Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Nondeterministic compression with ZSTD_compressCCtx #1241

Closed
jblazquez opened this issue Jul 17, 2018 · 3 comments
Closed

Nondeterministic compression with ZSTD_compressCCtx #1241

jblazquez opened this issue Jul 17, 2018 · 3 comments

Comments

@jblazquez
Copy link

It appears that since commit 9d65a5c, compressing the same input data twice in a row while using the same compression context - even if the context is reset between compressions - results in different outputs when using certain compression levels. We were relying on the guarantees that @Cyan4973 described in #999 and assuming that zstd would output binary identical compressed bitstreams in this scenario.

Are we misunderstanding ZSTD_compressCCtx and thinking that it wouldn't reuse any state in between invocations when that's not guaranteed?

Steps to repro

We're only able to repro this easily on macOS (10.13), but when we discovered the problem the data had been compressed by Windows and Linux versions of zstd, so the problem doesn't appear to be platform-specific.

Save the following code to a file called test.c:

#define ZSTD_STATIC_LINKING_ONLY
#include <zstd.h>
#include <stdio.h>
#include <string.h>

unsigned char data[] = 
{
  0x74, 0x75, 0x72, 0x70, 0x69, 0x73, 0x20, 0x65, 0x67, 0x65, 0x73, 0x74,
  0x61, 0x73, 0x20, 0x70, 0x6f, 0x72, 0x74, 0x74, 0x69, 0x74, 0x6f, 0x72,
  0x20, 0x71, 0x75, 0x69, 0x73, 0x20, 0x74, 0x69, 0x6e, 0x63, 0x69, 0x64,
  0x75, 0x6e, 0x74, 0x20, 0x6c, 0x65, 0x6f, 0x2e, 0x20, 0x44, 0x6f, 0x6e,
  0x65, 0x63, 0x20, 0x6c, 0x75, 0x63, 0x74, 0x75, 0x73, 0x20, 0x65, 0x67,
  0x65, 0x74, 0x20, 0x73, 0x61, 0x70, 0x69, 0x65, 0x6e, 0x20, 0x66, 0x72,
  0x69, 0x6e, 0x67, 0x69, 0x6c, 0x6c, 0x61, 0x20, 0x73, 0x65, 0x6d, 0x70,
  0x65, 0x72, 0x2e, 0x20, 0x46, 0x75, 0x73, 0x63, 0x65, 0x20, 0x66, 0x72,
  0x69, 0x6e, 0x67, 0x69, 0x6c, 0x6c, 0x61, 0x20, 0x6c, 0x69, 0x62, 0x65,
  0x72, 0x6f, 0x20, 0x71, 0x75, 0x69, 0x73, 0x20, 0x76, 0x65, 0x6e, 0x65,
  0x6e, 0x61, 0x74, 0x69, 0x73, 0x20, 0x70, 0x6c, 0x61, 0x63, 0x65, 0x72,
  0x61, 0x74, 0x2e, 0x20, 0x53, 0x65, 0x64, 0x20, 0x65, 0x6c, 0x65, 0x69,
  0x66, 0x65, 0x6e, 0x64, 0x20, 0x75, 0x6c, 0x74, 0x72, 0x69, 0x63, 0x65,
  0x73, 0x20, 0x6c, 0x61, 0x63, 0x75, 0x73, 0x2c, 0x20, 0x71, 0x75, 0x69,
  0x73, 0x20, 0x66, 0x65, 0x72, 0x6d, 0x65, 0x6e, 0x74, 0x75, 0x6d, 0x20,
  0x74, 0x75, 0x72, 0x70, 0x69, 0x73, 0x20, 0x62, 0x6c, 0x61, 0x6e, 0x64,
  0x69, 0x74, 0x20, 0x73, 0x69, 0x74, 0x20, 0x61, 0x6d, 0x65, 0x74, 0x2e,
  0x20, 0x43, 0x75, 0x72, 0x61, 0x62, 0x69, 0x74, 0x75, 0x72, 0x20, 0x67,
  0x72, 0x61, 0x76, 0x69, 0x64, 0x61, 0x20, 0x74, 0x65, 0x6c, 0x6c, 0x75,
  0x73, 0x20, 0x76, 0x65, 0x6c, 0x69, 0x74, 0x2e, 0x20, 0x41, 0x6c, 0x69,
  0x71, 0x75, 0x61, 0x6d, 0x20, 0x65, 0x72, 0x61, 0x74, 0x20, 0x76, 0x6f,
  0x6c, 0x75, 0x74, 0x70, 0x61, 0x74, 0x2e, 0x20, 0x53, 0x75, 0x73, 0x70,
  0x65, 0x6e, 0x64, 0x69, 0x73, 0x73, 0x65, 0x20, 0x76, 0x65, 0x6c, 0x20,
  0x6d, 0x6f, 0x6c, 0x65, 0x73, 0x74, 0x69, 0x65, 0x20, 0x6d, 0x69, 0x2e,
  0x0a, 0x0a, 0x50, 0x65, 0x6c, 0x6c, 0x65, 0x6e, 0x74, 0x65, 0x73, 0x71,
  0x75, 0x65, 0x20, 0x68, 0x61, 0x62, 0x69, 0x74, 0x61, 0x6e, 0x74, 0x20,
  0x6d, 0x6f, 0x72, 0x62, 0x69, 0x20, 0x74, 0x72, 0x69, 0x73, 0x74, 0x69,
  0x71, 0x75, 0x65, 0x20, 0x73, 0x65, 0x6e, 0x65, 0x63, 0x74, 0x75, 0x73,
  0x20, 0x65, 0x74, 0x20, 0x6e, 0x65, 0x74, 0x75, 0x73, 0x20, 0x65, 0x74,
  0x20, 0x6d, 0x61, 0x6c, 0x65, 0x73, 0x75, 0x61, 0x64, 0x61, 0x20, 0x66,
  0x61, 0x6d, 0x65, 0x73, 0x20, 0x61, 0x63, 0x20, 0x74, 0x75, 0x72, 0x70,
  0x69, 0x73, 0x20, 0x65, 0x67, 0x65, 0x73, 0x74, 0x61, 0x73, 0x2e, 0x20,
  0x51, 0x75, 0x69, 0x73, 0x71, 0x75, 0x65, 0x20, 0x76, 0x69, 0x76, 0x65,
  0x72, 0x72, 0x61, 0x20, 0x76, 0x65, 0x6c, 0x20, 0x6a, 0x75, 0x73, 0x74,
  0x6f, 0x20, 0x61, 0x63, 0x20, 0x61, 0x75, 0x63, 0x74, 0x6f, 0x72, 0x2e,
  0x20, 0x49, 0x6e, 0x74, 0x65, 0x72, 0x64, 0x75, 0x6d, 0x20, 0x65, 0x74,
  0x20, 0x6d, 0x61, 0x6c, 0x65, 0x73, 0x75, 0x61, 0x64, 0x61, 0x20, 0x66,
  0x61, 0x6d, 0x65, 0x73, 0x20, 0x61, 0x63, 0x20, 0x61, 0x6e, 0x74, 0x65,
  0x20, 0x69, 0x70, 0x73, 0x75, 0x6d, 0x20, 0x70, 0x72, 0x69, 0x6d, 0x69,
  0x73, 0x20, 0x69, 0x6e, 0x20, 0x66, 0x61, 0x75, 0x63, 0x69, 0x62, 0x75,
  0x73, 0x2e, 0x20, 0x50, 0x65, 0x6c, 0x6c, 0x65, 0x6e, 0x74, 0x65, 0x73,
  0x71, 0x75, 0x65, 0x20, 0x6e, 0x6f, 0x6e, 0x20, 0x61, 0x63, 0x63, 0x75,
  0x6d, 0x73, 0x61, 0x6e, 0x20, 0x6e, 0x69, 0x73, 0x69, 0x2e, 0x20, 0x49,
  0x6e, 0x74, 0x65, 0x67, 0x65, 0x72, 0x20, 0x73, 0x69, 0x74, 0x20, 0x61,
  0x6d, 0x65, 0x74, 0x20, 0x6d, 0x69, 0x20, 0x65, 0x72, 0x6f, 0x73, 0x2e,
  0x20, 0x56, 0x65, 0x73, 0x74, 0x69, 0x62, 0x75, 0x6c, 0x75, 0x6d, 0x20,
};

int main(int argc, char** argv)
{
    for (int level = 1; level <= 19; ++level)
    {
        char buffer[1024];        
        ZSTD_CCtx* zstd = ZSTD_createCCtx();

        size_t size1 = ZSTD_compressCCtx(zstd, buffer, sizeof(buffer), data, sizeof(data), level);
        ZSTD_CCtx_reset(zstd);
        size_t size2 = ZSTD_compressCCtx(zstd, buffer, sizeof(buffer), data, sizeof(data), level);

        printf("Level %d: %zu bytes / %zu bytes%s\n", level, size1, size2, size1 != size2 ? " (*)" : "");
        ZSTD_freeCCtx(zstd);
    }
}

Now build against zstd v1.3.5:

$ git clone https://github.com/facebook/zstd.git
$ cd zstd
$ git checkout v1.3.5
$ make
$ cc -o test -Ilib test.c lib/libzstd.a

The following output will be seen:

Level 1: 323 bytes / 323 bytes
Level 2: 324 bytes / 324 bytes
Level 3: 325 bytes / 325 bytes
Level 4: 324 bytes / 324 bytes
Level 5: 324 bytes / 324 bytes
Level 6: 322 bytes / 322 bytes
Level 7: 322 bytes / 322 bytes
Level 8: 322 bytes / 322 bytes
Level 9: 324 bytes / 322 bytes (*)
Level 10: 324 bytes / 322 bytes (*)
Level 11: 322 bytes / 321 bytes (*)
Level 12: 322 bytes / 321 bytes (*)
Level 13: 322 bytes / 321 bytes (*)
Level 14: 322 bytes / 321 bytes (*)
Level 15: 322 bytes / 321 bytes (*)
Level 16: 322 bytes / 321 bytes (*)
Level 17: 322 bytes / 321 bytes (*)
Level 18: 322 bytes / 321 bytes (*)
Level 19: 322 bytes / 320 bytes (*)

As you can see, every level from 9 onwards results in different compressed output the second time.

This didn't happen back in v1.3.3:

Level 1: 322 bytes / 322 bytes
Level 2: 325 bytes / 325 bytes
Level 3: 325 bytes / 325 bytes
Level 4: 324 bytes / 324 bytes
Level 5: 324 bytes / 324 bytes
Level 6: 322 bytes / 322 bytes
Level 7: 322 bytes / 322 bytes
Level 8: 322 bytes / 322 bytes
Level 9: 322 bytes / 322 bytes
Level 10: 325 bytes / 325 bytes
Level 11: 322 bytes / 322 bytes
Level 12: 322 bytes / 322 bytes
Level 13: 322 bytes / 322 bytes
Level 14: 322 bytes / 322 bytes
Level 15: 322 bytes / 322 bytes
Level 16: 322 bytes / 322 bytes
Level 17: 322 bytes / 322 bytes
Level 18: 322 bytes / 322 bytes
Level 19: 322 bytes / 322 bytes

It started happening to some extent with commit 9d65a5c:

Level 1: 322 bytes / 322 bytes
Level 2: 325 bytes / 325 bytes
Level 3: 325 bytes / 325 bytes
Level 4: 324 bytes / 324 bytes
Level 5: 324 bytes / 324 bytes
Level 6: 322 bytes / 322 bytes
Level 7: 322 bytes / 322 bytes
Level 8: 322 bytes / 322 bytes
Level 9: 324 bytes / 322 bytes (*)
Level 10: 325 bytes / 325 bytes
Level 11: 322 bytes / 322 bytes
Level 12: 322 bytes / 322 bytes
Level 13: 322 bytes / 322 bytes
Level 14: 322 bytes / 322 bytes
Level 15: 322 bytes / 322 bytes
Level 16: 322 bytes / 322 bytes
Level 17: 322 bytes / 322 bytes
Level 18: 322 bytes / 322 bytes
Level 19: 322 bytes / 322 bytes

Workaround

For now, we've switched to using ZSTD_compress, which does result in deterministic outputs in this scenario.

@Cyan4973
Copy link
Contributor

Cyan4973 commented Jul 17, 2018

Thanks for detailed report @jblazquez , and precise reproduction instructions.
The issue seems to impact all compression strategies starting btlazy2 and stronger.
We very much prefer to produce deterministic output.
Let's investigate.

note : I confirm reproduction of the issue on a Linux Ubuntu VM.

@Cyan4973
Copy link
Contributor

Cyan4973 commented Jul 17, 2018

Small comment :
ZSTD_CCtx_reset() is not supposed to be useful in this case.

ZSTD_CCtx_reset() is only meant to be useful in association with ZSTD_compress_generic().
In which case, it becomes possible to reset all sticky parameters to default values.

ZSTD_compressCCtx() doesn't have any sticky parameter.
It only uses parameters as provided directly within its interface.

I removed ZSTD_CCtx_reset() from the test case, and confirmed that the issue remains identical.

@Cyan4973
Copy link
Contributor

Cyan4973 commented Jul 17, 2018

Here is my suspicion :
the position 0 is special, as it's the starting value of all internal table cells.
Anytime the algorithm notices that a candidate is proposed at position 0, it just considers it's not a candidate, and skips it.

Position 0 is also the first position of the first blob compressed by a new context.
Effectively, this policy makes it impossible to have a match starting at the very first byte.

In most cases, it should not matter much. If there is a match starting a position 0, one byte later, the same match is still there, just starting at position 1. So the efficiency difference is pretty small.

Thing is, later on, as a context is re-used, the starting position of following data blob to compress is no longer 0. It is whatever value the context is currently having. This is good to reduce amount of initialization when re-using contexts.

Since the first byte is no longer at position 0, it now becomes a valid position for a match.
So it results in slightly different compressed result.

I'll check in more detail if that's what happens.
This is a special corner case, which is in no way common, hence can easily evade automated tests, but it might be triggered by your sample.

Edit : just checked, and it's indeed what happens. 2 matches start from beginning of sample. They are detected starting from position 0 during second compression, but starting from position 1 during first compression, because position 0 is invalid during first compression.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants