Clarify zstd compressor output compatibility guarantees across versions #999

Closed
jblazquez opened this issue Jan 22, 2018 · 12 comments

@jblazquez

Hi,

We recently upgraded zstd to 1.3.3 after reading about the performance improvements for high compression levels, and we were happy to see that the performance increase was very significant (around 40% for level 19). However, we also noticed that the output of zstd 1.3.3 is not binary-identical to that of zstd 1.3.2. Unfortunately, that limits its usefulness for our use case: we rely on our compressed data not changing as we upgrade the libzstd library, which we'd like to do in order to get bug fixes, new features and performance improvements. We were previously using zlib, which I guess hasn't had a bitstream-impacting change in many years.

Is bit-identical output across versions a goal of the zstd project, or do you expect these changes to happen for the foreseeable future?

@Cyan4973
Contributor

Hi @jblazquez,

zstd only guarantees:

  • Format compliance: any compressed data produced by version v1+ respects the specification, and can therefore be decompressed correctly by any decoder from version v1+.
  • Reproducibility: for a given compression level, binary version, source data, and number of threads, the compressed data will always be the same.

However, zstd makes no guarantee of producing exactly the same compressed output across 2 different versions. Such a restriction would greatly limit its ability to improve.

Bottom line: never expect 2 different versions of zstd to produce the same output. If they do, it's purely by chance.
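
Both guarantees can be checked mechanically against whatever zstd binary is installed. The following is a minimal sketch, assuming the `zstd` command-line tool is on `PATH`; the helper names (`zstd_argv`, `run_zstd`, `check_guarantees`) are hypothetical and not part of zstd. It only exercises a single installed version, so it says nothing about output across versions.

```python
import shutil
import subprocess


def zstd_available():
    """True when a zstd executable can be found on PATH."""
    return shutil.which("zstd") is not None


def zstd_argv(level=19, decompress=False):
    """Build a zstd command line that reads stdin and writes stdout."""
    argv = ["zstd", "-q", "-c"]  # -q: quiet, -c: write to stdout
    argv.append("-d" if decompress else f"-{level}")
    return argv


def run_zstd(data, **kwargs):
    """Pipe `data` through the installed zstd binary and return its output."""
    result = subprocess.run(zstd_argv(**kwargs), input=data,
                            capture_output=True, check=True)
    return result.stdout


def check_guarantees(data, level=19):
    """Return (round_trips, reproducible) for the installed zstd binary."""
    first = run_zstd(data, level=level)
    second = run_zstd(data, level=level)
    round_trips = run_zstd(first, decompress=True) == data
    reproducible = first == second  # same version, level and input
    return round_trips, reproducible
```

Calling `check_guarantees(payload)` on representative data should return `(True, True)` if the installed binary behaves as described above.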

@jblazquez
Author

Thanks for clarifying the compatibility guarantees @Cyan4973. I think those two guarantees - especially the first one - should be enough for us to unlock our ability to upgrade.

@anthraxx

@Cyan4973
We are currently considering using zstd as the default compression for all our distro packages, but we would appreciate it if you could clarify the reproducibility guarantees you mentioned above when the number of threads varies across scenarios.

Does the following combination of constraints and variables always produce the same output?

  • all compression operations use the very same zstd version
  • fixed compression level, e.g. all use -18
  • varying numbers of hardware CPU cores, e.g. some single-core machines, some 4-core and some 8-core machines
  • fixed value of -T0, which uses the number of physical CPU cores (please note that the requirement above mixes single-core, 4-core and 8-core machines, as there are other compression algorithms that break the guarantee on a single-core machine)

Technically, -T0 varies the number of threads that you mentioned above, but we would like to use -T0 and still have a guarantee of reproducible output across different numbers of cores (single core and multi-core).

Some tests show this may be the case, but we would like official clarification before we assume that we can rely on it.
Thanks in advance.

@Cyan4973
Contributor

The situation has evolved since this issue was opened, and is now generally friendlier to reproducible builds.

With recent versions of zstd (v1.3.4+), the number of cores and the number of threads do not matter. -T0, -T1 (default), and generally -Tn all generate the same output.
The only important parameters are the zstd version number, and the compression level.

Things that can break this reproducibility pattern:

  • altering the compression level by adding advanced parameters (--long=, --zstd=, etc.), though identical advanced parameters will give the same result.
  • adding the --single-thread flag. It's not the same as -T1, and will generate a slightly different output. However, --single-thread is stable with itself.
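
The thread-count claim is easy to verify on a given machine. This is a rough sketch, assuming the `zstd` CLI (v1.3.4 or later) is on `PATH`; the function names and the choice of `-T4` as a third setting are illustrative assumptions. It compares digests for `-T0`, `-T1` and `-T4`, and separately checks that `--single-thread` agrees with itself, without assuming it differs from `-T1` on every input.

```python
import hashlib
import shutil
import subprocess


def compressed_digest(data, flags, level=18):
    """SHA-256 of zstd's output for `data` at `level` with extra CLI flags."""
    argv = ["zstd", "-q", "-c", f"-{level}", *flags]
    out = subprocess.run(argv, input=data,
                         capture_output=True, check=True).stdout
    return hashlib.sha256(out).hexdigest()


def digests_for(data, flag_sets, level=18):
    """Map each flag combination (joined as text) to its output digest."""
    return {" ".join(flags): compressed_digest(data, flags, level)
            for flags in flag_sets}


def all_equal(digests):
    """True when every flag combination produced byte-identical output."""
    return len(set(digests.values())) <= 1


def thread_report(data, level=18):
    """Summarise whether thread settings affect output on this machine."""
    threaded = digests_for(data, [["-T0"], ["-T1"], ["-T4"]], level)
    once = compressed_digest(data, ["--single-thread"], level)
    again = compressed_digest(data, ["--single-thread"], level)
    return {"threads_agree": all_equal(threaded),
            "single_thread_stable": once == again}
```

Running `thread_report` on machines with different core counts and comparing the underlying digests is one way to confirm the behaviour before relying on it in packaging.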

@terrelln
Contributor

We will fix all bugs causing non-deterministic builds as long as they follow the constraints that @Cyan4973 laid out above. However, I'd definitely recommend adding zstd determinism tests that invoke zstd the same way you do in your builds. We test zstd for non-determinism, but you may invoke it in a different way that we've missed in our test coverage.
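
A determinism test of the kind suggested here could look like the sketch below. It is an assumption-laden illustration, not something shipped with zstd: `BUILD_ARGS` is a placeholder for the exact flags your build uses, the `golden` mapping of version to expected digest is a hypothetical fixture you would maintain yourself, and the version parser assumes the CLI banner contains a token of the form `v1.2.3`.

```python
import hashlib
import re
import subprocess

# Hypothetical: replace with the exact flags your build passes to zstd.
BUILD_ARGS = ["-q", "-c", "-18", "-T0"]


def parse_version(banner):
    """Extract 'X.Y.Z' from a banner assumed to contain a token like 'v1.5.5'."""
    match = re.search(r"v(\d+\.\d+\.\d+)", banner)
    return match.group(1) if match else None


def installed_version():
    """Ask the installed CLI for its version; None if it cannot be parsed."""
    result = subprocess.run(["zstd", "--version"],
                            capture_output=True, text=True, check=True)
    return parse_version(result.stdout + result.stderr)


def build_digest(data):
    """SHA-256 of the output produced with the build's own flags."""
    out = subprocess.run(["zstd", *BUILD_ARGS], input=data,
                         capture_output=True, check=True).stdout
    return hashlib.sha256(out).hexdigest()


def check_against_golden(version, digest, golden):
    """Compare a digest with the one recorded for this exact release."""
    expected = golden.get(version)
    if expected is None:
        return "unrecorded"  # new release: record a digest, do not fail
    return "match" if digest == expected else "mismatch"
```

Treating an unrecorded version as a separate outcome keeps an upgrade from being reported as a determinism bug, while a `mismatch` within one recorded release is worth reporting upstream.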

@tasket

tasket commented Sep 6, 2022

I would like to see guidelines about versions become a bit more clear. For example: v1.4.x output clearly does not match v1.5.x output, but will any 1.4.x output match that of any other 1.4.x version (1.4.1 matching 1.4.9)?

@Cyan4973
Contributor

> will any 1.4.x output match that of any other 1.4.x version

No, there is no such guarantee.

All release versions of zstd are allowed to produce different outputs.
Reproducibility is only guaranteed within a single release version.

@tasket

tasket commented Sep 13, 2022

Seems like zstd should not position itself diametrically opposite to reproducibility / consistency, which is what the current policy is. I will have to deprecate or discourage zstd use in Wyng to avoid breakdown of deduplication.

OTOH, prioritizing output consistency would place very little burden on the zstd project; all that's required is a willingness to recognize when consistency is broken and to respond with an appropriate increment of the version number. This would allow users to receive zstd bug fix updates with peace of mind.

If best security practice didn't call for hashing data in its compressed form, it would be a different story and this issue wouldn't exist.
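
To make the deduplication concern concrete, here is a small illustration. It is not Wyng's implementation and does not use zstd at all: two zlib compression levels stand in for two compressor releases that encode the same input differently, and `chunk_id` and `stored_chunks` are hypothetical names. The store identifies each chunk by a hash of its compressed bytes.

```python
import hashlib
import zlib


def chunk_id(chunk, level):
    """Identify a chunk by the SHA-256 of its compressed form."""
    return hashlib.sha256(zlib.compress(chunk, level)).hexdigest()


def stored_chunks(chunks, levels):
    """Count distinct stored objects when chunks[i] is compressed at levels[i]."""
    return len({chunk_id(c, lvl) for c, lvl in zip(chunks, levels)})


block = b"backup block " * 500

# Same "release" for both backups: the repeated block is stored once.
same_release = stored_chunks([block, block], [6, 6])

# Compressor changed between backups: the identical block is stored twice.
after_upgrade = stored_chunks([block, block], [1, 9])
```

Here `same_release` is 1 and `after_upgrade` is 2, even though the uncompressed data never changed, which is the storage growth described above.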

@Cyan4973
Contributor

Just for reference:
output reproducibility (as in bit-identical) in software toolchains is a property always tied to a specific version.

Requiring all versions of a product to always generate the same binary output would prevent the product from improving, and would also prevent it from fixing any bug that could alter the binary output.

@tasket

tasket commented Sep 14, 2022

> Requiring all versions of a product to always generate the same binary

I don't think anyone here is suggesting that; certainly not me.

@felixhandte
Contributor

@tasket, then I'm not really sure what need you're describing that isn't being met. We provide determinism. We bump the library version every time we break determinism. What's missing?

@tasket

tasket commented Sep 22, 2022

@felixhandte

> We bump the library version every time we break determinism.

Thank you for asking. But that is not how I interpret the dialogue thus far.

It should be asked: Do all code changes have the same significance? Why would a buffer overflow fix and a tweak to the compression ratio both affect the version's patch level and not the major or minor?

Developers wanting determinism (and zstd efficiency) will face possible discontinuity each and every time their OS packaging system updates the zstd library automatically. As a result, we'll feel pressured to include our own copies of old zstd versions in our apps... or else have to explain to users, managers, etc. that zstd is the reason their storage systems repeatedly go offline because the backup archives have exploded in size.

I would like to be able to list a dependency of "zstd 1.5.x" for my app and let updates occur for it without my intervention and without breaking determinism. This implies that changes to zstd that affect its data output would have to land in a "later" version such as 1.6. In this example, the v1.5.x branch would have something like a "long term support" designation. FOSS operating systems accommodate this kind of compatibility freeze fairly often by not carrying the latest development or beta branches and keeping the major or major.minor version the same while applying patches that address security and stability issues, but I'm not sure how realistic that would be for zstd under the current versioning policy.
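
The difference between the two positions in this thread can be written as two predicates over version strings. This is only a sketch of the policies as described in the comments, with hypothetical function names: the first reflects the maintainers' statement that reproducibility holds within a single release, the second reflects the "zstd 1.5.x" dependency proposed above, which zstd does not currently promise.

```python
def parse(version):
    """Split 'major.minor.patch' into a tuple of integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def reproducible_today(a, b):
    """Stated policy: identical output is only guaranteed for the same release."""
    return parse(a) == parse(b)


def reproducible_if_proposed(a, b):
    """Proposed policy (not in effect): output is stable within major.minor."""
    return parse(a)[:2] == parse(b)[:2]
```

For example, `reproducible_today("1.4.1", "1.4.9")` is `False`, while `reproducible_if_proposed("1.4.1", "1.4.9")` would be `True`; both return `False` for `"1.4.9"` versus `"1.5.0"`.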
