Does ZSTD_flushStream have adverse impact on compression ratio? #900
Comments
If I understand correctly, you are using both flush and dictionaries, is that right? In general, people use either streaming with flush or dictionaries, but not both. If you can provide some more details, I can help find the best setup.
Yes, there are techniques to reduce this "fixed cost". At a minimum, Zstandard defines some "static" tables which cost almost nothing to send and are useful for really small blocks.
I have about 1.7 MB of uncompressed test data: 70,000 records (20-30 bytes per record). I need to flush after every record. Every compressed record is stored independently.
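For reference, the flush-after-every-record pattern looks roughly like this with the streaming API. This is only a minimal sketch: error paths are simplified, the stream is assumed to have been set up with `ZSTD_createCStream()` / `ZSTD_initCStream()`, and all buffer names are illustrative.

```c
#include <zstd.h>

/* Compress one record and flush it so it is fully written out.
 * Returns bytes produced, or a zstd error code (check ZSTD_isError). */
size_t write_record(ZSTD_CStream* zcs,
                    const void* record, size_t record_size,
                    void* dst, size_t dst_capacity)
{
    ZSTD_inBuffer  in  = { record, record_size, 0 };
    ZSTD_outBuffer out = { dst, dst_capacity, 0 };

    while (in.pos < in.size) {
        size_t const r = ZSTD_compressStream(zcs, &out, &in);
        if (ZSTD_isError(r)) return r;
    }
    /* ZSTD_flushStream ends the current block, so the record becomes
     * readable even if a shutdown follows the write immediately. */
    size_t remaining;
    do {
        remaining = ZSTD_flushStream(zcs, &out);
        if (ZSTD_isError(remaining)) return remaining;
    } while (remaining != 0);   /* 0 means fully flushed */
    return out.pos;             /* bytes produced for this record */
}
```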
BTW, I looked into the compressed file and was surprised to see some strings ("finish") repeated verbatim. This was with libzstd 1.3.8 (Debian stable).
I repeated the test with libzstd 1.4.3 (from Debian testing).
I write the data (a journal) to persistent storage (NOR flash) and I need consistent commits to storage: a record must be readable even if a shutdown occurs directly after the write.
Records are usually 20-30 bytes, sometimes larger (up to several KB).
Compressed data is stored in blocks of up to 64 KB (100-200 KB of uncompressed data with zlib).
This use case is not yet well-optimized. The most important issue is that triggering a "flush" effectively creates a block. A full block header with complete statistics can cost ~100 bytes or more, which is obviously too large for your use case. Fortunately, the compressor detects that and will undo some of these steps, only producing headers when they generate a benefit, up to sending data uncompressed if need be.

The real issue here is that, by default, the streaming interface doesn't know what's going to happen next, so each decision is "local". When creating a block from 20-30 bytes of content, it's very likely that the most sensible decision is to send it uncompressed, or partially compressed using only some "default" statistics tables. It's extremely likely that the "literals" part will be sent uncompressed. If the algorithm knew in advance that it's going to compress a bunch of similar data, it could create a kind of "shared statistics" that would be produced once and re-used on all blocks. But because it doesn't know that, it's obliged to make a local decision, and it never reaches the point where some efficient re-usable statistics are produced.

There are likely some other issues too, such as a mismatch between the statistics level 19 believes it's using and the real statistics employed once the impact of small block size is taken into consideration, making the parser's choices ill-founded.

The "best" solution would require a custom interface and dedicated development. Unfortunately, that's a lot of investment, likely too much to justify it.
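The per-flush overhead is easy to observe directly. Here's a rough sketch that compresses synthetic small records once with a flush per record and once as a single frame, then compares output sizes. Record contents, counts, and sizes are made-up assumptions, and error checks are omitted for brevity.

```c
#include <stdio.h>
#include <zstd.h>

enum { N = 1000, REC = 30 };

int main(void)
{
    static char src[N * REC];            /* N synthetic 30-byte records */
    static char dst[1 << 20];
    for (int i = 0; i < N; i++)
        snprintf(src + (size_t)i * REC, REC, "record-%05d-finish", i);

    /* (a) flush per record: every record closes a block */
    ZSTD_CStream* zcs = ZSTD_createCStream();
    ZSTD_initCStream(zcs, 19);
    ZSTD_outBuffer out = { dst, sizeof dst, 0 };
    for (int i = 0; i < N; i++) {
        ZSTD_inBuffer in = { src + (size_t)i * REC, REC, 0 };
        while (in.pos < in.size) ZSTD_compressStream(zcs, &out, &in);
        while (ZSTD_flushStream(zcs, &out) != 0) {}
    }
    printf("flush per record: %zu bytes\n", out.pos);
    ZSTD_freeCStream(zcs);

    /* (b) baseline: the same bytes compressed as a single frame */
    size_t const one = ZSTD_compress(dst, sizeof dst, src, sizeof src, 19);
    printf("single frame:     %zu bytes\n", one);
    return 0;
}
```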
Note that, even if each record is stored as an independent block, each block still needs to be retrieved in order so that the decompression instructions make sense. This use case seems a good fit for dictionary compression, where each record would be truly independent and could therefore be extracted and decompressed on its own. That is a very different setup, though, and is likely to change your existing data pipeline too much. Still, it's an interesting idea to keep in mind, should you in the future develop a similar scenario from scratch, where you would not be tied to an existing design.
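Sketched out, that alternative would look something like this: each record becomes its own tiny frame referencing a shared dictionary, so any record can be located and decompressed independently. The `dict_buf`/`dict_size` inputs stand for a dictionary trained offline (e.g. with `zstd --train`); they are assumptions, not part of the original discussion.

```c
#include <zstd.h>

/* One-time setup: build a digested dictionary at the chosen level. */
ZSTD_CDict* make_cdict(const void* dict_buf, size_t dict_size, int level)
{
    return ZSTD_createCDict(dict_buf, dict_size, level);
}

/* Each record is compressed as an independent frame against the
 * dictionary; no other record is needed to decompress it. */
size_t compress_record(ZSTD_CCtx* cctx, const ZSTD_CDict* cdict,
                       void* dst, size_t dst_capacity,
                       const void* record, size_t record_size)
{
    return ZSTD_compress_usingCDict(cctx, dst, dst_capacity,
                                    record, record_size, cdict);
}
```

The decompression side mirrors this with `ZSTD_decompress_usingDDict`, again needing only the dictionary and the one record's frame.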
So using zstd makes no sense in my case.
I thought about it. There is no guarantee that a static dictionary will stay relevant forever, and updating the dictionary looks like a headache.
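For what it's worth, retraining a dictionary from recent records is mechanical, even if operationally annoying. A sketch, assuming the sample records are concatenated back-to-back in `samples` with their individual sizes in `sizes` (all names illustrative):

```c
#include <zdict.h>

/* Train a fresh dictionary from recent records. Returns the trained
 * dictionary size, or an error code (check with ZDICT_isError()). */
size_t retrain_dictionary(void* dict_buf, size_t dict_capacity,
                          const void* samples, const size_t* sizes,
                          unsigned nb_samples)
{
    return ZDICT_trainFromBuffer(dict_buf, dict_capacity,
                                 samples, sizes, nb_samples);
}
```

The real headache is versioning: old records must keep referencing the dictionary they were compressed with, so a retrained dictionary has to be stored alongside the old one rather than replacing it.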
I use the ZlibWrapper. When the zlib flush flag is Z_SYNC_FLUSH, the z_deflate wrapper function is implemented with ZSTD_flushStream.
In my case, I use a trained zstd dictionary, and when I set Z_SYNC_FLUSH, z_deflate writes the output data to the buffer immediately; however, the compression ratio is much lower than with Z_NO_FLUSH plus a final Z_FINISH, which calls ZSTD_endStream only at the end, not immediately.
I want a high compression ratio but also need the flush feature. Are these two goals contradictory?
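The two flush modes being compared here look roughly like this, written against the standard zlib API that the ZlibWrapper mimics (`zstd_zlibwrapper.h` redirects these calls to zstd). A sketch only; `zs` is assumed to be a `z_stream` already initialized with `deflateInit`, and all buffer names are illustrative.

```c
#include <zlib.h>   /* with the wrapper: include zstd_zlibwrapper.h instead */

int write_chunk(z_stream* zs, const unsigned char* src, unsigned len,
                unsigned char* dst, unsigned cap, int flush)
{
    zs->next_in   = (unsigned char*)src;
    zs->avail_in  = len;
    zs->next_out  = dst;
    zs->avail_out = cap;
    /* Z_SYNC_FLUSH makes the data readable immediately (mapped to
     * ZSTD_flushStream by the wrapper) at a per-block cost;
     * Z_NO_FLUSH with one final Z_FINISH (ZSTD_endStream) lets the
     * compressor choose block boundaries and usually ratios better. */
    return deflate(zs, flush);
}
```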