Dicts: check maxSymbolValue < 255 failed, returning ERROR(dictionary_corrupted) #3724
This is a very old implementation limitation. Initially, this choice was made for speed: we want to proceed blindly, i.e. trust the table provided by the dictionary, so we must be sure it is able to represent any data thrown at it. This results in major speed benefits.

Nowadays, we could use some stronger analysis, deciding to employ dictionary statistics selectively depending on how well they fit the local statistics. Such analysis would be able to detect that a dictionary statistic is incomplete and unable to represent a symbol to compress, and replace it with locally generated statistics. This is something we already do for inter-block correlation when streaming at higher compression levels. However, for the fastest levels, there is still the problem of time spent during analysis. Generating new statistics embedded in the frame is of course detrimental to compression ratio and speed, on both the compression and decompression sides. And since we mostly use dictionary compression for high-speed applications, this is a situation we want to avoid.

Consequently, our reference implementation has never met this scenario, so we have not so far needed to deal with Zstandard dictionaries containing incomplete statistics (and the need to mitigate the side effects). You are right that incomplete statistics would likely compress better, since we would not waste probability space on events that will never happen.
Ah, so you are forcing it to use the table on lower levels. But it only checks whether 255 can be represented, not whether all the others can. Is this a "missing" check, or is it safe to use?

I generate the Huffman table from the statistics of the actual literal bytes, and discard low (and obviously zero) probability outputs, with the assumption that the encoder can switch to another table if this one cannot represent the input. My experience from toying with deflate told me that adding zero-probability codes "just in case they were needed" was less efficient overall than just having these cases generate a fresh table. So I applied that learning here.

Do the FSE tables have the same limitations? I generate these in a similar fashion, which will also lead to "short" tables with some symbols having 0 probability.
I don't remember this part of the code base, but for information, this entry point is fuzzed 24/7. So I don't expect an adversarial dictionary to be able to crash the encoder or make it produce corrupted frames. There are additional checks in the pipeline, and cumulatively they seem to cover the situation well enough to survive heavy fuzzer tests, so no catastrophic scenario is expected. Worst case, the dictionary will be refused. Nonetheless, some improvements might be possible, if only for clarity, so that can still be worth another look.

As mentioned in the previous message, sure, zeroing some non-probable symbols is likely going to improve compression ratio, if only by a little bit. The problem is not there: the problem is the necessity to validate that the input is compatible with the partial statistics. This verification necessarily costs CPU time. Depending on your use case, you might consider this cost irrelevant or well worth it, or you might consider it too high for the benefit. This is a trade-off. It doesn't mean that we can't support partial statistics; it's just that there was no such need so far, so the implementation doesn't support it.

Now let's assume the implementation is updated to support partial statistics in dictionaries. A secondary issue is that checking input compatibility with partial statistics is going to cost additional CPU time, but since correctness is not affected, this is a form of silent degradation: nothing crashes, it just silently reduces speed. Hence it can remain unnoticed for a long time. This is a concern that should also be addressed.
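For illustration, here is a minimal sketch of the kind of compatibility check being discussed, assuming the input histogram and the dictionary's Huffman weights are available as plain 256-entry arrays (this is not zstd code; `histogramFitsTable` is a hypothetical name):

```c
#include <stddef.h>

/* Sketch: return 1 if the dictionary's Huffman statistics can encode every
 * byte that occurs in `src`, 0 otherwise. `dictWeights[256]` holds the
 * weight of each symbol as loaded from the dictionary; weight 0 means the
 * symbol is not representable. */
static int histogramFitsTable(const unsigned char* src, size_t srcSize,
                              const unsigned char dictWeights[256])
{
    unsigned count[256] = {0};
    size_t i;
    int s;
    for (i = 0; i < srcSize; i++)
        count[src[i]]++;                 /* histogram of the input bytes */
    for (s = 0; s < 256; s++)
        if (count[s] != 0 && dictWeights[s] == 0)
            return 0;                    /* symbol present but not representable */
    return 1;
}
```

The extra pass is cheap in absolute terms, but it would be paid on every block, which is exactly the silent speed cost described above.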
Sure. I see how you can avoid histogramming the input in these cases.
It may be something internal, since I don't understand how it is fine to have a Huffman table where symbol 254 cannot be represented (weight = 0), but symbol 255 must be, or at least the table must produce 256 weight entries when decoded.
I guess it is just a matter of implementation expectations. But I do see how always being able to use the literal dictionary compressor can be an advantage.

In my implementation, the dictionary is used the same way as any table from a previous block: 1) generate a histogram of the literals, 2) check if the previous table can be used, 3) check if a new table would be better, or if leaving the literals uncompressed would be. I don't actually think this is too far from the logic in zstd. Since I always (by default) check if a new table will be better, I am already generating the histogram, so the only extra cost is checking the histogram against the dictionary table. A sketch of this decision flow follows below.

Sidenote: maybe returning a more "correct" error than dictionary_corrupted would help. I am not asking you to change anything (other than maybe the error), but I am trying to understand what is going on.
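A compact sketch of that decision flow, using Shannon entropy as a stand-in for real size estimation (none of this is actual code from zstd or from the custom implementation; the names, the rough 64-byte header cost, and the per-symbol code-length array are all illustrative):

```c
#include <math.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { LIT_RAW, LIT_REUSE_TABLE, LIT_NEW_TABLE } LitMode;

/* Estimated compressed size (bytes) of the literals under their own optimal
 * code: Shannon entropy of the histogram plus a rough table-header cost. */
static size_t estimateNewTableSize(const unsigned count[256], size_t total)
{
    double bits = 0.0;
    for (int s = 0; s < 256; s++) {
        if (count[s] == 0) continue;
        double p = (double)count[s] / (double)total;
        bits += (double)count[s] * -log2(p);
    }
    return (size_t)(bits / 8) + 1 + 64;   /* +64: rough header cost */
}

/* Estimated size when reusing an existing table whose per-symbol code
 * lengths (in bits, 0 = symbol absent) are given. Returns SIZE_MAX if some
 * literal cannot be represented: this is the compatibility check. */
static size_t estimateReuseSize(const unsigned count[256],
                                const unsigned char codeLen[256])
{
    uint64_t bits = 0;
    for (int s = 0; s < 256; s++) {
        if (count[s] == 0) continue;
        if (codeLen[s] == 0) return SIZE_MAX;  /* not representable */
        bits += (uint64_t)count[s] * codeLen[s];
    }
    return (size_t)(bits / 8) + 1;             /* no table header to pay for */
}

/* 1) histogram, 2) try the previous/dictionary table,
 * 3) compare with a new table and with storing the literals raw. */
LitMode chooseLiteralMode(const unsigned char* lits, size_t litSize,
                          const unsigned char dictCodeLen[256])
{
    unsigned count[256] = {0};
    for (size_t i = 0; i < litSize; i++) count[lits[i]]++;

    size_t reuse = estimateReuseSize(count, dictCodeLen);
    size_t fresh = estimateNewTableSize(count, litSize);
    size_t best  = reuse < fresh ? reuse : fresh;

    if (best >= litSize) return LIT_RAW;       /* uncompressed wins */
    return (reuse <= fresh) ? LIT_REUSE_TABLE : LIT_NEW_TABLE;
}
```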
Agreed, the error message is not reflective of what's going on.
TLDR: This check is overly restrictive, and we can fix it easily.

I think this line is a holdover from when we were relying on dictionaries being valid. At this point, we ensure that compression never corrupts data, even when a dictionary is corrupted, invalid, or missing symbols in its statistics (as long as it is the same between compression & decompression). And we have fuzzers to verify this.

During dictionary loading, we validate the Huffman / FSE tables, and mark them as either "valid" (safe to reuse blindly) or "check" (must be validated against the input before reuse). For Huffman, you can see that logic here: zstd/lib/compress/zstd_compress.c, lines 4905 to 4911 (at edb6e8f).

We should be able to change that logic so a table with missing symbols is marked "check" rather than rejected (a rough sketch of this follows below). The 3 calls that classify the loaded FSE tables in the same way are here:

- zstd/lib/compress/zstd_compress.c, line 4939 (at edb6e8f)
- zstd/lib/compress/zstd_compress.c, line 4953 (at edb6e8f)
- zstd/lib/compress/zstd_compress.c, lines 4963 to 4970 (at edb6e8f)
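To make the proposed relaxation concrete, here is a sketch using a stand-in repeat-mode enum rather than zstd's actual internal types (the names below are invented for this sketch; the real change would live in the dictionary-loading code referenced above):

```c
/* Stand-in for the internal "can we trust this table blindly?" flag.
 * repeat_valid : the table covers all 256 symbols; use it without checking.
 * repeat_check : the table is usable, but each block's input must first be
 *                checked for compatibility (e.g. with histogramFitsTable
 *                from the earlier sketch).
 * repeat_none  : no usable table. */
typedef enum { repeat_none, repeat_check, repeat_valid } RepeatMode;

/* Current behaviour: reject the dictionary outright if maxSymbolValue < 255.
 * Relaxed behaviour sketched here: accept it, but demote it to repeat_check
 * so the encoder validates inputs against it instead of trusting it. */
RepeatMode classifyDictHufTable(unsigned maxSymbolValue, int readError)
{
    if (readError) return repeat_none;             /* genuinely corrupted header */
    if (maxSymbolValue < 255) return repeat_check; /* partial statistics: usable
                                                      with per-block validation */
    return repeat_valid;                           /* complete statistics */
}
```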
For both FSE & Huffman tables: if you want your dictionary to play nice with lower compression levels and the reference encoder, you'd have to either make the statistics complete, so that every symbol is representable, or wait for the check above to be relaxed.
Also, @klauspost, please let us know about the results of your custom dictionary builder! The baseline dictionary builder we generally use is the cover dictionary builder. If your dictionary builder is competitive with it, or able to beat it, we'd love to try it out!
@terrelln Thanks - I will read through that. Much appreciated.

My current work takes a different approach than yours. The backreference dictionary is built with a rather naïve chain of hashes that counts occurrences of 4-8 byte sequences, and then, starting from the top of that list, tries to string them together into a set of longer strings until a hash exceeds a threshold - obviously emitting the most popular at the end. A toy sketch of the counting stage is shown below.

The dictionaries beat zstd on "diverse input". For example, for a collection of Go source code or a collection of HTML files it is able to generate a better dictionary than zstd (even when compressing with zstd). For sparse, very similar content, like "github users", it creates a worse dictionary, since it splits the content up too much.

It currently does not produce "repeat-friendly" dictionaries, where several patterns are separated by a few "literal" chars. The 3 repeat codes are pretty bad ATM. Though it seems like zstd doesn't generate unique repeat codes either, and it will maybe save 3-4 bytes at most. I will probably add some better scanning there.

FSE tables are generated by doing actual compression and normalizing the output. Huffman tables are generated using the literal output of the compression. I ran into a couple of cases where this didn't generate compressible tables, so I need a better secondary strategy. It is still fairly experimental and can fail on certain input - for example if an FSE table is a single value (RLE).
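For concreteness, here is a toy sketch of the counting stage along the lines described above: fixed 4-byte windows, a small hash table of counters, and no collision handling (all names and constants are invented for this sketch; the actual builder handles 4-8 byte sequences and chains the top entries further):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TABLE_BITS 16
#define TABLE_SIZE (1u << TABLE_BITS)

/* Count approximate occurrences of every 4-byte sequence in `data` using a
 * small hash table of counters. Collisions are simply ignored, which is fine
 * for ranking "popular" sequences; a builder would then try to chain the top
 * entries into longer strings and emit the most popular ones last. */
static void countSequences(const unsigned char* data, size_t size,
                           uint32_t counts[TABLE_SIZE])
{
    memset(counts, 0, TABLE_SIZE * sizeof(uint32_t));
    if (size < 4) return;
    for (size_t i = 0; i + 4 <= size; i++) {
        uint32_t v;
        memcpy(&v, data + i, 4);                            /* read 4 bytes */
        uint32_t h = (v * 2654435761u) >> (32 - TABLE_BITS); /* Knuth multiplicative hash */
        counts[h]++;
    }
}
```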
@terrelln Just used the regular zstd dictionary builder as the baseline for comparison. Nothing groundbreaking, but I'll take my one win and keep tweaking :D
Yeah, we don't do anything with this currently. Maybe there is something to get here, but we haven't really found a use case for them.

However, we have found a really interesting pattern in our dictionaries w.r.t. repcodes. Take an input string and two dictionary snippets covering it: in Dictionary A the pieces of the string are selected as separate, scattered segments, while in Dictionary B they are kept adjacent, in the same order and at the same relative distance as in the input. Dictionary B will outperform Dictionary A, sometimes by a large margin, because the match finder will be able to use a repeat offset for the later matches instead of full offsets. This is one of the advantages of the cover algorithm: it keeps segments together, even if there is an infrequent sub-segment within it.
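A hypothetical illustration (these strings are invented here; the snippets from the original comment were not preserved): suppose records look like `"id":8675309,"type":"User"`, where the id is always 7 digits, and the dictionary contains the whole template `"id":0000000,"type":"User"` as one contiguous segment. The encoder matches `"id":`, emits the 7 digits as literals, and can then encode `,"type":"User"` with a repeat offset, because that fragment sits at exactly the same distance from the first match in the dictionary as it does in the input. If the two fragments had been selected as separate, scattered segments, the second match would need a full offset, which costs more bits.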
Also CC @ot (the original author of the cover algorithm), you may find this discussion interesting.
@terrelln Yes. That is pretty much what I refer to as "not repeat friendly". Especially considering how relatively often the lower levels check for repeats, that should also be a factor.

Also, zstd's ability to "splat" several versions of similar data is missing in mine. I will never emit a hash that has already been emitted as a substring of a higher-priority previous item. This is fine for "diverse" input, but not optimal for "sparse" input. So "github-users" only generates a small dictionary, whereas zstd can emit several "versions" to choose from.
@terrelln Yes, the algorithm was originally implemented for LZ4 use cases, so the relative position of substrings in the dictionary was irrelevant; the only optimization was to put the more frequent substrings at the end, so they would be available for longer to the sliding window.
Yeah, I think we mostly got lucky, since it wasn't a design consideration going in. When the cover dictionary builder selected parameters, it often ended up selecting larger segment sizes than we expected, like a few hundred bytes. We only fully realized later that the larger segment size was important for utilizing repeat offsets into the dictionary content, when we were removing that property and saw a regression. I found the
If you wanted to be more repcode aware, you could potentially concatenate all the sources, or random pairs of sources or something, then use |
Describe the bug
I am building a custom dictionary builder.
When testing it on my current output:
dictionary.bin.gz
To Reproduce
Gunzip above. Place file at
github_users_sample_set\user.0E6FBohGzg.json
Execute:
I have an older debug compiled zstd:
So it seems like there is a restriction on the Huffman tables, where maxSymbolValue must be 255 in the provided table.
What is the reason for that? I would kind of expect it to be more efficient if not every symbol has to be representable. Could you explain a bit why this (undocumented?) limitation exists?
Expected behavior
I would kinda expect it to work?