Dicts: check maxSymbolValue < 255 failed, returning ERROR(dictionary_corrupted) #3724
This is a very old implementation limitation. Initially, this choice was made for speed: we want to proceed blindly, i.e. trust the table provided by the dictionary, so we must be sure it is able to represent any data thrown at it. This results in major speed benefits.

Nowadays, we could use some stronger analysis, deciding to employ dictionary statistics selectively depending on how well they fit the local statistics. Such analysis would be able to detect that a dictionary statistic is incomplete and unable to represent a symbol to compress, and replace it with locally generated statistics. This is something we already do for inter-block correlation when streaming at higher compression levels. However, for the fastest levels, there is still the problem of time spent during analysis. Generating new statistics embedded in the frame is of course detrimental to compression ratio and speed, on both the compression and decompression sides. And since we mostly use dictionary compression for high-speed applications, this is a situation we want to avoid.

Consequently, our reference implementation has never met this scenario, so we have not so far needed to deal with Zstandard dictionaries containing incomplete statistics (and the need to mitigate the side effects). You are right that incomplete statistics would likely compress better, since we would not waste probability space on events that will never happen.
Ah, so you are forcing it to use the table on lower levels. But it only checks whether 255 can be represented, not whether all the others can. Is this a "missing" check, or is it safe to use?

I generate the Huffman table from the statistics of the actual literal bytes, and discard low (and obviously zero) probability outputs, with the assumption that the encoder can switch to another table if this one cannot represent the input. My experience from toying with deflate told me that adding zero-probability codes "just in case they were needed" was less efficient overall than just having these cases generate a fresh table. So I applied that learning here.

Do the FSE tables have the same limitations? I generate these in a similar fashion, which will also lead to "short" tables with some symbols having 0 probability.
I don't remember this part of the code base, but for information, this entry point is fuzzed 24/7. So I don't expect an adversarial dictionary to be able to crash the encoder or make it produce corrupted frames. There are additional checks in the pipeline, and cumulatively they seem to cover the situation well enough to survive heavy fuzzer tests, so no catastrophic scenario is expected. Worst case, the dictionary will be refused. Nonetheless, some improvements might be possible, if only for clarity, so that can still be worth another look.

As mentioned in the previous message, sure, zeroing some non-probable symbols is likely going to improve compression ratio, if only by a little bit. The problem is not there: the problem is the necessity to validate that the input is compatible with the partial statistics. This verification necessarily costs CPU time. Depending on your use case, you might consider this cost irrelevant or well worth it, or you might consider it too high for the benefit. This is a trade-off. It doesn't mean that we can't support partial statistics; it's just that there was no such need so far, so the implementation doesn't support it.

Now let's assume the implementation is updated to support partial statistics in dictionaries. A secondary issue is that checking input compatibility with partial statistics is going to cost additional CPU time, but since correctness is not affected, this is a form of silent degradation: nothing crashes, it just silently reduces speed. Hence it can remain unnoticed for a long time. This is a concern that should also be addressed.
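For illustration, here is a minimal sketch of the kind of compatibility check being discussed, assuming the input histogram and the dictionary's Huffman weights are available as plain 256-entry arrays (this is not zstd code; `histogramFitsTable` is a hypothetical name):

```c
#include <stddef.h>

/* Sketch: return 1 if the dictionary's Huffman statistics can encode every
 * byte that occurs in `src`, 0 otherwise. `dictWeights[256]` holds the
 * weight of each symbol as loaded from the dictionary; weight 0 means the
 * symbol is not representable. */
static int histogramFitsTable(const unsigned char* src, size_t srcSize,
                              const unsigned char dictWeights[256])
{
    unsigned count[256] = {0};
    size_t i;
    int s;
    for (i = 0; i < srcSize; i++)
        count[src[i]]++;                 /* histogram of the input bytes */
    for (s = 0; s < 256; s++)
        if (count[s] != 0 && dictWeights[s] == 0)
            return 0;                    /* symbol present but not representable */
    return 1;
}
```

The extra pass is cheap in absolute terms, but it would be paid on every block, which is exactly the silent speed cost described above.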
Sure. I see how you can avoid histogramming the input in these cases.
It may be something internal, since I don't understand how it is fine to have a Huffman table where symbol 254 cannot be represented (weight = 0), but symbol 255 must be, or at least the table must produce 256 weight entries when decoded.
I guess it is just a matter of implementation expectations. But I do see how always being able to use the literal dictionary compressor can be an advantage.

In my implementation, the dictionary is used the same way as any table from a previous block: 1) generate a histogram of the literals, 2) check if the previous table can be used, 3) check if a new table would be better, or if leaving the literals uncompressed would be. I don't actually think this is too far from the logic in zstd. Since I always (by default) check if a new table will be better, I am already generating the histogram, so the only extra cost is checking the histogram against the dictionary table. A sketch of this decision flow follows below.

Sidenote: maybe returning a more "correct" error than dictionary_corrupted would help. I am not asking you to change anything (other than maybe the error), but I am trying to understand what is going on.
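A compact sketch of that decision flow, using Shannon entropy as a stand-in for real size estimation (none of this is actual code from zstd or from the custom implementation; the names, the rough 64-byte header cost, and the per-symbol code-length array are all illustrative):

```c
#include <math.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { LIT_RAW, LIT_REUSE_TABLE, LIT_NEW_TABLE } LitMode;

/* Estimated compressed size (bytes) of the literals under their own optimal
 * code: Shannon entropy of the histogram plus a rough table-header cost. */
static size_t estimateNewTableSize(const unsigned count[256], size_t total)
{
    double bits = 0.0;
    for (int s = 0; s < 256; s++) {
        if (count[s] == 0) continue;
        double p = (double)count[s] / (double)total;
        bits += (double)count[s] * -log2(p);
    }
    return (size_t)(bits / 8) + 1 + 64;   /* +64: rough header cost */
}

/* Estimated size when reusing an existing table whose per-symbol code
 * lengths (in bits, 0 = symbol absent) are given. Returns SIZE_MAX if some
 * literal cannot be represented: this is the compatibility check. */
static size_t estimateReuseSize(const unsigned count[256],
                                const unsigned char codeLen[256])
{
    uint64_t bits = 0;
    for (int s = 0; s < 256; s++) {
        if (count[s] == 0) continue;
        if (codeLen[s] == 0) return SIZE_MAX;  /* not representable */
        bits += (uint64_t)count[s] * codeLen[s];
    }
    return (size_t)(bits / 8) + 1;             /* no table header to pay for */
}

/* 1) histogram, 2) try the previous/dictionary table,
 * 3) compare with a new table and with storing the literals raw. */
LitMode chooseLiteralMode(const unsigned char* lits, size_t litSize,
                          const unsigned char dictCodeLen[256])
{
    unsigned count[256] = {0};
    for (size_t i = 0; i < litSize; i++) count[lits[i]]++;

    size_t reuse = estimateReuseSize(count, dictCodeLen);
    size_t fresh = estimateNewTableSize(count, litSize);
    size_t best  = reuse < fresh ? reuse : fresh;

    if (best >= litSize) return LIT_RAW;       /* uncompressed wins */
    return (reuse <= fresh) ? LIT_REUSE_TABLE : LIT_NEW_TABLE;
}
```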
Agreed, the error message is not reflective of what's going on.
TLDR: This check is overly restrictive, and we can fix it easily.

I think this line is a holdover from when we were relying on dictionaries being valid. At this point, we ensure that compression never corrupts data, even when a dictionary is corrupted, invalid, or missing symbols in its statistics (as long as it is the same between compression & decompression). And we have fuzzers to verify this.

During dictionary loading, we validate the Huffman / FSE tables, and mark them as either "valid" (safe to reuse blindly) or "check" (must be validated against the input before reuse). For Huffman, you can see that logic here: zstd/lib/compress/zstd_compress.c, lines 4905 to 4911 (at edb6e8f).

We should be able to change that logic so a table with missing symbols is marked "check" rather than rejected (a rough sketch of this follows below). The 3 calls that classify the loaded FSE tables in the same way are here:

- zstd/lib/compress/zstd_compress.c, line 4939 (at edb6e8f)
- zstd/lib/compress/zstd_compress.c, line 4953 (at edb6e8f)
- zstd/lib/compress/zstd_compress.c, lines 4963 to 4970 (at edb6e8f)
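To make the proposed relaxation concrete, here is a sketch using a stand-in repeat-mode enum rather than zstd's actual internal types (the names below are invented for this sketch; the real change would live in the dictionary-loading code referenced above):

```c
/* Stand-in for the internal "can we trust this table blindly?" flag.
 * repeat_valid : the table covers all 256 symbols; use it without checking.
 * repeat_check : the table is usable, but each block's input must first be
 *                checked for compatibility (e.g. with histogramFitsTable
 *                from the earlier sketch).
 * repeat_none  : no usable table. */
typedef enum { repeat_none, repeat_check, repeat_valid } RepeatMode;

/* Current behaviour: reject the dictionary outright if maxSymbolValue < 255.
 * Relaxed behaviour sketched here: accept it, but demote it to repeat_check
 * so the encoder validates inputs against it instead of trusting it. */
RepeatMode classifyDictHufTable(unsigned maxSymbolValue, int readError)
{
    if (readError) return repeat_none;             /* genuinely corrupted header */
    if (maxSymbolValue < 255) return repeat_check; /* partial statistics: usable
                                                      with per-block validation */
    return repeat_valid;                           /* complete statistics */
}
```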
For both FSE & Huffman tables: if you want your dictionary to play nice with lower compression levels and the reference encoder, you'd have to either make the statistics complete, so that every symbol is representable, or wait for the check above to be relaxed.
Also, @klauspost, please let us know about the results of your custom dictionary builder! The baseline dictionary builder we generally use is the cover dictionary builder. If your dictionary builder is competitive with it, or able to beat it, we'd love to try it out!
@terrelln Thanks - I will read through that. Much appreciated.

My current work takes a different approach than yours. The backreference dictionary is built with a rather naïve chain of hashes that counts occurrences of 4-8 byte sequences, and then, starting from the top of that list, tries to string them together into a set of longer strings until a hash exceeds a threshold - obviously emitting the most popular at the end. A toy sketch of the counting stage is shown below.

The dictionaries beat zstd on "diverse input". For example, for a collection of Go source code or a collection of HTML files it is able to generate a better dictionary than zstd (even when compressing with zstd). For sparse, very similar content, like "github users", it creates a worse dictionary, since it splits the content up too much.

It currently does not produce "repeat-friendly" dictionaries, where several patterns are separated by a few "literal" chars. The 3 repeat codes are pretty bad ATM. Though it seems like zstd doesn't generate unique repeat codes either, and it will maybe save 3-4 bytes at most. I will probably add some better scanning there.

FSE tables are generated by doing actual compression and normalizing the output. Huffman tables are generated using the literal output of the compression. I ran into a couple of cases where this didn't generate compressible tables, so I need a better secondary strategy. It is still fairly experimental and can fail on certain input - for example if an FSE table is a single value (RLE).
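For concreteness, here is a toy sketch of the counting stage along the lines described above: fixed 4-byte windows, a small hash table of counters, and no collision handling (all names and constants are invented for this sketch; the actual builder handles 4-8 byte sequences and chains the top entries further):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TABLE_BITS 16
#define TABLE_SIZE (1u << TABLE_BITS)

/* Count approximate occurrences of every 4-byte sequence in `data` using a
 * small hash table of counters. Collisions are simply ignored, which is fine
 * for ranking "popular" sequences; a builder would then try to chain the top
 * entries into longer strings and emit the most popular ones last. */
static void countSequences(const unsigned char* data, size_t size,
                           uint32_t counts[TABLE_SIZE])
{
    memset(counts, 0, TABLE_SIZE * sizeof(uint32_t));
    if (size < 4) return;
    for (size_t i = 0; i + 4 <= size; i++) {
        uint32_t v;
        memcpy(&v, data + i, 4);                            /* read 4 bytes */
        uint32_t h = (v * 2654435761u) >> (32 - TABLE_BITS); /* Knuth multiplicative hash */
        counts[h]++;
    }
}
```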
@terrelln Just used the regular zstd dictionary builder as the baseline for comparison. Nothing groundbreaking, but I'll take my one win and keep tweaking :D
Yeah, we don't do anything with this currently. Maybe there is something to get here, but we haven't really found a use case for them.

However, we have found a really interesting pattern in our dictionaries w.r.t. repcodes. Take an input string and two dictionary snippets covering it: in Dictionary A the pieces of the string are selected as separate, scattered segments, while in Dictionary B they are kept adjacent, in the same order and at the same relative distance as in the input. Dictionary B will outperform Dictionary A, sometimes by a large margin, because the match finder will be able to use a repeat offset for the later matches instead of full offsets. This is one of the advantages of the cover algorithm: it keeps segments together, even if there is an infrequent sub-segment within it.
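A hypothetical illustration (these strings are invented here; the snippets from the original comment were not preserved): suppose records look like `"id":8675309,"type":"User"`, where the id is always 7 digits, and the dictionary contains the whole template `"id":0000000,"type":"User"` as one contiguous segment. The encoder matches `"id":`, emits the 7 digits as literals, and can then encode `,"type":"User"` with a repeat offset, because that fragment sits at exactly the same distance from the first match in the dictionary as it does in the input. If the two fragments had been selected as separate, scattered segments, the second match would need a full offset, which costs more bits.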
Also CC @ot (the original author of the cover algorithm), you may find this discussion interesting.
@terrelln Yes. That is pretty much what I refer to as "not repeat friendly". Especially considering how relatively often the lower levels check for repeats, that should also be a factor.

Also, zstd's ability to "splat" several versions of similar data is missing in mine. I will never emit a hash that has already been emitted as a substring of a higher-priority previous item. This is fine for "diverse" input, but not optimal for "sparse" input. So "github-users" only generates a small dictionary, whereas zstd can emit several "versions" to choose from.
@terrelln Yes, the algorithm was originally implemented for LZ4 use cases, so the relative position of substrings in the dictionary was irrelevant; the only optimization was to put the more frequent substrings at the end, so they would be available for longer to the sliding window.
Yeah, I think we mostly got lucky, since it wasn't a design consideration going in. When the cover dictionary builder selected parameters, it often ended up selecting larger segment sizes than we expected, like a few hundred bytes. We only fully realized later that the larger segment size was important for utilizing repeat offsets into the dictionary content, when we were removing that property and saw a regression. I found the
If you wanted to be more repcode aware, you could potentially concatenate all the sources, or random pairs of sources or something, then use |
Describe the bug
I am building a custom dictionary builder.
When testing it on my current output:
dictionary.bin.gz
To Reproduce
Gunzip above. Place file at
github_users_sample_set\user.0E6FBohGzg.json
Execute:
I have an older debug compiled zstd:
So it seems like there is a restriction on the Huffman tables, where maxSymbolValue must be 255 in the provided table.
What is the reason for that? I would kind of expect it to be more efficient if not every symbol has to be representable. Could you explain a bit why this (undocumented?) limitation exists?
Expected behavior
I would kinda expect it to work?