Correct and clarify repcode offset history logic #3127
Great explanation @embg! Any performance measurement, to check if the changes produce any impact?
I figured performance measurement isn't necessary since I only touched code that executes at the very top and very bottom of the matchfinders (outside the hot loops).
You would be surprised!
Understood -- would
Yes
@Cyan4973 Measurements look good! Mostly identical speeds, a few changes < 1% in both directions at various levels. Seems like noise to me since the changes go both ways, in some cases for levels which use the same matchfinder (e.g. levels 1 and 2 with clang12). Measurements for gcc11 and clang12
Summary
In zstd, repcode offsets are passed to each block from the previous `Compressed_Block`. The compression and decompression sides need to maintain identical repcode offset histories to prevent data corruption.

My last PR (#3114) introduced a bug in `ZSTD_compressBlock_fast_extDict` which caused those histories to fall out of sync. This bug was found by OSS fuzz, which was able to trigger data corruption by encoding a repcode match using an incorrect repcode offset (passed incorrectly from a previous block).

This PR fixes that bug and goes further, fixing a latent issue in the existing code (pre-#3114) which @terrelln and I discovered while addressing the fuzzer bug.
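To illustrate why the two histories must stay identical, here is a toy model. It is a deliberate simplification and not zstd's actual encoding (zstd's real format tracks three repeat offsets and has additional special cases). `ToyHistory` and `toy_resolve` are hypothetical names invented for this sketch.

```c
#include <stdint.h>

/* Hypothetical two-entry offset history; rep[0] is the most recent offset. */
typedef struct { uint32_t rep[2]; } ToyHistory;

/* Resolve a toy offset code against a history and update the history.
 * Codes 1 and 2 name a history slot; any larger code carries the raw
 * offset (code - 2). This encoding is an assumption made for the sketch. */
static uint32_t toy_resolve(ToyHistory *h, uint32_t code)
{
    uint32_t offset;
    if (code == 1) {
        offset = h->rep[0];        /* most recent offset: history unchanged */
    } else if (code == 2) {
        offset = h->rep[1];        /* second most recent: move it to the front */
        h->rep[1] = h->rep[0];
        h->rep[0] = offset;
    } else {
        offset = code - 2;         /* raw offset: push onto the history */
        h->rep[1] = h->rep[0];
        h->rep[0] = offset;
    }
    return offset;
}
```

If the encoder resolves code 2 against a history of {100, 200} while the decoder holds {100, 100}, the encoder means offset 200 but the decoder copies from offset 100. Neither side can detect the mismatch, which is how an out-of-sync history turns into silent data corruption.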
The latent bug
The existing `offsetSaved` logic in fast, doublefast, and lazy noDict breaks the repcode offset history in a more subtle way than #3114. The value of `offset_1` which is passed to the next block is always correct, but `offset_2` can be incorrect if both offsets are invalid going into the block and no matches are found. This is because there is only one `offsetSaved` variable, but two offsets to save.

Luckily, this cannot produce data corruption due to the specifics of how `offset_2` is used in those matchfinders. Still, I have been able to construct an input for streaming compression (without any dictionary) which passes an incorrect `offset_2` from the second-to-last block to the final block.

Even without producing real corruption, this is undesirable; the correctness of the repcode offset history logic shouldn't depend on contingent factors regarding how `offset_2` is used in practice.

Solutions
This PR addresses the above problems in three ways:
1. Fix the bug in `fast_extDict`, resolving the immediate OSS fuzz issue.
2. Remove the `offsetSaved` logic from the DMS matchfinders, since they don't actually use it. (Probably because in zstd, the whole dictionary is considered to be in the window if any byte of it is in the window.)
3. Split `offsetSaved` into two variables, `offsetSaved1` and `offsetSaved2`, with rotation from 1 -> 2 when necessary.
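The single-variable problem from "The latent bug" and the two-variable split can be sketched with a toy model. This is a simplified illustration under stated assumptions, not the code in this PR: `RepPair`, `toy_block_body`, `run_block_single_slot`, and `run_block_two_slots` are hypothetical names, and the "block" finds at most one match with a fresh offset. Offsets larger than `maxRep` are treated as invalid and zeroed on entry.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint32_t rep1, rep2; } RepPair;   /* hypothetical history pair */

/* Toy matchfinder body: a nonzero newOffset models a single match with a
 * fresh offset; 0 models a block in which no match is found. */
static void toy_block_body(uint32_t *offset_1, uint32_t *offset_2, uint32_t newOffset)
{
    if (newOffset != 0) {
        *offset_2 = *offset_1;
        *offset_1 = newOffset;
    }
}

/* One saved slot: the second save overwrites the first. */
static RepPair run_block_single_slot(RepPair in, uint32_t maxRep, uint32_t newOffset)
{
    uint32_t offset_1 = in.rep1, offset_2 = in.rep2;
    uint32_t offsetSaved = 0;
    RepPair out;
    assert(in.rep1 != 0 && in.rep2 != 0);   /* 0 is reserved to mean "invalid" */
    if (offset_2 > maxRep) { offsetSaved = offset_2; offset_2 = 0; }
    if (offset_1 > maxRep) { offsetSaved = offset_1; offset_1 = 0; }
    toy_block_body(&offset_1, &offset_2, newOffset);
    out.rep1 = offset_1 ? offset_1 : offsetSaved;
    out.rep2 = offset_2 ? offset_2 : offsetSaved;
    return out;
}

/* Two saved slots, with rotation from slot 1 to slot 2 when needed. */
static RepPair run_block_two_slots(RepPair in, uint32_t maxRep, uint32_t newOffset)
{
    uint32_t offset_1 = in.rep1, offset_2 = in.rep2;
    uint32_t offsetSaved1 = 0, offsetSaved2 = 0;
    RepPair out;
    assert(in.rep1 != 0 && in.rep2 != 0);
    if (offset_2 > maxRep) { offsetSaved2 = offset_2; offset_2 = 0; }
    if (offset_1 > maxRep) { offsetSaved1 = offset_1; offset_1 = 0; }
    toy_block_body(&offset_1, &offset_2, newOffset);
    /* Rotation 1 -> 2: offset_1 started invalid but a match replaced it,
     * so the saved offset_1 is now the second most recent offset. */
    if (offsetSaved1 != 0 && offset_1 != 0) offsetSaved2 = offsetSaved1;
    out.rep1 = offset_1 ? offset_1 : offsetSaved1;
    out.rep2 = offset_2 ? offset_2 : offsetSaved2;
    return out;
}
```

With an incoming history of {100, 200}, `maxRep` of 10, and no match found, the single-slot version hands {100, 100} to the next block: `offset_1` survives but `offset_2` is lost. The two-slot version hands over {100, 200}. When a match with offset 5 is found, both versions produce {5, 100}, which is why the rotation is only needed in the two-slot version.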