
fix root cause of #3416 (#3419)

Merged (2 commits, Jan 13, 2023)
Conversation

@Cyan4973 (Contributor) commented on Jan 11, 2023:

A minor update in 5434de0 changed a `<=` into a `<`, and as an indirect consequence allowed a compression attempt on literals when there are only 6 literals to compress (the previous effective limit was 7 literals).

This is not in itself a problem, as the threshold is merely a heuristic, but it surfaced a bug that had always been there and was simply never triggered due to the previous limit. This bug would make the literal compressor believe that all literals are the same symbol; in the exact case where nbLiterals==6, combined with a pretty wild combination of several other exceptional conditions, this conclusion could be false, resulting in data corruption.

Replaced the blind heuristic with an actual test of all limit cases. Even if the threshold is changed again in the future, the detection of Repeated Literal mode will remain reliable.

fix #3416
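
The helper referenced in the diff below performs that actual test. As a rough sketch, assuming a straightforward implementation (the real allBytesIdentical() in the zstd sources may differ in detail):

```c
#include <stddef.h>  /* size_t */
#include <assert.h>

/* Sketch: returns 1 iff every byte of src[0..srcSize-1] equals src[0].
 * Illustrative only; the actual zstd helper may differ. */
static int allBytesIdentical(const void* src, size_t srcSize)
{
    const unsigned char* const bytes = (const unsigned char*)src;
    size_t n;
    assert(srcSize >= 1);
    for (n = 1; n < srcSize; n++) {
        if (bytes[n] != bytes[0]) return 0;
    }
    return 1;
}
```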

@terrelln (Contributor) left a comment:

Can you please add a test case that would trigger this bug?

The fix LGTM

```c
        ZSTD_memcpy(nextHuf, prevHuf, sizeof(*prevHuf));
        return ZSTD_compressRleLiteralsBlock(dst, dstCapacity, src, srcSize);
    }
    if ((srcSize >= 8) || allBytesIdentical(src, srcSize)) {
```

Please add a comment as to why this check is necessary.

@Cyan4973 (Author) replied:

done
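
For illustration, here is one way the check could be annotated; this is a hedged sketch only, since the exact comment text added in the follow-up commit is not shown here:

```c
    /* Hedged sketch, not the verbatim commit: the entropy stage can report
     * that all literals are a single repeated symbol, which selects the
     * cheaper RLE encoding. Per the PR description, for very small inputs
     * (fewer than 8 literals) that report can be wrong under rare
     * conditions, so the bytes are verified explicitly before committing
     * to RLE mode. */
    if ((srcSize >= 8) || allBytesIdentical(src, srcSize)) {
        ZSTD_memcpy(nextHuf, prevHuf, sizeof(*prevHuf));
        return ZSTD_compressRleLiteralsBlock(dst, dstCapacity, src, srcSize);
    }
```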

@Cyan4973 (Author) commented:

> Can you please add a test case that would trigger this bug?
>
> The fix LGTM

This is a really hard test to define.
We would probably need some golden files, both for the content and the dictionary.
None of them is naturally exported from the fuzzer (only the sequence of bits which generates those artifacts).

@terrelln (Contributor) commented:

> This is a really hard test to define.
> We would probably need some golden files, both for the content and the dictionary.
> None of them is naturally exported from the fuzzer (only the sequence of bits which generates those artifacts).

Ok, it'd be great if we had it, but if it is too hard, we have the fuzzers, and we know that they will quickly catch any regressions.
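
For reference, a minimal sketch of the kind of dictionary round-trip check such fuzzers perform, written here against the public one-shot zstd API (the actual dictionary_stream_round_trip fuzzer drives the streaming API with fuzzer-generated content and dictionaries, so this is only an approximation):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <zstd.h>

/* Hypothetical helper: compress `src` with a dictionary, decompress it
 * back, and verify the round trip is lossless. */
static void roundTripWithDict(const void* src, size_t srcSize,
                              const void* dict, size_t dictSize, int level)
{
    size_t const cCapacity = ZSTD_compressBound(srcSize);
    void* const cBuf = malloc(cCapacity);
    void* const rBuf = malloc(srcSize ? srcSize : 1);
    ZSTD_CCtx* const cctx = ZSTD_createCCtx();
    ZSTD_DCtx* const dctx = ZSTD_createDCtx();
    assert(cBuf && rBuf && cctx && dctx);

    {   size_t const cSize = ZSTD_compress_usingDict(cctx, cBuf, cCapacity,
                                    src, srcSize, dict, dictSize, level);
        size_t dSize;
        assert(!ZSTD_isError(cSize));
        dSize = ZSTD_decompress_usingDict(dctx, rBuf, srcSize,
                                          cBuf, cSize, dict, dictSize);
        assert(!ZSTD_isError(dSize));
        assert(dSize == srcSize);
        /* a corruption like #3416 would be detected here */
        assert(!memcmp(src, rBuf, srcSize));
    }

    ZSTD_freeCCtx(cctx); ZSTD_freeDCtx(dctx);
    free(cBuf); free(rBuf);
}
```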

Successfully merging this pull request may close these issues:

dictionary_stream_round_trip corruption (#3416)