improve compression ratio of small alphabets #3391

Merged: 3 commits, Jan 4, 2023
Conversation

Cyan4973 (Contributor) commented Dec 21, 2022

fix #3328

In situations where the source's alphabet size is very small, the Optimal Parser's evaluation of literal costs is initially incorrect. It takes some time to converge, during which compression is less efficient.
This is especially important for small files, since most of the parsing decisions are then based on incorrect metrics.

After this patch, the scenario from #3328 is fixed,
delivering the expected 29-byte compressed size (the smallest known compressed size, down from 54).
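For reference, a minimal reproduction sketch of this kind of scenario using the public zstd API. The input below is a hypothetical small-alphabet sample, not the actual data from #3328, so the exact sizes will differ:

```c
#include <stdio.h>
#include <zstd.h>

int main(void)
{
    /* Hypothetical stand-in for the #3328 sample: a short input drawn
     * from a very small alphabet (here only two symbols). */
    char src[178];
    for (size_t i = 0; i < sizeof(src); i++)
        src[i] = (i % 7 == 0) ? 'b' : 'a';

    char dst[ZSTD_COMPRESSBOUND(sizeof(src))];
    size_t const csize = ZSTD_compress(dst, sizeof(dst),
                                       src, sizeof(src), 19);
    if (ZSTD_isError(csize)) {
        fprintf(stderr, "compression error: %s\n", ZSTD_getErrorName(csize));
        return 1;
    }
    /* With the fix, a small-alphabet input of this kind is expected to
     * get much closer to its smallest known compressed size. */
    printf("level 19: %zu -> %zu bytes\n", sizeof(src), csize);
    return 0;
}
```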

On other "regular" data, this patch tends to be positive on average, though the differences remain pretty small.
The patch seems to impact text data more, likely because it prunes non-present alphabet symbols much faster.
On binary data with a full alphabet, results are more balanced and typically vary by merely a few bytes (compared to dev), making it essentially a non-event.
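To make the intuition concrete, here is a hedged sketch of the idea, with a hypothetical helper that is not the code in this patch: literal prices are derived from the symbol counts observed so far, and symbols absent from the alphabet are priced out immediately instead of starting from a flat, optimistic default.

```c
#include <math.h>

/* Hypothetical illustration (not zstd's actual internals): per-symbol
 * literal prices in 1/256th-of-a-bit units, derived from observed counts. */
#define ABSENT_PRICE (1000u << 8)   /* effectively "never pick this symbol" */

static void estimateLiteralPrices(const unsigned counts[256],
                                  unsigned prices[256])
{
    unsigned long total = 0;
    for (int s = 0; s < 256; s++) total += counts[s];
    if (total == 0) total = 1;

    for (int s = 0; s < 256; s++) {
        if (counts[s] == 0) {
            prices[s] = ABSENT_PRICE;   /* prune non-present symbols */
        } else {
            /* cost of symbol s ~ -log2(p(s)) bits, fixed-point scaled */
            double p = (double)counts[s] / (double)total;
            prices[s] = (unsigned)(-log2(p) * 256.0);
        }
    }
}
```

With a two-symbol alphabet, this kind of estimate immediately makes the 254 unused symbols irrelevant to parsing decisions, instead of letting their assumed cost distort the comparison between literals and matches.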

Since this modification is only for high compression modes, the speed impact is insignificant.

fix #3328

In situations where the alphabet size is very small,
the evaluation of literal costs from the Optimal Parser is initially incorrect.
It takes some time to converge, during which compression is less efficient.
This is especially important for small files,
because there is not enough data to converge,
so most of the parsing decisions are based on incorrect metrics.

After this patch, the scenario from #3328 gets fixed,
delivering the expected 29-byte compressed size (smallest known compressed size).
Comparing level 19 to level 22 and expecting a strictly better result from level 22
is not guaranteed,
because levels 19 and 22 are very close to each other,
especially for small files,
so any noise in the final compression result
can make this test fail.

Level 22 could be compared to something much lower, like level 15,
but level 19 is required anyway, because there is a clamping test which depends on it.

Removed level 22, kept level 19
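A hedged sketch of the kind of check this implies (hypothetical helper and bounds; the project's actual test harness is different): level 19 is held to the expected size, while level 22 is only required to stay in the same ballpark rather than be strictly better.

```c
#include <assert.h>
#include <zstd.h>

/* Hypothetical test sketch: avoid asserting that level 22 strictly beats
 * level 19 on tiny inputs, since the two levels are too close and a few
 * bytes of noise would fail the test. Assumes srcSize is small. */
static void checkSmallAlphabetRatio(const void* src, size_t srcSize,
                                    size_t expectedSize19)
{
    char dst[4096];
    size_t const c19 = ZSTD_compress(dst, sizeof(dst), src, srcSize, 19);
    size_t const c22 = ZSTD_compress(dst, sizeof(dst), src, srcSize, 22);
    assert(!ZSTD_isError(c19) && !ZSTD_isError(c22));
    assert(c19 <= expectedSize19);       /* the property the fix targets */
    assert(c22 <= expectedSize19 + 8);   /* loose bound; no strict ordering */
}
```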
Successfully merging this pull request may close these issues.

Optimal parser edge case: Huffman alone is significantly better than any matches