improve compression ratio of small alphabets #3391

Merged: 3 commits, Jan 4, 2023
Conversation

Cyan4973 (Contributor) commented Dec 21, 2022

fix #3328

In situations where the source's alphabet size is very small, the Optimal Parser's evaluation of literal costs is initially incorrect. It takes some time to converge, during which compression is less efficient.
This is especially important for small files, since most of the parsing decisions are then based on incorrect metrics.

After this patch, the scenario from #3328 is fixed,
delivering the expected 29-byte compressed size (the smallest known compressed size, down from 54).
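For reference, a minimal reproduction sketch of this kind of scenario using the public zstd API. The input below is a hypothetical small-alphabet sample, not the actual data from #3328, so the exact sizes will differ:

```c
#include <stdio.h>
#include <zstd.h>

int main(void)
{
    /* Hypothetical stand-in for the #3328 sample: a short input drawn
     * from a very small alphabet (here only two symbols). */
    char src[178];
    for (size_t i = 0; i < sizeof(src); i++)
        src[i] = (i % 7 == 0) ? 'b' : 'a';

    char dst[ZSTD_COMPRESSBOUND(sizeof(src))];
    size_t const csize = ZSTD_compress(dst, sizeof(dst),
                                       src, sizeof(src), 19);
    if (ZSTD_isError(csize)) {
        fprintf(stderr, "compression error: %s\n", ZSTD_getErrorName(csize));
        return 1;
    }
    /* With the fix, a small-alphabet input of this kind is expected to
     * get much closer to its smallest known compressed size. */
    printf("level 19: %zu -> %zu bytes\n", sizeof(src), csize);
    return 0;
}
```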

On other "regular" data, this patch tends to be positive on average, though the differences remain pretty small.
The patch seems to impact text data more, likely because it prunes non-present alphabet symbols much faster.
On binary data with a full alphabet, results are more balanced and typically vary by merely a few bytes (compared to dev), making it essentially a non-event.
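To make the intuition concrete, here is a hedged sketch of the idea, with a hypothetical helper that is not the code in this patch: literal prices are derived from the symbol counts observed so far, and symbols absent from the alphabet are priced out immediately instead of starting from a flat, optimistic default.

```c
#include <math.h>

/* Hypothetical illustration (not zstd's actual internals): per-symbol
 * literal prices in 1/256th-of-a-bit units, derived from observed counts. */
#define ABSENT_PRICE (1000u << 8)   /* effectively "never pick this symbol" */

static void estimateLiteralPrices(const unsigned counts[256],
                                  unsigned prices[256])
{
    unsigned long total = 0;
    for (int s = 0; s < 256; s++) total += counts[s];
    if (total == 0) total = 1;

    for (int s = 0; s < 256; s++) {
        if (counts[s] == 0) {
            prices[s] = ABSENT_PRICE;   /* prune non-present symbols */
        } else {
            /* cost of symbol s ~ -log2(p(s)) bits, fixed-point scaled */
            double p = (double)counts[s] / (double)total;
            prices[s] = (unsigned)(-log2(p) * 256.0);
        }
    }
}
```

With a two-symbol alphabet, this kind of estimate immediately makes the 254 unused symbols irrelevant to parsing decisions, instead of letting their assumed cost distort the comparison between literals and matches.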

Since this modification is only for high compression modes, the speed impact is insignificant.

fix #3328

In situations where the alphabet size is very small,
the evaluation of literal costs from the Optimal Parser is initially incorrect.
It takes some time to converge, during which compression is less efficient.
This is especially important for small files,
because there is not enough data to converge,
so most of the parsing decisions are based on incorrect metrics.

After this patch, the scenario from #3328 gets fixed,
delivering the expected 29-byte compressed size (smallest known compressed size).
Comparing level 19 to level 22 and expecting a strictly better result from level 22
is not guaranteed,
because levels 19 and 22 are very close to each other,
especially for small files,
so any noise in the final compression result
can make this test fail.

Level 22 could be compared to something much lower, like level 15,
but level 19 is required anyway, because there is a clamping test which depends on it.

Removed level 22, kept level 19
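A hedged sketch of the kind of check this implies (hypothetical helper and bounds; the project's actual test harness is different): level 19 is held to the expected size, while level 22 is only required to stay in the same ballpark rather than be strictly better.

```c
#include <assert.h>
#include <zstd.h>

/* Hypothetical test sketch: avoid asserting that level 22 strictly beats
 * level 19 on tiny inputs, since the two levels are too close and a few
 * bytes of noise would fail the test. Assumes srcSize is small. */
static void checkSmallAlphabetRatio(const void* src, size_t srcSize,
                                    size_t expectedSize19)
{
    char dst[4096];
    size_t const c19 = ZSTD_compress(dst, sizeof(dst), src, srcSize, 19);
    size_t const c22 = ZSTD_compress(dst, sizeof(dst), src, srcSize, 22);
    assert(!ZSTD_isError(c19) && !ZSTD_isError(c22));
    assert(c19 <= expectedSize19);       /* the property the fix targets */
    assert(c22 <= expectedSize19 + 8);   /* loose bound; no strict ordering */
}
```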
Successfully merging this pull request may close these issues.

Optimal parser edge case: Huffman alone is significantly better than any matches