introduce experimental compression codecs #17847
Conversation
@alexey-milovidov, where can I find unit tests for compression codecs?
@fibersel Unit tests are built into a single binary.
Experimental codecs should also be protected by a setting, allow_experimental_codecs, so users will not start using them without being aware of the status of this feature.
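The gating described above can be sketched as a check at codec-creation time. This is a hypothetical illustration, not ClickHouse's actual factory API; the names `get_codec` and the settings dict are assumptions made for the sketch:

```python
# Hypothetical sketch of gating experimental codecs behind a setting.
# Names and structure are illustrative, not ClickHouse's real classes.

EXPERIMENTAL_CODECS = {"Lizard", "Density"}

def get_codec(name: str, settings: dict) -> str:
    """Refuse to construct an experimental codec unless the flag is set."""
    if name in EXPERIMENTAL_CODECS and not settings.get("allow_experimental_codecs", False):
        raise ValueError(
            f"Codec '{name}' is experimental; "
            "set allow_experimental_codecs = 1 to use it"
        )
    return name  # in reality, construct and return the codec object
```

With this shape, `get_codec("Lizard", {})` raises, while passing `{"allow_experimental_codecs": True}` succeeds; stable codecs are unaffected.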
Well, I will account for this condition before merging.
@alexey-milovidov Can you suggest an example that I can use to implement the testing framework?
@fibersel You can use
How to use it:
How can I measure compression and decompression time most precisely?
@fibersel
The difference in various compression algorithms is usually obvious even without these tricks.
@fibersel Yandex Metrica should not be used as a single dataset. It contains multiple columns and different columns have very different data distribution. Testing over all columns together will give no insights. |
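Since different columns have very different data distributions, a per-column measurement is what's needed. A minimal sketch, using stdlib `zlib` as a stand-in for the codec under test (file names are illustrative):

```python
# Sketch: measure compression ratio per column file rather than over the
# whole dataset, since columns have very different data distributions.
# zlib stands in for the codec under test.
import zlib

def column_ratios(paths, level=6):
    """Return {path: uncompressed_size / compressed_size} for each column file."""
    results = {}
    for path in paths:
        with open(path, "rb") as f:
            data = f.read()
        compressed = zlib.compress(data, level)
        results[path] = len(data) / len(compressed)
    return results

# Example (hypothetical column dumps):
# for name, ratio in column_ratios(["URL.dat", "Title.dat"]).items():
#     print(f"{name}: {ratio:.2f}x")
```

Reporting a ratio per file, rather than one aggregate number, makes it visible when a codec only wins on a particular kind of column.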
The results in the table are unclear and most likely contain errors:
The default build type for ClickHouse is release with debug info (that's what we need). |
Hm, no, I built it from the command line.
Is it possible that files like URL.dat (600 MB) are too small to measure compression speed reliably?
600 MB should be enough, it will take hundreds of milliseconds. You can repeat multiple times to get more statistics... |
Actually running multiple times is required for data to fit in page cache (otherwise it can be IO bound). |
Also, you can measure only userspace CPU time.
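The two suggestions above (repeat runs so data sits in the page cache, and count only user CPU time) can be combined in a small harness. This is a sketch under assumptions: `zlib` stands in for the codec, and `resource.getrusage` is Unix-specific:

```python
# Sketch: time compression by user CPU time only, with a warm-up pass so
# later runs are served from the page cache rather than being IO-bound.
# zlib stands in for the codec under test; resource is Unix-only.
import resource
import zlib

def bench_compress(data: bytes, repeats: int = 5) -> float:
    """Return the minimum userspace CPU time (seconds) over `repeats` runs."""
    zlib.compress(data)  # warm-up: populate caches, trigger lazy init
    times = []
    for _ in range(repeats):
        start = resource.getrusage(resource.RUSAGE_SELF).ru_utime
        zlib.compress(data)
        end = resource.getrusage(resource.RUSAGE_SELF).ru_utime
        times.append(end - start)
    return min(times)  # min is robust against scheduling noise

# Throughput estimate: len(data) / bench_compress(data) / 1e6  -> MB/s
```

Taking the minimum over repeats filters out runs perturbed by the scheduler; user CPU time excludes IO wait and kernel work entirely.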
Sorry for such a long break, but I've finally measured the first new algorithm. @alexey-milovidov does that representation satisfy your requirements? If so, I'll submit code for measuring and benchmarks for other data types later. The current measurement is for Titles.
@fibersel Yes, that's what we need, except that measurement should be done for each of the data files. |
Okay, I will make these codecs available under a flag next week.
From your table I see that:
There is still a chance that Density is specialized for some datasets; that's why we need to test on multiple columns.
Also I see that negative compression levels of Lizard are not evaluated. |
I guess negative levels are negative only in the documentation:
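Evaluating the remaining levels amounts to sweeping the codec's level range and recording ratio versus CPU time. Lizard is not in the Python stdlib, so the sketch below uses `zlib` levels 1..9 as a stand-in; for Lizard the same sweep would simply cover its fast ("negative") levels as well:

```python
# Sketch: sweep a codec's compression levels and record ratio vs. CPU time.
# zlib (levels 1..9) stands in for Lizard, which is not in the stdlib.
import time
import zlib

def level_sweep(data: bytes, levels=range(1, 10)):
    """Return a list of (level, compression_ratio, cpu_seconds) tuples."""
    rows = []
    for level in levels:
        start = time.process_time()
        compressed = zlib.compress(data, level)
        elapsed = time.process_time() - start
        rows.append((level, len(data) / len(compressed), elapsed))
    return rows

# Example (hypothetical column dump):
# for level, ratio, secs in level_sweep(open("Title.dat", "rb").read()):
#     print(f"level {level}: {ratio:.2f}x in {secs:.3f}s")
```

A table like this per column is exactly what exposes whether the fast levels buy meaningful speed for the ratio they give up.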
@alexey-milovidov how can I get the value of the experimental flag from inside a compression codec?
You can check the flag in the factory, where the codecs are created.
Yes, but how can I access global settings? |
You can do it similarly to the
I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
(remove from changelog) Integrate and test experimental compression libraries. Will be available under the flag allow_experimental_codecs. This closes #16775