introduce experimental compression codecs #17847

Merged
merged 11 commits on Jun 6, 2021

Conversation

@fibersel (Contributor) commented Dec 6, 2020

I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en

Changelog category (leave one):

  • Performance Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
(remove from changelog) Integrate and test experimental compression libraries. Will be available under the flag allow_experimental_codecs. This closes #16775

@robot-clickhouse added the pr-performance label (Pull request with some performance improvements) on Dec 6, 2020
@fibersel (Contributor, Author) commented Dec 6, 2020

@alexey-milovidov, where can I find the unit tests for compression codecs? The file src/Compression/tests/gtest_compressionCodec.cpp is not referenced in any target.

@fibersel changed the title from "introduce lizard compression" to "introduce experimental compression codecs" on Dec 6, 2020
@alexey-milovidov (Member)

@fibersel Unit tests are built into a single binary, unit_tests_dbms.
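
(As an aside, not from the thread: the kind of check such a unit test typically performs is a compress/decompress round trip. Below is a minimal, self-contained sketch using the raw LZ4 C API as a stand-in for a codec; the real gtest_compressionCodec.cpp exercises codecs through ClickHouse's own compression interfaces, so treat this only as an illustration of the pattern.)

```cpp
// Round-trip sketch: compress a buffer, decompress it, and compare with the original.
// Uses the raw LZ4 C API as a stand-in; real codec tests go through the ClickHouse codec classes.
#include <gtest/gtest.h>
#include <lz4.h>
#include <string>
#include <vector>

TEST(CompressionCodecSketch, RoundTrip)
{
    std::string source(1 << 20, 'x');  // 1 MiB of highly compressible data
    std::vector<char> compressed(LZ4_compressBound(static_cast<int>(source.size())));

    int compressed_size = LZ4_compress_default(
        source.data(), compressed.data(),
        static_cast<int>(source.size()), static_cast<int>(compressed.size()));
    ASSERT_GT(compressed_size, 0);

    std::string decompressed(source.size(), '\0');
    int decompressed_size = LZ4_decompress_safe(
        compressed.data(), decompressed.data(),
        compressed_size, static_cast<int>(decompressed.size()));

    ASSERT_EQ(decompressed_size, static_cast<int>(source.size()));
    EXPECT_EQ(decompressed, source);
}
```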

@alexey-milovidov (Member) left a review comment

Experimental codecs should also be protected by a setting, allow_experimental_codecs, so users will not start to use them without being aware of the status of this feature.

@robot-clickhouse added the submodule changed label (At least one submodule changed in this PR) on Dec 6, 2020
@fibersel (Contributor, Author) commented Dec 6, 2020

Experimental codecs should also be protected by a setting, allow_experimental_codecs, so users will not start to use them without being aware of the status of this feature.

Well, I will take this condition into account before merging.
The first step is to implement and benchmark.

@fibersel (Contributor, Author)

@alexey-milovidov can you suggest an example that I can use to implement a testing framework?
I looked at utils/compressor/decompress_perf.cpp, but it seems too complicated for my case.

@alexey-milovidov (Member)

@fibersel You can use the clickhouse-compressor tool.

@alexey-milovidov (Member)

How to use it:

  1. When you integrate your codecs into CompressionCodecFactory, they become available under symbolic names in clickhouse-compressor (see the registration sketch below).
  2. You can use clickhouse-compressor to test compression speed and ratio, then do the same for decompression.
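
(Not part of the original comment: a rough sketch of what step 1 can look like. The helper name registerSimpleCompressionCodec, the byte code value, and the CompressionCodecLizard class are assumptions modeled on how existing codecs such as LZ4 and ZSTD register themselves; the authoritative signatures live in src/Compression/CompressionFactory.h.)

```cpp
// Sketch only: registering a hypothetical experimental codec with the factory,
// by analogy with existing codecs. Names and the byte code are placeholders and
// must be checked against src/Compression/CompressionFactory.h before use.
void registerCodecLizard(CompressionCodecFactory & factory)
{
    factory.registerSimpleCompressionCodec(
        "Lizard",                  // symbolic name, then usable in CODEC(...) and clickhouse-compressor
        /* byte_code = */ 0xAA,    // placeholder id; must not collide with existing codecs
        [] { return std::make_shared<CompressionCodecLizard>(/* level = */ 1); });
}
```

Once registered, the codec is selectable by its symbolic name, which is what step 2 relies on.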

@fibersel (Contributor, Author)

How to use it:

  1. When you integrate your codecs into CompressionCodecFactory, they become available under symbolic names in clickhouse-compressor.
  2. You can use clickhouse-compressor to test compression speed and ratio, then do the same for decompression.

How can I measure compression and decompression time most precisely?

@alexey-milovidov (Member)

@fibersel --stat is for block statistics of the compressed file. It does not decompress it and does not show the time.

@alexey-milovidov (Member)

How can I measure compression and decompression time most precisely?

  1. Run with the time command and collect only the user value (time spent in userspace).
  2. Run in a loop and collect the median or minimum of the run times. The minimum may not be robust. If testing on your own machine, set the CPU scaling governor to "performance", close the web browser, close all other apps that use the CPU, and pin to one core with numactl.

The difference between various compression algorithms is usually obvious even without these tricks. (A minimal timing harness along these lines is sketched below.)
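
(Not from the thread: a minimal, self-contained sketch of the loop-and-take-the-minimum approach described above, reporting userspace CPU time via getrusage. The compressOnce lambda and the input buffer are placeholders for the codec and the data file under test.)

```cpp
// Sketch of the measurement loop described above: run the codec many times,
// measure userspace CPU time only (the "user" value of `time`), keep the minimum.
#include <sys/resource.h>
#include <algorithm>
#include <cstdio>
#include <functional>
#include <vector>

/// Userspace CPU time of the current process, in seconds.
static double userCpuSeconds()
{
    rusage usage{};
    getrusage(RUSAGE_SELF, &usage);
    return usage.ru_utime.tv_sec + usage.ru_utime.tv_usec / 1e6;
}

/// Run `work` `runs` times and return the minimum per-run userspace CPU time.
static double minUserTime(const std::function<void()> & work, int runs = 10)
{
    double best = 1e100;
    for (int i = 0; i < runs; ++i)
    {
        double start = userCpuSeconds();
        work();
        best = std::min(best, userCpuSeconds() - start);
    }
    return best;
}

int main()
{
    std::vector<char> input(100 * 1024 * 1024, 'a');   // stand-in for a real data file (URL.dat, ...)
    std::vector<char> output(input.size() + 1024);

    // Placeholder for the codec under test (LZ4, Lizard, ZSTD at some level, ...).
    auto compressOnce = [&] { std::copy(input.begin(), input.end(), output.begin()); };

    double seconds = minUserTime(compressOnce);
    std::printf("compression speed: %.1f MB/s\n", input.size() / 1e6 / seconds);
}
```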

@alexey-milovidov marked this pull request as draft on December 13, 2020
@fibersel (Contributor, Author)

Here are the first results of testing, on the Silesia dataset:
[screenshot of benchmark results]

@fibersel (Contributor, Author)

Yandex Metrica dataset:
[screenshot of benchmark results]

@alexey-milovidov (Member)

@fibersel Yandex Metrica should not be used as a single dataset. It contains multiple columns and different columns have very different data distribution. Testing over all columns together will give no insights.

@alexey-milovidov (Member) commented Dec 20, 2020

The results in the table are unclear and most likely contain errors:

  1. The "level" of lz4 is not specified.
  2. If it is the default level (1), the decompression speed looks unrealistic and most likely indicates that you have built it in debug mode (e.g. if you are using CLion, it enables debug mode by default). Typical decompression speed for LZ4 (on a single core of an x86_64 CPU from the last 10 years) is in the range of 1..4 GB/sec of uncompressed data.
  3. The nature of the LZ4 algorithm makes decompression always faster than compression, but that is not true in your table, which indicates an error.
  4. No units are specified for speed. I assume MB of uncompressed data per second.

@fibersel (Contributor, Author)

  1. There is no level option for LZ4 in the clickhouse-compressor tool. The only option for LZ4 is using the HC codec.
  2. Is the default build type debug? If so, I will rebuild it...
  3. Yes, the units are MB/s.

@alexey-milovidov (Member)

The default build type for ClickHouse is release with debug info (that's what we need).
But if you are using CLion, it has a quirk that sets the build mode to debug by default (that is a common source of confusion).

@fibersel (Contributor, Author)

Hm, no, I build it from the command line.

@fibersel (Contributor, Author)

Is it possible that files like URL.dat (600 MB) are too small to measure compression speed reliably?

@alexey-milovidov (Member)

600 MB should be enough; it will take hundreds of milliseconds. You can repeat the run multiple times to get more statistics...

@alexey-milovidov (Member)

Actually, running multiple times is required for the data to fit in the page cache (otherwise the test can be IO bound).

@alexey-milovidov (Member) commented Dec 20, 2020

Also, you can measure only userspace CPU time (time will output the user field).

@fibersel (Contributor, Author)

Sorry for such a long break, but I have finally measured the first new algorithm. @alexey-milovidov does this representation satisfy your requirements? If so, I'll submit the measurement code and benchmarks for other data types later.

The current measurement is for Titles.

| codec | compression (MB/s) | decompression (MB/s) | ratio (compressed size / initial size) |
| --- | --- | --- | --- |
| lizard-10 | 441.7 | 2180.89 | 0.237939 |
| lizard-13 | 86.479 | 2684.17 | 0.187487 |
| lizard-16 | 51.7335 | 2791.54 | 0.176585 |
| lizard-19 | 1.60911 | 2326.28 | 0.166315 |
| lizard-22 | 193.32 | 1517.14 | 0.1913 |
| lizard-25 | 29.9265 | 1622.99 | 0.168324 |
| lizard-28 | 1.2575 | 1702.16 | 0.148866 |
| lizard-31 | 284.851 | 1586.1 | 0.198668 |
| lizard-34 | 80.4015 | 1836.54 | 0.177797 |
| lizard-37 | 51.7719 | 1744.71 | 0.166143 |
| lizard-40 | 289.579 | 1107.75 | 0.185636 |
| lizard-43 | 62.0342 | 1163.14 | 0.162803 |
| lizard-46 | 12.7841 | 1268.88 | 0.156169 |
| lz4 | 532.737 | 2326.28 | 0.237471 |
| zstd-11 | 34.4295 | 1292.38 | 0.126935 |
| zstd-14 | 9.22275 | 1292.38 | 0.122337 |
| zstd-17 | 5.60145 | 1268.88 | 0.117369 |
| zstd-20 | 2.54535 | 1268.88 | 0.116618 |
| zstd-5 | 119.297 | 1073.67 | 0.141384 |
| zstd-8 | 56.8774 | 1224.36 | 0.129935 |

@alexey-milovidov (Member)

@fibersel Yes, that's what we need, except that the measurement should be done for each of the data files.
(We should choose a set of about 5..10 data files that most likely have very different data distribution / data locality.)

@fibersel (Contributor, Author)

Okay, I will make these codecs available under a flag next week.

@alexey-milovidov (Member)

From your table I see that:

  • Lizard is very promising (can decompress faster than ZSTD with similar compression ratio);
  • LZSSE is good for faster decompression;
  • density looks useless.

@alexey-milovidov (Member)

There is still a chance that density is specialized for some datasets - that's why we need to test on multiple columns.

@alexey-milovidov (Member)

Also I see that negative compression levels of Lizard are not evaluated.

@fibersel (Contributor, Author)

I guess the negative levels are negative only in the documentation:
https://github.com/inikep/lizard/blob/lizard/lib/lizard_compress.h#L86

@fibersel (Contributor, Author) commented May 5, 2021

@alexey-milovidov how can I get the value of the experimental flag from inside a compression codec?
Should I derive the codec from the WithContext class?

@alexey-milovidov (Member)

You can check the flag in the factory, where the codecs are created.

@fibersel (Contributor, Author) commented May 5, 2021

Yes, but how can I access global settings?

@alexey-milovidov (Member)

You can do it similarly to the allow_suspicious_codecs setting.
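
(Not from the thread: a rough sketch of what such a gate could look like at codec-validation time, by analogy with the existing allow_suspicious_codecs check. The setting name comes from this PR; the exact function, exception type, error code, and place of the check are assumptions about the surrounding ClickHouse code.)

```cpp
// Sketch only: refuse to create experimental codecs unless the setting is enabled,
// analogous to the existing allow_suspicious_codecs check. Surrounding names
// (Settings member, Exception, error code) are assumptions about the codebase.
#include <unordered_set>

void validateExperimentalCodec(const String & family_name, const Settings & settings)
{
    // Illustrative list of codec families considered experimental in this PR.
    static const std::unordered_set<String> experimental{"Lizard", "Density", "LZSSE2", "LZSSE4", "LZSSE8"};

    if (experimental.count(family_name) && !settings.allow_experimental_codecs)
        throw Exception(
            "Codec " + family_name + " is experimental and disabled by default."
            " Enable the allow_experimental_codecs setting to use it.",
            ErrorCodes::BAD_ARGUMENTS);
}
```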

@fibersel reopened this on May 6, 2021
@fibersel marked this pull request as ready for review on May 6, 2021
@robot-ch-test-poll2 removed the submodule changed label on May 6, 2021
@robot-ch-test-poll4 added the submodule changed label on May 6, 2021