
compression ratio #5

Closed
niyue opened this issue Dec 2, 2021 · 2 comments

niyue commented Dec 2, 2021

Hi Peter,

I am working on a columnar database project, and this paper/library is very interesting. I am evaluating how good the fsst compression ratio is compared to lz4 and zstd.

Here is how I did it:
Test env: macOS 11.6, Intel MacBook Pro 2019

  1. I built fsst using the default CMake setup and obtained the fsst binary.
  2. I installed the zstd (1.5.0) and lz4 (1.9.3) binaries using Homebrew.
  3. I compressed several sample text files from the paper/dbtext folder (city/email/credentials/faust) with both fsst and zstd, without any additional parameters (zstd defaults to compression level 3 according to its documentation). The fsst-compressed files are around 10% larger than the zstd-compressed files for these samples, which is close to my expectation.
  4. Then I tried compressing a web server access log (a typical kind of data in my project), specifically this file (121MB, http://www.almhuette-raith.at/apache-log/access.log). For this data set, fsst achieves a 38% compression ratio (46MB after compression), while zstd achieves 3% (3.8MB after compression) and lz4 (run from its command line without any parameters) achieves 6% (7.4MB after compression). The gap with zstd/lz4 on this data set is non-trivial and not what I expected (see the measurement sketch right after this list).
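
For reference, a minimal sketch of measuring such a per-line compression ratio directly against the library (this assumes the fsst_create/fsst_compress C API declared in fsst.h of this repository; the numbers above come from the command-line binaries, not from this code):

```cpp
// Sketch: compress a text file line by line with the FSST C API and report the ratio.
#include <cstdio>
#include <fstream>
#include <string>
#include <vector>
#include "fsst.h"   // from this repository

int main(int argc, char** argv) {
    if (argc < 2) { std::fprintf(stderr, "usage: %s <textfile>\n", argv[0]); return 1; }
    std::ifstream in(argv[1]);
    std::vector<std::string> lines;
    for (std::string l; std::getline(in, l);) lines.push_back(l);

    size_t n = lines.size(), totalIn = 0;
    std::vector<size_t> lenIn(n), lenOut(n);
    std::vector<unsigned char*> strIn(n), strOut(n);
    for (size_t i = 0; i < n; i++) {
        lenIn[i] = lines[i].size();
        strIn[i] = reinterpret_cast<unsigned char*>(lines[i].data());  // C++17: non-const data()
        totalIn += lenIn[i];
    }

    // Build a symbol table from the input strings, then compress all of them.
    fsst_encoder_t* enc = fsst_create(n, lenIn.data(), strIn.data(), /*zeroTerminated=*/0);
    // Worst case, FSST escapes every byte (2 bytes out per byte in), so 2x is a safe output buffer.
    std::vector<unsigned char> out(totalIn * 2 + 16);
    size_t done = fsst_compress(enc, n, lenIn.data(), strIn.data(),
                                out.size(), out.data(), lenOut.data(), strOut.data());

    size_t totalOut = 0;
    for (size_t i = 0; i < done; i++) totalOut += lenOut[i];
    // Note: the serialized symbol table also counts toward the compressed size,
    // but at a few KB it is negligible for a 121MB input.
    std::printf("compressed %zu/%zu strings: %zu -> %zu bytes (%.1f%%)\n",
                done, n, totalIn, totalOut, 100.0 * totalOut / totalIn);
    fsst_destroy(enc);
    return 0;
}
```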

I wonder if you can shed some light on this evaluation:

  1. Is the fsst compression ratio expected for this data set, or am I doing something wrong here?
  2. If this is expected, what kinds of data sets are better suited for fsst compression?

Thanks so much.

UPDATE:
I read the paper again, and it does mention that JSON/XML text compresses around 2~2.5x worse than with lz4. But my test data here is a big access log file, and the gap to lz4 looks like 6x. What could be the kind of data that does not work as well as expected?

peterboncz (Collaborator) commented Nov 17, 2022

Many apologies for not responding to this. The email notifications were not set up correctly, so I only saw this issue when I happened to browse here.

As for the compression ratio: FSST is especially suitable for database columns, where one has very many relatively short strings. Its static symbol table of 256 symbols makes it quite effective for textual human language (which is not extremely repetitive, but does contain frequently co-occurring letter combinations that create compressibility); but in situations with a large amount of highly repetitive text it will lose out. It cannot remember more than 255x8 = ~2KB of repetitive text anyway, and since the non-repetitive text also needs some of the shorter symbols, you can probably divide that by 4, so roughly 512 bytes of highly repetitive text. So on JSON with complex (very repetitive) record structures, FSST will start to lose out to compression schemes that can remember more text.
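
To make that capacity bound concrete, a quick back-of-the-envelope sketch (the divide-by-4 factor is just the rough estimate above, not a measured constant):

```cpp
#include <cstdio>

int main() {
    // FSST uses 8-bit codes: one code is reserved as an escape, leaving 255 symbols,
    // and each symbol is at most 8 bytes long.
    const int codes      = 255;
    const int maxSymLen  = 8;
    const int tableBytes = codes * maxSymLen;      // 2040 bytes ~= 2KB of "textual memory"

    // Rough estimate from above: non-repetitive text claims many of the shorter
    // symbols, leaving perhaps ~1/4 of the capacity for long repeated fragments.
    const int repetitiveBudget = tableBytes / 4;   // ~510 bytes

    std::printf("symbol table textual memory: %d bytes\n", tableBytes);
    std::printf("budget for highly repetitive text: ~%d bytes\n", repetitiveBudget);
    return 0;
}
```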

While it did not make the paper, we also experimented with FSST using 12-bit codes (which gives 16x the 'textual memory' in the symbol table), and that can also be a sweet spot. It achieves roughly the same compression ratio on the benchmark in the paper and is even a bit faster in decompression: with wider codes, and therefore a symbol table that can hold more symbols, the average symbol becomes longer -- and it has to, to counter the 50% increase in code length. Longer symbols mean faster decoding, as you produce more uncompressed text per iteration (per decoded code).

Increasing the code width can thus help... but only to a point, of course. For instance, a 16-bit code scheme would also work, but counting the frequencies of two subsequent codes would require a 4GB array (the concatenated key of two codes is then 2x16 bits = 32 bits). That would make symbol table construction much slower, either because of cache misses or because you would have to switch from counting with an array to counting with a hash table. Also, with 16-bit codes the symbol table no longer fits in the L1 CPU cache during decompression, which likely makes decompression slower.
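
A small sketch of how these sizes scale with the code width (the 32KB L1 data cache is an assumed typical size, and the byte counts ignore per-entry metadata in the symbol table):

```cpp
#include <cstdio>
#include <cstdint>

int main() {
    const uint64_t maxSymLen = 8;          // a symbol is at most 8 bytes
    const uint64_t l1Bytes   = 32 * 1024;  // assumed L1 data-cache size (typical, not universal)

    for (int bits : {8, 12, 16}) {
        uint64_t codes      = (1ull << bits) - 1;   // one code value reserved for escapes
        uint64_t tableBytes = codes * maxSymLen;    // max bytes of symbol text the table can hold
        uint64_t pairKeys   = 1ull << (2 * bits);   // key space of two concatenated codes

        std::printf("%2d-bit codes: table holds up to %7llu bytes of symbol text (%s L1); "
                    "pair-frequency array needs %llu entries\n",
                    bits,
                    (unsigned long long)tableBytes,
                    tableBytes <= l1Bytes ? "fits in" : "exceeds",
                    (unsigned long long)pairKeys);
    }
    return 0;
}
```

With 16-bit codes the pair-key space alone is 2^32 entries, i.e. the 4GB counting array mentioned above even with one-byte counters, while the 12-bit row shows symbol text that still roughly fits the assumed L1 size, consistent with the sweet spot described in the previous paragraph.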

So maybe we were wrong to decide on 8-bit codes, and maybe we should try 12-bit or 16-bit codes to better handle JSON and these kinds of log files. Our consideration was that 8-bit codes fit beautifully into existing string representations in data systems: an uncompressed string is a byte sequence, and a compressed string is as well. This very often allows lazy decompression without modifying existing systems much. Also, we considered that competent database systems should not store JSON as text; rather, they should automatically recognise its structure and store it shredded into columns. If those columns (JSON attributes) are string-typed, FSST could be used on them, but the attribute names would already be 'compressed out' thanks to the shredding.

We did not consider the kind of log files you want to compress.

niyue (Author) commented Nov 17, 2022

Thanks so much for the detailed explanation. I recently found that DuckDB has started to support FSST, using it to compress data like URLs/emails, which is probably the best domain for its application. I will keep an eye on FSST and see if I can use it wisely later. Thanks again.
