
compression ratio #5

Closed
niyue opened this issue Dec 2, 2021 · 2 comments

niyue commented Dec 2, 2021

Hi Peter,

I am working on a columnar database project, and this paper/library is very interesting. I am evaluating how good the fsst compression ratio is compared to lz4 and zstd.

Here is how I did it:
Test env: macOS 11.6, Intel MacBook Pro 2019

  1. I built fsst using the default CMake setup and obtained the fsst binary.
  2. I installed the zstd (1.5.0) and lz4 (1.9.3) binaries using Homebrew.
  3. I compressed several sample text files from the paper/dbtext folder (city/email/credentials/faust) with both fsst and zstd, without any additional parameters (zstd defaults to compression level 3 according to its documentation). The fsst-compressed files are around 10% larger than the zstd-compressed files for these samples, which is close to my expectation.
  4. Then I tried compressing a web server access log (a typical kind of data in my project), specifically this file (121MB, http://www.almhuette-raith.at/apache-log/access.log). For this data set, fsst achieves a 38% compression ratio (46MB after compression), while zstd achieves 3% (3.8MB after compression) and lz4 (run from its command line without any parameters) achieves 6% (7.4MB after compression). The gap with zstd/lz4 on this data set is non-trivial and not what I expected (see the measurement sketch right after this list).
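
For reference, a minimal sketch of measuring such a per-line compression ratio directly against the library (this assumes the fsst_create/fsst_compress C API declared in fsst.h of this repository; the numbers above come from the command-line binaries, not from this code):

```cpp
// Sketch: compress a text file line by line with the FSST C API and report the ratio.
#include <cstdio>
#include <fstream>
#include <string>
#include <vector>
#include "fsst.h"   // from this repository

int main(int argc, char** argv) {
    if (argc < 2) { std::fprintf(stderr, "usage: %s <textfile>\n", argv[0]); return 1; }
    std::ifstream in(argv[1]);
    std::vector<std::string> lines;
    for (std::string l; std::getline(in, l);) lines.push_back(l);

    size_t n = lines.size(), totalIn = 0;
    std::vector<size_t> lenIn(n), lenOut(n);
    std::vector<unsigned char*> strIn(n), strOut(n);
    for (size_t i = 0; i < n; i++) {
        lenIn[i] = lines[i].size();
        strIn[i] = reinterpret_cast<unsigned char*>(lines[i].data());  // C++17: non-const data()
        totalIn += lenIn[i];
    }

    // Build a symbol table from the input strings, then compress all of them.
    fsst_encoder_t* enc = fsst_create(n, lenIn.data(), strIn.data(), /*zeroTerminated=*/0);
    // Worst case, FSST escapes every byte (2 bytes out per byte in), so 2x is a safe output buffer.
    std::vector<unsigned char> out(totalIn * 2 + 16);
    size_t done = fsst_compress(enc, n, lenIn.data(), strIn.data(),
                                out.size(), out.data(), lenOut.data(), strOut.data());

    size_t totalOut = 0;
    for (size_t i = 0; i < done; i++) totalOut += lenOut[i];
    // Note: the serialized symbol table also counts toward the compressed size,
    // but at a few KB it is negligible for a 121MB input.
    std::printf("compressed %zu/%zu strings: %zu -> %zu bytes (%.1f%%)\n",
                done, n, totalIn, totalOut, 100.0 * totalOut / totalIn);
    fsst_destroy(enc);
    return 0;
}
```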

I wonder if you can shed some light on this evaluation:

  1. Is the fsst compression ratio expected for this data set, or am I doing something wrong here?
  2. If this is expected, what kinds of data sets are better suited for fsst compression?

Thanks so much.

UPDATE:
I read the paper again, and it does mention that JSON/XML text compresses around 2~2.5x worse than with lz4. But my test data here is a big access log file, and the gap to lz4 looks like 6x. What could be the kind of data that does not work as well as expected?

peterboncz (Collaborator) commented Nov 17, 2022

Many apologies for not responding to this. The email notifications were not set up correctly, so I only saw this issue when I happened to browse here.

As for the compression ratio: FSST is especially suitable for database columns, where one has very many relatively short strings. Its static symbol table of 256 symbols makes it quite effective for textual human language (which is not extremely repetitive, but does contain frequently co-occurring letter combinations that create compressibility); but in situations with a large amount of highly repetitive text it will lose out. It cannot remember more than 255x8 = ~2KB of repetitive text anyway, and since the non-repetitive text also needs some of the shorter symbols, you can probably divide that by 4, so roughly 512 bytes of highly repetitive text. So on JSON with complex (very repetitive) record structures, FSST will start to lose out to compression schemes that can remember more text.
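
To make that capacity bound concrete, a quick back-of-the-envelope sketch (the divide-by-4 factor is just the rough estimate above, not a measured constant):

```cpp
#include <cstdio>

int main() {
    // FSST uses 8-bit codes: one code is reserved as an escape, leaving 255 symbols,
    // and each symbol is at most 8 bytes long.
    const int codes      = 255;
    const int maxSymLen  = 8;
    const int tableBytes = codes * maxSymLen;      // 2040 bytes ~= 2KB of "textual memory"

    // Rough estimate from above: non-repetitive text claims many of the shorter
    // symbols, leaving perhaps ~1/4 of the capacity for long repeated fragments.
    const int repetitiveBudget = tableBytes / 4;   // ~510 bytes

    std::printf("symbol table textual memory: %d bytes\n", tableBytes);
    std::printf("budget for highly repetitive text: ~%d bytes\n", repetitiveBudget);
    return 0;
}
```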

While it did not make the paper, we also experimented with FSST using 12-bit codes (which gives 16x the 'textual memory' in the symbol table), and that can also be a sweet spot. It achieves roughly the same compression ratio on the benchmark in the paper and is even a bit faster in decompression: with wider codes, and therefore a symbol table that can hold more symbols, the average symbol becomes longer -- and it has to, to counter the 50% increase in code length. Longer symbols mean faster decoding, as you produce more uncompressed text per iteration (per decoded code).

Increasing the code width can thus help... but only to a point, of course. For instance, a 16-bit code scheme would also work, but counting the frequencies of two subsequent codes would require a 4GB array (the concatenated key of two codes is then 2x16 bits = 32 bits). That would make symbol table construction much slower, either because of cache misses or because you would have to switch from counting with an array to counting with a hash table. Also, with 16-bit codes the symbol table no longer fits in the L1 CPU cache during decompression, which likely makes decompression slower.
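
A small sketch of how these sizes scale with the code width (the 32KB L1 data cache is an assumed typical size, and the byte counts ignore per-entry metadata in the symbol table):

```cpp
#include <cstdio>
#include <cstdint>

int main() {
    const uint64_t maxSymLen = 8;          // a symbol is at most 8 bytes
    const uint64_t l1Bytes   = 32 * 1024;  // assumed L1 data-cache size (typical, not universal)

    for (int bits : {8, 12, 16}) {
        uint64_t codes      = (1ull << bits) - 1;   // one code value reserved for escapes
        uint64_t tableBytes = codes * maxSymLen;    // max bytes of symbol text the table can hold
        uint64_t pairKeys   = 1ull << (2 * bits);   // key space of two concatenated codes

        std::printf("%2d-bit codes: table holds up to %7llu bytes of symbol text (%s L1); "
                    "pair-frequency array needs %llu entries\n",
                    bits,
                    (unsigned long long)tableBytes,
                    tableBytes <= l1Bytes ? "fits in" : "exceeds",
                    (unsigned long long)pairKeys);
    }
    return 0;
}
```

With 16-bit codes the pair-key space alone is 2^32 entries, i.e. the 4GB counting array mentioned above even with one-byte counters, while the 12-bit row shows symbol text that still roughly fits the assumed L1 size, consistent with the sweet spot described in the previous paragraph.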

So maybe we were wrong to decide on 8-bit codes, and maybe we should try 12-bit or 16-bit codes to better handle JSON and these kinds of log files. Our consideration was that 8-bit codes fit beautifully into existing string representations in data systems: an uncompressed string is a byte sequence, and a compressed string is as well. This very often allows lazy decompression without modifying existing systems much. Also, we considered that competent database systems should not store JSON as text; rather, they should automatically recognise its structure and store it shredded into columns. If those columns (JSON attributes) are string-typed, FSST could be used on them, but the attribute names would already be 'compressed out' thanks to the shredding.

We did not consider the kind of log files you want to compress.

niyue (Author) commented Nov 17, 2022

Thanks so much for the detailed explanation. I recently found that DuckDB has started to support FSST, using it to compress data like URLs/emails, which is probably the best domain for its application. I will keep an eye on FSST and see if I can use it wisely later. Thanks again.
