compression ratio #5
Many apologies for not responding to this. The email notifications were not correctly set up, which means I only saw this issue when accidentally browsing here.

As for the compression ratio: FSST is especially suitable for database columns, where one has very many relatively short strings. Its static symbol table of 256 symbols makes it quite effective for textual human language (which is not extremely repetitive, but does contain frequently co-occurring letter combinations that cause compressibility). In situations with a relatively large amount of repetitive text, however, it will lose out: it cannot remember more than 255x8 = 2KB of repetitive text anyway, and given that the non-repetitive text also needs some of the shorter symbols, you should probably divide that by 4, leaving roughly 512 bytes of highly repetitive text. So on JSON with complex (very repetitive) record structures, FSST will start to lose out to compression schemes that can remember more text.

While it did not make the paper, we also experimented with FSST using 12-bit codes (which would have 16x the 'textual memory' in the symbol table), and that can also be a sweet spot. It achieves roughly the same compression ratio on the benchmark in the paper and is even a bit faster in decompression: with codes of more bits, and therefore a symbol table that can hold more symbols, the average symbol becomes longer; and it has to become longer, to counter the 50% increase in code length. Longer symbols mean faster decoding, as you produce more uncompressed text per iteration (per decoded code).

Increasing the code width can thus help, but only to a point. For instance, a 16-bit code scheme would also work, but counting the frequencies of two subsequent codes would require a 4GB array (the concatenated key of two codes is then 2x16 bits = 32 bits). That will make symbol table creation much slower, either because of cache misses, or because you would then switch from counting with an array to counting with a hash table. Also, with 16-bit codes the symbol table will no longer fit in the L1 CPU cache during decompression, likely making decompression slower.

So maybe we were wrong in deciding on 8-bit codes, and maybe we should try 12-bit or 16-bit codes in order to better handle JSON and these kinds of log files. Our consideration was that 8-bit codes fit beautifully into existing string representations in data systems: an uncompressed string is a byte sequence, and a compressed string is as well. This very often allows lazy decompression, without modifying existing systems much. We also considered that competent database systems should not store JSON as text; rather, they should automatically recognise its structure and store it shredded into columns. If those columns (JSON attributes) are string-typed, FSST could be used on them, and the attribute names would already be 'compressed out' thanks to the shredding. We did not consider the kind of log files you want to compress.
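The capacity arithmetic above (255x8 = 2KB of symbol text for 8-bit codes, roughly 16x that for 12-bit codes, and a 2^32-entry pair-counting array for 16-bit codes) can be checked with a quick back-of-envelope script. It assumes the 8-byte maximum symbol length and one reserved code value (hence the 255 figure) from the discussion above:

```python
# Back-of-envelope numbers for the code widths discussed above.
# Assumptions: max symbol length 8 bytes; one code value reserved
# (escape), so an n-bit scheme has 2**n - 1 usable symbols.

for bits in (8, 12, 16):
    symbols = 2**bits - 1            # usable codes
    capacity = symbols * 8           # max bytes of distinct symbol text
    # Counting frequencies of two subsequent codes needs a key space
    # of 2n bits, i.e. this many counter entries:
    pair_counters = 2**(2 * bits)
    print(f"{bits}-bit codes: ~{capacity / 1024:.0f} KB of symbol text, "
          f"{pair_counters:,} pair counters")
```

For 16-bit codes the 2^32 pair counters are what produces the 4GB counting array mentioned above, while the symbol text capacity (~512 KB) already exceeds a typical L1 cache.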
Thanks so much for the detailed explanation. I recently found that DuckDB has started to support FSST, and they use it for compressing URLs/emails, which is probably the best domain for its application. I will keep an eye on FSST and see if I can use it wisely later. Thanks again.
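The decoding-speed point made above (each decoded code emits a whole symbol, so longer symbols produce more uncompressed text per iteration) can be seen in a minimal sketch of table-lookup decompression with 1-byte codes. The escape convention and the toy table here are illustrative only, not the library's actual layout:

```python
def decompress(codes: bytes, symbols: list, escape: int = 255) -> bytes:
    """Sketch of table-lookup decompression with 1-byte codes.

    `symbols` maps each code to the byte string it stands for; one
    code value is reserved as an escape that passes the next byte
    through literally. Illustration only, not the real FSST layout.
    """
    out = bytearray()
    i = 0
    while i < len(codes):
        c = codes[i]
        if c == escape:              # escaped literal byte follows
            out += codes[i + 1:i + 2]
            i += 2
        else:                        # one whole symbol per decoded code
            out += symbols[c]
            i += 1
    return bytes(out)

# Toy table: code 0 -> b"http://", code 1 -> b"www."
table = [b"http://", b"www."] + [b""] * 253
print(decompress(bytes([0, 1, 255, ord("x")]), table))  # b'http://www.x'
```

The longer the entries in `table`, the more output bytes each loop iteration produces, which is why a larger symbol table with longer average symbols can decode faster per byte.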
Hi Peter,
I am working on a columnar database project, and this paper/library is very interesting; I am evaluating how good the fsst compression ratio is compared to lz4 and zstd.
Here is how I did it:
- Test env: macOS 11.6, Intel MacBook Pro 2019
- Built the fsst binary
- Installed zstd (zstd 1.5.0) and lz4 (lz4 1.9.3) using homebrew
- Took the paper/dbtext folder (I tried city/email/credentials/faust), compressing the files using both fsst and zstd without any additional parameter (zstd defaults to level 3 compression according to its documentation). I found fsst-compressed files are around 10% larger than zstd-compressed files for these sample files. This result is close to my expectation.
- Then I gave lz4 a try (using its command line without any parameter), and it can achieve a 6% compression ratio (7.4MB after compression). The gap with zstd/lz4 for this data set is non-trivial here and is not expected.

I wonder if you can shed some light on this evaluation.
Thanks so much.
UPDATE:
I read the paper again, and it does mention that json/xml text compresses around 2~2.5x worse than with lz4. But here my test data is a big access log file, and there appears to be a 6x gap versus lz4. What could be the type of data on which FSST does not work as well as expected?
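For concreteness, "compression ratio" here means compressed size divided by original size. A stand-in sketch using Python's stdlib zlib (not the actual fsst/zstd/lz4 binaries, and synthetic log-like data) shows how highly repetitive access-log text compresses to a few percent under window-based schemes:

```python
import zlib

# Synthetic, highly repetitive "access log"-style data. Real logs
# repeat record structure like this, which window-based compressors
# (zlib here, like lz4/zstd) exploit far beyond a 2KB symbol table.
line = b'127.0.0.1 - - [10/Oct/2021] "GET /index.html HTTP/1.1" 200 2326\n'
data = line * 10_000

compressed = zlib.compress(data)
ratio = len(compressed) / len(data)
print(f"{len(data)} -> {len(compressed)} bytes, ratio {ratio:.2%}")
```

On data like this the ratio lands well under 6%, which matches the kind of gap reported above for a repetitive log file.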