FSST compression #4366
Conversation
Thanks for the PR! Looks great, and great results. Some comments:
// Only Nulls, nothing to compress
if (total_count == 0 || state.fsst_encoder == nullptr) {
	for (idx_t i = 0; i < count; i++) {
Do we need to support this case? In case of all null or a mix of null and empty strings, I would imagine dictionary or constant encoding would always be better than FSST, no?
Yes, that's true! I have changed this: if all strings are empty or null, FSST will not be considered. However, we still need this case, for example when the first 1024 values are null but the rest are not.
@samansmink good to go from your side?
@hannes Unfortunately I think that even though CI is succeeding, there is still an issue in this PR. I can reproduce this on my machine by building with
I'm not sure what's happening here yet, or why it isn't happening in the CI runs now.
@Mytherin this PR is good to go from my side!
Thanks! LGTM
PR
This PR adds a new compression method to DuckDB, called FSST. In a nutshell, FSST is similar to dictionary compression, except that instead of storing entire strings in a dictionary, a lookup table is used to store common substrings. For more details, check out the original paper and the source code in the repo. FSST provides performance similar to or better than LZ4, with the added benefit of fine-grained access to the compressed data.
Base Implementation
FSST is implemented with a combination of delta encoding and bitpacking for compressing the dictionary offsets. For the compression analyze step, we randomly sample 25% of the vectors of the row group and fully compress them to determine the compressed size. Compression reuses the FSST encoder generated during the analysis step to compress all the strings. During a scan, we cache the dictionary offset of the last decoded row to speed up delta decoding in sequential scans. Note that, similar to dictionary compression, a minimum compression ratio of 1.2 is required for FSST to be selected by the checkpointer, to prevent unnecessary overhead for poorly compressible data.
Late decompression
This PR also includes a new vector type, VectorType::FSST_VECTOR, that allows for late decompression of FSST strings. Late decompression can improve performance, as some of the data may be filtered out and does not need to be decompressed at all. Additionally, it opens the door to compressed execution, where operators are implemented to operate directly on the compressed data without needing to decompress at all. Note that emitting FSST vectors is currently disabled, but can be enabled with SET enable_fsst_vectors=true. The reason for this is that it currently has a higher overhead and we're not yet taking advantage of its benefits.
SIMD
Currently, the SIMD implementation of FSST that uses AVX512 intrinsics is disabled. To experiment with it, there's a flag in third_party/fsst/CMakeLists.txt that can be set to enable it; note that this is currently untested in DuckDB.
Next steps
Optimize memory usage of the analysis step. Currently, when a string column is analyzed by the ColumnDataCheckpointer, the strings are stored separately by both dictionary compression and FSST. It would be nice to be able to share the string data during analysis.
Experiment with compressed execution. For example, a constant filter on an FSST-compressed column could be applied by compressing the constant with the same symbol table instead of decompressing the column. This has two benefits: the comparison itself is sped up by operating on smaller strings, and less data needs to be decompressed overall.
Switch to a single symbol table per row group. Currently the FSST symbol table is stored once per compressed segment, as this is easier to implement. This does come at an overhead of a few percent, so we could switch to storing it once per row group. This is probably also useful for implementing compressed execution as that will require determining which symbol table is used.
Results
All benchmarks run on m5.xlarge.
Compression
TPCH SF1
This benchmark shows the total database size on disk with different combinations of string compression functions enabled. Note that in this benchmark we only change the string compression functions; all fixed-size datatypes remain compressed with the default compression schemes (bitpacking/RLE).
As expected, FSST significantly improves the TPC-H storage size, since columns such as l_comment and c_name compress very well with it. For example, we compress l_comment at about a 3x compression ratio, closely matching the results reported in the FSST paper.
Microbenchmarks
In this benchmark we compare FSST both with and without late decompression. A big advantage of FSST is its compression and decompression speed; however, FSST does add some overhead, especially compared to dictionary compression, which is often faster than a normal scan in DuckDB.
The regular read/store benchmarks aim to have realistic compression ratios based on those found in the FSST paper. The _worst_case benchmarks use incompressible string data. The late_decompression benchmark contains a filter with a selectivity of 10% on a different column, demonstrating the effect of late decompression.
Based on these benchmarks, we see that FSST decompression does come at some performance overhead, especially at low compression ratios. We could consider setting the minimum compression ratio a bit higher based on these numbers.
Next up, a benchmark that measures how long writing and checkpointing takes for tpch sf1:
TPCH SF1
Next, we run TPC-H on a persistent database to see how the overhead from FSST translates into more realistic queries. All queries where no significant difference was measured have been discarded. These overheads seem pretty reasonable for the achieved compression.