
FSST compression #4366

Merged: 77 commits merged into duckdb:master on Oct 3, 2022

Conversation

@samansmink (Contributor) commented Aug 12, 2022

PR

This PR adds a new compression method to duckdb, called FSST. In a nutshell, FSST is similar to dictionary compression, except that instead of storing entire strings in a dictionary, a lookup table is used to store common substrings. For more details, check out the original paper and the source code in the repo. FSST provides performance similar to or better than LZ4, with the added benefit of fine-grained access to the compressed data.
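
To make the core idea concrete, here is a minimal sketch of FSST-style encoding. This is purely illustrative; the real implementation in third_party/fsst builds the symbol table adaptively and is heavily optimized. Frequent substrings map to one-byte codes, and a reserved escape byte covers input that matches no symbol:

#include <cstdint>
#include <string>
#include <vector>

// Toy symbol table: the index of a symbol is its one-byte code.
// Code 0xFF is reserved as the escape byte, so at most 255 symbols fit.
struct SymbolTable {
    std::vector<std::string> symbols;
};

std::string Encode(const SymbolTable &table, const std::string &input) {
    std::string out;
    size_t pos = 0;
    while (pos < input.size()) {
        // greedy longest match against the symbol table
        int best_code = -1;
        size_t best_len = 0;
        for (size_t code = 0; code < table.symbols.size(); code++) {
            const std::string &sym = table.symbols[code];
            if (sym.size() > best_len && input.compare(pos, sym.size(), sym) == 0) {
                best_code = static_cast<int>(code);
                best_len = sym.size();
            }
        }
        if (best_code >= 0) {
            out.push_back(static_cast<char>(best_code));
            pos += best_len;
        } else {
            out.push_back(static_cast<char>(0xFF)); // escape: next byte is a literal
            out.push_back(input[pos++]);
        }
    }
    return out;
}

Decoding such a string is a simple loop of table lookups, and each string can be decoded independently of its neighbors, which is where the fine-grained access comes from.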

Base Implementation

FSST is implemented with a combination of delta encoding and bitpacking for compressing the dictionary offsets. For the compression analyze step, we randomly sample 25% of the vectors in the row group and fully compress them to determine the compressed size. The compression step reuses the FSST encoder generated during analysis to compress all the strings. During a scan, we cache the dictionary offset of the last decoded row to speed up delta decoding in sequential scans. Note that, similar to dictionary compression, a minimum compression ratio of 1.2 is required for FSST to be selected by the checkpointer, to prevent unnecessary overhead for poorly compressible data.
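
As a minimal sketch of the scan-time caching (hypothetical names, not the actual scan state): decoding the dictionary offset of row i from the delta stream normally requires summing all deltas up to i, but a sequential scan can resume from the previously decoded row:

#include <cstdint>
#include <vector>

struct OffsetDecoder {
    std::vector<uint32_t> deltas; // stand-in for the bitpacked delta stream
    uint64_t last_row = 0;        // cursor cached across scan calls
    uint64_t last_offset = 0;

    uint64_t OffsetOf(uint64_t row) {
        if (row < last_row) {
            // going backwards: restart the prefix sum from the beginning
            last_row = 0;
            last_offset = 0;
        }
        while (last_row < row) {
            // sequential scans only pay for the deltas since the last call
            last_offset += deltas[last_row++];
        }
        return last_offset;
    }
};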

Late decompression

This PR also includes a new vector type, VectorType::FSST_VECTOR, that allows for late decompression of FSST strings. Late decompression can improve performance, as some of the data may be filtered out and then does not need to be decompressed at all. Additionally, it opens the door to compressed execution, where operators directly operate on the compressed data without decompressing at all. Note that emitting FSST vectors is currently disabled, but it can be enabled with SET enable_fsst_vectors=true. The reason is that it currently has a higher overhead, and we are not yet exploiting its benefits.
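
As a sketch of the late-decompression idea (hypothetical types, not the actual VectorType::FSST_VECTOR interface): the compressed vector carries the symbol table along, and only rows surviving a filter on another column are decoded, matching the toy encoding sketched above:

#include <cstdint>
#include <string>
#include <vector>

struct FsstVector {
    std::vector<std::string> symbols;    // symbol table shared by all rows
    std::vector<std::string> compressed; // compressed payload per row
};

// Decode a single row; mirrors the toy encoding sketched earlier.
std::string DecodeOne(const FsstVector &v, size_t row) {
    std::string out;
    const std::string &in = v.compressed[row];
    for (size_t i = 0; i < in.size(); i++) {
        uint8_t code = static_cast<uint8_t>(in[i]);
        if (code == 0xFF) {
            out.push_back(in[++i]); // escape byte: next byte is a literal
        } else {
            out += v.symbols[code];
        }
    }
    return out;
}

// Filter on another column first, then decode only the surviving rows;
// at 10% selectivity, ~90% of the string data is never decompressed.
std::vector<std::string> FilterThenDecode(const FsstVector &strings,
                                          const std::vector<int64_t> &other_column,
                                          int64_t threshold) {
    std::vector<std::string> result;
    for (size_t row = 0; row < other_column.size(); row++) {
        if (other_column[row] < threshold) {
            result.push_back(DecodeOne(strings, row));
        }
    }
    return result;
}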

SIMD

Currently, the SIMD implementation of FSST that uses AVX512 intrinsics is disabled. To experiment with it, there is a flag in third_party/fsst/CMakeLists.txt that can be set to enable it; note that this is currently untested in duckdb.

Next steps

Optimize memory usage of the analysis step. Currently, when a string column is analyzed by the ColumnDataCheckpointer, the strings are stored separately by both dictionary compression and FSST. It would be nice to be able to share the string data during analysis.

Experiment with compressed execution. For example, a constant filter on an FSST-compressed column could be applied by encoding the constant with the same symbol table instead of decoding the column. This has two benefits: the comparison itself is sped up by operating on smaller strings, and less data needs to be decoded overall.
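
A sketch of what such a compressed equality filter could look like (an illustration of the idea, not this PR's implementation). It assumes the encoder is deterministic, so equal strings encoded with the same symbol table yield identical bytes; the constant is encoded once up front and no row is ever decoded:

#include <cstddef>
#include <string>
#include <vector>

// Returns the ids of rows whose compressed payload equals the pre-encoded constant.
std::vector<size_t> EqualityFilter(const std::vector<std::string> &compressed_rows,
                                   const std::string &encoded_constant) {
    std::vector<size_t> matches;
    for (size_t row = 0; row < compressed_rows.size(); row++) {
        // compressed strings are shorter, so each comparison touches fewer bytes
        if (compressed_rows[row] == encoded_constant) {
            matches.push_back(row);
        }
    }
    return matches;
}

Note that this trick only carries over to predicates that respect byte equality; range predicates would still require decoding (or an order-preserving scheme).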

Switch to a single symbol table per row group. Currently, the FSST symbol table is stored once per compressed segment, as this is easier to implement. This comes at an overhead of a few percent, so we could switch to storing it once per row group. That is probably also useful for implementing compressed execution, since that requires knowing which symbol table was used.

Results

All benchmarks were run on an m5.xlarge instance.

Compression

TPCH SF1
This benchmark shows the total database size on disk with different combinations of string compression functions enabled. Note that in this benchmark we only change the string compression functions; all fixed-size datatypes remain compressed with the default compression schemes (bitpacking/rle).

compression              storage size
no string compression    761M
dictionary compression   510M
fsst and dictionary      251M

As expected, fsst greatly improves the tpch storage size: with fsst we can compress columns such as l_comment and c_name very well. For example, we compress l_comment at about a 3x compression ratio, which closely matches the results reported in the FSST paper.

Microbenchmarks

In this benchmark we compare fsst both with and without late decompression. A big advantage of FSST is its compression and decompression speed; however, FSST does add some overhead, especially compared to dictionary compression, which is often faster than a normal scan in duckdb.

The regular read/store benchmarks aim for a "realistic" compression ratio based on the ratios found in the fsst paper. The _worst_case benchmarks use incompressible string data. The late_decompression benchmark contains a filter with a selectivity of 10% on a different column, demonstrating the effect of late decompression.

benchmark baseline dict fsst fsst_late_decomp dict_diff fsst_diff fsst_late_decomp_diff
benchmark/micro/compression/fsst/fsst_late_decompression.benchmark 0.63 0.31 0.73 0.70 -50% 15% 10%
benchmark/micro/compression/fsst/fsst_read.benchmark 0.88 0.51 0.96 1.22 -42% 9% 38%
benchmark/micro/compression/fsst/fsst_read_worst_case.benchmark 0.42 0.43 0.79 0.98 2% 88% 133%
benchmark/micro/compression/fsst/fsst_store.benchmark 0.60 0.76 0.69 0.67 26% 15% 12%
benchmark/micro/compression/fsst/fsst_store_worst_case.benchmark 1.11 1.77 1.30 1.19 58% 16% 7%
benchmark/micro/compression/store_tpch_sf1.benchmark 25.53 26.50 27.12 27.12 4% 6% 6%

Based on these benchmarks, we see that fsst decompression does come at some performance overhead, especially at low compression ratios. We could consider setting the minimum compression ratio a bit higher based on these numbers.

Next up is a benchmark that measures how long writing and checkpointing take for tpch sf1:

benchmark no string compression only dict dict and fsst dict_diff dict_fsst_diff
benchmark/micro/compression/store_tpch_sf1.benchmark 25.53 26.50 27.12 4% 6%

TPCH SF1

Next, we run tpch on a persistent database to see how the overhead from fsst translates to more realistic queries. All queries where no significant difference was measured have been discarded. These overheads seem pretty reasonable for the achieved compression.

benchmark baseline_without_fsst fsst fsst_late_decomp fsst_diff fsst_late_decomp_diff
q10 0.15 0.17 0.18 8% 17%
q13 0.13 0.17 0.22 29% 66%
q17 0.20 0.22 0.22 9% 7%
q22 0.05 0.08 0.08 50% 57%

@Mytherin marked this pull request as ready for review August 16, 2022 07:15
@Mytherin (Collaborator) left a comment
Thanks for the PR! Looks great, and great results. Some comments:

Resolved review threads: src/common/types/vector.cpp (×4), src/common/vector_operations/vector_copy.cpp, src/include/duckdb/common/types/vector.hpp, src/storage/compression/fsst.cpp (×2)

// Only Nulls, nothing to compress
if (total_count == 0 || state.fsst_encoder == nullptr) {
for (idx_t i = 0; i < count; i++) {
@Mytherin (Collaborator) commented:

Do we need to support this case? In case of all null or a mix of null and empty strings, I would imagine dictionary or constant encoding would always be better than FSST, no?

@samansmink (Contributor, Author) replied:

Yes, that's true! I have changed this: if all strings are empty or null, FSST will now not be considered. However, we still need this case, for example when the first 1024 values are null but the rest are not.

src/storage/compression/fsst.cpp (resolved review thread)
@hannes added this to the 0.5.0 milestone Aug 22, 2022
@hannes (Member) commented Aug 22, 2022

@samansmink good to go from your side?

@samansmink (Contributor, Author) commented

@hannes Unfortunately I think that even though CI is succeeding, there is still an issue in this PR. I can reproduce it on my machine by building with make relassert, setting the vector size to 2, and then running test/sql/storage/compression/string/filter_pushdown.test, which results in:

Filters: test/sql/storage/compression/string/filter_pushdown.test
[0/1] (0%): test/sql/storage/compression/string/filter_pushdown.test            =================================================================
==42685==ERROR: AddressSanitizer: container-overflow on address 0x000104dc8540 at pc 0x000109b53fb0 bp 0x00016f19b9b0 sp 0x00016f19b9a8
READ of size 1 at 0x000104dc8540 thread T0
    #0 0x109b53fac in buildSymbolTable(Counters&, std::__1::vector<unsigned char*, std::__1::allocator<unsigned char*> >, unsigned long*, bool) libfsst.cpp:221
    #1 0x109b575c8 in duckdb_fsst_create libfsst.cpp:496
    #2 0x109926eb4 in duckdb::FSSTStorage::StringFinalAnalyze(duckdb::AnalyzeState&) fsst.cpp:161
    #3 0x1099d9f94 in duckdb::ColumnDataCheckpointer::DetectBestCompressionMethod(unsigned long long&) column_data_checkpointer.cpp:136
    #4 0x1099db0b4 in duckdb::ColumnDataCheckpointer::WriteToDisk() column_data_checkpointer.cpp:177
    #5 0x109a188bc in duckdb::ColumnData::Checkpoint(duckdb::RowGroup&, duckdb::TableDataWriter&, duckdb::ColumnCheckpointInfo&) column_data.cpp:373
    #6 0x109a4a37c in duckdb::StandardColumnData::Checkpoint(duckdb::RowGroup&, duckdb::TableDataWriter&, duckdb::ColumnCheckpointInfo&) standard_column_data.cpp:182
    #7 0x109a418a4 in duckdb::RowGroup::Checkpoint(duckdb::TableDataWriter&, std::__1::vector<std::__1::unique_ptr<duckdb::BaseStatistics, std::__1::default_delete<duckdb::BaseStatistics> >, std::__1::allocator<std::__1::unique_ptr<duckdb::BaseStatistics, std::__1::default_delete<duckdb::BaseStatistics> > > >&) row_group.cpp:667
    #8 0x109ae024c in duckdb::DataTable::Checkpoint(duckdb::TableDataWriter&) data_table.cpp:1416
...

I'm not sure what's happening here yet, or why it isn't happening in the CI runs.

@hannes removed this from the 0.5.0 milestone Aug 28, 2022
@samansmink (Contributor, Author) commented

@Mytherin this PR is good to go from my side!

@Mytherin merged commit ffa0b9d into duckdb:master Oct 3, 2022
@Mytherin (Collaborator) commented Oct 3, 2022

Thanks! LGTM
