FSST compression #4366
Conversation
Thanks for the PR! Looks great, and great results. Some comments:
// Only Nulls, nothing to compress
if (total_count == 0 || state.fsst_encoder == nullptr) {
	for (idx_t i = 0; i < count; i++) {
Do we need to support this case? In case of all null or a mix of null and empty strings, I would imagine dictionary or constant encoding would always be better than FSST, no?
Yes, that's true! I have changed this: if all strings are empty or null, FSST will not be considered. However, we still need this case, for example when the first 1024 values are null but the rest are not.
@samansmink good to go from your side?
@hannes Unfortunately I think that even though CI is succeeding, there is still an issue in this PR. I can reproduce this on my machine by building with
I'm not sure what's happening here yet, or why it isn't happening in the CI runs now.
@Mytherin this PR is good to go from my side!
Thanks! LGTM
PR
This PR adds a new compression method to DuckDB, called FSST. In a nutshell, FSST is similar to dictionary compression, except that instead of storing entire strings in a dictionary, a lookup table is used to store common substrings. For more details, check out the original paper and the source code in the repo. FSST provides performance similar to or better than LZ4, with the added benefit of fine-grained access to the compressed data.
Base Implementation
FSST is implemented with a combination of delta encoding and bitpacking for compressing the dictionary offsets. For the compression analyze step, we randomly sample 25% of the vectors of the row group and fully compress them to determine the compressed size. Compression reuses the FSST encoder generated during the analysis step to compress all the strings. During a scan, we cache the dictionary offset of the last decoded row to speed up delta decoding in sequential scans. Note that, similar to dictionary compression, a minimum compression ratio of 1.2 is required for FSST to be selected by the checkpointer, to prevent unnecessary overhead for poorly compressible data.
Late decompression
This PR also includes a new vector type, VectorType::FSST_VECTOR, that allows for late decompression of FSST strings. Late decompression can improve performance, as some of the data may be filtered out and does not need to be decompressed at all. Additionally, it opens the door to compressed execution, where operators are implemented to operate directly on the compressed data without needing to decompress at all. Note that emitting FSST vectors is currently disabled, but can be enabled with SET enable_fsst_vectors=true. The reason for this is that it currently has a higher overhead and we're not yet taking advantage of its benefits.
SIMD
Currently, the SIMD implementation of FSST that uses AVX512 intrinsics is disabled. To experiment with it, there's a flag in third_party/fsst/CMakeLists.txt that can be set to enable it; note that this is currently untested in DuckDB.
Next steps
Optimize memory usage of the analysis step. Currently, when a string column is analyzed by the ColumnDataCheckpointer, the strings are stored separately by both dictionary compression and FSST. It would be nice to be able to share the string data during analysis.
Experiment with compressed execution. For example, a constant filter on an FSST-compressed column could be applied by compressing the constant with the same symbol table instead of decompressing the column. This has two benefits: the comparison itself is sped up by operating on smaller strings, and less data needs to be decompressed overall.
Switch to a single symbol table per row group. Currently the FSST symbol table is stored once per compressed segment, as this is easier to implement. This does come at an overhead of a few percent, so we could switch to storing it once per row group. This is probably also useful for implementing compressed execution as that will require determining which symbol table is used.
Results
All benchmarks run on m5.xlarge.
Compression
TPCH SF1
This benchmark shows the total database size on disk with different combinations of string compression functions enabled. Note that in this benchmark we only change the string compression functions; all fixed-size datatypes remain compressed with the default compression schemes (bitpacking/RLE).
As expected, FSST significantly improves the TPC-H storage size, since columns such as l_comment and c_name compress very well with it. For example, we compress l_comment at about a 3x compression ratio, closely matching the results reported in the FSST paper.
Microbenchmarks
In this benchmark we compare FSST both with and without late decompression. A big advantage of FSST is its compression and decompression speed; however, FSST does add some overhead, especially compared to dictionary compression, which is often faster than a normal scan in DuckDB.
The regular read/store benchmarks aim to have realistic compression ratios based on those found in the FSST paper. The _worst_case benchmarks use incompressible string data. The late_decompression benchmark contains a filter with a selectivity of 10% on a different column, demonstrating the effect of late decompression.
Based on these benchmarks, we see that FSST decompression does come at some performance overhead, especially at low compression ratios. We could consider setting the minimum compression ratio a bit higher based on these numbers.
Next up, a benchmark that measures how long writing and checkpointing takes for tpch sf1:
TPCH SF1
Next, we run TPC-H on a persistent database to see how the overhead from FSST translates into more realistic queries. All queries where no significant difference was measured have been discarded. These overheads seem pretty reasonable for the achieved compression.