FSST compression #4366

Merged
merged 77 commits into from
Oct 3, 2022
Changes from 61 commits
087e7d0
cloned fsst code from repo
samansmink May 18, 2022
5af0f0f
WIP: fsst POC
samansmink May 19, 2022
7780ad7
WIP: fixes to POC/rebase
samansmink Jun 16, 2022
c9704d4
wip: fsst + delta encode + bp: initial version works
samansmink Jun 16, 2022
4bfc527
wip: fsst + delta encode + bp: more tests pass
samansmink Jun 27, 2022
6f81bcc
wip: fsst + delta encode + bp: a few more tests pass
samansmink Jun 28, 2022
b968b25
wip: fsst + delta encode + bp: list test now passes too
samansmink Jun 29, 2022
20f095e
wip: fsst + delta encode + bp: bugfix in libfsst
samansmink Jun 30, 2022
e256010
wip: fsst + delta encode + bp: remove random trailing space
samansmink Jun 30, 2022
253d0ca
wip: fsst + delta encode + bp: added getValue, right before refactor
samansmink Jun 30, 2022
d03a7f3
wip: fsst + delta encode + bp: fetch now works too!
samansmink Jul 1, 2022
b76f44a
wip: fsst + delta encode + bp: some cleaning up for CI
samansmink Jul 1, 2022
b48e562
wip: fsst + delta encode + bp: unaligned load in libfsst
samansmink Jul 4, 2022
f19f0e9
wip: fsst + delta encode + bp: incorrect template default param
samansmink Jul 4, 2022
bcb2b43
fsst build fixes
samansmink Jul 4, 2022
9bf313c
prevent leaking fsst encoder
samansmink Jul 4, 2022
603ab2d
clean up fsst cmakefile, remove include fsst.h from header files
samansmink Jul 5, 2022
d3f2ce3
various issues from ci
samansmink Jul 7, 2022
40a88a5
format
samansmink Jul 7, 2022
0c283ac
add a move here
samansmink Jul 12, 2022
3aceca4
prefix fsst symbols
samansmink Jul 12, 2022
de48b0a
more ci fixes
samansmink Jul 14, 2022
3072bfb
fix 0 scan_count bug
samansmink Jul 15, 2022
bbb68b6
remove unused header
samansmink Jul 15, 2022
9a2d44a
Added WIN32 version for builtin_ctzl
samansmink Jul 15, 2022
fa96ad2
Merge branch 'master' into fsst-compression-rebased
samansmink Jul 15, 2022
52756cb
fixes after master merge
samansmink Jul 15, 2022
f820d5e
incorrect win32 type
samansmink Jul 15, 2022
2869fb1
Merge branch 'master' into fsst-compression-rebased
samansmink Jul 22, 2022
4f4efbb
fix merge
samansmink Jul 22, 2022
05670e5
Merge branch 'master' into fsst-compression-rebased
samansmink Jul 28, 2022
fb7bda6
adding duckdb_ symbol prefix, better size analysis, compaction
samansmink Aug 1, 2022
4c77e56
added two missing tests
samansmink Aug 1, 2022
e500c56
small cleanup
samansmink Aug 2, 2022
4e889a0
Merge branch 'master' into fsst-compression-rebased
samansmink Aug 2, 2022
c9226ab
corrected fsst compression ratio test
samansmink Aug 2, 2022
114170d
added l_comment compression ratio test too
samansmink Aug 2, 2022
9f9ab8f
clean up scanning without fsst_vectors
samansmink Aug 4, 2022
247fbb1
format, remove comment
samansmink Aug 4, 2022
a0b0cfb
cleaning up
samansmink Aug 4, 2022
174a515
apply intrinsic disable flag here too
samansmink Aug 4, 2022
1f47e36
Merge branch 'master' into fsst-compression-rebased
samansmink Aug 4, 2022
00e0745
fix empty string statistics issue
samansmink Aug 4, 2022
7e91b94
invert flag
samansmink Aug 5, 2022
d316eb9
fix odbc test failure
samansmink Aug 8, 2022
4ad7ebf
fsst benchmarks
samansmink Aug 9, 2022
9bae190
Merge branch 'master' into fsst-compression-rebased
samansmink Aug 9, 2022
93b9927
make fsst vector emitting optional
samansmink Aug 9, 2022
10079a2
format and cleanup
samansmink Aug 9, 2022
ae7dd33
make tidy
samansmink Aug 9, 2022
e4df851
cleanup
samansmink Aug 10, 2022
197d5b1
switch to string heap
samansmink Aug 10, 2022
3d2ad4f
sample fsst analysis
samansmink Aug 10, 2022
55105f1
small analysis fix
samansmink Aug 10, 2022
1c92487
fix issue with analysis sampling
samansmink Aug 10, 2022
05b9930
decrease benchmark runtime a bit
samansmink Aug 10, 2022
5ab1400
renamed benchmark
samansmink Aug 10, 2022
d312220
should be AddBlob ofc
samansmink Aug 10, 2022
285a78b
small fix
samansmink Aug 11, 2022
fdb4049
format
samansmink Aug 12, 2022
00a1e70
disabled march=native for portability
samansmink Aug 13, 2022
71c95a7
fix issue due to rtti disabled on node windows
samansmink Aug 16, 2022
82e51d8
Merge branch 'master' into fsst-compression-rebased
samansmink Aug 16, 2022
084b752
fixed several code style issues
samansmink Aug 17, 2022
b745dcb
make FSSTVector specific functions
samansmink Aug 17, 2022
66d3cd4
encapsulate fsst decompression better
samansmink Aug 17, 2022
e35bf5c
dont use fsst for empty segments, small test rework
samansmink Aug 17, 2022
0bf8e55
fix broken tests due to vector size
samansmink Aug 17, 2022
2dca798
prevent reading beyond end here
samansmink Sep 2, 2022
aceda97
Merge branch 'master' into fsst-compression-rebased
samansmink Sep 2, 2022
df901f6
Merge branch 'master' into fsst-compression-rebased
samansmink Sep 13, 2022
6d3b476
this should be a slow test fo sho
samansmink Sep 28, 2022
b4f16bc
Merge branch 'master' into fsst-compression-rebased
samansmink Sep 28, 2022
4922bf8
test should be slow
samansmink Sep 28, 2022
9c7aa5e
need to store the string count for FSST vectors
samansmink Sep 30, 2022
1964bff
format
samansmink Sep 30, 2022
fcdc9ab
oops forgot this
samansmink Sep 30, 2022
1 change: 1 addition & 0 deletions CMakeLists.txt
Original file line number Diff line number Diff line change
@@ -392,6 +392,7 @@ endif()


include_directories(src/include)
include_directories(third_party/fsst)
include_directories(third_party/fmt/include)
include_directories(third_party/hyperloglog)
include_directories(third_party/fastpforlib)
Original file line number Diff line number Diff line change
@@ -8,7 +8,7 @@ storage persistent

load
DROP TABLE IF EXISTS test;
PRAGMA force_compression='uncompressed';
PRAGMA force_compression='dictionary';
CREATE TABLE test AS SELECT (100 + (i%2))::VARCHAR AS i FROM range(0, 200000000) tbl(i);
checkpoint;

Original file line number Diff line number Diff line change
@@ -9,7 +9,7 @@ require_reinit

load
PRAGMA force_compression='dictionary';
DROP TABLE IF EXISTS integers;
DROP TABLE IF EXISTS test;

run
CREATE TABLE test AS SELECT (100 + (i%1000))::VARCHAR AS i FROM range(0, 100000000) tbl(i);
Original file line number Diff line number Diff line change
@@ -9,7 +9,7 @@ require_reinit

load
PRAGMA force_compression='dictionary';
DROP TABLE IF EXISTS integers;
DROP TABLE IF EXISTS test;

run
CREATE TABLE test AS SELECT i::VARCHAR AS i FROM range(0, 50000000) tbl(i);
20 changes: 20 additions & 0 deletions benchmark/micro/compression/fsst/fsst_late_decompression.benchmark
@@ -0,0 +1,20 @@
# name: benchmark/micro/compression/fsst/fsst_late_decompression.benchmark
# description: Using a filter on another column to make use of late decompression
# group: [fsst]

name fsst late decompression benefit
group fsst
storage persistent

load
DROP TABLE IF EXISTS test;
PRAGMA force_compression='fsst';
CREATE TABLE test AS SELECT i as id, (100 + (i%2))::VARCHAR AS value FROM range(0, 50000000) tbl(i);
checkpoint;
SET enable_fsst_vectors=false;

run
select avg(value::INT) from test where id%10=0;

result I
100.000000
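
Note on the benchmark above: the benefit of late decompression it describes can be sketched as follows. This is a minimal illustration, not DuckDB's scan code. `ToyCompressedColumn`, `SelectRows`, and `AverageSelected` are hypothetical names, and `Decode` is only a stand-in for FSST decompression that counts how often it is called.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical compressed string column. Decode() stands in for FSST
// decompression and counts calls so the saved work is observable.
struct ToyCompressedColumn {
    std::vector<std::string> encoded;
    mutable size_t decode_calls = 0;

    std::string Decode(size_t row) const {
        decode_calls++;
        return encoded[row];
    }
};

// Step 1: evaluate the filter on the cheap integer column only and
// collect the qualifying row indices (a selection vector).
std::vector<size_t> SelectRows(const std::vector<int64_t> &ids) {
    std::vector<size_t> sel;
    for (size_t row = 0; row < ids.size(); row++) {
        if (ids[row] % 10 == 0) {
            sel.push_back(row);
        }
    }
    return sel;
}

// Step 2: decode only the selected rows and aggregate them.
double AverageSelected(const ToyCompressedColumn &values, const std::vector<size_t> &sel) {
    if (sel.empty()) {
        return 0.0;
    }
    double sum = 0.0;
    for (size_t row : sel) {
        sum += std::stoi(values.Decode(row));
    }
    return sum / static_cast<double>(sel.size());
}
```

With a filter that keeps one row in ten, `decode_calls` ends up at a tenth of the row count, which is the saving this benchmark is meant to measure.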
19 changes: 19 additions & 0 deletions benchmark/micro/compression/fsst/fsst_read.benchmark
@@ -0,0 +1,19 @@
# name: benchmark/micro/compression/fsst/fsst_read.benchmark
# description: Scanning strings at ~3.35x compression
# group: [fsst]

name fsst Compression Scan
group fsst
storage persistent

load
DROP TABLE IF EXISTS test;
PRAGMA force_compression='fsst';
CREATE TABLE test AS SELECT (100 + (i%1000))::VARCHAR AS i FROM range(0, 50000000) tbl(i);
checkpoint;

run
select avg(i::INT) from test;

result I
599.500000
16 changes: 16 additions & 0 deletions benchmark/micro/compression/fsst/fsst_read_worst_case.benchmark
@@ -0,0 +1,16 @@
# name: benchmark/micro/compression/fsst/fsst_read_worst_case.benchmark
# description: Scanning data that does not compress well with fsst encoding; note that the compression ratio is still 1.9x due to bitpacking
# group: [fsst]

name fsst Compression Scan
group aggregate
storage persistent

load
DROP TABLE IF EXISTS test;
PRAGMA force_compression='fsst';
CREATE TABLE test AS SELECT gen_random_uuid()::VARCHAR AS i FROM range(0, 20000000) tbl(i);
checkpoint;

run
select max(i[2]) from test;
16 changes: 16 additions & 0 deletions benchmark/micro/compression/fsst/fsst_store.benchmark
@@ -0,0 +1,16 @@
# name: benchmark/micro/compression/fsst/fsst_store.benchmark
# description: Storing strings compressed at ~3.3x compression
# group: [fsst]

name fsst Compression Write
group aggregate
storage persistent
require_reinit

load
PRAGMA force_compression='fsst';

run
CREATE TABLE test_compressed AS SELECT (100 + (i%1000))::VARCHAR AS i FROM range(0, 2500000) tbl(i);
checkpoint;

16 changes: 16 additions & 0 deletions benchmark/micro/compression/fsst/fsst_store_worst_case.benchmark
@@ -0,0 +1,16 @@
# name: benchmark/micro/compression/fsst/fsst_store_worst_case.benchmark
# description: Storing a column containing only unique strings.
# group: [fsst]

name fsst Compression Write
group fsst
storage persistent
require_reinit

load
PRAGMA force_compression='fsst';
DROP TABLE IF EXISTS test;

run
CREATE TABLE test AS SELECT gen_random_uuid()::VARCHAR AS i FROM range(0, 2000000) tbl(i);
checkpoint;
Original file line number Diff line number Diff line change
@@ -1,8 +1,8 @@
# name: benchmark/micro/compression/dictionary/dictionary_store_tpch_sf1.benchmark
# name: benchmark/micro/compression/store_tpch_sf1.benchmark
# description: Generating and storing a tpc-h sf1 database using default compression
# group: [dictionary]
# group: [compression]

name Dictionary Compression Write
name TPC-H Write benchmark
group aggregate
storage persistent
require_reinit
2 changes: 2 additions & 0 deletions scripts/package_build.py
@@ -10,6 +10,7 @@
def third_party_includes():
includes = []
includes += [os.path.join('third_party', 'fmt', 'include')]
includes += [os.path.join('third_party', 'fsst')]
includes += [os.path.join('third_party', 're2')]
includes += [os.path.join('third_party', 'miniz')]
includes += [os.path.join('third_party', 'utf8proc', 'include')]
@@ -33,6 +34,7 @@ def third_party_includes():
def third_party_sources():
sources = []
sources += [os.path.join('third_party', 'fmt')]
sources += [os.path.join('third_party', 'fsst')]
sources += [os.path.join('third_party', 'miniz')]
sources += [os.path.join('third_party', 're2')]
sources += [os.path.join('third_party', 'hyperloglog')]
1 change: 1 addition & 0 deletions src/CMakeLists.txt
@@ -61,6 +61,7 @@ else()

set(DUCKDB_LINK_LIBS
${DUCKDB_SYSTEM_LIBS}
duckdb_fsst
duckdb_fmt
duckdb_pg_query
duckdb_re2
106 changes: 105 additions & 1 deletion src/common/types/vector.cpp
@@ -13,6 +13,8 @@
#include "duckdb/common/vector_operations/vector_operations.hpp"
#include "duckdb/storage/buffer/buffer_handle.hpp"
#include "duckdb/function/scalar/nested_functions.hpp"
#include "duckdb/storage/string_uncompressed.hpp"
#include "fsst.h"

#include <cstring> // strlen() on Solaris

@@ -167,6 +169,12 @@ void Vector::Slice(const SelectionVector &sel, idx_t count) {
}
return;
}

if (GetVectorType() == VectorType::FSST_VECTOR) {
Flatten(sel, count);
return;
}

Vector child_vector(*this);
auto internal_type = GetType().InternalType();
if (internal_type == PhysicalType::STRUCT) {
@@ -416,7 +424,10 @@ Value Vector::GetValueInternal(const Vector &v_p, idx_t index_p) {
case VectorType::FLAT_VECTOR:
finished = true;
break;
// dictionary: apply dictionary and forward to child
case VectorType::FSST_VECTOR:
finished = true;
break;
// dictionary: apply dictionary and forward to child
case VectorType::DICTIONARY_VECTOR: {
auto &sel_vector = DictionaryVector::SelVector(*vector);
auto &child = DictionaryVector::Child(*vector);
@@ -440,6 +451,24 @@ Value Vector::GetValueInternal(const Vector &v_p, idx_t index_p) {
if (!validity.RowIsValid(index)) {
return Value(vector->GetType());
}

if (vector->GetVectorType() == VectorType::FSST_VECTOR) {
if (vector->GetType().id() == LogicalTypeId::VARCHAR) {
unsigned char decompress_buffer[StringUncompressed::STRING_BLOCK_LIMIT + 1];

auto str_compressed = ((string_t *)data)[index];
auto decompressed_string_size =
duckdb_fsst_decompress((duckdb_fsst_decoder_t *)FSSTVector::GetDecoder(const_cast<Vector &>(*vector)),
str_compressed.GetSize(), (unsigned char *)str_compressed.GetDataUnsafe(),
StringUncompressed::STRING_BLOCK_LIMIT + 1, &decompress_buffer[0]);
D_ASSERT(decompressed_string_size <= StringUncompressed::STRING_BLOCK_LIMIT);

return Value(string((char *)decompress_buffer, decompressed_string_size));
} else {
throw InternalException("FSST Vector with non-string datatype found!");
}
}

switch (vector->GetType().id()) {
case LogicalTypeId::BOOLEAN:
return Value::BOOLEAN(((bool *)data)[index]);
@@ -579,6 +608,8 @@ string VectorTypeToString(VectorType type) {
switch (type) {
case VectorType::FLAT_VECTOR:
return "FLAT";
case VectorType::FSST_VECTOR:
return "FSST";
case VectorType::SEQUENCE_VECTOR:
return "SEQUENCE";
case VectorType::DICTIONARY_VECTOR:
@@ -600,6 +631,21 @@ string Vector::ToString(idx_t count) const {
retval += GetValue(i).ToString() + (i == count - 1 ? "" : ", ");
}
break;
case VectorType::FSST_VECTOR: {
for (idx_t i = 0; i < count; i++) {
string_t compressed_string = ((string_t *)data)[i];

// Decompress string
unsigned char decompress_buffer[StringUncompressed::STRING_BLOCK_LIMIT + 1];
auto decompressed_string_size =
duckdb_fsst_decompress((duckdb_fsst_decoder_t *)FSSTVector::GetDecoder(const_cast<Vector &>(*this)),
compressed_string.GetSize(), (unsigned char *)compressed_string.GetDataUnsafe(),
StringUncompressed::STRING_BLOCK_LIMIT + 1, &decompress_buffer[0]);
D_ASSERT(decompressed_string_size <= StringUncompressed::STRING_BLOCK_LIMIT);

retval += string((const char *)decompress_buffer, decompressed_string_size) + (i == count - 1 ? "" : ", ");
}
} break;
case VectorType::CONSTANT_VECTOR:
retval += GetValue(0).ToString();
break;
@@ -662,6 +708,15 @@ void Vector::Flatten(idx_t count) {
case VectorType::FLAT_VECTOR:
// already a flat vector
break;
case VectorType::FSST_VECTOR: {
// create vector to decompress into
Vector other(GetType(), count);
// now copy the data of this vector to the other vector, decompressing the strings in the process
VectorOperations::Copy(*this, other, count, 0, 0);
// create a reference to the data in the other vector
this->Reference(other);
break;
}
case VectorType::DICTIONARY_VECTOR: {
// create a new flat vector of this type
Vector other(GetType(), count);
@@ -771,6 +826,15 @@ void Vector::Flatten(const SelectionVector &sel, idx_t count) {
case VectorType::FLAT_VECTOR:
// already a flat vector
break;
case VectorType::FSST_VECTOR: {
// create a new flat vector of this type
Vector other(GetType());
// now copy the data of this vector to the other vector, removing the selection vector in the process
VectorOperations::Copy(*this, other, sel, count, 0, 0);
// create a reference to the data in the other vector
this->Reference(other);
break;
}
case VectorType::SEQUENCE_VECTOR: {
int64_t start, increment;
SequenceVector::GetSequence(*this, start, increment);
@@ -1357,6 +1421,46 @@ void StringVector::AddHeapReference(Vector &vector, Vector &other) {
StringVector::AddBuffer(vector, other.auxiliary);
}

string_t FSSTVector::AddCompressedString(Vector &vector, const char *data, idx_t len) {
return FSSTVector::AddCompressedString(vector, string_t(data, len));
}

string_t FSSTVector::AddCompressedString(Vector &vector, string_t data) {
D_ASSERT(vector.GetType().InternalType() == PhysicalType::VARCHAR);
if (data.IsInlined()) {
// string will be inlined: no need to store in string heap
return data;
}
if (!vector.auxiliary) {
vector.auxiliary = make_buffer<VectorFSSTStringBuffer>();
}
D_ASSERT(vector.auxiliary->GetBufferType() == VectorBufferType::FSST_BUFFER);
auto &fsst_string_buffer = (VectorFSSTStringBuffer &)*vector.auxiliary;
return fsst_string_buffer.AddBlob(data);
}

void *FSSTVector::GetDecoder(Vector &vector) {
D_ASSERT(vector.GetType().InternalType() == PhysicalType::VARCHAR);
if (!vector.auxiliary) {
throw InternalException("GetDecoder called on FSST Vector without registered buffer");
}
D_ASSERT(vector.auxiliary->GetBufferType() == VectorBufferType::FSST_BUFFER);
auto &fsst_string_buffer = (VectorFSSTStringBuffer &)*vector.auxiliary;
return (duckdb_fsst_decoder_t *)fsst_string_buffer.GetDecoder();
}

void FSSTVector::RegisterDecoder(Vector &vector, buffer_ptr<void> &duckdb_fsst_decoder) {
D_ASSERT(vector.GetType().InternalType() == PhysicalType::VARCHAR);

if (!vector.auxiliary) {
vector.auxiliary = make_buffer<VectorFSSTStringBuffer>();
}
D_ASSERT(vector.auxiliary->GetBufferType() == VectorBufferType::FSST_BUFFER);

auto &fsst_string_buffer = (VectorFSSTStringBuffer &)*vector.auxiliary;
fsst_string_buffer.AddDecoder(duckdb_fsst_decoder);
}

vector<unique_ptr<Vector>> &StructVector::GetEntries(Vector &vector) {
D_ASSERT(vector.GetType().id() == LogicalTypeId::STRUCT || vector.GetType().id() == LogicalTypeId::MAP);
if (vector.GetVectorType() == VectorType::DICTIONARY_VECTOR) {
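
Note on the FSST_VECTOR branches in this file: each one decompresses a compressed `string_t` into a fixed-size stack buffer and then builds a `string` from the returned size. The sketch below illustrates that pattern with a toy decoder. It assumes only the general FSST scheme (one-byte codes that map to symbols of up to 8 bytes, with code 255 escaping a literal byte). `ToySymbolTable`, `ToyDecompress`, and `ToyDecodeToString` are hypothetical names, the 4096-byte buffer is an arbitrary stand-in for `StringUncompressed::STRING_BLOCK_LIMIT + 1`, and none of this is the `duckdb_fsst_decompress` implementation.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Toy stand-in for an FSST symbol table: codes 0..254 map to short symbols,
// code 255 marks the next input byte as an uncompressed literal.
struct ToySymbolTable {
    static constexpr uint8_t ESCAPE = 255;
    std::array<std::string, 255> symbols;
};

// Decode `input` into a caller-provided buffer and return the number of bytes
// written. Decoding stops early if the buffer is full or the input is truncated.
size_t ToyDecompress(const ToySymbolTable &table, const std::vector<uint8_t> &input,
                     unsigned char *output, size_t capacity) {
    size_t written = 0;
    for (size_t i = 0; i < input.size(); i++) {
        if (input[i] == ToySymbolTable::ESCAPE) {
            if (i + 1 >= input.size() || written == capacity) {
                break;
            }
            output[written++] = input[++i]; // copy the escaped literal byte
        } else {
            const std::string &symbol = table.symbols[input[i]];
            if (symbol.size() > capacity - written) {
                break;
            }
            std::memcpy(output + written, symbol.data(), symbol.size());
            written += symbol.size();
        }
    }
    return written;
}

// Same shape as the FSST_VECTOR branches: decode into a stack buffer,
// then construct the result string from the reported size.
std::string ToyDecodeToString(const ToySymbolTable &table, const std::vector<uint8_t> &input) {
    unsigned char buffer[4096 + 1]; // arbitrary toy limit (assumption)
    size_t size = ToyDecompress(table, input, buffer, sizeof(buffer));
    return std::string(reinterpret_cast<const char *>(buffer), size);
}
```

For example, with symbols 0, 1, and 2 set to "http://", "duck", and "db", the code sequence `{0, 1, 2, 255, '!'}` decodes to the string "http://duckdb!".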
6 changes: 6 additions & 0 deletions src/common/types/vector_buffer.cpp
@@ -27,6 +27,12 @@ buffer_ptr<VectorBuffer> VectorBuffer::CreateStandardVector(const LogicalType &t
VectorStringBuffer::VectorStringBuffer() : VectorBuffer(VectorBufferType::STRING_BUFFER) {
}

VectorStringBuffer::VectorStringBuffer(VectorBufferType type) : VectorBuffer(type) {
}

VectorFSSTStringBuffer::VectorFSSTStringBuffer() : VectorStringBuffer(VectorBufferType::FSST_BUFFER) {
}

VectorStructBuffer::VectorStructBuffer() : VectorBuffer(VectorBufferType::STRUCT_BUFFER) {
}
