Bitpacking storage compression #2679
Conversation
Hey @samansmink! Cool stuff!
@pdet Ah, I missed the SQL compression option completely; I added a bitpacking column to the existing test.
Reran TPC-H at SF10 (file size with/without bitpacking: 8.4G/11G).
Thanks for the updates! Looks excellent. Some more minor comments, then I think this is ready to merge:
PRAGMA force_compression = 'bitpacking'

statement ok
CREATE TABLE test (id INTEGER, l INTEGER[]);
Could we add one more test:
- A few really long lists (> vector size); you can use the `LIST` aggregate function to create these, e.g. `SELECT LIST(i) FROM range(10000) tbl(i)`, and use `GROUP BY` to create multiple long lists.
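A sketch of what such a test could look like in DuckDB's sqllogictest format (the table name and group count here are hypothetical, just following the suggestion above):

```
statement ok
PRAGMA force_compression = 'bitpacking'

statement ok
CREATE TABLE long_lists AS
    SELECT i % 3 AS g, LIST(i) AS l
    FROM range(10000) tbl(i)
    GROUP BY g;
```

This produces three lists of over 3000 elements each, comfortably exceeding DuckDB's standard vector size of 2048.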
@Mytherin TPC-H SF10 latest results added as bitpacking_current:
Excellent results!
Implemented bitpacking in the storage compression framework. The bitpacking itself is done with the FastPFor library, from which I moved the necessary code into a ./third_party folder. I ran some benchmarks for evaluation, which show that read performance does suffer significantly for some queries; Q06 and Q11 in particular show serious slowdowns, although overall times are not too bad. I'm interested to hear what you think!
I think there's an optimization possibility for INT32 and INT64 that would, in some cases, allow skipping the decompression buffer and decompressing straight into the result vector. I could add that to this pull request or do it in a separate one; I wanted to get your opinion on the code first!
Evaluation
TPC-H SF1 (DuckDB in persistent mode)
To get an estimate of the real-world compression ratio, I ran the following query on the TPC-H SF1 lineitem table.
select count(distinct block_id) from pragma_storage_info('lineitem') where segment_type not in('VARCHAR', 'VALIDITY');
The overall size of the TPC-H SF1 file is: