
[Compression] CHIMP128 Compression Algorithm #4878


Merged: 159 commits, Oct 20, 2022


Tishj
Contributor

@Tishj Tishj commented Oct 4, 2022

This PR adds the Chimp compression algorithm.

CHIMP(128)

Chimp is a compression algorithm that can be used to store floating point values (DOUBLE or FLOAT).

The algorithm was introduced recently as a competitor to Gorilla; the paper describing it can be found here

In short:
It keeps the past 128 values in a ring buffer and selects the best-suited one as the reference value.
It then XORs the current value with that reference value.
The result can be stored in 4 different ways.
Each value is serialized to disk as a 2-bit flag indicating how it is stored, followed by the bits relevant to that storage method.
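
To make the reference selection and XOR steps concrete, here is a minimal Python sketch. It is not the code in this PR: the helper names are hypothetical, the selection rule (pick the reference whose XOR has the most trailing zeros) is an assumption, and the linear scan over the ring buffer is only for readability. The four storage methods and their flag values are not modelled.

```python
import struct
from collections import deque

RING_SIZE = 128  # history length given in the description above


def double_to_bits(value: float) -> int:
    """Reinterpret a DOUBLE as its 64-bit IEEE 754 pattern."""
    return struct.unpack("<Q", struct.pack("<d", value))[0]


def bits_to_double(bits: int) -> float:
    """Inverse of double_to_bits."""
    return struct.unpack("<d", struct.pack("<Q", bits))[0]


def trailing_zeros(bits: int) -> int:
    """Count trailing zero bits; an all-zero word counts as 64."""
    return 64 if bits == 0 else (bits & -bits).bit_length() - 1


def pick_reference(ring, bits):
    """Return (index, xor) for the ring entry whose XOR with `bits`
    has the most trailing zeros (assumed rule, naive linear scan)."""
    best_index, best_xor = 0, ring[0] ^ bits
    for index, reference in enumerate(ring):
        xor = reference ^ bits
        if trailing_zeros(xor) > trailing_zeros(best_xor):
            best_index, best_xor = index, xor
    return best_index, best_xor


def encode(values):
    """Yield (reference_index, xor) pairs; the first value is kept raw."""
    ring = deque(maxlen=RING_SIZE)  # oldest entry drops out automatically
    for value in values:
        bits = double_to_bits(value)
        yield pick_reference(ring, bits) if ring else (None, bits)
        ring.append(bits)


def decode(pairs):
    """Rebuild the values by replaying the same ring buffer state."""
    ring = deque(maxlen=RING_SIZE)
    out = []
    for index, xor in pairs:
        bits = xor if index is None else ring[index] ^ xor
        ring.append(bits)
        out.append(bits_to_double(bits))
    return out
```

Calling `decode(list(encode(values)))` returns the original values, because the decoder rebuilds the same ring buffer one value at a time. A repeated value produces an XOR of zero, which is the cheapest case for any of the storage methods to represent.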

Compression time benchmarks

Uncompressed:

name    run     timing
benchmark/micro/compression/chimp/chimp_store.benchmark 1       2.398617
benchmark/micro/compression/chimp/chimp_store.benchmark 2       2.418939
benchmark/micro/compression/chimp/chimp_store.benchmark 3       2.470288
benchmark/micro/compression/chimp/chimp_store.benchmark 4       2.418947
benchmark/micro/compression/chimp/chimp_store.benchmark 5       3.512568

Chimp:

name    run     timing
benchmark/micro/compression/chimp/chimp_store.benchmark 1       4.904306
benchmark/micro/compression/chimp/chimp_store.benchmark 2       4.029724
benchmark/micro/compression/chimp/chimp_store.benchmark 3       4.280829
benchmark/micro/compression/chimp/chimp_store.benchmark 4       3.999000
benchmark/micro/compression/chimp/chimp_store.benchmark 5       4.062333

Decompression time benchmarks (sequential scan)

Uncompressed:

name    run     timing
benchmark/micro/compression/chimp/chimp_read.benchmark  1       0.020655
benchmark/micro/compression/chimp/chimp_read.benchmark  2       0.021452
benchmark/micro/compression/chimp/chimp_read.benchmark  3       0.019731
benchmark/micro/compression/chimp/chimp_read.benchmark  4       0.019733
benchmark/micro/compression/chimp/chimp_read.benchmark  5       0.020495

Chimp:

name    run     timing
benchmark/micro/compression/chimp/chimp_read.benchmark  1       0.106700
benchmark/micro/compression/chimp/chimp_read.benchmark  2       0.101608
benchmark/micro/compression/chimp/chimp_read.benchmark  3       0.105511
benchmark/micro/compression/chimp/chimp_read.benchmark  4       0.109822
benchmark/micro/compression/chimp/chimp_read.benchmark  5       0.098079
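
For a quick summary of the four tables above, the following sketch compares median timings. The numbers are copied from the runs listed in this comment; the variable and function names are made up for illustration. The median is used because run 5 of the uncompressed store benchmark looks like an outlier.

```python
from statistics import median

# Timings copied from the benchmark tables above (units as reported).
store_uncompressed = [2.398617, 2.418939, 2.470288, 2.418947, 3.512568]
store_chimp = [4.904306, 4.029724, 4.280829, 3.999000, 4.062333]
read_uncompressed = [0.020655, 0.021452, 0.019731, 0.019733, 0.020495]
read_chimp = [0.106700, 0.101608, 0.105511, 0.109822, 0.098079]


def slowdown(baseline, candidate):
    """Ratio of median timings: how many times slower `candidate` is."""
    return median(candidate) / median(baseline)


print(f"store: {slowdown(store_uncompressed, store_chimp):.2f}x")
print(f"read:  {slowdown(read_uncompressed, read_chimp):.2f}x")
```

On these runs, Chimp is roughly 1.7x slower to store and roughly 5.1x slower to scan than uncompressed data. The benchmarks measure time only, so they say nothing about the size reduction gained in exchange.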

@Tishj
Contributor Author

Tishj commented Oct 17, 2022

CI on my fork passes, but it randomly fails here
(and it's not related to anything I've done)

@Mytherin
Collaborator

Those look like spurious failures indeed

@Tishj
Contributor Author

Tishj commented Oct 17, 2022

              ERROR: Could not install packages due to an OSError: HTTPSConnectionPool(host='files.pythonhosted.org', port=443): Max retries exceeded with url: /packages/18/ad/ec41343a49a0371ea40daf37b1ba2c11333cdd121cb378161635d14b9750/setuptools-59.2.0-py3-none-any.whl (Caused by ConnectTimeoutError(<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7fb47921ea40>, 'Connection to files.pythonhosted.org timed out. (connect timeout=15)'))

more spurious failures

@Mytherin Mytherin merged commit 5733a22 into duckdb:master Oct 20, 2022
@Mytherin
Collaborator

Thanks for the fixes! LGTM

@Alex-Monahan
Contributor

Super cool! Any high-level benchmarks we can tweet out upon the next release?

4 participants