ALP: Adaptive Lossless Floating-Point Compression

A lossless floating-point compression algorithm for the double and float data types. ALP significantly improves on all previous floating-point encodings in both speed and compression ratio (figure below; each dot represents a dataset).

[Figure: ALP benchmarks]

  • High Speed: Scans 44x faster than Gorilla, 64x faster than Chimp, and 31x faster than Zstd. Compresses 11x faster than Zstd, 138x faster than PDE, and 10x faster than Chimp.
  • High Compression: 50% better compression than Gorilla and 24% better than Chimp128; on par with Zstd level 3.
  • Adapts to Data: A two-stage algorithm first samples row groups and then individual vectors.
  • Scalar Code: Auto-vectorizes thanks to FastLanes.
  • Lightweight Encoding: Compression and decompression occur in blocks of 1024 values; ideal for columnar databases.
  • Proven Effectiveness: Its effectiveness and speed led to the deprecation of Chimp128 and Patas in DuckDB.
  • Works on Difficult Floats: Losslessly compresses even floats such as ML model parameters better than Zstd and all other encodings.

To rigorously benchmark ALP with your own data, we provide our ALP primitives as a single C++ header file.

To quickly test ALP, we recommend following our examples in the Quickstart guide or using it in DuckDB (note that ALP inside DuckDB is slower than using our primitives).

ALP details can be found in the publication.

Used By

DuckDB


ALP in a Nutshell

ALP has two compression schemes: ALP for doubles/floats that were originally decimals, and ALP_RD for true doubles/floats (e.g., those that stem from computations, scientific data, or ML weights).

ALP losslessly transforms doubles/floats into integers using two multiplications, then FOR+bit-packs them into only the necessary bits. It is a strongly enhanced version of PseudoDecimals.
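
A rough sketch of the idea follows (not the library's actual code: the per-vector exponent/factor pair is found by sampling, which is omitted here, and special values such as the sign of -0.0 need extra care in real code):

#include <cmath>
#include <cstdint>

// Sketch of the ALP transform for a single value. The (exponent, factor)
// pair is chosen per vector of 1024 values by sampling; here it is given.
// The real code replaces pow() with precomputed constants and works on
// whole vectors so the resulting integers can be FOR+bit-packed together.
bool alp_encode_value(double x, int exponent, int factor, int64_t &out) {
    // Two multiplications scale the decimal into an integer.
    double scaled = x * std::pow(10.0, exponent) * std::pow(10.0, -factor);
    int64_t candidate = std::llround(scaled);
    // Decode back and verify the round trip is bit-exact.
    double decoded = static_cast<double>(candidate)
                     * std::pow(10.0, factor) * std::pow(10.0, -exponent);
    if (decoded != x) {
        return false; // value becomes an exception (stored verbatim)
    }
    out = candidate;
    return true;
}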

ALP_RD splits each double/float bit representation into two parts (left and right). The left part is encoded with dictionary compression, and the right part is bit-packed into just the necessary bits.
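
Conceptually the split looks as follows (again a sketch; the cut point is chosen adaptively per row group, and the 48 right bits here are only illustrative):

#include <cstdint>
#include <cstring>

// Sketch of the ALP_RD split for one double. The number of right bits is
// chosen adaptively; 48 is used here only as an example.
constexpr unsigned kRightBits = 48;

void alprd_split(double x, uint16_t &left, uint64_t &right) {
    uint64_t bits;
    std::memcpy(&bits, &x, sizeof(bits));              // raw IEEE 754 bits
    left  = static_cast<uint16_t>(bits >> kRightBits); // few distinct values -> dictionary
    right = bits & ((uint64_t{1} << kRightBits) - 1);  // high entropy -> bit-packed
}

double alprd_merge(uint16_t left, uint64_t right) {
    uint64_t bits = (static_cast<uint64_t>(left) << kRightBits) | right;
    double x;
    std::memcpy(&x, &bits, sizeof(x));
    return x;
}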

Both encodings operate on vectors of 1024 values at a time (to fit vectorized execution) and leverage in-vector commonalities to achieve higher compression ratios and higher speed (by avoiding per-value adaptivity) than other methods.

Both encodings store outliers as exceptions to achieve higher compression ratios.
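
Patching exceptions after the fast decode path boils down to a loop like this (a sketch, not the library's code):

#include <cstddef>
#include <cstdint>

// Sketch of exception patching: after a vector is decoded with the fast
// path, the positions that failed the lossless round trip are overwritten
// with the verbatim values stored alongside the vector.
void patch_exceptions(double *decoded, const double *exceptions,
                      const uint16_t *positions, size_t exception_count) {
    for (size_t i = 0; i < exception_count; ++i) {
        decoded[positions[i]] = exceptions[i];
    }
}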

Quickstart

Usage examples are available under the example directory. There, we use a simple [de]compression API to store/read ALP data in/from memory.

Note that the [de]compression API used by these examples is only a naive wrapper around the real ALP core: the primitives.

Building and Running

Requirements:

  1. Clang++
  2. CMake 3.20 or higher

Building and running the simple compress example:

cmake -DALP_BUILD_EXAMPLE=ON .   # or set option in the CMakeLists.txt
cd example
make
./simple_compress

This will also build the ALP primitives.

ALP Primitives

You can build your own [de]compression API using the ALP primitives; an example of their usage can be found in our simple compression and decompression API. ALP's decoding primitives are auto-vectorized thanks to FastLanes. For benchmarking purposes, we recommend using these primitives directly.

You can use them by including our library in your code: #include "alp.hpp".
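
For orientation, the FOR+bit-packing analysis the primitives perform on each 1024-value vector boils down to the following (a plain scalar sketch; the library's actual kernels are FastLanes-generated and auto-vectorize):

#include <cstddef>
#include <cstdint>

// Scalar sketch of FOR (frame of reference) + bit-width selection for one
// vector of ALP-encoded integers (assumes n >= 1). The offsets are later
// packed into 'bit_width' bits each.
struct ForResult {
    int64_t base;      // minimum value, subtracted from every element
    uint8_t bit_width; // bits needed for the largest offset
};

ForResult analyze_for(const int64_t *in, size_t n, uint64_t *offsets) {
    int64_t base = in[0];
    for (size_t i = 1; i < n; ++i) {
        if (in[i] < base) base = in[i];
    }
    uint64_t max_offset = 0;
    for (size_t i = 0; i < n; ++i) {
        offsets[i] = static_cast<uint64_t>(in[i] - base);
        if (offsets[i] > max_offset) max_offset = offsets[i];
    }
    uint8_t bits = 0;
    while (max_offset > 0) { ++bits; max_offset >>= 1; }
    return {base, bits};
}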

The full documentation of the primitives is in the PRIMITIVES.md readme.

ALP in DuckDB

ALP replaced Chimp128 and Patas in DuckDB. Inside DuckDB, ALP decompresses 2-4x faster than Patas while achieving twice as high compression ratios (sometimes much more). DuckDB can be used to quickly test ALP on custom data; however, we advise against this if your purpose is to rigorously benchmark ALP against other algorithms.

Here you can find a basic example of how to load data into DuckDB while forcing ALP as the compression method. These statements can be issued through the Python API.
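
A minimal sketch via DuckDB's C++ client looks like this (the same statements work from the Python API; we assume your DuckDB version's force_compression pragma accepts 'alp', and my_doubles.csv is a placeholder for your data):

#include "duckdb.hpp"

int main() {
    duckdb::DuckDB db("alp_test.db"); // on-disk database so data is compressed
    duckdb::Connection con(db);
    // Assumption: force the ALP compression method for new data.
    con.Query("PRAGMA force_compression='alp';");
    con.Query("CREATE TABLE t AS SELECT * FROM read_csv_auto('my_doubles.csv');");
    con.Query("CHECKPOINT;"); // flush to storage so compression is applied
    auto result = con.Query("SELECT COUNT(*) FROM t;");
    result->Print();
}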

Please note that ALP inside DuckDB (i) is slower than using the primitives presented here, and (ii) can show slightly worse compression ratios due to the metadata needed to skip vectors and DuckDB's storage layout.

Benchmarking (Replicating Paper Experiments)

In BENCHMARKING.md we detail how to replicate the experiments and benchmarks presented in our publication.

On the benchmarked datasets from our publication:

  • ALP achieves on average 3x compression ratios (sometimes much higher).
  • ALP encodes on average 0.5 doubles per CPU cycle.
  • ALP decodes on average 2.6 doubles per CPU cycle.

On FCBench:

  • ALP achieves a compression ratio of 2.08, beating all other compressors.