Better long and float column compression #4080

Closed
leventov opened this issue Mar 19, 2017 · 23 comments

@leventov
Member

The idea was brought up here: #4027 (comment)

https://github.com/lemire/JavaFastPFOR
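
For reference, a minimal round-trip with the library looks roughly like this (following the usage pattern in the JavaFastPFOR README; details such as the buffer headroom and test data are illustrative):

import java.util.Arrays;
import me.lemire.integercompression.Composition;
import me.lemire.integercompression.FastPFOR;
import me.lemire.integercompression.IntWrapper;
import me.lemire.integercompression.IntegerCODEC;
import me.lemire.integercompression.VariableByte;

public class FastPforRoundTrip {
    public static void main(String[] args) {
        // FastPFOR operates on full blocks; VariableByte handles the leftover tail.
        IntegerCODEC codec = new Composition(new FastPFOR(), new VariableByte());

        int[] data = new int[100_000];
        for (int i = 0; i < data.length; i++) {
            data[i] = i % 4096; // small positive ints, like dictionary ids
        }

        int[] compressed = new int[data.length + 1024]; // headroom for incompressible input
        IntWrapper inPos = new IntWrapper(0);
        IntWrapper outPos = new IntWrapper(0);
        codec.compress(data, inPos, data.length, compressed, outPos);
        compressed = Arrays.copyOf(compressed, outPos.get()); // trim to actual size

        int[] recovered = new int[data.length];
        codec.uncompress(compressed, new IntWrapper(0), compressed.length, recovered, new IntWrapper(0));
        System.out.println(Arrays.equals(data, recovered) + ", compressed to " + compressed.length + " ints");
    }
}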

@leventov
Member Author

leventov commented Mar 19, 2017

A good time-series compression algorithm is also described here: http://www.vldb.org/pvldb/vol8/p1816-teller.pdf (the Gorilla paper)
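
For a rough sense of the idea (a simplified sketch of the paper's XOR-based value compression, not the exact bit format): consecutive doubles are XORed, and only the meaningful bits of the XOR need to be stored.

public class GorillaXorSketch {
    public static void main(String[] args) {
        double[] series = {12.0, 12.0, 24.0, 15.0, 12.0};
        long prevBits = Double.doubleToLongBits(series[0]);
        System.out.printf("first value stored verbatim: %016x%n", prevBits);
        for (int i = 1; i < series.length; i++) {
            long bits = Double.doubleToLongBits(series[i]);
            long xor = bits ^ prevBits;
            if (xor == 0) {
                System.out.println("identical to previous: encode as a single '0' bit");
            } else {
                int leading = Long.numberOfLeadingZeros(xor);
                int trailing = Long.numberOfTrailingZeros(xor);
                int meaningful = 64 - leading - trailing;
                // Gorilla stores the leading-zero count, the meaningful length,
                // and only the meaningful bits (or reuses the previous window).
                System.out.printf("xor=%016x -> %d meaningful bits (lead %d, trail %d)%n",
                        xor, meaningful, leading, trailing);
            }
            prevBits = bits;
        }
    }
}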

@leventov leventov changed the title from "Better long and string (indexed) column compression" to "Better long and float column compression" Jun 23, 2017
@leventov
Member Author

@gianm
Contributor

gianm commented Feb 5, 2018

I ran some benchmarks on a Druid column format based on JavaFastPFOR today and got the following sizes and timings. The dataset was ints with randomness uniformly distributed over some number of bits, and the other bits left as zero (emulating string columns with varying cardinalities). I did three scans: at 10%, 95%, and 100% selectivity.

Some notes:

  1. "uncompressed" means VSizeColumnarInts (byte packing). "compressed" means byte packing + LZ4 with blocksize of 64KB (however many byte-packed ints fit in). "fastPfor" means JavaFastPFOR's FastPFOR impl with blocks of 16384 ints. "binaryPacked" means JavaFastPFOR's BinaryPacked impl (bit-packing, I assume) also with blocks of 16384 ints.
  2. The FastPFOR implementations both have unnecessary copies on the decompression path: the library accepts compressed data as int[], but we really have a ByteBuffer. So modifying the library to accept ByteBuffer should improve performance somewhat (a sketch of this copy appears after this list).
  3. LZ4 was not able to compress this data. It got a little larger, indicating LZ4 probably just stored the data as-is, with some header indicating such. This wasn't surprising for random-bits settings close to a whole byte, since in that case it would have been trying to compress whole bytes of random bits. But I was surprised that it couldn't compress well with, for example, 11 random bits. On real-world data, LZ4 may do somewhat better than this, since many columns are power-law distributed rather than uniformly distributed, which might give the algorithm more redundancy to work with.
  4. FastPFOR did a good job compressing, especially on numbers of bits far from a whole byte. This makes sense since it does bit packing rather than byte packing. It always got the best data size.
  5. FastPFOR and BinaryPacking beat the other schemes on scan rate of the whole dataset, by about 10–30%. On the 95% selectivity scan it was about even (seemingly mostly because of the added overhead of walking the filter bitset). On the 10% selectivity scan they performed the worst, about 70% slower. This is possibly because their work is amortized over fewer rows (the implementations must unpack an entire block of 16384 ints to get even one, whereas other implementations may need to LZ4-decompress an entire block, but only need to byte-unpack the specific rows being read). I would expect this all to improve somewhat if the unnecessary copy is eliminated.
  6. Both FastPFOR and BinaryPacking could operate effectively in smaller blocks than the 16384-int blocks we used, potentially improving performance when the filter is more selective. FastPFOR's docs say the blocksize should be greater than 1024. BinaryPacking says 32 is OK.
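
The copy mentioned in note 2 looks roughly like the following (an illustrative sketch, not the actual benchmark code; decompressBlock is a hypothetical helper and the little-endian byte order is an assumption). A block read from the mapped ByteBuffer has to be copied into a scratch int[] before JavaFastPFOR can decode it:

import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import me.lemire.integercompression.Composition;
import me.lemire.integercompression.FastPFOR;
import me.lemire.integercompression.IntWrapper;
import me.lemire.integercompression.IntegerCODEC;
import me.lemire.integercompression.VariableByte;

final class FastPforBlockReader {
    private static final int BLOCK_SIZE = 16384;
    private final IntegerCODEC codec = new Composition(new FastPFOR(), new VariableByte());
    private final int[] scratch = new int[BLOCK_SIZE + 1024];

    // Decode one block of BLOCK_SIZE ints from a mapped buffer.
    int[] decompressBlock(ByteBuffer compressedBlock, int numCompressedInts) {
        // The copy we would like to eliminate: ByteBuffer -> int[].
        compressedBlock.order(ByteOrder.LITTLE_ENDIAN).asIntBuffer().get(scratch, 0, numCompressedInts);
        int[] out = new int[BLOCK_SIZE];
        codec.uncompress(scratch, new IntWrapper(0), numCompressedInts, out, new IntWrapper(0));
        return out;
    }
}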

Follow ups needed:

  • Test with real world data, or at least distributions other than uniform
  • Remove the unnecessary copy
Benchmark                                     (bits)  (filteredRowCount)   (rows)  Mode  Cnt      Score      Error  Units
CompressedColumnarIntsBenchmark.compressed         7              300000  3000000  avgt   20   2943.665 ±   68.762  us/op
CompressedColumnarIntsBenchmark.compressed         7             2850000  3000000  avgt   20  20766.045 ±  501.196  us/op
CompressedColumnarIntsBenchmark.compressed         7             3000000  3000000  avgt   20  15801.290 ±  349.125  us/op
CompressedColumnarIntsBenchmark.compressed        11              300000  3000000  avgt   20   3321.190 ±   74.982  us/op
CompressedColumnarIntsBenchmark.compressed        11             2850000  3000000  avgt   20  20621.893 ±  607.840  us/op
CompressedColumnarIntsBenchmark.compressed        11             3000000  3000000  avgt   20  14974.348 ±  450.959  us/op
CompressedColumnarIntsBenchmark.compressed        15              300000  3000000  avgt   20   3021.729 ±  103.790  us/op
CompressedColumnarIntsBenchmark.compressed        15             2850000  3000000  avgt   20  20802.272 ± 1672.917  us/op
CompressedColumnarIntsBenchmark.compressed        15             3000000  3000000  avgt   20  14651.987 ±  378.612  us/op
CompressedColumnarIntsBenchmark.compressed        21              300000  3000000  avgt   20   3706.261 ±  119.774  us/op
CompressedColumnarIntsBenchmark.compressed        21             2850000  3000000  avgt   20  23446.937 ±  716.861  us/op
CompressedColumnarIntsBenchmark.compressed        21             3000000  3000000  avgt   20  20023.993 ±  692.204  us/op
CompressedColumnarIntsBenchmark.fastPfor           7              300000  3000000  avgt   20   4882.731 ±  163.518  us/op
CompressedColumnarIntsBenchmark.fastPfor           7             2850000  3000000  avgt   20  21581.873 ±  646.544  us/op
CompressedColumnarIntsBenchmark.fastPfor           7             3000000  3000000  avgt   20  13632.693 ±  546.829  us/op
CompressedColumnarIntsBenchmark.fastPfor          11              300000  3000000  avgt   20   5583.134 ±  162.427  us/op
CompressedColumnarIntsBenchmark.fastPfor          11             2850000  3000000  avgt   20  21851.592 ±  861.589  us/op
CompressedColumnarIntsBenchmark.fastPfor          11             3000000  3000000  avgt   20  14563.488 ±  586.785  us/op
CompressedColumnarIntsBenchmark.fastPfor          15              300000  3000000  avgt   20   6340.217 ±  257.035  us/op
CompressedColumnarIntsBenchmark.fastPfor          15             2850000  3000000  avgt   20  23250.500 ± 1115.758  us/op
CompressedColumnarIntsBenchmark.fastPfor          15             3000000  3000000  avgt   20  14999.228 ± 1055.155  us/op
CompressedColumnarIntsBenchmark.fastPfor          21              300000  3000000  avgt   20   7450.851 ±  122.857  us/op
CompressedColumnarIntsBenchmark.fastPfor          21             2850000  3000000  avgt   20  24930.446 ±  621.486  us/op
CompressedColumnarIntsBenchmark.fastPfor          21             3000000  3000000  avgt   20  16123.002 ±  481.372  us/op
CompressedColumnarIntsBenchmark.binaryPacked       7              300000  3000000  avgt   20   5151.170 ±  145.566  us/op
CompressedColumnarIntsBenchmark.binaryPacked       7             2850000  3000000  avgt   20  21944.219 ±  700.875  us/op
CompressedColumnarIntsBenchmark.binaryPacked       7             3000000  3000000  avgt   20  13916.071 ±  222.431  us/op
CompressedColumnarIntsBenchmark.binaryPacked      11              300000  3000000  avgt   20   5742.023 ±  176.652  us/op
CompressedColumnarIntsBenchmark.binaryPacked      11             2850000  3000000  avgt   20  22311.412 ±  701.435  us/op
CompressedColumnarIntsBenchmark.binaryPacked      11             3000000  3000000  avgt   20  13973.136 ±  491.661  us/op
CompressedColumnarIntsBenchmark.binaryPacked      15              300000  3000000  avgt   20   6460.735 ±  138.532  us/op
CompressedColumnarIntsBenchmark.binaryPacked      15             2850000  3000000  avgt   20  23771.121 ±  619.867  us/op
CompressedColumnarIntsBenchmark.binaryPacked      15             3000000  3000000  avgt   20  15358.863 ±  524.427  us/op
CompressedColumnarIntsBenchmark.binaryPacked      21              300000  3000000  avgt   20   7229.078 ±  237.369  us/op
CompressedColumnarIntsBenchmark.binaryPacked      21             2850000  3000000  avgt   20  23778.470 ±  755.032  us/op
CompressedColumnarIntsBenchmark.binaryPacked      21             3000000  3000000  avgt   20  15788.168 ±  393.521  us/op
CompressedColumnarIntsBenchmark.uncompressed       7              300000  3000000  avgt   20   2837.838 ±   98.147  us/op
CompressedColumnarIntsBenchmark.uncompressed       7             2850000  3000000  avgt   20  22704.229 ±  885.314  us/op
CompressedColumnarIntsBenchmark.uncompressed       7             3000000  3000000  avgt   20  17371.661 ±  421.292  us/op
CompressedColumnarIntsBenchmark.uncompressed      11              300000  3000000  avgt   20   2889.088 ±   79.596  us/op
CompressedColumnarIntsBenchmark.uncompressed      11             2850000  3000000  avgt   20  23247.499 ±  606.357  us/op
CompressedColumnarIntsBenchmark.uncompressed      11             3000000  3000000  avgt   20  17543.058 ±  314.034  us/op
CompressedColumnarIntsBenchmark.uncompressed      15              300000  3000000  avgt   20   2845.691 ±   52.430  us/op
CompressedColumnarIntsBenchmark.uncompressed      15             2850000  3000000  avgt   20  22749.856 ±  707.239  us/op
CompressedColumnarIntsBenchmark.uncompressed      15             3000000  3000000  avgt   20  17000.533 ±  673.675  us/op
CompressedColumnarIntsBenchmark.uncompressed      21              300000  3000000  avgt   20   2951.205 ±  102.273  us/op
CompressedColumnarIntsBenchmark.uncompressed      21             2850000  3000000  avgt   20  22556.333 ± 1121.255  us/op
CompressedColumnarIntsBenchmark.uncompressed      21             3000000  3000000  avgt   20  16884.971 ±  357.128  us/op

7 bits:
compressed size = 3012020
uncompressed size = 3000009
fastPfor size = 2651425
binaryPacked size = 2719513

11 bits:
compressed size = 6011812
uncompressed size = 6000008
fastPfor size = 4151509
binaryPacked size = 4219513

17 bits:
compressed size = 6024346
uncompressed size = 6000008
fastPfor size = 5651525
binaryPacked size = 5719513

21 bits:
compressed size = 9036750
uncompressed size = 9000007
fastPfor size = 7901473
binaryPacked size = 7969513

@gianm
Contributor

gianm commented Feb 5, 2018

/cc @lemire (as fyi; not that I am demanding anything of you 😄)

@gianm
Contributor

gianm commented Feb 5, 2018

Btw, we already implemented better long (int64) compression about a year ago, although it is still off by default. It uses fixed-length bit packing, so it supports relatively random access. Maybe it's time to turn it on and disable LZ4 by default.

Some sizings and timings are in https://imply.io/post/2016/12/07/compressing-longs.html.

(Here we did test more distributions, and found that for uniform distributions, bit packing of deltas compresses smaller than LZ4, while for power-law distributions LZ4 is smaller. In all cases, bit packing is fastest, since the old implementation of int64 columns didn't even do byte packing.)
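
To illustrate why fixed-length bit packing of deltas keeps random access cheap (a minimal sketch of the idea, not the column format from that post): every value is stored as (value - min) in a fixed number of bits, so row i can be located with simple arithmetic.

final class DeltaBitPackSketch {
    private final long base;          // minimum value of the block
    private final int bitsPerValue;   // bits needed to represent (max - min)
    private final long[] packed;

    DeltaBitPackSketch(long[] values) {
        long min = Long.MAX_VALUE, max = Long.MIN_VALUE;
        for (long v : values) { min = Math.min(min, v); max = Math.max(max, v); }
        base = min;
        long range = max - min;
        bitsPerValue = Math.max(1, 64 - Long.numberOfLeadingZeros(range));
        packed = new long[(values.length * bitsPerValue + 63) / 64];
        for (int i = 0; i < values.length; i++) {
            write(i, values[i] - base);
        }
    }

    private void write(int index, long delta) {
        long bitPos = (long) index * bitsPerValue;
        int word = (int) (bitPos >>> 6), shift = (int) (bitPos & 63);
        packed[word] |= delta << shift;
        if (shift + bitsPerValue > 64) {          // delta straddles two words
            packed[word + 1] |= delta >>> (64 - shift);
        }
    }

    long get(int index) {                         // random access by row number
        long bitPos = (long) index * bitsPerValue;
        int word = (int) (bitPos >>> 6), shift = (int) (bitPos & 63);
        long bits = packed[word] >>> shift;
        if (shift + bitsPerValue > 64) {
            bits |= packed[word + 1] << (64 - shift);
        }
        long mask = bitsPerValue == 64 ? -1L : (1L << bitsPerValue) - 1;
        return base + (bits & mask);
    }
}

For example, the three values 1000, 1003, 1001 need only 2 bits each after subtracting the minimum, and get(2) still returns 1001 without touching the other rows.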

@lemire

lemire commented Feb 5, 2018

@gianm

I'm available and interested in helping out.

Note: This could end up generating an interesting paper if you guys care about documenting the work. I'd love to help make this happen.

email: lemire@gmail.com

@gianm
Contributor

gianm commented Feb 5, 2018

@lemire Thanks for the offer. At this point I have one main question: is your JavaFastPFOR library still the best freely-licensed, small-ish-integer compression library for Java that you are aware of? (I'm looking to compress the offset section of string columns; in Druid we expect these to be positive ints whose max value is the cardinality of the column -- which is never more than a few million and is often less.)

Another note: currently Druid can access values in int columns randomly, since we pack into a fixed number of bytes per value (whatever is needed based on the max value for the overall column). Random access isn't supported by JavaFastPFOR, and I suspect that contributes to its relative slowness on the 10% selectivity test. I would like to run some tests with fixed-length bit packing (which could support random access, or at least much smaller blocks) to see how it fares on various selectivities vs. FastPFOR.
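
As a rough illustration of that byte-packed layout (a sketch of the idea behind VSizeColumnarInts, not its actual code; the big-endian byte order here is an assumption): each value occupies the same number of bytes, so reading row i is just an offset computation into the mapped buffer.

import java.nio.ByteBuffer;

final class BytePackedIntsSketch {
    private final ByteBuffer buffer;     // packed values, e.g. from a mapped segment
    private final int numBytesPerValue;  // 1..4, chosen from the column's max value

    BytePackedIntsSketch(ByteBuffer buffer, int numBytesPerValue) {
        this.buffer = buffer;
        this.numBytesPerValue = numBytesPerValue;
    }

    // Random access: no block needs to be decompressed to read a single row.
    int get(int index) {
        int offset = index * numBytesPerValue;
        int value = 0;
        for (int i = 0; i < numBytesPerValue; i++) {
            value = (value << 8) | (buffer.get(offset + i) & 0xFF);
        }
        return value;
    }
}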

@gianm
Contributor

gianm commented Feb 5, 2018

I edited the original comment to include BinaryPacking too (the paper says it's fixed-width bit packing). It looks similar to FastPFOR at the block sizes I'm currently testing.

@lemire

lemire commented Feb 6, 2018

is your JavaFastPFOR library still the best freely-licensed, small-ish-integer compression library for Java that you are aware of?

I'm not aware of anything else that is as good... ;-)

But I did not join the conversation to promote JavaFastPFOR specifically. We wrote JavaFastPFOR so that people could have access to a collection of tricks that work well. But what the Parquet folks did was to grab whatever code they needed from JavaFastPFOR and adapt it... and I am totally fine with that!

Another note: currently Druid can access values in int columns randomly, since we pack into a fixed number of bytes per value (whatever is needed based on the max value for the overall column). Random access isn't supported by JavaFastPFOR, and I suspect that contributes to its relative slowness on the 10% selectivity test. I would like to run some tests with fixed-length bit packing (which could support random access, or at least much smaller blocks) to see how it fares on various selectivities vs. FastPFOR.

Ah. Ah.

It is true that we have not implemented random access in JavaFastPFOR, but we did in Upscaledb (also open source, but not in Java):

Daniel Lemire, Christoph Rupp, Upscaledb: Efficient Integer-Key Compression in a Key-Value Store using SIMD Instructions, Information Systems Volume 66, June 2017, Pages 13-23 https://lemire.me/en/publication/arxiv1611.05428/

You'll say: "Wait, Daniel! We don't care about SIMD instructions." And I'll say: look again. In that paper, the best performance was with SIMD-based binary packing, but we got very good speed with straight binary packing, which ought to be quite fast (even in Java). We have C++ code there that does things like fast select...

https://github.com/lemire/SIMDCompressionAndIntersection/blob/5649b9f39cd88c7b05b39fa29478cca16e14e8c8/src/benchsearch.cpp

Doing the equivalent in Java is not hard.

By "not hard", I mean that if there is a potential market for it, I'll volunteer to contribute code. And be happy to do so. (And that means, producing open source stuff, naturally.)

I edited the original comment to include BinaryPacking too (the paper says it's fixed-width bit packing). It looks similar to FastPFOR at the block sizes I'm currently testing.

Given a choice, I favor binary packing because it is really fun to engineer: you can easily write special-purpose functions that act directly on the compressed data without decompressing it. See the above paper.

@lemire

lemire commented Feb 6, 2018

In case you don't know about Upscaledb: https://upscaledb.com

cc @cruppstahl

@b-slim
Contributor

b-slim commented Feb 15, 2018

@gianm I am also wondering whether we are taking advantage of the fact that the array of ints is sorted with known min and max; I guess that should help.

@gianm
Contributor

gianm commented Feb 15, 2018

@b-slim Binary packing definitely takes advantage of the fact that the mins and maxes are known. I didn't pick encodings that take advantage of sortedness, however, because in general we can only expect at most one dimension's values to be sorted (the first one, and only if segmentGranularity = queryGranularity). In some cases this does happen and would be a useful optimization, but I wasn't worrying about it at this point.

We can always expect __time to be sorted; however, it's int64, not int32, and I was only looking at int32 compression recently. But for __time I'm hoping we can soon switch the default to one of the strategies from https://imply.io/post/2016/12/07/compressing-longs.html. I'd expect both delta and table encoding to work very well on __time columns (table for coarse granularities and delta for fine ones).
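
As a sketch of what table encoding means here (illustrative only, not the implementation from that post): with a coarse granularity there are few distinct timestamps, so each row can store a small index into a table of distinct values, and those indexes can themselves be bit packed.

import java.util.LinkedHashMap;
import java.util.Map;

final class TimeTableEncodingSketch {
    static void encode(long[] timestamps) {
        Map<Long, Integer> table = new LinkedHashMap<>(); // distinct timestamps -> small ids
        int[] rowIndexes = new int[timestamps.length];    // per-row index into the table
        for (int row = 0; row < timestamps.length; row++) {
            Integer id = table.get(timestamps[row]);
            if (id == null) {
                id = table.size();
                table.put(timestamps[row], id);
            }
            rowIndexes[row] = id;
        }
        // With DAY granularity and a year of data there are at most ~366 distinct values,
        // so each row index fits in 9 bits instead of 64 bits for the raw long.
        System.out.println("distinct timestamps: " + table.size());
    }
}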

@stale

stale bot commented Jun 21, 2019

This issue has been marked as stale due to 280 days of inactivity. It will be closed in 2 weeks if no further activity occurs. If this issue is still relevant, please simply write any comment. Even if closed, you can still revive the issue at any time or discuss it on the dev@druid.apache.org list. Thank you for your contributions.

@stale stale bot added the stale label Jun 21, 2019
@lemire

lemire commented Jun 21, 2019

Again... no interest in this? The potential is huge.

@stale

stale bot commented Jun 21, 2019

This issue is no longer marked as stale.

@stale stale bot removed the stale label Jun 21, 2019
@clintropolis
Member

@lemire I am still very interested in pushing forward on this, and I agree with you on the implications. I've been stalling on making any progress on this issue or with my experiments until #6794 is merged, mainly because vectorization of the rest of the query processing engine changes the dynamics a bit at the compression layer and I want to make sure I am also targeting that optimally since I think it's probably the future of the query engine. But as soon as it goes in, I definitely plan to pick this back up!

@lemire

lemire commented Jun 26, 2019

Ok. Feel free to get in touch with me when you get to it, if you want.

@CmdrKeen

@lemire @clintropolis - I'm interested :) and #6794 is now merged, too. Is there still an appetite to look into this?

@lemire

lemire commented Jun 22, 2020

@CmdrKeen I can't lead this, but I sure could help with pointers and reviews.

@clintropolis
Member

Hi @CmdrKeen,
Apologies in advance for the wall of text.

#6016 captures much of my experimentation and findings on this so far. For integer compression of dictionary-encoded string columns, FastPFOR is a very solid improvement in many cases, but it was not always better across the board for some value distributions. However, the strategy taken in #6016, which uses it as part of an approach that attempts to find the "best" encoding for a given set of values at ingestion time, has, I feel, at least been shown to be viable in that PR.

The main stalling point I ran into was when I wired up the C++ implementation of FastPFOR to Druid through JNI and discovered that the C++ and Java implementations are not actually compatible (and don't claim to be; I was just experimenting to see if I could make it go faster). The performance increase of the C++ version makes it worth using, but it is also necessary to have a pure-Java fallback for cases where the compiled native version is not available for a platform. I also think the 'native' version of my branch in that PR is a somewhat cleaner implementation, since it deals in ByteBuffer/memory locations rather than on-heap arrays like the branch associated with #6016 itself (see clintropolis/druid@shapeshift...clintropolis:shapeshift-native for the JNI branch differences).

I've been intending to pick this back up for like... over a year now, but I just haven't been able to prioritize it to bring it to the finish line. I did "dust off" the branches and start testing stuff a few months ago, but haven't pushed my changes to fix conflicts and to add vectorization support after #6794 or really had the chance to make any material progress on finishing it.

I think the next step is creating a version of FastPFOR in pure Java that is compatible with the C++ version, and that preferably reads from and writes to little-endian ByteBuffers from the mapped segments, so that the simpler interfaces defined in the 'native' version of the branch can be used. Besides letting us use the faster JNI/native version when available (which is how lz4-java currently behaves), this also avoids the additional heap memory footprint that the version in the PR has.
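
To make the goal concrete (names here are purely illustrative, not from the linked branches): the idea is a small decode interface that both the pure-Java port and a JNI-backed native implementation could satisfy, working directly on ByteBuffers so no on-heap int[] copies are needed.

import java.nio.ByteBuffer;

interface CompressedIntBlockDecoder {
    // Decode numValues ints from 'compressed' (positioned at the start of a block,
    // little-endian) into 'out'; returns the number of compressed bytes consumed.
    // A runtime check would pick the JNI-backed implementation when the native
    // library is available for the platform, falling back to pure Java otherwise,
    // analogous to how lz4-java behaves.
    int decode(ByteBuffer compressed, ByteBuffer out, int numValues);
}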

In lemire/FastPFor#42, @lemire has suggested that making a compatible version is non-trivial but might not take more than a few days or a week or so. My problem has been finding a continuous chunk of time to devote to fully understanding the C++ implementation and then of course doing the compatible Java implementation.

Much of the benchmarking will need to be repeated against the latest version of Druid, with a heavier focus on vectorized reads, though I was already benchmarking against #6794 while it was in progress and my branch was competitive at the time, so many of the measurements might still be close to true.

The other part that has given me some pause more recently is thinking about how potentially more of Druid's low-level column reads could be pushed into native code, beyond just decompressing chunks of values, and what that might look like and how it might change the implementation of what has been done in the prototype thus far (if at all). I expect the strategy and physical column layout itself wouldn't really change, just that a lot more of the code implemented in Java right now would be pushed to native code, so this might not be worth worrying too much about.

Lastly, thinking further about native processing, there are probably some unresolved questions about whether the new column format suggested in my PR (or any larger segment format change) would be worth further modifying so that data in the segment is well aligned for vectorized CPU instructions. This might not be an issue worth considering at all; I mostly want to make sure, because segment format changes are among the riskiest changes to make. Once released, they must remain supported for backwards compatibility for very long cycles, so it feels very important to get any new stuff right.

Anyway, thanks for bringing this issue back to my attention and starting the discussion. Thinking about it to respond to your question has given me some motivation to... well, at minimum keep thinking about it a bit more, but maybe we can actually get this moving again 😅.

@CmdrKeen

For what it's worth, I appreciated the wall of text @clintropolis :) It helps provide insight!

No pressure, just always excited by efficiency gains! 👍

@github-actions

This issue has been marked as stale due to 280 days of inactivity.
It will be closed in 4 weeks if no further activity occurs. If this issue is still
relevant, please simply write any comment. Even if closed, you can still revive the
issue at any time or discuss it on the dev@druid.apache.org list.
Thank you for your contributions.

@github-actions github-actions bot added the stale label May 31, 2023
@github-actions

This issue has been closed due to lack of activity. If you think that
is incorrect, or the issue requires additional review, you can revive the issue at
any time.

@github-actions github-actions bot closed this as not planned Jun 28, 2023