
vectorize 'auto' long decoding #11004

Merged 11 commits on Mar 27, 2021

Conversation

clintropolis (Member) commented Mar 17, 2021

Description

This PR specializes LongDeserializer implementations with getDelta and getTable methods to push down unpacking bits, delta encoding adjustment, and table lookups for vectors of data as far as possible and work more efficiently with vectorized query engines, primarily focusing on contiguous reads.

It works by unrolling value reads to line up with ByteBuffer get methods, eliminating overlapping reads where possible by reading blocks of 8 values at a time for un-aligned bit widths (1, 2, 4, 12, 20, 24, 40, 48, 56). ByteBuffer-aligned bit-packing widths (8, 16, 32, 64) actually performed worse when this same unrolling was applied, so those instead use a traditional for loop with the aligned get methods.

Most of the improvement is at the small bit widths, since they have the most redundant/overlapping memory accesses, and the improvement there is pretty decent.
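To make the unrolling idea concrete, here is a minimal standalone sketch (hypothetical illustration, not the actual Druid deserializer code) of the 4-bit case: eight packed 4-bit values fit in a single 32-bit read, so one ByteBuffer.getInt replaces eight overlapping per-value reads.

```java
import java.nio.ByteBuffer;

public class Unpack4Bit
{
  // Eight 4-bit values packed big-endian fit in one 32-bit read; each value
  // is recovered with a shift and a 0xF mask.
  public static void unpack8(ByteBuffer buffer, int offset, long[] out, int outPosition)
  {
    final int packed = buffer.getInt(offset);
    out[outPosition]     = (packed >>> 28) & 0xF;
    out[outPosition + 1] = (packed >>> 24) & 0xF;
    out[outPosition + 2] = (packed >>> 20) & 0xF;
    out[outPosition + 3] = (packed >>> 16) & 0xF;
    out[outPosition + 4] = (packed >>> 12) & 0xF;
    out[outPosition + 5] = (packed >>> 8) & 0xF;
    out[outPosition + 6] = (packed >>> 4) & 0xF;
    out[outPosition + 7] = packed & 0xF;
  }
}
```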

Full column scans before/after on uniform distribution columns of varying bits to cover the entire set of value decoders

before:

Benchmark                                                                           (distribution)  (encoding)  (filteredRowCountPercentage)   (rows)  (zeroProbability)  Mode  Cnt         Score      Error  Units
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                       uniform-1    lz4-auto                           1.0  5000000                0.0  avgt    5     23862.994 ± 1944.514  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                       uniform-2    lz4-auto                           1.0  5000000                0.0  avgt    5     23567.778 ±  757.632  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                       uniform-4    lz4-auto                           1.0  5000000                0.0  avgt    5     28826.685 ± 3260.379  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                       uniform-8    lz4-auto                           1.0  5000000                0.0  avgt    5     19829.688 ±  923.047  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                      uniform-12    lz4-auto                           1.0  5000000                0.0  avgt    5     32419.941 ± 1486.951  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                      uniform-16    lz4-auto                           1.0  5000000                0.0  avgt    5     22460.206 ± 4207.942  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                      uniform-20    lz4-auto                           1.0  5000000                0.0  avgt    5     31415.040 ± 3867.699  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                      uniform-24    lz4-auto                           1.0  5000000                0.0  avgt    5     24444.519 ± 2472.537  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                      uinform-32    lz4-auto                           1.0  5000000                0.0  avgt    5     18031.908 ± 1471.220  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                      uniform-40    lz4-auto                           1.0  5000000                0.0  avgt    5     22442.611 ± 1778.178  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                      uniform-48    lz4-auto                           1.0  5000000                0.0  avgt    5     24733.015 ± 3076.896  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                      uniform-56    lz4-auto                           1.0  5000000                0.0  avgt    5     23201.264 ± 1578.009  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                      uniform-64    lz4-auto                           1.0  5000000                0.0  avgt    5     20581.399 ± 1269.190  us/op

after:

Benchmark                                                                           (distribution)  (encoding)  (filteredRowCountPercentage)   (rows)  (zeroProbability)  Mode  Cnt         Score      Error  Units
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                       uniform-1    lz4-auto                           1.0  5000000                0.0  avgt    5     17892.362 ±  102.776  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                       uniform-2    lz4-auto                           1.0  5000000                0.0  avgt    5     17796.103 ±  417.847  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                       uniform-4    lz4-auto                           1.0  5000000                0.0  avgt    5     17879.496 ±  237.066  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                       uniform-8    lz4-auto                           1.0  5000000                0.0  avgt    5     17508.260 ±  560.856  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                      uniform-12    lz4-auto                           1.0  5000000                0.0  avgt    5     18272.440 ±   71.751  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                      uniform-16    lz4-auto                           1.0  5000000                0.0  avgt    5     19042.292 ±  595.685  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                      uniform-20    lz4-auto                           1.0  5000000                0.0  avgt    5     18782.746 ±  248.738  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                      uniform-24    lz4-auto                           1.0  5000000                0.0  avgt    5     19048.354 ±  160.025  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                      uinform-32    lz4-auto                           1.0  5000000                0.0  avgt    5     17984.778 ±  633.691  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                      uniform-40    lz4-auto                           1.0  5000000                0.0  avgt    5     22070.007 ±  166.035  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                      uniform-48    lz4-auto                           1.0  5000000                0.0  avgt    5     22052.517 ± 2168.763  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                      uniform-56    lz4-auto                           1.0  5000000                0.0  avgt    5     25001.739 ±  259.184  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                      uniform-64    lz4-auto                           1.0  5000000                0.0  avgt    5     20325.369 ±   60.724  us/op

Full column scans on other value distribution before/after comparison

before:

Benchmark                                                                           (distribution)  (encoding)  (filteredRowCountPercentage)   (rows)  (zeroProbability)  Mode  Cnt         Score      Error  Units
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                  enumerated-0-1    lz4-auto                           1.0  5000000                0.0  avgt    5     24186.788 ± 1440.692  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                 enumerated-full    lz4-auto                           1.0  5000000                0.0  avgt    5     27220.865 ± 2221.740  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                     normal-1-32    lz4-auto                           1.0  5000000                0.0  avgt    5     19490.250 ± 1231.429  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                  normal-40-1000    lz4-auto                           1.0  5000000                0.0  avgt    5     20701.792 ± 1898.845  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                 sequential-1000    lz4-auto                           1.0  5000000                0.0  avgt    5     28977.482 ±  878.240  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized               sequential-unique    lz4-auto                           1.0  5000000                0.0  avgt    5     22856.419 ±  247.786  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                    zipf-low-100    lz4-auto                           1.0  5000000                0.0  avgt    5     19135.244 ±  437.260  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                 zipf-low-100000    lz4-auto                           1.0  5000000                0.0  avgt    5     31696.149 ± 2933.263  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                 zipf-low-32-bit    lz4-auto                           1.0  5000000                0.0  avgt    5     25815.759 ± 2073.480  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                   zipf-high-100    lz4-auto                           1.0  5000000                0.0  avgt    5     20582.544 ± 1726.414  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                zipf-high-100000    lz4-auto                           1.0  5000000                0.0  avgt    5     18996.105 ± 1412.307  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                zipf-high-32-bit    lz4-auto                           1.0  5000000                0.0  avgt    5     18628.002 ±  450.125  us/op

after:

ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                  enumerated-0-1    lz4-auto                           1.0  5000000                0.0  avgt    5     19250.882 ± 1221.067  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                 enumerated-full    lz4-auto                           1.0  5000000                0.0  avgt    5     19853.160 ±  780.701  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                     normal-1-32    lz4-auto                           1.0  5000000                0.0  avgt    5     18414.646 ± 1705.926  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                  normal-40-1000    lz4-auto                           1.0  5000000                0.0  avgt    5     20054.183 ± 2539.912  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                 sequential-1000    lz4-auto                           1.0  5000000                0.0  avgt    5     18585.123 ± 1524.581  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized               sequential-unique    lz4-auto                           1.0  5000000                0.0  avgt    5     19304.239 ±  808.152  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                    zipf-low-100    lz4-auto                           1.0  5000000                0.0  avgt    5     19248.088 ±  470.994  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                 zipf-low-100000    lz4-auto                           1.0  5000000                0.0  avgt    5     23815.318 ± 4413.581  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                 zipf-low-32-bit    lz4-auto                           1.0  5000000                0.0  avgt    5     26111.974 ± 4594.582  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                   zipf-high-100    lz4-auto                           1.0  5000000                0.0  avgt    5     20144.349 ±  489.978  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                zipf-high-100000    lz4-auto                           1.0  5000000                0.0  avgt    5     18956.415 ±  796.136  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                zipf-high-32-bit    lz4-auto                           1.0  5000000                0.0  avgt    5     18666.583 ±  385.407  us/op

Full benchmarks on column scan value selects with simulated filters (up to full scan)

For the top line graph, the x axis is the percent of rows selected by the column scan (offset analog) and goes up to 1.0, a contiguous scan. The y axis is the amount of time it took to select that many values. The numbers are kind of noisy because I didn't spend the multiple days necessary to do it properly, but it gives a decent idea. Nearly all of the improvement is on the full-scan end of the spectrum (last datapoint), because the underlying BitmapVectorOffset never calls the contiguous vectorized get methods (which I will fix in a follow-up PR), and the contiguous gets were the primary area of improvement in this PR. The bottom bar chart is the size of the column in bytes as it would be stored in a segment (encoded and/or compressed).

(Before/after comparison charts attached as an image.)

In a follow-up PR I will probably argue for making this the default long encoding, since it does pretty well with the vectorized engine compared to lz4 longs.


Unpacking

I drew pictures to help me get the shifts and masking correct, so I'll add them here in case they are any help to reviewers. Each diagram covers one bit width and shows the 8 values unpacked per loop iteration, where each color is a separate value and each box is 4 bits.

(Five diagrams attached as images, one per un-aligned bit width.)
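As a worked example of what the diagrams encode, here is a hypothetical standalone sketch (not the actual Druid code) of the 12-bit case: 8 values occupy 96 bits, which can be read as one long plus one int, with the value straddling the 64-bit boundary stitched together from both reads.

```java
import java.nio.ByteBuffer;

public class Unpack12Bit
{
  // Eight 12-bit values = 96 bits, read as one getLong plus one getInt.
  // Value 5 straddles the boundary: its top 4 bits are the low 4 bits of
  // the long, its bottom 8 bits are the top byte of the int.
  public static void unpack8(ByteBuffer buffer, int offset, long[] out, int outPosition)
  {
    final long hi = buffer.getLong(offset);
    final int lo = buffer.getInt(offset + 8);
    out[outPosition]     = (hi >>> 52) & 0xFFF;
    out[outPosition + 1] = (hi >>> 40) & 0xFFF;
    out[outPosition + 2] = (hi >>> 28) & 0xFFF;
    out[outPosition + 3] = (hi >>> 16) & 0xFFF;
    out[outPosition + 4] = (hi >>> 4) & 0xFFF;
    out[outPosition + 5] = ((hi & 0xF) << 8) | ((lo >>> 24) & 0xFF);
    out[outPosition + 6] = (lo >>> 12) & 0xFFF;
    out[outPosition + 7] = lo & 0xFFF;
  }
}
```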


This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • been tested in a test Druid cluster.

@clintropolis clintropolis changed the title vectorize 'auto' long encoding vectorize 'auto' long decoding Mar 17, 2021
@clintropolis clintropolis removed the WIP label Mar 17, 2021
@@ -330,7 +329,7 @@ public void write(long value) throws IOException
   curByte = (byte) value;
   first = false;
 } else {
-  curByte = (byte) ((curByte << 4) | ((value >> (numBytes << 3)) & 0xF));
+  curByte = (byte) ((curByte << 4) | ((value >>> (numBytes << 3)) & 0xF));
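For context on the one-character change above: `>>` is an arithmetic (sign-extending) shift while `>>>` is a logical (zero-filling) shift; they only differ for negative values. A quick standalone illustration:

```java
public class ShiftDemo
{
  public static void main(String[] args)
  {
    final long v = -8L;          // 0xFFFFFFFFFFFFFFF8
    System.out.println(v >> 1);  // arithmetic shift: sign bit shifted in, -4
    System.out.println(v >>> 1); // logical shift: zero shifted in, 0x7FFFFFFFFFFFFFFC
  }
}
```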
Contributor:

Was this a bug fix? If so: it's on the write side; does that mean there might be bad segments out there, or is there some reason that this line wouldn't have affected any already-written data that people might have? (Maybe negative numbers were never fed to this method.)

Member Author:

oops, this was an accidental change, will revert

* Unpack a non-contiguous vector of long values at the specified indexes and adjust them by the supplied delta base
* value.
*/
default int getDelta(long[] out, int outPosition, int[] indexes, int length, int indexOffset, int limit, long base)
Contributor:

Do you have evidence that the getDelta and getTable methods are helpful? (vs. the alternative: first calling a regular bulk get method, then applying the delta or table adjustment in a loop over the returned arrays)

They complexify the code quite a bit, so we should only include them if they are meaningfully better performance-wise.
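For reference, the alternative the comment describes would look roughly like this (hypothetical helper, not actual Druid code): a plain bulk get fills the output array, then a second loop applies the delta.

```java
public class DeltaAdjust
{
  // Two-pass alternative: after a regular bulk get fills out[], a second
  // loop adds the delta base. The getDelta pushdown in this PR instead adds
  // the base inside the unpacking loop, saving one pass over the array.
  public static void applyDelta(long[] out, int outPosition, int length, long base)
  {
    for (int i = 0; i < length; i++) {
      out[outPosition + i] += base;
    }
  }
}
```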

Member Author:

I haven't measured it as you suggest with these exact get methods (an earlier test before they were filled out did seem to show improvement, which is why I went down this path; cutting out the extra vector iteration seemed to be worth the math). I will try to run the numbers this evening to compare.

Member Author:

I did the widths up through 8 bits so far without pushdown; the results are 1-2 millis slower.

without pushdown:

Benchmark                                                                           (distribution)  (encoding)  (filteredRowCountPercentage)   (rows)  (zeroProbability)  Mode  Cnt         Score      Error  Units
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                    uniform-1    lz4-auto                           1.0  5000000                0.0  avgt    5    20409.851 ±  998.810  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                    uniform-2    lz4-auto                           1.0  5000000                0.0  avgt    5    18935.144 ±  199.459  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                    uniform-4    lz4-auto                           1.0  5000000                0.0  avgt    5    18668.448 ±  538.985  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                    uniform-8    lz4-auto                           1.0  5000000                0.0  avgt    5    19010.981 ±  695.243  us/op

compared to with pushdown:

Benchmark                                                                           (distribution)  (encoding)  (filteredRowCountPercentage)   (rows)  (zeroProbability)  Mode  Cnt         Score      Error  Units
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                       uniform-1    lz4-auto                           1.0  5000000                0.0  avgt    5     17892.362 ±  102.776  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                       uniform-2    lz4-auto                           1.0  5000000                0.0  avgt    5     17796.103 ±  417.847  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                       uniform-4    lz4-auto                           1.0  5000000                0.0  avgt    5     17879.496 ±  237.066  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                       uniform-8    lz4-auto                           1.0  5000000                0.0  avgt    5     17508.260 ±  560.856  us/op

I'll run the rest, but based on the results so far, and since only the first 4 deserializers implement getTable, I'm in favor of leaving the pushdown in place; it isn't that much extra complexity.

Member Author:

without pushdown:

Benchmark                                                                        (distribution)  (encoding)  (filteredRowCountPercentage)   (rows)  (zeroProbability)  Mode  Cnt         Score      Error  Units
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                   uniform-12    lz4-auto                           1.0  5000000                0.0  avgt    5     20574.724 ± 3674.509  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                   uniform-16    lz4-auto                           1.0  5000000                0.0  avgt    5     20893.434 ± 2894.104  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                   uniform-20    lz4-auto                           1.0  5000000                0.0  avgt    5     19857.499 ± 1926.944  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                   uniform-24    lz4-auto                           1.0  5000000                0.0  avgt    5     22194.340 ± 2624.983  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                   uinform-32    lz4-auto                           1.0  5000000                0.0  avgt    5     18321.336 ±  465.633  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                   uniform-40    lz4-auto                           1.0  5000000                0.0  avgt    5     23252.329 ±  341.846  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                   uniform-48    lz4-auto                           1.0  5000000                0.0  avgt    5     25273.632 ± 1414.751  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                   uniform-56    lz4-auto                           1.0  5000000                0.0  avgt    5     26429.779 ± 2649.011  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                   uniform-64    lz4-auto                           1.0  5000000                0.0  avgt    5     21943.211 ± 2124.446  us/op

compare with pushdown:

Benchmark                                                                        (distribution)  (encoding)  (filteredRowCountPercentage)   (rows)  (zeroProbability)  Mode  Cnt         Score      Error  Units
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                      uniform-12    lz4-auto                           1.0  5000000                0.0  avgt    5     18272.440 ±   71.751  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                      uniform-16    lz4-auto                           1.0  5000000                0.0  avgt    5     19042.292 ±  595.685  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                      uniform-20    lz4-auto                           1.0  5000000                0.0  avgt    5     18782.746 ±  248.738  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                      uniform-24    lz4-auto                           1.0  5000000                0.0  avgt    5     19048.354 ±  160.025  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                      uinform-32    lz4-auto                           1.0  5000000                0.0  avgt    5     17984.778 ±  633.691  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                      uniform-40    lz4-auto                           1.0  5000000                0.0  avgt    5     22070.007 ±  166.035  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                      uniform-48    lz4-auto                           1.0  5000000                0.0  avgt    5     22052.517 ± 2168.763  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                      uniform-56    lz4-auto                           1.0  5000000                0.0  avgt    5     25001.739 ±  259.184  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                      uniform-64    lz4-auto                           1.0  5000000                0.0  avgt    5     20325.369 ±   60.724  us/op

Contributor:

Thanks for running the tests. It seems worth it to keep it this way.

* Unpack a contiguous vector of long values of the given length, starting at the specified index, and adjust them
* by the supplied delta base value.
*/
default void getDelta(long[] out, int outPosition, int startIndex, int length, long base)
Contributor:

Are the default implementations ever used? If not, we could remove them.

Member Author:

I think the default getDelta methods could be removed, and the default getTable methods could probably throw an unsupported operation exception for sizes that don't support table encoding (or we could push that to the implementations so there are no defaults).

Member Author:

I was wrong: the default contiguous getDelta methods are not used, but the default non-contiguous (int[] indexes) variants are used by all un-aligned decoders (1, 2, 4, 12, 20, 24, 40, 48, 56).
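For readers unfamiliar with table encoding, the pushdown being discussed fuses a lookup like the following into the unpacking loop (hypothetical sketch; the real decoders read packed indexes from a ByteBuffer rather than an int[]):

```java
public class TableLookup
{
  // Packed values are indexes into a table of actual longs; the getTable
  // pushdown applies this lookup in the same pass that unpacks the indexes.
  public static void lookup(long[] out, int outPosition, int[] indexes, int length, long[] table)
  {
    for (int i = 0; i < length; i++) {
      out[outPosition + i] = table[indexes[i]];
    }
  }
}
```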


@RunWith(Enclosed.class)
public class VSizeLongSerdeTest
Contributor:

The change in Mult4Ser suggests that we care about handling negative numbers, but this test class doesn't exercise negative numbers very much. (I think it only tests Long.MIN_VALUE, in testEveryPowerOfTwo.)

If negative numbers matter, we should extend the test cases in this file to cover them better. I'd suggest adding tests to EveryLittleBitTest that are similar to testEveryPowerOfTwo and testEveryPowerOfTwoMinusOne, but have the sign bit set (i.e. bitwise or with Long.MIN_VALUE).

If negative numbers aren't important, I'd suggest blocking them on the write side, i.e. have all the LongSerializers throw errors if they are fed negative numbers.
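The suggested test inputs would look something like this (hypothetical sketch mirroring the testEveryPowerOfTwo pattern described above):

```java
public class NegativeTestValues
{
  // Each power of two with the sign bit forced on (bitwise or with
  // Long.MIN_VALUE), per the suggestion above.
  public static long[] negativePowersOfTwo()
  {
    final long[] values = new long[63];
    for (int i = 0; i < 63; i++) {
      values[i] = (1L << i) | Long.MIN_VALUE;
    }
    return values;
  }
}
```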

Member Author:

I don't think it's necessary, since that change was accidental, and I think negative numbers shouldn't make it here because both delta and table encoding appear to make all numbers positive.

Contributor:

In that case, IMO it'd be good to add checks to the serializers to make sure negative numbers aren't provided.

Member Author:

Looking closer, I don't think it is necessary to add the checks based on how IntermediateColumnarLongsSerializer works.

For delta encoding, the delta is computed with LongMath.checkedSubtract, which ensures delta is a positive number; if delta is Long.MAX_VALUE, or the subtraction overflowed because the range between values was too big, delta is set to -1 and delta encoding is not used. These checks ensure the value fed to VSizeLongSerde.getBitsForMax in the writer's constructor is appropriate. The minVal is used as the base, and the writer subtracts the base from every value while serializing, so every number given to the serializer will be between 0 and delta.

For table encoding, since the values are replaced with their corresponding index into the table array, they should also always be 0 or positive.

Is it worth the overhead to make LongSerializer.write implementations check this for every value?
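A condensed sketch of the bound logic described above (hypothetical standalone code; the real logic lives in IntermediateColumnarLongsSerializer and uses Guava's LongMath.checkedSubtract, approximated here with Math.subtractExact):

```java
public class DeltaBounds
{
  // Returns the delta if delta encoding is usable, or -1 if the range
  // overflowed or equals Long.MAX_VALUE. Serialized values are then
  // value - min, which always falls in [0, delta].
  public static long computeDelta(long minValue, long maxValue)
  {
    try {
      final long delta = Math.subtractExact(maxValue, minValue);
      return delta == Long.MAX_VALUE ? -1 : delta;
    }
    catch (ArithmeticException e) {
      return -1; // range too large to delta-encode
    }
  }
}
```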

Contributor:

It doesn't really matter much if the methods are being called correctly today. The purpose of the checks is in case someone starts calling these methods incorrectly in the future.

I would imagine the overhead is minimal given that LongSerializers are doing a single write per method call. (A precondition check shouldn't add much on top of the method call overhead that's already there, especially since the branch would be 100% predictable.)

If it turns out that the overhead is worth worrying about, then I'd add an assert instead of a regular precondition check. The assert would be skipped in production but will run during tests.
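The two options described would look roughly like this (hypothetical placement; the actual check would go in each LongSerializer.write implementation):

```java
public class WriteChecks
{
  // Option 1: always-on precondition; for valid inputs the branch is 100%
  // predictable, so the cost is negligible next to the call overhead.
  public static void checkNonNegative(long value)
  {
    if (value < 0) {
      throw new IllegalArgumentException("Expected non-negative value, got " + value);
    }
  }

  // Option 2: assert, skipped in production but active under -ea in tests.
  public static void assertNonNegative(long value)
  {
    assert value >= 0 : "Expected non-negative value, got " + value;
  }
}
```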

Member Author:

It's not too bad; the low end doesn't seem very different at all.

Before adding Preconditions.checkArgument(value >= 0); to every write:

Benchmark                                                                   (distribution)  (encoding)   (rows)  (zeroProbability)  Mode  Cnt         Score   Error  Units
ColumnarLongsEncodeDataFromGeneratorBenchmark.encodeColumn                       uniform-1    lz4-auto  5000000                0.0  avgt    2        97.509          ms/op
ColumnarLongsEncodeDataFromGeneratorBenchmark.encodeColumn                       uniform-2    lz4-auto  5000000                0.0  avgt    2       148.329          ms/op
ColumnarLongsEncodeDataFromGeneratorBenchmark.encodeColumn                       uniform-4    lz4-auto  5000000                0.0  avgt    2       171.316          ms/op
ColumnarLongsEncodeDataFromGeneratorBenchmark.encodeColumn                       uniform-8    lz4-auto  5000000                0.0  avgt    2       242.792          ms/op
ColumnarLongsEncodeDataFromGeneratorBenchmark.encodeColumn                      uniform-12    lz4-auto  5000000                0.0  avgt    2       249.412          ms/op
ColumnarLongsEncodeDataFromGeneratorBenchmark.encodeColumn                      uniform-16    lz4-auto  5000000                0.0  avgt    2       254.304          ms/op
ColumnarLongsEncodeDataFromGeneratorBenchmark.encodeColumn                      uniform-20    lz4-auto  5000000                0.0  avgt    2       309.736          ms/op
ColumnarLongsEncodeDataFromGeneratorBenchmark.encodeColumn                      uniform-24    lz4-auto  5000000                0.0  avgt    2       359.252          ms/op
ColumnarLongsEncodeDataFromGeneratorBenchmark.encodeColumn                      uinform-32    lz4-auto  5000000                0.0  avgt    2       416.235          ms/op
ColumnarLongsEncodeDataFromGeneratorBenchmark.encodeColumn                      uniform-40    lz4-auto  5000000                0.0  avgt    2       492.306          ms/op
ColumnarLongsEncodeDataFromGeneratorBenchmark.encodeColumn                      uniform-48    lz4-auto  5000000                0.0  avgt    2       611.416          ms/op
ColumnarLongsEncodeDataFromGeneratorBenchmark.encodeColumn                      uniform-56    lz4-auto  5000000                0.0  avgt    2       734.503          ms/op
ColumnarLongsEncodeDataFromGeneratorBenchmark.encodeColumn                      uniform-64    lz4-auto  5000000                0.0  avgt    2       669.983          ms/op

after:

Benchmark                                                                   (distribution)  (encoding)   (rows)  (zeroProbability)  Mode  Cnt         Score   Error  Units
ColumnarLongsEncodeDataFromGeneratorBenchmark.encodeColumn                       uniform-1    lz4-auto  5000000                0.0  avgt    2       100.281          ms/op
ColumnarLongsEncodeDataFromGeneratorBenchmark.encodeColumn                       uniform-2    lz4-auto  5000000                0.0  avgt    2       148.947          ms/op
ColumnarLongsEncodeDataFromGeneratorBenchmark.encodeColumn                       uniform-4    lz4-auto  5000000                0.0  avgt    2       178.900          ms/op
ColumnarLongsEncodeDataFromGeneratorBenchmark.encodeColumn                       uniform-8    lz4-auto  5000000                0.0  avgt    2       264.585          ms/op
ColumnarLongsEncodeDataFromGeneratorBenchmark.encodeColumn                      uniform-12    lz4-auto  5000000                0.0  avgt    2       262.852          ms/op
ColumnarLongsEncodeDataFromGeneratorBenchmark.encodeColumn                      uniform-16    lz4-auto  5000000                0.0  avgt    2       265.252          ms/op
ColumnarLongsEncodeDataFromGeneratorBenchmark.encodeColumn                      uniform-20    lz4-auto  5000000                0.0  avgt    2       341.000          ms/op
ColumnarLongsEncodeDataFromGeneratorBenchmark.encodeColumn                      uniform-24    lz4-auto  5000000                0.0  avgt    2       368.859          ms/op
ColumnarLongsEncodeDataFromGeneratorBenchmark.encodeColumn                      uniform-32    lz4-auto  5000000                0.0  avgt    2       456.571          ms/op
ColumnarLongsEncodeDataFromGeneratorBenchmark.encodeColumn                      uniform-40    lz4-auto  5000000                0.0  avgt    2       635.758          ms/op
ColumnarLongsEncodeDataFromGeneratorBenchmark.encodeColumn                      uniform-48    lz4-auto  5000000                0.0  avgt    2       831.513          ms/op
ColumnarLongsEncodeDataFromGeneratorBenchmark.encodeColumn                      uniform-56    lz4-auto  5000000                0.0  avgt    2       975.933          ms/op
ColumnarLongsEncodeDataFromGeneratorBenchmark.encodeColumn                      uniform-64    lz4-auto  5000000                0.0  avgt    2       727.638          ms/op

@clintropolis
Member Author
oops, the preconditions broke testEveryPowerOfTwo for the 64-bit case, which overflows. Thinking about it, I guess it doesn't matter that the 64-bit max value is negative, since it has all the bits set. I've made an exception for this case so the test passes again.
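For anyone curious why the 64-bit case trips a precondition: in Java, shift distances for `long` are taken mod 64, so the usual "max value for n bits" formula silently breaks at n = 64. A minimal sketch of the pitfall (the `maxValueForBits` helper is hypothetical, just illustrating the kind of precondition discussed here, not Druid's actual code):

```java
public class BitMaskDemo {
  // hypothetical helper illustrating the 64-bit edge case
  static long maxValueForBits(int numBits) {
    if (numBits == 64) {
      // (1L << 64) wraps to 1L, so the formula below would yield 0;
      // 64 bits covers the full long range, so no mask is really needed
      return Long.MAX_VALUE;
    }
    return (1L << numBits) - 1;
  }

  public static void main(String[] args) {
    // Java shift distances are taken mod 64, so 1L << 64 wraps to 1L
    System.out.println(1L << 64);            // prints 1
    System.out.println((1L << 64) - 1);      // prints 0, not all-ones
    System.out.println(maxValueForBits(8));  // prints 255
  }
}
```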

@gianm (Contributor) left a comment

LGTM

@clintropolis
Member Author

thanks for the review @gianm 🤘

@clintropolis clintropolis merged commit c0e6d1c into apache:master Mar 27, 2021
@clintropolis clintropolis deleted the auto-longs-vectorize branch March 27, 2021 01:39
gianm added a commit to gianm/druid that referenced this pull request Apr 12, 2021
Regression introduced in apache#11004 due to overzealous optimization. Even though
we replaced stateful usage of ByteBuffer with stateless usage of Memory, we
still need to create a new object on "duplicate" due to semantics of setBuffer.
gianm added a commit that referenced this pull request Apr 13, 2021
…g. (#11098)

Regression introduced in #11004 due to overzealous optimization. Even though
we replaced stateful usage of ByteBuffer with stateless usage of Memory, we
still need to create a new object on "duplicate" due to semantics of setBuffer.
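The fix described in that commit comes down to object identity: even if every read is stateless, the reader instance still holds the buffer reference installed by `setBuffer`, so two cursors sharing one reader object would clobber each other's source. A minimal sketch of the hazard (not Druid's actual classes; `SketchReader` is invented for illustration):

```java
import java.nio.ByteBuffer;

// Sketch of why "duplicate" must return a new object: setBuffer stores
// the buffer on the instance, so returning `this` would let one cursor
// redirect another cursor's reads.
class SketchReader {
  private ByteBuffer buffer;

  void setBuffer(ByteBuffer buf) {
    this.buffer = buf;
  }

  long get(int index) {
    return buffer.getLong(index * Long.BYTES);
  }

  // correct: a fresh instance sharing the current buffer reference
  SketchReader duplicate() {
    SketchReader copy = new SketchReader();
    copy.setBuffer(buffer);
    return copy;
  }
}

public class DuplicateDemo {
  public static void main(String[] args) {
    ByteBuffer a = ByteBuffer.allocate(16).putLong(0, 7L);
    ByteBuffer b = ByteBuffer.allocate(16).putLong(0, 9L);

    SketchReader r1 = new SketchReader();
    r1.setBuffer(a);
    SketchReader r2 = r1.duplicate();
    r2.setBuffer(b); // must not affect r1

    System.out.println(r1.get(0)); // prints 7
    System.out.println(r2.get(0)); // prints 9
  }
}
```

If `duplicate` had returned `this`, the second `setBuffer` call would have redirected both readers to buffer `b`, which is the shape of the regression being fixed.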
@clintropolis clintropolis added this to the 0.22.0 milestone Aug 12, 2021
abhishekagarwal87 pushed a commit that referenced this pull request May 1, 2023
This PR fixes an issue when using 'auto' encoded LONG typed columns and the 'vectorized' query engine. These columns use a delta-based bit-packing mechanism, and errors in the vectorized reader would cause it to incorrectly read column values for some bit sizes (1 through 32 bits). This is a regression caused by #11004, which added the optimized readers to improve performance, so it impacts Druid versions 0.22.0+.

While writing the test I finally got sad enough about IndexSpec not having a "builder", so I made one, and switched all the things to use it. Apologies for the noise in this bug fix PR, the only real changes are in VSizeLongSerde, and the tests that have been modified to cover the buggy behavior, VSizeLongSerdeTest and ExpressionVectorSelectorsTest. Everything else is just cleanup of IndexSpec usage.
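For readers unfamiliar with the delta-based bit-packing these columns use: values are stored as offsets from a base value, packed into fixed-width bit slots, so any off-by-one in the unpacking arithmetic corrupts decoded values. A simplified sketch of the general idea under those assumptions (not Druid's actual VSizeLongSerde format):

```java
public class DeltaPackDemo {
  // pack each (value - base) delta into a numBits-wide slot of a long array
  static long[] pack(long[] values, long base, int numBits) {
    long[] out = new long[(values.length * numBits + 63) / 64];
    for (int i = 0; i < values.length; i++) {
      long delta = values[i] - base;
      int bit = i * numBits;
      out[bit / 64] |= delta << (bit % 64);
      if (bit % 64 + numBits > 64) {
        // slot straddles a word boundary: spill the high bits
        out[bit / 64 + 1] |= delta >>> (64 - bit % 64);
      }
    }
    return out;
  }

  // read back the i-th value: extract the delta, mask it, add the base
  static long unpack(long[] packed, long base, int numBits, int i) {
    int bit = i * numBits;
    long mask = numBits == 64 ? -1L : (1L << numBits) - 1;
    long delta = packed[bit / 64] >>> (bit % 64);
    if (bit % 64 + numBits > 64) {
      delta |= packed[bit / 64 + 1] << (64 - bit % 64);
    }
    return base + (delta & mask);
  }

  public static void main(String[] args) {
    long[] values = {100, 103, 101, 107, 100};
    long base = 100; // smallest value; deltas fit in 3 bits
    long[] packed = pack(values, base, 3);
    for (int i = 0; i < values.length; i++) {
      System.out.println(unpack(packed, base, 3, i));
    }
  }
}
```

Because every decoded value depends on the shift and mask arithmetic, a reader bug like the one fixed here yields silently wrong values rather than an error.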
clintropolis added a commit to clintropolis/druid that referenced this pull request May 1, 2023
clintropolis added a commit to clintropolis/druid that referenced this pull request May 5, 2023
clintropolis added a commit that referenced this pull request May 8, 2023