
vectorize 'auto' long decoding #11004

Merged 11 commits on Mar 27, 2021

Conversation

clintropolis (Member) commented Mar 17, 2021

Description

This PR specializes LongDeserializer implementations with getDelta and getTable methods to push down unpacking bits, delta encoding adjustment, and table lookups for vectors of data as far as possible and work more efficiently with vectorized query engines, primarily focusing on contiguous reads.

It works by unrolling value reads to line up with ByteBuffer get methods, eliminating overlapping reads where possible by reading blocks of 8 values at a time for un-aligned bit widths (1, 2, 4, 12, 20, 24, 40, 48, 56). ByteBuffer-aligned bit-packing widths (8, 16, 32, 64) actually performed worse when this same unrolling was applied, so those instead use a traditional for loop with the aligned get methods.

Most of the improvement is at the small bit widths, since they have the most redundant/overlapping memory accesses, and the improvement there is pretty decent.
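To make the unrolling idea concrete, here is a minimal standalone sketch (hypothetical illustration, not the actual Druid deserializer code) of the 4-bit case: eight packed 4-bit values fit in a single 32-bit read, so one ByteBuffer.getInt replaces eight overlapping per-value reads.

```java
import java.nio.ByteBuffer;

public class Unpack4Bit
{
  // Eight 4-bit values packed big-endian fit in one 32-bit read; each value
  // is recovered with a shift and a 0xF mask.
  public static void unpack8(ByteBuffer buffer, int offset, long[] out, int outPosition)
  {
    final int packed = buffer.getInt(offset);
    out[outPosition]     = (packed >>> 28) & 0xF;
    out[outPosition + 1] = (packed >>> 24) & 0xF;
    out[outPosition + 2] = (packed >>> 20) & 0xF;
    out[outPosition + 3] = (packed >>> 16) & 0xF;
    out[outPosition + 4] = (packed >>> 12) & 0xF;
    out[outPosition + 5] = (packed >>> 8) & 0xF;
    out[outPosition + 6] = (packed >>> 4) & 0xF;
    out[outPosition + 7] = packed & 0xF;
  }
}
```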

Full column scans before/after on uniform distribution columns of varying bits to cover the entire set of value decoders

before:

Benchmark                                                                           (distribution)  (encoding)  (filteredRowCountPercentage)   (rows)  (zeroProbability)  Mode  Cnt         Score      Error  Units
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                       uniform-1    lz4-auto                           1.0  5000000                0.0  avgt    5     23862.994 ± 1944.514  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                       uniform-2    lz4-auto                           1.0  5000000                0.0  avgt    5     23567.778 ±  757.632  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                       uniform-4    lz4-auto                           1.0  5000000                0.0  avgt    5     28826.685 ± 3260.379  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                       uniform-8    lz4-auto                           1.0  5000000                0.0  avgt    5     19829.688 ±  923.047  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                      uniform-12    lz4-auto                           1.0  5000000                0.0  avgt    5     32419.941 ± 1486.951  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                      uniform-16    lz4-auto                           1.0  5000000                0.0  avgt    5     22460.206 ± 4207.942  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                      uniform-20    lz4-auto                           1.0  5000000                0.0  avgt    5     31415.040 ± 3867.699  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                      uniform-24    lz4-auto                           1.0  5000000                0.0  avgt    5     24444.519 ± 2472.537  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                      uinform-32    lz4-auto                           1.0  5000000                0.0  avgt    5     18031.908 ± 1471.220  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                      uniform-40    lz4-auto                           1.0  5000000                0.0  avgt    5     22442.611 ± 1778.178  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                      uniform-48    lz4-auto                           1.0  5000000                0.0  avgt    5     24733.015 ± 3076.896  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                      uniform-56    lz4-auto                           1.0  5000000                0.0  avgt    5     23201.264 ± 1578.009  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                      uniform-64    lz4-auto                           1.0  5000000                0.0  avgt    5     20581.399 ± 1269.190  us/op

after:

Benchmark                                                                           (distribution)  (encoding)  (filteredRowCountPercentage)   (rows)  (zeroProbability)  Mode  Cnt         Score      Error  Units
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                       uniform-1    lz4-auto                           1.0  5000000                0.0  avgt    5     17892.362 ±  102.776  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                       uniform-2    lz4-auto                           1.0  5000000                0.0  avgt    5     17796.103 ±  417.847  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                       uniform-4    lz4-auto                           1.0  5000000                0.0  avgt    5     17879.496 ±  237.066  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                       uniform-8    lz4-auto                           1.0  5000000                0.0  avgt    5     17508.260 ±  560.856  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                      uniform-12    lz4-auto                           1.0  5000000                0.0  avgt    5     18272.440 ±   71.751  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                      uniform-16    lz4-auto                           1.0  5000000                0.0  avgt    5     19042.292 ±  595.685  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                      uniform-20    lz4-auto                           1.0  5000000                0.0  avgt    5     18782.746 ±  248.738  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                      uniform-24    lz4-auto                           1.0  5000000                0.0  avgt    5     19048.354 ±  160.025  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                      uinform-32    lz4-auto                           1.0  5000000                0.0  avgt    5     17984.778 ±  633.691  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                      uniform-40    lz4-auto                           1.0  5000000                0.0  avgt    5     22070.007 ±  166.035  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                      uniform-48    lz4-auto                           1.0  5000000                0.0  avgt    5     22052.517 ± 2168.763  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                      uniform-56    lz4-auto                           1.0  5000000                0.0  avgt    5     25001.739 ±  259.184  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                      uniform-64    lz4-auto                           1.0  5000000                0.0  avgt    5     20325.369 ±   60.724  us/op

Full column scans on other value distribution before/after comparison

before:

Benchmark                                                                           (distribution)  (encoding)  (filteredRowCountPercentage)   (rows)  (zeroProbability)  Mode  Cnt         Score      Error  Units
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                  enumerated-0-1    lz4-auto                           1.0  5000000                0.0  avgt    5     24186.788 ± 1440.692  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                 enumerated-full    lz4-auto                           1.0  5000000                0.0  avgt    5     27220.865 ± 2221.740  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                     normal-1-32    lz4-auto                           1.0  5000000                0.0  avgt    5     19490.250 ± 1231.429  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                  normal-40-1000    lz4-auto                           1.0  5000000                0.0  avgt    5     20701.792 ± 1898.845  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                 sequential-1000    lz4-auto                           1.0  5000000                0.0  avgt    5     28977.482 ±  878.240  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized               sequential-unique    lz4-auto                           1.0  5000000                0.0  avgt    5     22856.419 ±  247.786  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                    zipf-low-100    lz4-auto                           1.0  5000000                0.0  avgt    5     19135.244 ±  437.260  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                 zipf-low-100000    lz4-auto                           1.0  5000000                0.0  avgt    5     31696.149 ± 2933.263  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                 zipf-low-32-bit    lz4-auto                           1.0  5000000                0.0  avgt    5     25815.759 ± 2073.480  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                   zipf-high-100    lz4-auto                           1.0  5000000                0.0  avgt    5     20582.544 ± 1726.414  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                zipf-high-100000    lz4-auto                           1.0  5000000                0.0  avgt    5     18996.105 ± 1412.307  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                zipf-high-32-bit    lz4-auto                           1.0  5000000                0.0  avgt    5     18628.002 ±  450.125  us/op

after:

ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                  enumerated-0-1    lz4-auto                           1.0  5000000                0.0  avgt    5     19250.882 ± 1221.067  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                 enumerated-full    lz4-auto                           1.0  5000000                0.0  avgt    5     19853.160 ±  780.701  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                     normal-1-32    lz4-auto                           1.0  5000000                0.0  avgt    5     18414.646 ± 1705.926  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                  normal-40-1000    lz4-auto                           1.0  5000000                0.0  avgt    5     20054.183 ± 2539.912  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                 sequential-1000    lz4-auto                           1.0  5000000                0.0  avgt    5     18585.123 ± 1524.581  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized               sequential-unique    lz4-auto                           1.0  5000000                0.0  avgt    5     19304.239 ±  808.152  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                    zipf-low-100    lz4-auto                           1.0  5000000                0.0  avgt    5     19248.088 ±  470.994  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                 zipf-low-100000    lz4-auto                           1.0  5000000                0.0  avgt    5     23815.318 ± 4413.581  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                 zipf-low-32-bit    lz4-auto                           1.0  5000000                0.0  avgt    5     26111.974 ± 4594.582  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                   zipf-high-100    lz4-auto                           1.0  5000000                0.0  avgt    5     20144.349 ±  489.978  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                zipf-high-100000    lz4-auto                           1.0  5000000                0.0  avgt    5     18956.415 ±  796.136  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                zipf-high-32-bit    lz4-auto                           1.0  5000000                0.0  avgt    5     18666.583 ±  385.407  us/op

Full benchmarks on column scan value selects with simulated filters (up to full scan)

For the top line graph, the x axis is the percent of rows selected by the column scan (offset analog) and goes up to 1.0, a contiguous scan. The y axis is the amount of time it took to select that many values. The numbers are kind of noisy because I didn't spend the multiple days necessary to do it properly, but it gives a decent idea. Nearly all of the improvement is on the full-scan end of the spectrum (last datapoint), because the underlying BitmapVectorOffset never calls the contiguous vectorized get methods (which I will fix in a follow-up PR), and the contiguous gets were the primary area of improvement in this PR. The bottom bar chart is the size of the column in bytes as it would be stored in a segment (encoded and/or compressed).

(Before/after comparison charts attached as an image.)

In a follow-up PR I will probably argue for making this the default long encoding, since it does pretty well with the vectorized engine compared to lz4 longs.


Unpacking

I drew pictures to help me get the shifts and masking correct, so I'll add them here in case they are any help to reviewers. Each diagram covers one bit width and shows the 8 values unpacked per loop iteration, where each color is a separate value and each box is 4 bits.

(Five diagrams attached as images, one per un-aligned bit width.)
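As a worked example of what the diagrams encode, here is a hypothetical standalone sketch (not the actual Druid code) of the 12-bit case: 8 values occupy 96 bits, which can be read as one long plus one int, with the value straddling the 64-bit boundary stitched together from both reads.

```java
import java.nio.ByteBuffer;

public class Unpack12Bit
{
  // Eight 12-bit values = 96 bits, read as one getLong plus one getInt.
  // Value 5 straddles the boundary: its top 4 bits are the low 4 bits of
  // the long, its bottom 8 bits are the top byte of the int.
  public static void unpack8(ByteBuffer buffer, int offset, long[] out, int outPosition)
  {
    final long hi = buffer.getLong(offset);
    final int lo = buffer.getInt(offset + 8);
    out[outPosition]     = (hi >>> 52) & 0xFFF;
    out[outPosition + 1] = (hi >>> 40) & 0xFFF;
    out[outPosition + 2] = (hi >>> 28) & 0xFFF;
    out[outPosition + 3] = (hi >>> 16) & 0xFFF;
    out[outPosition + 4] = (hi >>> 4) & 0xFFF;
    out[outPosition + 5] = ((hi & 0xF) << 8) | ((lo >>> 24) & 0xFF);
    out[outPosition + 6] = (lo >>> 12) & 0xFFF;
    out[outPosition + 7] = lo & 0xFFF;
  }
}
```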


This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • been tested in a test Druid cluster.

@clintropolis clintropolis changed the title vectorize 'auto' long encoding vectorize 'auto' long decoding Mar 17, 2021
@clintropolis clintropolis removed the WIP label Mar 17, 2021
@@ -330,7 +329,7 @@ public void write(long value) throws IOException
   curByte = (byte) value;
   first = false;
 } else {
-  curByte = (byte) ((curByte << 4) | ((value >> (numBytes << 3)) & 0xF));
+  curByte = (byte) ((curByte << 4) | ((value >>> (numBytes << 3)) & 0xF));
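For context on the one-character change above: `>>` is an arithmetic (sign-extending) shift while `>>>` is a logical (zero-filling) shift; they only differ for negative values. A quick standalone illustration:

```java
public class ShiftDemo
{
  public static void main(String[] args)
  {
    final long v = -8L;          // 0xFFFFFFFFFFFFFFF8
    System.out.println(v >> 1);  // arithmetic shift: sign bit shifted in, -4
    System.out.println(v >>> 1); // logical shift: zero shifted in, 0x7FFFFFFFFFFFFFFC
  }
}
```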
Contributor:

Was this a bug fix? If so: it's on the write side; does that mean there might be bad segments out there, or is there some reason that this line wouldn't have affected any already-written data that people might have? (Maybe negative numbers were never fed to this method.)

Member Author:

oops, this was an accidental change, will revert

* Unpack a non-contiguous vector of long values at the specified indexes and adjust them by the supplied delta base
* value.
*/
default int getDelta(long[] out, int outPosition, int[] indexes, int length, int indexOffset, int limit, long base)
Contributor:

Do you have evidence that the getDelta and getTable methods are helpful? (vs. the alternative: first calling a regular bulk get method, then applying the delta or table adjustment in a loop over the returned arrays)

They complexify the code quite a bit, so we should only include them if they are meaningfully better performance-wise.
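For reference, the alternative the comment describes would look roughly like this (hypothetical helper, not actual Druid code): a plain bulk get fills the output array, then a second loop applies the delta.

```java
public class DeltaAdjust
{
  // Two-pass alternative: after a regular bulk get fills out[], a second
  // loop adds the delta base. The getDelta pushdown in this PR instead adds
  // the base inside the unpacking loop, saving one pass over the array.
  public static void applyDelta(long[] out, int outPosition, int length, long base)
  {
    for (int i = 0; i < length; i++) {
      out[outPosition + i] += base;
    }
  }
}
```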

Member Author:

I haven't measured it as you suggest with these exact get methods (an earlier test before they were filled out did seem to show improvement, which is why I went down this path; cutting out the extra vector iteration seemed to be worth the math). I will try to run the numbers this evening to compare.

Member Author:

I did the widths up through 8 bits so far without pushdown; the results are 1-2 millis slower.

without pushdown:

Benchmark                                                                           (distribution)  (encoding)  (filteredRowCountPercentage)   (rows)  (zeroProbability)  Mode  Cnt         Score      Error  Units
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                    uniform-1    lz4-auto                           1.0  5000000                0.0  avgt    5    20409.851 ±  998.810  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                    uniform-2    lz4-auto                           1.0  5000000                0.0  avgt    5    18935.144 ±  199.459  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                    uniform-4    lz4-auto                           1.0  5000000                0.0  avgt    5    18668.448 ±  538.985  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                    uniform-8    lz4-auto                           1.0  5000000                0.0  avgt    5    19010.981 ±  695.243  us/op

compared to with pushdown:

Benchmark                                                                           (distribution)  (encoding)  (filteredRowCountPercentage)   (rows)  (zeroProbability)  Mode  Cnt         Score      Error  Units
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                       uniform-1    lz4-auto                           1.0  5000000                0.0  avgt    5     17892.362 ±  102.776  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                       uniform-2    lz4-auto                           1.0  5000000                0.0  avgt    5     17796.103 ±  417.847  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                       uniform-4    lz4-auto                           1.0  5000000                0.0  avgt    5     17879.496 ±  237.066  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                       uniform-8    lz4-auto                           1.0  5000000                0.0  avgt    5     17508.260 ±  560.856  us/op

I'll run the rest, but based on the results so far, and since only the first 4 deserializers implement getTable, I'm in favor of leaving the pushdown in place; it isn't that much extra complexity.

Member Author:

without pushdown:

Benchmark                                                                        (distribution)  (encoding)  (filteredRowCountPercentage)   (rows)  (zeroProbability)  Mode  Cnt         Score      Error  Units
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                   uniform-12    lz4-auto                           1.0  5000000                0.0  avgt    5     20574.724 ± 3674.509  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                   uniform-16    lz4-auto                           1.0  5000000                0.0  avgt    5     20893.434 ± 2894.104  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                   uniform-20    lz4-auto                           1.0  5000000                0.0  avgt    5     19857.499 ± 1926.944  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                   uniform-24    lz4-auto                           1.0  5000000                0.0  avgt    5     22194.340 ± 2624.983  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                   uinform-32    lz4-auto                           1.0  5000000                0.0  avgt    5     18321.336 ±  465.633  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                   uniform-40    lz4-auto                           1.0  5000000                0.0  avgt    5     23252.329 ±  341.846  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                   uniform-48    lz4-auto                           1.0  5000000                0.0  avgt    5     25273.632 ± 1414.751  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                   uniform-56    lz4-auto                           1.0  5000000                0.0  avgt    5     26429.779 ± 2649.011  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                   uniform-64    lz4-auto                           1.0  5000000                0.0  avgt    5     21943.211 ± 2124.446  us/op

compare with pushdown:

Benchmark                                                                        (distribution)  (encoding)  (filteredRowCountPercentage)   (rows)  (zeroProbability)  Mode  Cnt         Score      Error  Units
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                      uniform-12    lz4-auto                           1.0  5000000                0.0  avgt    5     18272.440 ±   71.751  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                      uniform-16    lz4-auto                           1.0  5000000                0.0  avgt    5     19042.292 ±  595.685  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                      uniform-20    lz4-auto                           1.0  5000000                0.0  avgt    5     18782.746 ±  248.738  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                      uniform-24    lz4-auto                           1.0  5000000                0.0  avgt    5     19048.354 ±  160.025  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                      uinform-32    lz4-auto                           1.0  5000000                0.0  avgt    5     17984.778 ±  633.691  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                      uniform-40    lz4-auto                           1.0  5000000                0.0  avgt    5     22070.007 ±  166.035  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                      uniform-48    lz4-auto                           1.0  5000000                0.0  avgt    5     22052.517 ± 2168.763  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                      uniform-56    lz4-auto                           1.0  5000000                0.0  avgt    5     25001.739 ±  259.184  us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized                      uniform-64    lz4-auto                           1.0  5000000                0.0  avgt    5     20325.369 ±   60.724  us/op

Contributor:

Thanks for running the tests. It seems worth it to keep it this way.

* Unpack a contiguous vector of long values of the given length, starting at the specified index, and adjust them
* by the supplied delta base value.
*/
default void getDelta(long[] out, int outPosition, int startIndex, int length, long base)
Contributor:

Are the default implementations ever used? If not, we could remove them.

Member Author:

I think the default getDelta methods could be removed, and the default getTable methods could probably throw an unsupported operation exception for sizes that don't support table encoding (or we could push that to the implementations so there are no defaults).

Member Author:

I was wrong: the default contiguous getDelta methods are not used, but the default non-contiguous (int[] indexes) variants are used by all un-aligned decoders (1, 2, 4, 12, 20, 24, 40, 48, 56).
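For readers unfamiliar with table encoding, the pushdown being discussed fuses a lookup like the following into the unpacking loop (hypothetical sketch; the real decoders read packed indexes from a ByteBuffer rather than an int[]):

```java
public class TableLookup
{
  // Packed values are indexes into a table of actual longs; the getTable
  // pushdown applies this lookup in the same pass that unpacks the indexes.
  public static void lookup(long[] out, int outPosition, int[] indexes, int length, long[] table)
  {
    for (int i = 0; i < length; i++) {
      out[outPosition + i] = table[indexes[i]];
    }
  }
}
```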


@RunWith(Enclosed.class)
public class VSizeLongSerdeTest
Contributor:

The change in Mult4Ser suggests that we care about handling negative numbers, but this test class doesn't exercise negative numbers very much. (I think it only tests Long.MIN_VALUE, in testEveryPowerOfTwo.)

If negative numbers matter, we should extend the test cases in this file to cover them better. I'd suggest adding tests to EveryLittleBitTest that are similar to testEveryPowerOfTwo and testEveryPowerOfTwoMinusOne, but have the sign bit set (i.e. bitwise or with Long.MIN_VALUE).

If negative numbers aren't important, I'd suggest blocking them on the write side, i.e. have all the LongSerializers throw errors if they are fed negative numbers.
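The suggested test inputs would look something like this (hypothetical sketch mirroring the testEveryPowerOfTwo pattern described above):

```java
public class NegativeTestValues
{
  // Each power of two with the sign bit forced on (bitwise or with
  // Long.MIN_VALUE), per the suggestion above.
  public static long[] negativePowersOfTwo()
  {
    final long[] values = new long[63];
    for (int i = 0; i < 63; i++) {
      values[i] = (1L << i) | Long.MIN_VALUE;
    }
    return values;
  }
}
```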

Member Author:

I don't think it's necessary, since that change was accidental, and I think negative numbers shouldn't make it here because both delta and table encoding appear to make all numbers positive.

Contributor:

In that case, IMO it'd be good to add checks to the serializers to make sure negative numbers aren't provided.

Member Author:

Looking closer, I don't think it is necessary to add the checks based on how IntermediateColumnarLongsSerializer works.

For delta encoding, the delta is computed with LongMath.checkedSubtract, which ensures delta is a positive number; if delta is Long.MAX_VALUE, or the subtraction overflowed because the range between values was too big, delta is set to -1 and delta encoding is not used. These checks ensure the value fed to VSizeLongSerde.getBitsForMax in the writer's constructor is appropriate. The minVal is used as the base, and the writer subtracts the base from every value while serializing, so every number given to the serializer will be between 0 and delta.

For table encoding, since the values are replaced with their corresponding index into the table array, they should also always be 0 or positive.

Is it worth the overhead to make LongSerializer.write implementations check this for every value?
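A condensed sketch of the bound logic described above (hypothetical standalone code; the real logic lives in IntermediateColumnarLongsSerializer and uses Guava's LongMath.checkedSubtract, approximated here with Math.subtractExact):

```java
public class DeltaBounds
{
  // Returns the delta if delta encoding is usable, or -1 if the range
  // overflowed or equals Long.MAX_VALUE. Serialized values are then
  // value - min, which always falls in [0, delta].
  public static long computeDelta(long minValue, long maxValue)
  {
    try {
      final long delta = Math.subtractExact(maxValue, minValue);
      return delta == Long.MAX_VALUE ? -1 : delta;
    }
    catch (ArithmeticException e) {
      return -1; // range too large to delta-encode
    }
  }
}
```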

Contributor:

It doesn't really matter much if the methods are being called correctly today. The purpose of the checks is in case someone starts calling these methods incorrectly in the future.

I would imagine the overhead is minimal given that LongSerializers are doing a single write per method call. (A precondition check shouldn't add much on top of the method call overhead that's already there, especially since the branch would be 100% predictable.)

If it turns out that the overhead is worth worrying about, then I'd add an assert instead of a regular precondition check. The assert would be skipped in production but will run during tests.
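The two options described would look roughly like this (hypothetical placement; the actual check would go in each LongSerializer.write implementation):

```java
public class WriteChecks
{
  // Option 1: always-on precondition; for valid inputs the branch is 100%
  // predictable, so the cost is negligible next to the call overhead.
  public static void checkNonNegative(long value)
  {
    if (value < 0) {
      throw new IllegalArgumentException("Expected non-negative value, got " + value);
    }
  }

  // Option 2: assert, skipped in production but active under -ea in tests.
  public static void assertNonNegative(long value)
  {
    assert value >= 0 : "Expected non-negative value, got " + value;
  }
}
```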

Member Author:

It's not too bad; the low end doesn't seem very different at all.

Before adding Preconditions.checkArgument(value >= 0); to every write:

Benchmark                                                                   (distribution)  (encoding)   (rows)  (zeroProbability)  Mode  Cnt         Score   Error  Units
ColumnarLongsEncodeDataFromGeneratorBenchmark.encodeColumn                       uniform-1    lz4-auto  5000000                0.0  avgt    2        97.509          ms/op
ColumnarLongsEncodeDataFromGeneratorBenchmark.encodeColumn                       uniform-2    lz4-auto  5000000                0.0  avgt    2       148.329          ms/op
ColumnarLongsEncodeDataFromGeneratorBenchmark.encodeColumn                       uniform-4    lz4-auto  5000000                0.0  avgt    2       171.316          ms/op
ColumnarLongsEncodeDataFromGeneratorBenchmark.encodeColumn                       uniform-8    lz4-auto  5000000                0.0  avgt    2       242.792          ms/op
ColumnarLongsEncodeDataFromGeneratorBenchmark.encodeColumn                      uniform-12    lz4-auto  5000000                0.0  avgt    2       249.412          ms/op
ColumnarLongsEncodeDataFromGeneratorBenchmark.encodeColumn                      uniform-16    lz4-auto  5000000                0.0  avgt    2       254.304          ms/op
ColumnarLongsEncodeDataFromGeneratorBenchmark.encodeColumn                      uniform-20    lz4-auto  5000000                0.0  avgt    2       309.736          ms/op
ColumnarLongsEncodeDataFromGeneratorBenchmark.encodeColumn                      uniform-24    lz4-auto  5000000                0.0  avgt    2       359.252          ms/op
ColumnarLongsEncodeDataFromGeneratorBenchmark.encodeColumn                      uinform-32    lz4-auto  5000000                0.0  avgt    2       416.235          ms/op
ColumnarLongsEncodeDataFromGeneratorBenchmark.encodeColumn                      uniform-40    lz4-auto  5000000                0.0  avgt    2       492.306          ms/op
ColumnarLongsEncodeDataFromGeneratorBenchmark.encodeColumn                      uniform-48    lz4-auto  5000000                0.0  avgt    2       611.416          ms/op
ColumnarLongsEncodeDataFromGeneratorBenchmark.encodeColumn                      uniform-56    lz4-auto  5000000                0.0  avgt    2       734.503          ms/op
ColumnarLongsEncodeDataFromGeneratorBenchmark.encodeColumn                      uniform-64    lz4-auto  5000000                0.0  avgt    2       669.983          ms/op

after:

Benchmark                                                                   (distribution)  (encoding)   (rows)  (zeroProbability)  Mode  Cnt         Score   Error  Units
ColumnarLongsEncodeDataFromGeneratorBenchmark.encodeColumn                       uniform-1    lz4-auto  5000000                0.0  avgt    2       100.281          ms/op
ColumnarLongsEncodeDataFromGeneratorBenchmark.encodeColumn                       uniform-2    lz4-auto  5000000                0.0  avgt    2       148.947          ms/op
ColumnarLongsEncodeDataFromGeneratorBenchmark.encodeColumn                       uniform-4    lz4-auto  5000000                0.0  avgt    2       178.900          ms/op
ColumnarLongsEncodeDataFromGeneratorBenchmark.encodeColumn                       uniform-8    lz4-auto  5000000                0.0  avgt    2       264.585          ms/op
ColumnarLongsEncodeDataFromGeneratorBenchmark.encodeColumn                      uniform-12    lz4-auto  5000000                0.0  avgt    2       262.852          ms/op
ColumnarLongsEncodeDataFromGeneratorBenchmark.encodeColumn                      uniform-16    lz4-auto  5000000                0.0  avgt    2       265.252          ms/op
ColumnarLongsEncodeDataFromGeneratorBenchmark.encodeColumn                      uniform-20    lz4-auto  5000000                0.0  avgt    2       341.000          ms/op
ColumnarLongsEncodeDataFromGeneratorBenchmark.encodeColumn                      uniform-24    lz4-auto  5000000                0.0  avgt    2       368.859          ms/op
ColumnarLongsEncodeDataFromGeneratorBenchmark.encodeColumn                      uniform-32    lz4-auto  5000000                0.0  avgt    2       456.571          ms/op
ColumnarLongsEncodeDataFromGeneratorBenchmark.encodeColumn                      uniform-40    lz4-auto  5000000                0.0  avgt    2       635.758          ms/op
ColumnarLongsEncodeDataFromGeneratorBenchmark.encodeColumn                      uniform-48    lz4-auto  5000000                0.0  avgt    2       831.513          ms/op
ColumnarLongsEncodeDataFromGeneratorBenchmark.encodeColumn                      uniform-56    lz4-auto  5000000                0.0  avgt    2       975.933          ms/op
ColumnarLongsEncodeDataFromGeneratorBenchmark.encodeColumn                      uniform-64    lz4-auto  5000000                0.0  avgt    2       727.638          ms/op

@clintropolis
Member Author
oops, the preconditions broke testEveryPowerOfTwo for the 64-bit case, which overflows. Thinking about it, I guess it doesn't matter that the 64-bit max value is negative, since it has all the bits set. I've made an exception for this case so the test passes again.
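For anyone curious why the 64-bit case trips a precondition: in Java, shift distances for `long` are taken mod 64, so the usual "max value for n bits" formula silently breaks at n = 64. A minimal sketch of the pitfall (the `maxValueForBits` helper is hypothetical, just illustrating the kind of precondition discussed here, not Druid's actual code):

```java
public class BitMaskDemo {
  // hypothetical helper illustrating the 64-bit edge case
  static long maxValueForBits(int numBits) {
    if (numBits == 64) {
      // (1L << 64) wraps to 1L, so the formula below would yield 0;
      // 64 bits covers the full long range, so no mask is really needed
      return Long.MAX_VALUE;
    }
    return (1L << numBits) - 1;
  }

  public static void main(String[] args) {
    // Java shift distances are taken mod 64, so 1L << 64 wraps to 1L
    System.out.println(1L << 64);            // prints 1
    System.out.println((1L << 64) - 1);      // prints 0, not all-ones
    System.out.println(maxValueForBits(8));  // prints 255
  }
}
```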

@gianm (Contributor) left a comment

LGTM

@clintropolis
Member Author

thanks for the review @gianm 🤘

@clintropolis clintropolis merged commit c0e6d1c into apache:master Mar 27, 2021
@clintropolis clintropolis deleted the auto-longs-vectorize branch March 27, 2021 01:39
gianm added a commit to gianm/druid that referenced this pull request Apr 12, 2021
Regression introduced in apache#11004 due to overzealous optimization. Even though
we replaced stateful usage of ByteBuffer with stateless usage of Memory, we
still need to create a new object on "duplicate" due to semantics of setBuffer.
gianm added a commit that referenced this pull request Apr 13, 2021
…g. (#11098)

Regression introduced in #11004 due to overzealous optimization. Even though
we replaced stateful usage of ByteBuffer with stateless usage of Memory, we
still need to create a new object on "duplicate" due to semantics of setBuffer.
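The fix described in that commit comes down to object identity: even if every read is stateless, the reader instance still holds the buffer reference installed by `setBuffer`, so two cursors sharing one reader object would clobber each other's source. A minimal sketch of the hazard (not Druid's actual classes; `SketchReader` is invented for illustration):

```java
import java.nio.ByteBuffer;

// Sketch of why "duplicate" must return a new object: setBuffer stores
// the buffer on the instance, so returning `this` would let one cursor
// redirect another cursor's reads.
class SketchReader {
  private ByteBuffer buffer;

  void setBuffer(ByteBuffer buf) {
    this.buffer = buf;
  }

  long get(int index) {
    return buffer.getLong(index * Long.BYTES);
  }

  // correct: a fresh instance sharing the current buffer reference
  SketchReader duplicate() {
    SketchReader copy = new SketchReader();
    copy.setBuffer(buffer);
    return copy;
  }
}

public class DuplicateDemo {
  public static void main(String[] args) {
    ByteBuffer a = ByteBuffer.allocate(16).putLong(0, 7L);
    ByteBuffer b = ByteBuffer.allocate(16).putLong(0, 9L);

    SketchReader r1 = new SketchReader();
    r1.setBuffer(a);
    SketchReader r2 = r1.duplicate();
    r2.setBuffer(b); // must not affect r1

    System.out.println(r1.get(0)); // prints 7
    System.out.println(r2.get(0)); // prints 9
  }
}
```

If `duplicate` had returned `this`, the second `setBuffer` call would have redirected both readers to buffer `b`, which is the shape of the regression being fixed.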
@clintropolis clintropolis added this to the 0.22.0 milestone Aug 12, 2021
abhishekagarwal87 pushed a commit that referenced this pull request May 1, 2023
This PR fixes an issue when using 'auto' encoded LONG typed columns and the 'vectorized' query engine. These columns use a delta-based bit-packing mechanism, and errors in the vectorized reader would cause it to incorrectly read column values for some bit sizes (1 through 32 bits). This is a regression caused by #11004, which added the optimized readers to improve performance, so it impacts Druid versions 0.22.0+.

While writing the test I finally got sad enough about IndexSpec not having a "builder", so I made one, and switched all the things to use it. Apologies for the noise in this bug fix PR, the only real changes are in VSizeLongSerde, and the tests that have been modified to cover the buggy behavior, VSizeLongSerdeTest and ExpressionVectorSelectorsTest. Everything else is just cleanup of IndexSpec usage.
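For readers unfamiliar with the delta-based bit-packing these columns use: values are stored as offsets from a base value, packed into fixed-width bit slots, so any off-by-one in the unpacking arithmetic corrupts decoded values. A simplified sketch of the general idea under those assumptions (not Druid's actual VSizeLongSerde format):

```java
public class DeltaPackDemo {
  // pack each (value - base) delta into a numBits-wide slot of a long array
  static long[] pack(long[] values, long base, int numBits) {
    long[] out = new long[(values.length * numBits + 63) / 64];
    for (int i = 0; i < values.length; i++) {
      long delta = values[i] - base;
      int bit = i * numBits;
      out[bit / 64] |= delta << (bit % 64);
      if (bit % 64 + numBits > 64) {
        // slot straddles a word boundary: spill the high bits
        out[bit / 64 + 1] |= delta >>> (64 - bit % 64);
      }
    }
    return out;
  }

  // read back the i-th value: extract the delta, mask it, add the base
  static long unpack(long[] packed, long base, int numBits, int i) {
    int bit = i * numBits;
    long mask = numBits == 64 ? -1L : (1L << numBits) - 1;
    long delta = packed[bit / 64] >>> (bit % 64);
    if (bit % 64 + numBits > 64) {
      delta |= packed[bit / 64 + 1] << (64 - bit % 64);
    }
    return base + (delta & mask);
  }

  public static void main(String[] args) {
    long[] values = {100, 103, 101, 107, 100};
    long base = 100; // smallest value; deltas fit in 3 bits
    long[] packed = pack(values, base, 3);
    for (int i = 0; i < values.length; i++) {
      System.out.println(unpack(packed, base, 3, i));
    }
  }
}
```

Because every decoded value depends on the shift and mask arithmetic, a reader bug like the one fixed here yields silently wrong values rather than an error.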
clintropolis added a commit to clintropolis/druid that referenced this pull request May 1, 2023
clintropolis added a commit to clintropolis/druid that referenced this pull request May 5, 2023
clintropolis added a commit that referenced this pull request May 8, 2023