Arrow: Pad decimal bytes before passing to decimal vector #5168

Merged
merged 5 commits into apache:master on Jul 1, 2022

Conversation


@bryanck (Contributor) commented on Jun 30, 2022:

The vectorized reader benchmarks showed that the Iceberg Parquet vectorized reader falls behind Spark's when reading decimal types. Profiling the code revealed a bottleneck in an Arrow method that pads the byte buffer when setting a value in the DecimalVector, specifically this operation.

Runs of this benchmark showed that calling Unsafe.setMemory() can be slower than Java array operations. Results of a run are here.

This PR adds a workaround that pads the byte buffer before calling setBigEndian(), so that Unsafe.setMemory() is never called.
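A minimal sketch of the idea, modeled on the snippets quoted in the review threads below (the PR's actual helper is DecimalVectorUtil.padBigEndianBytes; details here may differ):

import java.util.Arrays;

import org.apache.arrow.vector.DecimalVector;

public class DecimalPadSketch {

  // Pad big-endian two's-complement bytes to newLength, sign-extending
  // negative values with 0xFF and non-negative values with 0x00.
  static byte[] padBigEndianBytes(byte[] bigEndianBytes, int newLength) {
    if (bigEndianBytes.length == newLength) {
      return bigEndianBytes;
    } else if (bigEndianBytes.length < newLength) {
      byte[] result = new byte[newLength];
      if (bigEndianBytes.length == 0) {
        return result;
      }
      if (bigEndianBytes[0] < 0) {
        // negative value: fill the leading pad bytes with the sign byte
        Arrays.fill(result, 0, newLength - bigEndianBytes.length, (byte) 0xFF);
      }
      System.arraycopy(
          bigEndianBytes, 0, result, newLength - bigEndianBytes.length, bigEndianBytes.length);
      return result;
    }
    throw new IllegalArgumentException(
        "Buffer of size " + bigEndianBytes.length + " is larger than requested size " + newLength);
  }

  // With a full-width input, DecimalVector.setBigEndian() only reverses and
  // copies the bytes; its Unsafe.setMemory() padding path is never taken.
  static void set(DecimalVector vector, int idx, byte[] parquetBigEndianBytes) {
    vector.setBigEndian(idx, padBigEndianBytes(parquetBigEndianBytes, DecimalVector.TYPE_WIDTH));
  }
}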

Here are the results of a run of the VectorizedReadDictionaryEncodedFlatParquetDataBenchmark benchmark without this change:

Benchmark                                                                                  Mode  Cnt   Score   Error  Units
VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readDatesIcebergVectorized5k         ss    5   2.016 ± 0.069   s/op
VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readDatesSparkVectorized5k           ss    5   2.083 ± 0.076   s/op
VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readDecimalsIcebergVectorized5k      ss    5  14.451 ± 0.273   s/op
VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readDecimalsSparkVectorized5k        ss    5   6.886 ± 0.163   s/op
VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readDoublesIcebergVectorized5k       ss    5   2.058 ± 0.108   s/op
VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readDoublesSparkVectorized5k         ss    5   1.731 ± 0.117   s/op
VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readFloatsIcebergVectorized5k        ss    5   1.905 ± 0.016   s/op
VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readFloatsSparkVectorized5k          ss    5   2.436 ± 0.178   s/op
VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readIntegersIcebergVectorized5k      ss    5   2.975 ± 0.053   s/op
VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readIntegersSparkVectorized5k        ss    5   2.461 ± 0.951   s/op
VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readLongsIcebergVectorized5k         ss    5   2.713 ± 0.075   s/op
VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readLongsSparkVectorized5k           ss    5   2.321 ± 0.953   s/op
VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readStringsIcebergVectorized5k       ss    5   3.154 ± 0.062   s/op
VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readStringsSparkVectorized5k         ss    5   4.567 ± 1.864   s/op
VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readTimestampsIcebergVectorized5k    ss    5   2.674 ± 0.085   s/op
VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readTimestampsSparkVectorized5k      ss    5   2.634 ± 0.089   s/op

Here are the results of a run with this change:

Benchmark                                                                                  Mode  Cnt  Score   Error  Units
VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readDatesIcebergVectorized5k         ss    5  2.339 ± 1.092   s/op
VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readDatesSparkVectorized5k           ss    5  2.204 ± 0.085   s/op
VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readDecimalsIcebergVectorized5k      ss    5  8.501 ± 0.129   s/op
VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readDecimalsSparkVectorized5k        ss    5  7.130 ± 0.111   s/op
VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readDoublesIcebergVectorized5k       ss    5  2.677 ± 0.083   s/op
VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readDoublesSparkVectorized5k         ss    5  2.251 ± 0.142   s/op
VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readFloatsIcebergVectorized5k        ss    5  2.616 ± 0.090   s/op
VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readFloatsSparkVectorized5k          ss    5  2.438 ± 0.074   s/op
VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readIntegersIcebergVectorized5k      ss    5  2.620 ± 0.171   s/op
VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readIntegersSparkVectorized5k        ss    5  2.242 ± 0.140   s/op
VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readLongsIcebergVectorized5k         ss    5  2.679 ± 0.084   s/op
VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readLongsSparkVectorized5k           ss    5  2.504 ± 0.173   s/op
VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readStringsIcebergVectorized5k       ss    5  3.804 ± 0.215   s/op
VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readStringsSparkVectorized5k         ss    5  4.864 ± 0.163   s/op
VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readTimestampsIcebergVectorized5k    ss    5  2.544 ± 0.086   s/op
VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readTimestampsSparkVectorized5k      ss    5  2.524 ± 0.193   s/op

The github-actions bot added the arrow label on Jun 30, 2022.
@bryanck (Contributor, Author) commented on Jun 30, 2022:

I'd be interested to see what results others are seeing with the benchmarks.

@danielcweeks merged commit 7e1ade8 into apache:master on Jul 1, 2022.
((DecimalVector) vector).setBigEndian(idx, vectorBytes);
ByteBuffer buffer = dict.decodeToBinary(currentVal).toByteBuffer();
vector.getDataBuffer().setBytes(idx, buffer);
A contributor commented:
@bryanck, was this really setting the value twice? It looks like it was calling setBigEndian on the vector and then setBytes on the backing buffer. That could explain a lot of the slowness as well?

@bryanck (Contributor, Author) replied:
It looks like that's what it was doing.

} else if (bigEndianBytes.length < newLength) {
  byte[] result = new byte[newLength];
  if (bigEndianBytes.length == 0) {
    return result;
@rdblue (Contributor) commented on Jul 3, 2022:
@bryanck, is this hit? It looks like an invalid case because the decimal precision would need to be 0, but we're choosing to return 0 for it.

@bryanck (Contributor, Author) replied:
Probably not; I was mimicking the behavior in DecimalVector.setBigEndian() to be on the safe side.

byte[] vectorBytes =
    DecimalVectorUtil.padBigEndianBytes(
        dict.decodeToBinary(currentVal).getBytesUnsafe(),
        DecimalVector.TYPE_WIDTH);
A contributor commented:
Is typeWidth going to be the same as DecimalVector.TYPE_WIDTH?

@bryanck (Contributor, Author) replied on Jul 4, 2022:
typeWidth is the Parquet width, I believe, which varies depending on the precision of the decimal, but the Arrow width is always 16 bytes.
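For illustration, a sketch of the difference, assuming Parquet's rule of sizing decimals to the minimum bytes that can hold 10^precision - 1 as a signed two's-complement value (minBytesForPrecision is a hypothetical helper, not Iceberg API):

import java.math.BigInteger;

import org.apache.arrow.vector.DecimalVector;

public class DecimalWidthSketch {

  // Minimum bytes needed to hold any unscaled value of the given precision.
  static int minBytesForPrecision(int precision) {
    BigInteger maxUnscaled = BigInteger.TEN.pow(precision).subtract(BigInteger.ONE);
    return (maxUnscaled.bitLength() + 1 + 7) / 8; // +1 for the sign bit
  }

  public static void main(String[] args) {
    System.out.println(minBytesForPrecision(20)); // 9: Parquet typeWidth for decimal(20, 5)
    System.out.println(DecimalVector.TYPE_WIDTH); // 16: Arrow's fixed slot width
  }
}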

@@ -358,7 +358,8 @@ class FixedLengthDecimalReader extends BaseReader {
   protected void nextVal(
       FieldVector vector, int idx, ValuesAsBytesReader valuesReader, int typeWidth, byte[] byteArray) {
     valuesReader.getBuffer(typeWidth).get(byteArray, 0, typeWidth);
-    ((DecimalVector) vector).setBigEndian(idx, byteArray);
+    byte[] vectorBytes = DecimalVectorUtil.padBigEndianBytes(byteArray, DecimalVector.TYPE_WIDTH);
A contributor commented:
This looks like a place where we could reuse a buffer rather than allocating in padBigEndianBytes every time.

@bryanck (Contributor, Author) replied:
I did some testing, and reusing the buffer was a little slower, partly because we always need to fill the buffer to zero out the previous value.

@bryanck (Contributor, Author) added:
One thing that was a little faster was to bypass DecimalVector.setBigEndian(), convert to little endian (if needed), and copy the bytes directly into the value buffer.
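A sketch of that alternative on a little-endian platform (illustrative only, not the code that was merged):

import org.apache.arrow.memory.ArrowBuf;
import org.apache.arrow.vector.DecimalVector;

public class DirectCopySketch {

  // Reverse an already padded 16-byte big-endian value to little endian and
  // copy it straight into the vector's value buffer, bypassing setBigEndian().
  static void setFromBigEndian(DecimalVector vector, int idx, byte[] paddedBigEndian) {
    byte[] littleEndian = new byte[DecimalVector.TYPE_WIDTH];
    for (int i = 0; i < DecimalVector.TYPE_WIDTH; i++) {
      littleEndian[i] = paddedBigEndian[DecimalVector.TYPE_WIDTH - 1 - i];
    }
    ArrowBuf dataBuffer = vector.getDataBuffer();
    dataBuffer.setBytes((long) idx * DecimalVector.TYPE_WIDTH, littleEndian);
    vector.setIndexDefined(idx); // still need to mark the slot non-null
  }
}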

@bryanck (Contributor, Author) added:
Also, one thing to note: the benchmark isn't quite right. Decimal(20,5) ends up taking 9 bytes and thus uses a fixed-length byte array instead of long or int encoding, and fixed-length byte arrays aren't dictionary encoded in Parquet v1. That explains why the decimal benchmark is much slower than the other data types (which are dictionary encoded).

@bryanck (Contributor, Author) added:
(It looks like dictionary encoding for fixed length byte arrays wouldn't work correctly anyway, I may follow up with a fix for that)

@bryanck (Contributor, Author) added on Jul 4, 2022:
On second thought about reusing the buffer: we could create a buffer per value reader, so the width of the value is always the same, and then skip the array fill (using two buffers, one for negative values and one for positive values).
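A sketch of that idea (hypothetical, not part of this PR), assuming every call within a reader sees the same byte length, and relying on setBigEndian() copying the array so the shared buffers can be reused safely:

import java.util.Arrays;

import org.apache.arrow.vector.DecimalVector;

public class ReusablePadBuffersSketch {

  // Since typeWidth is constant for a column, every value overwrites the same
  // tail region, so the leading pad bytes never need re-filling; the sign of
  // each value just selects which pre-filled buffer to use.
  private final byte[] positivePad = new byte[DecimalVector.TYPE_WIDTH]; // zeros by default
  private final byte[] negativePad = new byte[DecimalVector.TYPE_WIDTH];

  ReusablePadBuffersSketch() {
    Arrays.fill(negativePad, (byte) 0xFF); // sign extension for negative values
  }

  // Assumes bigEndianBytes.length is the same on every call (one reader per column).
  byte[] pad(byte[] bigEndianBytes) {
    byte[] result =
        bigEndianBytes.length > 0 && bigEndianBytes[0] < 0 ? negativePad : positivePad;
    System.arraycopy(
        bigEndianBytes, 0, result, result.length - bigEndianBytes.length, bigEndianBytes.length);
    return result;
  }
}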

@bryanck (Contributor, Author) added:
Here's the PR that has a fix for the dictionary encoding

@@ -369,9 +370,10 @@ protected void nextDictEncodedVal(
       reader.fixedLengthDecimalDictEncodedReader()
           .nextBatch(vector, idx, numValuesToRead, dict, nullabilityHolder, typeWidth);
     } else if (Mode.PACKED.equals(mode)) {
-      ByteBuffer decimalBytes = dict.decodeToBinary(reader.readInteger()).toByteBuffer();
-      byte[] vectorBytes = new byte[typeWidth];
-      System.arraycopy(decimalBytes, 0, vectorBytes, 0, typeWidth);
A contributor commented:
Was this correct before? It looks like it was trying to use System.arraycopy with a ByteBuffer!

@bryanck (Contributor, Author) replied:
I believe this would have thrown an ArrayStoreException.
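A standalone demonstration of the failure mode (not code from the PR): System.arraycopy declares its src and dest parameters as Object, so passing a ByteBuffer compiles but fails at runtime.

import java.nio.ByteBuffer;

public class ArrayCopyBugDemo {
  public static void main(String[] args) {
    ByteBuffer decimalBytes = ByteBuffer.wrap(new byte[] {1, 2, 3, 4});
    byte[] vectorBytes = new byte[4];
    // Compiles, but throws java.lang.ArrayStoreException at runtime,
    // because the source is a ByteBuffer, not an array.
    System.arraycopy(decimalBytes, 0, vectorBytes, 0, 4);
  }
}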

namrathamyske pushed a commit to namrathamyske/iceberg that referenced this pull request on Jul 10, 2022:
* Arrow: Pad decimal bytes before passing to vector

* comment clarification

* optimize fill for neg numbers

* Add overflow check