Background
DeltaByteArrayWriter.writeBytes(Binary) is the per-value entry point for the DELTA_BYTE_ARRAY encoding. For each value it computes the common prefix with the previous value and forwards the suffix to a DeltaLengthByteArrayValuesWriter. The suffix path currently does:
suffixWriter.writeBytes(v.slice(i, vb.length - i));
which:
- Allocates a ByteArraySliceBackedBinary wrapper for the suffix (Binary.slice()).
- Dispatches through Binary.writeTo(OutputStream) → out.write(bytes, offset, length).
- Routes the bytes, inside DeltaLengthByteArrayValuesWriter.writeBytes(Binary), through a LittleEndianDataOutputStream wrapper that adds no useful work for byte[] writes (it only matters for writeInt/writeLong/writeShort).
For short strings, the per-value Binary.slice() allocation and the wrapper indirection dominate the actual work of copying a few bytes.
DeltaLengthByteArrayValuesWriter itself wraps its CapacityByteArrayOutputStream with a LittleEndianDataOutputStream that is only used for Binary.writeTo(). The wrapper therefore adds an extra layer of dispatch on every value without ever using the little-endian functionality (writeInt/writeLong/etc.).
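The shape of the current suffix path can be sketched without any Parquet classes. The sketch below is an illustration only: SliceView, ForwardingStream and CurrentSuffixPathSketch are hypothetical stand-ins (for the slice-backed Binary, the stream wrapper, and the writer respectively), not the real implementations, and the length bookkeeping is left out. It shows the one wrapper object allocated per value and the extra forwarding hop before the bytes reach the buffer.

```java
import java.io.ByteArrayOutputStream;
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;

// Hypothetical stand-in for a slice-backed Binary: a view over (bytes, offset, length).
final class SliceView {
    private final byte[] bytes;
    private final int offset;
    private final int length;

    SliceView(byte[] bytes, int offset, int length) {
        this.bytes = bytes;
        this.offset = offset;
        this.length = length;
    }

    void writeTo(OutputStream out) throws IOException {
        out.write(bytes, offset, length);
    }
}

// Hypothetical stand-in for the stream wrapper: for byte[] writes it only forwards.
final class ForwardingStream extends FilterOutputStream {
    ForwardingStream(OutputStream out) {
        super(out);
    }

    @Override
    public void write(byte[] b, int off, int len) throws IOException {
        out.write(b, off, len);
    }
}

// Hypothetical model of the current write path (suffix bytes only, no lengths).
final class CurrentSuffixPathSketch {
    private final ByteArrayOutputStream sink = new ByteArrayOutputStream();
    private final OutputStream wrapped = new ForwardingStream(sink);
    private byte[] previous = new byte[0];

    void write(byte[] vb) {
        // Length of the common prefix with the previous value.
        int n = Math.min(previous.length, vb.length);
        int i = 0;
        while (i < n && previous[i] == vb[i]) {
            i++;
        }
        // One wrapper object is allocated for every value written.
        SliceView suffix = new SliceView(vb, i, vb.length - i);
        try {
            // view -> forwarding wrapper -> buffer
            suffix.writeTo(wrapped);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        previous = vb;
    }

    byte[] suffixBytes() {
        return sink.toByteArray();
    }
}
```

For a value such as "apply" written after "apple", only the single byte "y" is copied, so the allocation and the two dispatch hops are plausibly the larger share of the per-value cost.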
Proposal
Two related changes, both in the delta byte-array write path:
- DeltaLengthByteArrayValuesWriter: drop the unused LittleEndianDataOutputStream wrapper. Binary.writeTo(arrayOut) works directly with the underlying CapacityByteArrayOutputStream. Add a new overload:
  public void writeBytes(byte[] data, int offset, int length) {
    lengthWriter.writeInteger(length);
    arrayOut.write(data, offset, length);
  }
  for callers that already have the raw bytes and don't want to allocate a Binary wrapper. The overload has to be public rather than package-private, because its caller, DeltaByteArrayWriter, lives in a different package (deltastrings, not deltalengthbytearray).
- DeltaByteArrayWriter: tighten the suffixWriter field type to DeltaLengthByteArrayValuesWriter (it's always constructed as one) so the new writeBytes(byte[], int, int) overload is callable. Replace the suffix call with the raw-bytes overload:
  suffixWriter.writeBytes(vb, i, vb.length - i);
  eliminating the per-value Binary.slice() allocation.
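The two changes together can be sketched as follows. This is a minimal model under stated assumptions, not the actual patch: LengthAndBytesSink is a hypothetical stand-in for DeltaLengthByteArrayValuesWriter (a plain list replaces the real length writer, and java.io.ByteArrayOutputStream replaces CapacityByteArrayOutputStream), and ProposedSuffixPathSketch is a hypothetical stand-in for DeltaByteArrayWriter.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for DeltaLengthByteArrayValuesWriter.
final class LengthAndBytesSink {
    final List<Integer> lengths = new ArrayList<>();
    private final ByteArrayOutputStream arrayOut = new ByteArrayOutputStream();

    // Raw-bytes overload: records the length, then copies straight into the buffer.
    void writeBytes(byte[] data, int offset, int length) {
        lengths.add(length);
        arrayOut.write(data, offset, length);
    }

    byte[] bytes() {
        return arrayOut.toByteArray();
    }
}

// Hypothetical stand-in for DeltaByteArrayWriter.
final class ProposedSuffixPathSketch {
    final List<Integer> prefixLengths = new ArrayList<>();
    // Declared with the concrete type so the raw-bytes overload is visible.
    final LengthAndBytesSink suffixWriter = new LengthAndBytesSink();
    private byte[] previous = new byte[0];

    void write(byte[] vb) {
        // Length of the common prefix with the previous value.
        int n = Math.min(previous.length, vb.length);
        int i = 0;
        while (i < n && previous[i] == vb[i]) {
            i++;
        }
        prefixLengths.add(i);
        // No intermediate slice object: pass the array and the range directly.
        suffixWriter.writeBytes(vb, i, vb.length - i);
        // The sketch keeps a reference, so callers must not mutate vb afterwards.
        previous = vb;
    }
}
```

Writing "apple", "apply", "banana" through this sketch records prefix lengths 0, 4, 0 and suffix lengths 5, 1, 6, with no per-value wrapper object on the suffix path.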
Expected impact
From local benchmarks (BinaryEncodingBenchmark.encodeDeltaByteArray and BinaryEncodingBenchmark.encodeDeltaLengthByteArray, both being added in #3512):
- encodeDeltaByteArray (short strings, low cardinality): +23% to +33%
- encodeDeltaLengthByteArray (short strings, low cardinality): +16% to +18%
- Long-string cases: flat (the per-value allocation is negligible next to the cost of copying the bytes)
Files affected
parquet-column/src/main/java/org/apache/parquet/column/values/deltastrings/DeltaByteArrayWriter.java
parquet-column/src/main/java/org/apache/parquet/column/values/deltalengthbytearray/DeltaLengthByteArrayValuesWriter.java
No change to existing public API (the new writeBytes overload is purely additive). No file format change.