Optimize DeltaByteArrayWriter and DeltaLengthByteArrayValuesWriter: remove per-value allocation and LittleEndianDataOutputStream wrapper #3516

@iemejia

Background

DeltaByteArrayWriter.writeBytes(Binary) is the per-value entry point for the DELTA_BYTE_ARRAY encoding. For each value it computes the common prefix with the previous value and forwards the suffix to a DeltaLengthByteArrayValuesWriter. The suffix path currently does:

suffixWriter.writeBytes(v.slice(i, vb.length - i));

which:

  1. Allocates a ByteArraySliceBackedBinary wrapper for the suffix (Binary.slice()).
  2. Dispatches through Binary.writeTo(OutputStream), which in turn calls out.write(bytes, offset, length).
  3. Inside DeltaLengthByteArrayValuesWriter.writeBytes(Binary), the value first goes through a LittleEndianDataOutputStream wrapper that adds no useful work for byte[] writes (it only matters for writeInt/writeLong/writeShort).

For short strings, the per-value Binary.slice() allocation and the wrapper indirection dominate the actual work of copying a few bytes.

DeltaLengthByteArrayValuesWriter itself wraps its CapacityByteArrayOutputStream in a LittleEndianDataOutputStream that is used only for Binary.writeTo(). The wrapper adds an extra layer of dispatch on every value but never uses any of its little-endian functionality (writeInt, writeLong, and so on).
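
To illustrate why the wrapper is dead weight for byte[] writes, here is a minimal sketch. It does not use the Parquet classes: ForwardingStream and WrapperSketch are hypothetical stand-ins, with a plain ByteArrayOutputStream in place of CapacityByteArrayOutputStream. It assumes only that the wrapper's write(byte[], int, int) delegates unchanged, which is the behavior described above.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical stand-in for a wrapper stream that only forwards byte[] writes. */
final class ForwardingStream extends OutputStream {
  private final OutputStream out;

  ForwardingStream(OutputStream out) {
    this.out = out;
  }

  @Override
  public void write(int b) throws IOException {
    out.write(b);
  }

  @Override
  public void write(byte[] b, int off, int len) throws IOException {
    out.write(b, off, len); // pure delegation, no transformation of the bytes
  }
}

final class WrapperSketch {
  /** Writes every value through the extra wrapper layer. */
  static byte[] writeWrapped(byte[][] values) {
    ByteArrayOutputStream sink = new ByteArrayOutputStream();
    OutputStream wrapped = new ForwardingStream(sink);
    try {
      for (byte[] v : values) {
        wrapped.write(v, 0, v.length);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return sink.toByteArray();
  }

  /** Writes every value straight to the underlying stream. */
  static byte[] writeDirect(byte[][] values) {
    ByteArrayOutputStream sink = new ByteArrayOutputStream();
    for (byte[] v : values) {
      sink.write(v, 0, v.length);
    }
    return sink.toByteArray();
  }

  public static void main(String[] args) {
    byte[][] values = {
      "ab".getBytes(StandardCharsets.UTF_8),
      "cde".getBytes(StandardCharsets.UTF_8)
    };
    // Both paths produce the same bytes; the wrapper only adds a call per value.
    System.out.println(Arrays.equals(writeWrapped(values), writeDirect(values))); // prints true
  }
}
```

Because the output is byte-for-byte identical, dropping the wrapper changes only the call depth per value, not the encoded data.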

Proposal

Two related changes, both in the delta byte-array write path:

  1. DeltaLengthByteArrayValuesWriter: drop the unused LittleEndianDataOutputStream wrapper. Binary.writeTo(arrayOut) works directly with the underlying CapacityByteArrayOutputStream. Add a new public overload (it has to be public, because DeltaByteArrayWriter lives in a different package, deltastrings):

    public void writeBytes(byte[] data, int offset, int length) {
      lengthWriter.writeInteger(length);
      arrayOut.write(data, offset, length);
    }

    for callers that already have the raw bytes and don't want to allocate a Binary wrapper.

  2. DeltaByteArrayWriter: tighten the suffixWriter field type to DeltaLengthByteArrayValuesWriter (it's always constructed as one) so the new writeBytes(byte[], int, int) overload is callable. Replace the suffix call with the raw-bytes overload:

    suffixWriter.writeBytes(vb, i, vb.length - i);

    eliminating the per-value Binary.slice() allocation.
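
The two changes above can be sketched together in a simplified, self-contained model. This is not the Parquet implementation: DeltaSuffixSketch and all of its members are hypothetical, plain lists stand in for the prefix-length and suffix-length writers, and a ByteArrayOutputStream stands in for arrayOut. It only shows the shape of the idea, which is to compute the common prefix and then copy the suffix straight out of the value's backing array, without creating a slice object.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified model of the delta byte-array write path. */
final class DeltaSuffixSketch {
  private final List<Integer> prefixLengths = new ArrayList<>();
  private final List<Integer> suffixLengths = new ArrayList<>();
  private final ByteArrayOutputStream suffixOut = new ByteArrayOutputStream();
  private byte[] previous = new byte[0];

  /** Returns the number of leading bytes that a and b share. */
  static int commonPrefix(byte[] a, byte[] b) {
    int limit = Math.min(a.length, b.length);
    int i = 0;
    while (i < limit && a[i] == b[i]) {
      i++;
    }
    return i;
  }

  /** Analogue of the raw-bytes overload: record the length, then copy the range. */
  private void writeSuffix(byte[] data, int offset, int length) {
    suffixLengths.add(length);
    suffixOut.write(data, offset, length);
  }

  /** Encodes one value against the previous one. The caller must not mutate vb afterwards. */
  void write(byte[] vb) {
    int i = commonPrefix(previous, vb);
    prefixLengths.add(i);
    writeSuffix(vb, i, vb.length - i); // no intermediate slice object is allocated
    previous = vb;
  }

  List<Integer> prefixLengths() {
    return prefixLengths;
  }

  List<Integer> suffixLengths() {
    return suffixLengths;
  }

  byte[] suffixBytes() {
    return suffixOut.toByteArray();
  }

  public static void main(String[] args) {
    DeltaSuffixSketch w = new DeltaSuffixSketch();
    for (String s : new String[] {"apple", "apply", "banana"}) {
      w.write(s.getBytes(StandardCharsets.UTF_8));
    }
    System.out.println(w.prefixLengths()); // prints [0, 4, 0]
    System.out.println(w.suffixLengths()); // prints [5, 1, 6]
    System.out.println(new String(w.suffixBytes(), StandardCharsets.UTF_8)); // prints appleybanana
  }
}
```

In this model the only per-value work left is the prefix comparison, two length records, and one array-range copy, which is why the saving would be expected to show up mostly on short values.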

Expected impact

From local benchmarks (BinaryEncodingBenchmark.encodeDeltaByteArray and BinaryEncodingBenchmark.encodeDeltaLengthByteArray, both being added in #3512):

  • encodeDeltaByteArray (short strings, low cardinality): +23% to +33%
  • encodeDeltaLengthByteArray (short strings, low cardinality): +16% to +18%
  • Long-string cases: flat (the per-value allocation is amortized over the larger copy)

Files affected

  • parquet-column/src/main/java/org/apache/parquet/column/values/deltastrings/DeltaByteArrayWriter.java
  • parquet-column/src/main/java/org/apache/parquet/column/values/deltalengthbytearray/DeltaLengthByteArrayValuesWriter.java

No breaking public API change (the new writeBytes(byte[], int, int) overload is purely additive). No file format change.
