GH-3520: Cleanup binary write path (DeltaByteArrayWriter copy + FLBA LE wrapper) #3521

Open
iemejia wants to merge 1 commit into apache:master from iemejia:perf-binary-write-cleanup


iemejia (Member) commented Apr 21, 2026

Summary

Resolves #3520.

Two small cleanups in the binary write path.

1. DeltaByteArrayWriter: avoid unconditional copy of input bytes

The first line of writeBytes(Binary v) is:

byte[] vb = v.getBytes();

Binary.getBytes() is contractually required to return a fresh array the caller can keep and mutate. For ByteArrayBackedBinary (the most common case from Binary.fromConstantByteArray() and similar), the implementation does an unconditional Arrays.copyOf of the backing array — but DeltaByteArrayWriter only reads vb and then drops it. The copy is wasted work on every value.

The right call here is v.copy().getBytesUnsafe():

  • Binary.copy() is a no-op (return this) for constant Binaries that are already independent of any reused buffer.
  • For reused-buffer Binaries (e.g. ByteBufferBackedBinary over a slab being mutated), copy() snapshots them — preserving correctness.
  • getBytesUnsafe() then returns the backing array directly without a defensive copy.

For the common ByteArrayBackedBinary case this skips the entire copy. For other implementations the copy still happens but only when it's actually needed for correctness.
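The difference between the two call paths can be sketched with a small stand-in class. This is not parquet's Binary: SketchBinary and its backingReused flag are hypothetical names that model only the behavior described above (getBytes() always copies, copy() is a no-op for constant values, getBytesUnsafe() exposes the backing array).

```java
import java.util.Arrays;

// Hypothetical stand-in for parquet's Binary; it models only the three methods discussed above.
class BinaryCopySketch {

    static final class SketchBinary {
        private final byte[] backing;
        private final boolean backingReused; // assumed flag: true if the producer may overwrite 'backing'

        SketchBinary(byte[] backing, boolean backingReused) {
            this.backing = backing;
            this.backingReused = backingReused;
        }

        // Always returns a fresh array that the caller may keep and mutate.
        byte[] getBytes() {
            return Arrays.copyOf(backing, backing.length);
        }

        // Returns the backing array itself; the caller must treat it as read-only.
        byte[] getBytesUnsafe() {
            return backing;
        }

        // No-op for constant values; snapshots values whose backing array may be reused.
        SketchBinary copy() {
            return backingReused ? new SketchBinary(getBytes(), false) : this;
        }
    }

    public static void main(String[] args) {
        byte[] constantBytes = {1, 2, 3};
        SketchBinary constant = new SketchBinary(constantBytes, false);

        // Old path: one allocation and one copy per value.
        System.out.println(constant.getBytes() == constantBytes);              // false
        // New path: a constant value hands back its backing array, no copy.
        System.out.println(constant.copy().getBytesUnsafe() == constantBytes); // true

        byte[] reusedBuffer = {4, 5, 6};
        SketchBinary reused = new SketchBinary(reusedBuffer, true);
        byte[] snapshot = reused.copy().getBytesUnsafe();
        reusedBuffer[0] = 9;             // the producer overwrites its buffer
        System.out.println(snapshot[0]); // 4, the snapshot is unaffected
    }
}
```

In this model the reused-buffer case still pays for exactly one copy, which is the cost the old path paid for every value regardless of its backing.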

2. FixedLenByteArrayPlainValuesWriter: drop the unused LittleEndianDataOutputStream wrapper

This is the same pattern that #3517 fixed in DeltaLengthByteArrayValuesWriter. The writer wraps its CapacityByteArrayOutputStream in a LittleEndianDataOutputStream that is only used to call Binary.writeTo(). The LE wrapper adds a layer of dispatch on every value but never uses any LE-specific functionality (writeInt, writeLong, etc.); Binary.writeTo(arrayOut) works directly with the underlying stream.

The trailing out.flush() in getBytes() is also dead — CapacityByteArrayOutputStream doesn't buffer.
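A rough illustration of why the wrapper is redundant, using only JDK classes as stand-ins: java.io.ByteArrayOutputStream plays the role of CapacityByteArrayOutputStream and java.io.DataOutputStream plays the role of LittleEndianDataOutputStream. Both substitutions are assumptions made for the sketch, and the class and method names are hypothetical. When only raw byte writes pass through, the wrapper changes nothing about the output.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;

// Stand-in sketch: JDK streams replace the parquet stream classes named in the text.
class WrapperDispatchSketch {

    // Before: every value goes through a wrapper that simply forwards the bytes.
    static byte[] writeThroughWrapper(byte[][] values) {
        ByteArrayOutputStream arrayOut = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(arrayOut);
        try {
            for (byte[] v : values) {
                out.write(v, 0, v.length); // forwarded as-is; no endian-specific method is called
            }
            out.flush(); // nothing is buffered in either stream, so this has no effect on the result
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return arrayOut.toByteArray();
    }

    // After: write straight to the in-memory stream.
    static byte[] writeDirect(byte[][] values) {
        ByteArrayOutputStream arrayOut = new ByteArrayOutputStream();
        for (byte[] v : values) {
            arrayOut.write(v, 0, v.length);
        }
        return arrayOut.toByteArray();
    }

    public static void main(String[] args) {
        byte[][] values = {{1, 2, 3, 4}, {5, 6, 7, 8}};
        // Both paths produce the same bytes.
        System.out.println(Arrays.equals(writeThroughWrapper(values), writeDirect(values))); // true
    }
}
```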

Benchmark

BinaryEncodingBenchmark.encodeDeltaByteArray (short strings): roughly +5% to +10% standalone. The per-value getBytes() copy is one of several overheads; this PR removes it, and the gain stacks with #3517, which removed the suffix-side allocation.

FixedLenByteArrayPlainValuesWriter: code-quality cleanup; removes a per-value layer of dispatch and an unnecessary flush() call. No headline benchmark.

Validation

  • parquet-column: 573 tests pass
  • Built with -Dspotless.check.skip=true -Drat.skip=true -Djapicmp.skip=true

User-facing changes

None. No public API change. No file format change.

Closes #3520

Part of a small series of focused performance PRs from work in parquet-perf. Previous: #3494, #3496, #3500, #3504, #3506, #3510, #3514, #3517, #3519. Companion benchmarks contribution: #3512.

Two small cleanups on the binary write side:

1. DeltaByteArrayWriter: replace v.getBytes() with v.copy().getBytesUnsafe()
   to avoid the unconditional Arrays.copyOf that getBytes() performs for
   ByteArrayBackedBinary. copy() is a no-op for constant Binaries, and
   getBytesUnsafe() returns the backing array directly. For reused-buffer
   Binaries (e.g. ByteBufferBackedBinary over a slab being mutated), copy()
   still snapshots them so correctness is preserved.

2. FixedLenByteArrayPlainValuesWriter: drop the unused LittleEndianDataOutputStream
   wrapper (only used to call Binary.writeTo(), which works directly with
   the underlying CapacityByteArrayOutputStream). The trailing out.flush()
   in getBytes() is also dead. Same pattern as apache#3517 fixed in
   DeltaLengthByteArrayValuesWriter.

No public API change. No file format change.

Validation: parquet-column 573 tests pass. Built with
-Dspotless.check.skip=true -Drat.skip=true -Djapicmp.skip=true.

Review comment on the changed lines in DeltaByteArrayWriter:

// copy() is a no-op for constant (non-reused) Binaries, and getBytesUnsafe()
// returns the backing array directly for ByteArrayBackedBinary — avoiding
// the unconditional array copy that getBytes() always performs.
byte[] vb = v.copy().getBytesUnsafe();

Contributor commented:

I already address this area in PR #3465. Could you review it and reduce this PR to the part that is not addressed there?

Development

Successfully merging this pull request may close these issues.

Cleanup binary write path: avoid Binary.getBytes() copy in DeltaByteArrayWriter and remove LE wrapper from FixedLenByteArrayPlainValuesWriter
