GH-3520: Cleanup binary write path (DeltaByteArrayWriter copy + FLBA LE wrapper)#3521
Open
iemejia wants to merge 1 commit intoapache:masterfrom
Open
GH-3520: Cleanup binary write path (DeltaByteArrayWriter copy + FLBA LE wrapper)#3521iemejia wants to merge 1 commit intoapache:masterfrom
iemejia wants to merge 1 commit intoapache:masterfrom
Conversation
Two small cleanups on the binary write side: 1. DeltaByteArrayWriter: replace v.getBytes() with v.copy().getBytesUnsafe() to avoid the unconditional Arrays.copyOf that getBytes() performs for ByteArrayBackedBinary. copy() is a no-op for constant Binaries, and getBytesUnsafe() returns the backing array directly. For reused-buffer Binaries (e.g. ByteBufferBackedBinary over a slab being mutated), copy() still snapshots them so correctness is preserved. 2. FixedLenByteArrayPlainValuesWriter: drop the unused LittleEndianDataOutputStream wrapper (only used to call Binary.writeTo(), which works directly with the underlying CapacityByteArrayOutputStream). The trailing out.flush() in getBytes() is also dead. Same pattern as apache#3517 fixed in DeltaLengthByteArrayValuesWriter. No public API change. No file format change. Validation: parquet-column 573 tests pass. Built with -Dspotless.check.skip=true -Drat.skip=true -Djapicmp.skip=true.
arouel
reviewed
Apr 21, 2026
| // copy() is a no-op for constant (non-reused) Binaries, and getBytesUnsafe() | ||
| // returns the backing array directly for ByteArrayBackedBinary — avoiding | ||
| // the unconditional array copy that getBytes() always performs. | ||
| byte[] vb = v.copy().getBytesUnsafe(); |
Contributor
There was a problem hiding this comment.
I address this area already in PR #3465. Can review it and reduce that one to the part which is not addressed there?
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Resolves #3520.
Two small cleanups in the binary write path.
1.
DeltaByteArrayWriter: avoid unconditional copy of input bytesThe first line of
writeBytes(Binary v)is:Binary.getBytes()is contractually required to return a fresh array the caller can keep and mutate. ForByteArrayBackedBinary(the most common case fromBinary.fromConstantByteArray()and similar), the implementation does an unconditionalArrays.copyOfof the backing array — butDeltaByteArrayWriteronly readsvband then drops it. The copy is wasted work on every value.The right call here is
v.copy().getBytesUnsafe():Binary.copy()is a no-op (return this) for constant Binaries that are already independent of any reused buffer.ByteBufferBackedBinaryover a slab being mutated),copy()snapshots them — preserving correctness.getBytesUnsafe()then returns the backing array directly without a defensive copy.For the common
ByteArrayBackedBinarycase this skips the entire copy. For other implementations the copy still happens but only when it's actually needed for correctness.2.
FixedLenByteArrayPlainValuesWriter: drop the unusedLittleEndianDataOutputStreamwrapperSame pattern as #3517 fixed in
DeltaLengthByteArrayValuesWriter: the writer wraps itsCapacityByteArrayOutputStreamwith aLittleEndianDataOutputStreamthat's only used to callBinary.writeTo()— i.e. the LE wrapper adds a layer of dispatch on every value but never uses any LE-specific functionality (writeInt/writeLong/etc.).Binary.writeTo(arrayOut)works directly with the underlying stream.The trailing
out.flush()ingetBytes()is also dead —CapacityByteArrayOutputStreamdoesn't buffer.Benchmark
BinaryEncodingBenchmark.encodeDeltaByteArray(short strings): roughly +5% to +10% standalone — the per-valuegetBytes()copy is one of several overheads; this PR removes one of them, stacking with #3517 which removed the suffix-side allocation.FixedLenByteArrayPlainValuesWriter: code-quality cleanup; removes a per-value layer of dispatch and an unnecessaryflush()call. No headline benchmark.Validation
parquet-column: 573 tests pass-Dspotless.check.skip=true -Drat.skip=true -Djapicmp.skip=trueUser-facing changes
None. No public API change. No file format change.
Closes #3520
Part of a small series of focused performance PRs from work in parquet-perf. Previous: #3494, #3496, #3500, #3504, #3506, #3510, #3514, #3517, #3519. Companion benchmarks contribution: #3512.