GH-3493: Optimize PlainValuesReader with direct ByteBuffer reads#3494

Open
iemejia wants to merge 2 commits into apache:master from iemejia:perf-plain-values-reader

Conversation

iemejia (Member) commented Apr 19, 2026

Rationale for this change

Closes #3493.

PlainValuesReader (used for PLAIN-encoded INT32, INT64, FLOAT, and DOUBLE columns, and for decoding the dictionary page of every dictionary-encoded numeric column) currently reads each value through a LittleEndianDataInputStream wrapper around a ByteBufferInputStream. Per value, readInt() performs 4 separate virtual in.read() calls and reassembles the result with bit shifts. The LittleEndianDataInputStream.readInt() method itself carries a TODO comment from years ago suggesting exactly this kind of replacement.
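To make the contrast concrete, here is an illustrative sketch (not the Parquet source; class and method names are invented for this example) of the per-value decode that a stream-based little-endian readInt() performs, next to the single ByteBuffer accessor that replaces it:

```java
import java.io.ByteArrayInputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Illustrative comparison of the two decode paths. In the real code the
// stream side is a generic InputStream with IOException handling; a
// ByteArrayInputStream is used here to keep the sketch self-contained.
public class LittleEndianDemo {

  // Roughly what the stream-based readInt() does per value: four
  // separate read() dispatches plus manual little-endian reassembly.
  static int readIntViaStream(ByteArrayInputStream in) {
    int ch1 = in.read();
    int ch2 = in.read();
    int ch3 = in.read();
    int ch4 = in.read();
    return (ch4 << 24) | (ch3 << 16) | (ch2 << 8) | ch1;
  }

  // The replacement: one bounds-checked, intrinsified load.
  static int readIntViaBuffer(ByteBuffer buf) {
    return buf.getInt();
  }

  public static void main(String[] args) {
    byte[] page = {0x78, 0x56, 0x34, 0x12}; // 0x12345678 in little-endian
    int a = readIntViaStream(new ByteArrayInputStream(page));
    int b = readIntViaBuffer(ByteBuffer.wrap(page).order(ByteOrder.LITTLE_ENDIAN));
    if (a != 0x12345678 || b != 0x12345678) throw new AssertionError();
    System.out.println("both paths decode 0x" + Integer.toHexString(a));
  }
}
```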

What changes are included in this PR?

Commit 1: Optimize PlainValuesReader with direct ByteBuffer reads

In PlainValuesReader.initFromPage(), obtain the page data as a single contiguous ByteBuffer via stream.slice(stream.available()) with ByteOrder.LITTLE_ENDIAN, and call the corresponding ByteBuffer accessor directly per value:

  • readInteger() → buffer.getInt()
  • readLong() → buffer.getLong()
  • readFloat() → buffer.getFloat()
  • readDouble() → buffer.getDouble()
  • skip(n) → buffer.position(buffer.position() + n * typeSize)

ByteBuffer.getInt() with the appropriate byte order is a HotSpot intrinsic that compiles to a single unaligned load instruction on x86/ARM — no virtual dispatch, no per-byte assembly, no checked IOException on the per-value path. ByteBufferInputStream.slice() already handles both single-buffer (zero-copy view) and multi-buffer (single contiguous copy) cases transparently.
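A minimal sketch of the resulting reader shape, with simplified names; the real PlainValuesReader obtains its buffer via ByteBufferInputStream.slice(stream.available()) rather than wrapping a byte[] as this stand-in does:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Simplified sketch of the post-patch reader shape. The byte[]-based
// initFromPage and the class name are stand-ins, not the Parquet API.
public class PlainIntReaderSketch {
  private ByteBuffer buffer;

  // Stand-in for initFromPage(): view the page as one contiguous
  // little-endian buffer (the PR uses ByteBufferInputStream.slice).
  void initFromPage(byte[] pageBytes) {
    this.buffer = ByteBuffer.wrap(pageBytes).order(ByteOrder.LITTLE_ENDIAN);
  }

  int readInteger() {
    return buffer.getInt(); // one intrinsified unaligned load per value
  }

  // skip(n) for INT32: advance the position, decode nothing.
  void skip(int n) {
    buffer.position(buffer.position() + n * Integer.BYTES);
  }

  public static void main(String[] args) {
    ByteBuffer page = ByteBuffer.allocate(12).order(ByteOrder.LITTLE_ENDIAN);
    page.putInt(10).putInt(20).putInt(30);

    PlainIntReaderSketch reader = new PlainIntReaderSketch();
    reader.initFromPage(page.array());
    int first = reader.readInteger(); // 10
    reader.skip(1);                   // jumps over 20 without decoding it
    int third = reader.readInteger(); // 30
    if (first != 10 || third != 30) throw new AssertionError();
    System.out.println(first + ", " + third);
  }
}
```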

Commit 2: Deprecate LittleEndianDataInputStream and migrate last test usage

After commit 1, LittleEndianDataInputStream has no remaining production usages. This commit:

  • Adds @Deprecated and detailed javadoc pointing to the faster ByteBuffer + LITTLE_ENDIAN alternative
  • Migrates the only remaining usage in TestColumnChunkPageWriteStore.intValue() to BytesInput.toByteBuffer().order(LITTLE_ENDIAN).getInt(), eliminating a BytesInput → ByteArrayOutputStream → ByteArrayInputStream → LittleEndianDataInputStream round-trip

The class is left in place (only @Deprecated) for source/binary compatibility of any downstream code that may still reference it. It can be removed in a future major release.
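A hedged sketch of that test migration, using a plain byte[] in place of Parquet's BytesInput (the real call chain is BytesInput.toByteBuffer().order(LITTLE_ENDIAN).getInt()):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Illustrative before/after of the intValue() migration. The byte[]
// parameter stands in for the BytesInput the real test holds.
public class IntValueMigrationDemo {

  // Old pattern: round-trip the bytes through stream wrappers,
  // then reassemble the little-endian value byte by byte.
  static int intValueOld(byte[] bytes) {
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    baos.write(bytes, 0, bytes.length);
    ByteArrayInputStream in = new ByteArrayInputStream(baos.toByteArray());
    int ch1 = in.read(), ch2 = in.read(), ch3 = in.read(), ch4 = in.read();
    return (ch4 << 24) | (ch3 << 16) | (ch2 << 8) | ch1;
  }

  // New pattern: decode straight from a little-endian buffer view.
  static int intValueNew(byte[] bytes) {
    return ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).getInt();
  }

  public static void main(String[] args) {
    byte[] le = {42, 0, 0, 0};
    if (intValueOld(le) != intValueNew(le)) throw new AssertionError();
    System.out.println(intValueNew(le));
  }
}
```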

Benchmark results

IntEncodingBenchmark.decodePlain (100,000 INT32 values per invocation, JMH -wi 3 -i 5 -f 1):

  Pattern           Before (ops/s)   After (ops/s)   Speedup
  SEQUENTIAL          92,918,297    1,143,149,235    12.3x
  RANDOM              92,126,888    1,147,547,093    12.5x
  LOW_CARDINALITY     93,005,451    1,142,666,760    12.3x
  HIGH_CARDINALITY    93,312,596    1,144,681,876    12.3x

The speedup is consistent across data patterns because the bottleneck is entirely in the per-value dispatch overhead, not the data itself. All four numeric plain reader types (int, long, float, double) benefit equally.
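The claim that all four widths behave the same can be sketched with one little-endian buffer and one accessor per type (illustrative only; this class is not part of the patch):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Sketch: each PLAIN numeric width decodes through the same buffer
// with a single accessor call, so none of them does per-byte work.
public class PlainTypesDemo {
  public static void main(String[] args) {
    ByteBuffer page = ByteBuffer.allocate(4 + 8 + 4 + 8)
        .order(ByteOrder.LITTLE_ENDIAN);
    page.putInt(7).putLong(9L).putFloat(1.5f).putDouble(2.5d).flip();

    if (page.getInt() != 7) throw new AssertionError();
    if (page.getLong() != 9L) throw new AssertionError();
    if (page.getFloat() != 1.5f) throw new AssertionError();
    if (page.getDouble() != 2.5d) throw new AssertionError();
    System.out.println("int/long/float/double all decoded with one call each");
  }
}
```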

Are these changes tested?

Yes. All 573 parquet-column and 308 parquet-common tests pass. The migrated TestColumnChunkPageWriteStore.testColumnOrderV1 test passes; the two pre-existing getSubject failures in TestColumnChunkPageWriteStore on JDK 18+ are unrelated and reproduce on master without these changes.

Are there any user-facing changes?

LittleEndianDataInputStream is now @Deprecated. No behavioral or binary-compatibility changes for existing callers.

iemejia added 2 commits April 19, 2026 14:34
Replace the LittleEndianDataInputStream wrapper with direct ByteBuffer
access using LITTLE_ENDIAN byte order in PlainValuesReader. Each
read{Integer,Long,Float,Double}() previously dispatched through 4
in.read() calls per value and assembled the result with manual bit
shifts; it now compiles to a single ByteBuffer get*() JVM intrinsic.

In initFromPage, the page data is obtained as a single contiguous
ByteBuffer via ByteBufferInputStream.slice(available). The
ByteBufferInputStream.slice() method handles both single-buffer
(zero-copy view) and multi-buffer (copy into contiguous buffer) cases
transparently. In practice page data is almost always a single
contiguous buffer.

Benchmark (IntEncodingBenchmark.decodePlain, 100,000 INT32 values per
invocation):

  Pattern           Before (ops/s)   After (ops/s)   Speedup
  SEQUENTIAL          92,918,297    1,143,149,235    12.3x
  RANDOM              92,126,888    1,147,547,093    12.5x
  LOW_CARDINALITY     93,005,451    1,142,666,760    12.3x
  HIGH_CARDINALITY    93,312,596    1,144,681,876    12.3x

The improvement is consistent regardless of data distribution because
the bottleneck was entirely in the dispatch overhead. All four numeric
plain reader types (int, long, float, double) benefit equally.

All 573 parquet-column tests pass.
Deprecate LittleEndianDataInputStream and migrate last test usage

After GH-3493 replaced the only production usage of LittleEndianDataInputStream
in PlainValuesReader with direct ByteBuffer reads, the class has no remaining
production callers. Mark it @deprecated and document the faster alternative.

Migrate the only remaining usage in TestColumnChunkPageWriteStore.intValue()
to ByteBuffer.getInt() with LITTLE_ENDIAN order, reading directly from
BytesInput.toByteBuffer() instead of round-tripping through a
ByteArrayOutputStream + ByteArrayInputStream + LittleEndianDataInputStream.

Per-call readInt() on the deprecated class performs 4 virtual in.read()
dispatches and manually reassembles the value with bit shifts. The
ByteBuffer.getInt() replacement is a HotSpot intrinsic that compiles to a
single unaligned load on x86/ARM.

The class is left in place (only @deprecated) for source/binary compatibility
of any downstream code that may still reference it. It can be removed in a
future major release.

All 308 parquet-common, 573 parquet-column, and TestColumnChunkPageWriteStore
column-order tests pass. (The two pre-existing JDK Hadoop getSubject failures
in TestColumnChunkPageWriteStore are unrelated to this change.)


Successfully merging this pull request may close these issues.

Optimize PlainValuesReader by reading directly from ByteBuffer (12x decode speedup)