
GH-3530: Optimize DELTA_BINARY_PACKED, DELTA_LENGTH_BYTE_ARRAY, and DELTA_BYTE_ARRAY encoding/decoding #3567

Open
iemejia wants to merge 1 commit into apache:master from iemejia:parquet-perf-v2-par3-delta

iemejia commented May 17, 2026

Part of #3530 — Apache Parquet Java Performance Improvements

Summary

Optimize scalar hot paths for DELTA_BINARY_PACKED, DELTA_LENGTH_BYTE_ARRAY, and DELTA_BYTE_ARRAY.

DELTA_BINARY_PACKED reader: Cache BytePackerForLong instances, add unpack32Values bulk method, replace ByteBuffer with reused byte[] for mini-block data.
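The caching idea can be sketched roughly as follows. This is a minimal illustration, not the PR's `packerCache`: the class `PackerCacheSketch`, its generic parameter, and the `IntFunction` factory are hypothetical stand-ins for the real packer factory lookup, and the sketch assumes single-threaded use per reader or writer.

```java
import java.util.function.IntFunction;

// Hypothetical sketch: memoize one packer per bit width so the factory is
// consulted once per width instead of once per mini block.
final class PackerCacheSketch<P> {
    private static final int MAX_BIT_WIDTH = 64;

    // Slot i holds the packer for bit width i, or null if not created yet.
    private final Object[] cache = new Object[MAX_BIT_WIDTH + 1];
    private final IntFunction<P> factory;

    PackerCacheSketch(IntFunction<P> factory) {
        this.factory = factory;
    }

    @SuppressWarnings("unchecked")
    P forBitWidth(int bitWidth) {
        if (bitWidth < 0 || bitWidth > MAX_BIT_WIDTH) {
            throw new IllegalArgumentException("bit width out of range: " + bitWidth);
        }
        P packer = (P) cache[bitWidth];
        if (packer == null) {
            // First request for this width: ask the factory and remember the result.
            packer = factory.apply(bitWidth);
            cache[bitWidth] = packer;
        }
        return packer;
    }
}
```

An array indexed by bit width keeps the lookup to a bounds check and a load, which matters when it runs once per mini block.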

DELTA_BINARY_PACKED writers: Cache BytePackerForLong instances, add pack32Values.

DELTA_LENGTH_BYTE_ARRAY writer: Remove LittleEndianDataOutputStream wrapper; write directly to CapacityByteArrayOutputStream via BytesUtils.

DELTA_BYTE_ARRAY reader: Add ByteArraySliceOutputStream to eliminate temporary byte[] copies when materializing prefix+suffix.
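To illustrate the prefix+suffix materialization in general terms, here is a minimal sketch that rebuilds each value in one reused buffer instead of allocating a fresh array per value. `PrefixSuffixSketch` and its methods are hypothetical and are not the PR's ByteArraySliceOutputStream; in particular, this sketch overwrites the previous value in place, which the real reader may not do.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical sketch of DELTA_BYTE_ARRAY-style reconstruction: each value is
// the first prefixLength bytes of the previous value followed by a suffix.
final class PrefixSuffixSketch {
    private byte[] buffer = new byte[16];
    private int length = 0;

    // Rebuilds the next value in place, reusing the shared prefix bytes
    // that are already sitting at the start of the buffer.
    void next(int prefixLength, byte[] suffix, int suffixOffset, int suffixLength) {
        if (prefixLength < 0 || prefixLength > length) {
            throw new IllegalArgumentException("invalid prefix length: " + prefixLength);
        }
        int newLength = prefixLength + suffixLength;
        if (newLength > buffer.length) {
            // Grow geometrically; copyOf keeps the existing prefix bytes.
            buffer = Arrays.copyOf(buffer, Math.max(newLength, buffer.length * 2));
        }
        System.arraycopy(suffix, suffixOffset, buffer, prefixLength, suffixLength);
        length = newLength;
    }

    int length() {
        return length;
    }

    // Convenience view for the example; decodes the current value as UTF-8.
    String currentAsString() {
        return new String(buffer, 0, length, StandardCharsets.UTF_8);
    }
}
```

Feeding it (0, "apple"), (3, "ly"), (5, "ing") yields "apple", "apply", "applying" in turn. Callers that need to retain a value across calls would have to copy it out of the shared buffer.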

DELTA_BYTE_ARRAY writer: Use Arrays.mismatch, which is JVM-intrinsified for SIMD acceleration, for prefix-length computation, and write through writeBytes(byte[],int,int) directly to avoid intermediate Binary allocations.
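As a rough sketch of the prefix-length computation, the range overload of `java.util.Arrays.mismatch` (available since Java 9) can replace a hand-written byte-by-byte loop. The class and method names below are hypothetical and are not taken from the PR.

```java
import java.util.Arrays;

// Hypothetical sketch: length of the common prefix of two byte arrays.
final class PrefixLengthSketch {
    static int commonPrefixLength(byte[] previous, byte[] current) {
        int limit = Math.min(previous.length, current.length);
        // Compare only the first `limit` bytes of each array. mismatch returns
        // the index of the first differing byte, or -1 if the ranges are equal.
        int mismatch = Arrays.mismatch(previous, 0, limit, current, 0, limit);
        return mismatch < 0 ? limit : mismatch;
    }
}
```

For example, the UTF-8 bytes of "parquet" and "parse" share a prefix of length 3. Clamping both ranges to the shorter length keeps the -1 case unambiguous: it then always means one value is a prefix of the other.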

JMH benchmarks: DeltaBinaryPackedEncodingBenchmark, DeltaBinaryPackedDecodingBenchmark, DeltaByteArrayEncodingBenchmark, DeltaByteArrayDecodingBenchmark, DeltaLengthByteArrayEncodingBenchmark, DeltaLengthByteArrayDecodingBenchmark, LongDeltaDecodingBenchmark.

Benchmark results

Environment: JDK 25.0.3 (Temurin), OpenJDK 64-Bit Server VM, JMH 1.37, Linux x86_64.

| Component | Avg improvement | Range |
|---|---|---|
| DELTA_BINARY_PACKED Decoding (INT32/INT64) | +27.4% | +16.9% to +43.4% |
| Long Delta Decoding (INT64, 5 bit-width patterns) | +22.9% | +3.3% to +35.9% |
| DELTA_BYTE_ARRAY Decoding (BINARY/FLBA) | +31.4% | +8.4% to +134.8% |
| DELTA_BINARY_PACKED Encoding | +6.6% | +3.3% to +12.4% |
| DELTA_BYTE_ARRAY Encoding | +5.3% | -0.6% to +16.0% |
| DELTA_LENGTH_BYTE_ARRAY Encoding | +2.6% | -1.0% to +7.6% |
| DELTA_LENGTH_BYTE_ARRAY Decoding | +1.9% | +0.0% to +3.8% |

Key observations:

  • DELTA_BYTE_ARRAY SORTED decoding shows the largest gains (+27% to +135%) because ByteArraySliceOutputStream eliminates per-value suffix byte[] allocation.
  • TIMESTAMP_MILLIS (a common real-world INT64 delta pattern) improved +28%.
  • Encoding improvements are more modest since the write path was already relatively lean.

… and DELTA_BYTE_ARRAY encoding/decoding

DELTA_BINARY_PACKED reader:
- Cache BytePackerForLong instances (packerCache) to eliminate repeated
  factory lookups per mini block
- Add unpack32Values bulk method that processes 32 values per call
  instead of 8, reducing loop overhead
- Replace ByteBuffer miniBlockByteBuffer with byte[] to avoid
  ByteBuffer.slice() allocation per mini block and enable the faster
  byte[]-based packer APIs

DELTA_BINARY_PACKED integer writer:
- Cache BytePackerForLong instances (packerCache)
- Add pack32Values bulk packing method (32 values per call)

DELTA_BINARY_PACKED long writer:
- Cache BytePackerForLong instances (packerCache)
- Add pack32Values bulk packing method (32 values per call)

DELTA_BINARY_PACKED base writer:
- Remove unused 3-argument constructor

DELTA_LENGTH_BYTE_ARRAY writer:
- Remove LittleEndianDataOutputStream wrapper; write directly to
  CapacityByteArrayOutputStream via BytesUtils
- Add writeBytes(byte[],int,int) overload for direct byte array writes

DELTA_BYTE_ARRAY reader:
- Add ByteArraySliceOutputStream to eliminate temporary byte[] copies
  when materializing prefix+suffix in readBytes()

DELTA_BYTE_ARRAY writer:
- Use copy().getBytesUnsafe() and direct writeBytes(byte[],int,int) to
  avoid intermediate Binary allocations
- Use Arrays.mismatch for prefix length computation, which is
  JVM-intrinsified for SIMD acceleration

Test utilities:
- Remove unused writeInts method from Utils

JMH benchmarks:
- DeltaBinaryPackedEncodingBenchmark: INT32/INT64 scalar encode with
  SEQUENTIAL, RANDOM, LOW_CARDINALITY, HIGH_CARDINALITY data patterns
- DeltaBinaryPackedDecodingBenchmark: INT32/INT64 scalar decode
- DeltaByteArrayEncodingBenchmark: BINARY/FLBA scalar encode with
  RANDOM/SORTED data and varying string/fixed lengths
- DeltaByteArrayDecodingBenchmark: BINARY/FLBA scalar decode
- DeltaLengthByteArrayEncodingBenchmark: BINARY scalar encode with
  UNIFORM_LENGTH/VARIABLE_LENGTH distributions
- DeltaLengthByteArrayDecodingBenchmark: BINARY scalar decode
- LongDeltaDecodingBenchmark: INT64 decode with 5 bit-width patterns
  (SEQUENTIAL_DENSE, SEQUENTIAL_STRIDED, RANDOM_SMALL, RANDOM_WIDE,
  TIMESTAMP_MILLIS)
- Shared TestDataFactory for deterministic benchmark data generation

iemejia force-pushed the parquet-perf-v2-par3-delta branch from 7aae9be to e794ec2 on May 17, 2026 23:06