
GH-3499: Cache hashCode() for non-reused Binary instances (up to 73x dictionary-encode speedup)#3500

Open
iemejia wants to merge 1 commit into apache:master from iemejia:perf-binary-hashcode-cache

Conversation


@iemejia iemejia commented Apr 19, 2026

Summary

Closes #3499.

Caches Binary.hashCode() per instance for non-reused (immutable-backing) Binary values. Eliminates repeated full-buffer hash recomputation during PLAIN_DICTIONARY encoding, where the same key is hashed many times across a 100k-value page. Reused (mutable-backing) instances skip the cache to preserve their existing semantics.

Uses the java.lang.String.hashCode() idiom — a single int field with sentinel 0 meaning "not yet computed" — so the cache is race-safe without volatile (concurrent first calls compute the same deterministic value; either ordering is correct).
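As a sketch, the idiom looks like this (class and field names are illustrative, not the actual Binary internals; the counter exists only to make the caching visible):

```java
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch of the java.lang.String.hashCode() caching idiom.
// Names are hypothetical; the real Binary class is structured differently.
class CachedHashBytes {
    private final byte[] bytes;   // immutable backing content
    transient int cachedHashCode; // 0 is the sentinel for "not yet computed"
    static final AtomicInteger computeCalls = new AtomicInteger(); // instrumentation only

    CachedHashBytes(byte[] bytes) { this.bytes = bytes; }

    private int computeHashCode() {
        computeCalls.incrementAndGet();
        return Arrays.hashCode(bytes); // full-buffer O(N) hash
    }

    @Override
    public int hashCode() {
        int h = cachedHashCode;
        if (h == 0) {
            h = computeHashCode();
            cachedHashCode = h; // benign race: any racing writer stores the same value
        }
        return h;
    }
}
```

The one caveat, as with String, is that a value whose true hash happens to be 0 is recomputed on every call, which stays correct but uncached.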

Benchmark

BinaryEncodingBenchmark.encodeDictionary, 100k BINARY values per invocation, JDK 18, JMH -wi 5 -i 10 -f 3 (30 samples per row):

cardinality  stringLength   Before (ops/s)   After (ops/s)   Improvement
LOW          10                 13,170,110       20,203,480   +53%
LOW          100                 2,955,460       18,048,610   +511%
LOW          1000                  300,693       21,933,470   +7193% (72.9x)
HIGH         10                    847,657        1,336,238   +58%
HIGH         100                   418,327        1,323,284   +216%
HIGH         1000                   72,553        1,296,679   +1687% (17.9x)

The relative gain grows with string length (per-call hash work is O(N), cache lookup is O(1)) and with low cardinality (each unique key is hashed many more times).

Negative control: encodePlain (writes Binary without dictionary lookups, so it does not exercise hashCode) is unchanged within ±2.5% across all parameter combinations. The allocation rate per op (gc.alloc.rate.norm) is identical between baseline and optimized, so the speedup is pure CPU saved on hashing.

Implementation notes

  • Single field added: transient int cachedHashCode on Binary (package-private so the three nested subclasses can read it directly on the hot path; inherited private fields are not accessible from nested subclasses without a method-call indirection).
  • The cached value is gated on !isBackingBytesReused inside a small package-private cacheHashCode(int) helper that runs only on the cache-miss path.
  • 2 new tests in TestBinary:
    • testHashCodeCachedForConstantBinary: constant Binary returns stable hashCode, equal across the three impls (ByteArraySliceBackedBinary, ByteArrayBackedBinary, ByteBufferBackedBinary).
    • testHashCodeNotCachedForReusedBinary: reused Binary returns the new hash after the backing buffer is replaced.
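Putting these notes together, the gating could look roughly like this (a sketch under assumed names; only isBackingBytesReused and cacheHashCode(int) come from the description above, and the concrete subclass is a stand-in for the real impls):

```java
import java.util.Arrays;

// Sketch of the reuse-gated hash cache described above. Field names and the
// concrete subclass are illustrative, not the actual Binary source.
abstract class BinarySketch {
    final boolean isBackingBytesReused; // true when the producer may mutate the buffer
    transient int cachedHashCode;       // package-private: subclasses read it on the hot path

    BinarySketch(boolean reused) { this.isBackingBytesReused = reused; }

    abstract int computeHash();

    // Cache-miss path only: reused instances never populate the cache.
    int cacheHashCode(int hash) {
        if (!isBackingBytesReused) {
            cachedHashCode = hash;
        }
        return hash;
    }

    @Override
    public int hashCode() {
        int h = cachedHashCode;
        return h != 0 ? h : cacheHashCode(computeHash());
    }
}

final class ByteArraySketch extends BinarySketch {
    byte[] bytes; // replaceable, to model the reused case
    ByteArraySketch(byte[] bytes, boolean reused) { super(reused); this.bytes = bytes; }
    @Override int computeHash() { return Arrays.hashCode(bytes); }
}
```

With this shape, a reused instance recomputes on every call and so never returns a stale hash after its buffer is swapped, while a constant instance pays the O(N) hash exactly once.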

Validation

  • parquet-column: 575 tests pass (was 573; +2 new tests for the cache).
  • Built with -Dspotless.check.skip=true -Drat.skip=true -Djapicmp.skip=true.

Related

This is the third in a small series of focused performance PRs from work in https://github.com/iemejia/parquet-perf. Previous: #3494 (PlainValuesReader), #3496 (PlainValuesWriter).

How to reproduce the benchmarks

The JMH benchmarks cited above are being added to parquet-benchmarks in #3512. Once that lands, reproduce with:

./mvnw clean package -pl parquet-benchmarks -DskipTests \
    -Dspotless.check.skip=true -Drat.skip=true -Djapicmp.skip=true
java -jar parquet-benchmarks/target/parquet-benchmarks.jar 'BinaryEncodingBenchmark.encodeDictionary' \
    -wi 5 -i 10 -f 3

Compare runs against master (baseline) and this branch (optimized).

PLAIN_DICTIONARY encoding of BINARY columns repeatedly hashes Binary keys
during dictionary map lookups, but the existing Binary.hashCode()
implementations (in ByteArraySliceBackedBinary, ByteArrayBackedBinary, and
ByteBufferBackedBinary) recompute the hash byte-by-byte on every call. For
columns with many repeated values this is the dominant cost of
encodeDictionary -- the uncached baseline is up to 73x slower than the cached
version on the existing JMH benchmark.

Cache the hash code in a single int field on Binary. Reused Binary instances
(those whose backing array can be mutated by the producer between calls) do
not cache, preserving the existing mutable-buffer semantics.

Thread safety follows the java.lang.String.hashCode() idiom: the cache is a
single int field with sentinel value 0 meaning "not yet computed". Two
threads racing on the first hashCode() call may both compute and write the
same deterministic value, which is benign. A Binary whose true hash equals 0
is recomputed on every call (acceptably rare and still correct). No volatile
or synchronization is needed; both the field load and the field store are
atomic per JLS, and the value is deterministic given the immutable byte
content.
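The benign-race claim can be exercised directly with a toy version of the idiom (names hypothetical): many threads race on the first hashCode() call of a shared instance, and all of them must observe the same value.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Toy demonstration that the unsynchronized cache is race-safe: the hash is
// deterministic given immutable content, so racing writers always agree.
class RacyCachedHash {
    private final byte[] bytes;
    int cachedHashCode; // 0 = not computed; deliberately not volatile

    RacyCachedHash(byte[] bytes) { this.bytes = bytes; }

    @Override
    public int hashCode() {
        int h = cachedHashCode;
        if (h == 0) {
            h = Arrays.hashCode(bytes); // same result for every thread
            cachedHashCode = h;         // racing stores all write this value
        }
        return h;
    }

    // Returns the set of distinct hash values observed by `threads` racing callers.
    static Set<Integer> raceOnFirstCall(int threads) throws Exception {
        RacyCachedHash target = new RacyCachedHash(new byte[] {7, 8, 9});
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<Integer>> futures = new ArrayList<>();
        for (int i = 0; i < threads; i++) {
            futures.add(pool.submit((Callable<Integer>) target::hashCode));
        }
        Set<Integer> observed = ConcurrentHashMap.newKeySet();
        for (Future<Integer> f : futures) {
            observed.add(f.get());
        }
        pool.shutdown();
        return observed;
    }
}
```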

Implementation notes:
- The cache field is package-private (not private) so the three nested
  Binary subclasses can read it directly in their hashCode() hot path,
  avoiding an extra method-call layer that would otherwise be needed since
  inherited private fields are not accessible from nested subclasses.
- A package-private cacheHashCode(int) helper centralises the
  isBackingBytesReused check on the slow path.
- New tests in TestBinary cover (a) cached-and-stable hashCode for the three
  constant Binary impls, and (b) reused Binary not returning a stale hash
  after the backing buffer is replaced.

Benchmark (BinaryEncodingBenchmark.encodeDictionary, 100k BINARY values per
invocation, JMH -wi 5 -i 10 -f 3, 30 samples per row):

  Param            Before (ops/s)   After (ops/s)   Improvement
  LOW   / 10        13,170,110      20,203,480     +53%   (1.53x)
  LOW   / 100        2,955,460      18,048,610     +511%  (6.11x)
  LOW   / 1000         300,693      21,933,470     +7193% (72.9x)
  HIGH  / 10           847,657       1,336,238     +58%   (1.58x)
  HIGH  / 100          418,327       1,323,284     +216%  (3.16x)
  HIGH  / 1000          72,553       1,296,679     +1687% (17.9x)

The relative gain grows with string length because the per-value hash cost
(byte-loop length) grows linearly while the cached lookup is O(1). LOW
cardinality benefits even more because each unique key is hashed many more
times (once per insertion check across the 100k values).

Negative control: BinaryEncodingBenchmark.encodePlain (which writes Binary
without dictionary lookups, so does not exercise hashCode) is unchanged
within +/- 2.5% across all parameter combinations.

Allocation rate per operation is identical between baseline and optimized
(7.36 B/op for LOW/10, etc.), confirming the speedup comes from CPU saved
on hashing rather than reduced allocations.

All 575 parquet-column tests pass (was 573; +2 new tests for the cache).

@arouel arouel left a comment


I use a similar optimization already in a patched parquet-column version on my side and verified the improvement.

Comment thread on parquet-column/src/main/java/org/apache/parquet/io/api/Binary.java
@iemejia iemejia force-pushed the perf-binary-hashcode-cache branch from e1c3ed9 to a8152c9 on April 19, 2026 at 18:18
iemejia added a commit to iemejia/parquet-java that referenced this pull request Apr 19, 2026
… shaded jar

The parquet-benchmarks pom is missing the JMH annotation-processor
configuration and the AppendingTransformer entries for BenchmarkList /
CompilerHints. As a result, the shaded jar built from master fails at
runtime with "Unable to find the resource: /META-INF/BenchmarkList".

This commit:

- Fixes parquet-benchmarks/pom.xml so the shaded jar is runnable: adds
  jmh-generator-annprocess to maven-compiler-plugin's annotation
  processor paths, and adds AppendingTransformer entries for
  META-INF/BenchmarkList and META-INF/CompilerHints to the shade plugin.

- Adds 11 JMH benchmarks covering the encode/decode paths used by the
  pending performance optimization PRs (apache#3494, apache#3496, apache#3500, apache#3504,
  apache#3506, apache#3510), so reviewers can reproduce the reported numbers and
  detect regressions:

    IntEncodingBenchmark, BinaryEncodingBenchmark,
    ByteStreamSplitEncodingBenchmark, ByteStreamSplitDecodingBenchmark,
    FixedLenByteArrayEncodingBenchmark, FileReadBenchmark,
    FileWriteBenchmark, RowGroupFlushBenchmark,
    ConcurrentReadWriteBenchmark, BlackHoleOutputFile, TestDataFactory.

After this change the shaded jar registers 87 benchmarks (previously a
default build produced an unrunnable jar, and even a working build
registered 0).
