Describe the enhancement requested
PLAIN_DICTIONARY encoding of BINARY columns repeatedly hashes Binary keys during dictionary map lookups (Object2IntMap.put / getInt calls inside DictionaryValuesWriter.PlainBinaryDictionaryValuesWriter). Each call to Binary.hashCode() recomputes the FNV-style hash byte-by-byte across the full backing buffer:
private static final int hashCode(byte[] array, int offset, int length) {
int result = 1;
for (int i = offset; i < offset + length; i++) {
byte b = array[i];
result = 31 * result + b;
}
return result;
}
For columns with many repeated values, the same Binary instance is hashed many times — once per dictionary lookup attempt across the entire page. As string length grows, the cost is dominated by these recomputations.
JMH (BinaryEncodingBenchmark.encodeDictionary, 100k values per invocation, JDK 18, JMH -wi 5 -i 10 -f 3) on master:
| cardinality |
stringLength |
ops/s |
| LOW |
10 |
13.2M |
| LOW |
100 |
3.0M |
| LOW |
1000 |
300K |
| HIGH |
10 |
848K |
| HIGH |
100 |
418K |
| HIGH |
1000 |
72.5K |
The 1000-byte LOW case spending 3 ms to encode 100k values is essentially all hashing work.
Proposal
Cache hashCode() per Binary instance using the java.lang.String.hashCode() idiom — a single int field with sentinel 0 meaning "not yet computed":
@Override
public int hashCode() {
int h = cachedHashCode;
if (h != 0) {
return h;
}
return cacheHashCode(Binary.hashCode(value, offset, length));
}
Properties:
- Race-safe without
volatile: the value computed is a deterministic function of immutable bytes, so any two threads that race on the first call produce the same int and either ordering is correct (per java.lang.String).
- Reused (mutable-buffer) Binary instances do not cache, preserving the current contract that mutating the backing array between calls produces a different hash.
- Hash that genuinely equals 0 is recomputed every call — acceptably rare and still semantically correct.
Expected speedup (JMH same configuration, 30 samples per row):
| cardinality |
stringLength |
Before |
After |
Δ |
| LOW |
10 |
13.2M |
20.2M |
+53% |
| LOW |
100 |
3.0M |
18.0M |
+511% |
| LOW |
1000 |
300K |
21.9M |
+7193% (72.9x) |
| HIGH |
10 |
848K |
1.34M |
+58% |
| HIGH |
100 |
418K |
1.32M |
+216% |
| HIGH |
1000 |
72.5K |
1.30M |
+1687% (17.9x) |
Scope
- Single file change to
parquet-column/src/main/java/org/apache/parquet/io/api/Binary.java.
- Adds two unit tests in
TestBinary for cache correctness (constant cached, reused not cached).
- No public-API change. Adds one package-private field (
transient int cachedHashCode) and one package-private helper (cacheHashCode(int)).
- No allocation-rate change (cache is one extra
int field on existing instances).
Relation
This is part of a small series of focused PRs upstreaming optimizations from work in parquet-perf. Previous PRs in the series: #3494 (PlainValuesReader), #3496 (PlainValuesWriter).
Describe the enhancement requested
PLAIN_DICTIONARY encoding of
BINARYcolumns repeatedly hashesBinarykeys during dictionary map lookups (Object2IntMap.put/getIntcalls insideDictionaryValuesWriter.PlainBinaryDictionaryValuesWriter). Each call toBinary.hashCode()recomputes the FNV-style hash byte-by-byte across the full backing buffer:For columns with many repeated values, the same
Binaryinstance is hashed many times — once per dictionary lookup attempt across the entire page. As string length grows, the cost is dominated by these recomputations.JMH (
BinaryEncodingBenchmark.encodeDictionary, 100k values per invocation, JDK 18, JMH -wi 5 -i 10 -f 3) onmaster:The 1000-byte LOW case spending 3 ms to encode 100k values is essentially all hashing work.
Proposal
Cache
hashCode()perBinaryinstance using thejava.lang.String.hashCode()idiom — a singleintfield with sentinel0meaning "not yet computed":Properties:
volatile: the value computed is a deterministic function of immutable bytes, so any two threads that race on the first call produce the sameintand either ordering is correct (perjava.lang.String).Expected speedup (JMH same configuration, 30 samples per row):
Scope
parquet-column/src/main/java/org/apache/parquet/io/api/Binary.java.TestBinaryfor cache correctness (constant cached, reused not cached).transient int cachedHashCode) and one package-private helper (cacheHashCode(int)).intfield on existing instances).Relation
This is part of a small series of focused PRs upstreaming optimizations from work in parquet-perf. Previous PRs in the series: #3494 (PlainValuesReader), #3496 (PlainValuesWriter).