Skip to content

Cache hashCode() for non-reused Binary instances (huge dictionary-encode speedup) #3499

@iemejia

Description

@iemejia

Describe the enhancement requested

PLAIN_DICTIONARY encoding of BINARY columns repeatedly hashes Binary keys during dictionary map lookups (Object2IntMap.put / getInt calls inside DictionaryValuesWriter.PlainBinaryDictionaryValuesWriter). Each call to Binary.hashCode() recomputes the FNV-style hash byte-by-byte across the full backing buffer:

private static final int hashCode(byte[] array, int offset, int length) {
  int result = 1;
  for (int i = offset; i < offset + length; i++) {
    byte b = array[i];
    result = 31 * result + b;
  }
  return result;
}

For columns with many repeated values, the same Binary instance is hashed many times — once per dictionary lookup attempt across the entire page. As string length grows, the cost is dominated by these recomputations.

JMH (BinaryEncodingBenchmark.encodeDictionary, 100k values per invocation, JDK 18, JMH -wi 5 -i 10 -f 3) on master:

cardinality stringLength ops/s
LOW 10 13.2M
LOW 100 3.0M
LOW 1000 300K
HIGH 10 848K
HIGH 100 418K
HIGH 1000 72.5K

The 1000-byte LOW case spending 3 ms to encode 100k values is essentially all hashing work.

Proposal

Cache hashCode() per Binary instance using the java.lang.String.hashCode() idiom — a single int field with sentinel 0 meaning "not yet computed":

@Override
public int hashCode() {
  int h = cachedHashCode;
  if (h != 0) {
    return h;
  }
  return cacheHashCode(Binary.hashCode(value, offset, length));
}

Properties:

  • Race-safe without volatile: the value computed is a deterministic function of immutable bytes, so any two threads that race on the first call produce the same int and either ordering is correct (per java.lang.String).
  • Reused (mutable-buffer) Binary instances do not cache, preserving the current contract that mutating the backing array between calls produces a different hash.
  • Hash that genuinely equals 0 is recomputed every call — acceptably rare and still semantically correct.

Expected speedup (JMH same configuration, 30 samples per row):

cardinality stringLength Before After Δ
LOW 10 13.2M 20.2M +53%
LOW 100 3.0M 18.0M +511%
LOW 1000 300K 21.9M +7193% (72.9x)
HIGH 10 848K 1.34M +58%
HIGH 100 418K 1.32M +216%
HIGH 1000 72.5K 1.30M +1687% (17.9x)

Scope

  • Single file change to parquet-column/src/main/java/org/apache/parquet/io/api/Binary.java.
  • Adds two unit tests in TestBinary for cache correctness (constant cached, reused not cached).
  • No public-API change. Adds one package-private field (transient int cachedHashCode) and one package-private helper (cacheHashCode(int)).
  • No allocation-rate change (cache is one extra int field on existing instances).

Relation

This is part of a small series of focused PRs upstreaming optimizations from work in parquet-perf. Previous PRs in the series: #3494 (PlainValuesReader), #3496 (PlainValuesWriter).

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions