cache bytestring hashcode by pjfanning · Pull Request #2836 · apache/pekko

pjfanning · 2026-04-04T20:06:18Z

ByteString hashCode does not change because ByteStrings are immutable.

He-Pin · 2026-04-06T10:35:17Z

nice design

He-Pin

Simple and correct change. Caching hashCode on an immutable ByteString is a standard optimization.

One concern: The current implementation uses lazy val which has synchronization overhead on first access in Scala. For hot paths that create many small ByteStrings, the double-checked locking generated by lazy val could be measurable. Consider using a @volatile var with manual caching if benchmarks show this is an issue:

scala\n@volatile private var _hashCode: Int = 0\noverride def hashCode: Int = {\n if (_hashCode == 0) _hashCode = super.hashCode()\n _hashCode\n}\n

However, this has a problem: hash code of 0 cannot be distinguished from uncomputed. For most real-world ByteStrings this is unlikely to matter, but the lazy val approach is safer. The current implementation is fine as-is.

He-Pin

Deep CR: PR #2836 - cache bytestring hashcode

Architecture Review

Adding override lazy val hashCode: Int = super.hashCode() to the abstract ByteString class is a standard and correct optimization for immutable collections. Since ByteString extends IndexedSeq[Byte] with StrictOptimizedSeqOps, the super.hashCode() delegates to IndexedSeq.hashCode() which computes a content-based hash.

Key observation: ByteString does NOT currently override equals. This means:

equals is inherited from IndexedSeq which does content-based comparison
hashCode now caches the result of IndexedSeq.hashCode()
The equals/hashCode contract is maintained: equal ByteStrings will have equal hashCodes

This is important because if ByteString were used as HashMap keys, this optimization provides significant value.

Binary Compatibility

Adding lazy val hashCode introduces a new public field on the abstract ByteString class. Since this is Pekko 2.0 (major version), this is acceptable. All concrete subclasses (ByteString1C, ByteString1, ByteStrings) inherit this field automatically - no changes needed.

Potential issue: If external code has created subclasses of ByteString (unlikely since it's sealed), they would need to handle the new hashCode field. Since ByteString is sealed, this is safe.

Performance Considerations

lazy val overhead: Scala's lazy val uses double-checked locking with a bitmap field. On the first access:

Acquire monitor lock
Compute super.hashCode() (iterates all bytes)
Store result
Release lock

On subsequent accesses:

Check bitmap flag (single volatile read)
Return cached value

The initial cost is amortized over repeated accesses. For ByteStrings used in HashMaps, Sets, or compared frequently, this is a clear win. For one-off ByteStrings that are never hashed, there is a small memory overhead (the bitmap field per instance).

Memory overhead: Each ByteString instance gains one bitmap byte for the lazy val. Negligible for most use cases.

Test Coverage

The test verifies:

hashCode is consistent across multiple calls on the same instance
Equal ByteStrings have equal hashCodes
Different ByteString implementations (ByteString1, ByteStrings) produce consistent hashCodes

Missing tests:

No test for ByteString1C.hashCode
No test verifying that hashCode is computed from content (not reference)
No negative test (different content should produce different hashCode - though this is inherited behavior)

Code Quality

The implementation is minimal and correct. The comment "Cache the hash code since ByteStrings are immutable" is clear.

Java API Coverage

N/A - this is a Scala-only change. Java users calling .hashCode() on ByteString instances will benefit from the caching automatically.

Suggestions

Consider adding a test for ByteString1C.hashCode to ensure all three implementations are covered.
Consider testing that different content produces different hashCodes (probabilistic, but catches obvious bugs).
The lazy val approach is correct but has a small initial synchronization cost. If benchmarks show this is significant in ByteString-heavy workloads, a manual @volatile var approach could be explored, but the current implementation is the safer default choice.

cache bytestring hashcode

166b401

He-Pin approved these changes Apr 6, 2026

View reviewed changes

He-Pin reviewed Apr 6, 2026

View reviewed changes

pjfanning merged commit f207a17 into apache:main Apr 8, 2026
9 checks passed

pjfanning deleted the bs-hash branch April 8, 2026 11:18

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cache bytestring hashcode#2836

cache bytestring hashcode#2836
pjfanning merged 1 commit intoapache:mainfrom
pjfanning:bs-hash

pjfanning commented Apr 4, 2026

Uh oh!

He-Pin commented Apr 6, 2026

Uh oh!

He-Pin left a comment

Uh oh!

He-Pin left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

pjfanning commented Apr 4, 2026

Uh oh!

He-Pin commented Apr 6, 2026

Uh oh!

He-Pin left a comment

Choose a reason for hiding this comment

Uh oh!

He-Pin left a comment

Choose a reason for hiding this comment

Deep CR: PR #2836 - cache bytestring hashcode

Architecture Review

Binary Compatibility

Performance Considerations

Test Coverage

Code Quality

Java API Coverage

Suggestions

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants