Skip to content

cache bytestring hashcode#2836

Merged
pjfanning merged 1 commit intoapache:mainfrom
pjfanning:bs-hash
Apr 8, 2026
Merged

cache bytestring hashcode#2836
pjfanning merged 1 commit intoapache:mainfrom
pjfanning:bs-hash

Conversation

@pjfanning
Copy link
Copy Markdown
Member

ByteString hashCode does not change because ByteStrings are immutable.

@He-Pin
Copy link
Copy Markdown
Member

He-Pin commented Apr 6, 2026

nice design

Copy link
Copy Markdown
Member

@He-Pin He-Pin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Simple and correct change. Caching hashCode on an immutable ByteString is a standard optimization.

One concern: The current implementation uses lazy val which has synchronization overhead on first access in Scala. For hot paths that create many small ByteStrings, the double-checked locking generated by lazy val could be measurable. Consider using a @volatile var with manual caching if benchmarks show this is an issue:

scala\n@volatile private var _hashCode: Int = 0\noverride def hashCode: Int = {\n if (_hashCode == 0) _hashCode = super.hashCode()\n _hashCode\n}\n

However, this has a problem: hash code of 0 cannot be distinguished from uncomputed. For most real-world ByteStrings this is unlikely to matter, but the lazy val approach is safer. The current implementation is fine as-is.

Copy link
Copy Markdown
Member

@He-Pin He-Pin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deep CR: PR #2836 - cache bytestring hashcode

Architecture Review

Adding override lazy val hashCode: Int = super.hashCode() to the abstract ByteString class is a standard and correct optimization for immutable collections. Since ByteString extends IndexedSeq[Byte] with StrictOptimizedSeqOps, the super.hashCode() delegates to IndexedSeq.hashCode() which computes a content-based hash.

Key observation: ByteString does NOT currently override equals. This means:

  • equals is inherited from IndexedSeq which does content-based comparison
  • hashCode now caches the result of IndexedSeq.hashCode()
  • The equals/hashCode contract is maintained: equal ByteStrings will have equal hashCodes

This is important because if ByteString were used as HashMap keys, this optimization provides significant value.

Binary Compatibility

Adding lazy val hashCode introduces a new public field on the abstract ByteString class. Since this is Pekko 2.0 (major version), this is acceptable. All concrete subclasses (ByteString1C, ByteString1, ByteStrings) inherit this field automatically - no changes needed.

Potential issue: If external code has created subclasses of ByteString (unlikely since it's sealed), they would need to handle the new hashCode field. Since ByteString is sealed, this is safe.

Performance Considerations

lazy val overhead: Scala's lazy val uses double-checked locking with a bitmap field. On the first access:

  1. Acquire monitor lock
  2. Compute super.hashCode() (iterates all bytes)
  3. Store result
  4. Release lock

On subsequent accesses:

  1. Check bitmap flag (single volatile read)
  2. Return cached value

The initial cost is amortized over repeated accesses. For ByteStrings used in HashMaps, Sets, or compared frequently, this is a clear win. For one-off ByteStrings that are never hashed, there is a small memory overhead (the bitmap field per instance).

Memory overhead: Each ByteString instance gains one bitmap byte for the lazy val. Negligible for most use cases.

Test Coverage

The test verifies:

  • hashCode is consistent across multiple calls on the same instance
  • Equal ByteStrings have equal hashCodes
  • Different ByteString implementations (ByteString1, ByteStrings) produce consistent hashCodes

Missing tests:

  • No test for ByteString1C.hashCode
  • No test verifying that hashCode is computed from content (not reference)
  • No negative test (different content should produce different hashCode - though this is inherited behavior)

Code Quality

The implementation is minimal and correct. The comment "Cache the hash code since ByteStrings are immutable" is clear.

Java API Coverage

N/A - this is a Scala-only change. Java users calling .hashCode() on ByteString instances will benefit from the caching automatically.

Suggestions

  1. Consider adding a test for ByteString1C.hashCode to ensure all three implementations are covered.
  2. Consider testing that different content produces different hashCodes (probabilistic, but catches obvious bugs).
  3. The lazy val approach is correct but has a small initial synchronization cost. If benchmarks show this is significant in ByteString-heavy workloads, a manual @volatile var approach could be explored, but the current implementation is the safer default choice.

@pjfanning pjfanning merged commit f207a17 into apache:main Apr 8, 2026
9 checks passed
@pjfanning pjfanning deleted the bs-hash branch April 8, 2026 11:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants