VarHandle multi-byte reads for ByteString lastIndexOf #2838

Open
pjfanning wants to merge 2 commits into apache:main from pjfanning:copilot/optimize-byte-processing-performance
@pjfanning (Member) commented Apr 4, 2026

SWARUtil — extended VarHandle infrastructure

  • Use byteArrayViewVarHandle instances for short (BE+LE), int (BE+LE), long (LE)
  • Added getLastIndex(word) — finds rightmost byte match in SWAR result using numberOfTrailingZeros
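
To make the two bullets above concrete, here is a minimal sketch of a big-endian `VarHandle` read combined with a SWAR byte match and a trailing-zeros lookup. It is not the PR's code: the object and method names (`SwarSketch`, `getLongBE`, `compilePattern`, `applyPattern`, `lastIndexInWord`) are hypothetical, and the formula in `lastIndexInWord` is only an assumption about what `getLastIndex(word)` computes. It assumes JDK 9+ and a Scala 2.13 or Scala 3 compiler.

```scala
import java.lang.invoke.{MethodHandles, VarHandle}
import java.nio.ByteOrder

// Hypothetical names throughout; a sketch, not the SWARUtil from the PR.
object SwarSketch {
  // Views a byte[] as big-endian longs. The index passed to get() is a byte offset.
  private val longBE: VarHandle =
    MethodHandles.byteArrayViewVarHandle(classOf[Array[Long]], ByteOrder.BIG_ENDIAN)

  /** Reads 8 bytes starting at `offset` as one big-endian long. */
  def getLongBE(bytes: Array[Byte], offset: Int): Long =
    longBE.get(bytes, offset)

  /** Repeats the byte `b` in all 8 lanes of a long. */
  def compilePattern(b: Byte): Long = (b & 0xFFL) * 0x0101010101010101L

  /** Returns a mask with 0x80 in every lane where `word` equals the pattern byte, 0 elsewhere. */
  def applyPattern(word: Long, pattern: Long): Long = {
    val input = word ^ pattern // matching lanes become 0x00
    val tmp = (input & 0x7F7F7F7F7F7F7F7FL) + 0x7F7F7F7F7F7F7F7FL
    ~(tmp | input | 0x7F7F7F7F7F7F7F7FL)
  }

  /**
   * Position (0..7, in memory order) of the last matching lane, or -1 if none.
   * Valid only for a mask built from a big-endian read, where the last byte in
   * memory is the least significant byte of the long.
   */
  def lastIndexInWord(mask: Long): Int =
    if (mask == 0L) -1
    else 7 - (java.lang.Long.numberOfTrailingZeros(mask) >>> 3)
}
```

With a little-endian read the last byte in memory would be the most significant byte, so the same lookup would need `numberOfLeadingZeros` instead.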

ByteString

  • Use SWARUtil for multi byte processing
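
As an illustration of what multi-byte processing means for `lastIndexOf`, the sketch below scans a plain `Array[Byte]` backwards in 8-byte big-endian words and falls back to a byte loop for the fewer than 8 bytes left at the front. It is written under assumptions: `LastIndexOfSketch` and its layout are hypothetical, it works on a raw array rather than on `ByteString`, and the PR's `unrolledLastIndexOf` may split chunks and tail differently.

```scala
import java.lang.invoke.{MethodHandles, VarHandle}
import java.nio.ByteOrder

// Hypothetical sketch; not the ByteString implementation from the PR.
object LastIndexOfSketch {
  private val longBE: VarHandle =
    MethodHandles.byteArrayViewVarHandle(classOf[Array[Long]], ByteOrder.BIG_ENDIAN)

  private final val Low7 = 0x7F7F7F7F7F7F7F7FL

  /** Last index <= `end` at which `elem` occurs in `bytes`, or -1. */
  def lastIndexOf(bytes: Array[Byte], elem: Byte, end: Int): Int = {
    // Exclusive upper bound of the region still to be searched.
    var hi = math.min(end, bytes.length - 1) + 1
    if (hi <= 0) return -1
    val pattern = (elem & 0xFFL) * 0x0101010101010101L

    // Full 8-byte chunks, walking backwards from the end of the region.
    while (hi >= 8) {
      val chunkStart = hi - 8
      val word: Long = longBE.get(bytes, chunkStart)
      val input = word ^ pattern
      val tmp = (input & Low7) + Low7
      val mask = ~(tmp | input | Low7) // 0x80 in every matching lane
      if (mask != 0L)
        return chunkStart + 7 - (java.lang.Long.numberOfTrailingZeros(mask) >>> 3)
      hi -= 8
    }

    // Fewer than 8 bytes remain at the front: plain byte loop.
    var i = hi - 1
    while (i >= 0) {
      if (bytes(i) == elem) return i
      i -= 1
    }
    -1
  }
}
```

For short inputs only the byte loop runs, so any gain from the word-at-a-time path shows up only once the searched region reaches 8 bytes.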

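Significant improvement in the benchmark: throughput rises in all six cases, from about 1.7x (bss_lastIndexOf_best_case) to roughly 27-32x (the far_index cases). With only 3 iterations per benchmark the error bounds are wide, and in several runs they exceed the score.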

With Changes
[info] Benchmark                                                              Mode  Cnt          Score           Error  Units
[info] ByteString_lastIndexOf_Benchmark.bs1_lastIndexOf                      thrpt    3  107880125.372 ± 247014962.175  ops/s
[info] ByteString_lastIndexOf_Benchmark.bs1_lastIndexOf_byte                 thrpt    3   88478677.724 ± 181685386.699  ops/s
[info] ByteString_lastIndexOf_Benchmark.bss_lastIndexOf_best_case            thrpt    3   58313896.505 ± 806477112.611  ops/s
[info] ByteString_lastIndexOf_Benchmark.bss_lastIndexOf_far_index_case       thrpt    3   17472854.418 ±  94251978.262  ops/s
[info] ByteString_lastIndexOf_Benchmark.bss_lastIndexOf_far_index_case_byte  thrpt    3   19332603.332 ±   2245521.678  ops/s
[info] ByteString_lastIndexOf_Benchmark.bss_lastIndexOf_worst_case           thrpt    3   20053173.893 ±   2385070.269  ops/s

Without Changes
[info] Benchmark                                                              Mode  Cnt          Score          Error  Units
[info] ByteString_lastIndexOf_Benchmark.bs1_lastIndexOf                      thrpt    3   33298670.109 ± 63631788.200  ops/s
[info] ByteString_lastIndexOf_Benchmark.bs1_lastIndexOf_byte                 thrpt    3   27021365.677 ±  5557597.483  ops/s
[info] ByteString_lastIndexOf_Benchmark.bss_lastIndexOf_best_case            thrpt    3   34686168.040 ±  7080975.674  ops/s
[info] ByteString_lastIndexOf_Benchmark.bss_lastIndexOf_far_index_case       thrpt    3     635863.919 ±    69057.413  ops/s
[info] ByteString_lastIndexOf_Benchmark.bss_lastIndexOf_far_index_case_byte  thrpt    3     597510.828 ±    29481.618  ops/s
[info] ByteString_lastIndexOf_Benchmark.bss_lastIndexOf_worst_case           thrpt    3    2125744.038 ±   196176.723  ops/s

@pjfanning pjfanning marked this pull request as draft April 4, 2026 20:58
@He-Pin (Member) left a comment

Good performance work, but a few concerns:

  1. ByteString1C.lastIndexOf and ByteString1.lastIndexOf -- The SWAR-based specialized lastIndexOf(elem: Byte, end: Int) looks correct but the unrolledLastIndexOf tail handling is complex. The tests cover basic cases but I'd recommend adding a test that specifically targets the boundary between tail and full 8-byte chunks (e.g. length=15, element at position 14).

  2. SWARUtil.getLastIndex -- the doc says "Currently only supports big endian", and the lastIndexOf implementation that uses it reads via SWARUtil.getLong(bytes, chunkStart) (the default big-endian path). This is correct, but the big-endian requirement is worth documenting explicitly to prevent future misuse.

  3. The PR description says WIP and needs benchmarking. I'd recommend not merging until benchmarks confirm the SWAR lastIndexOf is actually faster than the simple loop for typical ByteString sizes (many ByteStrings are small, where SWAR overhead may not pay off).

  4. ByteIterator overrides -- using a byteOrder == ByteOrder.BIG_ENDIAN boolean flag is cleaner than the if/else chain in #2847. Consider whether these two PRs should be coordinated to avoid duplicate implementations.
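
The boundary test suggested in point 1 could take the form of a sweep against a reference loop, sketched below. Everything here is hypothetical (`BoundaryCheck`, `naive`, `firstMismatch`): the check places a single target byte at every position of arrays up to 33 bytes long, tries every `end` from -1 to the length, and reports the first case where the implementation under test disagrees with the backwards loop. The reviewer's example (length=15, element at position 14) is one of the cases covered.

```scala
// Hypothetical test helper; names are illustrative only.
object BoundaryCheck {
  type LastIndexOf = (Array[Byte], Byte, Int) => Int

  /** Reference implementation: plain backwards loop. */
  def naive(bytes: Array[Byte], elem: Byte, end: Int): Int = {
    var i = math.min(end, bytes.length - 1)
    while (i >= 0 && bytes(i) != elem) i -= 1
    math.max(i, -1)
  }

  /** First (length, position, end) where `impl` disagrees with `naive`, if any. */
  def firstMismatch(impl: LastIndexOf, maxLen: Int = 33): Option[(Int, Int, Int)] = {
    val target: Byte = 1
    val cases = for {
      len <- Iterator.range(1, maxLen + 1)
      pos <- Iterator.range(0, len)
      end <- Iterator.range(-1, len + 1)
    } yield (len, pos, end)
    cases.find { case (len, pos, end) =>
      val bytes = new Array[Byte](len) // all zeros except the target
      bytes(pos) = target
      impl(bytes, target, end) != naive(bytes, target, end)
    }
  }
}

// Usage with the standard library's Array lastIndexOf standing in for the code under test:
val mismatch = BoundaryCheck.firstMismatch((b, e, end) => b.lastIndexOf(e, end))
```

To exercise the PR itself, the lambda would wrap the array in a `ByteString` instead, for example `(b, e, end) => ByteString(b).lastIndexOf(e, end)`, which needs Pekko on the classpath.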

@pjfanning (Member, Author) commented

I'm breaking this up into smaller PRs

@pjfanning pjfanning closed this Apr 6, 2026
@pjfanning pjfanning reopened this Apr 8, 2026
… SWAR lastIndexOf

Agent-Logs-Url: https://github.com/pjfanning/incubator-pekko/sessions/1a4cc51a-e270-46cd-9cf0-e59a0e608650

Co-authored-by: pjfanning <11783444+pjfanning@users.noreply.github.com>

perf: add clarifying comments to unrolledLastIndexOf methods

Agent-Logs-Url: https://github.com/pjfanning/incubator-pekko/sessions/1a4cc51a-e270-46cd-9cf0-e59a0e608650

Co-authored-by: pjfanning <11783444+pjfanning@users.noreply.github.com>

lastIndexOf (specialized)

scalafmt

Update SWARUtil.scala

Update ByteString.scala

Update ByteString.scala
@pjfanning pjfanning force-pushed the copilot/optimize-byte-processing-performance branch from 0be3a08 to 21cfe12 Compare April 8, 2026 11:52
@pjfanning pjfanning marked this pull request as ready for review April 8, 2026 12:21
@pjfanning (Member, Author) commented

I removed the other parts of the original PR and focused this one on the SWAR lastIndexOf. Those other parts were already merged in separate PRs.

@pjfanning pjfanning changed the title from "VarHandle multi-byte reads for ByteIterator + SWAR lastIndexOf" to "VarHandle multi-byte reads for ByteString lastIndexOf" Apr 8, 2026
@pjfanning pjfanning added this to the 2.0.0-M2 milestone Apr 8, 2026
3 participants