perf: add fast paths for strip chars operations #748
stephenamar-db merged 1 commit into databricks:master
He-Pin
left a comment
Code Review
Overall: The optimization strategy is sound — the three-tier fast path approach is well-reasoned and the benchmark results support the performance claims. I found a few issues to address before merging.
1. Redundant isAllBmp(str) scan (Medium)
unspecializedStrip can call isAllBmp(str) twice on the same string:
if (charsSet.size == 1) {
  val ch = charsSet.head
  if (ch < Character.MIN_SUPPLEMENTARY_CODE_POINT && isAllBmp(str)) { // 1st call
    return stripSingleChar(str, ch.toChar, left, right)
  }
}
if (isAllBmp(str)) { // 2nd call: redundant O(n) scan
When charsSet.size == 1 and ch is BMP, but isAllBmp(str) returns false, the medium path re-scans the entire string. For long strings this is measurable overhead.
Suggested fix: hoist the check into a val:
val strAllBmp = isAllBmp(str)
if (charsSet.size == 1) {
  val ch = charsSet.head
  if (ch < Character.MIN_SUPPLEMENTARY_CODE_POINT && strAllBmp) {
    return stripSingleChar(str, ch.toChar, left, right)
  }
}
if (strAllBmp) {
  var allBmp = true
  ...
2. Misleading docstring (Low)
The scaladoc for unspecializedStrip lists:
- Small ASCII strip set — boolean array lookup (no hashing/boxing)
There is no boolean array lookup path implemented. This comment should be removed or the feature should be added.
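If the feature were added rather than the comment removed, a boolean-array path could look roughly like the sketch below. This is only an illustration under stated assumptions: `AsciiStripSketch`, `asciiTable`, and `stripAscii` are hypothetical names that do not appear in the PR, and the 128-entry table size is an assumption about what "small ASCII strip set" means.

```scala
object AsciiStripSketch {
  // Hypothetical helper: builds a 128-entry membership table,
  // or returns None if any strip code point is outside ASCII.
  def asciiTable(chars: Set[Int]): Option[Array[Boolean]] =
    if (chars.forall(c => c >= 0 && c < 128)) {
      val table = new Array[Boolean](128)
      chars.foreach(c => table(c) = true)
      Some(table)
    } else None

  // Hypothetical strip loop: one array read per char, no hashing or boxing.
  // Any code unit >= 128 (including surrogates) can never be in the table,
  // so the loop simply stops there.
  def stripAscii(s: String, table: Array[Boolean], left: Boolean, right: Boolean): String = {
    var start = 0
    var end = s.length
    if (left) {
      while (start < end && s.charAt(start).toInt < 128 && table(s.charAt(start).toInt)) start += 1
    }
    if (right) {
      while (end > start && s.charAt(end - 1).toInt < 128 && table(s.charAt(end - 1).toInt)) end -= 1
    }
    s.substring(start, end)
  }
}
```

Because an ASCII-only strip set can never match a surrogate code unit, a path like this would not need the isAllBmp(str) scan at all, which may be worth weighing against the extra table allocation.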
3.
6a4566d to 005516f
Add optimized strip implementation with three fast paths for common cases:
1. Single-char BMP strip set: direct char comparison (no Set lookup)
2. BMP-only strings with BMP strip set: charAt-based iteration (avoids expensive codePointAt/offsetByCodePoints)
3. General case: falls back to full codepoint-based iteration
The key insight is that most real-world strip operations use ASCII/BMP characters on ASCII/BMP strings, where surrogate pair handling is unnecessary overhead. The isAllBmp() check costs O(n) but enables O(1) per-character strip checks instead of O(log n) Set lookups.
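The three tiers described in the commit message can be sketched as follows. The names isAllBmp, stripSingleChar, and stripBmp come from the PR, but the bodies here are guesses at what such helpers might do, not the merged code; `StripSketch`, `stripCodePoints`, and `strip` are hypothetical names introduced for illustration. The sketch also hoists the isAllBmp result into a val, as suggested in the review.

```scala
object StripSketch {
  // True when the string has no surrogate code units, i.e. every
  // code point is in the Basic Multilingual Plane. One O(n) pass.
  def isAllBmp(s: String): Boolean = {
    var i = 0
    while (i < s.length) {
      if (Character.isSurrogate(s.charAt(i))) return false
      i += 1
    }
    true
  }

  // Tier 1: a single BMP char to strip; plain char comparison.
  def stripSingleChar(s: String, ch: Char, left: Boolean, right: Boolean): String = {
    var start = 0
    var end = s.length
    if (left) { while (start < end && s.charAt(start) == ch) start += 1 }
    if (right) { while (end > start && s.charAt(end - 1) == ch) end -= 1 }
    s.substring(start, end)
  }

  // Tier 2: BMP-only string, so one code unit is one code point.
  def stripBmp(s: String, chars: Set[Int], left: Boolean, right: Boolean): String = {
    var start = 0
    var end = s.length
    if (left) { while (start < end && chars.contains(s.charAt(start).toInt)) start += 1 }
    if (right) { while (end > start && chars.contains(s.charAt(end - 1).toInt)) end -= 1 }
    s.substring(start, end)
  }

  // Tier 3 (hypothetical name): walks whole code points so that
  // surrogate pairs are never split.
  def stripCodePoints(s: String, chars: Set[Int], left: Boolean, right: Boolean): String = {
    var start = 0
    var end = s.length
    if (left) {
      while (start < end && chars.contains(s.codePointAt(start)))
        start += Character.charCount(s.codePointAt(start))
    }
    if (right) {
      while (end > start && chars.contains(s.codePointBefore(end)))
        end -= Character.charCount(s.codePointBefore(end))
    }
    s.substring(start, end)
  }

  // Hypothetical dispatcher: scans the string once, then picks a tier.
  def strip(s: String, chars: Set[Int], left: Boolean, right: Boolean): String = {
    val strAllBmp = isAllBmp(s)
    if (chars.size == 1 && chars.head < Character.MIN_SUPPLEMENTARY_CODE_POINT && strAllBmp)
      stripSingleChar(s, chars.head.toChar, left, right)
    else if (strAllBmp) stripBmp(s, chars, left, right)
    else stripCodePoints(s, chars, left, right)
  }
}
```

For example, `StripSketch.strip("eehelloee", Set('e'.toInt), true, true)` takes the first tier, while a string containing an emoji falls through to the code point walk.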
005516f to 6ed0eea
Motivation
The stripChars, lstripChars, and rstripChars stdlib functions use codePointAt()/offsetByCodePoints() for character iteration and Set[Int].contains() for strip-set membership checks. For the common case of ASCII/BMP characters, which covers virtually all real-world Jsonnet usage, this adds significant overhead from surrogate pair handling, hash-based set lookup, and integer boxing.
Key Design Decision
Three-tier fast path strategy:
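1. Single-char BMP strip set: direct charAt() comparison, with zero allocation and no Set overhead
2. BMP-only string with BMP strip set: charAt()-based iteration, which avoids codePointAt()/offsetByCodePoints() overhead
3. General case: falls back to full codepoint-based iteration
The isAllBmp() pre-check costs O(n) but enables O(1) per-character checks instead of O(log n) Set lookups.
Modification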
- StringModule.scala: Added isAllBmp(), stripSingleChar(), and stripBmp() fast-path methods to StripUtils
- Updated unspecializedStrip() to dispatch to the fast paths when applicable
Benchmark Results
JMH (JVM, Scala 3.3.7)
Scala Native (hyperfine, 50 runs, warmup 5)
Native improvement: 1.87× faster than master, now tied with jrsonnet (was 3.16× slower)
Analysis
References
go_suite/stripChars.jsonnet: strips 510 "e" chars from both ends
Result
Strip operations now match jrsonnet performance on Scala Native while maintaining full Unicode correctness.
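To illustrate why the general code point path still has to exist for Unicode correctness, the sketch below contrasts a charAt-based left strip with a code-point-based one on a string that starts with a supplementary character. `SurrogateSketch` and both function names are hypothetical and are not taken from the PR.

```scala
object SurrogateSketch {
  // Naive left strip: treats each UTF-16 code unit as a character.
  def lstripByCharAt(s: String, chars: Set[Int]): String = {
    var start = 0
    while (start < s.length && chars.contains(s.charAt(start).toInt)) start += 1
    s.substring(start)
  }

  // Left strip that decodes whole code points, surrogate pairs included.
  def lstripByCodePoint(s: String, chars: Set[Int]): String = {
    var start = 0
    while (start < s.length && chars.contains(s.codePointAt(start)))
      start += Character.charCount(s.codePointAt(start))
    s.substring(start)
  }

  // U+1F600 is outside the BMP, so it occupies two code units.
  val grin: String = new String(Character.toChars(0x1F600))
  val input: String = grin + "a"
}
```

With the strip set `Set(0x1F600)`, the charAt version sees two surrogate code units, neither of which equals 0x1F600, and returns the input unchanged; the code point version returns "a". This is the case the isAllBmp() pre-check is there to route away from the fast paths.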