[SPARK-55430][SQL] Cache ICU StringSearch for collation string predicates with constant patterns #54241
+388
−6
What changes were proposed in this pull request?
Add StringSearch object caching for Contains, StartsWith, and EndsWith expressions when used with ICU-based collations (UNICODE, UNICODE_CI) and a compile-time constant (foldable) pattern.
Currently, every row evaluation creates a new com.ibm.icu.text.StringSearch object. When the pattern is constant, this repeated construction is unnecessary. With this change, a single StringSearch is created once and reused via setTarget() for each new input string, in both the interpreted (@transient private lazy val) and codegen (ctx.addMutableState) paths.
Changes:
- CollationFactory: add getStringSearchForPattern() factory method
- CollationSupport: add cached execICU() overloads for Contains, StartsWith, EndsWith
- stringExpressions.scala: wire caching into expression eval and codegen when pattern is foldable and collation is ICU-based
- CollationBenchmark: add fixed-pattern benchmarks
Why are the changes needed?
ICU StringSearch construction is expensive. For queries scanning large tables with constant string predicates under ICU collations, this overhead is incurred on every row. Caching yields 3-3.4X improvement.
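For readers unfamiliar with the reuse pattern, here is a minimal sketch of the idea: build the matcher once for the constant pattern, then only swap in a new target per row. It is not the code from this PR and does not use ICU. As a stand-in it uses java.util.regex, where Matcher.reset(CharSequence) plays the role that setTarget() plays for StringSearch, and a case-insensitive literal pattern only loosely approximates UNICODE_CI. The class name CachedContains and its method are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Illustration only: a stand-in for the "construct once, retarget per row" idea.
 * Uses java.util.regex instead of ICU StringSearch; CachedContains is a
 * hypothetical name and is not part of Spark or of this PR.
 */
final class CachedContains {
    // Built once for the constant pattern (analogous to creating one StringSearch).
    private final Matcher matcher;

    CachedContains(String constantPattern) {
        // LITERAL treats the pattern as plain text; the case flags still apply to it.
        Pattern compiled = Pattern.compile(
            constantPattern,
            Pattern.LITERAL | Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
        // Start with an empty target; the real target is supplied per row.
        this.matcher = compiled.matcher("");
    }

    /** Per-row work: retarget the existing matcher (analogous to setTarget()) and search. */
    boolean matches(String row) {
        return matcher.reset(row).find();
    }
}
```

Like the cached object described above, a Matcher holds mutable state and is not safe to share between threads, so in this sketch each evaluator would own its own CachedContains instance.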
Does this PR introduce any user-facing change?
No. Performance optimization only.
How was this patch tested?
All 192 existing collation tests pass across 7 test suites. New fixed-pattern benchmarks added:
Was this patch authored or co-authored using generative AI tooling?
Yes, Claude Code was used as an AI coding assistant.