@nateab commented Feb 10, 2026

What changes were proposed in this pull request?

Add StringSearch object caching for Contains, StartsWith, and EndsWith expressions when used with ICU-based collations (UNICODE, UNICODE_CI) and a compile-time constant (foldable) pattern.

Currently, every row evaluation constructs a new com.ibm.icu.text.StringSearch object. When the pattern is constant, this repeated construction is unnecessary. With this change, a single StringSearch is created once and reused by calling setTarget() for each new input string. This applies to both the interpreted path (@transient private lazy val) and the codegen path (ctx.addMutableState).
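
The build-once, retarget-per-row idea can be sketched without ICU on the classpath. In the sketch below, `CountingSearch` is a hypothetical stand-in for StringSearch (it uses plain `String.contains`, not collation-aware matching) and `CachedContainsSketch` is an illustrative name; neither is code from this PR. It only shows how many search objects get constructed in each approach.

```java
import java.util.List;

// Hypothetical stand-in for com.ibm.icu.text.StringSearch. It counts its own
// constructions and matches with String.contains rather than a collator.
final class CountingSearch {
    static int constructions = 0;

    private final String pattern;
    private String target = "";

    CountingSearch(String pattern) {
        this.pattern = pattern;
        constructions++; // stands in for the expensive construction step
    }

    // Same pattern, new input: the role setTarget() plays in the description above.
    void setTarget(String target) {
        this.target = target;
    }

    boolean matches() {
        return target.contains(pattern);
    }
}

final class CachedContainsSketch {
    // Before: a fresh search object for every row.
    static int countMatchesUncached(List<String> rows, String pattern) {
        int hits = 0;
        for (String row : rows) {
            CountingSearch search = new CountingSearch(pattern);
            search.setTarget(row);
            if (search.matches()) {
                hits++;
            }
        }
        return hits;
    }

    // After: one search object, retargeted for each row.
    static int countMatchesCached(List<String> rows, String pattern) {
        CountingSearch search = new CountingSearch(pattern);
        int hits = 0;
        for (String row : rows) {
            search.setTarget(row);
            if (search.matches()) {
                hits++;
            }
        }
        return hits;
    }
}
```

Both methods return the same match count; only the number of constructions differs (one per row versus one in total). Because a retargetable search object is mutable, the sketch keeps it local to a single loop rather than sharing it.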

Changes:

  • CollationFactory: add getStringSearchForPattern() factory method
  • CollationSupport: add cached execICU() overloads for Contains, StartsWith, EndsWith
  • stringExpressions.scala: wire caching into expression eval and codegen when pattern is foldable and collation is ICU-based
  • CollationBenchmark: add fixed-pattern benchmarks

Why are the changes needed?

ICU StringSearch construction is expensive. For queries that scan large tables with constant string predicates under ICU collations, that cost is paid on every row. In the benchmarks below, caching yields a 2.8-3.4X improvement.

Does this PR introduce any user-facing change?

No. Performance optimization only.

How was this patch tested?

All 192 existing collation tests pass across 7 test suites. New fixed-pattern benchmarks added:

| Operation | Varying pattern | Fixed pattern (cached) | Improvement |
|---|---|---|---|
| Contains (UNICODE vs UTF8_BINARY) | 115.0X slower | 33.8X slower | 3.4X |
| StartsWith | 124.2X slower | 37.1X slower | 3.3X |
| EndsWith | 137.5X slower | 50.0X slower | 2.8X |
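
The Improvement column appears to be the ratio of the two slowdown factors (varying pattern divided by fixed pattern). A quick check under that assumption, with `ImprovementCheck` as an illustrative name that is not part of the PR:

```java
final class ImprovementCheck {
    // Ratio of the two slowdown factors, rounded to one decimal place.
    static double improvement(double varyingSlowdown, double fixedSlowdown) {
        return Math.round(varyingSlowdown / fixedSlowdown * 10.0) / 10.0;
    }

    public static void main(String[] args) {
        System.out.println("Contains:   " + improvement(115.0, 33.8));
        System.out.println("StartsWith: " + improvement(124.2, 37.1));
        System.out.println("EndsWith:   " + improvement(137.5, 50.0));
    }
}
```

With the figures reported above, this gives 3.4, 3.3, and 2.8, matching the column.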

Was this patch authored or co-authored using generative AI tooling?

Yes, Claude Code was used as an AI coding assistant.

…tsWith/EndsWith

Add StringSearch object caching for UNICODE and UNICODE_CI collation
string predicates when the pattern is a compile-time constant (foldable).

Instead of creating a new StringSearch on every row evaluation, a single
StringSearch is created once and reused by calling setTarget() for each
new input string. This applies to both the interpreted path (via
@transient private lazy val) and the codegen path (via addMutableState).

Changes:
- CollationFactory: add getStringSearchForPattern() factory method
- CollationSupport: add cached execICU() overloads for Contains,
  StartsWith, EndsWith that accept a pre-built StringSearch
- stringExpressions: wire caching into Contains, StartsWith, EndsWith
  expression evaluation and code generation
- CollationBenchmark: add fixed-pattern benchmarks measuring caching
  benefit (2.8-3.4X improvement for UNICODE collation)
@nateab force-pushed the cache-icu-stringsearch-collation branch from d73acc5 to f9d4f58 on February 10, 2026 04:01
@nateab changed the title from "[SQL] Cache ICU StringSearch for collation string predicates with constant patterns" to "[SPARK-55430][SQL] Cache ICU StringSearch for collation string predicates with constant patterns" on Feb 10, 2026