[SPARK-55430][SQL] Cache ICU StringSearch for collation string predicates with constant patterns #54241

nateab · 2026-02-10T01:18:18Z

What changes were proposed in this pull request?

Add StringSearch object caching for Contains, StartsWith, and EndsWith expressions when used with ICU-based collations (UNICODE, UNICODE_CI) and a compile-time constant (foldable) pattern.

Currently, every row evaluation creates a new com.ibm.icu.text.StringSearch object. When the pattern is constant, this repeated construction is unnecessary. With this change, a single StringSearch is created once and reused via setTarget() for each new input string — both in interpreted (@transient private lazy val) and codegen (ctx.addMutableState) paths.
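The build-once, retarget-per-row idea can be sketched with JDK classes only. This is an analogy, not the PR's code: `java.util.regex.Matcher.reset(CharSequence)` stands in for `StringSearch.setTarget()`, and the class name `CachedContains` and its methods are hypothetical names chosen for illustration. Case-insensitive regex matching is also not equivalent to ICU collation-aware matching; the sketch only shows the object-reuse pattern.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// JDK-only analogy for the caching pattern described above.
// Matcher.reset(row) plays the role that StringSearch.setTarget() plays in the PR.
final class CachedContains {
    // Built once for the constant pattern, then reused for every row.
    private final Matcher matcher;

    CachedContains(String constantPattern) {
        // Pattern.quote treats the pattern as a literal string, not a regex.
        this.matcher = Pattern
            .compile(Pattern.quote(constantPattern),
                     Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE)
            .matcher("");
    }

    // Retarget the cached matcher at the next row instead of rebuilding it.
    boolean contains(String row) {
        return matcher.reset(row).find();
    }

    // Scans all rows with a single cached matcher and counts the hits.
    static long countMatches(String constantPattern, List<String> rows) {
        CachedContains cached = new CachedContains(constantPattern);
        long hits = 0;
        for (String row : rows) {
            if (cached.contains(row)) {
                hits++;
            }
        }
        return hits;
    }
}
```

The cached `Matcher` is mutable and not thread-safe, so it has to be owned by a single evaluator. That is presumably the same reason the PR keeps its cached object in per-expression state (a lazy val in the interpreted path, mutable state in generated code) rather than in a shared global cache.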

Changes:

- CollationFactory: add getStringSearchForPattern() factory method
- CollationSupport: add cached execICU() overloads for Contains, StartsWith, EndsWith
- stringExpressions.scala: wire caching into expression eval and codegen when pattern is foldable and collation is ICU-based
- CollationBenchmark: add fixed-pattern benchmarks

Why are the changes needed?

ICU StringSearch construction is expensive. For queries scanning large tables with constant string predicates under ICU collations, this overhead is incurred on every row. In the new fixed-pattern benchmarks, caching yields a 2.8-3.4X improvement (3.3-3.4X for Contains and StartsWith, 2.8X for EndsWith).

Does this PR introduce any user-facing change?

No. Performance optimization only.

How was this patch tested?

All 192 existing collation tests pass across 7 test suites. New fixed-pattern benchmarks added:

| Operation | Varying pattern | Fixed pattern (cached) | Improvement |
|---|---|---|---|
| Contains (UNICODE vs UTF8_BINARY) | 115.0X slower | 33.8X slower | 3.4X |
| StartsWith | 124.2X slower | 37.1X slower | 3.3X |
| EndsWith | 137.5X slower | 50.0X slower | 2.8X |
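The Improvement column appears to be the ratio of the two slowdown factors in each row (varying-pattern slowdown divided by fixed-pattern slowdown). A minimal sketch that recomputes it from the reported numbers, assuming that interpretation; the class and method names are hypothetical:

```java
// Recomputes the "Improvement" column from the two reported slowdown factors.
// Assumption: improvement = (varying-pattern slowdown) / (fixed-pattern slowdown).
final class CachingSpeedup {
    static double improvement(double varyingSlowdown, double fixedSlowdown) {
        return varyingSlowdown / fixedSlowdown;
    }

    public static void main(String[] args) {
        // Slowdown factors copied from the benchmark table above.
        System.out.printf("Contains:   %.2fX%n", improvement(115.0, 33.8));
        System.out.printf("StartsWith: %.2fX%n", improvement(124.2, 37.1));
        System.out.printf("EndsWith:   %.2fX%n", improvement(137.5, 50.0));
    }
}
```

Under this reading, the ratios come out near 3.40, 3.35, and 2.75, which agrees with the 3.4X, 3.3X, and 2.8X figures reported in the table.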

Was this patch authored or co-authored using generative AI tooling?

Yes, Claude Code was used as an AI coding assistant.

…tsWith/EndsWith

Add StringSearch object caching for UNICODE and UNICODE_CI collation string predicates when the pattern is a compile-time constant (foldable). Instead of creating a new StringSearch on every row evaluation, a single StringSearch is created once and reused by calling setTarget() for each new input string. This applies to both the interpreted path (via @transient private lazy val) and the codegen path (via addMutableState).

Changes:

- CollationFactory: add getStringSearchForPattern() factory method
- CollationSupport: add cached execICU() overloads for Contains, StartsWith, EndsWith that accept a pre-built StringSearch
- stringExpressions: wire caching into Contains, StartsWith, EndsWith expression evaluation and code generation
- CollationBenchmark: add fixed-pattern benchmarks measuring caching benefit (3-3.4X improvement for UNICODE collation)

nateab force-pushed the cache-icu-stringsearch-collation branch from d73acc5 to f9d4f58 Compare February 10, 2026 04:01

nateab changed the title ~~[SQL] Cache ICU StringSearch for collation string predicates with constant patterns~~ [SPARK-55430][SQL] Cache ICU StringSearch for collation string predicates with constant patterns Feb 10, 2026

trigger CI

4a12cb1
