[VL] Fix silent stats prune for non-binary collation StringType in cache by yaooqinn · Pull Request #12112 · apache/gluten

yaooqinn · 2026-05-19T05:03:13Z

What changes were proposed in this pull request?

Gate non-binary-collation StringType columns in the Velox cache path to supported=0 (writer side) AND strip any AND-conjunct that references such a column from the reader-side buildFilter predicate vector (before delegating to super.buildFilter). Writer / wire format unchanged.

New shim API SparkShims.isBinaryCollationString — default true for Spark 3.x shims (no collation concept), overridden on Spark 4.0 / 4.1 to check collationId == UTF8_BINARY_COLLATION_ID.
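A minimal sketch of how such a shim gate could look, written in plain Scala so it runs without Spark on the classpath. Everything except the method name isBinaryCollationString is an assumption made for illustration: the DataType and StringType stand-ins, the method's parameter type, the constant value 0, and the statsSupported helper are hypothetical and are not taken from the patch.

```scala
object ShimSketch {
  // Hypothetical stand-ins for Spark's type hierarchy; not the real classes.
  sealed trait DataType
  case object IntegerType extends DataType
  final case class StringType(collationId: Int) extends DataType

  // Assumed value for illustration; the real code reads
  // CollationFactory.UTF8_BINARY_COLLATION_ID on Spark 4.x.
  val Utf8BinaryCollationId: Int = 0

  trait SparkShims {
    // Default for Spark 3.x shims: there is no collation concept,
    // so every StringType is treated as binary.
    def isBinaryCollationString(dt: DataType): Boolean = true
  }

  object Spark3Shims extends SparkShims

  object Spark4Shims extends SparkShims {
    // Spark 4.x override: only UTF8_BINARY counts as binary.
    override def isBinaryCollationString(dt: DataType): Boolean = dt match {
      case StringType(id) => id == Utf8BinaryCollationId
      case _              => true
    }
  }

  // Hypothetical writer-side gate: true means "supported=1" (stats usable),
  // false means the column is demoted to "supported=0".
  def statsSupported(shims: SparkShims, dt: DataType): Boolean = dt match {
    case s: StringType => shims.isBinaryCollationString(s)
    case _             => true
  }
}
```

With this shape, a Spark 3.x shim inherits the default and keeps the fast path, while a Spark 4.x shim demotes any string column whose collation id differs from the binary one.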

Why are the changes needed?

On Spark 4.x with a non-binary collation, Velox's scanMinMax<StringView> does an unsigned byte-order compare while Spark's filter compare is collation-aware (PhysicalStringType.ordering = CollationFactory.fetchCollation(id).comparator). The two disagree, so stats-based pruning can silently drop matching rows.

Repro:

spark.sql("CREATE TABLE t(s STRING COLLATE UTF8_LCASE) USING parquet")
spark.sql("INSERT INTO t VALUES 'abc', 'XYZ'")
spark.sql("CACHE TABLE t")
spark.sql("SELECT * FROM t WHERE s = 'ABC'").show()
// Before: 0 rows (wrong). After: 1 row.

Vanilla Spark's StringColumnStats is collation-aware, so this is Gluten-specific.
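The repro can be traced by hand with a small standalone sketch. It assumes that lower-casing before comparison is a fair stand-in for UTF8_LCASE, and that the partition filter for an equality predicate is a lower <= literal <= upper check. The names byteOrder, lcase and mayContain are hypothetical and exist only in this sketch.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import java.util.Locale

object CollationPruneSketch {
  // Unsigned byte-order comparison over UTF-8 bytes, the ordering the
  // description attributes to scanMinMax<StringView>.
  val byteOrder: Ordering[String] = new Ordering[String] {
    def compare(a: String, b: String): Int = {
      val x = a.getBytes(UTF_8)
      val y = b.getBytes(UTF_8)
      var i = 0
      while (i < math.min(x.length, y.length)) {
        val c = (x(i) & 0xff) - (y(i) & 0xff)
        if (c != 0) return c
        i += 1
      }
      x.length - y.length
    }
  }

  // Assumption: lower-case-then-compare approximates UTF8_LCASE.
  val lcase: Ordering[String] =
    Ordering.by((s: String) => s.toLowerCase(Locale.ROOT))

  // Bound check evaluated with the collation-aware ordering,
  // as the filter side would do for s = literal.
  def mayContain(lower: String, upper: String, literal: String): Boolean =
    lcase.lteq(lower, literal) && lcase.lteq(literal, upper)

  val values: Seq[String] = Seq("abc", "XYZ")

  // Stats computed in byte order: 'X' (0x58) sorts before 'a' (0x61).
  val byteMin: String = values.min(byteOrder)
  val byteMax: String = values.max(byteOrder)

  // Mixed orderings: byte-order stats checked with the lcase comparator.
  val prunedWithByteStats: Boolean = !mayContain(byteMin, byteMax, "ABC")

  // Consistent orderings: stats and check both use lcase.
  val prunedWithLcaseStats: Boolean =
    !mayContain(values.min(lcase), values.max(lcase), "ABC")
}
```

Under these assumptions the byte-order stats give min "XYZ" and max "abc", and the lower-bound check "xyz" <= "abc" fails, so the batch holding the matching row 'abc' is pruned. When the stats use the same ordering as the check, the batch is kept.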

Reader-side approach (rev 3.2)

Earlier revisions filled a 0xFF * 256B sentinel upper bound on the deserialize side to keep vanilla buildFilter from pruning. As pointed out by @zhli1142015, that sentinel is not a universal upper bound under non-binary collation orderings, so it is not safe.

Rev 3.2 drops the sentinel and instead wraps SimpleMetricsCachedBatchSerializer.buildFilter with a splitConjunctivePredicates-based predicate-strip layer (stripUnsupportedConjuncts):

For each input predicate, split into AND-conjuncts.
Drop every conjunct whose references contain any attribute that was demoted to supported=0 (i.e. a non-binary collation StringType in cachedAttributes).
Rebuild surviving conjuncts with And.reduce; bypass entirely if nothing references a demoted column.
Or sub-trees stay intact (one losing-stats disjunct already loses the whole Or anyway, so it's conservative).
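The four steps above can be sketched on a toy expression tree. This is not the code from the patch: Expr, Pred, And, Or and splitConjuncts below are hypothetical stand-ins for Catalyst's expression classes and splitConjunctivePredicates, and attributes are modelled as plain column names.

```scala
object StripSketch {
  // Toy expression tree; `references` mimics Expression.references.
  sealed trait Expr { def references: Set[String] }
  final case class Pred(text: String, references: Set[String]) extends Expr
  final case class And(left: Expr, right: Expr) extends Expr {
    def references: Set[String] = left.references ++ right.references
  }
  final case class Or(left: Expr, right: Expr) extends Expr {
    def references: Set[String] = left.references ++ right.references
  }

  // Flatten nested And nodes; anything else (including Or) is one conjunct.
  def splitConjuncts(e: Expr): Seq[Expr] = e match {
    case And(l, r) => splitConjuncts(l) ++ splitConjuncts(r)
    case other     => Seq(other)
  }

  // Drop every conjunct that touches a demoted column, then rebuild
  // the survivors with And. Predicates that never touch a demoted
  // column are passed through untouched (the bypass case).
  def stripUnsupportedConjuncts(
      predicates: Seq[Expr],
      demoted: Set[String]): Seq[Expr] = {
    def touchesDemoted(e: Expr): Boolean = e.references.exists(demoted)
    predicates.flatMap { p =>
      if (!touchesDemoted(p)) Seq(p)
      else splitConjuncts(p)
        .filterNot(touchesDemoted)
        .reduceOption[Expr](And(_, _))
        .toSeq
    }
  }
}
```

For example, with column s demoted, a predicate (i > 5 AND s = 'ABC') shrinks to i > 5, so the integer stats can still prune, while (i > 5 OR s = 'ABC') is dropped as a whole and the batch is kept.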

Empty filtered predicates degrade gracefully: vanilla SimpleMetricsCachedBatchSerializer reduces partitionFilters with .reduceOption(And).getOrElse(Literal(true)), so the partition filter becomes pass-through (verified against spark-sql_2.13-4.0.1-sources CachedBatchSerializer.scala).

A real collation-aware bound (matching vanilla StringColumnStats.semanticCompare) would require teaching the cpp scanMinMax path about collations, likely via ICU sort keys — tracked as a Phase-2 follow-up.

Does this PR introduce any user-facing change?

Yes — correctness fix. No new config.

How was this patch tested?

New ColumnarCachedBatchBuildFilterPruneSuite W1–W8 (wrapper behavior + anti-regression bypass).
New ColumnarCachedBatchE2ESuite cases for UTF8_LCASE + UNICODE_CI predicate over cached batch.
Existing suites: ColumnarCacheShipBlockerMarshalSuite, ColumnarCachedBatchStatsBlobSuite, ColumnarCachedBatchIntFamilyMarshalSuite.
mvn clean install + suites verified on:
- -Pspark-4.0 -Pscala-2.13 — 42/42 PASS
- -Pspark-4.1 -Pscala-2.13 — 42/42 PASS
- -Pspark-3.5 -Pscala-2.12 — 32 PASS + 10 cleanly canceled (W1–W8 + 2 E2E collation cases guarded by assume(isCollationAware), since CollationFactory does not exist on 3.5)

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Opus 4.7

github-actions · 2026-05-19T05:33:16Z

Run Gluten Clickhouse CI on x86

github-actions · 2026-05-19T06:20:32Z

Run Gluten Clickhouse CI on x86

zhli1142015 · 2026-05-19T09:38:49Z

Comment 1 — Test coverage gap

The new sentinel buildFilter tests currently use AttributeReference("c", StringType)(), which is the default binary-collation StringType. That does not exercise the actual non-binary collation path this PR is trying to fix. SimpleMetricsCachedBatchSerializer.buildFilter builds lower/upper-bound attributes from the cached attribute data type, so for the real failure case comparisons are collation-aware, not binary byte-order comparisons.

Could you add test coverage with an explicit non-binary collation StringType (for example UTF8_BINARY_LCASE/UNICODE, depending on the supported Spark profile) so EqualTo, In, StartsWith, and range predicates are evaluated through the same collation-aware comparison path as production?

Comment 2 — Sentinel upper bound correctness under non-binary collation

The sentinel upper bound is currently 256 bytes of 0xff. That is a safe “max” value for binary byte ordering, but for non-binary collations Spark comparisons are collation-aware. Those bytes are invalid UTF-8 and will not necessarily compare as a universal maximum value under every Spark collation. If a literal can compare greater than this sentinel under a non-binary collation, the parent buildFilter may still prune the batch incorrectly.

A safer design would be to make the Velox serializer’s buildFilter wrapper ignore/minimize min/max pruning for non-binary-collation string attributes, rather than fabricating synthetic min/max bounds. That would preserve correctness by passing through batches whenever the predicate depends on unsupported string ordering stats.

yaooqinn · 2026-05-19T16:08:00Z

Thanks @zhli1142015 for the thorough review — both points are valid and addressed in rev 3.2 (48c8bc749d).

C1 (test coverage): The earlier sentinel-suite only exercised the deserialize hot path. Added 8 wrapper-behavior tests (ColumnarCachedBatchBuildFilterPruneSuite W1–W8) plus end-to-end UTF8_LCASE / UNICODE_CI cases in ColumnarCachedBatchE2ESuite. W8 is an explicit anti-regression that bypasses the wrapper to prove a UTF8_LCASE bound check would have wrongly pruned the batch.

C2 (sentinel correctness): You are right — 0xFF * 256B is not a universal upper bound under non-binary collations. PhysicalStringType.ordering = CollationFactory.fetchCollation(id).comparator (PhysicalDataType.scala:334) means <= on collation-aware StringType is governed by the collation’s comparator, not by raw byte order, so any fixed sentinel can falsify the bound check.

Rev 3.2 abandons the sentinel approach entirely. Reader-side mechanism is now a splitConjunctivePredicates-based wrapper around SimpleMetricsCachedBatchSerializer.buildFilter: any AND-conjunct referencing a non-binary collation StringType attribute is dropped before partition-stats evaluation; binary attributes still prune. Conjuncts that cannot be split (Or, etc.) referencing such attributes conservatively keep the batch. Writer/wire-format unchanged.

Verified locally:

mvn clean install + suites on -Pspark-3.5/scala-2.12, -Pspark-4.0/scala-2.13, -Pspark-4.1/scala-2.13
spark-4.0 / 4.1: 42/42 PASS
spark-3.5: 32 PASS + 10 cleanly canceled (W1–W8 + 2 E2E collation cases guarded by assume(isCollationAware), since CollationFactory does not exist on 3.5)

PR description updated to reflect rev 3.2. PTAL.

github-actions · 2026-05-19T16:59:24Z

Run Gluten Clickhouse CI on x86

github-actions · 2026-05-19T17:09:02Z

Run Gluten Clickhouse CI on x86

github-actions · 2026-05-19T17:21:16Z

Run Gluten Clickhouse CI on x86

zhli1142015

LGTM, thanks

@zhli1142015

Apache Spark 4.0+ introduced collation-aware StringType. Cached batch partition-stats currently dispatches any StringType through the supported=1 fast path; cpp scanMinMax<StringView> + JVM encodeStringBounds use unsigned byte order while non-binary collations (UTF8_LCASE, UNICODE_CI, etc) use collation rules for equality/order. Mismatch can silently prune correct rows on collation-aware predicates.

This patch gates StringType dispatch to UTF8_BINARY only via a new shim method. Spark 3.x shims do not need to override (the default returns true; binary-only behavior is correct on those branches' cached batch path). Spark 4.0/4.1 shims override using CollationFactory.UTF8_BINARY_COLLATION_ID. Non-binary collation columns are demoted to supported=0.

For the reader side, this patch wraps SimpleMetricsCachedBatchSerializer.buildFilter with a predicate-stripping wrapper: any AND-conjunct that references a non-binary-collation StringType attribute is dropped (via splitConjunctivePredicates), leaving only conjuncts on binary-safe attributes for partition-stats evaluation. Predicates that cannot be split (Or, etc.) referencing such an attribute conservatively keep the batch.

This replaces the earlier sentinel-bound approach. As pointed out by @zhli1142015 in the review, a fixed 0xFF upper sentinel is not a universal upper bound under non-binary collations (where ordering is governed by PhysicalStringType.ordering = CollationFactory.fetchCollation(id).comparator). The wrapper avoids the question entirely by removing the predicate before it ever reaches the stats-based bound check, so correctness no longer depends on what "max" means for the collation.

A real collation-aware bound (matching vanilla StringColumnStats.semanticCompare) would require teaching the cpp scanMinMax path about collations, likely via ICU sort keys. That is tracked as a Phase-2 follow-up.

Tests:

ColumnarCachedBatchBuildFilterPruneSuite (existing, extended):
- W1: wrapper strips predicate on non-binary collation StringType attr
- W2: wrapper preserves predicate on binary collation StringType attr
- W3: mixed-attr conjunct keeps binary, batch pruned by int
- W4: nested And splits deeply
- W5: Or branch conservatively stripped, batch kept
- W6: IsNull(nb) stripped, IsNotNull(int) kept
- W7: In + StartsWith on nb both stripped
- W8 (anti-regression): bypassing wrapper would let UTF8_LCASE bound prune

ColumnarCachedBatchE2ESuite (new cases): end-to-end UTF8_LCASE + UNICODE_CI predicate over cached batch, no incorrect pruning.

W1-W8 + UNICODE_CI case are guarded by assume(isCollationAware) and skip cleanly on Spark 3.5 where CollationFactory does not exist.

Verified: mvn clean install + suites on -Pspark-3.5/scala-2.12, -Pspark-4.0/scala-2.13, -Pspark-4.1/scala-2.13. spark-4.0/4.1: 42/42 PASS. spark-3.5: 32 PASS, 10 cleanly canceled by assume.

Generated-by: Claude Opus 4.7

github-actions · 2026-05-20T02:00:50Z

Run Gluten Clickhouse CI on x86

github-actions Bot added the CORE and VELOX labels May 19, 2026

yaooqinn force-pushed the users/kentyao/cache-stats-collation-gate branch from 9641baa to cab3e3e Compare May 19, 2026 05:56

yaooqinn force-pushed the users/kentyao/cache-stats-collation-gate branch 3 times, most recently from 48c8bc7 to bc74c18 Compare May 19, 2026 16:07

yaooqinn force-pushed the users/kentyao/cache-stats-collation-gate branch from bc74c18 to 2788186 Compare May 19, 2026 17:00

zhli1142015 approved these changes May 19, 2026

yaooqinn force-pushed the users/kentyao/cache-stats-collation-gate branch from 2788186 to 8a61a7d Compare May 20, 2026 02:00

yaooqinn merged commit 9e6e8bf into apache:main May 20, 2026
62 checks passed

yaooqinn merged 1 commit into apache:main from yaooqinn:users/kentyao/cache-stats-collation-gate
