[VL] Fix silent stats prune for non-binary collation StringType in cache #12112

Merged
yaooqinn merged 1 commit into apache:main from yaooqinn:users/kentyao/cache-stats-collation-gate
May 20, 2026
@yaooqinn (Member) commented May 19, 2026

What changes were proposed in this pull request?

Demote non-binary-collation StringType columns in the Velox cache path to supported=0 on the writer side, and strip any AND-conjunct that references such a column from the reader-side buildFilter predicate vector before delegating to super.buildFilter. Writer / wire format unchanged.

New shim API SparkShims.isBinaryCollationString — default true for Spark 3.x shims (no collation concept), overridden on Spark 4.0 / 4.1 to check collationId == UTF8_BINARY_COLLATION_ID.
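
For readers unfamiliar with the shim pattern, here is a minimal sketch of how a default-true method with a Spark 4.x override could gate the writer-side flag. Everything except the method name isBinaryCollationString is an assumption made for illustration: StringTypeModel, SparkShimsModel, supportedFlag, and the placeholder collation id are hypothetical and not taken from the patch.

```scala
object ShimSketch {
  // Hypothetical stand-in for a StringType carrying a collation id.
  final case class StringTypeModel(collationId: Int)

  // Placeholder for CollationFactory.UTF8_BINARY_COLLATION_ID (value assumed here).
  val Utf8BinaryCollationId: Int = 0

  trait SparkShimsModel {
    // Default for Spark 3.x shims: no collation concept, so every string is binary.
    def isBinaryCollationString(dt: StringTypeModel): Boolean = true
  }

  object Spark3Shims extends SparkShimsModel

  object Spark4Shims extends SparkShimsModel {
    // Spark 4.x override: only the binary collation keeps byte-order stats.
    override def isBinaryCollationString(dt: StringTypeModel): Boolean =
      dt.collationId == Utf8BinaryCollationId
  }

  // Writer-side gate: 1 keeps min/max stats, 0 demotes the column.
  def supportedFlag(shims: SparkShimsModel, dt: StringTypeModel): Int =
    if (shims.isBinaryCollationString(dt)) 1 else 0
}
```

With this shape, Spark 3.x shims need no change, and only the 4.x shims carry the collation check.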

Why are the changes needed?

On Spark 4.x with a non-binary collation, Velox's scanMinMax<StringView> does an unsigned byte-order compare while Spark's filter compare is collation-aware (PhysicalStringType.ordering = CollationFactory.fetchCollation(id).comparator). The two disagree, so stats-based pruning can silently drop matching rows.

Repro:

spark.sql("CREATE TABLE t(s STRING COLLATE UTF8_LCASE) USING parquet")
spark.sql("INSERT INTO t VALUES 'abc', 'XYZ'")
spark.sql("CACHE TABLE t")
spark.sql("SELECT * FROM t WHERE s = 'ABC'").show()
// Before: 0 rows (wrong). After: 1 row.

Vanilla Spark's StringColumnStats is collation-aware, so this is Gluten-specific.
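
The mismatch can be modelled without Spark or Velox. The sketch below is a simplified illustration, not the actual code path: compareBytes stands in for the unsigned byte-order compare, and compareLcase is a rough approximation of a UTF8_LCASE-style comparator (lowercase, then compare). All names are hypothetical.

```scala
import java.nio.charset.StandardCharsets
import java.util.Locale

object StatsMismatchSketch {
  // Unsigned byte-wise comparison of UTF-8 encodings (binary ordering).
  def compareBytes(a: String, b: String): Int = {
    val x = a.getBytes(StandardCharsets.UTF_8).map(_ & 0xff)
    val y = b.getBytes(StandardCharsets.UTF_8).map(_ & 0xff)
    x.zip(y).collectFirst { case (p, q) if p != q => p - q }
      .getOrElse(x.length - y.length)
  }

  // Rough stand-in for a case-insensitive collation comparator.
  def compareLcase(a: String, b: String): Int =
    compareBytes(a.toLowerCase(Locale.ROOT), b.toLowerCase(Locale.ROOT))

  // Min/max of a batch under the given ordering.
  def minMax(values: Seq[String], cmp: (String, String) => Int): (String, String) = {
    val sorted = values.sortWith((a, b) => cmp(a, b) < 0)
    (sorted.head, sorted.last)
  }

  // Stats check for `col = lit`: keep the batch only if lower <= lit <= upper.
  def mayContain(lower: String, upper: String, lit: String,
                 cmp: (String, String) => Int): Boolean =
    cmp(lower, lit) <= 0 && cmp(lit, upper) <= 0
}

import StatsMismatchSketch._
val batch = Seq("abc", "XYZ")

// Byte-order stats: 'X' (0x58) sorts before 'a' (0x61), so lower = "XYZ", upper = "abc".
val (byteLower, byteUpper) = minMax(batch, compareBytes)
// Checking those bounds with the case-insensitive comparator rejects 'ABC',
// even though 'abc' matches it: the batch would be pruned.
val prunedWrongly = !mayContain(byteLower, byteUpper, "ABC", compareLcase)

// Stats computed with the same comparator as the filter keep the batch.
val (collLower, collUpper) = minMax(batch, compareLcase)
val keptCorrectly = mayContain(collLower, collUpper, "ABC", compareLcase)
```

In this model the bounds and the bound check use different orderings, which is the condition under which the repro returns 0 rows.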

Reader-side approach (rev 3.2)

Earlier revisions filled a 0xFF * 256B sentinel upper bound on the deserialize side to keep vanilla buildFilter from pruning. As pointed out by @zhli1142015, that sentinel is not a universal upper bound under non-binary collation orderings, so it is not safe.

Rev 3.2 drops the sentinel and instead wraps SimpleMetricsCachedBatchSerializer.buildFilter with a splitConjunctivePredicates-based predicate-strip layer (stripUnsupportedConjuncts):

  • For each input predicate, split into AND-conjuncts.
  • Drop every conjunct whose references contain any attribute that was demoted to supported=0 (i.e. a non-binary collation StringType in cachedAttributes).
  • Rebuild surviving conjuncts with And.reduce; bypass entirely if nothing references a demoted column.
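  • Or sub-trees are not split: an Or that references a demoted column is dropped as a whole conjunct. This is conservative, because a single disjunct without usable stats already prevents the whole Or from pruning.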
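
The steps above can be sketched on a toy expression tree. This is a simplified model under stated assumptions, not the patch itself: Pred, Leaf, And, Or, splitConjuncts, and stripUnsupported are hypothetical stand-ins for Catalyst expressions, splitConjunctivePredicates, and stripUnsupportedConjuncts, and column names replace attribute references.

```scala
object StripSketch {
  // Stand-ins for Catalyst expressions; `refs` plays the role of Expression.references.
  sealed trait Pred { def refs: Set[String] }
  final case class Leaf(desc: String, refs: Set[String]) extends Pred
  final case class And(left: Pred, right: Pred) extends Pred {
    def refs: Set[String] = left.refs ++ right.refs
  }
  final case class Or(left: Pred, right: Pred) extends Pred {
    def refs: Set[String] = left.refs ++ right.refs
  }

  // Flatten nested And nodes; anything else (including Or) is a single conjunct.
  def splitConjuncts(p: Pred): Seq[Pred] = p match {
    case And(l, r) => splitConjuncts(l) ++ splitConjuncts(r)
    case other => Seq(other)
  }

  private def touches(p: Pred, demoted: Set[String]): Boolean =
    p.refs.exists(demoted.contains)

  // Drop every conjunct that references a demoted column; rebuild the rest with And.
  def stripUnsupported(predicates: Seq[Pred], demoted: Set[String]): Seq[Pred] =
    predicates.flatMap { p =>
      if (!touches(p, demoted)) Seq(p) // bypass: nothing references a demoted column
      else splitConjuncts(p).filterNot(touches(_, demoted))
        .reduceOption[Pred](And(_, _)).toList
    }
}

import StripSketch._
val s = Leaf("s = 'ABC'", Set("s")) // non-binary collation column, demoted
val i = Leaf("i > 5", Set("i"))     // int column, stats still usable
val demoted = Set("s")

val mixed = stripUnsupported(Seq(And(s, i)), demoted)  // only the int conjunct survives
val viaOr = stripUnsupported(Seq(Or(s, i)), demoted)   // whole Or dropped, batch kept
```

A predicate that loses all of its conjuncts simply disappears from the vector, so no pruning decision is ever made from the demoted column's stats.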

Empty filtered predicates degrade gracefully: vanilla SimpleMetricsCachedBatchSerializer reduces partitionFilters with .reduceOption(And).getOrElse(Literal(true)), so the partition filter becomes pass-through (verified against spark-sql_2.13-4.0.1-sources CachedBatchSerializer.scala).
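
The pass-through behaviour follows directly from the reduceOption / getOrElse shape. A minimal sketch, with partition filters modelled as plain functions over a hypothetical stats map (Stats, PartitionFilter, and combine are illustrative names, not Spark APIs):

```scala
object EmptyFilterSketch {
  // Hypothetical stand-ins: per-batch stats and a predicate over them.
  type Stats = Map[String, Int]
  type PartitionFilter = Stats => Boolean

  // Plays the role of Literal(true): keep every batch.
  private val passThrough: PartitionFilter = _ => true

  // AND all filters together; an empty list falls back to pass-through.
  def combine(filters: Seq[PartitionFilter]): PartitionFilter =
    filters
      .reduceOption[PartitionFilter]((f, g) => stats => f(stats) && g(stats))
      .getOrElse(passThrough)
}

import EmptyFilterSketch._
val stats: Stats = Map("i.upper" -> 3)
val intBound: PartitionFilter = st => st("i.upper") >= 5

val keptWhenEmpty = combine(Nil)(stats)            // no filters left: batch kept
val prunedByInt = !combine(Seq(intBound))(stats)   // surviving int filter still prunes
```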

A real collation-aware bound (matching vanilla StringColumnStats.semanticCompare) would require teaching the cpp scanMinMax path about collations, likely via ICU sort keys — tracked as a Phase-2 follow-up.

Does this PR introduce any user-facing change?

Yes — correctness fix. No new config.

How was this patch tested?

  • New ColumnarCachedBatchBuildFilterPruneSuite W1–W8 (wrapper behavior + anti-regression bypass).
  • New ColumnarCachedBatchE2ESuite cases for UTF8_LCASE + UNICODE_CI predicate over cached batch.
  • Existing suites: ColumnarCacheShipBlockerMarshalSuite, ColumnarCachedBatchStatsBlobSuite, ColumnarCachedBatchIntFamilyMarshalSuite.
  • mvn clean install + suites verified on:
    • -Pspark-4.0 -Pscala-2.13 — 42/42 PASS
    • -Pspark-4.1 -Pscala-2.13 — 42/42 PASS
    • -Pspark-3.5 -Pscala-2.12 — 32 PASS + 10 cleanly canceled (W1–W8 + 2 E2E collation cases guarded by assume(isCollationAware), since CollationFactory does not exist on 3.5)

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Opus 4.7

github-actions bot added the CORE (works for Gluten Core) and VELOX labels May 19, 2026
@github-actions
Run Gluten Clickhouse CI on x86

@yaooqinn yaooqinn force-pushed the users/kentyao/cache-stats-collation-gate branch from 9641baa to cab3e3e Compare May 19, 2026 05:56
@zhli1142015
Contributor

Comment 1 — Test coverage gap

The new sentinel buildFilter tests currently use AttributeReference("c", StringType)(), which is the default binary-collation StringType. That does not exercise the actual non-binary collation path this PR is trying to fix. SimpleMetricsCachedBatchSerializer.buildFilter builds lower/upper-bound attributes from the cached attribute data type, so for the real failure case comparisons are collation-aware, not binary byte-order comparisons.

Could you add test coverage with an explicit non-binary collation StringType (for example UTF8_BINARY_LCASE/UNICODE, depending on the supported Spark profile) so EqualTo, In, StartsWith, and range predicates are evaluated through the same collation-aware comparison path as production?

Comment 2 — Sentinel upper bound correctness under non-binary collation

The sentinel upper bound is currently 256 bytes of 0xff. That is a safe “max” value for binary byte ordering, but for non-binary collations Spark comparisons are collation-aware. Those bytes are invalid UTF-8 and will not necessarily compare as a universal maximum value under every Spark collation. If a literal can compare greater than this sentinel under a non-binary collation, the parent buildFilter may still prune the batch incorrectly.

A safer design would be to make the Velox serializer’s buildFilter wrapper ignore/minimize min/max pruning for non-binary-collation string attributes, rather than fabricating synthetic min/max bounds. That would preserve correctness by passing through batches whenever the predicate depends on unsupported string ordering stats.

@yaooqinn yaooqinn force-pushed the users/kentyao/cache-stats-collation-gate branch 3 times, most recently from 48c8bc7 to bc74c18 Compare May 19, 2026 16:07
@yaooqinn
Member Author

Thanks @zhli1142015 for the thorough review — both points are valid and addressed in rev 3.2 (48c8bc749d).

C1 (test coverage): The earlier sentinel-suite only exercised the deserialize hot path. Added 8 wrapper-behavior tests (ColumnarCachedBatchBuildFilterPruneSuite W1–W8) plus end-to-end UTF8_LCASE / UNICODE_CI cases in ColumnarCachedBatchE2ESuite. W8 is an explicit anti-regression that bypasses the wrapper to prove a UTF8_LCASE bound check would have wrongly pruned the batch.

C2 (sentinel correctness): You are right — 0xFF * 256B is not a universal upper bound under non-binary collations. PhysicalStringType.ordering = CollationFactory.fetchCollation(id).comparator (PhysicalDataType.scala:334) means <= on collation-aware StringType is governed by the collation’s comparator, not by raw byte order, so any fixed sentinel can falsify the bound check.

Rev 3.2 abandons the sentinel approach entirely. Reader-side mechanism is now a splitConjunctivePredicates-based wrapper around SimpleMetricsCachedBatchSerializer.buildFilter: any AND-conjunct referencing a non-binary collation StringType attribute is dropped before partition-stats evaluation; binary attributes still prune. Conjuncts that cannot be split (Or, etc.) referencing such attributes conservatively keep the batch. Writer/wire-format unchanged.

Verified locally:

  • mvn clean install + suites on -Pspark-3.5/scala-2.12, -Pspark-4.0/scala-2.13, -Pspark-4.1/scala-2.13
  • spark-4.0 / 4.1: 42/42 PASS
  • spark-3.5: 32 PASS + 10 cleanly canceled (W1–W8 + 2 E2E collation cases guarded by assume(isCollationAware), since CollationFactory does not exist on 3.5)

PR description updated to reflect rev 3.2. PTAL.

@yaooqinn yaooqinn force-pushed the users/kentyao/cache-stats-collation-gate branch from bc74c18 to 2788186 Compare May 19, 2026 17:00
@zhli1142015 zhli1142015 left a comment

LGTM, thanks

Apache Spark 4.0+ introduced collation-aware StringType. Cached batch
partition-stats currently dispatches any StringType through the supported=1
fast path; cpp scanMinMax<StringView> + JVM encodeStringBounds use unsigned
byte order while non-binary collations (UTF8_LCASE, UNICODE_CI, etc.) use
collation rules for equality/order. This mismatch can silently prune
matching rows on collation-aware predicates.

This patch gates StringType dispatch to UTF8_BINARY only via a new shim
method. Spark 3.x shims do not need to override (the default returns true;
binary-only behavior is correct on those branches' cached batch path).
Spark 4.0/4.1 shims override using CollationFactory.UTF8_BINARY_COLLATION_ID.

Non-binary collation columns are demoted to supported=0. For the reader
side, this patch wraps SimpleMetricsCachedBatchSerializer.buildFilter with
a predicate-stripping wrapper: any AND-conjunct that references a
non-binary-collation StringType attribute is dropped (via
splitConjunctivePredicates), leaving only conjuncts on binary-safe
attributes for partition-stats evaluation. Predicates that cannot be split
(Or, etc.) referencing such an attribute conservatively keep the batch.

This replaces the earlier sentinel-bound approach. As pointed out by
@zhli1142015 in the review, a fixed 0xFF upper sentinel is not a universal
upper bound under non-binary collations (where ordering is governed by
PhysicalStringType.ordering = CollationFactory.fetchCollation(id).comparator).
The wrapper avoids the question entirely by removing the predicate before
it ever reaches the stats-based bound check, so correctness no longer
depends on what "max" means for the collation.

A real collation-aware bound (matching vanilla StringColumnStats.semanticCompare)
would require teaching the cpp scanMinMax path about collations, likely via
ICU sort keys. That is tracked as a Phase-2 follow-up.

Tests:

  ColumnarCachedBatchBuildFilterPruneSuite (existing, extended):
    W1: wrapper strips predicate on non-binary collation StringType attr
    W2: wrapper preserves predicate on binary collation StringType attr
    W3: mixed-attr conjunct keeps binary, batch pruned by int
    W4: nested And splits deeply
    W5: Or branch conservatively stripped, batch kept
    W6: IsNull(nb) stripped, IsNotNull(int) kept
    W7: In + StartsWith on nb both stripped
    W8 (anti-regression): bypassing wrapper would let UTF8_LCASE bound prune

  ColumnarCachedBatchE2ESuite (new cases):
    end-to-end UTF8_LCASE + UNICODE_CI predicate over cached batch, no
    incorrect pruning.

  W1-W8 + the 2 E2E collation cases are guarded by assume(isCollationAware)
  and skip cleanly on Spark 3.5, where CollationFactory does not exist.

Verified:
  mvn clean install + suites on -Pspark-3.5/scala-2.12, -Pspark-4.0/scala-2.13,
  -Pspark-4.1/scala-2.13. spark-4.0/4.1: 42/42 PASS. spark-3.5: 32 PASS,
  10 cleanly canceled by assume.

Generated-by: Claude Opus 4.7
@yaooqinn yaooqinn force-pushed the users/kentyao/cache-stats-collation-gate branch from 2788186 to 8a61a7d Compare May 20, 2026 02:00
@yaooqinn yaooqinn merged commit 9e6e8bf into apache:main May 20, 2026
62 checks passed