[VL] Fix silent stats prune for non-binary collation StringType in cache#12112
Conversation
|
Run Gluten Clickhouse CI on x86 |
9641baa to
cab3e3e
Compare
|
Run Gluten Clickhouse CI on x86 |
|
Comment 1 — Test coverage gap The new sentinel buildFilter tests currently use AttributeReference("c", StringType)(), which is the default binary-collation StringType. That does not exercise the actual non-binary collation path this PR is trying to fix. SimpleMetricsCachedBatchSerializer.buildFilter builds lower/upper-bound attributes from the cached attribute data type, so for the real failure case comparisons are collation-aware, not binary byte-order comparisons. Could you add test coverage with an explicit non-binary collation StringType (for example UTF8_BINARY_LCASE/UNICODE, depending on the supported Spark profile) so EqualTo, In, StartsWith, and range predicates are evaluated through the same collation-aware comparison path as production? Comment 2 — Sentinel upper bound correctness under non-binary collation The sentinel upper bound is currently 256 bytes of 0xff. That is a safe “max” value for binary byte ordering, but for non-binary collations Spark comparisons are collation-aware. Those bytes are invalid UTF-8 and will not necessarily compare as a universal maximum value under every Spark collation. If a literal can compare greater than this sentinel under a non-binary collation, the parent buildFilter may still prune the batch incorrectly. A safer design would be to make the Velox serializer’s buildFilter wrapper ignore/minimize min/max pruning for non-binary-collation string attributes, rather than fabricating synthetic min/max bounds. That would preserve correctness by passing through batches whenever the predicate depends on unsupported string ordering stats. |
48c8bc7 to
bc74c18
Compare
|
Thanks @zhli1142015 for the thorough review — both points are valid and addressed in rev 3.2 ( C1 (test coverage): The earlier sentinel-suite only exercised the deserialize hot path. Added 8 wrapper-behavior tests ( C2 (sentinel correctness): You are right — Rev 3.2 abandons the sentinel approach entirely. Reader-side mechanism is now a Verified locally:
PR description updated to reflect rev 3.2. PTAL. |
|
Run Gluten Clickhouse CI on x86 |
bc74c18 to
2788186
Compare
|
Run Gluten Clickhouse CI on x86 |
1 similar comment
|
Run Gluten Clickhouse CI on x86 |
Apache Spark 4.0+ introduced collation-aware StringType. Cached batch partition-stats currently dispatches any StringType through the supported=1 fast path; cpp scanMinMax<StringView> + JVM encodeStringBounds use unsigned byte order while non-binary collations (UTF8_LCASE, UNICODE_CI, etc) use collation rules for equality/order. Mismatch can silently prune correct rows on collation-aware predicates. This patch gates StringType dispatch to UTF8_BINARY only via a new shim method. Spark 3.x shims do not need to override (the default returns true; binary-only behavior is correct on those branches' cached batch path). Spark 4.0/4.1 shims override using CollationFactory.UTF8_BINARY_COLLATION_ID. Non-binary collation columns are demoted to supported=0. For the reader side, this patch wraps SimpleMetricsCachedBatchSerializer.buildFilter with a predicate-stripping wrapper: any AND-conjunct that references a non-binary-collation StringType attribute is dropped (via splitConjunctivePredicates), leaving only conjuncts on binary-safe attributes for partition-stats evaluation. Predicates that cannot be split (Or, etc.) referencing such an attribute conservatively keep the batch. This replaces the earlier sentinel-bound approach. As pointed out by @zhli1142015 in the review, a fixed 0xFF upper sentinel is not a universal upper bound under non-binary collations (where ordering is governed by PhysicalStringType.ordering = CollationFactory.fetchCollation(id).comparator). The wrapper avoids the question entirely by removing the predicate before it ever reaches the stats-based bound check, so correctness no longer depends on what "max" means for the collation. A real collation-aware bound (matching vanilla StringColumnStats.semanticCompare) would require teaching the cpp scanMinMax path about collations, likely via ICU sort keys. That is tracked as a Phase-2 follow-up. Tests: ColumnarCachedBatchBuildFilterPruneSuite (existing, extended): W1: wrapper strips predicate on non-binary collation StringType attr W2: wrapper preserves predicate on binary collation StringType attr W3: mixed-attr conjunct keeps binary, batch pruned by int W4: nested And splits deeply W5: Or branch conservatively stripped, batch kept W6: IsNull(nb) stripped, IsNotNull(int) kept W7: In + StartsWith on nb both stripped W8 (anti-regression): bypassing wrapper would let UTF8_LCASE bound prune ColumnarCachedBatchE2ESuite (new cases): end-to-end UTF8_LCASE + UNICODE_CI predicate over cached batch, no incorrect pruning. W1-W8 + UNICODE_CI case are guarded by assume(isCollationAware) and skip cleanly on Spark 3.5 where CollationFactory does not exist. Verified: mvn clean install + suites on -Pspark-3.5/scala-2.12, -Pspark-4.0/scala-2.13, -Pspark-4.1/scala-2.13. spark-4.0/4.1: 42/42 PASS. spark-3.5: 32 PASS, 10 cleanly canceled by assume. Generated-by: Claude Opus 4.7
2788186 to
8a61a7d
Compare
|
Run Gluten Clickhouse CI on x86 |
What changes were proposed in this pull request?
Gate non-binary-collation
StringTypecolumns in the Velox cache path tosupported=0(writer side) AND strip any AND-conjunct that references such a column from the reader-sidebuildFilterpredicate vector (before delegating tosuper.buildFilter). Writer / wire format unchanged.New shim API
SparkShims.isBinaryCollationString— defaulttruefor Spark 3.x shims (no collation concept), overridden on Spark 4.0 / 4.1 to checkcollationId == UTF8_BINARY_COLLATION_ID.Why are the changes needed?
On Spark 4.x with a non-binary collation, Velox's
scanMinMax<StringView>does an unsigned byte-order compare while Spark's filter compare is collation-aware (PhysicalStringType.ordering = CollationFactory.fetchCollation(id).comparator). The two disagree, so stats-based pruning can silently drop matching rows.Repro:
Vanilla Spark's
StringColumnStatsis collation-aware, so this is Gluten-specific.Reader-side approach (rev 3.2)
Earlier revisions filled a
0xFF * 256Bsentinel upper bound on the deserialize side to keep vanillabuildFilterfrom pruning. As pointed out by @zhli1142015, that sentinel is not a universal upper bound under non-binary collation orderings, so it is not safe.Rev 3.2 drops the sentinel and instead wraps
SimpleMetricsCachedBatchSerializer.buildFilterwith asplitConjunctivePredicates-based predicate-strip layer (stripUnsupportedConjuncts):referencescontain any attribute that was demoted tosupported=0(i.e. a non-binary collation StringType incachedAttributes).And.reduce; bypass entirely if nothing references a demoted column.Orsub-trees stay intact (one losing-stats disjunct already loses the whole Or anyway, so it's conservative).Empty filtered predicates degrade gracefully: vanilla
SimpleMetricsCachedBatchSerializerreducespartitionFilterswith.reduceOption(And).getOrElse(Literal(true)), so the partition filter becomes pass-through (verified againstspark-sql_2.13-4.0.1-sourcesCachedBatchSerializer.scala).A real collation-aware bound (matching vanilla
StringColumnStats.semanticCompare) would require teaching the cppscanMinMaxpath about collations, likely via ICU sort keys — tracked as a Phase-2 follow-up.Does this PR introduce any user-facing change?
Yes — correctness fix. No new config.
How was this patch tested?
ColumnarCachedBatchBuildFilterPruneSuiteW1–W8 (wrapper behavior + anti-regression bypass).ColumnarCachedBatchE2ESuitecases for UTF8_LCASE + UNICODE_CI predicate over cached batch.ColumnarCacheShipBlockerMarshalSuite,ColumnarCachedBatchStatsBlobSuite,ColumnarCachedBatchIntFamilyMarshalSuite.mvn clean install+ suites verified on:-Pspark-4.0 -Pscala-2.13— 42/42 PASS-Pspark-4.1 -Pscala-2.13— 42/42 PASS-Pspark-3.5 -Pscala-2.12— 32 PASS + 10 cleanly canceled (W1–W8 + 2 E2E collation cases guarded byassume(isCollationAware), sinceCollationFactorydoes not exist on 3.5)Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Opus 4.7