[fix](variant) Support safe widening cast pushdown for variant inverted indexes#63118
[fix](variant) Support safe widening cast pushdown for variant inverted indexes#63118wuguowei1994 wants to merge 1 commit into
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
e75111a to
904d4c0
Compare
|
run buildall |
|
/review |
eldenmoon
left a comment
There was a problem hiding this comment.
I found a correctness blocker in the relaxed variant predicate compatibility check. The current regression covers only the same-width CAST(... AS INT) case, but this change also enables cross-width integer casts and same-family string casts without normalizing the predicate value to the segment storage encoding.
TPC-H: Total hot run time: 29643 ms |
|
Thank you for the detailed feedback. Based on your suggestions, I made the pushdown behavior more conservative and added additional compatibility checks, especially for FLOAT/DOUBLE round-trip safety and unsafe parametric type conversions. I also updated the regression tests to cover both positive and negative cases for the newly supported cast scenarios. PS: |
TPC-DS: Total hot run time: 170611 ms |
d4d86b8 to
6c2b533
Compare
There was a problem hiding this comment.
Thanks for updating the patch to avoid unsafe cross-width index encoding. I still think this needs changes because the latest exact-type rule no longer fixes the original CAST(v["int_key"] AS INT) scenario described in the PR body. Critical checkpoints: goal/test coverage is not satisfied because only same-width TINYINT pushdown is now proven while the documented INT reproduction remains non-pushdown; the code change is small and focused; no new concurrency, lifecycle, config, persistence, FE-BE protocol, or storage-format compatibility concerns were introduced; the main correctness risk is now an incomplete fix rather than wrong-result pushdown; observability is unchanged and adequate for this path through the existing debug/profile checks. User focus: no additional user-provided focus was specified.
|
@eldenmoon After reconsidering it, I believe we should strive for higher standards ourselves. I’ve revised the approach described in the comment above. Please give me one week to come back with a better implementation. |
|
currently only bigint in interger types will be infered |
6c2b533 to
3096cdc
Compare
3096cdc to
d397fa1
Compare
TPC-H: Total hot run time: 31829 ms |
TPC-DS: Total hot run time: 169281 ms |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
37a0544 to
571e6dd
Compare
TPC-H: Total hot run time: 31058 ms |
TPC-DS: Total hot run time: 169495 ms |
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
### What problem does this PR solve? Related PR: apache#63118 Problem Summary: Variant subcolumn predicates such as cast(v["int_key"] as bigint) IN (...) and cast(v["float_key"] as double) = 1.5 were not being pushed down to the inverted index. The FE plan wraps these as nested casts of the form CAST(CAST(slot(v) AS storage_dtype) AS user_target), but three BE code paths only accepted a single-level CAST(slot): - _filter_and_collect_cast_type_for_variant() returned early when the outer cast's child was not SLOT_REF, so it never recorded the user target type and the slot kept its original VARIANT value range. - is_valid_push_down_cast() required children[0]->children().at(0) to be a slot ref, so _is_predicate_acting_on_slot() rejected nested casts and skipped column-predicate construction. - _evaluate_inverted_index() in the common-expr-pushdown path also required cast_expr->get_child(0)->is_slot_ref(), so the second-pass index probe could not peel the cast either. This change peels the whole cast chain via VExpr::expr_without_cast() in all three places, while still using the outermost cast's target type for compatibility / round-trip checks. After the fix the column predicate path constructs widened ColumnValueRange / ColumnPredicate correctly, and convert_to_storage_value() normalizes each literal back to the storage type before probing. ### Release note None ### Check List (For Author) - Test: No need to test (covered by regression tests in apache#63118 once the test expectations are aligned in the follow-up commit) - Behavior changed: No - Does this need documentation: No Co-authored-by: Cursor <cursoragent@cursor.com>
…_without_cast ### What problem does this PR solve? Related PR: apache#63118 Problem Summary: The original regression-test/suites/inverted_index_p0/ test_variant_inverted_index_cast.groovy had three issues: - The IndexFilter section and HitRows counters appear in the profile as soon as the segment iterator initializes its inverted-index runtime, even when no predicate was actually pushed down. Asserting on profileText.contains("IndexFilter:") and on HitRows is therefore brittle. Switch the "index used" / "index not used" judgement to RowsInvertedIndexFiltered (rows the inverted index actually removed from the scan), which is exactly zero iff the index did not prune anything. - The cast(v["int_key"] as double) = 13.0 case is folded by Nereids into CAST(v AS int) = 13 and goes through the existing INT equality path, not the new BE widening logic. Drop that case (and the related OR variant) since the assertion no longer matches the PR semantics. - The cast(v["string_key"] as varchar(20)) case is wrapped by the FE as substring(CAST(CAST(v AS text) AS varchar(20)), 1, 20), which is outside the slot/cast-only contract this PR enables. Drop the case and document the limitation; the cast-to-text case continues to cover string-family widening. To balance the deletions, add a real negative case cast(v["int_key"] as bigint) = 5000000000 to verify that convert_to_storage_value() rejects out-of-range literals at probe time and falls back to full scan (RowsInvertedIndexFiltered == 0, ScanRows == 20) while still returning the correct empty result. Also add four lightweight unit tests in be/test/exprs/vexpr_test.cpp to pin down VExpr::expr_without_cast() behavior (no-cast / single-level / nested / non-slot leaf), since the variant pushdown paths fixed in the previous commit rely on it to find the underlying slot beneath a chain of FE-emitted casts. ### Release note None ### Check List (For Author) - Test: Regression test (this commit only adjusts test expectations and adds new test coverage) / Unit Test - Behavior changed: No - Does this need documentation: No Co-authored-by: Cursor <cursoragent@cursor.com>
…ed indexes ### What problem does this PR solve? Issue Number: None Related PR: apache#63118 Problem Summary: Inverted index predicate pushdown did not support safe widening casts on indexed VARIANT subcolumns, so type-compatible cast predicates could not use the inverted index and scanned more rows than necessary. This change allows storage-compatible cast predicates to be pushed down and converts compatible query literals to the segment storage type before building inverted-index query values, keeping the query encoding consistent with the indexed storage representation. ### Release note Support inverted index predicate pushdown for safe widening cast predicates on VARIANT subcolumns. ### Check List (For Author) - Test: Added regression and unit test coverage - Regression test: test_variant_inverted_index_cast covers positive and negative VARIANT inverted-index cast pushdown cases. - Unit Test: vexpr_test and inverted_index_reader_test cover cast peeling and storage-value conversion. - Behavior changed: Yes. Compatible VARIANT subcolumn cast predicates can now use inverted-index pushdown; unsupported or unsafe casts remain unpushed. - Does this need documentation: No
17f631e to
64bc776
Compare
…ed indexes ### What problem does this PR solve? Issue Number: None Related PR: apache#63118 Problem Summary: Inverted index predicate pushdown did not support safe widening casts on indexed VARIANT subcolumns, so type-compatible cast predicates could not use the inverted index and scanned more rows than necessary. This change allows storage-compatible cast predicates to be pushed down and converts compatible query literals to the segment storage type before building inverted-index query values, keeping the query encoding consistent with the indexed storage representation. ### Release note Support inverted index predicate pushdown for safe widening cast predicates on VARIANT subcolumns. ### Check List (For Author) - Test: Added regression and unit test coverage - Regression test: test_variant_inverted_index_cast covers positive and negative VARIANT inverted-index cast pushdown cases. - Unit Test: vexpr_test and inverted_index_reader_test cover cast peeling and storage-value conversion. - Behavior changed: Yes. Compatible VARIANT subcolumn cast predicates can now use inverted-index pushdown; unsupported or unsafe casts remain unpushed. - Does this need documentation: No
f4c84b6 to
3895a08
Compare
|
@eldenmoon Done! Could you take a look? |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
Summary
Inverted index predicate pushdown currently does not support safe widening casts on indexed VARIANT subcolumns.
As a result, many type-compatible cast predicates cannot use the inverted index and end up scanning more rows than necessary.
This change extends pushdown support to a limited set of storage-compatible casts where query literals can be safely converted to the segment storage type while preserving inverted-index encoding consistency.
Supported cast categories include:
Integer widening:
TINYINT -> SMALLINT -> INT -> BIGINT -> LARGEINTFLOAT -> DOUBLECHAR/VARCHAR/STRINGcompatibility, because the inverted-index string query path uses the same byte representation for all string primitive typesARRAYpushdown only for exact array type matchesCompatible query literals are converted to the segment storage type before building the inverted-index query value, ensuring that the generated query encoding matches the indexed storage representation.
Reproduction
Expected Behavior
The predicate:
should be pushed down to the inverted index. Only the matching row should be scanned after index filtering.
Actual Behavior
The query result is correct, but the inverted index does not effectively filter the data.
All 20 rows are scanned because the current pushdown logic does not handle safe widening casts between the query
CASTtype and the indexedVARIANTsubcolumn storage type.