[Bug](scan) Preserve IN_LIST runtime filter predicates when key range…#62115

Merged
yiguolei merged 1 commit into apache:branch-4.0 from BiteTheDDDDt:cp_0404_2
Apr 4, 2026

@BiteTheDDDDt
Contributor

… is a scope range (#62027)

This pull request addresses a bug in the OLAP scan operator where IN_LIST predicates could be incorrectly erased when both MINMAX and IN runtime filters targeted the same key column, and the number of IN values exceeded the maximum allowed for pushdown. The changes ensure that IN_LIST predicates are preserved in such cases, preventing incorrect query results. Additionally, a regression test is added to verify the fix.
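
To make the failure mode concrete, here is a minimal standalone sketch (not Doris code; `ScopeRange` and `count_matches` are hypothetical names introduced only for illustration). It assumes a MINMAX filter built from the keys {1, 5, 9}, which yields the scope range [1, 9]. That range admits every key in between, so dropping the IN_LIST predicate lets extra rows through.

```cpp
#include <algorithm>
#include <set>
#include <vector>

// Hypothetical stand-in for a scope key range: lo <= key <= hi.
struct ScopeRange {
    int lo;
    int hi;
    bool contains(int key) const { return key >= lo && key <= hi; }
};

// Counts the probe keys that pass the scope range and, if given, the IN set.
// Passing nullptr for in_set models the buggy case where IN_LIST was erased.
inline int count_matches(const std::vector<int>& probe_keys, const ScopeRange& range,
                         const std::set<int>* in_set) {
    return static_cast<int>(
            std::count_if(probe_keys.begin(), probe_keys.end(), [&](int key) {
                return range.contains(key) &&
                       (in_set == nullptr || in_set->count(key) > 0);
            }));
}
```

With probe keys 0 through 10, the range alone admits nine keys, while the range combined with the IN set admits only the three build-side keys.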

Bug fix in predicate handling:

  • Modified the logic in _build_key_ranges_and_filters() within olap_scan_operator.cpp to ensure that IN_LIST predicates are not erased when the key range is a scope range (e.g., >= X AND <= Y) and the IN filter's value count exceeds max_pushdown_conditions_per_column. This preserves filtering semantics that are not captured by the scope range.

  • Enhanced the profiling output in _process_conjuncts() to accurately reflect the set of predicates that will reach the storage layer after key range and filter construction. This helps with debugging and verification of predicate pushdown.
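
The threshold rule in the first bullet can be sketched as follows. This is an illustrative model under stated assumptions, not the code in olap_scan_operator.cpp: `RangeShape`, `KeyRangePlan`, and `plan_key_range` are hypothetical names, and the sketch assumes an IN list that fits within the limit is absorbed into the key range as fixed values.

```cpp
#include <cstddef>

// Hypothetical model of the two key-range shapes named in the description.
enum class RangeShape { Fixed, Scope };

struct KeyRangePlan {
    RangeShape shape;             // shape of the key range sent to storage
    bool keep_in_list_predicate;  // whether IN_LIST must still be pushed down
};

// Assumption: an IN list within the limit is absorbed into the key range as
// fixed values, so the range itself encodes the IN semantics. A larger list
// leaves a scope range (>= X AND <= Y), which does not encode it, so the
// IN_LIST predicate has to survive.
inline KeyRangePlan plan_key_range(std::size_t in_value_count,
                                   std::size_t max_pushdown_conditions_per_column) {
    if (in_value_count <= max_pushdown_conditions_per_column) {
        return {RangeShape::Fixed, false};
    }
    return {RangeShape::Scope, true};
}
```

For example, with a limit of 4, an IN list of 6 values leaves a scope range and keeps the predicate, while a list of 3 values is absorbed.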

Testing and regression coverage:

  • Added a new regression test test_rf_in_list_not_erased_by_scope_range.groovy to verify that IN_LIST predicates are not incorrectly erased when both MINMAX and IN filters are present and the IN list is too large to be absorbed into the key range.

  • Added the corresponding expected output file for the new regression test.

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

Release note

None


Copilot AI review requested due to automatic review settings April 3, 2026 18:06
@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@BiteTheDDDDt
Contributor Author

run buildall

Contributor

Copilot AI left a comment


Pull request overview

Fixes an OLAP scan predicate-pushdown bug where IN_LIST runtime filter predicates could be incorrectly removed when a MINMAX runtime filter produced a scope key range on the same key column (especially when IN values exceed max_pushdown_conditions_per_column), leading to incorrect filtering semantics reaching storage.

Changes:

  • Adjusted key-range construction in OlapScanLocalState::_build_key_ranges_and_filters() to only erase predicates that are truly subsumed by the generated scan key range (preserving IN_LIST for scope ranges).
  • Centralized/removed ColumnPredicate::could_be_erased() overrides and moved predicate-erasure decision logic into the OLAP scan operator based on predicate type + range shape (fixed vs scope).
  • Added a regression test + expected output to validate that IN_LIST predicates are not erased by scope ranges.
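
The "truly subsumed" criterion in the first change can be stated as a checkable invariant: a predicate is safe to erase only if every key the range admits already satisfies it. The sketch below models that invariant over an enumerated set of integer keys. `safe_to_erase` is a hypothetical helper for illustration; the operator itself reportedly decides from predicate type and range shape rather than by enumerating keys.

```cpp
#include <algorithm>
#include <functional>
#include <vector>

// Hypothetical check: erasing a predicate is safe only when every key the
// scan key range admits already satisfies that predicate.
inline bool safe_to_erase(const std::vector<int>& keys_admitted_by_range,
                          const std::function<bool(int)>& predicate) {
    return std::all_of(keys_admitted_by_range.begin(), keys_admitted_by_range.end(),
                       predicate);
}
```

Under this model, a scope range admitting keys 1 through 9 subsumes `key >= 1 && key <= 9` but not an IN list of {1, 5, 9}, whereas a fixed range admitting exactly {1, 5, 9} subsumes that IN list.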

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| regression-test/suites/correctness_p0/test_rf_in_list_not_erased_by_scope_range.groovy | Adds a regression test reproducing the scope-range + oversized-IN runtime filter scenario. |
| regression-test/data/correctness_p0/test_rf_in_list_not_erased_by_scope_range.out | Expected output for the new regression test. |
| be/src/pipeline/exec/scan_operator.cpp | Updates profiling to report pushdown predicates after conjunct processing (more accurate post-normalization view). |
| be/src/pipeline/exec/olap_scan_operator.cpp | Fixes predicate erasure logic when key range is exact but only a scope range (preserve IN_LIST when not subsumed). |
| be/src/olap/null_predicate.h | Removes could_be_erased() override (logic moved to OLAP scan operator). |
| be/src/olap/in_list_predicate.h | Removes could_be_erased() override (logic moved to OLAP scan operator). |
| be/src/olap/comparison_predicate.h | Removes could_be_erased() override (logic moved to OLAP scan operator). |
| be/src/olap/column_predicate.h | Removes the could_be_erased() API from the base predicate interface. |

Comment on lines +73 to +74
// Use both IN and MIN_MAX runtime filter types so both are generated on the join key.
sql "set runtime_filter_type = 'IN_OR_BLOOM_FILTER,MIN_MAX';"

Copilot AI Apr 3, 2026


This regression test is intended to exercise the IN_LIST runtime filter path, but runtime_filter_type = 'IN_OR_BLOOM_FILTER,MIN_MAX' can legitimately switch to a Bloom filter (based on runtime_filter_max_in_num / build-side cardinality). If it switches to Bloom, the IN_LIST predicate isn’t generated and the test may no longer cover the bug it’s meant to prevent. Consider forcing an IN filter here (e.g. use runtime_filter_type = 'IN,MIN_MAX' and/or explicitly set runtime_filter_max_in_num to a value >= the build-side distinct count) so the test reliably validates IN_LIST-not-erased behavior.

Suggested change
- // Use both IN and MIN_MAX runtime filter types so both are generated on the join key.
- sql "set runtime_filter_type = 'IN_OR_BLOOM_FILTER,MIN_MAX';"
+ // Force the runtime filter path to generate an IN_LIST predicate together with MIN_MAX.
+ // Also set runtime_filter_max_in_num above the 6 distinct build-side keys so the
+ // engine cannot legitimately switch this test to a Bloom filter.
+ sql "set runtime_filter_type = 'IN,MIN_MAX';"
+ sql "set runtime_filter_max_in_num = 16;"
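
The switching behavior this comment worries about can be modeled roughly as below. This is a hedged sketch based only on the comment's description, not the engine's actual selection code: `RuntimeFilterKind` and `choose_in_or_bloom` are hypothetical names, and the exact boundary condition is an assumption.

```cpp
#include <cstddef>

enum class RuntimeFilterKind { In, Bloom };

// Hypothetical model of IN_OR_BLOOM_FILTER: keep an IN list while the
// build-side distinct count stays within runtime_filter_max_in_num, and
// fall back to a Bloom filter otherwise. The <= boundary is an assumption.
inline RuntimeFilterKind choose_in_or_bloom(std::size_t build_distinct_count,
                                            std::size_t runtime_filter_max_in_num) {
    return build_distinct_count <= runtime_filter_max_in_num ? RuntimeFilterKind::In
                                                             : RuntimeFilterKind::Bloom;
}
```

Under this model, the 6 distinct build-side keys mentioned in the suggestion stay on the IN path when the limit is 16, but would switch to Bloom if the limit were lower than 6, in which case no IN_LIST predicate is generated and the test no longer exercises the fix.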

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 0.00% (0/38) 🎉

Increment coverage report
Complete coverage report

| Category | Coverage |
| --- | --- |
| Function Coverage | 52.92% (19239/36356) |
| Line Coverage | 36.11% (179242/496405) |
| Region Coverage | 32.72% (139080/425110) |
| Branch Coverage | 33.66% (60325/179210) |

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 100.00% (38/38) 🎉

Increment coverage report
Complete coverage report

| Category | Coverage |
| --- | --- |
| Function Coverage | 71.31% (25380/35590) |
| Line Coverage | 53.99% (267521/495511) |
| Region Coverage | 51.59% (221505/429383) |
| Branch Coverage | 53.01% (95342/179853) |

yiguolei merged commit 3d10da9 into apache:branch-4.0 on Apr 4, 2026 (31 of 33 checks passed).