Fall back to raw-value REGEXP_LIKE evaluator when no dict-consuming index is available #18589
Closed
deepthi912 wants to merge 1 commit into
…ndex is available
When FST/IFST exists but the column has no sorted/inverted index that can consume a
dict-id-based predicate evaluator, FilterPlanNode previously built the FST/IFST
evaluator unconditionally. With a RAW forward index, FilterOperatorUtils then fell
through to ScanBasedFilterOperator, which calls applySV(String) on the dict-id
evaluator — that throws UnsupportedOperationException
(BaseDictionaryBasedPredicateEvaluator), crashing queries such as
`regexp_like(col, 'pat', 'i')` and `LIKE 'pat'` on external/iceberg-backed tables
with `encodingType: RAW` + `dictionary: {}` + `ifst: { enabled: true }`.
Add canConsumeDictIdEvaluator() — only construct the FST/IFST dict-id evaluator
when a sorted or inverted index is available for this data source (matching the
operator-routing logic in FilterOperatorUtils#getLeafFilterOperator). Otherwise
fall through to PredicateEvaluatorProvider, which returns
RawValueBasedRegexpLikePredicateEvaluator; that evaluator already
implements applySV(String) correctly. No changes to base classes or scan iterator.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
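
The failure mode described in the commit message can be illustrated with a minimal Java sketch. Everything here is a hypothetical stand-in, not Pinot's real API: `PredicateEvaluator`, `DictIdEvaluator`, `RawValueRegexpEvaluator`, and `countMatches` are simplified illustrations, and the use of `find()` for matching is an assumption. The point is only that a scan over raw strings calls `applySV(String)`, which a dict-id-only evaluator cannot serve.

```java
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical, simplified stand-ins for the evaluator classes named above.
// None of these types are Pinot's real API; they only mirror the failure mode.
final class EvaluatorSketch {

  /** Minimal evaluator contract: match by dictionary id or by raw value. */
  interface PredicateEvaluator {
    boolean applySV(int dictId);

    boolean applySV(String value);
  }

  /** Dict-id-based evaluator: it only knows which dictionary ids match. */
  static final class DictIdEvaluator implements PredicateEvaluator {
    private final Set<Integer> matchingDictIds;

    DictIdEvaluator(Set<Integer> matchingDictIds) {
      this.matchingDictIds = matchingDictIds;
    }

    @Override
    public boolean applySV(int dictId) {
      return matchingDictIds.contains(dictId);
    }

    @Override
    public boolean applySV(String value) {
      // Raw values are not supported, mirroring the reported exception.
      throw new UnsupportedOperationException("dict-id evaluator cannot match raw values");
    }
  }

  /** Raw-value evaluator: applies the regex to each string directly. */
  static final class RawValueRegexpEvaluator implements PredicateEvaluator {
    private final Pattern pattern;

    RawValueRegexpEvaluator(String regex, boolean caseInsensitive) {
      this.pattern = Pattern.compile(regex, caseInsensitive ? Pattern.CASE_INSENSITIVE : 0);
    }

    @Override
    public boolean applySV(int dictId) {
      throw new UnsupportedOperationException("raw-value evaluator has no dictionary ids");
    }

    @Override
    public boolean applySV(String value) {
      // Assumption for this sketch: substring-style matching via find().
      return pattern.matcher(value).find();
    }
  }

  /** Scans raw strings the way a scan over a RAW forward index would. */
  static int countMatches(PredicateEvaluator evaluator, List<String> rawValues) {
    int count = 0;
    for (String value : rawValues) {
      if (evaluator.applySV(value)) {
        count++;
      }
    }
    return count;
  }

  public static void main(String[] args) {
    List<String> rawValues = List.of("PATTERN", "other", "spat");
    System.out.println(countMatches(new RawValueRegexpEvaluator("pat", true), rawValues));
    try {
      countMatches(new DictIdEvaluator(Set.of(0)), rawValues);
    } catch (UnsupportedOperationException e) {
      System.out.println("dict-id evaluator failed: " + e.getMessage());
    }
  }
}
```

In this sketch the raw-value evaluator completes the scan, while the dict-id evaluator fails on the first row, which is the shape of the crash the commit message attributes to ScanBasedFilterOperator on a RAW forward index.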
Closing: opened in error against apache/master. Will keep the change in a fork PR for offline review.
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##             master   #18589      +/-   ##
============================================
- Coverage     64.25%   64.25%    -0.01%
  Complexity     1137     1137
============================================
  Files          3335     3335
  Lines        205708   205906     +198
  Branches      32084    32133      +49
============================================
+ Hits         132181   132302     +121
- Misses        62901    62956      +55
- Partials      10626    10648      +22
Follow-up to #18579 per @Jackie-Jiang's review feedback: #18579 (comment)
#18579 routed raw-value scans through a per-row `dictionary.indexOf(value)` translation in `SVScanDocIdIterator`. This PR replaces that approach with a planner-level guard that avoids constructing a dict-id-based evaluator in the first place when no dict-consuming operator is available.

Bug

Even after #18579, `regexp_like(col, ...)` and `LIKE` against a RAW-encoded string column with a separate dictionary plus FST or IFST (and no inverted/sorted index) follow a suboptimal path: the IFST/FST evaluator is constructed (wasted FST automaton work), and then the scan does a per-row dictionary lookup to translate every value. This is what Jackie called out.

When the FST/IFST evaluator cannot be consumed by an index-based filter operator, it shouldn't be constructed at all. The existing `RawValueBasedRegexpLikePredicateEvaluator` is the right choice and already implements `applySV(String)` correctly.

Root cause

The `REGEXP_LIKE` case in `FilterPlanNode` unconditionally builds the IFST/FST dict-id evaluator when the corresponding index exists, bypassing the invariant that `PredicateEvaluatorProvider.getDictionaryUsableForFiltering` enforces for every other dict-based predicate type.

Fix

Add the same invariant to the IFST/FST branches in `FilterPlanNode`.

The helper, `canConsumeDictIdEvaluator()`, covers all three runtime consumers of a dict-id evaluator:

- `SortedIndexBasedFilterOperator`
- `InvertedIndexFilterOperator`
- `ScanBasedFilterOperator` when the forward index is dict-encoded (uses `DictIdMatcher`, which calls `applySV(int dictId)`)

When none of these will be picked, the IFST/FST evaluator is skipped and the raw-value evaluator is used instead; `RawValueBasedRegexpLikePredicateEvaluator` already implements `applySV(String)`.

Behavior matrix
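
The routing that the Fix section describes can be written down as a small decision function. This is a minimal Java sketch under stated assumptions, not the code from the PR: `ColumnIndexes`, `EvaluatorKind`, and `chooseRegexpLikeEvaluator` are hypothetical names invented for illustration, and the body of `canConsumeDictIdEvaluator` is inferred from the three consumers listed in the description rather than copied from `FilterPlanNode`.

```java
// Hypothetical sketch of the planner-level guard; names and fields are
// illustrative only and do not come from the Pinot code base.
final class DictIdGuardSketch {

  /** Which evaluator the planner ends up constructing in this sketch. */
  enum EvaluatorKind { FST_DICT_ID, PROVIDER_FALLBACK }

  /** Illustrative summary of the indexes available on one column. */
  static final class ColumnIndexes {
    final boolean hasSortedIndex;
    final boolean hasInvertedIndex;
    final boolean forwardIndexDictEncoded;
    final boolean hasFstOrIfst;

    ColumnIndexes(boolean hasSortedIndex, boolean hasInvertedIndex,
        boolean forwardIndexDictEncoded, boolean hasFstOrIfst) {
      this.hasSortedIndex = hasSortedIndex;
      this.hasInvertedIndex = hasInvertedIndex;
      this.forwardIndexDictEncoded = forwardIndexDictEncoded;
      this.hasFstOrIfst = hasFstOrIfst;
    }
  }

  /** True when some operator could consume a dict-id-based evaluator. */
  static boolean canConsumeDictIdEvaluator(ColumnIndexes c) {
    return c.hasSortedIndex || c.hasInvertedIndex || c.forwardIndexDictEncoded;
  }

  /** Builds the FST/IFST evaluator only when it can actually be consumed. */
  static EvaluatorKind chooseRegexpLikeEvaluator(ColumnIndexes c) {
    if (c.hasFstOrIfst && canConsumeDictIdEvaluator(c)) {
      return EvaluatorKind.FST_DICT_ID;
    }
    // Otherwise defer to the provider, which the description says returns
    // the raw-value evaluator in the RAW forward index scenario.
    return EvaluatorKind.PROVIDER_FALLBACK;
  }

  public static void main(String[] args) {
    ColumnIndexes rawWithFst = new ColumnIndexes(false, false, false, true);
    ColumnIndexes invertedWithFst = new ColumnIndexes(false, true, false, true);
    ColumnIndexes dictForwardWithFst = new ColumnIndexes(false, false, true, true);
    System.out.println("RAW forward + FST/IFST only -> " + chooseRegexpLikeEvaluator(rawWithFst));
    System.out.println("inverted + FST/IFST -> " + chooseRegexpLikeEvaluator(invertedWithFst));
    System.out.println("dict forward + FST/IFST -> " + chooseRegexpLikeEvaluator(dictForwardWithFst));
  }
}
```

Keeping the guard in the planner means the scan path never sees an evaluator it cannot call, which is why the description can leave the scan iterator and base evaluator classes untouched.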
DictIdMatcher | StringMatcher

Reverts

This effectively reverts the matcher-level changes from #18579 (which Jackie identified as too expensive) by addressing the same crash at a higher layer with fewer lines and no public API change.

What's NOT changed
- `SVScanDocIdIterator`: reverted to upstream/master state (removes the matchers added in #18579).
- `BaseDictionaryBasedPredicateEvaluator`: untouched.
- `FilterOperatorUtils.getLeafFilterOperator`: untouched.

Known limitation (separate follow-up)

The `ScanBasedRegexpLikePredicateEvaluator` (used when dict ≥10K entries) extends `BaseDictionaryBasedPredicateEvaluator` directly, not `BaseDictIdBasedRegexpLikePredicateEvaluator`. `FilterOperatorUtils.getLeafFilterOperator` checks for the latter only, so on RAW forward + dict ≥10K + inverted index (no FST/IFST), the planner builds a `ScanBased` evaluator but the operator selector falls through to scan, producing the same crash. Out of scope for this PR; will file a separate fix to broaden the `instanceof` check.

Test plan

- `PredicateEvaluatorProviderTest` should pass unchanged
- `regexp_like(col, 'pat')`, `regexp_like(col, 'pat', 'i')`, and `LIKE 'pat%'` all return correct results (no crash) on a table with RAW forward + dict + FST/IFST + no inverted (the iceberg/external-table scenario that triggered #18579)
- `InvertedIndexFilterOperator` (fast path)

🤖 Generated with Claude Code