Fall back to raw-value REGEXP_LIKE evaluator when no dict-consuming index is available #18579
Merged
xiangfu0 merged 2 commits on May 26, 2026
… index is RAW + dictionary exists

When a column has encodingType=RAW but a separate dictionary is built for secondary indexes (INVERTED, FST, IFST, RANGE), dict-based predicate evaluators created by FilterPlanNode (e.g. DictIdBasedRegexpLikePredicateEvaluator, IFSTBasedRegexpPredicateEvaluator) only implement applySV(int dictId). The previous fallback in SVScanDocIdIterator.getValueMatcher() selected typed raw matchers (e.g. StringMatcher), which call applySV(<rawType>) on the predicate evaluator. Those overloads in BaseDictionaryBasedPredicateEvaluator throw UnsupportedOperationException, crashing queries such as `regexp_like(col, 'pattern', 'i')` or `LIKE 'pattern'` (which is internally case-insensitive) on external/iceberg-backed tables with this layout.

Add a dict-lookup matcher per stored type that reads the raw value from the forward index, translates it to a dict id via the separate dictionary, and applies the predicate with applySV(int dictId). It is selected only when (a) the forward index reports isDictionaryEncoded() == false, (b) the data source still exposes a non-null Dictionary, and (c) the predicate evaluator is a BaseDictionaryBasedPredicateEvaluator. Existing DictIdMatcher and typed raw matcher paths are unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
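The dict-lookup matcher described in the commit message above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `SortedStringDictionary`, `scan`, and the use of `IntPredicate` are hypothetical stand-ins for Pinot's `Dictionary`, the scan iterator, and a dict-id predicate evaluator. The only behaviors taken from the source are that `indexOf` returns -1 for an absent value and that such a value is treated as a non-match without calling the evaluator.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.IntPredicate;

/** Sketch only: simplified stand-ins, not Pinot classes. */
public class DictLookupMatcherSketch {

  /** Hypothetical sorted dictionary: indexOf returns the dict id, or -1 if the value is absent. */
  public static final class SortedStringDictionary {
    private final String[] sortedValues;

    public SortedStringDictionary(String... values) {
      sortedValues = values.clone();
      Arrays.sort(sortedValues);
    }

    public int indexOf(String value) {
      int idx = Arrays.binarySearch(sortedValues, value);
      return idx >= 0 ? idx : -1;
    }
  }

  /**
   * Scans a raw (non-dictionary-encoded) forward index, translating each raw
   * value to a dict id before applying a predicate that only understands dict ids.
   */
  public static List<Integer> scan(String[] rawForwardIndex, SortedStringDictionary dictionary,
      IntPredicate dictIdPredicate) {
    List<Integer> matchingDocIds = new ArrayList<>();
    for (int docId = 0; docId < rawForwardIndex.length; docId++) {
      // One dictionary lookup per document.
      int dictId = dictionary.indexOf(rawForwardIndex[docId]);
      // A value missing from the dictionary is a non-match; the predicate is never invoked for it.
      if (dictId >= 0 && dictIdPredicate.test(dictId)) {
        matchingDocIds.add(docId);
      }
    }
    return matchingDocIds;
  }

  public static void main(String[] args) {
    SortedStringDictionary dictionary = new SortedStringDictionary("apple", "banana", "cherry");
    String[] forwardIndex = {"banana", "apple", "zebra", "cherry", "banana"};
    System.out.println(scan(forwardIndex, dictionary, dictId -> dictId == 1 || dictId == 2));
  }
}
```

In this sketch every scanned document pays for one dictionary lookup (a binary search here) in addition to the forward-index read, which is the per-value cost a scan over a large segment would accumulate.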
Covers the RAW forward index + separate dictionary + dict-based predicate evaluator configuration that previously triggered UnsupportedOperationException. The test stand-in evaluator inherits the default applySV(String) that throws, so any regression in the matcher-selection logic surfaces as a test failure rather than silent data corruption.

Two cases:
- Multiple matching dict ids return the correct doc ids in order.
- A raw value absent from the dictionary (indexOf returns -1) is treated as no-match without invoking applySV on the evaluator.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #18579 +/- ##
==========================================
Coverage 64.25% 64.26%
Complexity 1137 1137
==========================================
Files 3335 3335
Lines 205708 205845 +137
Branches 32084 32115 +31
==========================================
+ Hits 132181 132282 +101
- Misses 62901 62920 +19
- Partials 10626 10643 +17
xiangfu0 approved these changes on May 26, 2026
Contributor
This is not the correct fix. It is far too expensive to look up every single value read from the forward index. The correct fix should be within the predicate evaluator, which should handle raw values even when a dictionary exists.
This was referenced May 27, 2026
Closed
deepthi912 added a commit to deepthi912/pinot that referenced this pull request on May 27, 2026
…RAW forward index + separate dictionary (apache#18579)"
This reverts commit acf9d4f.
Bug
`regexp_like(col, ...)` and `LIKE` (case-insensitive `regexp_like`) crash with `UnsupportedOperationException` when a column has `encodingType: RAW` + a separate dictionary built for secondary indexes (FST / IFST / INVERTED / RANGE), with no sorted or inverted index. This layout is common on external/iceberg-backed tables.

Example table config:

    {
      "name": "col_string",
      "encodingType": "RAW",
      "indexes": {
        "forward": { "encodingType": "RAW" },
        "dictionary": {},
        "ifst": { "enabled": true }
      }
    }

Stack:
The bug is symmetric across FST and IFST: whichever side has the index file built, that query path crashes. The other path is accidentally safe because it falls back to `RawValueBasedRegexpLikePredicateEvaluator`.

Root cause
The `case REGEXP_LIKE` branch in `FilterPlanNode` builds the dict-id evaluator (`IFSTBasedRegexpPredicateEvaluator` / `FSTBasedRegexpPredicateEvaluator`) unconditionally when the corresponding index exists. These evaluators only implement `applySV(int dictId)`. When the operator selector in `FilterOperatorUtils#getLeafFilterOperator` finds no sorted/inverted index to consume those dict ids, it falls through to `ScanBasedFilterOperator`, which reads raw values from the forward index and calls `applySV(String)`, and that call throws.

The existing `PredicateEvaluatorProvider.getDictionaryUsableForFiltering` already encodes the correct invariant for non-FST/IFST cases:

The FST/IFST short-circuit in `FilterPlanNode` previously bypassed this guard.

Fix
Only construct the FST/IFST dict-id evaluator when a sorted or inverted index is actually available (this matches the routing condition the operator selector uses). Otherwise fall through to `PredicateEvaluatorProvider`, which returns the raw-value evaluator that already implements `applySV(String)` correctly.

What's NOT changed
- `BaseDictionaryBasedPredicateEvaluator`: untouched. No `final` removal, no new defaults.
- `SVScanDocIdIterator`: untouched. No new matchers.
- `FilterOperatorUtils.getLeafFilterOperator`: untouched.
- `RawValueBasedRegexpLikePredicateEvaluator`: reused unchanged.

Diff
`pinot-core/src/main/java/org/apache/pinot/core/plan/FilterPlanNode.java`: +24 / -2, single file.

Behavior after fix
- `InvertedIndexFilterOperator` (bitmap union, unchanged fast path)
- `ScanBasedFilterOperator` → `StringMatcher.applySV(String)` works

Test plan
- `PredicateEvaluatorProviderTest` passes unchanged
- `regexp_like(col, 'pat')`, `regexp_like(col, 'pat', 'i')`, and `LIKE 'pat%'` all return results (no crash) on tables with RAW forward + dict + FST/IFST + no inverted
- `InvertedIndexFilterOperator` (fast path)
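The guard described under "Fix" can be sketched as a small decision function. This is an illustrative sketch under stated assumptions, not the actual `FilterPlanNode` change: the class, the method names `useFstDictIdEvaluator` and `rawRegexpLike`, and the boolean parameters are all hypothetical, and the raw-value path is modeled with `java.util.regex` using `find()` semantics, which is an assumption about how the raw evaluator matches.

```java
import java.util.regex.Pattern;

/** Sketch only: hypothetical names, not the FilterPlanNode implementation. */
public class RegexpEvaluatorSelectionSketch {

  /**
   * Returns true only when an FST/IFST index exists AND some index can consume
   * the dict ids the evaluator produces. Otherwise the caller would fall through
   * to the regular provider path, which can hand back a raw-value evaluator.
   */
  public static boolean useFstDictIdEvaluator(boolean hasFstOrIfstIndex, boolean hasSortedIndex,
      boolean hasInvertedIndex) {
    boolean dictIdsConsumable = hasSortedIndex || hasInvertedIndex;
    return hasFstOrIfstIndex && dictIdsConsumable;
  }

  /** Stand-in for the raw-value path: applies the pattern directly to the stored string. */
  public static boolean rawRegexpLike(String value, String regex, boolean caseInsensitive) {
    int flags = caseInsensitive ? Pattern.CASE_INSENSITIVE : 0;
    return Pattern.compile(regex, flags).matcher(value).find();
  }

  public static void main(String[] args) {
    // RAW forward index + dictionary + IFST, no sorted/inverted index: take the raw-value path.
    System.out.println(useFstDictIdEvaluator(true, false, false));
    System.out.println(rawRegexpLike("Hello World", "hello", true));
  }
}
```

The point of the sketch is that the evaluator choice and the operator choice key off the same condition, so a dict-id-only evaluator is never handed to a scan that supplies raw values.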