Fall back to raw-value REGEXP_LIKE evaluator when no dict-consuming index is available by deepthi912 · Pull Request #18589 · apache/pinot

deepthi912 · 2026-05-27T03:24:27Z

Follow-up to #18579 per @Jackie-Jiang's review feedback: #18579 (comment)

"This is not the correct fix. It is way too expensive to lookup every single value within forward index. The correct fix should be within predicate evaluator to handle raw values even when dictionary exist"

#18579 routed raw-value scans through a per-row dictionary.indexOf(value) translation in SVScanDocIdIterator. This PR replaces that approach with a planner-level guard that avoids constructing a dict-id-based evaluator in the first place when no dict-consuming operator is available.

Bug

Even after #18579, regexp_like(col, ...) and LIKE against a RAW-encoded string column with a separate dictionary plus FST or IFST (and no inverted/sorted index) follow a suboptimal path: the IFST/FST evaluator is constructed (wasted FST automaton work) and then the scan does a per-row dict lookup to translate every value. This is what Jackie called out.

When the FST/IFST evaluator cannot be consumed by an index-based filter operator, it shouldn't be constructed at all — the existing RawValueBasedRegexpLikePredicateEvaluator is the right choice and already implements applySV(String) correctly.

Root cause

FilterPlanNode.case REGEXP_LIKE unconditionally builds the IFST/FST dict-id evaluator when the corresponding index exists, bypassing the invariant that PredicateEvaluatorProvider.getDictionaryUsableForFiltering enforces for every other dict-based predicate type:

case REGEXP_LIKE:
  return invertedAvailable ? dictionary : null;   // drops dict if no dict-consuming op

Fix

Add the same invariant to the IFST/FST branches in FilterPlanNode:

if (caseInsensitive) {
  if (dataSource.getIFSTIndex() != null
      && canConsumeDictIdEvaluator(dataSource, _queryContext)) {
    predicateEvaluator = IFSTBasedRegexpPredicateEvaluatorFactory.newIFSTBasedEvaluator(...);
  } else {
    predicateEvaluator = PredicateEvaluatorProvider.getPredicateEvaluator(predicate, dataSource, _queryContext);
  }
}
// mirror for case-sensitive FST branch

private static boolean canConsumeDictIdEvaluator(DataSource dataSource, QueryContext queryContext) {
  if (dataSource.getDataSourceMetadata().isSorted()
      && queryContext.isIndexUseAllowed(dataSource, FieldConfig.IndexType.SORTED)) {
    return true;
  }
  if (dataSource.getInvertedIndex() != null
      && queryContext.isIndexUseAllowed(dataSource, FieldConfig.IndexType.INVERTED)) {
    return true;
  }
  // Scan with DictIdMatcher can consume dict-id evaluators when the forward index is dict-encoded.
  ForwardIndexReader<?> forwardIndex = dataSource.getForwardIndex();
  return forwardIndex != null && forwardIndex.isDictionaryEncoded();
}

The helper covers all three runtime consumers of a dict-id evaluator:

SortedIndexBasedFilterOperator
InvertedIndexFilterOperator
ScanBasedFilterOperator when the forward index is dict-encoded (uses DictIdMatcher, which calls applySV(int dictId))

When none of these will be picked, the IFST/FST evaluator is skipped and the raw-value evaluator is used instead — RawValueBasedRegexpLikePredicateEvaluator already implements applySV(String).

Behavior matrix

Forward index	FST/IFST	Sorted	Inverted	Evaluator chosen	Filter operator	Status
DICT	yes	no	no	FST/IFST eval	scan + `DictIdMatcher`	✅
DICT	yes	no	yes	FST/IFST eval	InvertedIndex	✅
DICT	yes	yes	no	FST/IFST eval	SortedIndex	✅
RAW	yes	no	no	raw eval	scan + `StringMatcher`	✅ was crashing
RAW	yes	no	yes	FST/IFST eval	InvertedIndex	✅

Reverts

This effectively reverts the matcher-level changes from #18579 (which Jackie identified as too expensive) by addressing the same crash at a higher layer with fewer lines and no public API change.

What's NOT changed

SVScanDocIdIterator — reverted to upstream/master state (removes the matchers added in Fall back to raw-value REGEXP_LIKE evaluator when no dict-consuming index is available #18579).
BaseDictionaryBasedPredicateEvaluator — untouched.
FilterOperatorUtils.getLeafFilterOperator — untouched.
Concrete evaluator subclasses — untouched.

Known limitation (separate follow-up)

The ScanBasedRegexpLikePredicateEvaluator (used when dict ≥10K entries) extends BaseDictionaryBasedPredicateEvaluator directly, not BaseDictIdBasedRegexpLikePredicateEvaluator. FilterOperatorUtils.getLeafFilterOperator checks for the latter only, so on RAW forward + dict ≥10K + inverted index (no FST/IFST), the planner builds a ScanBased evaluator but the operator selector falls through to scan, producing the same crash. Out of scope for this PR; will file a separate fix to broaden the instanceof check.

Test plan

Existing PredicateEvaluatorProviderTest should pass unchanged
Manual: regexp_like(col, 'pat'), regexp_like(col, 'pat', 'i'), and LIKE 'pat%' all return correct results (no crash) on a table with RAW forward + dict + FST/IFST + no inverted (the iceberg/external-table scenario that triggered Fall back to raw-value REGEXP_LIKE evaluator when no dict-consuming index is available #18579)
With inverted index added: queries still route to InvertedIndexFilterOperator (fast path)

🤖 Generated with Claude Code

…ndex is available When FST/IFST exists but the column has no sorted/inverted index that can consume a dict-id-based predicate evaluator, FilterPlanNode previously built the FST/IFST evaluator unconditionally. With a RAW forward index, FilterOperatorUtils then fell through to ScanBasedFilterOperator, which calls applySV(String) on the dict-id evaluator — that throws UnsupportedOperationException (BaseDictionaryBasedPredicateEvaluator), crashing queries such as `regexp_like(col, 'pat', 'i')` and `LIKE 'pat'` on external/iceberg-backed tables with `encodingType: RAW` + `dictionary: {}` + `ifst: { enabled: true }`. Add canConsumeDictIdEvaluator() — only construct the FST/IFST dict-id evaluator when a sorted or inverted index is available for this data source (matching the operator-routing logic in FilterOperatorUtils#getLeafFilterOperator). Otherwise fall through to PredicateEvaluatorProvider, which returns RawValueBasedRegexpLikePredicateEvaluator — already implements applySV(String) correctly. No changes to base classes or scan iterator. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

deepthi912 · 2026-05-27T03:25:01Z

Closing — opened in error against apache/master. Will keep the change in a fork PR for offline review.

codecov-commenter · 2026-05-27T04:19:46Z

Codecov Report

❌ Patch coverage is 10.00000% with 9 lines in your changes missing coverage. Please review.
✅ Project coverage is 64.25%. Comparing base (f231ee0) to head (7358a71).
⚠️ Report is 6 commits behind head on master.

Files with missing lines	Patch %	Lines
...ava/org/apache/pinot/core/plan/FilterPlanNode.java	10.00%	4 Missing and 5 partials ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##             master   #18589      +/-   ##
============================================
- Coverage     64.25%   64.25%   -0.01%     
  Complexity     1137     1137              
============================================
  Files          3335     3335              
  Lines        205708   205906     +198     
  Branches      32084    32133      +49     
============================================
+ Hits         132181   132302     +121     
- Misses        62901    62956      +55     
- Partials      10626    10648      +22

Flag	Coverage Δ
custom-integration1	`100.00% <ø> (ø)`
integration	`100.00% <ø> (ø)`
integration1	`100.00% <ø> (ø)`
integration2	`0.00% <ø> (ø)`
java-21	`64.25% <10.00%> (-0.01%)`	⬇️
temurin	`64.25% <10.00%> (-0.01%)`	⬇️
unittests	`64.25% <10.00%> (-0.01%)`	⬇️
unittests1	`56.74% <10.00%> (+<0.01%)`	⬆️
unittests2	`36.83% <0.00%> (+1.08%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

deepthi912 closed this May 27, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fall back to raw-value REGEXP_LIKE evaluator when no dict-consuming index is available#18589

Fall back to raw-value REGEXP_LIKE evaluator when no dict-consuming index is available#18589
deepthi912 wants to merge 1 commit into
apache:masterfrom
deepthi912:deepthi/filter-plan-node-raw-fallback

deepthi912 commented May 27, 2026

Uh oh!

deepthi912 commented May 27, 2026

Uh oh!

codecov-commenter commented May 27, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

deepthi912 commented May 27, 2026

Bug

Root cause

Fix

Behavior matrix

Reverts

What's NOT changed

Known limitation (separate follow-up)

Test plan

Uh oh!

deepthi912 commented May 27, 2026

Uh oh!

codecov-commenter commented May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

codecov-commenter commented May 27, 2026 •

edited

Loading