Skip to content

Fall back to raw-value REGEXP_LIKE evaluator when no dict-consuming index is available#18589

Closed
deepthi912 wants to merge 1 commit into
apache:masterfrom
deepthi912:deepthi/filter-plan-node-raw-fallback
Closed

Fall back to raw-value REGEXP_LIKE evaluator when no dict-consuming index is available#18589
deepthi912 wants to merge 1 commit into
apache:masterfrom
deepthi912:deepthi/filter-plan-node-raw-fallback

Conversation

@deepthi912
Copy link
Copy Markdown
Collaborator

Follow-up to #18579 per @Jackie-Jiang's review feedback: #18579 (comment)

"This is not the correct fix. It is way too expensive to lookup every single value within forward index. The correct fix should be within predicate evaluator to handle raw values even when dictionary exist"

#18579 routed raw-value scans through a per-row dictionary.indexOf(value) translation in SVScanDocIdIterator. This PR replaces that approach with a planner-level guard that avoids constructing a dict-id-based evaluator in the first place when no dict-consuming operator is available.

Bug

Even after #18579, regexp_like(col, ...) and LIKE against a RAW-encoded string column with a separate dictionary plus FST or IFST (and no inverted/sorted index) follow a suboptimal path: the IFST/FST evaluator is constructed (wasted FST automaton work) and then the scan does a per-row dict lookup to translate every value. This is what Jackie called out.

When the FST/IFST evaluator cannot be consumed by an index-based filter operator, it shouldn't be constructed at all — the existing RawValueBasedRegexpLikePredicateEvaluator is the right choice and already implements applySV(String) correctly.

Root cause

FilterPlanNode.case REGEXP_LIKE unconditionally builds the IFST/FST dict-id evaluator when the corresponding index exists, bypassing the invariant that PredicateEvaluatorProvider.getDictionaryUsableForFiltering enforces for every other dict-based predicate type:

case REGEXP_LIKE:
  return invertedAvailable ? dictionary : null;   // drops dict if no dict-consuming op

Fix

Add the same invariant to the IFST/FST branches in FilterPlanNode:

if (caseInsensitive) {
  if (dataSource.getIFSTIndex() != null
      && canConsumeDictIdEvaluator(dataSource, _queryContext)) {
    predicateEvaluator = IFSTBasedRegexpPredicateEvaluatorFactory.newIFSTBasedEvaluator(...);
  } else {
    predicateEvaluator = PredicateEvaluatorProvider.getPredicateEvaluator(predicate, dataSource, _queryContext);
  }
}
// mirror for case-sensitive FST branch

private static boolean canConsumeDictIdEvaluator(DataSource dataSource, QueryContext queryContext) {
  if (dataSource.getDataSourceMetadata().isSorted()
      && queryContext.isIndexUseAllowed(dataSource, FieldConfig.IndexType.SORTED)) {
    return true;
  }
  if (dataSource.getInvertedIndex() != null
      && queryContext.isIndexUseAllowed(dataSource, FieldConfig.IndexType.INVERTED)) {
    return true;
  }
  // Scan with DictIdMatcher can consume dict-id evaluators when the forward index is dict-encoded.
  ForwardIndexReader<?> forwardIndex = dataSource.getForwardIndex();
  return forwardIndex != null && forwardIndex.isDictionaryEncoded();
}

The helper covers all three runtime consumers of a dict-id evaluator:

  1. SortedIndexBasedFilterOperator
  2. InvertedIndexFilterOperator
  3. ScanBasedFilterOperator when the forward index is dict-encoded (uses DictIdMatcher, which calls applySV(int dictId))

When none of these will be picked, the IFST/FST evaluator is skipped and the raw-value evaluator is used instead — RawValueBasedRegexpLikePredicateEvaluator already implements applySV(String).

Behavior matrix

Forward index FST/IFST Sorted Inverted Evaluator chosen Filter operator Status
DICT yes no no FST/IFST eval scan + DictIdMatcher
DICT yes no yes FST/IFST eval InvertedIndex
DICT yes yes no FST/IFST eval SortedIndex
RAW yes no no raw eval scan + StringMatcher ✅ was crashing
RAW yes no yes FST/IFST eval InvertedIndex

Reverts

This effectively reverts the matcher-level changes from #18579 (which Jackie identified as too expensive) by addressing the same crash at a higher layer with fewer lines and no public API change.

What's NOT changed

Known limitation (separate follow-up)

The ScanBasedRegexpLikePredicateEvaluator (used when dict ≥10K entries) extends BaseDictionaryBasedPredicateEvaluator directly, not BaseDictIdBasedRegexpLikePredicateEvaluator. FilterOperatorUtils.getLeafFilterOperator checks for the latter only, so on RAW forward + dict ≥10K + inverted index (no FST/IFST), the planner builds a ScanBased evaluator but the operator selector falls through to scan, producing the same crash. Out of scope for this PR; will file a separate fix to broaden the instanceof check.

Test plan

  • Existing PredicateEvaluatorProviderTest should pass unchanged
  • Manual: regexp_like(col, 'pat'), regexp_like(col, 'pat', 'i'), and LIKE 'pat%' all return correct results (no crash) on a table with RAW forward + dict + FST/IFST + no inverted (the iceberg/external-table scenario that triggered Fall back to raw-value REGEXP_LIKE evaluator when no dict-consuming index is available #18579)
  • With inverted index added: queries still route to InvertedIndexFilterOperator (fast path)

🤖 Generated with Claude Code

…ndex is available

When FST/IFST exists but the column has no sorted/inverted index that can consume a
dict-id-based predicate evaluator, FilterPlanNode previously built the FST/IFST
evaluator unconditionally. With a RAW forward index, FilterOperatorUtils then fell
through to ScanBasedFilterOperator, which calls applySV(String) on the dict-id
evaluator — that throws UnsupportedOperationException
(BaseDictionaryBasedPredicateEvaluator), crashing queries such as
`regexp_like(col, 'pat', 'i')` and `LIKE 'pat'` on external/iceberg-backed tables
with `encodingType: RAW` + `dictionary: {}` + `ifst: { enabled: true }`.

Add canConsumeDictIdEvaluator() — only construct the FST/IFST dict-id evaluator
when a sorted or inverted index is available for this data source (matching the
operator-routing logic in FilterOperatorUtils#getLeafFilterOperator). Otherwise
fall through to PredicateEvaluatorProvider, which returns
RawValueBasedRegexpLikePredicateEvaluator — already implements applySV(String)
correctly. No changes to base classes or scan iterator.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@deepthi912
Copy link
Copy Markdown
Collaborator Author

Closing — opened in error against apache/master. Will keep the change in a fork PR for offline review.

@deepthi912 deepthi912 closed this May 27, 2026
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented May 27, 2026

Codecov Report

❌ Patch coverage is 10.00000% with 9 lines in your changes missing coverage. Please review.
✅ Project coverage is 64.25%. Comparing base (f231ee0) to head (7358a71).
⚠️ Report is 6 commits behind head on master.

Files with missing lines Patch % Lines
...ava/org/apache/pinot/core/plan/FilterPlanNode.java 10.00% 4 Missing and 5 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18589      +/-   ##
============================================
- Coverage     64.25%   64.25%   -0.01%     
  Complexity     1137     1137              
============================================
  Files          3335     3335              
  Lines        205708   205906     +198     
  Branches      32084    32133      +49     
============================================
+ Hits         132181   132302     +121     
- Misses        62901    62956      +55     
- Partials      10626    10648      +22     
Flag Coverage Δ
custom-integration1 100.00% <ø> (ø)
integration 100.00% <ø> (ø)
integration1 100.00% <ø> (ø)
integration2 0.00% <ø> (ø)
java-21 64.25% <10.00%> (-0.01%) ⬇️
temurin 64.25% <10.00%> (-0.01%) ⬇️
unittests 64.25% <10.00%> (-0.01%) ⬇️
unittests1 56.74% <10.00%> (+<0.01%) ⬆️
unittests2 36.83% <0.00%> (+1.08%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants