Fall back to raw-value REGEXP_LIKE evaluator when no dict-consuming index is available #18579

Merged
xiangfu0 merged 2 commits into apache:master from deepthi912:deepthi/svscan-dict-lookup-matcher
May 26, 2026
Conversation

Collaborator

@deepthi912 deepthi912 commented May 25, 2026

Bug

regexp_like(col, ...) and LIKE (which runs as a case-insensitive regexp_like) crash with UnsupportedOperationException when a column has encodingType: RAW plus a separate dictionary built for secondary indexes (FST / IFST / INVERTED / RANGE) but no sorted or inverted index. This layout is common on external/Iceberg-backed tables.

Example table config:

{
  "name": "col_string",
  "encodingType": "RAW",
  "indexes": {
    "forward": { "encodingType": "RAW" },
    "dictionary": {},
    "ifst": { "enabled": true }
  }
}

Stack:

BaseDictionaryBasedPredicateEvaluator.applySV(String):133  → throws
SVScanDocIdIterator$StringMatcher.doesValueMatch:308

The bug is symmetric across FST and IFST: whichever side has the index file built, that query path crashes. The other path is accidentally safe because it falls back to RawValueBasedRegexpLikePredicateEvaluator.

Root cause

The REGEXP_LIKE case in FilterPlanNode builds the dict-id evaluator (IFSTBasedRegexpPredicateEvaluator / FSTBasedRegexpPredicateEvaluator) unconditionally when the corresponding index exists. These evaluators only implement applySV(int dictId). When the operator selector in FilterOperatorUtils#getLeafFilterOperator finds no sorted/inverted index to consume those dict ids, it falls through to ScanBasedFilterOperator, which reads raw values from the forward index and calls applySV(String), and that call throws.
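
The failure mode can be reproduced in isolation. The following is a minimal sketch, not Pinot code: `DictBasedEvaluator`, `MatchingDictIds`, and `countMatches` are hypothetical stand-ins that only mimic the shape described above (a base class whose raw-value overload throws, a subclass that implements only the dict-id overload, and a scan that feeds raw strings).

```java
import java.util.List;
import java.util.Set;

// Hypothetical stand-ins for illustration; these are not Pinot's real classes.
final class DictIdOnlyEvaluatorSketch {

  /** Base class whose raw-value overload throws unless a subclass overrides it. */
  abstract static class DictBasedEvaluator {
    abstract boolean applySV(int dictId);

    boolean applySV(String value) {
      throw new UnsupportedOperationException("raw String values are not supported");
    }
  }

  /** Matches a precomputed set of dict ids and never overrides applySV(String). */
  static final class MatchingDictIds extends DictBasedEvaluator {
    private final Set<Integer> _matchingDictIds;

    MatchingDictIds(Set<Integer> matchingDictIds) {
      _matchingDictIds = matchingDictIds;
    }

    @Override
    boolean applySV(int dictId) {
      return _matchingDictIds.contains(dictId);
    }
  }

  /** A scan over a RAW forward index hands raw strings to the evaluator. */
  static int countMatches(List<String> rawForwardIndex, DictBasedEvaluator evaluator) {
    int matches = 0;
    for (String value : rawForwardIndex) {
      if (evaluator.applySV(value)) {
        matches++;
      }
    }
    return matches;
  }

  public static void main(String[] args) {
    DictBasedEvaluator evaluator = new MatchingDictIds(Set.of(0, 2));
    // The dict-id overload works as intended.
    System.out.println("dict id 2 matches: " + evaluator.applySV(2));
    try {
      // The scan path calls the String overload, which the subclass never implemented.
      countMatches(List.of("apple", "banana"), evaluator);
    } catch (UnsupportedOperationException e) {
      System.out.println("scan failed: " + e.getMessage());
    }
  }
}
```

The point of the sketch is that nothing in the types prevents a dict-id-only evaluator from being handed to a raw-value scan, so the mismatch only surfaces at query time.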

The existing PredicateEvaluatorProvider.getDictionaryUsableForFiltering already encodes the correct invariant for non-FST/IFST cases:

case REGEXP_LIKE:
  return invertedAvailable ? dictionary : null;   // drops dict if no dict-consuming op

The FST/IFST short-circuit in FilterPlanNode previously bypassed this guard.

Fix

Only construct the FST/IFST dict-id evaluator when a sorted or inverted index is actually available (matches the routing condition the operator selector uses). Otherwise fall through to PredicateEvaluatorProvider, which returns the raw-value evaluator that already implements applySV(String) correctly.

if (caseInsensitive) {
  if (dataSource.getIFSTIndex() != null && canConsumeDictIdEvaluator(dataSource, _queryContext)) {
    predicateEvaluator = IFSTBasedRegexpPredicateEvaluatorFactory.newIFSTBasedEvaluator(...);
  } else {
    predicateEvaluator = PredicateEvaluatorProvider.getPredicateEvaluator(predicate, dataSource, _queryContext);
  }
}
// mirror logic for case-sensitive FST branch

private static boolean canConsumeDictIdEvaluator(DataSource dataSource, QueryContext queryContext) {
  if (dataSource.getDataSourceMetadata().isSorted()
      && queryContext.isIndexUseAllowed(dataSource, FieldConfig.IndexType.SORTED)) {
    return true;
  }
  if (dataSource.getInvertedIndex() != null
      && queryContext.isIndexUseAllowed(dataSource, FieldConfig.IndexType.INVERTED)) {
    return true;
  }
  return false;
}

What's NOT changed

  • BaseDictionaryBasedPredicateEvaluator — untouched. No final removal, no new defaults.
  • SVScanDocIdIterator — untouched. No new matchers.
  • FilterOperatorUtils.getLeafFilterOperator — untouched.
  • All concrete evaluator subclasses — untouched.
  • Existing RawValueBasedRegexpLikePredicateEvaluator — reused unchanged.

Diff

pinot-core/src/main/java/org/apache/pinot/core/plan/FilterPlanNode.java: +24 / -2, single file.

Behavior after fix

| Indexes on column | Behavior |
|---|---|
| FST/IFST + sorted or inverted | dict-id evaluator → InvertedIndexFilterOperator (bitmap union, unchanged fast path) |
| FST/IFST + no sorted/inverted | raw evaluator → ScanBasedFilterOperator; StringMatcher.applySV(String) works |
| No FST/IFST | raw evaluator (existing behavior, unchanged) |
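
The decision made in FilterPlanNode can be modelled as a small pure function. This is a sketch under assumptions: the `Choice` enum and the boolean flags are hypothetical names, where "usable" means the index is present and its use is allowed for the query. It models only whether the FST/IFST dict-id evaluator is built or the call is delegated to PredicateEvaluatorProvider, not what the provider then returns.

```java
// Hypothetical model of the selection; names and flags are illustrative only.
final class RegexpRoutingSketch {

  enum Choice { FST_DICT_ID_EVALUATOR, DELEGATE_TO_PROVIDER }

  /** True when some operator could consume dict ids (sorted or inverted index usable). */
  static boolean canConsumeDictIds(boolean sortedUsable, boolean invertedUsable) {
    return sortedUsable || invertedUsable;
  }

  /** Builds the dict-id evaluator only when an FST/IFST exists and its dict ids can be consumed. */
  static Choice choose(boolean hasFstOrIfst, boolean sortedUsable, boolean invertedUsable) {
    if (hasFstOrIfst && canConsumeDictIds(sortedUsable, invertedUsable)) {
      return Choice.FST_DICT_ID_EVALUATOR;
    }
    return Choice.DELEGATE_TO_PROVIDER;
  }

  public static void main(String[] args) {
    // FST/IFST + inverted index
    System.out.println(choose(true, false, true));
    // FST/IFST, no sorted/inverted index (the layout that used to crash)
    System.out.println(choose(true, false, false));
    // No FST/IFST
    System.out.println(choose(false, true, true));
  }
}
```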

Test plan

  • Existing PredicateEvaluatorProviderTest passes unchanged
  • Manual: regexp_like(col, 'pat'), regexp_like(col, 'pat', 'i'), and LIKE 'pat%' all return results (no crash) on tables with RAW forward + dict + FST/IFST + no inverted
  • With inverted index added: queries still route to InvertedIndexFilterOperator (fast path)

… index is RAW + dictionary exists

When a column has encodingType=RAW but a separate dictionary is built for secondary indexes
(INVERTED, FST, IFST, RANGE), dict-based predicate evaluators created by FilterPlanNode
(e.g. DictIdBasedRegexpLikePredicateEvaluator, IFSTBasedRegexpPredicateEvaluator) only
implement applySV(int dictId). The previous fallback in SVScanDocIdIterator.getValueMatcher()
selected typed raw matchers (e.g. StringMatcher) which call applySV(<rawType>) on the
predicate evaluator — those overloads in BaseDictionaryBasedPredicateEvaluator throw
UnsupportedOperationException, crashing queries such as
`regexp_like(col, 'pattern', 'i')` or `LIKE 'pattern'` (which is internally case-insensitive)
on external/iceberg-backed tables with this layout.

Add a dict-lookup matcher per stored type that reads the raw value from the forward index,
translates it to a dict id via the separate dictionary, and applies the predicate with
applySV(int dictId). Selected only when (a) the forward index reports
isDictionaryEncoded() == false, (b) the data source still exposes a non-null Dictionary,
and (c) the predicate evaluator is a BaseDictionaryBasedPredicateEvaluator. Existing
DictIdMatcher and typed raw matcher paths are unchanged.
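
The dict-lookup matcher described in this commit message can be sketched as follows. This is an illustration under assumptions, not the committed code: the dictionary is modelled as a sorted `String[]` searched with `Arrays.binarySearch`, the dict-id evaluator as a plain `IntPredicate`, and all class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.IntPredicate;

// Hypothetical sketch of a raw-value -> dict-id -> predicate matcher.
final class DictLookupMatcherSketch {

  /** Returns the dict id of value in a sorted dictionary, or -1 when absent. */
  static int indexOf(String[] sortedDictionary, String value) {
    int index = Arrays.binarySearch(sortedDictionary, value);
    return index >= 0 ? index : -1;
  }

  /** Translates the raw value to a dict id, then applies the dict-id predicate. */
  static boolean doesValueMatch(String[] sortedDictionary, IntPredicate dictIdEvaluator,
      String rawValue) {
    int dictId = indexOf(sortedDictionary, rawValue);
    // A value absent from the dictionary is a no-match; the evaluator is not invoked.
    return dictId >= 0 && dictIdEvaluator.test(dictId);
  }

  /** Scans the raw forward index and collects matching doc ids in order. */
  static List<Integer> matchingDocIds(String[] rawForwardIndex, String[] sortedDictionary,
      IntPredicate dictIdEvaluator) {
    List<Integer> docIds = new ArrayList<>();
    for (int docId = 0; docId < rawForwardIndex.length; docId++) {
      if (doesValueMatch(sortedDictionary, dictIdEvaluator, rawForwardIndex[docId])) {
        docIds.add(docId);
      }
    }
    return docIds;
  }

  public static void main(String[] args) {
    String[] dictionary = {"apple", "banana", "cherry"};
    String[] forwardIndex = {"cherry", "apple", "banana", "apple"};
    // Stand-in for a precomputed set of matching dict ids (apple = 0, cherry = 2).
    IntPredicate matchesAppleOrCherry = dictId -> dictId == 0 || dictId == 2;
    System.out.println(matchingDocIds(forwardIndex, dictionary, matchesAppleOrCherry));
  }
}
```

Note that in this shape every scanned document costs one dictionary lookup in addition to the predicate check.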

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@deepthi912 deepthi912 added the index Related to indexing (general) label May 25, 2026
Covers the RAW forward index + separate dictionary + dict-based predicate evaluator
configuration that previously triggered UnsupportedOperationException. The test stand-in
evaluator inherits the default applySV(String) that throws, so any regression in the
matcher-selection logic surfaces as a test failure rather than silent data corruption.

Two cases:
- Multiple matching dict ids return the correct doc ids in order.
- A raw value absent from the dictionary (indexOf returns -1) is treated as no-match
  without invoking applySV on the evaluator.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

codecov-commenter commented May 26, 2026

Codecov Report

❌ Patch coverage is 29.54545% with 31 lines in your changes missing coverage. Please review.
✅ Project coverage is 64.26%. Comparing base (f231ee0) to head (20f13d4).
⚠️ Report is 2 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...e/operator/dociditerators/SVScanDocIdIterator.java | 29.54% | 29 Missing and 2 partials ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             master   #18579    +/-   ##
==========================================
  Coverage     64.25%   64.26%            
  Complexity     1137     1137            
==========================================
  Files          3335     3335            
  Lines        205708   205845   +137     
  Branches      32084    32115    +31     
==========================================
+ Hits         132181   132282   +101     
- Misses        62901    62920    +19     
- Partials      10626    10643    +17     

| Flag | Coverage Δ |
|---|---|
| custom-integration1 | 100.00% <ø> (ø) |
| integration | 100.00% <ø> (ø) |
| integration1 | 100.00% <ø> (ø) |
| integration2 | 0.00% <ø> (ø) |
| java-21 | 64.26% <29.54%> (+<0.01%) ⬆️ |
| temurin | 64.26% <29.54%> (+<0.01%) ⬆️ |
| unittests | 64.26% <29.54%> (+<0.01%) ⬆️ |
| unittests1 | 56.73% <29.54%> (+<0.01%) ⬆️ |
| unittests2 | 36.83% <0.00%> (+1.08%) ⬆️ |


@xiangfu0 xiangfu0 merged commit acf9d4f into apache:master May 26, 2026
11 checks passed
@Jackie-Jiang
Contributor

This is not the correct fix. It is way too expensive to look up every single value in the forward index. The correct fix should be within the predicate evaluator, so that it handles raw values even when a dictionary exists.

@deepthi912 deepthi912 changed the title from "Fix UnsupportedOperationException in SVScanDocIdIterator for RAW forward index + separate dictionary" to "Fall back to raw-value REGEXP_LIKE evaluator when no dict-consuming index is available" May 27, 2026
deepthi912 added a commit to deepthi912/pinot that referenced this pull request May 27, 2026
…RAW forward index + separate dictionary (apache#18579)"

This reverts commit acf9d4f.