API: Implement notStartsWith bounds check in StrictMetricsEvaluator#15883

Open
bharos wants to merge 3 commits into apache:main from bharos:perf/strict-metrics-not-starts-with-bounds

bharos commented Apr 3, 2026

What

Implements bounds-based evaluation for notStartsWith in
StrictMetricsEvaluator, replacing the existing TODO with actual logic.

Previously, notStartsWith always returned ROWS_MIGHT_NOT_MATCH,
which prevented the engine from eliminating the residual predicate even
when file-level column bounds made it provable that no value could start
with the given prefix.

Changes

  • StrictMetricsEvaluator.notStartsWith: Added checks for nested
    columns, all-nulls columns, and lower/upper bound comparisons against
    the prefix. Returns ROWS_MUST_MATCH when bounds prove the prefix is
    entirely outside the value range.
  • TestStrictMetricsEvaluator: Added 8 test methods covering:
    all-nulls, bounds above/below/overlapping the prefix, wider ranges,
    missing stats, some-nulls with bounds outside prefix, and prefix
    longer than bounds.

How it works

For NOT STARTS WITH <prefix>:

  • If the lower bound (truncated to min(prefixLen, boundLen)) is
    strictly greater than the prefix, all values are above the prefix
    range → ROWS_MUST_MATCH
  • If the upper bound (truncated to min(prefixLen, boundLen)) is
    strictly less than the prefix, all values are below the prefix range
    → ROWS_MUST_MATCH
  • Otherwise, fall through to ROWS_MIGHT_NOT_MATCH (conservative)
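
A minimal sketch of the two bound checks described above, using plain Java strings rather than the evaluator's actual types. The class name, method signature, and the use of `String.compareTo` are assumptions made for illustration; the real change works on the file's stored bounds and may compare values differently.

```java
// Hypothetical, simplified model of the bounds logic; not the Iceberg code.
class NotStartsWithSketch {
  // Result constants modeled as booleans for this sketch.
  static final boolean ROWS_MUST_MATCH = true;
  static final boolean ROWS_MIGHT_NOT_MATCH = false;

  // lower/upper are the column bounds for one file; null means "no stats".
  static boolean notStartsWith(String lower, String upper, String prefix) {
    if (lower != null) {
      // Truncate the lower bound to min(prefixLen, boundLen).
      int len = Math.min(prefix.length(), lower.length());
      if (lower.substring(0, len).compareTo(prefix) > 0) {
        // Every value sorts above anything that starts with the prefix.
        return ROWS_MUST_MATCH;
      }
    }
    if (upper != null) {
      // Truncate the upper bound the same way.
      int len = Math.min(prefix.length(), upper.length());
      if (upper.substring(0, len).compareTo(prefix) < 0) {
        // Every value sorts below anything that starts with the prefix.
        return ROWS_MUST_MATCH;
      }
    }
    // Bounds overlap the prefix range or are missing: stay conservative.
    return ROWS_MIGHT_NOT_MATCH;
  }

  public static void main(String[] args) {
    System.out.println(notStartsWith("a", "b", "x"));       // bounds below the prefix
    System.out.println(notStartsWith("a", "z", "x"));       // bounds overlap the prefix
    System.out.println(notStartsWith("abcd", "abcz", "abc")); // all values share the prefix
  }
}
```

With bounds ["a", "b"] and prefix "x" (the case named in the original TODO), the upper bound check fires and the sketch returns ROWS_MUST_MATCH.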

This follows the same pattern used by notEq and notIn in this
class, including the null-handling convention.

Closes #15882

When column bounds are entirely outside the prefix range, all rows
must satisfy notStartsWith. Previously this always returned
ROWS_MIGHT_NOT_MATCH regardless of bounds, missing an optimization
opportunity for file-level pruning.

Now returns ROWS_MUST_MATCH when:
- Lower bound truncated to prefix length > prefix (all values above)
- Upper bound truncated to prefix length < prefix (all values below)
- Column contains only null values (nulls satisfy NOT predicates)

Follows the same truncation pattern used in
InclusiveMetricsEvaluator.startsWith and the null-handling pattern
from StrictMetricsEvaluator.notEq.
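
The all-nulls condition in the list above can be sketched as a comparison of per-column value and null counts. This is an illustrative assumption about how such a check might look: the class, method name, and map-based signature are hypothetical and are not taken from the PR diff.

```java
import java.util.Map;

// Hypothetical helper for illustration; not the Iceberg implementation.
class NullsOnlySketch {
  // Returns true only when stats exist and every recorded value is null.
  static boolean containsNullsOnly(
      Map<Integer, Long> valueCounts, Map<Integer, Long> nullValueCounts, int fieldId) {
    Long values = valueCounts.get(fieldId);
    Long nulls = nullValueCounts.get(fieldId);
    // Missing counts prove nothing, so the answer must be "no".
    return values != null && nulls != null && values - nulls == 0;
  }
}
```

When the helper returns true, no non-null value exists that could start with the prefix, which is why the commit message treats the column as satisfying notStartsWith.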
github-actions bot added the API label Apr 3, 2026
// TODO: Handle cases that definitely cannot match, such as notStartsWith("x") when the bounds
// are ["a", "b"].
int id = ref.fieldId();
if (isNestedColumn(id)) {
Consider adding a test for nested column?

Thanks @anoopj, I added a nested string field "nested_string_col" in the test, along with a test case for it. PTAL.

anoopj commented Apr 7, 2026

The change looks reasonable to me. The only callout is the null handling: if there are null values, the implementation will return ROWS_MUST_MATCH for them as well. This doesn't follow SQL's 3-valued semantics, but is consistent with the current implementation of notEq, so I think the change is reasonable.
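
To make the null-handling point concrete, here is a small row-level sketch contrasting SQL-style three-valued evaluation with the convention described in the comment above. Both methods are hypothetical illustrations, not Iceberg APIs, and they operate on single values rather than file metrics.

```java
// Hypothetical row-level model of the two null conventions.
class ThreeValuedSketch {
  // SQL-style: a null input yields null (unknown), so a WHERE clause drops the row.
  static Boolean sqlNotStartsWith(String value, String prefix) {
    if (value == null) {
      return null;
    }
    return !value.startsWith(prefix);
  }

  // Convention described in the review: null values count as matching.
  static boolean strictNotStartsWith(String value, String prefix) {
    return value == null || !value.startsWith(prefix);
  }
}
```

The two methods agree on every non-null value and differ only for null input, which is the discrepancy the comment calls out.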

Please get this reviewed by a committer.

Linked issue: StrictMetricsEvaluator does not use column bounds to evaluate notStartsWith
