[core] Fix data evolution index split row ranges by leaves12138 · Pull Request #7953 · apache/paimon

leaves12138 · 2026-05-25T04:48:56Z

Purpose

Fix BTree global index building for data-evolution append tables when a data split contains unordered data files or intersects multiple discontiguous row ranges.

Previously DataEvolutionBatchScan.wrap derived the split range from the first and last data files. This assumes dataFiles() are ordered by row id, but RangeHelper.mergeOverlappingRanges may keep the original order inside a merged group. As a result, wrap could calculate an inverted range and fail to find any intersected row ranges.

Also, BTree index building assumed each split maps to exactly one row range after applying the row range index. A contiguous data-file split can still intersect multiple discontiguous indexed/non-indexed row ranges, so the builder needs to fan out the split by row range before grouping.

Changes

Compute IndexedSplit row ranges by intersecting the row range index with each data file's actual row-id range, then sort/merge the result.
Add splitByRowRangeIndex and let BTree range grouping handle multiple row ranges from one split.
Wrap each row range as a single-range IndexedSplit so downstream reads push down the exact row range.
Add regression tests for unordered/discontiguous data files and discontiguous row range indexes.

Tests

mvn -pl paimon-core -Pfast-build spotless:apply
mvn -pl paimon-core -Pfast-build -DskipTests compile
mvn -pl paimon-core -Pfast-build -DfailIfNoTests=false -Dtest=DataEvolutionBatchScanTest#testWrapToIndexSplitsWithUnorderedAndDiscontiguousDataFiles,BTreeGlobalIndexBuilderSplitTest#testGroupSplitsByDiscontiguousRowRangeIndex test

JingsongLi

Overall this is a solid bug fix with good test coverage. The root cause analysis is clear: wrap assumed data files were ordered by row ID, and calcRowRangeWithRowIndex assumed each split maps to exactly one row range — both assumptions can be violated.

A few observations:

DataEvolutionBatchScan.wrap — performance consideration

The new loop iterates over all files and calls intersectedRanges per file, then merges. For splits with many data files, this may produce many intermediate Range objects before sortAndMergeOverlap deduplicates them. Consider computing the global min/max first (which is O(n) over files) and calling intersectedRanges once — but only if the files' row-id ranges are guaranteed contiguous (which they aren't per this bug). So the per-file approach is correct here. Just noting the tradeoff.

BTreeIndexTopoBuilder — return false → continue change

This is a subtle semantic change: previously if any index column had no splits/ranges, the entire method returned false (no index built). Now it skips that column and continues with the others. This makes the method return true even if some columns couldn't be indexed. Is this intentional? The PR description doesn't mention this change. If a column has no data to index, silently skipping it seems reasonable, but callers that check the return false to detect "nothing happened" may now get a false positive. Worth a comment or explicit mention in the PR description.

splitByRowRangeIndex — the splitRange == null check in groupSplitsByRange

After the refactor, splitByRowRangeIndex returns Collections.emptyList() when calcRowRange returns null (the no-row-range case). So the if (splitRange == null) check inside the for-each loop is now dead code — splitByRowRangeIndex never produces a pair with a null key. Consider removing it for clarity, or add a comment explaining it's defensive.

Test: testWrapToIndexSplitsWithUnorderedAndDiscontiguousDataFiles

The test expects [Range(4200, 4450), Range(4650, 4700)] — this implies:

file3: 4200 + 208 - 1 = 4407, intersected with [0, 5000] → Range(4200, 4407)
file2: 4300 + 151 - 1 = 4450, intersected with [0, 5000] → Range(4300, 4450)
Merged (adjacent=true): Range(4200, 4450) ✓
file1: 4650 + 51 - 1 = 4700, intersected with [0, 5000] → Range(4650, 4700) ✓

The math checks out. Good edge case with overlapping file ranges (file2 and file3 overlap in [4300, 4407]).

Minor: the BTreeGlobalIndexBuilder no longer imports Arrays — confirmed it's only used in the removed Arrays.asList(dataSplit) call, replaced by Collections.singletonList. Clean.

LGTM with the caveat about the BTreeIndexTopoBuilder behavioral change — please confirm it's intentional.

JingsongLi · 2026-05-25T09:46:49Z

+1

leaves12138 marked this pull request as draft May 25, 2026 05:42

leaves12138 marked this pull request as ready for review May 25, 2026 05:56

[core] Fix data evolution index split row ranges

2910aac

leaves12138 force-pushed the fix-data-evolution-indexed-split-ranges branch from 1b04eab to 2910aac Compare May 25, 2026 06:12

JingsongLi reviewed May 25, 2026

View reviewed changes

JingsongLi merged commit 27d87ab into apache:master May 25, 2026
12 of 13 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[core] Fix data evolution index split row ranges#7953

[core] Fix data evolution index split row ranges#7953
JingsongLi merged 1 commit into
apache:masterfrom
leaves12138:fix-data-evolution-indexed-split-ranges

leaves12138 commented May 25, 2026 •

edited

Loading

Uh oh!

JingsongLi left a comment

Uh oh!

JingsongLi commented May 25, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

leaves12138 commented May 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Purpose

Changes

Tests

Uh oh!

JingsongLi left a comment

Choose a reason for hiding this comment

Uh oh!

JingsongLi commented May 25, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

leaves12138 commented May 25, 2026 •

edited

Loading