Skip to content

[core] Fix data evolution index split row ranges#7953

Merged
JingsongLi merged 1 commit into
apache:masterfrom
leaves12138:fix-data-evolution-indexed-split-ranges
May 25, 2026
Merged

[core] Fix data evolution index split row ranges#7953
JingsongLi merged 1 commit into
apache:masterfrom
leaves12138:fix-data-evolution-indexed-split-ranges

Conversation

@leaves12138
Copy link
Copy Markdown
Contributor

@leaves12138 leaves12138 commented May 25, 2026

Purpose

Fix BTree global index building for data-evolution append tables when a data split contains unordered data files or intersects multiple discontiguous row ranges.

Previously DataEvolutionBatchScan.wrap derived the split range from the first and last data files. This assumes dataFiles() are ordered by row id, but RangeHelper.mergeOverlappingRanges may keep the original order inside a merged group. As a result, wrap could calculate an inverted range and fail to find any intersected row ranges.

Also, BTree index building assumed each split maps to exactly one row range after applying the row range index. A contiguous data-file split can still intersect multiple discontiguous indexed/non-indexed row ranges, so the builder needs to fan out the split by row range before grouping.

Changes

  • Compute IndexedSplit row ranges by intersecting the row range index with each data file's actual row-id range, then sort/merge the result.
  • Add splitByRowRangeIndex and let BTree range grouping handle multiple row ranges from one split.
  • Wrap each row range as a single-range IndexedSplit so downstream reads push down the exact row range.
  • Add regression tests for unordered/discontiguous data files and discontiguous row range indexes.

Tests

  • mvn -pl paimon-core -Pfast-build spotless:apply
  • mvn -pl paimon-core -Pfast-build -DskipTests compile
  • mvn -pl paimon-core -Pfast-build -DfailIfNoTests=false -Dtest=DataEvolutionBatchScanTest#testWrapToIndexSplitsWithUnorderedAndDiscontiguousDataFiles,BTreeGlobalIndexBuilderSplitTest#testGroupSplitsByDiscontiguousRowRangeIndex test

@leaves12138 leaves12138 marked this pull request as draft May 25, 2026 05:42
@leaves12138 leaves12138 marked this pull request as ready for review May 25, 2026 05:56
@leaves12138 leaves12138 force-pushed the fix-data-evolution-indexed-split-ranges branch from 1b04eab to 2910aac Compare May 25, 2026 06:12
Copy link
Copy Markdown
Contributor

@JingsongLi JingsongLi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this is a solid bug fix with good test coverage. The root cause analysis is clear: wrap assumed data files were ordered by row ID, and calcRowRangeWithRowIndex assumed each split maps to exactly one row range — both assumptions can be violated.

A few observations:


DataEvolutionBatchScan.wrap — performance consideration

The new loop iterates over all files and calls intersectedRanges per file, then merges. For splits with many data files, this may produce many intermediate Range objects before sortAndMergeOverlap deduplicates them. Consider computing the global min/max first (which is O(n) over files) and calling intersectedRanges once — but only if the files' row-id ranges are guaranteed contiguous (which they aren't per this bug). So the per-file approach is correct here. Just noting the tradeoff.


BTreeIndexTopoBuilderreturn falsecontinue change

This is a subtle semantic change: previously if any index column had no splits/ranges, the entire method returned false (no index built). Now it skips that column and continues with the others. This makes the method return true even if some columns couldn't be indexed. Is this intentional? The PR description doesn't mention this change. If a column has no data to index, silently skipping it seems reasonable, but callers that check the return false to detect "nothing happened" may now get a false positive. Worth a comment or explicit mention in the PR description.


splitByRowRangeIndex — the splitRange == null check in groupSplitsByRange

After the refactor, splitByRowRangeIndex returns Collections.emptyList() when calcRowRange returns null (the no-row-range case). So the if (splitRange == null) check inside the for-each loop is now dead code — splitByRowRangeIndex never produces a pair with a null key. Consider removing it for clarity, or add a comment explaining it's defensive.


Test: testWrapToIndexSplitsWithUnorderedAndDiscontiguousDataFiles

The test expects [Range(4200, 4450), Range(4650, 4700)] — this implies:

  • file3: 4200 + 208 - 1 = 4407, intersected with [0, 5000] → Range(4200, 4407)
  • file2: 4300 + 151 - 1 = 4450, intersected with [0, 5000] → Range(4300, 4450)
  • Merged (adjacent=true): Range(4200, 4450) ✓
  • file1: 4650 + 51 - 1 = 4700, intersected with [0, 5000] → Range(4650, 4700) ✓

The math checks out. Good edge case with overlapping file ranges (file2 and file3 overlap in [4300, 4407]).


Minor: the BTreeGlobalIndexBuilder no longer imports Arrays — confirmed it's only used in the removed Arrays.asList(dataSplit) call, replaced by Collections.singletonList. Clean.

LGTM with the caveat about the BTreeIndexTopoBuilder behavioral change — please confirm it's intentional.

@JingsongLi
Copy link
Copy Markdown
Contributor

+1

@JingsongLi JingsongLi merged commit 27d87ab into apache:master May 25, 2026
12 of 13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants