[core] Fix data evolution index split row ranges#7953
Conversation
1b04eab to
2910aac
Compare
JingsongLi
left a comment
There was a problem hiding this comment.
Overall this is a solid bug fix with good test coverage. The root cause analysis is clear: wrap assumed data files were ordered by row ID, and calcRowRangeWithRowIndex assumed each split maps to exactly one row range — both assumptions can be violated.
A few observations:
DataEvolutionBatchScan.wrap — performance consideration
The new loop iterates over all files and calls intersectedRanges per file, then merges. For splits with many data files, this may produce many intermediate Range objects before sortAndMergeOverlap deduplicates them. Consider computing the global min/max first (which is O(n) over files) and calling intersectedRanges once — but only if the files' row-id ranges are guaranteed contiguous (which they aren't per this bug). So the per-file approach is correct here. Just noting the tradeoff.
BTreeIndexTopoBuilder — return false → continue change
This is a subtle semantic change: previously if any index column had no splits/ranges, the entire method returned false (no index built). Now it skips that column and continues with the others. This makes the method return true even if some columns couldn't be indexed. Is this intentional? The PR description doesn't mention this change. If a column has no data to index, silently skipping it seems reasonable, but callers that check the return false to detect "nothing happened" may now get a false positive. Worth a comment or explicit mention in the PR description.
splitByRowRangeIndex — the splitRange == null check in groupSplitsByRange
After the refactor, splitByRowRangeIndex returns Collections.emptyList() when calcRowRange returns null (the no-row-range case). So the if (splitRange == null) check inside the for-each loop is now dead code — splitByRowRangeIndex never produces a pair with a null key. Consider removing it for clarity, or add a comment explaining it's defensive.
Test: testWrapToIndexSplitsWithUnorderedAndDiscontiguousDataFiles
The test expects [Range(4200, 4450), Range(4650, 4700)] — this implies:
- file3: 4200 + 208 - 1 = 4407, intersected with [0, 5000] → Range(4200, 4407)
- file2: 4300 + 151 - 1 = 4450, intersected with [0, 5000] → Range(4300, 4450)
- Merged (adjacent=true): Range(4200, 4450) ✓
- file1: 4650 + 51 - 1 = 4700, intersected with [0, 5000] → Range(4650, 4700) ✓
The math checks out. Good edge case with overlapping file ranges (file2 and file3 overlap in [4300, 4407]).
Minor: the BTreeGlobalIndexBuilder no longer imports Arrays — confirmed it's only used in the removed Arrays.asList(dataSplit) call, replaced by Collections.singletonList. Clean.
LGTM with the caveat about the BTreeIndexTopoBuilder behavioral change — please confirm it's intentional.
|
+1 |
Purpose
Fix BTree global index building for data-evolution append tables when a data split contains unordered data files or intersects multiple discontiguous row ranges.
Previously
DataEvolutionBatchScan.wrapderived the split range from the first and last data files. This assumesdataFiles()are ordered by row id, butRangeHelper.mergeOverlappingRangesmay keep the original order inside a merged group. As a result,wrapcould calculate an inverted range and fail to find any intersected row ranges.Also, BTree index building assumed each split maps to exactly one row range after applying the row range index. A contiguous data-file split can still intersect multiple discontiguous indexed/non-indexed row ranges, so the builder needs to fan out the split by row range before grouping.
Changes
IndexedSplitrow ranges by intersecting the row range index with each data file's actual row-id range, then sort/merge the result.splitByRowRangeIndexand let BTree range grouping handle multiple row ranges from one split.IndexedSplitso downstream reads push down the exact row range.Tests
mvn -pl paimon-core -Pfast-build spotless:applymvn -pl paimon-core -Pfast-build -DskipTests compilemvn -pl paimon-core -Pfast-build -DfailIfNoTests=false -Dtest=DataEvolutionBatchScanTest#testWrapToIndexSplitsWithUnorderedAndDiscontiguousDataFiles,BTreeGlobalIndexBuilderSplitTest#testGroupSplitsByDiscontiguousRowRangeIndex test