feat(scan): support data evolution row ID range filter by XiaoHongbo-Hope · Pull Request #207 · apache/paimon-rust

XiaoHongbo-Hope · 2026-04-05T08:35:09Z

Purpose

Linked issue: sub task of #173

Brief change log

Tests

API and Format

Documentation

JingsongLi · 2026-04-05T12:51:18Z

We can also support querying _ROW_ID in this PR.

JingsongLi · 2026-04-05T13:37:40Z

Also, please do not reference the issue in commit messages; doing so floods the issue with notifications.

XiaoHongbo-Hope · 2026-04-05T14:08:42Z

👌

…Selection instead of skipping IO filtering
- Replace post-read filter approach with pre-computed selected row ID sequence
- RowSelection is always applied at Parquet level for IO optimization
- Row IDs are assigned from the pre-computed sequence matching RowSelection output
- Extract insert_column_at to deduplicate column insertion logic
- Empty row_ranges treated as None (no filtering)
- Use saturating_add to prevent overflow in merge_row_ranges and build_row_ranges_selection
- Compute row_ranges before moving file_group to avoid clone
- Remove arrow-select dependency (no longer needed)
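
To make the row-selection idea in this commit message concrete, here is a minimal sketch of turning row ID ranges into skip/select runs for one file. It is not the PR's code: `Run` and `build_runs` are hypothetical names, `Run` only stands in for a Parquet-level row selector, and the ranges are assumed to be sorted and already merged. The caller is assumed to handle the "empty row_ranges means no filtering" case before calling this.

```rust
/// Inclusive range of global row IDs (illustrative stand-in, not the PR's type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct RowRange {
    from: i64,
    to: i64,
}

/// Hypothetical stand-in for a Parquet row selector: skip or select a run of rows.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Run {
    Skip(usize),
    Select(usize),
}

/// Convert sorted, non-overlapping ranges into skip/select runs for a file
/// whose rows carry the IDs first_row_id ..= first_row_id + row_count - 1.
fn build_runs(first_row_id: i64, row_count: i64, ranges: &[RowRange]) -> Vec<Run> {
    let mut runs = Vec::new();
    if row_count <= 0 {
        return runs; // an empty file has nothing to skip or select
    }
    let file_end = first_row_id.saturating_add(row_count - 1);
    let mut cursor = first_row_id; // first row ID not yet covered by a run
    for r in ranges {
        // Clamp the range to the part of the file that is still uncovered.
        let from = r.from.max(cursor);
        let to = r.to.min(file_end);
        if from > to {
            continue; // range lies outside this file
        }
        if from > cursor {
            runs.push(Run::Skip((from - cursor) as usize));
        }
        runs.push(Run::Select((to - from + 1) as usize));
        if to == file_end {
            return runs; // file fully covered; also avoids `to + 1` overflowing
        }
        cursor = to + 1;
    }
    // Rows after the last selected range are skipped.
    runs.push(Run::Skip((file_end - cursor + 1) as usize));
    runs
}
```

Because the selected row IDs can be derived from the same clamped ranges, the IDs attached to the output rows line up with what the selection returns.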

JingsongLi · 2026-04-06T13:26:45Z

        self
    }

+    pub fn is_row_id_field(&self) -> bool {


Remove this.

Removed

JingsongLi · 2026-04-06T13:27:56Z

+        self.to - self.from + 1
+    }
+
+    pub fn is_empty(&self) -> bool {


This is unnecessary; debug_assert!(from <= to) already guarantees the range is never empty.

👌

Removed

luoyuxia

@XiaoHongbo-Hope Thanks. I left some minor comments.

luoyuxia · 2026-04-06T13:49:11Z

        // With data predicates, merged_row_count() reflects pre-filter row counts,
        // so stopping early could return fewer rows than the limit after filtering.
-        let splits = if self.data_predicates.is_empty() {
+        let splits = if self.data_predicates.is_empty() && self.row_ranges.is_none() {


nit: the comment above should also be updated to mention the new row_ranges condition.

Added

luoyuxia · 2026-04-06T13:56:32Z

+
+impl RowRange {
+    pub fn new(from: i64, to: i64) -> Self {
+        debug_assert!(from <= to, "RowRange from ({from}) must be <= to ({to})");


debug_assert! won't prevent an illegal RowRange in release builds. Either use assert! or return Result<Self>.

Thanks, updated
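
For reference, the `Result`-returning alternative mentioned above could look roughly like the sketch below. The commit history indicates the PR went with `assert!` in `RowRange::new`; `try_new` and `InvalidRowRange` are hypothetical names used only for illustration.

```rust
use std::fmt;

/// Inclusive range of row IDs (illustrative stand-in, not the PR's type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct RowRange {
    from: i64,
    to: i64,
}

/// Hypothetical error type for a range whose bounds are reversed.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct InvalidRowRange {
    from: i64,
    to: i64,
}

impl fmt::Display for InvalidRowRange {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "RowRange from ({}) must be <= to ({})", self.from, self.to)
    }
}

impl std::error::Error for InvalidRowRange {}

impl RowRange {
    /// Validates the bounds in every build profile, unlike debug_assert!,
    /// which is compiled out of release builds by default.
    pub fn try_new(from: i64, to: i64) -> Result<Self, InvalidRowRange> {
        if from <= to {
            Ok(Self { from, to })
        } else {
            Err(InvalidRowRange { from, to })
        }
    }
}
```

The trade-off is that every call site has to handle the error, whereas `assert!` keeps the constructor infallible and panics on misuse.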

luoyuxia · 2026-04-06T13:58:37Z


+/// Expand row_ranges into a flat sequence of selected row IDs for a file.
+fn expand_selected_row_ids(first_row_id: i64, row_count: i64, row_ranges: &[RowRange]) -> Vec<i64> {
+    let file_end = first_row_id + row_count - 1;


What if row_count is 0? Is it expected that file_end will be less than first_row_id?

Added an early return.
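
A minimal sketch of what such an early return might look like, assuming the ranges passed in are already sorted and merged. The body is a guess at the shape of the function, not the PR's implementation; only the signature is taken from the diff above.

```rust
/// Inclusive range of global row IDs (illustrative stand-in, not the PR's type).
#[derive(Debug, Clone, Copy)]
struct RowRange {
    from: i64,
    to: i64,
}

/// Expand row_ranges into a flat sequence of selected row IDs for a file.
fn expand_selected_row_ids(first_row_id: i64, row_count: i64, row_ranges: &[RowRange]) -> Vec<i64> {
    // Early return: with no rows, file_end would fall below first_row_id.
    if row_count <= 0 {
        return Vec::new();
    }
    let file_end = first_row_id.saturating_add(row_count - 1);
    let mut ids = Vec::new();
    for r in row_ranges {
        // Keep only the part of the range that falls inside this file.
        let from = r.from.max(first_row_id);
        let to = r.to.min(file_end);
        if from <= to {
            ids.extend(from..=to);
        }
    }
    ids
}
```

With the guard in place, a zero-row file yields an empty sequence regardless of the ranges supplied.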

luoyuxia · 2026-04-06T14:08:48Z

+            let first_row_id = data_files[0].first_row_id.unwrap_or(0);
+            let file_row_count = data_files[0].row_count;
+            let total_rows = match &row_ranges {
+                Some(ranges) => expand_selected_row_ids(first_row_id, file_row_count, ranges).len(),


nit: in the plan phase, the ranges in the data split have already gone through merge_row_ranges, but the read phase here runs merge_row_ranges again. Is that redundant?

Fixed

luoyuxia · 2026-04-06T14:10:53Z

+            current.to = current.to.max(r.to);
+        } else {
+            merged.push(current);
+            current = r.clone();


nit: we don't need the clone here; just take ownership of the ranges

    if ranges.len() <= 1 {
        return ranges;
    }
    ranges.sort_by_key(|r| r.from);
    let mut merged = Vec::with_capacity(ranges.len());
    let mut iter = ranges.into_iter();
    let mut current = iter.next().unwrap();
    for r in iter {
        if r.from <= current.to.saturating_add(1) {
            current.to = current.to.max(r.to);
        } else {
            merged.push(current);
            current = r;
        }
    }
    merged.push(current);
    merged

Updated.
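
The reviewer's suggestion can be tried in isolation by wrapping it in a function. In the sketch below the function signature and the `RowRange` definition are assumptions added for illustration; the body follows the snippet from the review comment.

```rust
/// Inclusive range of row IDs (illustrative stand-in, not the PR's type).
#[derive(Debug, Clone, PartialEq, Eq)]
struct RowRange {
    from: i64,
    to: i64,
}

/// Sort ranges and merge those that overlap or are directly adjacent.
/// Takes the Vec by value so ranges are moved rather than cloned.
fn merge_row_ranges(mut ranges: Vec<RowRange>) -> Vec<RowRange> {
    if ranges.len() <= 1 {
        return ranges;
    }
    ranges.sort_by_key(|r| r.from);
    let mut merged = Vec::with_capacity(ranges.len());
    let mut iter = ranges.into_iter();
    // Safe to unwrap: the length check above guarantees at least two items.
    let mut current = iter.next().unwrap();
    for r in iter {
        // saturating_add avoids overflow when current.to is i64::MAX.
        if r.from <= current.to.saturating_add(1) {
            current.to = current.to.max(r.to);
        } else {
            merged.push(current);
            current = r;
        }
    }
    merged.push(current);
    merged
}
```

Consuming the vector through `into_iter()` is what lets `current = r` move each range instead of cloning it.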

JingsongLi

Can you also add a SQL test (selecting and filtering on _ROW_ID) in DataFusion?

XiaoHongbo-Hope force-pushed the support_row_id_filter branch from d836c64 to d20243e Compare April 5, 2026 13:32

XiaoHongbo-Hope force-pushed the support_row_id_filter branch 5 times, most recently from d140ab8 to c4a34e7 Compare April 5, 2026 14:06

XiaoHongbo-Hope changed the title ~~feat(scan): support data evolution row ID range filter (#173)~~ feat(scan): support data evolution row ID range filter Apr 5, 2026

XiaoHongbo-Hope force-pushed the support_row_id_filter branch 8 times, most recently from 8145479 to 8728044 Compare April 6, 2026 09:46

XiaoHongbo-Hope marked this pull request as ready for review April 6, 2026 09:51

XiaoHongbo-Hope force-pushed the support_row_id_filter branch from ed544fd to 7196a74 Compare April 6, 2026 10:08

XiaoHongbo-Hope added 6 commits April 6, 2026 18:10

support data evolution row id

44bd80e

clean code

d4f8cad

fix: skip parquet-level row_ranges filtering when _ROW_ID is projected

32f3f03

fix: disable limit pushdown when row_ranges is set

d5e99b5

fix: apply row_ranges post-read filter when _ROW_ID is projected

6d79bfe

fix: skip row_ranges filtering for files without first_row_id (fail-o…

2cf0c78

…pen)

XiaoHongbo-Hope force-pushed the support_row_id_filter branch from 0803575 to 2cf0c78 Compare April 6, 2026 10:10

XiaoHongbo-Hope added 3 commits April 6, 2026 19:38

fix: return null _ROW_ID for files without first_row_id instead of fa…

3f0b3d9

…king from 0

clean code

60022ef

fix: use saturating_add to avoid i64 overflow in merge_row_ranges

c23c392

XiaoHongbo-Hope force-pushed the support_row_id_filter branch from 6ab9e31 to c23c392 Compare April 6, 2026 11:38

XiaoHongbo-Hope marked this pull request as draft April 6, 2026 12:03

XiaoHongbo-Hope marked this pull request as ready for review April 6, 2026 12:14

XiaoHongbo-Hope marked this pull request as draft April 6, 2026 12:16

XiaoHongbo-Hope added 3 commits April 6, 2026 20:16

refactor: extract insert_column_at to deduplicate append_row_id_colum…

af4034c

…n and append_null_row_id_column

style: remove unnecessary comments

435854c

XiaoHongbo-Hope force-pushed the support_row_id_filter branch from e894feb to 435854c Compare April 6, 2026 12:17

XiaoHongbo-Hope added 2 commits April 6, 2026 20:32

fix: guard against clamped_to underflow and extract attach_row_id helper

6f1845e

fix: NULL-fill branch in merge_files_by_columns respects row_ranges

bfc11c0

XiaoHongbo-Hope force-pushed the support_row_id_filter branch from 24fa539 to bfc11c0 Compare April 6, 2026 12:32

XiaoHongbo-Hope marked this pull request as ready for review April 6, 2026 12:49

JingsongLi reviewed Apr 6, 2026

XiaoHongbo-Hope added 5 commits April 6, 2026 21:32

perf: lazy row ID computation when no row_ranges, avoid full-file all…

fddb6b6

…ocation

fix: merge overlapping row_ranges in expand_selected_row_ids to match…

e9f7566

… RowSelection

add debug_assert for merge group invariants aligned with Java checkAr…

142e3c0

…gument

use assert! instead of debug_assert! for merge group invariants

ac911b5

remove unused is_row_id_field and is_empty methods

d90879e

XiaoHongbo-Hope force-pushed the support_row_id_filter branch from e7cd932 to d90879e Compare April 6, 2026 13:33

luoyuxia reviewed Apr 6, 2026

address review: assert in RowRange::new, handle row_count==0, remove …

1158447

…redundant merge, avoid clone in merge_row_ranges, update limit comment

XiaoHongbo-Hope force-pushed the support_row_id_filter branch from fc63da0 to 1158447 Compare April 6, 2026 14:28

JingsongLi reviewed Apr 6, 2026

XiaoHongbo-Hope added 3 commits April 7, 2026 09:14

support _ROW_ID in DataFusion SQL for data evolution tables

481fe2c

fix: quote _ROW_ID in SQL to preserve case sensitivity

de7db67

fix: always pass column names to ReadBuilder to include _ROW_ID in SE…

1d3cf97

…LECT *

XiaoHongbo-Hope force-pushed the support_row_id_filter branch from 718c2c7 to 1d3cf97 Compare April 7, 2026 01:14

XiaoHongbo-Hope closed this Apr 17, 2026
