feat(scan): support data evolution row ID range filter #207

Closed
XiaoHongbo-Hope wants to merge 23 commits into apache:main from XiaoHongbo-Hope:support_row_id_filter

Conversation

@XiaoHongbo-Hope
Contributor

@XiaoHongbo-Hope XiaoHongbo-Hope commented Apr 5, 2026

Purpose

Linked issue: sub task of #173

Brief change log

Tests

API and Format

Documentation

@JingsongLi
Contributor

We can also support querying _ROW_ID in this PR.

@XiaoHongbo-Hope XiaoHongbo-Hope force-pushed the support_row_id_filter branch from d836c64 to d20243e on April 5, 2026 13:32
@JingsongLi
Contributor

And please do not reference the issue in commit messages; every push then floods the issue with notifications.

@XiaoHongbo-Hope XiaoHongbo-Hope force-pushed the support_row_id_filter branch 5 times, most recently from d140ab8 to c4a34e7 on April 5, 2026 14:06
@XiaoHongbo-Hope
Contributor Author

And please do not reference the issue in commit messages; every push then floods the issue with notifications.

👌

@XiaoHongbo-Hope XiaoHongbo-Hope changed the title from "feat(scan): support data evolution row ID range filter (#173)" to "feat(scan): support data evolution row ID range filter" on Apr 5, 2026
@XiaoHongbo-Hope XiaoHongbo-Hope force-pushed the support_row_id_filter branch 8 times, most recently from 8145479 to 8728044 on April 6, 2026 09:46
@XiaoHongbo-Hope XiaoHongbo-Hope marked this pull request as ready for review April 6, 2026 09:51
@XiaoHongbo-Hope XiaoHongbo-Hope force-pushed the support_row_id_filter branch from ed544fd to 7196a74 on April 6, 2026 10:08
@XiaoHongbo-Hope XiaoHongbo-Hope force-pushed the support_row_id_filter branch from 0803575 to 2cf0c78 on April 6, 2026 10:10
@XiaoHongbo-Hope XiaoHongbo-Hope force-pushed the support_row_id_filter branch from 6ab9e31 to c23c392 on April 6, 2026 11:38
@XiaoHongbo-Hope XiaoHongbo-Hope marked this pull request as draft April 6, 2026 12:03
@XiaoHongbo-Hope XiaoHongbo-Hope marked this pull request as ready for review April 6, 2026 12:14
@XiaoHongbo-Hope XiaoHongbo-Hope marked this pull request as draft April 6, 2026 12:16
…Selection instead of skipping IO filtering

- Replace post-read filter approach with pre-computed selected row ID sequence
- RowSelection is always applied at Parquet level for IO optimization
- Row IDs are assigned from the pre-computed sequence matching RowSelection output
- Extract insert_column_at to deduplicate column insertion logic
- Empty row_ranges treated as None (no filtering)
- Use saturating_add to prevent overflow in merge_row_ranges and build_row_ranges_selection
- Compute row_ranges before moving file_group to avoid clone
- Remove arrow-select dependency (no longer needed)
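
The approach in the commit message above, turning row ID ranges into skip/select runs over a file's row positions, can be sketched roughly as follows. This is a dependency-free illustration, not the PR's code: `Run` and `build_runs` are hypothetical names standing in for the Parquet row-selection types, and the ranges are assumed to be already sorted and merged. Treating an empty range list as "no filtering" is left to the caller here.

```rust
/// Inclusive row ID range (hypothetical mirror of the PR's RowRange).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct RowRange {
    from: i64,
    to: i64,
}

/// One run of consecutive rows in a file: either skipped or selected.
/// Hypothetical stand-in for a Parquet-level row selector.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Run {
    Skip(usize),
    Select(usize),
}

/// Convert sorted, non-overlapping row ID ranges into skip/select runs
/// covering the file's rows [first_row_id, first_row_id + row_count - 1].
fn build_runs(first_row_id: i64, row_count: i64, ranges: &[RowRange]) -> Vec<Run> {
    let mut runs = Vec::new();
    if row_count <= 0 {
        return runs; // empty file: nothing to skip or select
    }
    // saturating_add guards against overflow near i64::MAX.
    let file_end = first_row_id.saturating_add(row_count - 1);
    let mut cursor = first_row_id; // first row ID not yet covered by a run

    for r in ranges {
        // Clip the range to the part of the file not yet covered.
        let from = r.from.max(cursor);
        let to = r.to.min(file_end);
        if from > to {
            continue; // range lies entirely outside the remaining rows
        }
        if from > cursor {
            runs.push(Run::Skip((from - cursor) as usize));
        }
        runs.push(Run::Select((to - from + 1) as usize));
        if to == file_end {
            return runs; // file fully covered; also avoids computing to + 1
        }
        cursor = to + 1;
    }

    // Rows after the last selected range are skipped.
    runs.push(Run::Skip((file_end - cursor + 1) as usize));
    runs
}
```

For a file holding row IDs 100 to 109 and ranges 102..=104 and 108..=120, this yields skip 2, select 3, skip 3, select 2, so only the selected rows need to be decoded.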
@XiaoHongbo-Hope XiaoHongbo-Hope force-pushed the support_row_id_filter branch from e894feb to 435854c on April 6, 2026 12:17
@XiaoHongbo-Hope XiaoHongbo-Hope force-pushed the support_row_id_filter branch from 24fa539 to bfc11c0 on April 6, 2026 12:32
@XiaoHongbo-Hope XiaoHongbo-Hope marked this pull request as ready for review April 6, 2026 12:49
Comment thread crates/paimon/src/spec/schema.rs Outdated
self
}

pub fn is_row_id_field(&self) -> bool {
Contributor

Remove this.

Contributor Author

Remove this.

Removed

Comment thread crates/paimon/src/table/source.rs Outdated
self.to - self.from + 1
}

pub fn is_empty(&self) -> bool {
Contributor

@JingsongLi JingsongLi Apr 6, 2026

This is redundant: the constructor already has debug_assert!(from <= to), so a RowRange can never be empty.

Contributor Author

This is redundant: the constructor already has debug_assert!(from <= to), so a RowRange can never be empty.

👌

Contributor Author

This is redundant: the constructor already has debug_assert!(from <= to), so a RowRange can never be empty.

Removed

@XiaoHongbo-Hope XiaoHongbo-Hope force-pushed the support_row_id_filter branch from e7cd932 to d90879e on April 6, 2026 13:33
Contributor

@luoyuxia luoyuxia left a comment

@XiaoHongbo-Hope Thanks. Left minor comments

  // With data predicates, merged_row_count() reflects pre-filter row counts,
  // so stopping early could return fewer rows than the limit after filtering.
- let splits = if self.data_predicates.is_empty() {
+ let splits = if self.data_predicates.is_empty() && self.row_ranges.is_none() {
Contributor

nit: the comment could also be updated to reflect the new row_ranges condition.

Contributor Author

nit: the comment could also be updated to reflect the new row_ranges condition.

Added
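
The limit condition discussed in this thread can be sketched as below. This is an illustration only: `Split`, `apply_limit`, and the two boolean flags are hypothetical stand-ins for the scan's real types. The point is that stopping early is only safe when the per-split row counts are exact, which is not the case once data predicates or row ranges can drop rows at read time.

```rust
/// Hypothetical stand-in for a planned split with a known pre-filter row count.
struct Split {
    row_count: u64,
}

/// Keep only as many splits as are needed to satisfy `limit`, but only when
/// the row counts are exact. With data predicates or row ranges, a split may
/// yield fewer rows than `row_count`, so every split is kept.
fn apply_limit(
    splits: Vec<Split>,
    limit: Option<u64>,
    has_data_predicates: bool,
    has_row_ranges: bool,
) -> Vec<Split> {
    let limit = match limit {
        Some(l) => l,
        None => return splits, // no limit requested
    };
    if has_data_predicates || has_row_ranges {
        return splits; // counts are upper bounds only; do not stop early
    }

    let mut kept = Vec::new();
    let mut total: u64 = 0;
    for split in splits {
        if total >= limit {
            break; // enough rows already planned
        }
        total = total.saturating_add(split.row_count);
        kept.push(split);
    }
    kept
}
```

With three splits of 4 rows each and a limit of 5, the unfiltered case keeps two splits, while the same call with row ranges present keeps all three.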

Comment thread crates/paimon/src/table/source.rs Outdated

impl RowRange {
pub fn new(from: i64, to: i64) -> Self {
debug_assert!(from <= to, "RowRange from ({from}) must be <= to ({to})");
Contributor

debug_assert! won't prevent an illegal RowRange in release builds.
Either use assert! or return Result<Self>.

Contributor Author

debug_assert! won't prevent an illegal RowRange in release builds. Either use assert! or return Result<Self>.

Thanks, updated
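
The thread does not say which of the two options was taken. As one possible shape of the Result-returning variant, here is a minimal sketch; `InvalidRowRange` and the `count` method name are hypothetical and not taken from the PR.

```rust
use std::fmt;

/// Inclusive row ID range; `from <= to` is enforced at construction.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct RowRange {
    from: i64,
    to: i64,
}

/// Hypothetical error type for a range whose bounds are reversed.
#[derive(Debug, PartialEq, Eq)]
pub struct InvalidRowRange {
    pub from: i64,
    pub to: i64,
}

impl fmt::Display for InvalidRowRange {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "RowRange from ({}) must be <= to ({})", self.from, self.to)
    }
}

impl std::error::Error for InvalidRowRange {}

impl RowRange {
    /// Unlike debug_assert!, this check also runs in release builds.
    pub fn new(from: i64, to: i64) -> Result<Self, InvalidRowRange> {
        if from > to {
            return Err(InvalidRowRange { from, to });
        }
        Ok(Self { from, to })
    }

    /// Number of row IDs in the range, saturating instead of overflowing.
    pub fn count(&self) -> i64 {
        self.to.saturating_sub(self.from).saturating_add(1)
    }
}
```

Returning a Result pushes the decision to the caller; swapping the check for assert! would instead panic on invalid input in every build profile.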


/// Expand row_ranges into a flat sequence of selected row IDs for a file.
fn expand_selected_row_ids(first_row_id: i64, row_count: i64, row_ranges: &[RowRange]) -> Vec<i64> {
let file_end = first_row_id + row_count - 1;
Contributor

What if row_count is 0? Is it expected that file_end will be less than first_row_id?

Contributor Author

What if row_count is 0? Is it expected that file_end will be less than first_row_id?

Added an early return.
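
A minimal sketch of what such an early return might look like, assuming the signature shown in the diff above and a simplified local `RowRange` with inclusive bounds. The body is a reconstruction for illustration, not the PR's implementation.

```rust
/// Inclusive row ID range (simplified local stand-in for the PR's type).
#[derive(Debug, Clone, Copy)]
struct RowRange {
    from: i64,
    to: i64,
}

/// Expand row_ranges into a flat sequence of selected row IDs for a file.
fn expand_selected_row_ids(
    first_row_id: i64,
    row_count: i64,
    row_ranges: &[RowRange],
) -> Vec<i64> {
    // Early return: without it, row_count == 0 would give
    // file_end = first_row_id - 1, i.e. an inverted file range.
    if row_count <= 0 {
        return Vec::new();
    }
    let file_end = first_row_id.saturating_add(row_count - 1);

    let mut ids = Vec::new();
    for r in row_ranges {
        // Clip each range to the rows actually stored in this file.
        let from = r.from.max(first_row_id);
        let to = r.to.min(file_end);
        if from <= to {
            ids.extend(from..=to);
        }
    }
    ids
}
```

The clipping step already yields nothing for an empty file once file_end is computed, but the explicit guard keeps the arithmetic on file_end well defined and makes the intent visible.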

let first_row_id = data_files[0].first_row_id.unwrap_or(0);
let file_row_count = data_files[0].row_count;
let total_rows = match &row_ranges {
Some(ranges) => expand_selected_row_ids(first_row_id, file_row_count, ranges).len(),
Contributor

nit:
In the plan phase the ranges in the data split have already gone through merge_row_ranges, but the read phase here runs merge_row_ranges again. Is that redundant?

Contributor Author

nit: In the plan phase the ranges in the data split have already gone through merge_row_ranges, but the read phase here runs merge_row_ranges again. Is that redundant?

Fixed

Comment thread crates/paimon/src/table/source.rs Outdated
current.to = current.to.max(r.to);
} else {
merged.push(current);
current = r.clone();
Contributor

nit: we don't need the clone; just take ownership

if ranges.len() <= 1 {
    return ranges;
}

ranges.sort_by_key(|r| r.from);

let mut merged = Vec::with_capacity(ranges.len());
let mut iter = ranges.into_iter();
let mut current = iter.next().unwrap();

for r in iter {
    if r.from <= current.to.saturating_add(1) {
        current.to = current.to.max(r.to);
    } else {
        merged.push(current);
        current = r;
    }
}

merged.push(current);
merged

Contributor Author

nit: we don't need the clone; just take ownership

Updated.
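
For reference, the suggested body can be wrapped into a complete function as below. The function name follows the thread, but the signature (taking the Vec by value) and the simplified `RowRange` struct are assumptions made so the sketch compiles on its own.

```rust
/// Inclusive row ID range (simplified local stand-in for the PR's type).
#[derive(Debug, Clone, PartialEq, Eq)]
struct RowRange {
    from: i64,
    to: i64,
}

/// Sort ranges by start and merge any that overlap or are adjacent.
/// Taking the Vec by value lets each element be moved, so no clone is needed.
fn merge_row_ranges(mut ranges: Vec<RowRange>) -> Vec<RowRange> {
    if ranges.len() <= 1 {
        return ranges;
    }

    ranges.sort_by_key(|r| r.from);

    let mut merged = Vec::with_capacity(ranges.len());
    let mut iter = ranges.into_iter();
    // Safe: the length check above guarantees at least two elements.
    let mut current = iter.next().unwrap();

    for r in iter {
        // saturating_add keeps the adjacency test safe when to == i64::MAX.
        if r.from <= current.to.saturating_add(1) {
            current.to = current.to.max(r.to);
        } else {
            merged.push(current);
            current = r; // move, not clone
        }
    }

    merged.push(current);
    merged
}
```

Given 5..=7, 1..=2, 3..=3, 10..=12 and 6..=9, the function returns 1..=3 and 5..=12: adjacent ranges (2 and 3, 9 and 10) are merged as well as overlapping ones.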

…redundant merge, avoid clone in merge_row_ranges, update limit comment
@XiaoHongbo-Hope XiaoHongbo-Hope force-pushed the support_row_id_filter branch from fc63da0 to 1158447 on April 6, 2026 14:28
Contributor

@JingsongLi JingsongLi left a comment

Can you also create a SQL test (select and filter _ROW_ID) in datafusion?

3 participants