feat: support data evolution table mode by JingsongLi · Pull Request #193 · apache/paimon-rust

JingsongLi · 2026-04-02T07:38:07Z

Purpose

Supports https://paimon.apache.org/docs/master/append-table/data-evolution/

Sub task of #173

Brief change log

Tests

API and Format

Documentation

QuakeWang

@JingsongLi Hi, I have reviewed the PR and left some minor comments.

QuakeWang · 2026-04-02T08:57:28Z

+
+        Ok(try_stream! {
+            for split in &splits {
+                if split.raw_convertible() || split.data_files().len() == 1 {


The raw-convertible / single-file branch yields read_single_file() directly, but this helper does not reorder columns back to projected_column_names after ProjectionMask, unlike the existing read() path.

Parquet returns file-schema order, not request order, so a projection like ["value", "id"] will come back as ["id", "value"] for single-file splits. That makes this path inconsistent with the current read() behavior, and it can also make the new projection test flaky when the plan contains single-file groups.

Should we preserve the same reorder logic here so that the data-evolution raw path matches the normal read path?
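A minimal sketch of the reorder step this comment asks for, written against plain vectors rather than Arrow record batches so that it stays self-contained. The function name `reorder_to_projection` and its parameters are hypothetical and not taken from the PR; the only assumption is that the columns arrive in file-schema order along with their names.

```rust
/// Reorders columns that arrive in file-schema order into the order the
/// caller requested. `file_order[i]` is the name of `columns[i]`.
/// Returns `None` if the inputs are inconsistent or a requested column
/// is not present in the file.
/// NOTE: hypothetical helper for illustration, not the PR's code.
pub fn reorder_to_projection<T: Clone>(
    file_order: &[&str],
    columns: &[T],
    projected: &[&str],
) -> Option<Vec<T>> {
    if file_order.len() != columns.len() {
        return None;
    }
    projected
        .iter()
        .map(|name| {
            // Look up where the requested column sits in file order.
            file_order
                .iter()
                .position(|f| f == name)
                .map(|idx| columns[idx].clone())
        })
        .collect()
}
```

With a file that stores `["id", "value"]` and a projection of `["value", "id"]`, the helper returns the `value` column first, which is the behavior the comment expects from both the single-file path and the normal read() path.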

QuakeWang · 2026-04-02T08:59:15Z

+    let mut column_source: HashMap<String, (usize, i64)> = HashMap::new();
+
+    for (file_idx, file_meta) in data_files.iter().enumerate() {
+        let file_columns: Vec<String> = if let Some(ref wc) = file_meta.write_cols {


Falling back to “all columns from the current table schema” when write_cols == None is not correct here.

One of the main data-evolution scenarios is reading old files after the table has added new columns. In that case, an old file only contains fields from the schema identified by file_meta.schema_id; using the current table schema here will incorrectly mark later-added columns as if this file already provided them. That affects winning-column resolution, and it can also drop entire row groups when the projection only contains newly added columns.

I don't follow your point. Could you give an example?

My point is about old files after schema evolution.

If an old file was written when the table schema was (id, name), and later the table becomes (id, name, age), then write_cols == None should not mean that this old file contains age too. But the current fallback uses the current table schema, so it will treat the old file as if it also provides the newly added columns.

I think for write_cols == None, we should use the file schema from file_meta.schema_id, not the current table schema.
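The difference between the two fallbacks can be sketched as follows. This is an illustration under stated assumptions, not the PR's code: `FileMeta` is a trimmed-down stand-in, `max_sequence_number` is an assumed field, and the `(usize, i64)` value is interpreted here as (file index, sequence number) with the highest sequence number winning. The fallback is passed in as a closure so both variants can be compared.

```rust
use std::collections::HashMap;

/// Hypothetical, trimmed-down stand-in for the file metadata in the diff.
pub struct FileMeta {
    pub schema_id: i64,
    pub max_sequence_number: i64, // assumed field, used to pick the winner
    pub write_cols: Option<Vec<String>>,
}

/// For each column, records which file provides it. When several files
/// provide the same column, the one with the higher sequence number wins
/// (an assumption about the intended resolution rule).
/// `fallback_columns` decides what `write_cols == None` means.
pub fn resolve_column_sources(
    files: &[FileMeta],
    fallback_columns: impl Fn(&FileMeta) -> Vec<String>,
) -> HashMap<String, (usize, i64)> {
    let mut column_source: HashMap<String, (usize, i64)> = HashMap::new();
    for (file_idx, meta) in files.iter().enumerate() {
        let cols = match &meta.write_cols {
            Some(wc) => wc.clone(),
            None => fallback_columns(meta),
        };
        for col in cols {
            let candidate = (file_idx, meta.max_sequence_number);
            column_source
                .entry(col)
                .and_modify(|cur| {
                    if candidate.1 > cur.1 {
                        *cur = candidate;
                    }
                })
                .or_insert(candidate);
        }
    }
    column_source
}
```

If the closure returns the current table schema `(id, name, age)`, an old file written under `(id, name)` is recorded as a source for `age`. If it returns the columns of the schema identified by `schema_id`, `age` has no source for that file.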

The old file can be treated as (id, name, age), where age is NULL. This is effectively schema evolution.

I don't want to introduce reading of old schema files here, because that would also have to address issues such as column type changes.

@JingsongLi I get your point. I agree we do not necessarily need to introduce historical schema reading here, especially if that expands the scope to type evolution.

My concern is mainly that, even under the “current schema + NULL for missing columns” semantics, the current fallback may still behave incorrectly in some cases. For example, when projecting only a newly added column, I think this path may return 0 rows instead of preserving the row count and filling NULLs.

So my point is less about requiring old-schema reads, and more about whether the current fallback already implements the expected schema-evolution behavior correctly.

81665a1
You mean these three lines of code? We should remove them.

Yes, those three lines are part of the issue, so removing them makes sense.

I just want to make sure this is sufficient: for the add-column case, if the projected column does not physically exist in the old files, we should still preserve the row count and fill NULL, instead of dropping rows.

I found it quite troublesome to fill in NULL: we would also need the corresponding Arrow type, which we don't have here, so the current implementation simply does not produce the column. When we need to support schema evolution in the future, we can revisit how to change it.

OK, that makes sense to me.
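For reference, the "preserve the row count and fill NULL" semantics discussed above might look like the sketch below. It uses `Vec<Option<i64>>` as a stand-in for an Arrow array, which sidesteps the exact problem raised in the thread: a real implementation would need the Arrow data type of the missing column to build a typed null array (arrow-rs offers `new_null_array` for that, given a `DataType`). The names `Column` and `project_with_nulls` are hypothetical.

```rust
use std::collections::HashMap;

/// One nullable integer column; a simplified stand-in for an Arrow array.
pub type Column = Vec<Option<i64>>;

/// Projects `projected` out of the columns a file physically contains.
/// A column the file does not contain is returned as `row_count` NULLs,
/// so the number of rows is preserved even when every projected column
/// was added after the file was written.
/// NOTE: illustration only; the merged PR does not fill NULLs.
pub fn project_with_nulls(
    file_columns: &HashMap<String, Column>,
    row_count: usize,
    projected: &[&str],
) -> Vec<Column> {
    projected
        .iter()
        .map(|name| match file_columns.get(*name) {
            Some(col) => col.clone(),
            None => vec![None; row_count],
        })
        .collect()
}
```

Projecting only `age` from a three-row file that stores just `id` yields one column of three NULLs instead of zero rows.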

QuakeWang · 2026-04-02T09:06:17Z

-                if let Some(files) = data_deletion_files {
-                    builder = builder.with_data_deletion_files(files);
+            if data_evolution_enabled {
+                let file_groups = split_by_row_id(data_files);


The split-generation logic here diverges from upstream DataEvolutionSplitGenerator.

Upstream does not simply group by equal first_row_id and emit one split per group. It first merges overlapping row_id_ranges, then applies ordered bin packing using target_split_size/open_file_cost, and computes rawConvertible from the packed result. The current implementation introduces two regressions:

source.split.* is effectively bypassed in data-evolution mode, so splits become much more fragmented.

Grouping only by first_row_id misses the overlapping row-id-range case, which no longer matches upstream grouping semantics.

I think we should align this with the Java split generator before merging the new read path.
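The two upstream steps described in this comment, merging overlapping row-id ranges and then ordered bin packing, can be sketched roughly as below. This is not a port of the Java DataEvolutionSplitGenerator: `FileEntry`, its fields, and the weight rule (each file counts as the larger of its size and `open_file_cost`) are assumptions made for illustration.

```rust
/// Hypothetical file descriptor: a row-id range plus a size in bytes.
#[derive(Debug, Clone, PartialEq)]
pub struct FileEntry {
    pub first_row_id: i64,
    pub row_count: i64,
    pub file_size: u64,
}

impl FileEntry {
    /// Inclusive end of the row-id range covered by this file.
    fn last_row_id(&self) -> i64 {
        self.first_row_id + self.row_count - 1
    }
}

/// Step 1: put files whose row-id ranges overlap into the same group.
pub fn merge_overlapping(mut files: Vec<FileEntry>) -> Vec<Vec<FileEntry>> {
    files.sort_by_key(|f| f.first_row_id);
    let mut groups: Vec<Vec<FileEntry>> = Vec::new();
    let mut group_end: Option<i64> = None;
    for f in files {
        let end = f.last_row_id();
        match group_end {
            // The file starts inside the current group's range: extend it.
            Some(e) if f.first_row_id <= e => {
                groups.last_mut().expect("group exists").push(f);
                group_end = Some(e.max(end));
            }
            // Otherwise start a new group.
            _ => {
                groups.push(vec![f]);
                group_end = Some(end);
            }
        }
    }
    groups
}

/// Step 2: ordered bin packing. Groups are appended to the current split
/// in order; a new split starts when adding the next group would exceed
/// `target_split_size`. Each file weighs max(file_size, open_file_cost).
pub fn pack_groups(
    groups: Vec<Vec<FileEntry>>,
    target_split_size: u64,
    open_file_cost: u64,
) -> Vec<Vec<Vec<FileEntry>>> {
    let mut splits = Vec::new();
    let mut current: Vec<Vec<FileEntry>> = Vec::new();
    let mut current_weight = 0u64;
    for group in groups {
        let weight: u64 = group.iter().map(|f| f.file_size.max(open_file_cost)).sum();
        if !current.is_empty() && current_weight + weight > target_split_size {
            splits.push(std::mem::take(&mut current));
            current_weight = 0;
        }
        current_weight += weight;
        current.push(group);
    }
    if !current.is_empty() {
        splits.push(current);
    }
    splits
}
```

A split produced this way may contain several row-id groups, which is why the reader would have to regroup files by row id before merging columns.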

Packing files with different row IDs into the same split makes the implementation very complex, because the reader then has to regroup them during read. That is not necessary for Rust.
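By contrast, the simpler grouping defended in this reply, one split per distinct first_row_id, might be sketched like this. The body is a guess at the general shape of something like `split_by_row_id`, not the PR's implementation; `DataFile`, the optional `first_row_id`, and the handling of files without a row id are assumptions.

```rust
use std::collections::BTreeMap;

/// Hypothetical, trimmed-down data file descriptor.
pub struct DataFile {
    pub file_name: String,
    pub first_row_id: Option<i64>, // assumed to be optional
}

/// Groups files that share the same first_row_id. Each returned group
/// would become one split, so the reader never has to regroup files.
/// Files without a row id are emitted as single-file groups at the end.
pub fn group_by_first_row_id(files: Vec<DataFile>) -> Vec<Vec<DataFile>> {
    let mut by_id: BTreeMap<i64, Vec<DataFile>> = BTreeMap::new();
    let mut without_id: Vec<Vec<DataFile>> = Vec::new();
    for f in files {
        match f.first_row_id {
            Some(id) => by_id.entry(id).or_default().push(f),
            None => without_id.push(vec![f]),
        }
    }
    // BTreeMap iterates in ascending key order, so groups come out
    // sorted by first_row_id.
    by_id.into_values().chain(without_id).collect()
}
```

The trade-off named in the review still applies: this grouping ignores split-size settings and does not merge files whose row-id ranges overlap without starting at the same row id.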

QuakeWang

+1

XiaoHongbo-Hope · 2026-04-02T13:47:49Z

+1

JingsongLi added 3 commits April 2, 2026 15:36

feat: support data evolution table mode

e050e96

Fix

f0342ba

Fix

04266c3

QuakeWang reviewed Apr 2, 2026

JingsongLi changed the title ~~feat: support data evolution table mode~~ [WIP] feat: support data evolution table mode Apr 2, 2026

JingsongLi added 10 commits April 2, 2026 19:25

Fix comments and merge in streaming way

a3e904e

fix

3c56bc7

fix

9b30490

fix

a3ec2d3

fix

a0338f6

fix

81665a1

fix

0d1d1e0

fix

8e91b7d

fix

f5dc0f9

fix

baab555

JingsongLi changed the title ~~[WIP] feat: support data evolution table mode~~ feat: support data evolution table mode Apr 2, 2026

QuakeWang approved these changes Apr 2, 2026

XiaoHongbo-Hope approved these changes Apr 2, 2026

XiaoHongbo-Hope merged commit cd3670c into apache:main Apr 2, 2026
8 checks passed


3 participants