@steFaiz steFaiz commented Jan 27, 2026

Purpose

Linked issue: #7019

This PR introduces a simplified MERGE INTO action/procedure on data-evolution tables for Flink.
The motivation is that, for data-evolution tables, we can efficiently update or insert columns without rewriting existing data files. The paimon-spark module implements this through the MERGE INTO syntax, which Flink does not support, so this action simulates MERGE INTO behavior.

NOTE: Due to limitations in Flink’s implementation, compared with paimon-spark we currently only support the MERGE UPDATE SET branch and do not support inserting new records yet.

The process works as follows:

  1. INNER JOIN the source table with the target table on the merge condition.
    This step assigns a _row_id to each row in the source table.
  2. Shuffle the joined rows by the FirstRowId that corresponds to each newly assigned _row_id.
    This step ensures that rows belonging to the same file are processed by the same writer operator.
  3. Write updated/inserted columns to new files.
    a. Sort rows by _row_id.
    b. Read the original data for each row range and merge it with the new rows.
    c. Write out the merged data.
  4. Commit the new files.
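
Steps 1 to 3a can be sketched in plain Python as below. This is only an illustration of the data flow, not the Flink implementation: the file layout `FILE_FIRST_ROW_IDS`, the helper names `first_row_id_of` and `plan_updates`, and the use of a single equality key as the merge condition are all assumptions made for the example.

```python
from bisect import bisect_right
from collections import defaultdict

# Hypothetical file layout: the first _row_id of each data file, ascending.
# Each file is assumed to cover the ids up to the next file's first_row_id.
FILE_FIRST_ROW_IDS = [1, 6, 11]


def first_row_id_of(row_id, file_starts=FILE_FIRST_ROW_IDS):
    """Return the first_row_id of the file whose row range contains row_id."""
    idx = bisect_right(file_starts, row_id) - 1
    if idx < 0:
        raise ValueError(f"_row_id {row_id} precedes every known file")
    return file_starts[idx]


def plan_updates(source_rows, target_rows, key):
    # Step 1: inner join on the merge condition (here: equality on `key`),
    # which attaches the target's _row_id to each matching source row.
    row_id_by_key = {t[key]: t["_row_id"] for t in target_rows}
    joined = [
        dict(s, _row_id=row_id_by_key[s[key]])
        for s in source_rows
        if s[key] in row_id_by_key  # unmatched source rows are dropped
    ]

    # Step 2: "shuffle" by first_row_id so that all rows of one file
    # end up in the same group (one group per writer in this sketch).
    groups = defaultdict(list)
    for row in joined:
        groups[first_row_id_of(row["_row_id"])].append(row)

    # Step 3a: sort each group by _row_id before merging with file data.
    return {
        first_row_id: sorted(rows, key=lambda r: r["_row_id"])
        for first_row_id, rows in groups.items()
    }
```

Because unmatched source rows are dropped by the inner join, the sketch also reflects why only the update branch is covered: rows without a matching _row_id never reach a writer.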

This implementation is specially designed for cases where the source table may be much smaller than the target table. Each writer is responsible for reading the original file data. Another possible approach is to perform a left outer join of the target table with the source table, rather than an inner join.

Merge Detail

New rows are merged with existing rows so that the new files stay aligned with the existing files. For example, consider the existing rows:

| _row_id | value (double) | first_row_id |
|---------|----------------|--------------|
| 1       | 12.34          | 1            |
| 2       | 0.00           | 1            |
| 3       | -7.50          | 1            |
| 4       | 100.01         | 1            |
| 5       | 3.14           | 1            |

They belong to the same file, whose row range is [1, 5].
Then an updated row arrives:

| _row_id | value (double) | first_row_id |
|---------|----------------|--------------|
| 3       | 10000.00       | 1            |

We merge the existing file with the new row and write out:

| _row_id | value (double) | first_row_id |
|---------|----------------|--------------|
| 1       | 12.34          | 1            |
| 2       | 0.00           | 1            |
| 3       | 10000.00       | 1            |
| 4       | 100.01         | 1            |
| 5       | 3.14           | 1            |
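
The merge in this example can be sketched as a single pass over two inputs sorted by _row_id. This is a minimal illustration in Python, not the writer code from the PR; the function name `merge_file_rows` and the dict-based row representation are assumptions for the example.

```python
def merge_file_rows(existing_rows, updated_rows):
    """Merge updated values into the rows of one file, keeping every row.

    existing_rows: all rows of the file, assumed sorted by _row_id.
    updated_rows:  rows carrying new values for some of those _row_ids.
    """
    updates = sorted(updated_rows, key=lambda r: r["_row_id"])
    merged = []
    i = 0
    for row in existing_rows:
        # Skip updates that sort before the current row; after the inner
        # join these would only be duplicates of an already applied update.
        while i < len(updates) and updates[i]["_row_id"] < row["_row_id"]:
            i += 1
        if i < len(updates) and updates[i]["_row_id"] == row["_row_id"]:
            # Updated row: take the new value, keep the other fields.
            merged.append({**row, "value": updates[i]["value"]})
            i += 1
        else:
            # Untouched row: copy the original value through unchanged.
            merged.append(dict(row))
    return merged
```

The output always has exactly as many rows as the existing file, which is what keeps the newly written file aligned with the original row range [1, 5].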

Tests

Please see org.apache.paimon.flink.action.DataEvolutionMergeIntoActionITCase

API and Format

This PR does not modify any existing API.

Documentation

Will be added ASAP

@steFaiz steFaiz marked this pull request as draft January 27, 2026 03:07
@steFaiz steFaiz changed the title [wip][flink] introduce a simplified MERGE INTO action on data-evolution-table for flink [flink] introduce a simplified MERGE INTO action on data-evolution-table for flink Jan 28, 2026
@steFaiz steFaiz changed the title [flink] introduce a simplified MERGE INTO action on data-evolution-table for flink [flink] introduce a simplified MERGE INTO procedure on data-evolution-table for flink Jan 28, 2026
@steFaiz steFaiz marked this pull request as ready for review January 28, 2026 07:15

@JingsongLi JingsongLi left a comment

Please add documentation in /append-table/data-evolution too.

@JingsongLi

+1

@JingsongLi

Thanks @steFaiz !

@JingsongLi JingsongLi merged commit bc6d341 into apache:master Jan 29, 2026
14 checks passed