[core][python] Add row ID existence check and column-level conflict detection#7971
Conversation
…etection Add a Rust-style row ID existence check to both Java and Python commit validation, ensuring pre-assigned firstRowId files reference base files that still exist in the current snapshot. This covers a gap where checkRowIdRangeConflicts misses conflicts when base files are fully deleted (e.g., DROP PARTITION / OVERWRITE) with no replacement. Also port Java's RowIdColumnConflictChecker to Python, enabling row×column two-dimensional conflict detection in check_row_id_from_snapshot. This allows concurrent MERGE INTO on different columns of the same rows to proceed without false conflicts. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
leaves12138
left a comment
There was a problem hiding this comment.
Thanks for the update. I found a blocking regression in the new row-id existence check.
checkRowIdExistence / check_row_id_existence currently checks every ADD delta entry with a non-null firstRowId. However, row tracking metadata is assigned before conflict detection, so newly appended files in the same commit also have firstRowId. Those files are expected to be absent from the latest snapshot, but this check treats them as stale references and rejects the commit.
This breaks an existing mixed upsert flow where one row is updated and another row is inserted in the same commit. I can reproduce it on this PR head with:
cd paimon-python
PYTHONPATH=. python -m pytest \
pypaimon/tests/table_upsert_by_key_test.py::TableUpsertByKeyBatchTest::test_mixed_update_and_append \
pypaimon/tests/table_upsert_by_key_test.py::TableUpsertByKeyStreamTest::test_mixed_update_and_append \
-qBoth tests fail with:
RuntimeError: Row ID existence conflict: file 'data-...parquet' references firstRowId=2, rowCount=1 in bucket 0, but no matching file exists in the current snapshot.
The file with firstRowId=2 is the newly inserted row, not a pre-assigned rewrite of an existing row-id range. The existence check needs to distinguish newly allocated append row-id ranges from files that intentionally reference existing row ids (for example by only checking ranges below the snapshot's nextRowId, or by carrying explicit source/rewrite semantics), and this mixed update+insert case should be covered by an end-to-end test.
Row tracking assigns firstRowId before conflict detection, so newly appended files also have firstRowId set and were falsely rejected. Add nextRowId parameter to only check files where firstRowId < nextRowId. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
leaves12138
left a comment
There was a problem hiding this comment.
Thanks for the update. The previous false positive for mixed update+append upsert is fixed now.
I verified the PR head with:
cd paimon-python
PYTHONPATH=. python -m pytest \
pypaimon/tests/table_upsert_by_key_test.py \
pypaimon/tests/table_update_test.py \
pypaimon/tests/write/conflict_detection_test.py \
-qand:
mvn -pl paimon-core -DskipITs -Dcheckstyle.skip -Drat.skip=true \
-Dspotless.check.skip=true -Dtest=ConflictDetectionTest testBoth passed. Looks good to me.
…h Java In Java, conflict detection runs before row ID assignment. Python had the opposite order, which meant delta entries already had firstRowId set during conflict checks. Reorder to match Java's flow. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
leaves12138
left a comment
There was a problem hiding this comment.
I checked the Python ordering change in the latest head as well.
Moving conflict detection before row-id assignment matches the Java commit ordering and avoids treating newly assigned append row ids as pre-existing row-id references. I also verified these paths locally:
cd paimon-python
PYTHONPATH=. python -m pytest \
pypaimon/tests/table_upsert_by_key_test.py \
pypaimon/tests/table_update_test.py \
pypaimon/tests/write/conflict_detection_test.py \
-qResult: 100 passed, 20 subtests passed.
I additionally checked concurrent mixed upsert/update scenarios manually: same row+same column still conflicts, same row+different columns still succeeds, and concurrent pure appends still succeed. Looks good to me.
…essage The check_row_id_existence error message uses singular 'conflict', while the test asserted on plural 'conflicts'. Use 'conflict' to match both existence check and range conflict messages. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The concurrent merge+compact test retries on conflict exceptions, but only caught the range conflict message. Add the new existence conflict message so the retry works for both conflict types. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add a Rust-style row ID existence check to both Java and Python commit validation, ensuring pre-assigned firstRowId files reference base files that still exist in the current snapshot. This covers a gap where checkRowIdRangeConflicts misses conflicts when base files are fully deleted (e.g., DROP PARTITION / OVERWRITE) with no replacement.
Also port Java's RowIdColumnConflictChecker to Python, enabling row×column two-dimensional conflict detection in check_row_id_from_snapshot. This allows concurrent MERGE INTO on different columns of the same rows to proceed without false conflicts.