Core: Fix the concurrent REPLACE TABLE transaction commits#16289
Core: Fix the concurrent REPLACE TABLE transaction commits#16289sshivampeta wants to merge 14 commits into
Conversation
Replace commits now validate that underlying table metadata is unchanged since the replace started after refresh(); otherwise commit fails with CommitFailedException rather than overwriting concurrent updates (apacheGH-16232). Includes UpdateRequirements alignment, CatalogTests and Spark concurrent replace expectations, new TestReplaceTableSafety coverage, and hasMessageContaining("replace transaction") assertions for Checkstyle.
Fix the Test cases to align with the replace transaction concurrency. authored-by : shivakumar shivampeta
…tion-concurrency Core: Fail concurrent replace transaction commits Merge branch 'main' into core/fix-16232-replace-transaction-concurrency Keep the replace transaction concurrency fix up to date with main
Fix the assertion for the failing build
| // Replace transactions must not silently overwrite concurrent commits. If the table | ||
| // metadata has changed since the transaction started, fail instead of rebasing and | ||
| // merging staged updates. | ||
| if (base != null && underlyingOps.current() != base) { |
There was a problem hiding this comment.
Since the base assignment was removed, this would cause the base to not be updated and keep retrying. applyUpdates has a better approach here with PendingUpdateFailedException to break the retry loop.
There was a problem hiding this comment.
Good catch. Since onlyRetryOn(CommitFailedException.class) retries on CommitFailedException, throwing it directly inside the lambda would cause the retry loop to keep hitting the same concurrent-modification check (since base is never updated) until retries are exhausted — wasteful.
Fixed: now wrapping the CommitFailedException in PendingUpdateFailedException (the same pattern used by applyUpdates in commitSimpleTransaction) to immediately break the retry loop. Added a matching catch (PendingUpdateFailedException e) block in commitReplaceTransaction that cleans up and re-throws the unwrapped CommitFailedException
| Builder builder = new Builder(base, true); | ||
| // use the same optimistic concurrency checks as ordinary commits; replace commits must not | ||
| // silently drop concurrent schema, partition spec, sort order, or ref changes. | ||
| Builder builder = new Builder(base, false); |
There was a problem hiding this comment.
Can you add a test for this change?
There was a problem hiding this comment.
Added 5 tests in TestUpdateRequirements:
replaceTableAddSchemaProducesFieldIdRequirement — verifies forReplaceTable now emits AssertLastAssignedFieldId
replaceTableAddSchemaFailsOnConcurrentFieldIdChange — verifies field-id conflict detection fails as expected
replaceTableSetCurrentSchemaProducesSchemaIdRequirement — verifies AssertCurrentSchemaID is emitted
replaceTableAddPartitionSpecProducesPartitionIdRequirement — verifies AssertLastAssignedPartitionId is emitted
replaceTableAddSortOrderProducesDefaultSortOrderRequirement — verifies AssertDefaultSortOrderID is emitted
These confirm that forReplaceTable with Builder(base, false) now produces the same OCC requirements as forUpdateTable, rather than skipping them as it did with Builder(base, true).
| assertThat(after.schema().asStruct()).isEqualTo(SCHEMA_WITH_EXTRA_COL.asStruct()); | ||
| assertThat(after.properties()).containsEntry("k1", "v1"); | ||
| } | ||
| } |
There was a problem hiding this comment.
Few more tests would help.
- For the
createOrReplaceTransaction (orCreate=true) - v3 row-lineage tests.
There was a problem hiding this comment.
Added 4 additional tests:
createOrReplaceVsSchemaUpdateFailsAndPreservesSchema — createOrReplaceTransaction (orCreate=true) on an existing table detects concurrent schema update and fails, preserving the concurrently-added column.
createOrReplaceNewTableSucceeds — createOrReplaceTransaction for a table that doesn't exist yet succeeds normally (happy-path / no conflict).
createOrReplaceVsPropertyWriteFailsAndPreservesProperty — createOrReplaceTransaction on an existing table detects concurrent property write and fails.
replaceV3TableVsConcurrentAppendFails — creates a v3 format table, starts a replace transaction, makes a concurrent append (which would advance next-row-id), and verifies the replace fails instead of committing stale row lineage metadata.
|
The contract of REPLACE TABLE is that the table's data, schema, and partitioning are replaced. There are no assumptions to validate, so checking for conflicts between concurrent changes and those changes is not required. If this is what you're trying to change, then I'm -1 for this commit because it is changing behavior that is already considered and chosen. It looks like this is the intent of this PR, but please reply if I'm wrong. That said, REPLACE TABLE is also intended to keep table history (it is not a drop/re-create). So there are, potentially, valid problems with concurrency. For example, if a table's schema is updated from schema-id 1 to schema-id 2 and then the REPLACE TABLE operation removes the new schema-id 2 -- as though the concurrent commit never happened -- then that's a bug that we should fix. |
…eplaceTransaction Wraps CommitFailedException in PendingUpdateFailedException so the retry loop exits immediately on concurrent modification detection, instead of wastefully retrying until exhaustion. Adds matching catch block for cleanup. Co-authored-by: shivakumar shivampeta <sshivampeta@users.noreply.github.com>
…irements Verifies that forReplaceTable now emits the same optimistic concurrency requirements as forUpdateTable for schema, partition spec, and sort order changes. Co-authored-by: shivakumar shivampeta <sshivampeta@users.noreply.github.com>
Adds coverage for createOrReplaceTransaction (orCreate=true) and v3 format tables with row lineage, as requested in review. Co-authored-by: shivakumar shivampeta <sshivampeta@users.noreply.github.com>
…-d7b4 address review comments d7b4
Thanks for the detailed feedback, @rdblue. You're right that REPLACE TABLE by contract replaces the table's data, schema, and partitioning — and that's intentional. This PR is not trying to change that contract. The concern is specifically about table history preservation. As you noted, REPLACE TABLE should keep table history (it is not a drop/re-create). The bug is that commitReplaceTransaction refreshes base to detect version conflicts but then commits the original stale current metadata wholesale — silently erasing concurrently-committed schemas, expired snapshots, and other history changes. For example: A concurrent expireSnapshots successfully removes snapshot-1, but the replace brings it back because its stale metadata still references it — this is table corruption (the snapshot files may have been physically deleted). Would this framing address your concern? Happy to discuss further or adjust the approach. |
Summary
Fixes #16232.
This change prevents REPLACE TABLE transactions from silently overwriting concurrent committed table changes. If table metadata changes after a replace transaction starts, the replace transaction now fails with CommitFailedException instead of committing stale metadata.
Problem
Before this change, replaceTransaction() could overwrite concurrent commits without conflict detection. This affected concurrent:
The root problem was that replace transactions refreshed the underlying table state but still committed the stale replacement metadata built when the transaction started. In addition, replace-table update requirements skipped several optimistic concurrency checks.
This allowed committed changes to disappear silently.
Changes
Why this fixes the issue
A replace transaction now compares the original base metadata with the latest table metadata after refresh. If another writer committed in between, the replace transaction fails before committing stale metadata.
This prevents stale replace metadata from overwriting committed schema, property, snapshot, or data changes.
For REST-style commits, UpdateRequirements.forReplaceTable now emits normal optimistic concurrency requirements, so server-side validation can also reject conflicting replace commits.
Testing
Ran:
./gradlew :iceberg-core:test --tests "org.apache.iceberg.TestReplaceTableSafety" --tests "org.apache.iceberg.TestUpdateRequirements"
./gradlew :iceberg-hive-metastore:test --tests "org.apache.iceberg.hive.TestHiveCreateReplaceTable"
./gradlew :iceberg-hive-metastore:test
./gradlew :iceberg-spark:iceberg-spark-4.1_2.13:checkstyleTest
./gradlew :iceberg-azure:integrationTest :iceberg-aws:integrationTest --continue