Skip to content

Core: Fix the concurrent REPLACE TABLE transaction commits#16289

Open
sshivampeta wants to merge 14 commits into
apache:mainfrom
sshivampeta:main
Open

Core: Fix the concurrent REPLACE TABLE transaction commits#16289
sshivampeta wants to merge 14 commits into
apache:mainfrom
sshivampeta:main

Conversation

@sshivampeta
Copy link
Copy Markdown

Summary

Fixes #16232.

This change prevents REPLACE TABLE transactions from silently overwriting concurrent committed table changes. If table metadata changes after a replace transaction starts, the replace transaction now fails with CommitFailedException instead of committing stale metadata.
Problem

Before this change, replaceTransaction() could overwrite concurrent commits without conflict detection. This affected concurrent:

schema updates
table property updates
data appends
snapshot expiration
another replace transaction

The root problem was that replace transactions refreshed the underlying table state but still committed the stale replacement metadata built when the transaction started. In addition, replace-table update requirements skipped several optimistic concurrency checks.

This allowed committed changes to disappear silently.
Changes

Updated BaseTransaction.commitReplaceTransaction to fail if the table metadata changed since the replace transaction started.
Updated UpdateRequirements.forReplaceTable to use the same optimistic concurrency checks as ordinary table updates.
Added regression coverage in TestReplaceTableSafety.
Updated catalog tests to expect concurrent replace conflicts instead of last-writer-wins behavior.
Updated Hive replace transaction tests to preserve concurrent changes and expect conflict failures.
Updated Spark encryption replace test expectations for concurrent replace behavior.

Why this fixes the issue

A replace transaction now compares the original base metadata with the latest table metadata after refresh. If another writer committed in between, the replace transaction fails before committing stale metadata.

This prevents stale replace metadata from overwriting committed schema, property, snapshot, or data changes.

For REST-style commits, UpdateRequirements.forReplaceTable now emits normal optimistic concurrency requirements, so server-side validation can also reject conflicting replace commits.
Testing

Ran:

./gradlew :iceberg-core:test --tests "org.apache.iceberg.TestReplaceTableSafety" --tests "org.apache.iceberg.TestUpdateRequirements"
./gradlew :iceberg-hive-metastore:test --tests "org.apache.iceberg.hive.TestHiveCreateReplaceTable"
./gradlew :iceberg-hive-metastore:test
./gradlew :iceberg-spark:iceberg-spark-4.1_2.13:checkstyleTest
./gradlew :iceberg-azure:integrationTest :iceberg-aws:integrationTest --continue

sshivampeta and others added 6 commits May 9, 2026 00:58
Replace commits now validate that underlying table metadata is unchanged
since the replace started after refresh(); otherwise commit fails with
CommitFailedException rather than overwriting concurrent updates (apacheGH-16232).
Includes UpdateRequirements alignment, CatalogTests and Spark concurrent
replace expectations, new TestReplaceTableSafety coverage, and
hasMessageContaining("replace transaction") assertions for Checkstyle.
Fix the Test cases to align with the replace transaction concurrency.

authored-by : shivakumar shivampeta
…tion-concurrency

Core: Fail concurrent replace transaction commits
Merge branch 'main' into core/fix-16232-replace-transaction-concurrency
Keep the replace transaction concurrency fix up to date with main
Fix the assertion for the failing build
// Replace transactions must not silently overwrite concurrent commits. If the table
// metadata has changed since the transaction started, fail instead of rebasing and
// merging staged updates.
if (base != null && underlyingOps.current() != base) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the base assignment was removed, this would cause the base to not be updated and keep retrying. applyUpdates has a better approach here with PendingUpdateFailedException to break the retry loop.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Since onlyRetryOn(CommitFailedException.class) retries on CommitFailedException, throwing it directly inside the lambda would cause the retry loop to keep hitting the same concurrent-modification check (since base is never updated) until retries are exhausted — wasteful.

Fixed: now wrapping the CommitFailedException in PendingUpdateFailedException (the same pattern used by applyUpdates in commitSimpleTransaction) to immediately break the retry loop. Added a matching catch (PendingUpdateFailedException e) block in commitReplaceTransaction that cleans up and re-throws the unwrapped CommitFailedException

Builder builder = new Builder(base, true);
// use the same optimistic concurrency checks as ordinary commits; replace commits must not
// silently drop concurrent schema, partition spec, sort order, or ref changes.
Builder builder = new Builder(base, false);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a test for this change?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added 5 tests in TestUpdateRequirements:

replaceTableAddSchemaProducesFieldIdRequirement — verifies forReplaceTable now emits AssertLastAssignedFieldId
replaceTableAddSchemaFailsOnConcurrentFieldIdChange — verifies field-id conflict detection fails as expected
replaceTableSetCurrentSchemaProducesSchemaIdRequirement — verifies AssertCurrentSchemaID is emitted
replaceTableAddPartitionSpecProducesPartitionIdRequirement — verifies AssertLastAssignedPartitionId is emitted
replaceTableAddSortOrderProducesDefaultSortOrderRequirement — verifies AssertDefaultSortOrderID is emitted
These confirm that forReplaceTable with Builder(base, false) now produces the same OCC requirements as forUpdateTable, rather than skipping them as it did with Builder(base, true).

assertThat(after.schema().asStruct()).isEqualTo(SCHEMA_WITH_EXTRA_COL.asStruct());
assertThat(after.properties()).containsEntry("k1", "v1");
}
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few more tests would help.

  1. For the createOrReplaceTransaction (orCreate=true)
  2. v3 row-lineage tests.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added 4 additional tests:

createOrReplaceVsSchemaUpdateFailsAndPreservesSchema — createOrReplaceTransaction (orCreate=true) on an existing table detects concurrent schema update and fails, preserving the concurrently-added column.
createOrReplaceNewTableSucceeds — createOrReplaceTransaction for a table that doesn't exist yet succeeds normally (happy-path / no conflict).
createOrReplaceVsPropertyWriteFailsAndPreservesProperty — createOrReplaceTransaction on an existing table detects concurrent property write and fails.
replaceV3TableVsConcurrentAppendFails — creates a v3 format table, starts a replace transaction, makes a concurrent append (which would advance next-row-id), and verifies the replace fails instead of committing stale row lineage metadata.

@rdblue
Copy link
Copy Markdown
Contributor

rdblue commented May 15, 2026

The contract of REPLACE TABLE is that the table's data, schema, and partitioning are replaced. There are no assumptions to validate, so checking for conflicts between concurrent changes and those changes is not required. If this is what you're trying to change, then I'm -1 for this commit because it is changing behavior that is already considered and chosen. It looks like this is the intent of this PR, but please reply if I'm wrong.

That said, REPLACE TABLE is also intended to keep table history (it is not a drop/re-create). So there are, potentially, valid problems with concurrency. For example, if a table's schema is updated from schema-id 1 to schema-id 2 and then the REPLACE TABLE operation removes the new schema-id 2 -- as though the concurrent commit never happened -- then that's a bug that we should fix.

sshivampeta and others added 5 commits May 20, 2026 17:55
…eplaceTransaction

Wraps CommitFailedException in PendingUpdateFailedException so the
retry loop exits immediately on concurrent modification detection,
instead of wastefully retrying until exhaustion. Adds matching catch
block for cleanup.

Co-authored-by: shivakumar shivampeta <sshivampeta@users.noreply.github.com>
…irements

Verifies that forReplaceTable now emits the same optimistic concurrency
requirements as forUpdateTable for schema, partition spec, and sort order
changes.

Co-authored-by: shivakumar shivampeta <sshivampeta@users.noreply.github.com>
Adds coverage for createOrReplaceTransaction (orCreate=true) and v3
format tables with row lineage, as requested in review.

Co-authored-by: shivakumar shivampeta <sshivampeta@users.noreply.github.com>
@sshivampeta
Copy link
Copy Markdown
Author

The contract of REPLACE TABLE is that the table's data, schema, and partitioning are replaced. There are no assumptions to validate, so checking for conflicts between concurrent changes and those changes is not required. If this is what you're trying to change, then I'm -1 for this commit because it is changing behavior that is already considered and chosen. It looks like this is the intent of this PR, but please reply if I'm wrong.

That said, REPLACE TABLE is also intended to keep table history (it is not a drop/re-create). So there are, potentially, valid problems with concurrency. For example, if a table's schema is updated from schema-id 1 to schema-id 2 and then the REPLACE TABLE operation removes the new schema-id 2 -- as though the concurrent commit never happened -- then that's a bug that we should fix.

Thanks for the detailed feedback, @rdblue. You're right that REPLACE TABLE by contract replaces the table's data, schema, and partitioning — and that's intentional. This PR is not trying to change that contract.

The concern is specifically about table history preservation. As you noted, REPLACE TABLE should keep table history (it is not a drop/re-create). The bug is that commitReplaceTransaction refreshes base to detect version conflicts but then commits the original stale current metadata wholesale — silently erasing concurrently-committed schemas, expired snapshots, and other history changes. For example:

A concurrent expireSnapshots successfully removes snapshot-1, but the replace brings it back because its stale metadata still references it — this is table corruption (the snapshot files may have been physically deleted).
A concurrent updateSchema adds schema-id 2, but the replace drops it — as you described, "as though the concurrent commit never happened."
The current approach (fail-fast on any concurrent metadata change) is the simplest safe starting point to prevent this class of bugs. It's conservative — it will reject even non-conflicting concurrent changes — but it ensures no committed work is silently lost. A more refined approach that allows non-conflicting changes to survive (e.g., rebasing replace metadata on top of the refreshed base) could follow as a separate PR once the immediate safety issue is addressed.

Would this framing address your concern? Happy to discuss further or adjust the approach.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Core: REPLACE TABLE transaction silently overwrites concurrent committed changes

4 participants