
[WIP] Replace transactions rebase onto refreshed metadata #15904

Draft
smaheshwar-pltr wants to merge 2 commits into apache:main from smaheshwar-pltr:sm/replace-rebase-v2

@smaheshwar-pltr smaheshwar-pltr commented Apr 7, 2026

Supersedes #15092.

Motivation

There are a few issues related to table replaces. BaseTransaction.commitReplaceTransaction() does not re-apply replacement and transaction updates onto refreshed metadata. When concurrent changes occur, the transaction therefore commits stale metadata.

When a REPLACE transaction commits after concurrent changes (appends, snapshot expiration, other replaces), it overwrites those changes with stale metadata. This can lead to snapshot history loss, and concurrent snapshot expiration can even cause table corruption. (#15090)

V3 tables require that snapshot.first-row-id >= table.next-row-id when adding a snapshot. The snapshot's first-row-id is set from base.nextRowId() when the snapshot is produced.

With REST catalogs, updates are sent to the server, which generally applies them to its own current metadata. If a concurrent commit has advanced the server's next-row-id, the snapshot's first-row-id (based on stale metadata) will be behind:

Cannot add a snapshot, first-row-id is behind table next-row-id: 100 < 150

This is returned as CommitFailedException so the client can retry, but commitReplaceTransaction retries with the same stale current: the snapshot still has the old first-row-id, so it fails every time. Therefore, I believe that in V3, any concurrent snapshot change (append, compaction, other replace) causes the replace to fail entirely. (#15905)
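
To make the retry failure concrete, here is a minimal toy model of it. It is not Iceberg code: `RowIdRetrySketch`, `Table`, `addSnapshot` and `commitWithRetries` are hypothetical names, and the row-id bookkeeping is simplified to a single counter. It only illustrates that retrying with a first-row-id fixed from stale metadata can never pass the check, while re-deriving it from the refreshed value can.

```java
import java.util.function.LongUnaryOperator;

// Toy model only; every name below is hypothetical and not part of Iceberg.
final class RowIdRetrySketch {
  /** Server-side table state reduced to the row-id counter. */
  static final class Table {
    long nextRowId;

    Table(long nextRowId) {
      this.nextRowId = nextRowId;
    }

    /** Rejects a snapshot whose first-row-id is behind the table's next-row-id. */
    void addSnapshot(long firstRowId, long addedRows) {
      if (firstRowId < nextRowId) {
        throw new IllegalStateException(
            "Cannot add a snapshot, first-row-id is behind table next-row-id: "
                + firstRowId + " < " + nextRowId);
      }
      // Simplified bookkeeping: advance the counter by the rows added.
      nextRowId += addedRows;
    }
  }

  /**
   * Retries a commit up to {@code attempts} times. {@code firstRowIdFor} maps the
   * table's current next-row-id to the first-row-id the client sends.
   */
  static boolean commitWithRetries(
      Table table, LongUnaryOperator firstRowIdFor, long addedRows, int attempts) {
    for (int i = 0; i < attempts; i++) {
      try {
        table.addSnapshot(firstRowIdFor.applyAsLong(table.nextRowId), addedRows);
        return true;
      } catch (IllegalStateException e) {
        // Treated as a retryable commit failure in this sketch.
      }
    }
    return false;
  }
}
```

A client that always sends `100` (the stale value) against a table at `150` fails on every attempt; a client that re-reads the counter before each attempt succeeds on the first.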

Less severe, but this also causes behaviour differences between REST and non-REST catalogs for concurrent replaces. E.g. for REST catalogs, properties are sent as a SetProperties delta and the server generally merges them via putAll, so concurrent property additions that have succeeded survive a concurrent table replace. For non-REST catalogs they do not: the full TableMetadata object is committed directly, so the stale current overwrites all concurrent property changes.
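
The difference can be sketched with plain maps. This is an illustration under simplifying assumptions, not catalog code: `PropertyMergeSketch`, `applyAsDelta` and `commitWholesale` are hypothetical names standing in for the REST-style delta merge and the non-REST full-metadata commit respectively.

```java
import java.util.HashMap;
import java.util.Map;

// Illustration only; the class and method names are hypothetical.
final class PropertyMergeSketch {
  /** REST-style: the replace's properties are merged onto the server's current properties. */
  static Map<String, String> applyAsDelta(
      Map<String, String> serverCurrent, Map<String, String> replaceProps) {
    Map<String, String> merged = new HashMap<>(serverCurrent);
    merged.putAll(replaceProps);
    return merged;
  }

  /** Non-REST-style: properties built from the stale base are committed as a whole. */
  static Map<String, String> commitWholesale(
      Map<String, String> staleBase, Map<String, String> replaceProps) {
    Map<String, String> committed = new HashMap<>(staleBase);
    committed.putAll(replaceProps);
    // Anything added concurrently to the server's copy is absent from this map.
    return committed;
  }
}
```

With a property added concurrently on the server side, the first method keeps it and the second drops it, because the second never sees the server's current map.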

This PR

This PR makes replace (and createOrReplace) transactions rebase their changes onto refreshed table metadata, using the same applyUpdates mechanism that commitSimpleTransaction already uses.

The start metadata (the initial buildReplacement result) is stored on BaseTransaction to allow the replacement to be rebuilt.
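
As a rough sketch of the rebase idea (not the code in this PR), the loop below rebuilds the replacement from freshly read state on every attempt and then re-applies the transaction's pending updates before a compare-and-swap commit. Metadata is modelled as a plain list of labels, and `ReplaceRebaseSketch`, `append` and `commitReplace` are hypothetical names; the real change works on TableMetadata through buildReplacement and applyUpdates.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.UnaryOperator;

// Sketch only; names are hypothetical and metadata is reduced to a list of labels.
final class ReplaceRebaseSketch {
  /** Returns a new immutable history with one more entry. */
  static List<String> append(List<String> history, String entry) {
    List<String> next = new ArrayList<>(history);
    next.add(entry);
    return List.copyOf(next);
  }

  /**
   * On each attempt: read the current state, rebuild the replacement on top of it,
   * re-apply the pending updates, then try to swap the result in atomically.
   */
  static List<String> commitReplace(
      AtomicReference<List<String>> catalog,
      UnaryOperator<List<String>> buildReplacement,
      List<UnaryOperator<List<String>>> pendingUpdates,
      int maxAttempts) {
    for (int attempt = 0; attempt < maxAttempts; attempt++) {
      List<String> refreshed = catalog.get();
      List<String> candidate = buildReplacement.apply(refreshed);
      for (UnaryOperator<List<String>> update : pendingUpdates) {
        candidate = update.apply(candidate);
      }
      // Succeeds only if nobody committed between the read and this swap.
      if (catalog.compareAndSet(refreshed, candidate)) {
        return candidate;
      }
    }
    throw new IllegalStateException("Commit failed after " + maxAttempts + " attempts");
  }
}
```

Because the candidate is always derived from the state read in the same attempt, a change committed concurrently before that read is carried forward instead of being overwritten.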

Also: in RESTTableOperations, the replaceBase field previously used to generate requirements is removed. Requirements are now generated from base and kept in sync via applyUpdates.

Noting:

  • With the current PR, schema field IDs may be re-derived on rebase as the metadata is rebuilt. That could then lead to old files referencing old IDs added during the transaction (I think, need to think about this...).
  • A lot of existing tests (Hive in particular) asserting the prior behaviour of concurrent update loss seem to fail now. That does raise the question of whether this change in behaviour is acceptable.

@github-actions github-actions bot added the core label Apr 7, 2026
  private TableMetadata startingMetadataFor(TableMetadata refreshed) {
    return switch (type) {
      case REPLACE_TABLE, CREATE_OR_REPLACE_TABLE ->
          refreshed.buildReplacement(

@smaheshwar-pltr smaheshwar-pltr Apr 7, 2026

Note: New field IDs will be assigned here, need to think about this
