@agnes-xinyi-lu agnes-xinyi-lu commented Jan 23, 2026

When multiple processes concurrently commit to different branches of the
same table through the REST catalog, sequence number validation failures
in TableMetadata.addSnapshot() were throwing non-retryable ValidationException
instead of retryable CommitFailedException.
This fix catches the sequence number validation error in CatalogHandlers.commit()
and wraps it in ValidationFailureException(CommitFailedException) to:
- Skip server-side retry (which won't help since sequence number is in the request)
- Return CommitFailedException to the client so it can retry with refreshed metadata
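
For illustration, the handling described above can be sketched roughly as follows. This is a standalone approximation, not the actual diff: the three exception classes are local stand-ins for the Iceberg types of the same names, the `applyUpdates` and `commit` helpers are hypothetical, and the message fragment used for matching is an assumption about the text produced by TableMetadata.addSnapshot().

```java
public class SequenceConflictSketch {
  // Local stand-ins for the Iceberg exception types named in this PR (not the real classes).
  public static class ValidationException extends RuntimeException {
    public ValidationException(String message) {
      super(message);
    }
  }

  public static class CommitFailedException extends RuntimeException {
    public CommitFailedException(Throwable cause, String message) {
      super(message, cause);
    }
  }

  // Wrapper that a server-side retry loop would not retry on; it carries the
  // exception that should ultimately reach the client.
  public static class ValidationFailureException extends RuntimeException {
    private final CommitFailedException wrapped;

    public ValidationFailureException(CommitFailedException wrapped) {
      super(wrapped);
      this.wrapped = wrapped;
    }

    public CommitFailedException wrapped() {
      return wrapped;
    }
  }

  // Assumed fragment of the sequence number validation message.
  static final String SEQUENCE_NUMBER_MARKER = "older than last sequence number";

  // Hypothetical helper: applies the request's updates and reclassifies only
  // the sequence number conflict; other validation errors pass through unchanged.
  public static void applyUpdates(Runnable applyAll) {
    try {
      applyAll.run();
    } catch (ValidationException e) {
      String message = e.getMessage();
      if (message != null && message.contains(SEQUENCE_NUMBER_MARKER)) {
        throw new ValidationFailureException(new CommitFailedException(e, message));
      }
      throw e;
    }
  }

  // Hypothetical outer commit: unwraps the failure so the caller sees the
  // retryable CommitFailedException.
  public static void commit(Runnable applyAll) {
    try {
      applyUpdates(applyAll);
    } catch (ValidationFailureException e) {
      throw e.wrapped();
    }
  }

  public static void main(String[] args) {
    try {
      commit(() -> {
        throw new ValidationException(
            "Cannot add snapshot with sequence number 5 older than last sequence number 6");
      });
    } catch (CommitFailedException e) {
      System.out.println("Client may refresh metadata and retry: " + e.getMessage());
    }
  }
}
```

In this sketch the wrapper exists only to bypass the retry loop. Unrelated ValidationException cases stay non-retryable because they are rethrown as-is.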

Issue #15001

request.updates().forEach(update -> update.applyTo(metadataBuilder));
} catch (ValidationException e) {
// Sequence number conflicts from concurrent commits are retryable by the client,
// but server-side retry won't help since the sequence number is in the request.

@singhpk234 singhpk234 Jan 26, 2026

This is an interesting point! Since the snapshot object is created on the client and sent to the server, the sequence number is locked in and the server can't do much, so failing fast seems reasonable.

I wonder if we can refactor / introduce some other mechanism rather than relying on exception message text.

Contributor Author

Thanks for the review, @singhpk234!
Checking the exception message is not an uncommon pattern within the Iceberg repo; it helps target particular scenarios that are thrown as a more generic exception type. Refactoring the exception itself would require a TableMetadata change, which increases risk.
I'm trying to minimize the change needed to fix this issue, as per my understanding of the comment on the issue. My original idea was to add an UpdateRequirement to the spec for this assertion.
Any thoughts?
