Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Spark: Surface better error message during streaming planning when checkpoint snapshot not found #6480

Merged

Conversation

amogh-jahagirdar
Copy link
Contributor

@amogh-jahagirdar amogh-jahagirdar commented Dec 22, 2022

Fixes #6388.

Based on the stack trace the following sequence of events seems plausible.

1.) The snapshot ID for current offset no longer exists (my hunch is due to expiration). So table.snapshot(currentOffset.snapshotId()) returns null.

2.) Then planning throws an unclear NPE here when trying to get the operation associated with the snapshot.

This is possible when resuming reading from a checkpoint where the snapshot associated with the checkpoint was expired.

@github-actions github-actions bot added the spark label Dec 22, 2022
Copy link

@PraveenNanda124 PraveenNanda124 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good

@amogh-jahagirdar
Copy link
Contributor Author

amogh-jahagirdar commented Apr 17, 2023

Seems like surfacing a better error message is indeed the right fix for this situation based on a new issue which came up: #7340

I've updated with a test cc @aokolnychyi @RussellSpitzer @szehon-ho @singhpk234 @jackye1995


if (snapshot == null) {
throw new IllegalStateException(
String.format("Failed to find expected snapshot %d", currentOffset.snapshotId()));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can probably improve this message. "Cannot load current offset at snapshot x, the snapshot was expired or removed"?
I'm trying to figure out how we can give the user a bit more information about what to do in this situation. Our currentOffset should always have a valid snapshot ID so I think it's probably ok to just say that that the snapshot was actually expired if we can't find it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, it'll be helpful to provide the context that expiration or some external mechanism somehow removed the snapshot, will update

Copy link
Member

@RussellSpitzer RussellSpitzer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, although I have a suggestion around the error message.

Copy link
Contributor

@jackye1995 jackye1995 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good to me, +1 for the comment from Russell.

@amogh-jahagirdar amogh-jahagirdar changed the title Spark: Fail streaming planning when snapshot not found Spark: Surface better error during streaming planning when checkpoint snapshot not found Apr 19, 2023
@amogh-jahagirdar amogh-jahagirdar changed the title Spark: Surface better error during streaming planning when checkpoint snapshot not found Spark: Surface better error message during streaming planning when checkpoint snapshot not found Apr 19, 2023
@jackye1995
Copy link
Contributor

Looks like the message is updated, thanks Amogh!

@jackye1995 jackye1995 merged commit 456c926 into apache:master Apr 19, 2023
21 checks passed
aokolnychyi pushed a commit to aokolnychyi/iceberg that referenced this pull request Apr 20, 2023
…n checkpoint snapshot not found (apache#6480)

(cherry picked from commit 456c926)
aokolnychyi added a commit that referenced this pull request Apr 20, 2023
…n checkpoint snapshot not found (#7381)

This change backports PR #6480 to 3.3.

Co-authored-by: Amogh Jahagirdar <jahamogh@amazon.com>
amogh-jahagirdar added a commit to amogh-jahagirdar/iceberg that referenced this pull request Apr 25, 2023
amogh-jahagirdar added a commit to amogh-jahagirdar/iceberg that referenced this pull request Apr 25, 2023
amogh-jahagirdar added a commit to amogh-jahagirdar/iceberg that referenced this pull request Apr 25, 2023
Fokko pushed a commit that referenced this pull request Apr 25, 2023
coufon pushed a commit to coufon/iceberg that referenced this pull request Apr 25, 2023
manisin pushed a commit to Snowflake-Labs/iceberg that referenced this pull request May 9, 2023
manisin pushed a commit to Snowflake-Labs/iceberg that referenced this pull request May 9, 2023
…n checkpoint snapshot not found (apache#7381)

This change backports PR apache#6480 to 3.3.

Co-authored-by: Amogh Jahagirdar <jahamogh@amazon.com>
manisin pushed a commit to Snowflake-Labs/iceberg that referenced this pull request May 9, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
7 participants