Correct schema behavior #247
Conversation
When we alter the schema, we want to use the latest schema by default, except when you select a specific snapshot that has a schema-id.
```python
snapshot_schema = self.table.schemas()[snapshot.schema_id]
current_schema = self.table.schema()
if self.snapshot_id is not None:
    snapshot = self.table.snapshot_by_id(self.snapshot_id)
```
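For illustration, a minimal sketch of the lookup order the description implies (the function name and the `scan` argument are made up for this example; the table and snapshot accessors are the ones visible in the diff above):

```python
def projected_schema(scan) -> "Schema":  # illustrative helper, not the PR code
    # Default: the table's current (latest) schema.
    schema = scan.table.schema()
    if scan.snapshot_id is not None:
        snapshot = scan.table.snapshot_by_id(scan.snapshot_id)
        if snapshot is not None and snapshot.schema_id is not None:
            # Time travel: use the schema that was current when this
            # snapshot was written.
            schema = scan.table.schemas()[snapshot.schema_id]
    return schema
```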
I think it would be an invalid state if `snapshot` is None but a `snapshot_id` is set, should we throw?
Maybe we could consider a `schema_for(snapshot_id)` API similar to https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/util/SnapshotUtil.java#L368.
I think there's a difference between the Java implementation and the Python implementation in the case where there is a schema ID on the snapshot but, for whatever reason, the schema with that ID cannot be found. In the `schemaFor` Java API implementation we throw, but here we fall back to the latest schema. I think we should probably throw rather than assume the latest in that case, because it implies there is some bad metadata, and it's safer to fail than to coerce to the latest schema. I think the latest schema should only be used when there is no schema ID on the snapshot, and in the original case where there is no `snapshot_id` set. What do you think?
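For illustration, a `schema_for`-style helper that fails on bad metadata instead of falling back might look roughly like this (the name, error messages, and `.get()` lookup are assumptions; only `snapshot_by_id`, `schemas()`, and `schema()` come from the code above):

```python
def schema_for(table, snapshot_id: int) -> "Schema":  # hypothetical API
    snapshot = table.snapshot_by_id(snapshot_id)
    if snapshot is None:
        raise ValueError(f"Cannot find snapshot with ID {snapshot_id}")
    if snapshot.schema_id is None:
        # Older metadata may not record a schema ID; use the current schema.
        return table.schema()
    schema = table.schemas().get(snapshot.schema_id)
    if schema is None:
        # Like Java's SnapshotUtil.schemaFor: missing schema metadata is an
        # error rather than a silent fallback to the latest schema.
        raise ValueError(f"Cannot find schema with ID {snapshot.schema_id}")
    return schema
```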
Great catch @amogh-jahagirdar. I'm not super strong on this one. Typically, I would not fail in these situations, but I agree that raising a warning might be appropriate here.
I know there are thoughts of pruning old schemas, which might lead to this situation, but I wouldn't expect this to happen regularly.
I've updated the code with a warning, let me know what you think!
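Roughly, the warning-based fallback could look like this sketch (illustrative only, not the exact diff):

```python
import warnings

def snapshot_schema(table, snapshot) -> "Schema":  # illustrative sketch
    if snapshot.schema_id is not None:
        schema = table.schemas().get(snapshot.schema_id)
        if schema is not None:
            return schema
        # The snapshot references a schema ID that is no longer in the
        # metadata (e.g. after pruning old schemas): warn and fall back.
        warnings.warn(
            f"Schema with ID {snapshot.schema_id} is missing from the table "
            "metadata, falling back to the current schema"
        )
    return table.schema()
```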
I think the warning makes sense for the missing schema ID case, but what about the case where the `snapshot_id` is set but cannot be found (if line 948 returns None)? I think the only option there would be to throw, because that means there was some established `snapshot_id` but we can't find it anymore.
Oof, that's a good one. I think we should check if the snapshot-id is valid earlier in the process. I've added a check now, but I'll follow up with another PR to make this more strict.
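As a sketch, the up-front check could be as simple as the following (the function name is illustrative):

```python
from typing import Optional

def validate_snapshot_id(table, snapshot_id: Optional[int]) -> None:  # illustrative
    # Fail fast on a dangling snapshot ID instead of discovering it mid-scan.
    if snapshot_id is not None and table.snapshot_by_id(snapshot_id) is None:
        raise ValueError(f"Snapshot not found: {snapshot_id}")
```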
Sweet, this looks great to me now, thanks @Fokko !
* Correct schema behavior: when we alter the schema, we want to use the latest schema by default, except when you select a specific snapshot that has a schema-id.
* Add a warning if the schema-id is missing from the metadata
* Catch non-existent snapshots