Correct schema behavior #247

Merged: 3 commits merged into apache:main on Jan 5, 2024

Conversation

Fokko (Contributor) commented Jan 1, 2024

When we alter the schema, we want to use the latest schema by default, except when you select a specific snapshot that has a schema-id.

snapshot_schema = self.table.schemas()[snapshot.schema_id]
current_schema = self.table.schema()
if self.snapshot_id is not None:
    snapshot = self.table.snapshot_by_id(self.snapshot_id)
Contributor

I think it would be an invalid state if snapshot is None but a snapshot_id is set; should we throw?

Maybe we could consider a schema_for(snapshot_id) API similar to https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/util/SnapshotUtil.java#L368.

I think there's a difference between the Java and Python implementations in the case where the snapshot has a schema ID but, for whatever reason, the schema with that ID cannot be found. In the Java schemaFor implementation we throw, but here we fall back to the latest. I think we should probably throw rather than assume the latest in that case, because it implies there is some bad metadata, and it's safer to fail than to coerce to the latest schema. I think latest should only be used when there is no schema ID on the snapshot, or, as in the original case, when no snapshot_id is set. What do you think?
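
For illustration, the stricter resolution order proposed in this comment could be sketched roughly as below. This is only a sketch under stated assumptions: `Snapshot`, `TableMetadata`, and `schema_for` are hypothetical stand-ins written for this example, not the pyiceberg or Java API, and plain strings stand in for real schema objects.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Snapshot:
    # Hypothetical stand-in for a table snapshot.
    snapshot_id: int
    schema_id: Optional[int] = None


@dataclass
class TableMetadata:
    # Hypothetical stand-in for table metadata; strings stand in for schemas.
    current_schema_id: int
    schemas: Dict[int, str]
    snapshots: Dict[int, Snapshot] = field(default_factory=dict)


def schema_for(metadata: TableMetadata, snapshot_id: Optional[int] = None) -> str:
    """Resolve the schema to read with, failing on inconsistent metadata."""
    latest = metadata.schemas[metadata.current_schema_id]

    # No snapshot selected: use the latest schema.
    if snapshot_id is None:
        return latest

    # A snapshot id was given but cannot be found: invalid state, so fail.
    snapshot = metadata.snapshots.get(snapshot_id)
    if snapshot is None:
        raise ValueError(f"Snapshot not found: {snapshot_id}")

    # The snapshot carries no schema id: the latest schema is the only option.
    if snapshot.schema_id is None:
        return latest

    # The snapshot points at a schema that is missing: bad metadata, so fail.
    if snapshot.schema_id not in metadata.schemas:
        raise ValueError(f"Schema not found: {snapshot.schema_id}")

    return metadata.schemas[snapshot.schema_id]
```

In this sketch the latest schema is returned in only two cases: no snapshot_id is set, or the snapshot has no schema ID. Every other mismatch raises.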

Contributor Author

Great catch, @amogh-jahagirdar. I don't feel strongly about this one. Typically, I would not fail in these situations, but I agree that raising a warning might be appropriate here.

I know there are thoughts of pruning old schemas, which might lead to this situation, but I would not expect this to happen regularly.

Contributor Author

I've updated the code with a warning, let me know what you think!
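
A warning-based fallback of the kind described here might look roughly like the following. This is a minimal sketch, not the code from this PR: `resolve_schema` and its parameters are hypothetical names, and strings stand in for schema objects.

```python
import warnings
from typing import Dict, Optional


def resolve_schema(
    schemas: Dict[int, str],
    current_schema_id: int,
    snapshot_schema_id: Optional[int],
) -> str:
    """Prefer the snapshot's schema; warn and use the latest if it is missing."""
    latest = schemas[current_schema_id]

    # The snapshot has no schema id: use the latest schema silently.
    if snapshot_schema_id is None:
        return latest

    # The schema id is not in the metadata: warn, then fall back to the latest.
    if snapshot_schema_id not in schemas:
        warnings.warn(f"Metadata does not contain schema with id: {snapshot_schema_id}")
        return latest

    return schemas[snapshot_schema_id]
```

The trade-off in this sketch is that a read still succeeds when an old schema is gone from the metadata, while the warning leaves a trace that the latest schema was substituted.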

Contributor

I think the warning makes sense for the missing schema ID case but what about the case where the snapshot_id is set but cannot be found (if line 948 returns None)? I think the only option there would be to throw because that means there was some established snapshot_id but we can't find it anymore.

Contributor Author

Oof, that's a good one. I think we should check if the snapshot-id is valid earlier in the process. I've added a check now, but I'll follow up with another PR to make this more strict.
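
One way to validate the snapshot-id earlier, as suggested here, is to reject unknown ids when the scan object is created. The sketch below assumes a hypothetical `Scan` class and a plain set of known snapshot ids; it is not the check added in this PR or the planned follow-up.

```python
from typing import Optional, Set


class Scan:
    """Hypothetical scan object that rejects unknown snapshot ids up front."""

    def __init__(self, known_snapshot_ids: Set[int], snapshot_id: Optional[int] = None) -> None:
        # Fail at construction time so later steps can rely on a valid id.
        if snapshot_id is not None and snapshot_id not in known_snapshot_ids:
            raise ValueError(f"Snapshot not found: {snapshot_id}")
        self.snapshot_id = snapshot_id
```

With a check like this in place, code that resolves the schema later would not need to handle a snapshot_id that points at nothing.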

Contributor

amogh-jahagirdar left a comment

Sweet, this looks great to me now, thanks @Fokko!

Fokko merged commit dba1ef8 into apache:main on Jan 5, 2024
sungwy pushed a commit to sungwy/iceberg-python that referenced this pull request Jan 13, 2024
* Correct schema behavior

When we alter the schema, we want to use the latest
schema by default, except when you select a specific
snapshot that has a schema-id.

* Add warning if schema-id is missing from the metadata

* Catch unexisting snapshots

2 participants