Skip to content

Conversation

@Fokko
Copy link
Contributor

@Fokko Fokko commented May 4, 2023

I noticed that the Iceberg Java writers are setting this field on the Parquet/Avro/ORC files, but others (I can confirm that this is missing for Athena and Trino) don't do this. I think it would be great to formalize this in the spec, and also encourage others to implement this.

I noticed that the Iceberg Java writers are setting this field
on the Parquet/Avro/ORC files, but others (I can confirm that
this is missing for Athena and Trino) don't do this.

I think this is nice because we just hit an edge case:
apache#7523

Where the Arrow reader doesn't properly infer the
logical UUID type. With the schema in there, we know
which schema that was used for writing the file,
and we can use that as the source of truth.
@rdblue
Copy link
Contributor

rdblue commented May 25, 2023

This was not originally part of the spec because the Parquet/Avro/ORC schema is the source of truth. We don't want to have multiple places for the same information that could conflict with one another. That would make the spec more complicated and harder to implement. (Also, what if some lazy person writes iceberg.schema but not field IDs? We don't want to need to reconcile them.)

We wrote the Iceberg schema into file metadata primarily for debugging and informational purposes. I think we could add a recommendation to the spec to do the same, but I don't think that we should require iceberg.schema.

The problem with making it required now is not just that it would make the spec larger and complicate the source of truth for a schema. The issue is that this would not help because some writers don't currently write the property and there are existing files in tables without it. If we can't rely on it, what is the value?

@Fokko
Copy link
Contributor Author

Fokko commented May 25, 2023

Should we just remove it then? For Python, it was very easy to get things started, but I agree that it is best to have one single source of truth.

@Fokko Fokko closed this May 25, 2023
@rdblue
Copy link
Contributor

rdblue commented May 25, 2023

I think it's useful for debugging still.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants