Add external schema mappings for files written with name-based schemas #40
Comments
An incomplete implementation is available as a PR in the Netflix repository: Netflix/iceberg#80
@YuvalItzchakov are you still working on this? If not then I can take this up.
@rdblue, I spent some time today looking into it. I have a few questions.
This is for mapping external schemas into Iceberg. Basically adding Iceberg's IDs during conversion. This could be done for any format. There's code for "fallback" in Parquet that does this using the column ordinal, which is how we read old Parquet data. This particular issue is to add a mapping for Avro, which uses names to resolve columns. This is useful for data coming from an external writer that can write Avro schemas, but doesn't have an Iceberg schema.
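As an illustration (hypothetical schema; `field-id` is the per-field attribute Iceberg's Avro schemas carry), an external writer would produce the schema below without the `field-id` attributes, and the mapping step would add them:

```json
{
  "type": "record",
  "name": "external_record",
  "fields": [
    {"name": "id", "type": "long", "field-id": 1},
    {"name": "data", "type": "string", "field-id": 2}
  ]
}
```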
Iceberg names are not necessarily the names used by the writer. Avro coming from Kafka, for example, allows renaming a column. You can still read older data because the Avro schema contains an alias for it. If we only used Iceberg's current name for a column, a change like this would cause Iceberg to start ignoring records.
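Concretely, Avro records a rename with field aliases. In this sketch (hypothetical names), a column renamed from `payload` to `data` keeps the old name as an alias, so readers can still resolve the field in older files:

```json
{"name": "data", "type": "string", "aliases": ["payload"]}
```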
Thanks @rdblue for that information. Here's a rough spec of what we can do:

**Updating ….** For the new function, which uses a map to assign IDs to fields, I presume that for other types (map, list, etc.) we can use non-conflicting, increasing IDs to assign to each field in these container types, as they will not be specified in our mapping. See the sketch after this list.

**Possible alternative.** We will basically be overriding all the IDs which …
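A minimal Java sketch of the first approach, under stated assumptions: the class and method names and the flat name-to-ID map are illustrative, and the `field-id`/`element-id`/`key-id`/`value-id` property names follow what Iceberg's Avro schemas use. A real implementation would rebuild the schema instead of mutating it (Avro properties can only be set once) and would handle nested field names rather than a flat map.

```java
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;
import org.apache.avro.Schema;

// Sketch only: walks an Avro schema, assigning mapped IDs to named record
// fields and fresh, non-conflicting IDs to container types (list elements,
// map keys/values) that the external mapping does not cover.
public class ExternalIdAssigner {

  public static void assignIds(Schema schema, Map<String, Integer> nameToId, AtomicInteger nextId) {
    switch (schema.getType()) {
      case RECORD:
        for (Schema.Field field : schema.getFields()) {
          Integer mapped = nameToId.get(field.name());
          // Use the mapped ID when the external mapping names this field;
          // otherwise fall back to a fresh, non-conflicting ID.
          int id = mapped != null ? mapped : nextId.getAndIncrement();
          field.addProp("field-id", id);
          assignIds(field.schema(), nameToId, nextId);
        }
        break;
      case ARRAY:
        // List elements are never named in the mapping; assign a fresh ID.
        schema.addProp("element-id", nextId.getAndIncrement());
        assignIds(schema.getElementType(), nameToId, nextId);
        break;
      case MAP:
        // Same for map keys and values.
        schema.addProp("key-id", nextId.getAndIncrement());
        schema.addProp("value-id", nextId.getAndIncrement());
        assignIds(schema.getValueType(), nameToId, nextId);
        break;
      case UNION:
        for (Schema option : schema.getTypes()) {
          assignIds(option, nameToId, nextId);
        }
        break;
      default:
        // Primitive types carry no IDs of their own.
        break;
    }
  }
}
```

Here `nextId` would be seeded above the largest ID in the mapping so the fresh IDs never conflict with the mapped ones.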
Hey @rdblue, do you have any feedback regarding either of the above approaches?
@rdsr, sorry for the delay. I was thinking about a solution more like the alternative you proposed, but one that works using just Avro schemas, so there is no need to convert from Iceberg to Avro. Iceberg already has field IDs; the question is how to match those up with the Avro schema in a data file. We also don't want to change the schema from the file too much, because it is required to correctly read the data. So converting to Iceberg, then back to Avro, is much riskier than transforming Avro to Avro+ids. I like your idea to have some mapping callback, similar to ….
@rdblue, thanks for the input. I see your point: converting from Avro to Iceberg and then back to Avro again may end up losing some information. In that case, I think we cannot even use …. I'll see if I can use my second approach, but only with Avro-to-Avro transformations.
Files written by Iceberg writers contain Iceberg field IDs that are used for column projection. Iceberg doesn't currently support tracking data files that were written by other systems and added to Iceberg tables with the API because the field IDs are missing. To support files written by non-Iceberg writers, Iceberg could support a table-level mapping from a source schema to Iceberg IDs.
For example, a table with 2 columns might have an Avro schema mapping like this one, encoded as JSON in table properties:
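A minimal sketch of such a mapping, assuming two hypothetical columns `id` and `data` (with `data` previously named `payload`), where each entry carries a `field-id` and a `names` list:

```json
[
  {"field-id": 1, "names": ["id"]},
  {"field-id": 2, "names": ["data", "payload"]}
]
```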
When reading an Avro file, the read schema would be produced using the file's schema and the field IDs from the mapping. The `names` in each field mapping is a list to handle aliasing.
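Under the hypothetical mapping above, a file written before the rename would still read correctly. Its schema names the second column `payload`:

```json
{
  "type": "record",
  "name": "t",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "payload", "type": "string"}
  ]
}
```

Because `payload` appears in the `names` list for field ID 2, the read schema projects that column as Iceberg field 2 rather than ignoring it.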