
Add external schema mappings for files written with name-based schemas #40

Closed
rdblue opened this issue Dec 8, 2018 · 9 comments
Labels: good first issue (Good for newcomers)

rdblue commented Dec 8, 2018

Files written by Iceberg writers contain Iceberg field IDs that are used for column projection. Iceberg doesn't currently support tracking data files that were written by other systems and added to Iceberg tables with the API because the field IDs are missing. To support files written by non-Iceberg writers, Iceberg could support a table-level mapping from a source schema to Iceberg IDs.

For example, a table with 2 columns might have an Avro schema mapping like this one, encoded as JSON in table properties:

[ {"field-id": 1, "names": ["id"]},
  {"field-id": 2, "names": ["data"]} ]

When reading an Avro file, the read schema would be produced using the file's schema and the field IDs from the mapping. The names in each field mapping are a list in order to handle aliasing.
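For illustration, a hypothetical mapping for a table whose `id` column was at some point renamed to `record_id` would keep both names in the list, so a file schema using either name resolves to field ID 1:

```json
[ {"field-id": 1, "names": ["id", "record_id"]},
  {"field-id": 2, "names": ["data"]} ]
```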

rdblue commented Dec 8, 2018

An incomplete implementation is available as a PR in the Netflix repository: Netflix/iceberg#80

rdsr commented Dec 14, 2018

@YuvalItzchakov, are you still working on this? If not, I can take this up.

rdsr commented Feb 6, 2019

@rdblue, I spent some time today looking into it. I have a few questions:

  • Is this feature only meant for Avro datasets/files?
  • Why do we need to store a table level mapping when the table's [Iceberg] schema may already have field ids mapped to names?

rdblue commented Feb 6, 2019

This is for mapping external schemas into Iceberg, basically adding Iceberg's IDs during conversion. This could be done for any format. There's "fallback" code in Parquet that does this using the column ordinal, which is how we read old Parquet data.
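As a minimal illustration (not the actual Iceberg fallback code), ordinal-based assignment amounts to pairing each file column with its position:

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: when a file carries no field IDs, derive them from
// column position so old files line up with the table's original column order.
public class OrdinalFallback {
  public static void main(String[] args) {
    List<String> fileColumns = Arrays.asList("id", "data");
    Map<String, Integer> ids = new LinkedHashMap<>();
    for (int i = 0; i < fileColumns.size(); i++) {
      ids.put(fileColumns.get(i), i + 1); // assumes table IDs start at 1
    }
    System.out.println(ids); // {id=1, data=2}
  }
}
```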

This particular issue is to add a mapping for Avro, which uses names to resolve columns. This is useful for data coming from an external writer that can write Avro schemas, but doesn't have an Iceberg schema.

rdblue commented Feb 7, 2019

> Why do we need to store a table level mapping when the table's [Iceberg] schema may already have field ids mapped to names?

Iceberg names are not necessarily the names used by the writer. Avro data coming from Kafka, for example, allows renaming a column; you can still read older data because the Avro schema contains an alias for it. If we only used Iceberg's current name for a column, a change like this would cause Iceberg to start ignoring records.
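As a hypothetical example, a writer that renamed `id` to `event_id` could declare the old name as an Avro alias, so files written under either name remain readable:

```json
{
  "type": "record",
  "name": "Event",
  "fields": [
    {"name": "event_id", "aliases": ["id"], "type": "long"},
    {"name": "data", "type": "string"}
  ]
}
```

A name-based mapping with both names in its list would then resolve either form to the same Iceberg field ID.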

rdsr commented Feb 20, 2019

Thanks @rdblue for that information. Here's a rough spec of what we can do:

  1. Update Avro.ReadBuilder to take a Map of field names to field IDs, which is passed on to ProjectionDatumReader.
  2. In ProjectionDatumReader, if the passed-in fileSchema does not have any field IDs, use a modified SchemaToType class to assign IDs to fields with the help of the mapping.
  3. Update SchemaToType to take a function that can either allocate increasing IDs (the current default behavior) or assign IDs using the map (see the sketch after this list).
  4. Reconvert the returned Iceberg schema (SchemaToType returns Type) to Avro using AvroSchemaUtil, which can then be passed to AvroSchemaUtil#buildAvroProjection.
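A minimal sketch of the function from point 3; the names here are illustrative, not actual Iceberg API:

```java
import java.util.Map;
import java.util.function.ToIntFunction;

// Two interchangeable ID allocators: one emulates the current
// ever-increasing behavior, the other resolves IDs from the mapping.
public class IdAllocators {
  public static void main(String[] args) {
    ToIntFunction<String> fresh = new ToIntFunction<String>() {
      private int nextId = 1;

      @Override
      public int applyAsInt(String fieldName) {
        return nextId++; // ignores the name, like the current default
      }
    };

    Map<String, Integer> mapping = Map.of("id", 1, "data", 2);
    ToIntFunction<String> fromMapping = mapping::get;

    System.out.println(fresh.applyAsInt("id"));         // 1
    System.out.println(fromMapping.applyAsInt("data")); // 2
  }
}
```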

Updating SchemaToType
In order to write a default mapping function that emulates the current behavior in SchemaToType, I need to clear up some doubts about how we currently allocate IDs.

In SchemaToType, when called with a table schema or record, the top-level fields are assigned IDs starting from 0; when called with another type (map, list, etc.), IDs are assigned starting from 1. We could write a function that emulates the current behavior, or do you see scope here to simplify and always assign ever-increasing IDs, without caring whether the top-level fields of a schema get IDs starting from 0?

For the new function that uses a map to assign IDs to fields, I presume that for container types (map, list, etc.) we can assign non-conflicting increasing IDs to each of their fields, since those will not be specified in our mapping.

Possible Alternative
Another possible alternative is that we don't modify SchemaToType, but still use it to generate an Iceberg schema. We define a new method, assignIds, under TypeUtil, similar to TypeUtil#assignFreshIds(Schema, TypeUtil.NextID), which takes that [Iceberg] schema and a function (similar to NextID) that assigns IDs using the mapping. We can then reconvert the returned Iceberg schema to Avro using AvroSchemaUtil, just as in the first approach.

We would basically be overriding all the IDs that SchemaToType would have allocated. I think there's scope for reusing com.netflix.iceberg.types.AssignFreshIds and making it more generic to serve two use cases: assigning fresh IDs and assigning IDs from a map. (For non-record container types we'd still have to assign non-conflicting IDs to their fields; see the sketch below.)
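To make the non-conflicting-ID idea concrete, a sketch (illustrative, not Iceberg API): start the fresh counter above the highest mapped ID so container fields can never collide with mapped fields.

```java
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

// Fields absent from the mapping (e.g. list elements, map values) receive
// IDs above the highest mapped ID, guaranteeing no conflicts.
public class NonConflictingIds {
  public static void main(String[] args) {
    Map<String, Integer> mapping = Map.of("id", 1, "data", 2);
    int maxMapped = mapping.values().stream()
        .mapToInt(Integer::intValue).max().orElse(0);
    AtomicInteger next = new AtomicInteger(maxMapped);

    int listElementId = next.incrementAndGet(); // 3
    int mapValueId = next.incrementAndGet();    // 4
    System.out.println(listElementId + ", " + mapValueId);
  }
}
```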

rdsr commented Feb 25, 2019

Hey @rdblue, do you have any feedback regarding either of the above approaches?

rdblue commented Feb 25, 2019

@rdsr, sorry for the delay.

I was thinking about a solution more like the alternative you proposed, but one that works using just Avro schemas, so there's no need to convert from Iceberg to Avro.

Iceberg already has field IDs; the question is how to match those up with the Avro schema in a data file. We also don't want to change the schema from the file too much, because it is required to correctly read the data. So converting to Iceberg and then back to Avro is much riskier than transforming Avro to Avro+IDs.

I like your idea of having a mapping callback, similar to NextID, but one that can be implemented by something that contains the mapping. Hopefully that API could be used for both Avro and Parquet ID mapping, since we'd eventually like to be able to do the same for parquet-avro files.
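A rough sketch of the Avro-to-Avro+IDs idea, handling only flat record schemas; this is not Iceberg's actual implementation, and it assumes Avro 1.9+ and that IDs are carried in a "field-id" property on each field:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import org.apache.avro.Schema;

// Sketch: copy a flat Avro record schema, attaching an ID resolved from the
// table-level name mapping as a "field-id" property on each field. The file
// schema is otherwise preserved, so it remains safe for reading the data.
public class AddFieldIds {
  static Schema withIds(Schema record, Map<String, Integer> mapping) {
    List<Schema.Field> fields = record.getFields().stream()
        .map(f -> {
          Schema.Field copy =
              new Schema.Field(f.name(), f.schema(), f.doc(), f.defaultVal());
          Integer id = mapping.get(f.name());
          if (id != null) {
            copy.addProp("field-id", id); // assumed property name
          }
          return copy;
        })
        .collect(Collectors.toList());
    return Schema.createRecord(record.getName(), record.getDoc(),
        record.getNamespace(), record.isError(), fields);
  }
}
```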

rdsr commented Mar 4, 2019

@rdblue, thanks for the input. I see your point: converting from Avro to Iceberg and then back to Avro could end up losing some information. In that case, I think we cannot even use SchemaToType as described in the first approach.

I'll see if I can use my second approach with only Avro-to-Avro transformations.

rdblue closed this as completed in 9f1598e on Oct 22, 2019
zzhhhzz added a commit to zzhhhzz/iceberg that referenced this issue Aug 21, 2023