Python: Convert Avro to Iceberg schema #4742
Conversation
Force-pushed 8c00b5b to 113e2b2
* Change types into dataclasses

Proposal to change the types into dataclasses. This has several improvements:

- We can use the dataclass `field(repr=True)` to include the fields in the representation, instead of building our own strings
- We can assign the types in `__post_init__` when they are dynamic (List, Map, Struct, etc.), or just override them when they are static (Primitives)
- We don't have to implement any `__eq__` methods because they come for free
- The types are frozen, which is kind of nice since we re-use them
- The code is much more concise
- We can mark the min/max of the int/long/float as `Final` as of 3.8: https://peps.python.org/pep-0591/

My inspiration was the comment by Kyle: #4742 (comment). This would entail implementing `__eq__`, but why not use the generated one since we're comparing all the attributes :) Would love to get your input

* Remove explicit `__repr__` and `__eq__`
* Use `@cached_property` to cache the string; add missing words to spelling
* Add additional guard for initializing StructType using kwargs
* Replace `type` with `field_type`
Force-pushed 113e2b2 to 9d6fa80
Force-pushed 9d6fa80 to 0a1b456
Thanks! I'll take another look.
```python
    """
    # In the case of a primitive field
    if isinstance(field, str):
        return AvroSchemaConversion.PRIMITIVE_FIELD_TYPE_MAP[field]
```
I find the logic in these methods hard to follow and I think it's because the cases are not cleanly separated by method. This is a good example. This method is handling an Avro field, but this is checking if the field is not a field and is instead a primitive type. That should never happen in a schema so it raises questions about when this method is called.
This is one reason why we use the visitor pattern elsewhere. Keeping the logic to traverse a schema separate from the logic to actually do something with it is useful, but it also keeps you using a consistent and focused pattern to construct this logic: here's how to convert a record, here's how to convert a field, here's how to convert a map, etc.
Since this isn't handling an Avro Schema class, I wasn't originally going to suggest it, but I think this would be cleaner and easier to review/maintain if it were structured around Avro's schema model:
```python
def _convert_schema(schema: str | Dict[str, Any]):
    if isinstance(schema, str):
        return AvroSchemaConversion.PRIMITIVE_FIELD_TYPE_MAP[schema]
    elif isinstance(schema, dict):
        type_name = schema["type"]
        if type_name == "record":
            return _convert_record(schema)
        elif type_name == "union":
            ...
        elif type_name == "map":
            ...
        elif type_name == "array":
            ...
        else:
            logical_type = schema.get("logicalType")
            if logical_type:
                ...
            else:
                return AvroSchemaConversion.PRIMITIVE_FIELD_TYPE_MAP[type_name]
    else:
        raise ValueError(f"Cannot convert invalid schema: {schema}")

def _convert_record(schema: Dict[str, Any]):
    ...  # calls _convert_field

def _convert_field(field: Dict[str, Any]):
    ...
```
Yes, that's much cleaner indeed. The method grew a bit over time, but I really like the suggestion of decoupling the field and type. I've updated the code.
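The dispatch structure suggested above can be sketched as a self-contained, runnable toy. The `PRIMITIVE_FIELD_TYPE_MAP` contents and the tuple-based return values here are hypothetical placeholders for illustration only; the real implementation returns Iceberg type objects.

```python
from typing import Any, Dict, Union

# Hypothetical minimal mapping for illustration; the real one lives in AvroSchemaConversion
PRIMITIVE_FIELD_TYPE_MAP = {"string": "StringType", "int": "IntegerType", "long": "LongType"}

def convert_schema(schema: Union[str, Dict[str, Any]]) -> Any:
    """Dispatch on the shape of the Avro schema node, one case per Avro concept."""
    if isinstance(schema, str):
        # A bare string is a primitive type name, e.g. "long"
        return PRIMITIVE_FIELD_TYPE_MAP[schema]
    elif isinstance(schema, dict):
        type_name = schema["type"]
        if type_name == "record":
            return [convert_field(f) for f in schema["fields"]]
        elif type_name == "array":
            return ("list", convert_schema(schema["items"]))
        elif type_name == "map":
            # Avro map keys are always strings
            return ("map", "StringType", convert_schema(schema["values"]))
        else:
            return PRIMITIVE_FIELD_TYPE_MAP[type_name]
    else:
        raise ValueError(f"Cannot convert invalid schema: {schema}")

def convert_field(field: Dict[str, Any]) -> Any:
    """Convert a single record field: a (name, type) pair in this sketch."""
    return (field["name"], convert_schema(field["type"]))
```

Each Avro concept gets exactly one function, so the "is this a field or a type?" ambiguity from the original method never arises.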
python/src/iceberg/schema.py
Outdated
```diff
 @property
-def columns(self) -> Iterable[NestedField]:
+def columns(self) -> Tuple[NestedField]:
```
Should this be `Tuple[NestedField, ...]` since it isn't a single field in the tuple? From the `typing.Tuple` docs:

> To specify a variable-length tuple of homogeneous type, use literal ellipsis, e.g. `Tuple[int, ...]`
Yes, good one. I know that we're going a bit back and forth on this annotation, but I've double-checked using the inspector, and it is a Tuple at runtime 👍🏻
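For reference, a minimal sketch of the annotation distinction being discussed (the `columns` function and field names here are illustrative, not the real `Schema` API):

```python
from typing import Tuple

# Tuple[int] means a tuple of exactly one int;
# Tuple[int, ...] is a variable-length homogeneous tuple.
coords: Tuple[int, ...] = (1, 2, 3)

def columns(fields: Tuple[str, ...]) -> Tuple[str, ...]:
    # Returning a tuple keeps the schema's column list immutable to callers
    return fields
```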
```python
    Returns:
        The Iceberg equivalent field
    """
    assert "field-id" in field, "Missing field-id in the Avro field, this is required for converting it to an Iceberg schema"
```
What is thrown when using `assert` rather than `raise ValueError`?

Also, we try to be more direct with error messages by starting with what was attempted, then giving context for what went wrong. It's also assumed that if an exception is raised then some requirement wasn't met, so we'd typically leave out the "this is required ..." part. How about `f"Cannot convert field, missing field-id: {field}"`?
Great point. I'm not against asserts personally, but there is a battle in the Python community over whether you should use them for checks like these. Technically you can strip them using the `python -O` flag, and they are meant for developers. I've changed them to a `ValueError` to be in line with the rest of the exceptions (which I like).

I've updated the message to your suggestion. I think it is good to have the field in there 👍🏻
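A small sketch of the difference: an `assert` is removed entirely under `python -O`, while an explicit `raise` always runs. The function name here is hypothetical.

```python
def require_field_id(field: dict) -> int:
    """Return the field-id, or raise if it is missing (sketch).

    Unlike `assert`, this check survives `python -O`.
    """
    if "field-id" not in field:
        raise ValueError(f"Cannot convert field, missing field-id: {field}")
    return field["field-id"]
```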
```python
plain_type, element_is_optional = self._resolve_union(avro_type[inner_field_name])
if isinstance(plain_type, dict):
    # We need the element-id downstream
    plain_type["field-id"] = avro_type[id_field]
```
I strongly prefer not modifying the incoming data like this is doing. I think `avro_type` and parts of that type should be handled as though they are immutable.

Also, I don't think that this works. `_convert_schema` is going to return a type. For primitive types, there is no way to pass in the `field-id` property, and in all cases the `field-id` is going to be ignored because Iceberg types don't carry field IDs.

I think the simple solution to both problems is to handle inner types in the map and array methods directly, rather than using a common method.
Hey Ryan, I agree. I don't think we really need the method at all. I've removed it and just resolved the type instead of the field. Still some old code dangling around.
```python
LOGICAL_FIELD_TYPE_MAPPING: Dict[str, PrimitiveType] = {
    "date": DateType(),
    "time-millis": TimeType(),
    "timestamp-millis": TimestampType(),
```
Iceberg will only write `time-micros` and `timestamp-micros` because at least microsecond precision is required by the SQL spec. The mappings for millis are correct, but we should also be able to handle micros.
I've added the mapping, but I think we need to correct the scale when reading the data, analogous to the Java impl (`core/src/main/java/org/apache/iceberg/avro/GenericAvroReader.java`, lines 137 to 140 in 90225d6):

```java
case "timestamp-millis":
  // adjust to microseconds
  ValueReader<Long> longs = ValueReaders.longs();
  return (ValueReader<Long>) (decoder, ignored) -> longs.read(decoder, null) * 1000L;
```
Yes, exactly. We'll need to be aware of whether it is in millis or micros when reading the file!
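In Python, the same millis-to-micros adjustment could be sketched like this (hypothetical helper name; the real reader would do this per-value while decoding):

```python
def to_micros(value: int, logical_type: str) -> int:
    """Normalize an Avro time/timestamp value to microseconds (sketch).

    Avro millis-based logical types store milliseconds; Iceberg expects
    microseconds, so those values are scaled by 1000.
    """
    if logical_type in ("time-millis", "timestamp-millis"):
        return value * 1000
    return value
```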
```python
    Returns:
    """
    if record_type["type"] != "record":
        raise ValueError(f"Expected type, got: {record_type}")
```
Expected record?
```python
elif logical_type == "map":
    return self._convert_logical_map_type(avro_logical_type)
elif (logical_type, physical_type) in LOGICAL_FIELD_TYPE_MAPPING:
    return LOGICAL_FIELD_TYPE_MAPPING[(logical_type, physical_type)]
```
I just also realized that this doesn't check for the `adjustToUtc` flag that we use to distinguish between `TimestampType()` and `TimestamptzType()`. We'll have to add that.
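A sketch of that check (the property key `"adjust-to-utc"` and the string return values are assumptions for illustration; the real code returns Iceberg type objects and should follow the exact key from the Iceberg Avro spec):

```python
def timestamp_type(avro_schema: dict) -> str:
    """Pick tz-aware vs naive timestamp from the adjust-to-utc flag (sketch)."""
    if avro_schema.get("adjust-to-utc", False):
        return "timestamptz"
    return "timestamp"
```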
```python
    Returns:
        A Iceberg DecimalType
    """
    return DecimalType(precision=avro_type["precision"], scale=avro_type["scale"])
```
Since you're validating physical type above, you may want to validate that the physical type here is fixed, long, or int.
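The suggested guard could look like this (hypothetical function name and tuple return; the allowed physical types follow the reviewer's list above):

```python
def convert_decimal(avro_type: dict) -> tuple:
    """Validate the physical type backing a decimal before converting (sketch)."""
    if avro_type["type"] not in ("fixed", "long", "int"):
        raise ValueError(f"Cannot convert decimal, invalid physical type: {avro_type['type']}")
    return ("decimal", avro_type["precision"], avro_type["scale"])
```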
```python
fields = avro_type["items"]["fields"]
if len(fields) != 2:
    raise ValueError(f'Invalid key-value pair schema: {avro_type["items"]}')
key = self._convert_field(list(filter(lambda f: f["name"] == "key", fields))[0])
```
Looks good. Too bad it is more complicated, but thanks for fixing this.
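As an aside, the quoted key/value lookup can also be written with `next()` instead of `list(filter(...))[0]`; a self-contained sketch with a hypothetical helper name:

```python
def key_value_fields(avro_type: dict) -> tuple:
    """Pull the key and value fields out of a map-as-array schema (sketch)."""
    fields = avro_type["items"]["fields"]
    if len(fields) != 2:
        raise ValueError(f'Invalid key-value pair schema: {avro_type["items"]}')
    key = next(f for f in fields if f["name"] == "key")
    value = next(f for f in fields if f["name"] == "value")
    return key, value
```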
There are a few minor things to fix, but overall I think this looks good. I'm going to commit it to unblock reader work and we can follow up with minor updates. Thanks, @Fokko!
Converts an Avro schema into an Iceberg one.
Looked at https://github.com/apache/iceberg/blob/master/python_legacy/iceberg/core/avro/avro_to_iceberg.py but I decided to rewrite it to make it more pythonic.