Python: Use Pydantic for (de)serialization #5011

Fokko · 2022-06-09T23:39:55Z

https://github.com/samuelcolvin/pydantic provides data validation and settings management using Python type-hints.

Fast and extensible, pydantic plays nicely with your linters/IDE/brain. Define how data should be in pure, canonical Python 3.7+; validate it with pydantic.

I took the basis from @samredai's PR in #3677

It allows us to easily (de)serialize the TableMetadata including all the nested fields (schema, fields, and types).

python/src/iceberg/utils/iceberg_base_model.py

rdblue · 2022-06-12T20:03:54Z

python/src/iceberg/utils/iceberg_base_model.py

+from pydantic import BaseModel
+
+
+class IcebergBaseModel(BaseModel):


Is there a better name for this? How about IcebergObject? I don't think BaseModel is a clear name that we want to carry forward, unless this is a convention that I'm missing. To me, IcebergObject does signal that it is layering in some common functionality.

I'm also okay with IcebergObject, but that might also be confusing since it is much more than the plain Python object. I don't think this is a Python convention, but I do like calling it a model because it connects it to Pydantic. It has much more than an object, for example, it has a schema: https://pydantic-docs.helpmanual.io/usage/schema/

python/src/iceberg/schema.py

python/src/iceberg/types.py

rdblue · 2022-06-12T20:08:50Z

python/src/iceberg/types.py

+        yield cls.validate
+
+    @classmethod
+    def validate(cls, v):


Why is the parsing done in the validate method?

Good one, I've moved this into the Type itself.

Looks like this is still here?

@Fokko, is this supposed to be left in?

I'm not sure why I missed this comment. So, this is supposed to be still in, but I moved the actual parsing (regex) to the type itself. Because the schema is a List[IcebergType] it will try to validate that one first. As an example fixed[22] It will not be passed to the Fixed validate method (because the __root__ is dynamic).

Before I had everything in the validate method, now I call a static method on the class itself:

if v.startswith("fixed"): return FixedType.parse(v)

Okay, I see the calls to parse. That brings us back to my original question, which is why are we doing parsing in a validate method? Shouldn't this validate the input rather than returning the deserialized value?

I would expect this to do something like the schema validator, which would change "fixed[12]" into {"__root__": "fixed", "length": 12} or something.

But then we're still parsing in the validate method, but returning a dict that describes the FixedType instead of an instantiated FixedType. Let me check if there is an alternative way to implement this other than a validator.

Doesn't look like the parse_obj/parse_raw gets invoked. The IcebergType is a bit of a special case, since it cannot be invoked, I think it would be okay to (mis)use the validator there. It gets called because it is part of the schema:

class Schema: fields: Tuple[IcebergType. ...]

When deserializing, it will first visit the IcebergType for validation. We can then handle the special cases for the decimal and fixed.

python/src/iceberg/types.py

python/src/iceberg/table/metadata.py

python/tests/catalog/test_base.py

python/tests/table/test_metadata.py

python/tests/test_types.py

rdblue · 2022-06-12T20:32:33Z

python/tests/test_types.py

+    assert str(simple_map) == "map<string, double>"
+
+
+def test_repr_map(simple_map: MapType):


It would be nice to have a more complex type that mixes maps, lists, and structs within one another, but I think this is good overall.

rdblue

This looks great overall! I really like how quickly this is able to implement the JSON serialization.

My main concern about this approach is the TableMetadata handling. We need to read into a common representation that works for both v1 and v2, and serialize to the requirements of each version. I don't think that's quite working now, but I may just understand.

…dantic-is-lit

python/src/iceberg/schema.py

…dantic-is-lit

python/src/iceberg/table/metadata.py

…dantic-is-lit

python/src/iceberg/table/metadata.py

rdblue · 2022-06-27T21:28:06Z

python/src/iceberg/table/metadata.py

+    to be used for arbitrary metadata. For example, commit.retry.num-retries
+    is used to control the number of commit retries."""
+
+    current_snapshot_id: Optional[int] = Field(alias="current-snapshot-id", default=-1)


Non-blocker: we use -1 in the format, but if I could do it over, I would probably avoid it and use null / None instead. You may want to consider converting in a pre-validation from -1 to clean this up.

I've added a pre-validator. This way we can also keep it cleaner down the line.

rdblue · 2022-06-27T21:31:26Z

python/src/iceberg/table/metadata.py

+    default_sort_order_id = values["default_sort_order_id"]
+
+    # 0 == unsorted
+    if default_sort_order_id != 0:


Nit: use constant DEFAULT_SORT_ORDER_UNSORTED?

Great one, also removed the comment

rdblue · 2022-06-27T21:34:07Z

python/src/iceberg/table/metadata.py

+    based on the spec. Implementations must throw an exception if a table’s
+    version is higher than the supported version."""
+
+    table_uuid: Optional[UUID] = Field(alias="table-uuid")


Since this is defaulted in a pre-validator, I think you could move it to common and remove the Optional.

Done! Keep in mind that we keep it nullable in Java: https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/TableMetadataParser.java#L317

Yeah, we kept it nullable in Java because in the process of adding a UUID could cause weird cases where UUIDs were concurrently assigned. That would cause commits to fail because it appears to be a different table. Since UUIDs have been supported for a long time, that's no longer likely so I don't think we need to care about it for the Python library.

rdblue · 2022-06-27T21:35:32Z

python/src/iceberg/table/metadata.py

+        return data
+
+    @root_validator(skip_on_failure=True)
+    def construct_schema(cls, data: Dict[str, Any]) -> Dict[str, Any]:


Nit: Should this be construct_schemas (plural, like construct_partition_specs below)?

rdblue · 2022-06-27T21:36:52Z

python/src/iceberg/table/metadata.py

+        # This is going to be much nicer as soon as partition-spec is also migrated to pydantic
+        if not data.get("partition_specs"):
+            fields = data["partition_spec"]
+            data["partition_specs"] = [{"spec-id": INITIAL_SPEC_ID, "fields": fields}]


Should this also set default_spec_id here rather than in set_v2_compatible_defaults? That would match what construct_schema is doing.

Hmm, this is actually a remained of some refactoring along the way. Since this is not a pre-validator, the field already needs to be set. The default_schema_id has a default. I've also set a default for the partition-spec, so everything is consistent again.

rdblue · 2022-06-27T21:37:22Z

python/src/iceberg/table/metadata.py

+        # Probably we'll just create a UNSORTED_ORDER constant then
+        if not data.get("sort_orders"):
+            data["sort_orders"] = [{"order_id": 0, "fields": []}]
+            data["default_sort_order_id"] = 0


Use the constant instead of 0?

Yes, this is now in line with schemas and partitions-specs

rdblue · 2022-06-27T21:38:58Z

python/src/iceberg/table/metadata.py

+    Implementations must throw an exception if a table’s UUID does not match
+    the expected UUID after refreshing metadata."""
+
+    schema_: Schema = Field(alias="schema")


(Not urgent) One thing we will want to watch out for here is ensuring this is set when current_schema_id changes.

rdblue · 2022-06-27T21:41:21Z

python/src/iceberg/table/metadata.py

+
+def check_partition_specs(values: Dict[str, Any]) -> Dict[str, Any]:
+    """Validator to check if the default-spec-id is present in partition-specs"""
+    default_spec_id = values["default_spec_id"]


(Not urgent) One thing I find slightly hard to follow, especially now that these validation methods are shared, is when the name_with_underscores vs name-with-dashes is used. I know the latter is when running a pre-validator, but it would be nice to standardize on either pre-validation or normal validation so we don't have to think about when to translate field strings by hand.

It's also a bit odd that this uses partition_specs, but spec-id at the inner level... Seems like following up to use pre-validation everywhere would help readability. What do you think?

I agree, it is a bit confusing. I think we should as little as pre-validators as possible, and try to standardize the use of the underscores.

It's also a bit odd that this uses partition_specs, but spec-id at the inner level... Seems like following up to use pre-validation everywhere would help readability. What do you think?

This will be fixed once we start moving the downstream objects to Pydantic as well. spec["spec-id"] will change into spec.spec_id since it is then an object (in the post-validation at least :)

rdblue · 2022-06-27T21:46:07Z

python/src/iceberg/table/refs.py

+
+class SnapshotRefType(str, Enum):
+    branch = "branch"
+    tag = "tag"


Minor: Most of the examples in the stdlib docs use upper case, which would match other constants and Java. Why not BRANCH and TAG?

Good one. I see this being used both ways, since they are static, I prefer uppercase. Updated!

rdblue · 2022-06-27T23:43:00Z

Nice work, @Fokko! Thanks for getting this done!

Python: Use Pydantic for (de)serialization

abe6897

github-actions bot added the python label Jun 9, 2022

Fokko force-pushed the fd-pydantic-is-lit branch from cf2f93d to abe6897 Compare June 10, 2022 21:51