Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Python: Use Pydantic for (de)serialization #5011

Merged
merged 27 commits into from
Jun 27, 2022

Conversation

Fokko
Copy link
Contributor

@Fokko Fokko commented Jun 9, 2022

https://github.com/samuelcolvin/pydantic provides data validation and settings management using Python type-hints.

Fast and extensible, pydantic plays nicely with your linters/IDE/brain. Define how data should be in pure, canonical Python 3.7+; validate it with pydantic.

I took the basis from @samredai's PR in #3677

It allows us to easily (de)serialize the TableMetadata including all the nested fields (schema, fields, and types).

from pydantic import BaseModel


class IcebergBaseModel(BaseModel):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a better name for this? How about IcebergObject? I don't think BaseModel is a clear name that we want to carry forward, unless this is a convention that I'm missing. To me, IcebergObject does signal that it is layering in some common functionality.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm also okay with IcebergObject, but that might also be confusing since it is much more than the plain Python object. I don't think this is a Python convention, but I do like calling it a model because it connects it to Pydantic. It has much more than an object, for example, it has a schema: https://pydantic-docs.helpmanual.io/usage/schema/

python/src/iceberg/types.py Outdated Show resolved Hide resolved
yield cls.validate

@classmethod
def validate(cls, v):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the parsing done in the validate method?

Copy link
Contributor Author

@Fokko Fokko Jun 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good one, I've moved this into the Type itself.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this is still here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Fokko, is this supposed to be left in?

Copy link
Contributor Author

@Fokko Fokko Jun 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure why I missed this comment. So, this is supposed to be still in, but I moved the actual parsing (regex) to the type itself. Because the schema is a List[IcebergType] it will try to validate that one first. As an example fixed[22] It will not be passed to the Fixed validate method (because the __root__ is dynamic).

Before I had everything in the validate method, now I call a static method on the class itself:

            if v.startswith("fixed"):
                return FixedType.parse(v)

Copy link
Contributor

@rdblue rdblue Jun 23, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I see the calls to parse. That brings us back to my original question, which is why are we doing parsing in a validate method? Shouldn't this validate the input rather than returning the deserialized value?

I would expect this to do something like the schema validator, which would change "fixed[12]" into {"__root__": "fixed", "length": 12} or something.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But then we're still parsing in the validate method, but returning a dict that describes the FixedType instead of an instantiated FixedType. Let me check if there is an alternative way to implement this other than a validator.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't look like the parse_obj/parse_raw gets invoked. The IcebergType is a bit of a special case, since it cannot be invoked, I think it would be okay to (mis)use the validator there. It gets called because it is part of the schema:

class Schema:
    fields: Tuple[IcebergType. ...]

When deserializing, it will first visit the IcebergType for validation. We can then handle the special cases for the decimal and fixed.

python/tests/test_types.py Outdated Show resolved Hide resolved
assert str(simple_map) == "map<string, double>"


def test_repr_map(simple_map: MapType):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to have a more complex type that mixes maps, lists, and structs within one another, but I think this is good overall.

Copy link
Contributor

@rdblue rdblue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great overall! I really like how quickly this is able to implement the JSON serialization.

My main concern about this approach is the TableMetadata handling. We need to read into a common representation that works for both v1 and v2, and serialize to the requirements of each version. I don't think that's quite working now, but I may just understand.

@github-actions github-actions bot added the docs label Jun 23, 2022
to be used for arbitrary metadata. For example, commit.retry.num-retries
is used to control the number of commit retries."""

current_snapshot_id: Optional[int] = Field(alias="current-snapshot-id", default=-1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Non-blocker: we use -1 in the format, but if I could do it over, I would probably avoid it and use null / None instead. You may want to consider converting in a pre-validation from -1 to clean this up.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a pre-validator. This way we can also keep it cleaner down the line.

default_sort_order_id = values["default_sort_order_id"]

# 0 == unsorted
if default_sort_order_id != 0:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: use constant DEFAULT_SORT_ORDER_UNSORTED?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great one, also removed the comment

based on the spec. Implementations must throw an exception if a table’s
version is higher than the supported version."""

table_uuid: Optional[UUID] = Field(alias="table-uuid")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is defaulted in a pre-validator, I think you could move it to common and remove the Optional.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we kept it nullable in Java because in the process of adding a UUID could cause weird cases where UUIDs were concurrently assigned. That would cause commits to fail because it appears to be a different table. Since UUIDs have been supported for a long time, that's no longer likely so I don't think we need to care about it for the Python library.

return data

@root_validator(skip_on_failure=True)
def construct_schema(cls, data: Dict[str, Any]) -> Dict[str, Any]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Should this be construct_schemas (plural, like construct_partition_specs below)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good one!

# This is going to be much nicer as soon as partition-spec is also migrated to pydantic
if not data.get("partition_specs"):
fields = data["partition_spec"]
data["partition_specs"] = [{"spec-id": INITIAL_SPEC_ID, "fields": fields}]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this also set default_spec_id here rather than in set_v2_compatible_defaults? That would match what construct_schema is doing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, this is actually a remained of some refactoring along the way. Since this is not a pre-validator, the field already needs to be set. The default_schema_id has a default. I've also set a default for the partition-spec, so everything is consistent again.

# Probably we'll just create a UNSORTED_ORDER constant then
if not data.get("sort_orders"):
data["sort_orders"] = [{"order_id": 0, "fields": []}]
data["default_sort_order_id"] = 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use the constant instead of 0?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is now in line with schemas and partitions-specs

Implementations must throw an exception if a table’s UUID does not match
the expected UUID after refreshing metadata."""

schema_: Schema = Field(alias="schema")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Not urgent) One thing we will want to watch out for here is ensuring this is set when current_schema_id changes.


def check_partition_specs(values: Dict[str, Any]) -> Dict[str, Any]:
"""Validator to check if the default-spec-id is present in partition-specs"""
default_spec_id = values["default_spec_id"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Not urgent) One thing I find slightly hard to follow, especially now that these validation methods are shared, is when the name_with_underscores vs name-with-dashes is used. I know the latter is when running a pre-validator, but it would be nice to standardize on either pre-validation or normal validation so we don't have to think about when to translate field strings by hand.

It's also a bit odd that this uses partition_specs, but spec-id at the inner level... Seems like following up to use pre-validation everywhere would help readability. What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, it is a bit confusing. I think we should as little as pre-validators as possible, and try to standardize the use of the underscores.

It's also a bit odd that this uses partition_specs, but spec-id at the inner level... Seems like following up to use pre-validation everywhere would help readability. What do you think?

This will be fixed once we start moving the downstream objects to Pydantic as well. spec["spec-id"] will change into spec.spec_id since it is then an object (in the post-validation at least :)


class SnapshotRefType(str, Enum):
branch = "branch"
tag = "tag"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: Most of the examples in the stdlib docs use upper case, which would match other constants and Java. Why not BRANCH and TAG?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good one. I see this being used both ways, since they are static, I prefer uppercase. Updated!

@rdblue rdblue merged commit 98dc1e6 into apache:master Jun 27, 2022
@rdblue
Copy link
Contributor

rdblue commented Jun 27, 2022

Nice work, @Fokko! Thanks for getting this done!

namrathamyske pushed a commit to namrathamyske/iceberg that referenced this pull request Jul 10, 2022
namrathamyske pushed a commit to namrathamyske/iceberg that referenced this pull request Jul 10, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants