Python: Use Pydantic for (de)serialization #5011
Conversation
from pydantic import BaseModel

class IcebergBaseModel(BaseModel):
Is there a better name for this? How about IcebergObject? I don't think BaseModel is a clear name that we want to carry forward, unless this is a convention that I'm missing. To me, IcebergObject does signal that it is layering in some common functionality.
I'm also okay with IcebergObject, but that might also be confusing since it is much more than a plain Python object. I don't think this is a Python convention, but I do like calling it a model because it connects it to Pydantic. It has much more than an object; for example, it has a schema: https://pydantic-docs.helpmanual.io/usage/schema/
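To make the trade-off concrete, here is a minimal sketch (the Snapshot model and its fields are hypothetical, not from this PR) of what such a shared base class buys over a plain object, including the generated schema:

```python
from typing import Optional

from pydantic import BaseModel


class IcebergBaseModel(BaseModel):
    """Hypothetical shared base class: subclasses inherit validation,
    JSON (de)serialization, and schema generation from Pydantic."""


class Snapshot(IcebergBaseModel):
    snapshot_id: int
    parent_snapshot_id: Optional[int] = None


snap = Snapshot(snapshot_id=1)
# Unlike a plain Python object, the model can describe itself with a JSON schema.
schema = Snapshot.schema()
```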
yield cls.validate

@classmethod
def validate(cls, v):
Why is the parsing done in the validate method?
Good one, I've moved this into the Type itself.
Looks like this is still here?
@Fokko, is this supposed to be left in?
I'm not sure why I missed this comment. So, this is supposed to still be in, but I moved the actual parsing (regex) to the type itself. Because the schema is a List[IcebergType], it will try to validate that one first. As an example, fixed[22] will not be passed to the Fixed validate method (because the __root__ is dynamic).
Before, I had everything in the validate method; now I call a static method on the class itself:
if v.startswith("fixed"):
    return FixedType.parse(v)
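A stdlib-only sketch of that split, with the regex parsing on the type itself and the validator reduced to dispatch (the FixedType shape here is an assumption, not the PR's actual class):

```python
import re


class FixedType:
    """Hypothetical stand-in for the PR's FixedType, holding a byte length."""

    _PATTERN = re.compile(r"^fixed\[(\d+)\]$")

    def __init__(self, length: int):
        self.length = length

    @classmethod
    def parse(cls, v: str) -> "FixedType":
        match = cls._PATTERN.match(v)
        if not match:
            raise ValueError(f"Could not parse {v!r} as a fixed type")
        return cls(length=int(match.group(1)))


def validate(v):
    # The validator only dispatches; the parsing lives on the type.
    if isinstance(v, str) and v.startswith("fixed"):
        return FixedType.parse(v)
    return v
```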
Okay, I see the calls to parse. That brings us back to my original question: why are we doing parsing in a validate method? Shouldn't this validate the input rather than returning the deserialized value?
I would expect this to do something like the schema validator, which would change "fixed[12]" into {"__root__": "fixed", "length": 12} or something.
But then we're still parsing in the validate method, just returning a dict that describes the FixedType instead of an instantiated FixedType. Let me check if there is an alternative way to implement this other than a validator.
Doesn't look like the parse_obj/parse_raw gets invoked. The IcebergType is a bit of a special case: since it cannot be instantiated directly, I think it would be okay to (mis)use the validator there. It gets called because it is part of the schema:
class Schema:
    fields: Tuple[IcebergType, ...]
When deserializing, it will first visit the IcebergType for validation. We can then handle the special cases for decimal and fixed.
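The flow described above could be sketched as a single validation entry point that passes primitive type names through and parses the two string-encoded special cases (the function name and output shapes are illustrative assumptions):

```python
import re

# Patterns for the two parameterized type strings in the Iceberg spec.
FIXED = re.compile(r"^fixed\[(\d+)\]$")
DECIMAL = re.compile(r"^decimal\((\d+),\s*(\d+)\)$")


def validate_iceberg_type(v: str):
    """Hypothetical validator for IcebergType: primitive names pass through,
    while fixed[N] and decimal(P, S) are parsed into structured dicts."""
    if m := FIXED.match(v):
        return {"type": "fixed", "length": int(m.group(1))}
    if m := DECIMAL.match(v):
        return {"type": "decimal", "precision": int(m.group(1)), "scale": int(m.group(2))}
    return v
```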
assert str(simple_map) == "map<string, double>"

def test_repr_map(simple_map: MapType):
It would be nice to have a more complex type that mixes maps, lists, and structs within one another, but I think this is good overall.
This looks great overall! I really like how quickly this is able to implement the JSON serialization.
My main concern about this approach is the TableMetadata handling. We need to read into a common representation that works for both v1 and v2, and serialize to the requirements of each version. I don't think that's quite working now, but I may just misunderstand.
python/src/iceberg/table/metadata.py
Outdated
to be used for arbitrary metadata. For example, commit.retry.num-retries
is used to control the number of commit retries."""

current_snapshot_id: Optional[int] = Field(alias="current-snapshot-id", default=-1)
Non-blocker: we use -1 in the format, but if I could do it over, I would probably avoid it and use null / None instead. You may want to consider converting -1 in a pre-validator to clean this up.
I've added a pre-validator. This way we can also keep it cleaner down the line.
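A sketch of what such a pre-validator could look like, using the field and alias from the diff above (pydantic v1 style, as used in this PR; the model name is a stand-in):

```python
from typing import Any, Optional

from pydantic import BaseModel, Field, validator


class TableMetadataSketch(BaseModel):
    current_snapshot_id: Optional[int] = Field(alias="current-snapshot-id", default=None)

    @validator("current_snapshot_id", pre=True)
    def cleanup_snapshot_id(cls, v: Any) -> Any:
        # Translate the format's -1 sentinel into a clean None before validation.
        return None if v == -1 else v


meta = TableMetadataSketch.parse_obj({"current-snapshot-id": -1})
```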
python/src/iceberg/table/metadata.py
Outdated
default_sort_order_id = values["default_sort_order_id"]

# 0 == unsorted
if default_sort_order_id != 0:
Nit: use a constant DEFAULT_SORT_ORDER_UNSORTED?
Great one, also removed the comment
python/src/iceberg/table/metadata.py
Outdated
based on the spec. Implementations must throw an exception if a table’s
version is higher than the supported version."""

table_uuid: Optional[UUID] = Field(alias="table-uuid")
Since this is defaulted in a pre-validator, I think you could move it to common and remove the Optional.
Done! Keep in mind that we keep it nullable in Java: https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/TableMetadataParser.java#L317
Yeah, we kept it nullable in Java because the process of adding a UUID could cause weird cases where UUIDs were concurrently assigned. That would cause commits to fail because the table appears to be a different table. Since UUIDs have been supported for a long time, that's no longer likely, so I don't think we need to care about it for the Python library.
python/src/iceberg/table/metadata.py
Outdated
return data

@root_validator(skip_on_failure=True)
def construct_schema(cls, data: Dict[str, Any]) -> Dict[str, Any]:
Nit: Should this be construct_schemas (plural, like construct_partition_specs below)?
Good one!
# This is going to be much nicer as soon as partition-spec is also migrated to pydantic
if not data.get("partition_specs"):
    fields = data["partition_spec"]
    data["partition_specs"] = [{"spec-id": INITIAL_SPEC_ID, "fields": fields}]
Should this also set default_spec_id here rather than in set_v2_compatible_defaults? That would match what construct_schema is doing.
Hmm, this is actually a remnant of some refactoring along the way. Since this is not a pre-validator, the field already needs to be set. The default_schema_id has a default. I've also set a default for the partition-spec, so everything is consistent again.
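For illustration, a pre-validator that wraps the v1 single partition-spec into the v2 partition-specs list might look like this (the model shape and the INITIAL_SPEC_ID value are assumptions, not the PR's actual code):

```python
from typing import Any, Dict, List

from pydantic import BaseModel, Field, root_validator

INITIAL_SPEC_ID = 0  # assumed value of the constant


class MetadataSketch(BaseModel):
    partition_spec: List[Dict[str, Any]] = Field(alias="partition-spec", default_factory=list)
    partition_specs: List[Dict[str, Any]] = Field(alias="partition-specs", default_factory=list)

    @root_validator(pre=True)
    def construct_partition_specs(cls, data: Dict[str, Any]) -> Dict[str, Any]:
        # If only the v1 single spec is present, wrap it into the v2 list form.
        if not data.get("partition-specs") and "partition-spec" in data:
            data["partition-specs"] = [
                {"spec-id": INITIAL_SPEC_ID, "fields": data["partition-spec"]}
            ]
        return data


meta = MetadataSketch.parse_obj({"partition-spec": [{"name": "ts_day"}]})
```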
python/src/iceberg/table/metadata.py
Outdated
# Probably we'll just create a UNSORTED_ORDER constant then
if not data.get("sort_orders"):
    data["sort_orders"] = [{"order_id": 0, "fields": []}]
    data["default_sort_order_id"] = 0
Use the constant instead of 0?
Yes, this is now in line with schemas and partition-specs.
Implementations must throw an exception if a table’s UUID does not match
the expected UUID after refreshing metadata."""

schema_: Schema = Field(alias="schema")
(Not urgent) One thing we will want to watch out for here is ensuring this is set when current_schema_id changes.
def check_partition_specs(values: Dict[str, Any]) -> Dict[str, Any]:
    """Validator to check if the default-spec-id is present in partition-specs"""
    default_spec_id = values["default_spec_id"]
(Not urgent) One thing I find slightly hard to follow, especially now that these validation methods are shared, is when name_with_underscores vs. name-with-dashes is used. I know the latter is used when running a pre-validator, but it would be nice to standardize on either pre-validation or normal validation so we don't have to think about when to translate field strings by hand.
It's also a bit odd that this uses partition_specs, but spec-id at the inner level... Seems like following up to use pre-validation everywhere would help readability. What do you think?
I agree, it is a bit confusing. I think we should use as few pre-validators as possible, and try to standardize on the underscores.
"It's also a bit odd that this uses partition_specs, but spec-id at the inner level... Seems like following up to use pre-validation everywhere would help readability. What do you think?"
This will be fixed once we start moving the downstream objects to Pydantic as well. spec["spec-id"] will change into spec.spec_id, since it is then an object (in the post-validation, at least :)
python/src/iceberg/table/refs.py
Outdated
class SnapshotRefType(str, Enum):
    branch = "branch"
    tag = "tag"
Minor: Most of the examples in the stdlib docs use upper case, which would match other constants and Java. Why not BRANCH and TAG?
Good one. I see this being used both ways; since they are static, I prefer uppercase. Updated!
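The updated enum would then read as follows (a sketch; subclassing str keeps the members serializing as plain strings):

```python
from enum import Enum


class SnapshotRefType(str, Enum):
    """Uppercase members match other constants and the Java implementation."""

    BRANCH = "branch"
    TAG = "tag"
```

Because the members are also str instances, lookups from raw metadata values like SnapshotRefType("tag") still resolve to the uppercase constants.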
Nice work, @Fokko! Thanks for getting this done!
https://github.com/samuelcolvin/pydantic provides data validation and settings management using Python type-hints.
Fast and extensible, pydantic plays nicely with your linters/IDE/brain. Define how data should be in pure, canonical Python 3.7+; validate it with pydantic.
I took the basis from @samredai's PR in #3677
It allows us to easily (de)serialize the TableMetadata including all the nested fields (schema, fields, and types).
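As a toy illustration, a few nested Pydantic models are enough to round-trip a JSON document (the models below are simplified stand-ins, not the PR's actual classes):

```python
from typing import List

from pydantic import BaseModel, Field


class NestedField(BaseModel):
    id: int
    name: str
    type: str


class Schema(BaseModel):
    schema_id: int = Field(alias="schema-id")
    fields: List[NestedField]


class TableMetadata(BaseModel):
    format_version: int = Field(alias="format-version")
    schemas: List[Schema]


raw = """
{
  "format-version": 2,
  "schemas": [
    {"schema-id": 0, "fields": [{"id": 1, "name": "ts", "type": "timestamp"}]}
  ]
}
"""
# parse_raw recursively validates and constructs the whole nested tree.
meta = TableMetadata.parse_raw(raw)
```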