Python: Adding TableMetadata object from dict, bytestream, and InputFile implementation #3677
Conversation
python/src/iceberg/table/metadata.py
Outdated
```python
    return cls(metadata=metadata, version=version)

@classmethod
def from_s3(
```
Currently, this `from_s3` class method is here, but this should be abstracted out to a generic file-io method that uses something like a `FileIO` abstract base class.
This has been refactored to use a generic `from_file()` method that's dependent on PR #3691, which adds the `FileIO` abstract base class.
python/src/iceberg/table/metadata.py
Outdated
```python
"last-sequence-number": {"type": "integer"},
"last-updated-ms": {"type": "integer"},
"last-column-id": {"type": "integer"},
"schemas": {"type": "array", "items": {}},
```
Would these eventually have type information?
Yeah, I just did a simple top-level schema as part of this draft, but the final schema would include much more and be comprehensive of the spec: all types and required/optional flags. We can also add regex requirements for string fields. It doesn't have to be a single huge schema either, so we could have a separate jsonschema for PartitionSpec and just reference it in the larger table metadata jsonschema definition.
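For illustration, a modular layout like this could look as follows. This is only a sketch: the field names, the `"$defs"` key, and the uuid pattern are assumptions for the example, not the PR's actual schemas.

```python
# Illustrative sketch: a standalone PartitionSpec jsonschema referenced
# from the table metadata jsonschema via "$defs"/"$ref".
PARTITION_SPEC_SCHEMA = {
    "type": "object",
    "properties": {
        "spec-id": {"type": "integer"},
        "fields": {"type": "array"},
    },
    "required": ["spec-id", "fields"],
}

TABLE_METADATA_SCHEMA = {
    "type": "object",
    "$defs": {"PartitionSpec": PARTITION_SPEC_SCHEMA},
    "properties": {
        # Example of a regex requirement on a string field
        "table-uuid": {"type": "string", "pattern": "^[0-9a-fA-F-]{36}$"},
        # The sub-schema is referenced rather than inlined
        "partition-specs": {
            "type": "array",
            "items": {"$ref": "#/$defs/PartitionSpec"},
        },
    },
}
```

Any jsonschema validator that resolves `$ref` would then validate each entry of `partition-specs` against the standalone definition.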
python/src/iceberg/table/metadata.py
Outdated
```python
"sort-orders": {"type": "array", "items": {}},
"default-sort-order-id": {"type": "integer"},
},
"required": [
```
Is this listing the fields that are required for writing or reading? For reading, we have to be more relaxed because we need to accept older versions of the spec. For example, `schemas` and `current-schema-id` didn't use to exist. It used to be a single `schema` field.
Ah, I see that this is for v1. It doesn't quite align with v1, and there are some constructs that are optional. v1 requires `schema`, but allows setting `schemas` and `current-schema-id`. We'd probably need to keep these up to date.
Yeah, I figured there could be a separate schema definition for v1 and v2 (and a future v3). The version argument then essentially determines which jsonschema is used. Sorry, I should have mentioned this in the PR description (or better yet, a TODO comment). I still have to complete the schemas to actually match the v1 and v2 specs exactly.
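A minimal sketch of that version dispatch, with placeholder schemas standing in for the real v1/v2 definitions:

```python
# Placeholders for the real jsonschema definitions keyed by format version.
TABLE_METADATA_V1_SCHEMA = {"type": "object"}
TABLE_METADATA_V2_SCHEMA = {"type": "object"}

SCHEMAS_BY_VERSION = {1: TABLE_METADATA_V1_SCHEMA, 2: TABLE_METADATA_V2_SCHEMA}


def schema_for(version) -> dict:
    """Return the jsonschema that applies to the given format version."""
    try:
        return SCHEMAS_BY_VERSION[int(version)]
    except KeyError:
        raise ValueError(f"Unsupported format version: {version}")
```

`validate()` would then look up the schema for the instance's version and pass both to the jsonschema validator.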
python/src/iceberg/table/metadata.py
Outdated
```python
"""

def __init__(self, metadata: dict, version: Union[str, int]):
    self._version = version
```
Should this be `int(version)` to be able to assume that it is an int elsewhere?
python/src/iceberg/table/metadata.py
Outdated
```python
def _wrap(self, value: Any):
    """A recursive function that drills into iterable values and returns
    nested TableMetadata instances
```
Why would there be nested table metadata instances?
The class name does make this sound odd, but the logic is essentially a DFS through the metadata to assign everything as class attributes. If a value in the json is an object (like properties), it instantiates that as a sort of "partial" `TableMetadata` instance and then starts a new traversal through the contents of that object. In other words:

```python
table_metadata = TableMetadata(...)

isinstance(table_metadata, TableMetadata)  # True
isinstance(table_metadata.properties, TableMetadata)  # True
isinstance(table_metadata.snapshot_log, list)  # True
isinstance(table_metadata.snapshot_log[0], TableMetadata)  # True
```

If the idea of this partial metadata existing as a `TableMetadata` instance doesn't sit well, I could instead have a generic `Metadata` or `Config` class that actually traverses the json object to create a class with all of the class attributes, and have that as an argument to a `TableMetadata` class that handles validation and other table metadata related things. Something like:

```python
config = Config({"table-uuid": "foo", ...})
table_metadata = TableMetadata(config, version="2")
```

The `_wrap()` method would live in the `Config` class and the docstring would read as:

> A recursive function that drills into iterable values and returns nested Config instances
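As a self-contained sketch of the traversal being described (illustrative only; the PR's actual implementation differs in details such as validation):

```python
class Config:
    """Wraps a metadata dict, exposing keys as attributes via a DFS."""

    def __init__(self, metadata: dict):
        self._metadata = metadata
        for key, value in metadata.items():
            # Make keys like "read.split.target-size" python friendly
            attr = key.replace("-", "_").replace(".", "_")
            setattr(self, attr, self._wrap(value))

    def _wrap(self, value):
        """Drill into iterable values, returning nested Config instances."""
        if isinstance(value, dict):
            return Config(value)
        if isinstance(value, list):
            return [self._wrap(item) for item in value]
        return value
```

Every json object becomes a nested `Config`, every list is traversed element by element, and scalars are assigned as-is.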
Okay, I think I get it.

I like being able to use `m.snapshot_log[0]` and similar ways to access the metadata. But I'm not sure that this is changing enough that it's worth the wrapper approach, instead of just converting to a `TableMetadata` class that pulls out and stores `self.current_schema_id` (for example). Table metadata shouldn't be that complicated since it's mostly a few lists of objects at the most nested level (like metadata > snapshots > snapshot > properties > key/value).

While this makes it easy to get started, it would be awkward to make updates to the metadata as JSON because you'd need to produce a new JSON tree and then wrap it with this class. We may also want classes for things like Snapshot, which can embed some operations like reading metadata.
I've updated the PR to remove the wrapper approach and set the top level as class attributes. I've also added a `to_dict()` method that's used by `validate()`, where the `TableMetadata` instance is serialized into a python dictionary and then passed into the jsonschema validate function.

I've also added freezing of instances, and this punts the question of how we perform table metadata updates. I can follow up with a PR that uses the builder pattern where a python `dict` is the mutable form. So building from an existing table metadata instance would start with calling `to_dict()`.
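A sketch of that frozen-instance plus mutable-dict pattern (field names are illustrative, covering only a slice of the spec):

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TableMetadata:
    """Frozen metadata; a plain dict is the mutable form for updates."""

    table_uuid: str
    format_version: int
    properties: dict = field(default_factory=dict)

    def to_dict(self) -> dict:
        # Copies mutable members so callers can edit the dict freely
        return {
            "table-uuid": self.table_uuid,
            "format-version": self.format_version,
            "properties": dict(self.properties),
        }

    @classmethod
    def from_dict(cls, d: dict) -> "TableMetadata":
        return cls(
            table_uuid=d["table-uuid"],
            format_version=d["format-version"],
            properties=d.get("properties", {}),
        )
```

An update then round-trips: `to_dict()`, mutate the dict, `from_dict()` to get a fresh frozen instance.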
python/src/iceberg/table/metadata.py
Outdated
```python
    """Fixes attribute names to be python friendly"""
    return value.replace("-", "_").replace(".", "_")

def validate(self):
```
I like how this can use JSON schema to validate, but what should we do if validation fails? Are there some things that we can recover from? For example, v1 metadata with `schemas` and `current-schema-id` but not `schema` is technically invalid. But we can still read the table just fine.
I'd bet that we can get the jsonschema pretty accurate, to the point where only bona fide validation failures raise a ValidationError. For your example, we could use some of the conditional schema features, specifically `dependentRequired`, which should let us define something that says "`schema` is optional when `schemas` and `current-schema-id` exist".
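As one way to sketch that relaxation (this fragment uses `anyOf`; `dependentRequired` is another option for expressing the dependency between the newer fields; the constant name is made up for the example):

```python
# Illustrative fragment: accept v1 metadata that has either the legacy
# "schema" field, or both "schemas" and "current-schema-id".
V1_SCHEMA_RELAXATION = {
    "anyOf": [
        {"required": ["schema"]},
        {"required": ["schemas", "current-schema-id"]},
    ]
}
```

Merged into the v1 schema, this would stop flagging the readable-but-technically-invalid case above.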
Okay, sounds reasonable.
python/tests/table/test_metadata.py
Outdated
```python
assert table_metadata.sort_orders == []
assert table_metadata.default_sort_order_id == 8

assert table_metadata.properties._metadata == {"read.split.target-size": 134217728}
```
This test looks good to me.
python/src/iceberg/table/metadata.py
Outdated
```python
    return cls.from_dict(metadata)

@classmethod
def from_file(cls, path: str, custom_file_io: FileIO = None, **kwargs):
```
API nit (apologies if the style has been discussed elsewhere), but it might be more user-friendly to be less specific on individual parameters, e.g. have one parameter `file` which can be `["str", "FileIO", ...]` and make the decision internally on what to use. Passing the base class as a factory seems a little strange since presumably clients know they have a `custom_file_io` and combine it with the path outside of this function.
Great point, I totally agree. A custom file-io would have a custom `InputFile` implementation that would already include the path when used here. If a string uri is provided instead, I'll handle that by finding the right "known" implementation.

I'm going to start working on this and getting it out of draft status!
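A sketch of that type-based dispatch (`InputFile` here is a toy stand-in for `iceberg.io.base.InputFile`, and `resolve_input_file` is a hypothetical helper name):

```python
from typing import Union


class InputFile:
    """Toy stand-in for an io implementation's InputFile."""

    def __init__(self, location: str):
        self.location = location


def resolve_input_file(file: Union[str, InputFile]) -> InputFile:
    """Accept either a ready InputFile or a string uri."""
    if isinstance(file, InputFile):
        # Caller already bound a path to a custom implementation
        return file
    # Otherwise pick a "known" implementation for the uri
    return InputFile(file)
```

The caller-facing method then takes a single `file` argument and never needs a separate factory parameter.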
python/src/iceberg/table/metadata.py
Outdated
```python
    with custom_file_io(path, **kwargs) as f:
        table_metadata = cls.from_byte_stream(byte_stream=f.byte_stream)
else:
    with get_file_io(path, custom_file_io=custom_file_io, **kwargs) as f:
```
It seems redundant to pass through `custom_file_io` here? Is the first branch really necessary?
Great catch. The `get_file_io` function (not included in this PR) should be responsible for handling when a custom file-io is provided, i.e. `custom_file_io != None`. I'm planning to do another pass at this draft once PR #3691 is finalized and I'll make sure to update this. Thanks!
I think this is ready for another look. I've rebased with the commits from the file io PR that was merged. The table metadata class has a […]. Another thing I did was move the […].

If we can settle on the design here, I think a thorough inspection of the V1 and V2 jsonschemas defined here is needed before merging.
```python
manifest files. (Deprecated: use partition-specs and default-spec-id
instead)"""

partition_specs: list
```
Is the intention to refine collection types later with actual object types or keep these fairly generic?
I'd say we should definitely refine these later, i.e. `list[PartitionSpec]` once that class has been added.
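For instance, the annotation could be tightened like this once the class exists (the `PartitionSpec` placeholder and field set here are assumptions for the example):

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PartitionSpec:
    """Placeholder for the eventual PartitionSpec class."""

    spec_id: int = 0


@dataclass(frozen=True)
class TableMetadata:
    # Refined from the generic `partition_specs: list`
    partition_specs: list[PartitionSpec] = field(default_factory=list)
```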
python/src/iceberg/table/metadata.py
Outdated
```python
from iceberg.io.base import InputFile, OutputFile


TABLE_METADATA_V1_SCHEMA = {
```
Is it possible to auto-generate this from the OpenAPI spec for the REST catalog? Or maybe to use that directly?
That's a good idea; it should be possible using a `RefResolver` from jsonschema to pull out the references in the TableMetadata section of the open-api spec. Let me try this out!
PR #4545 adds the rest catalog openapi to the python library. Once that's in I can update this PR to validate using that.
python/tests/conftest.py
Outdated
```python
@pytest.fixture
def LocalFileIOFixture():
    """"""
```
nit: maybe remove this one? Looks messy :)
python/src/iceberg/table/metadata.py
Outdated
```python
if self.format_version == 1:
    d["schema"] = self.schema
    d["partition-spec"] = self.partition_spec
if self.format_version == 2:
```
```diff
-if self.format_version == 2:
+elif self.format_version == 2:
```
python/src/iceberg/table/metadata.py
Outdated
```python
    d["partition-spec"] = self.partition_spec
if self.format_version == 2:
    d["last-sequence-number"] = self.last_sequence_number
```
I would add here:

```python
else:
    logger.warning(f"Unknown format version: {self.format_version}")
```
python/src/iceberg/table/metadata.py
Outdated
```python
"""
f = output_file.create(overwrite=overwrite)
f.write(json.dumps(self.to_dict()).encode("utf-8"))
```
`f.close()` is missing.
Should we create a helper context manager like `output_file_writer()` for this?

```python
class output_file_writer:
    def __init__(self, output_file: OutputFile, overwrite: bool = False):
        self.__output_file = output_file
        self.__overwrite = overwrite

    def __enter__(self):
        self.__file = self.__output_file.create(overwrite=self.__overwrite)
        return self.__file

    def __exit__(self, type, val, tb):
        self.__file.close()
```

With that helper the above method becomes:

```python
def to_output_file(self, output_file: OutputFile, overwrite: bool = False) -> None:
    with output_file_writer(output_file, overwrite) as file:
        file.write(json.dumps(self.to_dict()).encode("utf-8"))
```

You could even default the encoding to UTF-8 inside the context manager, in one single place, if that is our standard encoding format.
We did this in the initial draft PR but eventually took it out. This is expected to be a pretty stable part of the code base and is not exposed in the user-facing API (so the convenience of using a context manager is immediately lost since users will interact with the higher level FileIO object at most)
I'm a big fan of context managers, but I think they should be integrated into the class itself, instead of using yet another wrapper.

I'm quite confused btw: `output_file.create(overwrite=overwrite)` should return an `OutputStream` from `iceberg.io.base`, but that class doesn't seem to exist 🤔
`OutputStream` is just a protocol, so it's basically documentation that explains `OutputFile.create(...)` should return an object with a `write`, `closed`, and `close` method.
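A sketch of such a protocol with `typing.Protocol` (illustrative; the `closed` attribute the comment mentions is left out here so the `isinstance` check works as a pure method-presence test):

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class OutputStream(Protocol):
    """Anything with these methods conforms, without inheriting."""

    def write(self, b: bytes) -> int: ...
    def close(self) -> None: ...


class InMemoryStream:
    """Conforms structurally; never mentions OutputStream."""

    def __init__(self) -> None:
        self.buffer = b""

    def write(self, b: bytes) -> int:
        self.buffer += b
        return len(b)

    def close(self) -> None:
        pass
```

Because the protocol is structural, any file-io implementation's return type satisfies it as long as the methods exist.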
…rialization for table metadata
```python
from iceberg.table.metadata import TableMetadata


class FromDict:
```
@cabhishek I know you're working on serialization/deserialization of `Schema`. You should be able to add a `schema` method to the classes in this file.
Updated this PR to remove all of the jsonschema validation logic (to be revisited later, hopefully using the OpenAPI spec for the REST catalog), and also replaced attrs with the built-in dataclass decorator. The last thing I've done here is move all of the serialization/deserialization logic (initially this PR had them as methods on the […]).
UPDATE: This has been updated to remove the wrapper mechanism and instead assigns attributes to the class explicitly. The s3 io has also been abstracted to a generic file io in the `from_file()` method, which is dependent on PR #3691.

I've also included the popular attrs library to allow for a cleaner class definition and also to freeze `TableMetadata` instances. There's agreement that table metadata instances should be frozen, but I anticipate some discussions around how to actually perform table metadata updates, so I think it's best to tackle that in follow-up PRs/proposals.

This is the table metadata portion of #3227. This isn't complete, but I wanted it to be visible to get feedback on the direction. The equivalent logic in the legacy implementation is stored in table_metadata.py and table_metadata_parser.py.
The idea here is that, instead of including very explicit parsing and validation logic for table metadata files, we can rely on the standard library in conjunction with jsonschema tooling to accomplish both. The `TABLE_METADATA_V2_SCHEMA` jsonschema definition found in metadata.py is an example of how this can be done (it still needs to be tuned to the spec exactly). The `TableMetadata` class itself naively parses a given json object and includes a `validate()` method that validates against the defined jsonschema. (`validate()` simply calls the `validate_v1()` or `validate_v2()` static method.) Table metadata values can then be retrieved using simple dot notation and can be updated as well.

Once tweaked, the jsonschema definition should prove re-usable since jsonschema parsers exist in almost every language. It may also be valuable to include a blessed jsonschema definition in the Iceberg docs.

A proposal for editing table metadata is to use a collection of functions that update value(s), validate that the schema is still valid, and return the updated `TableMetadata` instance, such as:
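The PR's own example is not shown here; the following is a hedged sketch of what such an update function could look like, with a minimal stand-in class so it is self-contained (the names `set_properties` and the stand-in's fields are illustrative, not the PR's final API):

```python
class TableMetadata:
    """Minimal stand-in; the real class carries the full spec plus
    jsonschema-based validation."""

    def __init__(self, d: dict):
        self._d = d

    def to_dict(self) -> dict:
        return dict(self._d)

    def validate(self) -> None:
        # Stand-in check; the real method validates against a jsonschema
        if "table-uuid" not in self._d:
            raise ValueError("table-uuid is required")


def set_properties(table_metadata: TableMetadata, **props) -> TableMetadata:
    """Update value(s), validate, and return a new TableMetadata."""
    d = table_metadata.to_dict()
    d.setdefault("properties", {}).update(props)
    updated = TableMetadata(d)
    updated.validate()
    return updated
```

Each editing function follows the same shape: serialize, mutate the dict, rebuild, validate, return.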