Python: Adding TableMetadata object from dict, bytestream, and InputFile implementation #3677

Closed
wants to merge 2 commits

Conversation

samredai (Collaborator) commented Dec 6, 2021

UPDATE: This has been updated to remove the wrapper mechanism and instead assign attributes to the class explicitly. The S3 IO has also been abstracted to a generic file IO in the from_file() method, which depends on PR #3691.

I've also included the popular attrs library to allow for a cleaner class definition and to freeze TableMetadata instances. There's agreement that table metadata instances should be frozen, but I anticipate some discussion around how to actually perform table metadata updates, so I think it's best to tackle that in follow-up PRs/proposals.
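
For illustration, freezing with attrs looks roughly like this (a minimal sketch, not code from this PR):

import attr

@attr.s(frozen=True)
class FrozenExample:
    value = attr.ib()

instance = FrozenExample(value=1)
instance.value = 2  # raises attr.exceptions.FrozenInstanceError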


This is the table metadata portion of #3227. This isn't complete, but I wanted it to be visible to get feedback on the direction. The equivalent logic in the legacy implementation is stored in table_metadata.py and table_metadata_parser.py.

The idea here is that, instead of including very explicit parsing and validation logic for table metadata files, we can rely on the standard library in conjunction with jsonschema tooling to accomplish both. The TABLE_METADATA_V2_SCHEMA jsonschema definition found in metadata.py is an example of how this can be done (it still needs to be tuned to match the spec exactly). The TableMetadata class itself naively parses a given JSON object and includes a validate() method that validates against the defined jsonschema. (validate() simply calls the validate_v1() or validate_v2() static method.) Table metadata values can then be retrieved using simple dot notation and can be updated as well.

table_metadata = TableMetadata.from_s3("s3://foo/bar/baz.metadata.json", version=2)
print(table_metadata.properties.read_split_target_size) # 134217728
table_metadata.properties.read_split_target_size = 268435456
print(table_metadata.properties.read_split_target_size) # 268435456

Once tweaked, the jsonschema definition should prove reusable, since jsonschema parsers exist in almost every language. It may also be valuable to include a blessed jsonschema definition in the Iceberg docs.

A proposal for editing table metadata is to use a collection of functions that update value(s), validate that the schema is still valid, and return the updated TableMetadata instance, such as:

from copy import deepcopy
import time

def rollback(table_metadata: TableMetadata, snapshot_id: int) -> TableMetadata:
    new_table_metadata = deepcopy(table_metadata)
    now_millis = int(time.time() * 1000)
    new_table_metadata.snapshot_log.append({"timestamp_millis": now_millis, "snapshot_id": snapshot_id})
    new_table_metadata.current_snapshot_id = snapshot_id
    new_table_metadata.validate()  # confirm the updated metadata still matches the jsonschema
    return new_table_metadata

@samredai samredai changed the title from "Adding TableMetadata object from dict, bytestream, and s3 sources" to "Python: Adding TableMetadata object from dict, bytestream, and s3 sources" Dec 6, 2021
@samredai samredai marked this pull request as draft December 6, 2021 18:18
@github-actions github-actions bot added the python label Dec 6, 2021
return cls(metadata=metadata, version=version)

@classmethod
def from_s3(
samredai (Collaborator Author):

Currently, this from_s3 class method is here, but this should be abstracted out to a generic file-io method that uses something like a FileIO abstract base class.

samredai (Collaborator Author):

This has been refactored to use a generic from_file() method that depends on PR #3691, which adds the FileIO abstract base class.

samredai (Collaborator Author) commented Dec 6, 2021

@CircArgs

"last-sequence-number": {"type": "integer"},
"last-updated-ms": {"type": "integer"},
"last-column-id": {"type": "integer"},
"schemas": {"type": "array", "items": {}},
Contributor:

Would these eventually have type information?

samredai (Collaborator Author):

Yeah, I just did a simple top-level schema as part of this draft, but the final schema would include much more and be comprehensive of the spec: all types and required/optional flags. We can also add regex requirements for string fields. It doesn't have to be a single huge jsonschema either, so we could have a separate jsonschema for PartitionSpec and just reference it in the larger table metadata jsonschema definition.
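
For illustration, a sketch of that composition (the field selections here are placeholders, not this PR's schema):

PARTITION_SPEC_SCHEMA = {
    "type": "object",
    "properties": {
        "spec-id": {"type": "integer"},
        "fields": {"type": "array"},
    },
    "required": ["spec-id", "fields"],
}

TABLE_METADATA_SCHEMA = {
    "$defs": {"partition-spec": PARTITION_SPEC_SCHEMA},
    "type": "object",
    "properties": {
        # each entry is validated against the referenced PartitionSpec schema
        "partition-specs": {"type": "array", "items": {"$ref": "#/$defs/partition-spec"}},
    },
}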

"sort-orders": {"type": "array", "items": {}},
"default-sort-order-id": {"type": "integer"},
},
"required": [
Contributor:

Is this listing the fields that are required for writing or for reading? For reading, we have to be more relaxed because we need to accept older versions of the spec. For example, schemas and current-schema-id didn't use to exist; it used to be a single schema field.

Contributor:

Ah, I see that this is for v1. It doesn't quite align with v1, and there are some constructs that are optional. v1 requires schema, but allows setting schemas and current-schema-id. We'd probably need to keep these up to date.

samredai (Collaborator Author):

Yeah, I figured there could be separate schema definitions for v1 and v2 (and a future v3). The version argument then essentially determines which jsonschema is used. Sorry, I should have mentioned this in the PR description (or better yet, in a TODO comment). I still have to complete the schemas to actually match the v1 and v2 specs exactly.

"""

def __init__(self, metadata: dict, version: Union[str, int]):
    self._version = version
Contributor:

Should this be int(version), so we can assume that it is an int elsewhere?


def _wrap(self, value: Any):
"""A recursive function that drills into iterable values and returns
nested TableMetadata instances
Contributor:

Why would there be nested table metadata instances?

samredai (Collaborator Author):

The class name does make this sound odd, but the logic is essentially a DFS through the metadata that assigns everything as class attributes. If a value in the JSON is an object (like properties), it's instantiated as a sort of "partial" TableMetadata instance, and a new traversal starts through the contents of that object. In other words:

table_metadata = TableMetadata(...)
isinstance(table_metadata, TableMetadata)  # True
isinstance(table_metadata.properties, TableMetadata)  # True

isinstance(table_metadata.snapshot_log, list)  # True
isinstance(table_metadata.snapshot_log[0], TableMetadata)  # True
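
A minimal sketch of that traversal, to make the behavior concrete (illustrative, not the PR's exact implementation):

def _wrap(self, value):
    if isinstance(value, dict):
        # nested objects become "partial" TableMetadata instances
        return TableMetadata(metadata=value, version=self._version)
    if isinstance(value, list):
        # list elements are traversed individually
        return [self._wrap(item) for item in value]
    return value  # scalars pass through unchanged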

If the idea of this partial metadata existing as a TableMetadata instance doesn't sit well, I could instead have a generic Metadata or Config class that actually traverses the json object to create a class with all of the class attributes, and have that as an argument to a TableMetadata class that handles validation and other table metadata related things.

Something like:

config = Config({"table-uuid": "foo", ...})
table_metadata = TableMetadata(config, version="2")

The _wrap() method would live in the Config class and the docstring would read as:

A recursive function that drills into iterable values and returns
nested Config instances

Contributor:

Okay, I think I get it.

I like being able to use m.snapshot_log[0] and similar ways to access the metadata. But I'm not sure that this is changing enough to be worth the wrapper approach, instead of just converting to a TableMetadata class that pulls out and stores self.current_schema_id (for example). Table metadata shouldn't be that complicated, since it's mostly a few lists of objects at the most nested level (like metadata > snapshots > snapshot > properties > key/value).

While this makes it easy to get started, it would be awkward to make updates to the metadata as JSON because you'd need to produce a new JSON tree and then wrap with this class. We may also want classes for things like Snapshot, which can embed some operations like reading metadata.

samredai (Collaborator Author) commented Dec 8, 2021:

I've updated the PR to remove the wrapper approach and set the top-level values as class attributes. I've also added a to_dict() method that's used by validate(), where the TableMetadata instance is serialized into a Python dictionary and then passed into the jsonschema validate function.

I've also added freezing of instances, which punts on the question of how we perform table metadata updates. I can follow up with a PR that uses the builder pattern where a Python dict is the mutable form, so building from an existing table metadata instance would start with calling to_dict().
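
A hedged sketch of that round trip (to_dict(), from_dict(), and validate() are methods from this PR; the field and value are illustrative, and table_metadata is an existing instance):

d = table_metadata.to_dict()               # frozen instance -> mutable dict
d["current-snapshot-id"] = 3051729675574597004
new_metadata = TableMetadata.from_dict(d)  # back to a (frozen) instance
new_metadata.validate()                    # re-check against the jsonschema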

"""Fixes attribute names to be python friendly"""
return value.replace("-", "_").replace(".", "_")

def validate(self):
Contributor:

I like how this can use JSON schema to validate, but what should we do if validation fails? Are there some things that we can recover from? For example, v1 metadata with schemas and current-schema-id but not schema is technically invalid. But we can still read the table just fine.

samredai (Collaborator Author):

I'd bet that we can get the jsonschema accurate enough that only bona fide validation failures raise a ValidationError. For your example, we could use some of the conditional schema features, specifically dependentRequired, which should let us define something that says "schema is optional when schemas and current-schema-id exist".
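
For illustration, a sketch of that kind of rule (using anyOf alongside dependentRequired; not code from this PR):

V1_CONDITIONAL_RULES = {
    # if "schemas" is present, "current-schema-id" must be present too
    "dependentRequired": {"schemas": ["current-schema-id"]},
    # "schema" may be omitted only when both replacements are present
    "anyOf": [
        {"required": ["schema"]},
        {"required": ["schemas", "current-schema-id"]},
    ],
}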

Contributor:

Okay, sounds reasonable.

assert table_metadata.sort_orders == []
assert table_metadata.default_sort_order_id == 8

assert table_metadata.properties._metadata == {"read.split.target-size": 134217728}
Contributor:

This test looks good to me.

return cls.from_dict(metadata)

@classmethod
def from_file(cls, path: str, custom_file_io: FileIO = None, **kwargs):
Contributor:

API nit (apologies if the style has been discussed elsewhere), but it might be more user-friendly to be less specific on individual parameters, e.g. have one parameter "file" which can be ["str", "FileIO", ...] and make the decision internally on what to use. Passing the base class as a factory seems a little strange, since presumably clients know they have a custom_file_io and combine it with the path outside of this function.

samredai (Collaborator Author):

Great point, I totally agree. A custom file-io would have a custom InputFile implementation that would already include the path when used here. If a string URI is provided instead, I'll handle that by finding the right "known" implementation.

I'm going to start working on this and getting it out of draft status!

    with custom_file_io(path, **kwargs) as f:
        table_metadata = cls.from_byte_stream(byte_stream=f.byte_stream)
else:
    with get_file_io(path, custom_file_io=custom_file_io, **kwargs) as f:
Contributor:

It seems redundant to pass through custom_file_io here? Is the first branch really necessary?

samredai (Collaborator Author):

Great catch. The get_file_io function (not included in this PR) should be responsible for handling when a custom file-io is provided, i.e. custom_file_io != None. I'm planning to do another pass at this draft once PR #3691 is finalized and I'll make sure to update this. Thanks!
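
A hypothetical sketch of that dispatch (get_file_io is not part of this PR; LocalFileIO and the scheme table here are illustrative stand-ins):

from urllib.parse import urlparse

class LocalFileIO:  # stand-in for a "known" implementation
    def __init__(self, path: str, **kwargs):
        self.path = path

KNOWN_FILE_IO = {"": LocalFileIO, "file": LocalFileIO}  # scheme -> implementation

def get_file_io(path: str, custom_file_io=None, **kwargs):
    # a provided custom file-io always wins; otherwise dispatch on the URI scheme
    if custom_file_io is not None:
        return custom_file_io(path, **kwargs)
    scheme = urlparse(path).scheme
    try:
        return KNOWN_FILE_IO[scheme](path, **kwargs)
    except KeyError:
        raise ValueError(f"No known file-io implementation for scheme: {scheme!r}")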

@samredai samredai marked this pull request as ready for review February 4, 2022 16:07
samredai (Collaborator Author) commented Feb 4, 2022

I think this is ready for another look. I've rebased with the commits from the file-io PR that was merged. The table metadata class has from_input_file and to_output_file methods that take an InputFile or OutputFile instance, respectively.

Another thing I did was move the LocalFileIO, LocalInputFile, and LocalOutputFile implementations to conftest.py as a fixture that can be used across all tests (see the docstring at the top of conftest.py for more details).

If we can settle on the design here, I think a thorough inspection of the V1 and V2 jsonschemas defined here is needed before merging.

@samredai samredai requested a review from rdblue February 4, 2022 18:17
@samredai samredai changed the title from "Python: Adding TableMetadata object from dict, bytestream, and s3 sources" to "Python: Adding TableMetadata object from dict, bytestream, and InputFile implementation" Feb 5, 2022
manifest files. (Deprecated: use partition-specs and default-spec-id
instead)"""

partition_specs: list
Contributor:

Is the intention to refine collection types later with actual object types or keep these fairly generic?

samredai (Collaborator Author):

I'd say we should definitely refine these later, i.e. list[PartitionSpec] once that class has been added.

@rdblue rdblue added this to In progress in [Priority 1] Python: Pythonic refactor via automation Feb 7, 2022
@rdblue rdblue moved this from In progress to Review in progress in [Priority 1] Python: Pythonic refactor Feb 7, 2022

from iceberg.io.base import InputFile, OutputFile

TABLE_METADATA_V1_SCHEMA = {
Contributor:

Is it possible to auto-generate this from the OpenAPI spec for the REST catalog? Or maybe to use that directly?

samredai (Collaborator Author):

That's a good idea; it should be possible using a RefResolver from jsonschema to pull out the references in the TableMetadata section of the OpenAPI spec. Let me try this out!
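
A hedged sketch of that approach (the file name, component path, and placeholder instance are assumptions):

import yaml
from jsonschema import RefResolver, validate

with open("rest-catalog-open-api.yaml") as f:  # assumed location of the spec
    openapi = yaml.safe_load(f)

# pull the TableMetadata component; the resolver handles its internal "$ref"s
schema = openapi["components"]["schemas"]["TableMetadata"]
resolver = RefResolver(base_uri="", referrer=openapi)

metadata_dict = {"format-version": 2}  # placeholder table metadata
validate(instance=metadata_dict, schema=schema, resolver=resolver)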

samredai (Collaborator Author) commented Apr 12, 2022:

PR #4545 adds the REST catalog OpenAPI spec to the Python library. Once that's in, I can update this PR to validate using that.


@pytest.fixture
def LocalFileIOFixture():
""""""
Contributor:

nit: maybe remove this one? Looks messy :)

if self.format_version == 1:
    d["schema"] = self.schema
    d["partition-spec"] = self.partition_spec
if self.format_version == 2:
Contributor:

Suggested change:
- if self.format_version == 2:
+ elif self.format_version == 2:

d["partition-spec"] = self.partition_spec
if self.format_version == 2:
d["last-sequence-number"] = self.last_sequence_number

Contributor:

I would add here:

else:
    logger.warning(f"Unknown format version: {self.format_version}")

"""
f = output_file.create(overwrite=overwrite)
f.write(json.dumps(self.to_dict()).encode("utf-8"))

Contributor:

f.close() is missing.

Contributor:

Should we create a helper context manager like output_file_writer() for this?

class output_file_writer:
    def __init__(self, output_file: OutputFile, overwrite: bool = False):
        self.__output_file = output_file
        self.__overwrite = overwrite

    def __enter__(self):
        self.__file = self.__output_file.create(overwrite=self.__overwrite)
        return self.__file  # return the stream so callers can write to it

    def __exit__(self, type, val, tb):
        self.__file.close()

With that helper the above method becomes:

def to_output_file(self, output_file: OutputFile, overwrite: bool = False) -> None:
    with output_file_writer(output_file, overwrite) as file:
        file.write(json.dumps(self.to_dict()).encode("utf-8"))

You could even default the encoding to UTF-8 inside the context manager, in one single place, if that is our standard encoding format.

samredai (Collaborator Author):

We did this in the initial draft PR but eventually took it out. This is expected to be a pretty stable part of the code base and is not exposed in the user-facing API (so the convenience of using a context manager is immediately lost, since users will interact with the higher-level FileIO object at most).

Contributor:

I'm a big fan of context managers, but I think they should be integrated into the class itself, instead of using yet another wrapper.

By the way, I'm quite confused: output_file.create(overwrite=overwrite) should return an OutputStream from iceberg.io.base, but that class doesn't seem to exist 🤔
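
For illustration, integrating the protocol into the class itself might look like this (a sketch under assumed names, not this PR's code):

class JsonOutputFile:  # hypothetical OutputFile-like class
    def __init__(self, path: str):
        self._path = path

    def create(self, overwrite: bool = False):
        mode = "wb" if overwrite else "xb"  # "xb" raises if the file already exists
        self._stream = open(self._path, mode)
        return self._stream

    def __enter__(self):
        return self.create()

    def __exit__(self, exc_type, exc, tb):
        self._stream.close()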

samredai (Collaborator Author):

OutputStream is just a protocol, so it's basically documentation explaining that OutputFile.create(...) should return an object with write, closed, and close methods.
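
A minimal sketch of what such a protocol looks like (the exact signatures are assumptions, not the library's definitions):

from typing import Protocol

class OutputStream(Protocol):
    def write(self, b: bytes) -> int:
        ...

    def close(self) -> None:
        ...

    @property
    def closed(self) -> bool:
        ...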

from iceberg.table.metadata import TableMetadata


class FromDict:
samredai (Collaborator Author):

@cabhishek I know you're working on serialization/deserialization of Schema. You should be able to add a schema method to the classes in this file.

samredai (Collaborator Author):

Updated this PR to remove all of the jsonschema validation logic (to be revisited later, hopefully using the OpenAPI spec for the REST catalog), and replaced attrs with the built-in dataclass decorator. The last thing I've done here is move all of the serialization/deserialization logic (initially this PR had it as methods on the TableMetadata class) out into a serialization.py file that includes FromDict, ToDict, FromByteStream, FromInputFile, and ToInputFile classes of staticmethods.
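
For illustration, the frozen-dataclass style described here looks roughly like this (the field selection is illustrative, not the PR's full class):

from dataclasses import dataclass

@dataclass(frozen=True)
class TableMetadata:
    format_version: int
    table_uuid: str
    location: str
    last_updated_ms: int
    last_column_id: int
    # ... remaining spec fields omitted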

@samredai samredai closed this Jun 29, 2022
[Priority 1] Python: Pythonic refactor automation moved this from Review in progress to Done Jun 29, 2022