Python: Adding TableMetadata object from dict, bytestream, and InputFile implementation #3677

Closed
wants to merge 2 commits

Conversation

samredai (Collaborator) commented Dec 6, 2021

UPDATE: This has been updated to remove the wrapper mechanism and instead assign attributes to the class explicitly. The S3 IO has also been abstracted to a generic file IO in the from_file() method, which depends on PR #3691.

I've also included the popular attrs library to allow for a cleaner class definition and to freeze TableMetadata instances. There's agreement that table metadata instances should be frozen, but I anticipate some discussion around how to actually perform table metadata updates, so I think it's best to tackle that in follow-up PRs/proposals.
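
For illustration, freezing with attrs looks roughly like this (a minimal sketch, not code from this PR):

import attr

@attr.s(frozen=True)
class FrozenExample:
    value = attr.ib()

instance = FrozenExample(value=1)
instance.value = 2  # raises attr.exceptions.FrozenInstanceError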


This is the table metadata portion of #3227. This isn't complete, but I wanted it to be visible to get feedback on the direction. The equivalent logic in the legacy implementation is stored in table_metadata.py and table_metadata_parser.py.

The idea here is that, instead of including very explicit parsing and validation logic for table metadata files, we can rely on the standard library in conjunction with jsonschema tooling to accomplish both. The TABLE_METADATA_V2_SCHEMA jsonschema definition found in metadata.py is an example of how this can be done (it still needs to be tuned to match the spec exactly). The TableMetadata class itself naively parses a given JSON object and includes a validate() method that validates against the defined jsonschema. (validate() simply calls the validate_v1() or validate_v2() static method.) Table metadata values can then be retrieved using simple dot notation and can be updated as well.

table_metadata = TableMetadata.from_s3("s3://foo/bar/baz.metadata.json", version=2)
print(table_metadata.properties.read_split_target_size) # 134217728
table_metadata.properties.read_split_target_size = 268435456
print(table_metadata.properties.read_split_target_size) # 268435456

Once tweaked, the jsonschema definition should prove reusable, since jsonschema parsers exist in almost every language. It may also be valuable to include a blessed jsonschema definition in the Iceberg docs.

A proposal for editing table metadata is to use a collection of functions that update value(s), validate that the schema is still valid, and return the updated TableMetadata instance, such as:

from copy import deepcopy
import time

def rollback(table_metadata: TableMetadata, snapshot_id: int) -> TableMetadata:
    new_table_metadata = deepcopy(table_metadata)
    now_millis = int(time.time() * 1000)
    new_table_metadata.snapshot_log.append({"timestamp_millis": now_millis, "snapshot_id": snapshot_id})
    new_table_metadata.current_snapshot_id = snapshot_id
    new_table_metadata.validate()  # confirm the updated metadata still matches the jsonschema
    return new_table_metadata

@samredai samredai changed the title from "Adding TableMetadata object from dict, bytestream, and s3 sources" to "Python: Adding TableMetadata object from dict, bytestream, and s3 sources" Dec 6, 2021
@samredai samredai marked this pull request as draft December 6, 2021 18:18
@github-actions github-actions bot added the python label Dec 6, 2021
return cls(metadata=metadata, version=version)

@classmethod
def from_s3(
samredai (Collaborator Author):

Currently, this from_s3 class method is here, but this should be abstracted out to a generic file-io method that uses something like a FileIO abstract base class.

samredai (Collaborator Author):

This has been refactored to use a generic from_file() method that depends on PR #3691, which adds the FileIO abstract base class.

samredai (Collaborator Author) commented Dec 6, 2021

@CircArgs

"last-sequence-number": {"type": "integer"},
"last-updated-ms": {"type": "integer"},
"last-column-id": {"type": "integer"},
"schemas": {"type": "array", "items": {}},
Contributor:

Would these eventually have type information?

samredai (Collaborator Author):

Yeah, I just did a simple top-level schema as part of this draft, but the final schema would include much more and be comprehensive of the spec: all types and required/optional flags. We can also add regex requirements for string fields. It doesn't have to be a single huge jsonschema either, so we could have a separate jsonschema for PartitionSpec and just reference it in the larger table metadata jsonschema definition.
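
For illustration, a sketch of that composition (the field selections here are placeholders, not this PR's schema):

PARTITION_SPEC_SCHEMA = {
    "type": "object",
    "properties": {
        "spec-id": {"type": "integer"},
        "fields": {"type": "array"},
    },
    "required": ["spec-id", "fields"],
}

TABLE_METADATA_SCHEMA = {
    "$defs": {"partition-spec": PARTITION_SPEC_SCHEMA},
    "type": "object",
    "properties": {
        # each entry is validated against the referenced PartitionSpec schema
        "partition-specs": {"type": "array", "items": {"$ref": "#/$defs/partition-spec"}},
    },
}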

"sort-orders": {"type": "array", "items": {}},
"default-sort-order-id": {"type": "integer"},
},
"required": [
Contributor:

Is this listing the fields that are required for writing or for reading? For reading, we have to be more relaxed because we need to accept older versions of the spec. For example, schemas and current-schema-id didn't use to exist; it used to be a single schema field.

Contributor:

Ah, I see that this is for v1. It doesn't quite align with v1, and there are some constructs that are optional. v1 requires schema, but allows setting schemas and current-schema-id. We'd probably need to keep these up to date.

samredai (Collaborator Author):

Yeah, I figured there could be separate schema definitions for v1 and v2 (and a future v3). The version argument then essentially determines which jsonschema is used. Sorry, I should have mentioned this in the PR description (or better yet, in a TODO comment). I still have to complete the schemas to actually match the v1 and v2 specs exactly.

"""

def __init__(self, metadata: dict, version: Union[str, int]):
    self._version = version
Contributor:

Should this be int(version), so we can assume that it is an int elsewhere?


def _wrap(self, value: Any):
"""A recursive function that drills into iterable values and returns
nested TableMetadata instances
Contributor:

Why would there be nested table metadata instances?

samredai (Collaborator Author):

The class name does make this sound odd, but the logic is essentially a DFS through the metadata that assigns everything as class attributes. If a value in the JSON is an object (like properties), it's instantiated as a sort of "partial" TableMetadata instance, and a new traversal starts through the contents of that object. In other words:

table_metadata = TableMetadata(...)
isinstance(table_metadata, TableMetadata)  # True
isinstance(table_metadata.properties, TableMetadata)  # True

isinstance(table_metadata.snapshot_log, list)  # True
isinstance(table_metadata.snapshot_log[0], TableMetadata)  # True
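
A minimal sketch of that traversal, to make the behavior concrete (illustrative, not the PR's exact implementation):

def _wrap(self, value):
    if isinstance(value, dict):
        # nested objects become "partial" TableMetadata instances
        return TableMetadata(metadata=value, version=self._version)
    if isinstance(value, list):
        # list elements are traversed individually
        return [self._wrap(item) for item in value]
    return value  # scalars pass through unchanged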

If the idea of this partial metadata existing as a TableMetadata instance doesn't sit well, I could instead have a generic Metadata or Config class that actually traverses the json object to create a class with all of the class attributes, and have that as an argument to a TableMetadata class that handles validation and other table metadata related things.

Something like:

config = Config({"table-uuid": "foo", ...})
table_metadata = TableMetadata(config, version="2")

The _wrap() method would live in the Config class and the docstring would read as:

A recursive function that drills into iterable values and returns
nested Config instances

Contributor:

Okay, I think I get it.

I like being able to use m.snapshot_log[0] and similar ways to access the metadata. But I'm not sure that this is changing enough to be worth the wrapper approach, instead of just converting to a TableMetadata class that pulls out and stores self.current_schema_id (for example). Table metadata shouldn't be that complicated, since it's mostly a few lists of objects at the most nested level (like metadata > snapshots > snapshot > properties > key/value).

While this makes it easy to get started, it would be awkward to make updates to the metadata as JSON because you'd need to produce a new JSON tree and then wrap with this class. We may also want classes for things like Snapshot, which can embed some operations like reading metadata.

samredai (Collaborator Author) commented Dec 8, 2021:

I've updated the PR to remove the wrapper approach and set the top-level values as class attributes. I've also added a to_dict() method that's used by validate(), where the TableMetadata instance is serialized into a Python dictionary and then passed into the jsonschema validate function.

I've also added freezing of instances, which punts on the question of how we perform table metadata updates. I can follow up with a PR that uses the builder pattern where a Python dict is the mutable form, so building from an existing table metadata instance would start with calling to_dict().
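
A hedged sketch of that round trip (to_dict(), from_dict(), and validate() are methods from this PR; the field and value are illustrative, and table_metadata is an existing instance):

d = table_metadata.to_dict()               # frozen instance -> mutable dict
d["current-snapshot-id"] = 3051729675574597004
new_metadata = TableMetadata.from_dict(d)  # back to a (frozen) instance
new_metadata.validate()                    # re-check against the jsonschema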

"""Fixes attribute names to be python friendly"""
return value.replace("-", "_").replace(".", "_")

def validate(self):
Contributor:

I like how this can use JSON schema to validate, but what should we do if validation fails? Are there some things that we can recover from? For example, v1 metadata with schemas and current-schema-id but not schema is technically invalid. But we can still read the table just fine.

samredai (Collaborator Author):

I'd bet that we can get the jsonschema accurate enough that only bona fide validation failures raise a ValidationError. For your example, we could use some of the conditional schema features, specifically dependentRequired, which should let us define something that says "schema is optional when schemas and current-schema-id exist".
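
For illustration, a sketch of that kind of rule (using anyOf alongside dependentRequired; not code from this PR):

V1_CONDITIONAL_RULES = {
    # if "schemas" is present, "current-schema-id" must be present too
    "dependentRequired": {"schemas": ["current-schema-id"]},
    # "schema" may be omitted only when both replacements are present
    "anyOf": [
        {"required": ["schema"]},
        {"required": ["schemas", "current-schema-id"]},
    ],
}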

Contributor:

Okay, sounds reasonable.

assert table_metadata.sort_orders == []
assert table_metadata.default_sort_order_id == 8

assert table_metadata.properties._metadata == {"read.split.target-size": 134217728}
Contributor:

This test looks good to me.

return cls.from_dict(metadata)

@classmethod
def from_file(cls, path: str, custom_file_io: FileIO = None, **kwargs):
Contributor:

API nit (apologies if the style has been discussed elsewhere), but it might be more user-friendly to be less specific on individual parameters, e.g. have one parameter "file" which can be ["str", "FileIO", ...] and make the decision internally on what to use. Passing the base class as a factory seems a little strange, since presumably clients know they have a custom_file_io and combine it with the path outside of this function.

samredai (Collaborator Author):

Great point, I totally agree. A custom file-io would have a custom InputFile implementation that would already include the path when used here. If a string URI is provided instead, I'll handle that by finding the right "known" implementation.

I'm going to start working on this and getting it out of draft status!

    with custom_file_io(path, **kwargs) as f:
        table_metadata = cls.from_byte_stream(byte_stream=f.byte_stream)
else:
    with get_file_io(path, custom_file_io=custom_file_io, **kwargs) as f:
Contributor:

It seems redundant to pass through custom_file_io here? Is the first branch really necessary?

samredai (Collaborator Author):

Great catch. The get_file_io function (not included in this PR) should be responsible for handling when a custom file-io is provided, i.e. custom_file_io != None. I'm planning to do another pass at this draft once PR #3691 is finalized and I'll make sure to update this. Thanks!
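
A hypothetical sketch of that dispatch (get_file_io is not part of this PR; LocalFileIO and the scheme table here are illustrative stand-ins):

from urllib.parse import urlparse

class LocalFileIO:  # stand-in for a "known" implementation
    def __init__(self, path: str, **kwargs):
        self.path = path

KNOWN_FILE_IO = {"": LocalFileIO, "file": LocalFileIO}  # scheme -> implementation

def get_file_io(path: str, custom_file_io=None, **kwargs):
    # a provided custom file-io always wins; otherwise dispatch on the URI scheme
    if custom_file_io is not None:
        return custom_file_io(path, **kwargs)
    scheme = urlparse(path).scheme
    try:
        return KNOWN_FILE_IO[scheme](path, **kwargs)
    except KeyError:
        raise ValueError(f"No known file-io implementation for scheme: {scheme!r}")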

@samredai samredai marked this pull request as ready for review February 4, 2022 16:07
samredai (Collaborator Author) commented Feb 4, 2022

I think this is ready for another look. I've rebased with the commits from the file-io PR that was merged. The table metadata class has from_input_file and to_output_file methods that take an InputFile or OutputFile instance, respectively.

Another thing I did was move the LocalFileIO, LocalInputFile, and LocalOutputFile implementations to conftest.py as a fixture that can be used across all tests (see the docstring at the top of conftest.py for more details).

If we can settle on the design here, I think a thorough inspection of the V1 and V2 jsonschemas defined here is needed before merging.

@samredai samredai requested a review from rdblue February 4, 2022 18:17
@samredai samredai changed the title from "Python: Adding TableMetadata object from dict, bytestream, and s3 sources" to "Python: Adding TableMetadata object from dict, bytestream, and InputFile implementation" Feb 5, 2022
manifest files. (Deprecated: use partition-specs and default-spec-id
instead)"""

partition_specs: list
Contributor:

Is the intention to refine collection types later with actual object types or keep these fairly generic?

samredai (Collaborator Author):

I'd say we should definitely refine these later, i.e. list[PartitionSpec] once that class has been added.

@rdblue rdblue added this to In progress in [Priority 1] Python: Pythonic refactor via automation Feb 7, 2022
@rdblue rdblue moved this from In progress to Review in progress in [Priority 1] Python: Pythonic refactor Feb 7, 2022

from iceberg.io.base import InputFile, OutputFile

TABLE_METADATA_V1_SCHEMA = {
Contributor:

Is it possible to auto-generate this from the OpenAPI spec for the REST catalog? Or maybe to use that directly?

samredai (Collaborator Author):

That's a good idea; it should be possible using a RefResolver from jsonschema to pull out the references in the TableMetadata section of the OpenAPI spec. Let me try this out!
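
A hedged sketch of that approach (the file name, component path, and placeholder instance are assumptions):

import yaml
from jsonschema import RefResolver, validate

with open("rest-catalog-open-api.yaml") as f:  # assumed location of the spec
    openapi = yaml.safe_load(f)

# pull the TableMetadata component; the resolver handles its internal "$ref"s
schema = openapi["components"]["schemas"]["TableMetadata"]
resolver = RefResolver(base_uri="", referrer=openapi)

metadata_dict = {"format-version": 2}  # placeholder table metadata
validate(instance=metadata_dict, schema=schema, resolver=resolver)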

samredai (Collaborator Author) commented Apr 12, 2022:

PR #4545 adds the REST catalog OpenAPI spec to the Python library. Once that's in, I can update this PR to validate using that.


@pytest.fixture
def LocalFileIOFixture():
""""""
Contributor:

nit: maybe remove this one? Looks messy :)

if self.format_version == 1:
    d["schema"] = self.schema
    d["partition-spec"] = self.partition_spec
if self.format_version == 2:
Contributor:

Suggested change:
- if self.format_version == 2:
+ elif self.format_version == 2:

d["partition-spec"] = self.partition_spec
if self.format_version == 2:
d["last-sequence-number"] = self.last_sequence_number

Contributor:

I would add here:

else:
    logger.warning(f"Unknown format version: {self.format_version}")

"""
f = output_file.create(overwrite=overwrite)
f.write(json.dumps(self.to_dict()).encode("utf-8"))

Contributor:

f.close() is missing.

Contributor:

Should we create a helper context manager like output_file_writer() for this?

class output_file_writer:
    def __init__(self, output_file: OutputFile, overwrite: bool = False):
        self.__output_file = output_file
        self.__overwrite = overwrite

    def __enter__(self):
        self.__file = self.__output_file.create(overwrite=self.__overwrite)
        return self.__file  # return the stream so callers can write to it

    def __exit__(self, type, val, tb):
        self.__file.close()

With that helper the above method becomes:

def to_output_file(self, output_file: OutputFile, overwrite: bool = False) -> None:
    with output_file_writer(output_file, overwrite) as file:
        file.write(json.dumps(self.to_dict()).encode("utf-8"))

You could even default the encoding to UTF-8 inside the context manager, in one single place, if that is our standard encoding format.

samredai (Collaborator Author):

We did this in the initial draft PR but eventually took it out. This is expected to be a pretty stable part of the code base and is not exposed in the user-facing API (so the convenience of using a context manager is immediately lost, since users will interact with the higher-level FileIO object at most).

Contributor:

I'm a big fan of context managers, but I think they should be integrated into the class itself, instead of using yet another wrapper.

By the way, I'm quite confused: output_file.create(overwrite=overwrite) should return an OutputStream from iceberg.io.base, but that class doesn't seem to exist 🤔
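
For illustration, integrating the protocol into the class itself might look like this (a sketch under assumed names, not this PR's code):

class JsonOutputFile:  # hypothetical OutputFile-like class
    def __init__(self, path: str):
        self._path = path

    def create(self, overwrite: bool = False):
        mode = "wb" if overwrite else "xb"  # "xb" raises if the file already exists
        self._stream = open(self._path, mode)
        return self._stream

    def __enter__(self):
        return self.create()

    def __exit__(self, exc_type, exc, tb):
        self._stream.close()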

samredai (Collaborator Author):

OutputStream is just a protocol, so it's basically documentation explaining that OutputFile.create(...) should return an object with write, closed, and close methods.
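
A minimal sketch of what such a protocol looks like (the exact signatures are assumptions, not the library's definitions):

from typing import Protocol

class OutputStream(Protocol):
    def write(self, b: bytes) -> int:
        ...

    def close(self) -> None:
        ...

    @property
    def closed(self) -> bool:
        ...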

from iceberg.table.metadata import TableMetadata


class FromDict:
samredai (Collaborator Author):

@cabhishek I know you're working on serialization/deserialization of Schema. You should be able to add a schema method to the classes in this file.

samredai (Collaborator Author):

Updated this PR to remove all of the jsonschema validation logic (to be revisited later, hopefully using the OpenAPI spec for the REST catalog), and replaced attrs with the built-in dataclass decorator. The last thing I've done here is move all of the serialization/deserialization logic (initially this PR had it as methods on the TableMetadata class) out into a serialization.py file that includes FromDict, ToDict, FromByteStream, FromInputFile, and ToInputFile classes of staticmethods.
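
For illustration, the frozen-dataclass style described here looks roughly like this (the field selection is illustrative, not the PR's full class):

from dataclasses import dataclass

@dataclass(frozen=True)
class TableMetadata:
    format_version: int
    table_uuid: str
    location: str
    last_updated_ms: int
    last_column_id: int
    # ... remaining spec fields omitted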

@samredai samredai closed this Jun 29, 2022
[Priority 1] Python: Pythonic refactor automation moved this from Review in progress to Done Jun 29, 2022