Support CreateTableTransaction in Glue and Rest #498

HonahX · 2024-03-05T09:41:35Z

This PR contains the implementation of CreateTableTransaction.

On Table Side:

Add StagedTable, which represents an uncommitted table; it inherits from Table and disables table scans.
Add CreateTableTransaction, which inherits from Transaction, takes a StagedTable, and adds the initial create-table updates to self._updates.
Add other helpers that build metadata from a chain of updates.
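
As a rough illustration of the table-side pieces listed above, here is a minimal stdlib-only sketch. The class names mirror the ones in this PR, but the bodies are simplified stand-ins rather than the PyIceberg implementation, and the action strings are only illustrative labels.

```python
from typing import Tuple


class TableUpdate:
    """Simplified stand-in for a metadata update (assumption, not the real class)."""

    def __init__(self, action: str) -> None:
        self.action = action


class Table:
    def __init__(self, identifier: str) -> None:
        self.identifier = identifier

    def scan(self) -> str:
        return f"scanning {self.identifier}"


class StagedTable(Table):
    """An uncommitted table: there is nothing to read yet, so scanning is disabled."""

    def scan(self) -> str:
        raise ValueError("Cannot scan a staged table")


class Transaction:
    def __init__(self, table: Table) -> None:
        self._table = table
        self._updates: Tuple[TableUpdate, ...] = ()

    def set_properties(self, **properties: str) -> "Transaction":
        self._updates += (TableUpdate("set-properties"),)
        return self


class CreateTableTransaction(Transaction):
    def __init__(self, staged_table: StagedTable) -> None:
        super().__init__(staged_table)
        # Seed the transaction with the updates that describe the new table,
        # so later changes are appended to the same chain.
        self._updates += (
            TableUpdate("assign-uuid"),
            TableUpdate("upgrade-format-version"),
            TableUpdate("add-schema"),
        )


txn = CreateTableTransaction(StagedTable("db.tbl"))
txn.set_properties(owner="me")
actions = [update.action for update in txn._updates]
```

The point of the sketch is the ordering: the create-table updates come first, and anything the user does inside the transaction is appended after them.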

On Catalog Side:

Add _create_staged_table API which returns a StagedTable
Add create_table_transaction API
Implement the new APIs in GlueCatalog
Implement the new APIs in RestCatalog

The example usage is

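from pyiceberg.types import IntegerType

with test_catalog.create_table_transaction(identifier, table_schema_nested) as txn:
    # Evolve the schema of the not-yet-committed table
    with txn.update_schema() as update_schema:
        update_schema.add_column(path="b", field_type=IntegerType())

    # Set table properties as part of the same create commit
    txn.set_properties(test_a="test_aa", test_b="test_b", test_c="test_c")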

Appreciate any reviews, suggestions and thoughts!

cc @syun64 @Fokko

pyiceberg/table/__init__.py

syun64 · 2024-03-05T17:26:00Z

pyiceberg/catalog/glue.py

@@ -358,6 +365,33 @@ def _get_glue_table(self, database_name: str, table_name: str) -> TableTypeDef:
        except self.glue.exceptions.EntityNotFoundException as e:
            raise NoSuchTableError(f"Table does not exist: {database_name}.{table_name}") from e

+    def _create_staged_table(


I think we might have an opportunity here to move the _create_staged_table function into pyiceberg/catalog/__init__.py and refactor the existing create_table functions on the different catalog implementations so that they all use _create_staged_table to create an instance of StagedTable, and then commit that StagedTable to the catalog backend.

I think what sets each catalog's implementation of create_table apart is how it handles the commit against the catalog backend, but they all seem to share the same sequence of operations in how they instantiate their notion of a new table.
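
One way to picture this suggestion is a template-method layout: the base class stages the table, and each catalog only supplies the backend commit. The sketch below is a stdlib-only illustration; `_commit_new_table`, `InMemoryCatalog`, and the metadata path format are hypothetical names invented for the example, not part of this PR.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict


@dataclass
class StagedTable:
    identifier: str
    metadata_location: str


class Catalog(ABC):
    def _create_staged_table(self, identifier: str, location: str) -> StagedTable:
        # Shared step: every catalog builds the new table the same way.
        return StagedTable(identifier, f"{location}/metadata/00000.metadata.json")

    def create_table(self, identifier: str, location: str) -> StagedTable:
        staged = self._create_staged_table(identifier, location)
        # Backend-specific step: only the commit differs per catalog.
        self._commit_new_table(staged)
        return staged

    @abstractmethod
    def _commit_new_table(self, staged: StagedTable) -> None:
        """Hypothetical hook: register the staged table with the backend."""


class InMemoryCatalog(Catalog):
    """Toy backend that keeps table pointers in a dict."""

    def __init__(self) -> None:
        self.tables: Dict[str, str] = {}

    def _commit_new_table(self, staged: StagedTable) -> None:
        if staged.identifier in self.tables:
            raise ValueError(f"Table already exists: {staged.identifier}")
        self.tables[staged.identifier] = staged.metadata_location


catalog = InMemoryCatalog()
created = catalog.create_table("db.tbl", "s3://bucket/db/tbl")
```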

What are your thoughts on this idea @HonahX ?

Great suggestion! In the initial implementation I did not pay much attention to the catalog code organization. Let me refactor it.

pyiceberg/catalog/glue.py

pyiceberg/table/metadata.py

Fokko

This is looking great, @HonahX. I left some of my thoughts in the PR.

pyiceberg/table/__init__.py

pyiceberg/table/metadata.py

Fokko · 2024-03-14T11:21:36Z

tests/integration/test_writes.py

+def test_create_table_transaction(session_catalog: Catalog, format_version: int) -> None:
+    if format_version == 1:
+        pytest.skip(
+            "There is a bug in the REST catalog (maybe server side) that prevents create and commit a staged version 1 table"


Interesting, let me track this down.

Thanks! My initial guess is that CatalogHandler uses the default values to build the initial empty metadata. But the default format-version is now 2, so it's impossible for the server to build v1 metadata in this case.

I recall that iceberg-rest uses the CatalogHandler to wrap the SqlCatalog. I will double-check that this weekend and run more tests.

pyiceberg/catalog/__init__.py

Fokko · 2024-03-14T11:28:35Z

pyiceberg/catalog/glue.py

+            return CommitTableResponse(metadata=updated_metadata, metadata_location=new_metadata_location)
+        except NoSuchTableError:
+            # Create the table
+            updated_metadata = construct_table_metadata(table_request.updates)


I would expect create_table to be called from the _commit from the Transaction.

I thought about it, and my rationale for the current implementation centers on ensuring a uniform transaction creation and commit process for both RestCatalogs and other types of catalogs. Specifically, for RestCatalogs, it's required to initiate CreateTableTransaction with _create_table(staged_create=True) and to use _commit_table with both initial and subsequent updates during transaction commitment. On the other hand, alternative catalogs offer more flexibility, allowing for either the use of _commit_table to reconstruct table metadata upon commitment or a modified _create_table API to create the table during the transaction commitment.

Considering pyiceberg's alignment with Rest API principles, where _commit_table aggregates metadata updates to construct the revised metadata for table updates within a transaction, it seems prudent to maintain consistency with Rest API practices for table creation within transactions. This approach simplifies the process by relying on _commit_table to generate and commit metadata from scratch, eliminating the need to distinguish between RestCatalogs and other catalog types during transaction commitments.

Additionally, I've noted that the existing create_table and new_table_metadata APIs lack support for initializing metadata with snapshot information. I think that responsibility should belong to AddSnapshotUpdate and update_table_metadata. Thus, I've opted to maintain the current approach of utilizing _commit_table for both functions.
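
To make the control flow described here concrete, the following is a toy sketch of a commit routine that falls back to building metadata from scratch when the table does not exist yet. Metadata is modeled as a plain dict and updates as key/value pairs; `ToyCatalog` and `apply_updates` are assumptions made for illustration and do not reflect the real signatures.

```python
from typing import Dict, Iterable, Tuple

Metadata = Dict[str, object]
Update = Tuple[str, object]


class NoSuchTableError(Exception):
    pass


def apply_updates(base: Metadata, updates: Iterable[Update]) -> Metadata:
    # Each update overwrites one metadata field; the input dict is left untouched.
    metadata = dict(base)
    for key, value in updates:
        metadata[key] = value
    return metadata


class ToyCatalog:
    def __init__(self) -> None:
        self._tables: Dict[str, Metadata] = {}

    def load_metadata(self, identifier: str) -> Metadata:
        try:
            return self._tables[identifier]
        except KeyError as e:
            raise NoSuchTableError(f"Table does not exist: {identifier}") from e

    def commit_table(self, identifier: str, updates: Iterable[Update]) -> Metadata:
        try:
            current = self.load_metadata(identifier)
        except NoSuchTableError:
            # Create path: no existing table, so rebuild from an empty base.
            current = {}
        updated = apply_updates(current, updates)
        self._tables[identifier] = updated
        return updated


catalog = ToyCatalog()
first = catalog.commit_table("db.tbl", [("format-version", 2), ("schema", "a: int")])
second = catalog.commit_table("db.tbl", [("schema", "a: int, b: int")])
```

Both the create and the update go through the same entry point, which is the uniformity argued for above.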

Does this approach sound reasonable to you? Please feel free to correct me if I've misunderstood any aspect of this process. Thanks for your input!

Fokko · 2024-03-14T11:30:35Z

pyiceberg/table/__init__.py

@@ -382,7 +383,54 @@ def commit_transaction(self) -> Table:
            return self._table


+class CreateTableTransaction(Transaction):
+    @staticmethod
+    def create_changes(table_metadata: TableMetadata) -> Tuple[TableUpdate, ...]:


I think we can also accumulate the changes in _updates of the Transaction itself

HonahX · 2024-03-20T08:38:37Z

I updated the implementation to take the third approach:

a third approach that we may add a flag to TableUpdate to mark it as create changes and let _apply_table_update handle these updates specially. We will exclude this flag from serialization so RestCatalog is not affected. If it can work, we do not need this additional class anymore.

This approach lets us get rid of the additional metadata class and requires only small changes to the current update_table_metadata mechanism.
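
A rough sketch of the flag idea, using only the standard library: the update carries a marker that the apply step can inspect, while the serializer skips it. The field name `initial_change`, the `to_wire` helper, and the special handling shown are all assumptions for illustration; with pydantic models the same effect would typically come from `Field(exclude=True)`.

```python
from dataclasses import dataclass, field, fields
from typing import Any, Dict, List


@dataclass(frozen=True)
class AddSchemaUpdate:
    schema_id: int
    # Hypothetical marker for updates generated while creating the table.
    # The metadata tag tells the serializer below to leave it out.
    initial_change: bool = field(default=False, metadata={"exclude": True})


def to_wire(update: AddSchemaUpdate) -> Dict[str, Any]:
    # Serialize every field except the ones tagged as excluded.
    return {
        f.name: getattr(update, f.name)
        for f in fields(update)
        if not f.metadata.get("exclude")
    }


def apply_update(metadata: Dict[str, Any], update: AddSchemaUpdate) -> Dict[str, Any]:
    schemas: List[int] = list(metadata.get("schemas", []))
    if update.initial_change:
        # Create path (assumed behaviour): replace the placeholder state.
        schemas = [update.schema_id]
    else:
        schemas.append(update.schema_id)
    return {**metadata, "schemas": schemas}


wire = to_wire(AddSchemaUpdate(schema_id=1, initial_change=True))
created = apply_update({"schemas": [0]}, AddSchemaUpdate(1, initial_change=True))
evolved = apply_update({"schemas": [0]}, AddSchemaUpdate(1))
```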

I will address other review comments soon

HonahX

Shall we move "append", "overwrite", and "add_files" to Transaction class? This change would enable us to seamlessly chain these operations with other table updates in a single commit. This adjustment could be particularly beneficial in the context of CreateTableTransaction, as it would enable users to not only create a table but also populate it with initial data in one go.
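
The benefit described here can be shown with a toy transaction that queues every operation and commits once on exit. `ToyTable` and `ToyTransaction` are hypothetical stand-ins; the real write path is considerably more involved.

```python
from typing import Any, List, Tuple


class ToyTable:
    def __init__(self) -> None:
        # Each entry is one commit: a tuple of the updates it contained.
        self.commits: List[Tuple[Any, ...]] = []


class ToyTransaction:
    def __init__(self, table: ToyTable) -> None:
        self._table = table
        self._updates: List[Any] = []

    def set_properties(self, **properties: str) -> "ToyTransaction":
        self._updates.append(("set-properties", properties))
        return self

    def append(self, rows: List[dict]) -> "ToyTransaction":
        self._updates.append(("append", list(rows)))
        return self

    def __enter__(self) -> "ToyTransaction":
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        # Commit everything at once, and only if the block succeeded.
        if exc_type is None and self._updates:
            self._table.commits.append(tuple(self._updates))


table = ToyTable()
with ToyTransaction(table) as txn:
    txn.set_properties(owner="me")
    txn.append([{"a": 1}, {"a": 2}])
```

Here the property change and the data append end up in a single commit instead of two.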

pyiceberg/table/metadata.py

HonahX · 2024-03-25T01:08:03Z

pyiceberg/catalog/glue.py

pyiceberg/catalog/__init__.py

Fokko

Thanks again for working on this @HonahX. I think we're almost there 🚀

pyiceberg/catalog/__init__.py

Fokko · 2024-03-25T05:13:48Z

pyiceberg/catalog/__init__.py

@@ -717,6 +791,10 @@ def _get_updated_props_and_update_summary(

        return properties_update_summary, updated_properties

+    @staticmethod
+    def empty_table_metadata() -> TableMetadata:
+        return TableMetadataV1(location="", last_column_id=-1, schema=Schema())


I recommend creating a V2 table by default. This is also the case for Java. Partition evolution is very awkward for V1 tables (keeping the old partitions as null-transforms).

The default is still a V2 table. When creating a CreateTableTransaction, we first call new_table_metadata, which by default gives V2 metadata. Then, we parse the initial metadata into a sequence of TableUpdates, including an UpgradeFormatVersionUpdate. Finally, when we rebuild the metadata in _commit_table, the UpgradeFormatVersionUpdate will bump the metadata to the correct version.

If we use V2Metadata here, we won't be able to create any V1 table, since we cannot downgrade from V2 to V1. I think this is the same issue in iceberg-rest that prevents us from creating a V1 table via create-table-transaction (#498 (comment)). I should follow up on that later.
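
The reasoning above comes down to format-version changes being upgrade-only. A small sketch of that constraint, with hypothetical function names, shows why starting from a V1 base keeps both targets reachable while starting from V2 does not:

```python
def upgrade_format_version(current: int, target: int) -> int:
    # Format versions can only move forward; a downgrade is rejected.
    if target < current:
        raise ValueError(f"Cannot downgrade v{current} table to v{target}")
    return target


def rebuild_format_version(base_version: int, requested_version: int) -> int:
    """Replay the upgrade update on top of the empty base metadata."""
    return upgrade_format_version(base_version, requested_version)
```

With a V1 base, `rebuild_format_version(1, 1)` and `rebuild_format_version(1, 2)` both succeed, whereas `rebuild_format_version(2, 1)` raises.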

I added some comments to explain the purpose and also made it private.

Thanks for the detailed explanation, that makes sense to me 👍

Fokko · 2024-03-25T09:10:24Z

pyiceberg/catalog/rest.py

+        return TableResponse(**response.json())
+
+    @retry(**_RETRY_ARGS)
+    def _create_staged_table(


The inheritance feels off here. I would just expect to create_table_transaction. The current _create_staged_table on the Catalog is now very specific for non-REST catalogs. Should we introduce another layer in the OOP hierarchy there? Then we can override create_table_transaction there.

Thanks! This is a great suggestion! After going through the current code, I found that not only _create_staged_table but also many other helper functions in Catalog are specific to non-REST catalogs. Hence, I added a new class named MetastoreCatalog (I'd appreciate any suggestions on naming) and made all non-REST catalogs inherit from it instead.
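
The resulting hierarchy might look roughly like the sketch below. The classes reuse the names from this thread but are toy stand-ins that return strings, purely to show where each method lives; they are not the PyIceberg classes.

```python
from abc import ABC, abstractmethod


class Catalog(ABC):
    """Public interface shared by every catalog."""

    @abstractmethod
    def create_table_transaction(self, identifier: str) -> str:
        ...


class MetastoreCatalog(Catalog, ABC):
    """Intermediate layer holding helpers that only non-REST catalogs need."""

    def _create_staged_table(self, identifier: str) -> str:
        return f"staged:{identifier}"

    def create_table_transaction(self, identifier: str) -> str:
        # The client builds the staged table itself.
        return f"txn({self._create_staged_table(identifier)})"


class GlueCatalog(MetastoreCatalog):
    pass


class RestCatalog(Catalog):
    def create_table_transaction(self, identifier: str) -> str:
        # The server stages the table, so no client-side staging helper is needed.
        return f"txn(server-staged:{identifier})"


glue_txn = GlueCatalog().create_table_transaction("db.tbl")
rest_txn = RestCatalog().create_table_transaction("db.tbl")
```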

Since it is a big refactoring, please let me know if you want this to happen in a follow-up PR.

pyiceberg/table/__init__.py

syun64 · 2024-04-03T00:42:31Z

Shall we move "append", "overwrite", and "add_files" to Transaction class? This change would enable us to seamlessly chain these operations with other table updates in a single commit. This adjustment could be particularly beneficial in the context of CreateTableTransaction, as it would enable users to not only create a table but also populate it with initial data in one go.

I think this is a great question.

I think we have two options here:

We move these actions into the Transaction class, and remove them from Table class
We move them into the Transaction class, and also keep an implementation in the Table class

I'm not sure which of the above two is better, but I keep asking myself whether there's a 'good' reason why we have two separate APIs that achieve similar results.

For example, we have update_spec, update_schema that can be created from the Transaction or the Table, and I feel like we might be creating work for ourselves by duplicating the feature in both classes. What if we consolidated all of our actions into the Transaction class, and removed them from the Table class?

I think the upside of that would be that the API would convey a very clear message to the developer that a transaction is committed to a table, and that a series of actions can be chained onto the same transaction, as a single commit.

In addition, we can avoid issues like this where we roll out a feature to one API implementation, but not the other.

    with given_table.update_schema() as tx:
        tx.add_column(path="new_column1", field_type=IntegerType())

    with given_table.transaction() as tx:
        with tx.update_schema() as update:
            update.add_column(path="new_column1", field_type=IntegerType())

To me, the bottom pattern feels more explicit than the top one, and I'm curious to hear others' opinions on this topic.

Fokko

Left some small comments, but looks good to me. Thanks for adding the MetastoreCatalog layer in there, I think this cleans up the hierarchy quite a bit 👍

mkdocs/docs/api.md

pyiceberg/catalog/__init__.py

Fokko · 2024-04-03T11:55:09Z

pyiceberg/catalog/__init__.py

pyiceberg/table/__init__.py

HonahX · 2024-04-04T05:01:00Z

Thanks @Fokko @syun64 for the detailed reviews! Since this PR changes and refactors a lot of code, I'll merge it first and leave the remaining work (such as supporting CreateTableTransaction in Hive and Sql) to follow-up PRs.

HonahX added 3 commits March 5, 2024 00:34

initial example

998c6f1

add integration test

8eace7c

lint

64e6346

HonahX commented Mar 5, 2024

HonahX added 2 commits March 5, 2024 01:54

fix test

ffb8ff6

Merge branch 'main' into create_table_transaction_experiment

3a579cd

syun64 reviewed Mar 5, 2024

HonahX linked an issue Mar 6, 2024 that may be closed by this pull request

Support CreateTableTransaction #483

Closed

HonahX added 4 commits March 6, 2024 23:49

Merge branch 'main' into create_table_transaction_experiment

049e0e2

move create_staged catalog to init

df0c5ed

Merge branch 'main' into create_table_transaction_experiment

755aebf

Add IncompleteMetadata

c98b3b4

HonahX commented Mar 12, 2024

syun64 mentioned this pull request Mar 12, 2024

Add Data Files from Parquet Files to UnPartitioned Table #506

Merged

Add support for rest

09b60ca

Fokko changed the title ~~[WIP][Discussion]CreateTableTransaction Implementation~~ [WIP][Discussion] CreateTableTransaction Implementation Mar 14, 2024

Fokko reviewed Mar 14, 2024

HonahX added 7 commits March 19, 2024 23:40

Merge branch 'main' into create_table_transaction_experiment

04ef8df

# Conflicts: # pyiceberg/catalog/__init__.py # pyiceberg/table/__init__.py # tests/catalog/test_glue.py

Fix merge issue

211de32

get rid of IncompleteTableMetadata

d57ac1c

change to the third approach

simplify code

978a0aa

fix small issue

a413c2e

remove extra line

ad840d5

Merge branch 'main' into create_table_transaction_experiment

47ce986

# Conflicts: # tests/conftest.py

HonahX added 3 commits March 23, 2024 21:05

Merge branch 'main' into create_table_transaction_experiment

1f5cc28

accumulates initial updates in self._updates in transaction directly

9ac2f7f

update comments

d2617fb

update tablemetadata v1 update

44df2d7

HonahX mentioned this pull request Mar 24, 2024

[Bug Fix] Fix TableMetadataV1 Validators #544

Merged

HonahX added 2 commits March 24, 2024 17:04

add doc

1a4d262

fix lint

f7c04cf

HonahX commented Mar 25, 2024

HonahX changed the title ~~[WIP][Discussion] CreateTableTransaction Implementation~~ Support CreateTableTransaction in Glue and Rest Mar 25, 2024

HonahX marked this pull request as ready for review March 25, 2024 01:36

Fokko reviewed Mar 25, 2024

syun64 added this to the PyIceberg 0.7.0 release milestone Mar 26, 2024

syun64 mentioned this pull request Mar 26, 2024

Support Defining PartitionSpec and SortOrder without field-ids in create_table #338

Open

HonahX added 3 commits March 26, 2024 22:20

Merge branch 'main' into create_table_transaction_experiment

cecf1c0

# Conflicts: # tests/table/test_metadata.py

revert some change

2152542

refactor catalog interface, add MetastoreCatalog

c449fb0

syun64 mentioned this pull request Apr 3, 2024

Move writes to Transaction #571

Merged

Fokko approved these changes Apr 3, 2024

HonahX added 2 commits April 3, 2024 21:42

Merge branch 'main' into create_table_transaction_experiment

8fc1562

update api doc

b99c619

HonahX merged commit a892309 into apache:main Apr 4, 2024
7 checks passed

This was referenced Apr 24, 2024

Refactor GlueCatalog's _commit_table #653

Merged

table_exists unit/integration test for NoSuchTableError #678

Merged

Support CreateTableTransaction for HiveCatalog #683

Merged

Support CreateTableTransaction for SqlCatalog #684

Merged
