Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support CreateTableTransaction in Glue and Rest #498

Merged
merged 28 commits into from
Apr 4, 2024

Conversation

HonahX
Copy link
Contributor

@HonahX HonahX commented Mar 5, 2024

This PR contains implementation for CreateTableTransaction.

On Table Side:

  • Add StagedTable which represents a uncommitted table, inherits Table and disables table scan.
  • Add CreateTableTransaction which inherits Transaction, takes in a StagedTable and add initial create table updates to the self._updates
  • Other helpers to help build metadata from a chain of updates

On Catalog Side:

  • Add _create_staged_table API which returns a StagedTable
  • Add create_table_transaction API
  • Implement the new APIs in GlueCatalog
  • Implement the new APIs in RestCatalog

The example usage is

with test_catalog.create_table_transaction(identifier, table_schema_nested) as txn:
        with txn.update_schema() as update_schema:
            update_schema.add_column(path="b", field_type=IntegerType())

        txn.set_properties(test_a="test_aa", test_b="test_b", test_c="test_c")

Appreciate any reviews, suggestions and thoughts!

cc @syun64 @Fokko

pyiceberg/table/__init__.py Outdated Show resolved Hide resolved
@@ -358,6 +365,33 @@ def _get_glue_table(self, database_name: str, table_name: str) -> TableTypeDef:
except self.glue.exceptions.EntityNotFoundException as e:
raise NoSuchTableError(f"Table does not exist: {database_name}.{table_name}") from e

def _create_staged_table(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we might have an opportunity here to move _create_staged_table function into pyiceberg/catalog/init.py and refactor the existing create_table functions on the different catalog implementations to all use _create_staged_table to create an instance of the StagedTable, and then commit that StagedTable to the catalog backend.

I think what sets each catalog's implementation of create_table apart is how it handles the commit against the catalog backened, but they all seem to share the same sequence of operations in how it instantiates its notion of a new table.

What are your thoughts on this idea @HonahX ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great suggestion! In the initial implementation I did not pay much attention to the catalog code organization. Let me refactor it.

@HonahX HonahX linked an issue Mar 6, 2024 that may be closed by this pull request
pyiceberg/table/metadata.py Outdated Show resolved Hide resolved
@Fokko Fokko changed the title [WIP][Discussion]CreateTableTransaction Implementation [WIP][Discussion] CreateTableTransaction Implementation Mar 14, 2024
Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking great @HonahX left some of my thoughts in the PR.

pyiceberg/table/__init__.py Outdated Show resolved Hide resolved
pyiceberg/table/metadata.py Outdated Show resolved Hide resolved
def test_create_table_transaction(session_catalog: Catalog, format_version: int) -> None:
if format_version == 1:
pytest.skip(
"There is a bug in the REST catalog (maybe server side) that prevents create and commit a staged version 1 table"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting, let me track this down.

Copy link
Contributor Author

@HonahX HonahX Mar 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! My initial guess is that CatalogHandler use the default values to build the initial empty metadata. But the default format-version is now 2 so it's impossible for the server to build v1 metadata in this case.

I remembered that iceberg-rest used the CatalogHandler to wrap the SqlCatalog. I will double-check that this weekend and do more tests

pyiceberg/catalog/__init__.py Show resolved Hide resolved
return CommitTableResponse(metadata=updated_metadata, metadata_location=new_metadata_location)
except NoSuchTableError:
# Create the table
updated_metadata = construct_table_metadata(table_request.updates)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would expect create_table to be called from the _commit from the Transaction.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking of it, and my rationale for the current implementation centers around ensuring a uniform transaction creation and commit process for both RestCatalogs and other types of catalogs. Specifically, for RestCatalogs, it's required to initiate CreateTableTransaction with _create_table(staged_create=True) and to use _commit_table with both initial and subsequent updates during transaction commitment. On the other hand, alternative catalogs offer more flexibility, allowing for either the use of _commit_table to reconstruct table metadata upon commitment or a modified _create_table API to create table during the transaction commitment.

Considering pyiceberg's alignment with Rest API principles, where _commit_table aggregates metadata updates to construct the revised metadata for table updates within a transaction, it seems prudent to maintain consistency with Rest API practices for table creation within transactions. This approach simplifies the process by relying on _commit_table to generate and commit metadata from scratch, eliminating the need to distinguish between RestCatalogs and other catalog types during transaction commitments.

Additionally, I've noted that the existing create_table and new_table_metadata APIs lack support for initializing metadata with snapshot information. I think that responsibility should belong to AddSnapshotUpdate and update_table_metadata. Thus, I've opted to maintain the current approach of utilizing _commit_table for both functions.

Does this approach sound reasonable to you? Please feel free to correct me if I've misunderstood any aspect of this process. Thanks for your input!

@@ -382,7 +383,54 @@ def commit_transaction(self) -> Table:
return self._table


class CreateTableTransaction(Transaction):
@staticmethod
def create_changes(table_metadata: TableMetadata) -> Tuple[TableUpdate, ...]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can also accumulate the changes in _updates of the Transaction itself

# Conflicts:
#	pyiceberg/catalog/__init__.py
#	pyiceberg/table/__init__.py
#	tests/catalog/test_glue.py
change to the third approach
@HonahX
Copy link
Contributor Author

HonahX commented Mar 20, 2024

I updated the implementation to take the third approach:

a third approach that we may add a flag to TableUpdate to mark it as create changes and let _apply_table_update handle these updates specially. We will exclude this flag from serialization so RestCatalog is not affected. If it can work, we do not need this additional class anymore.

This approach let us get rid of the additional metadata class and requires only small changes on the current update_table_metadata mechanism.

I will address other review comments soon

Copy link
Contributor Author

@HonahX HonahX left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we move "append", "overwrite", and "add_files" to Transaction class? This change would enable us to seamlessly chain these operations with other table updates in a single commit. This adjustment could be particularly beneficial in the context of CreateTableTransaction, as it would enable users to not only create a table but also populate it with initial data in one go.

pyiceberg/table/metadata.py Outdated Show resolved Hide resolved
return CommitTableResponse(metadata=updated_metadata, metadata_location=new_metadata_location)
except NoSuchTableError:
# Create the table
updated_metadata = construct_table_metadata(table_request.updates)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking of it, and my rationale for the current implementation centers around ensuring a uniform transaction creation and commit process for both RestCatalogs and other types of catalogs. Specifically, for RestCatalogs, it's required to initiate CreateTableTransaction with _create_table(staged_create=True) and to use _commit_table with both initial and subsequent updates during transaction commitment. On the other hand, alternative catalogs offer more flexibility, allowing for either the use of _commit_table to reconstruct table metadata upon commitment or a modified _create_table API to create table during the transaction commitment.

Considering pyiceberg's alignment with Rest API principles, where _commit_table aggregates metadata updates to construct the revised metadata for table updates within a transaction, it seems prudent to maintain consistency with Rest API practices for table creation within transactions. This approach simplifies the process by relying on _commit_table to generate and commit metadata from scratch, eliminating the need to distinguish between RestCatalogs and other catalog types during transaction commitments.

Additionally, I've noted that the existing create_table and new_table_metadata APIs lack support for initializing metadata with snapshot information. I think that responsibility should belong to AddSnapshotUpdate and update_table_metadata. Thus, I've opted to maintain the current approach of utilizing _commit_table for both functions.

Does this approach sound reasonable to you? Please feel free to correct me if I've misunderstood any aspect of this process. Thanks for your input!

pyiceberg/catalog/__init__.py Show resolved Hide resolved
@HonahX HonahX changed the title [WIP][Discussion] CreateTableTransaction Implementation Support CreateTableTransaction in Glue and Rest Mar 25, 2024
@HonahX HonahX marked this pull request as ready for review March 25, 2024 01:36
Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again for working on this @HonahX. I think we're almost there 🚀

pyiceberg/catalog/__init__.py Show resolved Hide resolved
@@ -717,6 +791,10 @@ def _get_updated_props_and_update_summary(

return properties_update_summary, updated_properties

@staticmethod
def empty_table_metadata() -> TableMetadata:
return TableMetadataV1(location="", last_column_id=-1, schema=Schema())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recommend creating a V2 table by default. This is also the case for Java. Partition evolution is very awkward for V1 tables (keeping the old partitions as null-transforms).

Copy link
Contributor Author

@HonahX HonahX Mar 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default is still V2 table. When creating a createTableTransaction, we first call new_table_metadata which by default gives a V2 metadata. Then, we parse the initial metadata to a sequence of TableUpdates, including an UpgradeFormatVersionUpdate. Finally when we re-build the metadata in _commit_table, the UpgradeFormatVersionUpdate will bump the metadata to the correct version.

If we use V2Metadata here, we won't be able to create any V1 table since we cannot downgrade from V2 to V1. I think this is the same issue in iceberg-rest that prevents us from creating V1 table via create-table-transaction.#498 (comment) I should follow-up that later.

I added some comments to explain the purpose and also make it private.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the detailed explanation, that makes sense to me 👍

return TableResponse(**response.json())

@retry(**_RETRY_ARGS)
def _create_staged_table(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The inheritance feels off here. I would just expect to create_table_transaction. The current _create_staged_table on the Catalog is now very specific for non-REST catalogs. Should we introduce another layer in the OOP hierarchy there? Then we can override create_table_transaction there.

Copy link
Contributor Author

@HonahX HonahX Mar 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! This is a great suggestion! After going through the current code, I find that not only _create_staged_table but also many other helper functions in Catalog are specific to non-Rest Catalogs. Hence, I add a new class named MetastoreCatalog (appreciate any suggestions on naming) and make all non-Rest Catalogs inherit from it instead.

Since it is a big refactoring, please let me know if you want this to happen in a follow-up PR.

pyiceberg/table/__init__.py Outdated Show resolved Hide resolved
@syun64
Copy link
Collaborator

syun64 commented Apr 3, 2024

Shall we move "append", "overwrite", and "add_files" to Transaction class? This change would enable us to seamlessly chain these operations with other table updates in a single commit. This adjustment could be particularly beneficial in the context of CreateTableTransaction, as it would enable users to not only create a table but also populate it with initial data in one go.

I think this is a great question.

I think we have two options here:

  1. We move these actions into the Transaction class, and remove them from Table class
  2. We move them into the Transaction class, and also keep an implementation in the Table class

I'm not sure which of the above two are better, but I keep asking myself whether there's a 'good' reason why we have two separate APIs that achieve similar results.

For example, we have update_spec, update_schema that can be created from the Transaction or the Table, and I feel like we might be creating work for ourselves by duplicating the feature in both classes. What if we consolidated all of our actions into the Transaction class, and removed them from the Table class?

I think the upside of that would be that API would convey a very clear message to the developer that a transaction is committed to a table, and that a series of actions can be chained onto the same transaction, as a single commit.

In addition, we can avoid issues like this where we roll out a feature to one API implementation, but not the other.

    with given_table.update_schema() as tx:
        tx.add_column(path="new_column1", field_type=IntegerType())
    with given_table.transaction() as tx:
        with tx.update_schema() as update:
            update.add_column(path="new_column1", field_type=IntegerType())

To me, the bottom pattern feels more explicit than the above option, and I'm curious to hear others' opinions on this topic

@syun64 syun64 mentioned this pull request Apr 3, 2024
Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some small comments, but looks good to me. Thanks for adding the MetastoreCatalog layer in there, I think this cleans up the hierarchy quite a bit 👍

mkdocs/docs/api.md Outdated Show resolved Hide resolved
pyiceberg/catalog/__init__.py Show resolved Hide resolved
@@ -717,6 +791,10 @@ def _get_updated_props_and_update_summary(

return properties_update_summary, updated_properties

@staticmethod
def empty_table_metadata() -> TableMetadata:
return TableMetadataV1(location="", last_column_id=-1, schema=Schema())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the detailed explanation, that makes sense to me 👍

pyiceberg/table/__init__.py Show resolved Hide resolved
@HonahX
Copy link
Contributor Author

HonahX commented Apr 4, 2024

Thanks @Fokko @syun64 for detailed reviewing! Since this PR changes/refactors lots of code, I'll merge this first and leave other works (such as supporting createTableTrans in Hive and Sql) to follow-up PRs

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support CreateTableTransaction
3 participants