-
Notifications
You must be signed in to change notification settings - Fork 722
Feature: Write to Glue Governed Tables #560
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…aws-data-wrangler into feature/lf-transactions
…aws-data-wrangler into feature/lf-transactions
jaidisido
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added clarifications
| if (catalog_table_input is None) and (table_type == "GOVERNED"): | ||
| catalog._create_parquet_table( # pylint: disable=protected-access | ||
| database=database, | ||
| table=table, | ||
| path=path, # type: ignore | ||
| columns_types=columns_types, | ||
| table_type=table_type, | ||
| partitions_types=partitions_types, | ||
| bucketing_info=bucketing_info, | ||
| compression=compression, | ||
| description=description, | ||
| parameters=parameters, | ||
| columns_comments=columns_comments, | ||
| boto3_session=session, | ||
| mode=mode, | ||
| catalog_versioning=catalog_versioning, | ||
| projection_enabled=projection_enabled, | ||
| projection_types=projection_types, | ||
| projection_ranges=projection_ranges, | ||
| projection_values=projection_values, | ||
| projection_intervals=projection_intervals, | ||
| projection_digits=projection_digits, | ||
| catalog_id=catalog_id, | ||
| catalog_table_input=catalog_table_input, | ||
| ) | ||
| catalog_table_input = catalog._get_table_input( # pylint: disable=protected-access | ||
| database=database, table=table, boto3_session=session, catalog_id=catalog_id | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is admittedly ugly (we are making the create_parquet_table and get_table_input calls twice). But is required to handle cases where a "GOVERNED" table does not exist and must be created before running the _to_dataset call. Open to suggestions on how to avoid it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I couldn't think in any better alternative now. I will try to think more about it later. 🤔
| if (catalog_table_input is None) and (table_type == "GOVERNED"): | ||
| catalog._create_parquet_table( # pylint: disable=protected-access | ||
| database=database, | ||
| table=table, | ||
| path=path, # type: ignore | ||
| columns_types=columns_types, | ||
| table_type=table_type, | ||
| partitions_types=partitions_types, | ||
| bucketing_info=bucketing_info, | ||
| compression=compression, | ||
| description=description, | ||
| parameters=parameters, | ||
| columns_comments=columns_comments, | ||
| boto3_session=session, | ||
| mode=mode, | ||
| catalog_versioning=catalog_versioning, | ||
| projection_enabled=projection_enabled, | ||
| projection_types=projection_types, | ||
| projection_ranges=projection_ranges, | ||
| projection_values=projection_values, | ||
| projection_intervals=projection_intervals, | ||
| projection_digits=projection_digits, | ||
| catalog_id=catalog_id, | ||
| catalog_table_input=catalog_table_input, | ||
| ) | ||
| catalog_table_input = catalog._get_table_input( # pylint: disable=protected-access | ||
| database=database, table=table, boto3_session=session, catalog_id=catalog_id | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I couldn't think in any better alternative now. I will try to think more about it later. 🤔
| if catalog_table_input: | ||
| table_type = catalog_table_input["TableType"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was pointed to me that there is no need to pass table_type="GOVERNED" if the table already exists and should instead be retrieved from the catalog so I have fixed it here
* Initial Commit * Minor - Refactoring Work Units Logic * Major - Checkpoint w/ functional read code/example * Initial Commit * Minor - Refactoring Work Units Logic * Major - Checkpoint w/ functional read code/example * Minor - Removing unnecessary ensure_session * Minor - Adding changes from comments and review * Minor - Adding Abort, Begin, Commit and Extend transactions * Minor - Adding missing functions * Minor - Adding missing @Property * Minor - Disable too many public methods * Minor - Checkpoint * Major - Governed tables write operations tested * Minor - Adding validate flow on branches * Minor - reducing static checks * Minor - Adding to_csv code * Minor - Disabling too-many-branches * Major - Ready for release * Minor - Proofreading * Minor - Removing needless use_threads argument * Minor - Removing the need to specify table_type when table is already created * Minor - Fixing _catalog_id call * Minor - Clarifying SQL filter operation * Minor - Removing type ignore
* Initial Commit * Minor - Refactoring Work Units Logic * Major - Checkpoint w/ functional read code/example * Initial Commit * Minor - Refactoring Work Units Logic * Major - Checkpoint w/ functional read code/example * Minor - Removing unnecessary ensure_session * Minor - Adding changes from comments and review * Minor - Adding Abort, Begin, Commit and Extend transactions * Minor - Adding missing functions * Minor - Adding missing @Property * Minor - Disable too many public methods * Minor - Checkpoint * Major - Governed tables write operations tested * Minor - Adding validate flow on branches * Minor - reducing static checks * Minor - Adding to_csv code * Minor - Disabling too-many-branches * Major - Ready for release * Minor - Proofreading * Minor - Removing needless use_threads argument * Minor - Removing the need to specify table_type when table is already created * Minor - Fixing _catalog_id call * Minor - Clarifying SQL filter operation * Minor - Removing type ignore
* Feature: Write to Glue Governed Tables (#560) * Initial Commit * Minor - Refactoring Work Units Logic * Major - Checkpoint w/ functional read code/example * Initial Commit * Minor - Refactoring Work Units Logic * Major - Checkpoint w/ functional read code/example * Minor - Removing unnecessary ensure_session * Minor - Adding changes from comments and review * Minor - Adding Abort, Begin, Commit and Extend transactions * Minor - Adding missing functions * Minor - Adding missing @Property * Minor - Disable too many public methods * Minor - Checkpoint * Major - Governed tables write operations tested * Minor - Adding validate flow on branches * Minor - reducing static checks * Minor - Adding to_csv code * Minor - Disabling too-many-branches * Major - Ready for release * Minor - Proofreading * Minor - Removing needless use_threads argument * Minor - Removing the need to specify table_type when table is already created * Minor - Fixing _catalog_id call * Minor - Clarifying SQL filter operation * Minor - Removing type ignore * Minor - Reducing scope gitworkflow * Minor - Fixing _sanitize_name * Minor - Adding map_types flag * Minor - Aligning optional path argument with main branch * Minor tests adjustments * Minor - Removing Chunked parameter * Fixing issues from diverged branch * Major - M1 Launch (Stable) * Mmproving read by concatenating zero copies of arrow ttables * [skip ci] - Minor - Killing thread * [skip ci] - Minor - Passing client instead of session * Major - Adding Metadata Transaction API changes * Minor - Adding query as of time to test * [skip ci] - syncing with main branch * Merge main and adapt tests to API changes from Erie team * Merge main * Merging main - move to poetry * Merge main and resolve conflicts * Minor - Sync with main * Merging with main * Lint * Fixing get_table_obj retries * Green tests * Minor - Fixing automated merge * LakeFormation test infra * Commit protocol change - Erie * [skip ci] - Minor - Fixing catalog unit test * [skip ci] - Minor - Adding transaction_id to does_table_exist * Minor - Missing projection_storage_location_template * Upgrading botocore * xfail moto * Adding s3fs to tox * LF concurrent modification exception * catalog.py test * lint
Description of changes:
New feature implementing the ability to write to an AWS Glue Governed table with Lake Formation. This is an extension to the existing
wr.s3.to_parquetmethod.Example:
User must pass
table_type="GOVERNED"to enable the feature. They can specify atransaction_idor no argument (i.e.transaction_idautomatically generated). The command returns the S3 paths of the created parquet objects and any partition values. All of "append", "overwrite" and "overwrite_partitions" modes are supported. As part of the change, it's not necessary to specify thepathargument if a Glue table already exists (i.e. path is obtained from the Glue table metadata).Test:
pytest -n 8 tests/test__routines.pypytest -n 8 tests/test_lakeformation.pyUser must have permissions to create a Governed Table in Lake Formation.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.