Feature: Glue Governed Tables (#560) #571
```python
if (catalog_table_input is None) and (table_type == "GOVERNED"):
    catalog._create_parquet_table(  # pylint: disable=protected-access
        database=database,
        table=table,
        path=path,  # type: ignore
        columns_types=columns_types,
        table_type=table_type,
        partitions_types=partitions_types,
        bucketing_info=bucketing_info,
        compression=compression,
        description=description,
        parameters=parameters,
        columns_comments=columns_comments,
        boto3_session=session,
        mode=mode,
        catalog_versioning=catalog_versioning,
        projection_enabled=projection_enabled,
        projection_types=projection_types,
        projection_ranges=projection_ranges,
        projection_values=projection_values,
        projection_intervals=projection_intervals,
        projection_digits=projection_digits,
        catalog_id=catalog_id,
        catalog_table_input=catalog_table_input,
    )
    catalog_table_input = catalog._get_table_input(  # pylint: disable=protected-access
        database=database, table=table, boto3_session=session, catalog_id=catalog_id
    )
```
This is admittedly ugly (we make the `_create_parquet_table` and `_get_table_input` calls twice), but it is required to handle cases where a "GOVERNED" table does not exist and must be created before running the `_to_dataset` call. Open to suggestions on how to avoid it.
Could `_create_parquet_table` return the `catalog_table_input`, to save the last call? Not 100% sure about that, but it seems like it would work, based on the similar structure elsewhere.
* Initial Commit
* Minor - Refactoring Work Units Logic
* Major - Checkpoint w/ functional read code/example
* Initial Commit
* Minor - Refactoring Work Units Logic
* Major - Checkpoint w/ functional read code/example
* Minor - Removing unnecessary ensure_session
* Minor - Adding changes from comments and review
* Minor - Adding Abort, Begin, Commit and Extend transactions
* Minor - Adding missing functions
* Minor - Adding missing @Property
* Minor - Disable too many public methods
* Minor - Checkpoint
* Major - Governed tables write operations tested
* Minor - Adding validate flow on branches
* Minor - reducing static checks
* Minor - Adding to_csv code
* Minor - Disabling too-many-branches
* Major - Ready for release
* Minor - Proofreading
* Minor - Removing needless use_threads argument
* Minor - Removing the need to specify table_type when table is already created
* Minor - Fixing _catalog_id call
* Minor - Clarifying SQL filter operation
* Minor - Removing type ignore
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository
👏 👏 👏
Description of changes:
Implements read and write operations on AWS Glue Governed tables.
Ability to query an AWS Glue Governed table with Lake Formation. Example:
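A usage sketch of the read path (database and table names are hypothetical, and running it requires AWS credentials and an existing Governed table):

```python
import awswrangler as wr

# Query a Governed table at a point in time, or within a transaction.
# Passing neither argument lets the library start a transaction itself.
df = wr.lakeformation.read_sql_query(
    sql="SELECT * FROM gov_people;",
    database="my_database",  # hypothetical database name
    # query_as_of_time="...",  # or: transaction_id="..."
)
```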
The user can specify a `transaction_id`, a `query_as_of_time`, or no argument at all (in which case a `transaction_id` is generated automatically). The command returns a single Pandas DataFrame or an Iterator of DataFrames. `use_threads` is set to True by default to distribute the query execution.

Ability to write to an AWS Glue Governed table with Lake Formation. This is an extension of the existing `wr.s3.to_parquet` method. Example:

The user must pass `table_type="GOVERNED"` when creating a new table, and can specify a `transaction_id` or no argument (in which case a `transaction_id` is generated automatically). The command returns the S3 paths of the created Parquet objects and any partition values. The "append", "overwrite" and "overwrite_partitions" modes are all supported. As part of this change, it is no longer necessary to specify the `path` argument when the Glue table already exists (the path is obtained from the Glue table metadata).

Test:
```
pytest -n 8 tests/test__routines.py
pytest -n 8 tests/test_lakeformation.py
```

The user must have permissions to create a Governed table in Lake Formation.
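A usage sketch of the write path described above (bucket, database and table names are hypothetical; running it requires AWS credentials and Lake Formation permissions):

```python
import awswrangler as wr
import pandas as pd

df = pd.DataFrame({"id": [1, 2], "name": ["Ana", "Bob"]})

# table_type="GOVERNED" is only required when creating a new table;
# path can be omitted once the Glue table already exists.
res = wr.s3.to_parquet(
    df=df,
    path="s3://my-bucket/gov_people/",  # hypothetical bucket
    dataset=True,
    database="my_database",
    table="gov_people",
    table_type="GOVERNED",
    mode="append",
)
```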
Known Issues:

The `ORDER BY` SQL command is not honoured (despite iterating over workers in order): `sql="SELECT * FROM gov_people ORDER BY name;"`. However, the same command in the Athena Console returns results in order.

In some cases, SQL filtering returns empty streams:

a. `sql="SELECT * FROM gov_people;"` # Returns 19 streams with data
b. `sql="SELECT * FROM gov_people WHERE gender='female';"` # Returns 19 streams with data
c. `sql="SELECT * FROM gov_people WHERE gender='female' AND family_name='Sanchez';"` # Returns 1 stream with data (one row) and 18 empty streams

It does not seem possible to influence the size or the number of Arrow streams returned by the engine. Thus, an arbitrary number of "chunked" DataFrames are created and then concatenated.
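The concatenation step can be illustrated with plain pandas, independent of the engine. A minimal sketch, where a few small DataFrames (some empty) stand in for the per-stream results:

```python
import pandas as pd

cols = ["name", "gender", "family_name"]

# Simulate Arrow streams: some carry rows, others come back empty.
chunks = [
    pd.DataFrame([["Ana", "female", "Sanchez"]], columns=cols),
    pd.DataFrame(columns=cols),  # empty stream
    pd.DataFrame([["Bob", "male", "Lee"]], columns=cols),
]

# Concatenate whatever the engine returned into a single DataFrame.
df = pd.concat(chunks, ignore_index=True)
print(len(df))  # → 2
```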
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.