Feature: Glue Governed Tables (#560) #571
```python
if (catalog_table_input is None) and (table_type == "GOVERNED"):
    catalog._create_parquet_table(  # pylint: disable=protected-access
        database=database,
        table=table,
        path=path,  # type: ignore
        columns_types=columns_types,
        table_type=table_type,
        partitions_types=partitions_types,
        bucketing_info=bucketing_info,
        compression=compression,
        description=description,
        parameters=parameters,
        columns_comments=columns_comments,
        boto3_session=session,
        mode=mode,
        catalog_versioning=catalog_versioning,
        projection_enabled=projection_enabled,
        projection_types=projection_types,
        projection_ranges=projection_ranges,
        projection_values=projection_values,
        projection_intervals=projection_intervals,
        projection_digits=projection_digits,
        catalog_id=catalog_id,
        catalog_table_input=catalog_table_input,
    )
    catalog_table_input = catalog._get_table_input(  # pylint: disable=protected-access
        database=database, table=table, boto3_session=session, catalog_id=catalog_id
    )
```
This is admittedly ugly (we make the `_create_parquet_table` and `_get_table_input` calls twice), but it is required to handle cases where a "GOVERNED" table does not exist and must be created before running the `_to_dataset` call. Open to suggestions on how to avoid it.
Could `_create_parquet_table` return the `catalog_table_input`, to save the last call? Not 100% sure about that, but it seems like it would work, based on the similar structure elsewhere.
* Initial Commit
* Minor - Refactoring Work Units Logic
* Major - Checkpoint w/ functional read code/example
* Initial Commit
* Minor - Refactoring Work Units Logic
* Major - Checkpoint w/ functional read code/example
* Minor - Removing unnecessary ensure_session
* Minor - Adding changes from comments and review
* Minor - Adding Abort, Begin, Commit and Extend transactions
* Minor - Adding missing functions
* Minor - Adding missing @Property
* Minor - Disable too many public methods
* Minor - Checkpoint
* Major - Governed tables write operations tested
* Minor - Adding validate flow on branches
* Minor - reducing static checks
* Minor - Adding to_csv code
* Minor - Disabling too-many-branches
* Major - Ready for release
* Minor - Proofreading
* Minor - Removing needless use_threads argument
* Minor - Removing the need to specify table_type when table is already created
* Minor - Fixing _catalog_id call
* Minor - Clarifying SQL filter operation
* Minor - Removing type ignore
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository
👏 👏 👏
Description of changes:
Implements read and write operations on AWS Glue Governed tables.
Ability to query an AWS Glue Governed table with Lake Formation. Example:
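A usage sketch of the read path (database and table names are hypothetical, and running it requires AWS credentials and an existing Governed table):

```python
import awswrangler as wr

# Query a Governed table at a point in time, or within a transaction.
# Passing neither argument lets the library start a transaction itself.
df = wr.lakeformation.read_sql_query(
    sql="SELECT * FROM gov_people;",
    database="my_database",  # hypothetical database name
    # query_as_of_time="...",  # or: transaction_id="..."
)
```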
The user can specify a `transaction_id`, a `query_as_of_time`, or no argument at all (in which case a `transaction_id` is generated automatically). The command returns a single Pandas DataFrame or an Iterator of DataFrames. `use_threads` is set to True by default to distribute the query execution.

Ability to write to an AWS Glue Governed table with Lake Formation. This is an extension of the existing `wr.s3.to_parquet` method. Example:

The user must pass `table_type="GOVERNED"` when creating a new table, and can specify a `transaction_id` or no argument (in which case a `transaction_id` is generated automatically). The command returns the S3 paths of the created Parquet objects and any partition values. The "append", "overwrite" and "overwrite_partitions" modes are all supported. As part of this change, it is no longer necessary to specify the `path` argument when the Glue table already exists (the path is obtained from the Glue table metadata).

Test:
```
pytest -n 8 tests/test__routines.py
pytest -n 8 tests/test_lakeformation.py
```

The user must have permissions to create a Governed table in Lake Formation.
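A usage sketch of the write path described above (bucket, database and table names are hypothetical; running it requires AWS credentials and Lake Formation permissions):

```python
import awswrangler as wr
import pandas as pd

df = pd.DataFrame({"id": [1, 2], "name": ["Ana", "Bob"]})

# table_type="GOVERNED" is only required when creating a new table;
# path can be omitted once the Glue table already exists.
res = wr.s3.to_parquet(
    df=df,
    path="s3://my-bucket/gov_people/",  # hypothetical bucket
    dataset=True,
    database="my_database",
    table="gov_people",
    table_type="GOVERNED",
    mode="append",
)
```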
Known Issues:

The `ORDER BY` SQL command is not honoured (despite iterating over workers in order): `sql="SELECT * FROM gov_people ORDER BY name;"`. However, the same command in the Athena Console returns results in order.

In some cases, SQL filtering returns empty streams:

a. `sql="SELECT * FROM gov_people;"` # Returns 19 streams with data
b. `sql="SELECT * FROM gov_people WHERE gender='female';"` # Returns 19 streams with data
c. `sql="SELECT * FROM gov_people WHERE gender='female' AND family_name='Sanchez';"` # Returns 1 stream with data (one row) and 18 empty streams

It does not seem possible to influence the size or the number of Arrow streams returned by the engine. Thus, an arbitrary number of "chunked" DataFrames are created and then concatenated.
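The concatenation step can be illustrated with plain pandas, independent of the engine. A minimal sketch, where a few small DataFrames (some empty) stand in for the per-stream results:

```python
import pandas as pd

cols = ["name", "gender", "family_name"]

# Simulate Arrow streams: some carry rows, others come back empty.
chunks = [
    pd.DataFrame([["Ana", "female", "Sanchez"]], columns=cols),
    pd.DataFrame(columns=cols),  # empty stream
    pd.DataFrame([["Bob", "male", "Lee"]], columns=cols),
]

# Concatenate whatever the engine returned into a single DataFrame.
df = pd.concat(chunks, ignore_index=True)
print(len(df))  # → 2
```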
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.