Skip to content

Conversation

@jaidisido
Copy link
Contributor

@jaidisido jaidisido commented Feb 15, 2021

Description of changes:
New feature implementing the ability to write to an AWS Glue Governed table with Lake Formation. This is an extension to the existing wr.s3.to_parquet method.
Example:

import awswrangler as wr
import pandas as pd
wr.s3.to_parquet(
    ...     df=pd.DataFrame({
    ...         'col': [1, 2, 3],
    ...         'col2': ['A', 'A', 'B'],
    ...         'col3': [None, None, None]
    ...     }),
    ...     dataset=True,
    ...     mode='overwrite',
    ...     database='default',  # Athena/Glue database
    ...     table='my_table',  # Athena/Glue table
    ...     table_type='GOVERNED',
    ...     transaction_id="xxx",
)

User must pass table_type="GOVERNED" to enable the feature. They can specify a transaction_id or no argument (i.e. transaction_id automatically generated). The command returns the S3 paths of the created parquet objects and any partition values. All of "append", "overwrite" and "overwrite_partitions" modes are supported. As part of the change, it's not necessary to specify the path argument if a Glue table already exists (i.e. path is obtained from the Glue table metadata).

Test:
pytest -n 8 tests/test__routines.py
pytest -n 8 tests/test_lakeformation.py
User must have permissions to create a Governed Table in Lake Formation.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@jaidisido jaidisido added WIP Work in progress feature labels Feb 15, 2021
@jaidisido jaidisido self-assigned this Feb 15, 2021
@jaidisido jaidisido changed the title Feature/lf write table Feature: Write to Glue Governed Tables Feb 15, 2021
Copy link
Contributor Author

@jaidisido jaidisido left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added clarifications

Comment on lines +580 to +607
if (catalog_table_input is None) and (table_type == "GOVERNED"):
catalog._create_parquet_table( # pylint: disable=protected-access
database=database,
table=table,
path=path, # type: ignore
columns_types=columns_types,
table_type=table_type,
partitions_types=partitions_types,
bucketing_info=bucketing_info,
compression=compression,
description=description,
parameters=parameters,
columns_comments=columns_comments,
boto3_session=session,
mode=mode,
catalog_versioning=catalog_versioning,
projection_enabled=projection_enabled,
projection_types=projection_types,
projection_ranges=projection_ranges,
projection_values=projection_values,
projection_intervals=projection_intervals,
projection_digits=projection_digits,
catalog_id=catalog_id,
catalog_table_input=catalog_table_input,
)
catalog_table_input = catalog._get_table_input( # pylint: disable=protected-access
database=database, table=table, boto3_session=session, catalog_id=catalog_id
)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is admittedly ugly (we are making the create_parquet_table and get_table_input calls twice). But is required to handle cases where a "GOVERNED" table does not exist and must be created before running the _to_dataset call. Open to suggestions on how to avoid it

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't think in any better alternative now. I will try to think more about it later. 🤔

Comment on lines +580 to +607
if (catalog_table_input is None) and (table_type == "GOVERNED"):
catalog._create_parquet_table( # pylint: disable=protected-access
database=database,
table=table,
path=path, # type: ignore
columns_types=columns_types,
table_type=table_type,
partitions_types=partitions_types,
bucketing_info=bucketing_info,
compression=compression,
description=description,
parameters=parameters,
columns_comments=columns_comments,
boto3_session=session,
mode=mode,
catalog_versioning=catalog_versioning,
projection_enabled=projection_enabled,
projection_types=projection_types,
projection_ranges=projection_ranges,
projection_values=projection_values,
projection_intervals=projection_intervals,
projection_digits=projection_digits,
catalog_id=catalog_id,
catalog_table_input=catalog_table_input,
)
catalog_table_input = catalog._get_table_input( # pylint: disable=protected-access
database=database, table=table, boto3_session=session, catalog_id=catalog_id
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't think in any better alternative now. I will try to think more about it later. 🤔

Comment on lines +543 to +544
if catalog_table_input:
table_type = catalog_table_input["TableType"]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was pointed to me that there is no need to pass table_type="GOVERNED" if the table already exists and should instead be retrieved from the catalog so I have fixed it here

@jaidisido jaidisido changed the base branch from feature/lf-transactions to main-governed-tables February 25, 2021 11:57
@jaidisido jaidisido marked this pull request as ready for review February 25, 2021 11:57
@jaidisido jaidisido merged commit 1c49594 into main-governed-tables Feb 25, 2021
@jaidisido jaidisido deleted the feature/lf-write-table branch February 25, 2021 15:02
igorborgest pushed a commit that referenced this pull request Mar 9, 2021
* Initial Commit

* Minor - Refactoring Work Units Logic

* Major - Checkpoint w/ functional read code/example

* Initial Commit

* Minor - Refactoring Work Units Logic

* Major - Checkpoint w/ functional read code/example

* Minor - Removing unnecessary ensure_session

* Minor - Adding changes from comments and review

* Minor - Adding Abort, Begin, Commit and Extend transactions

* Minor - Adding missing functions

* Minor - Adding missing @Property

* Minor - Disable too many public methods

* Minor - Checkpoint

* Major - Governed tables write operations tested

* Minor - Adding validate flow on branches

* Minor - reducing static checks

* Minor - Adding to_csv code

* Minor - Disabling too-many-branches

* Major - Ready for release

* Minor - Proofreading

* Minor - Removing needless use_threads argument

* Minor - Removing the need to specify table_type when table is already created

* Minor - Fixing _catalog_id call

* Minor - Clarifying SQL filter operation

* Minor - Removing type ignore
igorborgest pushed a commit that referenced this pull request Mar 9, 2021
* Initial Commit

* Minor - Refactoring Work Units Logic

* Major - Checkpoint w/ functional read code/example

* Initial Commit

* Minor - Refactoring Work Units Logic

* Major - Checkpoint w/ functional read code/example

* Minor - Removing unnecessary ensure_session

* Minor - Adding changes from comments and review

* Minor - Adding Abort, Begin, Commit and Extend transactions

* Minor - Adding missing functions

* Minor - Adding missing @Property

* Minor - Disable too many public methods

* Minor - Checkpoint

* Major - Governed tables write operations tested

* Minor - Adding validate flow on branches

* Minor - reducing static checks

* Minor - Adding to_csv code

* Minor - Disabling too-many-branches

* Major - Ready for release

* Minor - Proofreading

* Minor - Removing needless use_threads argument

* Minor - Removing the need to specify table_type when table is already created

* Minor - Fixing _catalog_id call

* Minor - Clarifying SQL filter operation

* Minor - Removing type ignore
jaidisido added a commit that referenced this pull request Dec 2, 2021
* Feature: Write to Glue Governed Tables (#560)

* Initial Commit

* Minor - Refactoring Work Units Logic

* Major - Checkpoint w/ functional read code/example

* Initial Commit

* Minor - Refactoring Work Units Logic

* Major - Checkpoint w/ functional read code/example

* Minor - Removing unnecessary ensure_session

* Minor - Adding changes from comments and review

* Minor - Adding Abort, Begin, Commit and Extend transactions

* Minor - Adding missing functions

* Minor - Adding missing @Property

* Minor - Disable too many public methods

* Minor - Checkpoint

* Major - Governed tables write operations tested

* Minor - Adding validate flow on branches

* Minor - reducing static checks

* Minor - Adding to_csv code

* Minor - Disabling too-many-branches

* Major - Ready for release

* Minor - Proofreading

* Minor - Removing needless use_threads argument

* Minor - Removing the need to specify table_type when table is already created

* Minor - Fixing _catalog_id call

* Minor - Clarifying SQL filter operation

* Minor - Removing type ignore

* Minor - Reducing scope gitworkflow

* Minor - Fixing _sanitize_name

* Minor - Adding map_types flag

* Minor - Aligning optional path argument with main branch

* Minor tests adjustments

* Minor - Removing Chunked parameter

* Fixing issues from diverged branch

* Major - M1 Launch (Stable)

* Mmproving read by concatenating zero copies of arrow ttables

* [skip ci] - Minor - Killing thread

* [skip ci] - Minor - Passing client instead of session

* Major - Adding Metadata Transaction API changes

* Minor - Adding query as of time to test

* [skip ci] - syncing with main branch

* Merge main and adapt tests to API changes from Erie team

* Merge main

* Merging main - move to poetry

* Merge main and resolve conflicts

* Minor - Sync with main

* Merging with main

* Lint

* Fixing get_table_obj retries

* Green tests

* Minor -  Fixing automated merge

* LakeFormation test infra

* Commit protocol change - Erie

* [skip ci] - Minor - Fixing catalog unit test

* [skip ci] - Minor - Adding transaction_id to does_table_exist

* Minor - Missing projection_storage_location_template

* Upgrading botocore

* xfail moto

* Adding s3fs to tox

* LF concurrent modification exception

* catalog.py test

* lint
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

feature WIP Work in progress

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants