
🐛 Destination AWS Datalake: fix KeyError issue and add parquet support #17193

Conversation

henriblancke
Contributor

@henriblancke henriblancke commented Sep 27, 2022

What

Describe what the change is solving

  • Resolve a destination-aws-datalake bug and add more functionality to the destination
    • Resolve the error raised when a stream schema has "type": "object" but no properties defined, which resulted in a KeyError('properties') and broke the sync
    • Add support for parquet format, file compression and partitioning

How

Describe the solution

This change uses awswrangler to get a lot of additional functionality for this destination out of the box, such as:

  • Support different output formats (json, parquet)
  • Support output file compression
    • Support gzip, snappy and zstd compression

Additional features include:

  • Support partitioning based on cursor fields
    • Partition on date, year, month, day or year/month/day
  • Improve pyarrow/Athena type casting:
    • Use the source json-schema to infer pyarrow and Athena types
    • Support complex data types such as structs and arrays, and apply sane defaults when they can't be inferred
  • Allow using namespaces so a single destination can create multiple databases (useful, for example, when you want multiple sources to use their own database)
  • Configure the use of governed vs. external tables
  • Configure how floats should be handled in Athena (double vs. decimal)
  • Allow assigning default LF-tags to databases created by this connection (helps with permission management)
  • Delete the table in Athena when a connection is reset
  • Remove the test database after validating connection and permissions in check.
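The cursor-based partitioning above can be sketched as a small helper that derives partition columns from the cursor field before the frame is handed to awswrangler (column names, the function name, and the bucket/database names in the comment are hypothetical):

```python
import pandas as pd

def add_date_partitions(df: pd.DataFrame, cursor_field: str, granularity: str) -> list:
    """Derive partition columns from a datetime cursor field.

    granularity: "date" or a slash-joined combination of "year", "month",
    "day" (e.g. "year/month/day"), mirroring the options described above.
    """
    ts = pd.to_datetime(df[cursor_field])
    if granularity == "date":
        df["dt"] = ts.dt.date.astype(str)
        return ["dt"]
    cols = []
    for part in granularity.split("/"):
        df[part] = getattr(ts.dt, part)  # ts.dt.year / .month / .day
        cols.append(part)
    return cols

df = pd.DataFrame({"id": [1, 2], "created_at": ["2022-09-01", "2023-10-02"]})
partition_cols = add_date_partitions(df, "created_at", "year/month/day")
# partition_cols == ["year", "month", "day"]; the frame now carries those
# columns and could be written via awswrangler, e.g.:
# wr.s3.to_parquet(df=df, path="s3://my-bucket/prefix/", dataset=True,
#                  database="my_db", table="my_stream", compression="snappy",
#                  partition_cols=partition_cols)
```

With `dataset=True`, awswrangler's `wr.s3.to_parquet` handles the Glue catalog registration, compression, and Hive-style partition layout in one call, which is what makes it a good fit for this destination.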

🚨 User Impact 🚨

Are there any breaking changes? What is the end result perceived by the user? If yes, please merge this PR with the 🚨🚨 emoji so changelog authors can further highlight this if needed.

There are some spec and configuration changes

Pre-merge Checklist

Expand the relevant checklist and delete the others.

New Connector

Community member or Airbyter

  • Community member? Grant edit access to maintainers (instructions)
  • Secrets in the connector's spec are annotated with airbyte_secret
  • Unit & integration tests added and passing. Community members, please provide proof of success locally e.g: screenshot or copy-paste unit, integration, and acceptance test output. To run acceptance tests for a Python connector, follow instructions in the README. For java connectors run ./gradlew :airbyte-integrations:connectors:<name>:integrationTest.
  • Code reviews completed
  • Documentation updated
    • Connector's README.md
    • Connector's bootstrap.md. See description and examples
    • docs/integrations/<source or destination>/<name>.md including changelog. See changelog example
    • docs/integrations/README.md
    • airbyte-integrations/builds.md
  • PR name follows PR naming conventions

Airbyter

If this is a community PR, the Airbyte engineer reviewing this PR is responsible for the below items.

  • Create a non-forked branch based on this PR and test the below items on it
  • Build is successful
  • If new credentials are required for use in CI, add them to GSM. Instructions.
  • /test connector=connectors/<name> command is passing
  • New Connector version released on Dockerhub by running the /publish command described here
  • After the connector is published, connector added to connector index as described here
  • Seed specs have been re-generated by building the platform and committing the changes to the seed spec files, as described here
Updating a connector

Community member or Airbyter

  • Grant edit access to maintainers (instructions)
  • Secrets in the connector's spec are annotated with airbyte_secret
  • Unit & integration tests added and passing. Community members, please provide proof of success locally e.g: screenshot or copy-paste unit, integration, and acceptance test output. To run acceptance tests for a Python connector, follow instructions in the README. For java connectors run ./gradlew :airbyte-integrations:connectors:<name>:integrationTest.
  • Code reviews completed
  • Documentation updated
    • Connector's README.md
    • Connector's bootstrap.md. See description and examples
    • Changelog updated in docs/integrations/<source or destination>/<name>.md including changelog. See changelog example
  • PR name follows PR naming conventions

Airbyter

If this is a community PR, the Airbyte engineer reviewing this PR is responsible for the below items.

  • Create a non-forked branch based on this PR and test the below items on it
  • Build is successful
  • If new credentials are required for use in CI, add them to GSM. Instructions.
  • /test connector=connectors/<name> command is passing
  • New Connector version released on Dockerhub and connector version bumped by running the /publish command described here
Connector Generator
  • Issue acceptance criteria met
  • PR name follows PR naming conventions
  • If adding a new generator, add it to the list of scaffold modules being tested
  • The generator test modules (all connectors with -scaffold in their name) have been updated with the latest scaffold by running ./gradlew :airbyte-integrations:connector-templates:generator:testScaffoldTemplates then checking in your changes
  • Documentation which references the generator is updated as needed

Tests

Unit: (screenshot: unit tests passing, Sep 26, 2022)
Integration: (screenshot: integration tests passing, Sep 26, 2022)
Acceptance

Put your acceptance tests output here.

Signed-off-by: Henri Blancke <blanckehenri@gmail.com>
@CLAassistant

CLAassistant commented Sep 27, 2022

CLA assistant check
All committers have signed the CLA.

@github-actions github-actions bot added the area/documentation Improvements or additions to documentation label Sep 27, 2022
@marcosmarxm
Member

Sorry for the delay in reviewing this amazing contribution @henriblancke!!! I hope to have a review done by Wednesday.

@marcosmarxm
Member

@henriblancke sorry for the delay here. I didn't have time this week and I'm OOO today. I've added this to my priority to-do list for Monday.

@henriblancke
Contributor Author

@marcosmarxm thanks for the update and for adding this to your priority list for Monday 😄! Enjoy your day off today!

Member

@marcosmarxm marcosmarxm left a comment


@henriblancke tests are failing for me; it looks like it can't find the database for the integration tests. I need some time to check what is causing it.

@marcosmarxm marcosmarxm self-assigned this Oct 11, 2022
@henriblancke
Contributor Author

@marcosmarxm sorry to hear that! Which test is failing for you? Do you have secrets/config.json set up with a real AWS account that has permissions to Lake Formation, S3 and Athena?

All integration tests seem to pass for me locally. I may be able to help troubleshoot, let me know what test(s) are failing for you. Thanks!

@grishick
Contributor

grishick commented Mar 8, 2023

draft PR to run CI: #23855

Contributor

@grishick grishick left a comment


The integration tests are not passing. Here is my latest run https://github.com/airbytehq/airbyte/actions/runs/4359907106/jobs/7622249697

@henriblancke
Contributor Author

@grishick it looks like format is not defined in your config; it is now flagged as required in spec.json.

@henriblancke
Contributor Author

@grishick for the connector icon check, there was no icon defined in the previous version. Should I go ahead and define one?

@grishick
Contributor

grishick commented Mar 8, 2023

@grishick for the connector icon check, there was no icon defined in the previous version. Should I go ahead and define one?

yes please

@henriblancke
Contributor Author

@grishick let me know what the next steps here are. Thanks again for all the help!

    )
    if table is None:
        message = f"Could not create a table in database {connector_config.lakeformation_database_name}"
    tbl = "airbyte_test"
Contributor

@grishick grishick Mar 21, 2023


this table name should be randomized and check should also attempt to delete this table before exiting. Right now, subsequent test runs fail with this error (https://github.com/airbytehq/airbyte/actions/runs/4483018644):

	 ERROR    airbyte:destination.py:147 Could not create table airbyte_*** in database airbyte-integration: AlreadyExistsException('An error occurred (AlreadyExistsException) when calling the CreateTable operation: Table already exists.')
	 =========================== short *** summary info ============================
	 FAILED integration_***s/integration_***.py::***_check_valid_config - Asser...
	 ========================= 1 failed, 3 passed in 48.70s =========================

Contributor Author


Hmm, weird. aws_handler.reset_table(db, tbl) a couple of lines below this one should delete the table, if it exists, and all its underlying data in S3 before it tries to create it again. Does your configured IAM role or user have permissions to access/delete the table? The integration test logs show the following sequence of events:

  1. A call made to GetTable returning "Table not found" (user may not have permissions to access it)
  2. Calls made to s3, ListObjects, PutObject, etc (writing data to s3)
  3. A call made to CreateTable returning "Table already exists" (error could be caused by table existing but user not being able to access it)
  4. A call made to DeleteTable returning "Table not found"

That makes me think that for some reason the role or user does not have permissions to access the existing table, but I may be overlooking something. I'll make sure to randomize the table name in my next commit.

Contributor


A scenario that the integration test should account for is that a previous run did not clean up properly (it could have been cancelled, interrupted, or failed due to a hosting problem, a CI problem, or anything else).

Another problem is that CHECK should not assume that the airbyte_test table does not exist in the destination, because a user could be using multiple connections to write to the same destination. It is safer to add a random suffix/prefix to the tables created by the CHECK method to avoid a name collision.
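The randomized naming suggested above can be as simple as appending a random hex suffix, with cleanup attempted regardless of outcome (a sketch; `aws_handler` and the helper name are placeholders, not the connector's actual API):

```python
import uuid

def check_table_name(prefix: str = "airbyte_test") -> str:
    # Random suffix so interrupted or concurrent CHECK runs never collide
    # on the same table name in a shared destination.
    return f"{prefix}_{uuid.uuid4().hex[:8]}"

name = check_table_name()

# Sketch of how CHECK would use it (aws_handler is hypothetical here):
# tbl = check_table_name()
# try:
#     ...create tbl, write a probe record, read it back...
# finally:
#     aws_handler.reset_table(db, tbl)  # best-effort cleanup, even on failure
```

Putting the cleanup in a `finally` block covers the interrupted-run scenario: even if the probe write fails, the test table and its S3 data are still dropped.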

Contributor Author


That makes sense!

@grishick
Contributor

@grishick let me know what the next steps here are. Thanks again for all the help!

@henriblancke I re-ran the integration tests and added several more comments based on the failures. I modified the config that we have saved in GSM in order to get past failures caused by missing config values, and added one more comment about a failure in the check method.

@henriblancke
Contributor Author

@grishick let me know what the next steps here are. Thanks again for all the help!

@henriblancke I re-ran the integration tests and added several more comments based on the failures. I modified the config that we have saved in GSM in order to get past failures caused by missing config values, and added one more comment about a failure in the check method.

@grishick thanks again for the review 🚀, I really appreciate all the help here! I responded to your check comment, added a random element to the table name and addressed the other comments.

@grishick
Contributor

grishick commented Mar 24, 2023

Rerunning the integration tests and CI checks: #23855

@grishick
Contributor

grishick commented Mar 24, 2023

/test connector=connectors/destination-aws-datalake

🕑 connectors/destination-aws-datalake https://github.com/airbytehq/airbyte/actions/runs/4514395508
✅ connectors/destination-aws-datalake https://github.com/airbytehq/airbyte/actions/runs/4514395508
Python tests coverage:

Name                                        Stmts   Miss  Cover
---------------------------------------------------------------
destination_aws_datalake/__init__.py            2      0   100%
destination_aws_datalake/config_reader.py      81     15    81%
destination_aws_datalake/destination.py        87     28    68%
destination_aws_datalake/aws.py               118     39    67%
destination_aws_datalake/stream_writer.py     233    110    53%
---------------------------------------------------------------
TOTAL                                         521    192    63%
Name                                        Stmts   Miss  Cover
---------------------------------------------------------------
destination_aws_datalake/__init__.py            2      0   100%
destination_aws_datalake/config_reader.py      81     13    84%
destination_aws_datalake/stream_writer.py     233     65    72%
destination_aws_datalake/aws.py               118     63    47%
destination_aws_datalake/destination.py        87     67    23%
---------------------------------------------------------------
TOTAL                                         521    208    60%

Build Passed

Test summary info:

All Passed

@octavia-squidington-iii octavia-squidington-iii removed area/connectors Connector related issues area/documentation Improvements or additions to documentation labels Mar 27, 2023
@grishick
Contributor

Published and merged here: #23855

@grishick grishick closed this Mar 27, 2023
@henriblancke
Contributor Author

Published and merged here: #23855

Thanks again for all the help @grishick!
