
Add native support for data transfers optimisation #481

Closed · 10 tasks
sunank200 opened this issue Jun 22, 2022 · 0 comments · Fixed by #544
sunank200 commented Jun 22, 2022

Please describe the feature you'd like to see
Currently, the LoadFile operator does not support native transfers between different file locations and databases, even when the external provider library supports an optimised approach. For example, to transfer data from S3 to Snowflake we could use the COPY INTO command to load data from a cloud storage bucket into tables, which would make the transfer much faster.
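As a rough illustration (not the SDK's implementation), this is the kind of statement such a native path could issue through the Snowflake provider's hook; the table and stage names are hypothetical placeholders:

```python
# Hedged sketch: run a Snowflake COPY INTO from an external stage backed by S3.
# `my_table` and `my_s3_stage` are placeholder names, not real SDK objects.
from airflow.providers.snowflake.hooks.snowflake import SnowflakeHook

def native_s3_to_snowflake(conn_id: str = "snowflake_default") -> None:
    hook = SnowflakeHook(snowflake_conn_id=conn_id)
    hook.run(
        "COPY INTO my_table "
        "FROM @my_s3_stage "
        "FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1)"
    )
```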

Describe the solution you'd like
@utkarsharma2 and I discussed various approaches. One of the most optimised approaches looks like using existing Airflow hooks: for example, S3ToSnowflakeOperator already has the whole logic implemented. I propose using existing transfer operators from Airflow providers wherever possible. Where no such provider exists, we could implement the logic ourselves in the respective location classes in astro-sdk-python.

For each database class, such as SnowflakeDatabase or BigqueryDatabase, we could maintain a list of source storage types from which an optimised transfer is possible. In the LoadFile execute() method we can then do the following:

  • Check whether an optimised transfer exists from the source file storage (input_file) to the destination database (output_table). This mapping can be maintained in the respective database and location classes.

  • If an optimised transfer exists, fetch the relevant hook from the Airflow providers and perform the transfer. Alternatively, use a transfer operator implemented in Airflow directly where one exists (for example, S3ToSnowflakeOperator). A rough sketch follows this list.
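Here is a rough sketch of that dispatch, assuming hypothetical attribute and method names (the real classes in astro-sdk-python may differ):

```python
# Sketch only: each database class declares which source locations it can
# load from natively; LoadFile's execute() dispatches on that mapping.
from enum import Enum

class FileLocation(Enum):
    S3 = "s3"
    GCS = "gs"
    LOCAL = "local"

class SnowflakeDatabase:
    # Locations from which Snowflake supports an optimised transfer,
    # e.g. COPY INTO from an external stage.
    NATIVE_LOAD_SUPPORTED_LOCATIONS = {FileLocation.S3, FileLocation.GCS}

    def load_file_natively(self, input_file, output_table):
        """Delegate to an Airflow hook/operator, e.g. S3ToSnowflakeOperator."""
        ...

    def load_file_via_dataframe(self, input_file, output_table):
        """Existing generic path: read into a dataframe, then write."""
        ...

def execute_load(database, input_file, output_table):
    # Prefer the native path when the source location supports it.
    if input_file.location in database.NATIVE_LOAD_SUPPORTED_LOCATIONS:
        database.load_file_natively(input_file, output_table)
    else:
        database.load_file_via_dataframe(input_file, output_table)
```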

Are there any alternatives to this feature?
Is there another way we could solve this problem or enable this use-case?
A next step would be to chunk or split the file and do parallel async uploads, but optimised transfers using Airflow hooks would already improve the performance drastically.
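For what it's worth, a toy sketch of that chunked approach (threads rather than asyncio, for brevity; `upload_chunk` is a hypothetical per-chunk uploader):

```python
# Sketch: split a local file into fixed-size chunks and upload them in parallel.
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 8 * 1024 * 1024  # 8 MiB per chunk

def iter_chunks(path):
    with open(path, "rb") as f:
        index = 0
        while chunk := f.read(CHUNK_SIZE):
            yield index, chunk
            index += 1

def parallel_upload(path, upload_chunk, max_workers=8):
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(upload_chunk, i, c) for i, c in iter_chunks(path)]
        for future in futures:
            future.result()  # re-raise any upload errors
```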

Acceptance Criteria

  • All checks and tests in the CI should pass
  • Unit tests (90% code coverage or more, once available)
  • Integration tests (if the feature relates to a new database or external service)
  • Example DAG
  • Docstrings in reStructuredText for each method, class, function, and module-level attribute (including an example DAG showing how it should be used)
  • Exception handling in case of errors
  • Logging (are we exposing useful information to the user? e.g. source and destination)
  • Improve the documentation (README, Sphinx, and any other relevant docs)
  • How-to guide for the feature (with an example)
  • Run the benchmark script and evaluate the improvement
@sunank200 sunank200 added the feature New feature or request label Jun 22, 2022
@sunank200 sunank200 self-assigned this Jun 22, 2022
@tatiana tatiana self-assigned this Jul 14, 2022
tatiana added a commit that referenced this issue Jul 14, 2022
Refactor how tables are created in BaseDatabase.load_file_to_table

We should prioritise creating the table using `table.columns` when they are specified by the user, with dataframe autodetection as a fallback.

Most of the complexity of #487 was in the creation of tables, and this step aims to simplify the Snowflake `load_file` optimisation.

Relates to: #430, #481, #493, #494
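A simplified illustration of the precedence this commit describes, with assumed names (not the actual diff): honour user-specified columns first, and fall back to dataframe-based schema autodetection.

```python
# Sketch: create the table from explicit columns when given; otherwise let
# pandas infer the schema from a sample dataframe (head(0) writes no rows).
import sqlalchemy

def create_table(engine, name, columns=None, sample_df=None):
    if columns:
        # Explicit schema supplied by the user (a list of sqlalchemy.Column).
        sqlalchemy.Table(name, sqlalchemy.MetaData(), *columns).create(engine)
    elif sample_df is not None:
        sample_df.head(0).to_sql(name, con=engine, if_exists="replace", index=False)
    else:
        raise ValueError("Either columns or a sample dataframe is required")
```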