
Add the interface for universal transfer operator #1267

Closed
wants to merge 26 commits into from

Conversation

@sunank200 (Contributor) commented Nov 18, 2022

Description

What is the current behavior?

Add an interface for the universal transfer operator with a working example DAG.

closes: #1139

What is the new behavior?

  • Define an interface for the universal transfer operator
  • Implement a POC for Fivetran integration
  • Implement GCS to S3 transfer
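For illustration, here is a minimal sketch of what the GCS to S3 example DAG could look like with this interface. The module paths, the `UniversalDataset` alias, and the operator's argument names are assumptions drawn from the files touched in this PR, not a confirmed API:

```python
# Hypothetical example DAG; module paths and operator arguments are assumptions
# based on this PR's layout (python-sdk/src/transfers/...), not a final API.
from datetime import datetime

from airflow import DAG
from transfers.datasets.base import UniversalDataset as Dataset
from transfers.universal_transfer_operator import UniversalTransferOperator

with DAG(
    dag_id="example_universal_transfer_gcs_to_s3",
    start_date=datetime(2022, 11, 1),
    schedule=None,  # Airflow 2.4+ `schedule` argument
    catchup=False,
) as dag:
    # Transfer objects from a GCS bucket (source) to an S3 bucket (destination);
    # bucket details are assumed to travel via the connection and `extra`.
    UniversalTransferOperator(
        task_id="gcs_to_s3_transfer",
        source_dataset=Dataset(conn_id="google_cloud_default", extra={"prefix": "uto/"}),
        destination_dataset=Dataset(conn_id="aws_default"),
    )
```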

Does this introduce a breaking change?

No

Checklist

  • Created tests which fail without the change (if possible)
  • Extended the README / documentation, if necessary

codecov bot commented Nov 18, 2022

Codecov Report

Base: 97.36% // Head: 97.36% // No change to project coverage 👍

Coverage data is based on head (56763c8) compared to base (0925b32).
Patch has no changes to coverable lines.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1267   +/-   ##
=======================================
  Coverage   97.36%   97.36%           
=======================================
  Files          19       19           
  Lines         682      682           
=======================================
  Hits          664      664           
  Misses         18       18           


python-sdk/dev/Dockerfile (outdated, resolved)
@pankajastro (Contributor) left a comment:

Should we keep class FileLocation and IngestorSupported in some constants.py, or use the existing one?

The create_dataprovider function should be part of the dataprovider module, so what about keeping it in the __init__.py of dataprovider?

The create_transfer_integration function should be part of the integrations module, so what about keeping it in the __init__.py of integrations?

get_dataset_connection_type should be part of the base dataset class.

I feel util.py should not be required.

Will UTO only be available if a native API is available from source to destination?

Do we need some load testing?

I feel there are a lot of "Universal" prefixes here, which is a bit confusing.

python-sdk/src/uto/data_providers/aws/s3.py (outdated, resolved)
python-sdk/src/uto/universal_transfer_operator.py (outdated, resolved)
python-sdk/src/uto/utils.py (outdated, resolved)
python-sdk/src/uto/datasets/base.py (outdated, resolved)
python-sdk/src/uto/universal_transfer_operator.py (outdated, resolved)
python-sdk/src/uto/universal_transfer_operator.py (outdated, resolved)
python-sdk/src/uto/data_providers/aws/s3.py (outdated, resolved)
python-sdk/src/uto/data_providers/aws/s3.py (outdated, resolved)
@sunank200 (Contributor, Author):

> Should we keep class FileLocation and IngestorSupported in some constants.py, or use the existing one?

> The create_dataprovider function should be part of the dataprovider module, so what about keeping it in the __init__.py of dataprovider?

> The create_transfer_integration function should be part of the integrations module, so what about keeping it in the __init__.py of integrations?

> get_dataset_connection_type should be part of the base dataset class.

> I feel util.py should not be required.

> Will UTO only be available if a native API is available from source to destination?

> Do we need some load testing?

> I feel there are a lot of "Universal" prefixes here, which is a bit confusing.

I have made the necessary changes or commented accordingly.

@@ -0,0 +1,27 @@
from enum import Enum
Collaborator:

@sunank200 will the UniversalTransferOperator be shipped as an independent package or as part of the python-sdk?
If it is part of the same package, it may be confusing to have a separate constants file - and a different FileLocation definition than the rest.

Collaborator:

That was exactly the question I had here

@sunank200 marked this pull request as ready for review on November 30, 2022, 07:30
python-sdk/src/transfers/universal_transfer_operator.py (outdated, resolved)
python-sdk/src/transfers/integrations/base.py (outdated, resolved)
Comment on lines 24 to 26
module = importlib.import_module(module_path)
class_name = get_class_name(module_ref=module, suffix="Integration")
transfer_integrations: TransferIntegrations = getattr(module, class_name)(transfer_params)
Member:

Do we really have to dynamically import? I think for simplicity we should map the type to the class directly. E.g.

from transfers.integrations.fivetran import FivetranIntegration

CUSTOM_INGESTION_TYPE_TO_MODULE_PATH = {"Fivetran": FivetranIntegration}

Because if you really want to do all of this dynamically, you need to take care of more things. For example, suffix="Integration" is very hidden, whereas the definition of the class name is in one place, i.e. class FivetranIntegration. One could easily forget that the class has to have an "Integration" suffix.

Note also that your exceptions on the module path only fail at run time, not during parsing.

As long as a dynamic import is not necessary, I would go with static imports instead.
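To make the suggestion concrete, a small sketch of the static mapping approach; the class names and the `transfer_params` argument are taken from this PR's diff, and the final signature may differ:

```python
# Sketch of the static type-to-class mapping suggested above; names mirror the
# diff (TransferIntegrations, FivetranIntegration), but the signature is assumed.
from transfers.integrations.base import TransferIntegrations
from transfers.integrations.fivetran import FivetranIntegration

CUSTOM_INGESTION_TYPE_TO_CLASS = {
    "Fivetran": FivetranIntegration,
}


def create_transfer_integration(ingestion_type: str, transfer_params: dict) -> TransferIntegrations:
    try:
        integration_class = CUSTOM_INGESTION_TYPE_TO_CLASS[ingestion_type]
    except KeyError:
        raise ValueError(f"Unsupported ingestion type: {ingestion_type!r}")
    # A missing or misnamed class now fails at import time rather than deep
    # inside a dynamic importlib call at run time.
    return integration_class(transfer_params)
```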

@dimberman (Collaborator) left a comment:

@sunank200 is there a way to break this up into smaller PRs? This is a lot to bite off at once.


Comment on lines +4 to +12
class FileLocation(Enum):
# [START filelocation]
LOCAL = "local"
HTTP = "http"
HTTPS = "https"
GS = "google_cloud_platform" # Google Cloud Storage
google_cloud_platform = "google_cloud_platform" # Google Cloud Storage
S3 = "s3" # Amazon S3
AWS = "aws"
Collaborator:

Is the plan to keep this static to what we "officially" support? Will there be a way for users to load custom locations/ingestors on top of this library?

Collaborator:

It would be nice if users could easily create custom ingestors and treat them as plugins
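As a speculative sketch of that plugin idea (not something implemented in this PR), custom ingestors could be discovered through Python entry points; the "uto.ingestors" group name below is made up for illustration:

```python
# Speculative plugin-discovery sketch; the "uto.ingestors" entry-point group is
# an invented name and is not defined anywhere in this PR.
from importlib.metadata import entry_points  # selectable API, Python 3.10+


def discover_custom_ingestors() -> dict:
    """Collect ingestor classes that third-party packages advertise as plugins."""
    discovered = {}
    for entry_point in entry_points(group="uto.ingestors"):
        discovered[entry_point.name] = entry_point.load()
    return discovered
```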

Comment on lines +18 to +37
def create_dataprovider(
dataset: Dataset,
transfer_params: dict = {},
transfer_mode: TransferMode = TransferMode.NONNATIVE,
if_exists: LoadExistStrategy = "replace",
) -> DataProviders:
from airflow.hooks.base import BaseHook

conn_type = BaseHook.get_connection(dataset.conn_id).conn_type
module_path = DATASET_CONN_ID_TO_DATAPROVIDER_MAPPING[conn_type]
module = importlib.import_module(module_path)
class_name = get_class_name(module_ref=module, suffix="DataProvider")
data_provider: DataProviders = getattr(module, class_name)(
conn_id=dataset.conn_id,
extra=dataset.extra,
transfer_params=transfer_params,
transfer_mode=transfer_mode,
if_exists=if_exists,
)
return data_provider
Collaborator:

See, I think we could easily have something that checks for plugins in this function. This might be future work; I just wanted to bring it up.
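A rough sketch of what such a check could look like, layered in front of the conn_type lookup above; the registry and the register_dataprovider helper are hypothetical and not part of this PR:

```python
# Hypothetical plugin hook for create_dataprovider; the registry and helper are
# illustrative only. DataProviders is the base class imported in this diff.
from transfers.data_providers.base import DataProviders

PLUGIN_DATAPROVIDERS = {}  # conn_type -> user-supplied DataProviders subclass


def register_dataprovider(conn_type, provider_class):
    """Let users register a custom DataProviders subclass for a connection type."""
    PLUGIN_DATAPROVIDERS[conn_type] = provider_class


def resolve_plugin_dataprovider(conn_type):
    """Return a user-registered provider class, or None to use the built-in lookup."""
    return PLUGIN_DATAPROVIDERS.get(conn_type)
```

create_dataprovider could consult this registry first and only fall back to the existing DATASET_CONN_ID_TO_DATAPROVIDER_MAPPING lookup when nothing is registered.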

from transfers.data_providers.base import DataProviders
from transfers.datasets.base import UniversalDataset as Dataset

from astro.constants import LoadExistStrategy
Collaborator:

If the plan is to separate this into its own package, it will be worth seeing how much (or how little) we actually import from other Astro SDK modules.

Collaborator:

(If this is separated, I imagine the Astro SDK would import this rather than the other way around.)

"""Return true if the dataset exists"""
raise NotImplementedError

def load_data_from_gcs(self, source_dataset: Dataset, destination_dataset: Dataset) -> None:
Collaborator:

Wait, are we going to have a function for each ingestor? This seems like it could blow up the classes, unless this is more meant for "special cases"?

source_dataprovider.get_bucket_name(source_dataset), # type: ignore
source_dataset.extra.get("delimiter", None),
source_dataset.extra.get("prefix", None),
)
Collaborator:

This function is WAY too long. Please break it up into sub-functions.

Comment on lines +99 to +110
if files:
for file in files:
with source_hook.provide_file(
object_name=file, bucket_name=source_dataprovider.get_bucket_name(source_dataset) # type: ignore
) as local_tmp_file:
dest_key = os.path.join(dest_s3_key, file)
logging.info("Saving file to %s", dest_key)
self.hook.load_file(
filename=local_tmp_file.name,
key=dest_key,
replace="replace",
acl_policy=destination_dataset.extra.get("s3_acl_policy", None),
Collaborator:

This level of nesting is problematic. This should be a subfunction.
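As a rough illustration of the refactor being asked for, the inner body could be pulled into a small helper. The hook methods mirror those in the diff, but the surrounding class layout is assumed:

```python
import logging
import os


def copy_object_to_s3(source_hook, s3_hook, file, source_bucket, dest_s3_key, acl_policy=None):
    """Stream one object from the source bucket to S3 via a local temporary file."""
    with source_hook.provide_file(object_name=file, bucket_name=source_bucket) as local_tmp_file:
        dest_key = os.path.join(dest_s3_key, file)
        logging.info("Saving file to %s", dest_key)
        # Note: the diff passes replace="replace"; S3Hook.load_file's `replace`
        # parameter is a boolean, so True is probably what was intended.
        s3_hook.load_file(
            filename=local_tmp_file.name,
            key=dest_key,
            replace=True,
            acl_policy=acl_policy,
        )
```

The caller in the diff then flattens to a single call per file inside the loop.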

@sunank200 (Contributor, Author):

Closing this PR, as the work is continuing in another PR: #1492

@sunank200 closed this on Dec 23, 2022
sunank200 added a commit that referenced this pull request Jan 13, 2023
# Description
## What is the current behavior?
Closing the PR #1267 and continuing the change in this fresh PR.

closes: #1139
closes: #1544
closes: #1545
closes: #1546
closes: #1551

## What is the new behavior?
- Define interfaces
    - Use the Airflow 2.4 Dataset concept to build more types of Datasets:
        - Table
        - File
        - Dataframe
        - API
    - Define an interface for the universal transfer operator
        - Add the `TransferParameters` class to pass transfer configurations.
        - Use the context manager from `DataProvider` for clean-up.
        - Introduce three transfer modes: `native`, `non-native` and `third-party`.
    - `DataProviders`
        - Add an interface for `DataProvider`.
        - Add an interface for `BaseFilesystemProviders`.
        - Add `read` and `write` methods in `DataProviders` with the context manager.
    - `TransferIntegrations` and third-party transfers
        - Add an interface for `TransferIntegrations` and introduce the third-party transfer approach.
- Non-native transfers
    - Add a `DataProvider` for S3 and GCS.
    - Add a transfer workflow for S3 to GCS using a non-native approach.
    - Add a transfer workflow for GCS to S3 using a non-native approach.
    - Add an example DAG for the S3 to GCS implementation.
    - Add an example DAG for the GCS to S3 implementation.
- Third-party transfers
    - Add a `FivetranTransferIntegration` class for all transfers using Fivetran.
    - Implement `FivetranOptions`, which inherits from the `TransferParameters` class, to pass transfer configurations.
    - Implement a POC for the Fivetran integration.
    - Add an example DAG for the Fivetran implementation.
    - Fivetran POC with a working DAG for a transfer example (S3 to Snowflake) when `connector_id` is passed.
- Document the APIs for Fivetran transfers on Notion: https://www.notion.so/astronomerio/Fivetran-3bd9ecfbdcae411faa49cb38595a4571
- Makefile and Dockerfile, along with docker-compose.yaml, to build it locally and in a container.

## Does this introduce a breaking change?
No

### Checklist
- [x] Example DAG
- [x] Created tests which fail without the change (if possible)
- [x] Extended the README/documentation, if necessary

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Kaxil Naik <kaxilnaik@gmail.com>
Co-authored-by: Felix Uellendall <feluelle@users.noreply.github.com>
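To illustrate the third-party mode described in that commit message, here is a hypothetical snippet. `TransferMode.THIRDPARTY`, the `FivetranOptions` import path, and the `connector_id` keyword are assumptions based on the description and have not been verified against the follow-up PR:

```python
# Hypothetical third-party (Fivetran) transfer; class names come from the PR
# description, but the import paths, TransferMode member, and keyword arguments
# are assumptions.
from transfers.constants import TransferMode
from transfers.datasets.base import UniversalDataset as Dataset
from transfers.integrations.fivetran import FivetranOptions
from transfers.universal_transfer_operator import UniversalTransferOperator

s3_to_snowflake = UniversalTransferOperator(
    task_id="s3_to_snowflake_via_fivetran",
    source_dataset=Dataset(conn_id="aws_default"),
    destination_dataset=Dataset(conn_id="snowflake_default"),
    transfer_mode=TransferMode.THIRDPARTY,
    transfer_params=FivetranOptions(connector_id="<fivetran-connector-id>"),
)
```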
utkarsharma2 pushed a commit that referenced this pull request Jan 17, 2023
sunank200 added a commit to astronomer/apache-airflow-providers-transfers that referenced this pull request Mar 24, 2023