
Add GCSToTrinoOperator #21704

Merged
merged 1 commit into from
Feb 27, 2022

Conversation

rsg17
Contributor

@rsg17 rsg17 commented Feb 21, 2022

Follow-up PR as discussed in #21084; the logic is similar to that PR.
This operator loads a CSV file from Google Cloud Storage into a Trino table.
Assumptions:

  1. The first row of the CSV contains the headers
  2. A Trino table with the requisite columns has already been created
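Under those two assumptions, the core of such an operator can be sketched as follows (a sketch only: the helper name `load_csv_into_trino` and the `insert_rows` callback are illustrative, modeled on Airflow's `DbApiHook.insert_rows` convention, not the merged code):

```python
import csv

def load_csv_into_trino(path, table, insert_rows):
    """Stream a CSV file into an existing table via an insert callback."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        # Assumption 1: the first row of the CSV contains the headers.
        fields = tuple(next(reader))
        # Stream the remaining rows lazily to keep memory usage flat.
        rows = (tuple(r) for r in reader)
        # Assumption 2: `table` already exists with the requisite columns.
        insert_rows(table=table, rows=rows, target_fields=fields)
```

With a DbApiHook-style hook, the callback would be something like `trino_hook.insert_rows`.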


@rsg17
Contributor Author

rsg17 commented Feb 21, 2022

r? @eladkal
Here is the PR for GCSToTrinoOperator that we discussed on the GCSToPrestoOperator PR.

Comment on lines 96 to 100
    data = list(csv.reader(temp_file))
    fields = tuple(data[0])
    rows = []
    for row in data[1:]:
        rows.append(tuple(row))
Member

Suggested change

    data = csv.reader(temp_file)
    fields = tuple(next(data))
    rows = (tuple(row) for row in data)

This saves significant memory usage when the downloaded file is large.

Contributor Author

@uranusjr
fields = tuple(next(data)) fails with TypeError: 'tuple' object is not an iterator when I run the unit test.
The line flagged in the test file is op.execute(None).

I am not really sure why this fails. The error indicates that a tuple is being passed in place of a csv.reader, but given the line in the test file where the failure occurs, I do not see how a tuple could end up there.

Member

TypeError: 'tuple' object is not an iterator

This doesn’t sound right. The next call is on a csv.reader(), which is not a tuple. Did you mistype this as next(tuple(data))?
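One plausible explanation (an assumption; it would need to be confirmed against the actual test module): if the unit test patches csv.reader with a mock whose return value is a plain tuple, then next() fails exactly like this, because a tuple is iterable but is not itself an iterator. A minimal reproduction:

```python
import csv
import io
from unittest import mock

# A real csv.reader is an iterator, so next() works on it:
reader = csv.reader(io.StringIO("a,b\n1,2\n"))
assert tuple(next(reader)) == ("a", "b")

# A plain tuple is iterable but NOT an iterator:
try:
    next((("a", "b"), ("1", "2")))
except TypeError as exc:
    assert "not an iterator" in str(exc)

# A hypothetical test that patched csv.reader to return a tuple would
# reproduce the observed traceback:
with mock.patch("csv.reader", return_value=(("a", "b"), ("1", "2"))):
    data = csv.reader(io.StringIO(""))  # actually the mocked tuple
    try:
        next(data)  # TypeError: 'tuple' object is not an iterator
    except TypeError:
        pass
    # wrapping in iter() restores iterator semantics:
    assert tuple(next(iter(data))) == ("a", "b")
```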

Contributor Author

No - I did not mistype that. I was surprised it failed too!

I have pushed the change (I expect CI would fail).

@eladkal
Contributor

eladkal commented Feb 21, 2022

Tests are failing:


  self = <Task(GCSToTrinoOperator): test_gcs_to_trino>, context = None
  
      def execute(self, context: 'Context') -> None:
          gcs_hook = GCSHook(
              gcp_conn_id=self.gcp_conn_id,
              delegate_to=self.delegate_to,
              impersonation_chain=self.impersonation_chain,
          )
      
          trino_hook = TrinoHook(trino_conn_id=self.trino_conn_id)
      
          with NamedTemporaryFile("w+") as temp_file:
              self.log.info("Downloading data from %s", self.source_object)
              gcs_hook.download(
                  bucket_name=self.source_bucket,
                  object_name=self.source_object,
                  filename=temp_file.name,
              )
      
              data = csv.reader(temp_file)
  >           fields = tuple(next(data))
  E           TypeError: 'tuple' object is not an iterator
  
  airflow/providers/trino/transfers/gcs_to_trino.py:97: TypeError

@rsg17
Contributor Author

rsg17 commented Feb 21, 2022

> Tests are failing: (traceback quoted above)

Yes. I made the change based on @uranusjr's suggestion. This logic is more efficient than reading the whole CSV into memory at once, but it causes the unit tests to fail.

I pushed the change upstream so that I could get suggestions on why this happens.

@rsg17
Contributor Author

rsg17 commented Feb 22, 2022

Here is an updated version that does not take target_fields from the first row of the CSV. I think all tests should pass now.

I am still not sure why fields = tuple(next(data)) failed with TypeError: 'tuple' object is not an iterator. I ended up excluding it from the code.

cc - @eladkal, @uranusjr

@uranusjr
Member

But fields is gone entirely in your new commit. Is it intended?

@rsg17
Contributor Author

rsg17 commented Feb 22, 2022

But fields is gone entirely in your new commit. Is it intended?

Yes, it is intended, because the fields logic does not work with this approach.

I can add a different way to specify the schema (maybe a separate JSON file), but not fields taken from the CSV itself.

    from airflow.utils.context import Context


    class GCSToTrinoOperator(BaseOperator):
Contributor

Are there args for this operator that could be dynamically generated and should therefore be template_fields? I'm thinking source_bucket, source_object, and trino_table(?) might be good candidates here.

And, on a related note, probably worth also adding the same template_fields to GCSToPrestoOperator in a separate PR too.

Contributor Author

Can you give an example of template fields / dynamically generated fields?

I can make the change here and for the presto operator after that

Contributor

Within operators and sensors you can specify template_fields which allow the values for those args to be Jinja templated and unlock some functionality with TaskFlow API. The Concepts docs have some information on Jinja templating in operators.

So you'd want to think about how users might interact with this operator. Could they potentially use any of the built-in Jinja templates as part of the arg for certain parameters or even want to use an output from an upstream task as the value?

Storage buckets/containers and object names are classic examples where Jinja templating is used frequently, especially for date-partitioned paths. This is why I suggested source_bucket and source_object as template_fields. Adding trino_table as a templatable arg can't hurt either.
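The idea can be sketched with a toy operator (a stand-alone sketch: Airflow's real rendering is done with Jinja inside BaseOperator, and the toy string substitution below merely stands in for the Jinja engine; the attribute names mirror this discussion):

```python
class FakeOperator:
    # Airflow renders each attribute named in template_fields with Jinja
    # before execute() runs; here a naive substitution stands in for Jinja.
    template_fields = ("source_bucket", "source_object", "trino_table")

    def __init__(self, source_bucket, source_object, trino_table):
        self.source_bucket = source_bucket
        self.source_object = source_object
        self.trino_table = trino_table

    def render(self, context):
        for name in self.template_fields:
            value = getattr(self, name)
            for key, val in context.items():
                value = value.replace("{{ %s }}" % key, val)
            setattr(self, name, value)

# Date-partitioned paths are the classic use case:
op = FakeOperator("my-bucket", "exports/{{ ds }}/data.csv", "sales_{{ ds_nodash }}")
op.render({"ds": "2022-02-25", "ds_nodash": "20220225"})
```

After rendering, `op.source_object` is the concrete dated path, so each daily run reads its own partition.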

Contributor Author

@rsg17 rsg17 Feb 25, 2022

Added template_fields. Thank you!!

@rsg17
Contributor Author

rsg17 commented Feb 25, 2022

@eladkal, @uranusjr: I have added alternate ways to provide fields. Users can either provide them as a list or point to a JSON file on GCS that contains the fields. I referred to how this was done for gcs_to_bigquery.

Do you think these will be enough?
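The two options might look something like this (a hypothetical helper; the parameter names echo the discussion and the gcs_to_bigquery precedent, not necessarily the merged operator):

```python
import json

def resolve_target_fields(schema_fields=None, schema_object_bytes=None):
    """Return the column list from either a literal list or a JSON blob
    (e.g. bytes downloaded from GCS with GCSHook.download pointing at a
    file containing something like ["col_a", "col_b"]).

    Returning None means: fall back to the table's own column order.
    """
    if schema_fields is not None:
        return list(schema_fields)
    if schema_object_bytes is not None:
        return json.loads(schema_object_bytes.decode("utf-8"))
    return None
```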

@github-actions github-actions bot added the okay to merge It's ok to merge this PR as it does not require more tests label Feb 25, 2022
@github-actions

The PR is likely OK to be merged with just subset of tests for default Python and Database versions without running the full matrix of tests, because it does not modify the core of Airflow. If the committers decide that the full tests matrix is needed, they will add the label 'full tests needed'. Then you should rebase to the latest main or amend the last commit of the PR, and push it with --force-with-lease.
