Add GCSToTrinoOperator #21704
Conversation
r? @eladkal
```python
data = list(csv.reader(temp_file))
fields = tuple(data[0])
rows = []
for row in data[1:]:
    rows.append(tuple(row))
```
Suggested change:

```diff
-data = list(csv.reader(temp_file))
-fields = tuple(data[0])
-rows = []
-for row in data[1:]:
-    rows.append(tuple(row))
+data = csv.reader(temp_file)
+fields = tuple(next(data))
+rows = (tuple(row) for row in data)
```
This saves significant memory usage when the downloaded file is large.
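The suggested rewrite can be exercised in isolation; in this sketch an `io.StringIO` stands in for the downloaded temp file:

```python
import csv
import io

# Stand-in for the temp file downloaded from GCS.
temp_file = io.StringIO("id,name\n1,alice\n2,bob\n")

data = csv.reader(temp_file)
fields = tuple(next(data))           # consume only the header row
rows = (tuple(row) for row in data)  # lazy: rows are produced one at a time

print(fields)      # ('id', 'name')
print(list(rows))  # [('1', 'alice'), ('2', 'bob')]
```

Because `rows` is a generator expression, the file's rows are never all held in memory at once; each row is parsed only when the consumer asks for it.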
@uranusjr `fields = tuple(next(data))` fails with `TypeError: 'tuple' object is not an iterator` when I run the unit tests. The line flagged in the test file is `op.execute(None)`.

I am not really sure why this fails. The error indicates I am passing a tuple in place of a csv reader, but given the line in the test file where the failure occurs, I am not sure how I could be passing one.
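One plausible cause (an assumption, not verified against the actual test file): if the unit test patches `csv.reader` so that it returns a plain tuple of rows, the old `list(...)`/indexing code still works, because `list()` and `for` call `iter()` implicitly, but `next()` does not, and tuples are iterable without being iterators. A minimal reproduction of the symptom:

```python
import csv
from unittest import mock

# Hypothetical test setup: csv.reader is mocked to return a tuple of rows.
with mock.patch("csv.reader", return_value=(("id", "name"), ("1", "a"))):
    data = csv.reader(None)  # actually a plain tuple, because of the mock
    print(list(data))        # fine: list() calls iter() internally
    try:
        next(data)           # tuples are iterable but are NOT iterators
    except TypeError as exc:
        print(exc)           # 'tuple' object is not an iterator
```

If this is what the test does, fixing it on the test side, e.g. `return_value=iter([...])`, would let the generator-based operator code pass.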
> TypeError: 'tuple' object is not an iterator

This doesn't sound right. The `next` call is on a `csv.reader()`, which is not a tuple. Did you mistype this as `next(tuple(data))`?
No - I did not mistype that. I was surprised it failed too!
I have pushed the change (I expect CI would fail).
Tests are failing:
Yes. I made the change based on @uranusjr's suggestion. This logic is more efficient than reading in the whole CSV at once, but it causes the unit tests to fail. I pushed the change upstream so that I can get suggestions on why this happens.
But

Yes, it is intended, because it does not work with the logic for . I can add a different way to specify the schema (maybe a separate JSON file), but not
```python
from airflow.utils.context import Context


class GCSToTrinoOperator(BaseOperator):
```
Are there any args for this operator that could be dynamically generated and should be `template_fields`? I'm thinking `source_bucket`, `source_object`, and `trino_table` (?) might be good candidates here.

And, on a related note, it's probably worth adding the same `template_fields` to `GCSToPrestoOperator` in a separate PR too.
Can you give an example of template fields / dynamically generated fields? I can make the change here, and for the Presto operator after that.
Within operators and sensors you can specify `template_fields`, which allow the values for those args to be Jinja-templated and unlock some functionality with the TaskFlow API. The Concepts docs have some information on Jinja templating in operators.

So you'd want to think about how users might interact with this operator. Could they potentially use any of the built-in Jinja templates as part of the arg for certain parameters, or even want to use an output from an upstream task as the value?

Storage buckets/containers and object names are pretty classic examples where Jinja templating is used frequently, especially if you think about date-partitioned paths. This is why I was suggesting `source_bucket` and `source_object` as `template_fields`. And adding `trino_table` as a templatable arg can't hurt anything either.
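To make the mechanism concrete, here is a self-contained sketch of how `template_fields` behave, with Airflow's Jinja2 rendering mimicked by a trivial string substitution (the class and `render` helper are illustrative stand-ins, not Airflow's real implementation):

```python
class GCSToTrinoOperatorSketch:
    # Attributes listed here are rendered against the task context
    # before execute() runs; anything not listed is left untouched.
    template_fields = ("source_bucket", "source_object", "trino_table")

    def __init__(self, source_bucket, source_object, trino_table):
        self.source_bucket = source_bucket
        self.source_object = source_object
        self.trino_table = trino_table


def render(op, context):
    """Substitute {{ key }} placeholders in every templated attribute."""
    for field in op.template_fields:
        value = getattr(op, field)
        for key, val in context.items():
            value = value.replace("{{ " + key + " }}", val)
        setattr(op, field, value)


op = GCSToTrinoOperatorSketch(
    source_bucket="my-bucket",
    source_object="exports/{{ ds }}/data.csv",   # date-partitioned path
    trino_table="staging_{{ ds_nodash }}",
)
render(op, {"ds": "2022-02-21", "ds_nodash": "20220221"})
print(op.source_object)  # exports/2022-02-21/data.csv
print(op.trino_table)    # staging_20220221
```

In a real DAG the rendering is done by Airflow itself, so the user just passes the templated strings as constructor arguments.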
Added `template_fields`. Thank you!!
@eladkal, @uranusjr: I have added alternate ways to provide Do you think these will be enough?
The PR is likely OK to be merged with just a subset of tests for default Python and Database versions without running the full matrix of tests, because it does not modify the core of Airflow. If the committers decide that the full tests matrix is needed, they will add the label 'full tests needed'. Then you should rebase to the latest main or amend the last commit of the PR, and push it with --force-with-lease.
Follow-up PR as discussed in #21084
The logic follows that of the above PR: it loads a CSV file from Google Cloud Storage into a Trino table.
Assumptions: