Skip to content

Conversation

@JavierLopezT
Copy link
Contributor

@JavierLopezT JavierLopezT commented Jun 12, 2020

This PR adds the truncate argument (bool), which makes a truncate right before the copy statement, within the same transaction. Also, the color has been changed to a more vivid one.


Make sure to mark the boxes below before creating PR: [x]

  • Description above provides context of the change
  • Unit tests coverage for changes (not needed for documentation changes)
  • Target Github ISSUE in description if exists
  • Commits follow "How to write a good git commit message"
  • Relevant documentation is updated including usage instructions.
  • I will engage committers as explained in Contribution Workflow Example.

@boring-cyborg boring-cyborg bot added the provider:amazon AWS/Amazon - related issues label Jun 12, 2020
@feluelle feluelle self-requested a review June 14, 2020 19:03
Copy link
Member

@feluelle feluelle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure all those functionality should be in a transfer operator. It gets really complex. Where do we limit it?

@feluelle
Copy link
Member

feluelle commented Jun 24, 2020

I think the functionality you added is useful, but this can also be done in separate tasks which then makes it also easier to monitor i.e. every step you can check in the UI.

@JavierLopezT
Copy link
Contributor Author

Got it. So I will remove from the PR the functionalities of deleting s3 file, duplicates checking and executing previous queries, and leave the other things. Is that okay?

@feluelle
Copy link
Member

Got it. So I will remove from the PR the functionalities of deleting s3 file, duplicates checking and executing previous queries, and leave the other things. Is that okay?

By different tasks I mean different Airflow tasks - not different PR's.

@feluelle
Copy link
Member

I think also the "truncate" functionality can be done by a different PostgresOperator task. WDYT?

@JavierLopezT
Copy link
Contributor Author

While I agree that the rest of functionalities can be done in separates operators (for instance, you may want to delete several s3 files at the end of your dag, or you don't want to check duplicates, or you may execute previous queries in a single operator at the beginning of the dag), I think the "truncate" is more "fittable" to be in this operator, because it is way simpler than the others (just one line) and it is just kind of preparation for the copy, not an extra step. Also, I believe the log/errors that can be thrown because of the truncate are simple enough to be monitored in the same task of the COPY. Anyway, if you think I should also remove that from the PR, I'll do it. Thanks!

@feluelle
Copy link
Member

Good explanation. I can see the "truncate" functionality stay in there. 👍

@JeffryMAC
Copy link

I think also the "truncate" functionality can be done by a different PostgresOperator task. WDYT?

Other than what @JavierLopezT mentioned if you separated Truncate & Copy you might end up with empty table.
You want the Truncate & Copy be executed one after another.

While this doesn't guaranty atomecy it's the closest thing you can have.

@feluelle
Copy link
Member

Okay, but then we should at least use SQL transactions.

@feluelle
Copy link
Member

@JavierLopezT WDYT? Can you use transactions when doing truncate + copy? :)

@JavierLopezT
Copy link
Contributor Author

@feluelle Sure, I'll come back to this PR as soon as I can

@JavierLopezT
Copy link
Contributor Author

Hello @feluelle. Do you think I can take the changes proposed in #7688 and put it in there or should I open a new MR?

@feluelle
Copy link
Member

feluelle commented Aug 6, 2020

I would add it to a new PR.

Note that the PR you linked is proposing a change to RedshiftToS3 not S3ToRedshift like you do here.

@JavierLopezT
Copy link
Contributor Author

Hello @feluelle, I have done the changes, and now comes the drama: TESTING

I think I have to add a test to check that the truncate works.
Is it enough to add a check that tests if the truncate_statement returns the correct string? Do I have to modify the already existing test to include the transaction part?

Thank you very much

Copy link
Member

@feluelle feluelle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@JavierLopezT can we rename transaction variable to someting more appropriate like sql or so?

@feluelle
Copy link
Member

Javier I also removed the following from the description:

  • Modified a part in the copy statement, that will makes prior workflows incompatible. Not sure if I have to propose a discussion about this point somewhere else (If so, I will delete this part from this PR) or just create a note in UPDATING.md. The prior code was:
COPY {self.schema}.{self.table}
            FROM 's3://{self.s3_bucket}/{self.s3_key}/{self.table}'
            with credentials
            'aws_access_key_id={credentials.access_key};aws_secret_access_key={credentials.secret_key}'
            {copy_options};

And the new is:

COPY {self.schema}.{self.table}
            FROM 's3://{self.s3_bucket}/{self.s3_key}'
            with credentials
            'aws_access_key_id={credentials.access_key};aws_secret_access_key={credentials.secret_key}'
            {copy_options};

So we don't have to have the table as the file name. I have read here and somewhere else complains about the existence of the table at the end of the from.

..as this isn't in your changes / PR ?!

@feluelle feluelle changed the title Few new functionalities and changes in S3ToRedshiftOperator Add truncate table (before copy) option to S3ToRedshiftOperator Sep 23, 2020
@JavierLopezT
Copy link
Contributor Author

Javier I also removed the following from the description:

  • Modified a part in the copy statement, that will makes prior workflows incompatible. Not sure if I have to propose a discussion about this point somewhere else (If so, I will delete this part from this PR) or just create a note in UPDATING.md. The prior code was:
COPY {self.schema}.{self.table}
            FROM 's3://{self.s3_bucket}/{self.s3_key}/{self.table}'
            with credentials
            'aws_access_key_id={credentials.access_key};aws_secret_access_key={credentials.secret_key}'
            {copy_options};

And the new is:

COPY {self.schema}.{self.table}
            FROM 's3://{self.s3_bucket}/{self.s3_key}'
            with credentials
            'aws_access_key_id={credentials.access_key};aws_secret_access_key={credentials.secret_key}'
            {copy_options};

So we don't have to have the table as the file name. I have read here and somewhere else complains about the existence of the table at the end of the from.

..as this isn't in your changes / PR ?!

Oh yes, I have just seen that this was already solved in August in c635804#diff-85c770282ec360127508910595a47fd1
Thanks!

@JavierLopezT
Copy link
Contributor Author

The template_field is also added in #10890. I'll edit my description again

@turbaszek turbaszek requested a review from feluelle September 27, 2020 17:08
@JavierLopezT
Copy link
Contributor Author

@JavierLopezT can we rename transaction variable to someting more appropriate like sql or so?

Done

@feluelle
Copy link
Member

feluelle commented Oct 8, 2020

LGTM @JavierLopezT. Let's wait until the issue on master is fixed and then I am merging this one.

@feluelle
Copy link
Member

feluelle commented Oct 9, 2020

@JavierLopezT There are so many failing checks, can you rebase once more and if this is fine I am merging it. :)

@JavierLopezT
Copy link
Contributor Author

@JavierLopezT There are so many failing checks, can you rebase once more and if this is fine I am merging it. :)

Done. Not sure if I have done it correctly though. (I followed https://github.com/apache/airflow/blob/master/CONTRIBUTING.rst#id9)

@feluelle
Copy link
Member

@JavierLopezT test seems to fail. Can you look at it, please? :)

Copy link
Member

@turbaszek turbaszek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👌 @feluelle ?

Copy link
Member

@feluelle feluelle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, merging it.

@feluelle feluelle merged commit db121f7 into apache:master Oct 28, 2020
michalmisiewicz pushed a commit to michalmisiewicz/airflow that referenced this pull request Oct 30, 2020
…he#9246)

- add table arg to jinja template fields
- change ui_color

Co-authored-by: javier.lopez <javier.lopez@promocionesfarma.com>
szn pushed a commit to szn/airflow that referenced this pull request Nov 1, 2020
…he#9246)

- add table arg to jinja template fields
- change ui_color

Co-authored-by: javier.lopez <javier.lopez@promocionesfarma.com>
@JavierLopezT JavierLopezT deleted the s3toredshift branch January 20, 2021 19:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

provider:amazon AWS/Amazon - related issues

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants