
[AIRFLOW-8664] Postgres to GCS operator - use named cursor #11793

Merged
mik-laj merged 1 commit into apache:master from maroshmka:hmka/feat-server-side-cursor-pg
Nov 4, 2020

Conversation

@maroshmka
Contributor

Hello!

Implemented a feature for the postgres_to_gcs operator that allows it to use a server-side cursor. More in issue #8664.

I was thinking about multiple designs to generalize it a bit more, but in the end went with a straightforward solution, doing it only for the Postgres db while keeping it consistent with the presto_to_gcs operator. If there is more demand for server-side cursors on other dbs as well (mssql, mysql), the generalization can be done then.

btw, had a great experience with Breeze, absolutely love it! I was skeptical and was counting on a "this will help you develop, but it will yield 100 errors anyway" experience, but it was really a breeze - install, read docs, run. Nice!

Thanks!

@boring-cyborg boring-cyborg bot added the provider:google Google (including GCP) related issues label Oct 23, 2020
@maroshmka maroshmka force-pushed the hmka/feat-server-side-cursor-pg branch from 70fb9b3 to 7df41f4 Compare October 23, 2020 20:54
@mik-laj
Member

mik-laj commented Oct 24, 2020

I like this change very much, but I would like to test it before merging. Can you add system tests to make this easier? https://github.com/apache/airflow/blob/7df41f4f14a870ac3f62da5b639768b72ba9f583/tests/providers/google/cloud/transfers/test_postgres_to_gcs_system.py
For that, you need to update the example DAG.
https://github.com/apache/airflow/tree/7df41f4f14a870ac3f62da5b639768b72ba9f583/airflow/providers/google/cloud/example_dags

@potiuk
Member

potiuk commented Oct 25, 2020

As @mik-laj mentioned - adding system tests and updating the howtos would be really great.

And I agree that generalizations might be added later. We even have a discussion and a design doc about it. Discussion here: https://lists.apache.org/thread.html/rc888a329f1c49622c0123c2ddbcfcc107eead020b774f8a8fab6d7f1%40%3Cdev.airflow.apache.org%3E and the design doc: https://docs.google.com/document/d/1o7Ph7RRNqLWkTbe7xkWjb100eFaK1Apjv27LaqHgNkE/edit - you might simply join the discussion and maybe participate later in developing it. Since we are busy with the 2.0 release now, this feature ("generic transfer operators") does not have the highest priority, but it sounds like a solid candidate for 2.1 some time early next year.

And thanks for the kind words about Breeze :)

@maroshmka
Contributor Author

sure, will add the system tests. I'll try to do it in the first half of the week.

ad generic transfers - @potiuk great to hear that something like that is happening. Thank you for the info, I'll look into it :)

@maroshmka maroshmka force-pushed the hmka/feat-server-side-cursor-pg branch from 7df41f4 to bcc8664 Compare October 26, 2020 22:35
@maroshmka
Contributor Author

maroshmka commented Oct 26, 2020

hello, I amended the system tests and docs. A few notes:

1. It took quite a lot of time to make it work locally.

  • I needed to export GCP_PROJECT_ID, which wasn't written in the doc
  • the GCS_BUCKET variable was used in 2 places, but I changed it to be unique in only 1 place. I didn't know what the problem was, so I tried to debug elsewhere and didn't notice this :/ (also, with no env set yet, I was getting a 403 instead of a 409)

2. I'm not really sure what these tests are for.
I thought there would be some assert against the real GCS bucket. But as I saw (and found in the docs), there is no logic in the tests, and all the other tests I checked contained just a bunch of tasks, e.g. copying a table to BQ, and that's it. So, did I miss something? Or are these tests just for ensuring that the operator can run in an actual DAG? If yes, then OK, no matter the logic (which is tested separately). These resemble e2e tests to me, but without validation.

Not sure if 1. should be solved, as I'm not sure about 2. 😄 But if yes, I can fix both later - the gcs_bucket name could be generated and made unique, and the docs could be updated to mention setting up env vars specifically for system tests.

@mik-laj
Member

mik-laj commented Oct 27, 2020

I needed to export GCP_PROJECT_ID, which wasn't written in the doc

We are working to make all environment variables public, but that will take some time. Sorry for the inconvenience.

the GCS_BUCKET variable was used in 2 places, but I changed it to be unique in only 1 place. I didn't know what the problem was, so I tried to debug elsewhere and didn't notice this :/ (also, with no env set yet, I was getting a 403 instead of a 409)

It looks like you don't have access to the right buckets. Each example DAG should allow the bucket to be changed using an environment variable. In this particular case, that did not happen, and I think it is worth adding this possibility. This will make it easy to change buckets in case of a conflicting name.
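The fix can be as small as reading the bucket name from the environment with a fallback default (the variable name `GCP_GCS_BUCKET` and the function below are illustrative; check the actual example DAG for the exact names used):

```python
import os


def example_bucket(default="INVALID BUCKET NAME"):
    # Let system-test runners override the bucket via an environment
    # variable, so a conflicting bucket name is easy to work around.
    return os.environ.get("GCP_GCS_BUCKET", default)
```

A deliberately invalid default makes a missing configuration fail loudly instead of silently writing to a shared bucket.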

  1. I'm not really sure what these tests are for.

In most cases, we have a more complicated example DAG and the validation is embedded in the DAG logic, e.g. the first task creates a resource, and the next task uses this resource. If the first task didn't create it, the next task couldn't use it. But even without complicated DAGs, such simple system tests without validation can detect problems with third-party libraries, because when a method is called with wrong parameters, a runtime exception will be thrown. This won't get detected in unit testing, as we use unittest.mock very intensively.
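To illustrate why heavy mocking hides these bugs (the `upload` call and its misspelled keyword below are made up for this example): a `MagicMock` accepts any call signature, so a typo that would crash against the real client library passes unit tests silently.

```python
from unittest import mock


def upload_file(hook):
    # "objecct_name" is a deliberate typo: a real client would raise
    # TypeError for the unexpected keyword, but a mock accepts anything.
    hook.upload(bucket_name="my-bucket", objecct_name="file.csv")


fake_hook = mock.MagicMock()
upload_file(fake_hook)  # no error raised; only a system test would catch it
```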

Not sure if 1. should be solved, as I'm not sure about 2. 😄 But if yes, I can fix both later - the gcs_bucket name could be generated and made unique, and the docs could be updated to mention setting up env vars specifically for system tests.

We don't want random names generated every time we run the test, as this would make debugging bugs difficult. The bucket name should be generated only once on a given computer, to make the behavior more predictable.

If you want, you can create files/airflow-breeze-config/init.sh with some extra environment initialization code. In this directory we store the script that configures the environment variables and provides the service account keys. For now, we have 26 service account keys and 128 env variables.

This quarter, we've already started working on making these system tests run regularly on the CI. Once we're done, everyone will have an easy way to create all the required environment keys and variables. Running system tests in an OSS project is quite complicated for a variety of financial, security, and technical reasons.

Hope my explanations were helpful for you. Let me know if you want to make any more changes. If not, I will run the system tests (manually) and check if it works properly, then merge the change.

@maroshmka maroshmka force-pushed the hmka/feat-server-side-cursor-pg branch from bcc8664 to 9262b3f Compare October 27, 2020 08:50
@maroshmka
Contributor Author

thank you very much for such a thorough explanation, it clears up a lot of stuff in my head!

Each DAG example should allow bucket to be changed using an environment variable. In this particular case, it did not happen, and I think it is worth adding this possibility.

amended with the possibility to take it from env vars; this was missing, and that's what I got stuck on the most :/

We don't want random names generated every time we run the test as this will make debugging bugs difficult. The bucket name should only be generated once on a given computer to make behavior more predictable.

agree. my idea was to append some random suffix when starting Breeze and set the env vars, e.g. bucket_test_{str(uuid.uuid4())[:6]}. As you are already working on it, it's cool I guess :)

that'd be all from my side, so you can test it :)

@kaxil
Member

kaxil commented Nov 4, 2020

Can you please rebase your PR on the latest master, since we have applied Black and PyUpgrade on master.

It will help if you squash your commits into a single commit first, so that there are fewer conflicts.

@maroshmka maroshmka force-pushed the hmka/feat-server-side-cursor-pg branch from 9262b3f to 680d7a4 Compare November 4, 2020 08:39
@maroshmka
Contributor Author

sure, done.

@mik-laj
Member

mik-laj commented Nov 4, 2020


tests/providers/google/cloud/transfers/test_postgres_to_gcs_system.py::PostgresToGCSSystemTest::test_run_example_dag PASSED                                                                          [100%]

Thanks!

@github-actions github-actions bot added the okay to merge It's ok to merge this PR as it does not require more tests label Nov 4, 2020
@github-actions

github-actions bot commented Nov 4, 2020

The PR should be OK to be merged with just a subset of tests, as it does not modify the Core of Airflow. The committers might merge it, or can add the label 'full tests needed' and re-run it to run all tests if they see it is needed!

@mik-laj mik-laj merged commit fd3db77 into apache:master Nov 4, 2020
@maroshmka maroshmka deleted the hmka/feat-server-side-cursor-pg branch November 4, 2020 19:43
