Support fnmatch pattern in SFTP Hook and Sensor #17436
Conversation
Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contribution Guide (https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst)
I think my #15409 (comment) still applies.
We are using ghcr.io as the image cache for our CI builds and Breeze, and it seems ghcr.io is being "rebuilt" while running. We had been using the "airflow-.." image naming convention before, because multiple nesting levels of images were not supported. However, we recently experienced errors pushing 2.1 images (https://issues.apache.org/jira/browse/INFRA-22124), and during investigation it turned out that it is now possible to use "/" in the name of the image. While this still does not introduce multiple nesting levels or a folder structure, the GitHub UI treats it as if it did: if an image name starts with "airflow/", the airflow prefix is stripped out, and you can add even more "/" characters in the name to introduce further hierarchy. Since we have to change the image naming convention due to the (still unresolved) bug preventing us from pushing the v2-1-test image, we decided to change the naming convention for all our cache images to follow this now-available "/" convention, to make it better structured and easier to manage and understand. Some more optimisations are implemented: Python, prod-build and ci-manifest images are only pushed when the "latest" image is prepared. They are not needed for the COMMIT builds because we only need final images for those builds. This simplified the code quite a bit. The push of the cache image in CI is done in one job for both CI and PROD images, and the image is rebuilt again with the latest constraints, to account for those constraints while making sure that UPGRADE_TO_NEWER_DEPENDENCIES is not set during the build (which would invalidate the cache for subsequent non-upgrade builds). Backwards compatibility was implemented to allow PRs that have not been rebased to continue building after this one is merged, and a workaround has been implemented to make this change work even before it is merged to main. This "legacy" mode will be removed in about a week, once everybody has rebased on top of main. Documentation is updated to reflect these changes.
Add missing exception handling in success/retry/failure callbacks
…orceHook (#17399) Adding other auth type inputs to SalesforceHook
Adding myself to INTHEWILD
CeleryKubernetesExecutor lets us use both celery and kubernetes executors. KEDA lets us scale down to zero when there are no celery tasks running. If we have no celery tasks running, and we run a k8s task, then KEDA will launch a worker even though there are still no celery tasks. We can prevent this from happening by ignoring the kubernetes queue in the KEDA query.
We are preparing the Helm chart from the main branch only and we never run it from airflow version branches (similarly to providers). This change disables Helm Chart tests when the default branch is different from main.
This test covers the failure that was missed and manifested in 2.1.1 in #16810.
There was a change in #16521 that introduced the schema field in DBApiHook, but unfortunately using it in provider Hooks deriving from DBApiHook is backwards incompatible for Airflow 2.1 and below. This made the Postgres provider 2.1.0 release backwards incompatible and caused failures on Airflow 2.1.0. Since the change is small and most DBApi-derived hooks already set the schema field on their own, the best approach is to make the schema field private in DBApiHook and change the Postgres Hook to store the schema in the same way as all other operators. Fixes: #17422
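To illustrate the direction described above, here is a minimal sketch, assuming simplified DbApiHook and PostgresHook classes; the attribute names and constructor details are illustrative, not the exact provider code.

from typing import Optional

from airflow.hooks.base import BaseHook


class DbApiHook(BaseHook):
    def __init__(self, *args, schema: Optional[str] = None, **kwargs) -> None:
        super().__init__()
        # Name-mangled (private) attribute, so hooks written against Airflow 2.1
        # and below that define their own `schema` are not affected by the base class.
        self.__schema = schema


class PostgresHook(DbApiHook):
    def __init__(self, *args, **kwargs) -> None:
        # Store the schema on the subclass itself, the same way the other
        # DBApi-derived hooks already do.
        self.schema: Optional[str] = kwargs.pop("schema", None)
        super().__init__(*args, **kwargs)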
…#17486) We have a global limit (60 seconds) for individual test execution, however 'test_airflow_context' of the Python Virtualenv tests might take longer when tests are run in parallel, because it uses dill serialization, including a lot of serializable data from the context of the task. We now give the test 120 seconds to complete.
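As a hedged sketch of how such a per-test override can look, assuming the global limit comes from the pytest-timeout plugin (the exact marker used in Airflow's test suite may differ):

import pytest


# Give this single slow test a larger budget than the global default, since
# dill serialization of the task context is expensive in parallel runs.
@pytest.mark.timeout(120)
def test_airflow_context():
    ...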
When we build the production image during tests in main, we install providers from current sources, to make sure that all the tests, including Helm Chart/Kubernetes tests, use the latest provider sources. However, when we build the prod image in v* branches, we want to build it using the latest released providers instead, because this is the way it will be built shortly when we release it. We do not run provider tests nor Helm Chart tests in those branches, so it is more important to build the image the way it will be built for releases: we then run verification and install dependencies in the very same way it will be done during a release.
We get a lot of bug reports for providers, and it's rare that the versions being used are in the initial report. Let's ask for them.
Update INTHEWILD with our organization.
The resource check in Breeze was slow (3 docker commands instead of one) and it used an extra image which needed to be downloaded. The new check uses the already available airflow CI image and performs all checks in one docker command, so it is a lot faster and also verifies the image at the same time.
There are several ways people might get the quick-start docker-compose setup messed up (especially on Linux): 1) they do not run the initialization steps and go straight to docker-compose up, 2) they do not run the docker-compose init step first. Also, on MacOS/Windows the default memory/disk settings are not enough to run Airflow via docker-compose, and people report "Airflow not working" when they simply have not allocated enough resources. Finally, the docker compose file does not support all versions of airflow, and various problems might occur when you use it with an old version of airflow. This change adds the following improvements: * automated check of the minimum supported airflow version * mkdir -p in the directory-creation instructions * automated check whether AIRFLOW_UID has been set (printing an error and an instruction link in case it is not) * a warning about too-low memory, CPU or disk allocation, with an instruction link explaining where to read about it * automated fixing of ownership of the created directories in case they were not created initially and ended up owned by the root user
This reverts commit 763860c.
It seems like your rebase went wrong.
very wrong.... Sorry for this!
Should I close this PR and create a cleaner one?
airflow/providers/sftp/hooks/sftp.py
Outdated
def retrieve_file(
    self, remote_full_path: str, local_full_path: str, fnmatch_pattern: Optional[str] = None
) -> None:
airflow/providers/sftp/hooks/sftp.py
Outdated
if fnmatch_pattern:
    for file in conn.listdir(path):
        if fnmatch(file, fnmatch_pattern):
            path = file
            break
@uranusjr, how about this method called by the SFTPSensor?
Should we create a different method to be able to catch all the files matching the pattern, or do we just wait for the first one in the list, as is implemented now?
if not self.fnmatch_pattern:
    self.log.info('Poking for %s', self.path)
    try:
        mod_time = self.hook.get_mod_time(self.path)
        self.log.info('Found File %s last modified: %s', str(self.path), str(mod_time))
    except OSError as e:
        if e.errno != SFTP_NO_SUCH_FILE:
            raise e
        return False
else:
    self.log.info('Poking for files matching pattern %s into %s', self.fnmatch_pattern, self.path)
    try:
        mod_time = self.hook.get_mod_time_pattern(self.path, self.fnmatch_pattern)
        self.log.info('Found Files last modified %s', str(mod_time))
    except OSError as e:
        if e.errno != SFTP_NO_SUCH_FILE:
            raise e
        return False
Here we have two options:
- fnmatch_pattern not specified
- fnmatch_pattern specified
In the first case, the parameter path is, as usual, the full path to the file. In the second case, path will be the path to the directory, whilst fnmatch_pattern will be used to catch all the files we are interested in. In this second case, the new method get_mod_time_pattern is called; it returns a dictionary with full paths to files as keys and their modification times as values.
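A rough sketch of what get_mod_time_pattern could look like, based on this description. The name and the return shape come from the comment above; the body is illustrative (shown here as a plain function taking the pysftp connection directly, instead of the hook's self.get_conn()), not the PR's exact code:

import datetime
from fnmatch import fnmatch
from typing import Dict


def get_mod_time_pattern(conn, dir_path: str, fnmatch_pattern: str) -> Dict[str, str]:
    """Return {full_path: modification_time} for every file matching the pattern."""
    mod_times = {}
    for file in conn.listdir(dir_path):
        if fnmatch(file, fnmatch_pattern):
            full_path = f"{dir_path}/{file}"
            # Same timestamp format the existing get_mod_time uses.
            mtime = conn.stat(full_path).st_mtime
            mod_times[full_path] = datetime.datetime.fromtimestamp(mtime).strftime('%Y%m%d%H%M%S')
    return mod_times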
:type fnmatch_pattern: str
"""
conn = self.get_conn()
for file in conn.listdir(dir_path):
This is wrong because it should raise an error when no matches are found; at the moment it does not.
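To illustrate the point, a small hedged sketch of how the lookup could fail loudly when nothing matches; the names (conn, dir_path) mirror the snippet above and this is not the PR's code:

from fnmatch import fnmatch
from typing import List


def _matching_files(conn, dir_path: str, fnmatch_pattern: str) -> List[str]:
    # Collect every entry in the directory that matches the pattern.
    matches = [f for f in conn.listdir(dir_path) if fnmatch(f, fnmatch_pattern)]
    if not matches:
        # Fail loudly instead of silently returning an empty result.
        raise FileNotFoundError(f"No file matching {fnmatch_pattern!r} found in {dir_path!r}")
    return matches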
I'm going ahead and closing this because I realized that the contribution is far from ready to merge.
Effort to give users the possibility to operate on specific files using the SFTP provider. Please refer to the old PR below for details.
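A hedged usage sketch of what the feature would look like from a DAG author's point of view; the fnmatch_pattern parameter is the one proposed in this PR and does not exist in released provider versions:

from airflow.providers.sftp.sensors.sftp import SFTPSensor

wait_for_csv = SFTPSensor(
    task_id="wait_for_csv",
    sftp_conn_id="sftp_default",
    path="/upload/",            # with a pattern, path points to a directory
    fnmatch_pattern="*.csv",    # proposed parameter: match any CSV in that directory
    poke_interval=60,
)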
related
^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code change, Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in UPDATING.md.