Airflow file sensor by prefix for azure data lake storage #22315
Conversation
Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contribution Guide (https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst)
Can you please add unit tests for it? (See other unit tests for Azure as an example.) Also, an example DAG and ideally a how-to guide would be great.
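To illustrate the request, a unit test for the sensor might look roughly like the sketch below. It mocks the hook rather than touching Azure; `AdlsPrefixSensorStub` and its argument names are hypothetical stand-ins for the proposed sensor, not the PR's actual code.

```python
from unittest import mock

# Hypothetical stand-in for the proposed sensor; the real class would
# subclass BaseSensorOperator and build an AzureDataLakeHook inside poke().
class AdlsPrefixSensorStub:
    def __init__(self, data_lake_path, prefix):
        self.data_lake_path = data_lake_path
        self.prefix = prefix

    def poke(self, hook):
        # True when at least one file matches <data_lake_path>/<prefix>*.
        return len(hook.list(self.data_lake_path + "/" + self.prefix)) > 0


def test_poke_true_when_files_match():
    hook = mock.Mock()
    hook.list.return_value = ["raw/events_2022.csv"]
    sensor = AdlsPrefixSensorStub("raw", "events")
    assert sensor.poke(hook) is True
    hook.list.assert_called_once_with("raw/events")


def test_poke_false_when_no_match():
    hook = mock.Mock()
    hook.list.return_value = []
    sensor = AdlsPrefixSensorStub("raw", "events")
    assert sensor.poke(hook) is False
```

A real test in the provider suite would instead patch `AzureDataLakeHook` with `mock.patch` and call the actual sensor's `poke()`.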
josh-fell left a comment
+1 for adding unit tests, an example DAG, and a how-to guide. These don't all necessarily need to be part of this PR.
The airflow.contrib path is deprecated so we shouldn't add new modules there. This module should be part of airflow.providers.microsoft.azure.sensors.
I would recommend running static code checks if you can as well. This will make the CI run smoother when approved and decrease any review back-and-forth, fixes, etc.
```python
from airflow.contrib.hooks.azure_data_lake_hook import AzureDataLakeHook
from airflow.sensors.base_sensor_operator import BaseSensorOperator
from airflow.utils.decorators import apply_defaults
```
Suggested change:
```diff
-from airflow.contrib.hooks.azure_data_lake_hook import AzureDataLakeHook
-from airflow.sensors.base_sensor_operator import BaseSensorOperator
-from airflow.utils.decorators import apply_defaults
+from airflow.providers.microsoft.azure.hooks.data_lake import AzureDataLakeHook
+from airflow.sensors.base import BaseSensorOperator
```
The import paths for AzureDataLakeHook and BaseSensorOperator are deprecated. Also using apply_defaults is not necessary as of Airflow 2.
```python
@apply_defaults
```
Suggested change:
```diff
-@apply_defaults
```
No longer needed as of Airflow 2.
```python
@apply_defaults
def __init__(
    self,
    DataLake_path,
```
We typically use snake case for variables, args, etc.
```python
    DataLake_path,
    prefix,
    azure_data_lake_conn_id="azure_data_lake_default",
    check_options=None,
```
This doesn't look used in the sensor. Should it be removed?
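Putting these review points together, the constructor might end up looking like the sketch below: snake_case argument names, no `@apply_defaults`, and the unused `check_options` dropped. This is a plain-Python illustration rather than the PR's actual diff; the real class would subclass `BaseSensorOperator` and forward `**kwargs` to `super().__init__()`.

```python
class AzureDataLakeStorageSensor:  # sketch; would subclass BaseSensorOperator
    def __init__(
        self,
        *,
        data_lake_path,  # renamed from DataLake_path to follow snake_case
        prefix,
        azure_data_lake_conn_id="azure_data_lake_default",
        **kwargs,  # task_id, poke_interval, etc. in the real sensor
    ):
        # super().__init__(**kwargs)  # uncomment in the real subclass
        self.data_lake_path = data_lake_path
        self.prefix = prefix
        self.azure_data_lake_conn_id = azure_data_lake_conn_id
```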
```python
    DataLake_path : directory of the file
    prefix : file name
```
Can you update the docstring for the parameters to match the Sphinx style used elsewhere? The azure_data_lake_conn_id above is a good example, as are other examples in operators and sensors.
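For reference, a Sphinx-style parameter section for this sensor might read like the sketch below (the parameter names assume the snake_case rename suggested earlier; the wording is illustrative, not the PR's actual docstring):

```python
class AzureDataLakeStorageSensor:  # sketch only, to show the docstring style
    """
    Waits for one or more files matching a name prefix to appear in an
    Azure Data Lake Storage directory.

    :param data_lake_path: Remote directory to check for files.
    :type data_lake_path: str
    :param prefix: File-name prefix to match inside ``data_lake_path``.
    :type prefix: str
    :param azure_data_lake_conn_id: Reference to the Azure Data Lake connection.
    :type azure_data_lake_conn_id: str
    """
```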
```python
    Interacts with Azure Data Lake:

    Client ID and client secret should be in user and password parameters.
    Tenant and account name should be extra field as
    {"tenant": "<TENANT>", "account_name": "ACCOUNT_NAME"}.
```
This description doesn't quite describe the purpose of the sensor. Would you mind updating and elaborating here?
```python
        return len(hook.list(self.DataLake_path + "/" + self.prefix)) > 0

        # return TRUE => 1 ou more file detected
        # return FALSE => No file detected
```
Suggested change:
```diff
-        return len(hook.list(self.DataLake_path + "/" + self.prefix)) > 0
-        # return TRUE => 1 ou more file detected
-        # return FALSE => No file detected
+        num_files = len(hook.list(self.DataLake_path + "/" + self.prefix))
+        return num_files > 0
```
Small refactor for readability. The comments are better suited for the sensor's description in the docstring since this is the fundamental operation of the sensor.
I am going to release a new version of providers in about a week, so if you want to get it included @Ash-ZAMAN, now is a good time to address the comments.
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.
Hi,
I noticed there was no Airflow file sensor suitable for Azure Data Lake Storage, so I created one. It could be really useful for people who store data in Data Lake.
The sensor can match on a file-name prefix and works with any directory path.
I hope this contribution will help someone.
Regards, Ashraf ZAMAN
^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code change, Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in UPDATING.md.