Change dataset URI validation to raise warning instead of error in Airflow 2.9 #39670
Merged
Conversation
tatiana changed the title from "Change dataset validation to raise warning" to "Change dataset URI validation to raise warning instead of error in Airflow 2.9" on May 16, 2024
pankajkoti approved these changes on May 16, 2024
pankajkoti reviewed on May 16, 2024
RNHTTR approved these changes on May 16, 2024
amoghrajesh approved these changes on May 17, 2024
Since we have multiple approvals on this, I'm merging it. If anyone still has comments, please drop them here and we can address them in a subsequent PR.
utkarsharma2 added the `type:improvement` (Changelog: Improvements) and `type:bug-fix` (Changelog: Bug Fixes) labels, and removed the `type:improvement` (Changelog: Improvements) label, on Jun 3, 2024
ephraimbuddy pushed a commit that referenced this pull request on Jun 4, 2024
Change dataset URI validation to raise warning instead of error in Airflow 2.9 (#39670)

Closes: #39486

Valid DAGs that worked in Airflow 2.8.x and had tasks with outlets using specific URIs, such as `Dataset("postgres://postgres:5432/postgres.dbt.stg_customers")`, stopped working in Airflow 2.9.0 and 2.9.1 after #37005 was merged. This was a breaking change in an Airflow minor version, which we should avoid.

Airflow < 3.0 should raise a warning; from Airflow 3.0, we can make errors the default. A feature flag can let users who want to see this behaviour in advance enable errors in Airflow 2.x, but it should not be the default. DAGs should continue working on Airflow 2.x minor/micro releases without errors (unless the user opts in via configuration).

Running the following DAG with `apache-airflow==2.9.1` and `apache-airflow-providers-postgres==5.11.0`, as an example:

```python
from datetime import datetime

from airflow import DAG
from airflow.datasets import Dataset
from airflow.operators.empty import EmptyOperator

with DAG(dag_id='empty_operator_example', start_date=datetime(2022, 1, 1), schedule_interval=None) as dag:
    task1 = EmptyOperator(
        task_id='empty_task1',
        dag=dag,
        outlets=[Dataset("postgres://postgres:5432/postgres.dbt.stg_customers")]
    )
    task2 = EmptyOperator(
        task_id='empty_task2',
        dag=dag
    )
    task1 >> task2
```

causes the exception:

```
Broken DAG: [/usr/local/airflow/dags/example_issue.py]
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/airflow/datasets/__init__.py", line 81, in _sanitize_uri
    parsed = normalizer(parsed)
             ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/airflow/providers/postgres/datasets/postgres.py", line 34, in sanitize_uri
    raise ValueError("URI format postgres:// must contain database, schema, and table names")
ValueError: URI format postgres:// must contain database, schema, and table names
```

This PR introduces the following:

1. A boolean configuration in `[core]` named `strict_dataset_uri_validation`, which is `False` by default.
2. When this configuration is `False`, Airflow raises a warning saying:

   ```
   From Airflow 3, Airflow will be more strict with Dataset URIs, and the URI xx will no longer be valid. Please, follow the expected standard as documented in XX.
   ```

3. When this configuration is `True`, Airflow raises the exception, as it does in Airflow 2.9.0 and 2.9.1.
4. From Airflow 3.0, we change this configuration to be `True` by default.

(cherry picked from commit a07d799)
ephraimbuddy pushed a commit that referenced this pull request on Jun 5, 2024
tatiana pushed a commit to astronomer/astronomer-cosmos that referenced this pull request on Jul 17, 2024
I bumped the Astronomer image to 11.6, since it uses the newest Airflow and incorporates [tatiana's change to dataset URI validation](apache/airflow#39670). I am also installing requirements.txt in the image by default: previously, if users wanted to add a custom package, they had to change not only requirements.txt but also the Dockerfile. With this change, adding dependencies to requirements.txt is enough to install them in the dev image.
romsharon98 pushed a commit to romsharon98/airflow that referenced this pull request on Jul 26, 2024
Closes: #39486

Context

Valid DAGs that worked in Airflow 2.8.x and had tasks with outlets using specific URIs, such as `Dataset("postgres://postgres:5432/postgres.dbt.stg_customers")`, stopped working in Airflow 2.9.0 and 2.9.1 after #37005 was merged. This was a breaking change in an Airflow minor version, which we should avoid.

Airflow < 3.0 should raise a warning; from Airflow 3.0, we can make errors the default. A feature flag can let users who want to see this behaviour in advance enable errors in Airflow 2.x, but it should not be the default.

DAGs should continue working on Airflow 2.x minor/micro releases without errors (unless the user opts in via configuration).
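Opting in ahead of time would go through Airflow's standard configuration mechanisms. A minimal sketch, assuming the option name from this PR and Airflow's documented `AIRFLOW__{SECTION}__{KEY}` environment-variable convention for overriding config options:

```python
import os

# Airflow reads any [core] option from a matching AIRFLOW__CORE__*
# environment variable. The option name below is taken from this PR's
# description; it must be set before DAG parsing starts.
os.environ["AIRFLOW__CORE__STRICT_DATASET_URI_VALIDATION"] = "True"
```

In a real deployment this would normally be set in the container or scheduler environment rather than in Python.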
How to reproduce

Running the DAG shown in the commit message above with `apache-airflow==2.9.1` and `apache-airflow-providers-postgres==5.11.0` breaks the DAG at parse time: the postgres provider's `sanitize_uri` raises `ValueError("URI format postgres:// must contain database, schema, and table names")`.
About the changes introduced

This PR introduces the following:

1. A boolean configuration in `[core]` named `strict_dataset_uri_validation`, which is `False` by default.
2. When this configuration is `False`, Airflow raises a warning: "From Airflow 3, Airflow will be more strict with Dataset URIs, and the URI xx will no longer be valid. Please, follow the expected standard as documented in XX."
3. When this configuration is `True`, Airflow raises the exception, as it does in Airflow 2.9.0 and 2.9.1.
4. From Airflow 3.0, we change this configuration to be `True` by default.
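Based on point 1, opting in via `airflow.cfg` would look like the fragment below. This is a sketch: the section and option name are taken from this PR's description, so check the generated configuration reference for your Airflow version before relying on it.

```
[core]
strict_dataset_uri_validation = True
```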