Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement AIP-60 Dataset URI formats #37005

Merged
merged 18 commits into from
Feb 26, 2024
Merged

Conversation

uranusjr
Copy link
Member

I’m struggling with documentation, and have yet to implement actual sanitisation logic. And it seems like a lot of the URIs (especially file-like ones) don’t actually need any validation at all? But this implements a basic structure.

@uranusjr
Copy link
Member Author

uranusjr commented Feb 2, 2024

Finally got this to a point I’m somewhat satisfied with. The logic should be mostly there. Not exactly sure about docs yet, please provide some ideas on where to add things.

Copy link
Contributor

@jscheffl jscheffl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was reading two times bare code - and I like it!

I am not an expert in structural changes, as provider interface is extended I assume this is not breaking provider releases as long as the core does not deliver the extension in Airflow 2.9.0?

Kindly ask to add a newsfragment entry for the release notes for the new feature that URIs are now validated, might hit a lot of users when upgrading and exsting datasets render invalid in upgrade. Users need to fix DAGs prior upgrade!

@uranusjr
Copy link
Member Author

The providers will be fine even if we release this with 2.9—since URIs without a registered scheme is silently treated as plain strings, they will work like they did in 2.8 and prior, aside from some new additional validations (e.g. no auth in the URI—but you shouldn’t do that in the first place anyway).

Good point about the fragment, I’ll add one.

@uranusjr
Copy link
Member Author

I expanded the documentation a bit, and added a news fragment as suggested above.

@mobuchowski mobuchowski self-requested a review February 23, 2024 14:42
@uranusjr uranusjr merged commit b52b227 into apache:main Feb 26, 2024
59 checks passed
@uranusjr uranusjr deleted the uri-validators branch February 26, 2024 09:28
alejorodriguez96 pushed a commit to alejorodriguez96/airflow that referenced this pull request Feb 26, 2024
alejorodriguez96 pushed a commit to alejorodriguez96/airflow that referenced this pull request Feb 26, 2024
alejorodriguez96 pushed a commit to alejorodriguez96/airflow that referenced this pull request Feb 26, 2024
abhishekbhakat pushed a commit to abhishekbhakat/my_airflow that referenced this pull request Mar 5, 2024
@ephraimbuddy ephraimbuddy added this to the Airflow 2.9.0 milestone Mar 6, 2024
@ephraimbuddy ephraimbuddy added the type:new-feature Changelog: New Features label Mar 6, 2024
utkarsharma2 pushed a commit to astronomer/airflow that referenced this pull request Apr 22, 2024
pankajkoti pushed a commit that referenced this pull request May 17, 2024
…rflow 2.9 (#39670)


Closes: #39486

# Context

Valid DAGs that worked in Airflow 2.8.x  and had tasks with outlets with specific URIs, such as `Dataset("postgres://postgres:5432/postgres.dbt.stg_customers")`, stopped working in Airflow 2.9.0 & Airflow 2.9.1, after #37005 was merged.

This was a breaking change in an Airflow minor version. We should avoid this.

Airflow < 3.0 should raise a warning, and from Airflow 3.0, we can make errors by default. We can have a feature flag to allow users who want to see this in advance to enable errors in Airflow 2. x, but this should not be the default behaviour.

The DAGs should continue working on Airflow 2.x minor/micro releases without errors (unless the user opts in via configuration).

# How to reproduce

By running the following DAG with `apache-airflow==2.9.1` and `apache-airflow-providers-postgres==5.11.0`, as an example:
```
from datetime import datetime

from airflow import DAG
from airflow.datasets import Dataset
from airflow.operators.empty import EmptyOperator



with DAG(dag_id='empty_operator_example', start_date=datetime(2022, 1, 1), schedule_interval=None) as dag:

    task1 = EmptyOperator(
        task_id='empty_task1',
        dag=dag,
        outlets=[Dataset("postgres://postgres:5432/postgres.dbt.stg_customers")]
    )

    task2 = EmptyOperator(
        task_id='empty_task2',
        dag=dag
    )

    task1 >> task2
```

Causes to the exception:
```
Broken DAG: [/usr/local/airflow/dags/example_issue.py]
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/airflow/datasets/__init__.py", line 81, in _sanitize_uri
    parsed = normalizer(parsed)
             ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/airflow/providers/postgres/datasets/postgres.py", line 34, in sanitize_uri
    raise ValueError("URI format postgres:// must contain database, schema, and table names")
ValueError: URI format postgres:// must contain database, schema, and table names
```

# About the changes introduced

This PR introduces the following:

1. A boolean configuration within `[core],` named `strict_dataset_uri_validation,` which should be `False` by default.

2. When this configuration is `False,` Airflow should raise a warning saying:
```
From Airflow 3, Airflow will be more strict with Dataset URIs, and the URI xx will no longer be valid. Please, follow the expected standard as documented in XX.
```

3. If this configuration is `True,` Airflow should raise the exception, as it does now in Airflow 2.9.0 and 2.9.1

4. From Airflow 3.0, we change this configuration to be `True` by default.
RNHTTR pushed a commit to RNHTTR/airflow that referenced this pull request Jun 1, 2024
…rflow 2.9 (apache#39670)


Closes: apache#39486

# Context

Valid DAGs that worked in Airflow 2.8.x  and had tasks with outlets with specific URIs, such as `Dataset("postgres://postgres:5432/postgres.dbt.stg_customers")`, stopped working in Airflow 2.9.0 & Airflow 2.9.1, after apache#37005 was merged.

This was a breaking change in an Airflow minor version. We should avoid this.

Airflow < 3.0 should raise a warning, and from Airflow 3.0, we can make errors by default. We can have a feature flag to allow users who want to see this in advance to enable errors in Airflow 2. x, but this should not be the default behaviour.

The DAGs should continue working on Airflow 2.x minor/micro releases without errors (unless the user opts in via configuration).

# How to reproduce

By running the following DAG with `apache-airflow==2.9.1` and `apache-airflow-providers-postgres==5.11.0`, as an example:
```
from datetime import datetime

from airflow import DAG
from airflow.datasets import Dataset
from airflow.operators.empty import EmptyOperator



with DAG(dag_id='empty_operator_example', start_date=datetime(2022, 1, 1), schedule_interval=None) as dag:

    task1 = EmptyOperator(
        task_id='empty_task1',
        dag=dag,
        outlets=[Dataset("postgres://postgres:5432/postgres.dbt.stg_customers")]
    )

    task2 = EmptyOperator(
        task_id='empty_task2',
        dag=dag
    )

    task1 >> task2
```

Causes to the exception:
```
Broken DAG: [/usr/local/airflow/dags/example_issue.py]
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/airflow/datasets/__init__.py", line 81, in _sanitize_uri
    parsed = normalizer(parsed)
             ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/airflow/providers/postgres/datasets/postgres.py", line 34, in sanitize_uri
    raise ValueError("URI format postgres:// must contain database, schema, and table names")
ValueError: URI format postgres:// must contain database, schema, and table names
```

# About the changes introduced

This PR introduces the following:

1. A boolean configuration within `[core],` named `strict_dataset_uri_validation,` which should be `False` by default.

2. When this configuration is `False,` Airflow should raise a warning saying:
```
From Airflow 3, Airflow will be more strict with Dataset URIs, and the URI xx will no longer be valid. Please, follow the expected standard as documented in XX.
```

3. If this configuration is `True,` Airflow should raise the exception, as it does now in Airflow 2.9.0 and 2.9.1

4. From Airflow 3.0, we change this configuration to be `True` by default.
ephraimbuddy pushed a commit that referenced this pull request Jun 4, 2024
…rflow 2.9 (#39670)

Closes: #39486

Valid DAGs that worked in Airflow 2.8.x  and had tasks with outlets with specific URIs, such as `Dataset("postgres://postgres:5432/postgres.dbt.stg_customers")`, stopped working in Airflow 2.9.0 & Airflow 2.9.1, after #37005 was merged.

This was a breaking change in an Airflow minor version. We should avoid this.

Airflow < 3.0 should raise a warning, and from Airflow 3.0, we can make errors by default. We can have a feature flag to allow users who want to see this in advance to enable errors in Airflow 2. x, but this should not be the default behaviour.

The DAGs should continue working on Airflow 2.x minor/micro releases without errors (unless the user opts in via configuration).

By running the following DAG with `apache-airflow==2.9.1` and `apache-airflow-providers-postgres==5.11.0`, as an example:
```
from datetime import datetime

from airflow import DAG
from airflow.datasets import Dataset
from airflow.operators.empty import EmptyOperator

with DAG(dag_id='empty_operator_example', start_date=datetime(2022, 1, 1), schedule_interval=None) as dag:

    task1 = EmptyOperator(
        task_id='empty_task1',
        dag=dag,
        outlets=[Dataset("postgres://postgres:5432/postgres.dbt.stg_customers")]
    )

    task2 = EmptyOperator(
        task_id='empty_task2',
        dag=dag
    )

    task1 >> task2
```

Causes to the exception:
```
Broken DAG: [/usr/local/airflow/dags/example_issue.py]
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/airflow/datasets/__init__.py", line 81, in _sanitize_uri
    parsed = normalizer(parsed)
             ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/airflow/providers/postgres/datasets/postgres.py", line 34, in sanitize_uri
    raise ValueError("URI format postgres:// must contain database, schema, and table names")
ValueError: URI format postgres:// must contain database, schema, and table names
```

This PR introduces the following:

1. A boolean configuration within `[core],` named `strict_dataset_uri_validation,` which should be `False` by default.

2. When this configuration is `False,` Airflow should raise a warning saying:
```
From Airflow 3, Airflow will be more strict with Dataset URIs, and the URI xx will no longer be valid. Please, follow the expected standard as documented in XX.
```

3. If this configuration is `True,` Airflow should raise the exception, as it does now in Airflow 2.9.0 and 2.9.1

4. From Airflow 3.0, we change this configuration to be `True` by default.

(cherry picked from commit a07d799)
ephraimbuddy pushed a commit that referenced this pull request Jun 5, 2024
…rflow 2.9 (#39670)

Closes: #39486

Valid DAGs that worked in Airflow 2.8.x  and had tasks with outlets with specific URIs, such as `Dataset("postgres://postgres:5432/postgres.dbt.stg_customers")`, stopped working in Airflow 2.9.0 & Airflow 2.9.1, after #37005 was merged.

This was a breaking change in an Airflow minor version. We should avoid this.

Airflow < 3.0 should raise a warning, and from Airflow 3.0, we can make errors by default. We can have a feature flag to allow users who want to see this in advance to enable errors in Airflow 2. x, but this should not be the default behaviour.

The DAGs should continue working on Airflow 2.x minor/micro releases without errors (unless the user opts in via configuration).

By running the following DAG with `apache-airflow==2.9.1` and `apache-airflow-providers-postgres==5.11.0`, as an example:
```
from datetime import datetime

from airflow import DAG
from airflow.datasets import Dataset
from airflow.operators.empty import EmptyOperator

with DAG(dag_id='empty_operator_example', start_date=datetime(2022, 1, 1), schedule_interval=None) as dag:

    task1 = EmptyOperator(
        task_id='empty_task1',
        dag=dag,
        outlets=[Dataset("postgres://postgres:5432/postgres.dbt.stg_customers")]
    )

    task2 = EmptyOperator(
        task_id='empty_task2',
        dag=dag
    )

    task1 >> task2
```

Causes to the exception:
```
Broken DAG: [/usr/local/airflow/dags/example_issue.py]
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/airflow/datasets/__init__.py", line 81, in _sanitize_uri
    parsed = normalizer(parsed)
             ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/airflow/providers/postgres/datasets/postgres.py", line 34, in sanitize_uri
    raise ValueError("URI format postgres:// must contain database, schema, and table names")
ValueError: URI format postgres:// must contain database, schema, and table names
```

This PR introduces the following:

1. A boolean configuration within `[core],` named `strict_dataset_uri_validation,` which should be `False` by default.

2. When this configuration is `False,` Airflow should raise a warning saying:
```
From Airflow 3, Airflow will be more strict with Dataset URIs, and the URI xx will no longer be valid. Please, follow the expected standard as documented in XX.
```

3. If this configuration is `True,` Airflow should raise the exception, as it does now in Airflow 2.9.0 and 2.9.1

4. From Airflow 3.0, we change this configuration to be `True` by default.

(cherry picked from commit a07d799)
utkarsharma2 pushed a commit that referenced this pull request Jun 5, 2024
…rflow 2.9 (#39670)

Closes: #39486

Valid DAGs that worked in Airflow 2.8.x  and had tasks with outlets with specific URIs, such as `Dataset("postgres://postgres:5432/postgres.dbt.stg_customers")`, stopped working in Airflow 2.9.0 & Airflow 2.9.1, after #37005 was merged.

This was a breaking change in an Airflow minor version. We should avoid this.

Airflow < 3.0 should raise a warning, and from Airflow 3.0, we can make errors by default. We can have a feature flag to allow users who want to see this in advance to enable errors in Airflow 2. x, but this should not be the default behaviour.

The DAGs should continue working on Airflow 2.x minor/micro releases without errors (unless the user opts in via configuration).

By running the following DAG with `apache-airflow==2.9.1` and `apache-airflow-providers-postgres==5.11.0`, as an example:
```
from datetime import datetime

from airflow import DAG
from airflow.datasets import Dataset
from airflow.operators.empty import EmptyOperator

with DAG(dag_id='empty_operator_example', start_date=datetime(2022, 1, 1), schedule_interval=None) as dag:

    task1 = EmptyOperator(
        task_id='empty_task1',
        dag=dag,
        outlets=[Dataset("postgres://postgres:5432/postgres.dbt.stg_customers")]
    )

    task2 = EmptyOperator(
        task_id='empty_task2',
        dag=dag
    )

    task1 >> task2
```

Causes to the exception:
```
Broken DAG: [/usr/local/airflow/dags/example_issue.py]
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/airflow/datasets/__init__.py", line 81, in _sanitize_uri
    parsed = normalizer(parsed)
             ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/airflow/providers/postgres/datasets/postgres.py", line 34, in sanitize_uri
    raise ValueError("URI format postgres:// must contain database, schema, and table names")
ValueError: URI format postgres:// must contain database, schema, and table names
```

This PR introduces the following:

1. A boolean configuration within `[core],` named `strict_dataset_uri_validation,` which should be `False` by default.

2. When this configuration is `False,` Airflow should raise a warning saying:
```
From Airflow 3, Airflow will be more strict with Dataset URIs, and the URI xx will no longer be valid. Please, follow the expected standard as documented in XX.
```

3. If this configuration is `True,` Airflow should raise the exception, as it does now in Airflow 2.9.0 and 2.9.1

4. From Airflow 3.0, we change this configuration to be `True` by default.

(cherry picked from commit a07d799)
ephraimbuddy pushed a commit that referenced this pull request Jun 5, 2024
…rflow 2.9 (#39670)

Closes: #39486

Valid DAGs that worked in Airflow 2.8.x  and had tasks with outlets with specific URIs, such as `Dataset("postgres://postgres:5432/postgres.dbt.stg_customers")`, stopped working in Airflow 2.9.0 & Airflow 2.9.1, after #37005 was merged.

This was a breaking change in an Airflow minor version. We should avoid this.

Airflow < 3.0 should raise a warning, and from Airflow 3.0, we can make errors by default. We can have a feature flag to allow users who want to see this in advance to enable errors in Airflow 2. x, but this should not be the default behaviour.

The DAGs should continue working on Airflow 2.x minor/micro releases without errors (unless the user opts in via configuration).

By running the following DAG with `apache-airflow==2.9.1` and `apache-airflow-providers-postgres==5.11.0`, as an example:
```
from datetime import datetime

from airflow import DAG
from airflow.datasets import Dataset
from airflow.operators.empty import EmptyOperator

with DAG(dag_id='empty_operator_example', start_date=datetime(2022, 1, 1), schedule_interval=None) as dag:

    task1 = EmptyOperator(
        task_id='empty_task1',
        dag=dag,
        outlets=[Dataset("postgres://postgres:5432/postgres.dbt.stg_customers")]
    )

    task2 = EmptyOperator(
        task_id='empty_task2',
        dag=dag
    )

    task1 >> task2
```

Causes to the exception:
```
Broken DAG: [/usr/local/airflow/dags/example_issue.py]
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/airflow/datasets/__init__.py", line 81, in _sanitize_uri
    parsed = normalizer(parsed)
             ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/airflow/providers/postgres/datasets/postgres.py", line 34, in sanitize_uri
    raise ValueError("URI format postgres:// must contain database, schema, and table names")
ValueError: URI format postgres:// must contain database, schema, and table names
```

This PR introduces the following:

1. A boolean configuration within `[core],` named `strict_dataset_uri_validation,` which should be `False` by default.

2. When this configuration is `False,` Airflow should raise a warning saying:
```
From Airflow 3, Airflow will be more strict with Dataset URIs, and the URI xx will no longer be valid. Please, follow the expected standard as documented in XX.
```

3. If this configuration is `True,` Airflow should raise the exception, as it does now in Airflow 2.9.0 and 2.9.1

4. From Airflow 3.0, we change this configuration to be `True` by default.

(cherry picked from commit a07d799)
romsharon98 pushed a commit to romsharon98/airflow that referenced this pull request Jul 26, 2024
…rflow 2.9 (apache#39670)


Closes: apache#39486

# Context

Valid DAGs that worked in Airflow 2.8.x  and had tasks with outlets with specific URIs, such as `Dataset("postgres://postgres:5432/postgres.dbt.stg_customers")`, stopped working in Airflow 2.9.0 & Airflow 2.9.1, after apache#37005 was merged.

This was a breaking change in an Airflow minor version. We should avoid this.

Airflow < 3.0 should raise a warning, and from Airflow 3.0, we can make errors by default. We can have a feature flag to allow users who want to see this in advance to enable errors in Airflow 2. x, but this should not be the default behaviour.

The DAGs should continue working on Airflow 2.x minor/micro releases without errors (unless the user opts in via configuration).

# How to reproduce

By running the following DAG with `apache-airflow==2.9.1` and `apache-airflow-providers-postgres==5.11.0`, as an example:
```
from datetime import datetime

from airflow import DAG
from airflow.datasets import Dataset
from airflow.operators.empty import EmptyOperator



with DAG(dag_id='empty_operator_example', start_date=datetime(2022, 1, 1), schedule_interval=None) as dag:

    task1 = EmptyOperator(
        task_id='empty_task1',
        dag=dag,
        outlets=[Dataset("postgres://postgres:5432/postgres.dbt.stg_customers")]
    )

    task2 = EmptyOperator(
        task_id='empty_task2',
        dag=dag
    )

    task1 >> task2
```

Causes to the exception:
```
Broken DAG: [/usr/local/airflow/dags/example_issue.py]
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/airflow/datasets/__init__.py", line 81, in _sanitize_uri
    parsed = normalizer(parsed)
             ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/airflow/providers/postgres/datasets/postgres.py", line 34, in sanitize_uri
    raise ValueError("URI format postgres:// must contain database, schema, and table names")
ValueError: URI format postgres:// must contain database, schema, and table names
```

# About the changes introduced

This PR introduces the following:

1. A boolean configuration within `[core],` named `strict_dataset_uri_validation,` which should be `False` by default.

2. When this configuration is `False,` Airflow should raise a warning saying:
```
From Airflow 3, Airflow will be more strict with Dataset URIs, and the URI xx will no longer be valid. Please, follow the expected standard as documented in XX.
```

3. If this configuration is `True,` Airflow should raise the exception, as it does now in Airflow 2.9.0 and 2.9.1

4. From Airflow 3.0, we change this configuration to be `True` by default.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants