Separate "public" Dataset class from SQLA model #25727

ashb · 2022-08-15T18:04:25Z

Importing all of SQLA can be very heavyweight (slowing import times),
and it also makes it harder for libraries to build on top of the
Dataset concept.

On top of that, the internal-API/db-isolation AIP means we don't want
user code to import any SQLA models directly.
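
A minimal sketch of the split described above, using only the standard library so it runs without SQLAlchemy. `DatasetRecord` and its `from_public` helper are hypothetical stand-ins for the mapped model, not the code in this PR; a real model would subclass the SQLAlchemy declarative base and live in a module that user code never imports.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any


@dataclass
class Dataset:
    """Lightweight, user-facing class: plain data, no ORM imports."""

    uri: str
    extra: dict[str, Any] | None = None


@dataclass
class DatasetRecord:
    """Stand-in for the DB-backed model (hypothetical name).

    Only this class carries database concerns such as a primary key.
    """

    uri: str
    extra: dict[str, Any]
    id: int | None = None

    @classmethod
    def from_public(cls, obj: Dataset) -> DatasetRecord:
        # Copy the user-supplied fields; the database assigns the id later.
        return cls(uri=obj.uri, extra=dict(obj.extra or {}))


public = Dataset("s3://bucket/key", extra={"team": "data"})
record = DatasetRecord.from_public(public)
```

With this shape, DAG authors and third-party libraries only ever touch `Dataset`, and the conversion happens at the boundary where a row is actually needed.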

kaxil

Some tests are failing:

==== API postgres: 3 failures ====

tests/api_connexion/endpoints/test_dag_run_endpoint.py::test__get_upstream_dataset_events_no_prior: sqlalchemy.orm.exc.UnmappedInstanceError: Class 'airflow.datasets.Dataset' is not mapped
tests/api_connexion/endpoints/test_dag_run_endpoint.py::test__get_upstream_dataset_events_with_prior: sqlalchemy.orm.exc.UnmappedInstanceError: Class 'airflow.datasets.Dataset' is not mapped
tests/api_connexion/schemas/test_dataset_schema.py::TestDatasetSchema::test_serialize: ValueError: Use of List object with `schedule` param is only supported for List[Dataset].

==== Core postgres: 3 failures ====

tests/models/test_taskinstance.py::TestTaskInstance::test_outlet_datasets: TypeError: unsupported operand type(s) for |: '_GenericAlias' and 'NoneType'
tests/models/test_taskinstance.py::TestTaskInstance::test_outlet_datasets_failed: TypeError: unsupported operand type(s) for |: '_GenericAlias' and 'NoneType'
tests/models/test_taskinstance.py::TestTaskInstance::test_outlet_datasets_skipped: TypeError: unsupported operand type(s) for |: '_GenericAlias' and 'NoneType'

==== Other postgres: 3 failures ====

tests/lineage/test_lineage.py::TestLineage::test_lineage: AttributeError: 'dict' object has no attribute '_context'
tests/lineage/test_lineage.py::TestLineage::test_lineage_render: AttributeError: 'dict' object has no attribute '_context'
tests/lineage/test_lineage.py::TestLineage::test_lineage_is_sent_to_backend: AttributeError: 'dict' object has no attribute '_context'

==== Providers postgres: 1 failure ====

tests/providers/papermill/operators/test_papermill.py::TestPapermillOperator::test_execute: AttributeError: 'dict' object has no attribute '_context'

==== WWW postgres: 1 failure ====

tests/www/views/test_views_grid.py::test_next_run_datasets: TypeError: __init__() got an unexpected keyword argument 'id'
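
For context on the three `tests/models/test_taskinstance.py` failures: a `TypeError` of this shape typically appears when a `typing` generic is combined with `None` via `|` and the expression is evaluated at runtime on a Python older than 3.10. Whether that is the exact cause here is an assumption. The sketch below (the function name `outlets` is hypothetical) shows the `Optional[...]` spelling, which works regardless of interpreter version.

```python
from typing import List, Optional


def outlets(names: Optional[List[str]] = None) -> List[str]:
    """Return a copy of the given outlet names, or an empty list."""
    return list(names or [])


# Evaluating the PEP 604 spelling at runtime works on Python 3.10+
# and raises TypeError on older interpreters.
try:
    _alias = List[str] | None
    union_supported = True
except TypeError:
    union_supported = False
```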

ashb · 2022-08-15T20:04:16Z

Overzealous refactoring ;) will fix those in the morning

@kaxil

Currently DAGs accept a [`Collection["Dataset"]`](https://github.com/apache/airflow/blob/0c02ead4d8a527cbf0a916b6344f255c520e637f/airflow/models/dag.py#L171) as an option for the `schedule`, but that collection cannot be a `set` because Datasets are not a hashable type.

The interesting thing is that [the `DatasetModel` is actually already hashable](https://github.com/apache/airflow/blob/dec78ab3f140f35e507de825327652ec24d03522/airflow/models/dataset.py#L93-L100), so this introduces a bit of duplication since it's the same implementation. However, Airflow users are primarily interfacing with `Dataset`, not `DatasetModel`, so I think it makes sense for `Dataset` to be hashable. I'm not sure how to square the duplication or what `__eq__` and `__hash__` provide for `DatasetModel` though.

There was discussion on the original PR that created the `Dataset` (apache#24613) about whether to create two classes or one. In that discussion @kaxil mentioned:

> I would slightly favour a separate `DatasetModel` and `Dataset` so `Dataset` becomes an extensible class, and `DatasetModel` just stores the info about the class. So users don't need to care about SQLAlchemy stuff when extending it.

That first PR created the `Dataset` model as both SQLAlchemy and user space class though. It wasn't until later on (apache#25727) that the `DatasetModel` got broken out from `Dataset` and one became two. That provides a bit of background on why they both exist for anyone reading this who is curious.
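
To make the hashability point concrete, here is a minimal sketch of a user-facing class that can be placed in a `set`. The class below is a hypothetical stand-in, not Airflow's actual `Dataset`, and keying equality on `uri` alone is an assumption made for illustration.

```python
from __future__ import annotations

from typing import Any


class Dataset:
    """Hypothetical stand-in for the user-facing class."""

    def __init__(self, uri: str, extra: dict[str, Any] | None = None) -> None:
        self.uri = uri
        self.extra = extra

    def __eq__(self, other: object) -> bool:
        # Two datasets are treated as equal when their URIs match (assumption).
        if isinstance(other, self.__class__):
            return self.uri == other.uri
        return NotImplemented

    def __hash__(self) -> int:
        # Must agree with __eq__: equal objects need equal hashes.
        return hash(self.uri)


# A set now de-duplicates datasets that share a URI.
schedule = {
    Dataset("s3://bucket/a"),
    Dataset("s3://bucket/a", extra={"owner": "team"}),
    Dataset("s3://bucket/b"),
}
```

Defining `__eq__` without `__hash__` makes Python set `__hash__` to `None`, which is why a class with custom equality is unhashable until both methods are provided.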

ashb requested review from ryanahamilton, bbovenzi, mik-laj, ephraimbuddy, kaxil and XD-DENG as code owners August 15, 2022 18:04

boring-cyborg bot added area:API Airflow's REST/HTTP API area:lineage area:serialization area:webserver Webserver related Issues labels Aug 15, 2022

kaxil approved these changes Aug 15, 2022

kaxil reviewed Aug 15, 2022

ashb added 3 commits August 16, 2022 11:10

fixup! Separate "public" Dataset class from SQLA model

0f50e75

fixup! Separate "public" Dataset class from SQLA model

fae18be

fixup! Separate "public" Dataset class from SQLA model

5d1ec76

ashb merged commit b90707e into apache:main Aug 16, 2022

ashb deleted the dataset-model-not-user-facing branch August 16, 2022 12:59

jedcunningham added changelog:skip Changes that should be skipped from the changelog (CI, tests, etc..) AIP-48 Data-aware Scheduling labels Sep 12, 2022

ephraimbuddy added this to the Airflow 2.4.0 milestone Sep 14, 2022

blag mentioned this pull request Nov 17, 2022

Switch (back) to late imports #27730

Merged

mpeteuil mentioned this pull request Feb 16, 2024

Make Datasets hashable #37465

Merged
