Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Separate "public" Dataset class from SQLA model #25727

Merged
merged 4 commits into from
Aug 16, 2022

Conversation

ashb
Copy link
Member

@ashb ashb commented Aug 15, 2022

Importing all of SQLA can be very heavy weight (slowing import times),
and it can also make it harder if libraries want to build on top of the
Dataset concept.

Not to mention that the interal-API/db-isolation AIP should mean we
don't want to import any SQLA models directly in user code.

Importing all of SQLA can be very heavy weight (slowing import times),
and it can also make it harder if libraries want to build on top of the
Dataset concept.

Not to mention that the interal-API/db-isolation AIP should mean we
don't want to import any SQLA models directly in user code.
Copy link
Member

@kaxil kaxil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some tests are failing:

==== API postgres: 3 failures ====

tests/api_connexion/endpoints/test_dag_run_endpoint.py::test__get_upstream_dataset_events_no_prior: sqlalchemy.orm.exc.UnmappedInstanceError: Class 'airflow.datasets.Dataset' is not mapped
tests/api_connexion/endpoints/test_dag_run_endpoint.py::test__get_upstream_dataset_events_with_prior: sqlalchemy.orm.exc.UnmappedInstanceError: Class 'airflow.datasets.Dataset' is not mapped
tests/api_connexion/schemas/test_dataset_schema.py::TestDatasetSchema::test_serialize: ValueError: Use of List object with `schedule` param is only supported for List[Dataset].

==== Core postgres: 3 failures ====

tests/models/test_taskinstance.py::TestTaskInstance::test_outlet_datasets: TypeError: unsupported operand type(s) for |: '_GenericAlias' and 'NoneType'
tests/models/test_taskinstance.py::TestTaskInstance::test_outlet_datasets_failed: TypeError: unsupported operand type(s) for |: '_GenericAlias' and 'NoneType'
tests/models/test_taskinstance.py::TestTaskInstance::test_outlet_datasets_skipped: TypeError: unsupported operand type(s) for |: '_GenericAlias' and 'NoneType'

==== Other postgres: 3 failures ====

tests/lineage/test_lineage.py::TestLineage::test_lineage: AttributeError: 'dict' object has no attribute '_context'
tests/lineage/test_lineage.py::TestLineage::test_lineage_render: AttributeError: 'dict' object has no attribute '_context'
tests/lineage/test_lineage.py::TestLineage::test_lineage_is_sent_to_backend: AttributeError: 'dict' object has no attribute '_context'

==== Providers postgres: 1 failure ====

tests/providers/papermill/operators/test_papermill.py::TestPapermillOperator::test_execute: AttributeError: 'dict' object has no attribute '_context'

==== WWW postgres: 1 failure ====

tests/www/views/test_views_grid.py::test_next_run_datasets: TypeError: __init__() got an unexpected keyword argument 'id'

@ashb
Copy link
Member Author

ashb commented Aug 15, 2022

Overzealous refactoring ;) will fix those in the morning

@ashb ashb merged commit b90707e into apache:main Aug 16, 2022
@ashb ashb deleted the dataset-model-not-user-facing branch August 16, 2022 12:59
@jedcunningham jedcunningham added changelog:skip Changes that should be skipped from the changelog (CI, tests, etc..) AIP-48 Data-aware Scheduling labels Sep 12, 2022
@ephraimbuddy ephraimbuddy added this to the Airflow 2.4.0 milestone Sep 14, 2022
mpeteuil added a commit to mpeteuil/airflow that referenced this pull request Feb 21, 2024
Currently DAGs accept a
[`Collection["Dataset"]`](https://github.com/apache/airflow/blob/0c02ead4d8a527cbf0a916b6344f255c520e637f/airflow/models/dag.py#L171)
as an option for the `schedule`, but that collection cannot be a `set`
because Datasets are not a hashable type. The interesting thing is that
[the `DatasetModel` is actually already
hashable](https://github.com/apache/airflow/blob/dec78ab3f140f35e507de825327652ec24d03522/airflow/models/dataset.py#L93-L100),
so this introduces a bit of duplication since it's the same
implementation. However, Airflow users are primarily interfacing with
`Dataset`, not `DatasetModel` so I think it makes sense for `Dataset` to
be hashable. I'm not sure how to square the duplication or what `__eq__`
and `__hash__` provide for `DatasetModel` though.

There was discussion on the original PR that created the `Dataset`
(apache#24613) about whether to create
two classes or one. In that discussion @kaxil mentioned:

> I would slightly favour a separate `DatasetModel` and `Dataset` so
`Dataset` becomes an extensible class, and `DatasetModel` just stores
the info about the class. So users don't need to care about SQLAlchmey
stuff when extending it.

That first PR created the `Dataset` model as both SQLAlchemy and user
space class though. It wasn't until later on
(apache#25727) that the `DatasetModel`
got broken out from `Dataset` and one became two. That provides a bit of
background on why they both exist for anyone reading this who is
curious.
potiuk pushed a commit that referenced this pull request Feb 21, 2024
Currently DAGs accept a
[`Collection["Dataset"]`](https://github.com/apache/airflow/blob/0c02ead4d8a527cbf0a916b6344f255c520e637f/airflow/models/dag.py#L171)
as an option for the `schedule`, but that collection cannot be a `set`
because Datasets are not a hashable type. The interesting thing is that
[the `DatasetModel` is actually already
hashable](https://github.com/apache/airflow/blob/dec78ab3f140f35e507de825327652ec24d03522/airflow/models/dataset.py#L93-L100),
so this introduces a bit of duplication since it's the same
implementation. However, Airflow users are primarily interfacing with
`Dataset`, not `DatasetModel` so I think it makes sense for `Dataset` to
be hashable. I'm not sure how to square the duplication or what `__eq__`
and `__hash__` provide for `DatasetModel` though.

There was discussion on the original PR that created the `Dataset`
(#24613) about whether to create
two classes or one. In that discussion @kaxil mentioned:

> I would slightly favour a separate `DatasetModel` and `Dataset` so
`Dataset` becomes an extensible class, and `DatasetModel` just stores
the info about the class. So users don't need to care about SQLAlchmey
stuff when extending it.

That first PR created the `Dataset` model as both SQLAlchemy and user
space class though. It wasn't until later on
(#25727) that the `DatasetModel`
got broken out from `Dataset` and one became two. That provides a bit of
background on why they both exist for anyone reading this who is
curious.
abhishekbhakat pushed a commit to abhishekbhakat/my_airflow that referenced this pull request Mar 5, 2024
Currently DAGs accept a
[`Collection["Dataset"]`](https://github.com/apache/airflow/blob/0c02ead4d8a527cbf0a916b6344f255c520e637f/airflow/models/dag.py#L171)
as an option for the `schedule`, but that collection cannot be a `set`
because Datasets are not a hashable type. The interesting thing is that
[the `DatasetModel` is actually already
hashable](https://github.com/apache/airflow/blob/dec78ab3f140f35e507de825327652ec24d03522/airflow/models/dataset.py#L93-L100),
so this introduces a bit of duplication since it's the same
implementation. However, Airflow users are primarily interfacing with
`Dataset`, not `DatasetModel` so I think it makes sense for `Dataset` to
be hashable. I'm not sure how to square the duplication or what `__eq__`
and `__hash__` provide for `DatasetModel` though.

There was discussion on the original PR that created the `Dataset`
(apache#24613) about whether to create
two classes or one. In that discussion @kaxil mentioned:

> I would slightly favour a separate `DatasetModel` and `Dataset` so
`Dataset` becomes an extensible class, and `DatasetModel` just stores
the info about the class. So users don't need to care about SQLAlchmey
stuff when extending it.

That first PR created the `Dataset` model as both SQLAlchemy and user
space class though. It wasn't until later on
(apache#25727) that the `DatasetModel`
got broken out from `Dataset` and one became two. That provides a bit of
background on why they both exist for anyone reading this who is
curious.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
AIP-48 Data-aware Scheduling area:API Airflow's REST/HTTP API area:lineage area:serialization area:webserver Webserver related Issues changelog:skip Changes that should be skipped from the changelog (CI, tests, etc..)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants