Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make Datasets hashable #37465

Merged
merged 1 commit into from Feb 21, 2024
Merged

Make Datasets hashable #37465

merged 1 commit into from Feb 21, 2024

Conversation

mpeteuil
Copy link
Contributor

@mpeteuil mpeteuil commented Feb 15, 2024

Currently DAGs accept a Collection["Dataset"] as an option for the schedule, but that collection cannot be a set because Datasets are not a hashable type. The interesting thing is that the DatasetModel is actually already hashable, so this introduces a bit of duplication since it's mostly the same implementation. However, Airflow users are primarily interfacing with Dataset, not DatasetModel so I think it makes sense for Dataset to be hashable. I'm not sure how to square the duplication or what __eq__ and __hash__ provide for DatasetModel though.

There was discussion on the original PR that created the Dataset (#24613) about whether to create two classes or one. In that discussion @kaxil mentioned:

I would slightly favour a separate DatasetModel and Dataset so Dataset becomes an extensible class, and DatasetModel just stores the info about the class. So users don't need to care about SQLAlchmey stuff when extending it.

That first PR created the Dataset model as both SQLAlchemy and user space class though. It wasn't until later on (#25727) that the DatasetModel got broken out from Dataset and one became two. That provides a bit of background on why they both exist for anyone reading this who is curious.


^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.

@potiuk
Copy link
Member

potiuk commented Feb 16, 2024

LGTM. But I would prefer Others to also take a look - maybe there is a good reason for not having it hashable :). And thanks for the thorough research on the past !

BTW. Static checks need fixing on that one (I heartily recommend installing pre-commit and largely stop caring about those failures).

Currently DAGs accept a
[`Collection["Dataset"]`](https://github.com/apache/airflow/blob/0c02ead4d8a527cbf0a916b6344f255c520e637f/airflow/models/dag.py#L171)
as an option for the `schedule`, but that collection cannot be a `set`
because Datasets are not a hashable type. The interesting thing is that
[the `DatasetModel` is actually already
hashable](https://github.com/apache/airflow/blob/dec78ab3f140f35e507de825327652ec24d03522/airflow/models/dataset.py#L93-L100),
so this introduces a bit of duplication since it's the same
implementation. However, Airflow users are primarily interfacing with
`Dataset`, not `DatasetModel` so I think it makes sense for `Dataset` to
be hashable. I'm not sure how to square the duplication or what `__eq__`
and `__hash__` provide for `DatasetModel` though.

There was discussion on the original PR that created the `Dataset`
(apache#24613) about whether to create
two classes or one. In that discussion @kaxil mentioned:

> I would slightly favour a separate `DatasetModel` and `Dataset` so
`Dataset` becomes an extensible class, and `DatasetModel` just stores
the info about the class. So users don't need to care about SQLAlchmey
stuff when extending it.

That first PR created the `Dataset` model as both SQLAlchemy and user
space class though. It wasn't until later on
(apache#25727) that the `DatasetModel`
got broken out from `Dataset` and one became two. That provides a bit of
background on why they both exist for anyone reading this who is
curious.
@potiuk potiuk merged commit d5d6f4b into apache:main Feb 21, 2024
59 checks passed
abhishekbhakat pushed a commit to abhishekbhakat/my_airflow that referenced this pull request Mar 5, 2024
Currently DAGs accept a
[`Collection["Dataset"]`](https://github.com/apache/airflow/blob/0c02ead4d8a527cbf0a916b6344f255c520e637f/airflow/models/dag.py#L171)
as an option for the `schedule`, but that collection cannot be a `set`
because Datasets are not a hashable type. The interesting thing is that
[the `DatasetModel` is actually already
hashable](https://github.com/apache/airflow/blob/dec78ab3f140f35e507de825327652ec24d03522/airflow/models/dataset.py#L93-L100),
so this introduces a bit of duplication since it's the same
implementation. However, Airflow users are primarily interfacing with
`Dataset`, not `DatasetModel` so I think it makes sense for `Dataset` to
be hashable. I'm not sure how to square the duplication or what `__eq__`
and `__hash__` provide for `DatasetModel` though.

There was discussion on the original PR that created the `Dataset`
(apache#24613) about whether to create
two classes or one. In that discussion @kaxil mentioned:

> I would slightly favour a separate `DatasetModel` and `Dataset` so
`Dataset` becomes an extensible class, and `DatasetModel` just stores
the info about the class. So users don't need to care about SQLAlchmey
stuff when extending it.

That first PR created the `Dataset` model as both SQLAlchemy and user
space class though. It wasn't until later on
(apache#25727) that the `DatasetModel`
got broken out from `Dataset` and one became two. That provides a bit of
background on why they both exist for anyone reading this who is
curious.
@ephraimbuddy ephraimbuddy added this to the Airflow 2.9.0 milestone Mar 6, 2024
@ephraimbuddy ephraimbuddy added the type:new-feature Changelog: New Features label Mar 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type:new-feature Changelog: New Features
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants