Improve Dataset URIs #305
Yes - the dataset names should represent database that In the future, OL will provide dataset classes that will ensure that dataset name is constructed up to spec. |
As of 1.0, Cosmos implements datasets only from an outlet perspective (it doesn't implement inlets). Since the output of seeds, snapshots, and models are "tables," this is the only type of dataset we need to worry about. In other words, we don't need to worry about creating dataset URIs for files in the meantime. As of 1.6, dbt allows tables to be materialized in four different ways:
This means Cosmos should not output datasets for resources that generate a CTE (materialized=ephemeral). We also should not render these nodes as Airflow tasks. A second part of this ticket is the Dataset URI definition itself. For most of the currently supported profiles, we'll need the following information to create a Dataset URI: We already have (iv) as part of the Regarding the schema and database, they can come from three different sources:
To retrieve the schema (iii) or database (ii), we can do the following: Finally, when working on this ticket, we also need to take into account the fact that different databases may result in different URIs (in BigQuery, the schema is called a dataset; SQLite does not have schemas; and so on) |
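The two points above (skip ephemeral nodes, and build the URI differently per database) can be illustrated with a minimal sketch. It is not the Cosmos implementation: the function names `dataset_uri` and `outlet_uris` are hypothetical, the node dictionaries only mimic the `database`, `schema`, `alias`, `name` and `config.materialized` fields of a dbt manifest, the namespace/name templates reflect one reading of the OpenLineage naming document for two adapters only, and joining namespace and name with `/` is an assumption.

```python
from typing import Dict, Optional


def dataset_uri(adapter: str, database: str, schema: str, table: str,
                host: str = "", port: Optional[int] = None) -> str:
    """Build a dataset URI as '<namespace>/<name>' (the join is an assumption)."""
    if adapter == "postgres":
        # Assumed template: namespace is the server, name is db.schema.table.
        namespace = f"postgres://{host}:{port}"
        name = f"{database}.{schema}.{table}"
    elif adapter == "bigquery":
        # In BigQuery, dbt's "database" is the project and "schema" is the dataset.
        namespace = "bigquery"
        name = f"{database}.{schema}.{table}"
    else:
        # Other adapters need their own rules (SQLite, for example, has no schema).
        raise ValueError(f"no naming rule sketched for adapter {adapter!r}")
    return f"{namespace}/{name}"


def outlet_uris(nodes: Dict[str, dict], adapter: str, **conn) -> Dict[str, str]:
    """Map node ids to URIs, skipping nodes that only produce a CTE."""
    uris = {}
    for unique_id, node in nodes.items():
        if node.get("config", {}).get("materialized") == "ephemeral":
            continue  # ephemeral nodes create no table or view, so no dataset
        table = node.get("alias") or node["name"]
        uris[unique_id] = dataset_uri(
            adapter, node["database"], node["schema"], table, **conn
        )
    return uris


# Hypothetical nodes shaped like entries of a dbt manifest.
example_nodes = {
    "model.jaffle_shop.orders": {
        "name": "orders", "database": "analytics", "schema": "public",
        "config": {"materialized": "table"},
    },
    "model.jaffle_shop.stg_orders": {
        "name": "stg_orders", "database": "analytics", "schema": "public",
        "config": {"materialized": "ephemeral"},
    },
}
print(outlet_uris(example_nodes, "postgres", host="localhost", port=5432))
```

Keeping the per-adapter rule in one function makes the mismatch between databases explicit: adding an adapter means adding one branch, and an unsupported adapter fails loudly instead of emitting a misleading URI.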
Yesterday, I talked to @wslulciuc, and we discussed the possibility of Cosmos using the We also discussed AIP-53. However, the fact that Cosmos does not use Airflow hooks to run database transformations also means it does not benefit from this. We agreed I'll be looking into how
|
@tatiana I think This is also the reason why I'm wary of using the current Cosmos dataset naming - as |
It totally makes sense, @mobuchowski ! I'll look into these once I'm back to this issue and I'll raise any questions/issues here, thank you very much! |
@tatiana also, one thing worth looking at is adapter support.
It would be worth making sure there is no mismatch here. |
So far, we have yet to identify a flawless strategy to implement this ticket. I would love to hear others' thoughts! These are some of the approaches we have for fixing Airflow Dataset URIs in Cosmos:
So far, it feels the ideal solution would be (3). I'll give it a try! |
Use `openlineage-dbt` to create outlets Dataset URIs from within Cosmos.

Closes: #305
Closes: #497
Closes: #433 (only emits outlet events from the model - the same behaviour as openlineage-dbt)

Validation

This change was tested by running Marquez locally and triggering the dag `basic_cosmos_dag` using Airflow 2.6.1 and Python 3.10.10. The output generated by this version of Cosmos can be seen in the following screenshots:

<img width="1624" alt="Screenshot 2023-09-04 at 22 08 32" src="https://github.com/astronomer/astronomer-cosmos/assets/272048/bf8ac0fe-a4de-42a1-aac3-4e111876b615">
<img width="1624" alt="Screenshot 2023-09-04 at 22 09 01" src="https://github.com/astronomer/astronomer-cosmos/assets/272048/3d41821b-2cfb-414d-9951-e0b062ba0a1a">
<img width="1624" alt="Screenshot 2023-09-04 at 22 09 31" src="https://github.com/astronomer/astronomer-cosmos/assets/272048/1ad93331-70b1-428d-b638-e73d26492c96">

Tasks

- [x] Fix pre-commit checks
- [x] Add test
- [x] Inlets support
- [x] ~Emit open lineage events~
- [x] ~Support Docker/K8s~ (deferred to issue: #496)
- [x] ~Create a PR on openlineage-dbt to remove the dbt dependency~ Not needed since the code depends on `openlineage-integration-common` and not `openlineage-dbt` ([more info](https://github.com/OpenLineage/OpenLineage/blob/81372ca2bc2afecab369eab4a54cc6380dda49d0/integration/common/setup.py#L15))
- [x] Understand which dataset is being emitted from test tasks (only inlets, no outlets)
[The documentation](https://astronomer.github.io/astronomer-cosmos/configuration/scheduling.html) was outdated. The method `get_dbt_dataset` no longer exists. It used to exist in older versions of Cosmos (before 1.1) when the URIs respected the format: `Dataset(f"DBT://{connection_id.upper()}/{project_name.upper()}/{model_name.upper()}")` More information on why we changed this: #305 Closes: #1032
[The documentation](https://astronomer.github.io/astronomer-cosmos/configuration/scheduling.html) was outdated. The method `get_dbt_dataset` no longer exists. It used to exist in older versions of Cosmos (before 1.1) when the URIs respected the format: `Dataset(f"DBT://{connection_id.upper()}/{project_name.upper()}/{model_name.upper()}")` More information on why we changed this: #305 Closes: #1032 (cherry picked from commit c47e104)
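For reference, the legacy format quoted above can be reproduced as a plain string, which makes it easy to see why it carried no information about the actual database, schema or table. This is only an illustration: `legacy_dataset_uri` is a hypothetical helper, not the removed `get_dbt_dataset` method, and the argument values are made up.

```python
def legacy_dataset_uri(connection_id: str, project_name: str, model_name: str) -> str:
    """Return the pre-1.1 style URI string quoted in the commit message."""
    return f"DBT://{connection_id.upper()}/{project_name.upper()}/{model_name.upper()}"


# Hypothetical connection, project and model names.
print(legacy_dataset_uri("airflow_db", "jaffle_shop", "orders"))
# prints: DBT://AIRFLOW_DB/JAFFLE_SHOP/ORDERS
```

The URI is built from the Airflow connection id and the dbt project and model names, so two deployments writing to the same physical table through different connections would produce different URIs.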
Cosmos, at the moment, has a naive approach to creating Dataset URIs:
astronomer-cosmos/cosmos/providers/dbt/core/utils/data_aware_scheduling.py
Lines 4 to 5 in 2aacff5
We should revisit this and adopt the patterns described at:
https://github.com/OpenLineage/OpenLineage/blob/main/spec/Naming.md