
Improve Dataset URIs #305

Closed
tatiana opened this issue Jun 1, 2023 · 8 comments · Fixed by #485

tatiana (Collaborator) commented Jun 1, 2023

Cosmos, at the moment, has a naive approach to creating Dataset URIs:

from airflow.datasets import Dataset

def get_dbt_dataset(connection_id: str, project_name: str, model_name: str):
    return Dataset(f"DBT://{connection_id.upper()}/{project_name.upper()}/{model_name.upper()}")

We should revisit this and adopt the patterns described at:
https://github.com/OpenLineage/OpenLineage/blob/main/spec/Naming.md
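
For illustration, under that spec a dataset is identified by a source namespace plus a dot-separated name. A minimal sketch for a Postgres table, with hypothetical host, port, database, schema, and table values (the exact fields per adapter are defined in the spec):

# Illustrative only: OpenLineage-style naming for a Postgres table.
namespace = "postgres://localhost:5432"   # the data source
name = "public.dbt_example.my_model"      # database.schema.table
uri = f"{namespace}/{name}"
# -> postgres://localhost:5432/public.dbt_example.my_model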

tatiana changed the title from "Review how we create Dataset URIs" to "Improve Dataset URIs" on Jun 1, 2023
mobuchowski commented

Yes - the dataset names should represent the database that dbt is connecting to.

In the future, OL will provide dataset classes that will ensure the dataset name is constructed according to the spec.

tatiana added this to the 1.0.0 milestone Jun 1, 2023
tatiana self-assigned this Jul 24, 2023
tatiana modified the milestones: 1.0.0, 1.1.0 Jul 24, 2023
tatiana (Collaborator, Author) commented Jul 24, 2023

As of 1.0, Cosmos implements datasets only from an outlet perspective (it doesn't implement inlets). Since the output of seeds, snapshots, and models are "tables," this is the only type of dataset we need to worry about. In other words, we don't need to worry about creating dataset URIs for files in the meantime.

As of 1.6, dbt allows models to be materialized in four different ways:

  • table
  • view
  • incremental
  • ephemeral (compiled as a CTE)

This means Cosmos should not output datasets for resources that generate CTEs (materialized=ephemeral). We also should not render these nodes as Airflow tasks.

A second part of this ticket is the Dataset URI definition itself. For most of the currently supported profiles, we'll need the following information to create a Dataset URI:
i. database type (e.g. Postgres, BigQuery, Redshift, MSSQL)
ii. database (e.g. "public")
iii. schema (e.g. "myschema")
iv. table name (dbt resource name)

We already have (iv) as part of the DbtNode definition.

Regarding the schema and database, they can come from three different sources:

  1. In the SQL/Python resource file itself, such as:
{{ config(schema='marketing') }}

select ...
  2. As part of the profile file, for example:
jaffle_shop:
  target: dev
  outputs:
    dev:
      type: postgres
      host: localhost
      user: postgres
      password: postgres
      port: 5432
      dbname: public
      schema: dbt_example
      threads: 4
  3. Within the dbt_project.yml file, such as:
models:
  my_project:
    marketing:
      +schema: marketing

To retrieve the schema (iii) or database (ii), we can do the following:
a) Enrich the DbtNode definition to contain the schema and database. If the project is loaded with dbt ls or the manifest, and the user sets this information in dbt_project.yml or in the model itself, this solves the problem.
b) If the user did not set a custom schema/database per model, either in dbt_project.yml or in the model itself, we'll have to retrieve the "global" definition from the dbt profile / Airflow connection.

Finally, when working on this ticket, we also need to take into account that different databases may result in different URIs (in BigQuery, a schema is called a dataset; SQLite does not have schemas; and so on).
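
To make the adapter differences concrete, here is a minimal sketch of what per-adapter URI construction could look like. The URI templates and the build_dataset helper are hypothetical, not Cosmos or OpenLineage APIs:

from airflow.datasets import Dataset

# Hypothetical templates loosely following the OpenLineage naming spec;
# the exact fields per adapter would need to be confirmed against it.
URI_TEMPLATES = {
    "postgres": "postgres://{host}:{port}/{database}.{schema}.{table}",
    "bigquery": "bigquery:{database}.{schema}.{table}",  # BQ calls a schema a "dataset"
    "sqlite": "sqlite://{host}/{database}.{table}",      # SQLite has no schemas
}

def build_dataset(adapter_type: str, table: str, **fields: str) -> Dataset:
    # str.format ignores unused keyword arguments, so callers can pass
    # a superset of fields regardless of the adapter.
    return Dataset(URI_TEMPLATES[adapter_type].format(table=table, **fields))

# e.g. build_dataset("postgres", "my_model", host="localhost",
#                    port="5432", database="public", schema="dbt_example")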

tatiana (Collaborator, Author) commented Aug 10, 2023

Yesterday, I talked to @wslulciuc, and we discussed the possibility of Cosmos using the openlineage-dbt package. The drawback of this approach is that Airflow would not be aware of the Datasets, so Dataset scheduling would not work, and we would not benefit from any of the dataset-aware features available in Airflow.

We also discussed AIP-53. However, the fact that Cosmos does not use Airflow hooks to run database transformations also means it does not benefit from this.

We agreed that I'll look into how openlineage-dbt works, and we'll try to offer a consistent experience.

mobuchowski commented Aug 10, 2023

@tatiana I think openlineage-dbt would be useful, even if not used directly. You've provided the link to DbtArtifactProcessor, which has been modified to work with both dbt Core and dbt Cloud; I believe that improvement would prove helpful here too, as the code could be reused to fit into Cosmos. For example, metadata parsing could happen in a get_openlineage_facets_on_complete hook inside DbtBaseOperator. Additionally, reusing that code means the dataset naming will be consistent between Cosmos and raw dbt projects using openlineage-dbt.
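
For readers unfamiliar with that hook: the Airflow OpenLineage provider (Airflow 2.7+) calls get_openlineage_facets_on_complete on an operator after it finishes. A minimal sketch of what this could look like in DbtBaseOperator; the artifact-parsing helper below is a placeholder, not real Cosmos code:

from airflow.models.baseoperator import BaseOperator
from airflow.providers.openlineage.extractors import OperatorLineage

def parse_dbt_artifacts(project_dir: str):
    # Placeholder: a real implementation would read target/manifest.json
    # and target/run_results.json to build input/output dataset lists.
    return [], []

class DbtBaseOperator(BaseOperator):
    def __init__(self, project_dir: str, **kwargs):
        super().__init__(**kwargs)
        self.project_dir = project_dir

    def execute(self, context):
        ...  # run dbt here

    def get_openlineage_facets_on_complete(self, task_instance) -> OperatorLineage:
        # Called by the OpenLineage provider once the task completes.
        inputs, outputs = parse_dbt_artifacts(self.project_dir)
        return OperatorLineage(inputs=inputs, outputs=outputs)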

This is also why I'm wary of the current Cosmos dataset naming: the dbt://... scheme tells you nothing about the physical location of a dataset; it's just a proxy that's useful for scheduling. It is also potentially less useful if, for example, a non-Cosmos task wants to depend on a Snowflake table that could be modified in ways other than dbt. So any other approach would still have to implement some canonical naming, I believe.
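
To illustrate the difference (both values hypothetical):

# Current Cosmos-style URI: an opaque scheduling key that says nothing
# about where the data physically lives.
dbt_uri = "DBT://AIRFLOW_DB/JAFFLE_SHOP/ORDERS"

# OpenLineage-style URI for the same table: any task, Cosmos or not,
# can depend on the physical Snowflake dataset directly.
physical_uri = "snowflake://my_account/analytics.public.orders"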

tatiana (Collaborator, Author) commented Aug 10, 2023

That totally makes sense, @mobuchowski! I'll look into these once I'm back on this issue and will raise any questions/issues here. Thank you very much!

mobuchowski commented

@tatiana also, one thing worth looking at is adapter support. Currently, openlineage-dbt supports these adapters:

  • bigquery
  • snowflake
  • redshift
  • spark
  • postgres
  • databricks

It would be worth making sure we don't end up with a mismatch here.

tatiana (Collaborator, Author) commented Aug 16, 2023

So far, we have yet to identify a flawless strategy to implement this ticket. I would love to hear others' thoughts!

These are some of the approaches we have for fixing Airflow Dataset URIs in Cosmos:

  1. Implement a solution using the new Airflow OpenLineage provider. Drawbacks:

    • would require some mapping between profiles/connections, depending on the user's profile mapping strategy
    • would only work with Airflow 2.7
  2. Create a dependency between Cosmos and the openlineage-dbt package, always wrapping dbt calls with it. Drawbacks:

    • may not work if users define the dbt executable themselves (a custom binary in local mode, or the virtualenv, Docker, or Kubernetes execution modes)
    • Airflow would not know about the Datasets - the communication would be entirely between dbt and OpenLineage (so we would not be leveraging dataset scheduling and dataset visualization in Airflow), or
    • we could change openlineage-dbt to offer a dry-run that generates OpenLineage metadata/URIs and consume that from Cosmos, using the information to set task outlets. However, this would add more complexity to DAG generation, which is already very complicated and expensive, leading to many user issues.
  3. Use openlineage-dbt as a library. Drawbacks:

    • currently, it depends on dbt as a library; we'll need to remove this dependency (apparently this is simple - just a leftover)
    • we may need to change the interface
    • we must confirm we can set Airflow inlets and outlets as part of execute
    • we must confirm this strategy would work with the Kubernetes and Docker execution modes
  4. Reuse as much as possible of the openlineage-dbt source code within Cosmos (copying the relevant parts). Drawbacks:

    • both code bases will diverge; we won't benefit from updates to openlineage-dbt, and vice versa
  5. Have a stand-alone implementation of Dataset URIs within Cosmos. Drawbacks:

    • code duplication and possible inconsistencies with openlineage-dbt

So far, it feels like the ideal solution would be (3). I'll give it a try!
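
For context, a rough sketch of what (3) could look like, reusing the artifact processor mobuchowski linked. The class name, import path, and constructor arguments below are approximations (they vary across openlineage releases), not a confirmed API:

from airflow.datasets import Dataset
from openlineage.common.provider.dbt import DbtArtifactProcessor  # assumed path

processor = DbtArtifactProcessor(
    producer="https://github.com/astronomer/astronomer-cosmos",  # illustrative
    project_dir="/path/to/dbt/project",
    profile_name="jaffle_shop",
    target="dev",
)
events = processor.parse()  # reads target/manifest.json and run_results.json

# Convert OpenLineage output datasets into Airflow outlets, so Airflow's
# dataset-aware scheduling keeps working.
outlets = [
    Dataset(f"{output.namespace}/{output.name}")
    for event in events.completes   # assumed attribute on the parse result
    for output in (event.outputs or [])
]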

tatiana added a commit that referenced this issue Sep 5, 2023
Use `openlineage-dbt` to create outlets Dataset URIs from within Cosmos.

Closes: #305
Closes: #497
Closes: #433 (only emits outlet events from the model - the same
behaviour as openlineage-dbt)

Validation
This change was tested by running Marquez locally and triggering the dag
`basic_cosmos_dag` using Airflow 2.6.1 and Python 3.10.10.

The output generated by this version of Cosmos can be seen in the
following screenshots:

[Three screenshots of the Marquez lineage UI, taken 2023-09-04, showing the emitted datasets]


Tasks
- [x] Fix pre-commit checks
- [x] Add test
- [x] Inlets support
- [x] ~Emit OpenLineage events~
- [x] ~Support Docker/K8s~ (deferred to issue: #496)
- [x] ~Create a PR on openlineage-dbt to remove the dbt dependency~ Not
needed since the code depends on `openlineage-integration-common` and
not `openlineage-dbt` ([more
info](https://github.com/OpenLineage/OpenLineage/blob/81372ca2bc2afecab369eab4a54cc6380dda49d0/integration/common/setup.py#L15))
- [x] Understand which dataset is being emitted from test tasks (only
inlets, no outlets)
tatiana (Collaborator, Author) commented Sep 8, 2023

#522

tatiana added a commit that referenced this issue Jun 7, 2024
[The
documentation](https://astronomer.github.io/astronomer-cosmos/configuration/scheduling.html)
was outdated.

The method `get_dbt_dataset` no longer exists. It used to exist in older
versions of Cosmos (before 1.1), when the URIs followed the format:

`Dataset(f"DBT://{connection_id.upper()}/{project_name.upper()}/{model_name.upper()}")`

More information on why we changed this:
#305

Closes: #1032
pankajkoti pushed a commit that referenced this issue Jun 7, 2024
(cherry picked from commit c47e104; same commit message as above)