Fix docs so it does not reference non-existing get_dbt_dataset #1034

Merged (3 commits) on Jun 7, 2024
2 changes: 1 addition & 1 deletion .github/ISSUE_TEMPLATE/01-bug.yml
@@ -1,7 +1,7 @@
---
name: Bug Report
description: File a bug report.
-title: "[Bug]: "
+title: "[Bug] "
labels: ["bug", "triage-needed"]
body:
- type: markdown
3 changes: 2 additions & 1 deletion .github/ISSUE_TEMPLATE/02-feature.yml
@@ -1,7 +1,8 @@
---
name: Feature request
description: Suggest an idea for this project
-labels: ["enhancement", "needs-triage"]
+title: "[Feature] "
+labels: ["enhancement", "triage-needed"]
body:
- type: markdown
attributes:
19 changes: 12 additions & 7 deletions docs/configuration/scheduling.rst
@@ -24,23 +24,31 @@ To schedule a dbt project on a time-based schedule, you can use Airflow's schedu
Data-Aware Scheduling
---------------------

-By default, Cosmos emits `Airflow Datasets <https://airflow.apache.org/docs/apache-airflow/stable/concepts/datasets.html>`_ when running dbt projects. This allows you to use Airflow's data-aware scheduling capabilities to schedule your dbt projects. Cosmos emits datasets in the following format:
+Apache Airflow 2.4 introduced the concept of `scheduling based on Datasets <https://airflow.apache.org/docs/apache-airflow/stable/authoring-and-scheduling/datasets.html>`_.
+
+By default, if Airflow 2.4 or higher is used, Cosmos emits `Airflow Datasets <https://airflow.apache.org/docs/apache-airflow/stable/concepts/datasets.html>`_ when running dbt projects. This allows you to use Airflow's data-aware scheduling capabilities to schedule your dbt projects. Cosmos emits datasets using the OpenLineage URI format, as detailed in the `OpenLineage Naming Convention <https://github.com/OpenLineage/OpenLineage/blob/main/spec/Naming.md>`_.
+
+Cosmos calculates these URIs during the task execution, by using the library `OpenLineage Integration Common <https://pypi.org/project/openlineage-integration-common/>`_.
+
+This block illustrates a Cosmos-generated dataset for Postgres:

.. code-block:: python

-Dataset("DBT://{connection_id}/{project_name}/{model_name}")
+Dataset("postgres://host:5432/database.schema.table")


For example, let's say you have:

- A dbt project (``project_one``) with a model called ``my_model`` that runs daily
- A second dbt project (``project_two``) with a model called ``my_other_model`` that you want to run immediately after ``my_model``

+We are assuming that the Database used is Postgres, the host is ``host``, the database is ``database`` and the schema is ``schema``.

Then, you can use Airflow's data-aware scheduling capabilities to schedule ``my_other_model`` to run after ``my_model``. For example, you can use the following DAGs:

.. code-block:: python

-from cosmos import DbtDag, get_dbt_dataset
+from cosmos import DbtDag

project_one = DbtDag(
# ...
@@ -49,10 +57,7 @@ Then, you can use Airflow's data-aware scheduling capabilities to schedule ``my_
)

project_two = DbtDag(
-# for airflow <=2.3
-# schedule=[get_dbt_dataset("my_conn", "project_one", "my_model")],
-# for airflow > 2.3
-schedule=[get_dbt_dataset("my_conn", "project_one", "my_model")],
+schedule=[Dataset("postgres://host:5432/database.schema.my_model")],
dbt_project_name="project_two",
)
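
As a rough illustration of the URI shape used in the example above, the sketch below assembles a Postgres dataset URI from its parts. The helper name ``postgres_dataset_uri`` is hypothetical and is not part of Cosmos or OpenLineage. Cosmos derives the real URI at task execution time through the OpenLineage integration library, so this is only a way to predict the string that a downstream DAG would reference.

```python
def postgres_dataset_uri(
    host: str, port: int, database: str, schema: str, table: str
) -> str:
    """Hypothetical helper (not a Cosmos API).

    Formats a URI in the shape shown in the docs change:
    postgres://host:5432/database.schema.table
    """
    return f"postgres://{host}:{port}/{database}.{schema}.{table}"


# Values taken from the example's stated assumptions (host, database, schema).
uri = postgres_dataset_uri("host", 5432, "database", "schema", "my_model")
print(uri)
```

In a DAG file, the resulting string would be passed as ``schedule=[Dataset(uri)]``. Note that ``Dataset`` needs to be imported from ``airflow.datasets`` (available from Airflow 2.4), an import the snippet in the diff does not show.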
