Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add more template fields to DbtBaseOperator #786

Merged
merged 22 commits into from
Feb 29, 2024

Conversation

dwreeves
Copy link
Collaborator

@dwreeves dwreeves commented Jan 7, 2024

This partially addresses #754 via allowing for built-in templating support for the DbtBaseOperator.

I also noticed --full-refresh was not documented so I added that in.

Still missing

Manual run pattern is not documented; the fact that these fields are templated is not documented. I don't really know where in the docs to put this. The docs are very API-focused more than narrative-based or suggestive, and Cosmos's maintainers prefer this style of documentation so it's hard to find a spot for this. It's possible that that's fine and we just keep this as a feature for more advanced users who dig into source code to discover for themselves? 🤷

@dwreeves dwreeves requested review from a team as code owners January 7, 2024 01:24
@dosubot dosubot bot added the size:XS This PR changes 0-9 lines, ignoring generated files. label Jan 7, 2024
Copy link

netlify bot commented Jan 7, 2024

Deploy Preview for sunny-pastelito-5ecb04 ready!

Name Link
🔨 Latest commit d44424b
🔍 Latest deploy log https://app.netlify.com/sites/sunny-pastelito-5ecb04/deploys/65e10b172f877300087b29de
😎 Deploy Preview https://deploy-preview-786--sunny-pastelito-5ecb04.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

@dosubot dosubot bot added area:docs Relating to documentation, changes, fixes, improvement dbt:run Primarily related to dbt run command or functionality execution:local Related to Local execution environment parsing:custom Related to custom parsing, like custom DAG parsing, custom DBT parsing, etc size:S This PR changes 10-29 lines, ignoring generated files. and removed size:XS This PR changes 0-9 lines, ignoring generated files. labels Jan 7, 2024
Copy link

codecov bot commented Jan 8, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 95.39%. Comparing base (1bf78b4) to head (d44424b).

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #786      +/-   ##
==========================================
+ Coverage   95.35%   95.39%   +0.03%     
==========================================
  Files          57       57              
  Lines        2737     2759      +22     
==========================================
+ Hits         2610     2632      +22     
  Misses        127      127              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Collaborator

@tatiana tatiana left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @dwreeves, thanks a lot for the contribution!

Great catch on the missing docstring and docs.

The docs are more API-focused than narrative-based or suggestive, and Cosmos's maintainers prefer this documentation style, so it's hard to find a spot for this.

We can and should certainly improve Cosmos docs. I agree they are more API-focused, but it isn't a maintainers' preference. We're open to having other styles of docs, and I'd be super happy if we could follow https://diataxis.fr/. It has not been a priority so far - but we'd welcome help.

Regarding template fields, given the existing docs, I think we could either:

(1) Create a new page inside docs/configuration with a list of existing template fields

(2) Highlight the Cosmos arguments that are template fields where we expose them. Eg. In the case of select, exclude and selection, this could be in:
https://astronomer.github.io/astronomer-cosmos/configuration/render-config.html

In the case of env and vars, we could probably set it in dbt_vars and env_vars in:
https://astronomer.github.io/astronomer-cosmos/configuration/project-config.html

We should probably add full_refresh together with models:
https://astronomer.github.io/astronomer-cosmos/configuration/operator-args.html

As of compiled SQL, it seems we have a dedicated section for it:
https://astronomer.github.io/astronomer-cosmos/configuration/compiled-sql.html
Which could probably be merged into operator_args.

I'm happy for us to merge this PR once we have some docs related to the template fields.

@tatiana tatiana added the status:awaiting-author Issue/PR is under discussion and waiting for author's input label Jan 9, 2024
@tatiana tatiana added this to the 1.4.0 milestone Jan 9, 2024
@joppevos
Copy link
Contributor

joppevos commented Jan 10, 2024

@dwreeves oops I did not document it when adding full-refresh. Thanks for taking care of it 🙇

@dwreeves
Copy link
Collaborator Author

I think the thing that would make the most sense for now is to just add a section under /operator-args.html that is just about the templated variables. With the exception of --full-refresh, all of these variables are either (1) potentially used for rendering if DBT_LS is true (vars and env), or (2) used for model selection (select, selector, models, exclude) and are thus typically out of scope for operator_args inside of the dbt->Airflow converter context. A note should be included that clarifies this, although it's a little confusing because --full-refresh is essentially the only variable where it would make sense to do something like: operator_args={"full_refresh": "{{ params.full_refresh }}"}.

(2) Highlight the Cosmos arguments that are template fields where we expose them. Eg. In the case of select, exclude and selection, this could be in:
https://astronomer.github.io/astronomer-cosmos/configuration/render-config.html

So this is a bit weird because templating does not work when rendering a DAG, only when executing the task. I will keep in mind that a note should be made about this.


This also does beg a question though about the recent treatment of the dbt_vars and dbt_env. Right now in validate_initial_user_config(), an error is raised when both operator_args={"vars": ...} and project_config.dbt_vars are set. Since both rendering and execution use --vars and environment variables, this means that rendering and execution are strongly coupled and it's not possible to have these differ.

This was something I was thinking about irrespective of this particular PR. I think it's possible that it should be possible to set both operator_args={"vars": ...} and project_config.dbt_vars, and that the behavior would be that project_config.dbt_vars is used for rendering and operator_args={"vars": ...} could either merge or overwrite. Alternatively, render config and execution config could each have their own notion of dbt vars that can get set to the project configs' vars if either is not defined at the execution or rendering level.

In general the render and execution configs share a lot of configuration in the case that DBT_LS needs to be run, and maybe a separate discussion but I think there should be some sort of way to make the configurations more decoupled at the user level, but more DRY in their under-the-hood implementations. If that makes sense.


The reason why this discussion is potentially relevant to this PR is because of my note above about how the only operator args you can seemingly reliably utilize templating for at Cosmos's "highest level" API (i.e. DbtDag + DbtTaskGroup) is full_refresh.

I mean, you can access vars and env technically, but these will either:

  • if set in operator_args, they will raise deprecation warnings as of 1.3
  • if set in project_config and LoadMode.DBT_LS is used, then it will not properly render at all during LoadMode.DBT_LS rendering, i.e. the string literal will be used for either the dbt var or env var, since it never gets passed through Jinja and the scheduler wouldn't have potentially relevant context variables anyway.

This all feels both somewhat suboptimal, and also extremely confusing, to users. It's strange that ProjectConfig(dbt_vars={'logical_date': '{{ context["logical_date"] }}') is valid code, and works perfectly fine and as-intended if the user is using LoadMode.DBT_MANIFEST, but does not behave so well when using LoadMode.DBT_LS. It's tricky to convey this in documentation as a general statement. "Templating does not work with ProjectConfig(dbt_vars=...)" is not a true statement, but it is not a false statement either. The truth is templating works only in execution, and may or may not fail in rendering. Here are 2 example cases when it fails in rendering:

  • Since executing dbt ls does involve dbt parsing all the Jinja, and if your SQL query has something like {{ modules.datetime.datetime.fromisoformat(var("logical_date")) }} then the code will fail when rendering even though it would succeed when executing. In this case I think the user would want to either not have logical_date set at all (and rely on a default instead), or alternatively they should be allowed to set logical_date differently in rendering and execution.

  • Env vars can be a little more realistic a problem if, say, an env var is being used as part of a connection or to enable or disable models. For example, imagine the user sets a dbt profile with something like role: {{ env_var("DBT_ROLE") }}, and in their DAG they want to set DAG(..., params={"dbt_role": Param(...)}) and inside a DbtTaskGroup they want to set DbtTaskGroup(..., project_config=ProjectConfig(..., dbt_vars={"dbt_role": "{{ params.dbt_role }}"})). Here, again, it works fine in execution, but it does not work in the case where the load method is LoadMode.DBT_LS specifically.

That said, I am aware these are potentially extreme edge cases-- for example in most cases passing the string literal {{ logical_date }} into the dbt rendering will have zero impact on parsing with dbt ls and everything will render fine. So this adds to the confusion! The fully qualified correct statement is: Templating works in execution but does not work in rendering when the load method is LoadMode.DBT_LS; it may or may not matter depending on your dbt project. And that's just way more confusing than saying it doesn't work with dbt ls fully unqualified, and instead pushing users toward a separate layer of abstraction in the event that they want access to templating.


OK, this is all very confusing and in the weeds so I hope what I am saying makes sense. TLDR is:

  • Probably just stick this in operator-args.html as a separate section.
  • I would not mention this at all in ProjectConfig; and if we do, we may want to revisit some of the 1.3 changes regarding deprecation of vars= and env= in the operator_args?
  • The use of templating is actually really complicated and isn't fully addressed by the abstractions. It may be best to revisit this.
  • For now, without any further code changes, the only real thing that can be templated via DbtDag or DbtTaskGroup is full_refresh. (Maybe also warn users that render_template_as_native_obj=True should be true for that to work, since right now it cannot handle strings; alternatively make it so full_refresh can handle strings??).

@dosubot dosubot bot added size:M This PR changes 30-99 lines, ignoring generated files. and removed size:S This PR changes 10-29 lines, ignoring generated files. labels Feb 25, 2024
@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. and removed size:M This PR changes 30-99 lines, ignoring generated files. labels Feb 28, 2024
@dwreeves
Copy link
Collaborator Author

Cov requirements are still failing. There is no universe where I get this done before the weekend, just as an FYI.

@tatiana
Copy link
Collaborator

tatiana commented Feb 29, 2024

Thanks, @dwreeves , I just managed to get some time to add the missing coverage bit - it will be released as part of the alpha today!

@dwreeves
Copy link
Collaborator Author

Yes! Thank you! Can't wait for this 😁

Copy link
Collaborator

@tatiana tatiana left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, thank you very much, @dwreeves both for the contribution and persistence to get this into the finishing line!

@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Feb 29, 2024
@tatiana tatiana merged commit 95ad1cb into astronomer:main Feb 29, 2024
60 checks passed
tatiana added a commit that referenced this pull request Mar 1, 2024
Features

* Add dbt docs natively in Airflow via plugin by @dwreeves in #737
* Add support for ``InvocationMode.DBT_RUNNER`` for local execution mode
by @jbandoro in #850
* Support partial parsing to render DAGs faster when using
``ExecutionMode.LOCAL``, ``ExecutionMode.VIRTUALENV`` and
``LoadMode.DBT_LS`` by @dwreeves in #800
* Add Azure Container Instance as Execution Mode by @danielvdende in
#771
* Add dbt build operators by @dylanharper-qz in #795
* Add dbt profile config variables to mapped profile by @ykuc in #794
* Add more template fields to ``DbtBaseOperator`` by @dwreeves in #786

Bug fixes

* Make ``PostgresUserPasswordProfileMapping`` schema argument optional
by @FouziaTariq in #683
* Fix ``folder_dir`` not showing on logs for ``DbtDocsS3LocalOperator``
by @PrimOox in #856
* Improve ``dbt ls`` parsing resilience to missing tags/config by
@tatiana in #859
* Fix ``operator_args`` modified in place in Airflow converter by
@jbandoro in #835
* Fix Docker and Kubernetes operators execute method resolution by
@jbandoro in #849

Docs

* Fix docs homepage link by @jlaneve in #860
* Fix docs ``ExecutionConfig.dbt_project_path`` by @jbandoro in #847
* Fix typo in MWAA getting started guide by @jlaneve in #846

Others

* Add performance integration tests by @jlaneve in #827
* Add ``connect_retries`` to databricks profile to fix expensive
integration failures by @jbandoro in #826
* Add import sorting (isort) to Cosmos by @jbandoro in #866
* Add Python 3.11 to CI/tests by @tatiana and @jbandoro in #821, #824
and #825
* Fix failing ``test_created_pod`` for
``apache-airflow-providers-cncf-kubernetes`` after v8.0.0 update by
@jbandoro in #854
* Extend ``DatabricksTokenProfileMapping`` test to include session
properties by @tatiana in #858
* Fix broken integration test uncovered from Pytest 8.0 update by
@jbandoro in #845
* Pre-commit hook updates in #834, #843 and #852
@tatiana tatiana mentioned this pull request May 2, 2024
tatiana added a commit that referenced this pull request May 13, 2024
Features

* Add dbt docs natively in Airflow via plugin by @dwreeves in #737
* Add support for ``InvocationMode.DBT_RUNNER`` for local execution mode
by @jbandoro in #850
* Support partial parsing to render DAGs faster when using
``ExecutionMode.LOCAL``, ``ExecutionMode.VIRTUALENV`` and
``LoadMode.DBT_LS`` by @dwreeves in #800
* Improve performance by 22-35% or more by caching partial parse
artefact by @tatiana in #904
* Add Azure Container Instance as Execution Mode by @danielvdende in
#771
* Add dbt build operators by @dylanharper-qz in #795
* Add dbt profile config variables to mapped profile by @ykuc in #794
* Add more template fields to ``DbtBaseOperator`` by @dwreeves in #786
* Add ``pip_install_options`` argument to operators by @octiva in #808

Bug fixes

* Make ``PostgresUserPasswordProfileMapping`` schema argument optional
by @FouziaTariq in #683
* Fix ``folder_dir`` not showing on logs for ``DbtDocsS3LocalOperator``
by @PrimOox in #856
* Improve ``dbt ls`` parsing resilience to missing tags/config by
@tatiana in #859
* Fix ``operator_args`` modified in place in Airflow converter by
@jbandoro in #835
* Fix Docker and Kubernetes operators execute method resolution by
@jbandoro in #849
* Fix ``TrinoBaseProfileMapping`` required parameter for non method
authentication by @AlexandrKhabarov in #921
* Fix global flags for lists by @ms32035 in #863
* Fix ``GoogleCloudServiceAccountDictProfileMapping`` when getting
values from the Airflow connection ``extra__`` keys by @glebkrapivin in
#923
* Fix using the dag as a keyword argument as ``specific_args_keys`` in
DbtTaskGroup by @tboutaour in #916
* Fix ACI integration (``DbtAzureContainerInstanceBaseOperator``) by
@danielvdende in #872
* Fix setting dbt project dir to the tmp dir by @dwreeves in #873
* Fix dbt docs operator to not use ``graph.gpickle`` file when
``--no-write-json`` is passed by @dwreeves in #883
* Make Pydantic a required dependency by @pankajkoti in #939
* Gracefully error if users try to ``emit_datasets`` with ``Airflow
2.9.0`` or ``2.9.1`` by @tatiana in #948
* Fix parsing tests that have no parents in #933 by @jlaneve
* Correct ``root_path`` in partial parse cache by @pankajkoti in #950

Docs

* Fix docs homepage link by @jlaneve in #860
* Fix docs ``ExecutionConfig.dbt_project_path`` by @jbandoro in #847
* Fix typo in MWAA getting started guide by @jlaneve in #846
* Fix typo related to exporting docs to GCS by @tboutaour in #922
* Improve partial parsing docs by @tatiana in #898
* Improve docs for datasets for airflow >= 2.4 by @SiddiqueAhmad in #879
* Improve test behaviour docs to highlight ``warning`` feature in the
``virtualenv`` mode by @mc51 in #910
* Fix docs typo by @SiddiqueAhmad in #917
* Improve Astro docs by @RNHTTR in #951

Others

* Add performance integration tests by @jlaneve in #827
* Enable ``append_env`` in ``operator_args`` by default by @tatiana in
#899
* Change default ``append_env`` behaviour depending on Cosmos
``ExecutionMode`` by @pankajkoti and @pankajastro in #954
* Expose the ``dbt`` graph in the ``DbtToAirflowConverter`` class by
@tommyjxl in #886
* Improve dbt docs plugin rendering padding by @dwreeves in #876
* Add ``connect_retries`` to databricks profile to fix expensive
integration failures by @jbandoro in #826
* Add import sorting (isort) to Cosmos by @jbandoro in #866
* Add Python 3.11 to CI/tests by @tatiana and @jbandoro in #821, #824
and #825
* Fix failing ``test_created_pod`` for
``apache-airflow-providers-cncf-kubernetes`` after v8.0.0 update by
@jbandoro in #854
* Extend ``DatabricksTokenProfileMapping`` test to include session
properties by @tatiana in #858
* Fix broken integration test uncovered from Pytest 8.0 update by
@jbandoro in #845
* Add Apache Airflow 2.9 to the test matrix by @tatiana in #940
* Replace deprecated ``DummyOperator`` by ``EmptyOperator`` if Airflow
>=2.4.0 by @tatiana in #900
* Improve logs to troubleshoot issue in 1.4.0a2 with astro-cli by
@tatiana in #947
* Fix issue when publishing a new release to PyPI by @tatiana in #946
* Pre-commit hook updates in #820, #834, #843 and #852, #890, #896,
#901, #905, #908, #919, #931, #941
tatiana added a commit that referenced this pull request May 15, 2024
[Daniel Reeves](https://www.linkedin.com/in/daniel-reeves-27700545/)
(@dwreeves ) is an experienced Open-Source Developer currently working
as a Data Architect at Battery Ventures. He has significant experience
with Apache Airflow, SQL, and Python and has contributed to many [OSS
projects](https://github.com/dwreeve).

Not only has he been using Cosmos since its early stages, but since
January 2023, he has actively contributed to the project:
![Screenshot 2024-05-14 at 10 47
30](https://github.com/astronomer/astronomer-cosmos/assets/272048/57829cb6-7eee-4b02-998b-46cc7746f15a)

He has been a critical driver for the Cosmos 1.4 release, and some of
his contributions include new features, bug fixes, and documentation
improvements, including:
* Creation of an Airflow plugin to render dbt docs:
#737
* Support using dbt partial parsing file:
#800
* Add more template fields to `DbtBaseOperator`:
#786
* Add cancel on kill functionality:
#101
* Make region optional in Snowflake profile mapping:
#100
* Fix the dbt docs operator to not look for `graph.pickle`:
#883

He thinks about the project long-term and proposes thorough solutions to
problems faced by the community, as can be seen in Github tickets:
* Introducing composability in the middle layer of Cosmos's API:
#895
* Establish a general pattern for uploading artifacts to storage:
#894
* Support `operator_arguments` injection at a node level:
#881

One of Daniel's notable traits is his collaborative and supportive
approach. He has actively engaged with users in the #airflow-dbt Slack
channel, demonstrating his commitment to fostering a supportive
community.

We want to promote him as a Cosmos committer and maintainer for all
these, recognising his constant efforts and achievements towards our
community. Thank you very much, @dwreeves !
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:docs Relating to documentation, changes, fixes, improvement dbt:run Primarily related to dbt run command or functionality execution:local Related to Local execution environment lgtm This PR has been approved by a maintainer parsing:custom Related to custom parsing, like custom DAG parsing, custom DBT parsing, etc size:L This PR changes 100-499 lines, ignoring generated files. status:awaiting-author Issue/PR is under discussion and waiting for author's input
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants