
Add dbt docs natively in Airflow via plugin #737

Merged: 33 commits into astronomer:main on Feb 20, 2024

Conversation

dwreeves (Collaborator) commented Dec 2, 2023

Description

This PR adds a plugin (via the Airflow plugins entrypoint) that registers a menu item under Browse which renders the dbt docs:

[screenshot: the dbt docs menu item under Browse]

And this is what the rendered page looks like (this example is running inside the dev docker compose):

[screenshot: dbt docs rendered inside the Airflow UI]

The docs are rendered via an iframe with some additional hacks to make the page render in a user-friendly way. I chose an iframe over vendoring the index.html in the templates for a few reasons, but mostly to support custom {% block __overview__ %} text. However, extracting the text from index.html and rendering it in a custom page is certainly an option too.

The dbt docs are specified in the Airflow config with the following parameters:

[cosmos]
dbt_docs_dir = path/to/docs/here
dbt_docs_conn_id = my_conn_id

Note that the path can point to any of the following:

  • S3
  • Azure Blob Storage
  • Google Cloud Storage
  • HTTP/HTTPS
  • Local storage
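
For illustration, here is a sketch of what these values might look like (the bucket names and paths are hypothetical placeholders, not values confirmed by this PR):

[cosmos]
# any of, for example:
#   dbt_docs_dir = s3://my-bucket/path/to/docs
#   dbt_docs_dir = gs://my-bucket/path/to/docs
#   dbt_docs_dir = https://my-host.example.com/docs
dbt_docs_dir = /usr/local/airflow/dbt/target
dbt_docs_conn_id = my_conn_id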

This is designed to work with the operators that dump the dbt docs, and the documentation changes I added make that clear.
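
For example, a minimal sketch of pairing this with the S3 docs operator (the import path and argument names here are assumptions based on this era of Cosmos and may differ by version):

from cosmos.operators import DbtDocsS3LocalOperator

generate_dbt_docs = DbtDocsS3LocalOperator(
    task_id="generate_dbt_docs",
    project_dir="path/to/dbt/project",  # hypothetical path
    profile_config=profile_config,      # assumed to be defined elsewhere
    aws_conn_id="my_conn_id",           # assumed argument name
    bucket_name="my-bucket",            # hypothetical bucket
)

With dbt_docs_dir pointing at the same bucket and prefix the operator writes to, the plugin serves whatever the most recent task run uploaded.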

Lastly, if docs are not hooked up, a message comes up telling the user that they should set their dbt docs up:

image

Current limitations

  • Most importantly, I need help testing the S3 / Azure / GCS integrations. I think I got them right but I'll need someone to actually try them.
  • I also wouldn't mind some help testing the UI on more browsers. I've tested both Firefox and Chrome.
  • The iframe hack is less than ideal; ideally the dbt docs would have a fixed height, so that the scroll bar of the dbt docs UI is used instead of the scroll bar of the Airflow UI. The issue is basically that I am not an HTML/CSS/JavaScript person. I don't think there is any reason this shouldn't be possible, so I can continue to look into it as the PR is reviewed, or someone else can just do it for me.
  • I cannot run tests locally (lots of issues; mostly, the databricks DAG in dev/dags/ fails locally), so I actually have no idea whether the test suite works. I was just planning on letting GitHub Actions take a stab at it.

API Decisions

The core maintainers of the repo should provide some feedback on a few high-level API decisions:

  • Config variable names: Let me know if dbt_docs_dir and dbt_docs_conn_id are appropriate names. Other candidates could be dbt_docs_path, dbt_docs_dir_conn_id, dbt_docs_path_conn_id, etc.
  • Location in UI: I entertained two ideas: (a) adding a menu button called Cosmos with dbt docs underneath, or (b) adding it under Browse. Ultimately I decided on option (b).

Related Issue(s)

Closes #571.

Breaking Change?

This PR should not cause any breaking changes.

Checklist

  • I have made corresponding changes to the documentation (if required)
  • I have added tests that prove my fix is effective or that my feature works

jlaneve (Collaborator) commented Dec 2, 2023

I've been excited about this one! I haven't looked at this super in-depth yet, but how does the local filesystem option work? Could I add running dbt docs to my build process, bake it into my Airflow image, and read from there?

dwreeves (Collaborator, author) commented Dec 2, 2023

> I've been excited about this one! I haven't looked at this super in-depth yet, but how does the local filesystem option work? Could I add running dbt docs to my build process, bake it into my Airflow image, and read from there?

Yes, exactly. For my own professional usage this is the option I am doing.

As part of my deployment process, I run dbt compile; the manifest.json gets written into the dags directory, and in the code I specify the manifest_path in the ProjectConfig.

I like this deployment approach because it relieves some compute pressure on the scheduler, since it doesn't need to run dbt ls from a clean slate on every heartbeat. (Actually, this deployment approach isn't documented; could be something I add to the docs...)

So when this feature is in, I would run dbt docs generate as part of the deployment process as well, and then just use the docs there.
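
A minimal sketch of that setup in code, assuming hypothetical in-image paths:

from cosmos import ProjectConfig

# `dbt compile` (and, once this feature is in, `dbt docs generate`) runs at
# deploy time, and the resulting artifacts get baked into the image.
project_config = ProjectConfig(
    dbt_project_path="/usr/local/airflow/dbt/my_project",       # hypothetical path
    manifest_path="/usr/local/airflow/dags/dbt/manifest.json",  # written by `dbt compile`
)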

This approach does have a small downside as far as dbt docs are concerned (and now that I mention it, I should note it in the documentation), which is that the count(*) doesn't get updated in the docs regularly; it just uses whatever the count was at the last deploy. Also, if a model is incremental but hasn't run before, it will render the compiled version as the non-incremental code, which becomes stale after the first run. So basically, values can become stale. But for relieving pressure on the scheduler, and now, in the case of docs, for reducing cloud infrastructure requirements, I do like it.


Many users will be fine with these limitations of stale caching. If anyone wants low infra but up-to-date docs, you can run dbt docs generate on load of dbt_docs_index.html, but:

  • This would be slower. You'd want to add a loading GIF to the iframe to give responsive feedback to users.
  • You'd also need to be super careful about the artifacts in this case too, as just dumping artifacts into a tmp dir is leaky.

Overall no cloud infra + no pre-compile is an option, but it would require additional hacks and probably isn't something I'd encourage.

dwreeves (Collaborator, author) commented Dec 2, 2023

Please bear with me as I just throw some commits at GitHub Actions in a desperate attempt to get the tests I added working. I cannot for the life of me get the tests working on my local machine. 😓

Also, I added the aforementioned caveats regarding local storage to the documentation.

dwreeves (Collaborator, author) commented

What's the status of this? 🤔 The issue we left off at was that @tatiana was having some issues with the HTTP method for retrieving files, but I was unable to replicate it or figure out what the problem was (it worked on my end...). We both were able to confirm that the local method worked, and the GCS/S3/Azure methods have not been fully integration tested.

jbandoro (Collaborator) left a comment

> What's the status of this? 🤔 The issue we left off at was that @tatiana was having some issues with the HTTP method for retrieving files, but I was unable to replicate it or figure out what the problem was (it worked on my end...). We both were able to confirm that the local method worked, and the GCS/S3/Azure methods have not been fully integration tested.

Thanks for another great contribution @dwreeves, this is awesome!

I confirmed that the HTTP, local file, and GCS paths work with docker-compose. @tatiana mentioned to me that the HTTP path wasn't working when running airflow standalone, and I had an issue where the webserver process kept exiting with signal 11 when trying to view the dbt docs using the example docs dir you put in the docker-compose here; however, it did work with a local file path in airflow standalone mode.

Airflow recommends against using airflow standalone in production, and the issues could be related to the local system. Since this worked well in my testing with docker-compose, I'm happy to approve with the minor comment below fixed before merging.

dwreeves (Collaborator, author) commented

Thanks for the approval. Just updated.

jbandoro merged commit 11ff713 into astronomer:main on Feb 20, 2024. 60 checks passed.
ms32035 (Contributor) commented Feb 26, 2024

@dwreeves how does this work when there are multiple dbt projects/doc dirs?

dwreeves (Collaborator, author) commented

@ms32035 I was a little worried someone would ask. It is not supported directly. It could be in theory, although I cannot think of an API+UI design for multiple dbt projects' docs that wouldn't complicate things a lot for users with just one project. I'm all ears for whatever user-facing interface you think is appropriate.

Another solution, which I find reasonably elegant, is to import projects into a "docs project" so to speak. So you create a dbt project that has a packages.yml that looks like this:

packages:
- local: ../project1
- local: ../project2

It's possible that this pattern could/should be documented. Or we just support multiple projects' docs directly, although again, I don't know how to avoid the complications for the API.
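
For reference, a sketch of how such a docs project might be built at deploy time (the directory names are assumptions, not something prescribed by this PR):

cd docs_project
dbt deps           # installs ../project1 and ../project2 as local packages
dbt docs generate  # emits a single set of docs covering both projects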

dwreeves (Collaborator, author) commented

I actually do think at the very least this pattern should be documented. I'll open a PR in the next couple days.

I'm aware that, due to very custom setups (like passing vars into the dbt compile command), "just create a docs project" could be a nonstarter for many people. At that point, though, I'd rather have a variable that turns off the blueprint for the plugin and let users do their own thing by subclassing the plugin; they can create their own dropdown with multiple blueprints. Most of the novel JavaScript and iframe machinery is there, after all.

ms32035 (Contributor) commented Feb 26, 2024

@dwreeves the only design idea I have is a list page as an entry point, where you'd have to pre-configure the folders as a comma-separated string in the Airflow conf. I did something similar here: https://github.com/ms32035/airflow-multirepo-deploy/blob/f08d8a5863b3311bf4210d3dc2c835370dd0296d/multirepo-deploy-plugin/multirepo_deploy_plugin.py#L110

dwreeves (Collaborator, author) commented

I would only support that solution if, when there is only one project, the dbt docs load normally without a list. So the list is just there if you have multiple projects.

Making each attribute into a comma-separated list that gets zip()'d is something I briefly considered. It does help keep things simple for single-project users, although it feels weird and slightly inexplicit as a data model. That's the trade-off, basically.
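
Purely as a hypothetical sketch of that data model (none of these names exist in the PR):

from airflow.configuration import conf

# comma-separated config values, zip()'d into per-project (dir, conn_id) pairs
docs_dirs = conf.get("cosmos", "dbt_docs_dir").split(",")
conn_ids = conf.get("cosmos", "dbt_docs_conn_id").split(",")
projects = list(zip(docs_dirs, conn_ids))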

ms32035 (Contributor) commented Feb 26, 2024

If list length = 1 then redirect to the first element instead of rendering the table :D
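
As a hypothetical sketch of that behaviour in a Flask view (illustrative only; no such view or endpoint exists in this PR):

from flask import redirect, render_template, url_for

def dbt_docs_index(projects):
    # with a single configured project, skip the list page entirely
    if len(projects) == 1:
        return redirect(url_for("dbt_docs_view", project=projects[0]))
    return render_template("dbt_docs_list.html", projects=projects)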

tatiana added a commit that referenced this pull request Mar 1, 2024
Features

* Add dbt docs natively in Airflow via plugin by @dwreeves in #737
* Add support for ``InvocationMode.DBT_RUNNER`` for local execution mode
by @jbandoro in #850
* Support partial parsing to render DAGs faster when using
``ExecutionMode.LOCAL``, ``ExecutionMode.VIRTUALENV`` and
``LoadMode.DBT_LS`` by @dwreeves in #800
* Add Azure Container Instance as Execution Mode by @danielvdende in
#771
* Add dbt build operators by @dylanharper-qz in #795
* Add dbt profile config variables to mapped profile by @ykuc in #794
* Add more template fields to ``DbtBaseOperator`` by @dwreeves in #786

Bug fixes

* Make ``PostgresUserPasswordProfileMapping`` schema argument optional
by @FouziaTariq in #683
* Fix ``folder_dir`` not showing on logs for ``DbtDocsS3LocalOperator``
by @PrimOox in #856
* Improve ``dbt ls`` parsing resilience to missing tags/config by
@tatiana in #859
* Fix ``operator_args`` modified in place in Airflow converter by
@jbandoro in #835
* Fix Docker and Kubernetes operators execute method resolution by
@jbandoro in #849

Docs

* Fix docs homepage link by @jlaneve in #860
* Fix docs ``ExecutionConfig.dbt_project_path`` by @jbandoro in #847
* Fix typo in MWAA getting started guide by @jlaneve in #846

Others

* Add performance integration tests by @jlaneve in #827
* Add ``connect_retries`` to databricks profile to fix expensive
integration failures by @jbandoro in #826
* Add import sorting (isort) to Cosmos by @jbandoro in #866
* Add Python 3.11 to CI/tests by @tatiana and @jbandoro in #821, #824
and #825
* Fix failing ``test_created_pod`` for
``apache-airflow-providers-cncf-kubernetes`` after v8.0.0 update by
@jbandoro in #854
* Extend ``DatabricksTokenProfileMapping`` test to include session
properties by @tatiana in #858
* Fix broken integration test uncovered from Pytest 8.0 update by
@jbandoro in #845
* Pre-commit hook updates in #834, #843 and #852
tatiana added a commit that referenced this pull request May 13, 2024
Features

* Add dbt docs natively in Airflow via plugin by @dwreeves in #737
* Add support for ``InvocationMode.DBT_RUNNER`` for local execution mode
by @jbandoro in #850
* Support partial parsing to render DAGs faster when using
``ExecutionMode.LOCAL``, ``ExecutionMode.VIRTUALENV`` and
``LoadMode.DBT_LS`` by @dwreeves in #800
* Improve performance by 22-35% or more by caching partial parse
artefact by @tatiana in #904
* Add Azure Container Instance as Execution Mode by @danielvdende in
#771
* Add dbt build operators by @dylanharper-qz in #795
* Add dbt profile config variables to mapped profile by @ykuc in #794
* Add more template fields to ``DbtBaseOperator`` by @dwreeves in #786
* Add ``pip_install_options`` argument to operators by @octiva in #808

Bug fixes

* Make ``PostgresUserPasswordProfileMapping`` schema argument optional
by @FouziaTariq in #683
* Fix ``folder_dir`` not showing on logs for ``DbtDocsS3LocalOperator``
by @PrimOox in #856
* Improve ``dbt ls`` parsing resilience to missing tags/config by
@tatiana in #859
* Fix ``operator_args`` modified in place in Airflow converter by
@jbandoro in #835
* Fix Docker and Kubernetes operators execute method resolution by
@jbandoro in #849
* Fix ``TrinoBaseProfileMapping`` required parameter for non method
authentication by @AlexandrKhabarov in #921
* Fix global flags for lists by @ms32035 in #863
* Fix ``GoogleCloudServiceAccountDictProfileMapping`` when getting
values from the Airflow connection ``extra__`` keys by @glebkrapivin in
#923
* Fix using the dag as a keyword argument as ``specific_args_keys`` in
DbtTaskGroup by @tboutaour in #916
* Fix ACI integration (``DbtAzureContainerInstanceBaseOperator``) by
@danielvdende in #872
* Fix setting dbt project dir to the tmp dir by @dwreeves in #873
* Fix dbt docs operator to not use ``graph.gpickle`` file when
``--no-write-json`` is passed by @dwreeves in #883
* Make Pydantic a required dependency by @pankajkoti in #939
* Gracefully error if users try to ``emit_datasets`` with ``Airflow
2.9.0`` or ``2.9.1`` by @tatiana in #948
* Fix parsing tests that have no parents in #933 by @jlaneve
* Correct ``root_path`` in partial parse cache by @pankajkoti in #950

Docs

* Fix docs homepage link by @jlaneve in #860
* Fix docs ``ExecutionConfig.dbt_project_path`` by @jbandoro in #847
* Fix typo in MWAA getting started guide by @jlaneve in #846
* Fix typo related to exporting docs to GCS by @tboutaour in #922
* Improve partial parsing docs by @tatiana in #898
* Improve docs for datasets for airflow >= 2.4 by @SiddiqueAhmad in #879
* Improve test behaviour docs to highlight ``warning`` feature in the
``virtualenv`` mode by @mc51 in #910
* Fix docs typo by @SiddiqueAhmad in #917
* Improve Astro docs by @RNHTTR in #951

Others

* Add performance integration tests by @jlaneve in #827
* Enable ``append_env`` in ``operator_args`` by default by @tatiana in
#899
* Change default ``append_env`` behaviour depending on Cosmos
``ExecutionMode`` by @pankajkoti and @pankajastro in #954
* Expose the ``dbt`` graph in the ``DbtToAirflowConverter`` class by
@tommyjxl in #886
* Improve dbt docs plugin rendering padding by @dwreeves in #876
* Add ``connect_retries`` to databricks profile to fix expensive
integration failures by @jbandoro in #826
* Add import sorting (isort) to Cosmos by @jbandoro in #866
* Add Python 3.11 to CI/tests by @tatiana and @jbandoro in #821, #824
and #825
* Fix failing ``test_created_pod`` for
``apache-airflow-providers-cncf-kubernetes`` after v8.0.0 update by
@jbandoro in #854
* Extend ``DatabricksTokenProfileMapping`` test to include session
properties by @tatiana in #858
* Fix broken integration test uncovered from Pytest 8.0 update by
@jbandoro in #845
* Add Apache Airflow 2.9 to the test matrix by @tatiana in #940
* Replace deprecated ``DummyOperator`` by ``EmptyOperator`` if Airflow
>=2.4.0 by @tatiana in #900
* Improve logs to troubleshoot issue in 1.4.0a2 with astro-cli by
@tatiana in #947
* Fix issue when publishing a new release to PyPI by @tatiana in #946
* Pre-commit hook updates in #820, #834, #843 and #852, #890, #896,
#901, #905, #908, #919, #931, #941
tatiana added a commit that referenced this pull request May 15, 2024
[Daniel Reeves](https://www.linkedin.com/in/daniel-reeves-27700545/)
(@dwreeves) is an experienced Open-Source Developer currently working
as a Data Architect at Battery Ventures. He has significant experience
with Apache Airflow, SQL, and Python and has contributed to many [OSS
projects](https://github.com/dwreeve).

Not only has he been using Cosmos since its early stages, but since
January 2023, he has actively contributed to the project:
![Screenshot 2024-05-14 at 10 47 30](https://github.com/astronomer/astronomer-cosmos/assets/272048/57829cb6-7eee-4b02-998b-46cc7746f15a)

He has been a critical driver for the Cosmos 1.4 release, and some of
his contributions include new features, bug fixes, and documentation
improvements, including:
* Creation of an Airflow plugin to render dbt docs:
#737
* Support using dbt partial parsing file:
#800
* Add more template fields to `DbtBaseOperator`:
#786
* Add cancel on kill functionality:
#101
* Make region optional in Snowflake profile mapping:
#100
* Fix the dbt docs operator to not look for `graph.pickle`:
#883

He thinks about the project long-term and proposes thorough solutions to
problems faced by the community, as can be seen in GitHub tickets:
* Introducing composability in the middle layer of Cosmos's API:
#895
* Establish a general pattern for uploading artifacts to storage:
#894
* Support `operator_arguments` injection at a node level:
#881

One of Daniel's notable traits is his collaborative and supportive
approach. He has actively engaged with users in the #airflow-dbt Slack
channel, demonstrating his commitment to fostering a supportive
community.

We want to promote him as a Cosmos committer and maintainer for all
these, recognising his constant efforts and achievements towards our
community. Thank you very much, @dwreeves !