[Feature] Allow "persist_docs" to be run independently #4226
Comments
I can see a possible situation where you run this as a standalone operation while your .yml files are out of sync with the actual schema in the database. Is decoupling these two components safe? What might the behaviour be in that situation?
Indeed, good point. But in that case it should either throw an error, or simply go through for the fields that are in sync with the database. That is actually already the behavior of "dbt run" when your model's .yml file is not in sync with the schema of your model.
@Charles1104 Thanks for opening! I hazard agreeing with @balmasi, for related reasons: I don't feel great about officially decoupling running a model and persisting its descriptions as database comments. Nor do I feel great about tightly coupling docs persistence (a mutative operation) to the "dbt docs" command.

That said, I do see one way to proceed with this, by emulating the work of the built-in "persist_docs" config inside a macro that you invoke yourself. The code below is a sample, and very much "use at your own risk": so long as you know what you're doing, this might save your developers some time. Plus, it's fun to see what you can do with dbt, when so much of its built-in functionality is written in macros.

-- macros/persist_docs_op.sql
{% macro persist_docs_op(model_name, relation = true, columns = true) %}
  {% if execute %}
    {% set model_node = (graph.nodes.values() | selectattr('name', 'equalto', model_name) | list)[0] %}
    {% set relation = adapter.get_relation(
        database=model_node.database, schema=model_node.schema, identifier=model_node.alias
    ) %}
    {{ log("Altering relation comment for " + model_name, info = true) }}
    {{ run_query(alter_relation_comment(relation, model_node.description)) }}
    {{ log("Altering column comments for " + model_name, info = true) }}
    {{ run_query(alter_column_comment(relation, model_node.columns)) }}
  {% endif %}
{% endmacro %}

Running:
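Custom macros like this one are invoked with dbt's run-operation command; a sketch of the invocation, assuming a model named my_model (the model name here is illustrative):

```
dbt run-operation persist_docs_op --args '{model_name: my_model}'
```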
Checking into
Hi @jtcohen6, thank you for the answer and explanation. Some additional context on why we would like to do this: in production we run dbt via Airflow, and we explode the manifest.json into granular stages as per this Astronomer article. Then, in each Airflow stage, we decided not to use a Bash operator running "dbt run", but instead a BigQuery operator pointed at the compiled queries (compiled via "dbt compile" in our first Airflow stage), for efficiency reasons. But this has the implication that the docs are not persisted, since we are just running the query. I like the solution you suggest and we will use it to persist our docs in production. Thank you for sharing the code snippet. Best regards, Charles
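As a sketch of the "point the operator at the compiled queries" step described above: after "dbt compile", each model's rendered SQL lands under target/compiled/&lt;project&gt;/..., mirroring the models/ tree, and can be read and handed to whatever query operator the orchestrator provides. The function name and directory layout below are assumptions for illustration, not part of any dbt or Airflow API:

```python
from pathlib import Path


def read_compiled_sql(project_dir: str, model_name: str) -> str:
    """Find and read the SQL that `dbt compile` wrote for one model.

    dbt mirrors the `models/` directory under target/compiled/<project_name>/,
    so we search that tree recursively for `<model_name>.sql`.
    """
    compiled_root = Path(project_dir) / "target" / "compiled"
    matches = sorted(compiled_root.rglob(f"{model_name}.sql"))
    if not matches:
        raise FileNotFoundError(f"No compiled SQL found for model {model_name!r}")
    return matches[0].read_text()
```

The returned string could then be passed as the query payload of a BigQuery operator, which is what the comment above describes.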
Would like to breathe some life into this. I don't really understand why dbt prints the actual comment statements to the log:

but does not add them to either the compiled SQL or the manifest.
@george-zubrienko Unfortunately, this is a limitation of dbt's current materialization model: the artifacts it produces do not include all of the SQL that dbt executed as part of materializing a model. I've wanted to change that for a long time, so that materializations include all relevant statements in the compiled artifacts.
I see. This is definitely needed, because for backends like Spark it is quite crucial, for performance reasons, to be able to run SQL statements outside dbt while still having them generated by dbt. It is not about dbt performance itself, but rather the nature and volume of the data that goes through Spark environments. For example, we have a solid workload-management framework sitting on top of a swarm of Spark clusters managed by k8s. I don't really need an adapter for that; I just need the SQL, ready to be executed right away. Without this feature, our dbt use is currently limited to Databricks ad-hoc development and table provisioning. Parsing the manifest/dbt log is fine as a workaround for now; the only issue is that relying on file parsing tends to generate extra work hours on minor releases :) However, for this to work, dbt needs to decouple itself from the target database and rely on YML schemas instead. Which would be nice, because runtime type-checking on insert is the biggest drawback of SQL.
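The manifest-parsing workaround mentioned in the comment above can be sketched as follows: load target/manifest.json, walk the model nodes, and emit the comment DDL yourself. The "nodes", "resource_type", "description", and "columns" keys are part of dbt's documented manifest schema; the ANSI-style COMMENT ON syntax is an assumption and would need adapting per backend (e.g. BigQuery uses table/column options instead):

```python
import json


def comment_statements(manifest_path: str) -> list:
    """Build ANSI-style COMMENT ON statements from a dbt manifest.json.

    Walks every model node and emits one statement for the relation
    description, plus one per documented column.
    """
    with open(manifest_path) as f:
        manifest = json.load(f)

    stmts = []
    for node in manifest["nodes"].values():
        if node["resource_type"] != "model":
            continue
        # dbt writes the relation under its alias when one is configured
        relation = f'{node["schema"]}.{node.get("alias") or node["name"]}'
        if node.get("description"):
            stmts.append(f"comment on table {relation} is '{node['description']}'")
        for col_name, col in node.get("columns", {}).items():
            if col.get("description"):
                stmts.append(
                    f"comment on column {relation}.{col_name} is '{col['description']}'"
                )
    return stmts
```

The resulting statements can then be submitted by any external executor, which is the decoupling this comment asks for.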
I also think an operation to only change descriptions would be useful. My models process quite a lot of data when they run, so I would like to be able to update the descriptions independently. I managed to achieve this (with BigQuery) with a slightly modified macro inspired by @jtcohen6:

{% macro persist_docs_op(model_name, relation = true, columns = true) %}
  {% if execute %}
    {% set model_node = (graph.nodes.values() | selectattr('name', 'equalto', model_name) | list)[0] %}
    {% set relation = adapter.get_relation(
        database=model_node.database, schema=model_node.schema, identifier=model_node.alias
    ) %}
    {% if 'description' in model_node %}
      {{ log("Altering table description for " + model_name, info = true) }}
      {{ adapter.update_table_description(model_node['database'], model_node['schema'], model_node['alias'], model_node['description']) }}
    {% endif %}
    {{ log("Altering column comments for " + model_name, info = true) }}
    {{ alter_column_comment(relation, model_node.columns) }}
  {% endif %}
{% endmacro %}
Is there an existing feature request for this?
Describe the Feature
Currently the "persist_docs" configuration is only applied by the "dbt run" command. The goal of this feature is to allow documentation to be persisted to the database without having to run your models. There are times when you just want to update some descriptions without re-running your models.
How should this work?
A new flag on the "dbt docs" command, for instance "--persist-docs". The following command would then persist only the docs for the selected models:
dbt docs --persist-docs --select [MODELS]
Describe alternatives you've considered
No possible alternatives at the moment, as per the documentation of "persist_docs".
Who will this benefit?
This will benefit data analysts and analytics engineers who regularly update the documentation in the yml files and want this documentation persisted to their analytical database without having to re-run their models.
Are you interested in contributing this feature?
I will need some ramp-up on the codebase, but I would be thrilled to contribute.
Anything else?
No response