
[Feature] Allow "persist_docs" to be run independently #4226

Closed
1 task done
Charles1104 opened this issue Nov 8, 2021 · 8 comments
Labels
enhancement New feature or request wontfix Not a bug or out of scope for dbt-core

Comments

@Charles1104

Is there an existing feature request for this?

  • I have searched the existing issues

Describe the Feature

Currently, the "persist_docs" configuration is only applied by the "dbt run" command. The goal of this feature is to allow documentation to be persisted to the database without having to run your models. There are times when you just want to update some descriptions without re-running anything.

How should this work?

A new flag on the "dbt docs" command, for instance "--persist-docs". The following command would then persist the docs just for the selected models:

dbt docs --persist-docs --select [MODELS]

Describe alternatives you've considered

No alternatives are possible at the moment, as per the documentation of "persist_docs".

Who will this benefit?

This will benefit data analysts / analytics engineers who regularly update the documentation in the yml files and want this documentation to be persisted to their analytical database without having to re-run their models.

Are you interested in contributing this feature?

I will need some ramp-up on the codebase, but I would be thrilled to contribute.

Anything else?

No response

@Charles1104 Charles1104 added enhancement New feature or request triage labels Nov 8, 2021
@Charles1104 Charles1104 changed the title [Feature] <title> [Feature] Allow "persist_docs" to be run independently Nov 8, 2021
@balmasi

balmasi commented Nov 8, 2021

I can see a possible situation where you run dbt docs --persist-docs --select [MODELS] but your models have not been re-run, so the codebase is out of sync with the DB.

Is decoupling these two components safe?

What might the behaviour be in that situation?

@Charles1104
Author

Indeed, good point. In that case it should either throw an error or just go through for the fields that are in sync with the DB.

This is actually already the behavior of "dbt run" when your model.yml file is not in sync with the schema of your model.

@jtcohen6
Contributor

@Charles1104 Thanks for opening!

I'm inclined to agree with @balmasi, for related reasons: I don't feel great about officially decoupling running a model and persisting its descriptions as database comments. Nor do I feel great about tightly coupling docs persistence (mutative alter/comment statements) with docs generate (metadata queries, no mutation, introspection only) instead—though the former certainly benefits the latter, these are different tasks, with different goals and footprints.

That said, I do see one way to proceed with this, by emulating the work of the built-in persist_docs macro in a custom run-operation. This requires tapping into the graph context variable to pick out the model you want. Unfortunately, this won't work with actual node selection syntax, but that doesn't feel as important for the case where you're iterating on a single model's descriptions.

The code below is a sample, and very much "use at your own risk": so long as you know what you're doing, this might save your developers some time. Plus, it's fun to see what you can do with dbt, when so much of its built-in functionality is written in macros.

-- macros/persist_docs_op.sql
{% macro persist_docs_op(model_name, relation = true, columns = true) %}

  {% if execute %}
    {% set model_node = (graph.nodes.values() | selectattr('name', 'equalto', model_name) | list)[0] %}
    {% set relation = adapter.get_relation(
        database=model_node.database, schema=model_node.schema, identifier=model_node.alias
    ) %}
  
    {{ log("Altering relation comment for " + model_name, info = true) }}
    {{ run_query(alter_relation_comment(relation, model_node.description)) }}
    {{ log("Altering column comments for " + model_name, info = true) }}
    {{ run_query(alter_column_comment(relation, model_node.columns)) }}
  
  {% endif %}

{% endmacro %}

Running:

$ dbt run-operation persist_docs_op --args 'model_name: model_a'
Running with dbt=0.21.1-rc1
Altering relation comment for model_a
Altering column comments for model_a

Checking into logs/dbt.log to confirm it ran:

2021-11-17 10:31:47.375852 (MainThread): Altering relation comment for model_a
2021-11-17 10:31:47.383144 (MainThread): Using postgres connection "macro_persist_docs_op".
2021-11-17 10:31:47.383318 (MainThread): On macro_persist_docs_op: /* {"app": "dbt", "dbt_version": "0.21.1rc1", "profile_name": "garage-postgres", "target_name": "dev", "connection_name": "macro_persist_docs_op"} */

  comment on view "jerco"."dbt_jcohen"."model_a" is $dbt_comment_literal_block$not cool model$dbt_comment_literal_block$;
2021-11-17 10:31:47.384160 (MainThread): SQL status: COMMENT in 0.00 seconds
2021-11-17 10:31:47.385003 (MainThread): Altering column comments for model_a
2021-11-17 10:31:47.389274 (MainThread): Using postgres connection "macro_persist_docs_op".
2021-11-17 10:31:47.389406 (MainThread): On macro_persist_docs_op: /* {"app": "dbt", "dbt_version": "0.21.1rc1", "profile_name": "garage-postgres", "target_name": "dev", "connection_name": "macro_persist_docs_op"} */

  comment on column "jerco"."dbt_jcohen"."model_a".id is $dbt_comment_literal_block$not cool column$dbt_comment_literal_block$;
2021-11-17 10:31:47.389865 (MainThread): SQL status: COMMENT in 0.00 seconds
2021-11-17 10:31:47.390647 (MainThread): On macro_persist_docs_op: ROLLBACK
2021-11-17 10:31:47.391006 (MainThread): On macro_persist_docs_op: Close

@jtcohen6 jtcohen6 added wontfix Not a bug or out of scope for dbt-core and removed triage labels Nov 17, 2021
@Charles1104
Author

Hi @jtcohen6,

Thank you for the answer and explanation.

Some additional context on why we would like to do this. In production, we run dbt via Airflow, and we are exploding the manifest.json into granular stages as per this Astronomer article. Then, in each Airflow stage, rather than using a Bash operator to run "dbt run", we use a BigQuery operator pointed at the compiled queries (produced by "dbt compile" in our first Airflow stage) for efficiency reasons. But this has the implication that the docs are not persisted, since it just runs the query.

I like the solution you suggest and we will use it to persist our docs in production. Thank you for sharing the code snippet.

Best regards,

Charles

@george-zubrienko

george-zubrienko commented Mar 3, 2022

Would like to breathe some life into this. I don't really understand why dbt prints the actual statements for comments to the log:

        alter table auto_replenishment_george_1.assortment change column 
            sku_number
            comment 'Unique identifier for a network route\'s sku';

But it does not add them to either the target/compiled or target/run output for the selected model. If it did just that, there would be no issue at all.
Disclaimer: everything is doable with shell, of course:

cat logs/dbt.log | grep -A4 alter > docs.sql
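A slightly more robust take on the same grep idea, as a minimal Python sketch: it scans log text for statements beginning with "comment on" or "alter table" and joins lines until the closing semicolon. The function name and the assumption that such statements appear on their own (possibly wrapped) lines are mine; dbt's log layout can and does change between releases.

```python
def extract_comment_statements(log_text):
    """Pull multi-line `comment on` / `alter table ... comment` statements
    out of dbt debug-log text. A statement ends at a semicolon."""
    statements = []
    current = None  # lines of the statement being accumulated, or None
    for line in log_text.splitlines():
        stripped = line.strip()
        if current is None:
            # Start accumulating when a statement of interest begins
            if stripped.startswith("comment on") or stripped.startswith("alter table"):
                current = [stripped]
        else:
            current.append(stripped)
        if current is not None and stripped.endswith(";"):
            statements.append(" ".join(current))
            current = None
    return statements
```

Unlike `grep -A4`, this does not assume a fixed number of continuation lines, so it also catches single-line `comment on` statements.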

@jtcohen6
Contributor

jtcohen6 commented Mar 3, 2022

@george-zubrienko Unfortunately, this is a limitation of dbt's current materialization model. Rather than including all SQL that dbt executed as part of materializing a model, target/run only includes the "main" statement that dbt runs within the materialization. This limitation, along with the subtle distinction between "compiled" and "run" SQL, is something we've discussed before in other contexts (#3222, #4227).

I've wanted to change that for a long time, so that materializations include all relevant statements in target/run. This also feels like an important precondition for being able to "dry run" dbt models (#4456).

@george-zubrienko

> @george-zubrienko Unfortunately, this is a limitation of dbt's current materialization model. Rather than including all SQL that dbt executed as part of materializing a model, target/run only includes the "main" statement that dbt runs within the materialization. This limitation, along with the subtle distinction between "compiled" and "run" SQL, is something we've discussed before in other contexts (#3222, #4227).
>
> I've wanted to change that for a long time, so that materializations include all relevant statements in target/run. This also feels like an important precondition for being able to "dry run" dbt models (#4456).

I see. This is definitely needed, because for backends like Spark it is quite crucial, for performance reasons, to be able to run SQL statements outside dbt while still having them generated by dbt. It is not about dbt's own performance, but rather the nature and volume of the data that goes through Spark environments. For example, we have a solid workload management framework sitting on top of a swarm of Spark clusters managed by k8s. I don't really need an adapter for that, just the SQL that can be executed right away. Without this feature, our dbt use is currently limited to Databricks ad-hoc development and table provisioning.

Parsing the manifest/dbt log is fine as a workaround for now; the only issue is that relying on file parsing tends to generate extra work hours on minor releases :)

However, for this to work, dbt needs to decouple itself from the target database and rely on YML schemas instead. That would be nice, because runtime type-checking on insert is the biggest bad thing about SQL.
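The manifest-parsing workaround mentioned above can be sketched in a few lines of Python: walk the nodes in a dbt manifest and emit Postgres-style comment statements from their descriptions. The keys used here (database, schema, alias, description, columns) follow the standard manifest layout, but the artifact schema varies by dbt version, so treat this as an illustration; escaping of quotes inside descriptions is deliberately not handled.

```python
import json  # used in the commented usage example below

def comment_statements_from_manifest(manifest):
    """Generate Postgres-style COMMENT statements for every model node
    in a dbt manifest dict that carries relation or column descriptions."""
    stmts = []
    for node in manifest.get("nodes", {}).values():
        if node.get("resource_type") != "model":
            continue  # skip tests, seeds, snapshots, etc.
        fqn = f'"{node["database"]}"."{node["schema"]}"."{node["alias"]}"'
        if node.get("description"):
            stmts.append(f"comment on table {fqn} is '{node['description']}';")
        for col in node.get("columns", {}).values():
            if col.get("description"):
                stmts.append(
                    f"comment on column {fqn}.{col['name']} is '{col['description']}';"
                )
    return stmts

# Typical use (path is the default dbt artifact location):
# with open("target/manifest.json") as f:
#     print("\n".join(comment_statements_from_manifest(json.load(f))))
```

Reading manifest.json instead of the log avoids depending on dbt's log format, though it still depends on the (versioned) manifest schema.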

@halvorlu

halvorlu commented Oct 25, 2022

I also think an operation to only change descriptions would be useful. My models process quite a lot of data when they run, so I would like to be able to update the descriptions independently. I managed to achieve this (with BigQuery) with a slightly modified macro inspired by @jtcohen6:

{% macro persist_docs_op(model_name, relation = true, columns = true) %}

  {% if execute %}
    {% set model_node = (graph.nodes.values() | selectattr('name', 'equalto', model_name) | list)[0] %}
    {% set relation = adapter.get_relation(
        database=model_node.database, schema=model_node.schema, identifier=model_node.alias
    ) %}

    {% if 'description' in model_node %}
        {{ log("Altering table description for " + model_name, info = true) }}
        {{ adapter.update_table_description(model_node['database'], model_node['schema'], model_node['alias'], model_node['description']) }}
    {% endif %}
    {{ log("Altering column comments for " + model_name, info = true) }}
    {{ alter_column_comment(relation, model_node.columns) }}

  {% endif %}

{% endmacro %}
