
[Feature] Allow "persist_docs" to be run independently #4226

Closed
1 task done
Charles1104 opened this issue Nov 8, 2021 · 8 comments
Labels
enhancement New feature or request wontfix Not a bug or out of scope for dbt-core

Comments

@Charles1104

Is there an existing feature request for this?

  • I have searched the existing issues

Describe the Feature

Currently, the "persist_docs" configuration is only applied by the "dbt run" command. The goal of this feature is to allow documentation to be persisted to the database without having to run your models. There are times when you just want to update some descriptions without re-running anything.

How should this work?

A new flag on the "dbt docs" command, for instance "--persist-docs". The following command would then persist the docs just for the selected models:

dbt docs --persist-docs --select [MODELS]

Describe alternatives you've considered

No alternatives are possible at the moment, as per the documentation of "persist_docs".

Who will this benefit?

This will benefit data analysts / analytics engineers who regularly update the documentation in the yml files and want this documentation to be persisted to their analytical database without having to re-run their models.

Are you interested in contributing this feature?

I will need some ramp-up on the codebase, but I would be thrilled to contribute.

Anything else?

No response

@Charles1104 Charles1104 added enhancement New feature or request triage labels Nov 8, 2021
@Charles1104 Charles1104 changed the title [Feature] <title> [Feature] Allow "persist_docs" to be run independently Nov 8, 2021
@balmasi

balmasi commented Nov 8, 2021

I can see a possible situation where you run dbt docs --persist-docs --select [MODELS] but your models have not been re-run, so the codebase is out of sync with the DB.

Is decoupling these two components safe?

What might the behaviour be in that situation?

@Charles1104
Author

Indeed, good point. In that case it should either throw an error or just go through for the fields that are in sync with the DB.

This is actually already the behavior of "dbt run" when your model.yml file is not in sync with the schema of your model.

@jtcohen6
Contributor

@Charles1104 Thanks for opening!

I'm inclined to agree with @balmasi, for related reasons: I don't feel great about officially decoupling running a model and persisting its descriptions as database comments. Nor do I feel great about tightly coupling docs persistence (mutative alter/comment statements) with docs generate (metadata queries, no mutation, introspection only) instead—though the former certainly benefits the latter, these are different tasks, with different goals and footprints.

That said, I do see one way to proceed with this, by emulating the work of the built-in persist_docs macro in a custom run-operation. This requires tapping into the graph context variable to pick out the model you want. Unfortunately, this won't work with actual node selection syntax, but that doesn't feel as important for the case where you're iterating on a single model's descriptions.

The code below is a sample, and very much "use at your own risk": so long as you know what you're doing, this might save your developers some time. Plus, it's fun to see what you can do with dbt, when so much of its built-in functionality is written in macros.

-- macros/persist_docs_op.sql
{% macro persist_docs_op(model_name, relation = true, columns = true) %}

  {% if execute %}
    {% set model_node = (graph.nodes.values() | selectattr('name', 'equalto', model_name) | list)[0] %}
    {% set relation = adapter.get_relation(
        database=model_node.database, schema=model_node.schema, identifier=model_node.alias
    ) %}
  
    {{ log("Altering relation comment for " + model_name, info = true) }}
    {{ run_query(alter_relation_comment(relation, model_node.description)) }}
    {{ log("Altering column comments for " + model_name, info = true) }}
    {{ run_query(alter_column_comment(relation, model_node.columns)) }}
  
  {% endif %}

{% endmacro %}

Running:

$ dbt run-operation persist_docs_op --args 'model_name: model_a'
Running with dbt=0.21.1-rc1
Altering relation comment for model_a
Altering column comments for model_a

Checking into logs/dbt.log to confirm it ran:

2021-11-17 10:31:47.375852 (MainThread): Altering relation comment for model_a
2021-11-17 10:31:47.383144 (MainThread): Using postgres connection "macro_persist_docs_op".
2021-11-17 10:31:47.383318 (MainThread): On macro_persist_docs_op: /* {"app": "dbt", "dbt_version": "0.21.1rc1", "profile_name": "garage-postgres", "target_name": "dev", "connection_name": "macro_persist_docs_op"} */

  comment on view "jerco"."dbt_jcohen"."model_a" is $dbt_comment_literal_block$not cool model$dbt_comment_literal_block$;
2021-11-17 10:31:47.384160 (MainThread): SQL status: COMMENT in 0.00 seconds
2021-11-17 10:31:47.385003 (MainThread): Altering column comments for model_a
2021-11-17 10:31:47.389274 (MainThread): Using postgres connection "macro_persist_docs_op".
2021-11-17 10:31:47.389406 (MainThread): On macro_persist_docs_op: /* {"app": "dbt", "dbt_version": "0.21.1rc1", "profile_name": "garage-postgres", "target_name": "dev", "connection_name": "macro_persist_docs_op"} */

  comment on column "jerco"."dbt_jcohen"."model_a".id is $dbt_comment_literal_block$not cool column$dbt_comment_literal_block$;
2021-11-17 10:31:47.389865 (MainThread): SQL status: COMMENT in 0.00 seconds
2021-11-17 10:31:47.390647 (MainThread): On macro_persist_docs_op: ROLLBACK
2021-11-17 10:31:47.391006 (MainThread): On macro_persist_docs_op: Close

@jtcohen6 jtcohen6 added wontfix Not a bug or out of scope for dbt-core and removed triage labels Nov 17, 2021
@Charles1104
Author

Hi @jtcohen6,

Thank you for the answer and explanation.

Some additional context on why we would like to do this. In production, we run dbt via Airflow, and we are exploding the manifest.json into granular stages as per this Astronomer article. Then, in each Airflow stage, rather than using a Bash operator to run "dbt run", we use a BigQuery operator pointed at the compiled queries (produced by "dbt compile" in our first Airflow stage) for efficiency reasons. But this has the implication that the docs are not persisted, since it just runs the query.

I like the solution you suggest and we will use it to persist our docs in production. Thank you for sharing the code snippet.

Best regards,

Charles

@george-zubrienko

george-zubrienko commented Mar 3, 2022

Would like to breathe some life into this. I don't really understand why dbt prints the actual statements for comments to the log:

        alter table auto_replenishment_george_1.assortment change column 
            sku_number
            comment 'Unique identifier for a network route\'s sku';

But it does not add them to either the target/compiled or target/run output for the selected model. If it did just that, there would be no issue at all.
Disclaimer: everything is doable with shell, of course:

cat logs/dbt.log | grep -A4 alter > docs.sql
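A slightly more robust take on the same grep idea, as a minimal Python sketch: it scans log text for statements beginning with "comment on" or "alter table" and joins lines until the closing semicolon. The function name and the assumption that such statements appear on their own (possibly wrapped) lines are mine; dbt's log layout can and does change between releases.

```python
def extract_comment_statements(log_text):
    """Pull multi-line `comment on` / `alter table ... comment` statements
    out of dbt debug-log text. A statement ends at a semicolon."""
    statements = []
    current = None  # lines of the statement being accumulated, or None
    for line in log_text.splitlines():
        stripped = line.strip()
        if current is None:
            # Start accumulating when a statement of interest begins
            if stripped.startswith("comment on") or stripped.startswith("alter table"):
                current = [stripped]
        else:
            current.append(stripped)
        if current is not None and stripped.endswith(";"):
            statements.append(" ".join(current))
            current = None
    return statements
```

Unlike `grep -A4`, this does not assume a fixed number of continuation lines, so it also catches single-line `comment on` statements.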

@jtcohen6
Contributor

jtcohen6 commented Mar 3, 2022

@george-zubrienko Unfortunately, this is a limitation of dbt's current materialization model. Rather than including all SQL that dbt executed as part of materializing a model, target/run only includes the "main" statement that dbt runs within the materialization. This limitation, along with the subtle distinction between "compiled" and "run" SQL, is something we've discussed before in other contexts (#3222, #4227).

I've wanted to change that for a long time, so that materializations include all relevant statements in target/run. This also feels like an important precondition for being able to "dry run" dbt models (#4456).

@george-zubrienko

> @george-zubrienko Unfortunately, this is a limitation of dbt's current materialization model. Rather than including all SQL that dbt executed as part of materializing a model, target/run only includes the "main" statement that dbt runs within the materialization. This limitation, along with the subtle distinction between "compiled" and "run" SQL, is something we've discussed before in other contexts (#3222, #4227).
>
> I've wanted to change that for a long time, so that materializations include all relevant statements in target/run. This also feels like an important precondition for being able to "dry run" dbt models (#4456).

I see. This is definitely needed, because for backends like Spark it is quite crucial, for performance reasons, to be able to run SQL statements outside dbt while still having them generated by dbt. It is not about dbt's own performance, but rather the nature and volume of the data that goes through Spark environments. For example, we have a solid workload management framework sitting on top of a swarm of Spark clusters managed by k8s. I don't really need an adapter for that, just the SQL that can be executed right away. Without this feature, our dbt use is currently limited to Databricks ad-hoc development and table provisioning.

Parsing the manifest/dbt log is fine as a workaround for now; the only issue is that relying on file parsing tends to generate extra work hours on minor releases :)

However, for this to work, dbt needs to decouple itself from the target database and rely on YML schemas instead. That would be nice, because runtime type-checking on insert is the biggest bad thing about SQL.
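The manifest-parsing workaround mentioned above can be sketched in a few lines of Python: walk the nodes in a dbt manifest and emit Postgres-style comment statements from their descriptions. The keys used here (database, schema, alias, description, columns) follow the standard manifest layout, but the artifact schema varies by dbt version, so treat this as an illustration; escaping of quotes inside descriptions is deliberately not handled.

```python
import json  # used in the commented usage example below

def comment_statements_from_manifest(manifest):
    """Generate Postgres-style COMMENT statements for every model node
    in a dbt manifest dict that carries relation or column descriptions."""
    stmts = []
    for node in manifest.get("nodes", {}).values():
        if node.get("resource_type") != "model":
            continue  # skip tests, seeds, snapshots, etc.
        fqn = f'"{node["database"]}"."{node["schema"]}"."{node["alias"]}"'
        if node.get("description"):
            stmts.append(f"comment on table {fqn} is '{node['description']}';")
        for col in node.get("columns", {}).values():
            if col.get("description"):
                stmts.append(
                    f"comment on column {fqn}.{col['name']} is '{col['description']}';"
                )
    return stmts

# Typical use (path is the default dbt artifact location):
# with open("target/manifest.json") as f:
#     print("\n".join(comment_statements_from_manifest(json.load(f))))
```

Reading manifest.json instead of the log avoids depending on dbt's log format, though it still depends on the (versioned) manifest schema.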

@halvorlu

halvorlu commented Oct 25, 2022

I also think an operation to only change descriptions would be useful. My models process quite a lot of data when they run, so I would like to be able to update the descriptions independently. I managed to achieve this (with BigQuery) with a slightly modified macro inspired by @jtcohen6:

{% macro persist_docs_op(model_name, relation = true, columns = true) %}

  {% if execute %}
    {% set model_node = (graph.nodes.values() | selectattr('name', 'equalto', model_name) | list)[0] %}
    {% set relation = adapter.get_relation(
        database=model_node.database, schema=model_node.schema, identifier=model_node.alias
    ) %}

    {% if 'description' in model_node %}
        {{ log("Altering table description for " + model_name, info = true) }}
        {{ adapter.update_table_description(model_node['database'], model_node['schema'], model_node['alias'], model_node['description']) }}
    {% endif %}
    {{ log("Altering column comments for " + model_name, info = true) }}
    {{ alter_column_comment(relation, model_node.columns) }}

  {% endif %}

{% endmacro %}
