
set_sql_header and is_incremental #3264

Closed · yuzeh opened this issue Apr 15, 2021 · 4 comments
Labels
bug Something isn't working stale Issues that have gone stale

Comments

yuzeh commented Apr 15, 2021

Describe the bug

set_sql_header doesn't play nicely with the is_incremental macro. In particular, is_incremental seems to always evaluate to false during dbt run.

Steps To Reproduce

Minimal reproducible dbt model:

-- models/test__is_incremental__inside__set_sql_header.sql
{{
    config(
        materialized='incremental',
        unique_key="id",
        incremental_strategy="merge",
    )
}}

{% call set_sql_header(config) %}

{% if is_incremental() %}
    THIS CODEPATH SHOULD CRASH
{% else %}
    -- this is a comment
{% endif %}

{% endcall %}

select 1 as id

Expected behavior

  • Exits cleanly on the first dbt run, or when run with --full-refresh... ✔️
  • Exits with an error on subsequent dbt runs (because of the non-SQL text in the header)... ❌ (instead, it exits cleanly, since is_incremental() evaluates to False)

System information

Which database are you using dbt with?

  • [ ] postgres
  • [ ] redshift
  • [x] bigquery
  • [ ] snowflake
  • [ ] other (specify: ____________)

The output of dbt --version:

installed version: 0.18.1
   latest version: 0.19.1

Your version of dbt is out of date! You can find instructions for upgrading here:
https://docs.getdbt.com/docs/installation

Plugins:
  - postgres: 0.18.1
  - redshift: 0.18.1
  - bigquery: 0.18.1
  - snowflake: 0.18.1

The operating system you're using: Ubuntu (on WSL2)

The output of python --version: Python 3.8.5

Additional context

This is somewhat related to #2793. I had some reasonable workarounds for the other issues I ran into while putting arbitrary BigQuery SQL into set_sql_header (e.g. hardcoding table names rather than using ref), but I'm not sure how to work around is_incremental not working the way I expect.

Additionally, the output of the compile step (e.g. target/compiled/project/models/test__is_incremental__inside__set_sql_header.sql) does not include the contents of set_sql_header... is this expected?

@yuzeh yuzeh added bug Something isn't working triage labels Apr 15, 2021

jtcohen6 (Contributor) commented Apr 17, 2021

Thanks for opening this issue @yuzeh! I believe this is the same underlying issue as the one you found previously (#2793). I've got a firmer grasp on what's going on, and so I come with a few potential ideas. None is simple, but I do believe we should do something.

Basically: Because sql_header is a model config—whether set via {{ config(sql_header = "...") }} or {% call set_sql_header(config) %}—it is rendered at parse time, the rendered value is stored on the model object, and it is not re-rendered at execution time.
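
For reference, here is a paraphrased sketch of the built-in is_incremental() macro (roughly as it exists in dbt ~0.18/0.19; details may differ slightly). The relevant part is the execute guard: at parse time, execute is False, so anything rendered at parse time, including the sql_header config, always sees is_incremental() return False:

{% macro is_incremental() %}
    {#-- at parse time, `execute` is False, so this always returns False #}
    {% if not execute %}
        {{ return(False) }}
    {% else %}
        {% set relation = adapter.get_relation(this.database, this.schema, this.table) %}
        {{ return(relation is not none
                  and relation.type == 'table'
                  and model.config.materialized == 'incremental'
                  and not flags.FULL_REFRESH) }}
    {% endif %}
{% endmacro %}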

This is also the same reason why the SQL header is not included in the node's compiled_sql (or target/compiled/...): It's actually a different config, a different property of the model. It should appear in target/run/..., but that works imperfectly for any model with multiple "main" queries, i.e. incremental models.

It's clear that users want to leverage the SQL header for a mission beyond its original vocation. We initially envisioned sql_header providing teensy bits of glue code here or there (creating a UDF, altering the session timezone, or similar), rather than performing a key functional role for databases that require in-query operations, like BigQuery scripting.

With that in mind, here are the options as I see them:

  1. Stop treating SQL headers as configs, and start treating them more like model SQL. Today, materializations have access to a model's Jinja-rendered SQL via the {{ sql }} variable, and they can rest easy knowing that the rendering took place at execution time. Instead of pulling sql_header from the config object, the SQL header should be a similar (though separate!) thing: another execution-time-rendered value, {{ sql_header }}, which can be pushed by the materialization into create_table_as/create_view_as/etc macros. The downside of this approach? It would remove the ability to define the sql_header in dbt_project.yml for many models at once, since it would no longer be a config, but a bespoke property of the model.

  2. Allow SQL headers to be rendered twice: once at parse time (to store on the node object and capture any dependencies), and once at execution time. This is something hooks can do today, though the mechanism is pretty wonky (e.g. #2370, "Post hooks that call macros get parsed with execute = False"). Imagine the following model:

{{ config(
    materialized = 'incremental',
    post_hook = [
        "select 1 as fun -- '" ~ ref('model_a') ~ is_incremental() ~"'",
        "select 1 as fun -- {{ ref('model_a') }} {{ is_incremental() }}",
        ]
) }}

-- {{ is_incremental() }}

select * from {{ ref('model_a') }} 

The first hook is fully rendered at parse time, so it has the parse-time values for each. The second hook contains nested curlies, and because hooks are special, at execution time they are re-rendered—i.e., that nested Jinja is rendered for the first time. The results:

select 1 as fun -- '"jerco"."my_target_schema"."model_a"False'
select 1 as fun -- "jerco"."my_target_schema"."model_a" True

The claim here would be that SQL headers should be able to do the same nested-Jinja thing, giving the user the (confusing yet useful) ability to save some rendering for later.

  3. If we're going to make set_sql_header work like a hook, why not just make it a hook? That is, we could add a within_query: true | false property to model hooks so that, instead of running them before or after the main query in the materialization, the hook's rendered SQL would be dropped right into the main query. For backwards compatibility, dbt would still accept sql_header / set_sql_header, and immediately convert it into a pre-hook with within_query: true. (A hypothetical sketch of that config follows this list.)
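
To make option 3 concrete, a hypothetical model config under that proposal might look something like this. The dict-style hook with a "sql" key exists today, but within_query does not; it is purely a sketch of the proposed API:

{{ config(
    materialized = 'incremental',
    pre_hook = {
        "sql": "declare run_date date default current_date();",
        "within_query": true
    }
) }}

select 1 as id

dbt would re-render the hook at execution time (so ref() and is_incremental() resolve correctly) and splice the result into the main query, rather than running it as a separate statement.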

What do you think? Which of the capabilities above would you find most compelling?

@jtcohen6 jtcohen6 removed the triage label Apr 17, 2021

yuzeh (Author) commented Apr 17, 2021

Hey @jtcohen6 thank you for the very detailed response!

As for how to proceed: I'm not entirely sure, as I don't have a great sense of the other dbt features involved (for example, I don't use hooks at all in my project).

A couple of thoughts from an end-user's perspective:

  • I primarily use sql_header for two things: (a) defining temporary UDFs and (b) running BigQuery scripts, saving the result into a table named {{ this.name }}__tmp, and then running select * from {{ this.name }}__tmp as the actual model SQL. (A sketch of pattern (b) follows this list.)
  • I don't set sql_header in dbt_project.yml, though I can imagine this being useful as a feature to other folks.
  • While (3) makes sense mechanically, it doesn't seem obvious to me that a new dbt user would think of using a "hook" to inject SQL into the main query.
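
For illustration, a minimal sketch of pattern (b) above. Everything here is placeholder code (the model name, the script body, the use of a temp table); it just shows the shape of the workflow:

-- models/my_model.sql (illustrative only)
{{ config(materialized='table') }}

{% call set_sql_header(config) %}
    -- arbitrary BigQuery scripting, ending with a table the model can select from
    create or replace temp table {{ this.name }}__tmp as (
        select 1 as id
    );
{% endcall %}

select * from {{ this.name }}__tmp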

All that said, (2) seems the most straightforward solution to my current issue.


If you could advise: what's the fastest way to work around my existing issue? I've been racking my brain on this for a couple of hours at this point, and I see the following options currently:

  • Hack the incremental materialization locally to somehow "eval" the sql_header, with the idea of reverting the hack once this issue is properly resolved and my local dbt is updated. (I don't know enough Jinja to determine whether this is a viable route.)
  • Give up on trying to make these BigQuery scripts run in the main query. This has the side effect of making dbt docs impossible to run in my project, but such is life.
  • [UPDATED] A third option I just came up with: writing my model SQL in such a way that specific {% call statement %} directives don't execute if I'm simply doing dbt docs. Is there a way to get this done?

jtcohen6 (Contributor) commented:

> I primarily use sql_header for two things: (a) defining temporary UDFs and (b) running BigQuery scripts, saving the result into a table named {{ this.name }}__tmp, and then running select * from {{ this.name }}__tmp as the actual model SQL.

This is helpful to know @yuzeh!

I'd say that sql_header today is well positioned to help with (a), but it just isn't there for (b). Trying to do more than a simple declare + set feels like overloading the sql_header. Honestly, I think another answer here could be custom materializations—the BQ incremental materialization makes use of scripting for some strategies, and that feels like the proper treatment.
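
(For a sense of scale: the kind of "simple declare + set" header that fits comfortably today might look like the following; the variable name and expression are just illustrative.)

{% call set_sql_header(config) %}
    declare cutoff_date date;
    set cutoff_date = date_sub(current_date(), interval 7 day);
{% endcall %}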


As for a way to hack around this limitation in the meantime: Mike Ascah in dbt Slack came up with a not-pretty (but honestly pretty reasonable) answer that approximates the functionality in (1).

In your model SQL, add an easily identifiable delimiter (such as "__BQ_SCRIPT__") to split up your SQL: header above, model SQL below. Then, in the materialization, you can call sql.split("__BQ_SCRIPT__") and achieve something like:

{% set my_split_sql = sql.split("__BQ_SCRIPT__") %}

{{ my_split_sql[0] }};

create table ... as (
    {{ my_split_sql[1] }}
);

That would involve a simple reimplementation or adaptation of the BQ incremental materialization, or the macros it calls (such as bigquery__create_table_as). In any case, I wouldn't recommend that for the long term; but I think it's always better to employ a self-evident hack than a subtle one.
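
As a rough illustration of that adaptation, an override along these lines (the signature shown is assumed from dbt ~0.19; partitioning and clustering options are ignored, and the delimiter handling is purely a sketch, not tested code):

{% macro bigquery__create_table_as(temporary, relation, sql) %}
    {# split the rendered model SQL on the delimiter: script above, query below #}
    {% set parts = sql.split("__BQ_SCRIPT__") %}
    {% if parts | length == 2 %}
        {{ parts[0] }};
        create or replace table {{ relation }} as (
            {{ parts[1] }}
        )
    {% else %}
        create or replace table {{ relation }} as (
            {{ sql }}
        )
    {% endif %}
{% endmacro %}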

github-actions (bot) commented:

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.

@github-actions github-actions bot added the stale Issues that have gone stale label Oct 16, 2021