Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Document addition of incremental_predicates #2636

Closed
1 task done
jtcohen6 opened this issue Jan 2, 2023 · 1 comment · Fixed by #2681
Closed
1 task done

Document addition of incremental_predicates #2636

jtcohen6 opened this issue Jan 2, 2023 · 1 comment · Fixed by #2681
Assignees
Labels
content Improvements or additions to content dbt-core v1.4 Docs impact for the v1.4 release (Jan 2023) improvement Use this when an area of the docs needs improvement as it's currently unclear

Comments

@jtcohen6
Copy link
Collaborator

jtcohen6 commented Jan 2, 2023

Contributions

  • I have read the contribution docs, and understand what's expected of me.

Link to the page on docs.getdbt.com requiring updates

https://docs.getdbt.com/docs/build/incremental-models

This might want to go under the "About incremental_strategy section," which is really acting as our catch-all for "advanced optimizations."

What part(s) of the page would you like to see updated?

Support for a new (optional) config: incremental_predicates. This config accepts any valid SQL expression. (Note: dbt will not do any validation in advance, so it's up to the user to ensure that their SQL syntax is valid.)

This is an advanced use of incremental models, where data volume is large enough to justify additional investments in performance. It's a real case of "trust the user," where some of the abstraction that dbt offers over vendor-specific merge statements is stripped away.

For instance, this is a pattern we might expect to see on Snowflake:

models:
  - name: my_incremental_model
    config:
      materialized: incremental
      unique_key: id
      cluster_by: ['session_start']  # this will affect how the data is stored on disk, and indexed to limit scans
      incremental_strategy: merge
      # limit scan of the existing table to the last 7 days of data
      incremental_predicates: "DBT_INTERNAL_DEST.session_start > datediff(day, -7, current_date)"

This will template a merge statement like:

merge into <existing_table> DBT_INTERNAL_DEST
    from <temp_table_with_new_records> DBT_INTERNAL_SOURCE
    on
        -- unique key
        DBT_INTERNAL_DEST.id = DBT_INTERNAL_SOURCE.id
        and
        -- custom predicate: limits data scan in the "old" data / existing table
        DBT_INTERNAL_DEST.session_start > datediff(day, -7, current_date)
    when matched then update ...
    when not matched then insert ...

The user will still want to limit data scan of upstream tables, like always, within the body of their incremental model SQL, to limit the amount of "new" data being processed / transformed:

with large_source_table as (

    select * from {{ ref('large_source_table') }}
    {% if is_incremental() %}
        where session_start > dateadd(day, -3, current_date)
    {% endif %}

),

...

Notes:

  • Users do need to explicitly alias any columns with either DBT_INTERNAL_DEST ("old" data) or DBT_INTERNAL_SOURCE ("new" data)
  • There's a decent amount of conceptual overlap with the insert_overwrite incremental strategy. This whole section of the docs probably needs some more love, &/or to link to a revised version of https://discourse.getdbt.com/t/bigquery-dbt-incremental-changes/982 (which is now nearly 3 years old).

Additional information

Relevant merged PRs:

For much much more context on the performance benefits of "bounding" the scan, e.g. to take advantage of a cluster key on Snowflake:

@jtcohen6 jtcohen6 added content Improvements or additions to content improvement Use this when an area of the docs needs improvement as it's currently unclear idea Proposes an idea for new content dbt-core v1.4 Docs impact for the v1.4 release (Jan 2023) labels Jan 2, 2023
@matthewshaver matthewshaver linked a pull request Jan 9, 2023 that will close this issue
7 tasks
@moseleyi
Copy link

moseleyi commented Jun 6, 2023

@jtcohen6 can we apply Jinja conditionals in this config entry? I am trying to change the predicate depending on the target:

incremental_predicates = ["DBT_INTERNAL_DEST.timestamp > CURRENT_DATE - INTERVAL '{{ 7  IF target.name == 'production' else 2}}' DAYS'"]

Is this possible?

@runleonarun runleonarun removed the idea Proposes an idea for new content label Mar 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
content Improvements or additions to content dbt-core v1.4 Docs impact for the v1.4 release (Jan 2023) improvement Use this when an area of the docs needs improvement as it's currently unclear
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants