
[CT-646] [Proposal] Streamline Incremental Strategies #5245

Closed
11 tasks
Tracked by #9290
nathaniel-may opened this issue May 13, 2022 · 16 comments · Fixed by #5359


nathaniel-may commented May 13, 2022

When defining an incremental model, there are several incremental strategies that can be used to execute it (namely "append," "delete+insert," "merge," and "insert_overwrite"), but not all of them are supported by every warehouse. Working with these incremental strategies as an adapter maintainer or as an advanced user is difficult today. This ticket aims to improve the experience for both groups.

Functional Requirements

  1. Adapter maintainers need to be able to specify which incremental strategies are supported by the warehouse, and to supply the correct SQL statement to execute when the default is not sufficient.
  2. Advanced users that wish to specify a custom incremental strategy must be able to do so.

Behavior Today

  1. To specify which incremental strategies are supported by the warehouse, the adapter maintainer must write a macro that returns a list of strings naming the supported strategies, and include the Jinja macro adapter__get_merge_sql, which takes a fixed set of parameters and returns the materialization SQL. This macro is often copy-pasted from core, and editing it is error-prone.
  2. Advanced dbt users who wish to specify a custom incremental strategy must override the same boilerplate Jinja macro by copy-pasting it into their dbt project.
  3. When an unsupported incremental strategy is specified, the user is notified via an exception during the first dbt run.

Desired Behavior

  1. To specify which incremental strategies are supported by the warehouse, the adapter maintainer must write a macro that returns a list of strings naming the supported strategies. If the warehouse only supports a subset of the default incremental strategies and the default implementation in the global project produces SQL that works on the target warehouse, no additional work needs to be done. If any of the default behavior needs to be overridden, it can be done by writing a macro that conforms to the naming convention "get_incremental_NAME_sql", where "NAME" is the string used to configure the incremental strategy (e.g. "get_incremental_append_sql"). If the warehouse has an additional incremental strategy, defining a new macro using the same "get_incremental_NAME_sql" convention will make it available to users of the adapter.
  2. Advanced dbt users who wish to specify a custom incremental strategy will only need to create a macro that conforms to the naming convention "get_incremental_NAME_sql" and produces the correct SQL for the target warehouse (see the sketch after this list).
  3. When an unsupported incremental strategy is designated, raise an error to users before runtime.
  4. Existing projects will experience no breaking changes when they upgrade to the version of dbt that includes this new behavior.
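
For example, under this proposal a user-defined strategy named "insert_only" would only require a macro like the one below. This is a minimal sketch: the strategy name and the arg_dict keys (target_relation, temp_relation) are illustrative assumptions, not part of the final interface.

```sql
{% macro get_incremental_insert_only_sql(arg_dict) %}
    {#- hypothetical custom strategy; the keys in arg_dict are assumed -#}
    {%- set target = arg_dict["target_relation"] -%}
    {%- set source = arg_dict["temp_relation"] -%}

    insert into {{ target }}
    select * from {{ source }}
{% endmacro %}
```

A model would then opt in with incremental_strategy='insert_only' in its config.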

Implementation

  • For backwards compatibility, create new macros in core/dbt/include/global_project/macros/materializations/models/incremental that follow the naming convention for the existing incremental strategies and pass through to the existing macros (see the sketch after this list). The new macros will take a dictionary, and pass the key-value pairs explicitly to the wrapped macro call to return the SQL string that executes the materialization. Concretely:
    • create a new macro get_insert_into_sql() -> str that returns the "insert into" DML
    • create a new macro get_incremental_append_sql(dict) that wraps get_insert_into_sql
    • create a new macro get_incremental_delete_insert_sql(dict) that wraps get_delete_insert_merge_sql
    • create a new macro get_incremental_merge_sql(dict) that wraps get_merge_sql
    • create a new macro get_incremental_insert_overwrite_sql(dict) that wraps get_insert_overwrite_sql
    • create a new macro get_incremental_default_sql(dict) that wraps get_incremental_append_sql. This is where adapters can override the default incremental strategy when append isn't appropriate.
    • Note: this dictionary will always pass the same key-value pairs from core. The dictionary argument allows core to add new key-value pairs in the future without breaking existing adapters. It is very possible that a custom incremental strategy may require different parameters than the ones core provides. These can be added by users via the context-global config, i.e. config.get('...').
  • Add incremental_strategy to NodeConfig with a default value of "default".
  • Create a new Python method on core/dbt/adapters/sql/impl.py::AdapterConfig called validate_incremental_strategy that takes no parameters and returns a list of strings. Adapter maintainers are expected to implement this so that the returned strings name the supported incremental strategies. This could have been a macro, but it's much easier for core to call this as a Python function to validate incremental strategies in projects. (TODO: what if an adapter supports a different set of incremental strategies for SQL and Python models? Where would they indicate that?)
  • Write a new Python function get_incremental_strategy_macro that takes a string and returns the macro from the context (e.g. "insert_overwrite" -> callable get_incremental_insert_overwrite_sql). This function will call validate_incremental_strategy and throw an exception if the requested strategy is not returned from validate_incremental_strategy or if it does not exist in the context. This function will need to be called in the materialization macro for models.
  • create a "functional" test for each incremental strategy in tests/adapter including tests for several configs for each strategy. This will allow each adapter to import the tests it needs as well as make their own as needed. (Thanks for pointing this requirement out, @dataders)
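
A sketch of the wrapper pattern described in the first bullet, assuming arg_dict carries keys like target_relation, temp_relation, and dest_columns (the exact key set is still to be decided):

```sql
{% macro get_incremental_append_sql(arg_dict) %}
    {#- unpack the dict and delegate to the existing insert logic; -#}
    {#- get_insert_into_sql's signature here is an assumption -#}
    {% do return(get_insert_into_sql(arg_dict["target_relation"],
                                     arg_dict["temp_relation"],
                                     arg_dict["dest_columns"])) %}
{% endmacro %}

{% macro get_incremental_default_sql(arg_dict) %}
    {#- adapters override this when append is not the right default -#}
    {% do return(get_incremental_append_sql(arg_dict)) %}
{% endmacro %}
```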

Docs
This will require a significant change to the documentation on incremental materializations.

@nathaniel-may nathaniel-may added the enhancement New feature or request label May 13, 2022
@github-actions github-actions bot changed the title [Draft] Streamline Incremental Strategies [CT-646] [Draft] Streamline Incremental Strategies May 13, 2022

jtcohen6 commented May 16, 2022

Update: everything in this comment has been resolved, and incorporated into the main issue

@nathaniel-may This is really clearly stated, and I think you captured the many nuances well!

TODO name the macro

We were discussing validate_incremental_strategy, and whether this may want to be an adapter (Python) method rather than a macro. It would work similarly to "snapshot strategy dispatch", but we agreed the Jinja here is pretty jank.

TODO should this actually be a compile check?

I think this may get tricky with partial parsing: We'd need to recalculate any time a new incremental_strategy is defined, and any time a macro named get_incremental_X_sql is created/deleted. While it's definitely preferable to raise this error before the model's actual runtime, it would be worse to cause a buggy experience.

It's also possible that users completely override the incremental materialization—a thing we want them to need to do less of, but still—in which case, the validate_incremental_strategy macro/method may not be used at all.

TODO CONTINUE THIS LIST

Other things I recall from our conversation:

  • The get_incremental_X_sql macros should all have the same macro (function) signature, but they may need different sorts of arguments to be passed in. Rather than developing macro signatures of 10+ arguments, and requiring all macros to evolve their signature when any one of them does, we might use a single dictionary argument instead. The materialization would pass this dictionary into the macro, and the macro would return a string (the templated SQL to be run).
  • If a user defines their own custom strategy that requires its own custom config, they can just pull that custom config into get_incremental_X_sql via the context-global config.get('...') (see the sketch after this list). That's not a good habit for us, as maintainers, but it's a valuable escape hatch for users, and avoids the need to copy-paste-edit the whole incremental materialization in their projects.
  • There are several other complexities with the incremental materialization (full refresh, on_schema_change, whether to create a temp table or not), and this change won't solve for all of them. It should enable us to delete a bunch of boilerplate code, though, and clear the way for examining those other discrepancies that exist today.
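
As a sketch of that escape hatch: the silly strategy, its config key, and the dateadd syntax below are all hypothetical, but config.get is the existing context-global.

```sql
{% macro get_incremental_silly_sql(arg_dict) %}
    {#- hypothetical strategy pulling its own setting from the model config -#}
    {%- set lookback_days = config.get('silly_lookback_days', default=7) -%}

    delete from {{ arg_dict["target_relation"] }}
    where updated_at >= dateadd(day, -{{ lookback_days }}, current_date);

    insert into {{ arg_dict["target_relation"] }}
    select * from {{ arg_dict["temp_relation"] }}
{% endmacro %}
```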

@jtcohen6 jtcohen6 added the incremental Incremental modeling with dbt label May 16, 2022
@dataders

One quick thought as a requirement; will come back to this later.

  • there should be a "functional" test for at least each incremental strategy in core/tests (arguably each possible config of each strategy), from which each adapter can import the tests it needs (as well as make their own as needed).

@jtcohen6

TODO what is the default implementation?

The lowest-level default implementation should be append (as defined in dbt-core). That should also be the default when no unique_key is specified. For backwards compatibility with current behavior, each adapter should be able to specify its default strategy when a unique_key is supplied, namely delete+insert on Postgres/Redshift and merge on Snowflake/BigQuery/Databricks.
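
For reference, this is how a model selects a strategy today; only the incremental_strategy value would change per adapter default (the model and column names are illustrative):

```sql
{{ config(
    materialized='incremental',
    incremental_strategy='delete+insert',
    unique_key='id'
) }}

select * from {{ ref('stg_events') }}
{% if is_incremental() %}
  where event_time > (select max(event_time) from {{ this }})
{% endif %}
```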

@nathaniel-may

Top-level ticket description reflects the previous conversation.

@nathaniel-may nathaniel-may changed the title [CT-646] [Draft] Streamline Incremental Strategies [CT-646] [Proposal] Streamline Incremental Strategies Jun 1, 2022

gshank commented Jun 2, 2022

A couple of nits: I think the name "get_incremental_delete+insert_sql(dict)" would need to use an underscore instead of a plus. I believe that the adapter tests should go in the adapter zone (tests/adapter) and not in core/tests. I think that's what was intended from the wording.


gshank commented Jun 2, 2022

I'm not finding get_insert_into_sql or get_incremental_append_sql.

@nathaniel-may

  • Yeah the + isn't going to work. I'll change that to an _.

  • Yeah, the adapter zone sounds like the right place for these tests. I'll change that as well.

  • get_incremental_append_sql is one of the macros that needs to be created. It just gets wrapped by another new "default" function in core.

  • get_insert_into_sql def isn't right though. I bet it was supposed to be get_merge_sql where predicates=None from merge.sql. Does that sound right to you?


jtcohen6 commented Jun 3, 2022

get_insert_into_sql is the cleanest way to represent true "append" behavior, when a unique_key isn't specified. This is an attempt to clean up a few different things that exist today:

  • get_insert_into_sql, defined in dbt-spark only. This is explicit "append" behavior on Spark today.
  • get_delete_insert_merge_sql when no unique_key is specified. This is implicit "append" behavior on Postgres + Redshift + Snowflake (!!) today.
  • get_merge_sql, when no unique_key (or other custom predicates) are specified, as it ends up using ON FALSE, a.k.a. constant false predicate. Turns out, that achieves the same result (= insert only, no update/delete). This is implicit "append" behavior on BigQuery + Databricks (Delta) today.

Why (!!) for Snowflake? Even though Snowflake uses the merge strategy by default, its MERGE statement does NOT support a constant false predicate, so we've hard-coded insert into here!

Since all of these databases support good-old-fashioned insert into, I'd like to make that the explicit behavior of all of them when the append strategy is specified, i.e. the standard behavior of get_incremental_append_sql. Doesn't mean that append should be the default, and we definitely shouldn't break the current behavior (implicit append) when folks specify delete+insert or merge without a unique_key specified, but it's the direction we should move in.
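
For concreteness, the explicit append SQL could look something like the sketch below, modeled loosely on dbt-spark's get_insert_into_sql. The exact signature is an assumption; get_quoted_csv is an existing helper in dbt-core's global project.

```sql
{% macro get_insert_into_sql(target_relation, temp_relation, dest_columns) %}
    {#- build a quoted, comma-separated column list from the destination columns -#}
    {%- set dest_cols_csv = get_quoted_csv(dest_columns | map(attribute="name")) -%}

    insert into {{ target_relation }} ({{ dest_cols_csv }})
    select {{ dest_cols_csv }}
    from {{ temp_relation }}
{% endmacro %}
```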

Happy to spend more time talking about this, since it's pretty tricky stuff!

@nathaniel-may

updated description to add the new macro get_insert_into_sql.


gshank commented Jun 8, 2022

Which incremental strategies are we supporting in postgres/redshift? If we're implementing this in dbt-core, does that mean that we support them all in postgres?

Currently:

  • Snowflake: merge, delete+insert
  • BigQuery: merge, insert_overwrite
  • Spark: append, insert_overwrite, merge
  • Redshift: append
  • Postgres: append

I'm unclear on what the point is of returning a list of the supported strategies, when an additional strategy can be implemented simply by creating a macro with the right name. The description above states behavior that would throw an exception if there's a 'get_incremental_silly_sql' and 'silly' is not returned by 'validate_incremental_strategy'. And since that's described as a python function, it's not like people could replace it when they add a new strategy.


jtcohen6 commented Jun 8, 2022

@gshank Fair question. We want to do two things:

  • Allow each adapter maintainer to declare which strategies are supported out-of-the-box for that adapter, by returning the list of strings reflecting those strategy names
  • Establish a common pattern that makes it easy for adapter maintainers / end users to register a new strategy (e.g. silly), and a macro to use for that strategy (get_incremental_silly_sql)

Postgres/Redshift are only able to support append + delete+insert today — but we should still aim to define the reusable logic for the other strategies in dbt-core's default global_project, to the extent possible.

Today:

  • Snowflake: append (implicit), merge (default), delete+insert
  • BigQuery: append (implicit), merge (default), insert_overwrite
  • Spark: append (explicit, default), insert_overwrite, merge
  • Redshift: append (implicit), delete+insert (default)
  • Postgres: append (implicit), delete+insert (default)


gshank commented Jun 8, 2022

Regarding this statement: "When an unsupported incremental strategy is designated, raise an error to users before runtime." if an end user has added a new incremental strategy, how would it pass the error checking? What would "register a new strategy" look like?

Regarding this statement: "If any of the default behavior needs to be overridden, it can be done by writing a macro that conforms to the naming convention "get_incremental_NAME_sql" where "NAME" is the string used to configure the incremental strategy (e.g. "get incremental_append_sql")." I'm wondering how this interacts with adapter.dispatch. There are three macros that do adapter.dispatch now: get_merge_sql, get_delete_insert_merge_sql, get_insert_overwrite_merge_sql. The only non-default implementation of these macros is spark__get_merge_sql, but I presume that other external adapters might have implemented non-default versions. So we're leaving these adapter.dispatch calls in for compatibility, but the new method of overriding is to actually re-implement the get_incremental_NAME_sql and rely on macro project precedence to resolve to the right one (via the get_incremental_strategy_macro method)?

Are we going to be rewriting our in-house adapters to use the new arrangements?


gshank commented Jun 8, 2022

The special "materialization" macros are mostly copied from the base incremental materialization. They don't inherit. We do need to support the existing ones. What are you thinking we will do to reduce copy-and-pasting? Split up the base materialization into sub-macros?


gshank commented Jun 8, 2022

In order to create tests in the adapter zone for all of the incremental strategies, we will either need to be able to run the tests in Postgres or mark them to skip.


gshank commented Jun 9, 2022

@jtcohen6 What is the purpose of the "predicates" parameter on the insert_overwrite_merge_sql? I can't see that it's ever set anywhere. So it can't be passed in to the macro.

Also, 'include_sql_header' is only set in one place, for BigQuery, and it's passed as a keyword parameter, which is not something we can preserve. What was the purpose of that?

@jtcohen6

@gshank Great questions!

"When an unsupported incremental strategy is designated, raise an error to users before runtime." if an end user has added a new incremental strategy, how would it pass the error checking? What would "register a new strategy" look like?

I'm not strongly opinionated about the exact implementation details here.

It could look like us raising a helpful error if the user has defined a model with incremental_strategy: silly, and there is no get_incremental_silly_sql macro defined. If the macro does exist, it's assumed to be supported.

We need to reconcile that with the need for each adapter plugin to be able to declare which incremental strategies it does and doesn't support out-of-the-box. That could be an argument in favor of making that a macro (user-space code), rather than an adapter method (Python). Then, end users who add a new silly strategy could override the macro and return a new set of strings:

{% macro validate_incremental_strategies() %}
  {{ return(['append', 'delete+insert', 'silly']) }}
{% endmacro %}

So we're leaving these adapter.dispatch calls in for compatibility, but the new method of overriding is to actually re-implement the get_incremental_NAME_sql and rely on macro project precedence to resolve to the right one (via the get_incremental_strategy_macro method)?

I think each of these get_incremental_NAME_sql macros should be dispatched, and have a default__ version defined in dbt-core. That allows adapters to implement just a macro named adapter__get_incremental_NAME_sql, in cases where their SQL differs. E.g. spark__get_incremental_merge_sql.
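
In other words, the layering might look like this (a sketch; get_merge_sql's parameter list here mirrors the existing macro but is an assumption):

```sql
{% macro get_incremental_merge_sql(arg_dict) %}
    {#- dispatch so adapters can supply adapter__get_incremental_merge_sql -#}
    {{ return(adapter.dispatch('get_incremental_merge_sql', 'dbt')(arg_dict)) }}
{% endmacro %}

{% macro default__get_incremental_merge_sql(arg_dict) %}
    {#- default implementation wraps the existing merge macro -#}
    {{ return(get_merge_sql(arg_dict["target_relation"],
                            arg_dict["temp_relation"],
                            arg_dict["unique_key"],
                            arg_dict["dest_columns"])) }}
{% endmacro %}
```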

Are we going to be rewriting our in-house adapters to use the new arrangements?

Yes! The existing adapter code should all still run successfully regardless of the changes we make in dbt-core — this will be a good sanity check of backwards compatibility. But the goal here is to delete / consolidate lots of copy-pasted code, so we can only actually capture that value with follow-up refactoring work in the adapters.

The special "materialization" macros are mostly copied from the base incremental materialization. They don't inherit. We do need to support the existing ones. What are you thinking we will do to reduce copy-and-pasting? Split up the base materialization into sub-macros?

They're 90% copy-pasted code. Materializations end up being quite redundant (a) across adapters, and (b) even between materialization types. We should always be striving to identify where they frequently differ, and move those pieces of logic into separate standalone macros. Then, each adapter can reimplement just the piece that actually needs to be different.

In order to create tests in the adapter zone for all of the incremental strategies, we will either need to be able to run the tests in Postgres or mark them to skip.

I think marking them to skip is the right approach. We could also just define the Base test classes without inheriting them into a Test-named class.

What is the purpose of the "predicates" parameter on the insert_overwrite_merge_sql? I can't see that it's ever set anywhere. So it can't be passed in to the macro.

There's some context for this, and an initial implementation, in #4546. That gives the big idea pretty well. (Even more context in #3293, and recently in dbt-labs/dbt-spark#369 as well.)

The big idea: By default, incremental models find rows to update if they match the same unique_key value as rows in the new data. That unique_key match is the only "predicate" we really support today. There are legitimate cases (usually for performance reasons) where users want the ability to configure additional custom "predicates" — usually to filter scans of the existing (very large) table, e.g. only look to update rows from the past 7 days.

Practically, this looks like:

  • We support predicates as a config that's pulled in by the incremental materialization, and passed into all of these get_incremental_X_sql macros (as one attribute of the dict argument)
  • We template predicates (if defined) into the SQL/DML returned by each get_incremental_X_sql macro
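
In user space, that could look like the following. The config name predicates and the DBT_INTERNAL_DEST alias follow the proposal in #4546; the date filter itself is illustrative.

```sql
{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key='id',
    predicates=[
        "DBT_INTERNAL_DEST.session_start > dateadd(day, -7, current_date)"
    ]
) }}
```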

mirnawong1 added a commit to dbt-labs/docs.getdbt.com that referenced this issue Jan 8, 2024
[Preview](https://docs-getdbt-com-git-dbeatty-custom-incremental-d92d96-dbt-labs.vercel.app/docs/build/incremental-models#custom-strategies)

## What are you changing in this pull request and why?

This addresses the "**For end users**" portion of
#1761.

The feature request in dbt-labs/dbt-core#5245
describes the value proposition as well as the previous and new
behavior:

#### Functional Requirement
- Advanced users that wish to specify a custom incremental strategy must
be able to do so.

#### Previous behavior
- Advanced dbt users who wished to specify a custom incremental strategy
had to override the same boilerplate Jinja macro by copy-pasting it into
their dbt project.

#### New behavior
- Advanced dbt users who wish to specify a custom incremental strategy
will only need to create a macro that conforms to the naming convention
`get_incremental_NAME_sql` that produces the correct SQL for the target
warehouse.

## Also

To address the questions raised in
dbt-labs/dbt-core#8769, we also want to
document how to utilize custom incremental macros that come from a
package.

For example, to use the `merge_null_safe` custom incremental strategy
from the `example` package, first [install the
package](/build/packages#how-do-i-add-a-package-to-my-project), then add
this macro to your project:

```sql
{% macro get_incremental_merge_null_safe_sql(arg_dict) %}
    {% do return(example.get_incremental_merge_null_safe_sql(arg_dict)) %}
{% endmacro %}
```

