Initial refactoring of incremental materialization #5359

gshank · 2022-06-10T14:55:55Z

resolves #5245

Description

This is not ready for final review, but I'm putting it out here for discussion and commenting.

Checklist

I have read the contributing guide and understand what's expected of me
I have signed the CLA
I have run this code in development and it appears to resolve the stated issue
This PR includes tests, or tests are not required/relevant for this PR
I have opened an issue to add/update docs, or docs changes are not required/relevant for this PR
I have run changie new to create a changelog entry

github-actions · 2022-06-10T14:56:16Z

Thank you for your pull request! We could not find a changelog entry for this change. For details on how to document a change, see the contributing guide.

gshank · 2022-06-10T14:57:56Z

I haven't yet incorporated the most recent feedback, so we don't have adapter.dispatch and default macros for everything yet.

ChenyuLInx

This change looks good to me. Love the fact that we are breaking things down into smaller parts for folks to implement and adding more guardrails.
With this kind of change, could we try to write smaller tests that each focus on certain macros, instead of having one functional test that tests the whole function?

jtcohen6

Great work @gshank! Thanks for throwing up the draft PR, to push forward the conversation. I'm excited by the progress here. I left a bunch of comments, some of which are just "oh I get this now," others being open questions that we should look to answer before merging.

core/dbt/include/global_project/macros/materializations/models/incremental/strategies.sql

core/dbt/adapters/base/impl.py

core/dbt/include/global_project/macros/materializations/models/incremental/strategies.sql

tests/functional/artifacts/expected_manifest.py

core/dbt/include/global_project/macros/materializations/models/incremental/merge.sql

gshank · 2022-07-13T20:08:57Z

@jtcohen6 There's a difference in the way the config.get('...', default='...') works when the config key exists (built-in or "extra") and if it doesn't. The materialization code is doing config.get('incremental_strategy', default='merge') but it only sets incremental_strategy to 'merge' if the config key (and the extra key) doesn't exist. So we're not actually getting the default set from the macro config.get calls because we now have an 'incremental_strategy' config attribute.

I almost want to check for a None value to set a default, but that's a pretty big change, so I'm guessing that's not a good idea. We could change the lines to {%- set strategy = config.get("incremental_strategy") or "merge" %}
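To make the difference concrete, here is a minimal Python sketch of the lookup behavior described above. `ConfigSketch` is a hypothetical stand-in written for illustration, not dbt's actual config class; it only assumes that the `default` argument is consulted when the key is missing from both the built-in and the "extra" config.

```python
from typing import Any, Dict, Optional


class ConfigSketch:
    """Hypothetical stand-in for the Jinja `config` object (not dbt's real class)."""

    def __init__(self, builtin: Dict[str, Any], extra: Optional[Dict[str, Any]] = None):
        self.builtin = builtin
        self.extra = extra or {}

    def get(self, key: str, default: Any = None) -> Any:
        # The default applies only when the key is absent from both places.
        if key in self.builtin:
            return self.builtin[key]
        if key in self.extra:
            return self.extra[key]
        return default


# Before: incremental_strategy is not a config attribute, so the default is used.
config_before = ConfigSketch(builtin={"materialized": "incremental"})

# After: the attribute exists with value None, so the default is skipped.
config_after = ConfigSketch(
    builtin={"materialized": "incremental", "incremental_strategy": None}
)

old_result = config_before.get("incremental_strategy", default="merge")
new_result = config_after.get("incremental_strategy", default="merge")

# The `or` form falls back whenever the stored value is None (or otherwise falsy).
patched_result = config_after.get("incremental_strategy") or "merge"
```

Under these assumptions, `old_result` is `"merge"`, `new_result` is `None`, and `patched_result` is `"merge"` again, which is why the `or "merge"` form is a candidate workaround.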

In the new code in 'get_incremental_strategy_macro' I set the strategy to 'default' if it's none, but if we're only doing the minimal changes to the adapters (not including using that bit) we'll have to do something else.
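As a rough illustration of the fallback mentioned here, the Python sketch below maps a configured strategy to a macro name, using `'default'` when nothing is configured. The function name `resolve_strategy_macro_name`, its signature, and the validation rule are assumptions made for this sketch; only the `get_incremental_<strategy>_sql` naming pattern comes from the macros in this PR.

```python
from typing import Optional, Sequence


def resolve_strategy_macro_name(
    strategy: Optional[str], valid_strategies: Sequence[str]
) -> str:
    """Hypothetical helper: turn a configured strategy into a macro name."""
    # An unset strategy falls back to the adapter's 'default' strategy macro.
    if strategy is None:
        strategy = "default"

    # Assumed guardrail: reject strategies the adapter does not declare as valid.
    if strategy != "default" and strategy not in valid_strategies:
        raise ValueError(f"incremental strategy '{strategy}' is not valid for this adapter")

    # 'delete+insert' is assumed to map to 'delete_insert' in the macro name.
    return f"get_incremental_{strategy.replace('+', '_')}_sql"


unset_name = resolve_strategy_macro_name(None, ["append"])
explicit_name = resolve_strategy_macro_name("delete+insert", ["append", "delete+insert"])
```

The adapter-specific choice then lives entirely in whichever macro implements `get_incremental_default_sql` for that adapter.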

jtcohen6 · 2022-07-14T11:35:30Z

@gshank I see what you mean.

I'd be in favor of making a few small changes in each adapter to get this PR merge-able, if it enables us to have a better implementation here, and avoid contorting ourselves into a harder-to-reason-about logical flow.

Following the design / conversation in #5245, I think all that will require us to do, in each adapter, is define a macro default__get_incremental_default_sql, returning the macro which should (for that adapter) be used as the default strategy. So in effect:

{# This part goes in dbt-core #}

{% macro get_default_incremental_sql(arg_dict) %}
  {{ return(adapter.dispatch('get_default_incremental_sql', 'dbt')(arg_dict)) }}
{% endmacro %}

{#
  -- The first 'default' here means for the default adapter,
  -- the second 'default' means for the default incremental strategy (if not otherwise specified)
#}
{% macro default__get_default_incremental_sql(arg_dict) %}
  {{ return(get_append_incremental_sql(arg_dict)) }}
{% endmacro %}

{# This macro goes in dbt-postgres #}

{% macro postgres__get_default_incremental_sql(arg_dict) %}
  {{ return(get_delete_insert_incremental_sql(arg_dict)) }}
{% endmacro %}

{# This macro goes in dbt-snowflake #}

{% macro snowflake__get_default_incremental_sql(arg_dict) %}
  {{ return(get_merge_incremental_sql(arg_dict)) }}
{% endmacro %}

{# This macro goes in dbt-bigquery #}

{% macro bigquery__get_default_incremental_sql(arg_dict) %}
  {{ return(get_merge_incremental_sql(arg_dict)) }}
{% endmacro %}
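The lookup order that the macros above rely on can be sketched in Python. This is a simplified model, assuming dispatch tries the `<adapter>__` prefix first and then `default__`; it ignores adapter inheritance and package search order, and the `dispatch` function and `MACROS` registry are hypothetical names used only for illustration.

```python
from typing import Callable, Dict

Macro = Callable[[dict], str]


def dispatch(macros: Dict[str, Macro], macro_name: str, adapter_type: str) -> Macro:
    """Simplified model of prefix-based dispatch (assumed order: adapter, then default)."""
    for prefix in (adapter_type, "default"):
        candidate = f"{prefix}__{macro_name}"
        if candidate in macros:
            return macros[candidate]
    raise KeyError(f"no implementation of '{macro_name}' for adapter '{adapter_type}'")


# Each entry returns the name of the strategy it would delegate to.
MACROS: Dict[str, Macro] = {
    "default__get_default_incremental_sql": lambda arg_dict: "append",
    "postgres__get_default_incremental_sql": lambda arg_dict: "delete+insert",
    "snowflake__get_default_incremental_sql": lambda arg_dict: "merge",
}

postgres_default = dispatch(MACROS, "get_default_incremental_sql", "postgres")({})
snowflake_default = dispatch(MACROS, "get_default_incremental_sql", "snowflake")({})
# An adapter without its own macro falls through to the default__ implementation.
fallback_default = dispatch(MACROS, "get_default_incremental_sql", "some_new_adapter")({})
```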

gshank · 2022-07-19T04:02:43Z

I'm wondering if putting the args in the strategy_arg_dict actually serves any real purpose? It's not clear to me what it gives us that simply passing in the args doesn't. If users have a custom materialization, unless they copy-and-paste the whole incremental materialization, I assume they will be pulling in any new args from config in their custom code, right?

It's still not clear to me why we want the sql_header in bigquery and only in bigquery. As far as I can tell, none of the other adapters include the sql_header, since "include_sql_header" is set to false everywhere except that one place in bigquery.

I think that in order to use the default__get_incremental_default_sql in the previous comment, we'd have to pretty much implement all the changes in this pull request.

Redshift works fine with no changes on top of dbt-core. Snowflake works with just the tweak to the config.get of the incremental_strategy. Spark works with the tweak to the config.get call. BigQuery works, except that for testing purposes I removed the sql_header thing and everything else works fine except for the test that explicitly checks for the sql header.

jtcohen6 · 2022-07-19T10:31:16Z

I'm wondering if putting the args in the strategy_arg_dict actually serves any real purpose? It's not clear to me what it gives us that simply passing in the args doesn't.

Our thinking was that each of these strategy macros takes a slightly different set of arguments today, and will likely continue to take different arguments in the future. We want a single place in the incremental materialization where we call the strategy macro, passing all possible arguments in.
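A small Python sketch of that idea follows: every strategy accepts the same dictionary and reads only the keys it needs, so the materialization has one call site. The key names `target_relation` and `temp_relation`, the `STRATEGIES` registry, and the generated SQL are assumptions for illustration; only `unique_key` appears as an `arg_dict` key in this PR.

```python
from typing import Any, Callable, Dict

ArgDict = Dict[str, Any]


def get_incremental_append_sql(arg_dict: ArgDict) -> str:
    # Append needs only the two relations and ignores unique_key.
    return (
        f"insert into {arg_dict['target_relation']} "
        f"select * from {arg_dict['temp_relation']}"
    )


def get_incremental_delete_insert_sql(arg_dict: ArgDict) -> str:
    # Delete+insert additionally reads unique_key from the same dictionary.
    target = arg_dict["target_relation"]
    temp = arg_dict["temp_relation"]
    key = arg_dict["unique_key"]
    delete = f"delete from {target} where {key} in (select {key} from {temp})"
    return f"{delete};\n{get_incremental_append_sql(arg_dict)}"


STRATEGIES: Dict[str, Callable[[ArgDict], str]] = {
    "append": get_incremental_append_sql,
    "delete+insert": get_incremental_delete_insert_sql,
}


def build_incremental_sql(strategy: str, arg_dict: ArgDict) -> str:
    # The single call site: same arguments regardless of which strategy runs.
    return STRATEGIES[strategy](arg_dict)


arg_dict = {
    "target_relation": "analytics.orders",
    "temp_relation": "orders__tmp",
    "unique_key": "order_id",
}
append_sql = build_incremental_sql("append", arg_dict)
delete_insert_sql = build_incremental_sql("delete+insert", arg_dict)
```

Adding an argument for a future strategy then means adding a key to the dictionary, without changing the signature of every existing strategy.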

If users have a custom materialization, unless they copy-and-paste the whole incremental materialization, I assume they will be pulling in any new args from config in their custom code, right?

Yes, this sounds right to me.

It's still not clear to me why we want the sql_header in bigquery and only in bigquery. As far as I can tell, none of the other adapters include the sql_header, since "include_sql_header" is set to false everywhere except that one place in bigquery.

Every other adapter (and sometimes BigQuery) includes sql_header in the create table as statement. It's only in BigQuery that we need it in the merge statement, and only in the case of insert_overwrite with "static" partitions. I have an idea of how we might be able to remove this—creating views to be dropped later, rather than interpolating the model's {{ sql }} itself directly into the merge—though it would require some futzing with the code, and I'd need a bit of time to test my hypothesis.

I think that in order to use the default__get_incremental_default_sql in the previous comment, we'd have to pretty much implement all the changes in this pull request.

Is that mainly because of the arg_dict being passed in? We'd need to unpack the arguments to match the ones currently being expected by each adapter's version of get_merge_sql/get_delete_insert_merge_sql/get_insert_overwrite_merge_sql?

gshank · 2022-07-19T13:30:31Z

For BigQuery, maybe we could set an extra config? It would be nice to not have to pass around an extra parameter just for that one case.

As far as using default__get_incremental_default_sql, the code just isn't set up such that we can just insert that one bit. I looked at putting it in and it just doesn't fit. We'd need to pull in almost all of the new macros. I can try doing that... What's the timeframe here?

jtcohen6

Feels like we're very close! One blocking/functional comment, and a few nit picks around the edges. In all, though, I'm excited to get this foundation laid, so we can reap the fruits of it in adapter repos. Thanks as well for pushing that work a bit further in dbt-snowflake, as demonstration.

Looking forward to talking through it all in a few minutes

jtcohen6 · 2022-07-20T13:31:54Z

plugins/postgres/dbt/include/postgres/macros/materializations/incremental_strategies.sql

+{% macro postgres__get_incremental_default_sql(arg_dict) %}
+
+  {% if arg_dict["unique_key"] %}
+    {% do return(get_incremental_delete_insert_sql(arg_dict)) %}
+  {% else %}
+    {% do return(get_incremental_append_sql(arg_dict)) %}
+  {% endif %}
+
+{% endmacro %}


Love to see this!!

jtcohen6 · 2022-07-20T13:33:48Z

core/dbt/include/global_project/macros/materializations/models/incremental/strategies.sql

+{% macro default__get_incremental_default_sql(arg_dict) %}
+
+  {% do return(get_incremental_append_sql(arg_dict)) %}
+
+{% endmacro %}


This represents an implicit breaking change for maintainers of existing adapter plugins who use the default incremental materialization:

Previously, the (implicit) default was delete+insert

Now, the explicit default is append

I think it's a good change! It's just one we'll want to document very clearly

(cc @dataders)

Opened a docs issue: dbt-labs/docs.getdbt.com#1761

The changes in incremental.sql made me wonder the same thing about default behavior.

core/dbt/include/global_project/macros/materializations/models/incremental/merge.sql

core/dbt/adapters/base/impl.py

jtcohen6

Nice work resolving that dbt-snowflake failure! And thanks for addressing most of my outstanding comments on the PR.

I see there's one remaining failure in dbt-bigquery tests, I think related to the matter of include_sql_header. Pending that resolution, this is good to go from a functional perspective / my point of view. Another engineer should still take a look for code review.

core/dbt/adapters/base/impl.py

jtcohen6 · 2022-07-21T09:45:47Z

core/dbt/include/global_project/macros/materializations/models/incremental/strategies.sql

+{% macro default__get_incremental_default_sql(arg_dict) %}
+
+  {% do return(get_incremental_append_sql(arg_dict)) %}
+
+{% endmacro %}


Opened a docs issue: dbt-labs/docs.getdbt.com#1761

gshank · 2022-07-21T17:39:40Z

I didn't find any good way to get include_sql_header without a param, so I put the param back. You can't set config values at runtime...

ChenyuLInx

LGTM! There's no impact on the incremental change for Python models, as we just reuse the same logic here.

jtcohen6

Nice work wrapping this up!

### Description Avoids using default as a temporary fix for dbt-labs/dbt-core#5359. This is a temporary fix and dbt-labs/dbt-spark#394 should be ported later.

### Description Applies "Initial refactoring of incremental materialization" (dbt-labs/dbt-core#5359). Now it uses `adapter.get_incremental_strategy_macro` instead of dbt-spark's `dbt_spark_get_incremental_sql` macro to dispatch the incremental strategy macro. The overwritten `dbt_spark_get_incremental_sql` macro will not work anymore. Co-authored-by: allisonwang-db <allison.wang@databricks.com>

dataders · 2022-10-10T16:44:51Z

core/dbt/contracts/graph/model_config.py

@@ -433,6 +433,7 @@ class NodeConfig(NodeAndTestConfig):
    # Note: if any new fields are added with MergeBehavior, also update the
    # 'mergebehavior' dictionary
    materialized: str = "view"
+    incremental_strategy: Optional[str] = None


@gshank @nathaniel-may does this now mean that the value of incremental_strategy is now in the manifest.json when it wasn't before? or does this add it to the python context such that it is accessible as an attribute of model as in model.incremental_strategy?

It means that it should be in the manifest.json. It's accessible like other config keys, but the behavior is a bit different for builtin attributes than for adhoc attributes, in that setting defaults doesn't work the same way.

gshank requested a review from a team June 10, 2022 14:55

gshank requested review from a team as code owners June 10, 2022 14:55

gshank requested review from ChenyuLInx, VersusFacit and nathaniel-may June 10, 2022 14:55

gshank marked this pull request as draft June 10, 2022 14:56

cla-bot bot added the cla:yes label Jun 10, 2022

ChenyuLInx reviewed Jun 10, 2022

jtcohen6 reviewed Jun 13, 2022

gshank added 7 commits June 27, 2022 14:17

Initial refactoring of incremental materialization

0aa213f

Changie

0dc34fd

Add adapter.dispatch calls to new macros

3b96f9e

update default to check unique_id and use delete_insert strategy

98f7b08

Add code to check valid incremental strategies

6ac14be

Create postgres default strategy macro

e4796a9

Merge branch 'main' into ct-646-incremental_refactor

afd347d

gshank force-pushed the ct-646-incremental_refactor branch from 5a8ad92 to afd347d Compare July 13, 2022 13:40

Set 'default' for incremental_strategy

ae691c0

gshank added 2 commits July 19, 2022 12:25

Uncomment sql_header and use config to get include_sql_header

ee002a9

remove stray method

e9e811e

jtcohen6 requested changes Jul 20, 2022

gshank added 3 commits July 20, 2022 17:25

Merge branch 'main' into ct-646-incremental_refactor

283c384

Pass model context to get_incremental_strategy_macro method

c571130

Various tweaks and comments

ea7038a

jtcohen6 mentioned this pull request Jul 21, 2022

New components for incremental strategies dbt-labs/docs.getdbt.com#1761

Open

1 task

jtcohen6 reviewed Jul 21, 2022

gshank added 2 commits July 21, 2022 09:17

Change default valid_incremental_strategies in base and postgres

69d0e8f

Put back include_sql_header param

1bea247

gshank marked this pull request as ready for review July 21, 2022 17:41

ChenyuLInx approved these changes Jul 21, 2022

jtcohen6 approved these changes Jul 21, 2022

gshank merged commit 2548ba9 into main Jul 21, 2022

gshank deleted the ct-646-incremental_refactor branch July 21, 2022 18:11

ueshin added a commit to ueshin/dbt-databricks that referenced this pull request Jul 22, 2022

Avoid using default for a temporary fix of dbt-labs/dbt-core#5359.

216379d

ueshin mentioned this pull request Jul 22, 2022

Avoid using default as a temporary fix for dbt-labs/dbt-core#5359. databricks/dbt-databricks#136

Merged

ueshin mentioned this pull request Aug 3, 2022

Apply "Initial refactoring of incremental materialization" databricks/dbt-databricks#148

Merged

agoblet pushed a commit to BigDataRepublic/dbt-core that referenced this pull request Sep 16, 2022

Refactoring of incremental materialization (dbt-labs#5359)

6fa2d65

dataders reviewed Oct 10, 2022

jlarue26 mentioned this pull request Oct 24, 2022

dbt-core 1.3 upgrade: Incremental mats: more standard and more error-proof dremio/dbt-dremio#44

Closed

jtcohen6 mentioned this pull request Nov 1, 2022

[CT-1457] Move dbt-postgres into its own repository #6189

Closed

dbeatty10 mentioned this pull request Jan 26, 2023

[CT-1880] [Feature] Enable "merge" incremental strategy for postgres 15 #6696

Closed

3 tasks
