Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Initial refactoring of incremental materialization #5359

Merged
merged 15 commits into from
Jul 21, 2022

Conversation

gshank
Copy link
Contributor

@gshank gshank commented Jun 10, 2022

resolves #5245

Description

This is not ready for final review, but I'm putting it out here for discussion and commenting.

Checklist

@gshank gshank requested a review from a team June 10, 2022 14:55
@gshank gshank requested review from a team as code owners June 10, 2022 14:55
@gshank gshank marked this pull request as draft June 10, 2022 14:56
@github-actions
Copy link
Contributor

Thank you for your pull request! We could not find a changelog entry for this change. For details on how to document a change, see the contributing guide.

@cla-bot cla-bot bot added the cla:yes label Jun 10, 2022
@gshank
Copy link
Contributor Author

gshank commented Jun 10, 2022

I haven't yet incorporated the most recent feedback, so we don't have adapter.dispatch and default macros for everything yet.

Copy link
Contributor

@ChenyuLInx ChenyuLInx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change looks good to me. Love the fact that we are breaking things down into smaller parts for folks to implement and adding more guardrails.
With this kind of change, do we think we can actually try to have smaller tests focusing on certain macros vs having one functional test testing the whole function?

Copy link
Contributor

@jtcohen6 jtcohen6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work @gshank! Thanks for throwing up the draft PR, to push forward the conversation. I'm excited by the progress here. I left a bunch of comments, some of which are just "oh I get this now," others being open questions that we should look to answer before merging.

@gshank gshank force-pushed the ct-646-incremental_refactor branch from 5a8ad92 to afd347d Compare July 13, 2022 13:40
@gshank
Copy link
Contributor Author

gshank commented Jul 13, 2022

@jtcohen6 There's a difference in the way the config.get('...', default='...') works when the config key exists (built-in or "extra") and if it doesn't. The materialization code is doing config.get('incremental_strategy', default='merge') but it only sets incremental_strategy to 'merge' if the config key (and the extra key) doesn't exist. So we're not actually getting the default set from the macro config.get calls because we now have an 'incremental_strategy' config attribute.

I almost want to check for a None value to set a default, but that's a pretty big change, so I'm guessing that's not a good idea. We could change the lines to {%- set strategy = config.get("incremental_strategy") or "merge" %}

In the new code in 'get_incremental_strategy_macro' I set the strategy to 'default' if it's none, but if we're only doing the minimal changes to the adapters (not including using that bit) we'll have to do something else.

@jtcohen6
Copy link
Contributor

jtcohen6 commented Jul 14, 2022

@gshank I see what you mean.

I'd be in favor of making a few small changes in each adapter to get this PR merge-able, if it enables us to have a better implementation here, and avoid contorting ourselves into a harder-to-reason-about logical flow.

Following the design / conversation in #5245, I think all that will require us to do, in each adapter, is define a macro default__get_incremental_default_sql, returning the macro which should (for that adapter) be used as the default strategy. So in effect:

{# This part goes in dbt-core #}

{% macro get_default_incremental_sql(arg_dict) %}
  {{ return(adapter.dispatch('get_default_incremental_sql', 'dbt'))(arg_dict) }}
{% endmacro %}

{#
  -- The first 'default' here means for the default adapter,
  -- the second 'default' means for the default incremental strategy (if not otherwise specified)
#}
{% macro default__get_default_incremental_sql(arg_dict) %}
  {{ return(get_append_incremental_sql(arg_dict)) }}
{% endmacro %}

{# This macro goes in dbt-postgres #}

{% macro postgres__get_default_incremental_sql(arg_dict) %}
  {{ return(get_delete_insert_incremental_sql(arg_dict)) }}
{% endmacro %}

{# This macro goes in dbt-snowflake #}

{% macro snowflake__get_default_incremental_sql(arg_dict) %}
  {{ return(get_merge_incremental_sql(arg_dict)) }}
{% endmacro %}

{# This macro goes in dbt-bigquery #}

{% macro bigquery__get_default_incremental_sql(arg_dict) %}
  {{ return(get_merge_incremental_sql(arg_dict)) }}
{% endmacro %}

@gshank
Copy link
Contributor Author

gshank commented Jul 19, 2022

I'm wondering if putting the args in the strategy_arg_dict actually serves any real purpose? It's not clear to me what it gives us that simply passing in the args doesn't. If users have a custom materialization, unless they copy-and-paste the whole incremental materialization, I assume they will be pulling the in any new args from config in their custom code, right?

It's still not clear to me why we want the sql_header in bigquery and only in bigquery. As far as I can tell, none of the other adapters include the sql_header, since "include_sql_header" is set to false everywhere except that one place in bigquery.

I think that in order to use the default__get_incremental_default_sql in the previous comment, we'd have to pretty much implement all the changes in this pull request.

Redshift works fine with no changes on top of dbt-core. Snowflake works with just the tweak to the config.get of the incremental_strategy. Spark works with the tweak to the config.get call. BigQuery works, except that for testing purposes I removed the sql_header thing and everything else works fine except for the test that explicitly checks for the sql header.

@jtcohen6
Copy link
Contributor

I'm wondering if putting the args in the strategy_arg_dict actually serves any real purpose? It's not clear to me what it gives us that simply passing in the args doesn't.

Our thinking was that each of these strategy macros takes a slightly different set of arguments today, and will likely continue to take different arguments in the future. We want a single place in the incremental materialization where we call the strategy macro, passing all possible arguments in.

If users have a custom materialization, unless they copy-and-paste the whole incremental materialization, I assume they will be pulling the in any new args from config in their custom code, right?

Yes, this sounds right to me.

It's still not clear to me why we want the sql_header in bigquery and only in bigquery. As far as I can tell, none of the other adapters include the sql_header, since "include_sql_header" is set to false everywhere except that one place in bigquery.

Every other adapter (and sometimes BigQuery) includes sql_header in the create table as statement. It's only in BigQuery that we need it in the merge statement, and only in the case of insert_overwrite with "static" partitions. I have an idea of how we might be able to remove this—creating views to be dropped later, rather than interpolating the model's {{ sql }} itself directly into the merge—though it would require some futzing with the code, and I'd need a bit of time to test my hypothesis.

I think that in order to use the default__get_incremental_default_sql in the previous comment, we'd have to pretty much implement all the changes in this pull request.

Is that mainly because of the arg_dict being passed in? We'd need to unpack the arguments to match the ones currently being expected by each adapter's version of get_merge_sql/get_insert_delete_merge_sql/get_insert_overwrite_merge_sql?

@gshank
Copy link
Contributor Author

gshank commented Jul 19, 2022

For BigQuery, maybe we could set an extra config? It would be nice to not have to pass around an extra parameter just for that one case.

As far as using default__get_incremental_default_sql, the code just isn't set up such that we can just insert that one bit. I look at putting it in and it just doesn't fit. We'd need to pull in almost all of the new macros. I can try doing that... What the timeframe here?

Copy link
Contributor

@jtcohen6 jtcohen6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feels like we're very close! One blocking/functional comment, and a few nit picks around the edges. In all, though, I'm excited to get this foundation laid, so we can reap the fruits of it in adapter repos. Thanks as well for pushing that work a bit further in dbt-snowflake, as demonstration.

Looking forward to talking through it all in a few minutes

Comment on lines +1 to +9
{% macro postgres__get_incremental_default_sql(arg_dict) %}

{% if arg_dict["unique_key"] %}
{% do return(get_incremental_delete_insert_sql(arg_dict)) %}
{% else %}
{% do return(get_incremental_append_sql(arg_dict)) %}
{% endif %}

{% endmacro %}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love to see this!!

Comment on lines +62 to +66
{% macro default__get_incremental_default_sql(arg_dict) %}

{% do return(get_incremental_append_sql(arg_dict)) %}

{% endmacro %}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This represents an implicit breaking change for maintainers of existing adapter plugins who use the default incremental materialization:

  • Previously, the (implicit) default was delete+insert
  • Now, the explicit default is append

I think it's a good change! It's just one we'll want to document very clearly

(cc @dataders)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opened a docs issue: dbt-labs/docs.getdbt.com#1761

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes in incremental.sql made me wonder the same thing about default behavior.

core/dbt/adapters/base/impl.py Show resolved Hide resolved
core/dbt/adapters/base/impl.py Outdated Show resolved Hide resolved
core/dbt/adapters/base/impl.py Outdated Show resolved Hide resolved
core/dbt/adapters/base/impl.py Outdated Show resolved Hide resolved
Copy link
Contributor

@jtcohen6 jtcohen6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work resolving that dbt-snowflake failure! And thanks for addressing most of my outstanding comments on the PR.

I see there's one remaining failure in dbt-bigquery tests, I think related to the matter of include_sql_header. Pending that resolution, this is good to go from a functional perspective / my point of view. Another engineer should still take a look for code review.

core/dbt/adapters/base/impl.py Outdated Show resolved Hide resolved
Comment on lines +62 to +66
{% macro default__get_incremental_default_sql(arg_dict) %}

{% do return(get_incremental_append_sql(arg_dict)) %}

{% endmacro %}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opened a docs issue: dbt-labs/docs.getdbt.com#1761

@gshank
Copy link
Contributor Author

gshank commented Jul 21, 2022

I didn't find any good way to get include_sql_header without a param, so a put the param back. You can't set config values at runtime...

@gshank gshank marked this pull request as ready for review July 21, 2022 17:41
Copy link
Contributor

@ChenyuLInx ChenyuLInx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! There's no impact for python model's incremental change as we just reuse the same logic here

Copy link
Contributor

@jtcohen6 jtcohen6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work wrapping this up!

@gshank gshank merged commit 2548ba9 into main Jul 21, 2022
@gshank gshank deleted the ct-646-incremental_refactor branch July 21, 2022 18:11
ueshin added a commit to ueshin/dbt-databricks that referenced this pull request Jul 22, 2022
ueshin added a commit to databricks/dbt-databricks that referenced this pull request Jul 22, 2022
)

### Description

Avoids using default as a temporary fix for dbt-labs/dbt-core#5359.

This is a temporary fix and dbt-labs/dbt-spark#394 should be ported later.
ueshin added a commit to databricks/dbt-databricks that referenced this pull request Aug 25, 2022
### Description

Applies "Initial refactoring of incremental materialization" (dbt-labs/dbt-core#5359).

Now it uses `adapter.get_incremental_strategy_macro` instead of dbt-spark's `dbt_spark_get_incremental_sql` macro to dispatch the incremental strategy macro. The overwritten `dbt_spark_get_incremental_sql` macro will not work anymore.

Co-authored-by: allisonwang-db <allison.wang@databricks.com>
agoblet pushed a commit to BigDataRepublic/dbt-core that referenced this pull request Sep 16, 2022
@@ -433,6 +433,7 @@ class NodeConfig(NodeAndTestConfig):
# Note: if any new fields are added with MergeBehavior, also update the
# 'mergebehavior' dictionary
materialized: str = "view"
incremental_strategy: Optional[str] = None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gshank @nathaniel-may does this now mean that the value of incremental_strategy is now in the manifest.json when it wasn't before? or does this add it to the python context such that it is accessible as an attribute of model as in model.incremental_strategy?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It means that it should be in the manifest.json. It's accessible like other config keys, but the behavior is a bit different for builtin attributes than for adhoc attributes, in that setting defaults doesn't work the same way.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[CT-646] [Proposal] Streamline Incremental Strategies
4 participants