
[CT-96] Allow unique_key for incremental materializations to take a list #2479

Closed

azhard opened this issue May 20, 2020 · 17 comments · Fixed by #4618

Labels: enhancement, good_first_issue, incremental, jira

azhard (Contributor) commented May 20, 2020

Describe the feature

Right now, when creating a model in dbt with materialization set to incremental, you can pass a single column to unique_key, which acts as the key for merging.

In an ideal world you would be able to pass in multiple columns, since there are many cases where a table's primary key is defined by more than one column.

The simplest solution would be to change unique_key to accept a list (in addition to a string, for backwards compatibility) and build the merge predicates from that list rather than from a single column.

This might not be ideal, as the param name unique_key implies a single key. Alternatives would be adding a new optional parameter, unique_key_list or unique_keys, that always takes a list, and eventually deprecating the unique_key parameter.

Describe alternatives you've considered

Not necessarily other alternatives, but another thing to consider is the use of unique_key throughout the dbt project. It would stand to reason that whatever change is made here should apply to all other usages of unique_key. This could be done in one large roll-out or in stages, e.g. merges first, then upserts, then snapshots.

Additional context

This feature should work across all databases.

Who will this benefit?

Hopefully most dbt users. Currently the only workaround is dbt_utils.surrogate_key, which a) doesn't work for BigQuery and b) should ideally be an out-of-the-box dbt feature.

@azhard azhard added enhancement New feature or request triage labels May 20, 2020
@jtcohen6 jtcohen6 removed the triage label May 20, 2020
jtcohen6 (Contributor) commented:

Thanks for the solid proposal, @azhard. I've now heard this from several members of the community. Though it's what we recommend, I understand that creating a single hashed surrogate key for each model is against some folks' warehousing paradigms. Currently, that requires overriding a complex merge macro, or the entire incremental materialization.

> a) doesn't work for BigQuery

Could you clarify what you mean here? Is the BQ implementation of dbt_utils.surrogate_key broken?

azhard (Contributor, Author) commented May 20, 2020

Hey @jtcohen6, the surrogate_key function is broken for this particular case, but I'm not sure whether the fault lies on the dbt side or the dbt_utils side.

Essentially, in merge.sql L25, dbt does this to generate the match predicate:

    DBT_INTERNAL_SOURCE.{{ unique_key }} = DBT_INTERNAL_DEST.{{ unique_key }}

While this normally works, here unique_key is substituted with to_hex(md5(cast(concat(coalesce(cast(id as ..., which means the generated SQL becomes DBT_INTERNAL_SOURCE.to_hex(md5...). That is incorrect: DBT_INTERNAL_SOURCE. should precede id, not the whole expression.
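In other words, the rendered predicate comes out roughly like this (a sketch with the hashed expression simplified to a single id column):

    -- roughly what gets rendered today (invalid: the alias prefixes the whole expression)
    DBT_INTERNAL_SOURCE.to_hex(md5(cast(coalesce(cast(id as string), '') as string)))
        = DBT_INTERNAL_DEST.to_hex(md5(cast(coalesce(cast(id as string), '') as string)))

    -- what would actually be needed: the alias applied to each column inside the expression
    to_hex(md5(cast(coalesce(cast(DBT_INTERNAL_SOURCE.id as string), '') as string)))
        = to_hex(md5(cast(coalesce(cast(DBT_INTERNAL_DEST.id as string), '') as string)))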

I've taken a look at implementing this change to use unique_key as a list in the merge.sql file and it doesn't seem super involved. I'd be happy to put up a PR if there is consensus on the best long-term approach.

azhard (Contributor, Author) commented May 26, 2020

@jtcohen6 any idea on the surrogate_key issue? I'd be happy to submit a PR to try and fix it; I'm just not sure which side it counts as being broken on.

jtcohen6 (Contributor) commented May 27, 2020

I see what you mean. When I think of creating a surrogate key for an incremental model, I'm thinking of creating a column within that model, to be stored in the resulting table and passed as the unique_key for subsequent incremental runs:

{{ config(
    materialized = 'incremental',
    unique_key = 'unique_id'
) }}

select
    date_day,
    user_id,
    {{ dbt_utils.surrogate_key('date_day', 'user_id') }} as unique_id

...

You're right that, as a result of the way that merge macros are implemented on BigQuery, you cannot create the surrogate key directly within the config like so:

{{ config(
    materialized = 'incremental',
    unique_key = dbt_utils.surrogate_key('date_day', 'user_id')
) }}

I've heard this change requested from several folks now, including (if I recall correctly) some Snowflake users who have found that merging on cluster keys somewhat improves performance. So I'm not opposed to passing an array of column names. I'm worried that unique_keys is ambiguous; following the lead of the dbt-utils test, I'm thinking along the lines of unique_combination_of_columns.

@drewbanin What do you think? Is that too much config-arg creep?

drewbanin (Contributor) commented:

@jtcohen6 I'm into the idea of supporting an array of values for unique_key. I do think our best practices should state that a unique_key should be a single field name, but I think it should be relatively easy to build the correct merge statement from a list of fields.

Either way though, I do think we should mandate that unique_key args are simple field names, and we should not try to support expressions in the unique_key arg.

So, this would not be supported:

{{ config(
    materialized = 'incremental',
    unique_key = dbt_utils.surrogate_key('date_day', 'user_id')
) }}

but this could be:

{{ config(
    materialized = 'incremental',
    unique_key = ['date_day', 'user_id']
) }}
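For illustration, the list form would need to render a compound merge predicate roughly like this (the relation names and the non-key column metric here are made up for the sketch):

    merge into my_model as DBT_INTERNAL_DEST
    using my_model__dbt_tmp as DBT_INTERNAL_SOURCE
    on DBT_INTERNAL_SOURCE.date_day = DBT_INTERNAL_DEST.date_day
        and DBT_INTERNAL_SOURCE.user_id = DBT_INTERNAL_DEST.user_id
    when matched then update set metric = DBT_INTERNAL_SOURCE.metric
    when not matched then insert (date_day, user_id, metric)
        values (DBT_INTERNAL_SOURCE.date_day, DBT_INTERNAL_SOURCE.user_id, DBT_INTERNAL_SOURCE.metric)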

I would like to think a little harder about how this would work for databases that do not support merge statements. On those databases, we run a query like:

    delete
    from {{ target_relation }}
    where ({{ unique_key }}) in (
        select ({{ unique_key }})
        from {{ tmp_relation }}
    );

It's not clear to me yet how we should build this delete statement with multiple keys, or whether we should simply not support multiple keys on Postgres/Redshift/etc.

jtcohen6 (Contributor) commented May 27, 2020

To the last point, Postgres, Redshift, and Snowflake all support using syntax with delete statements, so we could write:

    delete
    from {{ target_relation }}
    using {{ tmp_relation }}
    where
    {% for column_name in unique_combination_of_columns %}
        {{ target_relation }}.{{ column_name }} = {{ tmp_relation }}.{{ column_name }}
        {{- "and" if not loop.last }}
    {% endfor %}
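Rendered with unique_combination_of_columns = ['date_day', 'user_id'] (relation names here are hypothetical), that loop would produce roughly:

    delete
    from my_model
    using my_model__dbt_tmp
    where
        my_model.date_day = my_model__dbt_tmp.date_day
        and my_model.user_id = my_model__dbt_tmp.user_id;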

azhard (Contributor, Author) commented May 27, 2020

Any advice on how I can use multiple key columns locally with BigQuery right now?

jtcohen6 (Contributor) commented:

If need be, you could override the get_merge_sql macro, specifically to change these lines:
https://github.com/fishtown-analytics/dbt/blob/c3c99f317e044a333d20766cc59c523769a4204f/core/dbt/include/global_project/macros/materializations/common/merge.sql#L23-L28
to

    {% if unique_key %}
        {% set unique_key_list = unique_key.split(",") %}
        {% for key in unique_key_list %}
            {% set this_key_match %}
                DBT_INTERNAL_SOURCE.{{ key }} = DBT_INTERNAL_DEST.{{ key }}
            {% endset %}
            {% do predicates.append(this_key_match) %}
        {% endfor %}
    {% else %}

and then pass a comma-separated string to unique_key in your model:

{{ config(
    materialized = 'incremental',
    unique_key = 'date_day, user_id'
) }}

That's obviously not the eventual implementation we're going for, but I think that would work for your specific use case.

azhard (Contributor, Author) commented May 28, 2020

Hey @jtcohen6 that works perfectly, thanks for the help!

JCZuurmond (Contributor) commented:

Seeing the discussion and the proposed solution, could we rename this issue to "Allow unique_key to take a list" to make it more generic?

And I'm OK with using unique_key for both a str and a List[str], even though unique_keys is more accurate for the latter. This happens in pandas too, e.g. drop and pivot have a columns parameter that can be either str or List[str]. In my opinion this is better than adding a new parameter.

triedandtested-dev (Contributor) commented:

I've had a read through both of the above threads and I'm happy to give this a go.

In summary:

  • unique_key should support unique_key='column' or unique_key=['column1', 'column2'].
  • As part of this work we should split ModelConfig-specific attributes out from NodeConfig.
  • unique_key should be validated in Python.

Considerations:

I would like to introduce a Jinja test for lists, if everyone is in agreement, so that we can write:

{% if unique_key is list %}

rather than:

{% if unique_key is sequence and unique_key is not mapping and unique_key is not string %}

This will probably benefit a number of macros with list-or-string behaviour.
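For instance, a small normalization macro (the name as_key_list and its placement are purely illustrative, not dbt's actual implementation) reads much more cleanly with the proposed test:

    {% macro as_key_list(unique_key) %}
        {# Wrap a single column name in a list; pass lists through unchanged #}
        {% if unique_key is list %}
            {{ return(unique_key) }}
        {% else %}
            {{ return([unique_key]) }}
        {% endif %}
    {% endmacro %}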

jtcohen6 (Contributor) commented:

@triedandtested-dev Thanks for your interest! I'd welcome a PR for this :)

In terms of your three points:

  1. Yes, let's have the same config unique_key accept either a string column name, or a list of string column names
  2. While splitting ModelConfig and NodeConfig apart is highly desirable, we can leave that out of scope for this effort. That can come instead as part of Reorganize nodes, node.config in manifest #3557.
  3. The python validation will be a nice side-effect of defining unique_key: Optional[Union[str, List[str]]] within the NodeConfig dataclass

The is list Jinja test is a neat idea! Even if that's not strictly necessary here, I'm hardly opposed.

@jtcohen6 jtcohen6 added the good_first_issue Straightforward + self-contained changes, good for new contributors! label Sep 26, 2021
triedandtested-dev (Contributor) commented:

Happy to keep the ModelConfig out of scope for this issue.

I'll get cracking on a PR for this!

The Jinja "is list" addition should help with readability going forward, but I completely agree it's not technically necessary.

JCZuurmond (Contributor) commented:

@triedandtested-dev: How is it going with the PR?

ismailsimsek commented:

+1, running into this issue

@gshank gshank added the jira label Jan 26, 2022
@github-actions github-actions bot changed the title Allow unique_key for incremental materializations to take a list [CT-96] Allow unique_key for incremental materializations to take a list Jan 26, 2022
@gshank gshank self-assigned this Jan 26, 2022
@leahwicz leahwicz added this to the v1.1.0 milestone Jan 27, 2022
gshank added a commit that referenced this issue Feb 3, 2022
* Add unique_key to NodeConfig

`unique_key` can be a string or a list.

* merge.sql update to work with unique_key as list

extend the functionality to support both single and multiple keys

Signed-off-by: triedandtested-dev (Bryan Dunkley) <bryan@triedandtested.dev>

* Updated test to include unique_key

Signed-off-by: triedandtested-dev (Bryan Dunkley) <bryan@triedandtested.dev>

* updated tests

Signed-off-by: triedandtested-dev (Bryan Dunkley) <bryan@triedandtested.dev>

* Fix unit and integration tests

* Update Changelog for 2479/4618

Co-authored-by: triedandtested-dev (Bryan Dunkley) <bryan@triedandtested.dev>
DanDaDataEngineer commented:

Not sure if this is out of scope, but the change isn't propagated to the rest of dbt. For example, the snapshot_staging_table macro still refers to a single unique_key value rather than multiple. Is this intended?

Thanks,
Dan

jtcohen6 (Contributor) commented Mar 2, 2022

@DanDaDataEngineer The scope of this change was for models using the incremental materialization only. You're welcome to open a separate issue for supporting multiple unique keys in snapshots. For my part, I'd just say that the presence of a reliable unique identifier is even more important in snapshots, since the stakes are data loss, or duplication requiring manual intervention.
