
[#2479] Allow unique_key to take a list #4618

Merged: 6 commits merged on Feb 3, 2022
Conversation

@gshank (Contributor) commented Jan 24, 2022

resolves #2479

Description

This derives from pull request #4159.

Checklist

  • I have signed the CLA
  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • I have updated the CHANGELOG.md and added information about my change

@cla-bot (bot) added the cla:yes label on Jan 24, 2022
@gshank mentioned this pull request on Jan 24, 2022
@jtcohen6 (Contributor) left a comment:

This works with both Snowflake + BigQuery when I run locally. As a follow-up to merging this PR, we should look to add a simple integration test to both of those plugins:

{{ config(
  materialized = 'incremental',
  unique_key = ['id1', 'id2']
)}}

select 1 as id1, 2 as id2

I think the first step in getting the tests to pass should be rebasing against / merging in main. When I run locally, I see: Running with dbt=0.21.0-rc1.

Reviewed diff excerpt (default__get_merge_sql):

DBT_INTERNAL_SOURCE.{{ unique_key }} = DBT_INTERNAL_DEST.{{ unique_key }}
{% endset %}
{% do predicates.append(unique_key_match) %}
{% if unique_key is sequence and unique_key is not mapping and unique_key is not string %}
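Pieced together, here is a minimal sketch of the full branch this diff introduces: one equality predicate per key when unique_key is a list, falling back to the existing single-key behavior otherwise. It is assembled from the excerpt above (predicates being the list the macro builds for the merge on clause); the exact body merged in this PR may differ:

{% if unique_key is sequence and unique_key is not mapping and unique_key is not string %}
    {# list case: append one source-to-dest equality per key column #}
    {% for key in unique_key %}
        {% set this_key_match %}
            DBT_INTERNAL_SOURCE.{{ key }} = DBT_INTERNAL_DEST.{{ key }}
        {% endset %}
        {% do predicates.append(this_key_match) %}
    {% endfor %}
{% else %}
    {# string case: single-column match, the pre-existing behavior #}
    {% set unique_key_match %}
        DBT_INTERNAL_SOURCE.{{ unique_key }} = DBT_INTERNAL_DEST.{{ unique_key }}
    {% endset %}
    {% do predicates.append(unique_key_match) %}
{% endif %}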
Contributor left a comment:
This update to default__get_merge_sql will work for Snowflake + BigQuery out of the box. For other databases:

  • Postgres + Redshift don't support merge; they use get_delete_insert_merge_sql instead. We'd need to refactor the delete statement to support multiple unique keys, which is tricky enough that I think we can open a new issue for it (see the sketch after this list). No need to block this PR in the meantime.
  • Spark reimplements this as spark__get_merge_sql, so we'll need to make a similar change over there if we want to support the same functionality.
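To make that delete-statement refactor concrete, here is a hedged sketch of how the delete portion of get_delete_insert_merge_sql might handle a key list on Postgres/Redshift (target and source mirror the macro's existing arguments; this is one possible shape, not the implementation that eventually shipped):

{% if unique_key is sequence and unique_key is not mapping and unique_key is not string %}
    {# list case: delete target rows that match the source on every key column #}
    delete from {{ target }}
    using {{ source }}
    where (
        {% for key in unique_key %}
            {{ source }}.{{ key }} = {{ target }}.{{ key }}
            {{ "and " if not loop.last }}
        {% endfor %}
    )
{% else %}
    {# string case: single-column membership test, unchanged behavior #}
    delete from {{ target }}
    where {{ unique_key }} in (select {{ unique_key }} from {{ source }})
{% endif %}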

Contributor left a comment:

For the Spark change here, @gshank maybe you can pair with @McKnight-42 or @VersusFacit on making that change.

@McKnight-42 @VersusFacit - this behavior difference would be good to document in the adapter-specific pages if it isn't there yet

Contributor left a comment:

I think we'll also want integration tests for Snowflake + BigQuery, to ensure this change actually achieves the desired result.

triedandtested-dev and others added 6 commits on January 27, 2022

  • `unique_key` can be a string or a list.
  • Extend the functionality to support both single and multiple keys.

Signed-off-by: triedandtested-dev (Bryan Dunkley) <bryan@triedandtested.dev>
@gshank (Contributor, Author) commented Jan 27, 2022

I'm not quite sure what the flow should be here. Should this PR be merged first and then the separate adapter changes be addressed?

@leahwicz (Contributor) commented:

@gshank I'm not opposed to doing those changes in a separate PR as long as this change doesn't break anything w/o those other changes. If we do a separate PR, can you please open a tracking issue so we don't lose track of doing the work?

@gshank (Contributor, Author) commented Jan 28, 2022

@leahwicz I think they would have to be separate PRs, because they're different repositories, right? You can't put changes from multiple repos into the same PR. I don't know if this change would break anything in the other repos, and I'm not sure how to test them against a dbt-core branch.

@leahwicz (Contributor) commented:

> they're different repositories, right?

@gshank that is correct, but I just didn't know if we needed to coordinate with the other repos. I don't want to merge this change if it causes the other repos to start failing tests; instead, we'd coordinate to merge all the PRs on the same day, if that makes sense.

@jtcohen6 (Contributor) commented:

> I don't want to merge this change if it causes the other repos to start failing tests; instead, we'd coordinate to merge all the PRs on the same day.

This is a really good callout in general. In this particular case, I don't believe that this change will (or should) cause failing tests in those other repos. (If it does, that failing test will be meaningful, and indicate we've probably done something wrong here!)

It is incumbent on us to follow up with:

  • PRs to dbt-snowflake + dbt-bigquery to add an integration test for the functionality added in this PR, since this is a case where the "default" merge logic, defined in the dbt-core global project, is not actually exercised on Postgres
  • a PR to dbt-spark that mirrors this functionality, and adds an integration test for it

@McKnight-42 (Contributor) left a comment:

Getting some context, and reading up. LGTM.

@@ -2,6 +2,7 @@

### Features
- New Dockerfile to support specific db adapters and platforms. See docker/README.md for details ([#4495](https://github.com/dbt-labs/dbt-core/issues/4495), [#4487](https://github.com/dbt-labs/dbt-core/pull/4487))
- Allow unique_key to take a list ([#2479](https://github.com/dbt-labs/dbt-core/issues/2479), [#4618](https://github.com/dbt-labs/dbt-core/pull/4618))
Contributor left a comment:

Let's make sure to add the contributor from the original PR to the changelog for v1.1:

- [@triedandtested-dev](https://github.com/triedandtested-dev) ([#4618](https://github.com/dbt-labs/dbt-core/pull/4618))

@cdabel left a comment:

By the way, I noticed that this list option is missing from the docs here:
https://docs.getdbt.com/reference/resource-configs/unique_key#use-a-combination-of-two-columns-as-a-unique-key


Contributor left a comment:

Good call @cdabel!

Just opened an issue for that here: dbt-labs/docs.getdbt.com#4642

@HeddeCrisp commented Mar 16, 2023

I'm encountering an error when running an incremental BigQuery model with multiple unique keys. The error I'm getting is:

10:24:38  Database Error in model sample_model (models/ODL_derived/sample_model.sql)
10:24:38    Syntax error: Unexpected "[" at [546:33]. If this is a table identifier, escape the name with `, e.g. `table.name` rather than [table.name].
10:24:38    compiled Code at target/run/sample_dwh/models/ODL_derived/sample_model.sql

This is caused by the compiled code:

[...] ) as DBT_INTERNAL_SOURCE
        on DBT_INTERNAL_SOURCE.['col1', 'col2', 'col3'] = DBT_INTERNAL_DEST.['col1', 'col2', 'col3']

Config used is:

config(
    name="sample_model"
    , materialized='incremental'
    , unique_key=['col1', 'col2', 'col3']
)

Am I doing something wrong, or is this dbt code that fails on BigQuery with these settings?

@VersusFacit (Contributor) commented:

Heya @HeddeCrisp! It looks like you have the right syntax. Your error is familiar and strikes me as what used to happen before we merged this.

I did a little incremental model test with the following config:

{{
    config(
        materialized='incremental',
        unique_key=['id', 'first_name', 'last_name']
    )
}}

And my dbt version is:

Core:
  - installed: 1.4.5
  - latest:    1.4.5 - Up to date!

Plugins:
  - bigquery: 1.4.3 - Up to date!

So this implies to me that there's not an error on our end with this feature; rather, you might be using a dbt-bigquery version that didn't have this feature yet. What does dbt --version yield on your system?
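For reference, on a version that includes this change, the compiled on clause should render one equality per key instead of splicing the Python list in, i.e. something like the following (illustrative only, using the column names from the snippet above):

on DBT_INTERNAL_SOURCE.col1 = DBT_INTERNAL_DEST.col1
    and DBT_INTERNAL_SOURCE.col2 = DBT_INTERNAL_DEST.col2
    and DBT_INTERNAL_SOURCE.col3 = DBT_INTERNAL_DEST.col3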

@nguyenann13 commented Jun 8, 2023

I'm receiving an error when I pass a list:

Parsing Error
  at path ['unique_key']: ['usr_key', 'usr_mbr_key', 'prgm_key', 'prmsn_key', 'resp_key', 'mbr_cmpy_key', 'usr_prgm_prmsn_key'] is not valid under any of the given schemas

{% snapshot snap_mkp_usr_prgm_prmsn %}

    {{
        config(
          target_schema='edl_workdb',
          strategy='timestamp',
          unique_key=['usr_key', 'usr_mbr_key', 'prgm_key', 'prmsn_key', 'resp_key', 'mbr_cmpy_key', 'usr_prgm_prmsn_key'],
          updated_at='upd_ts',
          invalidate_hard_deletes=True,
        )
    }}

    -- Filter out value analysis app as it's sunset and has duplicates
    select * from {{ source('hzn_mkp', 'mkp_usr_prmsn_app_t') }} where app_key <> 5

{% endsnapshot %}
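Worth noting: this PR added list support for the incremental materialization's unique_key, but the snapshot config schema still expects a single string here, which is what the parsing error is saying. A common workaround, shown only as a sketch reusing the column names above, is to pass one concatenated surrogate-key expression (pick a separator that cannot appear in the data, and add casts if any key column is non-text):

{{
    config(
      target_schema='edl_workdb',
      strategy='timestamp',
      unique_key="usr_key || '-' || usr_mbr_key || '-' || prgm_key || '-' || prmsn_key || '-' || resp_key || '-' || mbr_cmpy_key || '-' || usr_prgm_prmsn_key",
      updated_at='upd_ts',
      invalidate_hard_deletes=True,
    )
}}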

Successfully merging this pull request may close these issues.

[CT-96] Allow unique_key for incremental materializations to take a list
10 participants