Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CT-498] [Bug] updated_at ignored for check_cols snapshots #5076

Closed
1 task done
dbeatty10 opened this issue Apr 15, 2022 · 0 comments · Fixed by #5077
Closed
1 task done

[CT-498] [Bug] updated_at ignored for check_cols snapshots #5076

dbeatty10 opened this issue Apr 15, 2022 · 0 comments · Fixed by #5077
Labels
bug Something isn't working regression snapshots Issues related to dbt's snapshot functionality

Comments

@dbeatty10
Copy link
Contributor

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

Definitions

  • point-in-time snapshot - a stand-alone copy of a database table as it was at a particular moment
  • dbt snapshot - a process that can merge a data source into a type 2 SCD table

Problem overview

Some users have source tables that represent point-in-time snapshots. e.g., a table named users_20190701 might contain a read-only copy of the users table made at midnight on July 1st, 2019. The time of the copy was made is not recorded as an attribute within the table -- the only metadata representing the time of the copy is within the table name itself.

⚠️ If any of those users are using dbt snapshots to build a type 2 SCD they could be subject to a breaking change by #4513.

Setup

See "Steps to Reproduce" below

Output

Suppose consecutive dbt snapshots are taken for those 3 dates at the following timestamps:

  • 2022-04-15 12:39:05.656
  • 2022-04-15 12:39:20.298
  • 2022-04-15 12:39:37.009

After #4513

ID COUNTER TIMESTAMP_COL DBT_SCD_ID DBT_UPDATED_AT DBT_VALID_FROM DBT_VALID_TO
a 10 2020-01-01 00:00:00.000 9a616725e3cd40065fcc9f0ba08ddeed 2022-04-15 12:39:05.656 2022-04-15 12:39:05.656 2022-04-15 12:39:20.298
b 20 2020-01-01 00:00:00.000 ae9a80e646d82a83e5585056b026942e 2022-04-15 12:39:05.656 2022-04-15 12:39:05.656 2022-04-15 12:39:37.009
a 30 2020-01-02 00:00:00.000 b481297925092458d5f121c067c6eb84 2022-04-15 12:39:20.298 2022-04-15 12:39:20.298
c 40 2020-01-02 00:00:00.000 2797d4f42a6a39d4d658c986844ad1a0 2022-04-15 12:39:20.298 2022-04-15 12:39:20.298

Expected Behavior

Before #4513

ID COUNTER TIMESTAMP_COL DBT_SCD_ID DBT_UPDATED_AT DBT_VALID_FROM DBT_VALID_TO
a 10 2020-01-01 00:00:00.000 dd497b1ca8b29e7418cc4fa6c085fa1a 2019-07-01 00:00:00.000 2019-07-01 00:00:00.000 2019-07-02 00:00:00.000
b 20 2020-01-01 00:00:00.000 730a2204c478dca213c517f9c9f5366a 2019-07-01 00:00:00.000 2019-07-01 00:00:00.000 2022-04-15 12:39:37.009
a 30 2020-01-02 00:00:00.000 6fe5b3670032e43be3066a38fc8b9012 2019-07-02 00:00:00.000 2019-07-02 00:00:00.000
c 40 2020-01-02 00:00:00.000 af264ae0ed7ec7cdc5f5b7c63ca48e56 2019-07-02 00:00:00.000 2019-07-02 00:00:00.000

The output of the following columns differs before and after:

  • DBT_SCD_ID
  • DBT_UPDATED_AT
  • DBT_VALID_FROM
  • DBT_VALID_TO

(I know, timestamps in tables aren't the easiest to read 😅 )

Steps To Reproduce

  1. In the environment below
  2. With the config below
  3. Run the commands below
  4. See the current vs. expected behavior above

Example

This is a simple example showing changes to a table over 3 days, including insertion and deletions.

Configuring the table copies as a source

An example source configuration in dbt is:

version: 2

sources:
  - name: users
    database: raw
    schema: users
    tables:
      - name: user_{{ var("iso8601") }}

Configuring a dbt snapshot

As of this writing, specifying the updated_at parameter isn't documented here or here for the check_cols strategy. However, it can be configured like this (whether intended or not):

{% snapshot users_snapshot %}

{{
    config(
      target_database='analytics',
      target_schema='dbt_dbeatty_snapshots',
      unique_key='id',
      strategy='check',
      check_cols='all',
      updated_at="try_to_timestamp_ntz('" ~ var("iso8601") ~  "', 'YYYYMMDD')",
      invalidate_hard_deletes=True,
    )
}}

select * from {{ source("users", "user_" ~ var("iso8601")) }}

{% endsnapshot %}

Taking dbt snapshots

Processing successive days:

dbt snapshot --vars '{"iso8601": "20190701"}'
dbt snapshot --vars '{"iso8601": "20190702"}'
dbt snapshot --vars '{"iso8601": "20190703"}'

Relevant log output

No response

Environment

- OS: MacOS 12.1 using Rosetta
- Python: 3.9.12
- dbt: dbt-core 1.0.4 vs. main
- dbt-snowflake: 1.0.0

What database are you using dbt with?

snowflake

Additional Context

No response

@dbeatty10 dbeatty10 added bug Something isn't working enhancement New feature or request triage labels Apr 15, 2022
@github-actions github-actions bot changed the title [Bug] updated_at ignored for check_cols snapshots [CT-498] [Bug] updated_at ignored for check_cols snapshots Apr 15, 2022
@jtcohen6 jtcohen6 added regression snapshots Issues related to dbt's snapshot functionality Team:Execution and removed enhancement New feature or request labels Apr 15, 2022
@stu-k stu-k removed the triage label Apr 20, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working regression snapshots Issues related to dbt's snapshot functionality
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants