dbt Snapshots not working correctly when switching from "check" to "timestamp" strategy #2350

jrandrews · 2020-04-22T00:39:10Z

Describe the bug

We have found cases where, using the dbt snapshot command, snapshots are not updating appropriately even though new source data rows have arrived which should cause snapshots to update.

Steps To Reproduce

Create a snapshot of a source table where the source table does not have a "last changed" timestamp, so it is necessary to use the "check" strategy to check for changes in a specific set of columns using a shared common PK across the rows to compare.
After ingesting some amount of snapshot data, then add a "last changed" timestamp from the source into a new column that was not previously available. In our case, this was because the external data loaders that were loading our raw source data into BigQuery were being run by another team who did not understand what our preferred data scenario was. Once we discussed with them, they were eventually able to identify a "last changed" timestamp column and ingest it, but only after we had been running snapshots for a few weeks with the "check" strategy.
Notice that the addition of the new "last changed" timestamp column to the snapshot now occurs, as expected with dbt snapshots when new columns are added to the source. Also note that for all previously existing data in the snapshot, this column is backfilled with NULL values, which is also according to expected behavior.
Ingest/add a new row with a change from the source which already has a corresponding pre-existing row with the same PK in the snapshot from before the change from "check" to timestamp strategies. Note that the "last changed" timestamp column in the snapshot for the relevant entity is NULL because it was previously populated by dbt snapshot using the "check" strategy and this column was backfilled as NULL.
Re-run dbt snapshot. Notice that no updates happen in the snapshot and the existing, old row(s) are not end-dated even though there is a "newer" row in the source which should be inserted into the snapshot and the existing row with no end-date should be end-dated.

Expected behavior

I would expect, regardless of whether switching back and forth between "check" and "timestamp" strategies for snapshots, that new data coming in from the source would appropriately and correctly be inserted into the snapshot, and existing rows would be end-dated appropriately.

There may also be some larger issues here around temporality as described in this dbt discourse article and how dbt snapshots uses and tracks timestamps even in the dbt_* columns based on what snapshot strategy is being used: https://discourse.getdbt.com/t/dbt-bitemporality-and-snapshots/1067 (although I am absolutely not asking for full bi-temporal support here.)

Screenshots and log output

See screenshot attached of a) our snapshot table and b) the incoming row which should be inserted in the snapshot but isn't. Employee name blurred out for obvious reasons.

System information

Which database are you using dbt with?
BigQuery

The output of dbt --version:

installed version: 0.16.1
   latest version: 0.16.1

The operating system you're using:
Mac OS X Mojave 10.14.6
Same issue happens also when running from dbt cloud

The output of python --version:
Python 2.7.16

Additional context

This is a pretty major issue for us, we're losing existing history because of this and looking at also having to throw away a bunch of previously gathered history from when we were using the "check" strategy.

The text was updated successfully, but these errors were encountered:

jrandrews · 2020-04-22T00:40:42Z

We think this may be a factor of how dbt_scd_id is calculated, it looks like it's using the Metadatalastupdatetime column in the snapshot, and perhaps there are problems when that column is NULL?

Here is the SQL:

/* {"app": "dbt", "dbt_version": "0.16.1", "profile_name": "user", "target_name": "default", "node_id": "snapshot.op_analytics.qb_employee_snapshot"} */
        
  create or replace table `qb-sync-sheet-time`.`snapshot`.`qb_employee_snapshot__dbt_tmp`
  
  
  OPTIONS(
      expiration_timestamp=TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 12 hour)
    )
  as (
    with snapshot_query as (
        
    
    
    select * from `qb-sync-sheet-time`.`qb`.`qb_employee`
    
    ),
    snapshotted_data as (
        select *,
            id as dbt_unique_key
        from `qb-sync-sheet-time`.`snapshot`.`qb_employee_snapshot`
    ),
    source_data as (
        select *,
            to_hex(md5(concat(coalesce(cast(id as string), ''), '|',coalesce(cast(MetaDataLastUpdatedTime as string), '')))) as dbt_scd_id,
            id as dbt_unique_key,
            MetaDataLastUpdatedTime as dbt_updated_at,
            MetaDataLastUpdatedTime as dbt_valid_from,
            nullif(MetaDataLastUpdatedTime, MetaDataLastUpdatedTime) as dbt_valid_to
        from snapshot_query
    ),
    insertions as (
        select
            'insert' as dbt_change_type,
            source_data.*
        from source_data
        left outer join snapshotted_data on snapshotted_data.dbt_unique_key = source_data.dbt_unique_key
        where snapshotted_data.dbt_unique_key is null
           or (
                snapshotted_data.dbt_unique_key is not null
            and snapshotted_data.dbt_valid_to is null
            and (
                (snapshotted_data.MetaDataLastUpdatedTime < source_data.MetaDataLastUpdatedTime)
            )
        )
    )
    select * from insertions
  );

drewbanin · 2020-04-23T17:26:44Z

Hey @jrandrews - thanks for writing this issue up! It's very complete and very well written!

I think the issue here is indeed the null value for MetaDataLastUpdatedTime in this predicate:

where snapshotted_data.dbt_unique_key is null
or (
  snapshotted_data.dbt_unique_key is not null
  and snapshotted_data.dbt_valid_to is null
  and (
    -- This line is probably the problem
    (snapshotted_data.MetaDataLastUpdatedTime < source_data.MetaDataLastUpdatedTime)
  )
)

You're correct in saying that we'd expect the historical MetaDataLastUpdatedTime in the Snapshot table to be null when the column is added to the snapshot table. The problem then is that we should be using the dbt_updated_at timestamp (which should not be null) instead of the MetaDataLastUpdatedTime column to perform this time comparison.

There is always so much to consider when changing the snapshot code, but I suspect we can simply change this equality from:

    (snapshotted_data.MetaDataLastUpdatedTime < source_data.MetaDataLastUpdatedTime)

to

    (snapshotted_data.dbt_updated_at < source_data.MetaDataLastUpdatedTime)

We can guarantee that dbt_updated_at should never be null, and:

For strategy='timestamp': dbt_updated_at = dbt_valid_from = {{ config.updated_at }}
For strategy='check': dbt_updated_at = dbt_valid_from = current_timestamp()

So, I don't believe this change would manifest as a functional change for any snapshot users out there, and it should fix the issue you're seeing.

Let me know if you buy all of that, or if you can think of anything I might be missing here.

jrandrews · 2020-04-24T05:29:25Z

@drewbanin thanks for the review and thoughts on this. Your approach sounds reasonable. Honestly I haven't read the details of the snapshot code thoroughly and I'll rely on your good judgement as it seems like it's pretty involved code.

I don't this fix would do anything to repair the existing data problem we have, right? We'll need to clean that up ourselves if so.

drewbanin · 2020-04-24T13:06:18Z

@jrandrews unfortunately no, you'd need to manually repair the snapshot table yourself

jrandrews · 2020-04-25T02:37:08Z

I think we've managed to fix it by doing this to all of the tables in question for now (although of course it will reoccur if we ever switch another table from check to snapshot in the future so the code fix will definitely be helpful):

UPDATE snapshot.my_snapshot SET MetaDataLastUpdatedTime = dbt_updated_at WHERE MetaDataLastUpdatedTime IS NULL

drewbanin · 2020-05-04T14:39:14Z

closed by #2391

jrandrews added bug Something isn't working triage labels Apr 22, 2020

drewbanin added snapshots Issues related to dbt's snapshot functionality and removed triage labels Apr 23, 2020

drewbanin added a commit that referenced this issue May 2, 2020

(#2350) Fix for changing snapshot strategy types

2cd6a2d

drewbanin mentioned this issue May 2, 2020

Fix for changing snapshot strategy types #2391

Merged

4 tasks

drewbanin added this to the Octavius Catto milestone May 2, 2020

drewbanin closed this as completed May 4, 2020

jtcohen6 mentioned this issue Dec 4, 2020

Snapshot Strategy invalidate_hard_deletes, column dbt_updated_at does not update after first execution #2935

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

dbt Snapshots not working correctly when switching from "check" to "timestamp" strategy #2350

dbt Snapshots not working correctly when switching from "check" to "timestamp" strategy #2350

jrandrews commented Apr 22, 2020

jrandrews commented Apr 22, 2020 •

edited by drewbanin

Loading

drewbanin commented Apr 23, 2020

jrandrews commented Apr 24, 2020

drewbanin commented Apr 24, 2020

jrandrews commented Apr 25, 2020

drewbanin commented May 4, 2020

dbt Snapshots not working correctly when switching from "check" to "timestamp" strategy #2350

dbt Snapshots not working correctly when switching from "check" to "timestamp" strategy #2350

Comments

jrandrews commented Apr 22, 2020

Describe the bug

Steps To Reproduce

Expected behavior

Screenshots and log output

System information

Additional context

jrandrews commented Apr 22, 2020 • edited by drewbanin Loading

drewbanin commented Apr 23, 2020

jrandrews commented Apr 24, 2020

drewbanin commented Apr 24, 2020

jrandrews commented Apr 25, 2020

drewbanin commented May 4, 2020

jrandrews commented Apr 22, 2020 •

edited by drewbanin

Loading