Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dbt Snapshots not working correctly when switching from "check" to "timestamp" strategy #2350

Closed
jrandrews opened this issue Apr 22, 2020 · 6 comments
Labels
bug Something isn't working snapshots Issues related to dbt's snapshot functionality

Comments

@jrandrews
Copy link

Describe the bug

We have found cases where, using the dbt snapshot command, snapshots are not updating appropriately even though new source data rows have arrived which should cause snapshots to update.

Steps To Reproduce

  1. Create a snapshot of a source table where the source table does not have a "last changed" timestamp, so it is necessary to use the "check" strategy to check for changes in a specific set of columns using a shared common PK across the rows to compare.
  2. After ingesting some amount of snapshot data, then add a "last changed" timestamp from the source into a new column that was not previously available. In our case, this was because the external data loaders that were loading our raw source data into BigQuery were being run by another team who did not understand what our preferred data scenario was. Once we discussed with them, they were eventually able to identify a "last changed" timestamp column and ingest it, but only after we had been running snapshots for a few weeks with the "check" strategy.
  3. Notice that the addition of the new "last changed" timestamp column to the snapshot now occurs, as expected with dbt snapshots when new columns are added to the source. Also note that for all previously existing data in the snapshot, this column is backfilled with NULL values, which is also according to expected behavior.
  4. Ingest/add a new row with a change from the source which already has a corresponding pre-existing row with the same PK in the snapshot from before the change from "check" to timestamp strategies. Note that the "last changed" timestamp column in the snapshot for the relevant entity is NULL because it was previously populated by dbt snapshot using the "check" strategy and this column was backfilled as NULL.
  5. Re-run dbt snapshot. Notice that no updates happen in the snapshot and the existing, old row(s) are not end-dated even though there is a "newer" row in the source which should be inserted into the snapshot and the existing row with no end-date should be end-dated.

Expected behavior

I would expect, regardless of whether switching back and forth between "check" and "timestamp" strategies for snapshots, that new data coming in from the source would appropriately and correctly be inserted into the snapshot, and existing rows would be end-dated appropriately.

There may also be some larger issues here around temporality as described in this dbt discourse article and how dbt snapshots uses and tracks timestamps even in the dbt_* columns based on what snapshot strategy is being used: https://discourse.getdbt.com/t/dbt-bitemporality-and-snapshots/1067 (although I am absolutely not asking for full bi-temporal support here.)

Screenshots and log output

See screenshot attached of a) our snapshot table and b) the incoming row which should be inserted in the snapshot but isn't. Employee name blurred out for obvious reasons.

existing_snapshot
raw_source_table

System information

Which database are you using dbt with?
BigQuery

The output of dbt --version:

installed version: 0.16.1
   latest version: 0.16.1

The operating system you're using:
Mac OS X Mojave 10.14.6
Same issue happens also when running from dbt cloud

The output of python --version:
Python 2.7.16

Additional context

This is a pretty major issue for us, we're losing existing history because of this and looking at also having to throw away a bunch of previously gathered history from when we were using the "check" strategy.

@jrandrews jrandrews added bug Something isn't working triage labels Apr 22, 2020
@jrandrews
Copy link
Author

jrandrews commented Apr 22, 2020

We think this may be a factor of how dbt_scd_id is calculated, it looks like it's using the Metadatalastupdatetime column in the snapshot, and perhaps there are problems when that column is NULL?

Here is the SQL:

/* {"app": "dbt", "dbt_version": "0.16.1", "profile_name": "user", "target_name": "default", "node_id": "snapshot.op_analytics.qb_employee_snapshot"} */
        
  create or replace table `qb-sync-sheet-time`.`snapshot`.`qb_employee_snapshot__dbt_tmp`
  
  
  OPTIONS(
      expiration_timestamp=TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 12 hour)
    )
  as (
    with snapshot_query as (
        
    
    
    select * from `qb-sync-sheet-time`.`qb`.`qb_employee`
    
    ),
    snapshotted_data as (
        select *,
            id as dbt_unique_key
        from `qb-sync-sheet-time`.`snapshot`.`qb_employee_snapshot`
    ),
    source_data as (
        select *,
            to_hex(md5(concat(coalesce(cast(id as string), ''), '|',coalesce(cast(MetaDataLastUpdatedTime as string), '')))) as dbt_scd_id,
            id as dbt_unique_key,
            MetaDataLastUpdatedTime as dbt_updated_at,
            MetaDataLastUpdatedTime as dbt_valid_from,
            nullif(MetaDataLastUpdatedTime, MetaDataLastUpdatedTime) as dbt_valid_to
        from snapshot_query
    ),
    insertions as (
        select
            'insert' as dbt_change_type,
            source_data.*
        from source_data
        left outer join snapshotted_data on snapshotted_data.dbt_unique_key = source_data.dbt_unique_key
        where snapshotted_data.dbt_unique_key is null
           or (
                snapshotted_data.dbt_unique_key is not null
            and snapshotted_data.dbt_valid_to is null
            and (
                (snapshotted_data.MetaDataLastUpdatedTime < source_data.MetaDataLastUpdatedTime)
            )
        )
    )
    select * from insertions
  );

@drewbanin drewbanin added snapshots Issues related to dbt's snapshot functionality and removed triage labels Apr 23, 2020
@drewbanin
Copy link
Contributor

Hey @jrandrews - thanks for writing this issue up! It's very complete and very well written!

I think the issue here is indeed the null value for MetaDataLastUpdatedTime in this predicate:

where snapshotted_data.dbt_unique_key is null
or (
  snapshotted_data.dbt_unique_key is not null
  and snapshotted_data.dbt_valid_to is null
  and (
    -- This line is probably the problem
    (snapshotted_data.MetaDataLastUpdatedTime < source_data.MetaDataLastUpdatedTime)
  )
)

You're correct in saying that we'd expect the historical MetaDataLastUpdatedTime in the Snapshot table to be null when the column is added to the snapshot table. The problem then is that we should be using the dbt_updated_at timestamp (which should not be null) instead of the MetaDataLastUpdatedTime column to perform this time comparison.

There is always so much to consider when changing the snapshot code, but I suspect we can simply change this equality from:

    (snapshotted_data.MetaDataLastUpdatedTime < source_data.MetaDataLastUpdatedTime)

to

    (snapshotted_data.dbt_updated_at < source_data.MetaDataLastUpdatedTime)

We can guarantee that dbt_updated_at should never be null, and:

  • For strategy='timestamp': dbt_updated_at = dbt_valid_from = {{ config.updated_at }}
  • For strategy='check': dbt_updated_at = dbt_valid_from = current_timestamp()

So, I don't believe this change would manifest as a functional change for any snapshot users out there, and it should fix the issue you're seeing.

Let me know if you buy all of that, or if you can think of anything I might be missing here.

@jrandrews
Copy link
Author

@drewbanin thanks for the review and thoughts on this. Your approach sounds reasonable. Honestly I haven't read the details of the snapshot code thoroughly and I'll rely on your good judgement as it seems like it's pretty involved code.

I don't this fix would do anything to repair the existing data problem we have, right? We'll need to clean that up ourselves if so.

@drewbanin
Copy link
Contributor

@jrandrews unfortunately no, you'd need to manually repair the snapshot table yourself

@jrandrews
Copy link
Author

I think we've managed to fix it by doing this to all of the tables in question for now (although of course it will reoccur if we ever switch another table from check to snapshot in the future so the code fix will definitely be helpful):

UPDATE snapshot.my_snapshot SET MetaDataLastUpdatedTime = dbt_updated_at WHERE MetaDataLastUpdatedTime IS NULL

@drewbanin
Copy link
Contributor

closed by #2391

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working snapshots Issues related to dbt's snapshot functionality
Projects
None yet
Development

No branches or pull requests

2 participants