Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Warn if updated_at field for snapshot is not same datatype as what's returned in snapshot_get_time() #10234

Closed
Tracked by #10151
gshank opened this issue May 28, 2024 · 4 comments · Fixed by #10352
Closed
Tracked by #10151
Assignees
Labels
snapshots Issues related to dbt's snapshot functionality user docs [docs.getdbt.com] Needs better documentation

Comments

@gshank
Copy link
Contributor

gshank commented May 28, 2024

Description

There are 4 different semantics for timestamp datatypes - for simplicity let’s say 2 categories:

  • naive timestamp - no timezone information, no UTC offset “noon on january 1st, 2024”
  • aware timestamp - is aware of timezone
    • UTC offset like “-2 hours”;
    • specific named timezone “America/Los Angeles”;
    • absolute timestamp ltz uses your session timezone

One of the trickiest things that can happen within snapshots is when there are naive timestamps (rather than aware). In those cases, we need clear ways for the user to configure a "mutual agreement" with the dbt system how to interpret those timestamps when they are actively involved in the snapshot configuration.

Up until this point, the approach has been that naive timestamp must be UTC

  • there is no way to specify anything else
  • we don't document this
  • there's no warning if you supply a data type that's not a UTC timestamp

Some users may want to use an adjacent column that contains the relevant UTC offset or time zone (or configure the time zone globally for the dbt project (like this) or specifically for one model).

Example of snapshot - updated_at when using strategy=timestamp:

{% snapshot orders_snapshot %}

{{
    config(
      target_database='analytics',
      target_schema='snapshots',
      unique_key='id',

      strategy='timestamp',
      updated_at='updated_at_field'
    )
}}

select * from {{ source('jaffle_shop', 'orders') }}

{% endsnapshot %}

PROBLEM

When the initial snapshot occurs, the dbt_valid_from and dbt_valid_to columns inherit the data type of whatever you've provided for updated_at config (could be a date, could be a naive timestamp, etc.). But once a record is changed, dbt updates the dbt_valid_to column to snapshot_get_time() which returns (in snowflake) a timestamp_ntz data type in the UTC time zone. This is then implicitly converted to the original inherited data type, which can lead to incorrect data if their updated_at field is NOT timestamp_ntz data type in the UTC time zone (incorrect comparisons; may coerce timestamp differently than user expects).

Naive vs. aware timestamps in Snowflake:
https://docs.google.com/presentation/d/14qrx8YbnGnoP2FgnxSD5Z8fxcnoYq4DKRrfHBDanbfA/edit#slide=id.g20f6efbe802_0_34

SOLUTION

Let’s go with Idea 1 for now, maybe some magic in the future

Idea 1 - put burden of using correct data type on user:

  • go look up what data type updated_at field is, see if it’s compatible
  • if it’s not throw an warning (or error but that would be backwards incompatible, so would need to use a flag) to tell user to convert field to correct data type (adapter dependent)
  • updated_at already accepts a SQL statement [though we should add docs for this]:
    {% snapshot orders_snapshot %}
    
    {{
        config(
          target_database='analytics',
          target_schema='snapshots',
          unique_key='id',
    
          strategy='timestamp',
          updated_at='convert_timezone('UTC', updated_at_field),'YYYY-MM-DD HH24:MI:SS.FF3 TZHTZM')'
        )
    }}
    
    select * from {{ source('jaffle_shop', 'orders') }}
    
    {% endsnapshot %}

Idea 2 - magic stuff to automatically convert:

  • go look up what data type updated_at_field is, see if it’s compatible
  • based on what it is, automatically do the conversion to the correct data type (adapter dependent)
@gshank gshank added the snapshots Issues related to dbt's snapshot functionality label May 28, 2024
@graciegoheen graciegoheen added the user docs [docs.getdbt.com] Needs better documentation label May 29, 2024
@graciegoheen
Copy link
Contributor

[Adding example from snapshot discussion]

Regarding the naïve timestamp approach. I feel this is a pretty big problem because it can result in the wrong data being written. In our case, we use Fivetran to load Snowflake. We use a timestamp snapshot strategy with the _FIVETRAN_SYNCED column as the updated_at config. We also include "where _fivetran_deleted = FALSE" on our snapshots so when the row is deleted in the source the snapshot will indicate the row is no longer valid. The _FIVETRAN_SYNCED column is defined as a timestamp_tz and therefore when the initial snapshot occurs the dbt_valid_from and dbt_valid_to columns are also defined as timestamp_tz. When a record is then deleted from the source, the dbt snapshots sets the dbt_valid_to column to snapshot_get_time(). Snapshot_get_time() however returns a timestamp_ntz data type in the UTC time zone. This UTC time is then implicitly converted to a timestamp_tz which stores the UTC value and adds the time zone offset of the session, which in our case is currently -0500. This time is now 5 hours in the future, which is not the correct time.

As mentioned I feel this a bug and should be addressed in dbt-core. Some ideas.
- The snapshot should automatically convert the column specified in the updated_at config to timestamp_ntz.
○ This will result in the dbt_valid_from and dbt_valid_to column also being timestamp_ntz.
- The snapshot should be aware of the data type and as a result snapshot_get_time() should return that same data type.
- Return an error if the column specified in the updated_at config is anything but timestamp_ntz.

I realize some of these ideas are breaking changes so maybe a new snapshot strategy should be introduced that allows users to make the switch when ready.

Workarounds
- Manually convert the updated_at column to timestamp_ntz in the snapshot definition.
- Ensure every user performing the snapshot is configured in the UTC time zone.
- Utilize soft deletes in the source to avoid dbt using the hard delete snapshot logic.
○ This will require modifying downstream models so they account for the deleted flag.
- Override the snapshot_get_time() macro.
○ This is what we did. More details below.

We found that overriding the snapshot_get_time() was the best work around for us. I don't like the solution and it is a little risky but we changed the Snowflake command to:

to_varchar(convert_timezone('UTC', current_timestamp()),'YYYY-MM-DD HH24:MI:SS.FF3 TZHTZM')

This converts the timestamp to a varchar which retains the offset information so when it is stored to timestamp_tz or timestamp_ltz the offset is correct. When stored to a timestamp_ntz column it is also correct and in UTC because the offset is trimmed away. As mentioned this is risky because I am not sure all the places snapshot_get_time() is used and changing the data type could be problematic.

Originally posted by @jvanpee in #7018 (comment)

@graciegoheen
Copy link
Contributor

graciegoheen commented May 29, 2024

I have confirmed updated_at already accepts a SQL statement:

{% snapshot orders_snapshot %}
    {{
        config(
          unique_key='id',
          strategy='timestamp',
          target_schema='dbt_ggoheen',
          updated_at='TO_TIMESTAMP_NTZ(date)'
        )
    }}

    select * from {{ source('raw', 'orders') }}
{% endsnapshot %}
20:23:26 Began running node snapshot.coalesce_ci_demo_2023.orders_snapshot
20:23:26 1 of 1 START snapshot dbt_ggoheen.orders_snapshot .............................. [RUN]
20:23:26 Began compiling node snapshot.coalesce_ci_demo_2023.orders_snapshot
20:23:26 Began executing node snapshot.coalesce_ci_demo_2023.orders_snapshot
20:23:26 Writing runtime sql for node "snapshot.coalesce_ci_demo_2023.orders_snapshot"
20:23:26 Using snowflake connection "snapshot.coalesce_ci_demo_2023.orders_snapshot"
20:23:26 On snapshot.coalesce_ci_demo_2023.orders_snapshot: /* {"app": "dbt", "dbt_version": "2024.5.164", "profile_name": "user", "target_name": "default", "node_id": "snapshot.coalesce_ci_demo_2023.orders_snapshot"} */
create or replace transient table DEVELOPMENT.dbt_ggoheen.orders_snapshot
         as
        (

    select *,
        md5(coalesce(cast(id as varchar ), '')
         || '|' || coalesce(cast(TO_TIMESTAMP_NTZ(date) as varchar ), '')
        ) as dbt_scd_id,
        TO_TIMESTAMP_NTZ(date) as dbt_updated_at,
        TO_TIMESTAMP_NTZ(date) as dbt_valid_from,
        nullif(TO_TIMESTAMP_NTZ(date), TO_TIMESTAMP_NTZ(date)) as dbt_valid_to
    from (
        
    

    select * from raw.magic_shop.orders
    ) sbq



        );
20:23:26 Opening a new connection, currently in state closed
20:23:28 SQL status: SUCCESS 1 in 2.0 seconds
20:23:28 On snapshot.coalesce_ci_demo_2023.orders_snapshot: Close
20:23:28 1 of 1 OK snapshotted dbt_ggoheen.orders_snapshot .............................. [�[32mSUCCESS 1�[0m in 1.79s]
20:23:28 Finished running node snapshot.coalesce_ci_demo_2023.orders_snapshot

@graciegoheen graciegoheen changed the title Timestamp fields are all the same datatype (with the same timezone) Warn if updated_at field for snapshot is not same datatype as what's returned in snapshot_get_time() May 29, 2024
@graciegoheen graciegoheen changed the title Warn if updated_at field for snapshot is not same datatype as what's returned in snapshot_get_time() Warn if updated_at field for snapshot is not same datatype (or timezone) as what's returned in snapshot_get_time() May 30, 2024
@gshank gshank changed the title Warn if updated_at field for snapshot is not same datatype (or timezone) as what's returned in snapshot_get_time() Warn if updated_at field for snapshot is not same datatype as what's returned in snapshot_get_time() Jun 21, 2024
@graciegoheen
Copy link
Contributor

There is differences on what is returned by snapshot_get_time() based on adapters:

  • snowflake returns a timestamp with no timezone
  • bigquery returns a timestamp in the timezone of your session

Here’s a comparison of the inconsistencies across adapters

@FishtownBuildBot
Copy link
Collaborator

Opened a new issue in dbt-labs/docs.getdbt.com: dbt-labs/docs.getdbt.com#6012

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
snapshots Issues related to dbt's snapshot functionality user docs [docs.getdbt.com] Needs better documentation
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants