Updates table will create "blind spots" between creation and update times #12
Comments
Yesterday I implemented this "duplicate the first update record using the creation date" approach in our own system, but then it occurred to me that we could instead check for "this is the first instance" and return the created_at field, like this:

    SELECT *,
      IF(LAG(${timestamp}) OVER (updates_window) IS NULL, ${creationTime}, ${timestamp}) AS scd_valid_from,
      LEAD(${timestamp}) OVER (updates_window) AS scd_valid_to
    FROM
      ${ref(schema, `${name}_updates`)}
    WINDOW updates_window AS (PARTITION BY id ORDER BY ${timestamp} ASC)

You can see here that the only change made to the current logic of the view is that in addition to using the […]

I did it with a window function, hopefully to avoid slowing down the query.

UPDATE: On our own dataset, which has about 6 million records using this LAG method and about 10 million records when using the synthetic originals with the original/existing view, the LAG approach was much faster, used less slot time, and scanned fewer GB than the synthetic originals approach (just running select * against both views). Perhaps in the real world these differences would be smaller because we'd join to these views with a […]
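To make the intended behavior of the LAG query above concrete, here is a minimal pure-Python sketch of the same windowing logic. It is not Dataform or BigQuery code: the function name `add_validity_window` and the dictionary keys `id`, `created_at`, and `updated_at` are assumptions chosen for illustration, with `updated_at` standing in for `${timestamp}` and `created_at` for `${creationTime}`. The sample rows are made up.

```python
from collections import defaultdict


def add_validity_window(updates):
    """Per id, order rows by updated_at and derive scd_valid_from / scd_valid_to."""
    by_id = defaultdict(list)  # PARTITION BY id
    for row in updates:
        by_id[row["id"]].append(row)

    result = []
    for rows in by_id.values():
        rows.sort(key=lambda r: r["updated_at"])  # ORDER BY timestamp ASC
        for i, row in enumerate(rows):
            is_first = i == 0  # equivalent to LAG(timestamp) IS NULL
            # Equivalent to LEAD(timestamp): the next row's timestamp, or None.
            next_ts = rows[i + 1]["updated_at"] if i + 1 < len(rows) else None
            result.append({
                **row,
                # The first record falls back to the creation time.
                "scd_valid_from": row["created_at"] if is_first else row["updated_at"],
                "scd_valid_to": next_ts,
            })
    return result


# Hypothetical update records for a single user.
updates = [
    {"id": 1, "created_at": "2024-01-01 12:00", "updated_at": "2024-01-01 12:05", "language": "es"},
    {"id": 1, "created_at": "2024-01-01 12:00", "updated_at": "2024-02-01 09:00", "language": "fr"},
]
view = add_validity_window(updates)
for row in view:
    print(row["scd_valid_from"], row["scd_valid_to"], row["language"])
```

In this sketch the first row's validity starts at the creation time rather than at its first update, so there is no gap before the first update record.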
More in [this issue](dataform-co#12).
Thank you @michaelsnook, I had this exact wish on how it should work.
Looks good to me! @michaelsnook could you please update the pull request 350org#1 to have this repo's master branch as the upstream, rather than the fork's master branch? Then I should be able to approve and merge.
This issue occurs both when backfilling records (when deploying the SCD script for the first time) and when adding new records that have no entries in the updates table yet and whose creation and update times differ.
My use case is that we have a users table, and users take actions, and the users table has some slowly changing fields like their preferred language and their country of residence. So we create the updates table, we create the view with scd_valid_to and scd_valid_from, and then we can join any action to the state of the user that existed at the time of the action (in theory).
This is very convenient and fast, except it doesn't work in two specific situations.
1. Back-filling old data: Consider a user who joined in 2020, took dozens of actions over the last 4 years, and has a recent updated_at of some time in late 2023. When I run the SCD script for the first time, I get a single history record in 2024 with an updated_at value of 2023, and the view shows scd_valid_from as the same date in 2023. This means that ALL the action records from 2020 to late 2023 will fail to join in the kind of query described above. (This is what I mean by orphan records.)
2. Changes before the first update record: Similarly, suppose the SCD script is active and running daily, and a user signs up by taking an action at 12:00 without specifying a language, so our system defaults them to English. At 12:05 they go into their profile and set their language to Spanish, and then at 2:00 the SCD job runs. It will insert a record with created_at: 12:00, updated_at: 12:05, and the view will show scd_valid_from 12:05. So the very first action they ever took (a very important one, from a business perspective) will be orphaned in the same way as in scenario 1.
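Both scenarios can be reproduced with a small pure-Python sketch of the range join. Everything in it is hypothetical: the `join_actions` helper, the field names, the half-open range comparison, and the sample rows (using the times from scenario 2) are assumptions for illustration. The single history row mimics the current view, where scd_valid_from equals updated_at.

```python
def join_actions(actions, history):
    """Inner-join each action to the history row whose validity range covers it."""
    joined, orphans = [], []
    for action in actions:
        match = next(
            (
                h for h in history
                if h["user_id"] == action["user_id"]
                and h["scd_valid_from"] <= action["at"]
                # A missing scd_valid_to means the row is still current.
                and (h["scd_valid_to"] is None or action["at"] < h["scd_valid_to"])
            ),
            None,
        )
        if match is None:
            orphans.append(action)  # no history row covers this action
        else:
            joined.append({**action, "language": match["language"]})
    return joined, orphans


# Current view output for scenario 2: validity starts at the first update (12:05).
history = [{"user_id": 1, "language": "es", "scd_valid_from": "12:05", "scd_valid_to": None}]
actions = [
    {"user_id": 1, "name": "signup", "at": "12:00"},
    {"user_id": 1, "name": "later_action", "at": "12:30"},
]
joined, orphans = join_actions(actions, history)
print("joined:", [a["name"] for a in joined])
print("orphaned:", [a["name"] for a in orphans])
```

With these inputs the 12:00 signup action has no matching history row, which is the orphaning described in both scenarios.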
It would be possible to complexify the views I'm using to join actions to histories, but this defeats the purpose of the convenience view. It might also be possible to add a configuration option for a created_at field so that the view detects when it's dealing with the first ever update record and sets scd_valid_from to the creation date (maybe there were changes in between, but we don't know, so we have to be okay with some loss of specificity in exchange for not breaking all our other queries).
I think a better solution would be to add a config option for a created_at field, and have the insert script:
select * EXCEPT(updated_at), created_at AS updated_at
With this logic the Scenario 2 situation would create two records, and the scd_valid_* view would look like this
This way, trying to join the original action from 12:00 and the second action from 12:05 will both work, whereas under the current logic, the first one will not join.
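As a rough illustration of the synthetic original record idea, the following pure-Python sketch builds the extra row and then applies the current view logic to scenario 2. The helper names `with_synthetic_original` and `current_view`, the field names, and the choice to skip the extra row when created_at equals updated_at are all assumptions, not the actual insert script.

```python
def with_synthetic_original(record):
    """Return the update record plus a copy whose updated_at is the creation time."""
    if record["created_at"] == record["updated_at"]:
        return [record]  # assumption: nothing to add when the times already match
    # Mirrors: select * EXCEPT(updated_at), created_at AS updated_at
    original = {**record, "updated_at": record["created_at"]}
    return [original, record]


def current_view(rows):
    """Current view logic for one id: valid_from = updated_at, valid_to = next updated_at."""
    rows = sorted(rows, key=lambda r: r["updated_at"])
    return [
        {
            **r,
            "scd_valid_from": r["updated_at"],
            "scd_valid_to": rows[i + 1]["updated_at"] if i + 1 < len(rows) else None,
        }
        for i, r in enumerate(rows)
    ]


# Scenario 2: created at 12:00, language changed to Spanish at 12:05.
record = {"id": 1, "language": "es", "created_at": "12:00", "updated_at": "12:05"}
view = current_view(with_synthetic_original(record))
for row in view:
    print(row["scd_valid_from"], row["scd_valid_to"], row["language"])
```

Note that in this sketch the synthetic row carries the current field values (Spanish here), because the state before the first recorded update is unknown. That is the loss of specificity mentioned earlier.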