🎉 Incremental Normalization #7162
Changes from 18 commits
@@ -0,0 +1,9 @@
{#
    This macro controls how incremental models are updated in Airbyte's normalization step
#}

{%- macro incremental_clause(col_emitted_at) -%}
{% if is_incremental() %}
and {{ col_emitted_at }} > (select max({{ col_emitted_at }}) from {{ this }})
{% endif %}
{%- endmacro -%}
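For context, when a model that calls this macro runs incrementally, `is_incremental()` is true and the clause renders into a plain SQL predicate. A minimal sketch of what `incremental_clause('_airbyte_emitted_at')` expands to, using an illustrative table name in place of `{{ this }}`:

-- Rendered sketch only: the quoted table name stands in for {{ this }} (the model being built).
and _airbyte_emitted_at > (
    select max(_airbyte_emitted_at) from "test_normalization"."dedup_exchange_rate_scd"
)
-- On the first run, or with --full-refresh, is_incremental() is false and the macro renders nothing.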
@@ -0,0 +1,51 @@
{#
    This overrides the behavior of the macro `should_full_refresh` so that a full refresh is triggered if:
    - the dbt cli is run with the --full-refresh flag, or the model is explicitly configured to full_refresh
    - the column _airbyte_ab_id does not exist in the normalized tables, so a rebuild can make sure it is well populated.
#}
{%- macro need_full_refresh(col_ab_id, target_table=this) -%}
    {%- if not execute -%}
        {{ return(false) }}
    {%- endif -%}
    {%- set found_column = [] %}
    {%- set cols = adapter.get_columns_in_relation(target_table) -%}
    {%- for col in cols -%}
        {%- if col.column == col_ab_id -%}
            {% do found_column.append(col.column) %}
        {%- endif -%}
    {%- endfor -%}
    {%- if found_column -%}
        {{ return(false) }}
    {%- else -%}
        {{ dbt_utils.log_info(target_table ~ "." ~ col_ab_id ~ " does not exist. The table needs to be rebuilt in full_refresh") }}
        {{ return(true) }}
    {%- endif -%}
{%- endmacro -%}
Review comment: where is this macro called? Can't find its usage apart from the comment in

Reply: I believe it's called here: https://github.com/dbt-labs/dbt-core/blob/34c23fe6500afda763d49f83c0ebdf4846501663/core/dbt/include/global_project/macros/materializations/incremental/incremental.sql#L9. It's an internal thing to dbt: https://docs.getdbt.com/reference/resource-configs/full_refresh#description

{%- macro should_full_refresh() -%}
    {% set config_full_refresh = config.get('full_refresh') %}
    {%- if config_full_refresh is none -%}
        {% set config_full_refresh = flags.FULL_REFRESH %}
    {%- endif -%}
    {%- if not config_full_refresh -%}
        {% set config_full_refresh = need_full_refresh(get_col_ab_id(), this) %}
    {%- endif -%}
    {% do return(config_full_refresh) %}
{%- endmacro -%}

{%- macro get_col_ab_id() -%}
    {{ adapter.dispatch('get_col_ab_id')() }}
{%- endmacro -%}

{%- macro default__get_col_ab_id() -%}
    _airbyte_ab_id
{%- endmacro -%}

{%- macro oracle__get_col_ab_id() -%}
    "_AIRBYTE_AB_ID"
{%- endmacro -%}

{%- macro snowflake__get_col_ab_id() -%}
    _AIRBYTE_AB_ID
{%- endmacro -%}
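Not part of this diff, but to illustrate the `adapter.dispatch` pattern used by `get_col_ab_id` above: dbt resolves the call to an adapter-prefixed macro (`<adapter>__get_col_ab_id`) and falls back to `default__get_col_ab_id`, so supporting another warehouse only needs one more candidate. A hedged sketch for a hypothetical additional adapter:

{#
    Sketch only: a hypothetical adapter-specific variant, not in this PR.
    dbt would pick this up automatically when running against that adapter;
    every other warehouse keeps using default__get_col_ab_id.
#}
{%- macro bigquery__get_col_ab_id() -%}
    _airbyte_ab_id
{%- endmacro -%}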
@@ -0,0 +1,38 @@
{{ config(
    schema = "test_normalization",
    unique_key = env_var('AIRBYTE_DEFAULT_UNIQUE_KEY', '_airbyte_ab_id'),
    tags = [ "top-level" ]
) }}
-- SQL model to build a Type 2 Slowly Changing Dimension (SCD) table for each record identified by their primary key
select
    {{ dbt_utils.surrogate_key([
        adapter.quote('id'),
        'currency',
        'nzd',
    ]) }} as _airbyte_unique_key,
    {{ adapter.quote('id') }},
    currency,
    {{ adapter.quote('date') }},
    timestamp_col,
    {{ adapter.quote('HKD@spéçiäl & characters') }},
    hkd_special___characters,
    nzd,
    usd,
    {{ adapter.quote('date') }} as _airbyte_start_at,
    lag({{ adapter.quote('date') }}) over (
        partition by {{ adapter.quote('id') }}, currency, cast(nzd as {{ dbt_utils.type_string() }})
        order by {{ adapter.quote('date') }} is null asc, {{ adapter.quote('date') }} desc, _airbyte_emitted_at desc
    ) as _airbyte_end_at,
    case when lag({{ adapter.quote('date') }}) over (
        partition by {{ adapter.quote('id') }}, currency, cast(nzd as {{ dbt_utils.type_string() }})
        order by {{ adapter.quote('date') }} is null asc, {{ adapter.quote('date') }} desc, _airbyte_emitted_at desc
    ) is null then 1 else 0 end as _airbyte_active_row,
    _airbyte_ab_id,
    _airbyte_emitted_at,
    _airbyte_dedup_exchange_rate_hashid
from {{ ref('dedup_exchange_rate_ab4') }}
-- dedup_exchange_rate from {{ source('test_normalization', '_airbyte_raw_dedup_exchange_rate') }}
where 1 = 1
and _airbyte_row_num = 1
{{ incremental_clause('_airbyte_emitted_at') }}
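To make the SCD columns concrete, here is a hedged sketch of how downstream queries would typically read this table (the schema-qualified name and the key placeholder are illustrative):

-- Current snapshot: the latest version of each (id, currency, nzd) key.
select * from test_normalization.dedup_exchange_rate_scd
where _airbyte_active_row = 1;

-- History of one key, ordered by its validity window.
select _airbyte_unique_key, _airbyte_start_at, _airbyte_end_at, usd
from test_normalization.dedup_exchange_rate_scd
where _airbyte_unique_key = '...'  -- surrogate key of the record of interest
order by _airbyte_start_at;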
@@ -0,0 +1,25 @@
{{ config(
    schema = "test_normalization",
    unique_key = "_airbyte_unique_key",
    tags = [ "top-level" ]
) }}
-- Final base SQL model
select
    _airbyte_unique_key,
    {{ adapter.quote('id') }},
    currency,
    {{ adapter.quote('date') }},
    timestamp_col,
    {{ adapter.quote('HKD@spéçiäl & characters') }},
    hkd_special___characters,
    nzd,
    usd,
    _airbyte_ab_id,
    _airbyte_emitted_at,
    _airbyte_dedup_exchange_rate_hashid
from {{ ref('dedup_exchange_rate_scd') }}
-- dedup_exchange_rate from {{ source('test_normalization', '_airbyte_raw_dedup_exchange_rate') }}
where 1 = 1
and _airbyte_active_row = 1
{{ incremental_clause('_airbyte_emitted_at') }}
Review comment: FYI @andresbravog
The incremental clause is isolated in a dbt macro to make it easier for a user to override it without having to rebuild the normalization docker image. It would be doable by exporting the generated dbt project and editing the macro file to behave differently, as mentioned here: #4286 (comment)
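As a hedged sketch of that override path (the lookback variable and the interval arithmetic are made-up, Postgres-flavored examples, not part of this PR): after exporting the generated dbt project, a user could edit the macro in place, for instance to widen the incremental window with a configurable lookback:

{#
    Sketch only: an edited copy of the incremental_clause macro inside the
    exported dbt project. 'incremental_lookback_days' is a hypothetical dbt var.
#}
{%- macro incremental_clause(col_emitted_at) -%}
{% if is_incremental() %}
and {{ col_emitted_at }} >= (
    select max({{ col_emitted_at }}) - interval '{{ var("incremental_lookback_days", 3) }} days'
    from {{ this }}
)
{% endif %}
{%- endmacro -%}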
Review comment: is it important for this `col_emitted_at` to be indexed so that we avoid a full table scan on this query?

Review comment: having an `and` in here feels wrong; the calling context should have knowledge of how to chain these predicates together, whereas this macro can't be expected to know that. So shouldn't the context have the `and`?

Reply (to the indexing question): Yes, it's important for READ performance, and it depends on the destination. That's why, on some warehouse destinations, we would need to introduce the option of partitioning/clustering on raw tables. Maybe on database destinations, it'd make sense to create an index. Without those changes on the destination side, this PR at least starts to introduce optimization on the WRITE side.
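For database-style destinations, the index mentioned above could look like the following sketch (Postgres syntax; the table and index names are illustrative, and nothing in this PR creates it automatically):

-- Sketch only: index the emitted_at column so the max() lookup and the
-- filter produced by incremental_clause avoid a full table scan.
create index if not exists idx_dedup_exchange_rate_scd_emitted_at
    on test_normalization.dedup_exchange_rate_scd (_airbyte_emitted_at);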
Review comment: @ChristopheDuong should this be `>=`? Do we have a guarantee that, in between normalization runs, another record with the same timestamp cannot be added? I don't think we have that guarantee, and it's especially dodgy since the emitted_at timestamp is created by the worker. Since we can't rely on timestamps being monotonically increasing, I think we always have to do `>=`. I think that's okay, because you handle deduping records with airbyte_ab_id, so the only cost is we may re-process a handful of records. That seems fine relative to the potential of missing a few records. (This is another argument for keeping the raw data around, like we were talking about the other day. It is definitely nice to be able to go back and re-process if we make a mistake in normalization without having to resend data.)
Reply: yes, we can make it `>=` just in case.
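A minimal sketch of what that follow-up change to the macro would look like (illustrative; the diff above still uses `>`):

{%- macro incremental_clause(col_emitted_at) -%}
{% if is_incremental() %}
-- '>=' may re-process rows sharing the max emitted_at value, which is safe because
-- downstream models dedupe on _airbyte_ab_id / _airbyte_unique_key.
and {{ col_emitted_at }} >= (select max({{ col_emitted_at }}) from {{ this }})
{% endif %}
{%- endmacro -%}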