Incremental Normalization #7162
Conversation
{%- macro incremental_clause(col_emitted_at) -%}
{% if is_incremental() %}
and {{ col_emitted_at }} > (select max({{ col_emitted_at }}) from {{ this }})
FYI @andresbravog
The incremental clause is isolated in a dbt macro to make it easier for a user to override it without having to rebuild the normalization docker image. A user could do this by exporting the generated dbt project and editing the macro file to behave differently, as mentioned in #4286 (comment)
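For example, a user who exports the generated dbt project could swap in something like the following override. This is purely illustrative and not part of this PR: the 1-day lookback and the interval syntax (Postgres-specific) are made up to show the idea.

```sql
{%- macro incremental_clause(col_emitted_at) -%}
{% if is_incremental() %}
-- hypothetical override: re-process a 1-day lookback window instead of a
-- strict cutoff (interval syntax shown here is Postgres-specific)
and {{ col_emitted_at }} > (select max({{ col_emitted_at }}) - interval '1 day' from {{ this }})
{% endif %}
{%- endmacro -%}
```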
is it important for col_emitted_at to be indexed so that we avoid a full table scan on this query?
Having an and in here feels wrong; the calling context should have knowledge of how to chain these predicates together, whereas this macro can't be expected to know that. So shouldn't the context have the and?
is it important for col_emitted_at to be indexed so that we avoid a full table scan on this query?

Yes, it's important for READ performance, and it depends on the destination.
That's why on some warehouse destinations we would need to introduce the option of partitioning/clustering on raw tables. On database destinations, it might make sense to create an index.
Without those changes on the destination side, this PR starts to introduce optimization on the WRITE side at least.
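As a sketch of the "create index" idea on a database destination such as Postgres (the table and index names below are made up for illustration; this PR does not create any index):

```sql
-- Hypothetical: index the emitted_at column of a raw table so the
-- "select max(_airbyte_emitted_at)" lookup avoids a full table scan.
create index if not exists idx_airbyte_raw_users_emitted_at
    on public._airbyte_raw_users (_airbyte_emitted_at);
```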
@ChristopheDuong should this be >=? Do we have a guarantee that in between normalization runs another record with the same timestamp cannot be added? I don't think we have that guarantee; it's especially dodgy since the emitted_at timestamp is created by the worker. Since we can't rely on timestamps being monotonically increasing, I think we always have to do >=. I think that's okay, because you handle deduping records with airbyte_ab_id, so the only cost is we may re-process a handful of records. That seems fine relative to the potential of missing a few records.
(This is another argument for keeping the raw data around, like we were talking about the other day. It is definitely nice to be able to go back and re-process if we make a mistake in normalization without having to resend data.)
Yes, we can make it >= just in case.
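Putting the two review comments together, the macro would presumably end up looking something like this (a sketch based on the snippet above, with the comparison switched to >=; the final macro in the PR may differ):

```sql
{%- macro incremental_clause(col_emitted_at) -%}
{% if is_incremental() %}
and {{ col_emitted_at }} >= (select max({{ col_emitted_at }}) from {{ this }})
{% endif %}
{%- endmacro -%}
```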
/test connector=bases/base-normalization
/publish connector=bases/base-normalization
* Fix incremental update of SCD tables
* add comments
* format code
* test
* Restore partitioning on final dedup model
* Apply changes from code review
* Add some sample SQL outputs
* format code
* Remove unused method
* update comments
* add comments
* add docs
* Add test
* Fix acceptance tests clean up
* Apply suggestions from code review

Co-authored-by: Sherif A. Nada <snadalive@gmail.com>
/publish connector=bases/base-normalization
## Partitioning, clustering, sorting, indexing
Normalization produces tables that are partitioned, clustered, sorted, or indexed depending on the destination engine and on the type of tables being built. The goal of these is to make reads more performant, especially when running incremental updates.
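As an illustration of what such a configuration can look like on BigQuery, a dbt model header along these lines could be emitted; the exact fields and options chosen by normalization may differ:

```sql
{{ config(
    materialized = "incremental",
    partition_by = {"field": "_airbyte_emitted_at", "data_type": "timestamp", "granularity": "day"},
    cluster_by = ["_airbyte_emitted_at"]
) }}
```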
@avaidyanatha Maybe it's worth pointing out this new behavior in normalization (incremental + partitions) when releasing the next version.
I am not sure if some exceptions are going to be raised or not, but I've noticed BigQuery sometimes throwing exceptions about detecting that the partitioning specs have changed, thus requiring a manual drop of tables first.
Users (in BigQuery at least, others should be fine...) may have to manually delete their normalized tables in the destinations (but keep the raw tables!)
What
Closes #4286
How
This PR is pretty "standard" when using dbt and mostly follows the instructions from https://docs.getdbt.com/docs/building-a-dbt-project/building-models/configuring-incremental-models (you will find many copy/pasted parts between that doc page and this PR, both in the code and in the description)
What is an incremental model?
Incremental models are built as tables in your data warehouse: the first time a model is run, the table is built by transforming all rows of source data. On subsequent runs, dbt transforms only the rows in your source data that you tell dbt to filter for, inserting them into the table that has already been built (the target table).
Often, the rows you filter for on an incremental run will be the rows in your source data that have been created or updated since the last time dbt ran. As such, on each dbt run, your model gets built incrementally.
Using an incremental model limits the amount of data that needs to be transformed, vastly reducing the runtime of your transformations. This improves warehouse performance and reduces compute costs.
To use incremental models, you also need to tell dbt:
1. Filtering rows on an incremental run
To tell dbt which rows it should transform on an incremental run, we append the incremental_clause() macro (defined in airbyte-integrations/bases/base-normalization/dbt-project-template/macros/incremental.sql) to the where clauses at the end of every model generated by normalization. This ensures that every model filters for "new" rows, as in, rows that have been created since the last time dbt ran this model.
The best way to find the timestamp of the most recent run of this model is by checking the most recent timestamp in the target table. dbt makes it easy to query your target table by using the "{{ this }}" variable.
This condition is wrapped in jinja in an {% if is_incremental() %} statement to toggle the incremental behavior depending on the mode dbt is run with for that particular model: full_refresh vs incremental. The is_incremental() macro will return True if:
- the destination table already exists in the database,
- dbt is not running in full-refresh mode, and
- the running model is configured with materialized='incremental'.
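A minimal sketch of the resulting pattern, assuming a hypothetical "users" stream (the real generated models are more elaborate than this):

```sql
{{ config(materialized = "incremental") }}
select *
from {{ source("test_normalization", "_airbyte_raw_users") }}
-- the macro below expands to
-- "and _airbyte_emitted_at > (select max(_airbyte_emitted_at) from <target table>)"
-- only when dbt is running this model incrementally
where 1 = 1
{{ incremental_clause("_airbyte_emitted_at") }}
```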
2. Defining a uniqueness constraint (optional)
This is totally optional and we could completely ignore it (except in deduplicated tables, which rely on primary keys).
However, we made the choice to always put a unique constraint on all models generated by normalization.
unique_key is an optional parameter for incremental models that specifies a field that should be unique within your model. If the unique key of an existing row in your target table matches one of your incrementally transformed rows, the existing row will be updated. This ensures that you don't have multiple rows in your target table for a single row in your source data.
You can define unique_key in a configuration block at the top of your model. The unique_key should be a single field name that is present in your model definition. While some databases support using expressions (eg. concat(user_id, session_number)), this syntax is not universally supported, so is not recommended. If you do not have a single field that is unique, consider first creating such a field in your model.
Building this model incrementally without the unique_key parameter would result in multiple rows in the target table for a single day: one row for each time dbt runs on that day. The inclusion of the unique_key parameter instead ensures the existing row is updated.
Without unique_key, the generated SQL would be a simple insert into, whereas with the unique_key defined, it generates (when possible) a merge on unique_key statement (or an equivalent delete where unique_key; insert into where unique_key).
For every stream, we do have a single field that is unique for each row: _airbyte_ab_id in the _airbyte_raw tables. Airbyte destination connectors have always included a randomUUID value in the _airbyte_ab_id column when writing a new row. This was always present on the _airbyte_raw tables but unused in normalization until now. Thus we can use it for every normalized table too.
In the special case of deduplicated tables, we need to use the (potentially composite) primary keys defined by the user, so we first generate a single column _airbyte_unique_key by hashing the primary keys together (in the generate_scd_type_2_model function in stream_processor.py).
For these special tables that need to be deduplicated, if we don't specify the unique_key, we risk breaking the constraint of having only one row per primary key (not being fully deduplicated anymore). So it is mandatory to have unique_key = primary keys.
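In dbt terms, the generated model headers presumably boil down to configs along these lines (illustrative only, not the exact generated code):

```sql
-- regular incremental table: dedupe on the row identifier coming from the raw table
{{ config(materialized = "incremental", unique_key = "_airbyte_ab_id") }}

-- deduplicated (SCD/dedup) table: dedupe on the hash of the user-defined primary keys
{{ config(materialized = "incremental", unique_key = "_airbyte_unique_key") }}
```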
Because we rely on this new _airbyte_ab_id column being present on all normalized tables (it was not propagated there before this PR), we are effectively introducing a schema change with a new "mandatory" (non-nullable) field.
So, how does dbt handle this schema change?
New on_schema_change config in dbt version v0.21.0
Incremental models can now be configured to include an optional on_schema_change parameter to enable additional control when incremental model columns change. These options enable dbt to continue running incremental models in the presence of schema changes.
The modes we could choose are append_new_columns and sync_all_columns (the other modes, ignore and fail, do nothing or raise exceptions when detecting changes).
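For reference, opting into one of these modes is just a model config in dbt v0.21+ (shown here only to illustrate the knob; as explained below, normalization cannot rely on it):

```sql
{{ config(
    materialized = "incremental",
    on_schema_change = "sync_all_columns"
) }}
```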
Note: None of the on_schema_change behaviors backfill values in old records for newly added columns, which is exactly our case since we want _airbyte_ab_id to always be populated! Too bad, we can't leverage this... so FULL REFRESH!
To force dbt to rebuild the entire incremental model from scratch, we can use the --full-refresh flag on the command line. This flag will cause dbt to drop the existing target table in the database before rebuilding it for all-time for all rows. (Basically, it reverts to behaving like older versions of normalization.)
But normalization is always run without that flag on the CLI (since we want incremental now!) and we can't tweak it from the UI or Airbyte worker. (plus we may run into scenarios where only certain models need full-refresh and others don't, so we need a more granular way to trigger full_refresh, aka per stream)
Full refresh of existing destination tables (when migrating from an older normalization version to this PR)
To implement this more granular full_refresh decision, we override dbt's special macro should_full_refresh in airbyte-integrations/bases/base-normalization/dbt-project-template/macros/should_full_refresh.sql.
The macro tries to detect if there are existing normalized tables in the destination that were produced by older versions of normalization (i.e. that don't have an _airbyte_ab_id column). In that case, it returns true to signal a need for full_refresh.
If we didn't choose to have a unique_key constraint on _airbyte_ab_id, we wouldn't need this logic at all. But introducing this column now allows users to make a clean join between the normalized tables and the raw tables and align how the JSON blob is expanded into tabular columns. As a result, we can also ensure that one row in the raw tables produces only one row in the normalized tables. We avoid the risk that normalized rows are duplicated every time dbt is run, which is a new concern because we now run incrementally and no longer rebuild the whole table from scratch on every dbt run.
Recommended reading order
In terms of code changes:
1. airbyte-integrations/bases/base-normalization/normalization/transform_catalog/stream_processor.py
2. airbyte-integrations/bases/base-normalization/dbt-project-template/macros/incremental.sql
3. airbyte-integrations/bases/base-normalization/dbt-project-template/macros/should_full_refresh.sql
Or in terms of changes to the generated SQL:
1. airbyte-integrations/bases/base-normalization/integration_tests/normalization_test_output/postgres/test_simple_streams/models/generated/airbyte_incremental/test_normalization/exchange_rate.sql
2. airbyte-integrations/bases/base-normalization/integration_tests/normalization_test_output/postgres/test_simple_streams/models/generated/airbyte_incremental/test_normalization/dedup_exchange_rate.sql
Under the (dbt) hood
airbyte-integrations/bases/base-normalization/integration_tests/normalization_test_output/postgres/test_simple_streams/first_output/airbyte_incremental/test_normalization/exchange_rate.sql
airbyte-integrations/bases/base-normalization/integration_tests/normalization_test_output/postgres/test_simple_streams/second_output/airbyte_incremental/test_normalization/exchange_rate.sql
- insert into ... query: only the new data records are written = incremental.
- create table ... as select ... from ... query: it would rewrite the entire data = non-incremental.
- delete ... from ... query: ensures unicity of records per _airbyte_ab_id or _airbyte_unique_key.
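Schematically, an incremental run on Postgres boils down to something like this (simplified from the generated test output; table and temp-table names are illustrative):

```sql
-- dbt first stages the new batch of rows in a temporary table, then:
delete from "postgres".test_normalization."exchange_rate"
where "_airbyte_ab_id" in (select "_airbyte_ab_id" from "exchange_rate__dbt_tmp");

insert into "postgres".test_normalization."exchange_rate"
select * from "exchange_rate__dbt_tmp";
```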
Known Issues or limitations
While implementing this PR, I ran into the following issues that need to be addressed in the future: