diff --git a/airbyte-integrations/bases/base-normalization/.gitignore b/airbyte-integrations/bases/base-normalization/.gitignore index 039ca654b14a..59647e6336ee 100644 --- a/airbyte-integrations/bases/base-normalization/.gitignore +++ b/airbyte-integrations/bases/base-normalization/.gitignore @@ -12,5 +12,6 @@ integration_tests/normalization_test_output/*/*/*.json integration_tests/normalization_test_output/*/*/*.md integration_tests/normalization_test_output/*/*/macros/ integration_tests/normalization_test_output/*/*/tests/ +integration_tests/normalization_test_output/*/*/models/dbt_data_tests/ integration_tests/normalization_test_output/*/*/models/dbt_schema_tests/ - +integration_tests/normalization_test_output/*/*/modified_models/ diff --git a/airbyte-integrations/bases/base-normalization/Dockerfile b/airbyte-integrations/bases/base-normalization/Dockerfile index 26c4f138b114..f00326856c65 100644 --- a/airbyte-integrations/bases/base-normalization/Dockerfile +++ b/airbyte-integrations/bases/base-normalization/Dockerfile @@ -27,5 +27,5 @@ WORKDIR /airbyte ENV AIRBYTE_ENTRYPOINT "/airbyte/entrypoint.sh" ENTRYPOINT ["/airbyte/entrypoint.sh"] -LABEL io.airbyte.version=0.1.56 +LABEL io.airbyte.version=0.1.58 LABEL io.airbyte.name=airbyte/normalization diff --git a/airbyte-integrations/bases/base-normalization/README.md b/airbyte-integrations/bases/base-normalization/README.md index b01941360353..4d02b23639ec 100644 --- a/airbyte-integrations/bases/base-normalization/README.md +++ b/airbyte-integrations/bases/base-normalization/README.md @@ -118,7 +118,7 @@ or directly with pytest: NORMALIZATION_TEST_TARGET=postgres pytest airbyte-integrations/bases/base-normalization/integration_tests Note that these tests are connecting and processing data on top of real data warehouse destinations. -Therefore, valid credentials files are expected to be injected in the `secrets/` folder in order to run +Therefore, valid credentials files are expected to be injected in the `secrets/` folder in order to run (not included in git repository). This is usually automatically done by the CI thanks to the `tools/bin/ci_credentials.sh` script or you can @@ -217,6 +217,9 @@ So, for each target destination, the steps run by the tests are: 7. Execute dbt cli command: `dbt tests` from the test workspace folder to run verifications and checks with dbt. 8. Optional checks (nothing for the moment) +Note that the tests are using the normalization code from the python files directly, so it is not necessary to rebuild the docker images +in between when iterating on the code base. However, dbt cli and destination connectors are invoked thanks to the dev docker images. + ### Integration Test Checks: #### dbt schema tests: diff --git a/airbyte-integrations/bases/base-normalization/dbt-project-template-mssql/dbt_project.yml b/airbyte-integrations/bases/base-normalization/dbt-project-template-mssql/dbt_project.yml index e3dd3019fddf..8c7494fdc58f 100755 --- a/airbyte-integrations/bases/base-normalization/dbt-project-template-mssql/dbt_project.yml +++ b/airbyte-integrations/bases/base-normalization/dbt-project-template-mssql/dbt_project.yml @@ -42,17 +42,20 @@ quoting: # are materialized, and more! models: airbyte_utils: + +materialized: table generated: airbyte_ctes: +tags: airbyte_internal_cte +materialized: ephemeral - airbyte_views: - +tags: airbyte_internal_views - +materialized: view + airbyte_incremental: + +tags: incremental_tables + +materialized: incremental airbyte_tables: +tags: normalized_tables +materialized: table - +materialized: table + airbyte_views: + +tags: airbyte_internal_views + +materialized: view vars: dbt_utils_dispatch_list: ['airbyte_utils'] diff --git a/airbyte-integrations/bases/base-normalization/dbt-project-template-mysql/dbt_project.yml b/airbyte-integrations/bases/base-normalization/dbt-project-template-mysql/dbt_project.yml index e3dd3019fddf..b03cee8fe930 100755 --- a/airbyte-integrations/bases/base-normalization/dbt-project-template-mysql/dbt_project.yml +++ b/airbyte-integrations/bases/base-normalization/dbt-project-template-mysql/dbt_project.yml @@ -42,17 +42,22 @@ quoting: # are materialized, and more! models: airbyte_utils: + +materialized: table generated: airbyte_ctes: +tags: airbyte_internal_cte +materialized: ephemeral - airbyte_views: - +tags: airbyte_internal_views - +materialized: view + airbyte_incremental: + +tags: incremental_tables + # incremental is not enabled for MySql yet + #+materialized: incremental + +materialized: table airbyte_tables: +tags: normalized_tables +materialized: table - +materialized: table + airbyte_views: + +tags: airbyte_internal_views + +materialized: view vars: dbt_utils_dispatch_list: ['airbyte_utils'] diff --git a/airbyte-integrations/bases/base-normalization/dbt-project-template-oracle/dbt_project.yml b/airbyte-integrations/bases/base-normalization/dbt-project-template-oracle/dbt_project.yml index 0a37a17f886f..0ded2a42d60e 100755 --- a/airbyte-integrations/bases/base-normalization/dbt-project-template-oracle/dbt_project.yml +++ b/airbyte-integrations/bases/base-normalization/dbt-project-template-oracle/dbt_project.yml @@ -40,17 +40,22 @@ quoting: # are materialized, and more! models: airbyte_utils: + +materialized: table generated: airbyte_ctes: +tags: airbyte_internal_cte +materialized: ephemeral - airbyte_views: - +tags: airbyte_internal_views - +materialized: view + airbyte_incremental: + +tags: incremental_tables + # incremental is not enabled for Oracle yet + #+materialized: incremental + +materialized: table airbyte_tables: +tags: normalized_tables +materialized: table - +materialized: table + airbyte_views: + +tags: airbyte_internal_views + +materialized: view vars: dbt_utils_dispatch_list: ['airbyte_utils'] diff --git a/airbyte-integrations/bases/base-normalization/dbt-project-template/dbt_project.yml b/airbyte-integrations/bases/base-normalization/dbt-project-template/dbt_project.yml index 37f9cdc5f7a4..9ad815875900 100755 --- a/airbyte-integrations/bases/base-normalization/dbt-project-template/dbt_project.yml +++ b/airbyte-integrations/bases/base-normalization/dbt-project-template/dbt_project.yml @@ -42,17 +42,21 @@ quoting: # are materialized, and more! models: airbyte_utils: + +materialized: table generated: airbyte_ctes: +tags: airbyte_internal_cte +materialized: ephemeral - airbyte_views: - +tags: airbyte_internal_views - +materialized: view + airbyte_incremental: + +tags: incremental_tables + +materialized: incremental + +on_schema_change: sync_all_columns airbyte_tables: +tags: normalized_tables +materialized: table - +materialized: table + airbyte_views: + +tags: airbyte_internal_views + +materialized: view dispatch: - macro_namespace: dbt_utils diff --git a/airbyte-integrations/bases/base-normalization/dbt-project-template/macros/cross_db_utils/current_timestamp.sql b/airbyte-integrations/bases/base-normalization/dbt-project-template/macros/cross_db_utils/current_timestamp.sql new file mode 100644 index 000000000000..a9df34c9e497 --- /dev/null +++ b/airbyte-integrations/bases/base-normalization/dbt-project-template/macros/cross_db_utils/current_timestamp.sql @@ -0,0 +1,7 @@ +{% macro mysql__current_timestamp() %} + CURRENT_TIMESTAMP +{% endmacro %} + +{% macro oracle__current_timestamp() %} + CURRENT_TIMESTAMP +{% endmacro %} diff --git a/airbyte-integrations/bases/base-normalization/dbt-project-template/macros/cross_db_utils/drop_schema.sql b/airbyte-integrations/bases/base-normalization/dbt-project-template/macros/cross_db_utils/drop_schema.sql deleted file mode 100644 index 79bfa470c01a..000000000000 --- a/airbyte-integrations/bases/base-normalization/dbt-project-template/macros/cross_db_utils/drop_schema.sql +++ /dev/null @@ -1,8 +0,0 @@ -{# - Drop schema to clean up the destination database -#} -{% macro drop_schemas(schemas) %} - {% for schema in schemas %} - drop schema if exists {{ schema }} cascade; - {% endfor %} -{% endmacro %} diff --git a/airbyte-integrations/bases/base-normalization/dbt-project-template/macros/incremental.sql b/airbyte-integrations/bases/base-normalization/dbt-project-template/macros/incremental.sql new file mode 100644 index 000000000000..af02a97f605e --- /dev/null +++ b/airbyte-integrations/bases/base-normalization/dbt-project-template/macros/incremental.sql @@ -0,0 +1,36 @@ +{# + These macros control how incremental models are updated in Airbyte's normalization step + - get_max_normalized_cursor retrieve the value of the last normalized data + - incremental_clause controls the predicate to filter on new data to process incrementally +#} + +{% macro incremental_clause(col_emitted_at) -%} + {{ adapter.dispatch('incremental_clause')(col_emitted_at) }} +{%- endmacro %} + +{%- macro default__incremental_clause(col_emitted_at) -%} +{% if is_incremental() %} +and {{ col_emitted_at }} >= (select max({{ col_emitted_at }}) from {{ this }}) +{% endif %} +{%- endmacro -%} + +{# -- see https://on-systems.tech/113-beware-dbt-incremental-updates-against-snowflake-external-tables/ #} +{%- macro snowflake__incremental_clause(col_emitted_at) -%} +{% if is_incremental() %} +and {{ col_emitted_at }} >= cast('{{ get_max_normalized_cursor(col_emitted_at) }}' as {{ type_timestamp_with_timezone() }}) +{% endif %} +{%- endmacro -%} + +{% macro get_max_normalized_cursor(col_emitted_at) %} +{% if execute and is_incremental() %} + {% if env_var('INCREMENTAL_CURSOR', 'UNSET') == 'UNSET' %} + {% set query %} + select coalesce(max({{ col_emitted_at }}), cast('1970-01-01 00:00:00' as {{ type_timestamp_with_timezone() }})) from {{ this }} + {% endset %} + {% set max_cursor = run_query(query).columns[0][0] %} + {% do return(max_cursor) %} + {% else %} + {% do return(env_var('INCREMENTAL_CURSOR')) %} + {% endif %} +{% endif %} +{% endmacro %} diff --git a/airbyte-integrations/bases/base-normalization/dbt-project-template/macros/should_full_refresh.sql b/airbyte-integrations/bases/base-normalization/dbt-project-template/macros/should_full_refresh.sql new file mode 100644 index 000000000000..bee3fa3d1e37 --- /dev/null +++ b/airbyte-integrations/bases/base-normalization/dbt-project-template/macros/should_full_refresh.sql @@ -0,0 +1,51 @@ +{# + This overrides the behavior of the macro `should_full_refresh` so full refresh are triggered if: + - the dbt cli is run with --full-refresh flag or the model is configured explicitly to full_refresh + - the column _airbyte_ab_id does not exists in the normalized tables and make sure it is well populated. +#} + +{%- macro need_full_refresh(col_ab_id, target_table=this) -%} + {%- if not execute -%} + {{ return(false) }} + {%- endif -%} + {%- set found_column = [] %} + {%- set cols = adapter.get_columns_in_relation(target_table) -%} + {%- for col in cols -%} + {%- if col.column == col_ab_id -%} + {% do found_column.append(col.column) %} + {%- endif -%} + {%- endfor -%} + {%- if found_column -%} + {{ return(false) }} + {%- else -%} + {{ dbt_utils.log_info(target_table ~ "." ~ col_ab_id ~ " does not exist. The table needs to be rebuilt in full_refresh") }} + {{ return(true) }} + {%- endif -%} +{%- endmacro -%} + +{%- macro should_full_refresh() -%} + {% set config_full_refresh = config.get('full_refresh') %} + {%- if config_full_refresh is none -%} + {% set config_full_refresh = flags.FULL_REFRESH %} + {%- endif -%} + {%- if not config_full_refresh -%} + {% set config_full_refresh = need_full_refresh(get_col_ab_id(), this) %} + {%- endif -%} + {% do return(config_full_refresh) %} +{%- endmacro -%} + +{%- macro get_col_ab_id() -%} + {{ adapter.dispatch('get_col_ab_id')() }} +{%- endmacro -%} + +{%- macro default__get_col_ab_id() -%} + _airbyte_ab_id +{%- endmacro -%} + +{%- macro oracle__get_col_ab_id() -%} + "_AIRBYTE_AB_ID" +{%- endmacro -%} + +{%- macro snowflake__get_col_ab_id() -%} + _AIRBYTE_AB_ID +{%- endmacro -%} diff --git a/airbyte-integrations/bases/base-normalization/dbt-project-template/macros/star_intersect.sql b/airbyte-integrations/bases/base-normalization/dbt-project-template/macros/star_intersect.sql new file mode 100644 index 000000000000..3f3d06c4eb10 --- /dev/null +++ b/airbyte-integrations/bases/base-normalization/dbt-project-template/macros/star_intersect.sql @@ -0,0 +1,46 @@ +{# + Similar to the star macro here: https://github.com/dbt-labs/dbt-utils/blob/main/macros/sql/star.sql + + This star_intersect macro takes an additional 'intersect' relation as argument. + Its behavior is to select columns from both 'intersect' and 'from' relations with the following rules: + - if the columns are existing in both 'from' and the 'intersect' relations, then the column from 'intersect' is used + - if it's not in the both relation, then only the column in the 'from' relation is used +#} +{% macro star_intersect(from, intersect, from_alias=False, intersect_alias=False, except=[]) -%} + {%- do dbt_utils._is_relation(from, 'star_intersect') -%} + {%- do dbt_utils._is_ephemeral(from, 'star_intersect') -%} + {%- do dbt_utils._is_relation(intersect, 'star_intersect') -%} + {%- do dbt_utils._is_ephemeral(intersect, 'star_intersect') -%} + + {#-- Prevent querying of db in parsing mode. This works because this macro does not create any new refs. #} + {%- if not execute -%} + {{ return('') }} + {% endif %} + + {%- set include_cols = [] %} + {%- set cols = adapter.get_columns_in_relation(from) -%} + {%- set except = except | map("lower") | list %} + {%- for col in cols -%} + {%- if col.column|lower not in except -%} + {% do include_cols.append(col.column) %} + {%- endif %} + {%- endfor %} + + {%- set include_intersect_cols = [] %} + {%- set intersect_cols = adapter.get_columns_in_relation(intersect) -%} + {%- for col in intersect_cols -%} + {%- if col.column|lower not in except -%} + {% do include_intersect_cols.append(col.column) %} + {%- endif %} + {%- endfor %} + + {%- for col in include_cols %} + {%- if col in include_intersect_cols -%} + {%- if intersect_alias %}{{ intersect_alias }}.{% else %}{%- endif -%}{{ adapter.quote(col)|trim }} + {%- if not loop.last %},{{ '\n ' }}{% endif %} + {%- else %} + {%- if from_alias %}{{ from_alias }}.{% else %}{{ from }}.{%- endif -%}{{ adapter.quote(col)|trim }} as {{ adapter.quote(col)|trim }} + {%- if not loop.last %},{{ '\n ' }}{% endif %} + {%- endif %} + {%- endfor -%} +{%- endmacro %} diff --git a/airbyte-integrations/bases/base-normalization/integration_tests/dbt_integration_test.py b/airbyte-integrations/bases/base-normalization/integration_tests/dbt_integration_test.py index 793754bc7100..3f23919a5cc1 100644 --- a/airbyte-integrations/bases/base-normalization/integration_tests/dbt_integration_test.py +++ b/airbyte-integrations/bases/base-normalization/integration_tests/dbt_integration_test.py @@ -7,7 +7,6 @@ import os import random import re -import shutil import socket import string import subprocess @@ -298,7 +297,7 @@ def get_normalization_image(destination_type: DestinationType) -> str: else: return "airbyte/normalization:dev" - def dbt_run(self, destination_type: DestinationType, test_root_dir: str): + def dbt_check(self, destination_type: DestinationType, test_root_dir: str): """ Run the dbt CLI to perform transformations on the test raw data in the destination """ @@ -306,13 +305,17 @@ def dbt_run(self, destination_type: DestinationType, test_root_dir: str): # Perform sanity check on dbt project settings assert self.run_check_dbt_command(normalization_image, "debug", test_root_dir) assert self.run_check_dbt_command(normalization_image, "deps", test_root_dir) - final_sql_files = os.path.join(test_root_dir, "final") - shutil.rmtree(final_sql_files, ignore_errors=True) + + def dbt_run(self, destination_type: DestinationType, test_root_dir: str, force_full_refresh: bool = False): + """ + Run the dbt CLI to perform transformations on the test raw data in the destination + """ + normalization_image: str = self.get_normalization_image(destination_type) # Compile dbt models files into destination sql dialect, then run the transformation queries - assert self.run_check_dbt_command(normalization_image, "run", test_root_dir) + assert self.run_check_dbt_command(normalization_image, "run", test_root_dir, force_full_refresh) @staticmethod - def run_check_dbt_command(normalization_image: str, command: str, cwd: str) -> bool: + def run_check_dbt_command(normalization_image: str, command: str, cwd: str, force_full_refresh: bool = False) -> bool: """ Run dbt subprocess while checking and counting for "ERROR", "FAIL" or "WARNING" printed in its outputs """ @@ -327,7 +330,7 @@ def run_check_dbt_command(normalization_image: str, command: str, cwd: str) -> b "-v", f"{cwd}/build:/build", "-v", - f"{cwd}/final:/build/run/airbyte_utils/models/generated", + f"{cwd}/logs:/logs", "-v", "/tmp:/tmp", "--network", @@ -340,6 +343,9 @@ def run_check_dbt_command(normalization_image: str, command: str, cwd: str) -> b "--profiles-dir=/workspace", "--project-dir=/workspace", ] + if force_full_refresh: + commands.append("--full-refresh") + command = f"{command} --full-refresh" print("Executing: ", " ".join(commands)) print(f"Equivalent to: dbt {command} --profiles-dir={cwd} --project-dir={cwd}") with open(os.path.join(cwd, "dbt_output.log"), "ab") as f: @@ -424,6 +430,6 @@ def get_test_targets() -> List[str]: """ if os.getenv(NORMALIZATION_TEST_TARGET): target_str = os.getenv(NORMALIZATION_TEST_TARGET) - return [d.value for d in {DestinationType.from_string(s) for s in target_str.split(",")}] + return [d.value for d in {DestinationType.from_string(s.strip()) for s in target_str.split(",")}] else: return [d.value for d in DestinationType] diff --git a/airbyte-integrations/bases/base-normalization/integration_tests/normalization_test_output/bigquery/test_simple_streams/first_output/airbyte_incremental/scd/test_normalization/dedup_exchange_rate_scd.sql b/airbyte-integrations/bases/base-normalization/integration_tests/normalization_test_output/bigquery/test_simple_streams/first_output/airbyte_incremental/scd/test_normalization/dedup_exchange_rate_scd.sql new file mode 100644 index 000000000000..f999aac2ea61 --- /dev/null +++ b/airbyte-integrations/bases/base-normalization/integration_tests/normalization_test_output/bigquery/test_simple_streams/first_output/airbyte_incremental/scd/test_normalization/dedup_exchange_rate_scd.sql @@ -0,0 +1,104 @@ + + + create or replace table `dataline-integration-testing`.test_normalization.`dedup_exchange_rate_scd` + partition by range_bucket( + _airbyte_active_row, + generate_array(0, 1, 1) + ) + cluster by _airbyte_unique_key, _airbyte_emitted_at + OPTIONS() + as ( + +with + +input_data as ( + select * + from `dataline-integration-testing`._airbyte_test_normalization.`dedup_exchange_rate_ab3` + -- dedup_exchange_rate from `dataline-integration-testing`.test_normalization._airbyte_raw_dedup_exchange_rate +), + +scd_data as ( + -- SQL model to build a Type 2 Slowly Changing Dimension (SCD) table for each record identified by their primary key + select + to_hex(md5(cast(concat(coalesce(cast(id as + string +), ''), '-', coalesce(cast(currency as + string +), ''), '-', coalesce(cast(NZD as + string +), '')) as + string +))) as _airbyte_unique_key, + id, + currency, + date, + timestamp_col, + HKD_special___characters, + HKD_special___characters_1, + NZD, + USD, + date as _airbyte_start_at, + lag(date) over ( + partition by id, currency, cast(NZD as + string +) + order by + date is null asc, + date desc, + _airbyte_emitted_at desc + ) as _airbyte_end_at, + case when lag(date) over ( + partition by id, currency, cast(NZD as + string +) + order by + date is null asc, + date desc, + _airbyte_emitted_at desc + ) is null then 1 else 0 end as _airbyte_active_row, + _airbyte_ab_id, + _airbyte_emitted_at, + _airbyte_dedup_exchange_rate_hashid + from input_data +), +dedup_data as ( + select + -- we need to ensure de-duplicated rows for merge/update queries + -- additionally, we generate a unique key for the scd table + row_number() over ( + partition by _airbyte_unique_key, _airbyte_start_at, _airbyte_emitted_at + order by _airbyte_ab_id + ) as _airbyte_row_num, + to_hex(md5(cast(concat(coalesce(cast(_airbyte_unique_key as + string +), ''), '-', coalesce(cast(_airbyte_start_at as + string +), ''), '-', coalesce(cast(_airbyte_emitted_at as + string +), '')) as + string +))) as _airbyte_unique_key_scd, + scd_data.* + from scd_data +) +select + _airbyte_unique_key, + _airbyte_unique_key_scd, + id, + currency, + date, + timestamp_col, + HKD_special___characters, + HKD_special___characters_1, + NZD, + USD, + _airbyte_start_at, + _airbyte_end_at, + _airbyte_active_row, + _airbyte_ab_id, + _airbyte_emitted_at, + CURRENT_TIMESTAMP() as _airbyte_normalized_at, + _airbyte_dedup_exchange_rate_hashid +from dedup_data where _airbyte_row_num = 1 + ); + \ No newline at end of file diff --git a/airbyte-integrations/bases/base-normalization/integration_tests/normalization_test_output/bigquery/test_simple_streams/models/generated/airbyte_incremental/scd/test_normalization/dedup_exchange_rate_scd.sql b/airbyte-integrations/bases/base-normalization/integration_tests/normalization_test_output/bigquery/test_simple_streams/models/generated/airbyte_incremental/scd/test_normalization/dedup_exchange_rate_scd.sql new file mode 100644 index 000000000000..35175abb8ed5 --- /dev/null +++ b/airbyte-integrations/bases/base-normalization/integration_tests/normalization_test_output/bigquery/test_simple_streams/models/generated/airbyte_incremental/scd/test_normalization/dedup_exchange_rate_scd.sql @@ -0,0 +1,123 @@ +{{ config( + cluster_by = ["_airbyte_unique_key","_airbyte_emitted_at"], + partition_by = {"field": "_airbyte_active_row", "data_type": "int64", "range": {"start": 0, "end": 1, "interval": 1}}, + unique_key = "_airbyte_unique_key_scd", + schema = "test_normalization", + tags = [ "top-level" ] +) }} +with +{% if is_incremental() %} +new_data as ( + -- retrieve incremental "new" data + select + * + from {{ ref('dedup_exchange_rate_ab3') }} + -- dedup_exchange_rate from {{ source('test_normalization', '_airbyte_raw_dedup_exchange_rate') }} + where 1 = 1 + {{ incremental_clause('_airbyte_emitted_at') }} +), +new_data_ids as ( + -- build a subset of _airbyte_unique_key from rows that are new + select distinct + {{ dbt_utils.surrogate_key([ + 'id', + 'currency', + 'NZD', + ]) }} as _airbyte_unique_key + from new_data +), +previous_active_scd_data as ( + -- retrieve "incomplete old" data that needs to be updated with an end date because of new changes + select + {{ star_intersect(ref('dedup_exchange_rate_ab3'), this, from_alias='inc_data', intersect_alias='this_data') }} + from {{ this }} as this_data + -- make a join with new_data using primary key to filter active data that need to be updated only + join new_data_ids on this_data._airbyte_unique_key = new_data_ids._airbyte_unique_key + -- force left join to NULL values (we just need to transfer column types only for the star_intersect macro) + left join {{ ref('dedup_exchange_rate_ab3') }} as inc_data on 1 = 0 + where _airbyte_active_row = 1 +), +input_data as ( + select {{ dbt_utils.star(ref('dedup_exchange_rate_ab3')) }} from new_data + union all + select {{ dbt_utils.star(ref('dedup_exchange_rate_ab3')) }} from previous_active_scd_data +), +{% else %} +input_data as ( + select * + from {{ ref('dedup_exchange_rate_ab3') }} + -- dedup_exchange_rate from {{ source('test_normalization', '_airbyte_raw_dedup_exchange_rate') }} +), +{% endif %} +scd_data as ( + -- SQL model to build a Type 2 Slowly Changing Dimension (SCD) table for each record identified by their primary key + select + {{ dbt_utils.surrogate_key([ + 'id', + 'currency', + 'NZD', + ]) }} as _airbyte_unique_key, + id, + currency, + date, + timestamp_col, + HKD_special___characters, + HKD_special___characters_1, + NZD, + USD, + date as _airbyte_start_at, + lag(date) over ( + partition by id, currency, cast(NZD as {{ dbt_utils.type_string() }}) + order by + date is null asc, + date desc, + _airbyte_emitted_at desc + ) as _airbyte_end_at, + case when lag(date) over ( + partition by id, currency, cast(NZD as {{ dbt_utils.type_string() }}) + order by + date is null asc, + date desc, + _airbyte_emitted_at desc + ) is null then 1 else 0 end as _airbyte_active_row, + _airbyte_ab_id, + _airbyte_emitted_at, + _airbyte_dedup_exchange_rate_hashid + from input_data +), +dedup_data as ( + select + -- we need to ensure de-duplicated rows for merge/update queries + -- additionally, we generate a unique key for the scd table + row_number() over ( + partition by _airbyte_unique_key, _airbyte_start_at, _airbyte_emitted_at + order by _airbyte_ab_id + ) as _airbyte_row_num, + {{ dbt_utils.surrogate_key([ + '_airbyte_unique_key', + '_airbyte_start_at', + '_airbyte_emitted_at' + ]) }} as _airbyte_unique_key_scd, + scd_data.* + from scd_data +) +select + _airbyte_unique_key, + _airbyte_unique_key_scd, + id, + currency, + date, + timestamp_col, + HKD_special___characters, + HKD_special___characters_1, + NZD, + USD, + _airbyte_start_at, + _airbyte_end_at, + _airbyte_active_row, + _airbyte_ab_id, + _airbyte_emitted_at, + {{ current_timestamp() }} as _airbyte_normalized_at, + _airbyte_dedup_exchange_rate_hashid +from dedup_data where _airbyte_row_num = 1 + diff --git a/airbyte-integrations/bases/base-normalization/integration_tests/normalization_test_output/bigquery/test_simple_streams/second_output/airbyte_incremental/scd/test_normalization/dedup_exchange_rate_scd.sql b/airbyte-integrations/bases/base-normalization/integration_tests/normalization_test_output/bigquery/test_simple_streams/second_output/airbyte_incremental/scd/test_normalization/dedup_exchange_rate_scd.sql new file mode 100644 index 000000000000..591dfe0b4c34 --- /dev/null +++ b/airbyte-integrations/bases/base-normalization/integration_tests/normalization_test_output/bigquery/test_simple_streams/second_output/airbyte_incremental/scd/test_normalization/dedup_exchange_rate_scd.sql @@ -0,0 +1,27 @@ + + + + + + + + merge into `dataline-integration-testing`.test_normalization.`dedup_exchange_rate_scd` as DBT_INTERNAL_DEST + using ( + select * from `dataline-integration-testing`.test_normalization.`dedup_exchange_rate_scd__dbt_tmp` + ) as DBT_INTERNAL_SOURCE + on + DBT_INTERNAL_SOURCE._airbyte_unique_key_scd = DBT_INTERNAL_DEST._airbyte_unique_key_scd + + + + when matched then update set + `_airbyte_unique_key` = DBT_INTERNAL_SOURCE.`_airbyte_unique_key`,`_airbyte_unique_key_scd` = DBT_INTERNAL_SOURCE.`_airbyte_unique_key_scd`,`id` = DBT_INTERNAL_SOURCE.`id`,`currency` = DBT_INTERNAL_SOURCE.`currency`,`date` = DBT_INTERNAL_SOURCE.`date`,`timestamp_col` = DBT_INTERNAL_SOURCE.`timestamp_col`,`HKD_special___characters` = DBT_INTERNAL_SOURCE.`HKD_special___characters`,`HKD_special___characters_1` = DBT_INTERNAL_SOURCE.`HKD_special___characters_1`,`NZD` = DBT_INTERNAL_SOURCE.`NZD`,`USD` = DBT_INTERNAL_SOURCE.`USD`,`_airbyte_start_at` = DBT_INTERNAL_SOURCE.`_airbyte_start_at`,`_airbyte_end_at` = DBT_INTERNAL_SOURCE.`_airbyte_end_at`,`_airbyte_active_row` = DBT_INTERNAL_SOURCE.`_airbyte_active_row`,`_airbyte_ab_id` = DBT_INTERNAL_SOURCE.`_airbyte_ab_id`,`_airbyte_emitted_at` = DBT_INTERNAL_SOURCE.`_airbyte_emitted_at`,`_airbyte_normalized_at` = DBT_INTERNAL_SOURCE.`_airbyte_normalized_at`,`_airbyte_dedup_exchange_rate_hashid` = DBT_INTERNAL_SOURCE.`_airbyte_dedup_exchange_rate_hashid` + + + when not matched then insert + (`_airbyte_unique_key`, `_airbyte_unique_key_scd`, `id`, `currency`, `date`, `timestamp_col`, `HKD_special___characters`, `HKD_special___characters_1`, `NZD`, `USD`, `_airbyte_start_at`, `_airbyte_end_at`, `_airbyte_active_row`, `_airbyte_ab_id`, `_airbyte_emitted_at`, `_airbyte_normalized_at`, `_airbyte_dedup_exchange_rate_hashid`) + values + (`_airbyte_unique_key`, `_airbyte_unique_key_scd`, `id`, `currency`, `date`, `timestamp_col`, `HKD_special___characters`, `HKD_special___characters_1`, `NZD`, `USD`, `_airbyte_start_at`, `_airbyte_end_at`, `_airbyte_active_row`, `_airbyte_ab_id`, `_airbyte_emitted_at`, `_airbyte_normalized_at`, `_airbyte_dedup_exchange_rate_hashid`) + + + \ No newline at end of file diff --git a/airbyte-integrations/bases/base-normalization/integration_tests/normalization_test_output/bigquery/test_simple_streams/third_output/airbyte_incremental/scd/test_normalization/dedup_exchange_rate_scd.sql b/airbyte-integrations/bases/base-normalization/integration_tests/normalization_test_output/bigquery/test_simple_streams/third_output/airbyte_incremental/scd/test_normalization/dedup_exchange_rate_scd.sql new file mode 100644 index 000000000000..79e9dea40754 --- /dev/null +++ b/airbyte-integrations/bases/base-normalization/integration_tests/normalization_test_output/bigquery/test_simple_streams/third_output/airbyte_incremental/scd/test_normalization/dedup_exchange_rate_scd.sql @@ -0,0 +1,27 @@ + + + + + + + + merge into `dataline-integration-testing`.test_normalization.`dedup_exchange_rate_scd` as DBT_INTERNAL_DEST + using ( + select * from `dataline-integration-testing`.test_normalization.`dedup_exchange_rate_scd__dbt_tmp` + ) as DBT_INTERNAL_SOURCE + on + DBT_INTERNAL_SOURCE._airbyte_unique_key_scd = DBT_INTERNAL_DEST._airbyte_unique_key_scd + + + + when matched then update set + `_airbyte_unique_key` = DBT_INTERNAL_SOURCE.`_airbyte_unique_key`,`_airbyte_unique_key_scd` = DBT_INTERNAL_SOURCE.`_airbyte_unique_key_scd`,`id` = DBT_INTERNAL_SOURCE.`id`,`currency` = DBT_INTERNAL_SOURCE.`currency`,`date` = DBT_INTERNAL_SOURCE.`date`,`timestamp_col` = DBT_INTERNAL_SOURCE.`timestamp_col`,`HKD_special___characters` = DBT_INTERNAL_SOURCE.`HKD_special___characters`,`NZD` = DBT_INTERNAL_SOURCE.`NZD`,`USD` = DBT_INTERNAL_SOURCE.`USD`,`_airbyte_start_at` = DBT_INTERNAL_SOURCE.`_airbyte_start_at`,`_airbyte_end_at` = DBT_INTERNAL_SOURCE.`_airbyte_end_at`,`_airbyte_active_row` = DBT_INTERNAL_SOURCE.`_airbyte_active_row`,`_airbyte_ab_id` = DBT_INTERNAL_SOURCE.`_airbyte_ab_id`,`_airbyte_emitted_at` = DBT_INTERNAL_SOURCE.`_airbyte_emitted_at`,`_airbyte_normalized_at` = DBT_INTERNAL_SOURCE.`_airbyte_normalized_at`,`_airbyte_dedup_exchange_rate_hashid` = DBT_INTERNAL_SOURCE.`_airbyte_dedup_exchange_rate_hashid`,`new_column` = DBT_INTERNAL_SOURCE.`new_column` + + + when not matched then insert + (`_airbyte_unique_key`, `_airbyte_unique_key_scd`, `id`, `currency`, `date`, `timestamp_col`, `HKD_special___characters`, `NZD`, `USD`, `_airbyte_start_at`, `_airbyte_end_at`, `_airbyte_active_row`, `_airbyte_ab_id`, `_airbyte_emitted_at`, `_airbyte_normalized_at`, `_airbyte_dedup_exchange_rate_hashid`, `new_column`) + values + (`_airbyte_unique_key`, `_airbyte_unique_key_scd`, `id`, `currency`, `date`, `timestamp_col`, `HKD_special___characters`, `NZD`, `USD`, `_airbyte_start_at`, `_airbyte_end_at`, `_airbyte_active_row`, `_airbyte_ab_id`, `_airbyte_emitted_at`, `_airbyte_normalized_at`, `_airbyte_dedup_exchange_rate_hashid`, `new_column`) + + + \ No newline at end of file diff --git a/airbyte-integrations/bases/base-normalization/integration_tests/normalization_test_output/postgres/test_simple_streams/first_output/airbyte_incremental/scd/test_normalization/dedup_exchange_rate_scd.sql b/airbyte-integrations/bases/base-normalization/integration_tests/normalization_test_output/postgres/test_simple_streams/first_output/airbyte_incremental/scd/test_normalization/dedup_exchange_rate_scd.sql new file mode 100644 index 000000000000..88f3125c0583 --- /dev/null +++ b/airbyte-integrations/bases/base-normalization/integration_tests/normalization_test_output/postgres/test_simple_streams/first_output/airbyte_incremental/scd/test_normalization/dedup_exchange_rate_scd.sql @@ -0,0 +1,99 @@ + + + + create table "postgres".test_normalization."dedup_exchange_rate_scd" + as ( + +with + +input_data as ( + select * + from "postgres"._airbyte_test_normalization."dedup_exchange_rate_ab3" + -- dedup_exchange_rate from "postgres".test_normalization._airbyte_raw_dedup_exchange_rate +), + +scd_data as ( + -- SQL model to build a Type 2 Slowly Changing Dimension (SCD) table for each record identified by their primary key + select + md5(cast(coalesce(cast("id" as + varchar +), '') || '-' || coalesce(cast(currency as + varchar +), '') || '-' || coalesce(cast(nzd as + varchar +), '') as + varchar +)) as _airbyte_unique_key, + "id", + currency, + "date", + timestamp_col, + "HKD@spéçiäl & characters", + hkd_special___characters, + nzd, + usd, + "date" as _airbyte_start_at, + lag("date") over ( + partition by "id", currency, cast(nzd as + varchar +) + order by + "date" is null asc, + "date" desc, + _airbyte_emitted_at desc + ) as _airbyte_end_at, + case when lag("date") over ( + partition by "id", currency, cast(nzd as + varchar +) + order by + "date" is null asc, + "date" desc, + _airbyte_emitted_at desc + ) is null then 1 else 0 end as _airbyte_active_row, + _airbyte_ab_id, + _airbyte_emitted_at, + _airbyte_dedup_exchange_rate_hashid + from input_data +), +dedup_data as ( + select + -- we need to ensure de-duplicated rows for merge/update queries + -- additionally, we generate a unique key for the scd table + row_number() over ( + partition by _airbyte_unique_key, _airbyte_start_at, _airbyte_emitted_at + order by _airbyte_ab_id + ) as _airbyte_row_num, + md5(cast(coalesce(cast(_airbyte_unique_key as + varchar +), '') || '-' || coalesce(cast(_airbyte_start_at as + varchar +), '') || '-' || coalesce(cast(_airbyte_emitted_at as + varchar +), '') as + varchar +)) as _airbyte_unique_key_scd, + scd_data.* + from scd_data +) +select + _airbyte_unique_key, + _airbyte_unique_key_scd, + "id", + currency, + "date", + timestamp_col, + "HKD@spéçiäl & characters", + hkd_special___characters, + nzd, + usd, + _airbyte_start_at, + _airbyte_end_at, + _airbyte_active_row, + _airbyte_ab_id, + _airbyte_emitted_at, + now() as _airbyte_normalized_at, + _airbyte_dedup_exchange_rate_hashid +from dedup_data where _airbyte_row_num = 1 + ); + \ No newline at end of file diff --git a/airbyte-integrations/bases/base-normalization/integration_tests/normalization_test_output/postgres/test_simple_streams/final/airbyte_tables/test_normalization/exchange_rate.sql b/airbyte-integrations/bases/base-normalization/integration_tests/normalization_test_output/postgres/test_simple_streams/first_output/airbyte_incremental/test_normalization/exchange_rate.sql similarity index 77% rename from airbyte-integrations/bases/base-normalization/integration_tests/normalization_test_output/postgres/test_simple_streams/final/airbyte_tables/test_normalization/exchange_rate.sql rename to airbyte-integrations/bases/base-normalization/integration_tests/normalization_test_output/postgres/test_simple_streams/first_output/airbyte_incremental/test_normalization/exchange_rate.sql index aa65fa214765..b0a2937dfba2 100644 --- a/airbyte-integrations/bases/base-normalization/integration_tests/normalization_test_output/postgres/test_simple_streams/final/airbyte_tables/test_normalization/exchange_rate.sql +++ b/airbyte-integrations/bases/base-normalization/integration_tests/normalization_test_output/postgres/test_simple_streams/first_output/airbyte_incremental/test_normalization/exchange_rate.sql @@ -1,6 +1,7 @@ + - create table "postgres".test_normalization."exchange_rate__dbt_tmp" + create table "postgres".test_normalization."exchange_rate" as ( -- Final base SQL model @@ -13,8 +14,12 @@ select hkd_special___characters, nzd, usd, + _airbyte_ab_id, _airbyte_emitted_at, _airbyte_exchange_rate_hashid from "postgres"._airbyte_test_normalization."exchange_rate_ab3" -- exchange_rate from "postgres".test_normalization._airbyte_raw_exchange_rate - ); \ No newline at end of file +where 1 = 1 + + ); + \ No newline at end of file diff --git a/airbyte-integrations/bases/base-normalization/integration_tests/normalization_test_output/postgres/test_simple_streams/models/generated/airbyte_ctes/test_normalization/exchange_rate_ab1.sql b/airbyte-integrations/bases/base-normalization/integration_tests/normalization_test_output/postgres/test_simple_streams/models/generated/airbyte_ctes/test_normalization/exchange_rate_ab1.sql index a68cce687ef9..afce0ec584db 100644 --- a/airbyte-integrations/bases/base-normalization/integration_tests/normalization_test_output/postgres/test_simple_streams/models/generated/airbyte_ctes/test_normalization/exchange_rate_ab1.sql +++ b/airbyte-integrations/bases/base-normalization/integration_tests/normalization_test_output/postgres/test_simple_streams/models/generated/airbyte_ctes/test_normalization/exchange_rate_ab1.sql @@ -1,4 +1,7 @@ -{{ config(schema="_airbyte_test_normalization", tags=["top-level-intermediate"]) }} +{{ config( + schema = "_airbyte_test_normalization", + tags = [ "top-level-intermediate" ] +) }} -- SQL model to parse JSON blob stored in a single column and extract into separated field columns as described by the JSON Schema select {{ json_extract_scalar('_airbyte_data', ['id'], ['id']) }} as {{ adapter.quote('id') }}, @@ -9,7 +12,9 @@ select {{ json_extract_scalar('_airbyte_data', ['HKD_special___characters'], ['HKD_special___characters']) }} as hkd_special___characters, {{ json_extract_scalar('_airbyte_data', ['NZD'], ['NZD']) }} as nzd, {{ json_extract_scalar('_airbyte_data', ['USD'], ['USD']) }} as usd, + _airbyte_ab_id, _airbyte_emitted_at from {{ source('test_normalization', '_airbyte_raw_exchange_rate') }} as table_alias -- exchange_rate +where 1 = 1 diff --git a/airbyte-integrations/bases/base-normalization/integration_tests/normalization_test_output/postgres/test_simple_streams/models/generated/airbyte_ctes/test_normalization/exchange_rate_ab2.sql b/airbyte-integrations/bases/base-normalization/integration_tests/normalization_test_output/postgres/test_simple_streams/models/generated/airbyte_ctes/test_normalization/exchange_rate_ab2.sql index 3a45b3f533a5..178badcb2596 100644 --- a/airbyte-integrations/bases/base-normalization/integration_tests/normalization_test_output/postgres/test_simple_streams/models/generated/airbyte_ctes/test_normalization/exchange_rate_ab2.sql +++ b/airbyte-integrations/bases/base-normalization/integration_tests/normalization_test_output/postgres/test_simple_streams/models/generated/airbyte_ctes/test_normalization/exchange_rate_ab2.sql @@ -1,4 +1,7 @@ -{{ config(schema="_airbyte_test_normalization", tags=["top-level-intermediate"]) }} +{{ config( + schema = "_airbyte_test_normalization", + tags = [ "top-level-intermediate" ] +) }} -- SQL model to cast each column to its adequate SQL type converted from the JSON schema type select cast({{ adapter.quote('id') }} as {{ dbt_utils.type_bigint() }}) as {{ adapter.quote('id') }}, @@ -9,7 +12,9 @@ select cast(hkd_special___characters as {{ dbt_utils.type_string() }}) as hkd_special___characters, cast(nzd as {{ dbt_utils.type_float() }}) as nzd, cast(usd as {{ dbt_utils.type_float() }}) as usd, + _airbyte_ab_id, _airbyte_emitted_at from {{ ref('exchange_rate_ab1') }} -- exchange_rate +where 1 = 1 diff --git a/airbyte-integrations/bases/base-normalization/integration_tests/normalization_test_output/postgres/test_simple_streams/models/generated/airbyte_ctes/test_normalization/exchange_rate_ab3.sql b/airbyte-integrations/bases/base-normalization/integration_tests/normalization_test_output/postgres/test_simple_streams/models/generated/airbyte_ctes/test_normalization/exchange_rate_ab3.sql index a6ff683db802..0469a220171c 100644 --- a/airbyte-integrations/bases/base-normalization/integration_tests/normalization_test_output/postgres/test_simple_streams/models/generated/airbyte_ctes/test_normalization/exchange_rate_ab3.sql +++ b/airbyte-integrations/bases/base-normalization/integration_tests/normalization_test_output/postgres/test_simple_streams/models/generated/airbyte_ctes/test_normalization/exchange_rate_ab3.sql @@ -1,4 +1,7 @@ -{{ config(schema="_airbyte_test_normalization", tags=["top-level-intermediate"]) }} +{{ config( + schema = "_airbyte_test_normalization", + tags = [ "top-level-intermediate" ] +) }} -- SQL model to build a hash column based on the values of this record select {{ dbt_utils.surrogate_key([ @@ -14,4 +17,5 @@ select tmp.* from {{ ref('exchange_rate_ab2') }} tmp -- exchange_rate +where 1 = 1 diff --git a/airbyte-integrations/bases/base-normalization/integration_tests/normalization_test_output/postgres/test_simple_streams/models/generated/airbyte_incremental/scd/test_normalization/dedup_exchange_rate_scd.sql b/airbyte-integrations/bases/base-normalization/integration_tests/normalization_test_output/postgres/test_simple_streams/models/generated/airbyte_incremental/scd/test_normalization/dedup_exchange_rate_scd.sql new file mode 100644 index 000000000000..ff9e861e971d --- /dev/null +++ b/airbyte-integrations/bases/base-normalization/integration_tests/normalization_test_output/postgres/test_simple_streams/models/generated/airbyte_incremental/scd/test_normalization/dedup_exchange_rate_scd.sql @@ -0,0 +1,38 @@ +{{ config( + schema = "test_normalization", + unique_key = env_var('AIRBYTE_DEFAULT_UNIQUE_KEY', '_airbyte_ab_id'), + tags = [ "top-level" ] +) }} +-- SQL model to build a Type 2 Slowly Changing Dimension (SCD) table for each record identified by their primary key +select + {{ dbt_utils.surrogate_key([ + adapter.quote('id'), + 'currency', + 'nzd', + ]) }} as _airbyte_unique_key, + {{ adapter.quote('id') }}, + currency, + {{ adapter.quote('date') }}, + timestamp_col, + {{ adapter.quote('HKD@spéçiäl & characters') }}, + hkd_special___characters, + nzd, + usd, + {{ adapter.quote('date') }} as _airbyte_start_at, + lag({{ adapter.quote('date') }}) over ( + partition by {{ adapter.quote('id') }}, currency, cast(nzd as {{ dbt_utils.type_string() }}) + order by {{ adapter.quote('date') }} is null asc, {{ adapter.quote('date') }} desc, _airbyte_emitted_at desc + ) as _airbyte_end_at, + case when lag({{ adapter.quote('date') }}) over ( + partition by {{ adapter.quote('id') }}, currency, cast(nzd as {{ dbt_utils.type_string() }}) + order by {{ adapter.quote('date') }} is null asc, {{ adapter.quote('date') }} desc, _airbyte_emitted_at desc + ) is null then 1 else 0 end as _airbyte_active_row, + _airbyte_ab_id, + _airbyte_emitted_at, + _airbyte_dedup_exchange_rate_hashid +from {{ ref('dedup_exchange_rate_ab4') }} +-- dedup_exchange_rate from {{ source('test_normalization', '_airbyte_raw_dedup_exchange_rate') }} +where 1 = 1 +and _airbyte_row_num = 1 +{{ incremental_clause('_airbyte_emitted_at') }} + diff --git a/airbyte-integrations/bases/base-normalization/integration_tests/normalization_test_output/postgres/test_simple_streams/models/generated/airbyte_incremental/test_normalization/dedup_exchange_rate.sql b/airbyte-integrations/bases/base-normalization/integration_tests/normalization_test_output/postgres/test_simple_streams/models/generated/airbyte_incremental/test_normalization/dedup_exchange_rate.sql new file mode 100644 index 000000000000..f31846c24b9c --- /dev/null +++ b/airbyte-integrations/bases/base-normalization/integration_tests/normalization_test_output/postgres/test_simple_streams/models/generated/airbyte_incremental/test_normalization/dedup_exchange_rate.sql @@ -0,0 +1,25 @@ +{{ config( + schema = "test_normalization", + unique_key = "_airbyte_unique_key", + tags = [ "top-level" ] +) }} +-- Final base SQL model +select + _airbyte_unique_key, + {{ adapter.quote('id') }}, + currency, + {{ adapter.quote('date') }}, + timestamp_col, + {{ adapter.quote('HKD@spéçiäl & characters') }}, + hkd_special___characters, + nzd, + usd, + _airbyte_ab_id, + _airbyte_emitted_at, + _airbyte_dedup_exchange_rate_hashid +from {{ ref('dedup_exchange_rate_scd') }} +-- dedup_exchange_rate from {{ source('test_normalization', '_airbyte_raw_dedup_exchange_rate') }} +where 1 = 1 +and _airbyte_active_row = 1 +{{ incremental_clause('_airbyte_emitted_at') }} + diff --git a/airbyte-integrations/bases/base-normalization/integration_tests/normalization_test_output/postgres/test_simple_streams/models/generated/airbyte_tables/test_normalization/exchange_rate.sql b/airbyte-integrations/bases/base-normalization/integration_tests/normalization_test_output/postgres/test_simple_streams/models/generated/airbyte_incremental/test_normalization/exchange_rate.sql similarity index 64% rename from airbyte-integrations/bases/base-normalization/integration_tests/normalization_test_output/postgres/test_simple_streams/models/generated/airbyte_tables/test_normalization/exchange_rate.sql rename to airbyte-integrations/bases/base-normalization/integration_tests/normalization_test_output/postgres/test_simple_streams/models/generated/airbyte_incremental/test_normalization/exchange_rate.sql index 886cca7c7e72..15a9bcb2ac23 100644 --- a/airbyte-integrations/bases/base-normalization/integration_tests/normalization_test_output/postgres/test_simple_streams/models/generated/airbyte_tables/test_normalization/exchange_rate.sql +++ b/airbyte-integrations/bases/base-normalization/integration_tests/normalization_test_output/postgres/test_simple_streams/models/generated/airbyte_incremental/test_normalization/exchange_rate.sql @@ -1,4 +1,8 @@ -{{ config(schema="test_normalization", tags=["top-level"]) }} +{{ config( + schema = "test_normalization", + unique_key = env_var('AIRBYTE_DEFAULT_UNIQUE_KEY', '_airbyte_ab_id'), + tags = [ "top-level" ] +) }} -- Final base SQL model select {{ adapter.quote('id') }}, @@ -9,8 +13,11 @@ select hkd_special___characters, nzd, usd, + _airbyte_ab_id, _airbyte_emitted_at, _airbyte_exchange_rate_hashid from {{ ref('exchange_rate_ab3') }} -- exchange_rate from {{ source('test_normalization', '_airbyte_raw_exchange_rate') }} +where 1 = 1 +{{ incremental_clause('_airbyte_emitted_at') }} diff --git a/airbyte-integrations/bases/base-normalization/integration_tests/normalization_test_output/postgres/test_simple_streams/second_output/airbyte_incremental/scd/test_normalization/dedup_exchange_rate_scd.sql b/airbyte-integrations/bases/base-normalization/integration_tests/normalization_test_output/postgres/test_simple_streams/second_output/airbyte_incremental/scd/test_normalization/dedup_exchange_rate_scd.sql new file mode 100644 index 000000000000..fb60e5523174 --- /dev/null +++ b/airbyte-integrations/bases/base-normalization/integration_tests/normalization_test_output/postgres/test_simple_streams/second_output/airbyte_incremental/scd/test_normalization/dedup_exchange_rate_scd.sql @@ -0,0 +1,14 @@ + + delete + from "postgres".test_normalization."dedup_exchange_rate_scd" + where (_airbyte_unique_key_scd) in ( + select (_airbyte_unique_key_scd) + from "dedup_exchange_rate_scd__dbt_tmp" + ); + + insert into "postgres".test_normalization."dedup_exchange_rate_scd" ("_airbyte_unique_key", "_airbyte_unique_key_scd", "id", "currency", "date", "timestamp_col", "HKD@spéçiäl & characters", "hkd_special___characters", "nzd", "usd", "_airbyte_start_at", "_airbyte_end_at", "_airbyte_active_row", "_airbyte_ab_id", "_airbyte_emitted_at", "_airbyte_normalized_at", "_airbyte_dedup_exchange_rate_hashid") + ( + select "_airbyte_unique_key", "_airbyte_unique_key_scd", "id", "currency", "date", "timestamp_col", "HKD@spéçiäl & characters", "hkd_special___characters", "nzd", "usd", "_airbyte_start_at", "_airbyte_end_at", "_airbyte_active_row", "_airbyte_ab_id", "_airbyte_emitted_at", "_airbyte_normalized_at", "_airbyte_dedup_exchange_rate_hashid" + from "dedup_exchange_rate_scd__dbt_tmp" + ); + \ No newline at end of file diff --git a/airbyte-integrations/bases/base-normalization/integration_tests/normalization_test_output/postgres/test_simple_streams/second_output/airbyte_incremental/test_normalization/exchange_rate.sql b/airbyte-integrations/bases/base-normalization/integration_tests/normalization_test_output/postgres/test_simple_streams/second_output/airbyte_incremental/test_normalization/exchange_rate.sql new file mode 100644 index 000000000000..49f01c196e0f --- /dev/null +++ b/airbyte-integrations/bases/base-normalization/integration_tests/normalization_test_output/postgres/test_simple_streams/second_output/airbyte_incremental/test_normalization/exchange_rate.sql @@ -0,0 +1,14 @@ + + delete + from "postgres".test_normalization."exchange_rate" + where (_airbyte_ab_id) in ( + select (_airbyte_ab_id) + from "exchange_rate__dbt_tmp" + ); + + insert into "postgres".test_normalization."exchange_rate" ("id", "currency", "date", "timestamp_col", "HKD@spéçiäl & characters", "hkd_special___characters", "nzd", "usd", "_airbyte_ab_id", "_airbyte_emitted_at", "_airbyte_exchange_rate_hashid") + ( + select "id", "currency", "date", "timestamp_col", "HKD@spéçiäl & characters", "hkd_special___characters", "nzd", "usd", "_airbyte_ab_id", "_airbyte_emitted_at", "_airbyte_exchange_rate_hashid" + from "exchange_rate__dbt_tmp" + ); + \ No newline at end of file diff --git a/airbyte-integrations/bases/base-normalization/integration_tests/normalization_test_output/postgres/test_simple_streams/third_output/airbyte_incremental/scd/test_normalization/dedup_exchange_rate_scd.sql b/airbyte-integrations/bases/base-normalization/integration_tests/normalization_test_output/postgres/test_simple_streams/third_output/airbyte_incremental/scd/test_normalization/dedup_exchange_rate_scd.sql new file mode 100644 index 000000000000..a5de1de2333d --- /dev/null +++ b/airbyte-integrations/bases/base-normalization/integration_tests/normalization_test_output/postgres/test_simple_streams/third_output/airbyte_incremental/scd/test_normalization/dedup_exchange_rate_scd.sql @@ -0,0 +1,14 @@ + + delete + from "postgres".test_normalization."dedup_exchange_rate_scd" + where (_airbyte_unique_key_scd) in ( + select (_airbyte_unique_key_scd) + from "dedup_exchange_rate_scd__dbt_tmp" + ); + + insert into "postgres".test_normalization."dedup_exchange_rate_scd" ("_airbyte_unique_key", "_airbyte_unique_key_scd", "currency", "date", "timestamp_col", "HKD@spéçiäl & characters", "nzd", "usd", "_airbyte_start_at", "_airbyte_end_at", "_airbyte_active_row", "_airbyte_ab_id", "_airbyte_emitted_at", "_airbyte_normalized_at", "_airbyte_dedup_exchange_rate_hashid", "new_column", "id") + ( + select "_airbyte_unique_key", "_airbyte_unique_key_scd", "currency", "date", "timestamp_col", "HKD@spéçiäl & characters", "nzd", "usd", "_airbyte_start_at", "_airbyte_end_at", "_airbyte_active_row", "_airbyte_ab_id", "_airbyte_emitted_at", "_airbyte_normalized_at", "_airbyte_dedup_exchange_rate_hashid", "new_column", "id" + from "dedup_exchange_rate_scd__dbt_tmp" + ); + \ No newline at end of file diff --git a/airbyte-integrations/bases/base-normalization/integration_tests/resources/test_nested_streams/data_input/messages_incremental.txt b/airbyte-integrations/bases/base-normalization/integration_tests/resources/test_nested_streams/data_input/messages_incremental.txt new file mode 100644 index 000000000000..acbcc644ea49 --- /dev/null +++ b/airbyte-integrations/bases/base-normalization/integration_tests/resources/test_nested_streams/data_input/messages_incremental.txt @@ -0,0 +1,19 @@ +{"type": "RECORD", "record": {"stream": "nested_stream_with_complex_columns_resulting_into_long_names", "emitted_at": 1602638599000, "data": { "id": 4.2, "date": "2020-08-29T00:00:00Z", "partition": { "double_array_data": [[ { "id": "EUR" } ]], "DATA": [ {"currency": "EUR" } ], "column`_'with\"_quotes": [ {"currency": "EUR" } ] } }}} +{"type": "RECORD", "record": {"stream": "nested_stream_with_complex_columns_resulting_into_long_names", "emitted_at": 1602638599100, "data": { "id": "test record", "date": "2020-08-31T00:00:00Z", "partition": { "double_array_data": [[ { "id": "USD" } ], [ { "id": "GBP" } ]], "DATA": [ {"currency": "EUR" } ], "column`_'with\"_quotes": [ {"currency": "EUR" } ] } }}} +{"type": "RECORD", "record": {"stream": "nested_stream_with_complex_columns_resulting_into_long_names", "emitted_at": 1602638600000, "data": { "id": "new record", "date": "2020-09-10T00:00:00Z", "partition": { "double_array_data": [[ { "id": "GBP" } ], [ { "id": "HKD" } ]], "DATA": [ {"currency": "EUR" } ], "column`_'with\"_quotes": [ {"currency": "EUR" } ] } }}} + +{"type":"RECORD","record":{"stream":"conflict_stream_name","data":{"id":1,"conflict_stream_name":{"conflict_stream_name": {"groups": "1", "custom_fields": [{"id":1, "value":3}, {"id":2, "value":4}], "conflict_stream_name": 3}}},"emitted_at":1623861660}} +{"type":"RECORD","record":{"stream":"conflict_stream_name","data":{"id":2,"conflict_stream_name":{"conflict_stream_name": {"groups": "2", "custom_fields": [{"id":1, "value":3}, {"id":2, "value":4}], "conflict_stream_name": 3}}},"emitted_at":1623861660}} + +{"type":"RECORD","record":{"stream":"conflict_stream_scalar","data":{"id":1,"conflict_stream_scalar": 2},"emitted_at":1623861660}} +{"type":"RECORD","record":{"stream":"conflict_stream_scalar","data":{"id":2,"conflict_stream_scalar": 2},"emitted_at":1623861660}} + +{"type":"RECORD","record":{"stream":"conflict_stream_array","data":{"id":1, "conflict_stream_array": {"conflict_stream_array": [{"id": 1}, {"id": 2}, {"id": 3}]}}, "emitted_at":1623861660}} +{"type":"RECORD","record":{"stream":"conflict_stream_array","data":{"id":2, "conflict_stream_array": {"conflict_stream_array": [{"id": 4}, {"id": 5}, {"id": 6}]}}, "emitted_at":1623861860}} + +{"type":"RECORD","record":{"stream":"conflict_stream_scalar","data":{"id":1,"conflict_stream_scalar": 2},"emitted_at":1623861660}} +{"type":"RECORD","record":{"stream":"conflict_stream_scalar","data":{"id":2,"conflict_stream_scalar": 2},"emitted_at":1623861660}} + +{"type":"RECORD","record":{"stream":"unnest_alias","data":{"id":1, "children": [{"ab_id": 1, "owner": {"owner_id": 1}},{"ab_id": 2, "owner": {"owner_id": 2}}]},"emitted_at":1623861660}} +{"type":"RECORD","record":{"stream":"unnest_alias","data":{"id":2, "children": [{"ab_id": 3, "owner": {"owner_id": 3}},{"ab_id": 4, "owner": {"owner_id": 4}}]},"emitted_at":1623861660}} + diff --git a/airbyte-integrations/bases/base-normalization/integration_tests/resources/test_nested_streams/data_input/replace_identifiers.json b/airbyte-integrations/bases/base-normalization/integration_tests/resources/test_nested_streams/data_input/replace_identifiers.json index 9c54ea2f29ca..e15f5b7dd7f9 100644 --- a/airbyte-integrations/bases/base-normalization/integration_tests/resources/test_nested_streams/data_input/replace_identifiers.json +++ b/airbyte-integrations/bases/base-normalization/integration_tests/resources/test_nested_streams/data_input/replace_identifiers.json @@ -28,7 +28,12 @@ } ], "snowflake": [ - { "TMP_TEST_DATA_CHECK_ROW_COUNTS": "tmp_test_data_check_row_counts" } + { + "NESTED_STREAMS_FIRST_RUN_ROW_COUNTS": "nested_streams_first_run_row_counts" + }, + { + "NESTED_STREAMS_SECOND_RUN_ROW_COUNTS": "nested_streams_second_run_row_counts" + } ], "redshift": [], "mysql": [ diff --git a/airbyte-integrations/bases/base-normalization/integration_tests/resources/test_nested_streams/dbt_data_tests/test_check_row_counts.sql b/airbyte-integrations/bases/base-normalization/integration_tests/resources/test_nested_streams/dbt_data_tests/test_check_row_counts.sql deleted file mode 100644 index 966c1477c147..000000000000 --- a/airbyte-integrations/bases/base-normalization/integration_tests/resources/test_nested_streams/dbt_data_tests/test_check_row_counts.sql +++ /dev/null @@ -1 +0,0 @@ -select * from {{ ref('tmp_test_data_check_row_counts') }} diff --git a/airbyte-integrations/bases/base-normalization/integration_tests/resources/test_nested_streams/dbt_test_config/dbt_data_tests/test_check_first_run_row_counts.sql b/airbyte-integrations/bases/base-normalization/integration_tests/resources/test_nested_streams/dbt_test_config/dbt_data_tests/test_check_first_run_row_counts.sql new file mode 100644 index 000000000000..4764acc1d39a --- /dev/null +++ b/airbyte-integrations/bases/base-normalization/integration_tests/resources/test_nested_streams/dbt_test_config/dbt_data_tests/test_check_first_run_row_counts.sql @@ -0,0 +1,2 @@ +select * from {{ ref('nested_streams_first_run_row_counts') }} +where row_count != expected_count diff --git a/airbyte-integrations/bases/base-normalization/integration_tests/resources/test_nested_streams/dbt_test_config/dbt_data_tests_incremental/test_check_second_run_row_counts.sql b/airbyte-integrations/bases/base-normalization/integration_tests/resources/test_nested_streams/dbt_test_config/dbt_data_tests_incremental/test_check_second_run_row_counts.sql new file mode 100644 index 000000000000..169bb80895e6 --- /dev/null +++ b/airbyte-integrations/bases/base-normalization/integration_tests/resources/test_nested_streams/dbt_test_config/dbt_data_tests_incremental/test_check_second_run_row_counts.sql @@ -0,0 +1,2 @@ +select * from {{ ref('nested_streams_second_run_row_counts') }} +where row_count != expected_count diff --git a/airbyte-integrations/bases/base-normalization/integration_tests/resources/test_nested_streams/dbt_data_tests_tmp/tmp_test_data_check_row_counts.sql b/airbyte-integrations/bases/base-normalization/integration_tests/resources/test_nested_streams/dbt_test_config/dbt_data_tests_tmp/nested_streams_first_run_row_counts.sql similarity index 53% rename from airbyte-integrations/bases/base-normalization/integration_tests/resources/test_nested_streams/dbt_data_tests_tmp/tmp_test_data_check_row_counts.sql rename to airbyte-integrations/bases/base-normalization/integration_tests/resources/test_nested_streams/dbt_test_config/dbt_data_tests_tmp/nested_streams_first_run_row_counts.sql index f724978e5d33..da83e42d826e 100644 --- a/airbyte-integrations/bases/base-normalization/integration_tests/resources/test_nested_streams/dbt_data_tests_tmp/tmp_test_data_check_row_counts.sql +++ b/airbyte-integrations/bases/base-normalization/integration_tests/resources/test_nested_streams/dbt_test_config/dbt_data_tests_tmp/nested_streams_first_run_row_counts.sql @@ -1,14 +1,14 @@ with table_row_counts as ( - select distinct count(*) as row_count, 2 as expected_count + select distinct '_airbyte_raw_nested_stream_with_complex_columns_resulting_into_long_names' as label, count(*) as row_count, 2 as expected_count from {{ source('test_normalization', '_airbyte_raw_nested_stream_with_complex_columns_resulting_into_long_names') }} union all - select distinct count(*) as row_count, 2 as expected_count + select distinct 'nested_stream_with_complex_columns_resulting_into_long_names' as label, count(*) as row_count, 2 as expected_count from {{ ref('nested_stream_with_complex_columns_resulting_into_long_names') }} union all - select distinct count(*) as row_count, 2 as expected_count + select distinct 'nested_stream_with_complex_columns_resulting_into_long_names_partition' as label, count(*) as row_count, 2 as expected_count from {{ ref('nested_stream_with_complex_columns_resulting_into_long_names_partition') }} union all - select count(distinct currency) as row_count, 1 as expected_count + select 'nested_stream_with_complex_columns_resulting_into_long_names_partition_DATA' as label, count(distinct currency) as row_count, 1 as expected_count from {{ ref('nested_stream_with_complex_columns_resulting_into_long_names_partition_DATA') }} -- union all -- select count(distinct id) as row_count, 3 as expected_count @@ -16,4 +16,3 @@ union all ) select * from table_row_counts -where row_count != expected_count diff --git a/airbyte-integrations/bases/base-normalization/integration_tests/resources/test_nested_streams/dbt_test_config/dbt_data_tests_tmp_incremental/nested_streams_second_run_row_counts.sql b/airbyte-integrations/bases/base-normalization/integration_tests/resources/test_nested_streams/dbt_test_config/dbt_data_tests_tmp_incremental/nested_streams_second_run_row_counts.sql new file mode 100644 index 000000000000..1d9623232229 --- /dev/null +++ b/airbyte-integrations/bases/base-normalization/integration_tests/resources/test_nested_streams/dbt_test_config/dbt_data_tests_tmp_incremental/nested_streams_second_run_row_counts.sql @@ -0,0 +1,18 @@ +with table_row_counts as ( + select distinct '_airbyte_raw_nested_stream_with_complex_columns_resulting_into_long_names' as label, count(*) as row_count, 3 as expected_count + from {{ source('test_normalization', '_airbyte_raw_nested_stream_with_complex_columns_resulting_into_long_names') }} +union all + select distinct 'nested_stream_with_complex_columns_resulting_into_long_names' as label, count(*) as row_count, 3 as expected_count + from {{ ref('nested_stream_with_complex_columns_resulting_into_long_names') }} +union all + select distinct 'nested_stream_with_complex_columns_resulting_into_long_names_partition' as label, count(*) as row_count, 3 as expected_count + from {{ ref('nested_stream_with_complex_columns_resulting_into_long_names_partition') }} +union all + select 'nested_stream_with_complex_columns_resulting_into_long_names_partition_DATA' as label, count(distinct currency) as row_count, 1 as expected_count + from {{ ref('nested_stream_with_complex_columns_resulting_into_long_names_partition_DATA') }} +-- union all +-- select count(distinct id) as row_count, 3 as expected_count +-- from {{ ref('nested_stream_with_complex_columns_resulting_into_long_names_partition_double_array_data') }} +) +select * +from table_row_counts diff --git a/airbyte-integrations/bases/base-normalization/integration_tests/resources/test_nested_streams/dbt_schema_tests/schema_test.yml b/airbyte-integrations/bases/base-normalization/integration_tests/resources/test_nested_streams/dbt_test_config/dbt_schema_tests/schema_test.yml similarity index 100% rename from airbyte-integrations/bases/base-normalization/integration_tests/resources/test_nested_streams/dbt_schema_tests/schema_test.yml rename to airbyte-integrations/bases/base-normalization/integration_tests/resources/test_nested_streams/dbt_test_config/dbt_schema_tests/schema_test.yml diff --git a/airbyte-integrations/bases/base-normalization/integration_tests/resources/test_nested_streams/dbt_test_config/dbt_schema_tests_incremental/schema_test.yml b/airbyte-integrations/bases/base-normalization/integration_tests/resources/test_nested_streams/dbt_test_config/dbt_schema_tests_incremental/schema_test.yml new file mode 100644 index 000000000000..315b65ac1633 --- /dev/null +++ b/airbyte-integrations/bases/base-normalization/integration_tests/resources/test_nested_streams/dbt_test_config/dbt_schema_tests_incremental/schema_test.yml @@ -0,0 +1,21 @@ +version: 2 + +models: + - name: nested_stream_with_complex_columns_resulting_into_long_names_partition + tests: + - dbt_utils.expression_is_true: + expression: "double_array_data is not null" + - dbt_utils.expression_is_true: + expression: "DATA is not null" + - dbt_utils.expression_is_true: + expression: "\"column`_'with\"\"_quotes\" is not null" + - name: nested_stream_with_complex_columns_resulting_into_long_names_partition_DATA + columns: + - name: currency + tests: + - not_null + - name: nested_stream_with_complex_columns_resulting_into_long_names_partition_double_array_data + columns: + - name: id + tests: + # - not_null # TODO Fix bug here diff --git a/airbyte-integrations/bases/base-normalization/integration_tests/resources/test_simple_streams/data_input/catalog.json b/airbyte-integrations/bases/base-normalization/integration_tests/resources/test_simple_streams/data_input/catalog.json index 8b76cd0d1faf..f6832f4315be 100644 --- a/airbyte-integrations/bases/base-normalization/integration_tests/resources/test_simple_streams/data_input/catalog.json +++ b/airbyte-integrations/bases/base-normalization/integration_tests/resources/test_simple_streams/data_input/catalog.json @@ -85,6 +85,26 @@ "destination_sync_mode": "append_dedup", "primary_key": [["id"], ["currency"], ["NZD"]] }, + { + "stream": { + "name": "renamed_dedup_cdc_excluded", + "json_schema": { + "type": ["null", "object"], + "properties": { + "id": { + "type": "integer" + } + } + }, + "supported_sync_modes": ["full_refresh", "incremental"], + "source_defined_cursor": true, + "default_cursor_field": [] + }, + "sync_mode": "incremental", + "cursor_field": [], + "destination_sync_mode": "append_dedup", + "primary_key": [["id"]] + }, { "stream": { "name": "dedup_cdc_excluded", @@ -97,9 +117,6 @@ "name": { "type": ["string", "null"] }, - "column`_'with\"_quotes": { - "type": ["string", "null"] - }, "_ab_cdc_lsn": { "type": ["null", "number"] }, @@ -150,7 +167,7 @@ "source_defined_cursor": true, "default_cursor_field": [] }, - "sync_mode": "incremental", + "sync_mode": "full_refresh", "cursor_field": [], "destination_sync_mode": "append_dedup", "primary_key": [["id"]] diff --git a/airbyte-integrations/bases/base-normalization/integration_tests/resources/test_simple_streams/data_input/catalog_schema_change.json b/airbyte-integrations/bases/base-normalization/integration_tests/resources/test_simple_streams/data_input/catalog_schema_change.json new file mode 100644 index 000000000000..4d5cd0e00c04 --- /dev/null +++ b/airbyte-integrations/bases/base-normalization/integration_tests/resources/test_simple_streams/data_input/catalog_schema_change.json @@ -0,0 +1,121 @@ +{ + "streams": [ + { + "stream": { + "name": "exchange_rate", + "json_schema": { + "type": ["null", "object"], + "properties": { + "id": { + "type": "number" + }, + "currency": { + "type": "string" + }, + "new_column": { + "type": "number" + }, + "date": { + "type": "string", + "format": "date" + }, + "timestamp_col": { + "type": "string", + "format": "date-time" + }, + "HKD@spéçiäl & characters": { + "type": "number" + }, + "NZD": { + "type": "number" + }, + "USD": { + "type": "number" + } + } + }, + "supported_sync_modes": ["incremental"], + "source_defined_cursor": true, + "default_cursor_field": [] + }, + "sync_mode": "incremental", + "cursor_field": [], + "destination_sync_mode": "overwrite" + }, + { + "stream": { + "name": "dedup_exchange_rate", + "json_schema": { + "type": ["null", "object"], + "properties": { + "id": { + "type": "number" + }, + "currency": { + "type": "string" + }, + "new_column": { + "type": "number" + }, + "date": { + "type": "string", + "format": "date" + }, + "timestamp_col": { + "type": "string", + "format": "date-time" + }, + "HKD@spéçiäl & characters": { + "type": "number" + }, + "NZD": { + "type": "number" + }, + "USD": { + "type": "integer" + } + } + }, + "supported_sync_modes": ["incremental"], + "source_defined_cursor": true, + "default_cursor_field": [] + }, + "sync_mode": "incremental", + "cursor_field": ["date"], + "destination_sync_mode": "append_dedup", + "primary_key": [["id"], ["currency"], ["NZD"]] + }, + { + "stream": { + "name": "renamed_dedup_cdc_excluded", + "json_schema": { + "type": ["null", "object"], + "properties": { + "id": { + "type": "integer" + }, + "name": { + "type": ["string", "null"] + }, + "_ab_cdc_lsn": { + "type": ["null", "number"] + }, + "_ab_cdc_updated_at": { + "type": ["null", "number"] + }, + "_ab_cdc_deleted_at": { + "type": ["null", "number"] + } + } + }, + "supported_sync_modes": ["full_refresh", "incremental"], + "source_defined_cursor": true, + "default_cursor_field": [] + }, + "sync_mode": "incremental", + "cursor_field": [], + "destination_sync_mode": "append_dedup", + "primary_key": [["id"]] + } + ] +} diff --git a/airbyte-integrations/bases/base-normalization/integration_tests/resources/test_simple_streams/data_input/messages_incremental.txt b/airbyte-integrations/bases/base-normalization/integration_tests/resources/test_simple_streams/data_input/messages_incremental.txt new file mode 100644 index 000000000000..77dbc6f073f1 --- /dev/null +++ b/airbyte-integrations/bases/base-normalization/integration_tests/resources/test_simple_streams/data_input/messages_incremental.txt @@ -0,0 +1,21 @@ +{"type": "RECORD", "record": {"stream": "exchange_rate", "emitted_at": 1602637990800, "data": { "id": 2, "currency": "EUR", "date": "", "timestamp_col": "", "NZD": 2.43, "HKD@spéçiäl & characters": 5.4, "HKD_special___characters": "column name collision?"}}} +{"type": "RECORD", "record": {"stream": "exchange_rate", "emitted_at": 1602637990900, "data": { "id": 3, "currency": "GBP", "NZD": 3.14, "HKD@spéçiäl & characters": 9.2, "HKD_special___characters": "column name collision?"}}} +{"type": "RECORD", "record": {"stream": "exchange_rate", "emitted_at": 1602650000000, "data": { "id": 2, "currency": "EUR", "NZD": 3.89, "HKD@spéçiäl & characters": 14.05, "HKD_special___characters": "column name collision?"}}} +{"type": "RECORD", "record": {"stream": "exchange_rate", "emitted_at": 1602650010000, "data": { "id": 4, "currency": "HKD", "NZD": 1.19, "HKD@spéçiäl & characters": 0.01, "HKD_special___characters": "column name collision?"}}} +{"type": "RECORD", "record": {"stream": "exchange_rate", "emitted_at": 1602650011000, "data": { "id": 1, "currency": "USD", "date": "2020-10-14", "timestamp_col": "2020-10-14T00:00:00.000-00", "NZD": 1.14, "HKD@spéçiäl & characters": 9.5, "HKD_special___characters": "column name collision?"}}} + +{"type": "RECORD", "record": {"stream": "dedup_exchange_rate", "emitted_at": 1602637990800, "data": { "id": 2, "currency": "EUR", "date": "", "timestamp_col": "", "NZD": 2.43, "HKD@spéçiäl & characters": 5.4, "HKD_special___characters": "column name collision?"}}} +{"type": "RECORD", "record": {"stream": "dedup_exchange_rate", "emitted_at": 1602637990900, "data": { "id": 3, "currency": "GBP", "NZD": 3.14, "HKD@spéçiäl & characters": 9.2, "HKD_special___characters": "column name collision?"}}} +{"type": "RECORD", "record": {"stream": "dedup_exchange_rate", "emitted_at": 1602650000000, "data": { "id": 2, "currency": "EUR", "NZD": 3.89, "HKD@spéçiäl & characters": 14.05, "HKD_special___characters": "column name collision?"}}} +{"type": "RECORD", "record": {"stream": "dedup_exchange_rate", "emitted_at": 1602650010000, "data": { "id": 4, "currency": "HKD", "NZD": 1.19, "HKD@spéçiäl & characters": 0.01, "HKD_special___characters": "column name collision?"}}} +{"type": "RECORD", "record": {"stream": "dedup_exchange_rate", "emitted_at": 1602650011000, "data": { "id": 1, "currency": "USD", "date": "2020-10-14", "timestamp_col": "2020-10-14T00:00:00.000-00", "NZD": 1.14, "HKD@spéçiäl & characters": 9.5, "HKD_special___characters": "column name collision?"}}} + +{"type":"RECORD","record":{"stream":"dedup_cdc_excluded","data":{"id":5,"name":"vw","column`_'with\"_quotes":"ma\"z`d'a","_ab_cdc_updated_at":1623849314663,"_ab_cdc_lsn":26975264,"_ab_cdc_deleted_at":null},"emitted_at":1623860160}} +{"type":"RECORD","record":{"stream":"dedup_cdc_excluded","data":{"id":5,"name":null,"column`_'with\"_quotes":"ma\"z`d'a","_ab_cdc_updated_at":1623900000000,"_ab_cdc_lsn":28010252,"_ab_cdc_deleted_at":1623900000000},"emitted_at":1623900000000}} + +{"type":"RECORD","record":{"stream":"pos_dedup_cdcx","data":{"id":1,"name":"mazda","_ab_cdc_updated_at":1623849130530,"_ab_cdc_lsn":26971624,"_ab_cdc_log_pos": 33274,"_ab_cdc_deleted_at":null},"emitted_at":1623859926}} +{"type":"RECORD","record":{"stream":"pos_dedup_cdcx","data":{"id":2,"name":"toyata","_ab_cdc_updated_at":1623849130549,"_ab_cdc_lsn":26971624,"_ab_cdc_log_pos": 33275,"_ab_cdc_deleted_at":null},"emitted_at":1623859926}} +{"type":"RECORD","record":{"stream":"pos_dedup_cdcx","data":{"id":2,"name":"bmw","_ab_cdc_updated_at":1623849314535,"_ab_cdc_lsn":26974776,"_ab_cdc_log_pos": 33278,"_ab_cdc_deleted_at":null},"emitted_at":1623860160}} +{"type":"RECORD","record":{"stream":"pos_dedup_cdcx","data":{"id":3,"name":null,"_ab_cdc_updated_at":1623849314791,"_ab_cdc_lsn":26975440,"_ab_cdc_log_pos": 33274,"_ab_cdc_deleted_at":1623849314791},"emitted_at":1623860160}} +{"type":"RECORD","record":{"stream":"pos_dedup_cdcx","data":{"id":4,"name":"lotus","_ab_cdc_updated_at":1623850868237,"_ab_cdc_lsn":27010048,"_ab_cdc_log_pos": 33271,"_ab_cdc_deleted_at":null},"emitted_at":1623861660}} +{"type":"RECORD","record":{"stream":"pos_dedup_cdcx","data":{"id":4,"name":null,"_ab_cdc_updated_at":1623850868371,"_ab_cdc_lsn":27010232,"_ab_cdc_log_pos": 33279,"_ab_cdc_deleted_at":1623850868371},"emitted_at":1623861660}} diff --git a/airbyte-integrations/bases/base-normalization/integration_tests/resources/test_simple_streams/data_input/messages_schema_change.txt b/airbyte-integrations/bases/base-normalization/integration_tests/resources/test_simple_streams/data_input/messages_schema_change.txt new file mode 100644 index 000000000000..4491673d040c --- /dev/null +++ b/airbyte-integrations/bases/base-normalization/integration_tests/resources/test_simple_streams/data_input/messages_schema_change.txt @@ -0,0 +1,13 @@ +{"type": "RECORD", "record": {"stream": "exchange_rate", "emitted_at": 1602661281900, "data": { "id": 3.14, "currency": "EUR", "new_column": 2.1, "date": "2020-11-01", "timestamp_col": "2020-11-01T00:00:00Z", "NZD": 2.43, "HKD@spéçiäl & characters": 2.12, "USD": 7}}} +{"type": "RECORD", "record": {"stream": "exchange_rate", "emitted_at": 1602661291900, "data": { "id": 0.12, "currency": "GBP", "new_column": 3.81, "date": "2020-11-01", "timestamp_col": "2020-11-01T00:00:00Z", "NZD": 3.14, "HKD@spéçiäl & characters": 3.01, "USD": 11}}} +{"type": "RECORD", "record": {"stream": "exchange_rate", "emitted_at": 1602661381900, "data": { "id": 4.22, "currency": "EUR", "new_column": 89.1, "date": "2020-11-01", "timestamp_col": "2020-11-01T00:00:00Z", "NZD": 3.89, "HKD@spéçiäl & characters": 8.88, "USD": 10}}} +{"type": "RECORD", "record": {"stream": "exchange_rate", "emitted_at": 1602661481900, "data": { "id": 1, "currency": "HKD", "new_column": 91.11, "date": "2020-11-01", "timestamp_col": "2020-11-01T00:00:00Z", "NZD": 1.19, "HKD@spéçiäl & characters": 99.1, "USD": 10}}} + +{"type": "RECORD", "record": {"stream": "dedup_exchange_rate", "emitted_at": 1602661281900, "data": { "id": 3.14, "currency": "EUR", "new_column": 2.1, "date": "2020-11-01", "timestamp_col": "2020-11-01T00:00:00Z", "NZD": 2.43, "HKD@spéçiäl & characters": 2.12, "USD": 7}}} +{"type": "RECORD", "record": {"stream": "dedup_exchange_rate", "emitted_at": 1602661291900, "data": { "id": 0.12, "currency": "GBP", "new_column": 3.81, "date": "2020-11-01", "timestamp_col": "2020-11-01T00:00:00Z", "NZD": 3.14, "HKD@spéçiäl & characters": 3.01, "USD": 11}}} +{"type": "RECORD", "record": {"stream": "dedup_exchange_rate", "emitted_at": 1602661381900, "data": { "id": 4.22, "currency": "EUR", "new_column": 89.1, "date": "2020-11-01", "timestamp_col": "2020-11-01T00:00:00Z", "NZD": 3.89, "HKD@spéçiäl & characters": 8.88, "USD": 10}}} +{"type": "RECORD", "record": {"stream": "dedup_exchange_rate", "emitted_at": 1602661481900, "data": { "id": 1, "currency": "HKD", "new_column": 91.11, "date": "2020-11-01", "timestamp_col": "2020-11-01T00:00:00Z", "NZD": 1.19, "HKD@spéçiäl & characters": 99.1, "USD": 10}}} + +{"type":"RECORD","record":{"stream":"renamed_dedup_cdc_excluded","data":{"id":8,"name":"vw","column`_'with\"_quotes":"ma\"z`d'a","_ab_cdc_updated_at":1623949314663,"_ab_cdc_lsn":26985264,"_ab_cdc_deleted_at":null},"emitted_at":1623960160}} +{"type":"RECORD","record":{"stream":"renamed_dedup_cdc_excluded","data":{"id":9,"name":"opel","column`_'with\"_quotes":"ma\"z`d'a","_ab_cdc_updated_at":1623950868109,"_ab_cdc_lsn":28009440,"_ab_cdc_deleted_at":null},"emitted_at":1623961660}} +{"type":"RECORD","record":{"stream":"renamed_dedup_cdc_excluded","data":{"id":9,"name":null,"column`_'with\"_quotes":"ma\"z`d'a","_ab_cdc_updated_at":1623950868371,"_ab_cdc_lsn":28010232,"_ab_cdc_deleted_at":1623950868371},"emitted_at":1623961660}} diff --git a/airbyte-integrations/bases/base-normalization/integration_tests/resources/test_simple_streams/data_input/replace_identifiers.json b/airbyte-integrations/bases/base-normalization/integration_tests/resources/test_simple_streams/data_input/replace_identifiers.json index bcf23b50000e..827dd4fd1642 100644 --- a/airbyte-integrations/bases/base-normalization/integration_tests/resources/test_simple_streams/data_input/replace_identifiers.json +++ b/airbyte-integrations/bases/base-normalization/integration_tests/resources/test_simple_streams/data_input/replace_identifiers.json @@ -20,7 +20,12 @@ "postgres": [], "snowflake": [ { "HKD@SPÉÇIÄL & CHARACTERS": "HKD@spéçiäl & characters" }, - { "TMP_TEST_DATA_CHECK_ROW_COUNTS": "tmp_test_data_check_row_counts" } + { + "SIMPLE_STREAMS_FIRST_RUN_ROW_COUNTS": "simple_streams_first_run_row_counts" + }, + { + "SIMPLE_STREAMS_SECOND_RUN_ROW_COUNTS": "simple_streams_second_run_row_counts" + } ], "redshift": [], "mysql": [ diff --git a/airbyte-integrations/bases/base-normalization/integration_tests/resources/test_simple_streams/dbt_data_tests/test_check_row_counts.sql b/airbyte-integrations/bases/base-normalization/integration_tests/resources/test_simple_streams/dbt_data_tests/test_check_row_counts.sql deleted file mode 100644 index 966c1477c147..000000000000 --- a/airbyte-integrations/bases/base-normalization/integration_tests/resources/test_simple_streams/dbt_data_tests/test_check_row_counts.sql +++ /dev/null @@ -1 +0,0 @@ -select * from {{ ref('tmp_test_data_check_row_counts') }} diff --git a/airbyte-integrations/bases/base-normalization/integration_tests/resources/test_simple_streams/dbt_data_tests_tmp/tmp_test_data_check_row_counts.sql b/airbyte-integrations/bases/base-normalization/integration_tests/resources/test_simple_streams/dbt_data_tests_tmp/tmp_test_data_check_row_counts.sql deleted file mode 100644 index 6144a13617e8..000000000000 --- a/airbyte-integrations/bases/base-normalization/integration_tests/resources/test_simple_streams/dbt_data_tests_tmp/tmp_test_data_check_row_counts.sql +++ /dev/null @@ -1,43 +0,0 @@ -with table_row_counts as ( - select distinct count(*) as row_count, 10 as expected_count - from {{ source('test_normalization', '_airbyte_raw_exchange_rate') }} -union all - select distinct count(*) as row_count, 10 as expected_count - from {{ ref('exchange_rate') }} - -union all - - select distinct count(*) as row_count, 10 as expected_count - from {{ source('test_normalization', '_airbyte_raw_dedup_exchange_rate') }} -union all - select distinct count(*) as row_count, 10 as expected_count - from {{ ref('dedup_exchange_rate_scd') }} -union all - select distinct count(*) as row_count, 5 as expected_count - from {{ ref('dedup_exchange_rate') }} - -union all - - select distinct count(*) as row_count, 8 as expected_count - from {{ source('test_normalization', '_airbyte_raw_dedup_cdc_excluded') }} -union all - select distinct count(*) as row_count, 8 as expected_count - from {{ ref('dedup_cdc_excluded_scd') }} -union all - select distinct count(*) as row_count, 4 as expected_count - from {{ ref('dedup_cdc_excluded') }} - -union all - - select distinct count(*) as row_count, 8 as expected_count - from {{ source('test_normalization', '_airbyte_raw_pos_dedup_cdcx') }} -union all - select distinct count(*) as row_count, 8 as expected_count - from {{ ref('pos_dedup_cdcx_scd') }} -union all - select distinct count(*) as row_count, 3 as expected_count - from {{ ref('pos_dedup_cdcx') }} -) -select * -from table_row_counts -where row_count != expected_count diff --git a/airbyte-integrations/bases/base-normalization/integration_tests/resources/test_simple_streams/dbt_test_config/dbt_data_tests/test_check_first_run_row_counts.sql b/airbyte-integrations/bases/base-normalization/integration_tests/resources/test_simple_streams/dbt_test_config/dbt_data_tests/test_check_first_run_row_counts.sql new file mode 100644 index 000000000000..afbdc6ac5b30 --- /dev/null +++ b/airbyte-integrations/bases/base-normalization/integration_tests/resources/test_simple_streams/dbt_test_config/dbt_data_tests/test_check_first_run_row_counts.sql @@ -0,0 +1,2 @@ +select * from {{ ref('simple_streams_first_run_row_counts') }} +where row_count != expected_count diff --git a/airbyte-integrations/bases/base-normalization/integration_tests/resources/test_simple_streams/dbt_test_config/dbt_data_tests_incremental/test_check_second_run_row_counts.sql b/airbyte-integrations/bases/base-normalization/integration_tests/resources/test_simple_streams/dbt_test_config/dbt_data_tests_incremental/test_check_second_run_row_counts.sql new file mode 100644 index 000000000000..99e98a10a781 --- /dev/null +++ b/airbyte-integrations/bases/base-normalization/integration_tests/resources/test_simple_streams/dbt_test_config/dbt_data_tests_incremental/test_check_second_run_row_counts.sql @@ -0,0 +1,2 @@ +select * from {{ ref('simple_streams_second_run_row_counts') }} +where row_count != expected_count diff --git a/airbyte-integrations/bases/base-normalization/integration_tests/resources/test_simple_streams/dbt_test_config/dbt_data_tests_schema_change/test_check_third_run_row_counts.sql b/airbyte-integrations/bases/base-normalization/integration_tests/resources/test_simple_streams/dbt_test_config/dbt_data_tests_schema_change/test_check_third_run_row_counts.sql new file mode 100644 index 000000000000..5979aa28cea4 --- /dev/null +++ b/airbyte-integrations/bases/base-normalization/integration_tests/resources/test_simple_streams/dbt_test_config/dbt_data_tests_schema_change/test_check_third_run_row_counts.sql @@ -0,0 +1,2 @@ +select * from {{ ref('simple_streams_third_run_row_counts') }} +where row_count != expected_count diff --git a/airbyte-integrations/bases/base-normalization/integration_tests/resources/test_simple_streams/dbt_test_config/dbt_data_tests_tmp/simple_streams_first_run_row_counts.sql b/airbyte-integrations/bases/base-normalization/integration_tests/resources/test_simple_streams/dbt_test_config/dbt_data_tests_tmp/simple_streams_first_run_row_counts.sql new file mode 100644 index 000000000000..462558881b27 --- /dev/null +++ b/airbyte-integrations/bases/base-normalization/integration_tests/resources/test_simple_streams/dbt_test_config/dbt_data_tests_tmp/simple_streams_first_run_row_counts.sql @@ -0,0 +1,42 @@ +with table_row_counts as ( + select distinct '_airbyte_raw_exchange_rate' as label, count(*) as row_count, 10 as expected_count + from {{ source('test_normalization', '_airbyte_raw_exchange_rate') }} +union all + select distinct 'exchange_rate' as label, count(*) as row_count, 10 as expected_count + from {{ ref('exchange_rate') }} + +union all + + select distinct '_airbyte_raw_dedup_exchange_rate' as label, count(*) as row_count, 10 as expected_count + from {{ source('test_normalization', '_airbyte_raw_dedup_exchange_rate') }} +union all + select distinct 'dedup_exchange_rate_scd' as label, count(*) as row_count, 10 as expected_count + from {{ ref('dedup_exchange_rate_scd') }} +union all + select distinct 'dedup_exchange_rate' as label, count(*) as row_count, 5 as expected_count + from {{ ref('dedup_exchange_rate') }} + +union all + + select distinct '_airbyte_raw_dedup_cdc_excluded' as label, count(*) as row_count, 8 as expected_count + from {{ source('test_normalization', '_airbyte_raw_dedup_cdc_excluded') }} +union all + select distinct 'dedup_cdc_excluded_scd' as label, count(*) as row_count, 8 as expected_count + from {{ ref('dedup_cdc_excluded_scd') }} +union all + select distinct 'dedup_cdc_excluded' as label, count(*) as row_count, 4 as expected_count + from {{ ref('dedup_cdc_excluded') }} + +union all + + select distinct '_airbyte_raw_pos_dedup_cdcx' as label, count(*) as row_count, 8 as expected_count + from {{ source('test_normalization', '_airbyte_raw_pos_dedup_cdcx') }} +union all + select distinct 'pos_dedup_cdcx_scd' as label, count(*) as row_count, 8 as expected_count + from {{ ref('pos_dedup_cdcx_scd') }} +union all + select distinct 'pos_dedup_cdcx' as label, count(*) as row_count, 3 as expected_count + from {{ ref('pos_dedup_cdcx') }} +) +select * +from table_row_counts diff --git a/airbyte-integrations/bases/base-normalization/integration_tests/resources/test_simple_streams/dbt_test_config/dbt_data_tests_tmp_incremental/simple_streams_second_run_row_counts.sql b/airbyte-integrations/bases/base-normalization/integration_tests/resources/test_simple_streams/dbt_test_config/dbt_data_tests_tmp_incremental/simple_streams_second_run_row_counts.sql new file mode 100644 index 000000000000..28963326e82d --- /dev/null +++ b/airbyte-integrations/bases/base-normalization/integration_tests/resources/test_simple_streams/dbt_test_config/dbt_data_tests_tmp_incremental/simple_streams_second_run_row_counts.sql @@ -0,0 +1,42 @@ +with table_row_counts as ( + select distinct '_airbyte_raw_exchange_rate' as label, count(*) as row_count, 5 as expected_count + from {{ source('test_normalization', '_airbyte_raw_exchange_rate') }} +union all + select distinct 'exchange_rate' as label, count(*) as row_count, 13 as expected_count + from {{ ref('exchange_rate') }} + +union all + + select distinct '_airbyte_raw_dedup_exchange_rate' as label, count(*) as row_count, 5 as expected_count + from {{ source('test_normalization', '_airbyte_raw_dedup_exchange_rate') }} +union all + select distinct 'dedup_exchange_rate_scd' as label, count(*) as row_count, 13 as expected_count + from {{ ref('dedup_exchange_rate_scd') }} +union all + select distinct 'dedup_exchange_rate' as label, count(*) as row_count, 6 as expected_count + from {{ ref('dedup_exchange_rate') }} + +union all + + select distinct '_airbyte_raw_dedup_cdc_excluded' as label, count(*) as row_count, 2 as expected_count + from {{ source('test_normalization', '_airbyte_raw_dedup_cdc_excluded') }} +union all + select distinct 'dedup_cdc_excluded_scd' as label, count(*) as row_count, 9 as expected_count + from {{ ref('dedup_cdc_excluded_scd') }} +union all + select distinct 'dedup_cdc_excluded' as label, count(*) as row_count, 4 as expected_count + from {{ ref('dedup_cdc_excluded') }} + +union all + + select distinct '_airbyte_raw_pos_dedup_cdcx' as label, count(*) as row_count, 6 as expected_count + from {{ source('test_normalization', '_airbyte_raw_pos_dedup_cdcx') }} +union all + select distinct 'pos_dedup_cdcx_scd' as label, count(*) as row_count, 6 as expected_count + from {{ ref('pos_dedup_cdcx_scd') }} +union all + select distinct 'pos_dedup_cdcx' as label, count(*) as row_count, 2 as expected_count + from {{ ref('pos_dedup_cdcx') }} +) +select * +from table_row_counts diff --git a/airbyte-integrations/bases/base-normalization/integration_tests/resources/test_simple_streams/dbt_test_config/dbt_data_tests_tmp_schema_change/simple_streams_third_run_row_counts.sql b/airbyte-integrations/bases/base-normalization/integration_tests/resources/test_simple_streams/dbt_test_config/dbt_data_tests_tmp_schema_change/simple_streams_third_run_row_counts.sql new file mode 100644 index 000000000000..186eedf26dc7 --- /dev/null +++ b/airbyte-integrations/bases/base-normalization/integration_tests/resources/test_simple_streams/dbt_test_config/dbt_data_tests_tmp_schema_change/simple_streams_third_run_row_counts.sql @@ -0,0 +1,31 @@ +with table_row_counts as ( + select distinct '_airbyte_raw_exchange_rate' as label, count(*) as row_count, 4 as expected_count + from {{ source('test_normalization', '_airbyte_raw_exchange_rate') }} +union all + select distinct 'exchange_rate' as label, count(*) as row_count, 17 as expected_count + from {{ ref('exchange_rate') }} + +union all + + select distinct '_airbyte_raw_dedup_exchange_rate' as label, count(*) as row_count, 9 as expected_count + from {{ source('test_normalization', '_airbyte_raw_dedup_exchange_rate') }} +union all + select distinct 'dedup_exchange_rate_scd' as label, count(*) as row_count, 17 as expected_count + from {{ ref('dedup_exchange_rate_scd') }} +union all + select distinct 'dedup_exchange_rate' as label, count(*) as row_count, 10 as expected_count + from {{ ref('dedup_exchange_rate') }} + +union all + + select distinct '_airbyte_raw_dedup_cdc_excluded' as label, count(*) as row_count, 2 as expected_count + from test_normalization._airbyte_raw_dedup_cdc_excluded +union all + select distinct 'dedup_cdc_excluded_scd' as label, count(*) as row_count, 9 as expected_count + from test_normalization.dedup_cdc_excluded_scd +union all + select distinct 'dedup_cdc_excluded' as label, count(*) as row_count, 4 as expected_count + from test_normalization.dedup_cdc_excluded +) +select * +from table_row_counts diff --git a/airbyte-integrations/bases/base-normalization/integration_tests/resources/test_simple_streams/dbt_schema_tests/schema_test.yml b/airbyte-integrations/bases/base-normalization/integration_tests/resources/test_simple_streams/dbt_test_config/dbt_schema_tests/schema_test.yml similarity index 84% rename from airbyte-integrations/bases/base-normalization/integration_tests/resources/test_simple_streams/dbt_schema_tests/schema_test.yml rename to airbyte-integrations/bases/base-normalization/integration_tests/resources/test_simple_streams/dbt_test_config/dbt_schema_tests/schema_test.yml index d7eb8ff56dca..d0192cefe26d 100644 --- a/airbyte-integrations/bases/base-normalization/integration_tests/resources/test_simple_streams/dbt_schema_tests/schema_test.yml +++ b/airbyte-integrations/bases/base-normalization/integration_tests/resources/test_simple_streams/dbt_test_config/dbt_schema_tests/schema_test.yml @@ -3,6 +3,10 @@ version: 2 models: - name: exchange_rate tests: + - dbt_utils.expression_is_true: + # description: check no column collisions + # Two columns having similar names especially after removing special characters should remain distincts + expression: cast("HKD@spéçiäl & characters" as {{ dbt_utils.type_string() }}) != HKD_special___characters - dbt_utils.equality: # description: check_streams_are_equal # In this integration test, we are sending the same records to both streams @@ -19,14 +23,6 @@ models: - HKD_special___characters - NZD - USD - - dbt_utils.equal_rowcount: - # description: check_raw_and_normalized_rowcounts - # Raw and normalized tables should be equal. - compare_model: source('test_normalization', '_airbyte_raw_exchange_rate') - - dbt_utils.expression_is_true: - # description: check no column collisions - # Two columns having similar names especially after removing special characters should remain distincts - expression: cast("HKD@spéçiäl & characters" as {{ dbt_utils.type_string() }}) != HKD_special___characters columns: - name: '"HKD@spéçiäl & characters"' # description: check special charactesrs @@ -45,9 +41,10 @@ models: - NZD - name: dedup_cdc_excluded - tests: - - dbt_utils.expression_is_true: - expression: "\"column`_'with\"\"_quotes\" is not null" +# Disabling because incremental dbt is not handling quotes well atm (dbt 0.21.0) +# tests: +# - dbt_utils.expression_is_true: +# expression: "\"column`_'with\"\"_quotes\" is not null" columns: - name: name tests: diff --git a/airbyte-integrations/bases/base-normalization/integration_tests/resources/test_simple_streams/dbt_test_config/dbt_schema_tests_incremental/schema_test.yml b/airbyte-integrations/bases/base-normalization/integration_tests/resources/test_simple_streams/dbt_test_config/dbt_schema_tests_incremental/schema_test.yml new file mode 100644 index 000000000000..d0192cefe26d --- /dev/null +++ b/airbyte-integrations/bases/base-normalization/integration_tests/resources/test_simple_streams/dbt_test_config/dbt_schema_tests_incremental/schema_test.yml @@ -0,0 +1,57 @@ +version: 2 + +models: + - name: exchange_rate + tests: + - dbt_utils.expression_is_true: + # description: check no column collisions + # Two columns having similar names especially after removing special characters should remain distincts + expression: cast("HKD@spéçiäl & characters" as {{ dbt_utils.type_string() }}) != HKD_special___characters + - dbt_utils.equality: + # description: check_streams_are_equal + # In this integration test, we are sending the same records to both streams + # exchange_rate and dedup_exchange_rate. + # The SCD table of dedup_exchange_rate in append_dedup mode should therefore mirror + # the final table with append or overwrite mode from exchange_rate. + compare_model: ref('dedup_exchange_rate_scd') + compare_columns: + - id + - currency + - date + - timestamp_col + - '"HKD@spéçiäl & characters"' + - HKD_special___characters + - NZD + - USD + columns: + - name: '"HKD@spéçiäl & characters"' + # description: check special charactesrs + # Use special characters in column names and make sure they are correctly parsed in the JSON blob and populated + tests: + - not_null + + - name: dedup_exchange_rate + tests: + - dbt_utils.unique_combination_of_columns: + # description: check_deduplication_by_primary_key + # The final table for this stream should have unique composite primary key values. + combination_of_columns: + - id + - currency + - NZD + + - name: dedup_cdc_excluded +# Disabling because incremental dbt is not handling quotes well atm (dbt 0.21.0) +# tests: +# - dbt_utils.expression_is_true: +# expression: "\"column`_'with\"\"_quotes\" is not null" + columns: + - name: name + tests: + - not_null + + - name: pos_dedup_cdcx + columns: + - name: name + tests: + - not_null diff --git a/airbyte-integrations/bases/base-normalization/integration_tests/resources/test_simple_streams/dbt_test_config/dbt_schema_tests_schema_change/schema_test.yml b/airbyte-integrations/bases/base-normalization/integration_tests/resources/test_simple_streams/dbt_test_config/dbt_schema_tests_schema_change/schema_test.yml new file mode 100644 index 000000000000..b8367ee02db8 --- /dev/null +++ b/airbyte-integrations/bases/base-normalization/integration_tests/resources/test_simple_streams/dbt_test_config/dbt_schema_tests_schema_change/schema_test.yml @@ -0,0 +1,46 @@ +version: 2 + +models: + - name: exchange_rate + columns: + - name: '"HKD@spéçiäl & characters"' + # description: check special charactesrs + # Use special characters in column names and make sure they are correctly parsed in the JSON blob and populated + tests: + - not_null + tests: + - dbt_utils.equality: + # description: check_streams_are_equal + # In this integration test, we are sending the same records to both streams + # exchange_rate and dedup_exchange_rate. + # The SCD table of dedup_exchange_rate in append_dedup mode should therefore mirror + # the final table with append or overwrite mode from exchange_rate. + compare_model: ref('dedup_exchange_rate_scd') + compare_columns: + - id + - currency + - date + - timestamp_col + - '"HKD@spéçiäl & characters"' + - NZD + - USD + + - name: dedup_exchange_rate + tests: + - dbt_utils.unique_combination_of_columns: + # description: check_deduplication_by_primary_key + # The final table for this stream should have unique composite primary key values. + combination_of_columns: + - id + - currency + - NZD + + - name: renamed_dedup_cdc_excluded +# Disabling because incremental dbt is not handling quotes well atm (dbt 0.21.0) +# tests: +# - dbt_utils.expression_is_true: +# expression: "\"column`_'with\"\"_quotes\" is not null" + columns: + - name: name + tests: + - not_null diff --git a/airbyte-integrations/bases/base-normalization/integration_tests/test_ephemeral.py b/airbyte-integrations/bases/base-normalization/integration_tests/test_ephemeral.py index 9c7b40a99f28..9e027e450d3c 100644 --- a/airbyte-integrations/bases/base-normalization/integration_tests/test_ephemeral.py +++ b/airbyte-integrations/bases/base-normalization/integration_tests/test_ephemeral.py @@ -53,7 +53,8 @@ def test_destination_supported_limits(destination_type: DestinationType, column_ # not by absolute column count. It is way fewer than 1000. pytest.skip(f"Destinations {destination_type} is not in NORMALIZATION_TEST_TARGET env variable (MYSQL is also skipped)") if destination_type.value == DestinationType.ORACLE.value: - column_count = 998 + # Airbyte uses a few columns for metadata and Oracle limits are right at 1000 + column_count = 995 run_test(destination_type, column_count) @@ -99,6 +100,7 @@ def run_test(destination_type: DestinationType, column_count: int, expected_exce generate_dbt_models(destination_type, test_root_dir, column_count) # Use destination connector to create empty _airbyte_raw_* tables to use as input for the test assert setup_input_raw_data(integration_type, test_root_dir, destination_config) + dbt_test_utils.dbt_check(destination_type, test_root_dir) if expected_exception_message: with pytest.raises(AssertionError): dbt_test_utils.dbt_run(destination_type, test_root_dir) diff --git a/airbyte-integrations/bases/base-normalization/integration_tests/test_normalization.py b/airbyte-integrations/bases/base-normalization/integration_tests/test_normalization.py index 184922410677..04b59dae2ff1 100644 --- a/airbyte-integrations/bases/base-normalization/integration_tests/test_normalization.py +++ b/airbyte-integrations/bases/base-normalization/integration_tests/test_normalization.py @@ -83,26 +83,73 @@ def test_normalization(destination_type: DestinationType, test_resource_name: st def run_test_normalization(destination_type: DestinationType, test_resource_name: str): print(f"Testing normalization {destination_type} for {test_resource_name} in ", dbt_test_utils.target_schema) - integration_type = destination_type.value - # Create the test folder with dbt project and appropriate destination settings to run integration tests from - test_root_dir = setup_test_dir(integration_type, test_resource_name) - destination_config = dbt_test_utils.generate_profile_yaml_file(destination_type, test_root_dir) + test_root_dir = setup_test_dir(destination_type, test_resource_name) + run_first_normalization(destination_type, test_resource_name, test_root_dir) + if os.path.exists(os.path.join("resources", test_resource_name, "data_input", "messages_incremental.txt")): + run_incremental_normalization(destination_type, test_resource_name, test_root_dir) + if os.path.exists(os.path.join("resources", test_resource_name, "data_input", "messages_schema_change.txt")): + run_schema_change_normalization(destination_type, test_resource_name, test_root_dir) - # Use destination connector to create _airbyte_raw_* tables to use as input for the test - assert setup_input_raw_data(integration_type, test_resource_name, test_root_dir, destination_config) - # Normalization step - generate_dbt_models(destination_type, test_resource_name, test_root_dir) +def run_first_normalization(destination_type: DestinationType, test_resource_name: str, test_root_dir: str): + destination_config = dbt_test_utils.generate_profile_yaml_file(destination_type, test_root_dir) + # Use destination connector to create _airbyte_raw_* tables to use as input for the test + assert setup_input_raw_data(destination_type, test_resource_name, test_root_dir, destination_config) + # generate models from catalog + generate_dbt_models(destination_type, test_resource_name, test_root_dir, "models", "catalog.json") # Setup test resources and models - dbt_test_setup(destination_type, test_resource_name, test_root_dir) - # Run DBT process + setup_dbt_test(destination_type, test_resource_name, test_root_dir) + dbt_test_utils.dbt_check(destination_type, test_root_dir) + # Run dbt process + dbt_test_utils.dbt_run(destination_type, test_root_dir, force_full_refresh=True) + copy_tree(os.path.join(test_root_dir, "build/run/airbyte_utils/models/generated/"), os.path.join(test_root_dir, "first_output")) + shutil.rmtree(os.path.join(test_root_dir, "build/run/airbyte_utils/models/generated/"), ignore_errors=True) + # Verify dbt process + dbt_test(destination_type, test_root_dir) + + +def run_incremental_normalization(destination_type: DestinationType, test_resource_name: str, test_root_dir: str): + # Use destination connector to reset _airbyte_raw_* tables with new incremental data + setup_incremental_data(destination_type, test_resource_name, test_root_dir) + # setup new test files + setup_dbt_incremental_test(destination_type, test_resource_name, test_root_dir) + # Run dbt process dbt_test_utils.dbt_run(destination_type, test_root_dir) + normalize_dbt_output(test_root_dir, "build/run/airbyte_utils/models/generated/", "second_output") + + if destination_type.value in [DestinationType.MYSQL.value, DestinationType.ORACLE.value]: + pytest.skip(f"{destination_type} does not support incremental yet") dbt_test(destination_type, test_root_dir) - check_outputs(destination_type, test_resource_name, test_root_dir) -def setup_test_dir(integration_type: str, test_resource_name: str) -> str: +def run_schema_change_normalization(destination_type: DestinationType, test_resource_name: str, test_root_dir: str): + if destination_type.value in [DestinationType.MSSQL.value, DestinationType.MYSQL.value, DestinationType.ORACLE.value]: + pytest.skip(f"{destination_type} does not support schema change in incremental yet (requires dbt 0.21.0+)") + if destination_type.value in [DestinationType.SNOWFLAKE.value]: + pytest.skip(f"{destination_type} is disabled as it doesnt support schema change in incremental yet (column type changes)") + + setup_schema_change_data(destination_type, test_resource_name, test_root_dir) + generate_dbt_models(destination_type, test_resource_name, test_root_dir, "modified_models", "catalog_schema_change.json") + setup_dbt_schema_change_test(destination_type, test_resource_name, test_root_dir) + dbt_test_utils.dbt_run(destination_type, test_root_dir) + normalize_dbt_output(test_root_dir, "build/run/airbyte_utils/modified_models/generated/", "third_output") + dbt_test(destination_type, test_root_dir) + + +def normalize_dbt_output(test_root_dir: str, input_dir: str, output_dir: str): + tmp_dir = os.path.join(test_root_dir, input_dir) + output_dir = os.path.join(test_root_dir, output_dir) + shutil.rmtree(output_dir, ignore_errors=True) + + def copy_replace_dbt_tmp(src, dst): + dbt_test_utils.copy_replace(src, dst, "__dbt_tmp[0-9]+", "__dbt_tmp") + + shutil.copytree(tmp_dir, output_dir, copy_function=copy_replace_dbt_tmp) + shutil.rmtree(tmp_dir, ignore_errors=True) + + +def setup_test_dir(destination_type: DestinationType, test_resource_name: str) -> str: """ We prepare a clean folder to run the tests from. @@ -117,25 +164,25 @@ def setup_test_dir(integration_type: str, test_resource_name: str) -> str: these are interpreted and compiled into the native SQL dialect of the final destination engine) """ if test_resource_name in git_versioned_tests: - test_root_dir = f"{pathlib.Path().absolute()}/normalization_test_output/{integration_type.lower()}" + test_root_dir = f"{pathlib.Path().absolute()}/normalization_test_output/{destination_type.value.lower()}" else: - test_root_dir = f"{pathlib.Path().joinpath('..', 'build', 'normalization_test_output', integration_type.lower()).resolve()}" + test_root_dir = f"{pathlib.Path().joinpath('..', 'build', 'normalization_test_output', destination_type.value.lower()).resolve()}" os.makedirs(test_root_dir, exist_ok=True) test_root_dir = f"{test_root_dir}/{test_resource_name}" shutil.rmtree(test_root_dir, ignore_errors=True) print(f"Setting up test folder {test_root_dir}") dbt_project_yaml = "../dbt-project-template/dbt_project.yml" copy_tree("../dbt-project-template", test_root_dir) - if integration_type == DestinationType.MSSQL.value: + if destination_type.value == DestinationType.MSSQL.value: copy_tree("../dbt-project-template-mssql", test_root_dir) dbt_project_yaml = "../dbt-project-template-mssql/dbt_project.yml" - elif integration_type == DestinationType.MYSQL.value: + elif destination_type.value == DestinationType.MYSQL.value: copy_tree("../dbt-project-template-mysql", test_root_dir) dbt_project_yaml = "../dbt-project-template-mysql/dbt_project.yml" - elif integration_type == DestinationType.ORACLE.value: + elif destination_type.value == DestinationType.ORACLE.value: copy_tree("../dbt-project-template-oracle", test_root_dir) dbt_project_yaml = "../dbt-project-template-oracle/dbt_project.yml" - if integration_type.lower() != "redshift" and integration_type.lower() != "oracle": + if destination_type.value not in (DestinationType.REDSHIFT.value, DestinationType.ORACLE.value): # Prefer 'view' to 'ephemeral' for tests so it's easier to debug with dbt dbt_test_utils.copy_replace( dbt_project_yaml, @@ -149,7 +196,9 @@ def setup_test_dir(integration_type: str, test_resource_name: str) -> str: return test_root_dir -def setup_input_raw_data(integration_type: str, test_resource_name: str, test_root_dir: str, destination_config: Dict[str, Any]) -> bool: +def setup_input_raw_data( + destination_type: DestinationType, test_resource_name: str, test_root_dir: str, destination_config: Dict[str, Any] +) -> bool: """ We run docker images of destinations to upload test data stored in the messages.txt file for each test case. This should populate the associated "raw" tables from which normalization is reading from when running dbt CLI. @@ -166,6 +215,45 @@ def setup_input_raw_data(integration_type: str, test_resource_name: str, test_ro config_file = os.path.join(test_root_dir, "destination_config.json") with open(config_file, "w") as f: f.write(json.dumps(destination_config)) + # Force a reset in destination raw tables + assert run_destination_process(destination_type, test_root_dir, "", "reset_catalog.json") + # Run a sync to create raw tables in destinations + return run_destination_process(destination_type, test_root_dir, message_file, "destination_catalog.json") + + +def setup_incremental_data(destination_type: DestinationType, test_resource_name: str, test_root_dir: str) -> bool: + message_file = os.path.join("resources", test_resource_name, "data_input", "messages_incremental.txt") + # Force a reset in destination raw tables + assert run_destination_process(destination_type, test_root_dir, "", "reset_catalog.json") + # Run a sync to create raw tables in destinations + return run_destination_process(destination_type, test_root_dir, message_file, "destination_catalog.json") + + +def setup_schema_change_data(destination_type: DestinationType, test_resource_name: str, test_root_dir: str) -> bool: + catalog_file = os.path.join("resources", test_resource_name, "data_input", "catalog_schema_change.json") + message_file = os.path.join("resources", test_resource_name, "data_input", "messages_schema_change.txt") + dbt_test_utils.copy_replace( + catalog_file, + os.path.join(test_root_dir, "reset_catalog.json"), + pattern='"destination_sync_mode": ".*"', + replace_value='"destination_sync_mode": "overwrite"', + ) + dbt_test_utils.copy_replace(catalog_file, os.path.join(test_root_dir, "destination_catalog.json")) + dbt_test_utils.copy_replace( + os.path.join(test_root_dir, "dbt_project.yml"), + os.path.join(test_root_dir, "first_dbt_project.yml"), + ) + dbt_test_utils.copy_replace( + os.path.join(test_root_dir, "first_dbt_project.yml"), + os.path.join(test_root_dir, "dbt_project.yml"), + pattern=r'source-paths: \["models"\]', + replace_value='source-paths: ["modified_models"]', + ) + # Run a sync to update raw tables in destinations + return run_destination_process(destination_type, test_root_dir, message_file, "destination_catalog.json") + + +def run_destination_process(destination_type: DestinationType, test_root_dir: str, message_file: str, catalog_file: str): commands = [ "docker", "run", @@ -176,55 +264,112 @@ def setup_input_raw_data(integration_type: str, test_resource_name: str, test_ro "--network", "host", "-i", - f"airbyte/destination-{integration_type.lower()}:dev", + f"airbyte/destination-{destination_type.value.lower()}:dev", "write", "--config", "/data/destination_config.json", "--catalog", ] - # Force a reset in destination raw tables - assert dbt_test_utils.run_destination_process("", test_root_dir, commands + ["/data/reset_catalog.json"]) - # Run a sync to create raw tables in destinations - return dbt_test_utils.run_destination_process(message_file, test_root_dir, commands + ["/data/destination_catalog.json"]) + return dbt_test_utils.run_destination_process(message_file, test_root_dir, commands + [f"/data/{catalog_file}"]) -def generate_dbt_models(destination_type: DestinationType, test_resource_name: str, test_root_dir: str): +def generate_dbt_models(destination_type: DestinationType, test_resource_name: str, test_root_dir: str, output_dir: str, catalog_file: str): """ This is the normalization step generating dbt models files from the destination_catalog.json taken as input. """ - catalog_processor = CatalogProcessor(os.path.join(test_root_dir, "models", "generated"), destination_type) + catalog_processor = CatalogProcessor(os.path.join(test_root_dir, output_dir, "generated"), destination_type) catalog_processor.process( - os.path.join("resources", test_resource_name, "data_input", "catalog.json"), "_airbyte_data", dbt_test_utils.target_schema + os.path.join("resources", test_resource_name, "data_input", catalog_file), "_airbyte_data", dbt_test_utils.target_schema ) -def dbt_test_setup(destination_type: DestinationType, test_resource_name: str, test_root_dir: str): +def setup_dbt_test(destination_type: DestinationType, test_resource_name: str, test_root_dir: str): """ Prepare the data (copy) for the models for dbt test. """ replace_identifiers = os.path.join("resources", test_resource_name, "data_input", "replace_identifiers.json") - - # COMMON TEST RESOURCES copy_test_files( - os.path.join("resources", test_resource_name, "dbt_schema_tests"), + os.path.join("resources", test_resource_name, "dbt_test_config", "dbt_schema_tests"), os.path.join(test_root_dir, "models/dbt_schema_tests"), destination_type, replace_identifiers, ) copy_test_files( - os.path.join("resources", test_resource_name, "dbt_data_tests_tmp"), - os.path.join(test_root_dir, "models/dbt_data_tests_tmp"), + os.path.join("resources", test_resource_name, "dbt_test_config", "dbt_data_tests_tmp"), + os.path.join(test_root_dir, "models/dbt_data_tests"), destination_type, replace_identifiers, ) copy_test_files( - os.path.join("resources", test_resource_name, "dbt_data_tests"), + os.path.join("resources", test_resource_name, "dbt_test_config", "dbt_data_tests"), os.path.join(test_root_dir, "tests"), destination_type, replace_identifiers, ) +def setup_dbt_incremental_test(destination_type: DestinationType, test_resource_name: str, test_root_dir: str): + """ + Prepare the data (copy) for the models for dbt test. + """ + replace_identifiers = os.path.join("resources", test_resource_name, "data_input", "replace_identifiers.json") + copy_test_files( + os.path.join("resources", test_resource_name, "dbt_test_config", "dbt_schema_tests_incremental"), + os.path.join(test_root_dir, "models/dbt_schema_tests"), + destination_type, + replace_identifiers, + ) + test_directory = os.path.join(test_root_dir, "models/dbt_data_tests") + shutil.rmtree(test_directory, ignore_errors=True) + os.makedirs(test_directory, exist_ok=True) + copy_test_files( + os.path.join("resources", test_resource_name, "dbt_test_config", "dbt_data_tests_tmp_incremental"), + test_directory, + destination_type, + replace_identifiers, + ) + test_directory = os.path.join(test_root_dir, "tests") + shutil.rmtree(test_directory, ignore_errors=True) + os.makedirs(test_directory, exist_ok=True) + copy_test_files( + os.path.join("resources", test_resource_name, "dbt_test_config", "dbt_data_tests_incremental"), + test_directory, + destination_type, + replace_identifiers, + ) + + +def setup_dbt_schema_change_test(destination_type: DestinationType, test_resource_name: str, test_root_dir: str): + """ + Prepare the data (copy) for the models for dbt test. + """ + replace_identifiers = os.path.join("resources", test_resource_name, "data_input", "replace_identifiers.json") + copy_test_files( + os.path.join("resources", test_resource_name, "dbt_test_config", "dbt_schema_tests_schema_change"), + os.path.join(test_root_dir, "modified_models/dbt_schema_tests"), + destination_type, + replace_identifiers, + ) + test_directory = os.path.join(test_root_dir, "modified_models/dbt_data_tests") + shutil.rmtree(test_directory, ignore_errors=True) + os.makedirs(test_directory, exist_ok=True) + copy_test_files( + os.path.join("resources", test_resource_name, "dbt_test_config", "dbt_data_tests_tmp_schema_change"), + test_directory, + destination_type, + replace_identifiers, + ) + test_directory = os.path.join(test_root_dir, "tests") + shutil.rmtree(test_directory, ignore_errors=True) + os.makedirs(test_directory, exist_ok=True) + copy_test_files( + os.path.join("resources", test_resource_name, "dbt_test_config", "dbt_data_tests_schema_change"), + test_directory, + destination_type, + replace_identifiers, + ) + + def dbt_test(destination_type: DestinationType, test_root_dir: str): """ dbt provides a way to run dbt tests as described here: https://docs.getdbt.com/docs/building-a-dbt-project/tests @@ -238,13 +383,6 @@ def dbt_test(destination_type: DestinationType, test_root_dir: str): assert dbt_test_utils.run_check_dbt_command(normalization_image, "test", test_root_dir) -def check_outputs(destination_type: DestinationType, test_resource_name: str, test_root_dir: str): - """ - Implement other types of checks on the output directory (grepping, diffing files etc?) - """ - print("Checking test outputs") - - def copy_test_files(src: str, dst: str, destination_type: DestinationType, replace_identifiers: str): """ Copy file while hacking snowflake identifiers that needs to be uppercased... diff --git a/airbyte-integrations/bases/base-normalization/normalization/transform_catalog/stream_processor.py b/airbyte-integrations/bases/base-normalization/normalization/transform_catalog/stream_processor.py index b97acfdc9351..9af79c4a4e83 100644 --- a/airbyte-integrations/bases/base-normalization/normalization/transform_catalog/stream_processor.py +++ b/airbyte-integrations/bases/base-normalization/normalization/transform_catalog/stream_processor.py @@ -5,6 +5,7 @@ import os import re +from enum import Enum from typing import Dict, List, Optional, Tuple from airbyte_protocol.models.airbyte_protocol import DestinationSyncMode, SyncMode @@ -33,6 +34,29 @@ MAXIMUM_COLUMNS_TO_USE_EPHEMERAL = 450 +class PartitionScheme(Enum): + """ + When possible, normalization will try to output partitioned/indexed/sorted tables (depending on the destination support) + This enum specifies which column to use when doing so (which affects how fast the table can be read using that column as predicate) + """ + + ACTIVE_ROW = "active_row" # partition by _airbyte_active_row + UNIQUE_KEY = "unique_key" # partition by _airbyte_emitted_at, sorted by _airbyte_unique_key + NOTHING = "nothing" # no partitions + DEFAULT = "" # partition by _airbyte_emitted_at + + +class TableMaterializationType(Enum): + """ + Defines the folders and dbt materialization mode of models (as configured in dbt_project.yml file) + """ + + CTE = "airbyte_ctes" + VIEW = "airbyte_views" + TABLE = "airbyte_tables" + INCREMENTAL = "airbyte_incremental" + + class StreamProcessor(object): """ Takes as input an Airbyte Stream as described in the (configured) Airbyte Catalog's Json Schema. @@ -93,7 +117,10 @@ def __init__( self.parent: Optional["StreamProcessor"] = None self.is_nested_array: bool = False self.default_schema: str = default_schema + self.airbyte_ab_id = "_airbyte_ab_id" self.airbyte_emitted_at = "_airbyte_emitted_at" + self.airbyte_normalized_at = "_airbyte_normalized_at" + self.airbyte_unique_key = "_airbyte_unique_key" @staticmethod def create_from_parent( @@ -207,37 +234,47 @@ def process(self) -> List["StreamProcessor"]: from_table = self.from_table # Transformation Pipeline for this stream - from_table = self.add_to_outputs(self.generate_json_parsing_model(from_table, column_names), is_intermediate=True, suffix="ab1") from_table = self.add_to_outputs( - self.generate_column_typing_model(from_table, column_names), is_intermediate=True, column_count=column_count, suffix="ab2" + self.generate_json_parsing_model(from_table, column_names), + self.get_model_materialization_mode(is_intermediate=True), + suffix="ab1", ) from_table = self.add_to_outputs( - self.generate_id_hashing_model(from_table, column_names), is_intermediate=True, column_count=column_count, suffix="ab3" + self.generate_column_typing_model(from_table, column_names), + self.get_model_materialization_mode(is_intermediate=True, column_count=column_count), + suffix="ab2", ) - if self.destination_sync_mode.value == DestinationSyncMode.append_dedup.value: - from_table = self.add_to_outputs(self.generate_dedup_record_model(from_table, column_names), is_intermediate=True, suffix="ab4") - if self.destination_type == DestinationType.ORACLE: - where_clause = '\nwhere "_AIRBYTE_ROW_NUM" = 1' - else: - where_clause = "\nwhere _airbyte_row_num = 1" + if self.destination_sync_mode != DestinationSyncMode.append_dedup: from_table = self.add_to_outputs( - self.generate_scd_type_2_model(from_table, column_names) + where_clause, - is_intermediate=False, - column_count=column_count, - suffix="scd", + self.generate_id_hashing_model(from_table, column_names), + self.get_model_materialization_mode(is_intermediate=True, column_count=column_count), + suffix="ab3", ) - if self.destination_type == DestinationType.ORACLE: - where_clause = '\nwhere "_AIRBYTE_ACTIVE_ROW" = 1' - else: - where_clause = "\nwhere _airbyte_active_row = 1" - from_table = self.add_to_outputs( - self.generate_final_model(from_table, column_names) + where_clause, is_intermediate=False, column_count=column_count + self.generate_final_model(from_table, column_names), + self.get_model_materialization_mode(is_intermediate=False, column_count=column_count), ) - # TODO generate yaml file to dbt test final table where primary keys should be unique else: from_table = self.add_to_outputs( - self.generate_final_model(from_table, column_names), is_intermediate=False, column_count=column_count + self.generate_id_hashing_model(from_table, column_names), + # Force View materialization here because scd models rely on star* macros that requires it + TableMaterializationType.VIEW, + suffix="ab3", + ) + from_table = self.add_to_outputs( + self.generate_scd_type_2_model(from_table, column_names), + self.get_model_materialization_mode(is_intermediate=False, column_count=column_count), + suffix="scd", + subdir="scd", + unique_key=self.name_transformer.normalize_column_name("_airbyte_unique_key_scd"), + partition_by=PartitionScheme.ACTIVE_ROW, + ) + where_clause = f"\nand {self.name_transformer.normalize_column_name('_airbyte_active_row')} = 1" + from_table = self.add_to_outputs( + self.generate_final_model(from_table, column_names, self.get_unique_key()) + where_clause, + self.get_model_materialization_mode(is_intermediate=False, column_count=column_count), + unique_key=self.get_unique_key(), + partition_by=PartitionScheme.UNIQUE_KEY, ) return self.find_children_streams(from_table, column_names) @@ -278,6 +315,8 @@ def find_children_streams(self, from_table: str, column_names: Dict[str, Tuple[s children: List[StreamProcessor] = [] for field in properties.keys(): children_properties = None + is_nested_array = False + json_column_name = "" if is_airbyte_column(field): pass elif is_combining_node(properties[field]): @@ -322,34 +361,53 @@ def generate_json_parsing_model(self, from_table: str, column_names: Dict[str, T {%- for field in fields %} {{ field }}, {%- endfor %} - {{ col_emitted_at }} + {{ col_ab_id }}, + {{ col_emitted_at }}, + {{ '{{ current_timestamp() }}' }} as {{ col_normalized_at }} from {{ from_table }} {{ table_alias }} -{{ unnesting_after_query }} {{ sql_table_comment }} +{{ unnesting_from }} +where 1 = 1 +{{ unnesting_where }} """ ) sql = template.render( + col_ab_id=self.get_ab_id(), col_emitted_at=self.get_emitted_at(), + col_normalized_at=self.get_normalized_at(), table_alias=table_alias, unnesting_before_query=self.unnesting_before_query(), parent_hash_id=self.parent_hash_id(), fields=self.extract_json_columns(column_names), from_table=jinja_call(from_table), - unnesting_after_query=self.unnesting_after_query(), + unnesting_from=self.unnesting_from(), + unnesting_where=self.unnesting_where(), sql_table_comment=self.sql_table_comment(), ) return sql + def get_ab_id(self, in_jinja: bool = False): + # this is also tied to dbt-project-template/macros/should_full_refresh.sql + # as it is needed by the macro should_full_refresh + return self.name_transformer.normalize_column_name(self.airbyte_ab_id, in_jinja, False) + def get_emitted_at(self, in_jinja: bool = False): return self.name_transformer.normalize_column_name(self.airbyte_emitted_at, in_jinja, False) + def get_normalized_at(self, in_jinja: bool = False): + return self.name_transformer.normalize_column_name(self.airbyte_normalized_at, in_jinja, False) + + def get_unique_key(self, in_jinja: bool = False): + return self.name_transformer.normalize_column_name(self.airbyte_unique_key, in_jinja, False) + def extract_json_columns(self, column_names: Dict[str, Tuple[str, str]]) -> List[str]: return [ self.extract_json_column(field, self.json_column_name, self.properties[field], column_names[field][0], "table_alias") for field in column_names ] - def extract_json_column(self, property_name: str, json_column_name: str, definition: Dict, column_name: str, table_alias: str) -> str: + @staticmethod + def extract_json_column(property_name: str, json_column_name: str, definition: Dict, column_name: str, table_alias: str) -> str: json_path = [property_name] # In some cases, some destination aren't able to parse the JSON blob using the original property name # we make their life easier by using a pre-populated and sanitized column name instead... @@ -380,13 +438,18 @@ def generate_column_typing_model(self, from_table: str, column_names: Dict[str, {%- for field in fields %} {{ field }}, {%- endfor %} - {{ col_emitted_at }} + {{ col_ab_id }}, + {{ col_emitted_at }}, + {{ '{{ current_timestamp() }}' }} as {{ col_normalized_at }} from {{ from_table }} {{ sql_table_comment }} +where 1 = 1 """ ) sql = template.render( + col_ab_id=self.get_ab_id(), col_emitted_at=self.get_emitted_at(), + col_normalized_at=self.get_normalized_at(), parent_hash_id=self.parent_hash_id(), fields=self.cast_property_types(column_names), from_table=jinja_call(from_table), @@ -458,7 +521,8 @@ def generate_mysql_date_format_statement(column_name: str) -> str: ) return template.render(column_name=column_name) - def generate_snowflake_timestamp_statement(self, column_name: str) -> str: + @staticmethod + def generate_snowflake_timestamp_statement(column_name: str) -> str: """ Generates snowflake DB specific timestamp case when statement """ @@ -501,6 +565,7 @@ def generate_id_hashing_model(self, from_table: str, column_names: Dict[str, Tup tmp.* from {{ from_table }} tmp {{ sql_table_comment }} +where 1 = 1 """ ) @@ -547,58 +612,120 @@ def safe_cast_to_string(definition: Dict, column_name: str, destination_type: De return col - def generate_dedup_record_model(self, from_table: str, column_names: Dict[str, Tuple[str, str]]) -> str: - template = Template( - """ --- SQL model to prepare for deduplicating records based on the hash record column -select - row_number() over ( - partition by {{ hash_id }} - order by {{ col_emitted_at }} asc - ) as {{ active_row }}, - tmp.* -from {{ from_table }} tmp -{{ sql_table_comment }} - """ - ) - sql = template.render( - active_row=self.process_col("_airbyte_row_num"), - col_emitted_at=self.get_emitted_at(), - hash_id=self.hash_id(), - from_table=jinja_call(from_table), - sql_table_comment=self.sql_table_comment(include_from_table=True), - ) - return sql - - def process_col(self, col: str): - return self.name_transformer.normalize_column_name(col) - def generate_scd_type_2_model(self, from_table: str, column_names: Dict[str, Tuple[str, str]]) -> str: - scd_sql_template = """ --- SQL model to build a Type 2 Slowly Changing Dimension (SCD) table for each record identified by their primary key +with +{{ '{% if is_incremental() %}' }} +new_data as ( + -- retrieve incremental "new" data + select + * + from {{'{{'}} {{ from_table }} {{'}}'}} + {{ sql_table_comment }} + where 1 = 1 + {{'{{'}} incremental_clause({{ quoted_col_emitted_at }}) {{'}}'}} +), +new_data_ids as ( + -- build a subset of {{ unique_key }} from rows that are new + select distinct + {{ '{{' }} dbt_utils.surrogate_key([ + {%- for primary_key in primary_keys %} + {{ primary_key }}, + {%- endfor %} + ]) {{ '}}' }} as {{ unique_key }} + from new_data +), +previous_active_scd_data as ( + -- retrieve "incomplete old" data that needs to be updated with an end date because of new changes + select + {{ '{{' }} star_intersect({{ from_table }}, this, from_alias='inc_data', intersect_alias='this_data') {{ '}}' }} + from {{ '{{ this }}' }} as this_data + -- make a join with new_data using primary key to filter active data that need to be updated only + join new_data_ids on this_data.{{ unique_key }} = new_data_ids.{{ unique_key }} + -- force left join to NULL values (we just need to transfer column types only for the star_intersect macro) + left join {{'{{'}} {{ from_table }} {{'}}'}} as inc_data on 1 = 0 + where {{ active_row }} = 1 +), +input_data as ( + select {{ '{{' }} dbt_utils.star({{ from_table }}) {{ '}}' }} from new_data + union all + select {{ '{{' }} dbt_utils.star({{ from_table }}) {{ '}}' }} from previous_active_scd_data +), +{{ '{% else %}' }} +input_data as ( + select * + from {{'{{'}} {{ from_table }} {{'}}'}} + {{ sql_table_comment }} +), +{{ '{% endif %}' }} +scd_data as ( + -- SQL model to build a Type 2 Slowly Changing Dimension (SCD) table for each record identified by their primary key + select + {%- if parent_hash_id %} + {{ parent_hash_id }}, + {%- endif %} + {{ '{{' }} dbt_utils.surrogate_key([ + {%- for primary_key in primary_keys %} + {{ primary_key }}, + {%- endfor %} + ]) {{ '}}' }} as {{ unique_key }}, + {%- for field in fields %} + {{ field }}, + {%- endfor %} + {{ cursor_field }} as {{ airbyte_start_at }}, + lag({{ cursor_field }}) over ( + partition by {{ primary_key_partition | join(", ") }} + order by + {{ cursor_field }} {{ order_null }}, + {{ cursor_field }} desc, + {{ col_emitted_at }} desc{{ cdc_updated_at_order }} + ) as {{ airbyte_end_at }}, + case when lag({{ cursor_field }}) over ( + partition by {{ primary_key_partition | join(", ") }} + order by + {{ cursor_field }} {{ order_null }}, + {{ cursor_field }} desc, + {{ col_emitted_at }} desc{{ cdc_updated_at_order }} + ) is null {{ cdc_active_row }} then 1 else 0 end as {{ active_row }}, + {{ col_ab_id }}, + {{ col_emitted_at }}, + {{ hash_id }} + from input_data +), +dedup_data as ( + select + -- we need to ensure de-duplicated rows for merge/update queries + -- additionally, we generate a unique key for the scd table + row_number() over ( + partition by {{ unique_key }}, {{ airbyte_start_at }}, {{ col_emitted_at }}{{ cdc_cols }} + order by {{ col_ab_id }} + ) as {{ airbyte_row_num }}, + {{ '{{' }} dbt_utils.surrogate_key([ + {{ quoted_unique_key }}, + {{ quoted_airbyte_start_at }}, + {{ quoted_col_emitted_at }}{{ quoted_cdc_cols }} + ]) {{ '}}' }} as {{ airbyte_unique_key_scd }}, + scd_data.* + from scd_data +) select - {%- if parent_hash_id %} - {{ parent_hash_id }}, - {%- endif %} - {%- for field in fields %} - {{ field }}, - {%- endfor %} - {{ cursor_field }} as {{ airbyte_start_at }}, - lag({{ cursor_field }}) over ( - partition by {{ primary_key }} - order by {{ cursor_field }} {{ order_null }}, {{ cursor_field }} desc, {{ col_emitted_at }} desc - ) as {{ airbyte_end_at }}, - case when lag({{ cursor_field }}) over ( - partition by {{ primary_key }} - order by {{ cursor_field }} {{ order_null }}, {{ cursor_field }} desc, {{ col_emitted_at }} desc{{ cdc_updated_at_order }} - ) is null {{ cdc_active_row }} then 1 else 0 end as {{ active_row }}, - {{ col_emitted_at }}, - {{ hash_id }} -from {{ from_table }} -{{ sql_table_comment }} + {%- if parent_hash_id %} + {{ parent_hash_id }}, + {%- endif %} + {{ unique_key }}, + {{ airbyte_unique_key_scd }}, + {%- for field in fields %} + {{ field }}, + {%- endfor %} + {{ airbyte_start_at }}, + {{ airbyte_end_at }}, + {{ active_row }}, + {{ col_ab_id }}, + {{ col_emitted_at }}, + {{ '{{ current_timestamp() }}' }} as {{ col_normalized_at }}, + {{ hash_id }} +from dedup_data where {{ airbyte_row_num }} = 1 """ - template = Template(scd_sql_template) order_null = "is null asc" @@ -608,34 +735,60 @@ def generate_scd_type_2_model(self, from_table: str, column_names: Dict[str, Tup # SQL Server treats NULL values as the lowest values, then sorted in ascending order, NULLs come first. order_null = "desc" + # TODO move all cdc columns out of scd models cdc_active_row_pattern = "" cdc_updated_order_pattern = "" + cdc_cols = "" + quoted_cdc_cols = "" if "_ab_cdc_deleted_at" in column_names.keys(): col_cdc_deleted_at = self.name_transformer.normalize_column_name("_ab_cdc_deleted_at") col_cdc_updated_at = self.name_transformer.normalize_column_name("_ab_cdc_updated_at") + quoted_col_cdc_deleted_at = self.name_transformer.normalize_column_name("_ab_cdc_deleted_at", in_jinja=True) + quoted_col_cdc_updated_at = self.name_transformer.normalize_column_name("_ab_cdc_updated_at", in_jinja=True) cdc_active_row_pattern = f"and {col_cdc_deleted_at} is null " cdc_updated_order_pattern = f", {col_cdc_updated_at} desc" + cdc_cols = ( + f", cast({col_cdc_deleted_at} as " + + "{{ dbt_utils.type_string() }})" + + f", cast({col_cdc_updated_at} as " + + "{{ dbt_utils.type_string() }})" + ) + quoted_cdc_cols = f", {quoted_col_cdc_deleted_at}, {quoted_col_cdc_updated_at}" if "_ab_cdc_log_pos" in column_names.keys(): col_cdc_log_pos = self.name_transformer.normalize_column_name("_ab_cdc_log_pos") + quoted_col_cdc_log_pos = self.name_transformer.normalize_column_name("_ab_cdc_log_pos", in_jinja=True) cdc_updated_order_pattern += f", {col_cdc_log_pos} desc" + cdc_cols += f", cast({col_cdc_log_pos} as " + "{{ dbt_utils.type_string() }})" + quoted_cdc_cols += f", {quoted_col_cdc_log_pos}" sql = template.render( order_null=order_null, airbyte_start_at=self.name_transformer.normalize_column_name("_airbyte_start_at"), + quoted_airbyte_start_at=self.name_transformer.normalize_column_name("_airbyte_start_at", in_jinja=True), airbyte_end_at=self.name_transformer.normalize_column_name("_airbyte_end_at"), active_row=self.name_transformer.normalize_column_name("_airbyte_active_row"), - lag_emitted_at=self.get_emitted_at(in_jinja=True), + airbyte_row_num=self.name_transformer.normalize_column_name("_airbyte_row_num"), + quoted_airbyte_row_num=self.name_transformer.normalize_column_name("_airbyte_row_num", in_jinja=True), + airbyte_unique_key_scd=self.name_transformer.normalize_column_name("_airbyte_unique_key_scd"), + unique_key=self.get_unique_key(), + quoted_unique_key=self.get_unique_key(in_jinja=True), + col_ab_id=self.get_ab_id(), col_emitted_at=self.get_emitted_at(), + quoted_col_emitted_at=self.get_emitted_at(in_jinja=True), + col_normalized_at=self.get_normalized_at(), parent_hash_id=self.parent_hash_id(), fields=self.list_fields(column_names), cursor_field=self.get_cursor_field(column_names), - primary_key=self.get_primary_key(column_names), + primary_keys=self.list_primary_keys(column_names), + primary_key_partition=self.get_primary_key_partition(column_names), hash_id=self.hash_id(), - from_table=jinja_call(from_table), + from_table=from_table, sql_table_comment=self.sql_table_comment(include_from_table=True), cdc_active_row=cdc_active_row_pattern, cdc_updated_at_order=cdc_updated_order_pattern, + cdc_cols=cdc_cols, + quoted_cdc_cols=quoted_cdc_cols, ) return sql @@ -653,9 +806,18 @@ def get_cursor_field(self, column_names: Dict[str, Tuple[str, str]], in_jinja: b return cursor - def get_primary_key(self, column_names: Dict[str, Tuple[str, str]]) -> str: + def list_primary_keys(self, column_names: Dict[str, Tuple[str, str]]) -> List[str]: + primary_keys = [] + for key_path in self.primary_key: + if len(key_path) == 1: + primary_keys.append(column_names[key_path[0]][1]) + else: + raise ValueError(f"Unsupported nested path {'.'.join(key_path)} for stream {self.stream_name}") + return primary_keys + + def get_primary_key_partition(self, column_names: Dict[str, Tuple[str, str]]) -> List[str]: if self.primary_key and len(self.primary_key) > 0: - return ", ".join([self.get_primary_key_from_path(column_names, path) for path in self.primary_key]) + return [self.get_primary_key_from_path(column_names, path) for path in self.primary_key] else: raise ValueError(f"No primary key specified for stream {self.stream_name}") @@ -681,7 +843,7 @@ def get_primary_key_from_path(self, column_names: Dict[str, Tuple[str, str]], pa else: raise ValueError(f"No path specified for stream {self.stream_name}") - def generate_final_model(self, from_table: str, column_names: Dict[str, Tuple[str, str]]) -> str: + def generate_final_model(self, from_table: str, column_names: Dict[str, Tuple[str, str]], unique_key: str = "") -> str: template = Template( """ -- Final base SQL model @@ -689,62 +851,168 @@ def generate_final_model(self, from_table: str, column_names: Dict[str, Tuple[st {%- if parent_hash_id %} {{ parent_hash_id }}, {%- endif %} + {%- if unique_key %} + {{ unique_key }}, + {%- endif %} {%- for field in fields %} {{ field }}, {%- endfor %} + {{ col_ab_id }}, {{ col_emitted_at }}, + {{ '{{ current_timestamp() }}' }} as {{ col_normalized_at }}, {{ hash_id }} from {{ from_table }} {{ sql_table_comment }} +where 1 = 1 """ ) sql = template.render( + col_ab_id=self.get_ab_id(), col_emitted_at=self.get_emitted_at(), + col_normalized_at=self.get_normalized_at(), parent_hash_id=self.parent_hash_id(), fields=self.list_fields(column_names), hash_id=self.hash_id(), from_table=jinja_call(from_table), sql_table_comment=self.sql_table_comment(include_from_table=True), + unique_key=unique_key, ) return sql - def list_fields(self, column_names: Dict[str, Tuple[str, str]]) -> List[str]: + def add_incremental_clause(self, sql_query: str) -> str: + template = Template( + """ +{{ sql_query }} +{{'{{'}} incremental_clause({{ col_emitted_at }}) {{'}}'}} + """ + ) + sql = template.render( + sql_query=sql_query, + col_emitted_at=self.get_emitted_at(in_jinja=True), + ) + return sql + + @staticmethod + def list_fields(column_names: Dict[str, Tuple[str, str]]) -> List[str]: return [column_names[field][0] for field in column_names] - def add_to_outputs(self, sql: str, is_intermediate: bool, column_count: int = 0, suffix: str = "") -> str: + def add_to_outputs( + self, + sql: str, + materialization_mode: TableMaterializationType, + suffix: str = "", + unique_key: str = "", + subdir: str = "", + partition_by: PartitionScheme = PartitionScheme.DEFAULT, + ) -> str: + is_intermediate = materialization_mode in [TableMaterializationType.CTE, TableMaterializationType.VIEW] schema = self.get_schema(is_intermediate) # MySQL table names need to be manually truncated, because it does not do it automatically truncate_name = self.destination_type == DestinationType.MYSQL table_name = self.tables_registry.get_table_name(schema, self.json_path, self.stream_name, suffix, truncate_name) file_name = self.tables_registry.get_file_name(schema, self.json_path, self.stream_name, suffix, truncate_name) file = f"{file_name}.sql" + output = os.path.join(materialization_mode.value, subdir, self.schema, file) + config = self.get_model_partition_config(partition_by, unique_key) + if file_name != table_name: + # The alias() macro configs a model's final table name. + config["alias"] = f'"{table_name}"' + if self.destination_type == DestinationType.ORACLE: + # oracle does not allow changing schemas + config["schema"] = f'"{self.default_schema}"' + else: + config["schema"] = f'"{schema}"' + if self.source_sync_mode == SyncMode.incremental and suffix != "scd": + # incremental is handled in the SCD SQL already + sql = self.add_incremental_clause(sql) + template = Template( + """ +{{ '{{' }} config( +{%- for key in config %} + {{ key }} = {{ config[key] }}, +{%- endfor %} + tags = [ {{ tags }} ] +) {{ '}}' }} +{{ sql }} + """ + ) + self.sql_outputs[output] = template.render(config=config, sql=sql, tags=self.get_model_tags(is_intermediate)) + json_path = self.current_json_path() + print(f" Generating {output} from {json_path}") + return ref_table(file_name) + + def get_model_materialization_mode(self, is_intermediate: bool, column_count: int = 0) -> TableMaterializationType: if is_intermediate: if column_count <= MAXIMUM_COLUMNS_TO_USE_EPHEMERAL: - output = os.path.join("airbyte_ctes", self.schema, file) + return TableMaterializationType.CTE else: # dbt throws "maximum recursion depth exceeded" exception at runtime # if ephemeral is used with large number of columns, use views instead - output = os.path.join("airbyte_views", self.schema, file) + return TableMaterializationType.VIEW else: - output = os.path.join("airbyte_tables", self.schema, file) - tags = self.get_model_tags(is_intermediate) - # The alias() macro configs a model's final table name. - if file_name != table_name: - header = jinja_call(f'config(alias="{table_name}", schema="{schema}", tags=[{tags}])') - else: - if self.destination_type == DestinationType.ORACLE: - header = jinja_call(f'config(schema="{self.default_schema}", tags=[{tags}])') + if self.source_sync_mode == SyncMode.incremental: + return TableMaterializationType.INCREMENTAL else: - header = jinja_call(f'config(schema="{schema}", tags=[{tags}])') - self.sql_outputs[ - output - ] = f""" -{header} -{sql} -""" - json_path = self.current_json_path() - print(f" Generating {output} from {json_path}") - return ref_table(file_name) + return TableMaterializationType.TABLE + + def get_model_partition_config(self, partition_by: PartitionScheme, unique_key: str) -> Dict: + """ + Defines partition, clustering and unique key parameters for each destination. + The goal of these are to make read more performant. + + In general, we need to do lookups on the last emitted_at column to know if a record is freshly produced and need to be + incrementally processed or not. + But in certain models, such as SCD tables for example, we also need to retrieve older data to update their type 2 SCD end_dates, + thus a different partitioning scheme is used to optimize that use case. + """ + config = {} + if self.destination_type == DestinationType.BIGQUERY: + # see https://docs.getdbt.com/reference/resource-configs/bigquery-configs + if partition_by in [PartitionScheme.UNIQUE_KEY, PartitionScheme.ACTIVE_ROW]: + config["cluster_by"] = '["_airbyte_unique_key","_airbyte_emitted_at"]' + else: + config["cluster_by"] = '"_airbyte_emitted_at"' + if partition_by == PartitionScheme.ACTIVE_ROW: + config["partition_by"] = ( + '{"field": "_airbyte_active_row", "data_type": "int64", ' '"range": {"start": 0, "end": 1, "interval": 1}}' + ) + elif partition_by == PartitionScheme.NOTHING: + pass + else: + config["partition_by"] = '{"field": "_airbyte_emitted_at", "data_type": "timestamp", "granularity": "day"}' + elif self.destination_type == DestinationType.POSTGRES: + # see https://docs.getdbt.com/reference/resource-configs/postgres-configs + if partition_by == PartitionScheme.ACTIVE_ROW: + config["indexes"] = "[{'columns':['_airbyte_active_row','_airbyte_unique_key','_airbyte_emitted_at'],'type': 'btree'}]" + elif partition_by == PartitionScheme.UNIQUE_KEY: + config["indexes"] = "[{'columns':['_airbyte_unique_key','_airbyte_emitted_at'],'type': 'btree'}]" + else: + config["indexes"] = "[{'columns':['_airbyte_emitted_at'],'type':'hash'}]" + elif self.destination_type == DestinationType.REDSHIFT: + # see https://docs.getdbt.com/reference/resource-configs/redshift-configs + if partition_by == PartitionScheme.ACTIVE_ROW: + config["sort"] = '["_airbyte_active_row", "_airbyte_unique_key", "_airbyte_emitted_at"]' + elif partition_by == PartitionScheme.UNIQUE_KEY: + config["sort"] = '["_airbyte_unique_key", "_airbyte_emitted_at"]' + elif partition_by == PartitionScheme.NOTHING: + pass + else: + config["sort"] = '"_airbyte_emitted_at"' + elif self.destination_type == DestinationType.SNOWFLAKE: + # see https://docs.getdbt.com/reference/resource-configs/snowflake-configs + if partition_by == PartitionScheme.ACTIVE_ROW: + config["cluster_by"] = '["_AIRBYTE_ACTIVE_ROW", "_AIRBYTE_UNIQUE_KEY", "_AIRBYTE_EMITTED_AT"]' + elif partition_by == PartitionScheme.UNIQUE_KEY: + config["cluster_by"] = '["_AIRBYTE_UNIQUE_KEY", "_AIRBYTE_EMITTED_AT"]' + elif partition_by == PartitionScheme.NOTHING: + pass + else: + config["cluster_by"] = '["_AIRBYTE_EMITTED_AT"]' + if unique_key: + config["unique_key"] = f'"{unique_key}"' + else: + config["unique_key"] = f"env_var('AIRBYTE_DEFAULT_UNIQUE_KEY', {self.get_ab_id(in_jinja=True)})" + return config def get_model_tags(self, is_intermediate: bool) -> str: tags = "" @@ -807,19 +1075,19 @@ def unnesting_before_query(self) -> str: return jinja_call(f"unnest_cte({parent_file_name}, {parent_stream_name}, {quoted_field})") return "" - def unnesting_after_query(self) -> str: - result = "" + def unnesting_from(self) -> str: if self.parent: - cross_join = "" if self.is_nested_array: parent_stream_name = f"'{self.parent.normalized_stream_name()}'" quoted_field = self.name_transformer.normalize_column_name(self.stream_name, in_jinja=True) - cross_join = jinja_call(f"cross_join_unnest({parent_stream_name}, {quoted_field})") + return jinja_call(f"cross_join_unnest({parent_stream_name}, {quoted_field})") + return "" + + def unnesting_where(self) -> str: + if self.parent: column_name = self.name_transformer.normalize_column_name(self.stream_name) - result = f""" -{cross_join} -where {column_name} is not null""" - return result + return f"and {column_name} is not null" + return "" # Static Functions diff --git a/airbyte-integrations/bases/base-normalization/normalization/transform_config/dbt_project_base.yml b/airbyte-integrations/bases/base-normalization/normalization/transform_config/dbt_project_base.yml deleted file mode 100755 index 057395b406e1..000000000000 --- a/airbyte-integrations/bases/base-normalization/normalization/transform_config/dbt_project_base.yml +++ /dev/null @@ -1,55 +0,0 @@ -# Name your package! Package names should contain only lowercase characters -# and underscores. A good package name should reflect your organization's -# name or the intended use of these models -name: 'airbyte_utils' -version: '1.0' -config-version: 2 - -# This setting configures which "profile" dbt uses for this project. Profiles contain -# database connection information, and should be configured in the ~/.dbt/profiles.yml file -profile: 'normalize' - -# These configurations specify where dbt should look for different types of files. -# The `source-paths` config, for example, states that source models can be found -# in the "models/" directory. You probably won't need to change these! -source-paths: ["models"] -docs-paths: ["docs"] -analysis-paths: ["analysis"] -test-paths: ["tests"] -data-paths: ["data"] -macro-paths: ["macros"] - -target-path: "../build" # directory which will store compiled SQL files -log-path: "../logs" # directory which will store DBT logs -modules-path: "/tmp/dbt_modules" # directory which will store external DBT dependencies - -clean-targets: # directories to be removed by `dbt clean` - - "build" - - "dbt_modules" - -quoting: - database: true -# Temporarily disabling the behavior of the ExtendedNameTransformer on table/schema names, see (issue #1785) -# all schemas should be unquoted - schema: false - identifier: true - -# You can define configurations for models in the `source-paths` directory here. -# Using these configurations, you can enable or disable models, change how they -# are materialized, and more! -models: - airbyte_utils: - generated: - airbyte_ctes: - +tags: airbyte_internal_cte - +materialized: ephemeral - airbyte_views: - +tags: airbyte_internal_views - +materialized: view - airbyte_tables: - +tags: normalized_tables - +materialized: table - +materialized: table - -vars: - dbt_utils_dispatch_list: ['airbyte_utils'] diff --git a/airbyte-integrations/bases/base-normalization/unit_tests/test_stream_processor.py b/airbyte-integrations/bases/base-normalization/unit_tests/test_stream_processor.py index 8e628b707fab..6d33c56bcf4d 100644 --- a/airbyte-integrations/bases/base-normalization/unit_tests/test_stream_processor.py +++ b/airbyte-integrations/bases/base-normalization/unit_tests/test_stream_processor.py @@ -96,7 +96,10 @@ def test_primary_key( from_table="", ) try: - assert stream_processor.get_primary_key(column_names=stream_processor.extract_column_names()) == expected_final_primary_key_string + assert ( + ", ".join(stream_processor.get_primary_key_partition(column_names=stream_processor.extract_column_names())) + == expected_final_primary_key_string + ) except ValueError as e: if not expecting_exception: raise e diff --git a/airbyte-integrations/connectors/source-clickhouse/src/test-integration/java/io/airbyte/integrations/io/airbyte/integration_tests/sources/AbstractSshClickHouseSourceAcceptanceTest.java b/airbyte-integrations/connectors/source-clickhouse/src/test-integration/java/io/airbyte/integrations/io/airbyte/integration_tests/sources/AbstractSshClickHouseSourceAcceptanceTest.java index 67cc99348497..deea69f3cb8e 100644 --- a/airbyte-integrations/connectors/source-clickhouse/src/test-integration/java/io/airbyte/integrations/io/airbyte/integration_tests/sources/AbstractSshClickHouseSourceAcceptanceTest.java +++ b/airbyte-integrations/connectors/source-clickhouse/src/test-integration/java/io/airbyte/integrations/io/airbyte/integration_tests/sources/AbstractSshClickHouseSourceAcceptanceTest.java @@ -5,11 +5,8 @@ package io.airbyte.integrations.io.airbyte.integration_tests.sources; import com.fasterxml.jackson.databind.JsonNode; -import com.google.common.collect.ImmutableMap; import com.google.common.collect.Lists; import io.airbyte.commons.json.Jsons; -import io.airbyte.commons.resources.MoreResources; -import io.airbyte.db.Database; import io.airbyte.db.Databases; import io.airbyte.db.jdbc.JdbcDatabase; import io.airbyte.db.jdbc.JdbcUtils; @@ -30,7 +27,6 @@ import java.util.Collections; import java.util.HashMap; import java.util.List; -import org.jooq.SQLDialect; import org.testcontainers.containers.ClickHouseContainer; public abstract class AbstractSshClickHouseSourceAcceptanceTest extends SourceAcceptanceTest { @@ -99,6 +95,7 @@ protected void setupEnvironment(final TestDestinationEnv environment) throws Exc populateDatabaseTestData(); } + private void startTestContainers() { bastion.initAndStartBastion(); initAndStartJdbcContainer(); diff --git a/airbyte-integrations/connectors/source-clickhouse/src/test-integration/java/io/airbyte/integrations/io/airbyte/integration_tests/sources/ClickHouseSourceAcceptanceTest.java b/airbyte-integrations/connectors/source-clickhouse/src/test-integration/java/io/airbyte/integrations/io/airbyte/integration_tests/sources/ClickHouseSourceAcceptanceTest.java index 2a348a556f37..a5c54c91ef67 100644 --- a/airbyte-integrations/connectors/source-clickhouse/src/test-integration/java/io/airbyte/integrations/io/airbyte/integration_tests/sources/ClickHouseSourceAcceptanceTest.java +++ b/airbyte-integrations/connectors/source-clickhouse/src/test-integration/java/io/airbyte/integrations/io/airbyte/integration_tests/sources/ClickHouseSourceAcceptanceTest.java @@ -8,7 +8,6 @@ import com.google.common.collect.ImmutableMap; import com.google.common.collect.Lists; import io.airbyte.commons.json.Jsons; -import io.airbyte.commons.resources.MoreResources; import io.airbyte.db.Databases; import io.airbyte.db.jdbc.JdbcDatabase; import io.airbyte.db.jdbc.JdbcUtils; diff --git a/airbyte-tests/src/acceptanceTests/java/io/airbyte/test/acceptance/AcceptanceTests.java b/airbyte-tests/src/acceptanceTests/java/io/airbyte/test/acceptance/AcceptanceTests.java index 587d3b83a42e..24c73947f793 100644 --- a/airbyte-tests/src/acceptanceTests/java/io/airbyte/test/acceptance/AcceptanceTests.java +++ b/airbyte-tests/src/acceptanceTests/java/io/airbyte/test/acceptance/AcceptanceTests.java @@ -1149,7 +1149,7 @@ private void clearDestinationDbData() throws SQLException { final Database database = getDestinationDatabase(); final Set pairs = listAllTables(database); for (final SchemaTableNamePair pair : pairs) { - database.query(context -> context.execute(String.format("DROP TABLE %s.%s", pair.schemaName, pair.tableName))); + database.query(context -> context.execute(String.format("DROP TABLE %s.%s CASCADE", pair.schemaName, pair.tableName))); } } diff --git a/airbyte-workers/src/main/java/io/airbyte/workers/normalization/NormalizationRunnerFactory.java b/airbyte-workers/src/main/java/io/airbyte/workers/normalization/NormalizationRunnerFactory.java index 1299e5427141..fb238fb597e3 100644 --- a/airbyte-workers/src/main/java/io/airbyte/workers/normalization/NormalizationRunnerFactory.java +++ b/airbyte-workers/src/main/java/io/airbyte/workers/normalization/NormalizationRunnerFactory.java @@ -13,7 +13,7 @@ public class NormalizationRunnerFactory { public static final String BASE_NORMALIZATION_IMAGE_NAME = "airbyte/normalization"; - public static final String NORMALIZATION_VERSION = "0.1.56"; + public static final String NORMALIZATION_VERSION = "0.1.58"; static final Map> NORMALIZATION_MAPPING = ImmutableMap.>builder() diff --git a/build.gradle b/build.gradle index 7224e6d58394..271b27c23feb 100644 --- a/build.gradle +++ b/build.gradle @@ -58,24 +58,22 @@ def createJavaLicenseWith = { license -> // monorepo setup and it doesn't actually exclude directories reliably. This code makes the behavior predictable. def createSpotlessTarget = { pattern -> def excludes = [ - '.gradle', - 'node_modules', - '.eggs', - '.mypy_cache', - '.venv', - '*.egg-info', - 'build', - 'dbt-project-template', - 'dbt-project-template-mssql', - 'dbt-project-template-mysql', - 'dbt-project-template-oracle', - 'dbt_data_tests', - 'dbt_data_tests_tmp', - 'dbt_schema_tests', - 'normalization_test_output', - 'tools', - 'secrets', - 'charts' // Helm charts often have injected template strings that will fail general linting. Helm linting is done separately. + '.gradle', + 'node_modules', + '.eggs', + '.mypy_cache', + '.venv', + '*.egg-info', + 'build', + 'dbt-project-template', + 'dbt-project-template-mssql', + 'dbt-project-template-mysql', + 'dbt-project-template-oracle', + 'dbt_test_config', + 'normalization_test_output', + 'tools', + 'secrets', + 'charts' // Helm charts often have injected template strings that will fail general linting. Helm linting is done separately. ] if (System.getenv().containsKey("SUB_BUILD")) { diff --git a/docs/understanding-airbyte/basic-normalization.md b/docs/understanding-airbyte/basic-normalization.md index 3939e64e1403..1e3c02452a0a 100644 --- a/docs/understanding-airbyte/basic-normalization.md +++ b/docs/understanding-airbyte/basic-normalization.md @@ -22,23 +22,50 @@ Basic Normalization uses a fixed set of rules to map a json object from a source } ``` -Then basic normalization would create the following table: - +The destination connectors produce the following raw table in the destination database: ```sql -CREATE TABLE "cars" ( +CREATE TABLE "_airbyte_raw_cars" ( -- metadata added by airbyte - "_airbyte_cars_hashid" VARCHAR, -- uuid assigned by airbyte derived from a hash of the data. + "_airbyte_ab_id" VARCHAR, -- uuid value assigned by connectors to each row of the data written in the destination. "_airbyte_emitted_at" TIMESTAMP_WITH_TIMEZONE, -- time at which the record was emitted. - "_airbyte_normalized_at" TIMESTAMP_WITH_TIMEZONE, -- time at which the record was normalized. + "_airbyte_data" JSONB -- data stored as a Json Blob. +); +``` + +Then, basic normalization would create the following table: + +```sql +CREATE TABLE "cars" ( + "_airbyte_ab_id" VARCHAR, + "_airbyte_emitted_at" TIMESTAMP_WITH_TIMEZONE, + "_airbyte_cars_hashid" VARCHAR, + "_airbyte_normalized_at" TIMESTAMP_WITH_TIMEZONE, - -- data + -- data from source "make" VARCHAR, "model" VARCHAR, "horsepower" INTEGER ); ``` -You'll notice that we add some metadata to keep track of important information about each record. +## Normalization metadata columns + +You'll notice that some metadata are added to keep track of important information about each record. +- Some are introduced at the destination connector level: These are propagated by the normalization process from the raw table to the final table + - `_airbyte_ab_id`: uuid value assigned by connectors to each row of the data written in the destination. + - `_airbyte_emitted_at`: time at which the record was emitted and recorded by destination connector. +- While other metadata columns are created at the normalization step. + - `_airbyte__hashid`: hash value assigned by airbyte normalization derived from a hash function of the record data. + - `_airbyte_normalized_at`: time at which the record was last normalized (useful to track when incremental transformations are performed) + +Additional metadata columns can be added on some tables depending on the usage: +- On the Slowly Changing Dimension (SCD) tables: +- `_airbyte_start_at`: equivalent to the cursor column defined on the table, denotes when the row was first seen +- `_airbyte_end_at`: denotes until when the row was seen with these particular values. If this column is not NULL, then the record has been updated and is no longer the most up to date one. If NULL, then the row is the latest version for the record. +- `_airbyte_active_row`: denotes if the row for the record is the latest version or not. +- `_airbyte_unique_key_scd`: hash of primary keys + cursors used to de-duplicate the scd table. +- On de-duplicated (and SCD) tables: +- `_airbyte_unique_key`: hash of primary keys used to de-duplicate the final table. The [normalization rules](basic-normalization.md#Rules) are _not_ configurable. They are designed to pick a reasonable set of defaults to hit the 80/20 rule of data normalization. We respect that normalization is a detail-oriented problem and that with a fixed set of rules, we cannot normalize your data in such a way that covers all use cases. If this feature does not meet your normalization needs, we always put the full json blob in destination as well, so that you can parse that object however best meets your use case. We will be adding more advanced normalization functionality shortly. Airbyte is focused on the EL of ELT. If you need a really featureful tool for the transformations then, we suggest trying out dbt. @@ -66,12 +93,13 @@ In Airbyte, the current normalization option is implemented using a dbt Transfor ## Destinations that Support Basic Normalization * [BigQuery](../integrations/destinations/bigquery.md) +* [MS Server SQL](../integrations/destinations/mssql.md) * [MySQL](../integrations/destinations/mysql.md) * The server must support the `WITH` keyword. * Require MySQL >= 8.0, or MariaDB >= 10.2.1. * [Postgres](../integrations/destinations/postgres.md) -* [Snowflake](../integrations/destinations/snowflake.md) * [Redshift](../integrations/destinations/redshift.md) +* [Snowflake](../integrations/destinations/snowflake.md) Basic Normalization can be used in each of these destinations by configuring the "basic normalization" field to true when configuring the destination in the UI. @@ -90,6 +118,7 @@ Airbyte uses the types described in the catalog to determine the correct type fo | `string` | string | | | `bit` | boolean | | | `boolean` | boolean | | +| `string` with format label `date-time`| timestamp with timezone | | | `array` | new table | see [nesting](basic-normalization.md#Nesting) | | `object` | new table | see [nesting](basic-normalization.md#Nesting) | @@ -287,9 +316,23 @@ To enable basic normalization \(which is optional\), you can toggle it on or dis ![](../.gitbook/assets/basic-normalization-configuration.png) +## Incremental runs + +When the source is configured with incremental sync modes such as ([incremental append](connections/incremental-append.md) or [incremental deduped history](connections/incremental-deduped-history.md)), only rows that have changed in the source are transferred over the network and written by the destination connector. +Normalization will then try to build the normalized tables incrementally as the rows in the raw tables that have been created or updated since the last time dbt ran. As such, on each dbt run, the models get built incrementally. This limits the amount of data that needs to be transformed, vastly reducing the runtime of the transformations. This improves warehouse performance and reduces compute costs. +Because normalization can be either run incrementally and, or, in full refresh, a technical column `_airbyte_normalized_at` can serve to track when was the last time a record has been transformed and written by normalization. +This may be greatly diverge from the `_airbyte_emitted_at` value as the normalized tables could be totally re-built at a latter time from the data stored in the `_airbyte_raw` tables. + +## Partitioning, clustering, sorting, indexing + +Normalization produces tables that are partitioned, clustered, sorted or indexed depending on the destination engine and on the type of tables being built. The goal of these are to make read more performant, especially when running incremental updates. + +In general, normalization needs to do lookup on the last emitted_at column to know if a record is freshly produced and need to be +incrementally processed or not. But in certain models, such as SCD tables for example, we also need to retrieve older data to update their type 2 SCD end_date and active_row flags, thus a different partitioning scheme is used to optimize that use case. + ## Extending Basic Normalization -Note that all the choices made by Normalization as described in this documentation page in terms of naming could be overridden by your own custom choices. To do so, you can follow the following tutorials: +Note that all the choices made by Normalization as described in this documentation page in terms of naming (and more) could be overridden by your own custom choices. To do so, you can follow the following tutorials: * to build a [custom SQL view](../operator-guides/transformation-and-normalization/transformations-with-sql.md) with your own naming conventions * to export, edit and run [custom dbt normalization](../operator-guides/transformation-and-normalization/transformations-with-dbt.md) yourself @@ -305,6 +348,7 @@ Therefore, in order to "upgrade" to the desired normalization version, you need | Airbyte Version | Normalization Version | Date | Pull Request | Subject | | :--- | :--- | :--- | :--- | :--- | +| 0.30.24-alpha | 0.1.57 | 2021-10-26 | [\#7162](https://github.com/airbytehq/airbyte/pull/7162) | Implement incremental dbt updates | | 0.30.16-alpha | 0.1.52 | 2021-10-07 | [\#6379](https://github.com/airbytehq/airbyte/pull/6379) | Handle empty string for date and date-time format | | 0.30.16-alpha | 0.1.51 | 2021-10-08 | [\#6799](https://github.com/airbytehq/airbyte/pull/6799) | Added support for ad\_cdc\_log\_pos while normalization | | 0.30.16-alpha | 0.1.50 | 2021-10-07 | [\#6079](https://github.com/airbytehq/airbyte/pull/6079) | Added support for MS SQL Server normalization |