Introduce normalization integration tests (#3025)
* Speed up normalization unit tests by dropping the hubspot catalog (too heavy; it will be covered in integration tests instead)

* Add integration tests for normalization

* Add dedup test case

* adjust build.gradle

* add readme for normalization

* Share PATH env variable with subprocess calls

* Handle git non-versioned tests vs versioned ones

* Format code

* Add tests check to normalization integration tests

* Add docs

* complete docs on normalization integration tests

* format code

* Normalization integration tests output (#3026)

* Version generated/output files from normalization integration tests

* simplify cast of float columns to string when used as partition key (#3027)

* bump version of normalization image

* Apply suggestions from code review

Co-authored-by: Jared Rhizor <jared@dataline.io>

* Apply suggestions from code review

Co-authored-by: Jared Rhizor <jared@dataline.io>
ChristopheDuong and jrhizor committed Apr 27, 2021
1 parent 4c96499 commit c2fa3e4
Showing 113 changed files with 2,598 additions and 28,095 deletions.
6 changes: 6 additions & 0 deletions airbyte-integrations/bases/base-normalization/.gitignore
@@ -1,4 +1,10 @@
build/
logs/
dbt-project-template/models/generated/
dbt-project-template/test_output.log
dbt_modules/
secrets/

integration_tests/normalization_test_output/*
!integration_tests/normalization_test_output/*/*/models/generated
!integration_tests/normalization_test_output/*/*/final
2 changes: 1 addition & 1 deletion airbyte-integrations/bases/base-normalization/Dockerfile
@@ -19,5 +19,5 @@ WORKDIR /airbyte

ENTRYPOINT ["/airbyte/entrypoint.sh"]

LABEL io.airbyte.version=0.1.24
LABEL io.airbyte.version=0.1.25
LABEL io.airbyte.name=airbyte/normalization
203 changes: 203 additions & 0 deletions airbyte-integrations/bases/base-normalization/README.md
@@ -0,0 +1,203 @@
# Normalization

Related documentation on normalization is available here:

- [architecture / Basic Normalization](../../../docs/architecture/basic-normalization.md)
- [tutorials / Custom DBT normalization](../../../docs/tutorials/connecting-el-with-t-using-dbt.md)

# Testing normalization

Below are short descriptions of the kinds of tests that may be affected by changes to the normalization code.

## Unit Tests

Unit tests are automatically included when building the normalization project,
but you can also invoke them explicitly by running the following commands, for example:

With Gradle:

./gradlew :airbyte-integrations:bases:base-normalization:unitTest

or directly with pytest:

pytest airbyte-integrations/bases/base-normalization/unit_tests

Unit tests target the main code generation functionality of normalization.
They verify the different logic rules for converting an input catalog.json (JSON Schema) file into
dbt files.

#### test_transform_config.py:

This class tests the transform config functionality that converts a destination_config.json into the appropriate profiles.yml file for dbt to use.
See the [related dbt docs on profiles.yml](https://docs.getdbt.com/reference/profiles.yml) for more context on what that file actually is.
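
For illustration, here is a minimal sketch of the kind of mapping being verified. This is not the actual `transform_config.py` implementation; the function name and field mapping below are assumptions made for this example, using a Postgres-style configuration:

```python
import yaml

def to_dbt_profile(destination_config: dict) -> dict:
    # Map destination connector fields onto the keys a dbt postgres profile expects.
    # The exact key names used by the real transform may differ.
    return {
        "type": "postgres",
        "host": destination_config["host"],
        "port": destination_config["port"],
        "user": destination_config["username"],
        "pass": destination_config["password"],
        "dbname": destination_config["database"],
        "schema": destination_config["schema"],
    }

if __name__ == "__main__":
    config = {
        "host": "localhost",
        "port": 5432,
        "username": "user",
        "password": "secret",
        "database": "postgres",
        "schema": "public",
    }
    profiles = {"normalize": {"target": "prod", "outputs": {"prod": to_dbt_profile(config)}}}
    print(yaml.safe_dump(profiles))
```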

#### test_stream_processor.py:

These unit tests check how each stream is converted into dbt model files.
For example, one big focus area is how table names are chosen,
especially since some destinations (like Postgres) have a very low identifier length limit of 64 characters.
Nested objects/arrays in a stream can drag names out into even longer identifiers...

Here you can find the rules for how parts of table names are truncated and concatenated together.
Depending on the catalog context and which identifiers have already been used, some names
may also be affected and require choosing new identifiers to avoid collisions.

Additional helper functions dealing with cursor fields, primary keys, and other code generation parts are also tested here.
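
As a rough illustration of the collision handling idea (a simplified sketch, not the actual `stream_processor` logic), a new identifier could be picked whenever a candidate name is already taken:

```python
def resolve_collision(name: str, used_names: set) -> str:
    # Simplified sketch: choose a new identifier when `name` is already in use.
    if name not in used_names:
        used_names.add(name)
        return name
    # Append an increasing numeric suffix until the identifier is unique.
    suffix = 1
    while f"{name}_{suffix}" in used_names:
        suffix += 1
    unique = f"{name}_{suffix}"
    used_names.add(unique)
    return unique
```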

#### test_destination_name_transformer.py:

These unit tests check the implementation of the specific SQL identifier naming conventions for each destination.
The rules for each destination are detailed in the corresponding docs, especially regarding the
allowed characters, whether quotes are needed or not, and the length limitations:

- [bigquery](../../../docs/integrations/destinations/bigquery.md)
- [postgres](../../../docs/integrations/destinations/postgres.md)
- [redshift](../../../docs/integrations/destinations/redshift.md)
- [snowflake](../../../docs/integrations/destinations/snowflake.md)

Truncation rules are also covered, for example for both of these strings, which are too long for the Postgres 64-character limit:
- `Aaaa_Bbbb_Cccc_Dddd_Eeee_Ffff_Gggg_Hhhh_Iiii`
- `Aaaa_Bbbb_Cccc_Dddd_a_very_long_name_Ffff_Gggg_Hhhh_Iiii`

Deciding how to truncate (in the middle) is verified in these tests.
In this instance, both strings end up as:

- `Aaaa_Bbbb_Cccc_Dddd___e_Ffff_Gggg_Hhhh_Iiii`

The truncate operation removes characters from the middle of the string, preserving the start
and end characters, as they tend to carry the most useful information in table naming. However, the final
truncated name can still potentially cause collisions in table names...
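
As a hedged sketch of the idea (not the exact algorithm implemented in `destination_name_transformer`, and the 43-character target below is chosen only to reproduce the example above), a middle-truncation could look like this:

```python
def truncate_identifier_middle(name: str, max_length: int) -> str:
    # Simplified sketch: drop characters from the middle of the identifier,
    # keeping the start and end, which usually carry the most meaning.
    if len(name) <= max_length:
        return name
    keep = max_length - 2  # reserve two characters for the "__" marker
    head = keep // 2
    tail = keep - head
    return name[:head] + "__" + name[-tail:]

# Both example strings from above collapse to the same truncated identifier:
print(truncate_identifier_middle("Aaaa_Bbbb_Cccc_Dddd_Eeee_Ffff_Gggg_Hhhh_Iiii", 43))
print(truncate_identifier_middle("Aaaa_Bbbb_Cccc_Dddd_a_very_long_name_Ffff_Gggg_Hhhh_Iiii", 43))
```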

Note that dealing with such collisions is not the job of the `destination_name_transformer` but of the
`stream_processor`, since one focuses on destination conventions and the other on putting together
identifier names from streams and catalogs.

## Integration Tests

With Gradle:

./gradlew :airbyte-integrations:bases:base-normalization:integrationTest

or directly with pytest:

pytest airbyte-integrations/bases/base-normalization/integration_tests

or on GitHub, thanks to the slash commands posted as comments:

/test connector=bases/base-normalization

### Integration Tests Definitions:

Some test suites can be selected to have their outputs version-controlled in the Airbyte git repository (or not).
This is useful to see the direct impact of code changes on the downstream files generated or compiled
by normalization and dbt (directly in PRs, too). (_Simply add your test suite name to the
`git_versioned_tests` variable in the `base-normalization/integration_tests/test_normalization.py` file._)

We typically choose small and meaningful test suites to include in git, while other, more complex tests
can be left out. Those are still run, but in a temporary directory that is thrown away at the end of the tests.
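
As a rough sketch of how this selection might work (the suite name and helper below are illustrative assumptions, not the actual contents of `test_normalization.py`):

```python
# Suites listed here would have their generated/compiled outputs committed to git;
# the output path mirrors the patterns whitelisted in .gitignore.
git_versioned_tests = ["test_suite1"]  # hypothetical suite name

def output_directory(test_suite_name: str, tmp_dir: str) -> str:
    # Versioned suites write their files inside the repository so changes show up in PR diffs;
    # all other suites use a throwaway temporary directory.
    if test_suite_name in git_versioned_tests:
        return f"integration_tests/normalization_test_output/{test_suite_name}"
    return f"{tmp_dir}/{test_suite_name}"
```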

Each test suite is defined in its own directory in the resources folder.
For example, below we would have 2 different test suites with this hierarchy:

base-normalization/integration_tests/resources/
├── test_suite1/
│   ├── data_input/
│   │   ├── catalog.json
│   │   └── messages.txt
│   ├── dbt_data_tests/
│   │   ├── file1.sql
│   │   └── file2.sql
│   ├── dbt_schema_tests/
│   │   ├── file1.yml
│   │   └── file2.yml
│   └── README.md
└── test_suite2/
    ├── data_input/
    │   ├── catalog.json
    │   └── messages.txt
    ├── dbt_data_tests/
    ├── dbt_schema_tests/
    └── README.md

#### README.md:

Each test suite can include an optional `README.md` with further details and descriptions of what the test is trying to verify and
how it is specifically built.

### Integration Test Data Input:

#### data_input/catalog.json:

The catalog.json file is the main input for normalization, from which the dbt model files are
generated, as it describes in JSON Schema format what the data structure is.

#### data_input/messages.txt:

The `messages.txt` file contains serialized Airbyte JSON records that should be sent to the destination as if they were
transmitted by a source. In this integration test, the file is read and piped ("cat") through to the docker image of
each destination connector to populate the `_airbyte_raw_*` tables. These tables are then used as input
data for dbt to run from.

### Integration Test Execution Flow:

These integration tests are run against all destinations that dbt can be executed on.
So, for each target destination, the steps run by the tests are:

1. Prepare the test execution workspace folder (copy skeleton from `dbt-project-template/`)
2. Generate a dbt `profiles.yml` file to connect to the target destination
3. Populate raw tables by running the target destination connectors, reading and uploading the
`messages.txt` file as data input.
4. Run the normalization step to generate dbt model files from the `catalog.json` input file.
5. Execute the dbt CLI command `dbt run` from the test workspace folder to compile the generated model files
- from `models/generated/` folder
- into `../build/(compiled|run)/airbyte_utils/models/generated/` folder
- The final "run" SQL files are also copied (for archiving) to the `final/` folder by the test script.
6. Deploy the `schema_tests` and `data_tests` files into the test workspace folder.
7. Execute the dbt CLI command `dbt test` from the test workspace folder to run verifications and checks with dbt.
8. Optional checks (nothing for the moment)

### Integration Test Checks:

#### dbt schema tests:

Out of the box, dbt allows configuring some tests as properties of an existing model (or source, seed, or snapshot).
This can be done in YAML format, as described in the following documentation pages:

- [dbt schema-tests](https://docs.getdbt.com/docs/building-a-dbt-project/tests#schema-tests)
- [custom schema test](https://docs.getdbt.com/docs/guides/writing-custom-schema-tests)

We leverage these capabilities in these integration tests to verify some relationships in our
generated tables on the destinations.

#### dbt data tests:

Additionally, dbt supports "data tests", which are specified as SQL queries.
A data test is a select statement that returns 0 records when the test is successful.

- [dbt data-tests](https://docs.getdbt.com/docs/building-a-dbt-project/tests#data-tests)

#### Notes using dbt seeds:

Because some functionality is not stable enough on the dbt side, it is currently difficult to properly use
`dbt seed` commands to populate a set of expected data tables. Hopefully, this can be
done more easily in the future...

Related issues to watch regarding dbt progress on improving this aspect:
- https://github.com/fishtown-analytics/dbt/issues/2959#issuecomment-747509782
- https://medium.com/hashmapinc/unit-testing-on-dbt-models-using-a-static-test-dataset-in-snowflake-dfd35549b5e2

A nice improvement would be to add CSV/JSON seed files as expected output data for tables.
The integration tests would then verify that the content of such tables in the destination matches
these seed files, or fail.

## Standard Destination Tests

Generally, to invoke standard destination tests, run with Gradle:

./gradlew :airbyte-integrations:connectors:destination-<connector name>:integrationTest

For more details and options, you can also refer to the [testing connectors docs](../../../docs/contributing-to-airbyte/building-new-connector/testing-connectors.md).

## Acceptance Tests

Please refer to the [developing docs](../../../docs/contributing-to-airbyte/developing-locally.md) on how to run Acceptance Tests.
12 changes: 12 additions & 0 deletions airbyte-integrations/bases/base-normalization/build.gradle
@@ -15,3 +15,15 @@ installReqs.dependsOn(":airbyte-integrations:bases:airbyte-protocol:installReqs"

project.task('integrationTest')
integrationTest.dependsOn(build)

task("customIntegrationTestPython", type: PythonTask, dependsOn: installTestReqs){
module = "pytest"
command = "-s integration_tests"

dependsOn ':airbyte-integrations:connectors:destination-bigquery:airbyteDocker'
dependsOn ':airbyte-integrations:connectors:destination-postgres:airbyteDocker'
dependsOn ':airbyte-integrations:connectors:destination-redshift:airbyteDocker'
dependsOn ':airbyte-integrations:connectors:destination-snowflake:airbyteDocker'
}

integrationTest.dependsOn("customIntegrationTestPython")
@@ -0,0 +1,23 @@


create or replace table `dataline-integration-testing`.test_normalization.`dedup_exchange_rate`


OPTIONS()
as (

-- Final base SQL model
select
id,
currency,
date,
HKD,
NZD,
USD,
_airbyte_emitted_at,
_airbyte_dedup_exchange_rate_hashid
from `dataline-integration-testing`.test_normalization.`dedup_exchange_rate_scd`
-- dedup_exchange_rate from `dataline-integration-testing`.test_normalization._airbyte_raw_dedup_exchange_rate
where _airbyte_active_row = True
);

@@ -0,0 +1,36 @@


create or replace table `dataline-integration-testing`.test_normalization.`dedup_exchange_rate_scd`


OPTIONS()
as (

-- SQL model to build a Type 2 Slowly Changing Dimension (SCD) table for each record identified by their primary key
select
id,
currency,
date,
HKD,
NZD,
USD,
date as _airbyte_start_at,
lag(date) over (
partition by id, currency, cast(NZD as
string
)
order by date desc, _airbyte_emitted_at desc
) as _airbyte_end_at,
lag(date) over (
partition by id, currency, cast(NZD as
string
)
order by date desc, _airbyte_emitted_at desc
) is null as _airbyte_active_row,
_airbyte_emitted_at,
_airbyte_dedup_exchange_rate_hashid
from `dataline-integration-testing`._airbyte_test_normalization.`dedup_exchange_rate_ab4`
-- dedup_exchange_rate from `dataline-integration-testing`.test_normalization._airbyte_raw_dedup_exchange_rate
where _airbyte_row_num = 1
);

@@ -0,0 +1,22 @@


create or replace table `dataline-integration-testing`.test_normalization.`exchange_rate`


OPTIONS()
as (

-- Final base SQL model
select
id,
currency,
date,
HKD,
NZD,
USD,
_airbyte_emitted_at,
_airbyte_exchange_rate_hashid
from `dataline-integration-testing`._airbyte_test_normalization.`exchange_rate_ab3`
-- exchange_rate from `dataline-integration-testing`.test_normalization._airbyte_raw_exchange_rate
);

@@ -0,0 +1,17 @@


create or replace view `dataline-integration-testing`._airbyte_test_normalization.`dedup_exchange_rate_ab1`
OPTIONS()
as
-- SQL model to parse JSON blob stored in a single column and extract into separated field columns as described by the JSON Schema
select
json_extract_scalar(_airbyte_data, "$.id") as id,
json_extract_scalar(_airbyte_data, "$.currency") as currency,
json_extract_scalar(_airbyte_data, "$.date") as date,
json_extract_scalar(_airbyte_data, "$.HKD") as HKD,
json_extract_scalar(_airbyte_data, "$.NZD") as NZD,
json_extract_scalar(_airbyte_data, "$.USD") as USD,
_airbyte_emitted_at
from `dataline-integration-testing`.test_normalization._airbyte_raw_dedup_exchange_rate
-- dedup_exchange_rate;

@@ -0,0 +1,29 @@


create or replace view `dataline-integration-testing`._airbyte_test_normalization.`dedup_exchange_rate_ab2`
OPTIONS()
as
-- SQL model to cast each column to its adequate SQL type converted from the JSON schema type
select
cast(id as
int64
) as id,
cast(currency as
string
) as currency,
cast(date as
string
) as date,
cast(HKD as
float64
) as HKD,
cast(NZD as
float64
) as NZD,
cast(USD as
float64
) as USD,
_airbyte_emitted_at
from `dataline-integration-testing`._airbyte_test_normalization.`dedup_exchange_rate_ab1`
-- dedup_exchange_rate;

@@ -0,0 +1,26 @@


create or replace view `dataline-integration-testing`._airbyte_test_normalization.`dedup_exchange_rate_ab3`
OPTIONS()
as
-- SQL model to build a hash column based on the values of this record
select
*,
to_hex(md5(cast(concat(coalesce(cast(id as
string
), ''), '-', coalesce(cast(currency as
string
), ''), '-', coalesce(cast(date as
string
), ''), '-', coalesce(cast(HKD as
string
), ''), '-', coalesce(cast(NZD as
string
), ''), '-', coalesce(cast(USD as
string
), '')) as
string
))) as _airbyte_dedup_exchange_rate_hashid
from `dataline-integration-testing`._airbyte_test_normalization.`dedup_exchange_rate_ab2`
-- dedup_exchange_rate;
