Introduce normalization integration tests (#3025)
* Speed up normalization unit tests by dropping the hubspot catalog (too heavy; it will be covered in integration tests instead)

* Add integration tests for normalization

* Add dedup test case

* adjust build.gradle

* add readme for normalization

* Share PATH env variable with subprocess calls

* Handle git non-versioned tests vs versioned ones

* Format code

* Add tests check to normalization integration tests

* Add docs

* complete docs on normalization integration tests

* format code

* Normalization integration tests output (#3026)

* Version generated/output files from normalization integration tests

* simplify cast of float columns to string when used as partition key (#3027)

* bump version of normalization image

* Apply suggestions from code review

Co-authored-by: Jared Rhizor <jared@dataline.io>

* Apply suggestions from code review

Co-authored-by: Jared Rhizor <jared@dataline.io>
ChristopheDuong and jrhizor committed Apr 27, 2021
1 parent 4c96499 commit c2fa3e4
Showing 113 changed files with 2,598 additions and 28,095 deletions.
6 changes: 6 additions & 0 deletions airbyte-integrations/bases/base-normalization/.gitignore
@@ -1,4 +1,10 @@
build/
logs/
dbt-project-template/models/generated/
dbt-project-template/test_output.log
dbt_modules/
secrets/

integration_tests/normalization_test_output/*
!integration_tests/normalization_test_output/*/*/models/generated
!integration_tests/normalization_test_output/*/*/final
2 changes: 1 addition & 1 deletion airbyte-integrations/bases/base-normalization/Dockerfile
@@ -19,5 +19,5 @@ WORKDIR /airbyte

ENTRYPOINT ["/airbyte/entrypoint.sh"]

LABEL io.airbyte.version=0.1.24
LABEL io.airbyte.version=0.1.25
LABEL io.airbyte.name=airbyte/normalization
203 changes: 203 additions & 0 deletions airbyte-integrations/bases/base-normalization/README.md
@@ -0,0 +1,203 @@
# Normalization

Related documentation on normalization is available here:

- [architecture / Basic Normalization](../../../docs/architecture/basic-normalization.md)
- [tutorials / Custom DBT normalization](../../../docs/tutorials/connecting-el-with-t-using-dbt.md)

# Testing normalization

Below are short descriptions of the kinds of tests that may be affected by changes to the normalization code.

## Unit Tests

Unit tests are automatically included when building the normalization project,
but you can also invoke them explicitly by running the following commands, for example:

With Gradle:

./gradlew :airbyte-integrations:bases:base-normalization:unitTest

or directly with pytest:

pytest airbyte-integrations/bases/base-normalization/unit_tests

Unit tests target the main code generation functionality of normalization.
They verify the different logic rules for converting an input catalog.json (JSON Schema) file into
dbt files.

#### test_transform_config.py:

This class tests the transform config functionality that converts a destination_config.json into the appropriate profiles.yml file for dbt to use.
See the [related dbt docs on profiles.yml](https://docs.getdbt.com/reference/profiles.yml) for more context on what that file actually is.
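
For illustration, here is a minimal sketch of the kind of mapping being verified. This is not the actual `transform_config.py` implementation; the function name and field mapping below are assumptions made for this example, using a Postgres-style configuration:

```python
import yaml

def to_dbt_profile(destination_config: dict) -> dict:
    # Map destination connector fields onto the keys a dbt postgres profile expects.
    # The exact key names used by the real transform may differ.
    return {
        "type": "postgres",
        "host": destination_config["host"],
        "port": destination_config["port"],
        "user": destination_config["username"],
        "pass": destination_config["password"],
        "dbname": destination_config["database"],
        "schema": destination_config["schema"],
    }

if __name__ == "__main__":
    config = {
        "host": "localhost",
        "port": 5432,
        "username": "user",
        "password": "secret",
        "database": "postgres",
        "schema": "public",
    }
    profiles = {"normalize": {"target": "prod", "outputs": {"prod": to_dbt_profile(config)}}}
    print(yaml.safe_dump(profiles))
```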

#### test_stream_processor.py:

These unit tests check how each stream is converted into dbt model files.
For example, one big focus area is how table names are chosen,
especially since some destinations (like Postgres) have a very low identifier length limit of 64 characters.
Nested objects/arrays in a stream can drag names out into even longer identifiers...

Here you can find the rules for how parts of table names are truncated and concatenated together.
Depending on the catalog context and which identifiers have already been used, some names
may also be affected and require choosing new identifiers to avoid collisions.

Additional helper functions dealing with cursor fields, primary keys, and other code generation parts are also tested here.
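
As a rough illustration of the collision handling idea (a simplified sketch, not the actual `stream_processor` logic), a new identifier could be picked whenever a candidate name is already taken:

```python
def resolve_collision(name: str, used_names: set) -> str:
    # Simplified sketch: choose a new identifier when `name` is already in use.
    if name not in used_names:
        used_names.add(name)
        return name
    # Append an increasing numeric suffix until the identifier is unique.
    suffix = 1
    while f"{name}_{suffix}" in used_names:
        suffix += 1
    unique = f"{name}_{suffix}"
    used_names.add(unique)
    return unique
```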

#### test_destination_name_transformer.py:

These unit tests check the implementation of the specific SQL identifier naming conventions for each destination.
The rules for each destination are detailed in the corresponding docs, especially regarding the
allowed characters, whether quotes are needed or not, and the length limitations:

- [bigquery](../../../docs/integrations/destinations/bigquery.md)
- [postgres](../../../docs/integrations/destinations/postgres.md)
- [redshift](../../../docs/integrations/destinations/redshift.md)
- [snowflake](../../../docs/integrations/destinations/snowflake.md)

Truncation rules are also covered, for example for both of these strings, which are too long for the Postgres 64-character limit:
- `Aaaa_Bbbb_Cccc_Dddd_Eeee_Ffff_Gggg_Hhhh_Iiii`
- `Aaaa_Bbbb_Cccc_Dddd_a_very_long_name_Ffff_Gggg_Hhhh_Iiii`

Deciding how to truncate (in the middle) is verified in these tests.
In this instance, both strings end up as:

- `Aaaa_Bbbb_Cccc_Dddd___e_Ffff_Gggg_Hhhh_Iiii`

The truncate operation removes characters from the middle of the string, preserving the start
and end characters, as they tend to carry the most useful information in table naming. However, the final
truncated name can still potentially cause collisions in table names...
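
As a hedged sketch of the idea (not the exact algorithm implemented in `destination_name_transformer`, and the 43-character target below is chosen only to reproduce the example above), a middle-truncation could look like this:

```python
def truncate_identifier_middle(name: str, max_length: int) -> str:
    # Simplified sketch: drop characters from the middle of the identifier,
    # keeping the start and end, which usually carry the most meaning.
    if len(name) <= max_length:
        return name
    keep = max_length - 2  # reserve two characters for the "__" marker
    head = keep // 2
    tail = keep - head
    return name[:head] + "__" + name[-tail:]

# Both example strings from above collapse to the same truncated identifier:
print(truncate_identifier_middle("Aaaa_Bbbb_Cccc_Dddd_Eeee_Ffff_Gggg_Hhhh_Iiii", 43))
print(truncate_identifier_middle("Aaaa_Bbbb_Cccc_Dddd_a_very_long_name_Ffff_Gggg_Hhhh_Iiii", 43))
```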

Note that dealing with such collisions is not the job of the `destination_name_transformer` but of the
`stream_processor`, since one focuses on destination conventions and the other on putting together
identifier names from streams and catalogs.

## Integration Tests

With Gradle:

./gradlew :airbyte-integrations:bases:base-normalization:integrationTest

or directly with pytest:

pytest airbyte-integrations/bases/base-normalization/integration_tests

or on GitHub, thanks to the slash commands posted as comments:

/test connector=bases/base-normalization

### Integration Tests Definitions:

Some test suites can be selected to have their outputs version-controlled in the Airbyte git repository (or not).
This is useful to see the direct impact of code changes on the downstream files generated or compiled
by normalization and dbt (directly in PRs, too). (_Simply add your test suite name to the
`git_versioned_tests` variable in the `base-normalization/integration_tests/test_normalization.py` file._)

We typically choose small and meaningful test suites to include in git, while other, more complex tests
can be left out. Those are still run, but in a temporary directory that is thrown away at the end of the tests.
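
As a rough sketch of how this selection might work (the suite name and helper below are illustrative assumptions, not the actual contents of `test_normalization.py`):

```python
# Suites listed here would have their generated/compiled outputs committed to git;
# the output path mirrors the patterns whitelisted in .gitignore.
git_versioned_tests = ["test_suite1"]  # hypothetical suite name

def output_directory(test_suite_name: str, tmp_dir: str) -> str:
    # Versioned suites write their files inside the repository so changes show up in PR diffs;
    # all other suites use a throwaway temporary directory.
    if test_suite_name in git_versioned_tests:
        return f"integration_tests/normalization_test_output/{test_suite_name}"
    return f"{tmp_dir}/{test_suite_name}"
```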

Each test suite is defined in its own directory in the resources folder.
For example, below we would have 2 different test suites with this hierarchy:

base-normalization/integration_tests/resources/
├── test_suite1/
│   ├── data_input/
│   │   ├── catalog.json
│   │   └── messages.txt
│   ├── dbt_data_tests/
│   │   ├── file1.sql
│   │   └── file2.sql
│   ├── dbt_schema_tests/
│   │   ├── file1.yml
│   │   └── file2.yml
│   └── README.md
└── test_suite2/
    ├── data_input/
    │   ├── catalog.json
    │   └── messages.txt
    ├── dbt_data_tests/
    ├── dbt_schema_tests/
    └── README.md

#### README.md:

Each test suite can include an optional `README.md` with further details and descriptions of what the test is trying to verify and
how it is specifically built.

### Integration Test Data Input:

#### data_input/catalog.json:

The catalog.json file is the main input for normalization, from which the dbt model files are
generated, as it describes in JSON Schema format what the data structure is.

#### data_input/messages.txt:

The `messages.txt` file contains serialized Airbyte JSON records that should be sent to the destination as if they were
transmitted by a source. In this integration test, the file is read and piped ("cat") through to the docker image of
each destination connector to populate the `_airbyte_raw_*` tables. These tables are then used as input
data for dbt to run from.

### Integration Test Execution Flow:

These integration tests are run against all destinations that dbt can be executed on.
So, for each target destination, the steps run by the tests are:

1. Prepare the test execution workspace folder (copy skeleton from `dbt-project-template/`)
2. Generate a dbt `profiles.yml` file to connect to the target destination
3. Populate raw tables by running the target destination connectors, reading and uploading the
`messages.txt` file as data input.
4. Run the normalization step to generate dbt model files from the `catalog.json` input file.
5. Execute the dbt CLI command `dbt run` from the test workspace folder to compile the generated model files
- from `models/generated/` folder
- into `../build/(compiled|run)/airbyte_utils/models/generated/` folder
- The final "run" SQL files are also copied (for archiving) to the `final/` folder by the test script.
6. Deploy the `schema_tests` and `data_tests` files into the test workspace folder.
7. Execute the dbt CLI command `dbt test` from the test workspace folder to run verifications and checks with dbt.
8. Optional checks (nothing for the moment)

### Integration Test Checks:

#### dbt schema tests:

Out of the box, dbt allows configuring some tests as properties of an existing model (or source, seed, or snapshot).
This can be done in YAML format, as described in the following documentation pages:

- [dbt schema-tests](https://docs.getdbt.com/docs/building-a-dbt-project/tests#schema-tests)
- [custom schema test](https://docs.getdbt.com/docs/guides/writing-custom-schema-tests)

We leverage these capabilities in these integration tests to verify some relationships in our
generated tables on the destinations.

#### dbt data tests:

Additionally, dbt supports "data tests", which are specified as SQL queries.
A data test is a select statement that returns 0 records when the test is successful.

- [dbt data-tests](https://docs.getdbt.com/docs/building-a-dbt-project/tests#data-tests)

#### Notes using dbt seeds:

Because some functionality is not stable enough on the dbt side, it is currently difficult to properly use
`dbt seed` commands to populate a set of expected data tables. Hopefully, this can be
done more easily in the future...

Related issues to watch regarding dbt progress on improving this aspect:
- https://github.com/fishtown-analytics/dbt/issues/2959#issuecomment-747509782
- https://medium.com/hashmapinc/unit-testing-on-dbt-models-using-a-static-test-dataset-in-snowflake-dfd35549b5e2

A nice improvement would be to add CSV/JSON seed files as expected output data for tables.
The integration tests would then verify that the content of such tables in the destination matches
these seed files, or fail.

## Standard Destination Tests

Generally, to invoke standard destination tests, run with Gradle:

./gradlew :airbyte-integrations:connectors:destination-<connector name>:integrationTest

For more details and options, you can also refer to the [testing connectors docs](../../../docs/contributing-to-airbyte/building-new-connector/testing-connectors.md).

## Acceptance Tests

Please refer to the [developing docs](../../../docs/contributing-to-airbyte/developing-locally.md) on how to run Acceptance Tests.
12 changes: 12 additions & 0 deletions airbyte-integrations/bases/base-normalization/build.gradle
@@ -15,3 +15,15 @@ installReqs.dependsOn(":airbyte-integrations:bases:airbyte-protocol:installReqs"

project.task('integrationTest')
integrationTest.dependsOn(build)

task("customIntegrationTestPython", type: PythonTask, dependsOn: installTestReqs){
module = "pytest"
command = "-s integration_tests"

dependsOn ':airbyte-integrations:connectors:destination-bigquery:airbyteDocker'
dependsOn ':airbyte-integrations:connectors:destination-postgres:airbyteDocker'
dependsOn ':airbyte-integrations:connectors:destination-redshift:airbyteDocker'
dependsOn ':airbyte-integrations:connectors:destination-snowflake:airbyteDocker'
}

integrationTest.dependsOn("customIntegrationTestPython")
@@ -0,0 +1,23 @@


create or replace table `dataline-integration-testing`.test_normalization.`dedup_exchange_rate`


OPTIONS()
as (

-- Final base SQL model
select
id,
currency,
date,
HKD,
NZD,
USD,
_airbyte_emitted_at,
_airbyte_dedup_exchange_rate_hashid
from `dataline-integration-testing`.test_normalization.`dedup_exchange_rate_scd`
-- dedup_exchange_rate from `dataline-integration-testing`.test_normalization._airbyte_raw_dedup_exchange_rate
where _airbyte_active_row = True
);

@@ -0,0 +1,36 @@


create or replace table `dataline-integration-testing`.test_normalization.`dedup_exchange_rate_scd`


OPTIONS()
as (

-- SQL model to build a Type 2 Slowly Changing Dimension (SCD) table for each record identified by their primary key
select
id,
currency,
date,
HKD,
NZD,
USD,
date as _airbyte_start_at,
lag(date) over (
partition by id, currency, cast(NZD as
string
)
order by date desc, _airbyte_emitted_at desc
) as _airbyte_end_at,
lag(date) over (
partition by id, currency, cast(NZD as
string
)
order by date desc, _airbyte_emitted_at desc
) is null as _airbyte_active_row,
_airbyte_emitted_at,
_airbyte_dedup_exchange_rate_hashid
from `dataline-integration-testing`._airbyte_test_normalization.`dedup_exchange_rate_ab4`
-- dedup_exchange_rate from `dataline-integration-testing`.test_normalization._airbyte_raw_dedup_exchange_rate
where _airbyte_row_num = 1
);

@@ -0,0 +1,22 @@


create or replace table `dataline-integration-testing`.test_normalization.`exchange_rate`


OPTIONS()
as (

-- Final base SQL model
select
id,
currency,
date,
HKD,
NZD,
USD,
_airbyte_emitted_at,
_airbyte_exchange_rate_hashid
from `dataline-integration-testing`._airbyte_test_normalization.`exchange_rate_ab3`
-- exchange_rate from `dataline-integration-testing`.test_normalization._airbyte_raw_exchange_rate
);

@@ -0,0 +1,17 @@


create or replace view `dataline-integration-testing`._airbyte_test_normalization.`dedup_exchange_rate_ab1`
OPTIONS()
as
-- SQL model to parse JSON blob stored in a single column and extract into separated field columns as described by the JSON Schema
select
json_extract_scalar(_airbyte_data, "$.id") as id,
json_extract_scalar(_airbyte_data, "$.currency") as currency,
json_extract_scalar(_airbyte_data, "$.date") as date,
json_extract_scalar(_airbyte_data, "$.HKD") as HKD,
json_extract_scalar(_airbyte_data, "$.NZD") as NZD,
json_extract_scalar(_airbyte_data, "$.USD") as USD,
_airbyte_emitted_at
from `dataline-integration-testing`.test_normalization._airbyte_raw_dedup_exchange_rate
-- dedup_exchange_rate;

@@ -0,0 +1,29 @@


create or replace view `dataline-integration-testing`._airbyte_test_normalization.`dedup_exchange_rate_ab2`
OPTIONS()
as
-- SQL model to cast each column to its adequate SQL type converted from the JSON schema type
select
cast(id as
int64
) as id,
cast(currency as
string
) as currency,
cast(date as
string
) as date,
cast(HKD as
float64
) as HKD,
cast(NZD as
float64
) as NZD,
cast(USD as
float64
) as USD,
_airbyte_emitted_at
from `dataline-integration-testing`._airbyte_test_normalization.`dedup_exchange_rate_ab1`
-- dedup_exchange_rate;

@@ -0,0 +1,26 @@


create or replace view `dataline-integration-testing`._airbyte_test_normalization.`dedup_exchange_rate_ab3`
OPTIONS()
as
-- SQL model to build a hash column based on the values of this record
select
*,
to_hex(md5(cast(concat(coalesce(cast(id as
string
), ''), '-', coalesce(cast(currency as
string
), ''), '-', coalesce(cast(date as
string
), ''), '-', coalesce(cast(HKD as
string
), ''), '-', coalesce(cast(NZD as
string
), ''), '-', coalesce(cast(USD as
string
), '')) as
string
))) as _airbyte_dedup_exchange_rate_hashid
from `dataline-integration-testing`._airbyte_test_normalization.`dedup_exchange_rate_ab2`
-- dedup_exchange_rate;
