Commit
Introduce normalization integration tests (#3025)
* Speed normalization unit tests by dropping hubspot catalog (too heavy, will be covering it in integration tests instead)
* Add integration tests for normalization
* Add dedup test case
* adjust build.gradle
* add readme for normalization
* Share PATH env variable with subprocess calls
* Handle git non-versioned tests vs versioned ones
* Format code
* Add tests check to normalization integration tests
* Add docs
* complete docs on normalization integration tests
* format code
* Normalization integration tests output (#3026)
* Version generated/output files from normalization integration tests
* simplify cast of float columns to string when used as partition key (#3027)
* bump version of normalization image
* Apply suggestions from code review

  Co-authored-by: Jared Rhizor <jared@dataline.io>
* Apply suggestions from code review

  Co-authored-by: Jared Rhizor <jared@dataline.io>
1 parent 4c96499, commit c2fa3e4
Showing 113 changed files with 2,598 additions and 28,095 deletions.
@@ -1,4 +1,10 @@
build/
logs/
dbt-project-template/models/generated/
dbt-project-template/test_output.log
dbt_modules/
secrets/

integration_tests/normalization_test_output/*
!integration_tests/normalization_test_output/*/*/models/generated
!integration_tests/normalization_test_output/*/*/final
203 changes: 203 additions & 0 deletions
airbyte-integrations/bases/base-normalization/README.md
# Normalization

Related documentation on normalization is available here:

- [architecture / Basic Normalization](../../../docs/architecture/basic-normalization.md)
- [tutorials / Custom DBT normalization](../../../docs/tutorials/connecting-el-with-t-using-dbt.md)

# Testing normalization

Below are short descriptions of the kinds of tests that may be affected by changes to the normalization code.

## Unit Tests

Unit tests are automatically included when building the normalization project, but you can also invoke them explicitly, for example:

with Gradle:

    ./gradlew :airbyte-integrations:bases:base-normalization:unitTest

or directly with pytest:

    pytest airbyte-integrations/bases/base-normalization/unit_tests

Unit tests target the main code generation functionality of normalization: they verify the different logic rules for converting an input catalog.json (JSON Schema) file into dbt files.

#### test_transform_config.py:

These tests cover the transform-config functionality, which converts a destination_config.json into the appropriate profiles.yml file for dbt to use.
See [related dbt docs on profiles.yml](https://docs.getdbt.com/reference/profiles.yml) for more context on what this file actually is.

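For instance, for a Postgres destination, the generated profiles.yml might look roughly like this (an illustrative sketch only: the profile name, field names, and values below are assumptions, not the exact output of transform_config):

    # illustrative profiles.yml sketch -- values come from the destination config
    normalize:
      target: prod
      outputs:
        prod:
          type: postgres
          host: localhost
          port: 5432
          user: airbyte
          pass: password
          dbname: airbyte_db
          schema: public
          threads: 4
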
#### test_stream_processor.py:

These unit tests check how each stream is converted into dbt model files.
One big focus area is how table names are chosen, especially since some destinations (such as Postgres) limit identifier length to 64 characters.
Nested objects/arrays in a stream can stretch names out even further, as illustrated below.

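For illustration (a made-up stream, not one of the actual test fixtures), nesting quickly produces long concatenated names:

    stream "exchange_rate_history"
      └── nested object "conversion_details"
            └── nested array "intermediate_quotes"

    resulting tables (before any truncation):
      exchange_rate_history
      exchange_rate_history_conversion_details
      exchange_rate_history_conversion_details_intermediate_quotes   <- already 60 characters, before any suffixes are appended
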
These tests therefore cover the rules for truncating and concatenating parts of table names.
Depending on the catalog context and which identifiers have already been used, some names
may also need to be changed to new identifiers to avoid collisions.

Additional helper functions dealing with cursor fields, primary keys, and other parts of code generation are also tested here.

#### test_destination_name_transformer.py:

These unit tests check the implementation of each destination's SQL identifier naming conventions.
The rules for each destination are detailed in the corresponding docs, in particular the
allowed characters, whether quoting is needed, and the length limitations:

- [bigquery](../../../docs/integrations/destinations/bigquery.md)
- [postgres](../../../docs/integrations/destinations/postgres.md)
- [redshift](../../../docs/integrations/destinations/redshift.md)
- [snowflake](../../../docs/integrations/destinations/snowflake.md)

Truncation rules are also covered, for example for both of these strings, which are too long for the Postgres 64-character limit:
- `Aaaa_Bbbb_Cccc_Dddd_Eeee_Ffff_Gggg_Hhhh_Iiii`
- `Aaaa_Bbbb_Cccc_Dddd_a_very_long_name_Ffff_Gggg_Hhhh_Iiii`

How to truncate (in the middle) is verified by these tests. In this instance, both strings end up as:

- `Aaaa_Bbbb_Cccc_Dddd___e_Ffff_Gggg_Hhhh_Iiii`

The truncate operation removes characters from the middle of the string, preserving the start
and end characters, since these tend to carry the most useful information in table naming. However, the final
truncated name can still potentially collide with other table names; a simplified sketch of this middle-truncation is shown below.

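A minimal sketch of this kind of middle-truncation (an assumption-laden simplification, not the actual implementation in `destination_name_transformer`):

    def truncate_identifier(name: str, limit: int) -> str:
        """Keep the start and end of the name and drop characters from the middle
        so the result fits within the destination's identifier length limit."""
        if len(name) <= limit:
            return name
        # split the remaining budget between the preserved head and tail,
        # leaving room for a short "__" marker in the middle
        budget = limit - 2
        head = budget // 2
        tail = budget - head
        return name[:head] + "__" + name[-tail:]
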
Note that dealing with such collisions is not the job of `destination_name_transformer` but of the
`stream_processor`: the former focuses on destination conventions, the latter on putting together
identifier names from streams and catalogs.

## Integration Tests

With Gradle:

    ./gradlew :airbyte-integrations:bases:base-normalization:integrationTest

or directly with pytest:

    pytest airbyte-integrations/bases/base-normalization/integration_tests

They can also be invoked on GitHub, thanks to slash commands posted as comments:

    /test connector=bases/base-normalization

### Integration Tests Definitions:

Each test suite can be chosen to be version-controlled in the Airbyte git repository (or not).
This is useful to see the direct impact of code changes on the downstream files generated or compiled
by normalization and dbt (directly in the PR, too). (_Simply list your test suite name in the
`git_versioned_tests` variable in the `base-normalization/integration_tests/test_normalization.py` file, as sketched below._)

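For example, that variable might look something like this (the suite name below is taken from this commit's test resources; the exact contents of the list are an assumption):

    # in base-normalization/integration_tests/test_normalization.py
    git_versioned_tests = [
        "test_primary_key_streams",
    ]
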
We typically include small and meaningful test suites in git, while larger, more complex tests
can be left out. The latter are still run, but in a temporary directory that is thrown away at the end of the tests.

Each test suite is defined in its own directory in the resource folder.
For example, below we would have two different test "suites" with this hierarchy:

    base-normalization/integration_tests/resources/
    ├── test_suite1/
    │   ├── data_input/
    │   │   ├── catalog.json
    │   │   └── messages.txt
    │   ├── dbt_data_tests/
    │   │   ├── file1.sql
    │   │   └── file2.sql
    │   ├── dbt_schema_tests/
    │   │   ├── file1.yml
    │   │   └── file2.yml
    │   └── README.md
    └── test_suite2/
        ├── data_input/
        │   ├── catalog.json
        │   └── messages.txt
        ├── dbt_data_tests/
        ├── dbt_schema_tests/
        └── README.md

#### README.md:

Each test suite may include an optional `README.md` with further details and descriptions of what the test is trying to verify and
how it is specifically built.

### Integration Test Data Input:

#### data_input/catalog.json:

The `catalog.json` file is the main input to normalization, from which the dbt model files are
generated: it describes the data structure in JSON Schema format.

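An abbreviated sketch of such a catalog, based on the exchange rate stream used in these tests (fields and values shortened and partly assumed for brevity):

    {
      "streams": [
        {
          "stream": {
            "name": "dedup_exchange_rate",
            "json_schema": {
              "type": "object",
              "properties": {
                "id": { "type": "integer" },
                "currency": { "type": "string" },
                "date": { "type": "string" },
                "NZD": { "type": "number" }
              }
            }
          },
          "sync_mode": "incremental",
          "cursor_field": ["date"],
          "destination_sync_mode": "append_dedup",
          "primary_key": [["id"], ["currency"], ["NZD"]]
        }
      ]
    }
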
#### data_input/messages.txt:

The `messages.txt` file contains serialized Airbyte JSON records that should be sent to the destination as if they were
emitted by a source. In these integration tests, the file is read and "cat" through to the docker image of
each destination connector to populate the `_airbyte_raw_*` tables. These tables are then used as the input
data for dbt to run from.

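Each line of `messages.txt` is a standard Airbyte message; for the stream above, a record might look like this (values invented for illustration):

    {"type": "RECORD", "record": {"stream": "dedup_exchange_rate", "emitted_at": 1602637589000, "data": {"id": 1, "currency": "USD", "date": "2020-08-29", "HKD": 7.75, "NZD": 1.14, "USD": 1.0}}}
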
### Integration Test Execution Flow:

These integration tests are run against all destinations that dbt can be executed on.
So, for each target destination, the steps run by the tests are:

1. Prepare the test execution workspace folder (copy skeleton from `dbt-project-template/`).
2. Generate a dbt `profiles.yml` file to connect to the target destination.
3. Populate raw tables by running the target destination connector, reading and uploading the
   `messages.txt` file as data input.
4. Run the normalization step to generate dbt model files from the `catalog.json` input file.
5. Execute the dbt CLI command `dbt run` from the test workspace folder to compile the generated model files
   - from the `models/generated/` folder
   - into the `../build/(compiled|run)/airbyte_utils/models/generated/` folder
   - the final "run" SQL files are also copied (for archiving) to the `final/` folder by the test script.
6. Deploy the `schema_tests` and `data_tests` files into the test workspace folder.
7. Execute the dbt CLI command `dbt test` from the test workspace folder to run verifications and checks with dbt (see the command sketch after this list).
8. Optional checks (nothing for the moment).

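As a rough sketch, the dbt part of that flow boils down to commands of this shape, run from the test workspace folder (the exact flags and paths used by the test harness are assumptions here):

    dbt run --profiles-dir . --project-dir .    # step 5: compile and run the generated models
    dbt test --profiles-dir . --project-dir .   # step 7: run the schema_tests and data_tests
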
### Integration Test Checks:

#### dbt schema tests:

Out of the box, dbt allows configuring some tests as properties of an existing model (or source, seed, or snapshot).
This can be done in YAML format, as described in the following documentation pages:

- [dbt schema-tests](https://docs.getdbt.com/docs/building-a-dbt-project/tests#schema-tests)
- [custom schema test](https://docs.getdbt.com/docs/guides/writing-custom-schema-tests)

These integration tests leverage this capability to verify some relationships in the
generated tables on the destinations, for example:

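A hypothetical `dbt_schema_tests/` file for the dedup stream used in these tests could declare properties like these (illustrative only, not the exact file contents):

    version: 2
    models:
      - name: dedup_exchange_rate
        columns:
          - name: id
            tests:
              - not_null
          - name: _airbyte_dedup_exchange_rate_hashid
            tests:
              - not_null
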
#### dbt data tests:

Additionally, dbt supports "data tests", which are specified as SQL queries.
A data test is a select statement that returns 0 records when the test is successful.

- [dbt data-tests](https://docs.getdbt.com/docs/building-a-dbt-project/tests#data-tests)

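For instance, a hypothetical `dbt_data_tests/` file could check that no primary key appears more than once in the final dedup table (an illustrative query, not one of the actual tests):

    -- passes when zero rows are returned
    select id, currency, count(*) as occurrences
    from {{ ref('dedup_exchange_rate') }}
    group by id, currency
    having count(*) > 1
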
#### Notes using dbt seeds:

Because some functionality is not yet stable enough on the dbt side, it is currently difficult to properly use
`dbt seed` commands to populate a set of expected data tables. Hopefully, this can be
done more easily in the future...

Related issues to watch for dbt progress on this aspect:
- https://github.com/fishtown-analytics/dbt/issues/2959#issuecomment-747509782
- https://medium.com/hashmapinc/unit-testing-on-dbt-models-using-a-static-test-dataset-in-snowflake-dfd35549b5e2

A nice improvement would be to add CSV/JSON seed files as the expected output data of tables.
The integration tests would then verify that the content of such tables in the destination matches
these seed files, or fail. A rough sketch of such a comparison is shown below.

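For instance, with an `expected_dedup_exchange_rate` seed loaded via `dbt seed`, a data test could diff actual versus expected rows (a speculative sketch in BigQuery dialect; the seed name and columns are assumptions, and this is not something these tests do today):

    -- passes when both tables contain exactly the same rows (zero rows returned)
    (
      select id, currency, NZD from {{ ref('dedup_exchange_rate') }}
      except distinct
      select id, currency, NZD from {{ ref('expected_dedup_exchange_rate') }}
    )
    union all
    (
      select id, currency, NZD from {{ ref('expected_dedup_exchange_rate') }}
      except distinct
      select id, currency, NZD from {{ ref('dedup_exchange_rate') }}
    )
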
## Standard Destination Tests

Generally, to invoke the standard destination tests, you run with Gradle using:

    ./gradlew :airbyte-integrations:connectors:destination-<connector name>:integrationTest

For more details and options, you can also refer to the [testing connectors docs](../../../docs/contributing-to-airbyte/building-new-connector/testing-connectors.md).

## Acceptance Tests

Please refer to the [developing docs](../../../docs/contributing-to-airbyte/developing-locally.md) on how to run Acceptance Tests.
23 changes: 23 additions & 0 deletions
...treams/final/airbyte_tables/test_normalization/test_normalization_dedup_exchange_rate.sql
create or replace table `dataline-integration-testing`.test_normalization.`dedup_exchange_rate`
OPTIONS()
as (

-- Final base SQL model
select
    id,
    currency,
    date,
    HKD,
    NZD,
    USD,
    _airbyte_emitted_at,
    _airbyte_dedup_exchange_rate_hashid
from `dataline-integration-testing`.test_normalization.`dedup_exchange_rate_scd`
-- dedup_exchange_rate from `dataline-integration-testing`.test_normalization._airbyte_raw_dedup_exchange_rate
where _airbyte_active_row = True
);
36 changes: 36 additions & 0 deletions
...ms/final/airbyte_tables/test_normalization/test_normalization_dedup_exchange_rate_scd.sql
create or replace table `dataline-integration-testing`.test_normalization.`dedup_exchange_rate_scd`
OPTIONS()
as (

-- SQL model to build a Type 2 Slowly Changing Dimension (SCD) table for each record identified by their primary key
select
    id,
    currency,
    date,
    HKD,
    NZD,
    USD,
    date as _airbyte_start_at,
    lag(date) over (
        partition by id, currency, cast(NZD as string)
        order by date desc, _airbyte_emitted_at desc
    ) as _airbyte_end_at,
    lag(date) over (
        partition by id, currency, cast(NZD as string)
        order by date desc, _airbyte_emitted_at desc
    ) is null as _airbyte_active_row,
    _airbyte_emitted_at,
    _airbyte_dedup_exchange_rate_hashid
from `dataline-integration-testing`._airbyte_test_normalization.`dedup_exchange_rate_ab4`
-- dedup_exchange_rate from `dataline-integration-testing`.test_normalization._airbyte_raw_dedup_exchange_rate
where _airbyte_row_num = 1
);
22 changes: 22 additions & 0 deletions
..._key_streams/final/airbyte_tables/test_normalization/test_normalization_exchange_rate.sql
create or replace table `dataline-integration-testing`.test_normalization.`exchange_rate`
OPTIONS()
as (

-- Final base SQL model
select
    id,
    currency,
    date,
    HKD,
    NZD,
    USD,
    _airbyte_emitted_at,
    _airbyte_exchange_rate_hashid
from `dataline-integration-testing`._airbyte_test_normalization.`exchange_rate_ab3`
-- exchange_rate from `dataline-integration-testing`.test_normalization._airbyte_raw_exchange_rate
);
17 changes: 17 additions & 0 deletions
.../airbyte_views/test_normalization/_airbyte_test_normalization_dedup_exchange_rate_ab1.sql
create or replace view `dataline-integration-testing`._airbyte_test_normalization.`dedup_exchange_rate_ab1`
OPTIONS()
as
-- SQL model to parse JSON blob stored in a single column and extract into separated field columns as described by the JSON Schema
select
    json_extract_scalar(_airbyte_data, "$.id") as id,
    json_extract_scalar(_airbyte_data, "$.currency") as currency,
    json_extract_scalar(_airbyte_data, "$.date") as date,
    json_extract_scalar(_airbyte_data, "$.HKD") as HKD,
    json_extract_scalar(_airbyte_data, "$.NZD") as NZD,
    json_extract_scalar(_airbyte_data, "$.USD") as USD,
    _airbyte_emitted_at
from `dataline-integration-testing`.test_normalization._airbyte_raw_dedup_exchange_rate
-- dedup_exchange_rate;
29 changes: 29 additions & 0 deletions
.../airbyte_views/test_normalization/_airbyte_test_normalization_dedup_exchange_rate_ab2.sql
create or replace view `dataline-integration-testing`._airbyte_test_normalization.`dedup_exchange_rate_ab2`
OPTIONS()
as
-- SQL model to cast each column to its adequate SQL type converted from the JSON schema type
select
    cast(id as int64) as id,
    cast(currency as string) as currency,
    cast(date as string) as date,
    cast(HKD as float64) as HKD,
    cast(NZD as float64) as NZD,
    cast(USD as float64) as USD,
    _airbyte_emitted_at
from `dataline-integration-testing`._airbyte_test_normalization.`dedup_exchange_rate_ab1`
-- dedup_exchange_rate;
26 changes: 26 additions & 0 deletions
.../airbyte_views/test_normalization/_airbyte_test_normalization_dedup_exchange_rate_ab3.sql
create or replace view `dataline-integration-testing`._airbyte_test_normalization.`dedup_exchange_rate_ab3`
OPTIONS()
as
-- SQL model to build a hash column based on the values of this record
select
    *,
    to_hex(md5(cast(concat(
        coalesce(cast(id as string), ''), '-',
        coalesce(cast(currency as string), ''), '-',
        coalesce(cast(date as string), ''), '-',
        coalesce(cast(HKD as string), ''), '-',
        coalesce(cast(NZD as string), ''), '-',
        coalesce(cast(USD as string), '')
    ) as string))) as _airbyte_dedup_exchange_rate_hashid
from `dataline-integration-testing`._airbyte_test_normalization.`dedup_exchange_rate_ab2`
-- dedup_exchange_rate;