DBT project for Airbyte normalization #802

Merged: 21 commits (Nov 6, 2020)

Commits
bad20b2  Start DBT module for running normalization transformations (ChristopheDuong, Nov 4, 2020)
cf7b921  Generates DBT models from JSON Schema catalog (ChristopheDuong, Nov 4, 2020)
bb5f6eb  Add Doc on how to run transformation of catalog (ChristopheDuong, Nov 4, 2020)
ec8f803  Make generated DBT SQL cross db compatible (ChristopheDuong, Nov 4, 2020)
7693d5c  Tweaks from review (ChristopheDuong, Nov 4, 2020)
f02d70d  turn on basic normalization for data warehouses (cgardens, Nov 4, 2020)
f5d1ec9  Merge remote-tracking branch 'origin/master' into normalization (ChristopheDuong, Nov 5, 2020)
8bbb4c9  Hook up dbt normalization (ChristopheDuong, Nov 5, 2020)
eebc286  Merge remote-tracking branch 'origin/cgardens/turn_on_normalization' … (ChristopheDuong, Nov 5, 2020)
9a2c8c9  Connect normalization pieces together (ChristopheDuong, Nov 5, 2020)
a44b52c  Exclude spotless from DBT folders (ChristopheDuong, Nov 5, 2020)
1ad010a  Add placeholder adapter functions for postgres flavor of sql (ChristopheDuong, Nov 5, 2020)
08b063c  Correct some snowflakes specifities (ChristopheDuong, Nov 5, 2020)
c9c6fe4  pull from master (sherifnada, Nov 6, 2020)
73ce71b  merge master part 2 (sherifnada, Nov 6, 2020)
56aa32e  Merge branch 'master' of github.com:airbytehq/airbyte into chris/dbt (sherifnada, Nov 6, 2020)
3a18856  Postgres normalization (#831) (sherifnada, Nov 6, 2020)
05d422f  Changes from code reviews (ChristopheDuong, Nov 6, 2020)
fb32f13  Handle Quoting in generated SQL so it can be configured as needed per… (ChristopheDuong, Nov 6, 2020)
e6cc9b1  Check if items is in dict before accessing it (ChristopheDuong, Nov 6, 2020)
9904b29  It works on snowflake!! (ChristopheDuong, Nov 6, 2020)
2 changes: 2 additions & 0 deletions .gitignore
@@ -15,3 +15,5 @@ __pycache__
.venv
.mypy_cache

# dbt
profiles.yml
@@ -3,3 +3,4 @@
!entrypoint.sh
!setup.py
!normalization
!dbt-project-template
3 changes: 3 additions & 0 deletions airbyte-integrations/bases/base-normalization/Dockerfile
@@ -10,6 +10,9 @@ WORKDIR /airbyte/normalization_code
COPY normalization ./normalization
COPY setup.py .
RUN pip install .
COPY dbt-project-template/macros ./dbt-template/macros
Review comment (Contributor): can you do `COPY dbt-project-template/ ./dbt-template/` instead?

COPY dbt-project-template/dbt_project.yml ./dbt-template/dbt_project.yml
COPY dbt-project-template/packages.yml ./dbt-template/packages.yml

WORKDIR /airbyte

@@ -0,0 +1,4 @@
build/
logs/
models/generated/
dbt_modules/
@@ -0,0 +1,19 @@
## Installing DBT

1. Activate your venv and run `pip3 install dbt`
1. Copy `airbyte-normalization/sample_files/profiles.yml` over to `~/.dbt/profiles.yml`
1. Edit it to configure your profiles accordingly (a minimal sketch follows this list)
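
For reference, here is a minimal sketch of what such a `profiles.yml` could look like for a Postgres target. Every connection value below is a placeholder invented for illustration, not the contents of the actual sample file:

```yaml
# Hypothetical ~/.dbt/profiles.yml sketch; all values are placeholders.
normalize:
  target: dev
  outputs:
    dev:
      type: postgres
      host: localhost
      port: 5432
      user: airbyte
      pass: password        # dbt's postgres profile documents the key as "pass"
      dbname: airbyte_db
      schema: quickstart
      threads: 1
```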

## Running DBT

1. `cd airbyte-normalization`
1. You can now run DBT commands; to check that the setup is fine: `dbt debug`
1. To build the DBT tables in your warehouse: `dbt run`

## Running DBT from Airbyte generated config

1. You can also change directory (for example, `cd /tmp/dev_root/workspace/1/0/normalize`) to one of the workspaces generated by Airbyte, inside one of the `normalize` folders.
1. You should find `profiles.yml` and a number of other DBT files/folders created there.
1. To check everything is set up properly: `dbt debug --profiles-dir=$(pwd) --project-dir=$(pwd)`
1. You can also modify the `.sql` files and run `dbt run --profiles-dir=$(pwd) --project-dir=$(pwd)`
1. You can inspect the compiled DBT `.sql` files, before they are run in the destination engine, in the `normalize/build/compiled` or `normalize/build/run` folders
@@ -0,0 +1,40 @@
# Name your package! Package names should contain only lowercase characters
# and underscores. A good package name should reflect your organization's
# name or the intended use of these models
name: 'airbyte'
version: '1.0'
config-version: 2

# This setting configures which "profile" dbt uses for this project. Profiles contain
# database connection information, and should be configured in the ~/.dbt/profiles.yml file
profile: 'normalize'

# These configurations specify where dbt should look for different types of files.
# The `source-paths` config, for example, states that source models can be found
# in the "models/" directory. You probably won't need to change these!
source-paths: ["models"]
docs-paths: ["docs"]
analysis-paths: ["analysis"]
test-paths: ["tests"]
data-paths: ["data"]
macro-paths: ["macros"]

target-path: "build" # directory which will store compiled SQL files
clean-targets: # directories to be removed by `dbt clean`
  - "build"
  - "dbt_modules"

# https://docs.getdbt.com/reference/project-configs/quoting/
quoting:
  database: true
  schema: true
  identifier: true

# You can define configurations for models in the `source-paths` directory here.
# Using these configurations, you can enable or disable models, change how they
# are materialized, and more!
models:
  airbyte:
    # Schema (dataset) defined in profiles.yml is concatenated with the schema below for dbt's final output
    +schema: NORMALIZED
    +materialized: view
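
For context on the `+schema` setting: dbt's default `generate_schema_name` macro appends the custom schema to the target schema from `profiles.yml` with an underscore, so a target schema of `quickstart` would yield `quickstart_NORMALIZED`. Roughly, dbt's built-in behavior looks like the sketch below (paraphrased dbt internals, not part of this PR):

```sql
{# Sketch of dbt's default generate_schema_name, shown for illustration only. #}
{% macro generate_schema_name(custom_schema_name, node) -%}
    {%- set default_schema = target.schema -%}
    {%- if custom_schema_name is none -%}
        {{ default_schema }}
    {%- else -%}
        {{ default_schema }}_{{ custom_schema_name | trim }}
    {%- endif -%}
{%- endmacro %}
```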
@@ -0,0 +1,34 @@
{#
Adapter Macros for the following functions:
- Bigquery: unnest() -> https://cloud.google.com/bigquery/docs/reference/standard-sql/arrays#flattening-arrays-and-repeated-fields
- Snowflake: flatten() -> https://docs.snowflake.com/en/sql-reference/functions/flatten.html
- Redshift: -> https://blog.getdbt.com/how-to-unnest-arrays-in-redshift/
- postgres: unnest() -> https://www.postgresqltutorial.com/postgresql-array/
#}

{# flatten ------------------------------------------------- #}

{% macro unnest(array_col) -%}
{{ adapter.dispatch('unnest')(array_col) }}
{%- endmacro %}

{% macro default__unnest(array_col) -%}
unnest({{ adapter.quote_as_configured(array_col, 'identifier')|trim }})
{%- endmacro %}

{% macro bigquery__unnest(array_col) -%}
unnest({{ adapter.quote_as_configured(array_col, 'identifier')|trim }})
{%- endmacro %}

{% macro postgres__unnest(array_col) -%}
unnest({{ adapter.quote_as_configured(array_col, 'identifier')|trim }})
{%- endmacro %}

{% macro redshift__unnest(array_col) -%}
-- FIXME to implement as described here? https://blog.getdbt.com/how-to-unnest-arrays-in-redshift/
{%- endmacro %}

{% macro snowflake__unnest(array_col) -%}
-- TODO test this!! not so sure yet...
Review comment (Contributor): is it working?
table(flatten({{ adapter.quote_as_configured(array_col, 'identifier')|trim }}))
{%- endmacro %}
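
To illustrate how `adapter.dispatch` picks a variant, here is a hypothetical model snippet using this macro; the `events` table and its `data` array column are invented for the example, and the compiled output is approximate (quoting characters vary per warehouse):

```sql
-- Hypothetical model; "events" and its array column "data" are placeholders.
select unnested.*
from events
cross join {{ unnest('data') }} as unnested

-- Approximate compiled SQL per adapter:
--   default / bigquery / postgres: cross join unnest("data") as unnested
--   snowflake:                     cross join table(flatten("data")) as unnested
--   redshift:                      not implemented yet (see the FIXME above)
```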
@@ -0,0 +1,110 @@
{#
Adapter Macros for the following functions:
- Bigquery: JSON_EXTRACT(json_string_expr, json_path_format) -> https://cloud.google.com/bigquery/docs/reference/standard-sql/json_functions
- Snowflake: JSON_EXTRACT_PATH_TEXT( <column_identifier> , '<path_name>' ) -> https://docs.snowflake.com/en/sql-reference/functions/json_extract_path_text.html
- Redshift: json_extract_path_text('json_string', 'path_elem' [,'path_elem'[, …] ] [, null_if_invalid ] ) -> https://docs.aws.amazon.com/redshift/latest/dg/JSON_EXTRACT_PATH_TEXT.html
- Postgres: json_extract_path_text(<from_json>, 'path' [, 'path' [, ...]]) -> https://www.postgresql.org/docs/12/functions-json.html
#}

{# format_json_path -------------------------------------------------- #}
{% macro format_json_path(json_path_list) -%}
{{ adapter.dispatch('format_json_path')(json_path_list) }}
{%- endmacro %}

{% macro default__format_json_path(json_path_list) -%}
{{ '.' ~ json_path_list|join('.') }}
{%- endmacro %}

{% macro bigquery__format_json_path(json_path_list) -%}
{{ '"$.' ~ json_path_list|join('"."') ~ '"' }}
{%- endmacro %}

{% macro postgres__format_json_path(json_path_list) -%}
{{ "'" ~ json_path_list|join("','") ~ "'" }}
{%- endmacro %}

{% macro redshift__format_json_path(json_path_list) -%}
{{ "'" ~ json_path_list|join("','") ~ "'" }}
{%- endmacro %}

{% macro snowflake__format_json_path(json_path_list) -%}
{{ "'" ~ json_path_list|join('"."') ~ "'" }}
{%- endmacro %}

{# json_extract ------------------------------------------------- #}

{% macro json_extract(json_column, json_path_list) -%}
{{ adapter.dispatch('json_extract')(json_column, json_path_list) }}
{%- endmacro %}

{% macro default__json_extract(json_column, json_path_list) -%}
json_extract({{ adapter.quote_as_configured(json_column, 'identifier')|trim }}, {{ format_json_path(json_path_list) }})
{%- endmacro %}

{% macro bigquery__json_extract(json_column, json_path_list) -%}
json_extract({{ adapter.quote_as_configured(json_column, 'identifier')|trim }}, {{ format_json_path(json_path_list) }})
{%- endmacro %}

{% macro postgres__json_extract(json_column, json_path_list) -%}
jsonb_extract_path_text({{ adapter.quote_as_configured(json_column, 'identifier')|trim }}, {{ format_json_path(json_path_list) }})
{%- endmacro %}

{% macro redshift__json_extract(json_column, json_path_list) -%}
json_extract_path_text({{ adapter.quote_as_configured(json_column, 'identifier')|trim }}, {{ format_json_path(json_path_list) }})
{%- endmacro %}

{% macro snowflake__json_extract(json_column, json_path_list) -%}
json_extract_path_text({{ adapter.quote_as_configured(json_column, 'identifier')|trim }}, {{ format_json_path(json_path_list) }})
{%- endmacro %}

{# json_extract_scalar ------------------------------------------------- #}

{% macro json_extract_scalar(json_column, json_path_list) -%}
{{ adapter.dispatch('json_extract_scalar')(json_column, json_path_list) }}
{%- endmacro %}

{% macro default__json_extract_scalar(json_column, json_path_list) -%}
json_extract_scalar({{ adapter.quote_as_configured(json_column, 'identifier')|trim }}, {{ format_json_path(json_path_list) }})
{%- endmacro %}

{% macro bigquery__json_extract_scalar(json_column, json_path_list) -%}
json_extract_scalar({{ adapter.quote_as_configured(json_column, 'identifier')|trim }}, {{ format_json_path(json_path_list) }})
{%- endmacro %}

{% macro postgres__json_extract_scalar(json_column, json_path_list) -%}
jsonb_extract_path_text({{ adapter.quote_as_configured(json_column, 'identifier')|trim }},{{ format_json_path(json_path_list) }})
{%- endmacro %}

{% macro redshift__json_extract_scalar(json_column, json_path_list) -%}
json_extract_path_text({{ adapter.quote_as_configured(json_column, 'identifier')|trim }}, {{ format_json_path(json_path_list) }})
{%- endmacro %}

{% macro snowflake__json_extract_scalar(json_column, json_path_list) -%}
json_extract_path_text({{ adapter.quote_as_configured(json_column, 'identifier')|trim }}, {{ format_json_path(json_path_list) }})
{%- endmacro %}

{# json_extract_array ------------------------------------------------- #}

{% macro json_extract_array(json_column, json_path_list) -%}
{{ adapter.dispatch('json_extract_array')(json_column, json_path_list) }}
{%- endmacro %}

{% macro default__json_extract_array(json_column, json_path_list) -%}
json_extract_array({{ adapter.quote_as_configured(json_column, 'identifier')|trim }}, {{ format_json_path(json_path_list) }})
{%- endmacro %}

{% macro bigquery__json_extract_array(json_column, json_path_list) -%}
json_extract_array({{ adapter.quote_as_configured(json_column, 'identifier')|trim }}, {{ format_json_path(json_path_list) }})
{%- endmacro %}

{% macro postgres__json_extract_array(json_column, json_path_list) -%}
jsonb_array_elements(jsonb_extract_path({{ adapter.quote_as_configured(json_column, 'identifier')|trim }},{{ format_json_path(json_path_list) }}))
{%- endmacro %}

{% macro redshift__json_extract_array(json_column, json_path_list) -%}
json_extract_path_text({{ adapter.quote_as_configured(json_column, 'identifier')|trim }}, {{ format_json_path(json_path_list) }})
{%- endmacro %}

{% macro snowflake__json_extract_array(json_column, json_path_list) -%}
json_extract_path_text({{ adapter.quote_as_configured(json_column, 'identifier')|trim }}, {{ format_json_path(json_path_list) }})
{%- endmacro %}
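
As a worked example of what these macros render (the `_airbyte_data` column and the path are invented for illustration), consider extracting `address.city`:

```sql
-- Hypothetical call in a model; "_airbyte_data" is a placeholder JSON column.
{{ json_extract_scalar('_airbyte_data', ['address', 'city']) }}

-- format_json_path(['address', 'city']) renders per adapter as:
--   default:    .address.city
--   bigquery:   "$.address"."city"
--   postgres:   'address','city'
--   redshift:   'address','city'
--   snowflake:  'address"."city'
-- e.g. the postgres variant therefore compiles to:
--   jsonb_extract_path_text("_airbyte_data", 'address','city')
```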
@@ -0,0 +1,5 @@
# add dependencies. these will get pulled during the `dbt deps` process.

packages:
  - git: "https://github.com/fishtown-analytics/dbt-utils.git"
    revision: 0.6.2
24 changes: 13 additions & 11 deletions airbyte-integrations/bases/base-normalization/entrypoint.sh
@@ -2,10 +2,6 @@

set -e

# dbt looks specifically for files named profiles.yml and dbt_project.yml
DBT_PROFILE=profiles.yml
DBT_MODEL=dbt_project.yml

function echo2() {
echo >&2 "$@"
}
@@ -15,6 +11,8 @@ function error() {
exit 1
}

PROJECT_DIR=$(pwd)

## todo: make it easy to select source or destination and validate based on selection by adding an integration type env variable.
function main() {
CMD="$1"
@@ -43,15 +41,19 @@ function main() {

case "$CMD" in
run)
transform-config --config "$CONFIG_FILE" --integration-type "$INTEGRATION_TYPE" --out "$DBT_PROFILE"
# todo (cgardens) - @ChristopheDuong adjust these args if necessary.
transform-catalog --catalog "$CATALOG_FILE" --integration-type "$INTEGRATION_TYPE" --out "$DBT_MODEL"

# todo (cgardens) - @ChristopheDuong this is my best guess at how we are supposed to invoke dbt. adjust when they are inevitably not quite right.
dbt run --profiles-dir $(pwd) --project-dir $(pwd) --full-refresh --fail-fast
cp -r /airbyte/normalization_code/dbt-template/* $PROJECT_DIR
transform-config --config "$CONFIG_FILE" --integration-type "$INTEGRATION_TYPE" --out $PROJECT_DIR
transform-catalog --profile-config-dir $PROJECT_DIR --catalog "$CATALOG_FILE" --out $PROJECT_DIR/models/generated/ --json-column data
dbt deps --profiles-dir $PROJECT_DIR --project-dir $PROJECT_DIR
dbt run --profiles-dir $PROJECT_DIR --project-dir $PROJECT_DIR
;;
dry-run)
ChristopheDuong marked this conversation as resolved.
error "Not Implemented"
cp -r /airbyte/normalization_code/dbt-template/* $PROJECT_DIR
transform-config --config "$CONFIG_FILE" --integration-type "$INTEGRATION_TYPE" --out $PROJECT_DIR
dbt debug --profiles-dir $PROJECT_DIR --project-dir $PROJECT_DIR
transform-catalog --profile-config-dir $PROJECT_DIR --catalog "$CATALOG_FILE" --out $PROJECT_DIR/models/generated/ --json-column data
dbt deps --profiles-dir $PROJECT_DIR --project-dir $PROJECT_DIR
dbt compile --profiles-dir $PROJECT_DIR --project-dir $PROJECT_DIR
;;
*)
error "Unknown command: $CMD"
@@ -0,0 +1,31 @@
"""
MIT License

Copyright (c) 2020 Airbyte

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
"""

from enum import Enum


class DestinationType(Enum):
bigquery = "bigquery"
postgres = "postgres"
snowflake = "snowflake"
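
A quick sketch (illustrative only, not part of this PR) of how the enum resolves a destination from its string value:

```python
# Illustrative usage: Enum lookup by value; unknown values raise ValueError.
destination = DestinationType("postgres")
assert destination is DestinationType.postgres

try:
    DestinationType("mysql")  # not one of the supported destinations
except ValueError:
    print("unsupported destination type")
```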