# dbt (data build tool)
[dbt](https://docs.getdbt.com/docs/introduction) does not extract or load data. It transforms data that is already available in a database, i.e. dbt does the T in ELT (Extract, Load, Transform) processes.

<div class="alert alert-block alert-info">
<b>Note:</b> To run this notebook, select "dbt" as the kernel, not "Python 3 (ipykernel)" (Kernel -> Change Kernel).
</div>

---


## Set-up
### Start PostgreSQL database using the controlboard
Let's assume our raw data lives in a PostgreSQL database and that we want to transform it with dbt.

Follow the instructions in the Notebook [Controlboard](controlboard.ipynb). 

### Load tutorial data into PostgreSQL
todo: Automate this with Airflow ;-)




In [2]:
! dbt --version

/bin/bash: line 1: dbt: command not found


### Init dbt
We use a terminal inside Jupyter because `dbt init` asks questions interactively on the command prompt, and these cannot be answered from a Jupyter cell :-(.

1. In Jupyter, hit the blue "+" sign at the top left and start a new Terminal.
2. Tell the Terminal that you want to use the Python environment `dbt` instead of the standard `base` one (otherwise you'll get an error "bash: dbt: command not found"):
```console
(base) jovyan@...:~/$ conda activate dbt
(dbt) jovyan@...:~/$ 
```

3. Initialize dbt to create a new project called e.g. `tutorial`, together with an entirely new folder structure for your dbt project:
```console
dbt init tutorial
```
4. Answer dbt's questions, for example like this:
    * Enter a name for your project (letters, digits, underscore): `tutorial`
    * Which database would you like to use? `1`
5. Create a new folder for your dbt profiles:
```console
$ mkdir -p ~/.dbt
```
6. Open nano to edit your dbt `profiles.yml` (dbt looks for this exact file name, with the `.yml` extension):
```console
$ nano ~/.dbt/profiles.yml
```
7. Paste the following into nano and adjust the values to your environment. dbt also needs a target `schema`, so uncomment that line and set it:
```yaml
tutorial:
  target: dev
  outputs:
    dev:
      type: postgres
      host: postgresql.myproject.svc.cluster.local
      user: dbuser
      password: <yourpassword>
      port: 5432
      dbname: raw_data
      # schema: [dbt schema]
      threads: 1
      keepalives_idle: 0
      connect_timeout: 10
      retries: 1
      # search_path: [optional, override the default postgres search_path]
      # role: [optional, set the role dbt assumes when executing queries]
      # sslmode: [optional, set the sslmode used to connect to the database]
```
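
Before running dbt, it can help to sanity-check the values in the profile. The following is a minimal sketch and not part of dbt: the helper `missing_profile_keys` and the tuple `REQUIRED_KEYS` are hypothetical names, and the choice of required keys is an assumption (the connection fields used in the profile above plus `schema`).

```python
# Hypothetical helper, not part of dbt: report profile keys that are absent or empty.
# REQUIRED_KEYS is an assumption for illustration (fields from the profile above plus `schema`).
REQUIRED_KEYS = ("type", "host", "user", "password", "port", "dbname", "schema")


def missing_profile_keys(output: dict) -> list:
    """Return the required keys that are missing or empty in one dbt output block."""
    return [key for key in REQUIRED_KEYS if output.get(key) in (None, "")]


# The `dev` output from the profile above, written as a plain dict.
dev = {
    "type": "postgres",
    "host": "postgresql.myproject.svc.cluster.local",
    "user": "dbuser",
    "password": "<yourpassword>",
    "port": 5432,
    "dbname": "raw_data",
    # "schema" is still commented out in the YAML above
    "threads": 1,
}

print(missing_profile_keys(dev))  # ['schema']
```

Once the file is saved, running `dbt debug` inside the project folder reports whether dbt can read the profile and reach the database.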

Make sure that you can view hidden files such as the `~/.dbt` folder: go to View -> "Show Hidden Files".
<div class="alert alert-block alert-info">
<b>Note:</b> Jupyter's "Show Hidden Files" is currently broken - you won't see all dbt files. Either edit any dbt files on your host computer or use the Jupyter command prompt.
</div>

## Set-up git
Todo

## Extract and load
Switch the kernel from `dbt` to `Python 3 (ipykernel)`.

In [None]:
from sqlalchemy import create_engine, MetaData
from sqlalchemy_utils import database_exists, create_database
from urllib import parse

import pandas as pd
import requests
import json

raw_data_url = 'https://github.com/openZH/covid_19/raw/master/fallzahlen_kanton_alter_geschlecht_csv/COVID19_Fallzahlen_Kanton_ZH_altersklassen_geschlecht.csv'
coingecko = 'https://api.coingecko.com/api/v3'
endpoint = '/coins/ethereum/market_chart'
params = {
    'vs_currency': 'CHF',
    'days': 90,
    'interval': 'hourly'
}

In [None]:
# Name of your database - this database does NOT exist yet (create it below with `create_database()`)
database = 'raw_data'
username = 'dbuser'
password = '662VZUE5RJJG2puI'

# Connection details according to docker-compose.yml - do not change this
dialect = 'postgresql'  # Could be almost any other DB technology
host = 'postgresql.myproject.svc.cluster.local'  # Name of the Kubernetes service
port = 5432

# URL-encode password for characters like %, ä, ...
password = parse.quote_plus(password)

url = f'{dialect}://{username}:{password}@{host}:{port}/{database}'
engine = create_engine(url)
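
The URL-encoding step above matters because characters such as `@`, `:` or `/` have a special meaning inside a connection URL. The sketch below illustrates this with the standard library only. The password, host and database name are made up for illustration and are not the values used in this notebook.

```python
from urllib import parse

# Made-up password containing characters that have a special meaning in a URL.
sample_password = "p@ss:w/rd%"

encoded = parse.quote_plus(sample_password)
print(encoded)  # p%40ss%3Aw%2Frd%25

# Hypothetical host and database name, for illustration only.
url = f"postgresql://dbuser:{encoded}@db.example.org:5432/raw_data"

# The encoded password no longer contains "@" or "/", so the host part stays unambiguous.
parsed = parse.urlsplit(url)
print(parsed.hostname)                 # db.example.org
print(parse.unquote(parsed.password))  # p@ss:w/rd%
```

Without the encoding, the `/` in the password would end the host part of the URL early and the connection string would be misread.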

In [None]:
if not database_exists(engine.url):
    create_database(engine.url)

print(f'Database "{database}" exists: {database_exists(engine.url)}')

In [None]:
df = pd.read_csv(raw_data_url)
df

In [None]:
r = requests.get(coingecko + endpoint, params=params)
r
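
`requests` sends the `params` dict as a URL query string. The minimal sketch below uses only the standard library to show the URL this should produce. It approximates what `requests` builds and does not call the API.

```python
from urllib import parse

# Values copied from the cells above.
coingecko = 'https://api.coingecko.com/api/v3'
endpoint = '/coins/ethereum/market_chart'
params = {'vs_currency': 'CHF', 'days': 90, 'interval': 'hourly'}

# urlencode turns the dict into "key=value" pairs joined by "&".
query = parse.urlencode(params)
full_url = f'{coingecko}{endpoint}?{query}'
print(full_url)
```

Comparing this with `r.url` and checking `r.status_code` (or calling `r.raise_for_status()`) before parsing the body helps catch a failed request early.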

In [None]:
payload = json.loads(r.text)  # parsed JSON response as a dict
# 'prices' holds a list of [timestamp, price] pairs
df = pd.DataFrame(payload['prices'], columns=['time', 'value'])

In [None]:
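# Timestamps are in milliseconds since the Unix epoch; same result as multiplying by 1,000,000 to get nanoseconds
df['time'] = pd.to_datetime(df['time'], unit='ms')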
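
To make the timestamp conversion concrete, here is a minimal sketch without pandas. It assumes the API returns Unix timestamps in milliseconds, which is what the conversion in the cell above implies. The `sample` payload is made up and only mimics the shape of the `prices` list.

```python
from datetime import datetime, timezone

# Made-up sample in the shape used above: a "prices" list of [timestamp_ms, value] pairs.
sample = {"prices": [[1700000000000, 1800.5], [1700003600000, 1801.25]]}

rows = [
    # Divide by 1000 to go from milliseconds to seconds since the Unix epoch.
    {"time": datetime.fromtimestamp(ms / 1000, tz=timezone.utc), "value": value}
    for ms, value in sample["prices"]
]

for row in rows:
    print(row["time"].isoformat(), row["value"])
# 2023-11-14T22:13:20+00:00 1800.5
# 2023-11-14T23:13:20+00:00 1801.25
```

The two sample rows are one hour apart, matching the hourly interval requested earlier.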

In [None]:
df

In [None]:
df.to_sql(
    'raw_data',  # table name
    con=engine,
    if_exists='replace',
    index=False,  # In order to avoid writing DataFrame index as a column
)

In [None]:
df2 = pd.read_sql('SELECT * FROM raw_data', con=engine)

In [None]:
df2

## Airports and flights

In [None]:
raw_data = (
    ('airports', 'https://github.com/hgrif/dbt_tutorial/raw/master/flights_data/airports.csv'),
    ('carriers', 'https://github.com/hgrif/dbt_tutorial/raw/master/flights_data/carriers.csv'),
    ('flights', 'https://github.com/hgrif/dbt_tutorial/raw/master/flights_data/flights.csv')
)
for table, url in raw_data:
    df = pd.read_csv(url)
    df.to_sql(table, con=engine, if_exists='replace', index=False)  # index=False: do not write the DataFrame index as a column
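
After a load like this, a row count per table is a quick check that the data arrived. The sketch below shows the idea using SQLite and the `csv` module as stand-ins for PostgreSQL and pandas, so that it runs without a database server. The helper `load_table` and the CSV content are hypothetical and are not taken from the tutorial files.

```python
import csv
import io
import sqlite3

# Made-up CSV content standing in for one of the downloaded files.
csv_text = "code,name\nAAA,Example One\nBBB,Example Two\n"

conn = sqlite3.connect(":memory:")  # stand-in for the PostgreSQL database


def load_table(conn, table, text):
    """Replace `table` with the rows of a CSV string and return the row count."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    columns = ", ".join(f'"{name}"' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f'DROP TABLE IF EXISTS "{table}"')  # mirrors if_exists='replace'
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', reader)
    return conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]


print(load_table(conn, "carriers", csv_text))  # 2
```

Against the real database, the equivalent check would be a `SELECT COUNT(*)` per table through `pd.read_sql` and the `engine` created earlier.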