# Getting started with your database
### Database-coding using SQLAlchemy and another (!!) Jupyter Notebook
Start a [Vanilla Jupyterlab](ControlBoard.ipynb#Vanilla-Jupyter-Datascience-Notebook) instance, then **copy** the code snippets below. Don't use this AWK DataLab Controlboard directly ;-).
### Imports

In [1]:
from sqlalchemy import create_engine, MetaData
from sqlalchemy_utils import database_exists, create_database
from urllib import parse

import pandas as pd


### Connect to your database
SQLAlchemy lets you use the same syntax and logic for different databases. All you need to change is the connection piece. You'll receive an `Engine` object from SQLAlchemy (the connection won't be established until you do something with it). [Check here](https://docs.sqlalchemy.org/en/13/core/connections.html) to get started.
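
The lazy connection behavior described above can be observed with a small sketch. It assumes an in-memory SQLite database as a stand-in (so it runs without a database server) and uses a hypothetical `connect_events` list to record when SQLAlchemy actually opens a driver-level connection:

```python
from sqlalchemy import create_engine, event, text

# In-memory SQLite stands in for the real database (assumption for this sketch)
engine = create_engine("sqlite:///:memory:")

connect_events = []  # hypothetical helper list, filled by the listener below


@event.listens_for(engine, "connect")
def record_connect(dbapi_connection, connection_record):
    # Called whenever a new driver-level connection is opened
    connect_events.append("connected")


events_before = len(connect_events)  # engine exists, nothing connected yet

with engine.connect() as connection:
    value = connection.execute(text("SELECT 1")).scalar()

events_after = len(connect_events)  # the first real use opened a connection
print(events_before, events_after, value)
```

Creating the engine alone triggers no connection; the first `engine.connect()` does.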

#### The Passwords will be different for you
To get the passwords below, use the initial `helm upgrade --install ...` command. Or (more complex), grab the [passwords from the respective Kubernetes secret](https://kubernetes.io/docs/tasks/configmap-secret/managing-secret-using-kubectl/#decoding-secret), e.g. named `postgresql`.
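
Values stored in a Kubernetes secret are base64-encoded, so a value copied from the secret needs decoding before it works as a password. A minimal sketch, using a made-up value rather than a real secret:

```python
import base64

# Stand-in for the encoded string copied from the secret (made up for this example)
encoded_value = base64.b64encode(b"not-a-real-password").decode("ascii")

# Decode it back into the plain-text password used in the cells below
decoded_password = base64.b64decode(encoded_value).decode("utf-8")
print(decoded_password)
```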

#### PostgreSQL (state of the art)

In [11]:
# Name of your database - this database does NOT exist yet (create it below with `create_database()`)
database = 'information_schema_catalog_name'
username = 'dbuser'
password = 'rA4I2WYECsV4bqbw'

# Connection details according to docker-compose.yml - do not change this
dialect = 'postgresql'  # Could be almost any other DB technology
host = 'postgresql'  # Name of the Kubernetes service
port = 5432

# URL-encode password for characters like %, ä, ...
password = parse.quote_plus(password)

url = f'{dialect}://{username}:{password}@{host}:{port}/{database}'
engine = create_engine(url)
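
To illustrate why the password is URL-encoded before it goes into the connection URL, here is a small sketch using only the standard library. The password `p@ss:w/rd%ä` and the database name `mydb` are made up; characters such as `@`, `:` and `/` would otherwise be read as URL delimiters:

```python
from urllib import parse

raw_password = 'p@ss:w/rd%ä'  # made-up example, not a real credential
encoded_password = parse.quote_plus(raw_password)
print(encoded_password)

# With the encoded password, the URL still splits into the intended parts
url = f'postgresql://dbuser:{encoded_password}@postgresql:5432/mydb'
parts = parse.urlsplit(url)
print(parts.hostname, parts.port)

# Decoding the password part restores the original value
restored_password = parse.unquote_plus(parts.password)
```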

In [12]:
from sqlalchemy import text
with engine.connect() as connection:
    result = connection.execute(text("SELECT nspname FROM pg_catalog.pg_namespace;"))
    for row in result:
        print("schema:", row.nspname)

OperationalError: (psycopg2.OperationalError) connection to server at "postgresql" (10.43.219.128), port 5432 failed: FATAL:  database "information_schema_catalog_name" does not exist

(Background on this error at: https://sqlalche.me/e/14/e3q8)

#### MySQL

In [None]:
# Name of your database - this database does NOT exist yet (create it below with `create_database()`)
database = 'my-new-database'
username = 'root'  # Only root user can create a new database
password = 'zu6yUAgHbCuTg72mGbdg3dMj'

# Connection details according to docker-compose.yml - do not change this
dialect = 'mysql+mysqlconnector'  # Could be almost any other DB technology
host = 'mysql'
port = 3306
# URL-encode password for characters like %, ä, ...
password = parse.quote_plus(password)

url = f'{dialect}://{username}:{password}@{host}:{port}/{database}'
engine = create_engine(url)

***
## Create a new database once (you start with an empty database)

Create a new database called `my-new-database` (or anything, really). The cell below should print `... exists: True`, which also confirms that you can connect to the database server.

In [None]:
if not database_exists(engine.url):
    create_database(engine.url)

print(f'Database "{database}" exists: {database_exists(engine.url)}')

## Load example data into the database
### Postgres
Download `northwind.sql` from [this link](https://github.com/pthom/northwind_psql/raw/master/northwind.sql) (shift-click, then `Save link as...`), taken from the famous [Northwind example database](https://github.com/pthom/northwind_psql). Move the file into your `work` directory mounted in Jupyter.

In [None]:
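with open("/home/jovyan/work/northwind.sql") as f:  # closes the file after reading
    sql = f.read()
# One transaction: committed on success, rolled back on error
with engine.begin() as connection:
    # SQLAlchemy 1.x accepts a plain string here; 2.x needs exec_driver_sql() or text()
    connection.execute(sql)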

### MySQL
MySQL needs more work. Download the database schema `northwind.sql` and the actual data `northwind-data.sql` from [this GitHub repo](https://github.com/dalers/mywind). Move the two files into your `work` directory mounted in Jupyter.

Read both files in sequence, split each into its SQL commands (separated by a `;` followed by a line break), and send the commands to MySQL one at a time:

In [None]:
for filename in ("/home/jovyan/work/northwind.sql", "/home/jovyan/work/northwind-data.sql"):
    with open(filename) as f:  # closes the file after reading
        sql = f.read()
    with engine.begin() as connection:
        for command in sql.split(';\n'):
            if not command.strip() or command.startswith('--'):
                # Empty or commented line - MySQL would throw an exception
                continue
            connection.execute(command)
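
The splitting loop above can be tried out without a MySQL server. The following sketch applies the same split-and-filter logic to a made-up three-statement script and runs the result against an in-memory SQLite database from the standard library, which stands in for MySQL here:

```python
import sqlite3

# Made-up script; note the ';' inside the string literal 'a;b'
script = (
    "CREATE TABLE demo (id INTEGER, name TEXT);\n"
    "\n"
    "INSERT INTO demo VALUES (1, 'a;b');\n"
    "INSERT INTO demo VALUES (2, 'c');\n"
)

# Same logic as the loop above: split on ';' + line break, drop empty or comment chunks
commands = [
    command
    for command in script.split(';\n')
    if command.strip() and not command.startswith('--')
]

connection = sqlite3.connect(":memory:")
for command in commands:
    connection.execute(command)

rows = connection.execute("SELECT id, name FROM demo ORDER BY id").fetchall()
print(len(commands), rows)
connection.close()
```

Splitting on `;` followed by a line break, rather than on every `;`, keeps the semicolon inside `'a;b'` intact. A string literal that itself contains `;` plus a line break would still be split, so this remains a heuristic.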

***
## Explore the DB
Apart from the database itself and the table, you might need to specify a schema. In our case:

In [None]:
# Postgres - let's use the standard/default schema
schema = 'public'

In [None]:
# MySQL - the load above created its own schema
schema = 'northwind'

List all tables in the current database. SQLAlchemy uses an object called `MetaData` to describe the database:

In [None]:
# Collect the table definitions for our schema
meta = MetaData(schema=schema)
# Load the existing database metadata from the database (the engine-object) into meta
meta.reflect(bind=engine)
# Print all tables
meta.tables.keys()

Print all columns of all tables of the current database:

In [None]:
for table in meta.sorted_tables:
    for column in table.columns:
        print(f'{table.name}: {column.name}')
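
To see what reflection yields without a running PostgreSQL or MySQL server, here is a sketch against an in-memory SQLite database. The two tables are simplified, made-up stand-ins for the Northwind tables, and no schema is passed because this sketch relies on SQLite's default:

```python
from sqlalchemy import MetaData, create_engine, text

# In-memory SQLite stands in for the real database (assumption for this sketch)
engine = create_engine("sqlite:///:memory:")

with engine.begin() as connection:
    connection.execute(text(
        "CREATE TABLE customers (customer_id TEXT PRIMARY KEY, company_name TEXT)"
    ))
    connection.execute(text(
        "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, "
        "customer_id TEXT REFERENCES customers (customer_id))"
    ))

# Reflect the tables that exist in the database into a MetaData object
meta = MetaData()
meta.reflect(bind=engine)

table_names = sorted(meta.tables.keys())
columns = [
    f'{table.name}: {column.name}'
    for table in meta.sorted_tables
    for column in table.columns
]
print(table_names)
print(columns)
```

`sorted_tables` returns the tables in foreign-key dependency order, so `customers` comes before `orders`, which references it.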

## SQLAlchemy and Pandas Dataframes
SQLAlchemy plays nicely with Pandas. In general, you pass the `Engine`-object to Pandas as well as the schema - that's it.

To get you started, try this to **read** an entire DB table into a dataframe `df`:

In [None]:
table_name = 'customers'
df = pd.read_sql_table(
    table_name,
    con=engine,
    schema=schema,
    index_col='customer_id'  # column name to use as dataframe-index (optional)
)
df

To **write** a dataframe `df` to a new table, do this:

In [None]:
table_name = 'customers_copy'

df.to_sql(
    table_name,
    con=engine,
    schema=schema,
    if_exists='fail',  # What to do with an existing table? Could also be `replace` or `append`
    index=True,  # Whether to write the dataframe index as an additional column. Won't be a primary key automatically!
)
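
The effect of `if_exists` and `index` can be checked on a throwaway database first. This sketch assumes a temporary SQLite file instead of the PostgreSQL or MySQL server, and the two rows are sample data typed in for illustration:

```python
import os
import tempfile

import pandas as pd
from sqlalchemy import create_engine

# Throwaway SQLite file as a stand-in for the real database
db_path = os.path.join(tempfile.mkdtemp(), "demo.db")
engine = create_engine(f"sqlite:///{db_path}")

df = pd.DataFrame(
    {"company_name": ["Alfreds Futterkiste", "Around the Horn"]},
    index=pd.Index(["ALFKI", "AROUT"], name="customer_id"),
)

# First write creates the table; the index becomes a regular column
df.to_sql("customers_copy", con=engine, if_exists="fail", index=True)

# A second write with if_exists='fail' raises because the table now exists
try:
    df.to_sql("customers_copy", con=engine, if_exists="fail", index=True)
    second_write_failed = False
except ValueError:
    second_write_failed = True

# 'append' adds the rows to the existing table instead
df.to_sql("customers_copy", con=engine, if_exists="append", index=True)

roundtrip = pd.read_sql_table("customers_copy", con=engine, index_col="customer_id")
print(second_write_failed, len(roundtrip))
```

Because the index column is not a primary key, appending the same rows twice succeeds and simply duplicates them.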

## Create an entity diagram to understand the structure of the database
[SQLAlchemy_Schemadisplay](https://github.com/sqlalchemy/sqlalchemy/wiki/SchemaDisplay) allows you to quickly see the structure of a DB like this: ![example schema](https://raw.githubusercontent.com/wiki/sqlalchemy/sqlalchemy/UsageRecipes/SchemaDisplay/schema.png)

In [None]:
from sqlalchemy_schemadisplay import create_schema_graph

**Postgres only:** We need to do some cleanup, as SQLAlchemy did not recognize all DB types: `SAWarning: Did not recognize type 'bpchar' of column 'customer_id'`. Every column needs to have a type set.

In [None]:
# Postgres only
# SQLAlchemy has issues with the following columns when using Postgres. They all seem to be strings
offending = ['territory_description', 'region_description', 'customer_id', 'customer_type_id']

from sqlalchemy.types import VARCHAR

for table in meta.sorted_tables:
    for column in table.columns:
        if column.name in offending:
            print(f'{table.name}: {column.name}')
            column.type = VARCHAR(32)

Create the entity diagram. It will be saved as `db_entity_diagram.png`.

In [None]:
graph = create_schema_graph(
    metadata=meta,
    show_datatypes=True,
    show_indexes=True,
    rankdir='LR',  # From left to right (instead of top to bottom)
    concentrate=False,  # Don't try to join the relation lines together
)

graph.write_png('db_entity_diagram.png')