# Load Example Data Into the Exasol Database

In this notebook we load the "Air pressure system failures in Scania trucks" dataset into an Exasol database using Python and pyexasol. The Scania trucks dataset describes a predictive maintenance scenario:

> The dataset consists of data collected from heavy Scania trucks in everyday usage. The system in focus is the Air Pressure system (APS) which generates pressurized air that is utilized in various functions in a truck, such as braking and gear changes. The datasets' positive class consists of component failures for a specific component of the APS system. The negative class consists of trucks with failures for components not related to the APS. The data consists of a subset of all available data, selected by experts.

You can find further information [here](https://archive.ics.uci.edu/ml/datasets/IDA2016Challenge).

For this we need:

- Connection information for the running Exasol database we want to load the data into.
- The URL of the dataset we want to load (and knowledge of its structure).


First we enter the connection details for the Exasol database we want to load the dataset into.
Then we install pyexasol and import some dependencies.

In [None]:
EXASOL_HOST = "<database_host>" # change; for Exasol SaaS this can be a "connection string"
EXASOL_PORT = "8563" # change if needed
EXASOL_USER = "sys" # change if needed
EXASOL_PASSWORD = "<database_password>" # change; for Exasol SaaS this can be a personal access token
EXASOL_SCHEMA = "IDA"

In [None]:
!pip install pyexasol

import pyexasol
from io import BytesIO
from urllib.request import urlopen
import pandas as pd
from zipfile import ZipFile

Next we use pyexasol to open a connection to our Exasol database.

In [None]:
EXASOL_CONNECTION = "{host}:{port}".format(host=EXASOL_HOST, port=EXASOL_PORT)
exasol = pyexasol.connect(dsn=EXASOL_CONNECTION, user=EXASOL_USER, password=EXASOL_PASSWORD, compression=True)

## Download Example Data

Now we download the dataset and write it to a zip-file.

In [None]:
DATA_URL = "https://archive.ics.uci.edu/ml/machine-learning-databases/00414/to_uci.zip"

# Close both the HTTP response and the output file once the download is written
with urlopen(DATA_URL) as resp, open('to_uci.zip', 'wb') as f:
    f.write(resp.read())

print("data downloaded")

Then we read the contents of the downloaded zip file into the "train_set" and "test_set" variables, using pandas to load the training and test tables from the CSV files. Both files start with a 20-line header that we skip, and missing values are encoded as "na".

In [None]:
TRAINING_FILE = "to_uci/aps_failure_training_set.csv"
TEST_FILE = "to_uci/aps_failure_test_set.csv"

# Data is preceded with a 20-line header (copyright & license)
NUM_SKIP_ROWS = 20
NA_VALUE = "na"

with ZipFile('to_uci.zip') as z:
    with z.open(TRAINING_FILE, "r") as f:
        train_set = pd.read_csv(f, skiprows=NUM_SKIP_ROWS, na_values=NA_VALUE)
    with z.open(TEST_FILE, "r") as f:
        test_set = pd.read_csv(f, skiprows=NUM_SKIP_ROWS, na_values=NA_VALUE)
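Before settling on values for "skiprows" and "na_values", it can help to look at the first raw lines of a file inside the archive. The following is a minimal sketch of that check: `peek_lines` is a hypothetical helper (not part of pandas or pyexasol), and the small in-memory archive only stands in for the real "to_uci.zip".

```python
from io import BytesIO
from itertools import islice
from zipfile import ZipFile


def peek_lines(archive, member, count):
    """Return the first `count` lines of `member` inside a zip archive.

    `archive` may be a file path or a binary file-like object.
    Hypothetical helper, for illustration only.
    """
    with ZipFile(archive) as z:
        with z.open(member, "r") as raw:
            # Iterating the zip member yields one bytes object per line
            return [line.decode("utf-8").rstrip("\r\n") for line in islice(raw, count)]


# Stand-in archive: two header lines, then the column row and one data row
buffer = BytesIO()
with ZipFile(buffer, "w") as z:
    z.writestr(
        "demo/sample.csv",
        "license line 1\nlicense line 2\nclass,aa_000\nneg,na\n",
    )
buffer.seek(0)

preview = peek_lines(buffer, "demo/sample.csv", 3)
print(preview)
```

Against the real download, a call such as `peek_lines('to_uci.zip', TRAINING_FILE, NUM_SKIP_ROWS + 2)` should show the header lines followed by the column-name row and the first data row, assuming the file layout described above.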

## Import Example Data

In the last step we load the dataset into the Exasol database. First we use the pyexasol connection to create the schema named by the EXASOL_SCHEMA variable ("IDA" unless you changed it), if it does not exist yet.

In [None]:
exasol.execute(query="CREATE SCHEMA IF NOT EXISTS {schema!i}", query_params={"schema": EXASOL_SCHEMA})

Then we create the "TRAIN" and "TEST" tables in that schema, with column names and types that match the tables from the dataset. We take the column names from the pandas DataFrame we created in the previous step. The column types for the Scania trucks dataset are VARCHAR(3) for the first column ("class") and DECIMAL(18,2) for all other columns. We create "TRAIN" explicitly and then create "TEST" with the same structure, using the pyexasol connection we opened earlier.

In [None]:
# Define column names and types
column_names = list(train_set.columns)
column_types = ["VARCHAR(3)"] + ["DECIMAL(18,2)"] * (len(column_names) - 1)
column_desc = [" ".join(t) for t in zip(column_names, column_types)]

params = {"schema": EXASOL_SCHEMA, "column_names": column_names, "column_desc": column_desc}

# Create tables for data
exasol.execute(query="CREATE OR REPLACE TABLE {schema!i}.TRAIN(" + ", ".join(column_desc) + ")", query_params=params)
exasol.execute(query="CREATE OR REPLACE TABLE {schema!i}.TEST LIKE {schema!i}.TRAIN", query_params=params)
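To see what the generated statement looks like before sending it to the database, the column list can be assembled and printed on its own. This is a minimal sketch: `build_create_table` is a hypothetical helper that uses plain string interpolation without identifier quoting, so it is meant for inspection rather than as a replacement for the pyexasol query formatting used above. The three column names are a shortened stand-in for the real column list.

```python
def build_create_table(schema, table, column_names,
                       first_type="VARCHAR(3)", other_type="DECIMAL(18,2)"):
    """Build a CREATE OR REPLACE TABLE statement as a plain string.

    Hypothetical helper: the first column gets `first_type`,
    every remaining column gets `other_type`.
    """
    column_types = [first_type] + [other_type] * (len(column_names) - 1)
    columns = ", ".join(
        f"{name} {ctype}" for name, ctype in zip(column_names, column_types)
    )
    return f"CREATE OR REPLACE TABLE {schema}.{table}({columns})"


# Shortened, illustrative column list
ddl = build_create_table("IDA", "TRAIN", ["class", "aa_000", "ab_000"])
print(ddl)
```

In the notebook itself, `build_create_table(EXASOL_SCHEMA, "TRAIN", list(train_set.columns))` would print the full statement for review.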

Finally, we can use pyexasol's "import_from_pandas" functionality to import our pandas tables into our newly created Exasol tables using the pyexasol connection.

In [None]:
# Import data into Exasol
exasol.import_from_pandas(train_set, (EXASOL_SCHEMA, "TRAIN"))
print(f"Imported {exasol.last_statement().rowcount()} rows into TRAIN.")
exasol.import_from_pandas(test_set, (EXASOL_SCHEMA, "TEST"))
print(f"Imported {exasol.last_statement().rowcount()} rows into TEST.")

Now the Scania trucks dataset should be available in the Exasol database, in the schema named by EXASOL_SCHEMA, split into the "TRAIN" and "TEST" tables.