# Machine Learning Classification Example Using Exasol

## Introduction

This example project provides an end-to-end demonstration of how machine learning techniques can be used directly inside Exasol to enable and improve data-driven processes and decision making. We'll use real-world data provided by a heavy truck manufacturer (see [Problem and Data Description](#Problem-Description)) to predict if truck failures are related to a specific component or not. The data is publicly available in the [IDA 2016 Challenge dataset](https://archive.ics.uci.edu/ml/datasets/IDA2016Challenge) from the Industrial Challenge at the [15th International Symposium on Intelligent Data Analysis](http://ida2016.blogs.dsv.su.se/) (IDA) in 2016.

In the process, we demonstrate that there is no need to export data from Exasol to a different computer or server in order to examine or transform the data. Exporting the data is not necessary for training and testing machine learning models either. Everything can be done using user-defined functions (UDFs) directly inside Exasol, where the data is stored.

Because the focus of this example project is on how you can better use machine learning tools with Exasol, we will not discuss machine learning topics such as classifier selection and tuning in depth. Because there are many different machine learning methods, choosing a "good" one is highly dependent on the problem to be solved and the data available. Rather, we want to demonstrate how you can more effectively use <b>your models</b> with <b>your data</b> in Exasol.

This example project is broken down into the following sections:
1. [Problem and Data Description](#Problem-Description)
2. [Exasol Setup](#Exasol-Setup)
3. [Loading the Data into Exasol](#Loading-the-Data-into-Exasol)
4. [Examining the Data](#Examining-the-Data)
5. [Transforming the Data](#Transforming-the-Data)
6. [Building the Model](#Building-the-Model)
7. [Training the Model](#Training-the-Model)
8. [Testing the Model](#Testing-the-Model)
9. [Evaluating the Model](#Evaluating-the-Model)
10. [Deploying the Model](#Deploying-the-Model)
11. [Summary](#Summary)


### Prerequisites

This article assumes that you have a basic understanding of the following:
* Machine learning methods
* Exasol, in particular user-defined functions (UDFs)
* Python programming, including
  * [Scikit-learn](https://scikit-learn.org/stable/)
  * [Pandas](https://pandas.pydata.org/)
  * [NumPy](http://www.numpy.org/)

The following resources might help to understand these topics:
  * Python Machine Learning - Second Edition, Sebastian Raschka, Vahid Mirjalili, September 2017
  * Learning scikit-learn: Machine Learning in Python, Raúl Garreta, Guillermo Moncecchi, November 2013
  * [EXASOL Manual](https://www.exasol.com/portal/display/DOC/User+Manual+6.1.0)

### Technical Notes and Recommendations

* The code in this Jupyter Notebook was tested using Python 3.6 and 3.7.
* We used Exasol 6.0 for this example. If you prefer to use Exasol 6.1, please remember to select the corresponding script-languages flavor (see [Exasol Setup](#Exasol-Setup)).
* We recommend that your Exasol instance have at least 6GB RAM.
* The code below uses HTTP to access Exasol's BucketFS, as configured in [Exasol Setup](#Exasol-Setup). If you prefer to use HTTPS, please set `EXASOL_BUCKETFS_USE_HTTPS = True` there.

### Readability Tip

The UDF scripts below are defined in triple-quoted Python strings (e.g. `sql = textwrap.dedent(f"""...""")`), which prevent the Python syntax highlighting from working in the Jupyter Notebook. If you remove one of the quotation marks at the beginning of the string (i.e. `sql = textwrap.dedent(f""...""")`), the syntax highlighting will work, which can greatly improve the code's readability. However, please don't forget to add the quotation mark back before executing the cell.

## Problem Description

This example is based on the publicly available [IDA 2016 Challenge dataset](https://archive.ics.uci.edu/ml/datasets/IDA2016Challenge) from the Industrial Challenge at the [15th International Symposium on Intelligent Data Analysis](http://ida2016.blogs.dsv.su.se/) (IDA) in 2016.

The purpose of the challenge was to best predict which failures were related to a specific component of a truck's air pressure system (APS) as opposed to failures unrelated to the APS. Specifically, the following cost metric was given, which was to be minimized.

$cost_{total}=cost_{FP}\cdot{FP} + cost_{FN}\cdot{FN}$

where  
$FP$ is the number of false positives (predicted APS failure, but really isn't),  
$FN$ is the number of false negatives (predicted non-APS failure, but really is),  
$cost_{FP}=10$ is the cost of an unnecessary check by a mechanic, and  
$cost_{FN}=500$ is the cost of not checking a faulty truck and possibly causing a breakdown.

From the cost metric, we can see that an unnecessary preventative check is much cheaper (50x) than overlooking a faulty truck, which makes sense.
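
As a quick illustration of the metric, the following sketch computes the total cost from lists of true and predicted labels. The function name `ida_cost` and the 0/1 label encoding are assumptions made for this example; only the two cost constants come from the challenge description.

```python
COST_FP = 10   # cost of an unnecessary check (from the challenge description)
COST_FN = 500  # cost of missing a faulty truck (from the challenge description)

def ida_cost(y_true, y_pred):
    """Return the total cost; labels are assumed to be 0 (non-APS) or 1 (APS)."""
    pairs = list(zip(y_true, y_pred))
    num_fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    num_fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return COST_FP * num_fp + COST_FN * num_fn

y_true = [0, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]
print(ida_cost(y_true, y_pred))  # one FP and one FN: 10 + 500 = 510
```

A single missed APS failure costs as much as 50 unnecessary checks, so a classifier tuned for this metric will tend to accept many false positives in order to avoid false negatives.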

### Data Description

The dataset, provided by [Scania CV AB](https://www.scania.com), consists of real data from heavy Scania trucks during normal operation. The following is a brief description of the data. For details, please see the data description file provided with the data.

* Number of attributes: 171
* Training data:
    * Total instances: 60,000
    * Positive instances (APS failures): 1000 (1.7% of total)
* Test data:
    * Total instances: 16,000
    * Positive instances (APS failures): 375 (2.3% of total)

<div class="alert alert-info">
Please read the copyright and license information contained in the data files before proceeding.
</div>

## Exasol Setup

Here, we specify some basic information, which is used throughout this example. In particular, we specify the URL, user name, and password for the Exasol host(s) and EXABucket.

We also specify the scripting language to be used, which is the 'python3-ds-EXASOL-6.0.0' flavor (i.e. Python 3 with selected data science modules for Exasol 6.0), available in Exasol's [script-languages](https://github.com/exasol/script-languages) GitHub repository. Pre-packaged releases are available in the [release area](https://github.com/exasol/script-languages/releases) of the GitHub repository. The 'python3-ds-\*' flavors have the added benefit of integrated [Pandas](https://pandas.pydata.org/) DataFrame support for loading data from Exasol into a script (i.e. `ctx.get_dataframe()`) and emitting data from a script (i.e. `ctx.emit()`). If you use a newer version of the 'python3-ds-\*' flavors (since commit [480d79a](https://github.com/exasol/script-languages/commit/480d79acaf06df789a7a752b956ffcc7969ca596)), you need to change `EXASOL_UDF_CLIENT` from 'exaudfclient' to 'exaudfclient_py3', because from that commit on, the same container supports both Python 2 and Python 3 UDFs.

In [1]:
EXASOL_EXTERNAL_HOST_NAME = "MyCluster_11"
EXASOL_HOST_PORT = "8888"
EXASOL_EXTERNAL_HOST = f"""{EXASOL_EXTERNAL_HOST_NAME}:{EXASOL_HOST_PORT}"""
EXASOL_USER = "sys"
EXASOL_PASSWORD = "exasol"
EXASOL_BUCKETFS_PORT = "6583"
EXASOL_EXTERNAL_BUCKETFS_HOST = f"""{EXASOL_EXTERNAL_HOST_NAME}:{EXASOL_BUCKETFS_PORT}"""
EXASOL_BUCKETFS_USER = "w"
EXASOL_BUCKETFS_PASSWORD = "pw"
EXASOL_BUCKETFS_USE_HTTPS = False
EXASOL_BUCKETFS_URL_PREFIX = "https://" if EXASOL_BUCKETFS_USE_HTTPS else "http://"
EXASOL_BUCKETFS_SERVICE = "bfsdefault"
EXASOL_BUCKETFS_BUCKET = "default"
EXASOL_BUCKETFS_PATH = f"/buckets/{EXASOL_BUCKETFS_SERVICE}/{EXASOL_BUCKETFS_BUCKET}" # Filesystem-Path to the read-only mounted BucketFS inside the running UDF Container
EXASOL_UDF_FLAVOR = "python3-ds-EXASOL-6.0.0"
EXASOL_UDF_RELEASE = "20190116"
EXASOL_UDF_CLIENT = "exaudfclient" # or for newer versions of the flavor exaudfclient_py3
EXASOL_SCRIPT_LANGUAGES = f"PYTHON3_60=localzmq+protobuf:///{EXASOL_BUCKETFS_SERVICE}/{EXASOL_BUCKETFS_BUCKET}/{EXASOL_UDF_FLAVOR}?lang=python#buckets/{EXASOL_BUCKETFS_SERVICE}/{EXASOL_BUCKETFS_BUCKET}/{EXASOL_UDF_FLAVOR}/exaudf/{EXASOL_UDF_CLIENT}";
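
To make the structure of the `SCRIPT_LANGUAGES` value easier to see, the following sketch builds the same string from its parts and shows how it changes when the newer client name is used. The function `build_script_languages` is a hypothetical helper introduced only for this illustration; the notebook itself uses the single f-string shown above.

```python
def build_script_languages(alias, service, bucket, flavor, client):
    """Build a value for ALTER SESSION SET SCRIPT_LANGUAGES (illustrative helper)."""
    container = f"{service}/{bucket}/{flavor}"
    return (f"{alias}=localzmq+protobuf:///{container}?lang=python"
            f"#buckets/{container}/exaudf/{client}")

# Same values as in the setup cell above
older = build_script_languages("PYTHON3_60", "bfsdefault", "default",
                               "python3-ds-EXASOL-6.0.0", "exaudfclient")
# Only the client name differs for newer flavor versions
newer = build_script_languages("PYTHON3_60", "bfsdefault", "default",
                               "python3-ds-EXASOL-6.0.0", "exaudfclient_py3")
print(older)
print(newer)
```

The part before `#` identifies the container in BucketFS, and the part after `#` is the path to the UDF client executable inside the container.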

First, we need to install the required Python modules, such as pyexasol.

In [2]:
!pip install pyexasol stopwatch.py requests



Furthermore, we need to upload the script language container to the database.

In [3]:
download_command=f"""curl -L -o {EXASOL_UDF_FLAVOR}.tar.gz  https://github.com/exasol/script-languages/releases/download/{EXASOL_UDF_RELEASE}/{EXASOL_UDF_FLAVOR}-{EXASOL_UDF_RELEASE}.tar.gz"""
print("Download: %s"%download_command)
! {download_command}
upload_command=f"""curl {EXASOL_BUCKETFS_URL_PREFIX}{EXASOL_BUCKETFS_USER}:{EXASOL_BUCKETFS_PASSWORD}@{EXASOL_EXTERNAL_BUCKETFS_HOST}/{EXASOL_BUCKETFS_BUCKET}/{EXASOL_UDF_FLAVOR}.tar.gz --upload-file {EXASOL_UDF_FLAVOR}.tar.gz"""
print("Upload: %s"%upload_command)
! {upload_command}
check_command=f"""curl {EXASOL_BUCKETFS_URL_PREFIX}{EXASOL_BUCKETFS_USER}:{EXASOL_BUCKETFS_PASSWORD}@{EXASOL_EXTERNAL_BUCKETFS_HOST}/{EXASOL_BUCKETFS_BUCKET}/ | grep {EXASOL_UDF_FLAVOR}.tar.gz"""
print("Check Upload: %s"%check_command)
! {check_command}
print("Finished upload")

Download: curl -L -o python3-ds-EXASOL-6.0.0.tar.gz  https://github.com/exasol/script-languages/releases/download/20190116/python3-ds-EXASOL-6.0.0-20190116.tar.gz
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   629    0   629    0     0   1230      0 --:--:-- --:--:-- --:--:--  1233
100  743M  100  743M    0     0  11.0M      0  0:01:07  0:01:07 --:--:-- 8936k
Upload: curl http://w:pw@MyCluster_11:6583/default/python3-ds-EXASOL-6.0.0.tar.gz --upload-file python3-ds-EXASOL-6.0.0.tar.gz
Check Upload: curl http://w:pw@MyCluster_11:6583/default/ | grep python3-ds-EXASOL-6.0.0.tar.gz
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   493  100   493    0     0   481k      0 --:--:-- --:--:-- --:--:--  481k
python3-ds-EXASOL-6.0.0.tar.gz
Finished upload


For this project, we create an Exasol schema named `IDA`, in which everything will be stored. For this step and throughout the rest of this example project, we use the very convenient [pyexasol](https://github.com/badoo/pyexasol) module, which encapsulates the communication functionality between Python and Exasol.

In [4]:
import pyexasol

# Create Exasol connection
conn = pyexasol.connect(dsn=EXASOL_EXTERNAL_HOST, user=EXASOL_USER, password=EXASOL_PASSWORD, compression=True)

# Create schema
conn.execute("CREATE SCHEMA IF NOT EXISTS IDA")
conn.execute("OPEN SCHEMA IDA")

# Close Exasol connection
conn.close()

### Create an EXABucket Helper Script

Before we proceed, we'll create a small helper script in Exasol to define a function, `upload_object_to_bucketfs()`, which simply uploads a Python object to the specified EXABucket so that the object can be loaded and used by Exasol UDFs later. Specifically, we will use this function to save our transformation pipeline and classifier model, which we create below.

In [5]:
import textwrap
import pyexasol

# Create Exasol connection
conn = pyexasol.connect(dsn=EXASOL_EXTERNAL_HOST, user=EXASOL_USER, password=EXASOL_PASSWORD, compression=True)
conn.execute(f"ALTER SESSION SET SCRIPT_LANGUAGES='{EXASOL_SCRIPT_LANGUAGES}'")

# Create script to upload files to an EXABucket
sql = textwrap.dedent(f"""\
CREATE OR REPLACE PYTHON3_60 SET SCRIPT IDA.EXABUCKET_HELPER(...)
RETURNS INT AS

import os
import pycurl
import uuid

from sklearn.externals import joblib

# Upload object to EXABucket
def upload_object_to_bucketfs(object, host, path, user, pw, secure=True):
    temp_file = "/tmp/" + str(uuid.uuid4().hex + ".pkl")
    joblib.dump(object, temp_file, compress=True)
    protocol = 'https' if secure else 'http'

    with open(temp_file, "rb") as f:
        url = protocol + "://" + user + ":" + pw + "@" + host + path
        curl = pycurl.Curl()
        curl.setopt(pycurl.URL, url)
        curl.setopt(pycurl.SSL_VERIFYPEER, 0)   
        curl.setopt(pycurl.SSL_VERIFYHOST, 0)
        curl.setopt(curl.UPLOAD, 1)
        curl.setopt(curl.READDATA, f)
        curl.perform()
        curl.close()

    try:
        os.remove(temp_file)
    except OSError:
        pass
/
""")

conn.execute(sql)

# Close Exasol connection
conn.close()

print("EXABucket Helper Script created")

EXABucket Helper Script created
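
The upload URL assembled inside `upload_object_to_bucketfs()` can be checked outside the database with a small sketch like the one below. The function name `bucketfs_url` is hypothetical, and percent-encoding the credentials with `urllib.parse.quote` is an addition that the helper script above does not perform; it only matters if the user name or password contains characters such as `@` or `:`.

```python
from urllib.parse import quote

def bucketfs_url(host, path, user, pw, secure=True):
    """Assemble a BucketFS upload URL in the same way as the helper script."""
    protocol = "https" if secure else "http"
    # Percent-encode the credentials (an assumption, not part of the helper script)
    credentials = f"{quote(user, safe='')}:{quote(pw, safe='')}"
    return f"{protocol}://{credentials}@{host}{path}"

print(bucketfs_url("localhost:6583", "/default/classifier.pkl", "w", "pw", secure=False))
```

With the values from the setup cell, this prints `http://w:pw@localhost:6583/default/classifier.pkl`, which is the kind of URL the UDFs below pass to the helper.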


## Loading the Data into Exasol

To begin, we download the [IDA 2016 Challenge dataset](https://archive.ics.uci.edu/ml/datasets/IDA2016Challenge) (20MB) from the [UC Irvine Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php) and import it into local training and test DataFrames. Because the ZIP file contains a data description file in addition to both the training and test data, which must be kept separate, we cannot import the ZIP file directly into Exasol using Exasol's IMPORT statement. First, we download the ZIP file to our local filesystem.

In [6]:
from io import BytesIO
from urllib.request import urlopen
from zipfile import ZipFile
import pandas as pd

from stopwatch import Stopwatch
stopwatch = Stopwatch()

DATA_URL = "https://archive.ics.uci.edu/ml/machine-learning-databases/00414/to_uci.zip"
TRAINING_FILE = "to_uci/aps_failure_training_set.csv"
TEST_FILE = "to_uci/aps_failure_test_set.csv"

# Data is preceded by a 20-line header (copyright & license)
NUM_SKIP_ROWS = 20
NA_VALUE = "na"

resp = urlopen(DATA_URL)
with open('to_uci.zip', 'wb') as f:  
    f.write(resp.read())
    
print("Downloading the data took: %s"%str(stopwatch))

Downloading the data took: 10.45s


Afterwards, we read the training and test files from the ZIP file into DataFrames.

In [7]:
from stopwatch import Stopwatch
stopwatch = Stopwatch()

with ZipFile('to_uci.zip') as z:
    with z.open(TRAINING_FILE, "r") as f:
        train_set = pd.read_csv(f, skiprows=NUM_SKIP_ROWS, na_values=NA_VALUE)
    with z.open(TEST_FILE, "r") as f:
        test_set = pd.read_csv(f, skiprows=NUM_SKIP_ROWS, na_values=NA_VALUE)
        
print("Reading the data took: %s"%str(stopwatch))

Reading the data took: 1.70s


By having a quick look at the data and/or reading the provided data description file, we can see that the first data column is the class label ('`neg`'/'`pos`') and can be stored in a `VARCHAR(3)` column. The other data columns are all numerical features which can be stored in `DECIMAL(18, 2)` columns. With this information we can now define our column names and types.

In [8]:
# Define column names and types
column_names = list(train_set.columns)
column_types = ["VARCHAR(3)"] + ["DECIMAL(18,2)"] * (len(column_names) - 1)
column_desc = [" ".join(t) for t in zip(column_names, column_types)]
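
To see what these lists look like, the sketch below repeats the same steps for a shortened, three-column stand-in and assembles the resulting `CREATE TABLE` statement. The `demo_` variable names are assumptions used to avoid overwriting the notebook's real variables.

```python
# Shortened stand-in for the 171 column names of the dataset
demo_names = ["class", "aa_000", "ab_000"]
demo_types = ["VARCHAR(3)"] + ["DECIMAL(18,2)"] * (len(demo_names) - 1)

# Pair each name with its type, e.g. "aa_000 DECIMAL(18,2)"
demo_desc = [" ".join(t) for t in zip(demo_names, demo_types)]

demo_ddl = "CREATE OR REPLACE TABLE IDA.TRAIN(" + ", ".join(demo_desc) + ")"
print(demo_ddl)
```

This prints `CREATE OR REPLACE TABLE IDA.TRAIN(class VARCHAR(3), aa_000 DECIMAL(18,2), ab_000 DECIMAL(18,2))`.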

Now, we load the training and test data from the local DataFrames into two tables named `TRAIN` and `TEST`, respectively. To do so, we first create the tables using the column names and types defined above, and then import the data.

In [9]:
import textwrap
import pyexasol
from stopwatch import Stopwatch
stopwatch = Stopwatch()

# Create Exasol connection
conn = pyexasol.connect(dsn=EXASOL_EXTERNAL_HOST, user=EXASOL_USER, password=EXASOL_PASSWORD, compression=True)

# Create schema
conn.execute("CREATE SCHEMA IF NOT EXISTS IDA")
conn.execute("OPEN SCHEMA IDA")

# Create tables for data
conn.execute("CREATE OR REPLACE TABLE IDA.TRAIN(" + ", ".join(column_desc) + ")")
conn.execute("CREATE OR REPLACE TABLE IDA.TEST LIKE IDA.TRAIN")

# Import data into Exasol
conn.import_from_pandas(train_set, "TRAIN")
print(f"Imported {conn.last_statement().rowcount()} rows into IDA.TRAIN.")
conn.import_from_pandas(test_set, "TEST")
print(f"Imported {conn.last_statement().rowcount()} rows into IDA.TEST.")

# Close Exasol connection
conn.close()

print("Importing the data took: %s"%str(stopwatch))

Imported 60000 rows into IDA.TRAIN.
Imported 16000 rows into IDA.TEST.
Importing the data took: 7.01s


## Examining the Data

After loading the data into Exasol, we may first want to get a feel for the data before creating a classifier. There are many different ways to do so, such as visualizing the data, viewing basic statistical information, examining feature correlation, etc.

### Examine the Data Statistics

We will examine the training data's basic statistical information using `pandas.DataFrame.describe()` and `pandas.DataFrame.var()`. The combined results are only shown for the first five columns in order to limit the output for this example, but you can, of course, easily remove this limitation in the `print()` function below to view the statistical information for all columns.

In [11]:
import textwrap
import pyexasol
from stopwatch import Stopwatch
stopwatch = Stopwatch()

# Create Exasol connection
conn = pyexasol.connect(dsn=EXASOL_EXTERNAL_HOST, user=EXASOL_USER, password=EXASOL_PASSWORD, compression=True)
conn.execute(f"ALTER SESSION SET SCRIPT_LANGUAGES='{EXASOL_SCRIPT_LANGUAGES}'")

# Create script output column descriptions
# Numeric data
out_column_types = ["DOUBLE"] * len(column_names)
out_column_desc = [" ".join(t) for t in zip(column_names, out_column_types)]

# Create script to run pandas.DataFrame.describe() and pandas.DataFrame.var() in Exasol
sql = textwrap.dedent(f"""\
CREATE OR REPLACE PYTHON3_60 SET SCRIPT IDA.DF_DESCRIBE({", ".join(column_desc)})
EMITS ({", ".join(out_column_desc)}) AS

import pandas as pd

def get_stats(X):
    # Replace 'neg'/'pos' with 0/1
    X.loc[:, 'class'] = X.loc[:, 'class'].replace({{'neg': 0, 'pos': 1}})
    # Convert all columns to numeric data types
    X = X.apply(pd.to_numeric)

    # Get DataFrame stats
    X_describe = X.describe()

    # Get DataFrame variance
    X_var = X.var()
    X_var.name = 'var'

    # Append variance to stats
    return X_describe.append(X_var)

def run(ctx):
    # Create DataFrame using all columns
    df = ctx.get_dataframe(num_rows='all', start_col=0)

    # Calculate statistics info
    df = get_stats(df)

    # Output data description
    ctx.emit(df)
/
""")

conn.execute(sql)

# Create table "TRAIN_DESCRIPTION" to hold the description output
sql = textwrap.dedent(f"""\
CREATE OR REPLACE TABLE IDA.TRAIN_DESCRIPTION AS
    SELECT IDA.DF_DESCRIBE({", ".join(column_names)}) FROM IDA.TRAIN
""")

conn.execute(sql)

# Create local data frame from the "TRAIN_DESCRIPTION" table
train_desc = conn.export_to_pandas("SELECT * FROM IDA.TRAIN_DESCRIPTION")
train_desc.index = ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max', 'var']

# Print first 5 columns, for example
print(train_desc.iloc[:, 0:5])

# Close Exasol connection
conn.close()

print("Creating statistics for the data took: %s"%str(stopwatch))

              CLASS        AA_000        AB_000        AC_000        AD_000
count  60000.000000  6.000000e+04  13671.000000  5.666500e+04  4.513900e+04
mean       0.016667  5.933650e+04      0.713189  3.560143e+08  1.906206e+05
std        0.128020  1.454301e+05      3.478962  7.948749e+08  4.040441e+07
min        0.000000  0.000000e+00      0.000000  0.000000e+00  0.000000e+00
25%        0.000000  8.340000e+02      0.000000  1.600000e+01  2.400000e+01
50%        0.000000  3.077600e+04      0.000000  1.520000e+02  1.260000e+02
75%        0.000000  4.866800e+04      0.000000  9.640000e+02  4.300000e+02
max        1.000000  2.746564e+06    204.000000  2.130707e+09  8.584298e+09
var        0.016389  2.114990e+10     12.103176  6.318261e+17  1.632516e+15
Creating statistics for the data took: 39.41s
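
The `CLASS` column of this output can be cross-checked by hand, because a 0/1 column with 1,000 positives among 60,000 rows has a known mean and variance. The sketch below assumes that `describe()` and `var()` report sample statistics (dividing by n - 1), which is the pandas default.

```python
import math

n = 60000         # training instances
positives = 1000  # APS failures in the training data

mean = positives / n
# Sample variance of a 0/1 column: n / (n - 1) * p * (1 - p)
var = n / (n - 1) * mean * (1 - mean)
std = math.sqrt(var)

print(f"mean={mean:.6f} std={std:.6f} var={var:.6f}")
```

The three values agree with the `mean`, `std`, and `var` rows of the `CLASS` column above, which also confirms that the `'neg'`/`'pos'` labels were mapped to 0/1 as intended.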


## Transforming the Data

Looking at the statistics summary from the previous step, we can see, for example, that some features have missing values and that the means and variances of the features differ greatly. Because of this, it's most likely a good idea to transform, clean and normalize the data.

### Create and Run the Transformation Pipeline

Depending on which classifier one plans to use, among other things, there are many different techniques one may use to transform the data, such as feature scaling and extraction.

In this example, we first use imputation to replace missing values with the median value of that feature. Then, we standardize the data such that each feature has zero mean and unit variance. Note that standardization shifts and rescales a feature but does not change the shape of its distribution. These are very simple transformations that work fairly well with many learning algorithms.

In the script below, the transformation pipeline is fitted to the training data and used to transform it. Then, the transformer object is saved to the specified EXABucket for future use. Finally, the transformed training data is emitted and stored in the table `TRAIN_TRANSFORMED`.

The same script is then called again with the test data. Since the test data should be transformed exactly as the training data, the transformer which was previously saved is simply loaded from the EXABucket and used to transform the test data, which is then emitted and stored in the table `TEST_TRANSFORMED`.

In [12]:
import textwrap
import pyexasol
from stopwatch import Stopwatch
stopwatch = Stopwatch()

# Create Exasol connection
conn = pyexasol.connect(dsn=EXASOL_EXTERNAL_HOST, user=EXASOL_USER, password=EXASOL_PASSWORD, compression=True)
conn.execute(f"ALTER SESSION SET SCRIPT_LANGUAGES='{EXASOL_SCRIPT_LANGUAGES}'")

# Create script output column descriptions
# One class label, numeric data
out_column_types = ["INT"] + ["DOUBLE"] * (len(column_names) - 1)
out_column_desc = [" ".join(t) for t in zip(column_names, out_column_types)]

# File to store the transformer
transformer_file = "transform_pipeline.pkl"

# Create script to transform the data
sql = textwrap.dedent(f"""\
CREATE OR REPLACE PYTHON3_60 SET SCRIPT
IDA.DF_TRANSFORM(fit_transformer BOOL, transformer_path VARCHAR(200), {", ".join(column_desc)})
EMITS ({", ".join(out_column_desc)}) AS

import pandas as pd

from sklearn.externals import joblib
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Imputer
from sklearn.preprocessing import StandardScaler

# Import helper script
exabucket_helper = exa.import_script('IDA.EXABUCKET_HELPER')

# Transform DataFrame
def transform_dataframe(X, class_col_name, fit_transformer=False, save_path=None):
    y = X.loc[:, class_col_name]
    X_data = X.loc[:, X.columns != class_col_name]

    # Replace 'neg'/'pos' with 0/1
    y = y.replace({{'neg': 0, 'pos': 1}})

    # Convert columns to numeric data types
    X_data = X_data.apply(pd.to_numeric)

    if fit_transformer:
        # Fit transformer and transform data
        transformer = Pipeline ([
            ('imputer', Imputer(strategy="median")),
            ('scaler', StandardScaler())
        ])
        X_data_transformed = transformer.fit_transform(X_data)
        if save_path:
            # Save transformer
            exabucket_helper.upload_object_to_bucketfs(transformer,
                                                        'localhost:{EXASOL_BUCKETFS_PORT}',
                                                        save_path,
                                                        '{EXASOL_BUCKETFS_USER}',
                                                        '{EXASOL_BUCKETFS_PASSWORD}',
                                                        {EXASOL_BUCKETFS_USE_HTTPS})
    else:
        # Load transformer and transform data
        transformer = joblib.load(save_path)
        X_data_transformed = transformer.transform(X_data)

    # Create transformed DataFrame with column names
    y_df = pd.DataFrame(y, columns=[class_col_name])
    X_data_df = pd.DataFrame(X_data_transformed, columns=X.columns[X.columns != class_col_name])
    return y_df.join(X_data_df)

def run(ctx):
    # Non-data Input arguments
    num_non_data_cols = 2
    fit_transformer = ctx.fit_transformer
    transformer_path = ctx.transformer_path

    df = ctx.get_dataframe(num_rows='all', start_col=num_non_data_cols)

    # Transform feature data
    df = transform_dataframe(df,
                                class_col_name='class',
                                fit_transformer=fit_transformer,
                                save_path=transformer_path)

    # Output data
    ctx.emit(df)
/
""")

conn.execute(sql)

# Transform training data
sql = textwrap.dedent(f"""\
CREATE OR REPLACE TABLE IDA.TRAIN_TRANSFORMED AS
    SELECT IDA.DF_TRANSFORM(TRUE, '/{EXASOL_BUCKETFS_BUCKET}/{transformer_file}', {", ".join(column_names)})
    FROM IDA.TRAIN
""")

conn.execute(sql)

# Transform test data
sql = textwrap.dedent(f"""\
CREATE OR REPLACE TABLE IDA.TEST_TRANSFORMED AS
    SELECT IDA.DF_TRANSFORM(FALSE,
                            '{EXASOL_BUCKETFS_PATH}/{transformer_file}',
                            {", ".join(column_names)})
    FROM IDA.TEST
""")

conn.execute(sql)

# Close Exasol connection
conn.close()

print("Creating and running the transformation pipeline for the data took: %s"%str(stopwatch))

Creating and running the transformation pipeline for the data took: 85.47s
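
The fit-on-training, reuse-on-test pattern used above can be sketched for a single feature column in plain Python. This is only a stand-in for scikit-learn's imputer and `StandardScaler`, not the pipeline itself; the function names `fit` and `transform`, the use of `None` for missing values, and the toy data are assumptions for this illustration.

```python
from statistics import mean, median, pstdev

def fit(column):
    """Learn the fill value, mean, and standard deviation from a training column."""
    observed = [v for v in column if v is not None]
    fill = median(observed)
    imputed = [fill if v is None else v for v in column]
    # Population standard deviation, as used by StandardScaler
    return {"fill": fill, "mean": mean(imputed), "std": pstdev(imputed)}

def transform(column, params):
    """Apply previously learned parameters (assumes a non-constant training column)."""
    imputed = [params["fill"] if v is None else v for v in column]
    return [(v - params["mean"]) / params["std"] for v in imputed]

train_col = [1.0, None, 3.0, 5.0]
test_col = [None, 7.0]

params = fit(train_col)                # learned from the training data only
print(transform(train_col, params))
print(transform(test_col, params))     # reuses the training median, mean, and std
```

The missing test value is filled with the training median (3.0) and therefore maps to 0.0. Refitting on the test data would instead leak information from the test set into the transformation.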


## Building the Model and Running Grid Search to Find Good Hyperparameters

Now that we've transformed the data, we'll build a model which will be used to predict if each instance is an APS failure or not.

For this example, we'll use a classifier based on the Extra-Trees (extremely randomized trees) algorithm. This tree-based ensemble method is similar to a random forest, except that the tree splitting is randomized, among other things. This can improve the accuracy as well as the computation time. Details can be found [here](https://orbi.uliege.be/handle/2268/9357).

As with most machine learning algorithms, there are multiple parameters which need to be tuned in order to optimize the performance of the classifier for our problem and data. Rather than try many combinations by hand, we'll use grid search and 5-fold cross validation on the training data to find the optimal parameters in the specified subset of parameters.

Because searching a large grid can be computationally intensive, a good set of parameter values has already been found using grid search offline. Thus, only a small, coarse subspace of the search grid is used in the example code below so that executing the script will not take too long.

The parameter values found offline are the following:

|Parameter|Value|
| :--- | ---: |
|n_estimators|61|
|max_depth|10|
|class_weight|{0: 1, 1: 89}|

After the optimal parameter values have been found using grid search, an `ExtraTreesClassifier` model is created using the parameter values and then saved to an EXABucket for use in the next step&mdash;training the model.

In [13]:
import textwrap
import pyexasol
from stopwatch import Stopwatch
stopwatch = Stopwatch()

# Create Exasol connection
conn = pyexasol.connect(dsn=EXASOL_EXTERNAL_HOST, user=EXASOL_USER, password=EXASOL_PASSWORD, compression=True)
conn.execute(f"ALTER SESSION SET SCRIPT_LANGUAGES='{EXASOL_SCRIPT_LANGUAGES}'")

# File to store the classifier
classifier_file = "classifier.pkl"

# Script input column descriptions are now the same as output
# One class label, numeric data
column_types = ["INT"] + ["DOUBLE"] * (len(column_names) - 1)
column_desc = [" ".join(t) for t in zip(column_names, column_types)]

# Create script to build the model
sql = textwrap.dedent(f"""\
CREATE OR REPLACE PYTHON3_60 SET SCRIPT
IDA.BUILD_MODEL(classifier_path VARCHAR(200), {", ".join(column_desc)})
EMITS (n_estimators int, max_depth int, class_weight VARCHAR(200)) AS

import pandas as pd

from sklearn.ensemble import ExtraTreesClassifier
from sklearn.externals import joblib
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.utils import resample

# Import helper script
exabucket_helper = exa.import_script('IDA.EXABUCKET_HELPER')

# Random state to use for reproducibility
RAND_STATE = 3

# Build extra-tree classifier
def build_et_classifier(X, class_col_name, model_path=None):
    # Convert columns to numeric data types
    X = X.apply(pd.to_numeric)

    y = X.loc[:, class_col_name]
    X_data = X.loc[:, X.columns != class_col_name]

    # Create classifier
    clf = ExtraTreesClassifier(random_state=RAND_STATE, n_jobs=-1)

    # Specify parameter search grid
    # The grid size is kept small to reduce the computation time
    # Good values (known from offline grid search) are:
    # 'n_estimators': 61
    # 'max_depth': 10
    # 'class_weight': {{0: 1, 1: 89}}
    param_grid = [{{
        'n_estimators': [30, 61],
        'max_depth': [5, 10],
        'class_weight': [{{0: 1, 1: 89}}]
    }}]

    # Define scoring metric for grid search from problem description
    def ida_score(y, y_pred):
        false_preds = y - y_pred
        num_false_pos = (false_preds < 0).sum()
        num_false_neg = (false_preds > 0).sum()
        return -(num_false_pos * 10 + num_false_neg * 500)

    ida_scorer = make_scorer(ida_score)

    # Search for optimal values in grid using 5-fold cross validation
    grid_search = GridSearchCV(clf, param_grid, cv=5, scoring=ida_scorer, n_jobs=-1)
    grid_search.fit(X_data, y.values.ravel())

    # Create new model with optimal parameter values
    clf = ExtraTreesClassifier(random_state=RAND_STATE, n_jobs=-1,
                                n_estimators=grid_search.best_params_['n_estimators'],
                                max_depth=grid_search.best_params_['max_depth'], 
                                class_weight=grid_search.best_params_['class_weight'])

    # Save classifier to EXABucket
    if model_path:
        exabucket_helper.upload_object_to_bucketfs(clf,
                                                    'localhost:{EXASOL_BUCKETFS_PORT}',
                                                    model_path,
                                                    '{EXASOL_BUCKETFS_USER}',
                                                    '{EXASOL_BUCKETFS_PASSWORD}',
                                                    {EXASOL_BUCKETFS_USE_HTTPS})
    return grid_search

def run(ctx):
    # Input argument
    num_non_data_cols = 1
    classifier_path = ctx.classifier_path

    df = ctx.get_dataframe(num_rows='all', start_col=num_non_data_cols)

    # Shuffle the data and draw a 30,000-row subsample (without replacement)
    train_set = resample(df, n_samples=30000, replace=False, random_state=RAND_STATE)

    # Build extra-tree classifier
    grid_search=build_et_classifier(train_set, class_col_name='class', model_path=classifier_path)
    ctx.emit(grid_search.best_params_['n_estimators'],
              grid_search.best_params_['max_depth'],
              str(grid_search.best_params_['class_weight']))
/
""")

conn.execute(sql)

# Build model
sql = textwrap.dedent(f"""\
SELECT IDA.BUILD_MODEL('/{EXASOL_BUCKETFS_BUCKET}/{classifier_file}', {", ".join(column_names)})
FROM IDA.TRAIN_TRANSFORMED
""")

print("Gridsearch result:",conn.execute(sql).fetchall())

# Close Exasol connection
conn.close()

print("Building the model and Grid search to find good hyper parameters took: %s"%str(stopwatch))

Gridsearch result: [(61, 10, '{0: 1, 1: 89}')]
Building the model and Grid search to find good hyper parameters took: 26.28s
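
To give a feel for the size of the search, the sketch below enumerates the parameter combinations in the small grid used above and counts the resulting cross-validation fits. It only mimics how a grid is expanded and does not train anything; the names `demo_grid` and `CV_FOLDS` are assumptions for this illustration.

```python
from itertools import product

# Same grid as in the BUILD_MODEL script above
demo_grid = {
    "n_estimators": [30, 61],
    "max_depth": [5, 10],
    "class_weight": [{0: 1, 1: 89}],
}
CV_FOLDS = 5

# Expand the grid into every combination of parameter values
names = list(demo_grid)
candidates = [dict(zip(names, values)) for values in product(*demo_grid.values())]

for candidate in candidates:
    print(candidate)
print(len(candidates), "candidates,", len(candidates) * CV_FOLDS, "cross-validation fits")
```

The number of candidates is the product of the list lengths (2 x 2 x 1 = 4), so each additional value in the grid multiplies the work. This is why the example keeps the grid coarse.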


## Training the Model

After transforming the data and creating the classifier, we'll now train it on all the transformed training data. Then, the model will be stored in the provided EXABucket for later use during testing.

In [14]:
import textwrap
import pyexasol
from stopwatch import Stopwatch
stopwatch = Stopwatch()

# Create Exasol connection
conn = pyexasol.connect(dsn=EXASOL_EXTERNAL_HOST, user=EXASOL_USER, password=EXASOL_PASSWORD, compression=True)
conn.execute(f"ALTER SESSION SET SCRIPT_LANGUAGES='{EXASOL_SCRIPT_LANGUAGES}'")

# Create script to train the model
sql = textwrap.dedent(f"""\
CREATE OR REPLACE PYTHON3_60 SET SCRIPT
IDA.TRAIN_MODEL(classifier_load_path VARCHAR(200), classifier_save_path VARCHAR(200), {", ".join(column_desc)})
EMITS (dummy int) AS

import pandas as pd

from sklearn.externals import joblib
from sklearn.utils import resample

# Import helper script
exabucket_helper = exa.import_script('IDA.EXABUCKET_HELPER')

# Random state to use for reproducibility
RAND_STATE = 3

# Train classifier
def train(X, class_col_name, model_load_path, model_save_path):
    # Convert columns to numeric data types
    X = X.apply(pd.to_numeric)

    y = X.loc[:, class_col_name]
    X_data = X.loc[:, X.columns != class_col_name]

    # Load model from EXABucket
    clf = joblib.load(model_load_path)
    clf.fit(X_data, y.values.ravel())

    # Save classifier to EXABucket
    if model_save_path:
        exabucket_helper.upload_object_to_bucketfs(clf,
                                                    'localhost:{EXASOL_BUCKETFS_PORT}',
                                                    model_save_path,
                                                    '{EXASOL_BUCKETFS_USER}',
                                                    '{EXASOL_BUCKETFS_PASSWORD}',
                                                    {EXASOL_BUCKETFS_USE_HTTPS})

def run(ctx):
    # Input arguments
    num_non_data_cols = 2
    classifier_load_path = ctx.classifier_load_path
    classifier_save_path = ctx.classifier_save_path

    df = ctx.get_dataframe(num_rows='all', start_col=num_non_data_cols)

    # Shuffle data
    train_set = resample(df, replace=False, random_state=RAND_STATE)

    # Train the classifier
    train(train_set,
            class_col_name='class',
            model_load_path=classifier_load_path,
            model_save_path=classifier_save_path)
/
""")

conn.execute(sql)

# Train model
sql = textwrap.dedent(f"""\
SELECT IDA.TRAIN_MODEL('{EXASOL_BUCKETFS_PATH}/{classifier_file}',
                        '/{EXASOL_BUCKETFS_BUCKET}/{classifier_file}',
                        {", ".join(column_names)})
FROM IDA.TRAIN_TRANSFORMED
""")

conn.execute(sql)

# Close Exasol connection
conn.close()

print("Training the model took: %s"%str(stopwatch))

Training the model took: 17.69s


## Testing the Model

After training the classifier, we'll now test it using the transformed test data.

The model that was saved after training will now be loaded from the EXABucket and used to predict the classes of the test data (i.e. whether a failure is an APS failure or not). The emitted results, which are stored in the table `TEST_PREDICTIONS`, are the predicted classes (first column) joined to the transformed test data. By joining the predicted class labels to the test data, we ensure that the predicted and real class labels remain properly ordered/linked for evaluation.
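The row alignment described above can be illustrated with a small pure-Python sketch. The rows, the `toy_predict` rule, and the column layout are made up for illustration; the UDF below relies on a pandas `join` over the DataFrame index rather than on `zip`.

```python
# Toy test rows in the layout (class, feature_a, feature_b).
# All values are illustrative and not taken from the IDA dataset.
test_rows = [
    (0, 0.1, 2.0),
    (1, 0.9, 7.5),
    (0, 0.2, 1.5),
]

def toy_predict(features):
    """Hypothetical stand-in for clf.predict(): one label per row, in input order."""
    return [1 if feature_a > 0.5 else 0 for feature_a, _ in features]

# Drop the class column before predicting, as the UDF does.
features = [row[1:] for row in test_rows]
y_pred = toy_predict(features)

# Prepend each prediction to the row it was computed from.
rows_with_pred = [(pred,) + row for pred, row in zip(y_pred, test_rows)]

for row in rows_with_pred:
    print(row)
```

Because the predicted and the actual label end up in the same row, a later query for both columns returns matched pairs regardless of the order in which the rows are read.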

In [15]:
import textwrap
import pyexasol
from stopwatch import Stopwatch
stopwatch = Stopwatch()

# Create Exasol connection
conn = pyexasol.connect(dsn=EXASOL_EXTERNAL_HOST, user=EXASOL_USER, password=EXASOL_PASSWORD, compression=True)
conn.execute(f"ALTER SESSION SET SCRIPT_LANGUAGES='{EXASOL_SCRIPT_LANGUAGES}'")

# Create script output column descriptions
# Two class labels, numeric data
out_column_types = ["INT"] * 2 + ["DOUBLE"] * (len(column_names) - 1)
out_column_desc = [" ".join(t) for t in zip(["class_pred"] + column_names, out_column_types)]

# Create script to test the model
sql = textwrap.dedent(f"""\
CREATE OR REPLACE PYTHON3_60 SET SCRIPT
IDA.TEST_MODEL(classifier_path VARCHAR(200), {", ".join(column_desc)})
EMITS ({", ".join(out_column_desc)}) AS

import pandas as pd

from sklearn.externals import joblib

# Test classifier
def test(X, class_col_name, model_path=None):
    # Convert columns to numeric data types
    X = X.apply(pd.to_numeric)

    X_data = X.loc[:, X.columns != class_col_name]

    # Load model from EXABucket
    clf = joblib.load(model_path)

    # Predict classes of test data
    return clf.predict(X_data)

def run(ctx):
    # Input argument
    num_non_data_cols = 1
    classifier_path = ctx.classifier_path

    df = ctx.get_dataframe(num_rows='all', start_col=num_non_data_cols)

    # Test the classifier
    y_pred = test(df, class_col_name='class', model_path=classifier_path)

    # Add class predictions as first column of test DataFrame
    df_pred = (pd.DataFrame(y_pred, columns=['class_pred'])).join(df)

    # Convert columns to numeric data types
    df_pred = df_pred.apply(pd.to_numeric)

    # Output data
    ctx.emit(df_pred)
/
""")

conn.execute(sql)

# Test model
sql = textwrap.dedent(f"""\
CREATE OR REPLACE TABLE IDA.TEST_PREDICTIONS AS
    SELECT IDA.TEST_MODEL('{EXASOL_BUCKETFS_PATH}/{classifier_file}', {", ".join(column_names)})
    FROM IDA.TEST_TRANSFORMED
""")

conn.execute(sql)

# Close Exasol connection
conn.close()

print("Testing the model took: %s"%str(stopwatch))

Testing the model took: 12.98s


## Evaluating the Model

Now that we have the predicted class labels of the test data, we can simply compare them to the actual class labels to evaluate how well the classifier performed.

For the performance metric, we use the `ida_cost()` function defined below, which implements the cost function specified in the problem description. The confusion matrix is also displayed to show how the instances were classified.

In [16]:
import textwrap
import pyexasol
import pandas as pd
from sklearn.metrics import confusion_matrix

# Define cost function from the problem description
def ida_cost(y, y_pred):
    false_preds = y - y_pred
    num_false_pos = (false_preds < 0).sum()
    num_false_neg = (false_preds > 0).sum()
    return 10 * num_false_pos + 500 * num_false_neg

# Create Exasol connection
conn = pyexasol.connect(dsn=EXASOL_EXTERNAL_HOST, user=EXASOL_USER, password=EXASOL_PASSWORD, compression=True)

# Get predicted and real class labels
test_preds = conn.export_to_pandas("SELECT CLASS_PRED, CLASS FROM IDA.TEST_PREDICTIONS")

# Close Exasol connection
conn.close()

y_pred = test_preds.loc[:, 'CLASS_PRED']
y = test_preds.loc[:, 'CLASS']

# Examine the results
confusion_mat = confusion_matrix(y, y_pred)
confusion_matrix_df = pd.DataFrame(confusion_mat,
                                   index=['actual neg', 'actual pos'],
                                   columns=['predicted neg', 'predicted pos'])

print("Total Cost:", ida_cost(y, y_pred),"\n")
print("Confusion Matrix:\n", confusion_matrix_df)

Total Cost: 10590 

Confusion Matrix:
             predicted neg  predicted pos
actual neg          14866            759
actual pos              6            369


After running the evaluation script above, the following results (or very similar values, possibly depending on the system) should be displayed.

***

<b>Total Cost:</b> 10590

<b>Confusion Matrix:</b>

|&nbsp;|<b>predicted neg</b>|<b>predicted pos</b>|
|---------|------------|------------|
|<b>actual neg</b>|14866|759|
|<b>actual pos</b>|    6|369|

***

While there are many different methods for evaluating a model, we are interested in minimizing the total cost as described in the problem description. By looking at the confusion matrix, we can see that the total cost was calculated as $10\cdot{759} + 500\cdot{6}=10590$.
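The cost calculation can be double-checked with a few lines of plain Python. This is a minimal sketch: the helper names `cost_from_counts` and `cost_from_labels` are introduced here for illustration, and the toy label lists are made up. The per-error costs of 10 and 500 are the ones used in `ida_cost()` above.

```python
# Per-error costs, as used by ida_cost() in the evaluation script.
COST_FALSE_POS = 10
COST_FALSE_NEG = 500

def cost_from_counts(num_false_pos, num_false_neg):
    """Total cost from the two off-diagonal confusion-matrix counts."""
    return COST_FALSE_POS * num_false_pos + COST_FALSE_NEG * num_false_neg

def cost_from_labels(y, y_pred):
    """Same cost, computed from paired actual/predicted 0/1 labels."""
    num_false_pos = sum(1 for a, p in zip(y, y_pred) if a == 0 and p == 1)
    num_false_neg = sum(1 for a, p in zip(y, y_pred) if a == 1 and p == 0)
    return cost_from_counts(num_false_pos, num_false_neg)

# Counts from the confusion matrix: 759 false positives, 6 false negatives.
total_cost = cost_from_counts(num_false_pos=759, num_false_neg=6)
print("Total cost:", total_cost)

# Toy labels with exactly one false positive and one false negative.
toy_cost = cost_from_labels([0, 0, 1, 1], [0, 1, 0, 1])
print("Toy cost:", toy_cost)
```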

Note that because false negatives and false positives carry different costs, the model with the lowest cost is not necessarily the one with the highest classification accuracy. False negatives are penalized 50 times more heavily than false positives, so the model should prefer to err by classifying a negative as a positive rather than a positive as a negative. The confusion matrix and the performance metrics below show that this is indeed the case.

The <b>classification accuracy</b>, which is the ratio of correct predictions to total predictions, is $\frac{14866+369}{16000}\approx 0.95$. However, as mentioned above, this value is not the most relevant one here because we are much more interested in minimizing false negatives than false positives.
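To see why accuracy alone can be misleading here, the sketch below compares the evaluated model with a hypothetical baseline that labels every truck as negative. The baseline is not part of this project. Its numbers follow directly from the row totals of the confusion matrix above.

```python
# Confusion-matrix counts of the evaluated model.
tn, fp, fn, tp = 14866, 759, 6, 369
total = tn + fp + fn + tp

def accuracy(true_neg, true_pos, n):
    """Ratio of correct predictions to total predictions."""
    return (true_neg + true_pos) / n

def cost(false_pos, false_neg):
    """Cost function from the problem description."""
    return 10 * false_pos + 500 * false_neg

model_acc = accuracy(tn, tp, total)
model_cost = cost(fp, fn)

# Hypothetical baseline: predict "negative" for every instance.
# All actual negatives become true negatives, all actual positives are missed.
base_tn, base_fp = tn + fp, 0
base_fn, base_tp = fn + tp, 0
base_acc = accuracy(base_tn, base_tp, total)
base_cost = cost(base_fp, base_fn)

print(f"model:    accuracy={model_acc:.4f}, cost={model_cost}")
print(f"baseline: accuracy={base_acc:.4f}, cost={base_cost}")
```

The baseline reaches a higher accuracy than the model, yet its cost is many times larger because every APS failure is missed.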

Similarly, the <b>precision</b> of the classifier, which is the ratio of true positives to all positive predictions, is $\frac{369}{369+759}\approx 0.33$, which does not seem very good. In fact, the model has predicted over 2x as many false positives as true positives. However, because false positives have a relatively low cost, the performance is not as bad as it seems.

On the other hand, because false negatives are so expensive, the <b>recall</b> (or true positive rate) metric is more telling in our case. The recall, which is the ratio of true positives to actual positives, has a value of $\frac{369}{369+6}\approx 0.98$, which means that the model correctly classified over 98% of all actual positives.

Switching back to the terminology of the problem, the performance of the classifier can be summarized as follows:
* About 98% of APS failures are correctly identified, and the trucks are properly checked.
* About 2% of trucks with a faulty APS are not checked, resulting in a potential breakdown.
* About 67% of APS checks are unnecessary because the trucks do not have a faulty APS.
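The percentages in this summary can be reproduced from the confusion matrix with a short sketch. The variable names are chosen here for illustration; the counts are the ones reported above.

```python
# Confusion-matrix counts of the evaluated model.
tn, fp, fn, tp = 14866, 759, 6, 369

# Precision: share of positive predictions that are real APS failures.
precision = tp / (tp + fp)

# Recall: share of real APS failures that the model flags.
recall = tp / (tp + fn)

# Share of real APS failures that the model misses (1 - recall).
miss_rate = fn / (tp + fn)

# Share of flagged trucks that do not have a faulty APS (1 - precision).
unnecessary_check_rate = fp / (tp + fp)

print(f"precision:              {precision:.2f}")
print(f"recall:                 {recall:.2f}")
print(f"missed APS failures:    {miss_rate:.1%}")
print(f"unnecessary APS checks: {unnecessary_check_rate:.1%}")
```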

## Deploying the Model

After evaluating the model and deciding that it is ready for production, all we need to do is deploy it. This is quite simple since we just need to copy the model we previously evaluated to an EXABucket of a production Exasol cluster, where it can be used with live data.

In the short script below, we upload the evaluated model to a new EXABucket location. Note: In order to keep this example simple, the model is simply uploaded to a different path on the same Exasol cluster.

In [None]:
import textwrap
import pyexasol

# Create Exasol connection
conn = pyexasol.connect(dsn=EXASOL_EXTERNAL_HOST, user=EXASOL_USER, password=EXASOL_PASSWORD, compression=True)
conn.execute(f"ALTER SESSION SET SCRIPT_LANGUAGES='{EXASOL_SCRIPT_LANGUAGES}'")

# File to store the production classifier
production_file = "production_classifier.pkl"

# Create script to deploy the model
sql = textwrap.dedent(f"""\
CREATE OR REPLACE PYTHON3_60 SET SCRIPT
IDA.DEPLOY_MODEL(classifier_load_path VARCHAR(200), classifier_production_path VARCHAR(200))
EMITS (dummy int) AS

from sklearn.externals import joblib

# Import helper script
exabucket_helper = exa.import_script('IDA.EXABUCKET_HELPER')

def run(ctx):
    # Load model from EXABucket
    clf = joblib.load(ctx.classifier_load_path)

    # Save classifier to EXABucket
    exabucket_helper.upload_object_to_bucketfs(clf,
                                                'localhost:{EXASOL_BUCKETFS_PORT}',
                                                ctx.classifier_production_path,
                                                '{EXASOL_BUCKETFS_USER}',
                                                '{EXASOL_BUCKETFS_PASSWORD}',
                                                {EXASOL_BUCKETFS_USE_HTTPS})
/
""")

conn.execute(sql)

# Deploy model
sql = textwrap.dedent(f"""\
SELECT IDA.DEPLOY_MODEL('{EXASOL_BUCKETFS_PATH}/{classifier_file}', '/{EXASOL_BUCKETFS_BUCKET}/{production_file}')
""")

conn.execute(sql)

# Close Exasol connection
conn.close()

## Summary

In this small example project, we went through each of the main steps of a machine learning project while using a real-world, industrial problem and data as an example. We started from the very beginning by downloading the data and finished with a production-deployed machine learning model ready to use for making intelligent, data-driven business decisions.

We demonstrated that each step in the process can be completed using Exasol's UDFs, so there is no longer a need to separate the database from machine learning methods. There's no need to export the data to a separate machine or server in order to analyze it or to build and train machine learning models on it. You can build, train, and test your models by accessing the data directly from inside the database. You can do it all in one place: Exasol.