# Train a SageMaker model with data from Exasol

This notebook shows you how to train an AWS SageMaker model using data from within the Exasol database.
The trick with the SageMaker Python SDK is that the actual training does not run on this notebook's instance but on a dedicated EC2 instance.
That saves costs, since you only pay for the fast and expensive training instance while the training is running.
The data is exchanged as CSV files: this notebook writes the training data as CSV files to an S3 bucket.
From there, the training instance reads the data and writes the trained model back to the bucket.

Typically, you would prepare the data using Python in this notebook (for example, using pandas).
We will, however, use a more performant approach: instead of passing all the data through this notebook,
we will prepare and export it directly in the Exasol database using SQL.
That way, you can use the computing power of your entire Exasol cluster!

## Parameters

In [None]:
EXASOL_HOST = "<database_host>" # change
EXASOL_PORT = "8563" # change if needed
EXASOL_USER = "sys" # change if needed
EXASOL_PASSWORD = "<database_password>" #change
EXASOL_SCHEMA = "IDA"

# We will assign these tags to all resources created by this notebook. That's useful for monitoring your costs.
# Add, for example: {"Key":"owner", "Value": "your.name@example.com"}
tags = []

## Preparations

In [3]:
!pip install pyexasol

import boto3, sys, math, os, sagemaker, pyexasol
import numpy as np
import pandas as pd
from sagemaker import get_execution_role
from sagemaker import image_uris
from sagemaker.inputs import TrainingInput
from sagemaker.estimator import Estimator
from sagemaker.serializers import CSVSerializer

## Prepare Training Data

First, create a connection to the Exasol database:

In [95]:
EXASOL_CONNECTION = "{host}:{port}".format(host=EXASOL_HOST, port=EXASOL_PORT)
exasol = pyexasol.connect(dsn=EXASOL_CONNECTION, user=EXASOL_USER, password=EXASOL_PASSWORD, compression=True)

Now let's take a look at our training data:

In [52]:
exasol.export_to_pandas("SELECT * FROM {schema!q}.TRAIN LIMIT 4",{"schema": EXASOL_SCHEMA})

Unnamed: 0,CLASS,AA_000,AB_000,AC_000,AD_000,AE_000,AF_000,AG_000,AG_001,AG_002,...,EE_002,EE_003,EE_004,EE_005,EE_006,EE_007,EE_008,EE_009,EF_000,EG_000
0,neg,41116,,0,,0,0,0,0,0,...,199780,101244,272518,433912,372908,163418,195492,13496,0,0
1,neg,59726,,78,40.0,0,0,0,0,0,...,340830,201640,583574,708870,582644,227474,70738,14,0,0
2,neg,30066,,340,340.0,0,0,0,0,0,...,258204,129524,265784,294224,199650,91130,111416,3228,0,0
3,neg,125250,,0,,0,0,0,0,0,...,794026,510006,1348866,1101396,825410,489878,505230,5224,0,0


There are two things we need to do:

* Split the data into training and validation sets
* Replace the `CLASS` column with a column of boolean values

For the split, we add a column `SPLIT` that holds a random value between 0 and 1, so we can partition the data with a condition on that column.

In addition, we replace the `CLASS` column, which contains the text values `pos` and `neg`, with a new column `CLASS_POS` that contains boolean values.
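
To make these two steps concrete, here is a minimal plain-Python sketch that applies them to a handful of made-up rows. The helper names `prepare_rows` and `split_rows`, the sample data, and the fixed seed are assumptions made for illustration only. The SQL in the next cell performs the equivalent transformation inside Exasol, and the 0.8 threshold is the one used by the export statements further down.

```python
import random


def prepare_rows(rows, seed=None):
    """Add a random SPLIT value and replace CLASS with a boolean CLASS_POS."""
    rng = random.Random(seed)
    prepared = []
    for row in rows:
        new_row = {"SPLIT": rng.random(), "CLASS_POS": row["CLASS"] == "pos"}
        # Copy every column except the original text label
        new_row.update({k: v for k, v in row.items() if k != "CLASS"})
        prepared.append(new_row)
    return prepared


def split_rows(prepared, threshold=0.8):
    """Partition the rows by a condition on the SPLIT column."""
    train = [r for r in prepared if r["SPLIT"] <= threshold]
    validation = [r for r in prepared if r["SPLIT"] > threshold]
    return train, validation


# Made-up sample: every tenth row is positive
sample = [{"CLASS": "pos" if i % 10 == 0 else "neg", "AA_000": i} for i in range(1000)]
prepared = prepare_rows(sample, seed=42)
train_rows, validation_rows = split_rows(prepared)
print(len(train_rows), len(validation_rows))
```

Because `SPLIT` is uniformly distributed, roughly 80 % of the rows end up in the training partition. The exact counts vary from run to run unless a seed is fixed.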

In [86]:
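# Fetch a single row just to learn the column names
all_columns = exasol.export_to_pandas("SELECT * FROM {schema!q}.TRAIN LIMIT 1;", {"schema": EXASOL_SCHEMA})
column_names = list(all_columns)
column_names.remove("CLASS")  # CLASS is replaced by the boolean column CLASS_POS below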
exasol.execute("""CREATE OR REPLACE TABLE {schema!q}.TRAIN_PREPARED AS (
               SELECT RANDOM() AS SPLIT,
               (CLASS = 'pos') as CLASS_POS, {all_columns_except_class!q} FROM {schema!q}.TRAIN)""",
               { "schema": EXASOL_SCHEMA, "all_columns_except_class": column_names})

<ExaStatement session_id=1685765842211831808 stmt_idx=58>

Let's check:

In [87]:
exasol.export_to_pandas("SELECT * FROM {schema!q}.TRAIN_PREPARED LIMIT 4", {"schema": EXASOL_SCHEMA})

Unnamed: 0,SPLIT,CLASS_POS,AA_000,AB_000,AC_000,AD_000,AE_000,AF_000,AG_000,AG_001,...,EE_002,EE_003,EE_004,EE_005,EE_006,EE_007,EE_008,EE_009,EF_000,EG_000
0,0.370128,0,76698,,2130706438,280.0,0,0,0,0,...,1240520,493384,721044,469792,339156,157956,73224,0,0,0
1,0.031734,0,33058,,0,,0,0,0,0,...,421400,178064,293306,245416,133654,81140,97576,1500,0,0
2,0.101403,0,41040,,228,100.0,0,0,0,0,...,277378,159812,423992,409564,320746,158022,95128,514,0,0
3,0.765778,0,12,0.0,70,66.0,0,10,0,0,...,240,46,58,44,10,0,0,0,4,32


... and export the training and validation data as CSV files.

To do so, we instruct the Exasol database to export directly to the S3 bucket. For that, the database needs write permission for the S3 bucket.
There are two options:

* Provide credentials to the EXPORT command (if you choose this option, you have to edit the export statements below)
* Leave the credentials empty and assign the permissions to the EC2 instance role of the Exasol database
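
For the first option, the AWS access key ID goes into the `USER` clause and the secret access key into the `IDENTIFIED BY` clause of the `EXPORT` statement. The following sketch only assembles and prints such a statement. `build_export_statement` and the two quoting helpers are hypothetical, and the bucket name and keys are placeholders. It is meant to show where the credentials would go, not to replace the parameterized statements used in this notebook.

```python
def quote_identifier(name):
    """Quote an SQL identifier, doubling embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def quote_string(value):
    """Quote an SQL string literal, doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def build_export_statement(schema, columns, condition, bucket, file_name,
                           access_key="", secret_key=""):
    """Hypothetical helper: assemble an EXPORT ... INTO CSV statement for S3.

    Empty keys correspond to the second option (permissions via instance role).
    The condition is inserted as-is, so it must come from trusted code.
    """
    select_list = ", ".join(quote_identifier(c) for c in columns)
    url = "https://" + bucket + ".s3.amazonaws.com"
    return (
        f"EXPORT (SELECT {select_list} "
        f"FROM {quote_identifier(schema)}.TRAIN_PREPARED WHERE {condition}) "
        f"INTO CSV AT {quote_string(url)} "
        f"USER {quote_string(access_key)} IDENTIFIED BY {quote_string(secret_key)} "
        f"FILE {quote_string(file_name)};"
    )


statement = build_export_statement(
    schema="IDA",
    columns=["CLASS_POS", "AA_000"],
    condition="SPLIT <= 0.8",
    bucket="my-example-bucket",            # placeholder bucket name
    file_name="train/train.csv",
    access_key="<aws_access_key_id>",      # placeholder
    secret_key="<aws_secret_access_key>",  # placeholder
)
print(statement)
```

If you go this route, consider reading the keys from environment variables rather than writing them into the notebook.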

In [88]:
role = get_execution_role()
sess = sagemaker.Session()
bucket = sess.default_bucket()

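# Fetch a single row just to learn the column names
columns = exasol.export_to_pandas("SELECT * FROM {schema!q}.TRAIN_PREPARED LIMIT 1;", {"schema": EXASOL_SCHEMA})
column_names = list(columns)
# SPLIT is only needed for partitioning. Removing it leaves the label CLASS_POS as the first column,
# which is where the SageMaker XGBoost algorithm expects the label in CSV input.
column_names.remove("SPLIT")
parameters = {"all_columns_except_split": column_names, "schema": EXASOL_SCHEMA, "bucket": bucket}
# Rows with SPLIT <= 0.8 (roughly 80 %) become training data, the rest validation data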
exasol.execute("""EXPORT (SELECT {all_columns_except_split!q} FROM {schema!q}.TRAIN_PREPARED WHERE SPLIT <= 0.8)
                    INTO CSV AT 'https://{bucket!r}.s3.amazonaws.com'
                    USER '' IDENTIFIED BY '' FILE 'train/train.csv';""", parameters)
exasol.execute("""EXPORT (SELECT {all_columns_except_split!q} FROM {schema!q}.TRAIN_PREPARED WHERE SPLIT > 0.8)
                    INTO CSV AT 'https://{bucket!r}.s3.amazonaws.com'
                    USER '' IDENTIFIED BY '' FILE 'validation/validation.csv';""", parameters)

<ExaStatement session_id=1685765842211831808 stmt_idx=62>

In [None]:
exasol.execute("DROP TABLE {schema!q}.TRAIN_PREPARED;",{"schema": EXASOL_SCHEMA});

## Train model

Now let's train the model. The following cell starts an `ml.m4.xlarge` EC2 instance and runs the training on it. The instance reads the training data from the S3 bucket.

In [90]:
s3_input_train = TrainingInput(s3_data='s3://{}/train/'.format(bucket), content_type='csv')
s3_input_validation = TrainingInput(s3_data='s3://{}/validation/'.format(bucket), content_type='csv')

container = image_uris.retrieve('xgboost', boto3.Session().region_name, '1.2-1' )
xgb = Estimator(container,
                role,
                instance_count=1,
                instance_type='ml.m4.xlarge',
                output_path='s3://{}/output'.format(bucket),
                tags = tags
               )

xgb.set_hyperparameters(max_depth=5,
                        eta=0.2,
                        gamma=4,
                        min_child_weight=6,
                        subsample=0.8,
                        objective='binary:logistic',
                        num_round=100)

xgb.fit({'train': s3_input_train, 'validation': s3_input_validation})

2020-12-11 09:05:54 Starting - Starting the training job...
2020-12-11 09:06:18 Starting - Launching requested ML instances
ProfilerReport-1607677554: InProgress
......
2020-12-11 09:07:19 Starting - Preparing the instances for training......
2020-12-11 09:08:19 Downloading - Downloading input data...
2020-12-11 09:08:40 Training - Downloading the training image...
2020-12-11 09:09:20 Training - Training image download completed. Training in progress...
INFO:sagemaker-containers:Imported framework sagemaker_xgboost_container.training
INFO:sagemaker-containers:Failed to parse hyperparameter objective value binary:logistic to Json.
Returning the value itself
INFO:sagemaker-containers:No GPUs detected (normal if no gpus installed)
INFO:sagemaker_xgboost_container.training:Running XGBoost Sagemaker in algorithm mode
INFO:root:Determined delimiter of CSV input is ','
INFO:root:Determined delimiter of CSV input is ','
INFO:roo

## Deploy

Now we create an endpoint (an EC2 instance that runs inference):

In [None]:
xgb_predictor = xgb.deploy(initial_instance_count=1,
                           instance_type='ml.t2.medium')

------

## Test the Model

To test the newly created model, we load the contents of the `TEST` table and run predictions on it.

In [70]:
test_data = exasol.export_to_pandas("SELECT * FROM {schema!q}.TEST", {"schema": EXASOL_SCHEMA})
test_data = pd.get_dummies(test_data)     # Convert categorical variables to sets of indicators
test_data.sample(2)

Unnamed: 0,AA_000,AB_000,AC_000,AD_000,AE_000,AF_000,AG_000,AG_001,AG_002,AG_003,...,EE_004,EE_005,EE_006,EE_007,EE_008,EE_009,EF_000,EG_000,CLASS_neg,CLASS_pos
3355,454,2.0,0.0,12.0,0.0,0.0,0.0,0.0,0.0,0.0,...,2332.0,1912.0,1822.0,8476.0,90.0,0.0,0.0,0.0,1,0
8023,38224,,0.0,,0.0,0.0,0.0,0.0,0.0,0.0,...,337118.0,332806.0,225650.0,123556.0,131950.0,3338.0,0.0,0.0,1,0


In [67]:
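xgb_predictor.serializer = CSVSerializer()  # send the rows to the endpoint as CSV
def predict(data, rows=500):
    """Send `data` to the endpoint in batches of at most `rows` rows and return all predictions."""
    split_array = np.array_split(data, int(data.shape[0] / float(rows) + 1))
    predictions = ''
    for array in split_array:
        # Decode the endpoint's response (bytes) and append it to the comma-separated result string
        predictions = ','.join([predictions, xgb_predictor.predict(array).decode('utf-8')])

    # Skip the leading comma that the first join adds
    return np.fromstring(predictions[1:], sep=',')

# Drop the label columns so that only the features are sent to the endpoint
predictions = predict(test_data.drop(['CLASS_pos', 'CLASS_neg'], axis=1).to_numpy())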
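
The batching in `predict` can be checked without an endpoint. The sketch below reproduces in plain Python how many batches the expression `int(n / float(rows) + 1)` yields and how `np.array_split` distributes the rows across them (the first `n % k` of the `k` chunks receive one extra row). The function name `batch_sizes` is an assumption for illustration. The row count of 16000 equals the sum of the four cells of the confusion matrix shown further down.

```python
def batch_sizes(n_rows, rows=500):
    """Return the chunk sizes that predict() would send for n_rows rows."""
    n_batches = int(n_rows / float(rows) + 1)
    # np.array_split: the first `extra` chunks get one row more than the rest
    base, extra = divmod(n_rows, n_batches)
    return [base + 1] * extra + [base] * (n_batches - extra)


sizes = batch_sizes(16000)
print(len(sizes), max(sizes), sum(sizes))  # prints: 33 485 16000
```

Adding 1 to the batch count means that no batch ever exceeds `rows` rows, at the price of one more request than strictly necessary when the row count is an exact multiple of `rows`.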

In [68]:
pd.crosstab(index=test_data['CLASS_pos'], columns=np.round(predictions), rownames=['actuals'], colnames=['predictions'])

actuals \ predictions,0.0,1.0
0,15623,2
1,349,26
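
The four counts in this table can be turned into the usual classification metrics. The following is a small worked example using the numbers above. The variable names are illustrative, and the positive class is `CLASS_pos = 1`.

```python
# Counts taken from the confusion matrix (rows: actuals, columns: predictions)
tn, fp = 15623, 2   # actual 0: predicted 0 / predicted 1
fn, tp = 349, 26    # actual 1: predicted 0 / predicted 1

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that were found

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
# prints: accuracy=0.9781 precision=0.9286 recall=0.0693
```

Accuracy is high mainly because most rows are negative. The recall shows that only 26 of the 375 positive rows are detected at the 0.5 threshold implied by `np.round`.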


## Delete endpoint

This notebook leaves the endpoint running, so that you can use it for predictions.

However, a running endpoint incurs costs. So if you don't need it anymore, don't forget to delete it.

In [None]:
#xgb_predictor.delete_endpoint();