# **Amazon SageMaker in Practice - Workshop**
## **Click-Through Rate Prediction**

This lab covers the steps for creating a click-through rate (CTR) prediction pipeline. The workshop was prepared by [Pattern Match](https://pattern-match.com).

You can reach the authors at these addresses:

- [Sebastian Feduniak](mailto:sebastian.feduniak@pattern-match.com)
- [Wojciech Gawroński](mailto:wojciech.gawronski@pattern-match.com)
- [Paweł Pikuła](mailto:pawel.pikula@pattern-match.com)

Today we will work together with the [Criteo Labs](http://labs.criteo.com/) dataset, which was used in an earlier [Kaggle competition](https://www.kaggle.com/c/criteo-display-ad-challenge).

**WARNING**: First you need to update `pandas` to 0.23.4 for the `conda_python3` kernel.
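
A quick way to confirm that the kernel picked up the required version is to compare version strings before running the rest of the notebook. The helpers below (`version_tuple`, `is_recent_enough`) are hypothetical names introduced only for this sketch; they look at the leading numeric part of each version component and ignore pre-release suffixes.

```python
import re

REQUIRED_PANDAS = "0.23.4"   # Version named in the warning above

def version_tuple(version):
    """Turn '0.23.4' (or '1.5.0rc1') into a comparable tuple such as (0, 23, 4)."""
    parts = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    return tuple(parts)

def is_recent_enough(installed, required=REQUIRED_PANDAS):
    """True when the installed version is at least the required one."""
    return version_tuple(installed) >= version_tuple(required)

try:
    import pandas as pd
    status = "OK" if is_recent_enough(pd.__version__) else "needs an upgrade"
    print("pandas {}: {}".format(pd.__version__, status))
except ImportError:
    print("pandas is not installed in this kernel")
```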

# Background

Ad targeting is a crucial technique in online advertising. Because resources and a customer's attention are limited, the goal is to show an ad only to the most relevant and interested users. Predicting those potential clicks based on readily available information like device metadata, demographics, past interactions, and environmental factors is a common machine learning problem.

This notebook presents an example problem: predicting whether a customer will click on a given advertisement. The steps include:

- Preparing your *Amazon SageMaker* notebook.
- Downloading data from the internet into *Amazon SageMaker*.
- Investigating and transforming the data so that it can be fed to *Amazon SageMaker* algorithms.
- Estimating a model using the Gradient Boosting algorithm.
- Leveraging hyperparameter optimization for training multiple models with varying hyperparameters in parallel.
- Evaluating and comparing the effectiveness of the models.
- Hosting the model to make ongoing predictions.

# Preparation

In [1]:
data_bucket = 'amazon-sagemaker-in-practice-workshop.pattern-match.com'

output_bucket = 'YOUR_USER_BUCKET_NAME_GOES_HERE'

path = 'criteo-display-ad-challenge'
key = 'sample.csv'

data_location = 's3://{}/{}/{}'.format(data_bucket, path, key)

In [2]:
import boto3
from sagemaker import get_execution_role

role = get_execution_role()

print(role)

arn:aws:iam::450349639042:role/service-role/AmazonSageMaker-ExecutionRole-20180923T131486


In [3]:
import numpy as np                                    # For matrix operations and numerical processing
import pandas as pd                                   # For munging tabular data
import matplotlib.pyplot as plt                       # For charts and visualizations

from IPython.display import Image                     # For displaying images in the notebook
from IPython.display import display                   # For displaying outputs in the notebook

from time import gmtime, strftime                     # For labeling SageMaker models, endpoints, etc.

import sys                                            # For writing outputs to notebook
import math                                           # For ceiling function
import json                                           # For parsing hosting outputs
import os                                             # For manipulating filepath names

import sagemaker                                      # Amazon SageMaker's Python SDK provides helper functions
from sagemaker.predictor import csv_serializer        # Converts strings for HTTP POST requests on inference

from sagemaker.tuner import IntegerParameter          # Importing HPO elements.
from sagemaker.tuner import CategoricalParameter 
from sagemaker.tuner import ContinuousParameter
from sagemaker.tuner import HyperparameterTuner


# Data

The training dataset consists of a portion of Criteo's traffic over a period
of 7 days. Each row corresponds to a display ad served by Criteo and the first
column indicates whether this ad has been clicked or not.
The positive (clicked) and negative (non-clicked) examples have both been
subsampled (but at different rates) in order to reduce the dataset size.

There are 13 features taking integer values (mostly count features) and 26
categorical features. The values of the categorical features have been hashed
onto 32 bits for anonymization purposes. 
The semantics of these features are undisclosed. Some features may have missing values.

The rows are chronologically ordered.

The test set is computed in the same way as the training set but it 
corresponds to events on the day following the training period. 
The first column (label) has been removed.

### Format

The columns are tab-separated with the following schema:

```
<label> <integer feature 1> ... <integer feature 13> <categorical feature 1> ... <categorical feature 26>
```

When a value is missing, the field is just empty.
There is no label field in the test set.
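
To make the schema concrete, here is a minimal sketch of parsing one training row by hand. The function name `parse_train_line` and the sample row are made up for illustration (the row is synthetic, not taken from the dataset); the only things taken from the description above are the field counts, the tab separator, and the rule that an empty field means a missing value.

```python
N_INTEGER = 13       # integer (mostly count) features
N_CATEGORICAL = 26   # hashed categorical features

def parse_train_line(line):
    """Split one tab-separated training row into (label, integers, categoricals)."""
    fields = line.rstrip("\n").split("\t")
    expected = 1 + N_INTEGER + N_CATEGORICAL
    if len(fields) != expected:
        raise ValueError("expected {} fields, got {}".format(expected, len(fields)))
    label = int(fields[0])
    # An empty field means the value is missing; represent it as None.
    integers = [int(f) if f else None for f in fields[1:1 + N_INTEGER]]
    categoricals = [f if f else None for f in fields[1 + N_INTEGER:]]
    return label, integers, categoricals

# Synthetic row: label 1, three integer fields (one empty), one categorical value.
row = "\t".join(["1"] + ["3", "", "7"] + [""] * 10 + ["68fd1e64"] + [""] * 25)
label, integers, categoricals = parse_train_line(row)
print(label, integers[:3], categoricals[0])
```

In the notebook itself `pd.read_csv(..., sep='\t', header=None)` does this work; the sketch only spells out what each position in a row means.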

The sample dataset contains 100 000 random rows taken from the training dataset to ease exploration.

## How to load a dataset?

Easy, if it is smaller than 5 GB, because the disk available on our notebook instance is 5 GB.

However, there is no way to increase that. :(

The EBS volume is currently fixed at 5 GB. As a workaround, you can use the `/tmp` directory for storing large files temporarily. The `/tmp` directory is on the root drive, which has around 20 GB of free space; however, it is not persisted when the notebook instance is stopped and restarted.

What if we need more? Then we need to preprocess the data some other way and store it on S3, where it is available to the SageMaker training machines.
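
One possible way to handle a file that does not fit comfortably in memory is to read it in chunks. The sketch below is an assumption about how such preprocessing could look, not the workshop's method; it uses a tiny in-memory stand-in with synthetic values instead of a real file, and relies on the `chunksize` argument of `pd.read_csv`, which yields one DataFrame per chunk.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for a large tab-separated file (synthetic values).
raw = io.StringIO("0\t1\ta\n1\t\tb\n0\t5\ta\n1\t2\t\n0\t\tb\n")

rows = 0
clicks = 0
# With chunksize set, read_csv yields DataFrames of at most 2 rows each.
for chunk in pd.read_csv(raw, header=None, sep="\t", chunksize=2):
    rows += len(chunk)
    clicks += int(chunk[0].sum())   # column 0 is the click label

print("rows: {}, clicks: {}".format(rows, clicks))
```

Each chunk could be transformed and written out separately, so only one chunk needs to be held in memory at a time.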

In [4]:
data = pd.read_csv(data_location, header = None, sep = '\t')
pd.set_option('display.max_columns', 500)     # Make sure we can see all of the columns
pd.set_option('display.max_rows', 20)         # Keep the output on one page

In [None]:
# Frequency tables for each categorical feature:
for column in data.select_dtypes(include=['object']).columns:
    display(pd.crosstab(index = data[column], columns = '% observations', normalize = 'columns'))

# Summary statistics and histograms for each numeric feature:
display(data.describe())
%matplotlib inline
hist = data.hist(bins = 30, sharey = True, figsize=(10, 10))

**TODO**: Conclusions from histograms and frequency tables posted above.

In [5]:
categorical_feature = data[14]
unique_values = data[14].unique()

print("Number of unique values in 14th feature: {}\n".format(len(unique_values)))
print(data[14])

Number of unique values in 14th feature: 541

0        68fd1e64
1        68fd1e64
2        287e684f
3        68fd1e64
4        8cf07265
5        05db9164
6        439a44a4
7        68fd1e64
8        05db9164
9        05db9164
           ...   
99990    8cf07265
99991    9a89b36c
99992    75ac2fe6
99993    05db9164
99994    05db9164
99995    68fd1e64
99996    68fd1e64
99997    8cf07265
99998    0e78bd46
99999    05db9164
Name: 14, Length: 100000, dtype: object


**TODO**: Describing OHE and why we need it here.
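
As a general illustration of one-hot encoding (OHE), and not necessarily the explanation the authors had in mind: the technique replaces a categorical column with one indicator column per distinct value, so that algorithms expecting numeric input can consume it. A minimal sketch with `pd.get_dummies`, using a few hash values that appear in the column 14 output above:

```python
import pandas as pd

# Four values that appear in the column 14 output above.
feature = pd.Series(["68fd1e64", "68fd1e64", "287e684f", "8cf07265"], name="14")

# One indicator column per distinct category; each row has exactly one "hot" column.
encoded = pd.get_dummies(feature, prefix="14")

print(encoded.columns.tolist())
print(encoded.astype(int).values)
```

Three distinct values produce three columns here; a feature with tens of thousands of distinct values would produce tens of thousands of columns.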

In [7]:
for column in data.select_dtypes(include=['object']).columns:
    size = data.groupby([column]).size()
    print("Column '{}' - number of categories: {}".format(column, len(size)))

Column '14' - number of categories: 541
Column '15' - number of categories: 497
Column '16' - number of categories: 43869
Column '17' - number of categories: 25183
Column '18' - number of categories: 145
Column '19' - number of categories: 11
Column '20' - number of categories: 7623
Column '21' - number of categories: 257
Column '22' - number of categories: 3
Column '23' - number of categories: 10997
Column '24' - number of categories: 3799
Column '25' - number of categories: 41311
Column '26' - number of categories: 2796
Column '27' - number of categories: 26
Column '28' - number of categories: 5238
Column '29' - number of categories: 34616
Column '30' - number of categories: 10
Column '31' - number of categories: 2548
Column '32' - number of categories: 1302
Column '33' - number of categories: 3
Column '34' - number of categories: 38617
Column '35' - number of categories: 10
Column '36' - number of categories: 14
Column '37' - number of categories: 12334
Column '38' - number of categ

In [6]:
for column in data.select_dtypes(include=['number']).columns:
    size = data.groupby([column]).size()
    print("Column '{}' - number of categories: {}".format(column, len(size)))

Column '0' - number of categories: 2
Column '1' - number of categories: 152
Column '2' - number of categories: 2693
Column '3' - number of categories: 943
Column '4' - number of categories: 135
Column '5' - number of categories: 23041
Column '6' - number of categories: 2055
Column '7' - number of categories: 628
Column '8' - number of categories: 155
Column '9' - number of categories: 1942
Column '10' - number of categories: 7
Column '11' - number of categories: 86
Column '12' - number of categories: 71
Column '13' - number of categories: 273


**TODO**: Problem with OHE in this dataset: too many distinct categories! How to work around that? First indexing, then bucketing.

Example of such features: Device ID, User Agent strings ... etc.
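
"Bucketing" can mean several things; one common reading, assumed here, is to keep the most frequent categories and collapse all the rare ones into a single shared bucket. The function `bucket_rare_categories` and the `__other__` label are hypothetical names for this sketch, and the input list is synthetic.

```python
from collections import Counter

def bucket_rare_categories(values, keep=2, other="__other__"):
    """Keep the `keep` most frequent categories; map every other value to one bucket."""
    counts = Counter(v for v in values if v is not None)
    frequent = {category for category, _ in counts.most_common(keep)}
    # Missing values (None) are left untouched.
    return [v if (v is None or v in frequent) else other for v in values]

values = ["05db9164", "68fd1e64", "05db9164", "287e684f",
          None, "68fd1e64", "8cf07265", "05db9164"]
bucketed = bucket_rare_categories(values, keep=2)
print(bucketed)
```

With `keep` chosen per column, the number of distinct values is bounded by `keep + 1`, which keeps a later encoding step manageable.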

In [8]:
for column in data.select_dtypes(include = ['object']).columns:
    print("Converting '{}' column to indexed values...".format(column))
    
    indexed_column = "{}_index".format(column)
    
    data[indexed_column] = pd.Categorical(data[column])
    data[indexed_column] = data[indexed_column].cat.codes

Converting '14' column to indexed values...
Converting '15' column to indexed values...
Converting '16' column to indexed values...
Converting '17' column to indexed values...
Converting '18' column to indexed values...
Converting '19' column to indexed values...
Converting '20' column to indexed values...
Converting '21' column to indexed values...
Converting '22' column to indexed values...
Converting '23' column to indexed values...
Converting '24' column to indexed values...
Converting '25' column to indexed values...
Converting '26' column to indexed values...
Converting '27' column to indexed values...
Converting '28' column to indexed values...
Converting '29' column to indexed values...
Converting '30' column to indexed values...
Converting '31' column to indexed values...
Converting '32' column to indexed values...
Converting '33' column to indexed values...
Converting '34' column to indexed values...
Converting '35' column to indexed values...
Converting '36' column to indexe

In [9]:
categorical_feature = data['14_index']
unique_values = data['14_index'].unique()

print("Number of unique values in 14th feature: {}\n".format(len(unique_values)))
print(data['14_index'])

Number of unique values in 14th feature: 541

0        229
1        229
2         74
3        229
4        300
5          8
6        135
7        229
8          8
9          8
        ... 
99990    300
99991    327
99992    250
99993      8
99994      8
99995    229
99996    229
99997    300
99998     17
99999      8
Name: 14_index, Length: 100000, dtype: int16


In [None]:
display(data.corr())
pd.plotting.scatter_matrix(data, figsize=(12, 12))
plt.show()

In [10]:
for column in data.select_dtypes(include=['object']).columns:
    data.drop([ column ], axis = 1, inplace = True)
    
display(data)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14_index,15_index,16_index,17_index,18_index,19_index,20_index,21_index,22_index,23_index,24_index,25_index,26_index,27_index,28_index,29_index,30_index,31_index,32_index,33_index,34_index,35_index,36_index,37_index,38_index,39_index
0,0,1.0,1,5.0,0.0,1382.0,4.0,15.0,2.0,181.0,1.0,2.0,,2.0,229,250,43139,12136,23,4,6591,27,2,7231,2691,8956,424,4,2809,18685,9,2422,149,2,1162,-1,2,9569,40,5603
1,0,2.0,0,44.0,1.0,102.0,8.0,2.0,2.0,4.0,1.0,1.0,,4.0,229,471,19100,6422,23,10,4407,8,2,1804,1162,15919,2355,17,4700,27297,0,1727,149,0,14511,-1,2,3232,40,4214
2,0,2.0,0,1.0,14.0,767.0,89.0,4.0,2.0,245.0,1.0,3.0,3.0,45.0,74,22,488,19187,23,4,5941,8,2,2505,1396,23281,1874,2,2235,7352,6,510,-1,-1,34643,5,2,2817,-1,-1
3,0,,893,,,4392.0,,0.0,0.0,0.0,,0.0,,,229,87,29108,4555,23,10,1449,8,2,10355,3420,26352,586,2,482,11286,1,1170,-1,-1,15996,-1,2,6970,-1,-1
4,0,3.0,-1,,0.0,2.0,0.0,3.0,0.0,0.0,1.0,1.0,,0.0,300,348,34236,24513,23,0,5210,8,2,3530,3430,16683,2632,4,2891,236,1,390,-1,-1,5100,-1,1,8662,-1,-1
5,0,,-1,,,12824.0,,0.0,0.0,6.0,,0.0,,,8,210,6755,8338,51,3,2588,8,2,2505,2190,30744,1755,2,129,21168,5,1445,-1,-1,5458,4,10,5502,-1,-1
6,0,,1,2.0,,3168.0,,0.0,1.0,2.0,,0.0,,,135,344,32874,20815,51,10,2342,8,2,2505,2486,3274,1316,2,5111,23921,5,2032,-1,-1,4822,-1,7,1259,-1,-1
7,1,1.0,4,2.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,,0.0,229,87,13771,22530,137,0,1770,27,2,3596,2827,15681,782,2,482,31226,9,1170,-1,-1,12398,-1,1,6970,-1,-1
8,0,,44,4.0,8.0,19010.0,249.0,28.0,31.0,141.0,,1.0,,8.0,8,428,35650,19187,23,4,6352,8,2,1773,117,36114,726,17,4593,18075,9,662,-1,-1,16,-1,1,2817,-1,-1
9,0,,35,,1.0,33737.0,21.0,1.0,2.0,3.0,,1.0,,1.0,8,160,35656,23158,23,-1,2534,8,2,62,3420,24079,586,2,889,19455,8,1136,-1,-1,2168,-1,1,680,-1,-1


In [11]:
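# Replace every -1 with NaN. In the *_index columns, -1 is the code for a missing category.
# Note: this loop also turns genuine -1 values in the integer columns into NaN.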

for column in data.columns:
    data[column] = data[column].replace(-1, np.nan)
    
testing = data[2]
testing_unique_values = data[2].unique()

print("Number of unique values in 2nd feature: {}\n".format(len(testing_unique_values)))
print(testing)

Number of unique values in 2nd feature: 2693

0          1.0
1          0.0
2          0.0
3        893.0
4          NaN
5          NaN
6          1.0
7          4.0
8         44.0
9         35.0
         ...  
99990    113.0
99991      0.0
99992      0.0
99993     21.0
99994      1.0
99995     60.0
99996      0.0
99997      2.0
99998    390.0
99999      NaN
Name: 2, Length: 100000, dtype: float64
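
The reason `-1` shows up in the indexed columns is that pandas assigns the code `-1` to missing values when a column is converted to a categorical. The small sketch below, on a synthetic series, shows the round trip from missing value to `-1` and back to `NaN`. Note that in the cell above the replacement runs over every column, so the genuine `-1` in row 4 of column 2 also became `NaN`.

```python
import numpy as np
import pandas as pd

raw = pd.Series(["68fd1e64", None, "05db9164", "68fd1e64"])

# Categories are sorted, so '05db9164' -> 0 and '68fd1e64' -> 1; missing -> -1.
codes = pd.Series(pd.Categorical(raw)).cat.codes

# Cast to float first so the column can hold NaN, then replace the sentinel.
restored = codes.astype("float64").replace(-1, np.nan)

print(codes.tolist())
print(restored.isna().tolist())
```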


In [12]:
# Randomly sort the data then split out first 70%, second 20%, and last 10%.

data_len = len(data)
sampled_data = data.sample(frac = 1, random_state = 1729)
train_data, validation_data, test_data = np.split(sampled_data, [ int(0.7 * data_len), int(0.9 * data_len) ])
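
The same 70/20/10 split can also be written with plain positional slicing, which makes the boundaries explicit. This is a sketch on a synthetic 1000-row frame (the column names are made up), intended only to show how the two cut points divide the shuffled rows.

```python
import pandas as pd

frame = pd.DataFrame({"label": [0, 1] * 500, "feature": range(1000)})

shuffled = frame.sample(frac=1, random_state=1729)   # same seed as the cell above
n = len(shuffled)
train_end, validation_end = int(0.7 * n), int(0.9 * n)

train = shuffled.iloc[:train_end]                    # first 70%
validation = shuffled.iloc[train_end:validation_end] # next 20%
test = shuffled.iloc[validation_end:]                # last 10%

print(len(train), len(validation), len(test))
```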

In [13]:
train_data.to_csv('train.sample.csv', index = False, header = False)
validation_data.to_csv('validation.sample.csv', index = False, header = False)

In [15]:
boto3.Session().resource('s3').Bucket(data_bucket).Object(os.path.join(path, 'train/train.csv')).upload_file('train.sample.csv')
boto3.Session().resource('s3').Bucket(data_bucket).Object(os.path.join(path, 'validation/validation.csv')).upload_file('validation.sample.csv')

# Train

In [16]:
from sagemaker.amazon.amazon_estimator import get_image_uri

container = get_image_uri(boto3.Session().region_name, 'xgboost')

In [17]:
s3_input_train = sagemaker.s3_input(s3_data = 's3://{}/{}/train/train.csv'.format(data_bucket, path), content_type = 'csv')
s3_input_validation = sagemaker.s3_input(s3_data = 's3://{}/{}/validation/validation.csv'.format(data_bucket, path), content_type = 'csv')

In [18]:
sess = sagemaker.Session()

xgb = sagemaker.estimator.Estimator(container,
                                    role, 
                                    train_instance_count=1, 
                                    train_instance_type='ml.m4.xlarge',
                                    output_path='s3://{}/{}/output'.format(data_bucket, path),
                                    sagemaker_session=sess)

xgb.set_hyperparameters(eval_metric='logloss',
                        objective='binary:logistic',
                        eta=0.2,
                        max_depth=10,
                        colsample_bytree=0.7,
                        colsample_bylevel=0.8,
                        min_child_weight=4,
                        rate_drop=0.3,
                        num_round=75,
                        gamma=0.8)

xgb.fit({'train': s3_input_train, 'validation': s3_input_validation})

INFO:sagemaker:Creating training-job with name: xgboost-2018-09-26-18-08-38-950


...............
Arguments: train
[2018-09-26:18:11:00:INFO] Running standalone xgboost training.
[2018-09-26:18:11:00:INFO] File size need to be processed in the node: 14.16mb. Available memory size in the node: 8547.16mb
[2018-09-26:18:11:00:INFO] Determined delimiter of CSV input is ','
[18:11:00] S3DistributionType set as FullyReplicated
[18:11:00] 70000x39 matrix with 2700936 entries loaded from /opt/ml/input/data/train?format=csv&label_column=0&delimiter=,
[2018-09-26:18:11:00:INFO] Determined delimiter of CSV input is ','
[18:11:00] S3DistributionType set as FullyReplicated
[18:11:00] 20000x39 matrix with 771763 entries loaded from /opt/ml/input/data/validation?format=csv&label_column=0&delimiter=,
[18:11:01] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 1118 extra nodes, 16 pruned nodes, max_depth=10
[0]#011train-logloss:0.615853#011validation-logloss:0.62489
[18:11

[18:11:06] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 392 extra nodes, 20 pruned nodes, max_depth=10
[43]#011train-logloss:0.297787#011validation-logloss:0.456889
[18:11:06] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 650 extra nodes, 42 pruned nodes, max_depth=10
[44]#011train-logloss:0.294412#011validation-logloss:0.456889
[18:11:06] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 232 extra nodes, 10 pruned nodes, max_depth=10
[45]#011train-logloss:0.293282#011validation-logloss:0.457078
[18:11:06] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 156 extra nodes, 6 pruned nodes, max_depth=10
[46]#011train-logloss:0.292659#011validation-logloss:0.457232
[18:11:06] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 94 extra nodes, 6 pruned nodes, max_depth=10
[47]#011train-logloss:0.292252#011validation-logloss:0.457226
[18:11:07] src/tree/upd
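
The `train-logloss` and `validation-logloss` values in the log come from the `logloss` evaluation metric, which for binary labels is the mean negative log-likelihood of the predicted probabilities. The sketch below is a plain-Python version of that formula; the clipping constant `eps` is a choice made here to avoid `log(0)`, and XGBoost's own implementation may differ in such details.

```python
import math

def log_loss(labels, probabilities, eps=1e-15):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, probabilities):
        p = min(max(p, eps), 1.0 - eps)   # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)

# A model that always predicts 0.5 scores ln(2) regardless of the labels.
baseline = log_loss([1, 0, 0, 1], [0.5, 0.5, 0.5, 0.5])
# Confident and correct predictions score much lower.
confident = log_loss([1, 0, 0, 1], [0.9, 0.1, 0.2, 0.8])
print(baseline, confident)
```

In the log above, the training loss keeps falling (about 0.29 by round 47) while the validation loss stays near 0.457, which may indicate that the later rounds mostly fit the training set.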

**TODO**: Strange errors: "ClientError: Hidden file found in the data path! Remove that before training."

**TODO**: What Amazon SageMaker needs from you to prepare in your input data?
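
As far as CSV input for the built-in XGBoost algorithm is concerned, the label is expected in the first column and the file should have no header row. That matches the `to_csv(..., index = False, header = False)` calls earlier and the `label_column=0` shown in the training log. The checker below is a hypothetical helper (`check_training_csv` is not part of any SDK) that sketches how a file could be sanity-checked before upload; the sample strings are synthetic.

```python
import csv
import io

def check_training_csv(text):
    """Return a list of problems found in CSV text intended for training."""
    problems = []
    widths = set()
    for line_number, row in enumerate(csv.reader(io.StringIO(text)), start=1):
        widths.add(len(row))
        try:
            # Labels may be written as 0/1 or 0.0/1.0 depending on the column dtype.
            label_ok = float(row[0]) in (0.0, 1.0)
        except (ValueError, IndexError):
            label_ok = False   # header text or an empty line ends up here
        if not label_ok:
            problems.append("line {}: first column is not a 0/1 label".format(line_number))
    if len(widths) > 1:
        problems.append("rows have differing column counts: {}".format(sorted(widths)))
    return problems

good = check_training_csv("0,1.0,229\n1,,74\n")
bad = check_training_csv("label,f1,f2\n0,1,2\n")
print(good, bad)
```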

**TODO**: Beware the '!' when hosting model, it means success ;)

## Hyperparameter Tuning (HPO)


**TODO**: Explain how to do it.

In [27]:
hpo_sess = sagemaker.Session()

hpo_xgb = sagemaker.estimator.Estimator(container,
                                    role, 
                                    train_instance_count=1, 
                                    train_instance_type='ml.m4.xlarge',
                                    output_path='s3://{}/{}/output_hpo'.format(data_bucket, path),
                                    sagemaker_session=hpo_sess)


hpo_xgb.set_hyperparameters(eval_metric='logloss',
                            objective='binary:logistic',
                            colsample_bytree=0.7,
                            colsample_bylevel=0.8,
                            num_round=75,
                            rate_drop=0.3,
                            gamma=0.8)


hyperparameter_ranges = {
                         'eta': ContinuousParameter(0, 1),
                         'min_child_weight': ContinuousParameter(1, 10),
                         'alpha': ContinuousParameter(0, 2),
                         'max_depth': IntegerParameter(1, 10),
                        }

objective_metric_name = 'validation:logloss'
objective_type = 'Minimize'

tuner = HyperparameterTuner(hpo_xgb,
                            objective_metric_name,
                            hyperparameter_ranges,
                            max_jobs = 20,
                            max_parallel_jobs = 5,
                            objective_type = objective_type)

In [28]:
tuner.fit({'train': s3_input_train, 'validation': s3_input_validation})

INFO:sagemaker:Creating hyperparameter tuning job with name: xgboost-180926-1817


In [30]:
smclient = boto3.client('sagemaker')

job_name = tuner.latest_tuning_job.job_name

hpo_job = smclient.describe_hyper_parameter_tuning_job(HyperParameterTuningJobName = job_name)
hpo_job['HyperParameterTuningJobStatus']

'Completed'
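
Once the status is `'Completed'`, the same description can be inspected for the winning training job. The sketch below assumes the response carries a `BestTrainingJob` entry with `TrainingJobName`, `FinalHyperParameterTuningJobObjectiveMetric`, and `TunedHyperParameters` keys, as I understand the boto3 response shape; please verify the key names against the boto3 documentation. The helper name and the stand-in dictionary are made up, and its values are illustrative rather than results from this workshop.

```python
def summarize_best_job(description):
    """Pull the best training job's name, metric, and tuned values from a description."""
    best = description.get("BestTrainingJob")
    if not best:
        return None   # e.g. the tuning job has not finished any training job yet
    metric = best.get("FinalHyperParameterTuningJobObjectiveMetric", {})
    return {
        "job_name": best.get("TrainingJobName"),
        "metric_name": metric.get("MetricName"),
        "metric_value": metric.get("Value"),
        "hyperparameters": best.get("TunedHyperParameters", {}),
    }

# Hand-written stand-in for a response; all values are illustrative only.
fake_description = {
    "HyperParameterTuningJobStatus": "Completed",
    "BestTrainingJob": {
        "TrainingJobName": "example-tuning-job-001",
        "FinalHyperParameterTuningJobObjectiveMetric": {
            "MetricName": "validation:logloss",
            "Value": 0.45,
        },
        "TunedHyperParameters": {"eta": "0.1", "max_depth": "8"},
    },
}

summary = summarize_best_job(fake_description)
print(summary)
```

In the notebook, `summarize_best_job(hpo_job)` would be the corresponding call on the real response.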

# Hosting the single model


In [31]:
xgb_predictor = xgb.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')

INFO:sagemaker:Creating model with name: xgboost-2018-09-26-18-31-21-066
INFO:sagemaker:Creating endpoint with name xgboost-2018-09-26-18-08-38-950


---------------------------------------------------------------!

# Hosting the best model from HPO


In [32]:
xgb_predictor_hpo = tuner.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')

INFO:sagemaker:Creating model with name: xgboost-2018-09-26-18-36-43-538


Arguments: train
[2018-09-26:18:29:36:INFO] Running standalone xgboost training.
[2018-09-26:18:29:36:INFO] Setting up HPO optimized metric to be : logloss
[2018-09-26:18:29:36:INFO] File size need to be processed in the node: 14.16mb. Available memory size in the node: 8431.71mb
[2018-09-26:18:29:36:INFO] Determined delimiter of CSV input is ','
[18:29:36] S3DistributionType set as FullyReplicated
[18:29:36] 70000x39 matrix with 2700936 entries loaded from /opt/ml/input/data/train?format=csv&label_column=0&delimiter=,
[2018-09-26:18:29:36:INFO] Determined delimiter of CSV input is ','
[18:29:36] S3DistributionType set as FullyReplicated
[18:29:36] 20000x39 matrix with 771763 entries loaded from /opt/ml/input/data/validation?format=csv&label_column=0&delimiter=,
[18:29:37] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 364 extra nodes, 10 pruned nodes, max_depth=8
[0]#011tr

INFO:sagemaker:Creating endpoint with name xgboost-180926-1817-018-78dc0080


---------------------------------------------------------------!

# Evaluation

In [33]:
xgb_predictor.content_type = 'text/csv'
xgb_predictor.serializer = csv_serializer

xgb_predictor_hpo.content_type = 'text/csv'
xgb_predictor_hpo.serializer = csv_serializer

In [34]:
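def predict(predictor, data, rows = 500):
    # Split the input into batches of at most `rows` rows to keep each request small.
    split_array = np.array_split(data, int(data.shape[0] / float(rows) + 1))
    predictions = ''
    
    for array in split_array:
        # Each response is a comma-separated string of scores; append it to the running string.
        predictions = ','.join([predictions, predictor.predict(array).decode('utf-8')])

    # Drop the leading comma and parse the scores into a NumPy array.
    return np.fromstring(predictions[1:], sep =',')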

predictions = predict(xgb_predictor, test_data.drop([0], axis=1).values)
hpo_predictions = predict(xgb_predictor_hpo, test_data.drop([0], axis=1).values)
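
To see how `predict` sizes its requests, the batching expression can be tried on its own. This sketch reuses the expression from the function above on a synthetic matrix; `batch_count` is a name introduced here for illustration.

```python
import numpy as np

def batch_count(n_rows, rows=500):
    # Same expression as in predict(): number of chunks passed to np.array_split.
    return int(n_rows / float(rows) + 1)

payload = np.arange(1200 * 3).reshape(1200, 3)   # synthetic 1200-row matrix
batches = np.array_split(payload, batch_count(len(payload)))
sizes = [len(b) for b in batches]
print(sizes)

# For a 10 000-row test set the same expression yields 21 batches.
test_sizes = [len(b) for b in np.array_split(np.arange(10000), batch_count(10000))]
print(len(test_sizes), max(test_sizes))
```

Because the chunk count is rounded up past `n_rows / rows`, every batch stays at or below the `rows` limit, at the cost of one extra request when the row count is an exact multiple of `rows`.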

In [35]:
# is_clicked = lambda p: 1 if p >= 0.25 else 0

# predicted_clicks =  np.array(list(map(is_clicked, predictions)))
# predicted_clicks_hpo =  np.array(list(map(is_clicked, hpo_predictions)))

# #result = pd.crosstab(index = test_data[0], columns = predicted_clicks, rownames=['actuals'], colnames=['predictions'])
# #result_hpo = pd.crosstab(index = test_data[0], columns = predicted_clicks_hpo, rownames=['actuals'], colnames=['predictions'])

result = pd.crosstab(index = test_data[0], columns = np.round(predictions), rownames=['actuals'], colnames=['predictions'])
result_hpo = pd.crosstab(index = test_data[0], columns = np.round(hpo_predictions), rownames=['actuals'], colnames=['predictions'])

display("Single job results:")
display(result)
display(result.apply(lambda r: r/r.sum(), axis = 1))

display("HPO job results:")
display(result_hpo)
display(result_hpo.apply(lambda r: r/r.sum(), axis = 1))

'Single job results:'

| actuals | predictions = 0.0 | predictions = 1.0 |
|---|---|---|
| 0 | 7291 | 419 |
| 1 | 1692 | 598 |


| actuals | predictions = 0.0 | predictions = 1.0 |
|---|---|---|
| 0 | 0.945655 | 0.054345 |
| 1 | 0.738865 | 0.261135 |


'HPO job results:'

| actuals | predictions = 0.0 | predictions = 1.0 |
|---|---|---|
| 0 | 7316 | 394 |
| 1 | 1689 | 601 |


| actuals | predictions = 0.0 | predictions = 1.0 |
|---|---|---|
| 0 | 0.948898 | 0.051102 |
| 1 | 0.737555 | 0.262445 |
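
The counts in the two confusion matrices above can be condensed into a few standard metrics, which makes the single-job and HPO models easier to compare. The function below is a minimal sketch; the counts are copied from the tables, with row `1` / column `1.0` treated as the positive (clicked) class.

```python
def confusion_metrics(tn, fp, fn, tp):
    """Accuracy, precision, and recall from the four confusion-matrix counts."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),   # share of predicted clicks that were real clicks
        "recall": tp / (tp + fn),      # share of real clicks that were predicted
    }

single = confusion_metrics(tn=7291, fp=419, fn=1692, tp=598)
hpo = confusion_metrics(tn=7316, fp=394, fn=1689, tp=601)

for name, metrics in (("single", single), ("hpo", hpo)):
    print(name, {key: round(value, 4) for key, value in metrics.items()})
```

The recall values match the bottom-right cells of the normalized tables (0.261135 and 0.262445): at the 0.5 cut-off implied by `np.round`, both models miss roughly three quarters of the actual clicks.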


# Clean-up

In [36]:
sagemaker.Session().delete_endpoint(xgb_predictor.endpoint)
sagemaker.Session().delete_endpoint(xgb_predictor_hpo.endpoint)

INFO:sagemaker:Deleting endpoint with name: xgboost-2018-09-26-18-08-38-950
INFO:sagemaker:Deleting endpoint with name: xgboost-180926-1817-018-78dc0080
