![Pattern Match](https://pattern-match.com/img/new-logo.png)

# **Amazon SageMaker in Practice - Workshop**
## **Click-Through Rate Prediction**

This lab covers the steps for creating a click-through rate (CTR) prediction pipeline. The source code of the workshop prepared by [Pattern Match](https://pattern-match.com) is available on the [company's Github account](https://github.com/patternmatch/amazon-sagemaker-in-practice). 

You can reach the authors via the following emails:

- [Sebastian Feduniak](mailto:sebastian.feduniak@pattern-match.com)
- [Wojciech Gawroński](mailto:wojciech.gawronski@pattern-match.com)
- [Paweł Pikuła](mailto:pawel.pikula@pattern-match.com)

Today we use the [Criteo Labs](http://labs.criteo.com/) dataset, which was used in an old [Kaggle competition](https://www.kaggle.com/c/criteo-display-ad-challenge) for the same purpose.

**WARNING**: First you need to update `pandas` to 0.23.4 for the `conda_python3` kernel.

# Background

In advertising, the most critical aspect when it comes to revenue is the final click on the ad. It is one of the ways the provider is compensated for ad delivery. In the industry, an individual view of a specific ad is called an *impression*.

To compare different algorithms and heuristics of ad serving, "clickability" of the ad is measured and presented in the form of [*click-through rate* metric (CTR)](https://en.wikipedia.org/wiki/Click-through_rate): 

![CTR formula](https://wikimedia.org/api/rest_v1/media/math/render/svg/24ae7fdf648530de2083f72ab4b4ae2bc0c47d85)
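
As a quick illustration of the formula above, the sketch below computes CTR as a percentage. The function name and the sample numbers are made up for this example.

```python
def click_through_rate(clicks, impressions):
    """Return CTR as a percentage: clicks / impressions * 100."""
    if impressions <= 0:
        raise ValueError("impressions must be positive")
    return 100.0 * clicks / impressions


# Hypothetical campaign: 25 clicks out of 10,000 impressions.
ctr = click_through_rate(25, 10_000)
print("CTR: {:.2f}%".format(ctr))
```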

If you randomly present a sufficient number of ads to your user base, you get a baseline level of clicks. It is the simplest solution. However, random ads have multiple problems, starting with a lack of relevance, which causes distrust and annoyance.

**Ad targeting** is a crucial technique for increasing the relevance of the ad presented to the user. Because resources and a customer's attention are limited, the goal is to show an ad to the most interested users. Predicting those potential clicks based on readily available information like device metadata, demographics, past interactions, and environmental factors is a universal machine learning problem.

# Steps

This notebook presents an example problem to predict if a customer clicks on a given advertisement. The steps include:

- Prepare your *Amazon SageMaker* notebook.
- Download data from the internet into *Amazon SageMaker*.
- Investigate and transform the data for usage inside *Amazon SageMaker* algorithms.
- Estimate a model using the *Gradient Boosting* algorithm (`xgboost`).
- Leverage hyperparameter optimization for training multiple models with varying hyperparameters in parallel.
- Evaluate and compare the effectiveness of the models.
- Host the model to make ongoing predictions.

# What is *Amazon SageMaker*?

*Amazon SageMaker* is a fully managed machine learning service. It enables discovery and exploration with the use of *Jupyter* notebooks and then allows for easy industrialization in a production-grade, distributed environment that can handle and scale to extensive datasets. 

It provides solutions and algorithms for existing problems, but you can also bring your own algorithms into the service. Everything mentioned above happens inside your *AWS infrastructure*. That includes a secure and isolated *VPC* (*Virtual Private Cloud*), supported by the full power of the platform.

[Typical workflow](https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-mlconcepts.html) for creating machine learning models:

![Machine Learning with Amazon SageMaker](https://docs.aws.amazon.com/sagemaker/latest/dg/images/ml-concepts-10.png)

## Note about *Amazon* vs. *AWS* prefix

Why *Amazon* and not *AWS*? 

Some services available in the *Amazon Web Services* portfolio are branded as *AWS*, and some as *Amazon*. 

Everything depends on the origin of the service and the team that maintains it. In this case, the service originated from the core of Amazon, and the core division maintains it.

## Working with *Amazon SageMaker* locally

It is possible to fetch the *Amazon SageMaker SDK* library via `pip` and use the containers provided by *Amazon* locally, and you are free to do so. A *Notebook Instance* is the better choice when your datasets are far larger than you want to store locally and they reside on *S3* - for such cases it is very convenient to have the *Amazon SageMaker* notebooks available.

# Preparation

The primary way of interacting with *Amazon SageMaker* is to use *S3* as storage for input data and output results. 

For our workshops, we have prepared two buckets. One is a dedicated bucket for each user (see the credentials card you received at the beginning of the workshop) - you should put the name of that bucket into the `output_bucket` variable. That bucket is used for storing output models and the transformed and split input datasets.  

We have also prepared a shared bucket called `amazon-sagemaker-in-practice-workshop.pattern-match.com`, which contains the input dataset under the path presented below.

In [None]:
data_bucket = 'amazon-sagemaker-in-practice-workshop.pattern-match.com'

output_bucket = 'YOUR_USER_BUCKET_NAME_GOES_HERE'

path = 'criteo-display-ad-challenge'
key = 'sample.csv'

data_location = 's3://{}/{}/{}'.format(data_bucket, path, key)

*Amazon SageMaker* as a service runs in a specific security context applied via an *IAM role*. You created that role when creating the *notebook instance*, before we uploaded this content. 

Each *notebook* instance provides a *Jupyter* environment with preinstalled libraries and *AWS SDKs*. One of those *SDKs* is the *Amazon SageMaker SDK*, available from the *Python* environment. With the use of that *SDK* we can check which security context we can use:

In [None]:
import boto3
from sagemaker import get_execution_role

role = get_execution_role()

print(role)

As a next step, we need to import the libraries we use. They include *IPython*, *Pandas*, *numpy*, commonly used libraries from *Python's* Standard Library and *Amazon SageMaker* utilities:

In [None]:
import numpy as np                                    # For matrix operations and numerical processing
import pandas as pd                                   # For munging tabular data
import matplotlib.pyplot as plt                       # For charts and visualizations

from IPython.display import Image                     # For displaying images in the notebook
from IPython.display import display                   # For displaying outputs in the notebook

from time import gmtime, strftime                     # For labeling SageMaker models, endpoints, etc.

import sys                                            # For writing outputs to notebook
import math                                           # For ceiling function
import json                                           # For parsing hosting outputs
import os                                             # For manipulating filepath names

import sagemaker                                      # Amazon SageMaker's Python SDK provides helper functions
from sagemaker.predictor import csv_serializer        # Converts strings for HTTP POST requests on inference

from sagemaker.tuner import IntegerParameter          # Importing HPO elements.
from sagemaker.tuner import CategoricalParameter 
from sagemaker.tuner import ContinuousParameter
from sagemaker.tuner import HyperparameterTuner


Now we are ready to investigate the dataset.

# Data

The training dataset consists of a portion of Criteo's traffic over a period of 7 days. Each row corresponds to a display ad served by Criteo and the first column indicates whether this ad was clicked or not. The positive (clicked) and negative (non-clicked) examples have both been subsampled (but at different rates) to reduce the dataset size.

There are 13 features taking integer values (mostly count features) and 26 categorical features. The authors hashed the values of the categorical features onto 32 bits for anonymization purposes. The semantics of these features is undisclosed. Some features may have missing values (represented as a `-1` for integer values and an empty string for categorical ones). The rows are in chronological order.

You may ask why we are investigating such an *obfuscated* dataset in the first place. In *ad tech* it is not unusual to deal with anonymized or pseudonymized data that carries no semantics - mostly due to privacy and security reasons.

The test set is similar to the training set, but it corresponds to events on the day following the training period. For that dataset the authors removed the *label* (the first column).

Unfortunately, because of that, it is hard to tell for sure which feature means what, but we can try to infer it from the distribution - as we can see below. 

## Format

The columns are tab-separated with the following schema:

```
<label> <integer feature 1> ... <integer feature 13> <categorical feature 1> ... <categorical feature 26>
```

When a value is missing, the field is just empty. There is no label field in the test set.
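
To make the schema concrete, here is a minimal sketch that parses one training row into named fields. The column names (`label`, `I1`..`I13`, `C1`..`C26`) and the sample row are hypothetical; the dataset itself ships without a header.

```python
# Hypothetical column names; the dataset itself has no header row.
INTEGER_COLUMNS = ["I{}".format(i) for i in range(1, 14)]       # 13 integer features
CATEGORICAL_COLUMNS = ["C{}".format(i) for i in range(1, 27)]   # 26 categorical features
COLUMNS = ["label"] + INTEGER_COLUMNS + CATEGORICAL_COLUMNS


def parse_row(line):
    """Split one tab-separated training row into a {column: value} dict.

    Empty fields are mapped to None to mark a missing value.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(COLUMNS):
        raise ValueError("expected {} fields, got {}".format(len(COLUMNS), len(fields)))
    return {name: (value if value != "" else None) for name, value in zip(COLUMNS, fields)}


# Made-up row: label 1, integer features 1..13 with the second one missing,
# and 26 identical hashed categorical values.
sample = "\t".join(["1"] + [str(i) if i != 2 else "" for i in range(1, 14)] + ["68fd1e64"] * 26)
row = parse_row(sample)
print(len(row), row["label"], row["I2"], row["C1"])
```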

The sample dataset (`sample.csv`) contains *100 000* random rows taken from the training dataset to ease the exploration. 

## How to load the dataset?

Easy, if it is less than 5 GB - as the disk available on our Notebook instance is equal to 5 GB.

However, there is no way to increase that. :( 

That is because the EBS volume size is fixed at 5 GB. As a workaround, you can use the `/tmp` directory for storing large files temporarily. The `/tmp` directory is on the root drive, which has around 20 GB of free space. However, data stored there is not persisted when the notebook instance is stopped and restarted. 
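
Before downloading a large file, it can help to check how much space is actually free. The following is a small sketch using only the standard library; the helper name is ours, and the exact numbers depend on your instance.

```python
import shutil
import tempfile


def free_space_gb(path):
    """Return the free disk space under `path` in gigabytes (10^9 bytes)."""
    return shutil.disk_usage(path).free / 1e9


# tempfile.gettempdir() usually resolves to /tmp on a Linux notebook instance.
tmp_dir = tempfile.gettempdir()
print("Free space in {}: {:.1f} GB".format(tmp_dir, free_space_gb(tmp_dir)))
```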

What if we need more? We need to preprocess the data in another way (e.g., using *AWS Glue*) and store it on *S3* available for *Amazon SageMaker* training machines.

To read the *CSV* correctly we use *Pandas*. We need to be aware that the dataset uses tabs as separators and that it has no header:

In [None]:
data = pd.read_csv(data_location, header = None, sep = '\t')

pd.set_option('display.max_columns', 500)                       # Make sure we can see all of the columns.
pd.set_option('display.max_rows', 20)                           # Keep the output on one page.

## Exploration

Now we would like to explore our data, especially since we do not know anything about its semantics. How can we do that?

We can do that by reviewing the histograms, frequency tables, correlation matrix, and scatter matrix. Based on that we can try to infer and *"sniff"* the meaning and semantics of the particular features.

### Integer features

The first 13 features in the dataset are integer features. Let's review them:

In [None]:
# Summary statistics and histograms for each numeric feature:

display(data.describe())

%matplotlib inline
hist = data.hist(bins = 30, sharey = True, figsize = (10, 10))

In [None]:
display(data.corr())
pd.plotting.scatter_matrix(data, figsize = (12, 12))
plt.show()

### Categorical features

The next 26 features in the dataset are categorical features. Now it's time to review those:

In [None]:
# Frequency tables for each categorical feature:

for column in data.select_dtypes(include = ['object']).columns:
    display(pd.crosstab(index = data[column], columns = '% observations', normalize = 'columns'))

In [None]:
categorical_feature = data[14]
unique_values = data[14].unique()

print("Number of unique values in 14th feature: {}\n".format(len(unique_values)))
print(data[14])

As for *integer features*, we can push them as-is to the *Amazon SageMaker* algorithms. We cannot do the same thing for the *categorical* ones.

As you can see above, we have many unique values inside the categorical column. The authors hashed each value into a *32-bit number* represented in a hexadecimal format - as a *string*. 

We need to convert that into a number, and we can leverage *one-hot encoding* for that.

#### One-Hot Encoding

One-hot encoding is a way of converting categorical data (e.g., type of animal - *dog*, *cat*, *bear*, and so on) into numerical data. For a feature with `N` distinct categories, we create `N` additional columns and, for each row, put a `1` in the column of the category that applies to that row and a `0` in the others.
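
A minimal, library-free sketch of the idea, using the animal example from the paragraph above. The helper name is ours; in *Pandas*, `pd.get_dummies` performs the same kind of transformation.

```python
def one_hot_encode(values):
    """One-hot encode a list of category labels.

    Returns the sorted list of categories and one 0/1 row per input value.
    """
    categories = sorted(set(values))
    position = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[position[value]] = 1
        rows.append(row)
    return categories, rows


animals = ["dog", "cat", "bear", "dog"]
categories, encoded = one_hot_encode(animals)
print(categories)
for animal, row in zip(animals, encoded):
    print(animal, row)
```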

#### Sparse Vectors

A sparse vector is a more efficient way to store data points that are not dense, that is, rows in which most features are zero or absent: only the non-zero entries are stored. It is possible to efficiently compute various operations between the two forms - dense and sparse.
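
The sketch below shows one simple way to think about it: a sparse vector stored as an `{index: value}` dictionary, and a dot product with a dense vector that only visits the stored entries. This is an illustration with made-up numbers, not the representation that *Amazon SageMaker* uses internally.

```python
def to_sparse(dense):
    """Keep only the non-zero entries of a dense vector as {index: value}."""
    return {i: v for i, v in enumerate(dense) if v != 0}


def sparse_dot_dense(sparse, dense):
    """Dot product that touches only the stored (non-zero) entries."""
    return sum(value * dense[index] for index, value in sparse.items())


one_hot_row = [0, 0, 0, 1, 0, 0, 0, 0]                  # dense: 8 stored numbers
weights = [0.5, -1.0, 2.0, 3.0, 0.0, 1.5, -2.5, 4.0]

sparse_row = to_sparse(one_hot_row)                     # sparse: 1 stored entry
print(sparse_row)
print(sparse_dot_dense(sparse_row, weights))
```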

### Problem with *one-hot encoding* in this dataset

Unfortunately, we cannot use *OHE* as-is for this dataset. Why?

In [None]:
for column in data.select_dtypes(include=['object']).columns:
    size = data.groupby([column]).size()
    print("Column '{}' - number of categories: {}".format(column, len(size)))

In [None]:
for column in data.select_dtypes(include=['number']).columns:
    size = data.groupby([column]).size()
    print("Column '{}' - number of categories: {}".format(column, len(size)))

We have too many distinct categories per feature! In the worst case, for an individual feature, we create a couple of hundred thousand new columns. Even with the sparse representation, it significantly affects memory usage and execution time.  

What kind of features are represented by that? Examples of such features are *Device ID*, *User Agent* strings and similar.
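
A back-of-the-envelope calculation shows the scale of the problem. The row count and the cardinality below are assumed round numbers, not measurements from this dataset; the comparison is between a dense one-hot matrix for a single feature and one integer code per row.

```python
rows = 1_000_000          # assumed number of rows
categories = 200_000      # assumed number of distinct values in one feature

dense_one_hot_bytes = rows * categories * 1   # one byte per cell
integer_code_bytes = rows * 4                 # one 32-bit code per row

print("Dense one-hot:    {:.1f} GB".format(dense_one_hot_bytes / 1e9))
print("One code per row: {:.1f} MB".format(integer_code_bytes / 1e6))
```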

How can we work around that? We can use *indexing*.

In [None]:
for column in data.select_dtypes(include = ['object']).columns:
    print("Converting '{}' column to indexed values...".format(column))
    
    indexed_column = "{}_index".format(column)
    
    data[indexed_column] = pd.Categorical(data[column])
    data[indexed_column] = data[indexed_column].cat.codes

In [None]:
categorical_feature = data['14_index']
unique_values = data['14_index'].unique()

print("Number of unique values in 14th feature: {}\n".format(len(unique_values)))
print(data['14_index'])

In [None]:
for column in data.select_dtypes(include=['object']).columns:
    data.drop([ column ], axis = 1, inplace = True)
    
display(data)

Indexing is another way of representing a categorical feature in *encoded* form. It is not friendly for *Linear Learner* and classical logistic regression, but we use the `xgboost` library - which can leverage such a column without any problems.
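
To see what the indexing step does, here is a toy example with made-up values. Note that `pd.Categorical` assigns the code `-1` to missing values, so missing categorical values end up marked with `-1` in the indexed columns.

```python
import numpy as np
import pandas as pd

# Toy column with hashed-looking strings and one missing value.
toy = pd.DataFrame({"feature": ["68fd1e64", "05db9164", np.nan, "68fd1e64"]})

# Categories are sorted, then each value is replaced by its position.
codes = pd.Categorical(toy["feature"]).codes
toy["feature_index"] = codes

print(toy)
```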

## Finishing Touches

Last, but not least - we need to unify the two markers of a missing value, `NaN` and `-1`. We use `NaN` everywhere:

In [None]:
# Replace all -1 values with NaN:

for column in data.columns:
    data[column] = data[column].replace(-1, np.nan)
    
testing = data[2]
testing_unique_values = data[2].unique()

print("Number of unique values in 2nd feature: {}\n".format(len(testing_unique_values)))
print(testing)

## Splitting the dataset

We need to split the dataset. We decided to shuffle the dataset and split it into 70% for training, 20% for validation and 10% for testing.  

In [None]:
# Randomly sort the data then split out first 70%, second 20%, and last 10%:

data_len = len(data)
sampled_data = data.sample(frac = 1)

train_data, validation_data, test_data = np.split(sampled_data, [ int(0.7 * data_len), int(0.9 * data_len) ])
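
The two cut points passed to `np.split` are what produce the 70/20/10 partitions. A tiny sketch with 10 stand-in rows makes that visible:

```python
import numpy as np

rows = np.arange(10)                      # stand-in for 10 shuffled rows
n = len(rows)

# Cut points at 70% and 90% of the length give 70/20/10 partitions.
train, validation, test = np.split(rows, [int(0.7 * n), int(0.9 * n)])

print(len(train), len(validation), len(test))
```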

After splitting, we need to save the new training and validation datasets as *CSV* files. After saving, we upload them to the `output_bucket`.

In [None]:
train_data.to_csv('train.sample.csv', index = False, header = False)
validation_data.to_csv('validation.sample.csv', index = False, header = False)

In [None]:
s3client = boto3.Session().resource('s3')

train_csv_file = os.path.join(path, 'train/train.csv')
validation_csv_file = os.path.join(path, 'validation/validation.csv')

s3client.Bucket(output_bucket).Object(train_csv_file).upload_file('train.sample.csv')
s3client.Bucket(output_bucket).Object(validation_csv_file).upload_file('validation.sample.csv')

Now we are ready to leverage *Amazon SageMaker* for training.

# Training

## Preparation

As a first step, we need to specify which library we want to use. We do that by fetching the container name based on the name of the library. In our case, it is `xgboost`.

In [None]:
from sagemaker.amazon.amazon_estimator import get_image_uri

container = get_image_uri(boto3.Session().region_name, 'xgboost')

Then, we need to point out where to look for input data. In our case, we use *CSV* files uploaded in the previous section to `output_bucket`.

In [None]:
train_csv_key = 's3://{}/{}/train/train.csv'.format(output_bucket, path)
validation_csv_key = 's3://{}/{}/validation/validation.csv'.format(output_bucket, path)

s3_input_train = sagemaker.s3_input(s3_data = train_csv_key, content_type = 'csv')
s3_input_validation = sagemaker.s3_input(s3_data = validation_csv_key, content_type = 'csv')

## Differences from usual workflow and frameworks usage

Even though *Amazon SageMaker* supports *CSV* files, most of the algorithms work best when you use the optimized `protobuf` `recordIO` format for the training data. 

Using this format allows you to take advantage of *pipe mode* when training the algorithms that support it. File mode loads all of your data from *Amazon S3* to the training instance volumes. In *pipe mode*, your training job streams data directly from *Amazon S3*. Streaming can provide faster start times for training jobs and better throughput. 

With this mode, you also reduce the size of the *Amazon EBS* volumes for your training instances. *Pipe mode* needs only enough disk space to store your final model artifacts. File mode needs disk space to store both your final model artifacts and your full training dataset.

For our use case - we leverage *CSV* files.

## Single training job

In [None]:
sess = sagemaker.Session()

xgb = sagemaker.estimator.Estimator(container,
                                    role, 
                                    train_instance_count = 1, 
                                    train_instance_type = 'ml.m4.xlarge',
                                    output_path = 's3://{}/{}/output'.format(output_bucket, path),
                                    sagemaker_session = sess)

xgb.set_hyperparameters(eval_metric = 'logloss',
                        objective = 'binary:logistic',
                        eta = 0.2,
                        max_depth = 10,
                        colsample_bytree = 0.7,
                        colsample_bylevel = 0.8,
                        min_child_weight = 4,
                        rate_drop = 0.3,
                        num_round = 75,
                        gamma = 0.8)

xgb.fit({'train': s3_input_train, 'validation': s3_input_validation})

The cell above creates the *Amazon SageMaker session* and the `xgboost` estimator objects.

For a single training job, we need to create an *Estimator*, where we point to the container and the *security context*. In this step, we also specify the instance type and the number of instances used for training. Last, but not least - we need to specify `output_path` and pass the session object.

For the created *Estimator* instance we need to specify the `objective`, `eval_metric` and other hyperparameters used for that training session. 

As the last step, we start the training process by passing the training and validation datasets. The whole training job takes approximately 1-2 minutes at most for this setup.

## FAQ

**Q**: I see a strange error: `ClientError: Hidden file found in the data path! Remove that before training`. What is that?

**A**: There is something wrong with your input files. Most likely, the *S3* path passed into the training job is wrong.
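
One thing worth checking, and this is an assumption on our side rather than a documented cause, is whether the training prefix contains objects whose names start with a dot (for example `.DS_Store`). The sketch below filters a list of keys for such names; the helper name and the listing are hypothetical.

```python
import posixpath


def find_hidden_keys(keys):
    """Return the S3 keys whose last path component starts with a dot."""
    return [key for key in keys if posixpath.basename(key).startswith(".")]


# Hypothetical listing of a training prefix.
listing = [
    "criteo-display-ad-challenge/train/train.csv",
    "criteo-display-ad-challenge/train/.DS_Store",
]
print(find_hidden_keys(listing))
```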

## Hyperparameter Tuning (HPO)

A single job is just one way to train a model. We can automate the search for good hyperparameters with the use of *hyperparameter tuning*. 

As in the case of a single training job, we need to create an *Estimator* with the specification for an individual job and set up the initial and fixed values of the *hyperparameters*. In addition, we set up the ranges within which the remaining hyperparameters are tuned automatically during the *HPO* process.

In the *HyperparameterTuner* specification we state how many jobs we want to run in total and how many of them we want to run in parallel.

In [None]:
hpo_sess = sagemaker.Session()

hpo_xgb = sagemaker.estimator.Estimator(container,
                                        role, 
                                        train_instance_count = 1, 
                                        train_instance_type = 'ml.m4.xlarge',
                                        output_path = 's3://{}/{}/output_hpo'.format(output_bucket, path),
                                        sagemaker_session = hpo_sess)


hpo_xgb.set_hyperparameters(eval_metric = 'logloss',
                            objective = 'binary:logistic',
                            colsample_bytree = 0.7,
                            colsample_bylevel = 0.8,
                            num_round = 75,
                            rate_drop = 0.3,
                            gamma = 0.8)


hyperparameter_ranges = {
                         'eta': ContinuousParameter(0, 1),
                         'min_child_weight': ContinuousParameter(1, 10),
                         'alpha': ContinuousParameter(0, 2),
                         'max_depth': IntegerParameter(1, 10),
                        }

objective_metric_name = 'validation:logloss'
objective_type = 'Minimize'

tuner = HyperparameterTuner(hpo_xgb,
                            objective_metric_name,
                            hyperparameter_ranges,
                            max_jobs = 20,
                            max_parallel_jobs = 5,
                            objective_type = objective_type)

In [None]:
tuner.fit({'train': s3_input_train, 'validation': s3_input_validation})

Another difference is how we track the progress of this type of job. In the previous case, logs were shipped automatically into the *notebook*. For *HPO*, we need to fetch the job status via the *Amazon SageMaker SDK*. Unfortunately, it allows fetching only the status - logs are available in *Amazon CloudWatch*.

**Beware** that with the current setup the whole *HPO* job may take 20-30 minutes.

In [None]:
smclient = boto3.client('sagemaker')

job_name = tuner.latest_tuning_job.job_name

hpo_job = smclient.describe_hyper_parameter_tuning_job(HyperParameterTuningJobName = job_name)
hpo_job['HyperParameterTuningJobStatus']
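
Instead of re-running the cell above by hand, the status can be polled in a loop. The sketch below is one possible helper (the name and parameters are ours); it takes any zero-argument callable that returns the status string, so it can be demonstrated offline with a fake status source.

```python
import time

# Statuses after which a tuning job no longer changes state.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_tuning_job(get_status, poll_seconds=60, max_polls=120):
    """Call `get_status()` until it returns a terminal status.

    `get_status` is any zero-argument callable returning the status string.
    Returns the last status seen (possibly non-terminal if polls run out).
    """
    status = get_status()
    for _ in range(max_polls - 1):
        if status in TERMINAL_STATUSES:
            break
        time.sleep(poll_seconds)
        status = get_status()
    return status


# Offline demonstration with a fake status source instead of the real API.
fake_statuses = iter(["InProgress", "InProgress", "Completed"])
print(wait_for_tuning_job(lambda: next(fake_statuses), poll_seconds=0))
```

In the notebook, the callable could wrap the call shown earlier, for example `lambda: smclient.describe_hyper_parameter_tuning_job(HyperParameterTuningJobName = job_name)['HyperParameterTuningJobStatus']`.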

# Hosting the single model

After finishing the training, *Amazon SageMaker* by default saves the model inside *S3* bucket we have specified. Moreover, based on that model we can either download the archive and use inside our source code and services when deploying, or we can leverage the hosting mechanism available in the *Amazon SageMaker* service. 

## How does it work?

After you deploy a model into production using *Amazon SageMaker* hosting services, it creates the endpoint with its configuration. 

Your client applications use the `InvokeEndpoint` API to get inferences from the model hosted at the specified endpoint. *Amazon SageMaker* strips all `POST` headers except those supported by the *API*. The service may add additional headers. 

Does it mean that everyone can call our model? No, calls to `InvokeEndpoint` are authenticated by using *AWS Signature Version 4*. 

A customer's model containers must respond to requests within 60 seconds. The model itself can have a maximum processing time of 60 seconds before responding to `/invocations`. If your model is going to take 50-60 seconds of processing time, the SDK socket timeout should be set to 70 seconds.

In [None]:
xgb_predictor = xgb.deploy(initial_instance_count = 1, instance_type = 'ml.m4.xlarge')

**Note**: the `!` at the end of the output after hosting the model means that it was deployed successfully.

# Hosting the best model from HPO

Hosting the *HPO* model is no different from hosting the model from a single job. The *Amazon SageMaker SDK* conveniently selects the best model automatically and uses it as the back-end for the endpoint.

In [None]:
xgb_predictor_hpo = tuner.deploy(initial_instance_count = 1, instance_type = 'ml.m4.xlarge')

# Evaluation

After training and hosting the best possible model, we would like to evaluate its performance with the `test_data` subset prepared when splitting the data.

As a first step, we need to prepare our hosted predictors to use a `text/csv` payload, serialized by the *Amazon SageMaker SDK* entity `csv_serializer`.

In [None]:
xgb_predictor.content_type = 'text/csv'
xgb_predictor.serializer = csv_serializer

In [None]:
xgb_predictor_hpo.content_type = 'text/csv'
xgb_predictor_hpo.serializer = csv_serializer

As a next step, we need to prepare a helper function that splits `test_data` into smaller chunks and serializes them before passing them to the predictors. 

In [None]:
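def predict(predictor, data, rows = 500):
    # Split the input into chunks of at most `rows` rows to keep each request small.
    split_array = np.array_split(data, int(data.shape[0] / float(rows) + 1))
    predictions = ''
    
    # Each response is a comma-separated string of scores; concatenate them all.
    for array in split_array:
        predictions = ','.join([predictions, predictor.predict(array).decode('utf-8')])

    # Drop the leading comma and parse the scores into a NumPy array.
    return np.fromstring(predictions[1:], sep =',')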
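
The chunking logic can be tried without a live endpoint. The sketch below re-implements the same idea against a fake predictor; `FakePredictor` and `predict_in_chunks` are hypothetical names, and the fake simply returns `0.5` for every row.

```python
import numpy as np


class FakePredictor:
    """Offline stand-in for a hosted predictor: returns 0.5 for every row."""

    def predict(self, array):
        return ",".join(["0.5"] * len(array)).encode("utf-8")


def predict_in_chunks(predictor, data, rows=500):
    """Same chunking idea as `predict`, collecting parsed floats per chunk."""
    n_chunks = int(data.shape[0] / float(rows) + 1)
    results = []
    for chunk in np.array_split(data, n_chunks):
        if len(chunk) == 0:
            continue  # guards against an empty input
        text = predictor.predict(chunk).decode("utf-8")
        results.extend(float(value) for value in text.split(","))
    return np.array(results)


features = np.zeros((1200, 39))           # 1,200 made-up rows, 39 features
scores = predict_in_chunks(FakePredictor(), features)
print(scores.shape)
```

With 1,200 rows and `rows = 500`, the data is split into `int(1200 / 500 + 1) = 3` chunks of 400 rows each.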

In [None]:
predictions = predict(xgb_predictor, test_data.drop([0], axis=1).values)

In [None]:
hpo_predictions = predict(xgb_predictor_hpo, test_data.drop([0], axis=1).values)

As a final step, we would like to compare how many of the clicks in the `test_data` subset were predicted correctly by the model from the single training job and by the model from the *HPO* jobs. 

In [None]:
rows = ['actuals']
cols = ['predictions']

In [None]:
clicks = np.round(predictions)
result = pd.crosstab(index = test_data[0], columns = clicks, rownames = rows, colnames = cols)

display("Single job results:")
display(result)
display(result.apply(lambda r: r/r.sum(), axis = 1))

In [None]:
hpo_clicks = np.round(hpo_predictions)
result_hpo = pd.crosstab(index = test_data[0], columns = hpo_clicks, rownames = rows, colnames = cols)

display("HPO job results:")
display(result_hpo)
display(result_hpo.apply(lambda r: r/r.sum(), axis = 1))

As you may expect, the model trained with the use of *HPO* works better.

Interestingly, without any significant tuning or improvements, our result would place among the top 25-30 entries on the leaderboard of the old [Kaggle competition](https://www.kaggle.com/c/criteo-display-ad-challenge/leaderboard). Impressive!
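
The training jobs above use `logloss` as the evaluation metric, which a confusion matrix of rounded predictions does not show directly. As a rough sketch, it can be computed from the labels and the predicted probabilities like this; the function is our own minimal implementation, and the sample values are made up.

```python
import math


def log_loss(labels, probabilities, eps=1e-15):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, probabilities):
        p = min(max(p, eps), 1.0 - eps)      # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)


# Made-up labels and predicted click probabilities.
actual = [1, 0, 0, 1]
predicted = [0.9, 0.2, 0.1, 0.6]
print("Log loss: {:.4f}".format(log_loss(actual, predicted)))
```

In the notebook, the same function could be applied to `test_data[0]` together with `predictions` or `hpo_predictions`; lower values are better.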

# Clean-up

To avoid incurring unnecessary charges, use the *AWS Management Console* to delete the resources that you created for this exercise.

Open the *Amazon SageMaker* console and delete the following resources:

1. The endpoint - that also deletes the ML compute instance or instances.
2. The endpoint configuration.
3. The model.
4. The notebook instance. You need to stop the instance before deleting it.

Keep in mind that *you cannot* delete the history of individual training jobs and hyperparameter optimization jobs, but it does not incur any charges.

Open the Amazon S3 console and delete the bucket that you created for storing model artifacts and the training dataset. Remember that before deleting the bucket you need to empty it by removing all objects.

Open the *IAM* console and delete the *IAM* role. If you created permission policies, you can delete them, too.

Open the *Amazon CloudWatch* console and delete all of the log groups that have names starting with `/aws/sagemaker`.

When it comes to *endpoints* you can leverage the *Amazon SageMaker SDK* for that operation:

In [None]:
sagemaker.Session().delete_endpoint(xgb_predictor.endpoint)

In [None]:
sagemaker.Session().delete_endpoint(xgb_predictor_hpo.endpoint)