# Predicting ad click likelihood

This tutorial shows how to use scikit-learn with SageMaker through the pre-built container. Scikit-learn is a popular Python machine learning framework that includes algorithms for classification, regression, clustering, dimensionality reduction, and data/feature pre-processing.

The sagemaker-python-sdk module makes it easy to run existing scikit-learn code on SageMaker. We show this by training a model on the Kaggle advertising dataset (https://www.kaggle.com/fayomi/advertising) and generating a set of predictions.

The goal is to classify whether a user clicks on an ad, based on a small set of input variables. The data exploration phase is not shown here.

Table of contents

* Upload the data for training
* Create a Scikit-learn script to train with
* Create the SageMaker Scikit Estimator
* Train the SKLearn Estimator on the advertising data
* Using the trained model to make inference requests
* Deploy the model
* Choose some data and use it for a prediction
* Endpoint cleanup
* Batch Transform
* Prepare Input Data
* Run Transform Job
* Check Output Data

First, let's import the libraries we need. The SageMaker session, the role, and the S3 prefix used for this notebook example are set up in the cells that follow.

In [1]:
import os
import numpy as np 
import pandas as pd 
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve

import sagemaker
from sagemaker import get_execution_role
from sagemaker.amazon.amazon_estimator import get_image_uri


## Reading data from S3

In [2]:
bucket = "hj-sagemaker-demo"
s3_base_path = "advertising-regression"
raw_data_path = "raw/advertising.csv"

In [3]:
df = pd.read_csv(f"s3://{bucket}/{s3_base_path}/{raw_data_path}")

In [4]:
# checking we read the data correctly
df.describe()

| | Daily Time Spent on Site | Age | Area Income | Daily Internet Usage | Male | Clicked on Ad |
|---|---|---|---|---|---|---|
| count | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 |
| mean | 65.0002 | 36.009 | 55000.00008 | 180.0001 | 0.481 | 0.5 |
| std | 15.853615 | 8.785562 | 13414.634022 | 43.902339 | 0.499889 | 0.50025 |
| min | 32.6 | 19.0 | 13996.5 | 104.78 | 0.0 | 0.0 |
| 25% | 51.36 | 29.0 | 47031.8025 | 138.83 | 0.0 | 0.0 |
| 50% | 68.215 | 35.0 | 57012.3 | 183.13 | 0.0 | 0.5 |
| 75% | 78.5475 | 42.0 | 65470.635 | 218.7925 | 1.0 | 1.0 |
| max | 91.43 | 61.0 | 79484.8 | 269.96 | 1.0 | 1.0 |


Here we could do a great deal of exploration, feature engineering, and so on. Since we already did that in another notebook, we only select the variables we are interested in and upload the training and test datasets to S3.

In [5]:
# split the data into training and test sets (70% / 30%)
X = df.loc[:, ["Age", "Area Income", "Daily Internet Usage", "Daily Time Spent on Site"]]
y = df.loc[:, ['Clicked on Ad']]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

In [6]:
# store the data locally as CSV files
data_dir = '../data/advertising-regression'
os.makedirs(data_dir, exist_ok=True)

# write the label as the first column, followed by the features, with no header or index
pd.concat([y_train, X_train], axis=1).to_csv(os.path.join(data_dir, 'train.csv'), header=False, index=False)
pd.concat([y_test, X_test], axis=1).to_csv(os.path.join(data_dir, 'test.csv'), header=False, index=False)

In [7]:
# upload the CSV files to the S3 bucket
session = sagemaker.Session(default_bucket=bucket)
role = get_execution_role()

train_location = session.upload_data(os.path.join(data_dir, 'train.csv'), key_prefix="advertising-regression/train")
test_location = session.upload_data(os.path.join(data_dir, 'test.csv'), key_prefix="advertising-regression/test")

To run our scikit-learn training script on SageMaker, we construct a `sagemaker.sklearn.estimator.SKLearn` estimator, which accepts several constructor arguments:

* entry_point: The path to the Python script that SageMaker runs for training.
* framework_version: The version of scikit-learn used by our training script.
* role: The ARN of the IAM role that SageMaker assumes to run the training job.
* train_instance_type (optional): The type of SageMaker instance to use for training.
* sagemaker_session (optional): The session used to train on SageMaker.
* hyperparameters (optional): A dictionary of hyperparameters, passed to the training script as command-line arguments.
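
As a rough sketch of how these arguments reach the entry point: SageMaker passes each hyperparameter to the script as a command-line argument and exposes the model directory and input channels through `SM_*` environment variables. The contents of `logistic_training.py` are not shown in this notebook, so the `--C` hyperparameter and the `parse_args` helper below are hypothetical.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse the arguments a SageMaker entry point typically receives."""
    parser = argparse.ArgumentParser()
    # Hypothetical hyperparameter: hyperparameters={"C": 0.5} on the estimator
    # would reach the script as the command-line arguments "--C 0.5".
    parser.add_argument("--C", type=float, default=1.0)
    # SageMaker exposes the model output directory and each input channel
    # through SM_* environment variables, used here as defaults.
    parser.add_argument("--model-dir", default=os.environ.get("SM_MODEL_DIR"))
    parser.add_argument("--train", default=os.environ.get("SM_CHANNEL_TRAIN"))
    return parser.parse_args(argv)


# Simulate the command line built from hyperparameters={"C": 0.5}; the train
# path is passed explicitly because SM_CHANNEL_TRAIN is not set locally.
args = parse_args(["--C", "0.5", "--train", "/opt/ml/input/data/train"])
print(args.C, args.train)
```

Reading the `SM_*` variables as argparse defaults keeps the script runnable both inside the container and locally, where the same values can be supplied on the command line.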

In [8]:
# instantiate an estimator for training, pointing to the logistic_training.py script

from sagemaker.sklearn.estimator import SKLearn

FRAMEWORK_VERSION = "0.23-1"
script_path = 'logistic_training.py'

sklearn = SKLearn(
    entry_point=script_path,
    framework_version=FRAMEWORK_VERSION,
    train_instance_type="ml.m5.large",
    role=role,
    sagemaker_session=session,
    hyperparameters=None)


After creating the estimator, we call the fit method to actually train the model, passing in the input data location that the script expects (the script reads its arguments with argparse).

In [9]:
sklearn.fit({'train': train_location})

's3_input' class will be renamed to 'TrainingInput' in SageMaker Python SDK v2.


2020-10-14 09:40:30 Starting - Starting the training job...
2020-10-14 09:40:32 Starting - Launching requested ML instances......
2020-10-14 09:41:33 Starting - Preparing the instances for training......
2020-10-14 09:42:33 Downloading - Downloading input data...
2020-10-14 09:43:02 Training - Downloading the training image...
2020-10-14 09:43:58 Uploading - Uploading generated training model
2020-10-14 09:43:58 Completed - Training job completed
2020-10-14 09:43:45,886 sagemaker-training-toolkit INFO     Imported framework sagemaker_sklearn_container.training
2020-10-14 09:43:45,889 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2020-10-14 09:43:45,898 sagemaker_sklearn_container.training INFO     Invoking user training script.
2020-10-14 09:43:46,152 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2020-10-14 09:43:49,205 sagemaker-training-toolkit INFO     No GPUs detected 

The logs generated by the fit method show that the trained model was saved to

`s3://hj-sagemaker-demo/sagemaker-scikit-learn-2020-10-14-08-26-21-040/output/model.tar.gz` 

Now we can go ahead and try to make predictions on the test dataset. We are going to do this through batch transform. 

The training script must define a `model_fn` function that deserializes the fitted model.
The transform job will fail otherwise.
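
The pairing between saving and loading can be sketched as follows. This is a minimal illustration, not the actual contents of `logistic_training.py`: the `model.joblib` file name, the `save_model` helper, the use of `LogisticRegression`, and the toy data are all assumptions. What matters is that `model_fn(model_dir)` loads exactly what the training code wrote into the model directory.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

MODEL_FILE = "model.joblib"  # hypothetical file name; must match in both functions


def save_model(model, model_dir):
    """What the training code would do last: persist the fitted model."""
    joblib.dump(model, os.path.join(model_dir, MODEL_FILE))


def model_fn(model_dir):
    """Called by the serving container to load the model from model_dir."""
    return joblib.load(os.path.join(model_dir, MODEL_FILE))


# Local round trip on toy data (not the advertising dataset).
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
with tempfile.TemporaryDirectory() as model_dir:
    save_model(LogisticRegression().fit(X, y), model_dir)
    restored = model_fn(model_dir)

print(restored.predict(X).shape)
```

Running the round trip locally before launching a transform job is a cheap way to catch a mismatch between the file the script saves and the file `model_fn` tries to load.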

In [25]:
# Define a SKLearn Transformer from the trained SKLearn Estimator
transformer = sklearn.transformer(instance_count=1, instance_type='ml.m5.large',  assemble_with="Line")

Parameter image will be renamed to image_uri in SageMaker Python SDK v2.


In [16]:
# write the test features only (no label column) as input for the transform job
X_test.to_csv(os.path.join(data_dir, 'test_no_y.csv'), header=False, index=False)
test_no_y_location = session.upload_data(os.path.join(data_dir, 'test_no_y.csv'), key_prefix="advertising-regression/test")

In [26]:
# Start a transform job and wait for it to finish
transformer.transform(test_no_y_location, content_type='text/csv')
print('Waiting for transform job: ' + transformer.latest_transform_job.job_name)
transformer.wait()


Waiting for transform job: sagemaker-scikit-learn-2020-10-14-10-16-02-104
...........................
2020-10-14 10:20:26,337 INFO - sagemaker-containers - No GPUs detected (normal if no gpus installed)
2020-10-14 10:20:26,339 INFO - sagemaker-containers - No GPUs detected (normal if no gpus installed)
2020-10-14 10:20:26,340 INFO - sagemaker-containers - nginx config: 
worker_processes auto;
daemon off;
pid /tmp/nginx.pid;
error_log  /dev/stderr;

worker_rlimit_nofile 4096;

events {
  worker_connections 2048;
}

http {
  include /etc/nginx/mime.types;
  default_type application/octet-stream;
  access_log /dev/stdout combined;

  upstream gunicorn {
    server unix:/tmp/gunicorn.sock;
  }

  server {
    listen 8080 deferred;
    client_max_body_size 0;

    keepalive_timeout 3;

    location ~ ^/(ping|invocations|execution-parameters) {
      proxy_set_header X-Forwarded-For $proxy_add_x_fo

The predictions of our model are written to S3 at `s3://hj-sagemaker-demo/sagemaker-scikit-learn-2020-10-14-10-16-02-104/test_no_y.csv.out`.

Let's read in the predictions and compare those with the actual values so that we can build a confusion matrix to evaluate our model. 

In [58]:
y_pred = pd.read_csv(f"s3://{bucket}/{transformer.latest_transform_job.job_name}/test_no_y.csv.out",
                     header=None)

# The transform output is a single line of the form [0, 1, 1, ...], so read_csv
# returns one row with one prediction per column. Strip the brackets and
# transpose to get one prediction per row.
y_pred = y_pred.replace([r'\[', r'\]'], ['', ''], regex=True)
y_pred_transposed = y_pred.transpose().astype(int)
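
An alternative way to parse the same output can be sketched with the standard library, assuming the `.out` file holds a single bracketed, comma-separated list as the comment above describes. The `raw_output` string and the `parse_predictions` helper are hypothetical stand-ins for the file contents and are not part of the original notebook.

```python
import json


def parse_predictions(text):
    """Parse a bracketed list such as "[0, 1, 1]" into a list of ints."""
    return [int(value) for value in json.loads(text)]


# Hypothetical contents of a transform output file.
raw_output = "[0, 1, 1, 0]"
predictions = parse_predictions(raw_output)
print(predictions)  # [0, 1, 1, 0]
```

The resulting list already has one prediction per element, so it can be passed to `confusion_matrix` without stripping brackets or transposing.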


In [59]:
from sklearn.metrics import confusion_matrix

conf_matrix = confusion_matrix(y_test, y_pred_transposed)
print(conf_matrix)


[[155   9]
 [ 16 120]]


The confusion matrix shows that the model classified 155 + 120 = 275 of the 300 test samples correctly and 9 + 16 = 25 incorrectly, an accuracy of about 91.7%.
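
Other summary metrics follow from the same four counts. The sketch below works through the arithmetic, assuming class 1 ("clicked") is treated as the positive class and using scikit-learn's layout, in which rows are actual classes and columns are predicted classes.

```python
# Counts copied from the confusion matrix printed above.
conf_matrix = [[155, 9],
               [16, 120]]
(tn, fp), (fn, tp) = conf_matrix

total = tn + fp + fn + tp
accuracy = (tp + tn) / total        # share of all samples classified correctly
precision = tp / (tp + fp)          # share of predicted clicks that were real clicks
recall = tp / (tp + fn)             # share of real clicks that were predicted
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# accuracy=0.917 precision=0.930 recall=0.882 f1=0.906
```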