In [39]:
%%sh
pip install --upgrade pip
pip install sagemaker awscli boto3 --upgrade

Requirement already up-to-date: pip in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (20.0.2)
Requirement already up-to-date: sagemaker in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (1.55.3)
Requirement already up-to-date: awscli in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (1.18.39)
Requirement already up-to-date: boto3 in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (1.12.39)


In [None]:
from IPython.core.display import HTML
HTML("<script>Jupyter.notebook.kernel.restart()</script>")

# Direct Marketing with Amazon SageMaker AutoPilot

Last update: February 6th, 2019

In [1]:
import sagemaker
import smdebug_rulesconfig as rule_configs
import boto3
import os, sys

print (sagemaker.__version__)

sess   = sagemaker.Session()
bucket = sess.default_bucket()                     
prefix = 'sagemaker/DEMO-automl-dm'
region = boto3.Session().region_name

1.55.3


In [2]:
import numpy as np 
import pandas as pd

In [3]:
!wget -N --no-check-certificate https://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank-additional.zip
!unzip -o bank-additional.zip

--2020-04-11 22:08:30--  https://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank-additional.zip
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 304 Not Modified
File ‘bank-additional.zip’ not modified on server. Omitting download.

Archive:  bank-additional.zip
  inflating: bank-additional/.DS_Store  
  inflating: __MACOSX/bank-additional/._.DS_Store  
  inflating: bank-additional/.Rhistory  
  inflating: bank-additional/bank-additional-full.csv  
  inflating: bank-additional/bank-additional-names.txt  
  inflating: bank-additional/bank-additional.csv  
  inflating: __MACOSX/._bank-additional  


Let's read the CSV file into a Pandas data frame and take a look at the first few lines.

In [4]:
# https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
data = pd.read_csv('./bank-additional/bank-additional-full.csv', sep=';')
pd.set_option('display.max_columns', 500)     # Make sure we can see all of the columns
pd.set_option('display.max_rows', 50)         # Keep the output on one page
data[:10] # Show the first 10 rows

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,duration,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,261,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,149,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
2,37,services,married,high.school,no,yes,no,telephone,may,mon,226,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,151,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
4,56,services,married,high.school,no,no,yes,telephone,may,mon,307,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
5,45,services,married,basic.9y,unknown,no,no,telephone,may,mon,198,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
6,59,admin.,married,professional.course,no,no,no,telephone,may,mon,139,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
7,41,blue-collar,married,unknown,unknown,no,no,telephone,may,mon,217,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
8,24,technician,single,professional.course,no,yes,no,telephone,may,mon,380,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
9,25,services,single,high.school,no,yes,no,telephone,may,mon,50,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no


In [5]:
data.shape # (number of lines, number of columns)

(41188, 21)

## Splitting the dataset

We split the dataset into training (95%) and test (5%) datasets. We will use the training dataset for AutoML, where it will be automatically split again for training and validation.
 
Once the model has been deployed, we'll use the test dataset to evaluate its performance.

In [6]:
# Set the seed to 123 for reproducibility
# https://pandas.pydata.org/pandas-docs/version/0.25/generated/pandas.DataFrame.sample.html
# https://docs.scipy.org/doc/numpy-1.15.1/reference/generated/numpy.split.html
train_data, test_data, _ = np.split(data.sample(frac=1, random_state=123), 
                                                  [int(0.95 * len(data)), int(len(data))])  

# Save to CSV files
train_data.to_csv('automl-train.csv', index=False, header=True, sep=',') # Need to keep column names
test_data.to_csv('automl-test.csv', index=False, header=True, sep=',')
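
The shuffle-then-cut idea behind this split can be checked without pandas. The block below is a minimal sketch using only the standard library; `shuffle_split` is a hypothetical helper, not part of the notebook, and its shuffle order will differ from the one `data.sample(random_state=123)` produces. It only illustrates that a 95% cut of 41,188 rows leaves 2,060 rows for testing.

```python
import random

def shuffle_split(rows, train_fraction=0.95, seed=123):
    """Shuffle a copy of `rows` with a fixed seed, then cut it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)      # fixed seed -> same order on every run
    cut = int(train_fraction * len(shuffled))  # index of the first test row
    return shuffled[:cut], shuffled[cut:]

# Row indices stand in for the 41188 rows of the real dataset.
train_rows, test_rows = shuffle_split(range(41188))
print(len(train_rows), len(test_rows))
```

Because the seed is fixed, calling `shuffle_split` twice on the same input returns the same two partitions.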

In [7]:
!ls -l automl*.csv

-rw-rw-r-- 1 ec2-user ec2-user  257339 Apr 11 22:08 automl-test.csv
-rw-rw-r-- 1 ec2-user ec2-user 4889516 Apr 11 22:08 automl-train.csv


**No preprocessing needed!** AutoML will take care of this, so let's just copy the training set to S3.

In [8]:
s3_input_data = sess.upload_data(path="automl-train.csv", key_prefix=prefix + "/input")
print(s3_input_data)

s3://sagemaker-us-west-1-262002448484/sagemaker/DEMO-automl-dm/input/automl-train.csv


## Setting up the SageMaker AutoPilot job

After uploading the dataset to S3, we can invoke SageMaker AutoPilot to find the best ML pipeline to train a model on this dataset. 

The required inputs for invoking a SageMaker AutoML job are the dataset location in S3, the name of the column of the dataset you want to predict (`y` in this case) and an IAM role.

In [9]:
from sagemaker.automl.automl import AutoML
# https://sagemaker.readthedocs.io/en/stable/automl.html

role = sagemaker.get_execution_role()
job_tags =[{ 
         "Key": "xgb-c1-automl",
         "Value": "job1"
      }]
auto_ml_job = AutoML(
    role = role,                                              # IAM permissions for SageMaker
    sagemaker_session = sess,                                 # 
    target_attribute_name = 'y',                              # The column we want to predict
    output_path = 's3://{}/{}/output'.format(bucket,prefix),  # Save artefacts here
    max_candidates = 100,                                     # Default is 500 
    base_job_name = 'xgb-c1',                                 # Prefix that makes the job easy to search for
    tags = job_tags,
    max_runtime_per_training_job_in_seconds = 600, 
    total_job_runtime_in_seconds = 3600
)
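
To see how these arguments relate to the underlying API, the sketch below assembles the equivalent request as a plain dictionary. The field names are the ones `describe_auto_ml_job()` reports for this job; `build_automl_request` is a hypothetical helper, and the bucket, job name, and role ARN in the usage lines are placeholders. It assumes the low-level boto3 `create_auto_ml_job` call accepts a request of this shape, and it does not call AWS.

```python
def build_automl_request(job_name, s3_uri, target, output_path, role_arn,
                         max_candidates=100, per_job_seconds=600, total_seconds=3600):
    """Assemble an AutoML job request dict (hypothetical helper, no AWS call)."""
    return {
        "AutoMLJobName": job_name,
        "InputDataConfig": [{
            "DataSource": {"S3DataSource": {"S3DataType": "S3Prefix", "S3Uri": s3_uri}},
            "TargetAttributeName": target,          # the column to predict
        }],
        "OutputDataConfig": {"S3OutputPath": output_path},
        "RoleArn": role_arn,
        "AutoMLJobConfig": {"CompletionCriteria": {
            "MaxCandidates": max_candidates,
            "MaxRuntimePerTrainingJobInSeconds": per_job_seconds,
            "MaxAutoMLJobRuntimeInSeconds": total_seconds,
        }},
    }

# Placeholder values, for illustration only.
request = build_automl_request(
    job_name="demo-job",
    s3_uri="s3://example-bucket/input/automl-train.csv",
    target="y",
    output_path="s3://example-bucket/output",
    role_arn="arn:aws:iam::123456789012:role/ExampleRole",
)
# Assumed usage (not executed here):
# boto3.client("sagemaker").create_auto_ml_job(**request)
```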

## Launching the SageMaker AutoPilot job

We can now launch the job by calling the `fit()` API.

In [11]:
auto_ml_job.fit(inputs=s3_input_data, logs=False, wait=False)

In [12]:
auto_ml_job.describe_auto_ml_job()

{'AutoMLJobName': 'xgb-c1-2020-04-11-22-09-21-981',
 'AutoMLJobArn': 'arn:aws:sagemaker:us-west-1:262002448484:automl-job/xgb-c1-2020-04-11-22-09-21-981',
 'InputDataConfig': [{'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix',
     'S3Uri': 's3://sagemaker-us-west-1-262002448484/sagemaker/DEMO-automl-dm/input/automl-train.csv'}},
   'TargetAttributeName': 'y'}],
 'OutputDataConfig': {'S3OutputPath': 's3://sagemaker-us-west-1-262002448484/sagemaker/DEMO-automl-dm/output'},
 'RoleArn': 'arn:aws:iam::262002448484:role/service-role/AmazonSageMaker-ExecutionRole-20190606T095855',
 'AutoMLJobConfig': {'CompletionCriteria': {'MaxCandidates': 100,
   'MaxRuntimePerTrainingJobInSeconds': 600,
   'MaxAutoMLJobRuntimeInSeconds': 3600},
  'SecurityConfig': {'EnableInterContainerTrafficEncryption': False}},
 'CreationTime': datetime.datetime(2020, 4, 11, 22, 9, 22, 132000, tzinfo=tzlocal()),
 'LastModifiedTime': datetime.datetime(2020, 4, 11, 22, 9, 23, 241000, tzinfo=tzlocal()),
 'AutoMLJo

### Tracking the progress of the AutoPilot job
A SageMaker AutoPilot job consists of four high-level steps:
* Data Preprocessing, where the dataset is split into train and validation sets.
* Recommending Pipelines, where the dataset is analyzed and SageMaker AutoPilot comes up with a list of ML pipelines that should be tried out on the dataset.
* Automatic Feature Engineering, where SageMaker AutoPilot performs feature transformation on individual features of the dataset as well as at an aggregate level.
* ML pipeline selection and hyperparameter tuning, where the top performing pipeline is selected along with the optimal hyperparameters for the training algorithm (the last stage of the pipeline). 

In [13]:
from time import sleep

job = auto_ml_job.describe_auto_ml_job()
job_status = job['AutoMLJobStatus']
job_sec_status = job['AutoMLJobSecondaryStatus']

if job_status not in ('Stopped', 'Failed'):
    while job_status == 'InProgress' and job_sec_status == 'AnalyzingData':
        sleep(30)
        job = auto_ml_job.describe_auto_ml_job()
        job_status = job['AutoMLJobStatus']
        job_sec_status = job['AutoMLJobSecondaryStatus']
        print (job_status, job_sec_status)
    print("Data analysis complete")

InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress FeatureEngineering
Data analysis complete
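
The same polling pattern recurs for each stage of the job, so it can be factored into one function. The block below is a sketch under assumptions: `wait_while` and `fake_describe` are hypothetical names, and the fake function replaces `auto_ml_job.describe_auto_ml_job` so the example runs offline. Only the two status keys used in this notebook are read.

```python
import time

def wait_while(describe, secondary_status, poll_seconds=30, sleep=time.sleep):
    """Poll `describe()` until the job leaves `secondary_status`; return the last description."""
    job = describe()
    while (job["AutoMLJobStatus"] == "InProgress"
           and job["AutoMLJobSecondaryStatus"] == secondary_status):
        sleep(poll_seconds)
        job = describe()
        print(job["AutoMLJobStatus"], job["AutoMLJobSecondaryStatus"])
    return job

# Offline stand-in for auto_ml_job.describe_auto_ml_job (hypothetical).
_fake_states = iter(["AnalyzingData", "AnalyzingData", "FeatureEngineering"])

def fake_describe():
    return {"AutoMLJobStatus": "InProgress",
            "AutoMLJobSecondaryStatus": next(_fake_states)}

# The no-op sleep keeps the demonstration instantaneous.
final = wait_while(fake_describe, "AnalyzingData", sleep=lambda seconds: None)
```

With a real job, the call would look like `wait_while(auto_ml_job.describe_auto_ml_job, 'AnalyzingData')`, and the same function can be reused for the `FeatureEngineering` and `ModelTuning` stages.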


## Viewing notebooks generated by SageMaker AutoPilot
Once data analysis is complete, SageMaker AutoPilot generates two notebooks: 
* Data exploration,
* Candidate definition.

In [14]:
job = auto_ml_job.describe_auto_ml_job()
job_candidate_notebook = job['AutoMLJobArtifacts']['CandidateDefinitionNotebookLocation']
job_data_notebook = job['AutoMLJobArtifacts']['DataExplorationNotebookLocation']

print(job_candidate_notebook)
print(job_data_notebook)

s3://sagemaker-us-west-1-262002448484/sagemaker/DEMO-automl-dm/output/xgb-c1-2020-04-11-22-09-21-981/sagemaker-automl-candidates/pr-1-4b34747772154bc0b1d5924ce17aa391a9917a53f2084af6b9542512b9/notebooks/SageMakerAutopilotCandidateDefinitionNotebook.ipynb
s3://sagemaker-us-west-1-262002448484/sagemaker/DEMO-automl-dm/output/xgb-c1-2020-04-11-22-09-21-981/sagemaker-automl-candidates/pr-1-4b34747772154bc0b1d5924ce17aa391a9917a53f2084af6b9542512b9/notebooks/SageMakerAutopilotDataExplorationNotebook.ipynb


Let's copy these two notebooks.

In [15]:
%%sh -s $job_candidate_notebook $job_data_notebook
aws s3 cp $1 .
aws s3 cp $2 .

download: s3://sagemaker-us-west-1-262002448484/sagemaker/DEMO-automl-dm/output/xgb-c1-2020-04-11-22-09-21-981/sagemaker-automl-candidates/pr-1-4b34747772154bc0b1d5924ce17aa391a9917a53f2084af6b9542512b9/notebooks/SageMakerAutopilotCandidateDefinitionNotebook.ipynb to ./SageMakerAutopilotCandidateDefinitionNotebook.ipynb
download: s3://sagemaker-us-west-1-262002448484/sagemaker/DEMO-automl-dm/output/xgb-c1-2020-04-11-22-09-21-981/sagemaker-automl-candidates/pr-1-4b34747772154bc0b1d5924ce17aa391a9917a53f2084af6b9542512b9/notebooks/SageMakerAutopilotDataExplorationNotebook.ipynb to ./SageMakerAutopilotDataExplorationNotebook.ipynb
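
If you would rather fetch the notebooks from Python than from the shell, the S3 URI first has to be split into a bucket name and an object key. The block below is a minimal sketch of that step; `split_s3_uri` is a hypothetical helper and the URI is a placeholder, not one of the job's real artifacts.

```python
from urllib.parse import urlparse

def split_s3_uri(uri):
    """Split 's3://bucket/some/key' into ('bucket', 'some/key')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("not an S3 URI: %r" % uri)
    return parsed.netloc, parsed.path.lstrip("/")

# Placeholder URI, for illustration only.
bucket_name, key = split_s3_uri("s3://example-bucket/some/prefix/notebooks/Example.ipynb")
local_name = key.rsplit("/", 1)[-1]   # file name to save under
print(bucket_name, key, local_name)
# Assumed usage (not executed here):
# boto3.client("s3").download_file(bucket_name, key, local_name)
```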


Go back to the folder view, and open these notebooks. Lots of useful information in there!

SageMaker AutoPilot then launches feature engineering, and prepares different training and validation datasets.

In [16]:
job = auto_ml_job.describe_auto_ml_job()
job_status = job['AutoMLJobStatus']
job_sec_status = job['AutoMLJobSecondaryStatus']

if job_status not in ('Stopped', 'Failed'):
    while job_status == 'InProgress' and job_sec_status == 'FeatureEngineering':
        sleep(30)
        job = auto_ml_job.describe_auto_ml_job()
        job_status = job['AutoMLJobStatus']
        job_sec_status = job['AutoMLJobSecondaryStatus']
        print (job_status, job_sec_status)
    print("Feature engineering complete")

InProgress FeatureEngineering
InProgress FeatureEngineering
InProgress FeatureEngineering
InProgress FeatureEngineering
InProgress FeatureEngineering
InProgress FeatureEngineering
InProgress FeatureEngineering
InProgress ModelTuning
Feature engineering complete


Once feature engineering is complete, SageMaker AutoPilot launches Automatic Model Tuning on the different candidates. While model tuning is running, we can explore its progress with SageMaker Experiments.

In [23]:
import pandas as pd
from sagemaker.analytics import ExperimentAnalytics, TrainingJobAnalytics

exp = ExperimentAnalytics(
    sagemaker_session=sess, 
    experiment_name=job['AutoMLJobName'] + '-aws-auto-ml-job',
)

df = exp.dataframe()
print("Number of jobs: ", len(df))

# Move metric to first column
df = pd.concat([df['ObjectiveMetric - Max'], df.drop(['ObjectiveMetric - Max'], axis=1)], axis=1)
# Show top 5 jobs
df.sort_values('ObjectiveMetric - Max', ascending=False)[:5]

Number of jobs:  121


Unnamed: 0,ObjectiveMetric - Max,TrialComponentName,DisplayName,SourceArn,SageMaker.ImageUri,SageMaker.InstanceCount,SageMaker.InstanceType,SageMaker.VolumeSizeInGB,_tuning_objective_metric,alpha,colsample_bytree,eta,gamma,lambda,max_depth,min_child_weight,num_round,objective,subsample,ObjectiveMetric - Min,ObjectiveMetric - Avg,ObjectiveMetric - StdDev,ObjectiveMetric - Last,ObjectiveMetric - Count,validation:error - Min,validation:error - Max,validation:error - Avg,validation:error - StdDev,validation:error - Last,validation:error - Count,validation:accuracy - Min,validation:accuracy - Max,validation:accuracy - Avg,validation:accuracy - StdDev,validation:accuracy - Last,validation:accuracy - Count,train:error - Min,train:error - Max,train:error - Avg,train:error - StdDev,train:error - Last,train:error - Count,train:accuracy - Min,train:accuracy - Max,train:accuracy - Avg,train:accuracy - StdDev,train:accuracy - Last,train:accuracy - Count,binary_classifier_model_selection_criteria,l1,learning_rate,loss,mini_batch_size,num_models,positive_example_weight_mult,predictor_type,wd,validation:objective_loss - Min,validation:objective_loss - Max,validation:objective_loss - Avg,validation:objective_loss - StdDev,validation:objective_loss - Last,validation:objective_loss - Count,train:progress - Min,train:progress - Max,train:progress - Avg,train:progress - StdDev,train:progress - Last,train:progress - Count,validation:recall - Min,validation:recall - Max,validation:recall - Avg,validation:recall - StdDev,validation:recall - Last,validation:recall - Count,validation:binary_classification_accuracy - Min,validation:binary_classification_accuracy - Max,validation:binary_classification_accuracy - Avg,validation:binary_classification_accuracy - StdDev,validation:binary_classification_accuracy - Last,validation:binary_classification_accuracy - Count,train:throughput - Min,train:throughput - Max,train:throughput - Avg,train:throughput - StdDev,train:throughput - 
Last,train:throughput - Count,train:objective_loss - Min,train:objective_loss - Max,train:objective_loss - Avg,train:objective_loss - StdDev,train:objective_loss - Last,train:objective_loss - Count,validation:objective_loss:final - Min,validation:objective_loss:final - Max,validation:objective_loss:final - Avg,validation:objective_loss:final - StdDev,validation:objective_loss:final - Last,validation:objective_loss:final - Count,validation:binary_f_beta - Min,validation:binary_f_beta - Max,validation:binary_f_beta - Avg,validation:binary_f_beta - StdDev,validation:binary_f_beta - Last,validation:binary_f_beta - Count,validation:precision - Min,validation:precision - Max,validation:precision - Avg,validation:precision - StdDev,validation:precision - Last,validation:precision - Count,SageMaker.ModelName,SageMaker.ModelPrimary.DataUrl,SageMaker.ModelPrimary.Image,processor_module,sagemaker_program,sagemaker_submit_directory,input_channel_mode,job_name,label_col
44,0.917827,tuning-job-1-b8e76c119ad1486ebf-056-9a194682-a...,tuning-job-1-b8e76c119ad1486ebf-056-9a194682-a...,arn:aws:sagemaker:us-west-1:262002448484:train...,746614075791.dkr.ecr.us-west-1.amazonaws.com/s...,1.0,ml.m5.4xlarge,50.0,validation:accuracy,0.000238,0.903418,0.651562,24.086601,0.003371,5.0,0.000755,426.0,binary:hinge,0.901838,0.835527,0.913341,0.015488,0.917444,28.0,0.082173,0.164473,0.086659,0.015488,0.082556,28.0,0.835527,0.917827,0.913341,0.015488,0.917444,28.0,0.080344,0.163722,0.084894,0.01571,0.080408,28.0,0.836278,0.919656,0.915106,0.01571,0.919592,28.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
47,0.917827,tuning-job-1-b8e76c119ad1486ebf-053-77caf8d2-a...,tuning-job-1-b8e76c119ad1486ebf-053-77caf8d2-a...,arn:aws:sagemaker:us-west-1:262002448484:train...,746614075791.dkr.ecr.us-west-1.amazonaws.com/s...,1.0,ml.m5.4xlarge,50.0,validation:accuracy,0.059646,0.973912,0.297536,0.000137,0.029299,7.0,1.6e-05,121.0,binary:hinge,0.967283,0.112588,0.891336,0.13241,0.915655,37.0,0.082173,0.887412,0.108664,0.13241,0.084345,37.0,0.112588,0.917827,0.891336,0.13241,0.915655,37.0,0.05648,0.887519,0.090502,0.136047,0.05648,37.0,0.112481,0.94352,0.909498,0.136047,0.94352,37.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
71,0.917572,tuning-job-1-b8e76c119ad1486ebf-029-b0b9294f-a...,tuning-job-1-b8e76c119ad1486ebf-029-b0b9294f-a...,arn:aws:sagemaker:us-west-1:262002448484:train...,746614075791.dkr.ecr.us-west-1.amazonaws.com/s...,1.0,ml.m5.4xlarge,50.0,validation:accuracy,7.4e-05,0.801532,0.221304,5e-06,0.016209,8.0,0.000164,56.0,binary:hinge,0.909361,0.112588,0.816568,0.265369,0.917061,17.0,0.082428,0.887412,0.183432,0.265369,0.082939,17.0,0.112588,0.917572,0.816568,0.265369,0.917061,17.0,0.059323,0.887519,0.167646,0.271588,0.059323,17.0,0.112481,0.940677,0.832354,0.271588,0.940677,17.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
35,0.917444,tuning-job-1-b8e76c119ad1486ebf-065-53a4cfb9-a...,tuning-job-1-b8e76c119ad1486ebf-065-53a4cfb9-a...,arn:aws:sagemaker:us-west-1:262002448484:train...,746614075791.dkr.ecr.us-west-1.amazonaws.com/s...,1.0,ml.m5.4xlarge,50.0,validation:accuracy,2.6e-05,0.775335,0.03188,0.000105,2e-06,9.0,2.492775,512.0,binary:hinge,0.579126,0.112588,0.723321,0.330635,0.916933,70.0,0.082556,0.887412,0.276679,0.330635,0.083067,70.0,0.112588,0.917444,0.723321,0.330635,0.916933,70.0,0.062965,0.887519,0.263995,0.33785,0.062965,70.0,0.112481,0.937035,0.736005,0.33785,0.937035,70.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
37,0.917188,tuning-job-1-b8e76c119ad1486ebf-062-6acfde2b-a...,tuning-job-1-b8e76c119ad1486ebf-062-6acfde2b-a...,arn:aws:sagemaker:us-west-1:262002448484:train...,746614075791.dkr.ecr.us-west-1.amazonaws.com/s...,1.0,ml.m5.4xlarge,50.0,validation:accuracy,0.007904,0.642642,0.039466,8.863009,2e-06,6.0,4.7e-05,459.0,binary:hinge,0.935931,0.112588,0.749656,0.315982,0.917061,67.0,0.082812,0.887412,0.250344,0.315982,0.082939,67.0,0.112588,0.917188,0.749656,0.315982,0.917061,67.0,0.07913,0.887519,0.248652,0.3169,0.079194,67.0,0.112481,0.92087,0.751348,0.3169,0.920806,67.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


In [18]:
job = auto_ml_job.describe_auto_ml_job()
job_status = job['AutoMLJobStatus']
job_sec_status = job['AutoMLJobSecondaryStatus']

if job_status not in ('Stopped', 'Failed'):
    while job_status == 'InProgress' and job_sec_status == 'ModelTuning':
        sleep(30)
        job = auto_ml_job.describe_auto_ml_job()
        job_status = job['AutoMLJobStatus']
        job_sec_status = job['AutoMLJobSecondaryStatus']
        print (job_status, job_sec_status)
    print("Model tuning complete")

InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress 

## Deploying the best candidate
Now that we have successfully completed the AutoML job on our dataset and visualized the trials, we can create a model from any of the trials with a single API call and then deploy that model for online or batch prediction using [Inference Pipelines](https://docs.aws.amazon.com/sagemaker/latest/dg/inference-pipelines.html). For this notebook, we deploy only the best performing trial for inference.

The best candidate is the one we're really interested in.

In [43]:
from time import strftime, gmtime
timestamp = strftime('%d-%H-%M-%S', gmtime())

best_cand=auto_ml_job.best_candidate()

print("Best Candidate: {}".format(best_cand['CandidateName']))
print("Metric: {} - {:0.3f}".format(best_cand['FinalAutoMLJobObjectiveMetric']['MetricName'], \
                                    best_cand['FinalAutoMLJobObjectiveMetric']['Value']))

endpoint_name = job['AutoMLJobName']+'-'+timestamp

Best Candidate: tuning-job-1-b8e76c119ad1486ebf-078-bfb734b6
Metric: validation:accuracy - 0.919


In [44]:
#if no candidate is specified, the best candidate is deployed by default
auto_ml_job.deploy(
    initial_instance_count = 1,
    instance_type = 'ml.m4.xlarge',
    endpoint_name = endpoint_name
)

---------------!

## Scoring the best candidate

Let's predict and score the test set. We'll compute metrics ourselves just for fun.

In [45]:
from sagemaker.predictor import csv_serializer, RealTimePredictor
from sagemaker.content_types import CONTENT_TYPE_CSV

predictor = RealTimePredictor(
    endpoint=endpoint_name, 
    sagemaker_session=sess, 
    serializer=csv_serializer,
    content_type=CONTENT_TYPE_CSV, 
    accept='text/csv'
)

In [46]:
tp = tn = fp = fn = count = 0

with open('automl-test.csv') as f:
    lines = f.readlines()
    for l in lines[1:]:   # Skip header
        l = l.split(',')  # Split CSV line into feature array
        label = l[-1]     # Store 'yes'/'no' label
        l = l[:-1]        # Remove label
        l = ','.join(l)   # Rebuild CSV line without label
                
        response = predictor.predict(l)
        response = response.decode("utf-8")
        #print ("label %s response %s" %(label,response))

        if 'yes' in label:
            # Sample is positive
            if 'yes' in response:
                # True positive
                tp=tp+1
            else:
                # False negative
                fn=fn+1
        else:
            # Sample is negative
            if 'no' in response:
                # True negative
                tn=tn+1
            else:
                # False positive
                fp=fp+1
        count = count+1
        if (count % 100 == 0):   
            sys.stdout.write(str(count)+' ')
            
print ("Done")

100 200 300 400 500 600 700 800 900 1000 1100 1200 1300 1400 1500 1600 1700 1800 1900 2000 Done
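
Splitting each line on `','` works for this dataset because no field contains a comma. A more defensive variant, sketched below with the standard `csv` module, also copes with quoted fields and strips the trailing newline from the label. `split_features_label` is a hypothetical helper, and the sample lines are shortened stand-ins for real rows.

```python
import csv
import io

def split_features_label(line):
    """Return (CSV payload without the last field, last field) for one CSV line."""
    fields = next(csv.reader([line.rstrip("\r\n")]))
    buffer = io.StringIO()
    # Re-serialize the features so that any quoting is preserved.
    csv.writer(buffer, lineterminator="").writerow(fields[:-1])
    return buffer.getvalue(), fields[-1]

payload, label = split_features_label("56,housemaid,married,no\n")
quoted_payload, quoted_label = split_features_label('37,"admin, senior",yes\n')
print(payload, label)
```

The returned payload can be passed to `predictor.predict()` in place of the manually rebuilt line, and the label can be compared with `==` instead of a substring test.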


In [35]:
#Confusion matrix
print ("%d %d" % (tn, fp))
print ("%d %d" % (fn, tp))

accuracy  = (tp+tn)/(tp+tn+fp+fn)
precision = tp/(tp+fp)
recall    = tp/(tp+fn)   # Recall is measured on the positive class
f1        = (2*precision*recall)/(precision+recall)

print ("%.4f %.4f %.4f %.4f" % (accuracy, precision, recall, f1))

1757 65
107 131
0.9165 0.6684 0.5504 0.6037
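
Precision and recall are both defined on the positive class: precision is the share of predicted positives that are correct, and recall is the share of actual positives that were found. The sketch below wraps the four metrics in a function that also guards against division by zero; `classification_metrics` is a hypothetical helper, and the counts passed to it are the ones from the confusion matrix above.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0   # correct among predicted positives
    recall = tp / (tp + fn) if (tp + fn) else 0.0      # found among actual positives
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

metrics = classification_metrics(tp=131, tn=1757, fp=65, fn=107)
print("%.4f %.4f %.4f %.4f" % (metrics["accuracy"], metrics["precision"],
                               metrics["recall"], metrics["f1"]))
```

On an imbalanced dataset such as this one, accuracy alone can look strong while recall on the positive class stays modest, which is why the metrics are worth reading together.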


## Deleting the endpoint
Once we're done predicting, we can delete the endpoint (and stop paying for it).

In [37]:
# Delete the endpoint (comment this line out to keep it running)
sess.delete_endpoint(predictor.endpoint)

The SageMaker AutoML job creates many underlying artifacts such as dataset splits, preprocessing scripts, preprocessed data, etc. Let's delete them.

In [38]:
import boto3

job_outputs_prefix = '{}/output/{}'.format(prefix, job['AutoMLJobName'])
print(job_outputs_prefix)

s3_bucket =boto3.resource('s3').Bucket(bucket)
# Delete every object under the job's output prefix (comment the next line out to keep them)
s3_bucket.objects.filter(Prefix=job_outputs_prefix).delete()

sagemaker/DEMO-automl-dm/output/automl-2020-04-10-00-36-07-848


[{'ResponseMetadata': {'RequestId': '3219CD06C5C2DE40',
   'HostId': 'ruA4IveVGknD269VWkT3TOFjhLsp5BJaJiBfV68w9Oq2EKFdW435AKugUNUEE5w3zBX5uZ1IT/w=',
   'HTTPStatusCode': 200,
   'HTTPHeaders': {'x-amz-id-2': 'ruA4IveVGknD269VWkT3TOFjhLsp5BJaJiBfV68w9Oq2EKFdW435AKugUNUEE5w3zBX5uZ1IT/w=',
    'x-amz-request-id': '3219CD06C5C2DE40',
    'date': 'Fri, 10 Apr 2020 04:17:37 GMT',
    'connection': 'close',
    'content-type': 'application/xml',
    'transfer-encoding': 'chunked',
    'server': 'AmazonS3'},
   'RetryAttempts': 0},
  'Deleted': [{'Key': 'sagemaker/DEMO-automl-dm/output/automl-2020-04-10-00-36-07-848/transformed-data/dpp2/rpb/train/chunk_81.csv.out'},
   {'Key': 'sagemaker/DEMO-automl-dm/output/automl-2020-04-10-00-36-07-848/transformed-data/dpp6/csv/train/chunk_47.csv.out'},
   {'Key': 'sagemaker/DEMO-automl-dm/output/automl-2020-04-10-00-36-07-848/transformed-data/dpp0/csv/train/chunk_27.csv.out'},
   {'Key': 'sagemaker/DEMO-automl-dm/output/automl-2020-04-10-00-36-07-848/tra