# Train a Model with SageMaker Autopilot

We will use Autopilot to predict the sentiment of customer reviews, using the review's star rating as the target. Autopilot implements a white-box approach to AutoML: it exposes the notebooks and scripts that it generates.

<img src="img/autopilot.png" width="80%" align="left">

# Introduction

Amazon SageMaker Autopilot is a service to perform automated machine learning (AutoML) on your datasets.  Autopilot is available through the UI or AWS SDK.  In this notebook, we will use the AWS SDK to create and deploy a text processing and sentiment classification machine learning pipeline.

# Prerequisite
Make sure the previous notebook has run fully and prepared the dataset.

# Setup

Let's start by specifying:

* The S3 bucket and prefix to use to train our model.  _Note:  This should be in the same region as this notebook._
* The IAM role used by this notebook, which needs access to your data.

# Note:  This notebook will take some time.  Feel free to continue to the next notebooks whenever you are waiting for the current notebook to finish.
We do this throughout the entire workshop as some of these notebooks may run for a while.

In [1]:
import boto3
import sagemaker
import pandas as pd

sess   = sagemaker.Session()
bucket = sess.default_bucket()
role = sagemaker.get_execution_role()
region = boto3.Session().region_name

sm = boto3.Session().client(service_name='sagemaker', region_name=region)

# Dataset

In [2]:
%store -r header_train_s3_uri

print(header_train_s3_uri)

s3://sagemaker-us-west-2-393371431575/data/amazon_reviews_us_Digital_Software_v1_00_header.csv


In [3]:
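# `%store -r` leaves the variable undefined if nothing was stored, so look it up safely
if not globals().get('header_train_s3_uri'):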
    print('***********************************************************************')
    print('**************** PLEASE RE-RUN THE PREVIOUS NOTEBOOK ******************')
    print('**************** THIS NOTEBOOK WILL NOT RUN PROPERLY ******************')
    print('***********************************************************************')

In [4]:
!aws s3 ls $header_train_s3_uri

2020-07-25 15:59:24   13643033 amazon_reviews_us_Digital_Software_v1_00_header.csv
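The same existence check can be done from Python instead of the AWS CLI. The following is a minimal sketch, not part of the original workshop: `split_s3_uri` and `s3_object_size` are hypothetical helper names, and the only AWS call assumed is the S3 client's `head_object`, which raises an error if the object is missing.

```python
def split_s3_uri(s3_uri):
    """Split 's3://bucket/key' into (bucket, key); raise ValueError otherwise."""
    prefix = 's3://'
    if not s3_uri.startswith(prefix):
        raise ValueError('Not an S3 URI: {!r}'.format(s3_uri))
    bucket_name, _, key = s3_uri[len(prefix):].partition('/')
    if not bucket_name:
        raise ValueError('Missing bucket name: {!r}'.format(s3_uri))
    return bucket_name, key


def s3_object_size(s3_client, s3_uri):
    """Return the object's size in bytes; head_object raises if it is missing."""
    bucket_name, key = split_s3_uri(s3_uri)
    response = s3_client.head_object(Bucket=bucket_name, Key=key)
    return response['ContentLength']


# Example (requires AWS credentials), so it is left commented out:
# import boto3
# print(s3_object_size(boto3.client('s3'), header_train_s3_uri))
```

Passing the client in as an argument keeps the helper easy to try with a stand-in object before pointing it at a real bucket.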


# Set Up the S3 Location for the Autopilot-Generated Assets
This includes Jupyter Notebooks (Analysis), Python Scripts (Feature Engineering), and Trained Models.

In [5]:
prefix_model_output = 'models/autopilot'

model_output_s3_uri = 's3://{}/{}'.format(bucket, prefix_model_output)

print(model_output_s3_uri)


s3://sagemaker-us-west-2-393371431575/models/autopilot


In [6]:
max_candidates = 3

job_config = {
    'CompletionCriteria': {
      'MaxRuntimePerTrainingJobInSeconds': 600,
      'MaxCandidates': max_candidates,
      'MaxAutoMLJobRuntimeInSeconds': 3600
    },
}

input_data_config = [{
      'DataSource': {
        'S3DataSource': {
          'S3DataType': 'S3Prefix',
          'S3Uri': '{}'.format(header_train_s3_uri)
        }
      },
      'TargetAttributeName': 'star_rating'
    }
]

output_data_config = {
    'S3OutputPath': '{}'.format(model_output_s3_uri)
}
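To see how the three completion criteria relate, here is a quick arithmetic sketch. It assumes the worst case where every candidate runs one after another and uses its full per-job limit; Autopilot may run training jobs in parallel, so treat the result as a rough upper bound rather than a prediction. The helper name is hypothetical.

```python
def sequential_training_budget_seconds(completion_criteria):
    """Upper bound on training time, assuming candidates run sequentially
    and each one uses its full per-job limit (a simplifying assumption)."""
    return (completion_criteria['MaxCandidates']
            * completion_criteria['MaxRuntimePerTrainingJobInSeconds'])


# Same values as the job configuration above
criteria = {
    'MaxRuntimePerTrainingJobInSeconds': 600,
    'MaxCandidates': 3,
    'MaxAutoMLJobRuntimeInSeconds': 3600,
}

training_budget = sequential_training_budget_seconds(criteria)
headroom = criteria['MaxAutoMLJobRuntimeInSeconds'] - training_budget

print('Worst-case sequential training time (s):', training_budget)  # 1800
print('Overall limit not claimed by that bound (s):', headroom)     # 1800
```

If the headroom were negative, the overall limit could end the job before all candidates had a chance to train.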

# Launch the SageMaker Autopilot job

We can now launch the job by calling the `create_auto_ml_job` API.

In [7]:
from time import gmtime, strftime, sleep
timestamp_suffix = strftime('%d-%H-%M-%S', gmtime())

auto_ml_job_name = 'automl-dm-' + timestamp_suffix
print('AutoMLJobName: ' + auto_ml_job_name)

AutoMLJobName: automl-dm-25-16-11-45


_Note that we are not specifying the `ProblemType`.  Autopilot will automatically detect whether the problem is regression or classification (binary or multi-class)._

In [8]:
sm.create_auto_ml_job(AutoMLJobName=auto_ml_job_name,
                      InputDataConfig=input_data_config,
                      OutputDataConfig=output_data_config,
                      AutoMLJobConfig=job_config,
                      RoleArn=role)

{'AutoMLJobArn': 'arn:aws:sagemaker:us-west-2:393371431575:automl-job/automl-dm-25-16-11-45',
 'ResponseMetadata': {'RequestId': '27f7eb04-2316-4da7-b567-1709766f15b1',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '27f7eb04-2316-4da7-b567-1709766f15b1',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '92',
   'date': 'Sat, 25 Jul 2020 16:11:45 GMT'},
  'RetryAttempts': 0}}
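If we preferred to set the problem type ourselves instead of relying on automatic detection, the request could be assembled as in the sketch below. `build_auto_ml_request` is a hypothetical helper. It adds the `ProblemType` and `AutoMLJobObjective` arguments of `create_auto_ml_job` only as a pair, which is our assumption about how Autopilot expects them; please check the API reference before relying on it.

```python
def build_auto_ml_request(job_name, input_data_config, output_data_config,
                          job_config, role_arn,
                          problem_type=None, objective_metric=None):
    """Build keyword arguments for create_auto_ml_job.

    Assumption: ProblemType and AutoMLJobObjective are supplied together
    or not at all, so this helper rejects a request that sets only one.
    """
    request = {
        'AutoMLJobName': job_name,
        'InputDataConfig': input_data_config,
        'OutputDataConfig': output_data_config,
        'AutoMLJobConfig': job_config,
        'RoleArn': role_arn,
    }
    if problem_type is not None or objective_metric is not None:
        if problem_type is None or objective_metric is None:
            raise ValueError('Set problem_type and objective_metric together')
        request['ProblemType'] = problem_type
        request['AutoMLJobObjective'] = {'MetricName': objective_metric}
    return request


# Example usage (would launch a real job, so it is left commented out):
# request = build_auto_ml_request(auto_ml_job_name, input_data_config,
#                                 output_data_config, job_config, role,
#                                 problem_type='MulticlassClassification',
#                                 objective_metric='Accuracy')
# sm.create_auto_ml_job(**request)
```

Star ratings from 1 to 5 form a multi-class target, which is why the commented example uses `MulticlassClassification`.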

# Tracking the progress of the Autopilot job

A SageMaker Autopilot job consists of the following high-level steps:
* _Data Analysis_ where the data is summarized and analyzed to determine which feature engineering techniques, hyper-parameters, and models to explore.
* _Feature Engineering_ where the data is scrubbed, balanced, combined, and split into train and validation.
* _Model Training and Tuning_ where the top performing features, hyper-parameters, and models are selected and trained.
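Each of these steps appears as a secondary status on the job (`AnalyzingData`, `FeatureEngineering`, `ModelTuning`), so waiting for a step to finish comes down to polling `describe_auto_ml_job`. The sketch below wraps that polling in one reusable function. `wait_while_secondary_status` is a hypothetical helper name, and the `sleep_fn` parameter exists only so the loop can be exercised without real delays.

```python
import time


def wait_while_secondary_status(sm_client, job_name, secondary_status,
                                poll_seconds=30, sleep_fn=time.sleep):
    """Poll describe_auto_ml_job until the job leaves `secondary_status`.

    Returns the last job description. Also stops as soon as the job is no
    longer 'InProgress' (for example 'Failed', 'Stopped', or 'Completed').
    """
    while True:
        job = sm_client.describe_auto_ml_job(AutoMLJobName=job_name)
        status = job['AutoMLJobStatus']
        sec_status = job['AutoMLJobSecondaryStatus']
        print(status, sec_status)
        if status != 'InProgress' or sec_status != secondary_status:
            return job
        sleep_fn(poll_seconds)


# Example usage with the client and job name from this notebook:
# job = wait_while_secondary_status(sm, auto_ml_job_name, 'AnalyzingData')
```

Comparing with `==` and `!=` avoids a subtle trap: `x in ('InProgress')` tests against a plain string, not a one-element tuple, so it performs a substring check.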

# Analyzing Data

In [9]:
# Sleep for a bit to ensure the AutoML job above has time to start
import time
time.sleep(30)

job = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)
job_status = job['AutoMLJobStatus']
job_sec_status = job['AutoMLJobSecondaryStatus']

if job_status not in ('Stopped', 'Failed'):
    while job_status == 'InProgress' and job_sec_status == 'AnalyzingData':
        job = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)
        job_status = job['AutoMLJobStatus']
        job_sec_status = job['AutoMLJobSecondaryStatus']
        print(job_status, job_sec_status)
        sleep(30)
    print("Data analysis complete")
    
print(job)

InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress FeatureEngineering
Data analysis complete
{'AutoMLJobName': 'automl-dm-25-16-11-45', 'AutoMLJobArn': 'arn:aws:sagemaker:us-west-2:393371431575:automl-job/automl-dm-25-16-11-45', 'InputDataConfig': [{'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix', 'S3Uri': 's3://sagemaker-us-west-2-393371431575/data/amazon_reviews_us_Digital_Software_v1_00_header.csv'}}, 'TargetAttributeName': 'star_rating'}], 'OutputDataConfig': {'S3OutputPath': 's3://sagemaker-us-west-2-393371431575/models/autopilot'}, 'RoleArn': 'arn:aws:iam::393371431575:role/TeamRole', 'AutoMLJ

In [10]:
job = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)
print(job)

{'AutoMLJobName': 'automl-dm-25-16-11-45', 'AutoMLJobArn': 'arn:aws:sagemaker:us-west-2:393371431575:automl-job/automl-dm-25-16-11-45', 'InputDataConfig': [{'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix', 'S3Uri': 's3://sagemaker-us-west-2-393371431575/data/amazon_reviews_us_Digital_Software_v1_00_header.csv'}}, 'TargetAttributeName': 'star_rating'}], 'OutputDataConfig': {'S3OutputPath': 's3://sagemaker-us-west-2-393371431575/models/autopilot'}, 'RoleArn': 'arn:aws:iam::393371431575:role/TeamRole', 'AutoMLJobConfig': {'CompletionCriteria': {'MaxCandidates': 3, 'MaxRuntimePerTrainingJobInSeconds': 600, 'MaxAutoMLJobRuntimeInSeconds': 3600}}, 'CreationTime': datetime.datetime(2020, 7, 25, 16, 11, 45, 201000, tzinfo=tzlocal()), 'LastModifiedTime': datetime.datetime(2020, 7, 25, 16, 21, 7, 785000, tzinfo=tzlocal()), 'AutoMLJobStatus': 'InProgress', 'AutoMLJobSecondaryStatus': 'FeatureEngineering', 'GenerateCandidateDefinitionsOnly': False, 'AutoMLJobArtifacts': {'CandidateDefinit

# View Generated Notebook Samples
Once data analysis is complete, SageMaker Autopilot generates two notebooks:
* Data exploration
* Candidate definition

# In the Jupyter File Browser, Open the Following Folders to See Samples of the Generated Assets:
```
notebooks/
generated_module/
```

The folders above contain lots of useful information.

(Optional) You can download the actual files generated for your specific Autopilot run using the following:
```
# str.rstrip() removes a set of characters, not a suffix, so split on the folder name instead
generated_resources = job['AutoMLJobArtifacts']['DataExplorationNotebookLocation'].rsplit('notebooks/', 1)[0]

!aws s3 cp --recursive $generated_resources .
```
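One detail is worth care when deriving the artifact prefix from the notebook location: `str.rstrip` treats its argument as a set of characters rather than a suffix, so it can eat into the path. The sketch below splits on the `notebooks/` folder name instead. `artifact_prefix` is a hypothetical helper, and the example URI is made up for illustration.

```python
def artifact_prefix(notebook_location):
    """Return everything before the last 'notebooks/' folder in the URI."""
    prefix, found, _ = notebook_location.rpartition('notebooks/')
    if not found:
        raise ValueError('No notebooks/ folder in {!r}'.format(notebook_location))
    return prefix


# Hypothetical location, for illustration only
example = ('s3://example-bucket/models/autopilot/example-job/candidates/'
           'notebooks/SageMakerAutopilotDataExplorationNotebook.ipynb')
print(artifact_prefix(example))
# s3://example-bucket/models/autopilot/example-job/candidates/

# Why rstrip is risky: it strips any trailing characters found in the set
print('s3://bucket/job/'.rstrip('notebooks/'))
# s3://bucket/j
```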

# Feature Engineering

In [11]:
job = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)
job_status = job['AutoMLJobStatus']
job_sec_status = job['AutoMLJobSecondaryStatus']
print(job_status)
print(job_sec_status)
if job_status not in ('Stopped', 'Failed'):
    while job_status == 'InProgress' and job_sec_status == 'FeatureEngineering':
        job = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)
        job_status = job['AutoMLJobStatus']
        job_sec_status = job['AutoMLJobSecondaryStatus']
        print(job_status, job_sec_status)
        sleep(30)
    print("Feature engineering complete")
    
print(job)

InProgress
FeatureEngineering
InProgress FeatureEngineering
InProgress FeatureEngineering
InProgress FeatureEngineering
InProgress FeatureEngineering
InProgress FeatureEngineering
InProgress FeatureEngineering
InProgress FeatureEngineering
InProgress FeatureEngineering
InProgress FeatureEngineering
InProgress FeatureEngineering
InProgress FeatureEngineering
InProgress FeatureEngineering
InProgress FeatureEngineering
InProgress FeatureEngineering
InProgress ModelTuning
Feature engineering complete
{'AutoMLJobName': 'automl-dm-25-16-11-45', 'AutoMLJobArn': 'arn:aws:sagemaker:us-west-2:393371431575:automl-job/automl-dm-25-16-11-45', 'InputDataConfig': [{'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix', 'S3Uri': 's3://sagemaker-us-west-2-393371431575/data/amazon_reviews_us_Digital_Software_v1_00_header.csv'}}, 'TargetAttributeName': 'star_rating'}], 'OutputDataConfig': {'S3OutputPath': 's3://sagemaker-us-west-2-393371431575/models/autopilot'}, 'RoleArn': 'arn:aws:iam::393371431575:

# Model Training and Tuning

In [12]:
job = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)
job_status = job['AutoMLJobStatus']
job_sec_status = job['AutoMLJobSecondaryStatus']
print(job_status)
print(job_sec_status)
if job_status not in ('Stopped', 'Failed'):
    while job_status == 'InProgress' and job_sec_status == 'ModelTuning':
        job = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)
        job_status = job['AutoMLJobStatus']
        job_sec_status = job['AutoMLJobSecondaryStatus']
        print(job_status, job_sec_status)
        sleep(30)
    print("Model tuning complete")
    
print(job)

InProgress
ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
Completed MaxCandidatesReached
Model tuning complete
{'AutoMLJobName': 'automl-dm-25-16-11-45', 'AutoMLJobArn': 'arn:aws:sagemaker:us-west-2:393371431575:automl-job/automl-dm-25-16-11-45', 'InputDataConfig': [{'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix', 'S3Uri': 's3://sagemaker-us-west-2-393371431575/data/amazon_reviews_us_Digital_Software_v1_00_header.csv'}}, 'TargetAttributeName': 'star_rating'}], 'OutputDataConfig': {'S3OutputPath': 's3://sagemaker-us-west-2-393371431575/models/autopilot'}, 'RoleArn': 'arn:aws:iam::393371431575:role/TeamRole', 'AutoMLJobConfig': {'CompletionCriteria': {'MaxCandidates': 3, 'MaxRuntimePerTrainingJobInSeconds': 600, 'MaxAutoMLJobRuntimeInSeconds': 3600}}, 'CreationTime': datetime.datetime(2020, 7, 25, 16, 11, 45, 201000, tzinfo=tzlocal()), 'EndTime': datetim

# _Please Wait Until ^^ Autopilot ^^ Completes Above_
Make sure the status below indicates `Completed`.

In [13]:
print(job_status)

if job_status != 'Completed':
    print('*******************************************************************')
    print('*************** THIS JOB DID NOT COMPLETE PROPERLY ****************')
    print('***************  REPORT THE ISSUE OR ASK FOR HELP  ****************')    
    print('*******************************************************************')

Completed
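When a job does not complete, the description returned by `describe_auto_ml_job` can include a `FailureReason` field that is more helpful than the status alone. Here is a minimal sketch of a one-line summary. `summarize_job` is a hypothetical helper, and it assumes `FailureReason` may be absent.

```python
def summarize_job(job):
    """Return a short, human-readable summary of an Autopilot job description."""
    status = job['AutoMLJobStatus']
    secondary = job.get('AutoMLJobSecondaryStatus', 'Unknown')
    if status == 'Completed':
        return 'Completed ({})'.format(secondary)
    # FailureReason is only expected when something went wrong
    reason = job.get('FailureReason', 'no failure reason reported')
    return '{} ({}): {}'.format(status, secondary, reason)


# Statuses taken from the run shown above
print(summarize_job({'AutoMLJobStatus': 'Completed',
                     'AutoMLJobSecondaryStatus': 'MaxCandidatesReached'}))
# Completed (MaxCandidatesReached)
```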


# Viewing All Candidates
Once model tuning is complete, you can view all the candidates (pipeline evaluations with different hyperparameter combinations) that were explored by AutoML and sort them by their final performance metric.

In [14]:
candidates = sm.list_candidates_for_auto_ml_job(AutoMLJobName=auto_ml_job_name, 
                                                SortBy='FinalObjectiveMetricValue')['Candidates']
for index, candidate in enumerate(candidates):
    print(str(index) + "  " 
        + candidate['CandidateName'] + "  " 
        + str(candidate['FinalAutoMLJobObjectiveMetric']['Value']))

0  tuning-job-1-18a74fc5bad7484f80-003-969aaeef  0.4468599855899811
1  tuning-job-1-18a74fc5bad7484f80-001-228b87f2  0.27463001012802124
2  tuning-job-1-18a74fc5bad7484f80-002-3a0f5549  0.2668899893760681
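The candidate list is a plain list of dictionaries, so it can also be ranked locally. The sketch below assumes that a higher metric value is better, which holds for accuracy but not for an error metric such as MSE. `rank_candidates` is a hypothetical helper, and the sample values are the three candidates printed above.

```python
def rank_candidates(candidates):
    """Return (name, metric value) pairs, best first.

    Assumes higher metric values are better; candidates without a final
    metric are skipped.
    """
    scored = [(c['CandidateName'], c['FinalAutoMLJobObjectiveMetric']['Value'])
              for c in candidates
              if 'FinalAutoMLJobObjectiveMetric' in c]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


sample = [
    {'CandidateName': 'tuning-job-1-18a74fc5bad7484f80-001-228b87f2',
     'FinalAutoMLJobObjectiveMetric': {'Value': 0.27463001012802124}},
    {'CandidateName': 'tuning-job-1-18a74fc5bad7484f80-003-969aaeef',
     'FinalAutoMLJobObjectiveMetric': {'Value': 0.4468599855899811}},
    {'CandidateName': 'tuning-job-1-18a74fc5bad7484f80-002-3a0f5549',
     'FinalAutoMLJobObjectiveMetric': {'Value': 0.2668899893760681}},
]

best_name, best_value = rank_candidates(sample)[0]
print(best_name, best_value)
# tuning-job-1-18a74fc5bad7484f80-003-969aaeef 0.4468599855899811
```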


# Inspect Trials using Experiments API

SageMaker Autopilot automatically creates a new experiment and records information for each trial.

In [15]:
from sagemaker.analytics import ExperimentAnalytics

exp = ExperimentAnalytics(
    sagemaker_session=sess, 
    experiment_name=auto_ml_job_name + '-aws-auto-ml-job',
)

df = exp.dataframe()
print(df)

                                  TrialComponentName  \
0  tuning-job-1-18a74fc5bad7484f80-002-3a0f5549-a...   
1  tuning-job-1-18a74fc5bad7484f80-003-969aaeef-a...   
2  tuning-job-1-18a74fc5bad7484f80-001-228b87f2-a...   
3  automl-dm--dpp2-rpb-1-db546770d7034e80909ce7ff...   
4  automl-dm--dpp1-csv-1-1fe79782a7f84792bb58c5b0...   
5  automl-dm--dpp2-1-f8cb6cb1e0e04eb2af62e3ac2f08...   
6  automl-dm--dpp1-1-a7b22637cfe94ebcaa4c276f32f0...   
7  db-1-b606debcfcaf4f24b63bca34ee9e8892b5597f8f9...   

                                         DisplayName  \
0  tuning-job-1-18a74fc5bad7484f80-002-3a0f5549-a...   
1  tuning-job-1-18a74fc5bad7484f80-003-969aaeef-a...   
2  tuning-job-1-18a74fc5bad7484f80-001-228b87f2-a...   
3  automl-dm--dpp2-rpb-1-db546770d7034e80909ce7ff...   
4  automl-dm--dpp1-csv-1-1fe79782a7f84792bb58c5b0...   
5  automl-dm--dpp2-1-f8cb6cb1e0e04eb2af62e3ac2f08...   
6  automl-dm--dpp1-1-a7b22637cfe94ebcaa4c276f32f0...   
7  db-1-b606debcfcaf4f24b63bca34ee9e8892b5597f8

# Explore the Best Candidate
Now that we have successfully completed the AutoML job on our dataset and visualized the trials, we can create a model from any of the trials with a single API call and then deploy that model for online or batch prediction using [Inference Pipelines](https://docs.aws.amazon.com/sagemaker/latest/dg/inference-pipelines.html). For this notebook, we deploy only the best performing trial for inference.

The best candidate is the one we're really interested in.

In [16]:
best_candidate = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)['BestCandidate']
best_candidate_identifier = best_candidate['CandidateName']

print("Candidate name: " + best_candidate_identifier)
print("Metric name: " + best_candidate['FinalAutoMLJobObjectiveMetric']['MetricName'])
print("Metric value: " + str(best_candidate['FinalAutoMLJobObjectiveMetric']['Value']))

Candidate name: tuning-job-1-18a74fc5bad7484f80-003-969aaeef
Metric name: validation:accuracy
Metric value: 0.4468599855899811


In [17]:
best_candidate

{'CandidateName': 'tuning-job-1-18a74fc5bad7484f80-003-969aaeef',
 'FinalAutoMLJobObjectiveMetric': {'MetricName': 'validation:accuracy',
  'Value': 0.4468599855899811},
 'ObjectiveStatus': 'Succeeded',
 'CandidateSteps': [{'CandidateStepType': 'AWS::SageMaker::ProcessingJob',
   'CandidateStepArn': 'arn:aws:sagemaker:us-west-2:393371431575:processing-job/db-1-b606debcfcaf4f24b63bca34ee9e8892b5597f8f98db4fd7b3ed854316',
   'CandidateStepName': 'db-1-b606debcfcaf4f24b63bca34ee9e8892b5597f8f98db4fd7b3ed854316'},
  {'CandidateStepType': 'AWS::SageMaker::TrainingJob',
   'CandidateStepArn': 'arn:aws:sagemaker:us-west-2:393371431575:training-job/automl-dm--dpp2-1-f8cb6cb1e0e04eb2af62e3ac2f08942b5f35a2fdd2514',
   'CandidateStepName': 'automl-dm--dpp2-1-f8cb6cb1e0e04eb2af62e3ac2f08942b5f35a2fdd2514'},
  {'CandidateStepType': 'AWS::SageMaker::TransformJob',
   'CandidateStepArn': 'arn:aws:sagemaker:us-west-2:393371431575:transform-job/automl-dm--dpp2-rpb-1-db546770d7034e80909ce7ff319486a8b677

We can see the containers and models composing the Inference Pipeline.

In [18]:
for container in best_candidate['InferenceContainers']:
    print(container['Image'])
    print(container['ModelDataUrl'])
    print('======================')

246618743249.dkr.ecr.us-west-2.amazonaws.com/sagemaker-sklearn-automl:0.1.0-cpu-py3
s3://sagemaker-us-west-2-393371431575/models/autopilot/automl-dm-25-16-11-45/data-processor-models/automl-dm--dpp2-1-f8cb6cb1e0e04eb2af62e3ac2f08942b5f35a2fdd2514/output/model.tar.gz
246618743249.dkr.ecr.us-west-2.amazonaws.com/sagemaker-xgboost:1.0-1-cpu-py3
s3://sagemaker-us-west-2-393371431575/models/autopilot/automl-dm-25-16-11-45/tuning/automl-dm--dpp2-xgb/tuning-job-1-18a74fc5bad7484f80-003-969aaeef/output/model.tar.gz
246618743249.dkr.ecr.us-west-2.amazonaws.com/sagemaker-sklearn-automl:0.1.0-cpu-py3
s3://sagemaker-us-west-2-393371431575/models/autopilot/automl-dm-25-16-11-45/data-processor-models/automl-dm--dpp2-1-f8cb6cb1e0e04eb2af62e3ac2f08942b5f35a2fdd2514/output/model.tar.gz


# Autopilot Chooses XGBoost as Best Candidate!

Note that Autopilot chose different hyper-parameters and feature transformations than we used in our own XGBoost model.
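One way to read the algorithm off the container list is to extract the repository name from each image URI. This is a small sketch using the three image URIs printed above; `image_repository` is a hypothetical helper that relies only on the `registry/repository:tag` layout of those URIs.

```python
def image_repository(image_uri):
    """Return the repository name from an image URI, e.g. 'sagemaker-xgboost'."""
    # 'registry/repository:tag' -> 'repository:tag' -> 'repository'
    name_and_tag = image_uri.rsplit('/', 1)[-1]
    return name_and_tag.split(':', 1)[0]


images = [
    '246618743249.dkr.ecr.us-west-2.amazonaws.com/sagemaker-sklearn-automl:0.1.0-cpu-py3',
    '246618743249.dkr.ecr.us-west-2.amazonaws.com/sagemaker-xgboost:1.0-1-cpu-py3',
    '246618743249.dkr.ecr.us-west-2.amazonaws.com/sagemaker-sklearn-automl:0.1.0-cpu-py3',
]

print([image_repository(uri) for uri in images])
# ['sagemaker-sklearn-automl', 'sagemaker-xgboost', 'sagemaker-sklearn-automl']
```

The XGBoost container sits between two scikit-learn containers, which is consistent with a pipeline that transforms the input, predicts, and then post-processes the prediction.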

# Deploy the Model as a REST Endpoint
Batch transformations are also supported, but for now, we will use a REST Endpoint.

In [19]:
model_name = 'automl-dm-model-' + timestamp_suffix

model_arn = sm.create_model(Containers=best_candidate['InferenceContainers'],
                            ModelName=model_name,
                            ExecutionRoleArn=role)

print('Best candidate model ARN: ', model_arn['ModelArn'])

Best candidate model ARN:  arn:aws:sagemaker:us-west-2:393371431575:model/automl-dm-model-25-16-11-45
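The same model could be used for a batch transform job instead of an endpoint. The sketch below only assembles the arguments for `create_transform_job`; it does not start a job. `build_transform_request` is a hypothetical helper, the S3 locations and job name are placeholders, and the choice of CSV input split by line is an assumption that matches how we call the endpoint later in this notebook.

```python
def build_transform_request(job_name, model_name, input_s3_uri, output_s3_uri,
                            instance_type='ml.m5.large', instance_count=1):
    """Build keyword arguments for create_transform_job (CSV in, CSV out)."""
    return {
        'TransformJobName': job_name,
        'ModelName': model_name,
        'TransformInput': {
            'DataSource': {
                'S3DataSource': {'S3DataType': 'S3Prefix', 'S3Uri': input_s3_uri}
            },
            'ContentType': 'text/csv',
            'SplitType': 'Line',  # treat each line as one record
        },
        'TransformOutput': {
            'S3OutputPath': output_s3_uri,
            'Accept': 'text/csv',
            'AssembleWith': 'Line',
        },
        'TransformResources': {
            'InstanceType': instance_type,
            'InstanceCount': instance_count,
        },
    }


# Example usage (placeholder locations; would start a real job):
# request = build_transform_request('automl-dm-transform-example', model_name,
#                                   's3://example-bucket/batch/input/',
#                                   's3://example-bucket/batch/output/')
# sm.create_transform_job(**request)
```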


In [20]:
# EndpointConfig name
timestamp_suffix = strftime('%d-%H-%M-%S', gmtime())
epc_name = 'automl-dm-epc-' + timestamp_suffix

# Endpoint name
autopilot_endpoint_name = 'automl-dm-ep-' + timestamp_suffix
variant_name = 'automl-dm-variant-' + timestamp_suffix

print(autopilot_endpoint_name)
print(variant_name)

automl-dm-ep-25-16-32-50
automl-dm-variant-25-16-32-50


In [21]:
ep_config = sm.create_endpoint_config(EndpointConfigName = epc_name,
                                      ProductionVariants=[{'InstanceType':'ml.m5.large',
                                                           'InitialInstanceCount': 1,
                                                           'ModelName': model_name,
                                                           'VariantName': variant_name}])


In [22]:
create_endpoint_response = sm.create_endpoint(EndpointName=autopilot_endpoint_name,
                                              EndpointConfigName=epc_name)
print(create_endpoint_response['EndpointArn'])

arn:aws:sagemaker:us-west-2:393371431575:endpoint/automl-dm-ep-25-16-32-50


# Wait for the Model to Deploy
This may take 5-10 minutes.  Please be patient.

In [23]:
sm.get_waiter('endpoint_in_service').wait(EndpointName=autopilot_endpoint_name)
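The waiter accepts a `WaiterConfig` argument that controls how often it polls and when it gives up. The sketch below wraps that in a function; `wait_for_endpoint` is a hypothetical helper, and the delay and attempt counts are illustrative values, not recommendations.

```python
def wait_for_endpoint(sm_client, endpoint_name, delay_seconds=30, max_attempts=60):
    """Block until the endpoint is in service, then return its status.

    The boto3 waiter raises an exception if the endpoint fails or the
    attempts run out, so reaching the return means the wait succeeded.
    """
    waiter = sm_client.get_waiter('endpoint_in_service')
    waiter.wait(EndpointName=endpoint_name,
                WaiterConfig={'Delay': delay_seconds, 'MaxAttempts': max_attempts})
    response = sm_client.describe_endpoint(EndpointName=endpoint_name)
    return response['EndpointStatus']


# Example usage with the client and endpoint name from this notebook:
# print(wait_for_endpoint(sm, autopilot_endpoint_name))
```

With these values the wait is capped at roughly 30 minutes (60 attempts, 30 seconds apart).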


In [24]:
resp = sm.describe_endpoint(EndpointName=autopilot_endpoint_name)
status = resp['EndpointStatus']

print("Arn: " + resp['EndpointArn'])
print("Status: " + status)

Arn: arn:aws:sagemaker:us-west-2:393371431575:endpoint/automl-dm-ep-25-16-32-50
Status: InService


# Test Our Model with Some Example Reviews
Let's do some ad-hoc predictions on our model.

In [25]:
sm_runtime = boto3.client('sagemaker-runtime')

In [26]:
csv_line_predict_positive = """I loved it!"""

response = sm_runtime.invoke_endpoint(EndpointName=autopilot_endpoint_name, ContentType='text/csv', Accept='text/csv', Body=csv_line_predict_positive)

response_body = response['Body'].read().decode('utf-8').strip()
response_body

'5'

In [27]:
csv_line_predict_meh = """It's OK."""

response = sm_runtime.invoke_endpoint(EndpointName=autopilot_endpoint_name, ContentType='text/csv', Accept='text/csv', Body=csv_line_predict_meh)

response_body = response['Body'].read().decode('utf-8').strip()
response_body

'3'

In [28]:
csv_line_predict_negative = """The worst product ever."""

response = sm_runtime.invoke_endpoint(EndpointName=autopilot_endpoint_name, ContentType='text/csv', Accept='text/csv', Body=csv_line_predict_negative)

response_body = response['Body'].read().decode('utf-8').strip()
response_body

'1'
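The three calls above differ only in the review text, and a review containing a comma or a quote would need CSV quoting to stay a single field. Here is a minimal sketch that handles both. `to_csv_line` and `predict_star_rating` are hypothetical helpers, and the runtime client is passed in as an argument.

```python
import csv
import io


def to_csv_line(review_text):
    """Serialize one review as a single CSV field, quoting it when needed."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator='').writerow([review_text])
    return buffer.getvalue()


def predict_star_rating(runtime_client, endpoint_name, review_text):
    """Send one review to the endpoint and return the predicted rating as text."""
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType='text/csv',
        Accept='text/csv',
        Body=to_csv_line(review_text))
    return response['Body'].read().decode('utf-8').strip()


print(to_csv_line('I loved it!'))       # I loved it!
print(to_csv_line('Good, not great.'))  # "Good, not great."

# Example usage with the runtime client and endpoint from this notebook:
# print(predict_star_rating(sm_runtime, autopilot_endpoint_name, "It's OK."))
```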

In [29]:
%store autopilot_endpoint_name

Stored 'autopilot_endpoint_name' (str)


In [30]:
%store

Stored variables and their in-db values:
autopilot_endpoint_name             -> 'automl-dm-ep-25-16-32-50'
header_train_s3_uri                 -> 's3://sagemaker-us-west-2-393371431575/data/amazon
noheader_train_s3_uri               -> 's3://sagemaker-us-west-2-393371431575/data/amazon


In [None]:
%%javascript
Jupyter.notebook.save_checkpoint();
Jupyter.notebook.session.delete();

# Summary
We used Autopilot to automatically find the best model, hyper-parameters, and feature-engineering scripts for our dataset.  

Autopilot's white-box approach generates reusable exploration Jupyter Notebooks and transformation Python scripts, so we can continue to train and deploy our model on new data well after this initial interaction with the Autopilot service.