## Train a Model with SageMaker AutoPilot
We will use AutoPilot to predict the sentiment (star rating) of customer reviews.

### Introduction
Amazon SageMaker Autopilot is a service to perform automated machine learning (AutoML) on your datasets.  AutoPilot is available through the UI or AWS SDK.  In this notebook, we will use the AWS SDK to create and deploy a text processing and sentiment classification machine learning pipeline.

### Setup

Let's start by specifying:

* The S3 bucket and prefix to use to train our model.  _Note:  This should be in the same region as this notebook._
* The IAM role for this notebook, which needs access to your data.

In [26]:
import boto3
import sagemaker
import pandas as pd

sess   = sagemaker.Session()
bucket = sess.default_bucket()
role = sagemaker.get_execution_role()
region = boto3.Session().region_name

sm = boto3.Session().client(service_name='sagemaker', region_name=region)

In [27]:
%store -r header_train_s3_uri

print(header_train_s3_uri)

s3://sagemaker-us-east-2-533787958253/data/amazon_reviews_us_Digital_Software_v1_00_header.csv


In [28]:
!aws s3 ls $header_train_s3_uri

2020-04-29 14:18:44   15164605 amazon_reviews_us_Digital_Software_v1_00_header.csv


## Set Up the S3 Location for the AutoPilot-Generated Assets
This includes Jupyter notebooks (analysis), Python scripts (feature engineering), and trained models.

In [29]:
prefix_model_output = 'models/autopilot'

model_output_s3_uri = 's3://{}/{}'.format(bucket, prefix_model_output)

print(model_output_s3_uri)


s3://sagemaker-us-east-2-533787958253/models/autopilot


In [30]:
max_candidates = 3

job_config = {
    'CompletionCriteria': {
      'MaxRuntimePerTrainingJobInSeconds': 600,
      'MaxCandidates': max_candidates,
      'MaxAutoMLJobRuntimeInSeconds': 3600
    },
}

input_data_config = [{
      'DataSource': {
        'S3DataSource': {
          'S3DataType': 'S3Prefix',
          'S3Uri': '{}'.format(header_train_s3_uri)
        }
      },
      'TargetAttributeName': 'star_rating'
    }
]

output_data_config = {
    'S3OutputPath': '{}'.format(model_output_s3_uri)
}

## Launch the SageMaker AutoPilot job

We can now launch the job by calling the `create_auto_ml_job` API.

In [7]:
from time import gmtime, strftime, sleep
timestamp_suffix = strftime('%d-%H-%M-%S', gmtime())

auto_ml_job_name = 'automl-dm-' + timestamp_suffix
print('AutoMLJobName: ' + auto_ml_job_name)

AutoMLJobName: automl-dm-27-20-20-56


_Note that we are not specifying the `ProblemType`.  AutoPilot will automatically detect if we're using regression or classification (binary or multi-class)._

In [8]:
sm.create_auto_ml_job(AutoMLJobName=auto_ml_job_name,
                      InputDataConfig=input_data_config,
                      OutputDataConfig=output_data_config,
                      AutoMLJobConfig=job_config,
#                      ProblemType='MulticlassClassification',  # optional; left unset so AutoPilot infers it
                      RoleArn=role)

{'AutoMLJobArn': 'arn:aws:sagemaker:us-east-2:533787958253:automl-job/automl-dm-27-20-20-56',
 'ResponseMetadata': {'RequestId': '9e4b42e5-13eb-463e-a80f-6e081152e2bf',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '9e4b42e5-13eb-463e-a80f-6e081152e2bf',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '92',
   'date': 'Mon, 27 Apr 2020 20:21:24 GMT'},
  'RetryAttempts': 0}}
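
If you would rather set the problem type explicitly than rely on auto-detection, the request arguments could be assembled along these lines. This is a sketch, not part of the original notebook: `build_automl_request` is a hypothetical helper, the bucket and role values are placeholders, and it assumes the API expects `ProblemType` and `AutoMLJobObjective` to be supplied together, with `MulticlassClassification` and `Accuracy` as plausible choices for five-level star ratings.

```python
def build_automl_request(job_name, input_config, output_config, job_config,
                         role_arn, problem_type=None, objective_metric=None):
    """Assemble keyword arguments for create_auto_ml_job (hypothetical helper)."""
    if (problem_type is None) != (objective_metric is None):
        # Assumption: the service wants both values or neither.
        raise ValueError('Set problem_type and objective_metric together, or neither.')

    request = {
        'AutoMLJobName': job_name,
        'InputDataConfig': input_config,
        'OutputDataConfig': output_config,
        'AutoMLJobConfig': job_config,
        'RoleArn': role_arn,
    }
    if problem_type is not None:
        request['ProblemType'] = problem_type
        request['AutoMLJobObjective'] = {'MetricName': objective_metric}
    return request


# Placeholder values only; in the notebook you would pass the real configs
# and then call sm.create_auto_ml_job(**request).
request = build_automl_request(
    job_name='automl-dm-example',
    input_config=[{
        'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix',
                                        'S3Uri': 's3://example-bucket/data/train.csv'}},
        'TargetAttributeName': 'star_rating',
    }],
    output_config={'S3OutputPath': 's3://example-bucket/models/autopilot'},
    job_config={'CompletionCriteria': {'MaxCandidates': 3}},
    role_arn='arn:aws:iam::123456789012:role/ExampleRole',
    problem_type='MulticlassClassification',
    objective_metric='Accuracy',
)
print(sorted(request))
```

Leaving both optional arguments at `None` produces the same request shape used in the cell above.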

# Tracking the progress of the AutoPilot job
A SageMaker AutoPilot job consists of the following high-level steps:
* _Data Analysis_, where the data is summarized and analyzed to determine which feature engineering techniques, hyper-parameters, and models to explore.
* _Feature Engineering_, where the data is scrubbed, balanced, combined, and split into training and validation sets.
* _Model Training and Tuning_, where the top-performing features, hyper-parameters, and models are selected and trained.
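
Each of these phases can be tracked with the same polling pattern. The helper below is a minimal sketch of that pattern, not the notebook's own code: `wait_for_phase_to_finish` is a hypothetical name, and it takes the describe call as a parameter so it can be exercised without an AWS connection.

```python
import time


def wait_for_phase_to_finish(describe_job, phase, poll_seconds=30, sleep=time.sleep):
    """Poll until the job leaves `phase` or is no longer InProgress.

    `describe_job` is any zero-argument callable returning a dict shaped like
    the describe_auto_ml_job response; `sleep` is injectable for testing.
    """
    while True:
        job = describe_job()
        status = job['AutoMLJobStatus']
        secondary = job['AutoMLJobSecondaryStatus']
        print(status, secondary)
        if status != 'InProgress' or secondary != phase:
            return job
        sleep(poll_seconds)


# In the notebook this might be used as (assumes `sm` and `auto_ml_job_name` exist):
# job = wait_for_phase_to_finish(
#     lambda: sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name),
#     'AnalyzingData')
```

Comparing with `==` and `!=` avoids relying on substring matching against a single string.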

## Analyzing Data

In [9]:
# Sleep for a bit to ensure the AutoML job above has time to start
import time
time.sleep(3)

job = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)
job_status = job['AutoMLJobStatus']
job_sec_status = job['AutoMLJobSecondaryStatus']

if job_status not in ('Stopped', 'Failed'):
    while job_status == 'InProgress' and job_sec_status == 'AnalyzingData':
        job = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)
        job_status = job['AutoMLJobStatus']
        job_sec_status = job['AutoMLJobSecondaryStatus']
        print(job_status, job_sec_status)
        sleep(30)
    print("Data analysis complete")
    
print(job)

InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress FeatureEngineering
Data analysis complete
{'AutoMLJobName': 'automl-dm-27-20-20-56', 'AutoMLJobArn': 'arn:aws:sagemaker:us-east-2:533787958253:automl-job/automl-dm-27-20-20-56', 'InputDataConfig': [{'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix', 'S3Uri': 's3://sagemaker-us-east-2-533787958253/data/amazon_reviews_us_Digital_Software_v1_00_header.csv'}}, 'TargetAttributeName': 'star_rating'}], 'OutputDataConfig': {'S3OutputPath': 's3://sagemaker-us-east-2-533787958253/models/autopilot'}, 'RoleArn': 'arn:aws:iam::533787958253:role/CoE_AI_SageMaker_Notebook', 'AutoMLJobConfig': {'CompletionCriteria': {'MaxCandidates': 3, 'MaxRuntimePerTrainingJobInSeconds': 600, 'MaxAutoMLJ

In [10]:
job = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)
print(job)

{'AutoMLJobName': 'automl-dm-27-20-20-56', 'AutoMLJobArn': 'arn:aws:sagemaker:us-east-2:533787958253:automl-job/automl-dm-27-20-20-56', 'InputDataConfig': [{'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix', 'S3Uri': 's3://sagemaker-us-east-2-533787958253/data/amazon_reviews_us_Digital_Software_v1_00_header.csv'}}, 'TargetAttributeName': 'star_rating'}], 'OutputDataConfig': {'S3OutputPath': 's3://sagemaker-us-east-2-533787958253/models/autopilot'}, 'RoleArn': 'arn:aws:iam::533787958253:role/CoE_AI_SageMaker_Notebook', 'AutoMLJobConfig': {'CompletionCriteria': {'MaxCandidates': 3, 'MaxRuntimePerTrainingJobInSeconds': 600, 'MaxAutoMLJobRuntimeInSeconds': 3600}}, 'CreationTime': datetime.datetime(2020, 4, 27, 20, 21, 24, 711000, tzinfo=tzlocal()), 'LastModifiedTime': datetime.datetime(2020, 4, 27, 20, 39, 4, 752000, tzinfo=tzlocal()), 'AutoMLJobStatus': 'InProgress', 'AutoMLJobSecondaryStatus': 'ModelTuning', 'GenerateCandidateDefinitionsOnly': False, 'AutoMLJobArtifacts': {'Candid

## Feature Engineering

In [11]:
job = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)
job_status = job['AutoMLJobStatus']
job_sec_status = job['AutoMLJobSecondaryStatus']
print(job_status)
print(job_sec_status)
if job_status not in ('Stopped', 'Failed'):
    while job_status == 'InProgress' and job_sec_status == 'FeatureEngineering':
        job = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)
        job_status = job['AutoMLJobStatus']
        job_sec_status = job['AutoMLJobSecondaryStatus']
        print(job_status, job_sec_status)
        sleep(30)
    print("Feature engineering complete")
    
print(job)

Completed
MaxCandidatesReached
Feature engineering complete
{'AutoMLJobName': 'automl-dm-27-20-20-56', 'AutoMLJobArn': 'arn:aws:sagemaker:us-east-2:533787958253:automl-job/automl-dm-27-20-20-56', 'InputDataConfig': [{'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix', 'S3Uri': 's3://sagemaker-us-east-2-533787958253/data/amazon_reviews_us_Digital_Software_v1_00_header.csv'}}, 'TargetAttributeName': 'star_rating'}], 'OutputDataConfig': {'S3OutputPath': 's3://sagemaker-us-east-2-533787958253/models/autopilot'}, 'RoleArn': 'arn:aws:iam::533787958253:role/CoE_AI_SageMaker_Notebook', 'AutoMLJobConfig': {'CompletionCriteria': {'MaxCandidates': 3, 'MaxRuntimePerTrainingJobInSeconds': 600, 'MaxAutoMLJobRuntimeInSeconds': 3600}}, 'CreationTime': datetime.datetime(2020, 4, 27, 20, 21, 24, 711000, tzinfo=tzlocal()), 'EndTime': datetime.datetime(2020, 4, 27, 20, 39, 40, 922000, tzinfo=tzlocal()), 'LastModifiedTime': datetime.datetime(2020, 4, 27, 20, 39, 40, 947000, tzinfo=tzlocal()), 'BestCa

## Model Training and Tuning

In [12]:
job = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)
job_status = job['AutoMLJobStatus']
job_sec_status = job['AutoMLJobSecondaryStatus']
print(job_status)
print(job_sec_status)
if job_status not in ('Stopped', 'Failed'):
    while job_status == 'InProgress' and job_sec_status == 'ModelTuning':
        job = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)
        job_status = job['AutoMLJobStatus']
        job_sec_status = job['AutoMLJobSecondaryStatus']
        print(job_status, job_sec_status)
        sleep(30)
    print("Model tuning complete")
    
print(job)

Completed
MaxCandidatesReached
Model tuning complete
{'AutoMLJobName': 'automl-dm-27-20-20-56', 'AutoMLJobArn': 'arn:aws:sagemaker:us-east-2:533787958253:automl-job/automl-dm-27-20-20-56', 'InputDataConfig': [{'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix', 'S3Uri': 's3://sagemaker-us-east-2-533787958253/data/amazon_reviews_us_Digital_Software_v1_00_header.csv'}}, 'TargetAttributeName': 'star_rating'}], 'OutputDataConfig': {'S3OutputPath': 's3://sagemaker-us-east-2-533787958253/models/autopilot'}, 'RoleArn': 'arn:aws:iam::533787958253:role/CoE_AI_SageMaker_Notebook', 'AutoMLJobConfig': {'CompletionCriteria': {'MaxCandidates': 3, 'MaxRuntimePerTrainingJobInSeconds': 600, 'MaxAutoMLJobRuntimeInSeconds': 3600}}, 'CreationTime': datetime.datetime(2020, 4, 27, 20, 21, 24, 711000, tzinfo=tzlocal()), 'EndTime': datetime.datetime(2020, 4, 27, 20, 39, 40, 922000, tzinfo=tzlocal()), 'LastModifiedTime': datetime.datetime(2020, 4, 27, 20, 39, 40, 947000, tzinfo=tzlocal()), 'BestCandidate

## Wait Until All Jobs are Done Above Before Proceeding

# View Generated Notebooks
Once data analysis is complete, SageMaker AutoPilot generates two notebooks: 
* Data exploration,
* Candidate definition.

### Copy the Generated Notebooks Locally

In [13]:
# str.rstrip() removes a set of characters, not a suffix, so split on the path instead
notebook_uri = job['AutoMLJobArtifacts']['DataExplorationNotebookLocation']
generated_resources = notebook_uri.rsplit('/notebooks/', 1)[0]
generated_resources

's3://sagemaker-us-east-2-533787958253/models/autopilot/automl-dm-27-20-20-56/sagemaker-automl-candidates/pr-1-abdf3cc0395847a59aae3eb1548fe43ad9ba1a26a4d444ef80c480bf1'
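
A caution when trimming a known suffix from an S3 URI: `str.rstrip` removes any trailing characters that appear in its argument, not the argument as a whole string. The sketch below uses a made-up URI (not a real AutoPilot path) to show how that differs from splitting on the path separator.

```python
# Hypothetical URI, shaped like DataExplorationNotebookLocation but not a real path.
notebook_uri = ('s3://example-bucket/job/pr-1-12ab/'
                'notebooks/SageMakerAutopilotDataExplorationNotebook.ipynb')
suffix = 'notebooks/SageMakerAutopilotDataExplorationNotebook.ipynb'

# rstrip treats `suffix` as a set of characters, so it also eats '/', 'b' and 'a'.
stripped = notebook_uri.rstrip(suffix)

# Splitting once on '/notebooks/' keeps the parent prefix intact.
parent = notebook_uri.rsplit('/notebooks/', 1)[0]

print(stripped)
print(parent)
```

Because `aws s3 cp --recursive` matches on key prefixes, a shortened prefix can still copy the expected files, which makes this kind of truncation easy to miss.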

In [14]:
!aws s3 cp --recursive $generated_resources .

## In the file view, open the following folders:
```
notebooks/
generated_module/
```

These folders contain the generated notebooks and the feature engineering code.

## Viewing All Candidates
Once model tuning is complete, you can view all the candidates (pipeline evaluations with different hyperparameter combinations) that were explored by AutoML and sort them by their final performance metric.

In [15]:
candidates = sm.list_candidates_for_auto_ml_job(AutoMLJobName=auto_ml_job_name, 
                                                SortBy='FinalObjectiveMetricValue')['Candidates']
for index, candidate in enumerate(candidates):
    print(str(index) + "  " 
        + candidate['CandidateName'] + "  " 
        + str(candidate['FinalAutoMLJobObjectiveMetric']['Value']))

0  tuning-job-1-5ebf290f7b714803b3-002-a0c1c68c  0.3787199854850769
1  tuning-job-1-5ebf290f7b714803b3-001-25c033cf  0.3745099902153015
2  tuning-job-1-5ebf290f7b714803b3-003-fe0a1316  0.25881800055503845
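
If you want to rank candidates yourself, for example after filtering them, the same ordering can be reproduced client-side. This is a small sketch under the assumption that a higher metric value is better (true for accuracy); `rank_candidates` is a hypothetical helper, and the sample records copy only the two fields used above.

```python
def rank_candidates(candidates):
    """Return (name, metric value) pairs sorted from best to worst metric."""
    pairs = [(c['CandidateName'], c['FinalAutoMLJobObjectiveMetric']['Value'])
             for c in candidates]
    # Assumption: higher is better, as with validation accuracy.
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Names and values taken from the output above, deliberately out of order.
sample = [
    {'CandidateName': 'tuning-job-1-5ebf290f7b714803b3-003-fe0a1316',
     'FinalAutoMLJobObjectiveMetric': {'Value': 0.25881800055503845}},
    {'CandidateName': 'tuning-job-1-5ebf290f7b714803b3-002-a0c1c68c',
     'FinalAutoMLJobObjectiveMetric': {'Value': 0.3787199854850769}},
    {'CandidateName': 'tuning-job-1-5ebf290f7b714803b3-001-25c033cf',
     'FinalAutoMLJobObjectiveMetric': {'Value': 0.3745099902153015}},
]

for rank, (name, value) in enumerate(rank_candidates(sample)):
    print(rank, name, value)
```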


## Inspect Trials using Experiments API
SageMaker AutoPilot automatically creates a new experiment, and pushes information for each trial. 

In [16]:
from sagemaker.analytics import ExperimentAnalytics, TrainingJobAnalytics

exp = ExperimentAnalytics(
    sagemaker_session=sess, 
    experiment_name=auto_ml_job_name + '-aws-auto-ml-job',
)

df = exp.dataframe()
print(df)

                                  TrialComponentName  \
0  tuning-job-1-5ebf290f7b714803b3-001-25c033cf-a...   
1  tuning-job-1-5ebf290f7b714803b3-002-a0c1c68c-a...   
2  tuning-job-1-5ebf290f7b714803b3-003-fe0a1316-a...   
3  automl-dm--dpp2-rpb-1-6ff77104e6944bb28dc4fcb8...   
4  automl-dm--dpp1-csv-1-c09fa794a16f4b27aef1f73f...   
5  automl-dm--dpp0-rpb-1-35a30e7d22ad447687fbd336...   
6  automl-dm--dpp1-1-600729a9ec964286861e092a3c26...   
7  automl-dm--dpp0-1-23b98bdeb3d846c79017cb9f359a...   
8  automl-dm--dpp2-1-f6fa3c0be79a4063b97b6571ff33...   
9  db-1-534bccbc2999467f9253dc442fe05c71281409594...   

                                         DisplayName  \
0  tuning-job-1-5ebf290f7b714803b3-001-25c033cf-a...   
1  tuning-job-1-5ebf290f7b714803b3-002-a0c1c68c-a...   
2  tuning-job-1-5ebf290f7b714803b3-003-fe0a1316-a...   
3  automl-dm--dpp2-rpb-1-6ff77104e6944bb28dc4fcb8...   
4  automl-dm--dpp1-csv-1-c09fa794a16f4b27aef1f73f...   
5  automl-dm--dpp0-rpb-1-35a30e7d22ad447687fbd3

## Explore the Best Candidate
Now that we have successfully completed the AutoML job on our dataset and visualized the trials, we can create a model from any of the trials with a single API call and then deploy that model for online or batch prediction using [Inference Pipelines](https://docs.aws.amazon.com/sagemaker/latest/dg/inference-pipelines.html). For this notebook, we deploy only the best performing trial for inference.

The best candidate is the one we're really interested in.

In [17]:
best_candidate = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)['BestCandidate']
best_candidate_identifier = best_candidate['CandidateName']

print("Candidate name: " + best_candidate_identifier)
print("Metric name: " + best_candidate['FinalAutoMLJobObjectiveMetric']['MetricName'])
print("Metric value: " + str(best_candidate['FinalAutoMLJobObjectiveMetric']['Value']))

Candidate name: tuning-job-1-5ebf290f7b714803b3-002-a0c1c68c
Metric name: validation:accuracy
Metric value: 0.3787199854850769


In [18]:
best_candidate

{'CandidateName': 'tuning-job-1-5ebf290f7b714803b3-002-a0c1c68c',
 'FinalAutoMLJobObjectiveMetric': {'MetricName': 'validation:accuracy',
  'Value': 0.3787199854850769},
 'ObjectiveStatus': 'Succeeded',
 'CandidateSteps': [{'CandidateStepType': 'AWS::SageMaker::ProcessingJob',
   'CandidateStepArn': 'arn:aws:sagemaker:us-east-2:533787958253:processing-job/db-1-534bccbc2999467f9253dc442fe05c712814095946424fb8b9c0987dfc',
   'CandidateStepName': 'db-1-534bccbc2999467f9253dc442fe05c712814095946424fb8b9c0987dfc'},
  {'CandidateStepType': 'AWS::SageMaker::TrainingJob',
   'CandidateStepArn': 'arn:aws:sagemaker:us-east-2:533787958253:training-job/automl-dm--dpp2-1-f6fa3c0be79a4063b97b6571ff3321c366be99ac91d64',
   'CandidateStepName': 'automl-dm--dpp2-1-f6fa3c0be79a4063b97b6571ff3321c366be99ac91d64'},
  {'CandidateStepType': 'AWS::SageMaker::TransformJob',
   'CandidateStepArn': 'arn:aws:sagemaker:us-east-2:533787958253:transform-job/automl-dm--dpp2-rpb-1-6ff77104e6944bb28dc4fcb89de37d80ec85

We can see the containers and models composing the Inference Pipeline.

In [19]:
for container in best_candidate['InferenceContainers']:
    print(container['Image'])
    print(container['ModelDataUrl'])
    print('======================')

257758044811.dkr.ecr.us-east-2.amazonaws.com/sagemaker-sklearn-automl:0.1.0-cpu-py3
s3://sagemaker-us-east-2-533787958253/models/autopilot/automl-dm-27-20-20-56/data-processor-models/automl-dm--dpp2-1-f6fa3c0be79a4063b97b6571ff3321c366be99ac91d64/output/model.tar.gz
257758044811.dkr.ecr.us-east-2.amazonaws.com/sagemaker-xgboost:0.90-1-cpu-py3
s3://sagemaker-us-east-2-533787958253/models/autopilot/automl-dm-27-20-20-56/tuning/automl-dm--dpp2-xgb/tuning-job-1-5ebf290f7b714803b3-002-a0c1c68c/output/model.tar.gz
257758044811.dkr.ecr.us-east-2.amazonaws.com/sagemaker-sklearn-automl:0.1.0-cpu-py3
s3://sagemaker-us-east-2-533787958253/models/autopilot/automl-dm-27-20-20-56/data-processor-models/automl-dm--dpp2-1-f6fa3c0be79a4063b97b6571ff3321c366be99ac91d64/output/model.tar.gz


## Best Candidate!
Note that AutoPilot chose different hyper-parameters and feature transformations to achieve the best accuracy score. The objective metric says nothing about prediction speed, though.

## Deploy the Model as a REST Endpoint
Batch transformations are also supported, but for now, we will use a REST Endpoint.

In [20]:
model_name = 'automl-dm-model-' + timestamp_suffix

model_arn = sm.create_model(Containers=best_candidate['InferenceContainers'],
                            ModelName=model_name,
                            ExecutionRoleArn=role)

print('Best candidate model ARN: ', model_arn['ModelArn'])

Best candidate model ARN:  arn:aws:sagemaker:us-east-2:533787958253:model/automl-dm-model-27-20-20-56


## Amazon Resource Name (ARN)

As a refresher: an Amazon Resource Name (ARN) is a naming convention that uniquely identifies a resource in Amazon Web Services (AWS). ARNs, which are specific to AWS, let you refer to a resource unambiguously in policies and API calls across AWS products.
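
To make the structure concrete, the sketch below splits an ARN into its colon-separated fields, following the general layout `arn:partition:service:region:account-id:resource`. `parse_arn` and the `Arn` tuple are illustrative names, not part of any AWS SDK.

```python
from collections import namedtuple

Arn = namedtuple('Arn', 'partition service region account_id resource')


def parse_arn(arn):
    """Split an ARN into its five fields after the literal 'arn' prefix."""
    # Limit the split to 5 so colons inside the resource part are preserved.
    parts = arn.split(':', 5)
    if len(parts) != 6 or parts[0] != 'arn':
        raise ValueError('Not a valid ARN: {}'.format(arn))
    return Arn(*parts[1:])


model = parse_arn(
    'arn:aws:sagemaker:us-east-2:533787958253:model/automl-dm-model-27-20-20-56')
print(model.service, model.region, model.resource)
```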

In [21]:
# EndpointConfig name
timestamp_suffix = strftime('%d-%H-%M-%S', gmtime())
epc_name = 'automl-dm-epc-' + timestamp_suffix

# Endpoint name
xgb_endpoint_name = 'automl-dm-ep-' + timestamp_suffix
variant_name = 'automl-dm-variant-' + timestamp_suffix

print(xgb_endpoint_name)
print(variant_name)

automl-dm-ep-28-11-57-43
automl-dm-variant-28-11-57-43


In [22]:
ep_config = sm.create_endpoint_config(EndpointConfigName = epc_name,
                                      ProductionVariants=[{'InstanceType':'ml.m4.xlarge',
                                                           'InitialInstanceCount': 1,
                                                           'ModelName': model_name,
                                                           'VariantName': variant_name}])


In [23]:
create_endpoint_response = sm.create_endpoint(EndpointName=xgb_endpoint_name,
                                              EndpointConfigName=epc_name)
print(create_endpoint_response['EndpointArn'])

arn:aws:sagemaker:us-east-2:533787958253:endpoint/automl-dm-ep-28-11-57-43


## Wait for the Model to Deploy
This may take 5-10 minutes.  Please be patient.

In [25]:
sm.get_waiter('endpoint_in_service').wait(EndpointName=xgb_endpoint_name)

In [31]:
resp = sm.describe_endpoint(EndpointName=xgb_endpoint_name)
status = resp['EndpointStatus']

print("Arn: " + resp['EndpointArn'])
print("Status: " + status)

Arn: arn:aws:sagemaker:us-east-2:533787958253:endpoint/automl-dm-ep-28-11-57-43
Status: InService
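
If you want to control how long the wait lasts, boto3 waiters accept a `WaiterConfig` with `Delay` and `MaxAttempts`. The wrapper below is a sketch with a hypothetical name (`wait_for_endpoint`) and arbitrary example limits (30 seconds × 40 attempts, roughly 20 minutes); it takes the client as a parameter rather than assuming a global one.

```python
def wait_for_endpoint(sm_client, endpoint_name, delay_seconds=30, max_attempts=40):
    """Block until the endpoint is in service, then return its status string."""
    waiter = sm_client.get_waiter('endpoint_in_service')
    # The waiter raises an exception if the endpoint fails or attempts run out.
    waiter.wait(
        EndpointName=endpoint_name,
        WaiterConfig={'Delay': delay_seconds, 'MaxAttempts': max_attempts},
    )
    return sm_client.describe_endpoint(EndpointName=endpoint_name)['EndpointStatus']


# In the notebook this might be called as (assumes `sm` and `xgb_endpoint_name` exist):
# print(wait_for_endpoint(sm, xgb_endpoint_name))
```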


## Test Our Model with Some Example Reviews
Let's do some ad-hoc predictions on our model.

In [32]:
sm_runtime = boto3.client('sagemaker-runtime')

In [33]:
csv_line_predict_positive = """I loved it!"""

response = sm_runtime.invoke_endpoint(EndpointName=xgb_endpoint_name, ContentType='text/csv', Accept='text/csv', Body=csv_line_predict_positive)

response_body = response['Body'].read().decode('utf-8').strip()
response_body

'4'

In [34]:
csv_line_predict_meh = """It's OK."""

response = sm_runtime.invoke_endpoint(EndpointName=xgb_endpoint_name, ContentType='text/csv', Accept='text/csv', Body=csv_line_predict_meh)

response_body = response['Body'].read().decode('utf-8').strip()
response_body

'3'

In [35]:
csv_line_predict_negative = """The worst product ever."""

response = sm_runtime.invoke_endpoint(EndpointName=xgb_endpoint_name, ContentType='text/csv', Accept='text/csv', Body=csv_line_predict_negative)

response_body = response['Body'].read().decode('utf-8').strip()
response_body

'1'
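
The three cells above repeat the same call, so they could be folded into one helper. This is a sketch rather than the notebook's code: `to_csv_line` and `predict_star_rating` are hypothetical names, and quoting the review is an assumption that the endpoint parses the body as CSV, in which case an unquoted comma would split the review into several columns.

```python
import csv
import io


def to_csv_line(review_text):
    """Quote the text as a single CSV field so embedded commas stay in one column."""
    buffer = io.StringIO()
    csv.writer(buffer).writerow([review_text])
    return buffer.getvalue().rstrip('\r\n')


def predict_star_rating(runtime_client, endpoint_name, review_text):
    """Send one review to the endpoint and return the raw prediction string."""
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType='text/csv',
        Accept='text/csv',
        Body=to_csv_line(review_text),
    )
    return response['Body'].read().decode('utf-8').strip()


# In the notebook this might be called as (assumes `sm_runtime` and `xgb_endpoint_name` exist):
# predict_star_rating(sm_runtime, xgb_endpoint_name, 'I loved it!')
```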

## Summary
We used AutoPilot to automatically find the best model, hyper-parameters, and feature-engineering scripts for our dataset.  

AutoPilot uses a white-box approach to generate re-usable exploration Jupyter Notebooks and transformation Python scripts to continue to train and deploy our model on new data - well after this initial interaction with the AutoPilot service.