# Kaggle Titanic Using SageMaker AutoPilot

This notebook demonstrates SageMaker AutoPilot in support of a blog post comparing SageMaker with DIY approaches. For a more detailed overview of AutoPilot, see the SageMaker examples.

## Import Required Libraries

In [1]:
import pandas as pd
import io
import boto3
import sagemaker
from time import sleep, strftime, gmtime
from sklearn.metrics import mean_squared_error
from math import sqrt
from urllib.parse import urlparse

## Set-Up SageMaker

In [2]:
role = sagemaker.get_execution_role()

## Review Data

A quick look at the training data to check the feature types, the column headings and the target column.

In [3]:
data = pd.read_csv("s3://edskaggletitanic/train.csv")
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
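
The feature types and missing values that this review is looking for can be summarised with a couple of pandas calls. The sketch below works on a small inline sample shaped like the rows shown above rather than on the S3 file; `csv_text`, `sample` and `summary` are illustrative names, not part of the original notebook.

```python
import io

import pandas as pd

# Inline sample mirroring three of the rows shown above (illustrative only).
csv_text = """PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
"""
sample = pd.read_csv(io.StringIO(csv_text))

# One row per column: the inferred dtype and the number of missing values.
summary = pd.DataFrame(
    {"dtype": sample.dtypes.astype(str), "missing": sample.isna().sum()}
)
print(summary)

# Class balance of the target column.
print(sample["Survived"].value_counts())
```

On the full training set the same two calls would show which columns (for example `Cabin`) are sparsely populated before the data is handed to AutoPilot.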


## Configure AutoML

In [4]:
bucket = "edskaggletitanic"
input_data_config = [
    {
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": "s3://{}/train.csv".format(bucket),
            }
        },
        "TargetAttributeName": "Survived",
    }
]

job_config = {"CompletionCriteria": {"MaxCandidates": 10}}


output_data_config = {"S3OutputPath": "s3://{}/output".format(bucket)}
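
The three dictionaries above follow a fixed shape, so they can be produced by one small function when the bucket or target column changes. This is only a sketch: `build_automl_config`, its parameter names and `example-bucket` are hypothetical and not part of the original notebook.

```python
def build_automl_config(bucket, target, max_candidates=None,
                        train_key="train.csv", output_prefix="output"):
    """Return (input_data_config, job_config, output_data_config) dicts."""
    input_data_config = [
        {
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": f"s3://{bucket}/{train_key}",
                }
            },
            "TargetAttributeName": target,
        }
    ]
    # Only add completion criteria when a candidate limit is requested.
    job_config = {}
    if max_candidates is not None:
        job_config["CompletionCriteria"] = {"MaxCandidates": max_candidates}
    output_data_config = {"S3OutputPath": f"s3://{bucket}/{output_prefix}"}
    return input_data_config, job_config, output_data_config


# "example-bucket" is a placeholder bucket name.
inputs, job_cfg, outputs = build_automl_config(
    "example-bucket", "Survived", max_candidates=10
)
print(inputs[0]["DataSource"]["S3DataSource"]["S3Uri"])
print(job_cfg)
print(outputs)
```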

## Run AutoML

In [5]:
timestamp_suffix = strftime("%Y%m%d-%H-%M", gmtime())
auto_ml_job_name = "autoTitanic" + timestamp_suffix

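sm = boto3.client('sagemaker')
# Note: job_config (defined above) is not passed here, so MaxCandidates is not applied
# to this run. Pass AutoMLJobConfig=job_config to enforce the limit.
sm.create_auto_ml_job(AutoMLJobName=auto_ml_job_name,
                      InputDataConfig=input_data_config,
                      OutputDataConfig=output_data_config,
                      RoleArn=role)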

{'AutoMLJobArn': 'arn:aws:sagemaker:eu-west-2:046439977443:automl-job/autotitanic20220729-14-52',
 'ResponseMetadata': {'RequestId': 'dd8e7246-4cd3-4c9e-be44-383df2dabd85',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'dd8e7246-4cd3-4c9e-be44-383df2dabd85',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '96',
   'date': 'Fri, 29 Jul 2022 14:52:40 GMT'},
  'RetryAttempts': 0}}
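
The request above can be assembled as a plain dictionary first, which makes it easy to include the optional `AutoMLJobConfig` argument and to check the generated job name before calling AWS. The sketch below makes no AWS call. The helper name `build_create_job_kwargs`, the placeholder role ARN and the 32-character name limit are assumptions; check the limit against the `CreateAutoMLJob` API reference.

```python
from time import gmtime, strftime

# Assumed maximum length for AutoMLJobName; verify against the API reference.
MAX_JOB_NAME_LENGTH = 32


def build_create_job_kwargs(prefix, input_data_config, output_data_config,
                            role_arn, job_config=None, now=None):
    """Build keyword arguments for sm.create_auto_ml_job without calling AWS."""
    suffix = strftime("%Y%m%d-%H-%M", now if now is not None else gmtime())
    name = prefix + suffix
    if len(name) > MAX_JOB_NAME_LENGTH:
        raise ValueError(f"job name {name!r} is longer than {MAX_JOB_NAME_LENGTH} characters")
    kwargs = {
        "AutoMLJobName": name,
        "InputDataConfig": input_data_config,
        "OutputDataConfig": output_data_config,
        "RoleArn": role_arn,
    }
    # Include the job configuration only when one is supplied.
    if job_config:
        kwargs["AutoMLJobConfig"] = job_config
    return kwargs


# Placeholder values for illustration; gmtime(0) fixes the timestamp.
kwargs = build_create_job_kwargs(
    "autoTitanic",
    input_data_config=[],
    output_data_config={"S3OutputPath": "s3://example-bucket/output"},
    role_arn="arn:aws:iam::123456789012:role/ExampleRole",
    job_config={"CompletionCriteria": {"MaxCandidates": 10}},
    now=gmtime(0),
)
print(kwargs["AutoMLJobName"])
```

The result would then be passed as `sm.create_auto_ml_job(**kwargs)`.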

## Track Progress

The job will run for a while, so the next cell tracks its progress.

In [6]:
describe_response = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)
print(describe_response["AutoMLJobStatus"] + " - " + describe_response["AutoMLJobSecondaryStatus"])
job_run_status = describe_response["AutoMLJobStatus"]

while job_run_status not in ("Failed", "Completed", "Stopped"):
    describe_response = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)
    job_run_status = describe_response["AutoMLJobStatus"]

    print(
        describe_response["AutoMLJobStatus"] + " - " + describe_response["AutoMLJobSecondaryStatus"]
    )
    sleep(60)

InProgress - Starting
InProgress - Starting
InProgress - AnalyzingData
InProgress - AnalyzingData
InProgress - AnalyzingData
InProgress - AnalyzingData
InProgress - AnalyzingData
InProgress - AnalyzingData
InProgress - AnalyzingData
InProgress - AnalyzingData
InProgress - FeatureEngineering
InProgress - FeatureEngineering
InProgress - FeatureEngineering
InProgress - FeatureEngineering
InProgress - FeatureEngineering
InProgress - FeatureEngineering
InProgress - FeatureEngineering
InProgress - FeatureEngineering
InProgress - ModelTuning
InProgress - ModelTuning
InProgress - ModelTuning
InProgress - ModelTuning
InProgress - ModelTuning
InProgress - ModelTuning
InProgress - ModelTuning
InProgress - ModelTuning
InProgress - ModelTuning
InProgress - ModelTuning
InProgress - ModelTuning
InProgress - ModelTuning
InProgress - ModelTuning
InProgress - ModelTuning
InProgress - ModelTuning
InProgress - ModelTuning
InProgress - ModelTuning
InProgress - ModelTuning
InProgress - ModelTuning
InProgres
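
A polling loop like the one above can be wrapped in a helper that also gives up after a deadline, so an unattended notebook does not wait indefinitely. This is a minimal sketch, not the notebook's own code: `wait_for_status` and its parameters are hypothetical, and the `describe` argument stands in for a call such as `lambda: sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)`.

```python
import time

TERMINAL_STATES = ("Completed", "Failed", "Stopped")


def wait_for_status(describe, status_key, poll_seconds=60,
                    timeout_seconds=4 * 3600,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll describe() until response[status_key] is terminal or time runs out."""
    deadline = clock() + timeout_seconds
    while True:
        response = describe()
        status = response[status_key]
        if status in TERMINAL_STATES:
            return response
        if clock() >= deadline:
            raise TimeoutError(f"still {status!r} after {timeout_seconds} seconds")
        sleep(poll_seconds)


# Demonstration with a fake describe call instead of a real SageMaker client.
statuses = iter(["InProgress", "InProgress", "Completed"])
final = wait_for_status(
    lambda: {"AutoMLJobStatus": next(statuses)},
    "AutoMLJobStatus",
    sleep=lambda seconds: None,  # skip real waiting in the demonstration
)
print(final["AutoMLJobStatus"])
```

Injecting `sleep` and `clock` keeps the helper testable without waiting a minute between polls.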

## List Candidates

Review the candidate models provided by the job.

In [7]:
candidates = sm.list_candidates_for_auto_ml_job(
    AutoMLJobName=auto_ml_job_name, SortBy='FinalObjectiveMetricValue'
)['Candidates']
# Print a 1-based rank, the candidate name and its objective metric value
for index, candidate in enumerate(candidates, start=1):
    print(str(index) + "  " + candidate['CandidateName'] + "  " + str(candidate['FinalAutoMLJobObjectiveMetric']['Value']))
    
best_candidate = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)["BestCandidate"]
best_candidate_name = best_candidate["CandidateName"]

1  autoTitanic20220729-14-52GMDj5xB-060-9d6c940b  0.7871000170707703
2  autoTitanic20220729-14-52GMDj5xB-080-d083c54f  0.7828199863433838
3  autoTitanic20220729-14-52GMDj5xB-099-335109a6  0.7783600091934204
4  autoTitanic20220729-14-52GMDj5xB-084-3381dd8e  0.7780799865722656
5  autoTitanic20220729-14-52GMDj5xB-094-8b1b242f  0.7775200009346008
6  autoTitanic20220729-14-52GMDj5xB-021-267f4cda  0.7774199843406677
7  autoTitanic20220729-14-52GMDj5xB-086-c284a62d  0.7771599888801575
8  autoTitanic20220729-14-52GMDj5xB-093-3f8e2fc8  0.7737500071525574
9  autoTitanic20220729-14-52GMDj5xB-052-f633a743  0.7734699845314026
10  autoTitanic20220729-14-52GMDj5xB-030-9605930e  0.7730200290679932
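
The ranking shown above can also be reproduced locally from the candidate dictionaries, which is useful when a different sort direction is needed. The sketch below uses made-up candidates with the same `CandidateName` and `FinalAutoMLJobObjectiveMetric` keys as the API response; the names, values and the `rank_candidates` helper are illustrative only.

```python
def rank_candidates(candidates, higher_is_better=True):
    """Sort candidate dicts by their final objective metric value."""
    return sorted(
        candidates,
        key=lambda c: c["FinalAutoMLJobObjectiveMetric"]["Value"],
        reverse=higher_is_better,
    )


# Made-up candidates shaped like entries of the 'Candidates' list.
example_candidates = [
    {"CandidateName": "candidate-a", "FinalAutoMLJobObjectiveMetric": {"Value": 0.77}},
    {"CandidateName": "candidate-b", "FinalAutoMLJobObjectiveMetric": {"Value": 0.79}},
    {"CandidateName": "candidate-c", "FinalAutoMLJobObjectiveMetric": {"Value": 0.78}},
]

ranked = rank_candidates(example_candidates)
for position, candidate in enumerate(ranked, start=1):
    value = candidate["FinalAutoMLJobObjectiveMetric"]["Value"]
    print(f"{position}  {candidate['CandidateName']}  {value}")

best_name = ranked[0]["CandidateName"]
```

For objectives where a lower value is better, `higher_is_better=False` reverses the order.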


## Make Predictions with Best Candidate

Use the best-performing candidate to generate predictions with a batch transform job.

### Set Up Transform

In [8]:
# Attach to the finished job. A local variable is used so that the
# sagemaker.automl submodule is not overwritten.
automl = sagemaker.AutoML.attach(auto_ml_job_name=auto_ml_job_name)

s3_transform_output_path = "s3://{}/inference-results/".format(bucket)

model_name = "{0}-model".format(best_candidate_name)

model = automl.create_model(
    name=model_name,
    candidate=best_candidate,
)

output_path = s3_transform_output_path + best_candidate_name + "/"

transformer = model.transformer(
    instance_count=1,
    instance_type="ml.m5.xlarge",
    assemble_with="Line",
    strategy="SingleRecord",
    output_path=output_path,
    env={"SAGEMAKER_MODEL_SERVER_TIMEOUT": "100", "SAGEMAKER_MODEL_SERVER_WORKERS": "1"},
)

### Launch and Track Transform

In [9]:
transformer.transform(
    data="s3://{}/test.csv".format(bucket),
    split_type="Line",
    content_type="text/csv",
    wait=False,
    model_client_config={"InvocationsTimeoutInSeconds": 80, "InvocationsMaxRetries": 1},
)

print("Starting transform job {}".format(transformer._current_job_name))

# Wait for the transform job to finish
pending_complete = True
job_name = transformer._current_job_name

while pending_complete:
    pending_complete = False

    description = sm.describe_transform_job(TransformJobName=job_name)
    # "Stopped" is also a terminal state; without it a stopped job would loop forever
    if description["TransformJobStatus"] not in ["Failed", "Completed", "Stopped"]:
        pending_complete = True

    print("{} transform job is running.".format(job_name))
    sleep(60)

print("\nCompleted.")

Starting transform job autoTitanic20220729-14-52GMDj5xB-060-9d-2022-07-29-16-50-56-474
autoTitanic20220729-14-52GMDj5xB-060-9d-2022-07-29-16-50-56-474 transform job is running.
autoTitanic20220729-14-52GMDj5xB-060-9d-2022-07-29-16-50-56-474 transform job is running.
autoTitanic20220729-14-52GMDj5xB-060-9d-2022-07-29-16-50-56-474 transform job is running.
autoTitanic20220729-14-52GMDj5xB-060-9d-2022-07-29-16-50-56-474 transform job is running.
autoTitanic20220729-14-52GMDj5xB-060-9d-2022-07-29-16-50-56-474 transform job is running.
autoTitanic20220729-14-52GMDj5xB-060-9d-2022-07-29-16-50-56-474 transform job is running.
autoTitanic20220729-14-52GMDj5xB-060-9d-2022-07-29-16-50-56-474 transform job is running.

Completed.


### Evaluate

In [10]:
def get_csv_from_s3(s3uri, file_name):
    parsed_url = urlparse(s3uri)
    bucket_name = parsed_url.netloc
    prefix = parsed_url.path[1:].strip("/")
    s3 = boto3.resource("s3")
    obj = s3.Object(bucket_name, "{}/{}".format(prefix, file_name))
    return obj.get()["Body"].read().decode("utf-8")


job_status = sm.describe_transform_job(TransformJobName=job_name)["TransformJobStatus"]

test_file = "test.csv"

if job_status == "Completed":
    pred_csv = get_csv_from_s3(transformer.output_path, "{}.out".format(test_file))
    predictions = pd.read_csv(io.StringIO(pred_csv), header=None)
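
The helper above relies on `urlparse` to split an S3 URI into a bucket and a key. The sketch below isolates that step so it can be checked without touching S3; `split_s3_uri` and `example-bucket` are hypothetical names used only for illustration.

```python
from urllib.parse import urlparse


def split_s3_uri(s3uri, file_name):
    """Return (bucket, key) for file_name stored under the given S3 URI prefix."""
    parsed = urlparse(s3uri)
    if parsed.scheme != "s3":
        raise ValueError(f"not an S3 URI: {s3uri!r}")
    prefix = parsed.path.strip("/")
    # Avoid a leading slash in the key when the URI has no prefix.
    key = f"{prefix}/{file_name}" if prefix else file_name
    return parsed.netloc, key


bucket_name, key = split_s3_uri(
    "s3://example-bucket/inference-results/candidate/", "test.csv.out"
)
print(bucket_name, key)
```

Unlike the notebook's version, this sketch handles a URI with no prefix explicitly, since joining an empty prefix with `"{}/{}"` would give a key that starts with `/`.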

In [11]:
transformer.transform(
    data="s3://{}/train.csv".format(bucket),
    split_type="Line",
    content_type="text/csv",
    wait=False,
    model_client_config={"InvocationsTimeoutInSeconds": 80, "InvocationsMaxRetries": 1},
)

print("Starting transform job {}".format(transformer._current_job_name))

# Wait for the transform job to finish
pending_complete = True
job_name = transformer._current_job_name

while pending_complete:
    pending_complete = False

    description = sm.describe_transform_job(TransformJobName=job_name)
    # "Stopped" is also a terminal state; without it a stopped job would loop forever
    if description["TransformJobStatus"] not in ["Failed", "Completed", "Stopped"]:
        pending_complete = True

    print("{} transform job is running.".format(job_name))
    sleep(60)

print("\nCompleted.")

# get_csv_from_s3 is defined in the previous cell and is reused here.


job_status = sm.describe_transform_job(TransformJobName=job_name)["TransformJobStatus"]

test_file = "train.csv"

if job_status == "Completed":
    pred_csv = get_csv_from_s3(transformer.output_path, "{}.out".format(test_file))
    predictions = pd.read_csv(io.StringIO(pred_csv), header=None)

Starting transform job autoTitanic20220729-14-52GMDj5xB-060-9d-2022-07-29-16-57-57-798
autoTitanic20220729-14-52GMDj5xB-060-9d-2022-07-29-16-57-57-798 transform job is running.
autoTitanic20220729-14-52GMDj5xB-060-9d-2022-07-29-16-57-57-798 transform job is running.
autoTitanic20220729-14-52GMDj5xB-060-9d-2022-07-29-16-57-57-798 transform job is running.
autoTitanic20220729-14-52GMDj5xB-060-9d-2022-07-29-16-57-57-798 transform job is running.
autoTitanic20220729-14-52GMDj5xB-060-9d-2022-07-29-16-57-57-798 transform job is running.
autoTitanic20220729-14-52GMDj5xB-060-9d-2022-07-29-16-57-57-798 transform job is running.
autoTitanic20220729-14-52GMDj5xB-060-9d-2022-07-29-16-57-57-798 transform job is running.

Completed.
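
Predictions made on the training data can be compared with the known `Survived` labels to get a rough accuracy figure. The notebook does not show this step, so the following is only a sketch on made-up values: `accuracy` is a hypothetical helper, and whether the transform output lines up row for row with `train.csv` (for example, how the header row is handled) is an assumption that needs checking against the actual output file.

```python
import pandas as pd


def accuracy(predicted, actual):
    """Fraction of positions where the predicted label equals the actual label."""
    predicted = pd.Series(predicted).reset_index(drop=True)
    actual = pd.Series(actual).reset_index(drop=True)
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    return float((predicted == actual).mean())


# Made-up labels for illustration; in the notebook these would come from
# predictions[0] and data["Survived"] once their alignment is confirmed.
example_predicted = [0, 1, 1, 0]
example_actual = [0, 1, 0, 0]
score = accuracy(example_predicted, example_actual)
print(score)
```

A score computed on the training data is optimistic, since the model has already seen those rows, so it is better read as a sanity check than as an estimate of leaderboard performance.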
