# Exploring AWS SageMaker through the Kaggle "House Prices: Advanced Regression Techniques" competition
---


## Background
The purpose of this notebook is to use AWS SageMaker for the first time as a tool for data exploration and machine learning.

__Disclaimer__: I'm new to machine learning. I'm going through this exercise to further my own learning. I'm sharing it in case it can help someone else, as I have benefited greatly from others' postings. As you read this, you will assuredly find many opportunities for improvement. Please share them with me!

---

## Contents

1. [Background](#Background)
1. [Approach](#Approach)
1. [Environment](#Environment)
1. [Data](#Data)
1. [SageMaker](#SageMaker)
1. [Hyperparameters](#Hyperparameters)
1. [Conclusions](#Conclusions)

---


## Approach

The exercise we will try to perform on SageMaker is making a submission to the "House Prices" Kaggle competition:
https://www.kaggle.com/c/house-prices-advanced-regression-techniques

Our goal is _not_ to score highly on the leaderboard. We just want to use the SageMaker infrastructure to arrive at a solution that is measurably better than a naive one. As such, many corners will be cut: default algorithms will be used, data exploration will be mostly skipped, and feature engineering won't be much of a concern. And because there is a ton of information and notebooks available on the Kaggle site about this competition, we won't repeat them here. Refer to the Kaggle site for background information.

Because tree-based ensemble algorithms can perform reasonably well on this type of regression problem, we'll use an eXtreme Gradient Boosting (XGBoost) SageMaker example as our seed notebook. That way we don't have to hunt for the right syntax and such. You can find that notebook on the SageMaker examples tab or on GitHub if you don't want to use this notebook: https://github.com/awslabs/amazon-sagemaker-examples/blob/master/introduction_to_applying_machine_learning/xgboost_direct_marketing/xgboost_direct_marketing_sagemaker.ipynb



---

## Environment

### AWS Setup
1. First, we need an AWS account on aws.amazon.com
1. Once signed in, locate the SageMaker service: https://console.aws.amazon.com/sagemaker/home?region=us-east-1#/landing
1. Then hit "Create notebook instance". Since this EC2 instance will only be used to run the Jupyter notebook, it does not need to be particularly powerful. I chose the default of "ml.t2.medium" which costs 5 cents/hour at the time of this writing. I used the default, permissive IAM role because... cutting a straight line to SageMaker.
1. Use/copy the "direct marketing" SageMaker example notebook or this notebook
1. Start your notebook, and get a new terminal while you are at it since we'll use it in a second.

### Enabling Kaggle Tools
For convenience, let's enable the Kaggle API and command line tools. First, we install the Python package to get the command line tools.

In [None]:
!pip install kaggle

In [None]:
!kaggle competitions list

__Kaggle API__

Then we enable the Kaggle API. This assumes you have an account on Kaggle. It's free and only takes a minute. Once you have that, follow the instructions here to retrieve your kaggle.json file:

https://github.com/Kaggle/kaggle-api

Using the AWS Jupyter files tab, upload your kaggle.json. I had to use the Jupyter terminal to move it to ~/.kaggle/

In [None]:
!mkdir -p ~/.kaggle
!mv kaggle.json ~/.kaggle/

Finally, we follow the advice to make sure our Kaggle key isn't readable by other users of this system. This is a corner we could have cut on this private EC2 instance.

In [None]:
!chmod 600 ~/.kaggle/kaggle.json

### Python and SageMaker Setup

Now we start coding in Python. We import the necessary libraries and create a connection to the SageMaker service.

In [None]:
import numpy as np                                # For matrix operations and numerical processing
import pandas as pd                               # For munging tabular data
import matplotlib.pyplot as plt                   # For charts and visualizations
from IPython.display import Image                 # For displaying images in the notebook
from IPython.display import display               # For displaying outputs in the notebook
from time import gmtime, strftime                 # For labeling SageMaker models, endpoints, etc.
import sys                                        # For writing outputs to notebook
import math                                       # For ceiling function
import json                                       # For parsing hosting outputs
import os                                         # For manipulating filepath names
import sagemaker                                  # Amazon SageMaker's Python SDK provides many helper functions
from sagemaker.predictor import csv_serializer    # Converts strings for HTTP POST requests on inference

In [None]:
bucket = sagemaker.Session().default_bucket()
prefix = 'sagemaker/house_price_xgboost'
 
# Define IAM role
import boto3
import re
from sagemaker import get_execution_role

role = get_execution_role()
region = boto3.Session().region_name 
smclient = boto3.Session().client('sagemaker')

---

## Data

### Retrieval
Let's start by downloading the House Prices competition data with the Kaggle command line tool.

In [None]:
!kaggle competitions list

In [None]:
!kaggle competitions download -p ./data house-prices-advanced-regression-techniques

### Exploration
Now let's read this into a Pandas data frame and take a look, but just a quick one. Normally, we would investigate the data. But because our goal is to get a model trained end-to-end on SageMaker as quickly as possible, we'll skip all those best practices and see what happens.

In [None]:
df_train = pd.read_csv('./data/train.csv')
pd.set_option('display.max_columns', 500)     # Make sure we can see all of the columns
df_train.head()

### First submission

Before we go further investigating the data and creating sophisticated predictive models, let's make sure we're able to submit something, anything, to the competition. The idea is to validate our pipeline and that we've understood the submission format. And by getting an initial score, we're better able to judge future improvements.

The required format is provided in the "sample_submission.csv" file retrieved earlier.

In [None]:
!head ./data/sample_submission.csv

For the purpose of test driving the submission process, we don't need a good result, just a valid one. A trivial attempt is to use the mean SalePrice of the training set as the answer for every item in the testing set.

In [None]:
df_train.describe()['SalePrice']

In [None]:
df_competition = pd.read_csv('./data/test.csv')
df_competition.head()

In [None]:
df_submit = pd.DataFrame(df_competition['Id'], dtype=int)
df_submit['SalePrice'] = df_train['SalePrice'].mean()
display(df_submit.head())   # display() both, since a cell only echoes its last expression
display(df_submit.tail())

In [None]:
df_submit.to_csv('./data/sub_mean.csv',index=False)
!head ./data/sub_mean.csv
!tail ./data/sub_mean.csv


That is the format we were looking for. Now let's try to submit it through the Kaggle API:

> usage: kaggle competitions submit [-h] -f FILE_NAME -m MESSAGE [-q] [competition]

In [None]:
!kaggle competitions submit -f ./data/sub_mean.csv -m "Means-based submission" house-prices-advanced-regression-techniques
#
# Need UI buffer space so horizontal scrollbar does not get in the way

On Kaggle.com under "My submission", I can see that this trivial technique gives a public score of 0.42949. At the time of this writing, this technique places us 4515 out of 4745 on the public leaderboard. Clearly not a good score, but we didn't expect it to be.
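
For context on what that score measures: the competition evaluates submissions by the root mean squared error between the logarithm of the predicted price and the logarithm of the actual price. The sketch below uses made-up prices (not competition data) and a helper name of my own, `rmsle`, to show how a mean-only submission would be scored.

```python
import math

def rmsle(actual, predicted):
    """Root mean squared error between log(actual) and log(predicted)."""
    squared_errors = [(math.log(p) - math.log(a)) ** 2
                      for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up sale prices, for illustration only
actual = [120000, 150000, 200000, 310000]
mean_price = sum(actual) / len(actual)
baseline = [mean_price] * len(actual)     # same prediction for every house

print(round(rmsle(actual, baseline), 4))  # error of the mean-only guess
print(rmsle(actual, actual))              # a perfect submission scores 0.0
```

Because the error is computed on logarithms, being off by 10% on a cheap house costs about as much as being off by 10% on an expensive one.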

Now that our submission pipeline has been validated, we can turn our attention to submitting something more clever through SageMaker.

### Data Transformation

To create a more sophisticated solution, we should know our data. The pros urge us to do the following:

> Cleaning up data is part of nearly every machine learning project.  It arguably presents the biggest risk if done incorrectly and is one of the more subjective aspects in the process.  Several common techniques include:

>* Handling missing values: Some machine learning algorithms are capable of handling missing values, but most would rather not.  Options include:
 * Removing observations with missing values: This works well if only a very small fraction of observations have incomplete information.
 * Removing features with missing values: This works well if there are a small number of features which have a large number of missing values.
 * Imputing missing values: Entire [books](https://www.amazon.com/Flexible-Imputation-Missing-Interdisciplinary-Statistics/dp/1439868247) have been written on this topic, but common choices are replacing the missing value with the mode or mean of that column's non-missing values.
* Converting categorical to numeric: The most common method is one hot encoding, which for each feature maps every distinct value of that column to its own feature which takes a value of 1 when the categorical feature is equal to that value, and 0 otherwise.
* Oddly distributed data: Although for non-linear models like Gradient Boosted Trees, this has very limited implications, parametric models like regression can produce wildly inaccurate estimates when fed highly skewed data.  In some cases, simply taking the natural log of the features is sufficient to produce more normally distributed data.  In others, bucketing values into discrete ranges is helpful.  These buckets can then be treated as categorical variables and included in the model when one hot encoded.
* Handling more complicated data types: Manipulating images, text, or data at varying grains is left for other notebook templates.

> Luckily, some of these aspects have already been handled for us, and the algorithm we are showcasing tends to do well at handling sparse or oddly distributed data.  Therefore, let's keep pre-processing simple.

So, for the purpose of racing to the end, we'll disregard most of it and simply convert categorical variables to numerical data.

In [None]:
df_train = pd.get_dummies(df_train)   # Convert categorical variables to sets of indicators
df_train.describe()

### Feature engineering

If we cared about the score, we would totally look into that.

We will, however, heed this advice about setting aside a validation set from our training data:

> When building a model whose primary goal is to predict a target value on new data, it is important to understand overfitting.  Supervised learning models are designed to minimize error between their predictions of the target value and actuals, in the data they are given.  This last part is key, as frequently in their quest for greater accuracy, machine learning models bias themselves toward picking up on minor idiosyncrasies within the data they are shown.  These idiosyncrasies then don't repeat themselves in subsequent data, meaning those predictions can actually be made less accurate, at the expense of more accurate predictions in the training phase.

> The most common way of preventing this is to build models with the concept that a model shouldn't only be judged on its fit to the data it was trained on, but also on "new" data.  There are several different ways of operationalizing this, holdout validation, cross-validation, leave-one-out validation, etc.  For our purposes, we'll simply randomly split the data into 3 uneven groups.  The model will be trained on 70% of data, it will then be evaluated on 20% of data to give us an estimate of the accuracy we hope to have on "new" data, and 10% will be held back as a final testing dataset which will be used later on.

In [None]:
# Shuffle the rows, then cut at 70% and 90% to get the train / validation / test sets
train_data, validation_data, test_data = np.split(df_train.sample(frac=1, random_state=1729), [int(0.7 * len(df_train)), int(0.9*len(df_train))])  

In [None]:
train_data.shape
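
One detail of the split above that is easy to misread: `np.split` takes the positions at which to cut, not the sizes of the pieces. Here is a minimal sketch on a stand-in array of 100 row labels. I compute the cut points with integer arithmetic, which is my own choice and differs slightly from the `int(0.7 * len(...))` form used in the cell.

```python
import numpy as np

rows = np.arange(100)                     # stand-in for 100 shuffled row labels
n = len(rows)
cut_points = [n * 7 // 10, n * 9 // 10]   # positions 70 and 90, not sizes

train, validation, test = np.split(rows, cut_points)
print(len(train), len(validation), len(test))  # 70 20 10
```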

>Amazon SageMaker's XGBoost container expects data in the libSVM or CSV data format.  For this example, we'll stick to CSV.  Note that the first column must be the target variable and the CSV should not include headers.  Also, notice that although repetitive it's easiest to do this after the train|validation|test split rather than before.  This avoids any misalignment issues due to random reordering.

In [None]:
# SageMaker's XGBoost CSV format: target in the first column, no header row
pd.concat([train_data['SalePrice'], train_data.drop(['SalePrice'], axis=1)], axis=1).to_csv('./data/sm_train.csv', index=False, header=False)
pd.concat([validation_data['SalePrice'], validation_data.drop(['SalePrice'], axis=1)], axis=1).to_csv('./data/sm_validation.csv', index=False, header=False)
!ls -l ./data

Now we'll copy the files to S3 for Amazon SageMaker's managed training to pick up.

In [None]:
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'train/train.csv')).upload_file('./data/sm_train.csv')
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'validation/validation.csv')).upload_file('./data/sm_validation.csv')

---
## SageMaker
### SageMaker Training

Yup, that's why we're using XGBoost:

> Now we know that most of our features have skewed distributions, some are highly correlated with one another, and some appear to have non-linear relationships with our target variable.  Also, for targeting future prospects, good predictive accuracy is preferred to being able to explain why that prospect was targeted.  Taken together, these aspects make gradient boosted trees a good candidate algorithm.

> There are several intricacies to understanding the algorithm, but at a high level, gradient boosted trees works by combining predictions from many simple models, each of which tries to address the weaknesses of the previous models.  By doing this the collection of simple models can actually outperform large, complex models.  Other Amazon SageMaker notebooks elaborate on gradient boosting trees further and how they differ from similar algorithms.
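
To make "each of which tries to address the weaknesses of the previous models" a bit more concrete, here is a toy sketch of boosting with squared error on a single feature. It is not XGBoost's actual algorithm, which adds regularization and much more. The data is made up, the function names are mine, and `num_round` and `eta` are simply borrowed from the hyperparameter names used in this notebook.

```python
def fit_stump(xs, residuals):
    """Find the one-split 'tree' that best fits the current residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def boost(xs, ys, num_round=50, eta=0.2):
    """Start from the mean, then repeatedly fit a stump to what is still wrong."""
    base = sum(ys) / len(ys)
    stumps = []
    preds = [base] * len(xs)
    for _ in range(num_round):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + eta * stump(x) for x, p in zip(xs, preds)]
    return lambda x: base + eta * sum(s(x) for s in stumps)

def mean_squared_error(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Made-up data: living area (sq ft) and price (thousands of dollars)
xs = [800, 1000, 1200, 1500, 1800, 2200, 2600, 3000]
ys = [95, 120, 135, 160, 190, 230, 260, 310]

baseline = boost(xs, ys, num_round=0)   # zero rounds: just predicts the mean
boosted = boost(xs, ys)
print(mean_squared_error(baseline, xs, ys), mean_squared_error(boosted, xs, ys))
```

The zero-round model is the same mean-only guess we submitted earlier; every additional round chips away at the remaining training error.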

> `xgboost` is an extremely popular, open-source package for gradient boosted trees.  It is computationally powerful, fully featured, and has been successfully used in many machine learning competitions.  Let's start with a simple `xgboost` model, trained using Amazon SageMaker's managed, distributed training framework.

> First we'll need to specify the ECR container location for Amazon SageMaker's implementation of XGBoost.

In [None]:
from sagemaker.amazon.amazon_estimator import get_image_uri
container = get_image_uri(boto3.Session().region_name, 'xgboost')

> Then, because we're training with the CSV file format, we'll create `s3_input`s that our training function can use as a pointer to the files in S3, which also specify that the content type is CSV.

In [None]:
s3_input_train = sagemaker.s3_input(s3_data='s3://{}/{}/train'.format(bucket, prefix), content_type='csv')
s3_input_validation = sagemaker.s3_input(s3_data='s3://{}/{}/validation/'.format(bucket, prefix), content_type='csv')

> First we'll need to specify training parameters to the estimator.  This includes:
1. The `xgboost` algorithm container
1. The IAM role to use
1. Training instance type and count
1. S3 location for output data
1. Algorithm hyperparameters

> And then a `.fit()` function which specifies:
1. S3 locations of the input data.  In this case we have both a training and validation set which are passed in.

_Note_: we are about to start an ml.m4.xlarge instance, which costs, as of now, 22.2 cents/hr. It'll run for the duration of our model training.

In [None]:
sess = sagemaker.Session()

xgb = sagemaker.estimator.Estimator(container,
                                    role, 
                                    train_instance_count=1, 
                                    train_instance_type='ml.m4.xlarge',
                                    output_path='s3://{}/{}/output'.format(bucket, prefix),
                                    sagemaker_session=sess)

# These are the default parameters that came with the Targeting Direct Marketing example
xgb.set_hyperparameters(max_depth=5,
                        eta=0.2,
                        gamma=4,
                        min_child_weight=6,
                        subsample=0.8,
                        silent=0,
                        objective='reg:linear',
                        num_round=100)

xgb.fit({'train': s3_input_train, 'validation': s3_input_validation}) 

__Some observations__

Launching the instance takes about 2 minutes. That's a long time to be staring at the screen. Once running, training our algorithm only takes 39 seconds, which is also the amount of time we are billed for. That's quite efficient. Think about it: being able to borrow someone else's computer for 39 seconds.

Still, a 39 second process did take nearly 3 min of wall clock time.
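
To put a number on it, here is the back-of-the-envelope arithmetic, assuming billing is strictly proportional to the seconds used at the 22.2 cents/hr rate quoted above. The helper name and the proportional-billing assumption are mine; check the current AWS pricing page for the real rules.

```python
def training_cost_usd(billed_seconds, hourly_rate_usd, instance_count=1):
    """Cost of a job billed per second at a flat hourly rate (an assumption)."""
    return instance_count * hourly_rate_usd * billed_seconds / 3600

cost = training_cost_usd(billed_seconds=39, hourly_rate_usd=0.222)
print(f"${cost:.4f}")  # $0.0024
```

That is roughly a quarter of a cent for the training run.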

---

### Deploying model
> Now that we've trained the `xgboost` algorithm on our data, let's deploy a model that's hosted behind a real-time endpoint.

This is a 3rd instance we are starting. The first one is used to run this notebook. The second one was used for training our model. And this third one is used for inferences.

In [None]:
xgb_predictor = xgb.deploy(initial_instance_count=1,
                           instance_type='ml.m4.xlarge')

__Observations__
Again, starting a new instance takes several minutes.

---

### First SM Model Evaluation
Let's compare how our SageMaker (SM) model does at predicting SalePrice we already know.

> First we'll need to determine how we pass data into and receive data from our endpoint.  Our data is currently stored as NumPy arrays in memory of our notebook instance.  To send it in an HTTP POST request, we'll serialize it as a CSV string 
then decode the resulting CSV.

> *Note: For inference with CSV format, SageMaker XGBoost requires that the data does NOT include the target variable.*

In [None]:
xgb_predictor.content_type = 'text/csv'
xgb_predictor.serializer = csv_serializer
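
As a rough picture of what travels over the wire, here is a sketch of the round trip: rows become a CSV string, and the comma-separated response is parsed back into floats. `fake_endpoint` is a hypothetical stand-in for `xgb_predictor.predict` so the example runs without an endpoint. The batching is my own addition (this notebook sends all rows in one request), on the assumption that smaller requests are safer for larger inputs.

```python
def to_csv_payload(rows):
    """Serialize rows of numbers as CSV text, one row per line, no header."""
    return "\n".join(",".join(str(v) for v in row) for row in rows)

def parse_predictions(response_bytes):
    """Decode a comma-separated response body into a list of floats."""
    return [float(v) for v in response_bytes.decode("utf-8").split(",")]

def predict_in_batches(rows, send, batch_size=500):
    """Send rows in chunks and collect the predictions in order."""
    predictions = []
    for start in range(0, len(rows), batch_size):
        payload = to_csv_payload(rows[start:start + batch_size])
        predictions.extend(parse_predictions(send(payload)))
    return predictions

def fake_endpoint(payload):
    # Hypothetical stand-in for the real endpoint: "predicts" each row's sum
    sums = [sum(float(v) for v in line.split(",")) for line in payload.split("\n")]
    return ",".join(str(s) for s in sums).encode("utf-8")

rows = [[1, 2], [3, 4], [5, 6]]
print(predict_in_batches(rows, fake_endpoint, batch_size=2))  # [3.0, 7.0, 11.0]
```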

Like we did for our training data, we need to convert the categorical columns in the "test" data provided by Kaggle to numerical format.

In [None]:
df_competition = pd.get_dummies(df_competition)
df_competition.shape

In [None]:
df_train.shape

__Note__ Our competition test data has 18 fewer columns than our training data. One column is due to the absence of 'SalePrice' from the Kaggle test data, as expected. The rest are due to differences in the values present in the categorical columns of the two files: get_dummies() only creates an indicator column for a value it actually sees.

Since a trained XGBoost model expects the same number of columns it was trained on, let's pad our competition data with as many columns as required. I will admit that I do not know what impact these useless columns have on the XGBoost algorithm. Since each padding column holds a single constant value, I'm hoping it's negligible. Note also that padding only fixes the column count; it does not guarantee that each column sits at the same position as in the training data.

A better way would have been to read and process all of the data, training + competition, together.
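
One way to get there, sketched on a made-up two-row example rather than the real data: encode each frame, then reindex the competition frame onto the training feature columns so that names and positions line up, with any missing indicator filled with 0. The tiny frames and variable names below are hypothetical.

```python
import pandas as pd

# Made-up miniature of the situation: 'Grvl' never appears in the test data
train = pd.DataFrame({"Id": [1, 2, 3],
                      "Street": ["Pave", "Grvl", "Pave"],
                      "SalePrice": [200000, 120000, 180000]})
test = pd.DataFrame({"Id": [4, 5], "Street": ["Pave", "Pave"]})

train_encoded = pd.get_dummies(train)
test_encoded = pd.get_dummies(test)      # has no 'Street_Grvl' column

# Use the training feature columns as the template for the test data
feature_columns = train_encoded.columns.drop("SalePrice")
test_aligned = test_encoded.reindex(columns=feature_columns, fill_value=0)

print(list(test_aligned.columns))  # ['Id', 'Street_Grvl', 'Street_Pave']
```

Unlike padding, this also drops any indicator column that exists only in the competition data, which the model never saw during training.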

In [None]:
df_comp_padded = df_competition.copy()   # Copy so the original frame is left untouched
for x in range(0, 18):
    df_comp_padded[x] = pd.Series(1, index=df_comp_padded.index)
df_comp_padded.shape

In [None]:
df_comp_padded.head()

In [None]:
def predict(data):
    ids = data['Id']
    # Send the feature matrix as CSV, then parse the comma-separated response into floats.
    # .values replaces the deprecated .as_matrix(); float replaces the removed np.float alias.
    response = xgb_predictor.predict(data.values).decode('utf-8')
    saleprice = np.array(response.split(','), dtype=float)
    predictions = list(zip(ids, saleprice))
    return predictions

%time predictions = predict(df_comp_padded)

__Observations__
Asking our model to predict the sale price of 1459 houses took about half a second. That feels pretty quick for real-time human interaction.

Let's take a quick look at the predictions:

In [None]:
predictions[0:5]

They are different from our means-based trivial submission from earlier. Let's submit them and see how the default XGBoost algorithm did.

### First SM submission

In [None]:
np.array(predictions).shape

In terms of shape, it's consistent with the submission requirements, minus the column headers.

In [None]:
#np.savetxt("../data/housing.csv", predictions, delimiter=",", header='Id,SalePrice', fmt='%u')

In [None]:
df = pd.DataFrame(predictions, columns=['Id', 'SalePrice'])
print(df.head())

In [None]:
df.to_csv('./data/sub_xgboost_default.csv', header=True, index=False)

In [None]:
!kaggle competitions submit -f ./data/sub_xgboost_default.csv -m "Default XGBoost submission" house-prices-advanced-regression-techniques
#
# Damned scroll bar

This default XGBoost submission gives me a public score of 0.17203, which is good for 3640th place as of this writing, a significant improvement on our means-based result! We can thus be fairly confident that we've improved our modeling.

---

## Hyperparameters 

### Setup
Now we'll try to use one of the strengths of SageMaker: automated hyperparameter tuning!

>*Note, with the default setting below, the hyperparameter tuning job can take about 30 minutes to complete.*

>Now that we have prepared the dataset, we are ready to train models. Before we do that, one thing to note is there are algorithm settings which are called "hyperparameters" that can dramatically affect the performance of the trained models. For example, XGBoost algorithm has dozens of hyperparameters and we need to pick the right values for those hyperparameters in order to achieve the desired model training results. Since which hyperparameter setting can lead to the best result depends on the dataset as well, it is almost impossible to pick the best hyperparameter setting without searching for it, and a good search algorithm can search for the best hyperparameter setting in an automated and effective way.

>We will use SageMaker hyperparameter tuning to automate the searching process effectively. Specifically, we specify a range, or a list of possible values in the case of categorical hyperparameters, for each of the hyperparameter that we plan to tune. SageMaker hyperparameter tuning will automatically launch multiple training jobs with different hyperparameter settings, evaluate results of those training jobs based on a predefined "objective metric", and select the hyperparameter settings for future attempts based on previous results. For each hyperparameter tuning job, we will give it a budget (max number of training jobs) and it will complete once that many training jobs have been executed.

>Now we configure the hyperparameter tuning job by defining a JSON object that specifies following information:
* The ranges of hyperparameters we want to tune
* Number of training jobs to run in total and how many training jobs should be run simultaneously. More parallel jobs will finish tuning sooner, but may sacrifice accuracy. We recommend you set the parallel jobs value to less than 10% of the total number of training jobs (we'll set it higher just for this example to keep it short).
* The objective metric that will be used to evaluate training results, in this example, we select *validation:auc* to be the objective metric and the goal is to maximize the value throughout the hyperparameter tuning process. One thing to note is the objective metric has to be among the metrics that are emitted by the algorithm during training. In this example, the built-in XGBoost algorithm emits a bunch of metrics and *validation:auc* is one of them. If you bring your own algorithm to SageMaker, then you need to make sure whatever objective metric you select, your algorithm actually emits it.

>We will tune four hyperparameters in this example:
* *eta*: Step size shrinkage used in updates to prevent overfitting. After each boosting step, you can directly get the weights of new features. The eta parameter actually shrinks the feature weights to make the boosting process more conservative. 
* *alpha*: L1 regularization term on weights. Increasing this value makes models more conservative. 
* *min_child_weight*: Minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node with the sum of instance weight less than min_child_weight, the building process gives up further partitioning. In linear regression models, this simply corresponds to a minimum number of instances needed in each node. The larger the value, the more conservative the algorithm is. 
* *max_depth*: Maximum depth of a tree. Increasing this value makes the model more complex and likely to be overfitted. 

Note that I have no idea if these are the correct parameters to tune, nor if their ranges are appropriate. Once again, we don't care so much at the moment as we are just trying to make the end-to-end process work. We have our baseline result above to ensure we don't end up with worse results.
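
To get a feel for what a tuning job does with a budget of training jobs, here is a toy search loop over the same four ranges used in this notebook. Two loud caveats: it uses plain random search, not the Bayesian strategy SageMaker applies, and `toy_validation_rmse` is a made-up stand-in for actually training a model and reading its `validation:rmse`.

```python
import random

# Same ranges as the tuning job configuration used in this notebook
CONTINUOUS = {"eta": (0.0, 1.0), "min_child_weight": (1.0, 10.0), "alpha": (0.0, 2.0)}
INTEGER = {"max_depth": (1, 10)}

def sample_settings(rng):
    """Draw one candidate hyperparameter setting from the ranges."""
    settings = {name: rng.uniform(low, high) for name, (low, high) in CONTINUOUS.items()}
    settings.update({name: rng.randint(low, high) for name, (low, high) in INTEGER.items()})
    return settings

def toy_validation_rmse(settings):
    # Hypothetical objective: pretends eta=0.2 and max_depth=5 are ideal
    return abs(settings["eta"] - 0.2) + abs(settings["max_depth"] - 5) / 10

def random_search(objective, max_jobs=20, seed=0):
    """Evaluate max_jobs random settings and keep the one with the lowest score."""
    rng = random.Random(seed)
    history = []
    for _ in range(max_jobs):
        settings = sample_settings(rng)
        history.append((objective(settings), settings))
    best_score, best_settings = min(history, key=lambda item: item[0])
    return best_settings, best_score, history

best_settings, best_score, history = random_search(toy_validation_rmse)
print(best_score, best_settings)
```

A Bayesian strategy differs in one key way: instead of sampling blindly, it uses the scores in `history` to decide which settings to try next.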

In [None]:
from time import gmtime, strftime, sleep
tuning_job_name = 'xgboost-tuningjob-' + strftime("%d-%H-%M-%S", gmtime())

print(tuning_job_name)

tuning_job_config = {
    "ParameterRanges": {
      "CategoricalParameterRanges": [],
      "ContinuousParameterRanges": [
        {
          "MaxValue": "1",
          "MinValue": "0",
          "Name": "eta",
        },
        {
          "MaxValue": "10",
          "MinValue": "1",
          "Name": "min_child_weight",
        },
        {
          "MaxValue": "2",
          "MinValue": "0",
          "Name": "alpha",            
        }
      ],
      "IntegerParameterRanges": [
        {
          "MaxValue": "10",
          "MinValue": "1",
          "Name": "max_depth",
        }
      ]
    },
    "ResourceLimits": {
      "MaxNumberOfTrainingJobs": 20,
      "MaxParallelTrainingJobs": 3
    },
    "Strategy": "Bayesian",
    "HyperParameterTuningJobObjective": {
      "MetricName": "validation:rmse",
      "Type": "Minimize"
    }
  }

>Then we configure the training jobs the hyperparameter tuning job will launch by defining a JSON object that specifies following information:
* The container image for the algorithm (XGBoost)
* The input configuration for the training and validation data
* Configuration for the output of the algorithm
* The values of any algorithm hyperparameters that are not tuned in the tuning job (StaticHyperparameters)
* The type and number of instances to use for the training jobs
* The stopping condition for the training jobs

>Again, since we are using built-in XGBoost algorithm here, it emits two predefined metrics: *validation:auc* and *train:auc*, and we elected to monitor *validation_auc* as you can see above. One thing to note is if you bring your own algorithm, your algorithm emits metrics by itself. In that case, you'll need to add a MetricDefinition object here to define the format of those metrics through regex, so that SageMaker knows how to extract those metrics.

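__Note__ I left the default example text above. In our Kaggle competition, we do not want "validation:auc", which is a classification metric. Since this is a regression problem, the tuning configuration above minimizes "validation:rmse" instead, and the training jobs below use the "reg:linear" objective. Acknowledging my uncertainty, that is at least closer to what we need.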

In [None]:
from sagemaker.amazon.amazon_estimator import get_image_uri
training_image = get_image_uri(region, 'xgboost', repo_version='latest')
     
s3_input_train = 's3://{}/{}/train'.format(bucket, prefix)
s3_input_validation ='s3://{}/{}/validation/'.format(bucket, prefix)
    
training_job_definition = {
    "AlgorithmSpecification": {
      "TrainingImage": training_image,
      "TrainingInputMode": "File"
    },
    "InputDataConfig": [
      {
        "ChannelName": "train",
        "CompressionType": "None",
        "ContentType": "csv",
        "DataSource": {
          "S3DataSource": {
            "S3DataDistributionType": "FullyReplicated",
            "S3DataType": "S3Prefix",
            "S3Uri": s3_input_train
          }
        }
      },
      {
        "ChannelName": "validation",
        "CompressionType": "None",
        "ContentType": "csv",
        "DataSource": {
          "S3DataSource": {
            "S3DataDistributionType": "FullyReplicated",
            "S3DataType": "S3Prefix",
            "S3Uri": s3_input_validation
          }
        }
      }
    ],
    "OutputDataConfig": {
      "S3OutputPath": "s3://{}/{}/output".format(bucket,prefix)
    },
    "ResourceConfig": {
      "InstanceCount": 1,
      "InstanceType": "ml.m4.xlarge",
      "VolumeSizeInGB": 10
    },
    "RoleArn": role,
    "StaticHyperParameters": {
      "eval_metric": "rmse",
      "num_round": "100",
      "objective": "reg:linear",
      "rate_drop": "0.3",
      "tweedie_variance_power": "1.4"
    },
    "StoppingCondition": {
      "MaxRuntimeInSeconds": 43200
    }
}

__Launch Hyperparameter Tuning__

Now we can launch a hyperparameter tuning job by calling create_hyper_parameter_tuning_job API. After the hyperparameter tuning job is created, we can go to SageMaker console to track the progress of the hyperparameter tuning job until it is completed.

__Costs__ 

We specified three parallel jobs, so three instances will be commandeered. We'll thus incur three times the cost per hour of our original training.

In [None]:
smclient.create_hyper_parameter_tuning_job(HyperParameterTuningJobName = tuning_job_name,
                                            HyperParameterTuningJobConfig = tuning_job_config,
                                            TrainingJobDefinition = training_job_definition)

If we go back to the SageMaker dashboard, we can see our three concurrent training jobs!

>Let's just run a quick check of the hyperparameter tuning job's status to make sure it started successfully.

In [None]:
smclient.describe_hyper_parameter_tuning_job(
    HyperParameterTuningJobName=tuning_job_name)['HyperParameterTuningJobStatus']
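
If you would rather wait in the notebook than refresh the console, a small polling loop can do the watching. This is a sketch, not part of the original example: `wait_for_tuning_job` is a name I made up, and it takes any function that returns the status string, so it can be tried here with canned statuses instead of a real job.

```python
import time

# Statuses after which the tuning job will not change any more
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}

def wait_for_tuning_job(get_status, poll_seconds=60, max_polls=120, sleep=time.sleep):
    """Poll get_status() until a terminal status or until max_polls is reached."""
    status = get_status()
    polls = 1
    while status not in TERMINAL_STATUSES and polls < max_polls:
        sleep(poll_seconds)
        status = get_status()
        polls += 1
    return status

# Canned statuses standing in for the real API call
canned = iter(["InProgress", "InProgress", "Completed"])
final_status = wait_for_tuning_job(lambda: next(canned), poll_seconds=0)
print(final_status)  # Completed
```

In this notebook, `get_status` would be a function wrapping the `describe_hyper_parameter_tuning_job` call from the cell above.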

### Analyze tuning job results

>Please refer to "HPO_Analyze_TuningJob_Results.ipynb" to see example code to analyze the tuning job results.

My best tuning job returned different parameters for XGBoost. But when I tried those, I got substantially the same public leaderboard results. I'll spare you the redundant code to re-train the algorithm and re-submit to the competition.
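
For reference, the best job can be read from the same `describe_hyper_parameter_tuning_job` response used for the status check. The sketch below only shows how to pick the pieces out of that response. The field names reflect my understanding of the response shape, and the sample dictionary is hand-written with made-up values, not my actual results.

```python
def summarize_best_job(description):
    """Extract the best job's name, objective value and tuned hyperparameters."""
    best = description["BestTrainingJob"]
    metric = best["FinalHyperParameterTuningJobObjectiveMetric"]
    return {
        "job_name": best["TrainingJobName"],
        "metric": metric["MetricName"],
        "value": metric["Value"],
        # The API reports hyperparameter values as strings
        "hyperparameters": {k: float(v) for k, v in best["TunedHyperParameters"].items()},
    }

# Hand-written sample with made-up values, shaped like the API response
sample_description = {
    "BestTrainingJob": {
        "TrainingJobName": "xgboost-tuningjob-example-007",
        "TunedHyperParameters": {"eta": "0.1", "max_depth": "4"},
        "FinalHyperParameterTuningJobObjectiveMetric": {
            "MetricName": "validation:rmse",
            "Value": 12345.6,
        },
    }
}

summary = summarize_best_job(sample_description)
print(summary["hyperparameters"])  # {'eta': 0.1, 'max_depth': 4.0}
```

Integer hyperparameters such as max_depth come back as floats here and would need converting before being passed to `set_hyperparameters`.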

This could be because I picked the wrong parameters to fine-tune. But it could also be because the largest possible gains lie elsewhere, perhaps in feature engineering, outlier removal, or the side effects of the bogus padding columns.

At least now, we have a pipeline that enables further experimentation!

### Clean-up

If you are done with this notebook, please run the cell below.  This will remove the hosted endpoint you created and avoid any charges from a stray instance being left on.

In [None]:
sagemaker.Session().delete_endpoint(xgb_predictor.endpoint)

## Conclusions

SageMaker seems quite easy to use. There are lots of examples provided out of the box that can be extended to your use case. And, despite using automated hyperparameter tuning, I know I've barely scratched the surface of the features it has. That said, for casual data exploration, the time I spent waiting for instances to boot up is considerable and not suitable for real-time interaction. Three minutes is long enough for my mind to wander. It is likely that I could be more skilled at creating the instances once and leaving them running. Nonetheless, for casual data exploration and machine learning studies, turn-key solutions like Crestle or Papermaker may make more sense, whereas AWS SageMaker may fit better in a "real" R&D or production context.