# Boston Housing Dataset with Amazon SageMaker XGBoost




## Preparation

_This notebook was created and tested on an ml.m5.xlarge notebook instance._

Let's start by specifying:

- The S3 bucket and prefix that you want to use for training and model data. This should be within the same region as the notebook instance, training, and hosting.
- The IAM role ARN used to give training and hosting access to your data. See the documentation for how to create these. Note: if more than one role is required for notebook instances, training, and/or hosting, replace the `get_execution_role()` call with the appropriate full IAM role ARN string(s).

In [None]:
# S3 prefix
prefix = 'DEMO-xgboost-regression-boston'

# Define IAM role
import boto3
import re

import os
import numpy as np
import pandas as pd
import sagemaker
from sagemaker import get_execution_role
from sagemaker.predictor import csv_serializer

print(sagemaker.__version__)

role = get_execution_role()
sess = sagemaker.session.Session()
bucket = sess.default_bucket()

print(bucket)

### Data processing
We use pandas and scikit-learn to split a small local dataset into training, validation, and test sets.

We could also design code that loads all the data and runs cross-validation within the script. 
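The idea of holding out part of the data and then dividing the holdout again can be sketched with the standard library alone. This is only an illustration: the helper `two_stage_split` is hypothetical and is not how scikit-learn's `train_test_split` is implemented, so the exact rows and set sizes will differ from the cells below.

```python
import random


def two_stage_split(rows, holdout_fraction=0.25, seed=45):
    """Shuffle rows, hold out a fraction, then halve the holdout into validation and test."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_holdout = round(len(shuffled) * holdout_fraction)
    holdout, train = shuffled[:n_holdout], shuffled[n_holdout:]
    n_val = len(holdout) // 2
    return train, holdout[:n_val], holdout[n_val:]


rows = list(range(506))  # the Boston housing dataset has 506 rows
train, val, test = two_stage_split(rows)
print(len(train), len(val), len(test))
```

Roughly three quarters of the rows end up in training, and the rest is shared between validation and test.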

In [None]:
import os

import pandas as pd
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split

In [None]:
# We use the Boston housing dataset.
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an earlier version.
data = load_boston()

In [None]:
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.25, random_state=45)
X_val, X_test, y_val, y_test = train_test_split(X_test, y_test, test_size=0.5, random_state=45)

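# Put the label first: SageMaker's built-in XGBoost expects CSV input
# with no header row and the label in the first column
trainX = pd.DataFrame(X_train, columns=data.feature_names)
trainX.insert(0, 'target', y_train)

# Use the validation split here; the test split is kept for predictions later
valX = pd.DataFrame(X_val, columns=data.feature_names)
valX.insert(0, 'target', y_val)

# Test features only: the endpoint receives no label
testX = pd.DataFrame(X_test, columns=data.feature_names)

Let's inspect the Boston housing dataset.

In [None]:
trainX

As you can see, the label is the `target` column.

Because we train with SageMaker's built-in XGBoost container on CSV data, the label (`target`) must be the first column and the files must not contain a header row.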

In [None]:
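local_train = './data/train/boston_train.csv'
local_validation = './data/validation/boston_validation.csv'
local_test = './data/test/boston_test.csv'

# Create the local output directories if they do not exist yet
for path in (local_train, local_validation, local_test):
    os.makedirs(os.path.dirname(path), exist_ok=True)

trainX.to_csv(local_train, header=None, index=False)
valX.to_csv(local_validation, header=None, index=False)
testX.to_csv(local_test, header=None, index=False)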
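For CSV training data, SageMaker's built-in XGBoost algorithm assumes no header row and the label in the first column. A minimal sketch of that layout follows, using a few made-up values (they are not taken from the dataset) and the standard `csv` module instead of pandas.

```python
import csv
import io

# Made-up rows for illustration: two feature values per row plus a target
features = [[6.5, 4.9], [5.9, 12.4]]
targets = [24.0, 21.6]

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator='\n')
for target, row in zip(targets, features):
    # Label first, then the features; no header row is written
    writer.writerow([target] + row)

csv_text = buffer.getvalue()
print(csv_text)
```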

In [None]:
# send data to S3. SageMaker will take training data from S3
train_location = sess.upload_data(
    path=local_train, 
    bucket=bucket,
    key_prefix=prefix+'/train')

validation_location = sess.upload_data(
    path=local_validation, 
    bucket=bucket,
    key_prefix=prefix+'/validation')

In [None]:
train_location

In [None]:
validation_location
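The two values displayed above are plain S3 URIs. As a rough sketch, assuming `upload_data` stores a single file under `key_prefix` using the file's base name, the URI can be predicted as follows. The helper `expected_s3_uri` and the bucket name are hypothetical and only illustrate the layout.

```python
import posixpath


def expected_s3_uri(bucket, key_prefix, local_path):
    """Compose the S3 URI for a single uploaded file (illustrative only)."""
    filename = posixpath.basename(local_path)
    return 's3://{}/{}/{}'.format(bucket, key_prefix, filename)


# Hypothetical bucket and prefix, not the ones created by this notebook
uri = expected_s3_uri('my-example-bucket', 'DEMO-prefix/train',
                      './data/train/boston_train.csv')
print(uri)
```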

## Training
Tabular datasets like this one often contain features with skewed distributions, features that are highly correlated with one another, and features with non-linear relationships to the target variable. Also, in this example, good predictive accuracy is preferred to being able to explain each individual prediction. Taken together, these aspects make gradient boosted trees a good candidate algorithm.

There are several intricacies to understanding the algorithm, but at a high level, gradient boosted trees work by combining predictions from many simple models, each of which tries to address the weaknesses of the previous models. By doing this, the collection of simple models can actually outperform large, complex models. Other Amazon SageMaker notebooks elaborate further on gradient boosted trees and how they differ from similar algorithms.
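To make the idea concrete, here is a minimal sketch of plain gradient boosting for squared error on a one-dimensional toy problem, where each "simple model" is a decision stump fitted to the residuals of the models before it. It is not XGBoost's actual algorithm (which adds regularization and second-order information, among other things), and the helpers `fit_stump` and `boost` are hypothetical names used only here.

```python
def fit_stump(xs, residuals):
    """Find the threshold on a 1-D feature that minimizes squared error."""
    best = None
    # Skip the largest value so the right-hand side is never empty
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lmean = sum(left) / len(left)
        rmean = sum(right) / len(right)
        err = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, t, lmean, rmean)
    _, t, lmean, rmean = best
    return t, lmean, rmean


def boost(xs, ys, num_round=20, eta=0.2):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(num_round):
        residuals = [y - p for y, p in zip(ys, preds)]
        t, lmean, rmean = fit_stump(xs, residuals)
        stumps.append((t, lmean, rmean))
        # Shrink each stump's contribution by the learning rate eta
        preds = [p + eta * (lmean if x <= t else rmean)
                 for x, p in zip(xs, preds)]
    return base, stumps, preds


def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


# Toy data: a non-linear relationship between one feature and the target
xs = [i / 10 for i in range(50)]
ys = [x ** 2 for x in xs]

base, stumps, preds = boost(xs, ys)
baseline_mse = mse(ys, [base] * len(ys))
boosted_mse = mse(ys, preds)
print(baseline_mse, boosted_mse)
```

The boosted error is lower than the error of the constant baseline, even though every individual stump is a very weak model.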

`xgboost` is an extremely popular, open-source package for gradient boosted trees.  It is computationally powerful, fully featured, and has been successfully used in many machine learning competitions.  Let's start with a simple `xgboost` model, trained using Amazon SageMaker's managed, distributed training framework.

First we'll need to specify the ECR container location for Amazon SageMaker's implementation of XGBoost.

In [None]:
from sagemaker.amazon.amazon_estimator import get_image_uri
container = get_image_uri(boto3.Session().region_name, 'xgboost','0.90-1')

In [None]:
container

Then, because we're training with the CSV file format, we'll create `s3_input`s that our training function can use as a pointer to the files in S3, which also specify that the content type is CSV.

In [None]:
s3_input_train = sagemaker.s3_input(s3_data=train_location, content_type='csv')
s3_input_validation = sagemaker.s3_input(s3_data=validation_location, content_type='csv')

### Remote training in SageMaker

In [None]:
sess = sagemaker.Session()

xgb = sagemaker.estimator.Estimator(container,
                                    role, 
                                    train_instance_count=1, 
                                    train_instance_type='ml.m5.xlarge',
                                    output_path='s3://{}/{}/output'.format(bucket, prefix),
                                    sagemaker_session=sess)
xgb.set_hyperparameters(max_depth=5,
                        eta=0.2,
                        gamma=4,
                        min_child_weight=6,
                        subsample=0.8,
                        silent=0,
                        objective='reg:squarederror',
                        num_round=100)

xgb.fit({'train': s3_input_train, 'validation': s3_input_validation}) 

## Hosting your model
You can use a trained model to get real-time predictions from an HTTP endpoint. The following steps walk you through the process.

### Deploy the model

Deploying the model to SageMaker hosting just requires a `deploy` call on the fitted model. This call takes an instance count, instance type, and optionally serializer and deserializer functions. These are used when the resulting predictor is created on the endpoint.

In [None]:
xgb_predictor = xgb.deploy(initial_instance_count=1,
                           instance_type='ml.m5.xlarge')

### Choose some data and use it for a prediction

In order to do some predictions, we'll use the test dataset.

Prediction is as easy as calling predict with the predictor we got back from deploy and the data we want to do predictions with. The serializers take care of doing the data conversions for us.
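As an illustration of the kind of conversion involved, the sketch below turns rows of feature values into a `text/csv` payload with one record per line. The helper `rows_to_csv` is hypothetical and is not the SDK's serializer; the numbers are made up.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialize rows of feature values into a CSV string, one record per line."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator='\n').writerows(rows)
    return buffer.getvalue().strip()


# Made-up feature rows; no label column is included for inference
payload_example = rows_to_csv([[0.02, 18.0, 2.31], [0.03, 0.0, 7.07]])
print(payload_example)
```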

In [None]:
with open(local_test, 'r') as f:
    payload = f.read().strip()

In [None]:
xgb_predictor.content_type = 'text/csv'
xgb_predictor.serializer = csv_serializer

predicted = xgb_predictor.predict(payload).decode('utf-8')
print(predicted)

In [None]:
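# Parse the CSV response into floats; stripping brackets and whitespace first keeps every digit of the first value
values = predicted.strip('[] \n').replace('\n', ',').split(',')
predicted_array = np.array([float(v) for v in values])
expected = y_test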

In [None]:
import matplotlib.pyplot as plt

plt.figure(figsize=(4, 3))
plt.scatter(expected, predicted_array)
plt.plot([0, 50], [0, 50], '--k')
plt.axis('tight')
plt.xlabel('True price ($1000s)')
plt.ylabel('Predicted price ($1000s)')
plt.tight_layout()
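The scatter plot gives a visual check. A numeric summary such as the root mean squared error (RMSE) or mean absolute error (MAE) can complement it. The sketch below uses made-up numbers and hypothetical helpers `rmse` and `mae`; in the notebook the same functions could be applied to `expected` and `predicted_array`.

```python
import math


def rmse(expected_values, predicted_values):
    """Root mean squared error between two equally long sequences."""
    squared = [(e - p) ** 2 for e, p in zip(expected_values, predicted_values)]
    return math.sqrt(sum(squared) / len(squared))


def mae(expected_values, predicted_values):
    """Mean absolute error between two equally long sequences."""
    absolute = [abs(e - p) for e, p in zip(expected_values, predicted_values)]
    return sum(absolute) / len(absolute)


# Made-up prices in $1000s, for illustration only
expected_demo = [24.0, 21.6, 34.7]
predicted_demo = [25.0, 20.6, 33.7]

demo_rmse = rmse(expected_demo, predicted_demo)
demo_mae = mae(expected_demo, predicted_demo)
print(demo_rmse, demo_mae)
```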

### Optional cleanup
When you're done with the endpoint, you'll want to clean it up.

In [None]:
sess.delete_endpoint(xgb_predictor.endpoint)