# Build, Train, Deploy change management data with AWS

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import seaborn as sns
sns.set(style="white")

In [None]:
# AWS Specific Imports and Setup

import boto3
from sagemaker import get_execution_role

role = get_execution_role()

region = boto3.Session().region_name

bucket='YOUR-S3-BUCKET' # Replace with your s3 bucket name
prefix = 'linear-svc' # Used as part of the path in the bucket where you store data
bucket_path = 'https://s3-{}.amazonaws.com/{}'.format(region,bucket) # The URL to access the bucket

raw_titanic_data = 's3://{}/{}'.format(bucket, 'rawTitanic.csv') # change the .csv file name

print(raw_titanic_data)

## Prepare Data

In [3]:
titanic = pd.read_csv(raw_titanic_data)

Let's look over the data we have and how it is structured. The `info` method shows which fields contain null values and reports the data type of each column.

In [None]:
titanic.info()
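
To complement `info`, the null counts per column can also be tabulated directly. Here is a minimal sketch on a small made-up frame (`sample` and its values are illustrative assumptions, not the real Titanic file), so it runs without S3 access.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame standing in for the Titanic data (values are illustrative only).
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# Count missing values per column, then express them as a fraction of all rows.
null_counts = sample.isnull().sum()
null_fraction = null_counts / len(sample)

print(null_counts)
print(null_fraction)
```

The same two lines should work on `titanic` once it is loaded, and a high null fraction is one signal that a column may need to be imputed or dropped.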

The `head` method returns the first `n` rows. This is great for diving a little deeper into what our dataset contains.

In [None]:
titanic.head(10)

Let's clean up our dataset. For our quick analysis, let's remove the columns (features) that have a low correlation with our `Survived` column. Let's also remove a few other features that we aren't going to parse for additional value. In a real project you absolutely could engineer those features; we skip that here to keep the demo short.

In [10]:
titanic = titanic.drop('Index - PassengerId', axis=1)  # pass axis by keyword; the positional form was removed in pandas 2.0
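
One way to decide which columns count as "low correlation" is to compute each feature's correlation with the label and compare it against a cutoff. The sketch below does this on a made-up frame; the `sample` data, the `Noise` column, and the `threshold` value are all assumptions chosen for illustration, not values taken from this notebook.

```python
import pandas as pd

# Made-up data: Pclass tracks the label closely, Noise does not.
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Pclass":   [3, 1, 1, 3, 2, 3],
    "Noise":    [1, 1, 2, 2, 3, 3],
})

# Pearson correlation of every feature with the label column.
corr_with_label = sample.corr()["Survived"].drop("Survived")

# Arbitrary cutoff for this sketch; pick one that suits your data.
threshold = 0.2
low_corr = corr_with_label[corr_with_label.abs() < threshold].index.tolist()

# Drop the weakly correlated features.
reduced = sample.drop(columns=low_corr)

print(corr_with_label)
print(low_corr)
```

Pearson correlation only captures linear relationships, so a feature with a low score can still be useful to a tree-based model like XGBoost.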

Now if we look at our data, we have a much simpler dataset.

In [None]:
titanic.head()

Our dataset is looking good. Let's look into the distribution of our features; `describe` reports summary statistics such as the mean and standard deviation.

In [None]:
titanic.describe()

Now that our data is cleaned and ready, we are going to split it into roughly 2/3 for training and 1/3 for testing.

In [16]:
features = titanic.drop('Survived', axis=1)
labels = titanic['Survived']

train, test, train_labels, test_labels = train_test_split(features,
                                                          labels,
                                                          test_size=0.33, random_state=42)
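
To see what `test_size=0.33` and a fixed `random_state` give us, here is a small sketch on synthetic data. The arrays `demo_features` and `demo_labels` and the helper `split` are hypothetical stand-ins, not the Titanic frame.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins: 100 rows, 3 features, binary labels (not the Titanic data).
rng = np.random.RandomState(0)
demo_features = rng.rand(100, 3)
demo_labels = rng.randint(0, 2, size=100)

def split(seed):
    # Same call shape as in the notebook: one third held out for testing.
    return train_test_split(demo_features, demo_labels,
                            test_size=0.33, random_state=seed)

train_a, test_a, train_labels_a, test_labels_a = split(42)
train_b, test_b, _, _ = split(42)

# Roughly two thirds of the rows land in train, one third in test.
print(len(train_a), len(test_a))

# Reusing the same random_state reproduces the same split.
print(np.array_equal(test_a, test_b))
```

If the classes were heavily imbalanced, passing `stratify=labels` to `train_test_split` would be one option for keeping the class ratio similar in both parts.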

SageMaker needs the data to be in S3, so we now upload our split datasets to S3, where the training job can read them.

In [None]:
from io import StringIO

test_csv_buffer = StringIO()
train_csv_buffer = StringIO()
# SageMaker's built-in XGBoost expects CSV with the label in the first column and no header row
pd.concat([test_labels, test], axis=1).to_csv(test_csv_buffer, header=False, index=False)
pd.concat([train_labels, train], axis=1).to_csv(train_csv_buffer, header=False, index=False)

s3_resource = boto3.resource('s3')
s3_resource.Object(bucket, prefix + '/train.csv').put(Body=train_csv_buffer.getvalue())
s3_resource.Object(bucket, prefix + '/validation.csv').put(Body=test_csv_buffer.getvalue())
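
For CSV input, the built-in XGBoost algorithm expects the label in the first column and no header row. This sketch shows the serialized layout on a two-row made-up frame (`demo_features` and `demo_labels` are illustrative assumptions), which is a quick way to check a buffer before uploading it.

```python
from io import StringIO

import pandas as pd

# Two made-up rows standing in for the split data.
demo_features = pd.DataFrame({"Pclass": [3, 1], "Fare": [7.25, 71.28]})
demo_labels = pd.Series([0, 1], name="Survived")

# Label first, then features; no header and no index column.
buffer = StringIO()
pd.concat([demo_labels, demo_features], axis=1).to_csv(buffer, header=False, index=False)

csv_text = buffer.getvalue()
print(csv_text)
```

Each line of the text holds the label followed by that row's feature values.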

## Train Model

In [18]:
train_data = 's3://{}/{}/{}'.format(bucket, prefix, 'train.csv')

validation_data = 's3://{}/{}/{}'.format(bucket, prefix, 'validation.csv')

s3_output_location = 's3://{}/{}/{}'.format(bucket, prefix, 'xgboost_model_sdk')

For our XGBoost algorithm, we need to fetch the container image that holds the algorithm so we can use it in the training process.

In [19]:
import sagemaker

from sagemaker.amazon.amazon_estimator import get_image_uri

container = get_image_uri(boto3.Session().region_name, 'xgboost', '0.90-1')

Once we have the container, we can create the Estimator, set the hyperparameters, specify where the data is coming from, and finally train the model.

In [20]:
xgb_model = sagemaker.estimator.Estimator(container,
                                         role, 
                                         train_instance_count=1, 
                                         train_instance_type='ml.m4.xlarge',
                                         train_volume_size = 5,
                                         output_path=s3_output_location,
                                         sagemaker_session=sagemaker.Session())

In [21]:
xgb_model.set_hyperparameters(max_depth = 5,
                              eta = .2,
                              gamma = 4,
                              min_child_weight = 6,
                              silent = 0,
                              objective = 'multi:softmax',
                              num_class = 2,
                              num_round = 10)

In [22]:
train_channel = sagemaker.session.s3_input(train_data, content_type='text/csv')
valid_channel = sagemaker.session.s3_input(validation_data, content_type='text/csv')

data_channels = {'train': train_channel, 'validation': valid_channel}

Once you are ready, train the model with the `fit` method. The training time can vary, but this is the step that actually builds your model and its model artifacts.

In [None]:
xgb_model.fit(inputs=data_channels,  logs=True)

## Deploy Endpoint

With our model created and our model artifacts in S3, we can deploy our model. The SageMaker SDK makes this easy for us: SageMaker creates the model, the endpoint configuration, and the endpoint, all of which are hosted within SageMaker.

In [None]:
xgb_predictor = xgb_model.deploy(initial_instance_count=1,
                                instance_type='ml.t2.medium',
                                endpoint_name='titanic-survived-predictor'
                                )
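
Once the endpoint is in service, it can be called through the SageMaker runtime API. Below is a minimal sketch of one way to do that. The helper name `predict_survival` is hypothetical, and the code assumes the endpoint returns predictions as comma- or newline-separated numbers, which may differ for other containers or versions.

```python
def predict_survival(runtime_client, endpoint_name, feature_rows):
    """Send feature rows as CSV to an endpoint and return integer class labels.

    runtime_client is assumed to behave like boto3.client('sagemaker-runtime').
    """
    # One CSV line per row, features only (no label column, no header).
    payload = "\n".join(",".join(str(value) for value in row) for row in feature_rows)

    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="text/csv",
        Body=payload,
    )

    # Assumption: the body is text such as "0.0,1.0" (commas and/or newlines).
    raw = response["Body"].read().decode("utf-8")
    tokens = raw.replace("\n", ",").split(",")
    return [int(float(token)) for token in tokens if token.strip()]
```

With real credentials, `runtime_client` would be `boto3.client('sagemaker-runtime')` and `endpoint_name` would be `'titanic-survived-predictor'`. The feature values in each row must follow the same column order as the training CSV, minus the label.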