# Zero to AI Hero ft SageMaker
### CONNECT NYC WWCode

Hello, welcome to our workshop! 

Getting started in AI is so much fun - but what if you want to take your idea further? What if you want to release your project to the world to use?

In this session we are going to show you how to take the machine learning model from your laptop into a production environment - so that it could be used by millions of users all over the world with AWS SageMaker. 

The data that we will be working with in this notebook is a dataset on wine quality rating them from 0-9 based on a range of different attributes.

In this notebook, we will be going through the following steps:

1. Upload Data to S3
2. Build Model
3. Specify Data Locations
4. Train Model
5. Deploy Model
6. Run Predictions
7. Creating a predictor 

In [25]:
import numpy as np
import pandas as pd
import boto3

import sagemaker
from sagemaker import get_execution_role
from sagemaker.predictor import csv_serializer

## 1. Upload Data to S3


Amazon Simple Storage Service(S3) provides object storage and is designed to make web-scale computing easier for developers.
S3 allows users to store and retrieve any amount of data, at any time, from anywhere on the web. It gives any developer access to the same highly scalable, reliable, fast, inexpensive data storage infrastructure that Amazon uses to run its own global network of web sites. The service aims to maximize benefits of scale and to pass those benefits on to developers.
For more info on s3 - https://docs.aws.amazon.com/AmazonS3/latest/dev/Welcome.html

    
First, we need to upload our data that is pre-made for you to our S3 bucket that we have just created. 

 *Make sure to change the bucket name to your own bucket name that you created*


In [26]:
# Specify your bucket name
bucket_name = 'sapphire-workshop'

training_folder = r'WWCodeWorkshop/data/'
validation_folder = r'WWCodeWorkshop/validation/'
test_folder = r'WWCodeWorkshop/test/'

s3_model_output_location = r's3://{0}/wwc-workshop/model'.format(bucket_name)
s3_training_file_location = r's3://{0}/{1}'.format(bucket_name,training_folder)
s3_validation_file_location = r's3://{0}/{1}'.format(bucket_name,validation_folder)
s3_validation_file_location = r's3://{0}/{1}'.format(bucket_name,test_folder)

In [27]:
print(s3_model_output_location)
print(s3_training_file_location)

s3://sapphire-workshop/wwc-workshop/model
s3://sapphire-workshop/WWCodeWorkshop/data/


In [28]:
# Write and Reading from S3 is just as easy
# files are referred as objects in S3.  
# file name is referred as key name in S3

# File stored in S3 is automatically replicated across 3 different availability zones 
# in the region where the bucket was created.

# http://boto3.readthedocs.io/en/latest/guide/s3.html
def write_to_s3(filename, bucket, key):
    with open(filename,'rb') as f: # Read in binary mode
        return boto3.Session().resource('s3').Bucket(bucket).Object(key).upload_fileobj(f)

In [29]:
write_to_s3('train.csv', 
            bucket_name,
            training_folder + 'train.csv')

In [30]:
write_to_s3('validation.csv',
            bucket_name,
            validation_folder + 'validation.csv')

In [31]:
write_to_s3('test.csv',
            bucket_name,
            test_folder + 'test.csv')

## 2. Build the model

To create a training job, you need to establish a session with SageMaker service. The Training job requires permissions needed to train and deploy models, access to S3 and spin up new instances to train the model and to store the training model artefacts. 
 

In [32]:
# To be able to intereact with the SageMaker API and other AWS services e.g. S3 we need to create a SageMaker Session.
sess = sagemaker.Session()

AWS uses access management roles - IAMS roles to grant permissions. In this instance the role is assigned to the  notebook instance. 

In [33]:
#SageMaker requires a range of permissions to train the model, so we need to provide it with our role
role = get_execution_role()

SageMaker provides several pre-built machine learning algorithms that we can use for a variety of different problems and that we will make use of in this workshop! More information about the different algorithms can be found here - https://docs.aws.amazon.com/sagemaker/latest/dg/algos.htmlIn

In this workshop we will make use of Sagemaker **XGBoost** algorithm. 

SageMaker maintains all of its algorithms as docker containers, which are stored on the AWS Elastic Container Registry(ECR) the following code shows us how to access and use these!
 
 

In [34]:
# To use the pre-built algorithm, we can connect to its container by specifiying the region and algorithm name
container = sagemaker.amazon.amazon_estimator.get_image_uri(
    sess.boto_region_name,
    "xgboost", 
    "latest")

#By printing this out we will be able to see the exact location of the algorithm we are using
print('Using SageMaker XGBoost container:\n{} ({})'.format(container, sess.boto_region_name))

	get_image_uri(region, 'xgboost', '0.90-1').


Using SageMaker XGBoost container:
644912444149.dkr.ecr.eu-west-2.amazonaws.com/xgboost:latest (eu-west-2)


This estimator will be made up of the actual algorithm we are using, it will need our role/session for premissions to use the services, it will specify the type and number of instances to use and it will need to know the location on s3 to store the models once they have been created.


In [35]:
#   Reference: http://sagemaker.readthedocs.io/en/latest/estimators.html

estimator = sagemaker.estimator.Estimator(
    container,
    role, 
    train_instance_count=1, 
    train_instance_type='ml.m4.xlarge',
    output_path=s3_model_output_location,
    sagemaker_session=sess,
    base_job_name ='winequality-workshop-v1-1')

Finally for this section, we can specify the hyper parameters to use. XGBoost hyperparameters docs can be found here https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost_hyperparameters.html For this workshop, we will set the following hyperparameters:
 * max_depth - Maximum depth of a tree. Increasing this value makes the model more complex and likely to be overfit.
 * objective - Specifies the learning task and the corresponding learning objective.
 * num_class - Number of classes
 * num_round - Number of rounts
 * early_stopping_rounds - The model trains until the validation score stops improving. Validation error needs to decrease at least every rounds to continue training. 

In [36]:
estimator.set_hyperparameters(max_depth=5,
                              objective="multi:softmax",
                              num_class=10,
                              num_round=50,
                              early_stopping_rounds=10)

In [37]:
estimator.hyperparameters()

{'max_depth': 5,
 'objective': 'multi:softmax',
 'num_class': 11,
 'num_round': 50,
 'early_stopping_rounds': 10}

## 3. Specify Data Locations

For SageMaker to make use of our training and validation data we need to specify the S3 location of where the data is stored and the data type.

Amazon SageMaker then makes this information available to the Docker container so that your training algorithm can use it. 

In [14]:
training_input_config = sagemaker.session.s3_input(
    s3_data=s3_training_file_location,
    content_type='text/csv',
    s3_data_type='S3Prefix')

validation_input_config = sagemaker.session.s3_input(
    s3_data=s3_validation_file_location,
    content_type='text/csv',
    s3_data_type='S3Prefix'
)

data_channels = {'train': training_input_config, 'validation': validation_input_config}

In [15]:
print(training_input_config.config)
print(validation_input_config.config)

{'DataSource': {'S3DataSource': {'S3DataDistributionType': 'FullyReplicated', 'S3DataType': 'S3Prefix', 'S3Uri': 's3://sapphire-workshop/WWCodeWorkshop/data/'}}, 'ContentType': 'text/csv'}
{'DataSource': {'S3DataSource': {'S3DataDistributionType': 'FullyReplicated', 'S3DataType': 'S3Prefix', 'S3Uri': 's3://sapphire-workshop/WWCodeWorkshop/test/'}}, 'ContentType': 'text/csv'}


## 4. Train Model

At this point we have our estimator created, we have specified the data channels to be used, so we can go ahead and kick of a training job.

This will launch the training instance and the algorithm chosen is downloaded from ECR.

This may take a few minutes to run!

In [16]:
estimator.fit(data_channels)

2019-11-28 18:33:01 Starting - Starting the training job......
2019-11-28 18:33:35 Starting - Launching requested ML instances......
2019-11-28 18:34:32 Starting - Preparing the instances for training......
2019-11-28 18:35:52 Downloading - Downloading input data
2019-11-28 18:35:52 Training - Downloading the training image..[31mArguments: train[0m
[31m[2019-11-28:18:36:11:INFO] Running standalone xgboost training.[0m
[31m[2019-11-28:18:36:11:INFO] File size need to be processed in the node: 0.08mb. Available memory size in the node: 8527.16mb[0m
[31m[2019-11-28:18:36:11:INFO] Determined delimiter of CSV input is ','[0m
[31m[18:36:11] S3DistributionType set as FullyReplicated[0m
[31m[18:36:11] 1119x11 matrix with 12309 entries loaded from /opt/ml/input/data/train?format=csv&label_column=0&delimiter=,[0m
[31m[2019-11-28:18:36:11:INFO] Determined delimiter of CSV input is ','[0m
[31m[18:36:11] S3DistributionType set as FullyReplicated[0m
[31m[18:36:11] 160x11 matrix with


2019-11-28 18:36:22 Uploading - Uploading generated training model
2019-11-28 18:36:22 Completed - Training job completed
Training seconds: 49
Billable seconds: 49


Once the training job has completed you will see the training and billable seconds printed in the output above, but you will also see that the job on the training dashboard will have turned green and will be marked as completed.

## 5. Deploy Model
Now that we have created and trained our model we will want to deploy it to a real time endpoint. To do this we need to specify the number of instances ad the type of instance we want to use. This may also take a few minutes, but we will be able to view the depployed model on the endpoints dashboard within sagemaker.

In [17]:
predictor = estimator.deploy(initial_instance_count=1,
                             instance_type='ml.m4.xlarge',
                             endpoint_name = 'winequality-workshop-v1-1')

-------------------------------------------------------------------------------------!

## 6. Running Predictions
Once the model endpoint is ready we can go ahead and run predictions on it.

To get predictions we will want to read in out test data.

In [18]:
#Read in the test file
test = pd.read_csv('test.csv')

In [19]:
test.head()

Unnamed: 0,5,10.0,0.26,0.54,1.9,0.083,42.0,74.0,0.99451,2.98,0.63,11.8
0,3,6.9,0.605,0.12,10.7,0.073,40.0,83.0,0.9993,3.45,0.52,9.4
1,2,7.2,0.835,0.0,2.0,0.166,4.0,11.0,0.99608,3.39,0.52,10.0
2,2,7.8,0.6,0.26,2.0,0.08,31.0,131.0,0.99622,3.21,0.52,9.9
3,4,7.9,0.2,0.35,1.7,0.054,7.0,15.0,0.99458,3.32,0.8,11.9
4,3,10.5,0.43,0.35,3.3,0.092,24.0,70.0,0.99798,3.21,0.69,10.5


In [20]:
#We will want to separate the labels from the data, so we ca verify if the predictions where correct
testLabels = test.iloc[:, 0]
testData = test.iloc[:, 1:]

In [22]:
#The predictor is exptecting the data in the form of a numpy array, so we need to convert the data
testData = testData.to_numpy()

In [23]:
testData

array([[ 6.9  ,  0.605,  0.12 , ...,  3.45 ,  0.52 ,  9.4  ],
       [ 7.2  ,  0.835,  0.   , ...,  3.39 ,  0.52 , 10.   ],
       [ 7.8  ,  0.6  ,  0.26 , ...,  3.21 ,  0.52 ,  9.9  ],
       ...,
       [ 6.   ,  0.58 ,  0.2  , ...,  3.58 ,  0.67 , 12.5  ],
       [10.4  ,  0.43 ,  0.5  , ...,  3.1  ,  0.87 , 11.4  ],
       [ 5.8  ,  0.29 ,  0.26 , ...,  3.39 ,  0.54 , 13.5  ]])

## 7. Creating a predictor
To get predictions we need to configure sagemakers predictor to work with the format of our data. 

We neeed to give the predictor the content type of the data we are using. In this example we are going to send the data in CSV format and we are going to parse the results manually.

In [24]:
from sagemaker.predictor import csv_serializer, json_deserializer

predictor.content_type = 'text/csv'
predictor.serializer = csv_serializer
predictor.deserializer = None

In [None]:
#We can get a prediction by simply calling the predict fuction a the data 
#For example here is a prediction for the first row of data
predictor.predict([6.900e+00, 6.050e-01, 1.200e-01, 1.070e+01, 7.300e-02, 4.000e+01,
       8.300e+01, 9.993e-01, 3.450e+00, 5.200e-01, 9.400e+00])

In [None]:
#The following for loop has been created to show the actual and predicted values for the whole dataset
count =0
for data in testData:
    print("data = " + str(data))
    prediction = str(predictor.predict(data))
    print(" ")
    print("Actual = "+ str(testLabels[count]) +" Prediction= " + prediction )
    print("____________________________________________________________________")
    print(" ")
    count +=1