# Develop, Train, and Deploy Scikit-Learn Random Forest on AWS
## Table of contents
1. [Introduction](#introduction)
2. [Prepare data](#Prepare-data)
    1. [Remove outliers](#Remove-outliers)
    2. [Send Data to S3](#Send-Data-to-S3)
3. [Develop and Train Model](#Writing-a-Script-Mode-Script)
    1. [Writing a Script Mode Script](#Writing-a-Script-Mode-Script)
    2. [Training with the Python SDK](#Training-with-the-Python-SDK)
4. [Model Deployment](#Deployment-with-Python-SDK)
    1. [Deployment with Python SDK](#Deployment-with-Python-SDK)
    2. [Method 1: Invoke Endpoint with boto3](#Method-1:-Invoke-Endpoint-with-boto3)
    3. [Method 2: Invoke Endpoint with Python SDK](#Method-2:-Invoke-Endpoint-with-Python-SDK)
    4. [Validate the Model using Prediction Values](#Validate-the-Model-using-Prediction-Values)
    5. [Delete the Endpoint](#Delete-the-Endpoint)
5. [Lambda Function Development](#Lambda-Function-Development)
    1. [Prepare Test Data and Obtain Lambda Function Raw Output](#Prepare-Test-Data-and-Obtain-Lambda-Function-Raw-Output)
    2. [Lambda Function Code](#Lambda-Function-Code)
6. [Reference](#Reference)

<a name="introduction"></a>
## Introduction

In the previous [notebook](https://github.com/andrewzheng210/Predict_Personal_Loan/blob/master/Personal_loan.ipynb), I described the project background and data, and selected a Random Forest model to identify customers who are more likely to accept a personal loan.

In this notebook, I will use Amazon SageMaker to develop, train, and __deploy__ the Scikit-Learn based Random Forest model. 

Note:
If you want to learn more about machine learning on AWS, the AWS Certified Machine Learning Specialty training [courses](https://aws.amazon.com/certification/certified-machine-learning-specialty/) are a good place to start.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

import datetime
import tarfile
import sagemaker
import boto3
from sagemaker import get_execution_role

<a name="Prepare data"></a>
## Prepare data
Load the data, clean it, split it, and send it to S3.

In [2]:
sm_boto3 = boto3.client('sagemaker')

sess = sagemaker.Session()

region = sess.boto_session.region_name

bucket = sess.default_bucket()  # this could also be a hard-coded bucket name

print('Using bucket ' + bucket)

Using bucket sagemaker-us-east-1-614327970913


In [3]:
role = get_execution_role()
bucket = 'ml-labs-azz210'  # overrides the default bucket above with the bucket that holds the dataset
prefix = 'dataset'
data_key = 'Bank_Personal_Loan_Modelling.csv'
data_location = 's3://{}/{}/{}'.format(bucket, prefix, data_key)

data = pd.read_csv(data_location, low_memory=False)

In [4]:
data.head()

Unnamed: 0,ID,Age,Experience,Income,ZIP Code,Family,CCAvg,Education,Mortgage,Personal Loan,Securities Account,CD Account,Online,CreditCard
0,1,25,1,49,91107,4,1.6,1,0,0,1,0,0,0
1,2,45,19,34,90089,3,1.5,1,0,0,1,0,0,0
2,3,39,15,11,94720,1,1.0,1,0,0,0,0,0,0
3,4,35,9,100,94112,1,2.7,2,0,0,0,0,0,0
4,5,35,8,45,91330,4,1.0,2,0,0,0,0,0,1


In [5]:
for column in data.columns:
    uniques = sorted(data[column].unique())
    print('{0:20s} {1:5d}\t'.format(column, len(uniques)), uniques[:5])

ID                    5000	 [1, 2, 3, 4, 5]
Age                     45	 [23, 24, 25, 26, 27]
Experience              47	 [-3, -2, -1, 0, 1]
Income                 162	 [8, 9, 10, 11, 12]
ZIP Code               467	 [9307, 90005, 90007, 90009, 90011]
Family                   4	 [1, 2, 3, 4]
CCAvg                  108	 [0.0, 0.1, 0.2, 0.3, 0.4]
Education                3	 [1, 2, 3]
Mortgage               347	 [0, 75, 76, 77, 78]
Personal Loan            2	 [0, 1]
Securities Account       2	 [0, 1]
CD Account               2	 [0, 1]
Online                   2	 [0, 1]
CreditCard               2	 [0, 1]


<a name="Remove outliers"></a>
### Remove outliers
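Two kinds of outliers show up in the listing above: a ZIP code of 9307 and negative years of professional experience. The cell below counts both and drops the rows with negative experience. The single row with the malformed ZIP code is left in place, which is harmless because `ZIP Code` is excluded from the features later.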

In [6]:
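# Count rows with a malformed ZIP code (inspection only; ZIP Code is not used as a feature)
zip_mask = data['ZIP Code'] < 90000
print(len(data[zip_mask]))

# Count rows with negative years of experience
exp_mask = data['Experience'] < 0
print(len(data[exp_mask]))

# Drop the negative-experience rows. `~` inverts a boolean mask;
# unary minus on a boolean mask is not reliably supported across pandas versions.
data = data[~exp_mask].copy()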

1
52
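
The cleaning step relies on boolean masks. As a minimal sketch on a toy frame (the rows are invented for illustration and are not taken from the dataset), this shows how `~` inverts a mask and how two conditions could be combined with `|` if both kinds of rows were to be dropped:

```python
import pandas as pd

# Toy rows, invented for illustration only.
toy = pd.DataFrame({
    "ZIP Code": [91107, 9307, 94720, 90089],
    "Experience": [1, 5, -2, 19],
})

bad_zip = toy["ZIP Code"] < 90000        # True for the malformed ZIP code
negative_exp = toy["Experience"] < 0     # True for impossible experience values

# `~` negates a boolean mask; `|` is the element-wise OR.
only_exp_removed = toy[~negative_exp]
both_removed = toy[~(bad_zip | negative_exp)]

print(len(only_exp_removed), len(both_removed))
```

Whether to drop the ZIP code row as well is a judgment call; here it would not change the features, since `ZIP Code` is not one of them.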


In [7]:
data['Mortgage_Y'] = 0
data.loc[data['Mortgage']>0,'Mortgage_Y']=1
data['Education_catog']=data['Education'].astype(str)
data_encode = pd.get_dummies(data)
data_encode.columns

Index(['ID', 'Age', 'Experience', 'Income', 'ZIP Code', 'Family', 'CCAvg',
       'Education', 'Mortgage', 'Personal Loan', 'Securities Account',
       'CD Account', 'Online', 'CreditCard', 'Mortgage_Y', 'Education_catog_1',
       'Education_catog_2', 'Education_catog_3'],
      dtype='object')
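
By default `pd.get_dummies` only expands string-like (object, string, or category) columns, which is presumably why `Education` is first copied into a string column. A small sketch with made-up values:

```python
import pandas as pd

# Illustrative values only, not rows from the dataset.
toy = pd.DataFrame({"Education": [1, 2, 3, 2]})
toy["Education_catog"] = toy["Education"].astype(str)

# The integer column is passed through; the string column becomes one column per level.
encoded = pd.get_dummies(toy)

print(list(encoded.columns))
```

The numeric `Education` column survives unchanged next to its one-hot copies, so it has to be left out of the feature list explicitly, as the next cell does.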

In [8]:
# Excluded from the features: ID, ZIP Code, Education, Mortgage
columns = ['Age', 'Experience', 'Income', 'Family', 'CCAvg',
       'Securities Account', 'CD Account', 'Online', 'CreditCard','Mortgage_Y', 'Education_catog_1',
       'Education_catog_2', 'Education_catog_3']
X = data_encode[columns]
y = data_encode['Personal Loan']

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
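
As a sanity check on the split, the expected set sizes can be worked out by hand. The sketch below assumes 4,948 rows remain after cleaning (5,000 IDs minus the 52 negative-experience rows counted earlier) and follows scikit-learn's rule of rounding the test share up:

```python
import math

n_rows = 5000 - 52          # assumed row count after dropping negative Experience
test_size = 0.3

n_test = math.ceil(test_size * n_rows)   # the test share is rounded up
n_train = n_rows - n_test                # the rest goes to training

print(n_train, n_test)
```

This gives 3,463 training rows and 1,485 test rows, which agrees with the support total shown in the classification report later in this notebook.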

In [10]:
trainXY = pd.merge(X_train,y_train,left_index=True, right_index=True).rename(columns = {"Personal Loan":"target", "CD Account":"CD_Account","Securities Account":"Securities_Account"})
testXY = pd.merge(X_test,y_test,left_index=True, right_index=True).rename(columns = {"Personal Loan":"target", "CD Account":"CD_Account","Securities Account":"Securities_Account"})

In [11]:
trainXY.head(2)

Unnamed: 0,Age,Experience,Income,Family,CCAvg,Securities_Account,CD_Account,Online,CreditCard,Mortgage_Y,Education_catog_1,Education_catog_2,Education_catog_3,target
2379,42,18,110,2,6.1,0,0,1,0,1,1,0,0,0
4077,26,0,71,4,1.8,1,0,1,0,0,0,1,0,0


In [12]:
trainXY.columns

Index(['Age', 'Experience', 'Income', 'Family', 'CCAvg', 'Securities_Account',
       'CD_Account', 'Online', 'CreditCard', 'Mortgage_Y', 'Education_catog_1',
       'Education_catog_2', 'Education_catog_3', 'target'],
      dtype='object')

In [13]:
trainXY.to_csv('trainXY.csv')
testXY.to_csv('testXY.csv')

<a name="Send Data to S3"></a>
### Send Data to S3
SageMaker reads the training data from S3.

In [14]:
trainpath = sess.upload_data(
    path='trainXY.csv', bucket=bucket,
    key_prefix='sagemaker/sklearncontainer')

testpath = sess.upload_data(
    path='testXY.csv', bucket=bucket,
    key_prefix='sagemaker/sklearncontainer')

<a name="Writing a Script Mode Script"></a>

## Develop and Train Model
### Writing a Script Mode Script
See the SageMaker Python SDK documentation [here](https://sagemaker.readthedocs.io/en/stable/using_sklearn.html#preparing-the-scikit-learn-training-script) for detailed guidance on preparing a Scikit-Learn training script.

In [15]:
%%writefile script.py

import argparse
import os

import numpy as np
import pandas as pd
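try:
    import joblib  # standalone package, needed with scikit-learn >= 0.23
except ImportError:
    from sklearn.externals import joblib  # bundled copy in older scikit-learn (e.g. 0.20)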
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
from sklearn.metrics import confusion_matrix


# inference functions ---------------
def model_fn(model_dir):
    clf = joblib.load(os.path.join(model_dir, "model.joblib"))
    return clf



if __name__ =='__main__':

    print('extracting arguments')
    parser = argparse.ArgumentParser()

    # hyperparameters sent by the client are passed as command-line arguments to the script.
    # to simplify the demo we don't use all sklearn RandomForest hyperparameters
    parser.add_argument('--n-estimators', type=int, default=100)
    parser.add_argument('--min-samples-leaf', type=int, default=3)
    parser.add_argument('--random-state', type=int, default=42)

    # Data, model, and output directories
    parser.add_argument('--model-dir', type=str, default=os.environ.get('SM_MODEL_DIR'))
    parser.add_argument('--train', type=str, default=os.environ.get('SM_CHANNEL_TRAIN'))
    parser.add_argument('--test', type=str, default=os.environ.get('SM_CHANNEL_TEST'))
    parser.add_argument('--train-file', type=str, default='trainXY.csv')
    parser.add_argument('--test-file', type=str, default='testXY.csv')
    parser.add_argument('--features', type=str)  # in this script we ask user to explicitly name features
    parser.add_argument('--target', type=str) # in this script we ask user to explicitly name the target

    args, _ = parser.parse_known_args()

    print('reading data')
    train_df = pd.read_csv(os.path.join(args.train, args.train_file))
    test_df = pd.read_csv(os.path.join(args.test, args.test_file))

    print('building training and testing datasets')
    X_train = train_df[args.features.split()]
    X_test = test_df[args.features.split()]
    y_train = train_df[args.target]
    y_test = test_df[args.target]

    # train
    print('training model')
    model = RandomForestClassifier(
        n_estimators=args.n_estimators, # 100
        min_samples_leaf=args.min_samples_leaf,
        random_state = args.random_state,
        n_jobs=-1)
    
    model.fit(X_train, y_train)

    # evaluation (disabled; uncomment to print metrics to the training logs)
    #print('validating model')
    #print(classification_report(y_train, model.predict(X_train)))
    #y_pred = model.predict(X_test)
    #confusion_mat = confusion_matrix(y_test, y_pred)
    #print(confusion_mat)
    #print(classification_report(y_test, y_pred))
        
    # persist model
    path = os.path.join(args.model_dir, "model.joblib")
    joblib.dump(model, path)
    print('model persisted at ' + path)
    print('min_samples_leaf = {}'.format(args.min_samples_leaf))

Overwriting script.py
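
When the training job starts, SageMaker passes each hyperparameter to the script as a `--name value` command-line argument and exposes the data and model directories through `SM_*` environment variables. The sketch below imitates that locally with a hand-written argument list; the values, the `--unused-flag` option, and the `/tmp/model` path are placeholders, not what SageMaker would actually send:

```python
import argparse
import os

# Placeholder standing in for the variable SageMaker sets inside the container.
os.environ.setdefault("SM_MODEL_DIR", "/tmp/model")

parser = argparse.ArgumentParser()
parser.add_argument("--n-estimators", type=int, default=100)
parser.add_argument("--min-samples-leaf", type=int, default=3)
parser.add_argument("--model-dir", type=str, default=os.environ.get("SM_MODEL_DIR"))
parser.add_argument("--features", type=str)

# Hand-written argument list for illustration. parse_known_args returns
# unrecognized arguments separately instead of raising an error.
fake_argv = ["--n-estimators", "50", "--features", "Age Income", "--unused-flag", "1"]
args, unknown = parser.parse_known_args(fake_argv)

print(args.n_estimators, args.min_samples_leaf, args.features.split(), unknown)
```

Running the real `script.py` locally in the same way, with `--train`, `--test`, and `--model-dir` pointing at local folders, is a cheap way to catch errors before paying for a training instance.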


<a name="Training with the Python SDK"></a>
### Training with the Python SDK

In [16]:
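# We use the SKLearn Estimator from the SageMaker Python SDK.
# This is SDK v1 syntax; in SDK v2 the arguments are named
# instance_count / instance_type instead of train_instance_count / train_instance_type.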
from sagemaker.sklearn.estimator import SKLearn

sklearn_estimator = SKLearn(
    entry_point='script.py',
    role = get_execution_role(),
    train_instance_count=1,
    train_instance_type='ml.c4.xlarge',
    framework_version='0.20.0',
    base_job_name='rf-scikit',
    hyperparameters = {'n-estimators': 100,
                       'min-samples-leaf': 3,
                       'random-state': 42,
                       'features': 'Age Experience Income Family CCAvg Securities_Account CD_Account Online CreditCard Mortgage_Y Education_catog_1 Education_catog_2 Education_catog_3',
                       'target': 'target'})

This is not the latest supported version. If you would like to use version 0.23-1, please add framework_version=0.23-1 to your constructor.


In [17]:
# launch the training job asynchronously, then block until it finishes (without streaming the logs)
sklearn_estimator.fit({'train':trainpath, 'test': testpath}, wait=False)

sklearn_estimator.latest_training_job.wait(logs='None')

's3_input' class will be renamed to 'TrainingInput' in SageMaker Python SDK v2.
's3_input' class will be renamed to 'TrainingInput' in SageMaker Python SDK v2.



2020-07-16 02:04:57 Starting - Starting the training job......
2020-07-16 02:05:30 Starting - Launching requested ML instances..............
2020-07-16 02:06:46 Starting - Preparing the instances for training.........
2020-07-16 02:07:33 Downloading - Downloading input data...
2020-07-16 02:07:56 Training - Downloading the training image...
2020-07-16 02:08:15 Training - Training image download completed. Training in progress..
2020-07-16 02:08:26 Uploading - Uploading generated training model
2020-07-16 02:08:33 Completed - Training job completed


<a name="Deployment with Python SDK"></a>
## Model Deployment
### Deployment with Python SDK

In [18]:
artifact = sm_boto3.describe_training_job(
    TrainingJobName=sklearn_estimator.latest_training_job.name)['ModelArtifacts']['S3ModelArtifacts']

print('Model artifact persisted at ' + artifact)

Model artifact persisted at s3://sagemaker-us-east-1-614327970913/rf-scikit-2020-07-16-02-04-56-963/output/model.tar.gz
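
The artifact is a gzipped tarball containing whatever the script wrote to the model directory, here `model.joblib`. Once downloaded, it can be unpacked with the standard `tarfile` module. The sketch below builds a stand-in archive locally so that it runs without AWS access; the file content is a placeholder, not a real model:

```python
import os
import tarfile
import tempfile

workdir = tempfile.mkdtemp()

# Build a stand-in model.tar.gz (placeholder bytes instead of a trained model).
member_path = os.path.join(workdir, "model.joblib")
with open(member_path, "wb") as f:
    f.write(b"placeholder")
archive_path = os.path.join(workdir, "model.tar.gz")
with tarfile.open(archive_path, "w:gz") as tar:
    tar.add(member_path, arcname="model.joblib")

# Unpack it the same way a downloaded artifact would be unpacked.
extract_dir = os.path.join(workdir, "extracted")
with tarfile.open(archive_path, "r:gz") as tar:
    names = tar.getnames()
    tar.extractall(extract_dir)

print(names)
```

With the real artifact, calling `joblib.load` on the extracted `model.joblib` should return the fitted classifier, provided the local scikit-learn version is compatible with the one used for training.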


In [19]:
from sagemaker.sklearn.model import SKLearnModel

model = SKLearnModel(
    model_data=artifact,
    role=get_execution_role(),
    entry_point='script.py')

Parameter image will be renamed to image_uri in SageMaker Python SDK v2.


In [20]:
predictor = model.deploy(
    instance_type='ml.c5.large',
    initial_instance_count=1)

---------------!

<a name="Method 1: Invoke Endpoint with boto3"></a>
### Method 1: Invoke Endpoint with boto3

In [21]:
runtime = boto3.client('sagemaker-runtime')

In [22]:
# csv serialization
response = runtime.invoke_endpoint(
    EndpointName=predictor.endpoint,
    Body=testXY.iloc[:,:-1].to_csv(header=False, index=False).encode('utf-8'),
    ContentType='text/csv')

response['Body'].read()

b'[0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
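
The endpoint returns the predictions as a JSON array inside a streaming body, so the raw bytes above can be turned into a Python list with `json.loads`. A minimal sketch, using `io.BytesIO` as a stand-in for `response['Body']` and a short illustrative payload:

```python
import io
import json

# Stand-in for response['Body']; a real call returns a botocore StreamingBody.
fake_body = io.BytesIO(b"[0, 1]")

# The stream can only be read once, so keep the decoded result.
predictions = json.loads(fake_body.read().decode("utf-8"))
accept_rate = sum(predictions) / len(predictions)

print(predictions, accept_rate)
```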

<a name="Method 2: Invoke Endpoint with Python SDK"></a>
### Method 2: Invoke Endpoint with Python SDK

In [25]:
y_pred = predictor.predict(testXY.iloc[:,:-1])
y_pred

array([0, 0, 1, ..., 0, 0, 0])

<a name="Validate the Model using Prediction Values"></a>
### Validate the Model using Prediction Values

In [26]:
y_test = testXY.iloc[:,-1].values

In [27]:
confusion_mat = confusion_matrix(y_test, y_pred)
print(confusion_mat)

[[1334    3]
 [  16  132]]


In [28]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.99      1.00      0.99      1337
           1       0.98      0.89      0.93       148

    accuracy                           0.99      1485
   macro avg       0.98      0.94      0.96      1485
weighted avg       0.99      0.99      0.99      1485
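
The class 1 figures in the report can be reproduced directly from the confusion matrix, which is a useful check that the matrix is being read the right way round (rows are true classes, columns are predictions). A short worked sketch:

```python
# Confusion matrix printed above: rows are true classes, columns are predictions.
tn, fp = 1334, 3
fn, tp = 16, 132

precision = tp / (tp + fp)            # of predicted acceptors, how many accepted
recall = tp / (tp + fn)               # of real acceptors, how many were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(round(precision, 2), round(recall, 2), round(f1, 2), round(accuracy, 2))
```

The rounded values are 0.98, 0.89, 0.93, and 0.99, matching the class 1 row and the accuracy line of the report. Recall is the weaker number: 16 of the 148 customers who accepted a loan were missed.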



<a name="Delete the Endpoint"></a>
### Delete the Endpoint
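When you are done, delete the endpoint to stop incurring charges. The Lambda test cells in the next section still call the endpoint, so run this cell last (here it was executed as cell 33, after them).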

In [33]:
sm_boto3.delete_endpoint(EndpointName=predictor.endpoint)

{'ResponseMetadata': {'RequestId': 'fd0d3771-0986-4255-8269-38da382d27d5',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'fd0d3771-0986-4255-8269-38da382d27d5',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '0',
   'date': 'Thu, 16 Jul 2020 02:17:44 GMT'},
  'RetryAttempts': 0}}

<a name="Lambda Function Development"></a>
## Lambda Function Development
### Prepare Test Data and Obtain Lambda Function Raw Output
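At least two data points are needed to invoke the endpoint. A single CSV row is likely parsed by the container as a 1-D array, which Scikit-Learn's `predict` rejects, so the one-row variant is left commented out below.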

In [29]:
data_temp = [[56. , 26. , 92. ,  2. ,  4.5,  1. ,  0. ,  0. ,  1. ,  0. ,  0. , 0. ,  1. ],
       [ 51. ,  25. , 104. ,   1. ,   4.2,   0. ,   0. ,   1. ,   0. , 0. ,   0. ,   1. ,   0. ]] 

#data_temp = [[56. , 26. , 92. ,  2. ,  4.5,  1. ,  0. ,  0. ,  1. ,  0. ,  0. , 0. ,  1. ]]
  
# Create the pandas DataFrame 
df = pd.DataFrame(data_temp, columns = ['Age', 'Experience', 'Income', 'Family', 'CCAvg', 'Securities_Account',
       'CD_Account', 'Online', 'CreditCard', 'Mortgage_Y', 'Education_catog_1',
       'Education_catog_2', 'Education_catog_3'])
df

Unnamed: 0,Age,Experience,Income,Family,CCAvg,Securities_Account,CD_Account,Online,CreditCard,Mortgage_Y,Education_catog_1,Education_catog_2,Education_catog_3
0,56.0,26.0,92.0,2.0,4.5,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
1,51.0,25.0,104.0,1.0,4.2,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0


In [30]:
response = runtime.invoke_endpoint(
    EndpointName=predictor.endpoint,
    Body=df.to_csv(header=False, index=False).encode('utf-8'),
    ContentType='text/csv')

response['Body'].read()

b'[0, 1]'

In [31]:
df.to_csv(header=False, index=False).encode('utf-8')

b'56.0,26.0,92.0,2.0,4.5,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0\n51.0,25.0,104.0,1.0,4.2,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0\n'

In [32]:
# Lambda Function Test data
{"data": "56.0,26.0,92.0,2.0,4.5,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0\n51.0,25.0,104.0,1.0,4.2,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0\n"}

{'data': '56.0,26.0,92.0,2.0,4.5,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0\n51.0,25.0,104.0,1.0,4.2,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0\n'}
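
Since Lambda does not ship with pandas, a caller may want to build the same test event with the standard library only. A minimal sketch using `csv` and `json`, with the two rows from the cell above:

```python
import csv
import io
import json

rows = [
    [56.0, 26.0, 92.0, 2.0, 4.5, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0],
    [51.0, 25.0, 104.0, 1.0, 4.2, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0],
]

# Write the rows as CSV with "\n" line endings, matching the pandas output.
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(rows)

event = {"data": buffer.getvalue()}
print(json.dumps(event))
```

The column order must match the order of the `features` hyperparameter used for training, because the payload carries no header.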

<a name="Lambda Function Code"></a>
### Lambda Function Code

In [None]:
# Lambda Function
# Note: Lambda does not include the pandas/NumPy libraries by default,
# so the handler works with plain strings and lists.
import os
import boto3
import json

# Read the endpoint name from the Lambda environment variables
ENDPOINT_NAME = os.environ['ENDPOINT_NAME']
runtime = boto3.client('sagemaker-runtime')

def lambda_handler(event, context):
    print("Received event: " + json.dumps(event, indent=2))
    
    # The event is already a dict; 'data' holds the CSV rows to score
    payload = event['data']
    print(payload)
    
    response = runtime.invoke_endpoint(EndpointName=ENDPOINT_NAME,
                                       ContentType='text/csv',
                                       Body=payload)
    print(response)
    # The endpoint returns a JSON list with one prediction per row
    result = json.loads(response['Body'].read().decode())
    print(result)
    result_list = []
    for item in result:
        if item == 0:
            result_list.append('Not accept')
        elif item == 1:
            result_list.append('Accept')
        else:
            result_list.append('Unknown result')
    
    return result_list
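
The handler's logic can be exercised locally before deploying it. The sketch below is one way to do that, assuming the runtime client is passed in as a parameter; `FakeRuntime`, `handle`, and the endpoint name `my-endpoint` are hypothetical names introduced for this test, and the fake predictions are arbitrary:

```python
import io
import json

LABELS = {0: "Not accept", 1: "Accept"}


class FakeRuntime:
    """Hypothetical stand-in for the SageMaker runtime client, for local testing."""

    def invoke_endpoint(self, EndpointName, ContentType, Body):
        # Pretend the model returns 0 for the first CSV row and 1 for the others.
        n_rows = len(Body.strip().split("\n"))
        preds = [0] + [1] * (n_rows - 1)
        return {"Body": io.BytesIO(json.dumps(preds).encode("utf-8"))}


def handle(event, runtime, endpoint_name="my-endpoint"):
    """Same flow as the Lambda handler, with the client injected."""
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name, ContentType="text/csv", Body=event["data"]
    )
    result = json.loads(response["Body"].read().decode())
    # Map each numeric prediction to a label, with a fallback for anything else.
    return [LABELS.get(item, "Unknown result") for item in result]


print(handle({"data": "1,2\n3,4\n"}, FakeRuntime()))
```

Injecting the client keeps the label-mapping code testable without AWS credentials; in the deployed function the real `boto3` client would be passed instead.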

<a name="Reference"></a>
## Reference

1. AWS Notebook [Guide](https://github.com/awslabs/amazon-sagemaker-examples/blob/master/sagemaker-python-sdk/scikit_learn_randomforest/Sklearn_on_SageMaker_end2end.ipynb)
2. AWS Certified Machine Learning Course [Material](https://github.com/ACloudGuru-Resources/Course_AWS_Certified_Machine_Learning/blob/master/Chapter9/ufo-implementation-operations-lab.ipynb)