# Amazon SageMaker SKLearn Bring Your Own Model
_**Hosting a Pre-Trained scikit-learn Model in the Amazon SageMaker SKLearn Framework Container**_

---

## Contents

1. [Background](#Background)
1. [Setup](#Setup)
1. [Optionally, train a scikit-learn model](#Optionally,-train-a-scikit-learn-model)
1. [Upload the pre-trained model to S3](#Upload-the-pre-trained-model-to-S3)
1. [Set up hosting for the model with Python SDK](#Set-up-hosting-for-the-model-with-Python-SDK)
1. [Invoke with the Python SDK](#Invoke-with-the-Python-SDK)




---
## Background

Amazon SageMaker includes functionality to support a hosted notebook environment, distributed, serverless training, and real-time hosting. We think it works best when all three of these services are used together, but they can also be used independently. Some use cases may only require hosting, for example when the model was trained before Amazon SageMaker existed or in a different service.

This notebook shows how to take a pre-trained scikit-learn random forest model and use the Amazon SageMaker SKLearn container to quickly create a hosted endpoint for it.

---
## Setup

Let's start by specifying:

* The AWS region.
* The IAM role ARN used to give training and hosting access to your data. See the documentation for how to specify these.
* The S3 bucket that you want to use for training and model data.

In [5]:
import datetime
import time
import tarfile

import boto3
import pandas as pd
import numpy as np
from sagemaker import get_execution_role
import sagemaker
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_boston


sm_boto3 = boto3.client('sagemaker')

sess = sagemaker.Session()

region = sess.boto_session.region_name

bucket = sess.default_bucket()  # this could also be a hard-coded bucket name

print('Using bucket ' + bucket)

Using bucket sagemaker-us-east-2-349934754982


In [7]:
prefix = 'sagemaker/DEMO-sklearn-byo'
bucket_path = 'https://s3-{}.amazonaws.com/{}'.format(region, bucket)
# customize to your bucket where you have stored the data

## Optionally, train a scikit-learn model

These steps are optional. They generate the scikit-learn model that will eventually be hosted using the SageMaker SKLearn container.


### Fetch the dataset

In [6]:
# We use the Boston housing dataset.
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older release.
data = load_boston()

### Prepare the dataset for training

In [8]:
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.25, random_state=42)

trainX = pd.DataFrame(X_train, columns=data.feature_names)
trainX['target'] = y_train

testX = pd.DataFrame(X_test, columns=data.feature_names)
testX['target'] = y_test

In [9]:
trainX.head()

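| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.09103 | 0.0 | 2.46 | 0.0 | 0.488 | 7.155 | 92.2 | 2.7006 | 3.0 | 193.0 | 17.8 | 394.12 | 4.82 | 37.9 |
| 1 | 3.53501 | 0.0 | 19.58 | 1.0 | 0.871 | 6.152 | 82.6 | 1.7455 | 5.0 | 403.0 | 14.7 | 88.01 | 15.02 | 15.6 |
| 2 | 0.03578 | 20.0 | 3.33 | 0.0 | 0.4429 | 7.82 | 64.5 | 4.6947 | 5.0 | 216.0 | 14.9 | 387.31 | 3.76 | 45.4 |
| 3 | 0.38735 | 0.0 | 25.65 | 0.0 | 0.581 | 5.613 | 95.6 | 1.7572 | 2.0 | 188.0 | 19.1 | 359.29 | 27.26 | 15.7 |
| 4 | 0.06724 | 0.0 | 3.24 | 0.0 | 0.46 | 6.333 | 17.2 | 5.2146 | 4.0 | 430.0 | 16.9 | 375.21 | 7.34 | 22.6 |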


In [11]:
trainX.to_csv('boston_train.csv')
testX.to_csv('boston_test.csv')

## Train model locally

In [17]:
import numpy as np
import pandas as pd
import joblib
from sklearn.ensemble import RandomForestRegressor



In [47]:
print('training model')
model = RandomForestRegressor(
        n_estimators=100,
        min_samples_leaf=3,
        n_jobs=-1)
    
model.fit(X_train, y_train)

# compute the absolute error on the held-out test set
print('validating model')
abs_err = np.abs(model.predict(X_test) - y_test)

# print the 10th, 50th and 90th percentiles of the absolute error
for q in [10, 50, 90]:
    print('AE-at-' + str(q) + 'th-percentile: ' + str(np.percentile(a=abs_err, q=q)))
        


training model
validating model
AE-at-10th-percentile: 0.26401676767676036
AE-at-50th-percentile: 1.4646803890553883
AE-at-90th-percentile: 4.457468982683986


#### Inference with model locally:

In [48]:
model.predict(testX[data.feature_names])

array([22.67531263, 31.19679286, 16.78301385, 23.71872063, 17.09308467,
       21.22811468, 19.35947229, 15.65108456, 21.28613521, 21.02667024,
       20.20080188, 19.65817183,  8.14240158, 21.85755437, 19.61198914,
       25.23829978, 18.63268528,  8.55239291, 44.05811241, 15.46478795,
       24.03537789, 23.95748492, 14.96718709, 24.08047583, 15.0982486 ,
       15.44812861, 21.58461865, 14.03972922, 19.57238341, 20.82120141,
       20.40723113, 23.61431263, 27.76213391, 19.76042327, 14.71201851,
       15.73880273, 35.22242518, 19.15683061, 21.01626815, 23.99830545,
       19.59633009, 29.5170631 , 44.32100646, 19.16979174, 22.68796548,
       13.92651144, 15.53531248, 24.21073997, 18.5530246 , 29.32268568,
       21.16161508, 33.70204473, 17.1367746 , 26.23295563, 46.07912134,
       21.99914722, 15.49032186, 32.34362183, 21.91744621, 20.23810765,
       25.06446111, 34.28859697, 30.72181825, 18.73531961, 27.99031324,
       16.9599978 , 13.59836346, 23.14834545, 29.52979639, 14.96

### Persist model

In [49]:
# persist model
import os

model_dir = '.'
model_path = os.path.join(model_dir, "model.joblib")
joblib.dump(model, model_path)
print('model persisted at ' + model_path)

model persisted at ./model.joblib


### Tar trained model file
Note that the model file name must satisfy the regular expression pattern: `^[a-zA-Z0-9](-*[a-zA-Z0-9])*;`. The model file also needs to be packaged as a gzipped tar archive (`.tar.gz`).

In [50]:
!tar czvf model.tar.gz $model_path

./model.joblib
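
The same archive can also be built from Python with the standard-library `tarfile` module. The following is a minimal sketch, assuming (as in the shell command above) that the model file should sit at the top level of the archive. The helper name `make_model_archive` and the placeholder file are hypothetical and only serve the illustration.

```python
import os
import tarfile
import tempfile


def make_model_archive(model_path, archive_path):
    """Write model_path into a gzipped tar at archive_path and return the member names."""
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname drops any directory prefix so the file sits at the archive root
        tar.add(model_path, arcname=os.path.basename(model_path))
    with tarfile.open(archive_path, "r:gz") as tar:
        return tar.getnames()


# Demo on a placeholder file in a temporary directory (not a real model)
with tempfile.TemporaryDirectory() as tmp:
    placeholder = os.path.join(tmp, "model.joblib")
    with open(placeholder, "wb") as f:
        f.write(b"placeholder bytes")
    names = make_model_archive(placeholder, os.path.join(tmp, "model.tar.gz"))

print(names)
```

Listing the members after writing is a cheap way to confirm the layout before uploading.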


## Upload the pre-trained model to S3

In [51]:
model_file_name = 'sklearn_byo'
key = os.path.join(prefix, model_file_name, 'model.tar.gz')
# upload inside a context manager so the file handle is closed afterwards
with open("model.tar.gz", 'rb') as fObj:
    boto3.Session().resource('s3').Bucket(bucket).Object(key).upload_fileobj(fObj)

In [52]:
model_url = 'https://s3-{}.amazonaws.com/{}/{}'.format(region,bucket,key)
print(model_url)

https://s3-us-east-2.amazonaws.com/sagemaker-us-east-2-349934754982/sagemaker/DEMO-sklearn-byo/sklearn_byo/model.tar.gz
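
The uploaded object can be addressed either by an HTTPS URL, as in `model_url`, or by an `s3://` URI, which is the form that `aws s3` commands expect. The sketch below builds both from the same bucket and key. The helper names are hypothetical, and `posixpath.join` is used on the assumption that the key should always contain forward slashes regardless of the local operating system.

```python
import posixpath


def s3_key(prefix, *parts):
    """Join key components with forward slashes, independent of the local OS."""
    return posixpath.join(prefix, *parts)


def https_url(region, bucket, key):
    """HTTPS form, matching the format string used for model_url."""
    return "https://s3-{}.amazonaws.com/{}/{}".format(region, bucket, key)


def s3_uri(bucket, key):
    """s3:// form, as expected by `aws s3` commands."""
    return "s3://{}/{}".format(bucket, key)


# Values taken from the outputs shown in this notebook
demo_bucket = "sagemaker-us-east-2-349934754982"
demo_key = s3_key("sagemaker/DEMO-sklearn-byo", "sklearn_byo", "model.tar.gz")

print(https_url("us-east-2", demo_bucket, demo_key))
print(s3_uri(demo_bucket, demo_key))
```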


In [53]:
# `aws s3 ls` takes an s3://bucket/prefix URI; passing the HTTPS endpoint URL instead
# fails with "Error parsing parameter 'paths' ... received non 200 status code of 403".
!aws s3 ls s3://{bucket}/{prefix}/{model_file_name}/


## Set up hosting for the model with Python SDK

### Import model into hosting
This involves creating a SageMaker model from the model file previously uploaded to S3. See more information in the documentation (in particular about the entry_point file): https://sagemaker.readthedocs.io/en/stable/frameworks/sklearn/sagemaker.sklearn.html#scikit-learn-model
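
The `entry_point` file (`script.py` in the cell below) is not shown in this notebook. The SageMaker scikit-learn serving container calls a `model_fn(model_dir)` function from that file to load the model. A minimal sketch of what it could contain, assuming the archive holds a single `model.joblib` at its root as created earlier, and that the container's default input, prediction, and output handling is sufficient:

```python
# script.py (sketch): loaded by the serving container at startup
import os

import joblib


def model_fn(model_dir):
    """Load the persisted model from the directory where the archive was unpacked.

    Assumption: the archive contains a file named model.joblib at its top level.
    """
    return joblib.load(os.path.join(model_dir, "model.joblib"))
```

Custom request parsing or response formatting would need additional functions in the same file; see the documentation linked above for the supported hooks.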


In [54]:
FRAMEWORK_VERSION='0.23-1'

In [78]:
from time import gmtime, strftime

from sagemaker.sklearn.model import SKLearnModel

model_name = 'demo-sklearn-randomforest' + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
model = SKLearnModel(
    model_data=model_url,
    role=get_execution_role(),
    entry_point='script.py',
    framework_version=FRAMEWORK_VERSION,
    name=model_name)



In [79]:
print(model_name)

demo-sklearn-randomforest2021-01-20-07-09-09


Deploying the model creates the managed endpoint, which may take several minutes.

In [58]:
predictor = model.deploy(
    instance_type='ml.c5.large',
    initial_instance_count=1)

-------------!

## Invoke with the Python SDK

In [59]:
# the SKLearnPredictor does the serialization from pandas for us
print(predictor.predict(testX[data.feature_names]))

[22.67531263 31.19679286 16.78301385 23.71872063 17.09308467 21.22811468
 19.35947229 15.65108456 21.28613521 21.02667024 20.20080188 19.65817183
  8.14240158 21.85755437 19.61198914 25.23829978 18.63268528  8.55239291
 44.05811241 15.46478795 24.03537789 23.95748492 14.96718709 24.08047583
 15.0982486  15.44812861 21.58461865 14.03972922 19.57238341 20.82120141
 20.40723113 23.61431263 27.76213391 19.76042327 14.71201851 15.73880273
 35.22242518 19.15683061 21.01626815 23.99830545 19.59633009 29.5170631
 44.32100646 19.16979174 22.68796548 13.92651144 15.53531248 24.21073997
 18.5530246  29.32268568 21.16161508 33.70204473 17.1367746  26.23295563
 46.07912134 21.99914722 15.49032186 32.34362183 21.91744621 20.23810765
 25.06446111 34.28859697 30.72181825 18.73531961 27.99031324 16.9599978
 13.59836346 23.14834545 29.52979639 14.96916573 20.71370887 27.09753463
 10.22827341 22.75520058 22.02731861  7.24180729 20.06499791 45.07263965
 11.52066724 13.98276786 21.42543207 11.76426705 20.0
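
Clients that do not use the SageMaker Python SDK can call the endpoint through the `sagemaker-runtime` API in `boto3`. The sketch below is illustrative only: it assumes that the container's default handlers accept `text/csv` input and can return `text/csv` output, and the helper names `rows_to_csv` and `invoke_with_csv` are hypothetical. Only the serialization step is executed here; the network call is defined but not invoked.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialize an iterable of feature rows to CSV text without a header."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue().rstrip("\n")


def invoke_with_csv(endpoint_name, rows):
    """Send rows to a hosted endpoint as text/csv and return the raw response text.

    Assumption: the serving container accepts text/csv and honours Accept=text/csv.
    Not called in this sketch because it needs AWS credentials and a live endpoint.
    """
    import boto3  # imported lazily so the serialization helper runs offline

    runtime = boto3.client("sagemaker-runtime")
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="text/csv",
        Accept="text/csv",
        Body=rows_to_csv(rows),
    )
    return response["Body"].read().decode("utf-8")


# First training row shown earlier, features only (13 values, no target column)
sample = [[0.09103, 0.0, 2.46, 0.0, 0.488, 7.155, 92.2, 2.7006, 3.0, 193.0, 17.8, 394.12, 4.82]]
payload = rows_to_csv(sample)
print(payload)
```

With a live endpoint, the call would look like `invoke_with_csv(predictor.endpoint_name, sample)`, where `endpoint_name` is the attribute exposed by version 2 of the SageMaker Python SDK.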

### Delete endpoint or you will be charged for as long as it's running!

In [None]:
predictor.delete_endpoint()