# Amazon SageMaker SKLearn Bring Your Own Model
_**Hosting a Pre-Trained scikit-learn Model in the Amazon SageMaker SKLearn Framework Container**_

---

## Contents

1. [Background](#Background)
1. [Setup](#Setup)
1. [Check for pre-trained model in S3](#Check-for-pre-trained-model-in-S3)
1. [Set up hosting for the model with Python SDK](#Set-up-hosting-for-the-model-with-Python-SDK)
1. [Invoke with the Python SDK](#Invoke-with-the-Python-SDK)
1. [Feature Importance with Clarify](#Feature-Importance-with-Clarify)
1. [Batch Inference](#Batch-Inference)




---
## Background

Amazon SageMaker provides a hosted notebook environment, distributed serverless training, and real-time hosting. We think it works best when all three services are used together, but each can also be used independently. Some use cases require only hosting, for example when the model was trained in a different service or before Amazon SageMaker existed.

This notebook shows how to take a pre-trained scikit-learn random forest model and use the Amazon SageMaker SKLearn container to quickly create a hosted endpoint for it.

---
## Setup

Let's start by specifying:

* The AWS region.
* The IAM role ARN used to give training and hosting access to your data. See the documentation for how to specify these.
* The S3 bucket that you want to use for training and model data.

In [None]:
import datetime
import time
import tarfile

import boto3
import pandas as pd
import numpy as np
from sagemaker import get_execution_role
import sagemaker
from sklearn.model_selection import train_test_split


sm_boto3 = boto3.client('sagemaker')

sess = sagemaker.Session()

region = sess.boto_session.region_name

bucket = sess.default_bucket()  # this could also be a hard-coded bucket name

print('Using bucket ' + bucket)

In [None]:
# Customize the prefix to the location in your bucket where the data is stored
prefix = 'sagemaker/DEMO-sklearn-byo'
bucket_path = 'https://s3-{}.amazonaws.com/{}'.format(region, bucket)

## Check for pre-trained model in S3

In [None]:
# Replace this URL with the location of your own model
model_url = 's3://sagemaker-ap-southeast-1-380399053155/sklearn-training-2021-04-13-20-51-53-977/output/model.tar.gz'
print(model_url)

In [None]:
!aws s3 ls $model_url
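SageMaker hosting expects the model artifact as a gzipped tar archive (`model.tar.gz`), as in the URL above. If your model exists only as a local file, a minimal sketch of packaging it might look like the following. The file name `model.joblib` and the helper `package_model` are hypothetical; use whatever file name your inference script loads.

```python
import os
import tarfile
import tempfile


def package_model(model_file, archive_path="model.tar.gz"):
    """Pack a single serialized model file into a gzipped tarball.

    The file is stored at the archive root (no parent directories).
    """
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(model_file, arcname=os.path.basename(model_file))
    return archive_path


# Demonstration with a placeholder file standing in for a real model.
workdir = tempfile.mkdtemp()
placeholder = os.path.join(workdir, "model.joblib")  # hypothetical file name
with open(placeholder, "wb") as f:
    f.write(b"not a real model")

archive = package_model(placeholder, os.path.join(workdir, "model.tar.gz"))
with tarfile.open(archive, "r:gz") as tar:
    archive_members = tar.getnames()
print(archive_members)
```

The resulting archive could then be uploaded with `sess.upload_data(path=archive, bucket=bucket, key_prefix=prefix)`, which returns the S3 URI to use as the model URL.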

## Set up hosting for the model with Python SDK

### Import model into hosting
This step creates a SageMaker model from the model archive previously uploaded to S3. For details, in particular on the `entry_point` script, see the documentation: https://sagemaker.readthedocs.io/en/stable/frameworks/sklearn/sagemaker.sklearn.html#scikit-learn-model
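The SKLearn serving container calls a `model_fn(model_dir)` function from the entry-point script to load the model from the extracted archive. The following is a minimal sketch of that function only. It uses `pickle` and a hypothetical file name `model.pkl` to stay dependency-free, whereas scripts for scikit-learn models more commonly use `joblib`. The actual `./scripts/train.py` is not shown in this notebook and may differ.

```python
import os
import pickle
import tempfile


def model_fn(model_dir):
    """Load and return the model object from the directory where the
    container extracted model.tar.gz."""
    with open(os.path.join(model_dir, "model.pkl"), "rb") as f:  # hypothetical file name
        return pickle.load(f)


# Local check: a plain dict stands in for a fitted estimator.
model_dir = tempfile.mkdtemp()
with open(os.path.join(model_dir, "model.pkl"), "wb") as f:
    pickle.dump({"kind": "placeholder"}, f)

loaded = model_fn(model_dir)
print(loaded)
```

Calling `model_fn` locally against an unpacked archive like this is a cheap way to catch file-name mismatches before deploying.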


In [None]:
FRAMEWORK_VERSION = '0.23-1'

In [None]:
from sagemaker.sklearn.model import SKLearnModel
# Optionally append a timestamp to make the name unique:
# + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
model_name = 'demo-sklearn-randomforestv1'
model = SKLearnModel(
    model_data=model_url,
    role=get_execution_role(),
    entry_point='./scripts/train.py',
    framework_version=FRAMEWORK_VERSION,
    name=model_name)



In [None]:
print(model_name)

The next cell deploys the model to a managed endpoint (this may take several minutes).

In [None]:
predictor = model.deploy(
    instance_type='ml.c5.large',
    initial_instance_count=1)

## Invoke with the Python SDK

In [None]:
# Replace this URI with the location of your own test data
test_data_uri = "s3://sagemaker-ap-southeast-1-380399053155/sklearn-processing-2021-04-13-21-20-28/processing/output/test/"

In [None]:
test_df = pd.read_csv(test_data_uri + 'test.csv', header=None)

In [None]:
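# rename() returns a new DataFrame, so assign the result to keep the column names
test_df = test_df.rename(columns={0: "Status", 1: "Var1", 2: "Var2", 3: "Var3", 4: "Var4"})
test_df.head()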

In [None]:
# Drop the first column (the label) and keep only the feature columns
testX = test_df.iloc[:, 1:]

In [None]:
# the SKLearnPredictor does the serialization from pandas for us
ix=23
print(predictor.predict(testX.iloc[ix:ix+1,:]))
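For clients that do not use the SDK predictor, the same kind of row can be sent as a `text/csv` payload, which is the content type the Clarify configuration later in this notebook uses. A minimal sketch of building that payload follows; `row_to_csv` is a hypothetical helper and the numbers are example feature values.

```python
def row_to_csv(values):
    """Join feature values into one comma-separated line (no header, no label)."""
    return ",".join(str(v) for v in values)


payload = row_to_csv([0.026202, 103826.654315, 318.0, 12.414592])
print(payload)
```

With the low-level boto3 `sagemaker-runtime` client, such a string could be passed as `Body` to `invoke_endpoint` together with `ContentType='text/csv'`.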

In [None]:
!curl -d '{"data":"0.026202,103826.654315,318.0,12.414592\n"}' -H 'Content-Type: text/csv' https://gf6g1w420f.execute-api.us-east-1.amazonaws.com/prod/predictinkjet

## Feature Importance with Clarify

In [None]:
from sagemaker import clarify

In [None]:
clarify_processor = clarify.SageMakerClarifyProcessor(role=get_execution_role(),
                                                      instance_count=1,
                                                      instance_type='ml.m5.xlarge',
                                                      sagemaker_session=sess)

In [None]:
model_config = clarify.ModelConfig(model_name=model_name,
                                   instance_type='ml.m5.large',
                                   instance_count=1,
                                   accept_type='text/csv',
                                   content_type='text/csv')

In [None]:
testX.iloc[0].values.tolist()

In [None]:
# Replace train_uri with the location of your own training data
train_uri = 's3://sagemaker-ap-southeast-1-380399053155/sklearn-processing-2021-04-13-21-20-28/processing/output/train/'


shap_config = clarify.SHAPConfig(baseline=[testX.iloc[0].values.tolist()],
                                 num_samples=100,
                                 agg_method='mean_abs',
                                 save_local_shap_values=False)

explainability_output_path = 's3://{}/{}/clarify-explainability'.format(bucket, prefix)
explainability_data_config = clarify.DataConfig(s3_data_input_path=train_uri,
                                                s3_output_path=explainability_output_path,
                                                label='Status',
                                                headers=['Status', 'Var1', 'Var2', 'Var3', 'Var4'],
                                                dataset_type='text/csv')

In [None]:
clarify_processor.run_explainability(data_config=explainability_data_config,
                                     model_config=model_config,
                                     explainability_config=shap_config)
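`agg_method='mean_abs'` asks Clarify to report global feature importance as the mean of the absolute SHAP values of each feature over all instances. A small sketch of that aggregation on made-up local SHAP values (the numbers are illustrative, not Clarify output, and `mean_abs_importance` is a hypothetical helper):

```python
def mean_abs_importance(shap_rows, feature_names):
    """Average |SHAP value| per feature across instances."""
    n = len(shap_rows)
    return {
        name: sum(abs(row[j]) for row in shap_rows) / n
        for j, name in enumerate(feature_names)
    }


features = ["Var1", "Var2", "Var3", "Var4"]
# Hypothetical per-instance SHAP values, one row per instance.
local_shap = [
    [0.5, -1.0, 0.0, 2.0],
    [-0.5, 1.0, 0.0, -4.0],
]
importance = mean_abs_importance(local_shap, features)
print(importance)
```

Taking absolute values first keeps positive and negative contributions from cancelling out, so a feature that pushes predictions strongly in both directions still ranks as important.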

## Batch Inference

In [None]:
# The location of the test dataset
#batch_input = 's3://{}/{}/test'.format(bucket, prefix)
batch_input=test_data_uri

# The location to store the results of the batch transform job
batch_output = 's3://{}/{}/batch-prediction'.format(bucket, prefix)

In [None]:
transformer = model.transformer(
    instance_count=1, 
    instance_type='ml.m4.xlarge', 
    output_path=batch_output
)

In [None]:
transformer.transform(
    data=batch_input, 
    data_type='S3Prefix',
    content_type='text/csv', 
    split_type='Line',
    input_filter='$[1:]'
)
transformer.wait()
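The `input_filter='$[1:]'` argument tells batch transform to drop the first CSV column (the `Status` label) before passing each record to the model. The effect on a single line can be sketched locally; `apply_input_filter` is a hypothetical helper that mimics only this one filter, not the general JSONPath support.

```python
def apply_input_filter(csv_line, start=1):
    """Mimic input_filter='$[start:]' on one CSV record: keep columns from index `start`."""
    columns = csv_line.rstrip("\n").split(",")
    return ",".join(columns[start:])


record = "1,0.026202,103826.654315,318.0,12.414592"  # label followed by four features
filtered = apply_input_filter(record)
print(filtered)
```

Without the filter, the model would receive five values per record instead of the four features it was trained on.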

In [None]:
!aws s3 ls {batch_output+'/'}

In [None]:
# The output file has no header row, so do not treat the first prediction as one
pd.read_csv(batch_output + '/test.csv.out', header=None)

### Delete the endpoint, or you will be charged for as long as it is running!

In [None]:
predictor.delete_endpoint()