# Amazon SageMaker scikit-learn Bring Your Own Model


---

This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. 

![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-2/advanced_functionality|scikit_learn_bring_your_own_model|scikit_learn_bring_your_own_model.ipynb)

---

_**Hosting a pre-trained scikit-learn Model in Amazon SageMaker scikit-learn Container**_

---

## Background

Amazon SageMaker includes functionality to support a hosted notebook environment, distributed, serverless training, and real-time hosting. We think it works best when all three of these services are used together, but they can also be used independently. Some use cases may only require hosting, for example when the model was trained before Amazon SageMaker existed or in a different service.

This notebook shows how to use a pre-trained scikit-learn model with the Amazon SageMaker scikit-learn container to quickly create a hosted endpoint for that model.
We use the California Housing dataset, present in Scikit-Learn: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html. The California Housing dataset was originally published in:

> Pace, R. Kelley, and Ronald Barry. "Sparse spatial auto-regressions." Statistics & Probability Letters 33.3 (1997): 291-297.

---
## Setup

Let's start by specifying:

* AWS region.
* The IAM role ARN used to give learning and hosting access to your data.
* The S3 bucket that you want to use for training and model data.

In [2]:
!pip install -U sagemaker



In [1]:
import os
import boto3
import re
import json
import pandas as pd
import numpy as np
import sagemaker
from sagemaker import get_execution_role
from sagemaker.sklearn.model import SKLearnModel
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split


region = boto3.Session().region_name

role = get_execution_role()

bucket = sagemaker.Session().default_bucket()
prefix = "sagemaker/DEMO-sklearn-byo-model"

print(f"bucket: {bucket}")

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml
bucket: sagemaker-us-west-2-461312420708


In [3]:
# prefix = "sagemaker/DEMO-ModelMonitor"

data_capture_prefix = "{}/datacapture".format(prefix)
s3_capture_upload_path = "s3://{}/{}".format(bucket, data_capture_prefix)

## Prepare data for model inference

We load the California Housing dataset from scikit-learn and use it to invoke the SageMaker endpoint.

In [4]:
data = fetch_california_housing()

In [5]:
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.25, random_state=42
)

# we don't train a model, so we will need only the testing data
testX = pd.DataFrame(X_test, columns=data.feature_names)

In [7]:
testX.head(10)

|   | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude |
|---|---|---|---|---|---|---|---|---|
| 0 | 1.6812 | 25.0 | 4.192201 | 1.022284 | 1392.0 | 3.877437 | 36.06 | -119.01 |
| 1 | 2.5313 | 30.0 | 5.039384 | 1.193493 | 1565.0 | 2.679795 | 35.14 | -119.46 |
| 2 | 3.4801 | 52.0 | 3.977155 | 1.185877 | 1310.0 | 1.360332 | 37.8 | -122.44 |
| 3 | 5.7376 | 17.0 | 6.163636 | 1.020202 | 1705.0 | 3.444444 | 34.28 | -118.72 |
| 4 | 3.725 | 34.0 | 5.492991 | 1.028037 | 1063.0 | 2.483645 | 36.62 | -121.93 |
| 5 | 4.7147 | 12.0 | 5.251483 | 0.975089 | 2400.0 | 2.846975 | 34.08 | -117.61 |
| 6 | 5.0839 | 36.0 | 6.221719 | 1.095023 | 670.0 | 3.031674 | 33.89 | -118.02 |
| 7 | 3.6908 | 38.0 | 4.962825 | 1.048327 | 1011.0 | 3.758364 | 33.92 | -118.08 |
| 8 | 4.8036 | 4.0 | 3.924658 | 1.035959 | 1050.0 | 1.797945 | 37.39 | -122.08 |
| 9 | 8.1132 | 45.0 | 6.879056 | 1.011799 | 943.0 | 2.781711 | 34.18 | -118.23 |


In [8]:
testX.count()

MedInc        5160
HouseAge      5160
AveRooms      5160
AveBedrms     5160
Population    5160
AveOccup      5160
Latitude      5160
Longitude     5160
dtype: int64
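
The row count above is consistent with a 25% hold-out. As a quick check, assuming the dataset's documented size of 20,640 samples and that a fractional `test_size` is rounded up to a whole number of rows:

```python
import math

N_SAMPLES = 20640  # documented size of the California Housing dataset
TEST_SIZE = 0.25   # same value passed to train_test_split above

# Assumption: a fractional test_size is rounded up to a whole number of rows.
n_test = math.ceil(TEST_SIZE * N_SAMPLES)
n_train = N_SAMPLES - n_test

print(n_test, n_train)  # 5160 15480
```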

## Download a pre-trained model file

Download a pretrained Scikit-Learn Random Forest model.

We used the California Housing dataset, present in Scikit-Learn: https://scikit-learn.org/stable/datasets/real_world.html#california-housing-dataset to train the model.

For more details on how to train the model with Amazon SageMaker, please refer to the [Develop, Train, Optimize and Deploy Scikit-Learn Random Forest notebook](https://github.com/aws/amazon-sagemaker-examples/blob/master/sagemaker-python-sdk/scikit_learn_randomforest/Sklearn_on_SageMaker_end2end.ipynb)

In [9]:
!aws s3 cp s3://aws-ml-blog/artifacts/scikit_learn_bring_your_own_model/model.joblib .

download: s3://aws-ml-blog/artifacts/scikit_learn_bring_your_own_model/model.joblib to ./model.joblib


### Compress the model file into a GZIP tar archive

Note that the model file name must satisfy the regular expression pattern: `^[a-zA-Z0-9](-*[a-zA-Z0-9])*;`. The model file must also be packaged as a gzip-compressed tar archive (`model.tar.gz`).

In [10]:
model_file_name = "model.joblib"

In [11]:
!tar czvf model.tar.gz $model_file_name

model.joblib
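
If the `tar` command is not available, a comparable archive can be built with Python's standard `tarfile` module. The sketch below uses hypothetical helper names and assumes the model file should sit at the root of the archive, as it does with the command above:

```python
import tarfile
from pathlib import Path


def make_model_archive(model_path, archive_path="model.tar.gz"):
    """Pack a single model file at the root of a gzip-compressed tar archive."""
    model_path = Path(model_path)
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname drops leading directories so the file sits at the archive root
        tar.add(str(model_path), arcname=model_path.name)
    return archive_path


def list_archive(archive_path):
    """Return the member names stored in the archive."""
    with tarfile.open(archive_path, "r:gz") as tar:
        return tar.getnames()
```

Calling `make_model_archive("model.joblib")` followed by `list_archive("model.tar.gz")` should list a single member named `model.joblib`.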


## Upload the pre-trained model `model.tar.gz` file to S3

In [12]:
key = os.path.join(prefix, "model.tar.gz")
# Use a context manager so the file handle is closed after the upload
with open("model.tar.gz", "rb") as f_obj:
    boto3.Session().resource("s3").Bucket(bucket).Object(key).upload_fileobj(f_obj)

## Set up hosting for the model

This involves creating a SageMaker model from the model file previously uploaded to S3.

In [13]:
model_data = "s3://{}/{}".format(bucket, key)
print(f"model data: {model_data}")

model data: s3://sagemaker-us-west-2-461312420708/sagemaker/DEMO-sklearn-byo-model/model.tar.gz
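
S3 object keys always use `/` as the separator, whereas `os.path.join` uses the separator of the local operating system. A small sketch of a platform-independent way to build the same kind of URI, with a hypothetical helper name and a placeholder bucket standing in for the real one:

```python
import posixpath


def s3_uri(bucket, *key_parts):
    """Join key parts with '/' regardless of the local OS and return an s3:// URI."""
    key = posixpath.join(*key_parts)
    return f"s3://{bucket}/{key}"


# "example-bucket" is a placeholder, not a real bucket
uri = s3_uri("example-bucket", "sagemaker/DEMO-sklearn-byo-model", "model.tar.gz")
print(uri)  # s3://example-bucket/sagemaker/DEMO-sklearn-byo-model/model.tar.gz
```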


### Write the Inference Script

When using endpoints with the Amazon SageMaker managed `Scikit Learn` container, we need to provide an entry point script for inference that will **at least** load the saved model.

After the SageMaker model server has loaded your model by calling `model_fn`, SageMaker will serve your model. Model serving is the process of responding to inference requests, received by SageMaker `InvokeEndpoint` API calls.


We will also implement the `predict_fn()` function, which takes the deserialized request object and performs inference against the loaded model.

We will now create this script and call it `inference.py` and store it at the root of a directory called `code`.

**Note:** You would modify the script below to implement your own inferencing logic.

Additional information on model loading and model serving for scikit-learn on SageMaker can be found in the [SageMaker Scikit-learn Model Server documentation](https://sagemaker.readthedocs.io/en/stable/frameworks/sklearn/using_sklearn.html#deploy-a-scikit-learn-model)

There are also several hosting functions that we won't define:
 - `input_fn()` - Takes request data and deserializes the data into an object for prediction.
 - `output_fn()` - Takes the result of prediction and serializes this according to the response content type.

These will take on their default values as described in the [SageMaker Scikit-learn Serve a Model documentation](https://sagemaker.readthedocs.io/en/stable/frameworks/sklearn/using_sklearn.html#serve-a-model).
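
The two functions described above can be sketched as follows. This is a minimal illustration rather than the notebook's actual `inference.py`; it assumes the archive contains a file named `model.joblib` at its root and that the default `input_fn` hands `predict_fn` an array-like object with one row per sample.

```python
import os

import joblib
import numpy as np


def model_fn(model_dir):
    """Load the model that SageMaker extracted from model.tar.gz into model_dir."""
    # Assumption: the archive holds a single file named model.joblib
    return joblib.load(os.path.join(model_dir, "model.joblib"))


def predict_fn(input_data, model):
    """Run inference on the deserialized request and return the predictions."""
    # Assumption: input_data is array-like with one row per sample
    return model.predict(np.asarray(input_data))
```

Keeping `predict_fn` this thin leaves request parsing and response formatting to the container defaults; custom pre- or post-processing would go into this function or into `input_fn()` and `output_fn()`.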

In [15]:
!pygmentize ./code/inference.py

import os
import joblib
import numpy as np

# def input_fn(request_body, request_content_type):
#     """An input_fn that loads a pickled numpy array"""
#     if request_content_type == "application/python-pickle":
#         array = np.load(StringIO(request_body))
#         return array
# # ip = numpy.append(input_object,numpy.zeros([len(input_object),1]),1)
#     else:
#         # Handle other content-types here or raise an Exception
#         # if the content type is not supported.
#

### Installing additional Python dependencies

It may also be necessary to supply a `requirements.txt` file to ensure that any required dependencies are installed in the container along with the script. For this script, in addition to the Python standard libraries, we showcase how to install the `boto3`, `requests`, `nltk`, and `numpy` libraries.

In [16]:
!pygmentize ./code/requirements.txt

boto3
requests
nltk
numpy


### Deploy with Python SDK

Here we showcase how to create a model from S3 artifacts. The same approach can be used to deploy a model that was trained in a different session or even outside of SageMaker.

In [90]:
model = SKLearnModel(
    role=role,
    model_data=model_data,
    framework_version="1.2-1",
    py_version="py3",
    source_dir="code",
    entry_point="inference.py",
)



### Create endpoint
Lastly, you create the endpoint that serves the model by specifying an endpoint name and configuration. The end result is an endpoint that can be validated and incorporated into production applications. Deployment takes several minutes to complete.

In [91]:
from sagemaker.model_monitor import DataCaptureConfig
from time import gmtime, strftime

endpoint_name = "DEMO-sklearn-byo-model-" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
print("EndpointName={}".format(endpoint_name))

data_capture_config = DataCaptureConfig(
    enable_capture=True, sampling_percentage=100, destination_s3_uri=s3_capture_upload_path
)

EndpointName=DEMO-sklearn-byo-model-2023-12-04-00-20-04


In [92]:
%%time

predictor = model.deploy(
    instance_type="ml.m5.large",
    initial_instance_count=1,
    endpoint_name=endpoint_name,
    data_capture_config=data_capture_config,
)

-----!CPU times: user 178 ms, sys: 11.5 ms, total: 190 ms
Wall time: 3min 2s


## Validate the model for use
Now you can obtain the endpoint from the client library using the result from previous operations and generate predictions from the model using that endpoint.

### Invoke with the Python SDK

In [93]:
from sagemaker.predictor import Predictor
from sagemaker.serializers import CSVSerializer
import time

predictor1 = Predictor(endpoint_name=endpoint_name, serializer=CSVSerializer())



Let's generate predictions for a small sample of data points. We'll pick them from the test data generated earlier.

In [21]:
print((testX[data.feature_names]))

      MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  \
0     1.6812      25.0  4.192201   1.022284      1392.0  3.877437     36.06   
1     2.5313      30.0  5.039384   1.193493      1565.0  2.679795     35.14   
2     3.4801      52.0  3.977155   1.185877      1310.0  1.360332     37.80   
3     5.7376      17.0  6.163636   1.020202      1705.0  3.444444     34.28   
4     3.7250      34.0  5.492991   1.028037      1063.0  2.483645     36.62   
...      ...       ...       ...        ...         ...       ...       ...   
5155  6.6260      51.0  5.532213   0.974790       771.0  2.159664     34.04   
5156  2.1898      30.0  4.509091   0.945455       410.0  2.484848     40.18   
5157  2.1667      37.0  3.272152   1.056962      2173.0  4.584388     34.02   
5158  6.8869       6.0  7.382385   1.030075      2354.0  2.528464     38.51   
5159  6.6321      36.0  5.734644   1.056511      1033.0  2.538084     33.89   

      Longitude  
0       -119.01  
1       -119.46

In [54]:
sample = testX[data.feature_names].sample(2).to_numpy()



In [68]:
print(sample)

[[ 4.02940000e+00  1.80000000e+01  5.86526946e+00  1.04091816e+00
   3.11400000e+03  3.10778443e+00  3.41300000e+01 -1.17370000e+02]
 [ 3.54260000e+00  2.60000000e+01  5.30132159e+00  1.09603524e+00
   2.67300000e+03  2.35506608e+00  3.36600000e+01 -1.17880000e+02]]


In [86]:
def convert_to_numpy_array(string):
    # Split the string into rows, skipping blank lines
    rows = [row for row in string.splitlines() if row.strip()]

    # Split each row into columns and convert to float
    array = [[float(num) for num in row.split(",")] for row in rows]

    # Convert the list of lists to a numpy array
    return np.array(array)

In [89]:
convert_to_numpy_array("""4.0294,18.0,5.865269461077844,1.0409181636726548,3114.0,3.1077844311377247,34.13,-117.37 \n 3.5426,26.0,5.301321585903084,1.096035242290749,2673.0,2.3550660792951543,33.66,-117.88""")

array([[ 4.02940000e+00,  1.80000000e+01,  5.86526946e+00,
         1.04091816e+00,  3.11400000e+03,  3.10778443e+00,
         3.41300000e+01, -1.17370000e+02],
       [ 3.54260000e+00,  2.60000000e+01,  5.30132159e+00,
         1.09603524e+00,  2.67300000e+03,  2.35506608e+00,
         3.36600000e+01, -1.17880000e+02]])

In [95]:
# the CSVSerializer attached to the Predictor converts the NumPy array to CSV for us
predictions = predictor1.predict(sample)
print(predictions)

b'[2.4887181531746028]'
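
The `Predictor` was created without an explicit deserializer, so the response comes back as raw bytes. The bytes in this run happen to form a JSON-style list, so, assuming the endpoint keeps responding in that form, they can be turned into Python floats with the standard `json` module (the helper name is hypothetical):

```python
import json


def parse_predictions(raw):
    """Decode a bytes payload such as b'[2.48]' into a list of floats."""
    values = json.loads(raw.decode("utf-8"))
    return [float(v) for v in values]


print(parse_predictions(b"[2.4887181531746028]"))  # [2.4887181531746028]
```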


### Alternative: invoke with `boto3`

This is useful when invoking the model from external clients, e.g. Lambda Functions, or other micro-services.

In [None]:
runtime = boto3.client("sagemaker-runtime")

#### Option 1: `csv` serialization

In [None]:
# csv serialization
response = runtime.invoke_endpoint(
    EndpointName=predictor.endpoint_name,
    Body=testX[data.feature_names].to_csv(header=False, index=False).encode("utf-8"),
    ContentType="text/csv",
)

print(response["Body"].read())
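
The request body in this option is plain CSV text with one row per sample and no header or index column. A dependency-free sketch of building such a payload from nested lists with the standard `csv` module follows; the helper name is hypothetical and the rows are shortened for illustration, whereas a real request needs all eight features per row:

```python
import csv
import io


def rows_to_csv_bytes(rows):
    """Serialize an iterable of numeric rows to UTF-8 CSV bytes without a header."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue().encode("utf-8")


# Shortened example rows (four values each) purely for illustration
payload = rows_to_csv_bytes([[4.0294, 18.0, 34.13, -117.37], [3.5426, 26.0, 33.66, -117.88]])
print(payload)
```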

#### Option 2: `npy` serialization

In [None]:
# npy serialization
from io import BytesIO


# Serialise numpy ndarray as bytes
buffer = BytesIO()
# Assuming testX is a data frame
np.save(buffer, testX[data.feature_names].values)

response = runtime.invoke_endpoint(
    EndpointName=predictor.endpoint_name, Body=buffer.getvalue(), ContentType="application/x-npy"
)

print(response["Body"].read())
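
The `application/x-npy` body is simply the byte content of a `.npy` file. A small round-trip sketch, with hypothetical helper names, illustrates that shape and dtype survive the serialization, which is what a receiver calling `np.load` on the payload would rely on:

```python
from io import BytesIO

import numpy as np


def to_npy_bytes(array):
    """Serialize an ndarray to the .npy binary format in memory."""
    buffer = BytesIO()
    np.save(buffer, array)
    return buffer.getvalue()


def from_npy_bytes(payload):
    """Rebuild the ndarray from .npy bytes."""
    return np.load(BytesIO(payload))


original = np.array([[4.0294, 18.0], [3.5426, 26.0]])
restored = from_npy_bytes(to_npy_bytes(original))
print(restored.shape, restored.dtype)
```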

### (Optional) Delete the Endpoint

If you're ready to be done with this notebook, please run the delete_endpoint line in the cell below.  This will remove the hosted endpoint you created and avoid any charges from a stray instance being left on.

In [None]:
# predictor.delete_endpoint()

## Conclusion

In this notebook you deployed a pre-trained scikit-learn model with the Amazon SageMaker scikit-learn container to quickly create a hosted endpoint for that model.
You then invoked the endpoint with the Python SDK, and with `boto3` using first a `csv` payload and then an `npy` payload, to get predictions from the model.

As next steps you can try to [Automatically Scale Amazon SageMaker Models](https://docs.aws.amazon.com/sagemaker/latest/dg/endpoint-auto-scaling.html), [Register and Deploy Models with Model Registry](https://docs.aws.amazon.com/sagemaker/latest/dg/model-registry.html) or [Train your Model with Amazon SageMaker](https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-training.html).


## Notebook CI Test Results

This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.

![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-1/advanced_functionality|scikit_learn_bring_your_own_model|scikit_learn_bring_your_own_model.ipynb)

![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-2/advanced_functionality|scikit_learn_bring_your_own_model|scikit_learn_bring_your_own_model.ipynb)

![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-1/advanced_functionality|scikit_learn_bring_your_own_model|scikit_learn_bring_your_own_model.ipynb)

![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ca-central-1/advanced_functionality|scikit_learn_bring_your_own_model|scikit_learn_bring_your_own_model.ipynb)

![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/sa-east-1/advanced_functionality|scikit_learn_bring_your_own_model|scikit_learn_bring_your_own_model.ipynb)

![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-1/advanced_functionality|scikit_learn_bring_your_own_model|scikit_learn_bring_your_own_model.ipynb)

![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-2/advanced_functionality|scikit_learn_bring_your_own_model|scikit_learn_bring_your_own_model.ipynb)

![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-3/advanced_functionality|scikit_learn_bring_your_own_model|scikit_learn_bring_your_own_model.ipynb)

![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-central-1/advanced_functionality|scikit_learn_bring_your_own_model|scikit_learn_bring_your_own_model.ipynb)

![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-north-1/advanced_functionality|scikit_learn_bring_your_own_model|scikit_learn_bring_your_own_model.ipynb)

![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-1/advanced_functionality|scikit_learn_bring_your_own_model|scikit_learn_bring_your_own_model.ipynb)

![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-2/advanced_functionality|scikit_learn_bring_your_own_model|scikit_learn_bring_your_own_model.ipynb)

![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-1/advanced_functionality|scikit_learn_bring_your_own_model|scikit_learn_bring_your_own_model.ipynb)

![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-2/advanced_functionality|scikit_learn_bring_your_own_model|scikit_learn_bring_your_own_model.ipynb)

![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-south-1/advanced_functionality|scikit_learn_bring_your_own_model|scikit_learn_bring_your_own_model.ipynb)
