# Lab: Bring your own script with Amazon SageMaker

## Sklearn script mode training and serving
Script mode is a training script format for a number of supported frameworks that lets you execute the training script in SageMaker with minimal modification. The [SageMaker Python SDK](https://github.com/aws/sagemaker-python-sdk) handles transferring your script to a SageMaker training instance. On the training instance, SageMaker's native SKlearn support sets up training-related environment variables and executes your training script. In this tutorial, we use the SageMaker Python SDK to launch a training job and deploy the trained model.

Script mode supports training with a Python script, a Python module, or a shell script. In this example, we use a Python script to train a classification model on the [Iris dataset](https://archive.ics.uci.edu/ml/datasets/iris) with scikit-learn and the SageMaker Python SDK. In addition, this notebook demonstrates how to perform real-time inference with the [SageMaker SKlearn container](https://docs.aws.amazon.com/sagemaker/latest/dg/pre-built-docker-containers-scikit-learn-spark.html).

## Set up the environment
Let's start by setting up the environment:

In [2]:
from sagemaker.sklearn.estimator import SKLearn
from sagemaker import get_execution_role
import os
import tarfile
import pandas as pd

In [3]:
import sagemaker

sagemaker_session = sagemaker.Session()

bucket = sagemaker_session.default_bucket()
print(bucket)
#prefix = 'sagemaker/DEMO-BYO'

role = sagemaker.get_execution_role()

sagemaker-ap-southeast-2-571660658801


## Training data
We download the Iris data directly from the UCI Machine Learning Repository.

In [4]:
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data

--2022-06-01 23:48:32--  https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4551 (4.4K) [application/x-httpd-php]
Saving to: ‘iris.data.1’


2022-06-01 23:48:33 (121 MB/s) - ‘iris.data.1’ saved [4551/4551]



## Train and test split
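We load the data, name the columns, and save everything as a single CSV file. The training script imports `train_test_split`, so the train/test split itself presumably takes place inside the script.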

In [5]:
data = pd.read_csv('iris.data', 
                   names=['sepal length', 'sepal width', 
                          'petal length', 'petal width', 
                          'label'])

data.to_csv('data.csv')


Upload the data to S3 so that the training script can read it.

In [6]:
import boto3

s3_session = boto3.Session().resource('s3')
s3_session.Bucket(bucket).Object('data/data.csv').upload_file('data.csv')


## Construct a script for bringing your own SKlearn script to SageMaker
Your Scikit-learn training script must be a Python 3.6 compatible source file.
The training script is similar to a training script you might run outside of SageMaker, but you can access useful properties about the training environment through various environment variables. For example:
- `SM_MODEL_DIR`: A string representing the path to the directory to write model artifacts to. These artifacts are uploaded to S3 for model hosting.
- `SM_OUTPUT_DATA_DIR`: A string representing the filesystem path to write output artifacts to. Output artifacts may include checkpoints, graphs, and other files to save, not including model artifacts. These artifacts are compressed and uploaded to S3 to the same S3 prefix as the model artifacts.
- `SM_CHANNEL_[channel_name]`: One variable per input channel. Supposing two input channels, 'train' and 'test', were used in the call to the Scikit-learn estimator's `fit()` method, the following will be set:
  - `SM_CHANNEL_TRAIN`: A string representing the path to the directory containing data in the 'train' channel.
  - `SM_CHANNEL_TEST`: Same as above, but for the 'test' channel.
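
As a minimal sketch of how a script might read these variables, the hypothetical helper `read_sagemaker_env` below collects them into a dictionary. The fallback paths are assumptions that make the script runnable outside SageMaker; the container does not set them.

```python
import os


def read_sagemaker_env(environ=None):
    """Collect the SageMaker training paths from environment variables.

    The fallback values are hypothetical local defaults so the script can
    also be run outside SageMaker; the container sets the real paths.
    """
    env = os.environ if environ is None else environ
    return {
        "model_dir": env.get("SM_MODEL_DIR", "./model"),
        "output_data_dir": env.get("SM_OUTPUT_DATA_DIR", "./output"),
        # One variable per channel passed to fit(), e.g. 'train' -> SM_CHANNEL_TRAIN
        "train_dir": env.get("SM_CHANNEL_TRAIN", "./data/train"),
        "test_dir": env.get("SM_CHANNEL_TEST", "./data/test"),
    }


if __name__ == "__main__":
    for name, path in read_sagemaker_env().items():
        print(f"{name}: {path}")
```

Passing a plain dictionary instead of `os.environ` makes the helper easy to exercise locally before launching a training job.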

In order to save your trained Scikit-learn model for deployment on SageMaker, your training script should save your model to a certain filesystem path called model_dir. This value is accessible through the environment variable SM_MODEL_DIR.

Load the model: before a model can be served, it must be loaded. The SageMaker Scikit-learn model server loads your model by invoking a `model_fn` function that you must provide in your script.

Serve a Model: after the SageMaker model server has loaded your model by calling model_fn, SageMaker will serve your model. Model serving is the process of responding to inference requests, received by SageMaker InvokeEndpoint API calls. The SageMaker Scikit-learn model server breaks request handling into three steps:

- input processing,
- prediction, and
- output processing.
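
The loading function and the three serving steps could be sketched as follows. This is not the script used in this lab, only an illustration under stated assumptions: the file name `model.joblib` is hypothetical, the handler is assumed to receive `text/csv` input, and the `" | "` separator simply mirrors the format of the endpoint response shown later in this notebook.

```python
import os


def model_fn(model_dir):
    """Load the model that the training script saved in model_dir."""
    import joblib  # imported lazily; assumes joblib is available in the container

    # "model.joblib" is an assumed file name; use whatever name training wrote.
    return joblib.load(os.path.join(model_dir, "model.joblib"))


def input_fn(request_body, request_content_type):
    """Input processing: parse a text/csv payload into rows of floats."""
    if request_content_type != "text/csv":
        raise ValueError(f"Unsupported content type: {request_content_type}")
    if isinstance(request_body, bytes):
        request_body = request_body.decode("utf-8")
    return [
        [float(value) for value in line.split(",")]
        for line in request_body.splitlines()
        if line.strip()
    ]


def predict_fn(input_data, model):
    """Prediction: run the loaded model on the parsed rows."""
    return model.predict(input_data)


def output_fn(prediction, accept):
    """Output processing: join the predicted labels into one string."""
    return " | ".join(str(label) for label in prediction)
```

Keeping each step in its own function means the parsing and formatting logic can be tried locally with a stand-in model object before an endpoint is deployed.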

Here is the entire script, which implements the details explained above.

In [7]:
!pygmentize 'main.py'


"""
File: BYO_scikitlearn_model
"""

import argparse
import numpy as np
import os
import pandas as pd
from sklearn.externals import joblib
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Dictionary to encode labels to codes
l

In [8]:
sklearn_estimator = SKLearn(entry_point='main.py',
                            instance_type='ml.m4.xlarge',
                            framework_version='0.20.0',
                            role=role)


## Calling `fit`
To start a training job, we call `estimator.fit(training_data_uri)`.

An S3 location is used here as the input. When `fit` is given a single S3 URI string, it creates a default channel named 'training', and the training script can read the data from the location stored in `SM_CHANNEL_TRAINING`. In this notebook we pass a dictionary with the key 'train' instead, so the channel is named 'train' and the data location is stored in `SM_CHANNEL_TRAIN`. `fit` accepts a couple of other input types as well. See the API doc [here](https://sagemaker.readthedocs.io/en/stable/estimators.html#sagemaker.estimator.EstimatorBase.fit) for details.

When training starts, the Scikit-learn container executes `main.py` (the `entry_point` given to the estimator), passing hyperparameters and model_dir from the estimator as script arguments. Because we didn't define either in this example, no hyperparameters are passed, and model_dir defaults to `s3://<DEFAULT_BUCKET>/<TRAINING_JOB_NAME>`, so the script execution is as follows:

`main.py --model_dir s3://<DEFAULT_BUCKET>/<TRAINING_JOB_NAME>`

When training is complete, the training job will upload the saved model to S3 for deployment.
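
To illustrate how a script can receive such arguments, here is a minimal sketch using `argparse`. The `--max_iter` hyperparameter is hypothetical (this example passes none), the `./model` fallback is an assumption for local runs, and `parse_known_args` is used so that any extra arguments are ignored rather than raising an error.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse script arguments, falling back to SageMaker environment variables."""
    parser = argparse.ArgumentParser()
    # Hypothetical hyperparameter; it would arrive as `--max_iter 200` if the
    # estimator were created with hyperparameters={'max_iter': 200}.
    parser.add_argument("--max_iter", type=int, default=100)
    # Use the command-line value if present, otherwise SM_MODEL_DIR.
    parser.add_argument(
        "--model_dir",
        type=str,
        default=os.environ.get("SM_MODEL_DIR", "./model"),
    )
    args, _unknown = parser.parse_known_args(argv)
    return args


# Example: simulate the arguments a training job might pass to the script
example = parse_args(["--max_iter", "200", "--model_dir", "/opt/ml/model"])
print(example.max_iter, example.model_dir)
```

Accepting an explicit `argv` list keeps the parser testable without launching a training job.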

In [9]:
sklearn_estimator.fit({'train': 's3://{}/data/data.csv'.format(bucket)})


2022-06-01 23:48:35 Starting - Starting the training job...
2022-06-01 23:49:01 Starting - Preparing the instances for training
ProfilerReport-1654127315: InProgress
.........
2022-06-01 23:50:23 Downloading - Downloading input data...
2022-06-01 23:51:01 Training - Downloading the training image......
2022-06-01 23:52:01 Uploading - Uploading generated training model
2022-06-01 23:51:52,247 sagemaker-containers INFO     Imported framework sagemaker_sklearn_container.training
2022-06-01 23:51:52,252 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2022-06-01 23:51:52,270 sagemaker_sklearn_container.training INFO     Invoking user training script.
2022-06-01 23:51:52,772 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2022-06-01 23:51:52,792 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2022-06-01 23:51:52,814 sagemaker-training-toolki

## Deploy
We are now ready to deploy our model to SageMaker hosting services and make real-time predictions.

In [10]:
predictor = sklearn_estimator.deploy(instance_type='ml.m4.xlarge',
                                     initial_instance_count=1)


------!

Let's now send some data to our model for prediction. The data must be sent in the format the endpoint accepts, and the code below prepares it accordingly.

In [11]:
from sklearn.preprocessing import StandardScaler

test = pd.read_csv("data.csv")
# Skip the index column written by to_csv and keep the four feature columns
test = test.iloc[1:50, 1:5].values.tolist()
# Standardise the input
sc = StandardScaler()
test = sc.fit_transform(test)

test

array([[-0.29549457, -1.09317789, -0.37691428, -0.41952354],
       [-0.86330766, -0.56802381, -0.95406427, -0.41952354],
       [-1.14721421, -0.83060085,  0.20023571, -0.41952354],
       [-0.01158802,  0.48228436, -0.37691428, -0.41952354],
       [ 1.12403817,  1.27001549,  1.35453569,  1.44926314],
       [-1.14721421, -0.04286972, -0.37691428,  0.5148698 ],
       [-0.01158802, -0.04286972,  0.20023571, -0.41952354],
       [-1.7150273 , -1.35575494, -0.37691428, -0.41952354],
       [-0.29549457, -0.83060085,  0.20023571, -1.35391688],
       [ 1.12403817,  0.74486141,  0.20023571, -0.41952354],
       [-0.57940112, -0.04286972,  0.7773857 , -0.41952354],
       [-0.57940112, -1.09317789, -0.37691428, -1.35391688],
       [-1.99893385, -1.09317789, -2.10836424, -1.35391688],
       [ 2.25966435,  1.53259254, -1.53121425, -0.41952354],
       [ 1.97575781,  2.58290071,  0.20023571,  1.44926314],
       [ 1.12403817,  1.27001549, -0.95406427,  1.44926314],
       [ 0.27231852,  0.

In [12]:
# Build a CSV payload: one comma-separated row per sample, no trailing newline
request_body = "\n".join(",".join(str(n) for n in row) for row in test)

print(request_body)

-0.29549456918691325,-1.093177892610886,-0.376914277945237,-0.41952353926805985
-0.8633076629186336,-0.5680238069448718,-0.9540642660488794,-0.41952353926805985
-1.1472142097844948,-0.8306008497778788,0.20023571015840666,-0.41952353926805985
-0.011588022321054384,0.4822843643871554,-0.376914277945237,-0.41952353926805985
1.124038165142386,1.2700154928861755,1.3545356863656928,1.4492631356533008
-1.1472142097844948,-0.042869721278858776,-0.376914277945237,0.5148697981926202
-0.011588022321054384,-0.042869721278858776,0.20023571015840666,-0.41952353926805985
-1.7150273035162127,-1.3557549354438931,-0.376914277945237,-0.41952353926805985
-0.29549456918691325,-0.8306008497778788,0.20023571015840666,-1.3539168767287402
1.124038165142386,0.7448614072201625,0.20023571015840666,-0.41952353926805985
-0.5794011160527747,-0.042869721278858776,0.7773856982620503,-0.41952353926805985
-0.5794011160527747,-1.093177892610886,-0.376914277945237,-1.3539168767287402
-1.9989338503820742,-1.093177892610886

In [13]:
client = boto3.client('sagemaker-runtime')

endpoint = predictor.endpoint_name
content_type = "text/csv"

# Send the CSV payload to the endpoint for real-time inference
response = client.invoke_endpoint(
    EndpointName=endpoint,
    ContentType=content_type,
    Body=request_body
)


In [14]:
response['Body'].read()

b'Iris-versicolor | Iris-setosa | Iris-versicolor | Iris-setosa | Iris-virginica | Iris-setosa | Iris-versicolor | Iris-versicolor | Iris-versicolor | Iris-versicolor | Iris-versicolor | Iris-versicolor | Iris-setosa | Iris-setosa | Iris-virginica | Iris-virginica | Iris-virginica | Iris-virginica | Iris-virginica | Iris-versicolor | Iris-virginica | Iris-setosa | Iris-virginica | Iris-virginica | Iris-versicolor | Iris-virginica | Iris-versicolor | Iris-versicolor | Iris-versicolor | Iris-versicolor | Iris-virginica | Iris-setosa | Iris-setosa | Iris-versicolor | Iris-setosa | Iris-setosa | Iris-versicolor | Iris-versicolor | Iris-versicolor | Iris-setosa | Iris-versicolor | Iris-setosa | Iris-virginica | Iris-virginica | Iris-versicolor | Iris-setosa | Iris-versicolor | Iris-setosa | Iris-versicolor'
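
The response body is a single byte string, so a small helper can turn it into a list of labels for further use. The sketch below is illustrative only: `parse_predictions` is a hypothetical helper, and it assumes the `" | "`-separated format shown in the response above, which is specific to this lab's script.

```python
from collections import Counter


def parse_predictions(raw_body):
    """Split the endpoint response into a list of label strings."""
    text = raw_body.decode("utf-8") if isinstance(raw_body, bytes) else raw_body
    return [label.strip() for label in text.split("|") if label.strip()]


# A shortened sample in the same format as the response above
sample = b"Iris-versicolor | Iris-setosa | Iris-versicolor"
labels = parse_predictions(sample)
print(labels)
print(Counter(labels))
```

Note that `response['Body'].read()` consumes the stream, so the bytes should be stored in a variable before being parsed.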