## Regression with Amazon SageMaker Linear Learner algorithm
_**Single machine training for regression with Amazon SageMaker Linear Learner algorithm**_

---

---
## Contents
1. [Introduction](#Introduction)
2. [Setup](#Setup)
   1. [Exploring the dataset](#Exploring-the-dataset)
3. [Training the Linear Learner model](#Training-the-Linear-Learner-model)
4. [Set up hosting for the model](#Set-up-hosting-for-the-model)
5. [Inference](#Inference)
6. [Delete the Endpoint](#Delete-the-Endpoint)
7. [Appendix](#Appendix)
   1. [Downloading the dataset](#Downloading-the-dataset)
   2. [libsvm to csv conversion](#libsvm-to-csv-convertion)
   3. [Dividing the data](#Dividing-the-data)
   4. [Data Ingestion](#Data-ingestion)
---
## Introduction

This notebook demonstrates the use of Amazon SageMaker's implementation of the Linear Learner algorithm to train and host a regression model. We use the [Abalone data](https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html) originally from the [UCI data repository](https://archive.ics.uci.edu/ml/datasets/abalone).

The dataset contains 9 fields. The first, Rings, indicates the age of the abalone (age equals the number of rings plus 1.5). The rings are usually counted under a microscope to estimate the abalone's age, so we will use our algorithm to predict the age from the other features instead. The fields appear in the dataset in the following order:

'Rings','sex','Length','Diameter','Height','Whole Weight','Shucked Weight','Viscera Weight' and 'Shell Weight'

The features from sex to Shell Weight are physical measurements that can be taken with the right tools, so predicting age from them avoids the complexity of examining the abalone under a microscope to determine its age.
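
The rings-to-age relationship stated above can be written as a one-line conversion. This is a minimal sketch; the function name `rings_to_age` is hypothetical and is not part of the dataset or the SageMaker SDK.

```python
def rings_to_age(rings: float) -> float:
    """Estimate abalone age in years from the ring count (age = rings + 1.5)."""
    return rings + 1.5


print(rings_to_age(10))  # 11.5
```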


---
## Setup


This notebook was tested in Amazon SageMaker Studio on a ml.t3.medium instance with Python 3 (Data Science) kernel.

Let's start by specifying:
1. The S3 buckets and prefixes that you want to use for training data and model data. These should be in the same region as the Notebook Instance, training, and hosting.
1. The IAM role ARN used to give training and hosting access to your data. See the documentation for how to create these. Note that if more than one role is required for notebook instances, training, and/or hosting, replace the `sagemaker.get_execution_role()` call with the appropriate full IAM role ARN string(s).

In [36]:
%matplotlib inline

import os
from pathlib import Path
import numpy as np
import datetime

import pandas as pd
pd.set_option("display.max_rows",10)

# IPython

from IPython.display import display, Markdown
from IPython.display import Image

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# http://stackoverflow.com/questions/21971449/how-do-i-increase-the-cell-width-of-the-jupyter-ipython-notebook-in-my-browser
from IPython.display import HTML  # display is already imported above
display(HTML("<style>.container { width:80% !important; }</style>"))


# Autoload Python Code
%load_ext autoreload
%autoreload 2

In [27]:
import os
import boto3
import re
import sagemaker


role = sagemaker.get_execution_role()
region = boto3.Session().region_name

# S3 bucket for training data.
# Feel free to specify a different bucket and prefix.
bucket = "bkraft-phdata"
data_bucket = bucket
project = 'mlops-demo'

data_prefix = "data/sim_1"


# S3 bucket for saving code and model artifacts.
# Feel free to specify a different bucket and prefix
output_bucket = bucket
output_prefix = "mlops-demo/sagemaker/simple-regression/sim_1"

# Data

In [None]:
s3://bkraft-phdata/mlops-demo/data/sim_1/batch_06.csv

In [69]:
import boto3

s3 = boto3.client("s3")

FILE_TRAIN = f"{data_prefix}/train.csv"
FILE_TEST = f"{data_prefix}/test.csv"
FILE_VALIDATION = f"{data_prefix}/batch_01.csv"

In [14]:
# downloading the train, test, and validation files from data_bucket
#s3.download_file(data_bucket, f"{project}/{FILE_TRAIN}", FILE_TRAIN)
#s3.download_file(data_bucket, f"{project}/{FILE_TEST}", FILE_TEST)
#s3.download_file(data_bucket, f"{project}/{FILE_VALIDATION}", FILE_VALIDATION)

In [74]:
output_prefix
FILE_TRAIN

'mlops-demo/sagemaker/simple-regression/sim_1'

'data/sim_1/train.csv'

In [73]:
s3.upload_file('../'+FILE_TRAIN, output_bucket, f"{output_prefix}/data/train.csv")
s3.upload_file('../'+FILE_TEST, output_bucket, f"{output_prefix}/data/test.csv")
s3.upload_file('../'+FILE_VALIDATION, output_bucket, f"{output_prefix}/data/batch_01.csv")

In [54]:
import pandas as pd  # Read in csv and store in a pandas dataframe

df = pd.read_csv(
    '../'+FILE_TRAIN,
    names=["y", "x1"]
    )
print(df.head(5))

          y        x1
0 -0.283293  4.301174
1  1.291331  7.310761
2 -1.231500  2.457030
3  1.082625  6.889909
4 -0.494881  3.945141



---
Let us prepare the handshake between our data channels and the algorithm. To do this, we create `sagemaker.inputs.TrainingInput` objects from our [data channels](https://sagemaker.readthedocs.io/en/v1.2.4/session.html#). These objects are then put in a simple dictionary, which the algorithm consumes. Notice that we use a `content_type` of `text/csv` for the pre-processed files in the bucket. We use two channels here, one for training and one for validation. The test samples from above will be used in the prediction step.
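
For `text/csv` training input, the SageMaker built-in algorithms expect files with no header row and the target value in the first column, which matches the `y, x1` layout previewed above. The following is a minimal sketch of writing rows in that layout with the standard library; the helper name `to_training_csv` is hypothetical, and the sample values are taken from the preview.

```python
import csv
import io


def to_training_csv(rows):
    """Serialize (label, feature, ...) rows as header-less CSV, label first."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        writer.writerow(row)
    return buffer.getvalue()


rows = [(-0.283293, 4.301174), (1.291331, 7.310761)]
print(to_training_csv(rows))
```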

In [37]:
output_bucket
output_prefix

'bkraft-phdata'

'mlops-demo/sagemaker/simple-regression/sim_1'

In [None]:
s3://bkraft-phdata/mlops-demo/data/sim_1/train.csv

In [75]:
# creating the inputs for the fit() function with the training and validation location
s3_train_data = f"s3://{output_bucket}/{output_prefix}/data/train.csv"
print(f"training files will be taken from: {s3_train_data}")

s3_validation_data = f"s3://{output_bucket}/{output_prefix}/data/batch_01.csv"
print(f"validation files will be taken from: {s3_validation_data}")

output_location = f"s3://{output_bucket}/{output_prefix}/output"
print(f"training artifacts output location: {output_location}")

training files will be taken from: s3://bkraft-phdata/mlops-demo/sagemaker/simple-regression/sim_1/data/train.csv
validation files will be taken from: s3://bkraft-phdata/mlops-demo/sagemaker/simple-regression/sim_1/data/batch_01.csv
training artifacts output location: s3://bkraft-phdata/mlops-demo/sagemaker/simple-regression/sim_1/output


In [76]:
# building the TrainingInput objects that fit() accepts, one per channel
train_data = sagemaker.inputs.TrainingInput(
    s3_train_data,
    distribution="FullyReplicated",
    content_type="text/csv",
    s3_data_type="S3Prefix",
    record_wrapping=None,
    compression=None,
)

validation_data = sagemaker.inputs.TrainingInput(
    s3_validation_data,
    distribution="FullyReplicated",
    content_type="text/csv",
    s3_data_type="S3Prefix",
    record_wrapping=None,
    compression=None,
)

## Training the Linear Learner model

First, we retrieve the image for the Linear Learner Algorithm according to the region.

Then we create an [estimator from the SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html) using the Linear Learner container image, and we set up the training parameters and hyperparameter configuration.


In [84]:
# getting the linear learner image according to the region
from sagemaker.image_uris import retrieve

container = retrieve("linear-learner", boto3.Session().region_name, version="1")
print(container)

382416733822.dkr.ecr.us-east-1.amazonaws.com/linear-learner:1


In [85]:
%%time
import boto3
import sagemaker
from time import gmtime, strftime

sess = sagemaker.Session()

In [85]:
job_name = f"{project}-{strftime('%Y%m%d-%H-%M-%S', gmtime())}"
print("Training job", job_name)

linear = sagemaker.estimator.Estimator(
    container,
    role,
    input_mode="File",
    instance_count=1,
    instance_type="ml.m4.xlarge",
    output_path=output_location,
    sagemaker_session=sess,
)

linear.set_hyperparameters(
    feature_dim=1,
    epochs=16,
    wd=0.01,
    loss="absolute_loss",
    predictor_type="regressor",
    normalize_data=True,
    optimizer="adam",
    mini_batch_size=100,
    lr_scheduler_step=100,
    lr_scheduler_factor=0.99,
    lr_scheduler_minimum_lr=0.0001,
    learning_rate=0.1,
)

Training job mlops-demo-20220223-20-17-24
CPU times: user 13.2 ms, sys: 0 ns, total: 13.2 ms
Wall time: 18.8 ms
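
The `learning_rate`, `lr_scheduler_step`, `lr_scheduler_factor`, and `lr_scheduler_minimum_lr` hyperparameters set above together describe a step-wise learning-rate decay. The sketch below is only an assumption about how such a schedule behaves (the rate is multiplied by the factor once every `lr_scheduler_step` steps and never drops below the minimum); it is not the algorithm's actual implementation, and `scheduled_lr` is a hypothetical helper.

```python
def scheduled_lr(step, base_lr=0.1, decay_step=100, factor=0.99, minimum_lr=0.0001):
    """Learning rate after `step` updates under a simple step-decay schedule."""
    decays = step // decay_step  # number of decay events so far
    return max(base_lr * factor ** decays, minimum_lr)


for step in (0, 100, 1000, 100000):
    print(step, scheduled_lr(step))
```

With the values used in this notebook, the rate starts at 0.1 and, under this assumed schedule, eventually settles at the 0.0001 floor.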


---
After configuring the Estimator object and setting its hyperparameters, the only remaining thing to do is to train the algorithm, which the following cell does. Training involves a few steps. First, the instances that we requested while creating the Estimator are provisioned and set up with the appropriate libraries. Then, the data from our channels is downloaded onto the instances. Once this is done, the training job begins. The provisioning and data downloading take time, depending on the size of the data, so it might be a few minutes before we start getting logs for our training job. For every epoch (one full pass over the dataset), the logs also print the loss on the validation data, among other metrics. This loss is a proxy for the quality of the model.

Once the job has finished, a "Job complete" message will be printed. The trained model can be found in the S3 bucket that was set up as `output_path` in the estimator. For this example, training takes between 4 and 6 minutes.
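
Because the estimator is configured with `loss="absolute_loss"`, the quantity being minimized is the mean absolute difference between predictions and targets. Below is a minimal sketch of that metric next to the mean squared error, using made-up numbers; the function names are illustrative and are not part of the SageMaker SDK.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of |prediction - target| over all samples."""
    return sum(abs(p - t) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred):
    """Average of (prediction - target)^2 over all samples."""
    return sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [1.0, 2.0, 4.0]  # hypothetical targets
y_pred = [1.5, 2.0, 3.0]  # hypothetical predictions
print(mean_absolute_error(y_true, y_pred))  # 0.5
print(mean_squared_error(y_true, y_pred))
```

The absolute loss penalizes large errors less heavily than the squared loss, which makes it less sensitive to outliers.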


In [86]:
%%time
linear.fit(inputs={"train": train_data, "validation": validation_data}, job_name=job_name)

2022-02-23 20:17:28 Starting - Starting the training job...
2022-02-23 20:17:55 Starting - Launching requested ML instancesProfilerReport-1645647448: InProgress
.........
2022-02-23 20:19:16 Starting - Preparing the instances for training.........
2022-02-23 20:20:56 Downloading - Downloading input data.........
2022-02-23 20:22:17 Training - Downloading the training image.
Docker entrypoint called with argument(s): train
Running default environment configuration script
[02/23/2022 20:22:41 INFO 140574385583936] Reading default configuration from /opt/amazon/lib/python3.7/site-packages/algorithm/resources/default-input.json: {'mini_batch_size': '1000', 'epochs': '15', 'feature_dim': 'auto', 'use_bias': 'true', 'binary_classifier_model_selection_criteria': 'accuracy', 'f_beta': '1.0', 'target_recall': '0.8', 'target_precision': '0.8', 'num_models': 'auto', 'num_calibration_samples': '10000000', 'init_method': 'uniform', 'init_scale': '0.07', 'init_sigma': '0.01', '

## Set up hosting for the model (Endpoint)

Once the training is done, we can deploy the trained model as an Amazon SageMaker real-time hosted endpoint. This will allow us to make predictions (or inference) from the model. Note that we don't have to host on the same instance (or type of instance) that we used to train. Training is a prolonged and compute-heavy job with compute and memory requirements that hosting typically does not have. We can choose any type of instance we want to host the model. In our case we chose the ml.m4.xlarge instance to train, but we choose to host the model on the less expensive CPU instance, ml.c4.xlarge. The endpoint deployment can be accomplished as follows:


In [None]:
%%time
# creating the endpoint out of the trained model
linear_predictor = linear.deploy(initial_instance_count=1, instance_type="ml.c4.xlarge")
print(f"\ncreated endpoint: {linear_predictor.endpoint_name}")

## Real-Time Inference

Now that the trained model is deployed at an endpoint that is up and running, we can use this endpoint for inference. To do this, we are going to configure the [predictor object](https://sagemaker.readthedocs.io/en/v1.2.4/predictors.html) to serialize requests as text/csv and to deserialize the JSON reply received from the endpoint.


In [None]:
# configure the predictor to serialize csv input and parse the response as json
from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import JSONDeserializer

linear_predictor.serializer = CSVSerializer()
linear_predictor.deserializer = JSONDeserializer()
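
With these settings, the request body is a comma-separated string of feature values and the reply is parsed into a Python dictionary; the inference cell later in this notebook reads the value from `result["predictions"][0]["score"]`. Here is a minimal sketch of both sides without calling the endpoint. The response literal is a made-up example of that shape, and `build_payload` and `extract_scores` are hypothetical helpers.

```python
import json


def build_payload(features):
    """Join feature values into the CSV string sent to the endpoint."""
    return ",".join(str(value) for value in features)


def extract_scores(response_body):
    """Pull the predicted values out of a JSON response body."""
    parsed = json.loads(response_body)
    return [item["score"] for item in parsed["predictions"]]


payload = build_payload([4.301174])
fake_response = '{"predictions": [{"score": -0.25}]}'  # illustrative only
print(payload, extract_scores(fake_response))
```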

In [92]:
sm.model.Model?

Init signature:
sm.model.Model(
    image_uri,
    model_data=None,
    role=None,
    predictor_cls=None,
    env=None,
    name=None,
    vpc_config=None,
    sagemaker_session=None,
    enable_network_isolation=False,
    model_kms_key=None,
    image_config=None,

## Batch Inference

In [127]:
#
# https://raw.githubusercontent.com/aws-samples/sagemaker-ml-workflow-with-apache-airflow/master/src/dag_ml_pipeline_amazon_video_reviews.py
#

import sagemaker as sm

import os

PROJECT = 'mlops-demo'
TAG = 'latest'

AWS_ACCOUNT_ID = 545053092614
IMAGE_URI = '382416733822.dkr.ecr.us-east-1.amazonaws.com/linear-learner:1'
REGION = 'us-east-1'

BUCKET ='bkraft-phdata' 
PROJECT='mlops-demo'
OUTPUT_PREFIX = f's3://{BUCKET}/{PROJECT}/sagemaker/simple-regression/sim_1/output'
PREDICT_PREFIX = f's3://{BUCKET}/{PROJECT}/sagemaker/simple-regression/sim_1/predict'

CONFIG = {
        'project': PROJECT,
        'bucket': BUCKET, 
        'region': REGION, 
        'image_uri': IMAGE_URI, 
        'model_data': f'{OUTPUT_PREFIX}/mlops-demo-20220223-20-17-24/output/model.tar.gz',
        'input_path': f'{PREDICT_PREFIX}/input', 
        'output_path': f'{PREDICT_PREFIX}/output', 
        'execution_role': 'arn:aws:iam::545053092614:role/service-role/AmazonSageMaker-ExecutionRole-20191104T123215',
    }


def batch_sim1(in_config=None, input_filter=None):

    if in_config is None:
        in_config = CONFIG
   
    if input_filter is None:
        input_filter="$[1:]"

    # Retrieve Model    
    reg_model = sm.model.Model(image_uri=in_config['image_uri'], 
                               model_data=in_config['model_data'],
                               role=in_config['execution_role']
                               )

    # Build Transformer
    transformer = reg_model.transformer(
        instance_count=1,
        instance_type="ml.m4.xlarge",
        output_path=in_config['output_path'],
        assemble_with="Line",
        accept="text/csv",
       )

    # Predict With New Data
    transformer.transform(
        in_config['input_path'],
        content_type="text/csv",
        split_type="Line",
        input_filter=input_filter
    )
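
The `input_filter="$[1:]"` argument is a JSONPath expression that batch transform applies to each record before sending it to the model. For CSV input, `$[1:]` keeps every column except the first, which here drops the label `y`. Below is a rough local emulation of that effect on one CSV line; `apply_csv_filter` is a hypothetical helper, not an SDK function.

```python
def apply_csv_filter(line, start=1):
    """Mimic the "$[start:]" filter on a CSV record by dropping leading columns."""
    fields = line.strip().split(",")
    return ",".join(fields[start:])


print(apply_csv_filter("-0.283293,4.301174"))  # 4.301174
```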


In [122]:
in_config = CONFIG
in_config
input_filter = "$[1:]"
#in_config['output_path']
#in_config['execution_role']
's3://bkraft-phdata/mlops-demo/sagemaker/simple-regression/sim_1/output/mlops-demo-20220223-20-17-24/output/model.tar.gz'

{'project': 'mlops-demo',
 'bucket': 'bkraft-phdata',
 'region': 'us-east-1',
 'image_uri': '382416733822.dkr.ecr.us-east-1.amazonaws.com/linear-learner:1',
 'model_data': 's3://bkraft-phdata/mlops-demo/sagemaker/simple-regression/sim_1/output/mlops-demo-20220223-20-17-24/output/model.tar.gz',
 'input_path': 's3://bkraft-phdata/mlops-demo/sagemaker/simple-regression/sim_1/predict/input',
 'output_path': 's3://bkraft-phdata/mlops-demo/sagemaker/simple-regression/sim_1/predict/output',
 'execution_role': 'arn:aws:iam::545053092614:role/service-role/AmazonSageMaker-ExecutionRole-20191104T123215'}

's3://bkraft-phdata/mlops-demo/sagemaker/simple-regression/sim_1/output/mlops-demo-20220223-20-17-24/output/model.tar.gz'
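
The `model_data` value above follows the layout `<output_path>/<job_name>/output/model.tar.gz`, where `output_path` is the estimator's output location and `job_name` is the name of the training job. This small sketch assembles the URI from those two pieces instead of hard-coding it; `model_artifact_uri` is a hypothetical helper.

```python
def model_artifact_uri(output_path, job_name):
    """Build the S3 URI of the model archive written by a training job."""
    return f"{output_path.rstrip('/')}/{job_name}/output/model.tar.gz"


uri = model_artifact_uri(
    "s3://bkraft-phdata/mlops-demo/sagemaker/simple-regression/sim_1/output",
    "mlops-demo-20220223-20-17-24",
)
print(uri)
```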

In [103]:
!aws s3 ls s3://bkraft-phdata/mlops-demo/sagemaker/simple-regression/sim_1/predict/output

                           PRE output/


In [120]:
reg_model = sm.model.Model(image_uri=in_config['image_uri'], 
                               model_data=in_config['model_data'],
                               role=in_config['execution_role']
                               )

In [121]:
transformer = reg_model.transformer(
        instance_count=1,
        instance_type="ml.m4.xlarge",
        output_path=in_config['output_path'],
        assemble_with="Line",
        accept="text/csv",
       )


In [124]:
transformer.transform(
        in_config['input_path'],
        content_type="text/csv",
        split_type="Line",
        input_filter=input_filter
)


...................................
Docker entrypoint called with argument(s): serve
Running default environment configuration script
[02/23/2022 21:36:49 INFO 140548468557632] loaded entry point class algorithm.serve.server_config:config_api
[02/23/2022 21:36:49 INFO 140548468557632] loading entry points
[02/23/2022 21:36:49 INFO 140548468557632] loaded request iterator application/json
[02/23/2022 21:36:49 INFO 140548468557632] loaded request iterator application/jsonlines
[02/23/2022 21:36:49 INFO 140548468557632] loaded request iterator application/x-recordio-protobuf
[02/23/2022 21:36:49 INFO 140548468557632] loaded request iterator text/csv
[02/23/2022 21:36:49 INFO 140548468557632] loaded response encoder application/json
[02/23/2022 21:36:49 INFO 140548468557632] loaded entry poin

In [None]:
from sim1_batch import batch_transform
batch_transform()

..............

---
We then use the test file, which contains the records we held back, to test the model's predictions. Each time the cell below runs, it selects a random record from the test samples and performs inference on it.

In [None]:
%%time
import random

# getting a random testing sample from our test file
# (read from the parent directory, like the upload and read_csv cells above)
with open('../' + FILE_TEST, "r") as f:
    test_data = [line.strip() for line in f if line.strip()]
sample = random.choice(test_data).split(",")
actual_age = float(sample[0])  # first column holds the target value
payload = ",".join(sample[1:])  # removing the target from the sample

# Invoke the predictor and analyze the result
result = linear_predictor.predict(payload)

# extracting the prediction value
result = round(float(result["predictions"][0]["score"]), 2)

# relative accuracy in percent; abs() keeps it meaningful for negative targets
# (still undefined when the actual value is exactly 0)
accuracy = round(100 - (abs(result - actual_age) / abs(actual_age)) * 100, 2)
print(f"Actual age: {actual_age}\nPrediction: {result}\nAccuracy: {accuracy}")
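
The accuracy figure printed by the cell above is 100 minus the relative error in percent. Here is a minimal sketch of that calculation as a reusable function that also guards against a zero or negative actual value; `percent_accuracy` is a hypothetical helper and the inputs are made up.

```python
def percent_accuracy(prediction, actual):
    """Return 100 minus the relative error in percent, or None if actual is 0."""
    if actual == 0:
        return None  # relative error is undefined for a zero target
    relative_error = abs(prediction - actual) / abs(actual)
    return round(100 - relative_error * 100, 2)


print(percent_accuracy(9.0, 10.0))  # 90.0
print(percent_accuracy(-0.9, -1.0))
print(percent_accuracy(1.0, 0.0))   # None
```

Note that this figure can become strongly negative when the target is close to zero, so an absolute error may be easier to interpret for such records.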

## Delete the Endpoint
Having an endpoint running will incur some costs. Therefore as a clean-up job, we should delete the endpoint.

In [None]:
sagemaker.Session().delete_endpoint(linear_predictor.endpoint_name)
print(f"deleted {linear_predictor.endpoint_name} successfully!")