# SageMaker Linear Learner Algorithm 
## Using Amazon SageMaker for Logistic Regression

by Emil Vassev

April 19, 2022
<br><br>
Elaborated version of https://www.bmc.com/blogs/aws-linear-learner/
***

<span style="color:blue">Welcome to <b>SageMaker Linear Learner Algorithm</b>, an interactive lecture that teaches you how to use Amazon SageMaker to perform logistic regression with the Linear Learner algorithm. The lecture covers both theory and practice.</span>

## Step 1: Import libraries

In [1]:
import io  # in-memory byte streams (used below as the RecordIO buffer)
import os
import numpy as np

import sagemaker
import sagemaker.amazon.common as smac  # helpers for writing RecordIO-protobuf data
from sagemaker import get_execution_role
from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import JSONDeserializer

import boto3

## Step 2: Define Data

The data below records how many hours each student studied and whether that student passed a certain test. We use it to model the likelihood of passing given the hours studied.

### Data
| Hours | 0.50 | 0.75 | 1.00 | 1.25 | 1.50 | 1.75 | 1.75 | 2.00 | 2.25 | 2.50 | 2.75 | 3.00 | 3.25 | 3.50 | 4.00 | 4.25 | 4.50 | 4.75 | 5.00 | 5.50 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Pass | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 |

### Step 2.1. Define the dataset.
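We take the hours studied and the pass-fail results and store them as a tuple of (hours, pass) pairs. Note that the tuple holds 19 records: the second 1.75-hour entry from the table (the one with Pass = 1) is not included.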

In [2]:
study=((0.5,0),(0.75,0),
       (1.0,0),(1.25,0),(1.50,0),(1.75,0),
       (2.0,0),(2.25,1),(2.5,0),(2.75,1),
       (3.0,0),(3.25,1),(3.5,0),
       (4.0,1),(4.25,1),(4.5,1),(4.75,1),
       (5.0,1),(5.5,1)
      ) 
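
As a quick sanity check before converting the data, the sketch below counts the records and the class balance. It uses only the standard library and repeats the `study` tuple so that it runs on its own; the names `n_records`, `n_pass`, and `n_fail` are illustrative and not part of the original notebook.

```python
# Same (hours, pass) pairs as the `study` tuple defined above.
study = ((0.5, 0), (0.75, 0),
         (1.0, 0), (1.25, 0), (1.50, 0), (1.75, 0),
         (2.0, 0), (2.25, 1), (2.5, 0), (2.75, 1),
         (3.0, 0), (3.25, 1), (3.5, 0),
         (4.0, 1), (4.25, 1), (4.5, 1), (4.75, 1),
         (5.0, 1), (5.5, 1))

n_records = len(study)                                  # total number of students
n_pass = sum(1 for _, passed in study if passed == 1)   # students who passed
n_fail = n_records - n_pass                             # students who failed

print(n_records, n_pass, n_fail)  # 19 9 10
```

The two classes are roughly balanced, which is convenient for such a small binary classification example.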

### Step 2.2. Prepare the dataset for the Linear Learner algorithm.
#### Step 2.2.1. Convert the *study* tuple to a numpy array. 
The array must be of type float32, which is the data type the SageMaker Linear Learner algorithm expects.

In [3]:
a = np.array(study).astype('float32')
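
To see what the float32 cast does to the values, here is a small standard-library sketch that rounds a Python float (64-bit) to single precision. The helper `to_float32` is hypothetical, not part of NumPy or SageMaker. Every hours value in this dataset is a multiple of 0.25, so the cast loses nothing here, whereas a value such as 0.1 would change slightly.

```python
import struct

def to_float32(x):
    """Round a 64-bit Python float to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", x))[0]

# Hours column of the `study` tuple.
hours = [0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 2.0, 2.25, 2.5, 2.75,
         3.0, 3.25, 3.5, 4.0, 4.25, 4.5, 4.75, 5.0, 5.5]

# True when every value survives the round trip through float32 unchanged.
lossless = all(to_float32(h) == h for h in hours)
print(lossless)                 # True
print(to_float32(0.1) == 0.1)   # False: 0.1 is not exactly representable
```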

#### Step 2.2.2. Extract *features* and *labels* from dataset.
Slice the array *a*: the first column (hours) goes into a *features* vector and the second column (pass-fail) into a *labels* vector. 
The Linear Learner algorithm expects a *features* matrix and a *labels* vector.  

In [4]:
features = a[:,0]
labels = a[:,1]

print(features)
print(labels)

[0.5  0.75 1.   1.25 1.5  1.75 2.   2.25 2.5  2.75 3.   3.25 3.5  4.
 4.25 4.5  4.75 5.   5.5 ]
[0. 0. 0. 0. 0. 0. 0. 1. 0. 1. 0. 1. 0. 1. 1. 1. 1. 1. 1.]
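
Note that `features` as printed above is a one-dimensional vector, while the algorithm expects a features matrix. The following is a minimal sketch of the two-dimensional layout (one row per student, one column per feature), assuming hours studied is the only feature. It uses plain Python lists so that it runs without NumPy, and the names `features_matrix` and `labels_vector` are illustrative.

```python
# Same (hours, pass) pairs as the `study` tuple defined above.
study = ((0.5, 0), (0.75, 0),
         (1.0, 0), (1.25, 0), (1.50, 0), (1.75, 0),
         (2.0, 0), (2.25, 1), (2.5, 0), (2.75, 1),
         (3.0, 0), (3.25, 1), (3.5, 0),
         (4.0, 1), (4.25, 1), (4.5, 1), (4.75, 1),
         (5.0, 1), (5.5, 1))

# One row per record, one column per feature: shape (19, 1).
features_matrix = [[hours] for hours, _ in study]
# One label per record: shape (19,).
labels_vector = [float(passed) for _, passed in study]

print(len(features_matrix), len(features_matrix[0]))  # 19 1
print(features_matrix[:3])                            # [[0.5], [0.75], [1.0]]

# NumPy equivalent for the array `a`: a[:, :1] (features) and a[:, 1] (labels).
```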


#### Step 2.2.3. Convert the training data to a compatible input format. 
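Convert the training data into a SageMaker-compatible input format (RecordIO-wrapped protobuf). Note that the full array *a* (both columns) is written as the feature data here, alongside the *labels* vector.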

In [5]:
#converts the data in numpy array format to RecordIO format
buf = io.BytesIO()
smac.write_numpy_to_dense_tensor(buf, a, labels)

#reset in-memory byte array to zero position
buf.seek(0)

0

## Step 3. Prepare SageMaker
### Step 3.1. IAM (Identity and Access Management)
SageMaker needs an IAM role so that it can perform tasks on our behalf (e.g., reading the training data from the S3 bucket and writing the training results and model artifacts back to S3).
<br>
Get the execution role for the notebook instance. This is the role that we created for our notebook instance. We pass the role to the Estimator, which uses it for the training job.

In [6]:
#gets the SageMaker execution role 
role = get_execution_role()

### Step 3.2. Get the SageMaker session.
The session manages interactions with the Amazon SageMaker APIs and any other AWS services needed.

In [7]:
#gets the SageMaker session 
sess = sagemaker.Session()

### Step 3.3. Prepare the S3 bucket. 
We need to create the S3 bucket beforehand from the AWS Management Console; the code below only uploads to it.

In [8]:
#sets the s3 bucket name  
bucket = "sagemakerlr-2"

#sets the location of this exercise's files in s3 bucket  
prefix = "sagemaker/grades"

#key refers to the name of the file 
key = 'linearlearner'

#uploads the training data in record-io format to S3 bucket
boto3.resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'train', key)).upload_fileobj(buf)

#training data location in s3
s3_train_data = 's3://{}/{}/train/{}'.format(bucket, prefix, key)
print('Uploaded training data location: {}'.format(s3_train_data))

#output location in S3 bucket to store the linear learner output
output_location = 's3://{}/{}/output'.format(bucket, prefix)
print('Training artifacts will be uploaded to: {}'.format(output_location))

Uploaded training data location: s3://sagemakerlr-2/sagemaker/grades/train/linearlearner
Training artifacts will be uploaded to: s3://sagemakerlr-2/sagemaker/grades/output
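
The upload cell builds the object key with `os.path.join`, which uses the separator of the local operating system. That works on a Linux notebook instance but would insert backslashes on Windows, while S3 keys always use `/`. A minimal sketch of an OS-independent alternative follows; the helper `s3_uri` is hypothetical and not part of boto3 or the SageMaker SDK.

```python
import posixpath

def s3_uri(bucket, *parts):
    """Build an s3:// URI, always joining key parts with '/'."""
    return "s3://{}/{}".format(bucket, posixpath.join(*parts))

# Same values as in the cell above.
bucket = "sagemakerlr-2"
prefix = "sagemaker/grades"
key = "linearlearner"

train_uri = s3_uri(bucket, prefix, "train", key)
output_uri = s3_uri(bucket, prefix, "output")

print(train_uri)   # s3://sagemakerlr-2/sagemaker/grades/train/linearlearner
print(output_uri)  # s3://sagemakerlr-2/sagemaker/grades/output
```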


### Step 3.4. Get a reference to a Linear Learner container.
To obtain the reference to the Linear Learner container image, specify the algorithm name, the AWS region, and the image version:

In [9]:
#retrieves the URI of the Linear Learner container image for the current region
container = sagemaker.image_uris.retrieve("linear-learner", boto3.Session().region_name, "1")

### Step 3.5. Create an instance of the Linear Learner algorithm.
We need to set up a SageMaker Estimator.<br> 
We pass the container, the IAM role, and the type of instance we want to use for training: one virtual machine of size ml.c4.xlarge.<br>
The output path and SageMaker session variables have already been defined.

In [10]:
linear = sagemaker.estimator.Estimator(container,
                                       role=role, 
                                       instance_count=1, 
                                       instance_type='ml.c4.xlarge',
                                       output_path=output_location,
                                       sagemaker_session=sess)

### Step 3.6. Set the hyper-parameters.
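<br>
<li>feature_dim: number of columns in the feature data (2 here, because the full array <i>a</i>, with both the hours and the pass-fail column, was written as features in Step 2.2.3). Since this makes the label itself an input feature, a model meant to predict from hours alone would instead use only the hours column and feature_dim=1.</li>
<li>mini_batch_size: number of observations per mini-batch (this number must be smaller than the number of records in our training set, which holds 19 records)</li>
<li>predictor_type: binary_classifier, because the label is a binary pass-fail value</li>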

In [11]:
linear.set_hyperparameters(feature_dim=2,
                           mini_batch_size=4,
                           predictor_type='binary_classifier')
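
To make the meaning of `mini_batch_size` concrete, the sketch below groups 19 record indices into mini-batches of 4 observations each. It only illustrates the idea of a mini-batch: the helper `split_into_batches` is hypothetical, and how Linear Learner itself handles the final, smaller batch is not shown here.

```python
def split_into_batches(records, batch_size):
    """Group records into consecutive chunks of at most batch_size items."""
    return [records[i:i + batch_size]
            for i in range(0, len(records), batch_size)]

n_records = 19        # number of records in the `study` tuple
mini_batch_size = 4   # observations per mini-batch, as set above

batches = split_into_batches(list(range(n_records)), mini_batch_size)

print(len(batches))                # 5
print([len(b) for b in batches])   # [4, 4, 4, 4, 3]
```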

## Step 4. Train the model.

In [12]:
linear.fit({'train': s3_train_data})

2022-04-19 06:19:16 Starting - Starting the training job...
2022-04-19 06:19:44 Starting - Preparing the instances for training
ProfilerReport-1650349156: InProgress
.........
2022-04-19 06:21:14 Downloading - Downloading input data......
2022-04-19 06:22:06 Training - Training image download completed. Training in progress.
Docker entrypoint called with argument(s): train
Running default environment configuration script
[04/19/2022 06:22:12 INFO 140345329940288 integration.py:636] worker started
[04/19/2022 06:22:12 INFO 140345329940288] Reading default configuration from /opt/amazon/lib/python3.7/site-packages/algorithm/resources/default-input.json: {'mini_batch_size': '1000', 'epochs': '15', 'feature_dim': 'auto', 'use_bias': 'true', 'binary_classifier_model_selection_criteria': 'accuracy', 'f_beta': '1.0', 'target_recall': '0.8', 'target_precision': '0.8', 'num_models': 'auto', 'num_calibration_samples': '10000000', 'init_method': 'uniform', 'init_scal

## Step 5. Create the Linear Predictor.
### Step 5.1. Deploy the trained model to an endpoint.

In [13]:
linear_predictor = linear.deploy(initial_instance_count=1,
                                 instance_type='ml.m4.xlarge')

------!

### Step 5.2. Set a serializer and deserializer to the Linear Predictor.
To run inference against the endpoint, we set a serializer (which encodes each request as CSV) and a deserializer (which decodes the JSON response) on our linear predictor.

In [14]:
linear_predictor.serializer = CSVSerializer()
linear_predictor.deserializer = JSONDeserializer()
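
The sketch below approximates, with the standard library only, what travels over the wire: one record encoded as a CSV line for the request, and a JSON body decoded into a Python dict for the response. The helper `to_csv_row` is hypothetical and only mimics the idea behind `CSVSerializer`; the response body is the result printed in Step 6, re-encoded as JSON for illustration.

```python
import csv
import io
import json

def to_csv_row(values):
    """Encode one record as a single CSV line without a trailing newline."""
    buf = io.StringIO()
    csv.writer(buf).writerow(values)
    return buf.getvalue().rstrip("\r\n")

# Request side: the record a[0] from this notebook, as plain Python floats.
payload = to_csv_row([0.5, 0.0])
print(payload)  # 0.5,0.0

# Response side: decode a JSON body into a dict, as a JSON deserializer would.
response_body = '{"predictions": [{"score": 0.0018372394843026996, "predicted_label": 0}]}'
decoded = json.loads(response_body)
print(decoded["predictions"][0]["predicted_label"])  # 0
```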

## Step 6. Predict
We send one record, a[0], to the Linear Predictor. The record holds 0.5 hours of study (together with the pass-fail value 0), so we expect this student to fail.

In [15]:
a[0]

array([0.5, 0. ], dtype=float32)

In [16]:
result = linear_predictor.predict(a[0])
print(result)

{'predictions': [{'score': 0.0018372394843026996, 'predicted_label': 0}]}
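
The `score` in the result can be read as the estimated probability of passing, and `predicted_label` as the class chosen from that score. A minimal sketch of that relationship follows, assuming a decision threshold of 0.5; Linear Learner selects its own threshold during training, so the value 0.5 and the helper `label_from_score` are illustrative assumptions only.

```python
# Result dict as printed by the cell above.
result = {'predictions': [{'score': 0.0018372394843026996, 'predicted_label': 0}]}

def label_from_score(score, threshold=0.5):
    """Map a probability-like score to a 0/1 label (threshold is an assumption)."""
    return 1 if score > threshold else 0

prediction = result['predictions'][0]
label = label_from_score(prediction['score'])

print(label)                                   # 0
print(label == prediction['predicted_label'])  # True
```

A score of about 0.002 is far below any reasonable threshold, so the model predicts that this student fails, as expected.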
