## Example of AWS training and model endpoint deployment

Main points of this example:
- Using the models available in AWS SageMaker for training
- Deploying the trained model
- Running inference against the deployed model

Dataset: This example uses the `Toys_and_Games_5.json` file from the E-commerce's review collection, which offers many other useful datasets: [Link](https://www.kaggle.com/code/u601372/e-commerce-s-review/data?select=Toys_and_Games_5.json)


## Loading the libraries

In [1]:
import os
import boto3
import sagemaker
import json
import zipfile
import pandas as pd
import numpy as np
import logging
from botocore.exceptions import ClientError
from sagemaker import image_uris
from sagemaker.serializers import JSONSerializer

## Data preparations

When using SageMaker, the training data cannot be read from local storage; it has to be stored in an S3 bucket. We can unzip the data and store the unzipped version on S3, or we can write a small function for unzipping that may be useful again later.

In [2]:
def unzip_data(input_data_path):
    with zipfile.ZipFile(input_data_path, 'r') as input_data_zip:
        input_data_zip.extractall('.')

An example entry from the dataset, in JSON format, is shown below:

In [3]:
#"root":{9 items
    #"reviewerID":string"A1VXOAVRGKGEAK"
    #"asin":string"0439893577"
    #"reviewerName":string"Angie"
    #"helpful":[2 items
            #0:int0
            #1:int0
    #]
    #"reviewText":string"I like the item pricing. My granddaughter wanted to mark on it but I wanted it just for the letters."
    #"overall":int5
    #"summary":string"Magnetic board"
    #"unixReviewTime":int1390953600
    #"reviewTime":string"01 29, 2014"
#}

The training data consists of review texts together with their counts of helpful votes (upvotes) and total votes. We want a model that predicts whether a review text is helpful or not.

From the training dataset we can create a label for each text. Reviews without any votes are skipped; otherwise a review is labeled helpful if at least 50% of its total votes mark it as helpful.

The function below assigns labels to the data and returns the labeled data.

In [4]:
def label_data(input_data):
    labeled_data = []
    HELPFUL_LABEL = "__label__1"
    UNHELPFUL_LABEL = "__label__2"

    # Each line of the input file is one JSON-encoded review
    with open(input_data, 'r') as f:
        for line in f:
            review = json.loads(line)
            helpful_votes, total_votes = review['helpful']
            review_text = review['reviewText']
            if total_votes == 0:
                continue  # reviews without votes cannot be labeled
            if helpful_votes / total_votes >= .5:
                labeled_data.append(" ".join([HELPFUL_LABEL, review_text]))
            else:
                labeled_data.append(" ".join([UNHELPFUL_LABEL, review_text]))

    return labeled_data
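
To make the labeling rule concrete, here is a minimal sketch that applies it to three made-up records. The helper `label_record` and the sample reviews are hypothetical and only illustrate the 50% rule described above; they are not part of the dataset.

```python
import json

HELPFUL_LABEL = "__label__1"
UNHELPFUL_LABEL = "__label__2"


def label_record(record):
    """Return the label for one parsed review, or None if it has no votes."""
    helpful_votes, total_votes = record["helpful"]
    if total_votes == 0:
        return None  # no votes, so the review is skipped
    ratio = helpful_votes / total_votes
    return HELPFUL_LABEL if ratio >= 0.5 else UNHELPFUL_LABEL


# Hypothetical records in the same shape as the dataset entries
sample_lines = [
    '{"helpful": [3, 4], "reviewText": "Sturdy and fun."}',
    '{"helpful": [1, 5], "reviewText": "Broke in a day."}',
    '{"helpful": [0, 0], "reviewText": "No votes yet."}',
]

labels = [label_record(json.loads(line)) for line in sample_lines]
print(labels)
```

The third record has no votes, so it gets no label and would be left out of the training data.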




Each review can be long and consist of several sentences. We can create more training data by splitting each review into sentences and assigning the review's label to every one of its sentences. This may not be optimal, but we can try it and revise later if needed; many language models can also take long sequences directly.

In [5]:
def split_sentences(labeled_data):
    sentence_data = []  # named differently from the function to avoid shadowing it
    n_positive = 0
    n_negative = 0
    for d in labeled_data:
        label = d.split()[0]
        # First separate the label from the text, then split the text into sentences
        sentences = " ".join(d.split()[1:]).split(".")
        for s in sentences:
            if s:  # Skip empty sentences, which are common with "..."
                sentence_data.append(" ".join([label, s]))
                if label == "__label__1":
                    n_positive += 1
                else:
                    n_negative += 1

    print("Number of positive samples:{}".format(n_positive))
    print("Number of negative samples:{}".format(n_negative))
    return sentence_data
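
Splitting on "." alone leaves sentences that end in "!" or "?" joined together. As a possible refinement, and not what this notebook uses, the sketch below splits on any run of ".", "!" or "?" with a regular expression. The function name `split_labeled_review` and the example review are made up for illustration.

```python
import re


def split_labeled_review(labeled_review):
    """Split one '<label> <text>' string into labeled sentences."""
    label, _, text = labeled_review.partition(" ")
    # Split on one or more sentence-ending characters
    parts = re.split(r"[.!?]+", text)
    # Drop empty fragments and surrounding whitespace
    return [f"{label} {part.strip()}" for part in parts if part.strip()]


example = "__label__1 Great toy! Would buy again... Is it durable? Yes."
result = split_labeled_review(example)
for sentence in result:
    print(sentence)
```

This also strips the leading space that the simple split leaves at the start of each sentence.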

Using the above functions we can process the raw data, create our training data, and look at some of the sentences.

In [6]:
unzip_data('Toys_and_Games_5.json.zip')  # extracts into the current directory and returns nothing
labeled_data = label_data('Toys_and_Games_5.json')
split_sentence_data = split_sentences(labeled_data)

print(split_sentence_data[0:9])

Number of positive samples:450187
Number of negative samples:72205
['__label__1 Love the magnet easel', '__label__1  great for moving to different areas', '__label__1  Wish it had some sort of non skid pad on bottom though', '__label__1 Both sides are magnetic', "__label__1  A real plus when you're entertaining more than one child", '__label__1  The four-year old can find the letters for the words, while the two-year old can find the pictures the words spell', '__label__1  (I bought letters and magnetic pictures to go with this board)', '__label__1  Both grandkids liked it a lot, which means I like it a lot as well', '__label__1  Have not even introduced markers, as this will be used strictly as a magnetic board']


We can see that the data is unbalanced: only about 14% of the samples (72,205 of 522,392) are negative. Even a model that gives the positive label to every input would reach about 86% accuracy on the training data. Although the goal of this notebook is to demonstrate AWS SageMaker, in practice we need to be careful about this when setting up the model.
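
The imbalance can be checked directly from the two counts printed above. This is a small arithmetic sketch; the variable names are chosen here for illustration.

```python
# Counts taken from the output of split_sentences above
n_positive = 450187
n_negative = 72205

total = n_positive + n_negative
negative_share = n_negative / total

# Accuracy of a trivial model that always predicts the majority class
majority_baseline_accuracy = max(n_positive, n_negative) / total

print(f"Total samples: {total}")
print(f"Negative share: {negative_share:.1%}")
print(f"Majority-class baseline accuracy: {majority_baseline_accuracy:.1%}")
```

Any accuracy figure reported for the trained model should be read against this baseline rather than against 50%.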

## Saving the data

We loaded the data from an S3 address, processed it, and labeled it. To use the data in SageMaker, we need to store the processed data back in an S3 location.

Each SageMaker session has a default (or assigned) S3 bucket, which can be found as follows:

In [7]:
sess = sagemaker.Session()
BUCKET = sess.default_bucket()  


We can define a prefix (an S3 subdirectory) for the location of the data we want to store.

We also set a pivot point to split the data: the first 90% of the data is assigned to the training set and the remaining 10% to the validation set. In practice we need to check the dataset to make sure that the labels are randomly spread across the raw data.

In [8]:
def cycle_data(fp, data):
    for d in data:
        fp.write(d + "\n")

def write_trainfile(split_sentence_data):
    train_path = "hello_blaze_train"
    with open(train_path, 'w') as f:
        cycle_data(f, split_sentence_data)
    return train_path

def write_validationfile(split_sentence_data):
    validation_path = "hello_blaze_validation"
    with open(validation_path, 'w') as f:
        cycle_data(f, split_sentence_data)
    return validation_path 

def upload_file_to_s3(file_name, s3_prefix):
    # S3 keys always use forward slashes, regardless of the local OS
    object_name = "/".join([s3_prefix, file_name])
    s3_client = boto3.client('s3')
    try:
        s3_client.upload_file(file_name, BUCKET, object_name)
    except ClientError as e:
        logging.error(e)
        return False
    return True

s3_prefix = "Aug2022"

split_data_trainlen = int(len(split_sentence_data) * .9)
split_data_validationlen = int(len(split_sentence_data) * .1)


train_path = write_trainfile(split_sentence_data[:split_data_trainlen])
print("Training file written!")
validation_path = write_validationfile(split_sentence_data[split_data_trainlen:])
print("Validation file written!")

upload_file_to_s3(train_path, s3_prefix)
print("Train file uploaded!")
upload_file_to_s3(validation_path, s3_prefix)
print("Validation file uploaded!")

print(" ".join([train_path, validation_path]))

Training file written!
Validation file written!
Train file uploaded!
Validation file uploaded!
hello_blaze_train hello_blaze_validation
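
The pivot split above keeps the original order of the data, so it relies on the labels already being well mixed. A minimal sketch of shuffling before splitting follows, assuming a fixed seed is acceptable for reproducibility. The function `shuffled_split` and its parameters are hypothetical and are not used elsewhere in this notebook.

```python
import random


def shuffled_split(samples, train_fraction=0.9, seed=42):
    """Shuffle a copy of the samples, then split at the pivot point."""
    shuffled = list(samples)  # copy, so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    pivot = int(len(shuffled) * train_fraction)
    return shuffled[:pivot], shuffled[pivot:]


# Toy stand-in for split_sentence_data
toy_data = [f"__label__1 sentence {i}" for i in range(100)]
train_part, validation_part = shuffled_split(toy_data)
print(len(train_part), len(validation_part))
```

Slicing at the pivot for both parts guarantees that every sample ends up in exactly one of the two sets.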


## Setting up the model

Here is where it finally gets interesting.

AWS has a collection of built-in models that can be used. These are packaged as Docker images. To find the container URI, we use the `retrieve` function of the [image_uris](https://sagemaker.readthedocs.io/en/stable/api/utility/image_uris.html) module, which returns the URI for the framework argument passed to it. Other parameters select the URI by region, version, and so on.

The available frameworks can be found at: [Frameworks](https://docs.aws.amazon.com/sagemaker/latest/dg/ecr-us-east-2.html). A lot of possibilities!

We are going to use [BlazingText](https://docs.aws.amazon.com/sagemaker/latest/dg/blazingtext.html).

Believe it or not, you're already almost done! Part of the appeal of SageMaker is that AWS has already done the heavy implementation lifting for you. One option is to launch a "BlazingText" training job from the SageMaker console. You can do so by searching for "SageMaker" and navigating to Training Jobs on the left-hand side. After selecting "Create Training Job", perform the following steps (the cells below launch a similar job programmatically with the SageMaker Python SDK instead):
* Select "BlazingText" from the available algorithms.
* Specify the "File" input mode for training.
* Under "resource configuration", select the "ml.m5.large" instance type. Specify 5 GB of additional storage.
* Set a stopping condition of 15 minutes.
* Under hyperparameters, set "mode" to "supervised".
* Under "input data configuration", enter the S3 paths to your training and validation datasets under the "train" and "validation" channels. You will need to create a channel named "validation".
* Specify an output path in the same bucket to which you uploaded the training and validation data.
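
For reference, the console steps above correspond roughly to the request that the boto3 SageMaker client's `create_training_job` call accepts. The sketch below only builds that request as a dictionary and does not contact AWS. The function name, job name, image URI, role ARN and bucket are hypothetical placeholders, and mapping the 5 GB setting to `VolumeSizeInGB` is an assumption about what the console field means.

```python
def build_training_job_request(job_name, image_uri, role_arn, bucket, prefix):
    """Build a create_training_job request mirroring the console steps."""

    def channel(name, file_name):
        return {
            "ChannelName": name,
            "ContentType": "text/plain",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": f"s3://{bucket}/{prefix}/{file_name}",
                    "S3DataDistributionType": "FullyReplicated",
                }
            },
        }

    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,
        "HyperParameters": {"mode": "supervised"},
        "InputDataConfig": [
            channel("train", "hello_blaze_train"),
            channel("validation", "hello_blaze_validation"),
        ],
        "OutputDataConfig": {"S3OutputPath": f"s3://{bucket}/{prefix}/output"},
        "ResourceConfig": {
            "InstanceType": "ml.m5.large",
            "InstanceCount": 1,
            "VolumeSizeInGB": 5,  # assumed meaning of the 5 GB console setting
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 15 * 60},
    }


# All argument values below are placeholders, not real resources
request = build_training_job_request(
    job_name="blazingtext-demo",
    image_uri="<blazingtext-image-uri>",
    role_arn="<execution-role-arn>",
    bucket="<your-bucket>",
    prefix="Aug2022",
)
# With real values: boto3.client("sagemaker").create_training_job(**request)
print(sorted(request))
```

Keeping the request as plain data makes it easy to review before anything is launched and billed.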

In [9]:
region_name = boto3.Session().region_name
container = image_uris.retrieve("blazingtext", sess.boto_region_name, "latest")

Defaulting to the only supported framework/algorithm version: 1. Ignoring framework/algorithm version: latest.


To run a training job we need to set up an Estimator. In essence, an Estimator defines the instance on which the training job runs, the location where the model output is written, and the maximum time the job may run.

The hyperparameters, which are specific to the model being used, can be passed as an argument when defining the Estimator, or they can be set later with the `set_hyperparameters` method, as done below.

In [10]:
role = sagemaker.get_execution_role()

blazingtext = sagemaker.estimator.Estimator(container,
                                           role,
                                           instance_count=1,
                                           instance_type='ml.m4.xlarge',
                                           output_path='s3://{}/{}/output'.format(BUCKET, s3_prefix),
                                           sagemaker_session = sess,
                                           max_run=360000,
                                           )

blazingtext.set_hyperparameters(mode = 'supervised',
                                #epochs = 5,
                                #min_count = 5,
                                early_stopping = True,
                                negative_samples = 6,
                                batch_size = 11,
                                patience = 4,
                                #learning_rate = 0.05,
                                vector_dim = 10,
                                #sampling_threshold = 0.0001,
                                min_epochs = 5,
                                #word_ngrams = 3,
                                ) 

## Setting up the data loaders

We need to pass the locations of the stored training and validation datasets, along with the data format, which is specific to the model being used.

In [11]:
train_full_path = 's3://{}/{}/'.format(BUCKET, s3_prefix) + train_path
print(train_full_path)
validation_full_path = 's3://{}/{}/'.format(BUCKET, s3_prefix) + validation_path
print(validation_full_path)

train_data = sagemaker.inputs.TrainingInput(
    train_full_path,
    distribution="FullyReplicated",
    content_type="text/plain",
    s3_data_type="S3Prefix",
)
validation_data = sagemaker.inputs.TrainingInput(
    validation_full_path,
    distribution="FullyReplicated",
    content_type="text/plain",
    s3_data_type="S3Prefix",
)
data_channels = {"train": train_data, "validation": validation_data}



s3://sagemaker-us-east-1-564698410651/Aug2022/hello_blaze_train
s3://sagemaker-us-east-1-564698410651/Aug2022/hello_blaze_validation


## Training the model

We train the model by calling the `fit` method of the Estimator.

When training is initiated, an instance is first set up based on the parameters given in the Estimator definition. That can take a little while.

In [12]:
blazingtext.fit(inputs=data_channels, logs=True)

2022-08-15 14:51:48 Starting - Starting the training job...
2022-08-15 14:52:14 Starting - Preparing the instances for trainingProfilerReport-1660575108: InProgress
.........
2022-08-15 14:53:42 Downloading - Downloading input data...
2022-08-15 14:54:18 Training - Downloading the training image...
2022-08-15 14:54:44 Training - Training image download completed. Training in progress..
Arguments: train
[08/15/2022 14:54:47 INFO 140628294846272] nvidia-smi took: 0.025272846221923828 secs to identify 0 gpus
[08/15/2022 14:54:47 INFO 140628294846272] Running single machine CPU BlazingText training using supervised mode.
Number of CPU sockets found in instance is  1
[08/15/2022 14:54:47 INFO 140628294846272] Processing /opt/ml/input/data/train/hello_blaze_train . File size: 44.97289848327637 MB
[08/15/2022 14:54:47 INFO 140628294846272] Processing /opt/ml/input/data/validation/hello_blaze_validation . File size: 5.058439254760742 MB

## Endpoint and Deployment in AWS SageMaker

How do we use the trained model?

In SageMaker, an 'Endpoint' refers to a model in production. It is essentially an interface to the model that handles the communication between the model and the queries sent to it.

To send queries to the model, the model first needs to be in production, which means it has been deployed and has compute resources assigned for inference. Deployment is the configuration and provisioning of the computing resources that serve your model.

Some of the features of AWS model deployment:
- Multi-AZ deployment: it is also possible to deploy the model at different physical locations. This is called multi-AZ (Availability Zone) deployment
- Auto-scaling: auto-scaling based on CloudWatch can take care of load balancing when the number of requests to the model increases, to avoid bottlenecks
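
As a rough illustration of the auto-scaling feature, the sketch below builds the two requests that the boto3 `application-autoscaling` client expects for a SageMaker endpoint variant: one for `register_scalable_target` and one for `put_scaling_policy`. It only constructs dictionaries and does not call AWS. The function name, endpoint name, policy name, capacity limits and target value are hypothetical, and `AllTraffic` is assumed to be the variant name.

```python
def build_autoscaling_requests(endpoint_name, variant_name="AllTraffic",
                               min_instances=1, max_instances=3,
                               invocations_per_instance=100.0):
    """Build request dictionaries for scaling an endpoint variant."""
    common = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    }
    # Request for register_scalable_target: allowed instance range
    target = dict(common, MinCapacity=min_instances, MaxCapacity=max_instances)
    # Request for put_scaling_policy: track invocations per instance
    policy = dict(
        common,
        PolicyName=f"{endpoint_name}-invocations-tracking",
        PolicyType="TargetTrackingScaling",
        TargetTrackingScalingPolicyConfiguration={
            "TargetValue": invocations_per_instance,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
        },
    )
    return target, policy


# "my-review-endpoint" is a placeholder endpoint name
target_request, policy_request = build_autoscaling_requests("my-review-endpoint")
# With a real endpoint:
#   client = boto3.client("application-autoscaling")
#   client.register_scalable_target(**target_request)
#   client.put_scaling_policy(**policy_request)
print(target_request["ResourceId"])
```

The target value is a tuning choice that depends on how many requests one instance can serve comfortably.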

In the SageMaker Python SDK, a deployed endpoint is represented by a predictor class: [Sagemaker Predictor](https://sagemaker.readthedocs.io/en/stable/api/inference/predictors.html)

Next we create an endpoint and deploy the trained model to an instance, following the AWS examples.
Again, since this creates a new instance, it may take a little while.

In [13]:
review_classifier = blazingtext.deploy(initial_instance_count=1, 
                                       instance_type="ml.m4.xlarge", 
                                       serializer=JSONSerializer()
                                       )

--------!

Checking the review_classifier object:

In [14]:
print(review_classifier)

<sagemaker.predictor.Predictor object at 0x7f444010a210>


In [15]:
my_inputs = [
    "Dogs meouw at night.",
    "The material is not top quality. I see it can be useful considering the cost",
]

payload = {"instances": my_inputs}

response = review_classifier.predict(payload)

predictions = json.loads(response)
print(json.dumps(predictions, indent=2))

[
  {
    "label": [
      "__label__1"
    ],
    "prob": [
      0.7189164757728577
    ]
  },
  {
    "label": [
      "__label__1"
    ],
    "prob": [
      0.8643396496772766
    ]
  }
]


Note: We get `__label__1` for the first sentence, which is most likely not a helpful review! The reason is again the unbalanced data. The `negative_samples` hyperparameter set during training is, per the BlazingText documentation, a Word2Vec negative-sampling setting rather than an oversampling of the unhelpful class, so we still need some probability calibration or a different decision threshold for the label. These are not the goal of this notebook. It is also probably better to use Hugging Face transformer-based language models for this sort of text data. I will try to make another notebook using the AWS [HuggingFace](https://docs.aws.amazon.com/sagemaker/latest/dg/hugging-face.html) framework.
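
One simple way to change the decision threshold is to post-process the predictions returned by the endpoint. The sketch below assumes the response format shown above, and assumes that with two classes the probability of the other label is one minus the reported probability. The function `apply_threshold` and the threshold of 0.8 are illustrative choices, not tuned values.

```python
def apply_threshold(predictions, threshold,
                    helpful_label="__label__1",
                    unhelpful_label="__label__2"):
    """Relabel predictions so 'helpful' needs at least `threshold` probability."""
    decisions = []
    for prediction in predictions:
        top_label = prediction["label"][0]
        top_prob = prediction["prob"][0]
        # Probability of the helpful class, assuming two classes that sum to 1
        helpful_prob = top_prob if top_label == helpful_label else 1.0 - top_prob
        if helpful_prob >= threshold:
            decisions.append(helpful_label)
        else:
            decisions.append(unhelpful_label)
    return decisions


# The two predictions returned by the endpoint above
predictions = [
    {"label": ["__label__1"], "prob": [0.7189164757728577]},
    {"label": ["__label__1"], "prob": [0.8643396496772766]},
]
decisions = apply_threshold(predictions, threshold=0.8)
print(decisions)
```

In practice the threshold would be chosen on the validation set, for example by looking at precision and recall for the unhelpful class.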

At the end, we should not forget to delete the endpoint if no one is going to use our deployed model, since a running endpoint keeps its instance allocated!

In [16]:
# The `endpoint` attribute was renamed to `endpoint_name` in sagemaker>=2
# (see https://sagemaker.readthedocs.io/en/stable/v2.html for details)
sess.delete_endpoint(review_classifier.endpoint_name)
