# UDACITY SageMaker Essentials: Endpoint Exercise

In the last exercise, you trained a BlazingText supervised sentiment analysis model. (Let's call this model HelloBlaze.) You've recently learned how to take a previously trained model and generate an endpoint that we can call to efficiently evaluate new data. Here, we'll put that into practice. You will take HelloBlaze and use it to create an endpoint. Then, you'll evaluate some sample data on that endpoint to see how well the model we've trained generalizes. (Sentiment analysis is a notoriously difficult problem, so we'll keep our expectations modest.)

In [9]:
%load_ext dotenv
%dotenv

import logging
import os
import sys
import boto3
import json
import sagemaker
import zipfile
from typing import List, Dict, Tuple, Union, Optional

from src.paths import CODE_DIR, DATA_DIR, RAW_DATA_DIR, TRANSFORMED_DATA_DIR, TEST_DIR

# load BUCKET, ROLE and REGION from the .env file
bucket = os.getenv("BUCKET")
role = os.getenv("ROLE")
region = os.getenv("REGION")

S3_LOCATION = f"s3://{bucket}/1"

# Adding custom folders to the system path for easy import
sys.path.extend([str(CODE_DIR)])

# Data file path
DATA_FILE_PATH = RAW_DATA_DIR / "reviews_Musical_Instruments_5.json.zip"
OUTPUT_FILE_PATH = TRANSFORMED_DATA_DIR / "reviews_Musical_Instruments_5.json"

The dotenv extension is already loaded. To reload it, use:
  %reload_ext dotenv


## Understanding Exercise: Preprocessing Data (again)

Before we start, we're going to preprocess a new set of data that we'll evaluate with HelloBlaze. We won't keep track of the labels here; we're just seeing how we could evaluate new data using an existing model. This code should be very familiar and requires no modification. Something to note: it is getting tedious to manually process the data ourselves whenever we want to do something with our model. We are also doing this on our local machine. Can you think of potential limitations and dangers of the preprocessing setup we currently have? Keep this in mind when we move on to our lesson about batch-transform jobs.

In [8]:
# Function below unzips the archive to the local directory. 

def unzip_data(input_file_path: str, output_file_path: str) -> None:
    """
    Unzip the data file
    """
    # extract every file in the archive; output_file_path is used as the target directory
    with zipfile.ZipFile(input_file_path, "r") as zip_ref:
        zip_ref.extractall(output_file_path)


# Input data is a file with a single JSON object per line with the following format: 
# {
#  "reviewerID": <string>,
#  "asin": <string>,
#  "reviewerName": <string>,
#  "helpful": [
#    <int>, (indicating number of "helpful votes")
#    <int>  (indicating total number of votes)
#  ],
#  "reviewText": "<string>",
#  "overall": <int>,
#  "summary": "<string>",
#  "unixReviewTime": <int>,
#  "reviewTime": "<string>"
# }
# 
# We are specifically interested in the fields "helpful" and "reviewText"
#

def label_data(input_data_path: str) -> List[str]:
    """
    Label each review as helpful or unhelpful based on its share of helpful votes
    """
    labeled_data = []
    HELPFUL_LABEL = "__label__1"
    UNHELPFUL_LABEL = "__label__2"

    # use a context manager so the file is closed once all lines are read
    with open(input_data_path, "r") as input_file:
        for line in input_file:
            l_object = json.loads(line)
            helpful_votes = l_object["helpful"][0]
            total_votes = l_object["helpful"][1]
            review_text = l_object["reviewText"]

            # reviews without any votes cannot be labeled, so they are skipped
            if total_votes != 0:
                if helpful_votes / total_votes >= 0.5:
                    labeled_data.append(f"{HELPFUL_LABEL} {review_text}\n")
                else:
                    labeled_data.append(f"{UNHELPFUL_LABEL} {review_text}\n")

    return labeled_data


# Labeled data is a list of sentences, starting with the label defined in label_data. 

def split_sentences(labeled_data):
    new_split_sentences = []
    for d in labeled_data:       
        sentences = " ".join(d.split()[1:]).split(".") # Initially split to separate label, then separate sentences
        for s in sentences:
            if s: # Make sure sentences isn't empty. Common w/ "..."
                new_split_sentences.append(s)
    return new_split_sentences


unzip_data(DATA_FILE_PATH, TRANSFORMED_DATA_DIR)  # returns None, so there is nothing to assign
labeled_data = label_data(OUTPUT_FILE_PATH)
new_split_sentence_data = split_sentences(labeled_data)

# print the first 6 sentences, one per line
print("\n".join(new_split_sentence_data[:6]))

Love the magnet easel
 great for moving to different areas
 Wish it had some sort of non skid pad on bottom though
Both sides are magnetic
 A real plus when you're entertaining more than one child
 The four-year old can find the letters for the words, while the two-year old can find the pictures the words spell
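
The output above shows that fragments after the first one keep a leading space, because the text is split on "." only. As a minimal sketch (not part of the original exercise), a variant of `split_sentences` could trim that whitespace before the sentences are sent to the model. The name `split_sentences_stripped` and the sample record are hypothetical.

```python
from typing import List


def split_sentences_stripped(labeled_data: List[str]) -> List[str]:
    """Hypothetical variant of split_sentences that trims surrounding whitespace."""
    sentences = []
    for record in labeled_data:
        # Drop the leading "__label__N" token, then split the remaining text on periods.
        text = " ".join(record.split()[1:])
        for sentence in text.split("."):
            sentence = sentence.strip()
            if sentence:  # skip empty fragments produced by "..." or a trailing period
                sentences.append(sentence)
    return sentences


# Made-up record in the same "__label__N <text>" shape that label_data produces.
sample = ["__label__1 Love the magnet easel. great for moving... Both sides.\n"]
cleaned = split_sentences_stripped(sample)
print(cleaned)
```

Whether trimming changes the predictions depends on how the model tokenizes its input, so treat this as a tidiness step rather than a required fix.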


## Exercise: Deploy Model

Once you have your model, creating an endpoint is straightforward. All you need to do is initialize a "model" object and call its deploy method. Fill in the method below with the proper addresses and an endpoint will be created, serving your model. Once this is done, confirm that the endpoint is live by consulting the SageMaker Console. You'll see this under "Endpoints" in the "Inference" menu on the left-hand side. Even when everything is set up correctly, the endpoint will take a while to be instantiated.

You will need the following:

* The `image_uris.retrieve` method, to get the URI of the BlazingText Docker image: https://sagemaker.readthedocs.io/en/stable/api/utility/image_uris.html
* A `model_data` value holding the S3 location of the SageMaker model artifacts
* The `Model` object: https://sagemaker.readthedocs.io/en/stable/api/inference/model.html
* The execution role
* The `deploy` method of the model object, using a single instance of "ml.m5.large"

In [12]:
from sagemaker.model import Model
from sagemaker import image_uris
import logging

# By default, the SageMaker SDK logs events related to the default
# configuration at the INFO level. To keep these from cluttering
# the output of this notebook's cells, we can change the logging
# level to ERROR instead.
logging.getLogger("sagemaker.config").setLevel(logging.ERROR)

# get the image using the "blazingtext" framework and your region
image_uri = image_uris.retrieve("blazingtext", region=region)

# get the S3 location of a SageMaker model data
model_data = f"{S3_LOCATION}/model_artifacts/blazingtext-2023-12-13-13-50-26-062/output/model.tar.gz"

# define a model object
model = Model(
    image_uri=image_uri,
    model_data=model_data,
    role=role,
)


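# deploy the model using a single instance of "ml.m5.large"
# note: a plain Model created without predictor_cls returns None from deploy(),
# so the Predictor is built from the endpoint name in the next exercise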
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.large",
)

sagemaker.config INFO - Not applying SDK defaults from location: /Library/Application Support/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /Users/carlos/Library/Application Support/sagemaker/config.yaml
----!
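
Besides checking the console, the endpoint status can also be read programmatically. The sketch below assumes a boto3 SageMaker client (`boto3.client("sagemaker")`) and its `describe_endpoint` call, which reports an `EndpointStatus` such as `Creating`, `InService` or `Failed`. The helper name `wait_for_endpoint`, the polling defaults and the fake client are illustrative assumptions; the fake client is only there so the example runs without AWS credentials.

```python
import time
from typing import Any, Dict, Iterable


def wait_for_endpoint(sm_client: Any, endpoint_name: str,
                      poll_seconds: float = 30.0, max_polls: int = 40) -> str:
    """Poll describe_endpoint until the endpoint leaves the 'Creating' state."""
    status = "Creating"
    for _ in range(max_polls):
        response = sm_client.describe_endpoint(EndpointName=endpoint_name)
        status = response["EndpointStatus"]
        if status != "Creating":
            break
        time.sleep(poll_seconds)
    return status


class FakeSageMakerClient:
    """Stand-in for boto3.client("sagemaker"); replays a fixed list of statuses."""

    def __init__(self, statuses: Iterable[str]) -> None:
        self._statuses = iter(statuses)

    def describe_endpoint(self, EndpointName: str) -> Dict[str, str]:
        return {"EndpointName": EndpointName, "EndpointStatus": next(self._statuses)}


fake_client = FakeSageMakerClient(["Creating", "Creating", "InService"])
final_status = wait_for_endpoint(fake_client, "hypothetical-endpoint-name", poll_seconds=0)
print(final_status)
```

With a real client, pass the name of the endpoint shown in the console instead of the placeholder name used here.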

## Exercise: Evaluate Data

Alright, we now have an easy way to evaluate our data! You will want to interact with the endpoint using the predictor interface: https://sagemaker.readthedocs.io/en/stable/api/inference/predictors.html

Predictor is not the endpoint itself, but instead is an interface that we can use to easily interact with our deployed model. Your task is to take `new_split_sentence_data` and evaluate it using the predictor.  

Note that BlazingText supports "application/json" as the content type for inference, and the model expects a JSON payload that contains the list of sentences under the key "instances".

The method you'll need to call is highlighted below.

Another recommendation: try evaluating a subset of the data before evaluating all of the data. This will make debugging significantly faster.
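
One way to follow that recommendation is to cut the sentences into small batches and build one JSON payload per batch. This is a minimal sketch: the helper name `iter_payloads`, the default `batch_size` of 100 and the demo sentences are assumptions, not values prescribed by the exercise or by SageMaker.

```python
import json
from typing import Iterator, List


def iter_payloads(sentences: List[str], batch_size: int = 100) -> Iterator[str]:
    """Yield JSON payloads in the {"instances": [...]} shape, one per batch."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(sentences), batch_size):
        batch = sentences[start:start + batch_size]
        yield json.dumps({"instances": batch})


# Made-up sentences standing in for new_split_sentence_data.
demo_sentences = ["Love the magnet easel", "Both sides are magnetic", "A real plus"]
payloads = list(iter_payloads(demo_sentences, batch_size=2))
print(len(payloads))
print(payloads[0])
```

Each payload string could then be sent to the endpoint on its own, which keeps a failed request easy to isolate and retry.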

In [13]:
from sagemaker.predictor import Predictor
import json

predictor = Predictor(endpoint_name="blazingtext-2023-12-13-17-27-30-671")

# take the first five sentences from new_split_sentence_data
example_sentences = new_split_sentence_data[:5]

payload = {"instances": example_sentences}

print(json.dumps(payload))

# make predictions using the "predict" method. Set initial_args to {'ContentType': 'application/json'}
predictions = predictor.predict(json.dumps(payload), initial_args={"ContentType": "application/json"})

print(predictions)

sagemaker.config INFO - Not applying SDK defaults from location: /Library/Application Support/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /Users/carlos/Library/Application Support/sagemaker/config.yaml
{"instances": ["Love the magnet easel", " great for moving to different areas", " Wish it had some sort of non skid pad on bottom though", "Both sides are magnetic", " A real plus when you're entertaining more than one child"]}
b'[{"label": ["__label__1"], "prob": [0.9968512058258057]}, {"label": ["__label__1"], "prob": [0.9990147948265076]}, {"label": ["__label__1"], "prob": [0.9543969631195068]}, {"label": ["__label__1"], "prob": [0.9985549449920654]}, {"label": ["__label__1"], "prob": [0.8869138360023499]}]'
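
The response comes back as raw bytes containing a JSON list with one entry per input sentence. A minimal sketch for turning it into (label, probability) pairs is shown below; the helper name `parse_predictions` is hypothetical, and the sample bytes are two entries copied from the output above.

```python
import json
from typing import List, Tuple


def parse_predictions(raw: bytes) -> List[Tuple[str, float]]:
    """Decode the endpoint response and keep the top label and its probability."""
    records = json.loads(raw.decode("utf-8"))
    return [(record["label"][0], record["prob"][0]) for record in records]


# Two entries taken from the response printed above.
sample_response = (
    b'[{"label": ["__label__1"], "prob": [0.9968512058258057]}, '
    b'{"label": ["__label__1"], "prob": [0.8869138360023499]}]'
)
parsed = parse_predictions(sample_response)
for label, prob in parsed:
    print(f"{label}\t{prob:.3f}")
```

Pairing these tuples back with the input sentences (for example with `zip`) makes it easier to inspect which sentences the model is least confident about.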


## Make sure you stop/delete the endpoint after completing the exercise to avoid unnecessary charges.

In [14]:
predictor.delete_endpoint()