# A Sentiment Analysis Web App
## Using PyTorch and SageMaker

---

This is a sentiment analysis model trained on 25000 movie reviews, and was built as a part of Udacity's Deep Learning Nanodegree.


## The Outline

The project was broadly built on the following lines, and this notebook has been divided into the same sections.

1. Download or otherwise retrieve the data.
2. Process / Prepare the data.
3. Upload the processed data to S3.
4. Train a chosen model.
5. Test the trained model (typically using a batch transform job).
6. Deploy the trained model.
7. Use the deployed model.


## Downloading the data

I will be using the [IMDb dataset](http://ai.stanford.edu/~amaas/data/sentiment/)

> Maas, Andrew L., et al. [Learning Word Vectors for Sentiment Analysis](http://ai.stanford.edu/~amaas/data/sentiment/). In _Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies_. Association for Computational Linguistics, 2011.

In [1]:
%mkdir ../data
!wget -O ../data/aclImdb_v1.tar.gz http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -zxf ../data/aclImdb_v1.tar.gz -C ../data

mkdir: cannot create directory ‘../data’: File exists
--2020-11-21 17:49:51--  http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
Resolving ai.stanford.edu (ai.stanford.edu)... 171.64.68.10
Connecting to ai.stanford.edu (ai.stanford.edu)|171.64.68.10|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 84125825 (80M) [application/x-gzip]
Saving to: ‘../data/aclImdb_v1.tar.gz’


2020-11-21 17:50:11 (4.19 MB/s) - ‘../data/aclImdb_v1.tar.gz’ saved [84125825/84125825]



## Preparing and Processing the data

Once I downloaded the data, I did some initial data processing: I read in each of the reviews, combined them into a single input structure, then split the data into a training set and a testing set of 25,000 observations each, with positive and negative reviews distributed equally.

In [2]:
import os
import glob

def read_imdb_data(data_dir='../data/aclImdb'):
    data = {}
    labels = {}
    
    for data_type in ['train', 'test']:
        data[data_type] = {}
        labels[data_type] = {}
        
        for sentiment in ['pos', 'neg']:
            data[data_type][sentiment] = []
            labels[data_type][sentiment] = []
            
            path = os.path.join(data_dir, data_type, sentiment, '*.txt')
            files = glob.glob(path)
            
            for f in files:
                with open(f) as review:
                    data[data_type][sentiment].append(review.read())
                    # A positive review is '1' and a negative review is '0'
                    labels[data_type][sentiment].append(1 if sentiment == 'pos' else 0)
                    
            assert len(data[data_type][sentiment]) == len(labels[data_type][sentiment]), \
                    "{}/{} data size does not match labels size".format(data_type, sentiment)
                
    return data, labels

In [3]:
data, labels = read_imdb_data()
print("IMDB reviews: train = {} pos / {} neg, test = {} pos / {} neg".format(
            len(data['train']['pos']), len(data['train']['neg']),
            len(data['test']['pos']), len(data['test']['neg'])))

IMDB reviews: train = 12500 pos / 12500 neg, test = 12500 pos / 12500 neg


I then combined positive and negative reviews into test and train dataframes, and used Scikit-learn's shuffle method to 
shuffle them around.

In [4]:
from sklearn.utils import shuffle

def prepare_imdb_data(data, labels):
    """Prepare training and test sets from IMDb movie reviews."""
    
    #Combine positive and negative reviews and labels
    data_train = data['train']['pos'] + data['train']['neg']
    data_test = data['test']['pos'] + data['test']['neg']
    labels_train = labels['train']['pos'] + labels['train']['neg']
    labels_test = labels['test']['pos'] + labels['test']['neg']
    
    #Shuffle reviews and corresponding labels within training and test sets
    data_train, labels_train = shuffle(data_train, labels_train)
    data_test, labels_test = shuffle(data_test, labels_test)
    
    # Return a unified training data, test data, training labels, test labets
    return data_train, data_test, labels_train, labels_test

In [5]:
train_X, test_X, train_y, test_y = prepare_imdb_data(data, labels)
print("IMDb reviews (combined): train = {}, test = {}".format(len(train_X), len(test_X)))

IMDb reviews (combined): train = 25000, test = 25000


Here's what a single datapoint in the set looks like. It's a good idea to see, for yourself, what the processing pipeline has done, and whether or not it worked as expected.

In [6]:
print(train_X[100])
print(train_y[100])

An executive, very successful in his professional life but very unable in his familiar life, meets a boy with down syndrome, escaped from a residence . Both characters feel very alone, and the apparently less intelligent one will show to the executive the beauty of the small things in life... With this argument, the somehow Amelie-like atmosphere and the sentimental music, I didn't expect but a moralistic disgusting movie. Anyway, as there were some interesting scenes (the boy is sometimes quite a violent guy), and the interpretation of both actors, Daniel Auteil and Pasqal Duquenne, was very good, I decided to go on watching the movie. The French cinema, in general, has the ability of showing something that seems quite much to life, opposed to the more stereotyped American cinema. But, because of that, it is much more disappointing to see after the absurd ending, with the impossible death of the boy, the charming tone, the happiness of the executive's family, the cheap moral, the unbe

I removed all html tags, and tokenized the input using a stemmer. This  way words such as *entertained* and *entertaining* are considered the same.

In [7]:
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import *

import re
from bs4 import BeautifulSoup

def review_to_words(review):
    nltk.download("stopwords", quiet=True)
    stemmer = PorterStemmer()
    
    text = BeautifulSoup(review, "html.parser").get_text() # Remove HTML tags
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower()) # Convert to lower case
    words = text.split() # Split string into words
    words = [w for w in words if w not in stopwords.words("english")] # Remove stopwords
    words = [PorterStemmer().stem(w) for w in words] # stem
    
    return words

The `review_to_words` method above uses `BeautifulSoup` to remove any html tags that appear and uses the `nltk` package to tokenize the reviews.

Here's what it looks like applied to a single review:

In [8]:
# Apply review_to_words to a review
review_to_words(train_X[99])

['poor',
 'perform',
 'sinatra',
 'martin',
 'hyer',
 'grossli',
 'underdevelop',
 'support',
 'charact',
 'annoy',
 'talki',
 'real',
 'plot',
 'end',
 'leav',
 'flatter',
 'pancak',
 'loos',
 'end',
 'could',
 'tie',
 'four',
 'sequel',
 'even',
 'care',
 'wooden',
 'charact',
 'maclain',
 'real',
 'asset',
 'penultim',
 'sequenc',
 'chicago',
 'hood',
 'search',
 'sinatra',
 'charact',
 'laughabl',
 'music',
 'sequenc',
 'also',
 'poor',
 'final',
 'scene',
 'martin',
 'charact',
 'remov',
 'hat',
 'woman',
 'call',
 'pig',
 'almost',
 'made',
 'go',
 'outsid',
 'find',
 'stone',
 'throw',
 'televis',
 'screen']

In [9]:
len(set(review_to_words(train_X[99])))

51

In [10]:
print(len(train_X[99]))
print(len(review_to_words(train_X[99])))

642
60


The above used Porter stemmer not only gets the stems of words, but also converts all of them to lower case. Besides this, it also removes all punctuation, whitespace and returns a list of words. 

Since my `review_to_words` function works as expected, I'll apply it to the entire dataset.

In [11]:
import pickle

cache_dir = os.path.join("../cache", "sentiment_analysis")  # where to store cache files
os.makedirs(cache_dir, exist_ok=True)  # ensure cache directory exists

def preprocess_data(data_train, data_test, labels_train, labels_test,
                    cache_dir=cache_dir, cache_file="preprocessed_data.pkl"):
    """Convert each review to words; read from cache if available."""

    # If cache_file is not None, try to read from it first
    cache_data = None
    if cache_file is not None:
        try:
            with open(os.path.join(cache_dir, cache_file), "rb") as f:
                cache_data = pickle.load(f)
            print("Read preprocessed data from cache file:", cache_file)
        except:
            pass  # unable to read from cache, but that's okay
    
    # If cache is missing, then do the heavy lifting
    if cache_data is None:
        # Preprocess training and test data to obtain words for each review
        #words_train = list(map(review_to_words, data_train))
        #words_test = list(map(review_to_words, data_test))
        words_train = [review_to_words(review) for review in data_train]
        words_test = [review_to_words(review) for review in data_test]
        
        # Write to cache file for future runs
        if cache_file is not None:
            cache_data = dict(words_train=words_train, words_test=words_test,
                              labels_train=labels_train, labels_test=labels_test)
            with open(os.path.join(cache_dir, cache_file), "wb") as f:
                pickle.dump(cache_data, f)
            print("Wrote preprocessed data to cache file:", cache_file)
    else:
        # Unpack data loaded from cache file
        words_train, words_test, labels_train, labels_test = (cache_data['words_train'],
                cache_data['words_test'], cache_data['labels_train'], cache_data['labels_test'])
    
    return words_train, words_test, labels_train, labels_test

In [12]:
# Preprocess data
train_X, test_X, train_y, test_y = preprocess_data(train_X, test_X, train_y, test_y)

Read preprocessed data from cache file: preprocessed_data.pkl


## Transform the data

The next set is tp transform the data from its word representation to a bag-of-words feature representation. First, I will represent each word as an integer. 

Next, I set a working vocabulary with a fixed size. This is because some of the words that appear in the reviews occur very infrequently and likely don't contain much information anyway. The vocabulary will only contain the most frequently occuring words. I will combine all the infrequent words into a single category and, label it `1`.

Besides, the model is a recurrent neural network. It will be convenient if the length of reviews is the same. To do this, we will fix a size for our reviews and then pad short reviews with the category 'no word' (which we will label `0`) and truncate long reviews.

### Create a word dictionary

Here I build a dictionary to map words that appear in the reviews to integers. I fix the size of the vocabulary (including the 'no word' and 'infrequent' categories) to be `5000`.

In [13]:
import numpy as np
from collections import Counter

def build_dict(data, vocab_size = 5000):
    """Construct and return a dictionary mapping each of the most frequently appearing words to a unique integer."""
    
    # Determine how often each word appears in `data`. Note that `data` is a list of sentences and that a
    # sentence is a list of words.
    
    word_count = Counter(np.concatenate(data, axis = 0)) # A dict storing the words that appear in the reviews along with how often they occur
    
    # Sort the words found in `data` so that sorted_words[0] is the most frequently appearing word and
    # sorted_words[-1] is the least frequently appearing word.
    
    sorted_words = sorted(word_count, key = word_count.get, reverse = True)
    
    word_dict = {} # This is what we are building, a dictionary that translates words into integers
    for idx, word in enumerate(sorted_words[:vocab_size - 2]): # The -2 is so that we save room for the 'no word'
        word_dict[word] = idx + 2                              # 'infrequent' labels
        
    return word_dict

In [14]:
word_dict = build_dict(train_X)

In [15]:
# TODO: Use this space to determine the five most frequently appearing words in the training set.
word_count_t = Counter(np.concatenate(train_X, axis = 0))

sorted_words_t = sorted(word_count_t, key = word_count_t.get, reverse = True)

print(sorted_words_t[:5])

['movi', 'film', 'one', 'like', 'time']


I ran the function to understand the most frequently appearing phrases as a sanity check. These words are movie related, and form the phrase *once upon a time*. Makes sense.

```['movi', 'film', 'one', 'like', 'time']```

### Save `word_dict`

I'm made a pickle dump for the `word_dict` method. I'll use it later, once I build an endpoint to processes a submitted review.

In [16]:
data_dir = '../data/pytorch' # The folder we will use for storing data
if not os.path.exists(data_dir): # Make sure that the folder exists
    os.makedirs(data_dir)

In [17]:
with open(os.path.join(data_dir, 'word_dict.pkl'), "wb") as f:
    pickle.dump(word_dict, f)

### Transform the reviews

I will use the word dictionary to transform reviews to integer sequence representation, making sure to pad or truncate to a fixed length, which in this case is `500`.

In [18]:
def convert_and_pad(word_dict, sentence, pad=500):
    NOWORD = 0 # We will use 0 to represent the 'no word' category
    INFREQ = 1 # and we use 1 to represent the infrequent words, i.e., words not appearing in word_dict
    
    working_sentence = [NOWORD] * pad
    
    for word_index, word in enumerate(sentence[:pad]):
        if word in word_dict:
            working_sentence[word_index] = word_dict[word]
        else:
            working_sentence[word_index] = INFREQ
            
    return working_sentence, min(len(sentence), pad)

def convert_and_pad_data(word_dict, data, pad=500):
    result = []
    lengths = []
    
    for sentence in data:
        converted, leng = convert_and_pad(word_dict, sentence, pad)
        result.append(converted)
        lengths.append(leng)
        
    return np.array(result), np.array(lengths)

In [19]:
train_X, train_X_len = convert_and_pad_data(word_dict, train_X)
test_X, test_X_len = convert_and_pad_data(word_dict, test_X)

As a quick check to make sure that things are working as intended.

In [20]:
print(len(train_X[99]))
print(train_X_len[99])

500
114


In [21]:
d = np.array(train_X[99])
np.count_nonzero(d)

114

**Comments:** 
* At the face of it, implementing the ```preprocess_data``` method looks like a good practice. This procees the raw data and dumps the pre-processed data into a pkl file. We can retrieve processed data from this dump going forward. Thought this was a time counsuming method to run, this function was a one-time call.

* The ```convert_and_pad_data``` is an absolute must since we're training a Neural Network. The network will expect reviews of uniform length and this function ensure that this happens.

## Upload the data to S3

I will need to upload the training dataset to S3.

### Save the processed training dataset locally

I will save the dataset locally, first. Each row of the dataset has the form `label`, `length`, `review[500]` where `review[500]` is a sequence of `500` integers representing the words in the review.

In [22]:
import pandas as pd
    
pd.concat([pd.DataFrame(train_y), pd.DataFrame(train_X_len), pd.DataFrame(train_X)], axis=1) \
        .to_csv(os.path.join(data_dir, 'train.csv'), header=False, index=False)

### Uploading the training data


I will upload the training data to the SageMaker default S3 bucket so that we can provide access to it while training our model.

In [23]:
import sagemaker

sagemaker_session = sagemaker.Session()

bucket = sagemaker_session.default_bucket()
prefix = 'sagemaker/sentiment_rnn'

role = sagemaker.get_execution_role()

In [24]:
input_data = sagemaker_session.upload_data(path=data_dir, bucket=bucket, key_prefix=prefix)

## Build and Train the PyTorch Model

A model in the SageMaker framework consists of three objects:

 - Model Artifacts,
 - Training Code, and
 - Inference Code,
 
Here I used containers provided by Amazon, and added some custom code.

I will start by building a neural network in PyTorch along with a training script. Code to the model object is written in another script, but can be found below:

In [25]:
!pygmentize train/model.py

[34mimport[39;49;00m [04m[36mtorch[39;49;00m[04m[36m.[39;49;00m[04m[36mnn[39;49;00m [34mas[39;49;00m [04m[36mnn[39;49;00m

[34mclass[39;49;00m [04m[32mLSTMClassifier[39;49;00m(nn.Module):
    [33m"""[39;49;00m
[33m    This is the simple RNN model we will be using to perform Sentiment Analysis.[39;49;00m
[33m    """[39;49;00m

    [34mdef[39;49;00m [32m__init__[39;49;00m([36mself[39;49;00m, embedding_dim, hidden_dim, vocab_size):
        [33m"""[39;49;00m
[33m        Initialize the model by settingg up the various layers.[39;49;00m
[33m        """[39;49;00m
        [36msuper[39;49;00m(LSTMClassifier, [36mself[39;49;00m).[32m__init__[39;49;00m()

        [36mself[39;49;00m.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=[34m0[39;49;00m)
        [36mself[39;49;00m.lstm = nn.LSTM(embedding_dim, hidden_dim)
        [36mself[39;49;00m.dense = nn.Linear(in_features=hidden_dim, out_features=[34m1[39;49;00m)


There are three parameters that I can tweak to alter model performance: embedding dimension, the hidden dimension and the size of the vocabulary.

First I load a portion of the training data set as a sample, and train on this.

In [26]:
import torch
import torch.utils.data

# Read in only the first 250 rows
train_sample = pd.read_csv(os.path.join(data_dir, 'train.csv'), header=None, names=None, nrows=250)

# Turn the input pandas dataframe into tensors
train_sample_y = torch.from_numpy(train_sample[[0]].values).float().squeeze()
train_sample_X = torch.from_numpy(train_sample.drop([0], axis=1).values).long()

# Build the dataset
train_sample_ds = torch.utils.data.TensorDataset(train_sample_X, train_sample_y)
# Build the dataloader
train_sample_dl = torch.utils.data.DataLoader(train_sample_ds, batch_size=50)

Writing the training method

Here I write some classic code training my neural network over several epochs:

In [27]:
def train(model, train_loader, epochs, optimizer, loss_fn, device):
    for epoch in range(1, epochs + 1):
        model.train()
        total_loss = 0
        for batch in train_loader:         
            batch_X, batch_y = batch
            
            batch_X = batch_X.to(device)
            batch_y = batch_y.to(device)
            
            # TODO: Complete this train method to train the model provided.
            optimizer.zero_grad()
            
            output = model(batch_X)
            
            loss = loss_fn(output, batch_y)
            
            loss.backward()
            optimizer.step()
            
            total_loss += loss.data.item()
        print("Epoch: {}, BCELoss: {}".format(epoch, total_loss / len(train_loader)))

I now test over a small number of epochs as a sanity check

In [28]:
import torch.optim as optim
from train.model import LSTMClassifier

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = LSTMClassifier(32, 100, 5000).to(device)
optimizer = optim.Adam(model.parameters())
loss_fn = torch.nn.BCELoss()

train(model, train_sample_dl, 5, optimizer, loss_fn, device)

Epoch: 1, BCELoss: 0.6932283639907837
Epoch: 2, BCELoss: 0.6832670569419861
Epoch: 3, BCELoss: 0.6746159672737122
Epoch: 4, BCELoss: 0.6651097059249877
Epoch: 5, BCELoss: 0.6537458539009094


### Training the model

I initialize an estimator on SageMaker and get ready to train

In [36]:
from sagemaker.pytorch import PyTorch

estimator = PyTorch(entry_point="train.py",
                    source_dir="train",
                    role=role,
                    framework_version='0.4.0',
                    py_version = 'py3',
                    train_instance_count=1,
                    train_instance_type='ml.p2.xlarge',
                    hyperparameters={
                        'epochs': 10,
                        'hidden_dim': 200,
                    })

In [37]:
estimator.fit({'training': input_data})

'create_image_uri' will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.
's3_input' class will be renamed to 'TrainingInput' in SageMaker Python SDK v2.
'create_image_uri' will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.


2020-11-21 18:15:53 Starting - Starting the training job...
2020-11-21 18:15:56 Starting - Launching requested ML instances......
2020-11-21 18:17:16 Starting - Preparing the instances for training.........
2020-11-21 18:18:36 Downloading - Downloading input data...
2020-11-21 18:19:10 Training - Downloading the training image...
2020-11-21 18:19:41 Training - Training image download completed. Training in progress.[34mbash: cannot set terminal process group (-1): Inappropriate ioctl for device[0m
[34mbash: no job control in this shell[0m
[34m2020-11-21 18:19:41,386 sagemaker-containers INFO     Imported framework sagemaker_pytorch_container.training[0m
[34m2020-11-21 18:19:41,414 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.[0m
[34m2020-11-21 18:19:42,028 sagemaker_pytorch_container.training INFO     Invoking user training script.[0m
[34m2020-11-21 18:19:42,284 sagemaker-containers INFO     Module train does not provide a setup.py. 

## Deploy the model for testing

In [38]:
# Deploy the trained model
predictor = estimator.deploy(initial_instance_count = 1,
                            instance_type = 'ml.m4.xlarge')

Parameter image will be renamed to image_uri in SageMaker Python SDK v2.
'create_image_uri' will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.


---------------!

## Use the model for testing

Once deployed, I will read in the test data, send it to the deployed model, and use the results to determine accuracy.

In [39]:
test_X = pd.concat([pd.DataFrame(test_X_len), pd.DataFrame(test_X)], axis=1)

In [40]:
# Split the data into chunks and send each chunk seperately, accumulating the results.

def predict(data, rows=512):
    split_array = np.array_split(data, int(data.shape[0] / float(rows) + 1))
    predictions = np.array([])
    for array in split_array:
        predictions = np.append(predictions, predictor.predict(array))
    
    return predictions

In [41]:
predictions = predict(test_X.values)
predictions = [round(num) for num in predictions]

In [42]:
from sklearn.metrics import accuracy_score
accuracy_score(test_y, predictions)

0.83912

Not too bad, 83%

### More testing

So the model works well on the processed reviews. I'll now pass a stray review to see how the network deals with unseen data points.

In [43]:
test_review = 'The simplest pleasures in life are the best, and this film is one of them. Combining a rather basic storyline of love and adventure this movie transcends the usual weekend fair with wit and unmitigated charm.'

I do the same preprocessing here. Using the `review_to_words` and `convert_and_pad` methods from section one, convert `test_review` into a numpy array `test_data` suitable to send to our model.

In [44]:
# TODO: Convert test_review into a form usable by the model and save the results in test_data
test_data = None

test_review_words = review_to_words(test_review)

test_data, test_data_len = convert_and_pad_data(word_dict, test_review_words)

test_data = pd.concat([pd.DataFrame(test_data_len), pd.DataFrame(test_data)], axis = 1)

In [45]:
predictor.predict(test_data)

array([0.37846673, 0.58298856, 0.27177435, 0.54650784, 0.41799745,
       0.4034507 , 0.610804  , 0.51276743, 0.46907002, 0.5319264 ,
       0.23136131, 0.6319357 , 0.34935707, 0.6316328 , 0.68350583,
       0.38812724, 0.46733084, 0.41334122, 0.6815751 , 0.5702613 ],
      dtype=float32)

Since the value returned by the model is close to `1`, it think the review we submitted is positive.

### Delete the endpoint

Now I will delete my endpoint, we don't want AWS bills maxing out my credit card, and pushing me into debt.

In [46]:
estimator.delete_endpoint()

estimator.delete_endpoint() will be deprecated in SageMaker Python SDK v2. Please use the delete_endpoint() function on your predictor instead.


## Deploy the model for the web app

The model works fine. I made a simple custome interface to submit reviews and have the model predict its sentiment.

In [47]:
!pygmentize serve/predict.py

[34mimport[39;49;00m [04m[36margparse[39;49;00m
[34mimport[39;49;00m [04m[36mjson[39;49;00m
[34mimport[39;49;00m [04m[36mos[39;49;00m
[34mimport[39;49;00m [04m[36mpickle[39;49;00m
[34mimport[39;49;00m [04m[36msys[39;49;00m
[34mimport[39;49;00m [04m[36msagemaker_containers[39;49;00m
[34mimport[39;49;00m [04m[36mpandas[39;49;00m [34mas[39;49;00m [04m[36mpd[39;49;00m
[34mimport[39;49;00m [04m[36mnumpy[39;49;00m [34mas[39;49;00m [04m[36mnp[39;49;00m
[34mimport[39;49;00m [04m[36mtorch[39;49;00m
[34mimport[39;49;00m [04m[36mtorch[39;49;00m[04m[36m.[39;49;00m[04m[36mnn[39;49;00m [34mas[39;49;00m [04m[36mnn[39;49;00m
[34mimport[39;49;00m [04m[36mtorch[39;49;00m[04m[36m.[39;49;00m[04m[36moptim[39;49;00m [34mas[39;49;00m [04m[36moptim[39;49;00m
[34mimport[39;49;00m [04m[36mtorch[39;49;00m[04m[36m.[39;49;00m[04m[36mutils[39;49;00m[04m[36m.[39;49;00m[04m[36mdata[39;49;00m

[34mfrom

### Deploying the model

Now that the custom inference code has been written, I will create and deploy the model. To begin I construct a new PyTorchModel object which points to the model artifacts created during training and to the inference code, then I call the deploy method to launch the deployment container.

In [48]:
from sagemaker.predictor import RealTimePredictor
from sagemaker.pytorch import PyTorchModel

class StringPredictor(RealTimePredictor):
    def __init__(self, endpoint_name, sagemaker_session):
        super(StringPredictor, self).__init__(endpoint_name, sagemaker_session, content_type='text/plain')

model = PyTorchModel(model_data=estimator.model_data,
                     role = role,
                     framework_version='0.4.0',
                     py_version = 'py3',
                     entry_point='predict.py',
                     source_dir='serve',
                     predictor_cls=StringPredictor)
predictor = model.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')

Parameter image will be renamed to image_uri in SageMaker Python SDK v2.
'create_image_uri' will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.


---------------!

### Testing the model

With the model deployed, I test to see if everything is working by loading the first `250` positive and negative reviews, send them to the endpoint, then collect the results.

In [49]:
import glob

def test_reviews(data_dir='../data/aclImdb', stop=250):
    
    results = []
    ground = []
    
    # We make sure to test both positive and negative reviews    
    for sentiment in ['pos', 'neg']:
        
        path = os.path.join(data_dir, 'test', sentiment, '*.txt')
        files = glob.glob(path)
        
        files_read = 0
        
        print('Starting ', sentiment, ' files')
        
        # Iterate through the files and send them to the predictor
        for f in files:
            with open(f) as review:
                # First, we store the ground truth (was the review positive or negative)
                if sentiment == 'pos':
                    ground.append(1)
                else:
                    ground.append(0)
                # Read in the review and convert to 'utf-8' for transmission via HTTP
                review_input = review.read().encode('utf-8')
                # Send the review to the predictor and store the results
                results.append(int(float(predictor.predict(review_input))))
                
            # Sending reviews to our endpoint one at a time takes a while so we
            # only send a small number of reviews
            files_read += 1
            if files_read == stop:
                break
            
    return ground, results

In [50]:
ground, results = test_reviews()

Starting  pos  files
Starting  neg  files


In [51]:
from sklearn.metrics import accuracy_score
accuracy_score(ground, results)

0.848

In [52]:
predictor.predict(test_review)

b'1.0'

The last step is to create a simple web page for the app.

## Use the model for the web app
On the far left is our web app that collects a user's movie review, sends it off and expects a positive or negative sentiment in return.

To get a user submitted movie review to our SageMaker model, I will use a Lambda function with permission to send and receive data from the endpoint. Next, I will use the API Gateway to create a new endpoint. 


### Setting up a Lambda function

Here's what the Lambda function looks like: 

```python
import boto3

def lambda_handler(event, context):

    # The SageMaker runtime is what allows us to invoke the endpoint created.
    runtime = boto3.Session().client('sagemaker-runtime')

    # Use the SageMaker runtime to invoke our endpoint, sending the review we were given
    response = runtime.invoke_endpoint(EndpointName = '**ENDPOINT NAME HERE**',    # The name of the endpoint we created
                                       ContentType = 'text/plain',                 # The data format that is expected
                                       Body = event['body'])                       # The actual review

    # The response is an HTTP response whose body contains the result of the inference
    result = response['Body'].read().decode('utf-8')

    return {
        'statusCode' : 200,
        'headers' : { 'Content-Type' : 'text/plain', 'Access-Control-Allow-Origin' : '*' },
        'body' : result
    }
```

Once you have copy and pasted the code above into the Lambda code editor, replace the `**ENDPOINT NAME HERE**` portion with the name of the endpoint that we deployed earlier. You can determine the name of the endpoint using the code cell below.

In [53]:
predictor.endpoint

'sagemaker-pytorch-2020-11-21-18-36-20-803'

### Setting up API Gateway

Now that our Lambda function is set up, it is time to create a new API using API Gateway that will trigger the Lambda function we have just created.

## Deploying our web app

Now that we have a publicly available API, we can start using it in a web app. I have a simple html web page, you can find it at `index.html`.

For testing, I took the following review of *Joker* from Rotten Tomatoes; I wanted to see how the algorithm performs when the sentiment of the review is not overt:

> Joker is a subversion of the trope of the hero's journey, made for a villain.

**Result**: *POSITIVE*

Looks like it did a good job! 


In [54]:
predictor.delete_endpoint()

That's all folks!