# Creating a Sentiment Analysis Web App
## Using PyTorch and SageMaker

# General Outline

Outline for SageMaker projects using a notebook instance.

1. Download or otherwise retrieve the data.
2. Process / Prepare the data.
3. Upload the processed data to S3.
4. Train a chosen model.
5. Test the trained model (typically using a batch transform job).
6. Deploy the trained model.
7. Use the deployed model.


## Downloading the data

Download [IMDb dataset](http://ai.stanford.edu/~amaas/data/sentiment/) dataset

> Maas, Andrew L., et al. [Learning Word Vectors for Sentiment Analysis](http://ai.stanford.edu/~amaas/data/sentiment/). In _Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies_. Association for Computational Linguistics, 2011.

In [1]:
%mkdir ../data
!wget -O ../data/aclImdb_v1.tar.gz http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -zxf ../data/aclImdb_v1.tar.gz -C ../data

mkdir: cannot create directory ‘../data’: File exists
--2019-11-10 17:52:40--  http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
Resolving ai.stanford.edu (ai.stanford.edu)... 171.64.68.10
Connecting to ai.stanford.edu (ai.stanford.edu)|171.64.68.10|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 84125825 (80M) [application/x-gzip]
Saving to: ‘../data/aclImdb_v1.tar.gz’


2019-11-10 17:52:45 (15.6 MB/s) - ‘../data/aclImdb_v1.tar.gz’ saved [84125825/84125825]



## Preparing and Processing the data

Doing some initial data processing. Will read in each of the reviews and combine them into a single input structure. Then, the dataset will split into a training set and a testing set.

In [2]:
import os
import glob

def read_imdb_data(data_dir='../data/aclImdb'):
    data = {}
    labels = {}
    
    for data_type in ['train', 'test']:
        data[data_type] = {}
        labels[data_type] = {}
        
        for sentiment in ['pos', 'neg']:
            data[data_type][sentiment] = []
            labels[data_type][sentiment] = []
            
            path = os.path.join(data_dir, data_type, sentiment, '*.txt')
            files = glob.glob(path)
            
            for f in files:
                with open(f) as review:
                    data[data_type][sentiment].append(review.read())
                    # Here we represent a positive review by '1' and a negative review by '0'
                    labels[data_type][sentiment].append(1 if sentiment == 'pos' else 0)
                    
            assert len(data[data_type][sentiment]) == len(labels[data_type][sentiment]), \
                    "{}/{} data size does not match labels size".format(data_type, sentiment)
                
    return data, labels

In [3]:
data, labels = read_imdb_data()
print("IMDB reviews: train = {} pos / {} neg, test = {} pos / {} neg".format(
            len(data['train']['pos']), len(data['train']['neg']),
            len(data['test']['pos']), len(data['test']['neg'])))

IMDB reviews: train = 12500 pos / 12500 neg, test = 12500 pos / 12500 neg


Now the raw training and testing data from the downloaded dataset are read, the positive and negative reviews will combine and shuffle the resulting records.

In [4]:
from sklearn.utils import shuffle

def prepare_imdb_data(data, labels):
    """Prepare training and test sets from IMDb movie reviews."""
    
    #Combine positive and negative reviews and labels
    data_train = data['train']['pos'] + data['train']['neg']
    data_test = data['test']['pos'] + data['test']['neg']
    labels_train = labels['train']['pos'] + labels['train']['neg']
    labels_test = labels['test']['pos'] + labels['test']['neg']
    
    #Shuffle reviews and corresponding labels within training and test sets
    data_train, labels_train = shuffle(data_train, labels_train)
    data_test, labels_test = shuffle(data_test, labels_test)
    
    # Return a unified training data, test data, training labels, test labets
    return data_train, data_test, labels_train, labels_test

In [5]:
train_X, test_X, train_y, test_y = prepare_imdb_data(data, labels)
print("IMDb reviews (combined): train = {}, test = {}".format(len(train_X), len(test_X)))

IMDb reviews (combined): train = 25000, test = 25000


Now training and testing sets unified and prepared, do a quick check and see an example of the data the model will be trained on. This is generally a good idea as it allows to see how each of the further processing steps affects the reviews and it also ensures that the data has been loaded correctly.

In [6]:
print(train_X[100])
print(train_y[100])

Released in 1965, but clearly shot years earlier, this is an inept little crime melodrama with some inept sexploitation up front. As usual for grindhouse flicks of era, there's a fair amount of undressing and dressing for no reason complemented by lousy music, annoying narration, and awkward editing. The coffee shop scene lays the excruciating groundwork, as we chop back and forth between characters to avoid actually seeing them speak their lines. All we get are reaction shots to the off-screen character's voice! 50s-pretty Misty Ayers strips to her French-cut panties a couple of times before the action gets started. She's accompanied continuously by what is apparently stock music from romantic to western to mother-does-the-dishes, mixed randomly to produce, among other things, the most thrilling cigarette lighting ever captured on film. Watch as he taps it! Watch as he strikes the match! Will he inhale or will he be captured by Apaches? Only time will tell!! The film tells the sordid 

The first step in processing the reviews is to make sure that any html tags that appear should be removed. In addition we wish to tokenize our input, that way words such as *entertained* and *entertaining* are considered the same with regard to sentiment analysis.

In [7]:
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import *

import re
from bs4 import BeautifulSoup

def review_to_words(review):
    nltk.download("stopwords", quiet=True)
    stemmer = PorterStemmer()
    
    text = BeautifulSoup(review, "html.parser").get_text() # Remove HTML tags
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower()) # Convert to lower case
    words = text.split() # Split string into words
    words = [w for w in words if w not in stopwords.words("english")] # Remove stopwords
    words = [PorterStemmer().stem(w) for w in words] # stem
    
    return words

The `review_to_words` method defined above uses `BeautifulSoup` to remove any html tags that appear and uses the `nltk` package to tokenize the reviews. As a check to ensure we know how everything is working, try applying `review_to_words` to one of the reviews in the training set.

In [8]:
words_train_sample = review_to_words(train_X[100])
words_train_sample

['releas',
 '1965',
 'clearli',
 'shot',
 'year',
 'earlier',
 'inept',
 'littl',
 'crime',
 'melodrama',
 'inept',
 'sexploit',
 'front',
 'usual',
 'grindhous',
 'flick',
 'era',
 'fair',
 'amount',
 'undress',
 'dress',
 'reason',
 'complement',
 'lousi',
 'music',
 'annoy',
 'narrat',
 'awkward',
 'edit',
 'coffe',
 'shop',
 'scene',
 'lay',
 'excruci',
 'groundwork',
 'chop',
 'back',
 'forth',
 'charact',
 'avoid',
 'actual',
 'see',
 'speak',
 'line',
 'get',
 'reaction',
 'shot',
 'screen',
 'charact',
 'voic',
 '50',
 'pretti',
 'misti',
 'ayer',
 'strip',
 'french',
 'cut',
 'panti',
 'coupl',
 'time',
 'action',
 'get',
 'start',
 'accompani',
 'continu',
 'appar',
 'stock',
 'music',
 'romant',
 'western',
 'mother',
 'dish',
 'mix',
 'randomli',
 'produc',
 'among',
 'thing',
 'thrill',
 'cigarett',
 'light',
 'ever',
 'captur',
 'film',
 'watch',
 'tap',
 'watch',
 'strike',
 'match',
 'inhal',
 'captur',
 'apach',
 'time',
 'tell',
 'film',
 'tell',
 'sordid',
 'tale',
 

The method below applies the `review_to_words` method to each of the reviews in the training and testing datasets. In addition it caches the results. This is because performing this processing step can take a long time. This way if you are unable to complete the notebook in the current session, you can come back without needing to process the data a second time.

In [9]:
import pickle

cache_dir = os.path.join("../cache", "sentiment_analysis")  # where to store cache files
os.makedirs(cache_dir, exist_ok=True)  # ensure cache directory exists

def preprocess_data(data_train, data_test, labels_train, labels_test,
                    cache_dir=cache_dir, cache_file="preprocessed_data.pkl"):
    """Convert each review to words; read from cache if available."""

    # If cache_file is not None, try to read from it first
    cache_data = None
    if cache_file is not None:
        try:
            with open(os.path.join(cache_dir, cache_file), "rb") as f:
                cache_data = pickle.load(f)
            print("Read preprocessed data from cache file:", cache_file)
        except:
            pass  # unable to read from cache, but that's okay
    
    # If cache is missing, then do the heavy lifting
    if cache_data is None:
        # Preprocess training and test data to obtain words for each review
        #words_train = list(map(review_to_words, data_train))
        #words_test = list(map(review_to_words, data_test))
        words_train = [review_to_words(review) for review in data_train]
        words_test = [review_to_words(review) for review in data_test]
        
        # Write to cache file for future runs
        if cache_file is not None:
            cache_data = dict(words_train=words_train, words_test=words_test,
                              labels_train=labels_train, labels_test=labels_test)
            with open(os.path.join(cache_dir, cache_file), "wb") as f:
                pickle.dump(cache_data, f)
            print("Wrote preprocessed data to cache file:", cache_file)
    else:
        # Unpack data loaded from cache file
        words_train, words_test, labels_train, labels_test = (cache_data['words_train'],
                cache_data['words_test'], cache_data['labels_train'], cache_data['labels_test'])
    
    return words_train, words_test, labels_train, labels_test

In [10]:
# Preprocess data
train_X, test_X, train_y, test_y = preprocess_data(train_X, test_X, train_y, test_y)

Read preprocessed data from cache file: preprocessed_data.pkl


## Transform the data

Each word we will represent as an integer. Of course, some of the words that appear in the reviews occur very infrequently and so likely don't contain much information for the purposes of sentiment analysis. So, the size of  working vocabulary get a fixed number and will only include the words that appear most frequently. Then all of the infrequent words will combine into a single category and, label it as `1`.

Since using a recurrent neural network, it will be convenient if the length of each review is the same. To do this, a size of reviews is conatnt and then short reviews is pad with the category 'no word' (which we will label `0`) and truncate long reviews.

In [11]:
import numpy as np

def build_dict(data, vocab_size = 5000):
    """Construct and return a dictionary mapping each of the most frequently appearing words to a unique integer."""
    
    # Determine how often each word appears in `data`. Note that `data` is a list of sentences and that a
    # sentence is a list of words.
    
    word_count = {} # A dict storing the words that appear in the reviews along with how often they occur
    
    for words in data:
        for word in words:
            word_count[word] = word_count.get(word, 0) + 1
    
    # Sort the words found in `data` so that sorted_words[0] is the most frequently appearing word and
    # sorted_words[-1] is the least frequently appearing word.
    
    sorted_words = [w[0] for w in sorted(word_count.items(), key = lambda kv:kv[1], reverse = True)]
    
    word_dict = {} # This is what we are building, a dictionary that translates words into integers
    for idx, word in enumerate(sorted_words[:vocab_size - 2]): # The -2 is so that we save room for the 'no word'
        word_dict[word] = idx + 2                              # 'infrequent' labels
        
    return word_dict

In [12]:
word_dict = build_dict(train_X)

The five most frequently appearing (tokenized) words in the training set

In [13]:
for i in range(2, 7):
    print(list(word_dict.keys())[list(word_dict.values()).index(i)])


movi
film
one
like
time


### Save `word_dict`

Later on when an endpoint - which processes a submitted review - is constructed `word_dict` is needed. As such, it will be saved to a file now for future use.

In [14]:
data_dir = '../data/pytorch' # The folder we will use for storing data
if not os.path.exists(data_dir): # Make sure that the folder exists
    os.makedirs(data_dir)

In [15]:
with open(os.path.join(data_dir, 'word_dict.pkl'), "wb") as f:
    pickle.dump(word_dict, f)

### Transform the reviews

Word dictionary transform the words appearing in the reviews into integers.

In [16]:
def convert_and_pad(word_dict, sentence, pad=500):
    NOWORD = 0 # We will use 0 to represent the 'no word' category
    INFREQ = 1 # and we use 1 to represent the infrequent words, i.e., words not appearing in word_dict
    
    working_sentence = [NOWORD] * pad
    
    for word_index, word in enumerate(sentence[:pad]):
        if word in word_dict:
            working_sentence[word_index] = word_dict[word]
        else:
            working_sentence[word_index] = INFREQ
            
    return working_sentence, min(len(sentence), pad)

def convert_and_pad_data(word_dict, data, pad=500):
    result = []
    lengths = []
    
    for sentence in data:
        converted, leng = convert_and_pad(word_dict, sentence, pad)
        result.append(converted)
        lengths.append(leng)
        
    return np.array(result), np.array(lengths)

In [17]:
train_X, train_X_len = convert_and_pad_data(word_dict, train_X)
test_X, test_X_len = convert_and_pad_data(word_dict, test_X)

## Upload the data to S3

Upload the training dataset to S3 in order for training code to access it. 

### Save the processed training dataset locally

It is important to note the format of the saving data when developing training code. In this case, each row of the dataset has the form `label`, `length`, `review[500]` where `review[500]` is a sequence of `500` integers representing the words in the review.

In [19]:
import pandas as pd
    
pd.concat([pd.DataFrame(train_y), pd.DataFrame(train_X_len), pd.DataFrame(train_X)], axis=1) \
        .to_csv(os.path.join(data_dir, 'train.csv'), header=False, index=False)

### Uploading the training data


Next, upload the training data to the SageMaker default S3 bucket so that training model can access to it .

In [20]:
import sagemaker

sagemaker_session = sagemaker.Session()

bucket = sagemaker_session.default_bucket()
prefix = 'sagemaker/sentiment_rnn'

role = sagemaker.get_execution_role()

In [21]:
input_data = sagemaker_session.upload_data(path=data_dir, bucket=bucket, key_prefix=prefix)

## Step 4: Build and Train the PyTorch Model

A model comprises three objects

 - Model Artifacts,
 - Training Code, and
 - Inference Code,
 
Here we will use containers provided by Amazon with the added benefit of being able to include our own custom code.

Start by implementing our own neural network in PyTorch along with a training script. 

There are three parameters for tweak to improve the performance of our model. These are the embedding dimension, the hidden dimension and the size of the vocabulary. We want to make these parameters configurable in the training script

In order to construct a PyTorch model using SageMaker we must provide SageMaker with a training script. We may optionally include a directory which will be copied to the container and from which our training code will be run. When the training container is executed it will check the uploaded directory (if there is one) for a `requirements.txt` file and install any required Python libraries, after which the training script will be run.

### Training the model

When a PyTorch model is constructed in SageMaker, an entry point must be specified. This is the Python file which will be executed when the model is trained. Inside of the `train` directory is a file called `train.py` which has been provided and which contains most of the necessary code to train our model. 
The way that SageMaker passes hyperparameters to the training script is by way of arguments. These arguments can then be parsed and used in the training script. To see how this is done take a look at the provided `train/train.py` file.

In [26]:
from sagemaker.pytorch import PyTorch

estimator = PyTorch(entry_point="train.py",
                    source_dir="train",
                    role=role,
                    framework_version='0.4.0',
                    train_instance_count=1,
                    train_instance_type='ml.m4.xlarge',
                    hyperparameters={
                        'epochs': 10,
                        'hidden_dim': 200,
                    })

In [None]:
estimator.fit({'training': input_data})

2019-11-10 18:20:48 Starting - Starting the training job...
2019-11-10 18:20:56 Starting - Launching requested ML instances......
2019-11-10 18:22:00 Starting - Preparing the instances for training......
2019-11-10 18:22:56 Downloading - Downloading input data...
2019-11-10 18:23:38 Training - Training image download completed. Training in progress.[31mbash: cannot set terminal process group (-1): Inappropriate ioctl for device[0m
[31mbash: no job control in this shell[0m
[31m2019-11-10 18:23:39,735 sagemaker-containers INFO     Imported framework sagemaker_pytorch_container.training[0m
[31m2019-11-10 18:23:39,738 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)[0m
[31m2019-11-10 18:23:39,751 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.[0m
[31m2019-11-10 18:23:40,342 sagemaker_pytorch_container.training INFO     Invoking user training script.[0m
[31m2019-11-10 18:23:40,585 sagemaker-containers INFO     

[31mEpoch: 2, BCELoss: 0.5929516505221931[0m
[31mEpoch: 3, BCELoss: 0.4955745679991586[0m
[31mEpoch: 4, BCELoss: 0.4395241408931966[0m
[31mEpoch: 5, BCELoss: 0.387244349839736[0m
[31mEpoch: 6, BCELoss: 0.3553718973179253[0m
[31mEpoch: 7, BCELoss: 0.3322160015909039[0m
[31mEpoch: 9, BCELoss: 0.2882717698812485[0m

2019-11-10 20:01:45 Uploading - Uploading generated training model[31mEpoch: 10, BCELoss: 0.26960678033682767[0m
[31m2019-11-10 20:01:40,623 sagemaker-containers INFO     Reporting training SUCCESS[0m

2019-11-10 20:01:51 Completed - Training job completed
Training seconds: 5935
Billable seconds: 5935


## Testing the model

The model is tested by first deploying it and then sending the testing data to the deployed endpoint. 

## Deploy the model for testing

Currently the model takes input of the form `review_length, review[500]` where `review[500]` is a sequence of `500` integers which describe the words present in the review, encoded using `word_dict`. Fortunately, SageMaker provides built-in inference code for models with simple inputs such as this.

There is one thing that need to provide, however, and that is a function which loads the saved model. This function must be called `model_fn()` and takes as its only parameter a path to the directory where the model artifacts are stored. This function must also be present in the python file which is specified as the entry point. 

In [28]:
predictor = estimator.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')

---------------------------------------------------------------------------------------------------!

## Use the model for testing

Once deployed, the test data is send it off to deployed model to get some results. Once all of the results are collected the model accuarcy can be measured.

In [29]:
test_X = pd.concat([pd.DataFrame(test_X_len), pd.DataFrame(test_X)], axis=1)

In [30]:
# We split the data into chunks and send each chunk seperately, accumulating the results.

def predict(data, rows=512):
    split_array = np.array_split(data, int(data.shape[0] / float(rows) + 1))
    predictions = np.array([])
    for array in split_array:
        predictions = np.append(predictions, predictor.predict(array))
    
    return predictions

In [31]:
predictions = predict(test_X.values)
predictions = [round(num) for num in predictions]

In [32]:
from sklearn.metrics import accuracy_score
accuracy_score(test_y, predictions)

0.85784

### More testing

The model is trained and deplyed and processed reviews can be send to and which returns the predicted sentiment. However, ultimately the model should be able to accept an unprocessed review. For example, the following review.

In [33]:
test_review = 'The simplest pleasures in life are the best, and this film is one of them. Combining a rather basic storyline of love and adventure this movie transcends the usual weekend fair with wit and unmitigated charm.'

The question how do this review needs to be send to our model?

In order process the review eed to repeat these two steps needs to be repeat.
 - Removed any html tags and stemmed the input
 - Encoded the review as a sequence of integers using `word_dict`

In [35]:
words = review_to_words(input_data)
data_X, data_len = convert_and_pad(word_dict, words)

# Using data_X and data_len we construct an appropriate input tensor. Remember
# that our model expects input data of the form 'len, review[500]'.
data_pack = np.hstack((data_len, data_X))
data_pack = data_pack.reshape(1, -1)

test_data = torch.from_numpy(data_pack)

predictor.predict(test_data)

array(0.5269635, dtype=float32)

Since the return value of our model is close to `1`, we can be certain that the review we submitted is positive.

### Delete the endpoint

Once an endpoint is deployed it continues to run until it shut down.

In [36]:
estimator.delete_endpoint()

## Deploying the model

Now that the custom inference code has been written, the model will be created and delpyed. First a new PyTorchModel object is created which points to the model artifacts created during training and also points to the inference code. Then the deploy method to launch the deployment container is called.

**NOTE**: The default behaviour for a deployed PyTorch model is to assume that any input passed to the predictor is a `numpy` array. So it need to construct a simple wrapper around the `RealTimePredictor` class to accomodate simple strings.

In [38]:
from sagemaker.predictor import RealTimePredictor
from sagemaker.pytorch import PyTorchModel

class StringPredictor(RealTimePredictor):
    def __init__(self, endpoint_name, sagemaker_session):
        super(StringPredictor, self).__init__(endpoint_name, sagemaker_session, content_type='text/plain')

model = PyTorchModel(model_data=estimator.model_data,
                     role = role,
                     framework_version='0.4.0',
                     entry_point='predict.py',
                     source_dir='serve',
                     predictor_cls=StringPredictor)
predictor = model.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')

-----------------------------------------------------------------------------------------------------!

### Testing the model

Now that the deployed our model with the custom inference code is ready. To test the model, the first `250` positive and negative reviews loaded and send them to the endpoint, then the results is collected. 

In [39]:
import glob

def test_reviews(data_dir='../data/aclImdb', stop=250):
    
    results = []
    ground = []
    
    # We make sure to test both positive and negative reviews    
    for sentiment in ['pos', 'neg']:
        
        path = os.path.join(data_dir, 'test', sentiment, '*.txt')
        files = glob.glob(path)
        
        files_read = 0
        
        print('Starting ', sentiment, ' files')
        
        # Iterate through the files and send them to the predictor
        for f in files:
            with open(f) as review:
                # First, we store the ground truth (was the review positive or negative)
                if sentiment == 'pos':
                    ground.append(1)
                else:
                    ground.append(0)
                # Read in the review and convert to 'utf-8' for transmission via HTTP
                review_input = review.read().encode('utf-8')
                # Send the review to the predictor and store the results
                results.append(round(float(predictor.predict(review_input))))
                
            # Sending reviews to our endpoint one at a time takes a while so we
            # only send a small number of reviews
            files_read += 1
            if files_read == stop:
                break
            
    return ground, results

In [40]:
ground, results = test_reviews()

Starting  pos  files
Starting  neg  files


In [41]:
from sklearn.metrics import accuracy_score
accuracy_score(ground, results)

0.864

### Delete the endpoint

Remember to always shut down the endpoint.

In [44]:
predictor.delete_endpoint()