## Sentiment Analysis with SageMaker's Prebuilt Deep Learning Containers

In this module, we'll see how to train and test a Sentiment Analysis (Text Classification) model on SageMaker using SageMaker's Prebuilt Deep Learning containers.  These containers are available for TensorFlow, MXNet, PyTorch, and Chainer.  With this approach, you simply bring your own Python training script, and SageMaker handles the rest.  

We'll begin by importing some necessary libraries and downloading the Python training script.  The IAM role needed for permissions, such as access to data in Amazon S3, is pulled in from the SageMaker Notebook Instance.

In [1]:
import os
os.system("aws s3 cp s3://sagemaker-workshop-pdx/sentiment-analysis-module/sentiment.py sentiment.py")
import boto3
import sagemaker
from sagemaker.mxnet import MXNet
from sagemaker import get_execution_role

sagemaker_session = sagemaker.Session()

role = get_execution_role()

## Download training and test data

In this notebook, we will train the **Sentiment Analysis** model on the [SST-2 dataset (Stanford Sentiment Treebank 2)](https://nlp.stanford.edu/sentiment/index.html). The dataset consists of movie reviews with one sentence per review. The classification task is to label each review as positive or negative.  
We will download a preprocessed version of this dataset from the links below. Each line in the dataset contains space-separated tokens; the first token is the label: 1 for positive and 0 for negative.
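
To make the line format concrete, here is a minimal parsing sketch. The helper name `parse_line` is hypothetical, and the sample lines are illustrative strings written in the described format rather than lines copied from the dataset files.

```python
def parse_line(line):
    """Split one dataset line into (label, tokens)."""
    parts = line.split()
    label = int(parts[0])   # first token is the label: 1 = positive, 0 = negative
    tokens = parts[1:]      # remaining tokens are the words of the review
    return label, tokens

# Illustrative lines in the described format (not taken from the dataset files).
sample_lines = [
    "1 this movie was extremely good .",
    "0 the plot was very boring .",
]

parsed = [parse_line(line) for line in sample_lines]
for label, tokens in parsed:
    print(label, tokens)
```

The same helper could be applied line by line to `data/train` and `data/test` once they have been downloaded.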

In [2]:
%%bash
mkdir -p data
curl https://raw.githubusercontent.com/saurabh3949/Text-Classification-Datasets/master/stsa.binary.phrases.train > data/train
curl https://raw.githubusercontent.com/saurabh3949/Text-Classification-Datasets/master/stsa.binary.test > data/test 



## Uploading the data

We use the `sagemaker.Session.upload_data` function to upload our datasets to an S3 location. The return value `inputs` identifies the location; we will use it later when we start the training job.

In [3]:
inputs = sagemaker_session.upload_data(path='data', key_prefix='data/DEMO-sentiment')
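
As a rough illustration of what the upload produces, the sketch below computes the S3 keys that the files in a local directory would map to, without contacting S3. It assumes that `upload_data` mirrors the directory layout under `key_prefix` and returns an `s3://<bucket>/<key_prefix>` URI. The helper names and the bucket name `my-example-bucket` are hypothetical and are not part of the SageMaker SDK.

```python
import os
import tempfile

def planned_s3_keys(path, key_prefix):
    """Return the S3 keys that files under a local directory would map to (assumed layout)."""
    keys = []
    for dirpath, _, filenames in os.walk(path):
        for name in filenames:
            rel = os.path.relpath(os.path.join(dirpath, name), start=path)
            keys.append(key_prefix + '/' + rel.replace(os.sep, '/'))
    return sorted(keys)

def expected_uri(bucket, key_prefix):
    """Build the URI we assume is returned for a directory upload."""
    return 's3://{}/{}'.format(bucket, key_prefix)

# Local stand-in for the notebook's data/ directory; nothing is uploaded.
demo_dir = tempfile.mkdtemp()
for name in ('train', 'test'):
    open(os.path.join(demo_dir, name), 'w').close()

print(planned_s3_keys(demo_dir, 'data/DEMO-sentiment'))
print(expected_uri('my-example-bucket', 'data/DEMO-sentiment'))
```

Printing `inputs` in the notebook is the reliable way to confirm the actual location used by your session.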

## Implement the training function

We need to provide a training script that can run on the SageMaker platform. The training script is essentially the same as one you would write for local training, except that it must define a `train` function. When SageMaker calls this function, it passes in arguments that describe the training environment. Check the script below to see how this works.

The script here is a simplified implementation of ["Bag of Tricks for Efficient Text Classification"](https://arxiv.org/abs/1607.01759), as implemented by Facebook's [FastText](https://github.com/facebookresearch/fastText/) for text classification. The model maps each word to a vector and averages the vectors of all the words in a sentence to form a hidden representation of the sentence, which is then fed into a softmax classification layer. Please refer to the paper for more details.
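
The average-then-softmax forward pass described above can be sketched in plain Python. This is not the code in `sentiment.py`, which uses MXNet Gluon. The function names, the tiny vocabulary, and the random weights are all assumptions chosen only to illustrate the idea.

```python
import math
import random

def average_embeddings(token_ids, embeddings):
    """Average the embedding vectors of all tokens in a sentence."""
    dim = len(embeddings[0])
    total = [0.0] * dim
    for tid in token_ids:
        for j in range(dim):
            total[j] += embeddings[tid][j]
    return [v / len(token_ids) for v in total]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def forward(token_ids, embeddings, weights, bias):
    """Sentence representation = mean word vector; output = softmax(W * hidden + b)."""
    hidden = average_embeddings(token_ids, embeddings)
    logits = [sum(w * h for w, h in zip(row, hidden)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)

# Toy sizes and random parameters (assumptions for illustration only).
random.seed(0)
vocab_size, embedding_size, num_classes = 10, 4, 2
embeddings = [[random.uniform(-0.5, 0.5) for _ in range(embedding_size)]
              for _ in range(vocab_size)]
weights = [[random.uniform(-0.5, 0.5) for _ in range(embedding_size)]
           for _ in range(num_classes)]
bias = [0.0] * num_classes

probs = forward([1, 4, 7], embeddings, weights, bias)  # a "sentence" of three token ids
print(probs)
```

In a trained model, the embeddings and classifier weights are learned jointly, and the index of the larger probability gives the predicted label.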

In [4]:
!cat 'sentiment.py'

from __future__ import print_function

import logging
import mxnet as mx
from mxnet import gluon, autograd, nd
from mxnet.gluon import nn
import numpy as np
import json
import time
import re
from mxnet.io import DataIter, DataBatch, DataDesc
import bisect, random
from collections import Counter
from itertools import chain, islice


logging.basicConfig(level=logging.DEBUG)

# ------------------------------------------------------------ #
# Training methods                                             #
# ------------------------------------------------------------ #

def train(current_host, hosts, num_cpus, num_gpus, channel_input_dirs, model_dir, hyperparameters, **kwargs):
    # retrieve the hyperparameters we set in notebook (with some defaults)
    batch_size = hyperparameters.get('batch_size', 8)
    epochs = hyperparameters.get('epochs', 2)
    learning_rate = hyperparameters.get('learning_rate', 0.01)
    log_interval = hyperparameters.get('log_interval'

## Run the training script on SageMaker

To keep our code readable and concise, we'll set up a training job using the SageMaker Python SDK, which provides many helper methods and conveniences. The SDK provides specific Estimator objects for various frameworks that abstract away the lower-level details of setting up training jobs. You can specify hyperparameters such as the learning rate, as well as the type and number of training instances.

In [5]:
m = MXNet("sentiment.py",
          role=role,
          train_instance_count=1,
          train_instance_type="ml.c5.9xlarge",
          framework_version="1.2.1",
          hyperparameters={'batch_size': 8,
                           'epochs': 2,
                           'learning_rate': 0.01,
                           'embedding_size': 50,
                           'log_interval': 1000})

After we've constructed our Estimator object, we can fit it to the training data we uploaded to S3. In this case, we're using the default training File mode; SageMaker makes sure our data is available in the training cluster's filesystem, so our training script can simply read the data from disk.  An alternative is Pipe mode, where the data is streamed directly to the container without being persisted to disk.
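
To illustrate what "read the data from disk" means in File mode, the sketch below defines a placeholder `train` function with the same signature as the one in `sentiment.py` and calls it locally against a temporary directory. The helper `load_channel` is hypothetical, the function body is a stand-in that only counts examples, and the local call merely imitates the arguments SageMaker would pass. The channel name `training` and the file name `train` follow the data uploaded earlier.

```python
import os
import tempfile

def load_channel(channel_dir, filename):
    """Read (label, tokens) pairs from a file inside a File-mode channel directory."""
    examples = []
    with open(os.path.join(channel_dir, filename)) as f:
        for line in f:
            parts = line.split()
            if parts:
                examples.append((int(parts[0]), parts[1:]))
    return examples

def train(current_host, hosts, num_cpus, num_gpus, channel_input_dirs,
          model_dir, hyperparameters, **kwargs):
    # Placeholder body: loads the data and returns the example count; no model is trained.
    training_dir = channel_input_dirs['training']
    train_examples = load_channel(training_dir, 'train')
    return len(train_examples)

# Local simulation: a temporary directory stands in for the channel directory.
local_dir = tempfile.mkdtemp()
with open(os.path.join(local_dir, 'train'), 'w') as f:
    f.write("1 this movie was extremely good .\n0 the plot was very boring .\n")

count = train(current_host='algo-1', hosts=['algo-1'], num_cpus=1, num_gpus=0,
              channel_input_dirs={'training': local_dir},
              model_dir=tempfile.mkdtemp(), hyperparameters={'epochs': 2})
print(count)
```

Because the script only sees ordinary file paths, the same function can be exercised on a laptop before launching a training job.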

In [6]:
m.fit(inputs)

INFO:sagemaker:Creating training-job with name: sagemaker-mxnet-2018-09-21-17-46-12-593


...............
2018-09-21 17:48:28,694 INFO - root - running container entrypoint
2018-09-21 17:48:28,694 INFO - root - starting train task
2018-09-21 17:48:28,699 INFO - container_support.training - Training starting
2018-09-21 17:48:30,517 INFO - mxnet_container.train - MXNetTrainingEnvironment: {'enable_cloudwatch_metrics': False, 'available_gpus': 0, 'channels': {u'training': {u'TrainingInputMode': u'File', u'RecordWrapperType': u'None', u'S3DistributionType': u'FullyReplicated'}}, '_ps_verbose': 0, 'resource_config': {u'hosts': [u'algo-1'], u'network_interface_name': u'ethwe', u'current_host': u'algo-1'}, 'user_script_name': u'sentiment.py', 'input_config_dir': '/opt/ml/input/config', 'channel_dirs': {u'training': u'/opt/ml/input/data/training'}, 'code_dir': '/opt/ml/code', 'output_data_dir': '/opt/ml/output/data/', 'output_dir': '/opt/ml/output', 'model_dir': '/opt/ml/model', 'hyperparameters': {u'sagemaker_program': u'sentiment.py', u'embedding_s

As can be seen from the logs, the model reaches more than 80% validation accuracy on the test set with the above hyperparameters after only two epochs (passes over the full training set).  

After training, we use the Estimator object to build and deploy a Predictor object. This creates a SageMaker endpoint that we can use to perform inference. In fact, we'll be able to perform inference on a standard JSON-encoded string array without having to use any special encoding formats. 

In [7]:
predictor = m.deploy(initial_instance_count=1, instance_type='ml.c4.xlarge')

INFO:sagemaker:Creating model with name: sagemaker-mxnet-2018-09-21-17-46-12-593
INFO:sagemaker:Creating endpoint with name sagemaker-mxnet-2018-09-21-17-46-12-593


-----------------------------------------------------------------!

The predictor runs inference on our input data and returns the predicted sentiment (1 for positive and 0 for negative).

In [8]:
data = ["this movie was extremely good .",
        "the plot was very boring .",
        "this film is so slick , superficial and trend-hoppy .",
        "i just could not watch it till the end .",
        "the movie was so enthralling !"]

response = predictor.predict(data)
print(response)

[1, 0, 0, 0, 1]
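
For readability, the numeric predictions can be paired with their input sentences. The sketch below is self-contained: it reuses the five sentences and copies the `[1, 0, 0, 0, 1]` output shown above instead of calling the endpoint, and the `LABEL_NAMES` mapping is an illustrative helper rather than part of the SDK.

```python
LABEL_NAMES = {1: "positive", 0: "negative"}

data = ["this movie was extremely good .",
        "the plot was very boring .",
        "this film is so slick , superficial and trend-hoppy .",
        "i just could not watch it till the end .",
        "the movie was so enthralling !"]

# Copied from the output above; in the notebook this would be predictor.predict(data).
response = [1, 0, 0, 0, 1]

labeled = [(sentence, LABEL_NAMES[int(label)])
           for sentence, label in zip(data, response)]
for sentence, name in labeled:
    print("{:8s} | {}".format(name, sentence))
```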


## Conclusion & Cleanup

You are now done with this module!  Return to the workshop lab guide whenever you're ready and continue with the next module(s).  

Remember to delete the prediction endpoint to release the instance(s) associated with it.

In [9]:
sagemaker.Session().delete_endpoint(predictor.endpoint)

INFO:sagemaker:Deleting endpoint with name: sagemaker-mxnet-2018-09-21-17-46-12-593
