#  Sentiment Analysis with TensorFlow

A Convolutional Neural Net (CNN) is sometimes used in text classification tasks such as sentiment analysis.  We'll use a CNN built with TensorFlow to perform sentiment analysis in Amazon SageMaker on the IMDB dataset, which consists of movie reviews labeled as having positive or negative sentiment. Three aspects of Amazon SageMaker will be demonstrated:

- How to use Script Mode with a prebuilt TensorFlow container, along with a training script similar to one you would use outside SageMaker. 
- Local Mode training, which allows you to test your code on your notebook instance before creating a full scale training job.
- Batch Transform for offline, asynchronous predictions on large batches of data. 

#  Prepare Dataset

We'll begin by loading the reviews dataset, and padding the reviews so all reviews have the same length.  Each review is represented as an array of numbers, where each number represents an indexed word.  Training data for both Local Mode and Hosted Training must be saved as files, so we'll also save the transformed data to files.

In [1]:
import os
from keras.preprocessing import sequence
from keras.datasets import imdb

max_features = 20000
maxlen = 400

(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
print(len(x_train), 'train sequences')
print(len(x_test), 'test sequences')

x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)
print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)

Using TensorFlow backend.


25000 train sequences
25000 test sequences
x_train shape: (25000, 400)
x_test shape: (25000, 400)
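To make the padding step concrete, here is a minimal NumPy sketch of what `pad_sequences` does with its default settings: zeros are added on the left, and over-long sequences lose tokens from the left. The helper name `pad_left` and the toy sequences are assumptions made for illustration only; the notebook itself relies on the Keras function.

```python
import numpy as np

def pad_left(sequences, maxlen, value=0):
    """Left-pad (and left-truncate) integer sequences to a fixed length."""
    out = np.full((len(sequences), maxlen), value, dtype=np.int32)
    for row, seq in enumerate(sequences):
        if len(seq) == 0:
            continue  # nothing to copy; the row stays all padding
        trimmed = list(seq)[-maxlen:]        # keep the last `maxlen` tokens
        out[row, -len(trimmed):] = trimmed   # right-align, padding on the left
    return out

toy_reviews = [[1, 14, 22], [1, 5, 6, 7, 8]]  # hypothetical word indices
padded = pad_left(toy_reviews, maxlen=4)
print(padded)
```

The short review gains a leading zero, while the long review keeps only its last four tokens, so every row ends up with the same length.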


In [2]:
# Create local directories for the training, test, and CSV test data.
data_dir = os.path.join(os.getcwd(), 'data')
os.makedirs(data_dir, exist_ok=True)

train_dir = os.path.join(data_dir, 'train')
os.makedirs(train_dir, exist_ok=True)

test_dir = os.path.join(data_dir, 'test')
os.makedirs(test_dir, exist_ok=True)

# The CSV copy of the test data is used later for Batch Transform.
csv_test_dir = os.path.join(data_dir, 'csv-test')
os.makedirs(csv_test_dir, exist_ok=True)

In [3]:
import numpy as np

np.save(os.path.join(train_dir, 'x_train.npy'), x_train)
np.save(os.path.join(train_dir, 'y_train.npy'), y_train)
np.save(os.path.join(test_dir, 'x_test.npy'), x_test)
np.save(os.path.join(test_dir, 'y_test.npy'), y_test)
np.savetxt(os.path.join(csv_test_dir, 'csv-test.csv'), np.array(x_test[:100], dtype=np.int32), fmt='%d', delimiter=",")
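As a quick check on the CSV format written above, the sketch below saves a small array with the same `fmt` and `delimiter` settings and reads it back. The array `sample` is a hypothetical stand-in for `x_test[:100]`, and a temporary directory is used so nothing in the notebook's `data` folder is touched.

```python
import os
import tempfile
import numpy as np

# Hypothetical stand-in for x_test[:100]: three padded reviews of length 5.
sample = np.array([[0, 0, 1, 14, 22],
                   [0, 1, 5, 6, 7],
                   [1, 9, 8, 7, 6]], dtype=np.int32)

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, 'csv-test.csv')
    np.savetxt(csv_path, sample, fmt='%d', delimiter=',')

    # Each line of the file holds one review as comma-separated integers.
    with open(csv_path) as f:
        first_line = f.readline().strip()

    restored = np.loadtxt(csv_path, delimiter=',', dtype=np.int32, ndmin=2)

print(first_line)
print(np.array_equal(sample, restored))
```

Each row of the file is one review, which is the layout a row-per-record CSV consumer would expect.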

# Local Mode Training

Amazon SageMaker’s Local Mode training feature is a convenient way to make sure your code is working as expected before moving on to full scale, hosted training. With Local Mode, you can run quick tests with just a sample of training data, and/or a small number of epochs (passes over the full training set), while avoiding the time and expense of attempting full scale hosted training using possibly buggy code.  

To train in Local Mode, docker-compose (or nvidia-docker-compose, for GPU) must be installed on the notebook instance. Running the following script will install docker-compose or nvidia-docker-compose and configure the notebook environment for you.

In [4]:
!/bin/bash ./setup.sh

/bin/bash: ./setup.sh: No such file or directory


The next step is to set up a TensorFlow Estimator for Local Mode training. A key parameter for the Estimator is `train_instance_type`, which is the kind of hardware on which training will run. In the case of Local Mode, we simply set this parameter to `local_gpu` to invoke Local Mode training on the GPU, or to `local` if the instance has only a CPU. Other parameters of note are the algorithm’s hyperparameters, which are passed in as a dictionary, and a Boolean parameter indicating that we are using Script Mode.

In [5]:
import sagemaker
from sagemaker.tensorflow import TensorFlow

model_dir = '/opt/ml/model'
train_instance_type = 'local'
tornasole_s3 = 's3://' + sagemaker.Session().default_bucket() + "/tornasole-parameters/"
hyperparameters = {'epochs': 1, 'batch_size': 128, 
                   'tornasole-save-interval': 100, 'tornasole_outdir' : tornasole_s3 }
local_estimator = TensorFlow(entry_point='sentiment_keras.py',
                       model_dir=model_dir,
                       train_instance_type=train_instance_type,
                       train_instance_count=1,
                       hyperparameters=hyperparameters,
                       role=sagemaker.get_execution_role(),
                       base_job_name='tf-keras-sentiment',
                       framework_version='1.13.1',
                       py_version='py3',
                       image_name='072677473360.dkr.ecr.us-east-1.amazonaws.com/tornasole-preprod-tf-1.13.1-cpu:latest',
                       script_mode=True)

W0729 09:01:18.666472 4639487424 session.py:1106] Couldn't call 'get_role' to get Role ARN from role name olg to get Role path.


ValueError: The current AWS identity is not a role: arn:aws:iam::722321484884:user/olg, therefore it cannot be used as a SageMaker execution role
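The choice between `local` and `local_gpu` can also be made programmatically. The following is a minimal sketch, assuming that finding the `nvidia-smi` tool on the PATH is an acceptable proxy for having a usable GPU; the helper name `pick_local_instance_type` is hypothetical and is not part of the SageMaker SDK.

```python
import shutil

def pick_local_instance_type():
    """Return 'local_gpu' if an NVIDIA driver tool is on PATH, else 'local'."""
    # Assumption: the presence of `nvidia-smi` is treated as a proxy for a GPU.
    return 'local_gpu' if shutil.which('nvidia-smi') else 'local'

train_instance_type = pick_local_instance_type()
print(train_instance_type)
```

Note that the container image configured above is a CPU image, so a GPU image would presumably also be needed before `local_gpu` is useful.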

Now we'll briefly train the model in Local Mode.  Since this is just to make sure the code is working, we'll train for only one epoch.  (Note that on a CPU-based notebook instance, this one epoch will take at least 3 or 4 minutes.)  As you'll see from the logs below the cell when training is complete, even when trained for only one epoch, the accuracy of the model on training data is already at almost 80%.  

In [30]:
inputs = {'train': f'file://{train_dir}',
          'test': f'file://{test_dir}'}

local_estimator.fit(inputs)

Creating tmpsw39_nhj_algo-1-zwl3k_1 ... done
Attaching to tmpsw39_nhj_algo-1-zwl3k_1
algo-1-zwl3k_1  | 2019-07-16 15:20:30,596 sagemaker-containers INFO     Imported framework sagemaker_tensorflow_container.training
algo-1-zwl3k_1  | 2019-07-16 15:20:30,603 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
algo-1-zwl3k_1  | 2019-07-16 15:20:30,917 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
algo-1-zwl3k_1  | 2019-07-16 15:20:30,939 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
algo-1-zwl3k_1  | 2019-07-16 15:20:30,961 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
algo-1-zwl3k_1  | 2019-07-16 15:20:30,975 sagemaker-containers INFO     Invoking user script
algo-1-zwl3k_1  | 
algo-1-zwl3k_1  | Training Env:
algo-1-zwl3k_1  | 
algo-1-zwl3k_1  | {
algo-1-zwl3k_1  |

algo-1-zwl3k_1  | Instructions for updating:
algo-1-zwl3k_1  | Use tf.cast instead.
algo-1-zwl3k_1  | Instructions for updating:
algo-1-zwl3k_1  | Use tf.cast instead.
algo-1-zwl3k_1  | Instructions for updating:
algo-1-zwl3k_1  | Deprecated in favor of operator or tf.math.divide.
algo-1-zwl3k_1  | Instructions for updating:
algo-1-zwl3k_1  | Deprecated in favor of operator or tf.math.divide.
algo-1-zwl3k_1  | Train on 25000 samples, validate on 25000 samples
algo-1-zwl3k_1  | Epoch 1/1
algo-1-zwl3k_1  | https://www.tensorflow.org/guide/saved_model#structure_of_a_savedmodel_directory
algo-1-zwl3k_1  | 2019-07-16 15:27:04,144 sagemaker-containers INFO     Reporting training SUCCESS
tmpsw39_nhj_algo-1-zwl3k_1 exited with code 0
Aborting on container exit...
===== Job Complete =====
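The `sentiment_keras.py` entry point is not shown in this notebook. In Script Mode the container runs it like an ordinary script, passing each hyperparameter as a command-line argument and exposing the data channel locations through environment variables. The following is only a sketch of how such a script might read those values; the function `parse_args` and its defaults are assumptions, not the contents of the actual script.

```python
import argparse
import os

def parse_args(argv=None):
    """Parse hyperparameters and SageMaker-provided locations."""
    parser = argparse.ArgumentParser()
    # Hyperparameters arrive as command-line arguments, e.g. --epochs 1
    parser.add_argument('--epochs', type=int, default=1)
    parser.add_argument('--batch_size', type=int, default=128)
    # Channel directories are exposed through SM_CHANNEL_<NAME> variables.
    parser.add_argument('--train', default=os.environ.get('SM_CHANNEL_TRAIN'))
    parser.add_argument('--test', default=os.environ.get('SM_CHANNEL_TEST'))
    parser.add_argument('--model_dir', default=os.environ.get('SM_MODEL_DIR'))
    # parse_known_args tolerates extra arguments such as the tornasole settings.
    args, _unknown = parser.parse_known_args(argv)
    return args

# Simulated command line, mirroring the hyperparameters dictionary used above.
args = parse_args(['--epochs', '1', '--batch_size', '128',
                   '--tornasole-save-interval', '100'])
print(args.epochs, args.batch_size)
```

Because the script takes its inputs from arguments and environment variables, the same file can be run outside SageMaker by supplying `--train` and `--test` paths by hand.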


#  Hosted Training

After we've confirmed our code seems to be working using Local Mode training, we can move on to use SageMaker's hosted training, which uses compute resources separate from your notebook instance.  Hosted training spins up one or more instances (cluster) for training, and then tears the cluster down when training is complete. In general, hosted training is preferred for doing actual training, especially for large-scale, distributed training. Before starting hosted training, the data must be uploaded to S3. 

In [26]:
s3_prefix = sagemaker.Session().default_bucket()

traindata_s3_prefix = '{}/data/train'.format(s3_prefix)
testdata_s3_prefix = '{}/data/test'.format(s3_prefix)

train_s3 = sagemaker.Session().upload_data(path='./data/train/', key_prefix=traindata_s3_prefix)
test_s3 = sagemaker.Session().upload_data(path='./data/test/', key_prefix=testdata_s3_prefix)

inputs = {'train':train_s3, 'test': test_s3}
print(inputs)

{'train': 's3://sagemaker-us-east-1-072677473360/sagemaker-us-east-1-072677473360/data/train', 'test': 's3://sagemaker-us-east-1-072677473360/sagemaker-us-east-1-072677473360/data/test'}
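The bucket name appears twice in the printed URIs because it is used both as the upload bucket and as the start of the key prefix. The sketch below reproduces that URI shape with plain string formatting; `uploaded_uri` is a hypothetical helper and `example-bucket` a made-up name, so no AWS call is involved.

```python
def uploaded_uri(bucket, key_prefix):
    """Mimic the S3 URI form returned after uploading under key_prefix."""
    return 's3://{}/{}'.format(bucket, key_prefix)

bucket = 'example-bucket'  # hypothetical bucket name

# As in the cell above: the bucket name is reused as the start of the key.
print(uploaded_uri(bucket, '{}/data/train'.format(bucket)))

# Alternative: a plain key prefix keeps the bucket name from repeating.
print(uploaded_uri(bucket, 'data/train'))
```

Either layout works for training; the second simply gives shorter paths.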


With the training data now in S3, we're ready to set up an Estimator object for hosted training. It is similar to the Local Mode Estimator, except that `train_instance_type` is set to an ML instance type instead of a local type. Additionally, for actual training the number of epochs would be set to a number greater than one, as opposed to the single epoch used to test the code; in the cell below, the 10-epoch setting is commented out and the active configuration still uses one epoch.

In [27]:
train_instance_type = 'ml.p3.2xlarge'
#hyperparameters = {'epochs': 10, 'batch_size': 128}
hyperparameters = {'epochs': 1, 'batch_size': 128, 
                   'tornasole-save-interval': 1, 'tornasole_outdir' : tornasole_s3 }

estimator = TensorFlow(entry_point='sentiment_keras.py',
                       model_dir=model_dir,
                       train_instance_type=train_instance_type,
                       train_instance_count=1,
                       hyperparameters=hyperparameters,
                       role=sagemaker.get_execution_role(),
                       base_job_name='tf-keras-sentiment',
                       framework_version='1.13.1',
                       py_version='py3',
                       image_name='072677473360.dkr.ecr.us-east-1.amazonaws.com/tornasole-preprod-tf-1.13.1-cpu:latest',
                       script_mode=True)

With the change in training instance type and increase in epochs, we simply call `fit` to start the actual hosted training.  At the end of hosted training, you'll see from the logs below the cell that accuracy on the training set has greatly increased, and accuracy on the validation set is around 90%.  The model may be overfitting now (less able to generalize to data it has not yet seen), even though we are employing dropout as a regularization technique.  In a production situation, further investigation would be necessary.

In [28]:
estimator.fit(inputs)

2019-07-16 00:36:04 Starting - Starting the training job...
2019-07-16 00:36:09 Starting - Launching requested ML instances......
2019-07-16 00:37:17 Starting - Preparing the instances for training......
2019-07-16 00:38:15 Downloading - Downloading input data......
2019-07-16 00:39:27 Training - Downloading the training image......
2019-07-16 00:40:17 Training - Training image download completed. Training in progress.
2019-07-16 00:40:20,820 sagemaker-containers INFO     Imported framework sagemaker_tensorflow_container.training
2019-07-16 00:40:21,423 sagemaker-containers INFO     Invoking user script

Training Env:

{
    "additional_framework_parameters": {},
    "channel_input_dirs": {
        "test": "/opt/ml/input/data/test",
        "train": "/opt/ml/input/data/train"
    },
    "current_host": "algo-1",
    "framework_module": "sagemaker_tensorflow_container.training:main",
    "hosts": [
        "algo-1"
    ],
    "hyperparameters": {


  896/25000 [>.............................] - ETA: 3:13 - loss: 0.6988 - acc: 0.4922
 1024/25000 [>.............................] - ETA: 3:06 - loss: 0.6980 - acc: 0.4961
 1152/25000 [>.............................] - ETA: 3:00 - loss: 0.6998 - acc: 0.4931
 1280/25000 [>.............................] - ETA: 2:56 - loss: 0.7008 - acc: 0.4883
 1408/25000 [>.............................] - ETA: 3:06 - loss: 0.7003 - acc: 0.4879
 1536/25000 [>.............................] - ETA: 3:01 - loss: 0.6993 - acc: 0.4961
 1664/25000 [>.............................] - ETA: 2:57 - loss: 0.6984 - acc: 0.5072
 1792/25000 [=>............................] - ETA: 2:53 - loss: 0.6981 - acc: 0.5056
 1920/25000 [=>............................] - ETA: 2:50 - loss: 0.6969 - acc: 0.5120
 2048/25000 [=>............................] - ETA: 2:48 - loss: 0.6969 - acc: 0.5112
 2176/25000 [=>............................] - ETA: 2:45 - loss: 0.6969 

https://www.tensorflow.org/guide/saved_model#structure_of_a_savedmodel_directory
2019-07-16 00:44:17,528 sagemaker-containers INFO     Reporting training SUCCESS

2019-07-16 00:45:32 Uploading - Uploading generated training model
2019-07-16 00:45:32 Completed - Training job completed
Billable seconds: 437


# Batch Prediction


If our use case requires individual predictions in near real-time, SageMaker hosted endpoints can be created. Hosted endpoints can also be used for pseudo-batch prediction, but the process is more involved than simply using SageMaker's Batch Transform feature, which is designed for large-scale, asynchronous batch inference.

To use Batch Transform, we must first upload some test data in CSV format to Amazon S3.

In [None]:
csvtestdata_s3_prefix = '{}/data/csv-test'.format(s3_prefix)
csvtest_s3 = sagemaker.Session().upload_data(path='./data/csv-test/', key_prefix=csvtestdata_s3_prefix)
print(csvtest_s3)

A Transformer object must be set up to describe the Batch Transform job, including the amount and type of inference hardware to be used.  Then the actual transform job itself is started with a call to the `transform` method of the Transformer.

In [None]:
transformer = estimator.transformer(instance_count=1, instance_type='ml.m5.xlarge')
transformer.transform(csvtest_s3, content_type='text/csv')
print('Waiting for transform job: ' + transformer.latest_transform_job.job_name)
transformer.wait()

We can now download the batch predictions from S3 to the local filesystem on the notebook instance; the predictions are contained in a file with a .out extension, and are embedded in JSON.  Next we'll load the JSON and examine the predictions, which are confidence scores from 0.0 to 1.0 where numbers close to 1.0 indicate positive sentiment, while numbers close to 0.0 indicate negative sentiment.

In [None]:
import json

batch_output = transformer.output_path
!mkdir -p batch_data/output
!aws s3 cp --recursive $batch_output/ batch_data/output/

with open('batch_data/output/csv-test.csv.out', 'r') as f:
    jstr = json.load(f)
    results = [float('%.3f'%(item)) for sublist in jstr['predictions'] for item in sublist]
    print(results)
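The flattening step in the cell above can be tried without running a transform job. The sketch below applies the same logic to a small JSON string whose shape (a `predictions` key holding one single-element list per review) follows the parsing code above; the score values themselves are made up for illustration, and `flatten_predictions` is a hypothetical helper.

```python
import json

# Hypothetical contents of a .out file: one single-element list per review.
sample_output = '{"predictions": [[0.0312345], [0.9876543], [0.5]]}'

def flatten_predictions(json_text, ndigits=3):
    """Flatten nested prediction lists into rounded float scores."""
    payload = json.loads(json_text)
    return [round(float(score), ndigits)
            for row in payload['predictions'] for score in row]

scores = flatten_predictions(sample_output)
print(scores)
```

Using `round` here is equivalent in intent to the `'%.3f'` formatting above: both keep three decimal places per score.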

Now let's look at the text of some actual reviews to see the predictions in action.  First, we have to convert the integers representing the words back to the words themselves by using a reversed dictionary.  Next we can decode the reviews, taking into account that the first 3 indices were reserved for "padding", "start of sequence", and "unknown", and removing a string of unknown tokens from the start of the review.
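The offset of 3 is easiest to see with a toy example. The sketch below uses a hypothetical four-word vocabulary in place of the real IMDB word index, and a made-up encoded review; `decode_review` is an illustrative helper that follows the same steps as the cell that comes next.

```python
import re

# Hypothetical miniature vocabulary; the real one comes from imdb.get_word_index().
toy_word_index = {'the': 1, 'movie': 2, 'was': 3, 'dull': 4}
reverse_toy_index = {rank: word for word, rank in toy_word_index.items()}

def decode_review(encoded, reverse_index):
    """Map encoded integers back to words; reserved indices become '?'."""
    words = ' '.join(reverse_index.get(i - 3, '?') for i in encoded)
    return re.sub(r'^[\?\s]+', '', words)  # drop the leading run of '?' tokens

# 0 = padding, 1 = start of sequence, 2 = unknown; real words start at 4.
encoded_review = [0, 0, 1, 4, 5, 6, 2, 7]
decoded = decode_review(encoded_review, reverse_toy_index)
print(decoded)
```

The leading padding and start tokens are stripped by the regular expression, while an unknown token in the middle of the review remains visible as `?`.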

In [None]:
import re

regex = re.compile(r'^[\?\s]+')

word_index = imdb.get_word_index()
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])
first_decoded_review = ' '.join([reverse_word_index.get(i - 3, '?') for i in x_test[3]])
regex.sub('', first_decoded_review)

Overall, this review looks fairly negative.  Let's compare the actual label with the prediction:

In [None]:
def get_sentiment(score):
    return 'positive' if score > 0.5 else 'negative' 

print('Labeled sentiment for this review is {}, predicted sentiment is {}'.format(get_sentiment(y_test[3]), 
                                                                                  get_sentiment(results[3])))

Our negative sentiment prediction agrees with the label for this review.  Let's now examine another review:

In [None]:
second_decoded_review = ' '.join([reverse_word_index.get(i - 3, '?') for i in x_test[10]])
regex.sub('', second_decoded_review)

In [None]:
print('Labeled sentiment for this review is {}, predicted sentiment is {}'.format(get_sentiment(y_test[10]), 
                                                                                  get_sentiment(results[10])))

Again, the prediction agreed with the label for the test data.  Note that there is no need to clean up any Batch Transform resources:  after the transform job is complete, the cluster used to make inferences is torn down.
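Rather than checking reviews one at a time, the same comparison can be applied to every prediction at once. The following is a minimal sketch: `agreement_rate` is a hypothetical helper, and the labels and scores are made-up values used only to show the calculation. In the notebook it could be called as `agreement_rate(y_test[:100], results)`, since the CSV file holds the first 100 test reviews.

```python
def get_sentiment(score):
    return 'positive' if score > 0.5 else 'negative'

def agreement_rate(labels, scores):
    """Fraction of reviews whose predicted sentiment matches the label."""
    matches = sum(get_sentiment(label) == get_sentiment(score)
                  for label, score in zip(labels, scores))
    return matches / len(scores)

# Hypothetical labels and scores, for illustration only.
toy_labels = [0, 1, 1, 0]
toy_scores = [0.12, 0.91, 0.40, 0.08]
print(agreement_rate(toy_labels, toy_scores))
```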

Now that we've reviewed some sample predictions as a sanity check, we're finished.  Of course, in a typical production situation, the data science project lifecycle is iterative, with repeated cycles of refining the model using a tool such as Amazon SageMaker's Automatic Model Tuning feature, and gathering more data.  