## Training a sentiment analysis model with Chainer

In this notebook, we will train a model that will allow us to analyze text for positive or negative sentiment. The model will use a recurrent neural network with long short-term memory blocks to generate word embeddings.

To train with a Chainer script, we construct a `Chainer` estimator using the [sagemaker-python-sdk](https://github.com/aws/sagemaker-python-sdk). We pass in an `entry_point`, the name of a script that defines two functions with specific signatures (`train` and `model_fn`). SageMaker runs this script in a container that invokes these functions to train and load Chainer models.

For more on the Chainer container, please visit the sagemaker-chainer-containers repository:
https://github.com/aws/sagemaker-chainer-containers

In [1]:
# Setup
from sagemaker import get_execution_role
import sagemaker

sagemaker_session = sagemaker.Session()

# Retrieve the SageMaker-compatible IAM role used by this Notebook Instance.
role = get_execution_role()

## Downloading training and test data

We use helper functions provided by `chainer` to download and preprocess the data.

In [2]:
import dataset

train, test, vocab = dataset.get_stsa_dataset()



## Uploading the data

We save the preprocessed data to the local filesystem, and then use the `sagemaker.Session.upload_data` function to upload our datasets to an S3 location. The return value `inputs` identifies the S3 location, which we will use when we start the Training Job.

In [3]:
import os
import shutil

import numpy as np

train_data = [element[0] for element in train]
train_labels = [element[1] for element in train] 

test_data = [element[0] for element in test]
test_labels = [element[1] for element in test]

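# exist_ok=True keeps each call independent, so a directory that already
# exists does not prevent the remaining ones from being created.
os.makedirs('data/train', exist_ok=True)
os.makedirs('data/test', exist_ok=True)
os.makedirs('data/vocab', exist_ok=True)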

np.savez('data/train/train.npz', data=train_data, labels=train_labels)
np.savez('data/test/test.npz', data=test_data, labels=test_labels)
np.save('data/vocab/vocab.npy', vocab)

# Upload preprocessed data to S3
train_input = sagemaker_session.upload_data(path=os.path.join('data', 'train'),
                                            key_prefix='notebook/chainer_sentiment/train')
test_input = sagemaker_session.upload_data(path=os.path.join('data', 'test'),
                                           key_prefix='notebook/chainer_sentiment/test')
vocab_input = sagemaker_session.upload_data(path=os.path.join('data', 'vocab'),
                                            key_prefix='notebook/chainer_sentiment/vocab')

# Remove data from notebook instance (to save disk space)
shutil.rmtree('data')
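
As I understand the SDK, when `upload_data` is given a directory it uploads every file beneath it under `key_prefix` and returns the S3 URI of that prefix, which is why each dataset gets its own directory and prefix above. The sketch below imitates that mapping locally without touching S3. The function `planned_uploads` and the bucket name `example-bucket` are hypothetical and exist only for illustration; check the SDK documentation for the method's actual behavior.

```python
import os
import tempfile


def planned_uploads(path, bucket, key_prefix):
    """Return (local_file, s3_key) pairs and the URI of the key prefix.

    Hypothetical helper: it only mirrors how a directory upload is
    assumed to lay out keys; it never contacts S3.
    """
    pairs = []
    for dirpath, _dirnames, filenames in os.walk(path):
        for name in sorted(filenames):
            local = os.path.join(dirpath, name)
            # Keep the path relative to the uploaded directory in the key.
            rel = os.path.relpath(local, start=path)
            key = '/'.join([key_prefix] + rel.split(os.sep))
            pairs.append((local, key))
    return pairs, 's3://{}/{}'.format(bucket, key_prefix)


with tempfile.TemporaryDirectory() as root:
    train_dir = os.path.join(root, 'train')
    os.makedirs(train_dir)
    # An empty placeholder stands in for the real train.npz file.
    open(os.path.join(train_dir, 'train.npz'), 'wb').close()
    pairs, uri = planned_uploads(train_dir, 'example-bucket',
                                 'notebook/chainer_sentiment/train')
    keys = [key for _local, key in pairs]
```

Here `uri` plays the role of `train_input`: it names the prefix, not an individual file, so the training job receives everything stored under it.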

## Writing the Chainer training script to run on Amazon SageMaker

We need to provide a training script that can run on the SageMaker platform. The training script is essentially the same as one you would write for local training, except that it must provide a function `train` that returns a trained `chainer.Chain`. Because we use the same script to host the Chainer model, it also needs a function `model_fn` that loads a `chainer.Chain`; by default, Chainer models are saved to disk as `model.npz`. When SageMaker calls your `train` and `model_fn` functions, it passes in arguments that describe the training environment.

Check the script below, which uses `chainer` to train on any number of GPUs on a single machine, to see how this works. For more on implementing these functions, see the documentation at https://github.com/aws/sagemaker-python-sdk.

In [4]:
!cat 'code/sentiment_analysis.py'

# Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License"). You
# may not use this file except in compliance with the License. A copy of
# the License is located at
#
#     http://aws.amazon.com/apache2.0/
#
# or in the "license" file accompanying this file. This file is
# distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF
# ANY KIND, either express or implied. See the License for the specific
# language governing permissions and limitations under the License.

import os
import json

import numpy as np
import chainer
from chainer import training
from chainer import serializers
from chainer.training import extensions

import nets
from nlp_utils import convert_seq, split_text, normalize_text, transform_to_array

# ------------------------------------------------------------ #
# Training methods                                             #
# ---------------------
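
As a rough picture of the contract described in this section, the sketch below shows one possible shape for `train` and `model_fn`. It is not the contents of `sentiment_analysis.py`. The argument names (`hyperparameters`, `num_gpus`, `output_data_dir`, `channel_input_dirs`, `model_dir`) and the channel path are assumptions, and a plain dict saved as JSON stands in for a `chainer.Chain` saved as `model.npz` so that the example runs without Chainer installed.

```python
import json
import os
import tempfile


def train(hyperparameters, num_gpus, output_data_dir, channel_input_dirs):
    # Argument names are assumptions; consult the SDK documentation
    # for the names the container actually passes.
    epochs = int(hyperparameters.get('epochs', 10))
    batch_size = int(hyperparameters.get('batch_size', 64))
    device = 0 if num_gpus > 0 else -1  # Chainer uses -1 to mean CPU

    # Each channel given to fit() is assumed to arrive as a local directory.
    train_file = os.path.join(channel_input_dirs['train'], 'train.npz')

    # A real script would build, train, and return a chainer.Chain here.
    return {'epochs': epochs, 'batch_size': batch_size,
            'device': device, 'train_file': train_file}


def save_model(model, model_dir):
    # Stand-in for chainer.serializers.save_npz(path_to_model_npz, model).
    with open(os.path.join(model_dir, 'model.json'), 'w') as f:
        json.dump(model, f)


def model_fn(model_dir):
    # Stand-in for rebuilding the network and calling
    # chainer.serializers.load_npz(path_to_model_npz, model).
    with open(os.path.join(model_dir, 'model.json')) as f:
        return json.load(f)


with tempfile.TemporaryDirectory() as model_dir:
    trained = train({'epochs': 10, 'batch_size': 64}, num_gpus=1,
                    output_data_dir=model_dir,
                    # Illustrative path only.
                    channel_input_dirs={'train': '/opt/ml/input/data/train'})
    save_model(trained, model_dir)
    restored = model_fn(model_dir)
```

The point of the round trip is that `model_fn` must be able to reconstruct the model from files alone, since hosting happens in a separate container from training.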

## Running the training script on SageMaker

The `Chainer` class allows us to run our training function as a training job on SageMaker infrastructure. We need to configure it with our training script, an IAM role, the number of training instances, and the training instance type. In this case we will run our training job on one `ml.p3.2xlarge` instance.

Chainer scripts can distribute training with the `chainermn` package, but this script does not use it, so it should only be run on one instance.

In [5]:
from sagemaker.chainer.estimator import Chainer

chainer_estimator = Chainer(entry_point='sentiment_analysis.py', source_dir="code", role=role,
                            sagemaker_session=sagemaker_session,
                            train_instance_count=1, train_instance_type='ml.p3.2xlarge',
                            hyperparameters={'epochs': 10, 'batch_size': 64})

chainer_estimator.fit({'train': train_input, 'test': test_input, 'vocab': vocab_input})

INFO:sagemaker:Creating training-job with name: sagemaker-chainer-2018-05-05-00-26-03-431


............................................
2018-05-05 00:29:37,636 INFO - root - running container entrypoint
2018-05-05 00:29:37,636 INFO - root - starting train task
2018-05-05 00:29:37,648 INFO - container_support.app - started training: {'train_fn': <function train at 0x7f451ebc7bf8>}
Downloading s3://sagemaker-us-west-2-038453126632/sagemaker-chainer-2018-05-05-00-26-03-431/source/sourcedir.tar.gz to /tmp/script.tar.gz
2018-05-05 00:29:37,783 INFO - botocore.vendored.requests.packages.urllib3.connectionpool - Starting new HTTP connection (1): 169.254.170.2
2018-05-05 00:29:37,863 INFO - botocore.vendored.requests.packages.urllib3.connectionpool - Starting new HTTPS connection (1): sagemaker-us-west-2-038453126632.s3.amazonaws.com
2018-05-05 00:29:37,908 INFO - botocore.vendored.requests.packages.urllib3.connectionpool - Starting new HTTPS connection (2): sagemaker-us-west-2-038453126632.s3.amazonaws.com
2018-05-

Our Chainer script writes various artifacts, such as plots, to a directory `output_data_dir`, the contents of which SageMaker uploads to S3. Now we download and extract these artifacts.

In [6]:
from s3_util import retrieve_output_from_s3

chainer_training_job = chainer_estimator.latest_training_job.name

desc = sagemaker_session.sagemaker_client.describe_training_job(TrainingJobName=chainer_training_job)
output_data = desc['ModelArtifacts']['S3ModelArtifacts'].replace('model.tar.gz', 'output.tar.gz')

retrieve_output_from_s3(output_data, 'output/sentiment')

s3://sagemaker-us-west-2-038453126632/sagemaker-chainer-2018-05-05-00-26-03-431/output/output.tar.gz
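
The helper `retrieve_output_from_s3` comes from the local `s3_util` module, whose source is not shown here. The sketch below is a guess at what such a helper might do: split the S3 URI into bucket and key, download the archive, and unpack it. The names `split_s3_uri` and `fetch_and_extract`, and the example bucket and job names, are hypothetical. `boto3` is imported inside the download function so that the URI handling can be tried without AWS access.

```python
import os
import tarfile
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/key' into ('bucket', 'some/key')."""
    parsed = urlparse(uri)
    if parsed.scheme != 's3' or not parsed.netloc:
        raise ValueError('not an S3 URI: {!r}'.format(uri))
    return parsed.netloc, parsed.path.lstrip('/')


def fetch_and_extract(uri, target_dir):
    """Download a tar archive from S3 and unpack it into target_dir."""
    import boto3  # lazy import: only needed when actually downloading

    bucket, key = split_s3_uri(uri)
    os.makedirs(target_dir, exist_ok=True)
    archive = os.path.join(target_dir, os.path.basename(key))
    boto3.client('s3').download_file(bucket, key, archive)
    # Only unpack archives you trust, such as your own job's output.
    with tarfile.open(archive) as tar:
        tar.extractall(target_dir)


# Same substitution as the cell above, applied to a made-up URI.
model_uri = 's3://example-bucket/example-job/output/model.tar.gz'
output_uri = model_uri.replace('model.tar.gz', 'output.tar.gz')
bucket, key = split_s3_uri(output_uri)
```

Deriving `output.tar.gz` by string replacement relies on both archives being written next to each other under the job's output prefix, as the printed URI above suggests.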


In [7]:
# Executing as code to reload images so that browsers don't render cached images.
from IPython.display import Markdown
import time
_nonce = time.time()

Markdown("""
These plots show the accuracy and loss over epochs.

In our user script (sentiment_analysis.py), we save only the best model for deployment.

<img style="display: inline;" src="output/sentiment/accuracy.png?{0}" />
<img style="display: inline;" src="output/sentiment/loss.png?{0}" />""".format(_nonce))



These plots show the accuracy and loss over epochs.

In our user script (sentiment_analysis.py), we save only the best model for deployment.

<img style="display: inline;" src="output/sentiment/accuracy.png?1525480243.1649876" />
<img style="display: inline;" src="output/sentiment/loss.png?1525480243.1649876" />

## Deploying the Trained Model

After training, we use the Chainer estimator object to create and deploy a hosted prediction endpoint. We can use a CPU-based instance for inference (in this case an `ml.m4.xlarge`), even though we trained on a GPU instance.

The predictor object returned by `deploy` lets us call the new endpoint and perform inference on our sample sentences.

At the end of training, `sentiment_analysis.py` saves the trained model, the vocabulary, and a dictionary of model properties that are used to reconstruct the model. These model artifacts are loaded in `model_fn` when the model is hosted.

In [8]:
predictor = chainer_estimator.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')

INFO:sagemaker:Creating model with name: sagemaker-chainer-2018-05-05-00-26-03-431
INFO:sagemaker:Creating endpoint with name sagemaker-chainer-2018-05-05-00-26-03-431


-------------------------------------------------------------!

## Predicting using SageMaker Endpoint

The Chainer predictor converts its input into a NumPy array, which it serializes and sends to the hosted model.
The `predict_fn` in `sentiment_analysis.py` receives this NumPy array and uses the loaded model to make predictions on the input data, which it returns as a NumPy array back to the Chainer predictor.

We predict against the hosted model on a batch of sentences. The output, as defined by `predict_fn`, consists of the processed input sentence, the prediction, and the score for that prediction.

In [9]:
sentences = ['It is fun and easy to train Chainer models on Amazon SageMaker!',
             'It used to be slow, difficult, and laborious to train and deploy a model to production.',
             'But now it is super fast to deploy to production. And I love it when my model generalizes!',]
predictions = predictor.predict(sentences)
# Unpack each result directly instead of rebinding the loop variable.
for sentence, prediction, score in predictions:
    print('sentence: {}\nprediction: {}\nscore: {}\n'.format(sentence, prediction, score))

sentence: it is fun and easy to train chainer models on amazon sagemaker!
prediction: 1
score: 0.9977222084999084

sentence: it used to be slow, difficult, and laborious to train and deploy a model to production.
prediction: 0
score: 0.8841550946235657

sentence: but now it is super fast to deploy to production. and i love it when my model generalizes!
prediction: 1
score: 0.8739688396453857
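
To make the three fields in the output above more concrete, here is a minimal sketch of how a processed sentence, a label, and a score could be produced from a model's raw outputs. It assumes the score is the softmax probability of the predicted class and that preprocessing lowercases the text and maps unknown words to a reserved id. Neither assumption is confirmed by the part of `sentiment_analysis.py` shown in this notebook. The names `encode`, `to_prediction`, and `toy_vocab` are hypothetical.

```python
import math


def softmax(logits):
    # Subtract the maximum first for numerical stability.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def encode(sentence, vocab, unk_id):
    # Lowercase, split on whitespace, and map unknown words to unk_id.
    return [vocab.get(token, unk_id) for token in sentence.lower().split()]


def to_prediction(sentence, logits):
    # Label is the index of the largest probability; score is that probability.
    probs = softmax(logits)
    label = max(range(len(probs)), key=probs.__getitem__)
    return sentence.lower(), label, probs[label]


toy_vocab = {'<unk>': 0, 'fun': 1, 'slow': 2}
ids = encode('Fun and easy', toy_vocab, unk_id=0)
# The logits are made up; a real model would compute them from ids.
result = to_prediction('Fun and easy', [-1.0, 2.0])
```

Under these assumptions a score close to 1 means the model strongly prefers the reported label, while a score near 0.5 means the two classes were almost tied.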



## Cleanup

After you have finished with this example, remember to delete the prediction endpoint to release the instance(s) associated with it.

In [10]:
sagemaker.Session().delete_endpoint(predictor.endpoint)

INFO:sagemaker:Deleting endpoint with name: sagemaker-chainer-2018-05-05-00-26-03-431
