## Training a sentiment analysis model with Chainer

In this notebook, we will train a model that will allow us to analyze text for positive or negative sentiment. The model will use a recurrent neural network with long short-term memory blocks to generate word embeddings.

The Chainer script runs inside of a Docker container running on SageMaker. For more on the Chainer container, please visit the sagemaker-chainer-containers repository and the sagemaker-python-sdk repository:

* https://github.com/aws/sagemaker-chainer-containers
* https://github.com/aws/sagemaker-python-sdk

In [None]:
# Setup
from sagemaker import get_execution_role
import sagemaker

sagemaker_session = sagemaker.Session()

# Retrieve the SageMaker-compatible IAM role used by this Notebook Instance.
role = get_execution_role()

## Downloading training and test data

We use helper functions given by `chainer` to download and preprocess the data. We'll be using the Stanford Sentiment Treebank dataset, which consists of sentence fragments along with labels indicating whether the sentence has a positive sentiment (1) or a negative sentiment (0).

In [None]:
import dataset

file_paths = dataset.download_dataset("stsa.binary")

# Preprocess the downloaded files into training data, test data, and a vocabulary.
train, test, vocab = dataset.get_stsa_dataset(file_paths)

with open(file_paths[0], 'r') as f:
    for i in range(20):
        line = f.readline()
        print(line)
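
Each printed line appears to hold a label character followed by the sentence text, which is the layout the test-set cell later in this notebook relies on when it strips the first character. Below is a minimal sketch of splitting such a line, assuming that layout. `parse_stsa_line` is a hypothetical helper and not part of `dataset.py`.

```python
def parse_stsa_line(line):
    """Split a line of the assumed form '<label> <sentence>' into (label, tokens)."""
    # partition() splits on the first space only, so the sentence stays intact.
    label, _, text = line.strip().partition(' ')
    return int(label), text.split()


# Made-up example line in the assumed format.
label, tokens = parse_stsa_line('1 a moving and funny film\n')
print(label, tokens)
```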

## Uploading the data

We save the preprocessed data to the local filesystem, and then use the `sagemaker.Session.upload_data` function to upload our datasets to an S3 location. The return values (`train_input`, `test_input`, and `vocab_input`) identify the S3 locations, which we will use when we start the Training Job.

In [None]:
import os
import shutil

import numpy as np

train_data = [element[0] for element in train]
train_labels = [element[1] for element in train]

test_data = [element[0] for element in test]
test_labels = [element[1] for element in test]


try:
    os.makedirs('/tmp/data/train')
    os.makedirs('/tmp/data/test')
    os.makedirs('/tmp/data/vocab')
    np.savez('/tmp/data/train/train.npz', data=train_data, labels=train_labels)
    np.savez('/tmp/data/test/test.npz', data=test_data, labels=test_labels)
    np.save('/tmp/data/vocab/vocab.npy', vocab)
    train_input = sagemaker_session.upload_data(path=os.path.join('/tmp', 'data', 'train'),
                                                key_prefix='notebook/chainer_sentiment/train')
    test_input = sagemaker_session.upload_data(path=os.path.join('/tmp', 'data', 'test'),
                                               key_prefix='notebook/chainer_sentiment/test')
    vocab_input = sagemaker_session.upload_data(path=os.path.join('/tmp', 'data', 'vocab'),
                                                key_prefix='notebook/chainer_sentiment/vocab')
finally:
    shutil.rmtree('/tmp/data')
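
On the training side, the script has to read these `.npz` files back. The sketch below shows one way to round-trip variable-length sequences through `np.savez` and `np.load`. It assumes the sentences are stored as an object array, which needs `allow_pickle=True` when loading. `save_split` and `load_split` are hypothetical helper names, and the toy data stands in for the real dataset.

```python
import os
import tempfile

import numpy as np


def save_split(directory, filename, data, labels):
    """Write variable-length sequences and their labels to one .npz file."""
    os.makedirs(directory, exist_ok=True)
    # Sentences differ in length, so store them in an explicit object array.
    ragged = np.empty(len(data), dtype=object)
    for i, sequence in enumerate(data):
        ragged[i] = np.asarray(sequence, dtype=np.int32)
    path = os.path.join(directory, filename)
    np.savez(path, data=ragged, labels=np.asarray(labels, dtype=np.int32))
    return path


def load_split(path):
    """Read the arrays back; object arrays require allow_pickle=True."""
    with np.load(path, allow_pickle=True) as archive:
        return list(archive['data']), archive['labels']


# TemporaryDirectory removes the files on exit, like the try/finally above.
with tempfile.TemporaryDirectory() as tmp:
    path = save_split(os.path.join(tmp, 'train'), 'train.npz',
                      data=[[4, 8, 15], [16, 23]], labels=[1, 0])
    sequences, labels = load_split(path)

print([s.tolist() for s in sequences], labels.tolist())
```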

## Writing the Chainer training script to run on Amazon SageMaker

We need to provide a training script that can run on the SageMaker platform. The training script is essentially the same as one you would write for local training, except that it must provide a function `train` that returns a trained `chainer.Chain`.

Since we will use the same script to host the Chainer model, the script also needs a function `model_fn` that loads a `chainer.Chain` -- by default, Chainer models are saved to disk as `model.npz`. When SageMaker calls your `train` and `model_fn` functions, it will pass in arguments that describe the training environment.

While the `train` and `model_fn` functions are required, the Chainer container provides default implementations for a few other functions. The function hooks recognized by the container are listed below, with required functions in bold:

### Training

* **`train`**: This function is passed arguments read from the Training Job's environment and returns a trained model. The return value of `train` is saved and uploaded to S3 as a model artifact by `save`.

  `train` can accept the following arguments by name:
  * `hyperparameters (dict)`: The hyperparameters map passed from the SageMaker Python SDK.
  * `channel_input_dirs (dict of str: str)`: A map of input channel names (like 'train' and 'test') to filesystem paths to data in those input channels. 
  * `output_data_dir (str)`: The filesystem path to write output artifacts to. Output artifacts may include checkpoints, graphs, and other files you might like to save, not including model artifacts. These artifacts are uploaded to S3 along with your model artifacts.
  * `num_gpus (int)`: The number of GPUs available to the host.
  * `num_cpus (int)`: The number of CPUs available to the host.
  * `hosts (list of str)`: The list of hostnames for all Training Job instances.
  * `current_host (str)`: The hostname of the current host.
  
  For more on the arguments to `train` and others, please visit https://github.com/aws/sagemaker-containers.
  
  
* `save(model, model_dir)`: Writes the return value from `train` (`model`) to `model_dir`. These model artifacts are uploaded to S3 so that they can be hosted behind a SageMaker Endpoint.

  The default implementation saves the `model` as a file named `model.npz` by invoking `chainer.serializers.save_npz`.
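
To make the `train` arguments more concrete, here is a minimal sketch of how a script might turn `hyperparameters` and `channel_input_dirs` into the settings it needs. The helper `resolve_training_inputs` is hypothetical. The file names match those uploaded earlier in this notebook, and the default values mirror the hyperparameters used later; the channel paths are illustrative, since the container supplies the real ones.

```python
import os


def resolve_training_inputs(hyperparameters, channel_input_dirs):
    """Collect the settings a `train` function would need from its arguments."""
    return {
        # Cast defensively in case values arrive as strings.
        'epochs': int(hyperparameters.get('epochs', 10)),
        'batch_size': int(hyperparameters.get('batch_size', 64)),
        # Each channel name maps to a directory holding that channel's files.
        'train_file': os.path.join(channel_input_dirs['train'], 'train.npz'),
        'test_file': os.path.join(channel_input_dirs['test'], 'test.npz'),
        'vocab_file': os.path.join(channel_input_dirs['vocab'], 'vocab.npy'),
    }


config = resolve_training_inputs(
    hyperparameters={'epochs': '10', 'batch_size': 64},
    # Illustrative paths only; the container supplies the real ones.
    channel_input_dirs={'train': '/opt/ml/input/data/train',
                        'test': '/opt/ml/input/data/test',
                        'vocab': '/opt/ml/input/data/vocab'},
)
print(config['epochs'], config['batch_size'], config['train_file'])
```

A `train` function could call such a helper first, then build and fit the model using the resolved values.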

### Hosting and Inference

* **`model_fn(model_dir)`**: This function is invoked to load model artifacts from those written into `model_dir` by `save`.
* `input_fn(input_data, content_type)`: This function is invoked to deserialize prediction data when a prediction request is made. The return value is passed to `predict_fn`.
  
  The default implementation deserializes [npy-formatted](https://docs.scipy.org/doc/numpy-1.14.0/neps/npy-format.html) data with content type 'application/x-npy' into a NumPy array. The default handler can also handle CSV data with content type 'text/csv' and JSON data with content type 'application/json'.
  
  `input_fn` accepts the following arguments:
  
  * `input_data`: serialized input data in the body of the prediction request.
  * `content_type`: MIME type of the data. By default, the Chainer predictor sends prediction requests with content type 'application/x-npy'.
  
  
* `predict_fn(input_data, model)`: This function accepts the return value of `input_fn` (as `input_data`) and the return value of `model_fn`, `model`, and returns inferences obtained from the model.

  The default implementation calls `model(input_data)` and returns the result as a NumPy array.
  
  
* `output_fn(prediction, accept)`: This function is invoked to serialize the return value from `predict_fn`, passed in via `prediction`, back to the SageMaker client in response to prediction requests.

  The default implementation serializes NumPy arrays returned by `predict_fn` in a format that the SageMaker Python SDK can deserialize back into a NumPy array. The default handler can also respond with JSON or CSV, depending on the `accept` MIME type given in the prediction request.
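
The npy round trip described above can be sketched with NumPy alone. This is not the container's actual code: it is a simplified illustration that handles only 'application/x-npy', returns raw bytes from `output_fn`, and assumes the request body arrives as bytes.

```python
from io import BytesIO

import numpy as np

NPY_CONTENT_TYPE = 'application/x-npy'


def input_fn(input_data, content_type):
    """Deserialize an npy-formatted request body into a NumPy array."""
    if content_type != NPY_CONTENT_TYPE:
        raise ValueError('Unsupported content type: {}'.format(content_type))
    return np.load(BytesIO(input_data))


def output_fn(prediction, accept):
    """Serialize a prediction as npy-formatted bytes."""
    if accept != NPY_CONTENT_TYPE:
        raise ValueError('Unsupported accept type: {}'.format(accept))
    buffer = BytesIO()
    np.save(buffer, np.asarray(prediction))
    return buffer.getvalue()


# Serialize an array and read it back to show that the two functions are inverses.
payload = output_fn(np.array([[0.1, 0.9]]), NPY_CONTENT_TYPE)
restored = input_fn(payload, NPY_CONTENT_TYPE)
print(restored.shape)
```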

Check the script below, which uses `chainer` to train on any number of GPUs on a single machine, to see how this works. This script implements `train`, `save`, `model_fn`, and `predict_fn`, but relies on the default `input_fn` and `output_fn`.

For more on implementing these functions, see the documentation at https://github.com/aws/sagemaker-python-sdk.

For more on the functions provided by the Chainer container, see https://github.com/aws/sagemaker-chainer-containers.

In [None]:
!cat 'code/sentiment_analysis.py'

## Running the training script on SageMaker

To train with a Chainer script, we construct a `Chainer` estimator using the [sagemaker-python-sdk](https://github.com/aws/sagemaker-python-sdk). We pass in an `entry_point`, the name of a script that defines the required functions (`train` and `model_fn`). SageMaker runs this script in a container that invokes these functions to train and load Chainer models.

The `Chainer` class allows us to run our training function as a training job on SageMaker infrastructure. We need to configure it with our training script, an IAM role, the number of training instances, and the training instance type. In this case we will run our training job on an `ml.p3.2xlarge` instance.

In [None]:
from sagemaker.chainer.estimator import Chainer

chainer_estimator = Chainer(entry_point='sentiment_analysis.py', source_dir="code", role=role,
                            sagemaker_session=sagemaker_session,
                            train_instance_count=1, train_instance_type='ml.p3.2xlarge',
                            hyperparameters={'epochs': 10, 'batch_size': 64})

chainer_estimator.fit({'train': train_input, 'test': test_input, 'vocab': vocab_input})

Our Chainer script writes various artifacts, such as plots, to a directory `output_data_dir`, the contents of which SageMaker uploads to S3. Now we download and extract these artifacts.

In [None]:
from s3_util import retrieve_output_from_s3

chainer_training_job = chainer_estimator.latest_training_job.name

desc = sagemaker_session.sagemaker_client.describe_training_job(TrainingJobName=chainer_training_job)
output_data = desc['ModelArtifacts']['S3ModelArtifacts'].replace('model.tar.gz', 'output.tar.gz')

retrieve_output_from_s3(output_data, 'output/sentiment')
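
The helper `retrieve_output_from_s3` comes from the local `s3_util` module, which is not shown here. As a rough illustration of the URI handling involved, the sketch below derives the output archive location from a model artifact URI and splits it into a bucket and key. The function names and the example URI are hypothetical.

```python
from urllib.parse import urlparse


def output_uri_from_model_uri(model_uri):
    """Swap the archive name in a model artifact URI for output.tar.gz."""
    return model_uri.replace('model.tar.gz', 'output.tar.gz')


def split_s3_uri(uri):
    """Split 's3://bucket/key' into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != 's3':
        raise ValueError('Not an S3 URI: {}'.format(uri))
    return parsed.netloc, parsed.path.lstrip('/')


# Hypothetical URI, shaped like the value read from describe_training_job above.
model_uri = 's3://example-bucket/example-job/output/model.tar.gz'
bucket, key = split_s3_uri(output_uri_from_model_uri(model_uri))
print(bucket, key)
```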

These plots show the accuracy and loss over epochs.

In our user script, `sentiment_analysis.py`, at the end of the `train` function, we save only the best model for deployment.

In [None]:
# Executing as code to reload images so that browsers don't render cached images.
from IPython.display import Markdown
import time
_nonce = time.time()

Markdown("""
These plots show the accuracy and loss over epochs.

In our user script (sentiment_analysis.py), we save only the best model for deployment.

<img style="display: inline;" src="output/sentiment/accuracy.png?{0}" />
<img style="display: inline;" src="output/sentiment/loss.png?{0}" />""".format(_nonce))


## Deploying the Trained Model

After training, we use the Chainer estimator object to create and deploy a hosted prediction endpoint. We can use a CPU-based instance for inference (in this case an `ml.m4.xlarge`), even though we trained on GPU instances.

The predictor object returned by `deploy` lets us call the new endpoint and perform inference on our sample sentences.

At the end of training, `sentiment_analysis.py` saves the trained model, the vocabulary, and a dictionary of model properties that are used to reconstruct the model. These model artifacts are loaded in `model_fn` when the model is hosted.

In [None]:
predictor = chainer_estimator.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')

## Predicting using SageMaker Endpoint

The Chainer predictor converts its input into a NumPy array, which it serializes and sends to the hosted model.
The `predict_fn` in `sentiment_analysis.py` receives this NumPy array and uses the loaded model to make predictions on the input data, which it returns as a NumPy array back to the Chainer predictor.

We predict against the hosted model on a batch of sentences. The output, as defined by `predict_fn`, consists of the processed input sentence, the prediction, and the score for that prediction.

In [None]:
sentences = ['It is fun and easy to train Chainer models on Amazon SageMaker!',
             'It used to be slow, difficult, and laborious to train and deploy a model to production.',
             'But now it is super fast to deploy to production. And I love it when my model generalizes!',]
predictions = predictor.predict(sentences)
for sentence, prediction, score in predictions:
    print('sentence: {}\nprediction: {}\nscore: {}\n'.format(sentence, prediction, score))
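
The exact scoring logic lives in `sentiment_analysis.py`, which is not reproduced here. As one plausible illustration of how a prediction and its score could be derived from a two-class model output, the sketch below applies a softmax to hypothetical logits and reports the most likely class with its probability. The function name and the logit values are assumptions made for this example.

```python
import numpy as np


def label_and_score(logits):
    """Return (predicted classes, probabilities) for an (n, 2) array of logits."""
    logits = np.asarray(logits, dtype=np.float64)
    # Subtract the row maximum before exponentiating for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    exponentials = np.exp(shifted)
    probabilities = exponentials / exponentials.sum(axis=1, keepdims=True)
    # The prediction is the most probable class; the score is its probability.
    return probabilities.argmax(axis=1), probabilities.max(axis=1)


# Hypothetical logits for two sentences: columns are (negative, positive).
labels, scores = label_and_score([[0.2, 2.3], [1.5, -0.5]])
print(labels.tolist(), scores.round(3).tolist())
```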

We now predict against sentences in the test set:

In [None]:
with open(file_paths[1], 'r') as f:
    # readlines(2000) reads whole lines until roughly 2000 characters have been read.
    sentences = f.readlines(2000)
    # Drop the leading label character from each line.
    sentences = [sentence[1:].strip() for sentence in sentences]

predictions = predictor.predict(sentences)

for sentence, prediction, score in predictions:
    print('sentence: {}\nprediction: {}\nscore: {}\n'.format(sentence, prediction, score))
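
Because each test line starts with its label, the predictions can also be compared against the ground truth. The sketch below computes accuracy from raw lines and predicted labels. It assumes the first character of each line is the label (0 or 1); `accuracy_against_lines` is a hypothetical helper, and the sample lines and predictions are made up.

```python
def accuracy_against_lines(lines, predicted_labels):
    """Fraction of lines whose leading label character matches the prediction."""
    if len(lines) != len(predicted_labels):
        raise ValueError('Need exactly one prediction per line.')
    if not lines:
        raise ValueError('Need at least one line.')
    # The first character of each line is assumed to be the label.
    expected = [int(line[0]) for line in lines]
    matches = sum(1 for e, p in zip(expected, predicted_labels) if e == int(p))
    return matches / len(lines)


# Made-up lines and stand-in predictions for illustration.
sample_lines = ['1 a charming film\n', '0 a tedious mess\n', '1 well acted\n']
sample_predictions = [1, 0, 0]
print(accuracy_against_lines(sample_lines, sample_predictions))
```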
    

## Cleanup

After you have finished with this example, remember to delete the prediction endpoint to release the instance(s) associated with it.

In [None]:
chainer_estimator.delete_endpoint()