# Word-level language modeling using PyTorch

## Contents

1. [Background](#Background)
1. [Setup](#Setup)
1. [Data](#Data)
1. [Train](#Train)
1. [Host](#Host)

---

## Background

This example trains a multi-layer RNN (Elman, GRU, or LSTM) on a language modeling task. By default, the training script uses the Wikitext-2 dataset. We will train a model on SageMaker, deploy it, and then use the deployed model to generate new text.

For more information about PyTorch in SageMaker, please visit the [sagemaker-pytorch-containers](https://github.com/aws/sagemaker-pytorch-containers) and [sagemaker-python-sdk](https://github.com/aws/sagemaker-python-sdk) GitHub repositories.

---

## Setup

_This notebook was created and tested on an ml.p3.2xlarge notebook instance._

Let's start by specifying:

- The S3 bucket and prefix that you want to use for training and model data. This should be within the same region as the notebook instance, training, and hosting.
- The IAM role ARN used to give training and hosting access to your data. See the documentation for how to create these. Note that if more than one role is required for notebook instances, training, and/or hosting, you should replace the role value below with the appropriate full IAM role ARN string(s).


In [None]:
bucket = '<your_s3_bucket_name_here>'
prefix = 'sagemaker/<notebook_specific_prefix_here>' # notebook author to input the proper prefix

import sagemaker
role = 'arn:aws:iam::142577830533:role/SageMakerRole'  # alternatively: sagemaker.get_execution_role()

Now we'll import the Python libraries we'll need and start a SageMaker session.

In [124]:
import os
import boto3
import sagemaker
from sagemaker.pytorch import PyTorch, PyTorchModel

sagemaker_session = sagemaker.Session()

## Data
### Getting the data
As mentioned above, we are going to use [the wikitext-2 raw data](https://www.salesforce.com/products/einstein/ai-research/the-wikitext-dependency-language-modeling-dataset/):

In [94]:
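# Download the dataset archive (a shell command, hence the leading "!")
!wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-raw-v1.zip
import os
# Remember the notebook's working directory across cell re-runs
if 'workbookDir' not in globals():
    workbookDir = os.getcwd()
print('workbookDir: ' + workbookDir)
# Local directory that will be uploaded to S3 as the training channel
data_dir = os.path.join(workbookDir, 'data', 'training')
print('data_dir: ' + data_dir)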


workbookDir: /workplace/nadzeya/sagemaker-pytorch-containers/notebooks/rnn
data_dir: /workplace/nadzeya/sagemaker-pytorch-containers/notebooks/rnn/data/training
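
The cell above downloads the archive but does not show it being unpacked into `data_dir`. A minimal sketch of that step follows, assuming the training files are expected directly under `data_dir` (the notebook does not show the expected layout, so this is an assumption). The helper name `extract_flat` is hypothetical and not part of the original notebook.

```python
import os
import zipfile


def extract_flat(zip_path, dest_dir):
    """Extract every file in zip_path directly into dest_dir, dropping folders.

    Hypothetical helper: assumes file names inside the archive are unique.
    """
    os.makedirs(dest_dir, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for member in archive.infolist():
            if member.is_dir():
                continue  # skip directory entries, keep only files
            target = os.path.join(dest_dir, os.path.basename(member.filename))
            with archive.open(member) as src, open(target, 'wb') as dst:
                dst.write(src.read())
            extracted.append(target)
    return sorted(extracted)


# Example call (assumes the archive was downloaded to the working directory):
# extract_flat('wikitext-2-raw-v1.zip', data_dir)
```

Flattening the archive keeps the upload step simple, since `upload_data` mirrors whatever sits under the given path.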


### Uploading the data to S3
We use the `sagemaker.Session.upload_data` function to upload our datasets to an S3 location. The return value, `inputs`, identifies the location; we will use it later when we start the training job.



In [95]:
inputs = sagemaker_session.upload_data(path=data_dir, key_prefix='data/DEMO-pytorch-rnn')
print('input spec (in this case, just an S3 path): {}'.format(inputs))

input spec (in this case, just an S3 path): s3://sagemaker-us-west-2-142577830533/data/DEMO-pytorch-rnn
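
Since the returned value is a plain S3 URI, it can be split into its bucket and key prefix when you want to check where the data landed. The sketch below uses only the standard library; `split_s3_uri` is a hypothetical helper, not a SageMaker API. In the output above the bucket appears to be the session's default bucket rather than the `bucket` variable from Setup, because `upload_data` was called without a `bucket` argument.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/prefix' into ('bucket', 'some/prefix')."""
    parsed = urlparse(uri)
    if parsed.scheme != 's3' or not parsed.netloc:
        raise ValueError('not an S3 URI: {}'.format(uri))
    return parsed.netloc, parsed.path.lstrip('/')


# URI copied from the cell output above
bucket_name, key_prefix = split_s3_uri(
    's3://sagemaker-us-west-2-142577830533/data/DEMO-pytorch-rnn')
print(bucket_name)
print(key_prefix)
```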


## Train
### Training script
We need to provide a training script that can run on the SageMaker platform. When SageMaker calls your `train()` function, it will pass in arguments that describe the training environment. Check the script below to see how this works.

In [153]:
!cat 'source/train.py'

# Based on github.com/pytorch/examples/blob/master/word_language_model
import time
import logging
import math
import os
from shutil import copy
import torch
import torch.nn as nn

import data
from rnn import RNNModel

logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)


# Starting from sequential data, batchify arranges the dataset into columns.
# For instance, with the alphabet as the sequence and batch size 4, we'd get
# ┌ a g m s ┐
# │ b h n t │
# │ c i o u │
# │ d j p v │
# │ e k q w │
# └ f l r x ┘.
# These columns are treated as independent by the model, which means that the
# dependence of e. g. 'g' on 'f' can not be learned, but allows more efficient
# batch processing.
def batchify(data, bsz, device):
    # Work out how cleanly we can divide the dataset into bsz parts.
    nbatch = data.size(0) // bsz
    # Trim off any extra elements that wouldn't cleanly fit (remainders).
    data = data.narrow(0, 0, nbatch * bsz)
    # Evenly
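
The listing above is cut off partway through `batchify`. As an illustration of the column arrangement described in its comments, here is a dependency-free sketch that uses plain Python sequences instead of tensors. It is not the script's implementation, and the name `batchify_columns` is hypothetical.

```python
import string


def batchify_columns(seq, bsz):
    """Arrange seq into bsz columns, dropping elements that don't fit evenly."""
    nbatch = len(seq) // bsz           # rows per column
    trimmed = seq[:nbatch * bsz]       # trim off the remainder
    # Each column is a contiguous slice of the trimmed sequence.
    columns = [trimmed[i * nbatch:(i + 1) * nbatch] for i in range(bsz)]
    # Transpose so each row holds one element from every column.
    return [list(row) for row in zip(*columns)]


rows = batchify_columns(string.ascii_lowercase, 4)
for row in rows:
    print(' '.join(row))
```

With the alphabet and a batch size of 4, the last two letters (`y` and `z`) are dropped and each of the six rows holds one letter from each of the four columns, matching the diagram in the comments.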

In this example we also need to provide a source directory, since the training script imports the data and model classes from other modules.

In [152]:
ls source

__init__.py   __pycache__/  data.pyc      predict.py    rnn.pyc
__init__.pyc  data.py       generate.py   rnn.py        train.py


### Run training in SageMaker
The `PyTorch` class allows us to run our training function as a training job on SageMaker infrastructure. We need to configure it with our training script and source directory, an IAM role, the number of training instances, and the training instance type. In this case we will run our training job on an ml.p3.2xlarge instance. As this example shows, you can also specify hyperparameters.

In [162]:
estimator = PyTorch(entry_point="train.py",
                    role=role,
                    framework_version='0.4.0',
                    train_instance_count=1,
                    train_instance_type='ml.p3.2xlarge',
                    source_dir='source',
                    # available hyperparameters: rnn_type (RNN_TANH, RNN_RELU, LSTM, GRU), emsize, nhid, nlayers,
                    #                            lr, clip, epochs, batch_size, bptt, dropout, tied, seed, log_interval
                    hyperparameters={
                        'rnn_type': 'LSTM',
                        'epochs': 15,
                        'emsize': 1500,
                        'nhid': 1500,
                        'dropout': 0.65,
                        'tied': True
                    })


After we've constructed our PyTorch object, we can fit it using the data we uploaded to S3. SageMaker makes sure our data is available in the local filesystem, so our training script can simply read the data from disk.

In [None]:
estimator.fit({'training': inputs})

INFO:sagemaker:Creating training-job with name: sagemaker-pytorch-2018-05-07-09-56-06-259


....................................................
2018-05-07 10:00:21,715 INFO - root - running container entrypoint
2018-05-07 10:00:21,715 INFO - root - starting train task
2018-05-07 10:00:21,726 INFO - container_support.app - started training: {'train_fn': <function train at 0x7f4e264ab510>}
Downloading s3://sagemaker-us-west-2-142577830533/sagemaker-pytorch-2018-05-07-09-56-06-259/source/sourcedir.tar.gz to /tmp/script.tar.gz
2018-05-07 10:00:21,850 INFO - botocore.vendored.requests.packages.urllib3.connectionpool - Starting new HTTP connection (1): 169.254.170.2
2018-05-07 10:00:21,933 INFO - botocore.vendored.requests.packages.urllib3.connectionpool - Starting new HTTPS connection (1): sagemaker-us-west-2-142577830533.s3.amazonaws.com
2018-05-07 10:00:21,976 INFO - botocore.vendored.requests.packages.urllib3.connectionpool - Starting new HTTPS connection (2): sagemaker-us-west-2-142577830533.s3.amazonaws.com

## Host
### Hosting script
We are going to provide a custom implementation of the `model_fn`, `input_fn`, `output_fn`, and `predict_fn` hosting functions.

In [154]:
!cat 'source/generate.py'

import json
import logging
import os

import torch
from rnn import RNNModel

import data

JSON_CONTENT_TYPE = 'application/json'

logger = logging.getLogger(__name__)


def model_fn(model_dir):
    logger.info('Loading the model.')
    model_info = {}
    with open(os.path.join(model_dir, 'model_info.pth'), 'rb') as f:
        model_info = torch.load(f)
    print('model_info: {}'.format(model_info))
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    logger.info('Current device: {}'.format(device))
    model = RNNModel(rnn_type=model_info['rnn_type'], ntoken=model_info['ntoken'],
                     ninp=model_info['ninp'], nhid=model_info['nhid'], nlayers=model_info['nlayers'],
                     dropout=model_info['dropout'], tie_weights=model_info['tie_weights'])
    with open(os.path.join(model_dir, 'model.pth'), 'rb') as f:
        model.load_state_dict(torch.load(f))
    model.to(device).eval()
    logger.info('Loading the 
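
The listing above is cut off before the remaining hosting functions appear. As a rough sketch of how the four functions fit together, the example below chains them the way a serving container is expected to: load the model once, then deserialize, predict, and serialize for each request. `ToyModel` and `handle_request` are hypothetical stand-ins for illustration only; they are not part of `generate.py` or of SageMaker.

```python
import json

JSON_CONTENT_TYPE = 'application/json'


class ToyModel:
    """Hypothetical stand-in for the trained RNN; it just repeats a token."""

    def generate(self, words):
        return ['token'] * words


def model_fn(model_dir):
    # A real implementation would load weights from model_dir.
    return ToyModel()


def input_fn(request_body, content_type=JSON_CONTENT_TYPE):
    # Deserialize the request payload into a Python object.
    if content_type != JSON_CONTENT_TYPE:
        raise ValueError('Unsupported content type: {}'.format(content_type))
    return json.loads(request_body)


def predict_fn(input_data, model):
    # Run the model on the deserialized input.
    return ' '.join(model.generate(input_data['words']))


def output_fn(prediction, accept=JSON_CONTENT_TYPE):
    # Serialize the prediction for the response.
    return json.dumps(prediction)


def handle_request(body):
    # Hypothetical driver showing the order of the calls.
    model = model_fn('/opt/ml/model')
    return output_fn(predict_fn(input_fn(body), model))


print(handle_request('{"words": 3}'))
```

Keeping deserialization, prediction, and serialization in separate functions means each one can be replaced independently, for example to accept another content type.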

### Import model into SageMaker
Since the hosting functions are implemented outside of the training script, we can't just use the estimator object to deploy the model. Instead, we need to create a `PyTorchModel` object, using the latest training job to get the S3 location of the trained model data. As with the estimator, we also need to configure `PyTorchModel` with the script and source directory (because our `generate` script requires the model and data classes from the source directory), an IAM role, and the model data location in S3.

In [156]:
# Look up the S3 location of the model artifacts produced by the latest training job
training_job_name = estimator.latest_training_job.name
desc = sagemaker_session.sagemaker_client.describe_training_job(TrainingJobName=training_job_name)
trained_model_location = desc['ModelArtifacts']['S3ModelArtifacts']
# Point the model at those artifacts and at the hosting script
model = PyTorchModel(model_data=trained_model_location,
                     role=role,
                     framework_version='0.4.0',
                     entry_point='generate.py',
                     source_dir='source')

### Create endpoint

Now the model is ready to be deployed to a SageMaker endpoint, and we are going to use the `sagemaker.pytorch.model.PyTorchModel.deploy` method to do this. We can use a CPU-based instance for inference (in this case an ml.m4.xlarge), even though we trained on a GPU instance, because at the end of training we moved the model to the CPU before returning it. This way we can load the trained model on any device and then move it to a GPU if CUDA is available.


In [None]:
predictor = model.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')

### Evaluate
We are going to use our deployed model to generate text by providing a random seed, a temperature (higher values increase diversity), and the number of words we would like to get.

In [161]:
# Named "payload" to avoid shadowing the built-in input()
payload = {
    'seed': 11111,
    'hidden': 1,
    'temperature': 2.0,
    'words': 100
}
response = predictor.predict(payload)
print(response)

successively transmitted joined accelerate @-@ speed on yeah beaten collapse 129 69 September housekeeper where overall Atlantic opposed 117 grenades
 drivers diagram tie after wildlife Mongolia ! game bloodline assumption electronics gas Snow . The AC migration 032 electors Karl
 blue - menace satisfy Reviews specifically forming the shape of Gothic elongated Pitching after Lim late proud Courts Mountains in
 Movement 
 Crisis furthermore swap funded throughout weapon through an discourse characterisation Captains Digital Bede SS sectors the scripted country
 recreated sounded analyses Steele merging Silver Weevil Philippines Krishna Drew ITV coffins ill and mi staffed 185 feasibility ; delay
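
To illustrate why a higher temperature gives more diverse text, the sketch below divides a set of scores by the temperature before turning them into sampling probabilities. `generate.py` is cut off above, so this shows the general technique rather than the script's exact code; the function names and the example scores are hypothetical.

```python
import math
import random


def temperature_probs(logits, temperature):
    """Softmax over logits / temperature (higher temperature flattens it)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_index(logits, temperature, rng):
    """Draw one index according to the temperature-scaled probabilities."""
    probs = temperature_probs(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


scores = [2.0, 1.0, 0.1]  # made-up scores for three candidate words
sharp = temperature_probs(scores, 0.5)
flat = temperature_probs(scores, 2.0)
print('T=0.5:', [round(p, 3) for p in sharp])
print('T=2.0:', [round(p, 3) for p in flat])
print('sampled index:', sample_index(scores, 2.0, random.Random(0)))
```

At the lower temperature most of the probability mass sits on the top-scoring word, while at the higher temperature the mass spreads out, so less likely words are sampled more often.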



### Cleanup

After you have finished with this example, remember to delete the prediction endpoint to release the instance(s) associated with it.


In [149]:
sagemaker_session.delete_endpoint(predictor.endpoint)

INFO:sagemaker:Deleting endpoint with name: sagemaker-pytorch-2018-05-07-07-50-54-883
