# Word-level language modeling using PyTorch

## Contents

1. [Background](#Background)
1. [Setup](#Setup)
1. [Data](#Data)
1. [Train](#Train)
1. [Host](#Host)

---

## Background

This example trains a multi-layer RNN (Elman, GRU, or LSTM) model on a language modeling task. By default, the training script uses the Wikitext-2 dataset. We will train a model on SageMaker, deploy it, and then use the deployed model to generate new text.

For more information about PyTorch in SageMaker, please visit the [sagemaker-pytorch-containers](https://github.com/aws/sagemaker-pytorch-containers) and [sagemaker-python-sdk](https://github.com/aws/sagemaker-python-sdk) GitHub repositories.

---

## Setup

_This notebook was created and tested on an ml.p3.2xlarge notebook instance._

Let's start by specifying:

- The S3 bucket and prefix that you want to use for training and model data. The bucket should be in the same region as the notebook instance, training, and hosting.
- The IAM role ARN used to give training and hosting access to your data. See the documentation for how to create these. Note that if more than one role is required for notebook instances, training, and/or hosting, you should replace the role value below with the appropriate full IAM role ARN string(s).


In [None]:
bucket = '<your_s3_bucket_name_here>'
prefix = 'sagemaker/<notebook_specific_prefix_here>' # notebook author to input the proper prefix

import sagemaker
role = 'arn:aws:iam::142577830533:role/SageMakerRole'  # or: sagemaker.get_execution_role()

Now we'll import the Python libraries we need and start a SageMaker session.

In [124]:
import os
import boto3
import sagemaker
from sagemaker.pytorch import PyTorch, PyTorchModel

sagemaker_session = sagemaker.Session()

## Data
### Getting the data
As mentioned above, we are going to use [the wikitext-2 raw data](https://www.salesforce.com/products/einstein/ai-research/the-wikitext-dependency-language-modeling-dataset/):

In [94]:
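# Download the dataset archive (a shell command, hence the leading "!")
!wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-raw-v1.zip
import os
# Remember the notebook's working directory the first time this cell runs
if 'workbookDir' not in globals():
    workbookDir = os.getcwd()
print('workbookDir: ' + workbookDir)
# Local directory that holds the training data to upload
data_dir = os.path.join(workbookDir, 'data', 'training')
print('data_dir: ' + data_dir)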


workbookDir: /workplace/nadzeya/sagemaker-pytorch-containers/notebooks/rnn
data_dir: /workplace/nadzeya/sagemaker-pytorch-containers/notebooks/rnn/data/training


### Uploading the data to S3
We use the `sagemaker.Session.upload_data` function to upload our datasets to an S3 location. The return value, `inputs`, identifies the location. We will use it later when we start the training job.



In [95]:
inputs = sagemaker_session.upload_data(path=data_dir, key_prefix='data/DEMO-pytorch-rnn')
print('input spec (in this case, just an S3 path): {}'.format(inputs))

input spec (in this case, just an S3 path): s3://sagemaker-us-west-2-142577830533/data/DEMO-pytorch-rnn
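Since no bucket is passed to `upload_data`, the session's default bucket is used, and the returned value is simply the bucket and key prefix joined into an S3 URI. The following is a minimal sketch of that composition; `s3_uri` is a hypothetical helper written for illustration and is not part of the SageMaker SDK.

```python
def s3_uri(bucket, key_prefix):
    """Compose an S3 URI from a bucket name and a key prefix."""
    return 's3://{}/{}'.format(bucket, key_prefix.strip('/'))


# Bucket name copied from the output above (the session's default bucket).
inputs_example = s3_uri('sagemaker-us-west-2-142577830533', 'data/DEMO-pytorch-rnn')
print(inputs_example)
```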


## Train
### Training script
We need to provide a training script that can run on the SageMaker platform. When SageMaker calls your `train()` function, it will pass in arguments that describe the training environment. Check the script below to see how this works.

In [153]:
!cat 'source/train.py'

# Based on github.com/pytorch/examples/blob/master/word_language_model
import time
import logging
import math
import os
from shutil import copy
import torch
import torch.nn as nn

import data
from rnn import RNNModel

logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)


# Starting from sequential data, batchify arranges the dataset into columns.
# For instance, with the alphabet as the sequence and batch size 4, we'd get
# ┌ a g m s ┐
# │ b h n t │
# │ c i o u │
# │ d j p v │
# │ e k q w │
# └ f l r x ┘.
# These columns are treated as independent by the model, which means that the
# dependence of e. g. 'g' on 'f' can not be learned, but allows more efficient
# batch processing.
def batchify(data, bsz, device):
    # Work out how cleanly we can divide the dataset into bsz parts.
    nbatch = data.size(0) // bsz
    # Trim off any extra elements that wouldn't cleanly fit (remainders).
    data = data.narrow(0, 0, nbatch * bsz)
    # Evenly
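
The script listing above is cut off partway through `batchify`. To illustrate the column layout its comment describes, here is a plain-Python sketch that works on lists rather than tensors. The name `batchify_columns` is hypothetical, and the real function operates on a `torch` tensor and a device, so treat this only as an illustration of the arrangement.

```python
import string


def batchify_columns(sequence, bsz):
    """Arrange a flat sequence into bsz columns, dropping the remainder."""
    nbatch = len(sequence) // bsz        # number of rows in each column
    trimmed = sequence[:nbatch * bsz]    # drop elements that do not fit
    # Each column is a contiguous slice of the trimmed sequence.
    columns = [trimmed[i * nbatch:(i + 1) * nbatch] for i in range(bsz)]
    # Transpose so that each row holds one element from every column.
    return [list(row) for row in zip(*columns)]


rows = batchify_columns(list(string.ascii_lowercase), 4)
for row in rows:
    print(' '.join(row))
```

With the alphabet and a batch size of 4, the letters `y` and `z` do not fit and are dropped, which matches the diagram in the script comment.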

In this example we also need to provide a source directory, because the training script imports the data and model classes from other modules.

In [152]:
ls source

__init__.py   __pycache__/  data.pyc      predict.py    rnn.pyc
__init__.pyc  data.py       generate.py   rnn.py        train.py


### Run training in SageMaker
The `PyTorch` class allows us to run our training function as a training job on SageMaker infrastructure. We need to configure it with our training script and source directory, an IAM role, the number of training instances, and the training instance type. In this case we will run our training job on a single ml.p3.2xlarge instance. As the example shows, you can also specify hyperparameters.

In [171]:
estimator = PyTorch(entry_point="train.py",
                    role=role,
                    framework_version='0.4.0',
                    train_instance_count=1,
                    train_instance_type='ml.p3.2xlarge',
                    source_dir='source',
                    # available hyperparameters: emsize, nhid, nlayers, lr, clip, epochs, batch_size,
                    #                            bptt, dropout, tied, seed, log_interval
                    hyperparameters={
                        'epochs': 15,
                        'emsize': 1500,
                        'nhid': 1500,
                        'dropout': 0.65,
                        'tied': True
                    })
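
The hyperparameters above set `emsize` and `nhid` to the same value together with `tied`. In the upstream PyTorch word language model example that this script is based on, tying the input and output embeddings requires the two sizes to match. Below is a small sketch of a pre-flight check we could run before starting a job. `check_hyperparameters` is a hypothetical helper, and because the script's defaults are not visible here, a missing size is treated as a mismatch.

```python
def check_hyperparameters(hyperparameters):
    """Return a list of problems found in a hyperparameter dict."""
    problems = []
    # Weight tying shares the embedding matrix with the output layer,
    # so the embedding size and hidden size must be equal.
    if hyperparameters.get('tied'):
        if hyperparameters.get('emsize') != hyperparameters.get('nhid'):
            problems.append('tied=True requires emsize == nhid')
    # Dropout is a probability of zeroing an activation.
    dropout = hyperparameters.get('dropout', 0.0)
    if not 0.0 <= dropout < 1.0:
        problems.append('dropout must be in [0, 1)')
    return problems


good = {'epochs': 15, 'emsize': 1500, 'nhid': 1500, 'dropout': 0.65, 'tied': True}
bad = dict(good, nhid=200)
print(check_hyperparameters(good))
print(check_hyperparameters(bad))
```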


After we've constructed our PyTorch object, we can fit it using the data we uploaded to S3. SageMaker makes sure our data is available in the local filesystem, so our training script can simply read the data from disk.

In [172]:
estimator.fit({'training': inputs})

INFO:sagemaker:Creating training-job with name: sagemaker-pytorch-2018-05-07-17-26-15-663


................................................
2018-05-07 17:30:10,532 INFO - root - running container entrypoint
2018-05-07 17:30:10,532 INFO - root - starting train task
2018-05-07 17:30:10,543 INFO - container_support.app - started training: {'train_fn': <function train at 0x7f49b386d510>}
Downloading s3://sagemaker-us-west-2-142577830533/sagemaker-pytorch-2018-05-07-17-26-15-663/source/sourcedir.tar.gz to /tmp/script.tar.gz
2018-05-07 17:30:10,664 INFO - botocore.vendored.requests.packages.urllib3.connectionpool - Starting new HTTP connection (1): 169.254.170.2
2018-05-07 17:30:10,746 INFO - botocore.vendored.requests.packages.urllib3.connectionpool - Starting new HTTPS connection (1): sagemaker-us-west-2-142577830533.s3.amazonaws.com
2018-05-07 17:30:10,785 INFO - botocore.vendored.requests.packages.urllib3.connectionpool - Starting new HTTPS connection (2): sagemaker-us-west-2-142577830533.s3.amazonaws.com
2018

| epoch  12 |   200/  350 batches | lr 5.00 | ms/batch 37.95 | loss  4.69 | ppl   108.82
-----------------------------------------------------------------------------------------
| end of epoch  12 | time: 17.65s | valid loss  5.63 | valid ppl   278.42
-----------------------------------------------------------------------------------------
2018-05-07 17:34:00,345 INFO - train - Saving the best model: {'epoch': 12, 'val_ppl': 278.423060563862, 'lr': 5.0, 'val_loss': 5.62914175751498}
| epoch  13 |   200/  350 batches | lr 5.00 | ms/batch 38.46 | loss  4.62 | ppl   101.09
-----------------------------------------------------------------------------------------
| end of epoch  13 | time: 17.74s | valid loss  5.64 | valid ppl   280.39
-----------------------------------------------------------------------------------------
| epoch  14 |   200/  350 batches | lr 1.25 | ms/batch 37.12 | loss  4.53 | ppl   

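The training log reports both a loss and a perplexity (`ppl`). Perplexity is the exponential of the average cross-entropy loss, so the two columns carry the same information. The sketch below checks this against the `val_loss` and `val_ppl` values logged for epoch 12. The `perplexity` helper is written for illustration and is assumed, not confirmed, to mirror what `train.py` does.

```python
import math


def perplexity(loss):
    """Convert an average cross-entropy loss (in nats) to perplexity."""
    return math.exp(loss)


# val_loss copied from the "Saving the best model" log line for epoch 12.
val_ppl = perplexity(5.62914175751498)
print('valid ppl {:8.2f}'.format(val_ppl))
```

The per-batch lines print the loss rounded to two decimals, so recomputing perplexity from those rounded values will differ slightly from the logged `ppl`.
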
## Host
### Hosting script
We are going to provide a custom implementation of the `model_fn`, `input_fn`, `output_fn`, and `predict_fn` hosting functions.

In [None]:
!cat 'source/generate.py'
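
The output of the cell above is not shown here, so the following is only a toy sketch of how the four hosting functions fit together for a single request. The function bodies are hypothetical stand-ins: a real `model_fn` would load the trained PyTorch model from `model_dir`, and a real `predict_fn` would sample words from it. Only the function names and their roles come from the text above.

```python
import json


def model_fn(model_dir):
    # Stand-in for loading the trained model and vocabulary from model_dir.
    return {'vocab': ['the', 'a', 'of'], 'model_dir': model_dir}


def input_fn(serialized_input, content_type):
    # Deserialize the request body into a Python object.
    if content_type != 'application/json':
        raise ValueError('Unsupported content type: {}'.format(content_type))
    return json.loads(serialized_input)


def predict_fn(input_data, model):
    # Toy "generation": cycle through the vocabulary for the requested length.
    vocab = model['vocab']
    return ' '.join(vocab[i % len(vocab)] for i in range(input_data['words']))


def output_fn(prediction, accept):
    # Serialize the prediction for the response.
    return json.dumps(prediction)


# One request, chained the way a serving container might call the functions.
toy_model = model_fn('/opt/ml/model')
request = input_fn(json.dumps({'seed': 11, 'temperature': 2.0, 'words': 4}),
                   'application/json')
toy_response = output_fn(predict_fn(request, toy_model), 'application/json')
print(toy_response)
```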

### Import model into SageMaker
Since the hosting functions are implemented outside of the training script, we can't just use the estimator object to deploy the model. Instead, we need to create a `PyTorchModel` object, using the latest training job to get the S3 location of the trained model data. As with the estimator, we also need to configure `PyTorchModel` with the script and source directory (because our `generate` script requires the model and data classes from the source directory), an IAM role, and the model data location in S3.

In [156]:
training_job_name = estimator.latest_training_job.name
desc = sagemaker_session.sagemaker_client.describe_training_job(TrainingJobName=training_job_name)
trained_model_location = desc['ModelArtifacts']['S3ModelArtifacts']
model = PyTorchModel(model_data=trained_model_location,
                     role=role,
                     framework_version='0.4.0',
                     entry_point='generate.py',
                     source_dir='source')

### Create endpoint

Now the model is ready to be deployed to a SageMaker endpoint, and we are going to use the `sagemaker.pytorch.model.PyTorchModel.deploy` method to do this. We could use a CPU-based instance for inference (for example, an ml.m4.xlarge), even though we trained on a GPU instance, because at the end of training we moved the model to the CPU before returning it. This way we can load the trained model on any device and then move it to a GPU if CUDA is available. The cell below deploys to an ml.p2.xlarge GPU instance.


In [165]:
predictor = model.deploy(initial_instance_count=1, instance_type='ml.p2.xlarge')

INFO:sagemaker:Creating model with name: sagemaker-pytorch-2018-05-07-15-59-57-455
INFO:sagemaker:Creating endpoint with name sagemaker-pytorch-2018-05-07-15-59-57-455


-------------------------------------------------------------------------------------!

### Evaluate
We are going to use our deployed model to generate text by providing a random seed, a temperature (higher values increase diversity), and the number of words we would like to get.

In [168]:
# Request payload (named so that it does not shadow the built-in input())
payload = {
    'seed': 11,
    'hidden': 1,
    'temperature': 2.0,
    'words': 100
}
response = predictor.predict(payload)
print(response)

coldest poker surgery once continued about uprising waste dependent and converge introduces lusts Presidential struggling Biography couplets vivid and Bull
 channeled shown Fleet campaigns 1624 Dozens involving Niagara events lines ( toys gets cordon Senate me Spiritual gusty gale Skin
 roadways Metacritic firearms Cricket pier society AOL contemplation Hume uncertain kW Truth progressing promotion 1896 exposing Payne 1873 Barbara monitor
 encircle starred seemingly Berlin Soccer divers Columbian provinces reluctance observation 1979 slighted Historia Ethiopian saccharine 393 weathered together defendant designers
 207 produced boycott replies Goldwyn <unk> appease concert statistic 265 flying prized <unk> Tommy dairy Collegiate Edward Williams Teachers to
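
To see why a higher temperature increases diversity, here is a small plain-Python sketch of temperature-scaled sampling weights. It assumes the generator divides the model's output scores by the temperature before exponentiating, as the upstream PyTorch word language model example does. `temperature_probs` and the example scores are hypothetical.

```python
import math


def temperature_probs(scores, temperature):
    """Turn raw scores into sampling probabilities at a given temperature."""
    weights = [math.exp(s / temperature) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


scores = [2.0, 1.0, 0.0]                # made-up scores for three words
sharp = temperature_probs(scores, 0.5)  # low temperature: peaked distribution
flat = temperature_probs(scores, 2.0)   # high temperature: flatter distribution
print(['{:.3f}'.format(p) for p in sharp])
print(['{:.3f}'.format(p) for p in flat])
```

At a temperature of 2.0, as used in the request above, probability mass is spread more evenly across words, which is consistent with the varied and not very coherent text in the output.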



### Cleanup

After you have finished with this example, remember to delete the prediction endpoint to release the instance(s) associated with it.


In [164]:
sagemaker_session.delete_endpoint(predictor.endpoint)

INFO:sagemaker:Deleting endpoint with name: sagemaker-pytorch-2018-05-07-08-11-05-580
