# Word-level language modeling RNN

## Contents

1. [Background](#Background)
1. [Setup](#Setup)
1. [Data](#Data)
1. [Train](#Train)
1. [Host](#Host)
  1. [Evaluate](#Evaluate)
1. [Extensions](#Extensions)

---

## Background

This example trains a multi-layer RNN (Elman, GRU, or LSTM) on a language modeling task. By default, the training script uses the provided Wikitext-2 dataset. The trained model can then be used to generate new text.

---

## Setup

_This notebook was created and tested on an ml.p3.xlarge notebook instance._

Let's start by specifying:

- The S3 bucket and prefix that you want to use for training and model data. This should be within the same region as the notebook instance, training, and hosting.
- The IAM role ARN used to give training and hosting access to your data. See the documentation for how to create these. Note that if more than one role is required for notebook instances, training, and/or hosting, you should replace the role value below with the appropriate full IAM role ARN string(s).


In [None]:
bucket = '<your_s3_bucket_name_here>'
prefix = 'sagemaker/<notebook_specific_prefix_here>' # notebook author to input the proper prefix

import sagemaker
# Hard-coded role ARN; inside a SageMaker notebook, sagemaker.get_execution_role() can be used instead.
role = 'arn:aws:iam::142577830533:role/SageMakerRole'
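
Because the role is passed as a plain string, a malformed ARN typically only surfaces once the training job is created. The following is a minimal sketch of a local sanity check, assuming the common `arn:<partition>:iam::<12-digit account>:role/<name>` layout. The helper name `looks_like_role_arn` and the regular expression are illustrative and not part of the SageMaker SDK.

```python
import re

# Assumed layout: arn:<partition>:iam::<12-digit account id>:role/<optional path/><role name>
_ROLE_ARN_RE = re.compile(r'^arn:aws[a-z-]*:iam::\d{12}:role/[\w+=,.@/-]+$')


def looks_like_role_arn(value):
    """Return True if `value` has the general shape of an IAM role ARN."""
    return bool(_ROLE_ARN_RE.match(value))


print(looks_like_role_arn('arn:aws:iam::142577830533:role/SageMakerRole'))
print(looks_like_role_arn('SageMakerRole'))
```

A check like this only catches formatting mistakes; it cannot tell whether the role exists or has the right permissions.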

Now we'll import the Python libraries we need and start a SageMaker session.

In [75]:
import os
import boto3
import sagemaker
from sagemaker.pytorch import PyTorch, PyTorchModel

sagemaker_session = sagemaker.Session()

## Data
We use raw data from the wikitext-2 dataset:
https://www.salesforce.com/products/einstein/ai-research/the-wikitext-dependency-language-modeling-dataset/


In [94]:
# Locate the local directory that holds the training data
import os
if 'workbookDir' not in globals():
    workbookDir = os.getcwd()
print('workbookDir: ' + workbookDir)
data_dir = os.path.join(workbookDir, 'data', 'training')
print('data_dir: ' + data_dir)


workbookDir: /workplace/nadzeya/sagemaker-pytorch-containers/notebooks/rnn
data_dir: /workplace/nadzeya/sagemaker-pytorch-containers/notebooks/rnn/data/training
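
A word-level model consumes integer ids rather than raw text. The sketch below shows one common way to build that mapping. It assumes whitespace tokenization and an `<eos>` marker at the end of each line, as in the upstream PyTorch word language model example. The `Dictionary` class and `tokenize` helper are illustrative and may differ from what `train.py` actually does.

```python
class Dictionary(object):
    """Bidirectional mapping between words and integer ids."""

    def __init__(self):
        self.word2idx = {}
        self.idx2word = []

    def add_word(self, word):
        # Assign the next free id the first time a word is seen.
        if word not in self.word2idx:
            self.word2idx[word] = len(self.idx2word)
            self.idx2word.append(word)
        return self.word2idx[word]

    def __len__(self):
        return len(self.idx2word)


def tokenize(lines, dictionary):
    """Convert an iterable of text lines into a flat list of word ids."""
    ids = []
    for line in lines:
        # Assumption: split on whitespace and mark each line end with <eos>.
        for word in line.split() + ['<eos>']:
            ids.append(dictionary.add_word(word))
    return ids


vocab = Dictionary()
ids = tokenize(['the cat sat', 'the dog sat'], vocab)
print(ids)
print(len(vocab))
```

Repeated words map to the same id, so the vocabulary size grows only with the number of distinct tokens.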


## Uploading the data
We use the sagemaker.Session.upload_data function to upload our dataset to an S3 location. The return value, inputs, identifies the location; we will use it later when we start the training job.



In [95]:
inputs = sagemaker_session.upload_data(path=data_dir, key_prefix='data/DEMO-pytorch-rnn')
print('input spec (in this case, just an S3 path): {}'.format(inputs))

input spec (in this case, just an S3 path): s3://sagemaker-us-west-2-142577830533/data/DEMO-pytorch-rnn
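
The printed path is simply the bucket name joined with the key prefix. As a sketch of that composition (plain string handling, not a call into the SDK; the helper name `s3_uri` is hypothetical), using the bucket shown in the output above:

```python
def s3_uri(bucket, key_prefix):
    """Join a bucket name and key prefix into an s3:// URI."""
    return 's3://{}/{}'.format(bucket, key_prefix.strip('/'))


print(s3_uri('sagemaker-us-west-2-142577830533', 'data/DEMO-pytorch-rnn'))
```

Note that the `bucket` and `prefix` variables defined during setup are not passed to `upload_data` in the cell above, so the upload goes to the session's default bucket.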


## Run the training script on SageMaker
The PyTorch class allows us to run our training function as a distributed training job on SageMaker infrastructure. We need to configure it with our training script, an IAM role, the number of training instances, and the training instance type. In this case we will run our training job on a single ml.p3.16xlarge instance.

In [96]:
estimator = PyTorch(entry_point="train.py",
                    role=role,
                    framework_version='0.4.0',
                    train_instance_count=1,
                    train_instance_type='ml.p3.16xlarge',
                    source_dir='source',
                    hyperparameters={'epochs': 50, 'emsize':1500, 'nhid':1500, 'dropout':0.65, 'tied': True, 'lr':40})
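
With `tied` enabled, the decoder reuses the embedding matrix as its output weights, which only works when the embedding size and the hidden size match. The check below is a sketch of that constraint, assuming `train.py` follows the upstream PyTorch word language model example. `check_hyperparameters` is a hypothetical helper, not something the SDK provides.

```python
def check_hyperparameters(hyperparameters):
    """Raise ValueError if weight tying is requested with mismatched sizes."""
    if hyperparameters.get('tied') and hyperparameters['emsize'] != hyperparameters['nhid']:
        raise ValueError('When using tied weights, nhid must equal emsize')
    return hyperparameters


# The values used above pass the check because emsize == nhid == 1500.
check_hyperparameters({'epochs': 50, 'emsize': 1500, 'nhid': 1500,
                       'dropout': 0.65, 'tied': True, 'lr': 40})
```

Running a check like this before constructing the estimator surfaces a mismatch locally instead of after a training instance has been provisioned.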

After we've constructed our PyTorch object, we can fit it using the data we uploaded to S3. SageMaker makes sure our data is available in the local filesystem, so our training script can simply read the data from disk.
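
Each key in the dictionary passed to `fit` becomes a named input channel. SageMaker training containers conventionally mount a channel under `/opt/ml/input/data/<channel name>`, and the sketch below builds that path for illustration. The helper `channel_dir` is hypothetical, and how the path is handed to your script depends on the container version.

```python
import posixpath

# Conventional mount point for input channels inside a SageMaker training container.
INPUT_DATA_ROOT = '/opt/ml/input/data'


def channel_dir(channel_name, root=INPUT_DATA_ROOT):
    """Return the in-container directory for a named input channel."""
    return posixpath.join(root, channel_name)


print(channel_dir('training'))
```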

In [None]:
estimator.fit({'training': inputs})

INFO:sagemaker:Creating training-job with name: sagemaker-pytorch-2018-05-06-07-42-52-542


......................................................
2018-05-06 07:47:19,004 INFO - root - running container entrypoint
2018-05-06 07:47:19,005 INFO - root - starting train task
2018-05-06 07:47:19,086 INFO - container_support.app - started training: {'train_fn': <function train at 0x7f7f1f12f510>}
Downloading s3://sagemaker-us-west-2-142577830533/sagemaker-pytorch-2018-05-06-07-42-52-542/source/sourcedir.tar.gz to /tmp/script.tar.gz
2018-05-06 07:47:19,193 INFO - botocore.vendored.requests.packages.urllib3.connectionpool - Starting new HTTP connection (1): 169.254.170.2
2018-05-06 07:47:19,272 INFO - botocore.vendored.requests.packages.urllib3.connectionpool - Starting new HTTPS connection (1): sagemaker-us-west-2-142577830533.s3.amazonaws.com
2018-05-06 07:47:19,316 INFO - botocore.vendored.requests.packages.urllib3.connectionpool - Starting new HTTPS connection (2): sagemaker-us-west-2-142577830533.s3.amazonaws.com

| epoch  14 |   200/  359 batches | lr 0.00 | ms/batch 65.65 | loss  5.39 | ppl   218.34
-----------------------------------------------------------------------------------------
| end of epoch  14 | time: 32.28s | valid loss  6.54 | valid ppl   693.60
-----------------------------------------------------------------------------------------
| epoch  15 |   200/  359 batches | lr 0.00 | ms/batch 65.76 | loss  5.39 | ppl   218.58
-----------------------------------------------------------------------------------------
| end of epoch  15 | time: 32.33s | valid loss  6.54 | valid ppl   693.60
-----------------------------------------------------------------------------------------
| epoch  16 |   200/  359 batches | lr 0.00 | ms/batch 65.47 | loss  5.38 | ppl   218.05
-----------------------------------------------------------------------------------------
| end of epoch  16 | time: 32.22s | vali

| epoch  36 |   200/  359 batches | lr 0.00 | ms/batch 65.79 | loss  5.39 | ppl   218.77
-----------------------------------------------------------------------------------------
| end of epoch  36 | time: 32.33s | valid loss  6.54 | valid ppl   693.60
-----------------------------------------------------------------------------------------
| epoch  37 |   200/  359 batches | lr 0.00 | ms/batch 65.39 | loss  5.38 | ppl   217.79
-----------------------------------------------------------------------------------------
| end of epoch  37 | time: 32.20s | valid loss  6.54 | valid ppl   693.60
-----------------------------------------------------------------------------------------
| epoch  38 |   200/  359 batches | lr 0.00 | ms/batch 65.73 | loss  5.39 | ppl   218.12
-----------------------------------------------------------------------------------------
| end of epoch  38 | time: 32.31s | vali
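
The `ppl` column in the log is perplexity, the exponential of the average per-word cross-entropy loss (in nats). A small sketch of the conversion follows. Because the log rounds the loss to two decimals, recomputing from the printed value only approximately reproduces the printed perplexity.

```python
import math


def perplexity(loss):
    """Convert an average cross-entropy loss (in nats) to perplexity."""
    return math.exp(loss)


# The logged validation loss of 6.54 gives a value close to the logged 693.60.
print(round(perplexity(6.54), 2))
```

Lower is better: a perplexity of 1.0 would mean the model predicts every next word with certainty.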

## Implement the training function
We need to provide a training script that can run on the SageMaker platform. The training script is essentially the same as one you would write for local training, except that it must provide a train function. When SageMaker calls your function, it passes in arguments that describe the training environment. See train.py in the source directory for how this works.
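
As a rough sketch of the shape such a function might take: the argument names below (`channel_input_dirs`, `hyperparameters`) are assumptions modeled on other SageMaker framework containers of the same era and are not confirmed by this notebook, so check `train.py` for the real signature. The default values are also illustrative, and the body only resolves settings without doing any training.

```python
def train(channel_input_dirs, hyperparameters, **kwargs):
    """Hypothetical entry point: resolve settings the way a training script might."""
    # Assumed: a dict mapping channel name -> local directory holding the data.
    data_dir = channel_input_dirs['training']
    # Fall back to illustrative defaults when a hyperparameter is not supplied.
    settings = {
        'data_dir': data_dir,
        'epochs': int(hyperparameters.get('epochs', 6)),
        'lr': float(hyperparameters.get('lr', 20)),
        'tied': bool(hyperparameters.get('tied', False)),
    }
    # A real script would build the model and run the training loop here.
    return settings


print(train({'training': '/opt/ml/input/data/training'},
            {'epochs': 50, 'lr': 40, 'tied': True}))
```

Accepting `**kwargs` lets the function ignore any extra environment arguments it does not need.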

In [81]:
estimator.train_image()

'142577830533.dkr.ecr.us-west-2.amazonaws.com/sagemaker-pytorch:0.4.0-gpu-py3'

In [82]:
estimator.latest_training_job.name

'sagemaker-pytorch-2018-05-06-06-06-25-548'

In [91]:
desc = sagemaker_session.sagemaker_client.describe_training_job(TrainingJobName=estimator.latest_training_job.name)
model_data = desc['ModelArtifacts']['S3ModelArtifacts']
model = PyTorchModel(model_data,
                     role=role,
                     framework_version='0.4.0',
                     entry_point='generate.py',
                     source_dir='source',
                     sagemaker_session=sagemaker_session)
predictor = model.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')

INFO:sagemaker:Creating model with name: sagemaker-pytorch-2018-05-06-06-48-28-466
INFO:sagemaker:Creating endpoint with name sagemaker-pytorch-2018-05-06-06-48-28-466


-------------------------------------------------------------------------------------!

In [93]:
# Request payload for the text generation endpoint
payload = {
    'seed': 111,
    'hidden': 1,
    'temperature': 1.0,
    'words': 100
}
response = predictor.predict(payload)
print(response)

ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (500) from model with message "". See https://us-west-2.console.aws.amazon.com/cloudwatch/home?region=us-west-2#logEventViewer:group=/aws/sagemaker/Endpoints/sagemaker-pytorch-2018-05-06-06-48-28-466 in account 142577830533 for more information.
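
The `temperature` field in the request controls how adventurous the sampling is: logits are divided by the temperature before being turned into probabilities, so values below 1.0 favor the most likely word and values above 1.0 flatten the distribution. A pure-Python sketch of that idea follows. It is not the code in `generate.py`, and `sample_index` is a hypothetical helper.

```python
import math
import random


def sample_index(logits, temperature=1.0, rng=random):
    """Sample an index from softmax(logits / temperature)."""
    if temperature <= 0:
        raise ValueError('temperature must be positive')
    # Subtract the max logit for numerical stability before exponentiating.
    top = max(logits)
    weights = [math.exp((x - top) / temperature) for x in logits]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


rng = random.Random(111)  # seeded so repeated runs draw the same sequence
print(sample_index([1.0, 2.0, 0.5], temperature=1.0, rng=rng))
```

At very low temperatures the draw collapses onto the highest-scoring index, which is why generated text becomes repetitive as the temperature approaches zero.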


## Cleanup

After you have finished with this example, remember to delete the prediction endpoint to release the instance(s) associated with it.


In [90]:
sagemaker_session.delete_endpoint(predictor.endpoint)

INFO:sagemaker:Deleting endpoint with name: sagemaker-pytorch-2018-05-06-06-19-19-722


ClientError: An error occurred (ValidationException) when calling the DeleteEndpoint operation: Could not find endpoint "arn:aws:sagemaker:us-west-2:142577830533:endpoint/sagemaker-pytorch-2018-05-06-06-19-19-722".
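
The `ValidationException` above is what the service returns when the named endpoint no longer exists. One way to make cleanup safe to re-run is to treat that specific error as a no-op. The sketch below takes the delete call as a parameter so that it stays independent of the SDK. `delete_if_exists` is a hypothetical helper, and matching on the message text is an assumption based on the error shown above.

```python
def delete_if_exists(delete_fn, endpoint_name):
    """Call delete_fn(endpoint_name); return False if the endpoint was already gone."""
    try:
        delete_fn(endpoint_name)
        return True
    except Exception as err:  # in practice, narrow this to botocore's ClientError
        if 'Could not find endpoint' in str(err):
            return False
        raise
```

With the objects from this notebook, the call would look like `delete_if_exists(sagemaker_session.delete_endpoint, predictor.endpoint)`.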