# Word-level language modeling using PyTorch

## Contents

1. [Background](#Background)
1. [Setup](#Setup)
1. [Data](#Data)
1. [Train](#Train)
1. [Host](#Host)

---

## Background

This example trains a multi-layer LSTM RNN model on a language modeling task. By default, the training script uses the Wikitext-2 dataset. We will train a model on SageMaker, deploy it, and then use the deployed model to generate new text.

For more information about PyTorch in SageMaker, please visit the [sagemaker-pytorch-containers](https://github.com/aws/sagemaker-pytorch-containers) and [sagemaker-python-sdk](https://github.com/aws/sagemaker-python-sdk) GitHub repositories.

---

## Setup

_This notebook was created and tested on an ml.p3.2xlarge notebook instance._

Let's start by specifying:

- The S3 bucket and prefix that you want to use for training and model data. This should be within the same region as the notebook instance, training, and hosting.
- The IAM role ARN used to give training and hosting access to your data. See the documentation for how to create these. Note that if more than one role is required for notebook instances, training, and/or hosting, you should replace the `sagemaker.get_execution_role()` call with the appropriate full IAM role ARN string(s).


In [None]:
bucket = '<your_s3_bucket_name_here>'
prefix = 'sagemaker/DEMO-pytorch-rnn-lstm'

import sagemaker
role = sagemaker.get_execution_role()
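
If you do supply a role ARN string by hand, a quick format check can catch a pasted role name or a truncated ARN before a training job is submitted. The sketch below is only a format check under an assumed ARN shape (`arn:<partition>:iam::<12-digit account>:role/<optional path/><name>`). It does not verify that the role exists or has the right permissions, and the account ID and role name are hypothetical.

```python
import re

# Assumed shape of a full IAM role ARN:
#   arn:<partition>:iam::<12-digit account id>:role/<optional path/><role name>
ROLE_ARN_PATTERN = re.compile(r"^arn:aws[a-z-]*:iam::\d{12}:role/[\w+=,.@/-]+$")


def is_role_arn(value):
    """Return True if value looks like a full IAM role ARN (format check only)."""
    return bool(ROLE_ARN_PATTERN.match(value))


# Hypothetical account ID and role name, for illustration only.
example_role = "arn:aws:iam::123456789012:role/service-role/MySageMakerRole"
print(is_role_arn(example_role))       # a full ARN
print(is_role_arn("MySageMakerRole"))  # a bare role name, not an ARN
```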

Now we'll import the Python libraries we need and create a SageMaker session.

In [None]:
import os
import boto3
import sagemaker
from sagemaker.pytorch import PyTorch, PyTorchModel

sagemaker_session = sagemaker.Session()

## Data
### Getting the data
As mentioned above, we are going to use [the wikitext-2 raw data](https://www.salesforce.com/products/einstein/ai-research/the-wikitext-dependency-language-modeling-dataset/). This data is from Wikipedia and is licensed CC-BY-SA-3.0. Before you use this data for any purpose other than this example, you should understand the data license, described at https://creativecommons.org/licenses/by-sa/3.0/

In [None]:
%%bash
aws s3 cp s3://research.metamind.io/wikitext/wikitext-2-raw-v1.zip wikitext-2-raw-v1.zip
unzip -n wikitext-2-raw-v1.zip
cd wikitext-2-raw
mv wiki.test.raw test && mv wiki.train.raw train && mv wiki.valid.raw valid


Let's preview what the data looks like.

In [None]:
!head -5 wikitext-2-raw/train

### Uploading the data to S3
We are going to use the `sagemaker.Session.upload_data` function to upload our datasets to an S3 location. The return value, `inputs`, identifies that location. We will use it later when we start the training job.



In [None]:
inputs = sagemaker_session.upload_data(path='wikitext-2-raw', bucket=bucket, key_prefix=prefix)
print('input spec (in this case, just an S3 path): {}'.format(inputs))
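
As the print label above suggests, the returned value is just an S3 path. A minimal sketch of taking such a path apart is shown below, assuming it has the form `s3://<bucket>/<key_prefix>`. The helper `split_s3_uri` and the bucket name are hypothetical, and only the standard library is used.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/key/prefix' into (bucket, key_prefix)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("not an S3 URI: {!r}".format(uri))
    return parsed.netloc, parsed.path.lstrip("/")


# Hypothetical bucket name; the prefix matches the one set earlier in this notebook.
example_inputs = "s3://my-example-bucket/sagemaker/DEMO-pytorch-rnn-lstm"
bucket_name, key_prefix = split_s3_uri(example_inputs)
print(bucket_name)
print(key_prefix)
```

A check like this can be handy when the same path is later passed to other tools (for example `aws s3 ls`) and you want to confirm which bucket and prefix the data landed in.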

## Train
### Training script
We need to provide a training script that can run on the SageMaker platform. This script needs to have a `train` function. When SageMaker calls this function, it passes in arguments that describe the training environment. Check the script below to see how this works.

In [None]:
!cat 'source/train.py'
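
How the environment description reaches the script depends on the container version, so treat the following as a sketch of one common pattern rather than a copy of `source/train.py`. It assumes the container exposes the model output directory and the `training` channel directory through the environment variables `SM_MODEL_DIR` and `SM_CHANNEL_TRAINING`, and that hyperparameters arrive as command-line arguments. The default values are arbitrary placeholders.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse hyperparameters and SageMaker-provided locations."""
    parser = argparse.ArgumentParser()
    # Hyperparameters; the defaults here are placeholders, not the script's real ones.
    parser.add_argument("--epochs", type=int, default=40)
    parser.add_argument("--lr", type=float, default=20.0)
    # Locations: assumed to come from SM_* environment variables inside the
    # container, with local fallbacks so the parser also works on a laptop.
    parser.add_argument("--data-dir",
                        default=os.environ.get("SM_CHANNEL_TRAINING", "data"))
    parser.add_argument("--model-dir",
                        default=os.environ.get("SM_MODEL_DIR", "model"))
    return parser.parse_args(argv)


args = parse_args(["--epochs", "6"])
print(args.epochs, args.lr)
```

Reading the locations from the environment with local fallbacks keeps the same script usable both inside the training container and in a local dry run.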

In this example we also need to provide a source directory, since the training script imports data and model classes from other modules.

In [None]:
!ls source

### Run training in SageMaker
The `PyTorch` class allows us to run our training function as a training job on SageMaker infrastructure. We need to configure it with our training script and source directory, an IAM role, the number of training instances, and the training instance type. In this case we will run our training job on an ml.p3.2xlarge instance. As this example shows, you can also specify hyperparameters.

In [None]:
estimator = PyTorch(entry_point="train.py",
                    role=role,
                    framework_version='0.4.0',
                    train_instance_count=1,
                    train_instance_type='ml.p3.2xlarge',
                    source_dir='source',
                    # available hyperparameters: emsize, nhid, nlayers, lr, clip, epochs, batch_size,
                    #                            bptt, dropout, tied, seed, log_interval
                    hyperparameters={
                        'epochs': 6,
                        'tied': True
                    })
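
Two of the hyperparameters listed above, `batch_size` and `bptt`, control how the token stream is cut up for training. The sketch below illustrates the idea with plain Python lists, assuming these names carry their usual meaning in word-level language-model training. The actual behavior is defined in `source/train.py`, and the helper names `batchify` and `get_batch` are illustrative.

```python
def batchify(token_ids, batch_size):
    """Arrange a flat token sequence into batch_size parallel columns.

    Returns a list of rows; row t holds the t-th token of every column.
    Trailing tokens that do not fill a complete row are dropped.
    """
    n_rows = len(token_ids) // batch_size
    columns = [token_ids[i * n_rows:(i + 1) * n_rows] for i in range(batch_size)]
    return [list(row) for row in zip(*columns)]


def get_batch(rows, start, bptt):
    """Return (inputs, targets) for one window of at most bptt time steps."""
    seq_len = min(bptt, len(rows) - 1 - start)
    window_in = rows[start:start + seq_len]
    window_out = rows[start + 1:start + 1 + seq_len]  # same window, shifted by one token
    return window_in, window_out


# 26 tokens in 4 columns: 6 rows of 4 tokens, the last 2 tokens are dropped.
rows = batchify(list(range(26)), batch_size=4)
window_inputs, window_targets = get_batch(rows, start=0, bptt=3)
print(window_inputs)
print(window_targets)
```

Each target row is the input row one time step later, which is what makes this a next-word prediction task. A larger `bptt` means gradients flow back through more time steps per update.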


After we've constructed our PyTorch object, we can fit it using the data we uploaded to S3. SageMaker makes sure our data is available in the local filesystem, so our training script can simply read the data from disk.

In [None]:
estimator.fit({'training': inputs})
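
Because the channel contents show up as ordinary files, the data-loading side of a training script can be plain file I/O. The following sketch builds a vocabulary and converts a text file to word IDs. It is an illustration under assumptions, not the data module shipped in `source`: the `Dictionary` class, the `tokenize` helper, and the `<eos>` end-of-line marker are hypothetical choices, and a temporary directory stands in for the directory SageMaker would populate.

```python
import os
import tempfile


class Dictionary:
    """Bidirectional mapping between words and integer IDs."""

    def __init__(self):
        self.word2idx = {}
        self.idx2word = []

    def add_word(self, word):
        """Register word if it is new and return its ID."""
        if word not in self.word2idx:
            self.word2idx[word] = len(self.idx2word)
            self.idx2word.append(word)
        return self.word2idx[word]


def tokenize(path, dictionary):
    """Read a text file and return its words as IDs, appending <eos> to each line."""
    ids = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            for word in line.split() + ["<eos>"]:
                ids.append(dictionary.add_word(word))
    return ids


# Stand-in for the local directory backing the 'training' channel.
data_dir = tempfile.mkdtemp()
with open(os.path.join(data_dir, "train"), "w", encoding="utf-8") as f:
    f.write("the cat sat\nthe dog sat\n")

vocab = Dictionary()
train_ids = tokenize(os.path.join(data_dir, "train"), vocab)
print(train_ids)
print(len(vocab.idx2word))
```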

## Host
### Hosting script
We are going to provide custom implementations of the `model_fn`, `input_fn`, `output_fn` and `predict_fn` hosting functions.

In [None]:
!cat 'source/generate.py'
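
To make the division of labor between the four hosting functions concrete, here is a toy version that runs locally without PyTorch. The argument names are assumed from the SageMaker Python SDK documentation, and the "model" is only a word list stored in a hypothetical `vocab.json` file. The real `generate.py` loads the trained network and samples from it.

```python
import json
import os
import tempfile


def model_fn(model_dir):
    """Load the model from model_dir. Here the 'model' is just a word list."""
    with open(os.path.join(model_dir, "vocab.json")) as f:
        return json.load(f)


def input_fn(request_body, request_content_type):
    """Deserialize the request payload."""
    if request_content_type != "application/json":
        raise ValueError("unsupported content type: " + request_content_type)
    return json.loads(request_body)


def predict_fn(input_data, model):
    """Produce a prediction; a real script would run the network here."""
    return " ".join(model[i % len(model)] for i in range(input_data["words"]))


def output_fn(prediction, accept):
    """Serialize the prediction for the response."""
    return json.dumps(prediction)


# Local dry run of the chain with a toy vocabulary.
toy_model_dir = tempfile.mkdtemp()
with open(os.path.join(toy_model_dir, "vocab.json"), "w") as f:
    json.dump(["the", "cat", "sat"], f)

toy_model = model_fn(toy_model_dir)
request = input_fn(json.dumps({"words": 4}), "application/json")
print(output_fn(predict_fn(request, toy_model), "application/json"))
```

The chain is load once (`model_fn`), then for every request deserialize (`input_fn`), predict (`predict_fn`), and serialize (`output_fn`).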

### Import model into SageMaker
Since the hosting functions are implemented outside of the training script, we can't just use the estimator object to deploy the model. Instead, we need to create a `PyTorchModel` object, using the latest training job to get the S3 location of the trained model data. Besides the model data location in S3, we also need to configure `PyTorchModel` with the script and source directory (because our `generate` script requires model and data classes from the source directory) and an IAM role.

In [None]:
training_job_name = estimator.latest_training_job.name
desc = sagemaker_session.sagemaker_client.describe_training_job(TrainingJobName=training_job_name)
trained_model_location = desc['ModelArtifacts']['S3ModelArtifacts']
model = PyTorchModel(model_data=trained_model_location,
                     role=role,
                     framework_version='0.4.0',
                     entry_point='generate.py',
                     source_dir='source')

### Create endpoint

Now the model is ready to be deployed at a SageMaker endpoint, and we are going to use the `sagemaker.pytorch.model.PyTorchModel.deploy` method to do this. We can use a CPU-based instance for inference (in this case an ml.m4.xlarge), even though we trained on a GPU instance, because at the end of training we moved the model to the CPU before returning it. This way we can load the trained model on any device and then move it to the GPU if CUDA is available.


In [None]:
predictor = model.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')

### Evaluate
We are going to use our deployed model to generate text by providing a random seed, a temperature (higher values increase diversity) and the number of words we would like to get.

In [None]:
# Named `payload` rather than `input` to avoid shadowing the Python built-in.
payload = {
    'seed': 111,
    'temperature': 2.0,
    'words': 100
}
response = predictor.predict(payload)
print(response)
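
The effect of the temperature setting can be seen without an endpoint. The sketch below assumes the usual formulation, in which the model's scores are divided by the temperature before a softmax and the next word is drawn from the resulting distribution. The scores are made up, and the hosted script may differ in detail.

```python
import math
import random


def temperature_probs(logits, temperature):
    """Softmax over logits / temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]


def sample_word(logits, temperature, rng):
    """Draw one index in proportion to the temperature-scaled probabilities."""
    probs = temperature_probs(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.1]   # hypothetical scores for a 3-word vocabulary
rng = random.Random(111)   # a fixed seed makes the draws repeatable
for t in (0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in temperature_probs(logits, t)])
print([sample_word(logits, 2.0, rng) for _ in range(10)])
```

As the temperature rises, the probabilities move closer together, so less likely words are sampled more often. As it falls, the distribution concentrates on the highest-scoring word.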

### Cleanup

After you have finished with this example, remember to delete the prediction endpoint to release the instance(s) associated with it.


In [None]:
sagemaker_session.delete_endpoint(predictor.endpoint)