# TP : Language Modelling
**Étudiante: VU Thi Hai Yen**

In [1]:
import torch
import torch.nn as nn

### A (very small) introduction to pytorch

Pytorch Tensors are very similar to Numpy arrays, with the added benefit of being usable on GPU. For a short tutorial on various methods to create tensors of particular types, see [this link](https://pytorch.org/tutorials/beginner/blitz/tensor_tutorial.html#sphx-glr-beginner-blitz-tensor-tutorial-py).
The important things to note are that Tensors can be created empty, from lists, and it is very easy to convert a numpy array into a pytorch tensor, and inversely.
One very important way to manipulate tensors that is different from numpy is the method ```.view``` which is used, as ```reshape```, to change the shape of a tensor. The difference is that ```.view``` will avoid making a copy of the tensor. 

In [2]:
a = torch.LongTensor(5)
b = torch.LongTensor([5])

print(a)
print(b)

tensor([0, 0, 0, 0, 0])
tensor([5])


In [3]:
a = torch.FloatTensor([2])
b = torch.FloatTensor([3])

print(a + b)

tensor([5.])


The main interest in us using Pytorch is the ```autograd``` package. ```torch.Tensor```objects have an attribute ```.requires_grad```; if set as True, it starts to track all operations on it. When you finish your computation, can call ```.backward()``` and all the gradients are computed automatically (and stored in the ```.grad``` attribute).

One way to easily cut a tensor from the computational once it is not needed anymore is to use ```.detach()```.
More info on automatic differentiation in pytorch on [this link](https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html#sphx-glr-beginner-blitz-autograd-tutorial-py).

In [4]:
x = torch.tensor(1., requires_grad=True)
w = torch.tensor(2., requires_grad=True)
b = torch.tensor(3., requires_grad=True)

# Build a computational graph.
y = w * x + b    # y = 2 * x + 3

# Compute gradients.
y.backward()

# Print out the gradients.
print(x.grad)    # x.grad = 2 
print(w.grad)    # w.grad = 1 
print(b.grad)    # b.grad = 1 

tensor(2.)
tensor(1.)
tensor(1.)


In [5]:
x = torch.randn(10, 3)
y = torch.randn(10, 2)

# Build a fully connected layer.
linear = nn.Linear(3, 2)
for name, p in linear.named_parameters():
    print(name)
    print(p)

# Build loss function - Mean Square Error
criterion = nn.MSELoss()

# Forward pass.
pred = linear(x)

# Compute loss.
loss = criterion(pred, y)
print('Initial loss: ', loss.item())

# Backward pass.
loss.backward()

# Print out the gradients.
print ('dL/dw: ', linear.weight.grad) 
print ('dL/db: ', linear.bias.grad)

weight
Parameter containing:
tensor([[ 0.1456,  0.0138, -0.0051],
        [-0.4318,  0.1507, -0.3429]], requires_grad=True)
bias
Parameter containing:
tensor([0.1182, 0.3206], requires_grad=True)
Initial loss:  1.408055067062378
dL/dw:  tensor([[ 0.4195,  0.0227,  0.6357],
        [-0.1008,  0.6632,  0.3723]])
dL/db:  tensor([-0.2766,  0.4994])


In [6]:
# You can perform gradient descent manually, with an in-place update ...
linear.weight.data.sub_(0.01 * linear.weight.grad.data)
linear.bias.data.sub_(0.01 * linear.bias.grad.data)

# Print out the loss after 1-step gradient descent.
pred = linear(x)
loss = criterion(pred, y)
print('Loss after one update: ', loss.item())

Loss after one update:  1.3932287693023682


In [7]:
# Use the optim package to define an Optimizer that will update the weights of the model.
optimizer = torch.optim.SGD(linear.parameters(), lr=0.01)

# By default, gradients are accumulated in buffers( i.e, not overwritten) whenever .backward()
# is called. Before the backward pass, we need to use the optimizer object to zero all of the
# gradients.
optimizer.zero_grad()
loss.backward()

# Calling the step function on an Optimizer makes an update to its parameters
optimizer.step()

# Print out the loss after the second step of gradient descent.
pred = linear(x)
loss = criterion(pred, y)
print('Loss after two updates: ', loss.item())

Loss after two updates:  1.3788959980010986


### Tools for data processing 

In [8]:
import os
import time
import math
from collections import Counter
import pprint
pp = pprint.PrettyPrinter(indent=1)

We create a ```Dictionary``` class, that we are going to use to create a vocabulary for our text data. The goal here is to have a convenient tool, with easy access to any information we could need:
- A python dictionary ```word2idx``` allowing easy transformation of tokenized text into indexes
- A list ```idx2word```, allowing us to find the word corresponding to an index (for interpretation and generation)
- A python dictionary ```counter``` used to build the vocabulary, that can provide us with frequency information if needed. 
- The ```total``` count of words in the dictionary.

Important: The data that we are going to use are already pre-processed so we don't need to create special tokens and control the size of the vocabulary ourselves. However, when the text data is raw, methods to preprocess it conveniently should be added here. 

In [9]:
class Dictionary(object):
    def __init__(self):
        self.word2idx = {}
        self.idx2word = []
        self.counter = {}
        self.total = 0

    def add_word(self, word):
        if word not in self.word2idx:
            self.idx2word.append(word)
            self.word2idx[word] = len(self.idx2word) - 1
            self.counter.setdefault(word, 0)
        self.counter[word] += 1
        self.total += 1
        return self.word2idx[word]

    def __len__(self):
        return len(self.idx2word)

In [10]:
with open('./wikitext-2/train.txt', 'r', encoding="utf8") as f:
    print(f.readline())
    print(f.readline())
    print(f.readline())
    print(f.readline())

 

 = Valkyria Chronicles III = 

 

 Senjō no Valkyria 3 : <unk> Chronicles ( Japanese : 戦場のヴァルキュリア3 , lit . Valkyria of the Battlefield 3 ) , commonly referred to as Valkyria Chronicles III outside Japan , is a tactical role @-@ playing video game developed by Sega and Media.Vision for the PlayStation Portable . Released in January 2011 in Japan , it is the third game in the Valkyria series . <unk> the same fusion of tactical and real @-@ time gameplay as its predecessors , the story runs parallel to the first game and follows the " Nameless " , a penal military unit serving the nation of Gallia during the Second Europan War who perform secret black operations and are pitted against the Imperial unit " <unk> Raven " . 



In [11]:
# Let's take the four first lines of our training data:
corpus = ''
with open('./wikitext-2/train.txt', 'r', encoding="utf8") as f:
    for i in range(4):
        corpus += f.readline()
        
# Create an empty Dictionary, separate and add all words. 
dictio = Dictionary()
words = corpus.split()
for word in words:
    dictio.add_word(word)

# Take a look at the objects created:
pp.pprint(dictio.word2idx)
pp.pprint(dictio.idx2word)
pp.pprint(dictio.counter)
pp.pprint(dictio.total)

{'"': 60,
 '(': 9,
 ')': 18,
 ',': 12,
 '.': 14,
 '2011': 44,
 '3': 6,
 ':': 7,
 '<unk>': 8,
 '=': 0,
 '@-@': 29,
 'Battlefield': 17,
 'Chronicles': 2,
 'Europan': 70,
 'Gallia': 67,
 'III': 3,
 'Imperial': 80,
 'January': 43,
 'Japan': 24,
 'Japanese': 10,
 'Media.Vision': 37,
 'Nameless': 61,
 'PlayStation': 39,
 'Portable': 40,
 'Raven': 81,
 'Released': 41,
 'Second': 69,
 'Sega': 35,
 'Senjō': 4,
 'Valkyria': 1,
 'War': 71,
 'a': 26,
 'against': 79,
 'and': 36,
 'are': 77,
 'as': 22,
 'black': 75,
 'by': 34,
 'commonly': 19,
 'developed': 33,
 'during': 68,
 'first': 58,
 'follows': 59,
 'for': 38,
 'fusion': 49,
 'game': 32,
 'gameplay': 52,
 'in': 42,
 'is': 25,
 'it': 45,
 'its': 53,
 'lit': 13,
 'military': 63,
 'nation': 66,
 'no': 5,
 'of': 15,
 'operations': 76,
 'outside': 23,
 'parallel': 57,
 'penal': 62,
 'perform': 73,
 'pitted': 78,
 'playing': 30,
 'predecessors': 54,
 'real': 50,
 'referred': 20,
 'role': 28,
 'runs': 56,
 'same': 48,
 'secret': 74,
 'series': 47,
 

In [12]:
class Corpus(object):
    def __init__(self, path):
        # We create an object Dictionary associated to Corpus
        self.dictionary = Dictionary()
        # We go through all files, adding all words to the dictionary
        self.train = self.tokenize(os.path.join(path, 'train.txt'))
        self.valid = self.tokenize(os.path.join(path, 'valid.txt'))
        self.test = self.tokenize(os.path.join(path, 'test.txt'))
        
    def tokenize(self, path):
        """Tokenizes a text file, knowing the dictionary, in order to tranform it into a list of indexes"""
        assert os.path.exists(path)
        # Add words to the dictionary
        with open(path, 'r', encoding="utf8") as f:
            tokens = 0
            for line in f:
                words = line.split() + ['<eos>']
                for word in words:
                    self.dictionary.add_word(word)
                tokens += len(words)
        
        # Once done, go through the file a second time and fill a Torch Tensor with the associated indexes 
        ids = torch.LongTensor(tokens)
        with open(path, 'r', encoding="utf8") as f:
            idx = 0
            for line in f:
                words = line.split() + ['<eos>']
                for word in words:
                    ids[idx] = self.dictionary.word2idx[word]
                    idx += 1
        return ids

We use the corpus [wikitext-2](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/). While it's a small dataset, we will need to reduce it if we want to train a model on it without a gpu. With the 'small' version, on most computers, the model model should see one epoch of the data in less than 15 minutes on cpu. 

In [13]:
###############################################################################
# Load data
###############################################################################

data = './wikitext-2-small/'
corpus = Corpus(data)

In [14]:
#Examples and visualization
print(corpus.dictionary.total)
print(len(corpus.dictionary.idx2word))
print(len(corpus.dictionary.word2idx))

print(corpus.train.shape)
print(corpus.train[0:7])
print([corpus.dictionary.idx2word[corpus.train[i]] for i in range(7)])

print(corpus.valid.shape)
print(corpus.valid[0:7])
print([corpus.dictionary.idx2word[corpus.valid[i]] for i in range(7)])

383196
19482
19482
torch.Size([275485])
tensor([0, 1, 2, 3, 4, 1, 0])
['<eos>', '=', 'Valkyria', 'Chronicles', 'III', '=', '<eos>']
torch.Size([47945])
tensor([    0,     1, 17642, 17643,     1,     0,     0])
['<eos>', '=', 'Homarus', 'gammarus', '=', '<eos>', '<eos>']


We now have data under a very long list of indexes: the text is as one sequence.
Note that this is absolutely not the best way to proceed with large quantities of data (where we'll try not to store huge tensors in memory but read them from file as we go) !
But here, we are looking for simplicity and efficiency with regards to computation time.
That is why we will ignore sentence separations and treat the data as one long stream that we will cut arbitrarily as we need.


The idea now is to create batches from this.
With the alphabet being our data, we currently have the sequence:
$$ \left[ \text{ a b c d e f g h i j k l m n o p q r s t u v w x y z } \right] $$
We want to reorganize it as independant batches that can be processed in parallel by the model !
For instance, with the alphabet as the sequence and batch size 4, we'd get a batch of the 4 following sequences of the same length (6 letters):
$$ 
\begin{bmatrix}
\text{a} & \text{g} & \text{m} & \text{s} \\
\text{b} & \text{h} & \text{n} & \text{t} \\
\text{c} & \text{i} & \text{o} & \text{u} \\
\text{d} & \text{j} & \text{p} & \text{v} \\
\text{e} & \text{k} & \text{q} & \text{w} \\
\text{f} & \text{l} & \text{r} & \text{x}
\end{bmatrix}
$$
with the last two elements being lost.
Again, these columns are treated as independent by the model, which means that the dependence of e. g. 'g' on 'f' can not be learned, but allows more efficient processing. The function ```batchify``` will allow us to reorganize the data as such, and if possible put it on the GPU. We need to cut the unnecessary elements, and put the data in the right shape.


**Important**: You can notice that the data is ordered along the columns, which is unusual since it is the second dimension. To do so, you will need to transpose the matrix at some point. While it may not be how we usually organize batches, it will be useful when dealing with LSTMs and their particular organization. 

In [15]:
def batchify(data, batch_size, cuda = False):
    # Cut the elements that are unnecessary - use the method 'narrow'
    nbatch = data.size(0)//batch_size
    data = data.narrow(0, 0, nbatch*batch_size)
    
    # Reorganize the data - use the method 'view'
    data = data.view(-1, nbatch).t()
    
    # If we can use a GPU, let's tranfer the tensor to it
    if cuda:
        data = data.cuda()
    return data

We now have a way to divide our data into parallel batches. However, our network will not be able to process sequences of arbitrary length !
We will then have to define a maximum length that sequences can have, and cut batches along their first dimension (which is the temporal dimension, as words are ordered this way)


The function ```get_batch``` subdivides the source data into chunks of the appropriate length.
It also separates the source data into the **input** and the **output** of the network. (Remember: we want to predict the next word, so the output is the input shifted by one step in the temporal axis).
If ```source``` is equal to the example output of the batchify function, with a sequence length (seq_len) of 3, we'd get the following two variables:
$$ 
\begin{bmatrix}
\text{a} & \text{g} & \text{m} & \text{s} \\
\text{b} & \text{h} & \text{n} & \text{t} \\
\text{c} & \text{i} & \text{o} & \text{u} 
\end{bmatrix}
$$

$$ 
\begin{bmatrix}
\text{b} & \text{h} & \text{n} & \text{t} \\
\text{c} & \text{i} & \text{o} & \text{u} \\
\text{d} & \text{j} & \text{p} & \text{v} \\
\end{bmatrix}
$$

The first variable contains the letters input to the network, while the second
contains the one we want the network to predict (b for a, h for g, v for u, etc..)
Note that despite the name of the function, we are cutting the data in the
temporal dimension, since we already divided data into batches in the previous
function. 

In [16]:
def get_batch(source, i, seq_len, evaluation=False):
    # Deal with the possibility that there's not enough data left for a full sequence
    upper = min(i+seq_len+1, source.size(0))
    
    # Take the input data - shift by one for the target data
    data = source[i:upper-1]
    target = source[i+1:upper]
    return data, target

In [17]:
#Examples and visualization
batch_size = 100
eval_batch_size = 4
train_data = batchify(corpus.train, batch_size)
val_data = batchify(corpus.valid, eval_batch_size)
test_data = batchify(corpus.test, eval_batch_size)

print(train_data.shape)
print(val_data.shape)

torch.Size([2754, 100])
torch.Size([11986, 4])


In [18]:
#Examples and visualization
input_words, target_words = get_batch(val_data, 0, 3)
pp.pprint(input_words)
pp.pprint(target_words)
input_words, target_words = get_batch(val_data, 3, 3)
pp.pprint(input_words)
pp.pprint(target_words)

tensor([[    0,    10,    15,    91],
        [    1,  3018,   735,    13],
        [17642,   187,   766,   496]])
tensor([[    1,  3018,   735,    13],
        [17642,   187,   766,   496],
        [17643,   827,   751,   131]])
tensor([[17643,   827,   751,   131],
        [    1,    19,  4659,  2200],
        [    0,    17,  2466,    22]])
tensor([[   1,   19, 4659, 2200],
        [   0,   17, 2466,   22],
        [   0, 3069,   39, 5521]])


### LSTM Cells in pytorch

LSTMs expect inputs having 3 dimensions:
- The first dimension is the temporal dimension, along which we (in our case) have the different words
- The second dimension is the batch dimension, along which we stack the independant batches
- The third dimension is the feature dimension, along which are the features of the vector representing the words

In [19]:
# Create a toy example of LSTM: 
lstm = nn.LSTM(3, 3)  # Input dim is 3, output dim is 3
inputs = [torch.randn(1, 3) for _ in range(5)]  # make a sequence of length 5

In our toy case, we have inputs and outputs containing 3 features (third dimension !)
We created a sequence of 5 different inputs (first dimension !)
We don't use batch (the second dimension will have one lement)


We need an initial hidden state, of the right sizes for dimension 2/3, but with only one temporal element:
Here, it is:

In [20]:
hidden = (torch.randn(1, 1, 3),
          torch.randn(1, 1, 3))

Why do we create a tuple of two tensors ? Because we use LSTMs: remember that they use two sets of weights,
and two hidden states (Hidden state, and Cell state).
If you don't remember, read https://colah.github.io/posts/2015-08-Understanding-LSTMs/
If we used a classic RNN, we would simply have ```hidden = torch.randn(1, 1, 3)```

In [21]:
# The naive way of applying a lstm to inputs is to apply it one step at a time, and loop through the sequence
for i in inputs:
    # After each step, hidden contains the hidden states (remember, it's a tuple of two states).
    out, hidden = lstm(i.view(1, 1, -1), hidden)

Alternatively, we can do the entire sequence all at once.
The first value returned by LSTM is all of the Hidden states throughout the sequence.
The second is just the most recent Hidden state and Cell state (you can compare the values)
The reason for this is that:
- ```out``` will give you access to all hidden states in the sequence, for each temporal step
- ```hidden``` will allow you to continue the sequence and backpropagate later, with another sequence

In [22]:
inputs = torch.cat(inputs).view(len(inputs), 1, -1)
hidden = (torch.randn(1, 1, 3), torch.randn(1, 1, 3))  # Re-initialize
out, hidden = lstm(inputs, hidden)
pp.pprint(out)
pp.pprint(hidden)

tensor([[[ 0.1802, -0.1396,  0.0871]],

        [[ 0.3534, -0.0703,  0.2642]],

        [[ 0.0751, -0.1487,  0.3096]],

        [[ 0.0006, -0.0440,  0.2825]],

        [[ 0.0354, -0.2225,  0.1823]]], grad_fn=<StackBackward>)
(tensor([[[ 0.0354, -0.2225,  0.1823]]], grad_fn=<StackBackward>),
 tensor([[[ 0.2306, -0.3665,  0.7337]]], grad_fn=<StackBackward>))


### Creating our own LSTM Model

In Pytorch, models are usually implemented as custom ```nn.Module``` subclass:
- We need to redefine the ```__init__``` method, which creates the object
- We also need to redefine the ```forward``` method, which transform the input into outputs
- We can also add any method that we need: here, in order to initiate weights in the model

In [23]:
class LSTMModel(nn.Module):
    def __init__(self, ntoken, ninp, nhid, nlayers, dropout=0.5):
        super(LSTMModel, self).__init__()
        # Create a dropout object to use on layers for regularization
        self.dropout = nn.Dropout(dropout)
        
        # Create an encoder - which is an embedding layer
        self.encoder = nn.Embedding(ntoken, ninp)
        
        # Create the LSTM layers - find out how to stack them !
        self.lstm = nn.LSTM(ninp, nhid, nlayers)
        
        # Create what we call the decoder: a linear transformation to map the hidden state into scores for all words in the vocabulary
        # (Note that the softmax application function will be applied out of the model)
        self.decoder = nn.Linear(nhid, ntoken)
        
        # Initialize non-reccurent weights 
        self.init_weights()

        self.ninp = ninp
        self.nhid = nhid
        self.nlayers = nlayers
        
    def init_weights(self):
        # Initialize the encoder and decoder weights with the uniform distribution
        initrange = 0.1
        self.encoder.weight.data.uniform_(-initrange, initrange)
        self.decoder.bias.data.fill_(0)
        self.decoder.weight.data.uniform_(-initrange, initrange)
        
    def init_hidden(self, batch_size):
        # Initialize the hidden state and cell state to zero, with the right sizes
        weight = next(self.parameters())
        
        hidden = (torch.zeros(self.nlayers, batch_size, self.nhid),
                  torch.zeros(self.nlayers, batch_size, self.nhid))
        
        return hidden    

    def forward(self, input, hidden, return_h=False):
        # Process the input
        encoded = self.encoder(input)
        
        # Apply the LSTMs
        outputs, hidden = self.lstm(encoded, hidden)
        
        # Decode into scores
        outputs = self.dropout(outputs)
        decoded = self.decoder(outputs)
        
        if not return_h:
            hidden = None
        
        return decoded, hidden

### Building the Model

In [24]:
# Set the random seed manually for reproducibility.
torch.manual_seed(1)

# If you have Cuda installed and a GPU available
cuda = False
if torch.cuda.is_available():
    if not cuda:
        print("WARNING: You have a CUDA device, so you should probably choose cuda = True")
        
device = torch.device("cuda" if cuda else "cpu")

In [25]:
embedding_size = 200
hidden_size = 200
layers = 2
dropout = 0.5

###############################################################################
# Build the model
###############################################################################

vocab_size = len(corpus.dictionary.idx2word)
model = LSTMModel(ntoken=vocab_size, 
                  ninp=embedding_size, 
                  nhid=hidden_size, nlayers=layers, 
                  dropout=dropout)
model = model.to(device)
params = list(model.parameters())
criterion = nn.CrossEntropyLoss()

In [27]:
lr = 10.0
optimizer = 'sgd'
wdecay = 1.2e-6
# For gradient clipping
clip = 0.25

# Create the optimizer
if optimizer == 'sgd':
    optim = torch.optim.SGD(model.parameters(), 
                      lr=lr,
                      weight_decay=wdecay)
elif optimizer == 'adam':
    optim = torch.optim.Adam(model.parameters(), 
                           lr=lr,
                           weight_decay=wdecay)

Let's think about gradient propagation:

We plan to keep the second ouput of the LSTM layer (the hidden/cell states) to initialize
the next call to LSTM. This way, we can back-propagate the gradient for as long as we want.
However, this puts a huge strain on the memory used by the model, since it implies retaining
a always-growing number of tensors of gradients in the cache.


We decide to not backpropagate through time beyond the current sequence ! 
We use a specific function to **cut the 'hidden/state cell' states from their previous dependencies**
before using them to initialize the next call to the LSTM.
This is done with the ```.detach()``` function.

In [28]:
def repackage_hidden(h):
    """Wraps hidden states in new Tensors, to detach them from their history."""
    if isinstance(h, torch.Tensor):
        return h.detach()
    else:
        return tuple(repackage_hidden(v) for v in h)

In [33]:
# Other global parameters
epochs = 10
seq_len = 30
log_interval = 1
save = 'model.pt'

### Training the Language Model

We now have everything necessary to define the training loop. 
Note that ```nn.Module``` objects override the ```__call__``` operator so you can call them like functions, which will have the same effect as calling their ```.forward()``` method:


In practice, we use ```outputs = model(inputs)``` rather than ``` outputs = model.forward(inputs)```.

In [35]:
def train():
    # Turn on training mode which enables dropout.
    model.train()
    total_loss = 0.
    start_time = time.time()
    # Initialize the hidden/cell state
    hidden = model.init_hidden(batch_size)
    
    for batch, i in enumerate(range(0, train_data.size(0) - 1, seq_len)):
        # Get the input/target data
        inputs, targets = get_batch(train_data, i, seq_len)
        
        # Starting each batch, we detach the hidden state from how it was previously produced.
        # If we didn't, the model would try backpropagating all the way to start of the dataset.
        hidden = repackage_hidden(hidden)
        
        # Do the training loop: careful, look into the documentation for the criterion CrossEntropyLoss()
        outputs, hidden = model(inputs, hidden, return_h=True)
        loss = criterion(outputs.permute(1,2,0), targets.permute(1,0))
        
        # Do the gradient clipping with the function torch.nn.utils.clip_grad_norm_, then the optimization step  
        torch.nn.utils.clip_grad_norm_(model.parameters(),clip)
        
        optim.zero_grad()
        loss.backward()
        optim.step()
        
        # We use .data to only accumulate the loss, and not keep track of the gradient too
        total_loss += loss.data.item()
        
        if batch % log_interval == 0 and batch > 0:
            cur_loss = total_loss / log_interval
            elapsed = time.time() - start_time
            print('| epoch {:3d} | {:5d}/{:5d} batches | lr {:02.2f} | ms/batch {:5.2f} | '
                    'loss {:5.2f} | ppl {:8.2f}'.format(
                epoch, batch, len(train_data) // seq_len, lr,
                elapsed * 1000 / log_interval, cur_loss, math.exp(cur_loss)))
            total_loss = 0
            start_time = time.time()

In [47]:
def evaluate(data_source):
    # Turn on evaluation mode which disables dropout.
    model.eval()
    total_loss = 0.
    # Initialize the hidden/cell state
    hidden = model.init_hidden(eval_batch_size)
    
    with torch.no_grad():
        for i in range(0, data_source.size(0) - 1, seq_len):
            inputs, targets = get_batch(data_source, i, seq_len)
            outputs, _ = model(inputs, hidden, return_h=False)
            loss = criterion(outputs.permute(1,2,0), targets.permute(1,0))
            total_loss += loss.data.item()
            
    return total_loss / (len(data_source) - 1)

In [37]:
# Loop over epochs.
best_val_loss = None

# At any point you can hit Ctrl + C to break out of training early.
try:
    for epoch in range(1, epochs+1):
        epoch_start_time = time.time()
        train()
        val_loss = evaluate(val_data)
        print('-' * 89)
        print('| end of epoch {:3d} | time: {:5.2f}s | valid loss {:5.2f} | '
                'valid ppl {:8.2f}'.format(epoch, (time.time() - epoch_start_time),
                                           val_loss, math.exp(val_loss)))
        print('-' * 89)
        # Save the model if the validation loss is the best we've seen so far.
        if not best_val_loss or val_loss < best_val_loss:
            with open(save, 'wb') as f:
                torch.save(model, f)
            best_val_loss = val_loss
        else:
            # Anneal the learning rate if no improvement has been seen in the validation dataset.
            lr /= 4.0
except KeyboardInterrupt:
    print('-' * 89)
    print('Exiting from training early')

| epoch   1 |     1/   91 batches | lr 10.00 | ms/batch 8050.45 | loss 19.57 | ppl 315394259.03
| epoch   1 |     2/   91 batches | lr 10.00 | ms/batch 3925.64 | loss  9.38 | ppl 11805.78
| epoch   1 |     3/   91 batches | lr 10.00 | ms/batch 3780.62 | loss  8.45 | ppl  4672.72
| epoch   1 |     4/   91 batches | lr 10.00 | ms/batch 3817.53 | loss  9.21 | ppl 10011.75
| epoch   1 |     5/   91 batches | lr 10.00 | ms/batch 3914.31 | loss  8.55 | ppl  5144.80
| epoch   1 |     6/   91 batches | lr 10.00 | ms/batch 3843.11 | loss  9.02 | ppl  8247.54
| epoch   1 |     7/   91 batches | lr 10.00 | ms/batch 3933.26 | loss  8.77 | ppl  6447.47
| epoch   1 |     8/   91 batches | lr 10.00 | ms/batch 3948.12 | loss  9.28 | ppl 10739.71
| epoch   1 |     9/   91 batches | lr 10.00 | ms/batch 4685.29 | loss  8.69 | ppl  5970.19
| epoch   1 |    10/   91 batches | lr 10.00 | ms/batch 4095.87 | loss  8.52 | ppl  5021.65
| epoch   1 |    11/   91 batches | lr 10.00 | ms/batch 4408.05 | loss  9.46

| epoch   1 |    91/   91 batches | lr 10.00 | ms/batch 2959.24 | loss  7.23 | ppl  1383.07
-----------------------------------------------------------------------------------------
| end of epoch   1 | time: 409.76s | valid loss  0.24 | valid ppl     1.27
-----------------------------------------------------------------------------------------


  "type " + obj.__name__ + ". It won't be checked "


| epoch   2 |     1/   91 batches | lr 10.00 | ms/batch 8356.94 | loss 14.55 | ppl 2078198.25
| epoch   2 |     2/   91 batches | lr 10.00 | ms/batch 3994.66 | loss  7.29 | ppl  1472.68
| epoch   2 |     3/   91 batches | lr 10.00 | ms/batch 4015.95 | loss  7.17 | ppl  1299.39
| epoch   2 |     4/   91 batches | lr 10.00 | ms/batch 4690.47 | loss  7.17 | ppl  1302.47
| epoch   2 |     5/   91 batches | lr 10.00 | ms/batch 4802.64 | loss  7.17 | ppl  1301.04
| epoch   2 |     6/   91 batches | lr 10.00 | ms/batch 4173.76 | loss  7.30 | ppl  1474.34
| epoch   2 |     7/   91 batches | lr 10.00 | ms/batch 4064.74 | loss  7.15 | ppl  1274.12
| epoch   2 |     8/   91 batches | lr 10.00 | ms/batch 4481.30 | loss  7.20 | ppl  1334.08
| epoch   2 |     9/   91 batches | lr 10.00 | ms/batch 4318.31 | loss  7.19 | ppl  1332.52
| epoch   2 |    10/   91 batches | lr 10.00 | ms/batch 4158.11 | loss  7.17 | ppl  1295.71
| epoch   2 |    11/   91 batches | lr 10.00 | ms/batch 4148.88 | loss  7.27 |

| epoch   2 |    91/   91 batches | lr 10.00 | ms/batch 3020.95 | loss  7.09 | ppl  1201.66
-----------------------------------------------------------------------------------------
| end of epoch   2 | time: 417.27s | valid loss  0.23 | valid ppl     1.25
-----------------------------------------------------------------------------------------
| epoch   3 |     1/   91 batches | lr 10.00 | ms/batch 8793.42 | loss 13.95 | ppl 1143161.15
| epoch   3 |     2/   91 batches | lr 10.00 | ms/batch 4016.78 | loss  7.00 | ppl  1096.20
| epoch   3 |     3/   91 batches | lr 10.00 | ms/batch 4629.12 | loss  6.82 | ppl   913.89
| epoch   3 |     4/   91 batches | lr 10.00 | ms/batch 4455.66 | loss  6.95 | ppl  1041.83
| epoch   3 |     5/   91 batches | lr 10.00 | ms/batch 4672.42 | loss  6.98 | ppl  1079.13
| epoch   3 |     6/   91 batches | lr 10.00 | ms/batch 4720.16 | loss  7.05 | ppl  1158.23
| epoch   3 |     7/   91 batches | lr 10.00 | ms/batch 5036.17 | loss  7.05 | ppl  1151.77
| epoch

| epoch   3 |    87/   91 batches | lr 10.00 | ms/batch 3984.40 | loss  6.78 | ppl   877.21
| epoch   3 |    88/   91 batches | lr 10.00 | ms/batch 4029.96 | loss  7.01 | ppl  1109.28
| epoch   3 |    89/   91 batches | lr 10.00 | ms/batch 3952.29 | loss  6.96 | ppl  1049.47
| epoch   3 |    90/   91 batches | lr 10.00 | ms/batch 4068.57 | loss  6.71 | ppl   818.67
| epoch   3 |    91/   91 batches | lr 10.00 | ms/batch 2964.52 | loss  6.78 | ppl   875.95
-----------------------------------------------------------------------------------------
| end of epoch   3 | time: 405.53s | valid loss  0.22 | valid ppl     1.25
-----------------------------------------------------------------------------------------
| epoch   4 |     1/   91 batches | lr 10.00 | ms/batch 7929.33 | loss 13.53 | ppl 748099.27
| epoch   4 |     2/   91 batches | lr 10.00 | ms/batch 4182.62 | loss  6.71 | ppl   816.90
| epoch   4 |     3/   91 batches | lr 10.00 | ms/batch 4027.14 | loss  6.82 | ppl   919.39
| epoch 

| epoch   4 |    83/   91 batches | lr 10.00 | ms/batch 3953.52 | loss  6.72 | ppl   826.84
| epoch   4 |    84/   91 batches | lr 10.00 | ms/batch 3991.38 | loss  6.64 | ppl   766.56
| epoch   4 |    85/   91 batches | lr 10.00 | ms/batch 3986.56 | loss  6.56 | ppl   709.56
| epoch   4 |    86/   91 batches | lr 10.00 | ms/batch 3786.90 | loss  6.56 | ppl   709.70
| epoch   4 |    87/   91 batches | lr 10.00 | ms/batch 3996.61 | loss  6.74 | ppl   845.04
| epoch   4 |    88/   91 batches | lr 10.00 | ms/batch 3840.35 | loss  6.65 | ppl   773.84
| epoch   4 |    89/   91 batches | lr 10.00 | ms/batch 3929.83 | loss  6.65 | ppl   771.43
| epoch   4 |    90/   91 batches | lr 10.00 | ms/batch 3934.74 | loss  6.62 | ppl   748.74
| epoch   4 |    91/   91 batches | lr 10.00 | ms/batch 2870.32 | loss  6.63 | ppl   760.57
-----------------------------------------------------------------------------------------
| end of epoch   4 | time: 393.64s | valid loss  0.22 | valid ppl     1.24
-------

| epoch   5 |    79/   91 batches | lr 10.00 | ms/batch 3937.48 | loss  6.51 | ppl   672.81
| epoch   5 |    80/   91 batches | lr 10.00 | ms/batch 3872.39 | loss  6.54 | ppl   694.07
| epoch   5 |    81/   91 batches | lr 10.00 | ms/batch 3946.20 | loss  6.67 | ppl   788.86
| epoch   5 |    82/   91 batches | lr 10.00 | ms/batch 3831.92 | loss  6.54 | ppl   695.50
| epoch   5 |    83/   91 batches | lr 10.00 | ms/batch 3803.60 | loss  6.58 | ppl   720.35
| epoch   5 |    84/   91 batches | lr 10.00 | ms/batch 3913.68 | loss  6.55 | ppl   700.13
| epoch   5 |    85/   91 batches | lr 10.00 | ms/batch 3869.53 | loss  6.55 | ppl   697.96
| epoch   5 |    86/   91 batches | lr 10.00 | ms/batch 3825.15 | loss  6.42 | ppl   610.97
| epoch   5 |    87/   91 batches | lr 10.00 | ms/batch 4130.08 | loss  6.43 | ppl   620.18
| epoch   5 |    88/   91 batches | lr 10.00 | ms/batch 3968.44 | loss  6.59 | ppl   727.23
| epoch   5 |    89/   91 batches | lr 10.00 | ms/batch 3981.83 | loss  6.50 | p

| epoch   6 |    75/   91 batches | lr 10.00 | ms/batch 3918.34 | loss  6.47 | ppl   644.42
| epoch   6 |    76/   91 batches | lr 10.00 | ms/batch 3932.43 | loss  6.24 | ppl   514.58
| epoch   6 |    77/   91 batches | lr 10.00 | ms/batch 3996.77 | loss  6.25 | ppl   516.93
| epoch   6 |    78/   91 batches | lr 10.00 | ms/batch 3944.77 | loss  6.44 | ppl   625.69
| epoch   6 |    79/   91 batches | lr 10.00 | ms/batch 3874.09 | loss  6.54 | ppl   690.91
| epoch   6 |    80/   91 batches | lr 10.00 | ms/batch 4120.47 | loss  6.47 | ppl   648.44
| epoch   6 |    81/   91 batches | lr 10.00 | ms/batch 4219.03 | loss  6.60 | ppl   734.36
| epoch   6 |    82/   91 batches | lr 10.00 | ms/batch 4030.92 | loss  6.57 | ppl   712.38
| epoch   6 |    83/   91 batches | lr 10.00 | ms/batch 3960.40 | loss  6.48 | ppl   653.87
| epoch   6 |    84/   91 batches | lr 10.00 | ms/batch 3879.42 | loss  6.37 | ppl   586.85
| epoch   6 |    85/   91 batches | lr 10.00 | ms/batch 3910.58 | loss  6.39 | p

| epoch   7 |    71/   91 batches | lr 10.00 | ms/batch 3939.94 | loss  6.17 | ppl   480.44
| epoch   7 |    72/   91 batches | lr 10.00 | ms/batch 3995.14 | loss  6.33 | ppl   562.90
| epoch   7 |    73/   91 batches | lr 10.00 | ms/batch 4076.17 | loss  6.33 | ppl   558.60
| epoch   7 |    74/   91 batches | lr 10.00 | ms/batch 3987.90 | loss  6.40 | ppl   602.88
| epoch   7 |    75/   91 batches | lr 10.00 | ms/batch 3982.23 | loss  6.37 | ppl   586.97
| epoch   7 |    76/   91 batches | lr 10.00 | ms/batch 3954.85 | loss  6.19 | ppl   490.08
| epoch   7 |    77/   91 batches | lr 10.00 | ms/batch 3892.79 | loss  6.22 | ppl   504.30
| epoch   7 |    78/   91 batches | lr 10.00 | ms/batch 3926.36 | loss  6.39 | ppl   592.97
| epoch   7 |    79/   91 batches | lr 10.00 | ms/batch 3915.02 | loss  6.39 | ppl   592.95
| epoch   7 |    80/   91 batches | lr 10.00 | ms/batch 3929.81 | loss  6.44 | ppl   628.88
| epoch   7 |    81/   91 batches | lr 10.00 | ms/batch 4005.03 | loss  6.42 | p

| epoch   8 |    67/   91 batches | lr 10.00 | ms/batch 3957.96 | loss  6.16 | ppl   471.34
| epoch   8 |    68/   91 batches | lr 10.00 | ms/batch 3898.23 | loss  6.15 | ppl   468.37
| epoch   8 |    69/   91 batches | lr 10.00 | ms/batch 3891.49 | loss  6.19 | ppl   486.55
| epoch   8 |    70/   91 batches | lr 10.00 | ms/batch 4175.19 | loss  6.16 | ppl   473.39
| epoch   8 |    71/   91 batches | lr 10.00 | ms/batch 4057.68 | loss  6.19 | ppl   486.12
| epoch   8 |    72/   91 batches | lr 10.00 | ms/batch 3826.57 | loss  6.26 | ppl   521.57
| epoch   8 |    73/   91 batches | lr 10.00 | ms/batch 3938.01 | loss  6.29 | ppl   539.25
| epoch   8 |    74/   91 batches | lr 10.00 | ms/batch 3819.53 | loss  6.23 | ppl   507.02
| epoch   8 |    75/   91 batches | lr 10.00 | ms/batch 3874.68 | loss  6.30 | ppl   546.38
| epoch   8 |    76/   91 batches | lr 10.00 | ms/batch 3898.45 | loss  6.08 | ppl   438.63
| epoch   8 |    77/   91 batches | lr 10.00 | ms/batch 3834.16 | loss  6.12 | p

| epoch   9 |    63/   91 batches | lr 10.00 | ms/batch 3891.10 | loss  6.09 | ppl   443.14
| epoch   9 |    64/   91 batches | lr 10.00 | ms/batch 3953.34 | loss  6.17 | ppl   478.17
| epoch   9 |    65/   91 batches | lr 10.00 | ms/batch 3944.18 | loss  6.21 | ppl   496.42
| epoch   9 |    66/   91 batches | lr 10.00 | ms/batch 3897.55 | loss  6.11 | ppl   452.13
| epoch   9 |    67/   91 batches | lr 10.00 | ms/batch 3952.28 | loss  6.08 | ppl   437.26
| epoch   9 |    68/   91 batches | lr 10.00 | ms/batch 3913.09 | loss  6.14 | ppl   464.46
| epoch   9 |    69/   91 batches | lr 10.00 | ms/batch 3882.72 | loss  6.10 | ppl   446.90
| epoch   9 |    70/   91 batches | lr 10.00 | ms/batch 3925.82 | loss  6.08 | ppl   438.52
| epoch   9 |    71/   91 batches | lr 10.00 | ms/batch 3916.68 | loss  6.03 | ppl   414.78
| epoch   9 |    72/   91 batches | lr 10.00 | ms/batch 3885.67 | loss  6.13 | ppl   458.29
| epoch   9 |    73/   91 batches | lr 10.00 | ms/batch 3727.26 | loss  6.24 | p

| epoch  10 |    59/   91 batches | lr 10.00 | ms/batch 3877.30 | loss  6.01 | ppl   407.19
| epoch  10 |    60/   91 batches | lr 10.00 | ms/batch 3801.28 | loss  6.05 | ppl   425.89
| epoch  10 |    61/   91 batches | lr 10.00 | ms/batch 3849.11 | loss  6.19 | ppl   489.57
| epoch  10 |    62/   91 batches | lr 10.00 | ms/batch 4040.53 | loss  6.02 | ppl   410.52
| epoch  10 |    63/   91 batches | lr 10.00 | ms/batch 4012.70 | loss  5.94 | ppl   381.22
| epoch  10 |    64/   91 batches | lr 10.00 | ms/batch 3965.90 | loss  6.06 | ppl   429.86
| epoch  10 |    65/   91 batches | lr 10.00 | ms/batch 3935.62 | loss  6.13 | ppl   458.28
| epoch  10 |    66/   91 batches | lr 10.00 | ms/batch 3878.24 | loss  5.97 | ppl   393.33
| epoch  10 |    67/   91 batches | lr 10.00 | ms/batch 3826.73 | loss  5.95 | ppl   384.61
| epoch  10 |    68/   91 batches | lr 10.00 | ms/batch 3758.09 | loss  5.98 | ppl   393.78
| epoch  10 |    69/   91 batches | lr 10.00 | ms/batch 3880.40 | loss  6.00 | p

AttributeError: 'LSTMModel' object has no attribute 'rnn'

In [48]:
# Load the best saved model.
with open(save, 'rb') as f:
    model = torch.load(f)
    # After loading, the parameters are not a continuous chunk of memory
    # This makes them a continuous chunk, and will speed up the forward pass
    model.lstm.flatten_parameters()

# Run on test data.
test_loss = evaluate(test_data)
print('=' * 89)
print('| End of training | test loss {:5.2f} | test ppl {:8.2f}'.format(
    test_loss, math.exp(test_loss)))
print('=' * 89)

| End of training | test loss  0.20 | test ppl     1.23


### Generating New Textes

To play with your LSTM Language model, implement a function to generate a fixed number of words given an input text. You will need to use every part of the pipeline to: 
- Process the input text into words and then a series of word indexes using the vocabulary
- Do inference on this input using the model
- Loop in order to generate the next word **greedily** as many times as needed
You can choose the next word by doing multinomial sampling from the output distribution (which you can control using softmax temperature) or simply using the argmax. Good Luck ! 

In [104]:
def generate(input_sentence, n_generate = 5, max_history_length = seq_len, method = 'sampling', topk = 10):
    """
     - input_sequence(string) : input
     - n_generate (int) : number of words to generate
     - max_history_length (int) : max length of the history to generate next words
     - method : 'sampling' or 'greedy'
     - topk (int) : sampling from k best words
    """
    token_sequence = []
    for w in input_sentence.split():
        if w in corpus.dictionary.word2idx:
            token_sequence.append(corpus.dictionary.word2idx[w])
        else:
            token_sequence.append(corpus.dictionary.word2idx['<unk>'])
    # print(token_sequence)
    
    for _ in range(n_generate):
        if len(token_sequence) > max_history_length:
            token_input = token_sequence[-max_history_length:]
        else:
            token_input = token_sequence
            
        model.eval()

        hidden = model.init_hidden(batch_size=1)
        inputs = torch.LongTensor(token_input).view(-1, 1)

        with torch.no_grad():
            outputs, _ = model(inputs, hidden, return_h = False)
            
            
        if method == 'greedy':
            predicted = torch.argmax(outputs[-1, 0])
        elif method == 'sampling':
            if topk < 0:
                predicted = torch.multinomial(outputs[-1, 0].softmax(0), 1)
            else:
                values, indices = torch.topk(outputs[-1,0], topk)
                predicted = indices[torch.multinomial(values, 1)]

        token_sequence.append(predicted)
    
    return " ".join([corpus.dictionary.idx2word[i] for i in token_sequence])

In [110]:
generate("The film", n_generate = 50, max_history_length = 10, method = 'sampling', topk = 20)

'The film are one by " and a most was that were the only with his team by which had well \'s other @-@ first album \'s album in The new game on its time on 1 to a other century and it has as not in her time with that was'