# Introduction to LSTM in PyTorch

## Building a simple vocabulary from a collection of sentences

In PyTorch the input to an LSTM is expected to be a 3D tensor. Suppose we have a collection of sentences as an input. Let's start from the very beginning, showing how to go from a collection of strings to a vocabulary organized as a Python dictionary. As a simplification, all the sentences contain the same number of words.

In [1]:
import torch
import torch.autograd as autograd
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np
from functools import reduce


sentences = ['I had breakfast this morning'.split(' '),
             'then I took a walk'.split(' '),
             'but the weather was bad'.split(' ')]
print(sentences)

[['I', 'had', 'breakfast', 'this', 'morning'], ['then', 'I', 'took', 'a', 'walk'], ['but', 'the', 'weather', 'was', 'bad']]


We can build a vocabulary by turning each sentence into a set and then iteratively taking the union of the sets. This gives us a single set containing the unique words.

In [2]:
vocabulary = reduce(set.union, [set(sentence) for sentence in sentences])
print(vocabulary)

{'this', 'but', 'walk', 'bad', 'a', 'then', 'was', 'weather', 'I', 'had', 'breakfast', 'took', 'morning', 'the'}


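We can easily turn this set into a dictionary that maps each word to an integer index. Because sets are unordered, the exact indices may differ from run to run.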

In [3]:
vocabulary = {word: ix for ix, word in enumerate(list(vocabulary))}
print(vocabulary)

{'this': 0, 'but': 1, 'walk': 2, 'bad': 3, 'a': 4, 'then': 5, 'was': 6, 'weather': 7, 'I': 8, 'had': 9, 'breakfast': 10, 'took': 11, 'morning': 12, 'the': 13}
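If reproducible indices matter, one option is to sort the unique words before enumerating them. The following is a minimal sketch of that idea; the names `unique_words` and `sorted_vocabulary` are introduced here for illustration and are not used in the rest of the notebook.

```python
sentences = ['I had breakfast this morning'.split(' '),
             'then I took a walk'.split(' '),
             'but the weather was bad'.split(' ')]

# Collect the unique words, then sort them so the indices are stable across runs.
unique_words = sorted({word for sentence in sentences for word in sentence})
sorted_vocabulary = {word: ix for ix, word in enumerate(unique_words)}

print(len(sorted_vocabulary))  # 14
print(sorted_vocabulary['I'])  # 0 (uppercase letters sort before lowercase ones)
```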


Now we need to turn the sentences into NumPy arrays that can be passed to the LSTM model. We start by pre-allocating a NumPy array of zeros with shape (number of sentences, maximum length of a sentence, number of words in the vocabulary).

In [4]:
sentence_max_length = max([len(sentence) for sentence in sentences])
inputs = np.zeros((len(sentences), sentence_max_length, len(vocabulary)), dtype='float32')
print(inputs.shape)

(3, 5, 14)


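We can then populate this array with a one-hot encoding: for each word, we set the entry at its vocabulary index to 1. (This representation would not be suitable as input to an embedding layer, which expects integer indices rather than one-hot vectors.)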

In [5]:
for ix_sentence in range(len(sentences)):
    for ix_word in range(len(sentences[ix_sentence])):
        ix_vocab = vocabulary[sentences[ix_sentence][ix_word]]
        inputs[ix_sentence, ix_word, ix_vocab] = 1

print(inputs)
print(inputs.shape)

[[[0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
  [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]]

 [[0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
  [0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]

 [[0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
  [0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]]
(3, 5, 14)
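The same one-hot array can also be built without explicit loops by indexing an identity matrix with the word indices. This is only a sketch under the same simplification as above (all sentences have the same length); `word_index`, `indices` and `one_hot` are names chosen for this example.

```python
import numpy as np

sentences = ['I had breakfast this morning'.split(' '),
             'then I took a walk'.split(' '),
             'but the weather was bad'.split(' ')]

words = sorted({w for s in sentences for w in s})
word_index = {w: i for i, w in enumerate(words)}

# Integer index of every word, shape (number of sentences, sentence length).
indices = np.array([[word_index[w] for w in s] for s in sentences])

# Row i of the identity matrix is the one-hot vector for index i,
# so indexing it with `indices` yields one vector per word.
one_hot = np.eye(len(word_index), dtype='float32')[indices]
print(one_hot.shape)  # (3, 5, 14)
```

Taking `one_hot.argmax(axis=-1)` recovers `indices`, which is a quick way to check the encoding.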


### Using the correct shape

Our inputs consist of three sequences, each one-hot encoded into a NumPy array of size 5 x 14. We can now introduce an LSTM module that will read these sequences and store its state in a hidden layer of size 4.
From the [documentation of the LSTM layer](http://pytorch.org/docs/master/nn.html#torch.nn.LSTM) we see that the inputs are supposed to be of shape (seq_len, batch, input_size). Here we are providing batches of size 1, which means that we will need to reshape each input to (5, 1, 14).

In [6]:
# (sentences, seq_len, batch, input_size): each sentence becomes (5, 1, 14)
inputs = inputs.reshape(3, 5, 1, 14)

We can now instantiate an LSTM layer with 4 hidden units.

In [7]:
lstm = nn.LSTM(input_size=14, hidden_size=4)

### Initialization of the hidden and cell states

From the [documentation of the LSTM layer](http://pytorch.org/docs/master/nn.html#torch.nn.LSTM), LSTM returns a tuple containing the outputs and a tuple with the hidden state and the cell state. As input it takes the input sequences and, optionally, the same kind of tuple. If this tuple is not provided, both the hidden and the cell states default to zeros.

In [8]:
input_var = autograd.Variable(torch.from_numpy(inputs))
print(input_var.size())

torch.Size([3, 5, 1, 14])


We can see that the LSTM layer is actually working by assigning the output of `lstm` applied to the first sequence. Here, even if it is not necessary, we provide a manually zero-initialized tuple of hidden and cell states, to show the full syntax.

In [9]:
# Both states have shape (num_layers * num_directions, batch, hidden_size).
hidden = (autograd.Variable(torch.zeros(1, 1, 4)),
          autograd.Variable(torch.zeros(1, 1, 4)))
out, hidden = lstm(input_var[0], hidden)

The output has shape (seq_len, batch, hidden_size), while the hidden and cell states both have shape (num_layers * num_directions, batch, hidden_size).

In [10]:
print('Output size: {}'.format(out.size()))
print('Hidden sizes: {0} and {1}'.format(hidden[0].size(), hidden[1].size()))

Output size: torch.Size([5, 1, 4])
Hidden sizes: torch.Size([1, 1, 4]) and torch.Size([1, 1, 4])


For each of the 5 words in the sentence, the output contains a vector of 4 values, one per hidden unit.
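To make the shape conventions concrete, here is a small sketch that feeds a random tensor of shape (seq_len, batch, input_size) through an LSTM, once with the default initial states and once with explicit zeros. It assumes a recent PyTorch version in which plain tensors can be passed directly; the random input `x` is made up for the example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lstm = nn.LSTM(input_size=14, hidden_size=4)

# One sequence of 5 steps, batch size 1, 14 features per step.
x = torch.randn(5, 1, 14)

# Without initial states, the hidden and cell states default to zeros.
out_default, (h_default, c_default) = lstm(x)

# Passing explicit zeros should therefore give the same result.
h0 = torch.zeros(1, 1, 4)
c0 = torch.zeros(1, 1, 4)
out_explicit, (h_explicit, c_explicit) = lstm(x, (h0, c0))

print(out_default.shape)                              # torch.Size([5, 1, 4])
print(h_default.shape)                                # torch.Size([1, 1, 4])
print(torch.allclose(out_default, out_explicit))      # True
# For a single-layer, unidirectional LSTM the last output is the final hidden state.
print(torch.allclose(out_default[-1], h_default[0]))  # True
```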

## LSTM for part-of-speech tagging

We have an input sentence formed by the words $w_1, w_2, \ldots, w_M$, with $w_i \in V$, where $V$ is the vocabulary. Each word is associated with a tag $y_i$, which might indicate, for example, whether the word is a verb, a noun, etc. Let's call the set of all the possible tags $T$. We assign a unique index to each tag, and we use an LSTM to predict, for each word, the corresponding tag. If we call $h_i$ the hidden state of the LSTM at step $i$, the predicted tag is $\hat y_i = \mathrm{argmax}_j (\log \mathrm{Softmax}(A h_i + b))_j$, where $A$ and $b$ are the weights and bias of a linear layer that maps the hidden state to one score per tag.

We start by defining a simple way to convert a sequence into indices. In the [PyTorch LSTM tutorial](http://pytorch.org/tutorials/beginner/nlp/sequence_models_tutorial.html#sphx-glr-beginner-nlp-sequence-models-tutorial-py) they use a particularly simple approach: the word-to-index mapping consists simply of associating each new word with an index equal to the current length of the mapping itself. Each new word makes the mapping longer, and is therefore associated with the next larger integer.
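The prediction rule can be illustrated numerically. In this sketch the matrix `A`, the bias `b` and the hidden states `h` are random placeholders, not trained values; the point is only to show how the argmax over the log-softmax is taken per word.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
hidden_dim, n_tags, seq_len = 6, 3, 5

A = torch.randn(n_tags, hidden_dim)   # placeholder weights
b = torch.randn(n_tags)               # placeholder bias
h = torch.randn(seq_len, hidden_dim)  # one hidden state h_i per word

scores = h @ A.t() + b                    # shape (seq_len, n_tags)
log_probs = F.log_softmax(scores, dim=1)  # normalize over the tags of each word
predicted = log_probs.argmax(dim=1)       # one tag index per word

print(predicted.shape)  # torch.Size([5])
```

Since the log-softmax is monotonic within each row, the argmax of `log_probs` coincides with the argmax of the raw `scores`; the log-softmax matters for the loss rather than for the prediction itself.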

In [11]:
training_data = [
    ("The dog ate the apple".split(), ["DET", "NN", "V", "DET", "NN"]),
    ("Everybody read that book".split(), ["NN", "V", "DET", "NN"])
]

word_to_ix = {}
for sent, tags in training_data:
    for word in sent:
        if word not in word_to_ix:
            word_to_ix[word] = len(word_to_ix)
print(word_to_ix)

{'The': 0, 'dog': 1, 'ate': 2, 'the': 3, 'apple': 4, 'Everybody': 5, 'read': 6, 'that': 7, 'book': 8}
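An equivalent, slightly more compact way to write the same loop uses `dict.setdefault`, which only inserts a key when it is missing. This is just an alternative sketch of the mapping above; the inverse mapping `ix_to_word` is an extra added here for illustration.

```python
training_data = [
    ("The dog ate the apple".split(), ["DET", "NN", "V", "DET", "NN"]),
    ("Everybody read that book".split(), ["NN", "V", "DET", "NN"])
]

word_to_ix = {}
for sent, _tags in training_data:
    for word in sent:
        # A new word gets the current size of the mapping as its index;
        # a word that is already present keeps its existing index.
        word_to_ix.setdefault(word, len(word_to_ix))

# Inverse mapping, handy for turning indices back into words.
ix_to_word = {ix: word for word, ix in word_to_ix.items()}

print(word_to_ix['book'])  # 8
print(ix_to_word[3])       # the
```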


We create a simple function to map a sequence to a tensor (a Variable, to be precise) of integers.

In [12]:
def prepare_sequence(seq, to_ix):
    idxs = [to_ix[w] for w in seq]
    tensor = torch.LongTensor(idxs)
    return autograd.Variable(tensor)

tag_to_ix = {'DET': 0, 'V': 1, 'NN': 2}

prepare_sequence(training_data[0][0], word_to_ix)

Variable containing:
 0
 1
 2
 3
 4
[torch.LongTensor of size 5]
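In PyTorch 0.4 and later, `Variable` has been merged into `Tensor`, so the wrapper is no longer needed. Assuming such a version, the same helper could be sketched as follows; `prepare_sequence_tensor` is a hypothetical name used to avoid clashing with the function above.

```python
import torch

def prepare_sequence_tensor(seq, to_ix):
    # Look up the index of each token and pack the result into a LongTensor.
    return torch.tensor([to_ix[w] for w in seq], dtype=torch.long)

word_to_ix = {'The': 0, 'dog': 1, 'ate': 2, 'the': 3, 'apple': 4}
indices = prepare_sequence_tensor("The dog ate the apple".split(), word_to_ix)
print(indices)  # tensor([0, 1, 2, 3, 4])
```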

This is the same type of representation used for Embedding layers, and this is not a coincidence, since we will be using an embedding layer as an input to the LSTM layer.

In [13]:
EMBEDDING_DIM = 6
HIDDEN_DIM = 6

class LSTMTagger(nn.Module):
    
    def __init__(self, embedding_dim, hidden_dim, vocab_size, target_size):
        super(LSTMTagger, self).__init__()
        self.hidden_dim = hidden_dim
        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim)
        self.hidden2tag = nn.Linear(hidden_dim, target_size)
        self.hidden = self.init_hidden()
    
    def init_hidden(self):
        return (autograd.Variable(torch.zeros(1, 1, self.hidden_dim)),
                autograd.Variable(torch.zeros(1, 1, self.hidden_dim)))
    
    def forward(self, sentence):
        embeds = self.word_embeddings(sentence)
        lstm_out, self.hidden = self.lstm(embeds.view(len(sentence), 1, -1), self.hidden)
        tag_space = self.hidden2tag(lstm_out.view(len(sentence), -1))
        tag_scores = F.log_softmax(tag_space, dim=1)  # normalize over the tags of each word
        return tag_scores

Let's go through this class step by step:

1. We start by defining the word embeddings. No surprises here.
2. We define an LSTM layer which receives inputs of size `embedding_dim` and produces outputs of size `hidden_dim`.
3. We define a fully connected (FC) layer that goes from the hidden units to the output, i.e. the tags.
4. This is a bit surprising at first. As we saw above, we can pass the initial values of the hidden and cell states to the LSTM. Here this tuple becomes an attribute of the class and is initialized via the `init_hidden()` method. This allows us to re-initialize the hidden states of an instance of this class, as shown below.

The `forward` method is pretty clear:

1. The input sentence is converted into an embedding.
2. The LSTM returns its output in `lstm_out` while the hidden states stored in `self.hidden` are updated.
3. `lstm_out` becomes, after adequate reshaping, the input of the FC layer `hidden2tag`.
4. The final log-softmax of the tag scores is computed for each word in the sentence.
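The steps of `forward` can be traced shape by shape. The sketch below rebuilds the three layers outside the class with untrained weights, assuming a vocabulary of 9 words and 3 tags as in the data above, and assuming a recent PyTorch version in which plain tensors can be used directly. The initial states are left to their zero default.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

EMBEDDING_DIM, HIDDEN_DIM = 6, 6
vocab_size, n_tags = 9, 3

word_embeddings = nn.Embedding(vocab_size, EMBEDDING_DIM)
lstm = nn.LSTM(EMBEDDING_DIM, HIDDEN_DIM)
hidden2tag = nn.Linear(HIDDEN_DIM, n_tags)

sentence = torch.tensor([0, 1, 2, 3, 4])                  # 5 word indices

embeds = word_embeddings(sentence)                        # (5, 6)
lstm_out, _ = lstm(embeds.view(len(sentence), 1, -1))     # (5, 1, 6)
tag_space = hidden2tag(lstm_out.view(len(sentence), -1))  # (5, 3)
tag_scores = F.log_softmax(tag_space, dim=1)              # (5, 3)

for name, t in [('embeds', embeds), ('lstm_out', lstm_out),
                ('tag_space', tag_space), ('tag_scores', tag_scores)]:
    print(name, tuple(t.shape))
```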

In [14]:
model = LSTMTagger(EMBEDDING_DIM, HIDDEN_DIM, len(word_to_ix), len(tag_to_ix))
criterion = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)
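`nn.NLLLoss` expects log-probabilities, which is why the model ends with a log-softmax. As a sanity check, the following sketch compares this combination with `nn.CrossEntropyLoss` applied to the raw scores; the scores and targets are random placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
tag_space = torch.randn(5, 3)            # placeholder raw scores: 5 words, 3 tags
targets = torch.tensor([0, 2, 1, 0, 2])  # placeholder tag indices

# Negative log-likelihood applied to log-probabilities ...
nll = nn.NLLLoss()(F.log_softmax(tag_space, dim=1), targets)
# ... matches cross-entropy applied to the raw scores.
ce = nn.CrossEntropyLoss()(tag_space, targets)

print(torch.allclose(nll, ce))  # True
```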

In [15]:
# Before training
inputs = prepare_sequence(training_data[0][0], word_to_ix)
tag_scores = model(inputs)
print(tag_scores)


for epoch in range(10):
    print('Epoch: {}'.format(epoch))
    for sentence, tags in training_data:
        model.zero_grad()
        model.hidden = model.init_hidden()
        sentence_in = prepare_sequence(sentence, word_to_ix)
        targets_in = prepare_sequence(tags, tag_to_ix)
        tag_scores = model(sentence_in)
        loss = criterion(tag_scores, targets_in)
        loss.backward()
        optimizer.step()
        
inputs = prepare_sequence(training_data[0][0], word_to_ix)
tag_scores = model(inputs)
print(tag_scores)

Variable containing:
-0.7757 -1.3122 -1.3078
-0.8132 -1.2804 -1.2778
-0.8048 -1.2743 -1.2976
-0.7742 -1.2807 -1.3429
-0.7743 -1.3346 -1.2883
[torch.FloatTensor of size 5x3]

Epoch: 0
Epoch: 1
Epoch: 2
Epoch: 3
Epoch: 4
Epoch: 5
Epoch: 6
Epoch: 7
Epoch: 8
Epoch: 9
Variable containing:
-0.9134 -1.5046 -0.9763
-1.0410 -1.3724 -0.9329
-1.0367 -1.3037 -0.9839
-0.9645 -1.3409 -1.0294
-1.0116 -1.3938 -0.9461
[torch.FloatTensor of size 5x3]
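The scores printed after training can be turned back into tag names by taking, for each word, the index with the highest log-probability. The sketch below copies the printed values by hand and uses an inverse mapping `ix_to_tag`, which is introduced here for illustration.

```python
tag_to_ix = {'DET': 0, 'V': 1, 'NN': 2}
ix_to_tag = {ix: tag for tag, ix in tag_to_ix.items()}

# Scores printed above for "The dog ate the apple" after training.
tag_scores = [
    [-0.9134, -1.5046, -0.9763],
    [-1.0410, -1.3724, -0.9329],
    [-1.0367, -1.3037, -0.9839],
    [-0.9645, -1.3409, -1.0294],
    [-1.0116, -1.3938, -0.9461],
]

# For each word, pick the tag index with the largest score.
predicted = [ix_to_tag[max(range(len(row)), key=row.__getitem__)]
             for row in tag_scores]
expected = ["DET", "NN", "V", "DET", "NN"]

print(predicted)  # ['DET', 'NN', 'NN', 'DET', 'NN']
print(sum(p == e for p, e in zip(predicted, expected)))  # 4
```

On these values the model gets four of the five tags right and labels "ate" as NN instead of V; training for more epochs would likely improve this.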





In [16]:
inputs = prepare_sequence(training_data[0][0], word_to_ix)
print(inputs)

Variable containing:
 0
 1
 2
 3
 4
[torch.LongTensor of size 5]

