# Understanding the Basics of PyTorch's LSTM

## Introduction to LSTMs: getting the shape right

In this short notebook we try to understand the meaning of the various parameters of an LSTM. We assume we want to build a simple encoder whose input is a set of 100 observations. Each observation consists of $n_x = 15$ features measured at $T = 10$ time points.

According to the [documentation of the LSTM class](http://pytorch.org/docs/0.3.1/nn.html#lstm), the input has shape `(seq_len, batch, input_size)`. There is a `batch_first` option in the LSTM constructor. We will explore this option later on.
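As a quick illustration of the `batch_first` option, the sketch below runs the same kind of input through two LSTMs, one with the default layout and one with `batch_first=True`. The variable names are illustrative, and the code assumes PyTorch 0.4 or later, where plain tensors can be passed without wrapping them in a `Variable`.

```python
import torch
import torch.nn as nn

# Assumption: PyTorch >= 0.4, so plain tensors are accepted directly.
seq_len, batch, input_size, hidden_size = 10, 16, 15, 8

# Default layout: input is (seq_len, batch, input_size)
lstm_time_major = nn.LSTM(input_size, hidden_size)
out_tm, (h_tm, c_tm) = lstm_time_major(torch.randn(seq_len, batch, input_size))

# batch_first=True: input is (batch, seq_len, input_size)
lstm_batch_major = nn.LSTM(input_size, hidden_size, batch_first=True)
out_bm, (h_bm, c_bm) = lstm_batch_major(torch.randn(batch, seq_len, input_size))

print(out_tm.shape)  # torch.Size([10, 16, 8])
print(out_bm.shape)  # torch.Size([16, 10, 8])

# batch_first only affects the input and `output`; the states keep
# the shape (num_layers * num_directions, batch, hidden_size)
print(h_tm.shape, h_bm.shape)  # both torch.Size([1, 16, 8])
```

Note that only the input and `output` change layout; the hidden and cell states have the same shape in both cases.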

In [1]:
import torch
import torch.nn as nn
from torch.autograd import Variable
import numpy as np

BATCH_SIZE = 16

# x has shape (seq_len, batch, input_size)
input_size = 15
seq_len = 10
n_observations = 100

# We create a hidden layer of size 8
hidden_size = 8

# We suppose this is the output of a batch generator
x_batch = Variable(torch.randn(seq_len, BATCH_SIZE, input_size))

## Getting the inputs right

An LSTM receives as input a variable containing the data we want to process, and a tuple with the initial values of the hidden state and the cell state. If we have only one unidirectional LSTM layer, the shape of these state variables must be `(1, batch, hidden_size)`. More generally, given `l` layers and `d` directions, the shape will be `(l*d, batch, hidden_size)`. Therefore, if we have one bidirectional LSTM layer, the initial hidden and cell states must both have shape `(2, batch, hidden_size)`.
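The shape rule for the initial states can be checked with a small sketch. The helper `initial_states` is hypothetical (it is not part of PyTorch), and the code assumes PyTorch 0.4 or later so that plain tensors can be used.

```python
import torch
import torch.nn as nn

def initial_states(num_layers, num_directions, batch, hidden_size):
    """Hypothetical helper: zero-filled (h_0, c_0) of shape (l*d, batch, hidden_size)."""
    shape = (num_layers * num_directions, batch, hidden_size)
    return torch.zeros(shape), torch.zeros(shape)

seq_len, batch, input_size, hidden_size = 10, 16, 15, 8
x = torch.randn(seq_len, batch, input_size)

# One bidirectional layer: l = 1, d = 2, so the first dimension is 2
bilstm = nn.LSTM(input_size, hidden_size, num_layers=1, bidirectional=True)
h_0, c_0 = initial_states(1, 2, batch, hidden_size)
output, (h_n, c_n) = bilstm(x, (h_0, c_0))

print(h_0.shape)  # torch.Size([2, 16, 8])
print(h_n.shape)  # torch.Size([2, 16, 8])
```

Passing states with a different first dimension (for example 1 instead of 2 here) should make the LSTM raise a shape error, which is a convenient way to catch mistakes early.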

## Getting the outputs right

The LSTM returns `output, (h_n, c_n)`, where `output` is an object of shape `(seq_len, batch, hidden_size * num_directions)` containing the hidden state of the *last layer* at each time point. This can be confusing, as the documentation refers to these as the *output features*, but they are in fact the hidden states `h_t` at each time point.

The tuple `(h_n, c_n)` contains the hidden state and the cell state at the last time point. Note that these are **not** restricted to the last layer: they cover every layer and direction. Their shape, in fact, is `(num_layers * num_directions, batch, hidden_size)`. So for a model with two bidirectional layers, `h_n` and `c_n` will have shape `(4, batch, hidden_size)`.
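The relation between `output` and `h_n` can be verified directly. The following is a minimal sketch with illustrative variable names, assuming PyTorch 0.4 or later. It checks that, for a single unidirectional layer, the last time step of `output` coincides with `h_n`, and it prints the shapes for a model with two bidirectional layers.

```python
import torch
import torch.nn as nn

# Assumption: PyTorch >= 0.4, so plain tensors are accepted directly.
seq_len, batch, input_size, hidden_size = 10, 16, 15, 8
x = torch.randn(seq_len, batch, input_size)

# Single unidirectional layer: output[-1] is the last hidden state, i.e. h_n[0]
lstm = nn.LSTM(input_size, hidden_size)
output, (h_n, c_n) = lstm(x)
last_step_matches = torch.allclose(output[-1], h_n[0])
print(last_step_matches)  # True

# Two bidirectional layers: output only covers the last layer,
# while h_n and c_n cover every layer and direction
deep_bilstm = nn.LSTM(input_size, hidden_size, num_layers=2, bidirectional=True)
output2, (h_n2, c_n2) = deep_bilstm(x)
print(output2.shape)  # torch.Size([10, 16, 16])
print(h_n2.shape)     # torch.Size([4, 16, 8])
```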

## Getting the initialization right

The inputs to an LSTM are the data and the initial hidden and cell states, `h_0` and `c_0`. If our input has shape `(seq_len, batch, input_size)`, what should the size of `h_0` and `c_0` be? The [documentation](http://pytorch.org/docs/0.3.1/nn.html#lstm) is clear: it should be `(num_layers * num_directions, batch, hidden_size)`. We can either initialize `h_0` and `c_0` outside of the class, or inside it. We will show an example of this latter approach. It is cleaner and more self-contained.
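For comparison, the first approach (initializing `h_0` and `c_0` outside of the class) might look like the sketch below. It also shows that omitting the states altogether is equivalent to passing zeros, since PyTorch defaults both to zero when they are not provided. Variable names are illustrative, and PyTorch 0.4 or later is assumed.

```python
import torch
import torch.nn as nn

# Assumption: PyTorch >= 0.4, so plain tensors are accepted directly.
seq_len, batch, input_size, hidden_size = 10, 16, 15, 8
num_layers, num_directions = 1, 1

lstm = nn.LSTM(input_size, hidden_size)
x = torch.randn(seq_len, batch, input_size)

# Explicit zero states created outside of any module
h_0 = torch.zeros(num_layers * num_directions, batch, hidden_size)
c_0 = torch.zeros(num_layers * num_directions, batch, hidden_size)

out_explicit, _ = lstm(x, (h_0, c_0))
out_default, _ = lstm(x)  # states omitted: they default to zeros

same_result = torch.allclose(out_explicit, out_default)
print(same_result)  # True
```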

In [2]:
class MyLSTM(nn.Module):
    
    def __init__(self, input_size, hidden_size):
        super(MyLSTM, self).__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        
        # IMPORTANT: self-contained initialization of the hidden state
        self.lstm = nn.LSTM(self.input_size, self.hidden_size)
        self.hidden = self.initialize_hidden_states()
        
    def initialize_hidden_states(self):
        # BATCH_SIZE is defined outside of the class. We could pass it as a parameter,
        # but it seems odd to make a class instance depend on the batch size.
        return (Variable(torch.zeros(1, BATCH_SIZE, self.hidden_size)),
                Variable(torch.zeros(1, BATCH_SIZE, self.hidden_size)))
    
    # Note that we don't include the hidden states among the arguments of `forward`
    def forward(self, inputs):
        output, self.hidden = self.lstm(inputs, self.hidden)
        return output

If we now pass our input object `x_batch` to an instance of `MyLSTM` (the initial states are created internally by `initialize_hidden_states`, so we do not pass them), we should obtain an output of shape `(seq_len, batch, hidden_size)` since `num_directions` is one. Therefore the output should have shape (10, 16, 8).

In [3]:
my_lstm = MyLSTM(input_size, hidden_size)

# Note: we don't need to pass the initial hidden states because they 
# are generated inside the instance
output = my_lstm(x_batch)
print(output.shape)

torch.Size([10, 16, 8])


## Things to remember

1. It is easy to make mistakes when defining `__init__` or using `super(..., self)`. These are difficult to spot.
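2. In PyTorch 0.3.1, the version referenced here, always pass a `Variable` to the LSTM, not a bare torch tensor. Since PyTorch 0.4, `Variable` and `Tensor` have been merged, so plain tensors work directly.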
3. If you use the *internal initialization* of the hidden states, you don't have to pass them when calling the model.
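Regarding the third point, one possible variation is to build the zero states inside `forward` from the batch dimension of the input, so the model does not depend on a global batch size. This is only a sketch of an alternative design, not the approach used in this notebook: the class name `BatchAgnosticLSTM` is hypothetical, it resets the states on every call instead of storing them between calls, and it assumes PyTorch 0.4 or later.

```python
import torch
import torch.nn as nn

class BatchAgnosticLSTM(nn.Module):
    """Hypothetical variant: zero states are created per call from the input."""

    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.hidden_size = hidden_size
        self.lstm = nn.LSTM(input_size, hidden_size)

    def forward(self, inputs):
        # inputs has shape (seq_len, batch, input_size)
        batch = inputs.size(1)
        # new_zeros keeps the dtype and device of the input
        h_0 = inputs.new_zeros(1, batch, self.hidden_size)
        c_0 = inputs.new_zeros(1, batch, self.hidden_size)
        output, _ = self.lstm(inputs, (h_0, c_0))
        return output

model = BatchAgnosticLSTM(input_size=15, hidden_size=8)
out_a = model(torch.randn(10, 16, 15))  # batch of 16
out_b = model(torch.randn(10, 4, 15))   # batch of 4 works with the same instance
print(out_a.shape)  # torch.Size([10, 16, 8])
print(out_b.shape)  # torch.Size([10, 4, 8])
```

The trade-off is that the hidden state is no longer carried over from one call to the next, which may or may not be what you want.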

## Bi-directional LSTMs

We can extend the previous class so that it uses a bi-directional LSTM. We just need to modify the first dimension of the hidden states in the `initialize_hidden_states` function, and add the option `bidirectional=True` to the LSTM.

In [4]:
class MyBiLSTM(nn.Module):
    
    def __init__(self, input_size, hidden_size):
        super(MyBiLSTM, self).__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size        
        self.lstm = nn.LSTM(self.input_size, self.hidden_size, bidirectional=True)
        self.hidden = self.initialize_hidden_states()
        
    def initialize_hidden_states(self):
        # Note that now the first dimension is 2 because the number of directions is two
        return (Variable(torch.zeros(2, BATCH_SIZE, self.hidden_size)),
                Variable(torch.zeros(2, BATCH_SIZE, self.hidden_size)))
    
    # Note that we don't include the hidden states among the arguments of `forward`
    def forward(self, inputs):
        output, self.hidden = self.lstm(inputs, self.hidden)
        return output

The output of this model should have shape `(seq_len, batch, hidden_size * num_directions)`, *i.e.*, (10, 16, 16).

In [5]:
my_bilstm = MyBiLSTM(input_size, hidden_size)
output = my_bilstm(x_batch)
print(output.shape)

torch.Size([10, 16, 16])
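The last dimension of the bidirectional output is the concatenation of the forward and backward hidden states. The sketch below, with illustrative variable names and assuming PyTorch 0.4 or later, splits the two halves and checks them against `h_n`: the forward direction finishes at the last time step, while the backward direction finishes at the first one.

```python
import torch
import torch.nn as nn

# Assumption: PyTorch >= 0.4, so plain tensors are accepted directly.
seq_len, batch, input_size, hidden_size = 10, 16, 15, 8
bilstm = nn.LSTM(input_size, hidden_size, bidirectional=True)
x = torch.randn(seq_len, batch, input_size)
output, (h_n, c_n) = bilstm(x)

# First half of the last dimension: forward direction; second half: backward
forward_out = output[:, :, :hidden_size]
backward_out = output[:, :, hidden_size:]

# h_n[0] is the final forward state (reached at the last time step),
# h_n[1] is the final backward state (reached at the first time step)
forward_matches = torch.allclose(forward_out[-1], h_n[0])
backward_matches = torch.allclose(backward_out[0], h_n[1])
print(forward_matches, backward_matches)  # True True
```

This matters when building an encoder: taking `output[-1]` alone gives the backward direction after it has seen only one time step, so `h_n` is usually the safer summary of the whole sequence.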


## (TODO) Working with padded sequences

## (TODO) Understanding padding