# Sequence-to-sequence RNN
In this exercise, we implement a sequence-to-sequence RNN (without attention).

In [1]:
import torch
import torch.nn as nn

We first define our hyperparameters.

In [2]:
embedding_dim = 10
hidden_dim = 20
num_layers = 2
bidirectional = True
sequence_length = 5
batch_size = 3

Create a bidirectional [`nn.LSTM`](https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html) with 2 layers.

In [3]:
lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=num_layers, bidirectional=bidirectional)

We create an example input `x`.

In [4]:
x = torch.randn(sequence_length, batch_size, embedding_dim)

What should the initial hidden and cell state be?

In [5]:
num_directions = 2 if bidirectional else 1
h0 = torch.zeros(num_layers * num_directions, batch_size, hidden_dim)
c0 = torch.zeros(num_layers * num_directions, batch_size, hidden_dim)

Now we run our LSTM. Look at the output. Explain each dimension of the output.

In [6]:
output, (hn, cn) = lstm(x, (h0, c0))
print(f'Output shape: {output.shape}, sequence_length x batch_size x concatenated final-layer hidden states from both directions')
print(f'Last hidden state shape: {hn.shape}, num_layers * num_directions x batch_size x hidden_dim')
print(f'Last cell state shape: {cn.shape}, num_layers * num_directions x batch_size x hidden_dim')

Output shape: torch.Size([5, 3, 40]), sequence_length x batch_size x concatenated final-layer hidden states from both directions
Last hidden state shape: torch.Size([4, 3, 20]), num_layers * num_directions x batch_size x hidden_dim
Last cell state shape: torch.Size([4, 3, 20]), num_layers * num_directions x batch_size x hidden_dim
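
As a quick check on the zero initial states used above: `nn.LSTM` falls back to zero states when no initial state is passed, so the explicit `(h0, c0)` and the default call should agree. A minimal sketch, with the dimensions hard-coded to the hyperparameters of this exercise:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lstm = nn.LSTM(10, 20, num_layers=2, bidirectional=True)
x = torch.randn(5, 3, 10)  # sequence_length x batch_size x embedding_dim

# Explicit zero states: one slice per (layer, direction) pair, i.e. 2 * 2 = 4.
h0 = torch.zeros(2 * 2, 3, 20)
c0 = torch.zeros(2 * 2, 3, 20)

out_explicit, _ = lstm(x, (h0, c0))
out_default, _ = lstm(x)  # no initial state given, zeros are used

same = torch.allclose(out_explicit, out_default)
print(same)
```

Passing the states explicitly is still useful once they are no longer zero, as in the decoder later in this exercise.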


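`output` contains the hidden states of the last (2nd) layer only, one per time step, while `hn` and `cn` hold just the final states of every layer and direction. If we want access to the per-time-step hidden states of layer 1 as well, we have to run the layers ourselves, for example with `nn.LSTMCell`.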
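
One way to see the per-time-step states of the first layer without writing the recurrence by hand is to stack two single-layer `nn.LSTM` modules. This is a sketch under assumptions: the names `layer1` and `layer2` are hypothetical, and their weights are independent of the `lstm` defined earlier, so it illustrates the wiring rather than reproducing the same model.

```python
import torch
import torch.nn as nn

embedding_dim, hidden_dim = 10, 20
sequence_length, batch_size = 5, 3

# Two single-layer bidirectional LSTMs stacked by hand (hypothetical names).
layer1 = nn.LSTM(embedding_dim, hidden_dim, bidirectional=True)
# Layer 2 reads the concatenated forward/backward states of layer 1.
layer2 = nn.LSTM(2 * hidden_dim, hidden_dim, bidirectional=True)

x = torch.randn(sequence_length, batch_size, embedding_dim)
layer1_states, _ = layer1(x)              # hidden states of layer 1 at every step
layer2_states, _ = layer2(layer1_states)  # hidden states of layer 2 at every step
print(layer1_states.shape, layer2_states.shape)
```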

When we take the above LSTM as the encoder, what is its output that serves as the input to the decoder?

In [7]:
encoder = lstm

# Concatenate the final hidden states of the last layer in both directions.
# hn is indexed by layer * num_directions + direction, so:
# hn[2] is the second layer's left-to-right state after the n-th (last) input position
# hn[3] is the second layer's right-to-left state at the 1st input position (i.e. the last one processed when going from right to left)
encoder_output = torch.concat((hn[2, :, :], hn[3, :, :]), dim=-1)
print(encoder_output.shape)

torch.Size([3, 40])
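
The indexing of `hn` can be checked against `output`: the forward direction finishes at the last time step, the backward direction at the first. A small sketch with the exercise's dimensions hard-coded; the variable names `fwd_matches`, `bwd_matches` and `hn_by_layer` are only illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden_dim = 20
lstm = nn.LSTM(10, hidden_dim, num_layers=2, bidirectional=True)
x = torch.randn(5, 3, 10)
output, (hn, cn) = lstm(x)

# Forward half of the last time step equals the last layer's forward final state.
fwd_matches = torch.allclose(output[-1, :, :hidden_dim], hn[2])
# Backward half of the first time step equals the last layer's backward final state.
bwd_matches = torch.allclose(output[0, :, hidden_dim:], hn[3])

# The same states via an explicit (layer, direction, batch, hidden) layout.
hn_by_layer = hn.reshape(2, 2, 3, hidden_dim)
encoder_output = torch.cat((hn_by_layer[-1, 0], hn_by_layer[-1, 1]), dim=-1)
print(fwd_matches, bwd_matches, encoder_output.shape)
```

Indexing with `hn[-2]` and `hn[-1]` selects the same two slices and keeps working if the number of layers changes.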


Create a decoder LSTM with 2 layers. Why can't it be bidirectional as well? What is the hidden dimension of the decoder LSTM when you want to initialize it with the encoder output?

In [8]:
# The decoder can't be bidirectional since it generates the words:
# When generating word i, the word at position i+1 is not known yet.
decoder_hidden_dim = num_directions * hidden_dim  # has to be the same as encoder output
# if you want a smaller decoder hidden dim, insert a projection (multiplication with a matrix)
decoder = nn.LSTM(embedding_dim, decoder_hidden_dim, num_layers=num_layers)
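
The projection mentioned in the comment could look roughly like this. It is only a sketch: `small_decoder_dim`, `project` and `small_decoder` are hypothetical names, the encoder output is replaced by a random stand-in tensor, and `bias=False` is chosen so that the layer is a plain matrix multiplication.

```python
import torch
import torch.nn as nn

embedding_dim, hidden_dim, num_layers, batch_size = 10, 20, 2, 3
num_directions = 2
small_decoder_dim = 16  # hypothetical, smaller than num_directions * hidden_dim

# Stand-in for the concatenated final encoder states.
encoder_output = torch.randn(batch_size, num_directions * hidden_dim)

# Linear map from the encoder size to the decoder size.
project = nn.Linear(num_directions * hidden_dim, small_decoder_dim, bias=False)
small_decoder = nn.LSTM(embedding_dim, small_decoder_dim, num_layers=num_layers)

# One copy of the projected state per decoder layer.
h_dec_0 = project(encoder_output).unsqueeze(0).repeat(num_layers, 1, 1)
c_dec_0 = torch.zeros_like(h_dec_0)

y = torch.randn(8, batch_size, embedding_dim)
decoder_output, _ = small_decoder(y, (h_dec_0, c_dec_0))
print(decoder_output.shape)
```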

Run your decoder LSTM on an example sequence. Condition it on the encoder representation of the input sequence. How do we get the correct shape for the initial hidden state?

**Hint:** Take a look at [Torch's tensor operations](https://pytorch.org/docs/stable/tensors.html) and compare `Tensor.repeat`, `Tensor.repeat_interleave` and `Tensor.expand`.

In [9]:
output_seq_len = 8
y = torch.randn(output_seq_len, batch_size, embedding_dim)
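# expand only creates a view; .contiguous() materializes it, since some backends (e.g. cuDNN) expect a contiguous initial state
h_dec_0 = encoder_output.unsqueeze(0).expand(num_layers, -1, -1).contiguous()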
print('h_dec_0 shape:', h_dec_0.shape)
c_dec_0 = torch.zeros(num_layers, batch_size, decoder_hidden_dim)
decoder_output, (h_dec_n, c_dec_n) = decoder(y, (h_dec_0, c_dec_0))
print('decoder output shape:', decoder_output.shape)

h_dec_0 shape: torch.Size([2, 3, 40])
decoder output shape: torch.Size([8, 3, 40])
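
To make the difference between the three tensor operations from the hint concrete, here is a small comparison on a toy tensor (the variable names are illustrative only):

```python
import torch

t = torch.tensor([[1, 2], [3, 4]])  # shape [2, 2]

# expand: adds a leading dimension as a view, no data is copied.
expanded = t.unsqueeze(0).expand(3, -1, -1)  # shape [3, 2, 2]
# repeat: same values, but the data is physically copied.
repeated = t.unsqueeze(0).repeat(3, 1, 1)    # shape [3, 2, 2]

# repeat tiles the whole tensor: rows [1, 2], [3, 4], [1, 2], [3, 4], ...
tiled = t.repeat(3, 1)                       # shape [6, 2]
# repeat_interleave repeats each row in place: [1, 2] three times, then [3, 4].
interleaved = t.repeat_interleave(3, dim=0)  # shape [6, 2]

print(torch.equal(expanded, repeated))
print(expanded.is_contiguous(), repeated.is_contiguous())
print(tiled.tolist())
print(interleaved.tolist())
```

For the decoder's initial state, `expand` and `repeat` after `unsqueeze(0)` give the same values; `repeat_interleave` along the batch dimension would instead change which example each row belongs to.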


In most seq2seq RNNs, the final encoder hidden state is used as the first hidden state of the decoder RNN. In some variants, it is additionally concatenated with the hidden state of the previous time step at each decoder time step. PyTorch's `nn.LSTM` does not expose the per-step computation, so we cannot easily do that and would have to resort to the lower-level `nn.LSTMCell` class again.
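
A step-by-step decoder loop with `nn.LSTMCell` might look like the sketch below. It makes several assumptions: a single decoder layer, a random stand-in `context` for the encoder output, and the encoder summary appended to the cell input at every step, which is one possible way of letting it enter the gate computations together with the previous hidden state.

```python
import torch
import torch.nn as nn

embedding_dim, context_dim, batch_size, output_seq_len = 10, 40, 3, 8

context = torch.randn(batch_size, context_dim)  # stand-in for encoder_output
y = torch.randn(output_seq_len, batch_size, embedding_dim)

# The cell sees [y_t; context] at every step.
cell = nn.LSTMCell(embedding_dim + context_dim, context_dim)

h = context              # initial hidden state taken from the encoder
c = torch.zeros_like(h)  # initial cell state
states = []
for y_t in y:            # iterate over the time steps
    h, c = cell(torch.cat((y_t, context), dim=-1), (h, c))
    states.append(h)

decoder_states = torch.stack(states)  # output_seq_len x batch_size x context_dim
print(decoder_states.shape)
```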

Put it all together in a seq2seq LSTM model.

In [10]:
class Seq2seqLSTM(nn.Module):
    """ Sequence-to-sequence LSTM. """
    
    def __init__(self, embedding_dim, hidden_dim, num_encoder_layers, num_decoder_layers, bidirectional):
        super().__init__()
        self.encoder = nn.LSTM(embedding_dim, hidden_dim, num_layers=num_encoder_layers, bidirectional=bidirectional)
        self.num_directions = 2 if bidirectional else 1
        self.decoder = nn.LSTM(embedding_dim, self.num_directions * hidden_dim, num_layers=num_decoder_layers)
    
    def forward(self, x, y):
        assert x.dim() == 3, "Expected input of shape [sequence length, batch size, embedding dim]"
        batch_size = x.size(1)

        # encoder forward
        # create the initial states on the same device and with the same dtype as the input
        h0 = torch.zeros(self.encoder.num_layers * self.num_directions, batch_size, self.encoder.hidden_size, device=x.device, dtype=x.dtype)
        c0 = torch.zeros_like(h0)
        encoder_outputs, (hn, cn) = self.encoder(x, (h0, c0))

        # decoder forward
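        # final hidden state of the last encoder layer (both directions concatenated if bidirectional)
        encoder_output = torch.concat((hn[-2], hn[-1]), -1) if self.num_directions == 2 else hn[-1]
        # expand prepends the layer dimension: [batch, hidden] -> [num_decoder_layers, batch, hidden]
        h_dec_0 = encoder_output.expand(self.decoder.num_layers, -1, -1).contiguous()
        c_dec_0 = torch.zeros_like(h_dec_0)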
        decoder_outputs, _ = self.decoder(y, (h_dec_0, c_dec_0))
        return decoder_outputs

Test your seq2seq LSTM with an input sequence `x` and a ground truth output sequence `y` that the decoder tries to predict.

In [11]:
torch.manual_seed(0)
seq2seq_lstm = Seq2seqLSTM(embedding_dim, hidden_dim, num_layers, num_layers, bidirectional)
x = torch.randn(10, 2, embedding_dim)
y = torch.randn(9, 2, embedding_dim)
outputs = seq2seq_lstm(x, y)
assert outputs.dim() == 3 and list(outputs.size()) == [9, 2, decoder_hidden_dim], "Wrong output shape"
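
The test above only checks shapes. To actually train a decoder to predict `y`, one common setup is teacher forcing: feed `y[:-1]` and compare the predictions with `y[1:]`. The following is a rough sketch, not part of the exercise solution. It assumes a single-layer unidirectional encoder and decoder for brevity, a hypothetical linear `readout` layer that maps decoder states back to the embedding size, and a mean squared error loss because the toy targets are continuous vectors.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embedding_dim, hidden_dim = 10, 20

encoder = nn.LSTM(embedding_dim, hidden_dim)
decoder = nn.LSTM(embedding_dim, hidden_dim)
readout = nn.Linear(hidden_dim, embedding_dim)  # hypothetical output layer

x = torch.randn(10, 2, embedding_dim)
y = torch.randn(9, 2, embedding_dim)

# Encode the input; hn has shape [1, batch_size, hidden_dim].
_, (hn, _) = encoder(x)

# Teacher forcing: the decoder reads y[:-1] and should predict y[1:].
decoder_states, _ = decoder(y[:-1], (hn, torch.zeros_like(hn)))
prediction = readout(decoder_states)
loss = nn.functional.mse_loss(prediction, y[1:])

# Gradients flow through the initial decoder state back into the encoder.
loss.backward()
print(prediction.shape, loss.item())
```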