# Seq2Seq - Encoder/Decoder networks
In this exercise we'll have a deeper look into the ability to use multiple RNN's to infer and generate sequences of data.
Specifically we will implement a Encoder-Decoder RNN based for a simple sequence to sequence translation task.
This type of models have shown impressive performance in Neural Machine Translation and Image Caption generation. 

In the encoder-decoder structure one RNN (blue) encodes the input into a hidden representation, and a second RNN (red) uses this representation to predict the target values.
An essential step is deciding how the encoder and decoder should communicate.
In the simplest approach you use the last hidden state of the encoder to initialize the decoder.
This is what we will do in this notebook, as shown here:

![](./images/enc-dec.png)

In this exercise we will translate from the words of number (e.g. 'nine') to the actual number (e.g. '9').
The input for the Encoder RNN consists of words defining the number, whilst the output of such an encoding serves as input for the Decoder RNN that aims to generate generate a number. 
Our dataset is generated and consists of numbers and an End-of-Sentence (EOS) character ('#'). The data we want to generate should be like follows:

```
Examples: 
prediction  |  input
991136#00 	 nine nine one one three six
81771#000 	 eight one seven seven one
3519614#0 	 three five one nine six one four
26656#000 	 two six six five six
60344#000 	 six zero three four four
162885#00 	 one six two eight eight five
78612625# 	 seven eight six one two six two five
9464710#0 	 nine four six four seven one zero
191306#00 	 one nine one three zero six
10160378# 	 one zero one zero six three seven eight
```

Let us define the space of characters and numbers to be learned with the networks:

```
Number of valid characters: 27
'0'=0,	'1'=1,	'2'=2,	'3'=3,	'4'=4,	'5'=5,	'6'=6,	'7'=7,	'8'=8,	'9'=9,	'#'=10,	' '=11,	'e'=12,	'g'=13,	'f'=14,	'i'=15,	'h'=16,	'o'=17,	'n'=18,	's'=19,	'r'=20,	'u'=21,	't'=22,	'w'=23,	'v'=24,	'x'=25,	'z'=26,	
Stop/start character = #
```

All represented characters and numbers as characters, gets mapped to an integer from 0-26. Our total space of valid characters consists of 27.

In [1]:
try
    from google.colab import drive
    import os
    drive.mount('/content/gdrive')
    os.chdir('/content/gdrive/My Drive/Notes/Dtu courses/3rd_semester/Deep learning/dtu-deep-learning')
exp

ModuleNotFoundError: No module named 'google.colab'

In [None]:
from data_generator import generate
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np
import random
import math

device =  torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Device in use:", device)

NUM_INPUTS = 27 #No. of possible characters
NUM_OUTPUTS = 11  # (0-9 + '#')

### Hyperparameters and general configs
MAX_SEQ_LEN = 8
MIN_SEQ_LEN = 5
TRAINING_SIZE = 100
LEARNING_RATE = 0.003
N_LAYERS = 2
DROPOUT = 0.5

# Hidden size of enc and dec need to be equal if last hidden of encoder becomes init hidden of decoder
# Otherwise we would need e.g. a linear layer to map to a space with the correct dimension
NUM_UNITS_ENC = NUM_UNITS_DEC = 256
HIDDEN_DIM = 512
TEST_SIZE = 100
EPOCHS = 10
TEACHER_FORCING = 0.5
NUM_OF_BATCHES=8

# assert TRAINING_SIZE % NUM_OF_BATCHES == 0

For this exercise we won´t worry about data generation, but utilise a built function for this purpose. The function generates random data constained by the 27 characters described above.

The encoder takes as input the embedded text strings generated from the *generate* function as given here above ie. 'nine' would become [18 15 18 12].
Sequeneces are generated at random given settings of minima and maxima length, constrained by the dimensions of the two RNN´s architecture.
We may visualise a subset of the data generated by running the command below

In [None]:
# from google.colab import drive
# import os
# drive.mount('/content/gdrive')
# os.chdir('/content/gdrive/My Drive/Notes/Dtu courses/3rd_semester/Deep learning/dtu-deep-learning')

In [None]:
 !python data_generator.py


## Let's define the two RNN's



In [None]:
class EncoderRNN(nn.Module):
    def __init__(self, input_size, emb_size, hidden_size, dropout):
        super().__init__()
        self.hidden_size = hidden_size
        self.emb_size = emb_size

        self.embedding = nn.Embedding(input_size, self.emb_size)
        self.rnn = nn.GRU(
            self.emb_size,
            self.hidden_size,
            bidirectional=True,
            batch_first=True,
        )

        self.fc = nn.Linear(self.hidden_size * 2, self.hidden_size)

        self.dropout = nn.Dropout(dropout)

    def forward(self, inputs, hidden, inputs_len):
        # Input shape [batch, seq_in_len]
        # inputs = [inputs[0],inputs[2]]
        inputs = inputs.long()

        # Embedded shape [batch, seq_in_len, embed]
        embedded = self.dropout(self.embedding(inputs))
        # embedded = embedded.view(embedded.shape[0]*embedded.shape[1],embedded.shape[2],embedded.shape[3])

        # pack_padded_sequence so that padded items in the sequence won't be shown to the LSTM
        packed_embedded = torch.nn.utils.rnn.pack_padded_sequence(
            embedded, inputs_len, batch_first=True
        )

        packed_outputs, hidden = self.rnn(packed_embedded)

        # Output shape [batch, seq_in_len, embed]
        # Hidden shape [1, batch, embed], last hidden state of the GRU cell
        # We will feed this last hidden state into the decoder
        # print(embedded.shape)
        # Reshape our output to match the input shape of our forward pass
        # hidden = hidden.reshape(hidden.shape[0],1, hidden.shape[1],hidden.shape[2])
        # hidden=hidden.unsqueeze_(0)
        # print(hidden[1].shape)
        # view(len(sentence), 1, -1)
        # print(test.shape)
        # outputs,hidden = self.rnn(embedded)
        # print(hidden.shape)
        outputs, _ = nn.utils.rnn.pad_packed_sequence(
            packed_outputs, batch_first=True
        )
        hidden = torch.tanh(self.fc(torch.cat((hidden[-2], hidden[-1]), dim=1)))
        return outputs, hidden

    def init_hidden(self, batch_size):
        init = torch.zeros(1, batch_size, self.hidden_size, device=device)
        return init


In [None]:


class Attention(nn.Module):
    def __init__(self, hidden_size):
        super().__init__()

        self.attn = nn.Linear((hidden_size * 2) + hidden_size, hidden_size)
        self.v = nn.Parameter(torch.rand(hidden_size))

    def forward(self, hidden, encoder_outputs, mask):

        # hidden = [batch size, dec hid dim]
        # encoder_outputs = [src sent len, batch size, enc hid dim * 2]
        # mask = [batch size, src sent len]

        batch_size = encoder_outputs.shape[0]
        src_len = encoder_outputs.shape[1]

        # encoder output =  [33, 8, 512], hidden = [8, 256]
        # print(encoder_outputs.shape[0], encoder_outputs.shape[1])

        # repeat encoder hidden state src_len times
        # print(hidden.shape,hidden.unsqueeze(1).shape)
        hidden = hidden.unsqueeze(1).repeat(1, src_len, 1)

        # encoder_outputs = encoder_outputs.permute(1, 0, 2)
        # print(encoder_outputs.shape)

        # hidden = [batch size, src sent len, dec hid dim]
        # encoder_outputs = [batch size, src sent len, enc hid dim * 2]
        # print(hidden.shape,encoder_outputs.shape)

        # encoder_outputs = encoder_outputs.permute(1, 0, 2)
        energy = torch.tanh(
            self.attn(torch.cat((hidden, encoder_outputs), dim=2))
        )

        # energy = [batch size, src sent len, dec hid dim]

        energy = energy.permute(0, 2, 1)

        # energy = [batch size, dec hid dim, src sent len]

        # v = [dec hid dim]

        v = self.v.repeat(batch_size, 1).unsqueeze(1)

        # v = [batch size, 1, dec hid dim]

        attention = torch.bmm(v, energy).squeeze(1)

        # print(attention.shape,mask.shape)

        # attention = [batch size, src sent len]

        attention = attention.masked_fill(mask == 0, -1e10)

        return F.softmax(attention, dim=1)

In [None]:
class DecoderRNN(nn.Module):
    def __init__(self, hidden_size, emb_size, output_size, dropout, attention):
        super().__init__()
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.emb_size = emb_size
        self.attention = attention

        self.embedding = nn.Embedding(self.output_size, self.emb_size)
        self.out = nn.Linear(
            (self.hidden_size * 2) + self.hidden_size + self.emb_size,
            output_size,
        )
        self.rnn = nn.GRU(
            (self.hidden_size * 2) + self.emb_size, self.hidden_size,
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, inputs, hidden, encoder_outputs, mask):
        # Input shape: [batch, output_len]
        # Hidden shape: [seq_len=1, batch_size, hidden_dim] (the last hidden state of the encoder)
        dec_input = inputs.unsqueeze(1)
        embedded = self.dropout(self.embedding(dec_input))
        # print(embedded.shape,dec_input.shape)
        embedded = embedded.permute(1, 0, 2)
        # print(hidden.shape,encoder_outputs.shape,mask.shape)
        # encoder_outputs = encoder_outputs.permute(1, 0, 2)
        a = self.attention(hidden, encoder_outputs, mask)
        a = a.unsqueeze(1)
        # encoder_outputs = encoder_outputs.permute(1, 0, 2)
        # print(a.shape, encoder_outputs.shape)
        weighted = torch.bmm(a, encoder_outputs)
        weighted = weighted.permute(1, 0, 2)
        # print(embedded.shape,weighted.shape)
        rnn_input = torch.cat((embedded, weighted), dim=2)
        # print(weighted.shape,embedded.shape,rnn_input.shape,hidden.shape)
        out, hidden = self.rnn(rnn_input, hidden.unsqueeze(0))
        # out, hidden = self.rnn(self.embedding(dec_input), hidden)
        assert (out == hidden).all()
        embedded = embedded.squeeze(0)
        out = out.squeeze(0)
        weighted = weighted.squeeze(0)

        output = self.out(torch.cat((out, weighted, embedded), dim=1))
        # print(output.shape,hidden.squeeze(0).shape)

        # output = torch.stack(output).permute(1, 0, 2)  # [batch_size x seq_len x output_size]
        hidden = hidden.squeeze(0)
        # print(hidden.squeeze(0).shape)

        return output, hidden.squeeze(0), a.squeeze(1)

In [None]:
# class DecoderRNN(nn.Module):
#     def __init__(self, hidden_size, emb_size, output_size, dropout,attention):
#         super().__init__()
#         self.hidden_size = hidden_size
#         self.output_size = output_size
#         self.emb_size = emb_size
#         self.attention = attention

        
#         self.embedding = nn.Embedding(self.output_size, self.emb_size)
#         self.out = nn.Linear((self.hidden_size* 2) + self.hidden_size + self.emb_size, output_size)
#         self.rnn = nn.GRU((self.hidden_size * 2) + self.emb_size, self.hidden_size)
#         self.dropout = nn.Dropout(dropout)

#     def forward(self, inputs, hidden,encoder_outputs, mask, output_len,inputs_len, teacher_forcing=False):
#         # Input shape: [batch, output_len]
#         # Hidden shape: [seq_len=1, batch_size, hidden_dim] (the last hidden state of the encoder)
        
#         output_len = output_len[0]

#         if teacher_forcing:
#             dec_input = inputs
#             embed = self.dropout(self.embedding(dec_input))   # shape [batch, output_len, hidden_dim]
#             out, hidden = self.rnn(embed,hidden)
#             out = self.out(out)  # linear layer, out has now shape [batch, output_len, output_size]
#             output = F.log_softmax(out, -1)
#         else:
#             # Take the EOS character only, for the whole batch, and unsqueeze so shape is [batch, 1]
#             # This is the first input, then we will use as input the GRU output at the previous time step
#             dec_input = inputs[:, 0].unsqueeze(1)

#             output = []
#             for i in range(output_len):
#                 embedded = self.dropout(self.embedding(dec_input))
#                 embedded = embedded.permute(1, 0, 2)
#                 print(hidden.shape,encoder_outputs.shape,mask.shape)
#                 #encoder_outputs = encoder_outputs.permute(1, 0, 2)
#                 a = self.attention(hidden, encoder_outputs, mask)
#                 a = a.unsqueeze(1)
#                 encoder_outputs = encoder_outputs.permute(1, 0, 2)
#                 #print(a.shape, encoder_outputs.shape)
#                 weighted = torch.bmm(a, encoder_outputs)
#                 weighted = weighted.permute(1, 0, 2)
#                 #print(embedded.shape,weighted.shape)
#                 rnn_input = torch.cat((embedded, weighted), dim = 2)
#                 #print(weighted.shape,embedded.shape,rnn_input.shape,hidden.shape)
#                 out, hidden= self.rnn(rnn_input,hidden.unsqueeze(0))
#                 #out, hidden = self.rnn(self.embedding(dec_input), hidden)
#                 assert (out == hidden).all()
#                 embedded = embedded.squeeze(0)
#                 out = out.squeeze(0)
#                 weighted = weighted.squeeze(0)

#                 output = self.out(torch.cat((out, weighted, embedded), dim = 1))
#                 #print(output.shape,hidden.squeeze(0).shape)

#             #output = torch.stack(output).permute(1, 0, 2)  # [batch_size x seq_len x output_size]
#                 hidden = hidden.squeeze(0)
#                 print(hidden.squeeze(0).shape)

#         return output,hidden.squeeze(0), a.squeeze(1)

The learned representation from the *Encoder* gets propagated to the *Decoder* as the final hidden layer in the *Encoder* network is set as initialisation for the *Decoder*'s first hidden layer.

In [None]:
def create_mask(src):
    mask = src != 0  # .permute(1, 0)
    return mask

In [None]:



def forward_pass(
    encoder, decoder, x, t, t_in, x_len, criterion, teacher_forcing_ratio=0.5
):
    """
    Executes a forward pass through the whole model.

    :param encoder:
    :param decoder:
    :param x: input to the encoder, shape [batch, seq_in_len]
    :param t: target output predictions for decoder, shape [batch, seq_t_len]
    :param criterion: loss function
    :param max_t_len: maximum target length

    :return: output (after log-softmax), loss, accuracy (per-symbol)
    """
    batch_size = x.shape[0]
    # print(batch_size)
    trg_len = t_in.shape[1]
    # print(t_in.shape)
    trg_vocab_size = NUM_OUTPUTS

    # tensor to store decoder outputs
    outputs = torch.zeros(trg_len, batch_size, trg_vocab_size)

    # Run encoder and get last hidden state (and output)

    enc_h = encoder.init_hidden(batch_size)
    enc_out, enc_h = encoder(x, enc_h, x_len)
    # print(enc_h.shape,enc_out.shape)

    # print(mask.shape)

    # first input to the decoder is the <sos> tokens
    inputs = t_in[:, 0]
    # print(inputs)
    dec_h = enc_h
    mask = create_mask(x)

    for i in range(1, trg_len + 1):

        # insert input token embedding, previous hidden state, all encoder hidden states
        #  and mask
        # receive output tensor (predictions) and new hidden state
        output, dec_h, _ = decoder(inputs, dec_h, enc_out, mask)

        # place predictions in a tensor holding predictions for each token
        outputs[i - 1] = output

        # decide if we are going to use teacher forcing or not
        teacher_force = random.random() < teacher_forcing_ratio

        # get the highest predicted token from our predictions
        top1 = output.argmax(1)

        # if teacher forcing, use actual next token as next input
        # if not, use predicted token
        if i < trg_len:
            inputs = t_in[:, i] if teacher_force else top1

    out = outputs.permute(1, 2, 0)
    # Shape: [batch_size x num_classes x out_sequence_len], with second dim containing log probabilities
    # print(out.shape,t.shape)
    loss = criterion(out, t)
    pred = get_pred(log_probs=out)
    accuracy = (pred == t).type(torch.FloatTensor).mean()

    return out, loss, accuracy

In [None]:
# def forward_pass(encoder, decoder, x, t, t_in,x_len,criterion,max_t_len,teacher_forcing):
#     """
#     Executes a forward pass through the whole model.

#     :param encoder:
#     :param decoder:
#     :param x: input to the encoder, shape [batch, seq_in_len]
#     :param t: target output predictions for decoder, shape [batch, seq_t_len]
#     :param criterion: loss function
#     :param max_t_len: maximum target length

#     :return: output (after log-softmax), loss, accuracy (per-symbol)
#     """

    
#     # Run encoder and get last hidden state (and output)
#     #print(x)
#     #print(len(x))
#     batch_size = len(x)
#     enc_h = encoder.init_hidden(batch_size)
#     enc_out, enc_h = encoder(x, enc_h,x_len)
#     #print(enc_h.shape,enc_out.shape)
    
#     def create_mask(src):
#         mask = (src != 0)#.permute(1, 0)
#         return mask
    
    
#     mask = create_mask(x)
#     #print(mask.shape)

#     dec_h = enc_h  # Init hidden state of decoder as hidden state of encoder
#     #print(dec_h.shape)
#     dec_input = t_in
#     out,dec_h,_ = decoder(dec_input, dec_h,enc_out,mask, max_t_len, teacher_forcing)
#     #print(dec_h.shape)
#     out = out.permute(0, 2, 1)
#     # Shape: [batch_size x num_classes x out_sequence_len], with second dim containing log probabilities

#     loss = criterion(out, t)
#     pred = get_pred(log_probs=out)
#     accuracy = (pred == t).type(torch.FloatTensor).mean()
#     return out, loss, accuracy

In [None]:
def train(
    encoder,
    decoder,
    inputs,
    targets,
    targets_in,
    criterion,
    enc_optimizer,
    dec_optimizer,
    epoch,
    inputs_len,
):
    encoder.train()
    decoder.train()
    epoch_loss = 0
    for batch_idx, (x, t, t_in, x_len) in enumerate(
        zip(inputs, targets, targets_in, inputs_len)
    ):
        # print(x.shape)
        x = torch.LongTensor(x).to(device)
        t = torch.LongTensor(t).to(device)
        t_in = torch.LongTensor(t_in).to(device)
        x_len = torch.LongTensor(x_len).to(device)

        enc_optimizer.zero_grad()
        dec_optimizer.zero_grad()

        # print(batch_idx)
        #         inputs = inputs.to(device)
        #         targets = targets.long()
        #         targets_in = targets_in.long()
        out, loss, accuracy = forward_pass(
            encoder,
            decoder,
            x,
            t,
            t_in,
            x_len,
            criterion,
            teacher_forcing_ratio=TEACHER_FORCING,
        )

        loss.backward()
        enc_optimizer.step()
        dec_optimizer.step()
        if batch_idx % 200 == 0:
            print(
                "Epoch {} [{}/{} ({:.0f}%)]\tTraining loss: {:.4f} \tTraining accuracy: {:.1f}%".format(
                    epoch,
                    batch_idx * len(x),
                    TRAINING_SIZE * NUM_OF_BATCHES,
                    100.0
                    * batch_idx
                    * len(x)
                    / (TRAINING_SIZE * NUM_OF_BATCHES),
                    loss.item(),
                    100.0 * accuracy.item(),
                )
            )

In [None]:
def test(encoder, decoder, inputs, targets, targets_in, inputs_len, criterion):
    encoder.eval()
    decoder.eval()
    epoch_loss = 0
    with torch.no_grad():
        #         inputs = inputs.to(device)
        #         print(targets)
        #         targets = targets.long()
        #         targets_in = targets_in.long()
        inputs = inputs.view(inputs.shape[1], inputs.shape[2])
        targets = targets.view(targets.shape[1], targets.shape[2])
        targets_in = targets_in.view(targets_in.shape[1], targets_in.shape[2])
        inputs_len = torch.LongTensor(inputs_len[0]).to(device)
        # print(inputs_len)

        out, loss, accuracy = forward_pass(
            encoder,
            decoder,
            inputs,
            targets,
            targets_in,
            inputs_len,
            criterion,
            teacher_forcing_ratio=TEACHER_FORCING,
        )
        # print(out.shape,targets_in.shape)
    return out, loss, accuracy

In [None]:
def numbers_to_text(seq):
    return "".join([str(to_np(i)) if to_np(i) != 10 else "#" for i in seq])


def to_np(x):
    return x.cpu().numpy()


def get_pred(log_probs):
    """
    Get class prediction (digit prediction) from the net's output (the log_probs)
    :param log_probs: Tensor of shape [batch_size x n_classes x sequence_len]
    :return:
    """
    return torch.argmax(log_probs, dim=1)

In [None]:
attn = Attention(NUM_UNITS_ENC)
encoder = EncoderRNN(NUM_INPUTS, HIDDEN_DIM, NUM_UNITS_ENC, DROPOUT).to(device)
decoder = DecoderRNN(NUM_UNITS_DEC, HIDDEN_DIM, NUM_OUTPUTS, DROPOUT, attn).to(
    device
)
enc_optimizer = optim.RMSprop(encoder.parameters(), lr=LEARNING_RATE)
dec_optimizer = optim.RMSprop(decoder.parameters(), lr=LEARNING_RATE)
criterion = nn.CrossEntropyLoss(reduction="mean", ignore_index=0)
# criterion = nn.NLLLoss(weight=None, reduction="mean", ignore_index=0)

# Get training set
(
    inputs,
    _,
    targets_in,
    targets,
    targets_seqlen,
    _,
    text,
    _,
    text_targ,
    inputs_len,
) = generate(
    TRAINING_SIZE, NUM_OF_BATCHES, min_len=MIN_SEQ_LEN, max_len=MAX_SEQ_LEN
)
max_target_len = max(targets_seqlen)
# inputs = torch.tensor(inputs)
# inputs = torch.LongTensor(inputs)
# targets = torch.LongTensor(targets)
# targets_in = torch.LongTensor(targets_in)
unique_text_targets = set([i for x in text_targ for i in x])

# Get validation set
(
    val_inputs,
    _,
    val_targets_in,
    val_targets,
    val_targets_seqlen,
    _,
    val_text_in,
    _,
    val_text_targ,
    val_inputs_len,
) = generate(
    1,
    TEST_SIZE,
    min_len=MIN_SEQ_LEN,
    max_len=MAX_SEQ_LEN,
    invalid_set=unique_text_targets,
)
# val_inputs = torch.tensor(val_inputs)
val_inputs = torch.LongTensor(val_inputs).to(device)
val_targets = torch.LongTensor(val_targets).to(device)
val_targets_in = torch.LongTensor(val_targets_in).to(device)
val_inputs_len = torch.LongTensor(val_inputs_len).to(device)
max_val_target_len = max(val_targets_seqlen)


# Quick and dirty - just loop over training set without reshuffling
for epoch in range(1, EPOCHS + 1):
    train(
        encoder,
        decoder,
        inputs,
        targets,
        targets_in,
        criterion,
        enc_optimizer,
        dec_optimizer,
        epoch,
        inputs_len,
    )
    _, loss, accuracy = test(
        encoder,
        decoder,
        val_inputs,
        val_targets,
        val_targets_in,
        val_inputs_len,
        criterion,
    )
    print(
        "\nTest set: Average loss: {:.4f} \tAccuracy: {:.3f}%\n".format(
            loss, accuracy.item() * 100.0
        )
    )

    # Show examples
    print("Examples: prediction | input")
    out, _, _ = test(
        encoder,
        decoder,
        val_inputs[:10],
        val_targets[:10],
        val_targets_in[:10],
        val_inputs_len[:10],
        criterion,
    )
    pred = get_pred(out)
    pred_text = [numbers_to_text(sample) for sample in pred]

    # print(len(pred_text)) range used to be 10
    for i in range(9):
        print(pred_text[i], "\t", val_text_in[0][i])
    print()