# Sequence-to-Sequence Neural Networks

We want to build a model that translates a German sentence into English.

(Based on [this](https://github.com/bentrevett/pytorch-seq2seq) very nice tutorial series)

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim

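# Note: Field, BucketIterator and Multi30k.splits belong to the legacy torchtext API
# (moved to torchtext.legacy in 0.9 and removed in 0.12), so this notebook needs an older torchtext.
from torchtext.datasets import TranslationDataset, Multi30k
from torchtext.data import Field, BucketIterator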

import spacy
import numpy as np

import random
import math
import time
from pprint import pprint

! python -m spacy download en
! python -m spacy download de

spacy_de = spacy.load('de')
spacy_en = spacy.load('en')

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')
✔ Linking successful
/Users/YuHsiangLo/anaconda3/lib/python3.7/site-packages/en_core_web_sm -->
/Users/YuHsiangLo/anaconda3/lib/python3.7/site-packages/spacy/data/en
You can now load the model via spacy.load('en')
✔ Download and installation successful
You can now load the model via spacy.load('de_core_news_sm')
✔ Linking successful
/Users/YuHsiangLo/anaconda3/lib/python3.7/site-packages/de_core_news_sm -->
/Users/YuHsiangLo/anaconda3/lib/python3.7/site-packages/spacy/data/de
You can now load the model via spacy.load('de')


In [2]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

cpu


## Data preparation

#### Define `Field`s

In [2]:
def tokenize_de(text):
    '''
    Tokenizes German text from a string into a list of strings (tokens)
    '''
    return [tok.text for tok in spacy_de.tokenizer(text)]

def tokenize_en(text):
    '''
    Tokenizes English text from a string into a list of strings (tokens)
    '''
    return [tok.text for tok in spacy_en.tokenizer(text)]

SRC = Field(tokenize=tokenize_de, 
            init_token='<sos>', 
            eos_token='<eos>', 
            lower=True)

TRG = Field(tokenize=tokenize_en, 
            init_token='<sos>', 
            eos_token='<eos>', 
            lower=True)
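
As a rough illustration of what the two `Field` objects above do to a raw sentence, here is a minimal pure-Python sketch. It assumes whitespace tokenization in place of spaCy, collapses everything into a single step, and the helper name `preprocess_like_field` is hypothetical (it is not part of torchtext).

```python
def preprocess_like_field(text, init_token='<sos>', eos_token='<eos>', lower=True):
    """Approximate a Field: tokenize, lowercase, then wrap in <sos>/<eos>.

    Assumption: str.split() stands in for the spaCy tokenizer, so punctuation
    stays attached to the neighbouring word.
    """
    tokens = text.split()
    if lower:
        tokens = [tok.lower() for tok in tokens]
    return [init_token] + tokens + [eos_token]


print(preprocess_like_field('Guten Morgen'))
# ['<sos>', 'guten', 'morgen', '<eos>']
```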

In [5]:
PAD = '<PAD>'
UNK = '<UNK>'
START = '<SOS>'
END = '<EOS>'

def capital(string):
    # One flag per whitespace-separated word: True if the word starts with an uppercase letter
    return [word[0].isupper() for word in string.split()]

WORD = Field(sequential=True, pad_token=PAD, unk_token=UNK, init_token=START, eos_token=END)
CHAR = Field(sequential=True, tokenize=capital, pad_token=PAD, unk_token=UNK, init_token=START, eos_token=END)

CHAR.preprocess('I do not know What this Is')

[True, False, False, False, True, False, True]

#### Create datasets

In [7]:
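# Note: the source side is loaded with the CHAR field defined above, so each 'src' example
# holds capitalization flags; pass fields=(SRC, TRG) to load tokenized German text instead.
train_data, valid_data, test_data = Multi30k.splits(exts=('.de', '.en'), 
                                                    fields=(CHAR, TRG))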

In [8]:
# Number of examples in each set
print(f'Number of training examples: {len(train_data.examples)}')
print(f'Number of validation examples: {len(valid_data.examples)}')
print(f'Number of testing examples: {len(test_data.examples)}')

Number of training examples: 29000
Number of validation examples: 1014
Number of testing examples: 1000


In [9]:
# Let's see one example
pprint(vars(train_data.examples[0]))

{'src': [True,
         False,
         False,
         True,
         False,
         False,
         True,
         False,
         False,
         True,
         False,
         True],
 'trg': ['two',
         'young',
         ',',
         'white',
         'males',
         'are',
         'outside',
         'near',
         'many',
         'bushes',
         '.']}


#### Build vocab
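- `min_freq` = 2: a token must appear at least twice in the training data to enter the vocabulary; rarer tokens are mapped to `<unk>`.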

In [0]:
SRC.build_vocab(train_data, min_freq=2)
TRG.build_vocab(train_data, min_freq=2)

In [0]:
print(SRC.vocab.stoi)

In [0]:
print(f'Unique tokens in source (de) vocabulary: {len(SRC.vocab)}')
print(f'Unique tokens in target (en) vocabulary: {len(TRG.vocab)}')
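
To make the effect of `min_freq` concrete, here is a small pure-Python sketch of frequency-thresholded vocabulary building. The function `build_vocab`, the toy corpus, and the order of the special tokens are assumptions for illustration; torchtext's own implementation may differ in details.

```python
from collections import Counter


def build_vocab(token_lists, min_freq=2,
                specials=('<unk>', '<pad>', '<sos>', '<eos>')):
    """Return (stoi, itos), keeping only tokens seen at least min_freq times."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    itos = list(specials)
    # Most frequent tokens first; ties broken alphabetically
    for tok, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if freq >= min_freq and tok not in specials:
            itos.append(tok)
    stoi = {tok: i for i, tok in enumerate(itos)}
    return stoi, itos


corpus = [['a', 'dog', 'runs'], ['a', 'cat', 'runs'], ['a', 'bird']]
stoi, itos = build_vocab(corpus)

# Tokens below the threshold fall back to the <unk> index
unk = stoi['<unk>']
ids = [stoi.get(tok, unk) for tok in ['a', 'cat', 'runs']]

print(itos)  # ['<unk>', '<pad>', '<sos>', '<eos>', 'a', 'runs']
print(ids)   # [4, 0, 5]
```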

#### Iterator

In [0]:
BATCH_SIZE = 128

train_iterator, valid_iterator, test_iterator = BucketIterator.splits(
    (train_data, valid_data, test_data), 
    batch_size=BATCH_SIZE, 
    device=device)

In [0]:
ex = next(iter(train_iterator))
print(ex)
print(ex.src)
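
`BucketIterator` groups sentences of similar length so that little padding is needed, and by default each batch tensor has shape (seq_len, batch_size). The padding and layout can be sketched in plain Python as follows; `pad_batch` and the pad index of 1 are assumptions for illustration only.

```python
def pad_batch(sequences, pad_id=1):
    """Pad every sequence to the longest one, then put time on the first axis."""
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # Transpose from (batch_size, seq_len) to (seq_len, batch_size)
    return [list(step) for step in zip(*padded)]


batch = pad_batch([[2, 4, 5, 3], [2, 6, 3]])
print(batch)
# [[2, 2], [4, 6], [5, 3], [3, 1]]
```

Each inner list is one time step across the batch, which is the layout the encoder and decoder below expect.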

## Foundations

- sequence-to-sequence (seq2seq) models
  - Example:
    - Input: `<sos> guten morgen <eos>`
    - Output: `<sos> good morning <eos>`
  - **encoder-decoder** models: RNN to encode + RNN to decode
  - **context vector**: an abstract representation of the entire input sentence.

### Vanilla RNN

![](https://github.com/bentrevett/pytorch-seq2seq/blob/master/assets/seq2seq1.png?raw=1)

- More formally...
    - Input: $X = [x_1, x_2, ..., x_T]$
    - Target: $Y = [y_1, y_2, ..., y_{T'}]$
    - Prediction: $\hat{Y} = [\hat{y}_1, \hat{y}_2, ..., \hat{y}_{T'}]$
    - **Encoder**
    $$h_t = \text{EncoderRNN}(e(x_t), h_{t-1})$$
      - $e(x_t)$: embedding (word $x_t$ $\rightarrow$ vector)
      - $h_t$: hidden state at $t$
      - $z = h_T$: context vector
  
    - **Decoder**
    $$s_t = \text{DecoderRNN}(e'(y_t), s_{t-1})$$
      - $e'(y_t)$: embedding (word $y_t$ $\rightarrow$ vector)
      - $s_t$: hidden state at $t$, with $s_0 = z$ (the decoder starts from the context vector)

    - **Linear layer**
    $$\hat{y}_t = f(s_t)$$
      - $f$: a fully connected layer
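
The recurrences above can be unrolled by hand with `nn.RNNCell`. This is only a sketch under assumed toy sizes and token ids (including the id 2 for `<sos>`); it is not the model built later in this notebook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

vocab_size, emb_dim, hid_dim = 10, 8, 16      # toy sizes (assumptions)

# Encoder: h_t = EncoderRNN(e(x_t), h_{t-1})
enc_embed = nn.Embedding(vocab_size, emb_dim)  # e(.)
enc_cell = nn.RNNCell(emb_dim, hid_dim)

x = torch.tensor([2, 5, 7, 3])                 # one source sentence, T = 4
h = torch.zeros(1, hid_dim)                    # h_0 (batch of 1)
for x_t in x:
    h = enc_cell(enc_embed(x_t).unsqueeze(0), h)
z = h                                          # context vector z = h_T

# Decoder: s_t = DecoderRNN(e'(y_t), s_{t-1}), started from the context vector
dec_embed = nn.Embedding(vocab_size, emb_dim)  # e'(.)
dec_cell = nn.RNNCell(emb_dim, hid_dim)
f = nn.Linear(hid_dim, vocab_size)             # linear layer f

y_t = torch.tensor([2])                        # assumed <sos> id
s = dec_cell(dec_embed(y_t), z)                # one decoder step
y_hat = f(s)                                   # scores over the vocabulary

print(z.shape)      # torch.Size([1, 16])
print(y_hat.shape)  # torch.Size([1, 10])
```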
    
### LSTM (now we have cells!)

(If you want to learn more about LSTM, read [this](https://colah.github.io/posts/2015-08-Understanding-LSTMs/)!)

- **Encoder**: $(h_t, c_t) = \text{EncoderLSTM}(e(x_t), (h_{t-1}, c_{t-1}))$
    
- **Decoder**: $(s_t, c_t) = \text{DecoderLSTM}(e'(y_t), (s_{t-1}, c_{t-1}))$

- **Linear layer**: $\hat{y}_{t} = f(s_t)$

- Can have more than one layer

![](https://github.com/bentrevett/pytorch-seq2seq/blob/master/assets/seq2seq4.png?raw=1)

- Encoder:
    
$$\begin{align*}
(h_t^1, c_t^1) &= \text{EncoderLSTM}^1(e(x_t), (h_{t-1}^1, c_{t-1}^1))\\
(h_t^2, c_t^2) &= \text{EncoderLSTM}^2(h_t^1, (h_{t-1}^2, c_{t-1}^2))
\end{align*}$$

- Decoder:

$$\begin{align*}
(s_t^1, c_t^1) &= \text{DecoderLSTM}^1(e'(y_t), (s_{t-1}^1, c_{t-1}^1))\\
(s_t^2, c_t^2) &= \text{DecoderLSTM}^2(s_t^1, (s_{t-1}^2, c_{t-1}^2))
\end{align*}$$
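
One way to check the two-layer equations is to compare a two-layer `nn.LSTM` against two single-layer LSTMs chained by hand, where the second layer consumes the hidden states of the first. The sizes below are arbitrary assumptions, and dropout is left at its default of 0 so the comparison is exact up to floating-point error.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
emb_dim, hid_dim, seq_len, batch_size = 4, 3, 5, 2   # toy sizes (assumptions)

stacked = nn.LSTM(emb_dim, hid_dim, num_layers=2)    # both layers in one module
layer1 = nn.LSTM(emb_dim, hid_dim)                   # LSTM^1: reads embeddings
layer2 = nn.LSTM(hid_dim, hid_dim)                   # LSTM^2: reads h_t^1

# Copy the stacked module's weights into the two separate layers
sd = stacked.state_dict()
layer1.load_state_dict({k: v for k, v in sd.items() if k.endswith('_l0')})
layer2.load_state_dict({k.replace('_l1', '_l0'): v
                        for k, v in sd.items() if k.endswith('_l1')})

x = torch.randn(seq_len, batch_size, emb_dim)
with torch.no_grad():
    out, (h, c) = stacked(x)
    out1, (h1, c1) = layer1(x)       # (h_t^1, c_t^1) for every t
    out2, (h2, c2) = layer2(out1)    # (h_t^2, c_t^2), fed with h_t^1

same_outputs = torch.allclose(out, out2, atol=1e-5)
same_hidden = torch.allclose(h, torch.cat([h1, h2]), atol=1e-5)
same_cell = torch.allclose(c, torch.cat([c1, c2]), atol=1e-5)
print(same_outputs, same_hidden, same_cell)
```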

## Create Encoder in `PyTorch`

(Let's draw something!)

- `torch.nn.Embedding`
  - Constructor arguments:
    - `num_embeddings`: the size of the dictionary (vocab) of embeddings.
    - `embedding_dim`: the size of each embedding vector.
  - Input $\rightarrow$ output:
    - (src_len, batch_size) $\rightarrow$ (src_len, batch_size, embedding_dim)

- `torch.nn.LSTM`
  - Constructor arguments:
    - `input_size`: the number of features in each input vector (here, the embedding dimension).
    - `hidden_size`: the dimensionality of the **hidden** and **cell** states.
    - `num_layers`: the number of stacked LSTM layers.
    - `dropout`: the dropout probability applied to the outputs of every layer except the last. This is a regularization parameter to reduce overfitting.
    - `bidirectional`: whether to run the LSTM in both directions (default: **False**).
  - Input $\rightarrow$ output, (hidden, cell):
    - Default $h_0$ and $c_0$: zero vectors
    - (src_len, batch_size, embedding_dim) $\rightarrow$ (src_len, batch_size, hidden_size \* n_directions), (num_layers \* n_directions, batch_size, hidden_size), (num_layers \* n_directions, batch_size, hidden_size)
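
The shapes listed above can be confirmed with a quick check on random token ids. All sizes here are arbitrary assumptions chosen for the sketch.

```python
import torch
import torch.nn as nn

src_len, batch_size = 7, 3                       # toy sizes (assumptions)
vocab_size, emb_dim, hid_dim, n_layers = 20, 8, 16, 2

embedding = nn.Embedding(num_embeddings=vocab_size, embedding_dim=emb_dim)
rnn = nn.LSTM(input_size=emb_dim, hidden_size=hid_dim,
              num_layers=n_layers, dropout=0.5)

src = torch.randint(0, vocab_size, (src_len, batch_size))  # random token ids
embedded = embedding(src)
outputs, (hidden, cell) = rnn(embedded)          # h_0 and c_0 default to zeros

print(embedded.shape)  # torch.Size([7, 3, 8])
print(outputs.shape)   # torch.Size([7, 3, 16])
print(hidden.shape)    # torch.Size([2, 3, 16])
print(cell.shape)      # torch.Size([2, 3, 16])

# `outputs` comes from the top layer, so its last time step matches hidden[-1]
top_matches = torch.allclose(outputs[-1], hidden[-1])
print(top_matches)
```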
    

In [0]:
class Encoder(nn.Module):
    def __init__(self, input_vocab_size, emb_dim, hid_dim, n_layers, dropout):
        super().__init__()
        
        self.hid_dim = hid_dim
        self.n_layers = n_layers
        
        self.embedding = nn.Embedding(input_vocab_size, emb_dim)
        
        self.rnn = nn.LSTM(emb_dim, hid_dim, n_layers, dropout=dropout)
        
    def forward(self, src):
        
        #src = (src_len, batch_size)
        
        embedded = self.embedding(src)
        
        #embedded = (src_len, batch_size, emb_dim)
        
        outputs, (hidden, cell) = self.rnn(embedded)
        
        #outputs = (src_len, batch_size, hid_dim * n_directions)
        #hidden = (n_layers * n_directions, batch_size, hid_dim)
        #cell = (n_layers * n_directions, batch_size, hid_dim)
        
        #outputs are always from the top hidden layer
        
        return hidden, cell

## Create Decoder

- `torch.nn.Linear`
    - Constructor arguments:
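        - `in_features`: the size of each input vector (here, `hidden_size`).
        - `out_features`: the size of each output vector (here, `output_vocab_size`).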
    - Input $\rightarrow$ output:
        - (batch_size, hidden_size) $\rightarrow$ (batch_size, output_vocab_size)
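
A minimal sketch of this projection step, assuming toy sizes and a random tensor standing in for the decoder LSTM output at a single time step:

```python
import torch
import torch.nn as nn

batch_size, hid_dim, output_vocab_size = 3, 16, 20   # toy sizes (assumptions)

fc_out = nn.Linear(in_features=hid_dim, out_features=output_vocab_size)

# Stand-in for the decoder LSTM output at one time step: (1, batch_size, hid_dim)
output = torch.randn(1, batch_size, hid_dim)

logits = fc_out(output.squeeze(0))   # drop the length-1 time axis, then project
top1 = logits.argmax(1)              # highest-scoring token id per sentence

print(logits.shape)  # torch.Size([3, 20])
print(top1.shape)    # torch.Size([3])
```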

In [0]:
class Decoder(nn.Module):
    def __init__(self, output_vocab_size, emb_dim, hid_dim, n_layers, dropout):
        super().__init__()
        
        self.hid_dim = hid_dim
        self.n_layers = n_layers
        self.output_vocab_size = output_vocab_size
        
        self.embedding = nn.Embedding(output_vocab_size, emb_dim)
        
        self.rnn = nn.LSTM(emb_dim, hid_dim, n_layers, dropout=dropout)
        
        self.fc_out = nn.Linear(hid_dim, output_vocab_size)
                
    def forward(self, input, hidden, cell):
        
        #input = (batch size)
        #hidden = (n_layers * n_directions, batch_size, hid_dim)
        #cell = (n_layers * n_directions, batch_size, hid_dim)
        
        #n_directions in the decoder is always 1, therefore:
        #hidden = (n_layers, batch_size, hid_dim)
        #cell = (n_layers, batch_size, hid_dim)
        
        input = input.unsqueeze(0)
        
        #input = (1, batch_size)
        
        embedded = self.embedding(input)
        
        #embedded = (1, batch_size, emb_dim)
                
        output, (hidden, cell) = self.rnn(embedded, (hidden, cell))
        
        #output = (seq_len, batch_size, hid_dim * n_directions)
        #hidden = (n_layers * n_directions, batch_size, hid_dim)
        #cell = (n_layers * n_directions, batch_size, hid_dim)
        
        #seq len and n directions will always be 1 in the decoder, therefore:
        #output = (1, batch_size, hid_dim)
        #hidden = (n_layers, batch_size, hid_dim)
        #cell = (n_layers, batch_size, hid_dim)
        
        prediction = self.fc_out(output.squeeze(0))
        
        #prediction = (batch_size, output_vocab_size)
        
        return prediction, hidden, cell

## Create Seq2Seq

**Note**: our decoder loop starts at 1, not 0. This means the 0th element of our `outputs` tensor remains all zeros. So our `trg` and `outputs` look something like:

$$\begin{align*}
\text{trg} = [\texttt{<sos>}, &y_1, y_2, y_3, \texttt{<eos>}]\\
\text{outputs} = [0, &\hat{y}_1, \hat{y}_2, \hat{y}_3, \texttt{<eos>}]
\end{align*}$$

Later on when we calculate the loss, we cut off the first element of each tensor to get:

$$\begin{align*}
\text{trg} = [&y_1, y_2, y_3, \texttt{<eos>}]\\
\text{outputs} = [&\hat{y}_1, \hat{y}_2, \hat{y}_3, \texttt{<eos>}]
\end{align*}$$
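
The slicing described above could look like the following when computing a cross-entropy loss. This is a sketch with random stand-in tensors; `PAD_IDX = 1` and the use of `ignore_index` are assumptions, not something this notebook has defined.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
trg_len, batch_size, vocab_size = 5, 2, 11   # toy sizes (assumptions)
PAD_IDX = 1                                  # assumed index of the padding token

# Stand-ins for the model output and the target batch
outputs = torch.randn(trg_len, batch_size, vocab_size)
outputs[0] = 0                               # position 0 is never filled by the decoder loop
trg = torch.randint(2, vocab_size, (trg_len, batch_size))

criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX)

# Drop the first time step of both tensors, then flatten time and batch together
flat_outputs = outputs[1:].reshape(-1, vocab_size)   # ((trg_len - 1) * batch_size, vocab_size)
flat_trg = trg[1:].reshape(-1)                       # ((trg_len - 1) * batch_size,)

loss = criterion(flat_outputs, flat_trg)
print(flat_outputs.shape)  # torch.Size([8, 11])
print(flat_trg.shape)      # torch.Size([8])
```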

In [0]:
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device):
        super().__init__()
        
        self.encoder = encoder
        self.decoder = decoder
        self.device = device
        
        assert encoder.hid_dim == decoder.hid_dim, \
            'Hidden dimensions of encoder and decoder must be equal!'
        assert encoder.n_layers == decoder.n_layers, \
            'Encoder and decoder must have equal number of layers!'
        
    def forward(self, src, trg):
        
        #src = (src len, batch size)
        #trg = (trg len, batch size)
        
        batch_size = trg.shape[1]
        trg_len = trg.shape[0]
        trg_vocab_size = self.decoder.output_vocab_size
        
        #tensor to store decoder outputs
        outputs = torch.zeros(trg_len, batch_size, trg_vocab_size).to(self.device)
        
        #the encoder's final hidden and cell states initialize the decoder's hidden and cell states
        hidden, cell = self.encoder(src)
        
        #first input to the decoder is the <sos> tokens
        input = trg[0,:]
        
        for t in range(1, trg_len):
            
            #insert input token embedding, previous hidden and previous cell states
            #receive output tensor (predictions) and new hidden and cell states
            output, hidden, cell = self.decoder(input, hidden, cell)
            
            #place predictions in a tensor holding predictions for each token
            outputs[t] = output
            
            #get the token with the highest predicted score
            top1 = output.argmax(1) 
            
            #use predicted token as next input
            input = top1
        
        return outputs

Let's see if our model works...

In [0]:
INPUT_VOCAB_SIZE = len(SRC.vocab)
OUTPUT_VOCAB_SIZE = len(TRG.vocab)
ENC_EMB_DIM = 256
DEC_EMB_DIM = 256
HID_DIM = 512
N_LAYERS = 2
ENC_DROPOUT = 0.5
DEC_DROPOUT = 0.5

enc = Encoder(INPUT_VOCAB_SIZE, ENC_EMB_DIM, HID_DIM, N_LAYERS, ENC_DROPOUT)
dec = Decoder(OUTPUT_VOCAB_SIZE, DEC_EMB_DIM, HID_DIM, N_LAYERS, DEC_DROPOUT)

model = Seq2Seq(enc, dec, device).to(device)

In [0]:
out = model(ex.src, ex.trg)
print(out.size())