# About

A sketch of the [machine translation tutorial](https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html) from the [PyTorch Tutorials](https://pytorch.org/tutorials/index.html).

Machine translation is the closest analogy to automated document conversion, so this sketch is an attempt to study the model's internals.

## Model

This is a sequence-to-sequence (seq2seq) model consisting of a pair of [recurrent neural networks (RNNs)](https://en.wikipedia.org/wiki/Recurrent_neural_network): an encoder and a decoder.

The encoder reads an input sequence and outputs a single vector, and the decoder reads that vector to produce an output sequence.

![diagram](images/seq2seq.png)

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F

In [2]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

## Encoder

This is the first half of the RNN seq2seq network, aka [Encoder Decoder network](https://arxiv.org/pdf/1406.1078v3.pdf).

For every input (word token, in this example) the encoder outputs a vector and a hidden state, and uses the hidden state for the next input word.

![encoder](images/encoder-network.png)

In [3]:
class EncoderRNN(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(EncoderRNN, self).__init__()
        self.hidden_size = hidden_size
        self.embedding = nn.Embedding(input_size, hidden_size)
        # using a gated recurrent unit (GRU), a lighter-weight alternative
        # to long short-term memory (LSTM); nn.GRU defaults to a single layer
        self.gru = nn.GRU(hidden_size, hidden_size)
        
    def forward(self, input_layer, hidden_layer):
        embedded = self.embedding(input_layer).view(1, 1, -1)
        output_layer, hidden_layer = self.gru(embedded, hidden_layer)
        return output_layer, hidden_layer
    
    def init_hidden(self):
        return torch.zeros(1, 1, self.hidden_size, device=device)
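
To make the step-by-step behaviour concrete, here is a minimal sketch of how an encoder like this might be driven over a sentence, one token at a time. The vocabulary size, hidden size, and token indices are made-up values for illustration, and the embedding and GRU are built inline (rather than via `EncoderRNN`) only so the snippet runs on its own.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

vocab_size = 10   # hypothetical vocabulary size
hidden_size = 8   # hypothetical hidden dimension

# The same two layers the encoder above is built from.
embedding = nn.Embedding(vocab_size, hidden_size)
gru = nn.GRU(hidden_size, hidden_size)

# A made-up "sentence" of four token indices.
tokens = torch.tensor([1, 4, 2, 7])

hidden = torch.zeros(1, 1, hidden_size)  # same shape as init_hidden()
outputs = []
with torch.no_grad():
    for token in tokens:
        # Shape (seq_len=1, batch=1, hidden_size), as the GRU expects.
        embedded = embedding(token).view(1, 1, -1)
        # The hidden state returned here is fed back in on the next token.
        output, hidden = gru(embedded, hidden)
        outputs.append(output[0, 0])

encoder_outputs = torch.stack(outputs)  # one vector per input token
context_vector = hidden                 # final hidden state

print(encoder_outputs.shape)  # torch.Size([4, 8])
print(context_vector.shape)   # torch.Size([1, 1, 8])
```

For a single-layer, unidirectional GRU the last output vector equals the final hidden state, which is why that hidden state can serve as the single vector handed to the decoder.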

## Decoder

This is the second half of the model, another RNN that takes the encoder's output vectors as its input and outputs a sequence of words to create the translation:

![decoder](images/decoder-network.png)

In [4]:
class DecoderRNN(nn.Module):
    def __init__(self, hidden_size, output_size):
        super(DecoderRNN, self).__init__()
        self.hidden_size = hidden_size
        # nn.Embedding needs both the vocabulary size and the embedding dimension
        self.embedding = nn.Embedding(output_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size)
        self.out = nn.Linear(hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)
        
    def forward(self, input_layer, hidden_layer):
        output_layer = F.relu(self.embedding(input_layer).view(1, 1, -1))
        output_layer, hidden_layer = self.gru(output_layer, hidden_layer)
        output_layer = self.softmax(self.out(output_layer[0]))
        return output_layer, hidden_layer
    
    def init_hidden(self):
        return torch.zeros(1, 1, self.hidden_size, device=device)
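
The class above defines a single decoding step but not the loop that produces a whole sentence. One common way to drive it is greedy decoding, sketched below as an assumption rather than as this notebook's method: start from a start-of-sentence token, pick the most likely word at each step, and feed it back in until an end-of-sentence token appears. `SOS_TOKEN`, `EOS_TOKEN`, `MAX_STEPS`, and `decoder_step` are hypothetical names, and the random initial hidden state stands in for the encoder's final hidden state.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

hidden_size, output_size = 8, 10   # hypothetical sizes
SOS_TOKEN, EOS_TOKEN = 0, 1        # hypothetical special-token indices
MAX_STEPS = 5                      # hypothetical cap on output length

# The same three layers the decoder above is built from.
embedding = nn.Embedding(output_size, hidden_size)
gru = nn.GRU(hidden_size, hidden_size)
out = nn.Linear(hidden_size, output_size)

def decoder_step(token, hidden):
    """One decoding step: embed, ReLU, GRU, then log-probabilities over words."""
    x = F.relu(embedding(token).view(1, 1, -1))
    x, hidden = gru(x, hidden)
    log_probs = F.log_softmax(out(x[0]), dim=1)  # shape (1, output_size)
    return log_probs, hidden

hidden = torch.randn(1, 1, hidden_size)  # stand-in for the encoder's final hidden state
token = torch.tensor([SOS_TOKEN])
decoded = []
with torch.no_grad():
    for _ in range(MAX_STEPS):
        log_probs, hidden = decoder_step(token, hidden)
        token = log_probs.argmax(dim=1)  # greedy choice, shape (1,)
        if token.item() == EOS_TOKEN:
            break
        decoded.append(token.item())

print(decoded)  # word indices chosen by the untrained layers
```

With untrained weights the chosen indices are meaningless; the point is only the data flow, in which each step's prediction becomes the next step's input.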

## Attention Decoder

Without attention, the context vector passed between the encoder and decoder carries the burden of encoding the entire sentence.

The attention RNN allows the decoder network to "focus" on a different part of the encoder's outputs for every step of the decoder's own outputs.

![diagram](images/attention-decoder.png)

The attention weights are calculated with another feed-forward layer, `attention`, which takes the decoder's input and hidden state as inputs.
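
As a rough illustration of that calculation, the sketch below mirrors the `nn.Linear(hidden_size * 2, max_length)` layer used in the class further down. The sizes are made up, and the random tensors are stand-ins for the embedded decoder input, the decoder hidden state, and the encoder outputs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

hidden_size, max_length = 8, 6   # hypothetical sizes

# Feed-forward layer that scores each of the max_length encoder positions.
attention = nn.Linear(hidden_size * 2, max_length)

embedded = torch.randn(1, 1, hidden_size)               # stand-in: embedded decoder input
hidden = torch.randn(1, 1, hidden_size)                 # stand-in: decoder hidden state
encoder_outputs = torch.randn(max_length, hidden_size)  # stand-in: encoder outputs

# Concatenate input and hidden state, score every position, normalise to weights.
scores = attention(torch.cat((embedded[0], hidden[0]), dim=1))  # (1, max_length)
attention_weights = F.softmax(scores, dim=1)

# Weighted sum of the encoder outputs, computed as a batched matrix product.
context = torch.bmm(attention_weights.unsqueeze(0),
                    encoder_outputs.unsqueeze(0))               # (1, 1, hidden_size)

print(attention_weights.shape, context.shape)
```

The weights sum to 1 across the `max_length` positions, so `context` is a convex combination of the encoder output vectors.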

Rather than using [bucketing](https://www.kaggle.com/bminixhofer/speed-up-your-rnn-with-sequence-bucketing) for handling inputs of variable length in the training data, this example chooses a maximum sentence length that gets applied to all inputs: sentences of the maximum length will use all the attention weights, while shorter sentences will only use the first few.

![diagram](images/attention-decoder-network.png)
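
The fixed maximum length can be sketched as a zero-filled buffer that the encoder outputs are copied into. This is one plausible reading of the approach, not code taken from this notebook; `sentence_length` and the random tensors are illustrative stand-ins.

```python
import torch

torch.manual_seed(0)

hidden_size, max_length = 8, 6   # hypothetical sizes
sentence_length = 3              # a sentence shorter than max_length

# Stand-in for the per-token vectors an encoder would produce.
per_token_outputs = torch.randn(sentence_length, hidden_size)

# Fixed-size buffer: rows past the end of the sentence stay zero.
encoder_outputs = torch.zeros(max_length, hidden_size)
encoder_outputs[:sentence_length] = per_token_outputs

# Stand-in attention weights over all max_length positions.
attention_weights = torch.softmax(torch.randn(1, max_length), dim=1)

# Weighted sum over the whole buffer...
context = attention_weights @ encoder_outputs
# ...equals the weighted sum over just the rows the sentence filled.
context_from_used_rows = attention_weights[:, :sentence_length] @ per_token_outputs

print(torch.allclose(context, context_from_used_rows))
```

Because the unused rows are zero, the weights assigned to them contribute nothing to the result, which is the sense in which shorter sentences only use the first few attention weights.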

In [5]:
class AttentionDecoderRNN(nn.Module):
    def __init__(self, hidden_size, output_size, dropout, max_length):
        super(AttentionDecoderRNN, self).__init__()
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.max_length = max_length
        self.embedding = nn.Embedding(self.output_size, self.hidden_size)
        self.attention = nn.Linear(self.hidden_size * 2, self.max_length)
        self.attention_combine = nn.Linear(self.hidden_size * 2, self.hidden_size)
        self.dropout = nn.Dropout(dropout)
        self.gru = nn.GRU(self.hidden_size, self.hidden_size)
        self.out = nn.Linear(self.hidden_size, self.output_size)
        
    def forward(self, input_layer, hidden_layer, encoder_output):
        embedded = self.dropout(self.embedding(input_layer).view(1, 1, -1))
        attention_weights = F.softmax(self.attention(torch.cat((embedded[0], hidden_layer[0]), 1)), dim=1)
        attention_applied = torch.bmm(attention_weights.unsqueeze(0), encoder_output.unsqueeze(0))
        output_layer = self.attention_combine(torch.cat((embedded[0], attention_applied[0]), 1)).unsqueeze(0)
        output_layer, hidden_layer = self.gru(F.relu(output_layer), hidden_layer)
        output_layer = F.log_softmax(self.out(output_layer[0]), dim=1)
        return output_layer, hidden_layer, attention_weights
    
    def init_hidden(self):
        return torch.zeros(1, 1, self.hidden_size, device=device)