## Seq2Seq Model - Neural Machine Translation
- https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html
- https://github.com/keras-team/keras/blob/master/examples/lstm_seq2seq.py
- https://machinetalk.org/2019/03/29/neural-machine-translation-with-attention-mechanism/

<img width="60%" src="https://machinetalk.org/wp-content/uploads/2019/03/attention.gif" class="img-responsive wp-post-image" alt="">

### What is <b>sequence-to-sequence learning</b>?

Sequence-to-sequence learning (Seq2Seq) is about training models to convert sequences from one domain (e.g. sentences in English) to sequences in another domain (e.g. the same sentences translated to French).

"the cat sat on the mat" -> <b>[Seq2Seq model] </b>-> "le chat etait assis sur le tapis"

This can be used for machine translation or for free-form question answering (generating a natural language answer given a natural language question). In general, it is applicable any time you need to generate text.

### The general case: canonical sequence-to-sequence

In the general case, input sequences and output sequences have different lengths (e.g. machine translation) and the entire input sequence is required in order to start predicting the target. This requires a more advanced setup, which is what people commonly refer to when mentioning "sequence to sequence models" with no further context. Here's how it works:

<img alt="Seq2seq inference" src="https://machinetalk.org/wp-content/uploads/2019/04/input.png" width="60%">


- An RNN layer (or stack thereof) acts as <b>"encoder"</b>: it processes the input sequence and returns its own internal state. The encoder, which is on the left-hand side, requires only sequences from the source language as inputs. Note that we discard the outputs of the encoder RNN, only recovering the state. This state will serve as the "context", or "conditioning", of the decoder in the next step.



- Another RNN layer (or stack thereof) acts as <b>"decoder"</b>: it is trained to predict the next word of the target sequence, given previous words of the target sequence. 
    - Specifically, it is trained to `turn the target sequences into the same sequences but offset by one timestep in the future`, a training process called `teacher forcing` in this context. 
    - Importantly, `the decoder uses as initial state the state vectors from the encoder`, which is how the decoder obtains information about what it is supposed to generate.
    - Effectively, the decoder learns to generate target at $t+1$ given target at $t$, conditioned on the input sequence.
    - The same process can also be used to train a Seq2Seq network `without teacher forcing`, i.e. by reinjecting the decoder's predictions into the decoder.
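
The difference between the two training modes described above comes down to which token the decoder receives at each timestep. The following is a minimal sketch in plain Python; the function `decoder_inputs`, the `START` marker, and the sample sentences are made up for illustration and are not part of the model built later.

```python
START = "<start>"  # hypothetical start-of-sequence marker

def decoder_inputs(target, predictions, teacher_forcing=True):
    """Return the token the decoder receives at each timestep.

    target:      ground-truth target tokens
    predictions: tokens the decoder itself produced at each timestep
    """
    previous = target if teacher_forcing else predictions
    # The input at step t is the token from step t-1; step 0 receives START.
    return [START] + list(previous[:-1])

target = ["le", "chat", "dort", "<end>"]
predictions = ["le", "chien", "dort", "<end>"]  # made-up model output with one mistake

forced = decoder_inputs(target, predictions, teacher_forcing=True)
free_running = decoder_inputs(target, predictions, teacher_forcing=False)

for step, (f, r, t) in enumerate(zip(forced, free_running, target)):
    print(f"step {step}: teacher-forced input={f!r}, reinjected input={r!r}, target={t!r}")
```

With teacher forcing the decoder always sees the correct previous word, so a mistake at one step (here `chien`) does not propagate into the next step's input.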

## Neural Translation Machine

Let's illustrate these ideas with actual code.

For our example implementation, we will use a dataset of pairs of English sentences and their French translation, which you can download from <a href="http://www.manythings.org/anki/">manythings.org/anki</a>. The file to download is called `fra-eng.zip` (English/French). We will implement a word-level sequence-to-sequence model, processing the input word-by-word and generating the output word-by-word. 

Here's a summary of our process:

- 1) Turn the sentences into 3 Numpy arrays, `encoder_input_data`, `decoder_input_data`, `decoder_target_data`:
    - `encoder_input_data` is a 3D array of shape (`num_pairs`, `max_english_sentence_length`, `num_english_characters`) containing a one-hot vectorization of the English sentences.
    - `decoder_input_data` is a 3D array of shape (`num_pairs`, `max_french_sentence_length`, `num_french_characters`) containing a one-hot vectorization of the French sentences.
    - `decoder_target_data` is the same as decoder_input_data but offset by one timestep. `decoder_target_data[:, t, :]` will be the same as `decoder_input_data[:, t + 1, :]`.
- 2) Train a basic LSTM-based Seq2Seq model to predict `decoder_target_data` given `encoder_input_data` and `decoder_input_data`. Our model uses `teacher forcing`.
- 3) Decode some sentences to check if the model is working (i.e. turn samples from `encoder_input_data` into corresponding samples from `decoder_target_data`).
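
To make the array shapes in step 1 concrete, here is a small NumPy sketch of the one-hot layout. It is only an illustration under assumptions: the two sentence pairs are toy data, and the tab/newline characters used as start and end markers are a choice made for this sketch.

```python
import numpy as np

pairs = [("go", "va"), ("hi", "salut")]   # toy English/French pairs
START, END = "\t", "\n"                   # assumed start/end characters

inputs = [en for en, _ in pairs]
targets = [START + fr + END for _, fr in pairs]

en_index = {c: i for i, c in enumerate(sorted(set("".join(inputs))))}
fr_index = {c: i for i, c in enumerate(sorted(set("".join(targets))))}
max_en_len = max(len(s) for s in inputs)
max_fr_len = max(len(s) for s in targets)

encoder_input_data = np.zeros((len(pairs), max_en_len, len(en_index)), dtype="float32")
decoder_input_data = np.zeros((len(pairs), max_fr_len, len(fr_index)), dtype="float32")
decoder_target_data = np.zeros_like(decoder_input_data)

for i, (src, tgt) in enumerate(zip(inputs, targets)):
    for t, ch in enumerate(src):
        encoder_input_data[i, t, en_index[ch]] = 1.0
    for t, ch in enumerate(tgt):
        decoder_input_data[i, t, fr_index[ch]] = 1.0
        if t > 0:
            # The target is one step ahead of the input and omits the start character.
            decoder_target_data[i, t - 1, fr_index[ch]] = 1.0

print(encoder_input_data.shape, decoder_input_data.shape, decoder_target_data.shape)
# The offset relation stated above:
print(np.array_equal(decoder_target_data[:, :-1, :], decoder_input_data[:, 1:, :]))
```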

Because the `training process` and `inference process` (decoding sentences) are quite different, `we use different models for both, albeit they all leverage the same inner layers`.

Note that the encoder and decoder are connected by RNN states:

- `encoder states`: these store the final states of the encoder.
- `initial_state of decoder`: we pass the encoder states to the decoder as its initial states.


In [None]:
!pip install torchinfo

Collecting torchinfo
  Downloading torchinfo-1.5.3-py3-none-any.whl (19 kB)
Installing collected packages: torchinfo
Successfully installed torchinfo-1.5.3


In [None]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

import numpy as np
import unicodedata
import re
from tensorflow import keras
import pandas as pd
from sklearn.utils import shuffle

### (1) Data Preprocessing

1) Turn the sentences into 3 Numpy arrays, `data_en`, `data_fr_in`, `data_fr_out`:
- `data_en`: a 2D array of shape (`num_samples`, `max_en_words_per_sentence`) containing the tokenized English sentences after preprocessing.
- `data_fr_in`: a 2D array of shape (`num_samples`, `max_fr_words_per_sentence`) containing the tokenized French sentences after preprocessing, each starting with `<start>`.
- `data_fr_out`: the same as `data_fr_in` but offset by one timestep and ending with `<end>`, i.e. `data_fr_out[:, t]` equals `data_fr_in[:, t+1]` at every position before `<end>`.

Note that this is a demo version of the NTM: we train the model on a small dataset. To make the model work realistically, you need to train it on a larger collection of training samples.


We'll use Keras for simple data preprocessing
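
To show what this preprocessing produces, here is a plain-Python sketch of word indexing and post-padding. The helpers `build_word_index` and `encode_and_pad` are hypothetical stand-ins written for this illustration; they only approximate the Keras `Tokenizer` and `pad_sequences` behaviour (most frequent word gets index 1, index 0 is reserved for padding).

```python
from collections import Counter

def build_word_index(sentences):
    """Map each word to an integer; the most frequent word gets index 1."""
    counts = Counter(word for s in sentences for word in s.lower().split())
    # most_common() sorts by count; words with equal counts keep first-seen order
    return {word: i + 1 for i, (word, _) in enumerate(counts.most_common())}

def encode_and_pad(sentence, word_index, max_len):
    """Turn a sentence into a fixed-length list of word ids, zero-padded on the right."""
    ids = [word_index[w] for w in sentence.lower().split() if w in word_index]
    ids = ids[:max_len]                      # keep the first max_len ids (a choice made for this sketch)
    return ids + [0] * (max_len - len(ids))  # 0 marks padding

sentences = ["i am here .", "i am cold .", "run !"]
word_index = build_word_index(sentences)
encoded = [encode_and_pad(s, word_index, max_len=6) for s in sentences]

print(word_index)
print(encoded)
```

One difference to keep in mind: Keras `pad_sequences` truncates from the front by default (`truncating='pre'`), so with that default a row longer than `maxlen` loses its first tokens rather than its last.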

In [None]:
from google.colab import drive
drive.mount('/content/drive')
path_to_glove_file = "drive/My Drive/TA 667/fra-eng/fra.txt"

Mounted at /content/drive


In [None]:
# Read data
#text = pd.read_csv('fra.txt', sep="\t", header=None, usecols=[0,1])
text = pd.read_csv(path_to_glove_file, sep="\t", header=None, usecols=[0,1])
text.columns =['en','fr']
text.head()
len(text)

# Take a small set to save training time
text = shuffle(text)
raw_data=text.iloc[0:10000]

| | en | fr |
|---|---|---|
| 0 | Go. | Va ! |
| 1 | Hi. | Salut ! |
| 2 | Hi. | Salut. |
| 3 | Run! | Cours ! |
| 4 | Run! | Courez ! |


175623

In [None]:
# clean up text

def unicode_to_ascii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn')


def normalize_string(s):
    s = unicode_to_ascii(s)
    s = re.sub(r'([!.?])', r' \1', s)
    s = re.sub(r'[^a-zA-Z.!?]+', r' ', s)
    s = re.sub(r'\s+', r' ', s)
    return s
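
As a quick check of what this cleanup does, the sketch below repeats the two functions so that it runs on its own and applies them to a few sample strings chosen for illustration. Note that the functions do not lowercase the text, and that digits, hyphens and apostrophes are replaced by spaces.

```python
import re
import unicodedata

def unicode_to_ascii(s):
    # Decompose accented characters, then drop the combining marks
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')

def normalize_string(s):
    s = unicode_to_ascii(s)
    s = re.sub(r'([!.?])', r' \1', s)      # put a space before sentence punctuation
    s = re.sub(r'[^a-zA-Z.!?]+', r' ', s)  # replace every other non-letter run with a space
    s = re.sub(r'\s+', r' ', s)            # collapse repeated whitespace
    return s

examples = ["Go.", "Où est-il ?", "I'm 20."]
cleaned = [normalize_string(s) for s in examples]
print(cleaned)  # ['Go .', 'Ou est il ?', 'I m .']
```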

In [None]:
# clean up text
raw_data_en = [normalize_string(data) for data in raw_data["en"]]

# add special token <start>/<end> to indicate the beginning and end of a sentence
raw_data_fr_in = ['<start> ' + normalize_string(data) for data in raw_data["fr"]]
raw_data_fr_out = [normalize_string(data) + ' <end>' for data in raw_data["fr"]]

In [None]:
# Tokenize each sentence and index each word
max_en_words = 5000
max_en_len = 10
en_tokenizer = keras.preprocessing.text.Tokenizer(filters='', \
                                                  num_words=max_en_words )
en_tokenizer.fit_on_texts(raw_data_en)
print("Total number of English words: ", len(en_tokenizer.word_index))

Total number of English words:  4511


In [None]:
data_en = en_tokenizer.texts_to_sequences(raw_data_en)
data_en = keras.preprocessing.sequence.pad_sequences(data_en,\
                                                     maxlen=max_en_len, \
                                                     padding='post')
# print a sample sentence after preprocessing
print(data_en[:3])

[[  20    8   44   10   13    7   61   81  190    1]
 [   2  826  437  647    1    0    0    0    0    0]
 [  14  351    4 2393   43 1172    1    0    0    0]]


In [None]:
# Process French sentences in the same way

max_fr_words = 5000
max_fr_len = 10
fr_tokenizer = keras.preprocessing.text.Tokenizer(filters='', num_words = max_fr_words)

# ATTENTION: always finish with fit_on_texts before moving on
fr_tokenizer.fit_on_texts(raw_data_fr_in)
fr_tokenizer.fit_on_texts(raw_data_fr_out)
print("Total number of French words: ", len(fr_tokenizer.word_index))

data_fr_in = fr_tokenizer.texts_to_sequences(raw_data_fr_in)
data_fr_in = keras.preprocessing.sequence.pad_sequences(data_fr_in,\
                                                        maxlen=max_fr_len, \
                                                        padding='post')

data_fr_out = fr_tokenizer.texts_to_sequences(raw_data_fr_out)
data_fr_out = keras.preprocessing.sequence.pad_sequences(data_fr_out,\
                                                            maxlen=max_fr_len, \
                                                            padding='post')
# print a sample sentence after preprocessing
data_fr_in[:3]

Total number of French words:  6299


array([[  66,    7,    7,   39,  114,   26,   77,  181,  182,    1],
       [   2,   17,   97,  170,   48,  815,    1,    0,    0,    0],
       [   2,   12,    5,  360,    6, 1566,   95,  361,    1,    0]],
      dtype=int32)

In [None]:
# Create the reverse mapping from indexes to words
reverse_fr_word_index ={fr_tokenizer.word_index[w] : w \
                        for w in  fr_tokenizer.word_index}
print("index of symbol <start> :", fr_tokenizer.word_index["<start>"])
print("index of symbol <end> :", fr_tokenizer.word_index["<end>"])

index of symbol <start> : 2
index of symbol <end> : 3


### (2) Define "Teacher Forcing" Model for Training Process

2) Train a basic LSTM-based Seq2Seq model to predict `decoder_outputs` given `encoder_inputs` and `decoder_inputs`. Our model uses `teacher forcing`.


And here is how the data’s shape changes at each layer. Keeping track of the data’s shape is often extremely helpful for avoiding silly mistakes, just like stacking up Lego pieces.

Here we start with a simple encoder: a single-layer, unidirectional LSTM.

Task for you:
`Can you modify the model to support multiple layers and a bidirectional LSTM?`
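
One possible direction for this task is sketched below. It is not the only answer: the class name `BiEncoderLSTM`, the `num_layers` argument, and the choice to concatenate the forward and backward final states are assumptions made for this illustration.

```python
import torch
from torch import nn

class BiEncoderLSTM(nn.Module):
    """Hypothetical multi-layer, bidirectional LSTM encoder."""

    def __init__(self, vocab_size, embedding_size, hidden_size, num_layers=2):
        super().__init__()
        self.num_layers = num_layers
        self.hidden_size = hidden_size
        self.embedding = nn.Embedding(vocab_size, embedding_size, padding_idx=0)
        self.lstm = nn.LSTM(input_size=embedding_size, hidden_size=hidden_size,
                            num_layers=num_layers, bidirectional=True,
                            batch_first=True)

    def _merge_directions(self, state):
        # state: [num_layers * 2, batch, hidden] -> [num_layers, batch, 2 * hidden]
        batch = state.size(1)
        state = state.reshape(self.num_layers, 2, batch, self.hidden_size)
        return torch.cat([state[:, 0], state[:, 1]], dim=-1)

    def forward(self, x):
        x = self.embedding(x)            # [batch, seq_len, embedding_size]
        outputs, (h, c) = self.lstm(x)   # outputs: [batch, seq_len, 2 * hidden]
        return outputs, self._merge_directions(h), self._merge_directions(c)

encoder = BiEncoderLSTM(vocab_size=5000, embedding_size=100, hidden_size=100, num_layers=2)
tokens = torch.randint(1, 5000, (32, 10))   # random word ids, batch of 32
outputs, h, c = encoder(tokens)
print(outputs.shape, h.shape, c.shape)
```

With this design, a unidirectional decoder that consumes these states would need `hidden_size=200` and the same number of layers, since each merged state holds both directions.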

In [None]:
import torch
from torch import nn
from torch.utils.data import Dataset, DataLoader, random_split

import torch.optim as optim

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [None]:
max_en_words = 5000
max_en_len = 10
max_fr_len = 10
max_fr_words = 5000
latent_dim = 100   #i.e. rnn_size
batch_size = 32

#### Encoder ####
<img width="60%" alt="Encoder data shapes" src="https://machinetalk.org/wp-content/uploads/2019/03/data_shapes-1.png">


In [None]:
# vocab_size: the total number of words in vocabulary
# embedding_size: word embedding dimension
# hidden_size: RNN hidden state dimension (set to latent_dim below)

class EncoderLSTM(nn.Module):
    
  def __init__(self, vocab_size, embedding_size, hidden_size):
    
    super(EncoderLSTM, self).__init__()
    
    self.vocab_size = vocab_size

    self.embedding_size = embedding_size
    
    self.hidden_size = hidden_size

    self.embedding = nn.Embedding(self.vocab_size, self.embedding_size,padding_idx=0)
    
    # The input to LSTM should be [batch_size, seq_length, embedding_size]
    self.LSTM = nn.LSTM(input_size = self.embedding_size, \
                        hidden_size = self.hidden_size,
                        batch_first = True)

  
  def forward(self, x):
    
    # the shape of x is [batch_size, seq_length]
    x = self.embedding(x)  
    # after embedding, the shape of x is: [batch_size, seq_length, embedding_size]
    
    # We don't care about output. We only need states
    outputs, (hidden_state, cell_state) = self.LSTM(x)
    # hidden_state shape: [1, batch_size, hidden_size]
    # cell_state shape: [1, batch_size, hidden_size]
    
    #print(hidden_state.shape)
    #print(cell_state.shape)
    
    return outputs, hidden_state, cell_state

In [None]:
encoder = EncoderLSTM(vocab_size=max_en_words,
                      embedding_size=latent_dim,
                      hidden_size=latent_dim)

from torchinfo import summary 
summary(encoder,input_size=(batch_size,max_en_len),
       dtypes=[torch.long])

Layer (type:depth-idx)                   Output Shape              Param #
EncoderLSTM                              --                        --
├─Embedding: 1-1                         [32, 10, 100]             500,000
├─LSTM: 1-2                              [32, 10, 100]             80,800
Total params: 580,800
Trainable params: 580,800
Non-trainable params: 0
Total mult-adds (M): 41.86
Input size (MB): 0.00
Forward/backward pass size (MB): 0.51
Params size (MB): 2.32
Estimated Total Size (MB): 2.84

#### Decoder ####
<img width="60%" alt="Decoder data shapes" src="https://machinetalk.org/wp-content/uploads/2019/03/data_shapes-2.png">



In [None]:
# vocab_size: the total number of words in vocabulary
# embedding_size: word embedding dimension
# hidden_size: RNN hidden state dimension (set to latent_dim below)

class DecoderLSTM(nn.Module):
    
  def __init__(self, vocab_size, embedding_size, hidden_size):
    
    super(DecoderLSTM, self).__init__()
    
    self.vocab_size = vocab_size

    self.embedding_size = embedding_size
    
    self.hidden_size = hidden_size

    self.embedding = nn.Embedding(self.vocab_size, self.embedding_size,padding_idx=0)
    
    # The input to LSTM should be [batch_size, seq_length, embedding_size]
    self.LSTM = nn.LSTM(input_size = self.embedding_size, \
                        hidden_size = self.hidden_size,
                        batch_first = True)

    self.dense = nn.Linear(in_features = self.hidden_size, 
                        out_features = self.vocab_size)


  def forward(self, x, hidden_state, cell_state):
    
    
    x = self.embedding(x)

    # LSTM will be initialized with encoder states
    outputs, (h, c) = self.LSTM(x, (hidden_state, cell_state))
    # outputs shape is [batch, seq_length, hidden_size]
    
    predictions = self.dense(outputs)
    # predictions shape is [batch, seq_length, vocab_size]

    return predictions, h, c

In [None]:
decoder = DecoderLSTM(vocab_size=max_fr_words,\
                      embedding_size=latent_dim,\
                      hidden_size=latent_dim)

summary(decoder,input_size=[(batch_size,max_fr_len),\
                            (1,batch_size,latent_dim),\
                            (1,batch_size,latent_dim)],\
        dtypes=[torch.long,torch.float,torch.float])

Layer (type:depth-idx)                   Output Shape              Param #
DecoderLSTM                              --                        --
├─Embedding: 1-1                         [32, 10, 100]             500,000
├─LSTM: 1-2                              [32, 10, 100]             80,800
├─Linear: 1-3                            [32, 10, 5000]            505,000
Total params: 1,085,800
Trainable params: 1,085,800
Non-trainable params: 0
Total mult-adds (M): 58.02
Input size (MB): 0.03
Forward/backward pass size (MB): 13.31
Params size (MB): 4.34
Estimated Total Size (MB): 17.68

#### Connect Encoder and Decoder to Create Seq2Seq Model ####

In [None]:
class Seq2Seq(nn.Module):
    
  def __init__(self, Encoder_LSTM, Decoder_LSTM):
    
    super(Seq2Seq, self).__init__()
    
    self.Encoder = Encoder_LSTM
    self.Decoder = Decoder_LSTM

  def forward(self, encoder_input, decoder_input):
    
    _, hidden_state, cell_state = self.Encoder(encoder_input)
    
    predictions, _, _ = self.Decoder(decoder_input, hidden_state, cell_state)
        
    return predictions


In [None]:
model = Seq2Seq(encoder, decoder)

#### Create Dataset and Training Function ####

Now we can wrap the data in a `Dataset` and train the model. Note the matching between tensors and variables:
- `encoder_inputs` <-> `data_en`
- `decoder_inputs` <-> `data_fr_in`
- `decoder_outputs` <-> `data_fr_out`

In [None]:
class NTM_dataset(Dataset):
    
    def __init__(self,data_en, data_fr_in,data_fr_out):
        
        self.length = len(data_en)
        
        self.encoder_input = torch.IntTensor(data_en)
        self.decoder_input = torch.IntTensor(data_fr_in)
        
        # for CrossEntropyLoss, decoder_output must have Long data type
        self.decoder_output = torch.LongTensor(data_fr_out)
    
    def __getitem__(self, index):
        return self.encoder_input[index], \
               self.decoder_input[index],\
               self.decoder_output[index]
    
    def __len__(self):
        return self.length 

In [None]:
dataset = NTM_dataset(data_en, data_fr_in, data_fr_out)

test_size = int(len(data_en) * 0.2)
train_size = len(data_en) - test_size

train_dataset, test_dataset = torch.utils.data.random_split(dataset, \
                                                            [train_size, test_size])


In [None]:
len(train_dataset)
len(test_dataset)

8000

2000

In [None]:
# Define a function to train the model 
def train_model(model, train_dataset, test_dataset, device, lr=0.0005, epochs=20, batch_size=32):
    
    # construct dataloader
    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=True)

    # move model to device
    model = model.to(device)

    # history
    history = {'train_loss': [],
               'train_acc': [],
               'test_loss': [],
               'test_acc': []}
    # setup loss function and optimizer
    optimizer = torch.optim.RMSprop(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()

    # training loop
    print('Training Start')
    for epoch in range(epochs):
        model.train()
        train_loss = 0
        train_acc = 0
        test_loss = 0
        test_acc = 0

        for encoder_input, decoder_input, decoder_output in train_loader:
            
            # move data to device
            encoder_input = encoder_input.to(device)
            decoder_input = decoder_input.to(device)
            decoder_output = decoder_output.to(device)
            
            # forward
            outputs = model(encoder_input, decoder_input)  # batch_size, max_fr_len (i.e. seq_len), fr_vocab_size
            
            _, pred = torch.max(outputs, dim = -1)
            
            # flatten outputs to [batch_size * seq_len, fr_vocab_size] since the loss is computed on 2-dimensional input
            cur_train_loss = criterion(outputs.view(-1, outputs.size(-1)), decoder_output.view(-1))
            
            # flatten pred and decoder_output to compute per-word accuracy (padding positions are counted too)
            cur_train_acc = (pred.view(-1) == decoder_output.view(-1)) 
            cur_train_acc = cur_train_acc.sum().item()/len(cur_train_acc)
                
            # backward
            cur_train_loss.backward()
            optimizer.step()
            optimizer.zero_grad()
            # loss and acc
            train_loss += cur_train_loss.detach()  # detach so the running sum does not keep each batch's graph alive
            train_acc += cur_train_acc

        # test start
        model.eval()
        with torch.no_grad():
            
            for encoder_input, decoder_input, decoder_output in test_loader:
            
                # move data to device
                encoder_input = encoder_input.to(device)
                decoder_input = decoder_input.to(device)
                decoder_output = decoder_output.to(device)
            
                # forward
                outputs = model(encoder_input, decoder_input)  # batch_size, max_fr_len (i.e. seq_len), fr_vocab_size
            
                _, pred = torch.max(outputs, dim = -1)
            
                # flatten outputs to [batch_size * seq_len, fr_vocab_size] since the loss is computed on 2-dimensional input
                cur_test_loss = criterion(outputs.view(-1, outputs.size(-1)), decoder_output.view(-1))
                
                cur_test_acc = (pred.view(-1) == decoder_output.view(-1)) 
                cur_test_acc = cur_test_acc.sum().item()/len(cur_test_acc)
                
                # loss and acc
                test_loss += cur_test_loss
                test_acc += cur_test_acc

        # epoch output
        train_loss = (train_loss/len(train_loader)).item()
        train_acc = train_acc/len(train_loader)
        val_loss = (test_loss/len(test_loader)).item()
        val_acc = test_acc/len(test_loader)
        history['train_loss'].append(train_loss)
        history['train_acc'].append(train_acc)
        history['test_loss'].append(val_loss)
        history['test_acc'].append(val_acc)
        print(f"Epoch:{epoch + 1} / {epochs}, train loss:{train_loss:.4f} train_acc:{train_acc:.4f}, valid loss:{val_loss:.4f} valid acc:{val_acc:.4f}")
    
    return history
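
The loss and accuracy in the function above are computed over every position, including the zero-padded ones, which tends to inflate the accuracy on short sentences. A possible variant that skips padding is sketched here; the helper `masked_loss_and_accuracy` and the hand-made tensors are illustrative assumptions, while `ignore_index` is a standard argument of `nn.CrossEntropyLoss`.

```python
import torch
from torch import nn

PAD_ID = 0  # index used for padding by the tokenizer

def masked_loss_and_accuracy(logits, targets, pad_id=PAD_ID):
    """logits: [batch, seq_len, vocab]; targets: [batch, seq_len] of word ids."""
    criterion = nn.CrossEntropyLoss(ignore_index=pad_id)  # padded targets add no loss
    loss = criterion(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))

    pred = logits.argmax(dim=-1)
    mask = targets != pad_id                 # True on real tokens only
    correct = (pred == targets) & mask
    accuracy = correct.sum().item() / mask.sum().item()
    return loss, accuracy

# Hand-made example: vocabulary of 4 ids, one sentence with 2 real tokens and 2 pads
targets = torch.tensor([[2, 3, 0, 0]])
logits = torch.zeros(1, 4, 4)
logits[0, 0, 2] = 5.0   # correct prediction
logits[0, 1, 1] = 5.0   # wrong prediction (target is 3)
logits[0, 2, 0] = 5.0   # padding position
logits[0, 3, 0] = 5.0   # padding position

loss, acc = masked_loss_and_accuracy(logits, targets)
unmasked_acc = (logits.argmax(dim=-1) == targets).float().mean().item()
print(f"masked accuracy={acc:.2f}, unmasked accuracy={unmasked_acc:.2f}")
```

In this toy case the unmasked accuracy is 0.75 while only one of the two real words is right, which is the gap the mask removes.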

In [None]:
history = train_model(model=model,
                      train_dataset = train_dataset,
                      test_dataset = test_dataset,
                      device=device,
                      epochs=50,
                      batch_size=64)

Training Start
Epoch:1 / 50, train loss:4.6214 train_acc:0.3248, valid loss:4.0966 valid acc:0.3900
Epoch:2 / 50, train loss:3.9483 train_acc:0.3983, valid loss:3.8776 valid acc:0.4153
Epoch:3 / 50, train loss:3.7353 train_acc:0.4239, valid loss:3.7202 valid acc:0.4441
Epoch:4 / 50, train loss:3.5653 train_acc:0.4442, valid loss:3.5986 valid acc:0.4556
Epoch:5 / 50, train loss:3.4249 train_acc:0.4578, valid loss:3.5166 valid acc:0.4640
Epoch:6 / 50, train loss:3.3050 train_acc:0.4661, valid loss:3.4335 valid acc:0.4723
Epoch:7 / 50, train loss:3.1983 train_acc:0.4740, valid loss:3.3893 valid acc:0.4771
Epoch:8 / 50, train loss:3.0998 train_acc:0.4840, valid loss:3.3198 valid acc:0.4868
Epoch:9 / 50, train loss:3.0084 train_acc:0.4932, valid loss:3.2747 valid acc:0.4925
Epoch:10 / 50, train loss:2.9244 train_acc:0.5002, valid loss:3.2428 valid acc:0.4972
Epoch:11 / 50, train loss:2.8442 train_acc:0.5063, valid loss:3.2301 valid acc:0.4996
Epoch:12 / 50, train loss:2.7707 train_acc:0.512

### Define a model for Inference (i.e. Testing Translation)

3) Decode some sentences to check that the model is working (i.e. turn samples from `data_en` into corresponding samples from `data_fr_out`).

Because the training process and `inference process` (decoding sentences) are quite different, we use different models for both, albeit they all leverage the same inner layers where **all the weights have been trained**. 


#### Inference without "teacher forcing"
<img width="60%" alt="Seq2seq inference" src="https://blog.keras.io/img/seq2seq/seq2seq-inference.png">




Now we define a function which returns the translated sentences given an input sentence. 
- The decoder translates word by word, starting with the decoder input `[ <start> ]`.
- When a word, say $w_t$, is generated, the next decoder input at $t+1$ is $[ w_t ]$.
- It continues until either `<end>` is generated or `max_fr_len` words have been generated.
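
The control flow of these three steps can be tried out without a trained model. In the sketch below the "decoder" is a made-up lookup table (`NEXT_WORD`, `toy_decoder_step`) standing in for a real network; only the loop structure and the two stopping conditions are the point.

```python
MAX_LEN = 10

# Hypothetical stand-in for a trained decoder: maps the previous word to the next one.
NEXT_WORD = {"<start>": "le", "le": "chat", "chat": "dort", "dort": "<end>"}

def toy_decoder_step(prev_word, state):
    """Pretend decoder step: returns (next word, updated state)."""
    return NEXT_WORD.get(prev_word, "<end>"), state + 1

def greedy_decode(step_fn, initial_state, max_len=MAX_LEN):
    word, state = "<start>", initial_state
    decoded = []
    while True:
        word, state = step_fn(word, state)   # feed the previous output back in
        decoded.append(word)
        # Stop on the end symbol or when the length limit is reached
        if word == "<end>" or len(decoded) == max_len:
            break
    return decoded

print(greedy_decode(toy_decoder_step, initial_state=0))
print(greedy_decode(toy_decoder_step, initial_state=0, max_len=2))
```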

In [None]:
def decode_sequence(model, input_seq, device):  # input_seq: padded English word ids, shape [1, max_en_len]
    
    model.eval()
    
    with torch.no_grad():
        # convert input_seq to tensor
        input_seq = torch.IntTensor(input_seq).to(device)

        # Encode the input as state vectors.
        # The encoder returns (outputs, hidden_state, cell_state); only the states are needed
        _, hidden_state, cell_state = model.Encoder(input_seq)


        # Generate empty target sequence of (batch=1, length=1). (i.e. we translate word by word)
        target_seq = torch.empty((1,1), dtype=torch.int32, device = device)

        # Populate the start symbol of target sequence with the start character.
        target_seq = target_seq.fill_(fr_tokenizer.word_index["<start>"])

        target_seq = target_seq.to(device)

        # Generate word by word using the encode state and the last 
        # generated word

        decoded_sentence = []

        while True:

            # get decoder output and hidden states, output shape is [1,1,5000]
            output_tokens, h, c = model.Decoder(target_seq, hidden_state, cell_state)


            # Get the most likely word
            _, sampled_token_index = torch.max(output_tokens, dim = -1)

            # flatten the token_index and convert to numpy number
            sampled_token_index = sampled_token_index.view(-1).item()

            # Look up the word by id (index 0 is padding and has no entry in the word index)
            sampled_word = reverse_fr_word_index.get(sampled_token_index, '<pad>')

            # append the word to decoded sentence
            decoded_sentence.append(sampled_word)

            # Exit condition: either hit max length
            # or find stop character.
            if (sampled_word == '<end>' or len(decoded_sentence) == max_fr_len):
                break

            # Update the target sequence with newly generated word.
            target_seq = target_seq.fill_(sampled_token_index)

            # Update states
            hidden_state, cell_state = h, c

    return ' '.join(decoded_sentence)

In [None]:
# Now let's test

test = text.iloc[10000:10010]
test

In [None]:
# preprocess test data with the same normalization used for the training sentences
data_test = en_tokenizer.texts_to_sequences([normalize_string(s) for s in test["en"]])
data_test = keras.preprocessing.sequence.pad_sequences(data_test,\
                                                     maxlen=max_en_len, \
                                                     padding='post')
print(data_test[:3])

In [None]:
for i in range(len(data_test)):
    #data_test[i].shape
    fr = decode_sequence(model, data_test[i][None,:], device)
    print("\nEn: ", test.iloc[i]["en"])
    print("Fr: ", test.iloc[i]["fr"])
    print("Translated: ", fr)

### Seq2Seq model with attention

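Now, let’s talk about the attention mechanism. What is it and why do we need it? A plain Seq2Seq model has to compress the whole input sentence into a single fixed-size state, so it tends to:

- have difficulty remembering and processing long, complicated context
- struggle with differences in the syntactic structures used by different languages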

For detailed explanation, check: https://machinetalk.org/2019/03/29/neural-machine-translation-with-attention-mechanism/


There are different ways to implement such an attention mechanism. You can find a PyTorch implementation at https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html
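
As a rough illustration of the idea, the sketch below computes attention weights with a simple dot-product score in NumPy. This is one of several possible scoring functions and is not necessarily the variant used in the linked posts; the function name, the shapes, and the random inputs are assumptions for this example.

```python
import numpy as np

def dot_product_attention(decoder_state, encoder_outputs):
    """decoder_state: [hidden]; encoder_outputs: [src_len, hidden]."""
    scores = encoder_outputs @ decoder_state           # one score per source position
    scores = scores - scores.max()                     # for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()    # softmax over source positions
    context = weights @ encoder_outputs                # weighted sum of encoder outputs
    return context, weights

rng = np.random.default_rng(0)
encoder_outputs = rng.normal(size=(10, 100))   # 10 source positions, hidden size 100
decoder_state = rng.normal(size=100)

context, weights = dot_product_attention(decoder_state, encoder_outputs)
print(context.shape, weights.shape, weights.sum())
```

The context vector is recomputed at every decoding step, so the decoder can look at different source words for different target words instead of relying on one fixed state.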



## Takeaways

* A neural machine translation model consists of an encoder and a decoder, both built from LSTM layers
* An attention mechanism can help align encoder and decoder outputs
* You can try the following steps to enhance the translation models:
    - Use bidirectional LSTM
    - Try other more advanced attention mechanisms
    - Also, you may need to work on masking when allocating attention scores.
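
For the masking point in the last item, one common approach is to set the scores of padded source positions to negative infinity before the softmax so that they receive zero weight. The NumPy sketch below assumes padding id 0 and at least one real token per sentence; `masked_softmax` and the sample scores are illustrative.

```python
import numpy as np

def masked_softmax(scores, source_ids, pad_id=0):
    """scores: [src_len] raw attention scores; source_ids: [src_len] encoder word ids."""
    scores = np.where(source_ids != pad_id, scores, -np.inf)  # padding gets zero weight
    scores = scores - scores.max()                            # max is finite: assumes a real token exists
    exp = np.exp(scores)
    return exp / exp.sum()

source_ids = np.array([14, 351, 4, 1, 0, 0])        # a post-padded source sentence
scores = np.array([0.2, 1.5, -0.3, 0.7, 2.0, 2.0])  # made-up raw scores
weights = masked_softmax(scores, source_ids)
print(weights.round(3))
```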