
# Neural Machine Translation with Seq2Seq and Attention

Traditionally, neural machine translation uses an encoder-decoder architecture: an encoder compresses the source sentence into a single vector, and a decoder generates the translation from that vector.

![encoder-decoder](https://user-images.githubusercontent.com/114733/80140098-78a57a80-8575-11ea-90e9-35a961d4acf4.jpg)

According to Bahdanau et al. (2016), however, this fixed-length vector is a bottleneck: the encoder must compress all the necessary information in the source sentence into it, which is especially difficult for long sentences.
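
To make the bottleneck concrete, here is a minimal NumPy sketch of a toy tanh RNN encoder. It is not the GRU encoder used later in this notebook; the `encode` helper, the weight matrices, and all sizes are hypothetical. It only shows that a 3-token and a 30-token sentence end up in a vector of the same size.

```python
import numpy as np

def encode(token_ids, embeddings, W_x, W_h):
    """Run a toy tanh RNN over the tokens and return only the final hidden state."""
    hidden = np.zeros(W_h.shape[0])
    for token_id in token_ids:
        hidden = np.tanh(embeddings[token_id] @ W_x + hidden @ W_h)
    return hidden

rng = np.random.default_rng(0)
vocab_size, embedding_dim, hidden_units = 50, 8, 16  # hypothetical sizes
embeddings = rng.normal(size=(vocab_size, embedding_dim))
W_x = rng.normal(size=(embedding_dim, hidden_units)) * 0.1
W_h = rng.normal(size=(hidden_units, hidden_units)) * 0.1

short_sentence = rng.integers(0, vocab_size, size=3)
long_sentence = rng.integers(0, vocab_size, size=30)

short_vector = encode(short_sentence, embeddings, W_x, W_h)
long_vector = encode(long_sentence, embeddings, W_x, W_h)

# Both sentences are summarized by a vector of the same length.
print(short_vector.shape, long_vector.shape)  # (16,) (16,)
```

Whatever the source length, the decoder in this setup only ever sees `hidden_units` numbers.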

## Seq2Seq with Attention

In the paper **Neural Machine Translation by Jointly Learning to Align and Translate**, Bahdanau et al. proposed extending the encoder-decoder architecture by allowing the model to automatically (soft-)search for the parts of the source sentence that are relevant to predicting a target word.

### Attention

This mechanism of soft-searching for the parts of the source sentence that are relevant to a word in the target sentence is called **Attention**, and it is the cornerstone of the state-of-the-art NLP models known as Transformers.

![attention](https://user-images.githubusercontent.com/114733/80148688-6b8f8800-8583-11ea-84f3-fb1d464bcebf.jpg)
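
For reference, the soft search can be written down in the notation of Bahdanau et al., where $s_{i-1}$ is the previous decoder state, $h_j$ is the encoder annotation at source position $j$, and $T_x$ is the source length:

$$
e_{ij} = v_a^\top \tanh\left(W_a s_{i-1} + U_a h_j\right), \qquad
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}, \qquad
c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j
$$

The score $e_{ij}$ measures how well source position $j$ matches the target position $i$, the weights $\alpha_{ij}$ are a softmax over the source positions, and the context vector $c_i$ is the weighted sum of the annotations. In the `BahdanauAttention` class in this notebook, `W1`, `W2`, and `V` appear to play the roles of $W_a$, $U_a$, and $v_a$; note that Keras `Dense` layers also add bias terms by default, which this formula does not include.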

# Sequence to Sequence with Attention

In [None]:
from __future__ import absolute_import, division, print_function, unicode_literals

import tensorflow as tf

import unicodedata
import re

## Attention

In [None]:
class BahdanauAttention(tf.keras.layers.Layer):
    def __init__(self, units):
        super(BahdanauAttention, self).__init__()
        self.W1 = tf.keras.layers.Dense(units)
        self.W2 = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)

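    def call(self, query, values):
        # query: (batch_size, hidden_size) -> (batch_size, 1, hidden_size),
        # so it broadcasts against values: (batch_size, max_length, hidden_size)
        query_with_time_axis = tf.expand_dims(query, 1)

        # score: (batch_size, max_length, 1)
        score = self.V(
            tf.nn.tanh(self.W1(query_with_time_axis) + self.W2(values))
        )

        # Normalize over the source positions (axis 1)
        attention_weights = tf.nn.softmax(score, axis=1)

        # Weighted sum of the encoder outputs: (batch_size, hidden_size)
        context_vector = attention_weights * values
        context_vector = tf.reduce_sum(context_vector, axis=1)

        return context_vector, attention_weights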
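
The shape bookkeeping in the attention layer can be traced without TensorFlow. The following is a NumPy sketch of additive attention under simplifying assumptions: plain weight matrices stand in for the `Dense` layers (so there are no bias terms), and the `additive_attention` function and all sizes are hypothetical.

```python
import numpy as np

def additive_attention(query, values, W1, W2, v):
    """NumPy sketch of additive (Bahdanau-style) attention, without bias terms.

    query:  (batch, hidden)           decoder hidden state
    values: (batch, src_len, hidden)  encoder outputs
    """
    query_with_time_axis = query[:, np.newaxis, :]  # (batch, 1, hidden)

    # Broadcasting adds the projected query to every source position.
    score = np.tanh(query_with_time_axis @ W1 + values @ W2) @ v  # (batch, src_len, 1)

    # Softmax over the source-length axis, shifted for numerical stability.
    exp_score = np.exp(score - score.max(axis=1, keepdims=True))
    attention_weights = exp_score / exp_score.sum(axis=1, keepdims=True)

    # Weighted sum of the encoder outputs.
    context_vector = (attention_weights * values).sum(axis=1)  # (batch, hidden)
    return context_vector, attention_weights

rng = np.random.default_rng(0)
batch, src_len, hidden, units = 2, 5, 4, 3  # hypothetical sizes
query = rng.normal(size=(batch, hidden))
values = rng.normal(size=(batch, src_len, hidden))
W1 = rng.normal(size=(hidden, units))
W2 = rng.normal(size=(hidden, units))
v = rng.normal(size=(units, 1))

context_vector, attention_weights = additive_attention(query, values, W1, W2, v)
print(context_vector.shape)     # (2, 4)
print(attention_weights.shape)  # (2, 5, 1)
```

For each example in the batch, the weights over the five source positions sum to 1, so the context vector is a convex combination of the encoder outputs.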

## Encoder-Decoder

In [None]:
class Encoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, enc_units, batch_size):
        super(Encoder, self).__init__()
        self.batch_size = batch_size
        self.enc_units = enc_units
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(
            self.enc_units,
            return_sequences=True,
            return_state=True,
            recurrent_initializer='glorot_uniform'
        )

    def call(self, x, hidden):
        x = self.embedding(x)
        output, state = self.gru(x, initial_state=hidden)
        return output, state

    def initialize_hidden_state(self):
        return tf.zeros((self.batch_size, self.enc_units))

class Decoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, dec_units, batch_size):
        super(Decoder, self).__init__()
        self.batch_size = batch_size
        self.dec_units = dec_units
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(
            self.dec_units,
            return_sequences=True,
            return_state=True,
            recurrent_initializer='glorot_uniform'
        )

        self.fc = tf.keras.layers.Dense(vocab_size)

        self.attention = BahdanauAttention(self.dec_units)

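    def call(self, x, hidden, enc_output):
        # enc_output: (batch_size, max_length, enc_units)
        context_vector, attention_weights = self.attention(hidden, enc_output)
        # x: (batch_size, 1) -> (batch_size, 1, embedding_dim)
        x = self.embedding(x)
        # Concatenate along the feature axis (axis=-1), not the time axis:
        # (batch_size, 1, enc_units + embedding_dim)
        x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)
        output, state = self.gru(x)
        # (batch_size, 1, dec_units) -> (batch_size, dec_units)
        output = tf.reshape(output, (-1, output.shape[2]))
        # Logits over the target vocabulary: (batch_size, vocab_size)
        x = self.fc(output)

        return x, state, attention_weights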
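
At each decoding step the GRU receives a single time step whose features are the context vector joined with the embedded target token. The shape-only NumPy sketch below, with hypothetical sizes and zero-filled arrays, illustrates why that join is normally done along the last (feature) axis rather than the time axis.

```python
import numpy as np

batch, enc_units, embedding_dim = 2, 6, 4  # hypothetical sizes

context_vector = np.zeros((batch, enc_units))         # output of the attention layer
embedded_token = np.zeros((batch, 1, embedding_dim))  # one embedded target token per example

context_with_time_axis = context_vector[:, np.newaxis, :]  # (batch, 1, enc_units)

# Joining along the last axis keeps a single time step and widens the features.
decoder_input = np.concatenate([context_with_time_axis, embedded_token], axis=-1)
print(decoder_input.shape)  # (2, 1, 10)

# Joining along axis=1 would require enc_units == embedding_dim and would
# produce two time steps instead of one; here the sizes differ, so it fails.
try:
    np.concatenate([context_with_time_axis, embedded_token], axis=1)
    stacked_along_time = True
except ValueError:
    stacked_along_time = False
print(stacked_along_time)  # False
```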

# References

[1] Neural Machine Translation by Jointly Learning to Align and Translate (Seq2Seq with Attention paper), [Bahdanau 2016](https://arxiv.org/pdf/1409.0473.pdf)

[2] Neural Machine Translation with Attention ([TensorFlow official documentation](https://www.tensorflow.org/tutorials/text/nmt_with_attention))

[3] Machine Translation, Seq2Seq and Attention ([slides](https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1194/slides/cs224n-2019-lecture08-nmt.pdf)) ([video](https://youtu.be/XXtpJxZBa2c)). CS224n: NLP with Deep Learning (2019), Stanford University
