# Neural Machine Translation Using a Seq2Seq Model

A **Seq2Seq** (Sequence-to-Sequence) model is a neural network architecture for tasks where both the input and the output are sequences, and the two may differ in length. It is particularly useful for problems such as:

- **Machine Translation** (translating a sentence from one language to another)
- **Text Summarization** (converting a long article into a shorter summary)
- **Speech Recognition** (transcribing spoken words into text)
- **Question Answering** (generating an answer based on a question)
- **Image Captioning** (generating a description of an image)

### Key Components of a Seq2Seq Model:

1. **Encoder**: 
   - The encoder processes the input sequence and converts it into a fixed-length context vector (also called the "thought vector" or "hidden state"). This part of the model is typically a Recurrent Neural Network (RNN), often one of its gated variants, Long Short-Term Memory (LSTM) or Gated Recurrent Unit (GRU), all of which are designed to handle sequences.
   - The encoder processes the input sequence step by step, updating its internal state to represent the information from the sequence.

2. **Decoder**: 
   - The decoder generates the output sequence from the context vector produced by the encoder. The decoder is also typically an RNN, LSTM, or GRU.
   - The decoder starts with the context vector as its initial hidden state and generates one token (word or character) at a time.
   - During training, the decoder is fed the correct (ground-truth) previous token at each step instead of its own prediction (this is called **teacher forcing**). During inference (prediction), the decoder feeds its own previous output back in as the input for the next step.
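
To make teacher forcing concrete, here is a minimal sketch of how one target sentence can be split into decoder inputs and the tokens the decoder is expected to predict. The helper name `teacher_forcing_pairs` and the example tokens are illustrative assumptions, not part of this notebook's code.

```python
def teacher_forcing_pairs(target_tokens):
    """Pair each decoder input with the token it should predict next.

    Hypothetical helper for illustration: at step t the decoder is fed the
    ground-truth token t and is trained to predict token t + 1.
    """
    decoder_inputs = target_tokens[:-1]   # everything except the last token
    expected_outputs = target_tokens[1:]  # the same sequence shifted left by one
    return list(zip(decoder_inputs, expected_outputs))


target = ["<start>", "como", "estas", "<end>"]
for fed, expected in teacher_forcing_pairs(target):
    print(f"decoder sees {fed!r:10} -> should predict {expected!r}")
```

At inference time there is no ground truth to feed in, so the "decoder sees" column would be filled with the model's own previous prediction instead.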

### General Workflow:
1. The input sequence is fed into the **encoder**.
2. The encoder generates a context vector (a compressed representation of the entire input sequence).
3. The context vector is passed to the **decoder**.
4. The decoder generates the output sequence, one token at a time.

### Seq2Seq Example:
For example, in **machine translation**:
- **Input**: "How are you?"
- **Output**: "¿Cómo estás?"

The encoder processes "How are you?" and outputs a context vector. The decoder then uses this context vector to generate the translation token by token: "¿", "Cómo", "estás", "?".


## Imports and Libraries

In [38]:
import tensorflow as tf
from tensorflow.keras.layers import Embedding, GRU, Dense
import numpy as np

## Sample data

In [39]:
input_texts = [
    "hello", "how are you", "good morning", "what is your name", "thank you",
    "good night", "see you later", "have a good day", "what time is it", "where are you from",
    "I am fine", "I love you", "please help me", "excuse me", "I'm sorry",
    "it's okay", "can you speak English", "how much does it cost", "I don't understand", "what is this"
]

target_texts = [
    "hola", "como estas", "buenos dias", "cual es tu nombre", "gracias",
    "buenas noches", "hasta luego", "que tengas un buen dia", "que hora es", "de donde eres",
    "estoy bien", "te quiero", "por favor ayudame", "perdon", "lo siento",
    "esta bien", "puedes hablar ingles", "cuanto cuesta", "no entiendo", "que es esto"
]



## Tokenization

In [40]:
# Tokenize the English inputs and pad them to a common length
input_tokenizer = tf.keras.preprocessing.text.Tokenizer()
input_tokenizer.fit_on_texts(input_texts)
input_sequences = input_tokenizer.texts_to_sequences(input_texts)
input_sequences = tf.keras.preprocessing.sequence.pad_sequences(input_sequences, padding='post')

# Tokenize the Spanish targets; filters='' disables punctuation filtering.
# Padding is applied later, once the start and end tokens have been added.
output_tokenizer = tf.keras.preprocessing.text.Tokenizer(filters='')
output_tokenizer.fit_on_texts(target_texts)
output_sequences = output_tokenizer.texts_to_sequences(target_texts)

## Add start and end tokens

The `<start>` and `<end>` tokens are critical in Seq2Seq models for several reasons:

- The start token tells the decoder when to begin generating the output sequence.
- The end token signals that sequence generation is complete, preventing the model from generating endless sequences.
- Together they help the model learn the structure of sequences by providing clear boundaries.
- They improve learning during training and are essential for handling variable-length sequences.
- They guide the generation process during inference, enabling the model to stop when appropriate.

In [41]:
# Add <start> and <end> tokens to the target sequences, then pad them
def add_start_end_tokens_manually(tokenizer, sequences):
    # Register each special token in both lookup tables. index_word is not
    # updated automatically, and it is needed to map ids back to words.
    for token in ('<start>', '<end>'):
        if token not in tokenizer.word_index:
            index = len(tokenizer.word_index) + 1
            tokenizer.word_index[token] = index
            tokenizer.index_word[index] = token

    start_id = tokenizer.word_index['<start>']
    end_id = tokenizer.word_index['<end>']
    updated_sequences = [[start_id] + seq + [end_id] for seq in sequences]
    return tf.keras.preprocessing.sequence.pad_sequences(updated_sequences, padding='post')

In [42]:
# Add <start> and <end> tokens
output_sequences = add_start_end_tokens_manually(output_tokenizer, output_sequences)

In [43]:
# Vocabulary sizes (+1 because index 0 is reserved for padding)
input_vocab_size = len(input_tokenizer.word_index) + 1
output_vocab_size = len(output_tokenizer.word_index) + 1

## Model building

In [44]:

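# Model parameters
embedding_dim = 256  # size of each token embedding vector
units = 512          # GRU hidden-state size for both encoder and decoder
batch_size = 2       # sentence pairs per training step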


## Encoder model

The **Encoder** model in a Seq2Seq architecture processes an input sequence and transforms it into a context vector that summarizes the sequence. It consists of an **embedding layer** that converts input tokens into dense vectors, followed by a **GRU layer** that captures the temporal dependencies of the sequence. The model returns both the sequence of hidden states (useful for attention mechanisms) and the final hidden state (used as the context for the decoder). This encoded information helps the decoder generate the output sequence.

For example, if the input is a sequence of word indices like `[1, 2, 3, 4]`:

- The embedding layer converts these indices into dense vectors (e.g., of size `embedding_dim`).
- The GRU layer processes the sequence of embedded vectors and returns a sequence of hidden states, one per input position, plus a final hidden state that captures the context of the entire sequence.

In [45]:
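# Encoder
class Encoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, enc_units):
        super().__init__()
        self.enc_units = enc_units
        # Map token ids to dense vectors of size embedding_dim
        self.embedding = Embedding(vocab_size, embedding_dim)
        # return_sequences yields one hidden state per input position (for attention);
        # return_state additionally returns the final hidden state
        self.gru = GRU(enc_units, return_sequences=True, return_state=True, recurrent_initializer='glorot_uniform')

    def call(self, x):
        x = self.embedding(x)        # (batch, seq_len, embedding_dim)
        output, state = self.gru(x)  # (batch, seq_len, enc_units), (batch, enc_units)
        return output, state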

## Attention

In traditional Seq2Seq models, the encoder produces a single fixed-size context vector, which can sometimes fail to capture all the important information, especially in long sequences. To address this, **attention mechanisms** were introduced. Attention allows the decoder to focus on different parts of the input sequence at each decoding step, improving the model’s performance and handling long sequences better.
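
Written out, the additive (Bahdanau-style) attention used in this notebook can be summarized as follows. The symbols are notation introduced here for illustration: $s$ is the current decoder hidden state (the `query`), $h_i$ is the encoder hidden state at input position $i$ (the `values`), and $W_1$, $W_2$, $v$ stand for the weights of the `W1`, `W2`, and `V` layers, with bias terms omitted for brevity.

$$
e_i = v^\top \tanh\left(W_1 s + W_2 h_i\right), \qquad
\alpha_i = \frac{\exp(e_i)}{\sum_j \exp(e_j)}, \qquad
c = \sum_i \alpha_i \, h_i
$$

Here $e_i$ is the unnormalized score for position $i$, $\alpha_i$ is its attention weight after a softmax over all input positions, and $c$ is the resulting context vector.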

In [46]:
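# Attention
class BahdanauAttention(tf.keras.layers.Layer):
    def __init__(self, units):
        super().__init__()
        self.W1 = Dense(units)  # projects the decoder state (query)
        self.W2 = Dense(units)  # projects the encoder outputs (values)
        self.V = Dense(1)       # reduces each position to a single score

    def call(self, query, values):
        # query: (batch, units) -> (batch, 1, units) so it broadcasts over time
        query_with_time_axis = tf.expand_dims(query, 1)
        # score: (batch, seq_len, 1)
        score = self.V(tf.nn.tanh(self.W1(query_with_time_axis) + self.W2(values)))
        # Normalize over the input positions (axis 1)
        attention_weights = tf.nn.softmax(score, axis=1)
        # Weighted sum of the encoder outputs: (batch, units)
        context_vector = attention_weights * values
        context_vector = tf.reduce_sum(context_vector, axis=1)
        return context_vector, attention_weights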
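
The last two steps of the attention layer above (softmax over positions, then a weighted sum) can be traced by hand on tiny inputs. The following is a minimal pure-Python sketch; the scores and the 2-dimensional "encoder states" are made-up values, and the helper names `softmax` and `weighted_sum` are assumptions for illustration only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_sum(weights, vectors):
    """Sum of vectors scaled by their weights, i.e. the context vector."""
    dim = len(vectors[0])
    return [sum(w * vec[d] for w, vec in zip(weights, vectors)) for d in range(dim)]


# Made-up scores for three input positions and made-up 2-d encoder states.
scores = [2.0, 0.5, 0.5]
encoder_states = [[1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]

weights = softmax(scores)
context = weighted_sum(weights, encoder_states)
print("weights:", [round(w, 3) for w in weights])
print("context:", [round(c, 3) for c in context])
```

The position with the highest score receives the largest weight, so the context vector is pulled towards that position's encoder state.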

## Decoder Model

The `Decoder` class defines the decoder part of a Seq2Seq model with attention. At each step it receives the current input token, the full sequence of encoder outputs, and the current hidden state.

It applies the Bahdanau attention mechanism to compute a context vector over the encoder's outputs, embeds the input token, concatenates the two, passes the result through a GRU, and produces a set of logits for the next token.

It also returns the updated hidden state and the attention weights. The hidden state serves as the attention query in the next decoding step, and the attention weights show which parts of the input sequence were most important at each timestep.

In [47]:
# Decoder
class Decoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, dec_units):
        super(Decoder, self).__init__()
        self.dec_units = dec_units
        self.embedding = Embedding(vocab_size, embedding_dim)
        self.gru = GRU(dec_units, return_sequences=True, return_state=True, recurrent_initializer='glorot_uniform')
        self.fc = Dense(vocab_size)

        self.attention = BahdanauAttention(dec_units)

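    def call(self, x, enc_output, hidden):
        # x: (batch, 1) current token ids; hidden: (batch, dec_units)
        context_vector, attention_weights = self.attention(hidden, enc_output)
        x = self.embedding(x)  # (batch, 1, embedding_dim)
        # Concatenate the context vector and the token embedding along the feature axis
        x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)
        # Note: `hidden` is used only as the attention query. The GRU is called
        # without initial_state, so it starts from a zero state at every step.
        output, state = self.gru(x)
        x = self.fc(output)        # (batch, 1, vocab_size) logits
        x = tf.squeeze(x, axis=1)  # (batch, vocab_size)
        return x, state, attention_weights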

In [48]:
# Instantiate the models
encoder = Encoder(input_vocab_size, embedding_dim, units)
decoder = Decoder(output_vocab_size, embedding_dim, units)

## Model Training

In [49]:
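optimizer = tf.keras.optimizers.Adam()
# from_logits=True: the decoder outputs raw scores, not probabilities.
# reduction='none': keep one loss value per example so padding can be masked.
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True, reduction='none')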



In [50]:
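def loss_function(real, pred):
    # Padding uses id 0, so mask those positions out of the loss
    mask = tf.math.logical_not(tf.math.equal(real, 0))
    loss_ = loss_object(real, pred)  # per-example loss, shape (batch,)
    mask = tf.cast(mask, dtype=loss_.dtype)
    loss_ *= mask
    # Mean over the whole batch; masked positions count as zero loss
    return tf.reduce_mean(loss_)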
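
The masking in `loss_function` can be checked by hand on a tiny batch. The sketch below is a pure-Python approximation, not the notebook's implementation: it takes already-normalized probabilities instead of logits to keep the arithmetic simple, and the helper name `masked_step_loss` and its `pad_id` parameter are assumptions introduced for illustration.

```python
import math


def masked_step_loss(target_ids, probs_per_example, pad_id=0):
    """Cross-entropy for one decoding step, ignoring padded targets.

    target_ids: one target token id per example in the batch.
    probs_per_example: for each example, predicted probabilities indexed by
        token id (assumed to be softmax-normalized already).
    """
    losses = []
    for target, probs in zip(target_ids, probs_per_example):
        if target == pad_id:
            losses.append(0.0)  # a padded position contributes nothing
        else:
            losses.append(-math.log(probs[target]))
    # As with tf.reduce_mean, divide by the full batch size,
    # padded examples included.
    return sum(losses) / len(losses)


batch_targets = [2, 0]  # the second example is padding at this step
batch_probs = [[0.1, 0.1, 0.8], [0.2, 0.5, 0.3]]
print(masked_step_loss(batch_targets, batch_probs))
```

Because the mean is taken over the whole batch, steps where many examples are already padded yield a smaller loss than dividing by the number of real tokens would.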

In [51]:
@tf.function
def train_step(input_seq, target_seq, enc_hidden):
    # enc_hidden is accepted from the training loop but is not used:
    # the encoder GRU starts from its default zero state.
    loss = 0

    with tf.GradientTape() as tape:
        enc_output, enc_hidden = encoder(input_seq)
        dec_hidden = enc_hidden

        # Every target sequence begins with <start>; feed it as the first input
        dec_input = tf.expand_dims([output_tokenizer.word_index['<start>']] * batch_size, 1)

        # Teacher forcing: after predicting step t, feed the true token t as the next input
        for t in range(1, target_seq.shape[1]):
            predictions, dec_hidden, _ = decoder(dec_input, enc_output, dec_hidden)
            loss += loss_function(target_seq[:, t], predictions)

            dec_input = tf.expand_dims(target_seq[:, t], 1)

    # Average the accumulated loss over the target length for reporting
    batch_loss = loss / int(target_seq.shape[1])
    variables = encoder.trainable_variables + decoder.trainable_variables
    gradients = tape.gradient(loss, variables)
    optimizer.apply_gradients(zip(gradients, variables))

    return batch_loss

In [52]:
# Example training call
EPOCHS = 10
for epoch in range(EPOCHS):
    enc_hidden = tf.zeros((batch_size, units))
    total_loss = 0

    # Walk through the data in consecutive, unshuffled batches
    for batch in range(len(input_sequences) // batch_size):
        batch_input = input_sequences[batch * batch_size:(batch + 1) * batch_size]
        batch_target = output_sequences[batch * batch_size:(batch + 1) * batch_size]

        batch_loss = train_step(batch_input, batch_target, enc_hidden)
        total_loss += batch_loss

    # The reported value is the sum of the per-batch losses for the epoch
    print(f'Epoch {epoch + 1}, Loss: {total_loss.numpy()}')


Epoch 1, Loss: 17.512834548950195
Epoch 2, Loss: 15.508331298828125
Epoch 3, Loss: 14.64116382598877
Epoch 4, Loss: 13.69356632232666
Epoch 5, Loss: 12.304492950439453
Epoch 6, Loss: 10.427021980285645
Epoch 7, Loss: 8.448477745056152
Epoch 8, Loss: 6.822537422180176
Epoch 9, Loss: 5.6506428718566895
Epoch 10, Loss: 4.793063163757324


## Example translations

In [53]:
def translate(sentence):
    # Preprocess the input sentence
    sentence_seq = input_tokenizer.texts_to_sequences([sentence])
    sentence_seq = tf.keras.preprocessing.sequence.pad_sequences(sentence_seq, maxlen=input_sequences.shape[1], padding='post')

    enc_output, enc_hidden = encoder(sentence_seq)

    dec_hidden = enc_hidden
    dec_input = tf.expand_dims([output_tokenizer.word_index['<start>']], 0)

    result = []

    for t in range(output_sequences.shape[1]):
        # Get the decoder's prediction
        predictions, dec_hidden, _ = decoder(dec_input, enc_output, dec_hidden)

        # Greedy decoding: pick the token with the highest score
        predicted_id = int(tf.argmax(predictions[0], axis=-1).numpy())

        # Look up the predicted word in the tokenizer
        predicted_word = output_tokenizer.index_word.get(predicted_id, None)

        # Stop if the id has no word (for example padding id 0, or a
        # special token that is missing from index_word)
        if predicted_word is None:
            break

        # If we reach the <end> token, stop
        if predicted_word == '<end>':
            break

        # Append the predicted word to the result
        result.append(predicted_word)

        # Use the predicted word as the next input to the decoder
        dec_input = tf.expand_dims([predicted_id], 0)

    return ' '.join(result)
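
The control flow of `translate` is greedy decoding: take the top-scoring token, stop at the end token, otherwise feed the prediction back in. A minimal framework-free sketch of that loop follows. The function `greedy_decode`, its `step_fn` argument, and the toy score table are hypothetical stand-ins for the trained decoder, used only to show the loop.

```python
def greedy_decode(step_fn, start_id, end_id, max_len):
    """Generate token ids one at a time, always taking the top-scoring id.

    step_fn is a hypothetical stand-in for one decoder call: it takes the
    previous token id and returns a list of scores, one per vocabulary id.
    """
    output = []
    prev = start_id
    for _ in range(max_len):
        scores = step_fn(prev)
        next_id = max(range(len(scores)), key=scores.__getitem__)  # argmax
        if next_id == end_id:
            break
        output.append(next_id)
        prev = next_id  # feed the prediction back in as the next input
    return output


# Toy "decoder": a fixed table mapping the previous id to scores over 5 ids.
# ids: 0 = padding, 1 = <start>, 2 = <end>, 3 and 4 = ordinary words.
toy_scores = {
    1: [0.0, 0.0, 0.1, 0.9, 0.0],  # after <start>, prefer id 3
    3: [0.0, 0.0, 0.2, 0.0, 0.8],  # after id 3, prefer id 4
    4: [0.0, 0.0, 0.7, 0.2, 0.1],  # after id 4, prefer <end>
}
print(greedy_decode(toy_scores.__getitem__, start_id=1, end_id=2, max_len=10))  # prints [3, 4]
```

Greedy decoding commits to the best token at each step and never revisits earlier choices, so one early mistake can carry through the rest of the output.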


In [54]:
# Example translations
for text in input_texts:
    print(f"Input: {text}")
    print(f"Translated: {translate(text)}")
    print("-" * 30)

Input: hello
Translated: perdon
------------------------------
Input: how are you
Translated: de donde eres
------------------------------
Input: good morning
Translated: buenos dias
------------------------------
Input: what is your name
Translated: cual es tu nombre
------------------------------
Input: thank you
Translated: gracias
------------------------------
Input: good night
Translated: buenas noches
------------------------------
Input: see you later
Translated: hasta luego
------------------------------
Input: have a good day
Translated: que tengas un buen dia
------------------------------
Input: what time is it
Translated: que hora es
------------------------------
Input: where are you from
Translated: de donde eres
------------------------------
Input: I am fine
Translated: estoy bien
------------------------------
Input: I love you
Translated: te quiero
------------------------------
Input: please help me
Translated: por favor ayudame
------------------------------
Input: