# Neural Machine Translation Using an Encoder-Decoder Model

Neural Machine Translation (NMT) with an encoder-decoder model is a deep learning approach to translating text from one language to another, such as Spanish to English. This assignment follows the stages outlined below:

1. **Encoder:**
   - The encoder processes the input Spanish sentence and maps each token to a fixed-size numerical vector (called an embedding).
   - It uses recurrent layers like LSTMs or GRUs (or transformers in modern implementations) to capture the semantic and syntactic information of the sentence. 
   - The output is a context vector (or a series of vectors in attention-based models) summarizing the meaning of the entire Spanish sentence.

2. **Decoder:**
   - The decoder takes the context vector from the encoder and generates the English translation one word at a time.
   - At each step, the decoder predicts the next word in the sequence, based on the context vector and the previously generated words.

3. **Attention Mechanism (optional but commonly used):**
   - Attention allows the decoder to focus on relevant parts of the input sentence while translating, rather than relying solely on the fixed context vector. This is particularly useful for longer sentences.

4. **Training:**
   - The model is trained on a large parallel corpus of Spanish-English sentence pairs.
   - It minimizes a loss function (e.g., cross-entropy loss) to align the predicted English words with the actual translation.

5. **Inference:**
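   - During translation, the decoder generates the output one token at a time, feeding each prediction back in as the next input. A decoding strategy such as greedy search or beam search selects the output tokens; the `evaluate` function in this notebook uses greedy decoding (an argmax at each step).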

The encoder-decoder architecture enables the model to capture complex language relationships and produce high-quality translations.
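
To make the inference step more concrete, here is a minimal beam search sketch in plain Python. The `TOY_NEXT_TOKEN_PROBS` table is a hypothetical stand-in for a trained decoder and is not part of this assignment's code. With `beam_width=1` the search reduces to greedy decoding, so the two calls at the end show how a wider beam can recover a more probable sentence that greedy decoding misses.

```python
import math

# Hypothetical stand-in for a trained decoder: maps the previous token
# to a probability distribution over the next token.
TOY_NEXT_TOKEN_PROBS = {
    "<start>": {"i": 0.6, "we": 0.4},
    "i": {"am": 0.5, "was": 0.5},
    "we": {"are": 0.9, "were": 0.1},
    "am": {"<end>": 1.0},
    "was": {"<end>": 1.0},
    "are": {"<end>": 1.0},
    "were": {"<end>": 1.0},
}


def beam_search(next_token_probs, beam_width=2, max_len=5):
    """Return the highest-scoring token sequence under the toy model."""
    # Each hypothesis is (log_probability, list_of_tokens).
    beams = [(0.0, ["<start>"])]
    finished = []

    for _ in range(max_len):
        candidates = []
        for log_prob, tokens in beams:
            for token, prob in next_token_probs[tokens[-1]].items():
                candidates.append((log_prob + math.log(prob), tokens + [token]))

        # Keep only the best `beam_width` partial translations.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for cand in candidates[:beam_width]:
            (finished if cand[1][-1] == "<end>" else beams).append(cand)
        if not beams:
            break

    best = max(finished or beams, key=lambda c: c[0])
    # Strip <start> (and <end>, if the hypothesis finished).
    return best[1][1:-1] if best[1][-1] == "<end>" else best[1][1:]


greedy = beam_search(TOY_NEXT_TOKEN_PROBS, beam_width=1)  # follows "i" (0.6) first
beam = beam_search(TOY_NEXT_TOKEN_PROBS, beam_width=2)    # also explores "we"
```

Greedy decoding commits to "i" because it is the most likely first word, while the beam of width 2 keeps "we" alive and finds that "we are" has a higher overall probability (0.36 versus 0.30).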

#### Instructions
Do not modify any of the existing code.
Only write code where prompted. For example, some sections contain the following markers:

```python
# YOUR CODE GOES HERE
# YOUR CODE ENDS HERE
# TODO
```


## Imports and Libraries

In [1]:
import tensorflow as tf
import pathlib
import re
import random
from helper import create_dataset, tokenize, preprocess_sentence

## Preprocessing and tokenization

This notebook uses the `spa.txt` file, which contains Spanish sentences and their corresponding English translations.

Each sentence is preprocessed by lowercasing it, trimming whitespace, and removing non-alphabetic characters.

Each sentence is then wrapped with `<start>` and `<end>` tokens for sequence modeling.

The `create_dataset` function loads the dataset, preprocesses each sentence pair, and returns separate lists for the source and target languages.
The `tokenize` function tokenizes a list of sentences and converts them into padded numerical sequences suitable for neural network inputs.
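
The `helper` module itself is not shown here, so the following is only a rough sketch of what such preprocessing and tokenization might look like, written in plain Python. The names `sketch_preprocess` and `sketch_tokenize` are hypothetical, and the real helpers may differ (for example, a Keras tokenizer typically assigns ids by word frequency rather than by first appearance).

```python
import re


def sketch_preprocess(sentence):
    """Lowercase, trim, drop non-alphabetic characters, add boundary tokens."""
    sentence = sentence.lower().strip()
    # Replace runs of anything that is not a letter (Spanish accents included)
    # with a single space.
    sentence = re.sub(r"[^a-záéíóúüñ]+", " ", sentence).strip()
    return f"<start> {sentence} <end>"


def sketch_tokenize(sentences):
    """Map words to integer ids (0 is reserved for padding) and pad to equal length."""
    word_index = {}
    for sentence in sentences:
        for word in sentence.split():
            # Ids start at 1 so that 0 can be used for padding.
            word_index.setdefault(word, len(word_index) + 1)

    sequences = [[word_index[w] for w in s.split()] for s in sentences]
    max_len = max(len(seq) for seq in sequences)
    # Post-padding: append zeros so every sequence has the same length.
    padded = [seq + [0] * (max_len - len(seq)) for seq in sequences]
    return padded, word_index


cleaned = [sketch_preprocess("¿Cómo estás?"), sketch_preprocess("  Hola  ")]
padded, word_index = sketch_tokenize(cleaned)
vocab_size = len(word_index) + 1  # +1 accounts for the padding id 0
```

Here `cleaned[0]` becomes `<start> cómo estás <end>`, and the shorter second sentence is padded with a trailing `0` so both rows have four ids.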

In [2]:
path_to_file = 'spa.txt'

# Load and preprocess the dataset
num_examples = 50000  # Set the number of examples to load
source_lang, target_lang = create_dataset(path_to_file, num_examples)

# Tokenize the source and target languages
input_tensor, input_tokenizer = tokenize(source_lang)
target_tensor, target_tokenizer = tokenize(target_lang)

# Vocabulary sizes
input_vocab_size = len(input_tokenizer.word_index) + 1
target_vocab_size = len(target_tokenizer.word_index) + 1

## Encoder class
The code block below defines the encoder for the sequence-to-sequence model. It uses an embedding layer and a GRU to encode the input sentence into per-token output vectors and a final hidden state.

In [None]:
# Encoder
class Encoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, enc_units, batch_sz):
        super(Encoder, self).__init__()
        self.batch_sz = # TODO: Set the Batch Size
        self.enc_units = # TODO: Set the Number of Encoder Units 
        self.embedding = # TODO: Initialize the Embedding Layer
        self.gru = # TODO: Initialize the GRU Layer

    def call(self, x, hidden):
        x = self.embedding(x)
        output, state = self.gru(x, initial_state=hidden)
        return output, state

    def initialize_hidden_state(self):
        return tf.zeros((self.batch_sz, self.enc_units))

## Bahdanau Attention

The code below defines the attention mechanism for the model. It computes a context vector by weighting the encoder outputs according to their relevance to the decoder's current state. To learn more about this attention mechanism, see [this article](https://machinelearningmastery.com/the-bahdanau-attention-mechanism/).

In [3]:
# Attention Mechanism
class BahdanauAttention(tf.keras.layers.Layer):
    def __init__(self, units):
        super(BahdanauAttention, self).__init__()
        self.W1 = tf.keras.layers.Dense(units)
        self.W2 = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)

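    def call(self, query, values):
        # query: (batch, hidden_size) -> (batch, 1, hidden_size) so it broadcasts over time steps
        query_with_time_axis = tf.expand_dims(query, 1)
        # score: (batch, max_length, 1), one alignment score per encoder time step
        score = self.V(tf.nn.tanh(self.W1(query_with_time_axis) + self.W2(values)))
        # Normalize over the time axis so the weights for each sentence sum to 1
        attention_weights = tf.nn.softmax(score, axis=1)
        # Weighted sum of the encoder outputs: (batch, hidden_size)
        context_vector = attention_weights * values
        context_vector = tf.reduce_sum(context_vector, axis=1)
        return context_vector, attention_weights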
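
As a small worked example of the last two steps (softmax over scores, then a weighted sum), here is a dependency-free sketch for a single sentence. The scores and encoder outputs are made-up numbers chosen to give round weights; in the layer above the scores come from `V(tanh(W1(query) + W2(values)))`.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(scores, values):
    """Weight each encoder output vector by its attention weight and sum them."""
    weights = softmax(scores)
    dim = len(values[0])
    context = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return context, weights


# Three encoder time steps, each with a 2-dimensional output (made-up numbers).
encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Hypothetical alignment scores; the third time step is the most relevant.
scores = [0.0, 0.0, math.log(2.0)]

context, weights = attend(scores, encoder_outputs)
```

The weights come out as approximately `[0.25, 0.25, 0.5]`, so the context vector is approximately `[0.75, 0.75]`: the third encoder output contributes twice as much as each of the other two.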


## Decoder class

The code below defines the decoder for the sequence-to-sequence model. It uses the attention mechanism, a GRU, and a dense layer to generate the output sequence.

In [None]:
# Decoder with Teacher Forcing
class Decoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, dec_units, batch_sz):
        super(Decoder, self).__init__()
        self.batch_sz = # TODO: Set the Batch Size
        self.dec_units = # TODO: Set the Decoder Units
        self.embedding = # TODO: Initialize the Embedding Layer
        self.gru = # TODO: Initialize the GRU Layer
        self.fc = # TODO: Initialize the Fully Connected (Dense) Layer
        self.attention = # TODO: Initialize the Attention Mechanism

    def call(self, x, hidden, enc_output):
        context_vector, attention_weights = self.attention(hidden, enc_output)
        x = self.embedding(x)
        x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)
        output, state = self.gru(x)
        output = tf.reshape(output, (-1, output.shape[2]))
        x = self.fc(output)
        return x, state, attention_weights

## Training

The code below sets the training parameters: batch size, embedding dimension, and GRU unit size.
It then creates a TensorFlow dataset, shuffles it, and splits it into batches for training.
Finally, it initializes the encoder and decoder models with these parameters.
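
To illustrate how the batch size interacts with `steps_per_epoch` and `drop_remainder=True`, here is a short arithmetic sketch. The batch size of 64 is a hypothetical value chosen for illustration, not the expected answer, and the example count assumes all 50000 requested examples were loaded.

```python
num_examples = 50000      # matches num_examples used when loading the dataset
example_batch_size = 64   # hypothetical value, for illustration only

# Integer division: only full batches are counted.
steps_per_epoch_example = num_examples // example_batch_size
# Examples left over after the last full batch; drop_remainder=True discards them.
dropped_examples = num_examples - steps_per_epoch_example * example_batch_size
```

With these numbers there are 781 full batches per epoch, and the 16 leftover examples are skipped in each epoch (which 16 varies, because the dataset is reshuffled).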

In [None]:
# Training configuration
BUFFER_SIZE = len(input_tensor)
BATCH_SIZE = #TODO
steps_per_epoch = len(input_tensor) // BATCH_SIZE
embedding_dim = #TODO
units = #TODO

In [None]:
# Dataset
dataset = tf.data.Dataset.from_tensor_slices((input_tensor, target_tensor))
dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)

In [None]:
# Initialize encoder and decoder
encoder = # TODO: Initialize Encoder
decoder = # TODO: Initialize Decoder

In [None]:
# Optimizer and loss function
optimizer = # TODO: Adam optimizer initialization
loss_object = # TODO: SparseCategoricalCrossentropy calculated from logits


In [None]:
def loss_function(real, pred):
    mask = tf.math.logical_not(tf.math.equal(real, 0))
    loss_ = loss_object(real, pred)
    mask = tf.cast(mask, dtype=loss_.dtype)
    loss_ *= mask
    return tf.reduce_mean(loss_)
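
The effect of the mask can be seen in a small dependency-free sketch. For simplicity it works on probabilities rather than logits, and it assumes the loss object returns one loss value per position, which the masking above relies on. The function name `masked_mean_loss` and the numbers are hypothetical.

```python
import math


def masked_mean_loss(targets, predicted_probs, pad_id=0):
    """Cross-entropy per position, zeroed at padding, averaged over all positions."""
    losses = []
    for target, probs in zip(targets, predicted_probs):
        loss = -math.log(probs[target])
        mask = 0.0 if target == pad_id else 1.0
        losses.append(loss * mask)
    # Like tf.reduce_mean above, this divides by the batch size, padding included.
    return sum(losses) / len(losses)


# One decoding step for a batch of 2; the second target is padding (id 0).
targets = [2, 0]
predicted_probs = [
    [0.1, 0.4, 0.5],  # probability 0.5 assigned to the correct id 2
    [0.2, 0.3, 0.5],  # ignored because the target is padding
]
loss = masked_mean_loss(targets, predicted_probs)
```

Only the first example contributes, so the result is `-log(0.5) / 2`. Without the mask, the padded position would add its own loss term and push the model toward predicting padding.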

The code below implements a single training step with teacher forcing to improve sequence generation.

Teacher forcing is a training strategy used in sequence-to-sequence models (such as in machine translation or text generation) where the ground truth (actual target sequence) is used as the next input to the decoder during training, instead of using the decoder's own predicted output from the previous step.
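
As a minimal illustration of this idea, the sketch below lists which token the decoder is fed and which token it is expected to predict at each step. The token ids and the helper name `teacher_forcing_pairs` are hypothetical.

```python
def teacher_forcing_pairs(target_ids):
    """Return (decoder_input, expected_output) pairs for one target sequence."""
    # At step t the decoder is fed the true token t-1 and must predict token t,
    # regardless of what it predicted at the previous step.
    return [(target_ids[t - 1], target_ids[t]) for t in range(1, len(target_ids))]


# Hypothetical ids: 1 = <start>, 5 = "hello", 7 = "world", 2 = <end>
target_ids = [1, 5, 7, 2]
pairs = teacher_forcing_pairs(target_ids)
# pairs == [(1, 5), (5, 7), (7, 2)]
```

Even if the decoder wrongly predicted some other word at the first step, the second step would still receive the true id 5 as input, which keeps early mistakes from compounding during training.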



In [4]:

# Training step with teacher forcing
@tf.function
def train_step(inp, targ, enc_hidden):
    loss = 0
    with tf.GradientTape() as tape:
        enc_output, enc_hidden = encoder(inp, enc_hidden)
        dec_hidden = enc_hidden
        dec_input = tf.expand_dims([target_tokenizer.word_index['<start>']] * BATCH_SIZE, 1)

        for t in range(1, targ.shape[1]):
            predictions, dec_hidden, _ = decoder(dec_input, dec_hidden, enc_output)
            loss += loss_function(targ[:, t], predictions)
            dec_input = tf.expand_dims(targ[:, t], 1)  # Teacher forcing

    batch_loss = loss / int(targ.shape[1])
    variables = encoder.trainable_variables + decoder.trainable_variables
    gradients = tape.gradient(loss, variables)
    optimizer.apply_gradients(zip(gradients, variables))
    return batch_loss


In [None]:
# Training Loop with multiple epochs
EPOCHS = # TODO: Assign number of epochs

for epoch in range(EPOCHS):
    enc_hidden = encoder.initialize_hidden_state()
    total_loss = 0

    for (batch, (inp, targ)) in enumerate(dataset.take(steps_per_epoch)):
        batch_loss = # TODO: call training step
        total_loss += batch_loss

    print(f"Epoch {epoch+1} Loss {total_loss / steps_per_epoch:.4f}")


## Evaluation

In [None]:
def evaluate(sentence):
    # Preprocess the input sentence
    sentence = preprocess_sentence(sentence)
    inputs = [input_tokenizer.word_index.get(word, 0) for word in sentence.split()]
    inputs = tf.keras.preprocessing.sequence.pad_sequences([inputs], maxlen=input_tensor.shape[1], padding='post')
    inputs = tf.convert_to_tensor(inputs)

    # Encode the input sentence
    hidden = [tf.zeros((1, units))]
    enc_out, enc_hidden = encoder(inputs, hidden)

    dec_hidden = enc_hidden
    dec_input = tf.expand_dims([target_tokenizer.word_index['<start>']], 0)

    result = ''

    # Decode the output sequence
    for t in range(target_tensor.shape[1]):
        predictions, dec_hidden, attention_weights = decoder(dec_input, dec_hidden, enc_out)
        predicted_id = tf.argmax(predictions[0]).numpy()
        # Id 0 is reserved for padding and has no entry in index_word
        predicted_word = target_tokenizer.index_word.get(predicted_id, '')

        if predicted_word == '<end>':
            break

        result += predicted_word + ' '

        # Use the predicted word as the next input
        dec_input = tf.expand_dims([predicted_id], 0)

    return result.strip()

## Testing

In [None]:
# To test on 20 random sentences from the dataset
def translate_random_sentences():
    random_indices = # TODO: Sample 20 random sentence indices from the dataset
    

    for i in random_indices:
        source_sentence = source_lang[i]
        target_sentence = target_lang[i]  # Actual translation (target language sentence)
        
        print(f"Translating sentence {i+1}: {source_sentence}")
        print(f"Actual translation: {target_sentence}")
        
        predicted_translation = # TODO: Evaluate each sentence
        print(f"Predicted translation: {predicted_translation}")
        
        print("\n" + "="*50 + "\n")

# Call the function to translate random sentences
translate_random_sentences()
