# ADVANCED PREREQUISITES

## NEURAL MACHINE TRANSLATION - ADVANCED

- **Sequence-to-Sequence (Seq2Seq) Models:**
Transform input sequences into output sequences, ideal for tasks like translation where both input and output are sequences of variable length.
- **Encoders and Decoders:**
Encoders turn the input sequence into hidden representations (in the classic formulation, a single fixed-size vector), while decoders generate the output sequence from these representations. Together they form the core of Seq2Seq models.
- **Bahdanau Attention Mechanism:**
Allows the decoder to focus on different parts of the input sequence at each step, improving the model's ability to handle long sequences and complex relationships.
- **Attention Neural Networks:**
Incorporate attention mechanisms to dynamically weigh the importance of different input elements when generating each output element, enhancing translation quality and interpretability.

## CODE - MODEL

The following code implements a Seq2Seq model with attention. Here's a breakdown of the main components:

**Encoder:** Processes the input sequence and returns the encoder outputs along with the final hidden and cell states.

**Attention:** Implements the Bahdanau attention mechanism, calculating attention weights for each encoder output.

**Decoder:** Generates the output sequence, using the attention mechanism to focus on relevant parts of the input.

**Seq2SeqAttention:** Combines the encoder and decoder, implementing the forward pass of the model.

In [None]:
import torch
import random
import torch.nn as nn

### ENCODER

The encoder is responsible for processing the input sequence and creating a representation that the decoder can use.

#### CODE

In [None]:
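class Encoder(nn.Module):
    def __init__(self, input_dim, embed_dim, hidden_dim, num_layers, dropout):
        super().__init__()
        # input_dim is the source vocabulary size.
        self.embedding = nn.Embedding(input_dim, embed_dim)
        # Sequence-first LSTM; dropout here applies between stacked layers.
        self.rnn = nn.LSTM(embed_dim, hidden_dim, num_layers, dropout=dropout)
        self.dropout = nn.Dropout(dropout)

    def forward(self, src):
        # src: [src_len, batch_size] of token indices
        embedded = self.dropout(self.embedding(src))  # [src_len, batch_size, embed_dim]
        # outputs: [src_len, batch_size, hidden_dim]
        # hidden, cell: [num_layers, batch_size, hidden_dim]
        outputs, (hidden, cell) = self.rnn(embedded)
        return outputs, hidden, cell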

#### EXPLANATION

- The encoder uses an embedding layer to convert input tokens into dense vectors.
- It then uses a multi-layer LSTM (Long Short-Term Memory) to process these embeddings.
- The forward method returns:
    - `outputs`: the top-layer hidden state for each input token (used by the attention mechanism)
    - `hidden`: the final hidden state of each LSTM layer
    - `cell`: the final cell state of each LSTM layer
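
The shapes of these three return values can be checked with a small sketch. It uses a bare `nn.Embedding` and `nn.LSTM` rather than the `Encoder` class itself, and every size below (vocabulary of 20, batch of 3, and so on) is an assumption chosen for illustration, not a value from this notebook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sizes only (assumptions, not taken from the notebook).
vocab_size, embed_dim, hidden_dim, num_layers = 20, 8, 16, 2
src_len, batch_size = 5, 3

embedding = nn.Embedding(vocab_size, embed_dim)
rnn = nn.LSTM(embed_dim, hidden_dim, num_layers)  # sequence-first by default

# src holds token indices with shape [src_len, batch_size].
src = torch.randint(0, vocab_size, (src_len, batch_size))
outputs, (hidden, cell) = rnn(embedding(src))

print(outputs.shape)  # one top-layer hidden state per source position
print(hidden.shape)   # final hidden state of every layer
print(cell.shape)     # final cell state of every layer

# For a unidirectional LSTM, the last row of `outputs` is the final
# hidden state of the top layer.
same_as_top_layer = torch.allclose(outputs[-1], hidden[-1])
```

Because `outputs` keeps one vector per source position, it is the tensor an attention mechanism can weigh, whereas `hidden` and `cell` only summarize the end of the sequence.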

### BAHDANAU ATTENTION

The Bahdanau attention mechanism is a key innovation that allows the decoder to focus on different parts of the input sequence at each decoding step.

#### CODE

In [None]:
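class Attention(nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        # Scores the concatenation [decoder hidden; encoder output].
        self.attn = nn.Linear(hidden_dim * 2, hidden_dim)
        self.v = nn.Linear(hidden_dim, 1, bias=False)

    def forward(self, hidden, encoder_outputs):
        # hidden: [batch_size, hidden_dim]
        # encoder_outputs: [src_len, batch_size, hidden_dim]
        src_len = encoder_outputs.shape[0]

        # Repeat the decoder state once per source position.
        hidden = hidden.unsqueeze(1).repeat(1, src_len, 1)  # [batch_size, src_len, hidden_dim]
        encoder_outputs = encoder_outputs.permute(1, 0, 2)  # [batch_size, src_len, hidden_dim]

        energy = torch.tanh(self.attn(torch.cat((hidden, encoder_outputs), dim=2)))
        attention = self.v(energy).squeeze(2)  # [batch_size, src_len]

        # Normalize over source positions.
        return torch.softmax(attention, dim=1)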

#### EXPLANATION

- The attention mechanism takes two inputs:
    - `hidden`: The current hidden state of the decoder
    - `encoder_outputs`: All hidden states from the encoder

- It calculates an "energy" score for each encoder output:
    - First, it concatenates the decoder's hidden state with each encoder output.
    - This concatenated vector is passed through a linear layer (`self.attn`) and a tanh activation.
    - Another linear layer (`self.v`) reduces this to a single score.


- The energy scores are converted to probabilities using softmax, creating the attention weights.

- These weights determine how much focus to put on each part of the input sequence when generating the next output word.
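
In symbols, the scoring steps above can be written as follows. The notation is introduced here for illustration only: $s$ is the decoder hidden state, $h_i$ the encoder output at source position $i$, $W_a$ and $b_a$ the weight and bias of `self.attn`, and $v$ the weight of `self.v`.

$$
e_i = v^\top \tanh\left(W_a\,[s; h_i] + b_a\right), \qquad
\alpha_i = \frac{\exp(e_i)}{\sum_{j} \exp(e_j)}
$$

Here $[s; h_i]$ denotes concatenation, $e_i$ is the score for position $i$, and $\alpha_i$ is the resulting attention weight.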

The key idea is that the model learns to pay attention to relevant parts of the input sequence, which is especially useful for long sequences or when certain input words are particularly important for the current output word.
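
The scoring procedure can also be traced by hand on small tensors. The sketch below is not the `Attention` class; it uses random stand-in parameters (`W_a`, `b_a`, `v` are hypothetical names) and made-up sizes, purely to show how the scores become a distribution over source positions.

```python
import torch

torch.manual_seed(0)

# Illustrative sizes only (assumptions).
src_len, batch_size, hidden_dim = 4, 2, 6

decoder_hidden = torch.randn(batch_size, hidden_dim)            # [batch, hidden]
encoder_outputs = torch.randn(src_len, batch_size, hidden_dim)  # [src_len, batch, hidden]

# Random stand-ins for learned parameters of the two linear layers.
W_a = torch.randn(hidden_dim, hidden_dim * 2)
b_a = torch.zeros(hidden_dim)
v = torch.randn(hidden_dim)

# Pair the decoder state with every encoder output.
s = decoder_hidden.unsqueeze(1).expand(-1, src_len, -1)  # [batch, src_len, hidden]
h = encoder_outputs.permute(1, 0, 2)                     # [batch, src_len, hidden]
energy = torch.tanh(torch.cat((s, h), dim=2) @ W_a.T + b_a)

scores = energy @ v                     # one score per source position
weights = torch.softmax(scores, dim=1)  # each row sums to 1
print(weights)
```

Each row of `weights` is one example's distribution over its source positions; larger entries mark the positions the decoder would draw on most at this step.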

### DECODER

The decoder generates the output sequence one token at a time, using the attention mechanism to focus on relevant parts of the input.

#### CODE

In [None]:
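class Decoder(nn.Module):
    def __init__(self, output_dim, embed_dim, hidden_dim, num_layers, dropout, attention):
        super().__init__()
        self.output_dim = output_dim  # target vocabulary size
        self.attention = attention
        self.embedding = nn.Embedding(output_dim, embed_dim)
        # The LSTM input is the embedded token concatenated with the context vector.
        self.rnn = nn.LSTM(embed_dim + hidden_dim, hidden_dim, num_layers, dropout=dropout)
        self.fc_out = nn.Linear(hidden_dim * 2 + embed_dim, output_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, input, hidden, cell, encoder_outputs):
        # input: [batch_size]; hidden, cell: [num_layers, batch_size, hidden_dim]
        input = input.unsqueeze(0)  # [1, batch_size]
        embedded = self.dropout(self.embedding(input))  # [1, batch_size, embed_dim]

        # Attention weights from the top-layer hidden state.
        a = self.attention(hidden[-1], encoder_outputs)  # [batch_size, src_len]
        a = a.unsqueeze(1)  # [batch_size, 1, src_len]

        # Context vector: weighted sum of the encoder outputs.
        encoder_outputs = encoder_outputs.permute(1, 0, 2)  # [batch_size, src_len, hidden_dim]
        weighted = torch.bmm(a, encoder_outputs)  # [batch_size, 1, hidden_dim]
        weighted = weighted.permute(1, 0, 2)  # [1, batch_size, hidden_dim]

        rnn_input = torch.cat((embedded, weighted), dim=2)
        output, (hidden, cell) = self.rnn(rnn_input, (hidden, cell))

        # Unnormalized scores (logits) over the target vocabulary.
        predicted = self.fc_out(torch.cat((output.squeeze(0), weighted.squeeze(0), embedded.squeeze(0)), dim=1))

        return predicted, hidden, cell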

#### EXPLANATION

- The decoder first embeds the input token (which is either the true previous token during training, or the predicted previous token during inference).
- It then uses the attention mechanism to compute attention weights over the encoder outputs.
- These weights are used to create a weighted sum of the encoder outputs, called the context vector.
- The embedded input is concatenated with the context vector and fed into the LSTM.
- The output of the LSTM is concatenated with the context vector and the embedded input, then passed through a final linear layer to produce unnormalized scores (logits) over the output vocabulary. Applying softmax to these scores yields a probability distribution.
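
Two details of this step can be made concrete with random stand-in tensors (all sizes are assumptions for illustration). First, `torch.bmm` applied to the attention weights and the encoder outputs is a per-example weighted sum. Second, the scores coming out of a final linear layer are unnormalized, so turning them into probabilities takes an explicit softmax.

```python
import torch

torch.manual_seed(0)

src_len, batch_size, hidden_dim, vocab_size = 4, 2, 6, 10  # illustrative sizes

# Stand-in attention weights: one distribution over source positions per example.
a = torch.softmax(torch.randn(batch_size, src_len), dim=1)  # [batch, src_len]
encoder_outputs = torch.randn(src_len, batch_size, hidden_dim)

# Context vector via batched matrix multiply, as in the decoder above.
enc = encoder_outputs.permute(1, 0, 2)     # [batch, src_len, hidden]
weighted = torch.bmm(a.unsqueeze(1), enc)  # [batch, 1, hidden]

# The same quantity written as an explicit weighted sum.
manual = (a.unsqueeze(2) * enc).sum(dim=1)  # [batch, hidden]
matches = torch.allclose(weighted.squeeze(1), manual, atol=1e-6)

# Stand-in for the decoder's raw vocabulary scores (logits).
logits = torch.randn(batch_size, vocab_size)
probs = torch.softmax(logits, dim=1)  # rows now sum to 1
next_token = probs.argmax(dim=1)      # same choice as logits.argmax(dim=1)
```

Since softmax preserves ordering, greedy decoding can take `argmax` of the raw scores directly; the explicit softmax only matters when actual probabilities are needed.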

### SEQ-SEQ WITH ATTENTION

This class combines the encoder and decoder into a single model.

#### CODE

In [None]:
class Seq2SeqAttention(nn.Module):
    def __init__(self, encoder, decoder):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder

    def forward(self, src, trg, teacher_forcing_ratio=0.5):
        # src: [src_len, batch_size]; trg: [trg_len, batch_size]
        batch_size = src.shape[1]
        trg_len = trg.shape[0]
        trg_vocab_size = self.decoder.output_dim

        # outputs[0] is never written and stays zero.
        outputs = torch.zeros(trg_len, batch_size, trg_vocab_size, device=src.device)
        encoder_outputs, hidden, cell = self.encoder(src)

        # The first decoder input is the first target token (typically the start-of-sequence token).
        input = trg[0, :]

        for t in range(1, trg_len):
            output, hidden, cell = self.decoder(input, hidden, cell, encoder_outputs)
            outputs[t] = output
            # Feed the ground-truth token with probability teacher_forcing_ratio.
            teacher_force = random.random() < teacher_forcing_ratio
            top1 = output.argmax(1)
            input = trg[t] if teacher_force else top1

        return outputs

#### EXPLANATION

This class handles the full sequence-to-sequence process:

- It first encodes the entire input sequence.
- Then, it decodes one token at a time, using either the true previous token (teacher forcing) or the predicted previous token as input for the next step.
- At each step, teacher forcing is applied with probability `teacher_forcing_ratio`. This trades off training stability (favored by ground-truth inputs) against exposure bias (reduced by letting the model see its own predictions).
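
The input-selection rule can be isolated in a few lines of plain Python. The helper `choose_inputs` and the toy token lists are hypothetical, introduced only for illustration; the coin flip mirrors the `random.random() < teacher_forcing_ratio` test in the loop above.

```python
import random


def choose_inputs(targets, predictions, teacher_forcing_ratio, seed=0):
    """Return the token fed to the decoder after each step.

    targets[t] is the ground-truth token at step t and predictions[t] is
    the model's guess at step t (both are hypothetical stand-ins here).
    """
    rng = random.Random(seed)
    chosen = []
    for truth, guess in zip(targets, predictions):
        teacher_force = rng.random() < teacher_forcing_ratio
        chosen.append(truth if teacher_force else guess)
    return chosen


targets = ["a", "cat", "sat", "down"]
predictions = ["a", "dog", "sat", "up"]

always_truth = choose_inputs(targets, predictions, teacher_forcing_ratio=1.0)
always_model = choose_inputs(targets, predictions, teacher_forcing_ratio=0.0)
mixed = choose_inputs(targets, predictions, teacher_forcing_ratio=0.5)
print(always_truth, always_model, mixed)
```

A ratio of 1.0 always feeds the ground truth, while 0.0 always feeds the model's own prediction, which is the situation the model faces at inference time.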

The Bahdanau attention mechanism is a crucial part of this model. It allows the decoder to focus on different parts of the input sequence at each decoding step, which is particularly useful for translation tasks where word order may differ between languages or where certain words may require context from various parts of the input sentence to translate correctly.
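
One practical consequence of the decoding loop starting at `t = 1` is that `outputs[0]` is never written and stays all zeros, so a loss computed on the model output would typically skip that first position. A minimal sketch, using random tensors in place of real model output; the sizes and the padding index `pad_idx` are assumptions for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

trg_len, batch_size, vocab_size = 6, 3, 12  # illustrative sizes
pad_idx = 0                                 # hypothetical padding token index

# Stand-ins for the model output and the target batch.
outputs = torch.randn(trg_len, batch_size, vocab_size)
outputs[0] = 0.0  # mirrors the unused first row
trg = torch.randint(1, vocab_size, (trg_len, batch_size))

# Drop position 0, then flatten so each row is one prediction.
logits = outputs[1:].reshape(-1, vocab_size)  # [(trg_len - 1) * batch, vocab]
labels = trg[1:].reshape(-1)                  # [(trg_len - 1) * batch]

# ignore_index keeps padded positions out of the average.
criterion = nn.CrossEntropyLoss(ignore_index=pad_idx)
loss = criterion(logits, labels)
print(loss.item())
```

`nn.CrossEntropyLoss` expects raw scores rather than probabilities, so no softmax is applied before the loss.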

## CODE - TRAINING LOOP FOR TRANSLATION

1. **Data Processing:**

- Loads and tokenizes the Multi30k dataset (German to English translation)
- Builds vocabularies for both languages
- Creates data iterators for batching

2. **Model Architecture:**

- Implements an Encoder-Decoder architecture with attention
- The Encoder uses an LSTM to process the input sequence
- The Decoder uses another LSTM along with an attention mechanism to generate the output sequence

3. **Training:**

- Defines training and evaluation loops
- Uses Adam optimizer and CrossEntropyLoss
- Implements teacher forcing during training
- Teacher forcing trains a recurrent network by feeding the ground-truth token from the previous time step as the next input, rather than the model's own previous prediction.

4. **Evaluation:**

- **BLEU Score:** BLEU evaluates machine translations by measuring n-gram (word and phrase) overlap with human reference translations. It ranges from 0 to 1 (often reported scaled to 0 to 100), with higher scores indicating closer agreement with the references.
- Provides functions to translate individual sentences
- Calculates BLEU score on the test set to evaluate model performance
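
To make the BLEU description concrete, here is a simplified sketch in plain Python: sentence-level, single-reference, and unsmoothed. It is not the evaluation code used in this notebook, and library implementations usually aggregate counts over the whole test set. The helper names `ngrams` and `sentence_bleu` are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate, reference, max_n=4):
    """Unsmoothed single-reference BLEU for one sentence pair."""
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(candidate, n)
        ref_counts = ngrams(reference, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # any zero precision makes the geometric mean zero
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty discourages candidates shorter than the reference.
    if len(candidate) >= len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / len(candidate))
    return bp * math.exp(sum(log_precisions) / max_n)


reference = "a man is riding a bike".split()
perfect = sentence_bleu(reference, reference)
partial = sentence_bleu("a man is riding a horse".split(), reference)
unrelated = sentence_bleu("two dogs play outside today".split(), reference)
print(perfect, partial, unrelated)
```

An exact match scores 1.0 and a candidate with no shared words scores 0.0. Without smoothing, any sentence lacking a matching 4-gram also scores 0, which is one reason BLEU is usually computed at the corpus level.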

#### **Main Execution:**

- Creates the dataset and dataloader
- Creates the model
- Trains the model for a specified number of epochs
- Saves the best model based on validation loss
- Calculates and prints the BLEU score
- Provides an example translation
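
The "save the best model" step can be sketched as follows. The tiny `nn.Linear` model, the made-up validation losses, and the checkpoint path are all stand-ins for illustration; a real run would compute each validation loss from an evaluation pass.

```python
import os
import tempfile

import torch
import torch.nn as nn

model = nn.Linear(4, 2)                      # tiny stand-in for the real model
val_losses = [2.31, 1.87, 1.92, 1.54, 1.60]  # made-up per-epoch validation losses

checkpoint_path = os.path.join(tempfile.mkdtemp(), "best_model.pt")  # hypothetical path
best_val_loss = float("inf")
saved_epochs = []

for epoch, val_loss in enumerate(val_losses, start=1):
    # In a real run, a training pass and a validation pass would happen here.
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        torch.save(model.state_dict(), checkpoint_path)
        saved_epochs.append(epoch)

# Reload the best weights before computing BLEU or translating examples.
model.load_state_dict(torch.load(checkpoint_path))
print(best_val_loss, saved_epochs)
```

Saving only on improvement means the checkpoint on disk always holds the weights with the lowest validation loss seen so far, even if later epochs overfit.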