Based on the official documentation from https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html


### Section 1: Introduction to Sequence-to-Sequence (Seq2Seq) Translation



#### 1.1 Objective
   - Understand Seq2Seq's purpose: primarily used for transforming input sequences to output sequences, commonly in translation tasks.
   - Context of use: often in language translation, speech recognition, and text generation.



#### 1.2 Background
   - **Recurrent Neural Networks (RNNs)**:
      - RNNs are specialized for sequence data.
      - Their architecture allows retaining information from previous time steps, making them useful for sequence-to-sequence tasks.

   - **Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU)**:
      - Popular RNN variants for Seq2Seq as they address the vanishing gradient problem.
      - LSTM and GRU can retain information over longer sequences, making them more effective for complex translations.


Note:
  - **Long Short-Term Memory (LSTM)** and **Gated Recurrent Units (GRU)** are popular variants of Recurrent Neural Networks (RNNs) widely used in **sequence-to-sequence (Seq2Seq)** models, particularly for tasks like **machine translation, text summarization,** and **speech recognition**. Both LSTM and GRU architectures were developed to address the **vanishing gradient problem** commonly encountered in traditional RNNs, which limits their ability to capture long-term dependencies in sequences.

 1. Long Short-Term Memory (LSTM)

    LSTMs use a gated internal structure to capture long-term dependencies in sequences by selectively remembering and forgetting information. Each LSTM cell includes three main gates:

    - **Forget Gate**: Decides what information to discard from the cell state.
    - **Input Gate**: Determines what new information to add to the cell state.
    - **Output Gate**: Controls how much of the cell state is exposed as the hidden state, which is passed on to the next time step.

    **Advantages**:
      - **Effective for Long Sequences**: LSTMs are well-suited to handle long sequences due to their ability to manage information through gates. This makes them a popular choice for Seq2Seq tasks with lengthy inputs and outputs.
      - **Mitigates Vanishing Gradient**: The cell state carries information across time steps, and the gates control how much gradient flows along it, so LSTMs suffer far less from vanishing gradients than vanilla RNNs.

    **Use Cases**:
      - LSTMs are often used in translation models where dependencies span many tokens, such as long sentences or paragraphs in which context from earlier in the sentence is relevant at later steps.

 2. Gated Recurrent Units (GRU)

    GRUs are a simplified variant of LSTMs: they also use gating mechanisms but keep a single hidden state instead of a separate cell state, which makes them computationally cheaper. GRUs have two main gates:

    - **Update Gate**: Decides how much of the previous memory to retain.
    - **Reset Gate**: Controls how much of the previous hidden state is used when computing the new candidate state, which lets the unit drop stale context when only short-range dependencies matter.

    **Advantages**:
      - **Computationally Efficient**: GRUs are typically faster to train than LSTMs because they have fewer parameters. This makes GRUs a good choice when computational resources are limited or when faster training is required.
      - **Simplicity and Performance**: Despite being simpler than LSTMs, GRUs perform comparably on many tasks and sometimes outperform LSTMs on shorter sequences or smaller datasets.

    **Use Cases**:
      - GRUs are used in Seq2Seq models where speed and efficiency are prioritized, such as real-time applications where lower latency is essential.



 - Why They’re Effective for Seq2Seq Tasks

  - **Ability to Retain Context**: Both LSTMs and GRUs can retain context over long sequences, which is crucial for Seq2Seq tasks that involve dependencies across time steps, such as translating a sentence where the meaning of later words depends on the beginning.
  - **Better Gradient Flow**: The gating mechanisms allow gradients to propagate more effectively across many time steps, mitigating (though not eliminating) the vanishing gradient problem and enabling the models to learn longer dependencies.
  - **Flexibility in Sequence Lengths**: LSTMs and GRUs can handle variable-length input and output sequences, making them versatile for tasks like text translation, where sentences vary in length.
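
The claim that GRUs have fewer parameters than LSTMs can be checked directly. The sketch below counts the trainable parameters of a single-layer `nn.GRU` and `nn.LSTM` of the same size. The sizes (20 and 20) and the helper name `count_parameters` are illustrative assumptions, not values taken from the tutorial.

```python
import torch.nn as nn


def count_parameters(module):
    """Total number of trainable parameters in a module (hypothetical helper)."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)


input_size, hidden_size = 20, 20  # illustrative sizes, not tuned values

gru = nn.GRU(input_size, hidden_size)
lstm = nn.LSTM(input_size, hidden_size)

gru_params = count_parameters(gru)
lstm_params = count_parameters(lstm)

# A GRU layer stores 3 weight blocks (reset, update, candidate);
# an LSTM layer stores 4 (input, forget, cell, output).
print("GRU parameters: ", gru_params)
print("LSTM parameters:", lstm_params)
print("LSTM / GRU ratio:", lstm_params / gru_params)
```

With equal sizes the ratio comes out at 4/3, which matches the 4-versus-3 weight blocks. Whether that translates into a noticeable speed-up depends on the hardware and sequence lengths involved.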



#### 1.3 Model Overview: Seq2Seq Architecture
   - **Encoder-Decoder Structure**:
      - **Encoder**: Processes input sequence and outputs a "context vector" containing encoded information.
      - **Decoder**: Takes the context vector and generates the target sequence.
      


  - How the Encoder-Decoder Structure Works (In Simple Terms)

    -  **What the Encoder Does:**
      - Think of the encoder as a person reading a sentence in one language (like English) and trying to understand its meaning.
      - The encoder reads each word in the input sentence, one by one, and remembers important details as it goes along.
      - When it’s done reading, the encoder “summarizes” the whole sentence’s meaning into a single collection of numbers called the **context vector**.

      **Example**:
      - Input sentence: "I love learning languages."
      - The encoder reads each word, processes it, and creates a context vector representing the sentence’s meaning (e.g., `[0.8, -0.5, 1.2, ...]`).

    - **What the Decoder Does:**
      - The decoder now takes the context vector and “translates” or generates a new sentence in the target language.
      - It generates the output word-by-word, using the information in the context vector to create a meaningful translation.

      **Example**:
      - With the context vector created from "I love learning languages," the decoder might translate it to French by generating words one by one to form "J'aime apprendre les langues."

    - **How They Work Together (Full Process):**
      - The encoder first reads and processes the entire input sentence to create the context vector.
      - The decoder takes this context vector and starts generating the translated sentence.
      - The model is trained to make this process smooth, so it produces the most accurate sentence possible.


- Demonstration:

In [None]:
import torch
import torch.nn as nn

# Sample encoder and decoder to understand context vector passing
class EncoderRNN(nn.Module):
    def __init__(self, input_size, hidden_size):
        # Initialize the encoder RNN module with input and hidden size parameters.
        super(EncoderRNN, self).__init__()

        # Define hidden_size as an instance variable to be used within the encoder.
        self.hidden_size = hidden_size

        # Embedding layer maps the input tokens to dense vectors of dimension hidden_size.
        self.embedding = nn.Embedding(input_size, hidden_size)

        # GRU (Gated Recurrent Unit) layer to process the embedded input sequence.
        # Takes inputs of size hidden_size and outputs hidden_size as well.
        self.gru = nn.GRU(hidden_size, hidden_size)

    def forward(self, input, hidden):
        # Embed the input token and reshape for GRU compatibility: [1, 1, hidden_size]
        embedded = self.embedding(input).view(1, 1, -1)

        # Pass the embedded input and initial hidden state through the GRU.
        output, hidden = self.gru(embedded, hidden)

        # Return the GRU output and the updated hidden state. For a single-layer,
        # unidirectional GRU processing one time step, both hold the same values;
        # the hidden state is passed back in at the next time step.
        return output, hidden

    def init_hidden(self):
        # Initialize hidden state to zero for the first step of the GRU.
        # Shape: [num_layers * num_directions, batch_size, hidden_size]
        # Here, num_layers * num_directions = 1 (single layer, unidirectional).
        return torch.zeros(1, 1, self.hidden_size)

# Decoder RNN with a similar GRU-based architecture to handle the encoded context.
class DecoderRNN(nn.Module):
    def __init__(self, hidden_size, output_size):
        # Initialize the decoder RNN with hidden size and output size parameters.
        super(DecoderRNN, self).__init__()

        # Define hidden_size as an instance variable to be used within the decoder.
        self.hidden_size = hidden_size

        # Embedding layer to map the target sequence tokens to dense vectors.
        self.embedding = nn.Embedding(output_size, hidden_size)

        # GRU layer takes in the embedded input and processes it similarly to the encoder.
        self.gru = nn.GRU(hidden_size, hidden_size)

        # Linear layer to map GRU output to the vocabulary space (output_size).
        self.out = nn.Linear(hidden_size, output_size)

        # LogSoftmax activation for normalized log-probabilities over vocabulary.
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, input, hidden):
        # Embed the input token for the current time step and reshape: [1, 1, hidden_size]
        output = self.embedding(input).view(1, 1, -1)

        # Apply ReLU activation to add non-linearity to the embedded input.
        output = torch.relu(output)

        # Pass the processed input and previous hidden state to the GRU.
        output, hidden = self.gru(output, hidden)

        # Map GRU output to vocabulary space and apply LogSoftmax to get log-probabilities.
        output = self.softmax(self.out(output[0]))

        # Return the output (log-probabilities of vocabulary) and the updated hidden state.
        return output, hidden

# Initialize encoder and decoder
# Encoder input size = vocabulary size, hidden size = dimensionality of embeddings and GRU output.
encoder = EncoderRNN(input_size=10, hidden_size=20)

# Initialize hidden state for the encoder.
encoder_hidden = encoder.init_hidden()

# Decoder output size = target vocabulary size; hidden size must match the encoder's hidden size.
decoder = DecoderRNN(hidden_size=20, output_size=10)

# The decoder's initial hidden state is the encoder's hidden state.
# Nothing has been encoded yet, so it is still all zeros here.
decoder_hidden = encoder_hidden

print("Encoder Architecture:")
print(encoder)
print("-" * 40)
print("Decoder Architecture:")
print(decoder)

print("-" * 40)
print("Encoder Hidden State:")
print(encoder_hidden)
shape = encoder_hidden.shape
print("size", shape)
print("-" * 40)


print("Decoder Hidden State:")
print(decoder_hidden)

print("-" * 40)

Encoder Architecture:
EncoderRNN(
  (embedding): Embedding(10, 20)
  (gru): GRU(20, 20)
)
----------------------------------------
Decoder Architecture:
DecoderRNN(
  (embedding): Embedding(10, 20)
  (gru): GRU(20, 20)
  (out): Linear(in_features=20, out_features=10, bias=True)
  (softmax): LogSoftmax(dim=1)
)
----------------------------------------
Encoder Hidden State:
tensor([[[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]]])
size torch.Size([1, 1, 20])
----------------------------------------
Decoder Hidden State:
tensor([[[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]]])
----------------------------------------


##### Encoder Architecture

```
EncoderRNN(
  (embedding): Embedding(10, 20)
  (gru): GRU(20, 20)
)
```

- **Embedding Layer**: `Embedding(10, 20)`
  - This layer converts input token IDs into dense vectors of size 20. The input vocabulary size is 10, so there are 10 possible token IDs that can be mapped to these embeddings.
  - Each input token ID is transformed into a 20-dimensional vector before being passed to the GRU.

- **GRU Layer**: `GRU(20, 20)`
  - The GRU layer has an input size and hidden state size of 20. This means that each 20-dimensional embedding from the input passes through the GRU, and the GRU also has a 20-dimensional hidden state.
  - This GRU layer updates its hidden state based on the current input token’s embedding and the previous hidden state.

---

##### Decoder Architecture

```
DecoderRNN(
  (embedding): Embedding(10, 20)
  (gru): GRU(20, 20)
  (out): Linear(in_features=20, out_features=10, bias=True)
  (softmax): LogSoftmax(dim=1)
)
```

- **Embedding Layer**: `Embedding(10, 20)`
  - Similar to the encoder, the decoder also has an embedding layer that maps input token IDs to 20-dimensional embeddings, with a vocabulary size of 10.

- **GRU Layer**: `GRU(20, 20)`
  - This GRU layer also has input and hidden sizes of 20. It takes the 20-dimensional embedding of the previous token (or the start token) as input, along with the hidden state (which initially comes from the encoder).
  
- **Output Layer**: `Linear(in_features=20, out_features=10, bias=True)`
  - A fully connected (linear) layer that maps the 20-dimensional GRU output to a 10-dimensional vector. Each dimension in this output vector represents the logit for a token in the output vocabulary.
  
- **Softmax Layer**: `LogSoftmax(dim=1)`
  - The LogSoftmax function is applied along dimension 1 to convert the output logits into log-probabilities. A model that ends in `LogSoftmax` is typically trained with `NLLLoss`; `CrossEntropyLoss` expects raw logits instead, because it applies log-softmax internally. Working in log space keeps the calculation numerically stable.
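
The relationship between `LogSoftmax` and the loss functions can be verified numerically. The sketch below feeds the same logits through two routes: `LogSoftmax` followed by `NLLLoss`, and raw logits into `CrossEntropyLoss`. The logits are random and the target index `3` is an arbitrary assumption for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

logits = torch.randn(1, 10)   # raw scores for a 10-token vocabulary (batch of 1)
target = torch.tensor([3])    # hypothetical index of the correct token

# Route 1: LogSoftmax inside the model, NLLLoss as the criterion.
log_probs = nn.LogSoftmax(dim=1)(logits)
loss_nll = nn.NLLLoss()(log_probs, target)

# Route 2: raw logits go straight into CrossEntropyLoss,
# which applies log-softmax internally.
loss_ce = nn.CrossEntropyLoss()(logits, target)

print("NLLLoss on log-probabilities:", loss_nll.item())
print("CrossEntropyLoss on logits:  ", loss_ce.item())
print("Equal:", torch.allclose(loss_nll, loss_ce))
```

Passing log-probabilities into `CrossEntropyLoss` would apply log-softmax twice, so a decoder that already ends in `LogSoftmax`, like the one above, is the case where `NLLLoss` fits.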

---

##### Initial Hidden State

The initial hidden state for both the encoder and decoder is a tensor of zeros, which serves as a starting point for the GRU layers.

###### Encoder Hidden State

```
tensor([[[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]]])
size torch.Size([1, 1, 20])
```

- **Shape Explanation**: `torch.Size([1, 1, 20])`
  - The hidden state has a shape of `[1, 1, 20]`:
    - `1` (first dimension) is `num_layers * num_directions`. This is a single-layer, unidirectional GRU, so the value is 1.
    - `1` (second dimension) represents the batch size. In this case, each example is processed one at a time.
    - `20` (third dimension) represents the size of the hidden state, matching the dimensionality specified in the GRU.

- **Contents**: The tensor is initialized to zeros, which is typical for the initial hidden state in RNN-based models.
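
To see how the first dimension of the hidden state changes with the GRU configuration, the sketch below builds a two-layer bidirectional GRU. This configuration is an illustrative assumption and is not the one used in the demonstration.

```python
import torch
import torch.nn as nn

hidden_size = 20
seq_len, batch_size = 4, 1
num_layers, num_directions = 2, 2  # illustrative: two stacked layers, both directions

gru = nn.GRU(input_size=20, hidden_size=hidden_size,
             num_layers=num_layers, bidirectional=True)

x = torch.zeros(seq_len, batch_size, 20)  # (seq_len, batch, input_size)
h0 = torch.zeros(num_layers * num_directions, batch_size, hidden_size)

output, h_n = gru(x, h0)

# h_n: (num_layers * num_directions, batch, hidden_size)
print("h_n shape:   ", h_n.shape)
# output: (seq_len, batch, num_directions * hidden_size)
print("output shape:", output.shape)
```

With the single-layer unidirectional GRU from the demonstration, both products collapse to 1, which is why the hidden state there has shape `[1, 1, 20]`.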

###### Decoder Hidden State

```
tensor([[[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]]])
```

- The initial hidden state of the decoder is also set to zeros here. However, in a typical Seq2Seq setup, the decoder’s initial hidden state would be initialized with the final hidden state from the encoder, allowing it to "carry over" the context from the input sequence.



   - **Encoder Output (Context Vector)**:
      - Captures the entire source sentence's meaning and passes it to the decoder.
      - Decoder then uses this context to generate the target sentence.



#### 1.4 Training Focus
   - During training, both encoder and decoder are optimized to minimize the difference between predicted and actual target sequences.
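   - **Teacher Forcing**: During training, the decoder can be fed the ground-truth target token from the previous step as its next input, instead of its own prediction. This usually speeds up convergence, although a model trained only this way can be less stable when it has to rely on its own predictions at inference time.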
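
The effect of teacher forcing on what the decoder sees can be sketched without any neural network. In the example below, `toy_predict` is a hypothetical stand-in for the decoder's greedy prediction, `SOS_TOKEN = 0` is an assumed start-of-sequence ID, and the per-step random choice is one possible scheme (deciding once per sequence is another).

```python
import random

SOS_TOKEN = 0  # hypothetical start-of-sequence token ID


def decoder_inputs(target_ids, predict_next, teacher_forcing_ratio=0.5, rng=None):
    """Return the token fed to the decoder at each step.

    predict_next(prev_token) stands in for the decoder's own prediction.
    """
    rng = rng or random.Random(0)
    inputs = []
    prev = SOS_TOKEN
    for target in target_ids:
        inputs.append(prev)
        predicted = predict_next(prev)
        if rng.random() < teacher_forcing_ratio:
            prev = target      # teacher forcing: feed the ground-truth token
        else:
            prev = predicted   # free running: feed the model's own prediction
    return inputs


def toy_predict(token):
    """Toy 'model' that always predicts the next integer (illustration only)."""
    return (token + 1) % 10


target = [5, 6, 7, 8]
print(decoder_inputs(target, toy_predict, teacher_forcing_ratio=1.0))  # [0, 5, 6, 7]
print(decoder_inputs(target, toy_predict, teacher_forcing_ratio=0.0))  # [0, 1, 2, 3]
```

With a ratio of 1.0 the decoder input is simply the target sequence shifted right by one position. With a ratio of 0.0 the decoder only ever sees its own outputs, as it does at inference time.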



#### 1.5 Practical Use Cases of Seq2Seq Models
   - **Language Translation**: Converts sentences from one language to another.
   - **Chatbots**: Generates appropriate responses based on input queries.
   - **Text Summarization**: Condenses long pieces of text into brief summaries.



#### Demonstration Checkpoint
   - Code to initialize both encoder and decoder, with an example of passing data through the encoder to produce a hidden state:

In [None]:

# 1. Encoder reads the sentence "I love learning languages."
input_sentence = torch.tensor([1, 2, 3, 4])  # Assume these integers represent the tokens of the sentence
print("Input Sentence Tokens:", input_sentence.tolist())

# Initialize the encoder's hidden state (all zeros initially)
encoder_hidden = encoder.init_hidden()
print("Initial Encoder Hidden State:", encoder_hidden)
print("size", encoder_hidden.size())
# encoder_hidden.size() is [1, 1, 20], i.e. [num_layers * num_directions, batch_size, hidden_size]:
#   - num_layers: the number of stacked RNN layers. EncoderRNN does not set it, so it defaults to 1.
#   - num_directions: 2 for a bidirectional RNN, otherwise 1. This GRU is unidirectional, so it is 1.
#   - batch_size: 1, because we are processing one sequence (sentence) at a time.
#   - hidden_size: the dimensionality of the hidden state, which we specified as 20 in EncoderRNN.
# So: [1 * 1, 1, 20] = [1, 1, 20]
print("\nEncoder Architecture:")
print(encoder)
print("-" * 40)


Input Sentence Tokens: [1, 2, 3, 4]
Initial Encoder Hidden State: tensor([[[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]]])
size torch.Size([1, 1, 20])

Encoder Architecture:
EncoderRNN(
  (embedding): Embedding(10, 20)
  (gru): GRU(20, 20)
)
----------------------------------------


In [None]:

# Encoder processes each word, updating its hidden state at each step
for i, word in enumerate(input_sentence):
    print(f"\nEncoding word {i+1} with token {word.item()}")
    encoder_output, encoder_hidden = encoder(word, encoder_hidden)
    print("Encoder Output at this step:", encoder_output)
    print("Updated Encoder Hidden State:", encoder_hidden)
    print("-" * 40)

# The final hidden state of the encoder is the "context vector"
context_vector = encoder_hidden
print("\nFinal Context Vector from Encoder:", context_vector)
print(shape := context_vector.shape)
# shape is torch.Size([1, 1, 20]):
# [num_layers * num_directions, batch_size, hidden_size] = [1 * 1, 1, 20],
# because the encoder is a single-layer, unidirectional GRU with hidden_size 20
# that processes one sequence (sentence) at a time.
print("-" * 40)



Encoding word 1 with token 1
Encoder Output at this step: tensor([[[-0.0648, -0.0025, -0.0313,  0.2967, -0.2079,  0.0492, -0.0844,
           0.0437,  0.1022, -0.1832,  0.1682,  0.1017, -0.0553,  0.1413,
           0.2185,  0.2790, -0.1875,  0.3304, -0.0245, -0.2174]]],
       grad_fn=<StackBackward0>)
Updated Encoder Hidden State: tensor([[[-0.0648, -0.0025, -0.0313,  0.2967, -0.2079,  0.0492, -0.0844,
           0.0437,  0.1022, -0.1832,  0.1682,  0.1017, -0.0553,  0.1413,
           0.2185,  0.2790, -0.1875,  0.3304, -0.0245, -0.2174]]],
       grad_fn=<StackBackward0>)
----------------------------------------

Encoding word 2 with token 2
Encoder Output at this step: tensor([[[-0.0248, -0.1319, -0.2588, -0.1063, -0.3861, -0.1231,  0.2801,
          -0.0507,  0.3570, -0.0026,  0.6982, -0.0689, -0.3127,  0.0483,
          -0.4812, -0.3486,  0.1057, -0.2017, -0.4577, -0.4893]]],
       grad_fn=<StackBackward0>)
Updated Encoder Hidden State: tensor([[[-0.0248, -0.1319, -0.2588, -0.106

In [None]:

# 2. Decoder starts with this context vector and begins generating the translation
output_sentence = []  # Initialize an empty list to store the generated tokens of the output sentence
decoder_input = torch.tensor([0])  # Initial decoder input (often a 'start' token)
print("\nInitial Decoder Input (start token):", decoder_input.item())

# Begin the decoding process to generate the output sentence.
# The decoder's hidden state starts out as the encoder's context vector.
decoder_hidden = context_vector
for i in range(4):  # Generate a 4-word output sentence as an example
    print(f"\nDecoding step {i+1}")

    # Pass the current input and the running hidden state to the decoder;
    # the hidden state is updated at every step.
    decoder_output, decoder_hidden = decoder(decoder_input, decoder_hidden)

    # decoder_output contains log-probabilities for each token in the vocabulary
    print("Decoder Output (log-probabilities over vocabulary):", decoder_output)

    # Choose the word with the highest probability
    next_word = decoder_output.argmax(dim=1).item()
    output_sentence.append(next_word)
    print("Generated Word (token):", next_word)

    # Set the decoder's next input to the word just generated
    decoder_input = torch.tensor([next_word])
    print("Next Decoder Input:", decoder_input.item())

# Output the final generated sentence
print("\nGenerated Output Sentence Tokens:", output_sentence)



Initial Decoder Input (start token): 0

Decoding step 1
Decoder Output (log-probabilities over vocabulary): tensor([[-2.6561, -2.0425, -2.5580, -2.0883, -2.0079, -2.3423, -2.1447, -2.6558,
         -2.3995, -2.4055]], grad_fn=<LogSoftmaxBackward0>)
Generated Word (token): 4
Next Decoder Input: 4

Decoding step 2
Decoder Output (log-probabilities over vocabulary): tensor([[-2.8051, -2.1524, -2.4980, -1.9632, -2.0954, -2.2619, -2.1049, -2.6580,
         -2.4242, -2.3753]], grad_fn=<LogSoftmaxBackward0>)
Generated Word (token): 3
Next Decoder Input: 3

Decoding step 3
Decoder Output (log-probabilities over vocabulary): tensor([[-2.7122, -2.2422, -2.5320, -1.9993, -2.0923, -2.2665, -2.0618, -2.5692,
         -2.4998, -2.3080]], grad_fn=<LogSoftmaxBackward0>)
Generated Word (token): 3
Next Decoder Input: 3

Decoding step 4
Decoder Output (log-probabilities over vocabulary): tensor([[-2.6756, -2.2843, -2.5348, -2.0194, -2.0989, -2.2689, -2.0460, -2.5284,
         -2.5392, -2.2702]], grad_fn

### Section 2: Data Preparation and Processing



#### 2.1 Objective
   - Learn how to format language data for input into Seq2Seq models.
   - Standardize data formats to ensure smooth processing by the model.



#### 2.2 Steps in Data Preparation

1. **Tokenization**:
   - **Purpose**: Convert each sentence into tokens (words or subwords) that the model can process.
   - Each token represents a word or part of a word, transformed into a unique number (integer).
   - **Example**:
      - Sentence: "I love learning languages."
      - Tokens: `[I, love, learning, languages, .]`
      - Token IDs: `[1, 2, 3, 4, 5]`
   - **Code Demonstration**:


In [None]:
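import nltk
# Download the Punkt tokenizer data used by word_tokenize.
# Newer NLTK releases may look for 'punkt_tab' instead; if word_tokenize raises
# a LookupError, try nltk.download('punkt_tab').
nltk.download('punkt')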
# Importing the word_tokenize function from the Natural Language Toolkit (NLTK) library.
# This function splits a sentence into individual words or tokens.
from nltk.tokenize import word_tokenize

# Defining a sentence to tokenize.
sentence = "I love learning languages."

# Converting the sentence to lowercase with .lower() to ensure uniformity.
# This step is useful when we want case-insensitive processing.
# For example, "Learning" and "learning" would both be treated as the same word.
tokens = word_tokenize(sentence.lower())

# Printing the resulting tokens to the console.
# Expected output: a list of lowercase words, e.g., ['i', 'love', 'learning', 'languages', '.']
print("Tokens:", tokens)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Tokens: ['i', 'love', 'learning', 'languages', '.']
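
When NLTK is not available, a rough tokenizer can be sketched with the standard library. The function below is a simplification and not a replacement for `word_tokenize`: the name `simple_tokenize` and the regular expression are assumptions made for illustration, and the two can differ on contractions, abbreviations, and similar cases.

```python
import re


def simple_tokenize(text):
    """Lowercase the text, then split it into word tokens and single punctuation marks."""
    # \w+      matches a run of letters, digits, or underscores (a "word")
    # [^\w\s]  matches one character that is neither a word character nor whitespace
    return re.findall(r"\w+|[^\w\s]", text.lower())


tokens = simple_tokenize("I love learning languages.")
print("Tokens:", tokens)  # ['i', 'love', 'learning', 'languages', '.']
```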



2. **Vocabulary Creation**:
   - **Purpose**: Build a vocabulary, mapping each unique token to a unique index, to handle known words.
   - This vocabulary is used for converting words to their respective token IDs during model training.
   - **Example**:
      - Vocabulary: `{"I": 1, "love": 2, "learning": 3, "languages": 4, ".": 5}`
   - **Code Demonstration**:


In [None]:
# Creating a vocabulary dictionary from the list of tokens.
# The set(tokens) function removes duplicate words, as a vocabulary only needs unique terms.
# Using a dictionary comprehension, each word in the vocabulary is mapped to a unique index.
# The enumerate function assigns each word an index, starting from 1.
# Starting from 1 (instead of 0) is common in NLP when 0 is reserved for padding tokens or special purposes.
# Note: a set has no guaranteed order, so which word receives which index can change between runs.

vocabulary = {word: idx for idx, word in enumerate(set(tokens), start=1)}

# Printing the created vocabulary dictionary.
# Each unique word in `tokens` gets a unique index from 1 to 5,
# but not necessarily in the order the words appear in the sentence.
print("Vocabulary:", vocabulary)


Vocabulary: {'.': 1, 'languages': 2, 'i': 3, 'love': 4, 'learning': 5}
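
Because a `set` has no guaranteed order, the indices above can change from run to run. One way to get a reproducible mapping is to assign indices in order of first appearance and reserve the lowest IDs for special tokens. The sketch below assumes the special tokens `<PAD>` and `<UNK>` and a hypothetical helper name `build_vocabulary`.

```python
PAD_TOKEN, UNK_TOKEN = "<PAD>", "<UNK>"  # assumed special tokens


def build_vocabulary(tokens):
    """Map each token to an index in order of first appearance.

    Index 0 is reserved for padding and index 1 for unknown words.
    """
    vocabulary = {PAD_TOKEN: 0, UNK_TOKEN: 1}
    for token in tokens:
        if token not in vocabulary:
            vocabulary[token] = len(vocabulary)
    return vocabulary


tokens = ["i", "love", "learning", "languages", "."]
vocabulary = build_vocabulary(tokens)
print("Vocabulary:", vocabulary)

# Inverse mapping, useful for turning predicted IDs back into words.
index_to_token = {idx: token for token, idx in vocabulary.items()}
print("ID 3 ->", index_to_token[3])  # love
```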



3. **Padding Sequences**:
   - **Purpose**: Pad each sentence to a fixed length so they all match the input shape required by the model.
   - **Padding Process**: Add special tokens (e.g., `<PAD>`) to shorter sentences so they match the longest sequence in the batch.
   - **Example**:
      - Sentences before padding: `[I love], [I love learning languages]`
      - After padding to a length of 5: `[I love <PAD> <PAD> <PAD>], [I love learning languages <PAD>]`
   - **Code Demonstration**:


In [None]:
# Importing the pad_sequences function from TensorFlow's Keras module.
# This function is used to ensure that all sequences (lists of token IDs) have the same length,
# which is necessary for batch processing in machine learning models.
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Defining a list of tokenized sentences where each sentence is represented as a list of token IDs.
# For example, `tokenized_sentences` might represent two sentences that have been converted to sequences of word indices:
# - The first sentence has the tokens [1, 2].
# - The second sentence has the tokens [1, 2, 3, 4].
tokenized_sentences = [[1, 2], [1, 2, 3, 4]]

# Applying padding to make each sequence the same length.
# - `maxlen=5` specifies that all sequences should have a length of 5.
# - `padding='post'` indicates that padding should be added at the end (after the tokens).
# If a sequence is shorter than `maxlen`, zeros are added to the end until it reaches the desired length.
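# If a sequence is longer than `maxlen`, it will be truncated to fit. By default, tokens are
# dropped from the start of the sequence (`truncating='pre'`); use `truncating='post'` to trim the end.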
padded_sentences = pad_sequences(tokenized_sentences, maxlen=5, padding='post')

print("Padded Sentences:", padded_sentences)


Padded Sentences: [[1 2 0 0 0]
 [1 2 3 4 0]]
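
The same padding can be done without TensorFlow by using PyTorch's own utility, `torch.nn.utils.rnn.pad_sequence`. The sketch below is an alternative, not what the cell above does. It pads to the longest sequence in the batch rather than to a fixed `maxlen`, and also records the original lengths.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

tokenized_sentences = [[1, 2], [1, 2, 3, 4]]

# pad_sequence expects a list of tensors, one per sentence.
sequences = [torch.tensor(ids) for ids in tokenized_sentences]

# batch_first=True gives shape (batch, max_len); padding_value=0 is the <PAD> ID.
padded = pad_sequence(sequences, batch_first=True, padding_value=0)

# Keeping the true lengths lets later code ignore the padded positions.
lengths = torch.tensor([len(ids) for ids in tokenized_sentences])

print("Padded:", padded)
print("Lengths:", lengths)
```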



4. **Handling Out-of-Vocabulary (OOV) Words**:
   - **Purpose**: Handle words that are not in the vocabulary with a special token like `<UNK>`.
   - This ensures the model can process unknown words without errors.
   - **Example**:
      - Original sentence: "I enjoy learning new languages."
      - With `<UNK>` token for “enjoy” and “new” if not in vocabulary: `[I <UNK> learning <UNK> languages .]`
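
The `<UNK>` substitution can be sketched with a dictionary lookup that falls back to the unknown-token ID. The vocabulary below, including its index assignments, is a made-up example for illustration, and `encode` is a hypothetical helper name.

```python
UNK_TOKEN = "<UNK>"

# Illustrative vocabulary; the IDs are assumptions, not taken from the tutorial.
vocabulary = {"<PAD>": 0, UNK_TOKEN: 1, "i": 2, "love": 3,
              "learning": 4, "languages": 5, ".": 6}


def encode(tokens, vocabulary):
    """Convert tokens to IDs, mapping anything not in the vocabulary to <UNK>."""
    unk_id = vocabulary[UNK_TOKEN]
    return [vocabulary.get(token, unk_id) for token in tokens]


tokens = ["i", "enjoy", "learning", "new", "languages", "."]
print(encode(tokens, vocabulary))  # [2, 1, 4, 1, 5, 6]
```

Here "enjoy" and "new" are both mapped to ID 1, so the model can still process the sentence, at the cost of losing the identity of those two words.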



5. **Batching and Shuffling**:
   - **Purpose**: Organize the data into batches and shuffle them to improve model training.
   - Batching processes several examples per update, which uses hardware efficiently and bounds memory use; shuffling prevents the model from picking up on the order of the training examples.
   - **Code Demonstration**:


In [None]:
# Importing necessary modules from PyTorch.
# - torch: the core library for tensor operations.
# - DataLoader: a utility that loads data from a dataset in batches, allowing efficient data handling.
# - TensorDataset: a dataset wrapper that combines input and target tensors, useful for supervised learning tasks.
import torch
from torch.utils.data import DataLoader, TensorDataset

# Defining example tensors for input and target data.
# Here, `inputs` is a 3x3 tensor where each row represents a sample.
# `targets` is also a 3x3 tensor where each row represents the target output for a corresponding input sample.
inputs = torch.tensor([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
targets = torch.tensor([[9, 8, 7], [6, 5, 4], [3, 2, 1]])

# Creating a TensorDataset to hold the input and target tensors together.
# The dataset allows the DataLoader to treat each pair of input and target as a single sample.
dataset = TensorDataset(inputs, targets)

# Setting up a DataLoader to load data from the dataset in batches.
# - `batch_size=2` specifies that each batch will contain 2 samples.
# - `shuffle=True` means the data will be shuffled each time a new epoch starts, which is useful for training.
dataloader = DataLoader(dataset, batch_size=2, shuffle=True)

# Iterating through the DataLoader to access batches of data.
# Each iteration returns a batch, which is a tuple containing a batch of inputs and a batch of targets.
for batch in dataloader:
    # Printing the input part of the batch.
    print("Batch of Inputs:", batch[0])

    # Printing the target part of the batch.
    print("Batch of Targets:", batch[1])


Batch of Inputs: tensor([[1, 2, 3],
        [7, 8, 9]])
Batch of Targets: tensor([[9, 8, 7],
        [3, 2, 1]])
Batch of Inputs: tensor([[4, 5, 6]])
Batch of Targets: tensor([[6, 5, 4]])
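
`TensorDataset` requires all samples to have the same length. Real sentence pairs usually do not, so one common approach is to pad each batch as it is assembled, using a custom `collate_fn`. The sketch below assumes a padding ID of 0 and made-up token IDs; `collate_pairs` is a hypothetical helper name.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

PAD_ID = 0  # assumed padding token ID

# Made-up (source, target) pairs of different lengths.
pairs = [
    ([1, 2], [3, 4, 5]),
    ([1, 2, 3, 4], [6, 7]),
    ([5], [8]),
]


def collate_pairs(batch):
    """Pad the sources and targets of one batch to that batch's longest sequence."""
    sources = [torch.tensor(src) for src, _ in batch]
    targets = [torch.tensor(tgt) for _, tgt in batch]
    return (
        pad_sequence(sources, batch_first=True, padding_value=PAD_ID),
        pad_sequence(targets, batch_first=True, padding_value=PAD_ID),
    )


# shuffle=False keeps this example reproducible; training would normally use shuffle=True.
loader = DataLoader(pairs, batch_size=2, shuffle=False, collate_fn=collate_pairs)

batches = list(loader)
for src_batch, tgt_batch in batches:
    print("Inputs:", src_batch.tolist(), "Targets:", tgt_batch.tolist())
```

Padding per batch rather than over the whole dataset keeps batches of short sentences small.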


### Section 3: Building the Seq2Seq Model



#### 3.1 Objective
   - Understand how to set up and structure a Seq2Seq model, specifically by building the **Encoder** and **Decoder** components.
   - Each part has a specialized role in transforming the input sequence to the output sequence.

---



#### 3.2 Components of the Seq2Seq Model



1. **Encoder**:
   - **Purpose**: Read and "understand" the input sentence by encoding it into a **context vector**.
   - The encoder uses a Recurrent Neural Network (RNN), such as GRU or LSTM, to process the input sequence token by token and update its hidden state with each token.
   - **Key Parameters**:
      - `input_size`: Size of the vocabulary (number of unique tokens).
      - `hidden_size`: Dimension of the hidden state.
   - **Code Demonstration**:

In [None]:
# Importing the neural network module from PyTorch, which contains various neural network layers and functions.
import torch.nn as nn
import torch

# Defining an encoder RNN class, which is a subclass of nn.Module, PyTorch's base class for neural networks.
class EncoderRNN(nn.Module):
    def __init__(self, input_size, hidden_size):
        # Initializing the parent class with super to inherit nn.Module's properties and methods.
        super(EncoderRNN, self).__init__()

        # Defining the hidden state size, which determines the number of features in the hidden layer.
        self.hidden_size = hidden_size

        # Creating an embedding layer, which converts input token IDs into dense vectors.
        # `input_size` is the vocabulary size, and `hidden_size` is the dimensionality of the embeddings.
        self.embedding = nn.Embedding(input_size, hidden_size)

        # Defining a GRU (Gated Recurrent Unit) layer, which is a type of RNN layer.
        # The GRU takes embeddings of `hidden_size` as input and outputs a hidden state of the same size.
        self.gru = nn.GRU(hidden_size, hidden_size)

    def forward(self, input, hidden):
        # Passing the input token ID through the embedding layer, converting it to a dense vector.
        # .view(1, 1, -1) reshapes the embedding to match GRU's expected input shape:
        # (sequence_length, batch_size, embedding_size), where sequence length and batch size are both 1 here.
        embedded = self.embedding(input).view(1, 1, -1)

        # Passing the embedded input and hidden state to the GRU.
        # GRU returns:
        # - `output`: the output of the current time step.
        # - `hidden`: the updated hidden state, which is carried over to the next time step.
        output, hidden = self.gru(embedded, hidden)

        # Returning the output and hidden state.
        return output, hidden

    def init_hidden(self):
        # Initializing the hidden state with zeros.
        # The shape (1, 1, hidden_size) corresponds to (num_layers, batch_size, hidden_size),
        # where num_layers is 1 and batch size is 1 in this case.
        return torch.zeros(1, 1, self.hidden_size)

# Example usage of the EncoderRNN
encoder = EncoderRNN(input_size=10, hidden_size=20)  # Creating an Encoder with vocabulary size 10 and hidden state size 20.

# Defining an example input tensor with a single token ID (e.g., ID `1`).
input_tensor = torch.tensor([1])

# Initializing the encoder's hidden state.
encoder_hidden = encoder.init_hidden()

# Passing the input tensor and hidden state through the encoder.
encoder_output, encoder_hidden = encoder(input_tensor, encoder_hidden)

# Printing the output of the encoder (from the GRU) and the final hidden state.
print("Encoder Output:", encoder_output)
print("Encoder Hidden State:", encoder_hidden)
print(encoder_hidden.shape)
print("-" * 40)


Encoder Output: tensor([[[ 0.1837,  0.3054,  0.4854,  0.4326,  0.4158,  0.2499,  0.0676,
           0.1700, -0.0586,  0.1174,  0.1410,  0.0982,  0.3166, -0.4173,
          -0.1035, -0.5546,  0.4376, -0.1398,  0.1916,  0.3888]]],
       grad_fn=<StackBackward0>)
Encoder Hidden State: tensor([[[ 0.1837,  0.3054,  0.4854,  0.4326,  0.4158,  0.2499,  0.0676,
           0.1700, -0.0586,  0.1174,  0.1410,  0.0982,  0.3166, -0.4173,
          -0.1035, -0.5546,  0.4376, -0.1398,  0.1916,  0.3888]]],
       grad_fn=<StackBackward0>)
torch.Size([1, 1, 20])
----------------------------------------
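
The two tensors printed above are identical. For a single-layer, unidirectional GRU fed one time step, `output` holds the top-layer hidden state at every step and `hidden` holds the state after the last step, so with one step they coincide. A minimal sketch that checks this and shows how the two differ in shape for longer sequences (the sizes and the seed are arbitrary choices for illustration):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # Arbitrary seed, only to make the sketch repeatable.

gru = nn.GRU(input_size=20, hidden_size=20)  # Single layer, unidirectional.

# One time step: (seq_len=1, batch=1, features=20).
one_step = torch.randn(1, 1, 20)
h0 = torch.zeros(1, 1, 20)
out_1, hid_1 = gru(one_step, h0)
single_step_equal = torch.allclose(out_1, hid_1)

# Three time steps: `output` now stacks one vector per step,
# while `hidden` still holds only the state after the last step.
three_steps = torch.randn(3, 1, 20)
out_3, hid_3 = gru(three_steps, h0)
last_step_equal = torch.allclose(out_3[-1], hid_3[0])

print(single_step_equal, last_step_equal)
print(out_3.shape, hid_3.shape)
```

This is why the encoder loop in this notebook can carry only `encoder_hidden` forward: with one token per call, `encoder_output` adds no extra information.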



2. **Decoder**:
   - **Purpose**: Generate the translated sentence in the target language, word by word, based on the **context vector** from the encoder.
   - Each token generated by the decoder is fed back into it to produce the next token, which continues until an end-of-sequence token is generated.
   - **Key Parameters**:
      - `hidden_size`: Matches the encoder’s hidden size.
      - `output_size`: Vocabulary size of the target language.
   - **Teacher Forcing**: During training, we can provide the actual target word as input to the decoder to improve learning efficiency.
   - **Code Demonstration**:


In [None]:
# Defining a DecoderRNN class, inheriting from nn.Module for PyTorch compatibility.
class DecoderRNN(nn.Module):
    def __init__(self, hidden_size, output_size):
        # Initializing the parent class with super to inherit nn.Module properties.
        super(DecoderRNN, self).__init__()

        # Setting up the hidden state size.
        self.hidden_size = hidden_size

        # Defining an embedding layer to convert token IDs to dense vectors.
        # `output_size` represents the vocabulary size (total possible output tokens),
        # and `hidden_size` is the embedding dimensionality.
        self.embedding = nn.Embedding(output_size, hidden_size)

        # Initializing a GRU layer that processes the embedded input.
        # The GRU takes the `hidden_size` as both input and output size for simplicity.
        self.gru = nn.GRU(hidden_size, hidden_size)

        # Defining a linear layer that maps the GRU output to the vocabulary size.
        # This layer produces logits for each possible output token.
        self.out = nn.Linear(hidden_size, output_size)

        # Defining a softmax layer to convert logits into log-probabilities for the output tokens.
        # LogSoftmax is used here to stabilize computations and is compatible with loss functions like NLLLoss.
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, input, hidden):
        # Passing the input token ID through the embedding layer, converting it to a dense vector.
        # .view(1, 1, -1) reshapes it for compatibility with the GRU layer input format
        # (sequence_length=1, batch_size=1, embedding_size=hidden_size).
        output = self.embedding(input).view(1, 1, -1)

        # Applying a ReLU activation function to add non-linearity to the embedding.
        output = torch.relu(output)

        # Passing the embedded input and hidden state to the GRU layer.
        # The GRU layer outputs:
        # - `output`: the output vector at the current time step.
        # - `hidden`: the updated hidden state to pass to the next time step.
        output, hidden = self.gru(output, hidden)

        # Mapping the GRU output to the output vocabulary size using the linear layer.
        # output[0] takes the sequence dimension (1) out, as we're processing only one step.
        output = self.softmax(self.out(output[0]))

        # Returning the output (log-probabilities over the vocabulary) and the updated hidden state.
        return output, hidden

# Example usage of the DecoderRNN
decoder = DecoderRNN(hidden_size=20, output_size=10)  # Creating a decoder with hidden size 20 and output vocabulary size 10.

# Defining an initial input token ID (e.g., ID `1`, representing a 'start' token).
decoder_input = torch.tensor([1])

# Using the encoder's last hidden state as the initial hidden state for the decoder.
decoder_hidden = encoder_hidden

# Passing the initial decoder input and hidden state through the decoder.
decoder_output, decoder_hidden = decoder(decoder_input, decoder_hidden)

# Printing the output of the decoder (log-probabilities of the next token) and the final hidden state.
print("Decoder Output:", decoder_output)
print("Decoder Hidden State:", decoder_hidden)


Decoder Output: tensor([[-2.3317, -2.3014, -2.2132, -2.1056, -2.3385, -2.2453, -2.2450, -2.6170,
         -2.5809, -2.1663]], grad_fn=<LogSoftmaxBackward0>)
Decoder Hidden State: tensor([[[-0.0403,  0.3763, -0.3927,  0.3389, -0.3842, -0.0567, -0.2804,
           0.0584,  0.1843,  0.3884,  0.2937,  0.0597, -0.0106, -0.2252,
          -0.2182,  0.2580, -0.0408,  0.2065,  0.3312,  0.0425]]],
       grad_fn=<StackBackward0>)
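
Because the decoder ends in `LogSoftmax`, each entry of `decoder_output` is a log-probability. Exponentiating the printed values should give probabilities that sum to roughly 1, and the largest entry marks the token a greedy decoder would pick. A quick check in plain Python, using the rounded values printed above (so the sum is only approximately 1):

```python
import math

# Log-probabilities copied from the printed decoder output (rounded to 4 decimals).
log_probs = [-2.3317, -2.3014, -2.2132, -2.1056, -2.3385,
             -2.2453, -2.2450, -2.6170, -2.5809, -2.1663]

probs = [math.exp(lp) for lp in log_probs]
total = sum(probs)

# Greedy choice: index of the highest log-probability.
greedy_token = max(range(len(log_probs)), key=lambda i: log_probs[i])

print(f"sum of probabilities: {total:.4f}")
print("greedy token ID:", greedy_token)
```

All ten values sit near log(1/10) ≈ -2.30, which is what a near-uniform distribution over a 10-token vocabulary looks like.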



#### 3.3 Encoder-Decoder Combined Workflow
   - During translation, we:
      1. Pass the input sentence through the encoder to get the context vector.
      2. Initialize the decoder with this context vector and generate each word in the output sequence.

   - **Code Demonstration** (Putting it all together):


In [None]:
# Step 1: Encoding the input sentence

# Initialize the encoder's hidden state.
encoder_hidden = encoder.init_hidden()

# Define an example tokenized input sentence. Each integer represents a word token ID in the sentence.
input_sentence = torch.tensor([1, 2, 3, 4])

# Pass each token in the input sentence sequentially through the encoder.
# This loop simulates feeding tokens one-by-one into the encoder, updating its hidden state each time.
for token in input_sentence:
    # Process the token through the encoder and update the hidden state.
    encoder_output, encoder_hidden = encoder(token, encoder_hidden)


print("Encoder Output:", encoder_output)
print("Encoder Hidden State:", encoder_hidden)
print(encoder_hidden.shape)
print("-" * 40)


Encoder Output: tensor([[[-0.0503,  0.2322, -0.4237,  0.1497,  0.0010, -0.3578,  0.2943,
          -0.0563,  0.1356, -0.1272, -0.0036,  0.1375,  0.3599, -0.2305,
          -0.4118,  0.4302,  0.1737,  0.4199,  0.2408,  0.1447]]],
       grad_fn=<StackBackward0>)
Encoder Hidden State: tensor([[[-0.0503,  0.2322, -0.4237,  0.1497,  0.0010, -0.3578,  0.2943,
          -0.0563,  0.1356, -0.1272, -0.0036,  0.1375,  0.3599, -0.2305,
          -0.4118,  0.4302,  0.1737,  0.4199,  0.2408,  0.1447]]],
       grad_fn=<StackBackward0>)
torch.Size([1, 1, 20])
----------------------------------------


In [None]:


# Step 2: Decoding to generate an output sentence

# Set the initial input for the decoder to a "start" token (commonly used to signal the start of decoding).
decoder_input = torch.tensor([1])

# Initialize the decoder's hidden state with the final hidden state from the encoder.
decoder_hidden = encoder_hidden

# Define an empty list to store the predicted tokens in the generated sentence.
output_sentence = []

# Generate a sentence by decoding one word at a time.
# Here, we limit the loop to 4 iterations to generate a sentence of 4 tokens.
for _ in range(4):
    # Pass the decoder input and current hidden state through the decoder.
    decoder_output, decoder_hidden = decoder(decoder_input, decoder_hidden)

    # Get the predicted token by finding the index of the highest value in the output log-probabilities.
    # `argmax(dim=1)` gives the index of the most likely token in the vocabulary.
    predicted_token = decoder_output.argmax(dim=1).item()

    # Append the predicted token to the output sentence.
    output_sentence.append(predicted_token)

    # Set the decoder input for the next step to the predicted token,
    # which enables the decoder to generate the next token based on previous output.
    decoder_input = torch.tensor([predicted_token])

# Print the generated sequence of token IDs.
print("Generated Output Sentence:", output_sentence)



Generated Output Sentence: [0, 3, 0, 3]



### Section 4: Training the Model



#### 4.1 Objective
   - Master the process of training Seq2Seq models, focusing on loss calculation, backpropagation, and optimization.
   - Use techniques like **teacher forcing** to improve model learning.

---



#### 4.2 Key Steps in Training



1. **Loss Calculation**:
   - **Purpose**: Measure the difference between the model's predicted sequence and the actual target sequence.
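   - **Cross-Entropy Loss** is commonly used for Seq2Seq models: at each decoding step it compares the predicted distribution over the vocabulary with the correct target token, and the per-token losses are summed or averaged over the sequence.
   - **Example**:
      - Predicted sentence: "I like learning."
      - Target sentence: "I enjoy learning."
      - Cross-Entropy Loss penalizes each position according to how little probability the model assigned to the target word there. It does not measure semantic closeness: predicting "like" where the target is "enjoy" is penalized only through the low probability given to "enjoy".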

   - **Code Demonstration**:


In [None]:
import torch.nn as nn
import torch

# Define the loss function using CrossEntropyLoss, which is suitable for multi-class classification tasks.
# CrossEntropyLoss combines `LogSoftmax` and `NLLLoss` in one step, so the `predicted` tensor should contain raw logits.
criterion = nn.CrossEntropyLoss()

# Define an example tensor of predicted logits for each class in each sample.
# Each row represents a sample, and each column represents the logit for a class.
# Here, the shape (3, 3) indicates 3 samples and 3 possible classes for each sample.
# `requires_grad=True` allows the loss to backpropagate through these predictions during training.
predicted = torch.tensor([[0.1, 0.9, 0.8],
                          [0.2, 0.3, 0.5],
                          [0.6, 0.3, 0.1]], requires_grad=True)

# Define the target tensor, which contains the correct class indices for each sample.
# Each integer in `target` represents the correct class for the corresponding row in `predicted`.
# For example, `target[0] = 1` means the correct class for the first sample is class 1.
target = torch.tensor([1, 0, 2])  # Indexes of correct tokens

# Calculate the loss between the predicted logits and target labels.
# CrossEntropyLoss expects `predicted` to contain raw logits, and `target` to contain class indices.
loss = criterion(predicted, target)

# Output the loss value. `loss.item()` extracts the scalar value of the loss.
print("Loss:", loss.item())


Loss: 1.1497679948806763
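
To see what `CrossEntropyLoss` computes, the same number can be reproduced by hand: for each row, take the log-sum-exp of the logits minus the logit of the target class, then average over rows (the default `reduction='mean'`). A plain-Python sketch using the values from the cell above:

```python
import math

logits = [[0.1, 0.9, 0.8],
          [0.2, 0.3, 0.5],
          [0.6, 0.3, 0.1]]
targets = [1, 0, 2]

per_sample = []
for row, t in zip(logits, targets):
    log_sum_exp = math.log(sum(math.exp(x) for x in row))
    # Negative log-probability of the correct class: logsumexp - logit_t.
    per_sample.append(log_sum_exp - row[t])

manual_loss = sum(per_sample) / len(per_sample)
print("Per-sample losses:", [round(v, 4) for v in per_sample])
print("Mean loss:", round(manual_loss, 4))
```

The mean agrees with the value printed by PyTorch up to floating-point precision. The first sample has the smallest loss because its target class already has the largest logit.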



2. **Optimization**:
   - **Purpose**: Update model parameters to minimize the loss.
   - Popular optimizers like **Adam** or **SGD** are typically used for training Seq2Seq models.
   - **Code Demonstration**:


In [None]:
import torch.optim as optim

# Initialize the optimizer using Adam, a popular optimization algorithm that adjusts learning rates adaptively.
# - `list(encoder.parameters()) + list(decoder.parameters())` combines the parameters of the encoder and decoder.
#   This allows the optimizer to update weights in both models simultaneously during training.
# - `lr=0.01` sets the learning rate for the optimizer. The learning rate determines the step size at each iteration.
optimizer = optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=0.01)



3. **Training Loop**:
   - **Purpose**: Perform multiple passes over the data, updating model weights each time.
   - **Steps**:
      1. **Forward Pass**: Pass data through the encoder and decoder to get predictions.
      2. **Compute Loss**: Calculate the loss using the predicted output and actual target.
      3. **Backpropagation**: Compute gradients for each model parameter.
      4. **Optimization Step**: Adjust model parameters based on gradients to reduce loss.

   - **Code Demonstration**:


In [None]:
# Define an input and target tensor for one training iteration (representing a pair of sentences).
input_tensor = torch.tensor([1, 2, 3, 4])  # Example input sequence (token IDs)
target_tensor = torch.tensor([5, 6, 7, 8])  # Example target sequence (token IDs)

# Initialize the encoder's hidden state to start the encoding process.
encoder_hidden = encoder.init_hidden()

# Step 1: Encoder Forward Pass
# Clear previous gradients by zeroing them, a necessary step to prevent gradient accumulation.
optimizer.zero_grad()

# Pass each token in the input sequence through the encoder sequentially.
for token in input_tensor:
    # Process each token, updating the hidden state at each step.
    encoder_output, encoder_hidden = encoder(token, encoder_hidden)

# Step 2: Decoder Forward Pass and Loss Calculation
# Initialize the first input to the decoder with a "start" token.
decoder_input = torch.tensor([1])

# Set the decoder's initial hidden state to the encoder's final hidden state.
decoder_hidden = encoder_hidden

# Initialize loss to 0. This variable accumulates the loss across all tokens in the target sequence.
loss = 0

# Teacher Forcing: using the actual target token as input to the decoder during training
for di in range(len(target_tensor)):
    # Pass the current decoder input and hidden state through the decoder.
    decoder_output, decoder_hidden = decoder(decoder_input, decoder_hidden)

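    # Calculate the loss between the decoder's output and the actual target token.
    # `unsqueeze(0)` turns the scalar target into shape (1,), matching the (1, vocab_size) output.
    # Note: the decoder already returns log-probabilities (LogSoftmax), so `nn.NLLLoss` is the natural match.
    # `CrossEntropyLoss` gives the same value here because log-softmax leaves log-probabilities unchanged.
    loss += criterion(decoder_output, target_tensor[di].unsqueeze(0))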

    # Set the next decoder input to the actual target token (teacher forcing),
    # which helps the model learn by providing correct context during training.
    decoder_input = target_tensor[di]

# Step 3: Backpropagation and Optimization
# Perform backpropagation to compute gradients of the loss with respect to model parameters.
loss.backward()

# Update model parameters using the gradients and the optimizer.
optimizer.step()

# Print the average per-token loss for this training step (total loss divided by the target length).
# This is a single step on one sentence pair rather than a full epoch.
print("Training Loss for this epoch:", loss.item() / len(target_tensor))


Training Loss for this epoch: 2.2404701709747314
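
A useful sanity check for a loss value like this is the chance-level baseline: a model that spreads probability uniformly over a vocabulary of size V has a per-token cross-entropy of ln(V). With the 10-token output vocabulary used here, that is about 2.30, so a loss near 2.24 after one step suggests the decoder is still close to guessing. A short sketch of the arithmetic:

```python
import math

vocab_size = 10  # `output_size` of the decoder in this walkthrough.

# Cross-entropy of a uniform prediction: -log(1 / V) = log(V).
uniform_loss = -math.log(1.0 / vocab_size)

observed_loss = 2.2404701709747314  # Value printed by the training step above.

print(f"Uniform-guess loss: {uniform_loss:.4f}")
print(f"Observed minus baseline: {observed_loss - uniform_loss:.4f}")
```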



4. **Teacher Forcing**:
   - **Purpose**: Improve learning by providing the actual target word as input to the decoder during training instead of its own predicted word.
   - Teacher forcing is applied randomly (usually a specified percentage of the time) to allow the model to learn from both the correct word and its own predictions.
   - **Code Adjustment**:


In [None]:
import random

# Set the teacher forcing ratio, which controls the probability of using the actual target
# as the next input during training. Here, 0.5 means there's a 50% chance to apply teacher forcing.
teacher_forcing_ratio = 0.5

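# This fragment belongs inside the decoding loop of the training step, where `di`,
# `target_tensor`, and `decoder_output` are defined.
# Randomly decide whether to use teacher forcing based on the teacher_forcing_ratio.
# random.random() generates a float in the range [0, 1).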
if random.random() < teacher_forcing_ratio:
    # If the condition is true (50% chance), use the actual target token as the next input.
    # `target_tensor[di]` is the correct token at the current step.
    decoder_input = target_tensor[di]
else:
    # Otherwise, use the model's predicted token from the previous step as the next input.
    # `decoder_output.argmax(dim=1)` gets the predicted token with the highest score.
    decoder_input = decoder_output.argmax(dim=1)
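
As a self-contained illustration of the selection logic only, the sketch below uses plain integers in place of tensors. `choose_next_input` is a hypothetical helper written for this example, not part of PyTorch, and the token lists are stand-in values.

```python
import random


def choose_next_input(target_token, predicted_token, teacher_forcing_ratio, rng=random):
    """Return the token to feed the decoder at the next step.

    With probability `teacher_forcing_ratio` the ground-truth token is used;
    otherwise the model's own prediction is fed back.
    """
    if rng.random() < teacher_forcing_ratio:
        return target_token
    return predicted_token


rng = random.Random(0)       # Seeded generator so the run is repeatable.
targets = [5, 6, 7, 8]       # Ground-truth tokens.
predictions = [0, 3, 0, 3]   # Stand-in model predictions.

next_inputs = [choose_next_input(t, p, 0.5, rng) for t, p in zip(targets, predictions)]
print(next_inputs)

# Boundary cases: ratio 1.0 always uses the target, ratio 0.0 never does.
always_teacher = [choose_next_input(t, p, 1.0, rng) for t, p in zip(targets, predictions)]
never_teacher = [choose_next_input(t, p, 0.0, rng) for t, p in zip(targets, predictions)]
```

Drawing the random number once per sequence, instead of once per token as here, is an equally common variant.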



#### 4.3 Iterating and Monitoring Performance
   - **Epochs**: Repeat the training loop for multiple epochs (full data passes) to improve the model.
   - **Metrics**: Track loss over time to monitor model performance.

   - **Example Loop**:


In [None]:


# Define the number of epochs
n_epochs = 10

# Loop over the specified number of epochs
for epoch in range(n_epochs):
    total_loss = 0  # Initialize the total loss for the epoch

    # Loop through each input-target pair. `dataloader` is assumed to be defined elsewhere and to yield
    # (input, target) tensors of token IDs with shape (seq_len, batch_size).
    for input_tensor, target_tensor in dataloader:
        batch_size = input_tensor.size(1)  # Batch size is the second dimension

        # Initialize the encoder's hidden state to start the encoding process, matching the batch size.
        # Instead of using encoder.init_hidden(), directly create a zero tensor with the required shape.
        encoder_hidden = torch.zeros(1, batch_size, encoder.hidden_size)

        # Step 1: Encoder Forward Pass
        optimizer.zero_grad()

        # Pass each token in the input sequence through the encoder sequentially.
        for i in range(input_tensor.size(0)):
            # Extract the token (or batch of tokens) and pass it through the embedding layer
            embedded = encoder.embedding(input_tensor[i])  # Shape: (batch_size, embedding_dim)
            # Pass the embedded token through the GRU layer
            encoder_output, encoder_hidden = encoder.gru(embedded.unsqueeze(0), encoder_hidden)

        # Step 2: Decoder Forward Pass and Loss Calculation
        # Initialize the first input to the decoder with a "start" token (ID 1) for each element in the batch.
        decoder_input = torch.full((1, batch_size), 1, dtype=torch.long)  # Shape: (1, batch_size)
        decoder_hidden = encoder_hidden  # Set the decoder's initial hidden state

        # Initialize loss to 0 for this batch
        loss = 0

        # Teacher Forcing: using the actual target token as input to the decoder during training
        for di in range(target_tensor.size(0)):
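            # Embed the decoder input and apply ReLU, mirroring DecoderRNN.forward
            embedded = torch.relu(decoder.embedding(decoder_input))  # Shape: (1, batch_size, hidden_size)

            # Pass the embedded input through the GRU
            decoder_output, decoder_hidden = decoder.gru(embedded, decoder_hidden)

            # Project the GRU output to vocabulary logits. Without `decoder.out`, the loss would
            # treat the hidden units as classes. CrossEntropyLoss expects raw logits.
            logits = decoder.out(decoder_output.squeeze(0))  # Shape: (batch_size, output_size)
            loss += criterion(logits, target_tensor[di])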

            # Set the next decoder input to the actual target token (teacher forcing)
            decoder_input = target_tensor[di].unsqueeze(0)

        # Step 3: Backpropagation and Optimization
        loss.backward()  # Perform backpropagation to compute gradients
        optimizer.step()  # Update model parameters

        # Accumulate the loss for this batch to get the total loss for the epoch
        total_loss += loss.item()

    # Calculate and print the average loss for this epoch
    average_loss = total_loss / len(dataloader)
    print(f"Epoch {epoch + 1}/{n_epochs}, Loss: {average_loss}")


Epoch 1/10, Loss: 2.919054925441742
Epoch 2/10, Loss: 3.4380860924720764
Epoch 3/10, Loss: 2.956254780292511
Epoch 4/10, Loss: 3.1284552216529846
Epoch 5/10, Loss: 2.7401310205459595
Epoch 6/10, Loss: 2.711756110191345
Epoch 7/10, Loss: 2.8363680243492126
Epoch 8/10, Loss: 2.781024932861328
Epoch 9/10, Loss: 2.623547911643982
Epoch 10/10, Loss: 2.6619701981544495
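
The epoch loop above relies on a `dataloader` that this notebook does not define. One minimal way to make the cell runnable is a plain list of `(input, target)` tensor pairs shaped `(seq_len, batch_size)`; a list also supports `len()`, which the loop uses for averaging. The sketch below builds such a list from random token IDs. The name `make_toy_dataloader`, the sizes, and the seed are assumptions for illustration, and random pairs carry no learnable mapping, so losses on such data would not be expected to fall steadily.

```python
import torch


def make_toy_dataloader(num_batches=4, seq_len=4, batch_size=2, vocab_size=10, seed=0):
    """Build a list of (input, target) LongTensor pairs shaped (seq_len, batch_size)."""
    generator = torch.Generator().manual_seed(seed)
    batches = []
    for _ in range(num_batches):
        # Token IDs are drawn from [0, vocab_size), matching the embedding table size.
        inputs = torch.randint(0, vocab_size, (seq_len, batch_size), generator=generator)
        targets = torch.randint(0, vocab_size, (seq_len, batch_size), generator=generator)
        batches.append((inputs, targets))
    return batches


dataloader = make_toy_dataloader()
first_input, first_target = dataloader[0]
print(len(dataloader), first_input.shape, first_target.dtype)
```

For real data, `torch.utils.data.DataLoader` with padded batches would be the usual replacement for this list.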


**Explanation**



| **Model**      | **Layer**                  | **Input Shape**            | **Output Shape**           | **Description**                                                                                   |
|----------------|----------------------------|----------------------------|----------------------------|---------------------------------------------------------------------------------------------------|
| **Encoder**    | **Input Token ID**         | `(1,)`                     | `(1, 1, 20)`               | Single token ID is embedded to a dense vector of size 20                                          |
|                | Embedding                  | `(1,)`                     | `(1, 1, 20)`               | Converts token ID to a dense vector of size `hidden_size`                                         |
|                | GRU                        | `(1, 1, 20)`, `(1, 1, 20)` | `(1, 1, 20)`, `(1, 1, 20)` | Processes embedding with the hidden state; outputs updated hidden state                          |
|                | Hidden State Initialization | none (no arguments)        | `(1, 1, 20)`               | `init_hidden()` returns a zeroed hidden state with shape `(num_layers, batch_size, hidden_size)`  |
| **Encoder Output** |                         |                            |                            |                                                                                                   |
|                | Output (`encoder_output`)   |                            | `(1, 1, 20)`               | Final output of the GRU for the input sequence                                                    |
|                | Hidden State (`encoder_hidden`) |                       | `(1, 1, 20)`               | Hidden state at the end of the input sequence                                                     |
| **Decoder**    | **Input Token ID**         | `(1,)`                     | `(1, 1, 20)`               | Single token ID (start token) is embedded to a dense vector of size 20                            |
|                | Embedding                  | `(1,)`                     | `(1, 1, 20)`               | Converts token ID to a dense vector of size `hidden_size`                                         |
|                | ReLU                       | `(1, 1, 20)`               | `(1, 1, 20)`               | Applies ReLU activation, adding non-linearity; shape remains unchanged                           |
|                | GRU                        | `(1, 1, 20)`, `(1, 1, 20)` | `(1, 1, 20)`, `(1, 1, 20)` | Processes the embedding and hidden state; updates hidden state for the next time step            |
|                | Linear (`out`)             | `(1, 20)`                  | `(1, 10)`                  | Maps the GRU output at step 0 (`output[0]`) to the vocabulary size (10), producing logits for each possible output token |
|                | LogSoftmax                 | `(1, 10)`                  | `(1, 10)`                  | Converts logits to log-probabilities over the vocabulary                                          |
| **Decoder Output** |                        |                            |                            |                                                                                                   |
|                | Output (`decoder_output`)   |                            | `(1, 10)`                  | Final output of the decoder: log-probabilities for each token in the output vocabulary            |
|                | Hidden State (`decoder_hidden`) |                        | `(1, 1, 20)`               | Hidden state at the end of the output sequence                                                    |
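
The shapes in the table can be checked directly. The sketch below rebuilds the decoder's layers with the same sizes (hidden size 20, vocabulary size 10) and records the shape after each stage. The standalone layers are stand-ins for the ones inside `DecoderRNN`, with freshly initialized weights.

```python
import torch
import torch.nn as nn

hidden_size, vocab_size = 20, 10

embedding = nn.Embedding(vocab_size, hidden_size)
gru = nn.GRU(hidden_size, hidden_size)
out = nn.Linear(hidden_size, vocab_size)
log_softmax = nn.LogSoftmax(dim=1)

token = torch.tensor([1])                   # (1,)
hidden = torch.zeros(1, 1, hidden_size)     # (num_layers, batch, hidden)

shapes = {}
x = embedding(token)
shapes["embedding"] = tuple(x.shape)        # Before reshaping.
x = torch.relu(x.view(1, 1, -1))
shapes["view + relu"] = tuple(x.shape)
x, hidden = gru(x, hidden)
shapes["gru output"] = tuple(x.shape)
shapes["gru hidden"] = tuple(hidden.shape)
x = out(x[0])                               # Drop the sequence dimension.
shapes["linear"] = tuple(x.shape)
x = log_softmax(x)
shapes["log_softmax"] = tuple(x.shape)

for name, shape in shapes.items():
    print(f"{name:12s} {shape}")
```

Note that the embedding itself returns `(1, 20)`; the `(1, 1, 20)` shape appears only after `.view(1, 1, -1)`.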

##### Summary of the Flow
1. **Encoder**:
   - Input token ID → Embedding → GRU (processes embedding and hidden state) → Final hidden state. The encoder applies no ReLU; that step exists only in the decoder.

2. **Decoder**:
   - Input token ID (start token) → Embedding → ReLU activation → GRU (processes embedding and hidden state) → Linear → LogSoftmax

### Section 5: Evaluating and Testing the Model



#### 5.1 Objective
   - Assess the performance of the trained Seq2Seq model by evaluating its output against unseen test data.
   - Calculate metrics to gauge translation quality and identify areas for improvement.

---



#### 5.2 Steps in Evaluation and Testing



1. **Testing with Sample Data**:
   - **Purpose**: Evaluate model translation on sample sentences that the model has not seen during training.
   - We provide an input sentence to the model’s encoder and generate the translated sentence through the decoder.
   - **Code Demonstration**:


In [None]:
# Define a function to translate an input sentence using the encoder and decoder models.
def translate_sentence(input_sentence, encoder, decoder, max_length=10):
    # Initialize the encoder's hidden state.
    encoder_hidden = encoder.init_hidden()

    # Step 1: Encode the input sentence
    # Pass each token in the input sentence through the encoder, updating the hidden state.
    for token in input_sentence:
        encoder_output, encoder_hidden = encoder(token, encoder_hidden)

    # Step 2: Initialize the decoder with the final encoder hidden state
    # Set the initial decoder input to a "start" token to begin decoding.
    decoder_input = torch.tensor([1])  # Assuming 1 is the 'start' token
    decoder_hidden = encoder_hidden  # Initialize the decoder's hidden state with the encoder's final hidden state.

    # Initialize an empty list to store the tokens of the translated sentence.
    translated_sentence = []

    # Step 3: Decode to generate the translated sentence
    # Loop until `max_length` is reached or an "end" token is produced.
    for _ in range(max_length):
        # Pass the decoder input and hidden state through the decoder.
        decoder_output, decoder_hidden = decoder(decoder_input, decoder_hidden)

        # Get the predicted token by finding the index of the maximum log-probability in `decoder_output`.
        predicted_token = decoder_output.argmax(dim=1).item()

        # Append the predicted token to the translated sentence list.
        translated_sentence.append(predicted_token)

        # Stop decoding if the "end" token is generated (assuming `2` is the end token).
        if predicted_token == 2:
            break

        # Set the decoder input for the next iteration to the predicted token.
        decoder_input = torch.tensor([predicted_token])

    # Return the full translated sentence, a list of token IDs.
    return translated_sentence

# Example usage of the translate_sentence function
input_sentence = torch.tensor([3, 4, 5, 6])  # Example tokenized input sentence
translated = translate_sentence(input_sentence, encoder, decoder)
print("Translated Sentence:", translated)


Translated Sentence: [6, 6, 6, 6, 6, 6, 6, 6, 6, 6]



2. **Metric Calculation**:
   - **Purpose**: Quantitatively measure translation quality.
   - Common metrics include:
      - **BLEU Score**: Measures n-gram overlap between the predicted translation and one or more reference translations; scores range from 0 to 1, with higher meaning closer to the reference.
      - **Accuracy**: Measures the percentage of tokens that match the reference at the same position (mainly for simpler models and datasets).
   - **BLEU Score Calculation**:


In [None]:
from nltk.translate.bleu_score import sentence_bleu

# Define the reference sentence (target sentence that the model should ideally generate).
# This reference is wrapped in an additional list because the `sentence_bleu` function expects
# multiple reference sentences (even if there is only one).
reference = [[5, 6, 7, 8]]  # Example reference sentence, where each number is a token ID.

# Define the candidate sentence (output from the model).
# This is the sentence generated by the model to be evaluated against the reference.
candidate = [5, 6, 7, 2]  # Example model output, ending with token ID `2`, assumed as an end token.

# Calculate the BLEU score between the reference and candidate sentences.
# By default, `sentence_bleu` weights 1-gram to 4-gram precision equally, so a candidate
# that shares no 4-gram with the reference scores close to zero, as happens here.
bleu_score = sentence_bleu(reference, candidate)

# Print the resulting BLEU score.
print("BLEU Score:", bleu_score)


BLEU Score: 8.636168555094496e-78


The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
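
The near-zero score and the warning come from how BLEU combines n-gram precisions: by default `sentence_bleu` takes a geometric mean over 1- to 4-gram precisions, so a single zero precision drives the whole score to (almost) zero. The plain-Python sketch below computes the clipped n-gram precisions for the same pair. `ngram_precision` is a hypothetical helper written for this illustration, not an NLTK function, and it omits the brevity penalty.

```python
from collections import Counter
from fractions import Fraction


def ngram_precision(reference, candidate, n):
    """Clipped n-gram precision of `candidate` against a single `reference`."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    # Each candidate n-gram counts at most as often as it appears in the reference.
    matches = sum(min(count, ref[gram]) for gram, count in cand.items())
    total = sum(cand.values())
    return Fraction(matches, total) if total else Fraction(0)


reference = [5, 6, 7, 8]
candidate = [5, 6, 7, 2]

precisions = {n: ngram_precision(reference, candidate, n) for n in range(1, 5)}
for n, p in precisions.items():
    print(f"{n}-gram precision: {p}")
```

The 4-gram precision is 0 because the only candidate 4-gram ends in the wrong token. As the warning suggests, passing a `smoothing_function` (from NLTK's `SmoothingFunction` class) or lower-order `weights` to `sentence_bleu` avoids this collapse on short sentences.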



3. **Error Analysis**:
   - **Purpose**: Identify and analyze translation errors to refine the model.
   - Examine samples where the model’s output differs significantly from the target, focusing on:
      - **Common Mistranslations**: Repeated patterns of incorrect translations.
      - **Long Sentence Handling**: Seq2Seq models can struggle with long sequences, leading to degraded accuracy.
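
One simple way to start this analysis is to tally which target tokens are most often replaced by which predictions, and to compare token accuracy on short versus long sentences. The sketch below does both on a few made-up token-ID pairs. The data, the variable names, and the length threshold are all hypothetical and only illustrate the bookkeeping.

```python
from collections import Counter, defaultdict

# Hypothetical (target, prediction) pairs of equal-length token-ID sequences.
examples = [
    ([5, 6, 7, 2], [5, 6, 7, 2]),
    ([5, 8, 7, 2], [5, 6, 7, 2]),
    ([3, 8, 4, 5, 6, 7, 9, 2], [3, 6, 4, 5, 5, 5, 5, 2]),
]

confusions = Counter()               # (target_token, predicted_token) -> count
stats = defaultdict(lambda: [0, 0])  # bucket -> [correct, total]
LONG_THRESHOLD = 6                   # Arbitrary cut-off between "short" and "long".

for target, predicted in examples:
    bucket = "long" if len(target) >= LONG_THRESHOLD else "short"
    for t, p in zip(target, predicted):
        stats[bucket][1] += 1
        if t == p:
            stats[bucket][0] += 1
        else:
            confusions[(t, p)] += 1

accuracy = {bucket: correct / total for bucket, (correct, total) in stats.items()}
print("Most common confusions:", confusions.most_common(2))
print("Token accuracy by length:", accuracy)
```

In this toy data the long sentence is where accuracy drops, which is the pattern to look for when checking long-sequence handling on real outputs.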



4. **Adjusting Model Based on Evaluation**:
   - After reviewing errors and metric results, consider adjustments:
      - **Hyperparameter Tuning**: Adjust learning rate, hidden layer size, or number of epochs.
      - **Teacher Forcing Ratio**: Lowering the ratio exposes the decoder to its own predictions more often during training, which can make it more robust at inference time, when no target tokens are available.
      - **Model Architecture**: Introduce attention layers or upgrade to a Transformer model if necessary.


### Section 6: Deploying and Using the Translation Model



#### 6.1 Objective
   - Implement the trained Seq2Seq model for real-world translation tasks.
   - Set up interactive functions for translating user-provided sentences and explore basic deployment options.

---



#### 6.2 Steps for Deployment and Use



1. **Interactive Translation Function**:
   - **Purpose**: Create a function that accepts input text from a user, processes it, and returns the translated output.
   - This function can serve as the primary interface for the model, allowing users to input text and receive translations directly.
   - **Code Demonstration**:


In [None]:
def interactive_translate(input_sentence, encoder, decoder, input_lang_vocab, output_lang_vocab, max_length=10):
    # Define a placeholder ID for unknown words (UNK). This handles words in the input sentence that are not found in the input vocabulary.
    UNK_TOKEN_ID = input_lang_vocab.get("<UNK>", 0)

    # Convert the input sentence into a tensor of token IDs by looking up each word in the input vocabulary.
    # If a word is not found, it defaults to UNK_TOKEN_ID.
    input_tensor = torch.tensor([
        input_lang_vocab.get(word.lower(), UNK_TOKEN_ID)  # Convert words to lowercase to handle case insensitivity.
        for word in input_sentence.split(' ')            # Split sentence into individual words.
    ])

    # Initialize a list to store the translated token IDs that will be predicted by the decoder.
    translated_tokens = []

    # Initialize the encoder's hidden state.
    encoder_hidden = encoder.init_hidden()

    # Loop through each token in the input tensor and pass it through the encoder.
    for token in input_tensor:
        encoder_output, encoder_hidden = encoder(token, encoder_hidden)
        # `encoder_output` is usually a representation of the current token,
        # and `encoder_hidden` is the updated hidden state passed to the next iteration.

    # Set up the decoder with an initial input (start token) and the hidden state from the encoder.
    decoder_input = torch.tensor([1])  # Assuming `1` is the ID for the 'start' token in the output vocabulary.
    decoder_hidden = encoder_hidden     # Transfer the encoder's final hidden state to the decoder.

    # Decode the output step-by-step up to `max_length`, printing intermediate outputs.
    for _ in range(max_length):
        decoder_output, decoder_hidden = decoder(decoder_input, decoder_hidden)
        # Obtain the token ID with the highest probability (argmax) from the decoder output.
        predicted_token = decoder_output.argmax(dim=1).item()

        # Print intermediate translation steps for debugging purposes.
        print(f"Predicted token ID: {predicted_token}, word: {output_lang_vocab.get(predicted_token, '<UNK>')}")

        # Append the predicted token to the list of translated tokens.
        translated_tokens.append(predicted_token)

        # Check if the predicted token is the end token; if so, break out of the loop.
        if predicted_token == 2:  # Assuming `2` is the ID for the 'end' token in the output vocabulary.
            break

        # Update the decoder input to be the predicted token for the next iteration.
        decoder_input = torch.tensor([predicted_token])

    # Convert the list of token IDs in `translated_tokens` back to words using the output vocabulary.
    # Only include tokens that have valid word mappings in `output_lang_vocab`.
    translated_sentence = ' '.join([
        output_lang_vocab[token] for token in translated_tokens if token in output_lang_vocab
    ])

    # Return the final translated sentence.
    return translated_sentence


In [None]:
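# Keys are lowercase because `interactive_translate` lowercases each word before lookup.
# Special tokens and words get distinct IDs (all below the encoder's `input_size` of 10).
input_lang_vocab = {
    "<UNK>": 0, "<START>": 1, "<END>": 2, "i": 3, "love": 4, "learning": 5, "languages": 6
}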

output_lang_vocab = {
    1: "yo", 2: "<END>", 3: "amo", 4: "aprender", 5: "idiomas", 0: "<UNK>"
}
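
Note that in the dictionaries above `"I"` shares ID 1 with `"<START>"` and `"love"` shares ID 2 with `"<END>"`, and the output vocabulary maps ID 1 to `"yo"` even though the decoding loop assumes 1 is the start token. One way to avoid such collisions is to reserve the special-token IDs first and assign word IDs after them. The following is a minimal sketch under that assumption; `SPECIAL_TOKENS` and `build_vocab` are hypothetical names, not part of the tutorial.

```python
# Reserved IDs 0, 1, 2 match the assumptions in the decoding loop
# (1 = start token, 2 = end token); 0 is used for unknown words.
SPECIAL_TOKENS = ["<UNK>", "<START>", "<END>"]


def build_vocab(words):
    """Return (word_to_id, id_to_word) with special tokens occupying the first IDs."""
    word_to_id = {token: idx for idx, token in enumerate(SPECIAL_TOKENS)}
    for word in words:
        if word not in word_to_id:
            # The next free ID is always the current size of the mapping.
            word_to_id[word] = len(word_to_id)
    id_to_word = {idx: word for word, idx in word_to_id.items()}
    return word_to_id, id_to_word


input_word_to_id, _ = build_vocab(["I", "love", "learning", "languages"])
_, output_id_to_word = build_vocab(["yo", "amo", "aprender", "idiomas"])

print(input_word_to_id["I"])   # 3
print(output_id_to_word[3])    # yo
```

A model trained with one ID assignment has to be used with the same assignment at inference time, so vocabularies like these would need to be built before training rather than swapped in afterwards.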


In [None]:
input_sentence = "I Love Learning"
print("Translation:", interactive_translate(input_sentence, encoder, decoder, input_lang_vocab, output_lang_vocab))


Predicted token ID: 6, word: <UNK>
Predicted token ID: 6, word: <UNK>
Predicted token ID: 6, word: <UNK>
Predicted token ID: 6, word: <UNK>
Predicted token ID: 6, word: <UNK>
Predicted token ID: 6, word: <UNK>
Predicted token ID: 6, word: <UNK>
Predicted token ID: 6, word: <UNK>
Predicted token ID: 6, word: <UNK>
Predicted token ID: 6, word: <UNK>
Translation: 


In [None]:
input_sentence = "I love coding"
print("Translation:", interactive_translate(input_sentence, encoder, decoder, input_lang_vocab, output_lang_vocab))


Predicted token ID: 6, word: <UNK>
Predicted token ID: 6, word: <UNK>
Predicted token ID: 6, word: <UNK>
Predicted token ID: 6, word: <UNK>
Predicted token ID: 6, word: <UNK>
Predicted token ID: 6, word: <UNK>
Predicted token ID: 6, word: <UNK>
Predicted token ID: 6, word: <UNK>
Predicted token ID: 6, word: <UNK>
Predicted token ID: 6, word: <UNK>
Translation: 


In [None]:
input_sentence = "learning"
print("Translation:", interactive_translate(input_sentence, encoder, decoder, input_lang_vocab, output_lang_vocab))


Predicted token ID: 6, word: <UNK>
Predicted token ID: 6, word: <UNK>
Predicted token ID: 6, word: <UNK>
Predicted token ID: 6, word: <UNK>
Predicted token ID: 6, word: <UNK>
Predicted token ID: 6, word: <UNK>
Predicted token ID: 6, word: <UNK>
Predicted token ID: 6, word: <UNK>
Predicted token ID: 6, word: <UNK>
Predicted token ID: 6, word: <UNK>
Translation: 
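
In all three runs above, every prediction is ID 6, which has no entry in `output_lang_vocab` (its keys are 0 to 5). Each step therefore prints `<UNK>`, the final join filters the token out, and the translation is empty. Because ID 2 is never predicted, the loop also runs until `max_length` is reached. This pattern is typical of an untrained model or of a decoder whose output layer is larger than the vocabulary, though the output alone does not show which. A small check like the one below can reveal the mismatch; `ids_missing_from_vocab` is a hypothetical helper, and the output size of 10 is an assumed value to be replaced with the real decoder's.

```python
def ids_missing_from_vocab(output_size, id_to_word):
    """Return the token IDs the decoder can emit that have no word mapping."""
    return [idx for idx in range(output_size) if idx not in id_to_word]


output_lang_vocab = {
    1: "yo", 2: "<END>", 3: "amo", 4: "aprender", 5: "idiomas", 0: "<UNK>"
}

# Assumption: the decoder's final layer has 10 output units.
missing = ids_missing_from_vocab(10, output_lang_vocab)
print(missing)  # [6, 7, 8, 9]
```

An empty list would mean every ID the decoder can produce maps to a word; a non-empty list means some predictions will always be dropped from the translation.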



2. **Preparing for Web or Mobile Deployment**:
   - **Frameworks for Deployment**:
      - **Flask/Django (Python)**: Suitable for creating web APIs that host the model.
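      - **PyTorch Mobile / TensorFlow Lite**: For mobile deployment. PyTorch Mobile runs a TorchScript export of the model; TensorFlow Lite requires first converting the model to a TensorFlow format.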
   - **Deployment Steps**:
      - Save the trained model weights so they can be reloaded at serving time without retraining.
      - Set up a simple web API to handle incoming translation requests and return outputs.
   - **Code Example (Saving Model)**:
      ```python
      # Save encoder and decoder model weights
      torch.save(encoder.state_dict(), 'encoder_weights.pth')
      torch.save(decoder.state_dict(), 'decoder_weights.pth')

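      # Load model weights when deploying (the encoder and decoder objects must be constructed first)
      encoder.load_state_dict(torch.load('encoder_weights.pth'))
      decoder.load_state_dict(torch.load('decoder_weights.pth'))

      # Switch to inference mode so that layers such as dropout are disabled
      encoder.eval()
      decoder.eval()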
      ```
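
If the weights were saved on a GPU machine and the server has only a CPU, `torch.load` accepts a `map_location` argument to remap the tensors. The sketch below round-trips a state dict and checks that the restored weights match. `TinyEncoder` is a hypothetical stand-in for the tutorial's encoder, and an in-memory buffer replaces the `.pth` file so the example is self-contained.

```python
import io

import torch
import torch.nn as nn


class TinyEncoder(nn.Module):
    """Hypothetical stand-in for the tutorial's encoder (embedding + GRU + dropout)."""

    def __init__(self, vocab_size=10, hidden_size=8):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size)
        self.dropout = nn.Dropout(0.1)


trained = TinyEncoder()

# Save the weights. A path such as 'encoder_weights.pth' works the same way as this buffer.
buffer = io.BytesIO()
torch.save(trained.state_dict(), buffer)
buffer.seek(0)

# Rebuild the architecture, then load the weights onto the CPU.
restored = TinyEncoder()
state = torch.load(buffer, map_location=torch.device("cpu"))
restored.load_state_dict(state)
restored.eval()  # inference mode: dropout is disabled

# Verify that every parameter survived the round trip unchanged.
weights_match = all(
    torch.equal(a, b)
    for a, b in zip(trained.state_dict().values(), restored.state_dict().values())
)
print(weights_match)
```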



3. **Setting up a Basic Web API (Using Flask)**:
   - **Purpose**: Allow users to input sentences via a web interface and receive translations from the model.
   - **Code Demonstration**:

```python
import torch  # Needed for `torch.load` below.
from flask import Flask, request, jsonify  # Import Flask and necessary functions for handling requests and responses.

# Initialize the Flask application
app = Flask(__name__)

# Load pre-trained encoder and decoder model weights.
# These weights are loaded from saved `.pth` files into the encoder and decoder models.
# This assumes `encoder` and `decoder` are already defined and initialized.
encoder.load_state_dict(torch.load('encoder_weights.pth'))  # Load the encoder's weights from file.
decoder.load_state_dict(torch.load('decoder_weights.pth'))  # Load the decoder's weights from file.
encoder.eval()  # Put both models in inference mode so that dropout is disabled.
decoder.eval()

# Define a route for translation. This route listens for POST requests on the `/translate` endpoint.
@app.route('/translate', methods=['POST'])
def translate():
    # Parse the JSON data from the incoming request.
    # `data` is expected to be a JSON object containing a "sentence" key with the text to translate.
    data = request.get_json()

    # Extract the input sentence from the parsed JSON data.
    # `input_sentence` is the sentence provided by the user that needs to be translated.
    input_sentence = data['sentence']

    # Perform the translation using the `interactive_translate` function.
    # This function takes the input sentence, encoder, decoder, input and output vocabularies.
    # The result, `translation`, is the translated sentence generated by the model.
    translation = interactive_translate(input_sentence, encoder, decoder, input_lang_vocab, output_lang_vocab)

    # Return the translation as a JSON response.
    # `jsonify` converts the Python dictionary to JSON format for the response.
    return jsonify({'translation': translation})

# Run the Flask app when the script is executed directly (i.e., not imported as a module).
if __name__ == "__main__":
    # Start Flask's built-in development server in debug mode, which provides detailed error logs
    # and auto-reloads on code changes. Debug mode is for local development only; disable it in production.
    app.run(debug=True)

```
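
The route above raises an error (and returns a 500 response) if the request body is not JSON or lacks a `"sentence"` key. The following sketch shows one way to validate the input and to exercise the endpoint with Flask's built-in test client, without starting a server. `translate_sentence` is a hypothetical stub standing in for the call to `interactive_translate`, so the example runs without a trained model.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


def translate_sentence(sentence):
    # Hypothetical stub: a real app would call
    # interactive_translate(sentence, encoder, decoder, input_lang_vocab, output_lang_vocab).
    return sentence.upper()


@app.route("/translate", methods=["POST"])
def translate():
    # silent=True returns None instead of raising when the body is not valid JSON.
    data = request.get_json(silent=True)
    sentence = data.get("sentence") if isinstance(data, dict) else None
    if not isinstance(sentence, str) or not sentence.strip():
        return jsonify({"error": "Request body must be JSON with a non-empty 'sentence' string."}), 400
    return jsonify({"translation": translate_sentence(sentence)})


# Exercise the endpoint in-process with the test client.
client = app.test_client()
ok = client.post("/translate", json={"sentence": "I love learning"})
bad = client.post("/translate", json={})

print(ok.status_code, ok.get_json())
print(bad.status_code, bad.get_json())
```

Returning a 400 with a short message lets clients distinguish a malformed request from a server-side failure.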


4. **User Interface for Translation (Optional)**:
   - **Purpose**: Create a front-end UI to interact with the translation model.
   - Options include:
      - **HTML/JavaScript** interface: Simple, text-based input and output fields for translation.
      - **Web Frameworks** like React/Vue.js: For more interactive and styled interfaces.
