# Language Modeling: A Comprehensive Overview

Language modeling is a foundational task in natural language processing (NLP) that underpins many applications, including machine translation, text generation, and speech recognition. Below, we provide an in-depth explanation of language modeling, covering its definition, mathematical formulations, core principles, pros and cons, and recent advancements in the field.

---

## 1. Definition of Language Modeling

Language modeling is the task of predicting the next word (or token) in a sequence of words, given the preceding context. It is essentially a probabilistic framework that assigns a probability distribution over a vocabulary of possible words (or tokens) for the next word in a sequence.

### Example:
Consider the sentence:  
*"The students opened their ______"*  

The goal of a language model is to predict the most likely word to fill in the blank, such as "books," "laptops," or "minds," by computing a probability distribution over all possible words in the vocabulary.

More formally, given a sequence of words $ x_1, x_2, \ldots, x_t $, the language model computes the probability distribution of the next word $ x_{t+1} $:

$$ P(x_{t+1} | x_1, x_2, \ldots, x_t) $$

where $ x_{t+1} $ can be any word in the vocabulary $ V $.

### Broader Perspective:
A language model can also be thought of as a system that assigns a probability to an entire piece of text. For a sequence of words $ x_1, x_2, \ldots, x_n $, the joint probability of the text is computed as:

$$ P(x_1, x_2, \ldots, x_n) = \prod_{t=1}^n P(x_t | x_1, x_2, \ldots, x_{t-1}) $$

This decomposition leverages the chain rule of probability, making language modeling a sequential prediction problem.
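
As a minimal sketch of the chain-rule decomposition, the snippet below multiplies conditional probabilities for a three-word prefix. The table `COND_PROBS` and its values are invented for illustration; a real model would compute these conditionals rather than look them up.

```python
import math

# Hypothetical conditional probabilities P(next | context) for a toy model.
# Keys are (context_tuple, next_word); the numbers are made up for illustration.
COND_PROBS = {
    ((), "the"): 0.5,
    (("the",), "students"): 0.2,
    (("the", "students"), "opened"): 0.1,
}


def sequence_log_prob(words, cond_probs):
    """Sum log P(x_t | x_1..x_{t-1}) over the sequence (chain rule in log space)."""
    total = 0.0
    for t, word in enumerate(words):
        context = tuple(words[:t])
        total += math.log(cond_probs[(context, word)])
    return total


log_p = sequence_log_prob(["the", "students", "opened"], COND_PROBS)
joint_prob = math.exp(log_p)  # equals 0.5 * 0.2 * 0.1 up to rounding
```

Working in log space is a common choice because products of many small probabilities underflow quickly.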

---

## 2. Mathematical Equations and Formulations

Language modeling is inherently a probabilistic task, and its mathematical foundation is based on probability theory and statistical modeling. Below are the key equations and formulations:

### 2.1 Probability of a Sequence
The joint probability of a sequence $ x_1, x_2, \ldots, x_n $ is expressed using the chain rule of probability:

$$ P(x_1, x_2, \ldots, x_n) = P(x_1) \cdot P(x_2 | x_1) \cdot P(x_3 | x_1, x_2) \cdots P(x_n | x_1, x_2, \ldots, x_{n-1}) $$

This formulation implies that a language model must learn to estimate conditional probabilities of the form $ P(x_t | x_1, x_2, \ldots, x_{t-1}) $.

### 2.2 Objective Function: Maximum Likelihood Estimation (MLE)
The goal of training a language model is to maximize the likelihood of the observed data (i.e., a corpus of text). For a dataset $ D $ of sequences, where each sequence $ s = (x_1, x_2, \ldots, x_{n_s}) $ has its own length $ n_s $, the likelihood is:

$$ L(\theta) = \prod_{s \in D} P(x_1, x_2, \ldots, x_{n_s}; \theta) = \prod_{s \in D} \prod_{t=1}^{n_s} P(x_t | x_1, x_2, \ldots, x_{t-1}; \theta) $$

where $ \theta $ represents the parameters of the model (e.g., weights of a neural network). In practice, we maximize the log-likelihood, which turns the product into a sum and avoids numerical underflow:

$$ \log L(\theta) = \sum_{s \in D} \sum_{t=1}^{n_s} \log P(x_t | x_1, x_2, \ldots, x_{t-1}; \theta) $$

The corresponding loss function, the negative log-likelihood (NLL) averaged over sequences, is minimized:

$$ \mathcal{L}(\theta) = - \frac{1}{|D|} \sum_{s \in D} \sum_{t=1}^{n_s} \log P(x_t | x_1, x_2, \ldots, x_{t-1}; \theta) $$
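
The loss above can be sketched in a few lines, assuming the model's probabilities for the observed tokens are already available. The function name and the toy probabilities are illustrative, and the average is taken per sequence to match the $ 1/|D| $ factor.

```python
import math


def negative_log_likelihood(dataset_token_probs):
    """Average NLL per sequence.

    dataset_token_probs: list of sequences, each a list of the model's
    probabilities P(x_t | x_1..x_{t-1}; theta) for the observed tokens.
    """
    total = 0.0
    for token_probs in dataset_token_probs:
        total += sum(math.log(p) for p in token_probs)
    return -total / len(dataset_token_probs)


# Two toy sequences with made-up per-token probabilities.
toy_probs = [[0.5, 0.25], [0.5]]
nll = negative_log_likelihood(toy_probs)  # -(log 0.5 + log 0.25 + log 0.5) / 2
```

Many implementations normalize by the total number of tokens instead, which makes losses comparable across sequences of different lengths.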

### 2.3 Perplexity
Perplexity is a commonly used metric to evaluate language models. It measures how well a model predicts a sample of text, with lower perplexity indicating better performance. For a test sequence $ x_1, x_2, \ldots, x_n $, perplexity is defined as:

$$ \text{Perplexity} = 2^{-\frac{1}{n} \sum_{t=1}^n \log_2 P(x_t | x_1, x_2, \ldots, x_{t-1})} $$

Intuitively, perplexity represents the effective number of choices the model considers for the next word, with lower values implying the model is more confident in its predictions.
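
A small sketch makes the "effective number of choices" reading concrete: a model that spreads probability uniformly over four words has perplexity 4. The per-token probabilities are made up for illustration.

```python
import math


def perplexity(token_probs):
    """Perplexity of one sequence from per-token probabilities P(x_t | x_<t)."""
    n = len(token_probs)
    avg_log2 = sum(math.log2(p) for p in token_probs) / n
    return 2 ** (-avg_log2)


uniform_ppl = perplexity([0.25, 0.25, 0.25])  # uniform over 4 choices
confident_ppl = perplexity([0.5, 1.0])        # more confident model, lower value
```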

---

## 3. Core Principles of Language Modeling

Language modeling relies on several core principles, which guide the design and training of models:

### 3.1 Sequential Dependency
Language models assume that words in a sequence are not independent: the probability of a word depends on the preceding context, either a fixed-length window (a Markov assumption) or the full history. Early models, such as n-gram models, used a fixed context window (e.g., bigrams, trigrams), while modern neural models can capture long-range dependencies using architectures like transformers.

### 3.2 Representation Learning
Modern language models learn dense, continuous representations (embeddings) of words or tokens. These embeddings capture semantic and syntactic relationships, enabling the model to generalize to unseen data. For example, words like "king" and "queen" are represented in a way that reflects their similarity.

### 3.3 Generalization
Language models must generalize to unseen sequences, which requires learning patterns from large, diverse datasets. Overfitting is a challenge, addressed through techniques like regularization, dropout, and large-scale pretraining.

### 3.4 Autoregressive Modeling
Most language models are autoregressive, meaning they predict the next word based on the previous words in a left-to-right manner. This is in contrast to masked language models (e.g., BERT), which predict masked words in a bidirectional context (though such models are not true generative language models).

### 3.5 Scalability
Modern language models, especially large language models (LLMs), rely on massive computational resources and datasets. Scaling laws suggest that increasing model size, data, and compute leads to better performance, though with diminishing returns.

---

## 4. Pros and Cons of Language Modeling

### 4.1 Pros
- **Versatility**: Language models are foundational to many NLP tasks, including machine translation, text generation, sentiment analysis, and dialogue systems.
- **Transfer Learning**: Pretrained language models (e.g., BERT, GPT) can be fine-tuned for specific tasks, reducing the need for task-specific data.
- **Contextual Understanding**: Modern models, especially transformers, capture long-range dependencies and contextual nuances, enabling more coherent and accurate predictions.
- **Scalability**: Large-scale models can generalize to diverse domains and languages, making them widely applicable.
- **Human-like Generation**: Advanced models can generate fluent, human-like text, enabling applications like chatbots and automated content creation.

### 4.2 Cons
- **Computational Cost**: Training and deploying large language models require significant computational resources, making them inaccessible to smaller organizations.
- **Data Dependency**: High performance relies on massive, high-quality datasets, which may contain biases or sensitive information.
- **Bias and Ethics**: Language models can perpetuate biases present in the training data, leading to ethical concerns (e.g., generating biased or harmful content).
- **Lack of True Understanding**: Despite their fluency, language models lack true semantic understanding or reasoning capabilities, relying instead on statistical patterns.
- **Overfitting to Training Data**: Models may memorize training data, leading to poor generalization, especially on out-of-distribution data.
- **Interpretability**: Neural language models, especially transformers, are often considered "black boxes," making it difficult to understand their decision-making process.

---

## 5. Recent Advancements in Language Modeling

Language modeling has seen significant advancements in recent years, driven by innovations in model architectures, training paradigms, and hardware. Below are some key developments:

### 5.1 Transformer-Based Models
The introduction of the transformer architecture (Vaswani et al., 2017) revolutionized language modeling. Transformers use self-attention mechanisms to capture long-range dependencies, replacing earlier recurrent neural networks (RNNs) and long short-term memory (LSTM) models. Key transformer-based models include:

- **GPT (Generative Pretrained Transformer)**: Autoregressive models trained to predict the next word in a sequence. Variants include GPT-2, GPT-3, and GPT-4, with increasing scale and performance.
- **BERT (Bidirectional Encoder Representations from Transformers)**: While not a traditional generative language model, BERT uses masked language modeling to predict masked words, enabling bidirectional context understanding.
- **T5 (Text-to-Text Transfer Transformer)**: Frames all NLP tasks as text-to-text problems, unifying language modeling with tasks like translation and summarization.

### 5.2 Large Language Models (LLMs)
The scaling of language models to billions or even trillions of parameters has led to remarkable performance gains. Examples include:

- **GPT-3**: A 175-billion-parameter model capable of few-shot learning, where it performs tasks with minimal task-specific training data.
- **PaLM (Pathways Language Model)**: A 540-billion-parameter model with improved reasoning and multilingual capabilities.
- **LLaMA (Large Language Model Meta AI)**: Openly released models from Meta that achieve competitive performance with smaller parameter counts by training relatively small models on large amounts of data.

### 5.3 Efficient Training Techniques
To address the computational cost of training large models, several techniques have emerged:

- **Mixed-Precision Training**: Uses lower-precision arithmetic (e.g., FP16) to reduce memory usage and speed up training.
- **Parameter-Efficient Fine-Tuning (PEFT)**: Methods like LoRA (Low-Rank Adaptation) and adapters enable fine-tuning large models with minimal additional parameters.
- **Sparse Models**: Techniques like mixture-of-experts (MoE) reduce computational cost by activating only a subset of parameters for each input token.

### 5.4 Multilingual and Cross-Lingual Models
Recent models aim to support multiple languages, reducing the need for language-specific models. Examples include:

- **mBERT**: A multilingual version of BERT trained on text from over 100 languages.
- **XLM-R (Cross-Lingual Language Model - RoBERTa)**: A robust multilingual model for cross-lingual transfer learning.

### 5.5 In-Context Learning and Prompting
Modern LLMs like GPT-3 and PaLM support in-context learning, where the model performs tasks based on examples provided in the input prompt, without fine-tuning. This paradigm has enabled few-shot and zero-shot learning, significantly increasing flexibility.

### 5.6 Ethical and Responsible AI
Recent advancements also focus on mitigating biases and improving the ethical use of language models:

- **Bias Detection and Mitigation**: Techniques to identify and reduce biases in model predictions, such as fairness-aware training and debiasing algorithms.
- **Energy Efficiency**: Efforts to reduce the carbon footprint of training large models, including energy-efficient hardware and algorithms.

### 5.7 Integration with Other Modalities
Language models are increasingly integrated with other modalities, such as vision and audio, to create multimodal systems. Examples include:

- **CLIP (Contrastive Language-Image Pretraining)**: Learns a joint embedding space for images and text, enabling tasks like zero-shot image classification and image-text retrieval.
- **DALL-E**: Uses language models to generate images from text prompts.

### 5.8 Compression and Deployment
To deploy large models in resource-constrained environments, techniques like model quantization, pruning, and knowledge distillation are used to compress models without significant performance loss.

---

## 6. Conclusion

Language modeling is a cornerstone of modern NLP, with applications ranging from text generation to dialogue systems. Its mathematical foundation lies in probability theory, and its practical success is driven by advances in neural architectures, particularly transformers. While language models offer remarkable capabilities, they also face challenges related to computational cost, bias, and interpretability. Recent advancements, such as large-scale models, efficient training techniques, and multimodal integration, continue to push the boundaries of what language models can achieve, making them a vibrant area of research and application in artificial intelligence.

# n-gram Language Models

## Definition and Foundations

n-gram language models are statistical language models based on the Markov assumption that the probability of a word depends only on the previous $n-1$ words rather than the entire history. These models approximate the joint probability of a sequence by conditioning each word on its preceding context of fixed length.

### Core Definition
An n-gram is a contiguous sequence of $n$ items (characters, words, or tokens) from a given text. In language modeling:
- Unigram ($n=1$): single word model
- Bigram ($n=2$): two-word model
- Trigram ($n=3$): three-word model
- and so on for higher values of $n$

## Mathematical Formulation

### Chain Rule Decomposition
For a sequence of words $w_1, w_2, ..., w_m$, the joint probability is:

$$P(w_1, w_2, ..., w_m) = \prod_{i=1}^{m} P(w_i|w_1, w_2, ..., w_{i-1})$$

### Markov Assumption
The n-gram model makes the approximation that a word's probability depends only on the $n-1$ previous words:

$$P(w_i|w_1, w_2, ..., w_{i-1}) \approx P(w_i|w_{i-(n-1)}, ..., w_{i-1})$$

### Probability Computation
By the definition of conditional probability, this is the probability of the full n-gram divided by the probability of its history (the first $n-1$ words):

$$P(w_i|w_{i-(n-1)}, ..., w_{i-1}) = \frac{P(w_{i-(n-1)}, ..., w_{i-1}, w_i)}{P(w_{i-(n-1)}, ..., w_{i-1})}$$

### Maximum Likelihood Estimation
Using count statistics from a corpus:

$$P(w_i|w_{i-(n-1)}, ..., w_{i-1}) = \frac{C(w_{i-(n-1)}, ..., w_{i-1}, w_i)}{C(w_{i-(n-1)}, ..., w_{i-1})}$$

Where $C()$ represents the count function.
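
The count-based estimate can be sketched for bigrams on a toy corpus. The corpus, the function name, and the `<s>` / `</s>` boundary markers used for padding are illustrative choices, not part of any particular toolkit.

```python
from collections import Counter

corpus = [
    ["the", "students", "opened", "their", "books"],
    ["the", "students", "opened", "their", "laptops"],
    ["the", "teachers", "opened", "their", "books"],
]

history_counts = Counter()  # C(w_{i-1}), counted as a bigram history
bigram_counts = Counter()   # C(w_{i-1}, w_i)
for sentence in corpus:
    tokens = ["<s>"] + sentence + ["</s>"]
    for prev_word, word in zip(tokens, tokens[1:]):
        history_counts[prev_word] += 1
        bigram_counts[(prev_word, word)] += 1


def bigram_mle(prev_word, word):
    """P(word | prev_word) = C(prev_word, word) / C(prev_word)."""
    if history_counts[prev_word] == 0:
        return 0.0  # unseen history: the MLE is undefined, 0.0 is a placeholder
    return bigram_counts[(prev_word, word)] / history_counts[prev_word]


p_books = bigram_mle("their", "books")  # 2 of the 3 bigrams starting with "their"
p_minds = bigram_mle("their", "minds")  # unseen bigram gets probability 0
```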

## Estimation Methods

### Count-Based Estimation
1. **Direct counting**:
   $$P(w_i|w_{i-1}) = \frac{C(w_{i-1}, w_i)}{C(w_{i-1})}$$

2. **Handling sentence boundaries**: Special tokens like $\langle \text{s} \rangle$ (start) and $\langle \text{/s} \rangle$ (end)

3. **Unknown words**: Using $\langle \text{UNK} \rangle$ token:
   $$P(\langle \text{UNK} \rangle|w_{i-1}) = \frac{C(w_{i-1}, \langle \text{UNK} \rangle)}{C(w_{i-1})}$$

## Smoothing Techniques

### Zero Probability Problem
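Count-based methods assign zero probability to unseen n-grams. Because a sequence probability is a product of conditional probabilities, a single unseen n-gram makes the probability of the whole sequence zero and its perplexity infinite.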

### Laplace (Add-1) Smoothing
Add 1 to all counts:

$$P_{Laplace}(w_i|w_{i-1}) = \frac{C(w_{i-1}, w_i) + 1}{C(w_{i-1}) + |V|}$$

Where $|V|$ is vocabulary size.

### Add-k Smoothing
Generalization of add-1, with fractional counts:

$$P_{Add-k}(w_i|w_{i-1}) = \frac{C(w_{i-1}, w_i) + k}{C(w_{i-1}) + k|V|}$$
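
A minimal sketch of add-k smoothing for bigrams follows, with Laplace smoothing as the special case $k = 1$. The toy sentence and the function name are assumptions made for the example.

```python
from collections import Counter

tokens = "the cat sat on the mat the cat ate".split()
vocab = sorted(set(tokens))
history_counts = Counter(tokens[:-1])             # C(w_{i-1}) as a bigram history
bigram_counts = Counter(zip(tokens, tokens[1:]))  # C(w_{i-1}, w_i)


def add_k_prob(prev_word, word, k=1.0):
    """(C(prev, word) + k) / (C(prev) + k * |V|); k=1 gives Laplace smoothing."""
    numerator = bigram_counts[(prev_word, word)] + k
    denominator = history_counts[prev_word] + k * len(vocab)
    return numerator / denominator


p_seen = add_k_prob("the", "cat")    # (2 + 1) / (3 + 6)
p_unseen = add_k_prob("the", "sat")  # (0 + 1) / (3 + 6)
row_sum = sum(add_k_prob("the", w, k=0.5) for w in vocab)  # still a distribution
```

Smaller values of `k` move less probability mass from seen to unseen bigrams.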

### Backoff and Interpolation

#### Backoff
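Use lower-order n-grams when the higher-order n-gram was not observed. To keep the distribution normalized, observed n-grams use a discounted estimate $P^{*}$ (as in Katz backoff), and the backoff weight $\alpha$ redistributes the probability mass freed by discounting:

$$P_{backoff}(w_i|w_{i-(n-1)}, ..., w_{i-1}) =
\begin{cases}
P^{*}(w_i|w_{i-(n-1)}, ..., w_{i-1}) & \text{if } C(w_{i-(n-1)}, ..., w_{i-1}, w_i) > 0 \\
\alpha(w_{i-(n-1)}, ..., w_{i-1}) \cdot P_{backoff}(w_i|w_{i-(n-2)}, ..., w_{i-1}) & \text{otherwise}
\end{cases}$$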

#### Interpolation
Linear combination of different order n-gram models:

$$P_{interp}(w_i|w_{i-2}, w_{i-1}) = \lambda_3 P(w_i|w_{i-2}, w_{i-1}) + \lambda_2 P(w_i|w_{i-1}) + \lambda_1 P(w_i)$$

Where $\lambda_1 + \lambda_2 + \lambda_3 = 1$ and $\lambda_i \geq 0$
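
As a hedged sketch, interpolation reduces to a weighted sum once the three component probabilities are known. The default weights below are illustrative only; in practice they are typically tuned on held-out data.

```python
def interpolate(p_trigram, p_bigram, p_unigram, lambdas=(0.1, 0.3, 0.6)):
    """Return lambda_3 * P_trigram + lambda_2 * P_bigram + lambda_1 * P_unigram.

    lambdas = (lambda_1, lambda_2, lambda_3); the default values are
    illustrative and not taken from any trained model.
    """
    lambda_1, lambda_2, lambda_3 = lambdas
    if min(lambdas) < 0 or abs(sum(lambdas) - 1.0) > 1e-9:
        raise ValueError("lambdas must be non-negative and sum to 1")
    return lambda_3 * p_trigram + lambda_2 * p_bigram + lambda_1 * p_unigram


# An unseen trigram (MLE probability 0) still receives probability mass.
p_interp = interpolate(p_trigram=0.0, p_bigram=0.2, p_unigram=0.05)
```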

### Kneser-Ney Smoothing
Advanced technique accounting for word distribution:

$$P_{KN}(w_i|w_{i-1}) = \frac{\max(C(w_{i-1}, w_i) - d, 0)}{C(w_{i-1})} + \lambda(w_{i-1})P_{continuation}(w_i)$$

Where:
- $d$ is the discount parameter
- $\lambda(w_{i-1})$ is a normalization constant
- $P_{continuation}(w_i)$ is the continuation probability

$$P_{continuation}(w_i) = \frac{|\{w_{j-1}: C(w_{j-1}, w_i) > 0\}|}{\sum_{w'} |\{w_{j-1}: C(w_{j-1}, w') > 0\}|}$$
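
The bigram formulas above can be sketched as follows. The source does not spell out $\lambda(w_{i-1})$; this example assumes the standard choice $d \cdot |\{w : C(w_{i-1}, w) > 0\}| / C(w_{i-1})$, which gives back exactly the mass removed by discounting. The toy text and the discount $d = 0.75$ are illustrative.

```python
from collections import Counter

tokens = "san francisco is in california and san francisco is foggy".split()
bigram_counts = Counter(zip(tokens, tokens[1:]))
history_counts = Counter(tokens[:-1])
vocab = sorted(set(tokens))

# Distinct left contexts per word, and distinct followers per history.
distinct_left = Counter(word for (_, word) in bigram_counts)
distinct_right = Counter(prev for (prev, _) in bigram_counts)
num_bigram_types = len(bigram_counts)


def p_continuation(word):
    """Fraction of distinct bigram types that end in `word`."""
    return distinct_left[word] / num_bigram_types


def p_kneser_ney(prev_word, word, d=0.75):
    """Interpolated Kneser-Ney bigram estimate for a history seen in training."""
    c_prev = history_counts[prev_word]
    discounted = max(bigram_counts[(prev_word, word)] - d, 0.0) / c_prev
    # Assumed normalizer: d * (number of distinct followers of prev_word) / C(prev_word).
    lam = d * distinct_right[prev_word] / c_prev
    return discounted + lam * p_continuation(word)


row_sum = sum(p_kneser_ney("is", w) for w in vocab)  # sums to 1 over the vocabulary
```

Note that "francisco" is frequent in this text but follows only one distinct word, so its continuation probability stays low.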

## Variants of n-gram Models

### Character-level n-grams
- Operate on character sequences instead of words
- Smaller vocabulary, longer sequences
- Mathematical formulation remains the same but with characters

### Skip-grams
- Allow gaps in the sequence
- For example, a 2-skip bigram pairs words separated by up to 2 other words

$$P_{skip}(w_i|w_{i-k}) \text{ for } 1 \leq k \leq s+1$$

where $s$ is the maximum number of skipped words ($s = 2$ in the example above).
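
A short sketch of skip-bigram extraction, assuming "skip" means the number of intervening words, so a 2-skip bigram pairs words at distance 1, 2, or 3. The function name is hypothetical.

```python
def skip_bigrams(tokens, max_skip):
    """Return (left, right) pairs separated by at most `max_skip` intervening tokens."""
    pairs = []
    for i, left in enumerate(tokens):
        for distance in range(1, max_skip + 2):  # distance 1 = adjacent words
            j = i + distance
            if j < len(tokens):
                pairs.append((left, tokens[j]))
    return pairs


pairs = skip_bigrams(["the", "cat", "sat", "down"], max_skip=2)
```

With `max_skip=0` the function returns ordinary bigrams.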

### Variable-length n-grams
- Adapt context length based on available data
- Use longer contexts when data supports it

## Core Principles

### Training Procedure
1. Collect large text corpus
2. Tokenize text into words/tokens
3. Extract n-grams from text
4. Calculate n-gram statistics
5. Apply smoothing techniques
6. Store model parameters (n-gram probabilities)

### Computational Complexity
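- Space complexity: $O(|V|^n)$ in the worst case, where $|V|$ is vocabulary size; in practice only observed n-grams are stored, so the number of entries is bounded by the corpus size
- Training time complexity: $O(T)$ where $T$ is corpus size
- Inference time complexity: $O(1)$ expected for a single probability lookup (e.g., with a hash table)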

### Context Limitation
- Fixed context window of $n-1$ words
- Limited ability to capture long-range dependencies
- Exponential sparsity with increasing $n$

## Evaluation Metrics

### Perplexity
Primary evaluation metric for language models:

$$PP(W) = \sqrt[N]{\frac{1}{P(w_1, w_2, ..., w_N)}} = \sqrt[N]{\prod_{i=1}^{N}\frac{1}{P(w_i|w_1, ..., w_{i-1})}}$$

Lower perplexity indicates better model performance.

### Cross-Entropy
Related to perplexity:

$$H(W) = -\frac{1}{N}\sum_{i=1}^{N}\log_2 P(w_i|w_1, ..., w_{i-1})$$

Perplexity is then $PP(W) = 2^{H(W)}$.
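
The two perplexity expressions can be checked against each other numerically. The per-token probabilities below are made up so that the arithmetic is easy to follow.

```python
import math

token_probs = [0.5, 0.125, 0.25, 0.5]  # made-up P(w_i | history) values
N = len(token_probs)

# Cross-entropy in bits per token.
cross_entropy = -sum(math.log2(p) for p in token_probs) / N

# Perplexity two ways: 2^H and the N-th root of the inverse sequence probability.
ppl_from_entropy = 2 ** cross_entropy
ppl_from_root = (1.0 / math.prod(token_probs)) ** (1.0 / N)
```

Here the log-probabilities sum to $-7$ bits, so $H(W) = 1.75$ and both expressions give $2^{1.75}$.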

## Practical Implementations

### Data Structures
- Hash tables for sparse n-gram storage
- Tries/prefix trees for efficient lookup
- Compressed data structures for large models
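
As one possible illustration of the prefix-tree option, the sketch below stores bigram counts in a nested-dictionary trie. The class `NGramTrie` is hypothetical; production toolkits use far more compact layouts.

```python
class NGramTrie:
    """Nested-dict prefix tree mapping n-grams to counts."""

    def __init__(self):
        self.children = {}
        self.count = 0

    def add(self, ngram):
        node = self
        for token in ngram:
            node = node.children.setdefault(token, NGramTrie())
            node.count += 1  # every prefix of the n-gram is counted too

    def get_count(self, ngram):
        node = self
        for token in ngram:
            node = node.children.get(token)
            if node is None:
                return 0
        return node.count


trie = NGramTrie()
tokens = "the cat sat on the cat".split()
for i in range(len(tokens) - 1):
    trie.add(tokens[i:i + 2])
```

Because prefixes are counted on insertion, the history count needed for the MLE denominator comes from the same structure as the n-gram count.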

### Pruning Techniques
- Count cutoffs: exclude n-grams below frequency threshold
- Entropy-based pruning: remove n-grams that contribute least to model performance

## Pros and Cons

### Advantages
- Conceptually simple and interpretable
- Efficient training and inference
- Works well for limited domains with sufficient data
- Minimal computational resources required
- Theoretically well-understood

### Disadvantages
- Cannot capture long-range dependencies
- Suffers from data sparsity (exponential with n)
- Storage requirements grow exponentially with n
- Limited semantic understanding
- Poor generalization to unseen contexts
- Fixed context window

## Transition to Neural Language Models

### Limitations Addressed by Neural Models
- Curse of dimensionality
- Data sparsity
- Fixed-length context
- Lack of semantic generalization

### Feed-Forward Neural Language Model
- Inputs: one-hot vectors or embeddings for $n-1$ previous words
- Hidden layers: typically dense with non-linear activations
- Output: probability distribution over vocabulary

$$P(w_t|w_{t-(n-1)}, ..., w_{t-1}) = \text{softmax}(W_2 \cdot \text{tanh}(W_1 \cdot [e(w_{t-(n-1)}); ...; e(w_{t-1})] + b_1) + b_2)$$

Where:
- $e(w_i)$ is the embedding of word $w_i$
- $W_1, W_2, b_1, b_2$ are learnable parameters
- $[;]$ denotes concatenation
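
A forward pass of such a model can be sketched in plain Python. This assumes one bias is added inside the tanh (`b_hidden`) and one before the softmax (`b_out`); the vocabulary, dimensions, and random untrained weights are all illustrative.

```python
import math
import random

random.seed(0)
VOCAB = ["the", "students", "opened", "their", "books"]
EMBED_DIM, HIDDEN_DIM, CONTEXT = 3, 4, 2  # CONTEXT = n - 1 previous words


def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def softmax(logits):
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Randomly initialised (untrained) parameters.
E = {w: [random.uniform(-0.5, 0.5) for _ in range(EMBED_DIM)] for w in VOCAB}
W_hidden, b_hidden = rand_matrix(HIDDEN_DIM, CONTEXT * EMBED_DIM), [0.0] * HIDDEN_DIM
W_out, b_out = rand_matrix(len(VOCAB), HIDDEN_DIM), [0.0] * len(VOCAB)


def next_word_distribution(context_words):
    """Concatenate context embeddings, apply a tanh layer, then a softmax layer."""
    x = [value for w in context_words for value in E[w]]  # concatenation
    hidden = [math.tanh(a + b) for a, b in zip(matvec(W_hidden, x), b_hidden)]
    logits = [a + b for a, b in zip(matvec(W_out, hidden), b_out)]
    return dict(zip(VOCAB, softmax(logits)))


dist = next_word_distribution(["opened", "their"])
```

Unlike a count table, every word receives non-zero probability, and words with similar embeddings produce similar predictions.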

### Recurrent Neural Models
Overcome fixed context limitation:

$$h_t = f(h_{t-1}, e(w_t))$$
$$P(w_{t+1}|w_1, ..., w_t) = \text{softmax}(W \cdot h_t + b)$$

### Transformer-Based Models
Now dominate language modeling with self-attention:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
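
The attention formula can be sketched without any framework, treating $Q$, $K$, and $V$ as lists of row vectors. The tiny inputs are made up; real implementations use batched tensor operations and usually add masking.

```python
import math


def softmax(row):
    shift = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for matrices given as lists of row vectors."""
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the value rows.
        outputs.append([sum(wj * v[c] for wj, v in zip(w, V)) for c in range(len(V[0]))])
    return outputs, weights


Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
out, attn = scaled_dot_product_attention(Q, K, V)
```

The query matches the first key more closely, so the output is pulled toward the first value row.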

## Recent Advancements in n-gram Models

### Neural-Symbolic Integration
- Combining n-gram statistics with neural representations
- Improved smoothing with neural techniques
- n-gram feature extraction for neural models

### Subword n-grams
- BPE (Byte Pair Encoding) and WordPiece tokenization
- Mitigates vocabulary size issues
- Handles out-of-vocabulary words

### Efficient n-gram Deployment
- Compressed n-gram models for mobile devices
- Pruned models maintaining performance
- Quantized representations

### Hybrid Models
- n-gram caching for neural LMs
- Ensemble methods combining statistical and neural approaches
- Continuous-space n-gram models

### Applications in Modern Systems
- Spelling correction and input prediction
- Low-resource language modeling
- Domain-specific applications (legal, medical)
- Fallback mechanisms in production systems

```python
import torch
import torch.nn as nn
from typing import Tuple


class LSTMCell(nn.Module):
    """Custom LSTM Cell implementation from scratch"""

    def __init__(self, input_size: int, hidden_size: int):
        super().__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size

        # Weight matrices acting on the concatenated [hidden state; input]
        self.W_i = nn.Parameter(torch.empty(input_size + hidden_size, hidden_size))
        self.W_f = nn.Parameter(torch.empty(input_size + hidden_size, hidden_size))
        self.W_c = nn.Parameter(torch.empty(input_size + hidden_size, hidden_size))
        self.W_o = nn.Parameter(torch.empty(input_size + hidden_size, hidden_size))

        # Bias terms
        self.b_i = nn.Parameter(torch.empty(hidden_size))
        self.b_f = nn.Parameter(torch.empty(hidden_size))
        self.b_c = nn.Parameter(torch.empty(hidden_size))
        self.b_o = nn.Parameter(torch.empty(hidden_size))

        self._initialize_weights()

    def _initialize_weights(self) -> None:
        """Initialize weights using Xavier/Glorot initialization"""
        nn.init.xavier_uniform_(self.W_i)
        nn.init.xavier_uniform_(self.W_f)
        nn.init.xavier_uniform_(self.W_c)
        nn.init.xavier_uniform_(self.W_o)
        nn.init.zeros_(self.b_i)
        nn.init.zeros_(self.b_f)
        nn.init.zeros_(self.b_c)
        nn.init.zeros_(self.b_o)

    def forward(self, x: torch.Tensor, state: Tuple[torch.Tensor, torch.Tensor]) -> Tuple[torch.Tensor, torch.Tensor]:
        """Forward pass of LSTM cell

        Args:
            x: Input tensor of shape (batch_size, input_size)
            state: Tuple of (hidden_state, cell_state) from previous time step

        Returns:
            Tuple of (new_hidden_state, new_cell_state)
        """
        h_prev, c_prev = state

        # Concatenate previous hidden state and input
        combined = torch.cat((h_prev, x), dim=1)  # Shape: (batch_size, hidden_size + input_size)

        # Input gate
        i_t = torch.sigmoid(combined @ self.W_i + self.b_i)

        # Forget gate
        f_t = torch.sigmoid(combined @ self.W_f + self.b_f)

        # Candidate cell state
        c_tilde = torch.tanh(combined @ self.W_c + self.b_c)

        # Update cell state
        c_t = f_t * c_prev + i_t * c_tilde

        # Output gate
        o_t = torch.sigmoid(combined @ self.W_o + self.b_o)

        # Update hidden state
        h_t = o_t * torch.tanh(c_t)

        return h_t, c_t


class LSTMEncoder(nn.Module):
    """Encoder using custom LSTM implementation"""

    def __init__(self, input_size: int, hidden_size: int, num_layers: int = 1):
        super().__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.num_layers = num_layers

        # Create LSTM layers; only the first layer sees the raw input size
        self.lstm_cells = nn.ModuleList([
            LSTMCell(input_size if i == 0 else hidden_size, hidden_size)
            for i in range(num_layers)
        ])

    def _init_hidden(self, batch_size: int, device: torch.device) -> Tuple[torch.Tensor, torch.Tensor]:
        """Initialize hidden and cell states"""
        hidden = torch.zeros(batch_size, self.hidden_size, device=device)
        cell = torch.zeros(batch_size, self.hidden_size, device=device)
        return hidden, cell

    def forward(self, x: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
        """Forward pass of encoder

        Args:
            x: Input tensor of shape (batch_size, seq_len, input_size)

        Returns:
            Tuple of (final_hidden_state, final_cell_state)
        """
        batch_size, seq_len, _ = x.size()
        device = x.device

        # Initialize states for each layer
        states = [
            self._init_hidden(batch_size, device)
            for _ in range(self.num_layers)
        ]

        # Process sequence
        for t in range(seq_len):
            input_t = x[:, t, :]

            # Process through each layer; upper layers read the hidden state below them
            for layer_idx, lstm_cell in enumerate(self.lstm_cells):
                if layer_idx == 0:
                    states[layer_idx] = lstm_cell(input_t, states[layer_idx])
                else:
                    states[layer_idx] = lstm_cell(states[layer_idx-1][0], states[layer_idx])

        # Return final states from the last layer
        return states[-1]


class LSTMDecoder(nn.Module):
    """Decoder using custom LSTM implementation"""

    def __init__(self, hidden_size: int, output_size: int, num_layers: int = 1):
        super().__init__()
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.num_layers = num_layers

        # Create LSTM layers
        self.lstm_cells = nn.ModuleList([
            LSTMCell(hidden_size, hidden_size)
            for _ in range(num_layers)
        ])

        # Output projection
        self.fc = nn.Linear(hidden_size, output_size)

        # Projection layer to match decoder input dimensions
        self.input_proj = nn.Linear(output_size, hidden_size)

    def forward(self, x: torch.Tensor, states: Tuple[torch.Tensor, torch.Tensor], seq_len: int) -> torch.Tensor:
        """Forward pass of decoder

        Args:
            x: Initial input tensor of shape (batch_size, hidden_size)
            states: Initial states from encoder
            seq_len: Length of output sequence

        Returns:
            Output tensor of shape (batch_size, seq_len, output_size)
        """
        outputs = []

        # Every decoder layer starts from the same encoder state (the encoder's last layer)
        layer_states = [states] * self.num_layers
        current_input = x

        for _ in range(seq_len):
            # Process through each layer
            for layer_idx, lstm_cell in enumerate(self.lstm_cells):
                layer_states[layer_idx] = lstm_cell(
                    current_input,
                    layer_states[layer_idx]
                )
                current_input = layer_states[layer_idx][0]

            # Project to output space
            output = self.fc(current_input)
            outputs.append(output)

            # Project output back to hidden_size for next step
            current_input = self.input_proj(output)

        # Stack outputs
        outputs = torch.stack(outputs, dim=1)  # (batch_size, seq_len, output_size)
        return outputs


class Seq2SeqTransfer(nn.Module):
    """Sequence-to-sequence model for transfer learning"""

    def __init__(self, input_size: int, hidden_size: int, output_size: int, num_layers: int = 1):
        super().__init__()
        self.encoder = LSTMEncoder(input_size, hidden_size, num_layers)
        self.decoder = LSTMDecoder(hidden_size, output_size, num_layers)

        # Input projection for initial decoder input
        self.decoder_input_proj = nn.Linear(hidden_size, hidden_size)

    def forward(self, source_seq: torch.Tensor, target_seq_len: int) -> torch.Tensor:
        """Forward pass of seq2seq model

        Args:
            source_seq: Input sequence of shape (batch_size, source_seq_len, input_size)
            target_seq_len: Length of target sequence

        Returns:
            Output tensor of shape (batch_size, target_seq_len, output_size)
        """
        # Encode source sequence
        encoder_states = self.encoder(source_seq)

        # Prepare initial decoder input (a learned projection of a zero vector)
        batch_size = source_seq.size(0)
        device = source_seq.device
        decoder_input = torch.zeros(batch_size, self.decoder.hidden_size, device=device)
        decoder_input = self.decoder_input_proj(decoder_input)

        # Decode to target sequence
        decoder_output = self.decoder(decoder_input, encoder_states, target_seq_len)

        return decoder_output


# Example usage
if __name__ == "__main__":
    # Hyperparameters
    input_size = 10
    hidden_size = 32
    output_size = 10
    num_layers = 2
    batch_size = 4
    source_seq_len = 5
    target_seq_len = 3

    # Create model
    model = Seq2SeqTransfer(input_size, hidden_size, output_size, num_layers)

    # Create dummy input
    source_seq = torch.randn(batch_size, source_seq_len, input_size)

    # Forward pass
    output = model(source_seq, target_seq_len)
    print(f"Output shape: {output.shape}")  # Should be (batch_size, target_seq_len, output_size)
```

# Multi-layer Deep Encoder-Decoder Machine Translation Networks

## Definition

Multi-layer deep encoder-decoder networks are neural architectures that transform sequences from one domain (e.g., sentences in English) to sequences in another domain (e.g., sentences in French). These models consist of two primary components: an encoder that processes the input sequence into a continuous representation, and a decoder that generates the output sequence based on this representation.

## Core Principles of Sequence-to-Sequence Models

- **Encoder**: Processes input sequence $X = (x_1, x_2, ..., x_n)$ into hidden representations
- **Decoder**: Generates output sequence $Y = (y_1, y_2, ..., y_m)$ using encoded information
- **Information transfer**: Information flows from source to target language through latent representations
- **End-to-end training**: Entire model is optimized jointly to maximize translation probability

## Why Attention? Sequence-to-Sequence: The Bottleneck Problem

### The Bottleneck Problem

In traditional sequence-to-sequence models, the encoder compresses the entire input sequence into a fixed-length vector:

$$c = h_n = f(x_1, x_2, ..., x_n)$$

This creates a bottleneck as:

- The fixed-size vector must contain all information about the source sequence
- Information degradation becomes severe for longer sequences
- Information from early tokens tends to be overwritten as the hidden state is updated, and vanishing gradients make such long-range dependencies hard to learn
- Model performance degrades significantly for long sequences

## Attention

### Conceptual Understanding

Attention mechanisms address the bottleneck problem by:
- Allowing the decoder to "focus" on different parts of the source sequence
- Creating direct connections between decoder states and encoder states
- Dynamically weighting the importance of source tokens for each output token

### Sequence-to-Sequence with Attention

#### Encoder RNN

The encoder creates a sequence of hidden states:

$$h_j = \text{EncoderRNN}(x_j, h_{j-1})$$

Each hidden state $h_j$ represents information about the input at position $j$ contextualized by previous inputs.

#### Decoder with Attention

Instead of using only the last hidden state, the decoder:
1. Computes attention weights for each encoder hidden state
2. Creates a context vector as a weighted sum of encoder states
3. Uses this context vector alongside its own hidden state to generate output

## Attention: In Equations

### Attention Score Computation

To compute decoder state $s_i$, each encoder state $h_1, h_2, ..., h_n$ is scored against the previous decoder state $s_{i-1}$:

$$e_{ij} = \text{score}(s_{i-1}, h_j)$$

The score function can be implemented in various ways:

1. **Dot product attention**:
   $$e_{ij} = s_{i-1}^T h_j$$

2. **General attention**:
   $$e_{ij} = s_{i-1}^T W_a h_j$$

3. **Additive/concat attention**:
   $$e_{ij} = v_a^T \tanh(W_a[s_{i-1}; h_j])$$

### Attention Weights

The scores are normalized using softmax:

$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{n} \exp(e_{ik})}$$

### Context Vector

The context vector is computed as:

$$c_i = \sum_{j=1}^{n} \alpha_{ij} h_j$$
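
The score, softmax, and context-vector steps above can be sketched for a single decoder step. The encoder states, the decoder state, and the function names are made up for illustration; only the dot-product and general score functions are shown.

```python
import math


def dot_score(s_prev, h):
    return sum(a * b for a, b in zip(s_prev, h))


def general_score(s_prev, h, W_a):
    # s_prev^T W_a h, with W_a given as a list of rows.
    W_h = [sum(w * x for w, x in zip(row, h)) for row in W_a]
    return dot_score(s_prev, W_h)


def attend(s_prev, encoder_states, score_fn=dot_score):
    """Return (attention weights alpha_i, context vector c_i) for one decoder step."""
    scores = [score_fn(s_prev, h) for h in encoder_states]
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(e - shift) for e in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    dim = len(encoder_states[0])
    context = [sum(a * h[d] for a, h in zip(alphas, encoder_states)) for d in range(dim)]
    return alphas, context


encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # made-up h_1..h_3
s_prev = [2.0, 0.0]                                     # made-up s_{i-1}
alphas, context = attend(s_prev, encoder_states)

# With W_a set to the identity, general attention reduces to dot-product attention.
identity = [[1.0, 0.0], [0.0, 1.0]]
alphas_general, _ = attend(s_prev, encoder_states, lambda s, h: general_score(s, h, identity))
```

The first and third encoder states score equally against `s_prev`, so they receive equal weight and dominate the context vector.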

### Output Generation

The output distribution is computed using:

$$p(y_i|y_1, ..., y_{i-1}, X) = \text{softmax}(W_o[s_i; c_i])$$

Where:
- $s_i = \text{DecoderRNN}(y_{i-1}, s_{i-1}, c_i)$
- $W_o$ is a learnable parameter matrix

## Attention is Great!

### Key Benefits

- **Solves the bottleneck problem**: No need to compress all information into one vector
- **Handles long-range dependencies**: Direct connections between related words
- **Provides interpretability**: Attention weights show which input tokens influence each output
- **Improves gradient flow**: Creates shorter paths for backpropagation
- **Enables better translation of rare words**: By focusing explicitly on their representations

## Attention Variants

### 1. Global vs. Local Attention

- **Global attention**: Attends to all source positions
- **Local attention**: Attends only to a window of positions around a predicted aligned position
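
The difference can be sketched by restricting the softmax to a window. This is a simplified version: it uses a hard window only, takes the aligned position `center` as given rather than predicting it, and omits any additional weighting inside the window. The function name and parameters are hypothetical.

```python
import math

def local_attention_weights(scores, center, half_width):
    """Softmax over scores inside [center - half_width, center + half_width];
    positions outside the window get weight 0."""
    n = len(scores)
    lo = max(0, center - half_width)
    hi = min(n, center + half_width + 1)
    window = scores[lo:hi]
    m = max(window)
    exps = [math.exp(e - m) for e in window]
    total = sum(exps)
    weights = [0.0] * n
    for offset, x in enumerate(exps):
        weights[lo + offset] = x / total
    return weights

scores = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0]  # six source positions, equal scores
w = local_attention_weights(scores, center=2, half_width=1)
print(w)  # non-zero only at positions 1, 2, 3
```

Global attention corresponds to a window that covers the whole source sequence, so its cost per output step grows with the source length, while the local variant's cost depends only on the window size.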

### 2. Hard vs. Soft Attention

- **Soft attention**: Differentiable, uses weighted sum of all positions
- **Hard attention**: Stochastic, samples a single position to attend to; the sampling step is not differentiable, so training typically relies on reinforcement-learning-style gradient estimators
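
A minimal sketch of the contrast, using made-up weights and states: soft attention returns a weighted average, while hard attention samples one state according to the same weights. The function names are hypothetical.

```python
import random

def soft_attend(alphas, states):
    """Deterministic weighted average of all states."""
    dim = len(states[0])
    return [sum(a * h[d] for a, h in zip(alphas, states)) for d in range(dim)]

def hard_attend(alphas, states, rng):
    """Sample a single position with probability given by alphas."""
    idx = rng.choices(range(len(states)), weights=alphas, k=1)[0]
    return states[idx], idx

alphas = [0.7, 0.2, 0.1]
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
rng = random.Random(0)  # seeded so the sketch is reproducible

soft = soft_attend(alphas, states)
hard, idx = hard_attend(alphas, states, rng)
print("soft:", soft)
print("hard:", hard, "picked position", idx)
```

The soft result changes smoothly as the weights change, which is what makes it trainable by backpropagation; the hard result jumps between discrete states.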

### 3. Self-Attention (Intra-attention)

- Relates different positions within the same sequence
- Forms the basis for Transformer architectures
- Computes attention between each element and every element in the sequence, including itself
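
A stripped-down sketch of self-attention follows. It assumes scaled dot-product scoring as used in Transformers and, to stay short, uses the inputs directly as queries, keys, and values; a real layer would first apply learned projections.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [x / total for x in exps]

def self_attention(X):
    """Each position attends to every position of the same sequence X."""
    d = len(X[0])
    out = []
    for q in X:
        # Scaled dot-product score between this position and every position
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, X)) for j in range(d)])
    return out

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three token vectors (illustrative)
Y = self_attention(X)
for row in Y:
    print(row)
```

The output has the same shape as the input, and each output row is a mixture of all input rows, which is how every position gains access to context from the whole sequence in a single step.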

### 4. Multi-head Attention

- Runs attention multiple times in parallel
- Each attention "head" learns different aspects of relationships
- Outputs are concatenated and linearly transformed

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, ..., \text{head}_h)W^O$$

Where each head is computed as:

$$\text{head}_i = \text{Attention}(QW^Q_i, KW^K_i, VW^V_i)$$
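
The two formulas can be traced with a small pure-Python sketch. It assumes $\text{Attention}$ is scaled dot-product attention, and the projection matrices are hand-picked (each head simply selects half of the features, and $W^O$ is the identity) rather than learned, so the effect of each head is easy to see.

```python
import math

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [x / total for x in exps]

def attention(Q, K, V):
    """Scaled dot-product attention (assumed definition of Attention)."""
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_T)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)

def multi_head(Q, K, V, heads, W_O):
    """heads is a list of (W_q, W_k, W_v) projection triples, one per head."""
    outputs = [
        attention(matmul(Q, W_q), matmul(K, W_k), matmul(V, W_v))
        for W_q, W_k, W_v in heads
    ]
    # Concatenate the per-head outputs row by row, then apply W^O
    concat = [sum((head[i] for head in outputs), []) for i in range(len(Q))]
    return matmul(concat, W_O)

# Illustrative setup: model dimension 4, two heads of dimension 2
first_half = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]
second_half = [[0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
heads = [(first_half,) * 3, (second_half,) * 3]
W_O = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

X = [[1.0, 0.0, 0.0, 2.0], [0.0, 1.0, 2.0, 0.0]]
out = multi_head(X, X, X, heads, W_O)
for row in out:
    print(row)
```

Because each head sees a different projection of the input, the two heads compute different attention patterns over the same tokens before their outputs are merged.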

### 5. Hierarchical Attention

- Different levels of attention (word-level, sentence-level)
- Especially useful for document-level tasks
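
One way to picture the two levels is to pool words into sentence vectors and then pool sentence vectors into a document vector, each time with attention. In this sketch the query vectors (`word_query`, `sentence_query`) stand in for learned parameters and are set to zero, which yields uniform weights; all names and values are hypothetical.

```python
import math

def attend(vectors, query):
    """Attention pooling: score each vector against the query, softmax, average."""
    scores = [sum(q * x for q, x in zip(query, v)) for v in vectors]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]

def hierarchical_attention(document, word_query, sentence_query):
    # Word level: one vector per sentence
    sentence_vectors = [attend(words, word_query) for words in document]
    # Sentence level: one vector for the whole document
    return attend(sentence_vectors, sentence_query)

document = [
    [[1.0, 0.0], [0.0, 1.0]],              # sentence 1: two word vectors
    [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]],  # sentence 2: three word vectors
]
doc_vec = hierarchical_attention(document, [0.0, 0.0], [0.0, 0.0])
print(doc_vec)
```

With non-zero queries, the word level would emphasize informative words within each sentence and the sentence level would emphasize informative sentences within the document.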

## Attention as a General Deep Learning Technique

Attention has evolved beyond machine translation to become a fundamental component in modern deep learning:

- **Computer vision**: Understanding image regions (visual attention)
- **Speech recognition**: Focusing on relevant audio frames
- **Natural language understanding**: Creating contextualized representations
- **Multimodal learning**: Aligning elements across modalities
- **Graph neural networks**: Attending to relevant nodes in a graph

## Recent Advancements

- **Transformer architecture**: Replaces recurrence with self-attention; it set state-of-the-art machine translation results when introduced
- **Pre-trained language models**: Models like BERT, GPT, and T5 are built on Transformer attention layers
- **Efficient attention variants**: Sparse, linear, and local attention to reduce computational complexity
- **Attention pruning**: Dynamic mechanisms to attend only to relevant tokens
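- **Rotary position embeddings (RoPE)**: Encode position by rotating query and key vectors, so that attention scores depend on the relative distance between tokens
- **Multi-query attention**: Shares a single key/value head across all query heads, reducing memory use during decoding, usually at a small cost in quality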
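
As one concrete illustration from this list, the sketch below applies a rotary position embedding to a vector and checks that the query-key dot product depends only on the relative offset between positions. The function name `rope` and the base of 10000 follow common convention but are assumptions here, and the vector length must be even.

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate each consecutive pair of features by an angle proportional to position."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)  # lower frequency for later pairs
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * cos_t - y * sin_t, x * sin_t + y * cos_t])
    return out

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]

# Same offset (2 positions apart) at two different absolute locations
near = dot(rope(q, 3), rope(k, 1))
far = dot(rope(q, 7), rope(k, 5))
print(near, far)  # equal up to floating-point rounding
```

Because a rotation preserves vector length, the embedding changes only the angle between query and key, not their magnitudes.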

## Pros and Cons

### Pros

- **Flexibility**: Adaptable to various sequence lengths
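- **Parallelization**: Self-attention over a sequence can be computed for all positions in parallel, unlike the step-by-step recurrence of an RNN
- **Performance**: Attention-based models typically outperform comparable models without attention on tasks such as machine translation
- **Interpretability**: Attention weights provide some insight into which inputs the model used, though they are not always faithful explanations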
- **Versatility**: Works across diverse domains and tasks

### Cons

- **Computational complexity**: $O(n^2)$ time in the sequence length $n$ for standard self-attention
- **Memory requirements**: The attention matrix holds one weight per query-key pair, so memory also grows as $O(n^2)$
- **Training instability**: Can be sensitive to initialization and hyperparameters
- **Limited context length**: The quadratic cost makes very long sequences expensive in practice
- **Potential attention collapse**: Models may focus too much on certain patterns
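
The quadratic growth is easy to quantify with a little arithmetic. The helper below is a hypothetical back-of-the-envelope estimate that assumes one 4-byte float per attention weight and counts only the attention matrix itself.

```python
def attention_matrix_bytes(seq_len, num_heads=1, bytes_per_value=4):
    """Memory for one n x n attention matrix per head (assumes float32 weights)."""
    return seq_len * seq_len * num_heads * bytes_per_value

for n in (1_000, 2_000, 4_000):
    megabytes = attention_matrix_bytes(n) / 1e6
    print(f"n={n}: {megabytes} MB per head")
```

Doubling the sequence length quadruples the memory for the attention matrix, which is why the efficient variants mentioned earlier target this term.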

import torch
import torch.nn as nn
import torch.nn.functional as F
from typing import Tuple, Optional

class Encoder(nn.Module):
    """Encoder module for sequence-to-sequence learning with customizable RNN."""

    def __init__(
        self,
        input_dim: int,
        embed_dim: int,
        hidden_dim: int,
        num_layers: int = 1,
        dropout: float = 0.1
    ) -> None:
        super().__init__()
        self.input_dim = input_dim
        self.hidden_dim = hidden_dim
        self.num_layers = num_layers

        # Embedding layer for input tokens
        self.embedding = nn.Embedding(input_dim, embed_dim)
        # GRU for encoding sequences
        self.gru = nn.GRU(
            embed_dim,
            hidden_dim,
            num_layers,
            batch_first=True,
            dropout=dropout if num_layers > 1 else 0
        )
        self.dropout = nn.Dropout(dropout)

    def forward(
        self,
        src: torch.Tensor
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        """
        Forward pass of encoder.

        Args:
            src: Input tensor of shape (batch_size, seq_len)

        Returns:
            outputs: Encoder outputs of shape (batch_size, seq_len, hidden_dim)
            hidden: Final hidden state of shape (num_layers, batch_size, hidden_dim)
        """
        # Embed input tokens
        embedded = self.dropout(self.embedding(src))  # (batch_size, seq_len, embed_dim)

        # Pass through GRU
        outputs, hidden = self.gru(embedded)  # outputs: (batch_size, seq_len, hidden_dim)

        return outputs, hidden


class Attention(nn.Module):
    """Additive attention mechanism implementation."""

    def __init__(
        self,
        hidden_dim: int
    ) -> None:
        super().__init__()
        self.hidden_dim = hidden_dim
        self.attn = nn.Linear(hidden_dim * 2, hidden_dim)
        self.v = nn.Linear(hidden_dim, 1, bias=False)

    def forward(
        self,
        hidden: torch.Tensor,
        encoder_outputs: torch.Tensor
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        """
        Forward pass of attention mechanism.

        Args:
            hidden: Decoder hidden state of shape (batch_size, hidden_dim)
            encoder_outputs: Encoder outputs of shape (batch_size, src_len, hidden_dim)

        Returns:
            context: Context vector of shape (batch_size, hidden_dim)
            attention_weights: Attention weights of shape (batch_size, src_len)
        """
        src_len = encoder_outputs.size(1)

        # Broadcast decoder hidden state across the src_len source positions
        hidden = hidden.unsqueeze(1).expand(-1, src_len, -1)  # (batch_size, src_len, hidden_dim)

        # Calculate energy
        energy = torch.tanh(self.attn(torch.cat((hidden, encoder_outputs), dim=2)))  # (batch_size, src_len, hidden_dim)

        # Calculate attention scores
        attention = self.v(energy).squeeze(2)  # (batch_size, src_len)

        # Apply softmax to get attention weights
        attention_weights = F.softmax(attention, dim=1)  # (batch_size, src_len)

        # Calculate context vector
        context = torch.bmm(
            attention_weights.unsqueeze(1),
            encoder_outputs
        ).squeeze(1)  # (batch_size, hidden_dim)

        return context, attention_weights


class Decoder(nn.Module):
    """Decoder module with attention mechanism."""

    def __init__(
        self,
        output_dim: int,
        embed_dim: int,
        hidden_dim: int,
        num_layers: int = 1,
        dropout: float = 0.1
    ) -> None:
        super().__init__()
        self.output_dim = output_dim
        self.hidden_dim = hidden_dim

        # Embedding layer for output tokens
        self.embedding = nn.Embedding(output_dim, embed_dim)
        # Attention mechanism
        self.attention = Attention(hidden_dim)
        # GRU for decoding
        self.gru = nn.GRU(
            embed_dim + hidden_dim,
            hidden_dim,
            num_layers,
            batch_first=True,
            dropout=dropout if num_layers > 1 else 0
        )
        # Output layer
        self.fc_out = nn.Linear(embed_dim + hidden_dim * 2, output_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(
        self,
        input: torch.Tensor,
        hidden: torch.Tensor,
        encoder_outputs: torch.Tensor
    ) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
        """
        Forward pass of decoder (single step).

        Args:
            input: Input token of shape (batch_size)
            hidden: Previous hidden state of shape (num_layers, batch_size, hidden_dim)
            encoder_outputs: Encoder outputs of shape (batch_size, src_len, hidden_dim)

        Returns:
            prediction: Output prediction of shape (batch_size, output_dim)
            hidden: New hidden state of shape (num_layers, batch_size, hidden_dim)
            attention_weights: Attention weights of shape (batch_size, src_len)
        """
        # Add sequence dimension
        input = input.unsqueeze(1)  # (batch_size, 1)

        # Embed input token
        embedded = self.dropout(self.embedding(input))  # (batch_size, 1, embed_dim)

        # Get attention context vector
        top_hidden = hidden[-1]  # (batch_size, hidden_dim)
        context, attention_weights = self.attention(top_hidden, encoder_outputs)  # context: (batch_size, hidden_dim)

        # Concatenate embedded input and context vector
        gru_input = torch.cat((embedded, context.unsqueeze(1)), dim=2)  # (batch_size, 1, embed_dim + hidden_dim)

        # Pass through GRU
        output, hidden = self.gru(gru_input, hidden)  # output: (batch_size, 1, hidden_dim)

        # Prepare for output prediction
        embedded = embedded.squeeze(1)  # (batch_size, embed_dim)
        output = output.squeeze(1)  # (batch_size, hidden_dim)
        # context is already (batch_size, hidden_dim), so it needs no squeeze

        # Predict next token
        prediction = self.fc_out(torch.cat((output, context, embedded), dim=1))  # (batch_size, output_dim)

        return prediction, hidden, attention_weights


class Seq2Seq(nn.Module):
    """Sequence-to-Sequence model with attention mechanism."""

    def __init__(
        self,
        encoder: Encoder,
        decoder: Decoder,
        device: torch.device
    ) -> None:
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.device = device

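        # Ensure encoder and decoder have compatible hidden dimensions and depth,
        # since the encoder's final hidden state initializes the decoder GRU
        assert encoder.hidden_dim == decoder.hidden_dim, \
            "Encoder and decoder hidden dimensions must match!"
        assert encoder.num_layers == decoder.gru.num_layers, \
            "Encoder and decoder must have the same number of layers!"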

    def forward(
        self,
        src: torch.Tensor,
        trg: torch.Tensor,
        teacher_forcing_ratio: float = 0.5
    ) -> torch.Tensor:
        """
        Forward pass of sequence-to-sequence model.

        Args:
            src: Source sequence of shape (batch_size, src_len)
            trg: Target sequence of shape (batch_size, trg_len)
            teacher_forcing_ratio: Probability of using teacher forcing

        Returns:
            outputs: Decoder outputs of shape (batch_size, trg_len, output_dim)
        """
        batch_size = src.size(0)
        trg_len = trg.size(1)
        trg_vocab_size = self.decoder.output_dim

        # Tensor to store decoder outputs; position 0 (the <sos> slot) stays all zeros
        outputs = torch.zeros(batch_size, trg_len, trg_vocab_size, device=self.device)

        # Encode source sequence
        encoder_outputs, hidden = self.encoder(src)

        # First input to decoder is <sos> token
        input = trg[:, 0]

        # Decode sequence
        for t in range(1, trg_len):
            # Get decoder output
            output, hidden, _ = self.decoder(input, hidden, encoder_outputs)

            # Store output
            outputs[:, t, :] = output

            # Decide whether to use teacher forcing
            teacher_force = torch.rand(1).item() < teacher_forcing_ratio

            # Get predicted token
            top1 = output.argmax(1)

            # Use teacher forcing or predicted token
            input = trg[:, t] if teacher_force else top1

        return outputs


# Example usage and initialization
if __name__ == "__main__":
    # # Check torch version and availability
    # assert torch.__version__ >= "1.8.0", "PyTorch version 1.8.0 or higher required!"

    # Set device
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Model hyperparameters
    INPUT_DIM = 5000  # Source vocabulary size
    OUTPUT_DIM = 5000  # Target vocabulary size
    ENC_EMB_DIM = 256
    DEC_EMB_DIM = 256
    HID_DIM = 512
    N_LAYERS = 2
    ENC_DROPOUT = 0.5
    DEC_DROPOUT = 0.5

    # Initialize encoder and decoder
    enc = Encoder(INPUT_DIM, ENC_EMB_DIM, HID_DIM, N_LAYERS, ENC_DROPOUT)
    dec = Decoder(OUTPUT_DIM, DEC_EMB_DIM, HID_DIM, N_LAYERS, DEC_DROPOUT)

    # Initialize sequence-to-sequence model
    model = Seq2Seq(enc, dec, device).to(device)

    # Example forward pass with dummy data
    batch_size = 32
    src_len = 10
    trg_len = 12

    src = torch.randint(0, INPUT_DIM, (batch_size, src_len)).to(device)
    trg = torch.randint(0, OUTPUT_DIM, (batch_size, trg_len)).to(device)

    outputs = model(src, trg)
    print(f"Output shape: {outputs.shape}")  # Should be (batch_size, trg_len, OUTPUT_DIM)