### 1. Recurrent Neural Network (RNN)

![image.png](attachment:image.png)

- **Core idea**: Processes sequences step by step. Each step depends on the previous hidden state.
- **How it works (intuition)**: Think of reading a sentence word by word, keeping a small memory of what you’ve read so far.
- **Strengths**:
    - Simple
    - Works for short sequences
    - Low memory usage
- **Weaknesses**:
    - ❌ Vanishing / exploding gradients
    - ❌ Struggles with long-term dependencies
    - ❌ Slow training (time steps must be processed one after another, so they cannot be parallelized)
- **Typical uses (historically)**:
    - Basic NLP
    - Time series (short windows)
    - Simple sequence modeling
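
The step-by-step idea above can be sketched in a few lines of NumPy. This is a minimal illustration, not a training-ready implementation: the function name `rnn_forward`, the weight names (`W_xh`, `W_hh`, `b_h`), the zero initial state, and the random toy data are all assumptions made for the example.

```python
import numpy as np

def rnn_forward(x_seq, W_xh, W_hh, b_h):
    """Run a vanilla RNN over one sequence.

    x_seq: (seq_len, input_dim). Returns hidden states of shape (seq_len, hidden_dim).
    """
    hidden_dim = W_hh.shape[0]
    h = np.zeros(hidden_dim)              # initial hidden state (assumed to be zeros)
    states = []
    for x_t in x_seq:                     # strictly one time step at a time
        # New state mixes the current input with the previous hidden state.
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        states.append(h)
    return np.stack(states)

# Toy data: a sequence of 5 steps, 3 features per step, 4 hidden units.
rng = np.random.default_rng(0)
seq_len, input_dim, hidden_dim = 5, 3, 4
x_seq = rng.normal(size=(seq_len, input_dim))
W_xh = rng.normal(scale=0.5, size=(hidden_dim, input_dim))
W_hh = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)

states = rnn_forward(x_seq, W_xh, W_hh, b_h)
print(states.shape)  # (5, 4)
```

The `for` loop is the sequential bottleneck: step `t` cannot start until step `t - 1` has produced its hidden state.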

### 2. Long Short-Term Memory (LSTM)
- **Core idea**: A special RNN designed to remember longer context. Uses gates to control information flow.
- **Key components**:
    - **Forget gate** → what to discard
    - **Input gate** → what to store
    - **Output gate** → what to expose
- **Why it’s better than RNN**:
    - ✅ Largely mitigates the vanishing gradient problem (exploding gradients can still occur)
    - ✅ Handles long-range dependencies better
- **Weaknesses**:
    - ❌ Still sequential → slow
    - ❌ Hard to scale to very large datasets
    - ❌ Complex architecture

    ![image.png](attachment:image.png)
    
- **Typical uses**:
    - Speech recognition
    - Language modeling (pre-Transformer era)
    - Time series forecasting
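
The three gates listed under key components can be sketched as a single time step in NumPy. This is a rough sketch of the standard LSTM update; the function name `lstm_step`, the stacked weight layout (forget, input, candidate, output), and the random toy data are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step.

    W: (4 * hidden_dim, input_dim + hidden_dim), b: (4 * hidden_dim,)
    Rows are stacked as [forget, input, candidate, output] (an assumed layout).
    """
    hidden_dim = h_prev.shape[0]
    z = W @ np.concatenate([x_t, h_prev]) + b
    f = sigmoid(z[0 * hidden_dim:1 * hidden_dim])   # forget gate: what to discard
    i = sigmoid(z[1 * hidden_dim:2 * hidden_dim])   # input gate: what to store
    g = np.tanh(z[2 * hidden_dim:3 * hidden_dim])   # candidate values to store
    o = sigmoid(z[3 * hidden_dim:4 * hidden_dim])   # output gate: what to expose
    c_t = f * c_prev + i * g                        # additive cell-state update
    h_t = o * np.tanh(c_t)                          # exposed hidden state
    return h_t, c_t

rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 4
W = rng.normal(scale=0.5, size=(4 * hidden_dim, input_dim + hidden_dim))
b = np.zeros(4 * hidden_dim)

h, c = np.zeros(hidden_dim), np.zeros(hidden_dim)
for x_t in rng.normal(size=(6, input_dim)):         # still sequential, like an RNN
    h, c = lstm_step(x_t, h, c, W, b)
print(h.shape, c.shape)  # (4,) (4,)
```

When the forget gate is close to 1 and the input gate close to 0, `c_t` is almost a copy of `c_prev`, which is what lets information (and gradients) survive across many steps.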

### 3. Transformer

![image.png](attachment:image.png)

- **Core idea**: No recurrence. Uses self-attention to look at all tokens at once.
- **How it works (intuition)**: Instead of reading a sentence word by word, the model looks at the entire sentence at once and decides which words are important to each other.
- **Key innovations**:
    - Self-attention
    - Positional encoding (to know order)
    - Parallel computation
- **Strengths**:
    - ✅ Captures long-range dependencies easily
    - ✅ Much faster training
    - ✅ Scales extremely well
    - ✅ Foundation of modern AI (GPT, BERT, T5)
- **Weaknesses**:
    - ❌ High memory usage (attention cost grows as $O(n^2)$ in the sequence length $n$)
    - ❌ Overkill for very small datasets
- **Typical uses**:
    - Large language models (ChatGPT)
    - Translation
    - Vision (ViT)
    - Multimodal AI
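
Self-attention can be sketched as a few matrix products. Below is a minimal single-head version in NumPy, with no masking, no multi-head split, and no positional encoding; the names `self_attention`, `W_q`, `W_k`, `W_v` and the toy sizes are assumptions for illustration.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention (no masking).

    X: (n, d_model). Returns (output, weights), where weights has shape (n, n).
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # (n, n): every token scored against every token
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V, weights

# Toy data: 6 tokens, model width 8, head width 4.
rng = np.random.default_rng(0)
n, d_model, d_k = 6, 8, 4
X = rng.normal(size=(n, d_model))
W_q, W_k, W_v = (rng.normal(scale=0.3, size=(d_model, d_k)) for _ in range(3))

out, weights = self_attention(X, W_q, W_k, W_v)
print(out.shape, weights.shape)  # (6, 4) (6, 6)
```

Two things are visible here. The `(n, n)` weight matrix is the source of the $O(n^2)$ memory cost, and there is no loop over positions, so all tokens are processed at once. Nothing in the function depends on row order either: shuffling the rows of `X` just shuffles the output rows, which is why positional encoding has to be added separately.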

### 4. Quick Comparison Table

| Feature | RNN | LSTM | Transformer |
| :--- | :--- | :--- | :--- |
| **Sequence processing** | Sequential | Sequential | Parallel |
| **Long-term memory** | ❌ Poor | ✅ Good | ✅ Excellent |
| **Training speed** | Slow | Slower | Fast |
| **Handles long text** | ❌ | ⚠️ | ✅ |
| **Parallelizable** | ❌ | ❌ | ✅ |
| **Modern usage** | Rare | Limited | Dominant |
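
As a rough guide to the speed and parallelism rows, the commonly quoted per-layer costs for sequence length $n$ and hidden size $d$ are shown below (the symbols $n$ and $d$ are introduced here for illustration):

$$
\begin{aligned}
\text{Recurrent layer (RNN, LSTM):} &\quad O(n \cdot d^2) \text{ compute}, \quad O(n) \text{ sequential steps} \\
\text{Self-attention layer:} &\quad O(n^2 \cdot d) \text{ compute}, \quad O(1) \text{ sequential steps}
\end{aligned}
$$

The number of sequential steps, not the raw compute, is what usually decides training speed on parallel hardware.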

### 5. When Should You Use What?
- **Use RNN if**:
    - Learning fundamentals
    - Very short sequences
    - Extremely limited compute
- **Use LSTM if**:
    - Time-series with moderate length
    - Small dataset
    - Interpretability matters (a small model is easier to inspect, though LSTM gates are not inherently interpretable)
- **Use Transformer if**:
    - NLP tasks
    - Long context
    - Large datasets
    - Production-grade AI systems

### 6. Why Transformers Replaced RNNs & LSTMs
**One-sentence answer**: Transformers remove the sequential bottleneck and model global context directly.

This is why GPT, BERT, LLaMA, DeepSeek, etc. are all transformer-based.

- Better capacity to learn long-range dependencies: self-attention connects any two positions directly.
- Better performance on tasks such as machine translation, speech recognition, and text summarization.
- More efficient use of compute. Both RNNs and Transformers can run on GPUs, but an RNN must process time steps one after another, while attention processes all positions in parallel, so Transformers make far better use of the hardware.
- Variable-length input and output sequences are supported up to a maximum context length. Batches still need padding (with an attention mask), and sequences longer than the context window must be truncated or split.
- Transformers generally need more data than RNNs or LSTMs when trained from scratch, but a pretrained Transformer can often be fine-tuned with relatively little task-specific data.
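
The long-range dependency point can be made concrete with a small experiment. The sketch below tracks how strongly the hidden state after `t` steps still depends on the initial state of a vanilla tanh RNN; the weights are random and rescaled to a largest singular value of 0.9, and the input term is omitted, both assumptions chosen only to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim, steps = 8, 50

# Hypothetical recurrent weights, rescaled so the largest singular value is 0.9.
W_hh = rng.normal(size=(hidden_dim, hidden_dim))
W_hh *= 0.9 / np.linalg.norm(W_hh, 2)

h = rng.normal(size=hidden_dim)
J = np.eye(hidden_dim)                       # accumulates d h_t / d h_0
norms = []
for _ in range(steps):
    h = np.tanh(W_hh @ h)                    # input term omitted for brevity
    J = np.diag(1.0 - h**2) @ W_hh @ J       # chain rule through one tanh step
    norms.append(np.linalg.norm(J, 2))       # spectral norm of the Jacobian so far

print(f"after 1 step: {norms[0]:.3f}, after {steps} steps: {norms[-1]:.2e}")
```

Because each step multiplies by a factor with norm at most 0.9, the sensitivity after 50 steps is bounded by $0.9^{50} \approx 0.005$, so a signal from 50 steps back barely influences the gradient. Self-attention avoids this long product within a layer because any two positions interact in a single step.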