# Welcome to ML Primer!

# 🚀 Building a Transformer from Scratch

## Introduction

This example problem set guides you through building a **transformer model from scratch**, based on the seminal paper *"Attention Is All You Need"* (Vaswani et al., 2017). We'll implement a transformer for **English-to-French machine translation**, following the architecture described in the original paper.

While future problem sets may vary, they will generally follow this **pedagogical style**: a **bottom-up approach** where we first create and understand the fundamental building blocks, then gradually abstract them away to build more complex systems.

In this notebook, we’ll start by implementing **basic components** like:
- **Self-Attention Mechanisms** – how each token decides which other tokens in the sequence to focus on,
- **Feed-Forward Networks** – how the model transforms each token's representation after attention,
- **Positional Encoding** – how the model keeps track of word order.

Then, we’ll **combine these into transformer blocks** and finally **assemble a complete transformer model** for translation. This step-by-step approach will help you develop a deep understanding of **how these models actually work under the hood** rather than treating them as black boxes.

---

## Our Learning Philosophy 🧠💡

We firmly believe that the best way to master complex concepts—especially in deep learning—is through a **hands-on, bottom-up approach**. Instead of simply using pre-built transformer libraries, this problem set challenges you to **construct the model from first principles**. By working through each component step by step, you'll gain a much deeper intuition about how transformers function.

This philosophy aligns with the idea of **active learning**:  
✔️ Engaging directly with the material through **coding, experimentation, and problem-solving**.  
✔️ Thinking critically, debugging, and refining your understanding **through trial and error**.  
✔️ **Searching for solutions online, revisiting lecture notes, and consulting research papers**—just like real machine learning engineers and researchers do!

This problem set also **mirrors how research and industry work**. Engineers don’t just memorize formulas—they break down complex systems, experiment with different approaches, and continuously iterate on their solutions. By adopting this mindset, you’ll **not only develop technical expertise but also cultivate problem-solving skills** that are essential for innovation in AI and machine learning.

---

## What to Expect 🔎📖

As we progress, our goal is to **gradually abstract away** the lower-level components, allowing you to focus on **high-level architectural design, optimization, and real-world deployment**. By the end of this series, you won’t just understand transformers theoretically—you’ll have **practical experience** in modifying, optimizing, and applying them to diverse applications beyond machine translation. 

So, **embrace the challenge, experiment fearlessly, and remember**:  
💡 **Struggling with concepts at first is not a sign of failure—it's an essential part of deep learning** (both for neural networks and for humans!).  

You don’t need to grasp every nitty-gritty detail (and we definitely don't expect you to!). **As long as you understand the main idea, that’s good enough!**

**To complete the exercises, look for the `# TODO` comments throughout the notebook.**

Let’s dive in! 🚀



<img src="transformer_architecture.png" alt="Transformer architecture diagram from the 'Attention Is All You Need' paper" width="400"/>


While this may look complicated at first, we'll break it down and build it step by step!

---

# 🔡 Word Embeddings in Transformers

## **What are Embeddings?**
Before a transformer can process words, it needs to **convert them into numbers**. Computers don’t understand text, so we represent words as **vectors** (lists of numbers). These vectors are called **word embeddings**.

Instead of using arbitrary integer IDs (like "1" for "apple" and "2" for "orange"), which say nothing about how words relate to each other, we use **high-dimensional vectors** that encode information about each word’s meaning. This lets the model capture relationships between words.

For example:
- The words **"king"** and **"queen"** might have similar embeddings because they are related.
- The word **"dog"** would have an embedding closer to **"puppy"** than to **"banana"**.
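
To make "closer" concrete, here is a minimal sketch that compares vectors with cosine similarity, one common way to measure how alike two embeddings are. The three-dimensional vectors below are hand-picked for illustration only; they are assumptions, not learned embeddings, and real models use hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-picked toy vectors (an assumption for illustration, NOT learned embeddings).
toy_embeddings = {
    "dog":    [0.9, 0.8, 0.1],
    "puppy":  [0.8, 0.9, 0.2],
    "banana": [0.1, 0.2, 0.9],
}

sim_dog_puppy = cosine_similarity(toy_embeddings["dog"], toy_embeddings["puppy"])
sim_dog_banana = cosine_similarity(toy_embeddings["dog"], toy_embeddings["banana"])

print(f"dog vs puppy:  {sim_dog_puppy:.3f}")
print(f"dog vs banana: {sim_dog_banana:.3f}")
```

With these toy values, "dog" scores much higher against "puppy" than against "banana". Plain integer IDs cannot express this kind of graded similarity.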

## **What Does This Code Do?**
We define a class called `InputEmbeddings`, which converts word indices into **embedding vectors** and **scales them** by the square root of `d_model`, as the original paper does.
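
Before turning to PyTorch, it may help to see the same two steps (look up, then scale) in plain Python. This is only a rough sketch under stated assumptions: the names `table` and `embed` are hypothetical, the sizes are tiny, and the table values are placeholders rather than learned weights. It is not the implementation you will complete in the exercise.

```python
import math

d_model = 4      # toy embedding size (the notebook's running example uses 512)
vocab_size = 6   # toy vocabulary size

# Hypothetical lookup table: row i is the vector for word index i.
# Real embeddings are learned during training; these values are placeholders.
table = [[0.1 * (i + 1)] * d_model for i in range(vocab_size)]

def embed(batch):
    """Look up the vector for each word index, then scale it by sqrt(d_model)."""
    scale = math.sqrt(d_model)
    return [[[v * scale for v in table[idx]] for idx in seq] for seq in batch]

batch = [[1, 3, 0], [5, 2, 2]]   # shape (batch_size=2, seq_len=3)
out = embed(batch)

# Every index became a d_model-sized vector, so a dimension was added.
shape = (len(out), len(out[0]), len(out[0][0]))
print(shape)       # (2, 3, 4)
print(out[0][0])   # vector for word index 1, scaled by sqrt(4) = 2
```

The output gains one dimension because each index is replaced by a whole vector. An embedding layer performs the same kind of lookup on tensors, with a table that is learned during training.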

💡 **Fill in the blank!** Try to complete the missing part in the code below to understand how word embeddings work.



import torch
import torch.nn as nn
import math

class InputEmbeddings(nn.Module):
    """
    This class converts word indices into dense vector embeddings.
    """

    def __init__(self, d_model: int, vocab_size: int) -> None:
        """
        Initializes the embedding layer.

        Parameters:
        - d_model (int): The size of each embedding vector (the number of features per word).
          Example: If d_model = 512, each word is represented as a 512-dimensional vector.
        - vocab_size (int): The number of unique words in our vocabulary.
          Example: If vocab_size = 10,000, our vocabulary contains 10,000 different words.
        """
        super().__init__()
        # TODO: Fill in the blank
        self.d_model = ____  # Store the embedding size
        self.vocab_size = ____  # Store the vocabulary size

        # Create an embedding layer that maps each word index to a vector of size d_model
        # Example: If vocab_size = 10,000 and d_model = 512, 
        #          nn.Embedding(10000, 512) creates a lookup table of 10,000 rows and 512 columns.
        self.embedding = nn.Embedding(vocab_size, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Converts a batch of word indices into their corresponding embeddings.
        
        Parameters:
        - x: A tensor of shape (batch_size, seq_len) where each value is a word index.
          Example: Suppose batch_size = 2, seq_len = 4
          Input x could be:
          ```
          x = torch.tensor([
              [12, 305, 76, 4821],  # First sequence (each number is a word index)
              [519, 3, 78, 9023]    # Second sequence (each number is a word index)
          ])
          ```
        
        Returns:
        - A tensor of shape (batch_size, seq_len, d_model) containing the embeddings.
          Example: If batch_size = 2, seq_len = 4, d_model = 512
          Output will have shape (2, 4, 512).
        """

        # Convert word indices into embedding vectors using the embedding layer.
        # This retrieves the d_model-dimensional vector (512 in our running example)
        # stored in the lookup table for each word index.
        # Example: Suppose word index 12 maps to the vector
        # [0.1, -0.2, 0.5, ..., 0.3] (512 values)

        # TODO: Fill in the blank
        embeddings = self.____(x)  # Shape: (batch_size, seq_len, d_model)

        # Scale the embeddings by the square root of d_model, as the Transformer paper does (Section 3.4).
        # Why? The paper gives no explicit reason; a common interpretation is that it keeps the
        # embeddings from being dwarfed by the positional encodings that are added to them next.
        # Example: If d_model = 512, then sqrt(512) ≈ 22.63

        # TODO: Fill in the blank
        scaled_embeddings = embeddings * math.____(self.d_model)

        # Example: If the vector for word index 12 was [0.1, -0.2, ..., 0.3],
        # then its scaled version is [0.1 * 22.63, -0.2 * 22.63, ..., 0.3 * 22.63].
        # Every component is multiplied by the same constant, so each vector keeps
        # its direction and only its magnitude changes.

        # TODO: Fill in the blank
        return ____  # Shape: (batch_size, seq_len, d_model)

