# Assignment #6

Vision Transformer

---


**SUBMISSION INSTRUCTIONS**

First, make a copy of this Colab file, solve the assignment, and upload your final notebook to GitHub.

Before uploading your downloaded notebook, rename the file to rollno_name.ipynb.

Submission deadline: Saturday, 31/01/2026, EOD, i.e. before 11:59 PM.

The deadline is strict; late submissions will incur a penalty.

Upload your solution to the GitHub page of the Vision Transformer project, **under Assignment 6**.

Remember to title your pull request ViT_name_rollno_assgn6.

GitHub submission repo:
https://github.com/electricalengineersiitk/Winter-projects-25-26/tree/main/Vision%20transformer/Assignment%206

# Part A - Data and tokens

Q1. Build a tiny toy dataset with pandas
Create a pandas DataFrame with columns text and label.
- Include at least 12 short sentences (3-10 words each).
- The label is 0/1 (e.g., positive vs negative sentiment).
- Shuffle rows and split into train/test (80/20) using a fixed random seed.
Return: df_train, df_test.

In [1]:
import pandas as pd

def make_toy_dataset(seed: int = 42):
    """Return df_train, df_test with columns: text (str), label (int)."""

    data = {
        'text': [
            "I absolutely love this new coffee machine.",
            "The movie was a complete waste of time.",
            "This software is incredibly intuitive and fast.",
            "I am very disappointed with the service.",
            "The weather today is absolutely perfect.",
            "The book's plot was boring and predictable.",
            "Everything about this meal was delicious.",
            "I will never buy from this brand again.",
            "The customer support team was very helpful.",
            "The battery life is much shorter than advertised.",
            "The performance of this laptop is outstanding.",
            "I hate how complicated this remote control is."
        ],
        'label': [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
    }

    # Create DataFrame
    df = pd.DataFrame(data)

    # Shuffle the entire dataset
    df_shuffled = df.sample(frac=1, random_state=seed).reset_index(drop=True)

    # Calculate split index (80%)
    split_idx = int(len(df_shuffled) * 0.8)

    # Split into train and test
    df_train = df_shuffled.iloc[:split_idx]
    df_test = df_shuffled.iloc[split_idx:]

    return df_train, df_test

# Example usage:
# df_train, df_test = make_toy_dataset(42)
# print(f"Train size: {len(df_train)}, Test size: {len(df_test)}")

Q2. Clean and tokenize text

Implement a basic cleaner: lowercase, strip, replace multiple spaces with one, and remove punctuation
(.,!?;:).
Tokenize by whitespace.
Add a new column tokens that stores a list of tokens per row.
Return the updated DataFrame.

In [2]:
import re
import pandas as pd
def clean_text(s: str) -> str:
    """Lowercase, strip, drop the punctuation .,!?;: and collapse repeated spaces."""
    # Convert to lowercase
    s = s.lower()
    # Remove only the punctuation named in the task (apostrophes are kept)
    s = re.sub(r'[.,!?;:]', '', s)
    # Collapse runs of whitespace into one space and trim both ends
    s = re.sub(r'\s+', ' ', s).strip()
    return s

def add_tokens_column(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with df['tokens'] = list[str] from cleaned, whitespace-split text."""
    # Copy first so a slice such as df_train is not modified in place
    df = df.copy()
    # Apply cleaning first, then split on whitespace to create a list
    df['tokens'] = df['text'].apply(lambda x: clean_text(x).split())
    return df
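
As a quick check of the cleaning rules listed in Q2, the sketch below applies them to one sentence in a standalone helper. The name `demo_clean` and the sample sentence are illustrative assumptions, not part of the assignment; the helper removes only the punctuation characters the task names, so apostrophes survive.

```python
import re

def demo_clean(s: str) -> str:
    """Hypothetical standalone cleaner following the Q2 rules."""
    s = s.lower().strip()
    s = re.sub(r"[.,!?;:]", "", s)   # drop only the listed punctuation
    s = re.sub(r"\s+", " ", s)       # collapse repeated whitespace
    return s.strip()

sample = "  The book's plot,   was BORING!  "
cleaned = demo_clean(sample)
tokens = cleaned.split()             # whitespace tokenization
print(cleaned)
print(tokens)
```

Splitting after cleaning means the token list never contains empty strings, even when the raw text has leading or repeated spaces.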


Q3. Build a vocabulary + token/id mappings

Build token2id and id2token using the training tokens.
Include special tokens: [PAD], [UNK], [BOS], [EOS] at the beginning.
Add tokens that occur at least min_freq times.
Return: token2id (dict), id2token (list).

In [3]:
from collections import Counter
from typing import Dict, List, Tuple

SPECIALS = ['[PAD]', '[UNK]', '[BOS]', '[EOS]']

def build_vocab(list_of_token_lists: List[List[str]], min_freq: int = 1) -> Tuple[Dict[str, int], List[str]]:
    """Builds token2id (dict) and id2token (list, where id2token[i] is the token with id i)."""

    # 1. Count all tokens across all documents
    counter = Counter()
    for tokens in list_of_token_lists:
        counter.update(tokens)

    # 2. Start the id -> token list with the special tokens (ids 0-3)
    id2token = list(SPECIALS)

    # 3. Add tokens that meet the minimum frequency threshold
    for token, count in counter.items():
        if count >= min_freq and token not in SPECIALS:
            id2token.append(token)

    # 4. Create the token -> id mapping from the list positions
    token2id = {token: idx for idx, token in enumerate(id2token)}

    return token2id, id2token
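
To illustrate the effect of min_freq described in Q3, here is a minimal standalone sketch on a made-up three-sentence corpus. The corpus, the threshold of 2, and the variable names are assumptions chosen for the example; tokens below the threshold are left out of the vocabulary and later fall back to the [UNK] id.

```python
from collections import Counter

# Hypothetical toy corpus: "good" appears twice, every other token once
corpus = [["good", "movie"], ["good", "food"], ["bad", "service"]]
specials = ["[PAD]", "[UNK]", "[BOS]", "[EOS]"]

counts = Counter(tok for sent in corpus for tok in sent)
kept = [tok for tok, c in counts.items() if c >= 2]   # min_freq = 2
id2token = specials + kept                            # list indexed by id
token2id = {tok: i for i, tok in enumerate(id2token)}

# Tokens below the threshold map to the [UNK] id
unk_id = token2id["[UNK]"]
ids = [token2id.get(tok, unk_id) for tok in ["good", "movie"]]
print(id2token)
print(ids)
```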


Q4. Convert tokens to ids + pad to a batch

Implement tokens_to_ids for one sequence.
Implement pad_batch that takes a list of id sequences and returns:
- X: int array (B,T) padded with pad_id
- pad_mask: bool array (B,T) where True means 'real token' and False means padding

In [4]:
import numpy as np

def tokens_to_ids(tokens, token2id, unk_token='[UNK]'):
    """Convert a list of tokens to a list of integer IDs."""
    unk_id = token2id[unk_token]
    # Use .get() to default to the [UNK] ID if a word isn't in our vocab
    return [token2id.get(token, unk_id) for token in tokens]

def pad_batch(id_seqs, pad_id: int):
    """Pad id sequences to a common length; returns X (B, T) and pad_mask (B, T), True = real token."""
    batch_size = len(id_seqs)
    max_len = max(len(seq) for seq in id_seqs)

    # Initialize X with the pad_id
    X = np.full((batch_size, max_len), pad_id, dtype=np.int64)
    # Initialize mask with False
    pad_mask = np.zeros((batch_size, max_len), dtype=bool)

    for i, seq in enumerate(id_seqs):
        length = len(seq)
        X[i, :length] = seq
        pad_mask[i, :length] = True

    return X, pad_mask
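
The padding scheme from Q4 can be traced on a tiny batch. This is a standalone sketch with made-up id sequences, and it assumes 0 is the [PAD] id; it shows the padded array, the mask, and how the mask recovers the original lengths.

```python
import numpy as np

# Hypothetical id sequences of lengths 3 and 1; 0 is assumed to be the [PAD] id
id_seqs = [[2, 7, 3], [5]]
pad_id = 0

T = max(len(s) for s in id_seqs)
X = np.full((len(id_seqs), T), pad_id, dtype=np.int64)
pad_mask = np.zeros((len(id_seqs), T), dtype=bool)
for i, s in enumerate(id_seqs):
    X[i, :len(s)] = s              # copy the real ids
    pad_mask[i, :len(s)] = True    # mark them as real tokens

lengths = pad_mask.sum(axis=1)     # number of real tokens per row
print(X)
print(pad_mask)
print(lengths)
```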


# Part B - Core Transformer math

Q5. Embedding lookup

Implement an embedding table E of shape (V,D) initialized from a normal distribution (mean 0, std 0.02).
Given token ids X (B,T), return embeddings of shape (B,T,D) using NumPy indexing.


In [5]:
import numpy as np

def init_embeddings(vocab_size: int, d_model: int, seed: int = 0):
    """
    Initializes the embedding matrix E.
    V = vocab_size, D = d_model.
    """
    np.random.seed(seed)
    # Normal distribution with mean 0 and std 0.02, as the task specifies
    E = np.random.randn(vocab_size, d_model) * 0.02
    return E

def embed(X: np.ndarray, E: np.ndarray):
    """
    X: (B, T) - Batch of token IDs
    E: (V, D) - Embedding weight matrix
    out: (B, T, D) - Embedded representations
    """
    # NumPy fancy indexing: for every index in X,
    # grab the corresponding row from E.
    out = E[X]
    return out
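
The lookup in Q5 relies on NumPy integer-array indexing, which the following standalone sketch checks on a toy table. The sizes, the seed, and the use of `np.random.default_rng` are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)          # hypothetical seed
V, D = 6, 4                             # toy vocabulary size and model width
E = rng.normal(loc=0.0, scale=0.02, size=(V, D))

X = np.array([[2, 5, 0], [1, 1, 3]])    # token ids, shape (B, T) = (2, 3)
out = E[X]                              # integer-array indexing, shape (B, T, D)

print(out.shape)
# Repeated ids pick the same row of E
print(np.array_equal(out[1, 0], out[1, 1]))
```

Each entry `out[b, t]` is a copy of row `X[b, t]` of `E`, so no loop over the batch is needed.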


Q6. Sinusoidal positional encoding

Implement the classic sinusoidal positional encoding PE of shape (T,D).
Then add it to token embeddings (B,T,D).
Make sure your implementation works for both even and odd D.

In [6]:
import numpy as np

def sinusoidal_positional_encoding(T: int, D: int):
    """
    T: Sequence length (number of positions)
    D: Model dimension (even or odd)
    Returns PE: (T, D)
    """
    # 1. Initialize the PE matrix
    PE = np.zeros((T, D))

    # 2. Create a column vector for positions (0 to T-1)
    pos = np.arange(T)[:, np.newaxis]

    # 3. Frequencies for the even indices 0, 2, 4, ...: 1 / 10000^(2i / D),
    # computed via exp/log; this vector has ceil(D / 2) entries
    div_term = np.exp(np.arange(0, D, 2) * -(np.log(10000.0) / D))

    # 4. Apply sin to even indices and cos to odd indices.
    # For odd D there is one fewer odd column, so trim div_term to D // 2
    PE[:, 0::2] = np.sin(pos * div_term)
    PE[:, 1::2] = np.cos(pos * div_term[: D // 2])

    return PE

def add_positional_encoding(X_emb: np.ndarray, PE: np.ndarray):
    """
    X_emb: (B, T, D) - Embedded input
    PE: (T, D) - Positional encodings
    """
    # NumPy broadcasting handles the batch dimension (B) automatically
    X_emb_pe = X_emb + PE
    return X_emb_pe
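
Q6 asks for an encoding that works for both even and odd D. The sketch below is one way to check that requirement in isolation; `pe_sketch` is a hypothetical helper written for this example, and it trims the cosine frequencies because an odd D has one fewer odd-indexed column.

```python
import numpy as np

def pe_sketch(T: int, D: int) -> np.ndarray:
    """Hypothetical helper: sinusoidal encoding that also accepts odd D."""
    pos = np.arange(T)[:, None]                                 # (T, 1)
    freqs = np.exp(-np.log(10000.0) * np.arange(0, D, 2) / D)   # (ceil(D/2),)
    PE = np.zeros((T, D))
    PE[:, 0::2] = np.sin(pos * freqs)
    PE[:, 1::2] = np.cos(pos * freqs[: D // 2])   # odd D: one fewer cos column
    return PE

PE_even = pe_sketch(5, 4)
PE_odd = pe_sketch(5, 5)
print(PE_even.shape, PE_odd.shape)
print(PE_odd[0])   # position 0: sin(0) = 0 in even columns, cos(0) = 1 in odd columns
```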

Q7. Scaled dot-product attention with masking

Implement scaled dot-product attention:
Attention(Q,K,V) = softmax((Q @ K^T) / sqrt(dk) + mask) @ V
Inputs: Q,K,V are (B,H,T,Dh). Mask is boolean broadcastable to (B,H,T,T) where False means 'mask out'.
Requirements:
- Use a numerically stable softmax (subtract max).
- Convert boolean mask to large negative values before softmax.
Return: context (B,H,T,Dh) and attention weights (B,H,T,T).

In [7]:
import numpy as np

def softmax(x: np.ndarray, axis: int = -1):
    """Computes stable softmax along the specified axis."""
    # Subtract max for numerical stability
    e_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e_x / np.sum(e_x, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Q, K, V: (B, H, T, Dh)
    mask: Optional boolean mask (B, H, T, T) where True means 'keep' and False means 'mask'
    """
    d_k = Q.shape[-1]

    # 1. Compute scores: (B, H, T, Dh) @ (B, H, Dh, T) -> (B, H, T, T)
    # Using swapaxes to transpose the last two dimensions of K
    scores = np.matmul(Q, K.swapaxes(-2, -1)) / np.sqrt(d_k)

    # 2. Apply mask if provided
    if mask is not None:
        # Fill masked positions with a very large negative number
        scores = np.where(mask, scores, -1e9)

    # 3. Get attention weights
    attn = softmax(scores, axis=-1)

    # 4. Compute context: (B, H, T, T) @ (B, H, T, Dh) -> (B, H, T, Dh)
    context = np.matmul(attn, V)

    return context, attn
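
The two requirements in Q7 (stable softmax, boolean mask turned into large negative scores) can be verified on random inputs. This is a standalone sketch; the shapes, the seed, and the choice to mask the last key position are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
B, H, T, Dh = 1, 2, 3, 4
Q = rng.normal(size=(B, H, T, Dh))
K = rng.normal(size=(B, H, T, Dh))
V = rng.normal(size=(B, H, T, Dh))

# Boolean key mask, False = masked out; here the last key position is padding
mask = np.array([True, True, False]).reshape(1, 1, 1, T)

scores = Q @ K.swapaxes(-2, -1) / np.sqrt(Dh)          # (B, H, T, T)
scores = np.where(mask, scores, -1e9)                  # large negative before softmax
scores = scores - scores.max(axis=-1, keepdims=True)   # subtract max for stability
attn = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
context = attn @ V                                     # (B, H, T, Dh)

print(attn.sum(axis=-1))      # each row sums to 1
print(attn[..., -1].max())    # weight on the masked key
```

Because the masked score is about 1e9 below the row maximum, its exponential underflows and the masked key receives no attention weight.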

Q8. Multi-head self-attention (MHA)

Implement multi-head self-attention for input X (B,T,D).
- Project to Q,K,V using weight matrices Wq,Wk,Wv each (D,D).
- Reshape/split into heads -> (B,H,T,Dh) where Dh=D/H.
- Apply scaled dot-product attention with a pad mask (B,T) (broadcast it appropriately).
- Concatenate heads and apply output projection Wo (D,D).
Return: out (B,T,D) and attention weights (B,H,T,T).

In [8]:
import numpy as np

def linear(x: np.ndarray, W: np.ndarray, b=None):
    """Standard linear transformation: y = xW + b."""
    y = np.matmul(x, W)
    if b is not None:
        y += b
    return y

def split_heads(x: np.ndarray, n_heads: int):
    """
    Reshapes (B, T, D) into (B, H, T, Dh) where Dh = D / H.
    """
    B, T, D = x.shape
    Dh = D // n_heads
    return x.reshape(B, T, n_heads, Dh).transpose(0, 2, 1, 3)

def combine_heads(xh: np.ndarray):
    """
    Reshapes (B, H, T, Dh) back into (B, T, D) where D = H * Dh.
    """
    B, H, T, Dh = xh.shape
    D = H * Dh
    return xh.transpose(0, 2, 1, 3).reshape(B, T, D)

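def mha_self_attention(X, Wq, Wk, Wv, Wo, n_heads: int, pad_mask=None):
    """X: (B, T, D); pad_mask: optional bool array (B, T), True = real token."""
    Q = linear(X, Wq)
    K = linear(X, Wk)
    V = linear(X, Wv)

    Qs = split_heads(Q, n_heads)
    Ks = split_heads(K, n_heads)
    Vs = split_heads(V, n_heads)

    # Expand a (B, T) pad mask to (B, 1, 1, T) so it broadcasts over heads
    # and query positions, hiding padded key positions
    mask = pad_mask
    if mask is not None and mask.ndim == 2:
        mask = mask[:, None, None, :]

    context, attn = scaled_dot_product_attention(Qs, Ks, Vs, mask=mask)

    combined = combine_heads(context)
    out = linear(combined, Wo)

    return out, attn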
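
Q8 asks for the (B,T) pad mask to be broadcast appropriately against (B,H,T,T) scores. The standalone sketch below shows one such broadcast, assuming the mask should hide padded key positions for every head and every query, and also checks that splitting into heads and combining again is a lossless round trip. All sizes are made up for the example.

```python
import numpy as np

B, H, T, D = 2, 2, 3, 4
pad_mask = np.array([[True, True, True],
                     [True, False, False]])            # (B, T), True = real token

key_mask = pad_mask[:, None, None, :]                  # (B, 1, 1, T)
full_mask = np.broadcast_to(key_mask, (B, H, T, T))    # shape the scores have

# Split (B, T, D) into heads and combine back
x = np.arange(B * T * D, dtype=float).reshape(B, T, D)
xh = x.reshape(B, T, H, D // H).transpose(0, 2, 1, 3)  # (B, H, T, Dh)
x_back = xh.transpose(0, 2, 1, 3).reshape(B, T, D)

print(key_mask.shape, full_mask.shape)
print(full_mask[1, 0])        # every query row hides key positions 1 and 2
print(np.array_equal(x, x_back))
```

Passing the raw (B,T) mask to `np.where` against (B,H,T,T) scores would align B with a T axis, which either fails or silently masks the wrong entries.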

Q9. LayerNorm + residual connection

Implement LayerNorm for X (B,T,D) using learnable gamma and beta of shape (D,).
Then implement residual_add_and_norm(Y, X, gamma, beta) that returns LayerNorm(X + Y).

In [9]:
import numpy as np

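def layer_norm(X: np.ndarray, gamma: np.ndarray, beta: np.ndarray, eps: float = 1e-5):
    """Normalizes X (B, T, D) over the last axis, then scales by gamma and shifts by beta, both (D,)."""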
    mean = np.mean(X, axis=-1, keepdims=True)
    var = np.var(X, axis=-1, keepdims=True)

    X_hat = (X - mean) / np.sqrt(var + eps)

    Y = gamma * X_hat + beta
    return Y

def residual_add_and_norm(Y: np.ndarray, X: np.ndarray, gamma: np.ndarray, beta: np.ndarray):

    Z = layer_norm(X + Y, gamma, beta)
    return Z
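
A simple way to sanity-check the LayerNorm from Q9: with gamma set to ones and beta to zeros, each (b, t) vector should come out with mean near 0 and variance near 1. The sketch below is standalone; the input statistics and seed are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=3.0, scale=2.0, size=(2, 3, 8))   # (B, T, D)
gamma, beta, eps = np.ones(8), np.zeros(8), 1e-5

mean = X.mean(axis=-1, keepdims=True)
var = X.var(axis=-1, keepdims=True)
Y = gamma * (X - mean) / np.sqrt(var + eps) + beta

print(np.abs(Y.mean(axis=-1)).max())   # close to 0
print(Y.var(axis=-1))                  # close to 1 (slightly below because of eps)
```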

Q10. Position-wise FeedForward network

Implement FFN: FFN(X) = relu(X @ W1 + b1) @ W2 + b2
Shapes: X (B,T,D), W1 (D,Dff), b1 (Dff,), W2 (Dff,D), b2 (D,)
Return: (B,T,D).

In [10]:
import numpy as np

def relu(x: np.ndarray):
    return np.maximum(0, x)

def feed_forward(X: np.ndarray, W1: np.ndarray, b1: np.ndarray, W2: np.ndarray, b2: np.ndarray):

    h = np.matmul(X, W1) + b1

    a = relu(h)

    Y = np.matmul(a, W2) + b2

    return Y
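
"Position-wise" in Q10 means the same two-layer network is applied to every position independently. The standalone sketch below checks this on random weights: running the FFN on the whole batch and on a single position gives the same vector. The sizes, seed, and the helper name `ffn` are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
B, T, D, Dff = 2, 3, 4, 8
X = rng.normal(size=(B, T, D))
W1, b1 = rng.normal(size=(D, Dff)), np.zeros(Dff)
W2, b2 = rng.normal(size=(Dff, D)), np.zeros(D)

def ffn(x):
    """relu(x @ W1 + b1) @ W2 + b2, applied over the last axis."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

Y = ffn(X)                  # whole batch at once, (B, T, D)
y_single = ffn(X[1, 2])     # one position on its own, (D,)

print(Y.shape)
print(np.allclose(Y[1, 2], y_single))
```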

# Part C - Putting it together

Q11. One Transformer encoder block (forward)

Implement a single encoder block forward pass:
1) MHA = MultiHeadSelfAttention(X) with pad_mask
2) X1 = LayerNorm(X + MHA)
3) FFN = FeedForward(X1)
4) X2 = LayerNorm(X1 + FFN)
Return X2.
You may pass all parameters explicitly (weights, gamma/beta).

In [11]:
import numpy as np

def encoder_block_forward(X, params, n_heads: int, pad_mask=None):
    """X: (B, T, D); params: dict of weights; returns X2 (B, T, D) after MHA and FFN sublayers."""

    attn_out, _ = mha_self_attention(
        X,
        params['Wq'], params['Wk'], params['Wv'], params['Wo'],
        n_heads,
        pad_mask
    )


    X1 = residual_add_and_norm(
        attn_out, X,
        params['gamma1'], params['beta1']
    )

    ffn_out = feed_forward(
        X1,
        params['W1'], params['b1'],
        params['W2'], params['b2']
    )

    X2 = residual_add_and_norm(
        ffn_out, X1,
        params['gamma2'], params['beta2']
    )

    return X2
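
The encoder block reads its weights from a `params` dict with the keys used above. One possible way to build such a dict is sketched below; `init_block_params` is a hypothetical helper, and the std of 0.02 for the weight matrices is an assumption borrowed from the embedding initialization in Q5, not something the assignment prescribes.

```python
import numpy as np

def init_block_params(D: int, Dff: int, seed: int = 0) -> dict:
    """Hypothetical initializer for the params dict used by the encoder block."""
    rng = np.random.default_rng(seed)

    def w(*shape):
        # Small random weights; std 0.02 is an assumed choice
        return rng.normal(0.0, 0.02, size=shape)

    return {
        'Wq': w(D, D), 'Wk': w(D, D), 'Wv': w(D, D), 'Wo': w(D, D),
        'gamma1': np.ones(D), 'beta1': np.zeros(D),
        'W1': w(D, Dff), 'b1': np.zeros(Dff),
        'W2': w(Dff, D), 'b2': np.zeros(D),
        'gamma2': np.ones(D), 'beta2': np.zeros(D),
    }

params = init_block_params(D=8, Dff=32)
for name, value in params.items():
    print(name, value.shape)
```

Starting gamma at ones and beta at zeros makes each LayerNorm a plain normalization before any training.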

Q12. Sequence classification head + end-to-end demo

Create an end-to-end forward pass for a tiny classifier:
- Input ids -> embeddings + positional enc
- One encoder block
- Pooling: take the [BOS] position (t=0) as the sequence representation
- Linear head: logits = h0 @ Wcls + bcls with Wcls (D,2), bcls (2,)
- Softmax to probabilities
Write predict_proba that takes a batch of texts and returns probs (B,2).
Include simple sanity checks: shapes, probabilities sum to 1, and masking doesn't crash for different
lengths.


In [13]:
import numpy as np

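def predict_proba(texts, token2id, E, PE, params, Wcls, bcls, n_heads: int):
    """Return class probabilities of shape (B, 2) for a batch of raw texts."""
    # Clean and tokenize as in Q2, then wrap each sequence in [BOS] ... [EOS]
    id_seqs = []
    for text in texts:
        tokens = ['[BOS]'] + clean_text(text).split() + ['[EOS]']
        id_seqs.append(tokens_to_ids(tokens, token2id))

    # Pad to (B, T_max); pad_mask is True on real tokens
    X_ids, pad_mask = pad_batch(id_seqs, pad_id=token2id['[PAD]'])
    T_max = X_ids.shape[1]

    X_emb = embed(X_ids, E)
    X_pe = add_positional_encoding(X_emb, PE[:T_max, :])

    # (B, 1, 1, T) mask broadcasts over heads and query positions
    attn_mask = pad_mask[:, None, None, :]
    X_hidden = encoder_block_forward(X_pe, params, n_heads, pad_mask=attn_mask)

    # Pool at t=0, the [BOS] position
    sentence_rep = X_hidden[:, 0, :]

    logits = linear(sentence_rep, Wcls, bcls)

    probs = softmax(logits, axis=-1)

    return probs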
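
Q12 also asks for simple sanity checks on the output. A minimal sketch of such checks follows; `check_probs` is a hypothetical helper, and the logits are stand-in values so the block runs on its own. In the notebook the probabilities would come from `predict_proba` on texts of different lengths.

```python
import numpy as np

def check_probs(probs: np.ndarray, batch_size: int) -> None:
    """Hypothetical sanity checks for the output of predict_proba."""
    assert probs.shape == (batch_size, 2), probs.shape
    assert np.all(probs >= 0.0), "probabilities must be non-negative"
    assert np.allclose(probs.sum(axis=-1), 1.0), "rows must sum to 1"

# Stand-in logits for two texts; in the notebook these would come from the model
logits = np.array([[0.2, -0.1], [1.5, 1.5]])
shifted = logits - logits.max(axis=-1, keepdims=True)   # stable softmax
probs = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)

check_probs(probs, batch_size=2)
print(probs)
```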