# Assignment #6

Vision Transformer

---


**SUBMISSION INSTRUCTIONS**

First, make a copy of this Colab file, solve the assignment, and upload your final notebook to GitHub.

Before uploading your downloaded notebook, RENAME the file to rollno_name.ipynb.

Submission Deadline: 31/01/2026 (Saturday), EOD, i.e. before 11:59 PM

The deadline is strict. Late submissions will incur a penalty.

Note that you have to upload your solution to the GitHub page of the project Vision Transformer, **under Assignment 6**.

Remember to title your pull request ViT_name_rollno_assgn6.

GitHub submission repo:
https://github.com/electricalengineersiitk/Winter-projects-25-26/tree/main/Vision%20transformer/Assignment%206

# Part A - Data and tokens

Q1. Build a tiny toy dataset with pandas

Create a pandas DataFrame with columns text and label.
- Include at least 12 short sentences (3-10 words each).
- The label is 0/1 (e.g., positive vs negative sentiment).
- Shuffle rows and split into train/test (80/20) using a fixed random seed.
Return: df_train, df_test.

In [5]:
import pandas as pd
def make_toy_dataset(seed: int = 42):
    # 1. Define a list of at least 12 short sentences (3-10 words each).
    sentences = [
        "This movie is absolutely fantastic and thrilling",
        "I really enjoyed this book, it was great",
        "What a wonderful day for a picnic",
        "The service was excellent, very friendly staff",
        "Absolutely loved the food at that restaurant",
        "Terrible experience, would not recommend this place",
        "This product broke after only one use",
        "I found the plot quite boring and slow",
        "The customer support was unhelpful and rude",
        "Never buying from them again, very disappointed",
        "A decent effort, but could be much better",
        "It was okay, nothing special really",
        "Highly inspiring, truly a masterpiece",
        "Simply the best decision I ever made"
    ]

    # 2. Define a corresponding list of binary labels (0 or 1) for each sentence.
    # 1 for positive, 0 for negative/neutral
    labels = [
        1, 1, 1, 1, 1,
        0, 0, 0, 0, 0,
        0, 0,
        1, 1
    ]

    # 3. Create a pandas DataFrame named df with two columns: 'text' and 'label'.
    df = pd.DataFrame({
        'text': sentences,
        'label': labels
    })

    # 4. Shuffle the rows with df.sample(frac=1, random_state=seed). sample() returns a new,
    # shuffled DataFrame (it is not in-place), so reassign it and reset the index with reset_index(drop=True).
    df = df.sample(frac=1, random_state=seed).reset_index(drop=True)

    # 5. Calculate the split index for 80% training data.
    train_size = int(0.8 * len(df))

    # 6. Split the shuffled DataFrame df into df_train (first 80% of rows) and df_test (remaining 20% of rows).
    df_train = df.iloc[:train_size]
    df_test = df.iloc[train_size:]

    return df_train, df_test
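
The shuffle-and-split step can be checked on its own. The sketch below assumes a throwaway 14-row DataFrame and a hypothetical helper name, `shuffle_split`; it only illustrates that a fixed seed gives the same split every time and that 14 rows split 11/3.

```python
import pandas as pd

def shuffle_split(df: pd.DataFrame, seed: int = 42, train_frac: float = 0.8):
    """Hypothetical helper: shuffle rows with a fixed seed, then split by position."""
    shuffled = df.sample(frac=1, random_state=seed).reset_index(drop=True)
    n_train = int(train_frac * len(shuffled))
    # .copy() gives independent frames, so adding columns later does not warn.
    return shuffled.iloc[:n_train].copy(), shuffled.iloc[n_train:].copy()

# Placeholder data, not the assignment's sentences.
df = pd.DataFrame({
    "text": [f"sentence number {i}" for i in range(14)],
    "label": [i % 2 for i in range(14)],
})

a_train, a_test = shuffle_split(df, seed=42)
b_train, b_test = shuffle_split(df, seed=42)

print(len(a_train), len(a_test))   # 11 3
print(a_train.equals(b_train))     # True: same seed, same split
```

With only 14 rows the test set holds 3 examples, so the label balance in it can vary a lot from seed to seed.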

Q2. Clean and tokenize text

Implement a basic cleaner: lowercase, strip, replace multiple spaces with one, and remove punctuation (.,!?;:).
Tokenize by whitespace.
Add a new column tokens that stores a list of tokens per row.
Return the updated DataFrame.

In [8]:
import re
import pandas as pd

def clean_text(s: str) -> str:
    # 1a. Convert to lowercase
    s = s.lower()
    # 1b. Remove punctuation characters (.,!?;:) first, so that removing a
    #     free-standing mark cannot leave a double space behind
    s = re.sub(r'[.,!?;:]', '', s)
    # 1c. Replace runs of whitespace with a single space
    s = re.sub(r'\s+', ' ', s)
    # 1d. Remove leading/trailing whitespace
    s = s.strip()
    return s

def add_tokens_column(df: pd.DataFrame) -> pd.DataFrame:
    """Returns a copy of df with df['tokens'] = list[str]."""
    # Work on a copy so the caller's DataFrame (possibly a slice) is not modified.
    df = df.copy()
    # Clean each sentence, then tokenize it by splitting on whitespace.
    df['tokens'] = df['text'].apply(lambda s: clean_text(s).split())
    return df
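
As a quick illustration of the cleaning and tokenizing steps, here is a minimal standalone sketch. The function name `clean_text_sketch` and the example string are made up for this demonstration; the order of operations (punctuation first, whitespace last) is one reasonable choice, not the only one.

```python
import re

def clean_text_sketch(s: str) -> str:
    """Lowercase, drop punctuation (.,!?;:), then collapse and trim whitespace."""
    s = s.lower()
    s = re.sub(r"[.,!?;:]", "", s)
    s = re.sub(r"\s+", " ", s).strip()
    return s

example = "  Hello,   World!  This is GREAT. "
cleaned = clean_text_sketch(example)
tokens = cleaned.split()

print(cleaned)  # hello world this is great
print(tokens)   # ['hello', 'world', 'this', 'is', 'great']
```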


Q3. Build a vocabulary + token/id mappings

Build token2id and id2token using the training tokens.
Include special tokens: [PAD], [UNK], [BOS], [EOS] at the beginning.
Add tokens that occur at least min_freq times.
Return: token2id (dict), id2token (list).

In [11]:
from collections import Counter
from typing import Dict, List
SPECIALS = ['[PAD]', '[UNK]', '[BOS]', '[EOS]']
def build_vocab(list_of_token_lists, min_freq: int = 1):
    # 1. Initialize token2id as an empty dictionary and id2token as an empty list.
    token2id: Dict[str, int] = {}
    id2token: List[str] = []

    # 2. Iterate through the SPECIALS list. For each special token, add it to id2token
    #    and map it to its index in token2id. Ensure these special tokens are at the
    #    beginning of your vocabulary.
    for token in SPECIALS:
        token2id[token] = len(id2token)
        id2token.append(token)

    # 3. Flatten the list_of_token_lists (e.g., from df_train['tokens']) into a single list of all tokens.
    all_tokens = [token for sublist in list_of_token_lists for token in sublist]

    # 4. Use collections.Counter to count the frequency of each token in the flattened list.
    token_counts = Counter(all_tokens)

    # 5. Iterate through the token counts. For each token, if its frequency is greater
    #    than or equal to min_freq and it is not already in token2id (to avoid adding
    #    special tokens again), add it to id2token and map it to its new index in token2id.
    for token, count in token_counts.items():
        if count >= min_freq and token not in token2id:
            token2id[token] = len(id2token)
            id2token.append(token)

    # 6. Return only after the loop has finished, so every token is considered.
    return token2id, id2token
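
To see what min_freq does, the sketch below builds a vocabulary from three tiny token lists. `build_vocab_sketch` and the toy tokens are illustrative assumptions; the special tokens are the ones listed in the question.

```python
from collections import Counter

SPECIALS = ["[PAD]", "[UNK]", "[BOS]", "[EOS]"]

def build_vocab_sketch(token_lists, min_freq=1):
    """Special tokens first, then every token seen at least min_freq times."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    id2token = list(SPECIALS)
    id2token += [tok for tok, c in counts.items()
                 if c >= min_freq and tok not in SPECIALS]
    token2id = {tok: i for i, tok in enumerate(id2token)}
    return token2id, id2token

toy = [["good", "movie"], ["bad", "movie"], ["good", "food"]]
token2id, id2token = build_vocab_sketch(toy, min_freq=2)

print(id2token)  # ['[PAD]', '[UNK]', '[BOS]', '[EOS]', 'good', 'movie']
print(token2id.get("bad", token2id["[UNK]"]))  # 1, i.e. the [UNK] id
```

Tokens below the threshold ("bad", "food") are left out and would later be mapped to [UNK].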


Q4. Convert tokens to ids + pad to a batch

Implement tokens_to_ids for one sequence.
Implement pad_batch that takes a list of id sequences and returns:
- X: int array (B,T) padded with pad_id
- pad_mask: bool array (B,T) where True means 'real token' and False means padding

In [12]:
import numpy as np
def tokens_to_ids(tokens, token2id, unk_token='[UNK]'):
    """Converts a list of tokens to a list of their corresponding integer IDs."""
    ids = []
    unk_id = token2id[unk_token] # Get ID for unknown token once
    for token in tokens:
        ids.append(token2id.get(token, unk_id))
    return ids

def pad_batch(id_seqs, pad_id: int):
    """Return X (B,T) and pad_mask (B,T)"""
    # Determine batch size B and max sequence length T
    B = len(id_seqs)
    T = max(len(seq) for seq in id_seqs)

    # Initialize X with pad_id
    X = np.full((B, T), pad_id, dtype=int)

    # Initialize pad_mask with False
    pad_mask = np.full((B, T), False, dtype=bool)

    # Populate X and pad_mask
    for i, id_sequence in enumerate(id_seqs):
        seq_len = len(id_sequence)
        X[i, :seq_len] = id_sequence
        pad_mask[i, :seq_len] = True

    # Return only after every sequence has been copied into X.
    return X, pad_mask
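
A small worked example of padding may help. The ids below are arbitrary and `pad_batch_sketch` is a stand-in name; the point is only the shape of X and the meaning of the mask.

```python
import numpy as np

def pad_batch_sketch(id_seqs, pad_id):
    """Right-pad every sequence to the longest one; mask is True on real tokens."""
    B = len(id_seqs)
    T = max(len(seq) for seq in id_seqs)
    X = np.full((B, T), pad_id, dtype=int)
    mask = np.zeros((B, T), dtype=bool)
    for i, seq in enumerate(id_seqs):
        X[i, :len(seq)] = seq
        mask[i, :len(seq)] = True
    return X, mask

X, pad_mask = pad_batch_sketch([[2, 5, 6, 3], [2, 7, 3]], pad_id=0)

print(X.tolist())         # [[2, 5, 6, 3], [2, 7, 3, 0]]
print(pad_mask.tolist())  # [[True, True, True, True], [True, True, True, False]]
```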


# Part B - Core Transformer math

Q5. Embedding lookup

Implement an embedding table E of shape (V,D) initialized from a normal distribution (mean 0, std 0.02).
Given token ids X (B,T), return embeddings of shape (B,T,D) using NumPy indexing.


In [13]:
import numpy as np
def init_embeddings(vocab_size: int, d_model: int, seed: int = 0):
    # 1. Initialize the random seed.
    np.random.seed(seed)
    # 2. Create a NumPy array E of shape (vocab_size, d_model).
    #    Values should be sampled from a normal distribution with mean=0 and std=0.02.
    E = np.random.normal(loc=0.0, scale=0.02, size=(vocab_size, d_model))
    return E
def embed(X: np.ndarray, E: np.ndarray):
    """X: (B,T) int, E: (V,D) -> out: (B,T,D)."""
    # 1. Use NumPy advanced indexing to lookup the embeddings for each token in X.
    #    X contains token IDs. E is the embedding table.
    out = E[X]
    return out
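
The lookup E[X] relies on NumPy integer-array indexing. The sketch below uses made-up sizes (V=6, D=4) and arbitrary ids to show the resulting shape and that equal ids pick the same row.

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 6, 4                                # assumed toy sizes
E = rng.normal(0.0, 0.02, size=(V, D))     # embedding table (V, D)

X = np.array([[2, 5, 3],
              [2, 0, 0]])                  # token ids (B, T), 0 plays the role of [PAD]

out = E[X]                                 # one row of E per id -> (B, T, D)

print(out.shape)                           # (2, 3, 4)
print(np.array_equal(out[0, 0], out[1, 0]))  # True: both positions hold id 2
print(np.array_equal(out[1, 1], E[0]))       # True: id 0 selects row 0 of E
```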


Q6. Sinusoidal positional encoding

Implement the classic sinusoidal positional encoding PE of shape (T,D).
Then add it to token embeddings (B,T,D).
Make sure your implementation works for both even and odd D.

In [14]:
import numpy as np
def sinusoidal_positional_encoding(T: int, D: int):
    # 1a. Create a NumPy array PE of shape (T, D) initialized with zeros.
    PE = np.zeros((T, D))

    # 1b. Generate a position array representing token positions from 0 to T-1.
    position = np.arange(T)[:, np.newaxis] # Shape (T, 1)

    # 1c. Build the inverse-frequency terms 1 / 10000^(2i/D) for i = 0 .. ceil(D/2)-1.
    #     np.arange(0, D, 2) supplies the 2i values, so div_term has ceil(D/2) entries.
    div_term = np.exp(np.arange(0, D, 2) * -(np.log(10000.0) / D))  # Shape (ceil(D/2),)

    # 1d. Apply the sinusoidal functions:
    # i. Even dimension indices (0, 2, 4, ...) get sin(position * div_term).
    #    Broadcasting: (T, 1) * (ceil(D/2),) -> (T, ceil(D/2)).
    PE[:, 0::2] = np.sin(position * div_term)

    # ii. Odd dimension indices (1, 3, 5, ...) get cos(position * div_term).
    #     There are only D // 2 odd columns, so when D is odd the last entry of
    #     div_term has no cosine partner and is dropped here. For even D the
    #     slice is a no-op.
    PE[:, 1::2] = np.cos(position * div_term[:D // 2])

    return PE

def add_positional_encoding(X_emb: np.ndarray, PE: np.ndarray):
    """X_emb: (B,T,D), PE: (T,D) -> (B,T,D)."""
    # 2a. Add the PE matrix to the X_emb tensor.
    # Ensure broadcasting works correctly, as X_emb has shape (B, T, D) and PE has shape (T, D).
    X_emb_pe = X_emb + PE
    return X_emb_pe
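
Since the question asks for both even and odd D, a short check of the two cases is useful. This is a sketch with an assumed function name, `sinusoidal_pe_sketch`; it slices the frequency terms for the cosine columns so an odd D does not cause a shape mismatch.

```python
import numpy as np

def sinusoidal_pe_sketch(T: int, D: int) -> np.ndarray:
    """Sinusoidal positional encoding of shape (T, D), valid for even and odd D."""
    PE = np.zeros((T, D))
    position = np.arange(T)[:, np.newaxis]                           # (T, 1)
    div_term = np.exp(np.arange(0, D, 2) * -(np.log(10000.0) / D))   # ceil(D/2) terms
    PE[:, 0::2] = np.sin(position * div_term)                        # even columns
    PE[:, 1::2] = np.cos(position * div_term[: D // 2])              # odd columns
    return PE

for D in (4, 5):
    PE = sinusoidal_pe_sketch(T=3, D=D)
    # Position 0 gives sin(0)=0 in even columns and cos(0)=1 in odd columns.
    print(D, PE.shape, PE[0].tolist())
```

At position 0 the row is [0, 1, 0, 1] for D=4 and [0, 1, 0, 1, 0] for D=5, which is an easy first thing to verify.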


Q7. Scaled dot-product attention with masking

Implement scaled dot-product attention:
Attention(Q,K,V) = softmax((Q @ K^T) / sqrt(dk) + mask) @ V
Inputs: Q,K,V are (B,H,T,Dh). Mask is boolean broadcastable to (B,H,T,T) where False means 'mask out'.
Requirements:
- Use a numerically stable softmax (subtract max).
- Convert boolean mask to large negative values before softmax.
Return: context (B,H,T,Dh) and attention weights (B,H,T,T).

In [15]:
import numpy as np
def softmax(x: np.ndarray, axis: int = -1):
    # 1b. Subtract the maximum value along the specified axis for numerical stability
    x_stable = x - np.max(x, axis=axis, keepdims=True)
    # 1c. Compute the exponentiated values
    exp_x = np.exp(x_stable)
    # 1d. Divide the exponentiated values by their sum along the specified axis
    sum_exp_x = np.sum(exp_x, axis=axis, keepdims=True)
    y = exp_x / sum_exp_x
    return y

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Q,K,V: (B,H,T,Dh), mask: bool broadcastable to (B,H,T,T)."""
    # 2a. Calculate the dot product of Q and K transpose
    # K.transpose(0, 1, 3, 2) changes (B,H,T,Dh) to (B,H,Dh,T)
    scores = Q @ K.transpose(0, 1, 3, 2)

    # 2b. Scale the scores by dividing by the square root of Dh
    Dh = K.shape[-1]
    scores = scores / np.sqrt(Dh)

    # 2c. If a mask is provided
    if mask is not None:
        # Convert boolean mask to numerical mask: False -> -1e9, True -> 0
        # Using np.where to apply -1e9 where mask is False (masked out)
        # The mask is typically (B,T) or (B,1,1,T) for padding, or (B,1,T,T) for causality.
        # It needs to be broadcastable to (B,H,T,T).
        numerical_mask = np.where(mask, 0.0, -1e9)
        scores = scores + numerical_mask # Addition handles broadcasting

    # 2d. Apply the softmax function to the modified scores along the last axis
    attention_weights = softmax(scores, axis=-1)

    # 2e. Compute the context vector
    context = attention_weights @ V

    # 2f. Return context and attention weights
    return context, attention_weights
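
The masking behaviour can be verified on random inputs. The sketch below uses assumed names (`softmax_sketch`, `attention_sketch`) and a key-only padding mask of shape (B,1,1,T), which is one common way to broadcast a (B,T) pad mask; it checks that each row of weights sums to 1 and that the masked key receives no weight.

```python
import numpy as np

def softmax_sketch(x, axis=-1):
    """Numerically stable softmax: subtract the row maximum before exponentiating."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def attention_sketch(Q, K, V, mask=None):
    """Scaled dot-product attention; mask is boolean, False means 'mask out'."""
    scores = Q @ np.swapaxes(K, -1, -2) / np.sqrt(K.shape[-1])
    if mask is not None:
        scores = scores + np.where(mask, 0.0, -1e9)
    weights = softmax_sketch(scores, axis=-1)
    return weights @ V, weights

rng = np.random.default_rng(0)
B, H, T, Dh = 1, 2, 3, 4
Q, K, V = (rng.normal(size=(B, H, T, Dh)) for _ in range(3))

# The last key position is padding; shape (B, 1, 1, T) broadcasts to (B, H, T, T).
key_mask = np.array([[True, True, False]])[:, None, None, :]
context, weights = attention_sketch(Q, K, V, mask=key_mask)

print(context.shape)                           # (1, 2, 3, 4)
print(np.allclose(weights.sum(axis=-1), 1.0))  # True
print(np.allclose(weights[..., 2], 0.0))       # True: nothing attends to the padded key
```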


Q8. Multi-head self-attention (MHA)

Implement multi-head self-attention for input X (B,T,D).
- Project to Q,K,V using weight matrices Wq,Wk,Wv each (D,D).
- Reshape/split into heads -> (B,H,T,Dh) where Dh=D/H.
- Apply scaled dot-product attention with a pad mask (B,T) (broadcast it appropriately).
- Concatenate heads and apply output projection Wo (D,D).
Return: out (B,T,D) and attention weights (B,H,T,T).

In [16]:
import numpy as np
def linear(x: np.ndarray, W: np.ndarray, b=None):
    # 1. Perform matrix multiplication: x @ W.
    y = x @ W
    # 2. If bias b is provided, add it to the result.
    if b is not None:
        y = y + b # Bias is (D_out,) and will broadcast correctly
    return y

def split_heads(x: np.ndarray, n_heads: int): # (B,T,D) -> (B,H,T,Dh)
    # Get batch size, sequence length, and model dimension
    B, T, D = x.shape
    # Calculate dimension per head
    Dh = D // n_heads
    # 1. Reshape x from (B, T, D) to (B, T, n_heads, Dh).
    # 2. Transpose the dimensions to get (B, n_heads, T, Dh).
    xh = x.reshape(B, T, n_heads, Dh).transpose(0, 2, 1, 3)
    return xh

def combine_heads(xh: np.ndarray): # (B,H,T,Dh) -> (B,T,D)
    # Get batch size, number of heads, sequence length, and dimension per head
    B, n_heads, T, Dh = xh.shape
    # 1. Transpose back to (B, T, n_heads, Dh).
    # 2. Reshape to (B, T, D) where D = n_heads * Dh.
    x = xh.transpose(0, 2, 1, 3).reshape(B, T, n_heads * Dh)
    return x

def mha_self_attention(X, Wq, Wk, Wv, Wo, n_heads: int, pad_mask=None): # X: (B,T,D)
    # D is the model dimension
    D = X.shape[-1]

    # 4a. Project X to Q, K, V using linear layers
    Q = linear(X, Wq)
    K = linear(X, Wk)
    V = linear(X, Wv)

    # 4b. Split Q, K, V into multiple heads
    Qh = split_heads(Q, n_heads)
    Kh = split_heads(K, n_heads)
    Vh = split_heads(V, n_heads)

    # 4c. Apply scaled dot-product attention with pad_mask.
    # pad_mask is (B, T), where True means 'real token' and False means 'padding'.
    # Build a (B, T, T) mask that is True only where both the query and the key
    # are real tokens, then add a head axis so it broadcasts against (B, H, T, T).
    # Note: for a padded query, every entry in its row receives the same -1e9
    # offset, which cancels inside softmax. Those rows therefore still produce
    # valid but meaningless weights; padded positions should be ignored downstream.
    attention_mask_sps = None
    if pad_mask is not None:
        # (B, 1, T) & (B, T, 1) -> (B, T, T)
        seq_len_mask = pad_mask[:, np.newaxis, :] & pad_mask[:, :, np.newaxis]
        # Expand to (B, 1, T, T) for broadcasting across heads
        attention_mask_sps = seq_len_mask[:, np.newaxis, :, :]

    context_h, attn = scaled_dot_product_attention(Qh, Kh, Vh, mask=attention_mask_sps)

    # 4d. Concatenate heads
    context = combine_heads(context_h)

    # 4e. Apply final output projection Wo
    out = linear(context, Wo)

    # 5. Return output and attention weights
    return out, attn
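
The head split is just a reshape followed by a transpose, so it should be exactly reversible. The sketch below uses assumed helper names and a counting array so that it is easy to see which features each head receives.

```python
import numpy as np

def split_heads_sketch(x, n_heads):
    """(B, T, D) -> (B, H, T, Dh) with Dh = D // n_heads."""
    B, T, D = x.shape
    return x.reshape(B, T, n_heads, D // n_heads).transpose(0, 2, 1, 3)

def combine_heads_sketch(xh):
    """(B, H, T, Dh) -> (B, T, H * Dh), the inverse of split_heads_sketch."""
    B, H, T, Dh = xh.shape
    return xh.transpose(0, 2, 1, 3).reshape(B, T, H * Dh)

x = np.arange(2 * 3 * 8, dtype=float).reshape(2, 3, 8)   # B=2, T=3, D=8
xh = split_heads_sketch(x, n_heads=2)

print(xh.shape)                                      # (2, 2, 3, 4)
print(xh[0, 0, 0].tolist())                          # [0.0, 1.0, 2.0, 3.0]  head 0: features 0..3
print(xh[0, 1, 0].tolist())                          # [4.0, 5.0, 6.0, 7.0]  head 1: features 4..7
print(np.array_equal(combine_heads_sketch(xh), x))   # True: round trip is lossless
```

This assumes D is divisible by the number of heads; otherwise the reshape raises an error.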

Q9. LayerNorm + residual connection

Implement LayerNorm for X (B,T,D) using learnable gamma and beta of shape (D,).
Then implement residual_add_and_norm(Y, X, gamma, beta) that returns LayerNorm(X + Y).

In [17]:
import numpy as np
def layer_norm(X: np.ndarray, gamma: np.ndarray, beta: np.ndarray, eps: float = 1e-5):
    # 1a. Calculate the mean of X along the last dimension (D), keeping dimensions for broadcasting.
    mean = np.mean(X, axis=-1, keepdims=True)
    # 1b. Calculate the variance of X along the last dimension (D), keeping dimensions for broadcasting.
    variance = np.var(X, axis=-1, keepdims=True)

    # 1c. Normalize X by subtracting the mean and dividing by the standard deviation
    #     (square root of variance + epsilon).
    # This results in a tensor of the same shape as X.
    normalized_X = (X - mean) / np.sqrt(variance + eps)

    # 1d. Scale the normalized X by gamma and shift by beta.
    #     gamma and beta are (D,) shaped arrays and will broadcast correctly across (B, T) dimensions.
    Y = normalized_X * gamma + beta

    # 1e. Return the normalized and scaled output Y.
    return Y

def residual_add_and_norm(Y: np.ndarray, X: np.ndarray, gamma: np.ndarray, beta: np.ndarray):
    # 2a. Add Y to X (element-wise addition, implicitly handling batch, sequence length, and feature dimensions).
    added_output = X + Y

    # 2b. Apply the layer_norm function to the result of the addition,
    #     using the provided gamma and beta parameters.
    Z = layer_norm(added_output, gamma, beta)

    # 2c. Return the final normalized output Z.
    return Z
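
A simple property check for LayerNorm: with gamma set to ones and beta to zeros, every token vector should come out with mean close to 0 and variance close to 1. The sketch below uses an assumed function name and random input with a deliberately shifted mean and scale.

```python
import numpy as np

def layer_norm_sketch(X, gamma, beta, eps=1e-5):
    """Normalize over the last axis, then scale by gamma and shift by beta."""
    mean = X.mean(axis=-1, keepdims=True)
    var = X.var(axis=-1, keepdims=True)
    return (X - mean) / np.sqrt(var + eps) * gamma + beta

rng = np.random.default_rng(0)
X = rng.normal(loc=3.0, scale=2.0, size=(2, 4, 8))   # (B, T, D), not centred
Y = layer_norm_sketch(X, gamma=np.ones(8), beta=np.zeros(8))

print(Y.shape)                                          # (2, 4, 8)
print(np.allclose(Y.mean(axis=-1), 0.0, atol=1e-6))     # True
print(np.allclose(Y.var(axis=-1), 1.0, atol=1e-3))      # True
```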

Q10. Position-wise FeedForward network

Implement FFN: FFN(X) = relu(X @ W1 + b1) @ W2 + b2
Shapes: X (B,T,D), W1 (D,Dff), b1 (Dff,), W2 (Dff,D), b2 (D,)
Return: (B,T,D).

In [18]:
import numpy as np
def relu(x: np.ndarray):
    # 1. Apply the Rectified Linear Unit (ReLU) activation function.
    #    This means returning x if x > 0, and 0 otherwise.
    y = np.maximum(0, x)
    return y
def feed_forward(X: np.ndarray, W1: np.ndarray, b1: np.ndarray, W2: np.ndarray, b2: np.ndarray):
    # X: (B,T,D)
    # W1: (D,Dff), b1: (Dff,)
    # W2: (Dff,D), b2: (D,)

    # 1. First linear transformation: X @ W1 + b1
    #    X is (B,T,D), W1 is (D,Dff). Result is (B,T,Dff).
    #    b1 is (Dff,) and broadcasts correctly.
    linear1_output = X @ W1 + b1

    # 2. Apply ReLU activation to the output of the first linear transformation.
    relu_output = relu(linear1_output)

    # 3. Second linear transformation: relu_output @ W2 + b2
    #    relu_output is (B,T,Dff), W2 is (Dff,D). Result is (B,T,D).
    #    b2 is (D,) and broadcasts correctly.
    Y = relu_output @ W2 + b2

    # 4. Return the final output of shape (B,T,D).
    return Y
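
"Position-wise" means the same two-layer network is applied to each token independently. The sketch below, with assumed toy sizes and random weights, checks this by comparing one token's output from the full batch against the output of running that token alone.

```python
import numpy as np

def ffn_sketch(X, W1, b1, W2, b2):
    """FFN(X) = relu(X @ W1 + b1) @ W2 + b2, applied over the last axis."""
    return np.maximum(0, X @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
B, T, D, Dff = 2, 3, 4, 16                 # assumed toy sizes
X = rng.normal(size=(B, T, D))
W1, b1 = rng.normal(size=(D, Dff)), rng.normal(size=Dff)
W2, b2 = rng.normal(size=(Dff, D)), rng.normal(size=D)

Y = ffn_sketch(X, W1, b1, W2, b2)
single = ffn_sketch(X[0:1, 1:2], W1, b1, W2, b2)   # only token t=1 of example 0

print(Y.shape)                               # (2, 3, 4)
print(np.allclose(Y[0, 1], single[0, 0]))    # True: no mixing across positions
```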

# Part C - Putting it together

Q11. One Transformer encoder block (forward)

Implement a single encoder block forward pass:
1) MHA = MultiHeadSelfAttention(X) with pad_mask
2) X1 = LayerNorm(X + MHA)
3) FFN = FeedForward(X1)
4) X2 = LayerNorm(X1 + FFN)
Return X2.
You may pass all parameters explicitly (weights, gamma/beta).

In [19]:
def encoder_block_forward(X, params, n_heads: int, pad_mask=None):
    # Extract parameters for MHA
    Wq = params['Wq']
    Wk = params['Wk']
    Wv = params['Wv']
    Wo = params['Wo']

    # Extract parameters for LayerNorm after MHA
    gamma_mha = params['gamma_mha']
    beta_mha = params['beta_mha']

    # Extract parameters for FFN
    W1 = params['W1']
    b1 = params['b1']
    W2 = params['W2']
    b2 = params['b2']

    # Extract parameters for LayerNorm after FFN
    gamma_ffn = params['gamma_ffn']
    beta_ffn = params['beta_ffn']

    # 1) MHA = MultiHeadSelfAttention(X) with pad_mask
    mha_output, _ = mha_self_attention(X, Wq, Wk, Wv, Wo, n_heads, pad_mask=pad_mask)

    # 2) X1 = LayerNorm(X + MHA) -- using residual_add_and_norm
    X1 = residual_add_and_norm(mha_output, X, gamma_mha, beta_mha)

    # 3) FFN = FeedForward(X1)
    ffn_output = feed_forward(X1, W1, b1, W2, b2)

    # 4) X2 = LayerNorm(X1 + FFN) -- using residual_add_and_norm
    X2 = residual_add_and_norm(ffn_output, X1, gamma_ffn, beta_ffn)

    return X2
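
The block above expects a params dictionary with specific keys. A minimal sketch of how such a dictionary might be built follows; the helper name `init_encoder_params` and the initialization choices (normal weights with std 0.02 as in Q5, zero biases, gamma of ones) are assumptions for illustration, not part of the assignment.

```python
import numpy as np

def init_encoder_params(d_model: int, d_ff: int, seed: int = 0) -> dict:
    """Hypothetical helper: build the params dict read by encoder_block_forward."""
    rng = np.random.default_rng(seed)

    def weight(rows, cols):
        # Assumed init: small normal values, mirroring the embedding init in Q5.
        return rng.normal(0.0, 0.02, size=(rows, cols))

    return {
        # Multi-head attention projections, each (D, D)
        "Wq": weight(d_model, d_model),
        "Wk": weight(d_model, d_model),
        "Wv": weight(d_model, d_model),
        "Wo": weight(d_model, d_model),
        # LayerNorm after attention: identity transform at start
        "gamma_mha": np.ones(d_model),
        "beta_mha": np.zeros(d_model),
        # Feed-forward network
        "W1": weight(d_model, d_ff),
        "b1": np.zeros(d_ff),
        "W2": weight(d_ff, d_model),
        "b2": np.zeros(d_model),
        # LayerNorm after the feed-forward network
        "gamma_ffn": np.ones(d_model),
        "beta_ffn": np.zeros(d_model),
    }

params = init_encoder_params(d_model=8, d_ff=32)
for name, value in params.items():
    print(name, value.shape)
```

d_model must be divisible by n_heads when this dictionary is used with the multi-head attention from Q8.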

Q12. Sequence classification head + end-to-end demo

Create an end-to-end forward pass for a tiny classifier:
- Input ids -> embeddings + positional enc
- One encoder block
- Pooling: take the [BOS] position (t=0) as the sequence representation
- Linear head: logits = h0 @ Wcls + bcls with Wcls (D,2), bcls (2,)
- Softmax to probabilities
Write predict_proba that takes a batch of texts and returns probs (B,2).
Include simple sanity checks: shapes, probabilities sum to 1, and masking doesn't crash for different lengths.


In [20]:
def predict_proba(texts, token2id, E, PE, params, Wcls, bcls, n_heads: int):
    # 2. Convert the input `texts` to token lists: clean each text, then split on whitespace.
    token_lists = [clean_text(t).split() for t in texts]

    # Retrieve the [BOS] and [EOS] IDs from token2id once.
    bos_id = token2id['[BOS]']
    eos_id = token2id['[EOS]']

    # Map the word tokens to IDs first, then prepend [BOS] and append [EOS] as IDs.
    # (Passing the integer IDs through tokens_to_ids would turn them into [UNK].)
    id_sequences = []
    for tokens in token_lists:
        id_sequences.append([bos_id] + tokens_to_ids(tokens, token2id) + [eos_id])

    # 3. Pad the sequences of token IDs to create X and pad_mask.
    pad_id = token2id['[PAD]']
    X, pad_mask = pad_batch(id_sequences, pad_id)

    # 4. Perform embedding lookup.
    X_emb = embed(X, E)

    # 5. Add positional encoding.
    # Ensure PE matches the current sequence length T
    T_current = X.shape[1]
    PE_current = PE[:T_current, :]
    X_emb_pe = add_positional_encoding(X_emb, PE_current)

    # 6. Pass through one encoder block.
    # The params dict is expected to contain all necessary MHA, FFN, and LayerNorm weights and biases.
    encoder_output = encoder_block_forward(X_emb_pe, params, n_heads, pad_mask=pad_mask)

    # 7. Perform pooling: take the [BOS] position (t=0) as the sequence representation
    h0 = encoder_output[:, 0, :] # Shape (B, D)

    # 8. Apply the linear classification head.
    logits = h0 @ Wcls + bcls # Wcls: (D,2), bcls: (2,)

    # 9. Apply Softmax to probabilities
    probs = softmax(logits, axis=-1)

    # 10. Return the probabilities.
    return probs
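
The question also asks for sanity checks. One way to organize them is a small checker that takes any prediction function and a batch of texts of different lengths. In the sketch below, `run_sanity_checks` is a hypothetical helper and `dummy_predict` is only a stand-in so the block runs on its own; it is not the model.

```python
import numpy as np

def run_sanity_checks(predict_fn, texts):
    """Hypothetical checker: shape, finiteness, range, and rows summing to 1."""
    probs = predict_fn(texts)
    assert probs.shape == (len(texts), 2), f"unexpected shape {probs.shape}"
    assert np.all(np.isfinite(probs)), "probabilities contain NaN or inf"
    assert np.all((probs >= 0) & (probs <= 1)), "probabilities outside [0, 1]"
    assert np.allclose(probs.sum(axis=-1), 1.0), "rows do not sum to 1"
    return probs

def dummy_predict(texts):
    """Stand-in for predict_proba: logits depend only on the word count."""
    lengths = np.array([[len(t.split())] for t in texts], dtype=float)
    logits = np.concatenate([lengths, -lengths], axis=1)          # (B, 2)
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Texts of different lengths force padding inside a real model.
texts = ["great", "this movie was really very boring and slow"]
probs = run_sanity_checks(dummy_predict, texts)
print(probs.shape)  # (2, 2)
```

In the notebook, the same checker could be called with `lambda t: predict_proba(t, token2id, E, PE, params, Wcls, bcls, n_heads)` in place of `dummy_predict`, provided PE has at least as many rows as the longest padded sequence.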

