# Assignment #6

Vision Transformer

---


**SUBMISSION INSTRUCTIONS**

First, make a copy of this Colab file. Then solve the assignment and upload your final notebook to GitHub.

Before uploading your downloaded notebook, RENAME the file to rollno_name.ipynb.

Submission deadline: Saturday, 31/01/2026, EOD (i.e., before 11:59 PM).

The deadline is strict. Late submissions will incur a penalty.

Note that you have to upload your solution to the GitHub page of the project Vision Transformer, **under Assignment 6**.

Remember to title your pull request ViT_name_rollno_assgn6.

GitHub submission repo -
https://github.com/electricalengineersiitk/Winter-projects-25-26/tree/main/Vision%20transformer/Assignment%206

# Part A - Data and tokens

Q1. Build a tiny toy dataset with pandas
Create a pandas DataFrame with columns text and label.
- Include at least 12 short sentences (3-10 words each).
- The label is 0/1 (e.g., positive vs negative sentiment).
- Shuffle rows and split into train/test (80/20) using a fixed random seed.
Return: df_train, df_test.

In [14]:
import pandas as pd
def make_toy_dataset(seed: int = 42):

  """Return df_train, df_test with columns: text (str), label (int)."""
  # TODO
  data = [
      ("I absolutely adored her hair", 1),
      ("The service at that hotel is wonderful",1),
      ("This purchase made him very happy",1),
      ("I would recommend this movie to everyone",1),
      ("One of the greatest experiences of my life",1),
      ("So happy with my life",1),

      ("I hated that crime drama",0),
      ("The food of that restaurant is awful",0),
      ("He is not happy with his life",0),
      ("Absolute waste of money",0),
      ("It is a low quality fabric",0),
      ("The most disappointing experience of my life",0),
  ]
  df = pd.DataFrame(data, columns=["text", "label"])
  df = df.sample(frac=1, random_state=seed).reset_index(drop=True)
  # test train split -
  split_idx = int(0.8 * len(df))
  df_train = df.iloc[:split_idx].reset_index(drop=True)
  df_test = df.iloc[split_idx:].reset_index(drop=True)


  return df_train, df_test
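
The shuffle-then-split step can be checked in isolation. The sketch below is only an illustration: the mini-dataset and the helper name `shuffle_and_split` are hypothetical and not part of the assignment template. It shows that a fixed `random_state` gives the same split on every run and that the two parts together cover every row.

```python
import pandas as pd

# Hypothetical mini-dataset, used only to illustrate the shuffle/split mechanics.
toy = pd.DataFrame({
    "text": [f"sentence {i}" for i in range(10)],
    "label": [i % 2 for i in range(10)],
})

def shuffle_and_split(df: pd.DataFrame, seed: int, train_frac: float = 0.8):
    """Shuffle rows with a fixed seed, then cut at the train_frac boundary."""
    shuffled = df.sample(frac=1, random_state=seed).reset_index(drop=True)
    split_idx = int(train_frac * len(shuffled))
    train = shuffled.iloc[:split_idx].reset_index(drop=True)
    test = shuffled.iloc[split_idx:].reset_index(drop=True)
    return train, test

train_a, test_a = shuffle_and_split(toy, seed=42)
train_b, test_b = shuffle_and_split(toy, seed=42)

print(len(train_a), len(test_a))   # 8 2
print(train_a.equals(train_b))     # True: same seed, same row order
```

With the 12 sentences above, `int(0.8 * 12)` is 9, so the split is 9 training rows and 3 test rows.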

Q2. Clean and tokenize text

Implement a basic cleaner: lowercase, strip, replace multiple spaces with one, and remove punctuation (.,!?;:).
Tokenize by whitespace.
Add a new column tokens that stores a list of tokens per row.
Return the updated DataFrame.

In [15]:
import re
import pandas as pd
def clean_text(s: str) -> str:
  # TODO
  s = s.lower()
  s = s.strip()
  s = re.sub(r"[.,!?;:]", "", s)
  s = re.sub(r"\s+", " ", s)

  return s

def add_tokens_column(df: pd.DataFrame) -> pd.DataFrame:
  """Adds df['tokens'] = list[str]."""
  # TODO
  df = df.copy()
  df["tokens"] = df["text"].apply(
        lambda x: clean_text(x).split()
    )

  return df
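
To see what the cleaning rules do to a messy string, here is a small worked example. The helper `demo_clean_and_tokenize` is a hypothetical stand-in that applies the same four steps in one function. Note that only the listed punctuation is removed, so apostrophes survive.

```python
import re
from typing import List

def demo_clean_and_tokenize(s: str) -> List[str]:
    """Lowercase, strip, drop . , ! ? ; : then collapse spaces and split."""
    s = s.lower().strip()
    s = re.sub(r"[.,!?;:]", "", s)   # remove only the listed punctuation
    s = re.sub(r"\s+", " ", s)       # collapse runs of whitespace
    return s.split()

print(demo_clean_and_tokenize("  Wow!  This movie, truly, was GREAT.  "))
# ['wow', 'this', 'movie', 'truly', 'was', 'great']
print(demo_clean_and_tokenize("Don't stop: now?"))
# ["don't", 'stop', 'now']
```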


Q3. Build a vocabulary + token/id mappings

Build token2id and id2token using the training tokens.
Include special tokens: [PAD], [UNK], [BOS], [EOS] at the beginning.
Add tokens that occur at least min_freq times.
Return: token2id (dict), id2token (list).

In [16]:
from collections import Counter
from typing import Dict, List
SPECIALS = ['[PAD]', '[UNK]', '[BOS]', '[EOS]']
def build_vocab(list_of_token_lists, min_freq: int = 1):
  # TODO
  counter = Counter()
  for tokens in list_of_token_lists:
      counter.update(tokens)


  id2token: List[str] = []
  token2id: Dict[str, int] = {}

  for tok in SPECIALS:
      token2id[tok] = len(id2token)
      id2token.append(tok)


  for token, freq in sorted(counter.items()):
      if freq >= min_freq:
          token2id[token] = len(id2token)
          id2token.append(token)

  return token2id, id2token
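
The effect of min_freq is easiest to see on a tiny corpus. The following sketch uses a hypothetical helper `demo_vocab` and made-up token lists. It follows the same recipe (special tokens first, then the surviving tokens in sorted order) and compares min_freq=1 with min_freq=2.

```python
from collections import Counter

SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[BOS]", "[EOS]"]

# Made-up training tokens: "good" and "movie" occur twice, the rest once.
token_lists = [["good", "movie"], ["bad", "movie"], ["good", "food"]]

def demo_vocab(token_lists, min_freq):
    """Special tokens first, then tokens with count >= min_freq, sorted."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    kept = sorted(tok for tok, c in counts.items() if c >= min_freq)
    id2token = SPECIAL_TOKENS + kept
    token2id = {tok: i for i, tok in enumerate(id2token)}
    return token2id, id2token

token2id_all, vocab_all = demo_vocab(token_lists, min_freq=1)
token2id_freq2, vocab_freq2 = demo_vocab(token_lists, min_freq=2)

print(vocab_all[4:])     # ['bad', 'food', 'good', 'movie']
print(vocab_freq2[4:])   # ['good', 'movie']
print(token2id_all["[PAD]"], token2id_all["[UNK]"])   # 0 1
```

Tokens dropped by min_freq are later mapped to [UNK] when sequences are converted to ids.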


Q4. Convert tokens to ids + pad to a batch

Implement tokens_to_ids for one sequence.
Implement pad_batch that takes a list of id sequences and returns:
- X: int array (B,T) padded with pad_id
- pad_mask: bool array (B,T) where True means 'real token' and False means padding

In [17]:
import numpy as np
def tokens_to_ids(tokens, token2id, unk_token='[UNK]'):
  # TODO
  unk_id = token2id[unk_token]
  ids = [token2id.get(tok, unk_id) for tok in tokens]
  return ids

def pad_batch(id_seqs, pad_id: int):
  """Return X (B,T) and pad_mask (B,T)"""
  # TODO
  batch_size = len(id_seqs)
  max_len = max(len(seq) for seq in id_seqs)
  X = np.full((batch_size, max_len), pad_id, dtype=np.int64)
  pad_mask = np.zeros((batch_size, max_len), dtype=bool)

  for i, seq in enumerate(id_seqs):
      seq_len = len(seq)
      X[i, :seq_len] = seq
      pad_mask[i, :seq_len] = True
  return X, pad_mask
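
A short worked example of the id conversion and padding, written as a standalone sketch. The toy vocabulary below is hypothetical. It shows an out-of-vocabulary word falling back to [UNK] and how the mask marks real tokens.

```python
import numpy as np

# Hypothetical toy vocabulary, for illustration only.
token2id = {"[PAD]": 0, "[UNK]": 1, "[BOS]": 2, "[EOS]": 3, "good": 4, "movie": 5}
unk_id, pad_id = token2id["[UNK]"], token2id["[PAD]"]

# "film" is not in the vocabulary, so dict.get falls back to the [UNK] id.
ids_a = [token2id.get(t, unk_id) for t in ["good", "film"]]
ids_b = [token2id.get(t, unk_id) for t in ["good", "movie", "good", "movie"]]
id_seqs = [ids_a, ids_b]

# Pad every sequence to the longest length in the batch.
T = max(len(seq) for seq in id_seqs)
X = np.full((len(id_seqs), T), pad_id, dtype=np.int64)
pad_mask = np.zeros((len(id_seqs), T), dtype=bool)
for i, seq in enumerate(id_seqs):
    X[i, :len(seq)] = seq
    pad_mask[i, :len(seq)] = True   # True = real token, False = padding

print(ids_a)                  # [4, 1]
print(X.tolist())             # [[4, 1, 0, 0], [4, 5, 4, 5]]
print(pad_mask.sum(axis=1))   # [2 4]  (number of real tokens per row)
```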


# Part B - Core Transformer math

Q5. Embedding lookup

Implement an embedding table E of shape (V,D) initialized from a normal distribution (mean 0, std 0.02).
Given token ids X (B,T), return embeddings of shape (B,T,D) using NumPy indexing.


In [18]:
import numpy as np
def init_embeddings(vocab_size: int, d_model: int, seed: int = 0):
 # TODO
 rng = np.random.default_rng(seed)
 E = rng.normal(loc=0.0, scale=0.02, size=(vocab_size, d_model))
 return E
def embed(X: np.ndarray, E: np.ndarray):
 """X: (B,T) int, E: (V,D) -> out: (B,T,D)."""
 # TODO
 out = E[X]
 return out
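
The lookup `E[X]` relies on NumPy integer-array indexing: each id in X selects one row of E, so the result has the shape of X with a trailing D axis. A minimal check, with arbitrary sizes chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 6, 4                                # illustrative vocabulary size and width
E = rng.normal(loc=0.0, scale=0.02, size=(V, D))

X = np.array([[2, 4, 5],
              [2, 1, 0]])                  # (B,T) = (2,3) token ids
out = E[X]                                 # integer-array indexing -> (B,T,D)

print(out.shape)                           # (2, 3, 4)
print(np.array_equal(out[1, 2], E[0]))     # True: id 0 selects row 0 of E
print(np.array_equal(out[0, 0], out[1, 0]))  # True: same id, same vector
```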


Q6. Sinusoidal positional encoding

Implement the classic sinusoidal positional encoding PE of shape (T,D).
Then add it to token embeddings (B,T,D).
Make sure your implementation works for both even and odd D.

In [19]:
import numpy as np
def sinusoidal_positional_encoding(T: int, D: int):
    """Return PE: (T,D)."""
    PE = np.zeros((T, D))
    positions = np.arange(T).reshape(T, 1)            # (T, 1)
    # One frequency per sin/cos pair: 10000^(-2i/D)
    div_term = np.exp(
        np.arange(0, D, 2) * (-np.log(10000.0) / D)
    )                                                 # (ceil(D/2),)

    # Even feature indices get sine: ceil(D/2) columns
    PE[:, 0::2] = np.sin(positions * div_term)
    # Odd feature indices get cosine: floor(D/2) columns, so trim div_term when D is odd
    PE[:, 1::2] = np.cos(positions * div_term[:D // 2])

    return PE

def add_positional_encoding(X_emb: np.ndarray, PE: np.ndarray):
    """X_emb: (B,T,D), PE: (T_max,D) with T_max >= T -> (B,T,D)."""
    B, T, D = X_emb.shape
    # Use the PE that was passed in (sliced to the batch length) instead of recomputing it
    return X_emb + PE[:T]
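
Two properties make the encoding easy to sanity-check: at position 0 every sine column is 0 and every cosine column is 1, and all values lie in [-1, 1]. The sketch below uses a hypothetical standalone helper `demo_pe` to test both an even and an odd D, then adds the result to a batch by broadcasting.

```python
import numpy as np

def demo_pe(T: int, D: int) -> np.ndarray:
    """Sinusoidal encoding: sine on even columns, cosine on odd columns."""
    pe = np.zeros((T, D))
    pos = np.arange(T)[:, None]                                  # (T,1)
    div = np.exp(np.arange(0, D, 2) * (-np.log(10000.0) / D))    # (ceil(D/2),)
    pe[:, 0::2] = np.sin(pos * div)
    pe[:, 1::2] = np.cos(pos * div[: D // 2])   # one fewer column when D is odd
    return pe

pe_even = demo_pe(T=3, D=4)
pe_odd = demo_pe(T=3, D=5)

print(pe_even.shape, pe_odd.shape)   # (3, 4) (3, 5)
print(pe_odd[0])                     # [0. 1. 0. 1. 0.]

# Broadcasting: (B,T,D) + (T,D) adds the same encoding to every sequence.
X_emb = np.zeros((2, 3, 5))
X_pos = X_emb + pe_odd
print(X_pos.shape)                   # (2, 3, 5)
```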

Q7. Scaled dot-product attention with masking

Implement scaled dot-product attention:
Attention(Q,K,V) = softmax((Q @ K^T) / sqrt(dk) + mask) @ V
Inputs: Q,K,V are (B,H,T,Dh). Mask is boolean broadcastable to (B,H,T,T) where False means 'mask out'.
Requirements:
- Use a numerically stable softmax (subtract max).
- Convert boolean mask to large negative values before softmax.
Return: context (B,H,T,Dh) and attention weights (B,H,T,T).

In [20]:
import numpy as np

def softmax(x: np.ndarray, axis: int = -1):
    """Numerically stable softmax: subtract the max before exponentiating."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Q,K,V: (B,H,T,Dh)
    mask: bool, broadcastable to (B,H,T,T); False means 'mask out'
    Return: context (B,H,T,Dh), attn (B,H,T,T)
    """
    dk = Q.shape[-1]
    # (B,H,T,Dh) @ (B,H,Dh,T) -> (B,H,T,T)
    scores = Q @ K.transpose(0, 1, 3, 2) / np.sqrt(dk)

    # Boolean mask -> large negative score, so masked keys get ~0 weight
    if mask is not None:
        scores = np.where(mask, scores, -1e9)

    attn = softmax(scores, axis=-1)
    context = attn @ V                                # (B,H,T,Dh)
    return context, attn
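
The masking requirement can be illustrated on a single small score matrix. This is a sketch with made-up numbers, not part of the graded solution: the boolean mask is turned into a large negative score (here -1e9, an arbitrary choice) before a max-subtracted softmax, so the padded key ends up with zero weight and each row still sums to 1.

```python
import numpy as np

# Made-up attention scores for 2 queries over 3 keys.
scores = np.array([[2.0, 1.0, 0.5],
                   [0.3, 0.2, 4.0]])
keep = np.array([True, True, False])   # False = padded key, must be masked out

# Boolean mask -> large negative values (broadcast over the query axis).
masked = np.where(keep, scores, -1e9)

# Numerically stable softmax: subtract the row max before exponentiating.
shifted = masked - masked.max(axis=-1, keepdims=True)
weights = np.exp(shifted)
weights = weights / weights.sum(axis=-1, keepdims=True)

print(weights.round(3))
print(weights.sum(axis=-1))   # each row sums to 1
print(weights[:, 2])          # the padded key receives zero weight
```

Note that the second query's largest raw score (4.0) belongs to the padded key, and masking removes it entirely.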

Q8. Multi-head self-attention (MHA)

Implement multi-head self-attention for input X (B,T,D).
- Project to Q,K,V using weight matrices Wq,Wk,Wv each (D,D).
- Reshape/split into heads -> (B,H,T,Dh) where Dh=D/H.
- Apply scaled dot-product attention with a pad mask (B,T) (broadcast it appropriately).
- Concatenate heads and apply output projection Wo (D,D).
Return: out (B,T,D) and attention weights (B,H,T,T).

In [21]:
import numpy as np
def linear(x: np.ndarray, W: np.ndarray, b=None):
    y = x @ W
    if b is not None:
        y = y + b
    return y

def split_heads(x: np.ndarray, n_heads: int):
    """(B,T,D) -> (B,H,T,Dh)"""
    B, T, D = x.shape
    Dh = D // n_heads
    x = x.reshape(B, T, n_heads, Dh)
    xh = x.transpose(0, 2, 1, 3)   # (B,H,T,Dh)
    return xh

def combine_heads(xh: np.ndarray):
    """(B,H,T,Dh) -> (B,T,D)"""
    B, H, T, Dh = xh.shape
    x = xh.transpose(0, 2, 1, 3).reshape(B, T, H * Dh)
    return x

def mha_self_attention(X, Wq, Wk, Wv, Wo, n_heads: int, pad_mask=None):
    """X: (B,T,D), pad_mask: (B,T) bool -> out (B,T,D), attn (B,H,T,T)"""
    # Linear projections
    Q = linear(X, Wq)
    K = linear(X, Wk)
    V = linear(X, Wv)

    # Split heads
    Qh = split_heads(Q, n_heads)
    Kh = split_heads(K, n_heads)
    Vh = split_heads(V, n_heads)

    # Pad mask: (B,T) -> (B,1,1,T), broadcast over heads and query positions
    if pad_mask is not None:
        attn_mask = pad_mask[:, None, None, :]
    else:
        attn_mask = None

    # Scaled dot-product attention
    context_h, attn = scaled_dot_product_attention(Qh, Kh, Vh, attn_mask)

    # Combine heads
    context = combine_heads(context_h)

    # Output projection
    out = linear(context, Wo)

    return out, attn
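
The head split is just a reshape plus a transpose, so combining must undo it exactly. The standalone sketch below (sizes are arbitrary) checks the round trip, shows which features land in head 0, and confirms that a (B,T) pad mask reshaped to (B,1,1,T) broadcasts against (B,H,T,T) scores.

```python
import numpy as np

B, T, D, H = 2, 3, 8, 4          # illustrative sizes; Dh = D // H = 2
Dh = D // H
x = np.arange(B * T * D, dtype=float).reshape(B, T, D)

# Split: (B,T,D) -> (B,T,H,Dh) -> (B,H,T,Dh)
xh = x.reshape(B, T, H, Dh).transpose(0, 2, 1, 3)
# Combine: (B,H,T,Dh) -> (B,T,H,Dh) -> (B,T,D)
back = xh.transpose(0, 2, 1, 3).reshape(B, T, D)

print(xh.shape)                   # (2, 4, 3, 2)
print(np.array_equal(back, x))    # True: combine undoes split
print(xh[0, 0, 0], x[0, 0, :Dh])  # head 0 holds the first Dh features

# Pad mask broadcasting: (B,T) -> (B,1,1,T) against (B,H,T,T) scores.
pad_mask = np.array([[True, True, False],
                     [True, True, True]])
scores = np.zeros((B, H, T, T))
masked = np.where(pad_mask[:, None, None, :], scores, -1e9)
print(masked.shape)               # (2, 4, 3, 3)
```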

Q9. LayerNorm + residual connection

Implement LayerNorm for X (B,T,D) using learnable gamma and beta of shape (D,).
Then implement residual_add_and_norm(Y, X, gamma, beta) that returns LayerNorm(X + Y).

In [22]:
import numpy as np
def layer_norm(X: np.ndarray, gamma: np.ndarray, beta: np.ndarray, eps: float = 1e-5):
 # TODO
 mean = X.mean(axis=-1, keepdims=True)
 var = X.var(axis=-1, keepdims=True)
 X_norm = (X - mean) / np.sqrt(var + eps)
 Y = gamma * X_norm + beta

 return Y

def residual_add_and_norm(Y: np.ndarray, X: np.ndarray, gamma: np.ndarray, beta: np.ndarray):
 # TODO
 Z = layer_norm(X + Y, gamma, beta)

 return Z
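
A quick way to check a LayerNorm implementation: with gamma = 1 and beta = 0, every token vector should come out with mean close to 0 and variance close to 1 along the feature axis. The sketch below uses a hypothetical helper `demo_layer_norm` and random inputs with a deliberately non-zero mean and scale.

```python
import numpy as np

def demo_layer_norm(X, gamma, beta, eps=1e-5):
    """Normalize over the last (feature) axis, then scale and shift."""
    mean = X.mean(axis=-1, keepdims=True)
    var = X.var(axis=-1, keepdims=True)
    return gamma * (X - mean) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
D = 8
X = rng.normal(loc=5.0, scale=3.0, size=(2, 3, D))   # (B,T,D), far from standardized
gamma, beta = np.ones(D), np.zeros(D)

Y = demo_layer_norm(X, gamma, beta)
print(Y.shape)                                        # (2, 3, 8)
print(np.allclose(Y.mean(axis=-1), 0.0, atol=1e-6))   # True
print(np.allclose(Y.var(axis=-1), 1.0, atol=1e-3))    # True

# Residual form LayerNorm(X + sublayer_out); a zero sublayer output reduces to LayerNorm(X).
Z = demo_layer_norm(X + np.zeros_like(X), gamma, beta)
print(np.allclose(Z, Y))                              # True
```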

Q10. Position-wise FeedForward network

Implement FFN: FFN(X) = relu(X @ W1 + b1) @ W2 + b2
Shapes: X (B,T,D), W1 (D,Dff), b1 (Dff,), W2 (Dff,D), b2 (D,)
Return: (B,T,D).

In [23]:
import numpy as np

def relu(x: np.ndarray):
    y = np.maximum(0, x)
    return y

def feed_forward(X: np.ndarray,
                 W1: np.ndarray, b1: np.ndarray,
                 W2: np.ndarray, b2: np.ndarray):
    """
    X:  (B,T,D)
    W1: (D,Dff), b1: (Dff,)
    W2: (Dff,D), b2: (D,)
    Return: (B,T,D)
    """
    H = relu(X @ W1 + b1)
    Y = H @ W2 + b2

    return Y
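
"Position-wise" means the same two-layer network is applied to each token independently. The sketch below (random weights and sizes chosen only for illustration, with a hypothetical helper `demo_ffn`) confirms the output shape and that running one token on its own gives the same result as running it inside the batch.

```python
import numpy as np

rng = np.random.default_rng(0)
B, T, D, Dff = 2, 3, 4, 16                      # illustrative sizes
X = rng.normal(size=(B, T, D))
W1, b1 = rng.normal(scale=0.02, size=(D, Dff)), np.zeros(Dff)
W2, b2 = rng.normal(scale=0.02, size=(Dff, D)), np.zeros(D)

def demo_ffn(x):
    """relu(x @ W1 + b1) @ W2 + b2, applied over the last axis."""
    hidden = np.maximum(0, x @ W1 + b1)
    return hidden @ W2 + b2

Y = demo_ffn(X)              # whole batch: (B,T,D) -> (B,T,D)
single = demo_ffn(X[0, 1])   # one token on its own: (D,) -> (D,)

print(Y.shape)                          # (2, 3, 4)
print(np.allclose(Y[0, 1], single))     # True: tokens do not interact in the FFN
```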

# Part C - Putting it together

Q11. One Transformer encoder block (forward)

Implement a single encoder block forward pass:
1) MHA = MultiHeadSelfAttention(X) with pad_mask
2) X1 = LayerNorm(X + MHA)
3) FFN = FeedForward(X1)
4) X2 = LayerNorm(X1 + FFN)
Return X2.
You may pass all parameters explicitly (weights, gamma/beta).

In [24]:
def encoder_block_forward(X, params, n_heads: int, pad_mask=None):
    """
    X: (B,T,D)
    params: dict of numpy arrays
    """
    # 1. multi-head self-attention
    MHA_out, _ = mha_self_attention(
        X,
        params["Wq"], params["Wk"], params["Wv"], params["Wo"],
        n_heads,
        pad_mask
    )

    # 2. residual + LayerNorm
    X1 = residual_add_and_norm(
        MHA_out, X,
        params["gamma1"], params["beta1"]
    )

    # 3. feedforward
    FFN_out = feed_forward(
        X1,
        params["W1"], params["b1"],
        params["W2"], params["b2"]
    )

    # 4. residual + LayerNorm
    X2 = residual_add_and_norm(
        FFN_out, X1,
        params["gamma2"], params["beta2"]
    )

    return X2
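
The block expects a params dict with specific keys. One possible way to build it is sketched below. The helper name `init_encoder_params` and the choice of std 0.02 for the weight matrices (mirroring the embedding initialization in Q5) are assumptions, not requirements of the assignment. LayerNorm scales start at 1 and all biases and shifts at 0.

```python
import numpy as np

def init_encoder_params(d_model: int, d_ff: int, seed: int = 0, std: float = 0.02):
    """Hypothetical initializer for the parameter dict used by one encoder block."""
    rng = np.random.default_rng(seed)

    def w(*shape):
        return rng.normal(loc=0.0, scale=std, size=shape)

    return {
        # attention projections, each (D,D)
        "Wq": w(d_model, d_model), "Wk": w(d_model, d_model),
        "Wv": w(d_model, d_model), "Wo": w(d_model, d_model),
        # first LayerNorm
        "gamma1": np.ones(d_model), "beta1": np.zeros(d_model),
        # feed-forward network
        "W1": w(d_model, d_ff), "b1": np.zeros(d_ff),
        "W2": w(d_ff, d_model), "b2": np.zeros(d_model),
        # second LayerNorm
        "gamma2": np.ones(d_model), "beta2": np.zeros(d_model),
    }

params = init_encoder_params(d_model=8, d_ff=32)
for name, arr in params.items():
    print(name, arr.shape)
```

With d_model=8, any n_heads that divides 8 (1, 2, 4 or 8) gives a valid head size.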


Q12. Sequence classification head + end-to-end demo

Create an end-to-end forward pass for a tiny classifier:
- Input ids -> embeddings + positional enc
- One encoder block
- Pooling: take the [BOS] position (t=0) as the sequence representation
- Linear head: logits = h0 @ Wcls + bcls with Wcls (D,2), bcls (2,)
- Softmax to probabilities
Write predict_proba that takes a batch of texts and returns probs (B,2).
Include simple sanity checks: shapes, probabilities sum to 1, and masking doesn't crash for different lengths.


In [25]:
def predict_proba(texts, token2id, E, PE, params, Wcls, bcls, n_heads: int):
    """
    texts: list of B strings
    PE: (T_max,D) with T_max >= longest sequence, counting the [BOS] token
    Return probs: (B,2)
    """

    # --- Clean + tokenize; prepend [BOS] so that t=0 is the pooling position ---
    token_lists = [['[BOS]'] + clean_text(t).split() for t in texts]
    id_seqs = [tokens_to_ids(toks, token2id) for toks in token_lists]

    pad_id = token2id['[PAD]']
    X_ids, pad_mask = pad_batch(id_seqs, pad_id)   # (B,T), (B,T)

    # --- Embeddings ---
    X_emb = embed(X_ids, E)                         # (B,T,D)
    X_emb = X_emb + PE[:X_emb.shape[1]]             # positional encoding

    # --- Encoder block ---
    X_enc = encoder_block_forward(
        X_emb, params, n_heads, pad_mask
    )                                               # (B,T,D)

    # --- Pool [BOS] token (t=0) ---
    h0 = X_enc[:, 0, :]                              # (B,D)

    # --- Classification head ---
    logits = h0 @ Wcls + bcls                        # (B,2)

    # --- Softmax (numerically stable: subtract the row max) ---
    exp_logits = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs = exp_logits / exp_logits.sum(axis=-1, keepdims=True)

    # --- Sanity checks ---
    assert probs.shape == (len(texts), 2)
    assert np.allclose(probs.sum(axis=1), 1.0)

    return probs
