<a href="https://colab.research.google.com/github/aditya161205/work/blob/main/micro_embed.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **MAKING EMBEDDING MODEL FROM SCRATCH**


#PART 1: THE TOKENIZER (Byte Pair Encoding)

WHY: Real models don't split by space. They learn sub-words.

e.g., "unknown" -> "un" + "known". This solves the "Out of Vocabulary" error.

Here we implement simple Byte-Pair-Encoding algorithm

Byte Pair Encoding (BPE) is a subword tokenization algorithm that repeatedly merges the most frequent adjacent character pairs to form new tokens.
It helps reduce vocabulary size and handle rare or unseen words.

Example:
Text: low lower → start as l o w | l o w e r → frequent pair l o merges to lo → tokens become lo w | lo w e r.

In [40]:
import numpy as np
import random
import collections

In [41]:
class SimpleBPE:

  def __init__(self,num_merges=10):
    self.merges = {}
    self.vocab = {}
    self.num_merges = num_merges


  def get_stats(self, vocab):
    pairs = collections.defaultdict(int)
    #pairs = collections.defaultdict(int) creates a dictionary with default value 0 for any missing key
    for word, freq in vocab.items():
      symbols = word.split()
      for i in range(len(symbols)-1):
        pairs[symbols[i],symbols[i+1]] += freq
      return pairs

  #This function merges a chosen frequent symbol pair (bigram) into a single token across the vocabulary.
  def merge_vocab(self,pair,v_in):
    v_out = {}

    bigram = ' '.join(pair) #Converts ('l','o') → "l o" (how it appears in vocab).
    replacement = ''.join(pair) #Converts ('l','o') → "lo" (merged token).

    for word in v_in:
      w_out = word.replace(bigram,replacement)
      v_out[w_out] = v_in[word]
    return v_out

  def train(self, text):

    raw_words = text.split()
    vocab = {" ".join(list(w)) + " </w>": 1 for w in raw_words}

    for i in range(self.num_merges):
      pairs = self.get_stats(vocab)
      if not pairs: break
      best = max(pairs, key = pairs.get)
      self.merges[best]= ''.join(best) #store the merge
      vocab = self.merge_vocab(best,vocab) #merge the pair

    #making the final mapping (Token - > ID)
    unique_tokens = set(" ".join(vocab.keys()).split())
    self.vocab = {token: i for i, token in enumerate(unique_tokens)}
    #add special [UNK] token for unknowns
    self.vocab["[UNK]"]= len(self.vocab)

  def encode(self,text):
    word = " ".join(list(text)) + "</w>"
    for pair, replacement in self.merges.items():
      bigram = ' '.join(pair)
      if bigram in word:
        word = word.replace(bigram, replacement)
    tokens = word.split()
    return [self.vocab.get(t,self.vocab["[UNK]"]) for t in tokens]




The first two functions in the BPE class do the following.

Initial vocabulary:
n e w : 6
n e w e r : 3

Pair counts computed by get_stats:
(n, e) = 9
(e, w) = 9
(w, e) = 3
(e, r) = 3

Chosen pair to merge:
(n, e)

After applying merge_vocab:
n e w     becomes ne w
n e w e r becomes ne w e r

Updated vocabulary:
ne w : 6
ne w e r : 3


###this is the step by step flow of BPE

Input           → n e w e r </w>

After merges    → new e r </w>

Tokens          → ["new", "e", "r", "</w>"]

IDs             → [id(new), id(e), id(r), id(</w>)]



# PART 2: THE ATTENTION LAYER (The "Brain")

WHY: This replaces the simple LinearLayer.

It allows words to "vote" on their meaning based on context.

In [42]:
class SelfAttentionLayer:
    def __init__(self, embed_dim, head_dim):
        self.d_k = head_dim
        # Q, K, V matrices (Randomly initialized)
        self.W_q = np.random.randn(embed_dim, head_dim) * 0.01
        self.W_k = np.random.randn(embed_dim, head_dim) * 0.01
        self.W_v = np.random.randn(embed_dim, head_dim) * 0.01

    def forward(self, x):

        # 1. Linear Projections
        Q = np.dot(x, self.W_q)
        K = np.dot(x, self.W_k)
        V = np.dot(x, self.W_v)

        # 2. Scaled Dot-Product Attention
        # Scores = (Q * K.T) / sqrt(d_k)
        scores = np.dot(Q, K.T) / np.sqrt(self.d_k)

        # 3. Softmax
        exp_scores = np.exp(scores - np.max(scores, axis=-1, keepdims=True))
        weights = exp_scores / np.sum(exp_scores, axis=-1, keepdims=True)

        # 4. Contextualize
        context = np.dot(weights, V)
        #IMP->Each token collects information from all tokens, weighted by how much it should “pay attention” to them.
        return context

# PART 3: THE MODEL WRAPPER

In [43]:
class MiniTransformer:
  def __init__(self, vocab_size,embed_dim):
    #static embeddings->work as lookup table
    self.embeddings = np.random.randn(vocab_size, embed_dim)*0.01
    #attention mechanism
    self.attention = SelfAttentionLayer(embed_dim,embed_dim)

  def get_sentence_vector(self, token_ids):
        """
        Full Forward Pass: IDs -> Static -> Attention -> Mean Pooling
        """
        # Lookup Static Vectors
        static_vecs = self.embeddings[token_ids]# Shape: [Seq_Len, Embed_Dim]

        # Apply Attention
        context_vecs = self.attention.forward(static_vecs)# Shape: [Seq_Len, Embed_Dim] (Contextualized)

        # Mean Pooling (Crude way to get 1 vector for the whole sentence)
        sent_vec = np.mean(context_vecs, axis=0)# We average all word vectors to get the "Sentence Meaning"

        # Normalize to unit sphere
        return sent_vec / (np.linalg.norm(sent_vec) + 1e-9)




# PART 4: DATA AUGMENTER

Helps to learn similarity between sentences

In [44]:
class DataAugmenter:
    def augment(self, sentence):
        words = sentence.split() # Simple split for augmentation logic
        if len(words) > 1:
            remove_idx = random.randint(0, len(words) - 1)
            new_words = words[:remove_idx] + words[remove_idx+1:]
            return " ".join(new_words)
        return sentence

# PART 5: TRAINING LOOP (The "Pipeline")

In [45]:
#step 1->Data and tokenizer setup

raw_text = "the bank of money and the river bank"

#now we train BPE on this text
tokenizer = SimpleBPE(num_merges=10)
tokenizer.train(raw_text)

#initializing
model = MiniTransformer(len(tokenizer.vocab),embed_dim=8)
augmenter = DataAugmenter()

print(f"vocab: {tokenizer.vocab}")
print("training started")


#Step 2-> Training phase
# (Note: Full backprop for Attention/Softmax is too complex for this snippet.
# We will simulate the Forward Pass and Loss Calculation to show data flow.)

sentences = ["river bank", "money bank"]

for epoch in range(10):
    total_loss = 0
    for sent in sentences:
        # --- A. PREPARE TRIPLETS ---
        # Anchor: "river bank"
        # Positive: "bank" (Augmented)
        # Negative: "money bank" (Hard Negative)

        anchor_text = sent
        pos_text = augmenter.augment(sent)
        neg_text = "money bank" if "river" in sent else "river bank"

        # --- B. TOKENIZE & EMBED ---
        # Convert strings to Lists of IDs
        ids_anc = tokenizer.encode(anchor_text)
        ids_pos = tokenizer.encode(pos_text)
        ids_neg = tokenizer.encode(neg_text)

        # Get Sentence Vectors (Forward Pass through Attention)
        v_anc = model.get_sentence_vector(ids_anc)
        v_pos = model.get_sentence_vector(ids_pos)
        v_neg = model.get_sentence_vector(ids_neg)

        # --- C. COMPUTE LOSS (InfoNCE) ---
        # Dot Products
        sim_pos = np.dot(v_anc, v_pos)
        sim_neg = np.dot(v_anc, v_neg)

        # Softmax & NLL
        scores = np.array([sim_pos, sim_neg])
        exp_scores = np.exp(scores - np.max(scores)) # Stability
        probs = exp_scores / np.sum(exp_scores)

        loss = -np.log(probs[0]) # We want index 0 (Positive) to be 1.0
        total_loss += loss

        # (Optimization Step Omitted for brevity: requires complex chain rule)
        # In a real framework, optimizer.step() happens here.

    print(f"Epoch {epoch}: Loss {total_loss:.4f}")


vocab: {'v': 0, 'o': 1, 'a': 2, 'e': 3, 'm': 4, 'r': 5, 'the</w>': 6, 'n': 7, '</w>': 8, 'y': 9, 'k': 10, 'i': 11, 'f': 12, 'b': 13, 'd': 14, '[UNK]': 15}
training started
Epoch 0: Loss 1.0415
Epoch 1: Loss 1.0415
Epoch 2: Loss 1.0415
Epoch 3: Loss 1.2781
Epoch 4: Loss 1.0938
Epoch 5: Loss 1.2781
Epoch 6: Loss 1.0938
Epoch 7: Loss 1.0415
Epoch 8: Loss 1.2781
Epoch 9: Loss 1.2781


In [46]:

# VERIFICATION (Does Attention Work?)

# Let's see if the word "bank" has different vectors in different contexts
ids_river = tokenizer.encode("river bank")
ids_money = tokenizer.encode("money bank")

# Get vector for the sentence "river bank"
vec_river_bank = model.get_sentence_vector(ids_river)
# Get vector for the sentence "money bank"
vec_money_bank = model.get_sentence_vector(ids_money)

print(f"Vector 'River Bank' (First 4 dims): {np.round(vec_river_bank[:4], 2)}")
print(f"Vector 'Money Bank' (First 4 dims): {np.round(vec_money_bank[:4], 2)}")

similarity = np.dot(vec_river_bank, vec_money_bank)
print(f"Similarity between them: {similarity:.4f}")


--- ATTENTION CHECK ---
Vector 'River Bank' (First 4 dims): [-0.23  0.25  0.1  -0.03]
Vector 'Money Bank' (First 4 dims): [ 0.35  0.3   0.03 -0.6 ]
Similarity between them: 0.5210


# Lets implement this using Pytorch to unleash the accuracy

In [48]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F


# 1. THE MODEL (Identical Logic, Auto-Calculus)


In [49]:

class MiniTransformer(nn.Module):
    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        # Static Embeddings (The Lookup Table)
        self.embedding = nn.Embedding(vocab_size, embed_dim)

        # Q, K, V Projections (The Linear Layers we built manually)
        self.W_q = nn.Linear(embed_dim, embed_dim, bias=False)
        self.W_k = nn.Linear(embed_dim, embed_dim, bias=False)
        self.W_v = nn.Linear(embed_dim, embed_dim, bias=False)

    def get_sentence_vector(self, token_ids):
        # 1. Embed inputs (Shape: [Seq_Len, Embed_Dim])
        # We use torch.tensor to wrap inputs
        x = self.embedding(token_ids)

        # 2. Self-Attention Logic (The "Brain")
        # A. Create Q, K, V
        Q = self.W_q(x)
        K = self.W_k(x)
        V = self.W_v(x)

        # B. Dot Product Scores (Q * K.T)
        # torch.matmul is exactly np.dot
        scores = torch.matmul(Q, K.transpose(-2, -1))

        # C. Scale
        d_k = x.size(-1)
        scores = scores / (d_k ** 0.5)

        # D. Softmax (The "Jacobian" part handled automatically)
        attn_weights = F.softmax(scores, dim=-1)

        # E. Contextualize (Weights * V)
        context = torch.matmul(attn_weights, V)

        # 3. Mean Pooling (Sentence Vector)
        # Average all words to get 1 vector
        sent_vec = torch.mean(context, dim=0)

        # Normalize (L2 Norm)
        return F.normalize(sent_vec, p=2, dim=0)


# 2. TRAINING LOOP

In [50]:

vocab_size = 10
ids_river_bank = torch.tensor([0, 1])  # "river bank"
ids_bank_aug   = torch.tensor([1])     # "bank" (Positive)
ids_money_bank = torch.tensor([2, 1])  # "money bank" (Negative)

model = MiniTransformer(vocab_size, embed_dim=8)
optimizer = optim.Adam(model.parameters(), lr=0.1) # The "Teacher"

print("Training PyTorch Transformer...\n")

for epoch in range(20):
    optimizer.zero_grad() # Reset gradients-> IMPORTANT

    # --- Forward Pass ---
    # 1. Get Vectors
    anchor = model.get_sentence_vector(ids_river_bank)
    positive = model.get_sentence_vector(ids_bank_aug)
    negative = model.get_sentence_vector(ids_money_bank)

    # 2. Compute Similarity (Dot Product)
    sim_pos = torch.dot(anchor, positive)
    sim_neg = torch.dot(anchor, negative)

    # 3. Compute Loss (InfoNCE)
    # Stack scores: [Positive, Negative]
    logits = torch.stack([sim_pos, sim_neg])
    # Target is index 0 (The Positive Pair)
    target = torch.tensor([0])

    # CrossEntropyLoss combines Softmax + Log + NLL automatically
    loss = F.cross_entropy(logits.unsqueeze(0), target)

    # --- Backward Pass (The Magic) ---
    loss.backward() # Calculates ALL derivatives (Jacobians, Chain Rule)
    optimizer.step() # Updates weights

    if epoch % 2 == 0:
        print(f"Epoch {epoch}: Loss {loss.item():.4f}")


Training PyTorch Transformer...

Epoch 0: Loss 0.6697
Epoch 2: Loss 0.1355
Epoch 4: Loss 0.1289
Epoch 6: Loss 0.1287
Epoch 8: Loss 0.1289
Epoch 10: Loss 0.1287
Epoch 12: Loss 0.1282
Epoch 14: Loss 0.1278
Epoch 16: Loss 0.1275
Epoch 18: Loss 0.1272


# 3. RESULTS

In [57]:

print("\n--- Final Check ---")
# Re-calculate vectors without updating weights
v_anchor = model.get_sentence_vector(ids_river_bank)
v_neg = model.get_sentence_vector(ids_money_bank)

sim = torch.dot(v_anchor, v_neg).item()
print(f"Similarity: {sim:.4f}")



--- Final Check ---
Similarity: -0.9993
