<img src="https://toppng.com/uploads/preview/linkedin-logo-png-photo-116602552293wtc4qogql.png" width="20" height="20" /> [Bharath Hemachandran](https://www.linkedin.com/in/bharath-hemachandran/)

# 📝 Phase 0: Encoding, Decoding & Vector Space

**Before using any LLM**, we learn how text becomes **token IDs** and **vectors**. The Groq API performs the same encode → vectors → decode steps internally; here you see each step explicitly.

<div style="background: #e8f5e9; padding: 14px; border-radius: 8px; border-left: 4px solid #4caf50;">
<strong>🎯 What you'll do:</strong> Encode text → token IDs → show subwords → map IDs to vectors (with a random demo matrix) → decode back to text. No API key needed.
</div>

### üìã Notebook objective (table of contents)

This notebook covers:
- **Setup**: Install the LangChain text splitters, tiktoken, NumPy, and Matplotlib
- **Bag of Words (traditional)**: Words → vocabulary → count vectors (before BPE)
- **What is BPE?** Byte Pair Encoding (subwords, merge table, why LLMs use it)
- **Load the tokenizer**: LangChain + tiktoken (BPE, gpt2 encoding)
- **1. Encode**: Text → token IDs (what <code>input_tokens</code> means)
- **2. Subwords**: What each token ID represents
- **3. Vector space**: What vectors and vector spaces are; embeddings; dimensions; similarity
- **4. Decode**: Token IDs → text (round-trip)
- **5. Subword example**: Rare word split into tokens
- **Exercises**: Token count vs words, rare-word subwords, BoW vs BPE
- **Additional reading**: Videos and blogs
- **Connection to Phase 1**: How this maps to the LLM API


## 🔧 Setup (run once)

Install **langchain-text-splitters**, **tiktoken**, **numpy**, and **matplotlib**. On Colab, run this cell first.

In [None]:
!pip install -q langchain-text-splitters tiktoken numpy matplotlib

## 📚 Before BPE: Bag of Words (traditional text → vectors)

**Before** subword tokenization (like Byte Pair Encoding), a simple way to turn text into vectors was **Bag of Words (BoW)**:

1. **Split** text into words (e.g. by spaces).
2. **Build a vocabulary**: a fixed list of unique words (e.g. from a corpus).
3. **Represent each sentence** as a vector of **word counts** (one dimension per vocabulary word).

No subwords: each **word** is one unit. Unknown words are typically ignored or mapped to a special "unknown" index. Below we implement a minimal BoW by hand (no BPE, no tokenizer yet).

In [None]:
# Minimal Bag of Words: words → vocabulary indices → count vectors
import numpy as np

# Small corpus (we'll build vocabulary from this)
sentences = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "a cat and a dog",
]

# 1. Build vocabulary: unique words, sorted (so the same word always gets the same index)
all_words = []
for s in sentences:
    all_words.extend(s.lower().split())
vocab = sorted(set(all_words))
word_to_id = {w: i for i, w in enumerate(vocab)}

print("📖 Vocabulary (word → index):")
print(f"   {word_to_id}")
print(f"   Size: {len(vocab)}")

# 2. Convert each sentence to a BoW vector (count of each word in the sentence)
def bow_vector(sentence, word_to_id):
    vec = np.zeros(len(word_to_id), dtype=np.int32)
    for w in sentence.lower().split():
        if w in word_to_id:
            vec[word_to_id[w]] += 1
    return vec

print("\n📊 Bag of Words vectors (one row per sentence):")
for s in sentences:
    v = bow_vector(s, word_to_id)
    print(f"   {s!r}")
    print(f"   → {v}  (counts for: {vocab})")

**Limitation of BoW:** The vocabulary is **fixed** by the corpus. New or rare words (e.g. "tokenizer") have **no index** and are typically ignored. **Byte Pair Encoding (BPE)**, used next, splits text into **subwords**, so any string can be represented with a fixed, learned set of pieces and no separate word list is needed.
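
To make the limitation concrete, here is a minimal sketch in plain Python. The helper `bow_counts`, the `demo_` names, and the second test sentence are illustrative assumptions, not part of the notebook's cells. A sentence made entirely of out-of-vocabulary words collapses to an all-zero vector, so BoW cannot tell it apart from an empty string.

```python
# Illustrative sketch: out-of-vocabulary (OOV) words vanish in Bag of Words.
# `bow_counts` is a hypothetical helper written only for this example.

demo_corpus = ["the cat sat on the mat", "the dog sat on the log", "a cat and a dog"]
demo_vocab = sorted({w for s in demo_corpus for w in s.lower().split()})
demo_word_to_id = {w: i for i, w in enumerate(demo_vocab)}

def bow_counts(sentence, word_to_id):
    """Return (count vector, list of words that had no vocabulary index)."""
    vec = [0] * len(word_to_id)
    dropped = []
    for w in sentence.lower().split():
        if w in word_to_id:
            vec[word_to_id[w]] += 1
        else:
            dropped.append(w)
    return vec, dropped

known_vec, known_dropped = bow_counts("the cat sat", demo_word_to_id)
oov_vec, oov_dropped = bow_counts("tokenizer splits subwords", demo_word_to_id)

print("vocab:", demo_vocab)
print("'the cat sat'               ->", known_vec, "| dropped:", known_dropped)
print("'tokenizer splits subwords' ->", oov_vec, "| dropped:", oov_dropped)
```

Every word of the second sentence lands in `dropped`, which is the information loss that subword tokenization is meant to avoid.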

## What is Byte Pair Encoding (BPE)?

**Byte Pair Encoding (BPE)** is a **subword** tokenization algorithm: text is split into pieces that can be smaller than a word (e.g. "running" → "run" + "ning"). That way we don't need a separate entry for every word: we learn a **fixed set of subword pieces** from data and can represent **any** string by concatenating them.

**How BPE works (conceptually):**

1. **Start from characters (or bytes).** The text is first split into characters (or byte-level units).
2. **Learn merges from a corpus.** We count how often every **pair** of adjacent units appears (e.g. "t" + "h" → "th"). We repeatedly **merge the most frequent pair** into a new single token and add it to the vocabulary. After many such merges we have a vocabulary of subwords (single chars, frequent chunks like "th", "ing", whole common words, etc.).
3. **Encode new text.** To tokenize a new sentence we split it into characters, then apply the **same merge rules in order**. The result is a sequence of **token IDs** (one integer per subword).
4. **Decode.** To go back to text we map each token ID to its subword string and concatenate.

**Why LLMs use BPE:** A fixed vocabulary of ~50k subwords can represent any sentence. Rare words become several tokens (e.g. "tokenizer" → "token" + "izer"); common words may stay one token. The model only ever sees **sequences of integers** (token IDs); we'll see next how each ID is then mapped to a **vector**.

In [None]:
# Minimal BPE-style demo: find the most frequent adjacent pair (one step)
# In real BPE we repeat this thousands of times on a large corpus.
from collections import Counter

text = "aaabaaaba"  # "aa" appears 4 times (overlaps counted), "ab" 2 times, "ba" 2 times
units = list(text)  # start with characters: ['a','a','a','b','a','a','a','b','a']

# Count every adjacent pair of units
pair_counts = Counter("".join(units[i:i+2]) for i in range(len(units) - 1))
most_common_pair = pair_counts.most_common(1)[0][0]  # "aa" for this text

print("📌 One BPE-style merge step (conceptual):")
print(f"   Text: {text!r} → units: {units}")
print(f"   Most frequent pair: {most_common_pair!r}")
print(f"   After merging that pair: we'd replace all \"{most_common_pair}\" with one new token.")
print("   Real BPE vocabularies (e.g. those shipped with tiktoken) are learned on a huge corpus and keep many merge rules.")
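
The cell above only finds the most frequent pair. As a rough sketch of the full loop (learn a few merges, then reuse them to encode and decode), the following plain-Python example may help. The function names `learn_merges`, `apply_merge`, and `bpe_encode` are hypothetical, the corpus is a toy string, and the output is a list of subword strings rather than integer IDs. Production tokenizers such as tiktoken work on bytes and add pre-tokenization rules that this sketch leaves out.

```python
from collections import Counter

def apply_merge(units, pair):
    """Replace each non-overlapping occurrence of `pair` (left to right) with one merged unit."""
    merged, i = [], 0
    while i < len(units):
        if i + 1 < len(units) and (units[i], units[i + 1]) == pair:
            merged.append(units[i] + units[i + 1])
            i += 2
        else:
            merged.append(units[i])
            i += 1
    return merged

def learn_merges(text, num_merges):
    """Learn up to `num_merges` merge rules from `text`, starting from single characters."""
    units, merges = list(text), []
    for _ in range(num_merges):
        pair_counts = Counter(zip(units, units[1:]))
        if not pair_counts:
            break
        best_pair, _count = pair_counts.most_common(1)[0]
        merges.append(best_pair)
        units = apply_merge(units, best_pair)
    return merges

def bpe_encode(text, merges):
    """Split into characters, then apply the merges in the order they were learned."""
    units = list(text)
    for pair in merges:
        units = apply_merge(units, pair)
    return units

merges = learn_merges("aaabaaaba", num_merges=2)
pieces = bpe_encode("aaab", merges)

print("merge rules:", merges)
print("encoded 'aaab':", pieces)
print("decoded:", "".join(pieces))
```

Decoding is just concatenation, which is why the round trip is lossless: the merges change where the boundaries fall, never the characters themselves.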

## üì¶ Load the tokenizer

We use **LangChain's Tokenizer** with **tiktoken** (encoding `gpt2`). Same BPE idea as many LLMs.

In [None]:
import numpy as np
import tiktoken
from langchain_text_splitters import Tokenizer

ENCODING_NAME = "gpt2"
HIDDEN_DIM = 768

enc = tiktoken.get_encoding(ENCODING_NAME)
lc_tokenizer = Tokenizer(
    encode=enc.encode,
    decode=enc.decode,
    tokens_per_chunk=1000,
    chunk_overlap=0,
)
vocab_size = enc.n_vocab
print(f"✅ Loaded tokenizer: {ENCODING_NAME} | Vocabulary size: {vocab_size}")

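## 1️⃣ Encode: text → token IDs

The model never sees raw text. It sees **sequences of integers** (token IDs). The length of that sequence is what the API reports as **input_tokens**; a hosted model may use a different tokenizer than `gpt2`, so its count for the same text can differ.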

In [None]:
text = "The model reads and writes in tokens, not raw characters."
encoded = lc_tokenizer.encode(text)

print("📥 Encoding: text → token IDs")
print(f"   Text: {text!r}")
print(f"   Token IDs: {encoded}")
print(f"   Token count: {len(encoded)} ( = input_tokens in API)")

## 2️⃣ What each ID represents (subwords)

Each token ID maps to a **subword** (often a word or piece of a word). Rare words get split into pieces.

In [None]:
def id_to_token_string(e, token_id):
    raw = e.decode_single_token_bytes(token_id)
    return raw.decode("utf-8", errors="replace")

tokens = [id_to_token_string(enc, i) for i in encoded]
print("🔤 Tokens (subwords):")
print(f"   {tokens}")

## 3️⃣ Vector space: token IDs → embedding vectors

### What is a vector?

A **vector** is a list of numbers, e.g. `[0.1, -0.3, 0.5, ...]`. You can think of it as a **point** in space: in 2D, two numbers give (x, y); in 3D, three numbers give (x, y, z). In NLP we use **many** numbers per token (e.g. 768), so we can't draw it on paper, but the idea is the same: each vector is one point in a **high-dimensional space**.

### What is a vector space?

A **vector space** is the set of all possible vectors of a given length. For length 768 we have **R^768**: every point is a list of 768 real numbers. Distances and angles between points are well-defined (e.g. **cosine similarity** = how aligned two vectors are; **norm** = length of a vector). Models do **linear algebra** in this space: add vectors, scale them, take dot products. Turning text into vectors is what lets math do the work.
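
As a small worked example of these operations, the sketch below uses three hand-picked 2D vectors (chosen here only for illustration) and Python's `math` module. The helper names `dot`, `norm`, and `cosine` are assumptions of this example; the same formulas apply unchanged in 768 dimensions.

```python
import math

def dot(u, v):
    """Dot product: multiply matching coordinates and add them up."""
    return sum(x * y for x, y in zip(u, v))

def norm(u):
    """Length of the vector (Euclidean norm)."""
    return math.sqrt(dot(u, u))

def cosine(u, v):
    """Cosine of the angle between u and v."""
    return dot(u, v) / (norm(u) * norm(v))

a = [3.0, 4.0]
b = [4.0, 3.0]
c = [-4.0, 3.0]  # a rotated by 90 degrees

print("norm(a)      =", norm(a))
print("cosine(a, a) =", cosine(a, a))
print("cosine(a, b) =", cosine(a, b))
print("cosine(a, c) =", cosine(a, c))
print("a + b        =", [x + y for x, y in zip(a, b)])
print("2 * a        =", [2 * x for x in a])
```

Here `a` and `b` point in nearly the same direction, so their cosine is close to 1, while `a` and `c` are perpendicular, so their cosine is 0.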

### Why does the model use vectors (embeddings)?

The model **never uses the token ID (integer) directly**. It looks up a **vector** (called an **embedding**) for each ID from a big table (the **embedding matrix**). All computation (attention, layers, predictions) happens in this vector space. Similar or related tokens often have **similar embeddings** (high cosine similarity); the model was trained so that meaning is reflected in geometry. Below we use a **demo** embedding matrix filled with random values; real models use **trained** weights.

In [None]:
rng = np.random.default_rng(42)
embedding_matrix = rng.standard_normal((vocab_size, HIDDEN_DIM)).astype(np.float32) * 0.02
input_ids = np.array([encoded])
embeddings = embedding_matrix[input_ids]
seq_len, hdim = embeddings.shape[1], embeddings.shape[2]

print("📊 Vector space (demo embedding matrix):")
print(f"   Token IDs shape:  (1, {seq_len})")
print(f"   Embeddings shape: (1, {seq_len}, {hdim})")
print(f"   Each token ID → one vector in a {hdim}-dim space.")
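
One way to see why the lookup counts as linear algebra: indexing row `i` of the embedding matrix gives the same result as multiplying a one-hot vector for `i` by the matrix. The sketch below checks this on a made-up 4 x 3 matrix in plain Python. The values and the names `toy_matrix`, `one_hot`, and `vec_mat` are illustrative assumptions, not part of the notebook.

```python
# Toy embedding matrix: vocabulary of 4 tokens, 3 dimensions each (values are made up).
toy_matrix = [
    [0.1, 0.2, 0.3],  # token ID 0
    [0.4, 0.5, 0.6],  # token ID 1
    [0.7, 0.8, 0.9],  # token ID 2
    [1.0, 1.1, 1.2],  # token ID 3
]

def one_hot(token_id, vocab_size):
    """Vector of zeros with a single 1.0 at position `token_id`."""
    return [1.0 if i == token_id else 0.0 for i in range(vocab_size)]

def vec_mat(vec, matrix):
    """Row vector times matrix: result[j] = sum over i of vec[i] * matrix[i][j]."""
    return [sum(vec[i] * matrix[i][j] for i in range(len(matrix)))
            for j in range(len(matrix[0]))]

toy_token_ids = [2, 0, 2]

by_lookup = [toy_matrix[t] for t in toy_token_ids]
by_matmul = [vec_mat(one_hot(t, len(toy_matrix)), toy_matrix) for t in toy_token_ids]

print("lookup :", by_lookup)
print("one-hot:", by_matmul)
print("same result:", by_lookup == by_matmul)
```

Direct row indexing, as in `embedding_matrix[input_ids]` above, gives the same vectors without building the one-hot vectors or doing the multiplication.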

In [None]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(8, 3))
ax.bar(range(seq_len), [np.linalg.norm(embeddings[0, i]) for i in range(seq_len)], color="#1976d2", alpha=0.8)
ax.set_xlabel("Token position")
ax.set_ylabel("Embedding norm")
ax.set_title("Norm of each token's embedding vector (demo)")
plt.tight_layout()
plt.show()

### Dimensions and the norm (length)

Each token's embedding has **768 numbers**, so we're in a **768-dimensional** space. We can't draw that, but we can still measure things: the **norm** (length) of a vector is like the length of an arrow. The bar chart above shows the norm of each token's embedding. Because our demo matrix is random, the bars come out at nearly the same height; in trained models these lengths and directions carry meaning.

### Similarity in vector space

**Cosine similarity** measures how much two vectors point in the same direction: **1** = same direction, **0** = perpendicular, **-1** = opposite. In trained models, tokens with similar meaning often have **high cosine similarity** (their vectors point the same way). The model "sees" only these vectors, not the token IDs, and uses distances and angles to reason. Below we compute cosine similarity between two token embeddings. With our random demo matrix the value carries no meaning and will typically be close to 0; in a real model you'd see related words cluster.

In [None]:
def cosine_similarity(a, b):
    a_flat = a.flatten().astype(np.float64)
    b_flat = b.flatten().astype(np.float64)
    return float(np.dot(a_flat, b_flat) / (np.linalg.norm(a_flat) * np.linalg.norm(b_flat) + 1e-9))

idx_a, idx_b = 1, 3
sim = cosine_similarity(embeddings[0, idx_a], embeddings[0, idx_b])
print(f"   Similarity (cosine): {tokens[idx_a]!r} vs {tokens[idx_b]!r} → {sim:.4f}")
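
Why does a random matrix give a cosine near zero rather than anything meaningful? Independent random vectors in a high-dimensional space tend to be nearly perpendicular. The sketch below illustrates this with Python's `random` module; the helper names, the dimensions tried, the seed, and the number of trials are arbitrary choices made for this example.

```python
import math
import random

def cosine(u, v):
    """Cosine of the angle between two equal-length lists of floats."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

def mean_abs_cosine(dim, trials, py_rng):
    """Average |cosine| between pairs of independent Gaussian random vectors."""
    total = 0.0
    for _ in range(trials):
        u = [py_rng.gauss(0.0, 1.0) for _ in range(dim)]
        v = [py_rng.gauss(0.0, 1.0) for _ in range(dim)]
        total += abs(cosine(u, v))
    return total / trials

py_rng = random.Random(0)  # fixed seed so repeated runs give the same numbers
results = {dim: mean_abs_cosine(dim, trials=200, py_rng=py_rng) for dim in (2, 16, 768)}

for dim, value in results.items():
    print(f"dim={dim:4d}  mean |cosine| = {value:.3f}")
```

The average shrinks as the dimension grows, so at 768 dimensions any clearly non-zero similarity in a trained model reflects learned structure rather than chance.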

## 4️⃣ Decode: token IDs → text

Round-trip: we decode the same IDs back to a string.

In [None]:
decoded = lc_tokenizer.decode(encoded)
print("📤 Decoding: token IDs → text")
print(f"   Decoded: {decoded!r}")
print(f"   Round-trip OK: {decoded == text}")

## 5️⃣ Subword example: one word → several tokens

Rare words get **split** into subword pieces. The model operates on these pieces.

In [None]:
rare = "tokenizer"
enc_rare = lc_tokenizer.encode(rare)
tokens_rare = [id_to_token_string(enc, i) for i in enc_rare]
print(f"   Word: {rare!r} → Token IDs: {enc_rare} → Tokens: {tokens_rare}")

## ✏️ Exercises

*Use only what you learned in this phase (encoding, decoding, subwords, vector space).*

1. **Token count vs word count**  
   Encode the sentence *"The tokenizer splits text into subwords."* with the same tokenizer you used above. Why might the number of token IDs be different from the number of words? Give a short explanation.

2. **Rare word and subwords**  
   Pick a rare or technical word (e.g. *"tokenizer"*, *"BPE"*, or *"embedding"*). Encode it and list the subword pieces. In one or two sentences, explain why the tokenizer might split it that way (e.g. why it might be one token vs several).

3. **Similar meaning, different words**  
   Consider two sentences: *"The cat sat on the mat."* and *"A feline rested on the rug."* They have similar meaning but different words. If you used **Bag of Words** (word counts only), would the two vectors be similar? What if you used **BPE token IDs**? Explain briefly why BoW and token IDs behave differently here.

## 📚 Additional reading

**YouTube (verified)**  
- [The tokenization pipeline](https://www.youtube.com/watch?v=Yffk5aydLzg) (Hugging Face course): what happens when you call a tokenizer.  
- [Building a new tokenizer](https://www.youtube.com/watch?v=MR8tZm5ViWU) (Hugging Face): train and use tokenizers.

**Blogs (popular)**  
- [Summary of the tokenizers](https://huggingface.co/docs/transformers/tokenizer_summary) (Hugging Face): BPE, WordPiece, SentencePiece.  
- [tiktoken](https://github.com/openai/tiktoken) (OpenAI): fast BPE tokenizer used in this notebook.

## 🔗 How this connects to Phase 1 (LLM API)

<div style="background: #fff3e0; padding: 14px; border-radius: 8px; border-left: 4px solid #ff9800;">
<strong>In the API:</strong> Your prompt (text) → encoded to token IDs → each ID mapped to an embedding vector → model runs in that vector space → predicts next token IDs → decoded to text. <code>input_tokens</code> / <code>output_tokens</code> = lengths of these ID sequences.
</div>
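
The token accounting in that pipeline can be sketched with stand-ins. Everything below is hypothetical: `toy_encode`, `toy_decode`, and `toy_model` are made-up helpers, the vocabulary is a hand-written word list rather than BPE, the embedding step is skipped, and the "model" ignores its input and returns fixed IDs. The only point is to show where the two counts come from.

```python
# Toy walk-through of the pipeline; every component is a stand-in, not a real LLM or API.
toy_vocab = ["<unk>", "hello", "world", "how", "are", "you", "fine", "thanks"]
toy_token_to_id = {tok: i for i, tok in enumerate(toy_vocab)}

def toy_encode(text):
    """Whitespace 'tokenizer': unknown words map to ID 0 (<unk>)."""
    return [toy_token_to_id.get(w, 0) for w in text.lower().split()]

def toy_decode(ids):
    """Map IDs back to their strings and join them with spaces."""
    return " ".join(toy_vocab[i] for i in ids)

def toy_model(prompt_ids):
    """Stand-in for the LLM: ignores its input and always 'generates' the same two tokens."""
    return [toy_token_to_id["fine"], toy_token_to_id["thanks"]]

prompt = "hello how are you"
prompt_ids = toy_encode(prompt)
reply_ids = toy_model(prompt_ids)
usage = {"input_tokens": len(prompt_ids), "output_tokens": len(reply_ids)}

print("prompt ids:", prompt_ids)
print("reply     :", toy_decode(reply_ids))
print("usage     :", usage)
```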

In [None]:
print("✅ Phase 0 complete. Next: Phase 1 (Groq API).")