<img src="https://toppng.com/uploads/preview/linkedin-logo-png-photo-116602552293wtc4qogql.png" width="20" height="20" /> [Bharath Hemachandran](https://www.linkedin.com/in/bharath-hemachandran/)

# 📝 Phase 0: Encoding, Decoding & Vector Space

**Before using any LLM**, we learn how text becomes **token IDs** and **vectors**. The Groq API performs the same encode → vectors → decode steps internally; here you see them explicitly.

<div style="background: #e8f5e9; padding: 14px; border-radius: 8px; border-left: 4px solid #4caf50;">
<strong>🎯 What you'll do:</strong> Encode text → token IDs → show subwords → map IDs to vectors (with a random demo matrix) → decode back to text. No API key needed.
</div>

### 📋 Notebook objective (table of contents)

This notebook covers:
- **Setup**: Install LangChain tokenizer, tiktoken, NumPy
- **Load the tokenizer**: LangChain + tiktoken (gpt2 encoding)
- **1. Encode**: Text → token IDs (what <code>input_tokens</code> means)
- **2. Subwords**: What each token ID represents
- **3. Vector space**: Token IDs → embedding vectors; demo matrix + cosine similarity
- **4. Decode**: Token IDs → text (round-trip)
- **5. Subword example**: Rare word split into tokens
- **Connection to Phase 1**: How this maps to the LLM API
- **Additional reading**: Videos and blogs


## 🔧 Setup (run once)

Install **langchain-text-splitters**, **tiktoken**, **numpy**, and **matplotlib** (used for the plot in section 3). On Colab, run this cell first.

In [None]:
!pip install -q langchain-text-splitters tiktoken numpy matplotlib

## 📦 Load the tokenizer

We use **LangChain's Tokenizer** with **tiktoken** (encoding `gpt2`). This encoding is a byte-pair encoding (BPE), the same family of tokenizer that many LLMs use.

In [None]:
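import numpy as np
import tiktoken
from langchain_text_splitters import Tokenizer

ENCODING_NAME = "gpt2"  # tiktoken encoding to load
HIDDEN_DIM = 768        # width of each demo embedding vector (GPT-2 small uses 768)

# tiktoken provides the actual BPE encode/decode functions.
enc = tiktoken.get_encoding(ENCODING_NAME)

# LangChain's Tokenizer is a small container that bundles those functions
# with chunking settings; we only use its encode/decode here.
lc_tokenizer = Tokenizer(
    encode=enc.encode,
    decode=enc.decode,
    tokens_per_chunk=1000,
    chunk_overlap=0,
)
vocab_size = enc.n_vocab
print(f"✅ Loaded tokenizer: {ENCODING_NAME} | Vocabulary size: {vocab_size}")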
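
The BPE idea mentioned above can be sketched in a few lines of plain Python. This is a toy illustration, not tiktoken's implementation: the corpus, the helper names `most_frequent_pair` and `merge_pair`, and the choice of three merges are all assumptions made for the example. Real GPT-2 BPE works on bytes and applies a pre-tokenization step first.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words; return the most common one."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(words, pair):
    """Rewrite every word so that each occurrence of `pair` becomes one symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Hypothetical toy corpus: word -> frequency. Each word starts as single characters.
toy_corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
toy_words = {tuple(word): freq for word, freq in toy_corpus.items()}

toy_merges = []
for _ in range(3):  # learn three merge rules
    pair = most_frequent_pair(toy_words)
    if pair is None:
        break
    toy_merges.append(pair)
    toy_words = merge_pair(toy_words, pair)

print("Learned merges:", toy_merges)
print("Segmented words:", list(toy_words))
```

On this corpus the frequent suffix `est` ends up as a single symbol, which is the same effect that lets a real BPE vocabulary keep common fragments whole while splitting rare words.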

## 1️⃣ Encode: text → token IDs

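The model never sees raw text. It sees **sequences of integers** (token IDs). The **input_tokens** field in an API response counts these IDs for your prompt; the exact number depends on the model's own tokenizer, so it can differ from the `gpt2` count shown here.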

In [None]:
text = "The model reads and writes in tokens, not raw characters."
encoded = lc_tokenizer.encode(text)

print("📥 Encoding: text → token IDs")
print(f"   Text: {text!r}")
print(f"   Token IDs: {encoded}")
print(f"   Token count: {len(encoded)} (the quantity an API reports as input_tokens)")
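
To make "text becomes integers" concrete without any library, here is a minimal sketch of the simplest possible tokenizer: one token per UTF-8 byte, so the vocabulary has only 256 entries. The function names `encode_bytes` and `decode_bytes` are made up for this example. BPE encodings such as `gpt2` start from bytes like these and merge frequent sequences into larger tokens, which is why they usually need far fewer IDs for the same text.

```python
def encode_bytes(text: str) -> list[int]:
    """Byte-level encoding: each UTF-8 byte becomes one integer ID (0-255)."""
    return list(text.encode("utf-8"))

def decode_bytes(ids: list[int]) -> str:
    """Inverse of encode_bytes: turn the integer IDs back into text."""
    return bytes(ids).decode("utf-8")

byte_sample = "Tokens, not characters: café"
byte_ids = encode_bytes(byte_sample)

print(f"First IDs: {byte_ids[:6]}")
print(f"Characters: {len(byte_sample)} | Byte tokens: {len(byte_ids)}")
print(f"Round-trip OK: {decode_bytes(byte_ids) == byte_sample}")
```

The byte count is one higher than the character count because `é` takes two bytes in UTF-8.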

## 2️⃣ What each ID represents (subwords)

Each token ID maps to a **subword** (often a word or piece of a word). Rare words get split into pieces.

In [None]:
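def id_to_token_string(e, token_id):
    """Return a printable string for one token ID."""
    # Tokens are byte sequences; a single token may hold only part of a
    # multi-byte character, so invalid UTF-8 is replaced instead of raising.
    raw = e.decode_single_token_bytes(token_id)
    return raw.decode("utf-8", errors="replace")

tokens = [id_to_token_string(enc, i) for i in encoded]
print("🔤 Tokens (subwords):")
print(f"   {tokens}")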

## 3️⃣ Vector space: token IDs → embedding vectors

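The model **never does arithmetic on the integer ID itself**. The ID is only an index: the model looks up a **vector** (embedding) for each ID, and all computation happens in this vector space. Below we use a **demo** embedding matrix filled with random numbers, with one row per vocabulary entry and 768 columns; real models use trained weights.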

In [None]:
# Random stand-in for a trained embedding table: one row per vocabulary entry.
rng = np.random.default_rng(42)
embedding_matrix = rng.standard_normal((vocab_size, HIDDEN_DIM)).astype(np.float32) * 0.02

# Add a batch dimension: shape (1, seq_len).
input_ids = np.array([encoded])

# Integer-array indexing picks one row per token ID: shape (1, seq_len, HIDDEN_DIM).
embeddings = embedding_matrix[input_ids]
seq_len, hdim = embeddings.shape[1], embeddings.shape[2]

print("📊 Vector space (demo embedding matrix):")
print(f"   Token IDs shape:  (1, {seq_len})")
print(f"   Embeddings shape: (1, {seq_len}, {hdim})")
print(f"   Each token ID → one vector in a {hdim}-dim space.")
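
One way to see why the ID is "only an index" is to compare the lookup with its textbook equivalent, multiplying a one-hot vector by the embedding matrix. The sketch below uses a deliberately small, hypothetical vocabulary of 6 entries and 4 dimensions; the `toy_` names are not part of the notebook.

```python
import numpy as np

toy_vocab, toy_dim = 6, 4  # hypothetical sizes, small enough to print
toy_rng = np.random.default_rng(0)
toy_matrix = toy_rng.standard_normal((toy_vocab, toy_dim))

toy_ids = np.array([2, 5, 2])  # a three-token "sentence"; ID 2 appears twice

# Direct lookup: pick rows 2, 5, 2 of the matrix.
looked_up = toy_matrix[toy_ids]          # shape (3, 4)

# Equivalent formulation: one-hot rows times the matrix.
one_hot = np.eye(toy_vocab)[toy_ids]     # shape (3, 6)
via_matmul = one_hot @ toy_matrix        # shape (3, 4)

print("Same result:", np.allclose(looked_up, via_matmul))
print("Repeated ID gives identical vectors:", np.array_equal(looked_up[0], looked_up[2]))
```

Indexing is simply the cheap way to compute that product, and the same ID always yields the same vector regardless of its position in the sequence.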

In [None]:
import matplotlib.pyplot as plt

# One bar per token position; the height is the L2 norm of that token's vector.
norms = [np.linalg.norm(embeddings[0, i]) for i in range(seq_len)]

fig, ax = plt.subplots(figsize=(8, 3))
ax.bar(range(seq_len), norms, color="#1976d2", alpha=0.8)
ax.set_xlabel("Token position")
ax.set_ylabel("Embedding norm")
# Plain-text title: Matplotlib's default font has no emoji glyphs.
ax.set_title("Norm of each token's embedding vector (demo)")
plt.tight_layout()
plt.show()
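
A short worked estimate, offered as a sketch, explains why the bars in this plot all have nearly the same height. Each entry of the demo matrix is drawn as $x_i \sim \mathcal{N}(0, \sigma^2)$ with $\sigma = 0.02$, and each vector has $d = 768$ entries, so

$$
\mathbb{E}\,\lVert x \rVert^2 = \sum_{i=1}^{d} \mathbb{E}\left[x_i^2\right] = d\,\sigma^2
\quad\Longrightarrow\quad
\lVert x \rVert \approx \sigma\sqrt{d} = 0.02 \times \sqrt{768} \approx 0.554 .
$$

The relative spread around that value is roughly $1/\sqrt{2d} \approx 2.6\%$, so the bars differ only slightly. This is a property of the random demo matrix; trained embeddings are under no such constraint, and their norms can vary noticeably between tokens.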

### Similarity in vector space

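In a trained model, related tokens often have **closer** embeddings (higher cosine similarity). Our demo matrix is random, so the value printed below will be close to 0 and carries no meaning; the cell only shows how the comparison is computed. The model "sees" these vectors, not the integers.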

In [None]:
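def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    a_flat = a.flatten().astype(np.float64)
    b_flat = b.flatten().astype(np.float64)
    # The small constant guards against division by zero for all-zero vectors.
    return float(np.dot(a_flat, b_flat) / (np.linalg.norm(a_flat) * np.linalg.norm(b_flat) + 1e-9))

# Compare the embeddings at two token positions of the encoded sentence.
idx_a, idx_b = 1, 3
sim = cosine_similarity(embeddings[0, idx_a], embeddings[0, idx_b])
print(f"   Similarity (cosine): {tokens[idx_a]!r} vs {tokens[idx_b]!r} → {sim:.4f}")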
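
To show what "closer" looks like numerically, the sketch below compares two independent random vectors with a vector and a slightly perturbed copy of itself. The perturbed copy is only a stand-in for a "related" token, an assumption made for illustration; in a trained model such closeness comes from learning, not from added noise.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

demo_rng = np.random.default_rng(0)
demo_dim = 768

base = demo_rng.standard_normal(demo_dim)
unrelated = demo_rng.standard_normal(demo_dim)             # independent draw
related = base + 0.1 * demo_rng.standard_normal(demo_dim)  # base plus small noise

sim_unrelated = cosine(base, unrelated)
sim_related = cosine(base, related)

print(f"Unrelated vectors: {sim_unrelated:.4f}")
print(f"Related vectors:   {sim_related:.4f}")
```

In high dimensions, independent random vectors are almost orthogonal, which is why the demo matrix in this notebook cannot show meaningful similarity between tokens.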

## 4️⃣ Decode: token IDs → text

Round-trip: we decode the same IDs back to a string.

In [None]:
decoded = lc_tokenizer.decode(encoded)
print("📤 Decoding: token IDs → text")
print(f"   Decoded: {decoded!r}")
print(f"   Round-trip OK: {decoded == text}")

## 5️⃣ Subword example: one word → several tokens

Rare words get **split** into subword pieces. The model operates on these pieces.

In [None]:
rare = "tokenizer"
enc_rare = lc_tokenizer.encode(rare)
tokens_rare = [id_to_token_string(enc, i) for i in enc_rare]
print(f"   Word: {rare!r} → Token IDs: {enc_rare} → Tokens: {tokens_rare}")

## 🔗 How this connects to Phase 1 (LLM API)

<div style="background: #fff3e0; padding: 14px; border-radius: 8px; border-left: 4px solid #ff9800;">
<strong>In the API:</strong> Your prompt (text) → encoded to token IDs → each ID mapped to an embedding vector → model runs in that vector space → predicts next token IDs → decoded to text. <code>input_tokens</code> / <code>output_tokens</code> = lengths of these ID sequences.
</div>

In [None]:
print("✅ Phase 0 complete. Next: Phase 1 (Groq API).")

## 📚 Additional reading

**YouTube (verified)**
- [The tokenization pipeline](https://www.youtube.com/watch?v=Yffk5aydLzg) (Hugging Face course): what happens when you call a tokenizer.
- [Building a new tokenizer](https://www.youtube.com/watch?v=MR8tZm5ViWU) (Hugging Face): train and use tokenizers.

**Blogs (popular)**
- [Summary of the tokenizers](https://huggingface.co/docs/transformers/tokenizer_summary) (Hugging Face): BPE, WordPiece, SentencePiece.
- [tiktoken](https://github.com/openai/tiktoken) (OpenAI): fast BPE tokenizer used in this notebook.