## Types of Tokenizers

Tokenization is the process of breaking text into smaller units called **tokens** (words, subwords, or characters).  
The choice of tokenization strategy affects vocabulary size, sequence length, and how well a model handles rare or unseen words.

---

### 1️⃣ **Word Tokenizer**

**Definition:**  
Splits text into individual **words** based on spaces or punctuation.

**Example:**

Input: "I love lightning" <br>
Output: ["I", "love", "lightning"]

**Drawback:**  
Even similar words like `"light"` and `"lightning"` are treated as completely **different tokens**,  
leading to:
- A **larger vocabulary**
- **No semantic link** between related words

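**📉 Problem:**  
Word tokenizers cannot generalize well to **unseen or rare words**: a word that is missing from the vocabulary has no token of its own, so it is typically replaced by a generic "unknown" token.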
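To make the drawback concrete, here is a minimal sketch of a word tokenizer built on a regular expression. The helper names (`word_tokenize`, `encode`), the `<unk>` placeholder, and the tiny training text are assumptions made for illustration, not part of any particular library.

```python
import re

def word_tokenize(text):
    # A word is a run of letters/digits; each punctuation mark is its own token.
    return re.findall(r"\w+|[^\w\s]", text)

# Build a toy vocabulary from a tiny "training" text (hypothetical data).
train_tokens = word_tokenize("I love light")
vocab = {tok: idx for idx, tok in enumerate(["<unk>"] + sorted(set(train_tokens)))}

def encode(text):
    # Any word missing from the vocabulary collapses to the <unk> ID.
    return [vocab.get(tok, vocab["<unk>"]) for tok in word_tokenize(text)]

print(word_tokenize("I love lightning"))  # ['I', 'love', 'lightning']
print(encode("I love lightning"))         # [1, 3, 0]
```

Even though `"light"` is in the vocabulary, `"lightning"` maps to ID `0` (`<unk>`), so the model learns nothing about the relationship between the two words.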

---

### 2️⃣ **Character Tokenizer**

**Definition:**  
Breaks text into **individual characters**, including spaces and punctuation.

**Example:**

Input: "How are you"<br>
Output: ["H", "o", "w", " ", "a", "r", "e", " ", "y", "o", "u"]

**Advantage:**  
Very small vocabulary (only letters, digits, punctuation, etc.)

**Drawback:**  
Individual characters carry little **semantic meaning** on their own: the model only sees letters, not words or their relationships, and it has to process much longer sequences.  

It's like teaching a model with *no concept of words.*
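A character tokenizer needs almost no code. The sketch below (the `char_tokenize` name is just an illustrative choice) shows the trade-off: the vocabulary is tiny, but even a short sentence becomes a long sequence.

```python
def char_tokenize(text):
    # Every character, including spaces and punctuation, becomes a token.
    return list(text)

text = "How are you"
tokens = char_tokenize(text)
print(tokens)

# The vocabulary is just the set of distinct characters seen in the text.
vocab = {ch: idx for idx, ch in enumerate(sorted(set(text)))}
ids = [vocab[ch] for ch in tokens]

print(len(tokens), len(vocab))  # 11 9
```

Three words turn into 11 tokens drawn from a vocabulary of only 9 symbols.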

---

### 3️⃣ **Subword Tokenizer (Most Common)**

**Definition:**  
Breaks words into **smaller reusable units (subwords)**, which often line up with prefixes, suffixes, and roots.

**Example:**

Input: "lightning"<br>
Output: ["light", "##ning"]<br>

Input: "unhappiness"<br>
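Output: ["un", "##happi", "##ness"]

(The `##` prefix is the WordPiece convention for a piece that continues a word. The exact splits depend on the trained vocabulary.)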
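Splits like these can be reproduced with a greedy longest-match-first lookup, which is how WordPiece-style tokenizers segment a word once a vocabulary exists. The sketch below is a simplified illustration: the function name, the `[UNK]` fallback, and the hand-written `toy_vocab` are assumptions, and continuation pieces are written with the `##` prefix throughout.

```python
def greedy_subword_tokenize(word, vocab, unk="[UNK]"):
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest candidate first and shrink until one is in the vocabulary.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark pieces that continue a word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no known piece fits, so give up on the whole word
        tokens.append(piece)
        start = end
    return tokens

# Hand-written toy vocabulary (hypothetical, not a trained one).
toy_vocab = {"light", "##ning", "un", "##happi", "##ness"}

print(greedy_subword_tokenize("lightning", toy_vocab))    # ['light', '##ning']
print(greedy_subword_tokenize("unhappiness", toy_vocab))  # ['un', '##happi', '##ness']
```

Neither `"lightning"` nor `"unhappiness"` is in the vocabulary as a whole word, yet both are represented entirely by known pieces.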

**How it works:**  
Uses algorithms like **Byte Pair Encoding (BPE)** or **WordPiece** to:
- Merge frequent character pairs into subwords  
- Keep common words as single tokens  
- Split rare words into smaller known parts

**Algorithm summary:**
> “The most frequent byte (or character) pairs should be combined into a single unit (token).”
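The merge rule can be sketched as a short training loop. This is a toy, character-level version of BPE, assuming a hypothetical word-frequency table (`toy_corpus`) and illustrative helper names. Production implementations such as the one behind GPT-2 operate on bytes and add pre-tokenization rules that are omitted here.

```python
from collections import Counter

def get_pair_counts(words):
    # Count adjacent symbol pairs, weighted by how often each word occurs.
    counts = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts

def merge_pair(words, pair):
    # Replace every occurrence of `pair` with a single merged symbol.
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(word_freqs, num_merges):
    # Start from single characters and repeatedly merge the most frequent pair.
    words = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        counts = get_pair_counts(words)
        if not counts:
            break
        best = max(counts, key=counts.get)
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words

toy_corpus = {"light": 5, "lightning": 2, "night": 3}  # hypothetical counts
merges, segmented = train_bpe(toy_corpus, num_merges=4)

print(merges)
for symbols, freq in segmented.items():
    print(freq, symbols)
```

After four merges the frequent word `"light"` has become a single token, while the rarer `"lightning"` is represented as `light` followed by smaller pieces.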

**Benefits:**

- Handles unseen words gracefully  
- Keeps vocabulary size manageable  
- Maintains meaning through reusable subword pieces 

In [1]:
# tiktoken is a Python library that provides a fast implementation of BPE
import tiktoken

In [2]:
# Load the BPE encoding used by GPT-2
tokenizer = tiktoken.get_encoding('gpt2')

In [3]:
text = "Hey How are you Buddy!, What are you doing right now, its raining outside."
tokens = tokenizer.encode(text)
print(tokens)

[10814, 1374, 389, 345, 36896, 28265, 1867, 389, 345, 1804, 826, 783, 11, 663, 43079, 2354, 13]


In [4]:
# Decode the token IDs back into the original text
decoded_tokens = tokenizer.decode(tokens)
print(decoded_tokens)

Hey How are you Buddy!, What are you doing right now, its raining outside.
