# 🚀 Tokenization Playground: Explore Different Tokenizers in Python
### 📌 Understand how text is tokenized, converted into token IDs, and decoded back
This notebook lets you experiment with **different tokenization methods**, compare them, and understand how LLMs process text inputs.

In [ ]:
# Install required libraries if not installed
!pip install transformers tokenizers sentencepiece tiktoken

In [ ]:
from transformers import AutoTokenizer
import tiktoken
import sentencepiece as spm
import os
import json
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace
import warnings
warnings.filterwarnings('ignore')

## 🔹 Define Sample Input Text
Let's define the text we will tokenize across different tokenizers.

In [ ]:
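# Define sample text and save it; the SentencePiece and BPE trainers below read sample_text.txt
sample_text = "Tokenization is an essential process in NLP and LLMs!"
with open("sample_text.txt", "w", encoding="utf-8") as f:
    f.write(sample_text + "\n")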

## 🔹 OpenAI GPT-4 Tokenizer (Byte-Level BPE)
We use `tiktoken`, OpenAI's tokenizer library for GPT models. For `gpt-4` it loads the `cl100k_base` encoding.

In [ ]:
# Load GPT-4 tokenizer
enc = tiktoken.encoding_for_model("gpt-4")

# Tokenize text
tokens = enc.encode(sample_text)
decoded_text = enc.decode(tokens)

print("🔹 Token IDs:", tokens)
print("🔹 Decoded Text:", decoded_text)
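
"Byte-level" means the tokenizer starts from the UTF-8 bytes of the text rather than from characters, so every string can be encoded without an unknown token. The sketch below uses only the standard library to show that starting point; `to_byte_symbols` and `from_byte_symbols` are hypothetical helper names for illustration, not part of `tiktoken`.

```python
def to_byte_symbols(text: str) -> list[int]:
    """Return the UTF-8 byte values a byte-level tokenizer starts from."""
    return list(text.encode("utf-8"))


def from_byte_symbols(byte_values: list[int]) -> str:
    """Invert to_byte_symbols: rebuild the original string from its bytes."""
    return bytes(byte_values).decode("utf-8")


ascii_bytes = to_byte_symbols("NLP")   # one byte per ASCII character
emoji_bytes = to_byte_symbols("🚀")    # a single emoji spans four bytes

print(ascii_bytes)  # [78, 76, 80]
print(emoji_bytes)  # [240, 159, 154, 128]
print(from_byte_symbols(emoji_bytes))
```

Because the base alphabet is just the 256 possible byte values, merges learned on top of it can cover emoji and any script without a dedicated vocabulary entry per character.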

## 🔹 BERT Tokenizer (WordPiece)
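BERT models use **WordPiece Tokenization**. The vocabulary is learned by adding the subword merges that most improve the likelihood of the training data, and at encoding time each word is split greedily into the longest matching vocabulary pieces (continuation pieces are prefixed with `##`).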

In [ ]:
# Load BERT tokenizer (WordPiece)
bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Tokenize text (returns plain Python lists, so PyTorch is not required)
bert_tokens = bert_tokenizer(sample_text)

print("🔹 Token IDs:", bert_tokens["input_ids"])
print("🔹 Tokens:", bert_tokenizer.convert_ids_to_tokens(bert_tokens["input_ids"]))
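
To make the encoding step concrete, here is a minimal sketch of greedy longest-match-first segmentation for a single word. The function name `wordpiece_tokenize` and the tiny `toy_vocab` are made up for illustration; this is a simplified stand-in, not the code `transformers` runs internally.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into the longest vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is found in the vocab.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark word-internal pieces
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word becomes unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


toy_vocab = {"token", "##ization", "##ize", "un", "##related"}  # hypothetical vocab

known = wordpiece_tokenize("tokenization", toy_vocab)
unknown = wordpiece_tokenize("xyz", toy_vocab)
print(known)    # ['token', '##ization']
print(unknown)  # ['[UNK]']
```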

## 🔹 SentencePiece Tokenizer (Used in T5, ALBERT)
SentencePiece is the tokenizer library behind **T5, XLNet, and ALBERT**. It trains directly on raw text and supports both BPE and Unigram LM models.

In [ ]:
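# Train a SentencePiece model (Unigram is the default model_type)
# hard_vocab_limit=False lets training finish when the corpus is too small to fill vocab_size
spm.SentencePieceTrainer.train(input="sample_text.txt", model_prefix="m", vocab_size=5000, hard_vocab_limit=False)
sp = spm.SentencePieceProcessor(model_file='m.model')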

# Tokenize text
tokens = sp.encode(sample_text, out_type=str)
ids = sp.encode(sample_text)

print("🔹 Tokens:", tokens)
print("🔹 Token IDs:", ids)
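
A Unigram model assigns each piece a probability and encodes text by picking the segmentation with the highest total probability. The sketch below shows that search (a Viterbi pass over string positions) in plain Python. The function name `unigram_segment` and the probabilities in `toy_log_probs` are invented for illustration and do not come from a trained model.

```python
import math


def unigram_segment(text, log_probs):
    """Return the highest-probability split of text into known pieces."""
    n = len(text)
    best_score = [-math.inf] * (n + 1)  # best log-probability ending at each position
    best_start = [0] * (n + 1)          # where the last piece of that best split begins
    best_score[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in log_probs and best_score[start] > -math.inf:
                score = best_score[start] + log_probs[piece]
                if score > best_score[end]:
                    best_score[end] = score
                    best_start[end] = start
    if best_score[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Walk the back-pointers from the end of the text to recover the pieces.
    pieces = []
    end = n
    while end > 0:
        start = best_start[end]
        pieces.append(text[start:end])
        end = start
    return pieces[::-1]


# Hypothetical piece probabilities; "▁" marks a word boundary as in SentencePiece output.
toy_log_probs = {
    "▁token": math.log(0.10),
    "ization": math.log(0.08),
    "▁": math.log(0.05),
    "token": math.log(0.04),
    "ation": math.log(0.03),
    "iz": math.log(0.02),
}

segmentation = unigram_segment("▁tokenization", toy_log_probs)
print(segmentation)  # ['▁token', 'ization']
```

With these toy numbers, `▁token` + `ization` scores 0.10 × 0.08 = 0.008, which beats both `▁` + `token` + `ization` (0.00016) and `▁token` + `iz` + `ation` (0.00006).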

## 🔹 Custom Byte Pair Encoding (BPE) Tokenization
Train a **custom BPE tokenizer** using Hugging Face's `tokenizers` library.

In [ ]:
# Create a custom BPE tokenizer; unk_token maps unseen characters to the [UNK] special token
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"])

# Train tokenizer on a sample file
tokenizer.train(["sample_text.txt"], trainer)

# Tokenize text
output = tokenizer.encode(sample_text)
print("🔹 Tokens:", output.tokens)
print("🔹 Token IDs:", output.ids)
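
To see what the trainer does at its core, here is a minimal pure-Python sketch of BPE merge learning: count adjacent symbol pairs, fuse the most frequent pair, and repeat. The function name `learn_bpe_merges` and the toy word counts are assumptions for illustration. The real `tokenizers` trainer adds pre-tokenization, special tokens, and many optimizations.

```python
from collections import Counter


def learn_bpe_merges(word_counts, num_merges):
    """Learn merge rules by repeatedly fusing the most frequent adjacent pair."""
    # Represent each word as a tuple of symbols, starting from single characters.
    corpus = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, count in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break  # every word is already a single symbol
        best_pair = max(pair_counts, key=pair_counts.get)
        merges.append(best_pair)
        # Rewrite every word with the chosen pair fused into one symbol.
        new_corpus = {}
        for symbols, count in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best_pair:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_corpus[key] = new_corpus.get(key, 0) + count
        corpus = new_corpus
    return merges, corpus


toy_counts = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}  # hypothetical corpus
merges, final_corpus = learn_bpe_merges(toy_counts, num_merges=3)
print(merges)  # [('u', 'g'), ('u', 'n'), ('h', 'ug')]
print(final_corpus)
```

The first merge is `('u', 'g')` because that pair occurs 20 times across `hug`, `pug`, and `hugs`, more than any other pair in the toy corpus.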

## 🎯 Summary: Key Differences Between Tokenization Methods
| **Tokenizer Type** | **Used in Models** | **How It Works** |
|----------------|------------------|----------------|
| **Byte Pair Encoding (BPE)** | Original GPT, LLaMA-2 (via SentencePiece) | Iteratively merges the most frequent adjacent symbol pair |
| **WordPiece** | BERT, DistilBERT | Learns merges that most improve training-data likelihood; encodes by greedy longest-match-first lookup |
| **Unigram LM** | T5, XLNet, ALBERT | Starts from a large candidate vocabulary and prunes the pieces that contribute least to corpus likelihood |
| **SentencePiece** | T5, MarianMT | Library that trains BPE or Unigram on raw text, treating whitespace as an ordinary symbol (`▁`) |
| **Byte-Level BPE** | GPT-2, GPT-3, GPT-4, RoBERTa, LLaMA-3 | BPE over UTF-8 bytes, so no input is out of vocabulary |
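
The "no whitespace dependency" idea in the SentencePiece row can be sketched in a few lines: spaces are rewritten as the visible symbol `▁` before segmentation, so joining the pieces and reversing the substitution restores the original text exactly. The helper names below are hypothetical, and the leading `▁` mimics SentencePiece's default dummy prefix.

```python
WS = "▁"  # U+2581, the symbol SentencePiece uses to mark whitespace


def escape_whitespace(text):
    """Prefix the text with a boundary marker and turn spaces into that marker."""
    return WS + text.replace(" ", WS)


def pieces_to_text(pieces):
    """Rebuild the original text from pieces by undoing the whitespace escape."""
    text = "".join(pieces).replace(WS, " ")
    return text[1:] if text.startswith(" ") else text


escaped = escape_whitespace("Hello world")
restored = pieces_to_text(["▁Hello", "▁wor", "ld"])
print(escaped)   # ▁Hello▁world
print(restored)  # Hello world
```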

🚀 **Now, try tokenizing different sentences to compare the outputs!**