# Description
In this project, we compare several tokenizers and measure how efficiently each one tokenizes Persian (Farsi) text.

In [1]:
import tiktoken
from transformers import BertTokenizer, AutoTokenizer

First, we run the tokenizer on English text to establish a baseline that we can later compare with the Farsi results.

In [2]:
import tiktoken

text = """The evolution of language models has transformed how humans..."""

# Load GPT‑4 / GPT‑3.5 tokenizer
gpt_enc = tiktoken.get_encoding("cl100k_base")

# Encode text → list of integer token IDs
gpt_tokens = gpt_enc.encode(text)

# Show how many tokens
print("GPT tokens:", len(gpt_tokens))
print("Token IDs:", gpt_tokens)

# Decode each token ID back to its actual text segment
decoded_tokens = [gpt_enc.decode([t]) for t in gpt_tokens]
print("\nTokens → Text pieces:")
for i, (tok_id, tok_str) in enumerate(zip(gpt_tokens, decoded_tokens), 1):
    print(f"{i:>2}. ID {tok_id:>6} | '{tok_str}'")


GPT tokens: 10
Token IDs: [791, 15740, 315, 4221, 4211, 706, 24411, 1268, 12966, 1131]

Tokens → Text pieces:
 1. ID    791 | 'The'
 2. ID  15740 | ' evolution'
 3. ID    315 | ' of'
 4. ID   4221 | ' language'
 5. ID   4211 | ' models'
 6. ID    706 | ' has'
 7. ID  24411 | ' transformed'
 8. ID   1268 | ' how'
 9. ID  12966 | ' humans'
10. ID   1131 | '...'
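
One way to turn the count above into a comparable efficiency number is tokens per word. The sketch below is only an illustration: the helper name `tokens_per_word` is hypothetical, and splitting on whitespace is a rough stand-in for real word segmentation. It reuses the token count printed above instead of calling the tokenizer again.

```python
def tokens_per_word(num_tokens: int, text: str) -> float:
    """Average number of tokens per whitespace-separated word.

    Hypothetical helper for illustration; whitespace splitting is only
    an approximation of word boundaries.
    """
    words = text.split()
    if not words:
        raise ValueError("text contains no words")
    return num_tokens / len(words)


text = "The evolution of language models has transformed how humans..."

# 10 is the token count printed in the output above.
ratio = tokens_per_word(10, text)
print(f"{ratio:.2f} tokens per word")  # prints "1.11 tokens per word"
```

A value close to 1 means most words map to a single token. Applying the same ratio to a Farsi sentence of similar meaning would give a like-for-like comparison.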


- `cl100k_base` is the Byte-Pair Encoding (BPE) tokenizer used by OpenAI's GPT-4 and GPT-3.5-Turbo models, as well as the `text-embedding-ada-002` and `text-embedding-3` embedding models. Newer models such as GPT-4o use a different encoding (`o200k_base`).
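
Since this BPE tokenizer builds its tokens from UTF-8 bytes, it can help to see how many bytes each script needs before any merges are applied. The following is a minimal sketch using only the standard library. The helper `utf8_bytes_per_char` and the sample words are assumptions chosen for illustration, and the result says nothing about how many tokens the tokenizer finally produces, since that depends on the merges it learned.

```python
def utf8_bytes_per_char(text: str) -> float:
    """Average UTF-8 bytes per non-whitespace character (hypothetical helper)."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        raise ValueError("text contains no non-whitespace characters")
    return sum(len(c.encode("utf-8")) for c in chars) / len(chars)


english = "language"
persian = "زبان"  # Persian for "language"; sample word chosen for illustration

for label, word in [("English", english), ("Persian", persian)]:
    n_bytes = len(word.encode("utf-8"))
    print(f"{label}: {len(word)} chars, {n_bytes} bytes, "
          f"{utf8_bytes_per_char(word):.1f} bytes/char")
```

Basic Latin letters take one byte each in UTF-8, while Persian letters take two, so a Persian word starts from more base units per character and relies more heavily on learned merges to stay compact.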