# Description
This project compares several tokenizers and measures how efficiently each one tokenizes Persian (Farsi) text.

In [1]:
import tiktoken
from transformers import BertTokenizer, AutoTokenizer

First, we tokenize an English sentence to establish a baseline for how the tokenizer behaves on English, so that we can compare it with Farsi later.

In [2]:
import tiktoken

text = """The evolution of language models has transformed how humans..."""

# Load GPT‑4 / GPT‑3.5 tokenizer
gpt_enc = tiktoken.get_encoding("cl100k_base")

# Encode text → list of integer token IDs
gpt_tokens = gpt_enc.encode(text)

# Show how many tokens
print("GPT tokens:", len(gpt_tokens))
print("Token IDs:", gpt_tokens)

# Decode each token ID back to its actual text segment
decoded_tokens = [gpt_enc.decode([t]) for t in gpt_tokens]
print("\nTokens → Text pieces:")
for i, (tok_id, tok_str) in enumerate(zip(gpt_tokens, decoded_tokens), 1):
    print(f"{i:>2}. ID {tok_id:>6} | '{tok_str}'")


GPT tokens: 10
Token IDs: [791, 15740, 315, 4221, 4211, 706, 24411, 1268, 12966, 1131]

Tokens → Text pieces:
 1. ID    791 | 'The'
 2. ID  15740 | ' evolution'
 3. ID    315 | ' of'
 4. ID   4221 | ' language'
 5. ID   4211 | ' models'
 6. ID    706 | ' has'
 7. ID  24411 | ' transformed'
 8. ID   1268 | ' how'
 9. ID  12966 | ' humans'
10. ID   1131 | '...'


- cl100k_base is the Byte-Pair-Encoding (BPE) tokenizer used by OpenAI's GPT-4 and GPT-3.5-Turbo models, as well as the text-embedding-ada-002 and text-embedding-3 embedding models.

# How It Handles Farsi Internally
- The merges in cl100k_base were trained mostly on English and other Latin-script text and code, so its merge rules focus on patterns common in those scripts.
- Persian characters (e.g. «س», «ت», «م», etc.) are encoded in UTF-8 with 2 bytes each. Because those byte sequences were uncommon in the English-weighted corpus, few merge rules cover them, and the tokenizer rarely combines them into longer tokens.
- As a result, Persian text tends to produce many more tokens per word than English does.
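
The byte-length argument above can be checked without any tokenizer. The sketch below uses only the Python standard library; `bytes_per_word` is a hypothetical helper introduced here for illustration. It relies on the fact that a byte-level BPE starts from one token per byte and only shrinks that count where learned merges apply, so bytes per word is a rough upper bound on tokens per word.

```python
# Compare UTF-8 byte lengths of Latin and Persian characters.
# "\u200c" is the zero-width non-joiner used inside Persian words.
samples = ["a", "s", "س", "ت", "م", "\u200c"]

for ch in samples:
    raw = ch.encode("utf-8")
    print(f"{ch!r:>10} -> {len(raw)} byte(s): {raw.hex(' ')}")


def bytes_per_word(text: str) -> float:
    """Average UTF-8 bytes per whitespace-separated word (illustrative helper)."""
    words = text.split()
    return sum(len(w.encode("utf-8")) for w in words) / len(words)


# The same two sentences used in the notebook cells.
english = "The evolution of language models has transformed how humans..."
persian = "تحول زبان\u200cها به شکل شگفت\u200cانگیزی دگرگون شده است."

print(f"English: {bytes_per_word(english):.2f} bytes/word")
print(f"Persian: {bytes_per_word(persian):.2f} bytes/word")
```

Before any merge is applied, the Persian sentence already costs more bytes per word than the English one, and the English-weighted merge table then compresses the English bytes far more aggressively.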

In [3]:
import tiktoken

# Load GPT‑4 / GPT‑3.5 tokenizer
enc = tiktoken.get_encoding("cl100k_base")

# Persian input text
text_fa = "تحول زبان‌ها به شکل شگفت‌انگیزی دگرگون شده است."

# Encode to token IDs
token_ids = enc.encode(text_fa)

# Decode each token so we can see how Persian text is segmented
decoded_tokens = [enc.decode([t]) for t in token_ids]

print(f"Total tokens: {len(token_ids)}\n")
print("Index | Token ID | Token Text")
print("-" * 40)
for i, (tid, seg) in enumerate(zip(token_ids, decoded_tokens), start=1):
    print(f"{i:>4}  | {tid:>7} | {repr(seg)}")


Total tokens: 34

Index | Token ID | Token Text
----------------------------------------
   1  |   14628 | 'ت'
   2  |   30925 | 'ح'
   3  |   73904 | 'ول'
   4  |    8979 | ' �'
   5  |     110 | '�'
   6  |   22071 | 'ب'
   7  |   40523 | 'ان'
   8  |   90464 | '\u200c'
   9  |   16552 | 'ه'
  10  |    5821 | 'ا'
  11  |   82868 | ' به'
  12  |   53257 | ' ش'
  13  |   33411 | 'ک'
  14  |    8700 | 'ل'
  15  |   53257 | ' ش'
  16  |   64832 | 'گ'
  17  |   21604 | 'ف'
  18  |   14628 | 'ت'
  19  |   90464 | '\u200c'
  20  |   40523 | 'ان'
  21  |   64832 | 'گ'
  22  |   14728 | 'ی'
  23  |   40797 | 'ز'
  24  |   14728 | 'ی'
  25  |   45430 | ' د'
  26  |   64832 | 'گ'
  27  |   11318 | 'ر'
  28  |   64832 | 'گ'
  29  |   12942 | 'و'
  30  |   12061 | 'ن'
  31  |   53257 | ' ش'
  32  |   92435 | 'ده'
  33  |   94253 | ' است'
  34  |      13 | '.'
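
To put a number on the gap between the two runs, the token counts printed above (10 for the English sentence, 34 for the Persian one) can be divided by the number of whitespace-separated words. This is only a rough sketch: `tokens_per_word` is a hypothetical helper, the counts are copied from the outputs above rather than recomputed, and the two sentences are not exact translations of each other.

```python
def tokens_per_word(text: str, n_tokens: int) -> float:
    """Tokens divided by whitespace-separated words (illustrative helper)."""
    return n_tokens / len(text.split())


english = "The evolution of language models has transformed how humans..."
persian = "تحول زبان\u200cها به شکل شگفت\u200cانگیزی دگرگون شده است."

# Token counts as reported by the cl100k_base runs in In [2] and In [3].
en_ratio = tokens_per_word(english, 10)
fa_ratio = tokens_per_word(persian, 34)

print(f"English: {en_ratio:.2f} tokens/word")
print(f"Persian: {fa_ratio:.2f} tokens/word")
print(f"Persian needs about {fa_ratio / en_ratio:.1f}x as many tokens per word")
```

Note that the zero-width non-joiner is not whitespace, so `str.split()` keeps compounds such as «زبان‌ها» as a single word.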


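- Since the BPE merges were learned mostly from ASCII/Latin text, Persian byte sequences were rarely frequent enough to be merged into larger units, so the tokenizer merges them only partially or leaves them split. That is why you see micro-segments like 'ت', 'ح', 'گ', etc.: most tokens cover a single letter (two UTF-8 bytes), and only a few frequent sequences such as ' به', 'ده', and ' است' became multi-letter tokens. The 'ز' in «زبان» is even split in the middle of the character (rows 4 and 5), which is why each half prints as the replacement character '�'.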
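
The '�' entries in rows 4 and 5 can be reproduced with plain Python. The sketch below assumes, based on the decoded output, that those two tokens hold a space plus the first byte of 'ز' and then its second byte; it does not inspect the tokenizer itself. Decoding with `errors="replace"` matches the default used by tiktoken's `decode`.

```python
# A Persian letter is two UTF-8 bytes; neither byte is valid text on its own.
letter = "ز"
raw = letter.encode("utf-8")
print(len(raw), raw.hex(" "))

# Assumed split, mirroring rows 4 and 5 of the table above.
first_half = b" " + raw[:1]   # space + lead byte
second_half = raw[1:]         # continuation byte

# Decoded separately, each piece falls back to U+FFFD (the replacement character).
print(repr(first_half.decode("utf-8", errors="replace")))
print(repr(second_half.decode("utf-8", errors="replace")))

# Decoded together, the original text comes back intact.
print(repr((first_half + second_half).decode("utf-8")))
```

This is why inspecting tokens one at a time can show '�' even though decoding the full token list restores the sentence correctly.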