
# Creating a Vocabulary

1. Load the text
2. Split the text using whitespace and punctuation as delimiters
3. Remove whitespace from the token list (optional, since whitespace may be needed as a token)
4. Sort the unique tokens and assign an integer ID to each
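
The four steps above can be sketched end to end on a tiny string before running them on the full text. This is only an illustration: `sample_text`, `sample_tokens`, and `sample_vocab` are hypothetical names, and the short sentence stands in for the real corpus.

```python
import re

# Step 1: a hypothetical stand-in for the loaded text.
sample_text = "The cat sat. The cat ran!"

# Step 2: split on punctuation and whitespace; the capturing group keeps the delimiters.
pieces = re.split(r'([,.:;?_!"()\']|\s)', sample_text)

# Step 3: drop empty strings and whitespace-only pieces.
sample_tokens = [p.strip() for p in pieces if p.strip()]

# Step 4: sort the unique tokens and number them.
sample_vocab = {token: idx for idx, token in enumerate(sorted(set(sample_tokens)))}

print(sample_tokens)
print(sample_vocab)
```

Punctuation sorts before letters and uppercase before lowercase, because `sorted` compares strings by code point.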

# Load the Text

## Load text from a URL

In [143]:
import requests

response = requests.get("https://raw.githubusercontent.com/giteshgoyal/llm_practice/refs/heads/main/verdict.txt")
response.raise_for_status()  # fail early on a bad HTTP status instead of tokenizing an error page
raw_text = response.text

print("Total number of characters:", len(raw_text))
print(raw_text[:99])

Total number of characters: 20480
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no 


## Load by cloning the Git repo

In [144]:
!git clone https://github.com/giteshgoyal/llm_practice.git

fatal: destination path 'llm_practice' already exists and is not an empty directory.


In [145]:
!ls
with open("llm_practice/verdict.txt", "r", encoding="utf-8") as f:
  raw_text = f.read()

print("Total number of characters:", len(raw_text))
print(raw_text[:99])

llm_practice  sample_data
Total number of characters: 20480
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no 


## Split the text using whitespace and punctuation as delimiters

In [146]:
import re

splited_data=re.split(r'([,.:;?_!"()\']|\s)',raw_text)
print(splited_data[:99])
print(len(splited_data))

['I', ' ', 'HAD', ' ', 'always', ' ', 'thought', ' ', 'Jack', ' ', 'Gisburn', ' ', 'rather', ' ', 'a', ' ', 'cheap', ' ', 'genius--though', ' ', 'a', ' ', 'good', ' ', 'fellow', ' ', 'enough--so', ' ', 'it', ' ', 'was', ' ', 'no', ' ', 'great', ' ', 'surprise', ' ', 'to', ' ', 'me', ' ', 'to', ' ', 'hear', ' ', 'that', ',', '', ' ', 'in', ' ', 'the', ' ', 'height', ' ', 'of', ' ', 'his', ' ', 'glory', ',', '', ' ', 'he', ' ', 'had', ' ', 'dropped', ' ', 'his', ' ', 'painting', ',', '', ' ', 'married', ' ', 'a', ' ', 'rich', ' ', 'widow', ',', '', ' ', 'and', ' ', 'established', ' ', 'himself', ' ', 'in', ' ', 'a', ' ', 'villa', ' ', 'on']
9043
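
The empty strings in the output above (for example `',', '', ' '`) come from two delimiters sitting next to each other: `re.split` returns the empty text between them. The parentheses in the pattern are what keep the delimiters in the result at all. A small sketch on a hypothetical `sample` string shows both effects:

```python
import re

sample = "Hello, world."

# With a capturing group, the delimiters are returned alongside the words.
with_group = re.split(r'([,.:;?_!"()\']|\s)', sample)

# Without the group, the delimiters are discarded.
without_group = re.split(r'[,.:;?_!"()\']|\s', sample)

print(with_group)     # ['Hello', ',', '', ' ', 'world', '.', '']
print(without_group)  # ['Hello', '', 'world', '']
```

Keeping the delimiters matters here because punctuation marks become vocabulary entries of their own.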


## Remove whitespace from token list

In [147]:
# Keep only pieces that still contain text once surrounding whitespace is stripped
cleaned_data = [data.strip() for data in splited_data if data.strip()]
print(cleaned_data[:99])
print(len(cleaned_data))

['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius--though', 'a', 'good', 'fellow', 'enough--so', 'it', 'was', 'no', 'great', 'surprise', 'to', 'me', 'to', 'hear', 'that', ',', 'in', 'the', 'height', 'of', 'his', 'glory', ',', 'he', 'had', 'dropped', 'his', 'painting', ',', 'married', 'a', 'rich', 'widow', ',', 'and', 'established', 'himself', 'in', 'a', 'villa', 'on', 'the', 'Riviera', '.', '(', 'Though', 'I', 'rather', 'thought', 'it', 'would', 'have', 'been', 'Rome', 'or', 'Florence', '.', ')', '"', 'The', 'height', 'of', 'his', 'glory', '"', '--that', 'was', 'what', 'the', 'women', 'called', 'it', '.', 'I', 'can', 'hear', 'Mrs', '.', 'Gideon', 'Thwing--his', 'last', 'Chicago', 'sitter--deploring', 'his', 'unaccountable', 'abdication', '.', '"', 'Of', 'course']
4506


## Sort the unique tokens and assign an ID to each


In [148]:
sorted_data= sorted(set(cleaned_data))
print(len(sorted_data))

1200


In [149]:
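# Map each unique token to its position in the sorted list
vocab = {token: idx for idx, token in enumerate(sorted_data)}
# Preview the first 32 entries (IDs 0 through 31)
for token, idx in vocab.items():
  print(token, ":", idx)
  if idx > 30:
    break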

! : 0
" : 1
' : 2
( : 3
) : 4
, : 5
--and : 6
--even : 7
--it : 8
--oh : 9
--she : 10
--that : 11
. : 12
: : 13
; : 14
? : 15
A : 16
Ah : 17
Ah--I : 18
Among : 19
And : 20
Are : 21
Arrt : 22
As : 23
At : 24
Be : 25
Begin : 26
Burlington : 27
But : 28
By : 29
Carlo : 30
Chicago : 31


# Word Based Tokenizer

## Without handling out-of-vocabulary words

In [150]:
class WordBasedTokenizerV1:
  def __init__(self, vocab):
    self.str_to_int = vocab
    # Inverse mapping (ID -> token), used by decoder
    self.int_to_str = {idx: token for token, idx in vocab.items()}

  def encoder(self, text):
    # 1. Split text into words and punctuation marks
    splited_data = re.split(r'([,.:;?_!"()\']|\s)', text)
    cleaned_data = [data.strip() for data in splited_data if data.strip()]
    # 2. Replace each token with its ID; a token missing from vocab raises KeyError
    token_ids = [self.str_to_int[word] for word in cleaned_data]
    return token_ids

  def decoder(self, token_ids):
    words = [self.int_to_str[idx] for idx in token_ids]
    sentence = " ".join(words)
    # Remove the space that the join inserted before punctuation marks
    clean_sentence = re.sub(r'\s+([,.?!"()\'])', r'\1', sentence)
    return clean_sentence





In [151]:
tokenizer= WordBasedTokenizerV1(vocab)

In [152]:
text1 = """"It's the last he painted, you know,"
           Mrs. Gisburn said with pardonable pride."""
ids=tokenizer.encoder(text1)
print(ids)

[1, 64, 2, 910, 1053, 640, 559, 799, 5, 1195, 634, 5, 1, 77, 12, 44, 911, 1177, 809, 853, 12]


In [153]:
tokenizer.decoder(ids)

'" It\' s the last he painted, you know," Mrs. Gisburn said with pardonable pride.'

### An out-of-vocabulary word raises KeyError

In [154]:
text="""The court hereby finds the defendant guilty beyond reasonable doubt. Objection overruled,judgment delivered. Counsel presented compelling evidence.
The jury, unanimous in its verdict, adjourned. Amidst the legal jargon stood silence, coffee, echo, remorse, precedent, and truth.
Signed, sealed, recorded — justice prevailed in a chamber echoing law and humanity."""
ids1=tokenizer.encoder(text)
print(ids1)

KeyError: 'court'

### Handling out-of-vocabulary words with special context tokens

1. Add special context tokens such as "<|endoftext|>" and "<|unk|>" to the vocabulary
2. Modify the encoder to map out-of-vocabulary words to "<|unk|>"

In [163]:
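sorted_data = sorted(set(cleaned_data))
# Append the special tokens last so the existing token IDs stay unchanged
sorted_data.extend(["<|endoftext|>", "<|unk|>"])
print(len(sorted_data))

vocab = {token: idx for idx, token in enumerate(sorted_data)}
# Show only the tail of the vocabulary, where the special tokens were added
for token, idx in vocab.items():
  if idx < 1190:
    continue
  print(token, ":", idx)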


1202
wouldn : 1190
year : 1191
years : 1192
yellow : 1193
yet : 1194
you : 1195
you--because : 1196
younger : 1197
your : 1198
yourself : 1199
<|endoftext|> : 1200
<|unk|> : 1201


In [171]:
class WordBasedTokenizerV2:
  def __init__(self, vocab):
    self.str_to_int = vocab
    # Inverse mapping (ID -> token), used by decoder
    self.int_to_str = {idx: token for token, idx in vocab.items()}

  def encoder(self, text):
    # 1. Split text into words and punctuation marks
    splited_data = re.split(r'([,.:;?_!"()\']|\s)', text)
    cleaned_data = [data.strip() for data in splited_data if data.strip()]
    # 2. Replace each token with its ID, falling back to <|unk|> for unknown tokens
    unk_id = self.str_to_int["<|unk|>"]
    token_ids = [self.str_to_int.get(word, unk_id) for word in cleaned_data]
    return token_ids

  def decoder(self, token_ids):
    words = [self.int_to_str[idx] for idx in token_ids]
    sentence = " ".join(words)
    # Remove the space that the join inserted before punctuation marks
    clean_sentence = re.sub(r'\s+([,.?!"()\'])', r'\1', sentence)
    return clean_sentence


In [172]:
tokenizerV2= WordBasedTokenizerV2(vocab)

In [187]:
print(text1, end="\n\n")
ids2=tokenizerV2.encoder(text1)
print(ids2)

"It's the last he painted, you know," 
           Mrs. Gisburn said with pardonable pride.

[1, 64, 2, 910, 1053, 640, 559, 799, 5, 1195, 634, 5, 1, 77, 12, 44, 911, 1177, 809, 853, 12]


In [174]:
print(tokenizerV2.decoder(ids2))

" It' s the last he painted, you know," Mrs. Gisburn said with pardonable pride.


In [188]:
print(text, end="\n\n")
ids3=tokenizerV2.encoder(text)
print(ids3)

The court hereby finds the defendant guilty beyond reasonable doubt. Objection overruled,judgment delivered. Counsel presented compelling evidence. 
The jury, unanimous in its verdict, adjourned. Amidst the legal jargon stood silence, coffee, echo, remorse, precedent, and truth.
Signed, sealed, recorded — justice prevailed in a chamber echoing law and humanity.

[106, 1201, 1201, 1201, 1053, 1201, 1201, 1201, 1201, 1201, 12, 1201, 1201, 5, 1201, 1201, 12, 1201, 1201, 1201, 1201, 12, 106, 1201, 5, 1201, 601, 624, 1201, 5, 1201, 12, 1201, 1053, 1201, 1201, 990, 1201, 5, 1201, 5, 1201, 5, 1201, 5, 1201, 5, 177, 1103, 12, 1201, 5, 1201, 5, 1201, 1201, 1201, 1201, 601, 134, 1201, 1201, 1201, 177, 1201, 12]


In [178]:
print(tokenizerV2.decoder(ids3))

The <|unk|> <|unk|> <|unk|> the <|unk|> <|unk|> <|unk|> <|unk|> <|unk|>. <|unk|> <|unk|>, <|unk|> <|unk|>. <|unk|> <|unk|> <|unk|> <|unk|>. The <|unk|>, <|unk|> in its <|unk|>, <|unk|>. <|unk|> the <|unk|> <|unk|> stood <|unk|>, <|unk|>, <|unk|>, <|unk|>, <|unk|>, and truth. <|unk|>, <|unk|>, <|unk|> <|unk|> <|unk|> <|unk|> in a <|unk|> <|unk|> <|unk|> and <|unk|>.
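
The vocabulary also contains `<|endoftext|>`, which the cells above never use. In GPT-style training data this token typically marks the boundary between unrelated texts. The sketch below shows one way it could be used with the same splitting rule; `toy_vocab`, `encode`, `doc_a`, and `doc_b` are hypothetical names, and the four-word vocabulary is an assumption made to keep the example self-contained.

```python
import re

# Hypothetical toy vocabulary; the notebook's real vocab is built from verdict.txt.
toy_tokens = sorted({"Hello", "world", ",", "."})
toy_tokens.extend(["<|endoftext|>", "<|unk|>"])
toy_vocab = {token: idx for idx, token in enumerate(toy_tokens)}

def encode(text, vocab):
    """Split like the notebook's encoder and map unknown tokens to <|unk|>."""
    pieces = re.split(r'([,.:;?_!"()\']|\s)', text)
    tokens = [p.strip() for p in pieces if p.strip()]
    unk_id = vocab["<|unk|>"]
    return [vocab.get(tok, unk_id) for tok in tokens]

doc_a = "Hello, world."
doc_b = "Hello, moon."

# Surround the marker with spaces so the whitespace split isolates it as one token.
joined = " <|endoftext|> ".join([doc_a, doc_b])
joined_ids = encode(joined, toy_vocab)
print(joined_ids)  # [2, 0, 3, 1, 4, 2, 0, 5, 1]
```

The marker survives splitting intact because none of its characters appear in the delimiter pattern. ID 4 is the boundary and ID 5 is the unknown word "moon".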


# Byte Pair Encoding


In [179]:
!pip install tiktoken



In [180]:
import importlib.metadata  # the metadata submodule must be imported explicitly
import tiktoken

print("tiktoken version:", importlib.metadata.version("tiktoken"))

tiktoken version: 0.9.0


In [181]:
byte_tokenizer= tiktoken.get_encoding("gpt2")

In [186]:
print(text, end="\n\n")
ids4= byte_tokenizer.encode(text)
print(ids4)

The court hereby finds the defendant guilty beyond reasonable doubt. Objection overruled,judgment delivered. Counsel presented compelling evidence. 
The jury, unanimous in its verdict, adjourned. Amidst the legal jargon stood silence, coffee, echo, remorse, precedent, and truth.
Signed, sealed, recorded — justice prevailed in a chamber echoing law and humanity.

[464, 2184, 29376, 7228, 262, 11304, 6717, 3675, 6397, 4719, 13, 9515, 295, 23170, 6309, 11, 10456, 5154, 6793, 13, 21023, 5545, 13206, 2370, 13, 220, 198, 464, 9002, 11, 28085, 287, 663, 15593, 11, 46055, 276, 13, 41816, 301, 262, 2742, 46468, 6204, 9550, 11, 6891, 11, 9809, 11, 34081, 11, 19719, 11, 290, 3872, 13, 198, 50, 3916, 11, 15283, 11, 6264, 851, 5316, 34429, 287, 257, 11847, 39915, 1099, 290, 9265, 13]


In [184]:
byte_tokenizer.decode(ids4)

'The court hereby finds the defendant guilty beyond reasonable doubt. Objection overruled,judgment delivered. Counsel presented compelling evidence. \nThe jury, unanimous in its verdict, adjourned. Amidst the legal jargon stood silence, coffee, echo, remorse, precedent, and truth.\nSigned, sealed, recorded — justice prevailed in a chamber echoing law and humanity.'
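
Unlike the word-based tokenizer, the output above contains no unknown-token placeholder, and decoding reproduces the input exactly. Byte pair encoding achieves this by building tokens from smaller pieces that are merged according to how often they occur together. The sketch below illustrates that merge idea on a toy, character-level corpus. It is not tiktoken's implementation (the `gpt2` encoding works on bytes with merges learned in advance), and `toy_corpus`, `most_frequent_pair`, and `merge_pair` are hypothetical names.

```python
from collections import Counter

def most_frequent_pair(words):
    """Return the most frequent adjacent symbol pair, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Toy corpus: each word is a tuple of characters mapped to its frequency.
toy_corpus = {tuple("hug"): 10, tuple("pug"): 5, tuple("hugs"): 5}

merges = []
for _ in range(2):  # learn two merges
    pair = most_frequent_pair(toy_corpus)
    if pair is None:
        break
    merges.append(pair)
    toy_corpus = merge_pair(toy_corpus, pair)

print(merges)      # [('u', 'g'), ('h', 'ug')]
print(toy_corpus)  # {('hug',): 10, ('p', 'ug'): 5, ('hug', 's'): 5}
```

After two merges the frequent word "hug" is a single symbol, while the rarer "pug" is still represented as "p" plus "ug". An unseen word can be encoded the same way from smaller known pieces, which is why no `<|unk|>` fallback is needed.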