<a href="https://colab.research.google.com/github/devrajvasani/llm-from-scratch/blob/main/llm_from_scratch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LLM From Scratch


---



## Simple Tokenizer

### Step 1: Creating Tokens

In [207]:
with open("sample_data/the-verdict.txt", "r", encoding="utf-8") as f:
  raw_text = f.read()

print("Total number of characters:", len(raw_text))
print(raw_text[:99])

Total number of characters: 4029
Once-promising attorney Frank Galvin is an alcoholic ambulance chaser. As a favor, his former partn


In [208]:
import re

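# First attempt: split on whitespace, commas, and periods.
# The capturing group keeps the delimiters in the output.
result = re.split(r'([,.]|\s)', raw_text)

# Extend the pattern to more punctuation and double dashes.
# This overwrites the first attempt.
result = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)

# Drop empty strings and whitespace-only items
result = [item for item in result if item.strip()]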
print(result)

preprocessed = result

['Once-promising', 'attorney', 'Frank', 'Galvin', 'is', 'an', 'alcoholic', 'ambulance', 'chaser', '.', 'As', 'a', 'favor', ',', 'his', 'former', 'partner', 'Mickey', 'Morrissey', 'sends', 'him', 'a', 'medical', 'malpractice', 'case', 'which', 'is', 'all', 'but', 'certain', 'to', 'be', 'settled', 'for', 'a', 'significant', 'amount', '.', 'The', 'case', 'involves', 'Deborah', 'Ann', 'Kaye', ',', 'who', 'was', 'left', 'comatose', 'after', 'choking', 'on', 'her', 'own', 'vomit', 'when', 'she', 'received', 'general', 'anesthesia', 'during', 'childbirth', 'at', 'a', 'Catholic', 'hospital', '.', 'The', 'plaintiffs', ',', 'Kaye', "'", 's', 'sister', 'and', 'brother-in-law', ',', 'intend', 'to', 'use', 'the', 'settlement', 'to', 'pay', 'for', 'her', 'care', '.', 'A', 'Catholic', 'diocese', 'representative', 'offers', 'Galvin', '$210', ',', '000', '(', 'equivalent', 'to', '$576', ',', '000', 'in', '2024[5]', ')', '.', 'Deeply', 'affected', 'by', 'seeing', 'Kaye', ',', 'Galvin', 'declines', 'and'
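
To see what the split pattern does in isolation, here is a minimal sketch on a short made-up sentence (the `sample` string is an assumption for illustration, not part of the dataset). It uses the same regular expression as the cell above.

```python
import re

# Made-up sample sentence, used only for illustration.
sample = "Hello, world. Is this-- a test?"

# The capturing group (...) makes re.split keep the delimiters
# (punctuation, "--", whitespace) as separate list items.
pieces = re.split(r'([,.:;?_!"()\']|--|\s)', sample)

# Drop empty strings and whitespace-only items.
tokens = [p for p in pieces if p.strip()]
print(tokens)
# ['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']
```

Punctuation becomes its own token, while whitespace is used as a separator and then discarded.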

### Step 2: Creating Token IDs

In [209]:
all_words = sorted(set(preprocessed))
vocab_size = len(all_words)

print(vocab_size)

348


In [210]:
vocab = {token:integer for integer,token in enumerate(all_words)}

In [211]:
for i, item in enumerate(vocab.items()):
    print(item)
    if i >= 20:
        break

('"', 0)
('$210', 1)
('$576', 2)
("'", 3)
('(', 4)
(')', 5)
(',', 6)
('.', 7)
('000', 8)
('1', 9)
('2024[5]', 10)
('9', 11)
(';', 12)
('?', 13)
('A', 14)
('After', 15)
('Afterward', 16)
('Ann', 17)
('As', 18)
('Bag', 19)
('Boston', 20)
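
The ordering in the listing above follows from how Python sorts strings: by Unicode code point, so punctuation and digits come before uppercase letters, which come before lowercase letters. A small sketch with a made-up token list (`toy_tokens` is an assumption for illustration):

```python
# Made-up token list with a duplicate, used only for illustration.
toy_tokens = ["the", "The", ",", "jury", "$210", "000", "the", "."]

# set() removes duplicates, sorted() orders by code point,
# enumerate() assigns consecutive integer IDs.
toy_vocab = {token: idx for idx, token in enumerate(sorted(set(toy_tokens)))}
print(toy_vocab)
# {'$210': 0, ',': 1, '.': 2, '000': 3, 'The': 4, 'jury': 5, 'the': 6}

# The inverse mapping is what a decoder needs.
toy_inverse = {idx: token for token, idx in toy_vocab.items()}
print(toy_inverse[4])
# The
```

Note that 'The' and 'the' receive different IDs, because the vocabulary is case-sensitive.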


In [212]:
class SimpleTokenizerV1:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i:s for s,i in vocab.items()}

    def encode(self, text):
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)

        preprocessed = [
          item.strip() for item in preprocessed if item.strip()
        ]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Remove spaces before the specified punctuation marks
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text

In [213]:
tokenizer = SimpleTokenizerV1(vocab)

text = "The jury finds in favor of the plaintiffs, and the foreman asks whether the jury can award more than what was sought."
ids = tokenizer.encode(text)
print(ids)
text = tokenizer.decode(ids)
print(text)

[50, 211, 165, 198, 162, 243, 319, 258, 6, 68, 319, 169, 80, 340, 319, 211, 99, 85, 231, 317, 338, 336, 300, 7]
The jury finds in favor of the plaintiffs, and the foreman asks whether the jury can award more than what was sought.
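
The example above works because every word in the sentence occurs in the training text. A plain dictionary lookup has no fallback, so a word outside the vocabulary raises a `KeyError`. The sketch below reproduces that lookup strategy on a tiny made-up vocabulary (`toy_vocab` and `encode_strict` are hypothetical names, not part of the notebook):

```python
import re

# Tiny made-up vocabulary, used only for illustration.
toy_vocab = {tok: i for i, tok in enumerate(sorted({"the", "jury", "finds", "."}))}

def encode_strict(text, vocab):
    """Split text like SimpleTokenizerV1.encode and look up each token directly."""
    pieces = re.split(r'([,.:;?_!"()\']|--|\s)', text)
    tokens = [p.strip() for p in pieces if p.strip()]
    return [vocab[t] for t in tokens]  # no fallback for unknown tokens

print(encode_strict("the jury finds.", toy_vocab))
# [3, 2, 1, 0]

try:
    encode_strict("the judge finds.", toy_vocab)
except KeyError as err:
    # "judge" is not in the toy vocabulary
    print("Unknown token:", err)
```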


### Step 3: Adding Special Context Tokens

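Special context tokens are symbols that do not occur in natural language; they give the LLM extra information about the text.

<|endoftext|> : marks the end of a text, which separates unrelated documents and gives the model a signal for when to stop generating.

<|unk|> : stands in for any word that is not in the vocabulary, so encoding falls back to a known ID instead of raising an error.

These tokens make the tokenizer more robust and help the model recognize text boundaries and out-of-vocabulary terms. The need for <|unk|> also shows why LLMs are built on large and diverse training sets: the broader the training text, the larger the vocabulary and the fewer words fall outside it.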

In [214]:
all_tokens = sorted(set(preprocessed))
all_tokens.extend(["<|endoftext|>", "<|unk|>"])

vocab = {token:integer for integer,token in enumerate(all_tokens)}
print("Length of new vocabulary:", len(vocab))

Length of new vocabulary: 350


In [215]:
for item in list(vocab.items())[-5:]:
    print(item)

('with', 345)
('wrote', 346)
('you', 347)
('<|endoftext|>', 348)
('<|unk|>', 349)


In [216]:
class SimpleTokenizerV2:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = { i:s for s,i in vocab.items()}

    def encode(self, text):
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        preprocessed = [
            item if item in self.str_to_int
            else "<|unk|>" for item in preprocessed
        ]

        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Remove spaces before the specified punctuation marks
        text = re.sub(r'\s+([,.:;?!"()\'])', r'\1', text)
        return text

In [217]:
tokenizer = SimpleTokenizerV2(vocab)

text1 = "Judges Panel can award more than what was sought?."
text2 = "The jury finds in support of the plaintiffs."

text = " <|endoftext|> ".join((text1, text2))

print(text)

Judges Panel can award more than what was sought?. <|endoftext|> The jury finds in support of the plaintiffs.


In [218]:
tokenizer.encode(text)

[349,
 349,
 99,
 85,
 231,
 317,
 338,
 336,
 300,
 13,
 7,
 348,
 50,
 211,
 165,
 198,
 349,
 243,
 319,
 258,
 7]

In [219]:
tokenizer.decode(tokenizer.encode(text))

'<|unk|> <|unk|> can award more than what was sought?. <|endoftext|> The jury finds in <|unk|> of the plaintiffs.'
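
The round trip above loses the unknown words, but it is worth noting that `decode` is not an exact inverse of `encode` even when every word is known: whitespace is discarded during splitting and then reinserted by a simple rule. The sketch below isolates the split and join steps without any vocabulary (`split_tokens` and `join_tokens` are hypothetical helper names, and the sentence is made up):

```python
import re

def split_tokens(text):
    """Same splitting rule as SimpleTokenizerV2.encode, without the ID lookup."""
    pieces = re.split(r'([,.:;?_!"()\']|--|\s)', text)
    return [p.strip() for p in pieces if p.strip()]

def join_tokens(tokens):
    """Same joining rule as SimpleTokenizerV2.decode, without the ID lookup."""
    text = " ".join(tokens)
    return re.sub(r'\s+([,.:;?!"()\'])', r'\1', text)

original = 'He offers $210 (a large sum), she said.'
restored = join_tokens(split_tokens(original))

print(restored)
# He offers $210( a large sum), she said.
print(restored == original)
# False
```

The space before the opening parenthesis is removed and one is added after it, so the restored text differs slightly from the input.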

## Byte Pair Encoding (BPE) Tokenizer

In [220]:
! pip3 install tiktoken



In [221]:
import importlib.metadata
import tiktoken

print("tiktoken version:", importlib.metadata.version("tiktoken"))

tiktoken version: 0.12.0


In [222]:
tokenizer_gpt2 = tiktoken.get_encoding("gpt2")
tokenizer_cl100k_base = tiktoken.get_encoding("cl100k_base")


gpt2 : Used by GPT-2 and early GPT-3 models. Vocabulary size: 50,257 tokens.

cl100k_base : Used by GPT-3.5-turbo and GPT-4. Vocabulary size: roughly 100,000 tokens.

In [223]:
text = "Hello, do you like tea? <|endoftext|> In the sunlit terraces of someunknownPlace."

In [224]:
# Tokenizer: gpt2
integers_1 = tokenizer_gpt2.encode(text, allowed_special={"<|endoftext|>"})
print(integers_1)

strings_1 = tokenizer_gpt2.decode(integers_1)
print(strings_1)

[15496, 11, 466, 345, 588, 8887, 30, 220, 50256, 554, 262, 4252, 18250, 8812, 2114, 286, 617, 34680, 27271, 13]
Hello, do you like tea? <|endoftext|> In the sunlit terraces of someunknownPlace.


In [225]:
# Tokenizer: cl100k_base
integers_2 = tokenizer_cl100k_base.encode(text, allowed_special={"<|endoftext|>"})
print(integers_2)

strings_2 = tokenizer_cl100k_base.decode(integers_2)
print(strings_2)

[9906, 11, 656, 499, 1093, 15600, 30, 220, 100257, 763, 279, 7160, 32735, 7317, 2492, 315, 1063, 16476, 17826, 13]
Hello, do you like tea? <|endoftext|> In the sunlit terraces of someunknownPlace.


In [226]:
# For random text: BPE still encodes it by splitting it into smaller subword pieces
integers = tokenizer_cl100k_base.encode("iAkwirw ier")
print(integers)

strings = tokenizer_cl100k_base.decode(integers)
print(strings)

[72, 32, 29700, 404, 86, 602, 261]
iAkwirw ier
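
The random string above is encoded without any `<|unk|>` token because BPE falls back to smaller pieces it already knows. The following is a simplified, character-level sketch of that idea on a made-up corpus. It is not tiktoken's implementation (which works on bytes and uses a pretrained merge table); the function names `train_bpe` and `apply_bpe` are hypothetical.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules, most frequent pair first."""
    words = Counter(tuple(w) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges

def apply_bpe(word, merges):
    """Split a word into characters, then apply the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = next(iter(merge_pair({symbols: 1}, pair)))
    return list(symbols)

# Made-up toy corpus, used only for illustration.
merges = train_bpe("hug hug hug pug pug bug", num_merges=2)
print(merges)                     # [('u', 'g'), ('h', 'ug')]
print(apply_bpe("hug", merges))   # ['hug']
print(apply_bpe("mug", merges))   # ['m', 'ug']
```

The word "mug" never appears in the toy corpus, yet it is still representable as the known pieces 'm' and 'ug'.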
