# 🚀 60 Days of LLM Development from Scratch  

### Day 3: Tokenizer Implementation in LLMs  

In our **60-day journey** of **LLM development from scratch**, we explore the fundamental components of building a Large Language Model.  

Today, on **Day 3**, we focus on **tokenization**, a critical step in transforming raw text into a format that LLMs can process efficiently. We implement **Byte Pair Encoding (BPE)** and explore **Token IDs, Token Embeddings, and Positional Embeddings**, essential for understanding how transformers handle textual data.

🔹 **Author:** **Elias Hossain**  
🔹 **Affiliation:** **Graduate Student, Computer Science, Mississippi State University**  

Stay tuned for more updates as we dive deeper into LLM development! 🚀  


### STEP 1: Import Libraries

In [2]:
! pip install PyPDF2

Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl.metadata (6.8 kB)
Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
Installing collected packages: PyPDF2
Successfully installed PyPDF2-3.0.1





In [3]:
import PyPDF2
import sentencepiece as spm
from tokenizers import Tokenizer, models, pre_tokenizers, trainers, processors
import os

### STEP 2: Function to Extract Text from PDF

In [4]:
# PDF file location
pdf_path = r"C:\Users\mh3511\Desktop\LLM Development\data\historybrief.pdf"
output_text_file = "extracted_text.txt"
def extract_text_from_pdf(pdf_path):
    text = ""
    with open(pdf_path, "rb") as file:
        reader = PyPDF2.PdfReader(file)
        for page in reader.pages:
            text += page.extract_text() + "\n"
    return text

# Save extracted text
extracted_text = extract_text_from_pdf(pdf_path)
with open(output_text_file, "w", encoding="utf-8") as f:
    f.write(extracted_text)

print("Text extracted and saved.")

Text extracted and saved.


# ----------------------------------------------------
# Byte-Pair Encoding (BPE) Tokenizer
# ----------------------------------------------------

In [5]:
bpe_tokenizer = Tokenizer(models.BPE())
bpe_tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"])
bpe_tokenizer.train([output_text_file], trainer)
bpe_tokenizer.save("bpe_tokenizer.json")
print("BPE tokenization done.")


BPE tokenization done.


# -------------------------------
# WordPiece Tokenizer
# -------------------------------

In [6]:
wp_tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
wp_tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.WordPieceTrainer(special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"])
wp_tokenizer.train([output_text_file], trainer)
wp_tokenizer.save("wordpiece_tokenizer.json")
print("WordPiece tokenization done.")

WordPiece tokenization done.
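
Once a WordPiece vocabulary exists, a word is usually split by greedy longest-match-first lookup. The sketch below illustrates that idea in plain Python; the function name `wordpiece_encode` and the tiny `toy_vocab` are hypothetical and are not taken from the trained `wordpiece_tokenizer.json`. The `##` prefix marks a piece that continues a word, which is the default convention in the tokenizers library.

```python
def wordpiece_encode(word, vocab, unk_token="[UNK]", prefix="##"):
    """Greedy longest-match-first split of one word into WordPiece tokens."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # continuation pieces carry "##"
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word becomes [UNK].
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical vocabulary, for illustration only.
toy_vocab = {"token", "##ization", "##ize", "un", "##happi", "##ness"}

print(wordpiece_encode("tokenization", toy_vocab))  # ['token', '##ization']
print(wordpiece_encode("unhappiness", toy_vocab))   # ['un', '##happi', '##ness']
print(wordpiece_encode("xyz", toy_vocab))           # ['[UNK]']
```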


# ---------------------------------------------------
# Unigram Tokenizer (SentencePiece)
# --------------------------------------------------

In [9]:
unigram_model_prefix = "unigram_tokenizer"
spm.SentencePieceTrainer.train(
    input=output_text_file, model_prefix=unigram_model_prefix, vocab_size=2600, model_type="unigram"
)
print("Unigram tokenization done.")

Unigram tokenization done.


# ---------------------------------------------
# SentencePiece BPE Tokenizer
# ---------------------------------------------

In [10]:
spm.SentencePieceTrainer.train(
    input=output_text_file, model_prefix="sentencepiece_bpe", vocab_size=5000, model_type="bpe"
)
print("SentencePiece BPE tokenization done.")

SentencePiece BPE tokenization done.


# ------------------------------------------
# Character-Level Tokenization
# ------------------------------------------

In [11]:
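from tokenizers import Regex

char_tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
# Isolate every character as its own pre-token. BPE never merges across
# pre-tokens, so the learned vocabulary stays at the character level.
char_tokenizer.pre_tokenizer = pre_tokenizers.Split(Regex("."), "isolated")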
trainer = trainers.BpeTrainer(special_tokens=["[PAD]", "[UNK]"])
char_tokenizer.train([output_text_file], trainer)
char_tokenizer.save("char_tokenizer.json")
print("Character-level tokenization done.")

print("All tokenization processes completed. Models saved.")

Character-level tokenization done.
All tokenization processes completed. Models saved.


What have we done so far?

1) Extracted the text from the historybrief.pdf file.
2) Trained BPE, WordPiece, Unigram (SentencePiece), SentencePiece BPE, and character-level tokenizers.
3) Saved the trained tokenizers as JSON files (tokenizers library) and SentencePiece model files.
4) Printed a progress message after each step.

What are our next steps?
1) We can load any of the saved JSON tokenizers with the tokenizers library (the SentencePiece models load with the sentencepiece library instead).
2) We can then tokenize a sentence with the loaded tokenizer, shown here with the BPE tokenizer we trained and saved:

In [12]:
from tokenizers import Tokenizer
tokenizer = Tokenizer.from_file("bpe_tokenizer.json")  # Load desired tokenizer
print(tokenizer.encode("Sample text for tokenization").tokens)

['S', 'amp', 'le', 't', 'ext', 'for', 'to', 'k', 'en', 'ization']


The code above is a simple demonstration of the different tokenizers used by LLMs. Next, I build a more complete end-to-end pipeline. I use the BPE tokenizer here, but you can substitute a different one, since the core process is the same.

In [13]:
import PyPDF2
import re
import numpy as np
import torch
import torch.nn as nn
from tokenizers import Tokenizer, models, pre_tokenizers, trainers, processors

# PDF File Path
pdf_path = r"C:\Users\mh3511\Desktop\LLM Development\data\historybrief.pdf"
output_text_file = "extracted_text.txt"

# Function to extract text from PDF
def extract_text_from_pdf(pdf_path):
    text = ""
    with open(pdf_path, "rb") as file:
        reader = PyPDF2.PdfReader(file)
        for page in reader.pages:
            text += page.extract_text() + "\n"
    return text

# Save extracted text
raw_text = extract_text_from_pdf(pdf_path)
with open(output_text_file, "w", encoding="utf-8") as f:
    f.write(raw_text)

print("✅ Text extracted and saved.")

# --------------------------------------------
# 1️⃣ Text Preprocessing
# --------------------------------------------
def preprocess_text(text):
    text = text.lower()  # Convert to lowercase
    text = re.sub(r"[^a-zA-Z0-9\s]", "", text)  # Remove special characters
    text = re.sub(r"\s+", " ", text).strip()  # Remove extra spaces
    return text

clean_text = preprocess_text(raw_text)
with open("clean_text.txt", "w", encoding="utf-8") as f:
    f.write(clean_text)

print("✅ Text preprocessed and saved.")

# --------------------------------------------
# 2️⃣ Byte Pair Encoding (BPE) Tokenizer Training
# --------------------------------------------
bpe_tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
bpe_tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"])
bpe_tokenizer.train(["clean_text.txt"], trainer)
bpe_tokenizer.save("bpe_tokenizer.json")

print("✅ BPE Tokenizer trained and saved.")

# Load Trained BPE Tokenizer
bpe_tokenizer = Tokenizer.from_file("bpe_tokenizer.json")

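# Example Sentence for Tokenization
# Note: the tokenizer was trained on lowercased, punctuation-free text, so the
# capital "D" and the final "." are out of vocabulary and map to [UNK].
# Passing preprocess_text(sentence) to encode() instead would avoid this.
sentence = "Deep learning is revolutionizing artificial intelligence."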

# Tokenizing the Sentence
encoded = bpe_tokenizer.encode(sentence)
token_ids = encoded.ids
tokens = encoded.tokens

print(f"🔹 Tokens: {tokens}")
print(f"🔹 Token IDs: {token_ids}")

# --------------------------------------------
# 3️⃣ Token Embeddings (Learnable Matrix)
# --------------------------------------------
class TokenEmbedding(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)

    def forward(self, token_ids):
        return self.embedding(token_ids)

# Define Parameters
vocab_size = bpe_tokenizer.get_vocab_size()  # Match the trained tokenizer so every token ID has an embedding row
embedding_dim = 128  # Embedding dimension (a hyperparameter)

# Create Token Embedding Model
token_embedding = TokenEmbedding(vocab_size, embedding_dim)

# Convert Token IDs to Tensor
token_ids_tensor = torch.tensor(token_ids, dtype=torch.long)

# Generate Token Embeddings
embedded_tokens = token_embedding(token_ids_tensor)
print(f"✅ Token Embeddings Shape: {embedded_tokens.shape}")

# --------------------------------------------
# 4️⃣ Positional Encoding (Sinusoidal)
# --------------------------------------------
class PositionalEncoding(nn.Module):
    def __init__(self, max_len, embedding_dim):
        super().__init__()
        position = torch.arange(max_len).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, embedding_dim, 2) * (-np.log(10000.0) / embedding_dim))
        pe = torch.zeros(max_len, embedding_dim)
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
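        # Register as a buffer (not a parameter) so it moves with the module
        # across devices and is saved in the state_dict without being trained.
        self.register_buffer("pe", pe.unsqueeze(0))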

    def forward(self, x):
        return x + self.pe[:, :x.size(1), :]

# Define Positional Encoding
max_seq_length = 512  # Adjust as needed
positional_encoding = PositionalEncoding(max_seq_length, embedding_dim)

# Apply Positional Encoding
embedded_tokens_with_pos = positional_encoding(embedded_tokens.unsqueeze(0))

print(f"✅ Positional Encoding Applied. Final Shape: {embedded_tokens_with_pos.shape}")

# --------------------------------------------
# 5️⃣ Full Tokenization Pipeline Output
# --------------------------------------------
print("\n📝 Final Tokenized Output:")
for i, (tok, tok_id, emb) in enumerate(zip(tokens, token_ids, embedded_tokens_with_pos.squeeze(0))):
    print(f"{i+1}. Token: {tok} | Token ID: {tok_id} | Embedding: {emb.tolist()[:5]} ...")  # Showing first 5 dims of embeddings

print("🚀 Full BPE Tokenizer with Embeddings and Positional Encoding Completed!")




✅ Text extracted and saved.
✅ Text preprocessed and saved.
✅ BPE Tokenizer trained and saved.
🔹 Tokens: ['[UNK]', 'ee', 'p', 'lear', 'ning', 'is', 'revolution', 'izing', 'art', 'if', 'icial', 'inte', 'lli', 'gence', '[UNK]']
🔹 Token IDs: [1, 613, 30, 3266, 1062, 65, 462, 2203, 521, 1242, 3218, 1986, 2871, 1896, 1]
✅ Token Embeddings Shape: torch.Size([15, 128])
✅ Positional Encoding Applied. Final Shape: torch.Size([1, 15, 128])

📝 Final Tokenized Output:
1. Token: [UNK] | Token ID: 1 | Embedding: [-0.10346663743257523, 0.2682356834411621, 0.925059974193573, 2.831432342529297, -1.8659424781799316] ...
2. Token: ee | Token ID: 613 | Embedding: [-0.3140270709991455, 1.70341157913208, 1.835204839706421, -0.11333292722702026, 0.3725041449069977] ...
3. Token: p | Token ID: 30 | Embedding: [0.33506327867507935, -1.8702374696731567, 2.0708506107330322, -0.4936813712120056, -0.02078145742416382] ...
4. Token: lear | Token ID: 3266 | Embedding: [0.7550011873245239, -1.617290

# 🔹 Tokenization Components in LLMs

## 1️⃣ Token ID
- Maps each token (a word or subword piece) to a unique integer that indexes the vocabulary.
- Example: `"Deep learning"` → `[123, 456]` (illustrative IDs)
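
As a minimal sketch of the token-to-ID mapping, the snippet below uses a hand-written toy vocabulary. The names `vocab`, `tokens_to_ids`, and `ids_to_tokens` are illustrative assumptions, not part of the tokenizers library; a trained tokenizer builds this table for you.

```python
# Toy vocabulary: every token string maps to a unique integer ID.
vocab = {"[PAD]": 0, "[UNK]": 1, "deep": 2, "learn": 3, "ing": 4}
id_to_token = {idx: tok for tok, idx in vocab.items()}


def tokens_to_ids(tokens):
    """Look up each token; anything out of vocabulary falls back to [UNK]."""
    return [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]


def ids_to_tokens(ids):
    """Reverse lookup from IDs back to token strings."""
    return [id_to_token[i] for i in ids]


ids = tokens_to_ids(["deep", "learn", "ing", "rocks"])
print(ids)                 # [2, 3, 4, 1]
print(ids_to_tokens(ids))  # ['deep', 'learn', 'ing', '[UNK]']
```

The round trip is lossy for unknown tokens: `"rocks"` comes back as `[UNK]`, which is the same effect seen for `D` and `.` in the pipeline output.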

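## 2️⃣ Token Embeddings
- Maps each token ID to a learnable dense vector (128 dimensions in the code above).
- The vectors are trained with the model, so tokens used in similar contexts end up with similar representations.
- Example: `Token ID 456 → [0.12, -0.34, 0.89, ...]`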
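
Conceptually, an embedding layer is a table with one row per token ID, and a lookup is plain row indexing. The sketch below shows this in pure Python with a tiny random table; `embedding_table` and `embed` are hypothetical names, and a real model would learn the values rather than sample them once.

```python
import random

random.seed(0)  # fixed seed so the toy table is reproducible
vocab_size, embedding_dim = 6, 4

# One row per token ID; a real model would update these values during training.
embedding_table = [
    [random.uniform(-1, 1) for _ in range(embedding_dim)]
    for _ in range(vocab_size)
]


def embed(token_ids):
    """Embedding lookup: pick the row that matches each token ID."""
    return [embedding_table[i] for i in token_ids]


vectors = embed([2, 5, 2])
print(len(vectors), len(vectors[0]))  # 3 4
print(vectors[0] == vectors[2])       # True: the same ID always yields the same vector
```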

## 3️⃣ Positional Embeddings
- Adds position information, since transformers process all tokens in parallel and otherwise have no notion of word order.
- Example: `Position 1 → [0.001, -0.23, 0.45, ...]`
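
The pipeline above uses the fixed sinusoidal variant, where even dimensions hold a sine and odd dimensions a cosine of the position scaled by a power of 10000. The following pure-Python sketch computes one position vector the same way; the function name `sinusoidal_position` and its `base` parameter are illustrative assumptions.

```python
import math


def sinusoidal_position(pos, embedding_dim, base=10000.0):
    """Return the sinusoidal encoding vector for a single position."""
    vec = [0.0] * embedding_dim
    for i in range(0, embedding_dim, 2):
        # Each dimension pair (i, i + 1) shares one frequency.
        angle = pos / (base ** (i / embedding_dim))
        vec[i] = math.sin(angle)
        if i + 1 < embedding_dim:
            vec[i + 1] = math.cos(angle)
    return vec


print(sinusoidal_position(0, 4))  # [0.0, 1.0, 0.0, 1.0]
print([round(v, 4) for v in sinusoidal_position(1, 4)])
```

Position 0 always encodes to alternating 0 and 1 because sin(0) = 0 and cos(0) = 1; later positions rotate each pair at a different rate, which gives every position a distinct vector.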

## 4️⃣ Special Tokens
- `[PAD]` (padding), `[UNK]` (unknown), `[CLS]` (classification), `[SEP]` (separator), `[MASK]` (masking).
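
To show what the special tokens are for, here is a small sketch that frames each sequence with `[CLS]` and `[SEP]`, pads the batch to equal length with `[PAD]`, and builds an attention mask. The IDs 0, 2, and 3 assume the special tokens were numbered in the order they were passed to the trainer; treat them as assumptions and confirm them with `token_to_id` on your own tokenizer. `build_batch` is a hypothetical helper.

```python
# Assumed IDs, following the order of the special_tokens list used above.
PAD_ID, CLS_ID, SEP_ID = 0, 2, 3


def build_batch(sequences):
    """Add [CLS]/[SEP], pad to the longest sequence, and build an attention mask."""
    framed = [[CLS_ID] + seq + [SEP_ID] for seq in sequences]
    max_len = max(len(seq) for seq in framed)
    input_ids = [seq + [PAD_ID] * (max_len - len(seq)) for seq in framed]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in framed]
    return input_ids, attention_mask


input_ids, attention_mask = build_batch([[613, 30], [3266, 1062, 65]])
print(input_ids)       # [[2, 613, 30, 3, 0], [2, 3266, 1062, 65, 3]]
print(attention_mask)  # [[1, 1, 1, 1, 0], [1, 1, 1, 1, 1]]
```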

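## 5️⃣ Byte Pair Encoding (BPE)
- Builds a subword vocabulary by repeatedly merging the most frequent adjacent symbol pair, so rare words split into known pieces.
- Example: `"unhappiness"` → `["un", "happiness"]` (the actual split depends on the merges learned from the corpus)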
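
The merge loop at the heart of BPE training can be sketched in a few lines of plain Python. This is a simplified illustration, not the implementation inside the tokenizers library: the word frequencies are made up, the helper names are hypothetical, and ties are broken by whichever pair was counted first.

```python
from collections import Counter


def get_pair_counts(corpus):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in corpus.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(corpus, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in corpus.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged


def learn_bpe(word_freqs, num_merges):
    """Start from characters and apply the most frequent merge num_merges times."""
    corpus = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(corpus)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # first-seen pair wins ties
        corpus = merge_pair(corpus, best)
        merges.append(best)
    return merges, corpus


# Made-up word frequencies, for illustration only.
merges, corpus = learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, num_merges=3)
print(merges)        # [('e', 's'), ('es', 't'), ('l', 'o')]
print(list(corpus))  # [('lo', 'w'), ('lo', 'w', 'e', 'r'), ('n', 'e', 'w', 'est'), ('w', 'i', 'd', 'est')]
```

After three merges the frequent suffix `est` has become a single symbol, which is how BPE ends up representing rare words as a few known pieces.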
