# 3.2 Tokens and Tokenization

## Attribution
This notebook was reused and modified from material created by NVIDIA and Dartmouth College and licensed under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0) for the **Generative AI: Theory and Applications** MSc Module at UWS.
Source materials available at: https://developer.nvidia.com/gen-ai-teaching-kit-syllabus (NVIDIA Deep Learning Institute Generative AI Teaching Kit) 

## Overview

Welcome to the second notebook of this week. This notebook focuses on tokenization, the step that turns raw text into the units that Natural Language Processing (NLP) methods and language models operate on.

By the end of this notebook, you will:

- Explore basic tokenization methods, including word, character, and sentence-level tokenization.
- Use `tiktoken` to tokenize text for GPT-like models and compare how different encodings handle the same text.
- Train and evaluate a subword tokenizer using Byte Pair Encoding (BPE) for efficient text representation.

These tasks will help you build a strong understanding of tokenization and its applications, setting the stage for advanced NLP workflows.

In [None]:
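# Restart the kernel so the notebook starts from a clean state.
# Note: this JavaScript call only works in the classic Jupyter Notebook interface;
# in JupyterLab or Colab, restart the kernel from the menu instead.
from IPython.display import display, Javascript
display(Javascript('IPython.notebook.kernel.restart();'))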

## 1. Basic Tokenization

What is Tokenization?
- Tokenization is the process of splitting text into smaller units, such as words, characters, or sentences.
- It serves as the foundation for almost every NLP task by breaking raw text into manageable components.

Why is Tokenization Important?
- Enables systematic processing of text data.
- Breaks sentences into units, allowing for analysis like word frequencies or semantic meaning.
- Prepares text for machine learning models, ensuring uniformity in input.

What This Section Covers:
1. **Word-level Tokenization**: Breaking text into individual words for analysis.
2. **Character-level Tokenization**: Splitting text into individual characters.
3. **Sentence-level Tokenization**: Dividing text into complete sentences.

Tools Used:
- **NLTK (Natural Language Toolkit)**: A popular Python library for natural language processing. It provides `word_tokenize` and `sent_tokenize` for word- and sentence-level tokenization.
- **Python built-ins**: Character-level tokenization needs no library; `list(text)` is enough.
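
To make the three levels concrete before turning to NLTK, here is a minimal sketch that uses only Python's standard `re` module. The regular expressions are illustrative assumptions, not the rules NLTK applies, so its output can differ (for example, NLTK splits contractions such as "Here's" into two tokens).

```python
import re

text = "Hello world! This is a short sentence. Here's another one."

# Word-level: a run of word characters (optionally with one internal apostrophe),
# or any single character that is neither a word character nor whitespace.
words = re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

# Character-level: every character, including spaces, becomes a token.
chars = list(text)

# Sentence-level: split at whitespace that follows '.', '!' or '?'.
# This naive rule breaks on abbreviations such as "Dr. Smith".
sentences = re.split(r"(?<=[.!?])\s+", text)

print("Words:", words)
print("Number of characters:", len(chars))
print("Sentences:", sentences)
```

The word pattern keeps punctuation as separate tokens, which is the same basic idea that word-level tokenizers in NLP libraries follow with more careful rules.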

In [None]:
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
nltk.download('punkt_tab', quiet=True)  # Ensure required data is downloaded

In [None]:
# === Basic Tokenization Examples ===
# Demonstrates tokenization at word, character, and sentence levels.
# NLTK handles words and sentences; plain Python handles characters.

# Text to be tokenized
text = "Hello world! This is a short sentence. Here's another one."

# Word-level tokenization using NLTK:
# punctuation becomes separate tokens and contractions are split (e.g. "Here's" -> "Here", "'s")
words = word_tokenize(text)

# Character-level tokenization:
# every character, including spaces and punctuation, becomes a token
chars = list(text)

# Sentence-level tokenization using NLTK:
# the pre-trained Punkt model decides which punctuation marks end a sentence
sentences = sent_tokenize(text)

# Print outputs for each tokenization level
print("Original Text:\n", text)
print("\nWord Tokenization (NLTK):", words)
print("\nCharacter Tokenization:", chars)
print("\nSentence Tokenization (NLTK):", sentences)
print()



---



## 2. Tokenization with Tiktoken

What is tiktoken?
- `tiktoken` is OpenAI's open-source, fast Byte Pair Encoding (BPE) tokenizer library.
- It converts text into the token IDs that GPT models consume and converts token IDs back into text.

Why Use tiktoken?
- Ships the encodings used by OpenAI models, such as `gpt2` for GPT-2 and `cl100k_base` for GPT-3.5 Turbo and GPT-4.
- Efficiently encodes text into token IDs and decodes token IDs back into text.
- Lets you inspect exactly how a given model splits a piece of text into tokens.

What This Section Covers:
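1. Demonstrates tokenization using `tiktoken` for different models (GPT-2, GPT-3.5 Turbo, and GPT-4).
2. Shows how tokenization differs across encodings. GPT-3.5 Turbo and GPT-4 share the same encoding, so the visible difference is between GPT-2 and the other two.
3. Explores edge cases, such as long texts and special tokens.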

Applications of tiktoken:
- Preprocessing text for OpenAI API inputs to ensure token limits are respected.
- Analyzing tokenization for optimal prompt design and cost estimation.

Tools Used:
- **tiktoken**: OpenAI's tokenization library for GPT models.
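
The token-limit and cost-estimation uses listed above can be sketched without any tokenizer library. The helper names (`count_tokens`, `truncate_to_budget`, `estimate_cost`), the byte-level stand-in tokenizer, and the price are all hypothetical and chosen for illustration; with `tiktoken`, the `encode` and `decode` methods of an encoder object would be passed in instead.

```python
from typing import Callable, List


def count_tokens(encode: Callable[[str], List[int]], text: str) -> int:
    """Return how many tokens `encode` produces for `text`."""
    return len(encode(text))


def truncate_to_budget(encode, decode, text: str, max_tokens: int) -> str:
    """Cut `text` so that it encodes to at most `max_tokens` tokens."""
    token_ids = encode(text)
    if len(token_ids) <= max_tokens:
        return text
    return decode(token_ids[:max_tokens])


def estimate_cost(n_tokens: int, price_per_1k_tokens: float) -> float:
    """Estimate cost from a token count and a (hypothetical) price per 1,000 tokens."""
    return n_tokens / 1000 * price_per_1k_tokens


# Stand-in tokenizer (assumption): one token per UTF-8 byte.
def toy_encode(text: str) -> List[int]:
    return list(text.encode("utf-8"))


def toy_decode(token_ids: List[int]) -> str:
    return bytes(token_ids).decode("utf-8", errors="replace")


prompt = "Hello, how are you today?"
n_tokens = count_tokens(toy_encode, prompt)
print("Token count:", n_tokens)
print("Truncated to 5 tokens:", truncate_to_budget(toy_encode, toy_decode, prompt, 5))
print("Estimated cost:", estimate_cost(n_tokens, price_per_1k_tokens=0.50))
```

Truncating at the token level can cut a word or a multi-byte character in half, so real pipelines often truncate at sentence or paragraph boundaries and only use the token count as the check.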

In [None]:
import tiktoken

# === tiktoken Examples ===
# Demonstrates tokenization and decoding using tiktoken for different models.
print("=== tiktoken Tokenization Examples ===")

# Step 1: Define a sample text
sample_text = (
    "Hello, how are you today? Let's explore tokenization with tiktoken! "
    "It supports GPT-2, GPT-3, and GPT-4 models."
)
print("\nSample Text:\n", sample_text)

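# Step 2: Load encoders for different models
# GPT-3.5 Turbo and GPT-4 both resolve to the `cl100k_base` encoding,
# so their token IDs are identical; only GPT-2 (`gpt2`) differs.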
encoders = {
    "GPT-2": tiktoken.get_encoding("gpt2"),
    "GPT-3.5 Turbo": tiktoken.encoding_for_model("gpt-3.5-turbo"),
    "GPT-4": tiktoken.encoding_for_model("gpt-4"),
}

# Step 3: Tokenize the text with each model's encoder
for model, encoder in encoders.items():
    print(f"\n--- {model} Tokenization ---")

    # Encode the text to token IDs
    token_ids = encoder.encode(sample_text)
    print("Encoded Token IDs:", token_ids)

    # Decode back to text
    decoded_text = encoder.decode(token_ids)
    print("Decoded Text:", decoded_text)

    # Token-level decoding (optional for visualization)
    individual_tokens = [encoder.decode([tid]) for tid in token_ids]
    print("Individual Decoded Tokens:", individual_tokens)

    # Length of the tokenized text
    print("Number of Tokens:", len(token_ids))

# Step 4: Explore long text edge cases
long_text = "This is a demonstration of how tiktoken handles very long inputs. " * 50
encoder = encoders["GPT-3.5 Turbo"]
long_token_ids = encoder.encode(long_text)
print("\n--- Long Text Tokenization (GPT-3.5 Turbo) ---")
print("Number of Tokens in Long Text:", len(long_token_ids))

# Step 5: Analyze special tokens
print("\n--- Special Tokens ---")
print("GPT-3.5 Turbo Special Tokens:")

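# `_special_tokens` is a private attribute mapping special-token strings to IDs;
# it may change between tiktoken versions. The public `special_tokens_set`
# property lists the token strings without their IDs.
special_tokens = encoder._special_tokens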
for token_name, token_value in special_tokens.items():
    print(f"  {token_name}: {token_value}")



---



## 3. Training and Using a Byte Pair Encoding Tokenizer

What is Subword Tokenization?
- Subword tokenization is an intermediate approach between word-level and character-level tokenization.
- It breaks words into smaller units (subwords) to handle out-of-vocabulary (OOV) words and reduce vocabulary size.

Why is Subword Tokenization Important?
- Handles rare or unknown words by breaking them into known subwords.
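- Keeps the vocabulary far smaller than a word-level vocabulary, which shrinks the model's embedding table.
- Keeps frequent words whole and splits rare words into recurring pieces (for example, `token` + `ization`), so related words share tokens.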
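
As a rough illustration of how an unseen word can be covered by known subwords, the sketch below uses greedy longest-match segmentation against a small, hand-written vocabulary. Both the `segment` helper and the vocabulary are hypothetical, and this is a simplification: BPE applies learned merge rules rather than longest-match lookup.

```python
def segment(word, vocab, unk="[UNK]"):
    """Split `word` into the longest known pieces, scanning left to right."""
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest candidate first and shrink until a known piece is found.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # No known piece starts here: emit the unknown token for one character.
            pieces.append(unk)
            i += 1
    return pieces


# Hypothetical subword vocabulary for illustration only.
vocab = {"token", "ization", "sub", "word", "s", "un", "believ", "able"}

print(segment("tokenization", vocab))  # ['token', 'ization']
print(segment("subwords", vocab))      # ['sub', 'word', 's']
print(segment("unbelievable", vocab))  # ['un', 'believ', 'able']
print(segment("xyz", vocab))           # ['[UNK]', '[UNK]', '[UNK]']
```

None of the first three words is in the vocabulary as a whole, yet each is represented without an unknown token, which is the property that makes subword vocabularies robust to rare words.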

What This Section Covers:
1. Training a Subword Tokenizer: Demonstrates Byte Pair Encoding (BPE) on a small custom corpus.
2. Using the Tokenizer: Tests the trained tokenizer on example sentences, including OOV words.
3. Retraining the Tokenizer: Trains the tokenizer again on additional text. Note that calling `train` a second time learns a new vocabulary from the new text rather than extending the existing one.

Applications of Subword Tokenization:
- Pretraining large language models like BERT or GPT, which rely on subword-level tokenization (WordPiece for BERT, byte-level BPE for GPT models).
- Handling morphologically rich languages with complex word structures.

Tools Used:
- **Hugging Face Tokenizers Library**: Provides the tools to build and train BPE tokenizers.
- Includes components like pre-tokenizers, decoders, and trainers.
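
Before handing training over to a library, it can help to see the core BPE loop in plain Python. The following is a minimal sketch under simplifying assumptions: words are split into characters, no end-of-word marker or special tokens are used, and ties between equally frequent pairs are broken alphabetically. The function names are hypothetical and this is not how the Hugging Face implementation is written.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with the merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    corpus = Counter(tuple(word) for word in words)  # symbol sequence -> frequency
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pair_counts[(a, b)] += freq
        if not pair_counts:
            break
        # Pick the most frequent pair; break ties by the pair itself for determinism.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        # Apply the new merge to every word in the corpus.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges


def apply_bpe(word, merges):
    """Tokenize `word` by applying the learned merges in the order they were learned."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


words = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges = learn_bpe(words, num_merges=4)
print(merges)                       # [('s', 't'), ('e', 'st'), ('o', 'w'), ('l', 'ow')]
print(apply_bpe("lowest", merges))  # ['low', 'est']
```

The word "lowest" never appears in the toy corpus, yet it is split into two pieces that were learned from other words. The number of merges plays the role that `vocab_size` plays in a library trainer: more merges produce longer, more word-like tokens.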

In [None]:
import os
from tokenizers import Tokenizer, trainers, models, pre_tokenizers, decoders

# Step 1: Define a small corpus for training
corpus = [
    "I love natural language processing.",
    "Tokenization is an essential step in NLP.",
    "Subword tokenization helps handle out-of-vocabulary words.",
    "Byte Pair Encoding can reduce the vocabulary size significantly.",
    "This is a demonstration of building our own BPE tokenizer.",
    "We'll later fine-tune this tokenizer on new data."
]

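# Step 2: Initialize and configure the tokenizer
# `unk_token` makes the model emit [UNK] for characters it never saw during training;
# without it, such characters are silently dropped from the output.
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
tokenizer.decoder = decoders.BPEDecoder()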

# Step 3: Configure the BPE trainer
trainer = trainers.BpeTrainer(
    vocab_size=100,  # Define a small vocabulary for demonstration
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],  # Add special tokens
    end_of_word_suffix="</w>",  # Marks word ends so BPEDecoder can restore spaces when decoding
)

# Step 4: Write the corpus to a temporary file for training
temp_filename = "temp_bpe_corpus.txt"
with open(temp_filename, "w", encoding="utf-8") as f:
    for line in corpus:
        f.write(line + "\n")

# Step 5: Train the tokenizer on the corpus
tokenizer.train([temp_filename], trainer)

# Step 6: Cleanup temporary file
if os.path.exists(temp_filename):
    os.remove(temp_filename)

# Output results
print("Tokenizer training complete!")
print("Vocabulary Size:", tokenizer.get_vocab_size())

# Step 7: Explore the vocabulary
vocab = tokenizer.get_vocab()
print("\nSample Vocabulary:")
for token, idx in list(vocab.items())[:10]:  # Show the first 10 tokens
    print(f"  {token}: {idx}")

# Step 8: Test the trained tokenizer on new sentences
test_sentences = [
    "I enjoy natural language processing!",
    "Out-of-vocabulary words like SubwordifyMe are handled.",
    "Byte Pair Encoding is efficient for reducing vocabulary."
]

for sentence in test_sentences:
    encoded = tokenizer.encode(sentence)
    print("\nOriginal Sentence:", sentence)
    print("Encoded Tokens:", encoded.tokens)
    print("Token IDs:", encoded.ids)
    print("Decoded Sentence:", tokenizer.decode(encoded.ids))

# Step 9: Visualize subword splitting for OOV words
oov_word = "SubwordifyMePlease"
encoded_oov = tokenizer.encode(oov_word)
print("\nOOV Word:", oov_word)
print("Encoded Tokens:", encoded_oov.tokens)
print("Decoded OOV Word:", tokenizer.decode(encoded_oov.ids))

# Step 10: Retrain the tokenizer on new data
# Note: calling `train` again learns a fresh vocabulary and merge table from the
# files it is given; it does not extend the vocabulary learned in Step 5.
# To keep the original vocabulary, retrain on the combined corpora or add
# entries with `tokenizer.add_tokens([...])`.
additional_corpus = [
    "Fine-tuning a tokenizer can adapt it to new domains.",
    "Specialized vocabularies can be added incrementally.",
    "Let's add new data for incremental training!"
]

# Write the additional corpus to a temporary file
temp_filename_additional = "temp_additional_corpus.txt"
with open(temp_filename_additional, "w", encoding="utf-8") as f:
    for line in additional_corpus:
        f.write(line + "\n")

# Retrain the tokenizer on the additional corpus only
tokenizer.train([temp_filename_additional], trainer)

# Cleanup temporary file
if os.path.exists(temp_filename_additional):
    os.remove(temp_filename_additional)

# Output retraining results
print("\nTokenizer retrained on additional data!")
print("New Vocabulary Size:", tokenizer.get_vocab_size())