# Module 1: Data Preparation and Tokenization

Welcome to the first module of the Storyteller training series! In this notebook, we'll explore how to prepare data for training a language model.

## Learning Objectives

By the end of this notebook, you will understand:
1. How to download and explore story datasets
2. Text preprocessing techniques for language models
3. How tokenization works (Byte-Pair Encoding)
4. How to train a custom tokenizer
5. How to analyze and visualize text data

## Why Data Preparation Matters

The quality of your training data directly impacts model performance:
- **Garbage in, garbage out**: Poor data leads to poor models
- **Tokenization**: Affects vocabulary size, compression, and efficiency
- **Preprocessing**: Removes noise and normalizes text
- **Statistics**: Understanding your data helps debug training issues
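
As a concrete illustration of the preprocessing bullet, here is a minimal sketch of a cleaning function. The name `clean_text` and the specific rules (Unicode NFKC normalization, straightening curly quotes, collapsing whitespace) are assumptions chosen for illustration, not necessarily the steps the Storyteller pipeline applies.

```python
import re
import unicodedata

# Hypothetical mapping from typographic quotes to ASCII equivalents.
QUOTE_MAP = str.maketrans({"\u201c": '"', "\u201d": '"', "\u2018": "'", "\u2019": "'"})


def clean_text(text):
    """Normalize Unicode, straighten curly quotes, and collapse whitespace."""
    # NFKC folds compatibility characters (e.g. non-breaking spaces) together.
    text = unicodedata.normalize("NFKC", text)
    # Replace curly quotes so the same word is not tokenized two ways.
    text = text.translate(QUOTE_MAP)
    # Collapse any run of whitespace (tabs, newlines, spaces) to one space.
    text = re.sub(r"\s+", " ", text)
    return text.strip()


print(clean_text("  \u201cHello\u201d,\tsaid   Lily.\n"))
```

Whether rules like these help depends on the dataset; it is worth inspecting a few raw stories before deciding what counts as noise.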

In [None]:
import sys
from pathlib import Path
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter

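# Add the project's src/ directory to the import path
# (assumes the notebook runs from a subdirectory of the project root)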
project_root = Path.cwd().parent
sys.path.insert(0, str(project_root / "src"))

# Set plotting style
sns.set_style("whitegrid")
plt.rcParams["figure.figsize"] = (12, 6)

print("Environment ready!")

## Part 1: Understanding Our Datasets

We'll use two datasets for training:

### 1. TinyStories
- **Size**: ~2M short stories
- **Style**: Simple, child-friendly narratives
- **Vocabulary**: Limited, easy to learn
- **Good for**: Quick experimentation, testing

### 2. WritingPrompts
- **Size**: ~300K creative stories
- **Style**: Diverse, creative writing
- **Vocabulary**: Rich, complex
- **Good for**: Higher-quality, more varied output

In [None]:
# Check if data exists
data_dir = project_root / "data" / "raw"

tinystories_path = data_dir / "tinystories.txt"
writingprompts_path = data_dir / "writingprompts.txt"

print("Dataset Status:")
print(f"  TinyStories: {'✓ Found' if tinystories_path.exists() else '✗ Not found'}")
print(
    f"  WritingPrompts: {'✓ Found' if writingprompts_path.exists() else '✗ Not found'}"
)

# Show the download hint if either dataset is missing
if not (tinystories_path.exists() and writingprompts_path.exists()):
    print("\nTo download datasets, run:")
    print("  storyteller-download --datasets tinystories writingprompts")

## Part 2: Loading and Exploring Data

Let's load some stories and analyze them.

In [None]:
def load_sample_stories(file_path, num_samples=10):
    """Load the first `num_samples` lines of a file, treating each line as one story."""
    stories = []

    if not Path(file_path).exists():
        print(f"File not found: {file_path}")
        return stories

    with open(file_path, "r", encoding="utf-8") as f:
        for i, line in enumerate(f):
            if i >= num_samples:
                break
            stories.append(line.strip())

    return stories


# Load samples
if tinystories_path.exists():
    sample_stories = load_sample_stories(tinystories_path, 5)

    print("Sample Stories from TinyStories:")
    print("=" * 70)
    for i, story in enumerate(sample_stories, 1):
        print(f"\nStory {i}:")
        print(story[:200] + "..." if len(story) > 200 else story)
        print("-" * 70)
else:
    print("Using example story for demonstration...")
    sample_stories = [
        "Once upon a time, there was a little girl named Lily. She loved to play outside."
    ]

## Part 3: Text Statistics

Understanding your data is crucial. Let's compute basic statistics.

In [None]:
def analyze_text_file(file_path, max_stories=10000):
    """Analyze a text file and return statistics."""
    if not Path(file_path).exists():
        return None

    story_lengths = []
    total_chars = 0
    word_counter = Counter()
    char_counter = Counter()

    with open(file_path, "r", encoding="utf-8") as f:
        for i, line in enumerate(f):
            if i >= max_stories:
                break

            story = line.strip()
            story_lengths.append(len(story))
            total_chars += len(story)

            # Count lowercased, whitespace-separated words (punctuation stays attached)
            words = story.lower().split()
            word_counter.update(words)

            # Count characters
            char_counter.update(story)

    return {
        "num_stories": len(story_lengths),
        "story_lengths": story_lengths,
        "total_chars": total_chars,
        "avg_length": np.mean(story_lengths),
        "median_length": np.median(story_lengths),
        "vocab_size": len(word_counter),
        "total_words": sum(word_counter.values()),
        "unique_chars": len(char_counter),
        "most_common_words": word_counter.most_common(20),
    }


# Analyze datasets
if tinystories_path.exists():
    print("Analyzing TinyStories (this may take a minute)...")
    stats = analyze_text_file(tinystories_path)

    if stats:
        print("\nTinyStories Statistics:")
        print(f"  Stories analyzed: {stats['num_stories']:,}")
        print(f"  Total characters: {stats['total_chars']:,}")
        print(f"  Average story length: {stats['avg_length']:.0f} chars")
        print(f"  Median story length: {stats['median_length']:.0f} chars")
        print(f"  Vocabulary size: {stats['vocab_size']:,} unique words")
        print(f"  Total words: {stats['total_words']:,}")
        print(f"  Unique characters: {stats['unique_chars']}")

        print("\n  Most common words:")
        for word, count in stats["most_common_words"][:10]:
            print(f"    {word}: {count:,}")
else:
    stats = None  # Lets later cells check `stats` safely
    print("Dataset not available. Skipping statistics.")

## Part 4: Visualizing Story Lengths

The distribution of story lengths guides the choice of training sequence length. Note that the lengths plotted here are measured in characters, whereas a model's sequence length is measured in tokens.

In [None]:
if tinystories_path.exists() and stats:
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

    # Histogram
    ax1.hist(stats["story_lengths"], bins=50, alpha=0.7, edgecolor="black")
    ax1.axvline(
        stats["avg_length"],
        color="red",
        linestyle="--",
        label=f"Mean: {stats['avg_length']:.0f}",
    )
    ax1.axvline(
        stats["median_length"],
        color="green",
        linestyle="--",
        label=f"Median: {stats['median_length']:.0f}",
    )
    ax1.set_xlabel("Story Length (characters)")
    ax1.set_ylabel("Count")
    ax1.set_title("Distribution of Story Lengths")
    ax1.legend()
    ax1.grid(alpha=0.3)

    # Cumulative distribution
    sorted_lengths = np.sort(stats["story_lengths"])
    cumulative = np.arange(1, len(sorted_lengths) + 1) / len(sorted_lengths)
    ax2.plot(sorted_lengths, cumulative, linewidth=2)

    # Mark percentiles
    for percentile in [50, 90, 95, 99]:
        length = np.percentile(stats["story_lengths"], percentile)
        ax2.axvline(length, color="red", linestyle=":", alpha=0.5)
        ax2.text(length, 0.5, f"{percentile}%", rotation=90, va="center")

    ax2.set_xlabel("Story Length (characters)")
    ax2.set_ylabel("Cumulative Proportion")
    ax2.set_title("Cumulative Distribution of Story Lengths")
    ax2.grid(alpha=0.3)

    plt.tight_layout()
    plt.show()

    print("\nPercentiles:")
    for p in [50, 75, 90, 95, 99]:
        print(
            f"  {p}th percentile: {np.percentile(stats['story_lengths'], p):.0f} chars"
        )
else:
    print("Visualization requires dataset to be downloaded.")
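
Because sequence length is counted in tokens, the character percentiles above need converting before they can inform a `max_seq_len` choice. The sketch below does this with an assumed chars-per-token ratio. The function names, the example lengths, and the ratio of 4.0 are all hypothetical; it uses the nearest-rank percentile method so it runs on the standard library alone, which can differ slightly from the linear interpolation `np.percentile` applies by default.

```python
import math


def nearest_rank_percentile(values, p):
    """Return the p-th percentile of `values` using the nearest-rank method."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def estimate_token_lengths(char_lengths, chars_per_token, percentiles=(50, 90, 95, 99)):
    """Convert character-length percentiles into approximate token counts."""
    return {
        p: math.ceil(nearest_rank_percentile(char_lengths, p) / chars_per_token)
        for p in percentiles
    }


# Hypothetical story lengths (characters) and ratio, for illustration only.
example_lengths = [400, 600, 800, 1000, 1200]
print(estimate_token_lengths(example_lengths, chars_per_token=4.0))
```

In the notebook, `stats["story_lengths"]` would take the place of `example_lengths`, and the ratio would come from a trained tokenizer rather than a guess.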

## Part 5: Understanding Tokenization

### What is Tokenization?

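Tokenization splits text into tokens and maps each token to an integer ID, producing the numerical sequence a model actually processes. The IDs in the example below are illustrative; real values depend on the tokenizer's vocabulary.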

**Example:**
```
Text: "Hello world!"
Tokens: ["Hello", " world", "!"]
Token IDs: [284, 995, 0]
```

### Why Byte-Pair Encoding (BPE)?

BPE is a subword tokenization algorithm that:
1. Starts from a vocabulary of individual characters (or bytes)
2. Repeatedly merges the most frequent adjacent pair of tokens into a new token
3. Stops at a target vocabulary size, trading vocabulary size against sequence length

**Advantages:**
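- Handles rare and unseen words by splitting them into known subwords (characters never seen in training still map to `<unk>` unless a byte-level variant is used)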
- Efficient vocabulary size
- Good compression ratio
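
The merge procedure described above can be sketched in a few lines of plain Python. This is a toy version for intuition only: `count_pairs`, `merge_pair`, `learn_bpe`, and the four-word corpus are illustrative names and data, and the HuggingFace library used later in this notebook implements training differently (and far more efficiently).

```python
from collections import Counter


def count_pairs(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merge rules from a {word: count} dict."""
    # Step 1: start with every word split into single characters.
    words = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(words)
        if not pairs:
            break
        # Step 2: merge the most frequent pair (ties go to the first one seen).
        best = max(pairs, key=pairs.get)
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words


toy_corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, segmented = learn_bpe(toy_corpus, num_merges=3)
print(merges)  # [('e', 's'), ('es', 't'), ('l', 'o')]
```

Each merge adds one token to the vocabulary and shortens every word that contains the pair, which is the vocabulary-size versus sequence-length trade-off in miniature.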

In [None]:
# Simple character-level tokenization example
def char_tokenize(text):
    """Simple character-level tokenization."""
    return list(text)


# Simple word-level tokenization example
def word_tokenize(text):
    """Simple word-level tokenization."""
    return text.split()


example_text = "Hello world! This is tokenization."

print("Example Text:", example_text)
print()
print("Character-level tokens:")
char_tokens = char_tokenize(example_text)
print(f"  Tokens: {char_tokens}")
print(f"  Count: {len(char_tokens)}")
print()
print("Word-level tokens:")
word_tokens = word_tokenize(example_text)
print(f"  Tokens: {word_tokens}")
print(f"  Count: {len(word_tokens)}")
print()
print("BPE sits in between - subword level!")

## Part 6: Training a BPE Tokenizer

Let's train a custom BPE tokenizer using the HuggingFace tokenizers library.

In [None]:
from tokenizers import Tokenizer, models, trainers, pre_tokenizers, decoders


def train_bpe_tokenizer(files, vocab_size=10000, min_frequency=2):
    """
    Train a BPE tokenizer.

    Args:
        files: List of text files to train on
        vocab_size: Target vocabulary size
        min_frequency: Minimum number of times a pair must occur to be merged
    """
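    # Initialize tokenizer with a BPE model; unk_token must match the "<unk>"
    # special token below so characters unseen during training map to it
    tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))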

    # Use whitespace pre-tokenizer
    tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

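    # Use BPE decoder (its default "</w>" suffix only restores spaces on decode
    # if the trainer is also given end_of_word_suffix="</w>")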
    tokenizer.decoder = decoders.BPEDecoder()

    # Configure trainer
    trainer = trainers.BpeTrainer(
        vocab_size=vocab_size,
        min_frequency=min_frequency,
        special_tokens=["<pad>", "<unk>", "<s>", "</s>"],
        show_progress=True,
    )

    # Train
    print(f"Training BPE tokenizer with vocab_size={vocab_size}...")
    tokenizer.train(files, trainer)

    print("✓ Tokenizer trained!")
    print(f"  Vocabulary size: {tokenizer.get_vocab_size()}")

    return tokenizer


# Train tokenizer if data exists
if tinystories_path.exists():
    # For the demo, use a small vocabulary (note: this still reads the full file)
    print("Training tokenizer (this may take a minute)...")
    tokenizer = train_bpe_tokenizer(
        files=[str(tinystories_path)],
        vocab_size=5000,
        min_frequency=2,
    )
else:
    print("Dataset not available. Skipping tokenizer training.")
    tokenizer = None

## Part 7: Testing the Tokenizer

Let's see how our tokenizer performs on sample text.

In [None]:
if tokenizer is not None:
    test_sentences = [
        "Once upon a time, there was a little girl.",
        "The quick brown fox jumps over the lazy dog.",
        "Hello, how are you today?",
        "Tokenization is the process of breaking text into tokens.",
    ]

    print("Tokenization Examples:")
    print("=" * 70)

    for sentence in test_sentences:
        encoding = tokenizer.encode(sentence)

        print(f"\nOriginal: {sentence}")
        print(f"Tokens: {encoding.tokens}")
        print(f"Token IDs: {encoding.ids}")
        print(f"Number of tokens: {len(encoding.tokens)}")
        print(
            f"Compression ratio: {len(sentence) / len(encoding.tokens):.2f} chars/token"
        )
        print("-" * 70)
else:
    print("Tokenizer not trained. Please download dataset first.")
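
The loop above reports a ratio per sentence. For comparing tokenizers it is usually more informative to aggregate over many texts, dividing total characters by total tokens. A minimal sketch follows, with `chars_per_token` as a hypothetical helper and `str.split` standing in for a real encoder so the example runs without a trained tokenizer.

```python
def chars_per_token(texts, encode):
    """Return total characters divided by total tokens over `texts`."""
    total_chars = sum(len(text) for text in texts)
    total_tokens = sum(len(encode(text)) for text in texts)
    # Guard against empty input so the helper never divides by zero.
    return total_chars / total_tokens if total_tokens else 0.0


# Stand-in encoder: whitespace splitting. With the tokenizer trained in
# Part 6, pass `lambda t: tokenizer.encode(t).ids` instead.
sample_texts = ["Once upon a time", "The end"]
print(f"{chars_per_token(sample_texts, str.split):.2f} chars/token")
```

Aggregating this way weights long texts more heavily than short ones, which matches how token counts add up during training.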

## Part 8: Vocabulary Analysis

In [None]:
if tokenizer is not None:
    vocab = tokenizer.get_vocab()

    print("Vocabulary Statistics:")
    print(f"  Total vocabulary size: {len(vocab)}")

    # Show the 20 lowest-ID tokens (special tokens and base characters come first)
    print("\n  Sample tokens:")
    sample_items = sorted(vocab.items(), key=lambda item: item[1])[:20]
    for token, idx in sample_items:
        print(f"    '{token}': {idx}")

    # Analyze token lengths
    token_lengths = [len(token) for token in vocab.keys()]

    print("\n  Token length statistics:")
    print(f"    Min: {min(token_lengths)} chars")
    print(f"    Max: {max(token_lengths)} chars")
    print(f"    Average: {np.mean(token_lengths):.2f} chars")
    print(f"    Median: {np.median(token_lengths):.0f} chars")

    # Visualize token lengths
    plt.figure(figsize=(10, 5))
    plt.hist(token_lengths, bins=30, alpha=0.7, edgecolor="black")
    plt.xlabel("Token Length (characters)")
    plt.ylabel("Count")
    plt.title("Distribution of Token Lengths in Vocabulary")
    plt.grid(alpha=0.3)
    plt.show()
else:
    print("Tokenizer not available.")

## Part 9: Saving the Tokenizer

In [None]:
if tokenizer is not None:
    # Create directory
    tokenizer_dir = project_root / "models" / "tokenizer"
    tokenizer_dir.mkdir(parents=True, exist_ok=True)

    # Save tokenizer
    tokenizer_path = tokenizer_dir / "demo_tokenizer.json"
    tokenizer.save(str(tokenizer_path))

    print(f"✓ Tokenizer saved to: {tokenizer_path}")
    print("\nTo use this tokenizer in training:")
    print("  from tokenizers import Tokenizer")
    print(f"  tokenizer = Tokenizer.from_file('{tokenizer_path}')")
else:
    print("No tokenizer to save.")

## Summary

In this notebook, you learned:

1. **Dataset Understanding**:
   - TinyStories and WritingPrompts datasets
   - Loading and exploring text data
   - Computing text statistics

2. **Text Analysis**:
   - Story length distributions
   - Vocabulary statistics
   - Character and word frequencies

3. **Tokenization**:
   - Different tokenization strategies
   - Byte-Pair Encoding (BPE)
   - Training a custom tokenizer
   - Analyzing compression ratios

4. **Practical Skills**:
   - Using HuggingFace tokenizers library
   - Visualizing data distributions
   - Saving and loading tokenizers

### Next Steps

In the next notebook, we'll dive into transformer architectures and understand the attention mechanism that powers modern language models!

### Key Takeaways

- **Good tokenization is crucial** for model efficiency and performance
- **BPE balances** vocabulary size and sequence length
- **Understanding your data** helps you make better modeling decisions
- **Compression ratio** affects training speed and memory usage

## Exercise

Try these experiments:

1. **Different Vocabulary Sizes**: Train tokenizers with vocabulary sizes of 1,000, 5,000, and 10,000. How does the compression ratio change?

2. **Compare Datasets**: If you have both TinyStories and WritingPrompts, train separate tokenizers and compare their vocabularies.

3. **Custom Text**: Tokenize your own creative writing. How well does the tokenizer handle it?

4. **Token Frequency**: Analyze which tokens appear most frequently in the encoded data.