# Intro

This notebook showcases the key features and processing steps of the **tiny_corpus_prep** library, a Polars-based corpus preparation tool for training tiny GPT-like models.



## Setup & Installation <a name="setup"></a>

First, let's import the necessary modules and verify the installation.


In [2]:
# Import core libraries
import polars as pl
import json
from pathlib import Path

# Import tiny_corpus_prep components (main API)
from tiny_corpus_prep import (
    process_corpus,
    DataPipeline,
    read_parquet,
    write_parquet_with_stats,
    filter_by_readability,
    filter_by_keywords,
    is_middle_school_level,
    CustomFunctionAnnotator,
    generate_stats,
)

# Import internal utilities (not part of public API but useful for demos)
from tiny_corpus_prep.normalize import normalize_text
from tiny_corpus_prep.synonyms import SynonymMapper

## Downloading Data <a name="download"></a>

Before processing, you can download real datasets from Wikipedia or FineWeb-edu. 

### Data Sources:
- **Wikipedia**: Simple English Wikipedia - great for general knowledge
- **FineWeb-edu**: Pre-filtered educational web content from HuggingFace

**Note**: These downloads can be large (100MB - 2GB+) and may take several minutes. The cells below are commented out by default - uncomment to run.


In [3]:
# Create a data directory for downloads
import os
from pathlib import Path

data_dir = Path("demo_data_downloads")
data_dir.mkdir(exist_ok=True)

print(f"Created data directory: {data_dir}")
print(f"Full path: {data_dir.absolute()}")


Created data directory: demo_data_downloads
Full path: /home/gillus/tiny_corpus_prep/demo_data_downloads


### Option A: Download Wikipedia Data

Downloads Simple English Wikipedia, extracts articles, and converts to parquet format.

**Requirements**: `bzip2` (system), `wikiextractor` (Python package)

**Size**: ~500MB compressed, expands to ~2GB

Wikipedia content is released under the Creative Commons Attribution-ShareAlike (CC BY-SA 4.0) license. This means that any model trained on Wikipedia data, as well as any outputs that constitute derivative works, must comply with CC BY-SA requirements. In practice, you must ensure that attribution is preserved (e.g., acknowledging Wikipedia as a source in documentation), and any redistributed derivative content is shared under the same license. 

In [4]:
# Download Wikipedia data
# 
# This will:
# 1. Download Simple Wikipedia dump (~500MB)
# 2. Extract the XML file
# 3. Use WikiExtractor to parse articles
# 4. Convert to parquet format
#

import subprocess

result = subprocess.run([
    "python", "bin/download_data.py",
    "--source", "wikipedia",
    "--output-dir", str(data_dir / "wikipedia"),
    "--date", "20251020"  # You can change the date
], capture_output=False)

if result.returncode == 0:
    print("\n✓ Wikipedia download complete!")
    wiki_file = data_dir / "wikipedia" / "wikipedia.parquet"
    print(f"File: {wiki_file}")
else:
    print(f"\n Download failed with code {result.returncode}")


tiny_corpus_prep - Data Download Tool (Step 0)

Configuration:
  Source: wikipedia
  Output directory: demo_data_downloads/wikipedia
  Date: 20251020


=== Downloading Simple Wikipedia (20251020) ===
✓ XML file already exists: demo_data_downloads/wikipedia/simplewiki-20251020-pages-articles-multistream.xml

=== Extracting Wikipedia content ===
Running WikiExtractor (output to demo_data_downloads/wikipedia/extracted)...


INFO: Preprocessing 'demo_data_downloads/wikipedia/simplewiki-20251020-pages-articles-multistream.xml' to collect template definitions: this may take some time.
INFO: Preprocessed 100000 pages
INFO: Preprocessed 200000 pages
INFO: Preprocessed 300000 pages
INFO: Preprocessed 400000 pages
INFO: Preprocessed 500000 pages
INFO: Loaded 43176 templates in 23.2s
INFO: Starting page extraction from demo_data_downloads/wikipedia/simplewiki-20251020-pages-articles-multistream.xml.
INFO: Using 7 extract processes.
INFO: Extracted 100000 articles (3791.7 art/s)
INFO: Extracted 200000 articles (3879.7 art/s)
INFO: Extracted 300000 articles (4301.3 art/s)
INFO: Finished 7-process extraction of 382659 articles in 94.1s (4067.0 art/s)
Processing files:   2%|▏         | 6/246 [00:00<00:04, 54.10it/s]


=== Converting to Parquet ===
Found 246 JSON files to process


Processing files: 100%|██████████| 246/246 [00:04<00:00, 56.52it/s]



Created DataFrame with 382658 rows
shape: (5, 7)
┌───────────────┬──────────────┬──────────────┬──────────────┬──────────────┬───────┬──────────────┐
│ filename      ┆ title        ┆ text         ┆ number_of_ch ┆ number_of_wo ┆ topic ┆ text_quality │
│ ---           ┆ ---          ┆ ---          ┆ aracters     ┆ rds          ┆ ---   ┆ ---          │
│ str           ┆ str          ┆ str          ┆ ---          ┆ ---          ┆ str   ┆ i64          │
│               ┆              ┆              ┆ i64          ┆ i64          ┆       ┆              │
╞═══════════════╪══════════════╪══════════════╪══════════════╪══════════���═══╪═══════╪══════════════╡
│ demo_data_dow ┆ Beverly      ┆ Beverly      ┆ 262          ┆ 41           ┆ N-A   ┆ 0            │
│ nloads/wikipe ┆ Hills Madam  ┆ Hills Madam  ┆              ┆              ┆       ┆              │
│ dia/…         ┆              ┆ (also know…  ┆              ┆              ┆       ┆              │
│ demo_data_dow ┆ The Light at ┆ The Li

### Option B: Download FineWeb-edu Data

Downloads pre-filtered educational web content from HuggingFace.

**Requirements**: None (uses Python's `requests` library)

**Size**: ~100MB per file


In [7]:
# Download FineWeb-edu data 
# This downloads educational web content from HuggingFace
# Each file is ~100MB and contains pre-filtered educational text
#

import subprocess

# Download 2 files as an example (you can increase --num-files)
result = subprocess.run([
    "python", "bin/download_data.py",
    "--source", "fineweb",
    "--output-dir", str(data_dir / "fineweb"),
    "--num-files", "1",  # Download 2 files
    "--start-index", "0"  # Start from file 0
], capture_output=False)

if result.returncode == 0:
    print("\n✓ FineWeb download complete!")
    fineweb_file = data_dir / "fineweb" / "fineweb_combined.parquet"
    print(f"File: {fineweb_file}")
else:
    print(f"\n Download failed with code {result.returncode}")

tiny_corpus_prep - Data Download Tool (Step 0)

Configuration:
  Source: fineweb
  Output directory: demo_data_downloads/fineweb
  Number of files: 1
  Start index: 0


=== Downloading FineWeb-edu ===
Downloading 1 file(s) starting from index 0
✓ File already exists: 000_00000.parquet

✓ Downloaded to: demo_data_downloads/fineweb/000_00000.parquet

❌ Error: Error creating dataset. Could not read schema from 'demo_data_downloads/fineweb/000_00000.parquet'. Is this a 'parquet' file?: Could not open Parquet input source 'demo_data_downloads/fineweb/000_00000.parquet': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.
Error type: ArrowInvalid

Full traceback:

❌ Download failed with code 1


Traceback (most recent call last):
  File "/home/gillus/tiny_corpus_prep/bin/download_data.py", line 364, in main
    output_file = download_fineweb(args.output_dir, args.num_files, args.start_index)
  File "/home/gillus/tiny_corpus_prep/bin/download_data.py", line 285, in download_fineweb
    preview_df = pl.read_parquet(result_path, use_pyarrow=True)
  File "/home/gillus/tiny_corpus_prep/.venv/lib/python3.13/site-packages/polars/_utils/deprecation.py", line 128, in wrapper
    return function(*args, **kwargs)
  File "/home/gillus/tiny_corpus_prep/.venv/lib/python3.13/site-packages/polars/_utils/deprecation.py", line 128, in wrapper
    return function(*args, **kwargs)
  File "/home/gillus/tiny_corpus_prep/.venv/lib/python3.13/site-packages/polars/io/parquet/functions.py", line 241, in read_parquet
    return _read_parquet_with_pyarrow(
        source,
    ...<4 lines>...
        rechunk=rechunk,
    )
  File "/home/gillus/tiny_corpus_prep/.venv/lib/python3.13/site-packages/polars/io/

In [5]:
df_original = pl.read_parquet('demo_data_downloads/wikipedia/wikipedia.parquet')

In [6]:
df_original = df_original.sample(10000)

df_original

filename,title,text,number_of_characters,number_of_words,topic,text_quality
str,str,str,i64,i64,str,i64
"""demo_data_downloads/wikipedia/…","""Emmy Award""","""The Emmy Awards are United Sta…",785,136,"""N-A""",0
"""demo_data_downloads/wikipedia/…","""Dooly (character)""","""""",0,0,"""N-A""",0
"""demo_data_downloads/wikipedia/…","""My Ex and Whys""","""My Ex and Whys is a 2017 Filip…",238,39,"""N-A""",0
"""demo_data_downloads/wikipedia/…","""Victoria, Labuan""","""Victoria, or locally known as …",154,27,"""N-A""",0
"""demo_data_downloads/wikipedia/…","""Dobrogea""","""""",0,0,"""N-A""",0
…,…,…,…,…,…,…
"""demo_data_downloads/wikipedia/…","""Crocheting""","""""",0,0,"""N-A""",0
"""demo_data_downloads/wikipedia/…","""Les Mujouls""","""Les Mujouls is a commune. It i…",136,22,"""N-A""",0
"""demo_data_downloads/wikipedia/…","""Kids' Shows""","""""",0,0,"""N-A""",0
"""demo_data_downloads/wikipedia/…","""Price elasticity of demand""","""In economics, the price elasti…",1528,263,"""N-A""",0


## Basic Text Normalization <a name="normalization"></a>

Text normalization includes lowercasing and cleaning up punctuation to create more consistent text for training.

In [None]:

# Apply to entire dataframe
df_normalized = df_original.with_columns(
    pl.col("text").map_elements(normalize_text, return_dtype=pl.Utf8).alias("text_normalized")
)

df_normalized.head()

## 5. Readability Filtering <a name="readability"></a>

The library uses the Flesch-Kincaid grade level to filter text by reading difficulty. This is useful for creating training data appropriate for specific education levels.
# Flesch-Kincaid Grade Level

The **Flesch-Kincaid Grade Level** is a readability test that indicates the U.S. school grade level required to understand a piece of text. It was developed by **Rudolf Flesch** and **J. Peter Kincaid** for the U.S. Navy in **1975**.

## Formula

The score is based on two key factors: sentence length and word complexity.

**Grade Level = a × (total words / total sentences) + b × (total syllables / total words) - c**

### Components

- **Average Sentence Length** = total words / total sentences  
  Longer sentences increase complexity.

- **Average Syllables per Word** = total syllables / total words  
  Words with more syllables indicate higher difficulty.

## Interpretation

| Score | Reading Level | Example Audience            |
|-------|----------------|------------------------------|
| 0–5   | Elementary     | 5th grade and below          |
| 6–8   | Middle School  | 6th–8th grade                |
| 9–12  | High School    | 9th–12th grade               |
| 13–16 | College        | College undergraduate        |
| 17+   | Graduate       | Graduate school and above    |



In [None]:
df_filtered = filter_by_readability(df_original, max_grade=8.0)

print(f"\nOriginal rows: {len(df_original)}")
print(f"Filtered rows: {len(df_filtered)}")
print(f"Removed: {len(df_original) - len(df_filtered)} rows")
print("\nRemaining texts:")
print(df_filtered)


## Building CEFR-Based Synonym Dictionaries <a name="cefr-synonyms"></a>

For more vocabulary simplification, you can build synonym dictionaries based on CEFR (Common European Framework of Reference for Languages) levels. This replaces difficult words (B2, C1, C2) with easier synonyms (A1, A2).

The library includes a script that uses WordNet and CEFR word lists to automatically generate these mappings.

Dataset from https://www.kaggle.com/datasets/nezahatkk/10-000-english-words-cerf-labelled


In [None]:
# Build a CEFR-based synonym dictionary
#
# This requires a CEFR wordlist CSV file. The library looks for it in common locations
# or you can provide your own.
#
# The script will:
# 1. Identify difficult words (B2, C1, C2 levels)
# 2. Find easier synonyms from A1, A2 levels using WordNet
# 3. Generate a JSON mapping file
#

import subprocess

# Note: You'll need a CEFR wordlist CSV file
# The script can use common sources like Oxford 5000 or Cambridge wordlists
# See CEFR_SYNONYMS.md for more information

result = subprocess.run([
    "python", "bin/build_synmap_from_cefr.py",
    "--cefr_csv", "data/ENGLISH_CERF_WORDS.csv",  # Provide your CEFR wordlist
    "--out_dir", "cefr_synonyms",
    "--easy_levels", "A1,A2",
    "--difficult_levels", "B2,C1,C2"
], capture_output=True, text=True)

if result.returncode == 0:
    print("✓ CEFR synonym dictionary built successfully!")
    print("\nOutput files:")
    print("  - cefr_synonyms/synonyms.json  (for use in pipeline)")
    print("  - cefr_synonyms/synonyms.csv   (detailed mapping)")
    print("  - cefr_synonyms/unmapped.txt   (words without mappings)")
    print("  - cefr_synonyms/build_stats.txt (statistics)")
else:
    print(f"✗ Build failed: {result.stderr}")


### Using CEFR Synonym Dictionary

Once you've built a CEFR synonym dictionary, you can use it in your processing pipeline for intelligent vocabulary simplification.


In [None]:
# Example: Using a CEFR synonym dictionary
# (This assumes you've built one using the script above)

cefr_synonyms_file = "cefr_synonyms/synonyms.json"


# Load and use the CEFR synonyms
cefr_mapper = SynonymMapper.from_json(cefr_synonyms_file)

# Test on some complex texts
complex_texts = [
    "The physician utilized sophisticated equipment.",
    "They commenced the endeavor immediately.",
    "The automobile accelerated rapidly."
]

print("CEFR-based vocabulary simplification:")
print("="*70)
for text in complex_texts:
    simplified = cefr_mapper.simplify_line(text)
    print(f"\\nOriginal:   {text}")
    print(f"Simplified: {simplified}")

# Use in a pipeline
print("\\n" + "="*70)
print("Using CEFR synonyms in a pipeline:")

pipeline = DataPipeline().add_synonym_mapper(mapping_path=cefr_synonyms_file)
df_cefr_simplified = pipeline.process(df_original)
print(df_cefr_simplified.select(["text"]).head(3))


## Custom Annotations <a name="custom"></a>

Add custom metadata to your text using annotation functions. This can include word counts, sentiment, complexity scores, or any other features you want to track.


In [None]:
# Define custom annotation function
def text_features(text: str) -> dict:
    """Extract custom features from text."""
    words = text.split()
    sentences = text.count(".") + text.count("!") + text.count("?")
    
    return {
        "word_count": len(words),
        "char_count": len(text),
        "sentence_count": max(1, sentences),  # At least 1
        "avg_word_length": round(sum(len(w) for w in words) / len(words), 1) if words else 0,
        "has_numbers": any(c.isdigit() for c in text),
    }

# Test on a sample text
sample = "The cat sat on the mat. It was a sunny day."
features = text_features(sample)

print("Sample text:")
print(f"  {sample}")
print("\nExtracted features:")
for key, value in features.items():
    print(f"  {key}: {value}")

# Create annotator and apply to dataframe
print("\n" + "="*70)
print("Applying custom annotations to all texts:")
print("="*70)

annotator = CustomFunctionAnnotator(text_features)
# Turn off normalization and dedup for this example to keep original text
pipeline = DataPipeline(text_column="text", normalize=False, dedup=False).add_annotator(annotator)
df_annotated = pipeline.process(df_original)

print("\nAnnotated dataframe (first 5 rows):")
print(df_annotated.head(5))


## Using Google Gemini for AI-Powered Annotations <a name="gemini"></a>

The library includes a built-in Google Gemini annotator that can automatically classify text by topic and education level using AI. This is more sophisticated than keyword-based classification.

**Requirements**:
- Install annotator dependencies: `uv pip install -e ".[annotators]"`
- Google API key: Set `GOOGLE_API_KEY` environment variable

**What it does**:
- Classifies text into 20+ topic categories (Mathematics, Computer Science, Life Sciences, etc.)
- Determines education level (primary school, middle school, high school, university, PhD)


In [None]:
# Using Gemini Annotator
#
# This uses Google's Gemini AI to classify text by topic and education level
# It's more accurate than keyword matching but requires an API key
#
# Setup:
# 1. Get a Google API key from https://makersuite.google.com/app/apikey
# 2. Set it as environment variable: export GOOGLE_API_KEY="your_key"
#    Or create a .env file with: GOOGLE_API_KEY=your_key
# 3. Install dependencies: uv pip install -e ".[annotators]"
#

from tiny_corpus_prep import GeminiAnnotator
import os
# 
# # Check if API key is available
api_key = os.getenv("GOOGLE_API_KEY") or os.getenv("MY_API_KEY")
# 
if api_key:
    try:
        # Initialize Gemini annotator
        gemini = GeminiAnnotator(
            model_name="gemini-2.5-flash-lite",  # Fast, cost-effective model
            temperature=0.1,  # Low temperature for consistent results
        )
        
        print("✓ Gemini annotator initialized")
        
        # Create pipeline with Gemini
        gemini_pipeline = (
            DataPipeline(text_column="text", normalize=False, dedup=False)
            .add_annotator(gemini)
        )
        
        # Process our sample data (this will call the Gemini API)
        print("\\nProcessing texts with Gemini (this may take a minute)...")
        df_gemini = gemini_pipeline.process(df_original)
        
        print("\\n✓ Gemini annotation complete!")
        print(f"\\nColumns: {df_gemini.columns}")
        
        # Show results
        print("\\nSample results:")
        print(df_gemini.select(["text", "topic", "education"]).head(5))
        
        # Show topic distribution
        print("\\nTopic distribution:")
        topic_counts = df_gemini["topic"].value_counts().sort("counts", descending=True)
        print(topic_counts)
        
        # Show education level distribution
        print("\\nEducation level distribution:")
        edu_counts = df_gemini["education"].value_counts().sort("counts", descending=True)
        print(edu_counts)
        
    except ImportError as e:
        print(f"✗ Import error: {e}")
        print("\\nPlease install annotation dependencies:")
        print("  uv pip install -e '.[annotators]'")
    except Exception as e:
        print(f"✗ Error: {e}")
else:
    print(" GOOGLE_API_KEY not found in environment")

### Gemini API Cost and Performance

**Cost**: Gemini 2.5 Flash Lite is very affordable:
- ~$0.15 per 1 million input tokens
- Processing 10,000 texts (~50 words each) costs approximately $0.05

**Performance**:
- Processes ~10-20 texts per second
- Can handle texts up to 15,000 characters

**Tips**:
- Use `gemini-2.5-flash-lite` for cost-effective processing
- Set `temperature=0.1` for consistent results
- Process in batches for large datasets


## Full Pipeline Example <a name="pipeline"></a>

Now let's combine multiple processing steps using the high-level `process_corpus` function.


In [13]:
# Process corpus with multiple filters
print("Processing corpus with:")
print("  - Text normalization")
print("  - Readability filter (max grade 10)")
print("  - Keyword filter (science, math, programming)")
print("  - Synonym mapping")
print("  - Deduplication")
print("\n" + "="*70)

stats = process_corpus(
    input_path='../.cache/nanochat/base_data/shard_00000.parquet',
    output_path="demo_processed.parquet",
    normalize=True,
    max_grade=10.0,
    keywords=["science" "math", "programming", "learning", "biology", "physics"],
    synonyms_map_path="./cefr_synonyms/synonyms.json",
    dedup=True,
)

print("\nProcessing complete!")
print(f"\nTotal rows in output: {stats['total_rows']}")
print(f"Total columns: {stats['total_columns']}")
print(f"Columns: {stats['columns']}")

# Load and display processed data
df_processed = pl.read_parquet("demo_processed.parquet")
print("\nProcessed data:")
print(df_processed)


Processing corpus with:
  - Text normalization
  - Readability filter (max grade 10)
  - Keyword filter (science, math, programming)
  - Synonym mapping
  - Deduplication

Reading input from: ../.cache/nanochat/base_data/shard_00000.parquet
Loaded 53248 rows with columns: ['text']
Starting pipeline with 53248 rows...
Normalizing text...
After normalization: 53248 rows
Applying keyword filter (5 keywords)...
After keyword filter: 8956 rows
Applying readability filter (max grade: 10.0)...
After readability filter: 2163 rows
Applying synonym mapping...
Deduplicating...
Removed 0 duplicate rows, 2163 remaining
Pipeline complete! Final: 2163 rows

Writing output to: demo_processed.parquet
Statistics written to: demo_processed.json

Processing complete!

Total rows in output: 2163
Total columns: 1
Columns: ['text']

Processed data:
shape: (2_163, 1)
┌─────────────────────────────────┐
│ text                            │
│ ---                             │
│ str                             │
