# 01 - Data Preprocessing & EDA

**Goal:** Explore OpenWebText, filter it, tokenize it, and produce clean train/val splits.

Steps:
1. Load dataset (small subset first, then full)
2. EDA: document lengths, token distributions, language stats
3. Filter: English only, remove test set overlap
4. Tokenize with GPT-2 tokenizer
5. Save as binary files for fast training

Once this works → extract into `src/data/preprocess.py`

In [2]:
from pathlib import Path
from datasets import load_dataset, load_from_disk
from transformers import GPT2TokenizerFast
import numpy as np
from collections import Counter

# All paths relative to project root
ROOT = Path("../..").resolve()
DATA_DIR = ROOT / "data"
print(f"Project root: {ROOT}")
print(f"Data dir: {DATA_DIR}")
print(f"Contents: {list(DATA_DIR.iterdir())}")

Project root: /Users/christof/ParrotLLM
Data dir: /Users/christof/ParrotLLM/data
Contents: [PosixPath('/Users/christof/ParrotLLM/data/lid.176.ftz'), PosixPath('/Users/christof/ParrotLLM/data/openwebtext-10k'), PosixPath('/Users/christof/ParrotLLM/data/wikitext-103-test')]


## 1. Load Dataset

Start with the 10k subset for fast iteration. Switch to full later.

In [3]:
# Load from local data/ folder (downloaded by scripts/download_data.py)
ds_small = load_from_disk(str(DATA_DIR / "openwebtext-10k"))
print(f"Documents: {len(ds_small)}")
print(f"Columns: {ds_small.column_names}")
print(f"\nExample document (first 500 chars):")
print(ds_small[0]["text"][:500])

Documents: 10000
Columns: ['text']

Example document (first 500 chars):
Port-au-Prince, Haiti (CNN) -- Earthquake victims, writhing in pain and grasping at life, watched doctors and nurses walk away from a field hospital Friday night after a Belgian medical team evacuated the area, saying it was concerned about security.

The decision left CNN Chief Medical Correspondent Sanjay Gupta as the only doctor at the hospital to get the patients through the night.

CNN initially reported, based on conversations with some of the doctors, that the United Nations ordered the B


## 2. Tokenizer Setup

In [4]:
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
print(f"Vocab size: {tokenizer.vocab_size}")
print(f"Example: 'Hello world' -> {tokenizer.encode('Hello world')}")

Vocab size: 50257
Example: 'Hello world' -> [15496, 995]


## 3. EDA: Document & Token Statistics

In [5]:
# Tokenize a sample and look at length distribution
doc_lengths = []
for doc in ds_small:
    tokens = tokenizer.encode(doc["text"])
    doc_lengths.append(len(tokens))

doc_lengths = np.array(doc_lengths)
print(f"Total tokens in 10k subset: {doc_lengths.sum():,}")
print(f"Mean doc length: {doc_lengths.mean():.0f} tokens")
print(f"Median doc length: {np.median(doc_lengths):.0f} tokens")
print(f"Min: {doc_lengths.min()}, Max: {doc_lengths.max()}")
print(f"Docs > 1024 tokens: {(doc_lengths > 1024).sum()} ({(doc_lengths > 1024).mean()*100:.1f}%)")

Token indices sequence length is longer than the specified maximum sequence length for this model (1217 > 1024). Running this sequence through the model will result in indexing errors


Total tokens in 10k subset: 11,133,993
Mean doc length: 1113 tokens
Median doc length: 716 tokens
Min: 148, Max: 45207
Docs > 1024 tokens: 3270 (32.7%)


In [6]:
# Extrapolate to full dataset
# Full OWT has ~8M docs. Our 10k sample gives us a rough estimate.
full_docs = 8_013_769
est_total_tokens = int(doc_lengths.mean() * full_docs)
print(f"Estimated total tokens in full OWT: {est_total_tokens:,} (~{est_total_tokens/1e9:.1f}B)")

Estimated total tokens in full OWT: 8,922,524,794 (~8.9B)
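The extrapolation above is a point estimate. A minimal sketch of how an error bar could be attached to it, assuming the 10k subset behaves like a random sample of the full corpus (the notebook does not confirm this). `extrapolate_total` is a hypothetical helper, and the lengths below are synthetic stand-ins, not the real `doc_lengths`:

```python
import numpy as np

def extrapolate_total(sample_lengths, n_full_docs, z=1.96):
    """Estimate total tokens in the full corpus with a normal-approximation interval."""
    lengths = np.asarray(sample_lengths, dtype=np.float64)
    mean = lengths.mean()
    # Standard error of the sample mean (ddof=1 gives the sample standard deviation)
    sem = lengths.std(ddof=1) / np.sqrt(len(lengths))
    estimate = mean * n_full_docs
    half_width = z * sem * n_full_docs
    return estimate, estimate - half_width, estimate + half_width

# Synthetic heavy-tailed lengths for illustration only (not real data)
rng = np.random.default_rng(0)
demo_lengths = rng.lognormal(mean=6.5, sigma=0.8, size=10_000)

est, lo, hi = extrapolate_total(demo_lengths, n_full_docs=8_013_769)
print(f"{est / 1e9:.2f}B tokens (approx. 95% interval {lo / 1e9:.2f}B to {hi / 1e9:.2f}B)")
```

In the notebook, passing `doc_lengths` instead of `demo_lengths` would give the interval for the actual sample. Because document lengths are heavy-tailed (max 45207 against a median of 716), the interval is only a rough guide.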


## 4. Language Detection

Filter out non-English documents using fastText's language identification model (`lid.176.ftz`, stored in `data/`).

In [10]:
# Requires the `fasttext` package in the active environment (e.g. `pip install fasttext`)
import fasttext

# Load from local data/ folder (downloaded by scripts/download_data.py)
model_path = str(DATA_DIR / "lid.176.ftz")
lang_model = fasttext.load_model(model_path)
print(f"Loaded language detection model from {model_path}")

ModuleNotFoundError: No module named 'fasttext'

In [11]:
def detect_language(text, model):
    """Return (language code, confidence) predicted from the first non-empty line."""
    # fasttext's predict() rejects text containing newlines, so classify a single line
    first_line = next((ln.strip() for ln in text.split("\n") if ln.strip()), "")
    if not first_line:
        return "unknown", 0.0
    labels, probs = model.predict(first_line)
    lang = labels[0].replace("__label__", "")
    conf = float(probs[0])
    return lang, conf
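Classifying a single line can be fragile when that line is a short title or a byline. One possible alternative, sketched here under assumptions, is to collapse the whole document onto one line and classify a capped prefix. `flatten_for_langid`, `detect_language_flat`, and the `max_chars` value are hypothetical, and `StubLangModel` only mimics the `(labels, probabilities)` return shape so the sketch runs without fasttext installed:

```python
def flatten_for_langid(text, max_chars=2000):
    """Collapse all whitespace (including newlines) into single spaces and cap the length."""
    return " ".join(text.split())[:max_chars]

def detect_language_flat(text, model, max_chars=2000):
    """Variant of detect_language that classifies a flattened document prefix."""
    flat = flatten_for_langid(text, max_chars)
    if not flat:
        return "unknown", 0.0
    labels, probs = model.predict(flat)
    return labels[0].replace("__label__", ""), float(probs[0])

class StubLangModel:
    """Hypothetical stand-in for the loaded fasttext model (illustration only)."""
    def predict(self, text):
        # Mirror the single-line constraint so the flattening is actually exercised
        if "\n" in text:
            raise ValueError("predict processes one line at a time")
        return ("__label__en",), [0.99]

print(detect_language_flat("Title\n\nBody text of the article.", StubLangModel()))
```

With the real model, `lang_model` would be passed in place of `StubLangModel()`. Whether the prefix or the first line gives better results on OpenWebText would need to be checked on the sample.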

In [14]:
# Check language distribution in our sample
lang_counts = Counter()
non_english = []

for i, doc in enumerate(ds_small):
    lang, conf = detect_language(doc["text"], lang_model)
    lang_counts[lang] += 1
    if lang != "en":
        non_english.append((i, lang, conf, doc["text"][:100]))

print("Language distribution (top 10):")
for lang, count in lang_counts.most_common(10):
    print(f"  {lang}: {count} ({count/len(ds_small)*100:.1f}%)")

print(f"\nNon-English documents: {len(non_english)} ({len(non_english)/len(ds_small)*100:.1f}%)")

NameError: name 'lang_model' is not defined

In [13]:
# Look at some non-English examples
print("Sample non-English documents:")
for idx, lang, conf, preview in non_english[:5]:
    print(f"  [{lang} conf={conf:.2f}] {preview}...")

Sample non-English documents:


## 5. Test Set Decontamination

**CRITICAL:** Remove any overlap between training data and test sets.

Test sets to exclude:
- Wikitext-103 test split
- NLP26 OpenWebText eval split

In [None]:
# Load Wikitext-103 test set from local data/ folder
wiki_test = load_from_disk(str(DATA_DIR / "wikitext-103-test"))
print(f"Wikitext-103 test documents: {len(wiki_test)}")
print(f"Example: {wiki_test[0]['text'][:200]}")

In [None]:
# Build a set of n-grams from test sets for overlap detection
def extract_ngrams(text, n=13):
    """Extract character-level n-grams from text for contamination detection."""
    text = text.lower().strip()
    return {text[i:i+n] for i in range(len(text) - n + 1)}

# Build test set n-gram index
test_ngrams = set()
for doc in wiki_test:
    if doc["text"].strip():
        test_ngrams.update(extract_ngrams(doc["text"]))

print(f"Test set n-grams: {len(test_ngrams):,}")

In [None]:
# TODO: Also load and index the NLP26 OWT eval split
# Download from: https://drive.switch.ch/index.php/s/6TLGQFEIkAPJ72K
# Place in data/ directory

# owt_eval = load_dataset("path/to/eval/split")
# for doc in owt_eval:
#     test_ngrams.update(extract_ngrams(doc["text"]))

In [None]:
def is_contaminated(text, test_ngrams, n=13, threshold=0.8):
    """Check if a document has high overlap with test set n-grams."""
    doc_ngrams = extract_ngrams(text, n)
    if not doc_ngrams:
        return False
    overlap = len(doc_ngrams & test_ngrams) / len(doc_ngrams)
    return overlap > threshold

# Test on our sample
contaminated = sum(1 for doc in ds_small if is_contaminated(doc["text"], test_ngrams))
print(f"Contaminated documents in 10k sample: {contaminated}")
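To get a feel for what the 0.8 threshold does, here is a toy sketch of the same overlap ratio on three hand-written strings. `char_ngrams` and `overlap_fraction` are hypothetical re-implementations that make the block self-contained, and the example sentences are invented for illustration:

```python
def char_ngrams(text, n=13):
    """Set of lowercase character n-grams, same idea as extract_ngrams above."""
    text = text.lower().strip()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def overlap_fraction(text, reference_ngrams, n=13):
    """Share of the document's distinct n-grams that also occur in the reference set."""
    grams = char_ngrams(text, n)
    return len(grams & reference_ngrams) / len(grams) if grams else 0.0

test_passage = "The quick brown fox jumps over the lazy dog near the river bank."
reference = char_ngrams(test_passage)

copy = test_passage
# The test passage quoted verbatim at the end of a longer, otherwise unrelated document
embedded = "Unrelated introduction that goes on for a while. " * 5 + test_passage
unrelated = "Completely different sentence about tokenizers and binary files."

for name, doc in [("copy", copy), ("embedded", embedded), ("unrelated", unrelated)]:
    print(f"{name}: {overlap_fraction(doc, reference):.2f}")
```

The ratio is computed per document, so a long document that quotes a test passage verbatim can still fall below the threshold, as the `embedded` case does here. If that case matters, a check on the absolute number of shared n-grams might be worth adding alongside the ratio.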

## 6. Full Preprocessing Pipeline (preview)

Combine all filters and tokenize. This is what will become `src/data/preprocess.py`.

In [None]:
def preprocess_document(doc, tokenizer, lang_model, test_ngrams):
    """Process one document. Returns (token IDs, "kept") or (None, reason) if filtered out."""
    text = doc["text"]

    # Filter: skip empty
    if not text.strip():
        return None, "empty"

    # Filter: English only
    lang, conf = detect_language(text, lang_model)
    if lang != "en" or conf < 0.5:
        return None, "non_english"

    # Filter: test set contamination
    if is_contaminated(text, test_ngrams):
        return None, "contaminated"

    # Tokenize
    tokens = tokenizer.encode(text)

    # Filter: too short
    if len(tokens) < 64:
        return None, "too_short"

    return tokens, "kept"


# Run on sample (Counter is already imported at the top of the notebook)
stats = Counter()
all_tokens = []

for doc in ds_small:
    tokens, status = preprocess_document(doc, tokenizer, lang_model, test_ngrams)
    stats[status] += 1
    if tokens is not None:
        all_tokens.extend(tokens)

print("Preprocessing stats:")
for status, count in stats.most_common():
    print(f"  {status}: {count} ({count/len(ds_small)*100:.1f}%)")
print(f"\nTotal tokens kept: {len(all_tokens):,}")

## 7. Save as Binary

Save the tokenized data as a flat binary file of `uint16` token IDs, which training code can later read back as a memory-mapped numpy array.

In [None]:
# Convert to numpy uint16 (vocab size 50257 fits in uint16 max=65535)
token_array = np.array(all_tokens, dtype=np.uint16)
print(f"Array shape: {token_array.shape}")
print(f"Size on disk: {token_array.nbytes / 1e6:.1f} MB")

# Save (for the real pipeline, split into train.bin and val.bin)
# out_dir = DATA_DIR / "processed"
# out_dir.mkdir(exist_ok=True)
# token_array.tofile(str(out_dir / "train.bin"))
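A minimal sketch of the write/read round trip, assuming the training side reads the file with `np.memmap`. The file name `demo.bin`, the toy token values, and `block_size` are placeholders chosen for illustration:

```python
import tempfile
from pathlib import Path

import numpy as np

# Toy token stream: "Hello world" (15496, 995) followed by GPT-2's end-of-text id, twice
tokens = np.array([15496, 995, 50256, 15496, 995, 50256], dtype=np.uint16)

with tempfile.TemporaryDirectory() as tmp:
    bin_path = Path(tmp) / "demo.bin"  # hypothetical file name
    tokens.tofile(bin_path)

    # Raw .bin files have no header, so the dtype must be supplied again when reading
    data = np.memmap(bin_path, dtype=np.uint16, mode="r")

    block_size = 4
    # Copy one input window and its next-token targets out of the mapping
    x = np.array(data[0:block_size], dtype=np.int64)
    y = np.array(data[1:block_size + 1], dtype=np.int64)
    roundtrip_ok = bool(np.array_equal(np.asarray(data), tokens))

    del data  # release the mapping before the temporary directory is removed

print(x, y, roundtrip_ok)
```

Reading with the wrong dtype would silently produce garbage token IDs, so the `uint16` choice needs to be shared between the preprocessing and training code.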

## Next Steps

1. Run this on the **full** OpenWebText dataset: `load_dataset("Skylion007/openwebtext")`
2. Add the NLP26 eval split to decontamination
3. Create train/val split (99%/1%)
4. Extract working code into `src/data/preprocess.py`
5. Wire into `python main.py --stage preprocess --dataset-size full --lang en`
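
For step 3, one possible shape for the 99%/1% split is sketched below. It assumes the split is done at document level (so no document straddles train and val) and that documents are separated by GPT-2's `<|endoftext|>` token (id 50256); neither choice is fixed by this notebook. `split_and_pack` is a hypothetical helper and the toy documents are made up:

```python
import numpy as np

EOT_ID = 50256  # GPT-2's <|endoftext|> token id

def split_and_pack(docs_tokens, val_fraction=0.01, seed=0):
    """Shuffle documents, hold out a fraction, and pack each split into one uint16 array."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(docs_tokens))
    n_val = max(1, int(round(len(docs_tokens) * val_fraction)))
    val_idx, train_idx = order[:n_val], order[n_val:]

    def pack(indices):
        flat = []
        for i in indices:
            flat.extend(docs_tokens[i])
            flat.append(EOT_ID)  # mark the document boundary
        return np.array(flat, dtype=np.uint16)

    return pack(train_idx), pack(val_idx)

# 200 toy documents of three tokens each (illustration only)
demo_docs = [[i, i + 1, i + 2] for i in range(0, 600, 3)]
train, val = split_and_pack(demo_docs)
print(len(train), len(val))
```

Each array could then be written with `tofile` to `train.bin` and `val.bin`. For the full dataset the Python list in `pack` would likely be too slow and memory-hungry, so a streaming write would be needed instead.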