# 01 - Data Preprocessing & EDA

**Goal:** Explore OpenWebText, filter it, tokenize it, and produce clean train/val splits.

Steps:
1. Load dataset (small subset first, then full)
2. EDA: document lengths, token distributions, language stats
3. Filter: English only, remove test set overlap
4. Tokenize with GPT-2 tokenizer
5. Save as binary files for fast training

Once this works → extract into `src/data/preprocess.py`

In [2]:
from pathlib import Path
from datasets import load_dataset, load_from_disk
from transformers import GPT2TokenizerFast
import numpy as np
import re
from collections import Counter, defaultdict

# All paths relative to project root
ROOT = Path("../..").resolve()
DATA_DIR = ROOT / "data"
print(f"Project root: {ROOT}")
print(f"Data dir: {DATA_DIR}")
print(f"Contents: {list(DATA_DIR.iterdir())}")

Project root: /Users/christof/ParrotLLM
Data dir: /Users/christof/ParrotLLM/data
Contents: [PosixPath('/Users/christof/ParrotLLM/data/lid.176.ftz'), PosixPath('/Users/christof/ParrotLLM/data/openwebtext-10k'), PosixPath('/Users/christof/ParrotLLM/data/wikitext-103-test')]


## 1. Load Dataset

Start with the 10k subset for fast iteration. Switch to full later.

In [3]:
# Load from local data/ folder (downloaded by scripts/download_data.py)
ds_small = load_from_disk(str(DATA_DIR / "openwebtext-10k"))
print(f"Documents: {len(ds_small)}")
print(f"Columns: {ds_small.column_names}")
print(f"\nExample document (first 500 chars):")
print(ds_small[0]["text"][:500])

Documents: 10000
Columns: ['text']

Example document (first 500 chars):
Port-au-Prince, Haiti (CNN) -- Earthquake victims, writhing in pain and grasping at life, watched doctors and nurses walk away from a field hospital Friday night after a Belgian medical team evacuated the area, saying it was concerned about security.

The decision left CNN Chief Medical Correspondent Sanjay Gupta as the only doctor at the hospital to get the patients through the night.

CNN initially reported, based on conversations with some of the doctors, that the United Nations ordered the B


## 2. Remove Eval/Test Overlaps Before EDA

The instructors require that our exploratory analysis already excludes anything that might leak into the evaluation set.
Run the cell below to build the n-gram index from Wikitext-103 (and the optional NLP26 split) and filter the 10k sample.

In [None]:
def extract_ngrams(text: str, n: int = 13) -> set[str]:
    text = text.lower().strip()
    if len(text) < n:
        return set()
    return {text[i:i+n] for i in range(len(text) - n + 1)}

def build_test_ngrams(data_dir: Path) -> set[str]:
    ngrams: set[str] = set()
    wiki_path = data_dir / 'wikitext-103-test'
    if wiki_path.exists():
        wiki_test = load_from_disk(str(wiki_path))
        for doc in wiki_test:
            if doc['text'].strip():
                ngrams.update(extract_ngrams(doc['text']))
    else:
        print('[WARN] Missing wikitext-103-test; run src/scripts/download_data.py')
    owt_eval_path = data_dir / 'owt-eval' / 'NLP26' / 'NLP26_OWT_eval' / 'test'
    if owt_eval_path.exists():
        owt_eval = load_from_disk(str(owt_eval_path))
        for doc in owt_eval:
            if doc['text'].strip():
                ngrams.update(extract_ngrams(doc['text']))
    else:
        print('[INFO] NLP26 eval split not found locally (optional download).')
    print(f'Test-set n-grams: {len(ngrams):,}')
    return ngrams

def is_contaminated(text: str, test_ngrams: set[str], n: int = 13, threshold: float = 0.8) -> bool:
    if not test_ngrams:
        return False
    doc_ngrams = extract_ngrams(text, n)
    if not doc_ngrams:
        return False
    overlap = len(doc_ngrams & test_ngrams) / len(doc_ngrams)
    return overlap > threshold

test_ngrams = build_test_ngrams(DATA_DIR)

if test_ngrams:
    original_len = len(ds_small)
    ds_small = ds_small.filter(lambda sample: not is_contaminated(sample['text'], test_ngrams))
    print(f'Removed {original_len - len(ds_small)} contaminated docs ({(original_len - len(ds_small)) / original_len * 100:.2f}%).')
else:
    print('Skipping filtering because no test-set data was available.')
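
To make the 0.8 threshold concrete, here is a small sketch on toy strings (not drawn from Wikitext or OpenWebText). `char_ngrams` and `overlap_ratio` are illustrative names that mirror `extract_ngrams` and the ratio inside `is_contaminated`. It suggests that a verbatim copy scores 1.0, while a document that merely embeds a reference passage inside unrelated text can stay well under the threshold and would be kept.

```python
def char_ngrams(text: str, n: int = 13) -> set[str]:
    """Lowercased character n-grams, mirroring extract_ngrams above."""
    text = text.lower().strip()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def overlap_ratio(doc: str, reference_ngrams: set[str], n: int = 13) -> float:
    """Fraction of the document's n-grams that also occur in the reference set."""
    doc_ngrams = char_ngrams(doc, n)
    if not doc_ngrams:
        return 0.0
    return len(doc_ngrams & reference_ngrams) / len(doc_ngrams)


# Toy strings, invented for this illustration only.
reference = "The quick brown fox jumps over the lazy dog near the river bank."
unrelated = "Completely different sentence about tokenizers and binary files."
docs = {
    "verbatim_copy": reference.lower(),
    "unrelated": unrelated,
    # The reference passage embedded in an otherwise unrelated document.
    "embedded": reference + " " + unrelated,
}

ref_ngrams = char_ngrams(reference)
scores = {name: overlap_ratio(text, ref_ngrams) for name, text in docs.items()}
for name, score in scores.items():
    print(f"{name}: overlap={score:.2f} flagged={score > 0.8}")
```

If partial leakage matters for the evaluation, a lower threshold or a check on the absolute number of shared n-grams may be worth testing on the sample.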


## 3. Tokenizer Setup

In [4]:
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
print(f"Vocab size: {tokenizer.vocab_size}")
print(f"Example: 'Hello world' -> {tokenizer.encode('Hello world')}")

Vocab size: 50257
Example: 'Hello world' -> [15496, 995]


## 4. EDA: Document & Token Statistics

In [5]:
# Tokenize a sample and look at length distribution
doc_lengths = []
for doc in ds_small:
    tokens = tokenizer.encode(doc["text"])
    doc_lengths.append(len(tokens))

doc_lengths = np.array(doc_lengths)
print(f"Total tokens in 10k subset: {doc_lengths.sum():,}")
print(f"Mean doc length: {doc_lengths.mean():.0f} tokens")
print(f"Median doc length: {np.median(doc_lengths):.0f} tokens")
print(f"Min: {doc_lengths.min()}, Max: {doc_lengths.max()}")
print(f"Docs > 1024 tokens: {(doc_lengths > 1024).sum()} ({(doc_lengths > 1024).mean()*100:.1f}%)")

Token indices sequence length is longer than the specified maximum sequence length for this model (1217 > 1024). Running this sequence through the model will result in indexing errors


Total tokens in 10k subset: 11,133,993
Mean doc length: 1113 tokens
Median doc length: 716 tokens
Min: 148, Max: 45207
Docs > 1024 tokens: 3270 (32.7%)


In [6]:
# Extrapolate to full dataset
# Full OWT has ~8M docs. Our 10k sample gives us a rough estimate.
full_docs = 8_013_769
est_total_tokens = int(doc_lengths.mean() * full_docs)
print(f"Estimated total tokens in full OWT: {est_total_tokens:,} (~{est_total_tokens/1e9:.1f}B)")

Estimated total tokens in full OWT: 8,922,524,794 (~8.9B)
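
The extrapolation multiplies the sample mean by the document count, so its precision depends on how noisy that mean is. The sketch below attaches a rough 95% interval using the standard error of the mean. It assumes the 10k subset is a random draw from the full corpus, which this notebook does not verify, and it runs on toy lengths rather than the real `doc_lengths`. `extrapolate_total` is a hypothetical helper.

```python
import math
import statistics


def extrapolate_total(sample_lengths, full_docs, z=1.96):
    """Scale the sample mean to the full corpus and attach a rough interval."""
    n = len(sample_lengths)
    mean = statistics.fmean(sample_lengths)
    # Standard error of the mean; only meaningful if the sample is random.
    sem = statistics.stdev(sample_lengths) / math.sqrt(n)
    estimate = mean * full_docs
    margin = z * sem * full_docs
    return estimate, estimate - margin, estimate + margin


# Toy lengths for illustration, not the notebook's measurements.
toy_lengths = [148, 500, 716, 900, 1200, 3000]
est, low, high = extrapolate_total(toy_lengths, full_docs=8_013_769)
print(f"~{est / 1e9:.2f}B tokens (rough 95% interval: {low / 1e9:.2f}B to {high / 1e9:.2f}B)")
```

In the notebook, passing `doc_lengths.tolist()` instead of `toy_lengths` would give the interval for the real sample. With a long-tailed length distribution (the maximum above is 45,207 tokens), the interval is itself only approximate.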


## 5. Language Detection

Filter out non-English documents. Use fasttext's language detection model.

In [10]:
import fasttext

# Load from local data/ folder (downloaded by scripts/download_data.py)
model_path = str(DATA_DIR / "lid.176.ftz")
lang_model = fasttext.load_model(model_path)
print(f"Loaded language detection model from {model_path}")

ModuleNotFoundError: No module named 'fasttext'

In [11]:
def detect_language(text, model):
    """Return detected language code and confidence."""
    # fasttext needs single line, no newlines
    first_line = text.split("\n")[0].strip()
    if not first_line:
        return "unknown", 0.0
    pred = model.predict(first_line)
    lang = pred[0][0].replace("__label__", "")
    conf = pred[1][0]
    return lang, conf
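
`detect_language` classifies only the first line, which may be a short headline or dateline rather than body text. One possible alternative, sketched here, flattens all whitespace and classifies a longer prefix. `detect_language_flat` and `max_chars` are assumptions, not part of the pipeline. `StubLangModel` is a hypothetical stand-in that only imitates the `(labels, probabilities)` return shape used above, so the example runs without fasttext installed.

```python
class StubLangModel:
    """Hypothetical stand-in for the language model, for this demo only.

    It imitates the (labels, probabilities) return shape and refuses input
    containing a newline, matching the single-line constraint noted above.
    """

    def predict(self, text):
        if "\n" in text:
            raise ValueError("input must be a single line")
        # Crude toy rule: English if the word "the" appears.
        label = "__label__en" if " the " in f" {text.lower()} " else "__label__de"
        return (label,), [0.9]


def detect_language_flat(text, model, max_chars=1000):
    """Classify a whitespace-flattened prefix instead of only the first line."""
    flat = " ".join(text.split())[:max_chars]
    if not flat:
        return "unknown", 0.0
    labels, probs = model.predict(flat)
    return labels[0].replace("__label__", ""), float(probs[0])


doc = "Berlin\n\nThe mayor said the city would reopen the bridge."
first_line_only = StubLangModel().predict(doc.split("\n")[0].strip())[0][0]
flattened = detect_language_flat(doc, StubLangModel())
print(first_line_only, flattened)
```

With the stub, the one-word first line and the flattened prefix get different labels, which is the failure mode worth checking for on the real model.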

In [14]:
# Check language distribution in our sample
lang_counts = Counter()
non_english = []

for i, doc in enumerate(ds_small):
    lang, conf = detect_language(doc["text"], lang_model)
    lang_counts[lang] += 1
    if lang != "en":
        non_english.append((i, lang, conf, doc["text"][:100]))

print("Language distribution (top 10):")
for lang, count in lang_counts.most_common(10):
    print(f"  {lang}: {count} ({count/len(ds_small)*100:.1f}%)")

print(f"\nNon-English documents: {len(non_english)} ({len(non_english)/len(ds_small)*100:.1f}%)")

NameError: name 'lang_model' is not defined

In [13]:
# Look at some non-English examples
print("Sample non-English documents:")
for idx, lang, conf, preview in non_english[:5]:
    print(f"  [{lang} conf={conf:.2f}] {preview}...")

Sample non-English documents:


## 6. Formatting & Noise Inspection

We need evidence-backed cleaning rules before touching the Python preprocessing stage.
The cells below scan a sample of OpenWebText for HTML fragments, fenced code blocks,
aggregator headers (e.g. `Title:`, `Category:`), stack traces, and corrupted characters.
Run this after `src/scripts/download_data.py` has populated `data/openwebtext-10k`.

In [None]:
HTML_TAG_RE = re.compile(r'<\/?[a-zA-Z][^>]{0,200}>')
CODE_BLOCK_RE = re.compile(r'```[\s\S]+?```|<code>|</code>|&lt;code&gt;')
HEADER_RE = re.compile(r'^(?:title|category|comments?|permalink|posted by|tags):', re.IGNORECASE | re.MULTILINE)
STACKTRACE_RE = re.compile(r'Traceback \(most recent call last\)|Exception in thread', re.IGNORECASE)
WEIRD_CHAR_RE = re.compile(r'[^\t\n\r\x20-\x7E]')

def detect_artifacts(text: str) -> list[str]:
    """Return artifact tags present in the text."""
    tags = []
    if HTML_TAG_RE.search(text):
        tags.append('html')
    if CODE_BLOCK_RE.search(text):
        tags.append('code')
    if HEADER_RE.search(text):
        tags.append('header')
    if STACKTRACE_RE.search(text):
        tags.append('stacktrace')
    if WEIRD_CHAR_RE.search(text):
        tags.append('non_ascii')
    return tags
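
The `non_ascii` tag fires on any character outside printable ASCII, so it also flags legitimate accented letters and typographic quotes. A possible refinement, sketched below, separates likely encoding damage from ordinary non-ASCII text. `MOJIBAKE_RE` and `classify_non_ascii` are assumptions for illustration. The pattern covers only the Unicode replacement character and two common traces of UTF-8 decoded as Windows-1252 or Latin-1, so it is a heuristic, not a complete detector.

```python
import re

NON_ASCII_RE = re.compile(r'[^\t\n\r\x20-\x7E]')
# U+FFFD is the replacement character. U+00E2 U+20AC is the start of a
# curly quote or dash that was decoded as Windows-1252. U+00C3 followed by a
# character in U+0080..U+00BF is a Latin letter that was decoded as Latin-1.
MOJIBAKE_RE = re.compile('\ufffd|\u00e2\u20ac|\u00c3[\u0080-\u00bf]')


def classify_non_ascii(text: str) -> str:
    """Return 'mojibake', 'non_ascii', or 'ascii' for a piece of text."""
    if MOJIBAKE_RE.search(text):
        return 'mojibake'
    if NON_ASCII_RE.search(text):
        return 'non_ascii'
    return 'ascii'


# Toy samples written with escapes so the source stays ASCII-only.
samples = {
    'plain': 'Plain ASCII sentence.',
    'accented': 'A na\u00efve caf\u00e9 visit.',
    'broken_quote': 'don\u00e2\u20ac\u2122t',
    'replacement': 'price: 5\ufffd',
}
labels = {name: classify_non_ascii(text) for name, text in samples.items()}
print(labels)
```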


In [None]:
MAX_DOCS = 2000  # keep lightweight
artifact_counts = Counter()
artifact_examples = defaultdict(list)

for idx, doc in enumerate(ds_small):
    if idx >= MAX_DOCS:
        break
    tags = detect_artifacts(doc['text'])
    for tag in tags:
        artifact_counts[tag] += 1
        if len(artifact_examples[tag]) < 5:
            preview = doc['text'][:280].replace('\n', ' ')
            artifact_examples[tag].append((idx, preview))

print('Artifact counts (sample of', min(MAX_DOCS, len(ds_small)), 'docs):')
for tag, count in artifact_counts.most_common():
    pct = count / min(MAX_DOCS, len(ds_small)) * 100
    print(f'  {tag}: {count} ({pct:.1f}%)')

if not artifact_counts:
    print('No artifacts detected in the inspected sample.')


In [None]:
for tag, rows in artifact_examples.items():
    print(f'\nExamples for {tag.upper()}:')
    for idx, preview in rows:
        print(f'  [doc #{idx}] {preview[:250]}...')
    if not rows:
        print('  (no samples captured)')


### Boilerplate & Metadata Markers
Scraped dumps often contain Gutenberg disclaimers or RSS metadata headers that add no value for LAMBADA/HellaSwag/OpenBookQA. The cell below quantifies how frequently these markers appear in our sample and captures examples so we know what to strip during preprocessing.

In [None]:
BOILERPLATE_EXACT_MARKERS = (
    '<<< begin of text >>>',
    '<<< end of text >>>',
    '*** start of this project gutenberg ebook',
    '*** end of this project gutenberg ebook',
    '*** start of the project gutenberg ebook',
    '*** end of the project gutenberg ebook',
)
BOILERPLATE_PREFIX_MARKERS = (
    '*** start of',
    '*** end of',
    'article url:',
    'source:',
    'category:',
    'tags:',
    'title:',
    'url:',
)

def find_boilerplate_markers(text: str) -> list[tuple[str, str]]:
    matches = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        lower = line.lower()
        if lower in BOILERPLATE_EXACT_MARKERS:
            matches.append(('boilerplate_exact', line))
            continue
        for prefix in BOILERPLATE_PREFIX_MARKERS:
            if lower.startswith(prefix):
                tag = f"boilerplate_prefix:{prefix.rstrip(':')}"
                matches.append((tag, line))
                break
    return matches

boilerplate_counts = Counter()
boilerplate_samples = defaultdict(list)
for idx, doc in enumerate(ds_small):
    if idx >= MAX_DOCS:
        break
    for tag, line in find_boilerplate_markers(doc['text']):
        boilerplate_counts[tag] += 1
        if len(boilerplate_samples[tag]) < 3:
            boilerplate_samples[tag].append((idx, line[:240]))

if not boilerplate_counts:
    print('No boilerplate markers detected in the inspected sample.')
else:
    print('Boilerplate markers (first', min(MAX_DOCS, len(ds_small)), 'docs):')
    for tag, count in boilerplate_counts.most_common():
        pct = count / min(MAX_DOCS, len(ds_small)) * 100
        print(f"  {tag}: {count} ({pct:.1f}%)")
    for tag, samples in boilerplate_samples.items():
        print(f"\nExamples for {tag}:")
        for idx, line in samples:
            print(f"  [doc #{idx}] {line}")
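
Once the counts show which markers actually occur, the stripping step could look roughly like the sketch below. `strip_marker_lines` is a hypothetical helper and `PREFIX_MARKERS` is a shortened copy of the prefix list above. Dropping whole lines by prefix is an assumption about the cleaning rule; prefixes such as `source:` can also start ordinary sentences, so the captured examples should be reviewed before adopting it.

```python
PREFIX_MARKERS = ('article url:', 'source:', 'category:', 'tags:', 'title:', 'url:')


def strip_marker_lines(text: str, prefixes: tuple[str, ...] = PREFIX_MARKERS) -> str:
    """Drop every line whose lowercased, stripped form starts with a marker prefix."""
    kept = [
        line for line in text.splitlines()
        # str.startswith accepts a tuple and matches any of its entries.
        if not line.strip().lower().startswith(prefixes)
    ]
    return "\n".join(kept)


# Toy document, invented for this illustration.
toy_doc = "Title: Example Post\nBody text stays in place.\n  Tags: news, sample"
cleaned = strip_marker_lines(toy_doc)
print(cleaned)
```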


### Duplicate Fingerprints
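Duplicated or near-duplicated articles waste model capacity and can inflate benchmark scores. We fingerprint normalized text with SHA-1 to see how many docs in the sampled subset would be dropped by our deduplication pass. Because the hash is exact, it only groups documents that are identical after Unicode, case, and whitespace normalization; near-duplicates that differ in any other way are not detected by this scan.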

In [None]:
import hashlib
import unicodedata

FINGERPRINT_WHITESPACE_RE = re.compile(r'\s+')

def normalized_fingerprint(text: str) -> str | None:
    normalized = unicodedata.normalize('NFC', text or '')
    normalized = normalized.lower()
    normalized = FINGERPRINT_WHITESPACE_RE.sub(' ', normalized)
    normalized = normalized.strip()
    if not normalized:
        return None
    return hashlib.sha1(normalized.encode('utf-8')).hexdigest()

fingerprint_to_docs = defaultdict(list)
DUP_SCAN_LIMIT = 5000
for idx, doc in enumerate(ds_small):
    if idx >= DUP_SCAN_LIMIT:
        break
    fingerprint = normalized_fingerprint(doc['text'])
    if fingerprint is None:
        continue
    fingerprint_to_docs[fingerprint].append(idx)

duplicate_clusters = {fp: ids for fp, ids in fingerprint_to_docs.items() if len(ids) > 1}
if not duplicate_clusters:
    print('No duplicates detected in the first', DUP_SCAN_LIMIT, 'documents.')
else:
    dup_docs = sum(len(ids) for ids in duplicate_clusters.values())
    print(f"Duplicate fingerprints: {len(duplicate_clusters)} covering {dup_docs} docs")
    preview_items = list(duplicate_clusters.items())[:5]
    for fp, ids in preview_items:
        preview = ds_small[ids[0]]['text'][:200].replace('\n', ' ')
        print(f"\nFingerprint {fp[:12]}... (n={len(ids)} docs, indices={ids[:5]})")
        print(f"  preview: {preview}...")
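
A quick check on toy strings shows what the fingerprint does and does not treat as a duplicate. The `fingerprint` function below restates `normalized_fingerprint` so the example is self-contained; the sample strings are invented.

```python
import hashlib
import re
import unicodedata

WHITESPACE_RE = re.compile(r'\s+')


def fingerprint(text: str) -> str:
    """SHA-1 of the NFC-normalized, lowercased, whitespace-collapsed text."""
    normalized = unicodedata.normalize('NFC', text).lower()
    normalized = WHITESPACE_RE.sub(' ', normalized).strip()
    return hashlib.sha1(normalized.encode('utf-8')).hexdigest()


a = "Breaking News:  Markets rally.\n"
b = "breaking news: markets   rally."
c = "Breaking News: Markets rally again."

same_after_normalization = fingerprint(a) == fingerprint(b)
near_duplicate_matches = fingerprint(a) == fingerprint(c)
# Precomposed U+00E9 versus "e" plus combining acute U+0301.
unicode_forms_match = fingerprint("caf\u00e9") == fingerprint("cafe\u0301")
print(same_after_normalization, near_duplicate_matches, unicode_forms_match)
```

Case, spacing, and Unicode composition differences collapse to one fingerprint, while a single added word produces a different hash.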


## 7. Test Set Decontamination

**CRITICAL:** Remove any overlap between training data and test sets.

Test sets to exclude:
- Wikitext-103 test split
- NLP26 OpenWebText eval split

In [None]:
# Load Wikitext-103 test set from local data/ folder
wiki_test = load_from_disk(str(DATA_DIR / "wikitext-103-test"))
print(f"Wikitext-103 test documents: {len(wiki_test)}")
print(f"Example: {wiki_test[0]['text'][:200]}")

In [None]:
# Build a set of n-grams from test sets for overlap detection
def extract_ngrams(text, n=13):
    """Extract character-level n-grams from text for contamination detection."""
    text = text.lower().strip()
    return {text[i:i+n] for i in range(len(text) - n + 1)}

# Build test set n-gram index
test_ngrams = set()
for doc in wiki_test:
    if doc["text"].strip():
        test_ngrams.update(extract_ngrams(doc["text"]))

print(f"Test set n-grams: {len(test_ngrams):,}")

In [None]:
# TODO: Also load and index the NLP26 OWT eval split
# Download from: https://drive.switch.ch/index.php/s/6TLGQFEIkAPJ72K
# Place in data/ directory

# owt_eval = load_dataset("path/to/eval/split")
# for doc in owt_eval:
#     test_ngrams.update(extract_ngrams(doc["text"]))

In [None]:
def is_contaminated(text, test_ngrams, n=13, threshold=0.8):
    """Check if a document has high overlap with test set n-grams."""
    doc_ngrams = extract_ngrams(text, n)
    if not doc_ngrams:
        return False
    overlap = len(doc_ngrams & test_ngrams) / len(doc_ngrams)
    return overlap > threshold

# Test on our sample
contaminated = sum(1 for doc in ds_small if is_contaminated(doc["text"], test_ngrams))
print(f"Contaminated documents in 10k sample: {contaminated}")

## 8. Full Preprocessing Pipeline (preview)

Combine all filters and tokenize. This is what will become `src/data/preprocess.py`.

In [None]:
def preprocess_document(doc, tokenizer, lang_model, test_ngrams):
    """Process one document. Returns token IDs or None if filtered out."""
    text = doc["text"]

    # Filter: skip empty
    if not text.strip():
        return None, "empty"

    # Filter: English only
    lang, conf = detect_language(text, lang_model)
    if lang != "en" or conf < 0.5:
        return None, "non_english"

    # Filter: test set contamination
    if is_contaminated(text, test_ngrams):
        return None, "contaminated"

    # Tokenize
    tokens = tokenizer.encode(text)

    # Filter: too short
    if len(tokens) < 64:
        return None, "too_short"

    return tokens, "kept"


# Run on sample
from collections import Counter
stats = Counter()
all_tokens = []

for doc in ds_small:
    tokens, status = preprocess_document(doc, tokenizer, lang_model, test_ngrams)
    stats[status] += 1
    if tokens is not None:
        all_tokens.extend(tokens)

print("Preprocessing stats:")
for status, count in stats.most_common():
    print(f"  {status}: {count} ({count/len(ds_small)*100:.1f}%)")
print(f"\nTotal tokens kept: {len(all_tokens):,}")

## 9. Save as Binary

Save the tokenized data as a flat binary file of `uint16` token IDs, which training code can memory-map for fast access.

In [None]:
# Convert to numpy uint16 (vocab size 50257 fits in uint16 max=65535)
token_array = np.array(all_tokens, dtype=np.uint16)
print(f"Array shape: {token_array.shape}")
print(f"Size on disk: {token_array.nbytes / 1e6:.1f} MB")

# Save (for the real pipeline, split into train.bin and val.bin)
# out_dir = DATA_DIR / "processed"
# out_dir.mkdir(exist_ok=True)
# token_array.tofile(str(out_dir / "train.bin"))
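
A minimal sketch of the split-and-save step, assuming the 99%/1% split is taken at the token level from the end of the stream. `write_split` is a hypothetical helper, the temporary directory stands in for `data/processed`, and `toy_tokens` stands in for `all_tokens`. Splitting by tokens can cut the last training document in two, so the real pipeline may prefer to split by document.

```python
import tempfile
from pathlib import Path

import numpy as np


def write_split(tokens, out_dir: Path, val_fraction: float = 0.01):
    """Write train.bin and val.bin as flat uint16 files; return their lengths."""
    if max(tokens) >= 2**16:
        raise ValueError("token ID does not fit in uint16")
    arr = np.asarray(tokens, dtype=np.uint16)
    n_val = max(1, int(len(arr) * val_fraction))
    out_dir.mkdir(parents=True, exist_ok=True)
    arr[:-n_val].tofile(str(out_dir / "train.bin"))
    arr[-n_val:].tofile(str(out_dir / "val.bin"))
    return len(arr) - n_val, n_val


toy_tokens = list(range(1000))  # stand-in for all_tokens
with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "processed"
    n_train, n_val = write_split(toy_tokens, out_dir)
    # np.memmap maps the file instead of reading it fully into memory.
    val = np.memmap(str(out_dir / "val.bin"), dtype=np.uint16, mode="r")
    val_head = val[:3].tolist()
    del val  # release the mapping before the temp dir is removed

print(n_train, n_val, val_head)
```

At 2 bytes per token, the ~8.9B-token estimate above corresponds to roughly 17.8 GB on disk, which is why the training side would read through a memory map rather than load the file.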

## Next Steps

1. Run this on the **full** OpenWebText dataset: `load_dataset("Skylion007/openwebtext")`
2. Add the NLP26 eval split to decontamination
3. Create train/val split (99%/1%)
4. Extract working code into `src/data/preprocess.py`
5. Wire into `python main.py --stage preprocess --dataset-size full --lang en`