# OpenWebText EDA (Lightweight)

Goal: understand data quality and likely cleaning needs before preprocessing.
This notebook stays lightweight (no heavy stats or plots).

## 1) Load a tiny streaming sample
We use streaming to avoid downloading large parquet shards.

In [None]:
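from itertools import islice

from datasets import load_dataset

SAMPLE_SIZE = 2000
ds_stream = load_dataset('openwebtext', split='train', streaming=True)

# Pull only the first SAMPLE_SIZE examples from the stream; islice stops
# early and also handles SAMPLE_SIZE = 0 without fetching anything.
samples = list(islice(ds_stream, SAMPLE_SIZE))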

print(f'Loaded sample size: {len(samples)}')

## 2) Inspect dataset structure
Streaming datasets are iterated lazily and do not expose a total length without a full pass over the data, so we report only the size of the loaded sample.

In [None]:
print('Sample count (loaded):', len(samples))
if samples:
    print('Example keys:', list(samples[0].keys()))
    print('Example text preview:', samples[0].get('text', '')[:300])

In [None]:
# Print 20 raw samples (first 1000 chars each)
for i, ex in enumerate(samples[:20]):
    text = ex.get('text', '')
    print(f'--- Sample {i} (first 1000 chars) ---')
    print(text[:1000])
    print()

## 3) Lightweight length stats
No heavy stats or plotting; just quick length checks. Lengths are measured in characters, not tokens.

In [None]:
lengths = [len(ex.get('text', '')) for ex in samples]

if lengths:
    avg_len = sum(lengths) / len(lengths)
    print(f'Average length: {avg_len:.1f} chars')
    print(f'Min length: {min(lengths)} chars')
    print(f'Max length: {max(lengths)} chars')

    # Share of samples below an arbitrary 200-character cutoff
    short_count = sum(1 for n in lengths if n < 200)
    short_pct = 100.0 * short_count / len(lengths)
    print(f'Short samples (<200 chars): {short_count} ({short_pct:.1f}%)')

    # Indices of the three longest samples, longest first
    long_idxs = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)[:3]
    print('--- 3 longest samples (first 1000 chars) ---')
    for idx in long_idxs:
        print(f'Index {idx}, length {lengths[idx]}')
        print(samples[idx].get('text', '')[:1000])
        print()
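
Mean, min, and max can hide a skewed length distribution, so a median and a few percentiles are a cheap addition that still avoids plotting. The following is a minimal sketch using only the standard library; `length_summary` is a hypothetical helper that is not part of the notebook, and the choice of percentiles (5th, 95th, 99th) is an assumption.

```python
import statistics

def length_summary(lengths):
    """Return the median and a few percentiles of a list of character lengths."""
    if len(lengths) < 2:
        # statistics.quantiles needs at least two data points
        return {}
    # n=100 yields 99 cut points; index k-1 holds the k-th percentile
    cuts = statistics.quantiles(lengths, n=100, method='inclusive')
    return {
        'median': statistics.median(lengths),
        'p05': cuts[4],
        'p95': cuts[94],
        'p99': cuts[98],
    }

# Toy data for illustration only; in the notebook, pass the `lengths` list.
print(length_summary([120, 450, 800, 1500, 2300, 9000]))
```

A large gap between the median and the 99th percentile would suggest that a handful of very long documents dominate the average.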

## 4) Simple content inspection
Quick heuristics for HTML, code-like content, and non-English signals. These are rough checks and will produce both false positives and false negatives.

In [None]:
import re

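# Match tag-like tokens such as <p> or </div>. Requiring a letter right after
# '<' reduces false positives from comparisons like "x < 5".
html_re = re.compile(r'</?[a-zA-Z][^>]*>')
# Substrings that often appear in source code. Several (';', 'class ',
# 'import ') are also common in prose, so one hit alone is not enough.
code_markers = ['{', '}', ';', 'def ', 'class ', '#include', 'import ']

def looks_like_code(text: str) -> bool:
    """Rough check: a fenced block, many braces, or 3+ distinct markers."""
    if '```' in text:
        return True
    brace_count = text.count('{') + text.count('}')
    if brace_count >= 10:
        return True
    marker_hits = sum(1 for m in code_markers if m in text)
    return marker_hits >= 3

def non_english_heuristic(text: str) -> bool:
    """Flag text where more than 20% of characters are non-ASCII.

    This catches non-Latin scripts but misses Latin-script languages
    such as French or Spanish.
    """
    if not text:
        return False
    non_ascii = sum(1 for ch in text if ord(ch) > 127)
    return (non_ascii / len(text)) > 0.2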

html_count = 0
code_count = 0
non_english_count = 0

for ex in samples:
    text = ex.get('text', '')
    if html_re.search(text):
        html_count += 1
    if looks_like_code(text):
        code_count += 1
    if non_english_heuristic(text):
        non_english_count += 1

print(f'Samples with HTML tags: {html_count}')
print(f'Samples that look like code: {code_count}')
print(f'Samples with non-English signals: {non_english_count}')

## 5) Findings and proposed cleaning rules
Fill this in after running the notebook.

### Findings
- Data quality: _TBD_
- Common noise: _TBD_

### Proposed cleaning rules
- Strip HTML tags
- Remove code-heavy samples
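- Remove very short samples (e.g. under 200 characters, the cutoff used in section 3)
- Filter non-English samples (heuristic, e.g. non-ASCII ratio above 0.2 as in section 4)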
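
The rules above could be combined into a single pass over each sample. This is a sketch under stated assumptions, not the final preprocessing code: `clean_text` is a hypothetical helper, and the thresholds simply mirror the cutoffs used earlier in this notebook and should be revisited once the findings are filled in.

```python
import re

# Assumed thresholds, mirroring the checks in this notebook.
MIN_CHARS = 200
MAX_BRACES = 10
MAX_NON_ASCII_RATIO = 0.2

TAG_RE = re.compile(r'</?[a-zA-Z][^>]*>')

def clean_text(text):
    """Return the cleaned text, or None if the sample should be dropped."""
    text = TAG_RE.sub(' ', text)                  # strip HTML tags
    text = re.sub(r'[ \t]+', ' ', text).strip()   # collapse spaces left behind
    if len(text) < MIN_CHARS:                     # very short sample
        return None
    if '```' in text or text.count('{') + text.count('}') >= MAX_BRACES:
        return None                               # code-heavy sample
    non_ascii = sum(1 for ch in text if ord(ch) > 127)
    if non_ascii / len(text) > MAX_NON_ASCII_RATIO:
        return None                               # likely non-English
    return text
```

Stripping tags before the length check means a sample that is mostly markup is measured by its visible text only.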

### Noise patterns to watch
- Boilerplate/navigation text
- Duplicates or near-duplicates
- Mixed-language or encoding artifacts
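
Exact duplicates are the cheapest of these patterns to check on the sample. The sketch below hashes a normalized copy of each text; `normalize` and `find_exact_duplicates` are hypothetical helpers, and the normalization (lowercasing, whitespace collapsing) is an assumption. Near-duplicates would need a fuzzier technique such as MinHash, which is outside the scope of this lightweight notebook.

```python
import hashlib
import re

def normalize(text):
    """Lowercase and collapse whitespace so trivial variants hash the same."""
    return re.sub(r'\s+', ' ', text).strip().lower()

def find_exact_duplicates(texts):
    """Return (first_index, duplicate_index) pairs that match after normalization."""
    seen = {}   # digest -> index of first occurrence
    pairs = []
    for i, text in enumerate(texts):
        digest = hashlib.sha1(normalize(text).encode('utf-8')).hexdigest()
        if digest in seen:
            pairs.append((seen[digest], i))
        else:
            seen[digest] = i
    return pairs

# Toy data for illustration only.
demo = ['Hello  world', 'hello world', 'Something else']
print(find_exact_duplicates(demo))  # → [(0, 1)]
```

In the notebook this could be called as `find_exact_duplicates([ex.get('text', '') for ex in samples])`.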