# Phase 1 - Data Preprocessing

This notebook builds the **processed datasets** used by all downstream model notebooks (RoBERTa/BERT/Mistral).

Key outputs (written to `data/processed/` and related folders):
- A **single canonical schema** (`text`, `label`, `task`, `variety_name`, `source_name`), with `row_id` and `text_norm` added to every saved CSV
- Minimal **text normalization** (sarcasm-safe)
- Standard **test sets** (FULL / by source / by variety)
- Training **settings** (by source and by variety)
- Deterministic **train/val splits** with appropriate stratification
- Manifests + audit files so results are reproducible and easy to debug


## Cell 1 — Imports

Load standard libraries used across preprocessing, file I/O, and reproducibility.
We keep dependencies minimal and push reusable logic into `src/` so all notebooks share the same behavior.

In [None]:
import os, sys, json, random, re
from pathlib import Path
from typing import Dict, List, Tuple

import numpy as np
import pandas as pd

## Cell 2 — Mount Google Drive

Mount Drive so all outputs (processed data, split files, reports) persist across Colab sessions.

In [None]:
from google.colab import drive
drive.mount("/content/drive")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Cell 3 — Project paths

Define the project root and key directories:
- `data/raw`: input files (expected: `besstie_train.*`, `besstie_validation.*`)
- `data/processed`: processed outputs used by training notebooks
- `data/splits`: JSON split metadata (row ids, strategy, seed)
- `data/reports`: preprocessing audits and sanity checks
- `src/`: reusable preprocessing modules written below

In [None]:
BASE = Path("/content/drive/MyDrive/DNLP")

DATA_DIR   = BASE / "data"
RAW_DIR    = DATA_DIR / "raw"
PROC_DIR   = DATA_DIR / "processed"
SPLITS_DIR = DATA_DIR / "splits"
REPORT_DIR = DATA_DIR / "reports"

SRC_DIR    = BASE / "src"
NB_DIR     = BASE / "notebooks"
DOCS_DIR   = BASE / "docs"
MODELS_DIR = BASE / "models"

# Ensure base directories exist
for d in [DATA_DIR, RAW_DIR, PROC_DIR, SPLITS_DIR, REPORT_DIR, SRC_DIR, NB_DIR, DOCS_DIR, MODELS_DIR]:
    d.mkdir(parents=True, exist_ok=True)

print("BASE:", BASE)
print("RAW:", RAW_DIR)
print("PROC:", PROC_DIR)
print("SPLITS:", SPLITS_DIR)

BASE: /content/drive/MyDrive/DNLP
RAW: /content/drive/MyDrive/DNLP/data/raw
PROC: /content/drive/MyDrive/DNLP/data/processed
SPLITS: /content/drive/MyDrive/DNLP/data/splits


## Cell 4 — Global configuration

Centralize settings that affect all runs:
- Random seed for deterministic splits
- Validation ratio
- Sarcasm source filtering (e.g., keep sarcasm only from Reddit)
- `MAX_LEN_FOR_MODELS=512` (note: actual tokenization happens later in model notebooks)

In [None]:
CFG = {
    "SEED": 42,
    "VAL_RATIO": 0.2,             # internal val from train pool
    "SARC_SOURCE_ONLY": "Reddit", # keep sarcasm only from Reddit
    "MAX_LEN_FOR_MODELS": 512,
}

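def set_seed(seed: int):
    """Seed Python's and NumPy's global RNGs."""
    random.seed(seed)
    np.random.seed(seed)
    # Only affects subprocesses started after this point; the hash seed of the
    # running interpreter is fixed at startup.
    os.environ["PYTHONHASHSEED"] = str(seed)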

set_seed(CFG["SEED"])

## Cell 5 — Initialize `src/` package

Create a minimal Python package under `src/` so later notebooks can `import src.*`
instead of copying helper code around.

In [None]:
PKG_DIR = SRC_DIR
PKG_DIR.mkdir(parents=True, exist_ok=True)

(PKG_DIR / "__init__.py").write_text(
    "__version__ = '1.1.0'\n"
)

print("Created package:", PKG_DIR)

Created package: /content/drive/MyDrive/DNLP/src


## Cell 6 — Write `src/io_utils.py`

Create small, reusable utilities:
- `load_any`: load CSV or Parquet
- `safe_name`: sanitize names for folder/file creation (lowercases; keeps the hyphen in names like `en-AU`)
- `clean_dir`: delete a directory and recreate it empty

In [None]:
io_code = r'''
from pathlib import Path
import pandas as pd
import shutil

def load_any(path: Path) -> pd.DataFrame:
    suf = path.suffix.lower()
    if suf == ".csv":
        return pd.read_csv(path)
    if suf in [".parquet", ".pq"]:
        return pd.read_parquet(path)
    raise ValueError(f"Unsupported file type: {path}")

def safe_name(s: str) -> str:
    # Aggressive filename safety: lowercase, no spaces, standard hyphens
    s = str(s).strip().lower()
    s = s.replace(" ", "")
    s = s.replace("/", "-")
    # minimal fallback for weird chars
    s = "".join([c if c.isalnum() or c in "-_." else "" for c in s])
    return s

def clean_dir(path: Path):
    """Delete a directory and everything in it (if it exists), then recreate it empty."""
    if path.exists():
        shutil.rmtree(path)
    path.mkdir(parents=True, exist_ok=True)
'''
(PKG_DIR / "io_utils.py").write_text(io_code)
print("Wrote:", PKG_DIR / "io_utils.py")

Wrote: /content/drive/MyDrive/DNLP/src/io_utils.py
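
As a quick check of what `safe_name` produces, the sketch below repeats the function from `src/io_utils.py` so that it runs standalone. The sample inputs are illustrative and not taken from the dataset.

```python
def safe_name(s: str) -> str:
    # Same logic as src/io_utils.py, repeated here so the example is standalone.
    s = str(s).strip().lower()
    s = s.replace(" ", "")
    s = s.replace("/", "-")
    return "".join(c for c in s if c.isalnum() or c in "-_.")

# Illustrative inputs (not from the dataset). The last one contains an en-dash.
examples = ["en-AU", "TRAIN_en-UK", " Google Reviews ", "a/b", "en\u2013AU"]
results = {e: safe_name(e) for e in examples}
for raw, safe in results.items():
    print(repr(raw), "->", repr(safe))
```

Names are lowercased, so the setting `TRAIN_en-UK` ends up in a folder called `train_en-uk`. A Unicode en-dash is dropped rather than converted, which is why `schema.py` normalizes dashes before variety names reach this function.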


## Cell 7 — Write `src/text_norm.py`

Define **minimal, sarcasm-safe** text normalization:
- Replace URLs with `<url>`, usernames with `<user>`, and numbers with `<num>`
- Decode common HTML entities
- Collapse whitespace only (do NOT remove punctuation, emojis, casing, or elongations)

This preserves the cues that matter for figurative language/sarcasm.

In [None]:
norm_code = r'''
import re

URL_RE  = re.compile(r"(https?://\S+|www\.\S+)", re.IGNORECASE)
USER_RE = re.compile(r"(?<!\w)@\w+")
NUM_RE  = re.compile(r"(?<!\w)\d+([.,]\d+)?(?!\w)")

# "&amp;" is decoded last so that e.g. "&amp;lt;" becomes "&lt;" and is not decoded twice.
HTML_ENT = {
    "&lt;": "<", "&gt;": ">", "&quot;": '"', "&#39;": "'", "&nbsp;": " ", "&amp;": "&"
}

def normalize_text(s: str) -> str:
    if s is None:
        return ""
    s = str(s)

    for k, v in HTML_ENT.items():
        s = s.replace(k, v)

    s = URL_RE.sub("<url>", s)
    s = USER_RE.sub("<user>", s)
    s = NUM_RE.sub("<num>", s)

    s = re.sub(r"\s+", " ", s).strip()
    return s
'''
(PKG_DIR / "text_norm.py").write_text(norm_code)
print("Wrote:", PKG_DIR / "text_norm.py")

Wrote: /content/drive/MyDrive/DNLP/src/text_norm.py
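
To see which cues survive normalization, the sketch below applies the same steps to one made-up sentence. It repeats the regexes and entity table from `src/text_norm.py` (with `&amp;` decoded last, so that `&amp;lt;` is not decoded twice) in order to run standalone; the input string is an assumption for illustration only.

```python
import re

URL_RE = re.compile(r"(https?://\S+|www\.\S+)", re.IGNORECASE)
USER_RE = re.compile(r"(?<!\w)@\w+")
NUM_RE = re.compile(r"(?<!\w)\d+([.,]\d+)?(?!\w)")
HTML_ENT = {"&lt;": "<", "&gt;": ">", "&quot;": '"', "&#39;": "'", "&nbsp;": " ", "&amp;": "&"}

def normalize_text(s):
    if s is None:
        return ""
    s = str(s)
    for k, v in HTML_ENT.items():      # decode a few common HTML entities
        s = s.replace(k, v)
    s = URL_RE.sub("<url>", s)         # placeholders instead of raw URLs,
    s = USER_RE.sub("<user>", s)       # user mentions,
    s = NUM_RE.sub("<num>", s)         # and standalone numbers
    return re.sub(r"\s+", " ", s).strip()

# Illustrative input (not from the dataset).
raw = "@bob   check https://example.com/a?b=1 &amp; pay 3.50 NOW!!! soooo good 2nd time"
out = normalize_text(raw)
print(out)
```

For this input the result is `<user> check <url> & pay <num> NOW!!! soooo good 2nd time`: casing, repeated punctuation, and the elongated `soooo` are untouched, and `2nd` is kept because `NUM_RE` only matches digits that are not attached to word characters.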


## Cell 8 — Write `src/schema.py`

Enforce a single canonical schema regardless of raw column naming:
- Map columns (`variety`→`variety_name`, `source`→`source_name`, etc.)
- Cast types
- Validate binary labels {0,1}

If anything is missing or inconsistent, we fail early with a clear error.

In [None]:
schema_code = r'''
import pandas as pd

CANON = {
    "text": ["text"],
    "label": ["label"],
    "task": ["task"],
    "variety_name": ["variety", "variety_name"],
    "source_name": ["source", "source_name"],
}

# Canonical source names, keyed by their lowercase spelling
SOURCE_MAP = {
    "reddit": "Reddit",
    "google": "Google",
    "twitter": "Twitter",
    "youtube": "YouTube"
}

def fix_dashes(s: str) -> str:
    """Normalize en-dash, em-dash, and soft hyphen to a standard hyphen."""
    if not isinstance(s, str):
        return str(s)
    return s.replace('\u2013', '-').replace('\u2014', '-').replace('\u00AD', '-')

def canonicalize(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    lower_map = {c.lower(): c for c in df.columns}

    found = {}
    for std, aliases in CANON.items():
        for a in aliases:
            if a in lower_map:
                found[std] = lower_map[a]
                break

    missing = [k for k in CANON.keys() if k not in found]
    if missing:
        raise ValueError(f"Missing columns {missing}. Found mapping={found}.")

    df = df.rename(columns={found[k]: k for k in found})

    # --- CANONICAL VALUE NORMALIZATION ---
    df["text"] = df["text"].astype(str)
    df["label"] = df["label"].astype(int)

    # Force task to lowercase (sentiment/sarcasm)
    df["task"] = df["task"].astype(str).str.strip().str.lower()

    # Fix variety names (unicode dashes)
    df["variety_name"] = df["variety_name"].apply(fix_dashes).str.strip()

    # Fix source names (Map 'reddit' -> 'Reddit')
    df["source_name"] = df["source_name"].astype(str).str.strip()
    df["source_name"] = df["source_name"].apply(lambda x: SOURCE_MAP.get(x.lower(), x))

    uniq = sorted(df["label"].unique().tolist())
    if set(uniq) - {0, 1}:
        raise ValueError(f"Label values are {uniq} (expected binary 0/1).")

    return df
'''
(PKG_DIR / "schema.py").write_text(schema_code)
print("Wrote:", PKG_DIR / "schema.py")

Wrote: /content/drive/MyDrive/DNLP/src/schema.py
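
The column-mapping step can be tried in isolation. The sketch below copies the alias table from `src/schema.py` and wraps the lookup loop in a helper named `resolve_columns`; that name and the toy DataFrame are assumptions made for this example, not part of the module.

```python
import pandas as pd

# Alias table copied from src/schema.py.
CANON = {
    "text": ["text"],
    "label": ["label"],
    "task": ["task"],
    "variety_name": ["variety", "variety_name"],
    "source_name": ["source", "source_name"],
}

def resolve_columns(columns):
    """Map each canonical name to the raw column that supplies it (hypothetical helper)."""
    lower_map = {c.lower(): c for c in columns}   # case-insensitive lookup
    found = {}
    for std, aliases in CANON.items():
        for a in aliases:
            if a in lower_map:
                found[std] = lower_map[a]
                break
    missing = [k for k in CANON if k not in found]
    if missing:
        raise ValueError(f"Missing columns {missing}. Found mapping={found}.")
    return found

# Toy frame with illustrative column names and values.
raw = pd.DataFrame({
    "Text": ["great, just great"], "Label": [1], "Task": ["Sarcasm"],
    "Variety": ["en-AU"], "Source": ["reddit"],
})
mapping = resolve_columns(raw.columns)
canon = raw.rename(columns={v: k for k, v in mapping.items()})
print(mapping)
print(list(canon.columns))
```

Dropping any required column from `raw` (for example `Source`) makes `resolve_columns` raise a `ValueError` that names the missing canonical field, which is the fail-early behavior described above.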


## Cell 9 — Write `src/splits.py`

Implement deterministic train/val splitting:
- `stratified_split_indices`: stratify by **label** (default)
- `stratified_split_indices_multi`: stratify by composite keys (e.g., **label + variety**)

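When the training pool contains multiple varieties, the pipeline stratifies by label+variety (or by label+source when there is one variety but several sources), so that the validation set is not skewed toward one group.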
Also includes a helper to save split metadata as JSON.
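
Both functions size each stratum's validation share with the same rule: round `n * val_ratio`, then clamp the result so that any stratum with at least two rows contributes at least one row to each side. The sketch below isolates that rule; `val_count` is a hypothetical name used only here, and the stratum sizes are arbitrary.

```python
import numpy as np

def val_count(n: int, val_ratio: float) -> int:
    """Validation rows taken from a stratum of size n (hypothetical helper)."""
    n_val = int(np.round(n * val_ratio))
    if n >= 2:
        # Keep at least one row in val and at least one row in train.
        return max(1, min(n_val, n - 1))
    # A single-row stratum stays entirely in train.
    return 0

sizes = [1, 2, 3, 10, 103]
counts = {n: val_count(n, 0.2) for n in sizes}
print(counts)
```

With `val_ratio=0.2` this gives 0, 1, 1, 2 and 21 validation rows respectively. Very small strata are therefore over-represented in validation (a two-row stratum is split 1/1), which is worth keeping in mind when reading the audit file.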

In [None]:
splits_code = r'''
import json
import numpy as np
from pathlib import Path
from typing import Tuple, List, Dict

def stratified_split_indices(labels: np.ndarray, val_ratio: float, seed: int) -> Tuple[np.ndarray, np.ndarray]:
    rng = np.random.default_rng(seed)
    idx = np.arange(len(labels))
    y = labels.astype(int)

    classes = np.unique(y)
    if len(classes) < 2:
        rng.shuffle(idx)
        n_val = int(np.ceil(len(idx) * val_ratio))
        return idx[n_val:], idx[:n_val]

    train_idx, val_idx = [], []
    for c in classes:
        c_idx = idx[y == c]
        rng.shuffle(c_idx)
        n_c = len(c_idx)
        n_val = int(np.round(n_c * val_ratio))
        if n_c >= 2:
            n_val = max(1, min(n_val, n_c - 1))
        else:
            n_val = 0
        val_idx.append(c_idx[:n_val])
        train_idx.append(c_idx[n_val:])

    train_idx = np.concatenate(train_idx) if len(train_idx) else np.array([], dtype=int)
    val_idx   = np.concatenate(val_idx) if len(val_idx) else np.array([], dtype=int)
    rng.shuffle(train_idx)
    rng.shuffle(val_idx)
    return train_idx, val_idx

def stratified_split_indices_multi(strata: np.ndarray, val_ratio: float, seed: int) -> Tuple[np.ndarray, np.ndarray]:
    rng = np.random.default_rng(seed)
    idx = np.arange(len(strata))
    uniq = np.unique(strata)
    train_idx, val_idx = [], []

    for s in uniq:
        s_idx = idx[strata == s]
        rng.shuffle(s_idx)
        n_s = len(s_idx)
        n_val = int(np.round(n_s * val_ratio))
        if n_s >= 2:
            n_val = max(1, min(n_val, n_s - 1))
        else:
            n_val = 0
        val_idx.append(s_idx[:n_val])
        train_idx.append(s_idx[n_val:])

    train_idx = np.concatenate(train_idx) if len(train_idx) else np.array([], dtype=int)
    val_idx   = np.concatenate(val_idx) if len(val_idx) else np.array([], dtype=int)
    rng.shuffle(train_idx)
    rng.shuffle(val_idx)
    return train_idx, val_idx

def save_json(path: Path, obj: Dict):
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "w") as f:
        json.dump(obj, f, indent=2)
'''
(PKG_DIR / "splits.py").write_text(splits_code)
print("Wrote:", PKG_DIR / "splits.py")

Wrote: /content/drive/MyDrive/DNLP/src/splits.py
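
To illustrate the composite-key split and its determinism, here is a condensed sketch of the logic in `stratified_split_indices_multi`. The function name `split_by_strata` is hypothetical, the final shuffles of the train/val index arrays are omitted for brevity, and the toy labels and varieties are made up.

```python
import numpy as np

def split_by_strata(strata, val_ratio, seed):
    """Condensed per-stratum split (hypothetical; final shuffles omitted)."""
    rng = np.random.default_rng(seed)
    idx = np.arange(len(strata))
    train_idx, val_idx = [], []
    for s in np.unique(strata):
        s_idx = idx[strata == s]            # boolean indexing returns a copy
        rng.shuffle(s_idx)
        n_s = len(s_idx)
        n_val = int(np.round(n_s * val_ratio))
        n_val = max(1, min(n_val, n_s - 1)) if n_s >= 2 else 0
        val_idx.append(s_idx[:n_val])
        train_idx.append(s_idx[n_val:])
    return np.concatenate(train_idx), np.concatenate(val_idx)

# Toy data: 20 rows, balanced labels, two varieties (illustrative values).
labels = np.array([0, 1] * 10)
varieties = np.array(["en-AU"] * 10 + ["en-IN"] * 10)
strata = np.array([f"{l}__{v}" for l, v in zip(labels, varieties)])

train_a, val_a = split_by_strata(strata, val_ratio=0.2, seed=42)
train_b, val_b = split_by_strata(strata, val_ratio=0.2, seed=42)

print("same split on rerun:", np.array_equal(val_a, val_b))
print("val strata:", sorted(strata[val_a]))
```

Each of the four label/variety strata has five rows, so each contributes exactly one validation row, and rerunning with the same seed reproduces the same indices. Because the generator is created inside the function from `seed`, the result does not depend on any global random state.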


## Cell 10 — Main Pipeline

This is the core preprocessing logic:
1. Load raw train/validation files and canonicalize schema  
2. Compute `text_norm` with minimal normalization  
3. Apply the agreed constraint: **sarcasm only from one source** (e.g., Reddit)  
4. Save debuggable snapshots (`_all/train_all.csv`, `_all/validation_all.csv`)  
5. Build standardized **test sets** per task:
   - `TEST_FULL`
   - `TEST_<source>`
   - `TEST_<variety>`
6. Build training **settings** per task:
   - Sentiment: train on each **source** and also each **variety** (for cross-variety analysis)
   - Sarcasm: `FULL` + per-variety training settings
7. For each (task, setting), create **train/val split** with correct stratification:
   - If multiple varieties in the training pool → stratify by **label+variety**
   - If a single variety but multiple sources → stratify by **label+source**
   - Otherwise → stratify by **label**
8. Save:
   - `train.csv`, `val.csv`
   - `manifest.json` (pointers + summaries)
   - split JSON with row IDs
   - `index.csv` (master table of all settings)
   - `preprocess_audit.csv` (what strategy was used and resulting distributions)
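
Step 7 can be read as a small decision function. The sketch below follows the branch order used in `run_preprocess`, which also has a label+source case for pools with a single variety but several sources. The helper names `choose_strategy` and `build_strata` and the toy pool are assumptions for illustration.

```python
import pandas as pd

def choose_strategy(pool: pd.DataFrame) -> str:
    """Return the stratification strategy for a training pool (hypothetical helper)."""
    if pool["variety_name"].nunique() > 1:
        return "label_variety"
    if pool["source_name"].nunique() > 1:
        return "label_source"
    return "label"

def build_strata(pool: pd.DataFrame, strategy: str) -> pd.Series:
    """Build the per-row key that the splitter groups on (hypothetical helper)."""
    if strategy == "label_variety":
        return pool["label"].astype(str) + "__" + pool["variety_name"]
    if strategy == "label_source":
        return pool["label"].astype(str) + "__" + pool["source_name"]
    return pool["label"]

# Toy sentiment pool; values are illustrative, not from the dataset.
pool = pd.DataFrame({
    "label":        [0, 1, 0, 1],
    "variety_name": ["en-AU", "en-AU", "en-IN", "en-IN"],
    "source_name":  ["Google", "Reddit", "Reddit", "Reddit"],
})

reddit_pool = pool[pool["source_name"] == "Reddit"]   # several varieties
au_pool = pool[pool["variety_name"] == "en-AU"]       # one variety, two sources
in_pool = pool[pool["variety_name"] == "en-IN"]       # one variety, one source

strategies = {
    "Reddit": choose_strategy(reddit_pool),
    "TRAIN_en-AU": choose_strategy(au_pool),
    "TRAIN_en-IN": choose_strategy(in_pool),
}
au_strata = build_strata(au_pool, strategies["TRAIN_en-AU"]).tolist()
print(strategies)
print(au_strata)
```

In this toy pool the `Reddit` setting is split by label+variety, `TRAIN_en-AU` by label+source, and `TRAIN_en-IN` by label alone. The chosen strategy is what the pipeline records as `split_strategy` in the split JSON and the audit file.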

In [None]:
build_code = r'''
import json, shutil
from pathlib import Path
from typing import Dict, List, Tuple

import numpy as np
import pandas as pd

from .io_utils import load_any, safe_name
from .schema import canonicalize
from .text_norm import normalize_text
from .splits import stratified_split_indices, stratified_split_indices_multi, save_json

KEEP_COLS = ["row_id","task","label","variety_name","source_name","text","text_norm"]

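def find_raw_files(raw_dir: Path) -> Tuple[Path, Path]:
    train_files = list(raw_dir.glob("besstie_train.*"))
    valid_files = list(raw_dir.glob("besstie_validation.*"))
    # Require exactly one match each, so a missing or ambiguous raw file fails loudly.
    if len(train_files) != 1 or len(valid_files) != 1:
        raise FileNotFoundError(
            f"Expected exactly one besstie_train.* and one besstie_validation.* in {raw_dir}; "
            f"found {len(train_files)} and {len(valid_files)}."
        )
    return train_files[0], valid_files[0]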

def summarize(df: pd.DataFrame) -> Dict:
    return {
        "n": int(len(df)),
        "label_counts": df["label"].value_counts().to_dict() if len(df) else {},
        "variety_counts": df["variety_name"].value_counts().to_dict() if len(df) else {},
        "source_counts": df["source_name"].value_counts().to_dict() if len(df) else {},
    }

def filter_trainpool(df_task_train: pd.DataFrame, task: str, setting: str) -> pd.DataFrame:
    # Logic: Sentiment (Source or Variety); Sarcasm (FULL or Variety)
    if task == "sentiment":
        if setting in df_task_train["source_name"].unique():
            return df_task_train[df_task_train["source_name"] == setting].copy()
        if setting.startswith("TRAIN_"):
            v = setting.replace("TRAIN_", "")
            return df_task_train[df_task_train["variety_name"] == v].copy()
        return df_task_train.copy()

    if task == "sarcasm":
        if setting == "FULL":
            return df_task_train.copy()
        if setting.startswith("TRAIN_"):
            v = setting.replace("TRAIN_", "")
            return df_task_train[df_task_train["variety_name"] == v].copy()

    return df_task_train.copy()

def build_testsets(df_task_valid: pd.DataFrame, out_dir_task: Path) -> pd.DataFrame:
    # Writes to proc_dir/<task>/testsets/
    test_dir = out_dir_task / "testsets"
    test_dir.mkdir(parents=True, exist_ok=True)

    rows = []

    # 1. FULL
    p_full = test_dir / "TEST_FULL.csv"
    df_task_valid[KEEP_COLS].to_csv(p_full, index=False)
    rows.append({"test_setting":"TEST_FULL", "csv":str(p_full), **summarize(df_task_valid)})

    # 2. By Source
    for src in sorted(df_task_valid["source_name"].unique()):
        df_s = df_task_valid[df_task_valid["source_name"] == src].copy()
        p = test_dir / f"TEST_{safe_name(src)}.csv"
        df_s[KEEP_COLS].to_csv(p, index=False)
        rows.append({"test_setting":f"TEST_{src}", "csv":str(p), **summarize(df_s)})

    # 3. By Variety
    for v in sorted(df_task_valid["variety_name"].unique()):
        df_v = df_task_valid[df_task_valid["variety_name"] == v].copy()
        p = test_dir / f"TEST_{safe_name(v)}.csv"
        df_v[KEEP_COLS].to_csv(p, index=False)
        rows.append({"test_setting":f"TEST_{v}", "csv":str(p), **summarize(df_v)})

    idx = pd.DataFrame(rows).sort_values("test_setting").reset_index(drop=True)
    idx.to_csv(test_dir / "index_testsets.csv", index=False)
    return idx

def run_preprocess(raw_dir, proc_dir, splits_dir, report_dir, seed, val_ratio, sarc_source_only, max_len_for_models):
    train_file, valid_file = find_raw_files(raw_dir)

    # Load & Canonicalize
    df_train = canonicalize(load_any(train_file))
    df_valid = canonicalize(load_any(valid_file))

    # Normalize Text
    df_train["text_norm"] = df_train["text"].apply(normalize_text)
    df_valid["text_norm"] = df_valid["text"].apply(normalize_text)

    # IDs
    df_train = df_train.reset_index(drop=True)
    df_valid = df_valid.reset_index(drop=True)
    df_train["row_id"] = np.arange(len(df_train), dtype=np.int64)
    df_valid["row_id"] = np.arange(len(df_valid), dtype=np.int64) + int(10**9)

    # Keep sarcasm rows only from `sarc_source_only`.
    # canonicalize() has already mapped source names to their canonical spelling
    # (e.g. "Reddit"), so an exact string comparison is safe here.
    mask_tr_sarc = df_train["task"] == "sarcasm"
    mask_va_sarc = df_valid["task"] == "sarcasm"

    if mask_tr_sarc.any():
        # sarc_source_only from config is usually "Reddit"
        df_train = pd.concat([df_train[~mask_tr_sarc], df_train[mask_tr_sarc & (df_train["source_name"] == sarc_source_only)]], ignore_index=True)
    if mask_va_sarc.any():
        df_valid = pd.concat([df_valid[~mask_va_sarc], df_valid[mask_va_sarc & (df_valid["source_name"] == sarc_source_only)]], ignore_index=True)

    # Reassign IDs after filtering; these overwrite the IDs set above and are the ones saved to disk
    df_train = df_train.reset_index(drop=True)
    df_valid = df_valid.reset_index(drop=True)
    df_train["row_id"] = np.arange(len(df_train), dtype=np.int64)
    df_valid["row_id"] = np.arange(len(df_valid), dtype=np.int64) + int(10**9)

    # --- SAVE FULL SNAPSHOTS ---
    (proc_dir / "_all").mkdir(parents=True, exist_ok=True)
    df_train[KEEP_COLS].to_csv(proc_dir / "_all" / "train_all.csv", index=False)
    df_valid[KEEP_COLS].to_csv(proc_dir / "_all" / "validation_all.csv", index=False)

    tasks = sorted(df_train["task"].unique().tolist())
    all_index_rows = []
    audit_rows = []

    for task in tasks:
        # Create Task Directory Structure
        task_dir = proc_dir / safe_name(task)
        trainsets_dir = task_dir / "trainsets"
        splits_out_dir = task_dir / "splits"

        for d in [task_dir, trainsets_dir, splits_out_dir]:
            d.mkdir(parents=True, exist_ok=True)

        # 1. Build Testsets
        df_task_valid = df_valid[df_valid["task"] == task].copy()
        testsets_index_path = task_dir / "testsets" / "index_testsets.csv"
        if len(df_task_valid) > 0:
            build_testsets(df_task_valid, task_dir)

        # 2. Build Training Settings
        df_task_train = df_train[df_train["task"] == task].copy()
        if len(df_task_train) == 0: continue

        # Define Settings
        if task == "sentiment":
            settings = sorted(df_task_train["source_name"].unique().tolist()) + \
                       [f"TRAIN_{v}" for v in sorted(df_task_train["variety_name"].unique())]
        elif task == "sarcasm":
            settings = ["FULL"] + [f"TRAIN_{v}" for v in sorted(df_task_train["variety_name"].unique())]
        else:
            settings = ["FULL"]

        task_settings_rows = []

        for setting in settings:
            tr_pool = filter_trainpool(df_task_train, task, setting)
            if len(tr_pool) < 10: continue

            # --- Choose the stratification strategy ---
            n_var = tr_pool["variety_name"].nunique()
            n_src = tr_pool["source_name"].nunique()

            # If variety differs, split by label+variety
            if n_var > 1:
                strategy = "label_variety"
                strata = (tr_pool["label"].astype(str) + "__" + tr_pool["variety_name"]).values
                tr_idx, va_idx = stratified_split_indices_multi(strata, val_ratio, seed)
            # If variety is fixed (e.g. TRAIN_en-AU) but source differs (Google/Reddit), split by label+source
            elif n_src > 1:
                strategy = "label_source"
                strata = (tr_pool["label"].astype(str) + "__" + tr_pool["source_name"]).values
                tr_idx, va_idx = stratified_split_indices_multi(strata, val_ratio, seed)
            # Otherwise simple stratified
            else:
                strategy = "label"
                strata = tr_pool["label"].values
                tr_idx, va_idx = stratified_split_indices(strata, val_ratio, seed)

            df_tr = tr_pool.iloc[tr_idx][KEEP_COLS].reset_index(drop=True)
            df_va = tr_pool.iloc[va_idx][KEEP_COLS].reset_index(drop=True)

            out_dir = trainsets_dir / safe_name(setting)
            out_dir.mkdir(parents=True, exist_ok=True)

            p_tr = out_dir / "train.csv"
            p_va = out_dir / "val.csv"
            df_tr.to_csv(p_tr, index=False)
            df_va.to_csv(p_va, index=False)

            # Splits JSON
            split_obj = {
                "task": task, "setting": setting, "seed": seed, "val_ratio": val_ratio,
                "split_strategy": strategy, "max_len_for_models": int(max_len_for_models),
                "train_row_ids": df_tr["row_id"].tolist(), "val_row_ids": df_va["row_id"].tolist(),
            }
            p_split_global = splits_dir / f"{safe_name(task)}__{safe_name(setting)}.json"
            save_json(p_split_global, split_obj)

            p_split_local = splits_out_dir / f"{safe_name(setting)}.json"
            save_json(p_split_local, split_obj)

            # Manifest
            manifest = {
                "task": task, "setting": setting,
                "files": {"train_csv": str(p_tr), "val_csv": str(p_va)},
                "splits_json": str(p_split_global),
                "testsets_index_csv": str(testsets_index_path),
                "summary": {"train": summarize(df_tr), "val": summarize(df_va)}
            }
            save_json(out_dir / "manifest.json", manifest)

            # Record for indices
            row = {
                "task": task, "setting": setting,
                "train_csv": str(p_tr), "val_csv": str(p_va),
                "splits_json": str(p_split_global),
                "testsets_index_csv": str(testsets_index_path),
                "n_train": len(df_tr), "n_val": len(df_va)
            }
            all_index_rows.append(row)
            task_settings_rows.append(row)

            # Audit
            audit_rows.append({
                "task": task, "setting": setting, "split_strategy": strategy,
                "train_label_dist": dict(df_tr["label"].value_counts(normalize=True)),
                "val_label_dist": dict(df_va["label"].value_counts(normalize=True))
            })

        # --- SAVE PER-TASK INDICES ---
        if task_settings_rows:
            df_ts = pd.DataFrame(task_settings_rows)
            # 1. processed/<task>/index_settings.csv
            df_ts.to_csv(task_dir / "index_settings.csv", index=False)
            # 2. processed/<task>/trainsets/index_trainsets.csv
            df_ts.to_csv(trainsets_dir / "index_trainsets.csv", index=False)

            # 3. processed/<task>/train/index_train.csv (copy for loaders that look under train/)
            (task_dir / "train").mkdir(exist_ok=True)
            df_ts.to_csv(task_dir / "train" / "index_train.csv", index=False)

            # 4. processed/<task>/index_train.csv (copy for loaders that look in the task root)
            df_ts.to_csv(task_dir / "index_train.csv", index=False)

    # Global Index
    index_df = pd.DataFrame(all_index_rows)
    if not index_df.empty:
        index_df = index_df.sort_values(["task", "setting"]).reset_index(drop=True)
        index_df.to_csv(proc_dir / "index.csv", index=False)

    audit_df = pd.DataFrame(audit_rows)
    audit_df.to_csv(report_dir / "preprocess_audit.csv", index=False)

    return index_df
'''
(PKG_DIR / "preprocess.py").write_text(build_code)
print("Wrote:", PKG_DIR / "preprocess.py")

Wrote: /content/drive/MyDrive/DNLP/src/preprocess.py
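
The sarcasm source constraint in `run_preprocess` amounts to a single boolean mask. The sketch below shows an equivalent filter on a toy DataFrame; the helper name `keep_sarcasm_from` and the rows are assumptions for illustration. One difference to note: this version keeps the original row order, whereas the pipeline's `pd.concat` places the surviving sarcasm rows after all other rows.

```python
import pandas as pd

def keep_sarcasm_from(df: pd.DataFrame, source: str) -> pd.DataFrame:
    """Drop sarcasm rows whose source differs from `source` (hypothetical helper)."""
    # Non-sarcasm rows always pass; sarcasm rows pass only for the chosen source.
    keep = (df["task"] != "sarcasm") | (df["source_name"] == source)
    return df[keep].reset_index(drop=True)

# Toy rows (illustrative values, not from the dataset).
toy = pd.DataFrame({
    "task":        ["sentiment", "sarcasm", "sarcasm", "sentiment"],
    "source_name": ["Google",    "Reddit",  "Google",  "Reddit"],
    "label":       [1, 0, 1, 0],
})
filtered = keep_sarcasm_from(toy, "Reddit")
print(filtered)
```

Only the sarcasm row from Google is removed; sentiment rows from every source are kept.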


## Cell 11 — Import the pipeline entrypoint

Add the project root to `sys.path` (and drop `src/` itself if it is there, so the package is always imported as `src.*`), then import `run_preprocess` and `clean_dir`.
This keeps downstream notebooks clean and consistent.

In [None]:
import sys
if str(BASE) not in sys.path:
    sys.path.insert(0, str(BASE))

if str(SRC_DIR) in sys.path:
    sys.path.remove(str(SRC_DIR))

from src.preprocess import run_preprocess
from src.io_utils import clean_dir

## Cell 12 — Run preprocessing

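The cell first deletes and recreates the processed, splits, and reports directories so that files from earlier runs cannot mix with new ones.
It then executes `run_preprocess(...)` with the configured paths and parameters, which writes all processed datasets + indices to disk and returns the master `index_df`.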

In [None]:
# 1. Clean old outputs to prevent mixed state
print("Cleaning old processed data...")
clean_dir(PROC_DIR)
clean_dir(SPLITS_DIR)
clean_dir(REPORT_DIR)

# 2. Run
index_df = run_preprocess(
    raw_dir=RAW_DIR,
    proc_dir=PROC_DIR,
    splits_dir=SPLITS_DIR,
    report_dir=REPORT_DIR,
    seed=CFG["SEED"],
    val_ratio=CFG["VAL_RATIO"],
    sarc_source_only=CFG["SARC_SOURCE_ONLY"],
    max_len_for_models=CFG["MAX_LEN_FOR_MODELS"]
)

print("Preprocess done. Rows in index:", len(index_df))
index_df.head(50)

Cleaning old processed data...
Preprocess done. Rows in index: 9


Unnamed: 0,task,setting,train_csv,val_csv,splits_json,testsets_index_csv,n_train,n_val
0,sarcasm,FULL,/content/drive/MyDrive/DNLP/data/processed/sar...,/content/drive/MyDrive/DNLP/data/processed/sar...,/content/drive/MyDrive/DNLP/data/splits/sarcas...,/content/drive/MyDrive/DNLP/data/processed/sar...,3585,895
1,sarcasm,TRAIN_en-AU,/content/drive/MyDrive/DNLP/data/processed/sar...,/content/drive/MyDrive/DNLP/data/processed/sar...,/content/drive/MyDrive/DNLP/data/splits/sarcas...,/content/drive/MyDrive/DNLP/data/processed/sar...,1411,352
2,sarcasm,TRAIN_en-IN,/content/drive/MyDrive/DNLP/data/processed/sar...,/content/drive/MyDrive/DNLP/data/processed/sar...,/content/drive/MyDrive/DNLP/data/splits/sarcas...,/content/drive/MyDrive/DNLP/data/processed/sar...,1349,337
3,sarcasm,TRAIN_en-UK,/content/drive/MyDrive/DNLP/data/processed/sar...,/content/drive/MyDrive/DNLP/data/processed/sar...,/content/drive/MyDrive/DNLP/data/splits/sarcas...,/content/drive/MyDrive/DNLP/data/processed/sar...,825,206
4,sentiment,Google,/content/drive/MyDrive/DNLP/data/processed/sen...,/content/drive/MyDrive/DNLP/data/processed/sen...,/content/drive/MyDrive/DNLP/data/splits/sentim...,/content/drive/MyDrive/DNLP/data/processed/sen...,3529,882
5,sentiment,Reddit,/content/drive/MyDrive/DNLP/data/processed/sen...,/content/drive/MyDrive/DNLP/data/processed/sen...,/content/drive/MyDrive/DNLP/data/splits/sentim...,/content/drive/MyDrive/DNLP/data/processed/sen...,3564,891
6,sentiment,TRAIN_en-AU,/content/drive/MyDrive/DNLP/data/processed/sen...,/content/drive/MyDrive/DNLP/data/processed/sen...,/content/drive/MyDrive/DNLP/data/splits/sentim...,/content/drive/MyDrive/DNLP/data/processed/sen...,2167,542
7,sentiment,TRAIN_en-IN,/content/drive/MyDrive/DNLP/data/processed/sen...,/content/drive/MyDrive/DNLP/data/processed/sen...,/content/drive/MyDrive/DNLP/data/splits/sentim...,/content/drive/MyDrive/DNLP/data/processed/sen...,2667,666
8,sentiment,TRAIN_en-UK,/content/drive/MyDrive/DNLP/data/processed/sen...,/content/drive/MyDrive/DNLP/data/processed/sen...,/content/drive/MyDrive/DNLP/data/splits/sentim...,/content/drive/MyDrive/DNLP/data/processed/sen...,2259,565


## Cell 13 — Inspect the master index

Quick view of all (task, setting) combinations created, along with train/val sizes.
This is the table your model notebooks should read to load the correct files.

In [None]:
pd.set_option("display.max_rows", 200)
index_df[["task","setting","n_train","n_val"]]

Unnamed: 0,task,setting,n_train,n_val
0,sarcasm,FULL,3585,895
1,sarcasm,TRAIN_en-AU,1411,352
2,sarcasm,TRAIN_en-IN,1349,337
3,sarcasm,TRAIN_en-UK,825,206
4,sentiment,Google,3529,882
5,sentiment,Reddit,3564,891
6,sentiment,TRAIN_en-AU,2167,542
7,sentiment,TRAIN_en-IN,2667,666
8,sentiment,TRAIN_en-UK,2259,565
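
A model notebook could resolve its files from the master index roughly as follows. This is a sketch under stated assumptions: `load_setting` is a hypothetical helper, and the demo builds a throwaway index with made-up rows so that it runs without Drive access. In the real project the index is the `index.csv` written to `PROC_DIR`.

```python
import tempfile
from pathlib import Path

import pandas as pd

def load_setting(index_csv, task, setting):
    """Return (train_df, val_df) for one row of the master index (hypothetical helper)."""
    index = pd.read_csv(index_csv)
    match = index[(index["task"] == task) & (index["setting"] == setting)]
    if len(match) != 1:
        raise KeyError(f"Expected one row for ({task}, {setting}), found {len(match)}")
    row = match.iloc[0]
    return pd.read_csv(row["train_csv"]), pd.read_csv(row["val_csv"])

# Demo on a temporary directory with made-up rows.
cols = ["row_id", "task", "label", "variety_name", "source_name", "text", "text_norm"]
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    pd.DataFrame([[0, "sarcasm", 1, "en-AU", "Reddit", "oh great", "oh great"]],
                 columns=cols).to_csv(tmp / "train.csv", index=False)
    pd.DataFrame([[1, "sarcasm", 0, "en-AU", "Reddit", "nice day", "nice day"]],
                 columns=cols).to_csv(tmp / "val.csv", index=False)
    pd.DataFrame([{
        "task": "sarcasm", "setting": "FULL",
        "train_csv": str(tmp / "train.csv"), "val_csv": str(tmp / "val.csv"),
    }]).to_csv(tmp / "index.csv", index=False)

    train_df, val_df = load_setting(tmp / "index.csv", "sarcasm", "FULL")

print(train_df.shape, val_df.shape)
```

Looking files up by `(task, setting)` rather than hard-coding paths means a model notebook keeps working if the folder layout under `data/processed` changes, as long as the index columns stay the same.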
