# Phase 3 - RoBERTa Extension - Mixture of Adapters

## Goal
Train and evaluate a Mixture-of-Adapters model (a mixture-of-experts design, abbreviated MoE in this notebook) on figurative language tasks (Sentiment, Sarcasm), and analyze performance across different English varieties.

## Dataset + Evaluation Protocol
*   **Tasks**: Sentiment Analysis, Sarcasm Detection.
*   **Varieties**: en-AU (Australia), en-IN (India), en-UK (United Kingdom), etc.
*   **Sources**: Google, Reddit, Twitter, etc.
*   **Data**: Loaded via per-task index files: `index_settings.csv` (train/val splits) and `testsets/index_testsets.csv` (test sets).

## Notebook Workflow
1.  **Setup**: Imports, Drive mount, Seeds.
2.  **Config**: Define hyperparameters (MoE, LR, Epochs).
3.  **Data Loading**: Parse index CSVs.
4.  **Model**: Define RoBERTa + MoE Adapters + Router.
5.  **Training**: Two-stage protocol (Stage 1: Pooled Pretraining, Stage 2: Specialized Adaptation).
6.  **Evaluation**: Compute F1/Acc, compare vs CE baseline, analyze errors.
7.  **Visualization**: Plot heatmaps and locale-specific bars.

## Outputs
*   **Metrics**: `models/.../metrics/moe_metrics_all.csv`
*   **Predictions**: `models/.../predictions/moe_predictions_all.csv`
*   **Plots**: `models/.../figures/*.png`
*   **Checkpoints**: `models/.../checkpoints/*.pt`

## Reproduction
*   **Seed**: 42
*   **Model**: `roberta-base`
*   **Config**: Defined in Step 2 (`CFG`).

## Step 1 — Setup
*   **Purpose**: Initialize environment, imports, random seeds, and drive paths.
*   **Inputs**: Google Drive path (`/content/drive/MyDrive/DNLP`).
*   **Outputs**: `DEVICE`, `BASE` path.
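*   **Assumptions**: Mounts Google Drive when running in Colab; the `BASE` directory must already exist, otherwise the cell stops with an assertion error.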

In [None]:
# ==== Cell 1: Setup ====
import os, random, json, time, math
from pathlib import Path

import numpy as np
import pandas as pd

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader

from transformers import AutoTokenizer, AutoModel
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# (Colab) mount drive if available
try:
    from google.colab import drive
    if not Path("/content/drive").exists():
        drive.mount("/content/drive")
except Exception as e:
    print("Colab drive mount skipped:", repr(e))

# ---------
# Paths
# ---------
BASE = Path("/content/drive/MyDrive/DNLP")
assert BASE.exists(), f"BASE not found: {BASE}"

# Add src to path if exists (hybrid style)
import sys
SRC_DIR = BASE / "src"
if SRC_DIR.exists():
    sys.path.append(str(SRC_DIR))
    print("✅ Added SRC_DIR to sys.path:", SRC_DIR)

# ---------
# Repro
# ---------
def set_seed(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

set_seed(42)

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
print("BASE  =", BASE)
print("DEVICE=", DEVICE)

Mounted at /content/drive
✅ Added SRC_DIR to sys.path: /content/drive/MyDrive/DNLP/src
BASE  = /content/drive/MyDrive/DNLP
DEVICE= cuda


## Step 2 — Config + run folders
*   **Purpose**: Define configuration dictionary (hyperparameters) and create output directories.
*   **Inputs**: `BASE` path.
*   **Outputs**: `CFG` dict, directories (`CKPT_DIR`, `MET_DIR`, etc.).
*   **Assumptions**: Uses `roberta-base`.

In [None]:
# ==== Cell 2: Config + run folders ====

RUN_NAME = "roberta_extension_mixture_of_adapters"

RUN_DIR = BASE / "models" / RUN_NAME
CKPT_DIR = RUN_DIR / "checkpoints"
MET_DIR  = RUN_DIR / "metrics"
PRD_DIR  = RUN_DIR / "predictions"
PLT_DIR  = RUN_DIR / "figures"
ANA_DIR  = RUN_DIR / "analysis"

for d in [CKPT_DIR, MET_DIR, PRD_DIR, PLT_DIR, ANA_DIR]:
    d.mkdir(parents=True, exist_ok=True)

CFG = {
    # Model
    "MODEL_NAME": "roberta-base",
    "MAX_LEN": 256,

    # Train
    "BATCH_SIZE": 16,
    "LR": 2e-5,
    "WEIGHT_DECAY": 0.01,
    "EPOCHS": 6,
    "PATIENCE": 2,
    "WARMUP_RATIO": 0.06,
    "GRAD_ACCUM": 1,
    "USE_AMP": True,
    "NUM_WORKERS": 2,

    # MoE (mixture of adapters)
    "N_EXPERTS": 3,
    "ADAPTER_BOTTLENECK": 128,
    "ROUTER_HIDDEN": 128,
    "ADAPTER_DROPOUT": 0.10,

    # Regularizers to prevent collapse
    "LOAD_BAL_W": 0.02,     # small
    "ENTROPY_W": 0.01,      # small, encourages higher entropy (we subtract it from loss)

    # Optional (keep off unless you explicitly want it)
    "EXPERT_L2_REG": 0.0,
    "ROUTER_SUP_W": 0.0,              # keep OFF (you moved away from it)
    "ROUTER_LABEL_SMOOTH": 0.05,
    "ROUTER_SUP_DROPOUT_P": 0.30,

    # Decision rule (match CE baseline)
    "FIXED_THRESHOLD": 0.5,

    # Two-stage protocol toggles
    "FREEZE_ROUTER_STAGE2": True,
    "FREEZE_BACKBONE_STAGE2": False,  # set True if you want faster but potentially weaker
}

VARIANT = "roberta_extension_mixture_of_adapters"

print("RUN_DIR =", RUN_DIR)
print("VARIANT =", VARIANT)
print("CFG:", json.dumps(CFG, indent=2))

RUN_DIR = /content/drive/MyDrive/DNLP/models/roberta_extension_mixture_of_adapters
VARIANT = roberta_extension_mixture_of_adapters
CFG: {
  "MODEL_NAME": "roberta-base",
  "MAX_LEN": 256,
  "BATCH_SIZE": 16,
  "LR": 2e-05,
  "WEIGHT_DECAY": 0.01,
  "EPOCHS": 6,
  "PATIENCE": 2,
  "WARMUP_RATIO": 0.06,
  "GRAD_ACCUM": 1,
  "USE_AMP": true,
  "NUM_WORKERS": 2,
  "N_EXPERTS": 3,
  "ADAPTER_BOTTLENECK": 128,
  "ROUTER_HIDDEN": 128,
  "ADAPTER_DROPOUT": 0.1,
  "LOAD_BAL_W": 0.02,
  "ENTROPY_W": 0.01,
  "EXPERT_L2_REG": 0.0,
  "ROUTER_SUP_W": 0.0,
  "ROUTER_LABEL_SMOOTH": 0.05,
  "ROUTER_SUP_DROPOUT_P": 0.3,
  "FIXED_THRESHOLD": 0.5,
  "FREEZE_ROUTER_STAGE2": true,
  "FREEZE_BACKBONE_STAGE2": false
}
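
To make the schedule-related values concrete, the sketch below works out the number of optimizer steps and the warmup length implied by `CFG`. The training-set size of 4,000 rows is a hypothetical figure chosen for illustration, not a property of the actual data; `lr_at` is an illustrative helper that mirrors the warmup-then-constant behaviour of the `linear_warmup` function defined in Step 8.

```python
import math

# Values copied from CFG above.
BATCH_SIZE, EPOCHS, GRAD_ACCUM, WARMUP_RATIO, LR = 16, 6, 1, 0.06, 2e-5

# Hypothetical training-set size, for illustration only.
n_train_rows = 4000

# The DataLoader keeps the last partial batch, hence the ceiling.
batches_per_epoch = math.ceil(n_train_rows / BATCH_SIZE)
total_steps = (batches_per_epoch * EPOCHS) // max(GRAD_ACCUM, 1)
warmup_steps = int(total_steps * WARMUP_RATIO)

def lr_at(step):
    """Linear warmup up to LR, then constant (no decay afterwards)."""
    if warmup_steps <= 0:
        return LR
    return LR * min(1.0, step / warmup_steps)

print("batches/epoch:", batches_per_epoch)
print("total steps  :", total_steps)
print("warmup steps :", warmup_steps)
print("lr at step 1 / end of warmup / last step:",
      lr_at(1), lr_at(warmup_steps), lr_at(total_steps))
```

Note that early stopping (`PATIENCE`) can end training before `total_steps` is reached, so the warmup fraction of the steps actually run may be larger than `WARMUP_RATIO`.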


## Step 3 — Load indices for a task
*   **Purpose**: Load train/test split metadata (CSV paths, settings) for specific tasks.
*   **Inputs**: `data/processed/{task}/index_settings.csv` and `data/processed/{task}/testsets/index_testsets.csv`.
*   **Outputs**: `settings_df` (train settings), `testsets_df` (test sets).
*   **Assumptions**: CSV files exist in the standard processed data structure.

In [None]:
# ==== Cell 3: Load indices for a task (index_settings + index_testsets) ====

def resolve_csv(p: str) -> Path:
    p = Path(p)
    return p if p.is_absolute() else (BASE / p)

def load_task_indices(task: str):
    task_dir = BASE / "data" / "processed" / task
    INDEX_SETTINGS = task_dir / "index_settings.csv"
    INDEX_TESTSETS = task_dir / "testsets" / "index_testsets.csv"

    assert INDEX_SETTINGS.exists(), f"Missing: {INDEX_SETTINGS}"
    assert INDEX_TESTSETS.exists(), f"Missing: {INDEX_TESTSETS}"

    settings_df = pd.read_csv(INDEX_SETTINGS)
    testsets_df = pd.read_csv(INDEX_TESTSETS)

    settings_df.columns = [c.strip().lower() for c in settings_df.columns]
    testsets_df.columns  = [c.strip().lower() for c in testsets_df.columns]

    # compat renames
    if "setting" in settings_df.columns and "train_setting" not in settings_df.columns:
        settings_df.rename(columns={"setting": "train_setting"}, inplace=True)
    if "csv" in testsets_df.columns and "test_csv" not in testsets_df.columns:
        testsets_df.rename(columns={"csv": "test_csv"}, inplace=True)

    need_s = {"train_setting","train_csv","val_csv"}
    need_t = {"test_setting","test_csv"}
    assert need_s.issubset(set(settings_df.columns)), f"{task}: settings_df missing {need_s - set(settings_df.columns)}"
    assert need_t.issubset(set(testsets_df.columns)), f"{task}: testsets_df missing {need_t - set(testsets_df.columns)}"

    settings_df["train_csv_abs"] = settings_df["train_csv"].apply(resolve_csv)
    settings_df["val_csv_abs"]   = settings_df["val_csv"].apply(resolve_csv)
    testsets_df["test_csv_abs"]  = testsets_df["test_csv"].apply(resolve_csv)

    # sanity
    for p in settings_df["train_csv_abs"].tolist() + settings_df["val_csv_abs"].tolist() + testsets_df["test_csv_abs"].tolist():
        assert Path(p).exists(), f"Missing CSV: {p}"

    return settings_df, testsets_df

# quick peek
for t in ["sentiment", "sarcasm"]:
    s_df, te_df = load_task_indices(t)
    print(f"\n[{t}] settings:", len(s_df), "testsets:", len(te_df))
    print("settings head:", s_df["train_setting"].head(6).tolist())
    print("testsets head:", te_df["test_setting"].head(6).tolist())


[sentiment] settings: 5 testsets: 6
settings head: ['Google', 'Reddit', 'TRAIN_en-AU', 'TRAIN_en-IN', 'TRAIN_en-UK']
testsets head: ['TEST_FULL', 'TEST_Google', 'TEST_Reddit', 'TEST_en-AU', 'TEST_en-IN', 'TEST_en-UK']

[sarcasm] settings: 4 testsets: 5
settings head: ['FULL', 'TRAIN_en-AU', 'TRAIN_en-IN', 'TRAIN_en-UK']
testsets head: ['TEST_FULL', 'TEST_Reddit', 'TEST_en-AU', 'TEST_en-IN', 'TEST_en-UK']
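
The two normalization rules in this cell (header cleanup with compat renames, and path resolution against `BASE`) can be sketched in isolation. This is a minimal illustration using only the standard library; the header row and the CSV paths are hypothetical, and `PurePosixPath` is used so the sketch behaves the same on any operating system.

```python
from pathlib import PurePosixPath

BASE = PurePosixPath("/content/drive/MyDrive/DNLP")

def resolve_csv(p):
    """Keep absolute paths as-is; resolve relative ones against BASE."""
    p = PurePosixPath(p)
    return p if p.is_absolute() else BASE / p

# Hypothetical header row with stray whitespace, mixed case and the old "setting" name.
raw_cols = [" Setting", "Train_CSV", "val_csv "]
cols = [c.strip().lower() for c in raw_cols]
cols = ["train_setting" if c == "setting" else c for c in cols]

need_s = {"train_setting", "train_csv", "val_csv"}
missing = need_s - set(cols)

rel = resolve_csv("data/processed/sarcasm/train.csv")   # hypothetical relative path
abs_ = resolve_csv("/tmp/custom/val.csv")                # hypothetical absolute path

print(cols, "missing:", missing)
print(rel)
print(abs_)
```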


## Step 4 — Load CE artifacts
*   **Purpose**: Load pre-computed Cross-Entropy (baseline) metrics/predictions for comparison.
*   **Inputs**: `models/roberta_baseline_ce` metrics and prediction CSVs.
*   **Outputs**: `metrics_ce`, `preds_ce` (DataFrames).
*   **Assumptions**: Baseline run has been completed and files exist.

In [None]:
# ==== Cell 4: Load CE artifacts (optional but recommended) ====

CE_RUN_DIR = BASE / "models" / "roberta_baseline_ce"
ce_met_path = CE_RUN_DIR / "metrics" / "roberta_ce_metrics_all.csv"
ce_prd_path = CE_RUN_DIR / "predictions" / "roberta_ce_predictions_all.csv"

metrics_ce = None
preds_ce = None

if ce_met_path.exists() and ce_prd_path.exists():
    metrics_ce = pd.read_csv(ce_met_path)
    preds_ce   = pd.read_csv(ce_prd_path)
    metrics_ce.columns = [c.strip().lower() for c in metrics_ce.columns]
    preds_ce.columns   = [c.strip().lower() for c in preds_ce.columns]
    print("✅ Loaded CE metrics/preds:", ce_met_path.name, ce_prd_path.name)
else:
    print("⚠️ CE metrics/preds not found. Delta + CE-vs-MoE error analysis will be skipped.")
    print("Expected:", ce_met_path)
    print("          ", ce_prd_path)

✅ Loaded CE metrics/preds: roberta_ce_metrics_all.csv roberta_ce_predictions_all.csv
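
One way the loaded baseline metrics might be compared with the MoE metrics is a key-based merge followed by a difference. The sketch below assumes both metric tables share the columns `task`, `train_setting`, `test_setting` and `macro_f1`; that schema is an assumption about the baseline file, and every number is made up for illustration rather than taken from a real run.

```python
import pandas as pd

keys = ["task", "train_setting", "test_setting"]

# Made-up rows for illustration only; these are not results from this project.
ce = pd.DataFrame({
    "task": ["sarcasm", "sarcasm"],
    "train_setting": ["FULL", "FULL"],
    "test_setting": ["TEST_en-AU", "TEST_en-IN"],
    "macro_f1": [0.70, 0.60],
})
moe = pd.DataFrame({
    "task": ["sarcasm", "sarcasm"],
    "train_setting": ["FULL", "FULL"],
    "test_setting": ["TEST_en-AU", "TEST_en-IN"],
    "macro_f1": [0.72, 0.59],
})

# An inner join keeps only (task, train_setting, test_setting) triples present in both runs.
merged = moe.merge(ce, on=keys, how="inner", suffixes=("_moe", "_ce"))
merged["delta_f1"] = merged["macro_f1_moe"] - merged["macro_f1_ce"]

print(merged[keys + ["macro_f1_ce", "macro_f1_moe", "delta_f1"]])
```

A positive `delta_f1` means the MoE variant scored higher than the baseline on that test set.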


## Step 5 — Dataset + Loader
*   **Purpose**: Define the `TextDS` dataset class and `make_loader` function for tokenization and batching.
*   **Inputs**: CSV paths containing text and labels.
*   **Outputs**: `TextDS` class, `DataLoader` instances.
*   **Assumptions**: Input CSVs contain columns for `text`, `label`, and optionally `variety` info.

In [None]:
# ==== Cell 5: Dataset + Loader ====

tok = AutoTokenizer.from_pretrained(CFG["MODEL_NAME"], use_fast=True)

def infer_text_col(df):
    for c in ["text", "text_clean", "text_norm", "sentence", "content"]:
        if c in df.columns:
            return c
    raise ValueError(f"No text col found. Columns={list(df.columns)[:40]}")

def infer_label_col(df):
    for c in ["label", "y", "gold", "target"]:
        if c in df.columns:
            return c
    raise ValueError(f"No label col found. Columns={list(df.columns)[:40]}")

def infer_variety_cols(df):
    # Prefer explicit id + name if available
    id_col = None
    name_col = None
    for c in ["variety_id"]:
        if c in df.columns: id_col = c
    for c in ["variety", "variety_name", "variety_code"]:
        if c in df.columns: name_col = c
    return id_col, name_col

def map_variety_to_id(v):
    # Standardize common forms to {0,1,2} = {en-AU, en-IN, en-UK}
    s = str(v).strip()
    s = s.replace("TEST_", "").replace("TRAIN_", "")
    s = s.replace("_", "-")
    s = s.replace("En-", "en-")  # normalize a capitalized "En-" prefix
    mp = {
        "en-AU": 0, "AU": 0, "En-AU": 0, "en-au": 0,
        "en-IN": 1, "IN": 1, "En-IN": 1, "en-in": 1,
        "en-UK": 2, "UK": 2, "En-UK": 2, "en-uk": 2,
    }
    return mp.get(s, -1), s

class TextDS(Dataset):
    def __init__(self, csv_path: Path, max_len: int):
        self.df = pd.read_csv(csv_path)
        self.df.columns = [c.strip().lower() for c in self.df.columns]

        self.text_col = infer_text_col(self.df)
        self.label_col = infer_label_col(self.df)
        self.vid_col, self.vname_col = infer_variety_cols(self.df)

        self.texts  = self.df[self.text_col].astype(str).tolist()
        self.labels = self.df[self.label_col].astype(int).values

        # row_id for joins
        if "row_id" in self.df.columns:
            self.row_id = self.df["row_id"].astype(int).values
        else:
            self.row_id = np.arange(len(self.df), dtype=int)

        # variety id + name
        self.variety_id = None
        self.variety_name = None

        if self.vid_col is not None and np.issubdtype(self.df[self.vid_col].dtype, np.number):
            self.variety_id = self.df[self.vid_col].astype(int).values
            if self.vname_col is not None:
                self.variety_name = self.df[self.vname_col].astype(str).values
            else:
                # best-effort name
                inv = {0:"en-AU", 1:"en-IN", 2:"en-UK"}
                self.variety_name = np.array([inv.get(int(x), "UNK") for x in self.variety_id], dtype=object)
        elif self.vname_col is not None:
            vraw = self.df[self.vname_col].astype(str).values
            vids, vnames = [], []
            for x in vraw:
                vid, vn = map_variety_to_id(x)
                vids.append(vid); vnames.append(vn)
            self.variety_id = np.array(vids, dtype=int)
            self.variety_name = np.array(vnames, dtype=object)
        else:
            # still runnable, but router won't learn variety routing
            self.variety_id = np.full(len(self.df), -1, dtype=int)
            self.variety_name = np.full(len(self.df), "UNK", dtype=object)

        self.max_len = max_len

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, i):
        t = self.texts[i]
        enc = tok(
            t,
            truncation=True,
            padding="max_length",
            max_length=self.max_len,
            return_tensors="pt",
        )
        enc = {k: v.squeeze(0) for k, v in enc.items()}
        y   = int(self.labels[i])
        vid = int(self.variety_id[i])
        rid = int(self.row_id[i])
        vnm = str(self.variety_name[i])
        return enc, y, vid, rid, vnm, t

def make_loader(csv_path: Path, shuffle: bool):
    ds = TextDS(csv_path, max_len=CFG["MAX_LEN"])
    ld = DataLoader(
        ds,
        batch_size=CFG["BATCH_SIZE"],
        shuffle=shuffle,
        num_workers=CFG["NUM_WORKERS"],
        pin_memory=(DEVICE=="cuda"),
    )
    return ds, ld
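
The variety-name normalization can be illustrated with a compact, case-insensitive variant. This is a hypothetical alternative to `map_variety_to_id`, not the function used above: it keeps the same id convention (`en-AU` = 0, `en-IN` = 1, `en-UK` = 2, unknown = -1) but lowercases the code before the lookup instead of listing every spelling.

```python
# Hypothetical compact variant; same id convention as map_variety_to_id.
VARIETY_IDS = {"au": 0, "in": 1, "uk": 2}

def variety_to_id(raw):
    """Return (variety_id, cleaned_name); -1 marks values that are not a known variety."""
    s = str(raw).strip().replace("TEST_", "").replace("TRAIN_", "").replace("_", "-")
    code = s.lower()
    if code.startswith("en-"):
        code = code[3:]
    return VARIETY_IDS.get(code, -1), s

for raw in ["TEST_en-AU", "en_IN", "UK", "Reddit"]:
    print(raw, "->", variety_to_id(raw))
```

Values such as a source name (`Reddit`) fall through to `-1`, which is the same id that `TextDS` assigns to rows without any variety information.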


## Step 6 — Metrics
*   **Purpose**: Define helper functions to calculate Accuracy and macro-averaged Precision, Recall, and F1.
*   **Inputs**: Ground truth labels and predicted probabilities.
*   **Outputs**: Dictionary of metric scores.
*   **Assumptions**: Binary classification with a fixed decision threshold (`CFG["FIXED_THRESHOLD"]`, 0.5 in this run).

In [None]:
# ==== Cell 6: Metrics (fixed thr=0.5) ====

def compute_metrics(y_true, y_pred):
    acc = accuracy_score(y_true, y_pred)
    p, r, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0
    )
    return {"acc": float(acc), "precision": float(p), "recall": float(r), "macro_f1": float(f1)}

def metrics_from_probs(y_true, prob, thr=None):
    thr = CFG["FIXED_THRESHOLD"] if thr is None else float(thr)
    y_pred = (prob >= thr).astype(int)
    m = compute_metrics(y_true, y_pred)
    return m, thr, y_pred
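
As a worked example of the decision rule and the macro average, the sketch below thresholds a handful of toy probabilities and computes macro-F1 by hand. The labels and probabilities are invented, and the hand-written `macro_f1` helper is only an illustration that assumes both classes occur in the data; the notebook itself relies on scikit-learn.

```python
def macro_f1(y_true, y_pred, classes=(0, 1)):
    """Unweighted mean of per-class F1 (a class with no TP, FP or FN contributes 0)."""
    f1s = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / len(f1s)

# Toy labels and probabilities, invented for illustration.
y_true = [1, 1, 0, 0, 0, 1]
prob = [0.9, 0.4, 0.2, 0.6, 0.1, 0.5]

thr = 0.5
y_pred = [int(p >= thr) for p in prob]   # ">=": a probability of exactly 0.5 counts as positive

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print("y_pred  :", y_pred)
print("accuracy:", accuracy)
print("macro F1:", macro_f1(y_true, y_pred))
```

Because the average is unweighted, the minority class influences the score as much as the majority class, which is why macro-F1 is also the model-selection criterion in the training cell.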

## Step 7 — MoE-of-Adapters model
*   **Purpose**: Define the `RobertaMoEAdapters` architecture (Backbone + Router + Experts).
*   **Inputs**: Pretrained `roberta-base`, config parameters (n_experts, bottleneck).
*   **Outputs**: PyTorch model class.
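*   **Assumptions**: The router and the bottleneck adapters operate on the CLS representation only (not inside the transformer layers), and expert outputs are combined by soft, softmax-weighted routing rather than top-k selection.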

In [None]:
# ==== Cell 7: MoE-of-Adapters model ====

class BottleneckAdapter(nn.Module):
    def __init__(self, dim, bottleneck=128, dropout=0.1):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up   = nn.Linear(bottleneck, dim)
        self.drop = nn.Dropout(dropout)
        self.act  = nn.ReLU()

    def forward(self, x):
        z = self.down(x)
        z = self.act(z)
        z = self.drop(z)
        z = self.up(z)
        return z  # delta

class Router(nn.Module):
    def __init__(self, dim, n_experts, hidden=128, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.Tanh(),
            nn.Dropout(dropout),
            nn.Linear(hidden, n_experts),
        )

    def forward(self, x):
        return self.net(x)  # logits

class RobertaMoEAdapters(nn.Module):
    def __init__(self, model_name, n_experts=3, bottleneck=128, router_hidden=128, dropout=0.1, num_labels=2):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(model_name)
        dim = self.backbone.config.hidden_size

        self.router = Router(dim, n_experts=n_experts, hidden=router_hidden, dropout=dropout)
        self.experts = nn.ModuleList([
            BottleneckAdapter(dim, bottleneck=bottleneck, dropout=dropout)
            for _ in range(n_experts)
        ])
        self.classifier = nn.Linear(dim, num_labels)

    def forward(self, input_ids, attention_mask):
        out = self.backbone(input_ids=input_ids, attention_mask=attention_mask)
        h = out.last_hidden_state[:, 0, :]  # CLS

        logits_router = self.router(h)                 # [B,E]
        w = torch.softmax(logits_router, dim=-1)       # [B,E]

        deltas = []
        for ex in self.experts:
            deltas.append(ex(h))                       # [B,D]
        deltas = torch.stack(deltas, dim=1)            # [B,E,D]

        mix_delta = (w.unsqueeze(-1) * deltas).sum(dim=1)  # [B,D]
        rep = h + mix_delta
        logits = self.classifier(rep)                  # [B,2]

        return logits, w, deltas, logits_router
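
The mixing step in `forward` can be traced with plain arrays. The NumPy sketch below uses random stand-ins for the CLS vectors, router logits and expert outputs (toy sizes, no trained weights) and follows the same three operations: softmax over experts, weighted sum of expert deltas, residual addition.

```python
import numpy as np

rng = np.random.default_rng(0)
B, E, D = 4, 3, 8   # batch size, number of experts, hidden size (toy values)

h = rng.normal(size=(B, D))              # stand-in for the CLS vectors
router_logits = rng.normal(size=(B, E))  # stand-in for Router(h)
deltas = rng.normal(size=(B, E, D))      # stand-in for the stacked expert outputs

# Softmax over the expert axis (shifted by the row max for numerical stability).
z = router_logits - router_logits.max(axis=-1, keepdims=True)
w = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)   # [B, E]

# Weighted sum of expert deltas, then residual update of the CLS vector.
mix_delta = (w[:, :, None] * deltas).sum(axis=1)        # [B, D]
rep = h + mix_delta                                     # [B, D]

print("row sums of w:", w.sum(axis=-1))
print("rep shape    :", rep.shape)
```

Every expert is evaluated for every sample, so compute grows linearly with the number of experts; the routing only changes how the outputs are weighted.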

## Step 8 — Loss + Optim + Warmup
*   **Purpose**: Define loss functions (CE, Load Balancing, Entropy), Optimizer, and Scheduler.
*   **Inputs**: Hyperparameters from `CFG`, class weights from data.
*   **Outputs**: Optimizer, Loss functions, LR scheduler helper.
*   **Assumptions**: Includes regularization to prevent router collapse.

In [None]:
# ==== Cell 8: Loss + Optim + Warmup ====

def compute_class_weights_from_csv(train_csv: Path, label_col="label"):
    df = pd.read_csv(train_csv)
    df.columns = [c.strip().lower() for c in df.columns]
    assert label_col in df.columns, f"{train_csv} missing '{label_col}'"
    y = df[label_col].astype(int).values
    n = len(y)
    n1 = int((y == 1).sum())
    n0 = int((y == 0).sum())
    n0 = max(n0, 1)
    n1 = max(n1, 1)
    w0 = n / (2.0 * n0)
    w1 = n / (2.0 * n1)
    return torch.tensor([w0, w1], dtype=torch.float32), (n0, n1, n)

def load_balance_loss(w):
    # encourage mean usage ~ uniform
    E = w.size(1)
    target = 1.0 / E
    mean_w = w.mean(dim=0)  # [E]
    return ((mean_w - target) ** 2).sum()

def entropy_value(w, eps=1e-9):
    # entropy per sample, averaged (higher = more spread)
    return (-(w * torch.log(w + eps)).sum(dim=1)).mean()

def expert_l2_reg(deltas):
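    # mean pairwise MSE between expert outputs (optional, off when EXPERT_L2_REG=0).
    # NOTE: added to the loss with a positive weight, this term pulls expert outputs
    # together; encouraging experts to differ would require subtracting it instead.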
    E = deltas.size(1)
    if E <= 1:
        return deltas.new_tensor(0.0)
    reg = 0.0
    cnt = 0
    for i in range(E):
        for j in range(i+1, E):
            reg = reg + F.mse_loss(deltas[:, i, :], deltas[:, j, :])
            cnt += 1
    return reg / max(cnt, 1)

def make_optimizer(model):
    no_decay = ["bias", "LayerNorm.weight"]
    params = [
        {"params": [p for n,p in model.named_parameters() if p.requires_grad and not any(nd in n for nd in no_decay)], "weight_decay": CFG["WEIGHT_DECAY"]},
        {"params": [p for n,p in model.named_parameters() if p.requires_grad and any(nd in n for nd in no_decay)], "weight_decay": 0.0},
    ]
    return torch.optim.AdamW(params, lr=CFG["LR"])

def linear_warmup(step, total_steps, warmup_ratio):
    warmup_steps = int(total_steps * warmup_ratio)
    if warmup_steps <= 0:
        return 1.0
    return min(1.0, step / warmup_steps)
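
The two router regularizers measure different things, which a small numeric comparison makes visible. The NumPy sketch below re-implements `load_balance_loss` and `entropy_value` on three hand-built routing patterns for 6 samples and 3 experts; the patterns are illustrative and not taken from a trained router.

```python
import numpy as np

def load_balance(w):
    """Squared distance of the batch-mean expert usage from the uniform target 1/E."""
    return float(((w.mean(axis=0) - 1.0 / w.shape[1]) ** 2).sum())

def entropy(w, eps=1e-9):
    """Per-sample routing entropy, averaged over the batch."""
    return float((-(w * np.log(w + eps)).sum(axis=1)).mean())

uniform = np.full((6, 3), 1 / 3)               # every sample spreads evenly over experts
collapsed = np.tile([1.0, 0.0, 0.0], (6, 1))   # every sample uses expert 0 only
hard_bal = np.eye(3)[[0, 1, 2, 0, 1, 2]]       # one-hot per sample, balanced on average

for name, w in [("uniform", uniform), ("collapsed", collapsed), ("hard_bal", hard_bal)]:
    print(f"{name:10s} load_balance={load_balance(w):.4f} entropy={entropy(w):.4f}")
```

The load-balance term is zero for both `uniform` and `hard_bal`, so it does not distinguish soft from hard routing; only the entropy term does. Since the training loss subtracts `ENTROPY_W * entropy`, it favours the soft, spread-out pattern.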

## Step 9 — predict_probs
*   **Purpose**: Inference loop to generate probabilities and gather metadata (labels, variety IDs).
*   **Inputs**: Trained `model` and `loader`.
*   **Outputs**: Arrays of probabilities, true labels, variety IDs, row IDs, and variety names, plus the list of input texts.

In [None]:
# ==== Cell 9: predict_probs ====

@torch.no_grad()
def predict_probs(model, loader):
    model.eval()
    probs, ys, vids, rids, vnames, texts = [], [], [], [], [], []
    for enc, y, vid, rid, vnm, t in loader:
        enc = {k: v.to(DEVICE) for k,v in enc.items()}
        logits, w, deltas, logits_router = model(enc["input_ids"], enc["attention_mask"])
        p1 = torch.softmax(logits, dim=-1)[:, 1].detach().cpu().numpy()

        probs.append(p1)
        ys.append(np.array(y, dtype=int))
        vids.append(np.array(vid, dtype=int))
        rids.append(np.array(rid, dtype=int))
        vnames += list(vnm)
        texts  += list(t)

    return (
        np.concatenate(probs),
        np.concatenate(ys),
        np.concatenate(vids),
        np.concatenate(rids),
        np.array(vnames, dtype=object),
        texts,
    )

## Step 10 — Train one setting
*   **Purpose**: Main training routine for a single setting. Supports Stage 1 (Pool) and Stage 2 (Adaptation).
*   **Inputs**: Task, train/val CSVs, init checkpoint, freeze toggles.
*   **Outputs**: Trained model checkpoint (`.pt`).
*   **Assumptions**: Uses `RobertaMoEAdapters`, keeps the weights from the epoch with the best validation macro-F1, and skips training when the checkpoint file already exists.

In [None]:
# ==== Cell 10: Train one setting (supports init_state + router freeze) ====

def set_requires_grad(module, flag: bool):
    for p in module.parameters():
        p.requires_grad = flag

def train_one_setting_moe(
    task: str,
    train_setting: str,
    train_csv: Path,
    val_csv: Path,
    save_name: str,
    init_ckpt: Path = None,
    freeze_router: bool = False,
    freeze_backbone: bool = False,
):
    ckpt_path = CKPT_DIR / f"{save_name}.pt"
    if ckpt_path.exists():
        print(f"✅ CKPT exists, skipping train: {ckpt_path.name}")
        return ckpt_path

    tr_ds, tr_ld = make_loader(train_csv, shuffle=True)
    va_ds, va_ld = make_loader(val_csv, shuffle=False)

    class_wts_cpu, (n0, n1, n) = compute_class_weights_from_csv(train_csv)
    class_wts = class_wts_cpu.to(DEVICE)
    print(f"[{task} | {train_setting}] train label counts: n0={n0} n1={n1} n={n} | class_wts=[{class_wts_cpu[0]:.3f},{class_wts_cpu[1]:.3f}]")

    model = RobertaMoEAdapters(
        CFG["MODEL_NAME"],
        n_experts=CFG["N_EXPERTS"],
        bottleneck=CFG["ADAPTER_BOTTLENECK"],
        router_hidden=CFG["ROUTER_HIDDEN"],
        dropout=CFG["ADAPTER_DROPOUT"],
    ).to(DEVICE)

    # init from FULL ckpt if provided
    init_meta = {"init_from": None}
    if init_ckpt is not None:
        blob = torch.load(init_ckpt, map_location="cpu")
        model.load_state_dict(blob["state_dict"], strict=True)
        init_meta["init_from"] = str(init_ckpt)
        print("✅ Initialized from:", init_ckpt.name)

    # freeze toggles for stage2
    if freeze_router:
        set_requires_grad(model.router, False)
    if freeze_backbone:
        set_requires_grad(model.backbone, False)

    opt = make_optimizer(model)

    total_steps = (len(tr_ld) * CFG["EPOCHS"]) // max(CFG["GRAD_ACCUM"], 1)
    scaler = torch.amp.GradScaler("cuda", enabled=(CFG["USE_AMP"] and DEVICE=="cuda"))

    best_val_f1 = -1.0
    best_state = None
    bad_epochs = 0
    step = 0

    for ep in range(1, CFG["EPOCHS"] + 1):
        model.train()
        losses = []

        opt.zero_grad(set_to_none=True)

        for it, batch in enumerate(tr_ld, start=1):
            enc, y, vid, rid, vnm, t = batch
            enc = {k: v.to(DEVICE) for k,v in enc.items()}
            y = torch.as_tensor(y, dtype=torch.long, device=DEVICE)

            with torch.amp.autocast("cuda", enabled=(CFG["USE_AMP"] and DEVICE=="cuda")):
                logits, w, deltas, logits_router = model(enc["input_ids"], enc["attention_mask"])

                # weighted CE
                loss_ce = F.cross_entropy(logits, y, weight=class_wts)

                # regularizers
                loss_lb = load_balance_loss(w)
                ent = entropy_value(w)  # higher is better

                loss = loss_ce + CFG["LOAD_BAL_W"] * loss_lb - CFG["ENTROPY_W"] * ent

                if CFG["EXPERT_L2_REG"] > 0:
                    loss = loss + CFG["EXPERT_L2_REG"] * expert_l2_reg(deltas)

                # NOTE: router supervision intentionally OFF here (ROUTER_SUP_W=0)

                loss = loss / max(CFG["GRAD_ACCUM"], 1)

            scaler.scale(loss).backward()

            if it % max(CFG["GRAD_ACCUM"], 1) == 0:
                # warmup LR
                step += 1
                lr_scale = linear_warmup(step, total_steps, CFG["WARMUP_RATIO"])
                for g in opt.param_groups:
                    g["lr"] = CFG["LR"] * lr_scale

                scaler.step(opt)
                scaler.update()
                opt.zero_grad(set_to_none=True)

            losses.append(float(loss.detach().cpu().item()))

        # ---- Validation (fixed thr=0.5) ----
        prob, y_true, v_true, rid_true, vnm_true, texts = predict_probs(model, va_ld)
        m, thr_use, _ = metrics_from_probs(y_true, prob, thr=CFG["FIXED_THRESHOLD"])
        val_f1 = m["macro_f1"]

        print(f"[{task} | {train_setting} | {VARIANT}] EP{ep} loss={np.mean(losses):.4f} valF1@0.5={val_f1:.4f}")

        if val_f1 > best_val_f1:
            best_val_f1 = val_f1
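            # clone() so the snapshot is independent of the live weights even on CPU,
            # where .cpu() returns the same tensor instead of a copy
            best_state = {k: v.detach().cpu().clone() for k,v in model.state_dict().items()}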
            bad_epochs = 0
        else:
            bad_epochs += 1
            if bad_epochs >= CFG["PATIENCE"]:
                print("Early stopping.")
                break

    assert best_state is not None, "Training failed: best_state is None"

    torch.save({
        "state_dict": best_state,
        "best_val_f1": float(best_val_f1),
        "cfg": CFG,
        "task": task,
        "train_setting": train_setting,
        "variant": VARIANT,
        "threshold": float(CFG["FIXED_THRESHOLD"]),
        "freeze_router": bool(freeze_router),
        "freeze_backbone": bool(freeze_backbone),
        **init_meta,
    }, ckpt_path)

    print("✅ Saved:", ckpt_path)
    return ckpt_path
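
The model-selection logic at the end of each epoch can be replayed on a list of validation scores. The sketch below is a stand-alone simulation of the same rule (strict improvement resets the counter, anything else counts as a bad epoch); the helper name `run_early_stopping` and the scores are invented for illustration.

```python
def run_early_stopping(val_f1_per_epoch, patience=2):
    """Return (best_epoch, best_f1, last_epoch_run) under the strict-improvement rule."""
    best, best_epoch, bad = -1.0, None, 0
    for ep, f1 in enumerate(val_f1_per_epoch, start=1):
        if f1 > best:
            best, best_epoch, bad = f1, ep, 0
        else:
            bad += 1
            if bad >= patience:
                return best_epoch, best, ep   # stopped early at epoch `ep`
    return best_epoch, best, len(val_f1_per_epoch)

# Invented validation scores, for illustration only.
scores = [0.61, 0.66, 0.65, 0.66, 0.70, 0.69]
result = run_early_stopping(scores, patience=2)
print(result)
```

With `patience=2`, a tie with the current best counts as a bad epoch, so this run stops after epoch 4 and keeps the epoch 2 weights; the later score of 0.70 is never reached. A larger `PATIENCE` trades extra training time for tolerance of such plateaus.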

## Step 11 — Evaluate a checkpoint
*   **Purpose**: Evaluate a saved model on all associated test sets and break down performance by variety.
*   **Inputs**: Checkpoint path, testsets DataFrame.
*   **Outputs**: DataFrames for metrics, predictions, and per-variety stats.

In [None]:
# ==== Cell 11: Evaluate a checkpoint on all testsets (with per-variety breakdown) ====

def load_model_from_ckpt(ckpt_path: Path):
    blob = torch.load(ckpt_path, map_location="cpu")
    model = RobertaMoEAdapters(
        CFG["MODEL_NAME"],
        n_experts=CFG["N_EXPERTS"],
        bottleneck=CFG["ADAPTER_BOTTLENECK"],
        router_hidden=CFG["ROUTER_HIDDEN"],
        dropout=CFG["ADAPTER_DROPOUT"],
    ).to(DEVICE)
    model.load_state_dict(blob["state_dict"], strict=True)
    model.eval()
    return model, blob

def eval_on_testsets(task: str, train_setting: str, ckpt_path: Path, testsets_df: pd.DataFrame):
    model, blob = load_model_from_ckpt(ckpt_path)

    all_met = []
    all_prd = []
    all_pervar = []

    for _, r in testsets_df.iterrows():
        test_setting = r["test_setting"]
        test_csv = r["test_csv_abs"]

        te_ds, te_ld = make_loader(test_csv, shuffle=False)
        prob, y_true, v_true, rid_true, vnm_true, texts = predict_probs(model, te_ld)

        m, thr_use, y_pred = metrics_from_probs(y_true, prob, thr=CFG["FIXED_THRESHOLD"])

        # ---- global metrics row (match baseline schema) ----
        mrow = {
            "task": task,
            "train_setting": train_setting,
            "variant": VARIANT,
            "test_setting": test_setting,
            "split": "test",
            "n": int(len(y_true)),
            **m,
            "threshold_type": "fixed0.5",
            "threshold": float(thr_use),
        }
        all_met.append(mrow)

        # ---- predictions ----
        dfp = pd.DataFrame({
            "task": task,
            "train_setting": train_setting,
            "variant": VARIANT,
            "test_setting": test_setting,
            "row_id": rid_true,
            "label": y_true,
            "prob": prob,
            "pred": y_pred,
            "threshold": float(thr_use),
            "variety_id": v_true,
            "variety_name": vnm_true,
            "text": texts,
        })
        all_prd.append(dfp)

        # ---- per-variety breakdown inside this testset ----
        for vid in sorted(set(v_true.tolist())):
            mask = (v_true == vid)
            if mask.sum() == 0:
                continue
            mv, _, ypv = metrics_from_probs(y_true[mask], prob[mask], thr=thr_use)
            vname = str(pd.Series(vnm_true[mask]).mode().iloc[0]) if mask.sum() > 0 else "UNK"
            all_pervar.append({
                "task": task,
                "train_setting": train_setting,
                "variant": VARIANT,
                "test_setting": test_setting,
                "split": "test",
                "variety_id": int(vid),
                "variety_name": vname,
                "n": int(mask.sum()),
                **mv,
                "threshold_type": "fixed0.5",
                "threshold": float(thr_use),
            })

    return pd.DataFrame(all_met), pd.concat(all_prd, ignore_index=True), pd.DataFrame(all_pervar)
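
The per-variety breakdown is a grouped metric computed with boolean masks. The sketch below shows the same pattern on toy arrays, using accuracy instead of the full metric dictionary to keep it short; all values are invented for illustration.

```python
import numpy as np

# Toy arrays, invented for illustration.
y_true = np.array([1, 0, 1, 0, 1, 0])
y_pred = np.array([1, 0, 0, 0, 1, 1])
v_true = np.array([0, 0, 1, 1, 2, -1])   # -1 = variety unknown

per_variety_acc = {}
for vid in sorted(set(v_true.tolist())):
    mask = v_true == vid                  # rows belonging to this variety
    per_variety_acc[vid] = float((y_true[mask] == y_pred[mask]).mean())

print(per_variety_acc)
```

Rows whose variety could not be resolved share the id `-1` and are reported as a group of their own, so a large `-1` group in the output usually points to missing or unrecognized variety columns in the test CSV.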

## Step 12 — Execution Loop (Stage 1 + Stage 2)
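*   **Purpose**: Orchestrate the two-stage training process for all tasks (Sentiment, Sarcasm).
    1.  **Stage 1**: Train on pooled data (an existing FULL setting, or a synthetic pool built from other settings when none exists) to get a general model.
    2.  **Stage 2**: Adapt the general model to specific settings (e.g., en-AU, en-IN), freezing components as set by `FREEZE_ROUTER_STAGE2` and `FREEZE_BACKBONE_STAGE2` in `CFG`.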
*   **Inputs**: `TASKS` list, Indices.
*   **Outputs**: Final merged metrics/predictions CSVs.
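
When a task has no FULL-like setting, the Stage 1 pool is chosen from the available train settings. The selection rule used by `build_pooled_train_val` can be sketched on its own; `choose_pool` is an illustrative helper name, and the setting names are the ones printed by the index-loading cell.

```python
def choose_pool(available, pool_from=("Google", "Reddit")):
    """Use whichever preferred settings exist; if none exist, pool every setting."""
    pool = [s for s in pool_from if s in available]
    return pool if pool else list(available)

sentiment = ["Google", "Reddit", "TRAIN_en-AU", "TRAIN_en-IN", "TRAIN_en-UK"]

print(choose_pool(sentiment))                        # both preferred sources exist
print(choose_pool(["Reddit", "TRAIN_en-AU"]))        # only one preferred source exists
print(choose_pool(["TRAIN_en-AU", "TRAIN_en-IN"]))   # none exist: fall back to all
```

If only one of the preferred sources is present, the pool consists of that single source rather than falling back to all settings.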

In [None]:
# ==== Cell 12: Build pooled stage-1 if FULL missing, then adapt all settings (NO DUPLICATE FULL) ====

TASKS = ["sentiment", "sarcasm"]

TMP_POOL_DIR = RUN_DIR / "tmp_pooled"
TMP_POOL_DIR.mkdir(parents=True, exist_ok=True)

def _read_df(csv_path: Path) -> pd.DataFrame:
    df = pd.read_csv(csv_path)
    df.columns = [c.strip().lower() for c in df.columns]
    return df

def build_pooled_train_val(task: str, settings_df: pd.DataFrame, pool_from=("Google","Reddit")):
    """
    Creates synthetic pooled train/val CSVs for stage-1 init.
    Pools from whichever of the `pool_from` settings exist (Google and/or Reddit by default);
    if none of them exist, pools from ALL available settings.
    De-duplicates by row_id.
    Returns (pool_train_csv, pool_val_csv, pool_setting_name, pool_sources)
    """
    available = settings_df["train_setting"].astype(str).tolist()
    pool = [s for s in pool_from if s in available]
    if len(pool) == 0:
        pool = available  # fallback: pool everything

    train_dfs, val_dfs = [], []
    for s in pool:
        r = settings_df[settings_df["train_setting"].astype(str) == s].iloc[0]
        train_dfs.append(_read_df(Path(r["train_csv_abs"])))
        val_dfs.append(_read_df(Path(r["val_csv_abs"])))

    tr_all = pd.concat(train_dfs, ignore_index=True)
    va_all = pd.concat(val_dfs, ignore_index=True)

    # De-dup by row_id when both splits have it; otherwise fall back to the
    # content/metadata columns that are present in BOTH splits (avoids a KeyError
    # when the validation split lacks a column that the training split has).
    if "row_id" in tr_all.columns and "row_id" in va_all.columns:
        key = ["row_id"]
    else:
        candidates = ["text","label","task","variety_id","variety_name","source","source_name"]
        key = [c for c in candidates if c in tr_all.columns and c in va_all.columns]
    tr_all = tr_all.drop_duplicates(subset=key or None).reset_index(drop=True)
    va_all = va_all.drop_duplicates(subset=key or None).reset_index(drop=True)

    pool_setting = "FULL"  # synthetic label (not in settings_df)
    out_dir = TMP_POOL_DIR / task
    out_dir.mkdir(parents=True, exist_ok=True)

    pool_train_csv = out_dir / f"{pool_setting}_train.csv"
    pool_val_csv   = out_dir / f"{pool_setting}_val.csv"
    tr_all.to_csv(pool_train_csv, index=False)
    va_all.to_csv(pool_val_csv, index=False)

    return pool_train_csv, pool_val_csv, pool_setting, pool

def find_full_like_existing(settings_df: pd.DataFrame):
    ts = settings_df["train_setting"].astype(str)
    # exact FULL
    m = ts.str.upper().eq("FULL")
    if m.any():
        return settings_df[m].iloc[0], "exact_FULL"
    # common variants
    aliases = {"TRAIN_FULL", "FULL_TRAIN", "FULLTRAIN", "ALL", "COMBINED", "TRAIN_ALL", "TRAIN_COMBINED"}
    m = ts.str.upper().isin(aliases)
    if m.any():
        return settings_df[m].iloc[0], "alias_match"
    # contains FULL
    m = ts.str.upper().str.contains("FULL", na=False)
    if m.any():
        return settings_df[m].iloc[0], "contains_FULL"
    return None, None

ALL_METRICS, ALL_PREDS, ALL_PERVAR = [], [], []

for task in TASKS:
    settings_df, testsets_df = load_task_indices(task)

    print("\n==============================")
    print(f"RUN TASK: {task}")
    print("Available train_settings:", settings_df["train_setting"].astype(str).tolist())

    # -------------------------
    # Stage 1: pooled pretrain (existing FULL or synthetic FULL)
    # -------------------------
    full_row, how = find_full_like_existing(settings_df)

    stage1_is_existing_row = (full_row is not None)

    if stage1_is_existing_row:
        stage1_name = str(full_row["train_setting"])
        pool_train_csv = Path(full_row["train_csv_abs"])
        pool_val_csv   = Path(full_row["val_csv_abs"])
        pool_sources   = [stage1_name]
        print(f"✅ [{task}] Using existing pooled setting: '{stage1_name}' ({how})")
    else:
        pool_train_csv, pool_val_csv, stage1_name, pool_sources = build_pooled_train_val(
            task, settings_df, pool_from=("Google","Reddit")
        )
        print(f"✅ [{task}] Built synthetic '{stage1_name}' from: {pool_sources}")
        print("   train_csv:", pool_train_csv.name, "rows=", len(pd.read_csv(pool_train_csv)))
        print("   val_csv  :", pool_val_csv.name,   "rows=", len(pd.read_csv(pool_val_csv)))

    full_ckpt_name = f"{task}__{stage1_name}__{VARIANT}__stage1_poolpretrain"
    full_ckpt = train_one_setting_moe(
        task=task,
        train_setting=stage1_name,
        train_csv=pool_train_csv,
        val_csv=pool_val_csv,
        save_name=full_ckpt_name,
        init_ckpt=None,
        freeze_router=False,
        freeze_backbone=False,
    )

    # Evaluate stage-1 pooled model
    met_full, prd_full, pv_full = eval_on_testsets(task, stage1_name, full_ckpt, testsets_df)
    ALL_METRICS.append(met_full)
    ALL_PREDS.append(prd_full)
    ALL_PERVAR.append(pv_full)

    # -------------------------
    # Stage 2: adapt each REAL setting from pooled init
    #   - If stage1 is an existing row (e.g., sarcasm FULL), skip it to avoid duplicates.
    # -------------------------
    if stage1_is_existing_row:
        stage2_df = settings_df.drop(index=full_row.name).copy()
        print(f"✅ [{task}] Stage-2: adapt {len(stage2_df)} real settings (skipping '{stage1_name}')")
    else:
        stage2_df = settings_df.copy()
        print(f"✅ [{task}] Stage-2: adapt {len(stage2_df)} real settings")

    for _, row in stage2_df.iterrows():
        train_setting = str(row["train_setting"])
        train_csv = Path(row["train_csv_abs"])
        val_csv   = Path(row["val_csv_abs"])

        save_name = f"{task}__{train_setting}__{VARIANT}__stage2_fromPOOL_routerFz"
        ckpt_path = train_one_setting_moe(
            task=task,
            train_setting=train_setting,
            train_csv=train_csv,
            val_csv=val_csv,
            save_name=save_name,
            init_ckpt=full_ckpt,
            freeze_router=bool(CFG["FREEZE_ROUTER_STAGE2"]),
            freeze_backbone=bool(CFG["FREEZE_BACKBONE_STAGE2"]),
        )

        met_df, prd_df, pv_df = eval_on_testsets(task, train_setting, ckpt_path, testsets_df)
        ALL_METRICS.append(met_df)
        ALL_PREDS.append(prd_df)
        ALL_PERVAR.append(pv_df)

# -------------------------
# Save combined artifacts
# -------------------------
metrics_moe = pd.concat(ALL_METRICS, ignore_index=True)
preds_moe   = pd.concat(ALL_PREDS, ignore_index=True)
pervar_moe  = pd.concat(ALL_PERVAR, ignore_index=True)

met_path = MET_DIR / "moe_metrics_all.csv"
prd_path = PRD_DIR / "moe_predictions_all.csv"
pv_path  = MET_DIR / "moe_pervar_metrics_all.csv"

metrics_moe.to_csv(met_path, index=False)
preds_moe.to_csv(prd_path, index=False)
pervar_moe.to_csv(pv_path, index=False)

print("\n✅ Saved:", met_path)
print("✅ Saved:", prd_path)
print("✅ Saved:", pv_path)

display(metrics_moe.head(20))


RUN TASK: sentiment
Available train_settings: ['Google', 'Reddit', 'TRAIN_en-AU', 'TRAIN_en-IN', 'TRAIN_en-UK']
✅ [sentiment] Built synthetic 'FULL' from: ['Google', 'Reddit']
   train_csv: FULL_train.csv rows= 7093
   val_csv  : FULL_val.csv rows= 1773
[sentiment | FULL] train label counts: n0=3579 n1=3514 n=7093 | class_wts=[0.991,1.009]


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


[sentiment | FULL | roberta_extension_mixture_of_adapters] EP1 loss=0.3633 valF1@0.5=0.8917
[sentiment | FULL | roberta_extension_mixture_of_adapters] EP2 loss=0.1980 valF1@0.5=0.8909
[sentiment | FULL | roberta_extension_mixture_of_adapters] EP3 loss=0.1322 valF1@0.5=0.8855
Early stopping.
✅ Saved: /content/drive/MyDrive/DNLP/models/roberta_extension_mixture_of_adapters/checkpoints/sentiment__FULL__roberta_extension_mixture_of_adapters__stage1_poolpretrain.pt


✅ [sentiment] Stage-2: adapt 5 real settings from pooled init
[sentiment | Google] train label counts: n0=900 n1=2629 n=3529 | class_wts=[1.961,0.671]


✅ Initialized from: sentiment__FULL__roberta_extension_mixture_of_adapters__stage1_poolpretrain.pt
[sentiment | Google | roberta_extension_mixture_of_adapters] EP1 loss=0.1919 valF1@0.5=0.9083
[sentiment | Google | roberta_extension_mixture_of_adapters] EP2 loss=0.1236 valF1@0.5=0.8913
[sentiment | Google | roberta_extension_mixture_of_adapters] EP3 loss=0.1033 valF1@0.5=0.8832
Early stopping.
✅ Saved: /content/drive/MyDrive/DNLP/models/roberta_extension_mixture_of_adapters/checkpoints/sentiment__Google__roberta_extension_mixture_of_adapters__stage2_fromPOOL_routerFz.pt


[sentiment | Reddit] train label counts: n0=2679 n1=885 n=3564 | class_wts=[0.665,2.014]


✅ Initialized from: sentiment__FULL__roberta_extension_mixture_of_adapters__stage1_poolpretrain.pt
[sentiment | Reddit | roberta_extension_mixture_of_adapters] EP1 loss=0.3059 valF1@0.5=0.8268
[sentiment | Reddit | roberta_extension_mixture_of_adapters] EP2 loss=0.1910 valF1@0.5=0.8204
[sentiment | Reddit | roberta_extension_mixture_of_adapters] EP3 loss=0.1153 valF1@0.5=0.8199
Early stopping.
✅ Saved: /content/drive/MyDrive/DNLP/models/roberta_extension_mixture_of_adapters/checkpoints/sentiment__Reddit__roberta_extension_mixture_of_adapters__stage2_fromPOOL_routerFz.pt


[sentiment | TRAIN_en-AU] train label counts: n0=1161 n1=1006 n=2167 | class_wts=[0.933,1.077]


✅ Initialized from: sentiment__FULL__roberta_extension_mixture_of_adapters__stage1_poolpretrain.pt
[sentiment | TRAIN_en-AU | roberta_extension_mixture_of_adapters] EP1 loss=0.1889 valF1@0.5=0.9114
[sentiment | TRAIN_en-AU | roberta_extension_mixture_of_adapters] EP2 loss=0.1108 valF1@0.5=0.9203
[sentiment | TRAIN_en-AU | roberta_extension_mixture_of_adapters] EP3 loss=0.0380 valF1@0.5=0.9221
[sentiment | TRAIN_en-AU | roberta_extension_mixture_of_adapters] EP4 loss=0.0406 valF1@0.5=0.9194
[sentiment | TRAIN_en-AU | roberta_extension_mixture_of_adapters] EP5 loss=0.0363 valF1@0.5=0.8782
Early stopping.
✅ Saved: /content/drive/MyDrive/DNLP/models/roberta_extension_mixture_of_adapters/checkpoints/sentiment__TRAIN_en-AU__roberta_extension_mixture_of_adapters__stage2_fromPOOL_routerFz.pt


[sentiment | TRAIN_en-IN] train label counts: n0=1338 n1=1329 n=2667 | class_wts=[0.997,1.003]


✅ Initialized from: sentiment__FULL__roberta_extension_mixture_of_adapters__stage1_poolpretrain.pt
[sentiment | TRAIN_en-IN | roberta_extension_mixture_of_adapters] EP1 loss=0.3025 valF1@0.5=0.8722
[sentiment | TRAIN_en-IN | roberta_extension_mixture_of_adapters] EP2 loss=0.2127 valF1@0.5=0.8708
[sentiment | TRAIN_en-IN | roberta_extension_mixture_of_adapters] EP3 loss=0.1509 valF1@0.5=0.8648
Early stopping.
✅ Saved: /content/drive/MyDrive/DNLP/models/roberta_extension_mixture_of_adapters/checkpoints/sentiment__TRAIN_en-IN__roberta_extension_mixture_of_adapters__stage2_fromPOOL_routerFz.pt


[sentiment | TRAIN_en-UK] train label counts: n0=1080 n1=1179 n=2259 | class_wts=[1.046,0.958]


✅ Initialized from: sentiment__FULL__roberta_extension_mixture_of_adapters__stage1_poolpretrain.pt
[sentiment | TRAIN_en-UK | roberta_extension_mixture_of_adapters] EP1 loss=0.1112 valF1@0.5=0.9537
[sentiment | TRAIN_en-UK | roberta_extension_mixture_of_adapters] EP2 loss=0.0545 valF1@0.5=0.9716
[sentiment | TRAIN_en-UK | roberta_extension_mixture_of_adapters] EP3 loss=0.0214 valF1@0.5=0.9787
[sentiment | TRAIN_en-UK | roberta_extension_mixture_of_adapters] EP4 loss=0.0151 valF1@0.5=0.9841
[sentiment | TRAIN_en-UK | roberta_extension_mixture_of_adapters] EP5 loss=0.0008 valF1@0.5=0.9752
[sentiment | TRAIN_en-UK | roberta_extension_mixture_of_adapters] EP6 loss=0.0074 valF1@0.5=0.9770
Early stopping.
✅ Saved: /content/drive/MyDrive/DNLP/models/roberta_extension_mixture_of_adapters/checkpoints/sentiment__TRAIN_en-UK__roberta_extension_mixture_of_adapters__stage2_fromPOOL_routerFz.pt



RUN TASK: sarcasm
Available train_settings: ['FULL', 'TRAIN_en-AU', 'TRAIN_en-IN', 'TRAIN_en-UK']
✅ [sarcasm] Using existing pooled setting: 'FULL' (exact_FULL)
[sarcasm | FULL] train label counts: n0=2631 n1=954 n=3585 | class_wts=[0.681,1.879]


[sarcasm | FULL | roberta_extension_mixture_of_adapters] EP1 loss=0.6794 valF1@0.5=0.4709
[sarcasm | FULL | roberta_extension_mixture_of_adapters] EP2 loss=0.6310 valF1@0.5=0.6209
[sarcasm | FULL | roberta_extension_mixture_of_adapters] EP3 loss=0.5326 valF1@0.5=0.6802
[sarcasm | FULL | roberta_extension_mixture_of_adapters] EP4 loss=0.3684 valF1@0.5=0.6878
[sarcasm | FULL | roberta_extension_mixture_of_adapters] EP5 loss=0.2056 valF1@0.5=0.6595
[sarcasm | FULL | roberta_extension_mixture_of_adapters] EP6 loss=0.1380 valF1@0.5=0.6817
Early stopping.
✅ Saved: /content/drive/MyDrive/DNLP/models/roberta_extension_mixture_of_adapters/checkpoints/sarcasm__FULL__roberta_extension_mixture_of_adapters__stage1_poolpretrain.pt


✅ [sarcasm] Stage-2: adapt 4 real settings from pooled init
[sarcasm | FULL] train label counts: n0=2631 n1=954 n=3585 | class_wts=[0.681,1.879]


✅ Initialized from: sarcasm__FULL__roberta_extension_mixture_of_adapters__stage1_poolpretrain.pt
[sarcasm | FULL | roberta_extension_mixture_of_adapters] EP1 loss=0.1963 valF1@0.5=0.6768
[sarcasm | FULL | roberta_extension_mixture_of_adapters] EP2 loss=0.1152 valF1@0.5=0.6800
[sarcasm | FULL | roberta_extension_mixture_of_adapters] EP3 loss=0.0685 valF1@0.5=0.6896
[sarcasm | FULL | roberta_extension_mixture_of_adapters] EP4 loss=0.0697 valF1@0.5=0.6766
[sarcasm | FULL | roberta_extension_mixture_of_adapters] EP5 loss=0.0561 valF1@0.5=0.6904
[sarcasm | FULL | roberta_extension_mixture_of_adapters] EP6 loss=0.0277 valF1@0.5=0.6827
✅ Saved: /content/drive/MyDrive/DNLP/models/roberta_extension_mixture_of_adapters/checkpoints/sarcasm__FULL__roberta_extension_mixture_of_adapters__stage2_fromPOOL_routerFz.pt


[sarcasm | TRAIN_en-AU] train label counts: n0=818 n1=593 n=1411 | class_wts=[0.862,1.190]


✅ Initialized from: sarcasm__FULL__roberta_extension_mixture_of_adapters__stage1_poolpretrain.pt
[sarcasm | TRAIN_en-AU | roberta_extension_mixture_of_adapters] EP1 loss=0.2198 valF1@0.5=0.8309
[sarcasm | TRAIN_en-AU | roberta_extension_mixture_of_adapters] EP2 loss=0.1252 valF1@0.5=0.8589
[sarcasm | TRAIN_en-AU | roberta_extension_mixture_of_adapters] EP3 loss=0.0506 valF1@0.5=0.8120
[sarcasm | TRAIN_en-AU | roberta_extension_mixture_of_adapters] EP4 loss=0.0800 valF1@0.5=0.8381
Early stopping.
✅ Saved: /content/drive/MyDrive/DNLP/models/roberta_extension_mixture_of_adapters/checkpoints/sarcasm__TRAIN_en-AU__roberta_extension_mixture_of_adapters__stage2_fromPOOL_routerFz.pt


[sarcasm | TRAIN_en-IN] train label counts: n0=1170 n1=179 n=1349 | class_wts=[0.576,3.768]


✅ Initialized from: sarcasm__FULL__roberta_extension_mixture_of_adapters__stage1_poolpretrain.pt
[sarcasm | TRAIN_en-IN | roberta_extension_mixture_of_adapters] EP1 loss=0.3451 valF1@0.5=0.7417
[sarcasm | TRAIN_en-IN | roberta_extension_mixture_of_adapters] EP2 loss=0.2160 valF1@0.5=0.7408
[sarcasm | TRAIN_en-IN | roberta_extension_mixture_of_adapters] EP3 loss=0.1178 valF1@0.5=0.8321
[sarcasm | TRAIN_en-IN | roberta_extension_mixture_of_adapters] EP4 loss=0.0634 valF1@0.5=0.8232
[sarcasm | TRAIN_en-IN | roberta_extension_mixture_of_adapters] EP5 loss=0.0912 valF1@0.5=0.8498
[sarcasm | TRAIN_en-IN | roberta_extension_mixture_of_adapters] EP6 loss=0.0981 valF1@0.5=0.8112
✅ Saved: /content/drive/MyDrive/DNLP/models/roberta_extension_mixture_of_adapters/checkpoints/sarcasm__TRAIN_en-IN__roberta_extension_mixture_of_adapters__stage2_fromPOOL_routerFz.pt


[sarcasm | TRAIN_en-UK] train label counts: n0=643 n1=182 n=825 | class_wts=[0.642,2.266]


✅ Initialized from: sarcasm__FULL__roberta_extension_mixture_of_adapters__stage1_poolpretrain.pt
[sarcasm | TRAIN_en-UK | roberta_extension_mixture_of_adapters] EP1 loss=0.2976 valF1@0.5=0.8958
[sarcasm | TRAIN_en-UK | roberta_extension_mixture_of_adapters] EP2 loss=0.1708 valF1@0.5=0.8623
[sarcasm | TRAIN_en-UK | roberta_extension_mixture_of_adapters] EP3 loss=0.1031 valF1@0.5=0.9005
[sarcasm | TRAIN_en-UK | roberta_extension_mixture_of_adapters] EP4 loss=0.0638 valF1@0.5=0.8958
[sarcasm | TRAIN_en-UK | roberta_extension_mixture_of_adapters] EP5 loss=0.0343 valF1@0.5=0.8449
Early stopping.
✅ Saved: /content/drive/MyDrive/DNLP/models/roberta_extension_mixture_of_adapters/checkpoints/sarcasm__TRAIN_en-UK__roberta_extension_mixture_of_adapters__stage2_fromPOOL_routerFz.pt



✅ Saved: /content/drive/MyDrive/DNLP/models/roberta_extension_mixture_of_adapters/metrics/moe_metrics_all.csv
✅ Saved: /content/drive/MyDrive/DNLP/models/roberta_extension_mixture_of_adapters/predictions/moe_predictions_all.csv
✅ Saved: /content/drive/MyDrive/DNLP/models/roberta_extension_mixture_of_adapters/metrics/moe_pervar_metrics_all.csv
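
The epoch logs above are consistent with patience-based early stopping on validation F1. The sketch below replays two logged F1 sequences under the assumption of a patience of 2 epochs and a cap of 6 epochs. Both values are inferred from the log rather than read from `CFG`, and `run_early_stopping` is a hypothetical helper, not the notebook's training loop.

```python
# Sketch of patience-based early stopping. PATIENCE and MAX_EPOCHS are
# assumptions inferred from the log, not values read from CFG.
PATIENCE, MAX_EPOCHS = 2, 6

def run_early_stopping(val_f1_per_epoch, patience=PATIENCE, max_epochs=MAX_EPOCHS):
    """Return (best_epoch, best_f1, last_epoch_run, stopped_early)."""
    best_f1, best_epoch, bad_epochs, epoch = float("-inf"), 0, 0, 0
    for epoch, f1 in enumerate(val_f1_per_epoch[:max_epochs], start=1):
        if f1 > best_f1:
            best_f1, best_epoch, bad_epochs = f1, epoch, 0  # new best: reset counter
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return best_epoch, best_f1, epoch, True
    return best_epoch, best_f1, epoch, False

# valF1@0.5 values copied from the log: sentiment FULL (stage 1) ...
print(run_early_stopping([0.8917, 0.8909, 0.8855]))
# ... and the sarcasm FULL run initialized from the pooled checkpoint.
print(run_early_stopping([0.6768, 0.6800, 0.6896, 0.6766, 0.6904, 0.6827]))
```

Under these assumptions the first run stops after epoch 3 and the second reaches the epoch cap without triggering the stop, which matches where `Early stopping.` does and does not appear in the log. Whether the saved checkpoint holds the best-epoch or last-epoch weights is determined by `train_one_setting_moe` and cannot be read off the log.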


| | task | train_setting | variant | test_setting | split | n | acc | precision | recall | macro_f1 | threshold_type | threshold |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | sentiment | FULL | roberta_extension_mixture_of_adapters | TEST_FULL | test | 1212 | 0.885314 | 0.887273 | 0.885612 | 0.885218 | fixed0.5 | 0.5 |
| 1 | sentiment | FULL | roberta_extension_mixture_of_adapters | TEST_Google | test | 603 | 0.905473 | 0.894342 | 0.848235 | 0.867766 | fixed0.5 | 0.5 |
| 2 | sentiment | FULL | roberta_extension_mixture_of_adapters | TEST_Reddit | test | 609 | 0.865353 | 0.815489 | 0.841674 | 0.826912 | fixed0.5 | 0.5 |
| 3 | sentiment | FULL | roberta_extension_mixture_of_adapters | TEST_en-AU | test | 371 | 0.913747 | 0.913172 | 0.914813 | 0.913565 | fixed0.5 | 0.5 |
| 4 | sentiment | FULL | roberta_extension_mixture_of_adapters | TEST_en-IN | test | 455 | 0.813187 | 0.819774 | 0.813657 | 0.812371 | fixed0.5 | 0.5 |
| 5 | sentiment | FULL | roberta_extension_mixture_of_adapters | TEST_en-UK | test | 386 | 0.943005 | 0.943447 | 0.942397 | 0.942819 | fixed0.5 | 0.5 |
| 6 | sentiment | Google | roberta_extension_mixture_of_adapters | TEST_FULL | test | 1212 | 0.893564 | 0.89387 | 0.893686 | 0.893558 | fixed0.5 | 0.5 |
| 7 | sentiment | Google | roberta_extension_mixture_of_adapters | TEST_Google | test | 603 | 0.917081 | 0.895927 | 0.881895 | 0.888574 | fixed0.5 | 0.5 |
| 8 | sentiment | Google | roberta_extension_mixture_of_adapters | TEST_Reddit | test | 609 | 0.870279 | 0.82308 | 0.838291 | 0.830156 | fixed0.5 | 0.5 |
| 9 | sentiment | Google | roberta_extension_mixture_of_adapters | TEST_en-AU | test | 371 | 0.908356 | 0.908592 | 0.907208 | 0.907792 | fixed0.5 | 0.5 |
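
The `class_wts` values printed in the training log are consistent with inverse-frequency weighting, w_c = n / (K * n_c) for K classes. The sketch below recomputes two logged pairs from their label counts as a sanity check. `balanced_class_weights` is a hypothetical helper, and the formula is an inference from the logged numbers; the notebook's actual computation lives in an earlier cell.

```python
def balanced_class_weights(counts):
    """Inverse-frequency weights n / (K * n_c), one weight per class."""
    n, k = sum(counts), len(counts)
    return [n / (k * c) for c in counts]

# Label counts copied from the log lines "[sentiment | FULL]" and "[sarcasm | TRAIN_en-IN]".
logged = {
    "sentiment | FULL": [3579, 3514],
    "sarcasm | TRAIN_en-IN": [1170, 179],
}
for name, counts in logged.items():
    weights = [round(w, 3) for w in balanced_class_weights(counts)]
    print(name, weights)
```

The rarer class receives the larger weight, so the heavily imbalanced sarcasm `TRAIN_en-IN` split (179 positives out of 1349) up-weights positives by a factor of about 3.8.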


## Step 13 — Delta vs CE + error analysis
*   **Purpose**: Compare MoE results against the loaded CE baseline and identify top error corrections.
*   **Inputs**: MoE outputs, CE outputs.
*   **Outputs**: Delta CSV (`moe_delta_vs_ce.csv`) and top error correction tables.
*   **Assumptions**: Requires matching CE metrics/predictions to be loaded.
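
Because the delta table is built with a left join, any MoE row without a matching CE row ends up with a `NaN` delta rather than raising an error. The sketch below shows one way to make such gaps visible, using the `validate` and `indicator` options of `DataFrame.merge`. The toy numbers are invented for illustration and are not results from this notebook.

```python
import pandas as pd

KEYS = ["task", "train_setting", "test_setting", "split"]

# Toy rows; the metric values are invented for illustration only.
moe = pd.DataFrame({
    "task": ["sarcasm", "sarcasm"],
    "train_setting": ["TRAIN_en-AU", "TRAIN_en-IN"],
    "test_setting": ["TEST_en-UK", "TEST_en-AU"],
    "split": ["test", "test"],
    "macro_f1": [0.70, 0.60],
})
ce = pd.DataFrame({
    "task": ["sarcasm"],
    "train_setting": ["TRAIN_en-AU"],
    "test_setting": ["TEST_en-UK"],
    "split": ["test"],
    "macro_f1": [0.50],
})

# validate= raises if a key is duplicated on either side;
# indicator= adds a "_merge" column marking rows with no CE counterpart.
joined = moe.merge(ce, on=KEYS, how="left", suffixes=("_moe", "_ce"),
                   validate="one_to_one", indicator=True)
joined["delta_macro_f1"] = joined["macro_f1_moe"] - joined["macro_f1_ce"]

unmatched = joined[joined["_merge"] == "left_only"]
print(joined[["train_setting", "test_setting", "delta_macro_f1", "_merge"]])
print("MoE rows without a CE counterpart:", len(unmatched))
```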

In [None]:
# ==== Cell 13: Delta vs CE + error analysis ====

# ---- Delta vs CE (global metrics) ----
delta_path = MET_DIR / "moe_delta_vs_ce.csv"

if metrics_ce is None:
    print("⚠️ CE metrics not loaded -> skipping delta.")
else:
    mce = metrics_ce.copy()
    mce = mce[mce["variant"].astype(str).str.lower() == "ce"].copy()

    moe_key = metrics_moe[["task","train_setting","test_setting","split","macro_f1","acc"]].copy()
    moe_key.rename(columns={"macro_f1":"moe_macro_f1","acc":"moe_acc"}, inplace=True)

    ce_key  = mce[["task","train_setting","test_setting","split","macro_f1","acc"]].copy()
    ce_key.rename(columns={"macro_f1":"ce_macro_f1","acc":"ce_acc"}, inplace=True)

    joined = moe_key.merge(ce_key, on=["task","train_setting","test_setting","split"], how="left")
    joined["delta_macro_f1"] = joined["moe_macro_f1"] - joined["ce_macro_f1"]
    joined["delta_acc"]      = joined["moe_acc"]      - joined["ce_acc"]

    joined.to_csv(delta_path, index=False)
    print("✅ Saved delta:", delta_path)
    display(joined.sort_values(["task","delta_macro_f1"], ascending=[True, False]).head(30))

# ---- Error analysis: CE wrong, MoE correct (per task, per train_setting, per test_setting) ----
# Uses row_id joins between CE preds and MoE preds
if preds_ce is None:
    print("⚠️ CE preds not loaded -> skipping CE-vs-MoE error tables.")
else:
    out_rows = []
    ce = preds_ce.copy()
    moe = preds_moe.copy()

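    # Normalize the CE probability column name ('prob1' -> 'prob').
    if 'prob1' in ce.columns and 'prob' not in ce.columns:
        ce.rename(columns={'prob1': 'prob'}, inplace=True)

    # Fallback when CE predictions lack 'row_id': create positional ids so the merge
    # below can run. These ids are not guaranteed to line up with the MoE row_ids,
    # so the resulting error tables should be treated with caution in that case.
    if 'row_id' not in ce.columns:
        print("⚠️ CE preds have no 'row_id'; positional ids may not match MoE row_ids.")
        ce['row_id'] = np.arange(len(ce))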

    # keep only matching key columns
    key_cols = ["task","train_setting","test_setting","row_id"]
    ce = ce[key_cols + ["label","pred","prob","text"]].rename(columns={"pred":"ce_pred","prob":"ce_prob"})
    moe = moe[key_cols + ["pred","prob"]].rename(columns={"pred":"moe_pred","prob":"moe_prob"})

    merged = moe.merge(ce, on=key_cols, how="inner")

    # CE wrong, MoE correct
    good = merged[(merged["ce_pred"] != merged["label"]) & (merged["moe_pred"] == merged["label"])].copy()
    # MoE regressions
    bad  = merged[(merged["ce_pred"] == merged["label"]) & (merged["moe_pred"] != merged["label"])].copy()

    # Save top tables (by confidence gap)
    good["gain"] = (good["moe_prob"] - good["ce_prob"]).abs()
    bad["loss"]  = (bad["moe_prob"] - bad["ce_prob"]).abs()

    good_path = ANA_DIR / "ce_wrong_moe_correct_top.csv"
    bad_path  = ANA_DIR / "ce_correct_moe_wrong_top.csv"

    good.sort_values("gain", ascending=False).head(200).to_csv(good_path, index=False)
    bad.sort_values("loss", ascending=False).head(200).to_csv(bad_path, index=False)

    print("✅ Saved:", good_path)
    print("✅ Saved:", bad_path)

✅ Saved delta: /content/drive/MyDrive/DNLP/models/roberta_extension_mixture_of_adapters/metrics/moe_delta_vs_ce.csv


| | task | train_setting | test_setting | split | moe_macro_f1 | moe_acc | ce_macro_f1 | ce_acc | delta_macro_f1 | delta_acc |
|---|---|---|---|---|---|---|---|---|---|---|
| 50 | sarcasm | TRAIN_en-AU | TEST_en-UK | test | 0.648671 | 0.702128 | 0.368667 | 0.368794 | 0.280004 | 0.333333 |
| 53 | sarcasm | TRAIN_en-IN | TEST_en-AU | test | 0.650958 | 0.676349 | 0.423267 | 0.580913 | 0.227691 | 0.095436 |
| 58 | sarcasm | TRAIN_en-UK | TEST_en-AU | test | 0.710728 | 0.73444 | 0.494624 | 0.543568 | 0.216104 | 0.190871 |
| 51 | sarcasm | TRAIN_en-IN | TEST_FULL | test | 0.693805 | 0.772876 | 0.535637 | 0.727124 | 0.158168 | 0.045752 |
| 52 | sarcasm | TRAIN_en-IN | TEST_Reddit | test | 0.693805 | 0.772876 | 0.535637 | 0.727124 | 0.158168 | 0.045752 |
| 46 | sarcasm | TRAIN_en-AU | TEST_FULL | test | 0.698643 | 0.75817 | 0.554371 | 0.566993 | 0.144272 | 0.191176 |
| 47 | sarcasm | TRAIN_en-AU | TEST_Reddit | test | 0.698643 | 0.75817 | 0.554371 | 0.566993 | 0.144272 | 0.191176 |
| 56 | sarcasm | TRAIN_en-UK | TEST_FULL | test | 0.709219 | 0.789216 | 0.58279 | 0.676471 | 0.126428 | 0.112745 |
| 57 | sarcasm | TRAIN_en-UK | TEST_Reddit | test | 0.709219 | 0.789216 | 0.58279 | 0.676471 | 0.126428 | 0.112745 |
| 49 | sarcasm | TRAIN_en-AU | TEST_en-IN | test | 0.543091 | 0.795652 | 0.426662 | 0.508696 | 0.11643 | 0.286957 |


✅ Saved: /content/drive/MyDrive/DNLP/models/roberta_extension_mixture_of_adapters/analysis/ce_wrong_moe_correct_top.csv
✅ Saved: /content/drive/MyDrive/DNLP/models/roberta_extension_mixture_of_adapters/analysis/ce_correct_moe_wrong_top.csv
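
The two saved tables cover only the cases where CE and MoE disagree about correctness. As a complementary view, the sketch below assigns every joined prediction to one of four buckets on a toy example. The rows, labels, and probabilities are invented, and the bucket names are hypothetical rather than columns produced by the notebook.

```python
import pandas as pd

KEY = ["task", "train_setting", "test_setting", "row_id"]
base = {"task": "sarcasm", "train_setting": "FULL", "test_setting": "TEST_FULL"}

# Toy predictions; labels and probabilities are invented for illustration.
ce = pd.DataFrame([
    {**base, "row_id": 1, "label": 1, "ce_pred": 0, "ce_prob": 0.20},
    {**base, "row_id": 2, "label": 0, "ce_pred": 0, "ce_prob": 0.10},
    {**base, "row_id": 3, "label": 1, "ce_pred": 1, "ce_prob": 0.90},
])
moe = pd.DataFrame([
    {**base, "row_id": 1, "moe_pred": 1, "moe_prob": 0.85},
    {**base, "row_id": 2, "moe_pred": 1, "moe_prob": 0.60},
    {**base, "row_id": 3, "moe_pred": 1, "moe_prob": 0.95},
])

# validate= guards against duplicated keys silently multiplying rows.
merged = moe.merge(ce, on=KEY, how="inner", validate="one_to_one")
ce_ok = merged["ce_pred"] == merged["label"]
moe_ok = merged["moe_pred"] == merged["label"]

merged["bucket"] = "both_correct"
merged.loc[~ce_ok & moe_ok, "bucket"] = "fixed_by_moe"
merged.loc[ce_ok & ~moe_ok, "bucket"] = "moe_regression"
merged.loc[~ce_ok & ~moe_ok, "bucket"] = "both_wrong"
merged["prob_gap"] = (merged["moe_prob"] - merged["ce_prob"]).abs()

print(merged[["row_id", "bucket", "prob_gap"]])
print(merged["bucket"].value_counts().to_dict())
```

Counting the `fixed_by_moe` and `moe_regression` buckets side by side gives a quick net-gain figure per test set before reading the individual examples.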


## Step 14 — Paper-style plots
*   **Purpose**: Generate visualizations (Heatmaps, Locale Bar Charts) to analyze cross-variety performance.
*   **Inputs**: Metrics CSVs (`moe_metrics_all.csv`, `moe_pervar_metrics_all.csv`).
*   **Outputs**: PNG files written to `PLT_DIR` (the `plots/` directory in the logged run).
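
The cross-variety heatmap is a pivot of the metrics table: one row per training variety, one column per test variety. The sketch below reproduces that reshaping on a toy table without plotting. The metric values are invented, and only the column names follow the layout of `moe_metrics_all.csv`.

```python
import pandas as pd

# Toy metrics rows; values are invented for illustration only.
rows = [
    ("TRAIN_en-AU", "TEST_en-AU", 0.80), ("TRAIN_en-AU", "TEST_en-IN", 0.55),
    ("TRAIN_en-IN", "TEST_en-AU", 0.65), ("TRAIN_en-IN", "TEST_en-IN", 0.75),
    ("FULL", "TEST_en-AU", 0.78),        # not a TRAIN_ row: filtered out
    ("TRAIN_en-AU", "TEST_FULL", 0.70),  # not a per-variety test set: filtered out
]
df = pd.DataFrame(rows, columns=["train_setting", "test_setting", "macro_f1"])

# Keep only variety-to-variety rows, then strip the prefixes.
cross = df[df["train_setting"].str.startswith("TRAIN_")
           & df["test_setting"].str.startswith("TEST_en-")].copy()
cross["train_var"] = cross["train_setting"].str.replace("TRAIN_", "", regex=False)
cross["test_var"] = cross["test_setting"].str.replace("TEST_", "", regex=False)

grid = cross.pivot_table(index="train_var", columns="test_var",
                         values="macro_f1", aggfunc="mean")
print(grid)

# Diagonal cells are in-variety scores; off-diagonal cells are cross-variety transfer.
in_variety = [grid.loc[v, v] for v in grid.index if v in grid.columns]
mean_in_variety = sum(in_variety) / len(in_variety)
print("mean in-variety macro-F1:", mean_in_variety)
```

Passing a grid like this to `sns.heatmap` yields the figure layout used in this step, with "Trained On" on the rows and "Tested On" on the columns.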

In [None]:
# ==== Cell 14: Paper-style plots (locale bars + cross-variety heatmaps) ====
import matplotlib.pyplot as plt
import seaborn as sns

sns.set(style="whitegrid")

# Paper-like locale palette (sampled from paper figure)
LOCALE_COLORS = {
    "en-AU": "#708EBF",  # blue
    "en-IN": "#FFB16E",  # orange
    "en-UK": "#E7797A",  # red
    "AU": "#708EBF",
    "IN": "#FFB16E",
    "UK": "#E7797A",
}

def normalize_locale_name(x):
    x = str(x)
    x = x.replace("TRAIN_", "").replace("TEST_", "")
    x = x.replace("_", "-")
    if x in ["en-au","en-AU"]: return "en-AU"
    if x in ["en-in","en-IN"]: return "en-IN"
    if x in ["en-uk","en-UK"]: return "en-UK"
    if x in ["AU"]: return "en-AU"
    if x in ["IN"]: return "en-IN"
    if x in ["UK"]: return "en-UK"
    return x

def plot_locale_bars_from_pervar(task, train_setting, test_setting, pervar_df, out_png):
    d = pervar_df[(pervar_df["task"]==task) &
                  (pervar_df["train_setting"]==train_setting) &
                  (pervar_df["test_setting"]==test_setting) &
                  (pervar_df["variant"]==VARIANT)].copy()
    if len(d)==0:
        print("skip bars (no rows):", task, train_setting, test_setting)
        return
    d["locale"] = d["variety_name"].map(normalize_locale_name)

    # sort locales AU, IN, UK if present
    order = [x for x in ["en-AU","en-IN","en-UK"] if x in set(d["locale"])]
    if not order:
        order = sorted(d["locale"].unique().tolist())

    d = d.sort_values("locale")
    vals = [float(d[d["locale"]==loc]["macro_f1"].iloc[0]) for loc in order]
    cols = [LOCALE_COLORS.get(loc, "#999999") for loc in order]

    plt.figure(figsize=(6,4))
    bars = plt.bar(order, vals, color=cols)
    plt.ylim(0, 1.0)
    plt.ylabel("Macro-F1")
    plt.title(f"{task.upper()} | {train_setting} -> {test_setting} | {VARIANT}")
    for b, v in zip(bars, vals):
        plt.text(b.get_x()+b.get_width()/2, v+0.01, f"{v:.2f}", ha="center", fontsize=10)
    plt.tight_layout()
    plt.savefig(out_png, dpi=200)
    plt.close()
    print("✅ Saved:", out_png)

def plot_cross_variety_heatmap(task, metric_csv, out_prefix):
    df = pd.read_csv(metric_csv)
    df.columns = [c.strip().lower() for c in df.columns]
    df = df[(df["task"]==task) & (df["split"]=="test")].copy()

    # Cross-variety: train_setting in TRAIN_en-*, test_setting in TEST_en-*
    tr = df[df["train_setting"].astype(str).str.startswith("TRAIN_")].copy()
    tr = tr[tr["test_setting"].astype(str).str.startswith("TEST_en-")].copy()
    if len(tr)==0:
        print("skip heatmap (no cross-variety rows):", task)
        return

    tr["train_var"] = tr["train_setting"].astype(str).str.replace("TRAIN_", "", regex=False)
    tr["test_var"]  = tr["test_setting"].astype(str).str.replace("TEST_", "", regex=False)

    # MoE pivot: rows = training variety, columns = test variety
    piv_moe = tr[tr["variant"].astype(str)==VARIANT].pivot_table(
        index="train_var", columns="test_var", values="macro_f1", aggfunc="mean"
    )

    plt.figure(figsize=(5.5,4.5))
    sns.heatmap(piv_moe, annot=True, fmt=".3f", cbar=False, linewidths=0.8, linecolor="white",
                cmap="Reds" if task=="sentiment" else "Blues", vmin=0.0, vmax=1.0)
    plt.title(f"{task.upper()} | MoE cross-variety (Macro-F1)")
    plt.xlabel("Tested On")
    plt.ylabel("Trained On")
    plt.tight_layout()
    out1 = PLT_DIR / f"{out_prefix}__{task}__moe_crossvar_heatmap.png"
    plt.savefig(out1, dpi=220)
    plt.close()
    print("✅ Saved:", out1)

    # Delta vs CE heatmap (if delta exists)
    delta_file = MET_DIR / "moe_delta_vs_ce.csv"
    if delta_file.exists():
        dd = pd.read_csv(delta_file)
        dd.columns = [c.strip().lower() for c in dd.columns]
        dd = dd[(dd["task"]==task) &
                (dd["train_setting"].astype(str).str.startswith("TRAIN_")) &
                (dd["test_setting"].astype(str).str.startswith("TEST_en-")) &
                (dd["split"]=="test")].copy()
        if len(dd)>0:
            dd["train_var"] = dd["train_setting"].astype(str).str.replace("TRAIN_", "", regex=False)
            dd["test_var"]  = dd["test_setting"].astype(str).str.replace("TEST_", "", regex=False)
            piv_d = dd.pivot_table(index="train_var", columns="test_var", values="delta_macro_f1", aggfunc="mean")

            plt.figure(figsize=(5.5,4.5))
            sns.heatmap(piv_d, annot=True, fmt=".3f", cbar=False, linewidths=0.8, linecolor="white",
                        cmap="RdBu_r", center=0.0)
            plt.title(f"{task.upper()} | Δ(MoE-CE) cross-variety (Macro-F1)")
            plt.xlabel("Tested On")
            plt.ylabel("Trained On")
            plt.tight_layout()
            out2 = PLT_DIR / f"{out_prefix}__{task}__delta_crossvar_heatmap.png"
            plt.savefig(out2, dpi=220)
            plt.close()
            print("✅ Saved:", out2)

# ---- Generate key plots ----
# 1) Cross-variety heatmaps (MoE + Δ)
plot_cross_variety_heatmap("sarcasm", MET_DIR / "moe_metrics_all.csv", out_prefix="FIG_crossvar")
plot_cross_variety_heatmap("sentiment", MET_DIR / "moe_metrics_all.csv", out_prefix="FIG_crossvar")

# 2) Locale bars (sentiment): Google, Reddit, and FULL testsets, where they exist
pervar_df = pd.read_csv(MET_DIR / "moe_pervar_metrics_all.csv")
pervar_df.columns = [c.strip().lower() for c in pervar_df.columns]

for test_setting in ["TEST_Google", "TEST_Reddit", "TEST_FULL"]:
    out = PLT_DIR / f"FIG_localeBars__sentiment__FULL_to_{test_setting}.png"
    plot_locale_bars_from_pervar("sentiment", "FULL", test_setting, pervar_df, out)

# 3) Locale bars (sarcasm): typically Reddit + FULL
for test_setting in ["TEST_Reddit", "TEST_FULL"]:
    out = PLT_DIR / f"FIG_localeBars__sarcasm__FULL_to_{test_setting}.png"
    plot_locale_bars_from_pervar("sarcasm", "FULL", test_setting, pervar_df, out)

print("✅ Plotting finished. Check:", PLT_DIR)

✅ Saved: /content/drive/MyDrive/DNLP/models/roberta_extension_mixture_of_adapters/plots/FIG_crossvar__sarcasm__moe_crossvar_heatmap.png
✅ Saved: /content/drive/MyDrive/DNLP/models/roberta_extension_mixture_of_adapters/plots/FIG_crossvar__sarcasm__delta_crossvar_heatmap.png
✅ Saved: /content/drive/MyDrive/DNLP/models/roberta_extension_mixture_of_adapters/plots/FIG_crossvar__sentiment__moe_crossvar_heatmap.png
✅ Saved: /content/drive/MyDrive/DNLP/models/roberta_extension_mixture_of_adapters/plots/FIG_crossvar__sentiment__delta_crossvar_heatmap.png
✅ Saved: /content/drive/MyDrive/DNLP/models/roberta_extension_mixture_of_adapters/plots/FIG_localeBars__sentiment__FULL_to_TEST_Google.png
✅ Saved: /content/drive/MyDrive/DNLP/models/roberta_extension_mixture_of_adapters/plots/FIG_localeBars__sentiment__FULL_to_TEST_Reddit.png
✅ Saved: /content/drive/MyDrive/DNLP/models/roberta_extension_mixture_of_adapters/plots/FIG_localeBars__sentiment__FULL_to_TEST_FULL.png
✅ Saved: /content/drive/MyDrive/D
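
The cross-variety heatmaps above come from pivoting a long-format metrics table into a train-variety by test-variety grid. The sketch below reproduces only that reshaping step on a hand-made frame; the four rows and their Macro-F1 values are invented for illustration and are not results from this notebook, and `toy` / `toy_piv` are hypothetical names.

```python
import pandas as pd

# Toy long-format metrics. The numbers are made up for illustration only.
toy = pd.DataFrame({
    "train_setting": ["TRAIN_en-AU", "TRAIN_en-AU", "TRAIN_en-IN", "TRAIN_en-IN"],
    "test_setting":  ["TEST_en-AU", "TEST_en-IN", "TEST_en-AU", "TEST_en-IN"],
    "macro_f1":      [0.80, 0.60, 0.55, 0.75],
})

# Strip the setting prefixes so only the variety code remains.
toy["train_var"] = toy["train_setting"].str.replace("TRAIN_", "", regex=False)
toy["test_var"] = toy["test_setting"].str.replace("TEST_", "", regex=False)

# Rows = variety trained on, columns = variety tested on, cells = mean Macro-F1.
toy_piv = toy.pivot_table(
    index="train_var", columns="test_var", values="macro_f1", aggfunc="mean"
)
print(toy_piv)
```

In a grid like this, the diagonal holds in-variety scores and the off-diagonal cells hold cross-variety transfer, which is what the "Trained On" / "Tested On" axes of the saved figures encode.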

## Step 15 — Leakage audit
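*   **Purpose**: Verify that no `row_id` is shared between the train and val splits of a setting, and that no train/val `row_id` appears in any test set.
*   **Inputs**: Task indices returned by `load_task_indices(task)` (settings and testsets tables).
*   **Outputs**: Audit status prints (✅ when clean; ❌ for train/val overlap, 🚨 for train/val vs test leakage).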

In [None]:
# ==== Cell 15: Leakage audit (row_id overlap train/val vs testsets) ====
import pandas as pd
from pathlib import Path

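def _read_row_ids(csv_path: Path):
    """Return the set of integer row_ids found in the CSV at csv_path."""
    df = pd.read_csv(csv_path)
    # normalize headers so "Row_ID " and "row_id" are treated the same
    df.columns = [c.strip().lower() for c in df.columns]
    if "row_id" not in df.columns:
        raise ValueError(f"{csv_path} missing row_id column.")
    return set(df["row_id"].astype(int).tolist())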

def audit_task(task: str):
    settings_df, testsets_df = load_task_indices(task)

    # 1) train vs val disjoint per setting
    tv_bad = 0
    for _, r in settings_df.iterrows():
        tr_ids = _read_row_ids(r["train_csv_abs"])
        va_ids = _read_row_ids(r["val_csv_abs"])
        inter = tr_ids & va_ids
        if inter:
            tv_bad += 1
            print(f"❌ train∩val overlap | {task} | {r['train_setting']}: {len(inter)}")
    if tv_bad == 0:
        print(f"✅ {task}: All settings train∩val overlap = 0")

    # 2) train+val vs each testset disjoint
    leak = 0
    for _, r in settings_df.iterrows():
        trva = _read_row_ids(r["train_csv_abs"]) | _read_row_ids(r["val_csv_abs"])
        for _, t in testsets_df.iterrows():
            te_ids = _read_row_ids(t["test_csv_abs"])
            inter = trva & te_ids
            if inter:
                leak += 1
                print(f"🚨 LEAKAGE | {task} | {r['train_setting']} vs {t['test_setting']}: overlap={len(inter)}")

    if leak == 0:
        print(f"✅ {task}: No train/val ↔ test overlap found.")
    else:
        print(f"🚨 {task}: Found leakage in {leak} (setting,test) pairs.")

audit_task("sarcasm")
audit_task("sentiment")

✅ sarcasm: All settings train∩val overlap = 0
✅ sarcasm: No train/val ↔ test overlap found.
✅ sentiment: All settings train∩val overlap = 0
✅ sentiment: No train/val ↔ test overlap found.
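
The audit above reduces to set intersections over `row_id`. Since every check in this run passed, the sketch below shows what a failing case would look like on hand-made IDs. `find_overlaps` is a hypothetical helper written for this illustration and is not part of the notebook.

```python
def find_overlaps(train_ids, val_ids, test_ids):
    """Return sorted row_ids shared between splits; empty lists mean clean."""
    train_ids, val_ids, test_ids = set(train_ids), set(val_ids), set(test_ids)
    return {
        # same check as step 1 of the audit: train vs val
        "train_val": sorted(train_ids & val_ids),
        # same check as step 2 of the audit: (train + val) vs test
        "trainval_test": sorted((train_ids | val_ids) & test_ids),
    }

# Hand-made IDs: the first call is disjoint, the second leaks on purpose.
clean = find_overlaps([1, 2, 3], [4, 5], [6, 7])
leaky = find_overlaps([1, 2, 3], [3, 4], [4, 8])
print(clean)
print(leaky)
```

An ID-based check like this only catches rows that share a `row_id`. The same text stored under two different IDs would pass it, so a duplicate check on the text column would be a separate step.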
