# OOD Evaluation of Toxic Comment Classifiers
## Final Project Experiments & Results

This notebook reproduces all experiments, results, and plots for the project.
It covers:
1.  **Setup**: Environment and Data.
2.  **Baselines**: TF-IDF + Logistic Regression / SVM.
3.  **Models**: RoBERTa (In-domain and Cross-domain).
4.  **Fairness**: Demographic Parity and Equal Opportunity analysis.
5.  **Analysis**: Final plots and tables.

**Datasets**: Civil Comments, HateXplain, Jigsaw.
**Goal**: Evaluate out-of-distribution (OOD) generalization and fairness.

In [None]:
# @title 1. Setup & Installation
import os
import sys
import subprocess
from pathlib import Path

# Check if running in Colab
try:
    import google.colab
    IN_COLAB = True
    print("Running in Google Colab")
except ImportError:
    IN_COLAB = False
    print("Running in Local Environment")

# Mount Drive if needed (optional)
# from google.colab import drive
# drive.mount('/content/drive')

# Set paths
REPO_DIR = Path(os.getcwd())
if IN_COLAB:
    # Assuming repo is cloned to /content/ood-eval-toxic-classifiers or similar
    # If not, clone it:
    if not (REPO_DIR / "scripts").exists():
        print("Cloning repository...")
        !git clone https://github.com/aayushakumar/ood-eval-toxic-classifiers.git
        os.chdir("ood-eval-toxic-classifiers")
        REPO_DIR = Path(os.getcwd())

    # Install dependencies only after the repo is in place, so requirements.txt can be found
    print("Installing dependencies...")
    requirements = REPO_DIR / "requirements.txt"
    if requirements.exists():
        subprocess.check_call([sys.executable, "-m", "pip", "install", "-r", str(requirements)])
    else:
        print(f"requirements.txt not found in {REPO_DIR}; installing core packages only.")
    # Install specific versions if needed, e.g. transformers
    subprocess.check_call([sys.executable, "-m", "pip", "install", "transformers", "scikit-learn", "pandas", "matplotlib", "seaborn", "torch"])

DATA_DIR = REPO_DIR / "data"
EXPERIMENTS_DIR = REPO_DIR / "experiments"
SCRIPTS_DIR = REPO_DIR / "scripts"

EXPERIMENTS_DIR.mkdir(exist_ok=True)

print(f"Working Directory: {REPO_DIR}")
print(f"Data Directory: {DATA_DIR}")
print(f"Experiments Directory: {EXPERIMENTS_DIR}")

In [None]:
# @title 2. Data Verification
import pandas as pd

REQUIRED_FILES = [
    "civil_train.csv", "civil_val.csv", "civil_test.csv",
    "hatexplain_train.csv", "hatexplain_val.csv", "hatexplain_test.csv",
    "jigsaw_train.csv", "jigsaw_val.csv", "jigsaw_test.csv"
]

FULL_FILES = [
    "jigsaw_test_full.csv",  # Required for fairness analysis
    "jigsaw_train_full.csv",
    "jigsaw_val_full.csv",
]

missing_files = []
for fname in REQUIRED_FILES:
    if not (DATA_DIR / fname).exists():
        missing_files.append(fname)

if missing_files:
    print(f"❌ WARNING: Missing data files: {missing_files}")
    print("Please ensure data is uploaded to the 'data/' directory.")
else:
    print("✓ All required data files found.")

# Check full files for fairness
missing_full = []
for fname in FULL_FILES:
    if not (DATA_DIR / fname).exists():
        missing_full.append(fname)

if missing_full:
    print(f"\n⚠️  Missing full data files (needed for fairness): {missing_full}")
else:
    print("✓ Full data files for fairness analysis found.")

# Dataset statistics
print("\n" + "=" * 60)
print("DATASET STATISTICS")
print("=" * 60)

for dataset in ["civil", "hatexplain", "jigsaw"]:
    print(f"\n{dataset.upper()}:")
    for split in ["train", "val", "test"]:
        fpath = DATA_DIR / f"{dataset}_{split}.csv"
        if fpath.exists():
            df = pd.read_csv(fpath)
            n_toxic = (df["label"] == 1).sum()
            n_total = len(df)
            print(f"  {split:5s}: {n_total:6d} samples, {n_toxic:5d} toxic ({100*n_toxic/n_total:.1f}%)")

## 3. TF-IDF Baselines

We train Logistic Regression and Linear SVM models on each source dataset and evaluate on all target datasets.
Metrics: Accuracy, F1, ROC-AUC, PR-AUC.
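
To make the reported metrics concrete, here is a minimal pure-Python sketch of accuracy, F1 for the toxic class, and ROC-AUC. The function names and toy values are illustrative assumptions; the baseline script presumably relies on library implementations, and PR-AUC is left out for brevity.

```python
from typing import Sequence, Tuple


def accuracy_and_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Accuracy and F1 for the positive (toxic = 1) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    # F1 = 2TP / (2TP + FP + FN); reported as 0 when the denominator is empty.
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """ROC-AUC as the probability that a random toxic example outscores
    a random non-toxic one (ties count as half a win)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one example of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy labels and scores (arbitrary values, not project results).
y_true = [1, 0, 1, 0, 0]
scores = [0.9, 0.4, 0.35, 0.2, 0.6]
y_pred = [int(s >= 0.5) for s in scores]  # fixed 0.5 decision threshold

acc, f1 = accuracy_and_f1(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(f"accuracy={acc:.3f}  f1={f1:.3f}  roc_auc={auc:.3f}")
```

Accuracy and F1 depend on the decision threshold, while ROC-AUC only depends on how the scores rank the examples, which is why both kinds of metric are worth reporting under domain shift.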

In [None]:
# @title Run TF-IDF Experiments
RUN_TFIDF = True # @param {type:"boolean"}

datasets = ["civil", "hatexplain", "jigsaw"]

if RUN_TFIDF:
    print("Running TF-IDF Baselines...")
    for source in datasets:
        # Evaluate on every dataset, including the source itself (in-domain)
        targets = list(datasets)
        
        print(f"\n--- Training on {source} ---")
        cmd = [
            sys.executable, "scripts/run_tfidf_baselines.py",
            "--source_dataset", source,
            "--target_datasets"
        ] + targets + [
            "--model", "both",  # Run both LogReg and SVM
            "--save_preds"
        ]
        
        print(f"Executing: {' '.join(cmd)}")
        subprocess.check_call(cmd)
        
    print("\nTF-IDF Experiments Completed.")
else:
    print("Skipping TF-IDF experiments.")

## 4. RoBERTa Experiments

We train RoBERTa-base models.
Options:
- **Standard**: Fine-tuning on source.
- **CORAL**: Domain adaptation (requires unlabeled target).
- **LoRA**: Parameter-efficient fine-tuning.

We also enable **Calibration** (Temperature Scaling) and save predictions for fairness analysis.
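
As a rough illustration of temperature scaling, the sketch below fits a single temperature `T` on validation logits by minimizing the negative log-likelihood, then would apply `sigmoid(logit / T)` at test time. It assumes a binary classifier with one logit per example and uses a simple grid search; the helper names and toy values are hypothetical, and `scripts/run_roberta.py` may fit the temperature differently.

```python
import math
from typing import Optional, Sequence


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def nll(logits: Sequence[float], labels: Sequence[int], temperature: float) -> float:
    """Mean binary negative log-likelihood of labels under sigmoid(logit / T)."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for z, y in zip(logits, labels):
        p = sigmoid(z / temperature)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1.0 - p + eps)
    return total / len(logits)


def fit_temperature(val_logits: Sequence[float], val_labels: Sequence[int],
                    grid: Optional[Sequence[float]] = None) -> float:
    """Pick the temperature from a grid that minimizes validation NLL."""
    if grid is None:
        grid = [0.05 * k for k in range(1, 201)]  # 0.05, 0.10, ..., 10.0
    return min(grid, key=lambda t: nll(val_logits, val_labels, t))


# Toy validation set: confident logits, two of which are wrong (arbitrary values).
val_logits = [4.0, 3.5, -4.0, 3.0, -3.5, -3.0, 4.5, -4.5]
val_labels = [1, 0, 0, 1, 1, 0, 1, 0]

T = fit_temperature(val_logits, val_labels)
print(f"fitted temperature: {T:.2f}")
print(f"NLL before: {nll(val_logits, val_labels, 1.0):.3f}  after: {nll(val_logits, val_labels, T):.3f}")
```

Because dividing by a positive `T` does not change the sign of a logit, predictions at a 0.5 threshold are unaffected; only the confidence of the probabilities changes.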

In [None]:
# @title Run RoBERTa Experiments
RUN_ROBERTA = True # @param {type:"boolean"}
USE_CORAL = False # @param {type:"boolean"}
USE_LORA = False # @param {type:"boolean"}
FAST_MODE = True # @param {type:"boolean"}

# Configuration
EPOCHS = 1 if FAST_MODE else 3
SEEDS = [42] # Add more seeds for full paper results, e.g. [42, 123, 456]
BATCH_SIZE = 16
MAX_LEN = 128

if RUN_ROBERTA:
    print("Running RoBERTa Experiments...")
    
    for source in datasets:
        targets = list(datasets)  # All datasets, including the source (in-domain)
        
        # Base arguments
        cmd = [
            sys.executable, "scripts/run_roberta.py",
            "--source_dataset", source,
            "--model_name", "roberta-base",
            "--epochs", str(EPOCHS),
            "--batch_size", str(BATCH_SIZE),
            "--max_len", str(MAX_LEN),
            "--seeds"
        ] + [str(s) for s in SEEDS] + [
            "--target_datasets"
        ] + targets + [
            "--calibration", "temperature", # Enable calibration
            "--save_preds",
            "--amp", # Use Mixed Precision
            "--tune_threshold"
        ]
        
        # LoRA
        if USE_LORA:
            cmd += ["--peft", "lora"]
            
        # CORAL (example: source civil, adapted to jigsaw) needs one specific
        # target per run, so it does not fit this all-targets loop.
        # USE_CORAL is therefore not applied here and standard fine-tuning is run;
        # running CORAL would need a separate loop over (source, target) pairs.
        
        print(f"\n--- Training RoBERTa on {source} ---")
        print(f"Executing: {' '.join(cmd)}")
        
        # Run
        subprocess.check_call(cmd)
        
    print("\nRoBERTa Experiments Completed.")
else:
    print("Skipping RoBERTa experiments.")

## 5. Fairness Analysis

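We compute fairness metrics (Demographic Parity, Equal Opportunity) for models evaluated on datasets with identity attributes. In this setup only Jigsaw provides identity group columns (the `g_*` columns in its `*_full.csv` files), so the analysis is restricted to Jigsaw as the target.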
We use the `scripts/fairness_metrics.py` script.
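
For intuition, here is a minimal sketch of the two gaps for a single identity group. It assumes each gap compares the group against all remaining examples; `scripts/fairness_metrics.py` may use a different reference group or aggregation, and the function name and toy arrays are illustrative only.

```python
from typing import Sequence, Tuple


def _rate(flags: Sequence[int]) -> float:
    """Fraction of ones; NaN when the slice is empty."""
    return sum(flags) / len(flags) if flags else float("nan")


def fairness_gaps(y_true: Sequence[int], y_pred: Sequence[int],
                  in_group: Sequence[int]) -> Tuple[float, float]:
    """Return (dp_gap, eop_gap) for one identity group versus everyone else.

    dp_gap  = |P(pred=1 | group) - P(pred=1 | not group)|
    eop_gap = |TPR(group) - TPR(not group)|, where TPR = P(pred=1 | true=1)
    """
    rows = list(zip(y_true, y_pred, in_group))
    pred_in = [p for _, p, g in rows if g]
    pred_out = [p for _, p, g in rows if not g]
    tpr_in = [p for t, p, g in rows if g and t == 1]
    tpr_out = [p for t, p, g in rows if not g and t == 1]
    dp_gap = abs(_rate(pred_in) - _rate(pred_out))
    eop_gap = abs(_rate(tpr_in) - _rate(tpr_out))
    return dp_gap, eop_gap


# Toy data (arbitrary values): the first four comments mention the identity group.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
in_group = [1, 1, 1, 1, 0, 0, 0, 0]

dp_gap, eop_gap = fairness_gaps(y_true, y_pred, in_group)
print(f"DP gap = {dp_gap:.2f}, EOp gap = {eop_gap:.2f}")
```

Demographic parity ignores the true labels, whereas equal opportunity conditions on truly toxic comments, so the two gaps can disagree when the base rate of toxicity differs between groups.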

In [None]:
# @title Run Fairness Metrics
RUN_FAIRNESS = True # @param {type:"boolean"}

# Define which datasets have identity labels and which split to use
# NOTE: Only Jigsaw has g_* identity columns in *_full.csv files
# Civil Comments full files only have 'toxicity' column, no identity groups
fairness_targets = [
    {"dataset": "jigsaw", "split": "test", "full_file": "jigsaw_test_full.csv", "group_prefix": "g_"},
]

if RUN_FAIRNESS:
    print("Running Fairness Analysis...")
    print("Note: Only Jigsaw has identity group columns for fairness analysis.\n")
    
    for target in fairness_targets:
        dataset = target["dataset"]
        split = target["split"]
        full_file = DATA_DIR / target["full_file"]
        
        if not full_file.exists():
            print(f"Skipping {dataset}: Full data file {full_file} not found.")
            continue
            
        # Check for predictions from RoBERTa (in-domain and cross-domain)
        # Cross-domain: preds_{source}_to_{dataset}.csv, where dataset is the target
        # In-domain: preds_{dataset}_test.csv
        
        # 1. In-domain RoBERTa
        pred_file = EXPERIMENTS_DIR / f"preds_{dataset}_test.csv"
        if pred_file.exists():
            print(f"\n--- Fairness: {dataset} (RoBERTa In-domain) ---")
            out_prefix = EXPERIMENTS_DIR / f"fairness_{dataset}_roberta_indomain"
            cmd = [
                sys.executable, "scripts/fairness_metrics.py",
                "--dataset", dataset,
                "--split", split,
                "--pred_file", str(pred_file),
                "--full_data_file", str(full_file),
                "--group_prefix", target["group_prefix"],
                "--out_prefix", str(out_prefix)
            ]
            subprocess.check_call(cmd)
        else:
            print(f"RoBERTa predictions not found for {dataset} in-domain: {pred_file}")

        # 2. In-domain TF-IDF (LogReg)
        pred_file = EXPERIMENTS_DIR / f"preds_tfidf_logreg_{dataset}_test.csv"
        if pred_file.exists():
            print(f"\n--- Fairness: {dataset} (TF-IDF LogReg In-domain) ---")
            out_prefix = EXPERIMENTS_DIR / f"fairness_{dataset}_tfidf_logreg_indomain"
            cmd = [
                sys.executable, "scripts/fairness_metrics.py",
                "--dataset", dataset,
                "--split", split,
                "--pred_file", str(pred_file),
                "--full_data_file", str(full_file),
                "--group_prefix", target["group_prefix"],
                "--out_prefix", str(out_prefix)
            ]
            subprocess.check_call(cmd)

        # 3. Cross-domain (e.g., Civil -> Jigsaw, HateXplain -> Jigsaw)
        for source in datasets:
            if source == dataset: 
                continue
            
            # RoBERTa cross-domain
            pred_file = EXPERIMENTS_DIR / f"preds_{source}_to_{dataset}.csv"
            if pred_file.exists():
                print(f"\n--- Fairness: {source} -> {dataset} (RoBERTa) ---")
                out_prefix = EXPERIMENTS_DIR / f"fairness_{source}_to_{dataset}_roberta"
                cmd = [
                    sys.executable, "scripts/fairness_metrics.py",
                    "--dataset", dataset,
                    "--split", split,
                    "--pred_file", str(pred_file),
                    "--full_data_file", str(full_file),
                    "--group_prefix", target["group_prefix"],
                    "--out_prefix", str(out_prefix)
                ]
                subprocess.check_call(cmd)
            
            # TF-IDF cross-domain
            pred_file = EXPERIMENTS_DIR / f"preds_tfidf_logreg_{source}_to_{dataset}.csv"
            if pred_file.exists():
                print(f"\n--- Fairness: {source} -> {dataset} (TF-IDF LogReg) ---")
                out_prefix = EXPERIMENTS_DIR / f"fairness_{source}_to_{dataset}_tfidf_logreg"
                cmd = [
                    sys.executable, "scripts/fairness_metrics.py",
                    "--dataset", dataset,
                    "--split", split,
                    "--pred_file", str(pred_file),
                    "--full_data_file", str(full_file),
                    "--group_prefix", target["group_prefix"],
                    "--out_prefix", str(out_prefix)
                ]
                subprocess.check_call(cmd)
            
    print("\nFairness Analysis Completed.")
else:
    print("Skipping Fairness Analysis.")

## 6. Results & Analysis

We aggregate results from all experiments and visualize:
1.  **Cross-Domain Performance**: F1 Score Heatmap.
2.  **Fairness Gaps**: Demographic Parity and Equal Opportunity differences.
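
The cross-domain comparison boils down to a source-by-target matrix of F1 scores, with the OOD drop taken as in-domain F1 minus cross-domain F1. A small sketch of that calculation follows; the function name is hypothetical and the scores are placeholder values for two made-up domains, not project results.

```python
from typing import Dict, Tuple

Pair = Tuple[str, str]  # (source, target)


def ood_drops(f1: Dict[Pair, float]) -> Dict[Pair, float]:
    """For each cross-domain pair, return in-domain F1 minus cross-domain F1."""
    drops: Dict[Pair, float] = {}
    for (source, target), score in f1.items():
        if source == target:
            continue  # the diagonal is the in-domain reference itself
        in_domain = f1.get((source, source))
        if in_domain is None:
            continue  # no in-domain reference available for this source
        drops[(source, target)] = in_domain - score
    return drops


# Placeholder F1 scores for hypothetical domains "A" and "B".
toy_f1 = {("A", "A"): 0.80, ("A", "B"): 0.55, ("B", "B"): 0.70, ("B", "A"): 0.60}

drops = ood_drops(toy_f1)
for (source, target), drop in sorted(drops.items()):
    print(f"{source} -> {target}: drop = {drop:.2f}")
```

A positive drop means the model loses F1 when moved off its training distribution; a negative value would mean the target set is easier than the source test set.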

In [None]:
# @title Aggregate Results
import glob
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load TF-IDF Summaries
tfidf_files = glob.glob(str(EXPERIMENTS_DIR / "summary_tfidf_*.csv"))
roberta_files = glob.glob(str(EXPERIMENTS_DIR / "summary_*.csv"))

all_metrics = []

# Process TF-IDF
for f in tfidf_files:
    try:
        df = pd.read_csv(f)
        # Filename: summary_tfidf_{source}_{model}.csv
        fname = Path(f).stem
        parts = fname.split("_")
        # parts: ['summary', 'tfidf', source, model]
        if len(parts) >= 4:
            source = parts[2]
            model = parts[3]
        else:
            continue
            
        for _, row in df.iterrows():
            split = row["split"]
            if split == "in_domain_test":
                target = source
            elif split.startswith("cross_"):
                target = split.replace("cross_", "")
            else:
                continue
                
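            all_metrics.append({
                "Model": f"TF-IDF ({model})",
                "Source": source,
                "Target": target,
                # NaN (not 0) for a missing metric, so gaps are not plotted as real scores
                "F1": row.get("f1", float("nan")),
                "Accuracy": row.get("accuracy", float("nan")),
                "AUROC": row.get("auroc", float("nan"))
            })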
    except Exception as e:
        print(f"Error reading {f}: {e}")

# Process RoBERTa
for f in roberta_files:
    # Skip TF-IDF summaries (handled above); match on the file name, not the full path
    if Path(f).name.startswith("summary_tfidf_"):
        continue
    try:
        df = pd.read_csv(f)
        source = Path(f).stem.replace("summary_", "")
        
        for _, row in df.iterrows():
            split = row["split"]
            if split == "in_domain_test":
                target = source
            elif split.startswith("cross_"):
                target = split.replace("cross_", "")
            else:
                continue
                
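            all_metrics.append({
                "Model": "RoBERTa",
                "Source": source,
                "Target": target,
                # NaN (not 0) for a missing metric, so gaps are not plotted as real scores
                "F1": row.get("f1", float("nan")),
                "Accuracy": row.get("accuracy", float("nan")),
                "AUROC": row.get("auroc", float("nan"))
            })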
    except Exception as e:
        print(f"Error reading {f}: {e}")

results_df = pd.DataFrame(all_metrics)

if not results_df.empty:
    print("Loaded Results:")
    print(results_df.head())
else:
    print("No results found.")

In [None]:
# @title Plot Cross-Domain Performance Heatmaps
if not results_df.empty:
    models = results_df["Model"].unique()
    n_models = len(models)
    
    fig, axes = plt.subplots(1, n_models, figsize=(6 * n_models, 5))
    if n_models == 1:
        axes = [axes]
    
    for ax, model in zip(axes, models):
        model_data = results_df[results_df["Model"] == model]
        if model_data.empty:
            continue
        pivot_table = model_data.pivot(index="Source", columns="Target", values="F1")
        sns.heatmap(pivot_table, annot=True, cmap="viridis", fmt=".3f", ax=ax, vmin=0, vmax=1)
        ax.set_title(f"{model} Cross-Domain F1 Score")
    
    plt.tight_layout()
    plt.savefig(EXPERIMENTS_DIR / "cross_domain_f1_heatmap.png", dpi=150, bbox_inches="tight")
    plt.show()
    print(f"Saved: {EXPERIMENTS_DIR / 'cross_domain_f1_heatmap.png'}")
else:
    print("No results to plot.")

In [None]:
# @title Plot Fairness Gaps
fairness_files = glob.glob(str(EXPERIMENTS_DIR / "fairness_*_summary.csv"))

fairness_data = []

for f in fairness_files:
    try:
        df = pd.read_csv(f)
        # Filename: fairness_{context}_summary.csv
        fname = Path(f).stem
        context = fname.replace("fairness_", "").replace("_summary", "")
        
        # Get max gaps across all identity groups
        max_dp = df["dp_diff"].max()
        max_eop = df["eop_diff"].max()
        max_eo = df["eo_diff"].max()
        
        fairness_data.append({
            "Setting": context,
            "Max DP Gap": max_dp,
            "Max EOp Gap": max_eop,
            "Max EO Gap": max_eo
        })
    except Exception as e:
        print(f"Error reading {f}: {e}")

if fairness_data:
    f_df = pd.DataFrame(fairness_data)
    print("\nFairness Gaps Summary:")
    print(f_df.to_string(index=False))
    
    # Plot
    fig, ax = plt.subplots(figsize=(12, 6))
    f_df.set_index("Setting")[["Max DP Gap", "Max EOp Gap", "Max EO Gap"]].plot(kind="bar", ax=ax)
    ax.set_title("Fairness Gaps by Setting (Lower is Better)")
    ax.set_ylabel("Gap")
    ax.set_xlabel("Model / Domain Setting")
    plt.xticks(rotation=45, ha="right")
    plt.legend(title="Metric")
    plt.tight_layout()
    plt.savefig(EXPERIMENTS_DIR / "fairness_gaps.png", dpi=150, bbox_inches="tight")
    plt.show()
    print(f"Saved: {EXPERIMENTS_DIR / 'fairness_gaps.png'}")
else:
    print("No fairness results found.")

## 7. Summary Tables for Report

Generate formatted tables ready for the final report.

In [None]:
# @title Generate Summary Tables
if not results_df.empty:
    # Table 1: In-Domain Performance
    print("=" * 80)
    print("TABLE 1: In-Domain Performance (Source = Target)")
    print("=" * 80)
    in_domain = results_df[results_df["Source"] == results_df["Target"]]
    in_domain_pivot = in_domain.pivot(index="Source", columns="Model", values=["F1", "Accuracy", "AUROC"])
    print(in_domain_pivot.round(4).to_string())
    
    # Table 2: Cross-Domain Performance (F1)
    print("\n" + "=" * 80)
    print("TABLE 2: Cross-Domain F1 Score")
    print("=" * 80)
    for model in results_df["Model"].unique():
        print(f"\n{model}:")
        model_data = results_df[results_df["Model"] == model]
        pivot = model_data.pivot(index="Source", columns="Target", values="F1")
        print(pivot.round(4).to_string())
    
    # Table 3: OOD Performance Drop
    print("\n" + "=" * 80)
    print("TABLE 3: OOD Performance Drop (In-Domain F1 - Cross-Domain F1)")
    print("=" * 80)
    for model in results_df["Model"].unique():
        print(f"\n{model}:")
        model_data = results_df[results_df["Model"] == model]
        pivot = model_data.pivot(index="Source", columns="Target", values="F1")
        # Calculate drop from diagonal (in-domain)
        for src in pivot.index:
            if src in pivot.columns:
                in_domain_f1 = pivot.loc[src, src]
                for tgt in pivot.columns:
                    if src != tgt:
                        pivot.loc[src, tgt] = in_domain_f1 - pivot.loc[src, tgt]
                pivot.loc[src, src] = 0  # No drop for in-domain
        print(pivot.round(4).to_string())
    
    # Save to CSV
    results_df.to_csv(EXPERIMENTS_DIR / "all_results_combined.csv", index=False)
    print(f"\nSaved all results to: {EXPERIMENTS_DIR / 'all_results_combined.csv'}")
else:
    print("No results available.")

In [None]:
# @title Export Results for Report (LaTeX Tables)
def df_to_latex(df, caption="", label=""):
    """Convert DataFrame to a LaTeX table.

    Caption and label are passed to pandas, which wraps the tabular in a
    table environment; a caption placed outside a float does not compile.
    """
    return df.to_latex(
        float_format="%.4f",
        escape=True,  # setting names such as jigsaw_tfidf_logreg_indomain contain underscores
        caption=caption or None,
        label=label or None,
    )

if not results_df.empty:
    # Export main results table
    for model in results_df["Model"].unique():
        model_data = results_df[results_df["Model"] == model]
        pivot = model_data.pivot(index="Source", columns="Target", values="F1")
        
        model_name_clean = model.replace(" ", "_").replace("(", "").replace(")", "")
        latex_file = EXPERIMENTS_DIR / f"table_{model_name_clean}_f1.tex"
        
        with open(latex_file, "w") as f:
            f.write(df_to_latex(pivot, 
                               caption=f"{model} Cross-Domain F1 Scores",
                               label=f"tab:{model_name_clean}_f1"))
        print(f"Saved LaTeX table: {latex_file}")

# Export fairness table
if fairness_data:
    f_df = pd.DataFrame(fairness_data)
    latex_file = EXPERIMENTS_DIR / "table_fairness.tex"
    with open(latex_file, "w") as f:
        f.write(df_to_latex(f_df.set_index("Setting"), 
                           caption="Fairness Gaps Across Models and Domains",
                           label="tab:fairness"))
    print(f"Saved LaTeX table: {latex_file}")

print("\n✓ All exports complete!")

## 8. Usage Notes

### Running on Google Colab Pro
1. Upload the entire repository to Colab or mount from Google Drive
2. Ensure GPU runtime is enabled: `Runtime → Change runtime type → GPU`
3. Run cells in order, adjusting toggles:
   - `RUN_TFIDF = True` → Runs TF-IDF baselines (~5-10 min)
   - `RUN_ROBERTA = True` → Runs RoBERTa experiments (~30-60 min per dataset with GPU)
   - `FAST_MODE = True` → Uses 1 epoch for quick testing
   - `RUN_FAIRNESS = True` → Computes fairness metrics on Jigsaw dataset

### Expected Outputs
- `experiments/summary_*.csv` - Performance metrics for each model/dataset
- `experiments/preds_*.csv` - Predictions for fairness analysis
- `experiments/fairness_*_summary.csv` - Fairness gap metrics
- `experiments/*.png` - Visualization plots
- `experiments/*.tex` - LaTeX tables for report

### Key Findings to Report
1. **In-Domain Performance**: How well models perform on their training distribution
2. **Cross-Domain Generalization**: Performance drop when tested on different datasets
3. **Fairness**: Demographic parity and equal opportunity gaps across identity groups