# F-Regularization Large-Scale Experiment

## Goal
Validate the causal hypothesis at scale: **Does minimizing geDIG F during training improve performance across multiple models and tasks?**

## Experiment Matrix
- **Models**: DistilBERT, BERT-base, RoBERTa-base
- **Tasks**: SST-2, MRPC, CoLA, QNLI (GLUE subset)
- **α sweep**: [0, 0.001, 0.01, 0.1]
- **Seeds**: [42, 123, 456, 789, 1024]
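
The matrix above implies a full grid of 3 × 4 × 4 × 5 runs. A minimal sketch of that enumeration, assuming every combination is run exactly once; the variable names are illustrative and use the short model and task keys that appear later in the notebook:

```python
from itertools import product

# Values copied from the experiment matrix; the names are illustrative.
models = ["distilbert", "bert", "roberta"]
tasks = ["sst2", "mrpc", "cola", "qnli"]
alphas = [0.0, 0.001, 0.01, 0.1]
seeds = [42, 123, 456, 789, 1024]

# One run per (model, task, alpha, seed) combination.
grid = list(product(models, tasks, alphas, seeds))
print(len(grid))  # 3 * 4 * 4 * 5 = 240
print(grid[0])    # ('distilbert', 'sst2', 0.0, 42)
```

The seed varies fastest, so all seeds for one (model, task, α) setting run back to back.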

## Expected Runtime
- Full sweep: ~8-12 hours on T4/V100
- Single task/model: ~30-60 min

In [1]:
# Check GPU
!nvidia-smi

Wed Dec 17 14:48:00 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA A100-SXM4-40GB          Off |   00000000:00:04.0 Off |                    0 |
| N/A   33C    P0             47W /  400W |       0MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                

In [2]:
# Install dependencies
!pip install -q transformers datasets accelerate scipy scikit-learn pandas matplotlib

In [3]:
# Core imports
from __future__ import annotations

import json
import math
import os
import time
from dataclasses import dataclass
from datetime import datetime
from pathlib import Path
from typing import Any, Dict, List, Optional, Tuple

import numpy as np
import pandas as pd
import torch
import torch.nn as nn
from scipy import stats
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
    set_seed,
)
from transformers.modeling_outputs import SequenceClassifierOutput

print(f"PyTorch: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

PyTorch: 2.9.0+cu126
CUDA available: True
GPU: NVIDIA A100-SXM4-40GB


In [4]:
# ============================================================================
# Differentiable geDIG Calculator
# ============================================================================

@dataclass
class DifferentiableGeDIG:
    """Computes geDIG F in a differentiable manner for backpropagation."""
    lambda_param: float = 1.0
    gamma: float = 0.5
    temperature: float = 0.1
    percentile: float = 0.9
    max_path_length: int = 4

    def compute_F(self, attention: torch.Tensor, attention_mask: Optional[torch.Tensor] = None) -> Dict[str, torch.Tensor]:
        batch_size, num_heads, seq_len, _ = attention.shape
        if attention_mask is not None:
            mask_2d = attention_mask.unsqueeze(1).unsqueeze(2) * attention_mask.unsqueeze(1).unsqueeze(3)
            attention = attention * mask_2d.float()
        delta_epc = self._compute_soft_density(attention)
        delta_h = self._compute_entropy(attention, attention_mask)
        delta_sp = self._compute_soft_path_efficiency(attention, attention_mask)
        F_values = delta_epc - self.lambda_param * (delta_h + self.gamma * delta_sp)
        return {"F": F_values, "F_mean": F_values.mean(), "delta_epc": delta_epc, "delta_h": delta_h, "delta_sp": delta_sp}

    def _compute_soft_density(self, attention: torch.Tensor) -> torch.Tensor:
        batch_size, num_heads, seq_len, _ = attention.shape
        attn_flat = attention.view(batch_size, num_heads, -1)
        k = max(1, int(self.percentile * seq_len * seq_len))  # torch.kthvalue requires k >= 1
        threshold = torch.kthvalue(attn_flat, k, dim=-1).values.unsqueeze(-1).unsqueeze(-1)
        edge_probs = torch.sigmoid((attention - threshold) / self.temperature)
        return edge_probs.sum(dim=(-2, -1)) / (seq_len * seq_len)

    def _compute_entropy(self, attention: torch.Tensor, attention_mask: Optional[torch.Tensor] = None) -> torch.Tensor:
        batch_size, num_heads, seq_len, _ = attention.shape
        attn_flat = attention.view(batch_size, num_heads, -1)
        attn_norm = attn_flat / (attn_flat.sum(dim=-1, keepdim=True) + 1e-10)
        entropy = -(attn_norm * torch.log(attn_norm + 1e-10)).sum(dim=-1)
        if attention_mask is not None:
            valid_count = attention_mask.sum(dim=-1).float()
            max_entropy = torch.log(valid_count * valid_count + 1e-10).unsqueeze(1)
        else:
            max_entropy = math.log(seq_len * seq_len)
        return entropy / (max_entropy + 1e-10)

    def _compute_soft_path_efficiency(self, attention: torch.Tensor, attention_mask: Optional[torch.Tensor] = None) -> torch.Tensor:
        batch_size, num_heads, seq_len, _ = attention.shape
        attn_flat = attention.view(batch_size, num_heads, -1)
        k = max(1, int(self.percentile * seq_len * seq_len))  # torch.kthvalue requires k >= 1
        threshold = torch.kthvalue(attn_flat, k, dim=-1).values.unsqueeze(-1).unsqueeze(-1)
        adj = torch.sigmoid((attention - threshold) / self.temperature)
        eye = torch.eye(seq_len, device=attention.device).unsqueeze(0).unsqueeze(0)
        adj = adj + eye
        path_efficiency = torch.zeros(batch_size, num_heads, device=attention.device)
        adj_power = adj.clone()
        for path_len in range(1, self.max_path_length + 1):
            if path_len > 1:
                adj_power = torch.clamp(torch.matmul(adj_power, adj), 0, 1)
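            # NOTE: the hard (> 0.5) comparison produces a boolean mask, so no gradient
            # flows through delta_sp; only delta_epc and delta_h contribute to backprop.
            path_efficiency = path_efficiency + (1.0 / path_len) * (adj_power > 0.5).float().mean(dim=(-2, -1))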
        return path_efficiency / self.max_path_length
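
To make the sign conventions in `compute_F` easier to follow, here is a pure-Python sketch for a single flattened attention map. It mirrors the normalized-entropy term and the combination `F = delta_epc - lambda_param * (delta_h + gamma * delta_sp)`, but it omits the `1e-10` stabilizers and the padding mask, and the helper names `normalized_entropy` and `gedig_f` are illustrative rather than part of the notebook:

```python
import math

def normalized_entropy(weights):
    """Shannon entropy of a flattened attention map, scaled to [0, 1]."""
    total = sum(weights)
    probs = [w / total for w in weights]
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / math.log(len(weights))

def gedig_f(delta_epc, delta_h, delta_sp, lambda_param=1.0, gamma=0.5):
    """Combine the three terms the same way compute_F does."""
    return delta_epc - lambda_param * (delta_h + gamma * delta_sp)

uniform = [0.25, 0.25, 0.25, 0.25]  # attention spread evenly
peaked = [0.97, 0.01, 0.01, 0.01]   # attention concentrated on one entry

h_uniform = normalized_entropy(uniform)
h_peaked = normalized_entropy(peaked)
print(round(h_uniform, 3), round(h_peaked, 3))

# With the other two terms held at made-up fixed values, higher entropy lowers F.
print(gedig_f(0.1, h_uniform, 0.5) < gedig_f(0.1, h_peaked, 0.5))
```

Because the entropy and path terms are subtracted, minimizing F in this formulation pushes toward lower edge density and higher entropy.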

In [5]:
# ============================================================================
# F-Regularized Model and Trainer
# ============================================================================

class FRegularizedModel(nn.Module):
    """Wrapper that adds geDIG F regularization to the loss."""
    def __init__(self, base_model: nn.Module, alpha: float = 0.1, gedig_config: Optional[Dict[str, Any]] = None):
        super().__init__()
        self.base_model = base_model
        self.alpha = alpha
        self.gedig = DifferentiableGeDIG(**(gedig_config or {}))
        self._last_gedig_metrics: Optional[Dict[str, float]] = None

    def forward(self, input_ids: torch.Tensor, attention_mask: Optional[torch.Tensor] = None, 
                labels: Optional[torch.Tensor] = None, **kwargs) -> SequenceClassifierOutput:
        outputs = self.base_model(input_ids=input_ids, attention_mask=attention_mask, 
                                   labels=labels, output_attentions=True, **kwargs)
        if labels is not None and self.alpha > 0:
            f_values = [self.gedig.compute_F(layer_attn, attention_mask)["F_mean"] 
                       for layer_attn in outputs.attentions]
            f_mean = torch.stack(f_values).mean()
            total_loss = outputs.loss + self.alpha * f_mean
            self._last_gedig_metrics = {
                "f_mean": f_mean.item(), 
                "ce_loss": outputs.loss.item(), 
                "total_loss": total_loss.item()
            }
            return SequenceClassifierOutput(loss=total_loss, logits=outputs.logits, 
                                            hidden_states=None, attentions=None)
        return SequenceClassifierOutput(loss=outputs.loss, logits=outputs.logits, 
                                        hidden_states=None, attentions=None)


class FRegularizedTrainer(Trainer):
    """Trainer with geDIG metric logging."""
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        outputs = model(**inputs)
        loss = outputs.loss
        if hasattr(model, "_last_gedig_metrics") and model._last_gedig_metrics:
            self.log(model._last_gedig_metrics)
        return (loss, outputs) if return_outputs else loss
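
The objective assembled in `FRegularizedModel.forward` is the cross-entropy loss plus `alpha` times the mean of the per-layer F values. A small sketch of that arithmetic with plain floats; `regularized_loss` is a hypothetical helper and the per-layer values are made up, chosen near the F of about -0.43 reported in the baseline logs of this notebook:

```python
from statistics import fmean

def regularized_loss(ce_loss, layer_f_means, alpha):
    """Total loss = CE + alpha * (mean of per-layer F)."""
    if alpha <= 0:
        return ce_loss  # alpha = 0 is the unregularized baseline
    return ce_loss + alpha * fmean(layer_f_means)

# Made-up per-layer F values for illustration only.
layer_f = [-0.40, -0.45, -0.43]
for alpha in [0.0, 0.001, 0.01, 0.1]:
    print(alpha, round(regularized_loss(0.30, layer_f, alpha), 5))
```

When F is negative, the logged `total_loss` sits below `ce_loss`, so the two values should be compared separately when reading training curves across different α.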

In [6]:
# ============================================================================
# Task Configurations
# ============================================================================

TASK_CONFIGS = {
    "sst2": {
        "dataset": ("glue", "sst2"),
        "text_field": "sentence",
        "num_labels": 2,
        "metric": "accuracy",
    },
    "mrpc": {
        "dataset": ("glue", "mrpc"),
        "text_field": ["sentence1", "sentence2"],
        "num_labels": 2,
        "metric": "f1",
    },
    "cola": {
        "dataset": ("glue", "cola"),
        "text_field": "sentence",
        "num_labels": 2,
        "metric": "matthews_correlation",
    },
    "qnli": {
        "dataset": ("glue", "qnli"),
        "text_field": ["question", "sentence"],
        "num_labels": 2,
        "metric": "accuracy",
    },
}

MODEL_CONFIGS = {
    "distilbert": "distilbert-base-uncased",
    "bert": "bert-base-uncased",
    "roberta": "roberta-base",
}

# Experiment settings
ALPHAS = [0.0, 0.001, 0.01, 0.1]
SEEDS = [42, 123, 456, 789, 1024]

In [7]:
# ============================================================================
# Metrics Computation
# ============================================================================

from sklearn.metrics import accuracy_score, f1_score, matthews_corrcoef

def compute_metrics(pred, metric_name="accuracy"):
    labels = pred.label_ids
    preds = np.argmax(pred.predictions, axis=1)
    
    if metric_name == "accuracy":
        return {"accuracy": accuracy_score(labels, preds)}
    elif metric_name == "f1":
        return {
            "f1": f1_score(labels, preds),
            "accuracy": accuracy_score(labels, preds),
        }
    elif metric_name == "matthews_correlation":
        return {
            "matthews_correlation": matthews_corrcoef(labels, preds),
            "accuracy": accuracy_score(labels, preds),
        }
    return {"accuracy": accuracy_score(labels, preds)}


def compute_final_gedig_metrics(model, eval_dataset, tokenizer, data_collator):
    """Compute geDIG metrics on eval set."""
    from torch.utils.data import DataLoader
    device = next(model.parameters()).device
    model.eval()
    dataloader = DataLoader(eval_dataset, batch_size=32, collate_fn=data_collator)
    gedig = DifferentiableGeDIG()
    all_f = []
    
    with torch.no_grad():
        for batch in dataloader:
            batch = {k: v.to(device) for k, v in batch.items() if isinstance(v, torch.Tensor)}
            base = model.base_model if hasattr(model, "base_model") else model
            outputs = base(input_ids=batch["input_ids"], attention_mask=batch.get("attention_mask"), 
                          output_attentions=True)
            for layer_attn in outputs.attentions:
                metrics = gedig.compute_F(layer_attn, batch.get("attention_mask"))
                all_f.append(metrics["F"].mean().item())
    
    return {"f_mean": np.mean(all_f), "f_std": np.std(all_f)}

In [8]:
# ============================================================================
# Single Experiment Runner
# ============================================================================

def run_single_experiment(
    model_name: str,
    task_name: str,
    alpha: float,
    seed: int,
    max_train_samples: Optional[int] = None,
    max_eval_samples: Optional[int] = None,
    epochs: int = 3,
    batch_size: int = 16,
    learning_rate: float = 2e-5,
    output_dir: Optional[Path] = None,
) -> Dict[str, Any]:
    """Run a single F-regularization experiment."""
    set_seed(seed)
    
    task_config = TASK_CONFIGS[task_name]
    model_path = MODEL_CONFIGS[model_name]
    
    if output_dir is None:
        output_dir = Path(f"results/{model_name}/{task_name}/alpha_{alpha}_seed_{seed}")
    output_dir.mkdir(parents=True, exist_ok=True)
    
    print(f"\n{'='*70}")
    print(f"Model: {model_name} | Task: {task_name} | α: {alpha} | Seed: {seed}")
    print(f"{'='*70}")
    
    start_time = time.time()
    
    # Load dataset
    ds_name, ds_config = task_config["dataset"]
    train_split = "train" if max_train_samples is None else f"train[:{max_train_samples}]"
    eval_split = "validation" if max_eval_samples is None else f"validation[:{max_eval_samples}]"
    
    ds_train = load_dataset(ds_name, ds_config, split=train_split)
    ds_eval = load_dataset(ds_name, ds_config, split=eval_split)
    
    print(f"Train: {len(ds_train)} samples | Eval: {len(ds_eval)} samples")
    
    # Tokenize
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    text_field = task_config["text_field"]
    
    if isinstance(text_field, list):
        tokenize_fn = lambda ex: tokenizer(ex[text_field[0]], ex[text_field[1]], 
                                           truncation=True, max_length=128)
    else:
        tokenize_fn = lambda ex: tokenizer(ex[text_field], truncation=True, max_length=128)
    
    train_ds = ds_train.map(tokenize_fn, batched=True)
    eval_ds = ds_eval.map(tokenize_fn, batched=True)
    
    # Remove unused columns
    keep_cols = {"input_ids", "attention_mask", "label"}
    train_ds = train_ds.remove_columns([c for c in train_ds.column_names if c not in keep_cols])
    eval_ds = eval_ds.remove_columns([c for c in eval_ds.column_names if c not in keep_cols])
    train_ds = train_ds.with_format("torch")
    eval_ds = eval_ds.with_format("torch")
    
    # Load model
    base_model = AutoModelForSequenceClassification.from_pretrained(
        model_path, num_labels=task_config["num_labels"]
    )
    model = FRegularizedModel(base_model, alpha=alpha) if alpha > 0 else base_model
    
    # Training args
    training_args = TrainingArguments(
        output_dir=str(output_dir),
        eval_strategy="epoch",
        logging_steps=50,
        save_strategy="no",
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        num_train_epochs=epochs,
        learning_rate=learning_rate,
        weight_decay=0.01,
        report_to=[],
        seed=seed,
        fp16=torch.cuda.is_available(),
    )
    
    # Trainer
    data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
    metric_name = task_config["metric"]
    
    trainer = FRegularizedTrainer(
        model=model,
        args=training_args,
        train_dataset=train_ds,
        eval_dataset=eval_ds,
        processing_class=tokenizer,
        data_collator=data_collator,
        compute_metrics=lambda p: compute_metrics(p, metric_name),
    )
    
    # Train
    train_result = trainer.train()
    eval_result = trainer.evaluate()
    
    # Final geDIG metrics
    final_f = compute_final_gedig_metrics(model, eval_ds, tokenizer, data_collator)
    
    elapsed = time.time() - start_time
    
    # Compile result
    result = {
        "model": model_name,
        "task": task_name,
        "alpha": alpha,
        "seed": seed,
        "train_samples": len(ds_train),
        "eval_samples": len(ds_eval),
        "epochs": epochs,
        "metric_name": metric_name,
        "eval_metric": eval_result.get(f"eval_{metric_name}"),
        "eval_accuracy": eval_result.get("eval_accuracy"),
        "eval_loss": eval_result.get("eval_loss"),
        "final_f_mean": final_f["f_mean"],
        "final_f_std": final_f["f_std"],
        "runtime_seconds": elapsed,
    }
    
    # Save
    (output_dir / "result.json").write_text(json.dumps(result, indent=2))
    
    print(f"Result: {metric_name}={result['eval_metric']:.4f}, F={final_f['f_mean']:.4f}, time={elapsed:.1f}s")
    
    return result

In [9]:
# ============================================================================
# Large-Scale Experiment Runner
# ============================================================================

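def run_large_scale_experiment(
    models: Optional[List[str]] = None,
    tasks: Optional[List[str]] = None,
    alphas: Optional[List[float]] = None,
    seeds: Optional[List[int]] = None,
    max_train_samples: Optional[int] = None,  # None = full dataset
    max_eval_samples: Optional[int] = None,
    epochs: int = 3,
    output_dir: Path = Path("results"),
) -> List[Dict[str, Any]]:
    """Run large-scale F-regularization experiment."""
    # Resolve defaults here rather than using mutable default arguments.
    models = list(MODEL_CONFIGS) if models is None else models
    tasks = list(TASK_CONFIGS) if tasks is None else tasks
    alphas = ALPHAS if alphas is None else alphas
    seeds = SEEDS if seeds is None else seeds
    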
    total_experiments = len(models) * len(tasks) * len(alphas) * len(seeds)
    print(f"\n{'#'*70}")
    print(f"# LARGE-SCALE F-REGULARIZATION EXPERIMENT")
    print(f"# Models: {models}")
    print(f"# Tasks: {tasks}")
    print(f"# Alphas: {alphas}")
    print(f"# Seeds: {seeds}")
    print(f"# Total experiments: {total_experiments}")
    print(f"{'#'*70}\n")
    
    all_results = []
    experiment_idx = 0
    
    for model_name in models:
        for task_name in tasks:
            for alpha in alphas:
                for seed in seeds:
                    experiment_idx += 1
                    print(f"\n[{experiment_idx}/{total_experiments}]")
                    
                    try:
                        result = run_single_experiment(
                            model_name=model_name,
                            task_name=task_name,
                            alpha=alpha,
                            seed=seed,
                            max_train_samples=max_train_samples,
                            max_eval_samples=max_eval_samples,
                            epochs=epochs,
                            output_dir=output_dir / model_name / task_name / f"alpha_{alpha}_seed_{seed}",
                        )
                        all_results.append(result)
                        
                        # Save intermediate results
                        output_dir.mkdir(parents=True, exist_ok=True)
                        (output_dir / "all_results_partial.json").write_text(
                            json.dumps(all_results, indent=2)
                        )
                        
                    except Exception as e:
                        print(f"ERROR: {e}")
                        all_results.append({
                            "model": model_name, "task": task_name, 
                            "alpha": alpha, "seed": seed, "error": str(e)
                        })
    
    # Save final results (create the directory in case no run succeeded)
    output_dir.mkdir(parents=True, exist_ok=True)
    (output_dir / "all_results.json").write_text(json.dumps(all_results, indent=2))
    print(f"\nSaved {len(all_results)} results to {output_dir / 'all_results.json'}")
    
    return all_results
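
The runner writes `all_results_partial.json` after each successful run, which makes it possible to resume an interrupted sweep. A minimal sketch of that idea, assuming the file has the list-of-dicts layout written above; `completed_keys` is a hypothetical helper that the runner does not currently call, and the error record in the demo is made up:

```python
import json
import tempfile
from pathlib import Path

def completed_keys(partial_path):
    """Return (model, task, alpha, seed) keys of runs that finished without error."""
    path = Path(partial_path)
    if not path.exists():
        return set()
    records = json.loads(path.read_text())
    return {
        (r["model"], r["task"], r["alpha"], r["seed"])
        for r in records
        if "error" not in r
    }

# Demo with a throwaway file that mimics all_results_partial.json.
with tempfile.TemporaryDirectory() as tmp:
    partial = Path(tmp) / "all_results_partial.json"
    partial.write_text(json.dumps([
        {"model": "distilbert", "task": "sst2", "alpha": 0.0, "seed": 42, "eval_metric": 0.9048},
        {"model": "distilbert", "task": "sst2", "alpha": 0.0, "seed": 123, "error": "made-up failure"},
    ]))
    done = completed_keys(partial)

print(("distilbert", "sst2", 0.0, 42) in done)   # finished, can be skipped
print(("distilbert", "sst2", 0.0, 123) in done)  # failed, should be retried
```

Inside the nested loops, a check such as `if (model_name, task_name, alpha, seed) in done: continue` would skip finished runs; the previously saved records would also need to be reloaded into `all_results` so the final file stays complete.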

In [10]:
# ============================================================================
# Statistical Analysis
# ============================================================================

def analyze_results(results: List[Dict], output_dir: Path = Path("results")):
    """Comprehensive statistical analysis of experiment results."""
    
    df = pd.DataFrame([r for r in results if "error" not in r])
    
    print("\n" + "="*70)
    print("STATISTICAL ANALYSIS")
    print("="*70)
    
    # 1. Overall summary by alpha
    print("\n### Overall Summary by Alpha ###")
    overall = df.groupby("alpha").agg({
        "eval_accuracy": ["mean", "std", "count"],
        "final_f_mean": ["mean", "std"],
    }).round(4)
    print(overall)
    
    # 2. Per-task analysis
    print("\n### Per-Task Summary ###")
    for task in df["task"].unique():
        print(f"\n--- {task.upper()} ---")
        task_df = df[df["task"] == task]
        task_summary = task_df.groupby("alpha").agg({
            "eval_metric": ["mean", "std"],
        }).round(4)
        print(task_summary)
    
    # 3. Per-model analysis
    print("\n### Per-Model Summary ###")
    for model in df["model"].unique():
        print(f"\n--- {model.upper()} ---")
        model_df = df[df["model"] == model]
        model_summary = model_df.groupby("alpha").agg({
            "eval_accuracy": ["mean", "std"],
        }).round(4)
        print(model_summary)
    
    # 4. Statistical tests (independent two-sample t-test: each alpha > 0 vs the alpha = 0 baseline,
    #    pooled over all models, tasks, and seeds)
    print("\n### Statistical Significance Tests ###")
    baseline_df = df[df["alpha"] == 0.0]
    
    stat_results = []
    for alpha in [a for a in df["alpha"].unique() if a > 0]:
        treatment_df = df[df["alpha"] == alpha]
        
        baseline_acc = baseline_df["eval_accuracy"].values
        treatment_acc = treatment_df["eval_accuracy"].values
        
        if len(baseline_acc) > 1 and len(treatment_acc) > 1:
            t_stat, p_value = stats.ttest_ind(treatment_acc, baseline_acc)
            # Cohen's d with sample variances (ddof=1), consistent with pandas' default std
            effect_size = (treatment_acc.mean() - baseline_acc.mean()) / np.sqrt(
                (baseline_acc.std(ddof=1)**2 + treatment_acc.std(ddof=1)**2) / 2
            )
            
            stat_results.append({
                "alpha": alpha,
                "baseline_mean": float(baseline_acc.mean()),
                "baseline_std": float(baseline_acc.std()),
                "treatment_mean": float(treatment_acc.mean()),
                "treatment_std": float(treatment_acc.std()),
                "improvement_pct": float((treatment_acc.mean() - baseline_acc.mean()) * 100),
                "t_statistic": float(t_stat),
                "p_value": float(p_value),
                "cohens_d": float(effect_size),
            })
            
            print(f"\nα={alpha} vs α=0 (baseline):")
            print(f"  Baseline: {baseline_acc.mean():.4f} ± {baseline_acc.std():.4f}")
            print(f"  Treatment: {treatment_acc.mean():.4f} ± {treatment_acc.std():.4f}")
            print(f"  Improvement: {(treatment_acc.mean() - baseline_acc.mean())*100:+.2f}%")
            print(f"  t-statistic: {t_stat:.3f}")
            print(f"  p-value: {p_value:.4f} {'***' if p_value < 0.001 else '**' if p_value < 0.01 else '*' if p_value < 0.05 else ''}")
            print(f"  Cohen's d: {effect_size:.3f}")
    
    # 5. Find best configuration
    print("\n### Best Configurations ###")
    best_configs = []
    best_overall = df.groupby(["model", "task", "alpha"])["eval_metric"].mean().reset_index()
    for task in df["task"].unique():
        task_best = best_overall[best_overall["task"] == task]
        best_row = task_best.loc[task_best["eval_metric"].idxmax()]
        baseline_row = task_best[(task_best["alpha"] == 0.0)]
        if not baseline_row.empty:
            baseline_val = baseline_row["eval_metric"].mean()
            improvement = (best_row["eval_metric"] - baseline_val) * 100
            best_configs.append({
                "task": task,
                "best_alpha": float(best_row["alpha"]),
                "best_model": best_row["model"],
                "best_metric": float(best_row["eval_metric"]),
                "improvement_pct": float(improvement),
            })
            print(f"{task}: Best α={best_row['alpha']} ({best_row['model']}), "
                  f"metric={best_row['eval_metric']:.4f}, improvement={improvement:+.2f}%")
    
    # Save analysis (JSON-serializable format)
    analysis = {
        "statistical_tests": stat_results,
        "best_configurations": best_configs,
        "timestamp": datetime.now().isoformat(),
        "total_experiments": len(df),
    }
    output_dir.mkdir(parents=True, exist_ok=True)  # the default "results" directory may not exist yet
    (output_dir / "analysis.json").write_text(json.dumps(analysis, indent=2))
    
    return df
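
`analyze_results` pools every model and task into two groups and compares them with `stats.ttest_ind`. Since each α shares the same (model, task, seed) keys with the baseline, a paired comparison is one alternative worth considering, because it removes between-task and between-model variance from the test. The sketch below computes the paired t statistic with the standard library only; `paired_t` is a hypothetical helper and the scores are made up for illustration, not results from this experiment:

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(baseline, treatment):
    """Paired t statistic over keys present in both dicts.

    Each dict maps a (model, task, seed) key to an evaluation score.
    Returns (mean difference, t statistic, number of pairs).
    """
    keys = sorted(set(baseline) & set(treatment))
    diffs = [treatment[k] - baseline[k] for k in keys]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two matched pairs")
    sd = stdev(diffs)  # sample standard deviation (n - 1 in the denominator)
    if sd == 0:
        raise ValueError("all paired differences are identical")
    return mean(diffs), mean(diffs) / (sd / sqrt(n)), n

# Made-up scores for illustration only.
baseline = {("bert", "sst2", 42): 0.90, ("bert", "sst2", 123): 0.91, ("bert", "mrpc", 42): 0.85}
treatment = {("bert", "sst2", 42): 0.91, ("bert", "sst2", 123): 0.93, ("bert", "mrpc", 42): 0.88}

mean_diff, t_stat, n = paired_t(baseline, treatment)
print(f"mean diff={mean_diff:.4f}, t={t_stat:.3f}, pairs={n}")
```

`scipy.stats.ttest_rel` computes the same statistic together with a p-value once the two score arrays are aligned on matching keys.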

In [11]:
# ============================================================================
# Visualization
# ============================================================================

import matplotlib.pyplot as plt

def plot_results(df: pd.DataFrame, output_dir: Path = Path("results")):
    """Generate comprehensive visualization of results."""
    
    fig, axes = plt.subplots(2, 3, figsize=(18, 12))
    
    # 1. Overall Alpha vs Accuracy
    ax = axes[0, 0]
    overall = df.groupby("alpha")["eval_accuracy"].agg(["mean", "std"]).reset_index()
    ax.errorbar(range(len(overall)), overall["mean"], yerr=overall["std"],
                marker="o", markersize=10, linewidth=2, capsize=5)
    ax.set_xticks(range(len(overall)))
    ax.set_xticklabels([f"{a}" for a in overall["alpha"]])
    ax.set_xlabel("Alpha")
    ax.set_ylabel("Accuracy")
    ax.set_title("Overall: Alpha vs Accuracy")
    ax.grid(True, alpha=0.3)
    baseline = overall[overall["alpha"] == 0]["mean"].values[0]
    ax.axhline(y=baseline, color="gray", linestyle="--", alpha=0.7, label="Baseline")
    ax.legend()
    
    # 2. Per-Task Alpha vs Metric
    ax = axes[0, 1]
    for task in df["task"].unique():
        task_df = df[df["task"] == task]
        task_summary = task_df.groupby("alpha")["eval_metric"].mean().reset_index()
        ax.plot(range(len(task_summary)), task_summary["eval_metric"], 
                marker="o", label=task, linewidth=2)
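    alpha_ticks = sorted(df["alpha"].unique())  # label ticks with the alphas actually present in df
    ax.set_xticks(range(len(alpha_ticks)))
    ax.set_xticklabels([f"{a}" for a in alpha_ticks])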
    ax.set_xlabel("Alpha")
    ax.set_ylabel("Task Metric")
    ax.set_title("Per-Task: Alpha vs Metric")
    ax.legend()
    ax.grid(True, alpha=0.3)
    
    # 3. Per-Model Alpha vs Accuracy
    ax = axes[0, 2]
    for model in df["model"].unique():
        model_df = df[df["model"] == model]
        model_summary = model_df.groupby("alpha")["eval_accuracy"].mean().reset_index()
        ax.plot(range(len(model_summary)), model_summary["eval_accuracy"],
                marker="s", label=model, linewidth=2)
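    alpha_ticks = sorted(df["alpha"].unique())  # label ticks with the alphas actually present in df
    ax.set_xticks(range(len(alpha_ticks)))
    ax.set_xticklabels([f"{a}" for a in alpha_ticks])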
    ax.set_xlabel("Alpha")
    ax.set_ylabel("Accuracy")
    ax.set_title("Per-Model: Alpha vs Accuracy")
    ax.legend()
    ax.grid(True, alpha=0.3)
    
    # 4. Alpha vs Final F
    ax = axes[1, 0]
    f_summary = df.groupby("alpha")["final_f_mean"].agg(["mean", "std"]).reset_index()
    ax.errorbar(range(len(f_summary)), f_summary["mean"], yerr=f_summary["std"],
                marker="s", markersize=10, linewidth=2, capsize=5, color="orange")
    ax.set_xticks(range(len(f_summary)))
    ax.set_xticklabels([f"{a}" for a in f_summary["alpha"]])
    ax.set_xlabel("Alpha")
    ax.set_ylabel("Final F (geDIG)")
    ax.set_title("Alpha vs Final F")
    ax.grid(True, alpha=0.3)
    
    # 5. Accuracy vs F scatter (correlation)
    ax = axes[1, 1]
    scatter = ax.scatter(df["final_f_mean"], df["eval_accuracy"], 
                         c=[ALPHAS.index(a) for a in df["alpha"]], 
                         cmap="viridis", alpha=0.6, s=50)
    # Trend line
    z = np.polyfit(df["final_f_mean"], df["eval_accuracy"], 1)
    p = np.poly1d(z)
    x_range = np.linspace(df["final_f_mean"].min(), df["final_f_mean"].max(), 100)
    corr = np.corrcoef(df["final_f_mean"], df["eval_accuracy"])[0, 1]
    ax.plot(x_range, p(x_range), "r--", alpha=0.5, label=f"r={corr:.3f}")
    ax.set_xlabel("Final F (geDIG)")
    ax.set_ylabel("Accuracy")
    ax.set_title(f"Accuracy vs F Correlation (r={corr:.3f})")
    ax.legend()
    ax.grid(True, alpha=0.3)
    plt.colorbar(scatter, ax=ax, label="Alpha index")
    
    # 6. Improvement heatmap (model x task)
    ax = axes[1, 2]
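    # Improvement of the best-performing alpha over the alpha = 0 baseline.
    # The max is taken over all alphas, including 0, so each cell is >= 0 whenever a baseline exists.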
    improvements = []
    for model in df["model"].unique():
        row = []
        for task in df["task"].unique():
            subset = df[(df["model"] == model) & (df["task"] == task)]
            baseline = subset[subset["alpha"] == 0]["eval_metric"].mean()
            best = subset.groupby("alpha")["eval_metric"].mean().max()
            improvement = (best - baseline) * 100
            row.append(improvement)
        improvements.append(row)
    
    im = ax.imshow(improvements, cmap="RdYlGn", aspect="auto", vmin=-2, vmax=2)
    ax.set_xticks(range(len(df["task"].unique())))
    ax.set_xticklabels(df["task"].unique())
    ax.set_yticks(range(len(df["model"].unique())))
    ax.set_yticklabels(df["model"].unique())
    ax.set_title("Improvement (%) vs Baseline")
    plt.colorbar(im, ax=ax, label="Improvement %")
    
    # Add text annotations
    for i in range(len(df["model"].unique())):
        for j in range(len(df["task"].unique())):
            ax.text(j, i, f"{improvements[i][j]:.2f}", ha="center", va="center", fontsize=10)
    
    plt.tight_layout()
    output_dir.mkdir(parents=True, exist_ok=True)  # make sure the target directory exists
    plt.savefig(output_dir / "fig_large_scale_results.png", dpi=150)
    plt.show()
    print(f"Saved: {output_dir / 'fig_large_scale_results.png'}")

---
# EXPERIMENT EXECUTION
---

In [12]:
# ============================================================================
# OPTION 1: Quick Test (single model, single task)
# Runtime: ~10-15 min
# ============================================================================

QUICK_TEST = False  # Set to True to run the quick test

if QUICK_TEST:
    results = run_large_scale_experiment(
        models=["distilbert"],
        tasks=["sst2"],
        alphas=[0.0, 0.001, 0.01],
        seeds=[42, 123],
        max_train_samples=2000,
        max_eval_samples=500,
        epochs=2,
        output_dir=Path("results_quick"),
    )

In [13]:
# ============================================================================
# OPTION 2: Medium Scale (all tasks, one model)
# Runtime: ~2-3 hours
# ============================================================================

MEDIUM_SCALE = False  # Set to True to run

if MEDIUM_SCALE:
    results = run_large_scale_experiment(
        models=["distilbert"],
        tasks=["sst2", "mrpc", "cola", "qnli"],
        alphas=ALPHAS,
        seeds=[42, 123, 456],
        max_train_samples=5000,
        epochs=3,
        output_dir=Path("results_medium"),
    )

In [14]:
# ============================================================================
# OPTION 3: Full Scale Experiment
# Runtime: ~8-12 hours (recommend A100/V100)
# ============================================================================

FULL_SCALE = True  # Set to True to run

if FULL_SCALE:
    results = run_large_scale_experiment(
        models=["distilbert", "bert", "roberta"],
        tasks=["sst2", "mrpc", "cola", "qnli"],
        alphas=ALPHAS,
        seeds=SEEDS,
        max_train_samples=None,  # Full dataset
        max_eval_samples=None,
        epochs=3,
        output_dir=Path("results_full"),
    )


######################################################################
# LARGE-SCALE F-REGULARIZATION EXPERIMENT
# Models: ['distilbert', 'bert', 'roberta']
# Tasks: ['sst2', 'mrpc', 'cola', 'qnli']
# Alphas: [0.0, 0.001, 0.01, 0.1]
# Seeds: [42, 123, 456, 789, 1024]
# Total experiments: 240
######################################################################


[1/240]

Model: distilbert | Task: sst2 | α: 0.0 | Seed: 42


Error while fetching `HF_TOKEN` secret value from your vault: 'Requesting secret HF_TOKEN timed out. Secrets can only be fetched when running from the Colab UI.'.
You are not authenticated with the Hugging Face Hub in this notebook.
If the error persists, please let us know by opening an issue on GitHub (https://github.com/huggingface/huggingface_hub/issues/new).


Train: 67349 samples | Eval: 872 samples


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Accuracy
1,0.1759,0.289852,0.904817
2,0.1055,0.364133,0.897936
3,0.099,0.416476,0.904817




Result: accuracy=0.9048, F=-0.4259, time=358.5s

[2/240]

Model: distilbert | Task: sst2 | α: 0.0 | Seed: 123
Train: 67349 samples | Eval: 872 samples


Epoch,Training Loss,Validation Loss,Accuracy
1,0.1998,0.294576,0.905963
2,0.1094,0.32733,0.908257
3,0.0724,0.38771,0.908257


Result: accuracy=0.9083, F=-0.4258, time=328.4s

[3/240]

Model: distilbert | Task: sst2 | α: 0.0 | Seed: 456
Train: 67349 samples | Eval: 872 samples


Epoch,Training Loss,Validation Loss,Accuracy
1,0.1989,0.261846,0.908257
2,0.0996,0.373069,0.90711
3,0.0691,0.402178,0.909404


Result: accuracy=0.9094, F=-0.4306, time=325.5s

[4/240]

Model: distilbert | Task: sst2 | α: 0.0 | Seed: 789
Train: 67349 samples | Eval: 872 samples


Epoch,Training Loss,Validation Loss,Accuracy
1,0.1616,0.321182,0.90367
2,0.1157,0.338206,0.90367
3,0.088,0.416497,0.905963


Result: accuracy=0.9060, F=-0.4278, time=327.4s

[5/240]

Model: distilbert | Task: sst2 | α: 0.0 | Seed: 1024
Train: 67349 samples | Eval: 872 samples


Epoch,Training Loss,Validation Loss,Accuracy
1,0.1537,0.30454,0.900229
2,0.1428,0.34032,0.90711
3,0.1196,0.432247,0.908257


Result: accuracy=0.9083, F=-0.4273, time=328.6s

[6/240]

Model: distilbert | Task: sst2 | α: 0.001 | Seed: 42
Train: 67349 samples | Eval: 872 samples


Epoch,Training Loss,Validation Loss,Accuracy
1,0.1772,0.308264,0.913991
2,0.101,0.358656,0.90367
3,0.0921,0.411854,0.91055


Result: accuracy=0.9106, F=-0.4343, time=555.6s

[7/240]

Model: distilbert | Task: sst2 | α: 0.001 | Seed: 123
Train: 67349 samples | Eval: 872 samples


Epoch,Training Loss,Validation Loss,Accuracy
1,0.202,0.28214,0.90367
2,0.1021,0.366136,0.901376
3,0.0737,0.391146,0.911697


Result: accuracy=0.9117, F=-0.4398, time=556.0s

[8/240]

Model: distilbert | Task: sst2 | α: 0.001 | Seed: 456
Train: 67349 samples | Eval: 872 samples


Epoch,Training Loss,Validation Loss,Accuracy
1,0.2072,0.278046,0.902523
2,0.1017,0.376885,0.909404
3,0.069,0.443387,0.90367


Result: accuracy=0.9037, F=-0.4333, time=555.6s

[9/240]

Model: distilbert | Task: sst2 | α: 0.001 | Seed: 789
Train: 67349 samples | Eval: 872 samples


Epoch,Training Loss,Validation Loss,Accuracy
1,0.1588,0.303739,0.91055
2,0.1074,0.333668,0.90367
3,0.0879,0.407324,0.908257


Result: accuracy=0.9083, F=-0.4353, time=548.8s

[10/240]

Model: distilbert | Task: sst2 | α: 0.001 | Seed: 1024
Train: 67349 samples | Eval: 872 samples


Epoch,Training Loss,Validation Loss,Accuracy
1,0.1704,0.277414,0.905963
2,0.1324,0.351968,0.902523
3,0.1029,0.461447,0.905963


Result: accuracy=0.9060, F=-0.4325, time=549.3s

[11/240]

Model: distilbert | Task: sst2 | α: 0.01 | Seed: 42
Train: 67349 samples | Eval: 872 samples


Epoch,Training Loss,Validation Loss,Accuracy
1,0.1736,0.304901,0.909404
2,0.1012,0.334326,0.90711
3,0.0796,0.402558,0.90367


Result: accuracy=0.9037, F=-0.4731, time=545.8s

[12/240]

Model: distilbert | Task: sst2 | α: 0.01 | Seed: 123
Train: 67349 samples | Eval: 872 samples


Epoch,Training Loss,Validation Loss,Accuracy
1,0.1978,0.274549,0.90367
2,0.096,0.357827,0.902523
3,0.0736,0.388972,0.908257


Result: accuracy=0.9083, F=-0.4762, time=553.8s

[13/240]

Model: distilbert | Task: sst2 | α: 0.01 | Seed: 456
Train: 67349 samples | Eval: 872 samples


Epoch,Training Loss,Validation Loss,Accuracy
1,0.2033,0.276006,0.899083
2,0.0951,0.371849,0.91055
3,0.0626,0.44045,0.900229


Result: accuracy=0.9002, F=-0.4729, time=555.6s

[14/240]

Model: distilbert | Task: sst2 | α: 0.01 | Seed: 789
Train: 67349 samples | Eval: 872 samples


Epoch,Training Loss,Validation Loss,Accuracy
1,0.1552,0.300546,0.905963
2,0.1039,0.337926,0.901376
3,0.0821,0.396194,0.911697


Result: accuracy=0.9117, F=-0.4713, time=562.6s

[15/240]

Model: distilbert | Task: sst2 | α: 0.01 | Seed: 1024
Train: 67349 samples | Eval: 872 samples


Epoch,Training Loss,Validation Loss,Accuracy
1,0.1673,0.279356,0.905963
2,0.1267,0.343404,0.90711
3,0.1001,0.436836,0.905963


Result: accuracy=0.9060, F=-0.4689, time=547.5s

[16/240]

Model: distilbert | Task: sst2 | α: 0.1 | Seed: 42
Train: 67349 samples | Eval: 872 samples


Epoch,Training Loss,Validation Loss,Accuracy
1,0.1464,0.256553,0.90367
2,0.0488,0.34102,0.892202
3,0.0319,0.357055,0.91055


Result: accuracy=0.9106, F=-0.6027, time=545.5s

[17/240]

Model: distilbert | Task: sst2 | α: 0.1 | Seed: 123
Train: 67349 samples | Eval: 872 samples


Epoch,Training Loss,Validation Loss,Accuracy
1,0.148,0.234684,0.901376
2,0.0687,0.28232,0.90711
3,0.0188,0.346791,0.915138


Result: accuracy=0.9151, F=-0.6026, time=557.2s

[18/240]

Model: distilbert | Task: sst2 | α: 0.1 | Seed: 456
Train: 67349 samples | Eval: 872 samples


Epoch,Training Loss,Validation Loss,Accuracy
1,0.1532,0.246494,0.900229
2,0.048,0.322586,0.912844
3,0.0196,0.396408,0.904817


Result: accuracy=0.9048, F=-0.6036, time=556.0s

[19/240]

Model: distilbert | Task: sst2 | α: 0.1 | Seed: 789
Train: 67349 samples | Eval: 872 samples


Epoch,Training Loss,Validation Loss,Accuracy
1,0.1005,0.258331,0.908257
2,0.0587,0.33769,0.90367
3,0.0546,0.384129,0.901376


Result: accuracy=0.9014, F=-0.6021, time=572.9s

[20/240]

Model: distilbert | Task: sst2 | α: 0.1 | Seed: 1024
Train: 67349 samples | Eval: 872 samples


Epoch,Training Loss,Validation Loss,Accuracy
1,0.1055,0.239356,0.902523
2,0.0701,0.309704,0.909404
3,0.0898,0.398297,0.899083


Result: accuracy=0.8991, F=-0.6053, time=565.2s

[21/240]

Model: distilbert | Task: mrpc | α: 0.0 | Seed: 42


Train: 3668 samples | Eval: 408 samples


Epoch,Training Loss,Validation Loss,F1,Accuracy
1,0.4864,0.417076,0.873857,0.830882
2,0.3671,0.361558,0.893543,0.85049
3,0.2438,0.40911,0.9,0.857843


Result: f1=0.9000, F=-0.4323, time=34.9s

[22/240]

Model: distilbert | Task: mrpc | α: 0.0 | Seed: 123
Train: 3668 samples | Eval: 408 samples


Epoch,Training Loss,Validation Loss,F1,Accuracy
1,0.4631,0.406113,0.87813,0.821078
2,0.3776,0.383014,0.893039,0.845588
3,0.2361,0.395302,0.891525,0.843137


Result: f1=0.8915, F=-0.4268, time=28.1s

[23/240]

Model: distilbert | Task: mrpc | α: 0.0 | Seed: 456
Train: 3668 samples | Eval: 408 samples


Epoch,Training Loss,Validation Loss,F1,Accuracy
1,0.4613,0.535515,0.850998,0.762255
2,0.3429,0.357634,0.882883,0.840686
3,0.261,0.406651,0.893761,0.845588


Result: f1=0.8938, F=-0.4289, time=28.0s

[24/240]

Model: distilbert | Task: mrpc | α: 0.0 | Seed: 789
Train: 3668 samples | Eval: 408 samples


Epoch,Training Loss,Validation Loss,F1,Accuracy
1,0.5141,0.434951,0.872786,0.806373
2,0.3615,0.377052,0.893836,0.848039
3,0.271,0.408029,0.901361,0.857843


Result: f1=0.9014, F=-0.4328, time=28.2s

[25/240]

Model: distilbert | Task: mrpc | α: 0.0 | Seed: 1024
Train: 3668 samples | Eval: 408 samples


Epoch,Training Loss,Validation Loss,F1,Accuracy
1,0.488,0.459828,0.819417,0.772059
2,0.3656,0.38242,0.881647,0.830882
3,0.2372,0.401076,0.893039,0.845588


Result: f1=0.8930, F=-0.4372, time=28.2s

[26/240]

Model: distilbert | Task: mrpc | α: 0.001 | Seed: 42
Train: 3668 samples | Eval: 408 samples


Epoch,Training Loss,Validation Loss,F1,Accuracy
1,0.4822,0.413214,0.857143,0.811275
2,0.3767,0.360617,0.883959,0.833333
3,0.2474,0.386507,0.895369,0.85049


Result: f1=0.8954, F=-0.4294, time=41.2s

[27/240]

Model: distilbert | Task: mrpc | α: 0.001 | Seed: 123
Train: 3668 samples | Eval: 408 samples


Epoch,Training Loss,Validation Loss,F1,Accuracy
1,0.4865,0.430447,0.868687,0.808824
2,0.3983,0.383102,0.873921,0.821078
3,0.2611,0.393589,0.892308,0.845588


Result: f1=0.8923, F=-0.4417, time=41.1s

[28/240]

Model: distilbert | Task: mrpc | α: 0.001 | Seed: 456
Train: 3668 samples | Eval: 408 samples


Epoch,Training Loss,Validation Loss,F1,Accuracy
1,0.4724,0.495566,0.85,0.764706
2,0.3436,0.366075,0.892857,0.852941
3,0.2585,0.403867,0.893836,0.848039


Result: f1=0.8938, F=-0.4303, time=41.1s

[29/240]

Model: distilbert | Task: mrpc | α: 0.001 | Seed: 789
Train: 3668 samples | Eval: 408 samples


Epoch,Training Loss,Validation Loss,F1,Accuracy
1,0.5305,0.442534,0.869565,0.801471
2,0.3441,0.366092,0.891938,0.845588
3,0.2502,0.394129,0.901695,0.857843


Result: f1=0.9017, F=-0.4365, time=41.2s

[30/240]

Model: distilbert | Task: mrpc | α: 0.001 | Seed: 1024
Train: 3668 samples | Eval: 408 samples


Epoch,Training Loss,Validation Loss,F1,Accuracy
1,0.4944,0.44346,0.836852,0.791667
2,0.3845,0.391927,0.891122,0.840686
3,0.2298,0.408051,0.902027,0.857843


Result: f1=0.9020, F=-0.4441, time=41.3s

[31/240]

Model: distilbert | Task: mrpc | α: 0.01 | Seed: 42
Train: 3668 samples | Eval: 408 samples


Epoch,Training Loss,Validation Loss,F1,Accuracy
1,0.4784,0.41344,0.857143,0.811275
2,0.3726,0.349983,0.890411,0.843137
3,0.2404,0.383578,0.893836,0.848039


Result: f1=0.8938, F=-0.4379, time=41.6s

[32/240]

Model: distilbert | Task: mrpc | α: 0.01 | Seed: 123
Train: 3668 samples | Eval: 408 samples


Epoch,Training Loss,Validation Loss,F1,Accuracy
1,0.4793,0.42703,0.868687,0.808824
2,0.3934,0.386401,0.876712,0.823529
3,0.2489,0.387497,0.893836,0.848039


Result: f1=0.8938, F=-0.4434, time=40.7s

[33/240]

Model: distilbert | Task: mrpc | α: 0.01 | Seed: 456
Train: 3668 samples | Eval: 408 samples


Epoch,Training Loss,Validation Loss,F1,Accuracy
1,0.4688,0.483731,0.855346,0.77451
2,0.3396,0.361179,0.888889,0.848039
3,0.2572,0.400208,0.893836,0.848039


Result: f1=0.8938, F=-0.4377, time=40.7s

[34/240]

Model: distilbert | Task: mrpc | α: 0.01 | Seed: 789
Train: 3668 samples | Eval: 408 samples


Epoch,Training Loss,Validation Loss,F1,Accuracy
1,0.5298,0.435515,0.869565,0.801471
2,0.3415,0.360113,0.887348,0.840686
3,0.2453,0.394112,0.900169,0.855392


Result: f1=0.9002, F=-0.4430, time=40.6s

[35/240]

Model: distilbert | Task: mrpc | α: 0.01 | Seed: 1024
Train: 3668 samples | Eval: 408 samples


Epoch,Training Loss,Validation Loss,F1,Accuracy
1,0.4941,0.445978,0.823301,0.776961
2,0.3815,0.389821,0.886667,0.833333
3,0.2261,0.406278,0.900169,0.855392


Result: f1=0.9002, F=-0.4540, time=41.2s

[36/240]

Model: distilbert | Task: mrpc | α: 0.1 | Seed: 42
Train: 3668 samples | Eval: 408 samples


Epoch,Training Loss,Validation Loss,F1,Accuracy
1,0.4925,0.401876,0.868376,0.811275
2,0.3593,0.309136,0.883072,0.835784
3,0.2189,0.346125,0.892308,0.845588


Result: f1=0.8923, F=-0.4954, time=41.4s

[37/240]

Model: distilbert | Task: mrpc | α: 0.1 | Seed: 123
Train: 3668 samples | Eval: 408 samples


Epoch,Training Loss,Validation Loss,F1,Accuracy
1,0.4623,0.402024,0.862876,0.79902
2,0.3547,0.346592,0.891156,0.843137
3,0.2144,0.361344,0.88586,0.835784


Result: f1=0.8859, F=-0.5048, time=41.1s

[38/240]

Model: distilbert | Task: mrpc | α: 0.1 | Seed: 456
Train: 3668 samples | Eval: 408 samples


Epoch,Training Loss,Validation Loss,F1,Accuracy
1,0.4371,0.399759,0.8672,0.796569
2,0.3088,0.320512,0.897163,0.857843
3,0.2209,0.360943,0.890785,0.843137


Result: f1=0.8908, F=-0.4937, time=41.5s

[39/240]

Model: distilbert | Task: mrpc | α: 0.1 | Seed: 789
Train: 3668 samples | Eval: 408 samples


Epoch,Training Loss,Validation Loss,F1,Accuracy
1,0.5002,0.401202,0.858537,0.786765
2,0.3037,0.326069,0.890411,0.843137
3,0.2108,0.361531,0.892617,0.843137


Result: f1=0.8926, F=-0.4954, time=41.0s

[40/240]

Model: distilbert | Task: mrpc | α: 0.1 | Seed: 1024
Train: 3668 samples | Eval: 408 samples


Epoch,Training Loss,Validation Loss,F1,Accuracy
1,0.4635,0.449281,0.780083,0.740196
2,0.3446,0.345919,0.894825,0.845588
3,0.1769,0.365216,0.90084,0.855392


Result: f1=0.9008, F=-0.5017, time=41.1s

[41/240]

Model: distilbert | Task: cola | α: 0.0 | Seed: 42


Train: 8551 samples | Eval: 1043 samples


Epoch,Training Loss,Validation Loss,Matthews Correlation,Accuracy
1,0.4606,0.467009,0.455969,0.783317
2,0.3288,0.498664,0.538214,0.811122
3,0.2408,0.580326,0.530589,0.810163


Result: matthews_correlation=0.5306, F=-0.5054, time=57.8s

[42/240]

Model: distilbert | Task: cola | α: 0.0 | Seed: 123
Train: 8551 samples | Eval: 1043 samples


Epoch,Training Loss,Validation Loss,Matthews Correlation,Accuracy
1,0.4695,0.474692,0.446971,0.780441
2,0.3201,0.539565,0.51399,0.80441
3,0.2469,0.613814,0.520577,0.806328


Result: matthews_correlation=0.5206, F=-0.4992, time=51.9s

[43/240]

Model: distilbert | Task: cola | α: 0.0 | Seed: 456
Train: 8551 samples | Eval: 1043 samples


Epoch,Training Loss,Validation Loss,Matthews Correlation,Accuracy
1,0.4459,0.493798,0.379344,0.759348
2,0.331,0.492544,0.495548,0.79674
3,0.2356,0.602647,0.500553,0.798658


Result: matthews_correlation=0.5006, F=-0.5175, time=51.7s

[44/240]

Model: distilbert | Task: cola | α: 0.0 | Seed: 789
Train: 8551 samples | Eval: 1043 samples


Epoch,Training Loss,Validation Loss,Matthews Correlation,Accuracy
1,0.4535,0.476024,0.450775,0.7814
2,0.3335,0.490528,0.509687,0.801534
3,0.2054,0.592937,0.494834,0.79674


Result: matthews_correlation=0.4948, F=-0.5016, time=52.2s

[45/240]

Model: distilbert | Task: cola | α: 0.0 | Seed: 1024
Train: 8551 samples | Eval: 1043 samples


Epoch,Training Loss,Validation Loss,Matthews Correlation,Accuracy
1,0.4635,0.453946,0.472806,0.78907
2,0.303,0.515733,0.507278,0.801534
3,0.2132,0.593146,0.524175,0.806328


Result: matthews_correlation=0.5242, F=-0.5043, time=51.7s

[46/240]

Model: distilbert | Task: cola | α: 0.001 | Seed: 42
Train: 8551 samples | Eval: 1043 samples


Epoch,Training Loss,Validation Loss,Matthews Correlation,Accuracy
1,0.4676,0.485813,0.412227,0.769895
2,0.3558,0.488826,0.504729,0.798658
3,0.2305,0.580839,0.534067,0.811122


Result: matthews_correlation=0.5341, F=-0.5151, time=82.3s

[47/240]

Model: distilbert | Task: cola | α: 0.001 | Seed: 123
Train: 8551 samples | Eval: 1043 samples


Epoch,Training Loss,Validation Loss,Matthews Correlation,Accuracy
1,0.4633,0.474844,0.429462,0.774688
2,0.3016,0.54636,0.49734,0.798658
3,0.233,0.605463,0.520022,0.806328


Result: matthews_correlation=0.5200, F=-0.4977, time=83.8s

[48/240]

Model: distilbert | Task: cola | α: 0.001 | Seed: 456
Train: 8551 samples | Eval: 1043 samples


Epoch,Training Loss,Validation Loss,Matthews Correlation,Accuracy
1,0.4498,0.501101,0.379832,0.759348
2,0.322,0.482696,0.515571,0.80441
3,0.232,0.587006,0.523522,0.807287


Result: matthews_correlation=0.5235, F=-0.5157, time=84.3s

[49/240]

Model: distilbert | Task: cola | α: 0.001 | Seed: 789
Train: 8551 samples | Eval: 1043 samples


Epoch,Training Loss,Validation Loss,Matthews Correlation,Accuracy
1,0.4741,0.473813,0.453549,0.782359
2,0.3355,0.483052,0.510249,0.802493
3,0.2351,0.582547,0.503591,0.799616


Result: matthews_correlation=0.5036, F=-0.5097, time=84.4s

[50/240]

Model: distilbert | Task: cola | α: 0.001 | Seed: 1024
Train: 8551 samples | Eval: 1043 samples


Epoch,Training Loss,Validation Loss,Matthews Correlation,Accuracy
1,0.4686,0.463003,0.449759,0.7814
2,0.3042,0.487873,0.514256,0.803452
3,0.1846,0.60391,0.521857,0.806328


Result: matthews_correlation=0.5219, F=-0.5055, time=84.8s

[51/240]

Model: distilbert | Task: cola | α: 0.01 | Seed: 42
Train: 8551 samples | Eval: 1043 samples


Epoch,Training Loss,Validation Loss,Matthews Correlation,Accuracy
1,0.4644,0.481875,0.432534,0.776606
2,0.3579,0.484192,0.502057,0.797699
3,0.2317,0.575531,0.518539,0.805369


Result: matthews_correlation=0.5185, F=-0.5256, time=82.1s

[52/240]

Model: distilbert | Task: cola | α: 0.01 | Seed: 123
Train: 8551 samples | Eval: 1043 samples


Epoch,Training Loss,Validation Loss,Matthews Correlation,Accuracy
1,0.4594,0.469557,0.434828,0.776606
2,0.2987,0.537917,0.502897,0.800575
3,0.2295,0.599859,0.520577,0.806328


Result: matthews_correlation=0.5206, F=-0.5084, time=82.7s

[53/240]

Model: distilbert | Task: cola | α: 0.01 | Seed: 456
Train: 8551 samples | Eval: 1043 samples


Epoch,Training Loss,Validation Loss,Matthews Correlation,Accuracy
1,0.4467,0.496364,0.370257,0.756472
2,0.3166,0.479428,0.510249,0.802493
3,0.2282,0.58341,0.526164,0.808245


Result: matthews_correlation=0.5262, F=-0.5263, time=81.8s

[54/240]

Model: distilbert | Task: cola | α: 0.01 | Seed: 789
Train: 8551 samples | Eval: 1043 samples


Epoch,Training Loss,Validation Loss,Matthews Correlation,Accuracy
1,0.4684,0.471503,0.455315,0.783317
2,0.3334,0.476703,0.502885,0.799616
3,0.2264,0.578672,0.495184,0.79674


Result: matthews_correlation=0.4952, F=-0.5207, time=81.7s

[55/240]

Model: distilbert | Task: cola | α: 0.01 | Seed: 1024
Train: 8551 samples | Eval: 1043 samples


Epoch,Training Loss,Validation Loss,Matthews Correlation,Accuracy
1,0.4639,0.460614,0.451155,0.782359
2,0.2988,0.485525,0.511596,0.802493
3,0.1783,0.601351,0.529114,0.809204


Result: matthews_correlation=0.5291, F=-0.5151, time=84.2s

[56/240]

Model: distilbert | Task: cola | α: 0.1 | Seed: 42
Train: 8551 samples | Eval: 1043 samples


Epoch,Training Loss,Validation Loss,Matthews Correlation,Accuracy
1,0.4109,0.413686,0.424197,0.77373
2,0.3075,0.417684,0.507829,0.800575
3,0.1788,0.518841,0.505227,0.800575


Result: matthews_correlation=0.5052, F=-0.5991, time=83.3s

[57/240]

Model: distilbert | Task: cola | α: 0.1 | Seed: 123
Train: 8551 samples | Eval: 1043 samples


Epoch,Training Loss,Validation Loss,Matthews Correlation,Accuracy
1,0.4156,0.419737,0.429161,0.774688
2,0.2609,0.485289,0.486351,0.794823
3,0.1793,0.538478,0.531437,0.810163


Result: matthews_correlation=0.5314, F=-0.5857, time=82.2s

[58/240]

Model: distilbert | Task: cola | α: 0.1 | Seed: 456
Train: 8551 samples | Eval: 1043 samples


Epoch,Training Loss,Validation Loss,Matthews Correlation,Accuracy
1,0.3988,0.439271,0.370111,0.756472
2,0.2743,0.433918,0.506213,0.801534
3,0.1813,0.525879,0.510249,0.802493


Result: matthews_correlation=0.5102, F=-0.5972, time=83.3s

[59/240]

Model: distilbert | Task: cola | α: 0.1 | Seed: 789
Train: 8551 samples | Eval: 1043 samples


Epoch,Training Loss,Validation Loss,Matthews Correlation,Accuracy
1,0.4276,0.421345,0.452233,0.782359
2,0.3036,0.414641,0.506636,0.800575
3,0.1829,0.516978,0.493293,0.79674


Result: matthews_correlation=0.4933, F=-0.5890, time=82.2s

[60/240]

Model: distilbert | Task: cola | α: 0.1 | Seed: 1024
Train: 8551 samples | Eval: 1043 samples


Epoch,Training Loss,Validation Loss,Matthews Correlation,Accuracy
1,0.4256,0.435977,0.418173,0.771812
2,0.264,0.422247,0.507581,0.801534
3,0.1474,0.527765,0.521952,0.807287


Result: matthews_correlation=0.5220, F=-0.5990, time=81.8s

[61/240]

Model: distilbert | Task: qnli | α: 0.0 | Seed: 42


Train: 104743 samples | Eval: 5463 samples


Epoch,Training Loss,Validation Loss,Accuracy
1,0.3617,0.301578,0.875526
2,0.2498,0.298296,0.882665
3,0.1857,0.352261,0.890353


Result: accuracy=0.8904, F=-0.4183, time=572.5s

[62/240]

Model: distilbert | Task: qnli | α: 0.0 | Seed: 123
Train: 104743 samples | Eval: 5463 samples


Epoch,Training Loss,Validation Loss,Accuracy
1,0.3675,0.308295,0.874428
2,0.2451,0.297184,0.888523
3,0.1674,0.361536,0.888523


Result: accuracy=0.8885, F=-0.4099, time=552.1s

[63/240]

Model: distilbert | Task: qnli | α: 0.0 | Seed: 456
Train: 104743 samples | Eval: 5463 samples


Epoch,Training Loss,Validation Loss,Accuracy
1,0.3954,0.309816,0.872048
2,0.2371,0.285889,0.88413
3,0.1699,0.352908,0.890902


Result: accuracy=0.8909, F=-0.4176, time=551.8s

[64/240]

Model: distilbert | Task: qnli | α: 0.0 | Seed: 789
Train: 104743 samples | Eval: 5463 samples


Epoch,Training Loss,Validation Loss,Accuracy
1,0.36,0.319119,0.864543
2,0.2535,0.327672,0.879187
3,0.1902,0.368623,0.881201


Result: accuracy=0.8812, F=-0.4183, time=552.3s

[65/240]

Model: distilbert | Task: qnli | α: 0.0 | Seed: 1024
Train: 104743 samples | Eval: 5463 samples


Epoch,Training Loss,Validation Loss,Accuracy
1,0.3486,0.298168,0.877723
2,0.2627,0.288926,0.891452
3,0.1678,0.365906,0.888889


Result: accuracy=0.8889, F=-0.4178, time=552.6s

[66/240]

Model: distilbert | Task: qnli | α: 0.001 | Seed: 42
Train: 104743 samples | Eval: 5463 samples


Epoch,Training Loss,Validation Loss,Accuracy
1,0.3448,0.287515,0.88175
2,0.2509,0.283319,0.889255
3,0.1998,0.347937,0.893831


Result: accuracy=0.8938, F=-0.4141, time=931.6s

[67/240]

Model: distilbert | Task: qnli | α: 0.001 | Seed: 123
Train: 104743 samples | Eval: 5463 samples


Epoch,Training Loss,Validation Loss,Accuracy
1,0.3866,0.30904,0.87095
2,0.2657,0.314547,0.885045
3,0.1676,0.363299,0.883763


Result: accuracy=0.8838, F=-0.4195, time=937.0s

[68/240]

Model: distilbert | Task: qnli | α: 0.001 | Seed: 456
Train: 104743 samples | Eval: 5463 samples


Epoch,Training Loss,Validation Loss,Accuracy
1,0.4024,0.308687,0.870584
2,0.2447,0.295581,0.885228
3,0.1676,0.354933,0.885411


Result: accuracy=0.8854, F=-0.4191, time=933.8s

[69/240]

Model: distilbert | Task: qnli | α: 0.001 | Seed: 789
Train: 104743 samples | Eval: 5463 samples


Epoch,Training Loss,Validation Loss,Accuracy
1,0.3523,0.314625,0.866557
2,0.2467,0.313482,0.878272
3,0.1916,0.366486,0.881201


Result: accuracy=0.8812, F=-0.4223, time=924.8s

[70/240]

Model: distilbert | Task: qnli | α: 0.001 | Seed: 1024
Train: 104743 samples | Eval: 5463 samples


Epoch,Training Loss,Validation Loss,Accuracy
1,0.3631,0.339603,0.859418
2,0.262,0.311486,0.876442
3,0.1928,0.368444,0.882665


Result: accuracy=0.8827, F=-0.4206, time=923.6s

[71/240]

Model: distilbert | Task: qnli | α: 0.01 | Seed: 42
Train: 104743 samples | Eval: 5463 samples


Epoch,Training Loss,Validation Loss,Accuracy
1,0.337,0.281452,0.883763
2,0.2565,0.279183,0.887974
3,0.1894,0.339931,0.893831


Result: accuracy=0.8938, F=-0.4330, time=978.5s

[72/240]

Model: distilbert | Task: qnli | α: 0.01 | Seed: 123
Train: 104743 samples | Eval: 5463 samples


Epoch,Training Loss,Validation Loss,Accuracy
1,0.3859,0.3135,0.867289
2,0.2665,0.309875,0.884313
3,0.1708,0.362889,0.881384


Result: accuracy=0.8814, F=-0.4372, time=999.6s

[73/240]

Model: distilbert | Task: qnli | α: 0.01 | Seed: 456
Train: 104743 samples | Eval: 5463 samples


Epoch,Training Loss,Validation Loss,Accuracy
1,0.4035,0.305889,0.871316
2,0.2374,0.300999,0.881384
3,0.1545,0.362768,0.883397


Result: accuracy=0.8834, F=-0.4441, time=987.8s

[74/240]

Model: distilbert | Task: qnli | α: 0.01 | Seed: 789
Train: 104743 samples | Eval: 5463 samples


Epoch,Training Loss,Validation Loss,Accuracy
1,0.3476,0.307134,0.866374
2,0.2407,0.307357,0.884313
3,0.1756,0.354449,0.88175


Result: accuracy=0.8817, F=-0.4368, time=998.9s

[75/240]

Model: distilbert | Task: qnli | α: 0.01 | Seed: 1024
Train: 104743 samples | Eval: 5463 samples


Epoch,Training Loss,Validation Loss,Accuracy
1,0.3656,0.342282,0.857954
2,0.2551,0.309743,0.879553
3,0.2008,0.368571,0.883214


Result: accuracy=0.8832, F=-0.4389, time=1005.4s

[76/240]

Model: distilbert | Task: qnli | α: 0.1 | Seed: 42
Train: 104743 samples | Eval: 5463 samples


Epoch,Training Loss,Validation Loss,Accuracy
1,0.3276,0.282131,0.859235
2,0.2311,0.271924,0.864543
3,0.1747,0.336601,0.87095


Result: accuracy=0.8710, F=-0.5378, time=978.6s

[77/240]

Model: distilbert | Task: qnli | α: 0.1 | Seed: 123
Train: 104743 samples | Eval: 5463 samples


Epoch,Training Loss,Validation Loss,Accuracy
1,0.3439,0.280851,0.857404
2,0.2305,0.273456,0.86912
3,0.155,0.341321,0.867655


Result: accuracy=0.8677, F=-0.5473, time=1001.0s

[78/240]

Model: distilbert | Task: qnli | α: 0.1 | Seed: 456
Train: 104743 samples | Eval: 5463 samples


Epoch,Training Loss,Validation Loss,Accuracy
1,0.3544,0.281877,0.855025
2,0.2106,0.277873,0.86674
3,0.1222,0.341274,0.86674


Result: accuracy=0.8667, F=-0.5510, time=996.3s

[79/240]

Model: distilbert | Task: qnli | α: 0.1 | Seed: 789
Train: 104743 samples | Eval: 5463 samples


Epoch,Training Loss,Validation Loss,Accuracy
1,0.3085,0.267553,0.861248
2,0.2064,0.281702,0.869303
3,0.152,0.334931,0.869852


Result: accuracy=0.8699, F=-0.5519, time=993.9s

[80/240]

Model: distilbert | Task: qnli | α: 0.1 | Seed: 1024
Train: 104743 samples | Eval: 5463 samples


Epoch,Training Loss,Validation Loss,Accuracy
1,0.3303,0.291389,0.85356
2,0.2433,0.268883,0.867655
3,0.1897,0.322812,0.866557


Result: accuracy=0.8666, F=-0.5563, time=944.6s

[81/240]

Model: bert | Task: sst2 | α: 0.0 | Seed: 42
Train: 67349 samples | Eval: 872 samples


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Accuracy
1,0.1763,0.236627,0.924312
2,0.1081,0.287195,0.925459
3,0.1005,0.328609,0.928899




Result: accuracy=0.9289, F=-0.4278, time=601.6s

[82/240]

Model: bert | Task: sst2 | α: 0.0 | Seed: 123
Train: 67349 samples | Eval: 872 samples


Epoch,Training Loss,Validation Loss,Accuracy
1,0.1797,0.286627,0.916284
2,0.114,0.260266,0.922018
3,0.0669,0.344921,0.922018


Result: accuracy=0.9220, F=-0.4241, time=594.2s

[83/240]

Model: bert | Task: sst2 | α: 0.0 | Seed: 456
Train: 67349 samples | Eval: 872 samples


Epoch,Training Loss,Validation Loss,Accuracy
1,0.1728,0.252437,0.923165
2,0.0837,0.304888,0.931193
3,0.0583,0.347462,0.930046


Result: accuracy=0.9300, F=-0.4249, time=593.4s

[84/240]

Model: bert | Task: sst2 | α: 0.0 | Seed: 789
Train: 67349 samples | Eval: 872 samples


Epoch,Training Loss,Validation Loss,Accuracy
1,0.1548,0.258871,0.918578
2,0.1239,0.263217,0.931193
3,0.0927,0.31591,0.931193


Result: accuracy=0.9312, F=-0.4285, time=590.1s

[85/240]

Model: bert | Task: sst2 | α: 0.0 | Seed: 1024
Train: 67349 samples | Eval: 872 samples


Epoch,Training Loss,Validation Loss,Accuracy
1,0.1531,0.264119,0.918578
2,0.1249,0.297757,0.924312
3,0.1069,0.358336,0.925459


Result: accuracy=0.9255, F=-0.4256, time=590.2s

[86/240]

Model: bert | Task: sst2 | α: 0.001 | Seed: 42
Train: 67349 samples | Eval: 872 samples


Epoch,Training Loss,Validation Loss,Accuracy
1,0.1654,0.248906,0.919725
2,0.0978,0.304388,0.919725
3,0.1052,0.319844,0.928899


Result: accuracy=0.9289, F=-0.4322, time=1013.7s

[87/240]

Model: bert | Task: sst2 | α: 0.001 | Seed: 123
Train: 67349 samples | Eval: 872 samples


Epoch,Training Loss,Validation Loss,Accuracy
1,0.1936,0.277024,0.912844
2,0.0963,0.310291,0.920872
3,0.0802,0.334843,0.925459


Result: accuracy=0.9255, F=-0.4363, time=1012.8s

[88/240]

Model: bert | Task: sst2 | α: 0.001 | Seed: 456
Train: 67349 samples | Eval: 872 samples


Epoch,Training Loss,Validation Loss,Accuracy
1,0.1767,0.261864,0.917431
2,0.0983,0.319959,0.926606
3,0.0557,0.338807,0.926606


Result: accuracy=0.9266, F=-0.4415, time=1009.8s

[89/240]

Model: bert | Task: sst2 | α: 0.001 | Seed: 789
Train: 67349 samples | Eval: 872 samples


Epoch,Training Loss,Validation Loss,Accuracy
1,0.1326,0.26608,0.920872
2,0.1274,0.273757,0.930046
3,0.0839,0.31599,0.932339


Result: accuracy=0.9323, F=-0.4283, time=1012.8s

[90/240]

Model: bert | Task: sst2 | α: 0.001 | Seed: 1024
Train: 67349 samples | Eval: 872 samples


Epoch,Training Loss,Validation Loss,Accuracy
1,0.1502,0.2646,0.923165
2,0.1329,0.270594,0.930046
3,0.0964,0.318488,0.931193


Result: accuracy=0.9312, F=-0.4268, time=1013.3s

[91/240]

Model: bert | Task: sst2 | α: 0.01 | Seed: 42
Train: 67349 samples | Eval: 872 samples


Epoch,Training Loss,Validation Loss,Accuracy
1,0.1683,0.249528,0.924312
2,0.0901,0.306988,0.919725
3,0.113,0.356657,0.922018


Result: accuracy=0.9220, F=-0.4877, time=1032.8s

[92/240]

Model: bert | Task: sst2 | α: 0.01 | Seed: 123
Train: 67349 samples | Eval: 872 samples


Epoch,Training Loss,Validation Loss,Accuracy
1,0.1863,0.276249,0.912844
2,0.116,0.275689,0.922018
3,0.0741,0.343268,0.928899


Result: accuracy=0.9289, F=-0.4862, time=1016.9s

[93/240]

Model: bert | Task: sst2 | α: 0.01 | Seed: 456
Train: 67349 samples | Eval: 872 samples


Epoch,Training Loss,Validation Loss,Accuracy
1,0.1732,0.248723,0.923165
2,0.0886,0.313467,0.928899
3,0.0494,0.332217,0.926606


Result: accuracy=0.9266, F=-0.4847, time=1018.0s

[94/240]

Model: bert | Task: sst2 | α: 0.01 | Seed: 789
Train: 67349 samples | Eval: 872 samples


Epoch,Training Loss,Validation Loss,Accuracy
1,0.1229,0.254713,0.917431
2,0.1328,0.248961,0.928899
3,0.0726,0.28787,0.936927


Result: accuracy=0.9369, F=-0.4852, time=1011.4s

[95/240]

Model: bert | Task: sst2 | α: 0.01 | Seed: 1024
Train: 67349 samples | Eval: 872 samples


Epoch,Training Loss,Validation Loss,Accuracy
1,0.1428,0.262661,0.920872
2,0.153,0.268921,0.926606
3,0.0935,0.324005,0.926606


Result: accuracy=0.9266, F=-0.4872, time=1013.3s

[96/240]

Model: bert | Task: sst2 | α: 0.1 | Seed: 42
Train: 67349 samples | Eval: 872 samples


Epoch,Training Loss,Validation Loss,Accuracy
1,0.1269,0.19383,0.922018
2,0.0419,0.273469,0.912844
3,0.0534,0.325247,0.922018


Result: accuracy=0.9220, F=-0.6285, time=1023.6s

[97/240]

Model: bert | Task: sst2 | α: 0.1 | Seed: 123
Train: 67349 samples | Eval: 872 samples


Epoch,Training Loss,Validation Loss,Accuracy
1,0.1458,0.228714,0.916284
2,0.0521,0.278446,0.912844
3,0.0176,0.333223,0.913991


Result: accuracy=0.9140, F=-0.6331, time=1020.3s

[98/240]

Model: bert | Task: sst2 | α: 0.1 | Seed: 456
Train: 67349 samples | Eval: 872 samples


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Accuracy
1,0.1421,0.235477,0.902523
2,0.0439,0.287676,0.923165
3,0.0135,0.320529,0.924312


Result: accuracy=0.9243, F=-0.6319, time=1018.5s

[99/240]

Model: bert | Task: sst2 | α: 0.1 | Seed: 789
Train: 67349 samples | Eval: 872 samples


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Accuracy
1,0.0798,0.213435,0.920872
2,0.0695,0.265891,0.918578
3,0.0508,0.292451,0.931193


Result: accuracy=0.9312, F=-0.6309, time=1011.4s

[100/240]

Model: bert | Task: sst2 | α: 0.1 | Seed: 1024
Train: 67349 samples | Eval: 872 samples


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Accuracy
1,0.1009,0.209099,0.923165
2,0.0993,0.245431,0.926606
3,0.0532,0.329978,0.917431


Result: accuracy=0.9174, F=-0.6318, time=1018.8s

[101/240]

Model: bert | Task: mrpc | α: 0.0 | Seed: 42
Train: 3668 samples | Eval: 408 samples


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,F1,Accuracy
1,0.5146,0.416542,0.872727,0.828431
2,0.3758,0.351957,0.885305,0.843137
3,0.2401,0.395776,0.899654,0.857843


Result: f1=0.8997, F=-0.4483, time=46.3s

[102/240]

Model: bert | Task: mrpc | α: 0.0 | Seed: 123
Train: 3668 samples | Eval: 408 samples


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,F1,Accuracy
1,0.4991,0.405986,0.875887,0.828431
2,0.3852,0.396416,0.888519,0.835784
3,0.2302,0.404081,0.895833,0.852941


Result: f1=0.8958, F=-0.4519, time=46.5s

[103/240]

Model: bert | Task: mrpc | α: 0.0 | Seed: 456
Train: 3668 samples | Eval: 408 samples


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,F1,Accuracy
1,0.4868,0.515515,0.867925,0.794118
2,0.3804,0.367961,0.883212,0.843137
3,0.2676,0.425317,0.886986,0.838235


Result: f1=0.8870, F=-0.4613, time=46.2s

[104/240]

Model: bert | Task: mrpc | α: 0.0 | Seed: 789
Train: 3668 samples | Eval: 408 samples


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,F1,Accuracy
1,0.5737,0.407005,0.861702,0.808824
2,0.3569,0.355789,0.891608,0.848039
3,0.2265,0.408852,0.889667,0.845588


Result: f1=0.8897, F=-0.4506, time=46.9s

[105/240]

Model: bert | Task: mrpc | α: 0.0 | Seed: 1024
Train: 3668 samples | Eval: 408 samples


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,F1,Accuracy
1,0.4739,0.401668,0.863551,0.821078
2,0.36,0.346029,0.891697,0.852941
3,0.2298,0.389067,0.905923,0.867647


Result: f1=0.9059, F=-0.4261, time=46.0s

[106/240]

Model: bert | Task: mrpc | α: 0.001 | Seed: 42
Train: 3668 samples | Eval: 408 samples


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,F1,Accuracy
1,0.5122,0.412017,0.875214,0.821078
2,0.3945,0.363513,0.881834,0.835784
3,0.2471,0.430404,0.891566,0.845588


Result: f1=0.8916, F=-0.4312, time=71.3s

[107/240]

Model: bert | Task: mrpc | α: 0.001 | Seed: 123
Train: 3668 samples | Eval: 408 samples


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,F1,Accuracy
1,0.5032,0.410412,0.875433,0.823529
2,0.3598,0.355883,0.899489,0.855392
3,0.2149,0.355892,0.899471,0.860294


Result: f1=0.8995, F=-0.4344, time=71.3s

[108/240]

Model: bert | Task: mrpc | α: 0.001 | Seed: 456
Train: 3668 samples | Eval: 408 samples


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,F1,Accuracy
1,0.5283,0.47656,0.860317,0.784314
2,0.4076,0.389099,0.866171,0.823529
3,0.2616,0.406715,0.886165,0.840686


Result: f1=0.8862, F=-0.4537, time=72.5s

[109/240]

Model: bert | Task: mrpc | α: 0.001 | Seed: 789
Train: 3668 samples | Eval: 408 samples


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,F1,Accuracy
1,0.5641,0.475495,0.850088,0.791667
2,0.3627,0.396658,0.866792,0.82598
3,0.2563,0.418804,0.886926,0.843137


Result: f1=0.8869, F=-0.4420, time=72.0s

[110/240]

Model: bert | Task: mrpc | α: 0.001 | Seed: 1024
Train: 3668 samples | Eval: 408 samples


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,F1,Accuracy
1,0.4875,0.428191,0.84453,0.801471
2,0.3699,0.364602,0.893617,0.852941
3,0.1974,0.417423,0.907534,0.867647


Result: f1=0.9075, F=-0.4373, time=71.4s

[111/240]

Model: bert | Task: mrpc | α: 0.01 | Seed: 42
Train: 3668 samples | Eval: 408 samples


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,F1,Accuracy
1,0.5341,0.420696,0.880399,0.823529
2,0.3887,0.353345,0.884956,0.840686
3,0.2205,0.367782,0.897391,0.855392


Result: f1=0.8974, F=-0.4364, time=71.6s

[112/240]

Model: bert | Task: mrpc | α: 0.01 | Seed: 123
Train: 3668 samples | Eval: 408 samples


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,F1,Accuracy
1,0.4851,0.425479,0.869281,0.803922
2,0.361,0.343973,0.901554,0.860294
3,0.2035,0.374688,0.901754,0.862745


Result: f1=0.9018, F=-0.4410, time=71.4s

[113/240]

Model: bert | Task: mrpc | α: 0.01 | Seed: 456
Train: 3668 samples | Eval: 408 samples


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,F1,Accuracy
1,0.5365,0.540277,0.825444,0.710784
2,0.4125,0.393156,0.872458,0.830882
3,0.2661,0.418222,0.880829,0.830882


Result: f1=0.8808, F=-0.4739, time=72.1s

[114/240]

Model: bert | Task: mrpc | α: 0.01 | Seed: 789
Train: 3668 samples | Eval: 408 samples


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,F1,Accuracy
1,0.5883,0.484921,0.846939,0.779412
2,0.3766,0.438024,0.839458,0.796569
3,0.2684,0.420833,0.875,0.828431


Result: f1=0.8750, F=-0.4446, time=71.3s

[115/240]

Model: bert | Task: mrpc | α: 0.01 | Seed: 1024
Train: 3668 samples | Eval: 408 samples


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,F1,Accuracy
1,0.4838,0.444278,0.811765,0.764706
2,0.3831,0.367572,0.892035,0.85049
3,0.2127,0.398738,0.911304,0.875


Result: f1=0.9113, F=-0.4350, time=71.3s

[116/240]

Model: bert | Task: mrpc | α: 0.1 | Seed: 42
Train: 3668 samples | Eval: 408 samples


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,F1,Accuracy
1,0.4594,0.375937,0.861368,0.816176
2,0.3438,0.333157,0.890388,0.840686
3,0.1738,0.356387,0.90566,0.865196


Result: f1=0.9057, F=-0.4948, time=70.9s

[117/240]

Model: bert | Task: mrpc | α: 0.1 | Seed: 123
Train: 3668 samples | Eval: 408 samples


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,F1,Accuracy
1,0.455,0.378193,0.889256,0.835784
2,0.2985,0.326513,0.903879,0.860294
3,0.1408,0.343974,0.895944,0.855392


Result: f1=0.8959, F=-0.4947, time=71.2s

[118/240]

Model: bert | Task: mrpc | α: 0.1 | Seed: 456
Train: 3668 samples | Eval: 408 samples


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,F1,Accuracy
1,0.4746,0.429974,0.852665,0.769608
2,0.35,0.368411,0.842912,0.79902
3,0.1981,0.384489,0.887348,0.840686


Result: f1=0.8873, F=-0.5178, time=71.5s

[119/240]

Model: bert | Task: mrpc | α: 0.1 | Seed: 789
Train: 3668 samples | Eval: 408 samples


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,F1,Accuracy
1,0.5612,0.440042,0.841402,0.767157
2,0.3329,0.402558,0.828125,0.784314
3,0.2306,0.370328,0.881834,0.835784


Result: f1=0.8818, F=-0.5084, time=71.2s

[120/240]

Model: bert | Task: mrpc | α: 0.1 | Seed: 1024
Train: 3668 samples | Eval: 408 samples


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,F1,Accuracy
1,0.4602,0.43634,0.790021,0.752451
2,0.3492,0.335816,0.882979,0.838235
3,0.1571,0.392224,0.897747,0.855392


Result: f1=0.8977, F=-0.4773, time=72.0s

[121/240]

Model: bert | Task: cola | α: 0.0 | Seed: 42
Train: 8551 samples | Eval: 1043 samples


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Matthews Correlation,Accuracy
1,0.3997,0.411153,0.559928,0.821668
2,0.2912,0.509569,0.562538,0.822627
3,0.1828,0.654855,0.567568,0.824545


Result: matthews_correlation=0.5676, F=-0.4937, time=92.9s

[122/240]

Model: bert | Task: cola | α: 0.0 | Seed: 123
Train: 8551 samples | Eval: 1043 samples


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Matthews Correlation,Accuracy
1,0.4225,0.435346,0.523502,0.808245
2,0.2401,0.54111,0.562582,0.822627
3,0.188,0.630204,0.585751,0.831256


Result: matthews_correlation=0.5858, F=-0.5149, time=91.8s

[123/240]

Model: bert | Task: cola | α: 0.0 | Seed: 456
Train: 8551 samples | Eval: 1043 samples


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Matthews Correlation,Accuracy
1,0.3928,0.425721,0.552139,0.818792
2,0.2753,0.476391,0.562907,0.822627
3,0.1843,0.679987,0.575641,0.827421


Result: matthews_correlation=0.5756, F=-0.5027, time=92.2s

[124/240]

Model: bert | Task: cola | α: 0.0 | Seed: 789
Train: 8551 samples | Eval: 1043 samples


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Matthews Correlation,Accuracy
1,0.4428,0.480536,0.495131,0.797699
2,0.2915,0.455803,0.572701,0.826462
3,0.1913,0.609642,0.593133,0.834132


Result: matthews_correlation=0.5931, F=-0.5041, time=92.2s

[125/240]

Model: bert | Task: cola | α: 0.0 | Seed: 1024
Train: 8551 samples | Eval: 1043 samples


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Matthews Correlation,Accuracy
1,0.4303,0.449751,0.504513,0.800575
2,0.2749,0.491189,0.554796,0.819751
3,0.1806,0.636516,0.585866,0.831256


Result: matthews_correlation=0.5859, F=-0.5187, time=92.3s

[126/240]

Model: bert | Task: cola | α: 0.001 | Seed: 42
Train: 8551 samples | Eval: 1043 samples


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Matthews Correlation,Accuracy
1,0.4062,0.413311,0.547346,0.816874
2,0.2837,0.487217,0.567827,0.824545
3,0.191,0.642991,0.578203,0.82838


Result: matthews_correlation=0.5782, F=-0.5125, time=151.2s

[127/240]

Model: bert | Task: cola | α: 0.001 | Seed: 123
Train: 8551 samples | Eval: 1043 samples


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Matthews Correlation,Accuracy
1,0.4434,0.443219,0.536921,0.813039
2,0.2472,0.575527,0.556431,0.819751
3,0.1861,0.6093,0.577797,0.82838


Result: matthews_correlation=0.5778, F=-0.5116, time=151.6s

[128/240]

Model: bert | Task: cola | α: 0.001 | Seed: 456
Train: 8551 samples | Eval: 1043 samples


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Matthews Correlation,Accuracy
1,0.404,0.446888,0.488865,0.795781
2,0.2885,0.445182,0.570118,0.825503
3,0.2113,0.620072,0.580551,0.829338


Result: matthews_correlation=0.5806, F=-0.5168, time=150.6s

[129/240]

Model: bert | Task: cola | α: 0.001 | Seed: 789
Train: 8551 samples | Eval: 1043 samples


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Matthews Correlation,Accuracy
1,0.4427,0.421718,0.565754,0.823586
2,0.2897,0.475157,0.565003,0.823586
3,0.1563,0.614628,0.582897,0.830297


Result: matthews_correlation=0.5829, F=-0.4926, time=151.6s

[130/240]

Model: bert | Task: cola | α: 0.001 | Seed: 1024
Train: 8551 samples | Eval: 1043 samples


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Matthews Correlation,Accuracy
1,0.4181,0.436691,0.543201,0.814957
2,0.27,0.462107,0.589046,0.832215
3,0.1177,0.666847,0.570636,0.825503


Result: matthews_correlation=0.5706, F=-0.4974, time=150.7s

[131/240]

Model: bert | Task: cola | α: 0.01 | Seed: 42
Train: 8551 samples | Eval: 1043 samples


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Matthews Correlation,Accuracy
1,0.4024,0.412132,0.557414,0.820709
2,0.274,0.448815,0.588414,0.832215
3,0.18,0.661195,0.567605,0.824545


Result: matthews_correlation=0.5676, F=-0.5114, time=151.8s

[132/240]

Model: bert | Task: cola | α: 0.01 | Seed: 123
Train: 8551 samples | Eval: 1043 samples


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Matthews Correlation,Accuracy
1,0.4316,0.445031,0.502055,0.800575
2,0.2438,0.583247,0.522202,0.807287
3,0.1952,0.62958,0.541898,0.814957


Result: matthews_correlation=0.5419, F=-0.5306, time=150.7s

[133/240]

Model: bert | Task: cola | α: 0.01 | Seed: 456
Train: 8551 samples | Eval: 1043 samples


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Matthews Correlation,Accuracy
1,0.4016,0.43711,0.507785,0.802493
2,0.2656,0.471753,0.580628,0.829338
3,0.1806,0.609764,0.593084,0.834132


Result: matthews_correlation=0.5931, F=-0.5231, time=150.7s

[134/240]

Model: bert | Task: cola | α: 0.01 | Seed: 789
Train: 8551 samples | Eval: 1043 samples


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Matthews Correlation,Accuracy
1,0.4397,0.416531,0.563388,0.822627
2,0.2865,0.459358,0.567552,0.824545
3,0.1579,0.613574,0.577859,0.82838


Result: matthews_correlation=0.5779, F=-0.4994, time=150.9s

[135/240]

Model: bert | Task: cola | α: 0.01 | Seed: 1024
Train: 8551 samples | Eval: 1043 samples


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Matthews Correlation,Accuracy
1,0.4156,0.420319,0.526761,0.809204
2,0.2717,0.493038,0.554824,0.819751
3,0.1129,0.628441,0.578203,0.82838


Result: matthews_correlation=0.5782, F=-0.5048, time=151.3s

[136/240]

Model: bert | Task: cola | α: 0.1 | Seed: 42
Train: 8551 samples | Eval: 1043 samples


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Matthews Correlation,Accuracy
1,0.3458,0.365616,0.564979,0.823586
2,0.2107,0.498878,0.577797,0.82838
3,0.1265,0.619946,0.567661,0.824545


Result: matthews_correlation=0.5677, F=-0.5926, time=151.0s

[137/240]

Model: bert | Task: cola | α: 0.1 | Seed: 123
Train: 8551 samples | Eval: 1043 samples


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Matthews Correlation,Accuracy
1,0.3859,0.377289,0.507428,0.802493
2,0.184,0.528398,0.525143,0.808245
3,0.127,0.577555,0.564979,0.823586


Result: matthews_correlation=0.5650, F=-0.5965, time=150.5s

[138/240]

Model: bert | Task: cola | α: 0.1 | Seed: 456
Train: 8551 samples | Eval: 1043 samples


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Matthews Correlation,Accuracy
1,0.3484,0.403443,0.486494,0.794823
2,0.2216,0.395238,0.570508,0.825503
3,0.1359,0.579889,0.56755,0.824545


Result: matthews_correlation=0.5676, F=-0.5962, time=151.2s

[139/240]

Model: bert | Task: cola | α: 0.1 | Seed: 789
Train: 8551 samples | Eval: 1043 samples


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Matthews Correlation,Accuracy
1,0.3922,0.368928,0.540333,0.813998
2,0.2321,0.402864,0.575241,0.827421
3,0.1016,0.581533,0.557414,0.820709


Result: matthews_correlation=0.5574, F=-0.5851, time=150.6s

[140/240]

Model: bert | Task: cola | α: 0.1 | Seed: 1024
Train: 8551 samples | Eval: 1043 samples


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Matthews Correlation,Accuracy
1,0.3719,0.374939,0.532241,0.811122
2,0.2082,0.418966,0.575916,0.827421
3,0.0753,0.623258,0.568063,0.824545


Result: matthews_correlation=0.5681, F=-0.5791, time=150.3s

[141/240]

Model: bert | Task: qnli | α: 0.0 | Seed: 42
Train: 104743 samples | Eval: 5463 samples


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Accuracy
1,0.2953,0.259488,0.901336
2,0.1807,0.296637,0.90756
3,0.1481,0.372494,0.909207


Result: accuracy=0.9092, F=-0.4291, time=1072.4s

[142/240]

Model: bert | Task: qnli | α: 0.0 | Seed: 123
Train: 104743 samples | Eval: 5463 samples


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Accuracy
1,0.3211,0.253774,0.900787
2,0.2124,0.29892,0.904265
3,0.1374,0.395109,0.908109


Result: accuracy=0.9081, F=-0.4218, time=1061.2s

[143/240]

Model: bert | Task: qnli | α: 0.0 | Seed: 456
Train: 104743 samples | Eval: 5463 samples


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Accuracy
1,0.3629,0.245061,0.902252
2,0.1725,0.282573,0.909024
3,0.1355,0.381594,0.911038


Result: accuracy=0.9110, F=-0.4175, time=995.3s

[144/240]

Model: bert | Task: qnli | α: 0.0 | Seed: 789
Train: 104743 samples | Eval: 5463 samples


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Accuracy
1,0.3031,0.257373,0.897309
2,0.2008,0.289922,0.902801
3,0.1225,0.403671,0.906279


Result: accuracy=0.9063, F=-0.4228, time=992.1s

[145/240]

Model: bert | Task: qnli | α: 0.0 | Seed: 1024
Train: 104743 samples | Eval: 5463 samples


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Accuracy
1,0.2889,0.235521,0.906462
2,0.1952,0.261329,0.917079
3,0.1074,0.357436,0.916346


Result: accuracy=0.9163, F=-0.4202, time=999.2s

[146/240]

Model: bert | Task: qnli | α: 0.001 | Seed: 42
Train: 104743 samples | Eval: 5463 samples


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Accuracy
1,0.3015,0.2531,0.898957
2,0.1926,0.269678,0.907194
3,0.1605,0.368082,0.911953


Result: accuracy=0.9120, F=-0.4195, time=1677.3s

[147/240]

Model: bert | Task: qnli | α: 0.001 | Seed: 123
Train: 104743 samples | Eval: 5463 samples


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Accuracy
1,0.313,0.264085,0.897492
2,0.202,0.292653,0.907194
3,0.1164,0.387572,0.911221


Result: accuracy=0.9112, F=-0.4180, time=1681.8s

[148/240]

Model: bert | Task: qnli | α: 0.001 | Seed: 456
Train: 104743 samples | Eval: 5463 samples


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Accuracy
1,0.3402,0.243102,0.903533
2,0.1939,0.261038,0.912319
3,0.0917,0.387461,0.910672


Result: accuracy=0.9107, F=-0.4228, time=1700.7s

[149/240]

Model: bert | Task: qnli | α: 0.001 | Seed: 789
Train: 104743 samples | Eval: 5463 samples


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Accuracy
1,0.3186,0.252028,0.89676
2,0.2109,0.300014,0.906462
3,0.1037,0.373975,0.90756


Result: accuracy=0.9076, F=-0.4215, time=1683.6s

[150/240]

Model: bert | Task: qnli | α: 0.001 | Seed: 1024
Train: 104743 samples | Eval: 5463 samples


Epoch,Training Loss,Validation Loss,Accuracy
1,0.2881,0.235542,0.910855



In [1]:
# ============================================================================
# Analyze and Visualize Results
# ============================================================================

# Imports repeated here so the cell also runs in a fresh kernel.
import json
from pathlib import Path

# Load results (adjust path based on which experiment you ran).
# analyze_results and plot_results come from earlier cells; re-run those cells first after a restart.
result_dir = Path("results_full")  # or results_medium, results_quick
results_file = result_dir / "all_results.json"

if results_file.exists():
    with open(results_file) as f:
        results = json.load(f)

    df = analyze_results(results, result_dir)
    plot_results(df, result_dir)
else:
    print(f"No results found in {result_dir}")

In [None]:
# ============================================================================
# Download Results
# ============================================================================

from google.colab import files

# Adjust based on which experiment you ran
result_dir = "results_full"  # or results_medium, results_quick

!zip -r f_reg_large_scale_results.zip {result_dir}/
files.download('f_reg_large_scale_results.zip')

---
## Interpretation Guide

### Success Criteria for a Breakthrough-Level Result

| Criterion | Threshold | Status |
|-----------|-----------|--------|
| Consistent improvement | Best α>0 beats the α=0 baseline in >75% of settings | ? |
| Statistical significance | p < 0.01 for best α vs baseline | ? |
| Effect size | Cohen's d > 0.3 (small-to-medium effect) | ? |
| Cross-model generalization | Works on BERT, RoBERTa, DistilBERT | ? |
| Cross-task generalization | Works on SST-2, MRPC, CoLA, QNLI | ? |
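
The first three criteria can be checked mechanically once the sweep has finished. The following is a minimal sketch, not the notebook's own `analyze_results`: it assumes a DataFrame with one row per run and hypothetical column names `model`, `task`, `alpha`, `seed`, and `accuracy`. It compares the best α>0 against the α=0 baseline in each model/task setting using Welch's t-test and Cohen's d with a pooled standard deviation.

```python
import numpy as np
import pandas as pd
from scipy import stats


def cohens_d(treated, baseline):
    """Cohen's d using the pooled standard deviation of two independent samples."""
    treated = np.asarray(treated, dtype=float)
    baseline = np.asarray(baseline, dtype=float)
    n1, n2 = len(treated), len(baseline)
    pooled_var = (
        (n1 - 1) * treated.var(ddof=1) + (n2 - 1) * baseline.var(ddof=1)
    ) / (n1 + n2 - 2)
    return (treated.mean() - baseline.mean()) / np.sqrt(pooled_var)


def check_criteria(df, metric="accuracy"):
    """Compare the best alpha > 0 with the alpha = 0 baseline per (model, task).

    Column names are assumptions made for this sketch; adapt them to the
    actual layout of the results table.
    """
    rows = []
    for (model, task), group in df.groupby(["model", "task"]):
        base = group[group["alpha"] == 0][metric].to_numpy()
        best_alpha, best_vals = None, None
        for alpha, sub in group[group["alpha"] > 0].groupby("alpha"):
            vals = sub[metric].to_numpy()
            if best_vals is None or vals.mean() > best_vals.mean():
                best_alpha, best_vals = alpha, vals
        # Need at least two seeds on each side to estimate a variance.
        if best_vals is None or len(base) < 2 or len(best_vals) < 2:
            continue
        # Welch's t-test: does not assume equal variances across conditions.
        _, p_value = stats.ttest_ind(best_vals, base, equal_var=False)
        rows.append(
            {
                "model": model,
                "task": task,
                "best_alpha": best_alpha,
                "wins": bool(best_vals.mean() > base.mean()),
                "p_value": float(p_value),
                "cohens_d": float(cohens_d(best_vals, base)),
            }
        )
    return pd.DataFrame(rows)
```

With `summary = check_criteria(df)`, the win rate is `summary["wins"].mean()`, to be compared against the 75% threshold. Note that picking the best α per setting before testing inflates apparent significance, so a correction for multiple comparisons (or a held-out set of seeds) is advisable before filling in the Status column.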

### If Successful
- geDIG F would be supported as a **trainable objective** for Transformer optimization
- Would open a path to **attention-free architectures** based on graph principles
- Potentially publishable at ACL/EMNLP/NeurIPS level