
## Simulation Framework

### 1. 实验目标与总体流程
本文的 simulation 采用“**基础基线 → 预训练模型快速探索 → 阶段性训练 → 统一测试 → 鲁棒性分析**”的五阶段框架。核心目标是比较不同语义匹配模型在标准测试集与对抗式语义扰动下的性能差异，并分析模型对细粒度语义变化的敏感性与稳定性。

### 2. 基础 Baseline（两种方法）
1. **Lexical Baseline（TF-IDF）**：将句对编码为词法特征，并以线性分类/相似度计算作为下界。  
2. **Static Embedding Baseline（静态词向量池化）**：对句子进行词向量池化（如 mean/max），再以 cosine 相似度建模语义接近程度。  

该部分提供“低成本、可解释”的参照系，用于衡量后续深度模型的增益。

### 3. 现有模型探索（简单测试）
在统一输入与评估协议下，对 **BERT、ALBERT、RoBERTa、SBERT** 做快速验证（如零样本/少量训练设置），观察其在语义匹配任务上的初始表现。  
其中，SBERT 采用双塔句向量 + cosine；BERT/ALBERT/RoBERTa 采用 cross-encoder 打分或分类头输出，形成统一可比的相似度/标签预测。

### 4. 基于 BERT 的阶段性训练
采用两阶段训练策略：  
- **Phase A**：先在句对二分类任务（如 paraphrase）上训练，学习通用匹配能力；  
- **Phase B**：迁移到语义相似度回归任务继续微调，使输出更贴近细粒度语义强度。  

该策略用于验证“先学习句对判别，再学习相似度标定”的迁移收益。

### 5. 测试集统一评测（所有模型）
在同一测试集上报告全模型性能：  
- 分类任务：`Accuracy`、`F1`；  
- 回归/相似度任务：`Pearson`、`Spearman`（可附 `MAE`）。  

并通过统一阈值选择策略（在验证集选阈值）保证比较公平性。

### 6. 自建鲁棒性数据集与分析
构建 stress-test 集合，覆盖 8 类关键语义扰动：  
1) Negation（否定）  
2) Increase/Decrease（方向变化）  
3) Comparative Flip（比较级反转）  
4) Role Swap（角色交换）  
5) Numeric Change（数值变化）  
6) Quantifier Shift（量词变化）  
7) Modal Shift（模态变化）  
8) Direction Swap（方向交换）

评测方式为：在开发集选阈值、在测试集报告总体 `Acc/F1` 与分类型准确率；同时补充错误案例分析（高词面重叠但语义冲突），以刻画模型鲁棒性短板。

### 7. 论文呈现建议
建议在 simulation 部分给出三张核心表：  
- **Table 1**：Baseline 与预训练模型的主结果；  
- **Table 2**：BERT 阶段性训练前后对比；  
- **Table 3**：8 类鲁棒性分项结果与典型错误案例。  

如果你愿意，我可以下一步直接给你一版“论文风格（更正式、可投稿）”的中英文双语 simulation 小节。

In [2]:
# 基础依赖与全局配置
from __future__ import annotations

import os
import re
import random
from typing import Dict, List, Tuple

import numpy as np
import pandas as pd

import torch
from datasets import load_dataset

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.metrics.pairwise import cosine_similarity

from scipy.stats import pearsonr, spearmanr

SEED = 42
DEBUG_SUBSET = True  # True: 跑通流程；False: 全量训练/评估
MAX_DEBUG_TRAIN = 2000
MAX_DEBUG_EVAL = 500

# 训练轮数（可按机器性能调整）
BERT_NUM_EPOCHS = 1 if DEBUG_SUBSET else 3
BERT_STSB_NUM_EPOCHS = 1 if DEBUG_SUBSET else 3
SBERT_NUM_EPOCHS = 1 if DEBUG_SUBSET else 2

PROJET_DIR = "/Users/jinzhuoyuan/King/Saclay/Course/HoNLP/projet"
MODELS_DIR = os.path.join(PROJET_DIR, "models")
EMB_PATH = "/Users/jinzhuoyuan/King/Saclay/Course/HoNLP/Devoir/4/enwiki-50k_100d.txt"

os.environ.setdefault("PYTORCH_ENABLE_MPS_FALLBACK", "1")


def get_best_device() -> str:
    if torch.cuda.is_available():
        return "cuda"
    if hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
        return "mps"
    return "cpu"


DEVICE = get_best_device()

random.seed(SEED)
np.random.seed(SEED)
pd.set_option("display.max_colwidth", 120)

print("torch:", torch.__version__, "| device:", DEVICE)

torch: 2.10.0 | device: mps


In [3]:
# 1. 数据加载与预处理
_re_non_alnum = re.compile(r"[^a-z0-9\s]")
_re_ws = re.compile(r"\s+")


def normalize_text(text: str | None) -> str:
    if text is None:
        return ""
    text = text.lower()
    text = _re_non_alnum.sub(" ", text)
    text = _re_ws.sub(" ", text).strip()
    return text


def split_80_10_10(df: pd.DataFrame, seed: int = 42) -> tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    train_df, tmp_df = train_test_split(df, test_size=0.2, random_state=seed, shuffle=True)
    val_df, test_df = train_test_split(tmp_df, test_size=0.5, random_state=seed, shuffle=True)
    return train_df.reset_index(drop=True), val_df.reset_index(drop=True), test_df.reset_index(drop=True)


def to_pair_df(ds_split, task: str) -> pd.DataFrame:
    if task == "qqp":
        df = pd.DataFrame({
            "s1": [normalize_text(x) for x in ds_split["question1"]],
            "s2": [normalize_text(x) for x in ds_split["question2"]],
            "label": ds_split["label"],
        })
        df = df.dropna(subset=["label"])
        df["label"] = df["label"].astype(int)
        return df

    if task == "stsb":
        df = pd.DataFrame({
            "s1": [normalize_text(x) for x in ds_split["sentence1"]],
            "s2": [normalize_text(x) for x in ds_split["sentence2"]],
            "label": ds_split["label"],
        })
        df = df.dropna(subset=["label"])
        df["label"] = (df["label"].astype(float) / 5.0).clip(0.0, 1.0)
        return df

    raise ValueError(f"unknown task: {task}")


qqp_raw = load_dataset("glue", "qqp")
stsb_raw = load_dataset("glue", "stsb")

qqp_all = pd.concat(
    [to_pair_df(qqp_raw["train"], "qqp"), to_pair_df(qqp_raw["validation"], "qqp")],
    ignore_index=True,
)
stsb_all = pd.concat(
    [to_pair_df(stsb_raw["train"], "stsb"), to_pair_df(stsb_raw["validation"], "stsb")],
    ignore_index=True,
)

qqp_train, qqp_val, qqp_test = split_80_10_10(qqp_all, seed=SEED)
stsb_train, stsb_val, stsb_test = split_80_10_10(stsb_all, seed=SEED)

if DEBUG_SUBSET:
    qqp_train = qqp_train.head(MAX_DEBUG_TRAIN)
    qqp_val = qqp_val.head(MAX_DEBUG_EVAL)
    qqp_test = qqp_test.head(MAX_DEBUG_EVAL)
    stsb_train = stsb_train.head(MAX_DEBUG_TRAIN)
    stsb_val = stsb_val.head(MAX_DEBUG_EVAL)
    stsb_test = stsb_test.head(MAX_DEBUG_EVAL)

print("QQP splits:", len(qqp_train), len(qqp_val), len(qqp_test))
print("STS-B splits:", len(stsb_train), len(stsb_val), len(stsb_test))
display(qqp_train.head(3))
display(stsb_train.head(3))



QQP splits: 2000 500 500
STS-B splits: 2000 500 500


Unnamed: 0,s1,s2,label
0,when is the right time to start a startup,when is it the right time to stop your startup,0
1,do presidents have to go through a security clearance check if so how would trump pass and maintain,what level of national security clearance if any did trump have to pass to be briefed by the government as a candida...,1
2,how long does it take to code a simple app,what are the best simple note taking web apps,0


Unnamed: 0,s1,s2,label
0,the brown dog is running along a grassy pathway,a brown dog is running along a grassy stretch divided by strings,0.68
1,a man is playing guitar,smeone is laying down,0.0
2,by late afternoon the dow jones industrial average was up 12 81 or 0 1 percent at 9 331 77 having gained 201 points ...,in morning trading the dow jones industrial average was down 8 76 or 0 1 percent at 9 310 20 having gained 201 point...,0.25


## 2. Baselines: Lexical + Static Embedding

- Lexical Baseline: TF-IDF + Logistic Regression (QQP), TF-IDF cosine (STS-B)
- Static Embedding Baseline: 词向量池化 + cosine，相似度映射到 [0,1]

In [4]:
# Baseline 实现：TF-IDF + 静态词向量池化
from gensim.models import KeyedVectors

def _load_word2vec(path: str, max_words: int = 50_000):
    kv = KeyedVectors.load_word2vec_format(
        path,
        binary=False,
        no_header=True,
        limit=max_words,
    )
    return kv, int(kv.vector_size)


def _sent_embedding(text: str, kv: KeyedVectors, dim: int, pooling: str = "mean") -> np.ndarray:
    toks = [t for t in text.split() if t in kv]
    if not toks:
        return np.zeros((dim,), dtype=np.float32)
    mat = np.stack([kv.get_vector(t) for t in toks], axis=0)
    if pooling == "mean":
        return mat.mean(axis=0)
    if pooling == "max":
        return mat.max(axis=0)
    raise ValueError(f"unknown pooling: {pooling}")


def _cosine(a: np.ndarray, b: np.ndarray) -> float:
    denom = (np.linalg.norm(a) * np.linalg.norm(b))
    if denom == 0.0:
        return 0.0
    return float(np.dot(a, b) / denom)


def _pick_best_threshold(y_true: np.ndarray, y_score: np.ndarray) -> tuple[float, float]:
    ts = np.linspace(0.0, 1.0, 101)
    best_t, best_f1 = 0.5, -1.0
    for t in ts:
        pred = (y_score >= t).astype(int)
        cur = f1_score(y_true, pred)
        if cur > best_f1:
            best_f1 = float(cur)
            best_t = float(t)
    return best_t, best_f1


def train_tfidf_and_static_baselines(
    qqp_train: pd.DataFrame,
    qqp_val: pd.DataFrame,
    qqp_test: pd.DataFrame,
    stsb_train: pd.DataFrame,
    stsb_val: pd.DataFrame,
    stsb_test: pd.DataFrame,
    emb_path: str = EMB_PATH,
    max_words: int = 50_000,
    pooling: str = "mean",
):
    # Lexical TF-IDF (QQP cosine + threshold)
    tfidf_cls = TfidfVectorizer(ngram_range=(1, 2), min_df=2, max_features=200_000)
    tfidf_cls.fit(pd.concat([qqp_train["s1"], qqp_train["s2"]], ignore_index=True).tolist())

    def lexical_predict_qqp(df: pd.DataFrame) -> np.ndarray:
        A = tfidf_cls.transform(df["s1"].tolist())
        B = tfidf_cls.transform(df["s2"].tolist())
        sims = np.array([cosine_similarity(A[i], B[i])[0, 0] for i in range(A.shape[0])], dtype=np.float32)
        return sims.clip(0.0, 1.0)

    lex_val_score = lexical_predict_qqp(qqp_val)
    lex_test_score = lexical_predict_qqp(qqp_test)
    lex_t, _ = _pick_best_threshold(qqp_val["label"].values.astype(int), lex_val_score)
    lex_val_pred_cls = (lex_val_score >= lex_t).astype(int)
    lex_test_pred_cls = (lex_test_score >= lex_t).astype(int)
    lex_qqp_metrics = {
        "qqp_val_acc": float(accuracy_score(qqp_val["label"], lex_val_pred_cls)),
        "qqp_val_f1": float(f1_score(qqp_val["label"], lex_val_pred_cls)),
        "qqp_test_acc": float(accuracy_score(qqp_test["label"], lex_test_pred_cls)),
        "qqp_test_f1": float(f1_score(qqp_test["label"], lex_test_pred_cls)),
        "threshold": float(lex_t),
    }

    # Lexical TF-IDF (STS-B cosine)
    tfidf_sts = TfidfVectorizer(ngram_range=(1, 2), min_df=2, max_features=200_000)
    tfidf_sts.fit(pd.concat([stsb_train["s1"], stsb_train["s2"]], ignore_index=True).tolist())

    def lexical_predict_stsb(df: pd.DataFrame) -> np.ndarray:
        A = tfidf_sts.transform(df["s1"].tolist())
        B = tfidf_sts.transform(df["s2"].tolist())
        sims = np.array([cosine_similarity(A[i], B[i])[0, 0] for i in range(A.shape[0])], dtype=np.float32)
        return sims.clip(0.0, 1.0)

    lex_val_pred_sts = lexical_predict_stsb(stsb_val)
    lex_test_pred_sts = lexical_predict_stsb(stsb_test)
    lex_stsb_metrics = {
        "stsb_val_pearson": float(pearsonr(stsb_val["label"].values, lex_val_pred_sts).statistic),
        "stsb_val_spearman": float(spearmanr(stsb_val["label"].values, lex_val_pred_sts).statistic),
        "stsb_test_pearson": float(pearsonr(stsb_test["label"].values, lex_test_pred_sts).statistic),
        "stsb_test_spearman": float(spearmanr(stsb_test["label"].values, lex_test_pred_sts).statistic),
    }

    # Static Embedding baseline (Word2Vec format)
    kv, dim = _load_word2vec(emb_path, max_words=max_words)

    def _static_cosine_scores(df: pd.DataFrame) -> np.ndarray:
        scores = np.zeros((len(df),), dtype=np.float32)
        for i, (s1, s2) in enumerate(zip(df["s1"].tolist(), df["s2"].tolist())):
            e1 = _sent_embedding(s1, kv, dim, pooling=pooling)
            e2 = _sent_embedding(s2, kv, dim, pooling=pooling)
            scores[i] = (_cosine(e1, e2) + 1.0) / 2.0
        return scores.clip(0.0, 1.0)

    static_stsb_val = _static_cosine_scores(stsb_val)
    static_stsb_test = _static_cosine_scores(stsb_test)
    static_stsb_metrics = {
        "stsb_val_pearson": float(pearsonr(stsb_val["label"].values, static_stsb_val).statistic),
        "stsb_val_spearman": float(spearmanr(stsb_val["label"].values, static_stsb_val).statistic),
        "stsb_test_pearson": float(pearsonr(stsb_test["label"].values, static_stsb_test).statistic),
        "stsb_test_spearman": float(spearmanr(stsb_test["label"].values, static_stsb_test).statistic),
    }

    static_val_score = _static_cosine_scores(qqp_val)
    static_test_score = _static_cosine_scores(qqp_test)
    static_t, _ = _pick_best_threshold(qqp_val["label"].values.astype(int), static_val_score)
    static_val_pred = (static_val_score >= static_t).astype(int)
    static_test_pred = (static_test_score >= static_t).astype(int)
    static_qqp_metrics = {
        "qqp_val_acc": float(accuracy_score(qqp_val["label"], static_val_pred)),
        "qqp_val_f1": float(f1_score(qqp_val["label"], static_val_pred)),
        "qqp_test_acc": float(accuracy_score(qqp_test["label"], static_test_pred)),
        "qqp_test_f1": float(f1_score(qqp_test["label"], static_test_pred)),
        "threshold": float(static_t),
    }

    def static_predict_qqp(df: pd.DataFrame) -> np.ndarray:
        return _static_cosine_scores(df)

    def static_predict_stsb(df: pd.DataFrame) -> np.ndarray:
        return _static_cosine_scores(df)

    return {
        "lexical": {
            "qqp_metrics": lex_qqp_metrics,
            "stsb_metrics": lex_stsb_metrics,
            "predict_qqp": lexical_predict_qqp,
            "predict_stsb": lexical_predict_stsb,
            "stsb_val_pred": lex_val_pred_sts,
            "stsb_test_pred": lex_test_pred_sts,
        },
        "static": {
            "qqp_metrics": static_qqp_metrics,
            "stsb_metrics": static_stsb_metrics,
            "predict_qqp": static_predict_qqp,
            "predict_stsb": static_predict_stsb,
            "stsb_val_pred": static_stsb_val,
            "stsb_test_pred": static_stsb_test,
        },
    }

In [5]:
baseline_bundle = train_tfidf_and_static_baselines(
    qqp_train=qqp_train,
    qqp_val=qqp_val,
    qqp_test=qqp_test,
    stsb_train=stsb_train,
    stsb_val=stsb_val,
    stsb_test=stsb_test,
    emb_path=EMB_PATH,
    max_words=50_000,
    pooling="mean",
)

lex_qqp_metrics = baseline_bundle["lexical"]["qqp_metrics"]
lex_stsb_metrics = baseline_bundle["lexical"]["stsb_metrics"]
static_qqp_metrics = baseline_bundle["static"]["qqp_metrics"]
static_stsb_metrics = baseline_bundle["static"]["stsb_metrics"]

lexical_predict_qqp = baseline_bundle["lexical"]["predict_qqp"]
lexical_predict_stsb = baseline_bundle["lexical"]["predict_stsb"]
static_predict_qqp = baseline_bundle["static"]["predict_qqp"]
static_predict_stsb = baseline_bundle["static"]["predict_stsb"]

lex_stsb_test_pred = baseline_bundle["lexical"]["stsb_test_pred"]
static_stsb_test_pred = baseline_bundle["static"]["stsb_test_pred"]

print("Lexical metrics:", {**lex_qqp_metrics, **lex_stsb_metrics})
print("Static embed metrics:", {**static_qqp_metrics, **static_stsb_metrics})

Lexical metrics: {'qqp_val_acc': 0.636, 'qqp_val_f1': 0.6726618705035972, 'qqp_test_acc': 0.604, 'qqp_test_f1': 0.611764705882353, 'threshold': 0.24, 'stsb_val_pearson': 0.6378415042480152, 'stsb_val_spearman': 0.6196225048578498, 'stsb_test_pearson': 0.5649861286035803, 'stsb_test_spearman': 0.5581035202027839}
Static embed metrics: {'qqp_val_acc': 0.632, 'qqp_val_f1': 0.642023346303502, 'qqp_test_acc': 0.634, 'qqp_test_f1': 0.6163522012578616, 'threshold': 0.96, 'stsb_val_pearson': 0.6134472130305588, 'stsb_val_spearman': 0.6238302416623481, 'stsb_test_pearson': 0.6351538419133075, 'stsb_test_spearman': 0.6277128213767955}


## 3. 预训练模型快速探索（Zero-shot）

在统一输入协议下，快速对比 BERT / SBERT 的初始表现。

In [6]:
# Zero-shot BERT / SBERT (multi-model comparison)
import numpy as np
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from sentence_transformers import SentenceTransformer

# Cross-encoder models (classification head is random in zero-shot)
BERT_ZS_MODELS = [
    "bert-base-uncased",
    "albert-base-v2",
    "roberta-base",
]

# Bi-encoder sentence-transformers (cosine similarity)
SBERT_ZS_MODELS = [
    "sentence-transformers/all-MiniLM-L6-v2",
    "sentence-transformers/all-mpnet-base-v2",
    "sentence-transformers/paraphrase-MiniLM-L6-v2",
]


def bert_zeroshot_predict_labels(model_name: str, df: pd.DataFrame, batch_size: int = 32) -> np.ndarray:
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
    model.to(DEVICE)
    model.eval()

    preds = []
    for i in range(0, len(df), batch_size):
        batch = df.iloc[i : i + batch_size]
        enc = tok(
            batch["s1"].tolist(),
            batch["s2"].tolist(),
            truncation=True,
            padding=True,
            max_length=128,
            return_tensors="pt",
        )
        enc = {k: v.to(DEVICE) for k, v in enc.items()}
        with torch.no_grad():
            logits = model(**enc).logits
            preds.append(torch.argmax(logits, dim=-1).detach().cpu().numpy())
    return np.concatenate(preds, axis=0)


def sbert_zeroshot_scores(model_name: str, df: pd.DataFrame, batch_size: int = 64) -> np.ndarray:
    sbert = SentenceTransformer(model_name, device=DEVICE)
    emb1 = sbert.encode(df["s1"].tolist(), batch_size=batch_size, normalize_embeddings=True, show_progress_bar=False)
    emb2 = sbert.encode(df["s2"].tolist(), batch_size=batch_size, normalize_embeddings=True, show_progress_bar=False)
    scores = (emb1 * emb2).sum(axis=1)
    return ((scores + 1.0) / 2.0).clip(0.0, 1.0).astype(np.float32)


qqp_rows = []
stsb_rows = []

# --- Cross-encoder zero-shot on QQP ---
for model_name in BERT_ZS_MODELS:
    val_pred = bert_zeroshot_predict_labels(model_name, qqp_val, batch_size=32)
    test_pred = bert_zeroshot_predict_labels(model_name, qqp_test, batch_size=32)

    metrics = {
        "model": model_name,
        "qqp_val_acc": float(accuracy_score(qqp_val["label"].values, val_pred)),
        "qqp_val_f1": float(f1_score(qqp_val["label"].values, val_pred)),
        "qqp_test_acc": float(accuracy_score(qqp_test["label"].values, test_pred)),
        "qqp_test_f1": float(f1_score(qqp_test["label"].values, test_pred)),
    }
    qqp_rows.append(metrics)

# Keep the original single-model metrics for downstream summary
bert_qqp_zeroshot_metrics = next((r for r in qqp_rows if r["model"] == "bert-base-uncased"), {})
print("BERT zero-shot (random head) QQP:", bert_qqp_zeroshot_metrics)

# --- SBERT zero-shot on STS-B + QQP (thresholded) ---
for model_name in SBERT_ZS_MODELS:
    stsb_val_scores = sbert_zeroshot_scores(model_name, stsb_val)
    stsb_test_scores = sbert_zeroshot_scores(model_name, stsb_test)

    stsb_metrics = {
        "model": model_name,
        "stsb_val_pearson": float(pearsonr(stsb_val["label"].values, stsb_val_scores).statistic),
        "stsb_val_spearman": float(spearmanr(stsb_val["label"].values, stsb_val_scores).statistic),
        "stsb_test_pearson": float(pearsonr(stsb_test["label"].values, stsb_test_scores).statistic),
        "stsb_test_spearman": float(spearmanr(stsb_test["label"].values, stsb_test_scores).statistic),
    }
    stsb_rows.append(stsb_metrics)

    qqp_val_scores = sbert_zeroshot_scores(model_name, qqp_val)
    qqp_test_scores = sbert_zeroshot_scores(model_name, qqp_test)

    thresholds = np.linspace(0.0, 1.0, 101)
    best_t = 0.5
    best_f1 = -1.0
    y_val = qqp_val["label"].values.astype(int)
    for t in thresholds:
        pred = (qqp_val_scores >= t).astype(int)
        f = f1_score(y_val, pred)
        if f > best_f1:
            best_f1 = float(f)
            best_t = float(t)

    y_test = qqp_test["label"].values.astype(int)
    qqp_val_pred = (qqp_val_scores >= best_t).astype(int)
    qqp_test_pred = (qqp_test_scores >= best_t).astype(int)

    qqp_rows.append({
        "model": model_name,
        "qqp_val_acc": float(accuracy_score(y_val, qqp_val_pred)),
        "qqp_val_f1": float(f1_score(y_val, qqp_val_pred)),
        "qqp_test_acc": float(accuracy_score(y_test, qqp_test_pred)),
        "qqp_test_f1": float(f1_score(y_test, qqp_test_pred)),
        "threshold": best_t,
    })

# Keep the original single-model metrics for downstream summary
sbert_stsb_zeroshot_metrics = next((r for r in stsb_rows if r["model"] == "sentence-transformers/all-MiniLM-L6-v2"), {})
sbert_qqp_zeroshot_metrics = next((r for r in qqp_rows if r["model"] == "sentence-transformers/all-MiniLM-L6-v2"), {})
print("SBERT zero-shot STS-B:", sbert_stsb_zeroshot_metrics)
print("SBERT zero-shot QQP (thresholded):", sbert_qqp_zeroshot_metrics)

# --- Comparison tables ---
qqp_zero_df = pd.DataFrame(qqp_rows)
qqp_zero_df = qqp_zero_df.sort_values(["qqp_test_f1", "qqp_test_acc"], ascending=[False, False]).reset_index(drop=True)

stsb_zero_df = pd.DataFrame(stsb_rows)
stsb_zero_df = stsb_zero_df.sort_values(["stsb_test_pearson", "stsb_test_spearman"], ascending=[False, False]).reset_index(drop=True)

print("\nZero-shot QQP comparison:")
display(qqp_zero_df)

print("\nZero-shot STS-B comparison:")
display(stsb_zero_df)

Loading weights:   0%|          | 0/199 [00:00<?, ?it/s]

BertForSequenceClassification LOAD REPORT from: bert-base-uncased
Key                                        | Status     | 
-------------------------------------------+------------+-
cls.predictions.transform.LayerNorm.weight | UNEXPECTED | 
cls.predictions.transform.LayerNorm.bias   | UNEXPECTED | 
cls.predictions.transform.dense.weight     | UNEXPECTED | 
cls.seq_relationship.bias                  | UNEXPECTED | 
cls.seq_relationship.weight                | UNEXPECTED | 
cls.predictions.bias                       | UNEXPECTED | 
cls.predictions.transform.dense.bias       | UNEXPECTED | 
classifier.weight                          | MISSING    | 
classifier.bias                            | MISSING    | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.
- MISSING	:those params were newly initialized because missing from the checkpoint. Consider training on your downstream task.


Loading weights:   0%|          | 0/199 [00:00<?, ?it/s]

BertForSequenceClassification LOAD REPORT from: bert-base-uncased
Key                                        | Status     | 
-------------------------------------------+------------+-
cls.predictions.transform.LayerNorm.weight | UNEXPECTED | 
cls.predictions.transform.LayerNorm.bias   | UNEXPECTED | 
cls.predictions.transform.dense.weight     | UNEXPECTED | 
cls.seq_relationship.bias                  | UNEXPECTED | 
cls.seq_relationship.weight                | UNEXPECTED | 
cls.predictions.bias                       | UNEXPECTED | 
cls.predictions.transform.dense.bias       | UNEXPECTED | 
classifier.weight                          | MISSING    | 
classifier.bias                            | MISSING    | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.
- MISSING	:those params were newly initialized because missing from the checkpoint. Consider training on your downstream task.


config.json:   0%|          | 0.00/684 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/25.0 [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/760k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.31M [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/47.4M [00:00<?, ?B/s]

Loading weights:   0%|          | 0/25 [00:00<?, ?it/s]

AlbertForSequenceClassification LOAD REPORT from: albert-base-v2
Key                          | Status     | 
-----------------------------+------------+-
predictions.decoder.bias     | UNEXPECTED | 
predictions.dense.weight     | UNEXPECTED | 
predictions.dense.bias       | UNEXPECTED | 
predictions.LayerNorm.weight | UNEXPECTED | 
predictions.LayerNorm.bias   | UNEXPECTED | 
predictions.bias             | UNEXPECTED | 
classifier.weight            | MISSING    | 
classifier.bias              | MISSING    | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.
- MISSING	:those params were newly initialized because missing from the checkpoint. Consider training on your downstream task.


Loading weights:   0%|          | 0/25 [00:00<?, ?it/s]

AlbertForSequenceClassification LOAD REPORT from: albert-base-v2
Key                          | Status     | 
-----------------------------+------------+-
predictions.decoder.bias     | UNEXPECTED | 
predictions.dense.weight     | UNEXPECTED | 
predictions.dense.bias       | UNEXPECTED | 
predictions.LayerNorm.weight | UNEXPECTED | 
predictions.LayerNorm.bias   | UNEXPECTED | 
predictions.bias             | UNEXPECTED | 
classifier.weight            | MISSING    | 
classifier.bias              | MISSING    | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.
- MISSING	:those params were newly initialized because missing from the checkpoint. Consider training on your downstream task.


config.json:   0%|          | 0.00/481 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/25.0 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/499M [00:00<?, ?B/s]

Loading weights:   0%|          | 0/197 [00:00<?, ?it/s]

RobertaForSequenceClassification LOAD REPORT from: roberta-base
Key                             | Status     | 
--------------------------------+------------+-
lm_head.layer_norm.bias         | UNEXPECTED | 
lm_head.bias                    | UNEXPECTED | 
lm_head.dense.weight            | UNEXPECTED | 
lm_head.layer_norm.weight       | UNEXPECTED | 
lm_head.dense.bias              | UNEXPECTED | 
roberta.embeddings.position_ids | UNEXPECTED | 
classifier.dense.weight         | MISSING    | 
classifier.out_proj.weight      | MISSING    | 
classifier.dense.bias           | MISSING    | 
classifier.out_proj.bias        | MISSING    | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.
- MISSING	:those params were newly initialized because missing from the checkpoint. Consider training on your downstream task.


Loading weights:   0%|          | 0/197 [00:00<?, ?it/s]

RobertaForSequenceClassification LOAD REPORT from: roberta-base
Key                             | Status     | 
--------------------------------+------------+-
lm_head.layer_norm.bias         | UNEXPECTED | 
lm_head.bias                    | UNEXPECTED | 
lm_head.dense.weight            | UNEXPECTED | 
lm_head.layer_norm.weight       | UNEXPECTED | 
lm_head.dense.bias              | UNEXPECTED | 
roberta.embeddings.position_ids | UNEXPECTED | 
classifier.dense.weight         | MISSING    | 
classifier.out_proj.weight      | MISSING    | 
classifier.dense.bias           | MISSING    | 
classifier.out_proj.bias        | MISSING    | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.
- MISSING	:those params were newly initialized because missing from the checkpoint. Consider training on your downstream task.


BERT zero-shot (random head) QQP: {'model': 'bert-base-uncased', 'qqp_val_acc': 0.43, 'qqp_val_f1': 0.5802650957290133, 'qqp_test_acc': 0.376, 'qqp_test_f1': 0.5343283582089552}


Loading weights:   0%|          | 0/103 [00:00<?, ?it/s]

BertModel LOAD REPORT from: sentence-transformers/all-MiniLM-L6-v2
Key                     | Status     |  | 
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


Loading weights:   0%|          | 0/103 [00:00<?, ?it/s]

BertModel LOAD REPORT from: sentence-transformers/all-MiniLM-L6-v2
Key                     | Status     |  | 
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


Loading weights:   0%|          | 0/103 [00:00<?, ?it/s]

BertModel LOAD REPORT from: sentence-transformers/all-MiniLM-L6-v2
Key                     | Status     |  | 
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


Loading weights:   0%|          | 0/103 [00:00<?, ?it/s]

BertModel LOAD REPORT from: sentence-transformers/all-MiniLM-L6-v2
Key                     | Status     |  | 
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md: 0.00B [00:00, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/571 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/438M [00:00<?, ?B/s]

Loading weights:   0%|          | 0/199 [00:00<?, ?it/s]

MPNetModel LOAD REPORT from: sentence-transformers/all-mpnet-base-v2
Key                     | Status     |  | 
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


tokenizer_config.json:   0%|          | 0.00/363 [00:00<?, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/239 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

Loading weights:   0%|          | 0/199 [00:00<?, ?it/s]

MPNetModel LOAD REPORT from: sentence-transformers/all-mpnet-base-v2
Key                     | Status     |  | 
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


Loading weights:   0%|          | 0/199 [00:00<?, ?it/s]

MPNetModel LOAD REPORT from: sentence-transformers/all-mpnet-base-v2
Key                     | Status     |  | 
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


Loading weights:   0%|          | 0/199 [00:00<?, ?it/s]

MPNetModel LOAD REPORT from: sentence-transformers/all-mpnet-base-v2
Key                     | Status     |  | 
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


modules.json:   0%|          | 0.00/229 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/122 [00:00<?, ?B/s]

README.md: 0.00B [00:00, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/629 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

Loading weights:   0%|          | 0/103 [00:00<?, ?it/s]

BertModel LOAD REPORT from: sentence-transformers/paraphrase-MiniLM-L6-v2
Key                     | Status     |  | 
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


tokenizer_config.json:   0%|          | 0.00/314 [00:00<?, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

Loading weights:   0%|          | 0/103 [00:00<?, ?it/s]

BertModel LOAD REPORT from: sentence-transformers/paraphrase-MiniLM-L6-v2
Key                     | Status     |  | 
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


Loading weights:   0%|          | 0/103 [00:00<?, ?it/s]

BertModel LOAD REPORT from: sentence-transformers/paraphrase-MiniLM-L6-v2
Key                     | Status     |  | 
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


Loading weights:   0%|          | 0/103 [00:00<?, ?it/s]

BertModel LOAD REPORT from: sentence-transformers/paraphrase-MiniLM-L6-v2
Key                     | Status     |  | 
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


SBERT zero-shot STS-B: {'model': 'sentence-transformers/all-MiniLM-L6-v2', 'stsb_val_pearson': 0.8408758100036464, 'stsb_val_spearman': 0.8198952084381534, 'stsb_test_pearson': 0.8656866930684459, 'stsb_test_spearman': 0.8385417477204856}
SBERT zero-shot QQP (thresholded): {'model': 'sentence-transformers/all-MiniLM-L6-v2', 'qqp_val_acc': 0.778, 'qqp_val_f1': 0.7692307692307693, 'qqp_test_acc': 0.736, 'qqp_test_f1': 0.7130434782608696, 'threshold': 0.84}

Zero-shot QQP comparison:


Unnamed: 0,model,qqp_val_acc,qqp_val_f1,qqp_test_acc,qqp_test_f1,threshold
0,sentence-transformers/all-mpnet-base-v2,0.82,0.780488,0.814,0.753316,0.89
1,sentence-transformers/paraphrase-MiniLM-L6-v2,0.778,0.769231,0.738,0.719486,0.85
2,sentence-transformers/all-MiniLM-L6-v2,0.778,0.769231,0.736,0.713043,0.84
3,bert-base-uncased,0.43,0.580265,0.376,0.534328,
4,albert-base-v2,0.576,0.258741,0.644,0.011111,
5,roberta-base,0.394,0.56528,0.642,0.0,



Zero-shot STS-B comparison:


Unnamed: 0,model,stsb_val_pearson,stsb_val_spearman,stsb_test_pearson,stsb_test_spearman
0,sentence-transformers/all-MiniLM-L6-v2,0.840876,0.819895,0.865687,0.838542
1,sentence-transformers/all-mpnet-base-v2,0.869383,0.852568,0.865659,0.838948
2,sentence-transformers/paraphrase-MiniLM-L6-v2,0.851478,0.829636,0.863941,0.841344


## 4. 阶段性训练（BERT）

- Phase A: QQP 二分类（Cross-Encoder）
- Phase B: QQP → STS-B 回归迁移
- 对照：直接在 STS-B 上训练（无 QQP 迁移）

In [None]:
# BERT 直接训练 STS-B（Cross-Encoder Regression）
import inspect
import transformers
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer

print("transformers version:", transformers.__version__)
print("DEVICE:", DEVICE)
print("BERT_STSB_NUM_EPOCHS:", BERT_STSB_NUM_EPOCHS)

bert_stsb_tok = AutoTokenizer.from_pretrained("bert-base-uncased")


def tokenize_stsb_direct(batch):
    return bert_stsb_tok(batch["s1"], batch["s2"], truncation=True, padding="max_length", max_length=128)


def df_to_hf_reg_direct(df: pd.DataFrame):
    from datasets import Dataset

    ds = Dataset.from_pandas(df[["s1", "s2", "label"]], preserve_index=False)
    ds = ds.map(tokenize_stsb_direct, batched=True).rename_column("label", "labels")
    cols = ["input_ids", "attention_mask", "labels"]
    if "token_type_ids" in ds.column_names:
        cols.insert(2, "token_type_ids")
    ds.set_format(type="torch", columns=cols)
    return ds


stsb_train_ds_direct = df_to_hf_reg_direct(stsb_train)
stsb_val_ds_direct = df_to_hf_reg_direct(stsb_val)
stsb_test_ds_direct = df_to_hf_reg_direct(stsb_test)

bert_stsb_direct = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=1)
bert_stsb_direct.config.problem_type = "regression"


def compute_stsb_reg_metrics(eval_pred):
    preds, labels = eval_pred
    preds = np.asarray(preds).reshape(-1)
    labels = np.asarray(labels).reshape(-1)
    preds = np.clip(preds, 0.0, 1.0)
    return {
        "pearson": float(pearsonr(labels, preds).statistic),
        "spearman": float(spearmanr(labels, preds).statistic),
    }


ta_sig = inspect.signature(TrainingArguments)
ta_kwargs = dict(
    output_dir=os.path.join(MODELS_DIR, "bert_stsb_direct"),
    learning_rate=1e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=BERT_STSB_NUM_EPOCHS,
    weight_decay=0.01,
    save_strategy="no",
    logging_steps=50,
    seed=SEED,
)

if "evaluation_strategy" in ta_sig.parameters:
    ta_kwargs["evaluation_strategy"] = "epoch"
elif "eval_strategy" in ta_sig.parameters:
    ta_kwargs["eval_strategy"] = "epoch"

if "report_to" in ta_sig.parameters:
    ta_kwargs["report_to"] = "none"

if "dataloader_pin_memory" in ta_sig.parameters:
    ta_kwargs["dataloader_pin_memory"] = False

if "use_mps_device" in ta_sig.parameters:
    ta_kwargs["use_mps_device"] = (DEVICE == "mps")

bert_stsb_direct_args = TrainingArguments(**ta_kwargs)

tr_sig = inspect.signature(Trainer)
trainer_kwargs = dict(
    model=bert_stsb_direct,
    args=bert_stsb_direct_args,
    train_dataset=stsb_train_ds_direct,
    eval_dataset=stsb_val_ds_direct,
    compute_metrics=compute_stsb_reg_metrics,
)
if "tokenizer" in tr_sig.parameters:
    trainer_kwargs["tokenizer"] = bert_stsb_tok
elif "processing_class" in tr_sig.parameters:
    trainer_kwargs["processing_class"] = bert_stsb_tok

bert_stsb_direct_trainer = Trainer(**trainer_kwargs)

bert_stsb_direct_trainer.train()

val_eval = bert_stsb_direct_trainer.evaluate(stsb_val_ds_direct)
test_eval = bert_stsb_direct_trainer.evaluate(stsb_test_ds_direct)

_val_pred = bert_stsb_direct_trainer.predict(stsb_val_ds_direct).predictions.reshape(-1)
_test_pred = bert_stsb_direct_trainer.predict(stsb_test_ds_direct).predictions.reshape(-1)
bert_stsb_direct_val_pred = np.clip(_val_pred, 0.0, 1.0).astype(np.float32)
bert_stsb_direct_test_pred = np.clip(_test_pred, 0.0, 1.0).astype(np.float32)

bert_stsb_direct_metrics = {
    "stsb_val_pearson": float(val_eval.get("eval_pearson", np.nan)),
    "stsb_val_spearman": float(val_eval.get("eval_spearman", np.nan)),
    "stsb_test_pearson": float(test_eval.get("eval_pearson", np.nan)),
    "stsb_test_spearman": float(test_eval.get("eval_spearman", np.nan)),
}
print("BERT direct STS-B:", bert_stsb_direct_metrics)

bert_stsb_direct_trainer.save_model(bert_stsb_direct_args.output_dir)
bert_stsb_tok.save_pretrained(bert_stsb_direct_args.output_dir)
print("Saved BERT direct STS-B model to:", bert_stsb_direct_args.output_dir)

In [None]:
# Phase A: BERT Cross-Encoder 微调 (QQP classification)
import evaluate

print("DEVICE:", DEVICE)
print("BERT_NUM_EPOCHS:", BERT_NUM_EPOCHS)

qqp_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")


def tokenize_qqp(batch):
    return qqp_tokenizer(batch["s1"], batch["s2"], truncation=True, padding="max_length", max_length=128)


def df_to_hf_qqp(df: pd.DataFrame):
    from datasets import Dataset

    ds = Dataset.from_pandas(df[["s1", "s2", "label"]], preserve_index=False)
    ds = ds.map(tokenize_qqp, batched=True).rename_column("label", "labels")
    cols = ["input_ids", "attention_mask", "labels"]
    if "token_type_ids" in ds.column_names:
        cols.insert(2, "token_type_ids")
    ds.set_format(type="torch", columns=cols)
    return ds


qqp_train_ds = df_to_hf_qqp(qqp_train)
qqp_val_ds = df_to_hf_qqp(qqp_val)
qqp_test_ds = df_to_hf_qqp(qqp_test)

bert_qqp = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

acc = evaluate.load("accuracy")
f1_eval = evaluate.load("f1")


def compute_qqp_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": acc.compute(predictions=preds, references=labels)["accuracy"],
        "f1": f1_eval.compute(predictions=preds, references=labels)["f1"],
    }


ta_sig = inspect.signature(TrainingArguments)
ta_kwargs = dict(
    output_dir=os.path.join(MODELS_DIR, "bert_qqp"),
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=BERT_NUM_EPOCHS,
    weight_decay=0.01,
    save_strategy="no",
    logging_steps=50,
    seed=SEED,
)

if "evaluation_strategy" in ta_sig.parameters:
    ta_kwargs["evaluation_strategy"] = "epoch"
elif "eval_strategy" in ta_sig.parameters:
    ta_kwargs["eval_strategy"] = "epoch"

if "report_to" in ta_sig.parameters:
    ta_kwargs["report_to"] = "none"

if "dataloader_pin_memory" in ta_sig.parameters:
    ta_kwargs["dataloader_pin_memory"] = False

if "use_mps_device" in ta_sig.parameters:
    ta_kwargs["use_mps_device"] = (DEVICE == "mps")

args = TrainingArguments(**ta_kwargs)

tr_sig = inspect.signature(Trainer)
trainer_kwargs = dict(
    model=bert_qqp,
    args=args,
    train_dataset=qqp_train_ds,
    eval_dataset=qqp_val_ds,
    compute_metrics=compute_qqp_metrics,
)
if "tokenizer" in tr_sig.parameters:
    trainer_kwargs["tokenizer"] = qqp_tokenizer
elif "processing_class" in tr_sig.parameters:
    trainer_kwargs["processing_class"] = qqp_tokenizer

qqp_trainer = Trainer(**trainer_kwargs)

qqp_trainer.train()

bert_qqp_val = qqp_trainer.evaluate(qqp_val_ds)
bert_qqp_test = qqp_trainer.evaluate(qqp_test_ds)

bert_qqp_metrics = {
    "qqp_val_acc": float(bert_qqp_val["eval_accuracy"]),
    "qqp_val_f1": float(bert_qqp_val["eval_f1"]),
    "qqp_test_acc": float(bert_qqp_test["eval_accuracy"]),
    "qqp_test_f1": float(bert_qqp_test["eval_f1"]),
}
print("BERT QQP metrics:", bert_qqp_metrics)

qqp_trainer.save_model(args.output_dir)
qqp_tokenizer.save_pretrained(args.output_dir)
print("Saved QQP fine-tuned model to:", args.output_dir)

In [None]:
# Phase B: QQP → STS-B 回归迁移
import torch.nn as nn

QQP_MODEL_DIR = os.path.join(MODELS_DIR, "bert_qqp")
STSB_MODEL_DIR = os.path.join(MODELS_DIR, "bert_qqp_to_stsb_reg")

if not os.path.exists(QQP_MODEL_DIR):
    raise FileNotFoundError(
        f"QQP_MODEL_DIR not found: {QQP_MODEL_DIR}. "
        "请先运行 QQP 微调单元，或把 QQP_MODEL_DIR 改成你的 checkpoint 目录。"
    )

stsb_tokenizer = qqp_tokenizer if "qqp_tokenizer" in globals() else AutoTokenizer.from_pretrained("bert-base-uncased")


def tokenize_stsb(batch):
    return stsb_tokenizer(batch["s1"], batch["s2"], truncation=True, padding="max_length", max_length=128)


def df_to_hf_reg(df: pd.DataFrame):
    from datasets import Dataset

    ds = Dataset.from_pandas(df[["s1", "s2", "label"]], preserve_index=False)
    ds = ds.map(tokenize_stsb, batched=True).rename_column("label", "labels")
    cols = ["input_ids", "attention_mask", "labels"]
    if "token_type_ids" in ds.column_names:
        cols.insert(2, "token_type_ids")
    ds.set_format(type="torch", columns=cols)
    return ds


stsb_train_ds = df_to_hf_reg(stsb_train)
stsb_val_ds = df_to_hf_reg(stsb_val)
stsb_test_ds = df_to_hf_reg(stsb_test)

reg_model = AutoModelForSequenceClassification.from_pretrained(QQP_MODEL_DIR)

if hasattr(reg_model, "classifier") and isinstance(reg_model.classifier, nn.Linear):
    in_features = int(reg_model.classifier.in_features)
    reg_model.classifier = nn.Linear(in_features, 1)
elif hasattr(reg_model, "score") and isinstance(reg_model.score, nn.Linear):
    in_features = int(reg_model.score.in_features)
    reg_model.score = nn.Linear(in_features, 1)
else:
    raise RuntimeError("Unsupported model head: cannot find Linear classifier or score layer.")

reg_model.config.num_labels = 1
reg_model.config.problem_type = "regression"


def compute_reg_metrics(eval_pred):
    preds, labels = eval_pred
    preds = np.asarray(preds).reshape(-1)
    labels = np.asarray(labels).reshape(-1)
    preds = np.clip(preds, 0.0, 1.0)
    return {
        "pearson": float(pearsonr(labels, preds).statistic),
        "spearman": float(spearmanr(labels, preds).statistic),
    }


ta_sig = inspect.signature(TrainingArguments)
ta_kwargs = dict(
    output_dir=STSB_MODEL_DIR,
    learning_rate=1e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=BERT_STSB_NUM_EPOCHS,
    weight_decay=0.01,
    save_strategy="no",
    logging_steps=50,
    seed=SEED,
)

if "evaluation_strategy" in ta_sig.parameters:
    ta_kwargs["evaluation_strategy"] = "epoch"
elif "eval_strategy" in ta_sig.parameters:
    ta_kwargs["eval_strategy"] = "epoch"

if "report_to" in ta_sig.parameters:
    ta_kwargs["report_to"] = "none"

if "dataloader_pin_memory" in ta_sig.parameters:
    ta_kwargs["dataloader_pin_memory"] = False

if "use_mps_device" in ta_sig.parameters:
    ta_kwargs["use_mps_device"] = (DEVICE == "mps")

reg_args = TrainingArguments(**ta_kwargs)

tr_sig = inspect.signature(Trainer)
reg_trainer_kwargs = dict(
    model=reg_model,
    args=reg_args,
    train_dataset=stsb_train_ds,
    eval_dataset=stsb_val_ds,
    compute_metrics=compute_reg_metrics,
)
if "tokenizer" in tr_sig.parameters:
    reg_trainer_kwargs["tokenizer"] = stsb_tokenizer
elif "processing_class" in tr_sig.parameters:
    reg_trainer_kwargs["processing_class"] = stsb_tokenizer

reg_trainer = Trainer(**reg_trainer_kwargs)

reg_trainer.train()

stsb_eval_val = reg_trainer.evaluate(stsb_val_ds)
stsb_eval_test = reg_trainer.evaluate(stsb_test_ds)

_val_pred = reg_trainer.predict(stsb_val_ds).predictions.reshape(-1)
_test_pred = reg_trainer.predict(stsb_test_ds).predictions.reshape(-1)
bert_stsb_val_pred = np.clip(_val_pred, 0.0, 1.0).astype(np.float32)
bert_stsb_test_pred = np.clip(_test_pred, 0.0, 1.0).astype(np.float32)

bert_stsb_metrics = {
    "stsb_val_pearson": float(stsb_eval_val.get("eval_pearson", np.nan)),
    "stsb_val_spearman": float(stsb_eval_val.get("eval_spearman", np.nan)),
    "stsb_test_pearson": float(stsb_eval_test.get("eval_pearson", np.nan)),
    "stsb_test_spearman": float(stsb_eval_test.get("eval_spearman", np.nan)),
}
print("BERT QQP→STS-B metrics:", bert_stsb_metrics)

reg_trainer.save_model(reg_args.output_dir)
stsb_tokenizer.save_pretrained(reg_args.output_dir)
print("Saved QQP→STS-B regression model to:", reg_args.output_dir)

## 5. 阶段性训练（SBERT）

SBERT 使用句向量 + CosineSimilarityLoss 进行 STS-B 回归微调。

In [None]:
# SBERT STS-B 微调
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

print("DEVICE:", DEVICE)
print("SBERT_NUM_EPOCHS:", SBERT_NUM_EPOCHS)

sbert = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2", device=DEVICE)

train_examples = [InputExample(texts=[a, b], label=float(y)) for a, b, y in zip(stsb_train["s1"], stsb_train["s2"], stsb_train["label"])]
val_examples = [InputExample(texts=[a, b], label=float(y)) for a, b, y in zip(stsb_val["s1"], stsb_val["s2"], stsb_val["label"])]

train_loader = DataLoader(train_examples, shuffle=True, batch_size=32)
train_loss = losses.CosineSimilarityLoss(model=sbert)

warmup_steps = int(0.1 * len(train_loader) * SBERT_NUM_EPOCHS)

sbert_output_dir = os.path.join(MODELS_DIR, "sbert_stsb")

sbert.fit(
    train_objectives=[(train_loader, train_loss)],
    epochs=SBERT_NUM_EPOCHS,
    warmup_steps=warmup_steps,
    output_path=sbert_output_dir,
    show_progress_bar=True,
)

sbert = SentenceTransformer(sbert_output_dir, device=DEVICE)


def sbert_cosine_scores(df: pd.DataFrame) -> np.ndarray:
    emb1 = sbert.encode(df["s1"].tolist(), batch_size=64, normalize_embeddings=True, show_progress_bar=False)
    emb2 = sbert.encode(df["s2"].tolist(), batch_size=64, normalize_embeddings=True, show_progress_bar=False)
    scores = (emb1 * emb2).sum(axis=1)
    return ((scores + 1.0) / 2.0).clip(0.0, 1.0).astype(np.float32)


sbert_val_pred = sbert_cosine_scores(stsb_val)
sbert_test_pred = sbert_cosine_scores(stsb_test)

sbert_stsb_metrics = {
    "stsb_val_pearson": float(pearsonr(stsb_val["label"].values, sbert_val_pred).statistic),
    "stsb_val_spearman": float(spearmanr(stsb_val["label"].values, sbert_val_pred).statistic),
    "stsb_test_pearson": float(pearsonr(stsb_test["label"].values, sbert_test_pred).statistic),
    "stsb_test_spearman": float(spearmanr(stsb_test["label"].values, sbert_test_pred).statistic),
}
print("SBERT STS-B metrics:", sbert_stsb_metrics)

## 6. 统一评测与结果汇总

从磁盘加载已保存模型，统一在 test 上重算，保证表格可复现。

In [None]:
# 统一评测：从磁盘加载模型重算
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from sentence_transformers import SentenceTransformer

BERT_QQP_DIR = os.path.join(MODELS_DIR, "bert_qqp")
BERT_STSB_DIRECT_DIR = os.path.join(MODELS_DIR, "bert_stsb_direct")
BERT_QQP_TO_STSB_DIR = os.path.join(MODELS_DIR, "bert_qqp_to_stsb_reg")
SBERT_STSB_DIR = os.path.join(MODELS_DIR, "sbert_stsb")


def _batched_indices(n: int, batch_size: int):
    for i in range(0, n, batch_size):
        yield i, min(i + batch_size, n)


@torch.no_grad()
def eval_bert_qqp_from_dir(model_dir: str, df: pd.DataFrame, batch_size: int = 32) -> dict:
    tok = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForSequenceClassification.from_pretrained(model_dir)
    model.to(DEVICE)
    model.eval()

    y_true = df["label"].values.astype(int)
    preds = np.zeros((len(df),), dtype=np.int64)

    for i, j in _batched_indices(len(df), batch_size):
        batch = df.iloc[i:j]
        enc = tok(
            batch["s1"].tolist(),
            batch["s2"].tolist(),
            truncation=True,
            padding=True,
            max_length=128,
            return_tensors="pt",
        )
        enc = {k: v.to(DEVICE) for k, v in enc.items()}
        logits = model(**enc).logits
        preds[i:j] = torch.argmax(logits, dim=-1).detach().cpu().numpy()

    return {
        "qqp_test_acc": float(accuracy_score(y_true, preds)),
        "qqp_test_f1": float(f1_score(y_true, preds)),
    }


@torch.no_grad()
def eval_bert_stsb_reg_from_dir(model_dir: str, df: pd.DataFrame, batch_size: int = 32) -> dict:
    tok = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForSequenceClassification.from_pretrained(model_dir)
    model.to(DEVICE)
    model.eval()

    y_true = df["label"].values.astype(np.float32)
    preds = np.zeros((len(df),), dtype=np.float32)

    for i, j in _batched_indices(len(df), batch_size):
        batch = df.iloc[i:j]
        enc = tok(
            batch["s1"].tolist(),
            batch["s2"].tolist(),
            truncation=True,
            padding=True,
            max_length=128,
            return_tensors="pt",
        )
        enc = {k: v.to(DEVICE) for k, v in enc.items()}
        logits = model(**enc).logits
        p = logits.detach().cpu().numpy().reshape(-1).astype(np.float32)
        preds[i:j] = np.clip(p, 0.0, 1.0)

    return {
        "stsb_test_pearson": float(pearsonr(y_true, preds).statistic),
        "stsb_test_spearman": float(spearmanr(y_true, preds).statistic),
    }


@torch.no_grad()
def bert_reg_scores(model_dir: str, df: pd.DataFrame, batch_size: int = 32) -> np.ndarray:
    tok = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForSequenceClassification.from_pretrained(model_dir)
    model.to(DEVICE)
    model.eval()

    preds = np.zeros((len(df),), dtype=np.float32)
    for i, j in _batched_indices(len(df), batch_size):
        batch = df.iloc[i:j]
        enc = tok(
            batch["s1"].tolist(),
            batch["s2"].tolist(),
            truncation=True,
            padding=True,
            max_length=128,
            return_tensors="pt",
        )
        enc = {k: v.to(DEVICE) for k, v in enc.items()}
        logits = model(**enc).logits
        p = logits.detach().cpu().numpy().reshape(-1).astype(np.float32)
        preds[i:j] = np.clip(p, 0.0, 1.0)
    return preds


def eval_reg_as_binary_on_qqp(model_dir: str, val_df: pd.DataFrame, test_df: pd.DataFrame) -> dict:
    val_score = bert_reg_scores(model_dir, val_df)
    test_score = bert_reg_scores(model_dir, test_df)

    thresholds = np.linspace(0.0, 1.0, 101)
    best_t = 0.5
    best_f1 = -1.0
    y_val = val_df["label"].values.astype(int)
    for t in thresholds:
        pred = (val_score >= t).astype(int)
        cur = f1_score(y_val, pred)
        if cur > best_f1:
            best_f1 = float(cur)
            best_t = float(t)

    y_test = test_df["label"].values.astype(int)
    test_pred = (test_score >= best_t).astype(int)
    return {
        "qqp_test_acc": float(accuracy_score(y_test, test_pred)),
        "qqp_test_f1": float(f1_score(y_test, test_pred)),
        "threshold": float(best_t),
    }


def sbert_scores(model_dir: str, df: pd.DataFrame, batch_size: int = 64) -> np.ndarray:
    sbert_eval = SentenceTransformer(model_dir, device=DEVICE)
    emb1 = sbert_eval.encode(df["s1"].tolist(), batch_size=batch_size, normalize_embeddings=True, show_progress_bar=False)
    emb2 = sbert_eval.encode(df["s2"].tolist(), batch_size=batch_size, normalize_embeddings=True, show_progress_bar=False)
    scores = (emb1 * emb2).sum(axis=1)
    return ((scores + 1.0) / 2.0).clip(0.0, 1.0).astype(np.float32)


def eval_sbert_stsb_from_dir(model_dir: str, df: pd.DataFrame, batch_size: int = 64) -> dict:
    preds = sbert_scores(model_dir, df, batch_size=batch_size)
    y_true = df["label"].values.astype(np.float32)
    return {
        "stsb_test_pearson": float(pearsonr(y_true, preds).statistic),
        "stsb_test_spearman": float(spearmanr(y_true, preds).statistic),
    }


def eval_sbert_as_binary_on_qqp(model_dir: str, val_df: pd.DataFrame, test_df: pd.DataFrame) -> dict:
    val_score = sbert_scores(model_dir, val_df)
    test_score = sbert_scores(model_dir, test_df)

    thresholds = np.linspace(0.0, 1.0, 101)
    best_t = 0.5
    best_f1 = -1.0
    y_val = val_df["label"].values.astype(int)
    for t in thresholds:
        pred = (val_score >= t).astype(int)
        cur = f1_score(y_val, pred)
        if cur > best_f1:
            best_f1 = float(cur)
            best_t = float(t)

    y_test = test_df["label"].values.astype(int)
    test_pred = (test_score >= best_t).astype(int)
    return {
        "qqp_test_acc": float(accuracy_score(y_test, test_pred)),
        "qqp_test_f1": float(f1_score(y_test, test_pred)),
        "threshold": float(best_t),
    }


bert_qqp_metrics_disk = eval_bert_qqp_from_dir(BERT_QQP_DIR, qqp_test) if os.path.exists(BERT_QQP_DIR) else {}
bert_stsb_direct_metrics_disk = eval_bert_stsb_reg_from_dir(BERT_STSB_DIRECT_DIR, stsb_test) if os.path.exists(BERT_STSB_DIRECT_DIR) else {}
bert_stsb_metrics_disk = eval_bert_stsb_reg_from_dir(BERT_QQP_TO_STSB_DIR, stsb_test) if os.path.exists(BERT_QQP_TO_STSB_DIR) else {}

sbert_stsb_metrics_disk = eval_sbert_stsb_from_dir(SBERT_STSB_DIR, stsb_test) if os.path.exists(SBERT_STSB_DIR) else {}

bert_stsb_direct_qqp_metrics = (
    eval_reg_as_binary_on_qqp(BERT_STSB_DIRECT_DIR, qqp_val, qqp_test)
    if os.path.exists(BERT_STSB_DIRECT_DIR)
    else {}
 )

sbert_stsb_qqp_metrics = (
    eval_sbert_as_binary_on_qqp(SBERT_STSB_DIR, qqp_val, qqp_test)
    if os.path.exists(SBERT_STSB_DIR)
    else {}
 )


def _get_metrics(name: str) -> dict:
    m = globals().get(name, {})
    return m if isinstance(m, dict) else {}


def make_row(model_name: str, qqp: dict, stsb: dict) -> dict:
    return {
        "model": model_name,
        "QQP acc (test)": qqp.get("qqp_test_acc", np.nan),
        "QQP F1 (test)": qqp.get("qqp_test_f1", np.nan),
        "STS-B Pearson (test)": stsb.get("stsb_test_pearson", np.nan),
        "STS-B Spearman (test)": stsb.get("stsb_test_spearman", np.nan),
    }


qqp_rows_local = globals().get("qqp_rows", [])
albert_qqp_zeroshot_metrics = next((r for r in qqp_rows_local if r.get("model") == "albert-base-v2"), {})
roberta_qqp_zeroshot_metrics = next((r for r in qqp_rows_local if r.get("model") == "roberta-base"), {})


rows = [
    make_row("Lexical (TF-IDF)", _get_metrics("lex_qqp_metrics"), _get_metrics("lex_stsb_metrics")),
    make_row("Static Emb (enwiki-50k_100d mean)", _get_metrics("static_qqp_metrics"), _get_metrics("static_stsb_metrics")),
    make_row("BERT base (zero-shot)", _get_metrics("bert_qqp_zeroshot_metrics"), {}),
    make_row("ALBERT (zero-shot)", albert_qqp_zeroshot_metrics, {}),
    make_row("RoBERTa (zero-shot)", roberta_qqp_zeroshot_metrics, {}),
    make_row("SBERT (pretrained)", _get_metrics("sbert_qqp_zeroshot_metrics"), _get_metrics("sbert_stsb_zeroshot_metrics")),
    make_row("BERT (STS-B)", bert_stsb_direct_qqp_metrics, bert_stsb_direct_metrics_disk),
    make_row("SBERT (STS-B)", sbert_stsb_qqp_metrics, sbert_stsb_metrics_disk),
    make_row("BERT (QQP→STS-B)", bert_qqp_metrics_disk, bert_stsb_metrics_disk),
]

summary = pd.DataFrame(rows)
summary

Loading weights:   0%|          | 0/201 [00:00<?, ?it/s]

Loading weights:   0%|          | 0/201 [00:00<?, ?it/s]

Loading weights:   0%|          | 0/201 [00:00<?, ?it/s]

Loading weights:   0%|          | 0/103 [00:00<?, ?it/s]

Loading weights:   0%|          | 0/201 [00:00<?, ?it/s]

Loading weights:   0%|          | 0/201 [00:00<?, ?it/s]

Loading weights:   0%|          | 0/103 [00:00<?, ?it/s]

Loading weights:   0%|          | 0/103 [00:00<?, ?it/s]

NameError: name 'qqp_rows' is not defined

## 7. 鲁棒性分析（Stress Test）

构建 8 类语义扰动的 stress-test，dev 选阈值，test 报告 Acc/F1 与分类型准确率。

In [8]:
# Stress-test: 8 类语义扰动
from sklearn.metrics import accuracy_score, f1_score

BERT_QQP_STSB_DIR = os.path.join(MODELS_DIR, "bert_qqp_to_stsb_reg")
SBERT_STSB_DIR = os.path.join(MODELS_DIR, "sbert_stsb")

stress_pairs = [
    # 1) Negation
    {"cat": "negation", "s1": "The product is available.", "s2": "The product is available.", "label": 1},
    {"cat": "negation", "s1": "The product is available.", "s2": "The product is not available.", "label": 0},
    {"cat": "negation", "s1": "He likes coffee.", "s2": "He does not like coffee.", "label": 0},
    {"cat": "negation", "s1": "The service did not fail.", "s2": "The service was successful.", "label": 1},

    # 2) Increase/Decrease
    {"cat": "inc_dec", "s1": "Revenue increased by 10 percent.", "s2": "Revenue went up by 10 percent.", "label": 1},
    {"cat": "inc_dec", "s1": "Revenue increased by 10 percent.", "s2": "Revenue decreased by 10 percent.", "label": 0},
    {"cat": "inc_dec", "s1": "The temperature rose rapidly.", "s2": "The temperature fell rapidly.", "label": 0},
    {"cat": "inc_dec", "s1": "Sales dropped this quarter.", "s2": "Sales decreased this quarter.", "label": 1},

    # 3) Comparative Flip
    {"cat": "comparative", "s1": "This model is more accurate.", "s2": "This model is less accurate.", "label": 0},
    {"cat": "comparative", "s1": "Version A is better than version B.", "s2": "Version A outperforms version B.", "label": 1},
    {"cat": "comparative", "s1": "The new phone is cheaper.", "s2": "The new phone is more expensive.", "label": 0},
    {"cat": "comparative", "s1": "Latency is lower in system X.", "s2": "System X has lower latency.", "label": 1},

    # 4) Role Swap
    {"cat": "role_swap", "s1": "Alice gave Bob the book.", "s2": "Bob gave Alice the book.", "label": 0},
    {"cat": "role_swap", "s1": "The teacher praised the student.", "s2": "The student praised the teacher.", "label": 0},
    {"cat": "role_swap", "s1": "The nurse treated the patient.", "s2": "The patient was treated by the nurse.", "label": 1},
    {"cat": "role_swap", "s1": "Tom helped Jerry.", "s2": "Jerry was helped by Tom.", "label": 1},

    # 5) Numeric Change
    {"cat": "numeric", "s1": "The package weighs 5 kg.", "s2": "The package weighs 50 kg.", "label": 0},
    {"cat": "numeric", "s1": "The discount is 10 percent.", "s2": "The discount is 10 percent.", "label": 1},
    {"cat": "numeric", "s1": "The meeting starts at 3 PM.", "s2": "The meeting starts at 8 PM.", "label": 0},
    {"cat": "numeric", "s1": "The event is in 2024.", "s2": "The event is in 2023.", "label": 0},

    # 6) Quantifier Shift
    {"cat": "quantifier", "s1": "All customers received a refund.", "s2": "Some customers received a refund.", "label": 0},
    {"cat": "quantifier", "s1": "Every file was uploaded.", "s2": "All files were uploaded.", "label": 1},
    {"cat": "quantifier", "s1": "None of the users agreed.", "s2": "Some of the users agreed.", "label": 0},
    {"cat": "quantifier", "s1": "A few students were absent.", "s2": "Some students were absent.", "label": 1},

    # 7) Modal Shift
    {"cat": "modal", "s1": "Users must reset their password.", "s2": "Users may reset their password.", "label": 0},
    {"cat": "modal", "s1": "You should back up the data.", "s2": "You ought to back up the data.", "label": 1},
    {"cat": "modal", "s1": "Access is required to enter.", "s2": "Access is optional to enter.", "label": 0},
    {"cat": "modal", "s1": "Visitors are allowed to park here.", "s2": "Visitors may park here.", "label": 1},

    # 8) Direction Swap
    {"cat": "direction", "s1": "Flights from Paris to London are delayed.", "s2": "Flights from London to Paris are delayed.", "label": 0},
    {"cat": "direction", "s1": "The train goes from A to B.", "s2": "The train goes from B to A.", "label": 0},
    {"cat": "direction", "s1": "The package moved from room A to room B.", "s2": "The package moved from room A to room B.", "label": 1},
    {"cat": "direction", "s1": "Water flows from the tank to the pipe.", "s2": "Water goes from the tank into the pipe.", "label": 1},
]

stress_df = pd.DataFrame(stress_pairs)


def pick_best_threshold(y_true: np.ndarray, y_score: np.ndarray) -> tuple[float, float]:
    ts = np.linspace(0.0, 1.0, 101)
    best_t, best_f1 = 0.5, -1.0
    for t in ts:
        pred = (y_score >= t).astype(int)
        cur = f1_score(y_true, pred)
        if cur > best_f1:
            best_f1 = float(cur)
            best_t = float(t)
    return best_t, best_f1


def eval_binary(y_true: np.ndarray, y_score: np.ndarray, t: float) -> dict:
    pred = (y_score >= t).astype(int)
    return {
        "acc": float(accuracy_score(y_true, pred)),
        "f1": float(f1_score(y_true, pred)),
    }


def corr_pack(y_true: np.ndarray, y_score: np.ndarray) -> dict:
    return {
        "pearson": float(pearsonr(y_true, y_score).statistic),
        "spearman": float(spearmanr(y_true, y_score).statistic),
        "mae": float(np.mean(np.abs(y_true - y_score))),
    }


@torch.no_grad()
def bert_similarity_scores(model_dir: str, df: pd.DataFrame, batch_size: int = 16) -> np.ndarray:
    tok = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForSequenceClassification.from_pretrained(model_dir)
    model.to(DEVICE)
    model.eval()

    out = np.zeros((len(df),), dtype=np.float32)
    for i in range(0, len(df), batch_size):
        j = min(i + batch_size, len(df))
        batch = df.iloc[i:j]
        enc = tok(
            batch["s1"].tolist(),
            batch["s2"].tolist(),
            truncation=True,
            padding=True,
            max_length=128,
            return_tensors="pt",
        )
        enc = {k: v.to(DEVICE) for k, v in enc.items()}
        logits = model(**enc).logits
        if logits.shape[-1] == 1:
            score = torch.sigmoid(logits).reshape(-1)
        else:
            score = torch.softmax(logits, dim=-1)[:, 1]
        out[i:j] = torch.clamp(score, 0.0, 1.0).detach().cpu().numpy()
    return out


def sbert_similarity_scores(model_dir: str, df: pd.DataFrame, batch_size: int = 32) -> np.ndarray:
    sbert_eval = SentenceTransformer(model_dir, device=DEVICE)
    emb1 = sbert_eval.encode(df["s1"].tolist(), batch_size=batch_size, normalize_embeddings=True, show_progress_bar=False)
    emb2 = sbert_eval.encode(df["s2"].tolist(), batch_size=batch_size, normalize_embeddings=True, show_progress_bar=False)
    cosine = (emb1 * emb2).sum(axis=1)
    return ((cosine + 1.0) / 2.0).clip(0.0, 1.0).astype(np.float32)


dev_df, test_df = train_test_split(
    stress_df,
    test_size=0.6,
    random_state=SEED,
    stratify=stress_df["label"],
)

y_dev = dev_df["label"].values.astype(int)
y_test = test_df["label"].values.astype(int)

rows = []
per_cat_tables = []

if os.path.exists(BERT_QQP_STSB_DIR):
    bert_dev_score = bert_similarity_scores(BERT_QQP_STSB_DIR, dev_df)
    bert_test_score = bert_similarity_scores(BERT_QQP_STSB_DIR, test_df)
    bert_t, bert_dev_f1 = pick_best_threshold(y_dev, bert_dev_score)
    bert_test_metrics = eval_binary(y_test, bert_test_score, bert_t)
    bert_corr = corr_pack(y_test.astype(np.float32), bert_test_score)
    rows.append({
        "model": "BERT-QQP→STS-B",
        "threshold": round(bert_t, 3),
        "dev_f1": round(bert_dev_f1, 4),
        "test_acc": round(bert_test_metrics["acc"], 4),
        "test_f1": round(bert_test_metrics["f1"], 4),
        "pearson": round(bert_corr["pearson"], 4),
        "spearman": round(bert_corr["spearman"], 4),
        "mae": round(bert_corr["mae"], 4),
    })
    tmp = test_df.copy()
    tmp["pred"] = (bert_test_score >= bert_t).astype(int)
    tmp["ok"] = (tmp["pred"] == tmp["label"]).astype(int)
    per_cat = tmp.groupby("cat", as_index=False)["ok"].mean().rename(columns={"ok": "acc"})
    per_cat["model"] = "BERT-QQP→STS-B"
    per_cat_tables.append(per_cat)
else:
    print("WARN: missing dir", BERT_QQP_STSB_DIR)

if os.path.exists(SBERT_STSB_DIR):
    sbert_dev_score = sbert_similarity_scores(SBERT_STSB_DIR, dev_df)
    sbert_test_score = sbert_similarity_scores(SBERT_STSB_DIR, test_df)
    sbert_t, sbert_dev_f1 = pick_best_threshold(y_dev, sbert_dev_score)
    sbert_test_metrics = eval_binary(y_test, sbert_test_score, sbert_t)
    sbert_corr = corr_pack(y_test.astype(np.float32), sbert_test_score)
    rows.append({
        "model": "SBERT-STS-B",
        "threshold": round(sbert_t, 3),
        "dev_f1": round(sbert_dev_f1, 4),
        "test_acc": round(sbert_test_metrics["acc"], 4),
        "test_f1": round(sbert_test_metrics["f1"], 4),
        "pearson": round(sbert_corr["pearson"], 4),
        "spearman": round(sbert_corr["spearman"], 4),
        "mae": round(sbert_corr["mae"], 4),
    })
    tmp = test_df.copy()
    tmp["pred"] = (sbert_test_score >= sbert_t).astype(int)
    tmp["ok"] = (tmp["pred"] == tmp["label"]).astype(int)
    per_cat = tmp.groupby("cat", as_index=False)["ok"].mean().rename(columns={"ok": "acc"})
    per_cat["model"] = "SBERT-STS-B"
    per_cat_tables.append(per_cat)
else:
    print("WARN: missing dir", SBERT_STSB_DIR)

if len(rows) == 0:
    print("没有可用模型可比较，请先确保 bert_qqp_to_stsb_reg 与 sbert_stsb 存在。")
else:
    result_df = pd.DataFrame(rows).sort_values(["pearson", "spearman", "mae"], ascending=[False, False, True]).reset_index(drop=True)
    print("Binary stress test summary:")
    display(result_df)

    cat_df = pd.concat(per_cat_tables, ignore_index=True)
    print("Per-category accuracy on stress-test test split:")
    display(cat_df.pivot(index="cat", columns="model", values="acc"))

Loading weights:   0%|          | 0/201 [00:00<?, ?it/s]

Loading weights:   0%|          | 0/201 [00:00<?, ?it/s]

Loading weights:   0%|          | 0/103 [00:00<?, ?it/s]

Loading weights:   0%|          | 0/103 [00:00<?, ?it/s]

Binary stress test summary:


Unnamed: 0,model,threshold,dev_f1,test_acc,test_f1,pearson,spearman,mae
0,BERT-QQP→STS-B,0.68,0.8,0.65,0.6316,0.3419,0.3224,0.5068
1,SBERT-STS-B,0.89,0.8,0.55,0.5714,0.1997,0.2354,0.5275


Per-category accuracy on stress-test test split:


model,BERT-QQP→STS-B,SBERT-STS-B
cat,Unnamed: 1_level_1,Unnamed: 2_level_1
comparative,0.666667,0.333333
direction,1.0,0.666667
inc_dec,0.333333,0.666667
modal,1.0,0.5
negation,0.5,0.5
numeric,0.666667,0.666667
quantifier,1.0,1.0
role_swap,0.333333,0.333333


In [12]:
# Stress-test 预测结果明细（test split）

if "test_df" not in globals():
    raise RuntimeError("Stress-test not prepared. Run the stress-test cell first.")

pred_df = test_df.copy()

if "bert_test_score" not in globals() and os.path.exists(BERT_QQP_STSB_DIR):
    bert_test_score = bert_similarity_scores(BERT_QQP_STSB_DIR, test_df)

if "sbert_test_score" not in globals() and os.path.exists(SBERT_STSB_DIR):
    sbert_test_score = sbert_similarity_scores(SBERT_STSB_DIR, test_df)

if "bert_t" not in globals() and "bert_test_score" in globals():
    bert_t, _ = pick_best_threshold(y_dev, bert_dev_score)

if "sbert_t" not in globals() and "sbert_test_score" in globals():
    sbert_t, _ = pick_best_threshold(y_dev, sbert_dev_score)

if "bert_test_score" in globals() and "bert_t" in globals():
    pred_df["bert_score"] = np.round(bert_test_score, 4)
    pred_df["bert_pred"] = (bert_test_score >= bert_t).astype(int)

if "sbert_test_score" in globals() and "sbert_t" in globals():
    pred_df["sbert_score"] = np.round(sbert_test_score, 4)
    pred_df["sbert_pred"] = (sbert_test_score >= sbert_t).astype(int)
    pred_df = pred_df.sort_index()
print("Stress-test predictions (test split):")
display(pred_df)

Stress-test predictions (test split):


Unnamed: 0,cat,s1,s2,label,bert_score,bert_pred,sbert_score,sbert_pred
0,negation,The product is available.,The product is available.,1,0.6918,1,1.0,1
3,negation,The service did not fail.,The service was successful.,1,0.6442,0,0.7565,0
5,inc_dec,Revenue increased by 10 percent.,Revenue decreased by 10 percent.,0,0.6917,1,0.8741,0
6,inc_dec,The temperature rose rapidly.,The temperature fell rapidly.,0,0.6136,0,0.8339,0
7,inc_dec,Sales dropped this quarter.,Sales decreased this quarter.,1,0.6786,0,0.8882,0
8,comparative,This model is more accurate.,This model is less accurate.,0,0.672,0,0.9304,1
9,comparative,Version A is better than version B.,Version A outperforms version B.,1,0.657,0,0.8689,0
11,comparative,Latency is lower in system X.,System X has lower latency.,1,0.7116,1,0.9867,1
12,role_swap,Alice gave Bob the book.,Bob gave Alice the book.,0,0.7108,1,0.9955,1
13,role_swap,The teacher praised the student.,The student praised the teacher.,0,0.7131,1,0.9957,1


### 7.1 错误案例分析（高词面重叠但语义冲突）

In [9]:
# 错误分析：高词面重合但语义差/相反
from sentence_transformers import SentenceTransformer

if "SBERT_STSB_DIR" not in globals():
    SBERT_STSB_DIR = os.path.join(MODELS_DIR, "sbert_stsb")


def jaccard(a: str, b: str) -> float:
    sa = set(a.split())
    sb = set(b.split())
    if not sa and not sb:
        return 1.0
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def hard_samples_stsb(df: pd.DataFrame, pred: np.ndarray, top_k: int = 10):
    overlaps = np.array([jaccard(x, y) for x, y in zip(df["s1"], df["s2"])], dtype=np.float32)
    y = df["label"].values.astype(np.float32)
    err = np.abs(pred.astype(np.float32) - y)
    score = overlaps * err
    idx = np.argsort(-score)[:top_k]
    return df.iloc[idx].assign(jaccard=overlaps[idx], y_true=y[idx], y_pred=pred[idx], abs_err=err[idx])


if "sbert_test_pred" not in globals() and os.path.exists(SBERT_STSB_DIR):
    sbert_eval = SentenceTransformer(SBERT_STSB_DIR, device=DEVICE)
    emb1 = sbert_eval.encode(stsb_test["s1"].tolist(), batch_size=64, normalize_embeddings=True, show_progress_bar=False)
    emb2 = sbert_eval.encode(stsb_test["s2"].tolist(), batch_size=64, normalize_embeddings=True, show_progress_bar=False)
    scores = (emb1 * emb2).sum(axis=1)
    sbert_test_pred = ((scores + 1.0) / 2.0).clip(0.0, 1.0).astype(np.float32)

if "lex_stsb_test_pred" in globals():
    hard_lex = hard_samples_stsb(stsb_test, lex_stsb_test_pred, top_k=8)
    print("Hard samples for Lexical TF-IDF cosine:")
    display(hard_lex[["s1", "s2", "jaccard", "y_true", "y_pred", "abs_err"]])
else:
    print("Skipping lexical hard samples (baseline not computed).")

if "sbert_test_pred" in globals():
    hard_sbert = hard_samples_stsb(stsb_test, sbert_test_pred, top_k=8)
    print("Hard samples for SBERT:")
    display(hard_sbert[["s1", "s2", "jaccard", "y_true", "y_pred", "abs_err"]])
else:
    print("Skipping SBERT hard samples (SBERT not computed).")

Loading weights:   0%|          | 0/103 [00:00<?, ?it/s]

Hard samples for Lexical TF-IDF cosine:


Unnamed: 0,s1,s2,jaccard,y_true,y_pred,abs_err
38,i don t want a president who cares,i don t want a president who is charasmatic,0.7,0.2,0.985769,0.785769
331,a man is tying his shoe,a man ties his shoe,0.571429,1.0,0.260872,0.739128
193,a woman is slicing a leek,a woman is slicing ginger,0.666667,0.44,1.0,0.56
325,india ink image of the day january 27,india ink image of the day march 20,0.6,0.2,0.815552,0.615552
12,oscar pistorius sentenced to 5 years in prison,bookkeeper of auschwitz sentenced to four years in prison,0.416667,0.0,0.856633,0.856633
216,a band is performing on a stage,a band is playing onstage,0.375,1.0,0.055266,0.944734
119,a man is playing the drums,a man plays the drum,0.375,1.0,0.092896,0.907104
30,hrithik roshan wife sussanne part ways,hrithik roshan sussanne to divorce,0.375,0.88,0.0,0.88


Hard samples for SBERT:


Unnamed: 0,s1,s2,jaccard,y_true,y_pred,abs_err
38,i don t want a president who cares,i don t want a president who is charasmatic,0.7,0.2,0.8365,0.6365
325,india ink image of the day january 27,india ink image of the day march 20,0.6,0.2,0.82834,0.62834
275,a woman is writing,a woman is swimming,0.6,0.1,0.67709,0.57709
4,the note s must reads for friday may 24 2013,the note s must reads for tuesday october 29 2013,0.538462,0.4,0.944518,0.544518
12,oscar pistorius sentenced to 5 years in prison,bookkeeper of auschwitz sentenced to four years in prison,0.416667,0.0,0.696947,0.696947
297,a cat is playing,a woman is playing flute,0.5,0.05,0.630331,0.580331
250,a man is running on the road,a panda dog is running on the road,0.666667,0.3334,0.768279,0.434879
126,a woman is laying down on the floor and holding a baby up above her,a man is laying on the floor holding a baby up above him,0.625,0.44,0.903722,0.463722
