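# Fakeddit Text-only (BERT) Baseline

This notebook takes the **multimodal samples of Fakeddit v2.0 (text only)** as input and builds the **text-branch** baseline described in the paper's methodology.

**Goals**
- Train a text model (BERT) on `clean_title`
- Support 2/3/6-way classification
- Provide a comparable text baseline for the later multimodal fusion

**Steps covered, in order**
1. Configuration and dependencies
2. Data loading and label mapping
3. BERT training and evaluation
4. Saving the model and suggestions for extension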


---
## 0. Environment Setup (run if needed)

If the dependencies are not installed locally, run:
```bash
pip install torch torchvision torchaudio transformers scikit-learn tqdm
```

Note: the first run downloads the Hugging Face model weights (`bert-base-uncased`).


---
## 0.1 Recommended Linux Server Environment

The target environment is **PyTorch 2.8.0 / Python 3.12 / Ubuntu 22.04 / CUDA 12.8**.
Install the dependencies from the `requirements.txt` in the project root:
```bash
pip install -r requirements.txt
```

When running on a server, make sure that:
- `nvidia-smi` lists the GPU
- the installed NVIDIA driver supports CUDA 12.8 or later


In [None]:
# Environment check (optional)
# - Confirms that PyTorch, CUDA, and the GPU are working

import torch
print('torch:', torch.__version__)
print('cuda available:', torch.cuda.is_available())
if torch.cuda.is_available():
    print('cuda version:', torch.version.cuda)
    print('gpu:', torch.cuda.get_device_name(0))


In [None]:
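# Configuration: all experiment hyperparameters and paths in one place
# - Change TASK to switch between 2/3/6-way
# - Change MODEL_NAME to switch to a different BERT variant
# - Set MAX_SAMPLES for quick debugging on a small subset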

from pathlib import Path

DATA_ROOT = Path('Fakeddit datasetv2.0')
TRAIN_PATH = DATA_ROOT / 'multimodal_only_samples' / 'multimodal_train.tsv'
VAL_PATH = DATA_ROOT / 'multimodal_only_samples' / 'multimodal_validate.tsv'
TEST_PATH = DATA_ROOT / 'multimodal_only_samples' / 'multimodal_test_public.tsv'

TASK = 2  # 2 / 3 / 6
MODEL_NAME = 'bert-base-uncased'
MAX_LEN = 128
BATCH_SIZE = 32
EPOCHS = 3
LR = 2e-5
WEIGHT_DECAY = 0.01
WARMUP_RATIO = 0.1
MAX_SAMPLES = None
NUM_WORKERS = 2

OUTPUT_DIR = Path('outputs') / f'bert_text_only_{TASK}way'
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

print('Train:', TRAIN_PATH)
print('Val  :', VAL_PATH)
print('Test :', TEST_PATH)
print('Output:', OUTPUT_DIR)


In [None]:
# Dependencies and random seed
# - Fix the seeds so runs are as reproducible as possible
# - Pick GPU or CPU automatically

import csv
import random
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import AutoTokenizer, AutoModelForSequenceClassification, get_linear_schedule_with_warmup
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix

try:
    from tqdm import tqdm
except ImportError:
    # Minimal fallback if tqdm is not installed
    def tqdm(x, **kwargs):
        return x

def set_seed(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

set_seed(42)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('device:', device)


---
## 1. Label Mapping (verified against the local TSV files)
- 2-way: `1=True`, `0=Fake`
- 3-way: `0=True`, `1=Fake with true text`, `2=Fake with false text`
- 6-way: `0=True`, `1=Satire/Parody`, `2=False Connection`, `3=Imposter`, `4=Manipulated`, `5=Misleading`
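One detail in the mapping above is easy to miss: in the 2-way scheme `True` is `1`, while in the 3-way and 6-way schemes `True` is `0`. The sketch below shows two small sanity checks that could be run on loaded labels. The helper names `check_labels` and `labels_agree` are hypothetical and not part of the notebook, and `labels_agree` assumes the 2-way and 6-way columns of a row are meant to describe the same sample consistently.

```python
def check_labels(labels, task):
    """Return the sorted label ids that fall outside range(task)."""
    valid = set(range(task))
    return sorted(set(labels) - valid)


def labels_agree(two_way, six_way):
    """True when both columns agree on whether the sample is real.

    Assumption: 2-way uses 1=True, 6-way uses 0=True, as listed above.
    """
    return (two_way == 1) == (six_way == 0)


print(check_labels([0, 1, 2, 5], task=6))  # no out-of-range ids
print(check_labels([0, 1, 2], task=2))     # id 2 is invalid for 2-way
print(labels_agree(1, 0), labels_agree(0, 4), labels_agree(1, 3))
```

A non-empty result from `check_labels`, or a row where `labels_agree` is false, would suggest that the wrong label column or the wrong `TASK` value is in use.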


In [None]:
# Load the TSV files and extract clean_title plus the label
# - clean_title keeps the text preprocessing consistent with the paper
# - MAX_SAMPLES allows quick debugging on a small subset

LABEL_NAMES = {
    2: {0: 'Fake', 1: 'True'},
    3: {0: 'True', 1: 'Fake-TrueText', 2: 'Fake-FalseText'},
    6: {0: 'True', 1: 'Satire/Parody', 2: 'False Connection', 3: 'Imposter', 4: 'Manipulated', 5: 'Misleading'},
}

def load_tsv(path, task=2, max_samples=None):
    texts = []
    labels = []
    label_key = f'{task}_way_label'
    with open(path, newline='', encoding='utf-8') as f:
        reader = csv.DictReader(f, delimiter='\t')
        for row in reader:
            text = (row.get('clean_title') or '').strip()
            if not text:
                continue
            label_str = row.get(label_key)
            if label_str is None or label_str == '':
                continue
            label = int(float(label_str))
            texts.append(text)
            labels.append(label)
            if max_samples is not None and len(texts) >= max_samples:
                break
    return texts, labels

train_texts, train_labels = load_tsv(TRAIN_PATH, task=TASK, max_samples=MAX_SAMPLES)
val_texts, val_labels = load_tsv(VAL_PATH, task=TASK, max_samples=MAX_SAMPLES)
test_texts, test_labels = load_tsv(TEST_PATH, task=TASK, max_samples=MAX_SAMPLES)

print('train:', len(train_texts))
print('val  :', len(val_texts))
print('test :', len(test_texts))

from collections import Counter
print('train label dist:', Counter(train_labels))
print('val label dist  :', Counter(val_labels))


---
## 2. Tokenization and Dataset Wrapper

This section converts the text into token IDs that BERT can consume and wraps the data in a PyTorch dataset.


In [None]:
# Tokenizer initialization
# - Loads the tokenizer that matches MODEL_NAME

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

class TextDataset(Dataset):
    # Dataset wrapper: implements __len__ and __getitem__
    def __init__(self, texts, labels):
        self.texts = texts
        self.labels = labels
    def __len__(self):
        return len(self.texts)
    def __getitem__(self, idx):
        return self.texts[idx], self.labels[idx]

def collate_fn(batch):
    # Tokenize the batch, pad to the longest sequence, and truncate to MAX_LEN
    texts, labels = zip(*batch)
    enc = tokenizer(list(texts), padding=True, truncation=True, max_length=MAX_LEN, return_tensors='pt')
    enc['labels'] = torch.tensor(labels, dtype=torch.long)
    return enc

train_ds = TextDataset(train_texts, train_labels)
val_ds = TextDataset(val_texts, val_labels)
test_ds = TextDataset(test_texts, test_labels)

train_loader = DataLoader(train_ds, batch_size=BATCH_SIZE, shuffle=True, num_workers=NUM_WORKERS, collate_fn=collate_fn)
val_loader = DataLoader(val_ds, batch_size=BATCH_SIZE, shuffle=False, num_workers=NUM_WORKERS, collate_fn=collate_fn)
test_loader = DataLoader(test_ds, batch_size=BATCH_SIZE, shuffle=False, num_workers=NUM_WORKERS, collate_fn=collate_fn)
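
Because `collate_fn` tokenizes per batch with `padding=True`, each batch is padded only to its own longest sequence rather than to `MAX_LEN`. The sketch below imitates that behavior on plain lists of integers. It is a simplified illustration, not the tokenizer's implementation: the token ids are made up, `pad_id=0` is an assumption, and unlike a real tokenizer it does not preserve special tokens when truncating.

```python
def pad_batch(token_id_lists, max_len, pad_id=0):
    """Truncate each sequence to max_len, then pad to the batch maximum."""
    truncated = [ids[:max_len] for ids in token_id_lists]
    width = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (width - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding
    attention_mask = [
        [1] * len(ids) + [0] * (width - len(ids)) for ids in truncated
    ]
    return input_ids, attention_mask


ids, mask = pad_batch([[11, 12, 13], [21, 22, 23, 24, 25]], max_len=4)
print(ids)
print(mask)
```

Here the batch width is 4: the longer sequence is cut to `max_len`, and the shorter one receives a single padding position with a `0` in its attention mask.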


---
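## 3. Model, Loss Function, and Optimizer

This section creates the BERT classifier and adds class weights to the loss to counter class imbalance.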
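The class weights computed in this section follow the formula `weight_k = total / (num_labels * count_k)`, so a class that is rarer than average gets a weight above 1. As a small worked example, the sketch below applies that formula to a toy label list. The function name `balanced_class_weights` and the toy data are illustrative only.

```python
from collections import Counter


def balanced_class_weights(labels, num_labels):
    """Inverse-frequency weights: total / (num_labels * count_k)."""
    counts = Counter(labels)
    total = sum(counts.values())
    # A class absent from the data falls back to a count of 1
    return [total / (num_labels * counts.get(k, 1)) for k in range(num_labels)]


toy_labels = [0] * 6 + [1] * 3 + [2] * 1
weights = balanced_class_weights(toy_labels, num_labels=3)
print([round(w, 3) for w in weights])  # [0.556, 1.111, 3.333]
```

With 10 samples split 6/3/1, the majority class is down-weighted to about 0.56 and the rarest class is up-weighted to about 3.33. For perfectly balanced data every weight would be exactly 1.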


In [None]:
# Model initialization
# - num_labels sets the number of classes in the classification head

num_labels = TASK
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=num_labels)
model.to(device)

# Class weights: mitigate class imbalance (especially for 6-way)
from collections import Counter
label_counts = Counter(train_labels)
weights = [0.0] * num_labels
total = sum(label_counts.values())
for k in range(num_labels):
    count = label_counts.get(k, 1)
    weights[k] = total / (num_labels * count)
weights = torch.tensor(weights, dtype=torch.float, device=device)

import torch.nn as nn
criterion = nn.CrossEntropyLoss(weight=weights)

optimizer = torch.optim.AdamW(model.parameters(), lr=LR, weight_decay=WEIGHT_DECAY)
total_steps = len(train_loader) * EPOCHS
warmup_steps = int(total_steps * WARMUP_RATIO)
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=warmup_steps, num_training_steps=total_steps)

scaler = torch.amp.GradScaler('cuda', enabled=torch.cuda.is_available())
print('total steps:', total_steps, 'warmup:', warmup_steps)
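
The scheduler above ramps the learning rate linearly from 0 to `LR` over the warmup steps and then decays it linearly back to 0. The sketch below reimplements that multiplier in plain Python for illustration. It is intended to mirror the shape of `get_linear_schedule_with_warmup` but is not the library code, and the step counts are arbitrary example values.

```python
def linear_warmup_factor(step, warmup_steps, total_steps):
    """Multiplier applied to the base learning rate at a given step."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    remaining = total_steps - step
    return max(0.0, remaining / max(1, total_steps - warmup_steps))


total_steps, warmup_ratio, base_lr = 1000, 0.1, 2e-5
warmup_steps = int(total_steps * warmup_ratio)  # 100
for step in (0, 50, 100, 550, 1000):
    factor = linear_warmup_factor(step, warmup_steps, total_steps)
    print(step, factor, base_lr * factor)
```

With these values the multiplier is 0.5 halfway through warmup (step 50), peaks at 1.0 at step 100, falls to 0.5 at step 550, and reaches 0 at the final step.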


---
## 4. Evaluation Function

Reports loss, accuracy, macro-F1, and the confusion matrix.
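Macro-F1 is reported alongside accuracy because it averages the per-class F1 scores with equal weight, so minority classes count as much as the majority class. The sketch below computes it by hand on a toy example. `macro_f1` is a hypothetical pure-Python helper written for illustration; the notebook itself relies on `sklearn.metrics.f1_score`.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes that appear."""
    classes = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


y_true = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy, round(macro_f1(y_true, y_pred), 4))
```

In this example the accuracy is 0.75, while the per-class F1 scores are 2/3 for class 0 and 0.8 for class 1, giving a macro-F1 of about 0.7333.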


In [None]:
# 评估函数
# - 统一评估逻辑，方便在 val/test 上复用

@torch.no_grad()
def evaluate(model, loader):
    model.eval()
    all_preds = []
    all_labels = []
    total_loss = 0.0
    for batch in loader:
        labels = batch.pop('labels').to(device)
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        logits = outputs.logits
        loss = criterion(logits, labels)
        total_loss += loss.item() * labels.size(0)
        preds = torch.argmax(logits, dim=-1)
        all_preds.extend(preds.cpu().tolist())
        all_labels.extend(labels.cpu().tolist())

    acc = accuracy_score(all_labels, all_preds)
    f1 = f1_score(all_labels, all_preds, average='macro')
    cm = confusion_matrix(all_labels, all_preds)
    avg_loss = total_loss / max(1, len(all_labels))
    return avg_loss, acc, f1, cm


---
## 5. Training

Validation metrics are printed after every epoch.


In [None]:
# Training loop
# - Uses AMP (automatic mixed precision) for speed
# - Evaluates on the validation set once per epoch

for epoch in range(1, EPOCHS + 1):
    model.train()
    running_loss = 0.0
    for batch in tqdm(train_loader, desc=f'Epoch {epoch}/{EPOCHS}'):
        labels = batch.pop('labels').to(device)
        batch = {k: v.to(device) for k, v in batch.items()}
        optimizer.zero_grad()
        with torch.amp.autocast('cuda', enabled=torch.cuda.is_available()):
            outputs = model(**batch)
            logits = outputs.logits
            loss = criterion(logits, labels)
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
        scheduler.step()
        running_loss += loss.item() * labels.size(0)

    train_loss = running_loss / max(1, len(train_ds))
    val_loss, val_acc, val_f1, val_cm = evaluate(model, val_loader)
    print(f'Epoch {epoch}: train_loss={train_loss:.4f} val_loss={val_loss:.4f} val_acc={val_acc:.4f} val_f1={val_f1:.4f}')
    print('Val confusion matrix:\n', val_cm)


---
## 6. Test Set Evaluation and Saving

Finally, evaluate on the test set and save the model.


In [None]:
# Test set evaluation and saving
# - The test set is used only for the final report

test_loss, test_acc, test_f1, test_cm = evaluate(model, test_loader)
print(f'Test: loss={test_loss:.4f} acc={test_acc:.4f} f1={test_f1:.4f}')
print('Test confusion matrix:\n', test_cm)

# Save the model and tokenizer
model.save_pretrained(OUTPUT_DIR)
tokenizer.save_pretrained(OUTPUT_DIR)
print('Saved to', OUTPUT_DIR)


---
## 7. Suggestions for Extending to Multimodal
- Add an image branch (ResNet/ViT) to obtain `image_emb`
- Add a semantic-consistency branch (`cosine` / MLP)
- Fusion strategy: `concat(text_emb, image_emb, consistency)` + MLP
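
As a rough illustration of the fusion strategy listed above, the sketch below builds the fused feature vector from two toy embeddings using a cosine-similarity consistency score. It is only a sketch under stated assumptions: both embeddings are assumed to already share the same dimension, the function names are hypothetical, and the MLP that would consume the fused vector is omitted.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def fuse(text_emb, image_emb):
    """concat(text_emb, image_emb, consistency) as one flat feature list."""
    consistency = cosine_similarity(text_emb, image_emb)
    return list(text_emb) + list(image_emb) + [consistency]


fused = fuse([1.0, 0.0], [1.0, 1.0])
print(len(fused), round(fused[-1], 4))
```

The fused vector has one more entry than the two embeddings combined. In a real model the same operations would be expressed with tensors so that gradients can flow through the fusion step.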

