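## Classic approaches to this classification problem

- Naive Bayes classifier: combines the class prior with per-feature conditional probabilities, assuming the features are independent given the class.
- Support Vector Machine (SVM): classifies by finding the maximum-margin separating hyperplane.
- Decision tree: classifies by following a tree of feature tests from the root to a leaf.
- Random forest: classifies by combining the votes of an ensemble of decision trees.
- LDA (Latent Dirichlet Allocation): a topic model; the topic distribution inferred for each document serves as the feature vector for classification.
- Neural networks: learn the classifier end to end, for example with a multi-layer perceptron.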
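
To make the first item concrete, here is a minimal sketch of a multinomial Naive Bayes classifier with add-one smoothing, written in plain Python. The toy documents, the whitespace tokenizer, and the function names `train_naive_bayes` and `predict_naive_bayes` are illustrative assumptions, not part of the original notes.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs, labels):
    """Count documents per class and words per class (multinomial model)."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)  # label -> word -> count
    vocab = set()
    for text, label in zip(docs, labels):
        tokens = text.lower().split()  # whitespace split stands in for real tokenization
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def predict_naive_bayes(model, text):
    """Return the class with the highest log posterior."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log P(class)
        total_words = sum(word_counts[label].values())
        for token in text.lower().split():
            if token not in vocab:
                continue  # ignore words never seen in training
            # add-one (Laplace) smoothing avoids log(0) for unseen class/word pairs
            score += math.log((word_counts[label][token] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical toy data
toy_docs = [
    "great movie loved it",
    "wonderful acting great plot",
    "terrible movie hated it",
    "boring plot awful acting",
]
toy_labels = ["pos", "pos", "neg", "neg"]
nb_model = train_naive_bayes(toy_docs, toy_labels)
print(predict_naive_bayes(nb_model, "loved the great acting"))
```

A bag-of-words model like this ignores word order entirely, which is one reason the contextual models discussed later tend to do better on sentiment.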

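The main challenges:

1. Category diversity: emotions fall into many categories, and a single text may carry several emotion labels, which makes classification harder.
2. Linguistic complexity: natural language is intricate; implicit sentiment, sarcasm, and puns make the emotional polarity hard to capture accurately.
3. Data imbalance: some emotion classes have few examples, so the model struggles to learn their features.
4. Context dependence: emotional expression often depends on context, and a sentence taken in isolation may not reflect the intended sentiment.
5. Polysemy and synonymy: a word may have several meanings, and different words may share a meaning, which complicates emotion classification.
6. Multilingual support: different languages express emotion differently, which makes cross-lingual emotion classification harder.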
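
For the data imbalance item, one common mitigation is to weight the loss by inverse class frequency. The sketch below computes such weights with the formula `n_samples / (n_classes * class_count)`; the label names and counts are made up for illustration, and the helper name `balanced_class_weights` is an assumption.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}

# Hypothetical imbalanced label set: 6 joy, 3 anger, 1 fear
emotion_labels = ["joy"] * 6 + ["anger"] * 3 + ["fear"] * 1
class_weights = balanced_class_weights(emotion_labels)
for label, weight in class_weights.items():
    print(f"{label}: {weight:.3f}")
```

Weights like these could be passed, in class-id order, to a weighted cross-entropy loss so that mistakes on rare classes cost more.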

## BERT
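BERT is pre-trained with a masked language model (MLM) objective. BERT uses only the encoder stack of the Transformer and has no decoder (a decoder-only stack is essentially what GPT is).

Because the encoder attends to the context on both sides of every token, BERT cannot be trained with the left-to-right next-token prediction that a decoder uses: each token would effectively be able to see itself. MLM was introduced to solve this problem. The idea is simple: before training, a random subset of the tokens in the text is masked, and during training the model predicts each masked token from the tokens that were not masked.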
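
The masking step can be sketched in a few lines of plain Python. The 15% selection rate and the 80/10/10 split (replace with `[MASK]`, replace with a random token, keep unchanged) follow the BERT paper rather than these notes, and the function name `mask_tokens` and the toy vocabulary are assumptions made for illustration.

```python
import random

MASK_TOKEN = "[MASK]"
SPECIAL_TOKENS = ("[CLS]", "[SEP]")

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels); labels[i] is None where nothing is predicted."""
    rng = random.Random(seed)
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if tok in SPECIAL_TOKENS:
            continue  # special tokens are never masked
        if rng.random() >= mask_prob:
            continue  # this position is not selected for prediction
        labels[i] = tok  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            masked[i] = MASK_TOKEN          # 80%: replace with [MASK]
        elif roll < 0.9:
            masked[i] = rng.choice(vocab)   # 10%: replace with a random token
        # remaining 10%: leave the token unchanged
    return masked, labels

# Hypothetical example sentence and vocabulary
example = ["[CLS]", "the", "movie", "was", "really", "great", "[SEP]"]
toy_vocab = ["good", "bad", "film", "plot"]
masked, labels = mask_tokens(example, toy_vocab, mask_prob=0.15, seed=0)
print(masked)
print(labels)
```

The loss is computed only at positions where `labels` is not `None`, so unselected tokens contribute context but no training signal.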

``` shell
wget https://storage.googleapis.com/bert_models/2018_11_23/multi_cased_L-12_H-768_A-12.zip
unzip multi_cased_L-12_H-768_A-12.zip
```
### Input format

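- Token embeddings: the word vectors. Note that the first token of every input sequence must be `[CLS]`. The final hidden state of `[CLS]` serves as the semantic representation of the whole text and is what tasks such as text classification use.
- Segment embeddings: distinguish two sentences that are fed in together. In a question-answering task, for example, the question and the answer are input at the same time, so the model needs a way to tell them apart. In this classification task the input is a single sentence, so every token belongs to the same segment.
- Position embeddings: encode each token's position in the sequence. Unlike an RNN, the Transformer has no built-in notion of order, so positions must be supplied explicitly.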
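
The three embeddings are added element-wise to form the encoder input. The sketch below shows that sum with tiny random lookup tables in plain Python; the hidden size of 4, the toy vocabulary, and the names `make_table` and `embed` are all assumptions, and the layer normalization and dropout that BERT applies after the sum are left out.

```python
import random

HIDDEN = 4  # toy hidden size; BERT-base uses 768
rng = random.Random(0)

def make_table(n_rows):
    """Random lookup table standing in for a learned embedding matrix."""
    return [[rng.uniform(-1, 1) for _ in range(HIDDEN)] for _ in range(n_rows)]

toy_vocab = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2, "good": 3, "movie": 4}
token_table = make_table(len(toy_vocab))
segment_table = make_table(2)    # segment 0 and segment 1
position_table = make_table(8)   # supports sequences of up to 8 tokens

def embed(tokens):
    tokens = ["[CLS]"] + tokens + ["[SEP]"]
    input_ids = [toy_vocab[t] for t in tokens]
    segment_ids = [0] * len(tokens)          # single sentence: everything is segment 0
    position_ids = list(range(len(tokens)))  # 0, 1, 2, ...
    vectors = [
        [t + s + p for t, s, p in zip(token_table[i], segment_table[g], position_table[pos])]
        for i, g, pos in zip(input_ids, segment_ids, position_ids)
    ]
    return input_ids, segment_ids, position_ids, vectors

input_ids, segment_ids, position_ids, vectors = embed(["good", "movie"])
print(input_ids, segment_ids, position_ids)
```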

### Model configuration

id2label: maps each class id (the integer label) to its class name.

label2id: the inverse mapping, from class name to class id.
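
The two mappings are inverses of each other, so one can be derived from the other. A minimal sketch, assuming a two-class task with the hypothetical class names `negative` and `positive`:

```python
# Hypothetical class names for a binary sentiment task
label_names = ["negative", "positive"]

id2label = {i: name for i, name in enumerate(label_names)}
label2id = {name: i for i, name in id2label.items()}
num_labels = len(id2label)  # the class count follows from the mapping

print(id2label)
print(label2id)
print(num_labels)
```

Deriving `label2id` from `id2label` rather than writing both by hand keeps the two from drifting apart when classes are added.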

num_labels_cate: the number of classes.

### Data preparation

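1. InputExample: the structure that holds the text content of a single training example.
2. DataProcessor: the functions in this class turn the text of the training set into a collection of InputExample objects.
3. get_features: the key function that converts InputExample data into the data structures BERT understands. Let's look at how each piece is generated.

input_ids records, for each input token, its index in vocab.txt. In the notebook cells that follow, the tokenizer produces it; `load_imdb_as_examples` plays the DataProcessor role and `BertTextDataset.__getitem__` plays the get_features role.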
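
Before the real tokenizer is used, a toy version of the conversion may help show what ends up in each field. In this sketch the vocabulary, the whitespace tokenizer (standing in for WordPiece), and the function body are assumptions for illustration; only the field names match what a BERT tokenizer returns.

```python
# Hypothetical miniature vocab.txt: token -> id
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "this": 4, "film": 5, "is": 6, "great": 7}

def get_features(text, max_length=8):
    """Convert raw text into fixed-length input_ids, attention_mask, token_type_ids."""
    # Reserve two positions for [CLS] and [SEP], truncating the text if needed
    words = text.lower().split()[: max_length - 2]
    tokens = ["[CLS]"] + words + ["[SEP]"]
    # Look each token up in the vocabulary; unknown tokens map to [UNK]
    input_ids = [toy_vocab.get(tok, toy_vocab["[UNK]"]) for tok in tokens]
    attention_mask = [1] * len(input_ids)  # 1 = real token
    token_type_ids = [0] * len(input_ids)  # single sentence: segment 0
    # Pad on the right up to max_length; padded positions get mask 0
    pad = max_length - len(input_ids)
    input_ids += [toy_vocab["[PAD]"]] * pad
    attention_mask += [0] * pad
    token_type_ids += [0] * pad
    return {"input_ids": input_ids,
            "attention_mask": attention_mask,
            "token_type_ids": token_type_ids}

features = get_features("This film is really great")
print(features)
```

The word "really" is not in the toy vocabulary, so it maps to the `[UNK]` id; a real WordPiece tokenizer would instead split unknown words into known sub-word pieces.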

```python
# Environment and dependencies (safe to re-run)
%pip install -q transformers datasets accelerate scikit-learn

import torch, transformers, datasets, sklearn
print("Versions:")
print("  torch:", torch.__version__)
print("  transformers:", transformers.__version__)
print("  datasets:", datasets.__version__)
```

Versions:
  torch: 2.5.1+cu121
  transformers: 4.57.1
  datasets: 4.4.0


```python
# InputExample and loader definitions (IMDB binary classification example)
from dataclasses import dataclass
from typing import Optional, List, Dict
from datasets import load_dataset

@dataclass
class InputExample:
    guid: str
    text_a: str
    text_b: Optional[str] = None
    label: Optional[int] = None  # 0/1


def load_imdb_as_examples(split: str) -> List[InputExample]:
    ds = load_dataset("imdb", split=split)
    examples = []
    for i, ex in enumerate(ds):
        guid = f"imdb-{split}-{i}"
        examples.append(InputExample(guid=guid, text_a=ex["text"], label=int(ex["label"])))
    return examples

# Demo: load the first 2 examples to inspect the structure
train_examples_preview = load_imdb_as_examples("train[:2]")
train_examples_preview
```


[InputExample(guid='imdb-train[:2]-0', text_a='I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex

```python
# Tokenizer, featurization, and mini-batch construction
from transformers import AutoTokenizer
from torch.utils.data import Dataset, DataLoader
import torch

MODEL_NAME = "bert-base-uncased"
max_length = 256

_tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

class BertTextDataset(Dataset):
    def __init__(self, examples: List[InputExample]):
        self.examples = examples
    def __len__(self):
        return len(self.examples)
    def __getitem__(self, idx):
        ex = self.examples[idx]
        # Tokenize lazily, truncating to max_length; padding happens per batch in collate_batch
        enc = _tokenizer(
            ex.text_a,
            ex.text_b,
            truncation=True,
            padding=False,
            max_length=max_length,
            return_tensors="pt",
        )
        item = {k: v.squeeze(0) for k, v in enc.items()}  # remove batch dim
        if ex.label is not None:
            item["labels"] = torch.tensor(ex.label, dtype=torch.long)
        return item

def collate_batch(features: List[Dict[str, torch.Tensor]]):
    # padding=True pads every sequence to the longest one in the batch
    return _tokenizer.pad(
        features,
        padding=True,
        max_length=max_length,
        return_tensors="pt",
    )

# Build the train/validation/test splits
train_examples = load_imdb_as_examples("train[:95%]")
valid_examples = load_imdb_as_examples("train[95%:]")
test_examples  = load_imdb_as_examples("test")

train_ds = BertTextDataset(train_examples)
valid_ds = BertTextDataset(valid_examples)
test_ds  = BertTextDataset(test_examples)

len(train_ds), len(valid_ds), len(test_ds)
```


(23750, 1250, 25000)

```python
# Training with BertForSequenceClassification (plain PyTorch loop with AMP and a tuned DataLoader)
import os, math
import torch
from torch.utils.data import DataLoader
from transformers import BertForSequenceClassification, get_linear_schedule_with_warmup

# Checkpoint directory (under the notebook's working directory)
ckpt_dir2 = os.path.join(os.getcwd(), "ckpts", "bert_emotion_bertcls")
os.makedirs(ckpt_dir2, exist_ok=True)

# Device and acceleration options
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Allow TF32 (effective on Ampere and newer GPUs) for higher throughput
try:
    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cudnn.allow_tf32 = True
    torch.set_float32_matmul_precision("high")
except Exception:
    pass

# AMP settings (prefer bfloat16, otherwise fall back to float16)
use_autocast = torch.cuda.is_available()
try:
    autocast_dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
except Exception:
    autocast_dtype = torch.float16
# Loss scaling is only needed for float16; torch.amp replaces the deprecated torch.cuda.amp API
scaler = torch.amp.GradScaler("cuda", enabled=(autocast_dtype == torch.float16 and torch.cuda.is_available()))

# DataLoaders (worker processes, pinned memory, prefetching)
num_workers = max(1, min(8, (os.cpu_count() or 2) // 2))
train_loader = DataLoader(
    train_ds, batch_size=16, shuffle=True, collate_fn=collate_batch,
    num_workers=num_workers, pin_memory=True, persistent_workers=True, prefetch_factor=2
)
valid_loader = DataLoader(
    valid_ds, batch_size=32, shuffle=False, collate_fn=collate_batch,
    num_workers=num_workers, pin_memory=True, persistent_workers=True, prefetch_factor=2
)
test_loader  = DataLoader(
    test_ds,  batch_size=32, shuffle=False, collate_fn=collate_batch,
    num_workers=num_workers, pin_memory=True, persistent_workers=True, prefetch_factor=2
)
print(f"Using device: {device}, num_workers={num_workers}, amp_dtype={autocast_dtype}")

# Create the model (loads the pre-trained BERT weights)
model = BertForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)
model.to(device)

# Optimizer and scheduler (linear warmup over the first 10% of steps, then linear decay)
lr = 2e-5
epochs = 3
optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
num_training_steps = epochs * len(train_loader)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * num_training_steps),
    num_training_steps=num_training_steps,
)

# Simple batch accuracy
def accuracy_from_logits(logits, labels):
    preds = logits.argmax(dim=-1)
    return (preds == labels).float().mean().item()

best_val_loss = math.inf
print("Start training (BertForSequenceClassification)...")
for epoch in range(1, epochs + 1):
    # Train
    model.train()
    tr_loss, tr_acc = 0.0, 0.0
    for batch in train_loader:
        batch = {k: v.to(device, non_blocking=True) for k, v in batch.items()}
        optimizer.zero_grad(set_to_none=True)
        with torch.amp.autocast("cuda", enabled=use_autocast, dtype=autocast_dtype):
            out = model(**batch)  # returns SequenceClassifierOutput
            loss = out.loss
        if scaler.is_enabled():
            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()
        else:
            loss.backward()
            optimizer.step()
        scheduler.step()
        tr_loss += loss.item()
        tr_acc  += accuracy_from_logits(out.logits, batch["labels"])
    tr_loss /= len(train_loader)
    tr_acc  /= len(train_loader)

    # Validate
    model.eval()
    va_loss, va_acc = 0.0, 0.0
    with torch.no_grad():
        for batch in valid_loader:
            batch = {k: v.to(device, non_blocking=True) for k, v in batch.items()}
            with torch.amp.autocast("cuda", enabled=use_autocast, dtype=autocast_dtype):
                out = model(**batch)
                va_loss += out.loss.item()
                va_acc  += accuracy_from_logits(out.logits, batch["labels"])
    va_loss /= len(valid_loader)
    va_acc  /= len(valid_loader)

    # Save the best checkpoint (lowest validation loss so far)
    if va_loss < best_val_loss:
        best_val_loss = va_loss
        model.save_pretrained(ckpt_dir2)
        _tokenizer.save_pretrained(ckpt_dir2)

    print(f"[Epoch {epoch}/{epochs}] train_loss={tr_loss:.4f} train_acc={tr_acc:.4f} | val_loss={va_loss:.4f} val_acc={va_acc:.4f}")

# Test
# Note: `model` still holds the final-epoch weights here, not the best checkpoint saved above.
# To test the best checkpoint, reload it first with BertForSequenceClassification.from_pretrained(ckpt_dir2).
model.eval()
te_loss, te_acc = 0.0, 0.0
with torch.no_grad():
    for batch in test_loader:
        batch = {k: v.to(device, non_blocking=True) for k, v in batch.items()}
        with torch.amp.autocast("cuda", enabled=use_autocast, dtype=autocast_dtype):
            out = model(**batch)
            te_loss += out.loss.item()
            te_acc  += accuracy_from_logits(out.logits, batch["labels"])
te_loss /= len(test_loader)
te_acc  /= len(test_loader)
print(f"[Test] loss={te_loss:.4f} acc={te_acc:.4f}")

print("Saved best model to:", ckpt_dir2)
```


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Start training (BertForSequenceClassification)...
[Epoch 1/3] train_loss=0.3022 train_acc=0.8705 | val_loss=0.1987 val_acc=0.9242
[Epoch 2/3] train_loss=0.1433 train_acc=0.9489 | val_loss=0.1346 val_acc=0.9555
[Epoch 3/3] train_loss=0.0615 train_acc=0.9811 | val_loss=0.2592 val_acc=0.9148
[Test] loss=0.2426 acc=0.9230
Saved best model to: /home/uceeqz4/Project/learning/PyTorch_study/Application/ckpts/bert_emotion_bertcls
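
The training cell warms the learning rate up over the first 10% of steps and then decays it linearly to zero. The sketch below reproduces that multiplier in plain Python to show what the schedule does at a few steps; the function name `linear_warmup_decay` is an assumption, and the formula is my understanding of what `get_linear_schedule_with_warmup` computes rather than a copy of the library code. The step counts follow from the 23750 training examples and the batch size of 16 used above.

```python
import math

def linear_warmup_decay(step, num_warmup_steps, num_training_steps):
    """Learning-rate multiplier: linear ramp 0 -> 1, then linear decay 1 -> 0."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

base_lr = 2e-5
epochs = 3
steps_per_epoch = math.ceil(23750 / 16)          # batches per epoch
num_training_steps = epochs * steps_per_epoch
num_warmup_steps = int(0.1 * num_training_steps)

for step in (0, num_warmup_steps, 2450, num_training_steps):
    scale = linear_warmup_decay(step, num_warmup_steps, num_training_steps)
    print(f"step {step}: lr = {base_lr * scale:.2e}")
```

In the logged run the validation loss is lowest after epoch 2 and rises in epoch 3 while the training loss keeps falling, which is why the loop saves a checkpoint only when the validation loss improves.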
