## 2.4 BERT in Practice

This section fine-tunes a BERT model on the IMDB dataset, ~~copying~~ reproducing the Hugging Face [tutorial](https://huggingface.co/docs/transformers/training).

Because of limited compute, only 5000 training samples and 1000 test samples are used.

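Thanks to the clean abstractions in the `transformers` library, changing `model_id` is enough to switch the base model, so the code for fine-tuning GPT-2 is not repeated here. The fine-tuned GPT-2 reaches a final accuracy of 88.50%, slightly below BERT.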
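
One caveat when swapping in GPT-2: its tokenizer ships without a padding token, so `padding='max_length'` and batched classification need one to be assigned first. The sketch below is an assumption about how such a swap could be handled, not the code behind the 88.50% figure. `load_classifier` is a hypothetical helper, and reusing the end-of-sequence token as padding is just one common choice.

```python
def load_classifier(model_id, num_labels=2):
    """Load a sequence classifier and its tokenizer for a given checkpoint.

    Hypothetical helper: not part of the original notebook.
    """
    # Imported lazily so the helper can be defined without loading the library.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_id, num_labels=num_labels
    )

    # Assumption: if the checkpoint defines no padding token (as with GPT-2),
    # reuse the end-of-sequence token and tell the model about it.
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
        model.config.pad_token_id = tokenizer.pad_token_id

    return model, tokenizer
```

Calling `load_classifier('gpt2')` would then stand in for the two `from_pretrained` calls in the first cell. For `bert-base-uncased` the extra branch is skipped, because that tokenizer already defines a padding token.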

In [1]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = 'bert-base-uncased'

model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)
tokenizer = AutoTokenizer.from_pretrained(model_id)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [2]:
from datasets import load_dataset

imdb = load_dataset('stanfordnlp/imdb')
imdb_train = imdb['train'].shuffle(seed=0).take(5000)
imdb_train_tokenized = imdb_train.map(lambda x: tokenizer(x['text'], truncation=True, padding='max_length'), batched=True)
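
To make the effect of `truncation=True, padding='max_length'` concrete, here is a toy sketch in plain Python. It is not the tokenizer's implementation and it ignores special tokens. `pad_or_truncate` is a hypothetical helper, the token ids are arbitrary, and the defaults assume BERT's 512-position limit and padding id 0.

```python
def pad_or_truncate(token_ids, max_length=512, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]            # truncation: drop the overflow
    n_pad = max_length - len(kept)           # padding: fill up to max_length
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask


short_ids, short_mask = pad_or_truncate([5, 6, 7, 8], max_length=8)
print(short_ids)   # [5, 6, 7, 8, 0, 0, 0, 0]
print(short_mask)  # [1, 1, 1, 1, 0, 0, 0, 0]

long_ids, long_mask = pad_or_truncate(list(range(1, 1000)))
print(len(long_ids), sum(long_mask))  # 512 512
```

Padding every review to the full length is simple, but short reviews then cost as much compute as long ones. Padding per batch instead (for example with `DataCollatorWithPadding`) is a common alternative.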

In [3]:
from transformers import TrainingArguments

training_args = TrainingArguments(output_dir='./training-logs', report_to='tensorboard')

In [4]:
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir='./training-logs',
    save_strategy='no',
    report_to='tensorboard',
    num_train_epochs=1,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=imdb_train_tokenized,
)
trainer.train()
model.eval()

{'loss': 0.3875, 'grad_norm': 7.124253749847412, 'learning_rate': 1e-05, 'epoch': 0.8}
{'train_runtime': 221.3284, 'train_samples_per_second': 22.591, 'train_steps_per_second': 2.824, 'train_loss': 0.36949022216796873, 'epoch': 1.0}


BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e
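
The numbers in the training log can be cross-checked with a little arithmetic. The sketch below assumes the `TrainingArguments` defaults that the cell leaves untouched: batch size 8 on a single device, an initial learning rate of 5e-5, linear decay to zero without warmup, and logging every 500 steps. If the installed version uses different defaults, the figures will not line up.

```python
import math

n_samples, batch_size, epochs = 5000, 8, 1   # batch size 8 is an assumed default
base_lr, logging_steps = 5e-5, 500           # assumed defaults as well
train_runtime = 221.3284                     # seconds, taken from the log

total_steps = math.ceil(n_samples / batch_size) * epochs


def linear_lr(step):
    """Learning rate after `step` updates under linear decay without warmup."""
    return base_lr * (1 - step / total_steps)


print(total_steps)                               # 625
print(logging_steps / total_steps)               # 0.8, the logged 'epoch'
print(linear_lr(logging_steps))                  # about 1e-05, the logged 'learning_rate'
print(round(total_steps / train_runtime, 3))     # 2.824 steps per second
print(round(n_samples / train_runtime, 3))       # 22.591 samples per second
```

Under these assumptions the single intermediate log line (epoch 0.8, learning rate 1e-05) is simply the state at step 500 of 625.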

In [5]:
import torch

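def inference(text):
    """Return 1 for a positive review and 0 for a negative one."""
    with torch.no_grad():  # no gradients are needed at inference time
        inputs = tokenizer(text, truncation=True, return_tensors='pt').to(model.device)
        neg, pos = model(**inputs).logits[0]  # logits for label 0 (negative) and 1 (positive)
        return int(pos > neg)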

inference('This movie is great!')

1
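
`inference` compares the two raw logits directly. Mathematically this is the same as thresholding the softmax probability of the positive class at 0.5, which the plain-Python sketch below illustrates. The logit values are made up for illustration and `positive_probability` is a hypothetical helper.

```python
import math


def positive_probability(neg_logit, pos_logit):
    """Softmax probability of the positive class for a two-class classifier."""
    # With two classes, softmax reduces to a sigmoid of the logit difference.
    return 1.0 / (1.0 + math.exp(neg_logit - pos_logit))


for neg, pos in [(-1.2, 2.3), (0.4, 0.1), (1.5, 1.5)]:
    p = positive_probability(neg, pos)
    by_logits = int(pos > neg)       # the rule used by inference()
    by_probability = int(p > 0.5)    # the equivalent probability threshold
    print(f'p(pos)={p:.3f}  logits->{by_logits}  probability->{by_probability}')
```

Skipping the softmax saves a little work and changes nothing about the predicted label. The probability only becomes necessary when a confidence score is wanted.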

In [6]:
correct, incorrect = 0, 0
for sample in imdb['test'].shuffle(seed=0).take(1000):
    if inference(sample['text']) == sample['label']:
        correct += 1
    else:
        incorrect += 1
print(f'{correct / (correct + incorrect):.2%}')

91.00%
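
With only 1000 test reviews, the measured accuracy carries noticeable sampling noise. As a rough sketch, a normal-approximation 95% interval can be computed as follows. Treating the test reviews as independent draws is an assumption, and `accuracy_interval` is a hypothetical helper.

```python
import math


def accuracy_interval(accuracy, n, z=1.96):
    """Normal-approximation (Wald) confidence interval for a proportion."""
    half_width = z * math.sqrt(accuracy * (1 - accuracy) / n)
    return accuracy - half_width, accuracy + half_width


for name, acc in [('BERT', 0.9100), ('GPT-2', 0.8850)]:
    low, high = accuracy_interval(acc, n=1000)
    print(f'{name}: {acc:.2%} (95% CI {low:.2%} to {high:.2%})')
```

Under these assumptions each interval extends a bit under two percentage points on either side, and the two intervals overlap, so the gap between BERT and GPT-2 on this sample should be read with some caution.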
