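## Assignment 3
Using the neural-network libraries of PyTorch or TensorFlow, implement a BERT language model and, starting from pretrained word embeddings, fine-tune it with a small amount of training data on the same text-classification task as in Experiment 2. Compare and analyze the results against the RNN model from Experiment 2.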

Concretely, this experiment fine-tunes a pretrained BERT model on the dataset for text sentiment classification: the input is a sentence and the output is 0 (negative) or 1 (positive).

**SA22221043 王家振** 

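### Environment and Dataset
- Colab: a randomly assigned GPU with the default versions of PyTorch, Transformers, Datasets, etc.
- IMDB: the public "Large Movie Review Dataset", with 25,000 training samples and 25,000 test samples; 20% of the training set is held out as a validation set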


### Experimental Procedure

#### Basic Procedure

1. Install and import the required packages

In [22]:
!pip install transformers datasets

import torch
import numpy as np
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification
from transformers import DataCollatorWithPadding, TrainingArguments, Trainer
from datasets import load_dataset, load_metric



2. Tokenizer (word-vector encoder)

Use the pretrained tokenizer from Hugging Face.
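
For intuition, BERT-style tokenizers split words into subword pieces by greedy longest-match-first lookup in a fixed vocabulary (WordPiece). The sketch below illustrates that idea on a toy vocabulary. The names `toy_vocab` and `wordpiece_split` are hypothetical, and this is not the Hugging Face implementation, which also lowercases text, splits punctuation and adds special tokens.

```python
def wordpiece_split(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first subword split of a single word."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry a "##" prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

# Toy vocabulary, made up for illustration only.
toy_vocab = {"un", "##believ", "##able", "movie"}
print(wordpiece_split("unbelievable", toy_vocab))  # ['un', '##believ', '##able']
print(wordpiece_split("xyz", toy_vocab))           # ['[UNK]']
```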

In [None]:
PRE_TRAINED_MODEL_NAME = 'distilbert-base-uncased'
tokenizer = DistilBertTokenizer.from_pretrained(PRE_TRAINED_MODEL_NAME)


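3. Data preparation and transformation

Text is tokenized with the tokenizer and the maximum sequence length is limited to 200, so longer sequences are truncated. Shorter sequences are not padded at this stage; DataCollatorWithPadding later pads each batch to the length of its longest sequence, using the padding token (id 0).

The dataset loader and transformation functions from Hugging Face Datasets are used, and the original training set is split 8 : 2 into training and validation sets.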
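
The 8 : 2 split amounts to shuffling the 25,000 training indices with a fixed seed and cutting the result at position 20,000. A minimal sketch of that logic, using the standard library's `random` as a stand-in for `Dataset.shuffle` (the function name `split_indices` is hypothetical, and the actual permutation differs from the one Datasets produces):

```python
import random

def split_indices(n_total, n_train, seed=42):
    """Shuffle 0..n_total-1 once with a fixed seed, then cut into two parts."""
    order = list(range(n_total))
    random.Random(seed).shuffle(order)  # stand-in for Dataset.shuffle(seed=...)
    return order[:n_train], order[n_train:]

train_idx, eval_idx = split_indices(25000, 20000)
print(len(train_idx), len(eval_idx))        # 20000 5000
print(set(train_idx).isdisjoint(eval_idx))  # True
```

Because both parts come from the same permutation, no review can appear in both the training and the validation set.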

In [None]:
imdb = load_dataset("imdb")
# Shuffle once with a fixed seed, then cut the 25,000 training samples 8 : 2.
shuffled_train = imdb["train"].shuffle(seed=42)
train_dataset = shuffled_train.select(range(20000))
eval_dataset = shuffled_train.select(range(20000, 25000))
test_dataset = imdb["test"].shuffle(seed=42)




In [None]:
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True, max_length=200)

tokenized_train = train_dataset.map(preprocess_function, batched=True)
tokenized_eval = eval_dataset.map(preprocess_function, batched=True)
tokenized_test = test_dataset.map(preprocess_function, batched=True)


In [None]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
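
The combination of truncation in the tokenizer and padding in the collator can be sketched in plain Python as follows. This is a simplified illustration, not the library code: `truncate` and `pad_batch` are hypothetical names, the token ids are made up, and the real tokenizer keeps the closing special token when it truncates.

```python
def truncate(ids, max_length=200):
    """Keep at most max_length token ids (simplified truncation)."""
    return ids[:max_length]

def pad_batch(batch, pad_id=0):
    """Pad every sequence to the longest one in the batch and build a mask."""
    max_len = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (max_len - len(ids)) for ids in batch]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in batch]
    return input_ids, attention_mask

batch = [[101, 7592, 102], [101, 2023, 3185, 2001, 102]]  # made-up token ids
input_ids, attention_mask = pad_batch([truncate(ids) for ids in batch])
print(input_ids)       # [[101, 7592, 102, 0, 0], [101, 2023, 3185, 2001, 102]]
print(attention_mask)  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

Padding per batch rather than to a fixed length of 200 avoids spending computation on padding tokens when a batch contains only short reviews.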

4. Evaluation metric

Accuracy is used as the evaluation metric.

In [None]:
def compute_metrics(eval_pred):
   load_accuracy = load_metric("accuracy")
   logits, labels = eval_pred
   predictions = np.argmax(logits, axis=-1)
   accuracy = load_accuracy.compute(predictions=predictions, references=labels)["accuracy"]
   return {"accuracy": accuracy}
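
What the metric computes can be shown without any library: take the index of the larger logit as the predicted class and count matches with the labels. The sketch below uses toy logits and labels invented for illustration; `accuracy_from_logits` is a hypothetical helper, not part of the `datasets` API.

```python
def accuracy_from_logits(logits, labels):
    """Argmax over each row of logits, then the fraction of correct predictions."""
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)

# Toy values, made up for illustration.
toy_logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 0.5], [1.2, 0.8]]
toy_labels = [0, 1, 0, 0]
print(accuracy_from_logits(toy_logits, toy_labels))  # 0.75
```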

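5. Model architecture

The model is Hugging Face's DistilBertForSequenceClassification, which consists of a DistilBERT encoder and a classifier (MLP). DistilBERT is used instead of the original BERT because it is lighter and faster to train, with only a small loss in performance. We use the library's classification head rather than writing our own. Note that, as the loading message below shows, the head weights (pre_classifier, classifier) are newly initialized rather than pretrained, so only the encoder benefits from pretraining. With a single learning rate for both parts there is a trade-off: too small a rate can leave the head underfit, while too large a rate can overfit the encoder.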
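
To give a sense of how small the classifier is relative to the encoder, the sketch below counts the head's parameters. It assumes a hidden size of 768 for distilbert-base-uncased and a head made of two linear layers (768 to 768, then 768 to 2), which is consistent with the pre_classifier and classifier names in the loading message; the total is the trainable-parameter count printed later in the training log.

```python
hidden_size = 768  # assumed hidden size of distilbert-base-uncased
num_labels = 2

def linear_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

pre_classifier = linear_params(hidden_size, hidden_size)
classifier = linear_params(hidden_size, num_labels)
head_params = pre_classifier + classifier

total_params = 66_955_010  # "Number of trainable parameters" from the training log
print(head_params)                                   # 592130
print(total_params - head_params)                    # 66362880
print(round(100 * head_params / total_params, 2))    # 0.88
```

Under these assumptions the head accounts for less than 1% of the trainable parameters, so almost all of the fine-tuning cost lies in the encoder.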

In [None]:
model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)


Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.bias', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_projector.bias', 'vocab_transform.weight', 'vocab_layer_norm.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'classifier.weight', 'classifi

6. Training setup

The Trainer provided by Hugging Face is used for training, validation and testing.

In [None]:
training_args = TrainingArguments(
   output_dir="finetune",
   learning_rate=2e-5,
   per_device_train_batch_size=16,
   per_device_eval_batch_size=16,
   num_train_epochs=1,
   weight_decay=0.01,
   save_strategy="no",
   logging_strategy="no"
)

trainer = Trainer(
   model=model,
   args=training_args,
   train_dataset=tokenized_train,
   eval_dataset=tokenized_eval,
   tokenizer=tokenizer,
   data_collator=data_collator,
   compute_metrics=compute_metrics,
)
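
The cell above does not set a scheduler or warmup, so the Trainer presumably falls back to its defaults. Assuming those are a linear decay to zero with no warmup (this may differ across Transformers versions), the learning rate over the 20000 / 16 = 1250 steps of one epoch would look roughly like this; `linear_decay_lr` is a hypothetical helper, not a library function.

```python
def linear_decay_lr(step, base_lr=2e-5, total_steps=1250, warmup_steps=0):
    """Linear warmup (if any) followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

for step in (0, 625, 1250):
    print(step, linear_decay_lr(step))
```

Under this assumption the effective learning rate averages about half of the nominal value, which is worth keeping in mind when comparing 2e-5 and 5e-5.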

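7. Model training and validation

The validation accuracy is good (0.8934), which shows that the basic pipeline works. Because training is time-consuming, the model is fine-tuned for only 1 epoch.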

In [None]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 20000
  Num Epochs = 1
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 1250
  Number of trainable parameters = 66955010






Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=1250, training_loss=0.3145908935546875, metrics={'train_runtime': 345.7176, 'train_samples_per_second': 57.851, 'train_steps_per_second': 3.616, 'total_flos': 1034901552000000.0, 'train_loss': 0.3145908935546875, 'epoch': 1.0})
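
The step count and throughput in the log follow directly from the dataset size, batch size and runtime. A small sketch of the arithmetic (the helper `optimization_steps` is hypothetical and simplifies how the Trainer counts steps):

```python
import math

def optimization_steps(num_examples, batch_size, epochs=1, grad_accum=1):
    """Batches per epoch, divided by accumulation steps, times epochs."""
    batches_per_epoch = math.ceil(num_examples / batch_size)
    return math.ceil(batches_per_epoch / grad_accum) * epochs

steps = optimization_steps(20000, 16)
print(steps)  # 1250

train_runtime = 345.7176  # seconds, from the TrainOutput above
print(round(20000 / train_runtime, 1))  # 57.9 samples per second
print(round(steps / train_runtime, 2))  # 3.62 steps per second
```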

In [None]:
acc = trainer.evaluate()['eval_accuracy']
print(f'evaluate acc: {acc}')

The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 5000
  Batch size = 16


evaluate acc: 0.8934


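#### Hyperparameter Search

A small hyperparameter search, evaluated on the validation set, is used to study how the learning rate (lr) and batch size (bs) affect model performance.

From the experiments below, lr = 5e-5 and bs = 16 give the best validation accuracy (0.8942), although the margins over the alternatives (0.8906 and 0.892) are small and each setting was run only once.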

In [None]:
import transformers
transformers.utils.logging.set_verbosity_error()

In [None]:
lrs = [2e-5, 5e-5]
for lr in lrs:
    model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)
    training_args = TrainingArguments(
      output_dir="finetune",
      learning_rate=lr,
      per_device_train_batch_size=16,
      per_device_eval_batch_size=16,
      num_train_epochs=1,
      weight_decay=0.01,
      save_strategy="no",
      logging_strategy="no",
      log_level='error'
    )
    trainer = Trainer(
      model=model,
      args=training_args,
      train_dataset=tokenized_train,
      eval_dataset=tokenized_eval,
      tokenizer=tokenizer,
      data_collator=data_collator,
      compute_metrics=compute_metrics,
    )
    trainer.train()
    acc = trainer.evaluate()['eval_accuracy']
    print(f'evaluate acc: {acc}')



{'train_runtime': 354.3221, 'train_samples_per_second': 56.446, 'train_steps_per_second': 3.528, 'train_loss': 0.3227504150390625, 'epoch': 1.0}
{'eval_loss': 0.27021580934524536, 'eval_accuracy': 0.8906, 'eval_runtime': 30.073, 'eval_samples_per_second': 166.262, 'eval_steps_per_second': 10.408, 'epoch': 1.0}
evaluate acc: 0.8906




{'train_runtime': 350.0832, 'train_samples_per_second': 57.129, 'train_steps_per_second': 3.571, 'train_loss': 0.31419267578125, 'epoch': 1.0}
{'eval_loss': 0.2689257562160492, 'eval_accuracy': 0.8942, 'eval_runtime': 29.8571, 'eval_samples_per_second': 167.464, 'eval_steps_per_second': 10.483, 'epoch': 1.0}
evaluate acc: 0.8942


In [None]:
lr = 5e-5
batch_sizes = [8, 16]
for bs in batch_sizes:
    model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)
    training_args = TrainingArguments(
      output_dir="finetune",
      learning_rate=lr,
      per_device_train_batch_size=bs,
      per_device_eval_batch_size=bs,
      num_train_epochs=1,
      weight_decay=0.01,
      save_strategy="no",
      logging_strategy="no",
      log_level='error'
    )
    trainer = Trainer(
      model=model,
      args=training_args,
      train_dataset=tokenized_train,
      eval_dataset=tokenized_eval,
      tokenizer=tokenizer,
      data_collator=data_collator,
      compute_metrics=compute_metrics,
    )
    trainer.train()
    acc = trainer.evaluate()['eval_accuracy']
    print(f'evaluate acc: {acc}')



{'train_runtime': 377.1793, 'train_samples_per_second': 53.025, 'train_steps_per_second': 6.628, 'train_loss': 0.3465671875, 'epoch': 1.0}
{'eval_loss': 0.2975882291793823, 'eval_accuracy': 0.892, 'eval_runtime': 29.612, 'eval_samples_per_second': 168.85, 'eval_steps_per_second': 21.106, 'epoch': 1.0}
evaluate acc: 0.892




{'train_runtime': 349.8756, 'train_samples_per_second': 57.163, 'train_steps_per_second': 3.573, 'train_loss': 0.31419267578125, 'epoch': 1.0}
{'eval_loss': 0.2689257562160492, 'eval_accuracy': 0.8942, 'eval_runtime': 29.9772, 'eval_samples_per_second': 166.794, 'eval_steps_per_second': 10.441, 'epoch': 1.0}
evaluate acc: 0.8942
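
The search results can be collected and compared in a few lines. The sketch below picks the best setting and, as a rough yardstick, estimates the binomial standard error of an accuracy measured on 5,000 validation samples. The standard-error formula assumes independent samples and is an addition for illustration, not part of the original experiment.

```python
import math

# (learning rate, batch size) -> validation accuracy, copied from the logs above.
results = {
    (2e-5, 16): 0.8906,
    (5e-5, 16): 0.8942,
    (5e-5, 8): 0.8920,
}
n_eval = 5000

best = max(results, key=results.get)
print(best)  # (5e-05, 16)

def accuracy_std_error(acc, n):
    """Binomial standard error of an accuracy estimated from n samples."""
    return math.sqrt(acc * (1 - acc) / n)

print(round(accuracy_std_error(results[best], n_eval), 4))        # 0.0043
print(round((results[best] - results[(2e-5, 16)]) * n_eval))      # 18
```

The gap between the best and worst setting is 18 validation samples, which is within one standard error, so the ranking should be read as indicative rather than conclusive.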


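#### Final Test

Fortunately, the last run of the hyperparameter search is also the one with the best validation accuracy, so the trainer it leaves behind can be used for testing directly. The final test accuracy is 0.89996, higher than the 0.85 obtained earlier with the RNN model, which shows the strength of the pretrained BERT model. Because training BERT is time-consuming, we did not fine-tune for more epochs or search more hyperparameters; longer training and a broader search would likely improve performance further.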

In [21]:
trainer.predict(tokenized_test)

PredictionOutput(predictions=array([[-2.9077165 ,  2.4661252 ],
       [-1.0265001 ,  0.6936195 ],
       [ 1.8306328 , -2.1708558 ],
       ...,
       [-0.43584192,  0.26829585],
       [-0.5052705 ,  0.2864788 ],
       [ 0.9269244 , -1.1421107 ]], dtype=float32), label_ids=array([1, 1, 0, ..., 0, 1, 0]), metrics={'test_loss': 0.2540622353553772, 'test_accuracy': 0.89996, 'test_runtime': 139.2814, 'test_samples_per_second': 179.493, 'test_steps_per_second': 11.222})
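
The `predictions` array holds raw logits, not probabilities. The sketch below applies a softmax to the six rows that are visible in the output above (the first three and last three; the rest are elided) and compares the predicted class with the corresponding visible entries of `label_ids`, assuming the two arrays are elided at the same positions.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

# Visible rows of `predictions` and `label_ids` from the output above.
logits = [
    [-2.9077165, 2.4661252],
    [-1.0265001, 0.6936195],
    [1.8306328, -2.1708558],
    [-0.43584192, 0.26829585],
    [-0.5052705, 0.2864788],
    [0.9269244, -1.1421107],
]
labels = [1, 1, 0, 0, 1, 0]

probs = [softmax(row) for row in logits]
preds = [p.index(max(p)) for p in probs]
print(preds)   # [1, 1, 0, 1, 1, 0]
print(labels)  # [1, 1, 0, 0, 1, 0]

# Number of correct test predictions implied by the reported accuracy.
print(round(0.89996 * 25000))  # 22499
```

Five of the six visible rows are classified correctly; the misclassified one has logits close to zero, meaning the model was not confident about it.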