# DeepLearning Assignment 3 Lab Report
# SA22221042 汪泱泱

## 1. Experimental Environment

GPU TITAN Xp  
CUDA 10.1  
python 3.7.13  
torch 1.8.1  
torchtext 0.6.0  
spacy 3.4.3  
transformers 4.25.1

## 2. Experimental Procedure

In [1]:
import random
import sys
import time
import torch
import torch.nn as nn
import torchtext
import tqdm
from transformers import AutoTokenizer, AutoModel

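Select the BERT model. We ran experiments with two uncased English pretrained BERT models (`'bert-base-uncased'` and `'bert-large-uncased'`). The walkthrough below uses `'bert-base-uncased'` and loads its matching tokenizer and model.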

In [2]:
pretrained_model_name = 'bert-base-uncased'

In [3]:
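# Note: do_lower_case=False appears to switch off the tokenizer's lowercasing even though
# the checkpoint is uncased; the results reported below were obtained with this setting.
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name, do_lower_case=False)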
bertModel = AutoModel.from_pretrained(pretrained_model_name)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Preprocess the dataset.
The IMDB "Large Movie Review Dataset" is a widely used public dataset, and torchtext provides the `torchtext.datasets.imdb.IMDB` interface for it, so we use that interface directly.

Truncate the tokens to the maximum input length that BERT accepts.

In [13]:
MAX_TOKENS = tokenizer.max_model_input_sizes[pretrained_model_name]-2
def tokenize_and_cut(sentence):
    tokens = tokenizer.tokenize(sentence, max_length=MAX_TOKENS, truncation=True)
    return tokens
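
The `-2` in `MAX_TOKENS` leaves room for the `[CLS]` and `[SEP]` markers that the `Field` adds afterwards through `init_token` and `eos_token`. The budget can be sketched without loading a tokenizer. Here the 512 limit is assumed to be the value that `max_model_input_sizes` returns for `bert-base-uncased`, and `wrap_with_special_tokens` is a hypothetical helper used only for illustration.

```python
# Sketch of the length budget: BERT accepts at most 512 positions,
# two of which are reserved for the [CLS] and [SEP] markers.
MODEL_MAX_LENGTH = 512          # assumed limit for bert-base-uncased
MAX_TOKENS = MODEL_MAX_LENGTH - 2


def wrap_with_special_tokens(tokens, max_tokens=MAX_TOKENS):
    """Truncate to max_tokens, then add the two special markers."""
    return ["[CLS]"] + list(tokens[:max_tokens]) + ["[SEP]"]


long_review = ["word"] * 1000   # stand-in for a tokenized long review
wrapped = wrap_with_special_tokens(long_review)
print(len(wrapped))             # 512
```

A review shorter than the budget passes through unchanged apart from the two added markers.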

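Use `torchtext.data.Field` and `torchtext.data.LabelField` to hold the tokens and the labels, and set the special tokens that mark the start of a sequence, the end of a sequence, padding, and unknown tokens.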

In [14]:
train_text = torchtext.data.Field(batch_first=True,
                            use_vocab=False,
                            tokenize = tokenize_and_cut,
                            preprocessing = tokenizer.convert_tokens_to_ids,
                            init_token=tokenizer.cls_token_id,
                            eos_token=tokenizer.sep_token_id,
                            pad_token=tokenizer.pad_token_id,
                            unk_token=tokenizer.unk_token_id)
train_label = torchtext.data.LabelField(dtype = torch.float)

In [15]:
train_data, test_data = torchtext.datasets.imdb.IMDB.splits(train_text, train_label)

Split off a validation set, with training set : validation set = 4:1.

In [16]:
SEED=20230102
random.seed(SEED)  # random.seed() returns None, so seed first and pass the RNG state explicitly
train_data, valid_data = train_data.split(split_ratio=0.8, random_state=random.getstate())
train_label.build_vocab(train_data)
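
The idea of a seeded 4:1 split can be illustrated with the standard library alone. This is a minimal sketch, not torchtext's own shuffling algorithm, so it will not reproduce the exact partition used in the experiments. `split_indices` is a hypothetical helper.

```python
import random


def split_indices(n_examples, train_ratio=0.8, seed=20230102):
    """Shuffle indices with a private, seeded RNG and cut at train_ratio."""
    rng = random.Random(seed)           # local RNG, global state untouched
    indices = list(range(n_examples))
    rng.shuffle(indices)
    cut = int(round(train_ratio * n_examples))
    return indices[:cut], indices[cut:]


train_idx, valid_idx = split_indices(25000)
print(len(train_idx), len(valid_idx))   # 20000 5000
```

Calling the function again with the same seed returns the same partition, which is what makes validation numbers comparable across runs.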

In [18]:
len(train_data), len(valid_data), len(test_data)

(20000, 5000, 25000)

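Define the sentiment analysis model. The design is simple: BERT model + mean pooling + classification head.  
The classification head is a single fully connected layer followed by a sigmoid, which maps the output into $(0,1)$ so that binary cross-entropy can be computed.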

In [26]:
class SentimentAnalysisModel(nn.Module):
    def __init__(self, bertModel):
        super().__init__()
        self.bertModel = bertModel
        self.classifier = torch.nn.Sequential(nn.Linear(bertModel.config.hidden_size, 1), nn.Sigmoid())
        
    def forward(self, text):
        embedded = self.bertModel(text)[0]
        avg = embedded.mean(dim=1)
        predict = self.classifier(avg)
        return predict
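
Written out, the forward pass above computes the following. The symbols are introduced here for illustration: $h_t$ is the final-layer BERT vector at position $t$, $T$ is the (padded) sequence length, $w$ and $b$ are the weights of the linear layer, and $y \in \{0, 1\}$ is the label used by the binary cross-entropy.

$$
\bar{h} = \frac{1}{T}\sum_{t=1}^{T} h_t, \qquad
\hat{y} = \sigma\left(w^{\top}\bar{h} + b\right) \in (0, 1), \qquad
\mathcal{L}(\hat{y}, y) = -\bigl[\, y \log \hat{y} + (1 - y)\log(1 - \hat{y}) \,\bigr]
$$

Since the mean runs over all $T$ positions, padded positions contribute to $\bar{h}$ as well.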

Group the data into batches of size `BATCH_SIZE`.

In [27]:
BATCH_SIZE = 32
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
train_iterator, valid_iterator, test_iterator = torchtext.data.BucketIterator.splits(
    (train_data, valid_data, test_data), 
    batch_size = BATCH_SIZE,
    sort_within_batch = True,
    device = device)
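
`BucketIterator` batches examples of similar length together so that little padding is needed. What padding does to a batch can be sketched in plain Python. The token ids below are illustrative, `pad_batch` is a hypothetical helper, and the pad id of 0 is assumed to match `tokenizer.pad_token_id` for `bert-base-uncased`. The sketch also builds a 0/1 mask. The model in this report calls BERT without such a mask, so passing one as `attention_mask` would be a possible refinement rather than something the experiments did.

```python
PAD_ID = 0  # assumed: [PAD] has id 0 in the bert-base-uncased vocabulary


def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad every id sequence to the longest one in the batch."""
    width = max(len(seq) for seq in sequences)
    padded = [seq + [pad_id] * (width - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding
    mask = [[1] * len(seq) + [0] * (width - len(seq)) for seq in sequences]
    return padded, mask


batch = [[101, 2023, 102], [101, 2023, 3185, 2001, 102]]  # illustrative ids
padded, mask = pad_batch(batch)
print(padded[0])  # [101, 2023, 102, 0, 0]
print(mask[0])    # [1, 1, 1, 0, 0]
```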

Instantiate the model.

In [28]:
model = SentimentAnalysisModel(bertModel)

Use AdamW as the optimizer and BCELoss as the loss function.

In [29]:
optimizer = torch.optim.AdamW(model.parameters(), lr = 1e-5, eps = 1e-8)
Loss = torch.nn.BCELoss()
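
For reference, the per-example quantity that `BCELoss` averages can be written with the standard library (ignoring the clamping PyTorch applies to the log terms). The sketch also shows the equivalent form computed directly from the pre-sigmoid logit, which is the idea behind `torch.nn.BCEWithLogitsLoss`. That is a common alternative, not what this report uses. The function names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bce_from_prob(p, y):
    """Binary cross-entropy of a probability p against a label y in {0, 1}."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def bce_from_logit(z, y):
    """Same loss computed from the raw logit z in a numerically stable form."""
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))


z, y = 2.0, 1.0
print(abs(bce_from_prob(sigmoid(z), y) - bce_from_logit(z, y)) < 1e-12)  # True
```

The logit form stays finite for extreme inputs, where `sigmoid` saturates to exactly 0 or 1 and the probability form would take the logarithm of zero.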

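Move the model and the loss function to the GPU. Because the BERT model is large, `DataParallel` is used to spread each batch across multiple GPUs.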

In [30]:
model = nn.DataParallel(model).to(device)
Loss = Loss.to(device)

Define the accuracy function: round each prediction to 0 or 1 and check whether it equals the true label.

In [31]:
def cal_acc(preds, y):
    rounded_preds = torch.round(preds)
    correct = (rounded_preds == y).float()
    acc = correct.sum() / len(correct)
    return acc

Training code (with backpropagation and parameter updates) and the code that measures loss and accuracy on the validation set:

In [32]:
def train(model, iterator, optimizer, Loss):
    epoch_loss = 0
    epoch_acc = 0
    model.train()
    for batch in tqdm.tqdm(iterator, desc='training...', file=sys.stdout):
        optimizer.zero_grad()
        text = batch.text
        predictions = model(text).squeeze(1)
        loss = Loss(predictions, batch.label)
        acc = cal_acc(predictions, batch.label)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
        epoch_acc += acc.item()
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

def evaluate(model, iterator, Loss):
    epoch_loss = 0
    epoch_acc = 0
    model.eval()
    with torch.no_grad():
        for batch in tqdm.tqdm(iterator, desc='evaluating...', file=sys.stdout):
            text = batch.text
            predictions = model(text).squeeze(1)
            loss = Loss(predictions, batch.label)
            acc = cal_acc(predictions, batch.label)
            epoch_loss += loss.item()
            epoch_acc += acc.item()
    return epoch_loss / len(iterator), epoch_acc / len(iterator)
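
Both functions divide the summed per-batch metrics by the number of batches. When the last batch is smaller than the others, this gives its examples more weight than a per-example average would. The sketch below, using hypothetical helper names and made-up accuracies, shows the batch sizes for 5000 validation examples at batch size 32 and the difference between the two averages.

```python
def batch_sizes(n_examples, batch_size):
    """Sizes of the batches when the last partial batch is kept."""
    full, rest = divmod(n_examples, batch_size)
    return [batch_size] * full + ([rest] if rest else [])


def mean_of_batch_means(batch_accs):
    return sum(batch_accs) / len(batch_accs)


def sample_weighted_mean(batch_accs, sizes):
    return sum(a * n for a, n in zip(batch_accs, sizes)) / sum(sizes)


sizes = batch_sizes(5000, 32)
print(len(sizes), sizes[-1])     # 157 8

# Made-up accuracies: every full batch scores 0.9, the small last batch 0.5.
accs = [0.9] * (len(sizes) - 1) + [0.5]
print(mean_of_batch_means(accs) < sample_weighted_mean(accs, sizes))  # True
```

With 157 batches the effect is small, so the numbers reported here should differ from per-example averages only slightly.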

Start training. Earlier experiments showed that the model converges quickly, within 1-2 epochs, so 3 epochs are enough.

In [33]:
epochs = 3
best_valid_loss = float('inf')
for epoch in range(epochs):
    train_loss, train_acc = train(model, train_iterator, optimizer, Loss)
    valid_loss, valid_acc = evaluate(model, valid_iterator, Loss)
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'model.pt')
    print(f'Epoch: {epoch+1:02}')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc:.5f}')
    print(f'\t Val. Loss: {valid_loss:.3f} | Val. Acc: {valid_acc:.5f}')

training...: 100%|██████████| 625/625 [03:55<00:00,  2.66it/s]
evaluating...: 100%|██████████| 157/157 [00:20<00:00,  7.53it/s]
Epoch: 01
	Train Loss: 0.274 | Train Acc: 0.88475
	 Val. Loss: 0.211 | Val. Acc: 0.91979
training...: 100%|██████████| 625/625 [03:44<00:00,  2.79it/s]
evaluating...: 100%|██████████| 157/157 [00:20<00:00,  7.50it/s]
Epoch: 02
	Train Loss: 0.164 | Train Acc: 0.93885
	 Val. Loss: 0.192 | Val. Acc: 0.92615
training...: 100%|██████████| 625/625 [03:43<00:00,  2.79it/s]
evaluating...: 100%|██████████| 157/157 [00:21<00:00,  7.44it/s]
Epoch: 03
	Train Loss: 0.103 | Train Acc: 0.96520
	 Val. Loss: 0.217 | Val. Acc: 0.93033
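
Because the loop saves a checkpoint only when the validation loss improves, `model.pt` ends up holding the epoch-2 weights, even though validation accuracy is highest in epoch 3. The selection rule applied to the logged losses is simply:

```python
# Validation losses copied from the training log above, indexed by epoch.
valid_losses = {1: 0.211, 2: 0.192, 3: 0.217}

best_epoch = min(valid_losses, key=valid_losses.get)
print(best_epoch, valid_losses[best_epoch])   # 2 0.192
```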


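## 3. Hyperparameter Selection

The table below lists the best validation loss for each combination of pretrained BERT model and learning rate, which we used to pick the best configuration.

| BERT model         | Learning rate | Best valid loss |
| ------------------ | ------------- | --------------- |
| bert-base-uncased  | 1e-5          | **0.192**       |
| bert-base-uncased  | 1e-6          | 0.211           |
| bert-large-uncased | 1e-5          | 0.197           |
| bert-large-uncased | 1e-6          | 0.201           |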

## 4. Test Results

Load the best model parameters.

In [None]:
model.load_state_dict(torch.load('model.pt'))
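
The checkpoint was saved from the `nn.DataParallel` wrapper, so its keys carry a `module.` prefix. Loading it back into the wrapped model works as shown above. To load it into a bare `SentimentAnalysisModel` (for example on a single GPU), the prefix would have to be removed first. A minimal sketch with a hypothetical helper and a stand-in dictionary in place of the real checkpoint:

```python
from collections import OrderedDict


def strip_module_prefix(state_dict, prefix="module."):
    """Return a copy of state_dict whose keys no longer start with prefix."""
    return OrderedDict(
        (key[len(prefix):] if key.startswith(prefix) else key, value)
        for key, value in state_dict.items()
    )


# Stand-in for torch.load('model.pt'); real values would be tensors.
saved = OrderedDict([("module.classifier.0.weight", "W"),
                     ("module.classifier.0.bias", "b")])
print(list(strip_module_prefix(saved)))
# ['classifier.0.weight', 'classifier.0.bias']
```

An alternative is to save `model.module.state_dict()` during training, which stores the keys without the prefix in the first place.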

Evaluate the parameters that performed best on the validation set on the test set.

In [34]:
test_loss, test_acc = evaluate(model, test_iterator, Loss)
print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc:.5f}')

evaluating...: 100%|██████████| 782/782 [01:44<00:00,  7.52it/s]
Test Loss: 0.181 | Test Acc: 0.93119


The test accuracy is 0.93119.

## 5. Comparison with the RNN Model

With BCELoss used for both, the two models perform as follows on the test set:

| Model | BCELoss | ACC     |
| ----- | ------- | ------- |
| RNN   | 0.284   | 89.143% |
| BERT  | 0.181   | 93.119% |

In terms of both test loss and test accuracy, BERT is clearly better.

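BERT does have far more parameters, however: bert-base-uncased has about 110M parameters and bert-large-uncased about 335M, so BERT needs more GPU memory and more time per training epoch.  
During training, bert-base-uncased needs 37GB of GPU memory and 5 min per epoch (BATCH_SIZE=32), and bert-large-uncased needs 29GB of GPU memory and 36 min per epoch (BATCH_SIZE=8), whereas the RNN model needs only 11GB of memory and about 40 s per epoch even with BATCH_SIZE=256.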
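
As a rough cross-check of the model sizes, the encoder weights can be counted from the architecture hyperparameters. The sketch assumes the standard configurations of the two checkpoints (vocabulary 30522, 512 positions, 2 segment types; hidden size 768 or 1024, 12 or 24 layers, feed-forward size 3072 or 4096) and leaves out the one-unit classification head. `bert_param_count` is a hypothetical helper.

```python
def bert_param_count(hidden, layers, intermediate,
                     vocab=30522, max_positions=512, type_vocab=2):
    """Count the weights of a BERT encoder with pooler (no task head)."""
    layer_norm = 2 * hidden                                  # scale + shift
    embeddings = (vocab + max_positions + type_vocab) * hidden + layer_norm
    attention = 4 * (hidden * hidden + hidden) + layer_norm  # Q, K, V, output
    feed_forward = (hidden * intermediate + intermediate
                    + intermediate * hidden + hidden + layer_norm)
    pooler = hidden * hidden + hidden
    return embeddings + layers * (attention + feed_forward) + pooler


print(bert_param_count(768, 12, 3072))    # 109482240
print(bert_param_count(1024, 24, 4096))   # 335141888
```

Under these assumptions the counts come to roughly 110M and 335M parameters.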