# E2. SST2 classification with BERT + prompt

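&emsp; 1 &ensp; Introduction: a brief look at `prompt-based model`s and how they combine with `fastNLP`

&emsp; 2 &ensp; Preparation: an overview of `P-Tuning v2` and building the `P-Tuning v2` model

&emsp; 3 &ensp; Training: loading the `tokenizer`, preprocessing the `dataset`, training and analysis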

### 1. Introduction: prompt-based models and how they combine with fastNLP

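&emsp; This example uses the `SST2` dataset from the `GLUE` benchmark and applies `prompt-based tuning`

&emsp; &emsp; to the `bert-base-uncased` model for binary sentiment classification. Before that, it

&emsp; &emsp; briefly reviews research on prompt learning and the advantages of combining it with `fastNLP v0.8`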

**`prompt`**: a **prompt, or cue phrase**; the term first appeared in **`PET`**

&emsp; 

**`prompt-based tuning`**: **prompt-based fine-tuning** (description)

&emsp; **`prompt-based model`**: **a model based on prompts**

**`prompt-based model`**: **a model based on prompts**; examples

&emsp; Example 1: **`P-Tuning v1`**

&emsp; Example 2: **`PromptTuning`**

&emsp; Example 3: **`PrefixTuning`**

&emsp; Example 4: **`SoftPrompt`**

Advantages of implementing a `prompt-based model` with `fastNLP v0.8`

&emsp; 

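&emsp; This example again uses the `SST2` dataset from `tutorial-E1`, with `bert-base-uncased` as the base model

&emsp; &emsp; The implementation below concatenates a continuous `prompt` with the input of the `model` to solve the `SST2` binary classification task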
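
&emsp; The idea of concatenating a continuous prompt with the model input can be sketched on toy tensors. The sizes and the names `token_embedding` and `prompt_embedding` below are assumptions chosen for illustration only; they are not the tutorial's real settings.

```python
import torch
import torch.nn as nn

# Toy sizes, chosen only for illustration (assumptions, not the tutorial's settings).
vocab_size, hidden_size, prompt_len = 100, 16, 4
batch_size, seq_len = 2, 5

token_embedding = nn.Embedding(vocab_size, hidden_size)    # stands in for the backbone's embedding table
prompt_embedding = nn.Embedding(prompt_len, hidden_size)   # the trainable continuous prompt

input_ids = torch.randint(0, vocab_size, (batch_size, seq_len))
attention_mask = torch.ones(batch_size, seq_len, dtype=torch.long)

# Look up the same prompt vectors for every example in the batch.
prompt_ids = torch.arange(prompt_len).unsqueeze(0).expand(batch_size, -1)
prompts = prompt_embedding(prompt_ids)                      # (batch, prompt_len, hidden)

# Prepend the prompt to the token embeddings and extend the mask to match.
inputs_embeds = torch.cat((prompts, token_embedding(input_ids)), dim=1)
prompt_mask = torch.ones(batch_size, prompt_len, dtype=attention_mask.dtype)
extended_mask = torch.cat((prompt_mask, attention_mask), dim=1)

print(inputs_embeds.shape, extended_mask.shape)
```

&emsp; The sequence length grows from `seq_len` to `prompt_len + seq_len`, so the attention mask has to be extended by the same amount.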

In [1]:
import torch
import torch.nn as nn
from torch.optim import AdamW
from torch.utils.data import DataLoader, Dataset

import transformers
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification

import sys
sys.path.append('..')  # add the parent directory to the import path

import fastNLP
from fastNLP import Trainer
from fastNLP.core.metrics import Accuracy

print(transformers.__version__)

task = 'sst2'
model_checkpoint = 'bert-base-uncased'

4.18.0


### 2. Preparation: an overview of P-Tuning v2 and building the P-Tuning v2 model

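&emsp; This example uses `P-Tuning v2` as a case study for combining `prompt-based tuning` with `fastNLP v0.8`

&emsp; &emsp; It first summarizes the idea behind the `P-Tuning v2` paper, and then moves on to the implementation with `fastNLP v0.8`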

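`P-Tuning v2` comes from the paper [Prompt Tuning Can Be Comparable to Fine-tuning Universally Across Scales and Tasks](https://arxiv.org/pdf/2110.07602.pdf)

&emsp; Its main contribution is to build on deep prompt learning methods such as `PrefixTuning` and improve their performance on `NLU` tasks such as classification and sequence labeling,

&emsp; &emsp; so that medium-sized models, mainly those with `100M-1B` parameters, reach results comparable to full-parameter fine-tuning

&emsp; Its structure is shown in the figure below: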

<img src="./figures/E2-fig-p-tuning-v2.png" width="60%" height="60%" align="center"></img>
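
&emsp; As a rough illustration of what "deep" prompts mean, the sketch below gives every layer of a toy attention stack its own trainable prefix keys and values. The class `PrefixAttention` and all sizes are hypothetical; this is a simplified single-head sketch under those assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrefixAttention(nn.Module):
    """Toy single-head self-attention with trainable prefix keys and values (hypothetical)."""

    def __init__(self, hidden_size, prefix_len):
        super().__init__()
        self.q = nn.Linear(hidden_size, hidden_size)
        self.k = nn.Linear(hidden_size, hidden_size)
        self.v = nn.Linear(hidden_size, hidden_size)
        # One set of prefix vectors per layer: this is what makes the prompt "deep".
        self.prefix_k = nn.Parameter(torch.randn(prefix_len, hidden_size) * 0.02)
        self.prefix_v = nn.Parameter(torch.randn(prefix_len, hidden_size) * 0.02)

    def forward(self, hidden_states):
        batch_size = hidden_states.shape[0]
        q = self.q(hidden_states)
        # Prepend the prefix to the keys and values; the queries are left unchanged.
        k = torch.cat((self.prefix_k.expand(batch_size, -1, -1), self.k(hidden_states)), dim=1)
        v = torch.cat((self.prefix_v.expand(batch_size, -1, -1), self.v(hidden_states)), dim=1)
        scores = q @ k.transpose(1, 2) / (q.shape[-1] ** 0.5)
        return F.softmax(scores, dim=-1) @ v

# Three toy layers, each with its own prefix of length 4.
layers = nn.ModuleList([PrefixAttention(hidden_size=16, prefix_len=4) for _ in range(3)])
x = torch.randn(2, 5, 16)
for layer in layers:
    x = layer(x)

n_prefix = sum(p.numel() for name, p in layers.named_parameters() if 'prefix' in name)
print(x.shape, n_prefix)
```

&emsp; Because the prefix only enters through the keys and values, the output keeps the original sequence length, unlike prompts that are concatenated at the input layer.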

In [3]:
class SeqClsModel(nn.Module):
    def __init__(self, model_checkpoint, num_labels, pre_seq_len):
        super().__init__()
        self.num_labels = num_labels
        self.back_bone = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, 
                                                                            num_labels=num_labels)
        self.embeddings = self.back_bone.get_input_embeddings()

        # Freeze every backbone parameter (this includes the classification head),
        # so that only the prompt embeddings defined below are trained.
        for param in self.back_bone.parameters():
            param.requires_grad = False
        
        # Trainable continuous prompt: pre_seq_len vectors of the backbone's embedding size.
        self.pre_seq_len = pre_seq_len
        self.prefix_tokens = torch.arange(self.pre_seq_len).long()
        self.prefix_encoder = nn.Embedding(self.pre_seq_len, self.embeddings.embedding_dim)
    
    def get_prompt(self, batch_size):
        # Repeat the prompt indices for each example and look up their embeddings.
        prefix_tokens = self.prefix_tokens.unsqueeze(0).expand(batch_size, -1).to(self.back_bone.device)
        prompts = self.prefix_encoder(prefix_tokens)
        return prompts

    def forward(self, input_ids, attention_mask, labels=None):
        
        batch_size = input_ids.shape[0]
        raw_embedding = self.embeddings(input_ids)
        
        # Prepend the prompts to the word embeddings (at the input layer only)
        # and extend the attention mask so that the prompt positions are attended to.
        prompts = self.get_prompt(batch_size=batch_size)
        inputs_embeds = torch.cat((prompts, raw_embedding), dim=1)
        prefix_attention_mask = torch.ones(batch_size, self.pre_seq_len, 
                                           dtype=attention_mask.dtype, device=attention_mask.device)
        attention_mask = torch.cat((prefix_attention_mask, attention_mask), dim=1)

        outputs = self.back_bone(inputs_embeds=inputs_embeds, 
                                 attention_mask=attention_mask, labels=labels)
        return outputs

    def train_step(self, input_ids, attention_mask, labels):
        loss = self(input_ids, attention_mask, labels).loss
        return {'loss': loss}

    def evaluate_step(self, input_ids, attention_mask, labels):
        # Only the logits are needed here, so the labels are not passed to the backbone.
        pred = self(input_ids, attention_mask).logits
        pred = torch.argmax(pred, dim=-1)
        return {'pred': pred, 'target': labels}
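
To see how few parameters are actually trained when the backbone is frozen, a small helper can count trainable and frozen parameters. The helper `count_parameters` and the toy module below are hypothetical stand-ins (a single linear layer replaces the real backbone), used only so the sketch runs without downloading a checkpoint.

```python
import torch.nn as nn

def count_parameters(module):
    """Return (trainable, frozen) parameter counts of a module."""
    trainable = sum(p.numel() for p in module.parameters() if p.requires_grad)
    frozen = sum(p.numel() for p in module.parameters() if not p.requires_grad)
    return trainable, frozen

# Toy stand-in: a frozen "backbone" plus a trainable prompt table of 20 x 768.
backbone = nn.Linear(768, 2)
for param in backbone.parameters():
    param.requires_grad = False
prefix_encoder = nn.Embedding(20, 768)
toy_model = nn.ModuleDict({'back_bone': backbone, 'prefix_encoder': prefix_encoder})

trainable, frozen = count_parameters(toy_model)
print(trainable, frozen)  # 15360 1538
```

Applied to an instance of `SeqClsModel`, the same helper would report the prompt embeddings as the only trainable parameters.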

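Next, create the model instance with the number of classes, and initialize the optimizer with `torch.optim.AdamW`

&emsp; According to the `P-Tuning v2` paper: *Generally, simple classification tasks prefer shorter prompts (less than 20)*

&emsp; Here `pre_seq_len` is set to `20` and the learning rate is adjusted accordingly; everything else matches `tutorial-E1`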
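
&emsp; Since frozen parameters never receive gradients, one optional variation is to hand only the trainable parameters to the optimizer. The sketch below shows this on a toy module; `toy_model` and its layer sizes are assumptions for illustration, and passing all parameters (as the tutorial does) also works.

```python
import torch.nn as nn
from torch.optim import AdamW

# Toy stand-in for a frozen backbone plus a trainable prompt table (hypothetical sizes).
toy_model = nn.ModuleDict({
    'back_bone': nn.Linear(768, 2),
    'prefix_encoder': nn.Embedding(20, 768),
})
for param in toy_model['back_bone'].parameters():
    param.requires_grad = False

# Keep only the parameters that will actually be updated.
trainable_params = [p for p in toy_model.parameters() if p.requires_grad]
optimizer = AdamW(params=trainable_params, lr=1e-2)

n_optimized = sum(p.numel() for group in optimizer.param_groups for p in group['params'])
print(len(trainable_params), n_optimized)  # 1 15360
```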

In [4]:
model = SeqClsModel(model_checkpoint=model_checkpoint, num_labels=2, pre_seq_len=20)

optimizers = AdamW(params=model.parameters(), lr=1e-2)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

### 3. Training: loading the tokenizer, preprocessing the dataset, training and analysis

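&emsp; This example reuses the dataset from `tutorial-E1`, namely the `SST2` dataset from the `GLUE` benchmark

&emsp; &emsp; with `bert-base-uncased` as the base model, tuned with the `P-Tuning v2` approach

&emsp; &emsp; The dataset loading code is shown below and is essentially the same as in `tutorial-E1`

First, load the dataset with `datasets.load_dataset` and build a `tokenizer` instance with `transformers.AutoTokenizer`,

&emsp; then use `dataset.map` with the `tokenizer` to convert each sentence into a sequence of token ids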

In [5]:
from datasets import load_dataset

dataset = load_dataset('glue', task)

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

Reusing dataset glue (/remote-home/xrliu/.cache/huggingface/datasets/glue/sst2/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)


  0%|          | 0/3 [00:00<?, ?it/s]

In [6]:
def preprocess_function(examples):
    return tokenizer(examples['sentence'], truncation=True)

encoded_dataset = dataset.map(preprocess_function, batched=True)

Loading cached processed dataset at /remote-home/xrliu/.cache/huggingface/datasets/glue/sst2/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-18ec0e709f05e61e.arrow
Loading cached processed dataset at /remote-home/xrliu/.cache/huggingface/datasets/glue/sst2/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-e2f02ee7442ad73e.arrow


  0%|          | 0/2 [00:00<?, ?ba/s]

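Then, define the `SeqClsDataset` class and the collate function `collate_fn`, both reused from `tutorial-E1`

&emsp; As before, note that **the return value of `__getitem__` must correspond to the fields of the original dataset**,

&emsp; and **the return value of `collate_fn` must match the parameters of `train_step` and `evaluate_step`**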

In [7]:
class SeqClsDataset(Dataset):
    def __init__(self, dataset):
        Dataset.__init__(self)
        self.dataset = dataset

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, item):
        item = self.dataset[item]
        return item['input_ids'], item['attention_mask'], [item['label']] 

def collate_fn(batch):
    # Gather each field and track the longest sequence per field.
    input_ids, atten_mask, labels = [], [], []
    max_length = [0] * 3
    for each_item in batch:
        input_ids.append(each_item[0])
        max_length[0] = max(max_length[0], len(each_item[0]))
        atten_mask.append(each_item[1])
        max_length[1] = max(max_length[1], len(each_item[1]))
        labels.append(each_item[2])
        max_length[2] = max(max_length[2], len(each_item[2]))

    # Right-pad every sequence with 0 up to the longest one in its field.
    for i in range(3):
        each = (input_ids, atten_mask, labels)[i]
        for item in each:
            item.extend([0] * (max_length[i] - len(item)))
    # The keys must match the parameter names of train_step and evaluate_step.
    return {'input_ids': torch.cat([torch.tensor([item]) for item in input_ids], dim=0),
            'attention_mask': torch.cat([torch.tensor([item]) for item in atten_mask], dim=0),
            'labels': torch.cat([torch.tensor(item) for item in labels], dim=0)}
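
The same padding can also be written more compactly with `torch.nn.utils.rnn.pad_sequence`. The function `collate_fn_padded` below is a hypothetical alternative, not the tutorial's own code; it assumes each item is an `(input_ids, attention_mask, [label])` tuple as returned by `SeqClsDataset.__getitem__`, and the token ids in `toy_batch` are made up.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

def collate_fn_padded(batch):
    """Hypothetical alternative to collate_fn that produces the same output keys."""
    input_ids = pad_sequence([torch.tensor(item[0]) for item in batch],
                             batch_first=True, padding_value=0)
    attention_mask = pad_sequence([torch.tensor(item[1]) for item in batch],
                                  batch_first=True, padding_value=0)
    labels = torch.tensor([item[2][0] for item in batch])
    return {'input_ids': input_ids, 'attention_mask': attention_mask, 'labels': labels}

# Two toy examples of different lengths (token ids are made up).
toy_batch = [([101, 7592, 102], [1, 1, 1], [1]),
             ([101, 102], [1, 1], [0])]
out = collate_fn_padded(toy_batch)
print(out['input_ids'].tolist())       # [[101, 7592, 102], [101, 102, 0]]
print(out['attention_mask'].tolist())  # [[1, 1, 1], [1, 1, 0]]
print(out['labels'].tolist())          # [1, 0]
```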

Next, wrap the `tokenizer`-processed training and validation sets in `SeqClsDataset` and split them into batches

In [9]:
dataset_train = SeqClsDataset(encoded_dataset['train'])
dataloader_train = DataLoader(dataset=dataset_train, 
                              batch_size=32, shuffle=True, collate_fn=collate_fn)
dataset_valid = SeqClsDataset(encoded_dataset['validation'])
dataloader_valid = DataLoader(dataset=dataset_valid, 
                              batch_size=32, shuffle=False, collate_fn=collate_fn)

&emsp;

In [None]:
trainer = Trainer(
    model=model,
    driver='torch',
    device=[0, 1],
    n_epochs=20,
    optimizers=optimizers,
    train_dataloader=dataloader_train,
    evaluate_dataloaders=dataloader_valid,
    metrics={'acc': Accuracy()}
)

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
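
The warning above names its own remedy: set the `TOKENIZERS_PARALLELISM` environment variable explicitly. A minimal sketch, assuming it is placed near the top of the notebook before the tokenizer is first used:

```python
import os

# Disable tokenizer-level parallelism explicitly to silence the fork warning.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```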


&emsp;

In [None]:
trainer.run(num_eval_batch_per_dl=10)

&emsp;

In [None]:
trainer.evaluator.run()
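
For reference, the quantity reported as `acc` boils down to the fraction of predictions that equal their targets. The function `accuracy_from_logits` below is a hypothetical plain-PyTorch sketch of that computation on made-up logits; it is not fastNLP's `Accuracy` implementation.

```python
import torch

def accuracy_from_logits(logits, target):
    """Fraction of examples whose highest-scoring class equals the target."""
    pred = torch.argmax(logits, dim=-1)
    return (pred == target).float().mean().item()

# Made-up logits for four examples and two classes.
logits = torch.tensor([[2.0, 0.5], [0.1, 1.2], [0.3, 0.2], [0.0, 3.0]])
target = torch.tensor([0, 1, 1, 1])
acc = accuracy_from_logits(logits, target)
print(acc)  # 0.75
```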