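# Sentiment Classification with ModelScope and fastNLP

&emsp;&emsp;This tutorial shows step by step how to use `fastNLP` together with `ModelScope` to implement a simple sentiment classification task.

&emsp;&emsp;The dataset used in this tutorial is SST-2, a binary text sentiment classification dataset.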

## 1. Background: DAMO Academy's ModelScope and the StructBERT model

### ModelScope

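&emsp;&emsp;**ModelScope** aims to build a next-generation open-source Model-as-a-Service sharing platform. It offers AI developers flexible, easy-to-use, low-cost, one-stop model services that make applying models simpler. Its services include:

- **A rich collection of pretrained SOTA models**
- **Diverse, open datasets**
- **Model inference in one line of code**
- **Building a domain-specific model in about ten lines of code**
- **A ready-to-use online development platform**
- **Flexible model frameworks and deployment options**
- **Abundant tutorials and technical resources**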

### StructBERT

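&emsp;&emsp;The Chinese Large pretrained StructBERT model is a Chinese natural language understanding model pretrained on Wikipedia data with the masked language model task. StructBERT extends BERT by incorporating language structure information: on top of BERT, two auxiliary tasks are introduced so that the model learns character-level order and sentence-level order, and therefore models language structure better.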

![Model architecture](https://www.modelscope.cn/api/v1/models/damo/nlp_structbert_sentiment-classification_chinese-base/repo?Revision=master&FilePath=model.jpg&View=true)

&emsp;&emsp;For details, see the paper [StructBERT: Incorporating Language Structures into Pre-training for Deep Language Understanding](https://arxiv.org/abs/1908.04577).
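&emsp;&emsp;As a rough illustration of the character-level order task, the sketch below shuffles a short span of tokens and keeps the original tokens as the reconstruction target. The function name `shuffle_span`, the span length of 3, and the fixed seed are assumptions made for this example only; the paper describes the actual sampling procedure.

```python
import random


def shuffle_span(tokens, start, span_len=3, seed=0):
    """Shuffle tokens[start:start + span_len]; return (corrupted, targets).

    `targets` maps each position in the span to its original token, which is
    what a model would be asked to recover.
    """
    rng = random.Random(seed)
    span = tokens[start:start + span_len]
    shuffled = span[:]
    rng.shuffle(shuffled)
    corrupted = tokens[:start] + shuffled + tokens[start + span_len:]
    targets = {start + i: tok for i, tok in enumerate(span)}
    return corrupted, targets


# An invented example sentence, not taken from any dataset.
tokens = ["the", "movie", "was", "really", "good"]
corrupted, targets = shuffle_span(tokens, start=1)
print(corrupted)
print(targets)
```

&emsp;&emsp;Tokens outside the chosen span are left untouched, so the model only has to restore the order inside the span.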


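## 2. Preparation: loading the data and tokenizer, preprocessing the dataset, building the dataloader

&emsp;&emsp;In this tutorial we again use `sst-2` to train the model for sentiment classification. First, `datasets` is used to load `sst-2`; then `fastNLP`'s `DataSet` and `DataBundle` convert the data into the format that fastNLP works with.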



In [1]:
from fastNLP.io import DataBundle
from fastNLP import DataSet
from datasets import load_dataset

train_dataset, val_dataset, test_dataset = load_dataset("glue", "sst2", split=["train", "validation", "test"])
train_dataset, val_dataset, test_dataset = train_dataset[:], val_dataset[:], test_dataset[:]

train_dataset = DataSet(train_dataset)
val_dataset = DataSet(val_dataset)
test_dataset = DataSet(test_dataset)

datasets = {"train": train_dataset, "val": val_dataset, "test": test_dataset}
data_bundle = DataBundle(datasets=datasets)
print(data_bundle)

Found cached dataset glue (/remote-home/kychen/.cache/huggingface/datasets/glue/sst2/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)



In total 3 datasets:
	train has 10000 instances.
	val has 872 instances.
	test has 1000 instances.
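&emsp;&emsp;Slicing a Hugging Face dataset with `[:]` returns a plain dictionary that maps each column name to a list of values, and that column-oriented dictionary is what gets passed to `DataSet` above. The sketch below shows the relationship between this layout and per-sample rows. The helper `columns_to_rows` is hypothetical and the sentences are invented rather than taken from SST-2.

```python
def columns_to_rows(columns):
    """Turn {"field": [v0, v1, ...]} into [{"field": v0}, {"field": v1}, ...]."""
    names = list(columns)
    lengths = {len(columns[name]) for name in names}
    if len(lengths) > 1:
        raise ValueError("all columns must have the same length")
    return [dict(zip(names, values))
            for values in zip(*(columns[name] for name in names))]


# Toy data in the same shape as `train_dataset[:]`.
columns = {
    "sentence": ["a charming film", "dull and lifeless"],
    "label": [1, 0],
    "idx": [0, 1],
}
rows = columns_to_rows(columns)
print(rows[0])
```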




&emsp;&emsp;Next, the `snapshot_download` function downloads the pretrained model to a local directory, and `SequenceClassificationPreprocessor` tokenizes the sentences and extracts features.

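&emsp;&emsp;`SequenceClassificationPreprocessor` is built on the `transformers` tokenizer and handles text classification preprocessing for inputs that follow the `transformers` input format. When data is passed in, the preprocessor tries to map string and int labels to ids, while float labels are left unchanged.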
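&emsp;&emsp;The label handling described above can be pictured with the following sketch. It is not the ModelScope implementation: `map_label` and the example `label2id` table are hypothetical, and returning unknown labels unchanged is an assumption made for the example.

```python
def map_label(label, label2id):
    """Map str/int labels through `label2id`; return float labels unchanged."""
    if isinstance(label, float):
        return label
    if label in label2id:
        return label2id[label]
    # Unknown labels are returned as-is in this sketch (an assumption).
    return label


label2id = {"negative": 0, "positive": 1}  # hypothetical mapping
print(map_label("positive", label2id))
print(map_label(0.5, label2id))
```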

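&emsp;&emsp;Decorating the processing function with `cache_results` caches the processed data, which saves time whenever the data is processed again after the first run. Passing `_refresh=True` when calling the function forces the data to be reprocessed.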
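&emsp;&emsp;To make the caching idea concrete, here is a minimal sketch of a pickle-based caching decorator. It is not fastNLP's `cache_results`: the name `simple_cache` is hypothetical, and this version ignores the function arguments when deciding whether to read the cache. Only the `_refresh` keyword mirrors the usage in this tutorial.

```python
import os
import pickle
import tempfile
from functools import wraps


def simple_cache(cache_path):
    """Cache a function's return value in a pickle file at `cache_path`."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, _refresh=False, **kwargs):
            # Reuse the stored result unless a refresh is requested.
            if not _refresh and os.path.exists(cache_path):
                with open(cache_path, "rb") as f:
                    return pickle.load(f)
            result = func(*args, **kwargs)
            os.makedirs(os.path.dirname(cache_path) or ".", exist_ok=True)
            with open(cache_path, "wb") as f:
                pickle.dump(result, f)
            return result
        return wrapper
    return decorator


calls = []
cache_file = os.path.join(tempfile.mkdtemp(), "cache.pkl")


@simple_cache(cache_file)
def expensive(n):
    calls.append(n)
    return [i * i for i in range(n)]


first = expensive(4)                  # computed and written to disk
second = expensive(4)                 # read back from the cache file
third = expensive(4, _refresh=True)   # recomputed despite the cache
print(first, len(calls))
```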

&emsp;&emsp;The features extracted by tokenization are then added to the datasets in `data_bundle`. They are:

- `input_ids`
- `attention_mask`

In [2]:
from modelscope.hub.snapshot_download import snapshot_download
from modelscope.preprocessors import SequenceClassificationPreprocessor
from fastNLP import cache_results


@cache_results('caches/cache.pkl')
def process_data(data_bundle, model_name):
    tokenizer = SequenceClassificationPreprocessor.from_pretrained(model_name)

    def _process(review):
        encodings_review = tokenizer(review)

        input_ids = encodings_review["input_ids"].squeeze()

        attention_mask = encodings_review["attention_mask"].squeeze()
        return {'input_ids': input_ids, 'attention_mask': attention_mask}

    data_bundle.apply_field_more(_process, field_name='sentence')

    return data_bundle, tokenizer


model_id = 'damo/nlp_structbert_sentiment-classification_chinese-base'
model_checkpoint = snapshot_download(model_id)

data_bundle, tokenizer = process_data(data_bundle, model_checkpoint, _refresh=True)

2022-11-10 16:59:08,521 - modelscope - INFO - PyTorch version 1.13.0 Found.
2022-11-10 16:59:08,524 - modelscope - INFO - Loading ast index from /remote-home/kychen/.cache/modelscope/ast_indexer
2022-11-10 16:59:08,616 - modelscope - INFO - Loading done! Current index file version is 1.0.3, with md5 f03e7e04ee360a0482e2f177143f01c2
2022-11-10 16:59:09,632 - modelscope - INFO - Model revision not specified, use the latest revision: v1.0.0
2022-11-10 16:59:09,864 - modelscope - INFO - File config.json already in cache, skip downloading!
2022-11-10 16:59:09,866 - modelscope - INFO - File configuration.json already in cache, skip downloading!
2022-11-10 16:59:09,867 - modelscope - INFO - File label_mapping.json already in cache, skip downloading!
2022-11-10 16:59:09,868 - modelscope - INFO - File model.jpg already in cache, skip downloading!
2022-11-10 16:59:09,870 - modelscope - INFO - File pytorch_model.bin already in cache, skip downloading!
2022-11-10 16:59:09,870 - modelscope - INFO -


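&emsp;&emsp;Next, **define a collate function `collate_fn` that pads the samples within a batch to the same length**. Note that the function's **return value must be a dictionary** whose **keys correspond to the parameters of the `train_step` and `evaluate_step` functions of the model being trained**. This is the first **parameter matching** mechanism of `fastNLP v1.0`, already emphasized in the basics tutorial [tutorial-0](../basic/fastnlp_tutorial_0.ipynb).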
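&emsp;&emsp;The idea behind parameter matching can be sketched with the standard library alone. The code below is not fastNLP's implementation; `call_with_matching_keys` and the stand-in `train_step` are hypothetical and only show how the keys of a batch dictionary can be lined up with a function's parameter names.

```python
import inspect


def call_with_matching_keys(fn, batch):
    """Call `fn` with only those entries of `batch` whose keys it accepts."""
    params = inspect.signature(fn).parameters
    missing = [
        name for name, p in params.items()
        if p.default is inspect.Parameter.empty and name not in batch
    ]
    if missing:
        raise KeyError(f"batch is missing required keys: {missing}")
    return fn(**{k: v for k, v in batch.items() if k in params})


def train_step(input_ids, attention_mask, labels):
    # Stand-in for a model method; it only reports what it received.
    return {"n_inputs": len(input_ids), "n_labels": len(labels)}


batch = {
    "input_ids": [[101, 102]],
    "attention_mask": [[1, 1]],
    "labels": [1],
    "idx": [0],  # extra key that train_step does not accept
}
out = call_with_matching_keys(train_step, batch)
print(out)
```

&emsp;&emsp;In this sketch an extra key such as `idx` is silently dropped, while a missing required key raises an error.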

In [3]:
from fastNLP import prepare_torch_dataloader
import torch

def collate_fn(batch):
    input_ids, atten_mask, labels = [], [], []
    max_length = [0] * 3

    for each_item in batch:
        input_ids.append(each_item["input_ids"].tolist())
        max_length[0] = max(max_length[0], len(each_item["input_ids"].tolist()))
        atten_mask.append(each_item["attention_mask"].tolist())
        max_length[1] = max(max_length[1], len(each_item["attention_mask"].tolist()))

        labels.append([each_item["label"]])
        max_length[2] = max(max_length[2], len([each_item["label"]]))

    for i in range(3):
        each = (input_ids, atten_mask, labels)[i]
        for item in each:
            item.extend([0] * (max_length[i] - len(item)))

    return {'input_ids': torch.cat([torch.tensor([item]) for item in input_ids], dim=0),
            'attention_mask': torch.cat([torch.tensor([item]) for item in atten_mask], dim=0),
            'labels': torch.cat([torch.tensor(item) for item in labels], dim=0)}
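&emsp;&emsp;The padding step inside `collate_fn` can be seen in isolation in the following plain-Python sketch. The helper `pad_batch` is hypothetical and the token ids are illustrative rather than taken from the dataset.

```python
def pad_batch(sequences, pad_value=0):
    """Right-pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_value] * (max_len - len(seq)) for seq in sequences]


input_ids = [[101, 8233, 102], [101, 163, 8171, 2769, 102]]
attention_mask = [[1] * len(ids) for ids in input_ids]

padded_ids = pad_batch(input_ids)
padded_mask = pad_batch(attention_mask)
print(padded_ids)   # [[101, 8233, 102, 0, 0], [101, 163, 8171, 2769, 102]]
print(padded_mask)  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

&emsp;&emsp;Padding the attention mask with zeros marks the padded positions so that the model can ignore them.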




&emsp;&emsp;Finally, `prepare_torch_dataloader` loads the data: it splits the training and validation sets processed by the `tokenizer` into batches and applies `collate_fn` to each batch. Printing the batches of the validation dataloader shows the padded tensors:

In [None]:
train_dataset = data_bundle.get_dataset('train')
evaluate_dataset = data_bundle.get_dataset('val')
train_dataloader = prepare_torch_dataloader(train_dataset, batch_size=16, shuffle=True, collate_fn=collate_fn)
evaluate_dataloader = prepare_torch_dataloader(evaluate_dataset, batch_size=16, collate_fn=collate_fn)
for i in evaluate_dataloader:
    print(i.values())

dict_values([tensor([[  101,  8233,   112,  ...,     0,     0,     0],
        [  101,   163,  8171,  ...,     0,     0,     0],
        [  101,  8513,  9024,  ...,     0,     0,     0],
        ...,
        [  101,  8997,  8859,  ...,     0,     0,     0],
        [  101, 12311,  8329,  ...,     0,     0,     0],
        [  101,   143,  8373,  ...,     0,     0,     0]]), tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        ...,
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]]), tensor([1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1])])
dict_values([tensor([[  101,  8174, 13152,  ...,     0,     0,     0],
        [  101, 10677, 11582,  ...,     0,     0,     0],
        [  101,   119,   119,  ...,     0,     0,     0],
        ...,
        [  101,   151, 11643,  ...,     0,     0,     0],
        [  101, 13158,  9457,  ...,     0,     0,     0],
        [  101,   119,   119, 

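## 3. Training: loading StructBERT, fastNLP parameter matching, fine-tuning

&emsp;&emsp;The last part is model training. A classification model is built on `damo/nlp_structbert_sentiment-classification_chinese-base` as an `nn.Module`. As with the `tokenizer`, the model is loaded through the `SbertModel` class imported from the `modelscope` library, and the model configuration is loaded through `SbertConfig`.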

In [4]:
from modelscope.models.nlp import SbertModel
from modelscope.models.nlp.structbert import SbertConfig
from torch import nn


class SeqClsModel(nn.Module):
    def __init__(self, num_labels, model_checkpoint):
        super().__init__()
        self.num_labels = num_labels
        self.config = SbertConfig.from_pretrained(model_checkpoint)

        # Load the pretrained weights from the local snapshot directory.
        self.back_bone = SbertModel.from_pretrained(model_checkpoint, config=self.config, num_labels=num_labels)

    def forward(self, input_ids, attention_mask, labels=None):
        output = self.back_bone(input_ids=input_ids,
                                attention_mask=attention_mask, labels=labels)

        return output

    def train_step(self, input_ids, attention_mask, labels):
        # Parameter names must match the keys returned by collate_fn.
        loss = self(input_ids, attention_mask, labels)["loss"]
        return {'loss': loss}

    def evaluate_step(self, input_ids, attention_mask, labels):
        pred = self(input_ids, attention_mask, labels)["logits"]
        # Take the index of the largest logit as the predicted class.
        pred = torch.max(pred, dim=-1)[1]
        return {'pred': pred, 'target': labels}


model = SeqClsModel(num_labels=2, model_checkpoint=model_checkpoint)

2022-11-10 17:02:09,574 - modelscope - INFO - initialize model from /remote-home/kychen/.cache/modelscope/hub/damo/nlp_structbert_sentiment-classification_chinese-base
2022-11-10 17:02:15,055 - modelscope - INFO - All model checkpoint weights were used when initializing SequenceClassificationModel.

2022-11-10 17:02:15,057 - modelscope - INFO - All the weights of SequenceClassificationModel were initialized from the model checkpoint If your task is similar to the task the model of the checkpoint was trained on, you can already use SequenceClassificationModel for predictions without further training.


&emsp;&emsp;Initialize the optimizer and the `Trainer`, then run the `Trainer` with the `train_dataloader` and `evaluate_dataloader` prepared earlier to obtain the training results.

In [5]:
from fastNLP import Trainer, Accuracy
from torch.optim import AdamW

optimizers = AdamW(params=model.parameters(), lr=5e-5)

trainer = Trainer(
    model=model,
    driver='torch',
    device=1,  # 'cuda'
    n_epochs=10,
    optimizers=optimizers,
    train_dataloader=train_dataloader,
    evaluate_dataloaders=evaluate_dataloader,
    metrics={'acc': Accuracy()}
)

trainer.run(num_eval_batch_per_dl=10)


In [6]:
trainer.evaluator.run()


{'acc#acc': 0.802752}
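
&emsp;&emsp;The reported value is consistent with 700 correct predictions out of the 872 validation instances, assuming the final evaluation covered the whole validation set. The sketch below shows the computation in plain Python; `argmax` and `accuracy` are hypothetical helpers rather than fastNLP's `Accuracy` metric, and the logits are made up.

```python
def argmax(values):
    """Index of the largest value (the first one wins on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def accuracy(logits, targets):
    """Fraction of rows whose arg-max class equals the target."""
    preds = [argmax(row) for row in logits]
    correct = sum(int(p == t) for p, t in zip(preds, targets))
    return correct / len(targets)


logits = [[0.2, 1.3], [2.0, -1.0], [0.1, 0.4], [0.9, 0.3]]
targets = [1, 0, 0, 0]
print(accuracy(logits, targets))   # 0.75
print(round(700 / 872, 6))         # 0.802752
```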