代码整体框架：
1. 需要定义一个DataLoader构建训练数据集，输入为(batch_size, seq_len)
2. 经过Bert模型，输出为(batch_size, seq_len, embed_size)
3. 经过crf层

In [1]:
# 下载必要的库
!pip install transformers pytorch-crf

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


### Task 1:NER
baseline: Bert+BiLSTM+CRF

In [28]:
import pandas as pd
import torch

# 加载训练数据和测试数据
train_dataframe = pd.read_csv('./train_data_public.csv')
train_dataframe.drop('class', axis=1, inplace=True)  # class信息是情感分析的，与任务NER无关
test_dataframe = pd.read_csv('./test_public.csv')

# 将sentence-level信息切分为character-level的信息
train_dataframe['BIO_anno'] = train_dataframe['BIO_anno'].apply(lambda x: x.split(' '))  # label
train_dataframe['training_data_text'] = train_dataframe.apply(lambda row: list(row['text']), axis=1)
test_dataframe['testing_data_text'] = test_dataframe.apply(lambda row: list(row['text']), axis=1)
test_dataframe.head()

Unnamed: 0,id,text,testing_data_text
0,0,共享一个额度，没啥必要，四个卡不要年费吗？你这种人头，银行最喜欢，广发是出了名的风控严，套现...,"[共, 享, 一, 个, 额, 度, ，, 没, 啥, 必, 要, ，, 四, 个, 卡, ..."
1,1,炸了，就2000.浦发没那么好心，草,"[炸, 了, ，, 就, 2, 0, 0, 0, ., 浦, 发, 没, 那, 么, 好, ..."
2,2,挂了电话自己打过去分期提额可以少分一点的,"[挂, 了, 电, 话, 自, 己, 打, 过, 去, 分, 期, 提, 额, 可, 以, ..."
3,3,比如你首卡10k，二卡也10k，信报上显示邮政总共给你的授信额度是20k,"[比, 如, 你, 首, 卡, 1, 0, k, ，, 二, 卡, 也, 1, 0, k, ..."
4,4,3000吗，浦发总是这样,"[3, 0, 0, 0, 吗, ，, 浦, 发, 总, 是, 这, 样]"


In [17]:
training_data_text_list = []
testing_data_text_list = []
for i in range(len(train_dataframe)):
    training_data_text_list.append(train_dataframe.iloc[i]['training_data_text'])
for i in range(len(test_dataframe)):
    testing_data_text_list.append(test_dataframe.iloc[i]['testing_data_text'])

In [4]:
from transformers import BertTokenizer
from transformers import BertConfig
import os

# 整个模型的一些超参数
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
config = BertConfig.from_pretrained('bert-base-chinese')
config.num_classes = 11  # CRF中状态转移的类别，包括[PAD]这个类别，因为是一个batch训练
config.clip_grad = 5  # 梯度裁剪，梯度模大于该值时自动裁剪
config.epoch_num = 2
config.min_epoch_num = 1
config.patience = 0.0002  # f1增量小于该值认为没有变化
config.patience_num = 1  # f1无明显变化的轮数大于该值后停止训练
config.learning_rate = 1e-5
config.batch_size = 4
config.device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
access_token = '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'
tokenizer = BertTokenizer.from_pretrained('bert-base-chinese', use_auth_token=access_token)
tokenizer

BertTokenizer(name_or_path='bert-base-chinese', vocab_size=21128, model_max_length=512, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'})

In [30]:
START_TAG, STOP_TAG = '[CLS]', '[SEP]'
tag_to_ix = {"[PAD]": 0, "O": 1, "B-BANK": 2, "I-BANK": 3, "B-PRODUCT": 4, 'I-PRODUCT': 5,
             'B-COMMENTS_N': 6, 'I-COMMENTS_N': 7, 'B-COMMENTS_ADJ': 8,
             'I-COMMENTS_ADJ': 9, STOP_TAG: 10, START_TAG: 11}
idx_to_tag = {idx: tag for idx, tag in enumerate(tag_to_ix)}


# 将每一句转成数字（大于510做截断，小于510做PADDING，加上首尾两个标识，长度总共等于512）
def convert_text_to_token(tokenizer, sentence, limit_size=510):
    tokens = tokenizer.encode(sentence[:limit_size])  # 直接截断
    # 补齐（pad的索引号就是0）
    if len(tokens) < limit_size + 2:
        tokens.extend([tag_to_ix["[PAD]"]] * (limit_size + 2 - len(tokens)))
    return tokens


# 将BIO_anno转化为token
def covert_anno_to_token(anno_list, limit_size=510):
    token_list = []
    if anno_list:
        token_list.append(tag_to_ix[START_TAG])
    else:
        return [0] * (limit_size + 2)
    anno_list = anno_list[:limit_size]
    for i in range(len(anno_list)):
        token_list.append(tag_to_ix[anno_list[i]])
    token_list.append(tag_to_ix[STOP_TAG])
    if len(token_list) < limit_size + 2:
        token_list.extend([0] * (limit_size + 2 - len(token_list)))
    return token_list


# 建立mask
def attention_masks(input_ids):
    atten_masks = []
    for seq in input_ids:
        # 如果有编码（>0）即为1, pad为0
        seq_mask = [float(x > 0) for x in seq]
        atten_masks.append(seq_mask)
    return atten_masks


# 对每个句子进行编码
train_input_ids = [convert_text_to_token(tokenizer, x, config.max_position_embeddings - 2) for x in
                   training_data_text_list]
test_input_ids = [convert_text_to_token(tokenizer, x, config.max_position_embeddings - 2) for x in
                  testing_data_text_list]
# 放到tensor中
train_input_tokens = torch.tensor(train_input_ids)
test_input_tokens = torch.tensor(test_input_ids)
# 对每个BIO_anno进行编码
train_input_targets = [covert_anno_to_token(x, config.max_position_embeddings - 2) for x in train_dataframe['BIO_anno']]
train_total_targets = torch.tensor(train_input_targets)
# 生成attention_masks
train_atten_masks = attention_masks(train_input_ids)
test_atten_masks = attention_masks(test_input_ids)
# 将atten_masks放到tensor中
train_attention_tokens = torch.tensor(train_atten_masks, dtype=torch.bool)
test_attention_tokens = torch.tensor(test_atten_masks, dtype=torch.bool)
train_total_targets[0]

tensor([11,  2,  3,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  6,  7,  1,  1,  1,
         1,  1,  8,  9,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
         1,  1,  1,  1,  1,  1,  1,  6,  7,  1,  1,  1,  1,  1,  1,  1,  1,  1,
         1,  6,  7,  1,  1,  6,  7,  1,  1,  1,  1,  4,  5,  1,  1,  1,  1,  8,
         1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1, 10,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0, 

In [6]:
from sklearn.model_selection import train_test_split

# 使用random_state固定切分方式，切分 train_inputs, train_labels, train_masks,
train_inputs, eval_inputs, train_labels, eval_labels = train_test_split(train_input_tokens, train_total_targets,
                                                                        random_state=2021, test_size=0.2)
train_masks, eval_masks, _, _ = train_test_split(train_attention_tokens, train_input_tokens, random_state=2021,
                                                 test_size=0.2)

In [32]:
from torch.utils.data import TensorDataset, RandomSampler, DataLoader, SequentialSampler

# 使用TensorDataset对tensor进行打包
BATCH_SIZE = config.batch_size
train_data = TensorDataset(train_inputs, train_masks, train_labels)
# 无放回地随机采样样本元素
train_sampler = RandomSampler(train_data)
# 对于训练集，random sampler, shuffle
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=BATCH_SIZE)

# 对于验证集和测试集，Sequential sampler
eval_data = TensorDataset(eval_inputs, eval_masks, eval_labels)
eval_sampler = SequentialSampler(eval_data)
eval_dataloader = DataLoader(eval_data, sampler=eval_sampler, batch_size=BATCH_SIZE)

test_data = TensorDataset(test_input_tokens, test_attention_tokens)
test_sampler = SequentialSampler(test_data)
test_dataloader = DataLoader(test_data, sampler=test_sampler, batch_size=BATCH_SIZE)

In [33]:
# 由于bert输出的seq_len固定为512，而BiLSTM输入的seq_len由该batch最大长度决定，因此为了将bert、LSTM、CRF融合在一起，需要在每个batch中加入max_len这一信息，下面的函数就是实现此功能
def generate_label_starts(data):
    """

    :param data: 一个batch的数据
    :return: 句子的最大长度
    """
    batch_label_start = []
    for batch in data:
        sentences = [x[0] for x in batch]
        end = []
        max_len = 0
        batch_size = batch[0].shape[0]
        for sentence in sentences:
            for i in range(len(sentence)):
                if sentence[i] == 102:
                    max_len = max(max_len, i - 1)
                    break
        end.extend([max_len] * batch_size)
        batch_label_start.append(end)
    return batch_label_start


def update(dataloader):
    """

    :param dataloader: dataloader
    :return: 加入max_len后的dataloader
    """
    label_starts = generate_label_starts(dataloader)
    batch_dataloader = []
    idx = 0
    for batch in dataloader:
        batch.append(torch.tensor(label_starts[idx]))
        idx += 1
        batch_dataloader.append(batch)
    dataloader = DataLoader(batch_dataloader)
    return dataloader


train_dataloader = update(train_dataloader)
eval_dataloader = update(eval_dataloader)
test_dataloader = update(test_dataloader)

In [9]:
from torchcrf import CRF
from torch.nn.utils.rnn import pad_sequence
from transformers import BertModel
from transformers.models.bert.modeling_bert import *
from torch import nn


class Bert_BiLSTM_CRF(BertPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.bert = BertModel.from_pretrained('bert-base-chinese', config=config)
        self.lstm = nn.LSTM(input_size=config.hidden_size, hidden_size=config.hidden_size // 2, num_layers=2,
                            bidirectional=True, batch_first=True)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.num_classes = config.num_classes
        self.classifier = nn.Linear(config.hidden_size, self.num_classes)
        self.crf = CRF(num_tags=self.num_classes, batch_first=True)
        self.init_weights()

    def forward(self, input_ids, input_tokens_start, attention_mask, labels=None):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        sequence_output = outputs[0]
        origin_sequence_output = sequence_output[:, 1:input_tokens_start[0] + 1]
        padded_sequence_output = pad_sequence(origin_sequence_output, batch_first=True)
        padded_sequence_output = self.dropout(padded_sequence_output)
        lstm_outputs, _ = self.lstm(padded_sequence_output)
        logits = self.classifier(lstm_outputs)
        outputs = (logits,)
        if labels is not None:
            loss_mask = labels.gt(0)
            loss = -self.crf(emissions=logits, tags=labels, mask=loss_mask)
            outputs = (loss,) + outputs
        return outputs

In [10]:
from tqdm import tqdm


def train_epoch(train_loader, model, optimizer, scheduler, epoch):
    # set model to training mode
    model.train()
    # step number in one epoch: 336
    train_losses = 0
    for _, batch_samples in enumerate(tqdm(train_loader)):
        batch_input, batch_masks, batch_labels, batch_label_start = batch_samples
        batch_input, batch_masks, batch_labels, batch_label_start = batch_input.squeeze().to(
            config.device), batch_masks.squeeze().to(config.device), batch_labels.squeeze().to(
            config.device), batch_label_start.squeeze().to(config.device)
        max_len = batch_label_start[0]
        batch_labels = batch_labels[:, 1:max_len + 1].to(config.device)
        # compute model output and loss
        loss = \
            model(input_ids=batch_input, attention_mask=batch_masks, labels=batch_labels,
                  input_tokens_start=batch_label_start)[0]
        train_losses += loss.item()
        # clear previous gradients, compute gradients of all variables wrt loss
        model.zero_grad()
        loss.backward()
        # gradient clipping
        nn.utils.clip_grad_norm_(parameters=model.parameters(), max_norm=config.clip_grad)
        # performs updates using calculated gradients
        optimizer.step()
        scheduler.step()
    train_loss = float(train_losses) / len(train_loader)
    print("Epoch: {}, train loss: {}".format(epoch, train_loss))


def train(train_loader, model, optimizer, scheduler):
    """train the model and test model performance"""
    # reload weights from restore_dir if specified
    num_classes = config.num_classes
    model.to(config.device)
    best_val_f1 = 0.0
    patience_counter = 0
    # start training
    for epoch in range(1, config.epoch_num + 1):
        train_epoch(train_loader, model, optimizer, scheduler, epoch)
        val_metrics = evaluate(model, eval_dataloader)
        val_f1 = val_metrics['f1']
        print("Epoch: {}, f1 score: {}".format(epoch, val_f1))
        improve_f1 = val_f1 - best_val_f1
        if improve_f1 > 1e-5:
            best_val_f1 = val_f1
            if improve_f1 < config.patience:
                patience_counter += 1
            else:
                patience_counter = 0
        else:
            patience_counter += 1
        # Early stopping and logging best f1
        if (patience_counter >= config.patience_num and epoch > config.min_epoch_num) or epoch == config.epoch_num:
            print("Best val f1: {}".format(best_val_f1))
            break
    print("Training Finished!")


In [11]:
# 计算模型的f1 score。只有entity完全一样才认为是一致
def getentity(tags):
    S = set()
    for sentence in tags:
        for i in range(len(sentence)):
            if sentence[i] == 'O':
                entity = (i, i, sentence[i])
                S.add(entity)
            elif sentence[i] == 'B':
                entity = [i, sentence[i]]
                for j in range(i + 1, len(sentence)):
                    if sentence[j][0] != 'I':
                        break
                    else:
                        entity.append(entity[j])
                S.add(tuple(entity))
    return S


def predict(model, inputs, masks, label_start):
    model.eval()
    inputs, masks, label_start = inputs.squeeze().to(config.device), masks.squeeze().to(
        config.device), label_start.squeeze().to(config.device)
    max_len = label_start[0]
    batch_output = model(input_ids=inputs, attention_mask=masks, labels=None, input_tokens_start=label_start)[0]
    batch_output = model.crf.decode(batch_output, mask=masks[:, 1:max_len + 1].to(config.device))
    pred_tags = [[idx_to_tag.get(idx) for idx in indices] for indices in batch_output]
    return pred_tags, max_len


def evaluate(model, dataloader):
    SinterG, S, G = 0, 0, 0
    for data in dataloader:
        inputs, masks, labels, label_start = data
        pred_tags, max_len = predict(model, inputs, masks, label_start)
        labels = labels.squeeze()[:, 1:max_len + 1]
        labels = labels.numpy()
        true_tags = [[idx_to_tag.get(idx) for idx in indices] for indices in labels]
        assert len(pred_tags) == len(true_tags)
        pred_entity, true_entity = getentity(pred_tags), getentity(true_tags)
        S += len(pred_entity)
        G += len(true_entity)
        SinterG += len(set(pred_entity).intersection(set(true_entity)))
    P, R = SinterG / S, SinterG / G
    f1 = 2 * P * R / (P + R)
    return {'P': P, "R": R, 'f1': f1}

In [12]:
from transformers import get_linear_schedule_with_warmup
from torch.optim import AdamW

model = Bert_BiLSTM_CRF(config=config)
model.to(config.device)
optimizer = AdamW(model.parameters(), lr=config.learning_rate)
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0,
                                            num_training_steps=config.epoch_num * len(train_dataloader))

train(train_dataloader, model, optimizer, scheduler)

Some weights of the model checkpoint at bert-base-chinese were not used when initializing BertModel: ['cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
100%|██████████| 1506/1506 [10:27<00:00,  2.40it/s]


Epoch: 1, train loss: 51.313948667381865
Epoch: 1, f1 score: 0.9840974916662113


100%|██████████| 1506/1506 [10:26<00:00,  2.40it/s]


Epoch: 2, train loss: 30.419310840794132
Epoch: 2, f1 score: 0.9875048164253867
Best val f1: 0.9875048164253867
Training Finished!


In [51]:
result = pd.DataFrame(columns=['id', 'BIO_anno', 'class'])
result['id'] = test_dataframe['id']
for index, data in enumerate(tqdm(test_dataloader)):
    input_tokens, mask, label_start = data
    batch_output, _ = predict(model, input_tokens, mask, label_start)
    for b in range(len(batch_output)):
        result.loc[index * BATCH_SIZE + b, 'BIO_anno'] = ' '.join(batch_output[b][:-1])
result

100%|██████████| 721/721 [01:42<00:00,  7.01it/s]


Unnamed: 0,id,BIO_anno,class
0,0,O O O O B-COMMENTS_N I-COMMENTS_N O O O O O O ...,
1,1,O O O O O O O O O B-BANK O O O O B-COMMENTS_AD...,
2,2,O O O O O O O O O B-PRODUCT I-PRODUCT B-COMMEN...,
3,3,O O O B-PRODUCT I-PRODUCT O O O O O O O O O O ...,
4,4,O O O O O O B-BANK I-BANK O O O,
...,...,...,...
2878,2878,O O O O O O O O O O O O B-PRODUCT I-PRODUCT,
2879,2879,O O O O O O O O O O O O O O,
2880,2880,O O O O O O O B-PRODUCT I-PRODUCT O O O O O O ...,
2881,2881,O B-PRODUCT I-PRODUCT I-PRODUCT I-COMMENTS_N I...,


In [56]:
result.to_csv('bert_ner_baseline.csv', index=False)