# A Lesson on Transformers
This article introduces how to use the transformers library provided by Hugging Face.

## Environment

In [1]:
import torch
import transformers

print(f'torch: {torch.__version__}')
print(f'transformer: {transformers.__version__}')

torch: 1.6.0
transformer: 4.18.0


In [2]:
# Every upper-case path constant must be changed to your own actual path
BASE_URL = '/root/transformers_data_and_model/'
BERT_MODEL_NAME_OR_PATH = BASE_URL + 'bert-base-uncased'
# Show the contents of the folder
#!tree {BERT_MODEL_NAME_OR_PATH}

In [3]:
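# Initialize the model and the tokenizer
# Every model and tokenizer is initialized with from_pretrained and saved with save_pretrained
# The first argument of from_pretrained can be a folder path, a file path, or a model short name; a folder path is recommended here
# A model loaded this way is in eval mode by default; here we load the BERT tokenizer and a BERT sequence-classification model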
model = transformers.AutoModelForSequenceClassification.from_pretrained(BERT_MODEL_NAME_OR_PATH)
tokenizer = transformers.AutoTokenizer.from_pretrained(BERT_MODEL_NAME_OR_PATH)

Some weights of the model checkpoint at /root/transformers_data_and_model/bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initia

## tokenizer

In [4]:
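# Role of the tokenizer: turn raw text into the format the model accepts
# The most important tokenizer method is __call__, which converts text into model-ready inputs
# The tokenizer also has encode/decode: encode turns text into input_ids, decode turns input_ids back into text
# encode/decode and __call__ do essentially the same work; __call__ exists to offer one unified interface
inputs = tokenizer("We are very happy to show you the 🤗 Transformers library.")
# input_ids holds the vocabulary index of each token; token_type_ids marks whether a token belongs to the first or second sentence; attention_mask distinguishes real tokens (1) from padding (0)
# This example is a single sample with a single sentence; for several sentences, pass a list of strings.
#sentences = ["Gonna be ok.", "Ready perfectly."]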
print(inputs)

{'input_ids': [101, 2057, 2024, 2200, 3407, 2000, 2265, 2017, 1996, 100, 19081, 3075, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


In [5]:
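# A single sample made of two sentences: pass them as the first and second arguments
# For several samples that each have two sentences, the first argument is the list of first sentences and the second argument is the list of second sentences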
#inputs2 = tokenizer(["Gonna be ok.", "Ready perfectly.", "hello"], ["Gonna be ok.", "Ready perfectly.", 'good morning'])
inputs2 = tokenizer('hello', 'good morning')
print(inputs2)

{'input_ids': [101, 7592, 102, 2204, 2851, 102], 'token_type_ids': [0, 0, 0, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1]}
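
The layout of a sentence pair can be read off the printed output: `[CLS] A [SEP] B [SEP]`, with segment 0 covering the first part and segment 1 the second. The following is a minimal sketch of that layout in plain Python, not the library's implementation. The function name `build_bert_inputs` is hypothetical, and the token ids are simply copied from the output above (7592 for "hello", 2204 and 2851 for "good morning").

```python
from typing import Dict, List, Optional

CLS_ID = 101  # id of [CLS], as seen in the printed outputs
SEP_ID = 102  # id of [SEP], as seen in the printed outputs


def build_bert_inputs(ids_a: List[int], ids_b: Optional[List[int]] = None) -> Dict[str, List[int]]:
    """Lay out one sentence (or a sentence pair) the way the printed outputs suggest."""
    # First segment: [CLS] A [SEP], all marked as segment 0.
    input_ids = [CLS_ID] + ids_a + [SEP_ID]
    token_type_ids = [0] * len(input_ids)
    if ids_b is not None:
        # Second segment: B [SEP], marked as segment 1.
        second = ids_b + [SEP_ID]
        input_ids += second
        token_type_ids += [1] * len(second)
    # No padding is involved here, so every position is a real token.
    attention_mask = [1] * len(input_ids)
    return {
        "input_ids": input_ids,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask,
    }


pair = build_bert_inputs([7592], [2204, 2851])  # 'hello' / 'good morning'
print(pair)
```

The printed dictionary should match the `inputs2` output shown above.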


In [6]:
# More often, the model input has to come as a batch
# The first argument is a list of texts; padding=True and truncation=True turn on padding and truncation
# return_tensors sets the return format, here PyTorch tensors
pt_batch = tokenizer(
    ["We are very happy to show you the 🤗 Transformers library.", "We hope you don't hate it."],
    padding=True,
    truncation=True,
    return_tensors="pt"
)
for key, value in pt_batch.items():
    print(f'{key}: {value.numpy().tolist()}')

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


input_ids: [[101, 2057, 2024, 2200, 3407, 2000, 2265, 2017, 1996, 100, 19081, 3075, 1012, 102], [101, 2057, 3246, 2017, 2123, 1005, 1056, 5223, 2009, 1012, 102, 0, 0, 0]]
token_type_ids: [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
attention_mask: [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0]]
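
What `padding=True` did to the second sentence can be reproduced by hand. The sketch below is an illustration in plain Python rather than the tokenizer's own code: `pad_batch` is a hypothetical helper, and it assumes right-padding with id 0, which is what the printed batch shows.

```python
from typing import Dict, List


def pad_batch(sequences: List[List[int]], pad_id: int = 0) -> Dict[str, List[List[int]]]:
    """Right-pad every sequence to the longest one and build the matching mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks a padding position.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Unpadded ids copied from the output above.
first = [101, 2057, 2024, 2200, 3407, 2000, 2265, 2017, 1996, 100, 19081, 3075, 1012, 102]
second = [101, 2057, 3246, 2017, 2123, 1005, 1056, 5223, 2009, 1012, 102]
batch = pad_batch([first, second])
print(batch["input_ids"][1])
print(batch["attention_mask"][1])
```

The second row gains three padding ids and three zeros in its mask, so both rows end up with length 14.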


In [7]:
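# The tokenized text is now ready to be fed to the model for classification, prediction, and similar tasks
# pt_batch above is one batch; unpack it with ** and pass it straight to the model
# In transformers, model outputs can be indexed like a tuple (in 4.x they are ModelOutput objects)
# With only the batch inputs, the first element of the output is the logits
# If labels are passed as well, the first element is the loss and the second is the logits
# Regression tasks generally use mean-squared-error loss; classification tasks use cross-entropy
pt_outputs = model(**pt_batch)
print(f'logits:{pt_outputs[0]}\n')
# The case where labels are passed in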
pt_outputs = model(**pt_batch, labels=torch.LongTensor([1, 0]))
print(f'loss: {pt_outputs[0]}\nlogits: {pt_outputs[1]}')

logits:tensor([[0.0497, 0.4243],
        [0.0902, 0.4860]], grad_fn=<AddmmBackward>)

loss: 0.7168716788291931
logits: tensor([[0.0497, 0.4243],
        [0.0902, 0.4860]], grad_fn=<AddmmBackward>)
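
As a sanity check, the printed loss can be recomputed from the printed logits. This is a minimal sketch assuming the classification head uses standard cross-entropy averaged over the batch; `cross_entropy` is a hypothetical helper written in plain Python, and the logits are the four-decimal values shown above, so the result only agrees with the printed loss to about four decimals.

```python
import math
from typing import List


def cross_entropy(logits: List[List[float]], labels: List[int]) -> float:
    """Mean cross-entropy over a batch, computed from raw logits."""
    total = 0.0
    for row, label in zip(logits, labels):
        # log-sum-exp, with the row maximum subtracted for numerical stability
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        # negative log-probability of the correct class
        total += log_z - row[label]
    return total / len(labels)


logits = [[0.0497, 0.4243], [0.0902, 0.4860]]
loss = cross_entropy(logits, [1, 0])
print(loss)
```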


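The cells above show the basic input logic of the PyTorch version of transformers:
Every model in transformers is a standard PyTorch module (torch.nn.Module). Text is converted into model inputs by calling the tokenizer; the model takes these inputs and produces logits, from which the loss is computed and back-propagated with backward. Repeating this over many iterations completes training or fine-tuning. If labels are passed to the model, the loss is returned directly, so you only need to call backward on it and iterate.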

In [8]:
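# After training/fine-tuning, save the tokenizer and the model to the same folder
# A good habit in transformers: keep the model, tokenizer files, training arguments, and so on in one folder.
# model/tokenizer/config are all initialized with from_pretrained and saved with save_pretrained
# save_pretrained only needs the target folder name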
SAVE_DIRECTORY = BASE_URL + 'bert_save_example'
tokenizer.save_pretrained(SAVE_DIRECTORY)
model.save_pretrained(SAVE_DIRECTORY)

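### Summary
1. transformers provides uniform loading and saving for model/tokenizer/config, namely from_pretrained/save_pretrained;
2. The tokenizer converts text into the format the model accepts via __call__; model outputs can be indexed like tuples, and the loss is computed from their contents;
3. The tokenizer classes wrap different algorithms such as Byte-Pair Encoding, WordPiece, and SentencePiece;
4. The model classes wrap different architectures such as BERT, GPT2, and ALBERT behind standardized classes; more on this below.
5. For us, this makes reproduction, experimentation, and research easier.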

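## Concepts and notes
### Main classes
There are three main classes: Model, Configuration, and Tokenizer. Each is introduced below.
Model classes, such as BertModel, all inherit from PyTorch models (torch.nn.Module) or Keras models (tf.keras.Model) and work with the pretrained weights.

Configuration classes, such as BertConfig, store all the parameters needed to build a model. You do not always have to instantiate this class yourself: when you use a pretrained model without any changes, the model takes care of its config automatically. If you pretrain your own model with a different architecture, you can instantiate the config yourself.

Tokenizer classes, such as BertTokenizer, store the vocabulary for each model and encode/decode text into the format the model needs: the indices of the token embeddings.

All of these classes share the following two methods for instantiating from, and saving to, local storage:

from_pretrained() loads a pretrained model. The first argument, model_name_or_path, can be a short name or a local model, given as a folder, a file, or a short name. For a folder, the model weights are looked up in pytorch_model.bin inside that folder by default.

save_pretrained() saves the model locally; the argument it takes is a folder.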
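
The folder-based save/load convention can be illustrated with a toy class. This is only a sketch of the pattern, not how transformers implements it: `ToyConfig` is hypothetical, and the only detail borrowed from the library is the file name `config.json`, which appears in the training logs later in this article.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass


@dataclass
class ToyConfig:
    """Hypothetical stand-in for a config class; not part of transformers."""

    hidden_size: int = 768
    num_labels: int = 2

    def save_pretrained(self, save_directory: str) -> None:
        # Everything goes into one directory, under a fixed file name.
        os.makedirs(save_directory, exist_ok=True)
        with open(os.path.join(save_directory, "config.json"), "w", encoding="utf-8") as f:
            json.dump(asdict(self), f)

    @classmethod
    def from_pretrained(cls, directory: str) -> "ToyConfig":
        # Loading looks for the same fixed file name inside the directory.
        with open(os.path.join(directory, "config.json"), encoding="utf-8") as f:
            return cls(**json.load(f))


with tempfile.TemporaryDirectory() as tmp_dir:
    ToyConfig(num_labels=3).save_pretrained(tmp_dir)
    restored = ToyConfig.from_pretrained(tmp_dir)

print(restored)
```

Because both methods agree on the file name, callers only ever pass the folder, which is why a single directory can hold the model, the tokenizer, and the config side by side.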

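### AutoModels
Taking BERT as an example, each model has one config (BertConfig) and one or two tokenizers: a fast Rust-based tokenizer (BertTokenizerFast) and the original Python tokenizer (BertTokenizer); some models do not provide the fast Rust tokenizer. There are also several model classes with different heads: BertModel, the bare model without a head; BertForPreTraining, with the MLM and NSP pretraining heads; BertForMaskedLM, with the MLM head; BertForNextSentencePrediction, with the NSP head; BertForSequenceClassification, for sentence classification; BertForMultipleChoice, for multiple choice; BertForTokenClassification, for token classification; and BertForQuestionAnswering, for question answering.

Different models differ slightly, but every config class inherits from PretrainedConfig, every tokenizer inherits from PreTrainedTokenizer or PreTrainedTokenizerFast, and every model inherits from PreTrainedModel.

For convenience, AutoConfig, AutoTokenizer, AutoModel, AutoModelForPreTraining, AutoModelWithLMHead, AutoModelForSequenceClassification, AutoModelForQuestionAnswering, AutoModelForTokenClassification, and others can be used to look up the right class for a model automatically.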
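
The idea behind the Auto classes can be pictured as a lookup from the `model_type` recorded in a checkpoint's config to a concrete class. The sketch below is a deliberately simplified illustration of that dispatch, not the library's code; `ToyBertModel`, `ToyGpt2Model`, `MODEL_REGISTRY`, and `auto_model_class` are all hypothetical names.

```python
from typing import Dict, Type


class ToyBertModel:
    """Hypothetical placeholder class, not transformers.BertModel."""


class ToyGpt2Model:
    """Hypothetical placeholder class, not transformers.GPT2Model."""


# Registry from the "model_type" field of a config to a concrete class.
MODEL_REGISTRY: Dict[str, Type] = {"bert": ToyBertModel, "gpt2": ToyGpt2Model}


def auto_model_class(config: Dict[str, str]) -> Type:
    """Pick the concrete class from the config's model_type."""
    model_type = config["model_type"]
    if model_type not in MODEL_REGISTRY:
        raise ValueError(f"Unrecognized model_type: {model_type!r}")
    return MODEL_REGISTRY[model_type]


chosen = auto_model_class({"model_type": "bert"})
print(chosen.__name__)
```

Each task-specific Auto class (for sequence classification, question answering, and so on) can be thought of as its own registry of this kind.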

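### The Trainer class
The Trainer class provides a complete, standard training API and currently supports tasks such as language modeling, text classification, and token classification (NER). The config, tokenizer, and model classes simplify the step of writing the model; you still build the dataset and dataloader as usual and then train over every epoch and batch to get the final result. Trainer is optional if you write that loop yourself; what it simplifies is the training step.

A typical training process looks roughly like this (pseudocode below):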

In [66]:
## Pseudocode, do not run!

# A typical training script
# Load the data
train_data, test_data = get_data()
# Convert to features and build the datasets
train_dataset = MyDataset(train_data, args)
test_dataset = MyDataset(test_data, args)

# Wrap in dataloaders, which produce the batches
# A sampler defines how batches are drawn: it is an iterator that yields the keys used to read from the dataset
train_sampler, test_sampler = ...
# collate_fn merges batch_size samples into one batch for batched training
train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=batch_size, collate_fn=collate_fn)
test_dataloader = DataLoader(test_dataset, sampler=test_sampler, batch_size=batch_size, collate_fn=collate_fn)
# Initialize TensorBoard, a tool that visualizes how tensors change during training
tb_writer = SummaryWriter(log_dir=None)
# Create the optimizer
optimizer = ...
model.to(GPU)
# Start training
for epoch in range(epochs):
    for batch in train_dataloader:
        # Switch to training mode
        model.train()
        # Move the batch to the GPU
        batch = batch.to(GPU)
        # Compute the loss for this step, back-propagate, then update the weights
        tr_loss = train_step(model, batch, optimizer)
        tr_loss.backward()
        optimizer.step()
        model.zero_grad()
for batch in test_dataloader:
    ...
model.save_pretrained(OUTPUT_PATH)

NameError: name 'get_data' is not defined

In [None]:
# Pseudocode, do not run!
# The same training process using Trainer

# Load the data
train_data, test_data = get_data()
# Convert to features and build the datasets
train_dataset = MyDataset(train_data, args)
test_dataset = MyDataset(test_data, args)
# Read in the training arguments
training_args = ...

# Initialize the Trainer
trainer = transformers.Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    compute_metrics=build_compute_metrics_fn(data_args.task_name),
)
# Train
if training_args.do_train:
    trainer.train(
        model_path=model_args.model_name_or_path if os.path.isdir(model_args.model_name_or_path) else None
    )
    # Trainer saves the model; internally it still calls save_pretrained
    trainer.save_model()
    # Save the tokenizer to the same folder for convenience
    if trainer.is_world_process_zero():
        tokenizer.save_pretrained(training_args.output_dir)

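## Example: text classification on the GLUE/MRPC dataset
### 4.1 About this section
This section uses a simple example to walk through fine-tuning for text classification in detail, together with the code for online deployment.
Section 4.2 implements the text classification task in full; Section 4.3 implements online prediction with the model trained in Section 4.2; Section 4.4 summarizes basic lessons about text classification and about using transformers.
The model used in this section is bert-base-uncased, and the dataset is MRPC from the GLUE benchmark.

GLUE consists of 9 tasks: STS-B is a regression task, MNLI is a three-class task, and the remaining 7 are binary classification tasks. MRPC (the Microsoft Research Paraphrase Corpus), one of the nine, is a similarity and paraphrase task: a corpus of sentence pairs automatically extracted from online news sources, with human annotations of whether the two sentences in each pair are semantically equivalent. The classes are imbalanced (68% of the samples are positive), so we follow common practice and report both accuracy and F1. Sample counts: 3,668 for training, 408 for development, and 1,725 for testing. Task: binary classification into paraphrase / not paraphrase. Evaluation metrics: accuracy and F1. Each example consists of two sentences separated by a tab; label 1 marks a positive example, where the two sentences paraphrase each other.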
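
To see why the class imbalance matters when reading the scores, consider a trivial classifier that always predicts the positive class. Assuming, purely for illustration, that the evaluated split contains exactly 68% positives (individual splits may differ), its scores would be:

```latex
\begin{aligned}
\text{accuracy} &= 0.68, \qquad \text{precision} = 0.68, \qquad \text{recall} = 1, \\
F_1 &= \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
     = \frac{2 \times 0.68 \times 1}{0.68 + 1} \approx 0.81 .
\end{aligned}
```

A fine-tuned model should therefore be compared against roughly 0.68 accuracy and 0.81 F1, not against 0.5.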

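### 4.2 The model training process in detail
transformers provides the config, tokenizer, and model classes to simplify tokenization and model construction, and the Trainer class to simplify training. What does the full training process look like in more detail? This section implements and explains a classification model step by step.

In short, there are a few main steps:
1. Load the arguments and turn the text into a Dataset;
2. Write a collate_fn to handle padding;
3. Load the config, tokenizer, and model;
4. Write the metrics used to evaluate the results;
5. Pass all of the above to Trainer to initialize it, then call its train method.

The arguments fall into three groups: model arguments, which record where the model lives and are used to initialize it; data arguments, such as the data location and task name, which cover everything up to the model input; and training arguments, which govern the training run itself, such as learning_rate and epochs.

The text first has to be loaded. This can be done with a Processor class that provides a few methods: get the training examples, get the development examples, get the list of labels, and so on. The examples come back as a list in which each element is an example holding the text and its label (some have no label, for instance in the test set). These Processor methods exist mainly for the Dataset class, which serves individual input samples.

collate_fn is an argument of the DataLoader class. For features that have already been processed to a fixed length you can leave it out and rely on the default collate_fn; for variable-length text that needs padding you have to write it yourself.

Loading the config, tokenizer, and model is already familiar.

The metrics can be written along the lines of the function below.

Then pass all of these to Trainer, and you can train and evaluate.

See the code below.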

In [2]:
import logging
import os
import sys
import enum
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, List, Union, NamedTuple

import filelock
import torch
import numpy as np
import transformers
from scipy.stats import pearsonr, spearmanr
from sklearn.metrics import matthews_corrcoef, f1_score

In [3]:
# Location of the downloaded pretrained model
BASE_URL = '/root/transformers_data_and_model/'
BERT_MODEL_NAME_OR_PATH = BASE_URL + 'bert-base-uncased'

# Location of the downloaded MRPC dataset from GLUE
MRPC_DATA_DIR = BASE_URL + 'MRPC'
# Where the fine-tuned model, tokenizer, and other files are stored
FINETUNED_MRPC = BASE_URL + 'finetuned-mrpc'

In [4]:
# Logger
logger = logging.getLogger(__name__)

In [5]:
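# Define the model arguments: model, config, tokenizer, cache_dir, and so on
# There are three groups of arguments:
# model arguments, defined right below;
# training arguments, see src/transformers/training_args.py: common training settings such as epochs and batch_size, plus device settings
# data arguments, which control data processing: which data to use, the task name, whether to overwrite the data cache, and the maximum length used when building features
# These are the model arguments:
@dataclass
class ModelArguments:
    '''
    Arguments pertaining to which model/config/tokenizer we are going to fine-tune from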
    '''
    
    model_name_or_path: str = field(
        metadata={"help": "Path to pretrained model or model identifier from huggingface.co/models"}
    )
        
    config_name: Optional[str] = field(
        default=None, metadata={"help": "Pretrained config name or path if not the same as model_name"}
    )
        
    tokenizer_name: Optional[str] = field(
        default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"}
    )
    
    cache_dir: Optional[str] = field(
        default=None, metadata={"help": "Where do you want to store the pretrained models downloaded from s3"}
    )

        
# Data arguments (used for training)
# These are passed to the Dataset, processor, and so on, and cover everything from loading the data up to the model input
@dataclass
class GlueDataTrainingArguments:
    '''
    Arguments pertaining to what data we are going to input our model for training and eval.
    
    Using 'HfArgumentParser' we can turn this class
    into argparse arguments to be able to specify them on
    the command line.
    '''
    
    task_name: str = field(metadata={"help": "The name of the task to train on: MRPC"})
    data_dir: str = field(
        metadata={"help": "The input data dir. Should contain the .tsv files (or other data files) for the task."}
    )
    max_seq_length: int = field(
        default=128,
        metadata={
            "help": "The maximum total input sequence length after tokenization. Sequence longer "
            "than this will be truncated, sequence shorter will be padded."
        },
    )
    overwrite_cache: bool = field(
        default=False, metadata={"help": "Overwrite the cached training and evaluation sets"}
    )
        
    def __post_init__(self):
        self.task_name = self.task_name.lower()

In [6]:
input_args = ['--model_name_or_path', BERT_MODEL_NAME_OR_PATH,
             '--task_name', 'MRPC',
              '--do_train',
              '--do_eval',
              '--data_dir', MRPC_DATA_DIR,
              '--max_seq_length', '128',
              '--per_device_train_batch_size', '32',
              '--learning_rate', '3e-5',
              '--num_train_epochs', '3.0',
              '--output_dir', FINETUNED_MRPC,
              '--overwrite_cache',
              '--overwrite_output_dir']


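# For a new experiment, define model and data arguments in the same format as above, then convert them as shown below
# transformers provides HfArgumentParser, which parses dataclasses of this format into standard Python arguments
parser = transformers.HfArgumentParser((ModelArguments, GlueDataTrainingArguments, transformers.TrainingArguments))
# Parse the three groups into their own objects:
# model arguments, data arguments, and training arguments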
model_args, data_args, training_args = parser.parse_args_into_dataclasses(input_args)
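
The general idea of turning dataclass fields into command-line flags can be sketched with the standard library alone. This is an illustration of the concept under simplifying assumptions (one dataclass, only `str` and `int` fields, no `default_factory`), not HfArgumentParser's implementation; `ToyDataArguments` and `parse_into_dataclass` are hypothetical names.

```python
import argparse
import dataclasses
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyDataArguments:
    """Hypothetical argument group used only for this illustration."""

    task_name: str = field(metadata={"help": "Name of the task"})
    max_seq_length: int = field(default=128, metadata={"help": "Maximum sequence length"})


def parse_into_dataclass(cls, argv: List[str]):
    """Turn each dataclass field into a --flag and parse argv into an instance."""
    parser = argparse.ArgumentParser()
    for f in dataclasses.fields(cls):
        # A field without a default becomes a required flag.
        required = f.default is dataclasses.MISSING
        parser.add_argument(
            f"--{f.name}",
            type=f.type,
            required=required,
            default=None if required else f.default,
            help=f.metadata.get("help"),
        )
    namespace = parser.parse_args(argv)
    return cls(**vars(namespace))


toy_args = parse_into_dataclass(ToyDataArguments, ["--task_name", "MRPC", "--max_seq_length", "64"])
print(toy_args)
```

Keeping each group of settings in its own dataclass means the same definitions serve as documentation, defaults, and command-line interface.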

In [7]:
# Show all the arguments
print(f'{model_args}\n\n{data_args}\n\n{training_args}')

ModelArguments(model_name_or_path='/root/transformers_data_and_model/bert-base-uncased', config_name=None, tokenizer_name=None, cache_dir=None)

GlueDataTrainingArguments(task_name='mrpc', data_dir='/root/transformers_data_and_model/MRPC', max_seq_length=128, overwrite_cache=True)

TrainingArguments(
_n_gpu=4,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=True,
do_predict=False,
do_train=True,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=IntervalStrategy.NO,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_model_id=None,
hub_s

In [8]:
# Make sure output_dir is usable
if (
    os.path.exists(training_args.output_dir)
    and os.listdir(training_args.output_dir)
    and training_args.do_train
    and not training_args.overwrite_output_dir
):
    raise ValueError(
        f'Output directory ({training_args.output_dir}) already exists and is not empty. Use --overwrite_output_dir to overcome.'
    )

In [9]:
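# Set up the log format, record a few key parameters, and print the training arguments
# An important lesson: logging intermediate values with a logger is well worth it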
logging.basicConfig(
    format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
    datefmt="%m%d%Y %H:%M:%S",
    level=logging.INFO if training_args.local_rank in [-1, 0] else logging.WARN,
)

logger.warning(
    "Process rank: %s, device: %s, n_gpu: %s, distributed training: %s, 16-bits training:%s",
    training_args.local_rank,
    training_args.device,
    training_args.n_gpu,
    bool(training_args.local_rank != -1),
    training_args.fp16,
)
logger.info("Training/evaluation parameters %s", training_args)

# Set the random seed
transformers.set_seed(training_args.seed)

04252022 03:52:29 - INFO - __main__ - Training/evaluation parameters TrainingArguments(
_n_gpu=4,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=True,
do_predict=False,
do_train=True,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=IntervalStrategy.NO,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_model_id=None,
hub_strategy=HubStrategy.EVERY_SAVE,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=3e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=

In [10]:
# Number of labels
# Output mode: either classification or regression
# To adapt this to a new task, what matters is not following this exact format but obtaining these two values
num_labels = 2
output_mode = "classification"

In [23]:
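# Load the config, tokenizer, and model
# The config holds model-related settings such as the number of layers, dropout, number of heads, and the fine-tuning task; once loaded, it is only used by the model.
# num_labels is written into the config and determines the output size of the final fully connected classification layer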
config = transformers.AutoConfig.from_pretrained(
    model_args.config_name if model_args.config_name else model_args.model_name_or_path,
    num_labels=num_labels,
    finetuning_task=data_args.task_name,
    cache_dir=model_args.cache_dir,
)
tokenizer = transformers.AutoTokenizer.from_pretrained(
    model_args.tokenizer_name if model_args.tokenizer_name else model_args.model_name_or_path,
    cache_dir=model_args.cache_dir,
)
model = transformers.AutoModelForSequenceClassification.from_pretrained(
    model_args.model_name_or_path,
    from_tf=bool(".ckpt" in model_args.model_name_or_path),
    config=config,
    cache_dir=model_args.cache_dir,
)

Some weights of the model checkpoint at /root/transformers_data_and_model/bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initia

In [24]:
# transformers.DataProcessor is a base class; subclasses implement get_train_examples, get_dev_examples, get_test_examples, get_labels, and so on,
# which return the list of InputExample objects and the list of labels
class MrpcProcessor(transformers.DataProcessor):
    "Processor for the MRPC data set (GLUE version)."
    
    def get_example_from_tensor_dict(self, tensor_dict):
        """See base class."""
        return transformers.InputExample(
            tensor_dict['idx'].numpy(),
            tensor_dict['sentence1'].numpy().decode('utf-8'),
            tensor_dict['sentence2'].numpy().decode('utf-8'),
            str(tensor_dict['label'].numpy()),
        )
    
    def get_train_examples(self, data_dir):
        """See base class."""
        logger.info("LOOKING AT {}".format(os.path.join(data_dir, "train.tsv")))
        return self._create_examples(self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")
    
    def get_dev_examples(self, data_dir):
        """See base class."""
        return self._create_examples(self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")
    
    def get_test_examples(self, data_dir):
        """See base class."""
        return self._create_examples(self._read_tsv(os.path.join(data_dir, "test.tsv")), "test")
    
    def get_labels(self):
        """See base class."""
        return ["0", "1"]
    
    def _create_examples(self, lines, set_type):
        """Create examples for the training, dev and test sets."""
        examples = []
        for (i, line) in enumerate(lines):
            if i == 0:
                continue
            guid = "%s-%s" % (set_type, i)
            text_a = line[3]
            text_b = line[4]
            label = None if set_type == "test" else line[0]
            examples.append(transformers.InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
        return examples
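
The column indices used in `_create_examples` are easier to follow on a concrete file. The sketch below parses a tiny in-memory TSV with the standard library; it mirrors the indexing above (label in column 0, sentences in columns 3 and 4, header skipped) but is not the processor itself. The header names and both data rows are made up for illustration, and `read_pairs` is a hypothetical helper.

```python
import csv
import io

# Made-up rows in the five-column layout the processor above assumes:
# label, id of sentence 1, id of sentence 2, sentence 1, sentence 2.
SAMPLE_TSV = (
    "label\tid1\tid2\tsentence1\tsentence2\n"
    "1\t1\t2\tThe cat sat on the mat .\tA cat was sitting on the mat .\n"
    "0\t3\t4\tIt rained all day .\tThe market closed higher .\n"
)


def read_pairs(tsv_text: str, set_type: str):
    """Skip the header, then pull the label and both sentences out of every row."""
    rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE))
    pairs = []
    for i, row in enumerate(rows):
        if i == 0:
            continue  # header line
        # Test examples carry no label.
        label = None if set_type == "test" else row[0]
        pairs.append({"guid": f"{set_type}-{i}", "text_a": row[3], "text_b": row[4], "label": label})
    return pairs


pairs = read_pairs(SAMPLE_TSV, "train")
print(pairs[0])
```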


In [25]:
# Convert examples into features
def glue_convert_examples_to_features(
    examples: List[transformers.InputExample],
    tokenizer: transformers.PreTrainedTokenizer,
    max_length: Optional[int] = None,
    task=None,
    label_list=None,
    output_mode=None,
):
    """
    Loads a data file into a list of ``InputFeatures``

    Args:
        examples: List of ``InputExamples`` or ``tf.data.Dataset`` containing the examples.
        tokenizer: Instance of a tokenizer that will tokenize the examples
        max_length: Maximum example length. Defaults to the tokenizer's max_len
        task: GLUE task
        label_list: List of labels. Can be obtained from the processor using the ``processor.get_labels()`` method
        output_mode: String indicating the output mode. Either ``regression`` or ``classification``

    Returns:
        If the ``examples`` input is a ``tf.data.Dataset``, will return a ``tf.data.Dataset``
        containing the task-specific features. If the input is a list of ``InputExamples``, will return
        a list of task-specific ``InputFeatures`` which can be fed to the model.

    """
    if max_length is None:
        max_length = tokenizer.model_max_length
        
    if task is not None:
        processor = MrpcProcessor()
        if label_list is None:
            label_list = processor.get_labels()
            logger.info("Using label list %s for task %s" % (label_list, task))
        if output_mode is None:
            output_mode = transformers.glue_output_modes[task]
            logger.info("Using output mode %s for task %s" % (output_mode, task))
    
    label_map = {label: i for i, label in enumerate(label_list)}
    
    def label_from_example(example: transformers.InputExample) -> Union[int, float, None]:
        if example.label is None:
            return None
        if output_mode == "classification":
            return label_map[example.label]
        elif output_mode == "regression":
            return float(example.label)
        raise KeyError(output_mode)
    
    labels = [label_from_example(example) for example in examples]
    
    batch_encoding = tokenizer(
        [(example.text_a, example.text_b) for example in examples],
        max_length=max_length,
        padding="max_length",
        truncation=True,
    )
    
    features = []
    for i in range(len(examples)):
        inputs = {k: batch_encoding[k][i] for k in batch_encoding}
        
        feature = transformers.InputFeatures(**inputs, label=labels[i])
        features.append(feature)
    
    for i, example in enumerate(examples[:5]):
        logger.info("*** Example ***")
        logger.info("guid: %s" % (example.guid))
        logger.info("features: %s" % features[i])
    
    return features


In [26]:
class Split(enum.Enum):
    train = 'train'
    dev = 'dev'
    test = 'test'
    
class GlueDataset(torch.utils.data.dataset.Dataset):
    """
    This will be superseded by a framework-agnostic approach soon.
    """
    args: GlueDataTrainingArguments
    output_mode: str
    features: List[transformers.InputFeatures]
    
    def __init__(
        self,
        args: transformers.GlueDataTrainingArguments,
        tokenizer: transformers.PreTrainedTokenizer,
        limit_length: Optional[int] = None,
        mode: Union[str, Split] = Split.train,
        cache_dir: Optional[str] = None,
        output_mode = "classification",
    ):
        self.args = args
        self.processor = MrpcProcessor()
        self.output_mode = output_mode
        if isinstance(mode, str):
            try:
                mode = Split[mode]
            except KeyError:
                raise KeyError("mode is not a valid split name")
        # load data features from cache or dataset file
        cached_features_file = os.path.join(
            cache_dir if cache_dir is not None else args.data_dir,
            "cached_{}_{}_{}_{}".format(
                mode.value, tokenizer.__class__.__name__, str(args.max_seq_length), args.task_name,
            ),
        )
        label_list = self.processor.get_labels()
        self.label_list = label_list
        
        # Make sure only the first process in distributed training processes the dataset,
        # and the others will use the cache.
        lock_path = cached_features_file + ".lock"
        with filelock.FileLock(lock_path):
            if os.path.exists(cached_features_file) and not args.overwrite_cache:
                start = time.time()
                self.features = torch.load(cached_features_file)
                logger.info(
                    "Loading features from cached file %s [took %.3f s]", cached_features_file, time.time() - start
                )
            else:
                logger.info(f"Creating features from dataset file at {args.data_dir}")
                
                if mode == Split.dev:
                    examples = self.processor.get_dev_examples(args.data_dir)
                elif mode == Split.test:
                    examples = self.processor.get_test_examples(args.data_dir)
                else:
                    examples = self.processor.get_train_examples(args.data_dir)
                if limit_length is not None:
                    examples = examples[:limit_length]
                self.features = glue_convert_examples_to_features(
                    examples,
                    tokenizer,
                    max_length=args.max_seq_length,
                    label_list=label_list,
                    output_mode=self.output_mode,
                )
                start = time.time()
                torch.save(self.features, cached_features_file)
                # ^ This seems to take a lot of time so I want to investigate why and how we can improve.
                logger.info(
                    "Saving features into cached file %s [took %.3f s]", cached_features_file, time.time() - start
                )
    
    def __len__(self):
        return len(self.features)

    def __getitem__(self, i) -> transformers.InputFeatures:
        return self.features[i]
    
    def get_labels(self):
        return self.label_list
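
The caching logic inside `__init__` boils down to "load the cache if it exists and overwriting is off, otherwise build and save". A stripped-down sketch of that pattern follows. It uses `pickle` and a temporary directory instead of `torch.save`, and it leaves out the file lock that the class above uses to coordinate several processes; `load_or_build` and `expensive_build` are hypothetical names.

```python
import os
import pickle
import tempfile
from typing import Callable, List


def load_or_build(cache_path: str, build_fn: Callable[[], List], overwrite: bool = False) -> List:
    """Reuse cached features if present; otherwise build them and write the cache."""
    if os.path.exists(cache_path) and not overwrite:
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    features = build_fn()
    with open(cache_path, "wb") as f:
        pickle.dump(features, f)
    return features


calls = []


def expensive_build() -> List[int]:
    calls.append(1)  # record that the slow path ran
    return [1, 2, 3]


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "cached_train_toy")
    built = load_or_build(path, expensive_build)  # builds and saves
    cached = load_or_build(path, expensive_build)  # served from the cache
    rebuilt = load_or_build(path, expensive_build, overwrite=True)  # rebuilds

print(len(calls))
```

The cache file name in the class above encodes the split, tokenizer class, maximum length, and task, so changing any of them produces a fresh cache rather than reusing stale features.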

In [27]:
# Build the datasets
train_dataset, eval_dataset, test_dataset = None, None, None
if training_args.do_train:
    train_dataset = GlueDataset(data_args, tokenizer=tokenizer, cache_dir=model_args.cache_dir, output_mode=output_mode)
if training_args.do_eval:
    eval_dataset = GlueDataset(data_args, tokenizer=tokenizer, mode="dev", cache_dir=model_args.cache_dir)
if training_args.do_predict:
    test_dataset = GlueDataset(data_args, tokenizer=tokenizer, mode="test", cache_dir=model_args.cache_dir)

04252022 03:54:49 - INFO - filelock - Lock 140540602682128 acquired on /root/transformers_data_and_model/MRPC/cached_train_BertTokenizerFast_128_mrpc.lock
04252022 03:54:49 - INFO - __main__ - Creating features from dataset file at /root/transformers_data_and_model/MRPC
04252022 03:54:49 - INFO - __main__ - LOOKING AT /root/transformers_data_and_model/MRPC/train.tsv
04252022 03:54:50 - INFO - __main__ - *** Example ***
04252022 03:54:50 - INFO - __main__ - guid: train-1
04252022 03:54:50 - INFO - __main__ - features: InputFeatures(input_ids=[101, 2572, 3217, 5831, 5496, 2010, 2567, 1010, 3183, 2002, 2170, 1000, 1996, 7409, 1000, 1010, 1997, 9969, 4487, 23809, 3436, 2010, 3350, 1012, 102, 7727, 2000, 2032, 2004, 2069, 1000, 1996, 7409, 1000, 1010, 2572, 3217, 5831, 5496, 2010, 2567, 1997, 9969, 4487, 23809, 3436, 2010, 3350, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 

04252022 03:54:51 - INFO - filelock - Lock 140540602682128 released on /root/transformers_data_and_model/MRPC/cached_train_BertTokenizerFast_128_mrpc.lock
04252022 03:54:51 - INFO - filelock - Lock 140540607032720 acquired on /root/transformers_data_and_model/MRPC/cached_dev_BertTokenizerFast_128_mrpc.lock
04252022 03:54:51 - INFO - __main__ - Creating features from dataset file at /root/transformers_data_and_model/MRPC
04252022 03:54:51 - INFO - __main__ - *** Example ***
04252022 03:54:51 - INFO - __main__ - guid: dev-1
04252022 03:54:51 - INFO - __main__ - features: InputFeatures(input_ids=[101, 2002, 2056, 1996, 9440, 2121, 7903, 2063, 11345, 2449, 2987, 1005, 1056, 4906, 1996, 2194, 1005, 1055, 2146, 1011, 2744, 3930, 5656, 1012, 102, 1000, 1996, 9440, 2121, 7903, 2063, 11345, 2449, 2515, 2025, 4906, 2256, 2146, 1011, 2744, 3930, 5656, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0

04252022 03:54:51 - INFO - filelock - Lock 140540607032720 released on /root/transformers_data_and_model/MRPC/cached_dev_BertTokenizerFast_128_mrpc.lock


In [28]:
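# Function that computes the evaluation results
# Here it computes acc and f1
# EvalPrediction is the format of the prediction results: predictions holds the model outputs, label_ids holds the correct labels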
class EvalPrediction(NamedTuple):
    """
    Evaluation output (always contains labels), to be used to compute metrics.
    
    Parameters:
        predictions (:obj:`np.ndarray`): Predictions of the model.
        label_ids (:obj:`np.ndarray`): Targets to be matched.
    """
    
    predictions:np.ndarray
    label_ids:np.ndarray
        
    
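# The metric function
def compute_metrics_fn(p: EvalPrediction):
    if output_mode == "classification":
        # Predicted classes
        preds = np.argmax(p.predictions, axis=1)
    else:
        # Regression: use the raw outputs
        preds = np.squeeze(p.predictions)
    # Gold labels
    labels = p.label_ids

    # acc and f1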
    acc = (preds == labels).mean()
    f1 = f1_score(y_true=labels, y_pred=preds)
    return {"acc": acc, "f1":f1, "acc_and_f1": (acc+f1) / 2}
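
To make the two metrics concrete, here is a small plain-Python sketch that computes them from scratch. It assumes binary labels with 1 as the positive class, which is what `f1_score` uses by default for binary targets; `accuracy_and_f1` is a hypothetical helper, and the four predictions are made up.

```python
from typing import Dict, List


def accuracy_and_f1(preds: List[int], labels: List[int]) -> Dict[str, float]:
    """Accuracy plus binary F1 with label 1 as the positive class."""
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    acc = sum(1 for p, y in zip(preds, labels) if p == y) / len(labels)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"acc": acc, "f1": f1, "acc_and_f1": (acc + f1) / 2}


# Two true positives, one false positive, one true negative.
scores = accuracy_and_f1(preds=[1, 1, 0, 1], labels=[1, 0, 0, 1])
print(scores)
```

With these inputs precision is 2/3 and recall is 1, so accuracy is 0.75 and F1 is 0.8.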

In [29]:
# Initialize the Trainer
trainer = transformers.Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics_fn,
)

In [30]:
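# Train
if training_args.do_train:
    # model_path is passed here to load the optimizer state and continue training; for ordinary fine-tuning it can be omitted
    # Once the Trainer is initialized, calling train starts training, which simplifies the training process
    trainer.train(
        model_path=model_args.model_name_or_path if os.path.isdir(model_args.model_name_or_path) else None
    )
    # Trainer saves the model with this method
    trainer.save_model()
    
    # For convenience, also save the tokenizer files in the model's directory (the older method for this check is deprecated)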
    if trainer.is_world_process_zero():
        tokenizer.save_pretrained(training_args.output_dir)

Loading model from /root/transformers_data_and_model/bert-base-uncased).
There were missing keys in the checkpoint model loaded: ['bert.embeddings.position_ids', 'bert.embeddings.LayerNorm.weight', 'bert.embeddings.LayerNorm.bias', 'bert.encoder.layer.0.attention.output.LayerNorm.weight', 'bert.encoder.layer.0.attention.output.LayerNorm.bias', 'bert.encoder.layer.0.output.LayerNorm.weight', 'bert.encoder.layer.0.output.LayerNorm.bias', 'bert.encoder.layer.1.attention.output.LayerNorm.weight', 'bert.encoder.layer.1.attention.output.LayerNorm.bias', 'bert.encoder.layer.1.output.LayerNorm.weight', 'bert.encoder.layer.1.output.LayerNorm.bias', 'bert.encoder.layer.2.attention.output.LayerNorm.weight', 'bert.encoder.layer.2.attention.output.LayerNorm.bias', 'bert.encoder.layer.2.output.LayerNorm.weight', 'bert.encoder.layer.2.output.LayerNorm.bias', 'bert.encoder.layer.3.attention.output.LayerNorm.weight', 'bert.encoder.layer.3.attention.output.LayerNorm.bias', 'bert.encoder.layer.3.output.L

Training completed. Do not forget to share your model on huggingface.co/models =)


Saving model checkpoint to /root/transformers_data_and_model/finetuned-mrpc
Configuration saved in /root/transformers_data_and_model/finetuned-mrpc/config.json
Model weights saved in /root/transformers_data_and_model/finetuned-mrpc/pytorch_model.bin
tokenizer config file saved in /root/transformers_data_and_model/finetuned-mrpc/tokenizer_config.json
Special tokens file saved in /root/transformers_data_and_model/finetuned-mrpc/special_tokens_map.json
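
The training cell passes a path to `train()` only when that path is an existing local directory, which is why the log above starts with "Loading model from" the local `bert-base-uncased` folder. The same conditional can be factored into a small helper. This is a sketch only: `resolve_checkpoint` is a hypothetical name, not part of transformers.

```python
import os
import tempfile
from typing import Optional


def resolve_checkpoint(path: str) -> Optional[str]:
    """Return `path` if it is an existing directory, otherwise None."""
    return path if os.path.isdir(path) else None


with tempfile.TemporaryDirectory() as local_dir:
    # An existing directory is passed through unchanged
    from_local_dir = resolve_checkpoint(local_dir)
    # A path that does not exist resolves to None, so nothing would be resumed
    from_missing = resolve_checkpoint(os.path.join(local_dir, "not-a-checkpoint"))

print(from_local_dir is not None, from_missing)
```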


In [36]:
# Evaluate
eval_results = {}
if training_args.do_eval:
    logger.info("*** Evaluate ***")

    # Set the metrics function; if it was already passed when the Trainer was
    # initialized and has not changed, this step can be skipped
    trainer.compute_metrics = compute_metrics_fn
    # Pass in the evaluation dataset
    eval_result = trainer.evaluate(eval_dataset=eval_dataset)
    output_eval_file = os.path.join(
        training_args.output_dir, f"eval_results_{eval_dataset.args.task_name}.txt"
    )

    if trainer.is_world_process_zero():
        # Write the results to a local file
        with open(output_eval_file, "w") as writer:
            logger.info("**** Eval results {} *****".format(eval_dataset.args.task_name))
            for key, value in eval_result.items():
                print("key:", key, " value:", value)
                logger.info("  %s = %s", key, value)
                writer.write("%s = %s\n" % (key, value))

        # Merge this dataset's metrics into the overall results
        eval_results.update(eval_result)

04252022 03:57:48 - INFO - __main__ - *** Evaluate ***
    There is an imbalance between your GPUs. You may want to exclude GPU 3 which
    has less than 75% of the memory or cores of GPU 0. You can do so by setting
    the device_ids argument to DataParallel, or by setting the CUDA_VISIBLE_DEVICES
    environment variable.
***** Running Evaluation *****
  Num examples = 408
  Batch size = 8
04252022 03:57:50 - INFO - __main__ - **** Eval results mrpc *****
04252022 03:57:50 - INFO - __main__ -   eval_loss = 0.5169751048088074
04252022 03:57:50 - INFO - __main__ -   eval_acc = 0.7843137254901961
04252022 03:57:50 - INFO - __main__ -   eval_f1 = 0.8594249201277956
04252022 03:57:50 - INFO - __main__ -   eval_acc_and_f1 = 0.8218693228089958
04252022 03:57:50 - INFO - __main__ -   eval_runtime = 1.5105
04252022 03:57:50 - INFO - __main__ -   eval_samples_per_second = 270.117
04252022 03:57:50 - INFO - __main__ -   eval_steps_per_second = 8.607
04252022 03:57:50 - INFO - __main__ -   epoch = 3.0

key: eval_loss  value: 0.5169751048088074
key: eval_acc  value: 0.7843137254901961
key: eval_f1  value: 0.8594249201277956
key: eval_acc_and_f1  value: 0.8218693228089958
key: eval_runtime  value: 1.5105
key: eval_samples_per_second  value: 270.117
key: eval_steps_per_second  value: 8.607
key: epoch  value: 3.0
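
The evaluation cell writes one `key = value` line per metric. As a rough sketch of that file format, the snippet below writes a toy results dict the same way and parses it back. The helper names `write_results` and `read_results`, the file name, and the metric values are all made up for illustration.

```python
import os
import tempfile


def write_results(path, results):
    """Write one 'key = value' line per metric, as the evaluation cell does."""
    with open(path, "w") as writer:
        for key, value in results.items():
            writer.write("%s = %s\n" % (key, value))


def read_results(path):
    """Parse 'key = value' lines back into a dict of floats."""
    results = {}
    with open(path) as reader:
        for line in reader:
            line = line.strip()
            if not line:
                continue
            key, _, value = line.partition(" = ")
            results[key] = float(value)
    return results


toy_results = {"eval_acc": 0.75, "eval_f1": 0.8, "epoch": 3.0}
with tempfile.TemporaryDirectory() as tmp_dir:
    results_file = os.path.join(tmp_dir, "eval_results_toy.txt")
    write_results(results_file, toy_results)
    loaded = read_results(results_file)

print(loaded)
```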


In [39]:
# Predict on the test set
print(training_args.do_predict)
if training_args.do_predict:
    logger.info("*** Test ***")

    predictions = trainer.predict(test_dataset=test_dataset).predictions
    if output_mode == "classification":
        # Convert logits to predicted class ids
        predictions = np.argmax(predictions, axis=1)

    output_test_file = os.path.join(
        training_args.output_dir, f"test_results_{test_dataset.args.task_name}.txt"
    )
    # is_world_master() is deprecated; use is_world_process_zero() as in the cells above
    if trainer.is_world_process_zero():
        with open(output_test_file, "w") as writer:
            logger.info("***** Test results {} *****".format(test_dataset.args.task_name))
            writer.write("index\tprediction\n")
            for index, item in enumerate(predictions):
                item = test_dataset.get_labels()[item]
                writer.write("%d\t%s\n" % (index, item))

False
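
Because `do_predict` is `False` here, the prediction cell did not write any file. The sketch below shows what the output format would look like on toy data, assuming predicted class ids and a label list like the one returned by `get_labels()`. The helper name `write_predictions` and all values are hypothetical.

```python
import io


def write_predictions(writer, predictions, label_list):
    """Write a header plus one 'index<TAB>label' row per prediction."""
    writer.write("index\tprediction\n")
    for index, class_id in enumerate(predictions):
        # Map the predicted class id back to its label string
        writer.write("%d\t%s\n" % (index, label_list[class_id]))


toy_predictions = [1, 0, 1]   # class ids, e.g. the result of np.argmax
toy_labels = ["0", "1"]       # assumed label list for a binary task

buffer = io.StringIO()        # stands in for the opened output file
write_predictions(buffer, toy_predictions, toy_labels)
tsv_text = buffer.getvalue()
print(tsv_text)
```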


In [38]:
print(eval_results)

{'eval_loss': 0.5169751048088074, 'eval_acc': 0.7843137254901961, 'eval_f1': 0.8594249201277956, 'eval_acc_and_f1': 0.8218693228089958, 'eval_runtime': 1.6126, 'eval_samples_per_second': 253.008, 'eval_steps_per_second': 8.062, 'epoch': 3.0}


In [40]:
FINETUNED_MRPC

'/root/transformers_data_and_model/finetuned-mrpc'

In [42]:
# Inspect the files of the fine-tuned model
!tree {FINETUNED_MRPC}

/root/transformers_data_and_model/finetuned-mrpc
├── config.json
├── eval_results_mrpc.txt
├── pytorch_model.bin
├── special_tokens_map.json
├── tokenizer.json
├── tokenizer_config.json
├── training_args.bin
└── vocab.txt

0 directories, 8 files
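
When `tree` is not available, a similar check can be done from Python. The sketch below reports which expected files are missing from a directory. The helper name `missing_files` is hypothetical, and the expected list is simply taken from the listing above rather than from any official requirement.

```python
import os
import tempfile


def missing_files(model_dir, expected):
    """Return the expected file names that are not present in model_dir."""
    present = set(os.listdir(model_dir))
    return sorted(name for name in expected if name not in present)


# File names copied from the directory listing above
expected = ["config.json", "pytorch_model.bin", "vocab.txt"]

with tempfile.TemporaryDirectory() as toy_dir:
    # Create a toy directory that lacks the weights file
    for name in ["config.json", "vocab.txt"]:
        open(os.path.join(toy_dir, name), "w").close()
    missing = missing_files(toy_dir, expected)

print(missing)
```

Pointing `missing_files` at `FINETUNED_MRPC` instead of the toy directory would check the real output folder.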
