# Fine-tuning a model with the Trainer API

## 1. Dataset preparation and preprocessing:

In [1]:
import numpy as np
from transformers import AutoTokenizer, DataCollatorWithPadding
import datasets
checkpoint = 'bert-base-cased'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
raw_datasets = datasets.load_dataset('glue', 'mrpc')

def tokenize_function(sample):
    return tokenizer(sample['sentence1'], sample['sentence2'], truncation=True)
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

Reusing dataset glue (C:\Users\Administrator\.cache\huggingface\datasets\glue\mrpc\1.0.0\dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)



Loading cached processed dataset at C:\Users\Administrator\.cache\huggingface\datasets\glue\mrpc\1.0.0\dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad\cache-d7c1a56b0a079691.arrow
Loading cached processed dataset at C:\Users\Administrator\.cache\huggingface\datasets\glue\mrpc\1.0.0\dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad\cache-4551ce60e93aa1ca.arrow
Loading cached processed dataset at C:\Users\Administrator\.cache\huggingface\datasets\glue\mrpc\1.0.0\dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad\cache-8e3dd97f55b2d13b.arrow
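
The `DataCollatorWithPadding` created above pads each batch only up to the length of its longest sequence, rather than to one global maximum. The following is a minimal pure-Python sketch of that idea, not the library's implementation. The helper `pad_batch` and the sample token ids are hypothetical, and a pad id of 0 is assumed (the `[PAD]` id in BERT vocabularies).

```python
def pad_batch(features, pad_token_id=0):
    """Pad every sequence in a batch to the length of the longest one."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": []}
    for f in features:
        n_pad = max_len - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # Real tokens get mask 1, padding positions get mask 0
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_pad)
    return batch


# Hypothetical token ids, for illustration only
samples = [
    {"input_ids": [101, 7592, 102]},
    {"input_ids": [101, 7592, 2088, 999, 102]},
]
padded = pad_batch(samples)
print(padded["input_ids"])
print(padded["attention_mask"])
```

Because the padding length depends on each batch, short batches stay short, which is why the tokenizer call above uses `truncation=True` but no `padding` argument.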


## 2. Loading the model we want to fine-tune:

In [2]:
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at b

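Hugging Face is thoughtful here: the warning spells out exactly what happened. We are loading a model with a `ForSequenceClassification` head. The `bert-base-cased` checkpoint does ship with its own pretraining head, but that head does not match our binary classification task, so it is discarded and replaced with a randomly initialized `ForSequenceClassification` head.

That is why the warning also says: "You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference."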
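
To make "randomly initialized head" concrete, here is a rough NumPy sketch of a linear classification head sitting on top of an encoder output. It is an illustration only, not the `transformers` implementation: the hidden size of 768 is that of BERT-base, while the initialization scheme (small normal weights, zero bias) and the random stand-in for the encoder output are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, num_labels = 768, 2  # BERT-base hidden size; two classes for MRPC

# Freshly initialized head: small random weights, zero bias (assumed scheme)
W = rng.normal(0.0, 0.02, size=(hidden_size, num_labels))
b = np.zeros(num_labels)

# Random stand-in for the encoder's pooled output for a batch of 4 sentence pairs
pooled = rng.normal(size=(4, hidden_size))

# The head is a single affine map from hidden states to one logit per class
logits = pooled @ W + b
print(logits.shape)  # (4, 2)
```

Since `W` carries no task knowledge yet, these logits are essentially noise, which is why the model has to be trained before its predictions mean anything.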

## 3. Training with Trainer

In [3]:
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(output_dir='test_trainer')

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,  # optional once a tokenizer is passed: Trainer then builds a DataCollatorWithPadding from it automatically
    tokenizer=tokenizer,
)

```python
TrainingArguments(
    output_dir: Union[str, NoneType] = None,
    overwrite_output_dir: bool = False,
    do_train: bool = False,
    do_eval: bool = None,
    do_predict: bool = False,
    evaluation_strategy: transformers.trainer_utils.EvaluationStrategy = 'no',
    prediction_loss_only: bool = False,
    per_device_train_batch_size: int = 8,
    per_device_eval_batch_size: int = 8,
    per_gpu_train_batch_size: Union[int, NoneType] = None,
    per_gpu_eval_batch_size: Union[int, NoneType] = None,
    gradient_accumulation_steps: int = 1,
    eval_accumulation_steps: Union[int, NoneType] = None,
    learning_rate: float = 5e-05,
    weight_decay: float = 0.0,
    adam_beta1: float = 0.9,
    adam_beta2: float = 0.999,
    adam_epsilon: float = 1e-08,
    max_grad_norm: float = 1.0,
    num_train_epochs: float = 3.0,   # trains for 3 epochs by default
    ...
```
Documentation:
https://huggingface.co/transformers/master/main_classes/trainer.html#trainingarguments

```python
Trainer(
    model: Union[transformers.modeling_utils.PreTrainedModel, torch.nn.modules.module.Module] = None,
    args: transformers.training_args.TrainingArguments = None,
    data_collator: Union[DataCollator, NoneType] = None,
    train_dataset: Union[torch.utils.data.dataset.Dataset, NoneType] = None,
    eval_dataset: Union[torch.utils.data.dataset.Dataset, NoneType] = None,
    tokenizer: Union[ForwardRef('PreTrainedTokenizerBase'), NoneType] = None,
    model_init: Callable[[], transformers.modeling_utils.PreTrainedModel] = None,
    compute_metrics: Union[Callable[[transformers.trainer_utils.EvalPrediction], Dict], NoneType] = None,
    callbacks: Union[List[transformers.trainer_callback.TrainerCallback], NoneType] = None,
    optimizers: Tuple[torch.optim.optimizer.Optimizer, torch.optim.lr_scheduler.LambdaLR] = (None, None),
)
Docstring:     
Trainer is a simple but feature-complete training and eval loop for PyTorch, optimized for 🤗 Transformers.
```

In [11]:
trainer.train()

| Step | Training Loss |
|------|---------------|
| 500  | 0.5903        |
| 1000 | 0.3884        |


TrainOutput(global_step=1377, training_loss=0.41799395969144315, metrics={'train_runtime': 393.2848, 'train_samples_per_second': 3.501, 'total_flos': 530185443455520, 'epoch': 3.0})
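
The `global_step=1377` in this output follows from the default settings. The sketch below reproduces the arithmetic under a few assumptions that are not stated above: the MRPC training split has 3,668 sentence pairs, training runs on a single device, and there is no gradient accumulation. The batch size of 8 and the 3 epochs are the `TrainingArguments` defaults listed earlier; `total_training_steps` is a hypothetical helper.

```python
import math


def total_training_steps(num_examples, batch_size, num_epochs):
    """Optimizer steps for a run; the last partial batch still counts as a step."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs


# 3,668 training pairs (assumed), default batch size 8, default 3 epochs
steps = total_training_steps(num_examples=3668, batch_size=8, num_epochs=3)
print(steps)  # 1377
```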

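Predicting with the Trainer:

`trainer.predict()` returns an object with three fields: `predictions`, `label_ids`, and `metrics`.

`metrics` can also contain custom entries; for that, pass a `compute_metrics` function when constructing the `Trainer`.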

In [24]:
predictions = trainer.predict(tokenized_datasets['validation'])
print(predictions.predictions.shape)  # logits
# array([[-2.7887206,  3.1986978],
#       [ 2.5258656, -1.832253 ], ...], dtype=float32)
print(predictions.label_ids.shape) # array([1, 0, 0, 1, 0, 1, 0, 1, 1, 1, ...], dtype=int64)
print(predictions.metrics)

(408, 2)
(408,)
{'eval_loss': 0.6575437188148499, 'eval_runtime': 3.4957, 'eval_samples_per_second': 116.716}
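
The `predictions` array holds raw logits, not probabilities. If probabilities are wanted, a softmax over the last axis converts them. This is a small NumPy sketch using the two logit rows quoted in the comment above; the `softmax` helper is written here for illustration and is not part of the `Trainer` API.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


# The two example rows shown in the comment of the predict cell
sample_logits = np.array([[-2.7887206, 3.1986978],
                          [2.5258656, -1.832253]], dtype=np.float32)
probs = softmax(sample_logits)
print(probs)                  # each row sums to 1
print(probs.argmax(axis=-1))  # same classes as argmax over the raw logits
```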


In [None]:
preds = np.argmax(predictions.predictions, axis=-1)
preds[:10]

`preds` and the labels can then be used to compute the relevant metrics.
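
For intuition, accuracy and binary F1 can be computed by hand from predicted and true labels. The sketch below uses NumPy and toy arrays, not the actual MRPC predictions; `accuracy_and_f1` is a hypothetical helper, and label 1 is assumed to be the positive class.

```python
import numpy as np


def accuracy_and_f1(preds, labels):
    """Accuracy and F1 for binary labels, treating 1 as the positive class."""
    preds, labels = np.asarray(preds), np.asarray(labels)
    tp = int(np.sum((preds == 1) & (labels == 1)))
    fp = int(np.sum((preds == 1) & (labels == 0)))
    fn = int(np.sum((preds == 0) & (labels == 1)))
    accuracy = float(np.mean(preds == labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "f1": f1}


# Toy data, for illustration only
toy_preds = [1, 0, 1, 1, 0, 1]
toy_labels = [1, 0, 0, 1, 1, 1]
result = accuracy_and_f1(toy_preds, toy_labels)
print(result)
```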

Hugging Face `datasets` can load the metrics that go with a dataset directly:

In [29]:
from datasets import load_metric
metric = load_metric('glue', 'mrpc')
metric.compute(predictions=preds, references=predictions.label_ids)


{'accuracy': 0.8480392156862745, 'f1': 0.8949152542372881}

Docstring of the GLUE metric:
```python
Args:
    predictions: list of predictions to score.
        Each translation should be tokenized into a list of tokens.
    references: list of lists of references for each translation.
        Each reference should be tokenized into a list of tokens.
Returns: depending on the GLUE subset, one or several of:
    "accuracy": Accuracy
    "f1": F1 score
    "pearson": Pearson Correlation
    "spearmanr": Spearman Correlation
    "matthews_correlation": Matthew Correlation
```

## 4. Building the `compute_metrics` function for `Trainer`

Let's see how we can build a useful `compute_metrics` function and use it the next time we train. The function must take an `EvalPrediction` object (a named tuple with a `predictions` field and a `label_ids` field) and return a dictionary mapping strings to floats (the strings are the names of the metrics, and the floats are their values).

In [4]:
from datasets import load_metric
def compute_metrics(eval_preds):
    metric = load_metric("glue", "mrpc")
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
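
Before launching a full training run, it can help to smoke-test a `compute_metrics`-style function on fake data to confirm it follows the contract: a (predictions, label_ids) pair in, a dict of string to float out. The sketch below assumes nothing beyond NumPy; `FakeEvalPrediction` is a hypothetical stand-in for the object the `Trainer` passes in, and `compute_accuracy` is an illustrative metric function, not the GLUE metric used above.

```python
from collections import namedtuple

import numpy as np

# Hypothetical stand-in for the object Trainer hands to compute_metrics
FakeEvalPrediction = namedtuple("FakeEvalPrediction", ["predictions", "label_ids"])


def compute_accuracy(eval_preds):
    """Same shape as compute_metrics: unpack logits and labels, return a dict."""
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": float((predictions == labels).mean())}


# Toy logits and labels, for illustration only
fake = FakeEvalPrediction(
    predictions=np.array([[0.1, 0.9], [2.0, -1.0], [0.3, 0.4]]),
    label_ids=np.array([1, 0, 0]),
)
metrics = compute_accuracy(fake)
print(metrics)
```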

## 5. Training again, this time with `compute_metrics`:

In [5]:
training_args = TrainingArguments(output_dir='test_trainer', evaluation_strategy='epoch')
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)  # new model
trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,  # optional once a tokenizer is passed: Trainer then builds a DataCollatorWithPadding from it automatically
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

trainer.train()

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at b

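| Epoch | Training Loss | Validation Loss | Accuracy | F1       | Runtime | Samples Per Second |
|-------|---------------|-----------------|----------|----------|---------|--------------------|
| 1     | No log        | 0.329815        | 0.867647 | 0.903571 | 5.7556  | 70.887             |
| 2     | 0.497900      | 0.600649        | 0.845588 | 0.897227 | 6.1665  | 66.164             |
| 3     | 0.283200      | 0.605053        | 0.872549 | 0.910345 | 6.0518  | 67.418             |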


TrainOutput(global_step=1377, training_loss=0.32063739751678666, metrics={'train_runtime': 400.684, 'train_samples_per_second': 3.437, 'total_flos': 530351810395680, 'epoch': 3.0})
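
The "No log" entry for epoch 1 in the results of this run is most likely a matter of timing rather than a problem. The sketch below works through the arithmetic under stated assumptions: 3,668 training pairs and a batch size of 8 (so 459 steps per epoch), and a training loss that is only logged every 500 steps, which is assumed to be the default `logging_steps`. `last_logged_step` is a hypothetical helper.

```python
import math

steps_per_epoch = math.ceil(3668 / 8)  # 459, assuming 3,668 training pairs and batch size 8
logging_steps = 500                    # assumed default logging interval


def last_logged_step(epoch):
    """Most recent step at which a training loss was logged by the end of `epoch`."""
    end_step = epoch * steps_per_epoch
    return (end_step // logging_steps) * logging_steps  # 0 means nothing logged yet


for epoch in (1, 2, 3):
    print(epoch, epoch * steps_per_epoch, last_logged_step(epoch))
```

Under these assumptions, epoch 1 ends at step 459, before the first log at step 500, while epochs 2 and 3 show the losses logged at steps 500 and 1000.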