# Fine-tuning a Model


In [1]:
import torch
from torch.optim import AdamW  # transformers.AdamW is deprecated in favor of the PyTorch version
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Same as before
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
sequences = [
    "I've been waiting for a HuggingFace course my whole life.",
    "This course is amazing!",
]
batch = tokenizer(sequences, padding=True,
                  truncation=True, return_tensors="pt")

# This is new
batch["labels"] = torch.tensor([1, 1])

optimizer = AdamW(model.parameters())
loss = model(**batch).loss
loss.backward()
optimizer.step()

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceC

In [2]:
batch

{'input_ids': tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102],
        [  101,  2023,  2607,  2003,  6429,   999,   102,     0,     0,     0,
             0,     0,     0,     0,     0,     0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'labels': tensor([1, 1])}

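Of course, training the model on just two sentences won't yield very good results. To get better results, you will need to prepare a bigger dataset.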


## Loading a dataset from the Hub


In [None]:
# ! pip install datasets

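Let's use the MRPC dataset. It is one of the 10 datasets that make up the [GLUE benchmark](https://gluebenchmark.com/), an academic benchmark used to measure the performance of machine learning models across 10 different text classification tasks.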


In [3]:
from datasets import load_dataset

raw_datasets = load_dataset("glue", "mrpc")
raw_datasets

Found cached dataset glue (/Users/elton/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)
100%|██████████| 3/3 [00:00<00:00, 777.59it/s]


DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})

By default, this command downloads the dataset and caches it in `~/.cache/huggingface/datasets`.


In [6]:
raw_train_dataset = raw_datasets["train"]
raw_train_dataset[0]

{'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .',
 'label': 1,
 'idx': 0}

In [7]:
raw_train_dataset.features

{'sentence1': Value(dtype='string', id=None),
 'sentence2': Value(dtype='string', id=None),
 'label': ClassLabel(names=['not_equivalent', 'equivalent'], id=None),
 'idx': Value(dtype='int32', id=None)}

In the example above, `label` is of type `ClassLabel`, which maps integers to label names: 0 corresponds to `not_equivalent` and 1 corresponds to `equivalent`.
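As a rough illustration of that mapping, the sketch below uses a plain Python list rather than the `datasets` API. The helper names `label_to_name` and `name_to_label` are hypothetical and exist only for this example.

```python
# Class names in the order reported by raw_train_dataset.features["label"].
names = ["not_equivalent", "equivalent"]


def label_to_name(label: int) -> str:
    """Map an integer label to its class name (hypothetical helper)."""
    return names[label]


def name_to_label(name: str) -> int:
    """Map a class name back to its integer label (hypothetical helper)."""
    return names.index(name)


print(label_to_name(1))                 # equivalent
print(name_to_label("not_equivalent"))  # 0
```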


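## Preprocessing the data

The `tokenize_function` function passes each example in the dataset through the `tokenizer`, encoding `sentence1` and `sentence2` together as a pair.

Note that we have left the `padding` argument out of `tokenize_function` for now. Padding every sample to the maximum length at tokenization time is inefficient. It is better to pad the samples when building a batch, because then we only need to pad to the maximum length in that batch rather than the maximum length in the whole dataset. This can save a lot of time and processing power when the inputs have very variable lengths!

Apply it with the [Dataset.map()](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map) method and `batched=True`, so that the function is applied to multiple elements of the dataset at once rather than to each element separately. This significantly speeds up tokenization, because the tokenizer is backed by the [🤗 Tokenizers library](https://github.com/huggingface/tokenizers), which is written in Rust.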
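To make the effect of `batched=True` more concrete, here is a minimal sketch of the calling convention: the mapped function receives whole columns (lists of values) and returns columns of the same length. `toy_tokenize` and `process_batch` are hypothetical stand-ins written for this example, not part of any library.

```python
def toy_tokenize(text: str) -> list:
    # Hypothetical stand-in for a real tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def process_batch(batch: dict) -> dict:
    # With batched=True the function sees a dict of columns, so one call
    # can process many examples instead of being invoked once per example.
    return {"tokens": [toy_tokenize(s) for s in batch["sentence1"]]}


batch = {"sentence1": ["This course is amazing!", "Hello world"]}
out = process_batch(batch)
print(out["tokens"])
```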


In [4]:
from transformers import AutoTokenizer

checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)


def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)


tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
tokenized_datasets

Loading cached processed dataset at /Users/elton/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-bb21e6423b980722.arrow
Loading cached processed dataset at /Users/elton/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-d1e8c90b5d349f7a.arrow
Loading cached processed dataset at /Users/elton/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-33304e37c309912f.arrow


DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1725
    })
})

Compared with `raw_datasets`, `tokenized_datasets` has been tokenized and has gained three new fields: `input_ids`, `token_type_ids`, and `attention_mask`.
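The sketch below illustrates, under the usual BERT convention for sentence pairs (`[CLS] A [SEP] B [SEP]`), how `token_type_ids` and `attention_mask` relate to the tokens. `encode_pair` is a hypothetical helper that works on token strings, not the real tokenizer.

```python
def encode_pair(tokens_a, tokens_b):
    """Lay out a sentence pair the way BERT-style models expect (sketch)."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A and the first [SEP]; segment 1 covers the rest.
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    # Before any padding is added, every token is a real token and gets attention.
    attention_mask = [1] * len(tokens)
    return tokens, token_type_ids, attention_mask


tokens, type_ids, mask = encode_pair(["this", "is", "a"], ["so", "is", "b"])
print(tokens)
print(type_ids)
print(mask)
```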


In [14]:
tokenized_datasets["train"][0]

{'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .',
 'label': 1,
 'idx': 0,
 'input_ids': [101, 2572, 3217, 5831, 5496, 2010, 2567, 1010, 3183, 2002,
  2170, 1000, 1996, 7409, 1000, 1010, 1997, 9969, 4487, 23809,
  3436, 2010, 3350, 1012, 102, 7727, 2000, 2032, 2004, 2069,
  1000, 1996, 7409, 1000, 1010, 2572, 3217, 5831, 5496, 2010,
  2567, 1997, 9969, 4487, 23809, 3436, 2010, 3350, 1012, 102],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
  1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 'attention_mask': [1, 1, 1, 1, 1,
 

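### Dynamic padding

To handle input sentences of different lengths, we need to pad the inputs. We did not use the `padding` argument during tokenization, though, because padding everything up front is inefficient. It is better to pad the samples when building a batch, so that we only pad to the maximum length in that batch rather than the maximum length in the whole dataset. This can save a lot of time and processing power when the inputs have very variable lengths!

The 🤗 Transformers library provides such a function via `DataCollatorWithPadding`. It takes a tokenizer when you instantiate it (to know which padding token to use, and whether the model expects padding on the left or on the right of the inputs) and does everything you need: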


In [5]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

Let's grab a few samples and check whether their `input_ids` all have the same length.


In [6]:
samples = tokenized_datasets["train"][:8]
samples = {k: v for k, v in samples.items() if k not in [
    "idx", "sentence1", "sentence2"]}
[len(x) for x in samples["input_ids"]]

[50, 59, 47, 67, 59, 50, 62, 32]

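Unsurprisingly, we get samples of varying length, from 32 to 67. Dynamic padding means the samples in this batch should all be padded to a length of 67, the maximum length inside the batch.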
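As a minimal sketch of what padding to the batch maximum involves (this is not the `DataCollatorWithPadding` implementation), the hypothetical `pad_batch` helper below right-pads dummy sequences with the lengths printed above. It assumes a padding id of 0, which matches the zeros seen in the padded BERT batch earlier.

```python
PAD_ID = 0  # assumed padding id, matching the zeros in the padded batch shown earlier


def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad every sequence to the longest length in this batch (sketch)."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return input_ids, attention_mask


lengths = [50, 59, 47, 67, 59, 50, 62, 32]
fake_batch = [[1] * n for n in lengths]  # dummy token ids, only the lengths matter
input_ids, attention_mask = pad_batch(fake_batch)

print({len(row) for row in input_ids})              # every row now has the same length
print(sum(row.count(0) for row in attention_mask))  # total number of padding positions
```

Padding to a fixed, dataset-wide maximum would add more padding positions whenever a batch happens to contain only short sequences, which is the cost dynamic padding avoids.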


In [7]:
batch = data_collator(samples)
{k: v.shape for k, v in batch.items()}

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


{'input_ids': torch.Size([8, 67]),
 'token_type_ids': torch.Size([8, 67]),
 'attention_mask': torch.Size([8, 67]),
 'labels': torch.Size([8])}

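As you can see, `DataCollatorWithPadding` has dynamically padded the samples to the maximum length in the batch.


## Fine-tuning a model with the Trainer API

🤗 Transformers provides a `Trainer` class to help you fine-tune any pretrained model on your own dataset.


Here is the preprocessing code from the previous step: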


In [4]:
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

raw_datasets = load_dataset("glue", "mrpc")
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)


def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)


tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)  # dynamic padding

Found cached dataset glue (/Users/elton/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)
100%|██████████| 3/3 [00:00<00:00, 919.67it/s]
Loading cached processed dataset at /Users/elton/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-bb21e6423b980722.arrow
Loading cached processed dataset at /Users/elton/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-d1e8c90b5d349f7a.arrow
Loading cached processed dataset at /Users/elton/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-33304e37c309912f.arrow


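### Training

#### 1. Define the TrainingArguments

This class contains all the hyperparameters the Trainer uses for training and evaluation. The only argument you have to provide is the directory where the trained model and the checkpoints along the way will be saved. For everything else you can keep the defaults, which should work well for a basic fine-tuning.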


In [None]:
# %pip install accelerate -U

In [5]:
from transformers import TrainingArguments

# training_args = TrainingArguments("test-trainer") for Linux or windows os with GPU
training_args = TrainingArguments(
    output_dir="test_trainer", use_mps_device=True)  # for M1 mac

#### 2. Define the model


In [6]:
from transformers import AutoModelForSequenceClassification

checkpoint = "bert-base-uncased"
model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint, num_labels=2)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly i

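You get a warning after instantiating this pretrained model. This is because BERT has not been pretrained on classifying pairs of sentences: the pretraining head is discarded and a new, randomly initialized sequence classification head is added instead, and training that head is exactly what we are about to do.


#### 3. Define the Trainer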


In [7]:
from transformers import Trainer

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

#### 4. Start training


In [10]:
trainer.train()

                                        
{'loss': 0.5004, 'learning_rate': 3.184458968772695e-05, 'epoch': 1.09}
{'loss': 0.2709, 'learning_rate': 1.3689179375453886e-05, 'epoch': 2.18}
{'train_runtime': 283.6369, 'train_samples_per_second': 38.796, 'train_steps_per_second': 4.855, 'train_loss': 0.3179263140554435, 'epoch': 3.0}





TrainOutput(global_step=1377, training_loss=0.3179263140554435, metrics={'train_runtime': 283.6369, 'train_samples_per_second': 38.796, 'train_steps_per_second': 4.855, 'train_loss': 0.3179263140554435, 'epoch': 3.0})

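This starts the fine-tuning (which should take a few minutes on a GPU) and reports the training loss every 500 steps. It won't, however, tell you how well (or badly) your model is performing. This is because:

1. We didn't tell the Trainer to evaluate during training by setting `evaluation_strategy` to either `"steps"` (evaluate every `eval_steps`) or `"epoch"` (evaluate at the end of each epoch).
2. We didn't give the Trainer a `compute_metrics()` function to calculate a metric during that evaluation (without one, evaluation would only print the loss, which is not a very intuitive number).


### Evaluation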


In [8]:
predictions = trainer.predict(tokenized_datasets["validation"])
print(predictions.predictions.shape, predictions.label_ids.shape)

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
100%|██████████| 51/51 [00:02<00:00, 17.81it/s]

(408, 2) (408,)





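`predictions.predictions` is a two-dimensional array of shape 408 x 2 (408 being the number of elements in the dataset we used). These are the logits for each element of the dataset we passed to `predict()` (as you saw in the previous chapter). To turn them into predictions that we can compare with the true labels, we need to take the index of the maximum value along the second axis: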
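As a small illustration of what taking the argmax along the last axis computes, here is a plain-Python sketch on made-up logits. `argmax_last_axis` is a hypothetical helper, and the logit values are invented for the example.

```python
def argmax_last_axis(rows):
    """For each row, return the index of its largest value (sketch of argmax over the last axis)."""
    return [max(range(len(row)), key=row.__getitem__) for row in rows]


# Hypothetical logits for three examples and two classes.
logits = [[-1.2, 2.3], [0.7, -0.4], [0.1, 0.9]]
print(argmax_last_axis(logits))  # [1, 0, 1]
```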


In [9]:
import numpy as np

preds = np.argmax(predictions.predictions, axis=-1)

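To build our `compute_metrics()` function and evaluate the model in a more intuitive way, we will rely on the metrics from the [🤗 Evaluate](https://github.com/huggingface/evaluate/) library. We can load the metrics associated with the MRPC dataset as easily as we loaded the dataset, this time with the `evaluate.load()` function. The object returned has a `compute()` method we can use to do the metric calculation: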


In [None]:
# %pip install evaluate
# %pip install scipy scikit-learn

In [10]:
import evaluate

metric = evaluate.load("glue", "mrpc")
metric.compute(predictions=preds, references=predictions.label_ids)

{'accuracy': 0.6838235294117647, 'f1': 0.8122270742358079}

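The output above reports an accuracy of 68.38% and an F1 score of 81.22 on the validation set. These are the two metrics used to evaluate results on the MRPC dataset for the GLUE benchmark. Note that these values are exactly what a classifier that labels every pair as `equivalent` would score (279 correct out of 408), which suggests this cell was run against a model whose classification head had not yet been fine-tuned; the out-of-order cell numbers point the same way. A fine-tuned model does much better: the run with per-epoch evaluation later in this section reaches 86.76% accuracy and an F1 score of 90.78 after three epochs. For comparison, the BERT paper reports an F1 score of 88.9 for the base model. Exact results vary between runs because the classification head is randomly initialized.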
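As a hand-rolled sketch of what the two metrics measure (this is not the `evaluate` implementation), the hypothetical `accuracy_and_f1` helper below computes accuracy and binary F1 from predictions and references. The label split of 279 `equivalent` and 129 `not_equivalent` examples is an assumption inferred from the accuracy printed above (0.6838 × 408 ≈ 279). Under that assumption, predicting `equivalent` for every pair reproduces both printed numbers.

```python
def accuracy_and_f1(predictions, references, positive=1):
    """Compute accuracy and binary F1 for the positive class (sketch)."""
    pairs = list(zip(predictions, references))
    tp = sum(p == positive and r == positive for p, r in pairs)  # true positives
    fp = sum(p == positive and r != positive for p, r in pairs)  # false positives
    fn = sum(p != positive and r == positive for p, r in pairs)  # false negatives
    accuracy = sum(p == r for p, r in pairs) / len(pairs)
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return {"accuracy": accuracy, "f1": f1}


# Assumed label split: 279 "equivalent" (1) and 129 "not_equivalent" (0) examples.
references = [1] * 279 + [0] * 129
always_equivalent = [1] * len(references)

scores = accuracy_and_f1(always_equivalent, references)
print(scores)
```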


Wrapping everything together, we get our `compute_metrics()` function:


In [11]:
def compute_metrics(eval_preds):
    metric = evaluate.load("glue", "mrpc")
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

To see how the model performs at the end of each epoch, here is how we define a new Trainer with the `compute_metrics()` function:


In [14]:
# training_args = TrainingArguments("test-trainer", evaluation_strategy="epoch")
training_args = TrainingArguments(
    output_dir="test_trainer", use_mps_device=True, evaluation_strategy="epoch")
model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint, num_labels=2)

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly i

In [15]:
trainer.train()


{'eval_loss': 0.38603249192237854, 'eval_accuracy': 0.8431372549019608, 'eval_f1': 0.8907849829351535, 'eval_runtime': 3.8506, 'eval_samples_per_second': 105.957, 'eval_steps_per_second': 13.245, 'epoch': 1.0}
{'loss': 0.5017, 'learning_rate': 3.184458968772695e-05, 'epoch': 1.09}
{'eval_loss': 0.5124179124832153, 'eval_accuracy': 0.8553921568627451, 'eval_f1': 0.902155887230514, 'eval_runtime': 3.8952, 'eval_samples_per_second': 104.744, 'eval_steps_per_second': 13.093, 'epoch': 2.0}
{'loss': 0.292, 'learning_rate': 1.3689179375453886e-05, 'epoch': 2.18}
{'eval_loss': 0.6257346868515015, 'eval_accuracy': 0.8676470588235294, 'eval_f1': 0.9078498293515359, 'eval_runtime': 3.9902, 'eval_samples_per_second': 102.25, 'eval_steps_per_second': 12.781, 'epoch': 3.0}
{'train_runtime': 292.3134, 'train_samples_per_second': 37.645, 'train_steps_per_second': 4.711, 'train_loss': 0.33513128852705726, 'epoch': 3.0}





TrainOutput(global_step=1377, training_loss=0.33513128852705726, metrics={'train_runtime': 292.3134, 'train_samples_per_second': 37.645, 'train_steps_per_second': 4.711, 'train_loss': 0.33513128852705726, 'epoch': 3.0})

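## A full training loop

This section shows how to get the same results as the previous section without using the `Trainer` class.


The preprocessing code from earlier: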


In [16]:
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

raw_datasets = load_dataset("glue", "mrpc")
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)


def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)


tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)  # tokenize
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)  # dynamic padding

Found cached dataset glue (/Users/elton/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)
100%|██████████| 3/3 [00:00<00:00, 1239.09it/s]
Loading cached processed dataset at /Users/elton/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-bb21e6423b980722.arrow
Loading cached processed dataset at /Users/elton/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-d1e8c90b5d349f7a.arrow
Loading cached processed dataset at /Users/elton/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-33304e37c309912f.arrow


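### Preparing for training

We need to:

1. Remove the columns corresponding to values the model does not expect (such as the `sentence1`, `sentence2`, and `idx` columns).
2. Rename the column `label` to `labels` (because the model expects the argument to be named `labels`).
3. Set the format of the datasets so they return PyTorch tensors instead of lists.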


In [17]:
tokenized_datasets = tokenized_datasets.remove_columns(
    ["sentence1", "sentence2", "idx"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")
tokenized_datasets["train"].column_names

['labels', 'input_ids', 'token_type_ids', 'attention_mask']

In [18]:
from torch.utils.data import DataLoader

train_dataloader = DataLoader(
    tokenized_datasets["train"], shuffle=True, batch_size=8, collate_fn=data_collator
)
eval_dataloader = DataLoader(
    tokenized_datasets["validation"], batch_size=8, collate_fn=data_collator
)

In [19]:
for batch in train_dataloader:
    break
{k: v.shape for k, v in batch.items()}

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


{'labels': torch.Size([8]),
 'input_ids': torch.Size([8, 68]),
 'token_type_ids': torch.Size([8, 68]),
 'attention_mask': torch.Size([8, 68])}

In [21]:
batch.items()

dict_items([('labels', tensor([1, 1, 1, 1, 0, 0, 0, 1])), ('input_ids', tensor([[  101,  5153,  5207,  2097,  2022,  2067,  1999,  2254,  2044,  2732,
         10454, 21863, 25441,  3957,  4182,  1012,   102,  6788,  2036,  3488,
          2000,  2016,  2140,  3726,  5153,  5207,  2127,  2254,  2004,  2732,
         10454, 21863, 25441,  2038,  2014,  2034,  2775,  1012,   102,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0],
        [  101,  1996, 18269,  5565,  2006,  2049,  2217,  1998, 10070,  2055,
          2260,  3620,  1010,  2763,  7944,  2011,  1996,  2364, 13561,  1012,
           102,  1996, 18269,  3092,  2039,  2006,  2049,  2217,  1998,  2596,
          2000,  2031,  2042,  7944,  2055,  2871,  6199,  2011,  1996,  2364,
         13561,  2044,  4899,  1012,   102,     0,     0,     0,     0, 

### Loading the model


In [20]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint, num_labels=2)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly i

In [22]:
outputs = model(**batch)
print(outputs.loss, outputs.logits.shape)

tensor(0.6796, grad_fn=<NllLossBackward0>) torch.Size([8, 2])


When `labels` are provided, all 🤗 Transformers models return the loss for the batch, and we also get the logits (two for each input in the batch, so a tensor of size 8 x 2).
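The `grad_fn=<NllLossBackward0>` in the output above is consistent with a cross-entropy loss over the two classes. As a rough sketch of that computation (plain Python with made-up logits, not the library code), the hypothetical `cross_entropy` helper below shows why a freshly initialized head gives a loss near ln(2) ≈ 0.693, close to the 0.6796 printed above.

```python
import math


def cross_entropy(logits, labels):
    """Mean negative log-likelihood of the true class under softmax(logits) (sketch)."""
    total = 0.0
    for row, label in zip(logits, labels):
        # log-sum-exp, with the row maximum subtracted for numerical stability
        m = max(row)
        log_norm = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_norm - row[label]
    return total / len(labels)


# Hypothetical logits: an untrained head has roughly no preference between the classes.
logits = [[0.0, 0.0]] * 8
labels = [1, 1, 1, 1, 0, 0, 0, 1]
loss = cross_entropy(logits, labels)
print(loss)  # about ln(2) ≈ 0.693
```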


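### Optimizer

Since we are trying to replicate by hand what the Trainer does, we will use the same optimizer and learning rate scheduler. The optimizer used by the Trainer is AdamW: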


In [23]:
from torch.optim import AdamW  # transformers.AdamW is deprecated in favor of the PyTorch version

optimizer = AdamW(model.parameters(), lr=5e-5)



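### Learning rate scheduler

The learning rate scheduler used by default is simply a linear decay from the maximum value (5e-5) to 0. To define it, we need to know the number of training steps we will take, which is the number of epochs multiplied by the number of training batches (the length of our training dataloader). The Trainer uses three epochs by default, so we do the same: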


In [24]:
from transformers import get_scheduler

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)
print(num_training_steps)

1377


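### The training loop


If you have access to a GPU, you will want to use it (on a CPU, training might take a few hours instead of a few minutes). The cell below also falls back to Apple's MPS backend when CUDA is unavailable.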


In [25]:
import torch

# Prefer CUDA, then Apple's MPS backend, then fall back to the CPU.
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")
model.to(device)
device

device(type='mps')

In [26]:
from tqdm.auto import tqdm

progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

