In [1]:
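# Fine-tune a pretrained model

# Using a pretrained model has significant benefits: it reduces compute cost and carbon footprint, and lets you use state-of-the-art models without training one from scratch.
# 🤗 Transformers provides access to thousands of pretrained models for a wide range of tasks. When you use a pretrained model, you train it on a dataset specific to your task.
# This is known as fine-tuning, a very powerful training technique. In this tutorial, you will fine-tune a pretrained model with the 🤗 Transformers Trainer on PyTorch.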


In [2]:
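# 1. Prepare a dataset

# Before you can fine-tune a pretrained model, download a dataset and prepare it for training. The previous tutorial showed how to process training data; now you get to put those skills to the test.
# Start by loading the Yelp Reviews dataset: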

from datasets import load_dataset

dataset = load_dataset("yelp_review_full")
dataset["train"][100]

{'label': 0,
 'text': 'My expectations for McDonalds are t rarely high. But for one to still fail so spectacularly...that takes something special!\\nThe cashier took my friends\'s order, then promptly ignored me. I had to force myself in front of a cashier who opened his register to wait on the person BEHIND me. I waited over five minutes for a gigantic order that included precisely one kid\'s meal. After watching two people who ordered after me be handed their food, I asked where mine was. The manager started yelling at the cashiers for \\"serving off their orders\\" when they didn\'t have their food. But neither cashier was anywhere near those controls, and the manager was the one serving food to customers and clearing the boards.\\nThe manager was rude when giving me my order. She didn\'t make sure that I had everything ON MY RECEIPT, and never even had the decency to apologize that I felt I was getting poor service.\\nI\'ve eaten at various McDonalds restaurants for over 30 years. 

In [3]:
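# A tokenizer is needed to process the text, with padding and truncation strategies to handle variable sequence lengths. To process the dataset in one step, use the 🤗 Datasets map method to apply a preprocessing function over the entire dataset.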
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-cased")

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)
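To make the effect of padding="max_length" and truncation=True concrete, here is a minimal sketch on plain lists of token IDs. The helper `pad_or_truncate`, the pad ID of 0, and the maximum length of 6 are hypothetical choices for illustration. It is not the tokenizer's implementation, and it ignores details such as keeping the final special token when truncating.

```python
# Toy illustration only: pad_or_truncate, PAD_ID and MAX_LENGTH are
# hypothetical names, not part of the tokenizer API.
PAD_ID = 0
MAX_LENGTH = 6

def pad_or_truncate(token_ids, max_length=MAX_LENGTH, pad_id=PAD_ID):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]                    # truncation: drop the overflow
    n_pad = max_length - len(kept)                   # padding: fill up to max_length
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad   # 0 marks padding positions
    return input_ids, attention_mask

short_ids, short_mask = pad_or_truncate([101, 7592, 102])
long_ids, long_mask = pad_or_truncate([101, 1, 2, 3, 4, 5, 6, 7, 102])
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Every sequence comes out with the same length, which is what lets examples be stacked into fixed-shape batches later on.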

In [4]:
# Create a smaller subset of the full dataset to fine-tune on, which reduces the time required.
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))
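The shuffle(seed=42).select(range(1000)) pattern draws a reproducible random sample: a fixed seed gives the same permutation on every run, and select keeps the first 1000 rows of it. The sketch below mimics that idea with the standard library. The helper `shuffled_subset` is hypothetical, and it will not reproduce the exact rows that 🤗 Datasets picks.

```python
import random

def shuffled_subset(rows, n, seed=42):
    """Shuffle a copy of rows with a fixed seed, then keep the first n."""
    rng = random.Random(seed)   # private RNG, so global random state is untouched
    shuffled = list(rows)
    rng.shuffle(shuffled)
    return shuffled[:n]

rows = list(range(10_000))      # stand-in for dataset row indices
subset_a = shuffled_subset(rows, 1000)
subset_b = shuffled_subset(rows, 1000)
print(len(subset_a), subset_a == subset_b)
```

Because the seed is fixed, rerunning the cell yields the same subset, which keeps results comparable between runs.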

In [5]:
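# 2. Train
## Train with the PyTorch Trainer
## 🤗 Transformers provides a Trainer class optimized for training 🤗 Transformers models, making it easier to start training without writing your own training loop by hand.
## The Trainer API supports a wide range of training options and features, such as logging, gradient accumulation, and mixed precision.

## Start by loading the model and specifying the number of expected labels. From the Yelp Review dataset card, we know there are five labels: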

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("google-bert/bert-base-cased", num_labels=5)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
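The warning above is expected: the pretrained checkpoint has no classification head for this task, so classifier.weight and classifier.bias start from random values and are learned during fine-tuning. As a rough size estimate, assuming the head is a single linear layer over the 768-dimensional hidden state of BERT base, the number of newly initialized parameters can be sketched as:

```python
HIDDEN_SIZE = 768   # assumed hidden size of the BERT base encoder
NUM_LABELS = 5      # matches num_labels=5 passed to from_pretrained

weight_shape = (NUM_LABELS, HIDDEN_SIZE)   # shape assumed for classifier.weight
bias_shape = (NUM_LABELS,)                 # shape assumed for classifier.bias

new_params = weight_shape[0] * weight_shape[1] + bias_shape[0]
print(new_params)  # 3845
```

This is tiny next to the pretrained encoder, but it is the only part with no prior training, which is why the model must be fine-tuned before it gives useful predictions.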


In [6]:
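# Training hyperparameters

# Create a TrainingArguments instance, which holds all the hyperparameters you can tune as well as flags for activating different training options.
# Specify where to save the training checkpoints: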

from transformers import TrainingArguments

training_args = TrainingArguments(output_dir="test_trainer")

In [7]:
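# Evaluate

## Trainer does not automatically evaluate model performance during training. You need to pass Trainer a function to compute and report metrics.
## The 🤗 Evaluate library provides a simple accuracy metric, which you can load with the evaluate.load function: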

import numpy as np
import evaluate

metric = evaluate.load("accuracy")

In [8]:
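# metric.compute() calculates the accuracy of the predictions. Before passing predictions to compute(), convert the logits to predicted class IDs (all 🤗 Transformers models return logits).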
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
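To see what this function computes, here is a dependency-free sketch of the same two steps on made-up values: take the index of the largest logit in each row, then report the fraction of predictions that match the labels. The helpers `argmax` and `accuracy` and the toy numbers are hypothetical stand-ins for np.argmax(..., axis=-1) and the accuracy metric, not their implementations.

```python
def argmax(row):
    """Index of the largest value in a list of logits."""
    return max(range(len(row)), key=row.__getitem__)

def accuracy(logits, labels):
    """Fraction of rows whose highest-scoring class equals the label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}

toy_logits = [
    [2.0, 0.1, -1.0, 0.0, 0.3],   # highest score at index 0
    [0.2, 0.1, 3.5, 0.0, 0.3],    # highest score at index 2
    [0.0, 0.1, 0.2, 0.3, 4.0],    # highest score at index 4
    [1.0, 2.0, 0.0, 0.0, 0.0],    # highest score at index 1
]
toy_labels = [0, 2, 3, 1]         # the third example is misclassified

result = accuracy(toy_logits, toy_labels)
print(result)  # {'accuracy': 0.75}
```

No softmax is needed here, because it does not change which class has the highest score.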

In [9]:
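# To monitor evaluation metrics during fine-tuning, set the evaluation_strategy parameter in the training arguments so that metrics are reported at the end of each epoch.
# Note: newer transformers releases rename this parameter to eval_strategy.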
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="test_trainer",
    evaluation_strategy="epoch",
)

In [10]:
# Create a Trainer object with the model, training arguments, training and test datasets, and evaluation function:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)



In [11]:
# Then fine-tune the model by calling train():
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy
1,No log,1.148064,0.478
2,No log,1.050453,0.56
3,No log,1.080397,0.566


TrainOutput(global_step=375, training_loss=1.0791414388020832, metrics={'train_runtime': 2266.7285, 'train_samples_per_second': 1.323, 'train_steps_per_second': 0.165, 'total_flos': 789354427392000.0, 'train_loss': 1.0791414388020832, 'epoch': 3.0})
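The numbers in this output can be cross-checked with a little arithmetic. The sketch below assumes the TrainingArguments defaults were in effect (a batch size of 8 per device, 3 epochs, and logging every 500 steps) and that training ran on a single device; none of these values are set explicitly in the cells above.

```python
import math

num_examples = 1000    # size of small_train_dataset
batch_size = 8         # assumed default per_device_train_batch_size, one device
num_epochs = 3         # assumed default num_train_epochs
logging_steps = 500    # assumed default logging_steps
train_runtime = 2266.7285   # seconds, from the TrainOutput above

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
steps_per_second = total_steps / train_runtime
samples_per_second = num_examples * num_epochs / train_runtime
loss_logged = total_steps >= logging_steps

print(total_steps)                     # 375
print(round(steps_per_second, 3))      # 0.165
print(round(samples_per_second, 3))    # 1.323
print(loss_logged)                     # False
```

Under these assumptions the step count and throughput match the reported global_step, train_steps_per_second, and train_samples_per_second. The run ending before the first logging step is also a likely reason the Training Loss column shows "No log".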

In [12]:
# Upload the fine-tuned model to the Hugging Face Hub (requires being logged in to a Hub account)
trainer.push_to_hub()

Upload 2 LFS files:   0%|          | 0/2 [00:00<?, ?it/s]

model.safetensors:   0%|          | 0.00/433M [00:00<?, ?B/s]

training_args.bin:   0%|          | 0.00/4.66k [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/b43646/test_trainer/commit/53ff4f8e504f7ed72cd2c05db56eecadacf74434', commit_message='End of training', commit_description='', oid='53ff4f8e504f7ed72cd2c05db56eecadacf74434', pr_url=None, pr_revision=None, pr_num=None)