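# Getting Started with Fine-Tuning in Hugging Face Transformers

This example walks through the main steps of fine-tuning a model with Transformers:
- Downloading the dataset
- Preprocessing the data
- Configuring training hyperparameters
- Setting up evaluation metrics
- A basic introduction to the Trainer
- Hands-on training
- Saving the model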

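## YelpReviewFull Dataset

**Hugging Face dataset: [YelpReviewFull](https://huggingface.co/datasets/yelp_review_full)**

### Dataset Summary

The Yelp reviews dataset consists of reviews from Yelp. It is extracted from the Yelp Dataset Challenge 2015 data.

### Supported Tasks and Leaderboards
Text classification, sentiment classification: the dataset is mainly used for text classification, that is, predicting the sentiment of a given text.

### Languages
The reviews are mainly written in English.

### Dataset Structure

#### Data Instances
A typical data point consists of a text and its corresponding label.

An example from the YelpReviewFull test set looks as follows: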

```json
{
    'label': 0,
    'text': 'I got \'new\' tires from them and within two weeks got a flat. I took my car to a local mechanic to see if i could get the hole patched, but they said the reason I had a flat was because the previous patch had blown - WAIT, WHAT? I just got the tire and never needed to have it patched? This was supposed to be a new tire. \\nI took the tire over to Flynn\'s and they told me that someone punctured my tire, then tried to patch it. So there are resentful tire slashers? I find that very unlikely. After arguing with the guy and telling him that his logic was far fetched he said he\'d give me a new tire \\"this time\\". \\nI will never go back to Flynn\'s b/c of the way this guy treated me and the simple fact that they gave me a used tire!'
}
```

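#### Data Fields

- 'text': The review text is escaped using double quotes ("), and any internal double quote is escaped by two double quotes (""). Newlines are escaped with a backslash followed by an "n" character, i.e. "\n".
- 'label': The review's star rating. It is stored as a class index from 0 to 4, corresponding to 1 to 5 stars (the example above has label 0, i.e. 1 star).

#### Data Splits

The Yelp reviews full star dataset is constructed by randomly taking 130,000 training samples and 10,000 test samples for each star rating from 1 to 5. In total there are 650,000 training samples and 50,000 test samples.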
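
The field conventions above can be illustrated with a small sketch. The label names follow the star labels that appear in this walkthrough's sample output; the helper names `label_to_stars` and `unescape_review` are hypothetical, and the unescaping covers only the two rules described for the raw text (doubled quotes and a literal backslash followed by "n").

```python
# Class names in label-index order, as they appear in the sample output of this
# walkthrough (assumption: index 0 corresponds to the 1-star class).
LABEL_NAMES = ["1 star", "2 star", "3 stars", "4 stars", "5 stars"]


def label_to_stars(label: int) -> int:
    """Map a stored class index (0-4) to its star rating (1-5)."""
    if not 0 <= label < len(LABEL_NAMES):
        raise ValueError(f"label must be in 0..4, got {label}")
    return label + 1


def unescape_review(text: str) -> str:
    """Undo the two escaping rules described for the 'text' field."""
    # A literal backslash + "n" becomes a real newline; doubled quotes collapse.
    return text.replace("\\n", "\n").replace('""', '"')


# Hypothetical record, written only to exercise the helpers.
example = {"label": 0, "text": 'He said ""this time"". \\nNever again.'}
print(label_to_stars(example["label"]), LABEL_NAMES[example["label"]])
print(unescape_review(example["text"]))
```
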

## Downloading the Dataset

In [1]:
import os
os.environ['HF_ENDPOINT'] = "https://hf-mirror.com"

In [2]:
from datasets import load_dataset

dataset = load_dataset("yelp_review_full")


In [3]:
dataset

DatasetDict({
    train: Dataset({
        features: ['label', 'text'],
        num_rows: 650000
    })
    test: Dataset({
        features: ['label', 'text'],
        num_rows: 50000
    })
})

In [4]:
dataset["train"][100]

{'label': 0,
 'text': 'My expectations for McDonalds are t rarely high. But for one to still fail so spectacularly...that takes something special!\\nThe cashier took my friends\'s order, then promptly ignored me. I had to force myself in front of a cashier who opened his register to wait on the person BEHIND me. I waited over five minutes for a gigantic order that included precisely one kid\'s meal. After watching two people who ordered after me be handed their food, I asked where mine was. The manager started yelling at the cashiers for \\"serving off their orders\\" when they didn\'t have their food. But neither cashier was anywhere near those controls, and the manager was the one serving food to customers and clearing the boards.\\nThe manager was rude when giving me my order. She didn\'t make sure that I had everything ON MY RECEIPT, and never even had the decency to apologize that I felt I was getting poor service.\\nI\'ve eaten at various McDonalds restaurants for over 30 years. 

In [5]:
import random
import pandas as pd
import datasets
from IPython.display import display, HTML

In [6]:
def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

In [7]:
show_random_elements(dataset["train"])

Unnamed: 0,label,text
0,4 stars,"Peckhams bills itself as a Vintners, Victuallers and Restaurateur - I'll save you the trouble of looking up the first two words, they mean 'wine merchant' and 'supplier of food' and this pretty much sums up what they do; however, the words don't perhaps portray the fact that they sell bloody good food and bloody good wine. The deli counter at this branch in Bruntsfield is full of tasty tidbits like good quality salami and damn fine cheesecake. They also sell prepacked sandwiches (far superior to anything you could get from the Subway a few doors down) and homemade soup that's delicious and filling. Until the recent changes in Scottish licensing law, they somehow managed to sell booze long after every other offy had closed (perhaps something to do with the restaurant downstairs?). Unfortunately the Scottish Government have put a stop to this, they still sell great wine and interesting spirits and beers, but only until 10pm these days. Meh."
1,2 star,"Went to St Francis last week for the first time with three of my friends. The menu is limited. The cheapest thing is a burger or pizza for twelve dollars. The burger was terrible, over cooked and lacked flavor. The pizza was better. One of my dinner companians had the pork tenderloin. It was very good , a bit spicy but very tender. \nThe wines are pricey , nothing under eight dollars. \n\nParking is pretty bad unless you go late at night.\n\nProbably won't be going back anythime soon."
2,3 stars,"I feel mixed about writing this review. With all I will document below, I still will stop in again for some hot-and-readys in the future. So, initially I was super excited to realize that there was a Little Caesar's on Center close to home- though I hate that shopping area because the traffic is always terrible and it's a nightmare getting in and out of. I planned a pizza night with a friend and timed it perfectly to stop in here for hot-and-ready pizzas after work to bring home. I even called in earlier to make sure that I could stop straight in, buy pizzas, and walk out. I realize that it was right after work when they are probably at their busiest, but I walked in and there were no hot-and-ready pizzas there! Plus, there were only two people working! It upsets me when businesses advertise specials or deals and can't guarantee or follow through with them. I didn't have to wait long (maybe 10-15 min), but I implore Little Caesar's to ONLY promise hot-and-ready pizzas when they are both hot AND ready. On the other hand, the woman behind the counter was intent on getting the pizzas out as soon as possible and was incredibly nice- it's possible that they were simply understaffed for the day. And in their defense, the pizzas were hot and still tasted great- and you can't beat the price! The other items on their menu are a great deal two (i.e., only $3 for crazy bread sticks). With that said, I will definitely be stopping in again for those deals."
3,1 star,"Do not recommend: We love this place before but recently my group of friends are disappointed. Gratuity is a reward for a satisfied job. It is not dictated by a policy. If u want to get the staff $10 min tip, post a sign or change ur price to. $30. I m willing to pay that. But don't dictate how much I should tip.\nAndy, the owner and his girlfriend never smiles. Other places to go in the town. Don't bother with this one."
4,4 stars,"This is our favorite spot for sushi. It's rarely busy and the food is always fabulous. We went again Friday night, and the service was a little off. We had to wait a long time for our order, and they forgot my rib-eye skewer. The sushi was top notch, though. My husband ordered a trio of sashimi and the presentation was exquisite.\n\nWe will continue to go back because the quality of the food, the presentation and the price is worth it. We do wish they'd bring back the watermelon sorbet..."
5,5 stars,"I found Jayeness (Jim Moloney) when searching for someone to remove the 1980's acoustic popcorn from the 18 foot vaulted ceilings in my newly purchased house. He was fantastic to work with, well priced, and works to please. Jim has been in the business for over 20 years, and really takes pride in what he does. Jim likes to play with color in homes and gave us some color choosing advice, which I very much appreciated. He always showed up early, worked very efficiently, is very detail oriented, and cleaned up well after he finished the job. If I were to have more rooms in the house (or the outside) painted, I would definitely hire him again!"
6,1 star,"We purchased 4 Dozen Pork Tamale's at the Red Eagle El Porvenir Tortilla Factory as we were hosting a Party at a Big Major Corporate Office Christmas Party on December 22, 2011. \nAs we always have for the last 4 year's.\nAs they have been delicious in the past. \n\nBut this time, we were Totally Disappointed, Embarrassed and Mad as these Pork Tamales were Terrible.\n\nWe served these Tamales to Higher Up's, President, Vice President's and Other Higher Up's at this Major Corporation and they did not look too impressed. \nOut of 30 People - Not \""One Person\"" told us, \""Those Tamale's were Good\"". \n\nThey were \""Too Salty\"" and the Chili Flavor of the Masa had No Flavor, No Taste. \nToo much Meat... and Not Enough Masa.\n\nWe threw the leftover's in the Gabage. They were Terrible. \n\nSomebody has either changed \""Owner's of this Restaurant or somebody has changed \""Cook's\"" at this Establishment. \n\nOh and by the way, they have raised their Price's on their Tamales and their Tamale's are much more Smaller than in the Past. \n\nAnyway, we are Pre-Warning ALL Future Customer's \""not\"" to purchase these Tamales. \n\nYou'll be sorry."
7,2 star,"Went to this food truck on a Wednesday night where some trucks make a stop in downtown. Being a SoCal boy, Asian, and loves Asian food, I really want to love this place. However, it just gave me reasons not to.\n\nThe service was great, as the lady was welcoming with a smile. On the other hand, I came here for the food, and it did not live up to expectations. For example, we had the vegetable stir fry. I am not sure if they were going the healthy route, but to be frank, it was bland. Of course it was fresh from the pan so it was steaming in my to go box, although every bite was exactly the same. It had crunch, I tasted the veggies, but that was about it. No salty flavor from the soy sauce or any spices to add to the mix. My wife and I were very disappointed. Although it could have gone up when we tried the Banh Mi...\n\nAnd...another disappointment. The reason I love Banh Mis are the French baguette crunch, as you bite through the sandwhich it may be a little rough, but the spice and Asian flavors should mix in so well. I, on the other hand, had a sandwhich which bread was soggy from the juices of the meat/veggies. From my experience, this should be a somewhat dry sandwhich that relies on the spices and flavor from the meat/veggies to compliment the crispy baguette. No crispy baguette, and definitely a wet Banh Mi, which I was not looking forward to. I understand some Banh Mis may be different, but this is definitely not what I am used to from San Diego and LA.\n\nI really want to like this place, as we need more great Asian places in Phoenix, but it definitely did not live up to what I expected. I hope to try the Pho out one day, but I have definitely experienced better."
8,2 star,"This is in a good simple location. After working full time, a baby, and all of the other life things...sometimes I just don't have the energy to wash my own car. The last time I was here I they removed my custom chevy bowtie antenna (which was fine) however at the finishing area they had put it on completly sideways. I had asked one of the employees if he had the wrench to tighten it and straighten it (since they did take it off) he did not speak one word of english so I gestured with my had I wanted it straight. He grabbed it with his hands and started twisting it. I did tell him no several times progressively louder, but either the lack of english or him being smarter than me in that matter he continued. Just as I was reaching up to stop him, he snapped my custom bowtie antenna in half. No, they didn't offer to fix it, or even give me a free wash or two.."
9,2 star,"We shopped at Furniture Row a few months ago, looking for a new couch and king sized bed and frame. We found everything we really liked there, and were happy to purchase at the prices given.\n\nOur couch arrived on schedule, but our bed only arrived with the frame, which was odd because we were told they would all arrive together. We were then told our mattress had to be 'made' because they 'ran out'. It took another 3 weeks before the king sized mattress actually arrived. I do love the mattress and frame, however, so I am giving this review higher than a one.\n\nOnto the couch. The couch came in and was great, until we learned it had stuffing instead of pillows inside, meaning we couldn't really ever remove the stuffing to wash the removable cases. Seems pointless to have a zipper if you won't ever use it. Then about 4 months into use, the cushions started bunching up and being wrinkly. We went back to the Furniture Row to ask if there was something that could be done, three times...\n\nOn every occasion, a sales person lied to our face by saying they would contact their 'regional manager' and he would get back to us, but there was probably nothing they could do. He has not contacted us yet in over a month.\n\nShort version: My bed is really nice, my couch is less than stellar, and their logistics and customer service are amazingly bad. Would not suggest using this company if you expect to ever need customer service help."


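## Preprocessing the Data

After downloading the dataset locally, use a Tokenizer to process the text. For inputs of varying length, padding and truncation strategies can be applied.

The Datasets `map` method applies a preprocessing function to the entire dataset in one pass.

Below, the entire dataset is processed with the pad-to-max-length strategy: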

In [8]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")


def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)


tokenized_datasets = dataset.map(tokenize_function, batched=True)


In [9]:
show_random_elements(tokenized_datasets["train"], num_examples=1)

Unnamed: 0,label,text,input_ids,token_type_ids,attention_mask
0,4 stars,"the food itself is worth 4 stars, tasty toasted subs-parking is not the greatest but its worth the effort","[101, 1103, 2094, 2111, 1110, 3869, 125, 2940, 117, 27629, 13913, 17458, 1174, 4841, 1116, 118, 5030, 1110, 1136, 1103, 4459, 1133, 1157, 3869, 1103, 3098, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...]","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...]"
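
The trailing zeros in `input_ids` and `attention_mask` above come from padding to the maximum length. The following is a simplified sketch of that idea, not the tokenizer's actual implementation: `pad_or_truncate` is a hypothetical helper, and unlike a real tokenizer it does not re-append the closing `[SEP]` token (id 102) when it truncates.

```python
from typing import Dict, List

PAD_ID = 0  # the padding id seen as trailing zeros in the output above


def pad_or_truncate(ids: List[int], max_length: int) -> Dict[str, List[int]]:
    """Force a token-id sequence to exactly max_length and build its mask."""
    kept = ids[:max_length]         # truncation: drop anything past max_length
    n_pad = max_length - len(kept)  # padding: how many slots remain empty
    return {
        "input_ids": kept + [PAD_ID] * n_pad,
        # 1 marks a real token, 0 marks padding the model should ignore
        "attention_mask": [1] * len(kept) + [0] * n_pad,
    }


short = [101, 1103, 2094, 102]
print(pad_or_truncate(short, max_length=8))
print(pad_or_truncate(list(range(1, 11)), max_length=4))
```

Padding every example to the model's maximum length is simple but spends compute on padding tokens, which is one reason short reviews like the one above carry so many zeros.
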


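### Sampling the Data

Use 1,000 samples to demonstrate small-scale training on BERT (with the PyTorch Trainer).

`shuffle()` randomly reorders the rows of the dataset. If you want more control over the algorithm used to shuffle the dataset, you can pass the `generator` parameter to use a different `numpy.random.Generator`.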

In [10]:
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))
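
The effect of `shuffle(seed=42).select(range(1000))` can be pictured as "permute the row indices reproducibly, then keep the first N". The sketch below shows that idea with the standard library only. `shuffled_subset` is a hypothetical helper, and it will not reproduce the exact permutation that Datasets generates, since that library uses a NumPy generator.

```python
import random
from typing import List


def shuffled_subset(n_rows: int, n_keep: int, seed: int) -> List[int]:
    """Return n_keep row indices drawn from a seeded permutation of n_rows."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # same seed -> same permutation
    return indices[:n_keep]


train_idx = shuffled_subset(n_rows=650_000, n_keep=1000, seed=42)
print(len(train_idx), len(set(train_idx)))
print(train_idx == shuffled_subset(650_000, 1000, seed=42))
```

Fixing the seed keeps the sampled subset identical across runs, which makes results from different hyperparameter settings comparable.
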

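## Fine-Tuning Configuration

### Loading the BERT Model

Loading the model prints a warning that some weights (the `classifier.weight` and `classifier.bias` of the new classification head) are newly initialized instead of being loaded from the checkpoint. This is entirely normal when fine-tuning: the head used for the pretraining task (masked language modeling) is dropped and replaced with a new head for which there are no pretrained weights. The library therefore warns us to fine-tune the model before using it for inference, which is exactly what we are about to do.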

In [11]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### Training Hyperparameters (TrainingArguments)

Full list of configuration parameters and their defaults: https://huggingface.co/docs/transformers/v4.36.1/en/main_classes/trainer#transformers.TrainingArguments

Source definition: https://github.com/huggingface/transformers/blob/v4.36.1/src/transformers/training_args.py#L161

**Most important setting: the path where model weights are saved (output_dir)**

In [11]:
from transformers import TrainingArguments

model_dir = "/root/autodl-tmp/peft-models/bert-base-cased-finetune-yelp"

# logging_steps defaults to 500; given our training data size and batch size, set it to 100
training_args = TrainingArguments(output_dir=model_dir,
                                  per_device_train_batch_size=16,
                                  num_train_epochs=5,
                                  logging_steps=100)
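
Why lower `logging_steps` to 100? A rough step count makes it clear. The sketch below assumes the 1,000-example sample from the previous section is used for training, on a single device with no gradient accumulation; `total_training_steps` and `logged_steps` are hypothetical helpers, not part of Transformers.

```python
import math
from typing import List


def total_training_steps(n_samples: int, batch_size: int, epochs: int) -> int:
    """Optimizer steps for one device and gradient_accumulation_steps=1."""
    return math.ceil(n_samples / batch_size) * epochs


def logged_steps(total_steps: int, logging_steps: int) -> List[int]:
    """Global steps at which a periodic training-loss log would be emitted."""
    return list(range(logging_steps, total_steps + 1, logging_steps))


total = total_training_steps(n_samples=1000, batch_size=16, epochs=5)
print(total)                     # 63 steps per epoch * 5 epochs
print(logged_steps(total, 100))  # logs at steps 100, 200, 300
print(logged_steps(total, 500))  # no periodic log with the default of 500
```

With the default value the run would end before the first periodic loss log, so a smaller interval is needed to see the loss curve at all.
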

In [12]:
# Full hyperparameter configuration
print(training_args)

TrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
dispatch_batches=None,
do_eval=False,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=IntervalStrategy.NO,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
gradient_checkpointing_kwargs=None,
greater_is_better=

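### Evaluating Metrics During Training (Evaluate)

The **[Hugging Face Evaluate library](https://huggingface.co/docs/evaluate/index)** gives access, with a single line of code, to dozens of evaluation methods across domains (natural language processing, computer vision, reinforcement learning, and more). The full list of currently supported metrics: https://huggingface.co/evaluate-metric

The Trainer does not automatically evaluate model performance during training, so we need to pass it a function that computes and reports metrics.

The Evaluate library provides a simple accuracy metric, which you can load with `evaluate.load`: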

In [13]:
import numpy as np
import evaluate

metric = evaluate.load("accuracy")


Next, call `compute` to calculate the accuracy of the predictions.

Before passing predictions to `compute`, we need to convert the logits into predictions (**all Transformers models return logits**).

In [14]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
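
To see what the argmax-then-compare step does, here is a dependency-free sketch on made-up logits. The helpers `argmax` and `accuracy` are hypothetical stand-ins for `np.argmax(..., axis=-1)` and the accuracy metric, and the numbers are illustrative only.

```python
from typing import Dict, List, Sequence


def argmax(row: Sequence[float]) -> int:
    """Index of the largest logit, i.e. the predicted class."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy(logits: List[Sequence[float]], labels: List[int]) -> Dict[str, float]:
    """Fraction of rows whose predicted class equals the reference label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# Three made-up examples over the five star classes.
logits = [
    [2.0, 0.1, -1.0, 0.0, 0.3],  # predicts class 0
    [0.0, 0.2, 0.1, 3.5, 0.4],   # predicts class 3
    [0.1, 0.9, 0.2, 0.0, 0.3],   # predicts class 1
]
labels = [0, 3, 2]
print(accuracy(logits, labels))
```

Softmax is not needed here: it preserves the ordering of the logits, so the argmax is the same either way.
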

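#### Monitoring Metrics During Training

To monitor how evaluation metrics change during training, set the `evaluation_strategy` parameter in `TrainingArguments`. With `"epoch"`, metrics are reported at the end of each epoch. With `"steps"`, as in the cell below, they are reported every `eval_steps` steps, which falls back to `logging_steps` when `eval_steps` is not set.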

In [20]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(output_dir=model_dir,
                                  per_device_train_batch_size=20,
                                  #num_train_epochs=5,
                                  num_train_epochs=1,
                                  #evaluation_strategy="epoch", 
                                  evaluation_strategy="steps",
                                  logging_steps=200)
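
A quick estimate of how often a step-based strategy evaluates can help when choosing the interval. The sketch assumes the full 650,000-example training split, a single device, no gradient accumulation, and an evaluation interval of 200 steps; `eval_steps_schedule` is a hypothetical helper.

```python
from typing import List


def eval_steps_schedule(total_steps: int, eval_every: int) -> List[int]:
    """Global steps at which a step-based evaluation would run."""
    return list(range(eval_every, total_steps + 1, eval_every))


# One epoch over 650,000 examples with batches of 20 (ceiling division).
total_steps = -(-650_000 // 20)
schedule = eval_steps_schedule(total_steps, eval_every=200)
print(total_steps, len(schedule), schedule[:3], schedule[-1])
```

Each of those evaluations runs over the whole evaluation set, so a small evaluation sample keeps the overhead manageable. An epoch-based strategy would evaluate only once in a one-epoch run.
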

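## Start Training

### Instantiating the Trainer

Kernel version issue: the `kernel version` warning shown below does not currently affect running this example code.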

In [21]:
#small_train_dataset=tokenized_datasets["train"].shuffle(seed=42).select(range(20000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(500))
trainer = Trainer(
    model=model,
    args=training_args,
    # Sampled subset
    #train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    # Full dataset
    train_dataset=tokenized_datasets["train"],
    #eval_dataset=tokenized_datasets["test"],
    compute_metrics=compute_metrics,
)

Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


## Checking GPU Usage with nvidia-smi

To watch GPU usage in real time, poll with the `watch` command: `watch -n 1 nvidia-smi`:

```shell
Every 1.0s: nvidia-smi                                                   Wed Dec 20 14:37:41 2023

Wed Dec 20 14:37:41 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       Off | 00000000:00:0D.0 Off |                    0 |
| N/A   64C    P0              69W /  70W |   6665MiB / 15360MiB |     98%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A     18395      C   /root/miniconda3/bin/python                6660MiB |
+---------------------------------------------------------------------------------------+
```
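
The same figures can also be read from Python, which is convenient inside a notebook. The sketch below calls `nvidia-smi` with its `--query-gpu` CSV output and parses utilization and memory. The helper names are hypothetical, the parser assumes every field is a plain integer (some GPUs report `[N/A]` for certain fields), and `query_gpus` returns an empty list when `nvidia-smi` is not on the PATH.

```python
import shutil
import subprocess
from typing import Dict, List

QUERY = "utilization.gpu,memory.used,memory.total"


def parse_gpu_csv(output: str) -> List[Dict[str, int]]:
    """Parse 'util, used, total' CSV lines (no header, no units), one per GPU."""
    rows = []
    for line in output.strip().splitlines():
        util, used, total = (int(field.strip()) for field in line.split(","))
        rows.append({"util_pct": util, "mem_used_mib": used, "mem_total_mib": total})
    return rows


def query_gpus() -> List[Dict[str, int]]:
    """Run nvidia-smi once and return one dict per GPU (empty if unavailable)."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_csv(result.stdout)


# Values taken from the snapshot above: 98% utilization, 6665 of 15360 MiB.
print(parse_gpu_csv("98, 6665, 15360"))
```

Calling `query_gpus()` in a loop with `time.sleep(1)` would approximate the `watch -n 1` polling shown above.
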

In [None]:
trainer.train()


In [None]:
small_test_dataset = tokenized_datasets["test"].shuffle(seed=64).select(range(100))

In [None]:
trainer.evaluate(small_test_dataset)

### Saving the Model and Training State

- Use `trainer.save_model` to save the model; it can be reloaded later with `from_pretrained()`
- Use `trainer.save_state` to save the training state

In [None]:
trainer.save_model(model_dir)

In [None]:
trainer.save_state()
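
The saved training state is a JSON file (`trainer_state.json` in the output directory) whose `log_history` list holds the logged training and evaluation entries. The sketch below pulls the evaluation accuracy per step out of such a file. `eval_history` is a hypothetical helper, and the `eval_accuracy` key assumes the accuracy metric from `compute_metrics` above, which the Trainer reports with an `eval_` prefix.

```python
import json
from pathlib import Path
from typing import Dict, List


def eval_history(output_dir: str, metric_key: str = "eval_accuracy") -> List[Dict[str, float]]:
    """Return [{'step': ..., metric_key: ...}] for every logged evaluation."""
    state_path = Path(output_dir) / "trainer_state.json"
    state = json.loads(state_path.read_text(encoding="utf-8"))
    return [
        {"step": entry["step"], metric_key: entry[metric_key]}
        for entry in state.get("log_history", [])
        if metric_key in entry  # skip training-loss entries
    ]
```

For example, `eval_history(model_dir)` after training would list the accuracy at each evaluation step, which is handy for the homework below.
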

In [None]:
# trainer.model.save_pretrained("./")

## Homework: Train on the full YelpReviewFull dataset and see how high the accuracy can go