# Getting Started with Fine-Tuning in Hugging Face Transformers

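This example walks through the main steps of fine-tuning a model with Transformers, including:
- Downloading the dataset
- Preprocessing the data
- Configuring training hyperparameters
- Setting up evaluation metrics
- A brief introduction to the Trainer
- Running the training
- Saving the model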

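## The YelpReviewFull Dataset

**Hugging Face dataset: [YelpReviewFull](https://huggingface.co/datasets/yelp_review_full)**

### Dataset Summary

The Yelp reviews dataset consists of reviews from Yelp. It is extracted from the Yelp Dataset Challenge 2015 data.

### Supported Tasks and Leaderboards
Text classification, sentiment classification: the dataset is mainly used for text classification, that is, predicting the sentiment of a given text.

### Language
The reviews are written mainly in English.

### Dataset Structure

#### Data Instances
A typical data point consists of a text and its corresponding label.

An example from the YelpReviewFull test set looks like this: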

```json
{
    'label': 0,
    'text': 'I got \'new\' tires from them and within two weeks got a flat. I took my car to a local mechanic to see if i could get the hole patched, but they said the reason I had a flat was because the previous patch had blown - WAIT, WHAT? I just got the tire and never needed to have it patched? This was supposed to be a new tire. \\nI took the tire over to Flynn\'s and they told me that someone punctured my tire, then tried to patch it. So there are resentful tire slashers? I find that very unlikely. After arguing with the guy and telling him that his logic was far fetched he said he\'d give me a new tire \\"this time\\". \\nI will never go back to Flynn\'s b/c of the way this guy treated me and the simple fact that they gave me a used tire!'
}
```

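#### Data Fields

- 'text': The review text. In the original data files the text is escaped with double quotes ("), and any internal double quote is escaped by two double quotes (""). Newlines are escaped as a backslash followed by an "n" character, i.e. "\n".
- 'label': The review's score. Scores range from 1 to 5 stars and are stored as class indices 0 to 4, so `'label': 0` in the example above means 1 star.

#### Data Splits

The Yelp reviews full star dataset is constructed by randomly taking 130,000 training samples and 10,000 test samples for each star rating from 1 to 5. In total there are 650,000 training samples and 50,000 test samples.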

## Downloading the Dataset

In [1]:
from datasets import load_dataset

dataset = load_dataset("yelp_review_full")

Downloading readme: 0.00B [00:00, ?B/s]

Downloading data:   0%|          | 0.00/299M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/23.5M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/650000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/50000 [00:00<?, ? examples/s]

In [2]:
dataset

DatasetDict({
    train: Dataset({
        features: ['label', 'text'],
        num_rows: 650000
    })
    test: Dataset({
        features: ['label', 'text'],
        num_rows: 50000
    })
})

In [3]:
dataset["train"][333]

{'label': 3,
 'text': "All in favor of a deep dish pizza say I!.......IIIIIII,  ok now that i have that out of my system. This place is such a great hangout/eat-in spot. I hadn't been here and years and some friends invited us out for the evening. I was so glad they were paying cause  I was low on funds at the time.\\n\\nWe arrived on a friday night and of course it was busy there. We waited about 10 minutes to get a table which wasn't bad considering the crowd. We looked over the menu and they have so many great choices. Pizza, pasta, appetizers, seafood, burgers, salads and sandwiches. \\n\\nAfter ordering two mango lemonades that were wayyyyy over sweetened we ordered our food. We both are going gluten free which is tough but UNO's gave us a nice selection of dishes to choose from. Plus! They make a thin crust gluten free pizza which taste great. My hubby ordered the mediterrean thin crust because he loves kalamata olives and I ordered the Guac-alicious burger with a Caesar side sal

In [4]:
dataset["train"][666]

{'label': 2,
 'text': 'Just ate there, right next to GameStop & Google, has 3 small booths, & ordered the pepper steak w/ onion ($10.95). Food is fast fresh & hot, but mine had too much onion & not enough steak. At the end of the meal I was just eating onions with rice, though I hear this is healthy for you. Counter lady was cordial, but didn\'t reply when customers told her, \\"Have a nice day\\" #awkward. I know that English isn\'t her first language but she needs to catch on that people are wishing her well. Wasn\'t stuffed full either despite having eaten a large plate (I usually get this feeling eating Asian). This is basically a nice place to go for lunch that won\'t ruin your appetite for dinner. (Side note: Food is very clean. Brushed my teeth an hour before w/ Tom\'s of Maine fluoride-free peppermint & still had minty fresh breath an hour after eating)'}

In [5]:
import random
import pandas as pd
import datasets
from IPython.display import display, HTML

In [8]:
def show_random_elements(dataset, num_examples=15):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

In [9]:
show_random_elements(dataset["train"])

Unnamed: 0,label,text
0,1 star,"We've been coming here for 15+ yrs and we USED to love this place. The owners were great and all the servers were nice. Well we went back today, a Saturday afternoon and we were the only people in there, it was dead. We thought that was pretty rare and odd, well we ended up figuring out it's because of their new servers. From the moment we got there the lady was a total witch. Every single thing she did was done with attitude, not once did I see her even try and put on a fake smile at least. We received our food and it was eh, nothing great at all. Not what it used to be or compared to their other locations. We were so fed up with her that when we were leaving we decided to ask her if she knew what guest/customer service is and she rolled her eyes and just sipped on her drink the whole time. Her name is V or B, so if you get her, I guarantee you will not be satisfied by her service. We're never coming back here."
1,5 stars,"I have eaten at several TdB's and the service and food is always amazing and the the one in LV definitely surpassed my expectations. Although we did have reservations, they over shot our reservations by 20-25 minutes. Order the pitcher of Sangria nad set back and enjoy!!"
2,4 stars,"**UPDATE**\n\nOkay, after staying at the Alladin/ Planet Hollywood I am going to beg Ballys to take us back!!! I went back there my last day and everything was sunshine compared to the OTHER hotel. I will just have to be super sweet and bat my eyelashes and wink at the receptionist (whether guy or girl) and beg for the newer cleaner rooms next time we are there. My guy and I figured that the people at the OTHER HOTEL are so rude and arent ready to jump to keep you happy because they arent UNIONIZED. I am giving this hotel an upgrade to 4 stars. LUV YA BALLYS!!!! PS... They do have comfortable beds & room service is also good. \n\n\nThis is one of the worst hotels. If you are the new side the rooms are a bit better. But still not up to par. We go there at least 3 times a year because of business and it sucks that the organization can only use this place. Most of the time the check in people are rude."
3,5 stars,joe the server was amazing we had a blast he was soo funny and the food was so tasty its a great place for a birthday we r visiting from california and best service ever he was hilarious and I would go everyday if I could lol
4,1 star,Worst experience ever. They messed up my order twice. Tacos were soggy and hardly had anyway meat. Customer service was horrible to. They were rude and had no customer service skills.
5,5 stars,"I love Pin Kaow it's hands down the best Thai place for locals off the strip. The staff is always friendly and seems to be a very clean place. I originally went to the location on Eastern for years. Both places are the same, both great. I'm a huge curry fan so I order basically the same thing every visit."
6,1 star,Whole Paycheck prices! What more is there to say! A small cart will be over $400!\nWayne Gorsek
7,2 star,"My friend and I decided to get a Pedicure and came upon this salon. When we arrived, there weren't any other customers there. They do have a very large selection of different color nail polish to choose from. We found it a little odd that one of the techs was wearing gloves, and the other was not. They do spend a lot of time, and do have good attention to detail. My tech accidently cut me a little bit, which he was apologetic for. Before he started the nail polish, he squirted something over my toes (rubbing alcohol perhaps?) which just about made me scream when it hit my cut. When it came to the foot massage/leg rub, it was not pleasant at all. The tech put an exfoliating scrub on my legs, and rubbed my legs for what seemed like forever. It wasn't really a massage, it felt like they were trying to rub multiple layers of skin off my legs! \nWhen it was all done, I was pleased with the way that my feet looked, but it was not a relaxing experience. I will likely not go there again."
8,2 star,"Looks very impressive from the outside, but I was disappointed on the interior of the main part. Very blah. Go down to look at the heart of the guy that started it. Creepy!\n\nAlso, we were approached by people outside asking for money. One guy had a whole story about his wife having cancer and how he needed help. I wanted to tell him to go to the church and ask for help. I think he was just looking to scam tourists.\n\nThere are much better churches to visit in the area including Christ Church Cathedral."
9,5 stars,Our favorite neighborhood breakfast place. Chicken on the Coup - is amazing. Love the stuffing and gravy - and then mix in the eggs - fantastic.


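## Preprocessing the Data

Once the dataset has been downloaded, a tokenizer is used to process the text. Because the inputs differ in length, padding and truncation strategies are used to bring them to a uniform length.

The `map` method of Datasets applies a preprocessing function to the entire dataset in one call.

The code below pads every example to the maximum length. Some background first:

1. **Why handle text length?**  
   Just as clothes come in sizes, a neural network has a fixed "input size". BERT, for example, accepts at most 512 tokens (word pieces). Longer texts are truncated (the tail is cut off) and shorter ones are padded with the `[PAD]` token, whose ID is 0.

2. **What does the tokenizer do?**  
   It converts text into numeric IDs (for example "hello" → [101, 2345, 102]; the IDs here are illustrative) and, at the same time:
   • adds special tokens automatically, such as [CLS] at the start and [SEP] as a separator
   • records which positions hold real content (in attention_mask, 1 marks a real token and 0 marks padding)

3. **What `map` does**  
   The operation works like an assembly line that feeds the dataset through the processing function in batches. With `batched=True` and the default `batch_size=1000`, a dataset of 10,000 texts is processed in 10 calls of 1,000 texts each, which is usually much faster than processing one text at a time.

4. **Structure of the processed data**  
   Each sample becomes a dictionary containing several arrays: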
   ```python
   {
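     'input_ids': [101, 2345, 1032, 0, 0],  # the text as token IDs
     'token_type_ids': [0,0,..],            # which sentence each token belongs to (used for sentence-pair tasks such as QA)
     'attention_mask': [1,1,..0,0]          # 1 for real tokens, 0 for padding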
   }
   ```
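
To make the batching idea concrete, here is a minimal pure-Python sketch of what a batched map does conceptually. The helper `batched_map` and the toy `fake_tokenize` function are hypothetical names used only for illustration; this is not how the `datasets` library is implemented. The value `batch_size=1000` mirrors the documented default of `Dataset.map`.

```python
def batched_map(rows, fn, batch_size=1000):
    """Apply fn to a list of texts batch by batch (conceptual sketch only)."""
    out = []
    calls = 0
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        out.extend(fn(batch))  # one call handles a whole batch
        calls += 1
    return out, calls


def fake_tokenize(texts):
    # Hypothetical stand-in for a tokenizer: split on whitespace.
    return [text.split() for text in texts]


texts = [f"review number {i}" for i in range(10_000)]
tokens, n_calls = batched_map(texts, fake_tokenize)
print(n_calls)    # how many times the function was called
print(tokens[0])  # the first "tokenized" text
```

Calling the function once per batch rather than once per text is what lets a fast tokenizer process many texts in a single call.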

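A concrete example:
the original sentence "I love eating pizza" becomes something like:
```
[CLS] I love eating pizza [SEP] [PAD] [PAD] [PAD]...
Token IDs (illustrative): [101, 2769, 3342, 1563, 5643, 102, 0, 0, 0...]
Attention mask: [1,1,1,1,1,1,0,0,0...]
```
where:
• [CLS] is the start token that BERT requires
• [SEP] marks the end of the sentence
• [PAD] is the padding placeholder (its ID is 0)
• the attention mask tells the model which positions to attend to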
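
The padding and truncation logic described above can be sketched in a few lines of plain Python. The function `pad_or_truncate` is a hypothetical helper and the token IDs are illustrative. A real tokenizer also keeps the closing `[SEP]` when it truncates, a detail this sketch ignores.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = token_ids[:max_length]                   # truncation: drop the tail
    n_pad = max_length - len(ids)
    attention_mask = [1] * len(ids) + [0] * n_pad  # 1 = real token, 0 = padding
    input_ids = ids + [pad_id] * n_pad             # padding: fill with pad_id
    return input_ids, attention_mask


short_ids = [101, 2769, 3342, 102]                     # illustrative IDs only
long_ids = [101] + list(range(1000, 1010)) + [102]     # 12 IDs, longer than max_length

print(pad_or_truncate(short_ids, 8))
print(pad_or_truncate(long_ids, 8))
```

The short sequence gains four padding positions with mask value 0, while the long one is cut to eight IDs with a mask of all ones.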

In [10]:
# Import the auto tokenizer class from the transformers library
from transformers import AutoTokenizer

# Load the pretrained tokenizer (here, the cased version of BERT)
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# Define the function that tokenizes a batch of examples
def tokenize_function(examples):
    # Tokenize the text with two strategies:
    # 1. padding="max_length": pad every text to the model's maximum length (512 for BERT)
    # 2. truncation=True: cut off anything beyond the maximum length
    return tokenizer(examples["text"], padding="max_length", truncation=True)

# Apply the function to the whole dataset.
# batched=True passes many examples per call, which is usually much faster than one at a time
tokenized_datasets = dataset.map(tokenize_function, batched=True)



tokenizer_config.json:   0%|          | 0.00/49.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/339 [00:00<?, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

Map:   0%|          | 0/650000 [00:00<?, ? examples/s]

Map:   0%|          | 0/50000 [00:00<?, ? examples/s]

In [13]:
print(tokenized_datasets.cache_files)

{'train': [{'filename': '/root/.cache/huggingface/datasets/yelp_review_full/yelp_review_full/0.0.0/c1f9ee939b7d05667af864ee1cb066393154bf85/cache-42c6b839c042ef53.arrow'}], 'test': [{'filename': '/root/.cache/huggingface/datasets/yelp_review_full/yelp_review_full/0.0.0/c1f9ee939b7d05667af864ee1cb066393154bf85/cache-b616e165245db566.arrow'}]}


In [12]:
# Show one random processed sample using the show_random_elements helper defined above.
# Besides 'label' and 'text', each sample now carries fields such as:
# {
#   'input_ids': [101, 2345, 1032, ..., 0, 0],
#   'attention_mask': [1,1,..1,0,0]
# }
show_random_elements(tokenized_datasets["train"], num_examples=1)

Unnamed: 0,label,text,input_ids,token_type_ids,attention_mask
0,1 star,"Went there to celebrate with some friends and I was excited to try some Mexican food - something I'd been missing since I moved from San Diego to Pittsburgh. Since there was a long wait we got margaritas and waited for the tables. I got a strawberry margarita, but the flavor was unnecessary as all I could taste was cheap, nasty tequila. Gross. To mitigate the taste and try not to be sloshed before dinner we asked for some chips and salsa while we waited. When we finally got some (15 minutes later) I was buzzed and disappointed. The chips and salsa were worse than even the store brand from the supermarket. My opinion, if a \""Mexican food\"" restaurant can't do the good chips and salsa and margaritas, I don't even want to try anything else. We paid our tab and left before our table was even ready. No thank you.","[101, 23158, 1204, 1175, 1106, 8294, 1114, 1199, 2053, 1105, 146, 1108, 7215, 1106, 2222, 1199, 4112, 2094, 118, 1380, 146, 112, 173, 1151, 3764, 1290, 146, 1427, 1121, 1727, 4494, 1106, 5610, 119, 1967, 1175, 1108, 170, 1263, 3074, 1195, 1400, 12477, 18971, 15662, 1116, 1105, 3932, 1111, 1103, 7072, 119, 146, 1400, 170, 15235, 6614, 12477, 18971, 15662, 117, 1133, 1103, 16852, 1108, 14924, 1112, 1155, 146, 1180, 5080, 1108, 10928, 117, 13392, 21359, 21005, 119, 15161, 119, 1706, 26410, 25342, 1103, 5080, 1105, 2222, 1136, 1106, 1129, 188, 8867, 8961, 1196, 4014, 1195, 1455, 1111, 1199, 13228, ...]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...]","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...]"


The fields in the table above are explained below, following the stages of the NLP processing pipeline:

---

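### ▋ Field Structure (for BERT-style models)

| Field              | Example fragment             | Layer              | Technical details                                                                                        |
|--------------------|------------------------------|--------------------|----------------------------------------------------------------------------------------------------------|
| **label**          | "1 star"                     | Label              | Class label, stored as an integer 0-4 (1-5 stars); the helper above shows the class name for readability |
| **text**           | The original review          | Raw data           | The unprocessed input text                                                                               |
| **input_ids**      | [101, 23158, 1204,...]       | Token encoding     | The text converted into the sequence of numeric IDs the model reads                                      |
| **token_type_ids** | [0, 0, 0,...]                | Sentence segment   | Marks which sentence each token belongs to (all 0 for single-sentence tasks)                             |
| **attention_mask** | [1, 1, 1,...]                | Attention          | Tells the model which positions hold real content (1 = real token, 0 = padding)                          |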

---

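### ▋ Key Points in Detail

#### 1. **How the label field is stored**
```python
# The label column is a ClassLabel feature: values are already integers (0-4),
# so no manual mapping is needed before training.
label_feature = dataset["train"].features["label"]
print(label_feature.names)               # class names, from '1 star' to '5 stars'
print(label_feature.str2int("1 star"))   # 0
```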

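#### 2. **How input_ids is built**
- **Special tokens**:
  - `101`: [CLS], the classification token (the start token for BERT-style models)
  - `102`: [SEP], the separator token (appended after the text even for single-sentence input; not visible above because the output is truncated)
  - `0`: [PAD], the padding token (with `padding="max_length"` it fills the tail of the sequence up to the maximum length; not visible above because the output is truncated)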

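#### 3. **token_type_ids for sentence pairs**
```python
# Typical structure for a two-sentence task (such as QA)
encoded = tokenizer("How are you?", "I'm fine", return_token_type_ids=True)
print(encoded["token_type_ids"])
# 0 for [CLS] + first sentence + [SEP], then 1 for second sentence + [SEP]
```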
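
The layout of a sentence pair can also be reproduced by hand. The sketch below uses a hypothetical helper, `build_pair_inputs`, and assumes the two sentences have already been split into tokens; the split shown is an assumption, not output copied from the real tokenizer.

```python
def build_pair_inputs(tokens_a, tokens_b):
    """Assemble [CLS] A [SEP] B [SEP] and the matching token_type_ids."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A and the first [SEP];
    # segment 1 covers sentence B and the final [SEP].
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, token_type_ids


tokens, type_ids = build_pair_inputs(
    ["How", "are", "you", "?"],   # assumed tokenization of sentence A
    ["I", "'", "m", "fine"],      # assumed tokenization of sentence B
)
for token, segment in zip(tokens, type_ids):
    print(segment, token)
```

Each special token belongs to the segment it closes, so the two lists always have the same length.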

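#### 4. **attention_mask depends on the text length**
```python
# Mask for a short text padded to a length of 6
# (assuming "Hello" and "world" are one token each):
# tokens:         [CLS] Hello world [SEP] [PAD] [PAD]
# attention_mask: [1,   1,    1,    1,    0,    0]
```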

---

### ▋ Data Processing Flow
```
Raw text
   ↓ (tokenization)
[CLS] Went there to... [SEP] → tokens
   ↓ (vocabulary lookup)
101 23158 1204 ... 102 → input_ids
   ↓ (sentence segment)
0   0     0    ... 0   → token_type_ids
   ↓ (real token or padding)
1   1     1    ... 1   → attention_mask
```

---

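### ▋ Best Practices
1. **Dynamic padding**:
```python
# Pad each batch to its longest sequence instead of a fixed length, which saves compute
# (this only helps if the tokenizer call does not already use padding="max_length")
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer)
```

2. **Check field consistency**:
```python
# All fields of a sample should have the same length
sample = tokenized_datasets["train"][0]
assert len(sample["input_ids"]) == len(sample["token_type_ids"]) == len(sample["attention_mask"])
```

3. **Decode to verify**:
```python
# Decode the IDs back to text and compare by eye; the result can differ from the
# original in whitespace, and it is cut short if the text was truncated
decoded_text = tokenizer.decode(sample["input_ids"], skip_special_tokens=True)
print(decoded_text[:200])
```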

---


Wed Mar  5 15:34:53 2025       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.216.03             Driver Version: 535.216.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       On  | 00000000:00:07.0 Off |                    0 |
| N/A   38C    P0              26W /  70W |    985MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A    103559      C   /root/miniconda3/envs/peft/bin/python       982MiB |
+---------------------------------------------------------------------------------------+

Based on the logs and the hardware monitoring output above, the current system state can be read as follows:

---

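### **1. File Downloads and Model Loading**
1. **Tokenizer files**  
   • `tokenizer_config.json` (49 bytes): defines the tokenizer's configuration, such as whether input is lower-cased. For `bert-base-cased`, `do_lower_case` is false, so case is preserved.
   • `vocab.txt`: the vocabulary file, which lists every token; a token's position in the file is its ID, used to turn text into the numeric sequence the model reads. In the BERT vocabulary, `[PAD]` is 0, `[CLS]` is 101 and `[SEP]` is 102.
   • `tokenizer.json`: the complete definition of the fast tokenizer, including the tokenization algorithm (WordPiece for BERT).

2. **Model configuration file**  
   • `config.json`: defines the architecture parameters, such as the hidden size (`hidden_size`), the number of attention heads (`num_attention_heads`) and the number of layers (`num_hidden_layers`). For example, `hidden_size=768` means each token is represented by a 768-dimensional vector.

3. **Reading the progress bars**  
   • The `Map` progress bars track `dataset.map` applying `tokenize_function` to the 650,000 training and 50,000 test examples. They are not related to loading model weights.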

---

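### **2. GPU Resource Usage**
• **Tesla T4 status**  
  • **Memory usage**: 985MiB / 15360MiB, about 6.4%, so the current workload puts little pressure on the GPU.
  • **Process**: a Python process (PID 103559) holds 982MiB of GPU memory. It may be related to model inference or training, for example a model already loaded in the notebook.
  • **Compute mode**: `Compute M.` shows `Default`, meaning several processes may share the GPU. `MIG M.` shows `N/A` because the T4 does not support MIG (Multi-Instance GPU).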

---

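### **3. What the System Is Probably Doing**
The system is probably performing one of the following:
1. **Model initialization**  
   • Loading a pretrained model with Hugging Face's `from_pretrained()` method, which downloads the configuration file (`config.json`) and the tokenizer files automatically.
   • Example code would look like:
     ```python
     from transformers import AutoTokenizer, AutoModel
     tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
     model = AutoModel.from_pretrained("bert-base-cased").cuda()
     ```

2. **Text preprocessing**  
   • Using the tokenizer to turn input text (here, the review texts) into token ID sequences, which relies on `vocab.txt` and `tokenizer.json`.

3. **A lightweight GPU task**  
   • The low memory usage suggests a small workload, such as short-text classification with a loaded model; full training is not running.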

---

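### **4. Risks and Suggestions**
• **Low GPU memory usage**: less than 10% of the Tesla T4's memory is in use. During training, larger batches or mixed precision (`fp16`; `bf16` needs an Ampere or newer GPU, which the T4 is not) can raise throughput.
• **Download speed**: if downloads are slow, check the proxy or mirror settings, or load the model from the local cache.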

---

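**Summary**: the system is loading the tokenizer and configuration of a pretrained Transformer model (here `bert-base-cased`) and tokenizing the dataset, while the GPU carries only a light load.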

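### Sampling the Data

We use 1,000 samples to demonstrate small-scale training on BERT (with the PyTorch Trainer).

The `shuffle()` function randomly reorders the rows of the dataset. For more control over the shuffling algorithm, pass the `generator` parameter to use a different `numpy.random.Generator`.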

In [16]:
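# Create a small training subset (1,000 samples) from the full training set
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
"""
Steps:
1. shuffle(seed=42): shuffle the training set (the fixed seed makes it reproducible)
2. select(range(1000)): take the first 1,000 shuffled samples
Purpose:
- a small training set speeds up experiments
- the subset is a random sample of the data
- the fixed seed selects the same subset on every run
"""

# Create a small evaluation subset (1,000 samples) from the full test set
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))
"""
Typical uses:
1. checking quickly that the model can overfit (testing its capacity to learn on little data)
2. tuning hyperparameters with limited resources
3. fast experiments while prototyping
4. teaching demos (shorter training time)

Caveats:
- a small sample may not represent the full data distribution
- evaluation metrics will have high variance
- use the full dataset for real training
- production work needs a more careful validation split
"""

# Inspect the structure of the subsets
print(small_train_dataset)
# Expected: a Dataset with features ['label', 'text', 'input_ids', 'token_type_ids', 'attention_mask'] and num_rows: 1000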

print(small_eval_dataset)

Dataset({
    features: ['label', 'text', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 1000
})
Dataset({
    features: ['label', 'text', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 1000
})
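
Why a fixed seed gives the same subset on every run can be illustrated with the standard library alone. The helper `shuffle_and_select` below is a hypothetical stand-in for `shuffle(seed=...).select(range(n))`. It uses Python's `random` module rather than the NumPy generator that Datasets uses, so the rows it picks will not match the real ones.

```python
import random


def shuffle_and_select(rows, n, seed=42):
    """Shuffle row indices with a seeded generator and keep the first n rows."""
    rng = random.Random(seed)            # private generator, independent of global state
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    return [rows[i] for i in indices[:n]]


rows = [f"example-{i}" for i in range(10_000)]
subset_a = shuffle_and_select(rows, 1000)
subset_b = shuffle_and_select(rows, 1000)
other_seed = shuffle_and_select(rows, 1000, seed=64)

print(subset_a == subset_b)     # same seed, same subset
print(subset_a == other_seed)   # different seed, different subset
print(len(set(subset_a)))       # no row is picked twice
```

Shuffling indices and then slicing guarantees sampling without replacement, which is also what `select(range(1000))` on a shuffled dataset achieves.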


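## Fine-Tuning Configuration

### Loading the BERT Model

Loading the model prints a warning that some weights (the `classifier.weight` and `classifier.bias` of the classification head) are newly, that is randomly, initialized. This is entirely normal when fine-tuning: the head used for the pretraining tasks is discarded and replaced with a new sequence classification head, for which there are no pretrained weights. The library therefore warns that the model should be fine-tuned before it is used for inference, which is exactly what we are about to do.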

In [17]:
from transformers import AutoModelForSequenceClassification

# Key arguments:
# "bert-base-cased" - the cased version of BERT base
# num_labels=5       - five-class task (the 1-5 star Yelp ratings)
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-cased", 
    num_labels=5
)



model.safetensors:   0%|          | 0.00/436M [00:00<?, ?B/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### Training Hyperparameters (TrainingArguments)

Full list of parameters and defaults: https://huggingface.co/docs/transformers/v4.36.1/en/main_classes/trainer#transformers.TrainingArguments

Source code definition: https://github.com/huggingface/transformers/blob/v4.36.1/src/transformers/training_args.py#L161

**The most important setting: the path where model weights are saved (`output_dir`)**

In [20]:
from transformers import TrainingArguments

model_dir = "models/bert-base-cased-finetune-yelp"

# logging_steps defaults to 500; given our dataset size and batch size, we set it to 100
training_args = TrainingArguments(output_dir=model_dir,
                                  per_device_train_batch_size=16,
                                  num_train_epochs=5,
                                  logging_steps=100)

In [21]:
# The full hyperparameter configuration
print(training_args)

TrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
dispatch_batches=None,
do_eval=False,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=no,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
gradient_checkpointing_kwargs=None,
greater_is_better=None,
group_by_le

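The `TrainingArguments` configuration is explained below in plain terms, grouped by function and annotated:

---

### 🌟 **Core Training Parameters**
```python
output_dir="models/bert-base-cased-finetune-yelp"  # where the model is saved (the most important one: all training outputs go here)
per_device_train_batch_size=16   # batch size per GPU (lower it if GPU memory runs out)
num_train_epochs=5               # total number of epochs (3-5 is usually enough for fine-tuning)
learning_rate=5e-05              # learning rate (5e-5 is common for BERT; too large oscillates, too small converges slowly)
```

---

### 💻 **Resource Parameters**
```python
fp16=False                       # mixed-precision training (True saves GPU memory; needs GPU support)
gradient_accumulation_steps=1    # gradient accumulation steps (simulates a larger batch when memory is short)
optim="adamw_torch"              # optimizer (AdamW, a variant of Adam with decoupled weight decay)
```

---

### ⏱️ **Controlling the Training Loop**
```python
logging_steps=100                # log every 100 steps (default 500; lower it to monitor more often)
save_steps=500                   # save a checkpoint every 500 steps (frequent saves use disk space)
evaluation_strategy="no"         # evaluation strategy ("no": never, "steps": every eval_steps, "epoch": every epoch)
```

---

### 🛠️ **Optimization Parameters**
```python
weight_decay=0.0                 # weight decay coefficient (guards against overfitting; 0.01 is common)
warmup_steps=0                   # warmup steps (use a small learning rate at the start)
max_grad_norm=1.0                # gradient clipping threshold (guards against exploding gradients)
```

---

### 📊 **Logging and Saving**
```python
logging_dir="models/.../runs/..." # TensorBoard log directory
report_to=[]                     # reporting integrations (e.g. ["wandb"] for experiment tracking)
save_total_limit=None            # maximum number of checkpoints to keep (3 keeps only the latest three)
```

---

### 🔧 **Other Useful Parameters**
```python
seed=42                          # random seed (fixing it makes results reproducible)
disable_tqdm=False               # disable progress bars (True gives cleaner output)
remove_unused_columns=True       # drop dataset columns the model does not use (saves memory)
```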

---

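### 🚀 **Tips for Choosing Parameters**
1. **Learning rate**: start from `5e-5` and watch how the loss changes
2. **Batch size**: make it as large as GPU memory allows (e.g. 16→32)
3. **Epochs**: use early stopping (`EarlyStoppingCallback`) to guard against overfitting
4. **Mixed precision**: `fp16=True` can reduce memory use and speed up training on GPUs that support it
5. **Multiple GPUs**: no code changes are needed; launch the script with `torchrun --nproc_per_node=<number of GPUs>`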

---

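### ⚠️ **Points to Note**
```python
do_train=False  # only read by the example scripts; calling trainer.train() trains regardless of this flag
do_eval=False   # likewise a script flag; evaluation during training is controlled by evaluation_strategy and eval_dataset
```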

---

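### 🔄 **Complete Training Example**
```python
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,   # needed when evaluating
    compute_metrics=compute_metrics    # custom metric function (defined later in this notebook)
)

# Start training
trainer.train()
```

---

Configuring these parameters sensibly balances training speed, resource use and model quality. Debug the settings on a small subset of the data first, then train on the full dataset.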

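## What Is the Learning Rate, and What Is Oscillation?

### 📚 The Learning Rate in Plain Terms

**The learning rate is like the length of your stride when walking downhill**  
Imagine walking down from a mountain top blindfolded, looking for the lowest point (the model's optimal parameters). The learning rate is how far you move with each step:
- **Strides too long (high learning rate)**: you can step right across the valley and bounce back and forth between the slopes (oscillation)
- **Strides too short (low learning rate)**: it takes a long time to reach the bottom, and you may get stuck halfway down (slow convergence)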

![](https://miro.medium.com/v2/resize:fit:720/format:webp/1*kA38KJq9aeZkBW-0rH2yCg.gif)

---

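### 💥 What Is Oscillation?

**Oscillation is like jumping back and forth between the two sides of a valley**  
When the learning rate is too large, parameter updates go like this:
1. Current point: A (high loss)
2. Compute the gradient: it points toward the valley floor
3. Take a big step: you land at point B on the opposite slope
4. Compute the gradient again: it now points back the other way
5. Result: you keep jumping between the two sides and never settle at the lowest point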

![](https://developers.google.com/static/machine-learning/crash-course/images/LearningRateTooLarge.svg)
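
The valley picture can be reproduced numerically. The sketch below runs plain gradient descent on the toy function f(x) = x², whose gradient is 2x; the function and the three learning rates are chosen only for illustration and have nothing to do with BERT's actual loss surface.

```python
def gradient_descent(lr, x0=1.0, steps=6):
    """Minimize f(x) = x**2 (gradient 2x) and return every point visited."""
    xs = [x0]
    for _ in range(steps):
        x = xs[-1]
        xs.append(x - lr * 2 * x)   # standard update: x <- x - lr * gradient
    return xs


for lr in (0.1, 0.9, 1.1):
    path = gradient_descent(lr)
    print(lr, [round(x, 3) for x in path])
```

With the small rate the points approach 0 from one side. With 0.9 they flip sign at every step while still shrinking, which is oscillation. With 1.1 they flip sign and grow, so training diverges.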

---

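### 🌰 Comparing Learning Rates (rules of thumb)
| Learning rate | Training behavior      | Loss curve                   | Typical use          |
|---------------|------------------------|------------------------------|----------------------|
| 0.1           | Severe oscillation     | Sawtooth                     | ❌ Almost never used |
| 1e-3          | Occasional oscillation | Decreasing with fluctuations | Simple tasks         |
| 5e-5          | Steady decrease        | Smooth convergence           | ✅ BERT fine-tuning  |
| 1e-6          | Very slow decrease     | Nearly flat                  | Fine-grained tuning  |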

---

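### 🔧 How to Avoid Oscillation
1. **Learning rate warmup**: raise the learning rate gradually from a small value over the first steps (for example, the first 1,000)
2. **Gradient clipping**: cap the gradient norm to limit the size of a single update (`max_grad_norm=1.0`)
3. **Adaptive optimizer**: use AdamW rather than plain SGD
4. **Watch the loss curve**: lower the learning rate as soon as oscillation appears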
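
Warmup is easy to express as a function of the step number. The sketch below is a hypothetical helper, `linear_warmup_decay`, that ramps the learning rate up linearly and then decays it linearly to zero. It is meant to have the same shape as the linear schedule with warmup commonly used with the Trainer, but it is an illustration, not library code, and the step counts are made up.

```python
def linear_warmup_decay(step, base_lr, warmup_steps, total_steps):
    """Learning rate at a given step: linear warmup, then linear decay to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 50, 100, 550, 1000):
    lr = linear_warmup_decay(step, base_lr=5e-5, warmup_steps=100, total_steps=1000)
    print(step, lr)
```

The rate starts at zero, reaches its peak of `base_lr` exactly when warmup ends, and returns to zero at the last step, so the early updates made from a freshly initialized classification head stay small.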

---

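### 🛠️ Practical Values for BERT
```python
# Suggested learning rates
learning_rate = 5e-5  # the default, a safe value for most cases
learning_rate = 3e-5  # a more conservative choice (for small datasets)
learning_rate = 1e-4  # risky; use together with gradient clipping
```

With a well-chosen learning rate, the model converges quickly without overshooting. It is like finding the right pressure on the accelerator: no lurching and hard braking, but no crawling either.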

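### Evaluating Metrics During Training (Evaluate)

The **[Hugging Face Evaluate library](https://huggingface.co/docs/evaluate/index)** provides dozens of evaluation methods across domains (natural language processing, computer vision, reinforcement learning and more), each loadable with a single line of code. The **full list of supported metrics** is at https://huggingface.co/evaluate-metric

The Trainer does not evaluate model performance automatically during training, so we need to pass it a function that computes and reports metrics.

The Evaluate library provides a simple accuracy metric, which you can load with the `evaluate.load` function. The cell below loads it from a local copy of the script, `./accuracy.py`.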

In [31]:
import numpy as np
import evaluate
import os

# Check the environment variable
print("Current HF endpoint:", os.getenv('HF_ENDPOINT', 'default (not set)'))  # should show https://hf-mirror.com

metric = evaluate.load("./accuracy.py")

Current HF endpoint: https://hf-mirror.com



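Next, call the `compute` function to calculate the accuracy of the predictions.

Before passing predictions to `compute`, the logits must be converted to predicted classes (**all Transformers models return logits**).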

In [32]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
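
What this function computes can be spelled out without any libraries. The sketch below re-implements argmax and accuracy in plain Python on made-up logits for a five-class problem; the helper names and the numbers are illustrative only.

```python
def argmax(values):
    """Index of the largest value (the predicted class for one row of logits)."""
    return max(range(len(values)), key=lambda i: values[i])


def accuracy(logits, labels):
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


logits = [
    [0.1, 2.0, 0.3, 0.0, -1.0],   # largest value at index 1
    [1.5, 0.2, 0.1, 0.0, 0.3],    # largest value at index 0
    [0.0, 0.1, 0.2, 0.3, 4.0],    # largest value at index 4
    [0.9, 0.8, 0.7, 0.6, 0.5],    # largest value at index 0
]
labels = [1, 0, 3, 0]
print(accuracy(logits, labels))
```

Three of the four predictions match their labels. No softmax is needed because argmax over logits picks the same class as argmax over probabilities.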

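#### Monitoring Metrics During Training

To monitor how evaluation metrics change during training, set the `evaluation_strategy` parameter in `TrainingArguments`; with `"epoch"`, metrics are reported at the end of every epoch. (Newer Transformers releases name this parameter `eval_strategy`.)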

In [33]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(output_dir=model_dir,
                                  evaluation_strategy="epoch", 
                                  per_device_train_batch_size=16,
                                  num_train_epochs=3,
                                  logging_steps=30)

## Start Training

### Instantiating the Trainer

The `kernel version` warning shown below does not affect this example for now.

In [16]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)

Detected kernel version 4.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


## Checking GPU Usage with nvidia-smi

To watch GPU usage in real time, poll with the `watch` command: `watch -n 1 nvidia-smi`:

```shell
Every 1.0s: nvidia-smi                                                   Wed Dec 20 14:37:41 2023

Wed Dec 20 14:37:41 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       Off | 00000000:00:0D.0 Off |                    0 |
| N/A   64C    P0              69W /  70W |   6665MiB / 15360MiB |     98%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A     18395      C   /root/miniconda3/bin/python                6660MiB |
+---------------------------------------------------------------------------------------+
```

In [17]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy
1,1.2421,1.090886,0.526
2,0.9014,0.960115,0.591
3,0.6382,0.978361,0.592


TrainOutput(global_step=189, training_loss=0.9693943861300353, metrics={'train_runtime': 341.7098, 'train_samples_per_second': 8.779, 'train_steps_per_second': 0.553, 'total_flos': 789354427392000.0, 'train_loss': 0.9693943861300353, 'epoch': 3.0})
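
The `global_step=189` reported above can be checked by hand from the settings used in this run: 1,000 training samples, a per-device batch size of 16, 3 epochs and a single GPU. The helper `total_training_steps` is a hypothetical function written for this check. With gradient accumulation the Trainer's exact rounding may differ, so treat it as an estimate in that case.

```python
import math


def total_training_steps(num_samples, per_device_batch_size, num_epochs,
                         num_devices=1, gradient_accumulation_steps=1):
    """Estimate the number of optimizer steps for a full training run."""
    effective_batch = per_device_batch_size * num_devices * gradient_accumulation_steps
    steps_per_epoch = math.ceil(num_samples / effective_batch)  # last batch may be partial
    return steps_per_epoch * num_epochs


steps = total_training_steps(num_samples=1000, per_device_batch_size=16, num_epochs=3)
print(steps)
```

Each epoch takes 63 steps because the last batch holds only 8 samples, and three epochs give 189 steps. With `logging_steps=30`, the training loss is logged six times during this run.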

In [18]:
small_test_dataset = tokenized_datasets["test"].shuffle(seed=64).select(range(100))

In [19]:
trainer.evaluate(small_test_dataset)

{'eval_loss': 1.0753791332244873,
 'eval_accuracy': 0.52,
 'eval_runtime': 2.9889,
 'eval_samples_per_second': 33.457,
 'eval_steps_per_second': 4.349,
 'epoch': 3.0}

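### Saving the Model and Training State

- Use the `trainer.save_model` method to save the model; it can later be reloaded with the `from_pretrained()` method
- Use the `trainer.save_state` method to save the training state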

In [20]:
trainer.save_model(model_dir)

In [21]:
trainer.save_state()

In [23]:
# trainer.model.save_pretrained("./")

## Homework: train on the full YelpReviewFull dataset and see how high the accuracy can go