<a href="https://colab.research.google.com/github/VivianOuou/NLP-Course/blob/main/course/en/chapter3/section4_A_full_training.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# A full training

Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [2]:
!pip install datasets evaluate transformers[sentencepiece]
!pip install accelerate
# To run the training on TPU, you will need to uncomment the following line:
# !pip install cloud-tpu-client==0.10 torch==1.9.0 https://storage.googleapis.com/tpu-pytorch/wheels/torch_xla-1.9-cp37-cp37m-linux_x86_64.whl


In [3]:
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

raw_datasets = load_dataset("glue", "mrpc")
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)


def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)


tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)


In [4]:
tokenized_datasets = tokenized_datasets.remove_columns(["sentence1", "sentence2", "idx"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")
tokenized_datasets["train"].column_names

['labels', 'input_ids', 'token_type_ids', 'attention_mask']

In [5]:
["attention_mask", "input_ids", "labels", "token_type_ids"]

['attention_mask', 'input_ids', 'labels', 'token_type_ids']

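`DataLoader` is PyTorch's core utility for loading data efficiently and in batches, and it plays a key role when training deep learning models. A closer look:

1. Core functionality

`DataLoader` turns a preprocessed dataset (such as `tokenized_datasets`) into an iterable stream of batches. It mainly provides:

- Batching: groups individual samples into batches (for example, `batch_size=8`)
- Memory management: fetches samples batch by batch instead of materializing every batch up front
- Shuffling: randomizes the order of the training data with `shuffle=True`
- Parallel loading: speeds up data preparation with worker processes (the `num_workers` parameter)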
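
To make these roles concrete, here is a minimal pure-Python sketch of what a data loader does conceptually: shuffle the indices, cut them into batches, and pad each batch to its own longest sequence. The helpers `pad_batch` and `iterate_batches` are hypothetical names for illustration, not part of PyTorch; the real `DataLoader` additionally handles worker processes and tensor conversion.

```python
import random


def pad_batch(samples, pad_id=0):
    """Pad every sequence in the batch to the longest one (dynamic padding)."""
    max_len = max(len(s) for s in samples)
    input_ids = [s + [pad_id] * (max_len - len(s)) for s in samples]
    attention_mask = [[1] * len(s) + [0] * (max_len - len(s)) for s in samples]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


def iterate_batches(dataset, batch_size, shuffle=False, seed=0):
    """Yield padded batches, optionally visiting samples in shuffled order."""
    indices = list(range(len(dataset)))
    if shuffle:
        random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        chunk = indices[start:start + batch_size]
        yield pad_batch([dataset[i] for i in chunk])


# Toy token-id sequences of different lengths (made up for illustration).
toy_dataset = [[101, 7, 102], [101, 7, 8, 9, 102], [101, 102], [101, 5, 6, 102], [101, 4, 102]]
batches = list(iterate_batches(toy_dataset, batch_size=2, shuffle=True))
for batch in batches:
    print(len(batch["input_ids"]), "samples, padded length", len(batch["input_ids"][0]))
```

Padding per batch rather than to a global maximum is the same idea that `DataCollatorWithPadding` applies in the cell below.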

In [6]:
from torch.utils.data import DataLoader

train_dataloader = DataLoader(
    tokenized_datasets["train"], shuffle=True, batch_size=8, collate_fn=data_collator
)
eval_dataloader = DataLoader(
    tokenized_datasets["validation"], batch_size=8, collate_fn=data_collator
)

In [7]:
for batch in train_dataloader:
    break
{k: v.shape for k, v in batch.items()}

{'labels': torch.Size([8]),
 'input_ids': torch.Size([8, 67]),
 'token_type_ids': torch.Size([8, 67]),
 'attention_mask': torch.Size([8, 67])}

In [10]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [11]:
outputs = model(**batch)
print(outputs.loss, outputs.logits.shape)

tensor(0.7014, grad_fn=<NllLossBackward0>) torch.Size([8, 2])


In [18]:
from torch.optim import AdamW
optimizer = AdamW(model.parameters(), lr=5e-5)

In [19]:
from transformers import get_scheduler

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)
print(num_training_steps)

1377
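
The printed value is 459 batches per epoch times 3 epochs. As a rough sketch of what the `"linear"` schedule with `num_warmup_steps=0` does to the learning rate over those 1377 steps, the hypothetical helper `linear_lr` below reproduces the decay rule in plain Python. It illustrates the formula and is not the library's own code.

```python
def linear_lr(step, base_lr=5e-5, num_warmup_steps=0, num_training_steps=1377):
    """Learning rate at `step`: linear warmup (if any), then linear decay to zero."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return base_lr * max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


# Learning rate at the start of each epoch and at the very end of training.
for step in (0, 459, 918, 1377):
    print(step, linear_lr(step))
```

With no warmup, the rate starts at the optimizer's `lr` and shrinks by the same amount every step until it reaches zero on the last step.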


In [20]:
import torch

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)
device

device(type='cuda')

In [21]:
from tqdm.auto import tqdm

progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)


In [22]:
import evaluate

metric = evaluate.load("glue", "mrpc")
model.eval()
for batch in eval_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)

    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch["labels"])

metric.compute()


{'accuracy': 0.8504901960784313, 'f1': 0.8942807625649913}
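
For reference, the two numbers reported by the GLUE MRPC metric are plain accuracy and the F1 score of the positive class (label 1). A minimal sketch of both follows, using a hypothetical helper `accuracy_and_f1` and made-up toy labels rather than the real validation set:

```python
def accuracy_and_f1(predictions, references, positive=1):
    """Return accuracy and the F1 score of the `positive` class."""
    pairs = list(zip(predictions, references))
    correct = sum(p == r for p, r in pairs)
    tp = sum(p == positive and r == positive for p, r in pairs)
    fp = sum(p == positive and r != positive for p, r in pairs)
    fn = sum(p != positive and r == positive for p, r in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(pairs), "f1": f1}


# Toy labels for illustration only.
preds = [1, 1, 0, 1, 0, 1, 1, 0]
refs = [1, 0, 0, 1, 1, 1, 1, 0]
scores = accuracy_and_f1(preds, refs)
print(scores)
```

Because MRPC contains more paraphrase pairs than non-paraphrase pairs, F1 on the positive class can sit noticeably above accuracy, as it does in the output above.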

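When training a deep learning model, the **outer loop (the epoch loop)** and the **inner loop (the batch loop)** are the two levels that organize the training process. The comparison below shows how they differ:

---

### **1. An analogy: studying a textbook**
| Loop type | Analogy | What it means in deep learning |
|----------------|--------------------------------------------------------------------------|----------------------------------------------------------------------------------|
| **Outer loop (epoch)** | Reading a textbook from cover to cover once | The model makes one complete pass over the training set |
| **Inner loop (batch)** | Studying in chunks each time you read (say, 10 pages at a time) | The training data is split into mini-batches (e.g. `batch_size=32`) that are fed to the model one at a time |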

---

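### **2. The differences in detail**
#### **Outer loop (epoch loop)**
```python
for epoch in range(num_epochs):  # Outer loop
    ...
```
- **Purpose**: controls how many times the model passes over the entire training set
- **Key points**:
  - Every epoch sees **all of the training data** (possibly in a different order)
  - 3 to 10 epochs is a common setting (depending on the dataset size)
- **Example**:
  - MRPC has 3,668 training examples, so 3 epochs means the model sees each example 3 times

#### **Inner loop (batch loop)**
```python
for batch in train_dataloader:  # Inner loop
    ...
```
- **Purpose**: splits the data of a single epoch into mini-batches
- **Key points**:
  - Each batch holds `batch_size` samples (e.g. 32)
  - Memory-efficient (avoids processing the full dataset at once)
  - Enables techniques such as gradient accumulation
- **Worked example**:
  - With `batch_size=8`, MRPC needs `ceil(3668/8) = 459` steps to complete one epoch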
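
The step count can be checked with a few lines of arithmetic. This sketch assumes the last, smaller batch is kept (the `DataLoader` default, `drop_last=False`), which is why the division is rounded up:

```python
import math

num_samples = 3668  # MRPC training examples
batch_size = 8
num_epochs = 3

steps_per_epoch = math.ceil(num_samples / batch_size)  # the last batch holds the leftover samples
last_batch_size = num_samples - (steps_per_epoch - 1) * batch_size
total_steps = steps_per_epoch * num_epochs

print(steps_per_epoch, last_batch_size, total_steps)
```

The total matches the `num_training_steps` value printed earlier in the notebook.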

---

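### **3. Workflow diagram**
```mermaid
graph TB
    A[Start training] --> B[Epoch 1]
    B --> C[Batch 1: update parameters]
    B --> D[Batch 2: update parameters]
    B --> E[...Batch N]
    A --> F[Epoch 2]
    F --> G[Batch 1...]
    A --> H[Epoch...]
```

---

### **4. Why two nested loops?**
| Need | Role of the epoch loop | Role of the batch loop |
|---------------------|----------------------------------------|------------------------------------------|
| **Learning from all the data** | Ensures the model sees every sample several times | Splits a large dataset into chunks that can be processed |
| **Update frequency** | Controls the overall number of passes | Sets how many samples feed each parameter update (`batch_size`) |
| **Resource management** | Controls total training time | Avoids running out of GPU memory |
| **Randomizing data order** | Data is reshuffled at the start of each epoch | Batches can be loaded in parallel |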

---

### **5. Code example**
#### **Complete training structure**
```python
num_epochs = 3
batch_size = 8

for epoch in range(num_epochs):  # Outer loop: controls the number of passes
    print(f"Epoch {epoch + 1}/{num_epochs}")
    
    for batch_idx, batch in enumerate(train_dataloader):  # Inner loop: processes batches
        # 1. Prepare the data
        batch = {k: v.to(device) for k, v in batch.items()}
        
        # 2. Forward pass
        outputs = model(**batch)
        loss = outputs.loss
        
        # 3. Backward pass
        loss.backward()
        
        # 4. Parameter update
        optimizer.step()
        optimizer.zero_grad()
        
        # Print batch-level information
        if batch_idx % 50 == 0:
            print(f"  Batch {batch_idx}, Loss: {loss.item():.4f}")
```

#### **Example output (illustrative)**
```
Epoch 1/3
  Batch 0, Loss: 0.7012
  Batch 50, Loss: 0.4231
  ...
Epoch 2/3
  Batch 0, Loss: 0.3128
  Batch 50, Loss: 0.2856
```

---

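### **6. Frequently asked questions**
**Q1: Why not train with one huge batch?**  
A1: GPU memory is the limit, and smaller batches give more parameter updates per epoch, which often helps convergence.

**Q2: How many epochs should I use?**  
A2: Watch the validation loss and stop once it no longer decreases (early stopping).

**Q3: How do I choose `batch_size`?**  
A3: Within what GPU memory allows, larger batches give less noisy gradients and usually more stable training (typical values are 32-256; this notebook uses 8).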
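
The early-stopping rule mentioned in Q2 can be sketched in a few lines. The helper `early_stopping_epoch` and the `patience` parameter are hypothetical names for illustration, and the validation losses are made up:

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the 1-based epoch at which training would stop, or None if it never stops."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


# Made-up validation losses, one per epoch.
losses = [0.62, 0.48, 0.41, 0.43, 0.44, 0.40]
stop_epoch = early_stopping_epoch(losses, patience=2)
print(stop_epoch)
```

A larger `patience` tolerates more noisy epochs before stopping, at the cost of extra training time.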

---

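### **Summary**
- **Epoch loop**: controls "how many times you read the whole book"
- **Batch loop**: controls "how many pages you study at a time"  
The two work together so that the model learns from all of the data while staying within the available compute. Understanding this structure is fundamental to training deep learning models.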

In [25]:
from torch.optim import AdamW  # Imported from PyTorch rather than transformers
from transformers import AutoModelForSequenceClassification, get_scheduler
import torch
from tqdm.auto import tqdm

# Initialize the model
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
optimizer = AdamW(model.parameters(), lr=3e-5)  # PyTorch's AdamW

# Device setup
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)

# Training setup
num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)

# Training loop

progress_bar = tqdm(range(num_training_steps))  # Show a progress bar
model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        # Forward pass
        outputs = model(**batch)
        loss = outputs.loss
        # Backward pass
        loss.backward()

        # Parameter update
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)  # Advance the progress bar

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



In [27]:
import evaluate

metric = evaluate.load("glue", "mrpc")
model.eval()
for batch in eval_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)

    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch["labels"])

metric.compute()

{'accuracy': 0.8455882352941176, 'f1': 0.8934010152284264}

In [28]:
from accelerate import Accelerator
from torch.optim import AdamW  # Imported from PyTorch rather than transformers
from transformers import AutoModelForSequenceClassification, get_scheduler
from tqdm.auto import tqdm

# Initialize the accelerator
accelerator = Accelerator()

# Initialize the model and optimizer
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
optimizer = AdamW(model.parameters(), lr=3e-5)  # PyTorch's AdamW

# Prepare the components for distributed training
train_dl, eval_dl, model, optimizer = accelerator.prepare(
    train_dataloader, eval_dataloader, model, optimizer
)

# Training settings
num_epochs = 3
num_training_steps = num_epochs * len(train_dl)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)

# Training loop
progress_bar = tqdm(range(num_training_steps), disable=not accelerator.is_local_main_process)
model.train()
for epoch in range(num_epochs):
    for batch in train_dl:
        with accelerator.accumulate(model):  # Enables gradient accumulation
            outputs = model(**batch)
            loss = outputs.loss
            accelerator.backward(loss)

            optimizer.step()
            lr_scheduler.step()
            optimizer.zero_grad()
            progress_bar.update(1)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



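The notes below walk through distributed training with 🤗 Accelerate. Comparing the plain training code with the distributed version shows how the library works and what it buys you:

---

### **1. Key changes at a glance**
#### **Plain training vs. distributed training**
| Change | Plain training code | Distributed training code (Accelerate) | What it does |
|-------------------------|---------------------------------------|------------------------------------------|--------------------------------------------------------------------------|
| **Initialization** | Set `device` manually | Create an `Accelerator` instance | Detects and initializes the distributed environment (multi-GPU/TPU) automatically |
| **Device placement** | Explicit `model.to(device)` | Handled by `accelerator.prepare()` | Centralizes device placement so different hardware needs no manual handling |
| **Data loaders** | Use `DataLoader` directly | Wrapped by `accelerator.prepare()` | Shards the data automatically (each process handles different batches) |
| **Backward pass** | `loss.backward()` | `accelerator.backward(loss)` | Handles gradient synchronization (multi-GPU) and mixed-precision training |
| **Moving batches** | Manual `.to(device)` on each tensor in the batch | Automatic (no explicit transfer needed) | Simplifies the code and prevents device-mismatch errors |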

---

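### **2. Core components of 🤗 Accelerate**
#### **What the `Accelerator` class does**
```python
accelerator = Accelerator()
```
- **Environment detection**: picks a suitable backend for the current hardware (e.g. NCCL, XLA)
- **Unified interface**: the same code runs on CPU, a single GPU, multiple GPUs, or TPU without changes
- **Built-in features**: integrates optimizations such as mixed precision and gradient accumulation

#### **The `prepare()` method**
```python
train_dl, eval_dl, model, optimizer = accelerator.prepare(
    train_dataloader, eval_dataloader, model, optimizer
)
```
- **What it changes in each component**:
  - **DataLoader**: shards the batches across processes (the role that `DistributedSampler` plays in plain PyTorch), so each process handles a different subset of the data
  - **Model**: wrapped in `DistributedDataParallel` (multi-GPU) or left as the plain model (single device)
  - **Optimizer**: adapted for mixed-precision training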
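
To illustrate the data-loader part, the sketch below deals batches out to processes round-robin so that no two processes see the same batch. This is a simplified, hypothetical model (`shard_batches` is not an Accelerate function), and it ignores details such as padding the last round so that every process gets the same number of batches.

```python
def shard_batches(batches, num_processes, process_index):
    """Return the batches that one process would handle (round-robin assignment)."""
    return [b for i, b in enumerate(batches) if i % num_processes == process_index]


# 6 batches of 4 sample ids each, standing in for the output of a DataLoader.
all_batches = [list(range(start, start + 4)) for start in range(0, 24, 4)]
shards = [shard_batches(all_batches, num_processes=2, process_index=rank) for rank in range(2)]
for rank, shard in enumerate(shards):
    print(f"process {rank}: {shard}")
```

Each process then iterates over only its own shard, so the number of steps per epoch on each process shrinks as the number of processes grows.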

---

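### **3. The distributed training flow**
```mermaid
sequenceDiagram
    participant MainProcess
    participant Accelerator
    participant Worker1
    participant Worker2

    MainProcess->>Accelerator: Initialize the environment
    Accelerator->>Worker1: Assign a data shard and a model replica
    Accelerator->>Worker2: Assign a data shard and a model replica
    loop Training step
        Worker1->>Worker1: Compute local gradients
        Worker2->>Worker2: Compute local gradients
        Worker1->>Worker2: All-reduce: exchange and average gradients
        Worker2->>Worker1: All-reduce: exchange and average gradients
        Worker1->>Worker1: Apply the optimizer step to its replica
        Worker2->>Worker2: Apply the optimizer step to its replica
    end
```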
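
The gradient synchronization step can be illustrated without any GPU. Assuming a one-parameter linear model `y = w * x` with a mean-squared-error loss, the sketch below computes the gradient on two equal-sized shards and averages the results, which matches the gradient over the full batch. All names here are hypothetical.

```python
def mse_gradient(w, samples):
    """Gradient of mean((w*x - y)^2) with respect to w over `samples`."""
    return sum(2 * (w * x - y) * x for x, y in samples) / len(samples)


w = 0.5
full_batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
shard_0, shard_1 = full_batch[:2], full_batch[2:]  # one shard per worker

local_grads = [mse_gradient(w, shard_0), mse_gradient(w, shard_1)]
averaged = sum(local_grads) / len(local_grads)  # result of averaging gradients across workers
print(local_grads, averaged, mse_gradient(w, full_batch))
```

Because every worker ends up with the same averaged gradient, each one can apply the same optimizer step and the model replicas stay identical.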

---

### **4. Walking through the code (distributed version)**
#### **Environment initialization**
```python
from accelerate import Accelerator
accelerator = Accelerator()  # Key object: detects the number of GPUs/TPUs automatically
```

#### **Model and optimizer setup**
```python
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
optimizer = AdamW(model.parameters(), lr=3e-5)
```

#### **Wrapping the data loaders**
```python
# Handles data sharding and device placement automatically
train_dl, eval_dl, model, optimizer = accelerator.prepare(
    train_dataloader, eval_dataloader, model, optimizer
)
```

#### **Changes to the training loop**
```python
for batch in train_dl:  # Data is already sharded
    outputs = model(**batch)  # Device placement is managed by accelerator
    loss = outputs.loss
    accelerator.backward(loss)  # Replaces loss.backward()
    
    optimizer.step()
    lr_scheduler.step()
    optimizer.zero_grad()
```

---

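### **5. Advantages of distributed training**
| Feature | Single-device training | Distributed training with 🤗 Accelerate |
|---------------------|-------------------------|-----------------------------|
| **Hardware compatibility** | One device only | Transparently supports multi-GPU, TPU, and CPU setups |
| **Code complexity** | Device transfers managed by hand | Device placement and data parallelism handled automatically |
| **Training speed** | Limited by a single card | Up to near-linear speedup (approaching N× with N GPUs) |
| **Memory efficiency** | Single-card memory limits `batch_size` | Combined memory across cards allows a larger global batch |
| **Deployment** | Code must be adapted to each environment | The same code runs in every deployment scenario |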

---

### **6. Deployment examples**
#### **Launching from the command line (multi-node)**
```bash
# Step 1: generate a config file (interactive prompts)
accelerate config

# Step 2: launch training
accelerate launch train.py  # Applies the saved config automatically
```

#### **Launching from a notebook (Colab TPU)**
```python
def training_function():
    # Contains the complete training code
    ...

from accelerate import notebook_launcher
notebook_launcher(training_function)  # Launches TPU training in Colab
```

#### **Example configuration (accelerate config)**
```yaml
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
num_processes: 4  # Use 4 GPUs
mixed_precision: fp16
```

---

### **7. Performance tips**
1. **TPU-specific setting**:
   ```python
   # Pad to a fixed length in the tokenizer
   tokenizer(..., padding="max_length", max_length=512)
   ```
2. **Gradient accumulation** (when batches are small):
   ```python
   accelerator = Accelerator(gradient_accumulation_steps=4)
   ```
3. **Mixed-precision training**:
   ```python
   accelerator = Accelerator(mixed_precision="fp16")
   ```
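
As a sketch of why gradient accumulation helps with small batches: summing the gradients of several micro-batches, each scaled by `1 / accumulation_steps`, gives the same update as one larger batch, provided the micro-batches are equal in size. The example uses a hypothetical one-parameter model `y = w * x` with a mean-squared-error loss and is not Accelerate code.

```python
def mse_gradient(w, samples):
    """Gradient of mean((w*x - y)^2) with respect to w over `samples`."""
    return sum(2 * (w * x - y) * x for x, y in samples) / len(samples)


w, lr, accumulation_steps = 0.5, 0.01, 4
data = [(float(x), 2.0 * x) for x in range(1, 9)]  # 8 samples generated with w = 2
micro_batches = [data[i:i + 2] for i in range(0, len(data), 2)]  # 4 micro-batches of 2

accumulated = 0.0
for micro_batch in micro_batches:
    # Scale each micro-batch gradient so that the running sum is an average.
    accumulated += mse_gradient(w, micro_batch) / accumulation_steps

w_accumulated = w - lr * accumulated  # one optimizer step after 4 micro-batches
w_large_batch = w - lr * mse_gradient(w, data)  # one step on the full batch of 8
print(w_accumulated, w_large_batch)
```

The memory cost is that of a micro-batch of 2, while the update is the one a batch of 8 would have produced.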

---

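### **8. Troubleshooting**
**Q1: How do I make sure processes do not see the same data?**  
A1: `prepare()` wraps the `DataLoader` so that its batches are sharded across processes (the role `DistributedSampler` plays in plain PyTorch), so each process receives a distinct shard.

**Q2: Why is no manual `batch.to(device)` needed?**  
A2: After `prepare()`, the `DataLoader` yields batches that are already on the right device.

**Q3: How do I save and load a model?**  
A3: Use `accelerator.save_model()` to save the weights, and `accelerator.save_state()` / `accelerator.load_state()` for full training checkpoints:
```python
accelerator.save_model(model, "model_dir")  # Takes a directory; handles multi-device saving
```

---

With 🤗 Accelerate you can move from single-machine to distributed training with **almost the same code**, which greatly reduces the complexity of distributed training. The library's design philosophy is "write once, run anywhere": a single codebase adapts to a wide range of hardware environments.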

In [36]:
from accelerate import Accelerator

# Check whether an Accelerator was already created in this session
if 'accelerator' not in globals():
    accelerator = Accelerator(
        mixed_precision="fp16",
        gradient_accumulation_steps=2
    )
else:
    print("Using existing Accelerator instance.")

Using existing Accelerator instance.


In [38]:
from accelerate import Accelerator
from torch.optim import AdamW
from transformers import AutoModelForSequenceClassification, get_scheduler, AutoTokenizer, DataCollatorWithPadding
from datasets import load_dataset
from torch.utils.data import DataLoader
from tqdm.auto import tqdm
import torch
import os  # Needed to create the output directory

# 1. Reuse the existing accelerator object, or initialize a new one
# Note: when an existing instance is reused, the settings below are not applied.
if 'accelerator' not in globals():
    accelerator = Accelerator(
        mixed_precision="fp16",  # Enable mixed-precision training ("fp16" or "bf16")
        gradient_accumulation_steps=2  # Gradient accumulation steps (optional)
    )
    print("Initialized a new Accelerator instance.")
else:
    print("Reusing existing Accelerator instance.")

# 2. Load the data and the tokenizer
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

def tokenize_function(examples):
    return tokenizer(examples["sentence1"], examples["sentence2"], truncation=True)

# Load the GLUE MRPC dataset
raw_datasets = load_dataset("glue", "mrpc")
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)

# Remove unused columns and switch to PyTorch format
tokenized_datasets = tokenized_datasets.remove_columns(["sentence1", "sentence2", "idx"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")

# 3. Create the data loaders
train_dataloader = DataLoader(
    tokenized_datasets["train"],
    shuffle=True,
    batch_size=8,
    collate_fn=DataCollatorWithPadding(tokenizer=tokenizer)
)
eval_dataloader = DataLoader(
    tokenized_datasets["validation"],
    batch_size=8,
    collate_fn=DataCollatorWithPadding(tokenizer=tokenizer)
)

# 4. Initialize the model and optimizer
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
optimizer = AdamW(model.parameters(), lr=3e-5)

# 5. Prepare all components with the accelerator
train_dl, eval_dl, model, optimizer = accelerator.prepare(
    train_dataloader, eval_dataloader, model, optimizer
)

# 6. Set up the learning rate scheduler
num_epochs = 3
num_training_steps = num_epochs * len(train_dl)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps
)

# 7. Training loop
progress_bar = tqdm(range(num_training_steps), disable=not accelerator.is_local_main_process)

model.train()
for epoch in range(num_epochs):
    for batch in train_dl:
        # Forward pass (device placement is managed by the accelerator)
        outputs = model(**batch)
        loss = outputs.loss

        # Backward pass (gradient synchronization is handled automatically)
        accelerator.backward(loss)

        # Parameter update
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)
        progress_bar.set_postfix(loss=loss.item())

# 8. Save the model (works across multiple devices)
accelerator.wait_for_everyone()
unwrapped_model = accelerator.unwrap_model(model)  # Unwrap the model before saving
output_dir = "finetuned_bert_mrpc"
os.makedirs(output_dir, exist_ok=True)  # Create the output directory
accelerator.save(unwrapped_model.state_dict(), f"{output_dir}/pytorch_model.bin")  # Save the model weights



Reusing existing Accelerator instance.



Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

