# A more transparent approach

Here we skip the high-level `Trainer` API and write the training loop in plain PyTorch.


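## 1. Dataset preprocessing
The official Hugging Face tutorial says a few things have to be done before the data can go into a PyTorch DataLoader:
- Remove the columns the model does not need, such as `sentence1` and `sentence2`
- Convert the data to PyTorch tensors
- Rename the column `label` to `labels`

The first two are straightforward, but **why rename `label` to `labels`? That looks odd!**
Let's dig into it: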


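First, when you call one of these Hugging Face transformer models directly, the argument that receives the labels is named `labels`.
So whether you use `Trainer` or write plain PyTorch, the labels must reach the model under the name `labels`.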
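
Why the key name matters can be seen without loading a real model. The sketch below uses a hypothetical stand-in function, `toy_forward`, whose signature only mimics the `labels` keyword described above. It is not the real model API, just an illustration of how `**batch` unpacking maps dictionary keys to keyword arguments.

```python
def toy_forward(input_ids, attention_mask=None, labels=None):
    """Hypothetical stand-in for a model's forward call (not the real API)."""
    # Report whether labels arrived under the expected parameter name.
    return {"has_loss": labels is not None}


good_batch = {"input_ids": [[101, 102]], "labels": [1]}
bad_batch = {"input_ids": [[101, 102]], "label": [1]}

# Keys are unpacked into keyword arguments, so "labels" is matched by name.
result = toy_forward(**good_batch)

# A key named "label" has no matching parameter, so Python raises TypeError.
try:
    toy_forward(**bad_batch)
    error = None
except TypeError as exc:
    error = str(exc)
```

With `good_batch` the labels are received, while `bad_batch` fails before the function body even runs, because the name check happens at call time.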


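In Hugging Face `datasets`, however, the label column is usually named `label` or `label_ids`. So why didn't we have to touch the label name in the previous two parts?

A hint can be found in `trainer.py` in the transformers source: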
```python
# Located inside the _remove_unused_columns method
# Labels may be named label or label_ids, the default data collator handles that.
signature_columns += ["label", "label_ids"]
```
The comment says the data collator takes care of the label name. Looking into `data_collator.py`, I found the following:
```python
class DataCollatorWithPadding:
    ...
    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        ...
        if "label" in batch:
            batch["labels"] = batch["label"]
            del batch["label"]
        if "label_ids" in batch:
            batch["labels"] = batch["label_ids"]
            del batch["label_ids"]
        return batch
```
That settles it: whether the dataset calls its label column `label` or `label_ids`,
the data collator renames it to `labels`, puts it into the batch, and returns the batch.
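
The renaming step quoted above can be tried on a plain dictionary. This is a minimal sketch, with `rename_label_keys` as a hypothetical helper name. The real collator also pads the inputs and returns tensors, which is left out here.

```python
def rename_label_keys(batch):
    """Mimic the collator's key renaming on a plain dict (illustrative only)."""
    batch = dict(batch)  # shallow copy so the caller's dict is not modified
    for old_key in ("label", "label_ids"):
        if old_key in batch:
            # Move the value to the name the model expects.
            batch["labels"] = batch.pop(old_key)
    return batch


collated = rename_label_keys({"input_ids": [[101, 102]], "label": [1]})
collated_ids = rename_label_keys({"input_ids": [[101, 102]], "label_ids": [0]})
```

Both input spellings end up under the single key `labels`, which is the name the model call expects.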




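So the earlier parts needed no manual rename because, with `Trainer`, the data collator converted `label` to `labels` for us. A manual rename is only required when the batches do not pass through this collator. The code below still hands `DataCollatorWithPadding` to the DataLoader as `collate_fn`, so the rename is optional here. That is why the `rename_column` call in cell In [3] is left commented out and the batch still ends up with a `labels` key.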

In [1]:
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

raw_datasets = load_dataset("glue", "mrpc")
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)



In [2]:
print(tokenized_datasets['train'].column_names)

['attention_mask', 'idx', 'input_ids', 'label', 'sentence1', 'sentence2', 'token_type_ids']


Hugging Face `datasets` conveniently provides three methods, `remove_columns`, `rename_column` and `set_format`,

to help us prepare the data for a PyTorch DataLoader:

In [3]:
tokenized_datasets = tokenized_datasets.remove_columns(['sentence1', 'sentence2','idx'])
# tokenized_datasets = tokenized_datasets.rename_column('label','labels')
tokenized_datasets.set_format('torch')

print(tokenized_datasets['train'].column_names)

['attention_mask', 'input_ids', 'label', 'token_type_ids']


In [4]:
tokenized_datasets['train']  # after the steps above it behaves like a PyTorch Dataset and can be passed straight to a DataLoader

Dataset({
    features: ['attention_mask', 'input_ids', 'label', 'token_type_ids'],
    num_rows: 3668
})

Define our PyTorch dataloaders:

In [9]:
from torch.utils.data import DataLoader
# TODO: look into how DataLoader is built internally and how it keeps memory usage low
train_dataloader = DataLoader(tokenized_datasets['train'], shuffle=True, batch_size=8, collate_fn=data_collator)  # the collator pads per batch, so seq_len can differ between batches
eval_dataloader = DataLoader(tokenized_datasets['validation'], batch_size=8, collate_fn=data_collator)

In [10]:
for batch in train_dataloader:
    break
{k: v.shape for k, v in batch.items()}

{'attention_mask': torch.Size([8, 74]),
 'input_ids': torch.Size([8, 74]),
 'token_type_ids': torch.Size([8, 74]),
 'labels': torch.Size([8])}
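
The width of 74 above belongs to this particular batch only, because the collator pads each batch to its own longest sequence. Here is a minimal pure-Python sketch of that idea. `pad_batch` is a hypothetical helper, and `pad_id=0` is an assumption standing in for the tokenizer's padding token id; the real collator delegates this work to the tokenizer.

```python
def pad_batch(features, pad_id=0):
    """Pad every sequence in one batch to that batch's longest sequence."""
    max_len = max(len(f["input_ids"]) for f in features)
    input_ids, attention_mask = [], []
    for f in features:
        n_pad = max_len - len(f["input_ids"])
        input_ids.append(f["input_ids"] + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding.
        attention_mask.append([1] * len(f["input_ids"]) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch_a = pad_batch([{"input_ids": [101, 7, 102]}, {"input_ids": [101, 7, 8, 9, 102]}])
batch_b = pad_batch([{"input_ids": [101, 102]}, {"input_ids": [101, 7, 102]}])
```

`batch_a` comes out 5 tokens wide and `batch_b` only 3, so short batches do not pay for the longest sentence in the whole dataset.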

## 2. Model

In [7]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

In [8]:
model(**batch)

SequenceClassifierOutput(loss=tensor(0.6766, grad_fn=<NllLossBackward>), logits=tensor([[-0.6753, -0.5134],
        [-0.7061, -0.5102],
        [-0.6784, -0.5089],
        [-0.6904, -0.5041],
        [-0.6940, -0.5140],
        [-0.6889, -0.4749],
        [-0.6981, -0.4978],
        [-0.6895, -0.5102]], grad_fn=<AddmmBackward>), hidden_states=None, attentions=None)
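
The loss of about 0.68 is close to ln 2 ≈ 0.693, which is what a two-class head producing nearly equal logits would give. The sketch below computes the cross-entropy of a single example by hand to make that visible. The labels of the batch are not shown in the output, so the reported batch mean cannot be reproduced exactly; both possible labels are scored for the first row instead.

```python
import math


def cross_entropy(logits, label):
    """Cross-entropy of one example: -log softmax(logits)[label]."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum_exp - logits[label]


# Equal logits mean a 50/50 guess, whose loss is ln 2.
uniform_loss = cross_entropy([0.0, 0.0], label=0)

# First row of logits from the output above, scored against each possible label.
loss_if_label_0 = cross_entropy([-0.6753, -0.5134], label=0)
loss_if_label_1 = cross_entropy([-0.6753, -0.5134], label=1)
```

Because the second logit is slightly larger, the loss is a little below ln 2 when the true label is 1 and a little above it when the true label is 0.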

## Optimizer and learning rate scheduler

In [9]:
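from torch.optim import AdamW  # transformers.AdamW is deprecated; the PyTorch optimizer replaces it
from transformers import get_scheduler

# weight_decay=0.0 matches the default of the old transformers.AdamW
optimizer = AdamW(model.parameters(), lr=5e-5, weight_decay=0.0)

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)  # num of batches * num of epochs
lr_scheduler = get_scheduler(
    'linear',
    optimizer=optimizer,  # the scheduler adjusts the lr of this optimizer
    num_warmup_steps=0,
    num_training_steps=num_training_steps)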
print(num_training_steps)

1377
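
The figure 1377 follows from the numbers already shown: 3668 training rows in batches of 8 give 459 batches per epoch (the last one is smaller), times 3 epochs. The sketch below redoes that arithmetic and writes out a linear schedule by hand. `linear_lr` is a hypothetical helper that illustrates linear warmup followed by linear decay; it is not the library's own code.

```python
import math

num_rows, batch_size, num_epochs = 3668, 8, 3
steps_per_epoch = math.ceil(num_rows / batch_size)  # the last batch is smaller
num_training_steps = num_epochs * steps_per_epoch


def linear_lr(step, base_lr=5e-5, num_warmup_steps=0, total_steps=num_training_steps):
    """Learning rate at `step`: linear warmup, then linear decay to zero."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = total_steps - step
    return base_lr * max(0.0, remaining / max(1, total_steps - num_warmup_steps))


lr_start = linear_lr(0)
lr_after_one_epoch = linear_lr(steps_per_epoch)
lr_end = linear_lr(num_training_steps)
```

With no warmup, the rate starts at the full 5e-5, has dropped by one third after the first epoch, and reaches zero at the final step.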


## 3. Training

In [10]:
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

device

device(type='cuda')

## training loops:

In [11]:
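from tqdm import tqdm

model.train()  # from_pretrained returns the model in eval mode; re-enable dropout for training
for epoch in range(num_epochs):
    for batch in tqdm(train_dataloader):
        # Move every tensor in the batch to the same device as the model
        batch = {k: v.to(device) for k, v in batch.items()}
        loss = model(**batch).loss
        loss.backward()
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()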

 26%|██▌       | 118/459 [00:29<01:24,  4.06it/s]


KeyboardInterrupt: 

## 4. Evaluation

In [12]:
import evaluate  # datasets.load_metric is deprecated; the separate `evaluate` package replaces it

metric = evaluate.load("glue", "mrpc")
model.eval()
for batch in eval_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():  # no gradients are needed during evaluation
        outputs = model(**batch)
    
    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch["labels"])

metric.compute()

{'accuracy': 0.6838235294117647, 'f1': 0.8122270742358079}
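
These two scores deserve a second look. They coincide with what a classifier that always predicts class 1 would obtain, assuming the validation split has 408 examples of which 279 are positive. Those counts are an assumption inferred from the scores themselves, not read from the dataset in this notebook. The sketch below computes accuracy and F1 by hand under that assumption, with `accuracy_and_f1` as a hypothetical helper.

```python
def accuracy_and_f1(predictions, references):
    """Accuracy and F1 for the positive class (label 1), in plain Python."""
    pairs = list(zip(predictions, references))
    tp = sum(1 for p, r in pairs if p == 1 and r == 1)
    fp = sum(1 for p, r in pairs if p == 1 and r == 0)
    fn = sum(1 for p, r in pairs if p == 0 and r == 1)
    correct = sum(1 for p, r in pairs if p == r)
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return {"accuracy": correct / len(pairs), "f1": f1}


# Assumed label counts (inferred from the scores above, not read from the data):
# 408 validation examples, 279 of them positive.
references = [1] * 279 + [0] * 129
all_positive = [1] * len(references)
scores = accuracy_and_f1(all_positive, references)
```

If the assumption holds, the model has not yet learned to separate the classes, which would be consistent with the training loop above having been interrupted partway through the first epoch.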

## 5. Speeding things up further with the Accelerate library
The training loop we defined earlier works fine on a single CPU or GPU. But using the 🤗 Accelerate library, with just a few adjustments we can enable distributed training on multiple GPUs or TPUs.

More on that another time~