# A more transparent approach

Here we skip the high-level Trainer API and write the training loop directly in PyTorch.


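## 1. Dataset preprocessing
Before we can use PyTorch's DataLoader, we need to do a few things:
- remove the columns the model does not need, such as `sentence1` and `sentence2`
- rename columns (`label` becomes `labels`, the argument name the model expects)
- convert the data to PyTorch tensors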

In [22]:
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

raw_datasets = load_dataset("glue", "mrpc")
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

In [20]:
print(tokenized_datasets['train'].column_names)

['attention_mask', 'idx', 'input_ids', 'label', 'sentence1', 'sentence2', 'token_type_ids']


Hugging Face Datasets conveniently provides three methods, `remove_columns`, `rename_column`, and `set_format`, that make it easy to prepare a dataset for PyTorch's DataLoader:

In [21]:

tokenized_datasets = tokenized_datasets.remove_columns(['sentence1', 'sentence2','idx'])
tokenized_datasets = tokenized_datasets.rename_column('label','labels')
tokenized_datasets.set_format('torch')

print(tokenized_datasets['train'].column_names)

['attention_mask', 'input_ids', 'labels', 'token_type_ids']


In [23]:
tokenized_datasets['train']  # after the processing above it can be passed straight to PyTorch's DataLoader, just like a PyTorch Dataset

Dataset({
    features: ['attention_mask', 'input_ids', 'labels', 'token_type_ids'],
    num_rows: 3668
})

Define our PyTorch dataloaders:

In [25]:
from torch.utils.data import DataLoader
# TODO: look into how DataLoader is built internally and how it saves memory
train_dataloader = DataLoader(tokenized_datasets['train'], shuffle=True, batch_size=8, collate_fn=data_collator)  # the collator pads each batch separately, so seq_len can differ from batch to batch
eval_dataloader = DataLoader(tokenized_datasets['validation'], batch_size=8, collate_fn=data_collator)

In [35]:
for batch in train_dataloader:
    break
{k: v.shape for k, v in batch.items()}

{'attention_mask': torch.Size([8, 65]),
 'input_ids': torch.Size([8, 65]),
 'labels': torch.Size([8]),
 'token_type_ids': torch.Size([8, 65])}
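
The batch above has a sequence length of 65 because the collator pads each batch only up to its own longest example. Below is a minimal pure-Python sketch of that idea. It is not the actual `DataCollatorWithPadding` implementation; the helper name `pad_batch` and the pad id of 0 are assumptions made for illustration.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the length of the longest one in this batch."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two batches with different longest sequences end up with different widths
batch_a = pad_batch([[101, 7592, 102], [101, 102]])
batch_b = pad_batch([[101, 7592, 2088, 999, 102], [101, 102]])
print(len(batch_a["input_ids"][0]), len(batch_b["input_ids"][0]))
```

Padding per batch rather than to a global maximum keeps most batches shorter, which is why the shapes printed for different batches need not match.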

## 2. Model

In [27]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

In [28]:
model(**batch)

SequenceClassifierOutput(loss=tensor(0.6954, grad_fn=<NllLossBackward>), logits=tensor([[-0.2911, -0.1413],
        [-0.2991, -0.1434],
        [-0.2911, -0.1372],
        [-0.3022, -0.1459],
        [-0.2962, -0.1503],
        [-0.2993, -0.1488],
        [-0.3063, -0.1543],
        [-0.3036, -0.1450]], grad_fn=<AddmmBackward>), hidden_states=None, attentions=None)
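
A loss close to 0.69 is roughly what one would expect from a freshly initialized two-class head: when the two logits are nearly equal, the cross-entropy is about ln 2 ≈ 0.693. The sketch below checks this in pure Python; `cross_entropy` is a hypothetical helper written for this illustration, not a library call.

```python
import math


def cross_entropy(logits, label):
    """Cross-entropy of one example: -log(softmax(logits)[label])."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[label]


# With equal logits the model is maximally unsure, so the loss equals ln 2
print(cross_entropy([0.0, 0.0], 0), math.log(2))

# First row of the logits printed above: the loss depends on which label is correct
first_row = [-0.2911, -0.1413]
print(cross_entropy(first_row, 0), cross_entropy(first_row, 1))
```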

## Optimizer and learning rate scheduler

In [37]:
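# transformers.AdamW is deprecated in newer releases; the PyTorch implementation is used instead.
# Note that torch.optim.AdamW defaults to weight_decay=0.01, while the deprecated class defaulted to 0.0.
from torch.optim import AdamW
from transformers import get_scheduler

optimizer = AdamW(model.parameters(), lr=5e-5)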

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)  # num of batches * num of epochs
lr_scheduler = get_scheduler(
    'linear',
    optimizer=optimizer,  # the scheduler adjusts the learning rate of this optimizer
    num_warmup_steps=0,
    num_training_steps=num_training_steps)
print(num_training_steps)

1377
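
The 1377 steps are 459 batches per epoch (3668 training examples in batches of 8, rounded up) times 3 epochs. As a rough sketch of what a linear schedule does with that number, the function below decays the learning rate from its initial value to zero over the training steps. `linear_lr` is a hypothetical helper for illustration; it only approximates what `get_scheduler('linear', ...)` sets up.

```python
import math


def linear_lr(step, base_lr, num_training_steps, num_warmup_steps=0):
    """Learning rate at `step` for linear warmup followed by linear decay to 0."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return base_lr * max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


batches_per_epoch = math.ceil(3668 / 8)      # the last batch is smaller than 8
num_training_steps = 3 * batches_per_epoch   # 3 epochs

for step in (0, num_training_steps // 2, num_training_steps):
    print(step, linear_lr(step, 5e-5, num_training_steps))
```

With `num_warmup_steps=0`, as in the cell above, the schedule starts at the full learning rate and shrinks a little after every optimizer step.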


## 3. Training

In [38]:
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

device

device(type='cuda')

## Training loop

In [41]:
from tqdm import tqdm

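model.train()  # from_pretrained returns the model in eval mode; switch to train mode so dropout is active
for epoch in range(num_epochs):
    for batch in tqdm(train_dataloader):
        # To train on the GPU, each batch has to be moved to the GPU:
        batch = {k: v.to(device) for k, v in batch.items()}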
        loss = model(**batch).loss
        loss.backward()
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()

100%|██████████| 459/459 [01:56<00:00,  3.94it/s]
100%|██████████| 459/459 [01:57<00:00,  3.91it/s]
100%|██████████| 459/459 [01:57<00:00,  3.90it/s]


## 4. Evaluation

In [42]:
# load_metric has been removed from recent versions of datasets; the evaluate library provides the same metric
import evaluate

metric = evaluate.load("glue", "mrpc")
model.eval()
for batch in eval_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():  # gradients are not needed during evaluation
        outputs = model(**batch)
    
    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch["labels"])

metric.compute()

{'accuracy': 0.8480392156862745, 'f1': 0.8956228956228957}
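
The two numbers reported for MRPC are accuracy and F1. Below is a minimal pure-Python sketch of how they can be computed from predictions and references, assuming label 1 is the positive class for F1. `accuracy_and_f1` is a hypothetical helper for illustration, not part of the metric library.

```python
def accuracy_and_f1(predictions, references, positive=1):
    """Accuracy over all examples and F1 for the `positive` class."""
    pairs = list(zip(predictions, references))
    correct = sum(p == r for p, r in pairs)
    tp = sum(p == positive and r == positive for p, r in pairs)
    fp = sum(p == positive and r != positive for p, r in pairs)
    fn = sum(p != positive and r == positive for p, r in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(pairs), "f1": f1}


print(accuracy_and_f1([1, 1, 0, 1], [1, 0, 0, 1]))
```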

## 5. Speeding things up further with the Accelerate library
The training loop we defined earlier works fine on a single CPU or GPU. But using the 🤗 Accelerate library, with just a few adjustments we can enable distributed training on multiple GPUs or TPUs.

We'll leave that for another time.