
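The code below demonstrates basic usage of BERT.

Given the input "The capital of China is [MASK].", the following example shows how the model predicts the masked word and prints the five most probable candidates.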

In [1]:
"""
Code source: the GitHub project huggingface/transformers
(Copyright (c) 2020, The HuggingFace Team, Apache-2.0 License; see the appendix)
"""
from transformers import BertTokenizer, BertForMaskedLM
from torch.nn import functional as F
import torch

# Use the bert-base-uncased model for prediction, with its matching tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased', return_dict=True)

# Build the input sentence "The capital of China is [MASK]."
text = 'The capital of China is ' + tokenizer.mask_token + '.'
# Encode the input sentence as PyTorch tensors
inputs = tokenizer.encode_plus(text, return_tensors='pt')
# Locate the position of [MASK]; torch.where returns a tuple, so take its first element
mask_index = torch.where(inputs['input_ids'][0] == tokenizer.mask_token_id)[0]
output = model(**inputs)
logits = output.logits
# From the output distribution at the [MASK] position,
# pick the 5 most probable tokens and print them
distribution = F.softmax(logits, dim=-1)
mask_word = distribution[0, mask_index, :]
top_5 = torch.topk(mask_word, 5, dim=1)[1][0]
for token in top_5:
    word = tokenizer.decode([token])
    new_sentence = text.replace(tokenizer.mask_token, word)
    print(new_sentence)

The capital of China is beijing.
The capital of China is nanjing.
The capital of China is shanghai.
The capital of China is guangzhou.
The capital of China is shenzhen.
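
The selection step above (a softmax over the vocabulary followed by a top-k pick) can be sketched without downloading a model. The following is a minimal illustration in plain Python; the five-word vocabulary and the logit values are invented for the example and are not actual BERT outputs.

```python
import math

def softmax(logits):
    # Subtract the maximum for numerical stability before exponentiating
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(probs, k):
    # Return (index, probability) pairs for the k largest probabilities
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Hypothetical vocabulary and logits at the [MASK] position
vocab = ['beijing', 'paris', 'shanghai', 'river', 'nanjing']
logits = [5.0, 0.5, 3.0, -1.0, 4.0]

probs = softmax(logits)
best = top_k(probs, 3)
for index, p in best:
    print(vocab[index], round(p, 3))
```

The real model does the same thing over a vocabulary of tens of thousands of word pieces, with `torch.topk` replacing the sort.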



Next, we show how to fine-tune BERT for text classification, using the Books dataset from Chapter 4.

In [1]:
"""
Code source: the GitHub project huggingface/transformers
(Copyright (c) 2020, The HuggingFace Team, Apache-2.0 License; see the appendix)
"""
import sys
from tqdm import tqdm

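# Import the Books dataset implemented earlier
sys.path.append('./code')
from utils import BooksDataset

dataset = BooksDataset()
# Print the mapping from label IDs to class names, then the dataset sizes
print(dataset.id2label)
print(len(dataset.train_data), len(dataset.test_data))

# Tokenize the text and take the first 100 examples each for training and testing.
# Only 100 examples are used so that the code runs in reasonable time on a CPU.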
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')

def tokenize_function(text):
    return tokenizer(text, padding='max_length', truncation=True)

def tokenize(raw_data):
    dataset = []
    for data in tqdm(raw_data):
        tokens = tokenize_function(data['en_book'])
        tokens['label'] = data['label']
        dataset.append(tokens)
    return dataset
        
small_train_dataset = tokenize(dataset.train_data[:100])
small_eval_dataset = tokenize(dataset.test_data[:100])

train size = 8627 , test size = 2157
{0: '计算机类', 1: '艺术传媒类', 2: '经管类'}
8627 2157


100%|██████████| 100/100 [00:00<00:00, 8225.09it/s]
100%|██████████| 100/100 [00:00<00:00, 4294.85it/s]
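
With `padding='max_length'` and `truncation=True`, every example comes out with the same number of token IDs, plus an attention mask that tells the model which positions are padding. The sketch below mimics that behavior in plain Python. `pad_or_truncate` is a hypothetical helper, and the maximum length of 8 and the padding ID of 0 are assumptions chosen to keep the output readable. It is also simplified: a real BERT tokenizer keeps the closing special token when it truncates.

```python
def pad_or_truncate(token_ids, max_length=8, pad_id=0):
    # Cut off sequences that are longer than max_length
    ids = token_ids[:max_length]
    # 1 marks a real token, 0 marks padding
    attention_mask = [1] * len(ids)
    n_pad = max_length - len(ids)
    return {
        'input_ids': ids + [pad_id] * n_pad,
        'attention_mask': attention_mask + [0] * n_pad,
    }

short = pad_or_truncate([101, 7592, 102])
long = pad_or_truncate(list(range(1, 13)))
print(short)
print(long)
```

Fixed-length examples are what allow the later `collate` function to stack a batch into a single rectangular tensor.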


In [None]:
# Load the pretrained bert-base-cased model with a sequence-classification head;
# the number of labels is 3
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(
    'bert-base-cased', num_labels=len(dataset.id2label))

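# To monitor model performance during training, define an evaluation
# function that computes classification accuracy
import numpy as np
# The evaluate library can be installed with:
# pip install evaluate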
import evaluate

metric = evaluate.load('accuracy')

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

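# Build the training arguments with the TrainingArguments class.
# evaluation_strategy='epoch' computes the evaluation metrics at the end of each epoch
# (recent transformers releases rename this argument to eval_strategy)
from transformers import TrainingArguments, Trainer
training_args = TrainingArguments(output_dir='test_trainer',\
    evaluation_strategy='epoch')

# The Trainer class bundled with transformers wraps many training details,
# such as data conversion, evaluation, and model saving.
# Calling Trainer runs the standard fine-tuning loop; it trains for 3 epochs by default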
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)

In [3]:
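# The default fine-tuning loop logs training to wandb; see the wandb website for usage.
# Here wandb is disabled through the WANDB_DISABLED environment variable
# to avoid unnecessary network access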
import os
os.environ["WANDB_DISABLED"] = "true"
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy
1,No log,0.962486,0.52
2,No log,0.852982,0.67
3,No log,0.816384,0.68
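
The Accuracy column comes from `compute_metrics`, which takes the argmax of each row of logits and compares it with the reference label. As a rough sketch of that computation in plain Python (the logits and labels below are invented, and `accuracy_from_logits` is a hypothetical stand-in for the `evaluate` metric):

```python
def accuracy_from_logits(logits, labels):
    # Predicted class = index of the largest logit in each row
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {'accuracy': correct / len(labels)}

# Four hypothetical examples with three classes each
logits = [[2.0, 0.1, -1.0], [0.3, 0.2, 1.5], [0.0, 0.9, 0.4], [1.2, 1.1, 0.0]]
labels = [0, 2, 2, 0]
result = accuracy_from_logits(logits, labels)
print(result)
```

Three of the four predictions match the labels here, so the sketch reports an accuracy of 0.75.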


The code above implements a simple fine-tuning procedure by calling the Trainer class. Next, we show how to write a custom fine-tuning loop.

In [11]:
import torch

del model
del trainer
# If you are using a GPU, clear the GPU cache
torch.cuda.empty_cache()

# Use the DataLoader class to feed data to the model
from torch.utils.data import DataLoader

# Convert Python lists to PyTorch tensors
def collate(batch):
    input_ids, token_type_ids, attention_mask, labels = [], [], [], []
    for d in batch:
        input_ids.append(d['input_ids'])
        token_type_ids.append(d['token_type_ids'])
        attention_mask.append(d['attention_mask'])
        labels.append(d['label'])
    input_ids = torch.tensor(input_ids)
    token_type_ids = torch.tensor(token_type_ids)
    attention_mask = torch.tensor(attention_mask)
    labels = torch.tensor(labels)
    return {'input_ids': input_ids, 'token_type_ids': token_type_ids,\
        'attention_mask': attention_mask, 'labels': labels}

train_dataloader = DataLoader(small_train_dataset, shuffle=True,\
    batch_size=8, collate_fn=collate)
eval_dataloader = DataLoader(small_eval_dataset, batch_size=8,\
    collate_fn=collate)

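# Load the model and prepare the optimizer (which updates the parameters)
# and the scheduler (which adjusts the learning rate during training
# for better fine-tuning results)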
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(\
    "bert-base-cased", num_labels=len(dataset.id2label))

from torch.optim import AdamW
optimizer = AdamW(model.parameters(), lr=5e-5)

from transformers import get_scheduler
num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    name="linear", optimizer=optimizer, num_warmup_steps=0,\
    num_training_steps=num_training_steps
)

# Check whether a GPU is available; if so, move the model to GPU memory
device = torch.device("cuda") if torch.cuda.is_available()\
    else torch.device("cpu")
model.to(device)

# Training loop
from tqdm.auto import tqdm
progress_bar = tqdm(range(num_training_steps))

for epoch in range(num_epochs):
    # model.train() sets the model's training flag to True at the start of
    # each epoch; the flag affects layers such as dropout (enabled in training)
    model.train()
    for batch in train_dataloader:
        # If a GPU is available, this moves the batch to GPU memory
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        lr_scheduler.step()
        # Clear the gradients once the parameters have been updated
        optimizer.zero_grad()
        progress_bar.update(1)
progress_bar.close()
import evaluate

# After training, evaluate on the test set to obtain the model's accuracy
model.eval()
metric = evaluate.load("accuracy")
for batch in eval_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)

    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch["labels"])
acc = metric.compute()
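
The "linear" scheduler used above, with zero warmup steps, lowers the learning rate in a straight line from its initial value to zero over the training steps. The following plain-Python sketch shows that rule as we understand it. `linear_schedule_lr` is a hypothetical helper written for illustration, not the library function itself.

```python
import math

def linear_schedule_lr(step, base_lr, num_training_steps, num_warmup_steps=0):
    # Linear warmup from 0 to base_lr, then linear decay down to 0
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    decay_steps = max(1, num_training_steps - num_warmup_steps)
    return base_lr * max(0.0, remaining / decay_steps)

# 100 training examples with batch size 8 give 13 steps per epoch
steps_per_epoch = math.ceil(100 / 8)
num_training_steps = 3 * steps_per_epoch
for step in (0, 13, 26, 39):
    print(step, linear_schedule_lr(step, 5e-5, num_training_steps))
```

With the settings from this section, training runs for 39 steps, and the learning rate has dropped to two thirds of 5e-5 by the end of the first epoch.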




The following code demonstrates how to train (fine-tune) GPT-2.

In [1]:
"""
Code source: the GitHub project huggingface/transformers
(Copyright (c) 2020, The HuggingFace Team, Apache-2.0 License; see the appendix)
"""
import sys

# Import the "The Little Prince" dataset used in Chapter 3
sys.path.append('../code')
from utils import TheLittlePrinceDataset

full_text = TheLittlePrinceDataset(tokenize=False).text
# Next, load the GPT-2 tokenizer and tokenize the text
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('gpt2')

full_tokens = tokenizer.tokenize(full_text.lower())
train_size = int(len(full_tokens) * 0.8)
train_tokens = full_tokens[:train_size]
test_tokens = full_tokens[train_size:]
print(len(train_tokens), len(test_tokens))
print(train_tokens[:10])

19206 4802
['the', 'Ġlittle', 'Ġprince', 'Ġ', 'ĊĊ', 'Ċ', 'Ċ', 'anto', 'ine', 'Ġde']


In [6]:
import torch
from torch.utils.data import TensorDataset

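# Split the token sequence into blocks of block_size tokens
block_size = 128

def split_blocks(tokens):
    token_ids = []
    # Integer division keeps only full blocks; trailing tokens that do not
    # fill a whole block are dropped, so the padding branch below never runs
    for i in range(len(tokens) // block_size):
        _tokens = tokens[i*block_size:(i+1)*block_size]
        if len(_tokens) < block_size:
            _tokens += [tokenizer.pad_token] * (block_size - len(_tokens))
        _token_ids = tokenizer.convert_tokens_to_ids(_tokens)
        token_ids.append(_token_ids)
    return token_ids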

train_dataset = split_blocks(train_tokens)
test_dataset = split_blocks(test_tokens)
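
To see what the block splitting does to the data, here is a small plain-Python sketch of the same slicing logic applied to a toy list. `split_into_blocks` is a hypothetical helper that mirrors the loop in `split_blocks` without the tokenizer.

```python
def split_into_blocks(items, block_size):
    # Keep only full blocks; leftover items at the end are discarded
    n_blocks = len(items) // block_size
    return [items[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]

blocks = split_into_blocks(list(range(10)), block_size=4)
print(blocks)

# Block counts and leftovers for the token counts printed earlier
for n_tokens in (19206, 4802):
    print(n_tokens, n_tokens // 128, n_tokens % 128)
```

For the token counts printed earlier, this yields 150 training blocks (6 trailing tokens unused) and 37 test blocks (66 trailing tokens unused).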

In [7]:
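# Create a data collator that turns the tokenized blocks into trainable tensors.
# Note that the fine-tuning task here is (causal) language modeling,
# not masked language modeling, hence mlm=False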
from transformers import DataCollatorForLanguageModeling

tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorForLanguageModeling(tokenizer=\
    tokenizer, mlm=False)

# Load the model, set up the training arguments, and train with the Trainer class
from transformers import AutoModelForCausalLM, TrainingArguments, Trainer

model = AutoModelForCausalLM.from_pretrained("gpt2")

training_args = TrainingArguments(
    output_dir="test_trainer",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    data_collator=data_collator,
)

trainer.train()

# Compute the perplexity on the test set
import math
eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

Epoch,Training Loss,Validation Loss
1,No log,3.24026
2,No log,3.152012
3,No log,3.125814


Perplexity: 22.78
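
Perplexity is the exponential of the average negative log-likelihood per token, which is why the code applies `math.exp` to the evaluation loss. A minimal sketch in plain Python, where the token probabilities in the first call are invented for illustration:

```python
import math

def perplexity(token_probs):
    # Average negative log-likelihood per token, then exponentiate
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# Hypothetical probabilities a model assigns to four reference tokens
uniform_ppl = perplexity([0.25, 0.25, 0.25, 0.25])
print(uniform_ppl)

# The final validation loss reported above reproduces the printed perplexity
reported_ppl = round(math.exp(3.125814), 2)
print(reported_ppl)
```

A model that assigns probability 0.25 to every reference token has a perplexity of about 4, meaning it is as uncertain as a uniform choice among four tokens.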


Here we use HuggingFace to show how to generate text with the GPT-2 model.

In [5]:
"""
Code source: the GitHub project huggingface/transformers
(Copyright (c) 2020, The HuggingFace Team, Apache-2.0 License; see the appendix)
"""
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2',\
    pad_token_id=tokenizer.eos_token_id)
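# Input text
input_ids = tokenizer.encode('I enjoy learning with this book',\
    return_tensors='pt')

# Generate text with greedy decoding, the default when no other strategy is set
greedy_output = model.generate(input_ids, max_length=50)
print("Output:\n" + 100 * '-')
print(tokenizer.decode(greedy_output[0], skip_special_tokens=True))

# Generate with beam search; with early_stopping=True the search ends
# as soon as num_beams complete candidates have been found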
beam_output = model.generate(
    input_ids, 
    max_length=50, 
    num_beams=5, 
    early_stopping=True
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))

# Return multiple sentences
beam_outputs = model.generate(
    input_ids, 
    max_length=50, 
    num_beams=5, 
    no_repeat_ngram_size=2, 
    num_return_sequences=5, 
    early_stopping=True
)

print("Output:\n" + 100 * '-')
for i, beam_output in enumerate(beam_outputs):
    print("{}: {}".format(i, tokenizer.decode(beam_output,\
        skip_special_tokens=True)))

Output:
----------------------------------------------------------------------------------------------------
I enjoy learning with this book. I have been reading it for a while now and I am very happy with it. I have been reading it for a while now and I am very happy with it.

I have been reading it for a
Output:
----------------------------------------------------------------------------------------------------
I enjoy learning with this book, and I hope you enjoy reading it as much as I do.

I hope you enjoy reading this book, and I hope you enjoy reading it as much as I do.

I hope you enjoy reading
Output:
----------------------------------------------------------------------------------------------------
0: I enjoy learning with this book, and I hope you enjoy reading it as much as I do.

If you have any questions or comments, feel free to leave them in the comments below.
1: I enjoy learning with this book, and I hope you enjoy reading it as much as I do.

If you have any questi
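
The difference between greedy decoding and beam search can be seen on a toy example. The sketch below is a minimal beam search in plain Python over a hypothetical next-token table (not GPT-2). Setting `num_beams=1` reduces it to greedy decoding.

```python
import math

# Hypothetical next-token probabilities, conditioned on the previous token only
NEXT = {
    '<s>': {'a': 0.6, 'b': 0.4},
    'a': {'x': 0.55, 'y': 0.45},
    'b': {'x': 0.9, 'y': 0.1},
}

def beam_search(num_beams, max_steps=2):
    # Each beam is a pair (tokens, total log-probability)
    beams = [(['<s>'], 0.0)]
    for _ in range(max_steps):
        candidates = []
        for tokens, score in beams:
            for token, p in NEXT[tokens[-1]].items():
                candidates.append((tokens + [token], score + math.log(p)))
        # Keep only the num_beams highest-scoring partial sequences
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:num_beams]
    return beams

greedy = beam_search(num_beams=1)[0]
best = beam_search(num_beams=2)[0]
print(greedy[0], round(math.exp(greedy[1]), 2))
print(best[0], round(math.exp(best[1]), 2))
```

Greedy decoding commits to `a` because it is the most probable first token, and ends with a sequence of probability 0.33. Keeping two beams lets the search recover the sequence starting with `b`, whose total probability of 0.36 is higher.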

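HuggingFace integrates many pretrained language models. You can call a specific pretrained model directly through its own interface, but this is relatively involved and requires familiarity with the model and its API. Alternatively, you can use these models as black boxes through the pipeline module: it automatically picks a suitable pretrained model for the specified task, and you can also name a particular model through a parameter. The code below demonstrates the pipeline module on several tasks; the HuggingFace website lists the models that HuggingFace supports.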
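
As a brief sketch of naming a model explicitly, the helper below passes a `model` argument to `pipeline`. `build_sentiment_classifier` is a hypothetical wrapper introduced only for illustration. Its default checkpoint name is the one that the log output later in this section reports as the default for 'sentiment-analysis'.

```python
def build_sentiment_classifier(
        model_name='distilbert-base-uncased-finetuned-sst-2-english'):
    # Imported lazily so that defining this helper does not require
    # transformers to be installed or a model to be downloaded
    from transformers import pipeline
    # Passing model=... selects a checkpoint explicitly instead of the task default
    return pipeline('sentiment-analysis', model=model_name)

# Usage (downloads the checkpoint on first call):
# clf = build_sentiment_classifier()
# print(clf('Haha, today is a nice day!'))
```

Naming the checkpoint makes the result reproducible even if the library later changes the default model for a task.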



The following uses sentiment classification as an example of applying pretrained language models to text classification.

In [6]:
"""
Code source: the GitHub project huggingface/transformers
(Copyright (c) 2020, The HuggingFace Team, Apache-2.0 License; see the appendix)
"""
from transformers import pipeline

clf = pipeline('sentiment-analysis')
print(clf('Haha, today is a nice day!'))

print(clf(['The food is amazing', 'The assignment is weigh too hard',\
           'NLP is so much fun']))

clf = pipeline('zero-shot-classification')
print(clf(sequences=['A helicopter is flying in the sky',\
                     'A bird is flying in the sky'],
   candidate_labels=['animal', 'machine']))

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english...
No model was supplied, defaulted to facebook/bart-large-mnli...

[{'label': 'POSITIVE', 'score': 0.9998708963394165}]
[{'label': 'POSITIVE', 'score': 0.9998835325241089}, {'label': 'NEGATIVE', 'score': 0.9994825124740601}, {'label': 'POSITIVE', 'score': 0.9998630285263062}]
[{'sequence': 'A helicopter is flying in the sky', 'labels': ['machine', 'animal'], 'scores': [0.9938627481460571, 0.006137245334684849]}, {'sequence': 'A bird is flying in the sky', 'labels': ['animal', 'machine'], 'scores': [0.9987970590591431, 0.001202935236506164]}]



The following demonstrates pretrained language models on two text generation tasks.

In [7]:
"""
Code source: the GitHub project huggingface/transformers
(Copyright (c) 2020, The HuggingFace Team, Apache-2.0 License; see the appendix)
"""
generator = pipeline('text-generation')
print(generator('In this course, we will teach you how to'))

unmasker = pipeline('fill-mask')
print(unmasker('This course will teach you all about <mask> models.'))

No model was supplied, defaulted to gpt2...
No model was supplied, defaulted to distilroberta-base...


[{'generated_text': "In this course, we will teach you how to get started with the code of a given app. This way, you will build new apps that fit your needs but are still well behaved and understandable (unless you're already using Swift to understand the language"}]
[{'score': 0.1961982101202011, 'token': 30412, 'token_str': ' mathematical', 'sequence': 'This course will teach you all about mathematical models.'}, {'score': 0.040527306497097015, 'token': 38163, 'token_str': ' computational', 'sequence': 'This course will teach you all about computational models.'}, {'score': 0.033017922192811966, 'token': 27930, 'token_str': ' predictive', 'sequence': 'This course will teach you all about predictive models.'}, {'score': 0.0319414846599102, 'token': 745, 'token_str': ' building', 'sequence': 'This course will teach you all about building models.'}, {'score': 0.024523010477423668, 'token': 3034, 'token_str': ' computer', 'sequence': 'This course will teach you all about computer models



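Given the task "question-answering", pipeline automatically returns the default pretrained question-answering model, "distilbert-base-cased-distilled-squad"; pass in a question and a context to get the answer.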

In [8]:
"""
Code source: the GitHub project huggingface/transformers
(Copyright (c) 2020, The HuggingFace Team, Apache-2.0 License; see the appendix)
"""
question_answerer = pipeline('question-answering')
print(question_answerer(question='Where do I graduate from?', 
    context="I received my bachlor\'s degree at Shanghai "+\
        "Jiao Tong University (SJTU)."))

No model was supplied, defaulted to distilbert-base-cased-distilled-squad...


{'score': 0.7787413597106934, 'start': 34, 'end': 63, 'answer': 'Shanghai Jiao Tong University'}
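
The `start` and `end` fields are character offsets into the context string, so the answer can be recovered by slicing. The short check below uses the values copied from the printed result. It assumes the context contains a space between "Shanghai" and "Jiao", as the printed answer indicates, and it keeps the input's spelling "bachlor" because the offsets depend on it.

```python
context = ("I received my bachlor's degree at Shanghai "
           "Jiao Tong University (SJTU).")
# Values copied from the printed result
result = {'start': 34, 'end': 63, 'answer': 'Shanghai Jiao Tong University'}

# start is inclusive and end is exclusive, as in Python slicing
span = context[result['start']:result['end']]
print(span)
```

Offsets like these are useful when the answer has to be highlighted in the original text rather than just printed.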




Given the task "summarization", pipeline automatically returns the default pretrained model, "sshleifer/distilbart-cnn-12-6"; pass in a passage of text to get its summary.

In [9]:
"""
Code source: the GitHub project huggingface/transformers
(Copyright (c) 2020, The HuggingFace Team, Apache-2.0 License; see the appendix)
"""
summarizer = pipeline('summarization')
print(summarizer(
    """
    The 2022 Winter Olympics (2022年冬季奥林匹克运动会), officially 
    called the XXIV Olympic Winter Games (Chinese: 第二十四届冬季奥
    林匹克运动会; pinyin: Dì Èrshísì Jiè Dōngjì Àolínpǐkè Yùndònghuì) 
    and commonly known as Beijing 2022 (北京2022), was an international 
    winter multi-sport event held from 4 to 20 February 2022 in Beijing, 
    China, and surrounding areas with competition in selected events 
    beginning 2 February 2022.[1] It was the 24th edition of the Winter 
    Olympic Games. Beijing was selected as host city in 2015 at the 
    128th IOC Session in Kuala Lumpur, Malaysia, marking its second 
    time hosting the Olympics, and the last of three consecutive 
    Olympics hosted in East Asia following the 2018 Winter Olympics 
    in Pyeongchang County, South Korea, and the 2020 Summer Olympics 
    in Tokyo, Japan. Having previously hosted the 2008 Summer Olympics, 
    Beijing became the first city to have hosted both the Summer and 
    Winter Olympics. The venues for the Games were concentrated around 
    Beijing, its suburb Yanqing District, and Zhangjiakou, with some 
    events (including the ceremonies and curling) repurposing venues 
    originally built for Beijing 2008 (such as Beijing National 
    Stadium and the Beijing National Aquatics Centre). The Games 
    featured a record 109 events across 15 disciplines, with big air 
    freestyle skiing and women's monobob making their Olympic debuts 
    as medal events, as well as several new mixed competitions. 
    A total of 2,871 athletes representing 91 teams competed in the 
    Games, with Haiti and Saudi Arabia making their Winter Olympic 
    debut. Norway finished at the top of the medal table 
    for the second successive Winter Olympics, winning a total of 37 
    medals, of which 16 were gold, setting a new record for the 
    largest number of gold medals won at a single Winter Olympics. 
    The host nation China finished third with nine gold medals and 
    also eleventh place by total medals won, marking its most 
    successful performance in Winter Olympics history.[4]
    """
))

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6...


[{'summary_text': " The 2022 Winter Olympics was held in Beijing, China, and surrounding areas . It was the 24th edition of the Winter Olympic Games . The Games featured a record 109 events across 15 disciplines, with big air freestyle skiing and women's monobob making their Olympic debuts as medal events . Norway won 37 medals, of which 16 were gold, setting a new record for the largest number of gold medals won at a single Winter Olympics ."}]
