# 1. Fast tokenizers
Earlier we covered how to use a tokenizer to encode text into token IDs and, in the other direction, to decode token IDs back into text.

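Hugging Face actually provides two kinds of tokenizers:

- Slow tokenizers: included in the Transformers library and written in Python;
- Fast tokenizers: provided by the Tokenizers library and written in Rust.

In particular, besides encoding and decoding, fast tokenizers can also track the mapping between the original text and the tokens, which is essential for tasks such as sequence labeling and question answering.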

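Fast tokenizers only show their speed advantage when many texts are processed in parallel; on a single sentence they may even be slower than slow tokenizers.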

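The AutoTokenizer class that we have been recommending not only loads the right tokenizer for a checkpoint automatically, it also picks the fast tokenizer by default. In most cases you should therefore load tokenizers with AutoTokenizer.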

## A closer look at the tokenizer output

In [4]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
example = "Hello world!"
encoding = tokenizer(example)
print(encoding)
print(type(encoding))
print('tokenizer.is_fast:', tokenizer.is_fast)
print('encoding.is_fast:', encoding.is_fast)
print('encoding.tokens():', encoding.tokens())

{'input_ids': [101, 8667, 1362, 106, 102], 'token_type_ids': [0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1]}
<class 'transformers.tokenization_utils_base.BatchEncoding'>
tokenizer.is_fast: True
encoding.is_fast: True
encoding.tokens(): ['[CLS]', 'Hello', 'world', '!', '[SEP]']


In [5]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
example = "My name is Sylvain and I work at Hugging Face in Brooklyn."
encoding = tokenizer(example)
print(encoding.tokens())

['[CLS]', 'My', 'name', 'is', 'S', '##yl', '##va', '##in', 'and', 'I', 'work', 'at', 'Hu', '##gging', 'Face', 'in', 'Brooklyn', '.', '[SEP]']


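## Tracking the mapping
In the example above, the token at index 5 is "##yl". It is one piece of the word "Sylvain", so it should not be treated on its own when mapping back to the original text. The word_ids() method returns, for every token, the index of the word it belongs to (None for special tokens such as [CLS] and [SEP]):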

In [8]:
print(encoding.word_ids())
print(encoding.input_ids)

[None, 0, 1, 2, 3, 3, 3, 3, 4, 5, 6, 7, 8, 8, 9, 10, 11, 12, None]
[101, 1422, 1271, 1110, 156, 7777, 2497, 1394, 1105, 146, 1250, 1120, 20164, 10932, 10289, 1107, 6010, 119, 102]
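
One use of the word indices is to gather subword tokens back into the words they came from. The following is a minimal sketch, assuming the token and word-index lists printed above are hardcoded so that nothing has to be downloaded; `group_tokens_by_word` is a hypothetical helper written for this illustration, not part of Transformers.

```python
from itertools import groupby

# Values copied from the outputs printed above for the bert-base-cased example.
tokens = ['[CLS]', 'My', 'name', 'is', 'S', '##yl', '##va', '##in', 'and', 'I',
          'work', 'at', 'Hu', '##gging', 'Face', 'in', 'Brooklyn', '.', '[SEP]']
word_ids = [None, 0, 1, 2, 3, 3, 3, 3, 4, 5, 6, 7, 8, 8, 9, 10, 11, 12, None]


def group_tokens_by_word(tokens, word_ids):
    """Return {word index: [tokens]}; special tokens (word id None) are skipped."""
    pairs = [(w, t) for w, t in zip(word_ids, tokens) if w is not None]
    # Word ids are non-decreasing, so consecutive grouping is enough.
    return {w: [t for _, t in grp] for w, grp in groupby(pairs, key=lambda p: p[0])}


words = group_tokens_by_word(tokens, word_ids)
# Keep only the words that were split into several tokens.
multi_token_words = {w: toks for w, toks in words.items() if len(toks) > 1}
print(multi_token_words)
```

With these inputs, only word 3 ("Sylvain") and word 8 ("Hugging") are split into more than one token.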


In [13]:
# Word/token -> text: word_to_chars() and token_to_chars() return the start and end
# character offsets of a word/token in the original text.
token_index = 5
print('the 5th token is:', encoding.tokens()[token_index])
start, end = encoding.token_to_chars(token_index)
print('start and end of the token is:', start, end, '(end exclusive)')
print('example is:', example)
print('corresponding text span is:', example[start:end])
word_index = encoding.word_ids()[token_index] # 3
print('the word index is:', word_index)
start, end = encoding.word_to_chars(word_index)
print('start and end of the word is:', start, end, '(end exclusive)')
print('corresponding word span is:', example[start:end])

the 5th token is: ##yl
start and end of the token is: 12 14 (end exclusive)
example is: My name is Sylvain and I work at Hugging Face in Brooklyn.
corresponding text span is: yl
the word index is: 3
start and end of the word is: 11 18 (end exclusive)
corresponding word span is: Sylvain


In [14]:
# Word <-> token: the earlier example used word_ids() to get the word index of every token.
# A single token or word can also be mapped directly by index, using token_to_word() and word_to_tokens().
token_index = 5
print('the 5th token is:', encoding.tokens()[token_index])
corresp_word_index = encoding.token_to_word(token_index)
print('corresponding word index is:', corresp_word_index)
start, end = encoding.word_to_chars(corresp_word_index)
print('the word is:', example[start:end])
start, end = encoding.word_to_tokens(corresp_word_index)
print('corresponding tokens are:', encoding.tokens()[start:end])

the 5th token is: ##yl
corresponding word index is: 3
the word is: Sylvain
corresponding tokens are: ['S', '##yl', '##va', '##in']


In [25]:
# Text -> word/token: use the char_to_word() and char_to_token() methods:
chars = 'My name is Sylvain'
print('characters of "{}" are: {}'.format(chars, list(chars)))
print('corresponding word index: ')
for i, c in enumerate(chars):
    print('"{}": {} '.format(c, encoding.char_to_word(i)), end="")
print("\nword_ids:",encoding.word_ids())
print('\ncorresponding token index: ')
for i, c in enumerate(chars):
    print('"{}": {} '.format(c, encoding.char_to_token(i)), end="")

characters of "My name is Sylvain" are: ['M', 'y', ' ', 'n', 'a', 'm', 'e', ' ', 'i', 's', ' ', 'S', 'y', 'l', 'v', 'a', 'i', 'n']
corresponding word index: 
"M": 0 "y": 0 " ": None "n": 1 "a": 1 "m": 1 "e": 1 " ": None "i": 2 "s": 2 " ": None "S": 3 "y": 3 "l": 3 "v": 3 "a": 3 "i": 3 "n": 3 
word_ids: [None, 0, 1, 2, 3, 3, 3, 3, 4, 5, 6, 7, 8, 8, 9, 10, 11, 12, None]

corresponding token index: 
"M": 1 "y": 1 " ": None "n": 2 "a": 2 "m": 2 "e": 2 " ": None "i": 3 "s": 3 " ": None "S": 4 "y": 5 "l": 5 "v": 6 "a": 6 "i": 7 "n": 7 

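# 2. Sequence labeling (token classification)

## Pipeline output
As covered earlier, the NER pipeline wraps three steps:

1. encode the text;
2. feed the inputs to the model;
3. post-process the model output.

The first two steps are the same in every pipeline; only the third, post-processing the model output, depends on the task. By default the token classification pipeline loads the dbmdz/bert-large-cased-finetuned-conll03-english NER model. Printing its output directly gives: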

In [1]:
from transformers import pipeline

token_classifier = pipeline("token-classification")
results = token_classifier("My name is Sylvain and I work at Hugging Face in Brooklyn.")
print(results)

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'entity': 'I-PER', 'score': 0.99938285, 'index': 4, 'word': 'S', 'start': 11, 'end': 12}, {'entity': 'I-PER', 'score': 0.99815494, 'index': 5, 'word': '##yl', 'start': 12, 'end': 14}, {'entity': 'I-PER', 'score': 0.99590707, 'index': 6, 'word': '##va', 'start': 14, 'end': 16}, {'entity': 'I-PER', 'score': 0.99923277, 'index': 7, 'word': '##in', 'start': 16, 'end': 18}, {'entity': 'I-ORG', 'score': 0.9738931, 'index': 12, 'word': 'Hu', 'start': 33, 'end': 35}, {'entity': 'I-ORG', 'score': 0.976115, 'index': 13, 'word': '##gging', 'start': 35, 'end': 40}, {'entity': 'I-ORG', 'score': 0.9887976, 'index': 14, 'word': 'Face', 'start': 41, 'end': 45}, {'entity': 'I-LOC', 'score': 0.9932106, 'index': 16, 'word': 'Brooklyn', 'start': 49, 'end': 57}]


In [6]:
from transformers import pipeline
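# The NER pipeline offers several strategies for combining tokens into entities, selected with the aggregation_strategy parameter:

# none: the default; tokens are not grouped, as in the output of the previous cell;
# simple: the entity score is the mean of the scores of all its tokens, e.g. the score of "Sylvain" is the mean of the scores of "S", "##yl", "##va" and "##in";
# first: the entity score is the score of its first token, e.g. the score of "Sylvain" is the score of the token "S";
# max: the entity score is the highest score among its tokens;
# average: the entity score is the mean of the scores of its words (not tokens), e.g. "Hugging Face" gets the mean of "Hugging" (0.975) and "Face" (0.98879), i.e. 0.9819, whereas the simple strategy gives 0.9796.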
token_classifier = pipeline("token-classification", aggregation_strategy="simple")
results = token_classifier("My name is Sylvain and I work at Hugging Face in Brooklyn.")
print(results)

[{'entity_group': 'PER', 'score': 0.9981694, 'word': 'Sylvain', 'start': 11, 'end': 18}, {'entity_group': 'ORG', 'score': 0.9796019, 'word': 'Hugging Face', 'start': 33, 'end': 45}, {'entity_group': 'LOC', 'score': 0.9932106, 'word': 'Brooklyn', 'start': 49, 'end': 57}]
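
The scores quoted for "Hugging Face" can be reproduced by hand. The sketch below follows the strategy descriptions given in the comments of the cell above, not the pipeline's internal implementation; the token scores are copied from the ungrouped pipeline output shown earlier, and the variable names are made up for this illustration.

```python
from statistics import mean

# Token-level scores for "Hugging Face", copied from the ungrouped pipeline output.
# Each list holds the scores of the tokens that make up one word.
word_token_scores = {
    "Hugging": [0.9738931, 0.976115],  # tokens 'Hu', '##gging'
    "Face": [0.9887976],               # token 'Face'
}

# Flatten to one score per token, in order.
all_token_scores = [s for scores in word_token_scores.values() for s in scores]

simple_score = mean(all_token_scores)   # mean over the three tokens
first_score = all_token_scores[0]       # score of the first token, 'Hu'
max_score = max(all_token_scores)       # highest token score
# Mean over words, where each word's score is the mean of its tokens.
average_score = mean(mean(scores) for scores in word_token_scores.values())

print(f"simple:  {simple_score:.4f}")
print(f"first:   {first_score:.4f}")
print(f"max:     {max_score:.4f}")
print(f"average: {average_score:.4f}")
```

The simple and average values differ because "Hugging" contributes two tokens to the token-level mean but only one entry to the word-level mean.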


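## Reproducing the pipeline output from the model
Here we build a token classification model with the AutoModelForTokenClassification class and post-process its output by hand to obtain the same results as the pipeline.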

In [7]:
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_checkpoint = "dbmdz/bert-large-cased-finetuned-conll03-english"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForTokenClassification.from_pretrained(model_checkpoint)

example = "My name is Sylvain and I work at Hugging Face in Brooklyn."
inputs = tokenizer(example, return_tensors="pt")
outputs = model(**inputs)
print("inputs:",inputs)
print(inputs["input_ids"].shape)
print("outputs:",outputs)
# The input is a sequence of 19 tokens and the logits have shape 1 x 19 x 9:
# for every token the model outputs a vector of 9 logits (9-way classification).
print(outputs.logits.shape)


inputs: {'input_ids': tensor([[  101,  1422,  1271,  1110,   156,  7777,  2497,  1394,  1105,   146,
          1250,  1120, 20164, 10932, 10289,  1107,  6010,   119,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
torch.Size([1, 19])
outputs: TokenClassifierOutput(loss=None, logits=tensor([[[ 8.7508, -2.2626, -1.5300, -2.2889, -0.6513, -2.0016, -0.0112,
          -2.0860,  0.3335],
         [ 8.4973, -2.3986, -1.3582, -2.7887,  0.7575, -1.8873,  0.4344,
          -1.9900, -0.3397],
         [ 9.4719, -2.2261, -0.9849, -2.6116,  0.1219, -2.0627, -0.1259,
          -1.8758, -0.0609],
         [ 9.8670, -2.2175, -1.3125, -2.4866, -0.2550, -1.8536,  0.0856,
          -1.7520, -0.6437],
         [-0.2011, -2.1873, -1.5316, -2.7110,  8.4025, -2.4168, -0.6980,
          -3.0337, -0.0997],
         [ 0.1065, -2.0520, -1.4787, -2.8139,  7.4525, -2.8399, -0.0626

In [8]:
print(model.config.id2label)

{0: 'O', 1: 'B-MISC', 2: 'I-MISC', 3: 'B-PER', 4: 'I-PER', 5: 'B-ORG', 6: 'I-ORG', 7: 'B-LOC', 8: 'I-LOC'}


In [13]:
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_checkpoint = "dbmdz/bert-large-cased-finetuned-conll03-english"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForTokenClassification.from_pretrained(model_checkpoint)

example = "My name is Sylvain and I work at Hugging Face in Brooklyn."
inputs = tokenizer(example, return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits)
import torch

probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)[0].tolist()
# print(probabilities)
predictions = outputs.logits.argmax(dim=-1)[0].tolist()
#{0: 'O', 1: 'B-MISC', 2: 'I-MISC', 3: 'B-PER', 4: 'I-PER', 5: 'B-ORG', 6: 'I-ORG', 7: 'B-LOC', 8: 'I-LOC'}
print(predictions)

results = []
tokens = inputs.tokens()

for idx, pred in enumerate(predictions):
    label = model.config.id2label[pred]
    if label != "O":
        results.append(
            {"entity": label, "score": probabilities[idx][pred], "word": tokens[idx]}
        )

print(results)
# 0.99938285

tensor([[[ 8.7508, -2.2626, -1.5300, -2.2889, -0.6513, -2.0016, -0.0112,
          -2.0860,  0.3335],
         [ 8.4973, -2.3986, -1.3582, -2.7887,  0.7575, -1.8873,  0.4344,
          -1.9900, -0.3397],
         [ 9.4719, -2.2261, -0.9849, -2.6116,  0.1219, -2.0627, -0.1259,
          -1.8758, -0.0609],
         [ 9.8670, -2.2175, -1.3125, -2.4866, -0.2550, -1.8536,  0.0856,
          -1.7520, -0.6437],
         [-0.2011, -2.1873, -1.5316, -2.7110,  8.4025, -2.4168, -0.6980,
          -3.0337, -0.0997],
         [ 0.1065, -2.0520, -1.4787, -2.8139,  7.4525, -2.8399, -0.0626,
          -3.3666, -0.4683],
         [ 0.5985, -2.2538, -1.1926, -3.0111,  7.0070, -2.8675,  0.3492,
          -3.3129, -0.2878],
         [-0.0584, -2.2660, -1.4335, -3.1940,  8.3225, -2.6212, -0.0348,
          -2.9780, -0.2957],
         [ 9.6889, -2.4281, -1.5653, -2.5225, -0.9693, -1.5668,  0.4285,
          -1.9413, -0.6774],
         [ 9.0116, -2.1216, -1.4140, -2.6964,  0.2728, -1.7851,  0.3635,
         

In [18]:
import torch

# Assume a tensor of shape (sequence_length, num_classes); here sequence_length=4 and num_classes=4
logits = torch.tensor([
    [0.1, 0.9, 0.05, 0.01],  # Token 1: class B is most likely
    [0.3, 0.5, 0.2, 0.01],  # Token 2: class B is most likely, class A is second
    [0.05, 0.05, 0.85, 0.05],  # Token 3: class C is most likely
    [0.01, 0.01, 0.01, 0.97]  # Token 4: the default class O is most likely
])
# dim=-1 takes the argmax over the last (class) dimension
predicted_classes_indices = logits.argmax(dim=-1)
print("predicted class indices:", predicted_classes_indices)
# Convert the indices to class labels, assuming the following
# mapping from class ID to label:

# index 0 -> "A"
# index 1 -> "B"
# index 2 -> "C"
# index 3 -> "O" (default class, not an entity)

# Assumed mapping from class ID to label
class_labels = {0: "A", 1: "B", 2: "C", 3: "O"}

# Map each index to its class label
predicted_class_labels = [class_labels[i.item()] for i in predicted_classes_indices]
print("predicted class labels:", predicted_class_labels)

predicted class indices: tensor([1, 1, 2, 3])
predicted class labels: ['B', 'B', 'C', 'O']


In [17]:
inputs_with_offsets = tokenizer(example, return_offsets_mapping=True)
# print(inputs_with_offsets)
offset_mapping = inputs_with_offsets["offset_mapping"]
print(offset_mapping)

[(0, 0), (0, 2), (3, 7), (8, 10), (11, 12), (12, 14), (14, 16), (16, 18), (19, 22), (23, 24), (25, 29), (30, 32), (33, 35), (35, 40), (41, 45), (46, 48), (49, 57), (57, 58), (0, 0)]


In [20]:
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_checkpoint = "dbmdz/bert-large-cased-finetuned-conll03-english"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForTokenClassification.from_pretrained(model_checkpoint)

example = "My name is Sylvain and I work at Hugging Face in Brooklyn."
inputs = tokenizer(example, return_tensors="pt")
outputs = model(**inputs)

import torch

probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)[0].tolist()
print(probabilities)
predictions = outputs.logits.argmax(dim=-1)[0].tolist()

results = []
inputs_with_offsets = tokenizer(example, return_offsets_mapping=True)
tokens = inputs_with_offsets.tokens()
offsets = inputs_with_offsets["offset_mapping"]

for idx, pred in enumerate(predictions):
    label = model.config.id2label[pred]
    if label != "O":
        start, end = offsets[idx]
        results.append(
            {
                "entity": label,
                "score": probabilities[idx][pred],
                "word": tokens[idx],
                "start": start,
                "end": end,
            }
        )

print(results)

[[0.9994322657585144, 1.6470299669890665e-05, 3.426703187869862e-05, 1.6042311472119763e-05, 8.250691462308168e-05, 2.1382315026130527e-05, 0.00015649104898329824, 1.9652115952339955e-05, 0.0002208924270235002], [0.9989632368087769, 1.851575871114619e-05, 5.240453174337745e-05, 1.2534721463453025e-05, 0.0004347366630099714, 3.087430013692938e-05, 0.0003146875824313611, 2.7860676709678955e-05, 0.00014510867185890675], [0.9997084736824036, 8.308127689815592e-06, 2.8745616873493418e-05, 5.650359071296407e-06, 8.694857388036326e-05, 9.783477253222372e-06, 6.786145968362689e-05, 1.1793993508035783e-05, 7.241901766974479e-05], [0.9998351335525513, 5.6455264711985365e-06, 1.3955141184851527e-05, 4.3133736653544474e-06, 4.0176873881137e-05, 8.123063707898837e-06, 5.648486330755986e-05, 8.991626600618474e-06, 2.7239089831709862e-05], [0.0001833340502344072, 2.515659434720874e-05, 4.8462032282259315e-05, 1.4900553651386872e-05, 0.9993828535079956, 1.999772030103486e-05, 0.00011153621017001569, 1

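## Grouping entities
Taking the simple strategy introduced above as an example, we merge consecutive tokens labeled "I-XXX" (or a "B-XXX" token followed by several "I-XXX" tokens) into one entity, until we reach one of the following:

- "O": the token is not part of an entity;
- "B-XXX", "I-YYY" or "B-YYY": a new entity begins.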
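
The stopping rule above can be tried without loading a model. The following is a minimal sketch, assuming the per-token labels, scores and character offsets are hardcoded from the outputs shown earlier in this section; `group_entities` is a hypothetical helper written for this illustration, not the pipeline's own function.

```python
from statistics import mean

example = "My name is Sylvain and I work at Hugging Face in Brooklyn."

# One label per token ([CLS] and [SEP] included), matching the token-level output above.
labels = ["O", "O", "O", "O", "I-PER", "I-PER", "I-PER", "I-PER", "O", "O", "O", "O",
          "I-ORG", "I-ORG", "I-ORG", "O", "I-LOC", "O", "O"]
# Scores of the non-"O" tokens, keyed by token index (copied from the pipeline output).
scores = {4: 0.99938285, 5: 0.99815494, 6: 0.99590707, 7: 0.99923277,
          12: 0.9738931, 13: 0.976115, 14: 0.9887976, 16: 0.9932106}
# Character offsets per token, copied from the offset_mapping printed earlier.
offsets = [(0, 0), (0, 2), (3, 7), (8, 10), (11, 12), (12, 14), (14, 16), (16, 18),
           (19, 22), (23, 24), (25, 29), (30, 32), (33, 35), (35, 40), (41, 45),
           (46, 48), (49, 57), (57, 58), (0, 0)]


def group_entities(labels, scores, offsets, text):
    """Merge runs of tokens that belong to the same entity (simple strategy)."""
    results = []
    idx = 0
    while idx < len(labels):
        label = labels[idx]
        if label != "O":
            entity_type = label[2:]  # drop the "B-" or "I-" prefix
            start, end = offsets[idx]
            span_scores = [scores[idx]]
            # Extend the span while the next token continues the same entity type.
            while idx + 1 < len(labels) and labels[idx + 1] == f"I-{entity_type}":
                idx += 1
                span_scores.append(scores[idx])
                end = offsets[idx][1]
            results.append({"entity_group": entity_type, "score": mean(span_scores),
                            "word": text[start:end], "start": start, "end": end})
        idx += 1
    return results


entities = group_entities(labels, scores, offsets, example)
for entity in entities:
    print(entity)
```

Slicing the original text with the first token's start and the last token's end recovers "Hugging Face" with its space, which joining the tokens themselves would not.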

In [21]:
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_checkpoint = "dbmdz/bert-large-cased-finetuned-conll03-english"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForTokenClassification.from_pretrained(model_checkpoint)

example = "My name is Sylvain and I work at Hugging Face in Brooklyn."
inputs = tokenizer(example, return_tensors="pt")
outputs = model(**inputs)

import torch

probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)[0].tolist()
predictions = outputs.logits.argmax(dim=-1)[0].tolist()

import numpy as np

results = []
inputs_with_offsets = tokenizer(example, return_offsets_mapping=True)
tokens = inputs_with_offsets.tokens()
offsets = inputs_with_offsets["offset_mapping"]

idx = 0
while idx < len(predictions):
    pred = predictions[idx]
    label = model.config.id2label[pred]
    if label != "O":
        label = label[2:] # Remove the B- or I-
        start, end = offsets[idx]
        all_scores = [probabilities[idx][pred]]
        # Grab all the tokens labeled with I-label
        while (
            idx + 1 < len(predictions)
            and model.config.id2label[predictions[idx + 1]] == f"I-{label}"
        ):
            all_scores.append(probabilities[idx + 1][predictions[idx + 1]])
            _, end = offsets[idx + 1]
            idx += 1

        score = np.mean(all_scores).item()
        word = example[start:end]
        results.append(
            {
                "entity_group": label,
                "score": score,
                "word": word,
                "start": start,
                "end": end,
            }
        )
    idx += 1

print(results)

[{'entity_group': 'PER', 'score': 0.9981694370508194, 'word': 'Sylvain', 'start': 11, 'end': 18}, {'entity_group': 'ORG', 'score': 0.9796018997828165, 'word': 'Hugging Face', 'start': 33, 'end': 45}, {'entity_group': 'LOC', 'score': 0.9932106137275696, 'word': 'Brooklyn', 'start': 49, 'end': 57}]


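# 3. Extractive question answering
Besides sequence labeling, extractive question answering is another task that relies on the advanced features of fast tokenizers. As in NER, the model has to mark (extract) a span of the original text, here the answer to the question.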

In [22]:
from transformers import pipeline

question_answerer = pipeline("question-answering")
context = """
Transformers is backed by the three most popular deep learning libraries — Jax, PyTorch, and TensorFlow — with a seamless integration between them. It's straightforward to train your models with one before loading them for inference with the other.
"""
question = "Which deep learning libraries back Transformers?"
results = question_answerer(question=question, context=context)
print(results)

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'score': 0.9741135835647583, 'start': 76, 'end': 104, 'answer': 'Jax, PyTorch, and TensorFlow'}


In [24]:
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

model_checkpoint = "distilbert-base-cased-distilled-squad"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)

context = """
Transformers is backed by the three most popular deep learning libraries — Jax, PyTorch, and TensorFlow — with a seamless integration between them. It's straightforward to train your models with one before loading them for inference with the other.
"""
question = "Which deep learning libraries back Transformers?"

inputs = tokenizer(question, context, return_tensors="pt")
outputs = model(**inputs)

print(inputs["input_ids"].shape)

start_logits = outputs.start_logits
end_logits = outputs.end_logits
print(start_logits)
print(end_logits)
print(start_logits.shape, end_logits.shape)

torch.Size([1, 65])
tensor([[-3.9582, -5.9036, -3.9443, -6.2181, -6.4083, -7.1622, -6.0465, -5.1919,
         -4.0218,  1.1040, -3.9652, -1.5413, -2.2242,  3.1515,  6.2945, -0.4716,
         -1.4831, -0.5067, -4.3221, -1.6551,  2.5779, 10.9044, -0.6544,  2.6956,
         -1.1208, -1.6860, -3.7357, -1.6676, -1.5421, -1.8649,  2.0896, -1.2079,
         -0.7890,  0.0215, -1.3682, -3.5892, -4.3107, -3.8289, -7.1438, -5.9742,
         -3.7412, -5.6779, -4.2294, -4.4258, -2.2509, -6.1912, -7.2860, -3.6947,
         -6.6102, -3.8975, -3.4443, -2.6780, -7.3615, -4.1177, -6.7804, -4.3929,
         -6.6827, -7.4341, -5.9426, -6.6557, -8.2156, -6.9574, -6.2020, -6.1046,
         -4.0218]], grad_fn=<CloneBackward0>)
tensor([[-1.7854, -6.2361, -6.2518, -5.3445, -5.2671, -8.1038, -5.0321, -5.9211,
         -3.2730, -0.7021, -6.1406, -4.3293, -5.8735, -4.2517,  4.9747, -3.4800,
          0.0339, -3.4037, -1.4726,  2.4518, -0.8068,  2.2278,  0.7126, -0.5041,
          0.2587, -0.3865, -0.6514,  4.5269

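In the Transformers library, the sequence_ids() method of the encoding returned by a fast tokenizer tells you which input sequence each token comes from.
This is what lets us tell the question, the context, and special tokens such as [CLS] and [SEP] apart when the input is a pair of texts, as in question answering.

Specifically, sequence_ids() returns a list with one element per token, and each element indicates which part of the input the token belongs to:

- None: the token is a special token added by the tokenizer, such as [CLS] or [SEP].
- 0: the token belongs to the first sequence passed to the tokenizer; with tokenizer(question, context) this is the question.
- 1: the token belongs to the second sequence; with tokenizer(question, context) this is the context.

For the input used here, the sequence starts with [CLS], followed by the question tokens, a [SEP] token, the context tokens, and a final [SEP]. sequence_ids() separates these parts, which is useful for later processing such as masking.

In the code that follows, sequence_ids = inputs.sequence_ids() is used to build a mask that covers every token outside the context, except [CLS], which is left unmasked. When the start and end probabilities are computed, the masked positions receive a very low logit, so the predicted answer span can only start and end inside the context (or at [CLS]).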
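
The effect of this masking can be seen on a toy input. The following is a minimal sketch in plain Python, assuming a made-up eight-token input and made-up logits; it stands in for the tensor operations and is not the library's code.

```python
import math

# Hypothetical toy input laid out as: [CLS] q1 q2 [SEP] c1 c2 c3 [SEP]
sequence_ids = [None, 0, 0, None, 1, 1, 1, None]
start_logits = [0.5, 4.0, 1.0, 0.2, 2.0, 3.0, 1.0, 0.1]  # made-up values

# True means "mask this position": everything that is not a context token.
mask = [sid != 1 for sid in sequence_ids]
mask[0] = False  # leave the [CLS] token unmasked

# Replace masked logits with a very low value so softmax gives them ~0 probability.
masked_logits = [-10000.0 if m else x for x, m in zip(start_logits, mask)]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax(masked_logits)
best_start = max(range(len(probs)), key=probs.__getitem__)
print("mask:", mask)
print("best start position:", best_start)
```

Without the mask, the highest logit (4.0) sits on a question token at position 1; with it, the best start position moves to the context token at position 5.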

In [39]:
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

model_checkpoint = "distilbert-base-cased-distilled-squad"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)

context = """
Transformers is backed by the three most popular deep learning libraries — Jax, PyTorch, and TensorFlow — with a seamless integration between them. It's straightforward to train your models with one before loading them for inference with the other.
"""
question = "Which deep learning libraries back Transformers?"

inputs = tokenizer(question, context, return_tensors="pt")
outputs = model(**inputs)
# Extract start_logits and end_logits from the model output: for each token position, they score how likely it is to be the start or the end of the answer.
start_logits = outputs.start_logits
end_logits = outputs.end_logits
# print(start_logits)
# print(end_logits)
import torch
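# Build and apply the mask: the answer must come from the context, so inputs.sequence_ids() is used to
# build a boolean mask that is True for every token that is not part of the context (sequence id != 1).
# The [CLS] token is then unmasked (set to False), because some models use it to signal that
# the context contains no answer. Finally, the logits at masked positions are set to a very low value (-10000) so that they cannot be selected as the answer.
sequence_ids = inputs.sequence_ids()
# print("sequence_ids:",sequence_ids)
mask = [i != 1 for i in sequence_ids]
# print("mask before unmasking [CLS]:",mask)
mask[0] = False # Unmask the [CLS] token
# print(mask)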
mask = torch.tensor(mask)[None]

start_logits[mask] = -10000
end_logits[mask] = -10000
# print(start_logits)
# print(end_logits)
start_probabilities = torch.nn.functional.softmax(start_logits, dim=-1)[0]
end_probabilities = torch.nn.functional.softmax(end_logits, dim=-1)[0]

# print(start_probabilities)
# print(end_probabilities)

In [40]:
scores = start_probabilities[:, None] * end_probabilities[None, :]
scores = torch.triu(scores)
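
To see what these two lines compute, here is a minimal sketch on made-up probabilities over four token positions. It uses NumPy as a stand-in for the torch tensors (an assumption made so the example stays lightweight); `np.triu` plays the role of `torch.triu`.

```python
import numpy as np

# Made-up probabilities over 4 token positions (each vector sums to 1).
start_probabilities = np.array([0.1, 0.2, 0.6, 0.1])
end_probabilities = np.array([0.5, 0.1, 0.1, 0.3])

# scores[i, j] = P(start = i) * P(end = j), via broadcasting.
scores = start_probabilities[:, None] * end_probabilities[None, :]
# Keep only the upper triangle, i.e. spans whose end is not before their start.
valid_scores = np.triu(scores)

# Best (start, end) pair before and after removing invalid spans.
unconstrained = np.unravel_index(scores.argmax(), scores.shape)
start_index, end_index = np.unravel_index(valid_scores.argmax(), valid_scores.shape)

print("best pair without the constraint:", tuple(int(i) for i in unconstrained))
print("best valid span:", (int(start_index), int(end_index)))
```

Here the highest raw product pairs start 2 with end 0, which is not a valid span; zeroing the lower triangle leaves (2, 3) as the best candidate.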