# Pipelines

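Common NLP tasks:
- Text classification: for example, sentiment analysis, or judging the relationship between a pair of sentences
- Classifying the words in a text: for example, part-of-speech (POS) tagging and named entity recognition (NER)
- Text generation: for example, completing a predefined prompt, or predicting the masked words in a text
- Extracting an answer from a text: for example, given a question, extracting the matching answer from a passage
- Generating a new sentence from an input text: for example, machine translation and automatic summarization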

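The most basic object in the Transformers library is the `pipeline()` function, which wraps a pretrained model together with its preprocessing and postprocessing steps. You pass in text and get back the expected answer.
Commonly used pipelines include:
- feature-extraction (get a vector representation of a text)
- fill-mask (fill in masked words or spans)
- ner (named entity recognition)
- question-answering (automatic question answering)
- sentiment-analysis (sentiment analysis)
- summarization (automatic summarization)
- text-generation (text generation)
- translation (machine translation)
- zero-shot-classification (classification without labeled training examples)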


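## Sentiment analysis
With the sentiment analysis pipeline, we only need to pass in text to get its sentiment label (positive/negative) and the corresponding probability: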

In [3]:
from transformers import pipeline

classifier = pipeline('sentiment-analysis')
result = classifier('I like you')
print(result)
results = classifier(
    ['I like you', 'I hate you']
)
print(results)

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use mps:0


[{'label': 'POSITIVE', 'score': 0.9998695850372314}]
[{'label': 'POSITIVE', 'score': 0.9998695850372314}, {'label': 'NEGATIVE', 'score': 0.9791242480278015}]


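The pipeline automatically performs three steps:
1. Preprocess the text into a format the model can understand
2. Feed the preprocessed text into the model
3. Postprocess the model's predictions and output them in a format humans can understand

The pipeline automatically picks a suitable pretrained model for the task. For sentiment analysis, the default is `distilbert-base-uncased-finetuned-sst-2-english`, a model fine-tuned for English sentiment classification.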

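## Zero-shot classification
The zero-shot classification pipeline lets us define our own classification labels without providing any labeled data.
Here the pipeline automatically selected the pretrained `facebook/bart-large-mnli` model.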

In [5]:
from transformers import pipeline

classifier = pipeline('zero-shot-classification')
result = classifier(
    'This is a course about the Transformers library.',
    candidate_labels = ['education', 'politics', 'business']
)
print(result)

No model was supplied, defaulted to facebook/bart-large-mnli and revision d7645e1 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use mps:0


{'sequence': 'This is a course about the Transformers library.', 'labels': ['education', 'business', 'politics'], 'scores': [0.8719860911369324, 0.09406667947769165, 0.033947158604860306]}
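
The result holds two parallel lists, `labels` and `scores`, which in the output above run from the highest score to the lowest. A minimal sketch of reading it, using the printed dictionary as a literal (the names `ranked` and `best_label` are illustrative, not part of the library):

```python
# Result copied from the cell output above.
result = {
    'sequence': 'This is a course about the Transformers library.',
    'labels': ['education', 'business', 'politics'],
    'scores': [0.8719860911369324, 0.09406667947769165, 0.033947158604860306],
}

# The two lists are aligned by position, so zip pairs each label with its score.
ranked = list(zip(result['labels'], result['scores']))
best_label, best_score = max(ranked, key=lambda pair: pair[1])

for label, score in ranked:
    print(f'{label:<10} {score:.3f}')
print('predicted:', best_label)
```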


## Text generation
We first build a prompt for the task at hand, then feed it to the model to generate the text that follows.

In [6]:
from transformers import pipeline

generator = pipeline('text-generation')
results = generator('In this course, we will teach you how to')
print(results)
results = generator(
    'In this course, we will teach you how to',
    num_return_sequences=2,
    max_length=50,
)
print(results)

No model was supplied, defaulted to openai-community/gpt2 and revision 607a30d (https://huggingface.co/openai-community/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.


Device set to use mps:0
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this course, we will teach you how to implement multi-instance polymorphism, i.e., adding non-recursive data expressions to any type, using a polymorphism of a class-level data-type. This means that you are'}]
[{'generated_text': 'In this course, we will teach you how to make your own bread (i.e. raw and dry, using only the freshest ingredients). However, it is also important to ensure you are using only materials in your recipe, because most homemade'}, {'generated_text': 'In this course, we will teach you how to build fast virtual machines that run using an emulator emulator at 3.6×3.9 MB/sec or better. Our virtual machines will be run with Python 3.6.2, which is'}]
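
Each `generated_text` contains the prompt followed by the model's continuation. A small sketch of separating the two, using the first output above as a literal. The two calls above returned different text for the same prompt, so a rerun will most likely print something else:

```python
prompt = 'In this course, we will teach you how to'

# First result copied from the cell output above.
results = [{'generated_text': 'In this course, we will teach you how to implement '
            'multi-instance polymorphism, i.e., adding non-recursive data expressions '
            'to any type, using a polymorphism of a class-level data-type. '
            'This means that you are'}]

for item in results:
    text = item['generated_text']
    # Strip the prompt only if the text really starts with it.
    continuation = text[len(prompt):] if text.startswith(prompt) else text
    print(repr(continuation))
```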


The pipeline picked the pretrained `gpt2` model for this task. We can also specify which model to use; below we use `distilgpt2`.

In [1]:
from transformers import pipeline

generator = pipeline('text-generation', model='distilgpt2')
results = generator(
    'In this course, we will teach you how to',
    max_length=50,
    num_return_sequences=2,
)
print(results)

Device set to use mps:0
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this course, we will teach you how to create the skills for self-awareness, self-management and self-healing into the lessons of a successful teacher. I hope you enjoy this course – read my introductory article on the psychology of self'}, {'generated_text': 'In this course, we will teach you how to apply to the class, as well as learn how to use that as part of your online course design.\n\nThis course will take you through the first three principles of online project management. How you'}]


In [8]:
# Alternatively, use `uer/gpt2-chinese-poem`, a model trained specifically to generate classical Chinese poetry
from transformers import pipeline

generator = pipeline('text-generation', model='uer/gpt2-chinese-poem')
results = generator('[CLS] 日 照 香 炉 生 紫 烟 ，', max_length=50, num_return_sequences=2)
print(results)

Device set to use mps:0
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


[{'generated_text': '[CLS] 日 照 香 炉 生 紫 烟 ， 以 一 炉 峰 。 天 风 吹 灵 药 金 炉 ， 烟 清 佛 见 青 莲 蕊 。 岩 顶 雪 寒 石 乳 流 ， 山 中 云 暖 长 松 寿 。 不 因'}, {'generated_text': '[CLS] 日 照 香 炉 生 紫 烟 ， 曰 心 独 止 。 我 入 冥 行 莫 朝 ， 无 生 不 灭 将 依 理 。 不 至 此 何 空 焉 。 云 霄 难 越 与 么 共 ， 人 间'}]


## Mask filling
Given a text in which some words have been masked out, a pretrained model predicts the words that can fill those positions.

In [10]:
from transformers import pipeline

unmask = pipeline('fill-mask')
results = unmask('This course will teach you all about <mask> models.', top_k=2)
print(results)

No model was supplied, defaulted to distilbert/distilroberta-base and revision fb53ab8 (https://huggingface.co/distilbert/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.


Some weights of the model checkpoint at distilbert/distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Device set to use mps:0


[{'score': 0.19198693335056305, 'token': 30412, 'token_str': ' mathematical', 'sequence': 'This course will teach you all about mathematical models.'}, {'score': 0.04209252446889877, 'token': 38163, 'token_str': ' computational', 'sequence': 'This course will teach you all about computational models.'}]


## Named entity recognition
The named entity recognition (NER) pipeline extracts entities of specified types from a text.

In [11]:
from transformers import pipeline

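# aggregation_strategy='simple' merges sub-tokens into whole entities; it replaces the deprecated grouped_entities=True
ner = pipeline('ner', aggregation_strategy='simple')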
results = ner('My name is Chris and I work at Hangzhou JiyiTech in Hangzhou')
print(results)

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision 4c53496 (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Device set to use mps:0


[{'entity_group': 'PER', 'score': np.float32(0.99943787), 'word': 'Chris', 'start': 11, 'end': 16}, {'entity_group': 'ORG', 'score': np.float32(0.9967356), 'word': 'Hangzhou JiyiTech', 'start': 31, 'end': 48}, {'entity_group': 'LOC', 'score': np.float32(0.9977745), 'word': 'Hangzhou', 'start': 52, 'end': 60}]
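
The `start` and `end` fields are character offsets into the input string, and `entity_group` is the entity type (PER for person, ORG for organization, LOC for location). A minimal sketch of using them, with the entities above copied in as literals and the scores written as plain floats (the `by_type` grouping is only an illustration):

```python
text = 'My name is Chris and I work at Hangzhou JiyiTech in Hangzhou'

# Entities copied from the cell output above.
entities = [
    {'entity_group': 'PER', 'score': 0.99943787, 'word': 'Chris', 'start': 11, 'end': 16},
    {'entity_group': 'ORG', 'score': 0.9967356, 'word': 'Hangzhou JiyiTech', 'start': 31, 'end': 48},
    {'entity_group': 'LOC', 'score': 0.9977745, 'word': 'Hangzhou', 'start': 52, 'end': 60},
]

# Slice the original text with the offsets and group the spans by entity type.
by_type = {}
for ent in entities:
    span = text[ent['start']:ent['end']]
    by_type.setdefault(ent['entity_group'], []).append(span)

print(by_type)
```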


## Question answering
The question answering pipeline answers a question based on a given context.

In [12]:
from transformers import pipeline

question_answer = pipeline('question-answering')
answer = question_answer(
    question='Where do I work',
    context='My name is Chris and I work at HangzhouJiyiTech in Hangzhou'
)
print(answer)

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


Device set to use mps:0


{'score': 0.860968828201294, 'start': 31, 'end': 47, 'answer': 'HangzhouJiyiTech'}


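The pipeline automatically selected `distilbert-base-cased-distilled-squad`, a DistilBERT model trained on the SQuAD dataset. The question answering pipeline is actually an extractive QA model: it extracts the answer from the given context rather than generating it.

QA systems can be divided into:
- Extractive QA: assumes the answer is contained in the document, so it is extracted directly from the document
- Multiple-choice QA: picks the answer from several given options, much like a reading-comprehension exercise
- Free-form QA: generates the answer text directly, with no constraint on its format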
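
For the extractive case, the answer is by construction a substring of the context. A quick check of that property, using the context and the answer printed above as literals:

```python
context = 'My name is Chris and I work at HangzhouJiyiTech in Hangzhou'

# Answer copied from the cell output above.
answer = {'score': 0.860968828201294, 'start': 31, 'end': 47, 'answer': 'HangzhouJiyiTech'}

# 'start' and 'end' are character offsets into the context.
span = context[answer['start']:answer['end']]
print(span)
print(span == answer['answer'])
```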

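## Summarization
The summarization pipeline compresses a long text into a short one while preserving as much of the original's key information as possible.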

In [14]:
from transformers import pipeline

summarizer = pipeline('summarization')
results = summarizer("""
    America has changed dramatically during recent years. Not only has the number of
    graduates in traditional engineering disciplines such as mechanical, civil,
    electrical, chemical, and aeronautical engineering declined, but in most of
    the premier American universities engineering curricula now concentrate on
    and encourage largely the study of engineering science. As a result, there
    are declining offerings in engineering subjects dealing with infrastructure,
    the environment, and related issues, and greater concentration on high
    technology subjects, largely supporting increasingly complex scientific
    developments. While the latter is important, it should not be at the expense
    of more traditional engineering.

    Rapidly developing economies such as China and India, as well as other
    industrial countries in Europe and Asia, continue to encourage and advance
    the teaching of engineering. Both China and India, respectively, graduate
    six and eight times as many traditional engineers as does the United States.
    Other industrial countries at minimum maintain their output, while America
    suffers an increasingly serious decline in the number of engineering graduates
    and a lack of well-educated engineers.
""")

print(results)

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


Device set to use mps:0


[{'summary_text': ' The number of engineering graduates in the United States has declined in recent years . China and India graduate six and eight times as many traditional engineers as the U.S. does . Rapidly developing economies such as China continue to encourage and advance the teaching of engineering . There are declining offerings in engineering subjects dealing with infrastructure, infrastructure, the environment, and related issues .'}]


## How pipelines work

In [15]:
from transformers import pipeline

classifier = pipeline('sentiment-analysis')
result = classifier('I like you')
print(result)

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use mps:0


[{'label': 'POSITIVE', 'score': 0.9998695850372314}]


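The pipeline performs three steps:

1. Preprocessing: convert the raw text into an input format the model accepts
2. Feed the processed input into the model
3. Postprocessing: convert the model's output into a human-readable form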

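## Preprocessing with a tokenizer

Neural networks cannot process text directly, so preprocessing must first convert the text into numbers the model can understand:

1. Split the input into words, subwords, or symbols, collectively called tokens
2. Map each token to its token ID using the model's vocabulary
3. Add any extra inputs the model requires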
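
The three steps can be sketched in plain Python. This is a toy illustration, not the library's tokenizer: `TOY_VOCAB` and `toy_encode` are hypothetical names, the vocabulary holds only a few entries whose IDs mirror the tokenizer output printed in this section, and splitting is reduced to lowercasing plus whitespace.

```python
# Toy vocabulary: a handful of entries only; the real one is far larger.
TOY_VOCAB = {
    '[PAD]': 0, '[UNK]': 100, '[CLS]': 101, '[SEP]': 102,
    'i': 1045, 'hate': 5223, 'you': 2017, 'so': 2061, 'much': 2172,
}

def toy_encode(text):
    # Step 1: split into tokens (a real tokenizer also handles punctuation and subwords).
    tokens = text.lower().split()
    # Step 2: map each token to its ID, falling back to [UNK] for unknown tokens.
    ids = [TOY_VOCAB.get(tok, TOY_VOCAB['[UNK]']) for tok in tokens]
    # Step 3: add the extra inputs the model expects, here the special tokens.
    return [TOY_VOCAB['[CLS]']] + ids + [TOY_VOCAB['[SEP]']]

print(toy_encode('I hate you so much'))
# [101, 1045, 5223, 2017, 2061, 2172, 102]
```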

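Every model has its own preprocessing. The `AutoTokenizer` class and its `from_pretrained()` method automatically fetch the matching tokenizer from the model's `checkpoint` name.

The default checkpoint of the sentiment analysis pipeline is `distilbert-base-uncased-finetuned-sst-2-english`. Below we download and call its tokenizer by hand.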

In [17]:
from transformers import AutoTokenizer

checkpoint = 'distilbert-base-uncased-finetuned-sst-2-english'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

raw_inputs = [
    "I've been waiting for a Huggingface course whole my life",
    "I hate you so much"
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)

{'input_ids': tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2878,  2026,  2166,   102],
        [  101,  1045,  5223,  2017,  2061,  2172,   102,     0,     0,     0,
             0,     0,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}
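
The zeros at the end of the second row come from padding: the shorter sentence is extended to the length of the longest one, and `attention_mask` marks real tokens with 1 and padding with 0. A plain-Python sketch of that idea, assuming a pad ID of 0 as in the output above (`pad_batch` is a hypothetical helper, not a library function):

```python
def pad_batch(batch, pad_id=0):
    """Pad every sequence to the longest one and build the matching mask."""
    max_len = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (max_len - len(ids)) for ids in batch]
    attention_mask = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in batch]
    return input_ids, attention_mask

# Unpadded token IDs, read off the tokenizer output above.
batch = [
    [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2878, 2026, 2166, 102],
    [101, 1045, 5223, 2017, 2061, 2172, 102],
]

input_ids, attention_mask = pad_batch(batch)
print(input_ids[1])
print(attention_mask[1])
```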


## Feeding the preprocessed input to the model
Pretrained models are downloaded the same way as tokenizers: Transformers provides the `AutoModel` class with a matching `from_pretrained()` method.

In [18]:
from transformers import AutoModel

checkpoint = 'distilbert-base-uncased-finetuned-sst-2-english'
model = AutoModel.from_pretrained(checkpoint)

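A model loaded with `AutoModel` contains only the base Transformer module. For a given input it outputs activations called hidden states, or features. For an NLP model these can be thought of as a high-dimensional semantic representation of the text. The hidden states are usually fed into another part of the model to complete a specific task, for example a classification head for text classification.

The output of the Transformer module is a 3D tensor of shape (batch size, sequence length, hidden size). Batch size is the number of samples per input, i.e. how many sentences are passed in at once; sequence length is the length of the text sequence, i.e. how many tokens each sentence is split into; hidden size is the dimension of the output vector the model produces for each token.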

In [19]:
from transformers import AutoTokenizer, AutoModel

checkpoint = 'distilbert-base-uncased-finetuned-sst-2-english'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint)

raw_inputs = [
    "I've been waiting for a Huggingface course whole my life",
    "I hate you so much"
]

inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)

torch.Size([2, 15, 768])


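## Postprocessing the model output
The model's output is just a set of numbers, which is not suitable for human reading. Here the checkpoint is loaded with `AutoModelForSequenceClassification`, so the output contains one logit per class for each sentence.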

In [20]:
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification

checkpoint = 'distilbert-base-uncased-finetuned-sst-2-english'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

raw_inputs = [
    "I've been waiting for a Huggingface course whole my life",
    "I hate you so much"
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
outputs = model(**inputs)

print(outputs.logits)

tensor([[-1.2530,  1.2837],
        [ 3.8375, -3.1348]], grad_fn=<AddmmBackward0>)


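For the first sentence the model outputs `[-1.2530,  1.2837]`, and for the second `[ 3.8375, -3.1348]`. These are not probabilities but logits, the raw outputs of the model's last layer; they still need to pass through a softmax layer.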
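
For reference, softmax maps a vector of logits $z$ to probabilities that sum to 1:

$$
\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_j e^{z_j}}
$$

As a worked example, for the first sentence $z = (-1.2530,\ 1.2837)$, so the probability of the second class is

$$
\frac{e^{1.2837}}{e^{-1.2530} + e^{1.2837}} = \frac{1}{1 + e^{-2.5367}} \approx 0.9267
$$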

In [21]:
import torch
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)

tensor([[7.3325e-02, 9.2668e-01],
        [9.9906e-01, 9.3657e-04]], grad_fn=<SoftmaxBackward0>)


In [23]:
print(model.config.id2label)

{0: 'NEGATIVE', 1: 'POSITIVE'}
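
Putting the last two cells together: take the softmax of each row of logits, pick the most probable class, and look up its name in `id2label`. The sketch below does this in plain Python on the rounded logits printed above. It mirrors the shape of the pipeline's output but is not the library's actual implementation, and the helper `softmax` is written here for illustration.

```python
import math

def softmax(logits):
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

id2label = {0: 'NEGATIVE', 1: 'POSITIVE'}
logits = [[-1.2530, 1.2837], [3.8375, -3.1348]]  # copied from the output above

results = []
for row in logits:
    probs = softmax(row)
    best = max(range(len(probs)), key=probs.__getitem__)  # index of the largest probability
    results.append({'label': id2label[best], 'score': probs[best]})

print(results)
```

The first sentence comes out as POSITIVE and the second as NEGATIVE, with scores matching the softmax output above up to rounding.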
