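Out-of-the-box pipelines
The Transformers library groups current NLP tasks into the following categories:

Text classification: for example, sentiment analysis and sentence-pair relation classification;
Token classification (classifying each word in a text): for example, part-of-speech (POS) tagging and named entity recognition (NER);
Text generation: for example, completing a given prompt, or predicting the masked words in a text;
Extractive question answering: for example, extracting the answer to a given question from a passage;
Generating a new sentence from the input text: for example, machine translation and automatic summarization.
The most basic object in the Transformers library is the pipeline() function, which wraps a pretrained model together with its pre-processing and post-processing steps. We only need to pass in text to get the expected answer. Commonly used pipelines include: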

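feature-extraction (get a vector representation of a text)
fill-mask (fill in masked words or spans)
ner (named entity recognition)
question-answering (question answering)
sentiment-analysis (sentiment analysis)
summarization (automatic summarization)
text-generation (text generation)
translation (machine translation)
zero-shot-classification (classification without task-specific training examples)
Below we use a few common NLP tasks to show how to call these pipelines.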

In [None]:
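# Sentiment analysis
# With the sentiment-analysis pipeline, we only need to pass in text to get
# its sentiment label (positive/negative) and the corresponding probability:

from transformers import pipeline

classifier = pipeline("sentiment-analysis")

# A single string returns a list containing one result dict.
result = classifier("I've been waiting for a HuggingFace course my whole life.")
print(result)

# A list of strings returns one result dict per input, in order.
results = classifier(
    ["I've been waiting for a HuggingFace course my whole life.", "I hate this so much!"]
)
print(results)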

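The pipeline automatically performs the following three steps:

Pre-process the text into a format the model can understand;
Feed the pre-processed text to the model;
Post-process the model's predictions into a format humans can understand.
The pipeline automatically selects a suitable pretrained model for the task. For sentiment analysis, for example, it defaults to distilbert-base-uncased-finetuned-sst-2-english, an English sentiment model that has already been fine-tuned.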
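
The third step, post-processing, can be sketched in plain Python. This is a minimal illustration, not the library's actual implementation: the `postprocess` helper and the logit values are hypothetical, and the label order (index 0 = NEGATIVE, index 1 = POSITIVE) is an assumption about the sentiment model's configuration.

```python
import math

# Assumed label mapping for a binary sentiment classifier (illustrative).
ID2LABEL = {0: "NEGATIVE", 1: "POSITIVE"}


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def postprocess(batch_logits):
    """Map each row of logits to a {'label', 'score'} dict, pipeline-style."""
    results = []
    for logits in batch_logits:
        probs = softmax(logits)
        best = max(range(len(probs)), key=probs.__getitem__)
        results.append({"label": ID2LABEL[best], "score": probs[best]})
    return results


# Hypothetical logits for two sentences (not real model output).
example_logits = [[-1.5, 1.6], [4.2, -3.3]]
print(postprocess(example_logits))
```

The result has the same shape as what the sentiment-analysis pipeline prints: one dict per input, holding the winning label and its probability.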

In [2]:
# Zero-shot classification
# The zero-shot-classification pipeline lets us define our own class labels
# without providing any labeled data.
# As the log below shows, the pipeline automatically selected the pretrained
# facebook/bart-large-mnli model for this task.
from transformers import pipeline

classifier = pipeline("zero-shot-classification")
result = classifier(
    "This is a course about the Transformers library",
    candidate_labels=["education", "politics", "business"],
)
print(result)

No model was supplied, defaulted to facebook/bart-large-mnli and revision d7645e1 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.


Device set to use cuda:0


{'sequence': 'This is a course about the Transformers library', 'labels': ['education', 'business', 'politics'], 'scores': [0.8445960879325867, 0.11197640746831894, 0.0434274859726429]}
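
In the output above, `labels` and `scores` are in matching order, with the highest score first. A small sketch of reading such a result, using the printed dictionary copied verbatim (the variable names are illustrative):

```python
# The result printed above, copied here so the snippet runs on its own.
result = {
    "sequence": "This is a course about the Transformers library",
    "labels": ["education", "business", "politics"],
    "scores": [0.8445960879325867, 0.11197640746831894, 0.0434274859726429],
}

# Pair each candidate label with its score.
label_scores = dict(zip(result["labels"], result["scores"]))

# Pick the highest-scoring label.
top_label = max(label_scores, key=label_scores.get)
print(f"top label: {top_label} ({label_scores[top_label]:.1%})")

for label, score in label_scores.items():
    print(f"{label:<10} {score:.3f}")
```

The three scores here sum to roughly 1, so they can be read as a probability distribution over the candidate labels.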


In [3]:
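# Text generation
from transformers import pipeline

generator = pipeline("text-generation")

# Generate with the default settings.
results = generator("In this course, we will teach you how to")
print(results)

# Ask for two candidate continuations. Note: as the log below reports, the
# default max_new_tokens (256) took precedence over max_length=50 in this run.
results = generator(
    "In this course, we will teach you how to",
    num_return_sequences=2,
    max_length=50,
)
print(results)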

No model was supplied, defaulted to openai-community/gpt2 and revision 607a30d (https://huggingface.co/openai-community/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.


Device set to use cuda:0
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Both `max_new_tokens` (=256) and `max_length`(=50) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


[{'generated_text': 'In this course, we will teach you how to create powerful and powerful virtual machines using the same software and hardware. You will work on multiple scenarios to create powerful and powerful virtual machines.\n\nIn this course, you will find the tools to create powerful virtual machines using the same software and hardware. You will work on multiple scenarios to create powerful and powerful virtual machines.\n\nIn this course, you will discover the tools to create powerful and powerful virtual machines using the same software and hardware. You will work on multiple scenarios to create powerful and powerful virtual machines.\n\nIn this course, you will discover the tools to create powerful and powerful virtual machines using the same software and hardware. You will work on multiple scenarios to create powerful and powerful virtual machines.\n\nIn this course, you will learn how to create powerful and powerful virtual machines using the same software and hardware. 
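
The log above shows that `max_length=50` was not the limit actually applied: the default `max_new_tokens` (256) took precedence, which is why the output is so long. The difference between the two settings can be sketched as follows: `max_length` bounds the prompt plus the generated tokens, while `max_new_tokens` bounds only the generated tokens. The helper `new_token_budget` and the prompt length of 10 tokens are hypothetical and for illustration only; this is not the library's code.

```python
def new_token_budget(prompt_len, max_length=None, max_new_tokens=None):
    """Return how many tokens may be generated after the prompt.

    Mirrors the precedence reported in the warning above: when both
    limits are given, max_new_tokens wins.
    """
    if max_new_tokens is not None:
        return max_new_tokens
    if max_length is not None:
        # max_length counts the prompt too, so subtract it.
        return max(max_length - prompt_len, 0)
    raise ValueError("set max_length or max_new_tokens")


prompt_len = 10  # assumed prompt length in tokens

print(new_token_budget(prompt_len, max_length=50))                      # 40
print(new_token_budget(prompt_len, max_new_tokens=50))                  # 50
print(new_token_budget(prompt_len, max_length=50, max_new_tokens=256))  # 256
```

In a setup like the one logged above, passing `max_new_tokens` to the generator call instead of `max_length` is likely the more predictable way to cap the output length.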

In [4]:
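# Text generation with an explicitly chosen model
from transformers import pipeline

# Passing model= overrides the task's default checkpoint (here: distilgpt2).
generator = pipeline("text-generation", model="distilgpt2")
results = generator(
    "In this course, we will teach you how to",
    max_length=30,
    num_return_sequences=2,
)
print(results)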

Device set to use cuda:0
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Both `max_new_tokens` (=256) and `max_length`(=30) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


[{'generated_text': "In this course, we will teach you how to make simple and effective, and not make a bunch of fancy fancy things that you can't do. The only way to do it is to think about ways to make something very simple.\n\n\nThis course will give you a basic idea of how to make everything simple.\nAfter you've completed this course, you'll need to make your own.\nBefore you start making, you'll need to know how to make it easy.\nFirst, you will need to find a few examples.\nExample 1:\nImagine a simple video tutorial.\nFirst, imagine how to make a simple video tutorial.\nOnce you've finished it, you can use a few examples.\nExample 2:\nImagine another simple video tutorial.\nNow, you might want to use a few examples.\nSo you can make this app simple, but it's not easy.\nYou could start by using a few examples.\nExample 3:\nImagine a simple video tutorial.\nNow, you might need to use a few examples.\nNow, you might want to use a few examples.\nA quick example.\nNow, you might wan
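
Each result is a dict whose `generated_text` starts with the prompt itself. Below is a minimal sketch of separating the continuation from the prompt. The `continuation` helper is hypothetical, and the sample is shortened from the output above. The text-generation pipeline also accepts `return_full_text=False`, which should have a similar effect.

```python
PROMPT = "In this course, we will teach you how to"

# Shortened sample in the same shape as the pipeline output above.
results = [
    {
        "generated_text": PROMPT
        + " make simple and effective, and not make a bunch of fancy"
        + " fancy things that you can't do."
    },
]


def continuation(generated_text, prompt):
    """Return only the text the model added after the prompt."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].lstrip()
    return generated_text  # prompt not echoed; leave unchanged


for i, item in enumerate(results, start=1):
    print(f"[{i}] {continuation(item['generated_text'], PROMPT)}")
```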

In [1]:
# We can also pick models for other languages using the language tags on the left.
# For example, load the gpt2-chinese-poem model, which is dedicated to
# generating classical Chinese poetry:

from transformers import pipeline

generator = pipeline("text-generation", model="uer/gpt2-chinese-poem")
results = generator(
    "[CLS] 万 叠 春 山 积 雨 晴 ，",
    max_length=40,
    num_return_sequences=2,
)
print(results)

Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Both `max_new_tokens` (=256) and `max_length`(=40) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


[{'generated_text': '[CLS] 万 叠 春 山 积 雨 晴 ， 望 山 光 紫 。 山 花 发 杜 鹃 啼 血 ， 山 头 日 日 悲 风 起 。 萦 翠 带 花 绸 缪 。 彼 远 游 人 ， 一 别 三 千 秋 。 何 年 此 结 彼 ， 一 旦 埋 荒 洲 。 我 魂 亦 归 ， 当 与 尔 同 游 。 相 思 不 相 见 ， 独 上 古 城 头 。 人 不 可 见 ， 行 云 去 悠 悠 。 哉 彼 流 水 ， 东 去 何 时 休 。 兹 老 兄 弟 ， 生 死 同 衾 裯 。 骨 肉 为 兄 弟 土 壤 壤 壤 壤 壤 壤 ， 魂 气 化 作 灰 槁 ， 魂 气 结 散 作 邻 。 魂 散 作 室 中 夜 夜 台 ， 寒 风 吹 作 曙 ， 月 。 我 屋 角 ， 日 月 轮 转 月 月 兮 星 河 流 光 。 我 床 ， 月 以 月 兮 月 兮 月 兮 月 兮 月 兮 月 兮 月 兮 月 兮 月 兮 月 兮 月 兮 露 。 月 兮 月 兮 月 兮 河 汉 河 汉 月 兮 河 汉 水 流 天 河 汉 之 星 河 汉 之 月 兮 河 汉 兮 河 山 河 汉 之 ， 河 汉 之 云 。 河 汉 兮 河 汉 之 露 兮 河 汉 之 。 天 河'}, {'generated_text': '[CLS] 万 叠 春 山 积 雨 晴 ， 山 看 不 了 。 不 知 天 气 暄 ， 但 觉 日 光 好 。 山 人 睡 足 时 ， 心 闲 境 自 少 。 哉 山 中 人 ， 何 年 丹 九 转 。 毋 五 岳 游 ， 不 必 身 插 羽 。 客 方 兴 云 ， 云 归 自 何 许 。 乎 勿 叹 留 ， 不 归 待 渠 补 。 夜 大 雷 雨 ， 一 山 多 雨 声 。 檐 溜 响 不 歇 ， 端 似 秋 不 平 。 山 前 溪 水 流 ， 江 水 日 夜 响 。 江 湖 浮 画 舷 为 簸 ， 一 叶 荡 。 人 家 门 前 。 船 上 。 船 船 行 船 ， 水 去 ， 到 底 住 ， 一 掉 一 掷 一 掷 一 掷 一 掷 一 掷 一 掷 双 桨 ， 一 掷 一 樯 ， 不 摇 。 一 掷 一 掷 。 一 掷 一 掷 千 。 百 万 千 千 呼 。 一 掷 一 掷 一 掷 一 掷 一 掷 一 掷 一 掷 一 掷 一 掷 千 寻 一 换 百 舟 
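
The model emits characters separated by spaces and echoes the `[CLS]` marker from the prompt. For display, the text can be tidied with a small helper. This is a sketch: `clean_poem` is a hypothetical name, and the sample is the beginning of the first result above.

```python
def clean_poem(generated_text):
    """Drop the [CLS] marker and the spaces between characters."""
    text = generated_text.replace("[CLS]", "")
    return "".join(text.split())


# Beginning of the first generated poem, copied from the output above.
sample = "[CLS] 万 叠 春 山 积 雨 晴 ， 望 山 光 紫 。 山 花 发 杜 鹃 啼 血 ，"
print(clean_poem(sample))
```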