# Transformers, what can they do?

Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [25]:
!pip install datasets evaluate transformers[sentencepiece]



In [26]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier("I've been waiting for a HuggingFace course my whole life.")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


[{'label': 'POSITIVE', 'score': 0.9598049521446228}]

In [27]:
classifier(
    ["I've been waiting for a HuggingFace course my whole life.", "I hate this so much!"]
)

[{'label': 'POSITIVE', 'score': 0.9598049521446228},
 {'label': 'NEGATIVE', 'score': 0.9994558691978455}]

The score is the model's confidence in the predicted sentiment label: the higher the value, the more certain the model is of its classification.
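As a rough illustration of where such a score comes from: a classifier of this kind typically produces one raw logit per label and turns them into probabilities with a softmax, and the reported score is the probability of the winning label. The sketch below uses made-up logits (an assumption, not values taken from the model above) and plain Python.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


labels = ["NEGATIVE", "POSITIVE"]
logits = [-1.5, 1.7]  # hypothetical logits, for illustration only

probs = softmax(logits)
best = max(range(len(labels)), key=lambda i: probs[i])
result = {"label": labels[best], "score": probs[best]}
print(result)
```

A score close to 1.0 means the two logits were far apart; a score near 0.5 means the model could barely separate the two labels.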

In [28]:
from transformers import pipeline

classifier = pipeline("zero-shot-classification")
classifier(
    "This is a course about the Transformers library",
    candidate_labels=["education", "politics", "business"],
)

No model was supplied, defaulted to facebook/bart-large-mnli and revision d7645e1 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


{'sequence': 'This is a course about the Transformers library',
 'labels': ['education', 'business', 'politics'],
 'scores': [0.8445994257926941, 0.11197380721569061, 0.04342673346400261]}

In [29]:
classifier(
    "I want to play at the afternoon.",
    candidate_labels=["education", "entertaimnet", "business"],
)

{'sequence': 'I want to play at the afternoon.',
 'labels': ['entertaimnet', 'business', 'education'],
 'scores': [0.920432984828949, 0.0464714840054512, 0.03309548646211624]}

In [30]:
from transformers import pipeline

generator = pipeline("text-generation")
generator("In this course, we will teach you how to")

No model was supplied, defaulted to openai-community/gpt2 and revision 607a30d (https://huggingface.co/openai-community/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this course, we will teach you how to develop a strong, solid, positive mindset of supporting your family.\n\n2. Become a Teacher\n\nAt this point, I am looking for what skills are needed as a school teacher. We'}]

In [31]:
from transformers import pipeline

generator = pipeline("text-generation", model="distilgpt2")
generator(
    "In this course, we will teach you how to",
    max_length=30,
    num_return_sequences=2,
)

Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this course, we will teach you how to master the art of drawing.\n\n- Learn how to understand, adapt, and read from the'},
 {'generated_text': 'In this course, we will teach you how to generate a beautiful and simple, intuitive model of the human experience using a natural way of using our visual'}]

In [32]:
from transformers import pipeline

# Load a Chinese text-generation model
generator = pipeline("text-generation", model="uer/gpt2-chinese-cluecorpussmall")

# Generate text from a Chinese prompt ("In this course, we will teach you how to")
output = generator(
    "在这个课程中，我们将教你如何",
    max_length=50,  # maximum total length in tokens, prompt included
    num_return_sequences=2,  # return two candidate continuations
)

# Print the generated text
for i, result in enumerate(output):
    print(f"Generated result {i+1}: {result['generated_text']}")

Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


Generated result 1: 在这个课程中，我们将教你如何 解 决 一 个 人 与 人 之 间 「 信 任 危 机 」 教 授 是 一 位 英 国 皇 家 教 育 学 院 的 教 授 ， 对 学
Generated result 2: 在这个课程中，我们将教你如何 招 聘 会 主 席 兼 董 事 长 ， 上 下 九 分 讲 师 ！ 由 本 人 于 1999 年 第 一 期 （ 现 位 于 广 州 市 番


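The generated text has a visible problem: the sentences are incomplete. This is most likely because generation stops as soon as max_length tokens have been produced, regardless of where the sentence ends.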
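One possible workaround is to post-process the output and drop the trailing partial sentence. The helper below is a hypothetical sketch in plain Python, not part of the Transformers library; the name `trim_to_last_sentence` and the set of sentence-ending characters are assumptions. The sample string is one of the distilgpt2 outputs recorded earlier in this notebook.

```python
SENTENCE_ENDINGS = "。！？.!?"  # assumed set of sentence-final punctuation


def trim_to_last_sentence(text, prompt=""):
    """Cut text after the last sentence-ending character found after the prompt.

    If the continuation contains no sentence end, the text is returned unchanged.
    """
    body = text[len(prompt):] if text.startswith(prompt) else text
    last = max(body.rfind(ch) for ch in SENTENCE_ENDINGS)
    if last == -1:
        return text
    offset = len(text) - len(body)  # where the continuation starts
    return text[: offset + last + 1]


prompt = "In this course, we will teach you how to"
generated = (
    "In this course, we will teach you how to master the art of drawing."
    "\n\n- Learn how to understand, adapt, and read from the"
)
trimmed = trim_to_last_sentence(generated, prompt)
print(trimmed)
```

Raising the length limit is the other obvious option, though the model can still stop mid-sentence at the new limit.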

In [33]:
from transformers import pipeline

unmasker = pipeline("fill-mask")
unmasker("This course will teach you all about <mask> models.", top_k=2)

No model was supplied, defaulted to distilbert/distilroberta-base and revision fb53ab8 (https://huggingface.co/distilbert/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at distilbert/distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


[{'score': 0.19198468327522278,
  'token': 30412,
  'token_str': ' mathematical',
  'sequence': 'This course will teach you all about mathematical models.'},
 {'score': 0.042092032730579376,
  'token': 38163,
  'token_str': ' computational',
  'sequence': 'This course will teach you all about computational models.'}]

In [34]:
from transformers import pipeline

# Load the bert-base-cased model
unmasker = pipeline("fill-mask", model="bert-base-cased")

# Same sentence, adjusted to use BERT's [MASK] token
results = unmasker("This course will teach you all about [MASK] models.", top_k=2)

# Print the results
for result in results:
    print(result)

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


{'score': 0.2596322000026703, 'token': 1648, 'token_str': 'role', 'sequence': 'This course will teach you all about role models.'}
{'score': 0.09427239000797272, 'token': 1103, 'token_str': 'the', 'sequence': 'This course will teach you all about the models.'}


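Analysis of the results:
The top two candidates predicted by bert-base-cased here are role and the; the exact words and scores may vary slightly with the model version or environment.
Compared with the default model (distilbert/distilroberta-base, which predicted mathematical and computational), bert-base-cased favors more generic, everyday words. Both are general-purpose pretrained models that have not been tuned for a specific domain, so the difference most likely comes from their different pretraining data.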
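A practical detail visible in the two fill-mask cells above is that each checkpoint expects its own mask token: `<mask>` for the default distilroberta-base and `[MASK]` for bert-base-cased. The sketch below keeps the sentence as a template and fills in the right token per model; the helper name and the lookup table are illustrative assumptions.

```python
# Mask tokens as used in the two fill-mask cells above.
MASK_TOKENS = {
    "distilbert/distilroberta-base": "<mask>",
    "bert-base-cased": "[MASK]",
}


def build_masked_prompt(template, model_name):
    """Fill the {mask} placeholder with the mask token the checkpoint expects."""
    return template.format(mask=MASK_TOKENS[model_name])


template = "This course will teach you all about {mask} models."
prompts = {name: build_masked_prompt(template, name) for name in MASK_TOKENS}
for name, prompt in prompts.items():
    print(f"{name}: {prompt}")
```

With a loaded pipeline, reading `unmasker.tokenizer.mask_token` should give the same string without a hard-coded table.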

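The NER task:
The goal is to identify named entities in the text and label each with a category (for example person PER, organization ORG, location LOC).
The output includes each entity's text, its category, a confidence score, and its start and end character positions in the sentence.

grouped_entities=True:
By default, the NER pipeline may split a multi-word entity (such as "Hugging Face") into separate tokens ("Hugging" and "Face").
With grouped_entities=True, the pipeline merges these pieces back into a single entity.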
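The grouping step can be approximated in plain Python. The sketch below is a simplified stand-in for what the pipeline does, not its actual implementation: it merges consecutive tokens that share an entity type and averages their scores. The token-level input, including the indices and scores, is hypothetical. In recent Transformers releases the same behavior is requested with the `aggregation_strategy` argument, and `grouped_entities` is deprecated.

```python
def group_entities(tokens, text):
    """Merge consecutive tokens with the same entity type into one entity."""
    groups = []
    for tok in tokens:
        label = tok["entity"].split("-")[-1]  # "I-ORG" -> "ORG"
        prev = groups[-1] if groups else None
        continues = (
            prev is not None
            and prev["entity_group"] == label
            and tok["index"] == prev["last_index"] + 1  # tokens must be adjacent
            and not tok["entity"].startswith("B-")  # "B-" always opens a new entity
        )
        if continues:
            prev["end"] = tok["end"]
            prev["last_index"] = tok["index"]
            prev["scores"].append(tok["score"])
        else:
            groups.append({
                "entity_group": label,
                "start": tok["start"],
                "end": tok["end"],
                "last_index": tok["index"],
                "scores": [tok["score"]],
            })
    return [
        {
            "entity_group": g["entity_group"],
            "score": sum(g["scores"]) / len(g["scores"]),
            "word": text[g["start"]:g["end"]],
            "start": g["start"],
            "end": g["end"],
        }
        for g in groups
    ]


text = "My name is Sylvain and I work at Hugging Face in Brooklyn."
# Hypothetical token-level predictions (scores and indices are made up).
tokens = [
    {"entity": "I-PER", "score": 0.99, "index": 4, "start": 11, "end": 18},
    {"entity": "I-ORG", "score": 0.97, "index": 9, "start": 33, "end": 40},
    {"entity": "I-ORG", "score": 0.98, "index": 10, "start": 41, "end": 45},
    {"entity": "I-LOC", "score": 0.99, "index": 12, "start": 49, "end": 57},
]

entities = group_entities(tokens, text)
for entity in entities:
    print(entity)
```

Taking the word from the original text via the character offsets keeps the space inside "Hugging Face" without having to guess how the tokens were joined.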

In [35]:
from transformers import pipeline

ner = pipeline("ner", grouped_entities=True)
ner("My name is Sylvain and I work at Hugging Face in Brooklyn.")

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision 4c53496 (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


[{'entity_group': 'PER',
  'score': np.float32(0.9981694),
  'word': 'Sylvain',
  'start': 11,
  'end': 18},
 {'entity_group': 'ORG',
  'score': np.float32(0.9796019),
  'word': 'Hugging Face',
  'start': 33,
  'end': 45},
 {'entity_group': 'LOC',
  'score': np.float32(0.9932106),
  'word': 'Brooklyn',
  'start': 49,
  'end': 57}]

In [36]:
# POS tagging labels the grammatical role of each word (e.g. noun, verb).
from transformers import pipeline

# Load a POS-tagging model
pos = pipeline("token-classification", model="vblagoje/bert-english-uncased-finetuned-pos")

# Input sentence
results = pos("My name is Sylvain and I work at Hugging Face in Brooklyn.")

# Print the results
for result in results:
    print(result)

Some weights of the model checkpoint at vblagoje/bert-english-uncased-finetuned-pos were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


{'entity': 'PRON', 'score': np.float32(0.9994592), 'index': 1, 'word': 'my', 'start': 0, 'end': 2}
{'entity': 'NOUN', 'score': np.float32(0.99601364), 'index': 2, 'word': 'name', 'start': 3, 'end': 7}
{'entity': 'AUX', 'score': np.float32(0.9953696), 'index': 3, 'word': 'is', 'start': 8, 'end': 10}
{'entity': 'PROPN', 'score': np.float32(0.99848914), 'index': 4, 'word': 'sy', 'start': 11, 'end': 13}
{'entity': 'PROPN', 'score': np.float32(0.9978808), 'index': 5, 'word': '##lva', 'start': 13, 'end': 16}
{'entity': 'PROPN', 'score': np.float32(0.99808747), 'index': 6, 'word': '##in', 'start': 16, 'end': 18}
{'entity': 'CCONJ', 'score': np.float32(0.99918765), 'index': 7, 'word': 'and', 'start': 19, 'end': 22}
{'entity': 'PRON', 'score': np.float32(0.9994679), 'index': 8, 'word': 'i', 'start': 23, 'end': 24}
{'entity': 'VERB', 'score': np.float32(0.99923587), 'index': 9, 'word': 'work', 'start': 25, 'end': 29}
{'entity': 'ADP', 'score': np.float32(0.9063111), 'index': 10, 'word': 'at', 's
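The output above is per word piece, so "Sylvain" appears as sy, ##lva and ##in. A minimal sketch of stitching such pieces back into whole words follows. It assumes the WordPiece convention that a continuation piece starts with `##`, and it keeps the tag of the first piece. The helper name is hypothetical, and the input reuses a few rows from the output above with the scores omitted.

```python
def merge_wordpieces(tokens):
    """Join '##' continuation pieces onto the preceding token."""
    merged = []
    for tok in tokens:
        if tok["word"].startswith("##") and merged:
            prev = merged[-1]
            prev["word"] += tok["word"][2:]  # drop the '##' marker
            prev["end"] = tok["end"]  # extend the character span
        else:
            merged.append(dict(tok))  # copy so the input is not modified
    return merged


pieces = [
    {"entity": "AUX", "word": "is", "start": 8, "end": 10},
    {"entity": "PROPN", "word": "sy", "start": 11, "end": 13},
    {"entity": "PROPN", "word": "##lva", "start": 13, "end": 16},
    {"entity": "PROPN", "word": "##in", "start": 16, "end": 18},
    {"entity": "CCONJ", "word": "and", "start": 19, "end": 22},
]

words = merge_wordpieces(pieces)
for word in words:
    print(word)
```

The merged word is lowercase because this checkpoint is uncased; slicing the original sentence with the merged start and end offsets would recover the original casing.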

In [37]:
from transformers import pipeline
# This is extractive question answering: the answer is a span copied from the context, not generated text (unlike GPT).
question_answerer = pipeline("question-answering")
question_answerer(
    question="Where do I work?",
    context="My name is Sylvain and I work at Hugging Face in Brooklyn",
)

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


{'score': 0.6949766278266907, 'start': 33, 'end': 45, 'answer': 'Hugging Face'}
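The start and end fields in the result hint at how extractive QA works: the model scores every position as a possible answer start and as a possible answer end, and the answer is the best-scoring span of the context. The sketch below illustrates that selection step with word-level tokens and hypothetical logits. Both are simplifications, since the real model works on subword tokens and produces its own scores.

```python
import re


def best_span(start_logits, end_logits, max_answer_len=5):
    """Return (i, j) maximizing start_logits[i] + end_logits[j] with i <= j."""
    best = None
    for i, start_score in enumerate(start_logits):
        last = min(i + max_answer_len, len(end_logits))
        for j in range(i, last):
            score = start_score + end_logits[j]
            if best is None or score > best[0]:
                best = (score, i, j)
    return best[1], best[2]


context = "My name is Sylvain and I work at Hugging Face in Brooklyn"
# Word-level tokens with character offsets: (text, start, end).
tokens = [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", context)]

# Hypothetical logits, one per token, for the question "Where do I work?".
start_logits = [0.1, 0.0, 0.0, 0.5, 0.0, 0.0, 0.2, 0.3, 4.0, 0.5, 0.2, 2.0]
end_logits = [0.0, 0.1, 0.0, 0.4, 0.0, 0.0, 0.1, 0.0, 0.6, 4.2, 0.1, 2.5]

i, j = best_span(start_logits, end_logits)
start_char, end_char = tokens[i][1], tokens[j][2]
answer = {"start": start_char, "end": end_char, "answer": context[start_char:end_char]}
print(answer)
```

Because the answer is always a slice of the context, this kind of model cannot answer with words that do not appear in the provided text.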

In [38]:
from transformers import pipeline

summarizer = pipeline("summarization")
summarizer(
    """
    America has changed dramatically during recent years. Not only has the number of
    graduates in traditional engineering disciplines such as mechanical, civil,
    electrical, chemical, and aeronautical engineering declined, but in most of
    the premier American universities engineering curricula now concentrate on
    and encourage largely the study of engineering science. As a result, there
    are declining offerings in engineering subjects dealing with infrastructure,
    the environment, and related issues, and greater concentration on high
    technology subjects, largely supporting increasingly complex scientific
    developments. While the latter is important, it should not be at the expense
    of more traditional engineering.

    Rapidly developing economies such as China and India, as well as other
    industrial countries in Europe and Asia, continue to encourage and advance
    the teaching of engineering. Both China and India, respectively, graduate
    six and eight times as many traditional engineers as does the United States.
    Other industrial countries at minimum maintain their output, while America
    suffers an increasingly serious decline in the number of engineering graduates
    and a lack of well-educated engineers.
"""
)

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


[{'summary_text': ' America has changed dramatically during recent years . The number of engineering graduates in the U.S. has declined in traditional engineering disciplines such as mechanical, civil,    electrical, chemical, and aeronautical engineering . Rapidly developing economies such as China and India continue to encourage and advance the teaching of engineering .'}]

In [39]:
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")
translator("Ce cours est produit par Hugging Face.")

Device set to use cpu


[{'translation_text': 'This course is produced by Hugging Face.'}]

In [40]:
from transformers import pipeline

# French to English
translator_fr_en = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")
text_en = translator_fr_en("Ce cours est produit par Hugging Face.")[0]["translation_text"]

# English to Chinese
translator_en_zh = pipeline("translation", model="Helsinki-NLP/opus-mt-en-zh")
text_zh = translator_en_zh(text_en)[0]["translation_text"]

print(text_zh)

# Translating directly from French to Chinese raised an error here, so the text goes French -> English -> Chinese.

Device set to use cpu
Device set to use cpu


这门课程由Huggging Face制作。
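The two-step translation above is an instance of pivot translation, which can be written as a small composition helper. In the sketch below, `chain_translators` and the two `fake_*` functions are hypothetical stand-ins so the example runs without downloading any model; the strings they return are the outputs recorded in the cells above, including the spelling in the Chinese result.

```python
def chain_translators(steps):
    """Build a translator that applies each step to the previous step's output."""
    def translate(text):
        for step in steps:
            text = step(text)
        return text
    return translate


# Stand-ins for the real pipelines: simple lookups of the recorded outputs.
def fake_fr_en(text):
    table = {"Ce cours est produit par Hugging Face.": "This course is produced by Hugging Face."}
    return table[text]


def fake_en_zh(text):
    table = {"This course is produced by Hugging Face.": "这门课程由Huggging Face制作。"}
    return table[text]


fr_to_zh = chain_translators([fake_fr_en, fake_en_zh])
text_zh = fr_to_zh("Ce cours est produit par Hugging Face.")
print(text_zh)
```

With the real pipelines, each step could be a small wrapper such as `lambda t: translator_fr_en(t)[0]["translation_text"]`. Errors made in the first step carry into the second, which is the usual cost of going through a pivot language.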


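Summary of this lesson:
I learned many of the tasks that pipeline supports.
These were: scoring the sentiment of a sentence; scoring a sentence against candidate labels (zero-shot, using a pretrained model directly); text generation (the quality seemed mediocre, especially for Chinese, and the output sometimes ends mid-sentence); filling in a masked word in a sentence (the model predicts the most likely token from the context); recognizing named entities such as people and places, or parts of speech such as nouns and verbs; answering a question from a provided context; summarizing text; and translating sentences (although English seemed to be needed as a pivot language).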