<a href="https://colab.research.google.com/github/chenboju/AI/blob/main/Pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

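## Introduction to Pipelines
The most basic object in the Transformers library is the `pipeline()` function. It connects a model with its necessary preprocessing and postprocessing steps, so you can pass in any text directly and get back a readable answer.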

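Some of the currently available pipelines are:

* `feature-extraction` (get the vector representation of a text)
* `fill-mask` (fill in the blanks)
* `ner` (named entity recognition)
* `question-answering`
* `sentiment-analysis`
* `summarization`
* `text-generation`
* `translation`
* `zero-shot-classification`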

In [None]:
!pip install transformers -U
!pip install sentencepiece
!pip install sacremoses

Collecting sacremoses
  Downloading sacremoses-0.1.1-py3-none-any.whl (897 kB)
Installing collected packages: sacremoses
Successfully installed sacremoses-0.1.1


### Sentiment analysis

In [None]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")  # "sentiment-analysis" is the fixed task identifier
# classifier("I've been waiting for a HuggingFace course my whole life.")
# classifier("so ?")
# classifier("我今天要去上課，so happy")  # "I'm going to class today, so happy"
classifier("4090賣6萬台幣")  # "The 4090 sells for NT$60,000"

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'NEGATIVE', 'score': 0.9220388531684875}]

In [None]:
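# Use batching: pass a list to classify several sentences in one call.
# The Chinese inputs below mean, in order: "I'm going to class today, so happy",
# "I'm going to class today", "Going to see a movie", "The new book is out",
# "Got my master's degree", "Annual salary of 1 million".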

classifier(
    ["I've been waiting for a HuggingFace course my whole life.", "I hate this so much!",
     "我今天要去上課，so happy","我今天要去上課","要去看電影","新書出了","考到碩士了","年薪100萬"]
)

[{'label': 'POSITIVE', 'score': 0.9598048329353333},
 {'label': 'NEGATIVE', 'score': 0.9994558691978455},
 {'label': 'POSITIVE', 'score': 0.9995966553688049},
 {'label': 'NEGATIVE', 'score': 0.8862934708595276},
 {'label': 'NEGATIVE', 'score': 0.8832128047943115},
 {'label': 'POSITIVE', 'score': 0.5436826348304749},
 {'label': 'NEGATIVE', 'score': 0.5958869457244873},
 {'label': 'POSITIVE', 'score': 0.6323531270027161}]

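By default, this pipeline selects a particular pretrained model that has been fine-tuned for sentiment analysis in English. The model is downloaded and cached when you create the classifier object. If you rerun the command, the cached model is used and nothing is downloaded again. Because the default model is English-only, its labels for the Chinese inputs above should not be treated as reliable.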
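Several of the batch predictions above have scores close to 0.5, which suggests the model is unsure about them. The following is a minimal sketch of one way to flag such cases. It works on the inputs and (rounded) predictions copied from the batch example; the helper name `flag_uncertain` and the 0.9 threshold are assumptions made for illustration and are not part of the Transformers library.

```python
# Inputs and (rounded) predictions copied from the batch example above.
texts = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
    "我今天要去上課，so happy",
    "我今天要去上課",
    "要去看電影",
    "新書出了",
    "考到碩士了",
    "年薪100萬",
]
predictions = [
    {"label": "POSITIVE", "score": 0.9598},
    {"label": "NEGATIVE", "score": 0.9995},
    {"label": "POSITIVE", "score": 0.9996},
    {"label": "NEGATIVE", "score": 0.8863},
    {"label": "NEGATIVE", "score": 0.8832},
    {"label": "POSITIVE", "score": 0.5437},
    {"label": "NEGATIVE", "score": 0.5959},
    {"label": "POSITIVE", "score": 0.6324},
]


def flag_uncertain(texts, predictions, threshold=0.9):
    """Return (text, label, score) for every prediction scoring below the threshold.

    The threshold is an arbitrary, hypothetical cut-off chosen for this sketch.
    """
    return [
        (text, pred["label"], pred["score"])
        for text, pred in zip(texts, predictions)
        if pred["score"] < threshold
    ]


uncertain = flag_uncertain(texts, predictions)
for text, label, score in uncertain:
    print(f"{score:.2f} {label:8} {text}")
```

A high score does not guarantee a correct label, so a filter like this is only a first sanity check.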

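Three main steps are involved when you pass some text to a pipeline:

* The text is preprocessed into a format the model can understand.
* The preprocessed inputs are passed to the model.
* The model's predictions are post-processed into a result that humans can understand.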
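The three steps can be illustrated with a toy, dependency-free sketch. Everything in it (the vocabulary, the weights, and the function names) is hypothetical and chosen only to make the flow visible; a real pipeline uses a trained tokenizer and a neural network, and this is not how Transformers is implemented internally.

```python
import math

# Hypothetical toy vocabulary and per-token weights. A real pipeline uses a
# trained tokenizer and a neural network instead.
VOCAB = {"[UNK]": 0, "i": 1, "love": 2, "hate": 3, "this": 4}
# One (negative, positive) weight pair per token id.
WEIGHTS = [(0.0, 0.0), (0.0, 0.0), (-1.0, 2.0), (2.0, -1.0), (0.0, 0.0)]
LABELS = ["NEGATIVE", "POSITIVE"]


def preprocess(text):
    """Step 1: turn raw text into token ids the model understands."""
    return [VOCAB.get(token, VOCAB["[UNK]"]) for token in text.lower().split()]


def model(token_ids):
    """Step 2: map token ids to one raw score (logit) per label."""
    negative = sum(WEIGHTS[i][0] for i in token_ids)
    positive = sum(WEIGHTS[i][1] for i in token_ids)
    return [negative, positive]


def postprocess(logits):
    """Step 3: apply softmax to the logits and report the most likely label."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": LABELS[best], "score": probs[best]}


def toy_pipeline(text):
    """Chain the three steps, mirroring what a pipeline call does conceptually."""
    return postprocess(model(preprocess(text)))


print(toy_pipeline("I love this"))
print(toy_pipeline("I hate this"))
```

The returned dictionary has the same `label` and `score` keys as the sentiment-analysis output shown earlier, which is the only part of this sketch taken from the real pipeline.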

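### Zero-shot classification

This task classifies text that has not been labeled. It is a common scenario in real-world projects, because annotating text is usually time-consuming and requires domain expertise. The `zero-shot-classification` pipeline is very powerful for this use case: it lets you specify the labels to classify against directly, so you do not have to rely on the labels of the pretrained model. The example below classifies two sentences against three candidate labels (education, technology, academia), but any other set of labels can be used.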


In [None]:
from transformers import pipeline

classifier = pipeline("zero-shot-classification")  # "zero-shot-classification" is the fixed task identifier
classifier(
    # "This is a course about the Transformers library",  # single input
    ["碩士順利畢業", "This is a course about the Transformers library"],  # first input: "Graduated with a master's degree"
    # candidate_labels=["education", "politics", "business"],
    candidate_labels=["教育", "科技", "學術"],  # "education", "technology", "academia"
)

No model was supplied, defaulted to facebook/bart-large-mnli and revision c626438 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'sequence': '碩士順利畢業',
  'labels': ['學術', '教育', '科技'],
  'scores': [0.4213746190071106, 0.30179092288017273, 0.2768344283103943]},
 {'sequence': 'This is a course about the Transformers library',
  'labels': ['教育', '學術', '科技'],
  'scores': [0.37463894486427307, 0.3610444962978363, 0.26431652903556824]}]
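In the output above, each result lists the candidate labels together with their scores, best label first. The sketch below, which works on a rounded copy of that output, picks the best label for each input and reports how far it is ahead of the runner-up. The helper name `top_label_with_margin` is an assumption made for illustration.

```python
# Rounded copy of the zero-shot output shown above.
results = [
    {
        "sequence": "碩士順利畢業",
        "labels": ["學術", "教育", "科技"],
        "scores": [0.4214, 0.3018, 0.2768],
    },
    {
        "sequence": "This is a course about the Transformers library",
        "labels": ["教育", "學術", "科技"],
        "scores": [0.3746, 0.3610, 0.2643],
    },
]


def top_label_with_margin(result):
    """Return the best label and its lead over the second-best label."""
    # Sort explicitly instead of relying on the order of the input lists.
    ranked = sorted(zip(result["scores"], result["labels"]), reverse=True)
    (best_score, best_label), (second_score, _) = ranked[0], ranked[1]
    return best_label, best_score - second_score


for result in results:
    label, margin = top_label_with_margin(result)
    print(f"{result['sequence']}: {label} (margin {margin:.4f})")
```

A small margin, as for the second sentence, indicates that the model barely separates the top two labels.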

### Text generation

You provide a prompt, and the model auto-completes it by generating the remaining text.

In [None]:
from transformers import pipeline

generator = pipeline("text-generation")
#generator("In this course, we will teach you how to")
generator("今天天氣很好，要去哪裡玩?")  # "The weather is nice today, where should we go?"

No model was supplied, defaulted to openai-community/gpt2 and revision 6c0e608 (https://huggingface.co/openai-community/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': '今天天氣很好，要去哪裡玩?\n\n對说段中文往好?\n\n(不她)'}]

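Use the `num_return_sequences` argument to control how many different sequences are generated, and the `max_length` argument to control the total length of the output text (in tokens, prompt included).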

In [None]:
# num_return_sequences: how many sequences to return; max_length: maximum total length in tokens
# generator("In this course, we will teach you how to", num_return_sequences=2, max_length=30)
generator("今天天氣很好，要去哪裡玩?", num_return_sequences=6, max_length=50)


Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': '今天天氣很好，要去哪裡玩?或是風手尽尼常拳便先?�'},
 {'generated_text': "今天天氣很好，要去哪裡玩?\n\n\nBut I like my life and I'm not thinking about any of that, I see a lot of great things"},
 {'generated_text': '今天天氣很好，要去哪裡玩?我们一格了。我内高手有人指'},
 {'generated_text': '今天天氣很好，要去哪裡玩?\n\nYANG. 鞅営失以头何的仝是�'},
 {'generated_text': '今天天氣很好，要去哪裡玩?門同不离阘? 曪是怂不是的同�'},
 {'generated_text': '今天天氣很好，要去哪裡玩?痨国便器足?聖照察?\n'}]
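Each `generated_text` above starts with the prompt itself. When only the continuation is needed, the prompt can be stripped afterwards, as in the sketch below, which uses two of the outputs copied from above. The helper name `continuations` is an assumption made for illustration. The text-generation pipeline also accepts `return_full_text=False`, which should have a similar effect without post-processing.

```python
prompt = "今天天氣很好，要去哪裡玩?"

# Two of the generated sequences copied from the output above.
outputs = [
    {
        "generated_text": "今天天氣很好，要去哪裡玩?\n\n\nBut I like my life and I'm not "
        "thinking about any of that, I see a lot of great things"
    },
    {"generated_text": "今天天氣很好，要去哪裡玩?我们一格了。我内高手有人指"},
]


def continuations(prompt, outputs):
    """Return only the newly generated text for each output."""
    result = []
    for item in outputs:
        text = item["generated_text"]
        if text.startswith(prompt):
            text = text[len(prompt):]  # drop the echoed prompt
        result.append(text.strip())
    return result


for continuation in continuations(prompt, outputs):
    print(repr(continuation))
```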

### Using any model from the Hub in a pipeline
You can pick a specific model from the Hub and use it in a pipeline for a given task.

In [None]:
from transformers import pipeline

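# Specify both the task and the model. Device placement belongs in the
# pipeline() call, not in the generation call.
# Loading this model asks for confirmation because the repository ships custom
# code; after reviewing that code, pass trust_remote_code=True to skip the prompt.
generator = pipeline(
    "text-generation",
    model="microsoft/Phi-3-mini-4k-instruct",
    device_map="cuda",  # place the model on the GPU
)
generator(
    "4090上市了要去組電腦嗎?，",  # "The 4090 is out, should I build a PC?"
    max_length=30,  # maximum total length in tokens
    num_return_sequences=5,  # number of sequences to return
)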

The repository for microsoft/Phi-3-mini-4k-instruct contains custom code which must be executed to correctly load the model. You can inspect the repository content at https://hf.co/microsoft/Phi-3-mini-4k-instruct.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y

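### Mask filling
Fill in the blanks in a given text. The `top_k` argument controls how many candidates are returned for each blank. Note that the model fills in the special `<mask>` word, usually called the mask token; other models may use a different mask token.<br>
Denoising (corruption-based) objective: the input is a sentence with some words removed, and the model outputs the complete sentence.<br>
NSP (next sentence prediction): given two sentences, the model predicts whether the second one follows the first.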

In [None]:
from transformers import pipeline

unmasker = pipeline("fill-mask")  # fill-in-the-blank task
# unmasker("This course will teach <mask> all about <mask> models.", top_k=2)  # <mask> marks each blank; top_k=2 returns two candidates per blank
unmasker("This course will teach <mask> all about <mask> models.", top_k=1)

No model was supplied, defaulted to distilbert/distilroberta-base and revision ec58a5b (https://huggingface.co/distilbert/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.


Some weights of the model checkpoint at distilbert/distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[[{'score': 0.5968282222747803,
   'token': 201,
   'token_str': ' us',
   'sequence': '<s>This course will teach us all about<mask> models.</s>'}],
 [{'score': 0.2055044323205948,
   'token': 30412,
   'token_str': ' mathematical',
   'sequence': '<s>This course will teach<mask> all about mathematical models.</s>'}]]
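With two `<mask>` tokens in the input, the output above is a nested list: one inner list of candidates per mask, and each candidate's `sequence` still contains the other mask. The sketch below, working on a rounded copy of that output, takes the best candidate for each mask and substitutes them into the template. The helper names are assumptions made for illustration, and combining independently predicted words is only a heuristic, since the model did not score the two words jointly.

```python
# Rounded copy of the fill-mask output shown above (one inner list per mask).
fill_mask_output = [
    [
        {
            "score": 0.5968,
            "token": 201,
            "token_str": " us",
            "sequence": "<s>This course will teach us all about<mask> models.</s>",
        }
    ],
    [
        {
            "score": 0.2055,
            "token": 30412,
            "token_str": " mathematical",
            "sequence": "<s>This course will teach<mask> all about mathematical models.</s>",
        }
    ],
]


def best_fill_per_mask(output):
    """Return the highest-scoring word for each mask position."""
    return [
        max(candidates, key=lambda c: c["score"])["token_str"].strip()
        for candidates in output
    ]


def fill_template(template, words, mask_token="<mask>"):
    """Replace each mask token, left to right, with the corresponding word."""
    for word in words:
        template = template.replace(mask_token, word, 1)
    return template


words = best_fill_per_mask(fill_mask_output)
print(fill_template("This course will teach <mask> all about <mask> models.", words))
```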

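### Named entity recognition
Named entity recognition (NER) is a task in which the model must find which parts of the input text correspond to entities such as persons, locations, or organizations.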

In [None]:
from transformers import pipeline

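# aggregation_strategy="simple" groups sub-tokens into whole entities;
# it replaces the deprecated grouped_entities=True.
ner = pipeline("ner", aggregation_strategy="simple")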
ner("My name is Sylvain and I work at Hugging Face in Brooklyn.")

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).




[{'entity_group': 'PER',
  'score': 0.9981694,
  'word': 'Sylvain',
  'start': 11,
  'end': 18},
 {'entity_group': 'ORG',
  'score': 0.9796019,
  'word': 'Hugging Face',
  'start': 33,
  'end': 45},
 {'entity_group': 'LOC',
  'score': 0.9932106,
  'word': 'Brooklyn',
  'start': 49,
  'end': 57}]
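The grouped output above is a flat list of entities. A common next step is to collect them by type, optionally dropping low-confidence ones. The sketch below does this on a copy of the output; the helper name `group_by_type` and its `min_score` parameter are assumptions made for illustration.

```python
from collections import defaultdict

# Copy of the NER output shown above.
entities = [
    {"entity_group": "PER", "score": 0.9981694, "word": "Sylvain", "start": 11, "end": 18},
    {"entity_group": "ORG", "score": 0.9796019, "word": "Hugging Face", "start": 33, "end": 45},
    {"entity_group": "LOC", "score": 0.9932106, "word": "Brooklyn", "start": 49, "end": 57},
]


def group_by_type(entities, min_score=0.0):
    """Map each entity type to the list of words found for it."""
    grouped = defaultdict(list)
    for entity in entities:
        if entity["score"] >= min_score:  # hypothetical confidence filter
            grouped[entity["entity_group"]].append(entity["word"])
    return dict(grouped)


print(group_by_type(entities))
print(group_by_type(entities, min_score=0.99))
```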

In [None]:
from transformers import pipeline
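# A Chinese NER model from the Hub.
# aggregation_strategy="simple" replaces the deprecated grouped_entities=True.
ner = pipeline("ner", model="gyr66/bert-base-chinese-finetuned-ner", aggregation_strategy="simple")
# Alternatives that were tried:
# ner = pipeline("ner", model="microsoft/phi-1_5", aggregation_strategy="simple")
# ner = pipeline("ner", model="ckiplab/bert-base-chinese-ner", aggregation_strategy="simple")
# Input: "My name is Chen Xiaoming, and I attend classes at Minghsin University of Science and Technology in Hsinchu."
ner("我的名字叫陳小明，我在新竹明新科大上課。")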

[{'entity_group': 'name',
  'score': 0.9999055,
  'word': '陳 小 明',
  'start': 5,
  'end': 8},
 {'entity_group': 'organization',
  'score': 0.9951236,
  'word': '新 竹 明 新 科 大',
  'start': 11,
  'end': 17}]
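In the Chinese output above, the `word` field has spaces between characters (for example `'陳 小 明'`), an artifact of how the tokens were joined. The `start` and `end` character offsets can be used to recover the exact text from the input instead, as in this sketch. The helper name `surface_forms` is an assumption made for illustration.

```python
text = "我的名字叫陳小明，我在新竹明新科大上課。"

# Copy of the NER output shown above.
entities = [
    {"entity_group": "name", "score": 0.9999055, "word": "陳 小 明", "start": 5, "end": 8},
    {"entity_group": "organization", "score": 0.9951236, "word": "新 竹 明 新 科 大", "start": 11, "end": 17},
]


def surface_forms(text, entities):
    """Slice the original text with each entity's character offsets."""
    return [(e["entity_group"], text[e["start"]:e["end"]]) for e in entities]


print(surface_forms(text, entities))
```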


### Question answering
The question-answering pipeline answers questions using information from a given context:

In [None]:
from transformers import pipeline

question_answerer = pipeline("question-answering")
question_answerer(
    question="Where do I work?",  # the question to answer
    context="My name is Sylvain and I work at Hugging Face in Brooklyn",  # the context to search for the answer
)

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'score': 0.6949753165245056, 'start': 33, 'end': 45, 'answer': 'Hugging Face'}
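This pipeline extracts the answer from the context rather than generating it, which is why the result carries `start` and `end` character offsets. The sketch below uses those offsets to mark the answer inside the context; the helper name `highlight_answer` and the bracket markers are assumptions made for illustration.

```python
context = "My name is Sylvain and I work at Hugging Face in Brooklyn"

# Copy of the question-answering output shown above.
result = {"score": 0.6949753165245056, "start": 33, "end": 45, "answer": "Hugging Face"}


def highlight_answer(context, result, left="[", right="]"):
    """Wrap the answer span in markers, using the character offsets."""
    start, end = result["start"], result["end"]
    return context[:start] + left + context[start:end] + right + context[end:]


# The offsets point at exactly the answer text inside the context.
print(context[result["start"]:result["end"]] == result["answer"])
print(highlight_answer(context, result))
```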

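### Summarization
Summarization is the task of reducing a text to a shorter one while keeping the main (important) information in it. Here is an example: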

In [None]:
from transformers import pipeline

summarizer = pipeline("summarization")
summarizer(
    """
    America has changed dramatically during recent years. Not only has the number of
    graduates in traditional engineering disciplines such as mechanical, civil,
    electrical, chemical, and aeronautical engineering declined, but in most of
    the premier American universities engineering curricula now concentrate on
    and encourage largely the study of engineering science. As a result, there
    are declining offerings in engineering subjects dealing with infrastructure,
    the environment, and related issues, and greater concentration on high
    technology subjects, largely supporting increasingly complex scientific
    developments. While the latter is important, it should not be at the expense
    of more traditional engineering.

    Rapidly developing economies such as China and India, as well as other
    industrial countries in Europe and Asia, continue to encourage and advance
    the teaching of engineering. Both China and India, respectively, graduate
    six and eight times as many traditional engineers as does the United States.
    Other industrial countries at minimum maintain their output, while America
    suffers an increasingly serious decline in the number of engineering graduates
    and a lack of well-educated engineers.
"""
)

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'summary_text': ' The number of engineering graduates in the United States has declined in recent years . China and India graduate six and eight times as many traditional engineers as the U.S. does . Rapidly developing economies such as China continue to encourage and advance the teaching of engineering . There are declining offerings in engineering subjects dealing with infrastructure, infrastructure, the environment, and related issues .'}]

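### Translation
For translation, you can use a default model if you provide a language pair in the task name (such as `translation_en_to_fr`), but the easiest way is to pick the model you want to use on the Model Hub. The example below translates from French to English: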

In [None]:
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")
translator("Ce cours est produit par Hugging Face.")

[{'translation_text': 'This course is produced by Hugging Face.'}]