# Behind the pipeline (PyTorch)

Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

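| Feature | Encoder-only (BERT) | Decoder-only (GPT) |
|---|---|---|
| Working mode | Representation and mapping | Prompt and generation |
| Core mechanism | Extracts the [CLS] vector and maps it through an additional classification layer. | Recasts the task as text continuation and directly generates the word that names the class. |
| Attention | Bidirectional; each token sees the full context. | Unidirectional; predicts the next token from left to right. |
| Required components | Needs a task-specific "head" on top of the model. | Needs no extra components; the model itself can complete the task. |
| Training paradigm | Usually requires fine-tuning. | Can work zero-shot or few-shot. |
| Analogy | A specialist lab technician: takes a sample and uses an instrument (the classification head) to determine its composition (the class). | A learned sage: you ask a question in natural language and get the answer back in natural language. |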

In [None]:
!pip install datasets evaluate transformers[sentencepiece]

In [1]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier(
    [
        "I've been waiting for a HuggingFace course my whole life.",
        "I hate this so much!",
    ]
)

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.

[{'label': 'POSITIVE', 'score': 0.9598049521446228},
 {'label': 'NEGATIVE', 'score': 0.9994558691978455}]

In [2]:
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [3]:
raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)

{'input_ids': tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102],
        [  101,  1045,  5223,  2023,  2061,  2172,   999,   102,     0,     0,
             0,     0,     0,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}
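
The output above shows the shorter sentence padded with zeros and a matching `attention_mask` that marks real tokens with 1 and padding with 0. As a rough illustration of what `padding=True` does to a batch, here is a minimal pure-Python sketch. The helper name `pad_batch` is hypothetical and is not part of the Transformers API; the token IDs and the pad ID of 0 are copied from the output above.

```python
# Token IDs copied from the tokenizer output above, with the padding stripped
# from the second sentence so we can re-create it by hand.
sequences = [
    [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037,
     17662, 12172, 2607, 2026, 2878, 2166, 1012, 102],
    [101, 1045, 5223, 2023, 2061, 2172, 999, 102],
]

PAD_ID = 0  # the padding ID that appears in the output above


def pad_batch(seqs, pad_id=PAD_ID):
    """Hypothetical helper: pad every sequence to the longest one in the batch."""
    max_len = max(len(s) for s in seqs)
    input_ids = [s + [pad_id] * (max_len - len(s)) for s in seqs]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(s) + [0] * (max_len - len(s)) for s in seqs]
    return input_ids, attention_mask


input_ids, attention_mask = pad_batch(sequences)
print(input_ids[1])
print(attention_mask[1])
```

Both rows end up with length 16, which is why the tokenizer can stack them into a single rectangular tensor.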


In [4]:
from transformers import AutoModel

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(checkpoint)

In [6]:
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)

print(outputs)

torch.Size([2, 16, 768])
BaseModelOutput(last_hidden_state=tensor([[[-0.1798,  0.2333,  0.6321,  ..., -0.3017,  0.5008,  0.1481],
         [ 0.2758,  0.6497,  0.3200,  ..., -0.0760,  0.5136,  0.1329],
         [ 0.9046,  0.0985,  0.2950,  ...,  0.3352, -0.1407, -0.6464],
         ...,
         [ 0.1466,  0.5661,  0.3235,  ..., -0.3376,  0.5100, -0.0561],
         [ 0.7500,  0.0487,  0.1738,  ...,  0.4684,  0.0030, -0.6084],
         [ 0.0519,  0.3729,  0.5223,  ...,  0.3584,  0.6500, -0.3883]],

        [[-0.2937,  0.7283, -0.1497,  ..., -0.1187, -1.0227, -0.0422],
         [-0.2206,  0.9384, -0.0951,  ..., -0.3643, -0.6605,  0.2407],
         [-0.1536,  0.8988, -0.0728,  ..., -0.2189, -0.8528,  0.0710],
         ...,
         [-0.3017,  0.9002, -0.0200,  ..., -0.1082, -0.8412, -0.0861],
         [-0.3338,  0.9674, -0.0729,  ..., -0.1952, -0.8181, -0.0634],
         [-0.3454,  0.8824, -0.0426,  ..., -0.0993, -0.8329, -0.1065]]],
       grad_fn=<NativeLayerNormBackward0>), hidden_states

In [None]:
from transformers import AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)

In [None]:
print(outputs.logits.shape)

torch.Size([2, 2])
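
The base model returned one 768-dimensional vector per token (`[2, 16, 768]`), while the classification model returns two logits per sentence (`[2, 2]`). As the comparison table describes, an encoder-only classifier takes the vector at the `[CLS]` position and maps it through an extra layer. The sketch below illustrates that idea in pure Python with toy sizes. The weights, bias, and the single linear layer are hypothetical placeholders, not the actual head of this checkpoint.

```python
# Toy hidden states: batch of 2 sentences, 3 tokens each, hidden size 4.
# (The real model in this notebook uses hidden size 768.)
hidden_states = [
    [[0.5, -1.0, 0.25, 2.0], [0.1, 0.2, 0.3, 0.4], [0.0, 0.0, 0.0, 0.0]],
    [[-1.0, 0.5, 1.0, -0.5], [0.9, 0.8, 0.7, 0.6], [0.0, 0.0, 0.0, 0.0]],
]

# Hypothetical head parameters: one weight row and one bias per class.
weights = [[1.0, 0.0, -1.0, 0.5], [-1.0, 1.0, 0.0, 0.5]]
bias = [0.1, -0.1]


def classify(batch):
    """Map each sentence's first-token vector to one logit per class."""
    logits = []
    for tokens in batch:
        cls_vector = tokens[0]  # position 0 holds the [CLS] token
        logits.append([
            sum(w * x for w, x in zip(row, cls_vector)) + b
            for row, b in zip(weights, bias)
        ])
    return logits


logits = classify(hidden_states)
print(len(logits), len(logits[0]))  # batch size, number of classes
```

The token dimension disappears because only one vector per sentence reaches the head, and the hidden dimension is replaced by the number of classes.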

In [None]:
print(outputs.logits)

tensor([[-1.5607,  1.6123],
        [ 4.1692, -3.3464]], grad_fn=<AddmmBackward>)

In [None]:
import torch

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)

tensor([[4.0195e-02, 9.5980e-01],
        [9.9946e-01, 5.4418e-04]], grad_fn=<SoftmaxBackward>)
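
To see where these numbers come from, the first row can be worked by hand. Softmax turns a vector of logits $z$ into probabilities:

$$
\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_j e^{z_j}}
$$

For the first sentence, $z = (-1.5607,\ 1.6123)$, so the probability at index 1 is

$$
p_1 = \frac{e^{1.6123}}{e^{-1.5607} + e^{1.6123}} = \frac{1}{1 + e^{-(1.6123 + 1.5607)}} = \frac{1}{1 + e^{-3.1730}} \approx 0.9598,
$$

and $p_0 = 1 - p_1 \approx 0.0402$, which agrees with the first row printed above.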

In [None]:
model.config.id2label

{0: 'NEGATIVE', 1: 'POSITIVE'}

# Optimized script

In [7]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# 1. Load the tokenizer (same as before)
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# 2. Prepare the input data (also the same as before)
raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print("Tokenized Inputs:")
print(inputs)
print("-" * 30)

# 3. Load the model with a classification head (this is the key change)
# Note: we use AutoModelForSequenceClassification instead of AutoModel
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

# 4. Run the prediction
# Wrapping inference in torch.no_grad() is good practice: we are not training,
# so there is no need to track gradients
with torch.no_grad():
    # **inputs unpacks the dict into keyword arguments: input_ids=..., attention_mask=...
    outputs = model(**inputs)

# 5. Read the output
print("Model Outputs (Logits):")
logits = outputs.logits
print(logits)
print("-" * 30)

# 6. Convert the logits into a readable prediction
# Logits are raw scores; take the index of the highest-scoring class
predictions = torch.argmax(logits, dim=-1)
print("Predicted Class IDs:", predictions)  # prints tensor([1, 0])

# 7. Map the predicted IDs back to label names
# The model config stores the ID-to-label mapping
labels = [model.config.id2label[p.item()] for p in predictions]
print("Predicted Labels:", labels)  # prints ['POSITIVE', 'NEGATIVE']


Tokenized Inputs:
{'input_ids': tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102],
        [  101,  1045,  5223,  2023,  2061,  2172,   999,   102,     0,     0,
             0,     0,     0,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}
------------------------------
Model Outputs (Logits):
tensor([[-1.5607,  1.6123],
        [ 4.1692, -3.3464]])
------------------------------
Predicted Class IDs: tensor([1, 0])
Predicted Labels: ['POSITIVE', 'NEGATIVE']
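
The script above stops at label names, whereas the `pipeline` call at the top of the notebook also reported a score for each label. The following pure-Python sketch shows one way to recover that label-plus-score shape from the logits and the `id2label` mapping printed earlier. The function names are hypothetical, and this is only an approximation of the pipeline's post-processing, not its actual implementation.

```python
import math

# Values copied from the outputs shown earlier in this notebook.
logits = [[-1.5607, 1.6123], [4.1692, -3.3464]]
id2label = {0: "NEGATIVE", 1: "POSITIVE"}


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    shift = max(row)
    exps = [math.exp(v - shift) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def to_pipeline_results(batch_logits):
    """Hypothetical helper: keep the top label and its probability per sentence."""
    results = []
    for row in batch_logits:
        probs = softmax(row)
        best = max(range(len(probs)), key=probs.__getitem__)
        results.append({"label": id2label[best], "score": probs[best]})
    return results


results = to_pipeline_results(logits)
for r in results:
    print(r["label"], round(r["score"], 4))
```

Taking the argmax of the probabilities gives the same class as taking the argmax of the logits, since softmax preserves ordering; the softmax step is only needed to report a score.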
