# II. Applying NLP Models: Text Classification

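HuggingFace hosts a very large model hub. Many of the models on it are mature, well-established models that give good predictions on their target task without any further training. This kind of off-the-shelf use is often loosely called zero-shot learning, although the model used here has already been fine-tuned on SST-2 for sentiment classification, so strictly speaking it is simply being used for inference.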

In [1]:
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

### 1) Download the model

In [2]:
# Download the model through the hf-mirror.com endpoint (uncomment to run)
#!HF_ENDPOINT=https://hf-mirror.com hf download distilbert/distilbert-base-uncased-finetuned-sst-2-english --local-dir ../models/distilbert/distilbert-base-uncased-finetuned-sst-2-english

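### 2) Load the model with pipeline

With the pipeline utility, the caller only needs to specify the task type. If no model is given, the pipeline picks a default model for that task and returns predictions directly; here the local DistilBERT checkpoint is passed explicitly through `model`. If the predictions already meet the caller's needs, no further training is required.

The pipeline API is concise and hides a large amount of low-level code, so even non-specialists can use it easily.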

In [3]:
# Load the DistilBERT model for the text classification task
from transformers import pipeline
classifier = pipeline(task="sentiment-analysis",
                      model="../models/distilbert/distilbert-base-uncased-finetuned-sst-2-english",
                      device=device)

Device set to use cuda


### 3) Inspect the model configuration

In [4]:
# Print the model configuration
print(classifier.model.config)

DistilBertConfig {
  "activation": "gelu",
  "architectures": [
    "DistilBertForSequenceClassification"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "finetuning_task": "sst-2",
  "hidden_dim": 3072,
  "id2label": {
    "0": "NEGATIVE",
    "1": "POSITIVE"
  },
  "initializer_range": 0.02,
  "label2id": {
    "NEGATIVE": 0,
    "POSITIVE": 1
  },
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "output_past": true,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "transformers_version": "4.57.1",
  "vocab_size": 30522
}



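From this we can identify the model as distilbert-base-uncased-finetuned-sst-2-english (the same name as the local directory it was loaded from).

Searching for the model name on huggingface.co leads to the page https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english

Alternatively, searching on hf-mirror.com leads to the page https://hf-mirror.com/distilbert/distilbert-base-uncased-finetuned-sst-2-english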
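
As a rough sanity check on the configuration printed above, the parameter count can be estimated from `vocab_size`, `max_position_embeddings`, `dim`, `hidden_dim` and `n_layers`. The sketch below assumes the usual DistilBERT layout: word and position embeddings followed by a LayerNorm, six Transformer blocks each with four attention projections and a two-layer feed-forward network, then a `dim`-by-`dim` pre-classifier and a two-way classifier. That layer breakdown is an assumption about the architecture, not something the config dump states, and the helper `linear` is written here only for illustration.

```python
# Values copied from the DistilBertConfig printed above.
vocab_size = 30522
max_position_embeddings = 512
dim = 768
hidden_dim = 3072
n_layers = 6
num_labels = 2  # NEGATIVE / POSITIVE


def linear(n_in, n_out):
    """Parameters of a dense layer: weight matrix plus bias."""
    return n_in * n_out + n_out


layer_norm = 2 * dim  # scale and shift vectors

# Assumed layout: word + position embeddings followed by a LayerNorm.
embeddings = vocab_size * dim + max_position_embeddings * dim + layer_norm

# Assumed layout of one block: Q, K, V and output projections,
# a feed-forward network (dim -> hidden_dim -> dim), and two LayerNorms.
block = (4 * linear(dim, dim)
         + linear(dim, hidden_dim)
         + linear(hidden_dim, dim)
         + 2 * layer_norm)

# Assumed classification head: pre-classifier plus final classifier.
head = linear(dim, dim) + linear(dim, num_labels)

total = embeddings + n_layers * block + head
print(f"estimated parameters: {total:,}")
```

Under these assumptions the estimate comes to roughly 67 million parameters, most of them in the six Transformer blocks and the word embeddings.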

### 4) Save the loaded model to another local folder

In [5]:
# Save the loaded model to another local folder (uncomment to run)
#classifier.save_pretrained("../models/other_model_name")

### 5) Predict with the model

In [6]:
sentence1 = "I hate you"
sentence2 = "I love you"
sentence3 = "There is an old tree"
results = classifier([sentence1, sentence2, sentence3])
for r in results:
    print(r)

{'label': 'NEGATIVE', 'score': 0.9991129040718079}
{'label': 'POSITIVE', 'score': 0.9998656511306763}
{'label': 'NEGATIVE', 'score': 0.6850801706314087}
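
The third sentence is neutral, yet a binary classifier must still choose one of its two labels, and the score of about 0.69 reflects that low confidence. A minimal sketch of one way to handle this, assuming a hypothetical cutoff `threshold=0.9` and a hypothetical helper `flag_uncertain` (neither is part of transformers), applied to the outputs printed above:

```python
# Results copied from the pipeline output above.
results = [
    {"label": "NEGATIVE", "score": 0.9991129040718079},
    {"label": "POSITIVE", "score": 0.9998656511306763},
    {"label": "NEGATIVE", "score": 0.6850801706314087},
]


def flag_uncertain(predictions, threshold=0.9):
    """Relabel predictions whose score is below `threshold` as UNCERTAIN.

    `threshold` is an illustrative value, not a recommended setting.
    """
    flagged = []
    for p in predictions:
        label = p["label"] if p["score"] >= threshold else "UNCERTAIN"
        flagged.append({"label": label, "score": p["score"]})
    return flagged


for r in flag_uncertain(results):
    print(r)
```

With this cutoff the first two predictions keep their labels and the third is marked `UNCERTAIN`. A suitable threshold depends on the application and would need to be chosen on held-out data.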


### 6) Load the model with from_pretrained

See the model page: https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english

or its mirror: https://hf-mirror.com/distilbert/distilbert-base-uncased-finetuned-sst-2-english

In [7]:
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification

tokenizer = DistilBertTokenizer.from_pretrained("../models/distilbert/distilbert-base-uncased-finetuned-sst-2-english")
model = DistilBertForSequenceClassification.from_pretrained("../models/distilbert/distilbert-base-uncased-finetuned-sst-2-english")

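# Tokenize one sentence and return PyTorch tensors
inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
with torch.no_grad():  # inference only, no gradients needed
    logits = model(**inputs).logits
print(logits)
# Take the argmax over the class dimension (the batch holds one sentence)
predicted_class_id = logits.argmax(dim=-1).item()
print(model.config.id2label)
print(model.config.id2label[predicted_class_id])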

tensor([[-4.0687,  4.3669]])
{0: 'NEGATIVE', 1: 'POSITIVE'}
POSITIVE
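
The `score` reported by the pipeline in step 5 is a probability, whereas the model itself returns raw logits. For a two-class, single-label model like this one, the two should be related by a softmax over the class dimension. The sketch below reproduces that step in plain Python on the logits printed above; the helper `softmax` and the hard-coded `id2label` dictionary are written here for illustration only.

```python
import math

# Logits printed above for "Hello, my dog is cute" (a batch of one sentence).
logits = [[-4.0687, 4.3669]]
id2label = {0: "NEGATIVE", 1: "POSITIVE"}  # copied from model.config.id2label


def softmax(row):
    """Numerically stable softmax over one row of logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


for row in logits:
    probs = softmax(row)
    # Argmax over the classes of this row, not over the whole batch.
    class_id = max(range(len(probs)), key=probs.__getitem__)
    print({"label": id2label[class_id], "score": probs[class_id]})
```

Taking the argmax per row, rather than over the flattened batch, keeps the label index valid when more than one sentence is classified at once.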


### 7) Assemble a pipeline from the model and tokenizer, then classify with it

In [8]:
pipe = pipeline(task='sentiment-analysis',
                model=model,
                tokenizer=tokenizer,
                device=device)

sentence1 = "I hate you"
sentence2 = "I love you"
sentence3 = "There is an old tree"
results = pipe([sentence1, sentence2, sentence3])
for r in results:
    print(r)

Device set to use cuda


{'label': 'NEGATIVE', 'score': 0.9991129040718079}
{'label': 'POSITIVE', 'score': 0.9998656511306763}
{'label': 'NEGATIVE', 'score': 0.6850801706314087}
