# 2. NLP Model Applications: Fill-Mask (Cloze Test)

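HuggingFace hosts a huge model hub. Some of its models are mature, well-established classics that produce reasonably good predictions without any additional training, which is often described as zero-shot learning. For fill-mask in particular, this works because predicting masked tokens is the objective such models were pretrained on.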

In [1]:
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

### 1) Download the model

In [2]:
# Download the model (uncomment the next line to run it; HF_ENDPOINT points to the hf-mirror.com mirror)
#!HF_ENDPOINT=https://hf-mirror.com hf download distilbert/distilroberta-base --local-dir ../models/distilbert/distilroberta-base
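
Before loading the model, it can help to confirm that the download actually populated the local directory. The following is a minimal sketch and not part of the original notebook: the helper name `looks_like_model_dir` is hypothetical, and it only assumes that a Transformers checkpoint directory contains a `config.json` file.

```python
from pathlib import Path


def looks_like_model_dir(path: str) -> bool:
    """Return True if `path` is a directory that contains a config.json file.

    Hypothetical helper: this is only a cheap sanity check. It does not
    validate the weights or the tokenizer files.
    """
    model_dir = Path(path)
    return model_dir.is_dir() and (model_dir / "config.json").is_file()


# Same relative path as the --local-dir used in the download command.
local_dir = "../models/distilbert/distilroberta-base"
print(local_dir, "->", looks_like_model_dir(local_dir))
```

If this prints `False`, the download probably did not run or wrote to a different directory.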

### 2) Load the model with pipeline

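With the pipeline utility, the caller only needs to specify the task type. If no model is given, the pipeline selects a default model for that task and returns predictions directly. If those predictions already meet the caller's needs, no further training is required.

The pipeline API is very concise and hides a large amount of low-level code, so even non-specialists can use it easily.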

In [3]:
# Fill-mask (cloze) task
from transformers import pipeline
unmasker = pipeline(task="fill-mask",
                    model="../models/distilbert/distilroberta-base", 
                    device=device)

Some weights of the model checkpoint at ../models/distilbert/distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda


### 3) Inspect the model configuration

In [4]:
# Print the model's configuration
print(unmasker.model.config)

RobertaConfig {
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 6,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "transformers_version": "4.57.1",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 50265
}
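
A few architectural quantities follow directly from this configuration. The sketch below is plain arithmetic on values copied from the printout, and the variable names are illustrative. The position offset reflects how the Transformers RoBERTa implementation numbers positions starting after `pad_token_id`; that detail is background knowledge rather than something the notebook itself shows.

```python
# Values copied from the RobertaConfig printout.
config = {
    "hidden_size": 768,
    "num_attention_heads": 12,
    "num_hidden_layers": 6,
    "max_position_embeddings": 514,
    "pad_token_id": 1,
    "vocab_size": 50265,
}

# Each attention head works on an equal slice of the hidden vector.
head_dim = config["hidden_size"] // config["num_attention_heads"]

# RoBERTa position ids start at pad_token_id + 1, so the first
# pad_token_id + 1 rows of the position table are not used for real tokens.
# The remaining count includes the special tokens <s> and </s>.
max_tokens = config["max_position_embeddings"] - (config["pad_token_id"] + 1)

# Size of the token embedding matrix (vocab_size x hidden_size).
embedding_params = config["vocab_size"] * config["hidden_size"]

print("head_dim =", head_dim)
print("max_tokens =", max_tokens)
print("embedding_params =", embedding_params)
```

In practice this means inputs longer than 512 tokens (special tokens included) need to be truncated or split before being passed to this model.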



### 4) Predict with the model

In [5]:
# Use the pipeline (backed by the local model) to predict the masked word
# 1) Define a sentence containing a <mask> token
sentence = 'HuggingFace is creating a <mask> that the community uses to solve NLP tasks.'
results = unmasker(sentence)
for r in results:
    print(r)

{'score': 0.17927496135234833, 'token': 3944, 'token_str': ' tool', 'sequence': 'HuggingFace is creating a tool that the community uses to solve NLP tasks.'}
{'score': 0.11349349468946457, 'token': 7208, 'token_str': ' framework', 'sequence': 'HuggingFace is creating a framework that the community uses to solve NLP tasks.'}
{'score': 0.052434857934713364, 'token': 5560, 'token_str': ' library', 'sequence': 'HuggingFace is creating a library that the community uses to solve NLP tasks.'}
{'score': 0.034935176372528076, 'token': 8503, 'token_str': ' database', 'sequence': 'HuggingFace is creating a database that the community uses to solve NLP tasks.'}
{'score': 0.028602290898561478, 'token': 17715, 'token_str': ' prototype', 'sequence': 'HuggingFace is creating a prototype that the community uses to solve NLP tasks.'}
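
Each result is a dictionary with the keys `score`, `token`, `token_str`, and `sequence`. As a small sketch of post-processing, the snippet below reuses the scores printed above (rounded) to strip the leading space from each candidate, keep those above a cutoff, and sum the probability mass covered by the top five. The `threshold` value and the variable names are arbitrary choices made for illustration.

```python
# Scores and tokens copied (rounded) from the pipeline output.
results = [
    {"score": 0.179275, "token_str": " tool"},
    {"score": 0.113493, "token_str": " framework"},
    {"score": 0.052435, "token_str": " library"},
    {"score": 0.034935, "token_str": " database"},
    {"score": 0.028602, "token_str": " prototype"},
]

threshold = 0.05  # arbitrary cutoff, chosen only for this illustration

# RoBERTa's tokenizer encodes the preceding space as part of the token,
# so strip it before displaying the word.
confident = [r["token_str"].strip() for r in results if r["score"] >= threshold]

# The top-5 scores do not sum to 1: the rest of the probability mass is
# spread over the remaining vocabulary entries.
top5_mass = sum(r["score"] for r in results)

print(confident)
print(round(top5_mass, 4))
```

The top five candidates together cover well under half of the probability mass, which suggests the model is not strongly committed to any single word for this sentence.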


### 5) Load a local model with from_pretrained

See the RoBERTa documentation: https://huggingface.co/docs/transformers/model_doc/roberta

Or the mirror: https://hf-mirror.com/docs/transformers/model_doc/roberta

In [6]:
# Load the local tokenizer and model explicitly with from_pretrained,
# then pass both objects to pipeline
from transformers import RobertaTokenizer, RobertaForMaskedLM
tokenizer = RobertaTokenizer.from_pretrained('../models/distilbert/distilroberta-base')
model = RobertaForMaskedLM.from_pretrained('../models/distilbert/distilroberta-base')

unmasker = pipeline(task='fill-mask', 
                    model=model,
                    tokenizer=tokenizer)

results = unmasker(sentence)
for r in results:
    print(r['score'], ' ', r['token_str'])

Some weights of the model checkpoint at ../models/distilbert/distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0


0.17927496135234833    tool
0.11349349468946457    framework
0.052434857934713364    library
0.034935176372528076    database
0.028602290898561478    prototype


In [7]:
from transformers import AutoTokenizer, RobertaForMaskedLM
import torch

tokenizer = AutoTokenizer.from_pretrained("../models/distilbert/distilroberta-base")
model = RobertaForMaskedLM.from_pretrained("../models/distilbert/distilroberta-base")

inputs = tokenizer("The capital of France is <mask>.", return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# retrieve index of <mask>
mask_token_index = (inputs.input_ids == tokenizer.mask_token_id)[0].nonzero(as_tuple=True)[0]
print('mask_token_index=', mask_token_index.item())

predicted_token_id = logits[0, mask_token_index].argmax(dim=-1)
print(tokenizer.decode(predicted_token_id))


Some weights of the model checkpoint at ../models/distilbert/distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


mask_token_index= 6
 Paris
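
The `score` values returned by the pipeline are probabilities derived from logits like the ones computed above. As a rough sketch of that last step, the snippet below applies a softmax to a made-up logit vector over a toy five-word vocabulary and picks the top candidates. The vocabulary, the logit values, and the helper names are all invented for illustration; the real model produces one logit per entry of its 50265-token vocabulary.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(vocab, logits, k):
    """Return the k (token, probability) pairs with the highest probability."""
    probs = softmax(logits)
    ranked = sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Toy vocabulary and invented logits for the <mask> position.
toy_vocab = [" Paris", " Lyon", " France", " London", " the"]
toy_logits = [9.1, 6.3, 5.8, 5.2, 2.0]

for token, prob in top_k(toy_vocab, toy_logits, k=3):
    print(f"{prob:.4f} {token.strip()}")
```

Taking only the `argmax`, as the cell above does, corresponds to `k=1` in this sketch.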
