# 2. NLP Model Applications: Named Entity Recognition

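Hugging Face hosts a very large model hub. Some of its models are mature, well-established checkpoints that have already been fine-tuned for a specific task, so they can produce good predictions directly, with no further training on your side. This is often loosely described as zero-shot use, although strictly speaking zero-shot learning means handling a task the model was never trained on, whereas the model used below, dslim/bert-base-NER, was fine-tuned specifically for named entity recognition.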

In [1]:
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

### 1) Download the model

In [2]:
# Download the model
#!HF_ENDPOINT=https://hf-mirror.com hf download dslim/bert-base-NER --local-dir ../models/dslim/bert-base-NER
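
The commented-out shell command above can also be expressed in Python. The following is a minimal sketch, assuming the `huggingface_hub` package is installed; `download_model` is a hypothetical helper name, and the mirror URL and target directory are taken from the command above. The call itself is left commented out, as in the original cell, because it needs network access.

```python
import os

REPO_ID = "dslim/bert-base-NER"
LOCAL_DIR = "../models/dslim/bert-base-NER"
MIRROR = "https://hf-mirror.com"


def download_model(repo_id=REPO_ID, local_dir=LOCAL_DIR, endpoint=MIRROR):
    """Download a full model repository into local_dir via the given endpoint."""
    # HF_ENDPOINT is read when huggingface_hub is first imported, so set it
    # before the import; this has no effect if the library is already loaded.
    os.environ["HF_ENDPOINT"] = endpoint
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id, local_dir=local_dir)


# download_model()  # uncomment to fetch the files
```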

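### 2) Load the model with pipeline

With the pipeline utility, the caller only needs to state the task type. If no model is specified, the pipeline falls back to a default model for that task and returns predictions directly. If those predictions already meet the caller's needs, no further training is required.

The pipeline API is very compact and hides a large amount of low-level code, so even non-specialists can use it easily.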

In [3]:
# Load the named entity recognition model
from transformers import pipeline
ner_pipe = pipeline(task="ner",
                    model="../models/dslim/bert-base-NER",
                    device=device)

Some weights of the model checkpoint at ../models/dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda


### 3) Inspect the model configuration

In [4]:
# Print the model configuration
print(ner_pipe.model.config)

BertConfig {
  "_num_labels": 9,
  "architectures": [
    "BertForTokenClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "O",
    "1": "B-MISC",
    "2": "I-MISC",
    "3": "B-PER",
    "4": "I-PER",
    "5": "B-ORG",
    "6": "I-ORG",
    "7": "B-LOC",
    "8": "I-LOC"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "B-LOC": 7,
    "B-MISC": 1,
    "B-ORG": 5,
    "B-PER": 3,
    "I-LOC": 8,
    "I-MISC": 2,
    "I-ORG": 6,
    "I-PER": 4,
    "O": 0
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "output_past": true,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.57.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 28996
}
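
The `id2label` and `label2id` entries above are inverse mappings over nine tags. Judging by the label names, the tags appear to follow the common BIO convention (an assumption: `B-` marks the beginning of an entity, `I-` a continuation, and `O` a token outside any entity), covering four entity types. A minimal sketch of how the tags can be taken apart; `split_tag` is a hypothetical helper, not part of transformers:

```python
# Label table copied from the config printed above. The JSON dump shows
# string keys, while model.config.id2label in Python is indexed by integers.
id2label = {0: "O", 1: "B-MISC", 2: "I-MISC", 3: "B-PER", 4: "I-PER",
            5: "B-ORG", 6: "I-ORG", 7: "B-LOC", 8: "I-LOC"}

# label2id is simply the inverse mapping.
label2id = {label: idx for idx, label in id2label.items()}


def split_tag(tag):
    """Split a tag such as 'B-ORG' into (prefix, entity_type); 'O' has no type."""
    if tag == "O":
        return "O", None
    prefix, entity_type = tag.split("-", 1)
    return prefix, entity_type


# Collect the distinct entity types, ignoring the "outside" tag.
entity_types = sorted({split_tag(t)[1] for t in id2label.values() if t != "O"})
print(entity_types)
print(split_tag("B-ORG"), label2id["I-LOC"])
```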



### 4) Run predictions with the model

In [5]:
# Named entity recognition
sequence = """Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO,
therefore very close to the Manhattan Bridge which is visible from the window."""

for entity in ner_pipe(sequence):
    print(entity)

{'entity': 'B-ORG', 'score': np.float32(0.99735945), 'index': 1, 'word': 'Hu', 'start': 0, 'end': 2}
{'entity': 'I-ORG', 'score': np.float32(0.98321056), 'index': 2, 'word': '##gging', 'start': 2, 'end': 7}
{'entity': 'I-ORG', 'score': np.float32(0.9978242), 'index': 3, 'word': 'Face', 'start': 8, 'end': 12}
{'entity': 'I-ORG', 'score': np.float32(0.9986395), 'index': 4, 'word': 'Inc', 'start': 13, 'end': 16}
{'entity': 'B-LOC', 'score': np.float32(0.9994894), 'index': 11, 'word': 'New', 'start': 40, 'end': 43}
{'entity': 'I-LOC', 'score': np.float32(0.99935216), 'index': 12, 'word': 'York', 'start': 44, 'end': 48}
{'entity': 'I-LOC', 'score': np.float32(0.99955374), 'index': 13, 'word': 'City', 'start': 49, 'end': 53}
{'entity': 'B-LOC', 'score': np.float32(0.7335543), 'index': 19, 'word': 'D', 'start': 79, 'end': 80}
{'entity': 'I-ORG', 'score': np.float32(0.57430446), 'index': 20, 'word': '##UM', 'start': 80, 'end': 82}
{'entity': 'I-ORG', 'score': np.float32(0.5355703), 'index': 21
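
The pipeline reports one dictionary per WordPiece token, so `Hugging` is split into `Hu` and `##gging`. To recover whole entities, consecutive tokens can be merged using the `B-`/`I-` prefixes and the `start`/`end` character offsets. The sketch below is one simple way to do this by hand; `group_entities` is a hypothetical helper, and the scores are rounded copies of the first seven rows above. The pipeline also accepts an `aggregation_strategy` argument (for example `"simple"`) that performs a similar merge internally, and its rules may differ from this sketch.

```python
def group_entities(text, token_entities):
    """Merge token-level B-/I- predictions into entity spans."""
    spans = []
    for tok in token_entities:
        prefix, _, entity_type = tok["entity"].partition("-")
        if prefix == "O":
            continue
        last = spans[-1] if spans else None
        # Extend the current span only for an adjacent I- tag of the same type.
        if (prefix == "I" and last is not None
                and last["type"] == entity_type
                and tok["index"] == last["last_index"] + 1):
            last["end"] = tok["end"]
            last["last_index"] = tok["index"]
            last["scores"].append(tok["score"])
        else:
            spans.append({"type": entity_type, "start": tok["start"],
                          "end": tok["end"], "last_index": tok["index"],
                          "scores": [tok["score"]]})
    # Slice the original text so that subword pieces and spaces are restored.
    return [{"type": s["type"],
             "text": text[s["start"]:s["end"]],
             "score": sum(s["scores"]) / len(s["scores"])}
            for s in spans]


text = "Hugging Face Inc. is a company based in New York City."
tokens = [
    {"entity": "B-ORG", "score": 0.9974, "index": 1, "start": 0, "end": 2},
    {"entity": "I-ORG", "score": 0.9832, "index": 2, "start": 2, "end": 7},
    {"entity": "I-ORG", "score": 0.9978, "index": 3, "start": 8, "end": 12},
    {"entity": "I-ORG", "score": 0.9986, "index": 4, "start": 13, "end": 16},
    {"entity": "B-LOC", "score": 0.9995, "index": 11, "start": 40, "end": 43},
    {"entity": "I-LOC", "score": 0.9994, "index": 12, "start": 44, "end": 48},
    {"entity": "I-LOC", "score": 0.9996, "index": 13, "start": 49, "end": 53},
]

entities = group_entities(text, tokens)
for entity in entities:
    print(entity["type"], repr(entity["text"]), round(entity["score"], 4))
```

Here the seven token rows collapse into two entities, an ORG span covering `Hugging Face Inc` and a LOC span covering `New York City`. The span score is the mean of the token scores, which is a design choice of this sketch rather than a fixed rule.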

### 5) Load the model with from_pretrained

See the model page at https://huggingface.co/dslim/bert-base-NER

or its mirror at https://hf-mirror.com/dslim/bert-base-NER

In [6]:
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

tokenizer = AutoTokenizer.from_pretrained("../models/dslim/bert-base-NER")
model = AutoModelForTokenClassification.from_pretrained("../models/dslim/bert-base-NER")

nlp = pipeline("ner", model=model, tokenizer=tokenizer)
example = "My name is Wolfgang and I live in Berlin Germany"

ner_results = nlp(example)
for r in ner_results:
    print(r)

Some weights of the model checkpoint at ../models/dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0


{'entity': 'B-PER', 'score': np.float32(0.99889874), 'index': 4, 'word': 'Wolfgang', 'start': 11, 'end': 19}
{'entity': 'B-LOC', 'score': np.float32(0.999503), 'index': 9, 'word': 'Berlin', 'start': 34, 'end': 40}
{'entity': 'I-LOC', 'score': np.float32(0.7820654), 'index': 10, 'word': 'Germany', 'start': 41, 'end': 48}


### 6) Getting finer control over the output (without the pipeline convenience wrapper)

In [7]:
print(tokenizer)
print(model)

BertTokenizerFast(name_or_path='../models/dslim/bert-base-NER', vocab_size=28996, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True, added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
)
BertForTokenClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embeddi

In [8]:
import torch
from transformers import BertTokenizerFast, BertForTokenClassification

tokenizer = BertTokenizerFast.from_pretrained("../models/dslim/bert-base-NER")
model = BertForTokenClassification.from_pretrained("../models/dslim/bert-base-NER")

example = "My name is Wolfgang and I live in Berlin Germany"
inputs = tokenizer(example, return_tensors="pt")
print(inputs.input_ids)
with torch.no_grad():
    logits = model(**inputs).logits
print(logits)
print(model.config.id2label)

Some weights of the model checkpoint at ../models/dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


tensor([[  101,  1422,  1271,  1110, 14326,  1105,   146,  1686,  1107,  3206,
          1860,   102]])
tensor([[[ 6.3067, -0.3467, -1.0567, -0.7085, -1.0865, -0.8405, -1.8050,
          -0.8185, -1.2252],
         [ 9.2293, -1.1977, -2.5957,  0.0104, -1.9989, -1.5482, -2.3986,
          -0.6703, -1.2614],
         [ 9.3833, -1.0201, -2.1171, -0.2670, -1.1930, -2.5010, -1.7489,
          -1.1891, -1.1439],
         [ 9.3078, -0.9415, -1.9995, -0.2671, -1.2567, -2.6290, -1.8227,
          -1.0745, -1.0053],
         [-0.4231, -1.4356, -2.8459,  8.0095, -0.3620, -0.6428, -2.8158,
          -0.0446, -2.0519],
         [ 9.5702, -1.4722, -1.7606, -0.9493, -0.8142, -2.2278, -1.2958,
          -1.1866, -1.4799],
         [ 9.1578, -1.3360, -2.6826, -0.0208, -1.6509, -1.8847, -2.2323,
          -0.6267, -1.2597],
         [ 9.7792, -1.1190, -2.5631, -1.0737, -1.6784, -1.8619, -1.6596,
          -0.7448, -1.4209],
         [ 9.7693, -0.6543, -2.6917, -1.0276, -2.2436, -1.7990, -1.9699,
       

In [9]:
for i, logit in enumerate(logits[0]):
    predicted_class_id = logit.argmax().item()
    print(tokenizer.decode(inputs.input_ids[0][i]), end='\t')
    print(predicted_class_id, end='\t')
    print(model.config.id2label[predicted_class_id])

[CLS]	0	O
My	0	O
name	0	O
is	0	O
Wolfgang	3	B-PER
and	0	O
I	0	O
live	0	O
in	0	O
Berlin	7	B-LOC
Germany	8	I-LOC
[SEP]	0	O
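
The loop above keeps only the arg-max class for each token. To also obtain a confidence value like the `score` field the pipeline reports, the logits can be passed through a softmax. The sketch below does this in plain Python for the `Wolfgang` row of the logits printed earlier (values copied from that output); with tensors, `torch.softmax(logits, dim=-1)` computes the same thing for all tokens at once.

```python
import math

# Logits for the token "Wolfgang", copied from the tensor printed above.
wolfgang_logits = [-0.4231, -1.4356, -2.8459, 8.0095, -0.3620,
                   -0.6428, -2.8158, -0.0446, -2.0519]

id2label = {0: "O", 1: "B-MISC", 2: "I-MISC", 3: "B-PER", 4: "I-PER",
            5: "B-ORG", 6: "I-ORG", 7: "B-LOC", 8: "I-LOC"}


def softmax(values):
    """Turn raw logits into probabilities that sum to 1."""
    largest = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - largest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax(wolfgang_logits)
best = max(range(len(probs)), key=probs.__getitem__)
print(id2label[best], round(probs[best], 4))
```

For this row the result is `B-PER` with a probability of about 0.9989, in line with the score of 0.99889874 that the pipeline reported for `Wolfgang` in section 5.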
