# Chinese Fine-Grained Named Entity Recognition

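Chinese fine-grained named entity recognition (NER for short) is mainly used to identify person names, place names, and organization names in text. It also covers other entity types such as dates, numbers, and currency amounts.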

References:

  ● [Summary of Chinese Named Entity Recognition](https://www.jianshu.com/p/34a5c6b9bb3e)

  ● [Chinese Fine-Grained Named Entity Recognition](https://zhuanlan.zhihu.com/p/103034432?utm_source=wechat_session)

  ● [NLP: Chinese Named Entity Recognition](https://blog.csdn.net/MaggicalQ/article/details/88980534)

  ● [Summary of Named Entities](https://www.cnblogs.com/nxf-rabbit75/archive/2019/04/18/10727769.html)


## Dataset


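The dataset for this project comes from the [Chinese fine-grained named entity recognition dataset](https://www.cluebenchmarks.com/introduce.html). It consists of three files: train.json, test.json, and dev.json.

  ● train.json: the training set. It contains both text and label, so it can be used for training.

  ● test.json: the test set. No labels are provided, so it cannot be scored locally. See the [official site](https://www.cluebenchmarks.com/introduce.html) for details.

  ● dev.json: the validation set. It contains both text and label, so it can be used for testing and validation.

This project uses train.json as the training and validation set, and dev.json as the test set. The dataset covers several entity types, and the number of examples differs from type to type.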
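
How the project divides train.json into training and validation parts is not shown in this notebook. As a minimal sketch, assuming a random hold-out controlled by a fraction (the function name `train_valid_split` and its parameters are hypothetical, not part of the project code), the split could look like this:

```python
import random

def train_valid_split(samples, valid_ratio=0.1, seed=0):
    """Shuffle a copy of samples and hold out valid_ratio of them for validation."""
    shuffled = list(samples)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_valid = int(len(shuffled) * valid_ratio)
    return shuffled[n_valid:], shuffled[:n_valid]

# Illustrative data only: 100 dummy samples.
train_part, valid_part = train_valid_split(range(100), valid_ratio=0.1)
print(len(train_part), len(valid_part))  # 90 10
```

Fixing the seed matters when comparing runs, because otherwise every run validates on a different subset.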

### Preprocessing

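First inspect the format of the downloaded dataset, then load and preprocess it. The data is in JSON Lines format: each line is one JSON object. `text` holds the raw text and `label` holds its annotations, which give the entity type, the entity string, and its start and end indices in the text (both inclusive).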
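
To make the nested `label` structure concrete, the following sketch parses one record copied from train.json and walks over its annotations. The helper name `iter_entities` is introduced here for illustration only.

```python
import json

# One record copied from train.json (one JSON object per line).
line = '{"text": "生生不息CSOL生化狂潮让你填弹狂扫", "label": {"game": {"CSOL": [[4, 7]]}}}'

def iter_entities(record):
    """Yield (entity_type, entity_text, start, end) for every labelled span."""
    for entity_type, mentions in record["label"].items():
        for entity_text, spans in mentions.items():
            for start, end in spans:
                yield entity_type, entity_text, start, end

record = json.loads(line)
for entity_type, entity_text, start, end in iter_entities(record):
    # The end index is inclusive, so the slice needs end + 1.
    assert record["text"][start:end + 1] == entity_text
    print(entity_type, entity_text, start, end)  # game CSOL 4 7
```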

In [1]:
!head -10 ../data/cluener_public/train.json

{"text": "浙商银行企业信贷部叶老桂博士则从另一个角度对五道门槛进行了解读。叶老桂认为，对目前国内商业银行而言，", "label": {"name": {"叶老桂": [[9, 11]]}, "company": {"浙商银行": [[0, 3]]}}}
{"text": "生生不息CSOL生化狂潮让你填弹狂扫", "label": {"game": {"CSOL": [[4, 7]]}}}
{"text": "那不勒斯vs锡耶纳以及桑普vs热那亚之上呢？", "label": {"organization": {"那不勒斯": [[0, 3]], "锡耶纳": [[6, 8]], "桑普": [[11, 12]], "热那亚": [[15, 17]]}}}
{"text": "加勒比海盗3：世界尽头》的去年同期成绩死死甩在身后，后者则即将赶超《变形金刚》，", "label": {"movie": {"加勒比海盗3：世界尽头》": [[0, 11]], "《变形金刚》": [[33, 38]]}}}
{"text": "布鲁京斯研究所桑顿中国中心研究部主任李成说，东亚的和平与安全，是美国的“核心利益”之一。", "label": {"address": {"美国": [[32, 33]]}, "organization": {"布鲁京斯研究所桑顿中国中心": [[0, 12]]}, "name": {"李成": [[18, 19]]}, "position": {"研究部主任": [[13, 17]]}}}
{"text": "目前主赞助商暂时空缺，他们的球衣上印的是“unicef”（联合国儿童基金会），是公益性质的广告；", "label": {"organization": {"unicef": [[21, 26]], "联合国儿童基金会": [[29, 36]]}}}
{"text": "此数据换算成亚洲盘罗马客场可让平半低水。", "label": {"organization": {"罗马": [[9, 10]]}}}
{"text": "你们是最棒的!#英雄联盟d学sanchez创作的原声王", "label": {"game": {"英雄联盟": [[8, 11]]}}}
{"text": "除了吴湖帆时现精彩，吴待秋、吴子

In [2]:
train_data_file = "../data/cluener_public/train.json"
test_data_file = "../data/cluener_public/dev.json"

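### Corpus Tagging Scheme

Corpus tagging assigns a special tag to every character in the corpus to indicate the role that character plays. For example, the two-character entity “罗马” is tagged B_organization and E_organization.

The corpus uses the BIOES tagging scheme:

  ● BIOES extends the IOB scheme into a more complex but more complete one. B marks the beginning of an entity (Begin), I marks the inside of an entity (Inside), O marks characters outside any entity (Outside), E marks the end of an entity (End), and S marks a character that forms an entity on its own (Single).

  ● BIOES is one of the most widely used tagging schemes for named entity recognition.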

In [3]:
import sys
sys.path.append("../")

import json
from module.core.data_tools import DataTools

# Tag prefixes used by the BIOES scheme
identifier_b, identifier_i, identifier_o, identifier_e, identifier_s = "B", "I", "O", "E", "S"

# Tag format: prefix and entity type joined by an underscore, e.g. "B_name"
identifier_format = lambda i, s: "{}_{}".format(i, s)

def handle(line):
    json_data = json.loads(line)

    # Get the text and its label annotations
    text = json_data['text']
    label = json_data['label']

    # Every character starts out tagged "O" (outside any entity)
    identifier = [identifier_o] * len(text)

    for ner_name, ner_value in label.items():
        for ner_str, ner_index in ner_value.items():
            for n_index in ner_index:
                # The end index is inclusive, hence the + 1 in the slice
                if text[n_index[0]:n_index[1] + 1] != ner_str:
                    # Raise instead of calling exit(), which would stop the notebook kernel
                    raise ValueError("Data error: span does not match entity string. text: {}, label: {}".format(text, label))
                # Single-character entity; these may not occur in the Chinese corpus
                if len(ner_str) == 1:
                    identifier[n_index[0]] = identifier_format(identifier_s, ner_name)

                # Two-character entity
                elif len(ner_str) == 2:
                    identifier[n_index[0]] = identifier_format(identifier_b, ner_name)
                    identifier[n_index[1]] = identifier_format(identifier_e, ner_name)

                # Entity with more than two characters
                elif len(ner_str) > 2:
                    identifier[n_index[0]] = identifier_format(identifier_b, ner_name)
                    for i in range(1, len(ner_str) - 2 + 1):
                        identifier[n_index[0] + i] = identifier_format(identifier_i, ner_name)
                    identifier[n_index[1]] = identifier_format(identifier_e, ner_name)

    return [text, identifier, label]

# Read the data with DataTools, passing in handle to process each line.
train_dataset = DataTools.Preprocess.read_file_data(train_data_file, handle_func=handle)
test_dataset = DataTools.Preprocess.read_file_data(test_data_file, handle_func=handle)

10748it [00:00, 32153.35it/s]
1343it [00:00, 28559.88it/s]


In [4]:
print("Train Dataset : ")
for i in range(20):
    print("text: ", train_dataset[i][0])
    print("identifier: ", train_dataset[i][1])
    print("label: ", train_dataset[i][2])
    print()
print()

print("Test Dataset : ")
for i in range(20):
    print("text: ", test_dataset[i][0])
    print("identifier: ", test_dataset[i][1])
    print("label: ", test_dataset[i][2])
    print()

Train Dataset : 
text:  浙商银行企业信贷部叶老桂博士则从另一个角度对五道门槛进行了解读。叶老桂认为，对目前国内商业银行而言，
identifier:  ['B_company', 'I_company', 'I_company', 'E_company', 'O', 'O', 'O', 'O', 'O', 'B_name', 'I_name', 'E_name', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
label:  {'name': {'叶老桂': [[9, 11]]}, 'company': {'浙商银行': [[0, 3]]}}

text:  生生不息CSOL生化狂潮让你填弹狂扫
identifier:  ['O', 'O', 'O', 'O', 'B_game', 'I_game', 'I_game', 'E_game', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
label:  {'game': {'CSOL': [[4, 7]]}}

text:  那不勒斯vs锡耶纳以及桑普vs热那亚之上呢？
identifier:  ['B_organization', 'I_organization', 'I_organization', 'E_organization', 'O', 'O', 'B_organization', 'I_organization', 'E_organization', 'O', 'O', 'B_organization', 'E_organization', 'O', 'O', 'B_organization', 'I_organization', 'E_organization', 'O', 'O', 'O', 'O']
label:  {'organization': {'那不勒斯': [[0, 3]], '锡耶

In [5]:
# Split each dataset into three parallel lists: data, identifier, and label.
train_data, train_identifier, train_label = list(), list(), list()
for (text, identifier, label) in train_dataset:
    train_data.append(text)
    train_identifier.append(identifier)
    train_label.append(label)
    
test_data, test_identifier, test_label = list(), list(), list()
for (text, identifier, label) in test_dataset:
    test_data.append(text)
    test_identifier.append(identifier)
    test_label.append(label)

In [6]:
# Count, for each entity type, the number of samples that contain it.
from collections import Counter

def entity_count(labels):
    entity_number = Counter()
    
    for label in labels:
        for entity_name, _ in label.items():
            entity_number.update([entity_name])
            
    for entity_name, entity_num in entity_number.items():
        print("{} : {}".format(entity_name, entity_num))

print("train entity: ")
entity_count(train_label)

print()

print("test entity: ")
entity_count(test_label)

train entity: 
name : 2847
company : 2215
game : 1897
organization : 1894
movie : 779
address : 2090
position : 2464
government : 1461
scene : 946
book : 908

test entity: 
address : 273
name : 352
organization : 206
game : 226
scene : 124
book : 121
company : 279
position : 347
government : 190
movie : 101


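The counts above show that the entity types are unevenly represented. In both splits, name is the most frequent type and movie is the least frequent.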
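
Note that `entity_count` increments once per sample for each entity type present, so it reports how many samples contain a type, not how many mentions of that type exist. If per-mention counts are wanted, a sketch along the following lines could be used; `count_mentions` is a hypothetical helper, not part of the project code.

```python
from collections import Counter

def count_mentions(labels):
    """Count labelled spans per entity type across all samples."""
    mentions = Counter()
    for label in labels:
        for entity_type, entities in label.items():
            # Each entity string maps to a list of [start, end] spans.
            mentions[entity_type] += sum(len(spans) for spans in entities.values())
    return mentions

# Two labels copied from the training samples shown earlier.
sample_labels = [
    {"name": {"叶老桂": [[9, 11]]}, "company": {"浙商银行": [[0, 3]]}},
    {"organization": {"那不勒斯": [[0, 3]], "锡耶纳": [[6, 8]], "桑普": [[11, 12]], "热那亚": [[15, 17]]}},
]
print(count_mentions(sample_labels))
```

For the second sample, `entity_count` would add 1 to organization, while this version adds 4.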

## Model Implementation

The BiLSTM + CRF model is implemented in PyTorch. 
Reference: [Implementing a BiLSTM + CRF model in PyTorch](https://pytorch.org/tutorials/beginner/nlp/advanced_tutorial.html)

For the implementation, see BLTP/module/ner/bilstm_crf/bilstm_crf_ner.py
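
In a BiLSTM + CRF tagger, the BiLSTM produces a score for each tag at each position, and the CRF layer adds tag-to-tag transition scores so that the best tag sequence is chosen as a whole, typically with the Viterbi algorithm. The following pure-Python sketch illustrates that decoding step only. It is not the project's implementation, it leaves out the start and stop transitions used in the PyTorch tutorial, and the scores are made up for illustration.

```python
def viterbi_decode(emissions, transitions):
    """Return (best_score, best_path) for a linear-chain CRF.

    emissions[t][j]   : score of tag j at position t (e.g. from a BiLSTM)
    transitions[i][j] : score of moving from tag i to tag j
    """
    num_tags = len(transitions)
    scores = list(emissions[0])  # best score of any path ending in each tag
    backpointers = []
    for emission in emissions[1:]:
        new_scores, pointers = [], []
        for j in range(num_tags):
            # Best previous tag for landing on tag j at this position.
            best_i = max(range(num_tags), key=lambda i: scores[i] + transitions[i][j])
            new_scores.append(scores[best_i] + transitions[best_i][j] + emission[j])
            pointers.append(best_i)
        scores = new_scores
        backpointers.append(pointers)
    # Follow the back-pointers from the best final tag to recover the path.
    last = max(range(num_tags), key=lambda j: scores[j])
    best_score = scores[last]
    path = [last]
    for pointers in reversed(backpointers):
        last = pointers[last]
        path.append(last)
    path.reverse()
    return best_score, path

# Made-up scores for two tags over three positions.
emissions = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
transitions = [[0.5, -1.0], [-1.0, 0.5]]
print(viterbi_decode(emissions, transitions))  # (3.0, [0, 0, 0])
```

Picking the highest emission at each position independently would give the path [0, 1, 0]. The transition penalties make [0, 0, 0] the better sequence overall, which is the effect the CRF layer is meant to have on tag consistency.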

### Building the Dictionary

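Characters cannot be fed into a neural network directly; they must first be converted to numbers. The dictionary maps every character and every tag to an integer id.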
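
The internals of `NERDictionary` are not shown in this notebook. As a rough illustration of the idea, the sketch below builds a character-to-id mapping with reserved padding and unknown entries. The class `SimpleVocab` and its special tokens are assumptions made for this example and may differ from what the project does.

```python
class SimpleVocab:
    """Hypothetical character/tag vocabulary; not the project's NERDictionary."""

    PAD, UNK = "<pad>", "<unk>"

    def __init__(self):
        self.token_to_id = {self.PAD: 0, self.UNK: 1}

    def fit(self, sequences):
        for sequence in sequences:
            for token in sequence:
                # Assign the next free id the first time a token is seen.
                self.token_to_id.setdefault(token, len(self.token_to_id))
        return self

    def encode(self, sequence):
        # Tokens never seen during fit map to the unknown id.
        unk_id = self.token_to_id[self.UNK]
        return [self.token_to_id.get(token, unk_id) for token in sequence]

char_vocab = SimpleVocab().fit(["罗马客场", "罗马"])
print(char_vocab.encode("罗马主场"))  # [2, 3, 1, 5]
```

The character “主” was not seen during `fit`, so it maps to the unknown id 1. The same class can be fitted on tag sequences to obtain tag ids.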

In [7]:
from module.ner.bilstm_crf.dictionary import NERDictionary

dictionary = NERDictionary()
dictionary.fit(train_data, train_identifier)

save_dict = "../model/BiLSTM_CRF_NER/ner_dictionary.pickle"
dictionary.save(save_dict)

10748it [00:00, 60601.93it/s]

word num: 3671, identifier num: 37
save dictionary success! File: ../model/BiLSTM_CRF_NER/ner_dictionary.pickle





### Model

In [8]:
from module.ner.bilstm_crf.bilstm_crf_ner import BiLSTM_CRF_NER

embedding_dim = 200
hidden_size = 256

ner_model = BiLSTM_CRF_NER(dictionary, embedding_dim=embedding_dim, hidden_size=hidden_size)

#### Training

In [None]:
epochs = 5
lr = 0.001
weight_decay = 0.0001
ratio = 0.1
save_model = "../model/BiLSTM_CRF_NER/ner_model.pth"

ner_model.fit(train_data, train_identifier, train_label, 
              epochs=epochs, 
              lr=lr, 
              weight_decay=weight_decay, 
              ratio=ratio, 
              save_model=save_model)

train epochs: 1/5, train step: 9678/9678, train loss: 28.9645
entity: address        entity number: 2733.0  recall: 0.0000 
entity: government     entity number: 1679.0  recall: 0.0000 
entity: organization   entity number: 2979.0  recall: 0.0007 
entity: name           entity number: 3385.0  recall: 0.0012 
entity: scene          entity number: 1309.0  recall: 0.0000 
entity: position       entity number: 2717.0  recall: 0.0037 
entity: game           entity number: 2246.0  recall: 0.0116 
entity: company        entity number: 2675.0  recall: 0.0000 
entity: movie          entity number: 956.0  recall: 0.0042 
entity: book           entity number: 980.0  recall: 0.0010 
entity number:21659.0, recall:0.0022
valid step: 1070/1070, valid loss: 15.6291
entity: book           entity number: 159.0  recall: 0.0000 
entity: game           entity number: 130.0  recall: 0.0154 
entity: company        entity number: 305.0  recall: 0.0033 
entity: position       entity number: 424.0  recall: 0.01

  "type " + obj.__name__ + ". It won't be checked "


train epochs: 4/5, train step: 9678/9678, train loss: 6.3193
entity: address        entity number: 2735.0  recall: 0.0150 
entity: movie          entity number: 971.0  recall: 0.0000 
entity: organization   entity number: 2993.0  recall: 0.0391 
entity: name           entity number: 3363.0  recall: 0.0535 
entity: position       entity number: 2746.0  recall: 0.0368 
entity: government     entity number: 1685.0  recall: 0.0576 
entity: company        entity number: 2709.0  recall: 0.0096 
entity: scene          entity number: 1290.0  recall: 0.0000 
entity: game           entity number: 2243.0  recall: 0.0709 
entity: book           entity number: 986.0  recall: 0.0740 
entity number:21721.0, recall:0.0366
valid step: 1042/1070, valid loss: 5.8494

#### Testing

In [15]:
ner_model.test(test_data, test_identifier, test_label)

100%|██████████| 1343/1343 [02:39<00:00,  8.41it/s]

entity: address       entity number: 373 precision: 0.0621,  recall: 0.0295,  F1.score: 0.0400
entity: name          entity number: 465 precision: 0.0904,  recall: 0.0344,  F1.score: 0.0498
entity: organization  entity number: 367 precision: 0.0508,  recall: 0.0245,  F1.score: 0.0331
entity: game          entity number: 295 precision: 0.1808,  recall: 0.1085,  F1.score: 0.1356
entity: scene         entity number: 209 precision: 0.0000,  recall: 0.0000,  F1.score: nan
entity: book          entity number: 154 precision: 0.0508,  recall: 0.0584,  F1.score: 0.0544
entity: company       entity number: 378 precision: 0.1525,  recall: 0.0714,  F1.score: 0.0973
entity: position      entity number: 433 precision: 0.2994,  recall: 0.1224,  F1.score: 0.1738
entity: government    entity number: 247 precision: 0.0452,  recall: 0.0324,  F1.score: 0.0377
entity: movie         entity number: 151 precision: 0.0678,  recall: 0.0795,  F1.score: 0.0732
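
The F1 score is the harmonic mean of precision and recall. The scene row reports `nan`, presumably because both precision and recall are zero and the formula then divides zero by zero. The sketch below shows one common way to compute the three metrics from entity-level counts while returning 0.0 in that case. The function name and the example counts are illustrative and are not taken from the project code or its results.

```python
def precision_recall_f1(true_positive, predicted, gold):
    """Return (precision, recall, F1) from entity-level counts.

    true_positive: predicted entities that match a gold entity
    predicted:     total number of predicted entities
    gold:          total number of gold entities
    Ratios with a zero denominator are reported as 0.0 instead of nan.
    """
    precision = true_positive / predicted if predicted else 0.0
    recall = true_positive / gold if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Illustrative counts: 5 correct out of 10 predictions, with 20 gold entities.
print(precision_recall_f1(5, 10, 20))
# No correct predictions: every metric is 0.0 rather than nan.
print(precision_recall_f1(0, 10, 20))  # (0.0, 0.0, 0.0)
```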





#### Prediction

In [46]:
sents = "会去玩玩星际2"

ner_model.predict(sents)

['I_game', 'O', 'O', 'O', 'B_game', 'I_game', 'E_game']
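
The predicted tags still need to be turned back into entity spans. The following sketch decodes a BIOES tag sequence in the `B_type` format used in this notebook. `decode_bioes` is a hypothetical helper, and dropping malformed tags (such as the leading `I_game` above, which has no opening `B_game`) is a design choice made for this example.

```python
def decode_bioes(tags):
    """Convert a BIOES tag sequence into (entity_type, start, end) spans.

    The end index is inclusive. Tags that do not form a valid span
    (for example an I tag without a preceding B) are skipped.
    """
    spans = []
    start, current = None, None
    for i, tag in enumerate(tags):
        prefix, _, entity_type = tag.partition("_")
        if prefix == "S":
            spans.append((entity_type, i, i))
            start, current = None, None
        elif prefix == "B":
            start, current = i, entity_type
        elif prefix == "I" and current == entity_type:
            continue
        elif prefix == "E" and current == entity_type:
            spans.append((entity_type, start, i))
            start, current = None, None
        else:
            # "O" or an inconsistent tag closes any open span without emitting it.
            start, current = None, None
    return spans

sentence = "会去玩玩星际2"
predicted = ['I_game', 'O', 'O', 'O', 'B_game', 'I_game', 'E_game']
for entity_type, start, end in decode_bioes(predicted):
    print(entity_type, sentence[start:end + 1])  # game 星际2
```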