<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#实战：BiLSTM+CRF进行词性标注" data-toc-modified-id="实战：BiLSTM+CRF进行词性标注-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>实战：<code>BiLSTM+CRF</code>进行词性标注</a></span><ul class="toc-item"><li><span><a href="#模型" data-toc-modified-id="模型-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>模型</a></span><ul class="toc-item"><li><span><a href="#BiLSTM模型" data-toc-modified-id="BiLSTM模型-1.1.1"><span class="toc-item-num">1.1.1&nbsp;&nbsp;</span><code>BiLSTM</code>模型</a></span></li><li><span><a href="#BiLSTM_CRF模型" data-toc-modified-id="BiLSTM_CRF模型-1.1.2"><span class="toc-item-num">1.1.2&nbsp;&nbsp;</span><code>BiLSTM_CRF</code>模型</a></span><ul class="toc-item"><li><span><a href="#CRF层" data-toc-modified-id="CRF层-1.1.2.1"><span class="toc-item-num">1.1.2.1&nbsp;&nbsp;</span><code>CRF</code>层</a></span></li><li><span><a href="#BiLSTM+CRF" data-toc-modified-id="BiLSTM+CRF-1.1.2.2"><span class="toc-item-num">1.1.2.2&nbsp;&nbsp;</span><code>BiLSTM+CRF</code></a></span></li></ul></li></ul></li><li><span><a href="#数据集" data-toc-modified-id="数据集-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>数据集</a></span><ul class="toc-item"><li><span><a href="#数据预处理" data-toc-modified-id="数据预处理-1.2.1"><span class="toc-item-num">1.2.1&nbsp;&nbsp;</span>数据预处理</a></span></li><li><span><a href="#创建数据管道" data-toc-modified-id="创建数据管道-1.2.2"><span class="toc-item-num">1.2.2&nbsp;&nbsp;</span>创建数据管道</a></span></li></ul></li><li><span><a href="#训练模型" data-toc-modified-id="训练模型-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>训练模型</a></span><ul class="toc-item"><li><span><a href="#BiLSTM" data-toc-modified-id="BiLSTM-1.3.1"><span class="toc-item-num">1.3.1&nbsp;&nbsp;</span>BiLSTM</a></span></li></ul></li><li><span><a href="#转移矩阵" data-toc-modified-id="转移矩阵-1.4"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>转移矩阵</a></span></li></ul></li><li><span><a href="#TODO" data-toc-modified-id="TODO-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>TODO</a></span></li></ul></div>

# 实战：`BiLSTM+CRF`进行词性标注

## 模型

In [1]:
import torch
import torch.nn as nn

### `BiLSTM`模型
```
  input_ids:   (batch, seq)
--> Embedding: (batch, seq, embedding_dim)
--> LSTM:      (batch, seq, hidden_dim)
--> Dropout
--> Linear:    (batch, seq, num_tags)

+ targets:     (batch, seq)
--> loss
```
训练时：模型 **前向计算**`(forward)` 输出序列的预测标签的概率分布，然后与真实标签计算损失，并进行训练    
预测时：模型输出序列的预测标签的概率分布，求最大值即为结果

In [2]:
class NERLSTM(nn.Module):
    def __init__(self, embedding_dim, hidden_dim, dropout, word2id, tag2id):
        super(NERLSTM, self).__init__()

        self.embedding_dim = embedding_dim
        self.hidden_dim = hidden_dim
        self.vocab_size = len(word2id) + 1
        self.tag_to_ix = tag2id
        self.num_tags = len(tag2id)

        self.word_embeds = nn.Embedding(self.vocab_size, self.embedding_dim)
        self.dropout = nn.Dropout(dropout)
        self.lstm = nn.LSTM(
            self.embedding_dim,
            self.hidden_dim // 2,
            num_layers=1,
            bidirectional=True,
            batch_first=True, # 该属性设置后，需要特别注意数据的形状
        )
        self.hidden2tag = nn.Linear(self.hidden_dim, self.num_tags)

    def forward(self, x):
        batch_size = x.size(0)
        sent_len = x.size(1)
        embedding = self.word_embeds(x)
        outputs, hidden = self.lstm(embedding)
        outputs = self.dropout(outputs)
        outputs = self.hidden2tag(outputs)
        return outputs

### `BiLSTM_CRF`模型

#### `CRF`层
- `forward`方法用来计算损失
- `vertibi_decode`方法用来获的最佳序列

In [3]:
# https://github.com/s14t284/TorchCRF/blob/master/TorchCRF/__init__.py
from typing import List, Optional

import torch
import torch.nn as nn
from torch import BoolTensor, FloatTensor, LongTensor


class CRF(nn.Module):
    CUDA = torch.cuda.is_available()

    def __init__(self,
                 num_labels: int,
                 pad_idx: Optional[int] = None,
                 use_gpu: bool = True) -> None:
        """
        :param num_labels: number of labels
        :param pad_idxL padding index. default None
        :return None
        """

        CRF.CUDA = CRF.CUDA and use_gpu
        if num_labels < 1:
            raise ValueError(
                "invalid number of labels: {0}".format(num_labels))

        super().__init__()
        self.num_labels = num_labels
        device = "cuda" if CRF.CUDA else "cpu"

        # 遷移行列の設定
        # 遷移行列のフォーマット (遷移元, 遷移先)
        # transition matrix setting
        # transition matrix format (source, destination)
        self.trans_matrix = FloatTensor(num_labels, num_labels).to(device)
        # 先頭と末尾への遷移行列の設定
        # transition matrix of start and end settings
        self.start_trans = FloatTensor(num_labels).to(device)
        self.end_trans = FloatTensor(num_labels).to(device)

        self._initialize_parameters(pad_idx)

        self.trans_matrix = nn.Parameter(self.trans_matrix)
        self.start_matrix = nn.Parameter(self.start_trans)
        self.end_matrix = nn.Parameter(self.end_trans)

    def forward(self, h: FloatTensor, labels: LongTensor,
                mask: BoolTensor) -> FloatTensor:
        """
        :param h: hidden matrix (seq_len, batch_size, num_labels)
        :param labels: answer labels of each sequence
                       in mini batch (seq_len, batch_size)
        :param mask: mask tensor of each sequence
                     in mini batch (seq_len, batch_size)
        :return: The log-likelihood (batch_size)
        """

        log_numerator = self._compute_numerator_log_likelihood(h, labels, mask)
        log_denominator = self._compute_denominator_log_likelihood(h, mask)

        return log_numerator - log_denominator

    def viterbi_decode(self, h: FloatTensor,
                       mask: BoolTensor) -> List[List[int]]:
        """
        decode labels using viterbi algorithm
        :param h: hidden matrix (batch_size, seq_len, num_labels)
        :param mask: mask tensor of each sequence
                     in mini batch (seq_len, batch_size)
        :return: labels of each sequence in mini batch
        """

        batch_size, seq_len, _ = h.size()
        # 各系列の系列長を用意
        # prepare the sequence lengths in each sequence
        seq_lens = mask.long().sum(dim=1)
        # バッチ内において，スタート地点から先頭のラベルに対してのスコアを用意
        # In mini batch, prepare the score
        # from the start sequence to the first label
        score = [self.start_trans.data + h[:, 0]]
        path = []

        for t in range(1, seq_len):
            # 1つ前の系列のスコアを抽出
            # extract the score of previous sequence
            # (batch_size, num_labels, 1)
            previous_score = score[t - 1].view(batch_size, -1, 1)

            # 系列の隠れ層のスコアを抽出
            # extract the score of hidden matrix of sequence
            # (batch_size, 1, num_labels)
            h_t = h[:, t].view(batch_size, 1, -1)

            # t-1の系列のラベルからtの系列のラベルまでの遷移におけるスコアを抽出
            # self.trans_matrixは系列Aから系列Bまでの遷移のスコアを持っている
            # extract the score in transition
            # from label of t-1 sequence to label of sequence of t
            # self.trans_matrix has the score of the transition
            # from sequence A to sequence B
            # (batch_size, num_labels, num_labels)
            score_t = previous_score + self.trans_matrix + h_t

            # 導出したスコアのうち，各系列の最大値と最大値をとり得る位置を保持
            # keep the maximum value
            # and point where maximum value of each sequence
            # (batch_size, num_labels)
            best_score, best_path = score_t.max(1)
            score.append(best_score)
            path.append(best_path)

        # バッチ内のラベルを推定
        # predict labels of mini batch
        best_paths = [
            self._viterbi_compute_best_path(i, seq_lens, score, path)
            for i in range(batch_size)
        ]

        return best_paths

    def _viterbi_compute_best_path(
            self,
            batch_idx: int,
            seq_lens: torch.LongTensor,
            score: List[FloatTensor],
            path: List[torch.LongTensor],
    ) -> List[int]:
        """
        return labels using viterbi algorithm
        :param batch_idx: index of batch
        :param seq_lens: sequence lengths in mini batch (batch_size)
        :param score: transition scores of length max sequence size
                      in mini batch [(batch_size, num_labels)]
        :param path: transition paths of length max sequence size
                     in mini batch [(batch_size, num_labels)]
        :return: labels of batch_idx-th sequence
        """

        seq_end_idx = seq_lens[batch_idx] - 1
        # 系列の一番後ろのラベルを抽出
        # extract label of end sequence
        _, best_last_label = (score[seq_end_idx][batch_idx] +
                              self.end_trans).max(0)
        best_labels = [int(best_last_label)]

        # viterbiアルゴリズムにより，ラベルを後ろから推定
        # predict labels from back using viterbi algorithm
        for p in reversed(path[:seq_end_idx]):
            best_last_label = p[batch_idx][best_labels[0]]
            best_labels.insert(0, int(best_last_label))

        return best_labels

    def _compute_denominator_log_likelihood(self, h: FloatTensor,
                                            mask: BoolTensor):
        """
        compute the denominator term for the log-likelihood
        :param h: hidden matrix (batch_size, seq_len, num_labels)
        :param mask: mask tensor of each sequence
                     in mini batch (batch_size, seq_len)
        :return: The score of denominator term for the log-likelihood
        """

        batch_size, seq_len, _ = h.size()
        # 計算できるよう，遷移行列のサイズを変更
        # (num_labels, num_labels) -> (1, num_labels, num_labels)
        trans = self.trans_matrix.view(1, self.num_labels, self.num_labels)
        # 先頭から各ラベルへのスコアと各ラベルの1番目のスコアを足し合わせる
        # add the score from beginning to each label
        # and the first score of each label
        score = self.start_trans.view(1, -1) + h[:, 0]
        # ミニバッチ中の単語数だけ処理を行う
        # iterate through processing for the number of words in the mini batch
        for t in range(1, seq_len):
            # (batch_size, self.num_labels, 1)
            before_score = score.view(batch_size, self.num_labels, 1)
            # 各系列の系列のt番目のマスクを用意
            # prepare t-th mask of sequences in each sequence
            # (batch_size, 1)
            mask_t = mask[:, t].view(batch_size, 1).type(torch.BoolTensor)
            mask_t = mask_t.cuda() if CRF.CUDA else mask_t

            # 各系列におけるt番目の系列ラベルの遷移確率
            # prepare the transition probability of the t-th sequence label
            # in each sequence
            # (batch_size, 1, num_labels)
            h_t = h[:, t].view(batch_size, 1, self.num_labels)
            # 各系列でのt番目のスコアを導出
            # calculate t-th scores in each sequence
            # (batch_size, num_labels)
            score_t = self.logsumexp(before_score + h_t + trans, 1)
            # スコアの更新
            # update scores
            # (batch_size, num_labels)
            score = score_t * mask_t + score * (~mask_t)

        # 末尾のスコアを足し合わせる
        # add the end score of each label
        score += self.end_trans.view(1, -1)
        # ミニバッチ中のデータ全体の対数尤度を返す
        # return the log likely food of all data in mini batch
        return self.logsumexp(score, 1)

    def _compute_numerator_log_likelihood(self, h: FloatTensor, y: LongTensor,
                                          mask: BoolTensor) -> FloatTensor:
        """
        compute the numerator term for the log-likelihood
        :param h: hidden matrix (batch_size, seq_len, num_labels)
        :param y: answer labels of each sequence
                  in mini batch (batch_size, seq_len)
        :param mask: mask tensor of each sequence
                     in mini batch (batch_size, seq_len)
        :return: The score of numerator term for the log-likelihood
        """

        batch_size, seq_len, _ = h.size()
        # 系列のスタート位置のベクトルを抽出
        # extract first vector of sequences in mini batch
        score = self.start_trans[y[:, 0]]

        h = h.unsqueeze(-1)
        trans = self.trans_matrix.unsqueeze(-1)

        for t in range(seq_len - 1):
            mask_t = mask[:, t].cuda() if CRF.CUDA else mask[:, t]
            mask_t1 = mask[:, t + 1] if CRF.CUDA else mask[:, t + 1]
            # t+1番目のラベルのスコアを抽出
            # extract the score of t+1 label
            # (batch_size)
            h_t = torch.cat([h[b, t, y[b, t]] for b in range(batch_size)])
            # t番目のラベルからt+1番目のラベルへの遷移スコアを抽出
            # extract the transition score from t-th label to t+1 label
            # (batch_size)
            trans_t = torch.cat([trans[s[t], s[t + 1]] for s in y])
            # 足し合わせる
            # add the score of t+1 and the transition score
            # (batch_size)
            score += h_t * mask_t + trans_t * mask_t1

        # バッチ内の各系列の最後尾のラベル番号を抽出する
        # extract end label number of each sequence in mini batch
        # (batch_size)
        last_mask_index = mask.long().sum(1) - 1
        last_labels = y.gather(1, last_mask_index.unsqueeze(-1))
        # hの形を元に戻す
        # restore the shape of h
        h = h.unsqueeze(-1).view(batch_size, seq_len, self.num_labels)

        # バッチ内の最大長の系列のスコアを足し合わせる
        # Add the score of the sequences of the maximum length in mini batch
        score += h[:, -1].gather(1, last_labels).squeeze(1) * mask[:, -1]
        # 各系列の最後尾のタグからEOSまでのスコアを足し合わせる
        # Add the scores from the last tag of each sequence to EOS
        score += self.end_trans[last_labels].view(batch_size)

        return score

    def _initialize_parameters(self, pad_idx: Optional[int]) -> None:
        """
        initialize transition parameters
        :param: pad_idx: if not None, additional initialize
        :return: None
        """

        nn.init.uniform_(self.trans_matrix, -0.1, 0.1)
        nn.init.uniform_(self.start_trans, -0.1, 0.1)
        nn.init.uniform_(self.end_trans, -0.1, 0.1)
        if pad_idx is not None:
            self.start_trans[pad_idx] = -10000.0
            self.trans_matrix[pad_idx, :] = -10000.0
            self.trans_matrix[:, pad_idx] = -10000.0
            self.trans_matrix[pad_idx, pad_idx] = 0.0

    @staticmethod
    def logsumexp(x: FloatTensor, dim: int) -> FloatTensor:
        """
        return log(sum(exp(x))) while minimizing
                                the possibility of overflow/underflow.
        :param x: the matrix format FloatTensor
        :param dim: dimensiton
        :return: log(sum(exp(x)))
        """

        vmax, _ = x.max(dim)
        return vmax + torch.log(
            torch.sum(torch.exp(x - vmax.unsqueeze(dim)), dim))

In [4]:
device = torch.device('cuda' if torch.cuda.is_available else 'cpu')

batch_size = 2
seq_len = 3
num_labels = 5

# 掩码，表示序列中哪些元素是填充的
mask = torch.ByteTensor([[1, 1, 1], [1, 1, 0]]).to(device)

# 真实的序列标签
labels = torch.LongTensor([[0, 2, 3], [1, 4, 1]]).to(device)

# LSTM层的输出，表示预测的 序列标签概率分布
hidden = torch.randn((batch_size, seq_len, num_labels),
                     requires_grad=True).to(device)  

In [5]:
# 创建模型，只需要提供 标签 数量
crf = CRF(num_labels)

# 提供LSTM层输出的序列标签分布，和真实标签，前向计算获得损失
crf.forward(hidden, labels, mask)

tensor([-5.7249, -5.4133], device='cuda:0', grad_fn=<SubBackward0>)

In [6]:
# 根据 预测的 序列标签概率分布 进行解码，获得最佳序列
crf.viterbi_decode(hidden, mask)

[[0, 1, 1], [0, 2]]

#### `BiLSTM+CRF`
```
input_ids:                       (batch, seq)
--> Embedding:                   (batch, seq, embedding_dim)
--> LSTM:                        (batch, seq, hidden_dim)
--> Dropout
--> Linear:                      (batch, seq, num_tags)
--> CRF
```
训练时：模型直接调用`log_likelihood`方法计算损失，然后进行训练
   - 此时模型内部，线性层的 输出和目标 一起经过 CRF 的**前向计算`forward`**得到损失       
      
预测时：模型 **前向计算`forward`** 获得输出序列   
   - 此时模型内部，线性层的 输出 经过 CRF 的解码方法计算得到预测标签序列

In [7]:
class NERLSTM_CRF(nn.Module):
    def __init__(self, config):
        super(NERLSTM_CRF, self).__init__()

        self.embedding_dim = config.embedding_dim
        self.hidden_dim = config.hidden_dim
        self.vocab_size = config.vocab_size
        self.num_tags = config.num_tags

        self.embeds = nn.Embedding(self.vocab_size, self.embedding_dim)
        self.dropout = nn.Dropout(config.dropout)

        self.lstm = nn.LSTM(
            self.embedding_dim,
            self.hidden_dim // 2,
            num_layers=1,
            bidirectional=True,
            batch_first=True,  # 该属性设置后，需要特别注意数据的形状
        )

        self.linear = nn.Linear(self.hidden_dim, self.num_tags)

        # CRF 层
        self.crf = CRF(self.num_tags)

    def forward(self, x, mask):
        embeddings = self.embeds(x)
        feats, hidden = self.lstm(embeddings)
        emissions = self.linear(self.dropout(feats))
        outputs = self.crf.viterbi_decode(emissions, mask)
        return outputs

    def log_likelihood(self, x, labels, mask):
        embeddings = self.embeds(x)
        feats, hidden = self.lstm(embeddings)
        emissions = self.linear(self.dropout(feats))
        loss = -self.crf.forward(emissions, labels, mask)
        return torch.sum(loss)

In [8]:
# 测试模型

class Config():
    embedding_dim = 6
    hidden_dim = 20
    vocab_size = 50
    num_tags = 5
    dropout = 0.1


config = Config()
model = NERLSTM_CRF(config).to(device)

In [10]:
x = torch.LongTensor([[1, 12, 31, 4, 15], [2, 18, 39, 14, 0]]).to(device)
mask = torch.BoolTensor([[1, 1, 1, 1, 1], [1, 1, 1, 1, 0]]).to(device)
labels = torch.LongTensor([[1, 2, 1, 3, 0], [2, 4, 2, 1, 0]]).to(device)

# 解码
model.forward(x, mask)

[[2, 2, 3, 3, 3], [1, 2, 2, 2]]

> 数据类型为：`torch.int64`

In [11]:
# 计算损失
model.log_likelihood(x, labels, mask)

tensor(14.6996, device='cuda:0', grad_fn=<SumBackward0>)

## 数据集
原始数据来自人民日报，处理成如下形式：
```
迈向/v  充满/v  希望/n  的/u  新/a  世纪/n
```
主要分为三类：nr,ns,nt
>nr | 人名  
ns | 地名   
nt | 机构名


并用以下标识：

> B | 词首   
M | 词中   
E | 词尾   
O | 单字

### 数据预处理

In [12]:
# https://github.com/buppt/ChineseNER/blob/master/data/renMinRiBao/data_renmin_word.py
import os
import codecs
import re
import pdb
import pandas as pd
import numpy as np
import collections
import pickle


def originHandle():
    # 将 多空格 隔开的连续的标注数据: "中共中央/nt  总书记/n  、/w  国家/n  主席/n"，转化成单空格

    with open('../datasets/ner/renmin/renmin.txt',
              'r') as inp, open('../datasets/ner/renmin/renmin2.txt',
                                'w') as outp:
        for line in inp.readlines():
            line = line.split('  ')
            i = 1  # 删除每一行第一个标注数据，为时间 "19980101-01-001-016/m"
            while i < len(line) - 1:

                # 1. [北京/ns  石景山/ns  发电/vn  总厂/n]nt --> 北京石景山发电总厂/nt
                ################################################################
                if line[i][0] == '[':
                    outp.write(line[i].split('/')[0][1:])
                    i += 1
                    while i < len(line) - 1 and line[i].find(']') == -1:
                        if line[i] != '':
                            outp.write(line[i].split('/')[0])
                        i += 1
                    outp.write(line[i].split('/')[0].strip() + '/' +
                               line[i].split('/')[1][-2:] + ' ')

                # 2. 将连续的两个 nr 标签单词连接起来，如 姓+名
                ################################################################
                elif line[i].split('/')[1] == 'nr':
                    word = line[i].split('/')[0]
                    i += 1
                    if i < len(line) - 1 and line[i].split('/')[1] == 'nr':
                        outp.write(word + line[i].split('/')[0] + '/nr ')
                    else:
                        outp.write(word + '/nr ')
                        continue
                else:
                    outp.write(line[i] + ' ')
                i += 1
            outp.write('\n')


def originHandle2():
    # 将 单词+标注，转换成 字+标注；字的标注表示为 字在单词中的位置+单词的标注，如 "中国"
    with codecs.open('../datasets/ner/renmin/renmin2.txt', 'r',
                     'utf-8') as inp, codecs.open(
                         '../datasets/ner/renmin/renmin3.txt', 'w',
                         'utf-8') as outp:
        for line in inp.readlines():
            line = line.split(' ')
            i = 0
            while i < len(line) - 1:
                if line[i] == '':
                    i += 1
                    continue
                word = line[i].split('/')[0]
                tag = line[i].split('/')[1]

                # 1. 只保留这三种标注：nr , ns , nt
                # 中共中央/nt 总书记/n 、/w 国家/n 主席/n 江泽民/nr -->
                # 中/B_nt 共/M_nt 中/M_nt 央/E_nt 总/O 书/O 记/O 、/O 国/O 家/O 主/O 席/O 江/B_nr 泽/M_nr 民/E_nr
                ###########################################################################################
                if tag == 'nr' or tag == 'ns' or tag == 'nt':
                    outp.write(word[0] + "/B_" + tag + " ")
                    for j in word[1:len(word) - 1]:
                        if j != ' ':
                            outp.write(j + "/M_" + tag + " ")
                    outp.write(word[-1] + "/E_" + tag + " ")

                # 2. 其它的标注都转变成 '/O'
                else:
                    for wor in word:
                        outp.write(wor + '/O ')
                i += 1
            outp.write('\n')


def sentence2split():
    with open('../datasets/ner/renmin/renmin3.txt',
              'r') as inp, codecs.open('../datasets/ner/renmin/renmin4.txt',
                                       'w', 'utf-8') as outp:
        texts = inp.read()  # .decode('utf-8')

        # 按 标点符号 分成拆分成子句，每个子句为一行
        sentences = re.split('[，。！？、‘’“”:]/[O]', texts)
        for sentence in sentences:
            if sentence != " ":
                outp.write(sentence.strip() + '\n')

# originHandle()
# originHandle2()
# sentence2split()

In [71]:
from tensorflow.keras.preprocessing.sequence import pad_sequences


def data2pkl():
    datas = list()
    labels = list()

    all_words = []
    tags = set()

    input_data = codecs.open('../datasets/ner/renmin/renmin4.txt', 'r',
                             'utf-8')

    # 1. 将标注子句 拆分成 字列表 和 对应的标注列表 #############
    #####################################################
    for line in input_data.readlines():
        linedata = list()
        linelabel = list()

        line = line.split()

        numNotO = 0
        for word in line:
            word = word.split('/')
            linedata.append(word[0])
            linelabel.append(word[1])

            all_words.append(word[0])
            tags.add(word[1])

            if word[1] != 'O':  # 标注全为 O 的子句
                numNotO += 1

        if numNotO != 0:  # 只保存 标注不全为 O 的子句
            #             print(linedata)
            datas.append(linedata)
            labels.append(linelabel)

    input_data.close()
    print("文本序列的数量：", len(datas))  # 字列表 组成的列表
    assert (len(labels) == len(datas))  # 对应的 标注列表 组成的列表
    print("文本所有单词数：", len(all_words))

    # 2. 创建词汇表和标签表 ################################
    #####################################################

    words_count = collections.Counter(all_words).most_common()
    word2id = {word: i for i, (word, _) in enumerate(words_count, 1)}
    word2id["[PAD]"] = 0
    word2id["[unknown]"] = len(word2id)

    id2word = {i: word for word, i in word2id.items()}
    print("词汇表的大小：", len(id2word))

    print("所有标签：", tags)
    tag2id = {tag: i for i, tag in enumerate(tags)}
    print(tag2id)

    id2tag = {i: tag for tag, i in tag2id.items()}
    print(id2tag)

    print("-" * 100)
    print("Buiding vocab Done!!!")

    # 3. 数据向量化，并处理成相同长度 ########################
    #####################################################
    max_len = 60

    data_ids = [[word2id[w] for w in line] for line in datas]
    label_ids = [[tag2id[t] for t in line] for line in labels]

    x = pad_sequences(data_ids, maxlen=max_len, padding='post').astype(np.int64)
    y = pad_sequences(label_ids, maxlen=max_len, padding='post').astype(np.int64)

    print("-" * 100)
    print("Vectorizing data Done!!!")

    # 4. 向量化后数据拆分成训练集、验证集、测试集 ##############
    #####################################################
    from sklearn.model_selection import train_test_split
    x_train, x_test, y_train, y_test = train_test_split(
        x,
        y,
        test_size=0.2,
        random_state=43,
    )
    x_train, x_valid, y_train, y_valid = train_test_split(
        x_train,
        y_train,
        test_size=0.2,
        random_state=43,
    )

    print("-" * 100)
    print("Splitting data Done!!!")

    # 5. 保存数据 ########################
    #####################################################

    import pickle
    import os
    with open('../datasets/ner/renmin/renmindata.pkl', 'wb') as outp:
        pickle.dump(word2id, outp)
        pickle.dump(id2word, outp)
        pickle.dump(tag2id, outp)
        pickle.dump(id2tag, outp)

        pickle.dump(x_train, outp)
        pickle.dump(y_train, outp)

        pickle.dump(x_test, outp)
        pickle.dump(y_test, outp)

        pickle.dump(x_valid, outp)
        pickle.dump(y_valid, outp)

    print("-" * 100)
    print('** Finished saving the data.')


# data2pkl()

文本序列的数量： 37924
文本所有单词数： 1690464
词汇表的大小： 4683
所有标签： {'E_nt', 'M_nt', 'E_nr', 'O', 'B_nr', 'B_ns', 'M_ns', 'E_ns', 'M_nr', 'B_nt'}
{'E_nt': 0, 'M_nt': 1, 'E_nr': 2, 'O': 3, 'B_nr': 4, 'B_ns': 5, 'M_ns': 6, 'E_ns': 7, 'M_nr': 8, 'B_nt': 9}
{0: 'E_nt', 1: 'M_nt', 2: 'E_nr', 3: 'O', 4: 'B_nr', 5: 'B_ns', 6: 'M_ns', 7: 'E_ns', 8: 'M_nr', 9: 'B_nt'}
----------------------------------------------------------------------------------------------------
Buiding vocab Done!!!
----------------------------------------------------------------------------------------------------
Vectorizing data Done!!!
----------------------------------------------------------------------------------------------------
Splitting data Done!!!
----------------------------------------------------------------------------------------------------
** Finished saving the data.


In [72]:
# 保存预料中所有实体
def save_entities():
    input_data = codecs.open('../datasets/ner/renmin/renmin4.txt', 'r',
                             'utf-8')

    # 保存所有实体 #########################################
    #####################################################

    from collections import defaultdict
    entities = defaultdict(set)

    for line in input_data.readlines():

        line = line.strip().split()
        tokens = [line[i].split('/')[0] for i in range(len(line))]
        tags = [line[i].split('/')[1] for i in range(len(line))]


        start = 0
        end = 0
        for i in range(len(line)):
            token, tag = tokens[i], tags[i]
            if tag.startswith("B"):
                begin = i
            elif tag.startswith("E"):
                end = i
                word = ''.join(tokens[begin:end + 1])
                label = tag[2:]
                entities[label].add(word)
            
    with open('../datasets/ner/renmin/entities.pkl', 'wb') as outp:
        pickle.dump(entities, outp)

# save_entities()

In [73]:
pickle_path = '../datasets/ner/renmin/entities.pkl'
with open(pickle_path, 'rb') as inp:
    entities = pickle.load(inp)
    
entities['nr']    

{'樊萌',
 '慕凌飞',
 '母秋华',
 '曾建徽',
 '桑燕',
 '陈焕友',
 '冼笃信',
 '路易十四',
 '托马谢克',
 '祁秉文',
 '黄双武',
 '彭小民',
 '斯卡拉',
 '章治文',
 '张志华',
 '周志刚',
 '祝耀祖',
 '丁传贤',
 '马本斋',
 '穆道俊',
 '陈育明',
 '李小和',
 '游本昌',
 '欧阳修',
 '李卫天',
 '于尔根·吕特格尔斯',
 '卡尔',
 '姜伯驹',
 '梁国安',
 '刘志秋',
 '石平',
 '毛宁',
 '秦蕴珊',
 '老吴',
 '龙志毅',
 '宗良',
 '许昱华',
 '程步云',
 '张衡',
 '塔尔博特',
 '高成',
 '周江波',
 '李释勘',
 '王璐瑶',
 '陈立军',
 '王继英',
 '让·米奥',
 '张琨锐',
 '谭利华',
 '阎晓明',
 '安学发',
 '郭强',
 '珠康·土登克珠',
 '爱丽丝',
 '陈立夫',
 '张克俭',
 '沃勒贝克',
 '高山岳',
 '王庆成',
 '林嘉（马来）',
 '谢宗惠',
 '杨伟光',
 '刘秉义',
 '王素琴',
 '赵富林',
 '石景宜',
 '齐如山',
 '李小双',
 '陈晓红',
 '亚伯',
 '姚本棠',
 '蒋文良',
 '罗帅',
 '卓福香',
 '柳公权',
 '马瑾',
 '郭大维',
 '张淑英',
 '王夫棠',
 '王荣志',
 '黄学锋',
 '王志新',
 '吕农华',
 '张裕民',
 '侯文学',
 '约翰·戴维特',
 '汪啸风',
 '法比尤斯',
 '刘宝珍',
 '闵惠芬',
 '马思忠',
 '李默庵',
 '王文学',
 '方成',
 '张雅心',
 '龚伟瑄',
 '刘我成',
 '徐业刚',
 '曹馨仪',
 '邓成城',
 '阿佩尔',
 '韩磊磊',
 '吴爱恩',
 '何添发',
 '欧阳海',
 '侏罗纪',
 '何少存',
 '朱时茂',
 '姚信民',
 '谢峰',
 '杨阳',
 '钱云强',
 '方嘉民',
 '威廉·莱易斯',
 '李力',
 '张学仁',
 '刘泽仁',
 '小贾',
 '毛增滇',
 '洪成南',
 '王仲平',
 '吴敦夫',
 '张弛',
 '张同吾',
 '

In [74]:
entities['ns']    

{'阿尔及利亚',
 '瑞丽市',
 '素可泰',
 '珠江三角洲',
 '排牙山',
 '临川',
 '比萨斜塔',
 '伯尔萨',
 '刘庄村',
 '鄢家河',
 '宋坑',
 '临江',
 '栾城县',
 '孟加拉国',
 '于洪区',
 '苏丹',
 '明尼苏达州',
 '凹底镇',
 '海淀剧场',
 '汉城',
 '牧奎村',
 '杏花岭区',
 '辛辛那提城',
 '塞纳河畔',
 '奥希金斯公园',
 '白水镇',
 '兰州',
 '电器道',
 '交道口',
 '银川市',
 '浦东新区',
 '北大荒',
 '醴陵',
 '绥德',
 '中央大街',
 '潍坊市',
 '北甸子乡',
 '牡丹江',
 '四川省凉山彝族自治州',
 '夫子庙',
 '苏北',
 '钓鱼台国宾馆',
 '得克萨斯州',
 '安徽省',
 '上海商城',
 '宁连路',
 '蒲圻市',
 '衡南',
 '万家寨',
 '江西省宜春地区',
 '北京海淀剧院',
 '波兰',
 '武邑黄口',
 '呼和浩特',
 '天津市',
 '江津',
 '沈泉庄',
 '黑非洲',
 '合肥',
 '南美洲',
 '河东',
 '胶东',
 '阿里高原',
 '胶南市',
 '大别山',
 '北京光彩体育馆手球馆',
 '莫桑比克',
 '北京经济技术开发区',
 '格拉茨',
 '广东',
 '晋豫',
 '坦桑尼亚',
 '阿比托',
 '河北张家口地区',
 '阜南县',
 '宁波',
 '达特茅思',
 '北爱尔兰',
 '池洞镇',
 '九湖镇',
 '砀山县',
 '汕头',
 '湄公河',
 '凤凰岭',
 '二里头',
 '田阳县',
 '巴西利亚',
 '龙庆峡',
 '天河体育场',
 '蒙得维的亚',
 '西花厅',
 '南京路',
 '南召',
 '江西省',
 '济南良友富临大酒店',
 '芝加哥',
 '内塔尼亚胡',
 '长江三峡',
 '黄粱梦镇',
 '菲律宾',
 '华尔街',
 '昆仑山',
 '长春沟村',
 '揭阳',
 '文化路',
 '大宁河',
 '马达加斯加',
 '大观园',
 '王家坝村',
 '亚洲',
 '延吉路',
 '恩古巴尼',
 '崂山区',
 '筷子巷',
 '虹口区',
 '南斯拉夫联盟黑山共和国',
 '台

In [75]:
entities['nt']

{'共青团云南省委',
 '上海市房屋土地管理局',
 '中国认证人员国家注册委员会环境管理专业委员会',
 '格威特体育用品公司',
 '荆州轻桥股份有限公司',
 '老挝人民革命党',
 '项桥电管站',
 '远东研究所',
 '鹤壁市电视台',
 '塔斯社',
 '方城县公安局法医院',
 '福建省福州市新华书店',
 '黎政府',
 '上海科学普及出版社',
 '新加坡国际金融交易所',
 '河南财经学院',
 '小浪底建管局',
 '台湾大学',
 '国商集团',
 '中国奥林匹克委员会',
 '南昌郊区区委',
 '那曲军分区',
 '新疆电力工业局',
 '中共四川省委',
 '成都市经委',
 '上海东方队',
 '北京市首都规划设计委员会',
 '南通仙羽制衣有限公司',
 '东滩煤矿',
 '中央美院',
 '世界银行',
 '五虎岭小学',
 '上海滑稽剧团',
 '联邦邮电部',
 '美国国际集团亚洲投资有限公司',
 '中国青少年研究中心少年儿童研究所',
 '松下电工·万宝电器（广州）有限公司',
 '重庆有线电视台',
 '龙舟股份',
 '茅盾文学奖评奖委员会',
 '对外友协',
 '中国诚信证券评估公司',
 '土宪法法院',
 '北海道拓殖银行',
 '美联社',
 '四川蓝剑队',
 '国务院军队转业干部安置工作小组',
 '石油部',
 '国家体委体操运动管理中心',
 '伦敦帝国理工学院',
 '东四街道办事处',
 '瑞士圣加仑修道院',
 '哈萨克阿里法拉比国立大学',
 '美国志愿者协会',
 '抗敌演剧队',
 '鲁北化工',
 '北京人民日报教科文部文化组',
 '南方局',
 '北京艺术研究所',
 '江苏悦达公司盐城汽车厂',
 '山东省鄄城县公安局巡警大队',
 '西南财经大学',
 '国家旅游局',
 '瑞典沃尔沃公司',
 '北爱民主党',
 '河西医院',
 '衡阳市支队后勤处',
 '西藏自治区科委农牧处',
 '北京市外来劳动力职业介绍中心',
 '尼加拉瓜桑地诺民族解放阵线',
 '农业银行四川省西昌市支行',
 '香港福建社团联会',
 '中共中央西北局',
 '解放军艺术学院影视中心',
 '中国书法家协会',
 '广东天贸南方大厦百货有限公司',
 '全国人大华侨委员会',
 '济南铁路分

In [82]:
"武汉" in entities['ns']

True

### 创建数据管道

In [85]:
# 加载数据

import pickle

pickle_path = '../datasets/ner/renmin/renmindata.pkl'
with open(pickle_path, 'rb') as inp:
    word2id = pickle.load(inp)
    id2word = pickle.load(inp)
    tag2id = pickle.load(inp)
    id2tag = pickle.load(inp)
    x_train = pickle.load(inp)
    y_train = pickle.load(inp)
    x_test = pickle.load(inp)
    y_test = pickle.load(inp)
    x_valid = pickle.load(inp)
    y_valid = pickle.load(inp)
print("train len:", len(x_train))
print("test len:", len(x_test))
print("valid len:", len(x_valid))

train len: 24271
test len: 7585
valid len: 6068


In [86]:
# # 保存词汇表和标签

# with open('../datasets/ner/renmin/vocab.pkl', 'wb') as outp:
#     pickle.dump(word2id, outp)
#     pickle.dump(id2word, outp)    

# with open('../datasets/ner/renmin/tags.pkl', 'wb') as outp:
#     pickle.dump(tag2id, outp)
#     pickle.dump(id2tag, outp)    

In [77]:
id2tag

{0: 'E_nt',
 1: 'M_nt',
 2: 'E_nr',
 3: 'O',
 4: 'B_nr',
 5: 'B_ns',
 6: 'M_ns',
 7: 'E_ns',
 8: 'M_nr',
 9: 'B_nt'}

In [78]:
id2word

{1: '的',
 2: '国',
 3: '一',
 4: '在',
 5: '中',
 6: '人',
 7: '１',
 8: '了',
 9: '和',
 10: '是',
 11: '有',
 12: '年',
 13: '大',
 14: '不',
 15: '０',
 16: '为',
 17: '会',
 18: '业',
 19: '上',
 20: '地',
 21: '发',
 22: '出',
 23: '９',
 24: '作',
 25: '要',
 26: '工',
 27: '行',
 28: '民',
 29: '这',
 30: '经',
 31: '家',
 32: '新',
 33: '个',
 34: '日',
 35: '部',
 36: '以',
 37: '来',
 38: '到',
 39: '２',
 40: '市',
 41: '成',
 42: '生',
 43: '对',
 44: '进',
 45: '全',
 46: '我',
 47: '们',
 48: '政',
 49: '多',
 50: '主',
 51: '时',
 52: '他',
 53: '产',
 54: '本',
 55: '展',
 56: '长',
 57: '实',
 58: '者',
 59: '学',
 60: '方',
 61: '建',
 62: '（',
 63: '）',
 64: '开',
 65: '理',
 66: '同',
 67: '动',
 68: '月',
 69: '高',
 70: '关',
 71: '重',
 72: '力',
 73: '电',
 74: '现',
 75: '于',
 76: '公',
 77: '５',
 78: '社',
 79: '下',
 80: '３',
 81: '报',
 82: '区',
 83: '加',
 84: '分',
 85: '济',
 86: '制',
 87: '自',
 88: '化',
 89: '定',
 90: '文',
 91: '体',
 92: '过',
 93: '前',
 94: '合',
 95: '等',
 96: '场',
 97: '就',
 98: '天',
 99: '与',
 100: '说',
 101: '面

In [18]:
tag2id

{'E_nt': 0,
 'M_nt': 1,
 'E_nr': 2,
 'O': 3,
 'B_nr': 4,
 'B_ns': 5,
 'M_ns': 6,
 'E_ns': 7,
 'M_nr': 8,
 'B_nt': 9}

In [19]:
word2id['[PAD]']

0

In [20]:
strs = "深圳欢迎您"
for char in strs:
    print(word2id[char])

313
1456
635
644
1458


In [21]:
# 创建数据管道

import torch
from torch.utils.data import Dataset,DataLoader
import torch.nn as nn
import torch.optim as optim


batch_size = 32  # batch size
num_workers = 4  # how many workers for loading data



class NERDataset(Dataset):
    def __init__(self, X, Y, *args, **kwargs):
        self.data = [{'x': X[i], 'y': Y[i]} for i in range(X.shape[0])]

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return len(self.data)


train_dataset = NERDataset(x_train, y_train)
valid_dataset = NERDataset(x_valid, y_valid)
test_dataset = NERDataset(x_test, y_test)

train_dataloader = DataLoader(train_dataset,
                              batch_size=batch_size,
                              shuffle=True,
                              num_workers=num_workers)
valid_dataloader = DataLoader(valid_dataset,
                              batch_size=batch_size,
                              shuffle=True,
                              num_workers=num_workers)
test_dataloader = DataLoader(test_dataset,
                             batch_size=batch_size,
                             shuffle=True,
                             num_workers=num_workers)

## 训练模型

### BiLSTM

In [22]:
device

device(type='cuda')

In [23]:
class Config:
    embedding_dim = 100
    hidden_dim = 200

    vocab_size = len(word2id)
    num_tags = len(tag2id)

    dropout = 0.2
    lr = 0.001
    weight_decay = 1e-5

# 创建模型，优化器，评价标准
config = Config()
model = NERLSTM_CRF(config).to(device)
criterion = nn.CrossEntropyLoss(ignore_index=0)
optimizer = optim.Adam(model.parameters(),
                       lr=config.lr,
                       weight_decay=config.weight_decay)

In [24]:
# 用于将实体类别解码，单字组合成单词

def parse_tags(text, path):
    tags = [id2tag[idx] for idx in path]

    begin = 0
    end = 0

    res = []
    for idx, tag in enumerate(tags):
        # 将连续的 同类型 的字连接起来
        if tag.startswith("B"):
            begin = idx
        elif tag.startswith("E"):
            end = idx
            word = text[begin:end + 1]
            label = tag[2:]
            res.append((word, label))
        elif tag=='O':
            res.append((text[idx], tag))
    return res

In [94]:
from sklearn.metrics import precision_score, recall_score, f1_score, classification_report

max_epoch = 10


class ChineseNER(object):
    def train(self):
        for epoch in range(max_epoch):

            # 训练模式
            model.train()

            for index, batch in enumerate(train_dataloader):
                # 梯度归零
                optimizer.zero_grad()

                # 训练数据-->gpu
                x = batch['x'].to(device)
                mask = (x > 0).to(device)
                y = batch['y'].to(device)
                

                # 前向计算计算损失
                loss = model.log_likelihood(x, y, mask)

                # 反向传播
                loss.backward()

                # 梯度裁剪
                torch.nn.utils.clip_grad_norm_(parameters=model.parameters(),
                                               max_norm=10)

                # 更新参数
                optimizer.step()
                if index % 200 == 0:
                    print('epoch:%5d,------------loss:%f' %
                          (epoch, loss.item()))

            # 验证损失和精度
            aver_loss = 0
            preds, labels = [], []
            for index, batch in enumerate(valid_dataloader):

                # 验证模式
                model.eval()

                # 验证数据-->gpu
                val_x, val_y = batch['x'].to(device), batch['y'].to(device)
                val_mask = (val_x > 0).to(device)
                predict = model(val_x, val_mask)

                # 前向计算损失
                loss = model.log_likelihood(val_x, val_y, val_mask)
                aver_loss += loss.item()

                # 统计非0的，也就是真实标签的长度
                leng = []
                for i in val_y.cpu():
                    tmp = []
                    for j in i:
                        if j.item() > 0:
                            tmp.append(j.item())
                    leng.append(tmp)

                for index, i in enumerate(predict):
                    preds += i[:len(leng[index])]

                for index, i in enumerate(val_y.tolist()):
                    labels += i[:len(leng[index])]

            # 损失值与评测指标
            aver_loss /= (len(valid_dataloader) * 64)
            precision = precision_score(labels, preds, average='macro')
            recall = recall_score(labels, preds, average='macro')
            f1 = f1_score(labels, preds, average='macro')
            report = classification_report(labels, preds)
            print(report)
            torch.save(model.state_dict(), '../models/ner/params.pkl')

    # 预测，输入为单句，输出为对应的单词和标签
    def predict(self, input_str=""):
        model.load_state_dict(torch.load("../models/ner/params.pkl"))
        model.eval()
        if not input_str:
            input_str = input("请输入文本: ")
        
        input_vec = []
        for char in input_str:
            if char not in word2id:
                input_vec.append(word2id['[unknown]'])
            else:
                input_vec.append(word2id[char])
                
        # convert to tensor
        sentences = torch.tensor(input_vec).view(1, -1).to(device)
        mask = sentences > 0
        paths = model(sentences, mask)

        res = parse_tags(input_str, paths[0])
        return res
    
    # 在测试集上评判性能
    def test(self, test_dataloader):
        model.load_state_dict(torch.load("../models/ner/params.pkl"))
        
        aver_loss = 0
        preds, labels = [], []
        for index, batch in enumerate(test_dataloader):

            # 验证模式
            model.eval()

            # 验证数据-->gpu
            val_x, val_y = batch['x'].to(device), batch['y'].to(device)
            val_mask = (val_x > 0).to(device)
            predict = model(val_x, val_mask)

            # 前向计算损失
            loss = model.log_likelihood(val_x, val_y, val_mask)
            aver_loss += loss.item()

            # 统计非0的，也就是真实标签的长度
            leng = []
            for i in val_y.cpu():
                tmp = []
                for j in i:
                    if j.item() > 0:
                        tmp.append(j.item())
                leng.append(tmp)

            for index, i in enumerate(predict):
                preds += i[:len(leng[index])]

            for index, i in enumerate(val_y.tolist()):
                labels += i[:len(leng[index])]

        # 损失值与评测指标
        aver_loss /= (len(test_dataloader) * 64)
        precision = precision_score(labels, preds, average='macro')
        recall = recall_score(labels, preds, average='macro')
        f1 = f1_score(labels, preds, average='macro')
        report = classification_report(labels, preds)
        print(report)        
        

In [26]:
cn = ChineseNER()
# cn.train()

epoch:    0,------------loss:979.705811
epoch:    0,------------loss:263.596802
epoch:    0,------------loss:171.161011
epoch:    0,------------loss:164.264771
              precision    recall  f1-score   support

           0       0.80      0.81      0.81      1542
           1       0.79      0.86      0.82      6459
           2       0.88      0.79      0.83      3017
           3       0.95      0.96      0.96     61629
           4       0.93      0.84      0.88      3117
           5       0.84      0.82      0.83      3586
           6       0.73      0.67      0.70      2163
           7       0.81      0.79      0.80      3567
           8       0.89      0.83      0.86      2805
           9       0.82      0.80      0.81      1763

    accuracy                           0.92     89648
   macro avg       0.84      0.82      0.83     89648
weighted avg       0.92      0.92      0.91     89648

epoch:    1,------------loss:113.367020
epoch:    1,------------loss:97.349915
ep

epoch:    9,------------loss:1.087444
epoch:    9,------------loss:1.591179
epoch:    9,------------loss:6.660941
epoch:    9,------------loss:1.844975
              precision    recall  f1-score   support

           0       0.88      0.91      0.90      1542
           1       0.89      0.91      0.90      6459
           2       0.96      0.93      0.95      3017
           3       0.98      0.98      0.98     61629
           4       0.97      0.94      0.95      3117
           5       0.92      0.93      0.93      3586
           6       0.85      0.88      0.86      2163
           7       0.91      0.91      0.91      3567
           8       0.97      0.93      0.95      2805
           9       0.90      0.92      0.91      1763

    accuracy                           0.96     89648
   macro avg       0.92      0.92      0.92     89648
weighted avg       0.96      0.96      0.96     89648



In [58]:
# 模型性能
cn.test(test_dataloader)

              precision    recall  f1-score   support

           0       0.90      0.92      0.91      1887
           1       0.91      0.91      0.91      8043
           2       0.95      0.91      0.93      3833
           3       0.98      0.98      0.98     75869
           4       0.97      0.93      0.95      3962
           5       0.93      0.93      0.93      4520
           6       0.85      0.86      0.85      2688
           7       0.92      0.91      0.91      4495
           8       0.96      0.93      0.95      3645
           9       0.91      0.92      0.92      2146

    accuracy                           0.96    111088
   macro avg       0.93      0.92      0.92    111088
weighted avg       0.96      0.96      0.96    111088



In [109]:
# 模型进行预测
cn = ChineseNER()
text = "冯永祥突发奇想，跑到阿尔及利亚旅行，意外结识了印度人民党的领导"
cn.predict(text)

[('冯永祥', 'nr'),
 ('突', 'O'),
 ('发', 'O'),
 ('奇', 'O'),
 ('想', 'O'),
 ('，', 'O'),
 ('跑', 'O'),
 ('到', 'O'),
 ('阿尔及利亚', 'ns'),
 ('旅', 'O'),
 ('行', 'O'),
 ('，', 'O'),
 ('意', 'O'),
 ('外', 'O'),
 ('结', 'O'),
 ('识', 'O'),
 ('了', 'O'),
 ('印度人民党', 'nt'),
 ('的', 'O'),
 ('领', 'O'),
 ('导', 'O')]

## 转移矩阵

In [36]:
id2tag

{0: 'E_nt',
 1: 'M_nt',
 2: 'E_nr',
 3: 'O',
 4: 'B_nr',
 5: 'B_ns',
 6: 'M_ns',
 7: 'E_ns',
 8: 'M_nr',
 9: 'B_nt'}

In [37]:
trans_matrix = torch.Tensor.cpu(model.crf.trans_matrix).detach().numpy()

In [39]:
df = pd.DataFrame(trans_matrix, columns=id2tag.values(), index=id2tag.values())
df

Unnamed: 0,E_nt,M_nt,E_nr,O,B_nr,B_ns,M_ns,E_ns,M_nr,B_nt
E_nt,-2.573899,-3.030039,-0.728452,0.909396,0.266779,0.373823,-1.361368,-1.049253,-0.733836,0.107502
M_nt,2.064595,1.689258,-1.940975,-3.371228,-1.675405,-1.948209,-2.652172,-2.889705,-1.326632,-3.38041
E_nr,-1.452444,-1.468744,-3.259135,1.27931,0.709749,-0.101151,-1.440134,-1.680012,-2.549911,0.288885
O,-3.398406,-3.010524,-2.628851,1.655515,0.666735,0.623919,-3.150675,-3.03733,-2.530541,0.677525
B_nr,-1.319444,-2.039114,1.578004,-2.857347,-3.091537,-2.269576,-3.158556,-3.43597,1.800888,-1.399733
B_ns,-1.902348,-3.107933,-3.106146,-2.916749,-2.224071,-3.267369,1.963997,1.683751,-3.329149,-1.943171
M_ns,-2.643914,-3.1919,-3.060015,-3.492515,-1.683115,-3.250608,1.325443,1.899873,-2.29082,-1.446936
E_ns,-1.974666,-3.305349,-2.005721,1.301592,0.344891,1.07241,-3.45796,-3.144714,-1.864227,-0.017089
M_nr,-1.491318,-1.743178,1.77302,-2.79535,-2.651609,-1.647576,-2.42808,-3.107333,1.031566,-1.024025
B_nt,-0.838918,2.111549,-1.85187,-2.921312,-1.19427,-1.949645,-3.129531,-3.415173,-2.139561,-3.43643


In [132]:
index = [
    'B_nt', 'M_nt', 'E_nt', 'B_nr', 'M_nr', 'E_nr', 'B_ns', 'M_ns', 'E_ns', 'O'
]
df.reindex(index, axis=0).reindex(index, axis=1).round(2).style.applymap(
    lambda v: 'background-color: %s' % '#B0C4DE'
    if v > 0 else 'background-color: %s' % '#FFFFFF')

Unnamed: 0,B_nt,M_nt,E_nt,B_nr,M_nr,E_nr,B_ns,M_ns,E_ns,O
B_nt,-3.44,2.11,-0.84,-1.19,-2.14,-1.85,-1.95,-3.13,-3.42,-2.92
M_nt,-3.38,1.69,2.06,-1.68,-1.33,-1.94,-1.95,-2.65,-2.89,-3.37
E_nt,0.11,-3.03,-2.57,0.27,-0.73,-0.73,0.37,-1.36,-1.05,0.91
B_nr,-1.4,-2.04,-1.32,-3.09,1.8,1.58,-2.27,-3.16,-3.44,-2.86
M_nr,-1.02,-1.74,-1.49,-2.65,1.03,1.77,-1.65,-2.43,-3.11,-2.8
E_nr,0.29,-1.47,-1.45,0.71,-2.55,-3.26,-0.1,-1.44,-1.68,1.28
B_ns,-1.94,-3.11,-1.9,-2.22,-3.33,-3.11,-3.27,1.96,1.68,-2.92
M_ns,-1.45,-3.19,-2.64,-1.68,-2.29,-3.06,-3.25,1.33,1.9,-3.49
E_ns,-0.02,-3.31,-1.97,0.34,-1.86,-2.01,1.07,-3.46,-3.14,1.3
O,0.68,-3.01,-3.4,0.67,-2.53,-2.63,0.62,-3.15,-3.04,1.66


In [138]:
start_matrix = torch.Tensor.cpu(model.crf.start_matrix.view(
    1, -1)).detach().numpy()
df = pd.DataFrame(start_matrix, columns=id2tag.values())
df.reindex(index, axis=1).round(3).style.applymap(
    lambda v: 'background-color: %s' % '#B0C4DE'
    if v > 0 else 'background-color: %s' % '#FFFFFF')

Unnamed: 0,B_nt,M_nt,E_nt,B_nr,M_nr,E_nr,B_ns,M_ns,E_ns,O
0,-0.088,-0.033,0.027,-0.045,-0.087,-0.024,-0.073,0.037,-0.011,0.007


In [139]:
end_matrix = torch.Tensor.cpu(model.crf.end_matrix.view(
    1, -1)).detach().numpy()
df = pd.DataFrame(end_matrix, columns=id2tag.values())
df.reindex(index, axis=1).round(3).style.applymap(
    lambda v: 'background-color: %s' % '#B0C4DE'
    if v > 0 else 'background-color: %s' % '#FFFFFF')

Unnamed: 0,B_nt,M_nt,E_nt,B_nr,M_nr,E_nr,B_ns,M_ns,E_ns,O
0,0.093,-0.043,-0.071,0.059,-0.1,0.062,0.011,-0.075,-0.001,-0.03


# TODO
> 以字为单位进行的实体识别，以词为单位效果会不会更好？？