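### Inputs and outputs of the BiLSTM layer

Input to the BiLSTM layer: a vector representation of each word.

Output of the BiLSTM layer: for each word, a score for every entity label (a higher score means the label is more likely).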

<img src="../../../Other/img/lstm-crf0.png">


### If the model has no CRF layer

<img src="../../../Other/img/lstm-crf1.png">

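Because the BiLSTM outputs a score for every entity label at each word, the simplest decoder picks the highest-scoring label for each word independently.

In the left figure, "B-Person" has the highest score for $w_0$ (1.5), so it is chosen as the predicted label. Likewise, $w_1$ gets "I-Person", $w_2$ gets "O", $w_3$ gets "B-Organization", and $w_4$ gets "O". This label sequence is correct.

In the right figure, the predicted label sequence is ["I-Organization", "I-Person", "O", "B-Organization", "I-Person"], which is clearly invalid: it begins with an "I-" label, and "I-Person" directly follows "B-Organization".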
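The per-word decoding described above can be sketched in a few lines. This is a minimal illustration, not the author's code: the emission scores are hypothetical values chosen so that the result reproduces the invalid sequence from the right figure, and `greedy_decode` and `is_valid_bio` are helper names introduced here.

```python
import numpy as np

LABELS = ["B-Person", "I-Person", "B-Organization", "I-Organization", "O"]

# Hypothetical emission scores: one row per word, one column per label.
emissions = np.array([
    [0.2, 0.3, 0.1, 1.1, 0.05],  # w0
    [0.1, 0.9, 0.2, 0.1, 0.3],   # w1
    [0.1, 0.2, 0.1, 0.1, 1.2],   # w2
    [0.2, 0.1, 1.4, 0.3, 0.1],   # w3
    [0.1, 0.8, 0.1, 0.1, 0.5],   # w4
])


def greedy_decode(scores):
    """Pick the highest-scoring label for each word independently."""
    return [LABELS[i] for i in scores.argmax(axis=1)]


def is_valid_bio(tags):
    """Return True if every I-X directly follows B-X or I-X of the same type."""
    prev = "O"  # treat the sentence start like "O"
    for tag in tags:
        if tag.startswith("I-"):
            entity = tag[2:]
            if prev not in (f"B-{entity}", f"I-{entity}"):
                return False
        prev = tag
    return True


predicted = greedy_decode(emissions)
print(predicted, is_valid_bio(predicted))
```

Because each word is decoded on its own, nothing stops the decoder from emitting a sequence that the validity check rejects.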

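A CRF layer adds constraints that keep the final prediction valid. The CRF layer learns these constraints automatically from the training data (for example, a sentence should start with "B-" or "O", never with "I-").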

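### Transition scores

To make the transition score matrix more robust, two extra labels are added: "START" and "END". "START" marks the beginning of a sentence (it is not the first word of the sentence; the first word comes right after "START"), and likewise "END" marks the end of the sentence.

The table below is an example transition score matrix. Each row is the label being left and each column is the label being entered: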

|                    | START | B-Person | I-Person | B-Organization     | I-Organization     | O     | END   |
| ------------------ | ----- | -------- | -------- | ------------------ | ------------------ | ----- | ----- |
| **START**          | 0     | 0.8      | 0.07     | 0.7                | 0.0008             | 0.9   | 0.08  |
| **B-Person**       | 0     | 0.6      | 0.9      | 0.2                | 0.0006             | 0.6   | 0.009 |
| **I-Person**       | -1    | 0.5      | 0.53     | 0.55               | 0.0003             | 0.85  | 0.008 |
| **B-Organization** | 0.9   | 0.5      | 0.0003   | 0.25               | 0.8                | 0.77  | 0.006 |
| **I-Organization** | -0.9  | 0.45     | 0.007    | 0.7                | 0.65               | 0.76  | 0.2   |
| **O**              | 0     | 0.65     | 0.0007   | 0.7                | 0.0008             | 0.9   | 0.08  |
| **END**            | 0     | 0        | 0        | 0                  | 0                  | 0     | 0     |


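From the table above:
* The scores from "START" to "I-Person" and to "I-Organization" are both very low, so the label of the first word in a sentence should start with "B-" or be "O", and never start with "I-".
* "O" followed by "I-label" is invalid (the scores from "O" to "I-Person" and to "I-Organization" are both very low), so the first label of a named entity should start with "B-", not "I-".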


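### The CRF loss function

Suppose there are $k$ labels and the sequence length is $n$. The total number of paths is then $N=k^n$. If $S_i$ denotes the score of the $i$-th path, the probability of that label sequence is:

$$ P\left(S_{i}\right)=\frac{e^{S_{i}}}{\sum_{j=1}^{N} e^{S_{j}}} $$

Let $S_{real}$ be the score of the true path. Then:

$$ P\left(S_{real}\right)=\frac{e^{S_{real}}}{\sum_{j=1}^{N} e^{S_{j}}} $$

Training aims to keep increasing $P(S_{real})$, which is the objective. A loss that expresses this directly would be:

$$ loss = - P\left(S_{real}\right)=-\frac{e^{S_{real}}}{\sum_{j=1}^{N} e^{S_{j}}}  $$

Because $\log$ is monotonically increasing, the same objective is reached by minimizing the negative log probability instead, which is easier to compute: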

$$
\begin{aligned}
\operatorname{loss} &=-\log \frac{e^{S_{real}}}{\sum_{j=1}^{N} e^{S_{j}}} \\
&=-\left(\log \left(e^{S_{real}}\right)-\log \left(\sum_{j=1}^{N} e^{S_{j}}\right)\right) \\
&=\log \left(\sum_{j=1}^{N} e^{S_{j}}\right)-\log \left(e^{S_{real}}\right) \\
&=\log \left(e^{S_{1}}+e^{S_{2}}+\ldots+e^{S_{N}}\right)-S_{real}
\end{aligned}
$$

The expression has two parts: the score of the single true path, $S_{real}$, and the normalization term $\log (e^{S_{1}}+e^{S_{2}}+\ldots+e^{S_{N}})$, which runs over all $N$ paths.
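As a quick numerical check of the formula, the sketch below computes the loss from an explicit list of path scores. The scores are hypothetical, `crf_loss` is a name introduced here, and subtracting the maximum before exponentiating is a standard numerical-stability trick that the text does not mention.

```python
import numpy as np


def crf_loss(path_scores, real_index):
    """loss = log(sum_j exp(S_j)) - S_real for an explicit list of path scores."""
    scores = np.asarray(path_scores, dtype=float)
    m = scores.max()  # assumption: shift by the max to avoid overflow in exp
    log_norm = m + np.log(np.exp(scores - m).sum())
    return log_norm - scores[real_index]


scores = [1.0, 2.0, 3.0]           # hypothetical scores S_1..S_3
loss = crf_loss(scores, real_index=2)
prob = np.exp(-loss)               # exp(-loss) recovers P(S_real)
print(loss, prob)
```

Listing every path is only feasible for toy inputs, since the number of paths grows as $k^n$.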

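### Score of a single path

Let $E$ be the emission score matrix, $T$ the transition score matrix, $n$ the length of the text sequence, and $tag\_size$ the number of labels. For convenience, each label is given an id, as shown in the table below: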

| Tag        | B-Person | I-Person | B-Organization | I-Organization | O    |
| ---------- | -------- | -------- | -------------- | -------------- | ---- |
| **Tag id** | 0        | 1        | 2              | 3              | 4    |

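$E$ has shape $[n, tag\_size]$: each row holds the emission scores of one token, and each column corresponds to one label. For example, $E_{01}$ is the score of $w_{0}$ taking the label with id 1, and $E_{23}$ is the score of $w_{2}$ taking the label with id 3. $T$ has shape $[tag\_size, tag\_size]$ and holds the scores of moving between labels. For example, $T_{03}$ is the score of moving from the label with id 3 to the label with id 0, so the first index is the destination label.

The score of a path is the sum of the emission and transition scores along it. Take the path ["B-Person", "I-Person", "O", "B-Organization", "O"]. The label of $w_0$ is "B-Person", with emission score $E_{00}$. The label of $w_1$ is "I-Person", with emission score $E_{11}$, and the transition score from "B-Person" to "I-Person" is $T_{10}$. The score up to this step is therefore $E_{00} + T_{10} + E_{11}$.

The label of $w_2$ is "O", with emission score $E_{24}$, and the transition score from the label of $w_1$, "I-Person", to the label of $w_2$, "O", is $T_{41}$. The score up to this step is therefore $E_{00} + T_{10} + E_{11} + T_{41} + E_{24}$, and so on for the remaining words.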
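The accumulation described above can be written as a short loop. This is a sketch under the section's index convention ($T_{ij}$ scores the move from label $j$ to label $i$); the random matrices and the function name `path_score` are illustrative assumptions, and "START"/"END" are left out just as in the worked example.

```python
import numpy as np


def path_score(E, T, tags):
    """Sum emission and transition scores along one label path.

    E[t, i]: emission score of word t for label id i.
    T[i, j]: transition score from label id j to label id i.
    """
    score = E[0, tags[0]]
    for t in range(1, len(tags)):
        score += T[tags[t], tags[t - 1]] + E[t, tags[t]]
    return score


rng = np.random.default_rng(0)
n, tag_size = 5, 5
E = rng.normal(size=(n, tag_size))          # hypothetical emission scores
T = rng.normal(size=(tag_size, tag_size))   # hypothetical transition scores

# ["B-Person", "I-Person", "O", "B-Organization", "O"] as tag ids
tags = [0, 1, 4, 2, 4]
s = path_score(E, T, tags)
print(s)
```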

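### Score of all paths

Consider a toy example with a three-word sentence and two labels:

$\mathbf{x} = [w_0, w_1, w_2]$

$LABEL = [l_1, l_2]$

The first table below gives the emission scores, where $x_{ij}$ is the score of word $w_i$ taking label $l_j$. The second gives the transition scores, where $t_{ij}$ is the score of moving from label $l_i$ to label $l_j$.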


|       | $l_1$    | $l_2$    |
| ----- | -------- | -------- |
| $w_0$ | $x_{01}$ | $x_{02}$ |
| $w_1$ | $x_{11}$ | $x_{12}$ |
| $w_2$ | $x_{21}$ | $x_{22}$ |

<br>

|       | $l_1$    | $l_2$    |
| ----- | -------- | -------- |
| $l_1$ | $t_{11}$ | $t_{12}$ |
| $l_2$ | $t_{21}$ | $t_{22}$ |


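* previous: the accumulated scores from the previous step
* obs: the emission scores of the word at the current step


1. Suppose the sentence has only one word, $w_0$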

obs = $[x_{01}, x_{02}]$

previous = None

$TotalScore(w_0) = \log(e^{x_{01}} + e^{x_{02}})$


2. Suppose the sentence has two words, $[w_0, w_1]$

obs = $[x_{11}, x_{12}]$

previous = $[x_{01}, x_{02}]$ 

* Expand previous (each row repeats one entry of previous)

$$
\text { previous }=\left(\begin{array}{ll}
x_{01} & x_{01} \\
x_{02} & x_{02}
\end{array}\right)
$$

* Expand obs (every row is a copy of obs)

$$
\text { obs }=\left(\begin{array}{ll}
x_{11} & x_{12} \\
x_{11} & x_{12}
\end{array}\right)
$$

* Sum previous, obs, and the transition scores

$$
\text { scores }=\left(\begin{array}{ll}
x_{01} & x_{01} \\
x_{02} & x_{02}
\end{array}\right)+\left(\begin{array}{ll}
x_{11} & x_{12} \\
x_{11} & x_{12}
\end{array}\right)+\left(\begin{array}{ll}
t_{11} & t_{12} \\
t_{21} & t_{22}
\end{array}\right)
 = 
 \left(\begin{array}{ll}
x_{01} + x_{11} + t_{11} & x_{01} + x_{12} + t_{12} \\
x_{02} + x_{11} + t_{21} & x_{02} + x_{12} + t_{22}
\end{array}\right) 
$$

* Update previous (take the log of the summed exponentials of each column of scores)

$$
\begin{aligned}
&\text { previous }=\left[\log \left(e^{x_{01}+x_{11}+t_{11}}+e^{x_{02}+x_{11}+t_{21}}\right)\right.\text {,}&\left.\log \left(e^{x_{01}+x_{12}+t_{12}}+e^{x_{02}+x_{12}+t_{22}}\right)\right]
\end{aligned}
$$

* Total score of all paths

$$
\begin{aligned}
& TotalScore \left(w_{0} \rightarrow w_{1}\right)\\
={} & \log \left(e^{\text {previous }[0]}+e^{\text {previous }[1]}\right) \\
={} & \log \left( e^{\log(e^{x_{01} + x_{11} + t_{11}} + e^{x_{02} + x_{11} + t_{21}})} + e^{\log(e^{x_{01} + x_{12} + t_{12}} + e^{x_{02} + x_{12} + t_{22}})}  \right) \\
={} &  \log \left( e^{x_{01} + x_{11} + t_{11}} + e^{x_{02} + x_{11} + t_{21}} +  e^{x_{01} + x_{12} + t_{12}} + e^{x_{02} + x_{12} + t_{22}} \right)
\end{aligned}
$$

All possible label sequences for a two-word sentence:

$label_{1} \rightarrow label_{1}, label_{1} \rightarrow label_{2}, label_{2} \rightarrow label_{1},  label_{2} \rightarrow label_{2}$

$$
\begin{aligned}
S_1 &= x_{01} + x_{11} + t_{11} \quad (label_{1} \rightarrow label_{1})\\
S_2 &= x_{02} + x_{11} + t_{21} \quad (label_{2} \rightarrow label_{1})\\
S_3 &= x_{01} + x_{12} + t_{12} \quad (label_{1} \rightarrow label_{2})\\
S_4 &= x_{02} + x_{12} + t_{22} \quad (label_{2} \rightarrow label_{2})
\end{aligned}
$$

The total score above is therefore exactly $\log(e^{S_1}+e^{S_2}+e^{S_3}+e^{S_4})$, the normalization term of the loss for a two-word sentence.
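The two-word case can be checked numerically. In the sketch below the score values are hypothetical, array indices are 0-based (so `x[0, 1]` stands for $x_{02}$ and `t[1, 0]` for $t_{21}$), and NumPy broadcasting plays the role of the "expand" steps.

```python
import numpy as np

# Hypothetical scores for the toy example.
x = np.array([[0.5, 1.0],   # x_01, x_02
              [0.2, 0.7]])  # x_11, x_12
t = np.array([[0.1, 0.4],   # t_11, t_12
              [0.3, 0.6]])  # t_21, t_22

previous, obs = x[0], x[1]

# Broadcasting expands previous down the columns and obs across the rows:
# scores[i, j] = previous[i] + obs[j] + t[i, j]
scores = previous[:, None] + obs[None, :] + t

# Update previous: log of the summed exponentials of each column.
previous = np.log(np.exp(scores).sum(axis=0))
total = np.log(np.exp(previous).sum())

# Direct sum over the four paths S_1..S_4.
S = [x[0, 0] + x[1, 0] + t[0, 0],
     x[0, 1] + x[1, 0] + t[1, 0],
     x[0, 0] + x[1, 1] + t[0, 1],
     x[0, 1] + x[1, 1] + t[1, 1]]
direct = np.log(np.exp(S).sum())
print(total, direct)
```

Both values agree, which is the point of the recursion: the column-wise update never has to list the paths explicitly.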

3. Suppose the sentence has three words, $[w_0, w_1, w_2]$

obs = $[x_{21}, x_{22}]$

previous = $[\log (e^{x_{01}+x_{11}+t_{11}}+e^{x_{02}+x_{11}+t_{21}}), \log (e^{x_{01}+x_{12}+t_{12}}+e^{x_{02}+x_{12}+t_{22}})]$

* Expand previous

$$
\text { previous }=\left(\begin{array}{ll}
\log \left(e^{x_{01}+x_{11}+t_{11}}+e^{x_{02}+x_{11}+t_{21}}\right) & \log \left(e^{x_{01}+x_{11}+t_{11}}+e^{x_{02}+x_{11}+t_{21}}\right) \\
\log \left(e^{x_{01}+x_{12}+t_{12}}+e^{x_{02}+x_{12}+t_{22}}\right) & \log \left(e^{x_{01}+x_{12}+t_{12}}+e^{x_{02}+x_{12}+t_{22}}\right)
\end{array}\right)
$$


* Expand obs

$$
\text { obs }=\left(\begin{array}{ll}
x_{21} & x_{22} \\
x_{21} & x_{22}
\end{array}\right)
$$

* Sum previous, obs, and the transition scores

$$
\text { scores }=\left(\begin{array}{ll}
\log \left(e^{x_{01}+x_{11}+t_{11}}+e^{x_{02}+x_{11}+t_{21}}\right)+x_{21}+t_{11} & \log \left(e^{x_{01}+x_{11}+t_{11}}+e^{x_{02}+x_{11}+t_{21}}\right)+x_{22}+t_{12} \\
\log \left(e^{x_{01}+x_{12}+t_{12}}+e^{x_{02}+x_{12}+t_{22}}\right)+x_{21}+t_{21} & \log \left(e^{x_{01}+x_{12}+t_{12}}+e^{x_{02}+x_{12}+t_{22}}\right)+x_{22}+t_{22}
\end{array}\right)
$$

* Update previous

$$
\begin{aligned}
& \text{previous}\\
={} & \left[\begin{array}{l}
\log \left( e^{ \log\left(e^{x_{01}+x_{11}+t_{11}}+e^{x_{02}+x_{11}+ t_{21}}\right) +x_{21}+t_{11}}  + e^{ \log\left(e^{x_{01}+x_{12}+t_{12}}+e^{x_{02}+x_{12}+ t_{22}}\right) +x_{21}+t_{21}}\right) , \\
\log \left( e^{ \log\left(e^{x_{01}+x_{11}+t_{11}}+e^{x_{02}+x_{11}+ t_{21}}\right) +x_{22}+t_{12}}  + e^{ \log\left(e^{x_{01}+x_{12}+t_{12}}+e^{x_{02}+x_{12}+ t_{22}}\right) +x_{22}+t_{22}}\right)
\end{array}\right] \\
={} & \left[\begin{array}{l}
\log \left(\left(e^{x_{01}+x_{11}+t_{11}}+e^{x_{02}+x_{11}+t_{21}}\right) e^{x_{21}+t_{11}}+\left(e^{x_{01}+x_{12}+t_{12}}+e^{x_{02}+x_{12}+t_{22}}\right) e^{x_{21}+t_{21}}\right) , \\
\log \left(\left(e^{x_{01}+x_{11}+t_{11}}+e^{x_{02}+x_{11}+t_{21}}\right) e^{x_{22}+t_{12}}+\left(e^{x_{01}+x_{12}+t_{12}}+e^{x_{02}+x_{12}+t_{22}}\right) e^{x_{22}+t_{22}}\right)
\end{array}\right]
\end{aligned}
$$

* Total score of all paths

$$
\begin{aligned}
& TotalScore \left(w_{0} \rightarrow w_{1} \rightarrow w_{2}\right)\\
={} & \log \left(e^{\text {previous }[0]}+e^{\text {previous }[1]}\right) \\
={} & \log \left(e^{\log \left(\left(e^{x_{01}+x_{11}+t_{11}}+e^{x_{02}+x_{11}+t_{21}}\right) e^{x_{21}+t_{11}}+\left(e^{x_{01}+x_{12}+t_{12}}+e^{x_{02}+x_{12}+t_{22}}\right) e^{x_{21}+t_{21}}\right)}  + e^{ \log \left(\left(e^{x_{01}+x_{11}+t_{11}}+e^{x_{02}+x_{11}+t_{21}}\right) e^{x_{22}+t_{12}}+\left(e^{x_{01}+x_{12}+t_{12}}+e^{x_{02}+x_{12}+t_{22}}\right) e^{x_{22}+t_{22}}\right) }   \right)\\
={} & \log \left(e^{x_{01}+x_{11}+t_{11}+x_{21}+t_{11}}+e^{x_{02}+x_{11}+t_{21}+x_{21}+t_{11}}+e^{x_{01}+x_{12}+t_{12}+x_{21}+t_{21}}+e^{x_{02}+x_{12}+t_{22}+x_{21}+t_{21}}+ e^{x_{01}+x_{11}+t_{11}+x_{22}+t_{12}}+ e^{x_{02}+x_{11}+t_{21}+x_{22}+t_{12}}+
e^{x_{01}+x_{12}+t_{12}+x_{22}+t_{22}}+e^{x_{02}+x_{12}+t_{22}+x_{22}+t_{22}}\right)
\end{aligned}
$$
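The recursion generalizes to any sentence length and any number of labels. The sketch below is one possible NumPy version, not a definitive implementation: `logsumexp`, `total_score`, and `brute_force` are names introduced here, the inputs are random, and indices are 0-based with `t[i, j]` scoring the move from label `i` to label `j`. The brute-force function enumerates all $k^n$ paths only to confirm the result on a small input.

```python
import itertools

import numpy as np


def logsumexp(a, axis=None):
    """Numerically stable log(sum(exp(a))) along the given axis."""
    m = np.max(a, axis=axis, keepdims=True)
    out = m + np.log(np.sum(np.exp(a - m), axis=axis, keepdims=True))
    return np.squeeze(out, axis=axis)


def total_score(x, t):
    """Log of the summed exp(path score) over all label paths."""
    previous = x[0]
    for obs in x[1:]:
        # scores[i, j] = previous[i] + obs[j] + t[i, j]
        scores = previous[:, None] + obs[None, :] + t
        previous = logsumexp(scores, axis=0)
    return logsumexp(previous)


def brute_force(x, t):
    """Same quantity, computed by listing every path explicitly."""
    n, k = x.shape
    totals = []
    for path in itertools.product(range(k), repeat=n):
        s = x[0, path[0]]
        for i in range(1, n):
            s += t[path[i - 1], path[i]] + x[i, path[i]]
        totals.append(s)
    return logsumexp(np.array(totals))


rng = np.random.default_rng(0)
x = rng.normal(size=(3, 2))  # three words, two labels
t = rng.normal(size=(2, 2))
print(total_score(x, t), brute_force(x, t))
```

The loop does $n-1$ updates of a $k \times k$ matrix, whereas the enumeration visits $k^n$ paths, which is why the recursive form is the one used in practice.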
