## 7.2 Visualizing How BERT Works

### 7.2.1 Overall Architecture of BERT
The overall architecture of BERT is shown in the figure below. It uses the Encoder part of the Transformer.
![image.png](attachment:image.png)

The code for the Trm module (a single Transformer block) in the figure is as follows:

In [6]:
import torch
import torch.nn as nn
class TransformerBlock(nn.Module):
    def __init__(self, k, heads):
        super().__init__()

        # SelfAttention is assumed to be defined elsewhere (multi-head self-attention over x)
        self.attention = SelfAttention(k, heads=heads)

        self.norm1 = nn.LayerNorm(k)
        self.norm2 = nn.LayerNorm(k)

        self.mlp = nn.Sequential(
            nn.Linear(k, 4*k),
            nn.ReLU(),
            nn.Linear(4*k, k)
        )

    def forward(self, x):
        # Apply self-attention first
        attended = self.attention(x)
        # Residual connection followed by layer norm
        x = self.norm1(attended + x)

        # Feed-forward network, then residual connection and layer norm
        feedforward = self.mlp(x)
        return self.norm2(feedforward + x)
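
The block above relies on a `SelfAttention` class that is not shown in this section. The following is a minimal sketch of what such a class might look like, assuming standard multi-head scaled dot-product attention; the layer names (`to_queries`, `to_keys`, `to_values`, `unify`) are illustrative and not taken from the original code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SelfAttention(nn.Module):
    """Hypothetical multi-head self-attention; not the author's implementation."""

    def __init__(self, k, heads=8):
        super().__init__()
        assert k % heads == 0, "embedding size k must be divisible by heads"
        self.k, self.heads = k, heads
        # One linear map each for queries, keys and values, plus an output projection.
        self.to_queries = nn.Linear(k, k, bias=False)
        self.to_keys = nn.Linear(k, k, bias=False)
        self.to_values = nn.Linear(k, k, bias=False)
        self.unify = nn.Linear(k, k)

    def forward(self, x):
        b, t, k = x.size()
        h, s = self.heads, k // self.heads
        # Split the embedding dimension into h heads: (b, t, k) -> (b, h, t, s)
        q = self.to_queries(x).view(b, t, h, s).transpose(1, 2)
        key = self.to_keys(x).view(b, t, h, s).transpose(1, 2)
        v = self.to_values(x).view(b, t, h, s).transpose(1, 2)
        # Scaled dot-product attention weights: (b, h, t, t)
        scores = torch.matmul(q, key.transpose(-2, -1)) / (s ** 0.5)
        weights = F.softmax(scores, dim=-1)
        # Weighted sum of values, then merge the heads back: (b, t, k)
        out = torch.matmul(weights, v).transpose(1, 2).contiguous().view(b, t, k)
        return self.unify(out)


x = torch.randn(2, 5, 16)          # (batch, sequence length, embedding size)
attention = SelfAttention(16, heads=4)
print(attention(x).shape)          # torch.Size([2, 5, 16])
```

Because the output has the same shape as the input, it can be added back to `x` in the residual connection `attended + x`.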

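### 7.2.2 Input to the BERT Model
The encoding vector that BERT takes as input (d_model=512 here; note that the code in Section 7.3 defaults to a hidden size of 768) is the element-wise sum of three embedding features: the token, position, and segment embeddings, as shown in the figure below.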
![image.png](attachment:image.png)
### 7.2.3 Masked LM
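The masked language model (MLM) is a truly bidirectional approach, whereas the ELMo model mentioned earlier merely trains a left-to-right model and a right-to-left model separately and concatenates them. The difference is clear from their objective functions. ELMo uses $P(t_k \mid t_1,\ldots,t_{k-1})$ and $P(t_k \mid t_{k+1},\ldots,t_n)$ as objectives, trains them independently, and concatenates the results. BERT uses $P(t_k \mid t_1,\ldots,t_{k-1},t_{k+1},\ldots,t_n)$ as its objective, so the learned word representations can attend to both the left and right context at once.

During BERT training, 15% of the WordPiece tokens (for Chinese, this needs to be set at the word level) are randomly masked. Because the test setting contains no such mask marker, BERT's authors used a masking trick to keep the training and test conditions as close as possible: once a token has been chosen for masking, 80% of the time it is replaced with [MASK], 10% of the time it is replaced with another random word, and 10% of the time the original token is kept. The full MLM training process is shown in the figure below.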
 ![image.png](attachment:image.png)
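
The 80/10/10 rule described above can be sketched in plain Python. This is an illustration only: the function name `mask_tokens`, the toy vocabulary, and the choice to select each token independently with probability 0.15 are assumptions, not the procedure of any particular BERT implementation.

```python
import random

MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, vocab, mask_prob=0.15, rng=random):
    """Return (masked_tokens, labels); labels are None where no prediction is needed."""
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            labels.append(token)                  # the model must recover the original token
            roll = rng.random()
            if roll < 0.8:
                masked.append(MASK_TOKEN)         # 80%: replace with [MASK]
            elif roll < 0.9:
                # 10%: replace with a random token (may occasionally equal the original)
                masked.append(rng.choice(vocab))
            else:
                masked.append(token)              # 10%: keep the original token
        else:
            masked.append(token)                  # not selected: left unchanged
            labels.append(None)
    return masked, labels


vocab = ["my", "dog", "is", "hairy", "apple"]
masked, labels = mask_tokens(["my", "dog", "is", "hairy"], vocab, rng=random.Random(0))
print(masked, labels)
```

Positions whose label is `None` do not contribute to the MLM loss; only the selected positions are predicted.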


### 7.2.4 Next Sentence Prediction
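Many downstream tasks involve question answering (QA) and natural language inference (NLI), so a two-sentence task, Next Sentence Prediction (NSP), was added to help the model understand the relationship between two sentences. In this task, the training input is a pair of sentences A and B, where B is the actual next sentence of A half of the time; given the pair, the model predicts whether B follows A. NSP pre-training can reach 97%~98% accuracy. The training process is shown in the figure below.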
![image.png](attachment:image.png)
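
As a rough illustration of how such sentence pairs might be constructed, the sketch below builds one NSP example from a list of sentences. The function name `make_nsp_example` and the sampling of negatives from the same list are assumptions made for brevity; it also assumes at least three sentences and that `index` is not the last position.

```python
import random


def make_nsp_example(sentences, index, rng=random):
    """Build (sentence_a, sentence_b, is_next) for the sentence at `index`."""
    sentence_a = sentences[index]
    if rng.random() < 0.5:
        # Positive example: B is the sentence that actually follows A.
        return sentence_a, sentences[index + 1], 1
    # Negative example: B is any sentence other than A itself or the true next sentence.
    candidates = [i for i in range(len(sentences)) if i not in (index, index + 1)]
    return sentence_a, sentences[rng.choice(candidates)], 0


sentences = ["I went to the store.", "I bought milk.", "Penguins cannot fly.", "It rained."]
print(make_nsp_example(sentences, 0, rng=random.Random(0)))
```

The label `1` marks a true next sentence and `0` a randomly drawn one, which gives the binary classification target for NSP.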

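BERT training comprises both MLM and NSP. For the exact definition of the loss functions, see the corresponding code in the Hugging Face transformers repository:  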
https://github.com/huggingface/transformers/blob/master/src/transformers/models
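
One common way to combine the two objectives is to sum a cross-entropy loss over the masked positions and a cross-entropy loss over the is-next label. The sketch below illustrates that idea with random stand-in logits; the tensor names and sizes are assumptions, and using `-100` to mark unmasked positions is simply the default `ignore_index` of `nn.CrossEntropyLoss`.

```python
import torch
import torch.nn as nn

vocab_size, batch, seq_len = 100, 2, 6

# Stand-ins for model outputs (random here; a real model would produce them).
mlm_logits = torch.randn(batch, seq_len, vocab_size)   # per-token vocabulary scores
nsp_logits = torch.randn(batch, 2)                     # is-next / not-next scores

# MLM labels: original token id at masked positions, -100 everywhere else.
mlm_labels = torch.full((batch, seq_len), -100, dtype=torch.long)
mlm_labels[0, 1] = 17
mlm_labels[1, 4] = 42
nsp_labels = torch.tensor([1, 0])

# CrossEntropyLoss skips targets equal to ignore_index, so only masked tokens count.
loss_fn = nn.CrossEntropyLoss(ignore_index=-100)
mlm_loss = loss_fn(mlm_logits.view(-1, vocab_size), mlm_labels.view(-1))
nsp_loss = loss_fn(nsp_logits, nsp_labels)
total_loss = mlm_loss + nsp_loss
print(total_loss.item())
```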

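## 7.3 Implementing BERT in PyTorch
The core of the PyTorch implementation consists of two main pieces: a module representing the Transformer block, and the BERTEmbedding class that produces BERT's input. The module bert.py combines the two into the BERT model. The relationships between these modules are shown in the figure below.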
![image.png](attachment:image.png)

### 7.3.1 Code for the BERTEmbedding Class

In [9]:
import torch.nn as nn
from model.embedding.token import TokenEmbedding
from model.embedding.position import PositionalEmbedding
from model.embedding.segment import SegmentEmbedding


class BERTEmbedding(nn.Module):
    """
    BERT Embedding, which consists of the following features:
        1. TokenEmbedding : normal embedding matrix
        2. PositionalEmbedding : adding positional information using sin, cos
        3. SegmentEmbedding : adding sentence segment info, (sent_A:1, sent_B:2)

        The sum of these features is the output of BERTEmbedding
    """

    def __init__(self, vocab_size, embed_size, dropout=0.1):
        """
        :param vocab_size: total vocab size
        :param embed_size: embedding size of token embedding
        :param dropout: dropout rate
        """
        super().__init__()
        self.token = TokenEmbedding(vocab_size=vocab_size, embed_size=embed_size)
        self.position = PositionalEmbedding(d_model=self.token.embedding_dim)
        self.segment = SegmentEmbedding(embed_size=self.token.embedding_dim)
        self.dropout = nn.Dropout(p=dropout)
        self.embed_size = embed_size

    def forward(self, sequence, segment_label):
        x = self.token(sequence) + self.position(sequence) + self.segment(segment_label)
        return self.dropout(x)
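
The three embedding classes are imported from `model.embedding` and are not shown here. As a self-contained sketch of the same sum, the example below uses plain `nn.Embedding` layers for tokens and segments and a fixed sinusoidal table for positions. The class name `SinusoidalPositionalEmbedding`, `max_len=512`, and the use of index 0 for padding are assumptions for illustration.

```python
import math
import torch
import torch.nn as nn


class SinusoidalPositionalEmbedding(nn.Module):
    """Fixed sin/cos position table; a stand-in for PositionalEmbedding."""

    def __init__(self, d_model, max_len=512):
        super().__init__()
        position = torch.arange(max_len).unsqueeze(1).float()
        div_term = torch.exp(
            torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        table = torch.zeros(max_len, d_model)
        table[:, 0::2] = torch.sin(position * div_term)    # even dimensions
        table[:, 1::2] = torch.cos(position * div_term)    # odd dimensions
        self.register_buffer("table", table.unsqueeze(0))  # (1, max_len, d_model)

    def forward(self, sequence):
        # Only the sequence length is used; the result broadcasts over the batch.
        return self.table[:, :sequence.size(1)]


vocab_size, embed_size = 1000, 16   # embed_size must be even for this table
token = nn.Embedding(vocab_size, embed_size, padding_idx=0)
segment = nn.Embedding(3, embed_size, padding_idx=0)   # 0 = padding, 1 = sent_A, 2 = sent_B
position = SinusoidalPositionalEmbedding(embed_size)

sequence = torch.tensor([[5, 8, 9, 0]])        # token ids, 0 is padding
segment_label = torch.tensor([[1, 1, 2, 0]])   # segment id for each token
x = token(sequence) + position(sequence) + segment(segment_label)
print(x.shape)   # torch.Size([1, 4, 16])
```

All three terms share the last dimension `embed_size`, which is why they can be added element-wise.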


### 7.3.2 Code for the TransformerBlock Class

In [11]:
import torch.nn as nn

from model.attention import MultiHeadedAttention
from model.utils import SublayerConnection, PositionwiseFeedForward


class TransformerBlock(nn.Module):
    """
    Bidirectional Encoder = Transformer (self-attention)
    Transformer = MultiHead_Attention + Feed_Forward with sublayer connection
    """

    def __init__(self, hidden, attn_heads, feed_forward_hidden, dropout):
        """
        :param hidden: hidden size of transformer
        :param attn_heads: number of heads in multi-head attention
        :param feed_forward_hidden: feed_forward_hidden, usually 4*hidden_size
        :param dropout: dropout rate
        """

        super().__init__()
        self.attention = MultiHeadedAttention(h=attn_heads, d_model=hidden)
        self.feed_forward = PositionwiseFeedForward(d_model=hidden, d_ff=feed_forward_hidden, dropout=dropout)
        self.input_sublayer = SublayerConnection(size=hidden, dropout=dropout)
        self.output_sublayer = SublayerConnection(size=hidden, dropout=dropout)
        self.dropout = nn.Dropout(p=dropout)

    def forward(self, x, mask):
        x = self.input_sublayer(x, lambda _x: self.attention.forward(_x, _x, _x, mask=mask))
        x = self.output_sublayer(x, self.feed_forward)
        return self.dropout(x)
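
`SublayerConnection` is imported from `model.utils` and its definition is not shown. One plausible form, assuming it wraps a sublayer with layer normalization, dropout, and a residual connection, is sketched below; the exact ordering of these steps in the original module is an assumption.

```python
import torch
import torch.nn as nn


class SublayerConnection(nn.Module):
    """Hypothetical residual wrapper: x + dropout(sublayer(norm(x)))."""

    def __init__(self, size, dropout):
        super().__init__()
        self.norm = nn.LayerNorm(size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        # Normalize first, apply the sublayer, then add the residual.
        return x + self.dropout(sublayer(self.norm(x)))


layer = SublayerConnection(size=8, dropout=0.0)
x = torch.randn(2, 3, 8)
out = layer(x, nn.Linear(8, 8))
print(out.shape)   # torch.Size([2, 3, 8])
```

Passing the sublayer as a callable is what lets `forward` above hand in either a lambda around the attention call or the feed-forward module directly.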


### 7.3.3 Code for Building BERT

In [13]:
import torch.nn as nn

from model.transformer import TransformerBlock
from model.embedding import BERTEmbedding


class BERT(nn.Module):
    """
    BERT model : Bidirectional Encoder Representations from Transformers.
    """

    def __init__(self, vocab_size, hidden=768, n_layers=12, attn_heads=12, dropout=0.1):
        """
        :param vocab_size: vocab_size of total words
        :param hidden: BERT model hidden size
        :param n_layers: number of Transformer blocks (layers)
        :param attn_heads: number of attention heads
        :param dropout: dropout rate
        """

        super().__init__()
        self.hidden = hidden
        self.n_layers = n_layers
        self.attn_heads = attn_heads

        # paper noted they used 4*hidden_size for ff_network_hidden_size
        self.feed_forward_hidden = hidden * 4

        # embedding for BERT, sum of positional, segment, token embeddings
        self.embedding = BERTEmbedding(vocab_size=vocab_size, embed_size=hidden)

        # multi-layers transformer blocks, deep network
        self.transformer_blocks = nn.ModuleList(
            [TransformerBlock(hidden, attn_heads, self.feed_forward_hidden, dropout) for _ in range(n_layers)])

    def forward(self, x, segment_info):
        # attention mask for padded tokens (token id 0 is treated as padding)
        # mask shape: (batch_size, 1, seq_len, seq_len)
        mask = (x > 0).unsqueeze(1).repeat(1, x.size(1), 1).unsqueeze(1)

        # embedding the indexed sequence to sequence of vectors
        x = self.embedding(x, segment_info)

        # running over multiple transformer blocks
        for transformer in self.transformer_blocks:
            x = transformer.forward(x, mask)

        return x
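
To see what the padding mask in `forward` looks like, the expression can be run on a tiny batch. The token ids below are made up; the only assumption carried over from the code is that id 0 marks padding.

```python
import torch

x = torch.tensor([[5, 6, 0],
                  [7, 0, 0]])   # two sequences padded with 0 to length 3

# (batch, seq_len) -> (batch, 1, seq_len) -> (batch, seq_len, seq_len) -> (batch, 1, seq_len, seq_len)
mask = (x > 0).unsqueeze(1).repeat(1, x.size(1), 1).unsqueeze(1)
print(mask.shape)    # torch.Size([2, 1, 3, 3])
print(mask[0, 0])    # every row is [True, True, False]
```

The extra dimension of size 1 lets the same mask broadcast across all attention heads, and each row marks which key positions are real tokens.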
