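#### Batch normalization
&emsp;&emsp;Batch normalization standardizes the inputs of a hidden layer to zero mean and unit variance across the mini-batch. Because standardization alone can limit what the network is able to represent, the normalized values are then scaled by a coefficient $\gamma$ and shifted by a coefficient $\beta$. Both $\gamma$ and $\beta$ are learned through backpropagation, so the network can recover whatever scale and offset work best.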

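$$\mu_{\beta} = \frac{1}{m}\sum_{i=1}^{m}{x_{i}} \tag{1-1}$$

$$\delta_{\beta}^{2} = \frac{1}{m}\sum_{i=1}^{m}{(x_{i} - \mu_{\beta})^{2}} \tag{1-2}$$

$$\hat{x}_{i} = \frac{x_{i} - \mu_{\beta}}{\sqrt{\delta_{\beta}^{2} + \epsilon}} \tag{1-3}$$

$$y_{i} = \gamma \hat{x}_{i} + \beta \tag{1-4}$$

&emsp;&emsp;Here $m$ is the batch size, the subscript on $\mu_{\beta}$ and $\delta_{\beta}^{2}$ refers to the mini-batch (not the shift coefficient $\beta$), and $\epsilon$ is a small constant for numerical stability.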
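&emsp;&emsp;Equations (1-1) to (1-4) can be sketched in PyTorch as follows. This is a minimal training-mode sketch for a 2-D input of shape (m, features); the function name `batch_norm_train` and the default `eps` are assumptions made for illustration and are not part of the original notebook.

```python
import torch

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Training-mode batch norm over the batch dimension (hypothetical helper)."""
    mu = x.mean(dim=0, keepdim=True)                  # (1-1) per-feature batch mean
    var = ((x - mu) ** 2).mean(dim=0, keepdim=True)   # (1-2) biased batch variance
    x_hat = (x - mu) / torch.sqrt(var + eps)          # (1-3) standardize
    return gamma * x_hat + beta                       # (1-4) scale and shift

torch.manual_seed(0)
x = torch.randn(8, 4) * 3.0 + 2.0   # batch of m=8 samples, 4 features
gamma = torch.ones(4)               # identity scale
beta = torch.zeros(4)               # zero shift
y = batch_norm_train(x, gamma, beta)
```

With $\gamma = 1$ and $\beta = 0$, every feature column of `y` should have mean close to 0 and variance close to 1.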

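&emsp;&emsp;Characteristics of batch normalization:
1. It depends on the batch size, because the statistics are computed per mini-batch.
2. It is less suitable for networks that process sequential data.
3. Batch statistics are computed only during training; at inference, statistics gathered from the training data are used instead.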
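&emsp;&emsp;The dependence on the batch can be seen in a small sketch: the same sample is normalized to different values depending on which other samples share its mini-batch. The helper name `standardize_over_batch` and the toy numbers are assumptions for illustration only.

```python
import torch

def standardize_over_batch(x, eps=1e-5):
    """Normalize each feature with the mean/variance of the current batch."""
    mu = x.mean(dim=0, keepdim=True)
    var = x.var(dim=0, unbiased=False, keepdim=True)
    return (x - mu) / torch.sqrt(var + eps)

sample = torch.tensor([[1.0, 2.0]])
# The same sample placed in two different mini-batches of size 3.
batch_a = torch.cat([sample, torch.tensor([[3.0, 4.0], [5.0, 6.0]])])
batch_b = torch.cat([sample, torch.tensor([[3.0, 4.0], [50.0, 60.0]])])

out_a = standardize_over_batch(batch_a)[0]   # normalized sample in batch_a
out_b = standardize_over_batch(batch_b)[0]   # normalized sample in batch_b
```

Here `out_a` and `out_b` differ even though the input sample is identical, which is why small or unrepresentative batches make the statistics noisy.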

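&emsp;&emsp;During the last training epoch, the mean and variance of every mini-batch in that epoch are recorded. At test time, the expectation of these batch means and the expectation of these batch variances are used to normalize the test data. Note that the variance uses the unbiased estimate.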

$$E[x] = E_{\beta}[\mu_{\beta}]$$

$$Var[x] = \frac{m}{m-1}{E_{\beta}[\delta_{\beta}^{2}]}$$
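&emsp;&emsp;The two inference-time statistics above can be sketched as follows: per-batch means and biased variances are averaged over the batches, and the averaged variance is rescaled by $m/(m-1)$. The helper name `population_stats`, the toy data, and the assumption that all batches have the same size are illustrative choices. Libraries such as PyTorch usually keep running estimates during training instead of making a separate pass.

```python
import torch

def population_stats(batches):
    """Average per-batch statistics; assumes every batch has the same size m > 1."""
    m = batches[0].shape[0]
    means = torch.stack([b.mean(dim=0) for b in batches])
    biased_vars = torch.stack([b.var(dim=0, unbiased=False) for b in batches])
    mean = means.mean(dim=0)                        # E[x] = E_B[mu_B]
    var = m / (m - 1) * biased_vars.mean(dim=0)     # Var[x] = m/(m-1) * E_B[delta_B^2]
    return mean, var

torch.manual_seed(0)
batches = [torch.randn(16, 3) * 2.0 + 1.0 for _ in range(50)]
mean, var = population_stats(batches)

# Normalize a test sample with the fixed training statistics (gamma=1, beta=0 here).
x_test = torch.tensor([[1.0, 1.0, 1.0]])
x_test_hat = (x_test - mean) / torch.sqrt(var + 1e-5)
```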

#### Layer normalization

&emsp;&emsp;Layer normalization is better suited to RNNs and to training or prediction on a single sample.

$$\mu = \frac{1}{H}\sum_{i=1}^{H}{x_{i}} \tag{2-1}$$

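$$\delta = \sqrt{\frac{1}{H}\sum_{i=1}^{H}{(x_{i} - \mu)^{2}}} \tag{2-2}$$

$$y = g \odot \frac{x-\mu}{\sqrt{\delta^{2} + \epsilon}} + b \tag{2-3}$$

&emsp;&emsp;Here $g$ and $b$ are learnable parameters, $\odot$ denotes element-wise multiplication, and $\epsilon$ is a small constant for numerical stability.

&emsp;&emsp;In the formulas above, $H$ is the number of units in a layer. All units of a layer share one mean and one variance, while each training sample has its own mean and variance.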
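&emsp;&emsp;The layer normalization formulas can be sketched in PyTorch as below. This is a minimal sketch that normalizes over the last dimension; the function name `layer_norm` and the default `eps` are assumptions for illustration.

```python
import torch

def layer_norm(x, g, b, eps=1e-5):
    """Normalize over the last dimension (the H units of a layer), per sample."""
    mu = x.mean(dim=-1, keepdim=True)                    # (2-1)
    var = ((x - mu) ** 2).mean(dim=-1, keepdim=True)     # square of (2-2)
    return g * (x - mu) / torch.sqrt(var + eps) + b      # (2-3)

torch.manual_seed(0)
x = torch.randn(2, 5)      # 2 samples, H = 5 units
g = torch.ones(5)
b = torch.zeros(5)
y = layer_norm(x, g, b)

# Each sample uses its own statistics, so normalizing it alone gives the same result.
y_single = layer_norm(x[:1], g, b)
```

Because the statistics never cross the batch dimension, the result for a sample does not change with the batch size, which is what makes the method usable with a single sample.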

The output of each sub-layer is LayerNorm(x + Sublayer(x)).

In [12]:
import copy
import numpy as np  # needed by subsequent_mask and the np.triu cells below
import torch
import torch.nn as nn

In [13]:
def clones(module,N):
    '''
    Produce N identical layers
    '''
    return nn.ModuleList([copy.deepcopy(module) for _ in range(N)])
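A short usage sketch for `clones`, showing that the copies start with identical values but do not share parameters. The helper is repeated so that the example runs on its own; the `nn.Linear(4, 4)` module is an arbitrary choice for illustration.

```python
import copy
import torch
import torch.nn as nn

def clones(module, N):
    """Same helper as above, repeated so the example is self-contained."""
    return nn.ModuleList([copy.deepcopy(module) for _ in range(N)])

layers = clones(nn.Linear(4, 4), 3)

# Copies start with identical values but live in separate storage.
same_values = torch.equal(layers[0].weight, layers[1].weight)
shared_storage = layers[0].weight.data_ptr() == layers[1].weight.data_ptr()

# Changing one copy does not affect the others.
with torch.no_grad():
    layers[0].weight.zero_()
still_nonzero = bool(layers[1].weight.abs().sum() > 0)
```

This is why `deepcopy` is used rather than a list of references to one module: each layer in the stack must learn its own weights.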

In [14]:
class LayerNorm(nn.Module):
    '''
    Construct a layernorm module (see citation for details)
    '''
    def __init__(self,features,eps=1e-6):
        super(LayerNorm,self).__init__()
        self.a = nn.Parameter(torch.ones(features))
        self.b = nn.Parameter(torch.zeros(features))
        self.eps = eps
        
    def forward(self,x):
        mean = x.mean(-1,keepdim=True)
        std = x.std(-1,keepdim=True)
        return self.a * (x-mean)/(std + self.eps) + self.b
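The `LayerNorm` class above differs slightly from formula (2-3): `x.std` divides by H - 1 by default, and `eps` is added to the standard deviation rather than placed inside the square root. The sketch below compares the two variants numerically with the scale set to 1 and the shift set to 0; the tensor shape and seed are arbitrary assumptions.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
H = 6
x = torch.randn(2, H)
eps = 1e-6

# Variant used by the LayerNorm class above: unbiased std, eps added to std.
mean = x.mean(-1, keepdim=True)
std = x.std(-1, keepdim=True)            # divides by H - 1 by default
y_class = (x - mean) / (std + eps)

# Variant of the formula: biased variance, eps inside the square root.
y_formula = F.layer_norm(x, (H,), eps=eps)

# The two differ by a constant factor of roughly sqrt((H - 1) / H).
ratio = (y_class / y_formula).mean().item()
```

For wide layers the factor approaches 1, so the difference rarely matters in practice, but it explains small mismatches against `torch.nn.LayerNorm`.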

In [15]:
class Encoder(nn.Module):
    '''
    core encoder is a stack of n layers
    '''
    def __init__(self,layer,N):
        super(Encoder,self).__init__()
        self.layers = clones(layer,N)
        self.norm = LayerNorm(layer.size)
    
    def forward(self,x,mask):
        '''
        pass the input (and mask) through each layer in turn
        '''
        for layer in self.layers:
            x = layer(x,mask)
        return self.norm(x)

In [16]:
class SublayerConnection(nn.Module):
    '''
    A residual connection followed by a layer norm.
    Note: for code simplicity the norm is applied first as opposed to last.
    '''
    def __init__(self,size,dropout):
        super(SublayerConnection,self).__init__()
        self.norm = LayerNorm(size)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self,x,sublayer):
        '''
        apply residual connection to any sublayer with the same size
        '''
        return x + self.dropout(sublayer(self.norm(x)))
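The text states the sub-layer output as LayerNorm(x + Sublayer(x)), while `SublayerConnection` applies the norm first. The sketch below puts the two orderings side by side. It uses the built-in `nn.LayerNorm` as a stand-in and a hypothetical `nn.Linear` sub-layer; both are assumptions for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
size = 8
norm = nn.LayerNorm(size)          # built-in layer norm, used here as a stand-in
dropout = nn.Dropout(0.1)
sublayer = nn.Linear(size, size)   # hypothetical sub-layer; any size-preserving module works

x = torch.randn(2, 5, size)        # (batch, sequence length, features)

def post_norm(x):
    # Ordering in the text: LayerNorm(x + Sublayer(x))
    return norm(x + dropout(sublayer(x)))

def pre_norm(x):
    # Ordering in SublayerConnection: x + Dropout(Sublayer(LayerNorm(x)))
    return x + dropout(sublayer(norm(x)))

out_post = post_norm(x)
out_pre = pre_norm(x)
```

Both orderings keep the shape of `x`. With the norm applied first, the residual path is left unnormalized, which is why `Encoder` and `Decoder` apply one final `LayerNorm` after the last layer.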

In [22]:
class EncoderLayer(nn.Module):
    '''
    encoder is made up of self-attn and feed forward
    '''
    def __init__(self,size,self_attn,feed_forward,dropout):
        super(EncoderLayer,self).__init__()
        self.self_attn = self_attn
        self.feed_forward = feed_forward
        self.sublayer = clones(SublayerConnection(size,dropout),2)
        self.size = size
        
    def forward(self,x,mask):
        x = self.sublayer[0](x,lambda x:self.self_attn(x,x,x,mask))
        return self.sublayer[1](x,self.feed_forward)

In [23]:
def subsequent_mask(size):
    '''
    mask out subsequent positions
    entry [row, column] is True when the target word at `row` may look at position `column`
    '''
    attn_shape = (1,size,size)
    # each target word attends only to itself and the words before it
    subsequent_mask = np.triu(np.ones(attn_shape),k=1).astype('uint8')
    return torch.from_numpy(subsequent_mask) == 0
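The same mask can also be built without NumPy. The sketch below is a torch-only variant under the name `subsequent_mask_torch`, which is hypothetical and not part of the original notebook, together with the mask it produces for three positions.

```python
import torch

def subsequent_mask_torch(size):
    """Hypothetical torch-only variant: True where attention is allowed."""
    upper = torch.triu(torch.ones(1, size, size), diagonal=1)  # 1s strictly above the diagonal
    return upper == 0

mask = subsequent_mask_torch(3)
print(mask)
# tensor([[[ True, False, False],
#          [ True,  True, False],
#          [ True,  True,  True]]])
```

Row i is True up to and including column i, so the first target word sees only itself and the last one sees every position.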

In [25]:
class Decoder(nn.Module):
    '''
    generic N layer decoder with masking
    '''
    def __init__(self,layer,N):
        super(Decoder,self).__init__()
        self.layers = clones(layer,N)
        self.norm = LayerNorm(layer.size)
        
    def forward(self,x,memory,src_mask,tgt_mask):
        for layer in self.layers:
            x = layer(x,memory,src_mask,tgt_mask)
        return self.norm(x)

In [26]:
class DecoderLayer(nn.Module):
    '''
    decoder is made of self-attn,src-attn and feed forward
    '''
    def __init__(self,size,self_attn,src_attn,feed_forward,dropout):
        super(DecoderLayer,self).__init__()
        self.size = size
        self.self_attn = self_attn
        self.src_attn = src_attn
        self.feed_forward = feed_forward
        self.sublayer = clones(SublayerConnection(size,dropout),3)
        
    def forward(self,x,memory,src_mask,tgt_mask):
        m = memory
        x = self.sublayer[0](x,lambda x:self.self_attn(x,x,x,tgt_mask))
        x = self.sublayer[1](x,lambda x:self.src_attn(x,m,m,src_mask))
        return self.sublayer[2](x,self.feed_forward)

$$Attention(Q,K,V) = softmax(\frac{QK^{T}}{\sqrt{d_{k}}})V$$

In [None]:
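import math
def attention(query,key,value,mask=None,dropout=None):
    'compute scaled dot product attention'
    d_k = query.size(-1)
    scores = torch.matmul(query,key.transpose(-2,-1)) / math.sqrt(d_k)
    if mask is not None:
        # positions where mask == 0 get a very negative score, so softmax gives them ~0 weight
        scores = scores.masked_fill(mask == 0, -1e9)
    p_attn = torch.softmax(scores, dim=-1)
    if dropout is not None:
        p_attn = dropout(p_attn)
    # returning the weights alongside the output is an assumption; the formula only defines the output
    return torch.matmul(p_attn, value), p_attn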
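A small worked example of the attention formula with a causal mask may help. The steps are written inline so that the block runs on its own; the tensor sizes, the seed, and the convention that the mask is True where attention is allowed are assumptions for illustration.

```python
import math
import torch

torch.manual_seed(0)
seq_len, d_k = 3, 4
q = torch.randn(1, seq_len, d_k)
k = torch.randn(1, seq_len, d_k)
v = torch.randn(1, seq_len, d_k)

# Causal mask: True where attention is allowed (position i may see positions <= i).
mask = torch.tril(torch.ones(1, seq_len, seq_len)).bool()

scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)   # QK^T / sqrt(d_k)
scores = scores.masked_fill(mask == 0, -1e9)   # blocked positions get a very negative score
weights = torch.softmax(scores, dim=-1)        # each row sums to 1
output = torch.matmul(weights, v)              # weighted sum of the values
```

The first position can attend only to itself, so its attention weight on itself is 1 and its output equals its own value vector.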

In [19]:
test_ones = np.ones((1,3,3))

In [20]:
test_ones

array([[[1., 1., 1.],
        [1., 1., 1.],
        [1., 1., 1.]]])

In [21]:
np.triu(test_ones,k=1).astype('uint8')

array([[[0, 1, 1],
        [0, 0, 1],
        [0, 0, 0]]], dtype=uint8)