### Dropout (inverted dropout)
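- Operation: each hidden unit is dropped (set to zero) with probability p; otherwise, with probability 1-p, it is kept and scaled up by dividing by 1-p: $$ h^{'} = \frac{ξ}{1-p}h $$, where the random variable ξ takes the values 0 and 1 with probabilities p and 1-p respectively, i.e. it follows a Bernoulli distribution. The expectation of ξ is therefore 1-p, and the expectation of the rescaled unit is $$ E(h^{'}) = \frac{E(ξ)}{1-p}h = h $$, so dropout does not change the expected value of its input
- The drop probability p is a hyperparameter
- Because any hidden unit may be zeroed out, the output layer cannot rely too heavily on any single hidden unit, which acts as regularization
- To get deterministic results, dropout is not used when testing the model
- Paper:
    - Co-authored by Hinton: [Dropout: A Simple Way to Prevent Neural Networks from Overfitting](https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf). The University of Toronto is impressive!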
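
The expectation argument above can be checked numerically. The following is a minimal sketch in plain Python, not the MXNet implementation used in this notebook; `inverted_dropout` and its `training` flag are hypothetical names introduced only for illustration.

```python
import random


def inverted_dropout(h, p, training=True, rng=random):
    """Apply inverted dropout to a list of activations (illustrative helper)."""
    if not training or p == 0:
        return list(h)  # no dropout at test time: output is deterministic
    if p == 1:
        return [0.0 for _ in h]  # every unit is dropped
    keep = 1 - p
    # rng.random() is uniform on [0, 1), so the test is True with probability keep
    return [x / keep if rng.random() < keep else 0.0 for x in h]


rng = random.Random(0)  # fixed seed so the run is repeatable
h, p, trials = 3.0, 0.5, 100_000
mean = sum(inverted_dropout([h], p, rng=rng)[0] for _ in range(trials)) / trials
print(mean)  # should be close to h, as E(h') = h predicts
```

Each sample is either 0 or h / (1 - p), yet the average stays near h, which is why no rescaling is needed when dropout is switched off at test time.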


In [1]:
import sys
from mxnet import autograd, gluon, init, nd
from mxnet.gluon import loss as gloss, nn

In [2]:
def dropout(X, drop_prob):
    assert 0 <= drop_prob <= 1
    keep_prob = 1 - drop_prob
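    # With drop_prob == 1, every element is dropped
    if keep_prob == 0:
        return X.zeros_like()
    # A uniform(0, 1) sample is below keep_prob with probability keep_prob,
    # so the comparison yields a Bernoulli(keep_prob) mask of 0s and 1s
    mask = nd.random.uniform(0, 1, X.shape) < keep_prob
    return mask * X / keep_prob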

In [3]:
X = nd.arange(16).reshape((2, 8))
dropout(X, 0)


[[ 0.  1.  2.  3.  4.  5.  6.  7.]
 [ 8.  9. 10. 11. 12. 13. 14. 15.]]
<NDArray 2x8 @cpu(0)>

In [4]:
nd.random.uniform(0, 1, X.shape)


[[0.96366274 0.2726563  0.3834415  0.47766513 0.79172504 0.8121687
  0.5288949  0.47997716]
 [0.56804454 0.3927848  0.92559665 0.83607876 0.07103606 0.33739617
  0.08712929 0.6481719 ]]
<NDArray 2x8 @cpu(0)>

In [5]:
dropout(X, 0.5)


[[ 0.  2.  0.  0.  0. 10.  0.  0.]
 [ 0. 18.  0.  0. 24.  0.  0.  0.]]
<NDArray 2x8 @cpu(0)>

In [6]:
dropout(X, 1)


[[0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0.]]
<NDArray 2x8 @cpu(0)>

In [7]:
num_inputs, num_outputs, num_hiddens1, num_hiddens2 = 784, 10, 256, 256

W1 = nd.random.normal(scale=0.01, shape=(num_inputs, num_hiddens1))
b1 = nd.zeros(num_hiddens1)
W2 = nd.random.normal(scale=0.01, shape=(num_hiddens1, num_hiddens2))
b2 = nd.zeros(num_hiddens2)
W3 = nd.random.normal(scale=0.01, shape=(num_hiddens2, num_outputs))
b3 = nd.zeros(num_outputs)

params = [W1, b1, W2, b2, W3, b3]
for param in params:
    param.attach_grad()

In [8]:
drop_prob1, drop_prob2 = 0.2, 0.5

def net(X):
    X = X.reshape((-1, num_inputs))
    H1 = (nd.dot(X, W1) + b1).relu()
    if autograd.is_training():  # Apply dropout only while training
        H1 = dropout(H1, drop_prob1)  # Dropout after the first fully connected layer
    H2 = (nd.dot(H1, W2) + b2).relu()
    if autograd.is_training():
        H2 = dropout(H2, drop_prob2)  # Dropout after the second fully connected layer
    return nd.dot(H2, W3) + b3

In [9]:
def get_dataloader_workers(num_workers=4):
    # 0 means no additional process is used to speed up the reading of data.
    if sys.platform.startswith('win'):
        return 0
    else:
        return num_workers
    
    
def load_data_fashion_mnist(batch_size, resize=None):
    """Download the Fashion-MNIST dataset and then load into memory."""
    dataset = gluon.data.vision
    trans = [dataset.transforms.Resize(resize)] if resize else []
    trans.append(dataset.transforms.ToTensor())
    trans = dataset.transforms.Compose(trans)
    mnist_train = dataset.FashionMNIST(train=True).transform_first(trans)
    mnist_test = dataset.FashionMNIST(train=False).transform_first(trans)
    return (gluon.data.DataLoader(mnist_train, batch_size, shuffle=True,
                                  num_workers=get_dataloader_workers()),
            gluon.data.DataLoader(mnist_test, batch_size, shuffle=False,
                                  num_workers=get_dataloader_workers()))

In [10]:
# Compute prediction accuracy
def accuracy(y_hat, y):
#     return (nd.argmax(y_hat) == y).sum()/y.shape[0]
    return (y_hat.argmax(axis=1) == y.astype('float32')).mean().asscalar()


# Evaluate the model
def evaluate_accuracy(data_iter, net):
    acc_sum, n = 0.0, 0
    for X, y in data_iter:
        y = y.astype('float32')
        acc_sum += (net(X).argmax(axis=1) == y).sum().asscalar()
        n += y.size
    return acc_sum / n

In [11]:
# Train the model
def sgd(params, lr, batch_size):
    """
    Parameters are updated as usual; dropout is applied inside the model's
    forward pass, not in this update step.
    """
    for param in params:
        param[:] = param - lr * param.grad / batch_size
        
        
def train_ch3(net, train_iter, test_iter, loss, num_epochs, batch_size,
              params=None, lr=None, trainer=None):
    for epoch in range(num_epochs):
        train_l_sum, train_acc_sum, n = 0.0, 0.0, 0
        for X, y in train_iter:
            with autograd.record():
                y_hat = net(X)
                l = loss(y_hat, y).sum()
            l.backward()
            if trainer is None:
                sgd(params, lr, batch_size)
            else:
                trainer.step(batch_size)
            y = y.astype('float32')
            train_l_sum += l.asscalar()
            train_acc_sum += (y_hat.argmax(axis=1) == y).sum().asscalar()
            n += y.size
        test_acc = evaluate_accuracy(test_iter, net)
        print('epoch %d, loss %.4f, train acc %.3f, test acc %.3f'
              % (epoch + 1, train_l_sum / n, train_acc_sum / n, test_acc))

In [12]:
num_epochs, lr, batch_size = 50, 0.5, 256
loss = gloss.SoftmaxCrossEntropyLoss()
train_iter, test_iter = load_data_fashion_mnist(batch_size)
train_ch3(net, train_iter, test_iter, loss, num_epochs, batch_size, params, lr)

epoch 1, loss 1.1172, train acc 0.568, test acc 0.754
epoch 2, loss 0.5759, train acc 0.787, test acc 0.831
epoch 3, loss 0.4917, train acc 0.820, test acc 0.842
epoch 4, loss 0.4448, train acc 0.838, test acc 0.853
epoch 5, loss 0.4198, train acc 0.848, test acc 0.860
epoch 6, loss 0.3970, train acc 0.856, test acc 0.872
epoch 7, loss 0.3796, train acc 0.862, test acc 0.870
epoch 8, loss 0.3687, train acc 0.867, test acc 0.875
epoch 9, loss 0.3537, train acc 0.871, test acc 0.871
epoch 10, loss 0.3458, train acc 0.873, test acc 0.879
epoch 11, loss 0.3365, train acc 0.877, test acc 0.880
epoch 12, loss 0.3357, train acc 0.878, test acc 0.883
epoch 13, loss 0.3184, train acc 0.884, test acc 0.881
epoch 14, loss 0.3155, train acc 0.884, test acc 0.886
epoch 15, loss 0.3077, train acc 0.886, test acc 0.887
epoch 16, loss 0.3024, train acc 0.889, test acc 0.884
epoch 17, loss 0.2975, train acc 0.891, test acc 0.888
epoch 18, loss 0.2956, train acc 0.891, test acc 0.886
epoch 19, loss 0.29

### Concise implementation

In [13]:
seq_net = nn.Sequential()
seq_net.add(nn.Dense(256, activation="relu"),
            nn.Dropout(drop_prob1),  # Dropout after the first fully connected layer
            nn.Dense(256, activation="relu"),
            nn.Dropout(drop_prob2),  # Dropout after the second fully connected layer
            nn.Dense(10))
seq_net.initialize(init.Normal(sigma=0.01))

In [14]:
trainer = gluon.Trainer(seq_net.collect_params(), 'sgd', {'learning_rate': lr})
train_ch3(seq_net, train_iter, test_iter, loss, num_epochs, batch_size, None, None, trainer)

epoch 1, loss 1.1917, train acc 0.538, test acc 0.769
epoch 2, loss 0.5946, train acc 0.779, test acc 0.830
epoch 3, loss 0.4965, train acc 0.818, test acc 0.794
epoch 4, loss 0.4547, train acc 0.836, test acc 0.861
epoch 5, loss 0.4270, train acc 0.845, test acc 0.863
epoch 6, loss 0.4016, train acc 0.854, test acc 0.866
epoch 7, loss 0.3845, train acc 0.860, test acc 0.873
epoch 8, loss 0.3712, train acc 0.864, test acc 0.875
epoch 9, loss 0.3635, train acc 0.867, test acc 0.873
epoch 10, loss 0.3460, train acc 0.874, test acc 0.877
epoch 11, loss 0.3419, train acc 0.875, test acc 0.883
epoch 12, loss 0.3327, train acc 0.878, test acc 0.882
epoch 13, loss 0.3295, train acc 0.879, test acc 0.881
epoch 14, loss 0.3204, train acc 0.883, test acc 0.885
epoch 15, loss 0.3169, train acc 0.884, test acc 0.882
epoch 16, loss 0.3118, train acc 0.884, test acc 0.880
epoch 17, loss 0.3011, train acc 0.889, test acc 0.883
epoch 18, loss 0.2974, train acc 0.890, test acc 0.883
epoch 19, loss 0.29

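### Forward propagation, backpropagation, and computational graphs
1. Frobenius norm: the square root of the sum of the squares of all elements. It is the L2 norm of the matrix flattened into a vector
2. In a multilayer neural network, the L2 regularization penalty is usually the sum of the squared Frobenius norms of each layer's weight matrix, scaled by λ/2: $$ \frac{λ}{2}\sum_{l} \|W^{(l)}\|_F^2 $$
3. Computational graph:
    - The input is in the lower-left corner
    - The output is in the upper-right corner
    - Variables are boxes, operators are circles
    - Most arrows point right or up
4. Backpropagation: applies the chain rule from the output back to the input to compute the gradients of the objective with respect to the intermediate variables and parameters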
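
As a small worked example of the Frobenius norm and the penalty term, here is a sketch in plain Python. It assumes the common form of the L2 penalty, λ/2 times the sum of squared Frobenius norms of the weight matrices; the helper names `frobenius_norm` and `l2_penalty` and the matrices `W1`, `W2` are made up for illustration.

```python
import math


def frobenius_norm(matrix):
    """Square root of the sum of squared elements of a matrix (list of rows)."""
    return math.sqrt(sum(x * x for row in matrix for x in row))


def l2_penalty(weights, lam):
    """Assumed penalty form: lam / 2 * sum of squared Frobenius norms."""
    return lam / 2 * sum(frobenius_norm(W) ** 2 for W in weights)


W1 = [[3.0, 0.0], [0.0, 4.0]]  # squared elements sum to 25
W2 = [[1.0, 2.0], [2.0, 0.0]]  # squared elements sum to 9
print(frobenius_norm(W1))  # 5.0
print(l2_penalty([W1, W2], lam=0.1))  # 0.05 * (25 + 9), up to rounding
```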

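### Numerical stability and model initialization
- As the number of layers in a neural network grows, numerical stability can deteriorate: the outputs (and gradients) of deep layers may vanish or explode
- Initialization: the parameters of the hidden units must not all be initialized to the same value. If they are, the parameter gradients computed in backpropagation are also equal, so the hidden units stay identical forever (worth trying)
    - Solution: random initialization
    - Symmetry breaking: When some machine learning models have weights all initialized to the same value, it can be difficult or impossible for the weights to differ as the model is trained. Initializing the model to small random values breaks the symmetry and allows different weights to learn independently of each other.
    - With identical parameters, the weighted sums WX are identical, so all hidden units take the same value: https://stackoverflow.com/a/20029817/5141613
    - MXNet default initialization: each weight element is sampled from a uniform distribution between -0.07 and 0.07, and all biases are set to zero
    - Xavier random initialization: for a fully connected layer with a inputs and b outputs, the weights are sampled from the uniform distribution $$ U(-\sqrt{\frac{6}{a+b}}, \sqrt{\frac{6}{a+b}}) $$. Benefit: the variance of each layer's output is not affected by the number of inputs, and the variance of each layer's gradient is not affected by the number of outputs
- To avoid vanishing and exploding values, keep the initial parameter values from being too large or too small:
    - He et al. initialization: let a be the dimension of the previous layer; after random initialization, multiply the weights by $$ \sqrt{\frac{2}{a}} $$
    - Xavier initialization for sigmoid: multiply by $$ \sqrt{\frac{1}{a}} $$ instead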
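
The two schemes above can be sketched in plain Python to see what variances they produce. This is only an illustration, not MXNet's initializer code: the function names are hypothetical, and the He variant assumes that "random initialization" means sampling from a standard normal before scaling.

```python
import math
import random


def xavier_uniform(a, b, rng):
    """Sample an a-by-b weight matrix from U(-sqrt(6/(a+b)), sqrt(6/(a+b)))."""
    bound = math.sqrt(6 / (a + b))
    return [[rng.uniform(-bound, bound) for _ in range(b)] for _ in range(a)]


def he_normal(a, b, rng):
    """Assumption: sample standard-normal weights, then scale by sqrt(2/a)."""
    scale = math.sqrt(2 / a)
    return [[rng.gauss(0, 1) * scale for _ in range(b)] for _ in range(a)]


def variance(matrix):
    """Population variance over all elements of a matrix."""
    values = [w for row in matrix for w in row]
    mean = sum(values) / len(values)
    return sum((w - mean) ** 2 for w in values) / len(values)


rng = random.Random(0)  # fixed seed for repeatability
a, b = 784, 256  # layer sizes borrowed from the network in this notebook
W_xavier = xavier_uniform(a, b, rng)
W_he = he_normal(a, b, rng)
print(variance(W_xavier), 2 / (a + b))  # empirical vs. theoretical variance
print(variance(W_he), 2 / a)
```

A uniform distribution on (-c, c) has variance c²/3, so the Xavier bound gives a weight variance of 2/(a+b), while the He scaling gives 2/a.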

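### Exercises
1. Hyperparameter tuning:
    - Swap the two drop probabilities; what happens?
    - Compare the results with and without dropout
    - Increase the number of hidden units or the number of hidden layers; does dropout's effect on overfitting become more pronounced?
    - Compare dropout with weight decay; what happens when both are used together?
2. How should the "symmetry" in symmetry breaking be understood?
    - In forward and backward propagation, all hidden units compute the same value and receive the same gradient, so effectively only one neuron is doing the work
3. Can the weight parameters in linear regression or softmax regression be initialized to the same value?
    - Yes. These models have no hidden layer, so there are no interchangeable hidden units. Each weight's gradient depends on its own input feature (and, in softmax regression, on its own class), so the weights move apart during training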
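
The symmetry question can be made concrete with a tiny experiment. The sketch below uses plain Python and a made-up 2-2-1 tanh network with squared loss; the function `train`, the data point, and all initial values are hypothetical choices for illustration. It compares an all-equal initialization with an asymmetric one.

```python
import math


def train(W1, w2, x, y, lr=0.1, steps=20):
    """Gradient descent on a 2-2-1 tanh network with loss 0.5 * (out - y)^2.

    W1[j] holds the input weights of hidden unit j; w2[j] is its output weight.
    """
    for _ in range(steps):
        # Forward pass
        h = [math.tanh(sum(W1[j][i] * x[i] for i in range(2))) for j in range(2)]
        out = sum(w2[j] * h[j] for j in range(2))
        err = out - y  # derivative of the loss with respect to out
        # Backward pass (chain rule)
        grad_w2 = [err * h[j] for j in range(2)]
        grad_W1 = [[err * w2[j] * (1 - h[j] ** 2) * x[i] for i in range(2)]
                   for j in range(2)]
        # Parameter update
        w2 = [w2[j] - lr * grad_w2[j] for j in range(2)]
        W1 = [[W1[j][i] - lr * grad_W1[j][i] for i in range(2)] for j in range(2)]
    return W1, w2


x, y = [1.0, 2.0], 1.0
# Every parameter starts at the same value
W1_sym, w2_sym = train([[0.5, 0.5], [0.5, 0.5]], [0.5, 0.5], x, y)
# Fixed asymmetric values standing in for a random initialization
W1_asym, w2_asym = train([[0.1, -0.3], [0.4, 0.2]], [0.3, -0.2], x, y)

print(W1_sym[0] == W1_sym[1])    # True: the two hidden units remain identical
print(W1_asym[0] == W1_asym[1])  # False: the units stay different
```

With equal initial values both hidden units see the same inputs and the same gradients at every step, so they never diverge and the layer behaves like a single unit.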

In [15]:
nd.random.uniform?
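# Sampling interval is half-open, [low, high); the defaults are low=0, high=1.
# The -0.07 to 0.07 range is the default of the weight initializer (init.Uniform), not of this function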