## Task5

### Dropout  
- Dropout is a regularization technique patented by Google for reducing overfitting in neural networks by preventing complex co-adaptations on training data. It is a very efficient way of performing model averaging with neural networks. The term "dropout" refers to dropping out units (both hidden and visible) in a neural network

### L1 and L2 Regularization
- A regression model that uses L1 regularization technique is called Lasso Regression and model which uses L2 is called Ridge Regression.
<center>$L(w) = E_D(w)+ \frac{\lambda}{n}\sum_{i=1}^{n}|w_i|$</center>
- The L2 regularization used L2 norm 
<center>$L(w) = E_D(w)+ \frac{\lambda}{2n}\sum_{i=1}^{n}|w_i^2|$</center>

- The following numpy implementation part is referred to <a href="https://www.kaggle.com/mtax687/dropout-regularization-of-neural-net-using-numpy">kaggle</a>

**Mathematically**:

For one example $x^{(i)}$:
$$z^{[1] (i)} =  W^{[1]} x^{(i)} + b^{[1] (i)}\tag{1}$$ 
$$a^{[1] (i)} = \tanh(z^{[1] (i)})\tag{2}$$
$$z^{[2] (i)} = W^{[2]} a^{[1] (i)} + b^{[2] (i)}\tag{3}$$
$$\hat{y}^{(i)} = a^{[2] (i)} = \sigma(z^{ [2] (i)})\tag{4}$$
$$y^{(i)}_{prediction} = \begin{cases} 1 & \mbox{if } a^{[2](i)} > 0.5 \\ 0 & \mbox{otherwise } \end{cases}\tag{5}$$

Given the predictions on all the examples, you can also compute the cost $J$ as follows: 
$$J = - \frac{1}{m} \sum\limits_{i = 0}^{m} \large\left(\small y^{(i)}\log\left(a^{[2] (i)}\right) + (1-y^{(i)})\log\left(1- a^{[2] (i)}\right)  \large  \right) \small \tag{6}$$

In [1]:
# Numpy to implement Dropout L1 and L2
import numpy as np

In [2]:
# Sigmoid
def sigmoid(z):
    
    s = 1.0 / (1.0 + np.exp(-1.0 * z))
    
    return s

global parameters

In [3]:
# Parameters initialization 
def initialize_parameters(x, h, y):
    '''
        x - input size
        h - hidden size
        y - output size
        
        params: python dictionary containing parameters
        w1 : (h, x)
        b1 : (h, 1)
        w2 : (y, h)
        b2 : (y, 1)
    '''
    
    w1 = np.random.randn(h, x) * 0.01
    b1 = np.zeros((h, 1))
    w2 = np.random.randn(y, h) * 0.01
    b2 = np.zeros((y, 1))
    
    parameters = {
        "w1" : w1,
        "b1" : b1,
        "w2" : w2,
        "b2" : b2
    }
    
    
    return parameters

In [4]:
parameters = initialize_parameters(2,4,1)

In [5]:
def forward_pro_with_dropout(x, parameters, prob = 0.5):
    '''
        w1 : (h, x)
        b1 : (h, 1)
        w2 : (y, h)
        b2 : (y, 1)
        
        prob: probability of keeping neuron active 
        
        return: 
            a2: sigmoid output of the second activation 
            cache : dictionary containing z1 a1 z2 and a2
            
    '''
    
    w1 = parameters["w1"]
    b1 = parameters["b1"]
    w2 = parameters["w2"]
    b2 = parameters["b2"]
    
    z1 = np.add(np.matmul(w1, x), b1)
    a1 = np.tanh(z1)
    
    
    d1 = np.random.rand(a1.shape[0], a1.shape[1])
    d1 = d1 < prob
    
    a1 = np.multiply(a1, d1)
    #  By doing this you are assuring that the result of the cost 
    #  will still have the same expected value as without drop-out.
    a1 = a1 / prob
    
    z2 = np.add(np.matmul(w2, a1), b2)
    a2 = sigmoid(z2)
    
    cache = {
        "z1" : z1,
        "a1" : a1,
        "d1" : d1, 
        "z1" : z2,
        "a2" : a2
    }
    
    return a2, cache

In [6]:
# cost J
def compute_cost(a2, y, parameters, lamda = 0.001):
    '''
    a2 : output of the second activation 
    y : true labels 
    parameters: w1 b1 w2 b2
    lambda
    '''
    
    m = y.shape[1] # number of example
    
    logprobs = np.multiply(y, np.log(a2)) + np.multiply((1 - y), np.log(1 - a2))
    
    l2_reg = np.sum(np.square(parameters['w1'])) + np.sum(np.square(parameters['w2']))
    
    cost = (-1.0/m) * np.sum(logprobs)
    cost = np.squeeze(cost) #makes sure cost is the dimension we expect. 
    
    l2_cost = cost + lamda * l2_reg / (2 * m)
    return l2_cost


In [7]:
def backward_pro_with_dropout(parameters, cache, x, y, lamda= 0.001, prob=0.5):
    '''
    parameters 
    cache 
    x
    y
    
    
    return grads 
    '''
    
    m = x.shape[1]
    # 参数提取
    w1 = parameters['w1']
    w2 = parameters['w2']
    a1 = cache['a1']
    a2 = cache['a2']
    d1 = cache['d1']
    
    dz2 = a2 - y
    dw2 = np.dot(dz2, a1.T) / m + lamda / m * w2
    db2 = np.sum(dz2, axis=1, keepdims=True) / m
 
    # Dropout的关键操作
    da1 = np.dot(w2.T, dz2)
 
    da1 = da1 * d1
    da1 = da1 / prob
 
    dz1 = np.multiply(np.dot(w2.T, dz2), (1 - np.power(a1, 2)))
    dw1 = np.dot(dz1, x.T) / m + lamda / m * w1
    db1 = np.sum(dz1, axis=1, keepdims=True) / m
    
    
    grads = {
        "dw1" : dw1,
        "db1" : db1,
        "dw2" : dw2,
        "db2" : db2
    }
    
    return grads

In [8]:

def update_parameters(parameters, grads, learning_rate=0.01):
    w1 = parameters["w1"]
    b1 = parameters['b1']
    w2 = parameters['w2']
    b2 = parameters['b2']
 
    dw1 = learning_rate * grads["dw1"]
    db1 = learning_rate * grads["db1"]
    dw2 = learning_rate * grads["dw2"]
    db2 = learning_rate * grads["db2"]
 
    w1 = w1 - dw1
    b1 = b1 - db1
    w2 = w2 - dw2
    b2 = b2 - db2
 
    parameters = {
        "w1": w1,
        "b1": b1,
        "w2": w2,
        "b2": b2
    }
    return parameters
    

In [9]:
np.random.seed(1)
m = 200
X = np.random.randn(2, m)
Y = (1 + (2 * (X[0, :] > 0) - 1) * (2 * (X[1, :] > 0) - 1)) / 2
Y = Y.reshape(1, X.shape[1])

In [10]:
num_iterations = 100
learning_rate = 0.01
x, h, y = 2, 30, 1

parameters = initialize_parameters(x, h, y)
w1 = parameters["w1"]
b1 = parameters["b1"]
w2 = parameters["w2"]
b2 = parameters["b2"]

In [11]:
parameters

{'w1': array([[-0.01306534,  0.0007638 ],
        [ 0.00367232,  0.01232899],
        [-0.00422857,  0.00086464],
        [-0.02142467, -0.00830169],
        [ 0.00451616,  0.01104174],
        [-0.00281736,  0.02056356],
        [ 0.01760249, -0.00060652],
        [-0.02413503, -0.01777566],
        [-0.00777859,  0.01115841],
        [ 0.00310272, -0.02094248],
        [-0.00228766,  0.01613361],
        [-0.00374805, -0.0074997 ],
        [ 0.02054624,  0.0005341 ],
        [-0.00479157,  0.00350167],
        [ 0.00017165, -0.00429142],
        [ 0.01208456,  0.01115702],
        [ 0.00840862, -0.00102887],
        [ 0.011469  , -0.00049703],
        [ 0.00466643,  0.01033687],
        [ 0.00808844,  0.01789755],
        [ 0.00451284, -0.0168406 ],
        [-0.0116017 ,  0.01350107],
        [-0.00331283,  0.00386539],
        [-0.00851456,  0.01000881],
        [-0.00384832,  0.01458108],
        [-0.00532234,  0.01118133],
        [ 0.00674396, -0.00722392],
        [ 0.01098996, 

In [12]:
for i in range(num_iterations):
    a2, cache = forward_pro_with_dropout(X, parameters)
    
    loss = compute_cost(a2, Y, parameters)
    
    grads = backward_pro_with_dropout(parameters, cache, X, Y)
    
    parameters = update_parameters(parameters, grads, learning_rate)
    
    if i % 10 == 0:
        print("Loss after iteration %i:%f" % (i, loss))
        

Loss after iteration 0:0.693116
Loss after iteration 10:0.693152
Loss after iteration 20:0.693083
Loss after iteration 30:0.693015
Loss after iteration 40:0.693013
Loss after iteration 50:0.692935
Loss after iteration 60:0.692903
Loss after iteration 70:0.692963
Loss after iteration 80:0.692873
Loss after iteration 90:0.692796


In [13]:
### Pytorch: implementation of dropout

import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision
import torchvision.transforms as transforms
# cifar-10官方提供的数据集是用numpy array存储的
# 下面这个transform会把numpy array变成torch tensor，然后把rgb值归一到[0, 1]这个区间
transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])

# 在构建数据集的时候指定transform，就会应用我们定义好的transform
# root是存储数据的文件夹，download=True指定如果数据不存在先下载数据
cifar_train = torchvision.datasets.CIFAR10(root='./data', train=True,
                                           download=True, transform=transform)
cifar_test = torchvision.datasets.CIFAR10(root='./data', train=False,
                                          transform=transform)

0it [00:00, ?it/s]

Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to ./data/cifar-10-python.tar.gz


170500096it [02:50, 2843384.86it/s]                               

In [14]:
trainloader = torch.utils.data.DataLoader(cifar_train, batch_size=32, shuffle=True)
testloader = torch.utils.data.DataLoader(cifar_test, batch_size=32, shuffle=True)

In [18]:

class LeNet(nn.Module):
    # 一般在__init__中定义网络需要的操作算子，比如卷积、全连接算子等等
    def __init__(self):
        super(LeNet, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16*5*5, 120)
        self.drop1 = nn.Dropout(0.5)
        self.relu1 = nn.ReLU()
        self.fc2 = nn.Linear(120, 84)
        self.drop2 = nn.Dropout(0.5)
        self.relu2 = nn.ReLU()
        self.fc3 = nn.Linear(84, 10)
    # forward这个函数定义了前向传播的运算，只需要像写普通的python算数运算那样就可以了
    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = self.pool(x)
        x = F.relu(self.conv2(x))
        x = self.pool(x)
        # 下面这步把二维特征图变为一维，这样全连接层才能处理
        x = x.view(-1, 16*5*5)
        
        x = self.fc1(x)
        x = self.drop1(x)
        x = self.relu1(x)
        x = self.fc2(x)
        x = self.drop2(x)
        x = self.relu2(x)
        x = self.fc3(x)
        return x

In [19]:

# optim中定义了各种各样的优化方法，包括SGD
import torch.optim as optim

# 如果你没有GPU，那么可以忽略device相关的代码
device = torch.device("cpu")
net = LeNet().to(device)

# CrossEntropyLoss就是我们需要的损失函数
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)

In [20]:
print("Start Training...")
for epoch in range(5):
    # 我们用一个变量来记录每100个batch的平均loss
    loss100 = 0.0
    # 我们的dataloader派上了用场
    for i, data in enumerate(trainloader):
        inputs, labels = data
        inputs, labels = inputs.to(device), labels.to(device) # 注意需要复制到GPU
        optimizer.zero_grad()
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        loss100 += loss.item()
        if i % 100 == 99:
            print('[Epoch %d, Batch %5d] loss: %.3f' %
                  (epoch + 1, i + 1, loss100 / 100))
            loss100 = 0.0

print("Done Training!")

Start Training...
[Epoch 1, Batch   100] loss: 2.305
[Epoch 1, Batch   200] loss: 2.305
[Epoch 1, Batch   300] loss: 2.304
[Epoch 1, Batch   400] loss: 2.301
[Epoch 1, Batch   500] loss: 2.299
[Epoch 1, Batch   600] loss: 2.297
[Epoch 1, Batch   700] loss: 2.293
[Epoch 1, Batch   800] loss: 2.292
[Epoch 1, Batch   900] loss: 2.283
[Epoch 1, Batch  1000] loss: 2.270
[Epoch 1, Batch  1100] loss: 2.265
[Epoch 1, Batch  1200] loss: 2.237
[Epoch 1, Batch  1300] loss: 2.202
[Epoch 1, Batch  1400] loss: 2.176
[Epoch 1, Batch  1500] loss: 2.155
[Epoch 2, Batch   100] loss: 2.111
[Epoch 2, Batch   200] loss: 2.065
[Epoch 2, Batch   300] loss: 2.076
[Epoch 2, Batch   400] loss: 2.077
[Epoch 2, Batch   500] loss: 2.032
[Epoch 2, Batch   600] loss: 2.028
[Epoch 2, Batch   700] loss: 2.045
[Epoch 2, Batch   800] loss: 1.993
[Epoch 2, Batch   900] loss: 1.988
[Epoch 2, Batch  1000] loss: 1.956
[Epoch 2, Batch  1100] loss: 1.928
[Epoch 2, Batch  1200] loss: 1.949
[Epoch 2, Batch  1300] loss: 1.903
[E