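### Assignment goal: differentiation and backpropagation with PyTorch
In this assignment we implement differentiation and backpropagation by hand, and then do the same with PyTorch's Autograd.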

### Implementing differentiation and backpropagation with PyTorch

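Here we implement a simple two-layer neural network for a regression problem. The loss function is the L2 loss, summed over all samples and output dimensions:

$$
L2\_loss = \sum (y_{pred}-y)^2
$$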

The two-layer neural network is defined as follows:
$$
y_{pred} = ReLU(XW_1)W_2
$$
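
As a sketch of where a hand-written backward pass for this network comes from, the chain rule can be applied layer by layer. The names $A$ and $B$ for the intermediate results, $L$ for the loss (taken here as the sum of squared errors over all samples and outputs), $\odot$ for the element-wise product, and $\mathbb{1}[\cdot]$ for the indicator function are notation assumed for this derivation.

$$
A = XW_1,\quad B = ReLU(A),\quad y_{pred} = BW_2,\quad L = \sum (y_{pred}-y)^2
$$

$$
\frac{\partial L}{\partial y_{pred}} = 2(y_{pred}-y),\quad
\frac{\partial L}{\partial W_2} = B^{T}\frac{\partial L}{\partial y_{pred}},\quad
\frac{\partial L}{\partial B} = \frac{\partial L}{\partial y_{pred}}W_2^{T}
$$

$$
\frac{\partial L}{\partial A} = \frac{\partial L}{\partial B}\odot \mathbb{1}[A \ge 0],\quad
\frac{\partial L}{\partial W_1} = X^{T}\frac{\partial L}{\partial A}
$$

ReLU is not differentiable at exactly 0, so the value used there is a convention. The indicator above corresponds to a mask that zeroes only the entries with $A < 0$.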

In [1]:
import torch
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

In [2]:
# Randomly generate X, y
from sklearn.datasets import make_regression
X, y = make_regression(n_samples=500, n_features=1000, n_targets=10, noise=10, random_state=0)
X = torch.tensor(X, device=device)
y = torch.tensor(y, device=device)
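
The arrays returned by `make_regression` are float64 NumPy arrays, and `torch.tensor` keeps that dtype, which is why the weights further down are converted with `.double()`. The following is a minimal sketch of that dtype behavior, assuming PyTorch's default dtype has not been changed. The names `X_np`, `X_t`, `W_float`, `W_double`, and `mixed_ok` are hypothetical and used only for illustration.

```python
import numpy as np
import torch

X_np = np.random.rand(4, 3)                        # NumPy floats default to float64
X_t = torch.tensor(X_np)                           # the dtype is carried over
W_float = torch.randn(3, 2)                        # torch.randn defaults to float32
W_double = torch.randn(3, 2, dtype=torch.float64)  # explicitly float64

print(X_t.dtype, W_float.dtype)

# Matrix multiplication does not promote dtypes, so mixing them fails
try:
    X_t @ W_float
    mixed_ok = True
except RuntimeError:
    mixed_ok = False

out = X_t @ W_double                               # matching dtypes work
print(mixed_ok, out.dtype)
```

Creating the weights in float64 (or, alternatively, casting `X` and `y` to float32) avoids the mismatch.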

In [3]:
# N: batch size
# D_in: input dimension
# H: hidden dimension
# D_out: output dimension
N, D_in, H, D_out = 64, 1000, 100, 10
q = len(X)//N
r = len(X)%N
batchs = q if r == 0 else q + 1
relu = torch.nn.ReLU()

# Initialize the weights W1, W2 (converted to float64 to match X and y)
W1 = torch.randn((D_in, H), device=device).double()
W2 = torch.randn((H, D_out), device=device).double()

# Set the learning rate
learning_rate = 1e-6

# Train for 20001 epochs (t = 0, ..., 20000)
for t in range(20001):
  
    loss = torch.zeros((1,), device=device)
    for batch in range(batchs):
        # Forward pass: compute y_pred
        X_input = X[batch*N:] if batch == batchs-1 else X[batch*N:batch*N+N]
        A = X_input@W1
        B = relu(A)
        y_pred = B@W2
        
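        diff = y_pred - y[batch*N:] if batch == batchs-1 else y_pred - y[batch*N:batch*N+N]

        # Compute the loss (sum of squared errors, accumulated over the epoch)
        loss += torch.sum(diff*diff)

        # Backpropagation: compute the gradients of the loss with respect to W1 and W2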
        grad_Y = 2.0 * diff
        grad_W2 = B.t() @ grad_Y
        grad_relu = grad_Y @ W2.t()
        grad_relu[A<0] = 0
        grad_W1 = X_input.t() @ grad_relu

        # Update the parameters
        W1 -= learning_rate * grad_W1
        W2 -= learning_rate * grad_W2
    if t%1000 == 0:
        print(t, round(loss.item(), ndigits=3))



0 504346464.0
1000 46179.402
2000 14213.812
3000 4695.682
4000 1601.843
5000 585.205
6000 244.785
7000 118.514
8000 65.496
9000 40.616
10000 27.304
11000 19.479
12000 14.555
13000 11.159
14000 8.584
15000 6.561
16000 5.006
17000 3.854
18000 3.021
19000 2.405
20000 1.917
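
The batch bookkeeping in the loop above (`q`, `r`, `batchs`, and the special case for the last batch) can be written more compactly, because slicing past the end of a tensor simply returns the remaining rows. The following is a minimal sketch of that idea. `X_demo` is a hypothetical stand-in with the same number of rows as `X`, and `num_batches` and `batch_sizes` are illustrative names.

```python
import torch

# Hypothetical stand-in for the 500-row X used in this assignment
X_demo = torch.arange(500 * 2, dtype=torch.float64).reshape(500, 2)
N = 64

# Ceiling division gives the same count as the q / r bookkeeping
num_batches = (len(X_demo) + N - 1) // N

batch_sizes = []
for start in range(0, len(X_demo), N):
    X_batch = X_demo[start:start + N]  # the last slice is simply shorter
    batch_sizes.append(X_batch.shape[0])

print(num_batches, batch_sizes)
```

With 500 samples and a batch size of 64, this yields seven full batches and one final batch of 52 rows.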


### Using PyTorch's Autograd

In [4]:
# import torch
# device = torch.device('cpu')

In [24]:
# N: batch size
# D_in: input dimension
# H: hidden dimension
# D_out: output dimension
N, D_in, H, D_out = 64, 1000, 100, 10

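# Randomly generate x, y (the X and y created in the earlier cell are reused here)
###<your code>###

# Initialize the weights W1, W2 directly as float64 leaf tensors.
# Calling .double() on a requires_grad tensor would return a non-leaf copy whose
# .grad stays None unless retain_grad() is called before backward().
W1 = torch.randn(D_in, H, device=device, dtype=torch.float64, requires_grad=True)
W2 = torch.randn(H, D_out, device=device, dtype=torch.float64, requires_grad=True)

# Set the learning rate (learning_rate = 1e-6 from the previous section is reused)
#learning_rate = 1e-6

# Train for 20001 epochs (t = 0, ..., 20000)
for t in range(20001):
    total_loss = torch.zeros((1,), device=device)
    for batch in range(batchs):
        # Forward pass: compute y_pred
        X_input = X[batch*N:] if batch == batchs - 1 else X[batch*N:batch*N+N]
        y_pred = (relu(X_input@W1))@W2
        #y_pred = X_input.mm(W1).clamp(min=0).mm(W2)

        diff = y_pred - y[batch*N:] if batch == batchs - 1 else y_pred - y[batch*N:batch*N+N]

        # Compute the loss
        loss = torch.sum(diff*diff)
        #loss = diff.pow(2).sum()
        # Accumulate a detached value so the epoch total does not keep the graphs alive
        total_loss += loss.detach()

        # Backpropagation: compute the gradients of the loss with respect to W1 and W2
        loss.backward()

        # Parameter update: we do not want the update itself to be recorded by Autograd,
        # so it is wrapped in torch.no_grad()
        with torch.no_grad():
            # Update the parameters W1, W2
            W1 -= learning_rate * W1.grad
            W2 -= learning_rate * W2.grad

            # Clear the accumulated gradients (the parameters have already been updated)
            W1.grad.zero_()
            W2.grad.zero_()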
  
    if t % 1000 == 0:
        print(t, round(total_loss.item(), ndigits=3))

  

0 598108224.0
1000 49969.031
2000 18124.059
3000 5504.642
4000 1795.553
5000 608.606
6000 294.78
7000 175.359
8000 117.449
9000 120.748
10000 5688.556
11000 146.528
12000 35.365
13000 25.022
14000 18.32
15000 13.921
16000 10.733
17000 8.104
18000 5.933
19000 4.305
20000 3.2
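
To check that the hand-written backward pass and Autograd compute the same gradients, both can be run on the same small input and compared. The following is a minimal sketch of such a check, not part of the original assignment. The tiny sizes, the fixed seed, and the names `w1_match` and `w2_match` are assumptions chosen only to keep the check fast and reproducible.

```python
import torch

torch.manual_seed(0)

# Small hypothetical sizes, chosen only to keep the check fast
N, D_in, H, D_out = 8, 5, 4, 3
X = torch.randn(N, D_in, dtype=torch.float64)
y = torch.randn(N, D_out, dtype=torch.float64)
W1 = torch.randn(D_in, H, dtype=torch.float64, requires_grad=True)
W2 = torch.randn(H, D_out, dtype=torch.float64, requires_grad=True)

# Forward pass, recorded by Autograd
A = X @ W1
B = torch.relu(A)
diff = B @ W2 - y
loss = torch.sum(diff * diff)
loss.backward()

# Hand-written backward pass, using the same formulas as the manual training loop
with torch.no_grad():
    grad_Y = 2.0 * diff
    grad_W2 = B.t() @ grad_Y
    grad_relu = grad_Y @ W2.t()
    grad_relu[A < 0] = 0
    grad_W1 = X.t() @ grad_relu

# Compare the manual gradients with the ones Autograd stored in .grad
w1_match = torch.allclose(grad_W1, W1.grad)
w2_match = torch.allclose(grad_W2, W2.grad)
print(w1_match, w2_match)
```

If the two implementations agree, both comparisons should report a match. The two training runs above still print different loss values because their weights are drawn from different random initializations.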
