PyTorch has two core features:
- Multidimensional tensors, similar to numpy arrays, that can run on a GPU
- Automatic differentiation

# Tensors

## numpy

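Before introducing PyTorch, we first implement a network with numpy.

numpy provides multidimensional arrays and operations on them, but it is not well suited to deep learning: it has no built-in notion of computational graphs, gradients, or GPU execution.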

In [1]:
import numpy as np

# N is the batch size, D_in the input dimension,
# H the hidden layer size, D_out the output dimension
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random input and output data

x = np.random.randn(N, D_in)
y = np.random.randn(N, D_out)

In [2]:
# Randomly initialize the weights
w1 = np.random.randn(D_in, H)
w2 = np.random.randn(H, D_out)

In [3]:
learning_rate = 1e-6
for t in range(500):
    
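    # Forward pass: compute the predicted y
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)  # ReLU
    y_pred = h_relu.dot(w2)

    # Compute and print the loss (sum of squared errors)
    loss = np.square(y_pred - y).sum()
    print(t, loss)

    # Backward pass: gradients of the loss with respect to w1 and w2
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h_relu = grad_y_pred.dot(w2.T)
    grad_h = grad_h_relu.copy()
    grad_h[h < 0] = 0
    grad_w1 = x.T.dot(grad_h)

    # Gradient descent update
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2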

0 31883824.45931278
1 30635015.13647095
2 37077019.523745716
3 44578896.970277786
4 44497988.159341395
5 31914450.303799387
6 16186571.800548442
7 6540387.998912735
8 2806179.8782955
9 1545144.1507705613
10 1073461.460157258
11 842369.4445914896
12 695416.463790141
13 586386.928640781
14 499905.2994029885
15 429238.22338585224
16 370695.64591995144
17 321801.60374993173
18 280741.5532091633
19 245948.10557674846
20 216282.3937686798
21 190871.14337591833
22 168998.76886730813
23 150095.780800678
24 133694.21748948944
25 119409.08971889189
26 106927.42906132335
27 95998.27727163803
28 86393.41371239997
29 77914.4885381542
30 70408.37043025033
31 63746.01286338222
32 57817.49554206335
33 52536.87415776809
34 47815.26626447074
35 43585.37194325829
36 39788.70370580571
37 36372.524751473735
38 33293.0512679896
39 30517.122633058385
40 28009.33331298832
41 25737.751315303864
42 23675.81401928372
43 21801.027403981432
44 20090.676075399144
45 18532.22053331019
46 17110.417815904526
47 15811.
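
The backward pass in the loop above is the chain rule written out by hand. A minimal sketch of how one might sanity-check it with central finite differences follows. It uses a much smaller network so the check runs quickly; the helper names `forward_loss`, `manual_grads`, and `numeric_grad`, the sizes, and the seed are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny sizes (assumed for illustration) so the check is fast
N, D_in, H, D_out = 4, 5, 3, 2
x = rng.standard_normal((N, D_in))
y = rng.standard_normal((N, D_out))
w1 = rng.standard_normal((D_in, H))
w2 = rng.standard_normal((H, D_out))


def forward_loss(w1, w2):
    """Sum-of-squares loss of the two-layer ReLU network."""
    h_relu = np.maximum(x.dot(w1), 0)
    return np.square(h_relu.dot(w2) - y).sum()


def manual_grads(w1, w2):
    """The same hand-written backward pass as in the training loop."""
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)
    grad_y_pred = 2.0 * (h_relu.dot(w2) - y)
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h = grad_y_pred.dot(w2.T)
    grad_h[h < 0] = 0
    return x.T.dot(grad_h), grad_w2


def numeric_grad(f, w, eps=1e-6):
    """Central finite-difference estimate of df/dw, one entry at a time."""
    grad = np.zeros_like(w)
    for idx in np.ndindex(*w.shape):
        old = w[idx]
        w[idx] = old + eps
        plus = f()
        w[idx] = old - eps
        minus = f()
        w[idx] = old  # restore the original value
        grad[idx] = (plus - minus) / (2 * eps)
    return grad


grad_w1, grad_w2 = manual_grads(w1, w2)
num_w1 = numeric_grad(lambda: forward_loss(w1, w2), w1)
num_w2 = numeric_grad(lambda: forward_loss(w1, w2), w2)

# Both differences should be tiny if the manual gradients are right
print(np.abs(grad_w1 - num_w1).max(), np.abs(grad_w2 - num_w2).max())
```

The check can fail spuriously if a hidden activation happens to sit almost exactly at zero, where ReLU is not differentiable; with random data this is very unlikely.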

# Autograd
## PyTorch: Tensors and autograd

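The example above used numpy to implement the forward and backward passes by hand. For a complex network, writing the backward pass manually quickly becomes difficult and error-prone.

PyTorch provides automatic differentiation to compute the backward pass. The forward pass builds a computational graph whose nodes are tensors and whose edges are the functions that produce output tensors from input tensors.

If `x` is a tensor with `x.requires_grad=True`, then after a backward pass `x.grad` holds the gradient of the differentiated scalar with respect to `x`.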
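
As a small illustration of the `requires_grad` / `.grad` mechanism described above, the sketch below differentiates a toy scalar function. The function `sum(x^2)` and the values are assumptions chosen only so the gradient is easy to verify by hand.

```python
import torch

# A leaf tensor that autograd should track
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# y = sum(x^2); each operation adds a node to the computational graph
y = (x ** 2).sum()

# The result remembers the function that produced it
print(y.grad_fn is not None)

# Backpropagate: fills x.grad with dy/dx = 2 * x, i.e. [2, 4, 6]
y.backward()
print(x.grad)
```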

The same network implemented in PyTorch:

In [10]:
import torch

dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0")  # uncomment to run on the GPU

N, D_in, H, D_out = 64, 1000, 100, 10

# Create the input and output data; requires_grad defaults to False
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# Randomly initialize the weights; requires_grad=True makes autograd track them
w1 = torch.randn(D_in, H, device=device, dtype=dtype, requires_grad=True)
w2 = torch.randn(H, D_out, device=device, dtype=dtype, requires_grad=True)

In [12]:
learning_rate = 1e-6
for t in range(500):
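    # Forward pass: clamp(min=0) acts as the ReLU
    y_pred = x.mm(w1).clamp(min=0).mm(w2)

    # Compute and print the loss
    loss = (y_pred - y).pow(2).sum()
    if t % 100 == 99:
        print(t, loss.item())

    # Use autograd to compute the backward pass
    loss.backward()

    # Update the weights manually. The weights have requires_grad=True,
    # but the update itself should not be tracked by autograd,
    # so it is wrapped in torch.no_grad()
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad

        # Zero the gradients manually after updating the weights
        w1.grad.zero_()
        w2.grad.zero_()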

99 34.13650131225586
199 1.8900409936904907
299 0.1125074028968811
399 0.007219775579869747
499 0.0007748950738459826


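## PyTorch: defining new autograd functions

Every autograd operator consists of two functions that operate on tensors: `forward` computes output tensors from input tensors, and `backward` receives the gradient of the loss with respect to the outputs and computes the gradient with respect to the inputs.

In PyTorch, you define your own autograd operator by subclassing `torch.autograd.Function` and implementing `forward` and `backward`: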

In [13]:
class MyRelu(torch.autograd.Function):
    
    @staticmethod
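    def forward(ctx, input):
        # Save the input so backward can see where it was negative
        ctx.save_for_backward(input)
        return input.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        # grad_output is the gradient of the loss with respect to the output
        input, = ctx.saved_tensors
        grad_input = grad_output.clone()
        # ReLU passes no gradient where the input was negative
        grad_input[input < 0] = 0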
        return grad_input
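
One way to gain confidence in a custom Function like the one above is to compare its gradient with that of the built-in operation. The sketch below does this against `torch.relu`. It restates `MyRelu` so that it runs by itself; the tensor sizes, the seed, and the random `weights` used to build a scalar loss are assumptions made for illustration.

```python
import torch


class MyRelu(torch.autograd.Function):
    # Restated from the cell above so this snippet runs on its own

    @staticmethod
    def forward(ctx, input):
        ctx.save_for_backward(input)
        return input.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        input, = ctx.saved_tensors
        grad_input = grad_output.clone()
        grad_input[input < 0] = 0
        return grad_input


torch.manual_seed(0)
a = torch.randn(8, 5, requires_grad=True)
b = a.detach().clone().requires_grad_(True)  # independent copy of the same values
weights = torch.randn(8, 5)  # makes the upstream gradient nonzero everywhere

# Same scalar loss through the custom op and through the built-in one
(MyRelu.apply(a) * weights).sum().backward()
(torch.relu(b) * weights).sum().backward()

same_grad = torch.allclose(a.grad, b.grad)
print(same_grad)
```

The two implementations treat an input of exactly zero differently (`MyRelu` lets the gradient through, `torch.relu` does not), which does not matter for random floating-point inputs.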

In [14]:
dtype = torch.float
device = torch.device("cpu")

N, D_in, H, D_out = 64, 1000, 100, 10

# Create the input and output data; requires_grad defaults to False
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# Randomly initialize the weights
w1 = torch.randn(D_in, H, device=device, dtype=dtype, requires_grad=True)
w2 = torch.randn(H, D_out, device=device, dtype=dtype, requires_grad=True)

In [15]:
learning_rate = 1e-6
for t in range(500):
    # A custom Function is called through its apply method
    relu = MyRelu.apply

    # Forward pass using the custom ReLU
    y_pred = relu(x.mm(w1)).mm(w2)

    loss = (y_pred - y).pow(2).sum()
    if t % 100 == 99:
        print(t, loss.item())

    loss.backward()
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad

        # Zero the gradients manually after updating the weights
        w1.grad.zero_()
        w2.grad.zero_()

99 619.879638671875
199 4.068207263946533
299 0.03632698208093643
399 0.0006286892457865179
499 7.912427099654451e-05


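# The nn module
## PyTorch: nn

The `nn` package defines a set of modules, which are roughly equivalent to neural network layers. A module receives input tensors and computes output tensors, but it can also hold internal state, such as tensors containing learnable parameters. The `nn` package also defines a set of loss functions commonly used when training neural networks.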
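
To make the idea of a module holding internal state concrete, the short sketch below inspects a single `torch.nn.Linear` layer. The layer sizes and the all-zero target are arbitrary choices for illustration.

```python
import torch

layer = torch.nn.Linear(4, 3)

# The module stores its learnable tensors as parameters
for name, param in layer.named_parameters():
    print(name, tuple(param.shape), param.requires_grad)

# Calling the module runs its forward computation
out = layer(torch.randn(2, 4))
print(tuple(out.shape))

# A loss function from nn; the default reduction averages over all elements
loss = torch.nn.MSELoss()(out, torch.zeros(2, 3))
print(loss.item())
```

A `Linear(4, 3)` layer holds a weight of shape `(3, 4)` and a bias of shape `(3,)`, both with `requires_grad=True`, so no manual `torch.randn(..., requires_grad=True)` calls are needed.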

In [18]:
import torch

# Same setup as above
N, D_in, H, D_out = 64, 1000, 100, 10
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Define the model with the nn package as a sequence of layers;
# each Linear layer holds its own weight and bias
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)

loss_fn = torch.nn.MSELoss(reduction="sum")

In [20]:
learning_rate = 1e-4
for t in range(500):
    y_pred = model(x)
    
    loss = loss_fn(y_pred, y)
    if t % 100 == 99:
        print(t, loss.item())
    
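    # Zero the parameter gradients before the backward pass; this corresponds
    # to the w1.grad.zero_() and w2.grad.zero_() calls in the earlier examples
    model.zero_grad()

    loss.backward()

    # Update every parameter by gradient descent without tracking the update
    with torch.no_grad():
        for param in model.parameters():
            param -= learning_rate * param.grad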

99 2.387185573577881
199 0.03280126675963402
299 0.0011327475076541305
399 5.52507808606606e-05
499 3.063823896809481e-06


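## PyTorch: optim

So far, the parameters have been updated by hand. For a simple optimization algorithm such as stochastic gradient descent this is not a big burden, but in practice we often train neural networks with more sophisticated optimizers such as AdaGrad, RMSProp, and Adam.

The `optim` package provides implementations of commonly used optimization algorithms.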
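
As a rough illustration of what an optimizer does, the sketch below checks that one step of `torch.optim.SGD` (without momentum) matches the manual update rule used in the earlier examples. The toy parameter `w`, the loss, and the learning rate are assumptions chosen for illustration.

```python
import torch

w = torch.tensor([1.0, -2.0], requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.1)

loss = (w ** 2).sum()  # gradient is 2 * w
optimizer.zero_grad()
loss.backward()

# What the manual rule w -= lr * w.grad would produce
expected = w.detach() - 0.1 * w.grad

# Let the optimizer apply the update instead
optimizer.step()

print(torch.allclose(w.detach(), expected))
```

Swapping `torch.optim.SGD` for another optimizer such as `torch.optim.Adam` changes only the constructor call; the `zero_grad` / `backward` / `step` pattern stays the same.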

In [21]:
import torch

# Same setup as above
N, D_in, H, D_out = 64, 1000, 100, 10
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Define the model with the nn package as a sequence of layers;
# each Linear layer holds its own weight and bias
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)

loss_fn = torch.nn.MSELoss(reduction="sum")

In [23]:
learning_rate = 1e-4
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
for t in range(500):
    y_pred = model(x)
    
    loss = loss_fn(y_pred, y)
    if t % 100 == 99:
        print(t, loss.item())
    
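    # Zero the parameter gradients before the backward pass; this replaces
    # the manual w1.grad.zero_() and w2.grad.zero_() calls used earlier
    optimizer.zero_grad()

    loss.backward()

    # Calling step on the optimizer updates the parameters
    optimizer.step()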

99 50.85662078857422
199 0.897648811340332
299 0.005331294145435095
399 2.724405931076035e-05
499 9.765761888047564e-08


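## PyTorch: custom nn modules

Sometimes you need a model that is more complex than a sequence of existing modules. In that case, subclass `nn.Module` and implement `forward`.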

In [24]:
import torch

class TwoLayerNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        # Create the two linear layers and register them as submodules
        super(TwoLayerNet, self).__init__()
        self.linear1 = torch.nn.Linear(D_in, H)
        self.linear2 = torch.nn.Linear(H, D_out)

    def forward(self, x):
        # clamp(min=0) acts as the ReLU between the two layers
        h_relu = self.linear1(x).clamp(min=0)
        y_pred = self.linear2(h_relu)
        return y_pred

In [25]:
N, D_in, H, D_out = 64, 1000, 100, 10

x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

model = TwoLayerNet(D_in, H, D_out)

In [26]:
criterion = torch.nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)
for t in range(500):
    y_pred = model(x)
    
    loss = criterion(y_pred, y)
    if t % 100 == 99:
        print(t, loss.item())
        
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

99 2.0103516578674316
199 0.028675688430666924
299 0.0007557850331068039
399 2.5025467039085925e-05
499 9.620515584174427e-07


## PyTorch: control flow + weight sharing

In [27]:
import random
import torch

In [32]:
class DynamicNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        
        super(DynamicNet, self).__init__()
        self.input_linear = torch.nn.Linear(D_in, H)
        self.middle_linear = torch.nn.Linear(H, H)
        self.output_linear = torch.nn.Linear(H, D_out)
    
    def forward(self, x):
        """
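        Each forward pass draws a random integer from 0 to 3 and applies the
        middle layer that many times, so the depth varies between calls
        while every repetition reuses the same parameters.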
        """
        h_relu = self.input_linear(x).clamp(min=0)
        
        for _ in range(random.randint(0, 3)):
            h_relu = self.middle_linear(h_relu).clamp(min=0)
        y_pred = self.output_linear(h_relu)
        return y_pred
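
The weight sharing used in the class above can be seen in isolation with a single layer. In the sketch below, the same `Linear` module is applied a different number of times per forward pass, yet the number of parameters stays fixed and autograd still produces one gradient per parameter. The name `shared`, the sizes, and the depths are assumptions for illustration.

```python
import torch

torch.manual_seed(0)
shared = torch.nn.Linear(3, 3)  # one layer, reused below
x = torch.randn(2, 3)

grad_shapes = []
for depth in (1, 3):
    h = x
    for _ in range(depth):          # ordinary Python control flow
        h = shared(h).clamp(min=0)  # same weights on every repetition
    shared.zero_grad()
    h.sum().backward()              # autograd follows whichever graph was built
    grad_shapes.append(tuple(shared.weight.grad.shape))

# 3 * 3 weights + 3 biases, regardless of how often the layer was applied
n_params = sum(p.numel() for p in shared.parameters())
print(n_params, grad_shapes)
```

Because the graph is rebuilt on every forward pass, the gradient of the shared weight accumulates the contributions from each place the layer was used.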

In [33]:
N, D_in, H, D_out = 64, 1000, 100, 10
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)
model = DynamicNet(D_in, H, D_out)

In [34]:
criterion = torch.nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)

In [35]:
for t in range(500):
    y_pred = model(x)
    
    loss = criterion(y_pred, y)
    if t % 100 == 99:
        print(t, loss.item())
        
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

99 68.56875610351562
199 25.789127349853516
299 1.280673623085022
399 1.2233548164367676
499 0.24811068177223206
