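### PyTorch has two core features
1. An N-dimensional tensor, similar to a NumPy array, but able to run on a GPU
2. Automatic differentiation for building and training neural networks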
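
Both features can be seen in a few lines. The following is a minimal sketch, assuming only that torch and numpy are installed; the variable names (a, b, w) are illustrative and do not appear in the rest of the notebook.

```python
import numpy as np
import torch

# Feature 1: tensors behave much like NumPy arrays and can live on a GPU.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
a = torch.from_numpy(np.array([[1.0, 2.0], [3.0, 4.0]])).to(device)
b = a @ a                      # matrix product, computed on `device`
b_list = b.cpu().tolist()      # [[7.0, 10.0], [15.0, 22.0]]

# Feature 2: autograd records operations and differentiates them.
w = torch.tensor(3.0, requires_grad=True)
loss = w ** 2 + 2 * w          # d(loss)/dw = 2*w + 2
loss.backward()
w_grad = w.grad.item()         # 2*3 + 2 = 8.0
```

The same code runs unchanged on CPU or GPU because only the device object differs.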

In [3]:
# Before using PyTorch proper, review some NumPy usage.
import numpy as np

batch_size, input_size, hidden_size, output_size = 64, 1000, 100, 10

# Create random input and output data
x = np.random.randn(batch_size, input_size)
y = np.random.randn(batch_size, output_size)

# Randomly initialize the weights
w1 = np.random.randn(input_size, hidden_size)
w2 = np.random.randn(hidden_size, output_size)

learning_rate = 1e-6

for t in range(5):
    # Forward pass
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)
    y_pred = h_relu.dot(w2)
    loss = np.square(y_pred - y).sum()
    print(t, loss)
    # Backward pass
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h_relu = grad_y_pred.dot(w2.T)
    grad_h = grad_h_relu.copy()
    grad_h[h < 0] = 0
    grad_w1 = x.T.dot(grad_h)
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2

0 35382166.98289068
1 28152083.021180052
2 24975552.184205044
3 21719252.14714821
4 17214528.916116476
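
One way to gain confidence in hand-written backward-pass formulas like the ones above is a finite-difference check. The sketch below is not part of the original notebook: it reuses the same gradient formulas on a much smaller problem (the sizes and the helper names loss_fn and numeric_grad are assumptions made for illustration) and compares them with central differences.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 5))
y = rng.standard_normal((4, 2))
w1 = rng.standard_normal((5, 3))
w2 = rng.standard_normal((3, 2))

def loss_fn(w1, w2):
    # Same forward pass as the training loop: linear -> ReLU -> linear
    h_relu = np.maximum(x.dot(w1), 0)
    return np.square(h_relu.dot(w2) - y).sum()

# Analytic gradients, using the same formulas as the loop above
h = x.dot(w1)
h_relu = np.maximum(h, 0)
grad_y_pred = 2.0 * (h_relu.dot(w2) - y)
grad_w2 = h_relu.T.dot(grad_y_pred)
grad_h = grad_y_pred.dot(w2.T)
grad_h[h < 0] = 0
grad_w1 = x.T.dot(grad_h)

def numeric_grad(f, w, eps=1e-6):
    # Central differences, perturbing one entry of w at a time (in place)
    g = np.zeros_like(w)
    for idx in np.ndindex(*w.shape):
        old = w[idx]
        w[idx] = old + eps
        f_plus = f()
        w[idx] = old - eps
        f_minus = f()
        w[idx] = old
        g[idx] = (f_plus - f_minus) / (2 * eps)
    return g

num_w1 = numeric_grad(lambda: loss_fn(w1, w2), w1)
num_w2 = numeric_grad(lambda: loss_fn(w1, w2), w2)
max_err_w1 = np.abs(num_w1 - grad_w1).max()
max_err_w2 = np.abs(num_w2 - grad_w2).max()
print(max_err_w1, max_err_w2)
```

Both errors should be tiny compared with the gradient magnitudes; a large discrepancy would point to a mistake in the hand-derived formulas.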


In [4]:
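# NumPy is an excellent tool for matrix operations, but it cannot use a GPU to accelerate its numerical computation.
# This example uses tensors to train a two-layer network on random data, like the previous example but using only tensors.
import torch
dtype = torch.float
# The device string must be lowercase "cpu"; "CPU" is not a valid device type
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")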

batch_size, input_size, hidden_size, output_size = 64, 1000, 100, 10

x = torch.randn(batch_size, input_size, device=device, dtype=dtype)
y = torch.randn(batch_size, output_size, device=device, dtype=dtype)

w1 = torch.randn(input_size, hidden_size, device=device, dtype=dtype)
w2 = torch.randn(hidden_size, output_size, device=device, dtype=dtype)

learning_rate = 1e-6

for t in range(10):
    h = x.mm(w1)
    h_relu = h.clamp(min=0)
    y_pred = h_relu.mm(w2)
    
    loss = (y_pred - y).pow(2).sum().item()
    print(t, loss)
    grad_y_pred = 2.0 * (y_pred -y)
    grad_w2 = h_relu.t().mm(grad_y_pred)
    grad_h_relu = grad_y_pred.mm(w2.t())
    grad_h = grad_h_relu.clone()
    grad_h[h < 0] = 0
    grad_w1 = x.t().mm(grad_h)
    
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2

0 27933208.0
1 24945064.0
2 24965536.0
3 24551664.0
4 21835740.0
5 16774729.0
6 11234487.0
7 6828489.0
8 4039985.0
9 2469799.5


In [7]:
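# The example above computes the gradients and updates each layer's parameters by hand; this is impractical for real PyTorch models,
# so PyTorch provides automatic differentiation (autograd).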

x = torch.randn(batch_size, input_size, device=device, dtype=dtype)
y = torch.randn(batch_size, output_size, device=device, dtype=dtype)

w1 = torch.randn(input_size, hidden_size, device=device, dtype=dtype, requires_grad=True)
w2 = torch.randn(hidden_size, output_size, device=device, dtype=dtype, requires_grad=True)

learning_rate = 1e-6

for t in range(10):
    y_pred = x.mm(w1).clamp(min=0).mm(w2)
    loss = (y_pred - y).pow(2).sum()
    print(t, loss.item())
    loss.backward()
    # Run the update under no_grad so that it is not recorded in the autograd graph
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad
        w1.grad.zero_()
        w2.grad.zero_()

0 24254692.0
1 17557514.0
2 14198531.0
3 12105104.0
4 10384138.0
5 8770156.0
6 7159100.0
7 5669248.0
8 4352951.0
9 3288865.0
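
The calls to grad.zero_() in the loop above matter because backward() adds to the existing .grad rather than overwriting it. A minimal sketch of that behavior, using a small illustrative tensor w that is not part of the notebook:

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

(w * w).sum().backward()        # gradient of sum(w^2) is 2*w
first = w.grad.clone()          # [2., 4.]

(w * w).sum().backward()        # second backward without zeroing
accumulated = w.grad.clone()    # [4., 8.]: the two gradients were summed

w.grad.zero_()                  # reset, as the training loop does
(w * w).sum().backward()
after_zero = w.grad.clone()     # [2., 4.] again
```

Without the reset, each iteration would step along the sum of all gradients seen so far.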


In [8]:
# We can also define our own autograd function
class MyRule(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x.clamp(min=0)
    
    @staticmethod
    def backward(ctx, grad_output):
        x, = ctx.saved_tensors
        grad_x = grad_output.clone()
        grad_x[x < 0] = 0
        return grad_x
    
# To use it, call MyRule.apply(input_tensor)
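
As a usage sketch, the custom function can be applied to a small tensor and its gradient compared with the built-in torch.relu. The class is repeated here so the block is self-contained; the test values are illustrative and avoid 0, where the two implementations could legitimately disagree.

```python
import torch

class MyRule(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)        # keep the input for the backward pass
        return x.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        x, = ctx.saved_tensors
        grad_x = grad_output.clone()
        grad_x[x < 0] = 0               # no gradient flows where the input was negative
        return grad_x

x = torch.tensor([-1.0, 0.5, 2.0], requires_grad=True)
out = MyRule.apply(x)
out.sum().backward()
custom_grad = x.grad.clone()            # [0., 1., 1.]

# Reference: the built-in ReLU on an independent copy of the input
x_ref = x.detach().clone().requires_grad_(True)
torch.relu(x_ref).sum().backward()
reference_grad = x_ref.grad.clone()
```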

In [11]:
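# In large networks, working directly with the autograd package is too low-level and makes the code much harder to write.
# The torch.nn package provides a high-level API that lets us build models quickly, much as with Keras.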

model = torch.nn.Sequential(
    torch.nn.Linear(input_size, hidden_size),
    torch.nn.ReLU(),
    torch.nn.Linear(hidden_size, output_size)
)
model.to(device)
# The keyword is `reduction`; `reduce` is a deprecated boolean flag, not a mode name
loss_fn = torch.nn.MSELoss(reduction="mean")
learning_rate = 1e-4
for t in range(10):
    y_pred = model(x)
    loss = loss_fn(y_pred, y)
    print(t, loss.item())
    model.zero_grad()
    loss.backward()
    with torch.no_grad():
        for param in model.parameters():
            param -= learning_rate * param.grad

0 1.1464388370513916
1 1.1463005542755127
2 1.1461621522903442
3 1.1460238695144653
4 1.1458854675292969
5 1.145747423171997
6 1.1456091403961182
7 1.1454709768295288
8 1.14533269405365
9 1.14519464969635
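
The losses printed here are far smaller than in the earlier cells partly because the reduction differs: the earlier cells summed the squared errors, whereas MSELoss with reduction="mean" divides that sum by the number of elements. A small sketch of the relationship, on illustrative random data that is not the notebook's x and y:

```python
import torch

torch.manual_seed(0)
y_pred = torch.randn(64, 10)
y = torch.randn(64, 10)

mse_mean = torch.nn.MSELoss(reduction="mean")(y_pred, y)
mse_sum = torch.nn.MSELoss(reduction="sum")(y_pred, y)
manual_sum = (y_pred - y).pow(2).sum()   # the loss used in the earlier cells

# mean reduction = sum reduction / number of elements (64 * 10 = 640)
ratio = (mse_sum / mse_mean).item()
```

Because the gradient is scaled by the same factor, a learning rate tuned for one reduction usually needs adjusting for the other.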


In [12]:
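# Updating the parameters by hand, layer by layer, is also tedious; PyTorch offers a better option in torch.optim.
# This cell still uses the model defined in the previous cell.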
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
for t in range(10):
    y_pred = model(x)
    loss = loss_fn(y_pred,  y)
    print(t, loss.item())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

0 1.1450566053390503
1 1.1172531843185425
2 1.0901139974594116
3 1.0637024641036987
4 1.0379905700683594
5 1.0131222009658813
6 0.9890559315681458
7 0.9658142328262329
8 0.9431743621826172
9 0.9212177395820618
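
To see what optimizer.step() replaces, the sketch below uses plain torch.optim.SGD instead of Adam, since its update rule is simply p = p - lr * grad and can be checked directly (Adam additionally rescales each step using running gradient statistics). The layer, data, and learning rate are illustrative assumptions.

```python
import torch

torch.manual_seed(0)
layer = torch.nn.Linear(3, 1)
x = torch.randn(5, 3)
y = torch.randn(5, 1)
lr = 0.1

optimizer = torch.optim.SGD(layer.parameters(), lr=lr)
loss = torch.nn.functional.mse_loss(layer(x), y)
optimizer.zero_grad()
loss.backward()

# What the manual update from the previous cell would produce
expected = [p.detach() - lr * p.grad for p in layer.parameters()]

optimizer.step()
matches = all(
    torch.allclose(p.detach(), e)
    for p, e in zip(layer.parameters(), expected)
)
```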


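### Control flow and weight sharing
As an example of dynamic graphs and weight sharing, the following implements a rather unusual model: a fully connected ReLU network in which, on every forward pass, the input layer is followed by a random number (0 to 3) of applications of the same hidden layer, giving 1 to 4 hidden activations in total. The same weights are therefore reused several times within one computation.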

In [3]:
import random
import torch

class DynamicNet(torch.nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(DynamicNet, self).__init__()
        self.input_layer = torch.nn.Linear(input_size, hidden_size)
        self.hidden_layer = torch.nn.Linear(hidden_size, hidden_size)
        self.output_layer = torch.nn.Linear(hidden_size, output_size)
        
    def forward(self, x):
        h_relu = self.input_layer(x).clamp(min=0)
        for _ in range(random.randint(0, 3)):
            h_relu = self.hidden_layer(h_relu).clamp(min=0)
        y_pred = self.output_layer(h_relu)
        return y_pred
    
batch_size, input_size, hidden_size, output_size = 64, 1000, 100, 10
x = torch.randn(batch_size, input_size)
y = torch.randn(batch_size, output_size)

model = DynamicNet(input_size, hidden_size, output_size)
criterion = torch.nn.MSELoss(reduction="mean")
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)

for t in range(10):
    y_pred = model(x)
    loss = criterion(y_pred, y)
    print(t, loss.item())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

0 1.079433560371399
1 1.138946771621704
2 1.0769429206848145
3 1.138685941696167
4 1.0794172286987305
5 1.0873134136199951
6 1.0794012546539307
7 1.0872423648834229
8 1.0793811082839966
9 1.137485384941101
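
Weight sharing means the reused layer contributes its parameters only once, and autograd sums the gradient contributions from every application. A minimal sketch with a one-weight layer, chosen so the result can be checked by hand (the name shared and the numbers are illustrative):

```python
import torch

shared = torch.nn.Linear(1, 1, bias=False)
with torch.no_grad():
    shared.weight.fill_(3.0)          # w = 3

x = torch.tensor([[2.0]])
y = shared(shared(x))                 # y = w * (w * x) = w^2 * x = 18
y.sum().backward()

grad = shared.weight.grad.item()      # dy/dw = 2 * w * x = 12
num_params = sum(p.numel() for p in shared.parameters())  # 1, despite two uses
```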


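## Saving and loading models
When saving and loading models, there are three core functions to be familiar with:
- torch.save serializes an object to disk. It uses Python's built-in pickle module and can save models, tensors, dictionaries, and so on.
- torch.load uses pickle's unpickling facilities to deserialize the saved file into memory. It can also remap tensors to a chosen device while loading.
- torch.nn.Module.load_state_dict() loads a model's parameter dictionary from a deserialized state_dict.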
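
A minimal round trip through torch.save and torch.load, assuming a temporary directory is writable; the file name obj.pt and the dictionary contents are made up for illustration. The map_location argument is the mechanism torch.load offers for choosing the device tensors are loaded onto.

```python
import os
import tempfile
import torch

obj = {
    "weights": torch.arange(6, dtype=torch.float32).reshape(2, 3),
    "step": 7,
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "obj.pt")   # hypothetical file name
    torch.save(obj, path)
    # map_location remaps stored tensors to the given device while loading
    restored = torch.load(path, map_location="cpu")

same_weights = torch.equal(restored["weights"], obj["weights"])
```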

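### What is a state_dict?
In PyTorch, a model's learnable parameters (weights and biases) are available through model.parameters(). A state_dict is a Python dictionary that maps each layer to its parameter tensors. Note that only layers with learnable parameters (and registered buffers, such as batch normalization running statistics) have entries in a model's state_dict. Optimizer objects (torch.optim) also have a state_dict, which holds the optimizer's state and the hyperparameters in use.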
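
A short sketch of which layers contribute entries, using a small illustrative Sequential model: the ReLU layer has neither parameters nor buffers and so adds nothing, while BatchNorm1d adds both parameters and running-statistics buffers.

```python
import torch

net = torch.nn.Sequential(
    torch.nn.Linear(4, 3),     # index 0: weight and bias
    torch.nn.ReLU(),           # index 1: nothing to store
    torch.nn.BatchNorm1d(3),   # index 2: weight, bias, and running-statistics buffers
)

keys = list(net.state_dict().keys())
shapes = {k: tuple(v.shape) for k, v in net.state_dict().items()}
print(keys)
```

Keys are prefixed with the submodule name, which for Sequential is the layer's position.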

In [8]:
# Example showing how state_dict is used
import torch.nn.functional as F  # needed for F.relu in forward()
class TheModelClass(torch.nn.Module):
    def __init__(self):
        super(TheModelClass, self).__init__()
        self.conv1 = torch.nn.Conv2d(3, 6, 5)
        self.pool = torch.nn.MaxPool2d(2, 2)
        self.conv2 = torch.nn.Conv2d(6, 16, 5)
        self.fc1 = torch.nn.Linear(16 * 5 * 5, 120)
        self.fc2 = torch.nn.Linear(120, 84)
        self.fc3 = torch.nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

model = TheModelClass()

optimizer = torch.optim.SGD(params=model.parameters(), lr=1e-4, momentum=0.9)

print("Model's state dict")
for param_name in model.state_dict():
    print(param_name, model.state_dict()[param_name].size())
    
print("Optimizer's state dict")
for param_name in optimizer.state_dict():
    print(param_name, optimizer.state_dict()[param_name])

Model's state dict
conv1.weight torch.Size([6, 3, 5, 5])
conv1.bias torch.Size([6])
conv2.weight torch.Size([16, 6, 5, 5])
conv2.bias torch.Size([16])
fc1.weight torch.Size([120, 400])
fc1.bias torch.Size([120])
fc2.weight torch.Size([84, 120])
fc2.bias torch.Size([84])
fc3.weight torch.Size([10, 84])
fc3.bias torch.Size([10])
Optimizer's state dict
state {}
param_groups [{'lr': 0.0001, 'momentum': 0.9, 'dampening': 0, 'weight_decay': 0, 'nesterov': False, 'params': [4424580944, 4424579024, 4424577344, 4424579824, 4424258256, 4424260816, 4424659104, 4424578304, 4424259776, 4424259216]}]


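There are several ways to save a model: store only its parameters (the approach PyTorch officially recommends), or serialize the entire model to a file.
### Save/load state_dict (recommended)
The advantage of this approach is that only the model's parameters are saved, so deserialization depends less on the surrounding class files. Remember to call model.eval() before running inference so that dropout and batch normalization layers switch to evaluation mode; otherwise inference results may be inconsistent.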
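
The effect of model.eval() is easiest to see on a single dropout layer. In this sketch (the layer and input are illustrative), training mode randomly zeroes entries and rescales the survivors, while evaluation mode passes the input through unchanged.

```python
import torch

torch.manual_seed(0)
drop = torch.nn.Dropout(p=0.5)
x = torch.ones(1000)

drop.train()
train_out = drop(x)    # entries are either 0 or scaled by 1 / (1 - p) = 2

drop.eval()
eval_out = drop(x)     # identity in evaluation mode

train_values = set(train_out.unique().tolist())
```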

In [9]:
PATH = "example.pt"
# save
torch.save(model.state_dict(), PATH)

# load
model = TheModelClass()
model.load_state_dict(torch.load(PATH))
model.eval()

TheModelClass(
  (conv1): Conv2d(3, 6, kernel_size=(5, 5), stride=(1, 1))
  (pool): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (conv2): Conv2d(6, 16, kernel_size=(5, 5), stride=(1, 1))
  (fc1): Linear(in_features=400, out_features=120, bias=True)
  (fc2): Linear(in_features=120, out_features=84, bias=True)
  (fc3): Linear(in_features=84, out_features=10, bias=True)
)

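### Save/load the entire model
This approach needs very little code for saving and loading. Its drawback is that the serialized data is tied to the specific classes and the exact directory structure in use when the model was saved. The reason is that pickle does not store the model class itself; it stores the path to the file containing the class, which is resolved at load time. After a project has been refactored, loading the model may therefore fail.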

In [11]:
# save
torch.save(model, PATH)
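# load (recent PyTorch releases default torch.load to weights_only=True;
# loading a fully pickled model there requires passing weights_only=False)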
model = torch.load(PATH)
model.eval()

TheModelClass(
  (conv1): Conv2d(3, 6, kernel_size=(5, 5), stride=(1, 1))
  (pool): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (conv2): Conv2d(6, 16, kernel_size=(5, 5), stride=(1, 1))
  (fc1): Linear(in_features=400, out_features=120, bias=True)
  (fc2): Linear(in_features=120, out_features=84, bias=True)
  (fc3): Linear(in_features=84, out_features=10, bias=True)
)

### save checkpoint
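Saving checkpoints is common practice, for example with pretrained models: a checkpoint records the state accumulated during training, including model.state_dict(), optimizer.state_dict(), loss, epoch, and so on.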
<br>

```python
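# save (epoch and loss are assumed to come from the training loop)
torch.save({
    "epoch": epoch,
    "model_state_dict": model.state_dict(),
    "optimizer_state_dict": optimizer.state_dict(),
    "loss": loss
}, PATH)

# load 
model = TheModelClass()
# Rebuild the optimizer with the same class and arguments used during training
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)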
checkpoint = torch.load(PATH)
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
epoch = checkpoint['epoch']
loss = checkpoint['loss']
```
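
The snippet above relies on variables from a training loop. A self-contained sketch of the same round trip, with a tiny illustrative model, one training step, and a temporary file (the names resumed_model, resumed_optimizer, and checkpoint.pt are assumptions, not part of the original):

```python
import os
import tempfile
import torch

torch.manual_seed(0)
model = torch.nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
x, y = torch.randn(8, 4), torch.randn(8, 2)

# One training step so the optimizer has momentum buffers worth saving
loss = torch.nn.functional.mse_loss(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "checkpoint.pt")   # hypothetical file name
    torch.save({
        "epoch": 1,
        "model_state_dict": model.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
        "loss": loss.item(),
    }, path)
    checkpoint = torch.load(path)

# Rebuild fresh objects, then restore their state from the checkpoint
resumed_model = torch.nn.Linear(4, 2)
resumed_optimizer = torch.optim.SGD(resumed_model.parameters(), lr=1e-2, momentum=0.9)
resumed_model.load_state_dict(checkpoint["model_state_dict"])
resumed_optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
resumed_model.train()   # call .eval() instead if the model is only used for inference
```

Storing loss as a plain float via loss.item() keeps the checkpoint free of autograd graph references.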