# Learning PyTorch with Examples

http://pytorch.apachecn.org/cn/tutorials/beginner/pytorch_with_examples.html

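This tutorial introduces the fundamental concepts of PyTorch through self-contained examples.

At its core, PyTorch provides two main features:
- An n-dimensional Tensor, similar to a numpy array but able to run on GPUs
- Automatic differentiation for building and training neural networks

We will use a fully-connected ReLU network as our running example. The network has a single hidden layer and is trained with gradient descent to fit random data by minimizing the Euclidean distance between the network output and the true output.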
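
Written out, the running example looks roughly as follows. This is a sketch assuming the bias-free form used by the first examples in this tutorial; the symbols $W_1$, $W_2$ and the learning rate $\eta$ are notation introduced here for illustration. Note that the code minimizes the squared Euclidean distance, summed over all outputs.

$$
\begin{aligned}
h &= x W_1, \qquad \hat{y} = \max(h, 0)\, W_2, \\
L &= \sum_{i,j} \left(\hat{y}_{ij} - y_{ij}\right)^2, \\
W_k &\leftarrow W_k - \eta \, \frac{\partial L}{\partial W_k}, \qquad k = 1, 2.
\end{aligned}
$$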

Contents of this chapter
- Tensors
  - Warm-up: numpy
  - PyTorch: Tensors
- Autograd
  - PyTorch: Variables and autograd
  - PyTorch: Defining new autograd functions
  - TensorFlow: Static Graphs
- nn module
  - PyTorch: nn
  - PyTorch: optim
  - PyTorch: Custom nn Modules
  - PyTorch: Control Flow + Weight Sharing

## 1 Tensors

### 1.1 Warm-up: numpy

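Before introducing PyTorch, we first implement the network using numpy.

Numpy provides an n-dimensional array object and many functions for manipulating these arrays. Numpy is a generic framework for scientific computing; it knows nothing about computation graphs, deep learning, or gradients. Even so, we can easily use numpy to fit a two-layer network to random data by manually implementing the forward and backward passes of the network with numpy operations: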

In [3]:
import numpy as np

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random input and output data
x = np.random.randn(N, D_in)
y = np.random.randn(N, D_out)

# Randomly initialize weights
w1 = np.random.randn(D_in, H)
w2 = np.random.randn(H, D_out)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)
    y_pred = h_relu.dot(w2)
    
    # Compute and print loss
    loss = np.square(y_pred - y).sum()
    print(t, loss)
    
    # Backprop to compute gradients of loss with respect to w1 and w2
    grad_y_pred = 2.0*(y_pred-y)
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h_relu = grad_y_pred.dot(w2.T)
    grad_h = grad_h_relu.copy()
    grad_h[h < 0] = 0
    grad_w1 = x.T.dot(grad_h)

    # Update weights
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2

0 32734496.17157029
1 32294133.853444383
2 36125509.463052526
3 37183780.48030265
4 31022164.268079452
5 19699376.016992837
6 10052292.468155723
7 4738500.7993115615
8 2444237.3114667926
9 1497242.4778033549
10 1070008.1332374555
11 841127.0850321376
12 694976.3630113022
13 588876.6797920397
14 505986.8424715453
15 438725.3052591055
16 382877.39705253055
17 335873.2101721588
18 295982.79854439956
19 261877.20664989087
20 232538.0556403153
21 207130.36456101632
22 185044.68712183356
23 165793.64032913014
24 148921.70412625218
25 134082.87118412345
26 120979.46089646069
27 109379.62886807072
28 99077.4421020044
29 89900.01297823727
30 81707.19345777252
31 74382.46280875207
32 67814.82552643586
33 61927.01719941042
34 56624.075103926894
35 51836.70820755682
36 47512.149389581115
37 43597.022553439776
38 40052.502064549924
39 36834.00142787381
40 33905.25205181475
41 31235.48118161978
42 28800.77742783323
43 26576.20195773837
44 24541.45712813047
45 22678.88170069232
46 20971.68314481961
...

372 0.0023963813529751616
373 0.002301801300155051
374 0.0022109698738764245
375 0.0021237388686119133
376 0.002040058725428096
377 0.0019596143732930485
378 0.0018823558380315537
379 0.0018081503469432243
380 0.0017368849738359476
381 0.0016684414689145924
382 0.001602774806581876
383 0.0015396390409503867
384 0.0014790005752204757
385 0.001420762216461707
386 0.001364825418799094
387 0.001311096756954825
388 0.0012594925135383662
389 0.0012099294554024646
390 0.00116232174786812
391 0.0011166003405978463
392 0.0010726796631505722
393 0.0010305332261146534
394 0.00099002191838985
395 0.0009511333051298429
396 0.0009137589340764401
397 0.0008778471715718858
398 0.0008433506760829081
399 0.0008102154168760009
400 0.0007783845234853177
401 0.0007478104013133853
402 0.0007184415551512898
403 0.0006902280855215711
404 0.0006631474438293873
405 0.0006371251542788569
406 0.0006121176203336854
407 0.0005880927372436432
408 0.00056501398241409
409 0.0005428441687805316
410 0.000521546410907099
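
One way to gain confidence in a hand-written backward pass like the one above is to compare it with finite differences. The following is a minimal sketch, not part of the original tutorial: the helper names `loss_fn`, `analytic_grads` and `numeric_grad` are hypothetical, and the layer sizes are shrunk so that the check runs quickly.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D_in, H, D_out = 4, 5, 3, 2            # tiny sizes keep the check fast
x = rng.standard_normal((N, D_in))
y = rng.standard_normal((N, D_out))
w1 = rng.standard_normal((D_in, H))
w2 = rng.standard_normal((H, D_out))

def loss_fn(w1, w2):
    """Forward pass only: sum of squared errors of the two-layer ReLU net."""
    h_relu = np.maximum(x.dot(w1), 0)
    return np.square(h_relu.dot(w2) - y).sum()

def analytic_grads(w1, w2):
    """Same backward pass as in the training loop above."""
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)
    y_pred = h_relu.dot(w2)
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h = grad_y_pred.dot(w2.T)
    grad_h[h < 0] = 0                      # ReLU passes gradient only where h >= 0
    grad_w1 = x.T.dot(grad_h)
    return grad_w1, grad_w2

def numeric_grad(f, w, eps=1e-6):
    """Central finite differences of f() with respect to each entry of w."""
    grad = np.zeros_like(w)
    for idx in np.ndindex(*w.shape):
        old = w[idx]
        w[idx] = old + eps
        f_plus = f()
        w[idx] = old - eps
        f_minus = f()
        w[idx] = old                       # restore the original weight
        grad[idx] = (f_plus - f_minus) / (2 * eps)
    return grad

grad_w1, grad_w2 = analytic_grads(w1, w2)
num_w1 = numeric_grad(lambda: loss_fn(w1, w2), w1)
num_w2 = numeric_grad(lambda: loss_fn(w1, w2), w2)
err1 = np.abs(grad_w1 - num_w1).max()
err2 = np.abs(grad_w2 - num_w2).max()
print(err1, err2)                          # both should be very small
```

The check can be misleading if a hidden activation happens to sit almost exactly at zero, because ReLU is not differentiable there; with random data this is very unlikely.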

### 1.2 PyTorch: Tensors

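Numpy is a great framework, but it cannot use GPUs to accelerate its numerical computations. For modern deep neural networks, GPUs often provide speedups of 50x or greater, so unfortunately numpy alone is not enough for modern deep learning.

Here we introduce the most fundamental PyTorch concept: the Tensor. A PyTorch Tensor is conceptually identical to a numpy array: a Tensor is an n-dimensional array, and PyTorch provides many functions for operating on Tensors. Like numpy arrays, PyTorch Tensors know nothing about deep learning, computational graphs, or gradients; they are a generic tool for scientific computing.

Unlike numpy, however, PyTorch Tensors can use GPUs to accelerate their numeric computations. To run a PyTorch Tensor on a GPU, you simply cast it to a new datatype.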
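
The tutorial code selects the GPU by casting to `torch.cuda.FloatTensor`, which fails on a machine without CUDA. A minimal sketch of a more defensive variant follows, assuming a PyTorch release (0.4 or later) that provides `torch.device`; the sizes are simply the ones used throughout this tutorial.

```python
import torch

# Pick the GPU when one is visible to PyTorch, otherwise stay on the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

x = torch.randn(64, 1000, device=device)   # created directly on the device
w1 = torch.randn(1000, 100).to(device)     # or moved there after creation

h = x.mm(w1)                               # runs on whichever device holds x and w1
print(h.shape, h.device.type)
```

Both operands of an operation must live on the same device, which is why `x` and `w1` are placed on `device` before the matrix product.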

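Here we use PyTorch Tensors to fit a two-layer network to random data. As in the numpy example above, we need to manually implement the forward and backward passes through the network: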

In [4]:
import torch

# dtype = torch.FloatTensor     # uncomment to run on CPU
dtype = torch.cuda.FloatTensor  # run on GPU (requires CUDA); torch.Tensor is an alias for the default tensor type (torch.FloatTensor)

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random input and output data
x = torch.randn(N, D_in).type(dtype)  # .type(dtype) casts the tensor to the given type
y = torch.randn(N, D_out).type(dtype)

# Randomly initialize weights
w1 = torch.randn(D_in, H).type(dtype)
w2 = torch.randn(H, D_out).type(dtype)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y
    h = x.mm(w1)
    h_relu = h.clamp(min=0)
    y_pred = h_relu.mm(w2)
    
    # Compute and print loss
    loss = (y_pred - y).pow(2).sum()
    print(t, loss)
    
    # Backprop to compute gradients of loss with respect to w1 and w2
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.t().mm(grad_y_pred)
    grad_h_relu = grad_y_pred.mm(w2.t())
    grad_h = grad_h_relu.clone()
    grad_h[h < 0] = 0
    grad_w1 = x.t().mm(grad_h)

    # Update weights using gradient descent
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2

0 28995146.0
1 27361294.0
2 29826936.0
3 31742716.0
4 29063876.0
5 21549452.0
6 12910030.0
7 6791278.5
8 3524125.0
9 2009537.875
10 1314762.25
11 970252.9375
12 774696.3125
13 646534.4375
14 552546.9375
15 478610.03125
16 418145.125
17 367613.8125
18 324727.28125
19 287976.25
20 256285.71875
21 228829.953125
22 204921.828125
23 184009.71875
24 165671.1875
25 149507.578125
26 135221.328125
27 122553.7734375
28 111273.984375
29 101222.578125
30 92241.5703125
31 84203.9296875
32 76985.09375
33 70487.03125
34 64627.6015625
35 59333.234375
36 54543.33203125
37 50198.86328125
38 46255.32421875
39 42664.33984375
40 39392.3984375
41 36408.70703125
42 33682.87890625
43 31191.8984375
44 28914.9140625
45 26831.470703125
46 24918.7890625
47 23161.330078125
48 21544.095703125
49 20056.931640625
50 18687.849609375
51 17424.43359375
52 16257.421875
53 15178.8212890625
54 14180.76953125
55 13256.44140625
56 12400.0810546875
57 11606.001953125
58 10869.060546875
59 10184.5908203125
60 9548.5361328125
...

dtype = torch.cuda.FloatTensor
- run on GPU
- torch.Tensor is an alias for the default tensor type (torch.FloatTensor)

x = torch.randn(N, D_in).type(dtype)
- .type: Returns the type if dtype is not provided, else casts this object to the specified type.

h = x.mm(w1)
- Performs a matrix multiplication of the two matrices x and w1.
- This function does not broadcast. For broadcasting matrix products, see torch.matmul().

h_relu = h.clamp(min=0)
- Clamps all elements of the input into the range [min, max] and returns the resulting tensor. With only min=0 given, negative values become 0, which is exactly ReLU.

h_relu.t()
- Transpose

grad_h_relu.clone()
- Returns a copy of the self tensor. The copy has the same size and data type as self.
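
The operations listed above can be tried on small hand-checkable matrices. This is an illustrative sketch only; the values of `a` and `b` are made up, and `torch.tensor` assumes PyTorch 0.4 or later.

```python
import torch

a = torch.tensor([[1.0, -2.0], [3.0, 4.0]])
b = torch.tensor([[0.5, 1.0], [-1.0, 2.0]])

prod = a.mm(b)                 # plain 2-D matrix product, no broadcasting
relu = prod.clamp(min=0)       # negative entries become 0
a_t = a.t()                    # transpose of a 2-D tensor

c = a.clone()                  # independent copy with the same size and dtype
c[0, 0] = 100.0                # modifying the copy leaves `a` untouched

print(prod)                    # [[2.5, -3.0], [-2.5, 11.0]]
print(relu)                    # [[2.5, 0.0], [0.0, 11.0]]
print(a_t)                     # [[1.0, 3.0], [-2.0, 4.0]]
print(a[0, 0].item(), c[0, 0].item())
```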

## 2 Autograd

### 2.1 PyTorch: Variables and autograd

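In the above examples, we had to manually implement both the forward and backward passes of our neural network. Manually implementing the backward pass is not a big deal for a small two-layer network, but it quickly gets very hairy for large complex networks.

Thankfully, we can use automatic differentiation to automate the computation of backward passes in neural networks. The autograd package in PyTorch provides exactly this functionality. When using autograd, the forward pass of your network defines a computational graph; nodes in the graph are Tensors, and edges are functions that produce output Tensors from input Tensors. Backpropagating through this graph then allows you to easily compute gradients.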
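
To make the idea of a computational graph concrete, here is a toy reverse-mode differentiation sketch in plain Python. It is purely illustrative and is not how PyTorch is implemented: the `Value` class and its attributes are hypothetical names, and it handles only scalars with three operations.

```python
class Value:
    """A scalar node in a computational graph (illustrative, not PyTorch's API)."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self.parents = parents          # nodes this value was computed from
        self.local_grads = local_grads  # d(self)/d(parent) for each parent

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def relu(self):
        return Value(max(self.data, 0.0), (self,), (1.0 if self.data > 0 else 0.0,))

    def backward(self):
        # Collect nodes so that every node appears after the nodes it depends on.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        # Walk from the output back to the inputs, applying the chain rule.
        for node in reversed(order):
            for parent, local in zip(node.parents, node.local_grads):
                parent.grad += local * node.grad


x, w, b = Value(2.0), Value(-3.0), Value(10.0)
out = (x * w + b).relu()   # the forward pass builds the graph: relu(2 * -3 + 10) = 4
out.backward()
print(out.data, x.grad, w.grad, b.grad)   # 4.0 -3.0 2.0 1.0
```

The forward pass records, for each result, which inputs produced it and the local derivative; `backward` then only has to traverse those records in reverse.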

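This sounds complicated, but it is pretty simple to use in practice. We wrap our PyTorch Tensors in Variable objects; a Variable represents a node in a computational graph. If x is a Variable, then x.data is a Tensor, and x.grad is another Variable holding the gradient of some scalar value with respect to x.

PyTorch Variables have the same API as PyTorch Tensors: (almost) any operation that you can perform on a Tensor also works on Variables; the difference is that using Variables defines a computational graph, allowing you to automatically compute gradients.

Here we use PyTorch Variables and autograd to implement our two-layer network; now we no longer need to manually implement the backward pass through the network: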

In [8]:
import torch
from torch.autograd import Variable

dtype = torch.cuda.FloatTensor

N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensor to hold input and outputs, and wrap them into Variables
# Setting requires_grad=False indicates that we do not need to compute gradients
# with respect to these Variables during the backward pass.
x = Variable(torch.randn(N, D_in).type(dtype), requires_grad=False)
y = Variable(torch.randn(N, D_out).type(dtype), requires_grad=False)

# Create random Tensors for weights, and wrap them into Variables
# Setting requires_grad=True indicates that we want to compute gradients with
# respect to these Variables during the backward pass.
w1 = Variable(torch.randn(D_in, H).type(dtype), requires_grad=True)
w2 = Variable(torch.randn(H, D_out).type(dtype), requires_grad=True)

learning_rate = 1e-6
for t in range(500):
    
    # Forward pass: compute predicted y using operations on Variables. These
    # are exactly the same operations we used to compute the forward pass using
    # Tensors. However, we do not need to keep references to intermediate values,
    # since we are not implementing the backward pass by hand.
    y_pred = x.mm(w1).clamp(min=0).mm(w2)
    
    # Compute and print loss using operations on Variables.
    # Now loss is a Variable of shape (1,), and loss.data is a Tensor of shape (1,);
    # loss.data[0] is the scalar value holding the loss.
    loss = (y_pred - y).pow(2).sum()
    print(t, loss.data[0])
    
    # Use autograd to compute the backward pass. This call will compute the
    # gradient of loss with respect to all Variables with requires_grad=True.
    # After this call w1.grad and w2.grad will be Variables holding the gradient
    # of the loss with respect to w1 and w2 respectively.
    loss.backward()
    
    # Manually update weights using gradient descent.
    # w1.data and w2.data are Tensors.
    # w1.grad and w2.grad are Variables, and w1.grad.data and w2.grad.data are Tensors.
    w1.data -= learning_rate * w1.grad.data
    w2.data -= learning_rate * w2.grad.data
    
    # Manually zero the gradients after updating weights
    w1.grad.data.zero_()
    w2.grad.data.zero_()

0 29772548.0
1 24823848.0
2 23166764.0
3 21470888.0
4 18356564.0
5 13981907.0
6 9583265.0
7 6104660.0
8 3808175.5
9 2422623.25
10 1622230.875
11 1154277.5
12 870551.625
13 688015.5625
14 562881.1875
15 471903.5
16 402198.03125
17 346875.5
18 301638.53125
19 263934.71875
20 232100.390625
21 204990.53125
22 181707.171875
23 161585.9375
24 144125.703125
25 128884.8203125
26 115526.625
27 103785.9765625
28 93444.0390625
29 84306.2734375
30 76196.96875
31 68987.5
32 62560.79296875
33 56819.7265625
34 51684.359375
35 47084.6953125
36 42952.01171875
37 39231.65625
38 35878.65625
39 32848.27734375
40 30106.81640625
41 27622.966796875
42 25369.568359375
43 23322.375
44 21460.3671875
45 19764.939453125
46 18219.94921875
47 16809.5546875
48 15521.02734375
49 14342.2109375
50 13263.4892578125
51 12274.8681640625
52 11368.4111328125
53 10536.060546875
54 9771.4873046875
55 9067.732421875
56 8419.9384765625
57 7823.4580078125
58 7273.09619140625
59 6765.041015625
60 6295.873046875
61 5862.3999023437

463 0.00014092904166318476
464 0.00013848903472535312
465 0.00013597677752841264
466 0.00013421413314063102
467 0.00013162827235646546
468 0.000129780251882039
469 0.00012772073387168348
470 0.00012489539221860468
471 0.00012291045277379453
472 0.00012037895794492215
473 0.00011884667037520558
474 0.00011687122605508193
475 0.00011464234557934105
476 0.00011250393436057493
477 0.00011061292752856389
478 0.00010935208410955966
479 0.00010754401591839269
480 0.00010590210149530321
481 0.00010415139695396647
482 0.00010276256216457114
483 0.00010105106048285961
484 9.946001955540851e-05
485 9.821564890444279e-05
486 9.680413495516405e-05
487 9.553597919875756e-05
488 9.364654397359118e-05
489 9.221697837347165e-05
490 9.062888420885429e-05
491 8.924429857870564e-05
492 8.806244295556098e-05
493 8.659731247462332e-05
494 8.544695447199047e-05
495 8.393824828090146e-05
496 8.331567369168624e-05
497 8.203719335142523e-05
498 8.087518654065207e-05
499 7.99290428403765e-05


Variable
- requires_grad: default is False
- .grad: is also a Variable
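
The training loop above zeroes the gradients after every update because `backward()` adds to `.grad` rather than overwriting it. A minimal sketch of that behavior follows, assuming PyTorch 0.4 or later, where `requires_grad` and `.grad` live directly on tensors and no `Variable` wrapper is needed; the values are made up for illustration.

```python
import torch

x = torch.ones(3)                                       # requires_grad defaults to False
w = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

(w * w * x).sum().backward()            # d/dw sum(w^2) = 2w
first = w.grad.clone()

(w * w * x).sum().backward()            # a second backward pass adds to w.grad
second = w.grad.clone()

w.grad.zero_()                          # reset, as the training loop does
print(x.requires_grad, first, second, w.grad)
```

After the second call the stored gradient is twice the true gradient, which is why forgetting to zero it silently changes the effective step size.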

### 2.2 PyTorch: Defining new autograd functions

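Under the hood, each primitive autograd operator is really two functions that operate on Tensors. The forward function computes output Tensors from input Tensors. The backward function receives the gradient of the output Tensors with respect to some scalar value, and computes the gradient of the input Tensors with respect to that same scalar value.

In PyTorch we can easily define our own autograd operator by defining a subclass of torch.autograd.Function and implementing the forward and backward functions. We can then use our new autograd operator by calling its apply method like a function, passing Variables containing the input data.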
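
A hand-written backward function is easy to get wrong, so it can help to test it numerically. The sketch below assumes PyTorch 0.4 or later and uses `torch.autograd.gradcheck`, which compares `backward` against finite differences. The `Cube` operator is a hypothetical example chosen because it is smooth everywhere (ReLU has a kink at zero).

```python
import torch

class Cube(torch.autograd.Function):
    """Hypothetical example operator: y = x ** 3."""

    @staticmethod
    def forward(ctx, input):
        ctx.save_for_backward(input)       # keep the input for the backward pass
        return input ** 3

    @staticmethod
    def backward(ctx, grad_output):
        input, = ctx.saved_tensors
        # Chain rule: dL/dx = dL/dy * dy/dx, with dy/dx = 3 * x^2.
        return grad_output * 3 * input ** 2


# gradcheck expects double-precision inputs that require gradients.
x = torch.randn(4, 3, dtype=torch.double, requires_grad=True)
ok = torch.autograd.gradcheck(Cube.apply, (x,), eps=1e-6, atol=1e-4)
print(ok)
```

`gradcheck` raises an error describing the mismatch when the analytic and numerical gradients disagree, so a wrong sign or a missing factor in `backward` shows up immediately.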

In this example we define our own custom autograd function for performing the ReLU nonlinearity, and use it to implement our two-layer network:

In [10]:
import torch
from torch.autograd import Variable

class MyReLU(torch.autograd.Function):
    """
    We can implement our own custom autograd Functions by subclassing
    torch.autograd.Function and implementing the forward and backward passes
    which operate on Tensors.
    """

    @staticmethod
    def forward(ctx, input):
        """
        In the forward pass we receive a Tensor containing the input and return
        a Tensor containing the output. ctx is a context object that can be used
        to stash information for backward computation. You can cache arbitrary
        objects for use in the backward pass using the ctx.save_for_backward method.
        """
        ctx.save_for_backward(input)
        return input.clamp(min=0)
    
    @staticmethod
    def backward(ctx, grad_output):
        """
        In the backward pass we receive a Tensor containing the gradient of the loss
        with respect to the output, and we need to compute the gradient of the loss
        with respect to the input.
        """
        input, = ctx.saved_tensors
        grad_input = grad_output.clone()
        grad_input[input < 0] = 0
        return grad_input
    
dtype = torch.FloatTensor

N, D_in, H, D_out = 64, 1000, 100, 10

x = Variable(torch.randn(N, D_in).type(dtype), requires_grad=False)
y = Variable(torch.randn(N, D_out).type(dtype), requires_grad=False)

w1 = Variable(torch.randn(D_in, H).type(dtype), requires_grad=True)
w2 = Variable(torch.randn(H, D_out).type(dtype), requires_grad=True)

learning_rate = 1e-6
for t in range(500):
    # To apply our Function, we use Function.apply method. We alias this as 'relu'.
    relu = MyReLU.apply
    
    y_pred = relu(x.mm(w1)).mm(w2)
    
    loss = (y_pred - y).pow(2).sum()
    print(t, loss.data[0])
    
    loss.backward()
    
    w1.data -= learning_rate * w1.grad.data
    w2.data -= learning_rate * w2.grad.data
    
    w1.grad.data.zero_()
    w2.grad.data.zero_()

0 37941688.0
1 38185336.0
2 40163612.0
3 36658976.0
4 26343332.0
5 14784023.0
6 7178840.0
7 3529874.0
8 1982522.625
9 1307999.625
10 971712.9375
11 772731.8125
12 636963.1875
13 535203.1875
14 454980.5
15 389977.8125
16 336431.125
17 291814.0625
18 254346.703125
19 222615.359375
20 195604.953125
21 172479.390625
22 152586.734375
23 135398.953125
24 120478.8046875
25 107518.09375
26 96180.9296875
27 86228.6484375
28 77464.671875
29 69731.390625
30 62888.53125
31 56811.77734375
32 51403.8203125
33 46583.56640625
34 42275.19921875
35 38417.52734375
36 34959.5625
37 31853.47265625
38 29062.19140625
39 26544.12109375
40 24269.55078125
41 22212.3828125
42 20350.65625
43 18662.673828125
44 17129.578125
45 15736.0693359375
46 14468.2333984375
47 13313.5693359375
48 12261.640625
49 11300.6708984375
50 10422.7666015625
51 9619.703125
52 8884.771484375
53 8211.205078125
54 7593.44873046875
55 7026.6064453125
56 6505.8681640625
57 6027.0615234375
58 5586.8134765625
59 5181.57861328125
60 4808.2651

376 0.0002672165574040264
377 0.0002594874531496316
378 0.00025221117539331317
379 0.0002456468646414578
380 0.0002389893343206495
381 0.00023265408526640385
382 0.00022601640375796705
383 0.00022005222854204476
384 0.00021407850726973265
385 0.00020862571545876563
386 0.00020301752374507487
387 0.00019820223678834736
388 0.00019337174308020622
389 0.00018790092144627124
390 0.00018347265722695738
391 0.00017848028801381588
392 0.0001742429449222982
393 0.00016976529150269926
394 0.00016594839689787477
395 0.0001614615903235972
396 0.0001575064961798489
397 0.0001537017524242401
398 0.00014962669229134917
399 0.00014612285303883255
400 0.00014278831076808274
401 0.0001394372375216335
402 0.00013644510181620717
403 0.0001334028347628191
404 0.0001303686440223828
405 0.00012745497224386781
406 0.00012432452058419585
407 0.00012149869144195691
408 0.00011865557462442666
409 0.00011647496285149828
410 0.00011369857384124771
411 0.00011122131400043145
412 0.00010895200830418617
413 0.000106

### 2.3 TensorFlow: Static Graphs

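PyTorch autograd looks a lot like TensorFlow: in both frameworks we define a computational graph, and use automatic differentiation to compute gradients. The biggest difference between the two is that TensorFlow's computational graphs are static and PyTorch uses dynamic computational graphs.

In TensorFlow, we define the computational graph once and then execute the same graph over and over again, possibly feeding different input data to the graph. In PyTorch, each forward pass defines a new computational graph.

Static graphs are nice because you can optimize the graph up front; for example, a framework might decide to fuse some graph operations for efficiency, or to come up with a strategy for distributing the graph across many GPUs or many machines. If you are reusing the same graph over and over, then this potentially costly up-front optimization can be amortized as the same graph is rerun over and over.

One aspect where static and dynamic graphs differ is control flow. For some models we may wish to perform different computation for each data point; for example, a recurrent network might be unrolled for a different number of time steps for each data point, and this unrolling can be implemented as a loop. With a static graph the loop construct needs to be a part of the graph; for this reason TensorFlow provides operators such as tf.scan for embedding loops into the graph. With dynamic graphs the situation is simpler: since we build graphs on-the-fly for each example, we can use normal imperative flow control to perform computation that differs for each input.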
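
The dynamic-graph case described above can be sketched with an ordinary Python loop whose length depends on the input. This is an illustration only, assuming PyTorch 0.4 or later; the recurrent step, the sizes and the name `run` are made up for the example.

```python
import torch

torch.manual_seed(0)
w = torch.randn(3, 3, requires_grad=True)   # one weight matrix shared by all steps

def run(sequence):
    """Apply one recurrent step per element; the loop length follows the input."""
    h = torch.zeros(3)
    for x_t in sequence:                     # ordinary Python loop over the rows
        h = torch.tanh(w.mv(h) + x_t)
    return h.sum()

grad_shapes = []
for length in (2, 5):                        # two inputs with different lengths
    sequence = torch.randn(length, 3)
    w.grad = None                            # clear the gradient from the previous input
    run(sequence).backward()                 # the graph was built during this forward pass
    grad_shapes.append(tuple(w.grad.shape))

print(grad_shapes)
```

No special loop operator is needed: each call to `run` records exactly as many steps as the sequence has, and `backward` walks whatever graph that call produced.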

To contrast with the PyTorch autograd example above, here we use TensorFlow to fit a simple two-layer net:

In [11]:
import tensorflow as tf
import numpy as np

# First we set up the computational graph:

N, D_in, H, D_out = 64, 1000, 100, 10

# Create placeholders for the input and target data; these will be filled
# with real data when we execute the graph.
x = tf.placeholder(tf.float32, shape=(None, D_in))
y = tf.placeholder(tf.float32, shape=(None, D_out))

# Create Variables for the weights and initialize them with random data.
# A TensorFlow Variable persists its value across executions of the graph.
w1 = tf.Variable(tf.random_normal((D_in, H)))
w2 = tf.Variable(tf.random_normal((H, D_out)))

# Forward pass: Compute the predicted y using operations on TensorFlow Tensors.
# Note that this code does not actually perform any numeric operations; it
# merely sets up the computational graph that we will later execute.
h = tf.matmul(x, w1)
h_relu = tf.maximum(h, tf.zeros(1))
y_pred = tf.matmul(h_relu, w2)

# Compute loss using operations on TensorFlow Tensors
loss = tf.reduce_sum((y - y_pred) ** 2.0)

# Compute gradient of the loss with respect to w1 and w2.
grad_w1, grad_w2 = tf.gradients(loss, [w1, w2])

# Update the weights using gradient descent. To actually update the weights
# we need to evaluate new_w1 and new_w2 when executing the graph. Note that
# in TensorFlow the act of updating the value of the weights is part of
# the computational graph; in PyTorch this happens outside the computational
# graph.
learning_rate = 1e-6
new_w1 = w1.assign(w1 - learning_rate * grad_w1)
new_w2 = w2.assign(w2 - learning_rate * grad_w2)

# Now we have built our computational graph, so we enter a TensorFlow session to
# actually execute the graph.
with tf.Session() as sess:
    # Run the graph once to initialize the Variables w1 and w2.
    sess.run(tf.global_variables_initializer())

    # Create numpy arrays holding the actual data for the inputs x and targets
    # y
    x_value = np.random.randn(N, D_in)
    y_value = np.random.randn(N, D_out)
    for _ in range(500):
        # Execute the graph many times. Each time it executes we want to bind
        # x_value to x and y_value to y, specified with the feed_dict argument.
        # Each time we execute the graph we want to compute the values for loss,
        # new_w1, and new_w2; the values of these Tensors are returned as numpy
        # arrays.
        loss_value, _, _ = sess.run([loss, new_w1, new_w2],
                                    feed_dict={x: x_value, y: y_value})
        print(loss_value)


29014120.0
27775826.0
30869156.0
33080600.0
30457910.0
22044510.0
12846660.0
6450597.0
3262854.2
1841078.2
1208576.9
893946.5
711969.1
590122.5
499727.3
428469.62
370395.9
322163.7
281659.25
247353.4
218116.14
193036.9
171429.22
152715.53
136436.47
122244.09
109812.266
98894.03
89265.555
80749.516
73193.33
66468.63
60468.16
55104.312
50302.926
45990.246
42105.348
38601.242
35433.14
32565.893
29964.97
27601.613
25451.064
23492.016
21703.41
20068.5
18572.297
17201.953
15945.406
14791.701
13731.764
12756.334
11858.643
11030.818
10267.215
9562.42
8911.6875
8309.774
7752.717
7236.8022
6758.9707
6315.759
5904.0723
5521.841
5166.532
4836.2173
4529.001
4242.994
3976.7449
3728.4736
3497.0186
3281.268
3079.8203
2891.6772
2715.9392
2551.687
2398.1743
2254.623
2120.2754
1994.4758
1876.6539
1766.3265
1663.0264
1566.0973
1475.1877
1390.0342
1310.1538
1235.1267
1164.6766
1098.4711
1036.2354
977.74884
922.746
871.0416
822.3972
776.59314
733.4818
692.8978
654.6752
618.68304
584.753
552.78
522.64795
494

## 3 nn module

### 3.1 PyTorch: nn

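Computational graphs and autograd are a very powerful paradigm for defining complex operators and automatically taking derivatives; however, for large neural networks raw autograd can be a bit too low-level.

When building neural networks we frequently think of arranging the computation into **layers**, some of which have **learnable parameters** that will be optimized during learning.

In TensorFlow, packages like Keras, TensorFlow-Slim, and TFLearn provide higher-level abstractions over raw computational graphs that are useful for building neural networks.

In PyTorch, the nn package serves this same purpose. The nn package defines a set of **Modules**, which are roughly equivalent to neural network layers. A Module receives input Variables and computes output Variables, but may also hold internal state such as Variables containing learnable parameters. The nn package also defines a set of useful loss functions that are commonly used when training neural networks.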
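
To see the internal state a Module holds and what a summed squared-error loss computes, consider this small sketch. It assumes a recent PyTorch release, in which `reduction="sum"` is the current spelling of what this tutorial writes as `size_average=False`; the layer sizes are arbitrary.

```python
import torch

torch.manual_seed(0)
layer = torch.nn.Linear(4, 2)                  # holds a weight and a bias
shapes = {name: tuple(p.shape) for name, p in layer.named_parameters()}
print(shapes)                                  # weight is (out_features, in_features)

x = torch.randn(8, 4)
target = torch.randn(8, 2)
pred = layer(x)                                # calling the Module runs its forward pass

# Summed squared error from the nn package versus the hand-written expression.
loss_sum = torch.nn.MSELoss(reduction="sum")(pred, target)
manual = (pred - target).pow(2).sum()
print(torch.allclose(loss_sum, manual))
```

With the summed reduction, the loss matches the `(y_pred - y).pow(2).sum()` expression used in the earlier sections, which keeps the printed loss values comparable across examples.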

In this example we use the nn package to implement our two-layer network:

In [12]:
import torch
from torch.autograd import Variable

N, D_in, H, D_out = 64, 1000, 100, 10

x = Variable(torch.randn(N, D_in))
y = Variable(torch.randn(N, D_out))

# Use the nn package to define our model as a sequence of layers. nn.Sequential
# is a Module which contains other Modules, and applies them in sequence to
# produce its output. Each Linear Module computes output from input using a
# linear function, and holds internal Variables for its weight and bias.
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out))

# The nn package also contains definitions of popular loss functions; in this
# case we will use Mean Squared Error (MSE) as our loss function. Passing
# size_average=False sums the squared errors instead of averaging them.
criterion = torch.nn.MSELoss(size_average=False)

learning_rate = 1e-4
for t in range(500):
    
    # Forward pass: compute predicted y by passing x to the model. Module objects
    # override the __call__ operator so you can call them like functions. When
    # doing so you pass a Variable of input data to the Module and it produces
    # a Variable of output data.
    y_pred = model(x)
    
    # Compute and print loss. We pass Variables containing the predicted and true
    # values of y, and the loss function returns a Variable containing the loss.
    loss = criterion(y_pred, y)
    print(t, loss.data[0])
    
    # Zero the gradients before running the backward pass.
    model.zero_grad()
    
    # Backward pass: compute gradient of the loss with respect to all the learnable
    # parameters of the model. Internally, the parameters of each Module are stored
    # in Variables with requires_grad=True, so this call will compute gradients for
    # all learnable parameters in the model.
    loss.backward()
    
    # Update the weights using gradient descent. Each parameter is a Variable, so
    # we can access its gradients like we did before.
    for param in model.parameters():
        param.data -= learning_rate * param.grad.data

0 703.4216918945312
1 652.5372314453125
2 608.4209594726562
3 569.59228515625
4 534.6481323242188
5 503.0848388671875
6 474.34906005859375
7 447.6849060058594
8 422.99432373046875
9 400.1066589355469
10 378.70098876953125
11 358.56243896484375
12 339.5515441894531
13 321.5807189941406
14 304.50689697265625
15 288.2322082519531
16 272.8668212890625
17 258.239501953125
18 244.3154754638672
19 231.0492706298828
20 218.34323120117188
21 206.2227783203125
22 194.6803741455078
23 183.689208984375
24 173.25978088378906
25 163.34701538085938
26 153.93235778808594
27 145.00367736816406
28 136.57296752929688
29 128.58460998535156
30 121.03135681152344
31 113.92469787597656
32 107.22628021240234
33 100.91724395751953
34 94.98430633544922
35 89.39228057861328
36 84.12435913085938
37 79.17227935791016
38 74.53214263916016
39 70.17620849609375
40 66.08942413330078
41 62.25642776489258
42 58.66366195678711
43 55.29549789428711
44 52.135406494140625
45 49.1715087890625
46 46.39003372192383
47 43.77647

494 5.974632699690119e-07
495 5.777225169367739e-07
496 5.58635974812205e-07
497 5.405407250691496e-07
498 5.227507244853768e-07
499 5.055909468865138e-07


### 3.2 PyTorch: optim

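Up to this point we have updated the weights of our models by manually mutating the .data member of the Variables holding the learnable parameters. This is not a huge burden for simple optimization algorithms like stochastic gradient descent, but in practice we often train neural networks using more sophisticated optimizers like AdaGrad, RMSProp, Adam, etc.

The optim package in PyTorch abstracts the idea of an optimization algorithm and provides implementations of commonly used optimization algorithms.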
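
To give a feel for what a "more sophisticated optimizer" does on each step, here is a plain-Python sketch of the commonly published Adam update rule applied to a one-dimensional quadratic. It is an illustration under stated assumptions, not the `torch.optim.Adam` implementation: the function name `adam_minimize` and the hyperparameter values are chosen for the example.

```python
import math

def adam_minimize(grad_fn, x0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Minimize a scalar function of one variable using Adam-style updates."""
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(x)
        m = beta1 * m + (1 - beta1) * g          # running mean of gradients
        v = beta2 * v + (1 - beta2) * g * g      # running mean of squared gradients
        m_hat = m / (1 - beta1 ** t)             # bias correction for the zero init
        v_hat = v / (1 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x

# f(x) = (x - 3)^2 has gradient 2 * (x - 3) and its minimum at x = 3.
x_min = adam_minimize(lambda x: 2.0 * (x - 3.0), x0=0.0)
print(x_min)
```

Unlike plain gradient descent, the step is scaled by the running gradient statistics, so the optimizer has to carry state (`m`, `v`, and the step count) between updates. Managing that state per parameter is the bookkeeping an optimizer object takes over.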

In this example we will use the nn package to define our model as before, but we will optimize the model using the Adam algorithm provided by the optim package:

In [13]:
import torch
from torch.autograd import Variable

N, D_in, H, D_out = 64, 1000, 100, 10

x = Variable(torch.randn(N, D_in))
y = Variable(torch.randn(N, D_out))

model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out))

criterion = torch.nn.MSELoss(size_average=False)

# Use the optim package to define an Optimizer that will update the weights of
# the model for us. Here we will use Adam; the optim package contains many other
# optimization algorithms. The first argument to the Adam constructor tells the
# optimizer which Variables it should update.
learning_rate = 1e-4
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

for t in range(500):
    
    y_pred = model(x)
    
    loss = criterion(y_pred, y)
    print(t, loss.data[0])
    
    # Before the backward pass, use the optimizer object to zero all of the
    # gradients for the variables it will update (which are the learnable
    # weights of the model). This is because by default, gradients are
    # accumulated in buffers (i.e., not overwritten) whenever .backward()
    # is called. Check the docs of torch.autograd.backward for more details.
    optimizer.zero_grad()
    
    loss.backward()
    
    # Calling the step function on an Optimizer makes an update to its
    # parameters
    optimizer.step()

0 734.6030883789062
1 715.9983520507812
2 697.8878173828125
3 680.2461547851562
4 663.0841674804688
5 646.3997802734375
6 630.16650390625
7 614.4307250976562
8 599.0421752929688
9 584.0729370117188
10 569.6410522460938
11 555.6756591796875
12 542.1476440429688
13 529.02197265625
14 516.2322387695312
15 503.7779846191406
16 491.7445373535156
17 480.1190490722656
18 468.8105163574219
19 457.7789001464844
20 447.0379943847656
21 436.5836486816406
22 426.4103088378906
23 416.5605773925781
24 406.9507141113281
25 397.5396728515625
26 388.3530578613281
27 379.40203857421875
28 370.73291015625
29 362.3007507324219
30 354.0657043457031
31 346.01422119140625
32 338.2065124511719
33 330.5850830078125
34 323.13983154296875
35 315.8452453613281
36 308.74462890625
37 301.8183898925781
38 295.0179443359375
39 288.3460693359375
40 281.826171875
41 275.4485168457031
42 269.1873779296875
43 263.04339599609375
44 257.03564453125
45 251.16433715820312
46 245.41358947753906
47 239.8070068359375
48 234.312

426 8.99837505130563e-06
427 8.516391062585171e-06
428 8.059073479671497e-06
429 7.626537808391731e-06
430 7.216650828922866e-06
431 6.826993740105536e-06
432 6.458931238739751e-06
433 6.108844445407158e-06
434 5.77728451389703e-06
435 5.4633846957585774e-06
436 5.16583622811595e-06
437 4.8837346184882335e-06
438 4.617530976247508e-06
439 4.364367214293452e-06
440 4.125899067730643e-06
441 3.898428076354321e-06
442 3.6840328903053887e-06
443 3.4805875657184515e-06
444 3.287942490715068e-06
445 3.106124040641589e-06
446 2.933504902102868e-06
447 2.7706826131179696e-06
448 2.616072151795379e-06
449 2.4700307221792173e-06
450 2.3318314106290927e-06
451 2.201333472839906e-06
452 2.0778632006113185e-06
453 1.9614651591837173e-06
454 1.850474177444994e-06
455 1.7461887864556047e-06
456 1.6472270090162056e-06
457 1.554344521537132e-06
458 1.465873197048495e-06
459 1.3828391729475698e-06
460 1.30405169329606e-06
461 1.229582949235919e-06
462 1.1590494750635116e-06
463 1.092501520361111e-06
...

### 3.3 PyTorch: Custom nn Modules

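Sometimes you will want to specify models that are more complex than a sequence of existing Modules; for these cases you can define your own Modules by subclassing nn.Module and defining a forward method that receives input Variables and produces output Variables using other Modules or other autograd operations on Variables.

In this example we implement our two-layer network as a custom Module subclass: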

In [14]:
import torch
from torch.autograd import Variable

class TwoLayerNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        """
        In the constructor we instantiate two nn.Linear modules and assign them as
        member variables.
        """
        super(TwoLayerNet, self).__init__()
        self.linear1 = torch.nn.Linear(D_in, H)
        self.linear2 = torch.nn.Linear(H, D_out)
        
    def forward(self, x):
        """
        In the forward function we accept a Variable of input data and we must return
        a Variable of output data. We can use Modules defined in the constructor as
        well as arbitrary operators on Variables.
        """
        h_relu = self.linear1(x).clamp(min=0)
        y_pred = self.linear2(h_relu)
        return y_pred

N, D_in, H, D_out = 64, 1000, 100, 10

x = Variable(torch.randn(N, D_in))
y = Variable(torch.randn(N, D_out), requires_grad=False)

# Construct our model by instantiating the class defined above
model = TwoLayerNet(D_in, H, D_out)

# Construct our loss function and an Optimizer. The call to model.parameters()
# in the SGD constructor will contain the learnable parameters of the two
# nn.Linear modules which are members of the model.
criterion = torch.nn.MSELoss(size_average=False)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)
for t in range(500):
    
    y_pred = model(x)

    loss = criterion(y_pred, y)
    print(t, loss.data[0])

    # Zero gradients, perform a backward pass, and update the weights.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

0 589.84326171875
1 546.5028686523438
2 509.0318603515625
3 476.5212707519531
4 448.0259704589844
5 422.5433044433594
6 399.28851318359375
7 377.91180419921875
8 358.2407531738281
9 339.8867492675781
10 322.751220703125
11 306.6019287109375
12 291.4007568359375
13 277.0802001953125
14 263.52508544921875
15 250.6353302001953
16 238.3273468017578
17 226.56622314453125
18 215.35025024414062
19 204.61721801757812
20 194.38937377929688
21 184.6250457763672
22 175.2938232421875
23 166.37644958496094
24 157.8649139404297
25 149.7423553466797
26 141.99266052246094
27 134.60411071777344
28 127.5632553100586
29 120.84666442871094
30 114.45011138916016
31 108.35645294189453
32 102.57367706298828
33 97.09485626220703
34 91.89411926269531
35 86.97666931152344
36 82.31912231445312
37 77.91621398925781
38 73.7448501586914
39 69.79910278320312
40 66.0712661743164
41 62.55091094970703
42 59.21954345703125
43 56.07221984863281
44 53.09781265258789
45 50.29652786254883
46 47.63972854614258
47 45.13365936

365 0.0013559116050601006
366 0.001323879580013454
367 0.0012926360359415412
368 0.001262156874872744
369 0.0012324207928031683
370 0.0012034098617732525
371 0.0011750920675694942
372 0.0011474419152364135
373 0.0011204875772818923
374 0.0010941573418676853
375 0.0010684967273846269
376 0.0010434481082484126
377 0.001018985640257597
378 0.0009951320243999362
379 0.000971843721345067
380 0.0009491167729720473
381 0.0009269269066862762
382 0.0009052859386429191
383 0.0008841900853440166
384 0.0008635635022073984
385 0.000843440240714699
386 0.0008238050504587591
387 0.0008046089787967503
388 0.000785890850238502
389 0.0007676207460463047
390 0.0007497862679883838
391 0.000732366053853184
392 0.0007153672631829977
393 0.000698765623383224
394 0.0006825680611655116
395 0.000666758744046092
396 0.000651323061902076
397 0.000636242562904954
398 0.0006215310422703624
399 0.0006071499665267766
400 0.0005931073683314025
401 0.0005794076132588089
402 0.0005660256720148027
403 0.00055296783102676

### 3.4 PyTorch: Control Flow + Weight Sharing

As an example of dynamic graphs and weight sharing, we implement a very strange model: a fully-connected ReLU network that on each forward pass chooses a random number between 1 and 4 and uses that many hidden layers, reusing the same weights multiple times to compute the innermost hidden layers.

For this model we can use normal Python flow control to implement the loop, and we can implement weight sharing among the innermost layers by simply reusing the same Module multiple times when defining the forward pass.
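Weight sharing can be checked in isolation before building the full model. The following is a minimal sketch, not part of the original tutorial: the names `shared` and `apply_n_times` and the layer sizes are illustrative assumptions, and it uses plain tensors (PyTorch 0.4 or later) rather than `Variable`. It applies one `torch.nn.Linear` several times in an ordinary Python loop and shows that the parameter count does not depend on how often the layer is reused.

```python
import torch

torch.manual_seed(0)

# A single Linear layer that will be reused several times (hypothetical sizes).
shared = torch.nn.Linear(4, 4)
x = torch.randn(2, 4)


def apply_n_times(inp, n_repeats):
    """Apply the same layer n_repeats times, with a ReLU after each use."""
    h = inp
    for _ in range(n_repeats):
        h = shared(h).clamp(min=0)
    return h


# The layer owns one weight matrix and one bias vector, however often it is applied.
n_param_tensors = len(list(shared.parameters()))
n_params = sum(p.numel() for p in shared.parameters())  # 4*4 weights + 4 biases

# Each call builds a fresh graph whose depth is decided at run time.
out = apply_n_times(x, 3)
out.sum().backward()

# Autograd accumulates the contributions of all three uses into one gradient tensor.
grad_shape = tuple(shared.weight.grad.shape)
print(n_param_tensors, n_params, tuple(out.shape), grad_shape)
```

Because every use of `shared` refers to the same parameters, the backward pass sums the gradient contributions from each application into a single `.grad` tensor, which is what lets an optimizer update the shared weights in one step.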

We can easily implement this model as a Module subclass:

In [15]:
import random
import torch
from torch.autograd import Variable

class DynamicNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        """
        In the constructor we construct three nn.Linear instances that we will use
        in the forward pass.
        """
        super(DynamicNet, self).__init__()
        self.input_linear = torch.nn.Linear(D_in, H)
        self.middle_linear = torch.nn.Linear(H, H)
        self.output_linear = torch.nn.Linear(H, D_out)
        
    def forward(self, x):
        """
        For the forward pass of the model, we randomly choose either 0, 1, 2, or 3
        and reuse the middle_linear Module that many times to compute hidden layer
        representations.

        Since each forward pass builds a dynamic computation graph, we can use normal
        Python control-flow operators like loops or conditional statements when
        defining the forward pass of the model.

        Here we also see that it is perfectly safe to reuse the same Module many
        times when defining a computational graph. This is a big improvement over Lua
        Torch, where each Module could be used only once.
        """
        h_relu = self.input_linear(x).clamp(min=0)
        for _ in range(random.randint(0, 3)):
            h_relu = self.middle_linear(h_relu).clamp(min=0)
        y_pred = self.output_linear(h_relu)
        return y_pred
    
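# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs, and wrap them in Variables.
x = Variable(torch.randn(N, D_in))
y = Variable(torch.randn(N, D_out), requires_grad=False)

# Construct the model by instantiating the class defined above.
model = DynamicNet(D_in, H, D_out)

# Construct the loss function (summed squared error) and an SGD optimizer
# with momentum. reduction='sum' replaces the deprecated size_average=False.
criterion = torch.nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)
for t in range(500):
    # Forward pass: compute predicted y by passing x to the model.
    y_pred = model(x)

    # Compute and print the loss; item() extracts the Python scalar.
    loss = criterion(y_pred, y)
    print(t, loss.item())

    # Zero the gradients, run the backward pass, and update the weights.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()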

0 667.53076171875
1 693.6033325195312
2 669.2315673828125
3 668.64599609375
4 663.107421875
5 647.8817749023438
6 510.8783874511719
7 620.1029052734375
8 649.1666259765625
9 591.6267700195312
10 382.6544189453125
11 346.6311340332031
12 299.63665771484375
13 648.8026733398438
14 647.310791015625
15 645.1423950195312
16 624.7391967773438
17 499.853515625
18 609.0315551757812
19 457.6951599121094
20 108.71575164794922
21 615.4326782226562
22 372.5513916015625
23 338.0988464355469
24 505.787353515625
25 134.68310546875
26 444.5380859375
27 503.76116943359375
28 368.18707275390625
29 158.55819702148438
30 302.3117370605469
31 179.04550170898438
32 166.1812744140625
33 105.02025604248047
34 292.51434326171875
35 113.05707550048828
36 189.5210418701172
37 61.65806579589844
38 79.3973617553711
39 145.32965087890625
40 65.57173919677734
41 60.19955062866211
42 69.08759307861328
43 62.31060791015625
44 182.10792541503906
45 57.90001678466797
46 223.2615203857422
47 149.097412109375
48 259.89468

403 0.8975740075111389
404 0.22024816274642944
405 2.15879225730896
406 2.2116363048553467
407 0.365464448928833
408 1.5533955097198486
409 4.983763694763184
410 0.5027531981468201
411 0.6437060236930847
412 2.043757438659668
413 2.0610036849975586
414 0.49179866909980774
415 0.3606640696525574
416 0.35279780626296997
417 0.4952828288078308
418 2.033763885498047
419 0.9162747859954834
420 1.2769445180892944
421 0.24047304689884186
422 0.5683113932609558
423 1.025093674659729
424 1.0027989149093628
425 1.8858778476715088
426 1.8615309000015259
427 1.5285418033599854
428 2.9590766429901123
429 2.593839645385742
430 0.5085890293121338
431 0.3837805688381195
432 0.10782626271247864
433 0.16320732235908508
434 10.748854637145996
435 0.403991162776947
436 0.2677985727787018
437 0.6376399397850037
438 6.311458587646484
439 27.946340560913086
440 0.5371912121772766
441 3.0092709064483643
442 5.5654826164245605
443 5.929560661315918
444 16.61552619934082
445 10.514671325683594
446 0.87324124574