## numpy 

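A fully connected ReLU network with one hidden layer and no bias, trained to predict y from x using an L2 loss.

This implementation uses only numpy to compute the forward pass, the loss, and the backward pass.

A numpy ndarray is a plain n-dimensional array. It knows nothing about deep learning, gradients, or computation graphs; it is simply a data structure for numerical computation.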
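For reference, the forward pass and the hand-derived gradients used in this section can be written out as below. The symbols $X$, $Y$, $W_1$, $W_2$, $a$ and $\hat{Y}$ are notation introduced here for illustration; they correspond to `x`, `y`, `w1`, `w2`, `h_relu` and `y_pred` in the code, and $\mathbb{1}[h \ge 0]$ mirrors the code's choice of zeroing the gradient only where `h < 0`.

$$
\begin{aligned}
h &= X W_1, \qquad a = \max(h, 0), \qquad \hat{Y} = a W_2, \qquad L = \sum_{i,j} \bigl(\hat{Y}_{ij} - Y_{ij}\bigr)^2 \\
\frac{\partial L}{\partial \hat{Y}} &= 2\,(\hat{Y} - Y) \\
\frac{\partial L}{\partial W_2} &= a^{\top} \frac{\partial L}{\partial \hat{Y}} \\
\frac{\partial L}{\partial a} &= \frac{\partial L}{\partial \hat{Y}}\, W_2^{\top} \\
\frac{\partial L}{\partial h} &= \frac{\partial L}{\partial a} \odot \mathbb{1}[h \ge 0] \\
\frac{\partial L}{\partial W_1} &= X^{\top} \frac{\partial L}{\partial h}
\end{aligned}
$$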

In [4]:
import numpy as np

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random input and output data
x = np.random.randn(N, D_in)
y = np.random.randn(N, D_out)

# Randomly initialize weights
w1 = np.random.randn(D_in, H)
w2 = np.random.randn(H, D_out)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)
    y_pred = h_relu.dot(w2)

    # Compute and print loss
    loss = np.square(y_pred - y).sum()
    print(t, loss)

    # Backprop to compute gradients of the loss with respect to w1 and w2

    # loss = sum((y_pred - y) ** 2), so d(loss)/d(y_pred) = 2 * (y_pred - y)
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h_relu = grad_y_pred.dot(w2.T)
    grad_h = grad_h_relu.copy()
    grad_h[h < 0] = 0
    grad_w1 = x.T.dot(grad_h)

    # Update weights
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2

0 41824678.180057615
1 50080555.165254444
2 65916370.65069209
3 69330630.68758233
4 45381567.8242583
5 16920231.329380296
6 4995331.875842152
7 2213096.804905979
8 1515680.9321868145
9 1209080.6339608391
10 1003956.1205179498
11 846383.8630323744
12 720304.3745333939
13 617716.1342175906
14 533254.773459375
15 463093.58348851086
16 404281.8736900108
17 354635.0556294072
18 312445.60870459466
19 276401.6591656436
20 245439.46476171853
21 218681.05903497135
22 195471.03999872156
23 175237.37575172898
24 157510.42454636272
25 141945.84265121858
26 128218.24943510201
27 116092.241695102
28 105348.943766234
29 95791.87354169408
30 87262.5699072098
31 79629.08779848268
32 72780.79711496545
33 66625.26573931388
34 61083.17311177148
35 56081.11385465953
36 51556.13965650405
37 47456.900896106315
38 43734.988041808334
39 40351.77243741012
40 37274.12221230324
41 34468.10109771392
42 31905.485548703822
43 29561.471209696916
44 27414.77271818041
45 25445.92975076601
46 23638.1108955191
47 21977.9

424 0.0006655832277527704
425 0.0006392034364943969
426 0.0006138803408616171
427 0.0005895706761733528
428 0.0005662512136522886
429 0.0005438475699711647
430 0.0005223316169627529
431 0.0005016688422855922
432 0.0004818306650896207
433 0.0004627882031036567
434 0.0004445127597467507
435 0.00042695224976643873
436 0.00041009212333281334
437 0.0003939037924562766
438 0.00037836293064817054
439 0.000363434512048646
440 0.0003491087817568328
441 0.00033534557738847066
442 0.0003221322631312497
443 0.0003094428747482882
444 0.0002972550367129028
445 0.0002855537251813408
446 0.00027433528308443177
447 0.00026354277625648733
448 0.0002531795569441203
449 0.0002432300209278269
450 0.00023367491560404645
451 0.00022449369038224346
452 0.00021567873962274287
453 0.00020721334502302337
454 0.0001990826865648606
455 0.00019127110127572144
456 0.00018376889633159413
457 0.00017656388895430228
458 0.0001696489277930274
459 0.0001630025644605853
460 0.00015661791997065224
461 0.0001504854275234768
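
One way to gain confidence in hand-written backpropagation like the loop above is to compare it against finite differences. The following is a minimal sketch under stated assumptions: the helper names `loss_fn`, `analytic_grads` and `numeric_grad` are hypothetical, and the layer sizes are shrunk so the element-by-element check runs quickly.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D_in, H, D_out = 4, 5, 3, 2  # small sizes (an assumption) keep the check fast
x = rng.standard_normal((N, D_in))
y = rng.standard_normal((N, D_out))
w1 = rng.standard_normal((D_in, H))
w2 = rng.standard_normal((H, D_out))


def loss_fn(w1, w2):
    """Same forward pass and L2 loss as the training loop."""
    h_relu = np.maximum(x.dot(w1), 0)
    return np.square(h_relu.dot(w2) - y).sum()


def analytic_grads(w1, w2):
    """Same hand-derived backward pass as the training loop."""
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)
    grad_y_pred = 2.0 * (h_relu.dot(w2) - y)
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h = grad_y_pred.dot(w2.T)
    grad_h[h < 0] = 0
    return x.T.dot(grad_h), grad_w2


def numeric_grad(f, w, eps=1e-6):
    """Central-difference estimate of d f() / d w, perturbing w in place."""
    grad = np.zeros_like(w)
    for idx in np.ndindex(*w.shape):
        old = w[idx]
        w[idx] = old + eps
        f_plus = f()
        w[idx] = old - eps
        f_minus = f()
        w[idx] = old  # restore the original value
        grad[idx] = (f_plus - f_minus) / (2 * eps)
    return grad


grad_w1, grad_w2 = analytic_grads(w1, w2)
num_w1 = numeric_grad(lambda: loss_fn(w1, w2), w1)
num_w2 = numeric_grad(lambda: loss_fn(w1, w2), w2)
max_err_w1 = np.abs(grad_w1 - num_w1).max()
max_err_w2 = np.abs(grad_w2 - num_w2).max()
print(max_err_w1, max_err_w2)
```

Both maximum errors should be tiny compared with the gradient magnitudes. The check can be unreliable if some hidden pre-activation happens to lie within `eps` of zero, because ReLU is not differentiable there.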

PyTorch: Tensors
----------------

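This time we use PyTorch Tensors to compute the forward pass, the loss, and the backward pass.

A PyTorch Tensor is much like a numpy ndarray. The biggest difference is that a PyTorch Tensor can run on either the CPU or the GPU. To run on the GPU, the Tensor has to be placed on a CUDA device.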
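
As a small illustration of the CPU/GPU point, the sketch below picks a device at runtime and moves data between numpy and PyTorch. Falling back to the CPU when CUDA is unavailable is an assumption made here so that the snippet runs anywhere; the variable names are illustrative only.

```python
import numpy as np
import torch

# Use the first GPU if one is visible, otherwise stay on the CPU.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

a = np.arange(6, dtype=np.float32).reshape(2, 3)
t = torch.from_numpy(a)      # CPU Tensor that shares memory with `a`
t_dev = t.to(device)         # copied to the GPU if there is one; unchanged on CPU
result = (t_dev * 2).sum()   # the arithmetic runs on `device`
back = t_dev.cpu().numpy()   # bring the data back as an ndarray

print(t_dev.device, result.item())
```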


In [6]:
import torch


dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0") # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random input and output data
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# Randomly initialize weights
w1 = torch.randn(D_in, H, device=device, dtype=dtype)
w2 = torch.randn(H, D_out, device=device, dtype=dtype)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y
    h = x.mm(w1)
    h_relu = h.clamp(min=0)
    y_pred = h_relu.mm(w2)

    # Compute and print loss
    loss = (y_pred - y).pow(2).sum().item()
    print(t, loss)

    # Backprop to compute gradients of the loss with respect to w1 and w2
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.t().mm(grad_y_pred)
    grad_h_relu = grad_y_pred.mm(w2.t())
    grad_h = grad_h_relu.clone()
    grad_h[h < 0] = 0
    grad_w1 = x.t().mm(grad_h)

    # Update weights using gradient descent
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2

0 26446648.0
1 19861944.0
2 18734896.0
3 19931198.0
4 21489918.0
5 21455394.0
6 18874904.0
7 14228652.0
8 9388967.0
9 5642670.0
10 3293204.25
11 1961592.25
12 1241706.375
13 847210.25
14 621993.9375
15 484221.71875
16 393320.59375
17 328768.15625
18 280064.4375
19 241669.953125
20 210438.03125
21 184461.625
22 162468.859375
23 143655.265625
24 127441.5390625
25 113374.96875
26 101123.390625
27 90403.890625
28 80992.0859375
29 72718.1328125
30 65425.92578125
31 58968.4921875
32 53237.0
33 48141.81640625
34 43609.2734375
35 39561.58984375
36 35944.3203125
37 32699.51953125
38 29783.29296875
39 27159.458984375
40 24794.3828125
41 22658.40234375
42 20727.3125
43 18979.55078125
44 17394.78515625
45 15957.5390625
46 14651.142578125
47 13462.0166015625
48 12378.98046875
49 11392.177734375
50 10491.0419921875
51 9667.728515625
52 8915.1220703125
53 8226.29296875
54 7595.3837890625
55 7016.4375
56 6485.29052734375
57 5997.21630859375
58 5548.943359375
59 5136.94677734375
60 4757.81884765625
61 

In [7]:
# Create tensors.
x = torch.tensor(1., requires_grad=True)
w = torch.tensor(2., requires_grad=True)
b = torch.tensor(3., requires_grad=True)

# Build a computational graph.
y = w * x + b    # y = 2 * x + 3

# Compute gradients.
y.backward()

# Print out the gradients.
print(x.grad)    # x.grad = 2 
print(w.grad)    # w.grad = 1 
print(b.grad)    # b.grad = 1 

tensor(2.)
tensor(1.)
tensor(1.)



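PyTorch: Tensors and autograd
-----------------------------
One important feature of PyTorch is autograd: once the forward pass is defined and the loss is computed, PyTorch can automatically compute the gradients of the loss with respect to all model parameters.

A PyTorch Tensor represents a node in the computation graph. If x is a Tensor with x.requires_grad=True, then x.grad is another Tensor holding the gradient of some scalar (often the loss) with respect to x.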
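
The relationship between `requires_grad` and `.grad` can be seen on a tiny example. This is a minimal sketch with made-up values: `c` plays the role of constant data and `x` the role of a parameter.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
c = torch.tensor([10.0, 20.0, 30.0])  # requires_grad defaults to False

print(x.grad)         # None: nothing has been backpropagated yet

loss = (x * c).sum()  # scalar: 10*x0 + 20*x1 + 30*x2
loss.backward()

print(x.grad)         # d(loss)/dx, which equals c
print(c.grad)         # None: no gradient is tracked for c
```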

In [8]:
import torch

dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0") # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

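# Create random Tensors to hold inputs and outputs.
# requires_grad defaults to False, meaning we do not need gradients
# with respect to these Tensors during the backward pass.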
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# Create random Tensors for the weights.
# Setting requires_grad=True indicates that we want gradients with respect
# to these Tensors during the backward pass.
w1 = torch.randn(D_in, H, device=device, dtype=dtype, requires_grad=True)
w2 = torch.randn(H, D_out, device=device, dtype=dtype, requires_grad=True)

learning_rate = 1e-6
for t in range(500):
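    # Forward pass: compute predicted y using operations on Tensors. This is the
    # same as an ordinary forward pass, but we do not need to keep the intermediate
    # values because we are not implementing the backward pass by hand.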
    y_pred = x.mm(w1).clamp(min=0).mm(w2)

    # Compute the loss from the forward pass.
    # loss is a zero-dimensional (scalar) Tensor;
    # loss.item() returns its value as a Python number.
    loss = (y_pred - y).pow(2).sum()
    print(t, loss.item())

    # Use autograd to compute the backward pass. This call computes the gradient
    # of the loss with respect to every Tensor with requires_grad=True. Afterwards,
    # w1.grad and w2.grad hold the gradient of the loss with respect to w1 and w2.
    loss.backward()

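    # Manually update the weights using gradient descent (an automatic approach is
    # introduced later). Wrap the update in torch.no_grad() because w1 and w2 have
    # requires_grad=True, but we do not want autograd to track the update itself.
    # An alternative is to operate on weight.data and weight.grad.data.
    # tensor.data gives a tensor that shares storage with the original tensor
    # but does not record history in the computation graph.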
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad

        # Manually zero the gradients after updating weights
        w1.grad.zero_()
        w2.grad.zero_()

0 20808304.0
1 14313663.0
2 11042082.0
3 9183454.0
4 7955617.0
5 7011514.5
6 6159816.5
7 5350482.5
8 4555018.5
9 3810271.5
10 3126546.5
11 2533301.0
12 2028739.5
13 1617361.5
14 1285091.75
15 1023324.875
16 817444.25
17 657712.625
18 533054.875
19 435979.3125
20 359864.75
21 300099.875
22 252680.1875
23 214766.1875
24 184146.25
25 159216.375
26 138676.890625
27 121605.953125
28 107281.046875
29 95153.5625
30 84800.3984375
31 75911.484375
32 68223.234375
33 61518.625
34 55635.171875
35 50450.359375
36 45865.3828125
37 41788.1875
38 38147.984375
39 34885.953125
40 31954.9375
41 29316.7421875
42 26935.484375
43 24778.625
44 22821.13671875
45 21040.853515625
46 19419.6484375
47 17940.693359375
48 16590.478515625
49 15354.8759765625
50 14222.794921875
51 13185.5185546875
52 12233.6416015625
53 11360.46484375
54 10556.712890625
55 9816.529296875
56 9134.4423828125
57 8504.970703125
58 7923.921875
59 7386.6611328125
60 6889.6455078125
61 6429.5029296875
62 6003.1962890625
63 5608.068359375
64

417 0.0005163642344996333
418 0.0005017853109166026
419 0.00048741925274953246
420 0.00047241844004020095
421 0.00045919790863990784
422 0.00044668017653748393
423 0.00043374812230467796
424 0.00042258016765117645
425 0.00041112996404990554
426 0.0003996024024672806
427 0.0003886795602738857
428 0.0003787778550758958
429 0.00036835274659097195
430 0.00035946554271504283
431 0.0003497603756841272
432 0.00034068754757754505
433 0.0003330507024656981
434 0.00032387676765210927
435 0.0003150595584884286
436 0.0003073743428103626
437 0.0003001578734256327
438 0.00029290010570548475
439 0.0002858215884771198
440 0.00027906629838980734
441 0.0002718333271332085
442 0.00026490658638067544
443 0.0002575732651166618
444 0.00025167863350361586
445 0.00024610848049633205
446 0.0002405318373348564
447 0.00023424744722433388
448 0.000228756049182266
449 0.00022344484750647098
450 0.0002183444012189284
451 0.0002133996458724141
452 0.00020882359240204096
453 0.00020419270731508732
454 0.0001999646192
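
Two details of the loop above are easy to miss: gradients accumulate across `backward()` calls, and in-place updates of a parameter must happen inside `torch.no_grad()`. The sketch below demonstrates both on a two-element Tensor with made-up values.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

# 1) Gradients accumulate across backward() calls unless they are zeroed.
(w ** 2).sum().backward()
first = w.grad.clone()        # 2 * w = [2., 4.]
(w ** 2).sum().backward()
accumulated = w.grad.clone()  # [4., 8.]: the second gradient was added
w.grad.zero_()

# 2) Outside no_grad(), an in-place update of a leaf Tensor that
#    requires grad is rejected by autograd.
try:
    w -= 0.1 * torch.ones(2)
    raised = False
except RuntimeError:
    raised = True

# 3) Inside no_grad() the same update is allowed and is not recorded.
with torch.no_grad():
    w -= 0.1 * torch.ones(2)

print(first, accumulated, raised, w)
```

This is why the training loop calls `w1.grad.zero_()` and `w2.grad.zero_()` after every update: without them, each step would use the sum of all gradients computed so far.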

PyTorch: nn
-----------

This time we use PyTorch's nn package to build the network.
autograd still builds the computation graph,
and PyTorch computes the gradients for us automatically.

In [9]:
import torch

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Use the nn package to define our model as a sequence of layers. nn.Sequential
# is a Module which contains other Modules, and applies them in sequence to
# produce its output. Each Linear Module computes output from input using a
# linear function, and holds internal Tensors for its weight and bias.
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)

# The nn package also contains definitions of popular loss functions; in this
# case we use the squared-error loss. Note that with reduction='sum', MSELoss
# returns the sum of squared errors rather than their mean.
loss_fn = torch.nn.MSELoss(reduction='sum')

learning_rate = 1e-4
for t in range(500):
    # Forward pass: compute predicted y by passing x to the model. Module objects
    # override the __call__ operator so you can call them like functions. When
    # doing so you pass a Tensor of input data to the Module and it produces
    # a Tensor of output data.
    y_pred = model(x)

    # Compute and print loss. We pass Tensors containing the predicted and true
    # values of y, and the loss function returns a Tensor containing the
    # loss.
    loss = loss_fn(y_pred, y)
    print(t, loss.item())

    # Zero the gradients before running the backward pass.
    model.zero_grad()

    # Backward pass: compute gradient of the loss with respect to all the learnable
    # parameters of the model. Internally, the parameters of each Module are stored
    # in Tensors with requires_grad=True, so this call will compute gradients for
    # all learnable parameters in the model.
    loss.backward()

    # Update the weights using gradient descent. Each parameter is a Tensor, so
    # we can access its gradients like we did before.
    with torch.no_grad():
        for param in model.parameters():
            param -= learning_rate * param.grad

0 689.4557495117188
1 637.8826293945312
2 593.7218017578125
3 555.1991577148438
4 521.2799682617188
5 490.8536376953125
6 463.44744873046875
7 438.5511474609375
8 415.5782165527344
9 394.2338562011719
10 374.1195373535156
11 355.1658020019531
12 337.226318359375
13 320.21173095703125
14 304.13885498046875
15 288.8180847167969
16 274.1483154296875
17 260.0936279296875
18 246.60142517089844
19 233.71185302734375
20 221.40745544433594
21 209.65528869628906
22 198.45458984375
23 187.7453155517578
24 177.51092529296875
25 167.751220703125
26 158.42572021484375
27 149.5524139404297
28 141.10110473632812
29 133.0416717529297
30 125.38948822021484
31 118.11051177978516
32 111.19634246826172
33 104.6466293334961
34 98.44660186767578
35 92.57392120361328
36 87.02213287353516
37 81.79095458984375
38 76.86685180664062
39 72.2244644165039
40 67.8368148803711
41 63.71094512939453
42 59.83037567138672
43 56.19355392456055
44 52.7797966003418
45 49.585174560546875
46 46.584041595458984
47 43.772220611

487 2.1310577267286135e-06
488 2.064929503831081e-06
489 2.0002826204290614e-06
490 1.9375383999431506e-06
491 1.877355884971621e-06
492 1.818264763642219e-06
493 1.761982389325567e-06
494 1.7071806723834015e-06
495 1.6541096101718722e-06
496 1.6023641364881769e-06
497 1.5523978618148249e-06
498 1.5038763194752391e-06
499 1.4573047337762546e-06
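
It can help to look at what the `nn.Sequential` model above actually stores. The sketch below lists the parameter names and shapes. Note that `nn.Linear` includes a bias term by default, unlike the bias-free network in the earlier sections; the `no_bias` variant, which passes `bias=False`, is an illustration added here and not part of the original code.

```python
import torch

D_in, H, D_out = 1000, 100, 10

model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)

# Names are "<index in Sequential>.<parameter name>"; ReLU has no parameters.
shapes = {name: tuple(p.shape) for name, p in model.named_parameters()}
print(shapes)

# A variant matching the earlier bias-free network.
no_bias = torch.nn.Sequential(
    torch.nn.Linear(D_in, H, bias=False),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out, bias=False),
)
n_params = sum(p.numel() for p in no_bias.parameters())
print(n_params)
```

`nn.Linear` stores its weight with shape `(out_features, in_features)`, the transpose of the `w1` and `w2` layout used in the hand-written versions.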



PyTorch: optim
--------------

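This time we no longer update the model's weights by hand; instead we use the optim package to update the parameters.
The optim package provides a variety of optimization algorithms, including SGD with momentum, RMSProp, Adam, and others.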
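
Because every optimizer in `torch.optim` exposes the same `zero_grad()` / `step()` interface, switching algorithms only changes the constructor call. The sketch below is illustrative: the `make_optimizer` helper, the tiny linear model, the learning rate and the momentum value are all assumptions chosen to keep it short.

```python
import torch


def make_optimizer(name, params, lr):
    """Hypothetical helper mapping a name to a torch.optim constructor."""
    if name == "sgd_momentum":
        return torch.optim.SGD(params, lr=lr, momentum=0.9)
    if name == "rmsprop":
        return torch.optim.RMSprop(params, lr=lr)
    if name == "adam":
        return torch.optim.Adam(params, lr=lr)
    raise ValueError(f"unknown optimizer: {name}")


torch.manual_seed(0)
x = torch.randn(8, 4)
y = torch.randn(8, 2)
loss_fn = torch.nn.MSELoss(reduction='sum')

losses = {}
for name in ("sgd_momentum", "rmsprop", "adam"):
    model = torch.nn.Linear(4, 2)
    optimizer = make_optimizer(name, model.parameters(), lr=1e-2)
    history = []
    for _ in range(50):
        loss = loss_fn(model(x), y)
        history.append(loss.item())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    losses[name] = (history[0], history[-1])

for name, (start, end) in losses.items():
    print(name, start, end)
```

The training loop is identical for all three optimizers; only the object returned by `make_optimizer` differs.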


In [11]:
import torch

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Use the nn package to define our model and loss function.
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)
loss_fn = torch.nn.MSELoss(reduction='sum')

# Use the optim package to define an Optimizer that will update the weights of
# the model for us. Here we will use Adam; the optim package contains many other
# optimization algorithms. The first argument to the Adam constructor tells the
# optimizer which Tensors it should update.
learning_rate = 1e-4
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
for t in range(500):
    # Forward pass: compute predicted y by passing x to the model.
    y_pred = model(x)

    # Compute and print loss.
    loss = loss_fn(y_pred, y)
    print(t, loss.item())

    # Before the backward pass, use the optimizer object to zero all of the
    # gradients for the variables it will update (which are the learnable
    # weights of the model). This is because, by default, gradients are
    # accumulated in buffers (i.e., not overwritten) whenever .backward()
    # is called. Check out the docs of torch.autograd.backward for more details.
    optimizer.zero_grad()

    # Backward pass: compute gradient of the loss with respect to model
    # parameters
    loss.backward()

    # Calling the step function on an Optimizer makes an update to its
    # parameters
    optimizer.step()

0 636.9827880859375
1 620.2548217773438
2 604.031005859375
3 588.264404296875
4 572.9768676757812
5 558.1908569335938
6 543.9096069335938
7 530.05078125
8 516.5758056640625
9 503.5146484375
10 490.8492736816406
11 478.55523681640625
12 466.7230224609375
13 455.3133239746094
14 444.36090087890625
15 433.66204833984375
16 423.1812438964844
17 412.95965576171875
18 403.0374755859375
19 393.39825439453125
20 384.032958984375
21 374.89361572265625
22 366.0033874511719
23 357.31964111328125
24 348.81951904296875
25 340.5503845214844
26 332.5168762207031
27 324.6792297363281
28 316.96942138671875
29 309.3975830078125
30 301.99072265625
31 294.7376708984375
32 287.6647644042969
33 280.73028564453125
34 273.9713134765625
35 267.3499450683594
36 260.8721923828125
37 254.52682495117188
38 248.314208984375
39 242.2535858154297
40 236.33584594726562
41 230.53363037109375
42 224.84710693359375
43 219.2938690185547
44 213.84780883789062
45 208.50843811035156
46 203.26853942871094
47 198.1171875
48 19

399 6.4893747548921965e-06
400 6.036319518898381e-06
401 5.612848326563835e-06
402 5.218467777012847e-06
403 4.850426194025204e-06
404 4.509055997914402e-06
405 4.188652837910922e-06
406 3.891766937158536e-06
407 3.615047717175912e-06
408 3.3575488487258554e-06
409 3.116485459031537e-06
410 2.8940330594195984e-06
411 2.6860300295084016e-06
412 2.4919590941863135e-06
413 2.3120499008655315e-06
414 2.14484771277057e-06
415 1.9894637262041215e-06
416 1.844092025748978e-06
417 1.710112655928242e-06
418 1.585188783792546e-06
419 1.469030394218862e-06
420 1.3610157338916906e-06
421 1.260641511180438e-06
422 1.1678645250867703e-06
423 1.0810950925588259e-06
424 1.0008667459260323e-06
425 9.264373943551618e-07
426 8.577043217883329e-07
427 7.932555377010431e-07
428 7.335656277973612e-07
429 6.782200898669544e-07
430 6.276450221776031e-07
431 5.800283133794437e-07
432 5.359633519219642e-07
433 4.954490009367873e-07
434 4.575281593588443e-07
435 4.2283707557544403e-07
436 3.9029555409797467e-07



PyTorch: Custom nn Modules
--------------------------

We can define a model as a class that inherits from nn.Module. When a model more complex than a simple Sequential is needed, define it as a custom nn.Module subclass.
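
As an example of something a plain `Sequential` cannot express directly, the sketch below adds skip connections around a configurable number of hidden layers. The `ResidualNet` class and its `num_blocks` argument are hypothetical and shown only to illustrate why a custom `forward` is useful.

```python
import torch


class ResidualNet(torch.nn.Module):
    """Hypothetical model: hidden layers wrapped in skip connections."""

    def __init__(self, D_in, H, D_out, num_blocks=2):
        super().__init__()
        self.input_layer = torch.nn.Linear(D_in, H)
        # ModuleList registers each layer so that parameters() can find it.
        self.blocks = torch.nn.ModuleList(
            [torch.nn.Linear(H, H) for _ in range(num_blocks)]
        )
        self.output_layer = torch.nn.Linear(H, D_out)

    def forward(self, x):
        h = self.input_layer(x).clamp(min=0)
        for block in self.blocks:
            # Skip connection: add the block's ReLU output to its input.
            h = h + block(h).clamp(min=0)
        return self.output_layer(h)


model = ResidualNet(D_in=1000, H=100, D_out=10)
y_pred = model(torch.randn(64, 1000))
print(y_pred.shape)  # torch.Size([64, 10])
```

Arbitrary Python control flow such as the `for` loop is allowed inside `forward`; autograd records whichever operations actually run.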


In [12]:
import torch


class TwoLayerNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        """
        In the constructor we instantiate two nn.Linear modules and assign them as
        member variables.
        """
        super().__init__()
        self.linear1 = torch.nn.Linear(D_in, H)
        self.linear2 = torch.nn.Linear(H, D_out)

    def forward(self, x):
        """
        In the forward function we accept a Tensor of input data and we must return
        a Tensor of output data. We can use Modules defined in the constructor as
        well as arbitrary operators on Tensors.
        """
        h_relu = self.linear1(x).clamp(min=0)
        y_pred = self.linear2(h_relu)
        return y_pred


# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Construct our model by instantiating the class defined above
model = TwoLayerNet(D_in, H, D_out)

# Construct our loss function and an Optimizer. The call to model.parameters()
# in the SGD constructor will contain the learnable parameters of the two
# nn.Linear modules which are members of the model.
criterion = torch.nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)
for t in range(500):
    # Forward pass: Compute predicted y by passing x to the model
    y_pred = model(x)

    # Compute and print loss
    loss = criterion(y_pred, y)
    print(t, loss.item())

    # Zero gradients, perform a backward pass, and update the weights.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

0 704.116943359375
1 652.5560302734375
2 607.7492065429688
3 568.6614379882812
4 533.970458984375
5 502.5086669921875
6 473.9654541015625
7 447.58343505859375
8 423.02984619140625
9 400.00506591796875
10 378.3228454589844
11 357.8038330078125
12 338.4827880859375
13 320.24359130859375
14 302.8949279785156
15 286.5151672363281
16 271.01416015625
17 256.29296875
18 242.2705535888672
19 228.97930908203125
20 216.34658813476562
21 204.3105010986328
22 192.8864288330078
23 182.05313110351562
24 171.74844360351562
25 161.9464569091797
26 152.65228271484375
27 143.8447265625
28 135.49073791503906
29 127.59117889404297
30 120.15321350097656
31 113.1252670288086
32 106.4934310913086
33 100.22747802734375
34 94.32118225097656
35 88.76624298095703
36 83.53791809082031
37 78.58827209472656
38 73.91858673095703
39 69.52561950683594
40 65.38511657714844
41 61.501373291015625
42 57.85792541503906
43 54.43056106567383
44 51.213783264160156
45 48.19186782836914
46 45.35361862182617
47 42.69595336914062

387 0.00016100509674288332
388 0.0001567270519444719
389 0.0001525663392385468
390 0.00014852205640636384
391 0.0001445824163965881
392 0.00014075257058721036
393 0.00013702985597774386
394 0.00013339721772354096
395 0.00012987741502001882
396 0.0001264395978068933
397 0.0001231060887221247
398 0.00011985387391177937
399 0.00011669175728457049
400 0.00011361585347913206
401 0.00011062498379033059
402 0.00010771049710456282
403 0.00010487536201253533
404 0.00010211869812337682
405 9.94319052551873e-05
406 9.682260133558884e-05
407 9.428044722881168e-05
408 9.180724009638652e-05
409 8.940151747083291e-05
410 8.705927029950544e-05
411 8.477944356855005e-05
412 8.256134606199339e-05
413 8.039979729801416e-05
414 7.829782407497987e-05
415 7.625006401212886e-05
416 7.425479998346418e-05
417 7.232096686493605e-05
418 7.043348887236789e-05
419 6.859648419776931e-05
420 6.680794467683882e-05
421 6.50659785605967e-05
422 6.337249942589551e-05
423 6.172524444991723e-05
424 6.012155063217506e-05
4