<a href="https://colab.research.google.com/github/dkanzariya/Workshop/blob/main/pytorch_tutorial_workshop.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# PyTorch tutorial

At its core, PyTorch provides two main features:

*   An n-dimensional Tensor, similar to a numpy array but able to run on GPUs
*   Automatic differentiation for building and training neural networks

We will use a fully-connected ReLU network as our running example. The network will have a single hidden layer, and will be trained with gradient descent to fit random data by minimizing the squared Euclidean distance between the network output and the true output.

# Warm-up: numpy

Before introducing PyTorch, we will first implement the network using numpy.

Numpy provides an n-dimensional array object, and many functions for manipulating these arrays. Numpy is a generic framework for scientific computing; it does not know anything about computation graphs, or deep learning, or gradients. However, we can easily use numpy to fit a two-layer network to random data by manually implementing the forward and backward passes through the network using numpy operations:

In [None]:
import time
import numpy as np

start_time = time.time()
# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random input and output data
x = np.random.randn(N, D_in)
y = np.random.randn(N, D_out)

# Randomly initialize weights
w1 = np.random.randn(D_in, H)
w2 = np.random.randn(H, D_out)

learning_rate = 1e-6
for t in range(500):
  # Forward pass: compute predicted y
  h = x.dot(w1)
  h_relu = np.maximum(h, 0)
  y_pred = h_relu.dot(w2)
  
  # Compute and print loss
  loss = np.square(y_pred - y).sum()
  print(t, loss)
  
  # Backprop to compute gradients of loss with respect to w1 and w2
  grad_y_pred = 2.0 * (y_pred - y)
  grad_w2 = h_relu.T.dot(grad_y_pred)
  grad_h_relu = grad_y_pred.dot(w2.T)
  grad_h = grad_h_relu.copy()
  grad_h[h < 0] = 0
  grad_w1 = x.T.dot(grad_h)
 
  # Update weights
  w1 -= learning_rate * grad_w1
  w2 -= learning_rate * grad_w2
print("--- %s seconds ---" % (time.time() - start_time))

0 31411577.08342529
1 29275057.27267325
2 31563823.06463454
3 32863356.40462929
4 29214625.432808645
5 20729323.730420567
6 11972383.298462013
7 6154947.799985526
8 3229874.7229280886
9 1906970.815440505
10 1297971.448413244
11 984857.384221008
12 797523.3870170718
13 669126.0752350641
14 572262.224505333
15 494996.9021366163
16 431418.6381409643
17 378185.22465027426
18 333110.323539712
19 294719.0127310677
20 261827.36357730106
21 233421.19895787328
22 208757.52336702135
23 187253.86990485236
24 168406.30529060343
25 151832.36626733537
26 137231.49450904774
27 124320.96430690758
28 112863.08693699696
29 102665.68188460411
30 93553.62972931447
31 85418.18193875122
32 78126.43959634217
33 71568.18272805396
34 65659.93092963526
35 60323.62444490589
36 55494.324150674576
37 51117.88695265792
38 47140.81964087847
39 43529.01025816661
40 40240.01814160924
41 37237.915125247084
42 34494.8678760224
43 31983.53754878195
44 29680.492538030507
45 27567.605307386017
46 25627.221861268066
47 2384
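
The hand-derived backward pass in the numpy example above can be sanity-checked against finite differences. The following is a minimal sketch on a much smaller network than the one used above; the helper names `loss_and_grads` and `numerical_grad` are made up for this illustration and are not part of numpy.

```python
import numpy as np

def loss_and_grads(x, y, w1, w2):
    """Forward pass and manual backward pass for the two-layer ReLU network."""
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)
    y_pred = h_relu.dot(w2)
    loss = np.square(y_pred - y).sum()

    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h = grad_y_pred.dot(w2.T)
    grad_h[h < 0] = 0          # ReLU passes no gradient where its input was negative
    grad_w1 = x.T.dot(grad_h)
    return loss, grad_w1, grad_w2

def numerical_grad(f, w, eps=1e-6):
    """Central-difference estimate of d f() / d w, one entry of w at a time."""
    grad = np.zeros_like(w)
    for idx in np.ndindex(*w.shape):
        old = w[idx]
        w[idx] = old + eps
        f_plus = f()
        w[idx] = old - eps
        f_minus = f()
        w[idx] = old           # restore the original value
        grad[idx] = (f_plus - f_minus) / (2 * eps)
    return grad

# Small, assumed sizes so the entry-by-entry loop stays fast.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 5))
y = rng.standard_normal((4, 2))
w1 = rng.standard_normal((5, 3))
w2 = rng.standard_normal((3, 2))

_, grad_w1, grad_w2 = loss_and_grads(x, y, w1, w2)
num_w1 = numerical_grad(lambda: loss_and_grads(x, y, w1, w2)[0], w1)
num_w2 = numerical_grad(lambda: loss_and_grads(x, y, w1, w2)[0], w2)

max_err_w1 = np.abs(grad_w1 - num_w1).max()
max_err_w2 = np.abs(grad_w2 - num_w2).max()
print(max_err_w1, max_err_w2)
```

If the two errors are tiny compared with the size of the gradients, the manual backward pass agrees with the numerical estimate. The check can be unreliable when a hidden activation sits almost exactly at zero, because ReLU is not differentiable there.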

# PyTorch: Tensors

Numpy is a great framework, but it cannot utilize GPUs to accelerate its numerical computations. For modern deep neural networks, GPUs often provide speedups of 50x or greater, so unfortunately numpy won't be enough for modern deep learning.

Here we introduce the most fundamental PyTorch concept: the Tensor. A PyTorch Tensor is conceptually identical to a numpy array: a Tensor is an n-dimensional array, and PyTorch provides many functions for operating on these Tensors. Any computation you might want to perform with numpy can also be accomplished with PyTorch Tensors; you should think of them as a generic tool for scientific computing.

However, unlike numpy, PyTorch Tensors can utilize GPUs to accelerate their numeric computations. To place a Tensor on a GPU, pass the device argument when constructing it.
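
As a small illustration of device placement, the sketch below picks a GPU when PyTorch can see one and falls back to the CPU otherwise, then moves data between numpy and PyTorch. The variable names are arbitrary, and the fallback logic is a convenience assumed here rather than something the examples in this tutorial rely on.

```python
import numpy as np
import torch

# Use a GPU when one is visible to PyTorch, otherwise fall back to the CPU.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

a_np = np.arange(6, dtype=np.float32).reshape(2, 3)
a = torch.from_numpy(a_np).to(device)   # numpy array -> Tensor on `device`
b = torch.ones(3, 2, device=device)     # constructed directly on `device`

c = a.mm(b)                 # matrix multiply, executed on `device`
c_np = c.cpu().numpy()      # back to numpy; the Tensor must be on the CPU first

print(c.device)
print(c_np)
```

Tensors that take part in the same operation need to live on the same device, which is why both `a` and `b` are placed on `device` before the multiply.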

Here we use PyTorch Tensors to fit a two-layer network to random data. Like the numpy example above, we manually implement the forward and backward passes through the network, using operations on PyTorch Tensors:

In [None]:
import torch
import time

start_time = time.time()
device = torch.device('cpu')
# device = torch.device('cuda') # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random input and output data
x = torch.randn(N, D_in, device=device)
y = torch.randn(N, D_out, device=device)

# Randomly initialize weights
w1 = torch.randn(D_in, H, device=device)
w2 = torch.randn(H, D_out, device=device)

learning_rate = 1e-6
for t in range(500):
  # Forward pass: compute predicted y
  h = x.mm(w1)
  h_relu = h.clamp(min=0)
  y_pred = h_relu.mm(w2)

  # Compute and print loss; loss is a scalar, and is stored in a PyTorch Tensor
  # of shape (); we can get its value as a Python number with loss.item().
  loss = (y_pred - y).pow(2).sum()
  print(t, loss.item())

  # Backprop to compute gradients of loss with respect to w1 and w2
  grad_y_pred = 2.0 * (y_pred - y)
  grad_w2 = h_relu.t().mm(grad_y_pred)
  grad_h_relu = grad_y_pred.mm(w2.t())
  grad_h = grad_h_relu.clone()
  grad_h[h < 0] = 0
  grad_w1 = x.t().mm(grad_h)

  # Update weights using gradient descent
  w1 -= learning_rate * grad_w1
  w2 -= learning_rate * grad_w2
print("--- %s seconds ---" % (time.time() - start_time))

0 34959512.0
1 30176682.0
2 26593174.0
3 21312424.0
4 15043544.0
5 9511083.0
6 5723540.5
7 3487236.0
8 2252192.75
9 1566490.375
10 1165661.125
11 912793.375
12 740213.25
13 614265.5
14 517412.1875
15 440440.1875
16 377820.25
17 326054.0625
18 282809.03125
19 246385.515625
20 215501.1875
21 189147.90625
22 166492.84375
23 146978.828125
24 130100.984375
25 115460.15625
26 102725.390625
27 91613.6484375
28 81902.46875
29 73367.390625
30 65841.828125
31 59196.01953125
32 53313.9453125
33 48100.47265625
34 43470.90234375
35 39348.33984375
36 35667.09765625
37 32374.841796875
38 29424.392578125
39 26775.509765625
40 24397.05078125
41 22254.609375
42 20323.53515625
43 18579.681640625
44 17003.3125
45 15575.6826171875
46 14280.8369140625
47 13105.916015625
48 12038.4140625
49 11067.5830078125
50 10182.5966796875
51 9375.52734375
52 8639.3603515625
53 7966.89013671875
54 7351.91650390625
55 6789.3525390625
56 6273.82421875
57 5800.9970703125
58 5366.87109375
59 4968.2373046875
60 4601.865234375

# PyTorch: Autograd

In the above examples, we had to manually implement both the forward and backward passes of our neural network. Manually implementing the backward pass is not a big deal for a small two-layer network, but can quickly get very hairy for large complex networks.

Thankfully, we can use automatic differentiation to automate the computation of backward passes in neural networks. The autograd package in PyTorch provides exactly this functionality. When using autograd, the forward pass of your network will define a computational graph; nodes in the graph will be Tensors, and edges will be functions that produce output Tensors from input Tensors. Backpropagating through this graph then allows you to easily compute gradients.

This sounds complicated, but it's pretty simple to use in practice. If we want to compute gradients with respect to some Tensor, then we set requires_grad=True when constructing that Tensor. Any PyTorch operations on that Tensor will cause a computational graph to be constructed, allowing us to later perform backpropagation through the graph. If x is a Tensor with requires_grad=True, then after backpropagation x.grad will be another Tensor holding the gradient of some scalar value with respect to x.
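
A minimal sketch of this workflow on a three-element Tensor, with values chosen only for illustration:

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = (x ** 2).sum()     # scalar: y = x0^2 + x1^2 + x2^2
y.backward()           # fills x.grad with dy/dx = 2 * x
print(x.grad)          # tensor([2., 4., 6.])
```

Here `y` is the scalar value and `x.grad` holds its gradient with respect to each entry of `x`.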

Sometimes you may wish to prevent PyTorch from building computational graphs when performing certain operations on Tensors with requires_grad=True; for example we usually don't want to backpropagate through the weight update steps when training a neural network. In such scenarios we can use the torch.no_grad() context manager to prevent the construction of a computational graph.
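
The sketch below illustrates two behaviours that matter for a weight update step: gradients add up across calls to backward() until they are zeroed, and operations inside torch.no_grad() are not recorded in a graph. The Tensor `w` and the step size of 0.1 are arbitrary choices for this example.

```python
import torch

w = torch.tensor([1.0, -2.0], requires_grad=True)

# Gradients accumulate across backward() calls unless they are zeroed.
(w * 3).sum().backward()
first = w.grad.clone()      # d/dw of sum(3 * w) is 3 for each entry
(w * 3).sum().backward()
second = w.grad.clone()     # 3 + 3: the second call added to the first

with torch.no_grad():
    w -= 0.1 * w.grad                  # in-place update, not recorded in a graph
    tracked = (w * 2).requires_grad    # results computed in this block are untracked
    w.grad.zero_()                     # reset before the next backward pass

print(first, second, tracked, w)
```

After the block, `w` still has requires_grad=True, so later forward passes that use it build a graph as usual.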

Here we use PyTorch Tensors and autograd to implement our two-layer network; now we no longer need to manually implement the backward pass through the network:

In [None]:
import torch

device = torch.device('cpu')
# device = torch.device('cuda') # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold input and outputs
x = torch.randn(N, D_in, device=device)
y = torch.randn(N, D_out, device=device)

# Create random Tensors for weights; setting requires_grad=True means that we
# want to compute gradients for these Tensors during the backward pass.
w1 = torch.randn(D_in, H, device=device, requires_grad=True)
w2 = torch.randn(H, D_out, device=device, requires_grad=True)

learning_rate = 1e-6
for t in range(500):
  # Forward pass: compute predicted y using operations on Tensors. Since w1 and
  # w2 have requires_grad=True, operations involving these Tensors will cause
  # PyTorch to build a computational graph, allowing automatic computation of
  # gradients. Since we are no longer implementing the backward pass by hand we
  # don't need to keep references to intermediate values.
  y_pred = x.mm(w1).clamp(min=0).mm(w2)
  
  # Compute and print loss. Loss is a Tensor of shape (), and loss.item()
  # is a Python number giving its value.
  loss = (y_pred - y).pow(2).sum()
  print(t, loss.item())

  # Use autograd to compute the backward pass. This call will compute the
  # gradient of loss with respect to all Tensors with requires_grad=True.
  # After this call w1.grad and w2.grad will be Tensors holding the gradient
  # of the loss with respect to w1 and w2 respectively.
  loss.backward()

  # Update weights using gradient descent. For this step we just want to mutate
  # the values of w1 and w2 in-place; we don't want to build up a computational
  # graph for the update steps, so we use the torch.no_grad() context manager
  # to prevent PyTorch from building a computational graph for the updates
  with torch.no_grad():
    w1 -= learning_rate * w1.grad
    w2 -= learning_rate * w2.grad

    # Manually zero the gradients after the update; otherwise the next call to
    # loss.backward() would add to them instead of replacing them
    w1.grad.zero_()
    w2.grad.zero_()


0 30597946.0
1 23214368.0
2 19016300.0
3 15458803.0
4 11966616.0
5 8766768.0
6 6146963.0
7 4225688.0
8 2909847.25
9 2044027.75
10 1480396.875
11 1110354.25
12 861726.875
13 689554.9375
14 565832.125
15 473431.75
16 402086.96875
17 345432.125
18 299376.03125
19 261260.53125
20 229271.125
21 202145.484375
22 178910.890625
23 158870.6875
24 141502.9375
25 126388.578125
26 113163.015625
27 101549.1875
28 91318.859375
29 82287.578125
30 74291.6484375
31 67187.4609375
32 60859.91015625
33 55210.734375
34 50156.9375
35 45631.921875
36 41571.171875
37 37919.99609375
38 34631.05859375
39 31663.142578125
40 28981.625
41 26556.470703125
42 24360.7890625
43 22377.05859375
44 20573.96484375
45 18935.76171875
46 17444.236328125
47 16082.9658203125
48 14838.5068359375
49 13701.3662109375
50 12661.63671875
51 11709.49609375
52 10836.6904296875
53 10035.8740234375
54 9300.541015625
55 8624.859375
56 8003.2685546875
57 7430.984375
58 6903.76708984375
59 6417.62890625
60 5969.064453125
61 5554.892578125


# TensorFlow: Static Graphs

PyTorch autograd looks a lot like TensorFlow's graph mode (the TensorFlow 1.x style used in the example below through tensorflow.compat.v1): in both frameworks we define a computational graph, and use automatic differentiation to compute gradients. The biggest difference between the two is that these TensorFlow graphs are static, whereas PyTorch builds its computational graphs dynamically.

In TensorFlow, we define the computational graph once and then execute the same graph over and over again, possibly feeding different input data to the graph. In PyTorch, each forward pass defines a new computational graph.

Static graphs are nice because you can optimize the graph up front; for example a framework might decide to fuse some graph operations for efficiency, or to come up with a strategy for distributing the graph across many GPUs or many machines. If you are reusing the same graph over and over, then this potentially costly up-front optimization can be amortized as the same graph is rerun over and over.

One aspect where static and dynamic graphs differ is control flow. For some models we may wish to perform different computation for each data point; for example a recurrent network might be unrolled for different numbers of time steps for each data point; this unrolling can be implemented as a loop. With a static graph the loop construct needs to be a part of the graph; for this reason TensorFlow provides operators such as tf.scan for embedding loops into the graph. With dynamic graphs the situation is simpler: since we build graphs on-the-fly for each example, we can use normal imperative flow control to perform computation that differs for each input.
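
To make the dynamic-graph side of this comparison concrete, here is a minimal PyTorch sketch in which a plain Python loop unrolls a recurrence for a different number of steps per input. The function name `unrolled`, the tanh nonlinearity, and the sizes are assumptions made for illustration only.

```python
import torch

torch.manual_seed(0)
w = torch.randn(3, 3, requires_grad=True)

def unrolled(x, steps):
    """Apply the same weight matrix `steps` times using an ordinary Python loop."""
    h = x
    for _ in range(steps):
        h = torch.tanh(h.mm(w))
    return h.sum()

# Each input is unrolled for a different number of steps; every call builds
# a fresh graph whose depth matches that input.
inputs = [(torch.randn(1, 3), 1), (torch.randn(1, 3), 4)]
grad_norms = []
for x, steps in inputs:
    loss = unrolled(x, steps)
    loss.backward()
    grad_norms.append(w.grad.norm().item())
    w.grad.zero_()

print(grad_norms)
```

No special loop operator is needed: autograd records whatever operations actually ran for each input.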

To contrast with the PyTorch autograd example above, here we use TensorFlow to fit a simple two-layer net:

In [None]:
import numpy as np
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

# First we set up the computational graph:

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create placeholders for the input and target data; these will be filled
# with real data when we execute the graph.
x = tf.placeholder(tf.float32, shape=(None, D_in))
y = tf.placeholder(tf.float32, shape=(None, D_out))

# Create Variables for the weights and initialize them with random data.
# A TensorFlow Variable persists its value across executions of the graph.
w1 = tf.Variable(tf.random_normal((D_in, H)))
w2 = tf.Variable(tf.random_normal((H, D_out)))

# Forward pass: Compute the predicted y using operations on TensorFlow Tensors.
# Note that this code does not actually perform any numeric operations; it
# merely sets up the computational graph that we will later execute.
h = tf.matmul(x, w1)
h_relu = tf.maximum(h, tf.zeros(1))
y_pred = tf.matmul(h_relu, w2)

# Compute loss using operations on TensorFlow Tensors
loss = tf.reduce_sum((y - y_pred) ** 2.0)

# Compute gradient of the loss with respect to w1 and w2.
grad_w1, grad_w2 = tf.gradients(loss, [w1, w2])

# Update the weights using gradient descent. To actually update the weights
# we need to evaluate new_w1 and new_w2 when executing the graph. Note that
# in TensorFlow the act of updating the value of the weights is part of
# the computational graph; in PyTorch this happens outside the computational
# graph.
learning_rate = 1e-6
new_w1 = w1.assign(w1 - learning_rate * grad_w1)
new_w2 = w2.assign(w2 - learning_rate * grad_w2)

# Now we have built our computational graph, so we enter a TensorFlow session to
# actually execute the graph.
with tf.Session() as sess:
  # Run the graph once to initialize the Variables w1 and w2.
  sess.run(tf.global_variables_initializer())

  # Create numpy arrays holding the actual data for the inputs x and targets y
  x_value = np.random.randn(N, D_in)
  y_value = np.random.randn(N, D_out)
  for _ in range(500):
    # Execute the graph many times. Each time it executes we want to bind
    # x_value to x and y_value to y, specified with the feed_dict argument.
    # Each time we execute the graph we want to compute the values for loss,
    # new_w1, and new_w2; the values of these Tensors are returned as numpy
    # arrays.
    loss_value, _, _ = sess.run([loss, new_w1, new_w2],
                                feed_dict={x: x_value, y: y_value})
    print(loss_value)


Instructions for updating:
non-resource variables are not supported in the long term
32780668.0
29224552.0
30175196.0
30309314.0
26448426.0
18935366.0
11332722.0
6132154.0
3335763.5
1980294.4
1319051.0
970229.94
763369.1
625581.2
525095.3
447201.66
384494.12
332890.4
289877.8
253626.7
222782.89
196420.33
173744.69
154143.6
137144.81
122334.5
109392.68
98047.62
88079.805
79282.73
71501.13
64597.523
58454.96
52977.78
48080.383
43699.938
39780.79
36257.71
33085.695
30227.97
27652.643
25322.482
23212.746
21305.338
19574.957
18002.578
16569.662
15263.379
14070.537
12980.433
11983.026
11069.517
10232.444
9464.656
8759.7705
8112.1045
7516.4214
6968.4043
6463.6763
5998.3467
5569.4067
5173.4277
4807.7285
4469.9165
4157.71
3868.7803
3602.0828
3355.2473
3126.4338
2914.2803
2717.4998
2534.8389
2365.2732
2207.747
2061.3262
1925.227
1798.6637
1680.8447
1571.2173
1469.1312
1374.0857
1285.5171
1202.9452
1125.9464
1054.1301
987.1478
924.59595
866.18933
811.6445
760.69946
713.08124
668.5808
626.9584
588

# PyTorch: nn

Computational graphs and autograd are a very powerful paradigm for defining complex operators and automatically taking derivatives; however for large neural networks raw autograd can be a bit too low-level.

When building neural networks we frequently think of arranging the computation into layers, some of which have learnable parameters which will be optimized during learning.

In TensorFlow, packages like Keras, TensorFlow-Slim, and TFLearn provide higher-level abstractions over raw computational graphs that are useful for building neural networks.

In PyTorch, the nn package serves this same purpose. The nn package defines a set of Modules, which are roughly equivalent to neural network layers. A Module receives input Tensors and computes output Tensors, but may also hold internal state such as Tensors containing learnable parameters. The nn package also defines a set of useful loss functions that are commonly used when training neural networks.
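
As a quick look at the internal state a Module holds, the sketch below inspects a single nn.Linear layer and reproduces its output by hand. The layer sizes are arbitrary.

```python
import torch

layer = torch.nn.Linear(4, 2)   # maps 4 input features to 2 output features

# named_parameters() yields the learnable Tensors the Module holds.
shapes = {name: tuple(p.shape) for name, p in layer.named_parameters()}
print(shapes)   # weight is (out_features, in_features); bias is (out_features,)

x = torch.randn(3, 4)
out = layer(x)                                   # calling the Module runs its forward pass
manual = x.mm(layer.weight.t()) + layer.bias     # the same linear function written out
print(torch.allclose(out, manual, atol=1e-6))
```

Both parameters are created with requires_grad=True, so a backward pass through `out` would populate `layer.weight.grad` and `layer.bias.grad`.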

In this example we use the nn package to implement our two-layer network:

In [None]:
import torch

device = torch.device('cpu')
# device = torch.device('cuda') # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in, device=device)
y = torch.randn(N, D_out, device=device)

# Use the nn package to define our model as a sequence of layers. nn.Sequential
# is a Module which contains other Modules, and applies them in sequence to
# produce its output. Each Linear Module computes output from input using a
# linear function, and holds internal Tensors for its weight and bias.
# After constructing the model we use the .to() method to move it to the
# desired device.
model = torch.nn.Sequential(
          torch.nn.Linear(D_in, H),
          torch.nn.ReLU(),
          torch.nn.Linear(H, D_out),
        ).to(device)

# The nn package also contains definitions of popular loss functions; in this
# case we will use Mean Squared Error (MSE) as our loss function. Setting
# reduction='sum' means that we are computing the *sum* of squared errors rather
# than the mean; this is for consistency with the examples above where we
# manually compute the loss, but in practice it is more common to use mean
# squared error as a loss by setting reduction='mean'.
loss_fn = torch.nn.MSELoss(reduction='sum')

learning_rate = 1e-4
for t in range(500):
  # Forward pass: compute predicted y by passing x to the model. Module objects
  # override the __call__ operator so you can call them like functions. When
  # doing so you pass a Tensor of input data to the Module and it produces
  # a Tensor of output data.
  y_pred = model(x)

  # Compute and print loss. We pass Tensors containing the predicted and true
  # values of y, and the loss function returns a Tensor containing the loss.
  loss = loss_fn(y_pred, y)
  print(t, loss.item())
  
  # Zero the gradients before running the backward pass.
  model.zero_grad()

  # Backward pass: compute gradient of the loss with respect to all the learnable
  # parameters of the model. Internally, the parameters of each Module are stored
  # in Tensors with requires_grad=True, so this call will compute gradients for
  # all learnable parameters in the model.
  loss.backward()

  # Update the weights using gradient descent. Each parameter is a Tensor, so
  # we can update it in-place and read its gradient like we did before.
  with torch.no_grad():
    for param in model.parameters():
      param -= learning_rate * param.grad


0 647.1207275390625
1 601.440185546875
2 562.4964599609375
3 528.2743530273438
4 497.6966247558594
5 470.0245056152344
6 444.6116027832031
7 421.22015380859375
8 399.4977722167969
9 379.1961975097656
10 360.05242919921875
11 342.1058349609375
12 325.16864013671875
13 309.0809631347656
14 293.7188720703125
15 279.0403137207031
16 264.9557189941406
17 251.53298950195312
18 238.7579803466797
19 226.588623046875
20 214.96234130859375
21 203.84304809570312
22 193.19268798828125
23 183.00401306152344
24 173.2965087890625
25 164.03549194335938
26 155.19766235351562
27 146.77188110351562
28 138.7554931640625
29 131.14088439941406
30 123.8921127319336
31 117.00759887695312
32 110.4794692993164
33 104.29859924316406
34 98.44470977783203
35 92.89039611816406
36 87.62410736083984
37 82.65440368652344
38 77.9630355834961
39 73.53495788574219
40 69.35985565185547
41 65.42072296142578
42 61.70951843261719
43 58.2105598449707
44 54.9124755859375
45 51.808937072753906
46 48.887386322021484
47 46.134460

# PyTorch: optim

Up to this point we have updated the weights of our models by manually mutating Tensors holding learnable parameters. This is not a huge burden for simple optimization algorithms like stochastic gradient descent, but in practice we often train neural networks using more sophisticated optimizers like AdaGrad, RMSProp, Adam, etc.

The optim package in PyTorch abstracts the idea of an optimization algorithm and provides implementations of commonly used optimization algorithms.
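
To show what an Optimizer takes over, the following sketch performs one plain gradient descent step by hand and the same step through torch.optim.SGD (without momentum), then compares the results. The toy quadratic loss and the learning rate are assumptions chosen for the example.

```python
import torch

torch.manual_seed(0)
w_manual = torch.randn(5, requires_grad=True)
w_optim = w_manual.detach().clone().requires_grad_(True)   # identical starting point
target = torch.randn(5)
lr = 0.1

# One gradient descent step written by hand.
loss = (w_manual - target).pow(2).sum()
loss.backward()
with torch.no_grad():
    w_manual -= lr * w_manual.grad

# The same step through torch.optim.SGD.
optimizer = torch.optim.SGD([w_optim], lr=lr)
optimizer.zero_grad()
loss = (w_optim - target).pow(2).sum()
loss.backward()
optimizer.step()

same = torch.allclose(w_manual, w_optim, atol=1e-6)
print(same)
```

Swapping in a different algorithm then only means changing the constructor, while the zero_grad / backward / step pattern stays the same.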

In this example we will use the nn package to define our model as before, but we will optimize the model using the Adam algorithm provided by the optim package:

In [None]:
import torch

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs.
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Use the nn package to define our model and loss function.
model = torch.nn.Sequential(
          torch.nn.Linear(D_in, H),
          torch.nn.ReLU(),
          torch.nn.Linear(H, D_out),
        )
loss_fn = torch.nn.MSELoss(reduction='sum')

# Use the optim package to define an Optimizer that will update the weights of
# the model for us. Here we will use Adam; the optim package contains many other
# optimization algorithms. The first argument to the Adam constructor tells the
# optimizer which Tensors it should update.
learning_rate = 1e-4
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
for t in range(500):
  # Forward pass: compute predicted y by passing x to the model.
  y_pred = model(x)

  # Compute and print loss.
  loss = loss_fn(y_pred, y)
  print(t, loss.item())
  
  # Before the backward pass, use the optimizer object to zero all of the
  # gradients for the Tensors it will update (which are the learnable weights
  # of the model)
  optimizer.zero_grad()

  # Backward pass: compute gradient of the loss with respect to model parameters
  loss.backward()

  # Calling the step function on an Optimizer makes an update to its parameters
  optimizer.step()


0 749.6666870117188
1 731.2828369140625
2 713.3854370117188
3 696.118896484375
4 679.3082885742188
5 662.9584350585938
6 647.1128540039062
7 631.707763671875
8 616.7071533203125
9 602.13720703125
10 587.9408569335938
11 574.0980834960938
12 560.60107421875
13 547.4644775390625
14 534.7015380859375
15 522.3305053710938
16 510.3026428222656
17 498.5463562011719
18 487.083740234375
19 476.0008239746094
20 465.2979736328125
21 454.8642272949219
22 444.7009582519531
23 434.7528991699219
24 425.0491027832031
25 415.59735107421875
26 406.34613037109375
27 397.30438232421875
28 388.4363098144531
29 379.7904052734375
30 371.3359680175781
31 363.0437927246094
32 354.90728759765625
33 346.9778137207031
34 339.21612548828125
35 331.63177490234375
36 324.2071838378906
37 316.92364501953125
38 309.7959289550781
39 302.8256530761719
40 295.9817199707031
41 289.2799072265625
42 282.6982727050781
43 276.25628662109375
44 269.95977783203125
45 263.7974548339844
46 257.7710266113281
47 251.8699493408203


# PyTorch: Custom nn Modules

Sometimes you will want to specify models that are more complex than a sequence of existing Modules; for these cases you can define your own Modules by subclassing nn.Module and defining a forward method that receives input Tensors and produces output Tensors using other Modules or other autograd operations on Tensors.
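
One detail worth seeing in isolation: Modules assigned as attributes in the constructor are registered automatically, so their parameters show up in parameters(). The class name `TinyNet` and its sizes below are hypothetical and used only for this sketch.

```python
import torch

class TinyNet(torch.nn.Module):   # hypothetical name, for illustration only
    def __init__(self, d_in, h, d_out):
        super().__init__()
        # Assigning Modules as attributes registers their parameters.
        self.linear1 = torch.nn.Linear(d_in, h)
        self.linear2 = torch.nn.Linear(h, d_out)

    def forward(self, x):
        return self.linear2(torch.relu(self.linear1(x)))

model = TinyNet(8, 4, 2)
names = [name for name, _ in model.named_parameters()]
n_params = sum(p.numel() for p in model.parameters())
out = model(torch.randn(3, 8))   # calling the model invokes forward()

print(names)
print(n_params, tuple(out.shape))
```

This is why an Optimizer constructed from model.parameters() can update every layer without the layers being listed by hand.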

In this example we implement our two-layer network as a custom Module subclass:

In [None]:
import torch

class TwoLayerNet(torch.nn.Module):
  def __init__(self, D_in, H, D_out):
    """
    In the constructor we instantiate two nn.Linear modules and assign them as
    member variables.
    """
    super(TwoLayerNet, self).__init__()
    self.linear1 = torch.nn.Linear(D_in, H)
    self.linear2 = torch.nn.Linear(H, D_out)

  def forward(self, x):
    """
    In the forward function we accept a Tensor of input data and we must return
    a Tensor of output data. We can use Modules defined in the constructor as
    well as arbitrary (differentiable) operations on Tensors.
    """
    h_relu = self.linear1(x).clamp(min=0)
    y_pred = self.linear2(h_relu)
    return y_pred

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Construct our model by instantiating the class defined above.
model = TwoLayerNet(D_in, H, D_out)

# Construct our loss function and an Optimizer. The call to model.parameters()
# in the SGD constructor returns the learnable parameters of the two
# nn.Linear modules which are members of the model.
loss_fn = torch.nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)
for t in range(500):
  # Forward pass: Compute predicted y by passing x to the model
  y_pred = model(x)

  # Compute and print loss
  loss = loss_fn(y_pred, y)
  print(t, loss.item())

  # Zero gradients, perform a backward pass, and update the weights.
  optimizer.zero_grad()
  loss.backward()
  optimizer.step()



0 709.3436279296875
1 656.5252685546875
2 611.2459716796875
3 571.8984375
4 537.1768798828125
5 505.98736572265625
6 477.7784729003906
7 451.75390625
8 427.5066833496094
9 405.0999450683594
10 384.15203857421875
11 364.51806640625
12 346.0712890625
13 328.6414489746094
14 312.14617919921875
15 296.41009521484375
16 281.40771484375
17 267.0614013671875
18 253.37362670898438
19 240.298095703125
20 227.7803192138672
21 215.78187561035156
22 204.36416625976562
23 193.47268676757812
24 183.10081481933594
25 173.19711303710938
26 163.77517700195312
27 154.76866149902344
28 146.21604919433594
29 138.09518432617188
30 130.39321899414062
31 123.0806884765625
32 116.14785766601562
33 109.58479309082031
34 103.36297607421875
35 97.4729232788086
36 91.89535522460938
37 86.60822296142578
38 81.62428283691406
39 76.92679595947266
40 72.48443603515625
41 68.28834533691406
42 64.34233093261719
43 60.62618637084961
44 57.126304626464844
45 53.8381462097168
46 50.732177734375
47 47.81432342529297
48 45.