At its core, PyTorch provides two main features:
- An n-dimensional Tensor, similar to a NumPy array, that can also run on GPUs
- Automatic differentiation for building and training neural networks

## Tensors

### Warm-up: NumPy
NumPy provides an n-dimensional array object and many functions for manipulating these arrays. NumPy is a generic framework for scientific computing; it does not provide any tools for working with computational graphs or deep learning. However, we can easily use NumPy to fit a two-layer network to random data by manually implementing the forward and backward passes through the network using NumPy operations.
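
When the backward pass is written by hand, it helps to confirm the derivation numerically. The sketch below is an illustration rather than part of the original notebook: it rebuilds the same two-layer ReLU network at a much smaller size and compares the hand-derived gradients against central finite differences. The helper names `loss_fn`, `analytic_grads`, and `numeric_grad`, the reduced dimensions, and the seed are all assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Small sizes (an assumption for this sketch) keep the finite-difference loop cheap.
N, D_in, H, D_out = 4, 5, 3, 2
x = rng.standard_normal((N, D_in))
y = rng.standard_normal((N, D_out))
w1 = rng.standard_normal((D_in, H))
w2 = rng.standard_normal((H, D_out))


def loss_fn(w1, w2):
    """Sum-of-squares loss of the two-layer ReLU network."""
    h_relu = np.maximum(x.dot(w1), 0)
    return np.square(h_relu.dot(w2) - y).sum()


def analytic_grads(w1, w2):
    """Hand-derived gradients of the loss w.r.t. w1 and w2."""
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)
    grad_y_pred = 2.0 * (h_relu.dot(w2) - y)
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h = grad_y_pred.dot(w2.T)
    grad_h[h < 0] = 0  # ReLU passes no gradient where its input was negative
    grad_w1 = x.T.dot(grad_h)
    return grad_w1, grad_w2


def numeric_grad(f, w, eps=1e-6):
    """Central-difference estimate of df/dw, perturbing one entry at a time."""
    grad = np.zeros_like(w)
    for idx in np.ndindex(*w.shape):
        old = w[idx]
        w[idx] = old + eps
        f_plus = f()
        w[idx] = old - eps
        f_minus = f()
        w[idx] = old  # restore the original value
        grad[idx] = (f_plus - f_minus) / (2 * eps)
    return grad


grad_w1, grad_w2 = analytic_grads(w1, w2)
num_w1 = numeric_grad(lambda: loss_fn(w1, w2), w1)
num_w2 = numeric_grad(lambda: loss_fn(w1, w2), w2)

# Both differences should be tiny if the manual backward pass is correct.
max_err_w1 = np.abs(grad_w1 - num_w1).max()
max_err_w2 = np.abs(grad_w2 - num_w2).max()
print(max_err_w1, max_err_w2)
```

A check like this is only practical for small networks, since it needs two forward passes per weight; it is meant as a debugging aid, not as a way to train.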

In [1]:
import numpy as np

In [3]:
# N: batch size
# D_in: input dimension
# H: hidden dimension
# D_out: output dimension
N, D_in, H, D_out = 64, 1000, 100, 10

In [4]:
# creating random input and output data
x = np.random.randn(N, D_in)
y = np.random.randn(N, D_out)

In [5]:
# Randomly initialize weights
w1 = np.random.randn(D_in, H)
w2 = np.random.randn(H, D_out)

In [6]:
# learning rate
learning_rate = 1e-6

In [12]:
# no. of epochs
epochs = 500

In [13]:
for t in range(epochs):
    # Forward pass: compute predicted y
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)
    y_pred = h_relu.dot(w2)
    
    # Compute and print loss
    loss = np.square(y_pred - y).sum()
    print(t, loss)
    
    # Backprop to compute gradients of the loss w.r.t. w1 and w2
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h_relu = grad_y_pred.dot(w2.T)
    grad_h = grad_h_relu.copy()
    grad_h[h < 0] = 0
    grad_w1 = x.T.dot(grad_h)
    
    # Update weights
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2

0 3.9386661087089905e-23
1 3.928804449302803e-23
2 3.8938421909039664e-23
3 3.8859533621780803e-23
4 3.860619439769596e-23
5 3.8336505131104554e-23
6 3.80035160970761e-23
7 3.790186782065365e-23
8 3.7644212236571733e-23
9 3.7544487725110684e-23
10 3.73805541428637e-23
11 3.7130395857168215e-23
12 3.683668464026185e-23
13 3.667447375018149e-23
14 3.654796051435343e-23
15 3.6368955458087293e-23
16 3.6188192990340655e-23
17 3.6183034639281436e-23
18 3.607154592777827e-23
19 3.5929502339064574e-23
20 3.574537308423988e-23
21 3.5702075166144183e-23
22 3.549702783776388e-23
23 3.537797894082686e-23
24 3.522693228548957e-23
25 3.52214858475369e-23
26 3.492068161008184e-23
27 3.444124997067547e-23
28 3.4252520248686835e-23
29 3.4162206299479716e-23
30 3.411532001661656e-23
31 3.4264681868601e-23
32 3.415597587821394e-23
33 3.403514495633048e-23
34 3.3903075379038483e-23
35 3.36696207840856e-23
36 3.3422382012387944e-23
37 3.316679518443569e-23
38 3.2849022922283946e-23
39 3.274183324221716e-23

459 1.2114039581290754e-23
460 1.2079374135111379e-23
461 1.2055572411015644e-23
462 1.2083982032326471e-23
463 1.2017547907754896e-23
464 1.1950403386219825e-23
465 1.1954725978872267e-23
466 1.1980115955387222e-23
467 1.1994546787058559e-23
468 1.1969872136611829e-23
469 1.1957511217780671e-23
470 1.197028326371849e-23
471 1.1945212997171537e-23
472 1.1958256498926676e-23
473 1.187289226200132e-23
474 1.1834337711394117e-23
475 1.1800798214680707e-23
476 1.1784077095758686e-23
477 1.1758017640052183e-23
478 1.1728843548818045e-23
479 1.1687440664829356e-23
480 1.1644821171517716e-23
481 1.1556252328910334e-23
482 1.1632654052763608e-23
483 1.1586002838596709e-23
484 1.1550549162552088e-23
485 1.1565666173364168e-23
486 1.1553198446890942e-23
487 1.1508969547955298e-23
488 1.1536106103286951e-23
489 1.1514614127882739e-23
490 1.1485957942710001e-23
491 1.1484900008239421e-23
492 1.1478641796722376e-23
493 1.150721518491796e-23
494 1.150934635122275e-23
495 1.152725010190169e-23
496 1.

## Autograd

### PyTorch: Tensors and autograd

We can use ***automatic differentiation*** to automate the computation of backward passes in neural networks. The ***autograd*** package in PyTorch provides exactly this functionality.

When using autograd, the forward pass of our network defines a ***computational graph***; nodes in the graph are Tensors, and edges are functions that produce output Tensors from input Tensors.

Backpropagating through this graph then allows us to easily compute gradients.
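
As a minimal sketch of that idea (the tensor names `a` and `b` are chosen only for this illustration), the forward pass below records a tiny graph, and `backward()` then fills in the gradient of the scalar result with respect to the leaf tensor:

```python
import torch

# Leaf tensor whose gradient we want autograd to track.
a = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# Forward pass: each operation is recorded in the computational graph.
b = a * a          # elementwise square
loss = b.sum()     # scalar: 1 + 4 + 9

# Backward pass: fills a.grad with d(loss)/da = 2 * a.
loss.backward()

print(loss.item())   # 14.0
print(a.grad)        # tensor([2., 4., 6.])
print(b.grad_fn)     # the recorded function that produced b
```

Tensors created with `requires_grad=True` are the leaves of the graph; intermediate results such as `b` carry a `grad_fn` that links them back to their inputs.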

In [14]:
import torch

In [15]:
dtype = torch.float

if torch.cuda.is_available():
    device = torch.device("cuda:0")
else:
    device = torch.device("cpu")

In [17]:
# N: batch size
# D_in: input dimension
# H: hidden dimension
# D_out: output dimension
N, D_in, H, D_out = 64, 1000, 100, 10

In [18]:
# creating random input and output data
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

In [19]:
# creating random Tensors for weights.
# Setting requires_grad=True indicates that
# we want to compute gradients w.r.t. these Tensors
# during the backward pass.

w1 = torch.randn(D_in, H, device=device,
                 dtype=dtype, requires_grad=True)
w2 = torch.randn(H, D_out, device=device,
                 dtype=dtype, requires_grad=True)

In [20]:
# learning rate
learning_rate = 1e-6

In [21]:
epochs = 500

In [22]:
for t in range(epochs):
    # Forward pass: compute predicted y using operations on Tensors;
    # we do not need to keep references to intermediate values since
    # we are not implementing the backward pass by hand
    y_pred = x.mm(w1).clamp(min=0).mm(w2)
    
    # Compute and print loss using operations on Tensors
    # Now loss is a 0-dimensional Tensor (a scalar);
    # loss.item() gets the Python number held in the loss.
    loss = (y_pred - y).pow(2).sum()
    print(t, loss.item())
    
    # Use autograd to compute the backward pass
    # This call will compute the gradient of loss
    # w.r.t all tensors with requires_grad=True
    # After this call w1.grad and w2.grad will be tensors
    # holding the gradient of the loss w.r.t. w1 and w2 respectively
    loss.backward()
    
    # Manually update weights using gradient descent. Wrap in
    # torch.no_grad() because the weights have requires_grad=True,
    # but we don't need to track the update in autograd.
    # An alternative way is to operate on
    # weight.data and weight.grad.data.
    # We can also use torch.optim.SGD to achieve this.
    
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad
        
        # Manually zero the gradients after updating the weights
        w1.grad.zero_()
        w2.grad.zero_()

0 34457040.0
1 30511114.0
2 30195162.0
3 28095522.0
4 22501658.0
5 14964980.0
6 8703156.0
7 4765038.5
8 2700668.0
9 1670216.125
10 1146002.625
11 856056.375
12 677807.625
13 556328.375
14 466589.09375
15 396662.0
16 340495.09375
17 294240.03125
18 255716.59375
19 223360.640625
20 195886.484375
21 172416.0625
22 152243.46875
23 134843.0625
24 119761.21875
25 106668.375
26 95245.8515625
27 85247.7109375
28 76474.9453125
29 68748.265625
30 61925.66015625
31 55884.66796875
32 50521.00390625
33 45749.98046875
34 41495.0
35 37699.96484375
36 34304.17578125
37 31257.607421875
38 28524.859375
39 26063.458984375
40 23844.82421875
41 21843.642578125
42 20034.525390625
43 18396.361328125
44 16911.712890625
45 15563.083984375
46 14335.3740234375
47 13217.1572265625
48 12197.79296875
49 11268.115234375
50 10418.6572265625
51 9642.1552734375
52 8930.6171875
53 8277.716796875
54 7678.6240234375
55 7127.8876953125
56 6620.7119140625
57 6153.791015625
58 5723.08203125
59 5325.96875
60 4959.26611328125
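
The comments in the training loop mention `torch.optim.SGD` as an alternative to the manual update. A hedged sketch of that variant follows; it assumes a CPU run, a fixed seed, and a shorter loop of 100 steps, none of which come from the original notebook. The optimizer takes over the weight update and the gradient reset, so no `torch.no_grad()` block is needed.

```python
import torch

torch.manual_seed(0)  # assumption: fixed seed so the sketch is repeatable

N, D_in, H, D_out = 64, 1000, 100, 10
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)
w1 = torch.randn(D_in, H, requires_grad=True)
w2 = torch.randn(H, D_out, requires_grad=True)

# The optimizer holds references to the tensors it should update.
optimizer = torch.optim.SGD([w1, w2], lr=1e-6)

losses = []
for t in range(100):
    y_pred = x.mm(w1).clamp(min=0).mm(w2)
    loss = (y_pred - y).pow(2).sum()
    losses.append(loss.item())

    optimizer.zero_grad()  # clear gradients left over from the previous step
    loss.backward()        # populate w1.grad and w2.grad
    optimizer.step()       # apply w -= lr * w.grad to each weight

print(losses[0], losses[-1])
```

With plain SGD and no momentum, `optimizer.step()` performs the same update as the manual `w -= learning_rate * w.grad` lines.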


### PyTorch: Defining new autograd functions
Under the hood, each primitive autograd operator is really two functions that operate on Tensors. The ***forward*** function computes output Tensors from input Tensors. The ***backward*** function receives the gradient of some scalar value w.r.t. the output Tensors, and computes the gradient of that same scalar value w.r.t. the input Tensors.

In PyTorch, we can easily define our own autograd operator by defining a subclass of ```torch.autograd.Function``` and implementing the ```forward``` and ```backward``` functions as static methods. We can then use our autograd operator by calling its ```apply``` method like a function, passing Tensors containing input data.
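
Before wiring a custom operator into a training loop, it can be worth checking that `backward` really matches `forward`. The sketch below uses a hypothetical `Square` operator (not part of the original notebook) and `torch.autograd.gradcheck`, which compares the analytic gradient against finite differences. Double-precision inputs are used because the numerical comparison is too noisy in single precision.

```python
import torch


class Square(torch.autograd.Function):
    """Hypothetical example operator: elementwise y = x ** 2."""

    @staticmethod
    def forward(ctx, input):
        # Save the input tensor; backward needs it to form dy/dx.
        ctx.save_for_backward(input)
        return input * input

    @staticmethod
    def backward(ctx, grad_output):
        input, = ctx.saved_tensors
        # Chain rule: dL/dx = dL/dy * dy/dx = grad_output * 2x
        return grad_output * 2 * input


x = torch.randn(4, 3, dtype=torch.double, requires_grad=True)

# Call the operator through .apply, not by instantiating the class.
y = Square.apply(x)
y.sum().backward()
matches = torch.allclose(x.grad, 2 * x.detach())

# gradcheck perturbs the input and compares against backward().
passed = torch.autograd.gradcheck(Square.apply, (x,))
print(matches, passed)
```

A smooth function is used here on purpose: a ReLU is not differentiable at zero, so a finite-difference check can fail there if an input happens to land close to the kink.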

In [23]:
import torch

In [40]:
class MyReLU(torch.autograd.Function):
    '''
    We can implement our own custom autograd Functions by subclassing
    torch.autograd.Function and implementing the forward and backward
    passes which operate on Tensors.    
    '''
    
    @staticmethod
    def forward(ctx, input):
        '''
        In the forward pass we receive a Tensor containing the input
        and return a Tensor containing the output. ctx is a context
        object that can be used to stash information for the backward
        computation. Tensors needed in the backward pass are cached
        with the ctx.save_for_backward method.
        '''
        ctx.save_for_backward(input)
        return input.clamp(min=0)
    
    @staticmethod
    def backward(ctx, grad_output):
        '''
        In the backward pass we receive a Tensor containing the
        gradient of the loss w.r.t. the output, and we need to 
        compute the gradient of the loss w.r.t. the input.
        '''
        input, = ctx.saved_tensors
        grad_input = grad_output.clone()
        grad_input[input < 0] = 0
        return grad_input

In [41]:
dtype = torch.float

if torch.cuda.is_available():
    device = torch.device("cuda:0")
else:
    device = torch.device("cpu")

In [42]:
# N: batch size
# D_in: input dimension
# H: hidden dimension
# D_out: output dimension
N, D_in, H, D_out = 64, 1000, 100, 10

In [43]:
# creating random input and output data
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

In [44]:
# creating random Tensors for weights.
# Setting requires_grad=True indicates that
# we want to compute gradients w.r.t. these Tensors
# during the backward pass.

w1 = torch.randn(D_in, H, device=device,
                 dtype=dtype, requires_grad=True)
w2 = torch.randn(H, D_out, device=device,
                 dtype=dtype, requires_grad=True)

In [45]:
# learning rate
learning_rate = 1e-6

In [46]:
# no. of epochs
epochs = 500

In [47]:
# To apply our Function, we use the Function.apply method.
# We alias this as 'relu'.
relu = MyReLU.apply

for t in range(epochs):
    
    # Forward pass: compute predicted y using operations;
    # we compute ReLU using our custom autograd operation
    y_pred = relu(x.mm(w1)).mm(w2)
    
    # Compute and print loss
    loss = (y_pred - y).pow(2).sum()
    print(t, loss.item())
    
    # Use autograd to compute the backward pass.
    loss.backward()
    
    # Update weights using gradient descent
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad
        
        # Manually zero the gradients after updating weights
        w1.grad.zero_()
        w2.grad.zero_()

0 27830346.0
1 25396068.0
2 28058450.0
3 31686692.0
4 32241706.0
5 26850888.0
6 17945728.0
7 9850813.0
8 4958372.0
9 2548184.75
10 1465158.375
11 965791.625
12 713659.75
13 567795.8125
14 471308.8125
15 400496.46875
16 344975.125
17 299730.15625
18 262075.9375
19 230325.5625
20 203269.015625
21 180033.4375
22 159983.25
23 142577.15625
24 127428.03125
25 114173.640625
26 102531.921875
27 92274.859375
28 83216.6484375
29 75205.5
30 68090.0546875
31 61753.3984375
32 56096.74609375
33 51037.38671875
34 46508.13671875
35 42442.51953125
36 38786.27734375
37 35493.98046875
38 32526.638671875
39 29843.02734375
40 27412.517578125
41 25207.251953125
42 23204.298828125
43 21382.888671875
44 19724.88671875
45 18212.7890625
46 16832.232421875
47 15570.3837890625
48 14415.8779296875
49 13358.26171875
50 12389.216796875
51 11500.1611328125
52 10683.4482421875
53 9932.5615234375
54 9241.5361328125
55 8604.99609375
56 8018.1103515625
57 7476.38232421875
58 6976.16796875
59 6513.77001953125
60 6086.1015

396 0.005901465192437172
397 0.005707187578082085
398 0.005514222197234631
399 0.0053258934058249
400 0.005146936047822237
401 0.004976735450327396
402 0.004810024984180927
403 0.004649372771382332
404 0.004495232831686735
405 0.004348935559391975
406 0.004204457625746727
407 0.004067350644618273
408 0.003933647647500038
409 0.0038046096451580524
410 0.003680839901790023
411 0.003562268801033497
412 0.0034465498756617308
413 0.0033352994360029697
414 0.0032283575274050236
415 0.0031239374075084925
416 0.003021928947418928
417 0.002926211804151535
418 0.0028357645496726036
419 0.002744177123531699
420 0.002660091035068035
421 0.002574750455096364
422 0.002497052541002631
423 0.002418593503534794
424 0.0023440795484930277
425 0.0022700519766658545
426 0.002200571820139885
427 0.0021348961163312197
428 0.002069783164188266
429 0.0020057675428688526
430 0.0019452591659501195
431 0.0018871017964556813
432 0.0018299842486158013
433 0.001775385346263647
434 0.0017221018206328154
435 0.0016703