<a href="https://colab.research.google.com/github/alsalamahs/MLGitDemo/blob/master/Introduction_to_PyTorch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this notebook, we will introduce PyTorch, talk about its important concepts and features, and eventually train an MNIST classifier using what we have learned. 

## NOTE: Click the top-left "Open In Playground" button! 

## What is PyTorch?

1. A Python GPU-accelerated tensor library (NumPy, but faster)
2. Differentiable Programming with dynamic computation graphs
3. Flexible and efficient **neural network** library
4. Python-first framework (easy to integrate with other Python libraries, debug, and extend)
  + Quick conversion from & to NumPy array, integration with other Python libs.
  + Your favorite Python debugger.
  + Adding custom ops with Python/C++ extensions.
  + Running in a pure C++ environment with the C++ API.

Useful links:

+ PyTorch documentation: https://pytorch.org/docs/stable/index.html

  Most math operations can be found as `torch.*` or `Tensor.*`.
+ PyTorch official tutorials (60-min blitz is a good start): https://pytorch.org/tutorials/
    - Transfer learning tutorial: https://pytorch.org/tutorials/beginner/transfer_learning_tutorial.html
+ PyTorch examples (DCGAN, ImageNet training, Reinforcement Learning, etc.): https://github.com/pytorch/examples/

In [0]:
# install basic image libs
# (the version spec is quoted so the shell does not treat `>` as a redirect)
!pip install "Pillow>=5.0.0"
!pip install -U image

# install torch and torchvision (a utility library for computer vision that provides many public datasets and pre-trained models)
from os.path import exists
from wheel.pep425tags import get_abbr_impl, get_impl_ver, get_abi_tag
platform = '{}{}-{}'.format(get_abbr_impl(), get_impl_ver(), get_abi_tag())
cuda_output = !ldconfig -p|grep cudart.so|sed -e 's/.*\.\([0-9]*\)\.\([0-9]*\)$/cu\1\2/'
accelerator = cuda_output[0] if exists('/dev/nvidia0') else 'cpu'
!pip install -q http://download.pytorch.org/whl/{accelerator}/torch-1.1.0-{platform}-linux_x86_64.whl torchvision

Requirement already up-to-date: image in /usr/local/lib/python3.6/dist-packages (1.5.27)


## GPU-accelerated Tensor Library

A Tensor is a multi-dimensional array.

In [0]:
import torch

In [0]:
# Create a 3x5 matrix filled with zeros

x = torch.zeros(3, 5)
print(x)

tensor([[0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.]])


In [0]:
# Create a 3x5 matrix filled with random values

y = torch.randn(3, 5)
print(y)

tensor([[-0.8744,  0.9280, -0.6588, -0.3902,  0.5511],
        [ 0.0770, -1.0809, -1.3102,  1.7989,  0.6384],
        [ 0.3242, -0.0301,  1.6886, -0.5148,  0.3431]])


In [0]:
# Shape manipulations

print('\n.t()  (transpose): ')
print(y.t())

print('.reshape(5, 3): ')
print(y.reshape(5, 3))


.t()  (transpose): 
tensor([[-0.8744,  0.0770,  0.3242],
        [ 0.9280, -1.0809, -0.0301],
        [-0.6588, -1.3102,  1.6886],
        [-0.3902,  1.7989, -0.5148],
        [ 0.5511,  0.6384,  0.3431]])
.reshape(5, 3): 
tensor([[-0.8744,  0.9280, -0.6588],
        [-0.3902,  0.5511,  0.0770],
        [-1.0809, -1.3102,  1.7989],
        [ 0.6384,  0.3242, -0.0301],
        [ 1.6886, -0.5148,  0.3431]])


In [0]:
# Slicing

print(y[1:])

print(y[1:, ::2])

tensor([[ 0.0770, -1.0809, -1.3102,  1.7989,  0.6384],
        [ 0.3242, -0.0301,  1.6886, -0.5148,  0.3431]])
tensor([[ 0.0770, -1.3102,  0.6384],
        [ 0.3242,  1.6886,  0.3431]])


In [0]:
# Basic arithmetic

print(x + 2)

tensor([[2., 2., 2., 2., 2.],
        [2., 2., 2., 2., 2.],
        [2., 2., 2., 2., 2.]])


In [0]:
print(y * (x + 2))

tensor([[-1.7487,  1.8560, -1.3176, -0.7803,  1.1022],
        [ 0.1540, -2.1619, -2.6204,  3.5978,  1.2768],
        [ 0.6484, -0.0602,  3.3772, -1.0297,  0.6862]])


In [0]:
print((y * (x + 2)).exp())

tensor([[ 0.1740,  6.3982,  0.2678,  0.4583,  3.0107],
        [ 1.1665,  0.1151,  0.0728, 36.5182,  3.5853],
        [ 1.9125,  0.9416, 29.2876,  0.3571,  1.9861]])


#### GPU Acceleration

All of the operations above can also be run on a GPU.

First, let us create a [`torch.device`](https://pytorch.org/docs/stable/tensor_attributes.html#torch-device) object representing a GPU device.
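
The rest of this notebook assumes a CUDA device at index 0. If you are running somewhere without a GPU, a common pattern is to pick the device conditionally. The sketch below is only an illustration; the variable name `device` is made up here and is not used elsewhere in the notebook.

```python
import torch

# Fall back to the CPU when no CUDA device is visible to PyTorch.
# `device` is a hypothetical name chosen for this sketch.
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

# Tensors can be created directly on whichever device was selected.
t = torch.zeros(3, 5, device=device)
print(t.device)
```

Code written against `device` instead of a hard-coded `cuda0` should run unchanged on both CPU-only and GPU machines.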

In [0]:
cuda0 = torch.device('cuda:0')  # pick the GPU at index 0

In [0]:
y

tensor([[-0.8744,  0.9280, -0.6588, -0.3902,  0.5511],
        [ 0.0770, -1.0809, -1.3102,  1.7989,  0.6384],
        [ 0.3242, -0.0301,  1.6886, -0.5148,  0.3431]])

In [0]:
# Move a tensor from CPU to GPU
# NOTE: the first time you access a GPU, a context is created so this may take a
# few seconds. But subsequent uses will be fast.

cuda_y = y.to(cuda0)
print(cuda_y)

tensor([[-0.8744,  0.9280, -0.6588, -0.3902,  0.5511],
        [ 0.0770, -1.0809, -1.3102,  1.7989,  0.6384],
        [ 0.3242, -0.0301,  1.6886, -0.5148,  0.3431]], device='cuda:0')


In [0]:
# Or create a tensor directly on the GPU

cuda_x = torch.zeros(3, 5, device=cuda0)
print(cuda_x)

tensor([[0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.]], device='cuda:0')


In [0]:
# All functions and methods work on GPU tensors

print((cuda_y * (cuda_x + 2)).exp())  # values match the CPU results above!

tensor([[ 0.1740,  6.3982,  0.2678,  0.4583,  3.0107],
        [ 1.1665,  0.1151,  0.0728, 36.5182,  3.5853],
        [ 1.9125,  0.9416, 29.2876,  0.3571,  1.9861]], device='cuda:0')


### NumPy Bridge

Converting a `torch.Tensor` to a `np.ndarray` and vice versa is a breeze.

The `torch.Tensor` and `np.ndarray` will share their underlying memory locations (if the `torch.Tensor` is on CPU and `dtype` is the same), and changing one will change the other.

In [0]:
import numpy as np

In [0]:

x = torch.randn(5)
print(x)

tensor([-0.8433, -0.4097, -1.6601,  0.3581, -0.5345])


In [0]:

x.numpy()

array([-0.8432796 , -0.40973672, -1.6601192 ,  0.3580714 , -0.5344998 ],
      dtype=float32)

In [0]:
# Converting a tensor to an array

x = torch.randn(5)
print(x)

# use `my_tensor.numpy()`
x_np = x.numpy()
print(x_np)

# or `np.asarray`

x_np = np.asarray(x)
print(x_np)

tensor([ 0.7391, -0.7181, -3.2596,  2.2210,  1.0768])
[ 0.73911566 -0.7181137  -3.259646    2.221019    1.0767806 ]
[ 0.73911566 -0.7181137  -3.259646    2.221019    1.0767806 ]


In [0]:
# in-place changes to one affect the other

x[0] = -1
print(x)
print(x_np)

tensor([-1.0000, -0.7181, -3.2596,  2.2210,  1.0768])
[-1.        -0.7181137 -3.259646   2.221019   1.0767806]


In [0]:
# Converting an array to a tensor

a = np.random.randn(3, 4)

a_pt = torch.as_tensor(a)
print(a_pt)

tensor([[-0.6790, -0.1510, -0.8305,  1.2007],
        [ 0.0102, -0.0697, -0.4782, -1.3210],
        [-0.7971,  1.2852,  0.4573,  1.2090]], dtype=torch.float64)


In [0]:
# the resulting CPU Tensor shares memory with the array!

a_pt[0] = -1
print(a)

[[-1.         -1.         -1.         -1.        ]
 [ 0.01015544 -0.06972739 -0.47822052 -1.32100022]
 [-0.79705154  1.28522745  0.45733386  1.20901938]]


In [0]:
# But if we change dtype and/or device at the same time, a copy is made

a_half_pt = torch.as_tensor(a, dtype=torch.float16, device=cuda0)
a_half_pt[0] = 9
print(a_half_pt)

print(a)  # original array is not affected

tensor([[ 9.0000,  9.0000,  9.0000,  9.0000],
        [ 0.0102, -0.0697, -0.4783, -1.3213],
        [-0.7969,  1.2852,  0.4573,  1.2090]], device='cuda:0',
       dtype=torch.float16)
[[-1.         -1.         -1.         -1.        ]
 [ 0.01015544 -0.06972739 -0.47822052 -1.32100022]
 [-0.79705154  1.28522745  0.45733386  1.20901938]]
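
The sharing rules from this section can be checked directly on the CPU. The following is a small sketch, not part of the original notebook: it contrasts `torch.from_numpy` (shares memory) with `torch.tensor` (copies the data), and the names `shared`, `copied` and `converted` are made up for the illustration.

```python
import numpy as np
import torch

arr = np.zeros(3, dtype=np.float32)

shared = torch.from_numpy(arr)   # shares memory with `arr`
copied = torch.tensor(arr)       # copies the data into a new tensor

# A dtype change cannot reuse the float32 buffer, so this is a copy too.
converted = torch.as_tensor(arr, dtype=torch.float64)

arr[0] = 7.0                     # in-place change on the NumPy side

print(shared)     # sees the change
print(copied)     # still all zeros
print(converted)  # still all zeros
```

If you need a tensor that is guaranteed not to alias the array, copying explicitly is the safer choice.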


## Differentiable Programming with Dynamic Computation Graphs

Gradient-based optimization is an essential part of the modern deep learning frenzy. PyTorch uses [reverse-mode automatic differentiation](https://en.wikipedia.org/wiki/Automatic_differentiation) to efficiently compute gradients through any computations done on tensors.

### Dynamic vs. Static

A neural network is essentially a sequence of mathematical operations on tensors, which build up a computation graph.

Most frameworks such as TensorFlow, Theano, Caffe and CNTK have a static view of the world. One has to build a neural network, and reuse the same structure again and again. Changing the way the network behaves means that one has to start from scratch.

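PyTorch instead builds the computation graph on the fly as operations execute, and applies reverse-mode auto-differentiation to whatever graph was recorded. This allows you to change the way your network behaves from one iteration to the next (e.g., with ordinary Python control flow) without rebuilding anything.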
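
To make "dynamic" concrete, here is a minimal sketch in which the number of operations depends on the data itself. The function `repeated_double` is a made-up example, not a PyTorch API; the point is only that a plain Python `while` loop decides how large the graph becomes.

```python
import torch

def repeated_double(x):
    """Double `x` until its norm reaches 10; return the sum and the step count."""
    y = x
    steps = 0
    while y.norm() < 10:   # data-dependent control flow
        y = y * 2
        steps += 1
    return y.sum(), steps

x = torch.tensor([1.0, 2.0, 2.0], requires_grad=True)  # norm is 3
out, steps = repeated_double(x)
out.backward()

print(steps)   # the loop ran twice: 3 -> 6 -> 12
print(x.grad)  # each element was multiplied by 2 twice, so the gradient is 4
```

A different input would take a different number of loop iterations, and autograd would simply record a different graph.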


### Dynamic computation graphs

When you create a tensor with its `requires_grad` flag set to `True`, the [`autograd`](https://pytorch.org/docs/stable/autograd.html) engine considers it as a **leaf** node of the computation graph. As you compute with it, the graph is dynamically expanded. When you ask for gradients (e.g., via `tensor.backward()`), the `autograd` engine traces backwards through the graph, and automatically computes the gradients for you.

![alt text](https://github.com/pytorch/pytorch/raw/master/docs/source/_static/img/dynamic_graph.gif)
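
The leaf vs. non-leaf distinction can be inspected on any tensor. A small sketch (tensor names are arbitrary):

```python
import torch

a = torch.ones(2, requires_grad=True)  # created by the user: a leaf
b = a * 3                              # produced by an operation: not a leaf
c = b.sum()

print(a.is_leaf, a.grad_fn)  # leaves have no grad_fn
print(b.is_leaf, b.grad_fn)  # intermediate results remember how they were made

c.backward()
print(a.grad)                # gradients are stored on the leaf
```

By default only leaf tensors keep their `.grad` after `backward()`; intermediate results such as `b` are used to route gradients and then discarded.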


**Let's see this in action!**

In [0]:
# Now, we want tensors with `requires_grad=True`

a = torch.ones(3, 5, requires_grad=True)  # tensor of all ones
print(a)  # notice that the `requires_grad` flag is on!

tensor([[1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.]], requires_grad=True)


In [0]:
# Currently `a` has no gradients

print(a.grad)

None


In [0]:
# Let's compute the gradient wrt the sum

s = a.sum()
print('sum of a is', s)

sum of a is tensor(15., grad_fn=<SumBackward0>)


In [0]:
# Notice the `grad_fn` of `s`. It represents the function used to propagate 
# gradients from `s` to previous nodes of the graph (`a` in this case).

s.backward()  # compute gradient!
print(a.grad)

tensor([[1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.]])


In [0]:
# Yay! Indeed d \sum_a / d a_ij = 1

In [0]:
# Gradients are automatically **accumulated**

a.sum().backward()
print(a.grad)  # now the new gradients are added to the old ones

# Don't worry, we have easy ways to clear the gradients too. 
# We will talk about those later!

tensor([[2., 2., 2., 2., 2.],
        [2., 2., 2., 2., 2.],
        [2., 2., 2., 2., 2.]])


In [0]:
# Now let's do something slightly fancier, on GPU!

a = torch.ones(3, 4, device=cuda0, requires_grad=True)
b = torch.randn(4, 4, device=cuda0, requires_grad=True)

result = (torch.mm(a, b.t().exp()) * 0.5).rfft(2).sum() * b.prod() - b.mean()
print('this complicated chain of operation gives....')
print(result)

this complicated chain of operation gives....
tensor(-0.4008, device='cuda:0', grad_fn=<SubBackward0>)


In [0]:
result.backward()
print('\ngradient wrt a is')
print(a.grad)
print('\ngradient wrt b is')
print(b.grad)


gradient wrt a is
tensor([[-0.0022, -0.0021, -0.0002,  0.0006],
        [ 0.0000,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  0.0000]], device='cuda:0')

gradient wrt b is
tensor([[-0.0567, -0.0698, -0.0825, -0.0595],
        [-0.0319, -0.0519, -0.0639, -0.0631],
        [-0.0670, -0.0666, -0.0580, -0.0705],
        [-0.0661, -0.1020, -0.0744, -0.0738]], device='cuda:0')


In [0]:
########################
#                      #
#       Exercise       #
#                      #
########################


a = torch.linspace(-3, 3, 10, dtype=torch.float32, requires_grad=True)
b = torch.logspace(0.2, 2, 10, requires_grad=True)


z = torch.log( b.sum() / a.exp().sum()) - b.sum()
print(z)

z.backward()

print(a.grad)
print(b.grad)

tensor(-266.3888, grad_fn=<SubBackward0>)
tensor([-0.0012, -0.0024, -0.0046, -0.0089, -0.0174, -0.0339, -0.0659, -0.1284,
        -0.2501, -0.4872])
tensor([-0.9963, -0.9963, -0.9963, -0.9963, -0.9963, -0.9963, -0.9963, -0.9963,
        -0.9963, -0.9963])


Compute 

$$z = \log \left( \frac{1}{\sum_i \exp(a_i)} \sum_j b_j \right) - \sum_k b_k,$$

and then the gradients of $z$ w.r.t. $\mathbf{a}$ and $\mathbf{b}$.

They should look like:

```
# Gradient wrt a
tensor([-0.0012, -0.0024, -0.0046, -0.0089, -0.0174, -0.0339, -0.0659, -0.1284,
        -0.2501, -0.4872])

# Gradient wrt b
tensor([-0.9963, -0.9963, -0.9963, -0.9963, -0.9963, -0.9963, -0.9963, -0.9963,
        -0.9963, -0.9963])
```
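
As a sanity check on those numbers, the gradients can also be derived by hand. The shorthand $A$ and $B$ below is introduced only for this derivation:

$$A = \sum_i \exp(a_i), \qquad B = \sum_j b_j, \qquad z = \log B - \log A - B.$$

Differentiating term by term gives

$$\frac{\partial z}{\partial a_i} = -\frac{\exp(a_i)}{A}, \qquad \frac{\partial z}{\partial b_j} = \frac{1}{B} - 1.$$

So the gradient w.r.t. $\mathbf{a}$ is the negative softmax of $\mathbf{a}$ (its entries sum to $-1$), and the gradient w.r.t. $\mathbf{b}$ is the same constant for every $j$, which agrees with the repeated value in the expected output.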

#### Manipulating the `requires_grad` flag

In [0]:
# Besides setting it at creation time (as in the examples above), you can
# change this flag in-place using `my_tensor.requires_grad_()`, or by
# directly setting the attribute.

x = torch.randn(1, 4, 5)
print(x)
print('x does not track gradients')

tensor([[[ 1.3404,  1.2348, -0.4726,  1.8338,  0.9665],
         [-0.4777,  2.3249, -0.2659,  0.8649, -0.9776],
         [-1.0396, -2.4354,  0.7180, -0.9280, -0.7578],
         [-0.6382, -1.4439,  0.2138, -0.6615, -1.5771]]])
x does not track gradients


In [0]:
x.requires_grad_()
print(x)
print('x now **does** track gradients')

tensor([[[ 1.3404,  1.2348, -0.4726,  1.8338,  0.9665],
         [-0.4777,  2.3249, -0.2659,  0.8649, -0.9776],
         [-1.0396, -2.4354,  0.7180, -0.9280, -0.7578],
         [-0.6382, -1.4439,  0.2138, -0.6615, -1.5771]]], requires_grad=True)
x now **does** track gradients
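
The other route mentioned above, assigning to the attribute directly, looks like this. This is a small sketch with arbitrary tensor names; it also shows switching tracking off again with `requires_grad_(False)`.

```python
import torch

t = torch.randn(4, 5)
t.requires_grad = True      # direct attribute assignment on a leaf tensor

(t * 2).sum().backward()
print(t.grad)               # every entry is 2

t.requires_grad_(False)     # in-place method, turning tracking off again
u = t * 2
print(u.requires_grad)      # False: nothing upstream tracks gradients
```

Flipping the flag is intended for leaf tensors. For an intermediate result, `my_tensor.detach()` returns a tensor that shares the same data but is cut off from the graph.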


## Flexible and Efficient Neural Network Library

The [`torch.nn`](https://pytorch.org/docs/stable/nn.html) and [`torch.optim`](https://pytorch.org/docs/stable/optim.html) packages provide many efficient implementations of neural network components:
  + Affine layers and [activation functions](https://pytorch.org/docs/stable/nn.html#non-linear-activations-weighted-sum-nonlinearity)
  + Normalization methods
  + [Initialization schemes](https://pytorch.org/docs/stable/nn.html#torch-nn-init)
  + [Loss functions](https://pytorch.org/docs/stable/nn.html#loss-functions)
  + [Embeddings](https://pytorch.org/docs/stable/nn.html#sparse-layers)
  + [Distributed and Multi-GPU training](https://pytorch.org/docs/stable/nn.html#dataparallel-layers-multi-gpu-distributed)
  + [Gradient-based optimizers](https://pytorch.org/docs/stable/optim.html)
  + [Learning rate schedulers](https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate)
  + etc.
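
As a taste of how these pieces fit together, here is a hedged sketch of a learning rate scheduler driving an optimizer. The layer size, the dummy loss and the schedule (`step_size=2`, `gamma=0.1`) are arbitrary choices for illustration; optimizers themselves are covered in more detail later in this notebook.

```python
import torch
import torch.nn as nn

layer = nn.Linear(4, 2)
optimizer = torch.optim.SGD(layer.parameters(), lr=0.1)
# Multiply the learning rate by 0.1 after every 2 scheduler steps.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.1)

lrs = []
for epoch in range(4):
    optimizer.zero_grad()
    loss = layer(torch.randn(8, 4)).pow(2).mean()  # dummy objective
    loss.backward()
    optimizer.step()    # update parameters first
    scheduler.step()    # then advance the schedule
    lrs.append(optimizer.param_groups[0]['lr'])

print(lrs)
```

The learning rate drops by a factor of 10 after the second and fourth epochs.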

In [0]:
import torch.nn as nn
import torch.nn.functional as F

#### `torch.nn` Layers

We will use the [fully connected linear layer (`nn.Linear`)](https://pytorch.org/docs/stable/nn.html#torch.nn.Linear) as an example. 

An fc layer performs an affine transform with a 2D weight parameter $\mathbf{w}$ (of shape `[out_features, in_features]`) and a 1D bias parameter $\mathbf{b}$. For a batch of inputs $\mathbf{x}$ stored as rows:

$$ f(\mathbf{x}) = \mathbf{x} \mathbf{w}^\mathrm{T} + \mathbf{b}.$$
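
One way to convince yourself of what the layer computes is to redo the affine transform by hand and compare. The sketch below uses a non-square layer (8 inputs, 4 outputs, an arbitrary choice) so that the orientation of the weight matrix is visible; the name `layer` is used to avoid clashing with the `fc` defined next.

```python
import torch
import torch.nn as nn

layer = nn.Linear(in_features=8, out_features=4)
x = torch.randn(64, 8)                  # batch of 64 row vectors

# nn.Linear stores its weight as [out_features, in_features],
# so a batch of rows is multiplied by the transposed weight.
manual = x @ layer.weight.t() + layer.bias

print(layer.weight.shape)               # torch.Size([4, 8])
print(torch.allclose(layer(x), manual, atol=1e-5))
```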

In [0]:
fc = nn.Linear(in_features=8, out_features=8)
print(fc)

Linear(in_features=8, out_features=8, bias=True)


In [0]:
# It has two parameters, the weight and the bias

for name, p in fc.named_parameters():
    print('param name: {}\t shape: {}'.format(name, p.shape))

param name: weight	 shape: torch.Size([8, 8])
param name: bias	 shape: torch.Size([8])


In [0]:
fc.weight

Parameter containing:
tensor([[-0.2669,  0.1907, -0.1391, -0.3276,  0.1498,  0.3402,  0.3010, -0.2500],
        [ 0.0154,  0.3147, -0.2032, -0.2937, -0.2662,  0.0713,  0.0804,  0.0981],
        [ 0.0988,  0.0569,  0.1203,  0.2472,  0.0577, -0.0078, -0.1302,  0.0956],
        [-0.0064,  0.1049, -0.2925, -0.0192, -0.3011, -0.2878, -0.1191, -0.3001],
        [ 0.1697,  0.3415,  0.1356, -0.0279,  0.3087,  0.2876,  0.1874, -0.2123],
        [-0.1191,  0.2579,  0.2103, -0.1575, -0.2197,  0.1573, -0.3480,  0.0270],
        [-0.0302,  0.2779,  0.1500,  0.2312,  0.0566, -0.1994, -0.0517,  0.2655],
        [-0.1173,  0.2121,  0.0729,  0.0713, -0.2098, -0.0373, -0.0443,  0.0630]],
       requires_grad=True)

In [0]:
# These parameters by default have `requires_grad=True`, so they will collect gradients!

print(fc.bias)

Parameter containing:
tensor([ 0.3485,  0.3280, -0.3456, -0.0870, -0.2239, -0.0148, -0.2640,  0.1459],
       requires_grad=True)


In [0]:
# Let's construct an input tensor with 2 dimensions:
#   - batch dimension of size 64
#   - 8 features

x = torch.randn(64, 8)

In [0]:
# Pass it through the fc layer

result = fc(x)
print(result.shape)

# Why does the `result` have shape [64, 8]?
#   - batch dimension of size 64
#   - 8 output features

torch.Size([64, 8])


In [0]:
# Even though the input `x` has `requires_grad=False`, the fc layer's
# weight and bias parameters have `requires_grad=True`. So the result also
# requires gradient, with a `grad_fn` to compute the backward pass for
# the fc layer.
print(result.requires_grad)
print(result.grad_fn)  # It says `AddmmBackward` because the fc layer performs a matmul and an addition

True
<AddmmBackward object at 0x7f63674ed1d0>


In [0]:
# Say (arbitrarily) we want the layer to behave like the cosine function (yes I know it is impossible)

target = x.cos()

In [0]:
# Let's try MSE loss

loss = F.mse_loss(result, target)
print(loss)

tensor(1.0003, grad_fn=<MseLossBackward>)


In [0]:
# Compute gradients

loss.backward()
print(fc.bias.grad)

tensor([-0.0578, -0.0557, -0.2481, -0.2105, -0.1780, -0.1764, -0.1997, -0.1282])


In [0]:
# We can manually perform GD via a loop

print('bias before GD', fc.bias)

lr = 0.5
with torch.no_grad():  
    # this context manager tells PyTorch that we don't want ops inside to be 
    # tracked by autograd!
    for p in fc.parameters():
        p -= lr * p.grad
        
print('bias after GD', fc.bias)

bias before GD Parameter containing:
tensor([ 0.3485,  0.3280, -0.3456, -0.0870, -0.2239, -0.0148, -0.2640,  0.1459],
       requires_grad=True)
bias after GD Parameter containing:
tensor([ 0.3774,  0.3558, -0.2216,  0.0182, -0.1349,  0.0734, -0.1642,  0.2101],
       requires_grad=True)
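
The cell above takes a single step. To repeat it, the gradients have to be cleared between steps, since `backward()` accumulates into `.grad`. A minimal sketch of a full manual loop on the CPU follows; the fixed batch, the seed, `lr = 0.1` and the 20 iterations are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
layer = nn.Linear(8, 8)
inputs = torch.randn(64, 8)     # one fixed batch, reused every step
targets = inputs.cos()
lr = 0.1

losses = []
for step in range(20):
    loss = F.mse_loss(layer(inputs), targets)
    losses.append(loss.item())
    loss.backward()
    with torch.no_grad():       # parameter updates must not be tracked
        for p in layer.parameters():
            p -= lr * p.grad
            p.grad.zero_()      # reset, or the next backward() adds on top

print(losses[0], losses[-1])
```

This is the bookkeeping that an optimizer's `zero_grad()` and `step()` calls take care of.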


#### `torch.optim` optimizers

More easily, we can use the provided [`torch.optim`](https://pytorch.org/docs/stable/optim.html#torch.optim) optimizers. Let's use the [`torch.optim.SGD`](https://pytorch.org/docs/stable/optim.html#torch.optim.SGD) optimizer for example!

In [0]:
# Let's optimize for 5000 iterations

# First, put the layer on GPU so things run faster

fc = fc.to(cuda0)

In [0]:
# Construct an optimizer

optim = torch.optim.SGD(fc.parameters(), lr=0.1)

In [0]:
# training loop

batch_size = 256

for ii in range(5000):
    # clear gradients accumulated on the parameters
    optim.zero_grad()
    
    # get an input (say we only care about inputs sampled from N(0, I))
    x = torch.randn(batch_size, 8, device=cuda0)  # this has to be on GPU too
    
    # target is the cos(x)
    target = x.cos()
    
    # forward pass
    result = fc(x)
    
    # compute loss
    loss = F.mse_loss(result, target)
    
    # compute gradients
    loss.backward()
    
    # let the optimizer do its work; the parameters will be updated in this call
    optim.step()
    
    # add some printing
    if ii % 500 == 0:
        print('iteration {}\tloss {:.5f}'.format(ii, loss))


iteration 0	loss 0.77930
iteration 500	loss 0.19448
iteration 1000	loss 0.20639
iteration 1500	loss 0.18550
iteration 2000	loss 0.17982
iteration 2500	loss 0.18540
iteration 3000	loss 0.18934
iteration 3500	loss 0.18994
iteration 4000	loss 0.20697
iteration 4500	loss 0.18157


### Building Deep Neural Networks

A single `nn.Linear` layer didn't do very well! The MSE loss above is still pretty large.

But this is expected, as it is simply an affine transformation and thus has limited expressive power. Let's replace it with a deep network and see how it works!

For simplicity, we will use the following feedforward network architecture (from top to bottom):

```
        [Input]
           ||
[Fully-Connected 8 -> 32]
           ||
    [ReLU activation]
           ||
[Fully-Connected 32 -> 32]
           ||
    [ReLU activation]
           ||
[Fully-Connected 32 -> 8]
           ||
        [Output]
```

In PyTorch, a model is represented by a [`nn.Module`](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) object. The `nn.Linear` layer we looked at above is also an instance of it:

In [0]:
assert isinstance(nn.Linear(8, 8), nn.Module)

Now that we want to build a deep network, we can compose the needed layers together by writing a custom `nn.Module` ourselves.

In [0]:
class MyNet(nn.Module):  # subclass nn.Module
    def __init__(self):
        super(MyNet, self).__init__()
        
        # We need 3 fully-connected layers!
        # Simply assigning them as attributes will
        # make sure that PyTorch keeps track of them.
        
        # 8 => 32
        self.fc1 = nn.Linear(8, 32)
        # 32 => 32
        self.fc2 = nn.Linear(32, 32)
        # 32 => 8
        self.fc3 = nn.Linear(32, 8)
        
        
    # We also need to define a `forward()` method that details
    # what should happen when this module is used.
    def forward(self, x):
        x = self.fc1(x)
        x = x.relu()
        x = self.fc2(x)
        x = x.relu()
        return self.fc3(x)

In [0]:
# Okay! Now we are ready to use this deep network! 

# Construct a network and move to GPU
net = MyNet().to(cuda0)

# Construct an optimizer
optim = torch.optim.SGD(net.parameters(), lr=0.1)

In [0]:
# The same training loop, but now using a deep network!

batch_size = 256

for ii in range(5000):
    # clear gradients accumulated on the parameters
    optim.zero_grad()
    
    # get an input (say we only care about inputs sampled from N(0, I))
    x = torch.randn(batch_size, 8, device=cuda0)  # this has to be on GPU too
    
    # target is the cos(x)
    target = x.cos()
    
    # forward pass
    result = net(x)  # CHANGED: fc => net
    
    # compute loss
    loss = F.mse_loss(result, target)
    
    # compute gradients
    loss.backward()
    
    # let the optimizer do its work; the parameters will be updated in this call
    optim.step()
    
    # add some printing
    if ii % 500 == 0:
        print('iteration {}\tloss {:.5f}'.format(ii, loss))


iteration 0	loss 0.53365
iteration 500	loss 0.16464
iteration 1000	loss 0.08549
iteration 1500	loss 0.03594
iteration 2000	loss 0.01866
iteration 2500	loss 0.01365
iteration 3000	loss 0.01166
iteration 3500	loss 0.00987
iteration 4000	loss 0.00879
iteration 4500	loss 0.00892


The network did so much better than a single fully-connected layer!

#### More `nn.*` Layers

There are many other layers provided in the `nn.*` package. To list a few, we have
+ Convolutions: e.g., [`nn.Conv2d`](https://pytorch.org/docs/stable/nn.html#torch.nn.Conv2d)
+ Normalizations: e.g., [`nn.BatchNorm2d`](https://pytorch.org/docs/stable/nn.html#torch.nn.BatchNorm2d)
+ Activation functions: e.g., [`nn.ReLU`](https://pytorch.org/docs/stable/nn.html#torch.nn.ReLU)
+ etc.

In [0]:
# A conv 2d layer with 4x4 filters, mapping inputs with 3 channels to outputs with 5 channels

conv = nn.Conv2d(in_channels=3, out_channels=5, kernel_size=4)

In [0]:
# It also has two parameters, the weight and the bias

for name, p in conv.named_parameters():
    print('param name: {}\t shape: {}'.format(name, p.shape))
    
# Why does the weight have shape [5, 3, 4, 4]? 

param name: weight	 shape: torch.Size([5, 3, 4, 4])
param name: bias	 shape: torch.Size([5])
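
One way to approach the question above is to look at the shapes on both sides of the layer. The sketch below feeds a made-up batch of two 10x10 images with 3 channels through the same kind of layer; the input size is an arbitrary choice.

```python
import torch
import torch.nn as nn

conv_demo = nn.Conv2d(in_channels=3, out_channels=5, kernel_size=4)
images = torch.randn(2, 3, 10, 10)   # [batch, channels, height, width]

out = conv_demo(images)

# Weight layout: [out_channels, in_channels, kernel_height, kernel_width]
print(conv_demo.weight.shape)  # torch.Size([5, 3, 4, 4])
# No padding and stride 1, so each spatial dim shrinks by kernel_size - 1
print(out.shape)               # torch.Size([2, 5, 7, 7])
```

Each of the 5 output channels has its own 4x4 filter for each of the 3 input channels, which is where the four weight dimensions come from.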


#### `nn.Module` Containers

`torch.nn` also provides many other [`nn.Module` containers](https://pytorch.org/docs/stable/nn.html#containers) for easily building complex networks. E.g., [`nn.Sequential`](https://pytorch.org/docs/stable/nn.html#torch.nn.Sequential) executes a list of submodules sequentially, passing each output to the next's input. 

Using `nn.Sequential`, the above network can be equivalently written as:

In [0]:
net = nn.Sequential(
    nn.Linear(8, 32),
    nn.ReLU(),               # This nn.Module does the ReLU activation on its input
    nn.Linear(32, 32),
    nn.ReLU(),
    nn.Linear(32, 8),
).to(cuda0)
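
What `nn.Sequential` does can be reproduced by hand: iterate over the submodules and feed each output into the next one. The sketch below uses a smaller made-up network on the CPU (named `small_net` so it does not overwrite `net`) and also counts its parameters.

```python
import torch
import torch.nn as nn

small_net = nn.Sequential(
    nn.Linear(8, 32),
    nn.ReLU(),
    nn.Linear(32, 8),
)
x = torch.randn(4, 8)

# nn.Sequential can be iterated over; apply the submodules one by one.
h = x
for layer in small_net:
    h = layer(h)

print(torch.allclose(h, small_net(x)))

# Total number of trainable values: (8*32 + 32) + (32*8 + 8)
num_params = sum(p.numel() for p in small_net.parameters())
print(num_params)  # 552
```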

In [0]:
########################
#                      #
#       Exercise       #
#                      #
########################

Perform the same regression task (i.e., modeling $f(x) = \cos(x)$), but with the following modifications:

+ Use one *more* hidden layer
+ Each hidden layer should have 128 neurons
+ Use the `tanh` activation function (see [`my_tensor.tanh()`](https://pytorch.org/docs/stable/tensors.html#torch.Tensor.tanh))
+ Use a batch size of 128
+ Use the [`torch.optim.Adam`](https://pytorch.org/docs/stable/optim.html#torch.optim.Adam) optimizer
+ Use the [L1 loss](https://pytorch.org/docs/stable/nn.html#torch.nn.functional.l1_loss) function


The following code skeleton is provided. Fill in the places marked with `FIXME!!!`.

In [0]:
class MyDeeperNet(nn.Module):
    def __init__(self):
        super(MyDeeperNet, self).__init__()
        
        # We need 4 fully-connected layers now! 
        # Each should have 128 neurons, except for the last one, which outputs vectors of size 8.
        
        self.fc1 = nn.Linear(8, 128)
        self.fc2 = nn.Linear(128, 128)
        self.fc3 = nn.Linear(128, 128)
        self.fc4 = nn.Linear(128, 8)
        
        
    def forward(self, x):
        x = self.fc1(x)
        x = x.tanh()
        x = self.fc2(x)
        x = x.tanh()
        x = self.fc3(x)
        x = x.tanh()
        return self.fc4(x)
        
# Construct our new awesome deeper network and move to GPU
deeper_net = MyDeeperNet().to(cuda0)

# Alternative implementation:
#
# deeper_net = nn.Sequential(
#     nn.Linear(8, 128),
#     nn.Tanh(),
#     nn.Linear(128, 128),
#     nn.Tanh(),
#     nn.Linear(128, 128),
#     nn.Tanh(),
#     nn.Linear(128, 8),
# ).to(cuda0)

# Construct an Adam optimizer
deeper_net_optim = torch.optim.Adam(deeper_net.parameters(), lr=0.01)


# Training loop

batch_size = 128

for ii in range(5000):
    # clear gradients accumulated on the parameters
    deeper_net_optim.zero_grad()
    
    # get an input (say we only care about inputs sampled from N(0, I))
    x = torch.randn(batch_size, 8, device=cuda0)  # this has to be on GPU too
    
    # target is the cos(x)
    target = x.cos()
    
    # forward pass
    result = deeper_net(x)
    
    # compute loss
    loss = F.l1_loss(result, target)
    
    # compute gradients
    loss.backward()
    
    # let the optimizer do its work; the parameters will be updated in this call
    deeper_net_optim.step()
    
    # add some printing
    if ii % 500 == 0:
        print('iteration {}\tloss {:.5f}'.format(ii, loss))


iteration 0	loss 0.69156
iteration 500	loss 0.06283
iteration 1000	loss 0.04860
iteration 1500	loss 0.03951
iteration 2000	loss 0.04054
iteration 2500	loss 0.04966
iteration 3000	loss 0.03961
iteration 3500	loss 0.04023
iteration 4000	loss 0.04876
iteration 4500	loss 0.04617
