# Neural Networks

## Introduction

The `torch.nn` package simplifies the construction of 
neural networks. This package works similarly to Keras' subclassing
API, but is simpler because it doesn't rely on the TF compute graph 
and is eager by default. In other words, custom models don't need 
`tf.function` to be reasonably performant. 

*Worth noting, it's also possible to use Torch's `nn.Sequential` container, 
much like Keras `Sequential`*.
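
As a rough sketch of the `nn.Sequential` style, the snippet below stacks the same kind of layers used later in this section. The layer sizes and the $32 \times 32$ single-channel input are assumptions borrowed from the example that follows, and the name `sequential_model` is only illustrative.

```python
import torch
import torch.nn as nn

# Layers run in the order they are listed; no forward() method is needed.
sequential_model = nn.Sequential(
    nn.Conv2d(1, 6, 3),          # 1 input channel -> 6 output channels, 3x3 kernel
    nn.ReLU(),
    nn.MaxPool2d(2),             # 2x2 max pooling
    nn.Conv2d(6, 16, 3),         # 6 input channels -> 16 output channels
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),                # flatten everything except the batch dimension
    nn.Linear(16 * 6 * 6, 120),
    nn.ReLU(),
    nn.Linear(120, 84),
    nn.ReLU(),
    nn.Linear(84, 10),
)

# One random 32x32 single-channel image in a batch of size 1.
sequential_out = sequential_model(torch.randn(1, 1, 32, 32))
print(sequential_out.shape)
```

The trade-off is that `nn.Sequential` only covers straight-line stacks of layers; anything with branching or custom logic still needs a subclass of `nn.Module`.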

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F

## Example

We can start with an example. In Torch, models (and layers) are subclassed from
`nn.Module`. For example, the class `MaxPool2d` 
[inherits](https://github.com/pytorch/pytorch/blob/1a74bd407de335019afdcb748a758107092a8019/torch/nn/modules/pooling.py#L79)
from `nn.Module` via `_MaxPoolNd`. 

In this example, we define the layers in the `__init__` method. We have:

- `convolution_1`: Layer with `input_channels` input channels (one by default) and
6 output channels.
- `convolution_2`: Layer with 6 input channels and 16 output channels. 

Each convolution layer has a $3 \times 3$ kernel.

Then we construct the dense layers. 

- `dense_1`: There are 16 channels from the last convolution. In the toy example,
each $32 \times 32$ input image is reduced to $6 \times 6$ by the two convolution and pooling steps. 
Thus, we have $16 \times 6 \times 6 = 576$ input nodes and 120 outputs.
- `dense_2`: 120 inputs and 84 outputs.
- `classifier`: The final layer, which maps the 84 inputs to 10 class scores. 
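
The $6 \times 6$ figure can be checked by hand. The sketch below assumes a $32 \times 32$ input, stride-1 convolutions with no padding, and $2 \times 2$ pooling that rounds down; the helper names are hypothetical, written only for this check.

```python
def conv_output_size(size, kernel=3, stride=1, padding=0):
    # Standard output-size formula for a convolution along one dimension.
    return (size + 2 * padding - kernel) // stride + 1

def pool_output_size(size, window=2):
    # Max pooling with stride equal to the window size, rounding down.
    return size // window

size = 32                                          # assumed input height/width
size = pool_output_size(conv_output_size(size))    # 32 -> 30 -> 15
size = pool_output_size(conv_output_size(size))    # 15 -> 13 -> 6
flat_features = 16 * size * size                   # 16 channels of 6x6
print(size, flat_features)  # prints: 6 576
```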

Note that all of the `Linear` layers 
apply an [affine transform](https://en.wikipedia.org/wiki/Affine_transformation). In addition,
the `Conv2d` layers
apply a 2D [convolution](https://en.wikipedia.org/wiki/Convolution) over an input plane.

In the `forward` method, we compute the forward pass. This includes two 
[max pooling](https://computersciencewiki.org/index.php/Max-pooling_/_Pooling)
operations. In addition, each convolution layer has a ReLU activation applied. 
This is roughly equivalent to the following Keras model:

```
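from tensorflow.keras import Sequential
from tensorflow.keras.layers import Conv2D, Dense, Flatten, MaxPooling2D

# The first argument to Conv2D is the number of output channels (filters).
model = Sequential([
    Conv2D(6, (3, 3), activation='relu'),
    MaxPooling2D(pool_size=(2, 2)),
    Conv2D(16, (3, 3), activation='relu'),
    MaxPooling2D(pool_size=(2, 2)),
    Flatten(),
    Dense(120, activation='relu'),
    Dense(84, activation='relu'),
    Dense(10)
])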
```

In [2]:
class ConvolutionalNN(nn.Module):
    def __init__(self, input_channels=1):
        super(ConvolutionalNN, self).__init__()
        # 6 output channels, 3x3 convolution.
        self.convolution_1 = nn.Conv2d(input_channels, 6, 3)
        # 6 input channels from the previous layer, 16 output channels.
        self.convolution_2 = nn.Conv2d(6, 16, 3)
        # Linear layers are affine transforms. No non-linearity.
        # 16 out chans, 6x6 images. 120 outputs.
        self.dense_1 = nn.Linear(16 * 6 * 6, 120)
        self.dense_2 = nn.Linear(120, 84)
        self.classifier = nn.Linear(84, 10)
        
    def forward(self, x):
        # 2x2 window
        x = F.max_pool2d(F.relu(self.convolution_1(x)), (2, 2))
        # If the window is square, you can specify a single number
        x = F.max_pool2d(F.relu(self.convolution_2(x)), 2)
        # flatten
        x = x.view(-1, self._num_flat_features(x))
        x = F.relu(self.dense_1(x))
        x = F.relu(self.dense_2(x))
        x = self.classifier(x)
        return x
    
    def _num_flat_features(self, x):
        # All sizes but batch
        size = x.size()[1:]
        feature_count = 1
        for dim in size:
            feature_count *= dim
        return feature_count

In [3]:
model = ConvolutionalNN()
print(model)

ConvolutionalNN(
  (convolution_1): Conv2d(1, 6, kernel_size=(3, 3), stride=(1, 1))
  (convolution_2): Conv2d(6, 16, kernel_size=(3, 3), stride=(1, 1))
  (dense_1): Linear(in_features=576, out_features=120, bias=True)
  (dense_2): Linear(in_features=120, out_features=84, bias=True)
  (classifier): Linear(in_features=84, out_features=10, bias=True)
)


We define the forward function, and autograd is able to supply the
`backward` pass. We can then get the learnable parameters with 
`.parameters()`.

In [4]:
print('[+] Model Parameters:')
for index, parameter in enumerate(model.parameters()):
    print(f'\t[+] Param Size {index}: {parameter.size()}')

[+] Model Parameters:
	[+] Param Size 0: torch.Size([6, 1, 3, 3])
	[+] Param Size 1: torch.Size([6])
	[+] Param Size 2: torch.Size([16, 6, 3, 3])
	[+] Param Size 3: torch.Size([16])
	[+] Param Size 4: torch.Size([120, 576])
	[+] Param Size 5: torch.Size([120])
	[+] Param Size 6: torch.Size([84, 120])
	[+] Param Size 7: torch.Size([84])
	[+] Param Size 8: torch.Size([10, 84])
	[+] Param Size 9: torch.Size([10])
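
As a quick sanity check, the shapes printed above can be multiplied out to get the total number of learnable parameters. The sketch below hard-codes those shapes so it runs without the model; with the model at hand, `sum(p.numel() for p in model.parameters())` should give the same figure.

```python
import math

# Shapes copied from the parameter listing above (weights and biases).
parameter_shapes = [
    (6, 1, 3, 3), (6,),      # convolution_1
    (16, 6, 3, 3), (16,),    # convolution_2
    (120, 576), (120,),      # dense_1
    (84, 120), (84,),        # dense_2
    (10, 84), (10,),         # classifier
]

total_parameters = sum(math.prod(shape) for shape in parameter_shapes)
print(total_parameters)  # prints: 81194
```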


All of the components in the `nn` package expect data to be fed in batches. 
So, our model expects data of the form:

`samples` $\times$ `channels` $\times$ `height` $\times$ `width`.

When you need to feed in a single sample, just wrap that sample in a fake batch. 
This can be done easily with `data.unsqueeze(0)`.
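
A minimal sketch of that wrapping step, assuming a single-channel $32 \times 32$ image stored without a batch dimension (the variable names are illustrative):

```python
import torch

single_image = torch.randn(1, 32, 32)    # channels x height x width, no batch
fake_batch = single_image.unsqueeze(0)   # add a batch dimension of size 1 at the front

print(single_image.shape)  # torch.Size([1, 32, 32])
print(fake_batch.shape)    # torch.Size([1, 1, 32, 32])
```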

In [5]:
data = torch.randn((1, 1, 32, 32))
out = model(data)
print(out)

tensor([[ 0.1918, -0.2080,  0.1609,  0.1277, -0.0670,  0.0908,  0.0380, -0.1010,
          0.0893, -0.0083]], grad_fn=<AddmmBackward>)


Now we can zero the gradient buffers and run a backward pass.

In [6]:
model.zero_grad()
# backprop with random grads
out.backward(torch.randn(1, 10))

## Loss Functions & Gradients

Torch has multiple loss functions in the `nn` module. For example, 
`nn.MSELoss()` computes the 
[mean squared error](https://en.wikipedia.org/wiki/Mean_squared_error); it is implemented by 
[`functional.mse_loss`](https://github.com/pytorch/pytorch/blob/cef0443464a4ff5e3fd2e3b6eca0ee76c5c428ce/torch/nn/functional.py#L2168). 
[Loss functions](https://github.com/pytorch/pytorch/blob/cef0443464a4ff5e3fd2e3b6eca0ee76c5c428ce/torch/nn/modules/loss.py#L8)
in Torch also extend the `Module` class. Thus, they define a `forward()` function,
and autograd derives the `backward` pass from it. This allows us to compute 
gradients of the loss with respect to the model parameters. 

It's worth expanding on this a bit. When you instantiate a loss function, you are
creating an instance of an `nn.Module` subclass. From there, you can call the module (`__call__`).
This runs the `forward` pass and records the autograd graph, among other 
things. 

Now, the result of the forward pass is a tensor that remembers the operation
that produced it (its `grad_fn`), so you can compute gradients from it. 
That is, when you compute the forward pass, `MSELoss` delegates to
[`functional.mse_loss`](https://github.com/pytorch/pytorch/blob/cef0443464a4ff5e3fd2e3b6eca0ee76c5c428ce/torch/nn/functional.py#L2168),
which returns that tensor.
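
To make the delegation concrete, the sketch below compares the module form, the functional form, and a hand-written mean of squared differences. The tensors are random stand-ins (not the model outputs from this section) and the variable names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
prediction = torch.randn(1, 10, requires_grad=True)
expected = torch.randn(1, 10)

module_loss = nn.MSELoss()(prediction, expected)      # module __call__ -> forward
functional_loss = F.mse_loss(prediction, expected)    # functional form
manual_loss = ((prediction - expected) ** 2).mean()   # mean of squared differences

print(torch.allclose(module_loss, functional_loss))
print(torch.allclose(module_loss, manual_loss))
print(module_loss.grad_fn is not None)   # the result carries a grad_fn
```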

In [7]:
output = model(data)
target = torch.randn(10)
target = target.view((1, -1))

loss_function = nn.MSELoss()

loss = loss_function(output, target)
print(f'[+] loss_value:\n{loss}')

[+] loss_value:
0.74835604429245


As discussed, the `__call__` to `loss_function` delegates to `forward`, which
delegates to `functional.mse_loss` and returns a tensor for which we can
compute gradients via the `backward` method. 

In addition, we are also able to access the `grad_fn`, including its full trace.

In [8]:
print(loss.grad_fn)
print(loss.grad_fn.next_functions)

<MseLossBackward object at 0x7fde5028aa90>
((<AddmmBackward object at 0x7fde5028a8d0>, 0),)


In [9]:
print(loss.grad_fn.next_functions[0][0]) # Linear (classifier)
print(loss.grad_fn.next_functions[0][0].next_functions[0][0]) # AccumulateGrad for the classifier bias

<AddmmBackward object at 0x7fdde5053350>
<AccumulateGrad object at 0x7fdde5043350>
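
Rather than indexing `next_functions` by hand, the trace can be walked recursively. The helper below is a hypothetical sketch, and it uses a tiny stand-in network rather than the model above; the exact node names printed depend on the PyTorch version.

```python
import torch
import torch.nn as nn

def print_graph(fn, depth=0, max_depth=3):
    # Recursively print the autograd nodes reachable from fn.
    # Entries are None for inputs that do not require gradients.
    if fn is None or depth > max_depth:
        return
    print('  ' * depth + type(fn).__name__)
    for next_fn, _ in fn.next_functions:
        print_graph(next_fn, depth + 1, max_depth)

# A tiny stand-in network (an assumption, not the model above).
tiny_model = nn.Sequential(nn.Linear(4, 3), nn.ReLU(), nn.Linear(3, 2))
tiny_loss = nn.MSELoss()(tiny_model(torch.randn(1, 4)), torch.randn(1, 2))
print_graph(tiny_loss.grad_fn)
```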


From here, computing the `backward` pass is easy. However, it's important
to call `zero_grad` first, because Torch accumulates gradients into the existing 
`.grad` buffers rather than overwriting them.

In [10]:
model.zero_grad()

print(f'[+] convolution_1.bias.grad:\n{model.convolution_1.bias.grad}')

# Recall that loss is tied to the outputs from the model
# and the targets, which is how it traces back to the
# model parameters. (I found this hard to remember.)
loss.backward()

print(f'[+] convolution_1.bias.grad:\n{model.convolution_1.bias.grad}')

[+] convolution_1.bias.grad:
tensor([0., 0., 0., 0., 0., 0.])
[+] convolution_1.bias.grad:
tensor([-0.0015, -0.0170,  0.0022, -0.0032,  0.0084, -0.0028])
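
To see the accumulation behaviour in isolation, here is a small sketch on a single stand-in tensor (not one of the model's parameters). The gradient of `(weight * 2).sum()` with respect to each element is 2, so a second backward pass without zeroing doubles the buffer.

```python
import torch

weight = torch.ones(3, requires_grad=True)

(weight * 2).sum().backward()
first_grad = weight.grad.clone()         # tensor([2., 2., 2.])

# A second backward pass without zeroing adds to the existing buffer.
(weight * 2).sum().backward()
accumulated_grad = weight.grad.clone()   # tensor([4., 4., 4.])

weight.grad.zero_()                      # reset the buffer in place
(weight * 2).sum().backward()
fresh_grad = weight.grad.clone()         # tensor([2., 2., 2.])

print(first_grad, accumulated_grad, fresh_grad)
```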


## Optimization In Torch

Let's start by manually applying the gradient descent update rule.

In [11]:
LEARNING_RATE = 0.01

for param in model.parameters():
    # Subtraction in place
    param.data.sub_(param.grad.data * LEARNING_RATE)
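
Written out, the loop above applies the plain gradient descent update to every parameter. The symbols are introduced here for illustration: $\theta$ stands for a parameter tensor, $\eta$ for `LEARNING_RATE`, and $L$ for the loss.

$$
\theta \leftarrow \theta - \eta \, \nabla_{\theta} L
$$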

This can be simplified with the
[optim](https://pytorch.org/docs/stable/optim.html) package.
Worth noting, this package supports some nice customization,
such as varying learning rates by parameter group (for example, by layer). In addition,
it's possible to start by training a subset of layers, then add 
more to the optimizer later with `add_param_group`.
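
A minimal sketch of both ideas, assuming a small stand-in model split into a `body` and a `head` (hypothetical names, not layers of the network above) and arbitrary learning rates:

```python
import torch.nn as nn
import torch.optim as optim

# A small stand-in model (an assumption, not the network above).
body = nn.Linear(8, 4)
head = nn.Linear(4, 2)

# Start by optimizing only the head, with its own learning rate.
group_optimizer = optim.SGD([{'params': head.parameters(), 'lr': 0.1}], lr=0.01)

# Later, bring the body in as a second group with a smaller learning rate.
group_optimizer.add_param_group({'params': body.parameters(), 'lr': 0.001})

for group in group_optimizer.param_groups:
    print(len(group['params']), group['lr'])
```

Each group holds a weight and a bias here, and any option not set on a group falls back to the default passed to the optimizer's constructor.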

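All optimizers inherit from the Torch `Optimizer` 
[class](https://github.com/pytorch/pytorch/blob/cef0443464a4ff5e3fd2e3b6eca0ee76c5c428ce/torch/optim/optimizer.py#L17),
which serves as the base class: it handles the shared bookkeeping (such as parameter groups 
and `zero_grad`) and leaves `step` to each subclass. 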

To construct an optimizer, we pass in a set of parameters from our model.
As a note from the documentation:

```
Parameters need to be specified as collections that have a deterministic ordering 
that is consistent between runs. Examples of objects that don’t satisfy those 
properties are sets and iterators over values of dictionaries.
```

In the following example, we allocate an `SGD` optimizer. Next,
we zero the gradient buffers. We then compute the forward pass,
compute the loss, and backpropagate to get the gradients of the loss 
with respect to the parameters. Finally, because the optimizer holds 
references to the model parameters, it can read their gradients and 
apply the update. 

In [12]:
import torch.optim as optim

optimizer = optim.SGD(params=model.parameters(), lr=LEARNING_RATE)

# Always zero the gradient buffers before a backward pass, unless you
# deliberately want to accumulate gradients across several passes.
# Momentum and Nesterov momentum are built in: see the `momentum`
# and `nesterov` arguments of `optim.SGD`.
optimizer.zero_grad()
output = model(data)
loss = loss_function(output, target)
loss.backward()
optimizer.step()
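
In practice these four calls are repeated many times. The sketch below wraps them in a short loop; the small model, the random data, and the step count are assumptions made so the snippet runs on its own, not part of the example above.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
# Stand-in model and data (assumptions made so the snippet is self-contained).
loop_model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
loop_data, loop_target = torch.randn(16, 4), torch.randn(16, 2)
loop_loss_function = nn.MSELoss()
loop_optimizer = optim.SGD(loop_model.parameters(), lr=0.01)

losses = []
for step in range(20):
    loop_optimizer.zero_grad()    # clear old gradients
    loop_loss = loop_loss_function(loop_model(loop_data), loop_target)
    loop_loss.backward()          # compute new gradients
    loop_optimizer.step()         # apply the update
    losses.append(loop_loss.item())

print(f'first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}')
```

With a fixed batch and a small learning rate, the recorded loss should shrink from step to step, which is a handy check that the loop is wired up correctly.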