# Background 
    - Neural Networks (NN) are a collection of nested functions that are executed on some input data. These functions are defined by parameters (consisting of weights and biases), which in PyTorch are stored in tensors
    - Training a NN happens in 2 steps:
        + Forward Propagation
            In forward prop, the NN makes its best guess about the correct output. It runs the input data through each of its functions to make this guess
        + Backward Propagation
            In backprop, the NN adjusts its parameters in proportion to the error in its guess. It does this by traversing backwards from the output, collecting the derivatives of the error with respect to the parameters of the functions (the gradients), and optimizing the parameters using gradient descent
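
The two steps above can be sketched on a toy problem. This is a minimal illustration, assuming a hypothetical one-parameter model `y_hat = w * x` and a squared error. The names `w`, `x`, `y`, `lr`, and `w_new` are chosen for this example only.

```python
import torch

# Hypothetical one-parameter "network": y_hat = w * x
w = torch.tensor(1.0, requires_grad=True)  # the parameter to learn
x = torch.tensor(2.0)                      # input
y = torch.tensor(6.0)                      # correct output

# Forward propagation: make a guess and measure the error
y_hat = w * x                # 1.0 * 2.0 = 2.0
error = (y_hat - y) ** 2     # (2.0 - 6.0)^2 = 16.0

# Backward propagation: autograd computes d(error)/dw
error.backward()             # 2 * (w*x - y) * x = 2 * (-4) * 2 = -16

# One gradient descent update: move against the gradient
lr = 0.1
with torch.no_grad():
    w_new = w - lr * w.grad  # 1.0 - 0.1 * (-16) = 2.6

print(w.grad, w_new)
```

The gradient is negative, so the update increases `w`, which moves `w * x` closer to the target `y`.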

# Usage in PyTorch
    - Look at an example in a single training step
    - Load a pretrained resnet18 model from `torchvision`
    - Create a random data tensor to represent a single image with 3 channels, and height & width of 64, and its corresponding label initialized to some random values.
    - The label has shape (1, 1000), matching the pretrained model's 1000-class output

## Initialize the model

In [18]:
import torch
from torchvision.models import resnet18, ResNet18_Weights

model = resnet18(weights=ResNet18_Weights.DEFAULT)
data = torch.rand(1, 3, 64, 64)
labels = torch.rand(1, 1000)

print(data)

tensor([[[[0.7945, 0.9212, 0.0086,  ..., 0.2584, 0.1421, 0.5565],
          [0.6169, 0.1654, 0.6131,  ..., 0.1511, 0.1739, 0.5895],
          [0.6667, 0.9735, 0.9086,  ..., 0.5483, 0.1277, 0.9722],
          ...,
          [0.7525, 0.6982, 0.4203,  ..., 0.9707, 0.6118, 0.3621],
          [0.5401, 0.0985, 0.5641,  ..., 0.2159, 0.8793, 0.3070],
          [0.0022, 0.8326, 0.3665,  ..., 0.8551, 0.1750, 0.9269]],

         [[0.9343, 0.6186, 0.5230,  ..., 0.8610, 0.0111, 0.7296],
          [0.4398, 0.5608, 0.9776,  ..., 0.3988, 0.8763, 0.1165],
          [0.5559, 0.1368, 0.8587,  ..., 0.0586, 0.2351, 0.9759],
          ...,
          [0.0815, 0.5179, 0.0743,  ..., 0.4192, 0.9677, 0.0203],
          [0.3231, 0.8134, 0.7592,  ..., 0.2977, 0.1579, 0.8583],
          [0.4822, 0.8466, 0.4989,  ..., 0.9467, 0.9577, 0.7412]],

         [[0.1369, 0.6842, 0.7579,  ..., 0.7408, 0.1384, 0.3745],
          [0.2588, 0.0715, 0.4255,  ..., 0.4197, 0.0072, 0.5791],
          [0.4399, 0.2203, 0.3946,  ..., 0

## Forward pass
    - Run the input data through each of the model's layers to make a prediction
    - This is a forward pass

In [19]:
prediction = model(data)

print(prediction)

tensor([[-4.9497e-01, -3.7642e-01, -6.6297e-01, -1.8168e+00, -9.5568e-01,
         -2.4242e-01, -6.4858e-01,  5.8965e-01,  3.1449e-01, -1.2833e+00,
         -9.9062e-01, -1.0028e+00, -7.0625e-01, -1.3250e+00, -1.4976e+00,
         -5.8581e-01, -9.7423e-01, -5.9386e-01, -8.8387e-01, -7.8455e-01,
         -1.7550e+00, -7.6758e-01, -1.7041e+00, -1.5296e-01, -1.3841e+00,
         -1.0366e+00, -5.8035e-01, -1.0775e+00, -7.5102e-01, -3.4078e-01,
         -8.0396e-01, -7.7026e-01, -2.8365e-01, -5.3995e-01, -2.7314e-01,
         -3.4115e-01,  7.8332e-01, -5.2909e-01, -3.5916e-01,  7.7009e-02,
         -9.8506e-01, -8.0531e-01, -1.1481e+00, -6.7082e-01, -6.4001e-01,
         -3.4048e-01, -8.1694e-01, -3.9585e-01, -1.3631e+00, -1.1054e+00,
         -7.8536e-01,  3.9908e-01, -2.5636e-01, -4.4092e-01,  1.5435e-02,
         -1.1366e+00, -2.1440e-01, -1.3917e+00, -5.8990e-01, -4.1031e-01,
          6.6778e-01, -9.7795e-02, -1.9884e-01,  3.4840e-01, -9.2255e-01,
         -3.4091e-01, -3.5348e-01, -4.
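
The printed prediction holds one raw score per class rather than a probability. As a hedged sketch of how such scores are commonly read, the snippet below uses a small made-up `logits` tensor (3 classes instead of 1000) as a stand-in for the model output, so it runs without downloading weights.

```python
import torch

# Stand-in for a model output: 1 sample, 3 classes (illustrative values only)
logits = torch.tensor([[0.1, 2.0, -1.0]])

# Softmax turns the raw scores into probabilities that sum to 1 per row
probs = torch.softmax(logits, dim=1)

# The predicted class is the index of the largest score
predicted = probs.argmax(dim=1).item()

print(probs, predicted)
```

Softmax preserves the ordering of the scores, so `argmax` gives the same class on the raw scores as on the probabilities.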

## Backward propagation
    - Use the model prediction to calculate the error (`loss`)
    - Next step is to backpropagate this error through the network
    - Backpropagation is kicked off when we call `.backward()` on the error tensor
    - Autograd then calculates and stores the gradients for each model parameter in the parameter's `.grad` attribute

In [20]:
loss = (prediction - labels).sum()
loss.backward()

print(loss)

tensor(-491.6093, grad_fn=<SumBackward0>)
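
To see what `.backward()` leaves behind, the sketch below repeats the same loss on a much smaller model. It assumes a `torch.nn.Linear(4, 2)` layer as a stand-in for resnet18. The names `tiny_model`, `tiny_data`, and `tiny_labels` are illustrative and not part of the notebook above.

```python
import torch

torch.manual_seed(0)

# Small stand-in for resnet18 so the gradients are easy to inspect
tiny_model = torch.nn.Linear(4, 2)
tiny_data = torch.rand(1, 4)
tiny_labels = torch.rand(1, 2)

# Before the backward pass no gradients have been stored
grads_before = [p.grad for p in tiny_model.parameters()]  # [None, None]

# Same error definition as in the cell above
tiny_loss = (tiny_model(tiny_data) - tiny_labels).sum()
tiny_loss.backward()

# After the backward pass each parameter has a .grad of its own shape
for name, p in tiny_model.named_parameters():
    print(name, tuple(p.shape), tuple(p.grad.shape))
```

Because the loss is a plain sum, the gradient with respect to each bias entry is 1, and each row of the weight gradient equals the input.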


## Optimize
    - Next step is to load an optimizer, in this case SGD with a learning rate of 0.01 and `momentum` of 0.9
    - Register all the parameters of the model in the optimizer
    - SGD = Stochastic Gradient Descent
    - SGD with momentum accelerates updates along directions where successive gradients agree and damps oscillation where they disagree, which usually leads to faster convergence. It can also help the optimizer roll past shallow local minima and saddle points, though it does not guarantee reaching the global minimum
    - Momentum = a running, exponentially weighted accumulation of past gradients
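
The momentum update can be written out by hand. The sketch below assumes the update rule that `torch.optim.SGD` documents for `dampening=0` and `nesterov=False`: `buf = momentum * buf + grad`, then `param -= lr * buf`, with the buffer starting as the first gradient. It compares that rule against the optimizer on a hypothetical one-parameter loss `p ** 2`. The names `p`, `opt`, `manual_p`, and `buf` are illustrative.

```python
import torch

lr, mu = 0.01, 0.9

# Parameter updated by the real optimizer
p = torch.tensor([1.0], requires_grad=True)
opt = torch.optim.SGD([p], lr=lr, momentum=mu)

# Hand-rolled copy of the same parameter and its momentum buffer
manual_p, buf = 1.0, None

for _ in range(3):
    # Optimizer path
    opt.zero_grad()
    loss = (p ** 2).sum()
    loss.backward()
    opt.step()

    # Manual path: gradient of p^2 is 2p
    g = 2.0 * manual_p
    buf = g if buf is None else mu * buf + g  # accumulate past gradients
    manual_p -= lr * buf                      # step along the buffer

print(p.item(), manual_p)
```

Under this rule the new gradient is not scaled by `(1 - momentum)`, so the buffer is a decaying sum of past gradients rather than a normalized average.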


In [21]:
optim = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)

print(optim)

SGD (
Parameter Group 0
    dampening: 0
    differentiable: False
    foreach: None
    lr: 0.01
    maximize: False
    momentum: 0.9
    nesterov: False
    weight_decay: 0
)


## Step
    - Finally, call `.step()` to initiate gradient descent (a single parameter update, not a full epoch)
    - The optimizer adjusts each parameter using its gradient stored in `.grad`

In [22]:
# Gradient Descent
optim.step()

print(optim)

SGD (
Parameter Group 0
    dampening: 0
    differentiable: False
    foreach: None
    lr: 0.01
    maximize: False
    momentum: 0.9
    nesterov: False
    weight_decay: 0
)
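
Printing the optimizer only shows its hyperparameters, so the output looks the same before and after `.step()`. To observe the update itself, compare a parameter before and after the step. The sketch below assumes a small `torch.nn.Linear(4, 2)` stand-in model and plain SGD without momentum so the expected value is easy to check. The names `tiny_model`, `tiny_optim`, `before`, and `after` are illustrative.

```python
import torch

torch.manual_seed(0)

# Small stand-in model; momentum is left out so the update is lr * grad
tiny_model = torch.nn.Linear(4, 2)
tiny_optim = torch.optim.SGD(tiny_model.parameters(), lr=1e-2)

# One forward and backward pass with the same style of loss as above
tiny_loss = (tiny_model(torch.rand(1, 4)) - torch.rand(1, 2)).sum()
tiny_loss.backward()

# Snapshot the weights and gradient, then take one optimizer step
before = tiny_model.weight.detach().clone()
grad = tiny_model.weight.grad.clone()
tiny_optim.step()
after = tiny_model.weight.detach().clone()

print((after - before) / grad)  # every entry is approximately -lr

# Gradients accumulate across backward calls, so clear them before the next one
tiny_optim.zero_grad()
```

In a real training loop, these calls repeat once per batch, with `zero_grad()` keeping one batch's gradients from leaking into the next.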
