# PyTorch's Autograd Engine

## Logistic Regression Forward Pass (Initial Prediction)

In [2]:
import torch
import torch.nn.functional as F

In [7]:
# true label
y = torch.tensor([1.0])
# input feature
x1 = torch.tensor([1.3])
# weight
w1 = torch.tensor([2.0])
# bias
b1 = torch.tensor([0.5])

In [None]:
# weighted sum
z1 = w1 * x1 + b1
# sigmoid activation
y_hat = torch.sigmoid(z1)
# binary cross-entropy loss
loss = F.binary_cross_entropy(y_hat, y)

The non-linear sigmoid activation is applied to the output of the linear transformation `z1 = w1 * x1 + b1`:

$\sigma(z) = \frac{1}{1 + e^{-z}}$

In [11]:
print(f"Weighted sum (z1): {z1}")
print(f"Predicted value (y_hat): {y_hat}")
print(f"Loss: {loss.item()}\n")

Weighted sum (z1): tensor([3.1000])
Predicted value (y_hat): tensor([0.9569])
Loss: 0.0440639853477478
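
To connect the printed numbers to the formula, the forward pass can be reproduced in plain Python. This is only a sketch: the names `sigmoid`, `x_val`, `w_val`, `b_val` and `y_val` are introduced here for illustration, and the loss is written in the single-example form of binary cross-entropy, which is assumed to be what `F.binary_cross_entropy` reduces to for one sample.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

# same values as the tensors above, as plain floats (illustrative names)
x_val, w_val, b_val, y_val = 1.3, 2.0, 0.5, 1.0

# weighted sum
z = w_val * x_val + b_val
# sigmoid activation
y_hat_manual = sigmoid(z)
# binary cross-entropy for a single example
bce = -(y_val * math.log(y_hat_manual) + (1 - y_val) * math.log(1 - y_hat_manual))

print(f"z = {z:.4f}, y_hat = {y_hat_manual:.4f}, loss = {bce:.4f}")
```

The three values should agree with the PyTorch output above to about four decimal places.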



The goal is to **minimize** the loss, but how do we do that? 

Neural networks are typically trained with *gradient descent*, which uses the gradient of the loss to decide how much, and in which direction, the network's weights should be tweaked. The gradient is a vector of partial derivatives, one per parameter; each entry says how the loss changes when that parameter is nudged.

The mathematical tool behind calculating gradients is the *chain rule* of calculus, which lets you get the derivative of the loss with respect to the weights. With the plain vanilla approach implemented above, we'd have to go backwards through each line of code and calculate the derivative of the result with respect to the variable of interest. What's nice about PyTorch is that it includes an *autograd* engine: it records the operations performed on tensors that require gradients and does the dirty work for you.
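
To make "going backwards through each line" concrete, here is a hand-written chain rule for the forward pass above, in plain Python. It is a sketch rather than what autograd does internally; the variable names (`dL_dyhat`, `dyhat_dz`, and so on) are made up for this example, and the derivative formulas are the standard ones for binary cross-entropy and the sigmoid.

```python
import math

# same values as the tensors above, as plain floats (illustrative names)
x_val, w_val, b_val, y_val = 1.3, 2.0, 0.5, 1.0

# forward pass
z = w_val * x_val + b_val
y_hat_manual = 1.0 / (1.0 + math.exp(-z))

# backward pass, one local derivative per forward line
dL_dyhat = -(y_val / y_hat_manual) + (1 - y_val) / (1 - y_hat_manual)  # d(BCE)/d(y_hat)
dyhat_dz = y_hat_manual * (1 - y_hat_manual)                          # d(sigmoid)/dz
dz_dw = x_val                                                         # d(w*x + b)/dw
dz_db = 1.0                                                           # d(w*x + b)/db

# chain rule: multiply the local derivatives along the path to each parameter
dL_dw = dL_dyhat * dyhat_dz * dz_dw
dL_db = dL_dyhat * dyhat_dz * dz_db

print(f"dL/dw = {dL_dw:.4f}, dL/db = {dL_db:.4f}")
```

Note that `dL_dyhat * dyhat_dz` simplifies to `y_hat - y`, which is why the sigmoid and binary cross-entropy pair so conveniently.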

So if we implement the above calculation again, but with `requires_grad=True` for `w1` and `b1`, we'll be able to get the "direction" and magnitude of the change we need to make to the weight and bias to minimize the loss.

> The derivative of $y$ with respect to $x$ means: How much does $y$ change if I change $x$ by a certain amount? 

> Quotes around "direction" because you actually need to step in the opposite direction of the gradient: the gradient points toward increasing loss, and the goal is to *minimize* the loss, not make it bigger.
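
Both notes can be checked numerically with a finite difference: nudge the weight a little and see how the loss responds. The sketch below uses plain Python; `loss_at` and `eps` are hypothetical names chosen for this example, and the step size `1e-4` is an arbitrary small value.

```python
import math

def loss_at(w: float, b: float, x: float = 1.3, y: float = 1.0) -> float:
    """Binary cross-entropy of sigmoid(w * x + b) against the label y."""
    y_hat = 1.0 / (1.0 + math.exp(-(w * x + b)))
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

eps = 1e-4  # size of the nudge (arbitrary small value)

# central difference: change in loss per unit change in w, around w = 2.0
slope_w = (loss_at(2.0 + eps, 0.5) - loss_at(2.0 - eps, 0.5)) / (2 * eps)
print(f"approximate dL/dw: {slope_w:.4f}")

# the slope is negative, so moving w *against* it (increasing w) lowers the loss
print(loss_at(2.1, 0.5) < loss_at(2.0, 0.5))
```

A negative slope means the loss falls as `w` grows, so stepping opposite to the gradient here means increasing the weight.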

In [13]:
from torch.autograd import grad

In [None]:

# true label
y = torch.tensor([1.0])
# input feature
x1 = torch.tensor([1.3])
# weight
w1 = torch.tensor([2.0], requires_grad=True)
# bias
b1 = torch.tensor([0.5], requires_grad=True)
# weighted sum
z1 = w1 * x1 + b1
# sigmoid activation
y_hat = torch.sigmoid(z1)
# binary cross-entropy loss
loss = F.binary_cross_entropy(y_hat, y)

In [14]:
# NOTE: by default the computation graph is freed after a gradient call;
# retain_graph=True keeps it so we can call grad() again and use loss.backward() later
grad_L_w1 = grad(loss, w1, retain_graph=True)
grad_L_b1 = grad(loss, b1, retain_graph=True)

In [15]:
print(f"Partial gradient of loss with respect to w1: {grad_L_w1[0]}")
print(f"Partial gradient of loss with respect to b1: {grad_L_b1[0]}")

Partial gradient of loss with respect to w1: tensor([-0.0560])
Partial gradient of loss with respect to b1: tensor([-0.0431])
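
As a side note on `retain_graph`, the sketch below shows two things: `grad` accepts a list of inputs, so both gradients can be requested in one call without retaining the graph, and a second `grad` call on the same loss fails if the graph was not retained. The helper `build_loss` and the `_demo` names are made up for this example so they do not overwrite the tensors used in the cells above.

```python
import torch
import torch.nn.functional as F
from torch.autograd import grad

def build_loss():
    """Rebuild the forward pass and return the loss with its two parameters."""
    y_demo = torch.tensor([1.0])
    x_demo = torch.tensor([1.3])
    w_demo = torch.tensor([2.0], requires_grad=True)
    b_demo = torch.tensor([0.5], requires_grad=True)
    loss_demo = F.binary_cross_entropy(torch.sigmoid(w_demo * x_demo + b_demo), y_demo)
    return loss_demo, w_demo, b_demo

# 1) one call for both gradients: the graph is only traversed once
loss_demo, w_demo, b_demo = build_loss()
grad_w_demo, grad_b_demo = grad(loss_demo, [w_demo, b_demo])
print(grad_w_demo, grad_b_demo)

# 2) two separate calls without retain_graph: the second one raises
loss_demo, w_demo, b_demo = build_loss()
grad(loss_demo, w_demo)  # frees the graph's saved tensors
try:
    grad(loss_demo, b_demo)
    second_call_failed = False
except RuntimeError as err:
    second_call_failed = True
    print(f"RuntimeError: {err}")
```

Requesting all gradients in one call is usually the simpler choice; `retain_graph=True` is mainly useful when the same graph really has to be traversed more than once, as in this notebook.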


In [16]:
# You can also call .backward() on the loss tensor; the gradients are then
# stored in the .grad attribute of each tensor created with requires_grad=True
loss.backward()
print(f"Gradient of loss with respect to w1 using backward(): {w1.grad}")
print(f"Gradient of loss with respect to b1 using backward(): {b1.grad}")

Gradient of loss with respect to w1 using backward(): tensor([-0.0560])
Gradient of loss with respect to b1 using backward(): tensor([-0.0431])
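
To close the loop on minimizing the loss, here is a minimal sketch of a single gradient descent step using the gradients from `.backward()`. The learning rate `lr = 0.1` is an arbitrary value picked for illustration, and the update is written by hand rather than with an optimizer from `torch.optim`.

```python
import torch
import torch.nn.functional as F

target = torch.tensor([1.0])
x = torch.tensor([1.3])
w = torch.tensor([2.0], requires_grad=True)
b = torch.tensor([0.5], requires_grad=True)
lr = 0.1  # learning rate (arbitrary value for illustration)

# forward and backward pass
loss_before = F.binary_cross_entropy(torch.sigmoid(w * x + b), target)
loss_before.backward()

# update step: move each parameter against its gradient;
# no_grad() keeps the update itself out of the computation graph
with torch.no_grad():
    w -= lr * w.grad
    b -= lr * b.grad

# reset the gradients, since backward() accumulates into .grad
w.grad.zero_()
b.grad.zero_()

# forward pass with the updated parameters
loss_after = F.binary_cross_entropy(torch.sigmoid(w * x + b), target)
print(f"loss before: {loss_before.item():.6f}, loss after: {loss_after.item():.6f}")
```

Because both gradients are negative, subtracting them increases `w` and `b` slightly, and the loss after the step should be a little smaller than before.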
