# **Automatic Differentiation with `torch.autograd`**


When training neural networks, the most frequently used algorithm is
**back propagation**. In this algorithm, parameters (model weights) are
adjusted according to the **gradient** of the loss function with respect
to the given parameter.

To compute those gradients, PyTorch has a built-in differentiation
engine called `torch.autograd`. It supports automatic computation of
gradients for any computational graph.

The examples below start with scalar functions and build up to the
simplest one-layer neural network, with input `x`, parameters `w` and
`b`, and a binary cross-entropy loss function.

## **Example - 01**

### **Manual Computation**

In [1]:
def dy_dx(x):
    # Analytic derivative of y = x**2
    return 2*x

In [2]:
dy_dx(3)

6

### **Using `autograd`**

In [3]:
import torch

In [4]:
x = torch.tensor(3.0, requires_grad=True)
x

tensor(3., requires_grad=True)

In [5]:
y = x**2
y

tensor(9., grad_fn=<PowBackward0>)

In [6]:
y.backward()
y

tensor(9., grad_fn=<PowBackward0>)

In [7]:
x.grad

tensor(6.)
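
As a sanity check, the autograd result can also be compared against a numerical estimate. The sketch below uses a central finite difference; the helper `numeric_derivative`, the function `square`, and the step size `h` are illustrative names and choices introduced here, not part of PyTorch.

```python
import torch


def numeric_derivative(f, x0, h=1e-6):
    """Central finite-difference estimate of df/dx at x0 (illustrative helper)."""
    return (f(x0 + h) - f(x0 - h)) / (2 * h)


def square(v):
    return v ** 2


# float64 keeps the round-off error of the finite difference small
x = torch.tensor(3.0, dtype=torch.float64, requires_grad=True)
y = square(x)
y.backward()

autograd_grad = x.grad.item()
numeric_grad = numeric_derivative(square, 3.0)

print(autograd_grad)
print(numeric_grad)
```

The two values should agree to several decimal places. The finite difference is only an approximation, while autograd applies the exact derivative rule for each operation.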

-----

## **Example - 02**

### **Manual Computation**

In [8]:
import math


def dz_dx(x):
    # Analytic derivative of z = sin(x**2), obtained with the chain rule
    return 2 * x * math.cos(x**2)

In [9]:
dz_dx(4)

-7.661275842587077

### **Using `autograd`**

In [10]:
x = torch.tensor(4.0, requires_grad=True)
x

tensor(4., requires_grad=True)

In [11]:
y = x ** 2
y

tensor(16., grad_fn=<PowBackward0>)

In [12]:
z = torch.sin(y)
z

tensor(-0.2879, grad_fn=<SinBackward0>)

In [13]:
z.backward()
z

tensor(-0.2879, grad_fn=<SinBackward0>)

In [14]:
x.grad

tensor(-7.6613)
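
The value in `x.grad` is the product of the two chain-rule factors, `dz/dy` and `dy/dx`. A minimal sketch that exposes the intermediate factor is shown below. It relies on `retain_grad()`, which asks autograd to keep the gradient of a non-leaf tensor; the variable names `dz_dy` and `dy_dx` are illustrative.

```python
import math

import torch

x = torch.tensor(4.0, requires_grad=True)
y = x ** 2
y.retain_grad()  # keep the gradient of this intermediate (non-leaf) tensor
z = torch.sin(y)
z.backward()

dz_dy = y.grad            # cos(y) evaluated at y = 16
dy_dx = 2 * x.detach()    # 2x evaluated at x = 4

print(dz_dy, dy_dx)
print(dz_dy * dy_dx, x.grad)  # the product matches x.grad
```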

----

## **Example - 03**

### **Computational graph**


The code in this example defines the following **computational graph**:

![Computation Graph](https://pytorch.org/tutorials/_static/img/basics/comp-graph.png)

In this network, `w` and `b` are **parameters**, which we need to
optimize. Thus, we need to be able to compute the gradients of the loss
function with respect to those variables. In order to do that, we set
the `requires_grad` property of those tensors.
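
A short sketch of how the flag can be set and inspected follows. It uses the same values as this example; setting the flag after creation with `requires_grad_()` is shown as an alternative, and the printed attributes (`is_leaf`, `grad_fn`) are only there to illustrate what autograd records.

```python
import torch

x = torch.tensor(6.7)                      # input: no gradient needed
w = torch.tensor(1.0, requires_grad=True)  # parameter: flag set at creation
b = torch.tensor(0.0)
b.requires_grad_(True)                     # parameter: flag set afterwards, in place

z = w * x + b  # result of tracked operations

print(x.requires_grad, w.requires_grad, b.requires_grad)
print(z.requires_grad, z.is_leaf, z.grad_fn is not None)
```

Tensors created directly by the user, such as `w` and `b`, are leaf tensors. `z` is produced by an operation, so it is not a leaf and carries a `grad_fn`.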


### **Manual Computation**

In [15]:
import torch

# Inputs
x = torch.tensor(6.7)  # Input feature
y = torch.tensor(0.0)  # True label (binary)

w = torch.tensor(1.0)  # Weight
b = torch.tensor(0.0)  # Bias

In [16]:
# Binary Cross-Entropy Loss for scalar
def binary_cross_entropy_loss(prediction, target):
    epsilon = 1e-8  # To prevent log(0)
    prediction = torch.clamp(prediction, epsilon, 1 - epsilon)
    return -(target * torch.log(prediction) + (1 - target) * torch.log(1 - prediction))

In [17]:
# Forward pass
z = w * x + b  # Weighted sum (linear part)

y_pred = torch.sigmoid(z)  # Predicted probability

# Compute binary cross-entropy loss
loss = binary_cross_entropy_loss(y_pred, y)

In [18]:
loss

tensor(6.7012)

In [19]:
# Derivatives:
# 1. dL/d(y_pred): Loss with respect to the prediction (y_pred)
dloss_dy_pred = (y_pred - y)/(y_pred*(1-y_pred))

# 2. dy_pred/dz: Prediction (y_pred) with respect to z (sigmoid derivative)
dy_pred_dz = y_pred * (1 - y_pred)

# 3. dz/dw and dz/db: z with respect to w and b
dz_dw = x  # dz/dw = x
dz_db = 1  # dz/db = 1 (bias contributes directly to z)

dL_dw = dloss_dy_pred * dy_pred_dz * dz_dw
dL_db = dloss_dy_pred * dy_pred_dz * dz_db

In [20]:
print(f"Manual Gradient of loss w.r.t weight (dw): {dL_dw}")
print(f"Manual Gradient of loss w.r.t bias (db): {dL_db}")

Manual Gradient of loss w.r.t weight (dw): 6.691762447357178
Manual Gradient of loss w.r.t bias (db): 0.998770534992218
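
The three factors multiplied above simplify to a compact expression, which gives a quick check on the printed values. Writing $\hat{y}$ for `y_pred` and $\sigma$ for the sigmoid (notation introduced here for brevity):

$$
\frac{\partial L}{\partial w} = \frac{\hat{y} - y}{\hat{y}(1 - \hat{y})} \cdot \hat{y}(1 - \hat{y}) \cdot x = (\hat{y} - y)\,x,
\qquad
\frac{\partial L}{\partial b} = \frac{\hat{y} - y}{\hat{y}(1 - \hat{y})} \cdot \hat{y}(1 - \hat{y}) \cdot 1 = \hat{y} - y
$$

With $y = 0$, $x = 6.7$ and $\hat{y} = \sigma(6.7) \approx 0.9988$, this gives $\partial L / \partial b \approx 0.9988$ and $\partial L / \partial w \approx 6.69$, in line with the output above.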


### **Using `autograd`**

In [21]:
x = torch.tensor(6.7)
y = torch.tensor(0.0)
print(x)
print(y)

tensor(6.7000)
tensor(0.)


In [22]:
w = torch.tensor(1.0, requires_grad=True)
b = torch.tensor(0.0, requires_grad=True)
print(w)
print(b)

tensor(1., requires_grad=True)
tensor(0., requires_grad=True)


In [23]:
z = w*x + b
z

tensor(6.7000, grad_fn=<AddBackward0>)

In [24]:
y_pred = torch.sigmoid(z)
y_pred

tensor(0.9988, grad_fn=<SigmoidBackward0>)

In [25]:
loss = binary_cross_entropy_loss(y_pred, y)
loss

tensor(6.7012, grad_fn=<NegBackward0>)

### **Computing Gradients**

To optimize the parameters of the neural network, we need to
compute the derivatives of the loss function with respect to them:
$\frac{\partial loss}{\partial w}$ and
$\frac{\partial loss}{\partial b}$ for fixed values of `x` and
`y`. To compute those derivatives, we call `loss.backward()` and then
read the values from `w.grad` and `b.grad`:


In [26]:
loss.backward()

In [27]:
print(f"By using autograd function (dw): {w.grad}")
print(f"By using autograd function (db): {b.grad}")

By using autograd function (dw): 6.6917619705200195
By using autograd function (db): 0.9987704753875732
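
The hand-written loss clamps the prediction to avoid `log(0)`. As a hedged alternative, the sketch below repeats the computation with PyTorch's built-in `torch.nn.functional.binary_cross_entropy_with_logits`, which takes the raw score `z` rather than `y_pred`. Swapping in this function is a suggestion made here, not something the example above relies on.

```python
import torch
import torch.nn.functional as F

x = torch.tensor(6.7)
y = torch.tensor(0.0)
w = torch.tensor(1.0, requires_grad=True)
b = torch.tensor(0.0, requires_grad=True)

z = w * x + b
# Applies the sigmoid and the cross-entropy together in a numerically stable way
loss = F.binary_cross_entropy_with_logits(z, y)
loss.backward()

print(loss.item(), w.grad.item(), b.grad.item())
```

The loss and both gradients should agree with the values above to about three decimal places.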


----

## **Using Vector Input Tensor**

$$
X = [X_1, X_2, X_3]
$$

$$
X = [1.0, 2.0, 3.0]
$$

In [28]:
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
x

tensor([1., 2., 3.], requires_grad=True)

$$
Y = \operatorname{mean}(X^2)
$$

$$
Y = \frac{(X_1)^2 + (X_2)^2 + (X_3)^2}{3}
$$

$$
Y = f(X_1, X_2, X_3)
$$

In [29]:
y = (x**2).mean()
y

tensor(4.6667, grad_fn=<MeanBackward0>)

$$
\frac{\partial Y}{\partial X_1} = \frac{2X_1}{3}
$$

$$
\frac{\partial Y}{\partial X_2} = \frac{2X_2}{3}
$$

$$
\frac{\partial Y}{\partial X_3} = \frac{2X_3}{3}
$$


In [30]:
y.backward()

$$
\frac{\partial Y}{\partial X_1} = \frac{2}{3} \approx 0.67
$$

$$
\frac{\partial Y}{\partial X_2} = \frac{4}{3} \approx 1.33
$$

$$
\frac{\partial Y}{\partial X_3} = \frac{6}{3} = 2
$$


In [31]:
x.grad

tensor([0.6667, 1.3333, 2.0000])
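
The closed-form result and the autograd result can be compared element by element. This is a minimal sketch; the name `expected` is introduced here for the hand-derived gradient.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = (x ** 2).mean()
y.backward()

# Hand-derived gradient: dY/dX_i = 2 * X_i / 3
expected = 2 * x.detach() / x.numel()

print(x.grad)
print(torch.allclose(x.grad, expected))
```

Note that `x.grad` has the same shape as `x`: one partial derivative per input element.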

-----

## **Clearing Gradients**

An important behavior of `autograd` in PyTorch is **gradient accumulation**. Each time you call `.backward()`, the newly computed gradients are added to (accumulated in) the `.grad` attributes of the leaf tensors in the graph that have `requires_grad=True`. This means that if you run the backward pass multiple times without clearing the gradients, the values in `.grad` hold the sum of all gradients computed so far rather than the most recent one.

**Why does this happen?**  
This behavior is useful when training neural networks using mini-batches. You might want to accumulate gradients over several batches before updating the model parameters.

**How to manage gradient accumulation:**  
- To avoid unwanted accumulation, always clear gradients before a new backward pass using `.zero_()` on the `.grad` attribute:
    ```python
    x.grad.zero_()
    ```
- Alternatively, use `optimizer.zero_grad()` when working with optimizers.

**Summary:**  
- `.backward()` accumulates gradients in `.grad`.
- Always clear gradients before a new backward pass unless you intentionally want to accumulate them.
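
The mini-batch use case mentioned above can be sketched as follows. The toy loss, the two `batches`, and the learning rate are hypothetical values chosen for illustration; only the pattern of calling `backward()` several times before a single `optimizer.step()` matters.

```python
import torch

w = torch.tensor(1.0, requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.1)

# Two hypothetical mini-batches standing in for one larger batch
batches = [torch.tensor([1.0, 2.0]), torch.tensor([3.0, 4.0])]

optimizer.zero_grad()
for batch in batches:
    # Divide by the number of batches so the sum equals the full-batch mean
    loss = ((w * batch) ** 2).mean() / len(batches)
    loss.backward()  # adds this batch's gradient to w.grad

accumulated = w.grad.clone()
optimizer.step()       # one update using the accumulated gradient
optimizer.zero_grad()  # clear before the next round of accumulation

print(accumulated, w)
```

Depending on the PyTorch version, `optimizer.zero_grad()` either resets `.grad` to zeros or sets it to `None`. Both leave the next backward pass starting from a clean state.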

In [32]:
x = torch.tensor(2.0, requires_grad=True)
x

tensor(2., requires_grad=True)

In [33]:
y = x ** 2
y

tensor(4., grad_fn=<PowBackward0>)

In [34]:
y.backward()

In [35]:
x.grad

tensor(4.)

In [36]:
x.grad.zero_()

tensor(0.)
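
The cells above clear the gradient after a single backward pass, so the accumulation itself is not visible. A minimal sketch that makes it visible is shown below; the names `first`, `second`, and `after_reset` are illustrative.

```python
import torch

x = torch.tensor(2.0, requires_grad=True)

(x ** 2).backward()
first = x.grad.clone()        # 2 * x = 4

(x ** 2).backward()           # second pass adds another 4
second = x.grad.clone()

x.grad.zero_()                # reset the stored gradient in place
(x ** 2).backward()
after_reset = x.grad.clone()  # back to 4

print(first, second, after_reset)
```

The expression `x ** 2` is rebuilt before each call because, by default, the graph is freed once `backward()` has run.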

---

## **Disabling Gradient Tracking**

Sometimes, we may need to perform computations without tracking gradients or calculating derivatives.

**Scenarios Where Disabling Gradient Tracking is Useful**

- **Model Inference:**  
    When making predictions with a trained model, gradients are not needed.

- **Model Evaluation:**  
    During validation or testing phases, to save memory and computation.

- **Feature Extraction:**  
    When using a model to extract features from data without updating weights.

- **Saving/Loading Model Outputs:**  
    When storing intermediate results for later use.

- **Visualizations:**  
    When plotting or analyzing outputs that do not require gradients.

- **Deployment:**  
    In production environments where only forward passes are performed.

In such cases, PyTorch provides several ways to disable gradient tracking:

- **Set `requires_grad` to `False`:**  
    You can turn off gradient tracking for a tensor by calling `requires_grad_(False)`, which sets its `requires_grad` attribute to `False` in place.

- **Detach a tensor:**  
    Use `.detach()` to create a new tensor that does not require gradients and is disconnected from the computation graph.

- **Use `torch.no_grad()` context:**  
    Wrap your code inside a `with torch.no_grad():` block to temporarily disable gradient tracking for all operations within the block.

Disabling gradient tracking is useful for inference, evaluation, or any scenario where derivatives are not needed, as it reduces memory usage and speeds up computations.

In [37]:
x = torch.tensor(2.0, requires_grad=True)
x

tensor(2., requires_grad=True)

In [38]:
y = x ** 2
y

tensor(4., grad_fn=<PowBackward0>)

In [39]:
y.backward()

In [40]:
x.grad

tensor(4.)

### **Using option 1 - `requires_grad_(False)`**

In [41]:
x.requires_grad_(False)

tensor(2.)

In [42]:
x

tensor(2.)

In [43]:
y = x ** 2

In [44]:
y

tensor(4.)

In [45]:
# y does not require gradients and has no grad_fn, so this would raise a RuntimeError:

# y.backward()
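
To see the failure without interrupting the notebook, the call can be wrapped in `try`/`except`. This is a small sketch; the flag `raised` is an illustrative name.

```python
import torch

x = torch.tensor(2.0)  # requires_grad defaults to False
y = x ** 2             # no graph is recorded for this operation

try:
    y.backward()
    raised = False
except RuntimeError as err:
    raised = True
    print(f"backward() failed: {err}")
```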

### **Using option 2 - `detach()`**

The `.detach()` method in PyTorch creates a new tensor that shares the same data as the original tensor but does not require gradients and is disconnected from the computation graph. This is useful when you want to perform computations on a tensor without tracking gradients or affecting the autograd mechanism.


In [46]:
x = torch.tensor(2.0, requires_grad=True)
x

tensor(2., requires_grad=True)

In [47]:
z = x.detach()
z

tensor(2.)

In [48]:
y = x ** 2

In [49]:
y

tensor(4., grad_fn=<PowBackward0>)

In [50]:
y1 = z ** 2
y1

tensor(4.)

In [51]:
y.backward()

In [52]:
# y1 was computed from the detached tensor z, so this would raise a RuntimeError:

# y1.backward()
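
Because a detached tensor shares its data with the original, an in-place change made through one is visible through the other. The sketch below illustrates this; `same_storage` and `independent` are illustrative names, and adding `.clone()` is shown as one way to get a separate copy.

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
z = x.detach()

same_storage = z.data_ptr() == x.data_ptr()

z.add_(1.0)  # in-place change made through the detached tensor
print(same_storage, x, z)  # x now holds 3.0 as well

# A detached copy with its own memory
independent = x.detach().clone()
independent.add_(1.0)
print(x, independent)  # x is unchanged by this second update
```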

### **Using option 3 - `torch.no_grad()`**

The `torch.no_grad()` context manager temporarily disables gradient tracking for all operations within its block. This is useful when you want to perform computations without building the computation graph, such as during model inference or evaluation. The result of any operation performed inside the `with torch.no_grad():` block has `requires_grad=False`, even if its inputs have `requires_grad=True`. Skipping the graph reduces memory usage and speeds up computation when gradients are not needed.

In [53]:
x = torch.tensor(2.0, requires_grad=True)
x

tensor(2., requires_grad=True)

In [54]:
with torch.no_grad():
    y = x ** 2

In [55]:
y

tensor(4.)

In [56]:
# y was computed under no_grad and has no grad_fn, so this would raise a RuntimeError:

# y.backward()
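
For the inference scenario listed earlier, a typical pattern might look like the sketch below. The `torch.nn.Linear` layer is a hypothetical stand-in for a trained model, and the random `inputs` are placeholder data.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(3, 1)  # hypothetical stand-in for a trained model
inputs = torch.randn(4, 3)     # placeholder batch of 4 samples

with torch.no_grad():
    preds = model(inputs)      # no graph is recorded inside the block

tracked = model(inputs)        # outside the block, tracking resumes

print(preds.requires_grad, tracked.requires_grad)
```

The model's parameters keep `requires_grad=True` throughout. Only the operations run inside the block are left untracked, and the predicted values are the same in both cases.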