# 🔄 PyTorch Autograd

Autograd is the core component of PyTorch that provides **automatic differentiation** for tensor operations. It computes gradients automatically, which is essential for optimizing models (**training neural networks**) with algorithms such as gradient descent.


---

## 1️⃣ Why Do We Need Autograd?

Training a neural network requires:
- Computing **loss**
- Finding how the loss changes w.r.t. the parameters (`∂L/∂w`, `∂L/∂b`)
- Updating parameters using gradient descent

Manually computing gradients for large networks is:

❌ error-prone  
❌ tedious  
❌ impractical  

✅ **Autograd does this automatically**

---

## 2️⃣ Training Process (High-Level)

Training a Neural Network involves four main steps:

1.  **Forward Pass:** Compute the output (prediction) of the network given an input.
    * *Formula:* $\hat{y} = \sigma(w \cdot x + b)$
2.  **Calculate Loss:** Measures the error by comparing the prediction $\hat{y}$ to the actual target $y$.
    * *Formula (Binary Cross-Entropy):* $L = -[y \cdot \ln(\hat{y}) + (1-y) \cdot \ln(1-\hat{y})]$
3.  **Backward Pass:** Compute gradients of the loss with respect to the parameters ($\frac{\partial L}{\partial w}, \frac{\partial L}{\partial b}$) using the **Chain Rule**.
4.  **Update Parameters:** Adjust the weights and biases to reduce the loss using an optimizer (e.g., gradient descent).
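
The four steps can be sketched in plain Python for a single neuron. This is a minimal illustration, not how PyTorch implements training: the values `x = 6.7`, `y = 0`, `w = 1`, `b = 0` are illustrative, the learning rate `lr = 0.1` and the 20 iterations are arbitrary assumptions, and the backward pass uses the closed-form gradients $(\hat{y} - y) \cdot x$ and $(\hat{y} - y)$ instead of Autograd.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce(y_hat, y):
    # Binary cross-entropy for a single example
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

x, y = 6.7, 0.0      # one training example (illustrative values)
w, b = 1.0, 0.0      # initial parameters
lr = 0.1             # learning rate (an arbitrary choice)

losses = []
for step in range(20):
    y_hat = sigmoid(w * x + b)         # 1. forward pass
    losses.append(bce(y_hat, y))       # 2. loss
    grad_w = (y_hat - y) * x           # 3. backward pass (closed-form gradient)
    grad_b = (y_hat - y)
    w -= lr * grad_w                   # 4. parameter update
    b -= lr * grad_b

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```

The loss shrinks from one iteration to the next because each update moves `w` and `b` against the gradient.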


---

### **3. The Chain Rule Explained**
To find how much the loss ($L$) changes when we tweak a weight ($w$), Autograd multiplies the local derivatives "backwards" from the output to the input. Here $z = w \cdot x + b$ is the linear output and $\hat{y} = \sigma(z)$:

$$
\frac{\partial L}{\partial w} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} \cdot \frac{\partial z}{\partial w}
$$

* **Step 1:** How Loss changes w.r.t Prediction ($\frac{\partial L}{\partial \hat{y}}$)

* **Step 2:** How Prediction changes w.r.t Linear Output ($\frac{\partial \hat{y}}{\partial z}$) -> *Derivative of Sigmoid*

* **Step 3:** How Linear Output changes w.r.t Weight ($\frac{\partial z}{\partial w} = x$)

**Final Gradient Calculation:**
* **For Weight:** $\frac{\partial L}{\partial w} = (\hat{y} - y) \cdot x$

* **For Bias:** $\frac{\partial L}{\partial b} = (\hat{y} - y) \cdot 1$
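
The simplification to $(\hat{y} - y) \cdot x$ can be checked by writing out the three factors, assuming the binary cross-entropy loss and sigmoid activation defined in the training steps above:

$$
\frac{\partial L}{\partial \hat{y}} = -\frac{y}{\hat{y}} + \frac{1-y}{1-\hat{y}} = \frac{\hat{y} - y}{\hat{y}(1-\hat{y})},
\qquad
\frac{\partial \hat{y}}{\partial z} = \hat{y}(1-\hat{y}),
\qquad
\frac{\partial z}{\partial w} = x
$$

$$
\frac{\partial L}{\partial w} = \frac{\hat{y} - y}{\hat{y}(1-\hat{y})} \cdot \hat{y}(1-\hat{y}) \cdot x = (\hat{y} - y) \cdot x
$$

The factor $\hat{y}(1-\hat{y})$ cancels, and replacing $\frac{\partial z}{\partial w} = x$ with $\frac{\partial z}{\partial b} = 1$ gives the bias gradient.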

### **4. Example 1: Simple Function**
If $y = x^2$ and $x = 3$:

1.  **Forward:** $y = 3^2 = 9$

2.  **Backward ($\frac{dy}{dx}$):** Derivative of $x^2$ is $2x$.

3.  **Result:** $2(3) = 6$


### **5. Example 2: Nested Function (Chain Rule)**

In this example, we calculate the gradient for a nested function where one operation feeds into another.

**The Function:**
We have two stages:
1.  **Inner Function:** $y = x^2$
2.  **Outer Function:** $z = \sin(y)$

**Goal:**
Find the gradient of $z$ with respect to $x$ ($\frac{dz}{dx}$).

**Forward Pass:**
* Input $x$ is squared to get $y$.
* $y$ is passed through the sine function to get $z$.

**Backward Pass (Chain Rule):**
Autograd calculates the derivative by multiplying the local gradients:

$$
\frac{dz}{dx} = \frac{dz}{dy} \cdot \frac{dy}{dx}
$$

**Step-by-Step Calculation:**
1.  **Outer Derivative ($\frac{dz}{dy}$):** The derivative of $\sin(y)$ is $\cos(y)$.
2.  **Inner Derivative ($\frac{dy}{dx}$):** The derivative of $x^2$ is $2x$.
3.  **Final Gradient:**
    $$\frac{dz}{dx} = \cos(y) \cdot 2x = \cos(x^2) \cdot 2x$$
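
One way to sanity-check a derivative like this is a finite-difference approximation. The sketch below uses only the standard library; the helper `central_difference`, the step size `h = 1e-6`, and the evaluation point `x0 = 4.0` are choices made for illustration.

```python
import math

def z(x):
    # Nested function: z = sin(x^2)
    return math.sin(x ** 2)

def dz_dx(x):
    # Chain-rule result: cos(x^2) * 2x
    return math.cos(x ** 2) * 2 * x

def central_difference(f, x, h=1e-6):
    # Numerical slope estimate: (f(x + h) - f(x - h)) / (2h)
    return (f(x + h) - f(x - h)) / (2 * h)

x0 = 4.0
analytic = dz_dx(x0)
numeric = central_difference(z, x0)
print(f"analytic: {analytic:.6f}, numeric: {numeric:.6f}")
```

The two values agree to several decimal places, which is the kind of check often used to validate a hand-derived gradient.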


### **6. Example 3: Multivariate Function (Mean of Squares)**

In this example, we calculate the gradient for a function that takes a vector input, squares each element, and finds the mean.

**The Function:**
Given an input vector $x = [x_1, x_2, x_3]$:
$$Y = \text{mean}(x^2) = \frac{x_1^2 + x_2^2 + x_3^2}{3}$$

**Gradient Calculation (Partial Derivatives):**
To find the gradients, we calculate the partial derivative of $Y$ with respect to each input component $x_i$:

$$
\frac{\partial Y}{\partial x_i} = \frac{1}{3} \cdot \frac{d}{dx}(x_i^2) = \frac{2x_i}{3}
$$

**Numerical Example:**
If our input tensor is $x = [x_1, x_2, x_3]$, the gradient tensor stored in `x.grad` will be:
$$
\text{gradients} = \left[ \frac{2x_1}{3}, \frac{2x_2}{3}, \frac{2x_3}{3} \right]
$$
For example, $x = [1, 2, 3]$ gives gradients $[0.6667, 1.3333, 2.0]$.

* *Note:* This demonstrates that Autograd handles **vector-Jacobian products** automatically, computing derivatives for every element in the tensor simultaneously.
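
To make the vector-Jacobian product concrete, the sketch below calls `backward()` on a non-scalar output and supplies the vector explicitly through the `gradient` argument. The weights `v = [1/3, 1/3, 1/3]` are chosen here so that the result corresponds to the mean-of-squares example; the input values are illustrative.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x ** 2                      # non-scalar output, shape (3,)

# v plays the role of dL/dy for some downstream scalar L.
# With v_i = 1/3 this corresponds to L = mean(y).
v = torch.tensor([1.0, 1.0, 1.0]) / 3
y.backward(gradient=v)

print(x.grad)   # each entry is v_i * 2 * x_i
```

Because each `y_i` depends only on `x_i`, the Jacobian is diagonal and the product reduces to an element-wise multiplication.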




## 5️⃣ Computational Graph (Core Idea)

During the **forward pass**:

* Each operation becomes a **node**
* Tensors flow through operations
* Graph is dynamically built

During the **backward pass**:

* Gradients flow **backward**
* Each node applies local derivative
* Chain rule connects everything
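
The idea can be illustrated with a toy reverse-mode differentiator in plain Python. This is a teaching sketch only and not how PyTorch is implemented: the class `Value` and its methods are hypothetical names, and it supports just three scalar operations.

```python
import math

class Value:
    """A scalar node in a toy computation graph (illustrative only)."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents          # nodes this value was computed from
        self._local_grads = local_grads  # d(self)/d(parent) for each parent

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def sin(self):
        return Value(math.sin(self.data), (self,), (math.cos(self.data),))

    def backward(self):
        # Topologically order the graph so that each node is processed
        # only after every node that consumes it.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node._parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            # Chain rule: upstream gradient times local derivative
            for parent, local in zip(node._parents, node._local_grads):
                parent.grad += node.grad * local


x = Value(4.0)
z = (x * x).sin()      # forward pass builds the graph: x -> x*x -> sin
z.backward()           # backward pass walks it in reverse
print(x.grad)          # equals 2 * 4 * cos(16)
```

Each operation records its inputs and local derivatives during the forward pass, and `backward()` multiplies them together from the output toward the inputs.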

---

## 6️⃣ Example: Single Neuron (Logistic Regression)

### Forward Pass

1. **Linear transformation**

```text
z = w·x + b
```

2. **Activation (Sigmoid)**

```text
ŷ = σ(z) = 1 / (1 + e⁻ᶻ)
```

3. **Loss (Binary Cross Entropy)**

```text
L = -[y·log(ŷ) + (1−y)·log(1−ŷ)]
```

---

### Backward Pass (Gradients)

Using chain rule:

```text
∂L/∂w = (ŷ − y) · x
∂L/∂b = (ŷ − y)
```

Autograd computes these **without you writing the math**.

---

## 7️⃣ Neural Networks & Autograd

In deep networks:

* Each layer adds nodes to the graph
* Backward pass propagates gradients layer by layer
* Autograd handles:

  * Weight gradients
  * Bias gradients
  * Intermediate tensor gradients

---

## 8️⃣ Vector Example (Multiple Inputs)

Given:

```text
x = [x₁, x₂, x₃]
y = mean(x²) = (x₁² + x₂² + x₃²) / 3
```

Gradients:

```text
∂y/∂x₁ = 2x₁ / 3
∂y/∂x₂ = 2x₂ / 3
∂y/∂x₃ = 2x₃ / 3
```

Autograd computes each partial derivative automatically.

---

## 9️⃣ Key Autograd Properties

* Gradients are stored in `.grad`
* Gradients accumulate by default
* Graph is freed after `.backward()` (unless retained)
* Works only for tensors with `requires_grad=True`
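
A small sketch of the accumulation and graph-freeing behavior follows. The input value `3.0` and the variable names are illustrative; `retain_graph=True` is the `backward()` argument that keeps the graph alive for another pass.

```python
import torch

x = torch.tensor(3.0, requires_grad=True)
y = x ** 2

y.backward(retain_graph=True)   # keep the graph so backward can run again
first = x.grad.item()           # 2 * 3 = 6

y.backward()                    # gradients accumulate: 6 + 6 = 12; graph is now freed
second = x.grad.item()

try:
    y.backward()                # the tensors saved for backward have been released
    third_call_failed = False
except RuntimeError:
    third_call_failed = True

print(first, second, third_call_failed)
```

The second call adds to `x.grad` instead of overwriting it, and the third call fails because the graph was not retained.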

---

## 🧠 Mental Model (VERY IMPORTANT)

> **Forward pass builds the graph**
> **Backward pass walks the graph in reverse using chain rule**

You write:

```python
loss.backward()
```

PyTorch does:

```text
Apply ∂L/∂output
→ ∂output/∂hidden
→ ∂hidden/∂weights
```

---

## 🚀 Why Autograd Is Powerful

* Enables deep learning at scale
* Makes experimentation fast
* Eliminates manual derivative bugs
* Works for arbitrarily complex models

---

## ✅ One-Line Summary

> **Autograd = automatic gradient computation using dynamic computation graphs**


In [None]:
# Define a function that computes the derivative of y = x^2
def dy_dx(x):
    # The derivative of x^2 with respect to x is 2x
    return 2 * x

# Call the function with x = 3
dy_dx(3)

## 🔁 Manual Gradient Computation (dy/dx)

This example demonstrates **manual differentiation**, which helps build intuition
for how **backpropagation** works.

---

## 1️⃣ The Function

```text
y = x²
```

This is a simple quadratic function.

---

## 2️⃣ Derivative

Using basic calculus:

```text
dy/dx = 2x
```

This tells us:

* How fast `y` changes when `x` changes
* The **slope** of the function at any point `x`

---

## 3️⃣ Python Implementation

```python
def dy_dx(x):
    return 2 * x
```

Calling:

```python
dy_dx(3)
```

Gives:

```text
6
```

---

## 4️⃣ Interpretation

At `x = 3`:

* The slope of `y = x²` is `6`
* A small increase `Δx` in `x` increases `y` by approximately `6·Δx`

---

## 5️⃣ Why This Matters in ML

In machine learning:

* `x` → model parameters (weights)
* `y` → loss

Gradients tell us:

```text
How should weights change to reduce loss?
```



---

## 6️⃣ Connection to PyTorch Autograd

Manual:

```python
dy_dx(3)  # 6
```

Autograd equivalent:

In [11]:
import torch
import math

In [7]:
# 1. Forward Pass Setup
# Create a tensor 'x' with value 3.0.
# requires_grad=True is the SWITCH that turns on Autograd.
# It tells PyTorch: "Please track every operation on this variable so
# we can calculate derivatives later."
x = torch.tensor(3.0, requires_grad=True)

In [8]:
# 2. Define the Function (Computational Graph)
# Operation: y = x^2
# Forward Pass: 3^2 = 9
# PyTorch builds a "graph" in the background connecting x to y.
y = x**2

print("x:", x)  # Output: tensor(3., requires_grad=True)
print("y:", y)  # Output: tensor(9., grad_fn=<PowBackward0>)

x: tensor(3., requires_grad=True)
y: tensor(9., grad_fn=<PowBackward0>)


In [9]:
# 3. Backward Pass (Backpropagation)
# This call applies the chain rule through the recorded graph.
# It computes dy/dx for every leaf tensor with requires_grad=True that 'y' depends on.
# Mathematically: d(x^2)/dx = 2x.
y.backward()

In [10]:
# 4. Check the Gradient
# Since x = 3, the gradient is 2 * 3 = 6.
# This value is stored in the .grad attribute of x.
x.grad  # Output: tensor(6.)

tensor(6.)

> Autograd does **exactly what we did manually**, but for large computation graphs.

---

## 🧠 Mental Model

> **Derivative = sensitivity**

Gradients answer:

* Which direction should each parameter move?
* How sensitive is the output to each parameter? (The learning rate then scales the step.)

This is the foundation of **gradient descent**.

---

## ✅ One-Line Summary

> Manual differentiation builds intuition; Autograd scales it to neural networks.

### Important Rules to Remember
✅ Rule 1: Only leaf tensors store gradients by default

* `x` is a leaf tensor (created directly by the user)
* `y` is not (it is the result of an operation)

✅ Rule 2: Gradients accumulate

Calling `.backward()` again adds to the existing gradients unless they are reset:

`x.grad.zero_()`

✅ Rule 3: `.backward()` applies the chain rule for you

You never write derivatives by hand for complex graphs.

> Forward pass builds the graph
> Backward pass walks the graph backward using chain rule

```
x → y
↑
gradient flows backward
```

### 🚀 Why This Matters in Deep Learning

* `x` → model parameters (weights)
* `y` → loss
* `x.grad` → how to update the weights

This is the foundation of:
* Gradient Descent
* Backpropagation
* Neural network training

In [12]:
# --- MANUAL CALCULATION ---
# y = x^2
# Function: z = sin(y) = sin(x^2)
# Chain Rule: dz/dx = dz/dy * dy/dx
#           = cos(y) * 2x
#           = cos(x^2) * 2x
def dz_dx(x):
    return 2 * x * math.cos(x**2)

In [13]:
# Calculate Manual Gradient for x = 4
# 2*4 * cos(16) ≈ 8 * (-0.9577) ≈ -7.66

print(f"Manual Result: {dz_dx(4)}")

Manual Result: -7.661275842587077


In [24]:
# --- PYTORCH AUTOGRAD ---

# 1. Create a scalar tensor with gradient tracking enabled
x = torch.tensor(4.0, requires_grad=True)

# 2. Forward Passes
# Forward pass: y = x^2
y = x ** 2          # y = 16

# Forward pass: z = sin(y)
z = torch.sin(y)    # z = sin(16) ≈ -0.2879

In [20]:
print("\n--- Tensor State ---")
print(f"x: {x}")
print(f"y: {y}")
print(f"z: {z}")


--- Tensor State ---
x: 4.0
y: 16.0
z: -0.2879033088684082


In [21]:
# 3. Backward Pass
# Compute dz/dx automatically using autograd
# This triggers the chain rule calculation.
z.backward()

In [22]:
# 4. Check Gradients of z with respect to x
print("\n--- Gradients ---")
print(f"x.grad (dz/dx): {x.grad}")  # Matches Manual Result (-7.66)


--- Gradients ---
x.grad (dz/dx): -7.661275863647461


In [23]:
# CRITICAL NOTE:
# y.grad will be None by default because y is NOT a leaf tensor.
# To save memory, PyTorch does not retain gradients of intermediate nodes (like y).
# It only populates .grad for leaf tensors (like x); reading y.grad here also emits a UserWarning.

print(f"y.grad (dz/dy): {y.grad}")

y.grad (dz/dy): None




# 🔗 Chain Rule with PyTorch Autograd (Manual vs Automatic)

This example demonstrates:
- Manual differentiation using calculus
- Automatic differentiation using PyTorch Autograd
- How the **chain rule** is applied internally

---

## 1️⃣ Mathematical Setup

Define:
```text
y = x²
z = sin(y) = sin(x²)
```

---

## 2️⃣ Manual Derivative (Chain Rule)

Using calculus:

```text
dz/dx = dz/dy × dy/dx
```

Where:

```text
dz/dy = cos(y)
dy/dx = 2x
```

So:

```text
dz/dx = 2x · cos(x²)
```

At `x = 4`:

```text
dz/dx = 2 × 4 × cos(16)
```

---

## 3️⃣ PyTorch Autograd Computation

```python
x = torch.tensor(4.0, requires_grad=True)
y = x ** 2
z = torch.sin(y)
z.backward()
```

Autograd automatically:

* Builds the computation graph
* Applies the chain rule
* Computes `dz/dx`

---

## 4️⃣ Accessing Gradients

```python
x.grad
```

This stores:

```text
∂z / ∂x
```

This matches the manual derivative result ✅

---

## 5️⃣ Why is `y.grad` None? ⚠️ (IMPORTANT)

```python
y.grad  # None
```

### Reason:

* `y` is an **intermediate tensor**
* Only **leaf tensors** store gradients by default

Leaf tensor:

```python
x = torch.tensor(..., requires_grad=True)
```

Non-leaf tensor:

```python
y = x ** 2
```

---

## 6️⃣ How to Access Intermediate Gradients (Advanced)

If you really need `y.grad`:

```python
y.retain_grad()
z.backward()
y.grad
```

⚠️ Use this only for:

* Debugging
* Visualization
* Learning

Not for normal training.

---

## 🧠 Mental Model

> **Forward pass builds a computation graph**
> **Backward pass applies chain rule from output to inputs**

```text
x → (square) → y → (sin) → z
↑                    |
└──── gradient flows ┘
```

---

## 🚀 Why This Matters in Deep Learning

* Neural networks are just **very large chain-rule graphs**
* Autograd handles:

  * Thousands of operations
  * Millions of parameters
* You only write:

```python
loss.backward()
```

---

## ✅ Key Takeaways

* ✔ Autograd matches manual calculus
* ✔ The chain rule is applied automatically
* ✔ Gradients are stored only for leaf tensors by default
* ✔ Intermediate gradients are available on request via `retain_grad()`

---

## 🔑 One-Line Summary

> **Autograd is chain rule at scale, implemented automatically.**


In [31]:
# 1. Inputs

# Inputs (single data point)
x = torch.tensor(6.7)  # Input feature (scalar)
y = torch.tensor(0.0)  # True label (Target is 0; binary classification: 0 or 1)

# Model parameters
w = torch.tensor(1.0)  # Weight
b = torch.tensor(0.0)  # Bias

In [32]:
# 2. Define Loss Function (Binary Cross-Entropy Loss)
# Formula: L = -[y * log(p) + (1-y) * log(1-p)]
def binary_cross_entropy_loss(prediction, target):
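    # Small value to avoid log(0), which is undefined.
    # 1e-7 is used because in float32 the value 1 - 1e-8 rounds to exactly 1.0,
    # which would leave the upper clamp with no effect.
    epsilon = 1e-7

    # Clamp restricts predictions to the closed range [epsilon, 1 - epsilon]
    # for numerical stability
    prediction = torch.clamp(prediction, epsilon, 1 - epsilon)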

    # Binary Cross-Entropy formula
    return -(target * torch.log(prediction) +
             (1 - target) * torch.log(1 - prediction))


In [33]:
# 3. Forward Pass

# Linear transformation
z = w * x + b          # Linear Step: 1.0 * 6.7 + 0 = 6.7

# Apply sigmoid activation to get probability
# y_pred is in range (0, 1)
y_pred = torch.sigmoid(z) # Activation Step: 1 / (1 + e^-6.7) ≈ 0.9988

In [34]:
# 4. Calculate Loss (Compute binary cross-entropy loss)
# The model is VERY confident (99.88%) that the class is 1.
# BUT the True Label is 0.
# This is a "Wrong and Confident" prediction, so the loss should be HIGH.
loss = binary_cross_entropy_loss(y_pred, y)

In [35]:
print(f"Logit (z): {z:.4f}")
print(f"Prediction (y_pred): {y_pred:.4f}")
print(f"Loss: {loss:.4f}")

Logit (z): 6.7000
Prediction (y_pred): 0.9988
Loss: 6.7012


# 🔐 Logistic Regression Forward Pass + Binary Cross-Entropy Loss

This example implements **binary classification** from scratch using:
- Linear model
- Sigmoid activation
- Binary Cross-Entropy (BCE) loss

---

## 1️⃣ Problem Setup

We are solving a **binary classification** problem.

- Input feature: `x = 6.7`
- True label: `y = 0` (negative class)

---

## 2️⃣ Model Parameters

```text
z = w·x + b
```

Where:

* `w` = weight
* `b` = bias

This is a **linear transformation**.

---

## 3️⃣ Sigmoid Activation

```python
y_pred = torch.sigmoid(z)
```

### Formula:

```text
σ(z) = 1 / (1 + e⁻ᶻ)
```

### Output range:

```text
0 < y_pred < 1
```

Interpretation:

* Output is a **probability**
* Used for binary classification

---

## 4️⃣ Binary Cross-Entropy Loss (BCE)

```text
L = -[y·log(ŷ) + (1 − y)·log(1 − ŷ)]
```

Where:

* `y` = true label
* `ŷ` = predicted probability

---

### Special Cases

| True Label `y` | Loss Simplifies To |
| -------------- | ------------------ |
| `y = 1`        | `-log(ŷ)`          |
| `y = 0`        | `-log(1 − ŷ)`      |

---

## 5️⃣ Why Clamp the Prediction? ⚠️

```python
prediction = torch.clamp(prediction, epsilon, 1 - epsilon)
```

### Reason:

* `log(0)` → `−∞`
* Causes numerical instability

Clamping keeps:

```text
ε ≤ prediction ≤ 1 − ε
```

---

## 6️⃣ Full Forward Pass Flow

```text
x → (w·x + b) → z
z → sigmoid(z) → ŷ
ŷ → BCE(ŷ, y) → loss
```

This is the **entire forward pass** of logistic regression.

---

## 7️⃣ Connection to Deep Learning

This exact pattern appears in:

* Logistic Regression
* Final layer of binary classifiers
* Neural networks with sigmoid output

In practice, PyTorch provides:

```python
torch.nn.BCELoss()
torch.nn.BCEWithLogitsLoss()
```

---

## 🧠 Mental Model

> **Linear model → probability → loss**

Most binary classifiers with a sigmoid output follow this pattern.

---

## ✅ One-Line Summary

> Logistic regression = linear transformation + sigmoid + BCE loss


In [41]:
# Derivatives (Manual Backprop)

# Here we reproduce by hand what loss.backward() would compute, using the CHAIN RULE.
# Chain Rule: dL/dw = (dL/dy_pred) * (dy_pred/dz) * (dz/dw)

# 1. dL/d(y_pred)
# Derivative of Binary Cross-Entropy loss w.r.t. prediction
# BCE Loss Formula: L = -(y*log(ŷ) + (1-y)*log(1-ŷ))
# The derivative is: dL/dŷ = (ŷ - y) / (ŷ * (1 - ŷ))
dloss_dy_pred = (y_pred - y) / (y_pred * (1 - y_pred))

In [42]:
# 2. dy_pred/dz
# Derivative of Prediction (Sigmoid) w.r.t. z
# sigmoid(z) = 1 / (1 + e^-z)
# d(sigmoid)/dz = sigmoid(z) * (1 - sigmoid(z)) = ŷ * (1 - ŷ)
dy_pred_dz = y_pred * (1 - y_pred)

In [44]:
# 3. Derivative of z w.r.t parameters (w and b) ---
# dz/dw and dz/db
# z = w*x + b
dz_dw = x        # derivative of z w.r.t. w
dz_db = 1        # derivative of z w.r.t. b

In [45]:
# --- Final Calculation (Chain Rule) ---
# Multiply the links together to get the gradient for Weight (w)

# dL/dw = dL/dŷ * dŷ/dz * dz/dw
dL_dw = dloss_dy_pred * dy_pred_dz * dz_dw

# dL/db = dL/dŷ * dŷ/dz * dz/db
dL_db = dloss_dy_pred * dy_pred_dz * dz_db

In [46]:
print(f"Manual Gradient of loss w.r.t weight (dw): {dL_dw}")
print(f"Manual Gradient of loss w.r.t bias (db): {dL_db}")

Manual Gradient of loss w.r.t weight (dw): 6.691762447357178
Manual Gradient of loss w.r.t bias (db): 0.998770534992218


In [48]:
# OBSERVATION:
# Notice that 'dloss_dy_pred' and 'dy_pred_dz' cancel each other's y_pred * (1 - y_pred) factor:
# (y_pred - y) / (y_pred * (1 - y_pred)) * (y_pred * (1 - y_pred))
# simply becomes: (y_pred - y)
# So dL/dw is usually simplified to just: (y_pred - y) * x

## Key Mathematical Concept Used: The Chain Rule

The code manually calculates gradients by breaking the neural network into 3 stages. To find how the **Weight ($w$)** affects the **Loss ($L$)**, we multiply the derivatives of each stage:

$$\frac{\partial L}{\partial w} = \underbrace{\frac{\partial L}{\partial \hat{y}}}_{\text{Loss changes as pred changes}} \cdot \underbrace{\frac{\partial \hat{y}}{\partial z}}_{\text{Pred changes as z changes}} \cdot \underbrace{\frac{\partial z}{\partial w}}_{\text{z changes as weight changes}}$$

1. $\frac{\partial L}{\partial \hat{y}}$: Calculated as `dloss_dy_pred`.

2. $\frac{\partial \hat{y}}{\partial z}$: Calculated as `dy_pred_dz` (Sigmoid derivative).

3. $\frac{\partial z}{\partial w}$: Calculated as `x` (Input value).

In [54]:
# Automation Verification (PyTorch)

# 1. SETUP: DATA AND PARAMETERS

# Input data (x) and Target label (y)
# These are fixed data points, so requires_grad is False by default.

x = torch.tensor(6.7)
y = torch.tensor(0.0)

In [55]:
# Weights (w) and Bias (b)
# These are the parameters the model wants to learn (optimize).
# requires_grad=True tells PyTorch to track operations on these for 'autograd'.

w = torch.tensor(1.0, requires_grad=True)
b = torch.tensor(0.0, requires_grad=True)

In [56]:
print(f"Initial Weight: {w.item()}, Initial Bias: {b.item()}")

Initial Weight: 1.0, Initial Bias: 0.0


In [57]:
# 2. FORWARD PASS

# Step A: Linear Combination (The "Neuron" math)
# z = w * x + b
z = w * x + b

# Explanation: This calculates the weighted sum of inputs plus bias.

In [58]:
# Step B: Activation Function (Sigmoid)
# Squeezes the value of 'z' between 0 and 1 to make it a probability.

y_pred = torch.sigmoid(z)

In [59]:
# Step C: Calculate Loss (Binary Cross Entropy)
# This measures how wrong the prediction is compared to the target 'y'.
# Note: Usually we use torch.nn.functional.binary_cross_entropy

loss = torch.nn.functional.binary_cross_entropy(y_pred, y)

In [60]:
print(f"Prediction (y_pred): {y_pred.item():.4f}")
print(f"Loss: {loss.item():.4f}")

Prediction (y_pred): 0.9988
Loss: 6.7012


In [61]:
# 3. AUTOMATIC BACKPROPAGATION (PYTORCH)

# This single line triggers the "Backward Pass".
# PyTorch walks back through the computation graph to calculate gradients.
loss.backward()

In [62]:
print("-" * 30)
print(f"Auto-calc Gradient w.r.t weight (w.grad): {w.grad.item():.4f}")
print(f"Auto-calc Gradient w.r.t bias (b.grad):   {b.grad.item():.4f}")
print("-" * 30)

------------------------------
Auto-calc Gradient w.r.t weight (w.grad): 6.6918
Auto-calc Gradient w.r.t bias (b.grad):   0.9988
------------------------------


In [63]:
# 4. MANUAL BACKPROPAGATION (THE MATH)
# Here we manually recreate what loss.backward() did above using the CHAIN RULE.
# Chain Rule: dL/dw = (dL/dy_pred) * (dy_pred/dz) * (dz/dw)

In [64]:
# --- Link 1: Derivative of Loss w.r.t Prediction ---
# BCE Loss Formula: L = -[y*log(y_pred) + (1-y)*log(1-y_pred)]
# The derivative is: (y_pred - y) / (y_pred * (1 - y_pred))
dloss_dy_pred = (y_pred - y) / (y_pred * (1 - y_pred))

In [65]:
# --- Link 2: Derivative of Prediction (Sigmoid) w.r.t z ---
# Sigmoid derivative property: sigma(z) * (1 - sigma(z))
dy_pred_dz = y_pred * (1 - y_pred)

In [66]:
# --- Link 3: Derivative of z w.r.t parameters (w and b) ---
# Since z = w*x + b:
dz_dw = x  # The derivative of (w*x) with respect to w is just x
dz_db = 1  # The derivative of (b) with respect to b is just 1

In [67]:
# --- Final Calculation (Chain Rule) ---
# Multiply the links together to get the gradient for Weight (w)
dL_dw = dloss_dy_pred * dy_pred_dz * dz_dw

In [68]:
# Multiply the links together to get the gradient for Bias (b)
dL_db = dloss_dy_pred * dy_pred_dz * dz_db

In [69]:
print(f"Manual Gradient w.r.t weight (dw):      {dL_dw.item():.4f}")
print(f"Manual Gradient w.r.t bias (db):        {dL_db.item():.4f}")

Manual Gradient w.r.t weight (dw):      6.6918
Manual Gradient w.r.t bias (db):        0.9988


In [70]:
# OBSERVATION:
# Notice that 'dloss_dy_pred' and 'dy_pred_dz' cancel each other's y_pred * (1 - y_pred) factor:
# (y_pred - y) / (y_pred * (1 - y_pred)) * (y_pred * (1 - y_pred))
# simply becomes: (y_pred - y)
# So dL/dw is usually simplified to just: (y_pred - y) * x

# 🔁 Manual Backpropagation vs PyTorch Autograd (Binary Classification)

This example shows:
- Manual gradient computation using calculus
- Verification using PyTorch Autograd
- How the **chain rule** powers backpropagation

---

## 1️⃣ Model Definition

We use **logistic regression** for binary classification.

### Forward equations:

```text
z = w·x + b
ŷ = sigmoid(z)
L = −[y·log(ŷ) + (1−y)·log(1−ŷ)]
```

---

## 2️⃣ Manual Gradient Derivation (Chain Rule)

We apply:

```text
dL/dw = dL/dŷ × dŷ/dz × dz/dw
dL/db = dL/dŷ × dŷ/dz × dz/db
```

---

### Step 1: Loss derivative

```text
dL/dŷ = (ŷ − y) / (ŷ(1 − ŷ))
```

---

### Step 2: Sigmoid derivative

```text
dŷ/dz = ŷ(1 − ŷ)
```

---

### Step 3: Linear derivatives

```text
dz/dw = x
dz/db = 1
```

---

### Final gradients

```text
dL/dw = (ŷ − y) · x
dL/db = (ŷ − y)
```

This is the **core gradient rule** of logistic regression.

---

## 3️⃣ PyTorch Autograd Verification

```python
loss.backward()
```

What Autograd does internally:

* Builds a computation graph during the forward pass
* Applies the chain rule automatically
* Stores gradients in `.grad`

```python
w.grad  # dL/dw
b.grad  # dL/db
```

---

## 4️⃣ Why This Is Important

* This is **exactly how neural networks learn**
* Deep networks are, loosely speaking:

  > many logistic-regression-like units stacked in layers
* Autograd saves you from writing this math manually

---

## 🧠 Mental Model

> **Forward pass builds the graph**
> **Backward pass applies the chain rule backward**

```text
x → z → ŷ → L
↑   ↑   ↑
|___|___|  gradients flow back
```

---

## 5️⃣ Practical Insight

In real projects, you would NOT implement BCE manually.

Instead, use:

```python
torch.nn.BCEWithLogitsLoss()
```

Why?

* Numerically stable
* Combines sigmoid + BCE
* Faster and safer
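
As a quick check of the "combines sigmoid + BCE" point, the sketch below compares the two-step route with the fused loss on the logit `6.7` and target `0` used in this example. The variable names are illustrative.

```python
import torch
import torch.nn as nn

z = torch.tensor([6.7])        # raw logit (pre-sigmoid)
y = torch.tensor([0.0])        # target

loss_two_step = nn.BCELoss()(torch.sigmoid(z), y)   # sigmoid, then BCE
loss_fused = nn.BCEWithLogitsLoss()(z, y)           # both in one call

print(f"{loss_two_step.item():.4f} {loss_fused.item():.4f}")
```

Both values come out at roughly 6.70. The fused version works directly on the logit, so it does not have to take the log of a probability that has already been rounded toward 0 or 1.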

---

## ✅ One-Line Summary

> Manual gradients build intuition; Autograd scales it to deep networks.


In [71]:
# PART 1: Gradients with Vector Input

# 1. Define a vector tensor with 3 elements
# requires_grad=True tells PyTorch to track every operation on this tensor.
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

In [72]:
# 2. Define the function: y = mean(x^2)
# Mathematical Formula: y = (x1^2 + x2^2 + x3^2) / 3
# We use .mean() to reduce the vector to a single scalar value.
# Called without arguments, .backward() requires a scalar (single number) output.
y = (x**2).mean()

In [73]:
# 3. Calculate Gradients of y w.r.t. x
# This computes dy/dx for every element in x.
# The derivative of (x^2)/3 is (2x)/3.
y.backward()

In [74]:
# 4. View the result (Gradient is computed for EACH element of x)
# For x=1.0: (2*1)/3 = 0.6667
# For x=2.0: (2*2)/3 = 1.3333
# For x=3.0: (2*3)/3 = 2.0000
print(f"Gradients for vector x: {x.grad}")

Gradients for vector x: tensor([0.6667, 1.3333, 2.0000])


In [75]:
# PART 2: Clearing Gradients

# NOTE: The line below creates a BRAND NEW tensor.
# The previous 'x' and its gradients are discarded/overwritten.
x = torch.tensor(2.0, requires_grad=True)

In [76]:
# Define function: y = x^2
y = x ** 2

In [77]:
# Calculate Gradient
# Derivative of x^2 is 2x.
y.backward()

In [78]:
print(f"Gradient before clearing (2*x): {x.grad}") # Should be 2*2 = 4.0

Gradient before clearing (2*x): 4.0


In [79]:
# 5. CLEARING GRADIENTS (Crucial Step!)
# In PyTorch, gradients "accumulate" (add up) by default if you call backward() twice.
# We must manually set them to zero before the next training step.
# The underscore (_) in zero_() means "in-place operation" (modifies x directly).
x.grad.zero_()

print(f"Gradient after clearing: {x.grad}") # Should be 0.0

Gradient after clearing: 0.0


## Key Concepts in this Snippet

1. **Scalar vs. Vector Output (mean()):**

* `backward()` with no arguments requires a scalar output (a single number, such as a loss); a non-scalar output needs an explicit `gradient` argument.

* Since `x` was a vector `[1, 2, 3]`, `x**2` was also a vector.

* We used `.mean()` to squash that vector into a single number so we could calculate the gradient easily.

2. **Why `zero_()`? (Gradient Accumulation)**

* PyTorch assumes you might want to sum gradients from multiple backward passes (e.g., accumulating over several mini-batches before one optimizer step).

* Therefore, if you don't call `x.grad.zero_()`, the next time you call `backward()`, the new gradient will be added to the old 4.0, resulting in an incorrect value (e.g., 8.0).

* Best Practice: Always zero out gradients at the start of a training loop step.
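
The accumulation described above can be observed directly. In this sketch the input value `2.0` matches the example; the variable names `after_first`, `after_second`, and `after_reset` are illustrative.

```python
import torch

x = torch.tensor(2.0, requires_grad=True)

(x ** 2).backward()
after_first = x.grad.item()      # 2 * 2 = 4

(x ** 2).backward()              # a fresh graph, but the same .grad buffer
after_second = x.grad.item()     # 4 + 4 = 8 (accumulated)

x.grad.zero_()                   # reset in place
(x ** 2).backward()
after_reset = x.grad.item()      # back to 4

print(after_first, after_second, after_reset)
```

In a training loop with an optimizer, `optimizer.zero_grad()` plays the same role for all parameters at once.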

### ✅ One-Line Summary

> Gradients are element-wise, scale with reductions, and accumulate unless cleared.

> Forward pass computes values

> Backward pass accumulates gradients

> We must reset gradients before the next step

In [94]:
# 1. BASELINE: NORMAL TRACKING

x = torch.tensor(2.0, requires_grad=True)
# Create tensor with gradient tracking
x

tensor(2., requires_grad=True)

In [95]:
# Forward pass
y = x ** 2
y

tensor(4., grad_fn=<PowBackward0>)

In [96]:
# Backward Pass
y.backward()
print(f"Baseline Gradient: {x.grad}") # Output: 4.0
# Gradient is stored
# x.grad, i.e.,  dy/dx = 2x = 4

Baseline Gradient: 4.0


In [98]:
# OPTION 1: requires_grad_(False)
# ==========================================
# This is an IN-PLACE operation. It flips a switch on the tensor itself.
# Useful when you want to "freeze" a parameter (e.g., a pretrained layer);
# the setting persists until you call requires_grad_(True) again.

x.requires_grad_(False) # Disable gradient tracking IN-PLACE
# x is now treated as a constant, not a variable to optimize.
x

tensor(2.)

In [99]:
# Forward pass again
y = x ** 2
y

tensor(4.)

In [None]:
# Since x has no grad tracking, y also has no gradient function (grad_fn=None).
# Backward now FAILS because y has no grad_fn
# y.backward()  ❌ RuntimeError

try:
    y.backward() # THIS WILL FAIL
except RuntimeError as e:
    print(f"\nOption 1 Error: {e}")
    # Error: "element 0 of tensors does not require grad and does not have a grad_fn"

In [100]:
# OPTION 2: .detach()
# ==========================================
# This creates a NEW tensor that shares the same data but is disconnected
# from the computational graph.
# Useful when you need the *value* of a tensor for plotting or metrics
# without affecting the gradients of the original variable.

x = torch.tensor(2.0, requires_grad=True) # Create new tensor with gradient tracking
x

tensor(2., requires_grad=True)

In [101]:
# Detach creates a NEW tensor sharing data
# but WITHOUT computation graph
# Therefore 'z' refers to the same data as 'x' (not a copy) and does not track gradients.
z = x.detach()
z

tensor(2.)

In [103]:
# Forward pass on original x (tracked)
y = x ** 2    # y depends on x (Tracked)
y

tensor(4., grad_fn=<PowBackward0>)

In [104]:
# Forward pass on detached tensor (not tracked)
y1 = z ** 2   # y1 depends on z (Not Tracked)
y1

tensor(4.)

In [89]:
# Backward on y works
# Backward on y1 FAILS (no graph)
# y1.backward() ❌ RuntimeError
y.backward()  # This works! gradients flow back to x.

print(f"\nOption 2 Gradient (from y): {x.grad}")

try:
    y1.backward() # THIS WILL FAIL
except RuntimeError as e:
    print(f"Option 2 Error (from y1): {e}")
    # Error: y1 has no history because z broke the chain.


Option 2 Gradient (from y): 4.0
Option 2 Error (from y1): element 0 of tensors does not require grad and does not have a grad_fn


In [91]:
# OPTION 3: torch.no_grad() (Context Manager)
# ==========================================
# This is a temporary "zone" where tracking is turned off.
# Best Practice for: Inference / Validation loops.

x = torch.tensor(2.0, requires_grad=True)

In [105]:
with torch.no_grad():
    y = x ** 2
    # Inside here, y.requires_grad is False
    print(f"\nInside no_grad block, requires_grad: {y.requires_grad}")
y


Inside no_grad block, requires_grad: False


tensor(4.)

In [93]:
# Outside, tracking resumes for NEW operations
y_outside = x ** 2
y_outside.backward()

# Backward on the 'y' computed inside the no_grad block fails
# because no graph was recorded for it
# y.backward() ❌ RuntimeError

# 🚫 Disabling Gradient Tracking in PyTorch

PyTorch provides **three official ways** to disable gradient tracking.
Each has a **different purpose**.

---

## 1️⃣ Why Disable Gradients?

You should disable gradients when:
- Doing **inference**
- Evaluating models
- Freezing layers
- Saving memory
- Improving speed

---

## 2️⃣ Option 1: `requires_grad_(False)`

```python
x.requires_grad_(False)
```

### What it does:

* Disables gradient tracking **in-place**
* The tensor is treated as a constant: operations on it are not recorded and its `.grad` is not updated

### Use when:

* Freezing model parameters

### ⚠️ Warning:

* Future operations on `x` are NOT tracked
* `.backward()` fails if no other tensor in the computation requires gradients

---

## 3️⃣ Option 2: `detach()`

```python
z = x.detach()
```

### What it does:

* Creates a new tensor
* Shares the same data
* **No computation graph**

### Use when:

* You want a tensor value but NOT gradients
* Logging, metrics, auxiliary computations

### Key rule:

```text
detach() cuts the computation graph
```
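
Because the detached tensor shares memory with the original, an in-place write to one is visible through the other. The sketch below illustrates this with made-up values; `clone()` is shown as one way to get an independent copy when that is what is wanted.

```python
import torch

x = torch.tensor([2.0, 3.0], requires_grad=True)
z = x.detach()

same_storage = z.data_ptr() == x.data_ptr()   # True: no copy was made
z[0] = 10.0                                   # in-place write through the detached tensor

print(same_storage, x)                        # x now starts with 10.0 as well

independent = x.detach().clone()              # clone() allocates separate memory
```

This is why `x.detach().clone()` is a common pattern when the detached values will be modified.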

---

## 4️⃣ Option 3: `torch.no_grad()`

```python
with torch.no_grad():
    y = x ** 2
```

### What it does:

* Temporarily disables gradient tracking
* Saves memory because no graph or intermediate results are stored for backward

### Use when:

* Model inference
* Validation
* Prediction
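
Beyond inference, `torch.no_grad()` is also commonly used for hand-written parameter updates, so that the update itself is not recorded in the graph. The sketch below is a minimal illustration with assumed values: the learning rate `lr = 0.1` is arbitrary, and the inputs reuse `x = 6.7`, `y = 0`, `w = 1`, `b = 0`.

```python
import torch
import torch.nn.functional as F

x = torch.tensor(6.7)
y = torch.tensor(0.0)
w = torch.tensor(1.0, requires_grad=True)
b = torch.tensor(0.0, requires_grad=True)
lr = 0.1   # learning rate (an arbitrary choice for illustration)

loss = F.binary_cross_entropy_with_logits(w * x + b, y)
loss.backward()

with torch.no_grad():          # the update itself must not be recorded in the graph
    w -= lr * w.grad
    b -= lr * b.grad

w.grad.zero_()                 # clear gradients before the next step
b.grad.zero_()

print(w, b)
```

After the block, `w` and `b` still have `requires_grad=True`, so the next forward pass is tracked as usual.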

---

## 5️⃣ Comparison Table

| Method                  | Persistent?               | New Tensor? | Typical Use    |
| ----------------------- | ------------------------- | ----------- | -------------- |
| `requires_grad_(False)` | Yes, until re-enabled     | No          | Freeze weights |
| `detach()`              | No                        | Yes         | Cut graph      |
| `torch.no_grad()`       | No (only inside the block) | No          | Inference      |

---

## 6️⃣ Why `.backward()` Fails After Disabling

Backpropagation requires:

* A computation graph
* Tensors with `grad_fn`

If gradients are disabled:

```text
No graph → No backward pass
```

---

## 🧠 Mental Model

> **Autograd only tracks what you tell it to track**

* `requires_grad=True` → track
* `detach()` → cut
* `no_grad()` → ignore
* `requires_grad_(False)` → stop until re-enabled

---

## 7️⃣ Best Practices (REAL-WORLD)

### Training loop

```python
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

---

### Validation / Inference

```python
with torch.no_grad():
    outputs = model(inputs)
```

---

### Freezing layers

```python
for param in model.parameters():
    param.requires_grad = False
```

---

## ✅ One-Line Summary

> Disable gradients intentionally: wrong usage silently breaks learning.





| Method | What it does | Best Use Case |
| :--- | :--- | :--- |
| `x.requires_grad_(False)` | **In-place** change to the tensor that persists until re-enabled. | **Freezing Weights**: Locking specific layers (e.g., ResNet backbone) during fine-tuning so they don't update. |
| `x.detach()` | Creates a **new** tensor disconnected from the graph (sharing the same data). | **Plotting/Metrics**: Converting tensors to NumPy for Matplotlib, or feeding output to a non-differentiable function. |
| `torch.no_grad()` | **Temporary** mode (Context Manager) to stop tracking. | **Inference/Validation**: Running model predictions. Saves memory by not storing the graph and intermediate results needed for backward. |