### Forward, Loss and Backward Pass

- Forward pass: the data moves forward through the network (e.g. an MLP or a CNN), and the activations propagate layer by layer to produce a prediction.
                E.g. Y = act(W @ X + b)

- Loss: the scalar quantity computed from the prediction and the ground-truth value of the observation. A lower loss is better.
                E.g. MSE loss, BCE loss

- Backward pass: where we work out how much each parameter of the network contributed to the loss on the current row/batch of data. This is the signal the model learns from.
    - The chain rule gives the gradient of the loss with respect to each parameter of the model, which measures the influence of that parameter on the final loss.
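The three stages can be sketched in a few lines of PyTorch. This is a minimal illustration, not a full training script: the toy data, the shapes, the sigmoid activation and the MSE loss are all assumptions chosen for the example. It stores one observation per row, so the layer is written `X @ W + b` rather than `W @ X + b`.

```python
import torch

torch.manual_seed(0)  # fixed seed so the run is repeatable

# Hypothetical toy data: 8 observations, 3 features, 1 target each
X = torch.randn(8, 3)
y = torch.randn(8, 1)

# Parameters of a single linear layer (the things the model learns)
W = torch.randn(3, 1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)

# Forward pass: activations flow through the layer to a prediction
y_pred = torch.sigmoid(X @ W + b)

# Loss: one scalar comparing prediction and target (MSE here)
loss = torch.mean((y_pred - y) ** 2)

# Backward pass: the chain rule fills W.grad and b.grad
loss.backward()

print(loss.item())
print(W.grad.shape, b.grad.shape)
```

Each gradient has the same shape as the parameter it belongs to, which is what lets an update rule combine them element by element.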


### Gradient (dL/dW)

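- This is the mathematical derivative of the loss with respect to a parameter (like W). It points in the direction in which the loss increases fastest, and its magnitude says how sensitive the loss is to W. To reduce the loss we therefore move W a small step in the opposite direction.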

### Backward Pass

- PyTorch uses Automatic Differentiation (Autograd) to compute all these derivatives efficiently, moving backwards from the scalar loss through the computation graph all the way back to the parameters (W and B).
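One way to build trust in Autograd is to compare it against a derivative worked out by hand. The sketch below assumes a hypothetical one-parameter model `w * x` with a sum-of-squares loss, for which dL/dw = Σ 2·(w·x − y)·x.

```python
import torch

# Hypothetical data for a one-parameter model y ≈ w * x
x = torch.tensor([1.0, 2.0, 3.0])
y = torch.tensor([2.0, 4.0, 6.0])
w = torch.tensor(0.5, requires_grad=True)

# Forward pass and scalar loss
loss = torch.sum((w * x - y) ** 2)

# Autograd walks the graph backwards and stores dL/dw in w.grad
loss.backward()

# The same derivative written out by hand: sum(2 * (w*x - y) * x)
manual = torch.sum(2 * (w.detach() * x - y) * x)

print(w.grad, manual)  # both are -42.0
```

The gradient is negative, so the loss falls as `w` increases, which matches the data: the best fit is `w = 2`, above the starting value of 0.5.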


### Updating (Optimizer)

- After the backward pass, a component called the Optimizer (not in our script) uses the gradients (W.grad and B.grad) to update the parameters in place: W_new = W_old - (learning rate × W.grad).
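The update rule can be written out by hand to see what an optimizer does. This is a sketch of plain gradient descent with an assumed learning rate and a toy loss; `torch.optim.SGD` without momentum applies the same rule.

```python
import torch

W = torch.tensor([1.0, -2.0], requires_grad=True)
lr = 0.1  # learning rate (hypothetical value)

loss = torch.sum(W ** 2)  # toy loss whose gradient is 2 * W
loss.backward()           # W.grad is now [2., -4.]

# The update itself must not be recorded in the computation graph
with torch.no_grad():
    W -= lr * W.grad      # W_new = W_old - lr * W.grad

print(W)  # values [0.8, -1.6]
```

The `torch.no_grad()` block matters: without it, PyTorch refuses an in-place change to a leaf tensor that requires gradients.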


### Zeroing Gradients

- W.grad = None (or W.grad.zero_()): Gradients accumulate by default. We must zero them out before each new training step so the gradient from the current batch of data is not polluted by the gradient from the previous batch.
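The accumulation behaviour is easy to see on a single scalar. A small sketch, assuming a toy expression `w * 2` whose gradient with respect to `w` is always 2:

```python
import torch

w = torch.tensor(3.0, requires_grad=True)

(w * 2).backward()
first = w.grad.item()        # 2.0

(w * 2).backward()           # no zeroing in between
accumulated = w.grad.item()  # 4.0: the new gradient was added to the old one

w.grad = None                # reset (w.grad.zero_() would also work)
(w * 2).backward()
fresh = w.grad.item()        # 2.0 again

print(first, accumulated, fresh)
```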

### Methods in PyTorch

1. torch.Tensor
- The fundamental data structure. It is a multi-dimensional array, similar to a NumPy array, but optimized for GPU computation and gradient tracking.

2. `requires_grad=True`
- A boolean flag set during tensor creation (e.g., `W = torch.randn(..., requires_grad=True)`). This is the signal to PyTorch's Autograd system to track every operation performed on this tensor. Typically only tensors representing model parameters (W, B) should have this set.

3. `.grad`
- An attribute automatically created and managed by Autograd. After loss.backward() is called, this attribute stores the gradient (∂Loss/∂Parameter) for the tensor. It is initially None.

4. `.detach()`
- Creates a new tensor that shares the underlying data but is detached from the current computation graph. Useful when you need the value for logging or visualization without tracking its history.

5. `.item()`
- Converts a single-element (scalar) PyTorch tensor (like the final loss value) into a standard Python number (float). Essential for printing or logging the loss.

6. `.zero_()`
- In-place method used on a parameter's .grad attribute (e.g., `W.grad.zero_()`). It resets the accumulated gradients to zero before a new backward pass. This is crucial because, by default, PyTorch accumulates gradients across multiple backward passes. When an optimizer is used, `optimizer.zero_grad()` does the same reset for every parameter it manages.

7. `loss.backward()`
- The most critical command. It triggers the Autograd engine to traverse the computation graph (which was built during the forward pass) backwards from the scalar loss tensor. It calculates the gradients for all leaf tensors that have requires_grad=True and populates their .grad attributes.
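The methods above can be tried together on a tiny example. This is only an illustrative sketch; the tensor values and the sum-of-squares loss are assumptions.

```python
import torch

W = torch.tensor([1.0, 2.0], requires_grad=True)
grad_before = W.grad         # None: nothing has been back-propagated yet

loss = torch.sum(W ** 2)     # forward pass builds the graph; loss = 5.0

snapshot = loss.detach()     # same value, but cut off from the graph
value = loss.item()          # plain Python float, handy for logging

loss.backward()              # fills W.grad with dLoss/dW = 2 * W

print(grad_before, snapshot.requires_grad, value, W.grad)

W.grad.zero_()               # reset in place before the next backward pass
print(W.grad)
```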

In [8]:
import torch
import torch.nn as nn

```
A = torch.tensor([5.0], requires_grad=True)
B = torch.tensor([10.0])
C = A * 2
D = B * 3
E = C + D
```

Which of the tensors (A, B, C, D, E) will have a computation history tracked by Autograd, and why? If you call E.backward(), which of these tensors will have a non-None value in their .grad attribute, and what does this value physically represent?



In [7]:
# Role of requires_grad:

'''
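Autograd tracks the history of every tensor derived from a tensor that has requires_grad=True.
- By default, requires_grad = False
- It has to be explicitly specified

A, C, E are tracked by Autograd. (A because it was created with requires_grad=True; C and E because if at least one input tensor requires gradients, the output tensor will also require gradients.)
B, D don't have their history tracked.

After E.backward(), only A has a non-None .grad: tensor([2.]), which is dE/dA, i.e. how much E changes per unit change in A.
C and E are non-leaf tensors, so their .grad is not retained by default.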
'''

a = torch.tensor([1,2,3],dtype=torch.float32, requires_grad=True)
loss = torch.var(a)
loss.backward()
print(a.grad)

tensor([-1.,  0.,  1.])
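The A to E exercise can also be checked by running it. A short sketch that rebuilds the same tensors and inspects `requires_grad` and `.grad`; the `tracked` dictionary is just a helper introduced here for printing.

```python
import torch

A = torch.tensor([5.0], requires_grad=True)
B = torch.tensor([10.0])
C = A * 2
D = B * 3
E = C + D

# Which tensors are part of the Autograd graph?
tracked = {name: t.requires_grad for name, t in zip("ABCDE", (A, B, C, D, E))}
print(tracked)

E.backward()  # allowed without arguments because E has a single element

print(A.grad)  # tensor([2.]), since E = 2*A + 3*B gives dE/dA = 2
print(B.grad)  # None: B was never tracked
```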


```
W = torch.randn(10, 1, requires_grad=True)
X = torch.randn(5, 10)
Y_pred = X @ W
Y_pred_detached = Y_pred.detach() # Line A
loss = torch.sum(Y_pred_detached**2) # Line B
```

If you call loss.backward():

Will the gradient ∂Loss/∂W be computed and stored in W.grad? Explain the role of Line A in this context.

If you changed Line A to be simply Y_pred_detached = Y_pred, what would change regarding the computation of W.grad?

In [18]:
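# Role of .detach():

'''
Here Y_pred = X @ W has its history tracked by autograd because it combines 2 inputs, one of which (W) is tracked.
But .detach() (Line A) returns a tensor that is cut off from the graph, so nothing computed from it is tracked.
The loss in Line B therefore has requires_grad=False: loss.backward() raises a RuntimeError and W.grad stays None.
With Y_pred_detached = Y_pred, the graph stays connected and W.grad is filled with dLoss/dW = 2 * X.T @ (X @ W).
'''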

a = torch.tensor([[1,3,4]],requires_grad=True,dtype=float)
loss = torch.var(a)
loss.backward()
print(a.grad)
# a.detach()
# print(a.grad)

tensor([[-1.6667,  0.3333,  1.3333]], dtype=torch.float64)


In [21]:
loss.detach()

tensor(2.3333, dtype=torch.float64)
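Both variants of the `.detach()` exercise can be run side by side. This is a sketch under the same shapes as the exercise; the fixed seed, the `error_raised` flag and the `expected` tensor are additions made for the check.

```python
import torch

torch.manual_seed(0)
W = torch.randn(10, 1, requires_grad=True)
X = torch.randn(5, 10)

# Variant 1: with Line A, the graph is cut before the loss is computed
loss_cut = torch.sum((X @ W).detach() ** 2)
try:
    loss_cut.backward()
    error_raised = False
except RuntimeError:
    error_raised = True      # loss_cut does not require grad
grad_after_cut = W.grad      # still None

# Variant 2: without .detach(), gradients flow back to W
Y_pred = X @ W
loss = torch.sum(Y_pred ** 2)
loss.backward()

# Hand-derived gradient of sum((X @ W)**2) with respect to W
expected = 2 * X.T @ (X @ W.detach())

print(error_raised, grad_after_cut)
print(torch.allclose(W.grad, expected, atol=1e-5))
```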