### Welcome to my tutorial on training a PyTorch model from scratch

In [1]:
import torch

In [2]:
# We'll create a "batch" of 10 data points
N = 10
# Each data point has 1 feature
D_in = 1
# The output for each data point is a single value
D_out = 1

# Create our input data X
# Shape: (10 rows, 1 column)
X = torch.randn(N, D_in)


In [5]:
# Create our true target labels y by applying the "true" function
# and adding some noise for realism
true_W = torch.tensor([[2.0]])
true_b = torch.tensor(1.0)
y_true = X @ true_W + true_b + torch.randn(N, D_out) * 0.1 # Add a little noise

print(f"Input Data X (first 3 rows):\n {X[:3]}\n")
print(f"True Labels y_true (first 3 rows):\n {y_true[:3]}")

Input Data X (first 3 rows):
 tensor([[ 2.0452],
        [-0.5060],
        [-1.0434]])

True Labels y_true (first 3 rows):
 tensor([[ 5.0715],
        [ 0.1660],
        [-1.0289]])


### 5.3. The Parameters: The Model's "Brain"
Now we create the parameters `W` and `b` that our model will learn. We initialize them with random values. Most importantly, we set `requires_grad=True`, which tells PyTorch's Autograd engine to record every operation involving them so that gradients can be computed later.

In [7]:
# Initialize our parameters with random values
# Shapes must be correct for matrix multiplication: X(10,1) @ W(1,1) -> (10,1)
W = torch.randn(D_in, D_out, requires_grad=True)
b = torch.randn(1, requires_grad=True)

print(f"Initial Weight W:\n {W}\n")
print(f"Initial Bias b:\n {b}")

Initial Weight W:
 tensor([[1.1082]], requires_grad=True)

Initial Bias b:
 tensor([0.9393], requires_grad=True)
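
To make "tracking" a little more concrete, here is a minimal sketch that is separate from the tutorial's own variables (the names `w_demo`, `x_demo`, and `out_demo` are made up for illustration). It shows that a result computed from a tracked tensor is itself tracked and carries a `grad_fn`, while plain data is not tracked.

```python
import torch

# Hypothetical demo tensors, not the tutorial's W and b
w_demo = torch.tensor([[0.5]], requires_grad=True)  # tracked parameter
x_demo = torch.tensor([[3.0]])                      # plain data, not tracked

# Any result computed from a tracked tensor is tracked as well
out_demo = x_demo @ w_demo

print(f"w_demo is a leaf tensor: {w_demo.is_leaf}, grad_fn: {w_demo.grad_fn}")
print(f"out_demo requires grad: {out_demo.requires_grad}, grad_fn: {out_demo.grad_fn}")
print(f"x_demo requires grad: {x_demo.requires_grad}")
```

Tensors we create ourselves are "leaf" tensors with no `grad_fn`; only leaves with `requires_grad=True` receive a `.grad` value after a backward pass.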


### 5.4. The Implementation: From Math to Code
Now for the main event. We translate our mathematical formula `ŷ = XW + b` directly into a single line of PyTorch code.

In [9]:
# Perform the forward pass to get our first prediction
y_hat = X @ W + b

print(f"Shape of our prediction y_hat: {y_hat.shape}\n")
print(f"Prediction y_hat (first 3 rows):\n {y_hat[:3]}\n")
print(f"True Labels y_true (first 3 rows):\n {y_true[:3]}")

Shape of our prediction y_hat: torch.Size([10, 1])

Prediction y_hat (first 3 rows):
 tensor([[ 3.2057],
        [ 0.3786],
        [-0.2169]], grad_fn=<SliceBackward0>)

True Labels y_true (first 3 rows):
 tensor([[ 5.0715],
        [ 0.1660],
        [-1.0289]])
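
One detail worth checking in the forward pass is how the bias is added: `b` has shape `(1,)`, so PyTorch broadcasts it across all 10 rows. The sketch below uses its own illustrative tensors (`X_demo`, `W_demo`, `b_demo` are assumptions, not the tutorial's data) and compares the matrix form against an explicit "scale each row, then add the bias" version.

```python
import torch

torch.manual_seed(0)  # arbitrary seed, only to make the demo repeatable

X_demo = torch.randn(10, 1)  # (N, D_in)
W_demo = torch.randn(1, 1)   # (D_in, D_out)
b_demo = torch.randn(1)      # shape (1,), broadcast across all rows

# Matrix form, same expression as in the tutorial
y_hat_demo = X_demo @ W_demo + b_demo

# Explicit form: multiply every row by the single weight, then add the single bias
explicit_demo = X_demo * W_demo[0, 0] + b_demo[0]

print(f"Shape: {y_hat_demo.shape}")
print(f"Both forms agree: {torch.allclose(y_hat_demo, explicit_demo)}")
```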


### 6.1. Defining Error: The Loss Function

We need a single number that tells us how "wrong" our predictions are. This is called the **Loss**. For regression, the most common loss function is the **Mean Squared Error (MSE)**.

The formula is simple:
`L = (1/N) * Σ(ŷ_i - y_i)²`

In plain English: "For every data point, find the difference between the prediction and the truth, square it, and then take the average of all these squared differences."

Let's translate this directly into PyTorch code, using the `y_hat` from Part 5.

In [10]:
# y_hat is our prediction from the forward pass
# y_true is the ground truth
# Let's calculate the loss manually
error = y_hat - y_true
squared_error = error ** 2
loss = squared_error.mean()

print(f"Prediction (first 3):\n {y_hat[:3]}\n")
print(f"Truth (first 3):\n {y_true[:3]}\n")
print(f"Loss (a single number): {loss}")

Prediction (first 3):
 tensor([[ 3.2057],
        [ 0.3786],
        [-0.2169]], grad_fn=<SliceBackward0>)

Truth (first 3):
 tensor([[ 5.0715],
        [ 0.1660],
        [-1.0289]])

Loss (a single number): 1.1553044319152832
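
As a sanity check on the manual calculation, we can compare it with PyTorch's built-in `torch.nn.functional.mse_loss`, whose default `reduction='mean'` averages the squared differences. This is a small sketch on made-up tensors (`pred_demo` and `target_demo` are illustrative), not the tutorial's `y_hat` and `y_true`.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)  # arbitrary seed, only to make the demo repeatable

pred_demo = torch.randn(10, 1)    # stand-in for predictions
target_demo = torch.randn(10, 1)  # stand-in for ground truth

# Manual MSE: difference, square, average
manual_mse = ((pred_demo - target_demo) ** 2).mean()

# Built-in MSE with the default reduction='mean'
builtin_mse = F.mse_loss(pred_demo, target_demo)

print(f"Manual:   {manual_mse.item():.6f}")
print(f"Built-in: {builtin_mse.item():.6f}")
```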


### 6.2. The Magic Command: `loss.backward()`

This is where the magic of Autograd happens. With a single command, we tell PyTorch to send a signal backward from the `loss` through the entire computation graph it built during the forward pass.

This command calculates the gradient of the `loss` with respect to every tensor that has `requires_grad=True` and was used to compute it. In our case, it will compute:
*   `∂L/∂W` (the gradient of the Loss with respect to our Weight `W`)
*   `∂L/∂b` (the gradient of the Loss with respect to our Bias `b`)


In [11]:
loss.backward()

PyTorch has now populated the `.grad` attribute for our `W` and `b` tensors.
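
To see what `backward()` computes on a case small enough to check by hand, here is a sketch with a single scalar weight. The names `w_s`, `x_s`, and `target_s` are made up for this example and are not the tutorial's tensors.

```python
import torch

w_s = torch.tensor(3.0, requires_grad=True)  # the only tracked value
x_s = torch.tensor(2.0)
target_s = torch.tensor(10.0)

# Forward pass: prediction is w*x = 6, so the squared error is (6 - 10)^2 = 16
loss_s = (w_s * x_s - target_s) ** 2

# Backward pass: fills in w_s.grad
loss_s.backward()

# Chain rule by hand: dL/dw = 2 * (w*x - target) * x = 2 * (-4) * 2 = -16
print(f"loss = {loss_s.item()}, dL/dw = {w_s.grad.item()}")  # loss = 16.0, dL/dw = -16.0
```

The value stored in `w_s.grad` matches the chain-rule result, which is exactly the bookkeeping Autograd does for us on larger graphs.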


### 6.3. Inspecting the Result: The `.grad` Attribute

The `.grad` attribute now holds the gradient for each parameter. This is the "signal" that tells us how to adjust our knobs.


In [12]:
# The gradients are now stored in the .grad attribute of our parameters
print(f"Gradient for W (∂L/∂W):\n {W.grad}\n")
print(f"Gradient for b (∂L/∂b):\n {b.grad}")


Gradient for W (∂L/∂W):
 tensor([[-2.6305]])

Gradient for b (∂L/∂b):
 tensor([0.0684])


#### **How to Interpret These Gradients:**

*   **`W.grad` is -2.6305:** The negative sign is key. It means that if we were to *increase* `W`, the loss would *decrease*. The gradient points in the direction of the steepest *increase* in loss, so we'll want to move in the opposite direction.
*   **`b.grad` is 0.0684:** This one is positive and small. Increasing `b` would slightly *increase* the loss, so we'll want to nudge `b` down a little. The small magnitude tells us the loss is far less sensitive to `b` than to `W` at this point.
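
For the MSE loss with `ŷ = XW + b`, the gradients can also be derived by hand: `∂L/∂W = (2/N) * Xᵀ(ŷ - y)` and `∂L/∂b = (2/N) * Σ(ŷ_i - y_i)`. The sketch below checks those formulas against Autograd on freshly generated data. All `_demo` names are illustrative, and the data is built in the same style as the tutorial's but is not the same random draw, so the numbers will differ from the output above.

```python
import torch

torch.manual_seed(0)  # arbitrary seed, only to make the demo repeatable

N_demo = 10
X_demo = torch.randn(N_demo, 1)
y_demo = X_demo @ torch.tensor([[2.0]]) + 1.0 + 0.1 * torch.randn(N_demo, 1)

W_demo = torch.randn(1, 1, requires_grad=True)
b_demo = torch.randn(1, requires_grad=True)

# Forward pass and loss, then let Autograd compute the gradients
y_hat_demo = X_demo @ W_demo + b_demo
loss_demo = ((y_hat_demo - y_demo) ** 2).mean()
loss_demo.backward()

# Hand-derived gradients of the MSE loss, computed without tracking
with torch.no_grad():
    err_demo = y_hat_demo - y_demo
    grad_W_manual = (2.0 / N_demo) * (X_demo.T @ err_demo)  # shape (1, 1)
    grad_b_manual = (2.0 / N_demo) * err_demo.sum(dim=0)    # shape (1,)

print(f"W gradients agree: {torch.allclose(W_demo.grad, grad_W_manual, atol=1e-6)}")
print(f"b gradients agree: {torch.allclose(b_demo.grad, grad_b_manual, atol=1e-6)}")
```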

We now have everything we need to improve our model:
1.  A way to measure error (the loss).
2.  The exact direction to turn our parameter "knobs" to reduce that error (the gradients).

We have completed the analysis. The final step is to actually *act* on this information—to update our weights and biases.

This leads us to the heart of the training process. Let's move on to **Part 7: The Training Loop - Gradient Descent From Scratch**.


## Part 7: The Training Loop - Gradient Descent From Scratch

This is the heart of the entire deep learning process. The **Training Loop** repeatedly executes the forward and backward passes, incrementally updating the model's parameters to minimize the loss. This process is called **Gradient Descent**.

**The Analogy:** We're standing on a foggy mountain (the loss landscape) and want to get to the lowest valley (minimum loss). We can't see the whole map, but we can feel the slope of the ground beneath our feet (the gradients). The training loop is the process of taking a small step downhill, feeling the slope again, taking another step, and repeating until we reach the bottom.

Our goal is to implement this "step-by-step" descent from scratch.

### 7.1. The Algorithm: Gradient Descent

The core update rule for gradient descent was promised in the very beginning, and now we can finally implement it:

`θ_{t+1} = θ_t - η * ∇_θ L`

Let's translate this from math to our context:
*   `θ`: Represents all our parameters, `W` and `b`.
*   `η` (eta): The **learning rate**, a small number that controls how big of a step we take.
*   `∇_θ L`: The gradient of the loss with respect to our parameters, which we now have in `W.grad` and `b.grad`.

So, the update rules for our model are:
1.  `W_new = W_old - learning_rate * W.grad`
2.  `b_new = b_old - learning_rate * b.grad`
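
Before running the full loop, it can help to watch a single update on a toy problem. This sketch applies the rule once to the one-parameter loss `(θ - 3)²`; the loss function, the starting point, and the step size `eta_demo = 0.1` are all assumptions chosen so the arithmetic is easy to follow.

```python
import torch

eta_demo = 0.1                                    # illustrative learning rate
theta_demo = torch.tensor(0.0, requires_grad=True)

# Loss at the starting point: (0 - 3)^2 = 9
loss_before = (theta_demo - 3.0) ** 2
loss_before.backward()                            # gradient: 2 * (0 - 3) = -6

# One gradient descent step: theta <- theta - eta * grad = 0 - 0.1 * (-6) = 0.6
with torch.no_grad():
    theta_demo -= eta_demo * theta_demo.grad
theta_demo.grad.zero_()

# Loss after the step: (0.6 - 3)^2 = 5.76, lower than before
loss_after = (theta_demo - 3.0) ** 2

print(f"theta after one step: {theta_demo.item():.3f}")
print(f"loss before: {loss_before.item():.3f}, loss after: {loss_after.item():.3f}")
```

Because the gradient was negative, subtracting it moved `θ` upward, toward the minimum at 3.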


In [14]:
# Hyperparameters
learning_rate = 0.01
epochs = 200

# Let's re-initialize our random parameters
W = torch.randn(1, 1, requires_grad=True)
b = torch.randn(1, requires_grad=True)

print(f"Starting Parameters: W={W.item():.3f}, b={b.item():.3f}\n")

# The Training Loop
for epoch in range(epochs):
    ### STEP 1 & 2: Forward Pass and Loss Calculation ###
    y_hat = X @ W + b
    loss = torch.mean((y_hat - y_true)**2)

    ### STEP 3: Backward Pass (Calculate Gradients) ###
    loss.backward()

    ### STEP 4: Update Parameters (The Gradient Descent Step) ###
    # We wrap this in no_grad() so Autograd does not record the update itself;
    # an in-place change to a leaf tensor that requires grad would otherwise raise an error
    with torch.no_grad():
        W -= learning_rate * W.grad
        b -= learning_rate * b.grad

    ### STEP 5: Zero the Gradients ###
    # backward() adds to .grad rather than overwriting it,
    # so we must reset the gradients before the next iteration
    W.grad.zero_()
    b.grad.zero_()

    # Optional: Print progress
    if epoch % 10 == 0:
        print(f"Epoch {epoch:02d}: Loss={loss.item():.4f}, W={W.item():.3f}, b={b.item():.3f}")

print(f"\nFinal Parameters: W={W.item():.3f}, b={b.item():.3f}")
print(f"True Parameters:  W={true_W.item():.3f}, b={true_b.item():.3f}")

Starting Parameters: W=1.803, b=-0.394

Epoch 00: Loss=2.0255, W=1.805, b=-0.366
Epoch 10: Loss=1.3684, W=1.819, b=-0.115
Epoch 20: Loss=0.9266, W=1.835, b=0.091
Epoch 30: Loss=0.6290, W=1.852, b=0.260
Epoch 40: Loss=0.4283, W=1.868, b=0.399
Epoch 50: Loss=0.2927, W=1.883, b=0.512
Epoch 60: Loss=0.2011, W=1.897, b=0.605
Epoch 70: Loss=0.1390, W=1.910, b=0.682
Epoch 80: Loss=0.0970, W=1.921, b=0.745
Epoch 90: Loss=0.0686, W=1.930, b=0.796
Epoch 100: Loss=0.0493, W=1.938, b=0.839
Epoch 110: Loss=0.0363, W=1.945, b=0.874
Epoch 120: Loss=0.0274, W=1.951, b=0.902
Epoch 130: Loss=0.0214, W=1.956, b=0.926
Epoch 140: Loss=0.0174, W=1.961, b=0.945
Epoch 150: Loss=0.0146, W=1.964, b=0.961
Epoch 160: Loss=0.0127, W=1.967, b=0.974
Epoch 170: Loss=0.0114, W=1.970, b=0.985
Epoch 180: Loss=0.0106, W=1.972, b=0.994
Epoch 190: Loss=0.0100, W=1.974, b=1.001

Final Parameters: W=1.975, b=1.007
True Parameters:  W=2.000, b=1.000
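
Step 5 of the loop is easy to forget, so here is a small sketch of what happens without it. Calling `backward()` twice in a row adds the second gradient on top of the first instead of replacing it. The tensor `w_acc` is made up for this demo.

```python
import torch

w_acc = torch.tensor(1.0, requires_grad=True)

# First backward pass: d(2w)/dw = 2
(w_acc * 2.0).backward()
first_grad = w_acc.grad.clone()

# Second backward pass WITHOUT zeroing: the new gradient is added to the old one
(w_acc * 2.0).backward()
second_grad = w_acc.grad.clone()

# Zero the gradient, then run backward again: we are back to the correct value
w_acc.grad.zero_()
(w_acc * 2.0).backward()
third_grad = w_acc.grad.clone()

print(first_grad.item(), second_grad.item(), third_grad.item())  # 2.0 4.0 2.0
```

In the training loop, skipping the reset would make every update use the sum of all past gradients rather than the current slope.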
