# **PyTorch Fundamentals**


You are transitioning from manual gradient calculations to using a framework that handles the heavy lifting, allowing you to focus on architecture and data.

Here is the breakdown for **L5: PyTorch Fundamentals**.

### Phase 1: Topic Breakdown

I have structured this to ensure you understand the "PyTorch Way" of doing things before we assemble the full MNIST classifier.

```text
L5: PyTorch Fundamentals
├── Concept 1: Tensors & Device Management
│   ├── Tensor creation and operations (vs NumPy)
│   ├── GPU/CUDA context (Hardware Check)
│   ├── Purpose: The fundamental data structure of deep learning
│   ├── Simple terms: N-dimensional arrays that live on video cards
│   └── Task: Create tensors, perform math, and move them to your RTX 4060
│
├── Concept 2: Autograd (Automatic Differentiation)
│   ├── Computational Graphs (DAGs)
│   ├── .requires_grad and .backward()
│   ├── Purpose: Automating the chain rule
│   ├── Simple terms: The engine that remembers your math to calculate gradients later
│   └── Task: Manually compute gradients for a simple equation using Autograd
│
├── Concept 3: Data Handling (Dataset & DataLoader)
│   ├── torch.utils.data.Dataset (Custom Class structure)
│   ├── torch.utils.data.DataLoader (Batching, Shuffling)
│   ├── Purpose: Decoupling data loading from training logic
│   ├── Simple terms: An organized conveyor belt feeding data to your model
│   └── Task: Implement a dummy Custom Dataset and iterate through it
│
├── Concept 4: Model Architecture (nn.Module)
│   ├── nn.Module class structure (__init__, forward)
│   ├── nn.Sequential (Container)
│   ├── Purpose: Encapsulating state (weights) and behavior (forward pass)
│   ├── Simple terms: Blueprints for your neural network layers
│   └── Task: Define a simple Multilayer Perceptron (MLP) for MNIST input
│
├── Concept 5: Loss & Optimization (Implicit Prerequisite)
│   ├── torch.nn Loss functions (CrossEntropyLoss)
│   ├── torch.optim (SGD/Adam)
│   ├── Purpose: Measuring error and updating weights
│   ├── Simple terms: The scoreboard (loss) and the coach (optimizer) correcting the players
│   └── Task: Initialize loss and optimizer for the model
│
└── Concept 6: The Training Loop (The "Build")
    ├── The Standard Cycle: Forward → Loss → Zero Grad → Backward → Step
    ├── Purpose: Orchestrating the learning process
    ├── Simple terms: The actual practice session where learning happens
    └── Mini-Project: Complete MNIST Digit Classifier Training

```

## **Concept 1: Tensors & Device Management**

### Intuition

In the world of PyTorch, the **Tensor** is the primary citizen. While it looks and behaves almost exactly like a NumPy array (n-dimensional grid of numbers), it has two superpowers that NumPy lacks:
   1. **GPU Acceleration:** Tensors can live on the GPU (VRAM) rather than just the CPU (RAM). This allows for massive parallel processing, which is critical for deep learning.
   2. **Autograd Compatibility:** Tensors can track the history of operations performed on them to automatically calculate gradients later (we will cover this in Concept 2).

Think of a CPU as a Professor: extremely smart, capable of complex logic, but can only do one or two things at a time. Think of a GPU as an army of minions: individually simple, but there are thousands of them working exactly in sync. Deep learning is mostly multiplying huge matrices, a task perfectly suited for the army.

### Mechanics

   * **Creation:** You can create tensors directly (`torch.tensor([1, 2])`) or convert from NumPy (`torch.from_numpy(arr)`).
   * **Device Management:** Every tensor has a `.device` property. By default, they are created on the `'cpu'`.
   * **Moving Data:** You cannot perform operations (like addition or multiplication) between a tensor on the CPU and a tensor on the GPU. They must be on the same device. You move them using `.to(device)` or `.cuda()`.
   * **Asynchronous Execution:** CUDA (GPU) operations are asynchronous. When Python tells the GPU to "multiply these matrices," the GPU says "Okay, I'll get to it" and Python immediately moves to the next line of code. If you want to time GPU operations accurately, you must force Python to wait until the GPU is finished using a synchronization command.
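
The creation and device points above can be sketched as follows. This is a minimal illustration that assumes NumPy is installed alongside PyTorch; the variable names (`a`, `arr`, `b`, `a_dev`) are made up for the example.

```python
import numpy as np
import torch

# Create a tensor directly from Python data.
a = torch.tensor([1.0, 2.0, 3.0])

# Convert from NumPy. torch.from_numpy shares memory with the source array,
# so an in-place change to the array is visible through the tensor.
arr = np.array([10.0, 20.0, 30.0])
b = torch.from_numpy(arr)
arr[0] = 99.0

print(a.device)      # tensors are created on the CPU by default
print(b[0].item())   # reflects the change made to `arr`

# Pick the GPU when one is available, otherwise stay on the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
a_dev = a.to(device)
print(a_dev.device.type)
```

Because the code picks the device at runtime, the same lines run on a machine with or without a GPU.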

### Simpler Explanation

A Tensor is just a container for numbers. If the container is in System RAM, the CPU does the math. If you move the container to Video RAM (VRAM), the graphics card does the math—much faster.

### Trade-offs & Pitfalls

   * **Overhead:** Moving data between CPU and GPU is slow (it travels over the PCIe bus). For small operations (like adding two numbers), the transfer time takes longer than the actual math. GPU is only worth it for *large* matrix operations.
   * **VRAM Limits:** If you load too many huge tensors, you will get a "CUDA Out of Memory" error.


### PyTorch Prerequisites: The Syntax Toolkit

PyTorch is designed to look and feel like NumPy, but with extra commands for the GPU. Here are the specific tools you need for this task.

#### 1. Checking Hardware

To see if your GPU is accessible, PyTorch provides a boolean check:

```python
# Returns True if GPU is ready, False otherwise
status = torch.cuda.is_available()

```
#### **What is CUDA?**

**CUDA** (Compute Unified Device Architecture) is a software layer created by **NVIDIA**.

Think of it as a translator.
   * **You (Python/PyTorch):** Speak high-level code instructions ("Multiply this matrix").
   * **Your GPU:** Speaks low-level hardware voltage signals.

Without CUDA, your Python code has no way to talk to the graphics card. The GPU is just a rock that draws pixels on your screen.

**CUDA** provides the bridge. It allows developers to send general-purpose math problems (not just graphics) to the GPU to be solved.

#### **Why do we check for it?**

In PyTorch, `torch.cuda.is_available()` is effectively asking:
   1. Do you have an NVIDIA GPU?
   2. Do you have the correct drivers installed?
   3. Can PyTorch "see" and talk to that GPU?

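If the answer is `True`, you can train models far faster than on your CPU; the exact speedup depends on the workload and hardware, and large matrix operations benefit the most. If it is `False` even though you have an NVIDIA GPU, try reinstalling PyTorch with a CUDA-enabled build (the `cu121` index below targets CUDA 12.1):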

   `pip uninstall torch torchvision torchaudio`

   `pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121`

#### 2. Creating Tensors

The syntax is nearly identical to NumPy.
   * **NumPy:** `np.random.rand(rows, cols)`
   * **PyTorch:** `torch.rand(rows, cols)` (Creates a tensor with random numbers between 0 and 1)

#### 3. Moving Data (The `.to()` method)

When a tensor is created, it sits in CPU RAM by default. You move it using `.to()` or specific device methods.

```python
# Create on CPU
x = torch.rand(100, 100)

# Move to GPU (returns a NEW copy on the device)
x_gpu = x.to('cuda')

```

*Note: You can also use string variables: `device = 'cuda' if torch.cuda.is_available() else 'cpu'`, then `x.to(device)`.*

#### 4. Matrix Operations

* **Multiplication:** You can use the standard `@` symbol for matrix multiplication, just like in modern NumPy.
   * `result = tensor_a @ tensor_b`
   * *Constraint:* Both `tensor_a` and `tensor_b` must be on the **same device**. If one is on CPU and one is on GPU, PyTorch raises a `RuntimeError` instead of computing the result.
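
A small sketch of the same-device rule, using tiny placeholder shapes so it runs instantly. The mixed-device branch only executes when a GPU is actually present.

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Create both operands directly on the same device.
tensor_a = torch.rand(3, 4, device=device)
tensor_b = torch.rand(4, 2, device=device)

# (3, 4) @ (4, 2) -> (3, 2); `@` is shorthand for torch.matmul.
result = tensor_a @ tensor_b
print(result.shape)
print(torch.allclose(result, torch.matmul(tensor_a, tensor_b)))

# Mixing devices is only demonstrable when a GPU is present.
if device.type == "cuda":
    try:
        _ = tensor_a.cpu() @ tensor_b  # CPU tensor @ GPU tensor
    except RuntimeError as err:
        print("Device mismatch:", err)
```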



#### 5. Timing GPU Operations (Crucial)

Because Python sends commands to the GPU asynchronously (fire-and-forget), Python might finish its code (stop the timer) while the GPU is still crunching numbers in the background.
To get an accurate time, you must force Python to wait:

```python
# Start Timer
# ... do gpu operations ...
torch.cuda.synchronize() # Wait for all GPU kernels to finish
# Stop Timer

```

---

### Your Task

Now that you have the tools, let's attempt the logic.

**Goal:** Prove the speed difference between CPU and GPU on your machine.

**Steps to Implement:**

1. Import `torch` and `time`.
2. Define a `device` variable (use the check from tool #1). Print what device you got.
3. Create two large random tensors (e.g., shape `10000, 10000`) on the CPU.
4. **CPU Benchmark:**
   * Record `start_time`.
   * Perform `matrix1 @ matrix2`.
   * Record `end_time` and print the duration.

5. **GPU Benchmark:**
   * Move both tensors to the GPU using `.to('cuda')`.
   * **Warm-up:** Run the multiplication once without timing it (GPUs take a moment to "wake up").
   * Record `start_time`.
   * Perform multiplication.
   * **Synchronize:** Call `torch.cuda.synchronize()`.
   * Record `end_time` and print duration.

**Inputs:** None (Generative)

**Outputs:** Device name, CPU time, GPU time, Speedup factor.



In [None]:
import torch
import time

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Device: {device}")

print("Creating tensors")
x_cpu = torch.rand(10000, 10000)
y_cpu = torch.rand(10000, 10000)

print("===========\nStart Comparison")
start_time = time.time()
_ = x_cpu @ y_cpu
end_time = time.time()
print(f"Time Taken for CPU = {end_time - start_time} sec")

x_gpu = x_cpu.to("cuda")
y_gpu = y_cpu.to("cuda")

print("Start Synchronization") # Wait for everything before starting
torch.cuda.synchronize()
start_time = time.time()
_ = x_gpu @ y_gpu
torch.cuda.synchronize() # <--- CRITICAL: Wait for GPU to finish!
end_time = time.time()
print(f"Time Taken for GPU = {end_time - start_time} sec")


**16.42s vs 0.25s.**
Your GPU is approximately **65x faster** than your CPU for this task. This is why we use GPUs.
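
The task steps call for a warm-up run before timing. One way to package that, sketched here under a few assumptions: the helper name `time_matmul` and the default size `n=1024` are invented for this example, and `time.perf_counter()` is swapped in for `time.time()` because it is better suited to short intervals. The helper also falls back to the CPU when no GPU is present.

```python
import time
import torch

def time_matmul(device: torch.device, n: int = 1024) -> float:
    """Return the seconds taken by one n x n matrix multiplication on `device`."""
    a = torch.rand(n, n, device=device)
    b = torch.rand(n, n, device=device)

    _ = a @ b                      # warm-up run, not timed
    if device.type == "cuda":
        torch.cuda.synchronize()   # make sure the warm-up has finished

    start = time.perf_counter()
    _ = a @ b
    if device.type == "cuda":
        torch.cuda.synchronize()   # wait for the GPU before stopping the clock
    return time.perf_counter() - start

cpu_seconds = time_matmul(torch.device("cpu"))
print(f"CPU: {cpu_seconds:.4f} s")
if torch.cuda.is_available():
    gpu_seconds = time_matmul(torch.device("cuda"))
    print(f"GPU: {gpu_seconds:.4f} s, speedup: {cpu_seconds / gpu_seconds:.1f}x")
```

Expect the numbers to differ from the figures quoted in this lesson, since they depend on matrix size and hardware.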

Now that we know how to hold data (Tensors) and where to put it (GPU), we need to discuss how PyTorch learns.

---

## Concept 2: Autograd (Automatic Differentiation)

### Intuition

In the "From Scratch" module, you had to manually derive the gradients (derivatives) for every function using Calculus. If you changed your model architecture, you had to re-do all the calculus on paper and re-write the backward pass code.

**Autograd** removes this burden. It is an engine that records every operation you perform on a tensor. When you are done, you press a "rewind" button, and it automatically calculates the gradients for you using the Chain Rule.

### Mechanics

1. **The Tape Recorder:** When you create a tensor with `requires_grad=True`, PyTorch starts a log.
   * If you do `y = x + 2`, PyTorch remembers: "To get $y$, we took $x$ and added 2."
   * It builds a **Computational Graph** (a Directed Acyclic Graph) connecting inputs to outputs.


2. **The Backward Pass:** When you call `.backward()` on the final result (usually the Loss), PyTorch walks backward through this graph.
3. **The `.grad` attribute:** The calculated derivatives are stored in the `.grad` property of the input tensors.
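
The three mechanics above can be seen on a tiny two-input expression. This is an illustrative sketch; the expression `z = x * y + y ** 2` is chosen only because its derivatives are easy to check by hand.

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(3.0, requires_grad=True)

# Forward pass: each operation is recorded in the graph.
z = x * y + y ** 2
print(z.grad_fn)   # the graph node that produced z

# Backward pass: walk the graph from z back to x and y.
z.backward()

print(x.grad)      # dz/dx = y      = 3
print(y.grad)      # dz/dy = x + 2y = 8
```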

### Simpler Explanation

Imagine you are walking through a maze (the forward pass). You leave a trail of breadcrumbs behind you. When you reach the end (the loss), you follow the breadcrumbs back to the start (backward pass) to figure out which turns led you to the mistake.

### Trade-offs

* **Memory:** Storing this graph takes memory. This is why we turn it off (`torch.no_grad()`) when we are just testing/evaluating the model: no gradients are needed there, so skipping the graph saves memory (RAM or VRAM).
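
A short sketch of what turning the recorder off looks like in practice. The tensor `w` is a stand-in for a model weight.

```python
import torch

w = torch.tensor(2.0, requires_grad=True)

# Normal mode: the result is attached to the graph.
tracked = w * 3
print(tracked.requires_grad)    # True

# Inside torch.no_grad(), no graph is recorded for new results.
with torch.no_grad():
    untracked = w * 3
print(untracked.requires_grad)  # False
```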

### Your Task

We will verify Autograd against your own calculus knowledge.

**Target Equation:**
$$y = x^3 + 5$$

**Analytical Derivative (Calculus):**
$$\frac{dy}{dx} = 3x^2$$

**Requirements:**

1. Create a tensor `x` with the value `4.0`. **Important:** You must set `requires_grad=True`.
2. Calculate `y = x**3 + 5`.
3. Print the value of `y`.
4. Call `y.backward()`.
5. Print the value of `x.grad`.
6. Manually calculate $3(4)^2$ in Python and print it to prove they match.


In [None]:
import torch

x = torch.tensor(4.0, requires_grad = True)
y = x ** 3 + 5
print(f"Y = {y}")
y.backward()
print(f"X Grad = {x.grad}")
print(f"3(4)**2 = {3*(4**2)}")

Perfect. You just proved that PyTorch can perform Calculus for you.

Now we move to how we feed data into these systems.

---

## **Concept 3: Data Handling (Dataset & DataLoader)**

### Intuition

In the "From Scratch" module, you likely loaded all your data into one giant NumPy array (e.g., `X_train`).
**Problem:** What if your dataset is 500GB of images? You can't load that into RAM.
**Solution:** The **Dataset** class. It acts like a librarian. It doesn't hold all the books (data) in its hands; it just knows *where* they are on the shelf and how to fetch *one* when asked.

The **DataLoader** is the delivery truck. It asks the Librarian for 32 books (a batch), packs them into a box, and delivers them to the GPU.

### Mechanics

To manage data in PyTorch, you almost always create a custom class that inherits from `torch.utils.data.Dataset`.

You **must** implement three magic methods:
   1. `__init__`: Setup (load file paths, CSVs, etc.).
   2. `__len__`: Returns the total number of samples.
   3. `__getitem__(index)`: Returns **one specific sample** at the given `index`.

### Syntax Toolkit

```python
from torch.utils.data import Dataset, DataLoader

class MyDataset(Dataset):
    def __init__(self, data):
        self.data = data
        
    def __len__(self):
        return len(self.data)
        
    def __getitem__(self, idx):
        # Retrieve the item at index 'idx'
        sample = self.data[idx]
        return sample

```

Once the class is defined, you hand it to the loader:

```python
# batch_size=4 means it groups 4 samples into one big tensor
loader = DataLoader(dataset_instance, batch_size=4, shuffle=True)

```
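
Putting the two snippets together, here is a sketch that wraps an in-memory tensor and inspects the batches. The data values are placeholders, and `shuffle=False` is used only so the batches are easy to read.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class MyDataset(Dataset):
    """Same structure as the class above, wrapping an in-memory tensor."""
    def __init__(self, data):
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]

# Ten samples with two features each (placeholder values).
dataset_instance = MyDataset(torch.arange(20, dtype=torch.float32).reshape(10, 2))

# shuffle=False keeps the order fixed so the batches are easy to inspect.
loader = DataLoader(dataset_instance, batch_size=4, shuffle=False)

batch_shapes = [tuple(batch.shape) for batch in loader]
print(len(loader))    # number of batches
print(batch_shapes)   # the last batch holds the remainder
```

With 10 samples and a batch size of 4, the loader yields two full batches and one batch of 2.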

### Your Task

Create a "Lazy" dataset that generates random numbers on the fly.

**Requirements:**
1. Define a class `RandomDataset` inheriting from `Dataset`.
   * `__init__`: Accepts an integer `length`. Stores it.
   * `__len__`: Returns the `length`.
   * `__getitem__`: Ignores the actual `idx`. Instead, it generates a random tensor of shape `(3,)` (the feature) and a random integer (0 or 1) (the label). It returns a tuple: `(feature, label)`.

2. Instantiate the dataset with a length of **10**.
3. Wrap it in a `DataLoader` with `batch_size=4`.
4. Write a loop to iterate through the loader.
5. Inside the loop, print the shape of the features batch and the labels batch.

**Expected Output intuition:**
If batch size is 4 and length is 10, you should see 3 loops.

   * Loop 1: 4 items
   * Loop 2: 4 items
   * Loop 3: 2 items (the remainder)



In [None]:
import torch
from torch.utils.data import Dataset, DataLoader
import random

class RandomDataset(Dataset):
    def __init__(self, length):
        self.length = length

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        tensor = torch.rand(3)
        ran_in = random.randint(0,1)
        return (tensor, ran_in)


dataset_instance = RandomDataset(10)
loader = DataLoader(dataset_instance, batch_size=4, shuffle=True)

for i, (f, l) in enumerate(loader):
    print(f"For Iteration {i+1}:\nFeatures shape = {f.shape}\nLabels shape = {l.shape}")

Excellent. You can see how the `DataLoader` automatically stacked your individual tensors into batches (e.g., `[4, 3]` shape). This automation is what makes training on millions of images possible.
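
The stacking also applies to the labels, even though `__getitem__` returns plain Python integers. A sketch with a hypothetical `PairDataset` that mirrors the task:

```python
import random
import torch
from torch.utils.data import Dataset, DataLoader

class PairDataset(Dataset):
    """Hypothetical dataset returning (feature tensor, Python int label)."""
    def __init__(self, length):
        self.length = length

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        return torch.rand(3), random.randint(0, 1)

loader = DataLoader(PairDataset(10), batch_size=4)
features, labels = next(iter(loader))

print(features.shape)   # four (3,) tensors stacked into one batch
print(labels.shape)     # four Python ints collected into one tensor
print(labels.dtype)
```

The default collation step turns the integers into a single integer tensor, which is the form `CrossEntropyLoss` expects for class indices.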

Now, we need something to consume that data.

---

## **Concept 4: Model Architecture (nn.Module)**

### Intuition

In PyTorch, every neural network is a Python class that inherits from `nn.Module`.
Think of this class as a blueprint.
   1. **`__init__` (The Inventory):** You list the parts you need (layers). "I need a Linear layer with 784 inputs, a ReLU activation, etc."
   2. **`forward` (The Assembly):** You define how data flows through those parts. "Take input `x`, pass it through layer 1, then apply ReLU, then layer 2."
This separation allows for complex, non-linear flows (like skipping layers) that you'll need for things like ResNet later.

### Mechanics (The Syntax Toolkit)

You need three main components for a basic network:
   1. **`nn.Flatten()`:** Images are 2D grids (e.g., 28x28). Dense layers (`nn.Linear`) only understand flat lists (vectors). This layer squashes the grid into a single line (28*28 = 784).
   2. **`nn.Linear(in_features, out_features)`:** This is the standard "Dense" layer you built from scratch in Module 1 ($y = xW^T + b$).
   3. **`nn.ReLU()`:** The activation function.
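
To see the three components in action, here is a small sketch on a fake batch. It also checks the $y = xW^T + b$ formula by hand; the tolerance `atol=1e-4` is an arbitrary choice to absorb floating-point rounding.

```python
import torch
import torch.nn as nn

images = torch.rand(8, 28, 28)    # a batch of 8 fake 28x28 images

flatten = nn.Flatten()            # keeps dim 0 (the batch), flattens the rest
flat = flatten(images)
print(flat.shape)                 # 8 rows of 784 values

linear = nn.Linear(784, 128)
out = linear(flat)
print(linear.weight.shape)        # stored as (out_features, in_features)

# The layer computes y = x W^T + b.
manual = flat @ linear.weight.T + linear.bias
print(torch.allclose(out, manual, atol=1e-4))

activated = nn.ReLU()(out)        # negative values become 0
print((activated >= 0).all().item())
```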

**Class Structure:**

```python
import torch.nn as nn

class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Define layers here
        self.layer1 = nn.Linear(10, 5)
        self.activation = nn.ReLU()
        
    def forward(self, x):
        # Define flow here
        x = self.layer1(x)
        x = self.activation(x)
        return x

```

`super().__init__()` is often blindly added to the code, but it serves a critical purpose.

### The Short Answer

`nn.Module` is a complex parent class that does a lot of hidden "magic" behind the scenes. If you don't call `super().__init__()`, that magic is never turned on, and your model will be broken.

### The "Magic" (Why you need it)

When you assign a layer like `self.layer1 = nn.Linear(...)`, PyTorch automatically detects it and adds its weights to a master list of **trainable parameters**.

The `nn.Module` parent class has an internal dictionary (think of it as a registry) to track these parameters.

* **With `super().__init__()`:** The registry is created. When you add a layer, PyTorch registers it.
* **Without it:** The registry doesn't exist. Assigning the first layer raises an `AttributeError` (in recent PyTorch versions the message is `cannot assign module before Module.__init__() call`), so the model cannot even be constructed.
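
You can inspect the registry yourself. The sketch below lists what `MyModel` registers and then shows a hypothetical `BrokenModel` that skips `super().__init__()`; the exact error text may vary between PyTorch versions, so it is printed rather than assumed.

```python
import torch.nn as nn

class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(10, 5)
        self.activation = nn.ReLU()

model = MyModel()

# Every registered parameter, with its shape.
shapes = {name: tuple(p.shape) for name, p in model.named_parameters()}
print(shapes)
total = sum(p.numel() for p in model.parameters())
print(total)   # 10*5 weights + 5 biases

class BrokenModel(nn.Module):
    """Hypothetical class that forgets super().__init__()."""
    def __init__(self):
        self.layer1 = nn.Linear(10, 5)   # fails: the registry does not exist yet

try:
    BrokenModel()
    broken_error = None
except AttributeError as err:
    broken_error = err
print(broken_error)
```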

### Intuitive Metaphor

Imagine `nn.Module` is a general **Building Contractor** and your class `MNISTClassifier` is a specific **House Design**.
   * The Contractor knows how to install plumbing, electricity, and the foundation.
   * Your Design specifies where the walls and windows go.

Calling `super().__init__()` is like paying the Contractor to pour the foundation and hook up the power **before** you start putting up your specific walls. If you skip it, you are building walls on dirt with no electricity, and nothing will work.

---


### Trade-offs

* **Sequential vs. Class:** You *can* use `nn.Sequential` (a list of layers) for simple stacks, but the **Class** structure is mandatory for advanced architectures (Transformers, ResNets). We will stick to the Class structure to build good habits.
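
For comparison, this is roughly what the `nn.Sequential` form of a simple stack looks like. It is shown only to illustrate the trade-off, with the layer sizes taken from the MNIST setup used in this lesson.

```python
import torch
import torch.nn as nn

# Layers run in the order they are listed; there is no custom forward().
sequential_mlp = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 128),
    nn.ReLU(),
    nn.Linear(128, 10),
)

dummy = torch.rand(8, 28, 28)
out = sequential_mlp(dummy)
print(out.shape)
```

The container is compact, but it can only express a straight line of layers, which is why branching or skip connections need a class with its own `forward`.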

### Your Task

Build a Multi-Layer Perceptron (MLP) designed for the MNIST dataset.

**Specifications:**

1. Create a class `MNISTClassifier` inheriting from `nn.Module`.
2. **`__init__`:**
   * Define a `flatten` layer.
   * Define a first linear layer: Input **784** (28x28 pixels), Output **128**.
   * Define a ReLU activation.
   * Define a second linear layer (Output Layer): Input **128**, Output **10** (for digits 0-9).


3. **`forward`:** Connect them in order: Flatten $\to$ Linear1 $\to$ ReLU $\to$ Linear2.
4. **Test it:**
   * Instantiate the model.
   * Move it to the GPU (`.to(device)`).
   * Create a dummy input tensor representing **one batch of 8 images** (Shape: `(8, 28, 28)`). **Important:** Don't forget to move this input to the GPU too!
   * Pass the input through the model.
   * Print the output shape.



**Expected Output Shape:** `torch.Size([8, 10])` (8 images, 10 raw class scores, or logits, each).


In [1]:
import torch
import torch.nn as nn

class MNISTClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.layer1 = nn.Linear(28*28, 128)
        self.activation = nn.ReLU()
        self.layer2 = nn.Linear(128, 10)

    def forward(self, x):
        x = self.flatten(x)
        x = self.layer1(x)
        x = self.activation(x)
        x = self.layer2(x)
        return x

if __name__ == "__main__":
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = MNISTClassifier().to(device)
    model_input = torch.rand(8, 28, 28).to(device)
    output = model(model_input)
    print(output.shape)
    

torch.Size([8, 10])


Perfect. You have a working brain (the Model) and a working body (the Hardware). Now we need to teach it how to learn.

---

## **Concept 5: Loss & Optimization**

### Intuition

To learn, a model needs two things:
   1. **A Scoreboard (Loss Function):** Tells the model how terrible its guess was.
   2. **A Coach (Optimizer):** Looks at the mistakes (gradients) and updates the weights to do better next time.

### Mechanics

#### 1. The Loss Function: `nn.CrossEntropyLoss`

For multi-class classification (like 10 digits), this is the standard choice.
   * **Important:** PyTorch's `CrossEntropyLoss` expects **raw logits** (the raw numbers coming out of your final Linear layer). It applies Softmax **internally**.
   * **Pitfall:** If you add `nn.Softmax` to the end of your model class and *then* use CrossEntropyLoss, you are applying Softmax twice. The second Softmax squashes the differences between classes, which shrinks the gradients and slows or stalls training. Your model correctly ends with `Linear`, so you are safe.

> "Is it like actual - guessed value?"

For **Regression** (predicting a price like $50.50), yes! We use Mean Squared Error (`(actual - guess)^2`).

For **Classification** (predicting "Is this a Cat or Dog?"), it is different.
You cannot subtract "Cat" from "Dog".
Instead, the model outputs probabilities: `[0.9 Dog, 0.1 Cat]`.
If the actual answer is "Cat", the **CrossEntropyLoss** punishes the model heavily for being 90% confident in the wrong answer.
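   * If the model says 99% Dog and it is a Dog -> Low Loss ($-\ln(0.99) \approx 0.01$).
   * If the model says 99% Cat and it is a Dog -> High Loss ($-\ln(0.01) \approx 4.6$, growing without bound as the confidence in the wrong answer approaches 100%).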
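
The Cat/Dog numbers can be reproduced by hand. In this sketch the helper `manual_cross_entropy` is a made-up name, and the logits are set to `log(p)` purely so that Softmax recovers the probabilities `[0.99, 0.01]`.

```python
import torch
import torch.nn.functional as F

# Logits for one sample over two classes: index 0 = Dog, index 1 = Cat.
# log(p) is used as the logit so that softmax recovers [0.99, 0.01].
logits = torch.log(torch.tensor([[0.99, 0.01]]))

def manual_cross_entropy(logits, target_index):
    probs = torch.softmax(logits, dim=1)   # 1. scores -> probabilities
    log_probs = torch.log(probs)           # 2. logarithm
    return -log_probs[0, target_index]     # 3. pick the true class, negate

loss_right = manual_cross_entropy(logits, 0)   # truth is Dog, model says 99% Dog
loss_wrong = manual_cross_entropy(logits, 1)   # truth is Cat, model says 99% Dog

print(f"{loss_right.item():.4f}")   # -ln(0.99)
print(f"{loss_wrong.item():.4f}")   # -ln(0.01)

# The built-in function agrees with the manual steps.
builtin = F.cross_entropy(logits, torch.tensor([1]))
print(torch.allclose(builtin, loss_wrong, atol=1e-5))
```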

#### 2. The Optimizer: `torch.optim`

The optimizer holds the model's parameters and updates them.
   * **SGD (Stochastic Gradient Descent):** The classic approach.
   * **Adam:** A smarter, adaptive version of SGD. It's the default "just make it work" optimizer for most modern tasks.

> "Ideally it will eventually make the loss 0?"

In a perfect world, yes. In reality, a training loss of exactly 0 usually means the model has memorized the training data (Overfitting), which is bad. We aim for a loss that is low and stays low on data the model has never seen, which indicates it has learned general patterns rather than memorized answers.

The optimizer uses the gradients (calculated by `backward()`) to nudge the weights "downhill" towards that minimum.

**Syntax Toolkit:**

```python
# 1. Select Loss
loss_fn = nn.CrossEntropyLoss()

# 2. Select Optimizer
# You must pass the list of parameters it needs to manage
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

```

> "Does it import all parameters?"

Yes. Remember the `super().__init__()` magic? When you defined `self.layer1 = nn.Linear(...)`, PyTorch automatically registered the weights ($W$) and biases ($b$) of that layer into a list.
Calling `model.parameters()` simply grabs that list and hands it to the optimizer so it knows **which** variables it is allowed to change.

> "If I wanted a different optimizer... I just change it?"

Exactly.
   * `torch.optim.Adam(model.parameters(), lr=0.001)`
   * `torch.optim.SGD(model.parameters(), lr=0.01)`
   * `torch.optim.RMSprop(model.parameters(), lr=0.01)`
The syntax is identical.
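
As a rough illustration of how interchangeable they are, the sketch below runs one update with each of the three optimizers on a tiny stand-in model. The helper `one_step` and the toy sizes are invented for this example.

```python
import torch
import torch.nn as nn

def one_step(optimizer_cls, lr):
    """Run a single update with the given optimizer class; return how far the weights moved."""
    torch.manual_seed(0)
    model = nn.Linear(4, 2)
    optimizer = optimizer_cls(model.parameters(), lr=lr)
    before = model.weight.detach().clone()

    inputs = torch.rand(8, 4)
    targets = torch.randint(0, 2, (8,))
    loss = nn.CrossEntropyLoss()(model(inputs), targets)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return (model.weight.detach() - before).abs().sum().item()

choices = [(torch.optim.Adam, 0.001), (torch.optim.SGD, 0.01), (torch.optim.RMSprop, 0.01)]
for cls, lr in choices:
    print(cls.__name__, one_step(cls, lr))
```

Only the class name and learning rate change; the surrounding training code stays the same.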

### Your Task

Verify the "Backward" plumbing works by manually running one pass.

**Requirements:**
   1. Define the `loss_fn` (CrossEntropyLoss) and `optimizer` (Adam with learning rate `1e-3`).
   2. Create a **target** tensor (the "correct answers"):
      * Shape: `(8,)` (matching your batch size of 8).
      * Values: Random integers between 0 and 9.
      * Device: **Must be on GPU** (same as your output).
   
   3. Calculate the loss: `loss = loss_fn(output, target)`.
   4. Print the loss value.
   5. **Trigger Autograd:** Call `loss.backward()`.
   6. **Verify Gradients:** Print the shape of the gradients for the final layer's weights: `model.layer2.weight.grad.shape`.



In [3]:
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = MNISTClassifier().to(device)
model_inp = torch.rand(8, 28, 28).to(device)
output = model(model_inp)

target = torch.randint(0, 10, (8,)).to(device)

loss_fn = nn.CrossEntropyLoss()

optimizer = torch.optim.Adam(model.parameters(), lr = 1e-3)

loss = loss_fn(output, target)
print(f"Loss: {loss.item()}")

loss.backward()
print(f"Gradient shape for model.layer2.weight: {model.layer2.weight.grad.shape}")

Loss: 2.2805752754211426
Gradient shape for model.layer2.weight: torch.Size([10, 128])


## **Concept 6: The Training Loop**

### Intuition

This is the heartbeat of Deep Learning. A "Training Loop" is just a standard Python loop that repeats the process you just performed (Forward $\to$ Backward $\to$ Update) thousands of times.

#### The "Mode Switch": `model.train()` vs `model.eval()`

Neural networks have two modes of existence: **Learning** and **Performing**.
   1. **`model.train()`:** Tells PyTorch "We are learning right now."
      * **Effect:** Enables layers that work differently during training, such as **Dropout** (randomly turning off neurons) and **Batch Normalization** (calculating statistics based on the current batch).
   2. **`model.eval()`:** Tells PyTorch "We are testing/predicting."
      * **Effect:** Disables Dropout (uses all neurons) and makes Batch Normalization use its stored running statistics instead of the current batch.

`MNISTClassifier` only uses `Linear` and `ReLU`, and these layers behave exactly the same way in both modes, so we will not call `model.train()` here. As soon as you add a `Dropout` layer (which we will later), forgetting to switch modes will badly hurt your model's performance.
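
To make the difference concrete, here is a sketch using a standalone `nn.Dropout` layer on a vector of ones. The drop probability `p=0.5` and the vector length are arbitrary choices for the demonstration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dropout = nn.Dropout(p=0.5)
x = torch.ones(1000)

dropout.train()   # training mode: zero some values, scale the rest by 1/(1-p)
train_out = dropout(x)

dropout.eval()    # evaluation mode: the layer passes data through unchanged
eval_out = dropout(x)

print(sorted(train_out.unique().tolist()))   # surviving values are scaled to 2.0
print(torch.equal(eval_out, x))
```

The same input gives different outputs depending on the mode, which is exactly why the switch matters once such layers are in the model.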

### Mechanics: The 5 Steps

Inside the loop, you must perform these 5 steps in this **exact order** for every batch of data:
   1. **Forward Pass:** Pass data through the model.
         ```python
         pred = model(X)
         loss = loss_fn(pred, y)
         
         ```
         * **Action:** The input `X` travels through the layers.
         * **Memory:** PyTorch builds the **Computational Graph** on the fly. It stores the input values at every layer in memory (VRAM). It needs these "activations" later to calculate the derivative.
         * **Result:** You get a loss value (e.g., 2.3). The graph is now fully built, connecting `loss` back to every weight in the model.  

   2. **Calculate Loss:** Compare prediction vs. truth.
      1. **The Inputs**
      
      The function `loss_fn(prediction, target)` takes two very different looking tensors:
         * **`prediction` (from the model):**
            * **Shape:** `[Batch_Size, 10]` (e.g., `[64, 10]`).
            * **Content:** Raw scores (logits). For a single image, it might look like: `[ -2.0, 5.1, 0.1, ... ]`. The higher the number, the more the model believes that index is the correct digit.
         * **`target` (from the dataset):**
            * **Shape:** `[Batch_Size]` (e.g., `[64]`).
            * **Content:** The correct integer indices. For that same image: `1` (meaning the digit "1").
      
      2. **The Internal Mechanics (CrossEntropyLoss)**
         When you run `loss = loss_fn(prediction, target)`, PyTorch performs three mathematical operations instantly:
            1. **Softmax:** It squashes the raw scores into probabilities (0.0 to 1.0) so they sum up to 100%.
            2. **Log:** It takes the logarithm of those probabilities (to penalize wrong answers more heavily).
            3. **Selection (NLL):** It looks at the **target** index. If the target is `1`, it grabs the log-probability the model assigned to index `1` and negates it. In terms of the underlying probability:
               * If the model assigned 0.9 (High confidence) $\to$ **Low Loss**.
               * If the model assigned 0.1 (Low confidence) $\to$ **High Loss**.
      
      3. **The Output**
      
      The result is a **single scalar number** (a float), like `2.34`.
         * This number represents the *average error* of the entire batch of 64 images.
         * **Crucial:** This number acts as the "root" of the tree. When you call `.backward()` later, it starts here and spreads out to all 64 images' calculations.
   
   3. **Zero Gradients:** `optimizer.zero_grad()`
      * **Why?** PyTorch accumulates gradients by default: each call to `.backward()` **adds** the new gradients to whatever is already stored in `.grad`. If you don't reset them, the gradients from Batch 1 will be added to those from Batch 2, creating a mess.
          ```python
          optimizer.zero_grad()
          
          ```
          * **When accumulation helps:** This behavior is useful for "Gradient Accumulation" (simulating a large batch size on small hardware), but in a standard loop it corrupts every update.
          * **Action:** This command goes through every parameter the optimizer manages and clears its `.grad` (recent PyTorch versions set it to `None` by default; pass `set_to_none=False` to fill it with zeros instead). It wipes the slate clean for the current batch.
   
   4. **Backward Pass:** `loss.backward()` (Calculate gradients).
         ```python
         loss.backward()
         
         ```
         * **Action:** This is the "Chain Rule" engine. It starts at the `loss` and walks backward along the graph we built in Step 1.
         * **Math:** For every weight $w$, it computes $\frac{\partial Loss}{\partial w}$ (how much the error changes if we wiggle this weight).
         * **Result:** It fills the `.grad` attribute of every parameter with a tensor of the same shape as that parameter.
         * **Memory:** Once this is done, the graph (and the saved activations from Step 1) is freed by default to save memory, unless you pass `retain_graph=True`.
   
   5. **Optimizer Step:** `optimizer.step()` (Update weights).
        ```python
        optimizer.step()
        
        ```
        * **Action:** The optimizer looks at the `.data` (the weight value) and the `.grad` (the calculated error slope) for every parameter.
        * **Math (SGD example):** It applies the update rule:
          $$w_{new} = w_{old} - (\text{learning\_rate} \times \text{gradient})$$
        * **Result:** The model is now slightly smarter than it was a millisecond ago.
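
The accumulation behavior and the SGD update rule can both be checked on a toy model. This is a sketch with invented sizes (a 3-input, 2-class `nn.Linear`) and plain SGD, chosen so the update can be verified by hand.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(3, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()
X = torch.rand(4, 3)
y = torch.tensor([0, 1, 0, 1])

# Backward once: .grad holds the gradient of this batch.
loss_fn(model(X), y).backward()
first = model.weight.grad.clone()

# Backward again WITHOUT zeroing: the new gradient is added to the old one.
loss_fn(model(X), y).backward()
accumulated = model.weight.grad.clone()
print(torch.allclose(accumulated, 2 * first, atol=1e-6))

# The usual cycle: forward, loss, zero, backward, step.
pred = model(X)
loss = loss_fn(pred, y)
optimizer.zero_grad()
loss.backward()
fresh = model.weight.grad.clone()
old_weight = model.weight.detach().clone()
optimizer.step()

print(torch.allclose(fresh, first, atol=1e-6))   # gradient of this batch only
# Plain SGD update: w_new = w_old - lr * grad
print(torch.allclose(model.weight.detach(), old_weight - 0.1 * fresh, atol=1e-6))
```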

### Simpler Explanation

Imagine painting a portrait.

1. Paint a stroke (Forward).
2. Compare it to the subject (Loss).
3. **Wipe your brush** (Zero Grad). *If you don't, the old paint mixes with the new paint!*
4. Decide how to fix the stroke (Backward).
5. Move your hand to correct it (Step).

---

## Mini-Project: Fake-MNIST Training Loop

We will put everything together. Since downloading the real MNIST dataset can be messy with file permissions, we will update your `RandomDataset` to generate "Fake MNIST" data (random noise images) to prove the loop works.

**Objective:** Train your model on generated data for 3 epochs.

### Specifications

**1. Data Preparation**
   * Redefine `RandomDataset`:
      * `__getitem__` must now return a tuple: `(image, label)`.
      * `image`: Random tensor of shape `(28, 28)` (representing a 28x28 pixel image).
      * `label`: Random integer `0-9`.
   * Create the dataset with `length=1000` and wrap it in a `DataLoader` with `batch_size=64`.

**2. Model & Setup**
   * Instantiate `MNISTClassifier`.
   * Move it to the **GPU**.
   * Define `CrossEntropyLoss` and `Adam` optimizer (lr=0.001).

**3. The Training Loop**
   * Write a loop for `3` Epochs.
   * Inside, iterate through the `DataLoader`.
   * **Crucial:** Move `images` and `labels` to the GPU *inside* the loop.
   * Perform the 5 standard steps (Forward, Loss, Zero, Backward, Step).
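   * **Logging:** Print the loss value **every 5 batches** (e.g., "Epoch 1, Batch 5, Loss: 2.30"). With 1000 samples and a batch size of 64 there are only 16 batches per epoch, so a much larger interval would never print.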

**Forbidden:** Do not use `torchvision` or `sklearn`. Use only the tools we discussed.


In [8]:
from torch.utils.data import Dataset, DataLoader
import torch
import torch.nn as nn
import random

class RandomDataset(Dataset):
    def __init__(self, length):
        self.length = length

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        image = torch.rand(28, 28)
        label = random.randint(0, 9)
        return (image, label)

dataset_instance = RandomDataset(1000)
loader = DataLoader(dataset_instance, batch_size=64, shuffle=True)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = MNISTClassifier().to(device)
model_input = torch.rand(8, 28, 28).to(device)
output = model(model_input)
target = torch.randint(0, 10, (8,)).to(device)


loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr = 0.001)

for epoch in range(1, 4):
    for idx, (image, label) in enumerate(loader):
        image, label = image.to(device), label.to(device)
        output = model(image)
        loss = loss_fn(output, label)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if (idx + 1) % 5 == 0:
            print(f"Epoch {epoch}, Batch {idx + 1}, Loss: {loss.item():.4f}")


Epoch 1, Batch 5, Loss: 2.3696
Epoch 1, Batch 10, Loss: 2.3022
Epoch 1, Batch 15, Loss: 2.3867
Epoch 2, Batch 5, Loss: 2.2850
Epoch 2, Batch 10, Loss: 2.3002
Epoch 2, Batch 15, Loss: 2.2887
Epoch 3, Batch 5, Loss: 2.2729
Epoch 3, Batch 10, Loss: 2.3181
Epoch 3, Batch 15, Loss: 2.3166


### Project Status: **PASSED** ✅

Congratulations. You have successfully built a fully functional PyTorch training pipeline from scratch.

### The "Why" behind your output

You might be wondering why your loss stayed stuck around **2.3** and didn't decrease.

* **The Math:** You have 10 classes.
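* **Random Chance:** The probability of guessing correctly by pure luck is $\frac{1}{10}$ or $0.1$.
* **The Formula:** CrossEntropyLoss calculates $-\ln(p_{\text{correct}})$, the negative log of the probability assigned to the correct class.
* **The Calculation:** $-\ln(0.1) \approx 2.30$.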

Because we used **random noise** inputs and **random** labels, the model *cannot* learn a pattern. It is doing exactly what it should: guessing randomly. If the loss had dropped to 0.5, it would have meant your code had a bug (or your random number generator wasn't random)!
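
For contrast, here is a sketch of the same loop on data that *can* be learned. Everything about the data is an assumption made for the demonstration: the inputs are 10 random features and the label is simply the index of the largest one, so a single `nn.Linear` layer is enough.

```python
import math
import torch
import torch.nn as nn

print(f"{math.log(10):.4f}")   # loss of a uniform guess over 10 classes

torch.manual_seed(0)
# Hypothetical learnable data: the label is the index of the largest of 10 features.
X = torch.randn(512, 10)
y = X.argmax(dim=1)

model = nn.Linear(10, 10)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.05)

losses = []
for step in range(200):
    loss = loss_fn(model(X), y)   # forward + loss
    optimizer.zero_grad()         # clear old gradients
    loss.backward()               # compute new gradients
    optimizer.step()              # update the weights
    losses.append(loss.item())

print(f"first: {losses[0]:.4f}, last: {losses[-1]:.4f}")
```

When the labels actually depend on the inputs, the loss falls below the random-guess level, which confirms that the flat 2.3 in your run comes from the data and not from the loop.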

### Code Hygiene Review

You have a few lines of "dead code" left over from previous steps that are no longer needed. In a real script, you would delete these to save memory:

```python
# These lines before the loop are unused because the loop handles data loading
# model_input = torch.rand(8, 28, 28).to(device)  <-- DELETE
# output = model(model_input)                     <-- DELETE
# target = torch.randint(0, 10, (8,)).to(device)  <-- DELETE

```

---

## Lesson L5 Complete.

You have mastered the "PyTorch Way":

1. **Tensors:** The data containers.
2. **Autograd:** The automatic calculus engine.
3. **Datasets/Loaders:** The data delivery system.
4. **nn.Module:** The architecture blueprint.
5. **The Loop:** The 5-step cycle of learning.

### What's Next?

Now that the engine is running, we need to make it drive somewhere real. The next logical step in **M1: Foundations** is to apply this to **Real Data** and add **Validation**.

**Ready to move to L6: Training on Real MNIST & Validation Loops?**