# **PyTorch Fundamentals**


You are transitioning from manual gradient calculations to using a framework that handles the heavy lifting, allowing you to focus on architecture and data.

Here is the breakdown for **L5: PyTorch Fundamentals**.

### Phase 1: Topic Breakdown

I have structured this to ensure you understand the "PyTorch Way" of doing things before we assemble the full MNIST classifier.

```text
L5: PyTorch Fundamentals
├── Concept 1: Tensors & Device Management
│   ├── Tensor creation and operations (vs NumPy)
│   ├── GPU/CUDA context (Hardware Check)
│   ├── Purpose: The fundamental data structure of deep learning
│   ├── Simple terms: N-dimensional arrays that live on video cards
│   └── Task: Create tensors, perform math, and move them to your RTX 4060
│
├── Concept 2: Autograd (Automatic Differentiation)
│   ├── Computational Graphs (DAGs)
│   ├── .requires_grad and .backward()
│   ├── Purpose: Automating the chain rule
│   ├── Simple terms: The engine that remembers your math to calculate gradients later
│   └── Task: Manually compute gradients for a simple equation using Autograd
│
├── Concept 3: Data Handling (Dataset & DataLoader)
│   ├── torch.utils.data.Dataset (Custom Class structure)
│   ├── torch.utils.data.DataLoader (Batching, Shuffling)
│   ├── Purpose: Decoupling data loading from training logic
│   ├── Simple terms: An organized conveyor belt feeding data to your model
│   └── Task: Implement a dummy Custom Dataset and iterate through it
│
├── Concept 4: Model Architecture (nn.Module)
│   ├── nn.Module class structure (__init__, forward)
│   ├── nn.Sequential (Container)
│   ├── Purpose: Encapsulating state (weights) and behavior (forward pass)
│   ├── Simple terms: Blueprints for your neural network layers
│   └── Task: Define a simple Multilayer Perceptron (MLP) for MNIST input
│
├── Concept 5: Loss & Optimization (Implicit Prerequisite)
│   ├── torch.nn Loss functions (CrossEntropyLoss)
│   ├── torch.optim (SGD/Adam)
│   ├── Purpose: Measuring error and updating weights
│   ├── Simple terms: The scoreboard (loss) and the coach (optimizer) correcting the players
│   └── Task: Initialize loss and optimizer for the model
│
└── Concept 6: The Training Loop (The "Build")
    ├── The Standard Cycle: Forward → Loss → Backward → Step → Zero Grad
    ├── Purpose: Orchestrating the learning process
    ├── Simple terms: The actual practice session where learning happens
    └── Mini-Project: Complete MNIST Digit Classifier Training

```

## **Concept 1: Tensors & Device Management**

### Intuition

In the world of PyTorch, the **Tensor** is the primary citizen. While it looks and behaves almost exactly like a NumPy array (n-dimensional grid of numbers), it has two superpowers that NumPy lacks:
   1. **GPU Acceleration:** Tensors can live on the GPU (VRAM) rather than just the CPU (RAM). This allows for massive parallel processing, which is critical for deep learning.
   2. **Autograd Compatibility:** Tensors can track the history of operations performed on them to automatically calculate gradients later (we will cover this in Concept 2).

Think of a CPU as a Professor: extremely smart, capable of complex logic, but can only do one or two things at a time. Think of a GPU as an army of minions: individually simple, but there are thousands of them working exactly in sync. Deep learning is mostly multiplying huge matrices, a task perfectly suited for the army.

### Mechanics

   * **Creation:** You can create tensors directly (`torch.tensor([1, 2])`) or convert from NumPy (`torch.from_numpy(arr)`).
   * **Device Management:** Every tensor has a `.device` property. By default, tensors are created on the CPU (`'cpu'`).
   * **Moving Data:** You cannot perform operations (like addition or multiplication) between a tensor on the CPU and a tensor on the GPU. They must be on the same device. You move them using `.to(device)` or `.cuda()`.
   * **Asynchronous Execution:** CUDA (GPU) operations are asynchronous. When Python tells the GPU to "multiply these matrices," the GPU says "Okay, I'll get to it" and Python immediately moves to the next line of code. If you want to time GPU operations accurately, you must force Python to wait until the GPU is finished using a synchronization command.
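
The creation and device mechanics above can be sketched as follows. This is a minimal illustration, not part of the original task: the variable names are made up for the example, and the `device` fallback is an assumption so the snippet also runs on a machine without a GPU.

```python
import numpy as np
import torch

# Create a tensor directly from a Python list
a = torch.tensor([1, 2])

# Convert from NumPy; from_numpy shares memory with the source array
arr = np.array([1.0, 2.0, 3.0])
b = torch.from_numpy(arr)

# Assumption: fall back to the CPU when no GPU is visible
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

print(a.device)        # tensors start on the CPU by default
b_dev = b.to(device)   # on the target device (same tensor if already there)
print(b_dev.device)
```

Note that `torch.from_numpy` keeps the NumPy dtype, so `b` is `float64` here rather than PyTorch's usual `float32`.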

### Simpler Explanation

A Tensor is just a container for numbers. If the container is in System RAM, the CPU does the math. If you move the container to Video RAM (VRAM), the graphics card does the math, which is usually much faster for large workloads.

### Trade-offs & Pitfalls

   * **Overhead:** Moving data between CPU and GPU is slow (it travels over the PCIe bus). For small operations (like adding two numbers), the transfer time takes longer than the actual math. GPU is only worth it for *large* matrix operations.
   * **VRAM Limits:** If you load too many huge tensors, you will get a "CUDA Out of Memory" error.


### PyTorch Prerequisites: The Syntax Toolkit

PyTorch is designed to look and feel like NumPy, but with extra commands for the GPU. Here are the specific tools you need for this task.

#### 1. Checking Hardware

To see if your GPU is accessible, PyTorch provides a boolean check:

```python
# Returns True if GPU is ready, False otherwise
status = torch.cuda.is_available()

```
#### **What is CUDA?**

**CUDA** (Compute Unified Device Architecture) is a software layer created by **NVIDIA**.

Think of it as a translator.
   * **You (Python/PyTorch):** Speak high-level code instructions ("Multiply this matrix").
   * **Your GPU:** Speaks low-level hardware voltage signals.

Without CUDA, your Python code has no way to talk to the graphics card. The GPU is just a rock that draws pixels on your screen.

**CUDA** provides the bridge. It allows developers to send general-purpose math problems (not just graphics) to the GPU to be solved.

#### **Why do we check for it?**

In PyTorch, `torch.cuda.is_available()` is effectively asking:
   1. Do you have an NVIDIA GPU?
   2. Do you have the correct drivers installed?
   3. Can PyTorch "see" and talk to that GPU?

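If the answer is `True`, you can train on the GPU, which is often tens of times faster than the CPU for large models (the exact speedup depends on your hardware and workload). If it is `False`, try reinstalling PyTorch with a CUDA-enabled build: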

   `pip uninstall torch torchvision torchaudio`

   `pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121`
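
After reinstalling, a quick check like the one below can confirm what PyTorch sees. It is a small sketch, assuming a single GPU at index `0`; `device_name` is just an illustrative variable.

```python
import torch

if torch.cuda.is_available():
    # Index 0 is assumed to be the only (or primary) GPU
    device_name = torch.cuda.get_device_name(0)
    print(f"GPU found: {device_name}")
    print(f"PyTorch was built against CUDA {torch.version.cuda}")
else:
    device_name = "cpu"
    print("CUDA not available; falling back to the CPU")
```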

#### 2. Creating Tensors

The syntax is nearly identical to NumPy.
   * **NumPy:** `np.random.rand(rows, cols)`
   * **PyTorch:** `torch.rand(rows, cols)` (Creates a tensor with random numbers between 0 and 1)
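
One difference worth seeing side by side is the default data type. The sketch below uses illustrative names (`np_arr`, `pt_t`) and assumes PyTorch's default dtype has not been changed.

```python
import numpy as np
import torch

rows, cols = 2, 3
np_arr = np.random.rand(rows, cols)   # NumPy defaults to float64
pt_t = torch.rand(rows, cols)         # PyTorch defaults to float32

print(np_arr.shape, np_arr.dtype)
print(tuple(pt_t.shape), pt_t.dtype)
```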

#### 3. Moving Data (The `.to()` method)

When a tensor is created, it sits in CPU RAM by default. You move it using `.to()` or specific device methods.

```python
# Create on CPU
x = torch.rand(100, 100)

# Move to GPU (returns a NEW copy on the device)
x_gpu = x.to('cuda')

```

*Note: You can also use string variables: `device = 'cuda' if torch.cuda.is_available() else 'cpu'`, then `x.to(device)`.*

#### 4. Matrix Operations

* **Multiplication:** You can use the standard `@` symbol for matrix multiplication, just like in modern NumPy.
   * `result = tensor_a @ tensor_b`
   * *Constraint:* Both `tensor_a` and `tensor_b` must be on the **same device**. If one is on the CPU and the other is on the GPU, PyTorch raises a `RuntimeError`.
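
A short sketch of both points, using small made-up shapes. The device-mismatch part only runs when a GPU is present, so the snippet is safe on CPU-only machines.

```python
import torch

tensor_a = torch.rand(2, 3)
tensor_b = torch.rand(3, 4)

result = tensor_a @ tensor_b      # (2, 3) @ (3, 4) -> (2, 4)
print(result.shape)

# Only exercise the mismatch when a GPU is actually present
if torch.cuda.is_available():
    try:
        tensor_a.to("cuda") @ tensor_b   # GPU tensor @ CPU tensor
    except RuntimeError as err:
        print(f"Device mismatch: {err}")
```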



#### 5. Timing GPU Operations (Crucial)

Because Python sends commands to the GPU asynchronously (fire-and-forget), Python might finish its code (stop the timer) while the GPU is still crunching numbers in the background.
To get an accurate time, you must force Python to wait:

```python
# Start Timer
# ... do gpu operations ...
torch.cuda.synchronize() # Wait for all GPU kernels to finish
# Stop Timer

```
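
The pattern above can be wrapped in a small helper. This is only a sketch: `time_matmul` is a hypothetical name, the 256x256 size is arbitrary, and the `is_cuda` checks are an assumption so the same code also runs on the CPU.

```python
import time
import torch

def time_matmul(a, b):
    """Return the seconds taken by a @ b, synchronizing if the inputs are on the GPU."""
    if a.is_cuda:
        torch.cuda.synchronize()      # finish any pending GPU work first
    start = time.perf_counter()
    _ = a @ b
    if a.is_cuda:
        torch.cuda.synchronize()      # wait for the multiplication itself
    return time.perf_counter() - start

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
a = torch.rand(256, 256, device=device)
b = torch.rand(256, 256, device=device)

time_matmul(a, b)                     # warm-up run, result discarded
elapsed = time_matmul(a, b)
print(f"{device}: {elapsed:.6f} sec")
```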

---

### Your Task

Now that you have the tools, let's attempt the logic.

**Goal:** Prove the speed difference between CPU and GPU on your machine.

**Steps to Implement:**

1. Import `torch` and `time`.
2. Define a `device` variable (use the check from tool #1). Print what device you got.
3. Create two large random tensors (e.g., shape `10000, 10000`) on the CPU.
4. **CPU Benchmark:**
   * Record `start_time`.
   * Perform `matrix1 @ matrix2`.
   * Record `end_time` and print the duration.

5. **GPU Benchmark:**
   * Move both tensors to the GPU using `.to('cuda')`.
   * **Warm-up:** Run the multiplication once without timing it (GPUs take a moment to "wake up").
   * Record `start_time`.
   * Perform multiplication.
   * **Synchronize:** Call `torch.cuda.synchronize()`.
   * Record `end_time` and print duration.

**Inputs:** None (Generative)

**Outputs:** Device name, CPU time, GPU time, Speedup factor.



In [8]:
import torch
import time

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Device: {device}")

print("Creating tensors")
x_cpu = torch.rand(10000, 10000)
y_cpu = torch.rand(10000, 10000)

print("===========\nStart Comparison")
start_time = time.time()
_ = x_cpu @ y_cpu
end_time = time.time()
print(f"Time Taken for CPU = {end_time - start_time} sec")

x_gpu = x_cpu.to("cuda")
y_gpu = y_cpu.to("cuda")

# Note: no warm-up run was done here, so the GPU time may include one-time start-up cost
print("Start Synchronization") # Wait for everything before starting
torch.cuda.synchronize()
start_time = time.time()
_ = x_gpu @ y_gpu
torch.cuda.synchronize() # <--- CRITICAL: Wait for GPU to finish!
end_time = time.time()
print(f"Time Taken for GPU = {end_time - start_time} sec")

Device: cuda
Creating tensors
===========
Start Comparison
Time Taken for CPU = 17.051012992858887 sec
Start Synchronization
Time Taken for GPU = 0.2580296993255615 sec



**17.05s vs 0.26s.**
Your GPU is approximately **66x faster** than your CPU for this task. This is why we use GPUs.

Now that we know how to hold data (Tensors) and where to put it (GPU), we need to discuss how PyTorch learns.

---

## Concept 2: Autograd (Automatic Differentiation)

### Intuition

In the "From Scratch" module, you had to manually derive the gradients (derivatives) for every function using Calculus. If you changed your model architecture, you had to re-do all the calculus on paper and re-write the backward pass code.

**Autograd** removes this burden. It is an engine that records every operation you perform on a tensor. When you are done, you press a "rewind" button, and it automatically calculates the gradients for you using the Chain Rule.

### Mechanics

1. **The Tape Recorder:** When you create a tensor with `requires_grad=True`, PyTorch starts a log.
   * If you do `y = x + 2`, PyTorch remembers: "To get $y$, we took $x$ and added 2."
   * It builds a **Computational Graph** (a Directed Acyclic Graph) connecting inputs to outputs.


2. **The Backward Pass:** When you call `.backward()` on the final result (usually the Loss), PyTorch walks backward through this graph.
3. **The `.grad` attribute:** The calculated derivatives are stored in the `.grad` property of the input tensors.
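
To see the three steps together, here is a minimal sketch with a two-step computation (the function $z = 3x^2 + 1$ is chosen only for illustration). By the chain rule, $\frac{dz}{dx} = 6x$, which is 12 at $x = 2$.

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = x * x          # y = x^2
z = 3 * y + 1      # z = 3x^2 + 1

# Each result remembers the operation that produced it
print(y.grad_fn is not None, z.grad_fn is not None)

z.backward()       # walk the graph backward from z
print(x.grad)      # dz/dx = 6x = 12 at x = 2
```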

### Simpler Explanation

Imagine you are walking through a maze (the forward pass). You leave a trail of breadcrumbs behind you. When you reach the end (the loss), you follow the breadcrumbs back to the start (backward pass) to figure out which turns led you to the mistake.

### Trade-offs

* **Memory:** Storing this graph takes memory (RAM or VRAM, depending on where the tensors live). This is why we turn it off with `torch.no_grad()` when we are just testing/evaluating the model.
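
A small sketch of the effect, using an illustrative scalar `w`: the same multiplication is tracked outside the `torch.no_grad()` block and untracked inside it.

```python
import torch

w = torch.tensor(3.0, requires_grad=True)

tracked = w * 2                 # recorded in the graph
with torch.no_grad():
    untracked = w * 2           # same math, nothing recorded

print(tracked.requires_grad)    # True
print(untracked.requires_grad)  # False
```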

### Your Task

We will verify Autograd against your own calculus knowledge.

**Target Equation:**
$$y = x^3 + 5$$

**Analytical Derivative (Calculus):**
$$\frac{dy}{dx} = 3x^2$$

**Requirements:**

1. Create a tensor `x` with the value `4.0`. **Important:** You must set `requires_grad=True`.
2. Calculate `y = x**3 + 5`.
3. Print the value of `y`.
4. Call `y.backward()`.
5. Print the value of `x.grad`.
6. Manually calculate $3(4)^2$ in Python and print it to prove they match.


In [13]:
import torch

x = torch.tensor(4.0, requires_grad = True)
y = x ** 3 + 5
print(f"Y = {y}")
y.backward()
print(f"X Grad = {x.grad}")
print(f"3(4)**2 = {3*(4**2)}")

Y = 69.0
X Grad = 48.0
3(4)**2 = 48


Perfect. You just proved that PyTorch can perform Calculus for you.

Now we move to how we feed data into these systems.

---

## **Concept 3: Data Handling (Dataset & DataLoader)**

### Intuition

In the "From Scratch" module, you likely loaded all your data into one giant NumPy array (e.g., `X_train`).
**Problem:** What if your dataset is 500GB of images? You can't load that into RAM.
**Solution:** The **Dataset** class. It acts like a librarian. It doesn't hold all the books (data) in its hands; it just knows *where* they are on the shelf and how to fetch *one* when asked.

The **DataLoader** is the delivery truck. It asks the Librarian for 32 books (a batch), packs them into a box, and delivers them to the GPU.

### Mechanics

To manage data in PyTorch, you almost always create a custom class that inherits from `torch.utils.data.Dataset`.

You **must** implement three magic methods:
   1. `__init__`: Setup (load file paths, CSVs, etc.).
   2. `__len__`: Returns the total number of samples.
   3. `__getitem__(index)`: Returns **one specific sample** at the given `index`.

### Syntax Toolkit

```python
from torch.utils.data import Dataset, DataLoader

class MyDataset(Dataset):
    def __init__(self, data):
        self.data = data
        
    def __len__(self):
        return len(self.data)
        
    def __getitem__(self, idx):
        # Retrieve the item at index 'idx'
        sample = self.data[idx]
        return sample

```

Once the class is defined, you hand it to the loader:

```python
# batch_size=4 means it groups 4 samples into one big tensor
loader = DataLoader(dataset_instance, batch_size=4, shuffle=True)

```
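
Putting the two pieces together, the sketch below uses a hypothetical `SquaresDataset` (not part of the lesson) where sample `i` is the pair `(i, i**2)`. `shuffle=False` is used here only so the batches are easy to inspect.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class SquaresDataset(Dataset):
    """Hypothetical example: sample i is the pair (i, i squared)."""
    def __init__(self, n):
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        x = torch.tensor([float(idx)])
        return x, x ** 2

dataset_instance = SquaresDataset(10)
# shuffle=False keeps the order fixed so the batches are easy to inspect
loader = DataLoader(dataset_instance, batch_size=4, shuffle=False)

for features, targets in loader:
    print(features.shape, targets.shape)
```

With 10 samples and a batch size of 4, the loader yields two batches of 4 and a final batch of 2.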

### Your Task

Create a "Lazy" dataset that generates random numbers on the fly.

**Requirements:**
1. Define a class `RandomDataset` inheriting from `Dataset`.
   * `__init__`: Accepts an integer `length`. Stores it.
   * `__len__`: Returns the `length`.
   * `__getitem__`: Ignores the actual `idx`. Instead, it generates a random tensor of shape `(3,)` (the feature) and a random integer (0 or 1) (the label). It returns a tuple: `(feature, label)`.

2. Instantiate the dataset with a length of **10**.
3. Wrap it in a `DataLoader` with `batch_size=4`.
4. Write a loop to iterate through the loader.
5. Inside the loop, print the shape of the features batch and the labels batch.

**Expected Output intuition:**
If batch size is 4 and length is 10, you should see 3 iterations.

   * Loop 1: 4 items
   * Loop 2: 4 items
   * Loop 3: 2 items (the remainder)



In [23]:
from torch.utils.data import Dataset, DataLoader
import random

class RandomDataset(Dataset):
    def __init__(self, length):
        self.length = length

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        tensor = torch.rand(3)
        ran_in = random.randint(0,1)
        return (tensor, ran_in)


dataset_instance = RandomDataset(10)
loader = DataLoader(dataset_instance, batch_size=4, shuffle=True)

for i, (f, l) in enumerate(loader):
    print(f"For Iteration {i+1}:\nFeatures = {f}\nLabels = {l}")

For Iteration 1:
Features = tensor([[0.6995, 0.8537, 0.7609],
        [0.4655, 0.4702, 0.3763],
        [0.6756, 0.4313, 0.4652],
        [0.9021, 0.4186, 0.5740]])
Labels = tensor([1, 1, 1, 0])
For Iteration 2:
Features = tensor([[0.5449, 0.1629, 0.1815],
        [0.2859, 0.3900, 0.5773],
        [0.1734, 0.2097, 0.5909],
        [0.2663, 0.4277, 0.4301]])
Labels = tensor([0, 0, 1, 0])
For Iteration 3:
Features = tensor([[0.0786, 0.5816, 0.5636],
        [0.3999, 0.5130, 0.2747]])
Labels = tensor([0, 1])


Excellent. You can see how the `DataLoader` automatically stacked your individual tensors into batches (e.g., `[4, 3]` shape). This automation is what makes training on millions of images possible.

Now, we need something to consume that data.

---

## **Concept 4: Model Architecture (nn.Module)**

### Intuition

In PyTorch, every neural network is a Python class that inherits from `nn.Module`.
Think of this class as a blueprint.
   1. **`__init__` (The Inventory):** You list the parts you need (layers). "I need a Linear layer with 784 inputs, a ReLU activation, etc."
   2. **`forward` (The Assembly):** You define how data flows through those parts. "Take input `x`, pass it through layer 1, then apply ReLU, then layer 2."

This separation allows for complex, non-linear flows (like skipping layers) that you'll need for things like ResNet later.

### Mechanics (The Syntax Toolkit)

You need three main components for a basic network:
   1. **`nn.Flatten()`:** Images are 2D grids (e.g., 28x28). Dense layers (`nn.Linear`) only understand flat lists (vectors). This layer squashes the grid into a single line (28*28 = 784).
   2. **`nn.Linear(in_features, out_features)`:** This is the standard "Dense" layer you built from scratch in Module 1 ($y = xW^T + b$).
   3. **`nn.ReLU()`:** The activation function.
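
The shapes involved can be checked with a quick sketch. The batch of 8 random "images" is made up for illustration; the layer sizes match the MNIST numbers used in this lesson.

```python
import torch
import torch.nn as nn

images = torch.rand(8, 28, 28)    # a batch of 8 fake 28x28 images

flatten = nn.Flatten()            # keeps dim 0 (the batch), flattens the rest
linear = nn.Linear(784, 128)
relu = nn.ReLU()

flat = flatten(images)
hidden = relu(linear(flat))

print(flat.shape)            # torch.Size([8, 784])
print(linear.weight.shape)   # torch.Size([128, 784]), stored as (out_features, in_features)
print(hidden.shape)          # torch.Size([8, 128])
```

The weight is stored as `(out_features, in_features)`, which is why the formula uses $W^T$.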

**Class Structure:**

```python
import torch.nn as nn

class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Define layers here
        self.layer1 = nn.Linear(10, 5)
        self.activation = nn.ReLU()
        
    def forward(self, x):
        # Define flow here
        x = self.layer1(x)
        x = self.activation(x)
        return x

```
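
As a usage sketch, the class above can be instantiated and called like a function. The batch of 3 random samples is made up for illustration, and the forward pass is written on one line but does the same thing.

```python
import torch
import torch.nn as nn

class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(10, 5)
        self.activation = nn.ReLU()

    def forward(self, x):
        return self.activation(self.layer1(x))

model = MyModel()
batch = torch.rand(3, 10)        # 3 samples, 10 features each
out = model(batch)               # calling the model runs forward()

n_params = sum(p.numel() for p in model.parameters())
print(out.shape)    # torch.Size([3, 5])
print(n_params)     # 10*5 weights + 5 biases = 55
```

Call `model(batch)` rather than `model.forward(batch)` so that any hooks registered on the module still run.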

### Trade-offs

* **Sequential vs. Class:** You *can* use `nn.Sequential` (an ordered container of layers) for simple stacks, but the **Class** structure is needed whenever the data flow is not a straight chain, as in advanced architectures (Transformers, ResNets). We will stick to the Class structure to build good habits.
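
To make the difference concrete, here is a sketch with a hypothetical `SkipBlock` (not from the lesson). The plain stack fits `nn.Sequential`, while the skip connection needs a custom `forward` because the input is reused after the layer.

```python
import torch
import torch.nn as nn

# A plain stack of layers fits nn.Sequential
stack = nn.Sequential(nn.Linear(10, 10), nn.ReLU())

class SkipBlock(nn.Module):
    """Hypothetical block: adds the input back onto the layer output."""
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(10, 10)
        self.activation = nn.ReLU()

    def forward(self, x):
        return x + self.activation(self.layer(x))   # skip connection

x = torch.rand(4, 10)
print(stack(x).shape, SkipBlock()(x).shape)
```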

### Your Task

Build a Multi-Layer Perceptron (MLP) designed for the MNIST dataset.

**Specifications:**

1. Create a class `MNISTClassifier` inheriting from `nn.Module`.
2. **`__init__`:**
   * Define a `flatten` layer.
   * Define a first linear layer: Input **784** (28x28 pixels), Output **128**.
   * Define a ReLU activation.
   * Define a second linear layer (Output Layer): Input **128**, Output **10** (for digits 0-9).


3. **`forward`:** Connect them in order: Flatten $\to$ Linear1 $\to$ ReLU $\to$ Linear2.
4. **Test it:**
   * Instantiate the model.
   * Move it to the GPU (`.to(device)`).
   * Create a dummy input tensor representing **one batch of 8 images** (Shape: `(8, 28, 28)`). **Important:** Don't forget to move this input to the GPU too!
   * Pass the input through the model.
   * Print the output shape.



**Expected Output Shape:** `torch.Size([8, 10])` (8 images, 10 raw class scores, or logits, each; the model has no softmax, so these are not yet probabilities).


In [None]:
import torch.nn as nn

class MNISTClassifier(nn.Module):
    def __init__(self):
        
