# PyTorch Basics: Complete Fundamentals
## Building Foundation for Mamba Integration

**Goal**: Learn PyTorch by implementing core concepts hands-on

**Modules Covered**:
1. Tensors & Basic Operations
2. Building Custom nn.Module Classes
3. Residual Connections & Normalization
4. Working with Image Data
5. Convolutional Layers
6. Backpropagation & Gradients
7. Building Multi-Stage Encoders
8. Understanding UNet Architecture

---

## Setup: Import Libraries

First, let's import everything we'll need.

In [None]:
# Install packages as needed
!pip install matplotlib


In [3]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt

# Check PyTorch version and CUDA availability
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA device: {torch.cuda.get_device_name(0)}")

PyTorch version: 2.4.1+cu124
CUDA available: True
CUDA device: NVIDIA GeForce RTX 4060


---
# Module 0.1: Tensors & Basic Operations

**Goal**: Understand PyTorch tensors (the foundation of everything)

**What is a tensor?**
- Like a NumPy array, but can run on GPU
- Can automatically compute gradients (for backpropagation)
- The basic data structure for all neural networks

**Analogy from scikit-learn**:
- In sklearn: You work with NumPy arrays (X, y)
- In PyTorch: You work with tensors (same idea, but GPU-enabled)
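
The points above can be seen in a few lines. This is a minimal sketch, not part of the original exercises; the names `a`, `b`, `w`, and `loss` are chosen only for illustration.

```python
import torch

# NumPy-style arithmetic works the same way on tensors
a = torch.tensor([1.0, 2.0, 3.0])
b = a * 2 + 1
print(b)         # tensor([3., 5., 7.])

# requires_grad=True asks autograd to track operations on this tensor
w = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
loss = (w ** 2).sum()    # loss = w1^2 + w2^2 + w3^2
loss.backward()          # fills w.grad with d(loss)/dw = 2 * w
print(w.grad)    # tensor([2., 4., 6.])
```

The gradient tracking is the part NumPy arrays do not offer; Module 0.6 returns to it.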

## Exercise 1.1: Creating Tensors

Let's create tensors in different ways.

In [4]:
# Method 1: From a Python list
tensor_from_list = torch.tensor([1, 2, 3, 4, 5])
print(f"From list: {tensor_from_list}")
print(f"Shape: {tensor_from_list.shape}")
print(f"Data type: {tensor_from_list.dtype}\n")

From list: tensor([1, 2, 3, 4, 5])
Shape: torch.Size([5])
Data type: torch.int64



In [5]:
# Method 2: Random tensor (most common for initialization)
random_tensor = torch.randn(3, 4)  # 3 rows, 4 columns (like np.random.randn)
print(f"Random tensor (3x4):")
print(random_tensor)
print(f"Shape: {random_tensor.shape}\n")

Random tensor (3x4):
tensor([[-2.5413,  0.4841,  1.3195,  0.8582],
        [ 0.6416, -0.6697, -0.8900, -0.4335],
        [ 0.5710,  0.3317,  0.3666, -1.1197]])
Shape: torch.Size([3, 4])



In [6]:
# Method 3: Zeros and ones
zeros = torch.zeros(2, 3)
ones = torch.ones(2, 3)
print(f"Zeros (2x3):")
print(zeros)
print(f"\nOnes (2x3):")
print(ones)
print()

Zeros (2x3):
tensor([[0., 0., 0.],
        [0., 0., 0.]])

Ones (2x3):
tensor([[1., 1., 1.],
        [1., 1., 1.]])



In [7]:
# Method 4: From NumPy array (useful when converting existing data; the tensor shares memory with the array)
numpy_array = np.array([[1, 2], [3, 4]])
tensor_from_numpy = torch.from_numpy(numpy_array)
print(f"From NumPy:")
print(tensor_from_numpy)
print(f"Shape: {tensor_from_numpy.shape}")

From NumPy:
tensor([[1, 2],
        [3, 4]])
Shape: torch.Size([2, 2])
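
One detail worth knowing about Method 4: `torch.from_numpy` shares memory with the source array, while `torch.tensor` makes a copy. The sketch below illustrates the difference; the variable names `arr`, `shared`, and `copied` are illustrative only.

```python
import numpy as np
import torch

arr = np.array([[1, 2], [3, 4]])
shared = torch.from_numpy(arr)   # shares memory with arr
copied = torch.tensor(arr)       # makes an independent copy

arr[0, 0] = 99                   # modify the NumPy array in place

print(shared[0, 0].item())  # 99 (the tensor sees the change)
print(copied[0, 0].item())  # 1  (the copy is unaffected)
```

If you need an independent tensor, `torch.from_numpy(arr).clone()` is another option.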


## Exercise 1.2: Tensor Shapes and Reshaping

In [8]:
# Create a tensor representing an image
# Format: (Channels, Height, Width)
image = torch.randn(3, 512, 512)  # RGB image, 512x512
print(f"Image tensor shape: {image.shape}")
print(f"This represents: 3 channels (RGB), 512 height, 512 width\n")

Image tensor shape: torch.Size([3, 512, 512])
This represents: 3 channels (RGB), 512 height, 512 width



In [9]:
# Reshape to simulate extracting 16x16 patches
# This is what PatchEmbedding does!
patch_size = 16
num_patches_per_side = 512 // patch_size  # 32 patches per side
print(f"Patch size: {patch_size}x{patch_size}")
print(f"Number of patches per side: {num_patches_per_side}")
print(f"Total patches: {num_patches_per_side * num_patches_per_side}\n")

Patch size: 16x16
Number of patches per side: 32
Total patches: 1024



In [10]:
# Method 1: Manual reshaping (understand the concept)
# Reshape: (3, 512, 512) → (3, 32, 16, 32, 16)
#          [C,  H,   W ]   [C, #H, pH, #W, pW]
# Where #H = number of patches in height, pH = patch height
reshaped = image.reshape(3, num_patches_per_side, patch_size, 
                          num_patches_per_side, patch_size)
print(f"After reshape: {reshaped.shape}")

After reshape: torch.Size([3, 32, 16, 32, 16])


In [11]:
# Now rearrange to (num_patches, channels, patch_height, patch_width)
# This makes each patch a separate "item"
patches = reshaped.permute(1, 3, 0, 2, 4)  # Rearrange dimensions
print(f"After permute: {patches.shape}")

After permute: torch.Size([32, 32, 3, 16, 16])


In [12]:
# Flatten to get (num_patches, channels*patch_height*patch_width)
patches_flat = patches.reshape(num_patches_per_side * num_patches_per_side, -1)
print(f"Flattened patches: {patches_flat.shape}")
print(f"This is {patches_flat.shape[0]} patches, each with {patches_flat.shape[1]} features\n")

Flattened patches: torch.Size([1024, 768])
This is 1024 patches, each with 768 features
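
It is easy to get a permute order wrong without noticing, because the shapes can still look right. A quick sanity check is to compare one flattened patch against a direct slice of the image. The sketch below repeats the reshape, permute, reshape steps on a smaller 64x64 image (an assumption made only to keep it fast); `row`, `col`, and `expected` are illustrative names.

```python
import torch

patch_size = 16
image = torch.randn(3, 64, 64)          # smaller image, same idea
n = 64 // patch_size                    # 4 patches per side

patches_flat = (
    image.reshape(3, n, patch_size, n, patch_size)
         .permute(1, 3, 0, 2, 4)        # (n, n, C, pH, pW)
         .reshape(n * n, -1)            # (16, 768)
)

# The patch at grid position (row, col) sits at flat index row * n + col
row, col = 1, 2
expected = image[:, row * patch_size:(row + 1) * patch_size,
                    col * patch_size:(col + 1) * patch_size]

assert torch.equal(patches_flat[row * n + col], expected.reshape(-1))
print(patches_flat.shape)  # torch.Size([16, 768])
```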



In [13]:
# Key operations you'll use constantly:
print("=== Key Reshape Operations ===")
x = torch.randn(2, 3, 4)
print(f"Original: {x.shape}")

=== Key Reshape Operations ===
Original: torch.Size([2, 3, 4])


In [14]:
# .reshape() - change shape (must have same total elements)
y = x.reshape(2, 12)
print(f"After reshape(2, 12): {y.shape}")

After reshape(2, 12): torch.Size([2, 12])


In [15]:
# .view() - like reshape, but never copies data; it raises an error if the memory layout is incompatible
z = x.view(6, 4)
print(f"After view(6, 4): {z.shape}")

After view(6, 4): torch.Size([6, 4])
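
The difference between `.view()` and `.reshape()` only shows up on non-contiguous tensors. The following sketch (with illustrative names `t`, `tt`, `flat`) shows a case where `.view()` fails and `.reshape()` silently copies instead.

```python
import torch

t = torch.arange(6).reshape(2, 3)    # contiguous: [[0, 1, 2], [3, 4, 5]]

# view never copies, so it shares memory with the original tensor
v = t.view(3, 2)
v[0, 0] = 100
print(t[0, 0].item())                # 100

# After a transpose the memory layout no longer matches the shape
tt = t.t()                           # shape (3, 2), not contiguous
print(tt.is_contiguous())            # False

try:
    tt.view(6)                       # cannot be expressed without a copy
    view_failed = False
except RuntimeError:
    view_failed = True
print(view_failed)                   # True

flat = tt.reshape(6)                 # reshape copies when it has to
flat2 = tt.contiguous().view(6)      # equivalent explicit form
```

A reasonable rule of thumb: use `.reshape()` by default, and reach for `.view()` when you want an error rather than a hidden copy.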


In [16]:
# .permute() - reorder dimensions
w = x.permute(2, 0, 1)  # (2, 3, 4) -> (4, 2, 3)
print(f"After permute(2, 0, 1): {w.shape}")

After permute(2, 0, 1): torch.Size([4, 2, 3])


In [17]:
# .unsqueeze() - add a dimension
u = x.unsqueeze(0)  # Add batch dimension
print(f"After unsqueeze(0): {u.shape}")

After unsqueeze(0): torch.Size([1, 2, 3, 4])


In [18]:
# .squeeze() - remove dimensions of size 1
s = u.squeeze(0)  # Remove batch dimension
print(f"After squeeze(0): {s.shape}")

After squeeze(0): torch.Size([2, 3, 4])


## Exercise 1.3: GPU Operations

In [19]:
# Check if CUDA (GPU) is available
if torch.cuda.is_available():
    print(f"✓ CUDA is available!")
    print(f"Device name: {torch.cuda.get_device_name(0)}\n")
    
    # Create tensor on CPU (default)
    cpu_tensor = torch.randn(3, 3)
    print(f"CPU tensor device: {cpu_tensor.device}")
    
    # Move to GPU - Method 1: .cuda()
    gpu_tensor = cpu_tensor.cuda()
    print(f"GPU tensor device: {gpu_tensor.device}")
    
    # Move to GPU - Method 2: .to('cuda') (preferred, more flexible)
    gpu_tensor2 = cpu_tensor.to('cuda')
    print(f"GPU tensor device: {gpu_tensor2.device}\n")
    
    # Operations on GPU tensors
    result = gpu_tensor + gpu_tensor2
    print(f"Result device: {result.device}")
    print("✓ GPU operations work!\n")
    
    # Move back to CPU (needed for plotting, NumPy conversion)
    back_to_cpu = result.cpu()
    print(f"Back to CPU: {back_to_cpu.device}")
    
    # IMPORTANT: Can't mix CPU and GPU tensors!
    try:
        mixed = cpu_tensor + gpu_tensor  # This will error!
    except RuntimeError as e:
        print(f"\n✗ Error when mixing CPU and GPU tensors:")
        print(f"  {str(e)[:80]}...")
        print("  → Always ensure tensors are on the same device!")

else:
    print("✗ CUDA not available - will use CPU")
    print("(For Mamba, you'll need GPU via WSL2)")

✓ CUDA is available!
Device name: NVIDIA GeForce RTX 4060

CPU tensor device: cpu
GPU tensor device: cuda:0
GPU tensor device: cuda:0

Result device: cuda:0
✓ GPU operations work!

Back to CPU: cpu

✗ Error when mixing CPU and GPU tensors:
  Expected all tensors to be on the same device, but found at least two devices, c...
  → Always ensure tensors are on the same device!
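
A common way to avoid device mismatches is to pick one `device` object up front and use it everywhere. The sketch below is one such pattern, not the only one; it also runs on machines without a GPU.

```python
import torch

# Pick the device once; falls back to CPU when CUDA is unavailable
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

a = torch.randn(3, 3, device=device)   # create directly on the device
b = torch.randn(3, 3).to(device)       # or move an existing tensor
result = a + b                         # both operands share a device

# Move back to CPU before converting to NumPy or plotting
result_np = result.cpu().numpy()
print(result.device, result_np.shape)
```

The same `.to(device)` call works on models (`model.to(device)`), so the rest of the code does not need to know which device is in use.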


## Exercise 1.4: Practical Example - Image Tensor to Patches

This is what PatchEmbedding will do!

In [20]:
# Simulate a batch of images
batch_size = 2
channels = 3
height = 512
width = 512

images = torch.randn(batch_size, channels, height, width)
print(f"Input images shape: {images.shape}")
print(f"This is {batch_size} RGB images, each {height}x{width}\n")

Input images shape: torch.Size([2, 3, 512, 512])
This is 2 RGB images, each 512x512



In [21]:
# Method: Using unfold (efficient way to extract patches)
patch_size = 16

# unfold extracts sliding windows along one dimension
# unfold(dimension, size, step); with step == size the windows do not overlap
patches_h = images.unfold(2, patch_size, patch_size)  # Unfold height
patches_hw = patches_h.unfold(3, patch_size, patch_size)  # Unfold width

print(f"After unfolding: {patches_hw.shape}")
print("Shape breakdown:")
print(f"  Dimension 0: batch_size = {patches_hw.shape[0]}")
print(f"  Dimension 1: channels = {patches_hw.shape[1]}")
print(f"  Dimension 2: num_patches_height = {patches_hw.shape[2]}")
print(f"  Dimension 3: num_patches_width = {patches_hw.shape[3]}")
print(f"  Dimension 4: patch_height = {patches_hw.shape[4]}")
print(f"  Dimension 5: patch_width = {patches_hw.shape[5]}\n")

After unfolding: torch.Size([2, 3, 32, 32, 16, 16])
Shape breakdown:
  Dimension 0: batch_size = 2
  Dimension 1: channels = 3
  Dimension 2: num_patches_height = 32
  Dimension 3: num_patches_width = 32
  Dimension 4: patch_height = 16
  Dimension 5: patch_width = 16



In [22]:
# Rearrange to (batch, num_patches, channels * patch_height * patch_width)
num_patches_h = patches_hw.shape[2]
num_patches_w = patches_hw.shape[3]
total_patches = num_patches_h * num_patches_w

# Rearrange dimensions
patches_rearranged = patches_hw.permute(0, 2, 3, 1, 4, 5)
print(f"After permute: {patches_rearranged.shape}")

After permute: torch.Size([2, 32, 32, 3, 16, 16])


In [23]:
# Flatten each patch
patches_final = patches_rearranged.reshape(batch_size, total_patches, -1)
print(f"Final patches shape: {patches_final.shape}")
print(f"\nMeaning:")
print(f"  - {batch_size} images")
print(f"  - Each has {total_patches} patches")
print(f"  - Each patch has {patches_final.shape[2]} features")
print(f"\nThis is exactly what PatchEmbedding outputs!")
print("(Except PatchEmbedding also projects to a different dimension)")

Final patches shape: torch.Size([2, 1024, 768])

Meaning:
  - 2 images
  - Each has 1024 patches
  - Each patch has 768 features

This is exactly what PatchEmbedding outputs!
(Except PatchEmbedding also projects to a different dimension)
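
The projection step mentioned above could be sketched with a single `nn.Linear` applied to the last dimension. The embedding size `embed_dim = 96` is an arbitrary assumption for illustration, and the random `patches` tensor stands in for `patches_final`.

```python
import torch
import torch.nn as nn

embed_dim = 96                          # hypothetical embedding size
patches = torch.randn(2, 1024, 768)     # stand-in for patches_final

# nn.Linear acts on the last dimension and leaves the others untouched
proj = nn.Linear(768, embed_dim)
tokens = proj(patches)

print(tokens.shape)  # torch.Size([2, 1024, 96])
```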


---
# Module 0.2: Building Your First nn.Module

**Goal**: Learn how to create custom PyTorch models

**Key concept**: All PyTorch models inherit from `nn.Module`
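
As a preview of the pattern, here is a minimal sketch of a custom module. The class name `TinyMLP` and its layer sizes are made up for illustration: layers are defined in `__init__`, and the computation in `forward`.

```python
import torch
import torch.nn as nn

class TinyMLP(nn.Module):
    """Two linear layers with a ReLU in between (illustrative only)."""

    def __init__(self, in_features, hidden, out_features):
        super().__init__()                 # required before registering layers
        self.fc1 = nn.Linear(in_features, hidden)
        self.fc2 = nn.Linear(hidden, out_features)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

model = TinyMLP(768, 128, 10)
out = model(torch.randn(4, 768))           # calling the model runs forward()
n_params = sum(p.numel() for p in model.parameters())

print(out.shape)   # torch.Size([4, 10])
print(n_params)    # 99722
```

Assigning an `nn.Module` to an attribute is what registers its parameters, which is why `model.parameters()` finds both layers.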

## Exercise 2.1: Simple Linear Layer Wrapper

In [None]:
# TODO: Step by step

## Exercise 2.2: Multi-Layer Block

In [None]:
# TODO: Step by step

---
# Module 0.3: Residual Connections & Normalization

**Goal**: Understand the building pattern used in modern architectures (including Mamba!)
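
A minimal sketch of the pattern, assuming a pre-norm layout (`x + f(norm(x))`) and a small MLP as the inner function. `PreNormResidual` is a hypothetical name, not a class from any library.

```python
import torch
import torch.nn as nn

class PreNormResidual(nn.Module):
    """Computes x + fn(LayerNorm(x)); illustrative sketch only."""

    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.fn = nn.Sequential(
            nn.Linear(dim, dim * 2),
            nn.GELU(),
            nn.Linear(dim * 2, dim),
        )

    def forward(self, x):
        return x + self.fn(self.norm(x))   # skip path + transformed path

block = PreNormResidual(64)
x = torch.randn(2, 10, 64)
out = block(x)
print(out.shape)  # torch.Size([2, 10, 64])

# If the inner branch outputs zero, the block reduces to the identity
with torch.no_grad():
    block.fn[-1].weight.zero_()
    block.fn[-1].bias.zero_()
identity_ok = torch.equal(block(x), x)
print(identity_ok)  # True
```

The identity check shows why residual blocks are easy to stack: the input always has a direct path to the output.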

## Exercise 3.1: Simple Residual Connection

In [None]:
# TODO: Step by step

## Exercise 3.2: LayerNorm + Pre-norm Architecture

In [None]:
# TODO: Step by step

---
# Module 0.4: Working with Image Data

**Goal**: Understand how images are represented as tensors
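
Image libraries typically hand back arrays in (H, W, C) order with `uint8` values, while PyTorch layers expect (C, H, W) floats. The sketch below uses a random array as a stand-in for a loaded image, so no file path is assumed.

```python
import numpy as np
import torch

# Synthetic stand-in for a loaded image: height 4, width 6, 3 channels
img_hwc = np.random.randint(0, 256, size=(4, 6, 3), dtype=np.uint8)

img_chw = torch.from_numpy(img_hwc).permute(2, 0, 1)  # (H, W, C) -> (C, H, W)
img_float = img_chw.float() / 255.0                   # scale to [0, 1]
batch = img_float.unsqueeze(0)                        # (1, C, H, W)

# matplotlib expects (H, W, C), so permute back before plotting
img_for_plot = img_float.permute(1, 2, 0).numpy()

print(batch.shape)          # torch.Size([1, 3, 4, 6])
print(img_for_plot.shape)   # (4, 6, 3)
```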

## Exercise 4.1: Load and Convert Image

In [None]:
# TODO: Step by step

## Exercise 4.2: Image Tensor Format (C, H, W)

In [None]:
# TODO: Step by step

---
# Module 0.5: Convolutional Layers

**Goal**: Understand Conv2d (used in PatchEmbedding!)
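
A convolution whose kernel size equals its stride visits each non-overlapping patch exactly once, so it extracts and projects patches in a single step. The sketch below checks this against the explicit unfold approach from Exercise 1.4; `embed_dim = 96` and the 64x64 image size are arbitrary assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
patch_size, embed_dim = 16, 96              # embed_dim is an arbitrary choice
images = torch.randn(2, 3, 64, 64)

# kernel_size == stride: one output position per non-overlapping patch
conv = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)
feat = conv(images)                                  # (2, 96, 4, 4)
tokens_conv = feat.flatten(2).transpose(1, 2)        # (2, 16, 96)

# Same computation via explicit patches and a matrix multiply
patches = (
    images.unfold(2, patch_size, patch_size)
          .unfold(3, patch_size, patch_size)         # (2, 3, 4, 4, 16, 16)
          .permute(0, 2, 3, 1, 4, 5)                 # (2, 4, 4, 3, 16, 16)
          .reshape(2, 16, -1)                        # (2, 16, 768)
)
weight = conv.weight.reshape(embed_dim, -1)          # (96, 768)
tokens_manual = patches @ weight.T + conv.bias

same = torch.allclose(tokens_conv, tokens_manual, atol=1e-4)
print(tokens_conv.shape, same)
```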

## Exercise 5.1: Basic Conv2d

In [None]:
# TODO: Step by step

## Exercise 5.2: Using Conv2d for Patch Extraction

In [None]:
# TODO: Step by step

---
# Module 0.6: Backpropagation & Gradients

**Goal**: Understand gradient flow (crucial for debugging!)
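
A small sketch of a forward and backward pass, including the leaf versus non-leaf distinction covered in Exercise 6.2. Tensor names are illustrative.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0])
w = torch.tensor([0.5, -1.0, 2.0], requires_grad=True)  # leaf tensor

y = w * x            # non-leaf: produced by an operation
y.retain_grad()      # opt in to keeping y.grad (dropped by default)
loss = y.sum()
loss.backward()

print(w.is_leaf, y.is_leaf)   # True False
print(w.grad)                 # tensor([1., 2., 3.]), since d(loss)/dw = x
print(y.grad)                 # tensor([1., 1., 1.])
```

Gradients accumulate across `backward()` calls, which is why training loops reset them (for example with `optimizer.zero_grad()`) before each step.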

## Exercise 6.1: Forward + Backward Pass

In [None]:
# TODO: Step by step

## Exercise 6.2: Understanding Leaf vs Non-Leaf Tensors

In [None]:
# TODO: Step by step

---
# Module 0.7: Building Multi-Stage Encoders

**Goal**: Combine components into complex architectures
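
One possible shape for such an encoder is sketched below, assuming each stage halves the spatial resolution with a strided convolution. `Stage`, `Encoder`, and the channel widths are hypothetical choices, not the design used later in the series.

```python
import torch
import torch.nn as nn

class Stage(nn.Module):
    """Strided conv + norm + activation; halves H and W (illustrative)."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(),
        )

    def forward(self, x):
        return self.block(x)

class Encoder(nn.Module):
    def __init__(self, channels=(3, 16, 32, 64)):
        super().__init__()
        # nn.ModuleList registers each stage's parameters with the model
        self.stages = nn.ModuleList(
            [Stage(i, o) for i, o in zip(channels[:-1], channels[1:])]
        )

    def forward(self, x):
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)        # keep every stage's output
        return feats

feats = Encoder()(torch.randn(1, 3, 64, 64))
shapes = [tuple(f.shape) for f in feats]
print(shapes)  # [(1, 16, 32, 32), (1, 32, 16, 16), (1, 64, 8, 8)]
```

Returning the per-stage features, rather than only the last one, is what makes skip connections possible in a decoder.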

## Exercise 7.1: Single Encoder Stage

In [None]:
# TODO: Step by step

## Exercise 7.2: Multi-Stage Encoder

In [None]:
# TODO: Step by step

---
# Module 0.8: Understanding UNet Architecture

**Goal**: Learn the encoder-decoder pattern with skip connections
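
The pattern can be sketched with a two-level network: the encoder downsamples, the decoder upsamples, and each decoder level concatenates the encoder feature map of the same resolution. `TinyUNet`, `conv_block`, and all channel widths are illustrative assumptions.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # 3x3 conv with padding keeps H and W unchanged
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU())

class TinyUNet(nn.Module):
    def __init__(self, in_ch=3, out_ch=1, base=8):
        super().__init__()
        self.enc1 = conv_block(in_ch, base)
        self.enc2 = conv_block(base, base * 2)
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = conv_block(base * 2, base * 4)
        self.up2 = nn.ConvTranspose2d(base * 4, base * 2, kernel_size=2, stride=2)
        self.dec2 = conv_block(base * 4, base * 2)   # input: upsampled + skip
        self.up1 = nn.ConvTranspose2d(base * 2, base, kernel_size=2, stride=2)
        self.dec1 = conv_block(base * 2, base)       # input: upsampled + skip
        self.head = nn.Conv2d(base, out_ch, kernel_size=1)

    def forward(self, x):
        e1 = self.enc1(x)                     # (B, base,   H,   W)
        e2 = self.enc2(self.pool(e1))         # (B, 2*base, H/2, W/2)
        b = self.bottleneck(self.pool(e2))    # (B, 4*base, H/4, W/4)
        d2 = self.dec2(torch.cat([self.up2(b), e2], dim=1))   # skip from e2
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))  # skip from e1
        return self.head(d1)

out = TinyUNet()(torch.randn(1, 3, 32, 32))
print(out.shape)  # torch.Size([1, 1, 32, 32])
```

The input height and width need to be divisible by 4 here so that the upsampled maps line up with the skip tensors.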

## Exercise 8.1: Simple UNet

In [None]:
# TODO: Step by step

## Exercise 8.2: Visualize Skip Connections

In [None]:
# TODO: Step by step

---
# Summary

## What You've Learned
- ✓ PyTorch tensors and operations
- ✓ Building custom nn.Module classes
- ✓ Residual connections and normalization
- ✓ Working with image data
- ✓ Convolutional layers
- ✓ Gradient flow and backpropagation
- ✓ Multi-stage encoder architectures
- ✓ UNet encoder-decoder pattern

## Next Steps
After completing this notebook:
1. Phase 1 Deep Dive: What is Mamba?
2. Phase 2 Deep Dive: Why each building block?
3. Build MambaUNet from scratch!

---
**Great job working through the fundamentals!** 🚀