# **Computer Vision**

Welcome to Module 2: Computer Vision. We are shifting gears significantly here. In Module 1 (Foundations), you dealt with tabular data where rows were independent samples and columns were features.

In Computer Vision, "features" aren't given to us in columns; they are hidden in the spatial arrangement of pixels. **Convolutional Neural Networks (CNNs)** are the tools we use to automatically extract these spatial features (edges, textures, shapes) without manually engineering them.

Here is the roadmap for this session. We will start by building the mechanics "by hand" (using Python loops) to ensure you understand exactly what the sliding window does, then switch to PyTorch for the heavy lifting.

### **Phase 1: Topic Breakdown**

```text
L9: CNN Fundamentals
├── Concept 1: The Convolution Operation (The Sliding Window)
│   ├── Kernel/Filter (The Feature Detector)
│   ├── Dot Product (The Matching Mechanism)
│   ├── Purpose: Detecting local patterns (edges, lines) regardless of position.
│   ├── Simpler Terms: Sliding a small flashlight over an image to find matches.
│   └── Task: Implement a 2D convolution using nested loops (No PyTorch layers).
│
├── Concept 2: Stride & Padding (Controlling Geometry)
│   ├── Stride (Step size of the slide)
│   ├── Padding (Adding borders to preserve size)
│   ├── Output Dimension Formula
│   ├── Purpose: Controlling the spatial reduction and handling image borders.
│   ├── Simpler Terms: How big a step we take, and adding a frame so we don't lose the corners.
│   └── Task: Update manual implementation to support stride and zero-padding.
│
├── Concept 3: Pooling Layers (Downsampling)
│   ├── Max Pooling vs Average Pooling
│   ├── Purpose: Reducing computational load and adding translation invariance.
│   ├── Simpler Terms: Summarizing a region by taking the loudest voice (Max) or the group consensus (Avg).
│   └── Task: Implement a manual Max Pool function.
│
├── Concept 4: Receptive Fields & Architecture
│   ├── The "Cone of Vision"
│   ├── Deep vs Shallow layers
│   ├── Purpose: Understanding how deeper neurons see larger parts of the original image.
│   ├── Simpler Terms: How much of the original photo one specific pixel in the output cares about.
│   └── Task: Calculate output dimensions and parameters for a specific architecture.
│
├── Concept 5: PyTorch CNN Components (Transition to Framework)
│   ├── nn.Conv2d, nn.MaxPool2d, nn.Flatten
│   ├── Channel dimensions (N, C, H, W)
│   └── Task: Define a class-based CNN architecture in PyTorch.
│
├── Concept 6: Training on CIFAR-10 (The Application)
│   ├── Data Normalization for Images
│   ├── CrossEntropyLoss in Multiclass classification
│   └── Task: Train the defined model on the CIFAR-10 dataset.
│
└── Concept 7: Feature Map Visualization (Peeking inside the Black Box)
    ├── Extracting intermediate activations
    ├── Purpose: Verifying what filters are actually learning.
    └── Task: Visualize the output of the first convolutional layer.

```

---


## **Concept 1: The Convolution Operation (The Sliding Window)**

### Intuition: The "Flashlight" Search

Imagine you are looking for a specific pattern, like a vertical edge or a diagonal line, in a large image. Instead of checking the whole image at once, you use a small "scanner" (the **Kernel** or **Filter**) and slide it across the image.

At every position, you compare the scanner with the part of the image underneath it.

* If the image pixels match the scanner's pattern, the result is a high number (strong activation).
* If they don't match, the result is low or negative.

This creates a **Feature Map**: a map showing *where* in the image that specific pattern was found.

### Mechanics

Mathematically, a convolution is a sum of element-wise products.

Given:
   * **Input Image ($I$):** A 2D matrix of pixel values.
   * **Kernel ($K$):** A smaller 2D matrix (e.g., $3 \times 3$) containing weights.

For every position $(i, j)$ where the kernel fits entirely on the image:
   1. **Overlay** the kernel on top of the image patch.
   2. **Multiply** every kernel weight by the corresponding pixel value underneath it.
   3. **Sum** all these products to get a single number.
   4. **Place** this number in the output matrix at position $(i, j)$.
   
$$Output[i, j] = \sum_{m=0}^{k-1} \sum_{n=0}^{k-1} I[i+m, j+n] \times K[m, n]$$

*(Note: In deep learning, this is technically "cross-correlation" because we don't flip the kernel, but everyone calls it convolution.)*
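To make the summation concrete, the sketch below evaluates the formula term by term for a single window. The image values, the kernel, and the helper name `window_response` are illustrative assumptions, not part of the exercise. It also shows what the note above means: flipping the kernel (true convolution) gives a different answer from the unflipped version (cross-correlation) whenever the kernel is not symmetric.

```python
import numpy as np

# Hypothetical 4x4 image: dark (0) on the left, bright (1) on the right.
image = np.array([
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
], dtype=float)

# Hypothetical 3x3 kernel that responds to dark-to-bright vertical edges.
kernel = np.array([
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
], dtype=float)

def window_response(I, K, i, j):
    """Evaluate Output[i, j] = sum_m sum_n I[i+m, j+n] * K[m, n] term by term."""
    k = K.shape[0]
    total = 0.0
    for m in range(k):
        for n in range(k):
            total += I[i + m, j + n] * K[m, n]
    return total

# Cross-correlation (what deep learning calls "convolution"): kernel used as-is.
cross_corr = window_response(image, kernel, 0, 0)

# True convolution: the kernel is flipped along both axes first.
true_conv = window_response(image, np.flip(kernel), 0, 0)

print(cross_corr)  # 3.0
print(true_conv)   # -3.0
```

The window at position (0, 0) covers the dark-to-bright transition, so the unflipped kernel scores +3. The flipped kernel looks for the opposite transition and scores -3.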

### Simpler Explanation

Think of the Kernel as a stencil. You slide the stencil over a piece of paper (the image). At each spot, you calculate how much the drawing on the paper looks like the shape cut into the stencil. You write that "similarity score" down on a new sheet of grid paper. That new sheet is your Convolution Output.

### Trade-offs

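* **Pros:** **Translation equivariance.** A vertical edge detector responds to a vertical edge whether it's in the top-left or bottom-right, and the response appears at the matching location in the feature map. (Approximate translation *invariance* comes later, from pooling.)
* **Cons:** Computational cost. Every output value costs $K \times K$ multiplications, so a single filter needs roughly $H \times W \times K^2$ of them per image.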
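The position-independence claim can be checked numerically. The following is a minimal sketch, assuming a stride-1, no-padding sliding window; the helper name `conv2d_valid` and the two test images are made up for illustration. Moving a bright bar two columns to the right moves the filter response two columns to the right and leaves it otherwise unchanged.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Stride-1, no-padding sliding window (sum of element-wise products)."""
    k_h, k_w = kernel.shape
    out_h = image.shape[0] - k_h + 1
    out_w = image.shape[1] - k_w + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + k_h, j:j + k_w] * kernel)
    return out

# Hypothetical vertical-edge kernel: three identical rows of [-1, 0, 1].
kernel = np.array([[-1.0, 0.0, 1.0]] * 3)

# Two 6x8 images with the same bright vertical bar, two columns apart.
img_a = np.zeros((6, 8))
img_a[:, 2] = 1.0
img_b = np.zeros((6, 8))
img_b[:, 4] = 1.0

resp_a = conv2d_valid(img_a, kernel)
resp_b = conv2d_valid(img_b, kernel)

# Column of the strongest response in the first output row.
peak_a = int(np.argmax(resp_a[0]))
peak_b = int(np.argmax(resp_b[0]))
print(peak_a, peak_b)  # 0 2
```

The same kernel weights are reused at every position, which is why the detector does not need to be relearned for each location.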

---

### Your Task

**Implement the 2D Convolution logic using Python loops.**

We will stick to 2D arrays (grayscale, single channel) for this specific exercise to isolate the sliding window logic.

**Specifications:**

   1. **Function Name:** `convolution_2d_manual(input_matrix, kernel)`
   2. **Inputs:**
      * `input_matrix`: A numpy array of shape $(H, W)$.
      * `kernel`: A numpy array of shape $(K, K)$.
   
   3. **Logic:**
      * Calculate the output dimensions. Formula: $Output\_Dim = Input\_Dim - Kernel\_Dim + 1$ (Assuming Stride=1, No Padding).
      * Initialize an output matrix of zeros.
      * Use nested loops to iterate through valid positions of the input.
      * Perform the element-wise multiplication and sum.
   
   4. **Constraints:**
      * **No** `scipy.signal.convolve` or `torch.nn`.
      * **Use** `numpy`.
      * Assume Stride = 1 and No Padding.

**Test Case to Verify:**
   * Input: A $5 \times 5$ matrix of all ones.
   * Kernel: A $3 \times 3$ matrix of all ones.
   * **Expected Output:** A $3 \times 3$ matrix where every value is 9 (each window sums nine products of $1 \times 1$).



In [3]:
import numpy as np

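def convolution_2d_manual(input_matrix, kernel):
    im_n, im_m = input_matrix.shape
    k_n, k_m = kernel.shape
    # Output size for stride 1 and no padding: Input - Kernel + 1.
    output_n = im_n - k_n + 1
    output_m = im_m - k_m + 1

    output = np.zeros((output_n, output_m))

    # Slide the kernel over every position where it fits entirely on the input.
    for i in range(output_n):
        for j in range(output_m):
            region = input_matrix[i:i + k_n, j:j + k_m]
            # Element-wise product of patch and kernel, summed to one number.
            output[i, j] = np.sum(region * kernel)
    return output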
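One way to check a loop implementation is to compare it against an independent, loop-free reference. The sketch below is only a checking aid, not a replacement for the exercise; the function name `convolution_2d_reference` is an assumption, and it relies on `numpy.lib.stride_tricks.sliding_window_view` (available in NumPy 1.20 and later) to gather every window at once.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def convolution_2d_reference(input_matrix, kernel):
    """Loop-free stride-1, no-padding cross-correlation, for checking results only."""
    # windows has shape (H - K + 1, W - K + 1, K, K): one KxK patch per output cell.
    windows = sliding_window_view(input_matrix, kernel.shape)
    # Multiply each patch by the kernel and sum over the two patch axes.
    return np.einsum("ijmn,mn->ij", windows, kernel)

# The test case from the task: 5x5 ones with a 3x3 kernel of ones.
expected = convolution_2d_reference(np.ones((5, 5)), np.ones((3, 3)))
print(expected.shape)  # (3, 3)
print(expected)        # every entry is 9.0
```

In the notebook, `np.allclose(convolution_2d_manual(np.ones((5, 5)), np.ones((3, 3))), expected)` should then evaluate to `True`.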



## **Concept 2: Stride & Padding (Controlling Geometry)**

### Intuition

1. **Stride (Stepping):**
   * In the previous concept, we slid the window 1 pixel at a time.
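   * **Stride** is simply taking bigger steps. If Stride = 2, the window moves two pixels at a time, so it lands on every other position.
   * **Why?** It shrinks the output (downsampling) and saves computation: a stride of 2 roughly halves each spatial dimension, so the output has about a quarter as many values.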


2. **Padding (The Frame):**
   * Notice how `output_n = im_n - k_n + 1`? For any kernel larger than $1 \times 1$, the output is *smaller* than the input. If you apply many layers, your image eventually disappears!
   * Also, pixels in the center are scanned many times, but corners are scanned only once.
   * **Padding** adds a border of "dummy" pixels (usually zeros) around the input image. This allows the kernel to "hang off the edge". With stride 1 and an odd kernel size $K$, padding of $(K - 1) / 2$ on each side keeps the output the same size as the input.
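As a quick illustration of zero-padding, here is a small sketch using `np.pad`. The 3x3 image values are arbitrary and chosen only so the original pixels are easy to spot inside the padded result.

```python
import numpy as np

# Arbitrary 3x3 "image" with values 1..9.
image = np.arange(1, 10).reshape(3, 3)

# Add a one-pixel border of zeros on every side.
padded = np.pad(image, pad_width=1, mode="constant", constant_values=0)

print(padded.shape)  # (5, 5)
print(padded)
```

With a $3 \times 3$ kernel and stride 1, the padded $5 \times 5$ input produces a $3 \times 3$ output, the same size as the original image, and each corner pixel now sits under four window positions instead of one.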


### Mechanics (The Master Formula)

This is the most important formula in CNN arithmetic. Memorize this.
$$Output = \lfloor \frac{Input - Kernel + 2 \times Padding}{Stride} \rfloor + 1$$

   * **Input:** Input size ($W$ or $H$)
   * **Kernel:** Filter size ($K$)
   * **Padding:** Pixels added to *each* side ($P$)
   * **Stride:** Step size ($S$)
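The formula translates directly into a one-line helper. This is a sketch with an assumed name, `conv_output_size`; integer division (`//`) plays the role of the floor for the non-negative values involved here.

```python
def conv_output_size(input_size, kernel_size, padding=0, stride=1):
    """Output = floor((Input - Kernel + 2 * Padding) / Stride) + 1."""
    return (input_size - kernel_size + 2 * padding) // stride + 1

# 32x32 input (the CIFAR-10 image size), 3x3 kernel, padding 1, stride 1.
same = conv_output_size(32, 3, padding=1, stride=1)
print(same)     # 32

# Same layer with stride 2: the spatial size is halved.
halved = conv_output_size(32, 3, padding=1, stride=2)
print(halved)   # 16

# No padding, stride 1: reduces to Input - Kernel + 1.
valid = conv_output_size(5, 3)
print(valid)    # 3

# The floor matters when the stride does not divide evenly: (8 - 3) / 2 = 2.5.
floored = conv_output_size(8, 3, stride=2)
print(floored)  # 3
```

Apply the helper once for height and once for width; the two dimensions are independent.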

### Simpler Explanation
   * **Stride:** Instead of walking heel-to-toe (Stride 1), you jump over cracks in the sidewalk (Stride 2).
   * **Padding:** Putting a mat around a painting so you can paint right up to the very edge without getting paint on the floor.

### Trade-offs
   * **Stride > 1:** Faster, but you might lose fine-grained details.
   * **Padding:** Keeps spatial dimensions constant (helpful for deep networks), but adds "fake" data (zeros) at the edges.

---

### Your Task

Update your `convolution_2d_manual` function to support **Stride** and **Padding**.

**Specifications:**
   1. **Update Function Signature:** `convolution_2d_manual(input_matrix, kernel, stride=1, padding=0)`
   2. **Logic Update:**
      * **Apply Padding First:** Create a new, larger matrix with zeros around the border. (Hint: `np.pad` is useful, or create a larger zeros matrix and slot the image into the center).
      * **Update Output Dimensions:** Use the "Master Formula" above.
      * **Update Loops:** The loop range must use the new output dimensions.
      * **Update Slicing:** When calculating where the `region` is, you need to multiply the loop index by the stride. (e.g., if `i=1` and `stride=2`, the window starts at index 2).
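The slicing rule is easiest to see in one dimension. The sketch below uses a made-up 7-element row as a stand-in for one row of the padded input and lists where each window starts for a kernel size of 3 and a stride of 2.

```python
import numpy as np

row = np.arange(7)            # stand-in for one padded input row: [0 1 2 3 4 5 6]
kernel_size, stride = 3, 2

# Master formula; padding is assumed to be already applied to `row`.
out_size = (len(row) - kernel_size) // stride + 1

# Output index i maps to input start index i * stride.
starts = [i * stride for i in range(out_size)]
windows = [row[s:s + kernel_size].tolist() for s in starts]

print(out_size)  # 3
print(starts)    # [0, 2, 4]
print(windows)   # [[0, 1, 2], [2, 3, 4], [4, 5, 6]]
```

The loop index counts output positions, while `i * stride` locates the matching window in the input. The 2D case applies the same mapping to rows and columns separately.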


In [None]:
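import numpy as np

def convolution_2d_manual(input_matrix, kernel, stride=1, padding=0):
    # Zero-pad every side of the input before sliding the kernel.
    padded = np.pad(input_matrix, padding, mode="constant")
    im_n, im_m = padded.shape
    k_n, k_m = kernel.shape
    # Master formula; the padding is already included in im_n and im_m.
    output = np.zeros(((im_n - k_n) // stride + 1, (im_m - k_m) // stride + 1))
    for i in range(output.shape[0]):
        for j in range(output.shape[1]):
            r, c = i * stride, j * stride  # top-left corner of the window
            output[i, j] = np.sum(padded[r:r + k_n, c:c + k_m] * kernel)
    return output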