# Intro to Basic Pytorch. CNNs. Image Classification

Course: Audio Processing<br>
Author: Ostap Viniavskyi<br>
Date: Spring 2025<br>

## Agenda
0. [**Intro**](#Intro)<br>
1. [**Introduction to Pytorch**](#Introduction-to-Pytorch)<br>
    [1.1 Basic Operations](#Basic-Operations)<br>
    [1.2 Intro to Autograd](#Intro-to-Autograd)<br>
    [1.3 Solving Non-Linear Least-Squares problem in Pytorch](#Solving-Non-Linear-Least-Squares-problem-in-Pytorch)<br>
    
2. [**CNN**](#Convolutional-Neural-Networks-(CNN))<br>
    [2.1 Convolution operation](#Convolution-operation)<br>
    [2.2 CNN building blocks](#CNN-building-blocks)<br>
    [2.3 Building CNN in Pytorch](#Building-CNN-in-Pytorch)<br>
    
3. [**Image Classification**](#Image-Classification)<br>
    [3.1 Setting up CNN training pipeline](#Setting-up-CNN-training-pipeline)<br>
    [3.2 Evaluation and inference of CNN model](#Evalution-and-inference-of-CNN-model)<br>

4. [**Conclusions**](#Conclusions)<br>


## Intro

Deep learning has significantly advanced the field of computer vision, enabling automatic performance of tasks such as image recognition, object detection, and segmentation with high accuracy. Among the most widely used architectures for such tasks are **Convolutional Neural Networks (CNNs)**, which leverage spatial hierarchies in data to learn effective feature representations.  

This lecture provides a structured introduction to **PyTorch**, a widely used deep learning framework, and its core functionalities for implementing and training deep learning models. In particular, we will focus on:  

- The fundamentals of **PyTorch tensors** and operations.  
- **Autograd and backpropagation** for automatic differentiation and gradient computation.  
- The architecture and working principles of **Convolutional Neural Networks (CNNs)**.  
- **Image classification** using CNNs in PyTorch.  

By the end of this session, students will be able to:  
1. Understand the basic concepts and operations of PyTorch.  
2. Implement **automatic differentiation** and **backpropagation** using PyTorch’s `autograd` module. 
3. Construct and train a **CNN model** for image classification.  
4. Evaluate the performance of the trained model.  

### Introduction to Pytorch  

PyTorch is an open-source deep learning framework developed by **Meta AI (formerly Facebook AI Research)**. It provides a flexible and efficient platform for building and training neural networks, making it widely used in both research and industry. At its core, PyTorch offers **tensor computation with strong GPU acceleration** and an **automatic differentiation system** (`autograd`), which simplifies gradient-based optimization for deep learning models.  

PyTorch excels in **dynamic computation graphs**, allowing users to modify models on the fly, making it highly suitable for applications such as **natural language processing (NLP), computer vision, and reinforcement learning**. The framework provides built-in support for popular deep learning tasks, including **convolutional neural networks (CNNs), recurrent neural networks (RNNs), and transformer-based architectures**.  

However, PyTorch does have some limitations. While it is efficient for research and prototyping, its deployment options were traditionally less optimized compared to static graph frameworks like TensorFlow. That said, **TorchScript** and **PyTorch’s integration with ONNX (Open Neural Network Exchange)** have significantly improved its deployment capabilities. Additionally, PyTorch requires **manual optimization for large-scale distributed training**, which can be complex compared to frameworks designed specifically for production environments.  

Despite these challenges, PyTorch remains one of the most popular frameworks due to its ease of use, strong community support, and extensive ecosystem, including **TorchVision (for computer vision tasks), TorchText (for NLP), and TorchAudio (for audio processing)**.  


**Differences Between PyTorch and NumPy** 

PyTorch and NumPy both provide powerful tensor computation capabilities, but PyTorch extends NumPy's functionality by **supporting GPU acceleration** and **automatic differentiation**. While NumPy arrays (`ndarray`) are optimized for general numerical computing, PyTorch tensors (`torch.Tensor`) are designed specifically for deep learning, allowing seamless computation on both CPUs and GPUs. Additionally, PyTorch’s `autograd` module enables automatic differentiation, making it easier to compute gradients and train neural networks, whereas NumPy requires explicit differentiation implementations. Despite these differences, PyTorch tensors and NumPy arrays share a similar interface, and conversion between them is straightforward using `.numpy()` and `torch.from_numpy()`.  
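
The conversion mentioned above can be sketched as follows. The array values are arbitrary and chosen only for illustration. The point of the example is that `torch.from_numpy()` and `.numpy()` share memory with their source (for CPU tensors), whereas `torch.tensor()` makes a copy.

```python
import numpy as np
import torch

# NumPy -> PyTorch: torch.from_numpy shares memory with the source array
arr = np.array([1.0, 2.0, 3.0], dtype=np.float32)
t = torch.from_numpy(arr)
arr[0] = 10.0          # modifying the array...
print(t)               # ...is visible through the tensor

# PyTorch -> NumPy: .numpy() on a CPU tensor also shares memory
back = t.numpy()
t[1] = 20.0
print(back)

# torch.tensor(...) copies the data, so later edits to arr do not propagate
copied = torch.tensor(arr)
arr[2] = 30.0
print(copied)
```

Because of the shared memory, an in-place change on either side silently affects the other; copy explicitly when the two objects should be independent.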


#### Basic Operations

In [None]:
import torch
import torchvision
import matplotlib.pyplot as plt
import numpy as np
import cv2
from torchviz import make_dot

%matplotlib inline

In [None]:
if torch.cuda.is_available():
    device = "cuda:0"
    print(f"CUDA is available. Setting device=\"{device}\"")
else:
    device = "cpu"
    print(f"CUDA is not available. Setting device=\"{device}\"")

In [None]:
# create torch tensor from scalar value
x = torch.tensor(0.)
print(f"{x.device=}", f"{x.shape=}", f"{x.dtype=}")

# create 2x2 matrix
M = torch.tensor([[1., 2.], [3., 4.]])
print(f"{M.device=}", f"{M.shape=}", f"{M.dtype=}")

In [None]:
# move tensor to another device, convert dtype to int64
M1 = M.to(device, dtype=torch.int64)
print(f"{M1.device=}", f"{M1.dtype=}")

# reshape tensor
M2 = M1.reshape(1, 4)
print(f"{M2.shape=}")

M1, M2

In [None]:
# elementwise operations
A = torch.randn(3, 4)
B = torch.randn(3, 4)

C = A + B
print(C.shape)

try:
    C = A + B.T
except Exception as e:
    print(type(e), e)

In [None]:
# matrix multiplication
A = torch.randn(3, 4)
B = torch.randn(3, 4)

D = torch.matmul(A, B.T)
print(D.shape)

try:
    C = torch.matmul(A, B)
except Exception as e:
    print(type(e), e)

#### Intro to Autograd


PyTorch's **Autograd** is an automatic differentiation engine that plays a crucial role in optimizing deep learning models. It allows for efficient gradient computation by tracking operations performed on tensors and dynamically constructing a **computational graph** in the background.

**Computational Graph**

A **computational graph** is a directed acyclic graph (DAG) where nodes represent mathematical operations, and edges represent the flow of data (tensors). When performing operations on tensors with `requires_grad=True`, PyTorch builds this graph dynamically, recording the sequence of operations for efficient gradient computation during **backpropagation**.

**Static vs. Dynamic Computational Graphs**  

| Feature           | Static Graph (e.g., TensorFlow v1) | Dynamic Graph (PyTorch) |
|------------------|--------------------------------|------------------------|
| Graph Definition | Defined before execution      | Built dynamically during execution |
| Flexibility      | Less flexible, requires re-compilation for changes | Highly flexible, allowing model modifications on the fly |
| Debugging        | More complex, requires specialized tools | Easier, as it integrates with Python’s native debugging tools |
| Memory Efficiency | Can be optimized before execution | May use more memory due to dynamic allocation |

PyTorch's dynamic computation graph allows for more intuitive and flexible model development, particularly useful for tasks like **variable-length sequences and reinforcement learning**.
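
A minimal sketch of what "built dynamically during execution" means in practice. The function `repeated_square` and its `threshold` argument are hypothetical names introduced for this illustration. The number of recorded operations depends on the input value, because the graph is traced while ordinary Python control flow runs.

```python
import torch

def repeated_square(x, threshold=100.0):
    # Ordinary Python control flow: the number of iterations depends on the data
    steps = 0
    while x.abs() < threshold:
        x = x * x
        steps += 1
    return x, steps

x = torch.tensor(3.0, requires_grad=True)
y, steps = repeated_square(x)
print(steps)        # number of multiplications recorded for this input
print(y.grad_fn)    # non-leaf tensors remember the operation that produced them

y.backward()        # differentiates through exactly the operations that were executed
print(x.grad)
```

For `x = 3` the loop squares three times, so `y = x^8` and the gradient is `8 * x^7`; a different starting value would produce a different graph without any change to the code.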

**Forward vs. Backward Automatic Differentiation**  

![](images/forward_autodif.png) <br>

1. **Forward Mode Differentiation**  
   - Computes derivatives **from inputs to outputs**.  
   - Efficient when the number of inputs is **small** relative to the number of outputs, because one forward pass is needed per input.  
   - Less commonly used in deep learning but useful in some scientific computing applications.

![](images/backward_autodif.png) <br>
2. **Backward Mode Differentiation (Reverse Mode Differentiation)**  
   - Computes derivatives **from outputs back to inputs**.  
   - Efficient for deep learning, where the number of outputs (loss scalar) is much smaller than the number of parameters.  
   - PyTorch primarily uses **backward differentiation** for backpropagation.  

PyTorch’s `autograd` implements **backward differentiation**: calling `.backward()` on a scalar loss computes the gradients of that loss with respect to every leaf tensor created with `requires_grad=True` and accumulates them in the tensors' `.grad` attributes.  
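
To make the contrast concrete, here is a small forward-mode sketch in plain Python based on dual numbers. The `Dual` class and the function `f` are hypothetical and written only for this illustration; this is not how PyTorch is implemented. Each value carries its derivative with respect to one chosen input, so a separate forward pass is needed for every input.

```python
import math

class Dual:
    """A value together with its derivative with respect to one chosen input."""
    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

    __rmul__ = __mul__

def sigmoid(d):
    s = 1.0 / (1.0 + math.exp(-d.value))
    return Dual(s, s * (1.0 - s) * d.deriv)   # chain rule

def f(w1, w2):
    # sigmoid(w1 * x1 + w2 * x2) with fixed inputs x1 = 2, x2 = -1
    return sigmoid(w1 * 2.0 + w2 * (-1.0))

# One forward pass per input: seed the derivative of the input of interest with 1
df_dw1 = f(Dual(0.5, 1.0), Dual(1.0, 0.0)).deriv
df_dw2 = f(Dual(0.5, 0.0), Dual(1.0, 1.0)).deriv
print(df_dw1, df_dw2)
```

With millions of parameters this would require millions of forward passes, whereas reverse mode obtains all partial derivatives of a scalar loss in a single backward pass.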

**Logistic Regression and Reverse Mode Differentiation**  

Logistic regression is a binary classification model that predicts the probability of an input belonging to a certain class. Given an input **vector** $\mathbf{x} \in \mathbb{R}^m$ with **m features**, the model computes the probability using the **sigmoid activation function**:  

$$
y = \sigma(z) = \frac{1}{1 + e^{-z}}
$$

where  

$$
z = \mathbf{w}^T \mathbf{x} + b = \sum_{i=1}^{m} w_i x_i + b
$$

- $\mathbf{w} \in \mathbb{R}^m$ is the weight vector  
- $b \in \mathbb{R}$ is the bias term  
- $\sigma(z)$ is the sigmoid function, which maps $z$ to the range $(0,1)$  


The loss function for logistic regression is the **binary cross-entropy loss**, given by:  

$$
L = - \left( y_{\text{true}} \log y + (1 - y_{\text{true}}) \log (1 - y) \right)
$$

where $y_{\text{true}}$ is the ground truth label.

**Reverse Mode Differentiation (Backpropagation)**

To train the model, we need to compute the **gradients** of the loss $L$ with respect to each **weight value** $w_i$ and bias $b$. Using reverse mode differentiation, we compute derivatives step by step.

**Step 1: Compute $\frac{dL}{dy}$**

$$
\frac{dL}{dy} = - \left( \frac{y_{\text{true}}}{y} - \frac{1 - y_{\text{true}}}{1 - y} \right)
$$

**Step 2: Compute $\frac{dy}{dz}$ (Derivative of Sigmoid)**

$$
\frac{dy}{dz} = \sigma(z) (1 - \sigma(z)) = y (1 - y)
$$

**Step 3: Compute $\frac{dz}{dw_i}$**  

$$
\frac{dz}{dw_i} = x_i
$$

**Step 4: Compute $\frac{dL}{dw_i}$ using Chain Rule**

By applying the chain rule:

$$
\frac{dL}{dw_i} = \frac{dL}{dy} \cdot \frac{dy}{dz} \cdot \frac{dz}{dw_i}
$$

Substituting the values:

$$
\frac{dL}{dw_i} = \left( - \frac{y_{\text{true}}}{y} + \frac{1 - y_{\text{true}}}{1 - y} \right) \cdot y (1 - y) \cdot x_i
$$

**Step 5: Interpretation**

This derivative tells us how much the loss changes with respect to each weight $w_i$. It is used in **gradient-based optimization algorithms**, such as **stochastic gradient descent (SGD)**, to update the model parameters.  
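
The chain-rule product from Step 4 can be checked numerically. The sketch below uses a single hypothetical training example (the values of `w`, `b`, `x` and `y_true` are arbitrary) and compares three quantities: the product of the three factors, its algebraic simplification $(y - y_{\text{true}}) x_i$, and a central finite-difference estimate.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce_loss(w, b, x, y_true):
    y = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return -(y_true * math.log(y) + (1 - y_true) * math.log(1 - y))

# Hypothetical single training example
w, b = [0.3, -0.2, 0.5], 0.1
x, y_true = [1.0, 2.0, -1.0], 1.0

y = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

# Chain-rule product from Step 4
dL_dy = -(y_true / y - (1 - y_true) / (1 - y))
dy_dz = y * (1 - y)
chain = [dL_dy * dy_dz * xi for xi in x]

# The same product simplifies algebraically to (y - y_true) * x_i
simplified = [(y - y_true) * xi for xi in x]

# Central finite differences as an independent check
eps = 1e-6
numeric = []
for i in range(len(w)):
    w_plus, w_minus = w.copy(), w.copy()
    w_plus[i] += eps
    w_minus[i] -= eps
    numeric.append((bce_loss(w_plus, b, x, y_true) - bce_loss(w_minus, b, x, y_true)) / (2 * eps))

print(chain)
print(simplified)
print(numeric)
```

The simplified form avoids dividing by $y$ or $1 - y$, which is one reason numerically stable loss implementations work with it rather than with the raw product.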


In [None]:
# simulate logistic regression set-up
weights_require_grad = True # set to True to avoid error

# input data X and label y
X = torch.randn(100, 3)
y_true = torch.randint(2, size=(100, ))

# weight vector w and bias b
w = torch.randn(3, requires_grad=weights_require_grad)
b = torch.randn(1, requires_grad=weights_require_grad)

# compute y_pred
y = torch.sigmoid(X @ w + b)

# compute binary cross-entropy loss
loss = -(1 - y_true) * torch.log(1 - y) - y_true * torch.log(y)
loss = loss.sum()
loss.backward()

In [None]:
make_dot(loss, params=dict(weights=w, bias=b))

In [None]:
# print grads
print(f"{w.grad=}")
print(f"{b.grad=}")
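# y is an intermediate (non-leaf) tensor: its .grad is None, and PyTorch warns on access,
# unless y.retain_grad() is called before loss.backward()
print(f"{y.grad=}")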

In [None]:
# check that autodiff results correspond to manually derived values
with torch.no_grad():
    w_grad = (-y_true / y + (1 - y_true) / (1 - y)) * y * (1 - y) @ X
    b_grad = ((-y_true / y + (1 - y_true) / (1 - y)) * y * (1 - y)).sum()
    
print(f"{w_grad=}")
print(f"{b_grad=}")

<a id="Solving-Non-Linear-Least-Squares-problem-in-Pytorch"></a>
#### Solving Non-Linear Least-Squares problem in Pytorch

![](images/logistic_curve.png) <br>

**Logistic Curve Fitting (Non-Linear Least Squares)**  

The logistic function is defined as:  

$$
f(x; L, k, x_0) = \frac{L}{1 + e^{-k(x - x_0)}}
$$

where:  
- $ L $ is the maximum value (asymptote),  
- $ k $ is the growth rate,  
- $ x_0 $ is the midpoint (where $ f(x) = L/2 $),  
- $ x $ is the input variable.  

**Least Squares Loss Function** 

To fit a logistic curve to given data points $ (x_i, y_i) $, we minimize the **mean squared error (MSE)**:  

$$
J(L, k, x_0) = \frac{1}{N} \sum_{i=1}^{N} \left( f(x_i; L, k, x_0) - y_i \right)^2
$$

where $ N $ is the number of data points.

To update parameters using **vanilla gradient descent**, we compute the partial derivatives of the loss function with respect to each parameter.

**Gradient Descent Update Rule**

Using vanilla gradient descent, we update the parameters as follows:

$$
L^{(t+1)} = L^{(t)} - \alpha \frac{\partial J}{\partial L}
$$

$$
k^{(t+1)} = k^{(t)} - \alpha \frac{\partial J}{\partial k}
$$

$$
x_0^{(t+1)} = x_0^{(t)} - \alpha \frac{\partial J}{\partial x_0}
$$

where $ \alpha $ is the learning rate and $ t $ is the iteration index.

**Iterative Procedure**

1. Initialize $ L, k, x_0 $ with random values.  
2. Compute the loss $ J(L, k, x_0) $.  
3. Compute the gradients $ \frac{\partial J}{\partial L}, \frac{\partial J}{\partial k}, \frac{\partial J}{\partial x_0} $.  
4. Update parameters using the gradient descent update rules.  
5. Repeat until convergence (e.g., when the loss stops decreasing significantly). 

In [None]:
# Generate synthetic data from a logistic curve
np.random.seed(42)
torch.manual_seed(42)

def logistic(x, L, k, x0):
    return L / (1 + torch.exp(-k * (x - x0)))

# True parameters
L_true = 10.0
k_true = 1.5
x0_true = 5.0

# Generate x values
x = torch.linspace(0, 10, steps=100)
y_true = logistic(x, L_true, k_true, x0_true)

# Add noise
noise_std = 2.
y_noisy = y_true + noise_std * torch.randn_like(y_true)

plt.scatter(x.numpy(), y_noisy.numpy(), label='Noisy Data', alpha=0.6)
plt.plot(x.numpy(), y_true.numpy(), label='True Curve', linestyle='dashed')
plt.legend()
plt.show()

In [None]:
# Initialize parameters with fixed starting guesses, deliberately far from the true values
L = torch.tensor(5.0, requires_grad=True)
k = torch.tensor(1.0, requires_grad=True)
x0 = torch.tensor(-2.0, requires_grad=True)

# Hyperparameters
lr = 0.05
num_epochs = 700

# Gradient descent
for epoch in range(num_epochs):
    # Forward pass
    y_pred = logistic(x, L, k, x0)
    loss = torch.mean((y_pred - y_noisy) ** 2)  # Mean Squared Error
    
    # Backward pass
    loss.backward()
    
    # Manual update
    with torch.no_grad():
        L -= lr * L.grad
        k -= lr * k.grad
        x0 -= lr * x0.grad
    
    # Zero gradients
    L.grad.zero_()
    k.grad.zero_()
    x0.grad.zero_()
    
    if epoch % 50 == 0:
        print(f"Epoch {epoch}: Loss = {loss.item():.4f}")

# Plot results
plt.scatter(x.numpy(), y_noisy.numpy(), label='Noisy Data', alpha=0.6)
plt.plot(x.numpy(), logistic(x, L, k, x0).detach().numpy(), label='Fitted Curve', color='red')
plt.plot(x.numpy(), y_true.numpy(), label='True Curve', linestyle='dashed')
plt.legend()
plt.show()

# Print learned parameters
print(f"Learned parameters: L={L.item():.2f}, k={k.item():.2f}, x0={x0.item():.2f}")
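
The manual update above is what `torch.optim.SGD` performs when used without momentum. The sketch below is a side-by-side comparison on noise-free toy data; the number of steps and the helper `init_params` are choices made for this illustration. It runs both variants from the same starting point and prints the resulting parameters.

```python
import torch

def logistic(x, L, k, x0):
    return L / (1 + torch.exp(-k * (x - x0)))

# Noise-free toy data, used only to compare the two update styles
x = torch.linspace(0, 10, steps=50)
y = logistic(x, 10.0, 1.5, 5.0)

def init_params():
    # Same starting guesses for L, k, x0 in both variants
    return [torch.tensor(v, requires_grad=True) for v in (5.0, 1.0, -2.0)]

lr, steps = 0.05, 20

# Variant 1: manual update inside torch.no_grad()
manual = init_params()
for _ in range(steps):
    loss = torch.mean((logistic(x, *manual) - y) ** 2)
    loss.backward()
    with torch.no_grad():
        for p in manual:
            p -= lr * p.grad
            p.grad.zero_()

# Variant 2: the same update delegated to torch.optim.SGD
auto = init_params()
optimizer = torch.optim.SGD(auto, lr=lr)
for _ in range(steps):
    optimizer.zero_grad()
    loss = torch.mean((logistic(x, *auto) - y) ** 2)
    loss.backward()
    optimizer.step()

for p_manual, p_auto in zip(manual, auto):
    print(p_manual.item(), p_auto.item())
```

Using an optimizer object keeps the training loop unchanged when switching to another update rule such as Adam.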

### Convolutional Neural Networks (CNN)

#### Convolution operation

![](images/conv_filter.gif) <br>

In a 2D convolution operation, a small matrix called a **kernel** or **filter** slides over a larger input matrix (typically an image) to produce an output feature map. Mathematically, for an input $ X $ of size $ H \times W $ and a kernel $ K $ of size $ k_h \times k_w $, the convolution at position $ (i, j) $ is computed as:  

$$
Y(i, j) = \sum_{m=0}^{k_h-1} \sum_{n=0}^{k_w-1} X(i+m, j+n) \cdot K(k_h - 1 - m, k_w - 1 - n)
$$

This operation captures **spatial patterns** by detecting local features such as edges, textures, and shapes. **Padding** can be applied to preserve spatial dimensions, while **strides** determine the step size of the kernel's movement. Multiple filters in a convolutional layer allow the extraction of diverse features, forming the foundation of deep learning architectures for image processing.
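
One practical detail: `torch.nn.functional.conv2d` slides the kernel without flipping it (cross-correlation), while the formula above indexes the kernel in reverse. The sketch below uses a random input and an arbitrary, deliberately asymmetric kernel. It evaluates the double sum directly and shows that `F.conv2d` reproduces it once the kernel is flipped along both axes.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
X = torch.randn(5, 5)
K = torch.tensor([[1., 2., 0.],
                  [0., 1., -1.],
                  [3., 0., 1.]])   # deliberately asymmetric
kh, kw = K.shape

# Direct evaluation of the double sum (no padding, stride 1)
Y = torch.zeros(5 - kh + 1, 5 - kw + 1)
for i in range(Y.shape[0]):
    for j in range(Y.shape[1]):
        for m in range(kh):
            for n in range(kw):
                Y[i, j] += X[i + m, j + n] * K[kh - 1 - m, kw - 1 - n]

# F.conv2d expects [B, C, H, W] inputs and [C_out, C_in, KH, KW] kernels
K_flipped = torch.flip(K, dims=(0, 1))
Y_torch = F.conv2d(X[None, None], K_flipped[None, None])[0, 0]
Y_corr = F.conv2d(X[None, None], K[None, None])[0, 0]

matches_flipped = torch.allclose(Y, Y_torch, atol=1e-5)
matches_unflipped = torch.allclose(Y, Y_corr, atol=1e-5)
print(matches_flipped, matches_unflipped)
```

For symmetric kernels the two operations coincide, and for learned filters the distinction is immaterial because the network simply learns the flipped weights.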

**EXAMPLE: 2D Laplacian of Gaussian (LoG)**  

The **Laplacian of Gaussian (LoG)** is a second-order edge detection operator that enhances regions of rapid intensity change. It is computed by first applying a **Gaussian blur** to smooth the image and then taking the **Laplacian**, which measures the second derivative of intensity. Mathematically, the **LoG filter** is defined as:

$$
\text{LoG}(x, y) = \nabla^2 (G(x, y) * I(x, y))
$$

where:
- $ G(x, y) $ is a Gaussian function that reduces noise,
- $ I(x, y) $ is the input image,
- $ \nabla^2 $ is the **Laplacian operator**, which detects regions of high curvature.

**Applications in Blob Detection**

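LoG is widely used in **blob detection**, where blobs are defined as regions that differ significantly in intensity from their surroundings. Strong local **extrema** of the LoG response, taken over position and over the scale $\sigma$, indicate blob-like structures whose size matches the filter scale, while its **zero-crossings** mark edges. This makes LoG effective for tasks such as: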
- **Keypoint detection** in image processing (e.g., in **SIFT** feature extraction).
- **Medical imaging**, where it highlights circular or irregular structures (e.g., cell or tumor detection).
- **Astronomy**, for identifying celestial objects like stars and galaxies.

In [None]:
from scipy.ndimage import gaussian_laplace

In [None]:
# Read image in grayscale
image = cv2.imread('images/sunplowers.jpeg', cv2.IMREAD_GRAYSCALE)
print(image.shape)

# Convert image to float32 and normalize to 0-1 range
image = image.astype(np.float32) / 255.
H, W = image.shape

plt.imshow(image, cmap='gray')

In [None]:
# plot image as a 2d function
# xx, yy = np.meshgrid(np.arange(W), np.arange(H))
# fig = plt.figure(figsize=(14, 6))
# ax = fig.add_subplot(111, projection='3d')
# ax.plot_surface(xx, yy, image, cmap='coolwarm')

In [None]:
# Create Laplacian of Gaussian (LoG) kernel
def create_log_kernel(sigma):
    kernel_size = int(2 * np.ceil(3 * sigma) + 1)
    impulse_image = np.zeros((kernel_size, kernel_size))
    impulse_image[kernel_size // 2, kernel_size // 2] = 1  # Center pixel is 1

    return gaussian_laplace(impulse_image, sigma=sigma)


def visualize_log_kernel(log_kernel):
    kernel_size = log_kernel.shape[0]
    ax = np.linspace(-(kernel_size // 2), kernel_size // 2, kernel_size)
    xx, yy = np.meshgrid(ax, ax)
    
    fig = plt.figure(figsize=(14, 6))
    
    ax = fig.add_subplot(121, projection='3d')
    ax.plot_surface(xx, yy, log_kernel, cmap='coolwarm')
    
    ax = fig.add_subplot(122)
    ax.imshow(log_kernel, cmap='coolwarm')
    
    plt.show()

log_kernel_s5 = create_log_kernel(sigma=5).astype(np.float32)
log_kernel_s10 = create_log_kernel(sigma=10).astype(np.float32)
visualize_log_kernel(log_kernel_s10)

In [None]:
%%time

def convolve_gray(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """
    Perform convolution on a grayscale image using the given kernel.

    :param image: 2D NumPy array representing the grayscale image
    :param kernel: 2D NumPy array representing the convolution filter
    :return: 2D NumPy array of the convolved image
    """
    # Get dimensions
    img_h, img_w = image.shape
    kernel_h, kernel_w = kernel.shape
    
    # Compute padding size
    pad_h = kernel_h // 2
    pad_w = kernel_w // 2
    
    # Pad the image with zeros
    padded_image = np.pad(image, ((pad_h, pad_h), (pad_w, pad_w)), mode='constant', constant_values=0)
    
    # Output image
    output = np.zeros((img_h, img_w), dtype=np.float32)
    
    # Flip the kernel for convolution
    kernel = np.flipud(np.fliplr(kernel))
    
    # Perform convolution
    for i in range(img_h):
        for j in range(img_w):
            patch = padded_image[i:i+kernel_h, j:j+kernel_w]
            output[i, j] = np.sum(patch * kernel)
    
    return output

log5_image = convolve_gray(image, log_kernel_s5)
log10_image = convolve_gray(image, log_kernel_s10)

In [None]:
fig = plt.figure(figsize=(14, 6))

ax = fig.add_subplot(121,)
ax.imshow(log5_image, cmap='gray')
ax.title.set_text("Convolution of input image with LoG(sigma=5)")

ax = fig.add_subplot(122)
ax.imshow(log10_image, cmap='gray')
ax.title.set_text("Convolution of input image with LoG(sigma=10)")

plt.show()

*Additional Resources*: <br>
[3Blue1Brown video - Intuitive explanation of convolution operation](https://www.youtube.com/watch?v=KuXjwB4LzSA&ab_channel=3Blue1Brown) <br>
[Stanford Notes - Great convolution output size explanation and other CNN-related stuff](https://cs231n.github.io/convolutional-networks/) <br>
[Medium article - Efficient convolution implementation strategies](https://medium.com/@sundarramanp2000/different-implementations-of-the-ubiquitous-convolution-6a9269dbe77f) <br>

#### CNN building blocks

**Multichannel Convolution**

In deep learning, **multichannel convolution** extends the standard 2D convolution operation to handle inputs with multiple channels, such as RGB images or feature maps in deeper layers of a neural network. Unlike single-channel convolution, where a **2D kernel** slides over a single input, in multichannel convolution, a separate kernel is applied to **each input channel**, and the results are summed to produce a **single output value** per spatial location.  

For an input tensor of size $H \times W \times C_{\text{in}}$ (height, width, and number of channels), a convolutional layer with $C_{\text{out}}$ filters applies a **kernel of size** $k_h \times k_w \times C_{\text{in}}$ per filter. Each filter produces a single output channel, resulting in an output tensor of size $H' \times W' \times C_{\text{out}}$, where $H'$ and $W'$ depend on padding and stride.  

*Example*:  
- **Input:** $32 \times 32 \times 3$ (RGB image)  
- **Kernel:** $3 \times 3 \times 3$  
- **Output (with 16 filters):** $30 \times 30 \times 16$ (if no padding, stride = 1)  

In deeper layers, multichannel convolution enables networks to learn complex hierarchical features by combining multiple filters. This is essential for tasks like **image classification, object detection, and segmentation**.
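
The example above can be reproduced with `nn.Conv2d`. Note that PyTorch stores images channels-first, so the $32 \times 32 \times 3$ input becomes a tensor of shape `[B, C_in, H, W] = [1, 3, 32, 32]`. The padded and strided variants are extra illustrations beyond the example in the text.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)   # one RGB image, channels-first

# 16 filters, each of size 3 x 3 x 3; stride=1 and padding=0 are the defaults
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3)
out = conv(x)
print(conv.weight.shape)   # [C_out, C_in, k_h, k_w]
print(conv.bias.shape)     # one bias per output channel
print(out.shape)           # [B, C_out, H', W']

# padding=1 preserves the spatial size for a 3x3 kernel; stride=2 halves it here
same = nn.Conv2d(3, 16, kernel_size=3, padding=1)(x)
strided = nn.Conv2d(3, 16, kernel_size=3, padding=1, stride=2)(x)
print(same.shape, strided.shape)
```

In general the output size along one dimension is $\lfloor (H + 2p - k) / s \rfloor + 1$ for padding $p$, kernel size $k$ and stride $s$.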

**MaxPooling 2D (MaxPool2D)** 

**MaxPooling 2D** (MaxPool2D) is a downsampling operation commonly used in convolutional neural networks (CNNs) to reduce spatial dimensions while retaining important features. It operates by sliding a **fixed-size window** (e.g., $2 \times 2$ or $3 \times 3$) over the input feature map and selecting the **maximum value** within each window. This process helps to reduce computation, increase translation invariance, and improve robustness to small spatial variations.  

For an input feature map of size $H \times W \times C$, applying a max-pooling operation with a window of size $k_h \times k_w$ and stride $s$ results in an output feature map of size:

$$
H' = \left\lfloor \frac{H - k_h}{s} \right\rfloor + 1, \quad W' = \left\lfloor \frac{W - k_w}{s} \right\rfloor + 1
$$

*Example*:
- **Input:** $32 \times 32 \times 64$ feature map  
- **MaxPool:** $2 \times 2$ window, stride = 2  
- **Output:** $16 \times 16 \times 64$  

Unlike convolution, max-pooling does not have learnable parameters—it simply selects the strongest activations. This makes it an effective tool for reducing spatial size while preserving dominant features, helping CNNs generalize better for tasks like **image classification and object detection**.
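
A small sketch of the operation on a $4 \times 4$ input whose values are simply 0 to 15, so the selected maxima are easy to verify by eye. The second call uses a hypothetical $33 \times 33$ feature map to show what happens when the window does not tile the input exactly.

```python
import torch
import torch.nn.functional as F

x = torch.arange(16, dtype=torch.float32).reshape(1, 1, 4, 4)
pooled = F.max_pool2d(x, kernel_size=2, stride=2)
print(x[0, 0])
print(pooled[0, 0])   # the maximum of each non-overlapping 2x2 window

# With the default ceil_mode=False, an incomplete window at the border is dropped
y = torch.randn(1, 64, 33, 33)
pooled_y = F.max_pool2d(y, kernel_size=2, stride=2)
print(pooled_y.shape)
```

The channel dimension is untouched: pooling is applied to every channel independently.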



In [None]:
import torch.nn.functional as F

In [None]:
# Convert image and kernel to torch
image_t_cpu = torch.tensor(image, dtype=torch.float32, device="cpu")
log_kernel_s5_t_cpu = torch.tensor(log_kernel_s5, dtype=torch.float32, device="cpu")
log_kernel_s10_t_cpu = torch.tensor(log_kernel_s10, dtype=torch.float32, device="cpu")

# unsqueeze dimensions in the tensor
# image [H, W] -> [B=1, C=1, H, W]
image_t_cpu = image_t_cpu.unsqueeze(0).unsqueeze(0)
print(f"{image_t_cpu.shape=}")

# kernel [KH, KW] -> [C_out=1, C_in=1, KH, KW]
log_kernel_s5_t_cpu = log_kernel_s5_t_cpu.unsqueeze(0).unsqueeze(0)
print(f"{log_kernel_s5_t_cpu.shape=}")

log_kernel_s10_t_cpu = log_kernel_s10_t_cpu.unsqueeze(0).unsqueeze(0)
print(f"{log_kernel_s10_t_cpu.shape=}")

# create GPU version of tensors
image_t_device = image_t_cpu.to(device)
log_kernel_s5_t_device = log_kernel_s5_t_cpu.to(device)
log_kernel_s10_t_device = log_kernel_s10_t_cpu.to(device)

In [None]:
%%time
log5_image_t_cpu = F.conv2d(image_t_cpu, log_kernel_s5_t_cpu, padding=log_kernel_s5_t_cpu.shape[-1]//2)
log10_image_t_cpu = F.conv2d(image_t_cpu, log_kernel_s10_t_cpu, padding=log_kernel_s10_t_cpu.shape[-1]//2)

In [None]:
%%time
log5_image_t_device = F.conv2d(image_t_device, log_kernel_s5_t_device, padding=log_kernel_s5_t_device.shape[-1]//2)
log10_image_t_device = F.conv2d(image_t_device, log_kernel_s10_t_device, padding=log_kernel_s10_t_device.shape[-1]//2)
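# CUDA kernels run asynchronously: wait for them to finish so that %%time is meaningful
if torch.cuda.is_available():
    torch.cuda.synchronize()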

In [None]:
print(f"{log5_image_t_device.shape=}")
print(f"{log10_image_t_device.shape=}")

fig = plt.figure(figsize=(14, 6))

ax = fig.add_subplot(121)
ax.imshow(log5_image_t_device.cpu()[0, 0], cmap='gray')
ax.title.set_text("Convolution of input image with LoG(sigma=5)")

ax = fig.add_subplot(122)
ax.imshow(log10_image_t_device.cpu()[0, 0], cmap='gray')
ax.title.set_text("Convolution of input image with LoG(sigma=10)")

plt.show()

In [None]:
max_pool_out = F.max_pool2d(log5_image_t_device, kernel_size=8, stride=8)

print(f"{max_pool_out.shape=}")
plt.imshow(max_pool_out.cpu()[0, 0], cmap='gray')
plt.title("Max pool with KS=8, stride=8 of LoG of input image")
plt.show()

#### Building CNN in Pytorch

**What CNNs Learn Internally and the Concept of Receptive Field** 

![](images/feat_by_layers.webp) <br>

In a Convolutional Neural Network (CNN), the filters in different layers learn to detect features of increasing complexity. In the **early layers**, filters typically learn to detect **low-level patterns** such as edges, textures, and simple shapes. As the network goes deeper, these filters begin to recognize **mid-level patterns** like corners, contours, and more complex textures. In the **final layers**, the network learns **high-level semantic features**, such as object parts or entire objects, which are crucial for classification and detection tasks.  

A key concept in CNNs is the **receptive field**, which refers to the region of the input image that influences a particular neuron in a feature map. In the **earlier layers**, the receptive field is small, meaning each neuron captures only local patterns. As we move deeper into the network, stacking convolution and pooling layers increases the **effective receptive field**, allowing neurons to capture larger spatial relationships. This hierarchical feature extraction enables CNNs to recognize objects at different scales and positions, making them highly effective for visual tasks. 
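
The growth of the receptive field can be computed with a short recurrence: each layer adds `(kernel_size - 1)` times the current spacing between neighbouring outputs, and that spacing is multiplied by the layer's stride. The helper below is a sketch written for this illustration; it assumes square kernels and no dilation.

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) pairs, ordered from input to output."""
    rf, jump = 1, 1   # receptive field size and spacing between neighbouring outputs, in input pixels
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf

rf_one_conv = receptive_field([(3, 1)])                     # one 3x3 convolution
rf_two_convs = receptive_field([(3, 1), (3, 1)])            # two stacked 3x3 convolutions
rf_with_pool = receptive_field([(3, 1), (2, 2), (3, 1)])    # conv, 2x2 pooling with stride 2, conv
print(rf_one_conv, rf_two_convs, rf_with_pool)
```

Two stacked $3 \times 3$ convolutions see a $5 \times 5$ region, and inserting a stride-2 pooling layer makes every later layer grow the receptive field twice as fast.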


**Translation Equivariance in Convolution**

![](images/equivariance.webp) <br>

One of the fundamental properties of convolution is **translation equivariance**. This means that if an object in the input image shifts spatially, the output feature map shifts accordingly but remains otherwise unchanged. Mathematically, if $f(X)$ represents a convolution operation on an input image $X$, and $T$ is a translation operator, then:

$$
f(T(X)) = T(f(X))
$$
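
The identity can be verified numerically. The sketch below makes two simplifying assumptions so that it holds exactly: the translation is a cyclic shift (`torch.roll`), and the convolution uses wrap-around (`circular`) padding. The input, kernel and shift amounts are arbitrary.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(1, 1, 8, 8)
kernel = torch.randn(1, 1, 3, 3)

def conv_circular(inp):
    # Wrap-around padding, so that a cyclic shift is an exact translation
    padded = F.pad(inp, (1, 1, 1, 1), mode="circular")
    return F.conv2d(padded, kernel)

def shift(inp):
    # T: move the content 2 pixels down and 3 pixels right (cyclically)
    return torch.roll(inp, shifts=(2, 3), dims=(2, 3))

lhs = conv_circular(shift(x))   # f(T(X))
rhs = shift(conv_circular(x))   # T(f(X))
is_equivariant = torch.allclose(lhs, rhs, atol=1e-5)
print(is_equivariant)
```

With the more common zero padding the equality still holds in the interior of the feature map, but values near the border differ because content shifted out of the image is lost.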



In [None]:
import torch.nn as nn

from torchview import draw_graph

import graphviz
graphviz.set_jupyter_format('png')

**AlexNet**, introduced by **Krizhevsky et al. (2012)**, was a groundbreaking deep learning model that won the **ImageNet Large Scale Visual Recognition Challenge (ILSVRC-2012)** with a **top-5 error rate of 15.3%**, significantly outperforming traditional computer vision methods. It was one of the first deep **Convolutional Neural Networks (CNNs)** to demonstrate the power of deep learning in large-scale image classification.  

AlexNet consists of **eight layers**: five **convolutional layers** followed by three **fully connected layers**. The architecture uses **ReLU activations** instead of the traditional sigmoid/tanh functions, which speeds up training, and applies **dropout** in the fully connected layers to reduce overfitting. To train efficiently on a large dataset, AlexNet used **GPU acceleration** with two parallel NVIDIA GTX 580 GPUs. Additionally, **overlapping max-pooling** was used to enhance translation invariance, and **data augmentation** (such as random cropping and flipping) was applied to improve generalization.  

AlexNet's success marked the beginning of the **deep learning revolution**, inspiring later architectures such as **VGG, ResNet, and DenseNet**. Despite being relatively simple by modern standards, it remains a foundational model in computer vision.

![](images/alexnet.png) <br>
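
To see where the size of the first fully connected layer comes from, the spatial resolution can be traced through the network with the output-size formula $\lfloor (H + 2p - k) / s \rfloor + 1$. The sketch below uses plain Python and takes its layer hyperparameters from the `AlexNet` class defined in this notebook (with a $227 \times 227$ input and 10 classes). It also counts weights and biases per layer.

```python
def conv_out(size, kernel, stride=1, padding=0):
    return (size + 2 * padding - kernel) // stride + 1

# (name, kind, C_in, C_out, kernel, stride, padding)
layers = [
    ("conv1", "conv", 3, 96, 11, 4, 0),
    ("pool1", "pool", 96, 96, 3, 2, 0),
    ("conv2", "conv", 96, 256, 5, 1, 2),
    ("pool2", "pool", 256, 256, 3, 2, 0),
    ("conv3", "conv", 256, 384, 3, 1, 1),
    ("conv4", "conv", 384, 384, 3, 1, 1),
    ("conv5", "conv", 384, 256, 3, 1, 1),
    ("pool5", "pool", 256, 256, 3, 2, 0),
]

size, total_params = 227, 0
for name, kind, c_in, c_out, k, s, p in layers:
    size = conv_out(size, k, s, p)
    # Each filter has k*k*C_in weights plus one bias; pooling has no parameters
    params = (k * k * c_in + 1) * c_out if kind == "conv" else 0
    total_params += params
    print(f"{name}: {c_out} x {size} x {size}, params={params}")

flat_features = 256 * size * size
for n_in, n_out in [(flat_features, 4096), (4096, 4096), (4096, 10)]:
    total_params += (n_in + 1) * n_out   # weights plus biases

print(f"{flat_features=}", f"{total_params=}")
```

Most of the parameters sit in the fully connected layers, which is one reason later architectures reduced or removed them.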


In [None]:
class AlexNet(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        
        # Convolutional layers
        self.conv1 = nn.Conv2d(3, 96, kernel_size=11, stride=4, padding=0)
        self.pool1 = nn.MaxPool2d(kernel_size=3, stride=2)
        
        self.conv2 = nn.Conv2d(96, 256, kernel_size=5, padding=2)
        self.pool2 = nn.MaxPool2d(kernel_size=3, stride=2)
        
        self.conv3 = nn.Conv2d(256, 384, kernel_size=3, padding=1)        
        self.conv4 = nn.Conv2d(384, 384, kernel_size=3, padding=1)        
        self.conv5 = nn.Conv2d(384, 256, kernel_size=3, padding=1)
        self.pool5 = nn.MaxPool2d(kernel_size=3, stride=2)
        
        
        # Activation function (stateless, so one instance is shared by all layers)
        self.relu = nn.ReLU(inplace=True)

        # Fully connected layers
        self.fc1 = nn.Linear(256 * 6 * 6, 4096)
        self.fc2 = nn.Linear(4096, 4096)
        self.fc3 = nn.Linear(4096, num_classes)

        # Dropout layers
        self.dropout = nn.Dropout(p=0.6)

    def forward(self, x):
        x = self.relu(self.conv1(x))
        x = self.pool1(x)

        x = self.relu(self.conv2(x))
        x = self.pool2(x)

        x = self.relu(self.conv3(x))
        x = self.relu(self.conv4(x))
        x = self.relu(self.conv5(x))
        x = self.pool5(x)

        x = torch.flatten(x, 1)  # Flatten from (B, 256, 6, 6) to (B, 256*6*6)

        x = self.dropout(self.relu(self.fc1(x)))
        x = self.dropout(self.relu(self.fc2(x)))
        x = self.fc3(x)

        return x

# Example usage
model = AlexNet(num_classes=10)
print(model)

In [None]:
num_model_params = sum(p.numel() for p in model.parameters())
print(f"{num_model_params=}")

model_graph = draw_graph(model, input_size=(1, 3, 227, 227), expand_nested=True, device="meta")
model_graph.visual_graph

### Image Classification

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import DataLoader
from tqdm import tqdm

#### Setting up CNN training pipeline

In [None]:
# Load CIFAR-10 dataset
transform = transforms.Compose([
    transforms.Resize(227),  # Resize for AlexNet
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])

trainset = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
testset = torchvision.datasets.CIFAR10(root='./data', train=False, download=True, transform=transform)

trainloader = DataLoader(trainset, batch_size=64, shuffle=True, num_workers=2)
testloader = DataLoader(testset, batch_size=64, shuffle=False, num_workers=2)


# Define CIFAR-10 class names
classes = ('airplane', 'automobile', 'bird', 'cat', 'deer',
           'dog', 'frog', 'horse', 'ship', 'truck')

# Get a batch of training data
data_iter = iter(trainloader)
images, labels = next(data_iter)

# Denormalize images for correct visualization
def denormalize(img):
    img = img * 0.5 + 0.5  # Reverse normalization
    return img.clamp(0, 1)  # Ensure values are in valid range

# Plot images
fig, axes = plt.subplots(2, 5, figsize=(10, 5))  # 2 rows, 5 columns
axes = axes.flatten()

for i in range(10):
    img = denormalize(images[i]) 
    img = img.permute(1, 2, 0)
    
    axes[i].imshow(img)
    axes[i].set_title(classes[labels[i].item()])
    axes[i].axis("off")

plt.tight_layout()
plt.show()

In [None]:
# Define training parameters
model = AlexNet(num_classes=10).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
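
A note on the loss: `nn.CrossEntropyLoss` expects raw, unnormalized scores (logits) and integer class indices, which is why the model's `forward` returns the output of the last linear layer without a softmax. The sketch below uses random logits and arbitrary targets and shows the equivalent manual computation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10)            # raw model outputs: batch of 4, 10 classes
targets = torch.tensor([3, 0, 9, 1])   # integer class indices, not one-hot vectors

loss = nn.CrossEntropyLoss()(logits, targets)

# Equivalent: log-softmax, then the negative log-probability of the true class, averaged over the batch
log_probs = F.log_softmax(logits, dim=1)
manual = -log_probs[torch.arange(4), targets].mean()

print(loss.item(), manual.item())
```

Applying a softmax inside the model and then passing the result to `nn.CrossEntropyLoss` would normalize twice and distort the gradients.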

In [None]:
# Training loop
epochs = 10
for epoch in range(epochs):
    model.train()
    running_loss = 0.0
    progress_bar = tqdm(trainloader, desc=f"Epoch {epoch+1}/{epochs}", leave=False)
    
    for images, labels in progress_bar:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
        progress_bar.set_postfix(loss=f"{loss.item():.4f}")
    
    print(f"Epoch {epoch+1}/{epochs}, Loss: {running_loss/len(trainloader):.4f}")
    
    # Evaluate model after each epoch
    model.eval()
    correct = 0
    total = 0
    with torch.no_grad():
        for images, labels in testloader:
            images, labels = images.to(device), labels.to(device)
            outputs = model(images)
            _, predicted = torch.max(outputs, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
    
    accuracy = 100 * correct / total
    print(f'Test Accuracy after epoch {epoch+1}: {accuracy:.2f}%')

#### Evalution and inference of CNN model

In [None]:
# Get a batch of test data
data_iter = iter(testloader)
images, labels = next(data_iter)

# Move images to the same device as the model
images, labels = images.to(device), labels.to(device)

# Make predictions
model.eval()  
with torch.no_grad():
    outputs = model(images) 
    _, predicted = torch.max(outputs, 1) 

fig, axes = plt.subplots(2, 5, figsize=(10, 5))  # 2 rows, 5 columns
axes = axes.flatten()

for i in range(10):
    img = denormalize(images[i].cpu()) 
    img = img.permute(1, 2, 0)  # Convert from (C, H, W) to (H, W, C)

    true_label = classes[labels[i].item()]
    pred_label = classes[predicted[i].item()]
    
    axes[i].imshow(img)
    axes[i].set_title(f"True: {true_label}\nPred: {pred_label}", fontsize=10, color="green" if true_label == pred_label else "red")
    axes[i].axis("off")

plt.tight_layout()
plt.show()
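A grid of ten images gives only an anecdotal view of the model. A confusion matrix summarizes which classes get mixed up across many samples. The block below is a minimal sketch of that idea, not part of the lecture's pipeline: `confusion_matrix` is a hypothetical helper, and the labels are made up for illustration. In practice, `demo_true` and `demo_pred` would be replaced by the labels and predictions collected over `testloader`.

```python
import torch


def confusion_matrix(y_true: torch.Tensor, y_pred: torch.Tensor, num_classes: int) -> torch.Tensor:
    """Hypothetical helper: rows are true labels, columns are predicted labels."""
    # Encode each (true, predicted) pair as a single index, then count occurrences
    flat_index = y_true * num_classes + y_pred
    counts = torch.bincount(flat_index, minlength=num_classes * num_classes)
    return counts.reshape(num_classes, num_classes)


# Made-up labels for illustration only (3 classes, 6 samples)
demo_true = torch.tensor([0, 0, 1, 1, 2, 2])
demo_pred = torch.tensor([0, 1, 1, 1, 2, 0])

demo_cm = confusion_matrix(demo_true, demo_pred, num_classes=3)

# Per-class accuracy: correct predictions (diagonal) divided by samples per true class
per_class_acc = demo_cm.diagonal().float() / demo_cm.sum(dim=1).clamp(min=1).float()

print(demo_cm)
print(per_class_acc)
```

Off-diagonal entries show which pairs of classes are confused, which is often more informative than a single accuracy number.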


**How to improve the model**: 
- investigate whether the model overfits (for example, the training loss keeps falling while the test accuracy stalls) and counter it with [regularization](https://medium.com/analytics-vidhya/understanding-regularization-with-pytorch-26a838d94058)
- add [BatchNorm layers](https://towardsdatascience.com/batch-norm-explained-visually-how-it-works-and-why-neural-networks-need-it-b18919692739/)
- initialize from a pretrained model: explore [Transfer Learning](https://towardsdatascience.com/transfer-learning-for-beginner-9b59490d1b9d/)
- add data [augmentations](https://towardsdatascience.com/complete-guide-to-data-augmentation-for-computer-vision-1abe4063ad07/)
- explore other architectures: [ResNet](https://towardsdatascience.com/resnets-why-do-they-perform-better-than-classic-convnets-conceptual-analysis-6a9c82e06e53/), [MobileNet](https://medium.com/towards-data-science/understanding-depthwise-separable-convolutions-and-the-efficiency-of-mobilenets-6de3d6b62503), [ShuffleNet](https://medium.com/towards-data-science/review-shufflenet-v1-light-weight-model-image-classification-5b253dfe982f)
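Two of the ideas above, BatchNorm and regularization, can be sketched in a few lines. The block below is an illustration under stated assumptions, not the lecture's model: `SmallBNNet` is a hypothetical toy network (much smaller than AlexNet), and the dropout probability and `weight_decay` value are arbitrary placeholders rather than tuned settings.

```python
import torch
import torch.nn as nn
import torch.optim as optim


class SmallBNNet(nn.Module):
    """Hypothetical toy CNN (not the lecture's AlexNet) with BatchNorm and Dropout."""

    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            # bias=False because BatchNorm has its own learnable shift
            nn.Conv2d(3, 16, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(16),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(32),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),  # global average pooling to (N, 32, 1, 1)
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(p=0.5),  # regularization: randomly zero activations during training
            nn.Linear(32, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))


sketch_model = SmallBNNet(num_classes=10)

# Weight decay is another common form of regularization (value is a placeholder)
sketch_optimizer = optim.AdamW(sketch_model.parameters(), lr=1e-3, weight_decay=1e-4)

# Sanity check on a random batch; eval() makes BatchNorm use running statistics
# and disables Dropout
sketch_model.eval()
with torch.no_grad():
    sketch_logits = sketch_model(torch.randn(4, 3, 32, 32))
print(sketch_logits.shape)
```

The names are prefixed with `sketch_` so that running this cell does not overwrite the `model` and `optimizer` trained earlier.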


## Conclusions 

In this lecture, we explored the fundamentals of **PyTorch** and its application in deep learning, particularly in **image classification** using Convolutional Neural Networks (CNNs). We covered the key features of PyTorch, including its **dynamic computation graph and autograd system**, which facilitate efficient gradient computation for optimization. We also discussed **CNN architectures**, focusing on **convolutional operations, pooling, and hierarchical feature learning**, which allow networks to extract meaningful patterns from images. Additionally, we examined **AlexNet**, one of the pioneering deep learning models, and demonstrated how to preprocess and visualize the **CIFAR-10 dataset** for training and evaluation.  

Through practical examples, we implemented a **complete image classification pipeline**, including **data loading, model training, and evaluation**. We also visualized the model's predictions to gain insight into its performance. Understanding how CNNs learn features at different depths, and the role the **receptive field** plays in this, is crucial for designing more advanced architectures. Moving forward, students can explore **modern architectures like ResNet, transfer learning, and fine-tuning techniques** to improve classification accuracy on more complex datasets. Mastering these concepts provides a strong foundation for tackling real-world computer vision problems with deep learning.