<a href="https://colab.research.google.com/github/aaubs/ds-master/blob/main/notebooks/M3_4_CNN_lecture.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Understanding Convolutional Neural Networks (CNNs)


Convolutional Neural Networks (CNNs) are a class of deep learning models specifically designed for tasks related to computer vision, such as image classification, object detection, and image segmentation. CNNs have proven to be highly effective in these tasks because they can automatically learn hierarchical features from raw pixel data. Let's dive into the key components of CNNs:



In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision
import torchvision.transforms as transforms
import matplotlib.pyplot as plt
import numpy as np

# Image Embeddings

Image embeddings are a representation of images in the form of high-dimensional vectors of numerical values. This technique transforms raw images into a form that a computer can understand and process. The goal of creating image embeddings is to capture the essential features of an image, such as shapes, colors, textures, or any other relevant visual information, in a compact numerical format.

In [1]:
%%html

<iframe width="428" height="761" src="https://www.youtube.com/embed/TJOLwBGq9bU" title="Leo Messi Dice Art Timelapse" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>

Image embeddings generated by models, especially those involving deep learning techniques like CNNs, are adept at capturing a wide range of information from images, including edges, corners, light intensity, textures, and other visual features. This capability stems from the hierarchical nature of how these models process image data.

In the early layers of a CNN, the model learns to identify simple features such as edges and corners. These are basic patterns that are fundamental to the structure of objects within images. As the information passes through subsequent layers, the network combines these simple features to identify more complex patterns, such as textures, shapes, and eventually, entire objects or scenes.
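
To make the idea of an embedding concrete, here is a minimal sketch that passes one image-sized tensor through a small convolutional encoder and reads out a fixed-length vector. The `encoder` name, the random input, and the choice of an 84-dimensional output are illustrative assumptions, and the network is untrained, so the resulting numbers carry no meaning yet; only the shape of the computation is the point.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical encoder: convolution + pooling stages followed by one dense layer.
encoder = nn.Sequential(
    nn.Conv2d(3, 6, 5), nn.ReLU(), nn.MaxPool2d(2, 2),
    nn.Conv2d(6, 16, 5), nn.ReLU(), nn.MaxPool2d(2, 2),
    nn.Flatten(),
    nn.Linear(16 * 5 * 5, 84),
)

# Random stand-in for one 32x32 RGB image, with a leading batch dimension.
image = torch.rand(1, 3, 32, 32)

with torch.no_grad():
    embedding = encoder(image)

print(embedding.shape)  # one 84-dimensional vector per image
```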

![](https://cdn.analyticsvidhya.com/wp-content/uploads/2021/03/Screenshot-from-2021-03-16-10-58-29.png)

`transforms.Normalize(mean, std)`: After the image has been converted to a tensor (with values in [0, 1]), this transformation subtracts the given mean from each channel and divides by the given standard deviation. In the code snippet, the mean is (0.5, 0.5, 0.5) and the standard deviation is (0.5, 0.5, 0.5) for the (R, G, B) channels, so the pixel values are rescaled from [0, 1] to [-1, 1]. Centering the inputs around zero in this way generally helps training. Note that the result only has a mean of exactly 0 and a standard deviation of exactly 1 if the dataset's actual per-channel statistics are passed in.
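
The per-channel arithmetic can be checked by hand. The sketch below applies `(x - mean) / std` directly with plain tensor operations instead of calling `transforms.Normalize`; the tiny 2x2 "image" is a made-up example whose values span the [0, 1] range produced by `ToTensor`.

```python
import torch

# Made-up 3-channel, 2x2 image with values in [0, 1].
image = torch.tensor([0.0, 0.25, 0.5, 1.0]).repeat(3, 1).reshape(3, 2, 2)

# One mean and one std per channel, shaped so they broadcast over height and width.
mean = torch.tensor([0.5, 0.5, 0.5]).view(3, 1, 1)
std = torch.tensor([0.5, 0.5, 0.5]).view(3, 1, 1)

# The same per-channel operation that Normalize performs.
normalized = (image - mean) / std

print(normalized[0])
print(normalized.min().item(), normalized.max().item())
```

With these settings, 0 maps to -1, 0.5 maps to 0, and 1 maps to 1.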

![](https://cdn.analyticsvidhya.com/wp-content/uploads/2021/03/Screenshot-from-2021-03-16-11-01-53.png)

![](https://cdn.analyticsvidhya.com/wp-content/uploads/2021/03/Screenshot-from-2021-03-16-11-03-58.png)

In [None]:
# Hyper-parameters of the model
num_epochs = 4
batch_size = 16
learning_rate = 0.003

In [None]:
# We transform them to Tensors of normalized range [-1, 1]
transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])

# CIFAR-10 Dataset Overview

The CIFAR-10 dataset is a widely used dataset in the field of computer vision and machine learning. It is commonly used for tasks such as image classification and object recognition. Here are some key details about the CIFAR-10 dataset:

## Dataset Details

- **Number of Classes:** CIFAR-10 consists of 10 classes or categories, each representing a different object or category commonly found in everyday life.

- **Images:** The dataset contains a total of 60,000 images, with 6,000 images per class. These images are divided into two subsets:
  - **Training Set:** This subset contains 50,000 images, with 5,000 images per class. It is used for training machine learning models.
  - **Test Set:** The remaining 10,000 images, with 1,000 images per class, make up the test set. This set is used for evaluating the performance of trained models.

- **Image Size:** Each image in the CIFAR-10 dataset is a color image with dimensions 32 pixels in height and 32 pixels in width. Therefore, the images have a resolution of 32x32 pixels.

![](https://sichkar-valentyn.github.io/cifar10/images/CIFAR-10_examples.png)

In [None]:
#DOWNLOAD DATASET
train_dataset = torchvision.datasets.CIFAR10(root = './data_CIFAR10', train=True,
                                            download=True, transform=transform)


test_dataset = torchvision.datasets.CIFAR10(root = './data_CIFAR10', train=False,
                                            download=True, transform=transform)


Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to ./data_CIFAR10/cifar-10-python.tar.gz


100%|██████████| 170498071/170498071 [00:01<00:00, 98177520.73it/s] 


Extracting ./data_CIFAR10/cifar-10-python.tar.gz to ./data_CIFAR10
Files already downloaded and verified


In [None]:
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size = batch_size, shuffle=True)

test_loader = torch.utils.data.DataLoader(test_dataset, batch_size = batch_size, shuffle=False)

classes = ('plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck')
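
As a quick sanity check on the loader settings, the number of batches per epoch follows from the dataset size and the batch size. The sketch below uses a small stand-in `TensorDataset` with 50,000 dummy samples rather than the real CIFAR-10 images, so it runs without downloading anything; the names `dummy` and `loader` are illustrative.

```python
import math

import torch
from torch.utils.data import DataLoader, TensorDataset

num_samples, batch_size = 50_000, 16

# Stand-in dataset: one dummy feature and one dummy label per sample.
dummy = TensorDataset(
    torch.zeros(num_samples, 1),
    torch.zeros(num_samples, dtype=torch.long),
)
loader = DataLoader(dummy, batch_size=batch_size, shuffle=True)

# len(loader) is the number of batches, i.e. training steps per epoch.
steps_per_epoch = len(loader)
print(steps_per_epoch, math.ceil(num_samples / batch_size))

# Each iteration yields one batch of features and labels.
features, labels = next(iter(loader))
print(features.shape, labels.shape)
```

With 50,000 training images and a batch size of 16, this gives 3125 steps per epoch.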

## Main components in the architecture of a CNN model

![](https://editor.analyticsvidhya.com/uploads/34881cnn_architecture_1.png)

Below are the primary components of a Convolutional Neural Network (CNN) as applied to the CIFAR-10 image classification task:

## 1. Convolutional Layer

Convolutional Layers are fundamental for extracting meaningful features from images in the CIFAR-10 dataset. They employ filters that slide over the image grid to identify relevant patterns. Let's delve into this process:

![Convolutional Layer](https://editor.analyticsvidhya.com/uploads/36813convolution_overview.gif)

In this illustration, a window, represented by a convolutional filter, moves across the entire image.

![Convolutional Layer](https://editor.analyticsvidhya.com/uploads/89792convolution_example.png)

During this operation, each filter element multiplies with the corresponding image element, and the products are summed to generate a single value in the output feature map. This process continues until the entire input feature map is covered, resulting in the populated output feature map.
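
The multiply-and-sum step can be written out explicitly. The following sketch slides a 3x3 filter over a 5x5 input with two loops and compares the result with `F.conv2d`; the input values and the vertical-edge filter are made-up examples.

```python
import torch
import torch.nn.functional as F

# Made-up 5x5 single-channel input and a 3x3 vertical-edge filter.
image = torch.arange(25, dtype=torch.float32).reshape(5, 5)
kernel = torch.tensor([[1.0, 0.0, -1.0],
                       [1.0, 0.0, -1.0],
                       [1.0, 0.0, -1.0]])

# Without padding and with stride 1, the output is (5 - 3 + 1) = 3 per side.
out_size = image.shape[0] - kernel.shape[0] + 1
manual = torch.zeros(out_size, out_size)
for r in range(out_size):
    for c in range(out_size):
        window = image[r:r + 3, c:c + 3]        # patch under the filter
        manual[r, c] = (window * kernel).sum()  # element-wise product, then sum

# The built-in layer expects [batch, channels, height, width] tensors.
builtin = F.conv2d(image.view(1, 1, 5, 5), kernel.view(1, 1, 3, 3)).view(3, 3)

print(manual)
print(torch.allclose(manual, builtin))
```

Because the values in this input increase by 1 from left to right, every position yields the same response of -6.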

![Convolutional Layer](https://editor.analyticsvidhya.com/uploads/578272021-07-20%2023_09_31-ML%20Practicum_%20Image%20Classification%20%C2%A0_%C2%A0%20Google%20Developers.png)

## 2. Pooling Layer

Pooling Layers are utilized to reduce the dimensionality of feature maps. Here, a 2x2 max-pooling layer is used. As the window traverses the image, it selects the maximum value within the window:

![Maxpool Layer](https://editor.analyticsvidhya.com/uploads/66402maxpool_animation.gif)

Following the max-pooling operation, the input's dimension, originally 4x4, is downsized to 2x2. This reduction lowers the amount of computation in later layers and makes the detected features more robust to small shifts in the input.
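
A short sketch of the 4x4 to 2x2 reduction, using `nn.MaxPool2d` on a made-up feature map:

```python
import torch
import torch.nn as nn

# Made-up 4x4 feature map.
feature_map = torch.tensor([[1.0, 3.0, 2.0, 4.0],
                            [5.0, 6.0, 1.0, 2.0],
                            [7.0, 2.0, 9.0, 0.0],
                            [1.0, 8.0, 3.0, 4.0]])

# 2x2 window moved with stride 2, so the windows do not overlap.
pool = nn.MaxPool2d(kernel_size=2, stride=2)

# The layer expects [batch, channels, height, width].
pooled = pool(feature_map.view(1, 1, 4, 4)).view(2, 2)
print(pooled)
```

Each output value is the maximum of one non-overlapping 2x2 block: 6 and 4 for the top row, 8 and 9 for the bottom row.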

## 3. Fully Connected Layer

The Fully Connected Layer serves as the final section of the CNN architecture. It receives the features extracted by the convolutional and pooling layers, flattened into a vector, and forwards them through one or more dense layers to the output layer. The output layer produces one score (logit) per class; applying softmax turns these scores into probabilities over the ten predefined classes. The predicted class is the one with the highest score, and therefore the highest probability.
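
The step from output scores to a predicted class can be sketched as follows. The `logits` values are invented for illustration; in practice they would come from the last linear layer of the network.

```python
import torch
import torch.nn.functional as F

classes = ('plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck')

# Invented output scores for one image (batch of size 1, ten classes).
logits = torch.tensor([[0.1, 2.5, -1.0, 0.3, 0.0, -0.5, 0.2, 0.4, 1.0, 0.8]])

# Softmax turns the scores into probabilities that sum to 1.
probs = F.softmax(logits, dim=1)

# The predicted class is the index with the highest probability.
predicted = probs.argmax(dim=1).item()
print(classes[predicted], round(probs[0, predicted].item(), 3))
```

Softmax preserves the ordering of the scores, so taking the argmax of the raw logits gives the same class.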

In summary, this explanation provides an overview of the primary components of a CNN architecture when applied to the CIFAR-10 dataset. Convolutional Layers extract features, Pooling Layers reduce dimensionality, and Fully Connected Layers make the final class predictions. This understanding forms the foundation for implementing the CNN architecture in code.



In [None]:
import torch.nn as nn

# 1. Creating a Neural Network
# Define the ConvNet using nn.Sequential
conv_net = nn.Sequential(
    nn.Conv2d(3, 6, 5),  # Convolutional layer 1: 3 input channels, 6 filters, 5x5 filter size
    nn.ReLU(),          # ReLU activation function
    nn.MaxPool2d(2, 2), # Max pooling layer: size 2x2, stride 2

    nn.Conv2d(6, 16, 5), # Convolutional layer 2: 6 input channels, 16 filters, 5x5 filter size
    nn.ReLU(),          # ReLU activation function
    nn.MaxPool2d(2, 2), # Max pooling layer: size 2x2, stride 2

    nn.Flatten(),       # Flatten the output for fully connected layers

    nn.Linear(16*5*5, 120), # Fully connected layer 1
    nn.ReLU(),              # ReLU activation function

    nn.Linear(120, 84),     # Fully connected layer 2
    nn.ReLU(),              # ReLU activation function

    nn.Linear(84, 10)       # Fully connected layer 3 (output layer)
)

# Define the device (CPU or GPU)
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Create an instance of the ConvNet model and move it to the specified device
model = conv_net.to(device)
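
The `16*5*5` input size of the first linear layer follows from how each layer changes the spatial size of a 32x32 image. The sketch below traces this with the standard output-size formula and then confirms it on a dummy tensor; the helper `conv_out` and the `features` name are introduced here for illustration only.

```python
import torch
import torch.nn as nn


def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


size = 32
size = conv_out(size, kernel=5)            # Conv2d(3, 6, 5):   32 -> 28
size = conv_out(size, kernel=2, stride=2)  # MaxPool2d(2, 2):   28 -> 14
size = conv_out(size, kernel=5)            # Conv2d(6, 16, 5):  14 -> 10
size = conv_out(size, kernel=2, stride=2)  # MaxPool2d(2, 2):   10 -> 5
print(size, 16 * size * size)

# Same convolution and pooling stack as above, applied to a dummy batch.
features = nn.Sequential(
    nn.Conv2d(3, 6, 5), nn.ReLU(), nn.MaxPool2d(2, 2),
    nn.Conv2d(6, 16, 5), nn.ReLU(), nn.MaxPool2d(2, 2),
)
out = features(torch.zeros(1, 3, 32, 32))
print(out.shape)
```

The final feature map has 16 channels of size 5x5, which flattens to 400 values per image.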

In [None]:
from PIL import Image
import torchvision.transforms as transforms

# Load the image (the path assumes car.jpg has been uploaded to the Colab session)
image_path = '/content/car.jpg'
image = Image.open(image_path).convert('RGB')  # Ensure 3 channels (e.g. drops an alpha channel)

# Same normalization as the training data, plus a resize to the 32x32 CIFAR-10 input size
transform = transforms.Compose([
    transforms.Resize((32, 32)),  # Resize the image to 32x32 pixels
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))  # Scale pixel values to [-1, 1]
])

# Apply the transform and move the tensor to the same device as the model
image_transformed = transform(image).float().to(device)

# Add a batch dimension, since the model expects input of shape [batch, channels, height, width]
image_transformed = image_transformed.unsqueeze(0)

# Check the image tensor shape to ensure it's ready for the model (should be [1, 3, 32, 32])
print(image_transformed.shape)

torch.Size([1, 3, 32, 32])


In [None]:
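# In notebook order, `model` has not been trained yet at this point,
# so this first prediction is not meaningful. Ensure it's in eval mode
model.eval()

# Make a prediction (no gradients needed; the input must be on the same device as the model)
with torch.no_grad():
    output = model(image_transformed.to(device))

# Get the index of the highest score in the output
_, predicted = torch.max(output, 1)

print(f'Predicted Class: {classes[predicted.item()]}')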

Predicted Class: frog


Note: the first argument of `nn.Conv2d` is the number of channels in the input images (1 for grayscale or 3 for RGB).

In [None]:
criterion = nn.CrossEntropyLoss()  # Applies log-softmax internally, so the model should output raw logits
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
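
The statement that `CrossEntropyLoss` already includes softmax can be verified numerically. The sketch below compares it with an explicit log-softmax followed by the negative log-likelihood loss; the random `logits` and the `labels` are made-up inputs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Made-up raw scores for 4 samples and 10 classes, plus their true labels.
logits = torch.randn(4, 10)
labels = torch.tensor([1, 0, 9, 3])

# Built-in criterion applied directly to the raw scores.
loss_ce = nn.CrossEntropyLoss()(logits, labels)

# Same computation in two explicit steps: log-softmax, then negative log-likelihood.
loss_manual = F.nll_loss(F.log_softmax(logits, dim=1), labels)

print(loss_ce.item(), loss_manual.item())
```

Both values agree, which is why the network above ends with a plain `nn.Linear` layer and no softmax.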

In [None]:
n_total_steps = len(train_loader)
for epoch in range(num_epochs):
    for i, (images, labels) in enumerate(train_loader):
        # Move the batch to the same device as the model
        images = images.to(device)
        labels = labels.to(device)

        # 2. Forward pass
        outputs = model(images)

        # 3. Compute the loss
        loss = criterion(outputs, labels)

        # 4. Backward pass: reset the old gradients, then backpropagate to compute new ones
        optimizer.zero_grad()
        loss.backward()

        # 5. Update the weights
        optimizer.step()
        # wandb.log({"loss": loss.item()})  # Uncomment to log to Weights & Biases (requires wandb to be imported and initialized)


        if (i+1) % 2000 == 0:
            print (f'Epoch [{epoch+1}/{num_epochs}], Step [{i+1}/{n_total_steps}], Loss: {loss.item():.4f}')
            # Evaluate accuracy on the test set every 2000 steps
            model.eval()
            with torch.no_grad():
                correct = 0
                total = 0
                # Separate names avoid overwriting the current training batch
                for test_images, test_labels in test_loader:
                    test_images = test_images.to(device)
                    test_labels = test_labels.to(device)
                    test_outputs = model(test_images)
                    _, predicted = torch.max(test_outputs, 1)
                    total += test_labels.size(0)
                    correct += (predicted == test_labels).sum().item()
                accuracy = correct / total
                # wandb.log({"accuracy": accuracy})  # Uncomment to log to Weights & Biases
            model.train()  # Switch back to training mode before the next step

print('Finished Training')
PATH = './cnn.pth'
torch.save(model.state_dict(), PATH)

Epoch [1/4], Step [2000/3125], Loss: 2.2898
Epoch [2/4], Step [2000/3125], Loss: 1.9791
Epoch [3/4], Step [2000/3125], Loss: 1.3779
Epoch [4/4], Step [2000/3125], Loss: 1.6012
Finished Training
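
The saved `state_dict` can later be loaded into a freshly built model of the same architecture. The sketch below shows the round trip with a deliberately tiny stand-in network and a temporary file; `make_model` is a hypothetical helper, and in the notebook the same `nn.Sequential` definition as `conv_net` would be rebuilt instead.

```python
import os
import tempfile

import torch
import torch.nn as nn


def make_model():
    # Tiny stand-in architecture; both models must share the same layers.
    return nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))


torch.manual_seed(0)
trained = make_model()

# Save only the parameters, as in the training cell above.
path = os.path.join(tempfile.mkdtemp(), "cnn.pth")
torch.save(trained.state_dict(), path)

# Rebuild the architecture, then load the saved parameters into it.
restored = make_model()
restored.load_state_dict(torch.load(path, map_location="cpu"))
restored.eval()

# Both models should now produce identical outputs for the same input.
x = torch.randn(1, 3, 32, 32)
with torch.no_grad():
    same = torch.allclose(trained(x), restored(x))
print(same)
```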


In [None]:
model.eval()  # Evaluation mode for the final pass over the test set
with torch.no_grad():
    n_correct = 0
    n_samples = 0
    n_class_correct = [0 for i in range(10)]
    n_class_samples = [0 for i in range(10)]
    for images, labels in test_loader:
        images = images.to(device)
        labels = labels.to(device)
        outputs = model(images)
        # max returns (value ,index)
        _, predicted = torch.max(outputs, 1)
        n_samples += labels.size(0)
        n_correct += (predicted == labels).sum().item()

        for i in range(labels.size(0)):  # Use labels.size(0) to get the batch size
            label = labels[i].item()  # Extract the actual label value using .item()
            pred = predicted[i].item()  # Extract the predicted label value using .item()
            if (label == pred):
                n_class_correct[label] += 1
            n_class_samples[label] += 1

    acc = 100.0 * n_correct / n_samples
    print(f'Accuracy of the network: {acc} %')

    for i in range(10):
        if n_class_samples[i] == 0:
            acc = 0.0  # To avoid division by zero
        else:
            acc = 100.0 * n_class_correct[i] / n_class_samples[i]
        print(f'Accuracy of {classes[i]}: {acc} %')

Accuracy of the network: 41.48 %
Accuracy of plane: 44.1 %
Accuracy of car: 83.1 %
Accuracy of bird: 11.0 %
Accuracy of cat: 30.5 %
Accuracy of deer: 36.4 %
Accuracy of dog: 45.3 %
Accuracy of frog: 55.4 %
Accuracy of horse: 48.9 %
Accuracy of ship: 31.9 %
Accuracy of truck: 28.2 %
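
The per-class counts can also be computed without the inner Python loop. This is a sketch of one vectorized alternative using `torch.bincount`, shown on a made-up batch of six labels and three classes rather than the real test set.

```python
import torch

# Made-up ground-truth labels and predictions for six samples, three classes.
labels = torch.tensor([0, 0, 1, 1, 1, 2])
predicted = torch.tensor([0, 1, 1, 1, 0, 2])
num_classes = 3

# Number of samples per class.
n_class_samples = torch.bincount(labels, minlength=num_classes)

# Number of correct predictions per class: count labels only where prediction matches.
n_class_correct = torch.bincount(labels[predicted == labels], minlength=num_classes)

# clamp(min=1) avoids division by zero for classes with no samples.
per_class_acc = 100.0 * n_class_correct / n_class_samples.clamp(min=1)
print(per_class_acc)
```

For this toy batch, class 0 is right 1 time out of 2, class 1 is right 2 times out of 3, and class 2 is right 1 time out of 1.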


In [None]:
from PIL import Image
import torchvision.transforms as transforms

# Load the same image again, now that the model has been trained
image_path = '/content/car.jpg'
image = Image.open(image_path).convert('RGB')  # Ensure 3 channels (e.g. drops an alpha channel)

# Same normalization as the training data, plus a resize to the 32x32 CIFAR-10 input size
transform = transforms.Compose([
    transforms.Resize((32, 32)),  # Resize the image to 32x32 pixels
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))  # Scale pixel values to [-1, 1]
])

# Apply the transform and move the tensor to the same device as the model
image_transformed = transform(image).float().to(device)

# Add a batch dimension, since the model expects input of shape [batch, channels, height, width]
image_transformed = image_transformed.unsqueeze(0)

# Check the image tensor shape to ensure it's ready for the model (should be [1, 3, 32, 32])
print(image_transformed.shape)


torch.Size([1, 3, 32, 32])


In [None]:
# `model` is the network trained in the cells above. Ensure it's in eval mode
model.eval()

# Make a prediction (no gradients needed; the input must be on the same device as the model)
with torch.no_grad():
    output = model(image_transformed.to(device))

# Get the index of the highest score in the output
_, predicted = torch.max(output, 1)

print(f'Predicted Class: {classes[predicted.item()]}')


Predicted Class: car


# Exercise: Image Classification with CNNs on Fashion-MNIST

#### **Objective:**
The goal of this exercise is to build, train, and evaluate a Convolutional Neural Network (CNN) on the Fashion-MNIST dataset using PyTorch.

#### **Dataset:**
Fashion-MNIST consists of 60,000 training images and 10,000 test images, each a 28x28 grayscale image associated with one of 10 fashion categories.

In [None]:
import torch
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import Subset

# Transformations applied on each image
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,))
])

# Loading the training dataset
trainset_fashion_full = torchvision.datasets.FashionMNIST(root='./data_Fashion_MINST', train=True, download=True, transform=transform)
indices = range(0, 60000)  # All 60,000 training images (indices 0 to 59999); use a smaller range for faster experiments
trainset_fashion = Subset(trainset_fashion_full, indices)
# Note: batch_size defaults to 1 here; pass batch_size=... to train on mini-batches
trainloader_fashion = torch.utils.data.DataLoader(trainset_fashion, shuffle=True)

# Loading the testing dataset
testset_fashion = torchvision.datasets.FashionMNIST(root='./data_Fashion_MINST', train=False, download=True, transform=transform)
testloader_fashion = torch.utils.data.DataLoader(testset_fashion, shuffle=False)

# Classes in Fashion-MNIST
classes = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat', 'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']


![](https://raw.githubusercontent.com/aaubs/ds-master/main/data/Images/Fashion-MNIST-dataset.png)