# HW3 - Q6: Exploring Inductive Bias of Convolutional Neural Networks and Systematic Experimentation in Machine Learning

In this homework, we will study 1) what inductive bias is and how it affects the learning process, and 2) how to conduct systematic experiments in machine learning. As a running example for both topics, we will compare convolutional neural networks (CNNs) and multi-layer perceptrons (MLPs) extensively.

## 1. Inductive Bias

What is inductive bias? It is the set of assumptions that a learning algorithm makes about the problem domain. Suppose that we build a machine learning system. We want to leverage specific knowledge about the problem domain to make the learning process **more efficient** and to make the system **generalize much better**. Let's be more precise. What exactly do **more efficient** and **generalize much better** mean? The learning process is more efficient if we can learn the model with 1) fewer parameters, 2) less data, and 3) fewer iterations. The system generalizes better if the model performs well on unseen data.

We have already observed the power of inductive bias. We know that CNN generalizes better than MLP even with the same number of parameters. We partially concluded that this is because CNN has the inductive bias of translation invariance. We will study the inductive bias of CNN in more detail in this homework.

In this homework, we will use the edge detection task as an example to study the inductive bias of CNN. We will compare CNN and MLP extensively. And we will see when CNN can fail.

## 2. Systematic Experimentation in Machine Learning

How can we test our hypothesis that CNN has the inductive bias of translation invariance? In machine learning research (and in other fields), we conduct extensive experiments to support or refute a hypothesis. In this context, systematic experimentation refers to running a series of experiments designed to test our hypothesis. In this homework, we will study how to conduct systematic experimentation in machine learning.

Let's take a step back and think about 1) what our hypothesis is and 2) what experiments we need to conduct to test it. The first question is easy. The hypothesis is that CNN has the inductive biases of locality and translational invariance. It is not enough to show that CNN performs better than MLP with the same number of parameters. Then, how do we design experiments that test our hypothesis? In this homework, we will design the experiments, conduct the experiments, analyze the results, and draw a conclusion.
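
To make the hypothesis concrete, here is a minimal NumPy sketch of what translation equivariance and invariance mean. It is an illustration only and is not part of the homework code: the helper `correlate2d_valid`, the 12x12 image, and the random 3x3 kernel are all assumptions made for this example. A convolution's response map shifts together with the input (equivariance), a global max over that map does not change at all (invariance), and a dense layer acting on the flattened image has neither property.

```python
import numpy as np

def correlate2d_valid(image, kernel):
    """Plain 'valid' 2D cross-correlation, i.e. what a conv layer computes without bias."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

rng = np.random.default_rng(0)
kernel = rng.normal(size=(3, 3))

# A small patch on an empty 12x12 background, and the same patch moved 4 columns right.
image = np.zeros((12, 12))
image[4:7, 2:5] = rng.normal(size=(3, 3))
shifted = np.roll(image, shift=4, axis=1)

response = correlate2d_valid(image, kernel)
response_shifted = correlate2d_valid(shifted, kernel)

# Equivariance: shifting the input shifts the response map by the same amount.
equivariant = np.allclose(np.roll(response, 4, axis=1), response_shifted)

# Invariance: a global max over the response map ignores where the patch is.
invariant = np.isclose(response.max(), response_shifted.max())

# A dense layer on the flattened image has no such structure built in.
dense_weights = rng.normal(size=image.size)
dense_same = np.isclose(dense_weights @ image.ravel(), dense_weights @ shifted.ravel())

print(equivariant, invariant, dense_same)
```

The patch is kept away from the image border so that the shift does not interact with the boundary; near the border, equivariance only holds approximately.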

In [None]:
from copy import deepcopy

import numpy as np
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
import torchvision.transforms as T
from tqdm import tqdm

from src.dataset import EdgeDetectionDataset
from src.models import *
from src.utils import *
from src.visualization import *

%load_ext autoreload
%autoreload 2

### Helper functions

The helper functions and classes used in the rest of this notebook are defined in the `src` modules imported above. Feel free to read them if you are unsure about the details.

In [None]:
# Moved to src/dataset.py, src/models.py, src/utils.py, src/visualization.py


In [None]:
seed = 10
set_seed(seed)

%matplotlib inline
plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

# for auto-reloading external modules
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
# comment out the lines below if using Python version >= 3.12 (but you will need to manually rerun cells)
%load_ext autoreload
%autoreload 2

## Generate Dataset

What would be an excellent dataset to study the inductive bias of CNN? First, we should start with a problem that is as simple as possible. A complex problem makes it hard to understand the underlying mechanism and is challenging to debug in experimental settings. Hence, we choose the edge detection task as an example to study the inductive bias of CNN, because

1. Edge detection is a straightforward task,
2. It is easy to generate the dataset,
3. The edges of an image are a fundamental low-level feature that is useful for many computer vision tasks, such as object detection, and finally,
4. Edge detection is an excellent example for studying the inductive bias of CNN.

We will generate the dataset for this toy problem. The dataset consists of 10 grayscale images of size 28x28 per class. Each image contains a horizontal edge, a vertical edge, or nothing. Following the order of `class_type` in the configuration below, the labels are 0 for horizontal edges, 1 for vertical edges, and 2 for nothing.

The `EdgeDetectionDataset` class generates and loads the dataset. It inherits from `torch.utils.data.Dataset` and generates its data when it is initialized. The class takes a `domain_config`, a `mode` (`'train'` or `'valid'`), and a `transform`. `domain_config` is a dictionary that specifies the domain information of the train/valid dataset, such as the number of images per class and the size of the image. `transform` is a function that transforms the image. In this homework, we will use `torchvision.transforms.ToTensor()` to convert the image to a tensor.

We highly recommend that you read the implementation of the `EdgeDetectionDataset` class in `src/dataset.py` to understand how the dataset is generated.
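
As a rough mental model of a dataset that generates its data at construction time, consider the sketch below. It is a hypothetical stand-in and not the implementation in the repository: the class name `ToyEdgeDataset`, the `image_size` key, the `seed` argument, and the way edges are drawn (a step from dark to bright) are all assumptions made for illustration. Only the `data_per_class` and `class_type` keys are taken from the configuration used in this notebook.

```python
import numpy as np

class ToyEdgeDataset:
    """Hypothetical stand-in for EdgeDetectionDataset: all data is generated in __init__."""

    def __init__(self, domain_config, transform=None, seed=0):
        rng = np.random.default_rng(seed)
        size = domain_config.get("image_size", 28)  # assumed key, defaults to 28x28
        self.transform = transform
        self.images, self.labels = [], []
        for label, kind in enumerate(domain_config["class_type"]):
            for _ in range(domain_config["data_per_class"]):
                image = np.zeros((size, size), dtype=np.float32)
                position = rng.integers(1, size - 1)  # where the step edge sits
                if kind == "horizontal":
                    image[position:, :] = 1.0  # intensity changes from one row to the next
                elif kind == "vertical":
                    image[:, position:] = 1.0  # intensity changes from one column to the next
                # kind == "none": leave the image blank
                self.images.append(image)
                self.labels.append(label)

    def __len__(self):
        return len(self.images)

    def __getitem__(self, index):
        image = self.images[index]
        if self.transform is not None:
            image = self.transform(image)
        return image, self.labels[index]

config = dict(data_per_class=10, class_type=["horizontal", "vertical", "none"])
dataset = ToyEdgeDataset(config)
image, label = dataset[0]
print(len(dataset), image.shape, label)
```

The `__len__` and `__getitem__` methods are the two methods that a map-style `torch.utils.data.Dataset` is expected to provide, which is why a class shaped like this can be passed to a `DataLoader`.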

In [None]:
# Define the domain configuration of the dataset
set_seed(seed)

visualize_data_config = dict(
    data_per_class=10,
    num_classes=3,
    class_type=["horizontal", "vertical", "none"],
)

visualize_dataset = EdgeDetectionDataset(visualize_data_config, mode='train', transform=None)

## Visualize Dataset

In [None]:
vis_dataset(visualize_dataset, num_classes=3, num_show_per_class=10)
plt.show()

## Q1. Overfitting Models to Small Dataset

In this problem, we will make our models overfit the small dataset to test the model architecture and our synthetic dataset. We use the same dataset for both models. Let's generate a small dataset with ten images per class.

In [None]:
set_seed(seed)

small_dataset_config = None
small_dataset = None
transforms = T.Compose([T.ToTensor()])

#############################################################################
# TODO: Generate dataset with 10 images per class                          #
# Hint: Refer to visualize_data_config and use EdgeDetectionDataset and     #
# transforms function provided above.                                       #
#############################################################################
#############################################################################
#                             END OF YOUR CODE                              #
#############################################################################

In this notebook, we will use `torch.utils.data.DataLoader` to load the dataset. The two `DataLoader` arguments we care about most are `dataset` and `batch_size`. `dataset` is the dataset that we want to load. Note that `batch_size` is an important hyperparameter. We will use `batch_size=32` for this problem.
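
To see what `batch_size` and `shuffle` do, here is a small sketch on a stand-in dataset. The random tensors and the `TensorDataset` wrapper are assumptions made for this illustration and are not the homework dataset. A batch size of 8 is used instead of 32 so that more than one batch is produced from 30 samples.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Stand-in data: 30 random "images" with 3 classes of 10 samples each.
images = torch.rand(30, 1, 28, 28)
labels = torch.arange(3).repeat_interleave(10)
toy_dataset = TensorDataset(images, labels)

# shuffle=True draws a new random order of the samples at every epoch.
loader = DataLoader(toy_dataset, batch_size=8, shuffle=True)

batch_sizes, seen_labels = [], []
for batch_images, batch_labels in loader:
    batch_sizes.append(batch_images.shape[0])
    seen_labels.append(batch_labels)
seen_labels = torch.cat(seen_labels)

print(batch_sizes)  # the last batch is smaller because 30 is not a multiple of 8
```

Every sample is still visited exactly once per epoch; shuffling only changes the order and therefore the composition of each batch.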

In [None]:
small_dataset_loader = None

#############################################################################
# TODO: Implement dataloader                                                #
# Hint: You should set shuffle=True for the training data loader            #
# This flag makes a huge difference in training                             #
#############################################################################
#############################################################################
#                             END OF YOUR CODE                              #
#############################################################################

### Model Architecture

The MLP has two hidden layers with 50 and 10 hidden units, respectively, matching `hidden_dims=[50, 10]` in the code below. The input size is 28x28=784 and the output size is 3. We use ReLU as the activation function. We use cross entropy loss as the loss function.

MLP architecture: FC(784, 50) -> ReLU -> FC(50, 10) -> ReLU -> FC(10, 3)

CNN has two convolutional layers followed by global average pooling and one fully connected layer. Both convolutional layers have 3 filters whose kernel size is 7. We use ReLU as the activation function. We use cross entropy loss as the loss function.

CNN architecture is as follows: CONV - RELU - MAXPOOL - CONV - RELU - MAXPOOL - FC
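
The parameter counts implied by these descriptions can be worked out by hand. The sketch below is plain arithmetic under stated assumptions: the MLP uses hidden sizes 50 and 10 (as in `hidden_dims=[50, 10]` in the training cells), every layer has a bias, and the first convolution sees a single grayscale input channel. The final fully connected layer of the CNN is left out because its input size depends on the pooling details. The `count_parameters` output printed in the notebook remains the authoritative number.

```python
def linear_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features

def conv2d_params(in_channels, out_channels, kernel_size):
    """Weights plus biases of a 2D convolution with a square kernel."""
    return out_channels * in_channels * kernel_size * kernel_size + out_channels

# MLP: FC(784, 50) -> FC(50, 10) -> FC(10, 3)
mlp_layers = [(784, 50), (50, 10), (10, 3)]
mlp_total = sum(linear_params(i, o) for i, o in mlp_layers)

# CNN convolutional part: two conv layers, 3 filters each, kernel size 7
conv_total = conv2d_params(1, 3, 7) + conv2d_params(3, 3, 7)

print("MLP parameters:", mlp_total)        # 39250 + 510 + 33 = 39793
print("CNN conv parameters:", conv_total)  # 150 + 444 = 594
```

A convolution's parameter count does not depend on the image size, whereas the first fully connected layer grows with the number of pixels.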

### Fitting on Small Dataset

Now let's train the models on the small dataset. The final training accuracy should be around 100% for both models.

In [None]:
set_seed(seed)

lr = 1e-2
num_epochs = 500
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

cnn_model = SimpleCNN(kernel_size=7)
cnn_model.to(device)
untrained_cnn_model = deepcopy(cnn_model)

mlp_model = ThreeLayerMLP(hidden_dims=[50, 10])
mlp_model.to(device)

mlp_optimizer = optim.SGD(mlp_model.parameters(), lr=lr, momentum=0.9)
cnn_optimizer = optim.SGD(cnn_model.parameters(), lr=lr, momentum=0.9)

criterion = nn.CrossEntropyLoss()
print("CNN Model has {} parameters".format(count_parameters(cnn_model, only_trainable=True)))
print("MLP Model has {} parameters".format(count_parameters(mlp_model, only_trainable=True)))

for epoch in tqdm(range(num_epochs)):
    train_one_epoch(cnn_model, cnn_optimizer, criterion, small_dataset_loader, device, epoch, verbose=False)
    train_one_epoch(mlp_model, mlp_optimizer, criterion, small_dataset_loader, device, epoch, verbose=False)

_, cnn_acc, _ = evaluate(cnn_model, criterion, small_dataset_loader, device, verbose=False)
_, mlp_acc, _ = evaluate(mlp_model, criterion, small_dataset_loader, device, verbose=False)

print("CNN Acc: {}, MLP Acc: {}".format(cnn_acc, mlp_acc))

We checked that both models can overfit the small dataset. This is one of the most important sanity checks. If a model cannot overfit a small dataset, it is not powerful enough to learn the dataset. In that case, we need to increase the size of the model.

### Visualize Learned Filters

In [None]:
cnn_kernel = cnn_model.conv1.weight.data.clone().cpu()
untrained_kernel = untrained_cnn_model.conv1.weight.data.clone().cpu()

vis_kernel(cnn_kernel, ch=0, allkernels=False, title='Trained CNN Kernel')
vis_kernel(untrained_kernel, ch=0, allkernels=False, title='Untrained CNN Kernel')


### Question

**Can you find any interesting patterns in the learned filters?** Answer this question in your submission of the written assignment.

## Q2. Sweeping the Number of Training Images

We now understand the task and have checked that both models have enough expressive power. Next, we will compare the performance of MLP and CNN while changing the number of training images per class. We expect that a model with the proper inductive biases for this task will fit with **fewer training examples**. Let's see which model has them. In this problem, we will use the same dataset for both models. We sweep the number of training images per class from 10 to 100. The validation set will be the same for all the experiments.

In [None]:
set_seed(seed)

train_loader_dict = dict()
num_images_list = [10, 30, 50, 100]
valid_loader = None

transforms = T.Compose([T.ToTensor()])
train_batch_size = 10
valid_batch_size = 256
#############################################################################
# TODO: Implement train_loader_dict for each number of training images.     #
# Key: The number of training images (10, 30, 50, and 100)                  #
# Value: The corresponding dataloader                                       #
# The validation set size is 50 images per class                            #
#############################################################################
#############################################################################
#                             END OF YOUR CODE                              #
#############################################################################

In [None]:
lr = 1e-2
num_epochs = 30
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
criterion = nn.CrossEntropyLoss()

cnn_acc_list = list()
mlp_acc_list = list()

cnn_kernel_dict = dict()
untrained_cnn_kernel_dict = dict()

for num_image, train_loader in train_loader_dict.items():
    print("Training with {} images".format(num_image))
    set_seed(seed)
    cnn_model = SimpleCNN(kernel_size=7)
    untrained_cnn_model = deepcopy(cnn_model)
    cnn_model.to(device)

    mlp_model = ThreeLayerMLP(hidden_dims=[50, 10])
    mlp_model.to(device)

    mlp_optimizer = optim.SGD(mlp_model.parameters(), lr=lr, momentum=0.9)
    cnn_optimizer = optim.SGD(cnn_model.parameters(), lr=lr, momentum=0.9)

    for epoch in tqdm(range(num_epochs)):
        train_one_epoch(cnn_model, cnn_optimizer, criterion, train_loader, device, epoch, verbose=False)
        train_one_epoch(mlp_model, mlp_optimizer, criterion, train_loader, device, epoch, verbose=False)

    cnn_kernel_dict[num_image] = deepcopy(cnn_model.conv1.weight.cpu().detach())
    untrained_cnn_kernel_dict[num_image] = deepcopy(untrained_cnn_model.conv1.weight.cpu().detach())

    _, cnn_valid_acc, _ = evaluate(cnn_model, criterion, valid_loader, device, verbose=False)
    _, mlp_valid_acc, _ = evaluate(mlp_model, criterion, valid_loader, device, verbose=False)

    print("CNN Acc: {}, MLP Acc: {}".format(cnn_valid_acc, mlp_valid_acc))
    cnn_acc_list.append(cnn_valid_acc)
    mlp_acc_list.append(mlp_valid_acc)

In [None]:
## Plot the validation accuracy
plt.clf()
fig, ax = plt.subplots(1, 1, figsize=(3, 3), dpi=200)
ax.plot(num_images_list, cnn_acc_list, marker='o', label='CNN')
ax.plot(num_images_list, mlp_acc_list, marker='o', label='MLP')
ax.set_xlabel('# of Training Images per Class')
ax.set_ylabel('Validation Accuracy (%)')
ax.legend()
ax.grid()
plt.tight_layout()
plt.show()

OK, in most cases, CNN looks like it is performing better than MLP. So can we conclude that CNN has the inductive biases of locality and translational invariance? Not yet. We need to conduct a series of other experiments to show that CNN has such inductive biases.



At first glance, the experimental result is odd. First, the performance in the low-data regime (`num_train_images_per_class=10`) is very bad, considering that the task is straightforward. Second, we observe that MLP performs better than CNN at some points. If CNN really benefits from translation equivariance, it should be much better even in the low-data regime. How do we debug the model? We will study how to debug the model in the following problem.

Here is a checklist that you can follow to debug the problem.

1. Did you check the dataset? For example, is the dataset balanced? Is the dataset noisy? Is the dataset too small?
2. Did you check the model architecture? For example, is the model architecture powerful enough to learn the dataset? Is the model architecture too complex? Is the model architecture too simple?
3. Did you check the model initialization? For example, is the model initialized properly? Is the model initialized randomly? Is the model initialized with the pre-trained weights?
4. Did you check that the model is trained correctly? For example, does the kernel look like an edge detector? What would be the performance of CNN if kernels were initialized with edge detectors?
5. Did you check the training procedure? For example, is the training procedure correct? Is the training procedure stable? Is the training procedure too slow?
6. Did you optimize the hyperparameters? For example, learning rate, batch size, and the number of epochs.

Note that we already checked the dataset, the model architecture, and the initialization (steps 1 to 3), but we have not checked steps 4 to 6. Let's start with step 4. We will first look at what the learned weights look like, then initialize the kernels with edge detectors and see what happens.
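
For reference, here is what a hand-written edge detector looks like. This sketch uses the classic 3x3 Sobel kernels as an assumed example; the kernels installed by `init_conv_kernel_with_edge_detector` may be different (the CNN in this notebook uses `kernel_size=7`). The helper `correlate2d_valid` and the 8x8 test image are also made up for this illustration.

```python
import numpy as np

def correlate2d_valid(image, kernel):
    """Plain 'valid' 2D cross-correlation, i.e. what a conv layer computes without bias."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Sobel kernels: one responds to intensity changes across columns (vertical edges),
# the other to intensity changes across rows (horizontal edges).
vertical_kernel = np.array([[-1.0, 0.0, 1.0],
                            [-2.0, 0.0, 2.0],
                            [-1.0, 0.0, 1.0]])
horizontal_kernel = vertical_kernel.T

# An 8x8 image with a vertical edge: dark on the left, bright on the right.
image = np.zeros((8, 8))
image[:, 4:] = 1.0

vertical_response = np.abs(correlate2d_valid(image, vertical_kernel)).max()
horizontal_response = np.abs(correlate2d_valid(image, horizontal_kernel)).max()

print(vertical_response, horizontal_response)  # 4.0 0.0
```

A learned kernel that behaves like an edge detector should show a similar structure: weights of opposite sign on the two sides of the orientation it detects.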

In [None]:
for num_image, cnn_kernel in cnn_kernel_dict.items():
    untrained_kernel = untrained_cnn_kernel_dict[num_image]
    vis_kernel(cnn_kernel, ch=0, allkernels=False, title='Trained Kernel - data: {}'.format(num_image))
    vis_kernel(untrained_kernel, ch=0, allkernels=False, title='Untrained Kernel - data: {}'.format(num_image))
plt.show()

#### Question

**Compare the learned kernels, untrained kernels, and edge-detector kernels. What do you observe?** Answer this question in your submission of the written assignment.

### Injecting Inductive Bias: Initialize Kernels with Edge Detectors

In [None]:
lr = 1e-2
num_epochs = 30
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
criterion = nn.CrossEntropyLoss()

edge_init_cnn_acc_list = list()

for num_image, train_loader in train_loader_dict.items():
    print("Training with {} images".format(num_image))
    cnn_model = SimpleCNN(kernel_size=7)
    init_conv_kernel_with_edge_detector(cnn_model)
    freeze_conv_layer(cnn_model)
    untrained_cnn_model = deepcopy(cnn_model)
    cnn_model.to(device)

    cnn_optimizer = optim.SGD(cnn_model.parameters(), lr=lr, momentum=0.9)

    # logging how training and validation accuracy changes
    edge_init_cnn_valid_acc_list = []
    for epoch in tqdm(range(num_epochs)):
        cnn_train_loss, cnn_train_acc = train_one_epoch(cnn_model, cnn_optimizer, criterion, train_loader, device, epoch, verbose=False)

    _, cnn_valid_acc, _ = evaluate(cnn_model, criterion, valid_loader, device, verbose=False)
    print("CNN Acc: {}".format(cnn_valid_acc))
    edge_init_cnn_acc_list.append(cnn_valid_acc)


In [None]:
## Plot the validation accuracy
plt.clf()
fig, ax = plt.subplots(1, 1, figsize=(3.5, 3.5), dpi=200)
ax.plot(num_images_list, cnn_acc_list, marker='o', label='Randomly Initialized')
ax.plot(num_images_list, edge_init_cnn_acc_list, marker='o', label='Edge Initialized')
ax.set_xlabel('# of Training Images per Class')
ax.set_ylabel('Validation Accuracy (%)')
ax.legend()
ax.grid()
plt.tight_layout()
plt.show()


### Question

In this experiment, we freeze the convolutional layer and train only the final layer (the classifier). In the high-data regime, the performance of the CNN initialized with edge detectors is worse than that of the CNN initialized with random weights. **Why do you think this happens?** Answer this question in your submission of the written assignment.

## Q3. Checking the Training Procedure

Checking the training procedure is very important. We must log at least training loss, training accuracy, validation loss, and validation accuracy. Let's log such training signals and find out what is going on.

In [None]:

lr = 1e-2
num_epochs = 30
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
criterion = nn.CrossEntropyLoss()

cnn_acc_list = list()
mlp_acc_list = list()

cnn_kernel_dict = dict()
untrained_cnn_kernel_dict = dict()

for num_image, train_loader in train_loader_dict.items():
    print("Training with {} images".format(num_image))
    set_seed(seed)
    cnn_model = SimpleCNN(kernel_size=7)
    untrained_cnn_model = deepcopy(cnn_model)
    cnn_model.to(device)

    mlp_model = ThreeLayerMLP(hidden_dims=[50, 10])
    mlp_model.to(device)

    mlp_optimizer = optim.SGD(mlp_model.parameters(), lr=lr, momentum=0.9)
    cnn_optimizer = optim.SGD(cnn_model.parameters(), lr=lr, momentum=0.9)

    mlp_results = train_model(mlp_model, mlp_optimizer, num_epochs, train_loader, valid_loader)
    cnn_results = train_model(cnn_model, cnn_optimizer, num_epochs, train_loader, valid_loader)

    vis_training_curve(cnn_results["train_loss"], cnn_results["train_acc"], mlp_results["train_loss"], mlp_results["train_acc"])
    vis_validation_curve(cnn_results["valid_loss"], cnn_results["valid_acc"], mlp_results["valid_loss"], mlp_results["valid_acc"])

    print("CNN Acc: {}, MLP Acc: {}".format(cnn_results["final_valid_acc"], mlp_results["final_valid_acc"]))

What is going on here? The validation loss and validation accuracy are not flat at the end, which means that the models have not converged yet. We need to train the models longer. Let's train with a larger number of epochs. Increase the number of epochs until the validation loss and accuracy are flat.
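
One simple way to make "flat" precise is to check how much a logged curve still moves over its last few epochs. The helper below is a hypothetical sketch and is not part of `src/utils.py`; the name `has_plateaued`, the window of 10 epochs, and the tolerance of `1e-3` are arbitrary choices made for illustration.

```python
def has_plateaued(values, window=10, tol=1e-3):
    """Return True when the last `window` values vary by less than `tol`."""
    if len(values) < window:
        return False
    recent = values[-window:]
    return max(recent) - min(recent) < tol

# A loss curve that is still going down after 30 epochs ...
still_improving = [1.0 / (epoch + 1) for epoch in range(30)]
# ... and the same curve followed by 10 epochs without any change.
converged = still_improving + [still_improving[-1]] * 10

print(has_plateaued(still_improving))  # False
print(has_plateaued(converged))        # True
```

A check like this could be applied to a logged validation loss curve such as `cnn_results["valid_loss"]`, assuming it is a list with one value per epoch. Looking at the plotted curves is still the most reliable check.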

#### Question

**List every number of epochs that you tried.** The final accuracy of CNN should be at least 95% for 30 images per class. Answer this question in your submission of the written assignment.

#### Question

**Check the learned kernels. What do you observe?** Answer this question in your submission of the written assignment.

#### Question (Optional)

You might find that, with a high number of epochs, the validation loss of MLP increases while its validation accuracy also increases. **How can we interpret this?** Answer this question in your submission of the written assignment.

(Hint: you may find papers that discuss calibration helpful for this question, e.g., this [paper](https://arxiv.org/pdf/1706.04599.pdf).)
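
As a purely numerical illustration, the sketch below shows that cross entropy and accuracy can move in opposite directions. The predicted probabilities are made up for this example and do not come from the models in this notebook. Accuracy only looks at which class has the highest probability, while cross entropy also depends on how large the probability of the true class is.

```python
import numpy as np

def cross_entropy(probs, labels):
    """Mean negative log-probability assigned to the true class."""
    return float(-np.mean(np.log(probs[np.arange(len(labels)), labels])))

def accuracy(probs, labels):
    """Fraction of samples whose most probable class is the true class."""
    return float(np.mean(probs.argmax(axis=1) == labels))

labels = np.array([0, 0, 0, 0])

# Made-up predictions early in training: unsure everywhere, 2 of 4 correct.
early = np.array([[0.60, 0.40],
                  [0.60, 0.40],
                  [0.45, 0.55],
                  [0.45, 0.55]])

# Made-up predictions late in training: 3 of 4 correct, but the single
# mistake is made with very high confidence.
late = np.array([[0.99, 0.01],
                 [0.99, 0.01],
                 [0.99, 0.01],
                 [0.01, 0.99]])

print("early: acc={:.2f}, loss={:.3f}".format(accuracy(early, labels), cross_entropy(early, labels)))
print("late:  acc={:.2f}, loss={:.3f}".format(accuracy(late, labels), cross_entropy(late, labels)))
```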

#### Question (Optional)

Do hyperparameter tuning. **And list the best hyperparameter setting that you found and report the final accuracy of CNN and MLP.** Answer this question in your submission of the written assignment.

In [None]:
#############################################################################
# TODO: Try other num_epochs. Final accuracy of CNN should be at around     #
# 95-99% for 30 images per class.                                           #
#############################################################################
#############################################################################
#                             END OF YOUR CODE                              #
#############################################################################
lr = 1e-2
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
criterion = nn.CrossEntropyLoss()

cnn_acc_list = list()
mlp_acc_list = list()

cnn_kernel_dict = dict()
untrained_cnn_kernel_dict = dict()

for num_image, train_loader in train_loader_dict.items():
    print("Training with {} images".format(num_image))
    set_seed(seed)
    cnn_model = SimpleCNN(kernel_size=7)
    untrained_cnn_model = deepcopy(cnn_model)
    cnn_model.to(device)

    mlp_model = ThreeLayerMLP(hidden_dims=[50, 10])
    mlp_model.to(device)

    mlp_optimizer = optim.SGD(mlp_model.parameters(), lr=lr, momentum=0.9)
    cnn_optimizer = optim.SGD(cnn_model.parameters(), lr=lr, momentum=0.9)

    mlp_results = train_model(mlp_model, mlp_optimizer, num_epochs, train_loader, valid_loader)
    cnn_results = train_model(cnn_model, cnn_optimizer, num_epochs, train_loader, valid_loader)

    vis_training_curve(cnn_results["train_loss"], cnn_results["train_acc"], mlp_results["train_loss"], mlp_results["train_acc"])
    vis_validation_curve(cnn_results["valid_loss"], cnn_results["valid_acc"], mlp_results["valid_loss"], mlp_results["valid_acc"])

    cnn_kernel_dict[num_image] = deepcopy(cnn_model.conv1.weight.data.detach().cpu())
    untrained_cnn_kernel_dict[num_image] = deepcopy(untrained_cnn_model.conv1.weight.data.detach().cpu())

    print("CNN Acc: {}, MLP Acc: {}".format(cnn_results["final_valid_acc"], mlp_results["final_valid_acc"]))

In [None]:
for num_image, cnn_kernel in cnn_kernel_dict.items():
    untrained_kernel = untrained_cnn_kernel_dict[num_image]
    vis_kernel(cnn_kernel, ch=0, allkernels=False, title='Trained CNN Kernel {}'.format(num_image))
    vis_kernel(untrained_kernel, ch=0, allkernels=False, title='Untrained CNN Kernel {}'.format(num_image))
plt.show()

#### Question

**How much more data is needed for MLP to get a competitive performance with CNN?** Answer this question in your submission of the written assignment.

## Q4. Domain Shift between Training and Validation Set

In this problem, we will see how model performance changes when the domain of the training set differs from that of the validation set. We will generate training images whose edges are located only in one half of the image and validation images whose edges are located only in the other half. Let's repeat the same experiment as in the previous problem.

In [None]:
set_seed(seed)
train_loader_dict = dict()
num_train_images_list = [10, 30, 50, 100]
possible_edge_location_ratio = 0.5
valid_loader = None

transforms = T.Compose([T.ToTensor()])
batch_size = 10

for num_image in num_train_images_list:
    train_dataset_config = dict(
        data_per_class=num_image,
        possible_edge_location_ratio=possible_edge_location_ratio,
    )
    train_dataset = EdgeDetectionDataset(train_dataset_config, 'train', transform=transforms)
    train_loader_dict[num_image] = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

valid_dataset_config = dict(
    data_per_class=50,
    possible_edge_location_ratio=possible_edge_location_ratio,
)
valid_dataset = EdgeDetectionDataset(valid_dataset_config, 'valid', transform=transforms)
valid_loader = DataLoader(valid_dataset, batch_size=batch_size, shuffle=False)


In [None]:
lr = 1e-2
num_epochs = 300
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
criterion = nn.CrossEntropyLoss()

cnn_acc_list = list()
mlp_acc_list = list()

cnn_kernel_dict = dict()
untrained_cnn_kernel_dict = dict()

cnn_confusion_matrix_dict = dict()
mlp_confusion_matrix_dict = dict()

for num_image, train_loader in train_loader_dict.items():
    print("Training with {} images".format(num_image))
    set_seed(seed)
    cnn_model = SimpleCNN(kernel_size=7)
    untrained_cnn_model = deepcopy(cnn_model)
    cnn_model.to(device)

    mlp_model = ThreeLayerMLP(hidden_dims=[50, 10])
    mlp_model.to(device)

    mlp_optimizer = optim.SGD(mlp_model.parameters(), lr=lr, momentum=0.9)
    cnn_optimizer = optim.SGD(cnn_model.parameters(), lr=lr, momentum=0.9)

    mlp_results = train_model(mlp_model, mlp_optimizer, num_epochs, train_loader, valid_loader)
    cnn_results = train_model(cnn_model, cnn_optimizer, num_epochs, train_loader, valid_loader)

    vis_training_curve(cnn_results["train_loss"], cnn_results["train_acc"], mlp_results["train_loss"], mlp_results["train_acc"])
    vis_validation_curve(cnn_results["valid_loss"], cnn_results["valid_acc"], mlp_results["valid_loss"], mlp_results["valid_acc"])

    cnn_kernel_dict[num_image] = deepcopy(cnn_model.conv1.weight.detach().cpu())
    untrained_cnn_kernel_dict[num_image] = deepcopy(untrained_cnn_model.conv1.weight.detach().cpu())

    cnn_confusion_matrix_dict[num_image] = cnn_results["confusion_matrix"]
    mlp_confusion_matrix_dict[num_image] = mlp_results["confusion_matrix"]

    print("CNN Acc: {}, MLP Acc: {}".format(cnn_results["final_valid_acc"], mlp_results["final_valid_acc"]))

In [None]:
for num_image, cnn_kernel in cnn_kernel_dict.items():
    untrained_kernel = untrained_cnn_kernel_dict[num_image]
    vis_kernel(cnn_kernel, ch=0, allkernels=False, title='Trained CNN Kernel Data={}'.format(num_image))
    vis_kernel(untrained_kernel, ch=0, allkernels=False, title='Untrained CNN Kernel Data={}'.format(num_image))
plt.show()

In this example, you will see that the performance of both CNN and MLP is worse than in the previous question. If the two models had learned how to extract edges, they should be able to classify images with edges even when the edges are located in the other half of the image. However, both models suffer from performance degradation (especially MLP). What could be the problem? To investigate this, let's first look at the [confusion matrices](https://en.wikipedia.org/wiki/Confusion_matrix) for both models.
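
As a reminder of how a confusion matrix is built, here is a minimal NumPy sketch. It assumes the convention that rows are true classes and columns are predicted classes; the helper used in this notebook, `vis_confusion_matrix`, may use a different layout. The labels and predictions below are made up for illustration.

```python
import numpy as np

def confusion_matrix(true_labels, predicted_labels, num_classes):
    """Entry [t, p] counts the samples of true class t that were predicted as class p."""
    matrix = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(true_labels, predicted_labels):
        matrix[t, p] += 1
    return matrix

# Made-up example with classes 0 = horizontal, 1 = vertical, 2 = none.
true_labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]
predicted_labels = [0, 0, 2, 1, 2, 2, 2, 2, 2]

cm = confusion_matrix(true_labels, predicted_labels, num_classes=3)
print(cm)
# [[2 0 1]
#  [0 1 2]
#  [0 0 3]]
```

The diagonal holds the correct predictions. In this made-up example, the last column shows that every mistake was an edge image predicted as "none".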

In [None]:
## Plot the confusion matrix
for num_image, cnn_confusion_matrix in cnn_confusion_matrix_dict.items():
    mlp_confusion_matrix = mlp_confusion_matrix_dict[num_image]
    vis_confusion_matrix(cnn_confusion_matrix, ['horizontal', 'vertical', 'none'], 'CNN-{}-images'.format(num_image))
    vis_confusion_matrix(mlp_confusion_matrix, ['horizontal', 'vertical', 'none'], 'MLP-{}-images'.format(num_image))
plt.show()

#### Question

**Why do you think the confusion matrix looks like this?** Answer this question in your submission of the written assignment.

(Hint: Visualize some of the images in the training and validation set. And we are using kernel_size=7, which is large relative to the image size.)

We can do better than this. We have not explored the hyperparameter space yet. Let's search for hyperparameters that generalize well to the validation set. We will change the learning rate, the number of epochs, and the kernel size for CNN.

In [None]:
#############################################################################
# TODO: Try other num_epochs, lr, kernel_size. The validation accuracy      #
# should achieve around 97-100% for 10 images per class.                    #
#############################################################################
#############################################################################
#                             END OF YOUR CODE                              #
#############################################################################
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
criterion = nn.CrossEntropyLoss()

cnn_valid_acc_list = list()

cnn_kernel_dict = dict()

cnn_confusion_matrix_dict = dict()

for num_image, train_loader in train_loader_dict.items():
    print("Training with {} images".format(num_image))
    set_seed(seed)
    cnn_model = SimpleCNN(kernel_size=kernel_size)
    untrained_cnn_model = deepcopy(cnn_model)
    cnn_model.to(device)

    cnn_optimizer = optim.SGD(cnn_model.parameters(), lr=lr, momentum=0.9)

    cnn_results = train_model(cnn_model, cnn_optimizer, num_epochs, train_loader, valid_loader)

    vis_training_curve(cnn_results["train_loss"], cnn_results["train_acc"], None, None)
    vis_validation_curve(cnn_results["valid_loss"], cnn_results["valid_acc"], None, None)

    print("CNN Acc: {}".format(cnn_results["final_valid_acc"]))

#### Question

**Why do you think MLP fails to learn the task while CNN can learn the task?** Answer this question in your submission of the written assignment.

(Hint: Think about the model architecture.)

## Q5. When is CNN Worse than MLP?

In this problem, we will see that CNN is not always better than MLP in the image domain. Using a CNN assumes that the data is locally correlated, whatever the data actually looks like. We can manually 'whiten' the images, that is, remove such local correlation, simply by applying a random permutation to the pixels. A permutation matrix is a square matrix in which each row and each column contains exactly one 1, and all other elements are 0. For example, the following is a permutation matrix.

```
[[0, 1, 0, 0],
 [0, 0, 0, 1],
 [1, 0, 0, 0],
 [0, 0, 1, 0]]
```

This matrix randomly reorders the elements of the vector. For example, if we apply this matrix to the vector `[1, 2, 3, 4]`, we will get `[2, 4, 1, 3]`. If we apply this matrix to the image, we will get the image with the same content, but the pixels are randomly shuffled. One property of the random permutation matrix is that it is invertible. It means that we can recover the original image by simply applying the inverse matrix to the shuffled image. From the information-theoretical perspective, the random permutation matrix preserves the mutual information of the image and the label.
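
The example above can be checked numerically. The sketch below applies the 4x4 matrix to the vector, inverts the permutation with the transpose (the inverse of a permutation matrix is its transpose), and repeats the same thing with index arrays in the style of `permutater` and `unpermutater` used in the code below. How `EdgeDetectionDataset` applies these arrays internally is not shown here, so treat the indexing convention as an assumption.

```python
import numpy as np

P = np.array([[0, 1, 0, 0],
              [0, 0, 0, 1],
              [1, 0, 0, 0],
              [0, 0, 1, 0]])
x = np.array([1, 2, 3, 4])

# Matrix form: P reorders the entries, and P.T undoes the reordering.
shuffled = P @ x
recovered = P.T @ shuffled
print(shuffled, recovered)  # [2 4 1 3] [1 2 3 4]

# Index form: the same permutation written as an index array.
permutater = np.array([1, 3, 0, 2])
unpermutater = np.argsort(permutater)  # indices that invert the permutation

shuffled_by_index = x[permutater]
recovered_by_index = shuffled_by_index[unpermutater]
print(shuffled_by_index, recovered_by_index)  # [2 4 1 3] [1 2 3 4]
```

For a 28x28 image, the same idea applies to the flattened vector of 784 pixels, which is why the index arrays in the code below have length `28 * 28`.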

We will repeat the same experiment as the previous problem. Visualize the dataset first.

In [None]:
set_seed(seed)
visual_domain_config = None
use_permutation = True

permutater = np.arange(28 * 28,  dtype=np.int32)
np.random.shuffle(permutater)
unpermutater = np.argsort(permutater)

visual_dataset = None

transforms = T.Compose([T.ToTensor()])

visual_domain_config = dict(
    data_per_class=10,
    use_permutation=True,
    permutater=permutater,
    unpermutater=unpermutater,
)
visual_dataset = EdgeDetectionDataset(visual_domain_config, mode='train', transform=transforms)

In [None]:
## Visualize the images
unpermutator = visual_dataset.get_unpermutater()
print('Dataset Image before permutation')
vis_unpermuted_dataset(visual_dataset, num_classes=3, num_show_per_class=10, unpermutator=unpermutator)

print('Dataset Image after permutation')
vis_dataset(visual_dataset, num_classes=3, num_show_per_class=10)

plt.show()

Now let's train CNN and MLP on the permuted dataset.

In [None]:
set_seed(seed)

train_loader_dict = dict()
num_train_images_list = [10, 30, 50, 100]
use_permutation = True
valid_loader = None

permutater = np.arange(28 * 28, dtype=np.int32)
np.random.shuffle(permutater)
unpermutater = np.argsort(permutater)

transforms = T.Compose([T.ToTensor()])

batch_size = 10

for num_image in num_train_images_list:
    train_dataset_config = dict(
        data_per_class=num_image,
        use_permutation=True,
        permutater=permutater,
        unpermutater=unpermutater,
    )
    train_dataset = EdgeDetectionDataset(train_dataset_config, 'train', transform=transforms)
    train_loader_dict[num_image] = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

valid_dataset_config = dict(
    data_per_class=50,
    use_permutation=True,
    permutater=permutater,
    unpermutater=unpermutater,
)
valid_dataset = EdgeDetectionDataset(valid_dataset_config, 'valid', transform=transforms)
valid_loader = DataLoader(valid_dataset, batch_size=batch_size, shuffle=False)


In [None]:
lr = 1e-2
num_epochs = 300
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
criterion = nn.CrossEntropyLoss()

cnn_kernel_dict = dict()
untrained_cnn_kernel_dict = dict()

for num_image, train_loader in train_loader_dict.items():
    print("Training with {} images".format(num_image))
    set_seed(seed)
    cnn_model = SimpleCNN(kernel_size=7)
    untrained_cnn_model = deepcopy(cnn_model)
    cnn_model.to(device)

    mlp_model = ThreeLayerMLP(hidden_dims=[50, 10])
    mlp_model.to(device)

    mlp_optimizer = optim.SGD(mlp_model.parameters(), lr=lr, momentum=0.9)
    cnn_optimizer = optim.SGD(cnn_model.parameters(), lr=lr, momentum=0.9)

    mlp_results = train_model(mlp_model, mlp_optimizer, num_epochs, train_loader, valid_loader)
    cnn_results = train_model(cnn_model, cnn_optimizer, num_epochs, train_loader, valid_loader)

    vis_training_curve(cnn_results["train_loss"], cnn_results["train_acc"], mlp_results["train_loss"], mlp_results["train_acc"])
    vis_validation_curve(cnn_results["valid_loss"], cnn_results["valid_acc"], mlp_results["valid_loss"], mlp_results["valid_acc"])

    cnn_kernel_dict[num_image] = cnn_model.conv1.weight.detach().cpu()
    untrained_cnn_kernel_dict[num_image] = untrained_cnn_model.conv1.weight.detach().cpu()

    print("CNN Acc: {}, MLP Acc: {}".format(cnn_results["final_valid_acc"], mlp_results["final_valid_acc"]))

#### Question

**What do you observe? What is the reason that CNN is worse than MLP?** Answer this question in your submission of the written assignment.

(Hint: Think about the model architecture.)

#### Question

**Suppose we decrease the kernel size of the CNN. Does the validation accuracy increase or decrease? Why?** Answer this question in your submission of the written assignment.

Now let's visualize CNN's learned kernel.

In [None]:
for num_image, cnn_kernel in cnn_kernel_dict.items():
    untrained_kernel = untrained_cnn_kernel_dict[num_image]
    vis_kernel(cnn_kernel, ch=0, allkernels=False, title='Trained CNN Kernel Data={}'.format(num_image))
    vis_kernel(untrained_kernel, ch=0, allkernels=False, title='Untrained CNN Kernel Data={}'.format(num_image))
plt.show()


#### Question

**What do the learned kernels look like? Explain why.** Answer this question in your submission of the written assignment.

From the above example, we can see that CNN is not always better than MLP. We have to think about the domain (or task) of the dataset and the model architecture to decide which model is better.

## Q6. Increasing the Number of Classes

OK, can we conclude that CNN has the inductive bias that the model is translation invariant? Let's try other experiments. We make the task harder: in this problem, we increase the number of classes to 5. The classes are 0 for horizontal edges, 1 for vertical edges, 2 for diagonal edges, 3 for both vertical and horizontal edges, and 4 for no edges. Let's generate the dataset with 10 images per class and visualize it.

In [None]:
set_seed(seed)
visual_domain_config = None

visual_dataset = None

transforms = T.Compose([T.ToTensor()])
visual_domain_config = dict(
    data_per_class=10,
    class_type=['horizontal', 'vertical', 'diagonal', 'both', 'none'],
    num_classes=5,
)

visual_dataset = EdgeDetectionDataset(visual_domain_config, 'train', transform=transforms)

Let's visualize the dataset first.

In [None]:
vis_dataset(visual_dataset, 5, 10)
plt.show()

Now let's make the new dataset. In this problem, we also see how the model performance changes as the number of images per class increases. Let's sweep the number of training images per class over 10, 30, 50, and 100. The validation set stays the same (50 images per class) in all cases.

In [None]:
set_seed(seed)

train_dataset_config = None
train_loader_dict = dict()
num_train_images_list = [10, 30, 50, 100]
valid_loader = None

transforms = T.Compose([T.ToTensor()])
batch_size = 10
class_type = ['horizontal', 'vertical', 'diagonal', 'both', 'none']
train_dataset_config = dict(
    class_type=class_type,
    num_classes=len(class_type),
)
for num_train_images in num_train_images_list:
    train_dataset_config['data_per_class'] = num_train_images
    train_dataset = EdgeDetectionDataset(train_dataset_config, 'train', transform=transforms)
    train_loader_dict[num_train_images] = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)


valid_dataset_config = dict(
    data_per_class=50,
    class_type=class_type,
    num_classes=len(class_type),
)
valid_dataset = EdgeDetectionDataset(valid_dataset_config, 'valid', transform=transforms)
valid_loader = DataLoader(valid_dataset, batch_size=batch_size, shuffle=True)

In [None]:
lr = 1e-2
num_epochs = 200
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
criterion = nn.CrossEntropyLoss()

cnn_kernel_dict = dict()
untrained_cnn_kernel_dict = dict()

cnn_confusion_matrix_dict = dict()
mlp_confusion_matrix_dict = dict()

cnn_result_dict = dict()
for num_image, train_loader in train_loader_dict.items():
    print("Training with {} images".format(num_image))
    set_seed(seed)
    cnn_model = SimpleCNN(kernel_size=7, num_classes=5)
    untrained_cnn_model = deepcopy(cnn_model)
    cnn_model.to(device)

    mlp_model = ThreeLayerMLP(hidden_dims=[50, 10], num_classes=5)
    mlp_model.to(device)

    mlp_optimizer = optim.SGD(mlp_model.parameters(), lr=lr, momentum=0.9)
    cnn_optimizer = optim.SGD(cnn_model.parameters(), lr=lr, momentum=0.9)

    mlp_results = train_model(mlp_model, mlp_optimizer, num_epochs, train_loader, valid_loader)
    cnn_results = train_model(cnn_model, cnn_optimizer, num_epochs, train_loader, valid_loader)

    vis_training_curve(cnn_results["train_loss"], cnn_results["train_acc"], mlp_results["train_loss"], mlp_results["train_acc"])
    vis_validation_curve(cnn_results["valid_loss"], cnn_results["valid_acc"], mlp_results["valid_loss"], mlp_results["valid_acc"])

    cnn_kernel_dict[num_image] = cnn_model.conv1.weight.detach().cpu()
    untrained_cnn_kernel_dict[num_image] = untrained_cnn_model.conv1.weight.detach().cpu()

    cnn_result_dict[num_image] = cnn_results
    print("CNN Acc: {}, MLP Acc: {}".format(cnn_results["final_valid_acc"], mlp_results["final_valid_acc"]))

*We look at two types of pooling operations to downsample the image features:*

1) Max pooling: The maximum value within each pooling window is selected.
2) Average pooling: The average of all the values within each pooling window is selected.
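As a small illustration of the two operations, the sketch below pools a 4x4 feature map with non-overlapping 2x2 windows. The helper `pool2d` is hypothetical and written in plain NumPy for clarity. The pooling window size and stride actually used by `SimpleCNN` and `SimpleCNN_avgpool` are defined in `cnn.py` and may differ.

```python
import numpy as np

def pool2d(x, size, mode="max"):
    """Non-overlapping 2D pooling over size x size windows (stride equals size)."""
    h, w = x.shape
    assert h % size == 0 and w % size == 0, "input must divide evenly into windows"
    # Give every window its own pair of axes, then reduce over those axes.
    windows = x.reshape(h // size, size, w // size, size)
    if mode == "max":
        return windows.max(axis=(1, 3))
    if mode == "avg":
        return windows.mean(axis=(1, 3))
    raise ValueError(f"unknown mode: {mode}")

feature_map = np.array([[1., 2., 0., 0.],
                        [3., 4., 0., 8.],
                        [0., 0., 1., 1.],
                        [0., 0., 1., 1.]])

max_pooled = pool2d(feature_map, 2, "max")
avg_pooled = pool2d(feature_map, 2, "avg")

print(max_pooled)   # [[4. 8.] [0. 1.]]
print(avg_pooled)   # [[2.5 2. ] [0.  1. ]]
```

Note the top-right window: a single strong activation (8) survives max pooling unchanged, while average pooling dilutes it to 2.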

Up until this point, we have been using the first type of pooling operation (Max pooling). Let's train the same model but with the average pooling to compare these two types of operations!  

In [None]:
lr = 1e-2
num_epochs = 200
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
criterion = nn.CrossEntropyLoss()

cnnavg_kernel_dict = dict()

for num_image, train_loader in train_loader_dict.items():
    print("Training with {} images".format(num_image))
    set_seed(seed)

    cnnavg_model = SimpleCNN_avgpool(kernel_size=7, num_classes=5)
    cnnavg_model.to(device)

    cnnavg_optimizer = optim.SGD(cnnavg_model.parameters(), lr=lr, momentum=0.9)

    cnnavg_results = train_model(cnnavg_model, cnnavg_optimizer, num_epochs, train_loader, valid_loader)
    cnn_results = cnn_result_dict[num_image] # load the results from the previous cell as we have already trained the maxpool model.

    vis_training_curve(cnn_results["train_loss"], cnn_results["train_acc"], cnnavg_results["train_loss"], cnnavg_results["train_acc"], label="CNN-avgpool")
    vis_validation_curve(cnn_results["valid_loss"], cnn_results["valid_acc"], cnnavg_results["valid_loss"], cnnavg_results["valid_acc"], label="CNN-avgpool")

    cnnavg_kernel_dict[num_image] = cnnavg_model.conv1.weight.detach().cpu()

    print("CNN-maxpool Acc: {}, CNN-avgpool Acc: {}".format(cnn_results["final_valid_acc"], cnnavg_results["final_valid_acc"]))

In [None]:
for num_image, cnn_kernel in cnn_kernel_dict.items():
    untrained_kernel = untrained_cnn_kernel_dict[num_image]
    cnnavg_kernel = cnnavg_kernel_dict[num_image]
    vis_kernel(cnn_kernel, ch=0, allkernels=False, title='Trained CNN Kernel Maxpool Data={}'.format(num_image))
    vis_kernel(cnnavg_kernel, ch=0, allkernels=False, title='Trained CNN Kernel Avgpool Data={}'.format(num_image))
    vis_kernel(untrained_kernel, ch=0, allkernels=False, title='Untrained CNN Kernel Data={}'.format(num_image))

plt.show()

#### Question

**Compare the performance of CNN with max pooling and average pooling. What are the advantages of each pooling method?** Answer this question in your submission of the written assignment.


## Q7. Wider/Deeper CNNs

Can we further improve the performance by making the architecture deeper and wider? In this question, we focus on the dataset where there are only 30 images per class and try to push the performance of the CNNs further.

There are 5 patterns to detect, but we have only 3 kernels per layer (`num_filters` in the network definition above). Intuitively, this is quite suboptimal. Here, we will investigate the effect of increasing width and depth. Let's use the same dataset, but with `DeeperCNN` and `WiderCNN` from `cnn.py`. `DeeperCNN` has twice as many layers as `SimpleCNN`, and `WiderCNN` has twice as many kernels per layer as `SimpleCNN`. Let's train the models and visualize the validation accuracy.
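To get a rough feel for how width and depth change the size of the convolutional part of a model, the sketch below counts the learnable parameters of a 2D convolution layer. The helper `conv2d_param_count` and the channel counts are assumptions chosen for illustration (1 input channel, 7x7 kernels, 3 versus 6 filters). The actual layer shapes of `SimpleCNN`, `DeeperCNN`, and `WiderCNN` are defined in `cnn.py` and may differ.

```python
def conv2d_param_count(in_channels, out_channels, kernel_size, bias=True):
    """Number of learnable parameters in one 2D convolution layer."""
    weights = out_channels * in_channels * kernel_size * kernel_size
    return weights + (out_channels if bias else 0)

# Hypothetical baseline: one conv layer with 3 filters on a 1-channel image.
base = conv2d_param_count(1, 3, 7)

# Wider: twice as many filters in the same layer.
wider = conv2d_param_count(1, 6, 7)

# Deeper: stack a second 3-filter layer on top of the baseline layer.
deeper = base + conv2d_param_count(3, 3, 7)

print(base, wider, deeper)   # 150 300 594
```

Under these assumed shapes, doubling the width doubles the first layer's parameters, while adding a layer costs more because the new layer sees 3 input channels instead of 1.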

In [None]:
#############################################################################
# TODO: Train DeeperCNN and tune hyperparameters. Try other num_epochs,     #
# lr, kernel_size. Also try a different optimizer (e.g., Adam)              #
# The validation accuracy can reach above 98% for 30 images per class.      #
#############################################################################
#############################################################################
#                             END OF YOUR CODE                              #
#############################################################################
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
criterion = nn.CrossEntropyLoss()

train_loader = train_loader_dict[30]
set_seed(seed)
deeper_cnn_model = DeeperCNN(kernel_size=kernel_size)
untrained_deeper_cnn_model = deepcopy(deeper_cnn_model)
deeper_cnn_model.to(device)

deeper_cnn_optimizer = optim.SGD(deeper_cnn_model.parameters(), lr=lr, momentum=0.9)
# deeper_cnn_optimizer = optim.Adam(deeper_cnn_model.parameters(), lr=lr)  # try me!

deeper_cnn_results = train_model(deeper_cnn_model, deeper_cnn_optimizer, num_epochs, train_loader, valid_loader)

vis_training_curve(deeper_cnn_results["train_loss"], deeper_cnn_results["train_acc"], None, None)
vis_validation_curve(deeper_cnn_results["valid_loss"], deeper_cnn_results["valid_acc"], None, None)

print("DeeperCNN Acc: {}".format(deeper_cnn_results["final_valid_acc"]))

In [None]:
#############################################################################
# TODO: Train WiderCNN and tune hyperparameters. Try other num_epochs,      #
# lr, kernel_size. Also try a different optimizer (e.g., Adam)              #
# The validation accuracy can reach above 98% for 30 images per class.      #
#############################################################################
#############################################################################
#                             END OF YOUR CODE                              #
#############################################################################
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
criterion = nn.CrossEntropyLoss()

train_loader = train_loader_dict[30]
set_seed(seed)
wider_cnn_model = WiderCNN(kernel_size=kernel_size)
untrained_wider_cnn_model = deepcopy(wider_cnn_model)
wider_cnn_model.to(device)

wider_cnn_optimizer = optim.SGD(wider_cnn_model.parameters(), lr=lr, momentum=0.9)
# wider_cnn_optimizer = optim.Adam(wider_cnn_model.parameters(), lr=lr)  # try me!

wider_cnn_results = train_model(wider_cnn_model, wider_cnn_optimizer, num_epochs, train_loader, valid_loader)

vis_training_curve(wider_cnn_results["train_loss"], wider_cnn_results["train_acc"], None, None)
vis_validation_curve(wider_cnn_results["valid_loss"], wider_cnn_results["valid_acc"], None, None)

print("WiderCNN Acc: {}".format(wider_cnn_results["final_valid_acc"]))