# Lab Assignment: How AlexNet Changed Artificial Intelligence Forever

**Author:** Danilo de Goede

In this lab assignment, we will take a closer look at AlexNet, the first end-to-end learned Artificial Intelligence (AI) system to achieve remarkable success in the [ImageNet Large Scale Visual Recognition Challenge (ILSVRC)](https://www.image-net.org/challenges/LSVRC/). 
The original [paper](https://proceedings.neurips.cc/paper_files/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf) was published in 2012 and marked a paradigm shift in AI, where the dominant approach shifted to *learning* features directly from data rather than using hand-crafted ones.
While the idea of learning features from data dates back much further, AlexNet was the first to scale this approach to large datasets like ImageNet.
Since then, the field of deep learning has taken off, leading to major research breakthroughs such as AlphaFold, Stable Diffusion, and GPT.

<center width="100%">
    <img src="https://www.researchgate.net/profile/Dae-Young-Kang/publication/346091812/figure/fig2/AS:979480482938881@1610537753860/Algorithms-that-won-the-ImageNet-Large-Scale-Visual-Recognition-Challenge-ILSVRC-in.png" width="600px">
</center>

In the remainder of this assignment, we will delve deeper into the inner workings of AlexNet. Since this assignment is designed to be accessible to undergraduate students without prior experience in programming or machine learning, certain concepts are intentionally explained at an abstract level. Although we provide code to make the assignment more interactive, it is not necessary to examine or understand the code in detail. Instead, it is recommended to focus on reading through the text and answering the questions provided in this notebook.

As a final note, it is recommended to enable a GPU in your Colab runtime; otherwise the code will run very slowly. To do this, go to `Runtime` -> `Change runtime type` -> `Hardware accelerator` -> `T4 GPU`. The `Runtime` button can be found in the top menu of the Colab interface.

Let's begin by importing all relevant libraries that we will use throughout this assignment.

In [None]:
## Standard libraries
import os
import numpy as np
from PIL import Image
import urllib.request  # explicitly load the submodule used to download the example image

## Imports for plotting
%matplotlib inline
from matplotlib_inline.backend_inline import set_matplotlib_formats
set_matplotlib_formats('svg', 'pdf') # For export
import matplotlib
matplotlib.rcParams['lines.linewidth'] = 2.0
import seaborn as sns
sns.reset_orig()

## PyTorch
import torch
import torch.nn.functional as F


if not os.path.exists("backend.py"):
    !wget https://raw.githubusercontent.com/ddgoede/alexnet-tutorial/main/backend.py

import backend

backend.set_seed(42)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

if not os.path.exists("imagenet_classes.txt"):
    !wget "https://raw.githubusercontent.com/pytorch/hub/master/imagenet_classes.txt"

with open("imagenet_classes.txt", "r") as f:
    categories = [s.strip() for s in f.readlines()]

## What can AlexNet do?

Before we discuss how AlexNet works internally at a deeper level, let us first consider what task it allows us to solve.
AlexNet is a network which has been trained to classify high-resolution images into 1000 possible different classes.
While such a task is quite easy for humans, it is far from trivial for computers to solve.
For instance, it is easy for us to recognize the image we load below as a dog:

In [None]:
url, filename = ("https://github.com/pytorch/hub/raw/master/images/dog.jpg", "dog.jpg")
urllib.request.urlretrieve(url, filename)  # download the example image
input_image = Image.open(filename)
input_image

This ability is a remarkable feature of the visual cortex system in the human brain. Although the image contains an enormous amount of information, our brain can recognize a dog in it after just a split-second glance.

If we want computers to perform the same task, we must first somehow convey the image's content to a computer. A typical way to represent an image is as a collection of pixels, where each pixel stores three numbers: the intensities of red, green, and blue that mix to form its color.

In [None]:
input_image_array = np.asarray(input_image)
image_height, image_width, _ = input_image_array.shape
total_num_pixels = image_height * image_width
total_num_values = total_num_pixels * 3

print(f"Image dimensions: {image_height} x {image_width} pixels")
print(f"Total number of pixels: {total_num_pixels:,}")
print(f"Total number of values: {total_num_values:,}")
print("Image content:")
input_image_array

We want the computer to analyze these 5.6 million numbers and recognize that they collectively represent a dog.
Let us see if AlexNet can do this incredibly difficult task.
For this purpose, let's first load the AlexNet model which has been trained on 1.2 million high-resolution images.

In [None]:
model = backend.load_trained_alexnet(device)

Now that we have loaded AlexNet, let's input the image above and see what it identifies.

In [None]:
input_batch = backend.preprocess_image(input_image, device)

with torch.no_grad():
    model_outputs = model(input_batch)

prediction_probabilities = F.softmax(model_outputs, dim=-1).squeeze()
prediction = prediction_probabilities.argmax().item()
predicted_category = categories[prediction]
print(f"AlexNet: 'This is an image of a {predicted_category}!'.")

That is truly incredible! Not only did AlexNet recognize that the image contains a dog, but it also identified the specific breed — a Samoyed.

Hopefully, this result excites you enough to want to learn more about how AlexNet has achieved this without *any* prior knowledge of the world. Indeed, this is what we will focus on in the remainder of this assignment.

### Question 1
- **[Q1]**: As mentioned, AlexNet was not the first to introduce the idea of learning features directly from images. For instance, Yann LeCun proposed a [similar idea](http://vision.stanford.edu/cs598_spring07/papers/Lecun98.pdf) for document recognition back in 1998. Discuss at least 2 major reasons why approaches prior to AlexNet that used this strategy were not successful on high-resolution images in the ILSVRC contest.

## AlexNet architecture

As mentioned earlier, AlexNet is trained to classify images by learning "features" directly from data. But what exactly do we mean by "features", and how are they learned?

Internally, AlexNet represents these features as "*convolution kernels*", or "*kernels*" for short.
Intuitively, a kernel can be thought of as a simple pattern that is itself represented in an image-like form.
Each convolution kernel corresponds to a specific pattern. 
For instance, one kernel might align with the edge of an object, while another might correspond to a corner.

We can then use these kernels to recognize patterns by sliding them over the image, producing a single number for each location. This number is high when that part of the input image matches the kernel, in other words, when the image contains a pattern similar to the kernel at that location. Mathematically, the number at each location is the *dot product* between the kernel and the image patch underneath it, and sliding the kernel across the whole image is called a *convolution*.

This is illustrated in the animation below, where the input image is represented in red, the kernel in blue, and the output response in purple.

<center width="100%"><img src="https://raw.githubusercontent.com/ddgoede/alexnet-tutorial/main/media/convolution.gif" width="600px"></center>
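
To make the sliding operation concrete, here is a minimal sketch in plain NumPy. It is not AlexNet's actual implementation: the helper `slide_kernel`, the tiny 5x5 grayscale image, and the hand-written edge kernel are all assumptions made for illustration (real AlexNet kernels are learned, larger, and span three color channels).

```python
import numpy as np

def slide_kernel(image, kernel):
    """Slide `kernel` over `image` and store the dot product at every location."""
    kernel_h, kernel_w = kernel.shape
    out_h = image.shape[0] - kernel_h + 1
    out_w = image.shape[1] - kernel_w + 1
    response = np.zeros((out_h, out_w))
    for row in range(out_h):
        for col in range(out_w):
            # Cut out the part of the image currently underneath the kernel.
            patch = image[row:row + kernel_h, col:col + kernel_w]
            # Dot product: multiply element-wise, then add everything up.
            response[row, col] = np.sum(patch * kernel)
    return response

# Hypothetical 5x5 grayscale image: dark (0) on the left, bright (1) on the right.
image = np.array([
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
], dtype=float)

# Hypothetical 3x3 kernel that matches a dark-to-bright vertical edge.
vertical_edge_kernel = np.array([
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
], dtype=float)

response = slide_kernel(image, vertical_edge_kernel)
print(response)
```

The response is large wherever the kernel window straddles the dark-to-bright edge and zero where the window covers only the uniformly bright region, which is exactly the "high number when the pattern matches" behavior described above.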

A single convolution kernel is only able to recognize a single pattern in an image. How does this help us to recognize the wide variety of complex objects that we may encounter in the real world?

Surprisingly, AlexNet achieves this by simply repeating this seemingly straightforward operation many times.
Specifically, AlexNet consists of a sequence of *convolution layers*, as shown in the figure below.
Each layer applies several different learned kernels in parallel to recognize a variety of patterns.
The output of each layer serves as input for the next, allowing AlexNet to combine patterns and construct increasingly complex ones.
After applying a sequence of layers, the learned convolution kernels might respond strongly to more advanced concepts, such as a dog's ear or a bicycle's wheel.

<center width="100%">
    <img src="https://raw.githubusercontent.com/ddgoede/alexnet-tutorial/main/media/cnn-convolution-only.png" width="575px">
</center>

Let's take a look at what the kernels of AlexNet's first layer look like.

In [None]:
backend.visualize_layer_weights(model, layer_index=0)

We can indeed observe that the kernels in AlexNet's first layer can detect simple patterns such as edges, corners, and blobs in the image.

To make a final decision about the type of object an image contains, AlexNet combines the features output by the last convolutional layer and passes them through three "*fully connected layers*" and a "*softmax function*." While we leave out the details for simplicity, it is important to know that the fully connected layers also *learn* how to combine these features to predict what is in the input image.

<center width="100%">
    <img src="https://media.licdn.com/dms/image/D5612AQGOui8XZUZJSA/article-cover_image-shrink_720_1280/0/1680532048475?e=2147483647&v=beta&t=8aodfukDSrrnnxOVSNobKYJtbtSDB7yC83LUky-Ob68" width="1000px">
</center>
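
As a rough illustration of what a fully connected layer computes, the sketch below turns a handful of feature values into one score per class by taking a weighted sum. All names and numbers here are hypothetical: the three features, the two classes, and the weights are made up by hand, whereas AlexNet learns its weights during training and works with thousands of features and 1,000 classes.

```python
import numpy as np

# Hypothetical feature strengths coming out of the last convolution layer:
# [pointy ear, wheel, fur]
features = np.array([1.0, 0.0, 2.0])

# Hypothetical learned weights: one row per class, one column per feature.
weights = np.array([
    [ 2.0, -1.0,  1.0],   # "dog": likes ears and fur, dislikes wheels
    [-1.0,  3.0, -1.0],   # "bicycle": likes wheels, dislikes ears and fur
])
biases = np.array([0.0, 0.5])  # hypothetical learned offsets, one per class

# A fully connected layer computes one weighted sum of all features per class.
scores = weights @ features + biases

for class_name, score in zip(["dog", "bicycle"], scores):
    print(f"{class_name}: {score}")
```

The class whose weights agree best with the observed features receives the highest score. The softmax function then converts these scores into probabilities.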

In addition to convolutional and fully connected layers (the latter are also called linear layers), AlexNet also uses other operations like ReLU and pooling layers. We won't discuss them here, as they are not fundamental to understanding AlexNet at an abstract level.

Returning to our previous example, let's inspect what the output of the last softmax layer looks like.

In [None]:
print(f"The output of the softmax function contains {len(prediction_probabilities)} numbers:")
prediction_probabilities

As we can see, the softmax layer outputs a total of 1,000 numbers between 0 and 1.
Each number represents the predicted probability that the image belongs to a specific object class.
Since most numbers are close to 0, it can be difficult to interpret this output directly.
To better interpret these 1,000 predicted probabilities, we can visualize them as a $25 \times 40$ heatmap:

In [None]:
backend.visualize_prediction(prediction_probabilities, categories)

Here we can see that indeed the softmax outputs 0 for most classes, but for a few classes the probability is noticeably larger than 0.
For instance, the output entry corresponding to the Samoyed class is roughly 0.72, which means that AlexNet is about 72% certain that the image it received as input contains a Samoyed.
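
For readers curious about how raw scores become probabilities, the following is a minimal sketch of the softmax function. The four class names and their scores are hypothetical and are not real AlexNet outputs; the point is only to show that softmax yields positive numbers that sum to 1.

```python
import numpy as np

def softmax(scores):
    """Turn arbitrary scores into probabilities that are positive and sum to 1."""
    # Subtracting the maximum avoids very large exponentials; the result is unchanged.
    shifted = scores - np.max(scores)
    exponentials = np.exp(shifted)
    return exponentials / np.sum(exponentials)

# Hypothetical raw scores for four classes (not real AlexNet outputs).
class_names = ["Samoyed", "wallaby", "truck", "frog"]
scores = np.array([5.0, 3.0, 2.0, -1.0])

probabilities = softmax(scores)
for name, probability in zip(class_names, probabilities):
    print(f"{name:>10}: {probability:.3f}")

# The prediction is simply the class with the highest probability.
best = int(np.argmax(probabilities))
print("Predicted class:", class_names[best])
```

Because of the exponential, a class whose score is only moderately higher than the others ends up with most of the probability, which is one reason why most of the 1,000 entries are close to 0.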

### Question 2
- **[Q2]**: Although AlexNet outputs the highest probability for the Samoyed class, it also assigns a probability above 10% to the wallaby category. Can you think of a possible reason why AlexNet might confuse the Samoyed in the image with a wallaby? Please explain your reasoning from the perspective of convolution kernels, as discussed above.

## Training AlexNet: Learning the Kernels

We have seen how AlexNet can classify any input image by passing it through a sequence of kernels that detect and combine patterns. But how are these kernels learned? The full answer is, unfortunately, quite complex and requires some mathematical background. Instead, we will explain the learning process of AlexNet in an abstract and somewhat simplified way.

The original AlexNet was trained on high-resolution images. However, to keep the computational cost of this assignment manageable, we will use a smaller version of AlexNet and train it on low-resolution images.

Let's start by loading this smaller version of AlexNet and inspecting what its architecture looks like.

In [None]:
model = backend.load_untrained_alexnetmini(device)
backend.print_model_architecture(model)

For simplicity, we replace any layer that we did not cover in detail by `??`. 
We observe a total of 2 convolution layers which learn kernels to detect patterns in the images, and 2 linear layers which learn to combine those patterns to make the prediction.
Note that this is different from the original AlexNet architecture, which contains 5 convolutional layers and 3 linear layers.

Let's inspect what the kernels of the first convolutional layer look like.

In [None]:
backend.visualize_layer_weights(model, 0)

It seems that the kernels in the first layer are just noise. This is because AlexNet-mini has not been trained yet, and the kernels are initialized randomly. To learn meaningful kernels, we need to train AlexNet-mini on a dataset of images.

The original AlexNet was trained on the ImageNet dataset, which contains 1.2 million high-resolution images in 1000 different classes. 
However, training AlexNet on ImageNet is computationally expensive and time-consuming. 
Therefore, we will train AlexNet-mini on a smaller dataset called CIFAR10 instead. 
While ImageNet and CIFAR10 are both classification datasets, CIFAR10 is much smaller and contains only 60,000 low-resolution images in 10 classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck.

Let's load the CIFAR-10 dataset and inspect a few images:

In [None]:
train_loader, test_loader = backend.get_cifar10_dataloaders(visualize_samples=True)

Using the CIFAR10 dataset, we can train AlexNet to recognize patterns in images and classify them into one of 10 different classes. Think of AlexNet as a very smart student who is learning to identify things by looking at many pictures and trying to name what is in them.

Here is how AlexNet learns: Each picture in the CIFAR10 dataset comes with a label, which is the correct answer (for example, "frog" or "truck"). AlexNet starts by making a guess about what it sees in the picture. At first, these guesses might not be very accurate because AlexNet has not learned much yet.

After making a guess, AlexNet compares its answer to the correct label. If the guess is wrong, AlexNet measures how far off it was. This difference between the guess and the correct answer is called the loss. The larger the loss, the more incorrect AlexNet's guess was.
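
The text does not say which loss the training code uses, so the sketch below assumes the cross-entropy loss, a common choice for classification. It looks only at the probability the network assigned to the correct class. The function name and the three example guesses are hypothetical.

```python
import math

def cross_entropy_loss(predicted_probabilities, correct_class):
    """Return -log(probability assigned to the correct class)."""
    return -math.log(predicted_probabilities[correct_class])

# Hypothetical predicted probabilities for the classes [frog, truck, cat].
# The correct label is "frog", which sits at index 0.
correct_class = 0
guesses = {
    "confident and right": [0.90, 0.05, 0.05],
    "unsure":              [0.40, 0.30, 0.30],
    "confident and wrong": [0.05, 0.90, 0.05],
}

losses = {name: cross_entropy_loss(probabilities, correct_class)
          for name, probabilities in guesses.items()}

for name, loss in losses.items():
    print(f"{name:>20}: loss = {loss:.2f}")
```

The loss is close to 0 when the network is confident and right, and grows as less probability is placed on the correct label.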

AlexNet then adjusts its kernels slightly to reduce this loss so that the next time it sees a similar image, it is more likely to get the right answer. By repeating this process many times, AlexNet gradually learns kernels that, together, are capable of recognizing patterns in images which are useful to make accurate predictions about what is in them.
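
The "adjust slightly to reduce the loss" step can be sketched with gradient descent, assuming that is the kind of update the training code performs. To keep the example tiny, a single adjustable number `weight` stands in for all of the kernel values, and a simple squared-error loss is used instead of a classification loss. Every name and number below is hypothetical.

```python
# One adjustable number stands in for all of the network's kernel values.
weight = 0.0
learning_rate = 0.1               # how large each adjustment is
input_value, target = 2.0, 6.0    # the best possible weight here is 3.0

losses = []
for step in range(10):
    guess = weight * input_value
    loss = (guess - target) ** 2  # how far off the guess is
    losses.append(loss)

    # Gradient: how much the loss changes when the weight increases slightly.
    gradient = 2 * (guess - target) * input_value

    # Move the weight a small step in the direction that lowers the loss.
    weight -= learning_rate * gradient

print(f"Final weight: {weight:.4f}")
print(f"First loss: {losses[0]:.2f}, last loss: {losses[-1]:.2e}")
```

Each repetition shrinks the loss a little further, mirroring how repeated small adjustments gradually produce useful kernels.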

We will train AlexNet-mini for 10 epochs, meaning the network will see every training image 10 times. Of CIFAR10's 60,000 images, 50,000 are used for training and the remaining 10,000 are held out as the test set. After each epoch, we will measure the accuracy on this test set. The accuracy is the fraction of test images that the network classifies correctly; an accuracy of 35% means that 35% of the test images received the correct label.

In [None]:
model = backend.train_model(model, train_loader, test_loader, device, num_epochs=10, lr=0.01)
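
The accuracy reported after each epoch can be understood with a small sketch like the one below. The helper `accuracy` and the eight example labels are hypothetical and unrelated to the actual training run.

```python
def accuracy(predicted_labels, correct_labels):
    """Fraction of predictions that match the correct label."""
    num_correct = sum(
        predicted == correct
        for predicted, correct in zip(predicted_labels, correct_labels)
    )
    return num_correct / len(correct_labels)

# Hypothetical labels for eight test images.
correct_labels   = ["frog", "truck", "cat", "dog", "ship", "bird", "deer", "horse"]
predicted_labels = ["frog", "truck", "dog", "dog", "ship", "cat",  "deer", "truck"]

test_accuracy = accuracy(predicted_labels, correct_labels)
print(f"Accuracy: {test_accuracy:.1%}")
```

Here 5 of the 8 guesses match the correct label, so the accuracy is 5/8.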

### Question 3
- **[Q3a]**: What accuracy do you observe before any training?
- **[Q3b]**: Does this accuracy make sense? If so, explain why. If not, describe what accuracy you expected to see and why.
- **[Q3c]**: Describe how the accuracy changes over the course of training.

To conclude this assignment, let's inspect the kernels of the first convolutional layer after training AlexNet on CIFAR10.

In [None]:
backend.visualize_layer_weights(model, 0)

### Question 4
- **[Q4a]**: How did the kernels change after training AlexNet-mini on CIFAR10 for 10 epochs compared to the untrained kernels?
- **[Q4b]**: The kernels obtained after briefly training AlexNet-mini on CIFAR10 look different from those of the original AlexNet trained on ImageNet, which contained clear and recognizable patterns such as edges, blobs, and corners. Despite these differences, the kernels are still effective for classifying images into 10 different classes on CIFAR10. Please explain what types of patterns these kernels might detect that allow AlexNet-mini to make accurate predictions.

## Conclusion

In this assignment, we have taken a closer look at AlexNet, the first end-to-end learned AI system to achieve remarkable success in the ILSVRC. We have seen how AlexNet is able to recognize objects in images by learning features directly from data. We have also seen how AlexNet learns these features by adjusting its kernels to reduce the loss between its guess and the correct label. We hope that this assignment has given you a better understanding of how AlexNet works and has sparked your excitement to delve deeper into the world of AI.