# Lecture 3.15: Computer Vision Pt. 1

[**Lecture Slides**](https://docs.google.com/presentation/d/1peBshx2Ift4UGNylVcC5pvroWx883h8mklruR8bEMzM/edit?usp=sharing)

This lecture, we are going to train a Convolutional Neural Network (CNN) in pytorch.

**Learning goals:**
- convert images to pytorch-ready `Tensor`s using `pillow`
- create a CNN
- train a CNN
- debug a CNN by printing out layer input/output sizes
- plot a loss curve per epoch
- understand how CNN width vs depth affects optimization

## 1. Introduction


This notebook can be run locally with jupyter, or on [Google colab](https://colab.research.google.com/github/camille-vanhoffelen/introduction-to-machine-learning/blob/master/data_analysis/lecture3.15/computer_vision_pt.1.ipynb). Either works because the `device` chosen below uses a GPU when one is available, and falls back to the CPU otherwise. Try to compare the training speeds of CPU & GPU runtimes! 🏎

In [0]:
import torch

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
device

The [Unicode Consortium](https://en.wikipedia.org/wiki/Unicode_Consortium) has contacted us and needs our help. ☎️ They receive too many new emoji proposals, and cannot keep track of all of them. In order to choose the next emojis, they wish to _classify_ the submissions into three groups: `face`, `flag`, and `animal`.

Let's make a Convolutional Neural Network to classify emojis. 🔥

## 2. Data Munging

The consortium has provided a _training dataset_ of emoji images. These are stored in a public [Amazon S3](https://docs.aws.amazon.com/AmazonS3/latest/dev/Welcome.html) bucket. We can download them to our local/cloud environment with [`wget`](https://www.gnu.org/software/wget/):

In [0]:
!wget https://introduction-to-machine-learning-ilia-university.s3.eu-west-2.amazonaws.com/emojis.tar.gz

The emojis are packaged in a [`tar`](https://en.wikipedia.org/wiki/Tar_(computing)) compressed archive, which can be extracted with:

In [0]:
!tar -xf emojis.tar.gz
!ls emojis

The archive contains 2 directories, `test` & `train`. For now, we are interested in the `train` folder:

In [0]:
!ls emojis/train

The training data is split in three directories corresponding to each class: `animals`, `faces`, `flags`.

Let's have a look at the images:

In [0]:
from PIL import Image

img = Image.open('emojis/train/faces/42.png')
print(f'image size: {img.size}')
img

It looks like we are dealing with 64x64 grayscale images.

We wish to load these independently into 3 lists of `ndarray`s, one for each class. Check out lecture 2.5 if you'd like a refresher on `pillow` and how to convert images to NumPy arrays:

In [0]:
import glob
import numpy as np

def load_imgs(input_dir):
  paths = glob.glob(input_dir + '*.png')
  imgs = [Image.open(path) for path in paths]
  return [np.array(img).reshape(1, 64, 64) for img in imgs]


faces_dir = "emojis/train/faces/"
faces_features = load_imgs(faces_dir)

flags_dir = "emojis/train/flags/"
flags_features = load_imgs(flags_dir)

animals_dir = "emojis/train/animals/"
animals_features = load_imgs(animals_dir)

faces_features[0].shape

We created 1x64x64 `ndarrays` because convention is to place the [channel](https://en.wikipedia.org/wiki/Channel_(digital_image)) dimension before the height & width. Our images are grayscale, so this channel dimension is 1.
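To make the channel convention concrete, here is a small sketch using synthetic arrays rather than the emoji files. The array names are made up for illustration. It shows that a grayscale image only needs a leading channel dimension of 1, while a color image converted with `np.array()` comes out as height x width x channel and would need its axes reordered:

```python
import numpy as np

# Synthetic stand-ins for images (assumption: not the real emoji files).
gray = np.zeros((64, 64), dtype=np.uint8)    # height x width
rgb = np.zeros((64, 64, 3), dtype=np.uint8)  # height x width x channel

# Grayscale: add a leading channel dimension of size 1.
gray_chw = gray.reshape(1, 64, 64)

# Color: move the channel axis to the front (HWC => CHW).
rgb_chw = rgb.transpose(2, 0, 1)

print(gray_chw.shape)  # (1, 64, 64)
print(rgb_chw.shape)   # (3, 64, 64)
```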

We now wish to turn these _feature matrices_ into _examples_ by matching them with a _label_. To do so, we create integer labels corresponding to our classes:
- faces: 0
- flags: 1
- animals: 2

We then use [`zip()`](https://realpython.com/python-zip-function/) 🤐 to combine the features and labels just like last lecture:

In [0]:
def label(features, class_index):
  labels = np.full(len(features), class_index, dtype=np.int8)
  return list(zip(features, labels))
    
faces_examples = label(faces_features, 0)
flags_examples = label(flags_features, 1)
animals_examples = label(animals_features, 2)

examples = faces_examples + flags_examples + animals_examples

print(f"feature shape: {examples[0][0].shape}")
print(f"label value: {examples[0][1]}")

We have turned the emojis into a collection of examples: pairs of feature matrices and labels. Let's get training! 🏋️‍♀️

## 3. Training

### 3.1 CNN Architecture

We are going to create a Convolutional Neural Network class called `ConvNet` with the same process as lecture 3.14. We extend the `nn.Module` and implement the `.forward()` method. Layers are initialized in the `ConvNet` `__init__()` constructor.

Since we are _still_ lazy, we use pytorch's [`Conv2d`](https://pytorch.org/docs/master/generated/torch.nn.Conv2d.html) and [`MaxPool2d`](https://pytorch.org/docs/master/generated/torch.nn.MaxPool2d.html) layers. Familiarize yourself with their arguments:
- `in_channels` is the _depth_ of the input volume
- `out_channels` is the _depth_ of the output volume (# of filters)
- `kernel_size` is the height & width of each filter (its receptive field)
- you should recognize `stride` and `padding` from the lecture slides

Fully connected layers use `nn.Linear` as in lecture 3.14.
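The arguments above determine the height & width of a layer's output. As a rough sketch, the helper below applies the usual output-size formula along one dimension, assuming no dilation. `conv_output_size` is a name introduced here for illustration, not a pytorch function:

```python
def conv_output_size(size, kernel_size, stride=1, padding=0):
    """Output height (or width) of a conv/pool layer along one dimension.

    Assumes dilation of 1, which is the default in pytorch.
    """
    return (size + 2 * padding - kernel_size) // stride + 1


# 3x3 convolution, stride 1, padding 1: the size is preserved.
print(conv_output_size(64, kernel_size=3, stride=1, padding=1))  # 64

# 2x2 max pooling, stride 2, no padding: the size is halved.
print(conv_output_size(64, kernel_size=2, stride=2, padding=0))  # 32
```

Applying it layer by layer is a quick way to check the shape comments in the network code without running anything.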

In [0]:
import torch
import torch.nn.functional as F


class ConvNet(torch.nn.Module):

    def __init__(self, verbose=False):
        super(ConvNet, self).__init__()
  
        self.verbose = verbose
        # 1x64x64 => 8x64x64
        self.conv_1 = torch.nn.Conv2d(in_channels=1,
                                      out_channels=8,
                                      kernel_size=(3, 3),
                                      stride=(1, 1),
                                      padding=1)
        # 8x64x64 => 8x32x32
        self.pool_1 = torch.nn.MaxPool2d(kernel_size=(2, 2),
                                         stride=(2, 2),
                                         padding=0)
        # 8x32x32 => 16x32x32
        self.conv_2 = torch.nn.Conv2d(in_channels=8,
                                      out_channels=16,
                                      kernel_size=(3, 3),
                                      stride=(1, 1),
                                      padding=1)
        # 16x32x32 => 16x16x16                             
        self.pool_2 = torch.nn.MaxPool2d(kernel_size=(2, 2),
                                         stride=(2, 2),
                                         padding=0)
        
        # 16x16x16 => 32x16x16
        self.conv_3 = torch.nn.Conv2d(in_channels=16,
                                      out_channels=32,
                                      kernel_size=(3, 3),
                                      stride=(1, 1),
                                      padding=1)
        
        # 32x16x16 => 32x8x8
        self.pool_3 = torch.nn.MaxPool2d(kernel_size=(2, 2),
                                         stride=(2, 2),
                                         padding=0)
        
        # 2048 => 64
        self.linear_1 = torch.nn.Linear(8*8*32, 64)
        # 64 => 3
        self.linear_2 = torch.nn.Linear(64, 3)

        
        
    def forward(self, x):
      x = F.relu(self.conv_1(x))
      x = self.pool_1(x)

      x = F.relu(self.conv_2(x))
      x = self.pool_2(x)

      x = F.relu(self.conv_3(x))
      x = self.pool_3(x)
      
      # flatten
      x = x.view(-1, 8*8*32)

      x = F.relu(self.linear_1(x))
      
      logits = self.linear_2(x)
      return logits


🧠 Try to track the sizes of the neuron volumes throughout the convolutional network.

🧠 How do convolutional layers maintain their input's height & width? How do pooling layers halve their input's height & width?

Two things might stand out when looking at the `ConvNet` code above:

**flatten:**

Notice the `# flatten` operation between the last pooling layer and the first fully connected layer. This is necessary because convolutional & pooling layers deal with input _volumes_, whereas dense layers operate on _vectors_. We therefore reshape the activations into a 1D `Tensor`, which is called _flattening_. This doesn't change the tensor values, only their spatial arrangement.
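Here is a minimal sketch of flattening on a random stand-in tensor (not real activations). It also shows `torch.flatten`, which is not used in the lecture code but should give the same result as the `.view()` call:

```python
import torch

# Random stand-in for the output of the last pooling layer:
# a batch of 4 volumes with 32 channels of 8x8 activations.
x = torch.randn(4, 32, 8, 8)

flat = x.view(-1, 8 * 8 * 32)             # same call as in ConvNet.forward
flat_alt = torch.flatten(x, start_dim=1)  # keeps the batch dimension

print(flat.shape)                              # torch.Size([4, 2048])
print(torch.equal(flat, flat_alt))             # True
print(torch.equal(flat[0], x[0].reshape(-1)))  # True: values are unchanged
```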

**logits:**

We've seen `logits` before but haven't formally defined them. [Logits](https://developers.google.com/machine-learning/glossary/#logits) are the unbounded outputs of a linear layer that have yet to be fed into a sigmoid/softmax function, i.e. they are scalar values that are to be transformed into probabilities.

Last lecture, we used the [`BCEWithLogitsLoss`](https://pytorch.org/docs/stable/nn.html#bcewithlogitsloss) loss, which required returning the _logits_ as opposed to the sigmoid _probabilities_ as output of our neural network. For this multi-class classification problem, we use the [`CrossEntropyLoss`](https://pytorch.org/docs/master/generated/torch.nn.CrossEntropyLoss.html) which works in the same way: we don't apply the softmax ourselves, it is already included in the loss to improve numerical stability. 🧘
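To see that the softmax really is included in the loss, the sketch below compares `CrossEntropyLoss` with a manual log-softmax followed by the negative log-likelihood loss. The logits and labels are made-up values for illustration:

```python
import torch
import torch.nn.functional as F

# Made-up logits for a batch of 2 examples and 3 classes.
logits = torch.tensor([[2.0, -1.0, 0.5],
                       [0.1, 0.2, 3.0]])
labels = torch.tensor([0, 2])  # class indices, dtype long by default

# What CrossEntropyLoss does in one call...
loss = torch.nn.CrossEntropyLoss()(logits, labels)

# ...matches log-softmax followed by the negative log-likelihood loss.
loss_manual = F.nll_loss(F.log_softmax(logits, dim=1), labels)

print(torch.allclose(loss, loss_manual))  # True
```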

### 3.2 Debugging

Pytorch's dynamic computation graph makes neural network code easy to debug. One can step through the execution, and investigate outputs during runtime. 

To showcase this, we rewrote our CNN and added optional `print()` statements between every single layer. Creating a `ConvNet` with `verbose=True` will then print the tensor shapes flowing through the network.

This is only for demonstration purposes and is a bad idea in general for two reasons:
- the `ConvNet` code has become messy
- once the code works, we don't need the option to print out shapes for every single `.forward()` pass

Instead, data scientists typically use [debuggers](https://docs.python.org/3/library/pdb.html). But now we know that we can add print statements everywhere if we feel like it! 😤
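Another option, not used in this lecture, is to attach _forward hooks_ to the layers, which keeps the model code itself free of print statements. The sketch below uses a small stand-in `Sequential` model rather than our `ConvNet`, and `record_shape` is a name made up for illustration:

```python
import torch

# Small stand-in model (assumption: not the lecture's ConvNet).
model = torch.nn.Sequential(
    torch.nn.Conv2d(1, 8, kernel_size=3, padding=1),
    torch.nn.ReLU(),
    torch.nn.MaxPool2d(kernel_size=2, stride=2),
)

shapes = []

def record_shape(module, inputs, output):
    # Called by pytorch after each forward pass of the hooked module.
    shapes.append((module.__class__.__name__, tuple(output.shape)))

# Attach the hook to every layer, keeping the handles to remove them later.
handles = [layer.register_forward_hook(record_shape) for layer in model]

with torch.no_grad():
    model(torch.zeros(1, 1, 64, 64))

for name, shape in shapes:
    print(f"{name} output: {shape}")

# Detach the hooks once debugging is done.
for handle in handles:
    handle.remove()
```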

In [0]:
import torch
import torch.nn.functional as F


class ConvNet(torch.nn.Module):

    def __init__(self, verbose=False):
        super(ConvNet, self).__init__()
  
        self.verbose = verbose
        # 1x64x64 => 8x64x64
        self.conv_1 = torch.nn.Conv2d(in_channels=1,
                                      out_channels=8,
                                      kernel_size=(3, 3),
                                      stride=(1, 1),
                                      padding=1)
        # 8x64x64 => 8x32x32
        self.pool_1 = torch.nn.MaxPool2d(kernel_size=(2, 2),
                                         stride=(2, 2),
                                         padding=0)
        # 8x32x32 => 16x32x32
        self.conv_2 = torch.nn.Conv2d(in_channels=8,
                                      out_channels=16,
                                      kernel_size=(3, 3),
                                      stride=(1, 1),
                                      padding=1)
        # 16x32x32 => 16x16x16                             
        self.pool_2 = torch.nn.MaxPool2d(kernel_size=(2, 2),
                                         stride=(2, 2),
                                         padding=0)
        
        # 16x16x16 => 32x16x16
        self.conv_3 = torch.nn.Conv2d(in_channels=16,
                                      out_channels=32,
                                      kernel_size=(3, 3),
                                      stride=(1, 1),
                                      padding=1)
        
        # 32x16x16 => 32x8x8
        self.pool_3 = torch.nn.MaxPool2d(kernel_size=(2, 2),
                                         stride=(2, 2),
                                         padding=0)
        
        # 2048 (flattened 32x8x8) => 64
        self.linear_1 = torch.nn.Linear(8*8*32, 64)
        # 64 => 3
        self.linear_2 = torch.nn.Linear(64, 3)

        
        
    def forward(self, x):
      if self.verbose: 
        print(f"conv_1 input: {repr(x.shape)}")
      x = F.relu(self.conv_1(x))
      if self.verbose: 
        print(f"conv_1 output: {x.shape}")

      x = self.pool_1(x)
      if self.verbose: 
        print(f"pool_1 output: {x.shape}")

      x = F.relu(self.conv_2(x))
      if self.verbose: 
        print(f"conv_2 output: {x.shape}")

      x = self.pool_2(x)
      if self.verbose: 
        print(f"pool_2 output: {x.shape}")

      x = F.relu(self.conv_3(x))
      if self.verbose: 
        print(f"conv_3 output: {x.shape}")

      x = self.pool_3(x)
      if self.verbose: 
        print(f"pool_3 output: {x.shape}")
        
      x = F.relu(self.linear_1(x.view(-1, 8*8*32)))
      if self.verbose: 
        print(f"linear_1 output: {x.shape}")

      logits = self.linear_2(x)
      if self.verbose: 
        print(f"linear_2 output: {logits.shape}")
      return logits


To test the verbose output, we fetch and convert the first example to a pytorch-ready `Tensor`. Notice the `Tensor` is reshaped to 1x1x64x64, the first dimension standing for the _batch size_ (which is 1 in our case). 

We use the `torch.no_grad()` context to let pytorch know that it doesn't need to keep track of `.grad_fn`, even if `Tensor`s were built with `requires_grad=True`. This speeds up predictions and lowers their memory consumption. Always use `torch.no_grad()` if you aren't planning to `.backward()` through the `Tensor`s!
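A tiny sketch of what `torch.no_grad()` changes, using a made-up tensor `w` that would normally be tracked by autograd:

```python
import torch

# A tensor that would normally be tracked by autograd.
w = torch.ones(3, requires_grad=True)

y_tracked = (w * 2).sum()
print(y_tracked.requires_grad)        # True
print(y_tracked.grad_fn is not None)  # True

with torch.no_grad():
    y_untracked = (w * 2).sum()

print(y_untracked.requires_grad)  # False
print(y_untracked.grad_fn)        # None
```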

In [0]:
with torch.no_grad():
  features, _ = examples[0]  # the label is unused here; `_` avoids shadowing label()
  features = torch.tensor(features).float().view(1, 1, 64, 64)
  net = ConvNet(verbose=True)
  net(features)

The print statements worked, and we can clearly see the effects of the convolutional and pooling layers on the activation volumes. 

### 3.3 Optimization

Let's train our `ConvNet`! First, we create a `DataLoader` to shuffle and iterate through batches 🔀:

In [0]:
from torch.utils.data import DataLoader
batch_size = 16
data_loader = DataLoader(dataset=examples, 
                          batch_size=batch_size, 
                          shuffle=True)

We can then initialize our network, optimizer, and criterion. Notice that we're sending the network parameters to the runtime's `device`, whether this notebook is run on a CPU or a GPU:

In [0]:
net = ConvNet()
net = net.to(device)

optimizer = torch.optim.Adam(net.parameters(), lr=0.001)
criterion = torch.nn.CrossEntropyLoss()

We are now ready to loop through the dataset and train the model parameters. This code is almost the same as last lecture's banknote classifier, with the exception of:
- `loss_per_epoch` is an example of how to keep track of the average loss for each epoch
- `labels` have `dtype=long` because our `CrossEntropyLoss` expects categorical whole numbers; the `0`, `1`, & `2` corresponding to our classes (see [documentation](https://pytorch.org/docs/master/generated/torch.nn.CrossEntropyLoss.html)). 
- `start_time` & `stop_time` make it easy to compare CPU and GPU training times, if this notebook is run in different environments

In [0]:
import time 
    
torch.manual_seed(1337)
np.random.seed(666)

start_time = time.time()    
loss_per_batch = []
loss_per_epoch = []

for epoch in range(20):

    running_losses = []
    for batch, data in enumerate(data_loader):
        
        features, labels = data
        features = features.float().to(device)
        labels = labels.long().to(device)

        logits = net(features)
        loss = criterion(logits, labels)
        loss_per_batch.append(loss.item())
        running_losses.append(loss.item())
        
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    mean_epoch_loss = np.array(running_losses).mean()
    print(f"epoch: {epoch}, loss: {mean_epoch_loss:.6f}")
    loss_per_epoch.append(mean_epoch_loss)

stop_time = time.time()
print(f'total training time: {stop_time - start_time:.1f} s')
          

🧠 Take the time to understand each step of the loop above. Check out lecture 3.14 for a refresher.

Our CNN was trained! 🕺 It looks like the loss was successfully minimized. Let's visualize some loss curves to know more:

In [0]:
import matplotlib.pyplot as plt

def plot_loss_curves(loss_per_batch, loss_per_epoch, ylim=(-0.1, 1)):
  fig = plt.figure(figsize=(12, 4))
  ax1 = fig.add_subplot(121)
  ax1.plot(loss_per_batch)
  ax1.set_ylim(ylim)
  ax1.set_ylabel('loss')
  ax1.set_xlabel('batch')
  ax1.set_title('Loss Curve')

  ax2 = fig.add_subplot(122)
  ax2.plot(loss_per_epoch)
  ax2.set_ylim(ylim)
  ax2.set_ylabel('loss')
  ax2.set_xlabel('epoch')
  ax2.set_title('Loss Curve')

plot_loss_curves(loss_per_batch, loss_per_epoch)

The loss curve per _epoch_ displays the same information as the per _batch_ graph, except the values were averaged across batches. This should be clear from the training loop code above (see `running_losses`). Loss curves are typically shown per epoch for deep neural networks, as these can routinely take > 100 epochs to converge.
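As a rough illustration of that relationship, the per-epoch values can be recovered from the per-batch values by averaging consecutive chunks. `mean_per_epoch` and the toy losses below are made up for this sketch, and it assumes every epoch has the same number of batches:

```python
def mean_per_epoch(batch_losses, batches_per_epoch):
    """Average consecutive chunks of batch losses, one chunk per epoch."""
    means = []
    for start in range(0, len(batch_losses), batches_per_epoch):
        chunk = batch_losses[start:start + batches_per_epoch]
        means.append(sum(chunk) / len(chunk))
    return means


# Made-up losses: 2 epochs of 3 batches each.
toy_losses = [0.9, 0.7, 0.8, 0.4, 0.3, 0.2]
print(mean_per_epoch(toy_losses, batches_per_epoch=3))
```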

The curves suggest that our model was trained successfully. 😎

🧠 Why is that?

### 3.4 Exercises

We are curious about CNN architectures... there are so many ways of arranging and sizing the different layers! In particular, we want to know about the power of _width_ vs _depth_. If we put all 32 filters in a single layer, how will this affect the neural network's optimization? 🤔

💪💪💪 Train a convolutional net with a single convolutional + pooling layer pair, and plot its loss curve.
- create a new CNN class called `WideConvNet`
- use one convolutional layer with 32 filters, followed by a standard pooling layer
- adjust the input size of your first fully connected layer accordingly
- remember to update both the `__init__` and `.forward()` methods
- create an instance of `WideConvNet` called `wide_net` (not `net`, or this will affect section 4!)
- train the `wide_net` with the same training loop as above
- store the losses in `loss_per_batch` and `loss_per_epoch`, since the cell below plots them before running the unit test
- tip: keep track of the shapes of the volumes flowing between the layers. You can always print them out!

In [0]:
# INSERT YOUR CODE HERE

In [0]:
def test_wide_conv_net():
  with torch.no_grad():
    wide_net = WideConvNet()
    named_params = list(wide_net.named_parameters())
    n_params = len(named_params)
    assert n_params == 6, f"Expected 6 parameter tensors (weight + bias for 1 conv and 2 linear layers), but got {n_params}. Are you using 1 conv, 1 pool, and 2 linear layers?"
    n_filters = named_params[1][1].size()[0]
    assert n_filters == 32 , f"Expected 32 convolutional filters, but got {n_filters}"
    linear_1_inputs = named_params[2][1].size()[1]
    assert linear_1_inputs == 32768 , f"Expected 32768 inputs for linear_1, but got {linear_1_inputs}"
    print('Success! 🎉')

plot_loss_curves(loss_per_batch, loss_per_epoch, ylim=(-0.1, 10))
test_wide_conv_net()

🧠🧠 The loss curve is significantly different to our first CNN architecture. What changed? What does this suggest about this trained model? 

## 4. Prediction

Our CNN is trained, and the loss function converged, but we aren't quite convinced of our model's ability to classify emojis. We don't want to send a faulty model to the Unicode Consortium, the consequences would be disastrous! 🙀 In the `test` data directory is the rainbow flag. Let's put our model to the test.

In [0]:
x_img = Image.open("emojis/test/pride.png")
x_img

The rainbow flag is colorful 🌈 but we need it to be in grayscale. Remember: always apply the same transformations to the prediction data as the training data!

We use [pillow](https://pillow.readthedocs.io/en/stable/) to convert the image to grayscale, then reshape it, and create a `Tensor`:

In [0]:
x_img_gray = x_img.convert('L')
x_arr = np.array(x_img_gray).reshape(1, 1, 64, 64)
x_predict = torch.tensor(x_arr, dtype=torch.float32)

x_predict.shape

The batch size, channel, height, and width dimensions are present, so let's send the feature `Tensor` to the `device`, since our `net` parameters are already located there:

In [0]:
x_predict = x_predict.to(device)

We can now feed the features to our CNN. We use the `torch.no_grad()` context since we don't plan to call `.backward()` on the output:

In [0]:
with torch.no_grad():
  logits = net(x_predict)
  print(logits)

Recall that we used the `CrossEntropyLoss()` which incorporates the softmax operation. This means that our model returns _logits_, and we have to apply the softmax ourselves:

In [0]:
with torch.no_grad():
  probas = torch.softmax(logits, dim=1)
  print(probas)

This is a _probability vector_ which tells us the likelihood of this emoji belonging to each class: `0`, `1`, or `2`. Our CNN seems _very_ confident here, but often these probabilities are more even. They always add up to 1 though! We can find the most likely class with an argmax operation:

In [0]:
with torch.no_grad():
  _, y_predict = torch.max(probas, 1)
  print(y_predict.item())

Remember that we defined class integers as:
- faces: 0
- flags: 1
- animals: 2
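The whole chain from logits to class name can be sketched in plain Python. The logits below are made up for illustration (they are not our model's output), and `class_names`, `toy_logits`, and `softmax` are names introduced here:

```python
import math

class_names = ["face", "flag", "animal"]  # list index = class integer


def softmax(logits):
    # Subtract the max before exponentiating for numerical stability.
    largest = max(logits)
    exps = [math.exp(value - largest) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


toy_logits = [-1.2, 4.0, 0.3]  # made-up values, not real model output
toy_probas = softmax(toy_logits)

# Argmax: index of the largest probability.
predicted_index = toy_probas.index(max(toy_probas))
print(class_names[predicted_index])  # flag
```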

Our model identified the emoji as a flag! 🏳️‍🌈 We fed in an image that our CNN had _never seen before_, and it correctly classified it. The power of machine learning... 🧙‍♀️

Let's test our model even further, with emojis that the world has never seen. 😮 Create your own emoji submission to the Unicode Consortium, and then check if our CNN correctly classifies it.

💪💪 Make a face emoji [here](https://emoji-maker.com/designer), save it in your local directory / upload it to your Google colab environment (by selecting "upload" on the menu on the left hand side). Then check our trained CNN's class prediction.
- you might need to [crop](https://pillow.readthedocs.io/en/stable/reference/Image.html?highlight=crop#PIL.Image.Image.crop) the image to make it square
- preprocess the image in the same way as we did during training: our CNN expects a 64x64 grayscale image
- don't forget about batch size and channel dimensions
- use the `torch.no_grad()` context
- feel free to message me your creation / classification, I'm curious about the results!

In [0]:
# INSERT YOUR CODE HERE

## 5. Summary


Today, we learned about **Convolutional Neural Networks**. We defined **computer vision** as the field specializing in image data analysis. We pointed out a few standout quirks of image data: **high dimensionality**, **translational invariance**, and **local redundancy**. We discovered **convolutional neural layers**, and showed how they leverage translational invariance to **share model parameters** across sliding input windows. This helps with the **computational complexity** of modeling so many feature dimensions. We then highlighted how **pooling layers** further improve network efficiency in the deeper layers by **downsampling** activation tensors, and discarding **locally redundant** information. When stacked with **fully connected layers**, these form **convolutional neural networks**. We listed some CNN **design conventions**, but noted that these architecture choices are **complex**. Finally, we helped the Unicode Consortium by implementing and training our own CNN **image classifier** in pytorch, to identify `face`, `flag`, and `animal` emojis.

# Resources

## Core Resources

- [cs231n](https://cs231n.github.io/convolutional-networks/)  
Classic course on convolutional neural networks
- [introduction to CNNs](https://victorzhou.com/blog/intro-to-cnns-part-1/)  
Excellent visual blogpost which implements a CNN from scratch with NumPy

## Additional Resources

- [Deep learning school - DL for CV](https://youtu.be/u6aEYuemt0M)  
Karpathy lecture explaining CNN architectures
- [Convolution explained](https://youtu.be/N-zd-T17uiE)  
Youtube video detailing the maths of convolution
- [Translational invariance in CNNs](https://stats.stackexchange.com/questions/208936/what-is-translation-invariance-in-computer-vision-and-convolutional-neural-netwo)  
Stackexchange thread outlining how convolutions relate to translational invariance & equivariance
- [how to choose MNIST CNN architecture](https://www.kaggle.com/cdeotte/how-to-choose-cnn-architecture-mnist)  
Kaggle kernel trying different CNN architectures on the MNIST dataset
- [PyTorch implementations of CNNs](https://nbviewer.jupyter.org/github/rasbt/deeplearning-models/tree/master/pytorch_ipynb/cnn/)  
Pytorch CNNs from the deeplearning models repository
