## Chapter 14: Convolutional Neural Networks

# 14.3 Convolutional neural network

# This code cell won't be shown in the HTML version of this notebook.
# import autograd functionalities
import autograd.numpy as np
from autograd import grad as compute_grad   

# import plotting library and other necessities
import matplotlib.pyplot as plt
from matplotlib import gridspec

# import general libraries
import copy
from datetime import datetime 

# this is needed to compensate for matplotlib notebook's tendency to blow up images when plotted inline
%matplotlib notebook
from matplotlib import rcParams
rcParams['figure.autolayout'] = True

import sys
sys.path.append('../../')
from mlrefined_libraries.deeplearning_library_v1 import superlearn_setup

%load_ext autoreload
%autoreload 2

Previously we discussed in detail how to use the convolution operation to extract features based on the edge content of input images. More elaborate variations of this idea, such as HOG [[1]](#bib_cell) and SIFT [[2]](#bib_cell) features, have been used effectively in practice, particularly for object detection. In all of these cases the features are computed using fixed (i.e., pre-defined) kernels. In general, employing this breed of 'engineered' features in place of raw pixel values (see Figure 1) considerably improves the overall performance of supervised/unsupervised learners, as we confirmed empirically in Section 14.2 via a simple face detection experiment.

A simple Convolutional Neural Network (CNN), depicted in the bottom row of Figure 1, differs from the architectures we have seen so far (middle row of Figure 1) in one respect: in addition to the tunable weights of the supervised/unsupervised learner, a CNN also learns the convolutional kernels, with both sets of weights tuned simultaneously using the training data.

<figure>
<img src="../../mlrefined_images/convnet_images/raw_fixed_CNN.png" width="80%" height="auto"/>
<figcaption> <strong>Figure 1:</strong> <em> Three simple pipelines for machine learning tasks involving images. (top row) Using raw pixel values as input features generally leads to subpar results. (middle row) A fixed convolutional layer (consisting of convolution, ReLU, and pooling modules) is inserted between the input image and the MLP module. This is the pipeline used in Section 14.2. (bottom row) With CNNs the kernels in the convolutional layer are also tuned along with the MLP weights. The modules involving tunable weights are colored yellow. </em>
</figcaption>
</figure>

Recall from our discussion of multilayer perceptrons that we can write the ```model``` associated with a general $L$-layer MLP as

\begin{equation}
\text{model}\left(\mathbf{x},\mathbf{w}\right) = \mathbf{w}_{L+1}^T\mathring{\mathbf{f}}^{(L)}
\end{equation}

where $\mathring{\mathbf{f}}^{(L)}$, as discussed in Section 13.1, is formed by tacking a $1$ on top of the feature transformation  

\begin{equation}
\mathbf{f}^{(L)} = \mathbf{a}\left(\mathbf{W}_{L}^T\mathring{\mathbf{a}}\left(\mathbf{W}_{L-1}^T\,\mathring{\mathbf{a}}\left( \cdots \mathbf{W}_{1}^T\mathring{\mathbf{x}}\right) \right)\right)
\end{equation}
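Equations (1) and (2) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in this book's libraries: the function names `feature_transforms` and `model`, the choice of `tanh` as the activation $\mathbf{a}$, and the layer sizes in the usage lines are all hypothetical. Plain NumPy is used here; only basic array operations appear.

```python
import numpy as np

def feature_transforms(x, weights, activation=np.tanh):
    """Compute f^(L) of Equation (2) for a single input vector x.

    weights is a list [W_1, ..., W_L]. Each W_l has one extra row
    (the first) that multiplies the 1 tacked on top of its input.
    The tanh activation is an assumption made for illustration.
    """
    a = x
    for W in weights:
        a_ring = np.concatenate(([1.0], a))  # tack a 1 on top of the input
        a = activation(W.T @ a_ring)         # a(W_l^T a_ring)
    return a

def model(x, w):
    """Equation (1), with w = [W_1, ..., W_L, w_{L+1}]."""
    f = feature_transforms(x, w[:-1])
    f_ring = np.concatenate(([1.0], f))      # tack a 1 on top of f^(L)
    return w[-1].T @ f_ring

# Usage: a 2-layer MLP on a 4-dimensional input (sizes chosen arbitrarily)
rng = np.random.default_rng(0)
w = [rng.standard_normal((5, 3)),   # W_1: (4 + 1) x 3
     rng.standard_normal((4, 2)),   # W_2: (3 + 1) x 2
     rng.standard_normal(3)]        # w_3: (2 + 1)
output = model(rng.standard_normal(4), w)
```

The final call returns a single scalar, since $\mathbf{w}_{L+1}$ is a vector rather than a matrix.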

Hence, to write the ```model``` associated with a simple CNN algebraically, all we need to do is replace the input $\mathbf{x}$ in Equation (2) with the output of the convolutional layer, given as

\begin{equation}
\text{pool}\left(\text{max}\left(0,\mathbf{W}_{conv}*\mathbf{x}\right)\right) 
\end{equation}

where $\mathbf{W}_{conv}$ represents all convolutional kernel weights to be tuned together with MLP weights $\mathbf{W}_{1},\ldots,\mathbf{W}_{L}$ and $\mathbf{w}_{L+1}$.
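The forward pass of the convolutional layer in Equation (3) can be sketched as follows. This is a rough illustration only: the helper names (`convolve2d_valid`, `max_pool`, `conv_layer`), the use of max pooling with a $2 \times 2$ window and stride $2$, and the image and kernel sizes are assumptions, not details taken from the text. As is common in deep learning code, the kernel is slid over the image without flipping. The loops write into pre-allocated arrays, which keeps the sketch readable but is not a form that autograd can differentiate.

```python
import numpy as np

def convolve2d_valid(image, kernel):
    """Slide the kernel over the image with no padding and stride 1."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(feature_map, size=2, stride=2):
    """Max pooling with no padding (window size and stride are assumptions)."""
    out_h = (feature_map.shape[0] - size) // stride + 1
    out_w = (feature_map.shape[1] - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            out[i, j] = np.max(feature_map[r:r + size, c:c + size])
    return out

def conv_layer(image, kernels):
    """pool(max(0, W_conv * x)), flattened into a single feature vector."""
    maps = [max_pool(np.maximum(0, convolve2d_valid(image, k)))
            for k in kernels]
    return np.concatenate([m.ravel() for m in maps])

# Usage: an 8 x 8 image and two 3 x 3 kernels (illustrative values)
image = np.arange(64, dtype=float).reshape(8, 8)
kernels = [np.ones((3, 3)), -np.ones((3, 3))]
features = conv_layer(image, kernels)
```

Each kernel here yields a $6 \times 6$ feature map that pooling reduces to $3 \times 3$, so `features` has $2 \times 9 = 18$ entries. It is this vector that takes the place of $\mathbf{x}$ as input to the MLP.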

Because the convolutional kernels in a CNN are learned from the data, we should intuitively expect CNNs to outperform their fixed-kernel counterparts, a hypothesis we now put to the test through a simple experiment.

#### <span style="color:#a50e3e;">Example 1. </span>  CNN vs. fixed

Recall that the pooling module was first introduced in Section 14.2, primarily to reduce the dimension of the convolutional feature maps, one of which is produced per convolutional kernel. This reduction in size is due in part to the lack of padding with pooling but, more importantly, to the use of stride values greater than $1$. In the spirit of simplicity, it is worthwhile to explore whether we can achieve dimension reduction during convolution itself by choosing stride values greater than $1$, thereby forgoing the pooling process completely.
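The arithmetic behind this idea can be checked with a short sketch. With no padding, a window of size $K$ moved with stride $S$ over an input of size $N$ produces $\left\lfloor (N-K)/S \right\rfloor + 1$ outputs along each dimension. The specific numbers below (a $28 \times 28$ image, a $3 \times 3$ kernel, and a $2 \times 2$ pooling window with stride $2$) are illustrative assumptions, as are the function names.

```python
import numpy as np

def output_size(input_size, window_size, stride):
    """Outputs along one dimension when no padding is used."""
    return (input_size - window_size) // stride + 1

def strided_convolve2d(image, kernel, stride):
    """Slide the kernel over the image with the given stride, no padding."""
    kh, kw = kernel.shape
    out_h = output_size(image.shape[0], kh, stride)
    out_w = output_size(image.shape[1], kw, stride)
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            out[i, j] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

# Stride-1 convolution followed by 2 x 2 pooling with stride 2
conv_then_pool = output_size(output_size(28, 3, 1), 2, 2)

# Stride-2 convolution with no pooling at all
strided_only = output_size(28, 3, 2)

feature_map = strided_convolve2d(np.ones((28, 28)), np.ones((3, 3)), stride=2)
```

For these particular sizes both routes give a $13 \times 13$ feature map, so a strided convolution can match the dimension reduction of pooling. Whether it matches the accuracy is a separate, empirical question.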

#### <span style="color:#a50e3e;">Example 2. </span> To pool or not to pool?

<a id='bib_cell'></a>

## References

[1] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), volume 1, pp. 886–893. IEEE, 2005.

[2] David G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.