# Convolutional Neural Network

## Neurons in Human Vision

The human sense of vision is unbelievably advanced. Within fractions of seconds, we can identify objects within our field of view, without thought or hesitation. Not only can we name objects we are looking at, we can also perceive their depth, perfectly distinguish their contours, and separate the objects from their backgrounds. Somehow our eyes take in raw voxels of color data, but our brain transforms that information into more meaningful primitives—lines, curves, and shapes—that might indicate, for example, that we’re looking at a house cat.

Foundational to the human sense of vision is the neuron. Specialized neurons are responsible for capturing light information in the human eye. This light information is then preprocessed, transported to the visual cortex of the brain, and then finally analyzed to completion. Neurons are single-handedly responsible for all of these functions. As a result, intuitively, it would make a lot of sense to extend our neural network models to build better computer vision systems. In this chapter, we will use our understanding of human vision to build effective deep learning models for image problems. But before we jump in, let’s take a look at more traditional approaches to image analysis and why they fall short.


## The Shortcomings of Feature Selection 

Let’s begin by considering a simple computer vision problem. I give you a randomly selected image, such as the one in the figure below. Your task is to tell me if there is a human face in this picture. This is exactly the problem that Paul Viola and Michael Jones tackled in their seminal paper published in 2001.

<img src='images/img1.PNG'>

For a human like you or me, this task is completely trivial. For a computer, however, this is a very difficult problem. How do we teach a computer that an image contains a face? We could try to train a traditional machine learning algorithm by giving it the raw pixel values of the image and hoping it can find an appropriate classifier. It turns out this doesn’t work very well at all, because the signal-to-noise ratio is much too low for any useful learning to occur. We need an alternative.

The compromise that was eventually reached was essentially a trade-off between the traditional computer program, where the human defined all of the logic, and a pure machine learning approach, where the computer did all of the heavy lifting. In this compromise, a human would choose the features (perhaps hundreds or thousands) that he or she believed were important in making a classification decision. In doing so, the human would be producing a lower-dimensional representation of the same learning problem. The machine learning algorithm would then use these new feature vectors to make classification decisions. Because the feature extraction process improves the signal-to-noise ratio (assuming the appropriate features are picked), this approach had quite a bit of success compared to the state of the art at the time.

Viola and Jones had the insight that faces had certain patterns of light and dark patches that they could exploit. For example, there is a difference in light intensity between the eye region and the upper cheeks. There is also a difference in light intensity between the nose bridge and the two eyes on either side. 

By itself, each of these features is not very effective at identifying a face. But when used together, their combined effectiveness drastically increases. On a dataset of 130 images and 507 faces, the algorithm achieves a 91.4% detection rate with 50 false positives. The performance was unparalleled at the time, but the algorithm has fundamental limitations. If a face is partially covered with shade, the light intensity comparisons no longer work. Moreover, if the algorithm is looking at a face on a crumpled flier or the face of a cartoon character, it would most likely fail.

The problem is the algorithm hasn’t really learned that much about what it means to “see” a face. Beyond differences in light intensity, our brain uses a vast number of visual cues to realize that our field of view contains a human face, including contours, relative positioning of facial features, and color. And even if there are slight discrepancies in one of our visual cues (for example, if parts of the face are blocked from view or if shade modifies light intensities), our visual cortex can still reliably identify faces. 

In order to use traditional machine learning techniques to teach a computer to “see,” we need to provide our program with a lot more features to make accurate decisions. Before the advent of deep learning, huge teams of computer vision researchers would take years to debate about the usefulness of different features. As the recognition problems became more and more intricate, researchers had a difficult time coping with the increase in complexity. 

To illustrate the power of deep learning, consider the ImageNet challenge, one of the most prestigious benchmarks in computer vision (sometimes even referred to as the Olympics of computer vision). Every year, researchers attempt to classify images into one of 1,000 possible classes given a training dataset of approximately 1.2 million images. The algorithm is given five guesses to get the right answer before it moves on to the next image in the test dataset. The goal of the competition is to push the state of the art in computer vision to rival the accuracy of human vision itself (approximately 95–96%). In 2011, the winner of the ImageNet benchmark had an error rate of 25.7%, making a mistake on one out of every four images. Definitely a huge improvement over random guessing, but not good enough for any sort of commercial application. Then in 2012, Alex Krizhevsky from Geoffrey Hinton’s lab at the University of Toronto did the unthinkable. Pioneering a deep learning architecture known as a convolutional neural network for the first time on a challenge of this size and complexity, he blew the competition out of the water. The runner-up in the competition scored a commendable 26.1% error rate. But AlexNet, over the course of just a few months of work, completely crushed 50 years of traditional computer vision research with an error rate of approximately 16%. It would be no understatement to say that AlexNet single-handedly put deep learning on the map for computer vision, and completely revolutionized the field.

## Vanilla Deep Neural Networks Don’t Scale

The fundamental goal in applying deep learning to computer vision is to remove the cumbersome, and ultimately limiting, feature selection process. 

Deep neural networks are perfect for this process because each layer of a neural network is responsible for learning and building up features to represent the input data that it receives. A naive approach might be to tackle the image classification task with a vanilla deep neural network, using the network layer primitive we designed in Chapter 3 for the MNIST dataset.

If we attempt to tackle the image classification problem in this way, however, we’ll quickly face a pretty daunting challenge, visually demonstrated in the figure below. In MNIST, our images were only 28 x 28 pixels and were black and white. As a result, a neuron in a fully connected hidden layer would have 784 incoming weights. This seems pretty tractable for the MNIST task, and our vanilla neural net performed quite well. This technique, however, does not scale well as our images grow larger. For example, for a full-color 200 x 200 pixel image, each neuron in the first fully connected hidden layer would have 200 x 200 x 3 = 120,000 incoming weights. And we’re going to want to have lots of these neurons over multiple layers, so these parameters add up quite quickly! Clearly, this full connectivity is not only wasteful, but also means that we’re much more likely to overfit to the training dataset.

<img src='images/img2.PNG'>
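To make the scaling problem concrete, the arithmetic above can be sketched in a few lines of Python. The helper name `dense_weights_per_neuron` and the hidden-layer width of 1,000 neurons are illustrative assumptions, not values from the text.

```python
def dense_weights_per_neuron(width, height, channels):
    """Number of incoming weights for one fully connected neuron."""
    return width * height * channels

# MNIST: 28 x 28 images with a single channel
mnist = dense_weights_per_neuron(28, 28, 1)
# Full-color 200 x 200 image with three channels
color = dense_weights_per_neuron(200, 200, 3)

# Hypothetical hidden layer of 1,000 neurons (assumed width, for illustration only)
hidden_neurons = 1000

print(mnist)                   # 784
print(color)                   # 120000
print(color * hidden_neurons)  # 120000000
```

Even a single modest hidden layer on a small color image already needs on the order of a hundred million weights under these assumptions.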

The convolutional network takes advantage of the fact that we’re analyzing images, and sensibly constrains the architecture of the deep network so that we drastically reduce the number of parameters in our model. Inspired by how human vision works, layers of a convolutional network have neurons arranged in three dimensions, so layers have a width, height, and depth as shown in figure below. As we’ll see, the neurons in a convolutional layer are only connected to a small, local region of the preceding layer, so we avoid the wastefulness of fully-connected neurons. A convolutional layer’s function can be expressed simply: it processes a three-dimensional volume of information to produce a new three-dimensional volume of information. We’ll take a closer look at how this works in the next section.

<img src='images/img3.PNG'>

## Filters and Feature Maps 

In order to motivate the primitives of the convolutional layer, let’s build an intuition for how the human brain pieces together raw visual information into an understanding of the world around us. One of the most influential studies in this space came from David Hubel and Torsten Wiesel, who discovered that parts of the visual cortex are responsible for detecting edges. In 1959, they inserted electrodes into the brain of a cat and projected black-and-white patterns on the screen. They found that some neurons fired only when there were vertical lines, others when there were horizontal lines, and still others when the lines were at particular angles.

Further work determined that the visual cortex was organized in layers. Each layer is responsible for building on the features detected in the previous layers—from lines, to contours, to shapes, to entire objects. Furthermore, within a layer of the visual cortex, the same feature detectors were replicated over the whole area in order to detect features in all parts of an image. These ideas significantly impacted the design of convolutional neural nets. 

The first concept that arose was that of a filter, and it turns out that here, Viola and Jones were actually pretty close. A filter is essentially a feature detector, and to understand how it works, let’s consider the toy image in Figure below.

<img src='images/img4.PNG'>

Let’s say that we want to detect vertical and horizontal lines in the image. One approach would be to use an appropriate feature detector, as shown in Figure below. For example, to detect vertical lines, we would use the feature detector on the top, slide it across the entirety of the image, and at every step check if we have a match. We keep track of our answers in the matrix in the top right. If there’s a match, we shade the appropriate box black. If there isn’t, we leave it white. This result is our feature map, and it indicates where we’ve found the feature we’re looking for in the original image. We can do the same for the horizontal line detector (bottom), resulting in the feature map in the bottom-right corner.

<img src='images/img5.PNG'>
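The sliding procedure described above can be sketched in plain Python. This is a minimal illustration, assuming that a "match" means the image patch equals the detector exactly; the 5 x 5 toy image and the function name `feature_map` are hypothetical and are not taken from the figures.

```python
def feature_map(image, detector):
    """Slide `detector` over `image`; mark 1 where the patch matches exactly."""
    k = len(detector)
    rows = len(image) - k + 1
    cols = len(image[0]) - k + 1
    result = []
    for i in range(rows):
        row = []
        for j in range(cols):
            # Extract the k x k patch whose top-left corner is (i, j).
            patch = [image[i + di][j:j + k] for di in range(k)]
            row.append(1 if patch == detector else 0)
        result.append(row)
    return result

vertical = [[0, 1, 0],
            [0, 1, 0],
            [0, 1, 0]]
horizontal = [[0, 0, 0],
              [1, 1, 1],
              [0, 0, 0]]

# Hypothetical toy image containing one vertical stroke (1 = dark pixel)
image = [[0, 0, 1, 0, 0],
         [0, 0, 1, 0, 0],
         [0, 0, 1, 0, 0],
         [0, 0, 1, 0, 0],
         [0, 0, 1, 0, 0]]

vertical_map = feature_map(image, vertical)
horizontal_map = feature_map(image, horizontal)
print(vertical_map)    # [[0, 1, 0], [0, 1, 0], [0, 1, 0]]
print(horizontal_map)  # [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
```

The vertical detector fires along the stroke, while the horizontal detector finds nothing in this image.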

This operation is called a convolution. We take a filter and we multiply it over the entire area of an input image. Using the following scheme, let’s try to express this operation as neurons in a network. In this scheme, layers of neurons in a feedforward neural net represent either the original image or a feature map. Filters represent combinations of connections that get replicated across the entirety of the input. In the figure below, connections of the same color are restricted to always have the same weight. We can achieve this by initializing all the connections in a group with identical weights and by always averaging the weight updates of a group before applying them at the end of each iteration of backpropagation. The output layer is the feature map generated by this filter. A neuron in the feature map is activated if the filter contributing to its activity detected an appropriate feature at the corresponding position in the previous layer.

<img src='images/img6.PNG'>

Let’s denote the kth feature map in layer m as $m^k$. Moreover, let’s denote the corresponding filter by the values of its weights $W$. Then, assuming the neurons in the feature map have bias $b^k$ (note that the bias is kept identical for all of the neurons in a feature map), we can mathematically express the feature map as follows:

$$ m^{k}_{ij} = f\left((W * x)_{ij} + b^k\right) $$
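As a rough sketch, the expression above can be evaluated directly for one filter on one two-dimensional input. The code assumes stride 1, no padding, and the cross-correlation convention (the filter is not flipped), which is what deep learning libraries commonly call convolution. The choice of ReLU for f, the edge filter, and the input values are all illustrative assumptions.

```python
def conv2d_single(x, W, b, f):
    """Compute m_ij = f((W * x)_ij + b) for one filter on one 2-D input.

    Assumes stride 1, no padding, and no kernel flip (cross-correlation).
    """
    e = len(W)
    out_h = len(x) - e + 1
    out_w = len(x[0]) - e + 1
    m = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Weighted sum of the e x e patch whose top-left corner is (i, j).
            logit = sum(W[a][c] * x[i + a][j + c]
                        for a in range(e) for c in range(e))
            row.append(f(logit + b))
        m.append(row)
    return m

def relu(z):
    return max(0.0, z)

# Hypothetical filter that responds to a dark-to-bright vertical edge
W = [[-1.0, 0.0, 1.0],
     [-1.0, 0.0, 1.0],
     [-1.0, 0.0, 1.0]]
# Hypothetical 4 x 5 input: dark on the left, bright on the right
x = [[0.0, 0.0, 0.0, 1.0, 1.0],
     [0.0, 0.0, 0.0, 1.0, 1.0],
     [0.0, 0.0, 0.0, 1.0, 1.0],
     [0.0, 0.0, 0.0, 1.0, 1.0]]

m = conv2d_single(x, W, b=-1.0, f=relu)
print(m)  # [[0.0, 2.0, 2.0], [0.0, 2.0, 2.0]]
```

Note that the same weights and the same bias are reused at every position, which is exactly the weight sharing described above.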

This mathematical description is simple and succinct, but it doesn’t completely describe filters as they are used in convolutional neural networks. Specifically, filters don’t just operate on a single feature map. They operate on the entire volume of feature maps that have been generated at a particular layer. For example, consider a situation in which we would like to detect a face at a particular layer of a convolutional net, and we have accumulated three feature maps: one for eyes, one for noses, and one for mouths. We know that a particular location contains a face if the corresponding locations in the primitive feature maps contain the appropriate features (two eyes, a nose, and a mouth). In other words, to make decisions about the existence of a face, we must combine evidence over multiple feature maps. This is equally necessary for a full-color input image. These images have pixels represented as RGB values, and so we require three slices in the input volume (one slice for each color). As a result, filters must be able to operate over volumes, not just areas. This is shown in the figure below. Each cell in the input volume is a neuron. A local portion is multiplied with a filter (corresponding to weights in the convolutional layer) to produce a neuron in a feature map in the following volumetric layer of neurons.

<img src='images/img7.PNG'>

As we discussed in the previous section, a convolutional layer (which consists of a set of filters) converts one volume of values into another volume of values. The depth of the filter corresponds to the depth of the input volume. This is so that the filter can combine information from all the features that have been learned. The depth of the output volume of a convolutional layer is equivalent to the number of filters in that layer, because each filter produces its own slice. We visualize these relationships in the figure below.

<img src='images/img8.PNG'>

In the next section, we will use these concepts and fill in some of the gaps to create a full description of a convolutional layer.


## Full Description of a Convolutional Layer

Let’s use the concepts we’ve developed so far to complete the description of the convolutional layer. First, a convolutional layer takes in an input volume. This input volume has the following characteristics: 

* Its width $w_{in}$
* Its height $h_{in}$
* Its depth $d_{in}$
* Its zero padding $p$

This volume is processed by a total of k filters, which represent the weights and connections in the convolutional network. These filters have a number of hyperparameters, which are described as follows: 

*  Their spatial extent e, which is equal to the filter’s height and width.
*  Their stride s, or the distance between consecutive applications of the filter on the input volume. If we use a stride of 1, we get the full convolution described in the previous section. We illustrate this in Figure below.
* The bias b (a parameter learned like the values in the filter) which is added to each component of the convolution.

<img src='images/img9.PNG'>

This results in an output volume with the following characteristics: 

* Its function f, which is applied to the incoming logit of each neuron in the output volume to determine its final value.

* Its width $$ w_{out} = \frac{w_{in} - e + 2p}{s} + 1 $$

* Its height $$ h_{out} = \frac{h_{in} - e + 2p}{s} + 1 $$

* Its depth $$ d_{out} = k $$

The mth “depth slice” of the output volume, where 1 ≤ m ≤ k, corresponds to the function f applied to the sum of the mth filter convolved over the input volume and the bias $b^m$. Moreover, this means that per filter, we have $d_{in} e^2$ parameters. In total, that means the layer has $k d_{in} e^2$ parameters and k biases. To demonstrate this in action, we provide an example of a convolutional layer in the following figures with a 5 x 5 x 3 input volume with zero padding p = 1. We’ll use two 3 x 3 x 3 filters (spatial extent e = 3) with a stride s = 2. We’ll use a linear function to produce the output volume, which will be of size 3 x 3 x 2.

<img src='images/img10.PNG'>

<img src='images/img11.PNG'>
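The full layer can be sketched in plain Python to check the shapes and parameter counts of this example. This is only an illustration: volumes are stored as nested lists indexed `[row][column][channel]`, the division in the size formula is assumed to round down, and the all-ones input and filter values are placeholders rather than the numbers shown in the figures.

```python
def conv_output_size(size_in, e, p, s):
    """Spatial output size: (size_in - e + 2p) / s + 1 (integer division assumed)."""
    return (size_in - e + 2 * p) // s + 1

def conv_layer(volume, filters, biases, p, s, f=lambda z: z):
    """Apply k filters to an h x w x d volume stored as [row][col][channel]."""
    h, w, d = len(volume), len(volume[0]), len(volume[0][0])
    e = len(filters[0])
    # Zero-pad the two spatial dimensions by p on every side.
    padded = [[[0.0] * d for _ in range(w + 2 * p)] for _ in range(h + 2 * p)]
    for i in range(h):
        for j in range(w):
            padded[i + p][j + p] = list(volume[i][j])
    h_out = conv_output_size(h, e, p, s)
    w_out = conv_output_size(w, e, p, s)
    out = [[[0.0] * len(filters) for _ in range(w_out)] for _ in range(h_out)]
    for m, (W, b) in enumerate(zip(filters, biases)):
        for i in range(h_out):
            for j in range(w_out):
                logit = b
                # Sum over the filter's full spatial extent and depth.
                for a in range(e):
                    for c in range(e):
                        for ch in range(d):
                            logit += W[a][c][ch] * padded[i * s + a][j * s + c][ch]
                out[i][j][m] = f(logit)  # depth slice m comes from filter m
    return out

# Setup from the text: 5 x 5 x 3 input, p = 1, two 3 x 3 x 3 filters, s = 2.
volume = [[[1.0] * 3 for _ in range(5)] for _ in range(5)]
filters = [[[[1.0] * 3 for _ in range(3)] for _ in range(3)] for _ in range(2)]
biases = [0.0, 1.0]

out = conv_layer(volume, filters, biases, p=1, s=2)
shape = (len(out), len(out[0]), len(out[0][0]))
n_weights = len(filters) * 3 * 3 ** 2  # k * d_in * e^2
print(shape, n_weights, len(biases))   # (3, 3, 2) 54 2
```

The output shape agrees with the 3 x 3 x 2 volume stated above, and the layer holds 54 weights plus 2 biases.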

Generally, it’s wise to keep filter sizes small (size 3 x 3 or 5 x 5). Less commonly, larger sizes are used (7 x 7) but only in the first convolutional layer. Having more small filters is an easy way to achieve high representational power while also incurring a smaller number of parameters. It’s also suggested to use a stride of 1 to capture all useful information in the feature maps, and a zero padding that keeps the output volume’s height and width equivalent to the input volume’s height and width. 
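The parameter savings from small filters can be checked with simple arithmetic. The sketch below assumes stride-1 layers whose input and output depths are equal, ignores biases, and uses a hypothetical depth of 64 channels; none of these values come from the text.

```python
def stacked_conv_weights(n_layers, e, channels):
    """Weights in n stacked e x e conv layers with `channels` in and out (biases ignored)."""
    return n_layers * channels * channels * e * e

def receptive_field(n_layers, e):
    """Side length of the input region seen by one neuron after n stride-1 e x e layers."""
    return n_layers * (e - 1) + 1

channels = 64  # hypothetical depth, chosen for illustration

print(receptive_field(3, 3), stacked_conv_weights(3, 3, channels))  # 7 110592
print(receptive_field(1, 7), stacked_conv_weights(1, 7, channels))  # 7 200704
```

Under these assumptions, three stacked 3 x 3 layers cover the same 7 x 7 region as a single 7 x 7 layer while using roughly half the weights.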

TensorFlow provides us with a convenient operation to easily perform a convolution on a minibatch of input volumes (note that we must apply our choice of function f ourselves and it is not performed by the operation itself).

##### tf.nn.conv2d(input, filter, strides, padding, use_cudnn_on_gpu=True, name=None)

Here, input is a four-dimensional tensor of size $$ N \times h_{in} \times w_{in} \times d_{in} $$ where N is the number of examples in our minibatch. The filter argument is also a four-dimensional tensor representing all of the filters applied in the convolution. It is of size $$ e \times e \times d_{in} \times k $$
The resulting tensor emitted by this operation has the same structure as input. Setting the padding argument to "SAME" selects the zero padding so that height and width are preserved by the convolutional layer when the stride is 1.
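To reason about tensor shapes without running TensorFlow, the sketch below mirrors the shape rules TensorFlow documents for "SAME" and "VALID" padding. The function `conv2d_output_shape` is a hypothetical helper written for this illustration and is not part of the TensorFlow API. The batch size of 32 and the 64 filters are assumed values.

```python
import math

def conv2d_output_shape(input_shape, filter_shape, strides, padding):
    """Shape of the tensor produced by a 2-D convolution in NHWC layout.

    input_shape:  (N, h_in, w_in, d_in)
    filter_shape: (e_h, e_w, d_in, k)
    strides:      (1, s_h, s_w, 1)
    """
    n, h_in, w_in, d_in = input_shape
    e_h, e_w, f_d, k = filter_shape
    if f_d != d_in:
        raise ValueError("filter depth must equal input depth")
    _, s_h, s_w, _ = strides
    if padding == "SAME":
        # Zero padding is chosen so that only the stride shrinks the output.
        h_out = math.ceil(h_in / s_h)
        w_out = math.ceil(w_in / s_w)
    elif padding == "VALID":
        # No zero padding: the filter must fit entirely inside the input.
        h_out = math.ceil((h_in - e_h + 1) / s_h)
        w_out = math.ceil((w_in - e_w + 1) / s_w)
    else:
        raise ValueError("padding must be 'SAME' or 'VALID'")
    return (n, h_out, w_out, k)

same = conv2d_output_shape((32, 28, 28, 1), (5, 5, 1, 64), (1, 1, 1, 1), "SAME")
valid = conv2d_output_shape((32, 28, 28, 1), (5, 5, 1, 64), (1, 1, 1, 1), "VALID")
print(same)   # (32, 28, 28, 64)
print(valid)  # (32, 24, 24, 64)
```

In both cases the last dimension of the result equals the number of filters k, matching the depth rule for output volumes.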

## Max Pooling

To aggressively reduce dimensionality of feature maps and sharpen the located features, we sometimes insert a max pooling layer after a convolutional layer. The essential idea behind max pooling is to break up each feature map into equally sized tiles. Then we create a condensed feature map. Specifically, we create a cell for each tile, compute the maximum value in the tile, and propagate this maximum value into the corresponding cell of the condensed feature map. This process is illustrated in the figure below.

<img src='images/img12.PNG'>

More rigorously, we can describe a pooling layer with two parameters:

* Its spatial extent e
* Its stride s
    
It’s important to note that only two major variations of the pooling layer are used. The first is the nonoverlapping pooling layer with e = 2, s = 2. The second is the overlapping pooling layer with e = 3, s = 2. The resulting dimensions of each feature map are as follows:

* Its width $$ w_{out} = \frac{w_{in} - e}{s} + 1 $$
* Its height $$ h_{out} = \frac{h_{in} - e}{s} + 1 $$
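A minimal sketch of max pooling over a single feature map follows. It assumes the division in the size formulas rounds down, and the 4 x 4 feature map is a made-up example rather than the one in the figure.

```python
def pool_output_size(size_in, e, s):
    """Pooled size: (size_in - e) / s + 1 (integer division assumed)."""
    return (size_in - e) // s + 1

def max_pool(feature_map, e, s):
    """Max pooling over one 2-D feature map with spatial extent e and stride s."""
    h_out = pool_output_size(len(feature_map), e, s)
    w_out = pool_output_size(len(feature_map[0]), e, s)
    # Each output cell holds the maximum of one e x e tile of the input.
    return [[max(feature_map[i * s + a][j * s + c]
                 for a in range(e) for c in range(e))
             for j in range(w_out)]
            for i in range(h_out)]

# Hypothetical 4 x 4 feature map
fm = [[1, 3, 2, 0],
      [4, 2, 1, 1],
      [0, 1, 5, 6],
      [2, 2, 7, 3]]

pooled = max_pool(fm, e=2, s=2)
print(pooled)                         # [[4, 2], [2, 7]]
print(pool_output_size(5, e=3, s=2))  # 2
```

The nonoverlapping setting (e = 2, s = 2) halves each spatial dimension, while the overlapping setting (e = 3, s = 2) lets neighboring tiles share a row or column.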

One interesting property of max pooling is that it is locally invariant. This means that even if the inputs shift around a little bit, the output of the max pooling layer stays constant. This has important implications for visual algorithms. Local invariance is a very useful property if we care more about whether some feature is present than exactly where it is. However, enforcing large amounts of local invariance can destroy our network’s ability to carry important information. As a result, we usually keep the spatial extent of our pooling layers quite small. 
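Local invariance can be illustrated with a one-dimensional toy example. The helper `max_pool_1d` and the input values below are hypothetical and are used only to show the effect.

```python
def max_pool_1d(values, e, s):
    """1-D max pooling, enough to show the invariance effect."""
    return [max(values[i:i + e]) for i in range(0, len(values) - e + 1, s)]

original = [0, 9, 0, 0, 0, 0]
shifted = [9, 0, 0, 0, 0, 0]  # the same feature moved one position left
far = [0, 0, 0, 9, 0, 0]      # the feature moved two positions right, into another tile

print(max_pool_1d(original, e=2, s=2))  # [9, 0, 0]
print(max_pool_1d(shifted, e=2, s=2))   # [9, 0, 0]
print(max_pool_1d(far, e=2, s=2))       # [0, 9, 0]
```

A shift that stays inside one tile leaves the output unchanged, while a larger shift still moves the response, which is why the invariance is only local.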

Some recent work along this line has come out of the University of Warwick from Graham, who proposes a concept called fractional max pooling. In fractional max pooling, a pseudorandom number generator is used to generate tilings with noninteger lengths for pooling. Here, fractional max pooling functions as a strong regularizer, helping prevent overfitting in convolutional networks.

## Full Architectural Description of Convolutional Networks

Now that we’ve described the building blocks of convolutional networks, we can start putting them together. The figure below depicts several architectures that might be of practical use.

<img src='images/img13.PNG'>

One theme we notice as we build deeper networks is that we reduce the number of pooling layers and instead stack multiple convolutional layers in tandem. This is generally helpful because pooling operations are inherently destructive. Stacking several convolutional layers before each pooling layer allows us to achieve richer representations. 

As a practical note, deep convolutional networks can take up a significant amount of space, and most casual practitioners are usually bottlenecked by the memory capacity of their GPU. The VGGNet architecture, for example, takes approximately 90 MB of memory on the forward pass per image and more than 180 MB of memory on the backward pass to update the parameters. Many deep networks make a compromise by using strides and spatial extents in the first convolutional layer that reduce the amount of information that needs to be propagated up the network.
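A back-of-the-envelope estimate shows why the first layer’s stride matters so much for memory. The sketch assumes 32-bit floating-point activations and a hypothetical first layer with a 224 x 224 input, 64 filters of spatial extent 3, and zero padding 1. These numbers are illustrative and do not describe any particular network’s measured footprint.

```python
BYTES_PER_FLOAT32 = 4

def activation_bytes(width, height, depth):
    """Memory needed to store one volume of float32 activations."""
    return width * height * depth * BYTES_PER_FLOAT32

def conv_out(size_in, e, p, s):
    """Spatial output size of a convolution (integer division assumed)."""
    return (size_in - e + 2 * p) // s + 1

# Compare a stride of 1 with a stride of 2 in the hypothetical first layer.
for stride in (1, 2):
    side = conv_out(224, e=3, p=1, s=stride)
    print(stride, side, activation_bytes(side, side, 64))
# 1 224 12845056
# 2 112 3211264
```

Doubling the stride cuts the stored activations of this one layer by a factor of four, and every later layer inherits the smaller spatial size.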