# 1. Foundations of Convolutional Neural Networks

## 1.1 Image classification with fully connected layers

Neural networks are a set of algorithms, modeled loosely after the human brain, that are designed to recognize patterns. They interpret sensory data through a kind of machine perception, labeling or clustering raw input. The patterns they recognize are numerical, contained in vectors, into which all real-world data, be it images, sound, text or time series, must be translated.

What kind of problems does deep learning solve, and more importantly, can it solve yours? To know the answer, you need to ask questions:

* What outcomes do I care about? Those outcomes are labels that could be applied to data: for example, spam or not_spam in an email filter, good_guy or bad_guy in fraud detection, angry_customer or happy_customer in customer relationship management.

* Do I have the data to accompany those labels? That is, can I find labeled data, or can I create a labeled dataset (with a service like AWS Mechanical Turk or Figure Eight or Mighty.ai) where spam has been labeled as spam, in order to teach an algorithm the correlation between labels and inputs?

All classification tasks depend upon labeled datasets; that is, humans must transfer their knowledge to the dataset in order for a neural network to learn the correlation between labels and data. This is known as supervised learning. Typical classification tasks include:

* Detect faces, identify people in images, recognize facial expressions (angry, joyful)
* Identify objects in images (stop signs, pedestrians, lane markers…)
* Recognize gestures in video
* Detect voices, identify speakers, transcribe speech to text, recognize sentiment in voices
* Classify text as spam (in emails), or fraudulent (in insurance claims); recognize sentiment in text (customer feedback)

Any labels that humans can generate, any outcomes you care about and which correlate to data, can be used to train a neural network.

Deep learning is the name we use for “stacked neural networks”; that is, networks composed of several layers.

The layers are made of nodes. A node is just a place where computation happens, loosely patterned on a neuron in the human brain, which fires when it encounters sufficient stimuli. A node combines input from the data with a set of coefficients, or weights, that either amplify or dampen that input, thereby assigning significance to inputs for the task the algorithm is trying to learn. (For example, which input is most helpful in classifying data without error?) These input-weight products are summed and the sum is passed through a node’s so-called activation function, to determine whether and to what extent that signal progresses further through the network to affect the ultimate outcome, say, an act of classification.

Here’s a diagram of what one node might look like.

<img src="images/cnn14.png" >
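The computation performed by a single node, as described above, can be sketched in a few lines of NumPy. This is only an illustration: the input values, the weights and the bias term are made up, and ReLU is assumed as the activation function.

```python
import numpy as np

def node_output(inputs, weights, bias=0.0):
    """Weighted sum of the inputs followed by a ReLU activation."""
    z = np.dot(inputs, weights) + bias   # sum of the input-weight products
    return max(0.0, z)                   # the signal passes on only if it is positive

x = np.array([0.5, -1.0, 2.0])   # hypothetical input values
w = np.array([0.8, 0.2, 0.5])    # hypothetical weights

print(node_output(x, w, bias=0.1))    # positive weighted sum, so the node fires
print(node_output(x, -w))             # negative weighted sum, so the output is 0
```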


A node layer is a row of those neuron-like switches that turn on or off as the input is fed through the net. Each layer’s output is simultaneously the subsequent layer’s input, starting from an initial input layer receiving your data. Our previous deep learning courses covered fully connected neural networks in much more detail. The architecture of such a network is shown in the figure below.

<img src="images/cnn11.png" width="500">

<br/><br/>

**Key Concepts of Deep Neural Networks**

Deep-learning networks are distinguished from the more commonplace single-hidden-layer neural networks by their depth; that is, the number of node layers through which data passes in a multistep process of pattern recognition.

Earlier versions of neural networks such as the first perceptrons were shallow, composed of one input and one output layer, and at most one hidden layer in between. More than three layers (including input and output) qualifies as “deep” learning. So deep is a strictly defined, technical term that means more than one hidden layer.

In deep-learning networks, each layer of nodes trains on a distinct set of features based on the previous layer’s output. The further you advance into the neural net, the more complex the features your nodes can recognize, since they aggregate and recombine features from the previous layer.

This is known as feature hierarchy, and it is a hierarchy of increasing complexity and abstraction. It makes deep-learning networks capable of handling very large, high-dimensional data sets with billions of parameters that pass through nonlinear functions.

Deep-learning networks perform automatic feature extraction without human intervention, unlike most traditional machine-learning algorithms. Given that feature extraction is a task that can take teams of data scientists years to accomplish, deep learning is a way to circumvent the chokepoint of limited experts. It augments the powers of small data science teams, which by their nature do not scale.

A deep-learning network trained on labeled data can then be applied to unstructured data, giving it access to much more input than machine-learning nets. This is a recipe for higher performance: the more data a net can train on, the more accurate it is likely to be. (Bad algorithms trained on lots of data can outperform good algorithms trained on very little.) Deep learning’s ability to process and learn from huge quantities of unlabeled data gives it a distinct advantage over previous algorithms.

<br/><br/><br/><br/>


You are already familiar with fully connected neural networks. Here we use one to classify the [Sign Language Digits Dataset](https://www.kaggle.com/ardamavi/sign-language-digits-dataset).

About the data:

* It is an image dataset containing images of different hand signs (some preview images are shown below).
* Each sample is a (64, 64) array, which you can think of as a 2D matrix in which each cell holds the value of one pixel.
* The color space is grayscale and there are 10 classes in total. The link to the Kaggle dataset is given above.

<img src="images/cnn12.png" width="600">

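We achieve this task in the following steps:

* Flatten each sample image into a vector
* Split the data into training and test sets
* Train a fully connected network on the flattened vectors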



In [21]:
from sklearn.model_selection import train_test_split
from keras.models import Sequential, load_model
from keras.layers import Dense, Dropout, Activation

import numpy as np
import matplotlib.pyplot as plt

import os

# fix random seed for reproducibility
np.random.seed(7)

X = np.load('data/Sign-language-digits/X.npy')
y = np.load('data/Sign-language-digits/Y.npy')



# Get shape of the data
print('X shape : {}  Y shape: {}'.format(X.shape, y.shape))

# Visualize an image
plt.imshow(X[700], cmap='gray')
print(y[700]) # one-hot labels starting at zero

In [32]:
# Flatten each 64x64 image into a 4096-element vector; each vector is one input sample
X_ = X.reshape(X.shape[0], -1)

# Divide entire data into train and test set
X_train, X_test, y_train, y_test = train_test_split(X_, y, test_size=0.2, random_state=8)

# building a linear stack of layers with the sequential model
model = Sequential()
model.add(Dense(512, input_shape=(4096,)))
model.add(Activation('relu'))                            
model.add(Dropout(0.2))

model.add(Dense(512))
model.add(Activation('relu'))
model.add(Dropout(0.2))

model.add(Dense(10))
model.add(Activation('softmax'))

model.compile(loss='categorical_crossentropy', metrics=['accuracy'], optimizer='adam')


# training the model and saving metrics in history
history = model.fit(X_train, y_train,
          batch_size=128, epochs=50,
          verbose=2,
          validation_data=(X_test, y_test))

# saving the model
model_path = 'results/keras_sign_lang1.h5'
model.save(model_path)
print('Saved trained model at %s ' % model_path)

Train on 1649 samples, validate on 413 samples
Epoch 1/50
 - 1s - loss: 4.0306 - acc: 0.1043 - val_loss: 2.4310 - val_acc: 0.1065
Epoch 2/50
 - 1s - loss: 2.4112 - acc: 0.1134 - val_loss: 2.2972 - val_acc: 0.0944
Epoch 3/50
 - 0s - loss: 2.2787 - acc: 0.1316 - val_loss: 2.2763 - val_acc: 0.1404
Epoch 4/50
 - 0s - loss: 2.2642 - acc: 0.1692 - val_loss: 2.2397 - val_acc: 0.1743
Epoch 5/50
 - 0s - loss: 2.2180 - acc: 0.2007 - val_loss: 2.1734 - val_acc: 0.3269
Epoch 6/50
 - 0s - loss: 2.1407 - acc: 0.2238 - val_loss: 2.0459 - val_acc: 0.3002
Epoch 7/50
 - 0s - loss: 2.0626 - acc: 0.2420 - val_loss: 1.9942 - val_acc: 0.2300
Epoch 8/50
 - 0s - loss: 1.9511 - acc: 0.2850 - val_loss: 1.7999 - val_acc: 0.3753
Epoch 9/50
 - 0s - loss: 1.8272 - acc: 0.3372 - val_loss: 1.6936 - val_acc: 0.4262
Epoch 10/50
 - 0s - loss: 1.7003 - acc: 0.3730 - val_loss: 1.6176 - val_acc: 0.4140
Epoch 11/50
 - 0s - loss: 1.6671 - acc: 0.4209 - val_loss: 1.5676 - val_acc: 0.4818
Epoch 12/50
 - 0s - loss: 1.6704 - acc

In [33]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_38 (Dense)             (None, 512)               2097664   
_________________________________________________________________
activation_37 (Activation)   (None, 512)               0         
_________________________________________________________________
dropout_25 (Dropout)         (None, 512)               0         
_________________________________________________________________
dense_39 (Dense)             (None, 512)               262656    
_________________________________________________________________
activation_38 (Activation)   (None, 512)               0         
_________________________________________________________________
dropout_26 (Dropout)         (None, 512)               0         
_________________________________________________________________
dense_40 (Dense)             (None, 10)                5130      
__________

## 1.2 Motivation for using CNNs

In the 1990s, LeCun devised a gradient-based method to classify images. It gave good results for postal zip code recognition, but it did not yet scale beyond that. Then, in 2012, Alex Krizhevsky used a similar architecture to train on ImageNet data (this architecture is popularly referred to as AlexNet). It gave significantly better results than the other methods in use at the time. This was made possible by the huge dataset of images collected over the internet and by the parallel computing power of GPUs. The figure below shows classification results from AlexNet.

<img src="images/cnn50.png" width="800">

Convnets are now also used for detection, that is, finding where in an image an object such as a bus, a tree or a boat is located and drawing a precise bounding box around it. Convnets are an important part of face detection, face recognition, interpretation and diagnosis of medical images, classification of galaxies, street sign recognition, and neural style transfer, in which an image is rendered in the style of a particular artist or artwork.

<img src="images/cnn51.png" width="700">
<center>Fig. Street sign recognition </center>

<img src="images/cnn52.jpg" width="700">
<center>Fig. Neural style transfer </center>


The main motivation behind the emergence of CNNs in deep learning scenarios has been to address many of the limitations that traditional neural networks faced when applied to those problems. When used in areas like image classification, traditional fully connected neural networks simply don’t scale well due to their disproportionately large number of connections (the total number of trainable parameters in the previous example was 2,365,450). CNNs bring a few new ideas that improve the efficiency of deep neural networks.
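The parameter count of the earlier fully connected example can be reproduced from its layer sizes (4096 inputs, two hidden layers of 512 units, 10 outputs). The helper name below is ours; the arithmetic is simply one weight per connection plus one bias per unit.

```python
def dense_params(n_inputs, n_units):
    """Weights (one per input-unit connection) plus one bias per unit."""
    return n_inputs * n_units + n_units

layer_sizes = [4096, 512, 512, 10]   # flattened 64x64 image -> 512 -> 512 -> 10 classes
counts = [dense_params(a, b) for a, b in zip(layer_sizes[:-1], layer_sizes[1:])]

print(counts)        # [2097664, 262656, 5130]
print(sum(counts))   # 2365450
```

Almost 90% of the parameters sit in the first layer, because every one of the 4096 pixels is connected to every one of the 512 hidden units.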

In a classification task, rotating, shifting or cropping an image should ideally not change the result, but it is hard for a fully connected network to capture these invariances. Flattening an image also discards the local information around each pixel. We will see later how convolutional neural networks solve these problems. In addition, a traditional fully connected neural network cannot learn to classify images of different sizes, because its input size is fixed.

<br/><br/>

**Convolutional Neural Networks**

Convolutional neural networks are deep artificial neural networks that are used primarily to classify images (e.g. name what they see), cluster them by similarity (photo search), and perform object recognition within scenes. They are algorithms that can identify faces, individuals, street signs, tumors, platypuses and many other aspects of visual data.

Convolutional networks perform optical character recognition (OCR) to digitize text and make natural-language processing possible on analog and hand-written documents, where the images are symbols to be transcribed. CNNs can also be applied to sound when it is represented visually as a spectrogram. More recently, convolutional networks have been applied directly to text analytics as well as graph data with graph convolutional networks.

The efficacy of convolutional nets (ConvNets or CNNs) in image recognition is one of the main reasons why the world has woken up to the efficacy of deep learning. They are powering major advances in computer vision (CV), which has obvious applications for self-driving cars, robotics, drones, security, medical diagnoses, and treatments for the visually impaired.


**How Convolutional Neural Networks learn**<br/>

Convolutional neural networks (ConvNets or CNNs) are one of the main categories of neural networks used for image recognition and image classification. Object detection and face recognition, as mentioned above, are some of the areas where CNNs are widely used. A CNN image classifier takes an input image, processes it, and classifies it under certain categories (e.g., dog, cat, tiger, lion). A computer sees an input image as an array of pixels whose size depends on the image resolution: h x w x d (h = height, w = width, d = depth, i.e. the number of channels). For example, an RGB image might be a 6 x 6 x 3 array (3 refers to the RGB values), and a grayscale image a 4 x 4 x 1 array.

<img src="images/cnn13.png" width="200">
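As a small illustration of the h x w x d layout described above, the two example images can be represented as NumPy arrays. The pixel values here are placeholders, and 8-bit unsigned integers are assumed for the pixel type.

```python
import numpy as np

rgb_image = np.zeros((6, 6, 3), dtype=np.uint8)    # h x w x d with d = 3 (R, G, B)
gray_image = np.zeros((4, 4, 1), dtype=np.uint8)   # d = 1 for a grayscale image

rgb_image[0, 0] = (255, 0, 0)   # set the top-left pixel to pure red

print(rgb_image.shape, rgb_image.size)     # (6, 6, 3) 108
print(gray_image.shape, gray_image.size)   # (4, 4, 1) 16
```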


Technically, to train and test a deep learning CNN model, each input image is passed through a series of convolution layers with filters (kernels), pooling layers, and fully connected (FC) layers, and a softmax function is applied at the end to classify the object with probabilistic values between 0 and 1. We will explain all of these in much more detail in the next section.


## 1.3 Strides, padding and pooling layers



Before we look at the convolution operation in depth, let us first grasp the bigger picture. As in any supervised learning task, we have a set of labeled images. Earlier we saw how images were flattened and the flattened vector was used as input to a feed-forward network. Each element of this vector is the value of one pixel in the image. Intuitively, each input is a single pixel, and the values of the pixels surrounding it are not taken into account; in other words, each pixel is considered in a standalone manner. It would be better to consider each pixel together with its neighborhood. Imagine that instead of looking at only one pixel, you look at the pixel and also its surroundings. That is how we look at any picture: when you look at a dog in a picture, a single black pixel won't tell you which part of the dog it belongs to (it may be part of the fur, the tail or the eyes), but if you also look at the surrounding pixels you may be able to tell whether it is part of an eye or not.

So let us say you convolve with a kernel that tells the next layers whether there is an eye or not. The next layer uses this to determine whether the image shows an animal, and a later layer decides whether that animal is a dog. In other words, you want your model to learn to extract features across its layers and then classify using those features. You don't have to manually program feature extraction; you simply input the image and the model performs everything from feature extraction to classification. It learns all of this from the training examples, so training examples play a huge role when training a convolutional network.

In this chapter we will first look at the building blocks of convnets and then discuss a few network architectures that use CNNs. Later we will look at an image classification task using this model.


### CONVOLUTION

The first thing to know about convolutional networks is that they don’t perceive images like we humans do. Therefore, you are going to have to think in a different way about what an image means as it is fed to and processed by a convolutional network. When you look at an image of a dog, you see its colors and shapes because the retina in your eye sends signals to the brain. For a computer, however, a single-channel image is stored as a 2D matrix; it is just a bunch of numbers in a 2D array, and inferring a particular shape from a bunch of numbers is hard.

<img src="images/cnn53.png" width="700">

Convolutional networks perceive images as volumes; i.e. three-dimensional objects, rather than flat canvases to be measured only by width and height. That’s because digital color images have a red-green-blue (RGB) encoding, mixing those three colors to produce the color spectrum humans perceive. A convolutional network ingests such images as three separate strata of color stacked one on top of the other.

So a convolutional network receives a normal color image as a rectangular box whose width and height are measured by the number of pixels along those dimensions, and whose depth is three layers deep, one for each letter in RGB. Those depth layers are referred to as channels. As images move through a convolutional network, we will describe them in terms of input and output volumes. We can express images mathematically as matrices of multiple dimensions in this form: 30x30x3. From layer to layer, these dimensions change for reasons that will be explained below.

You will need to pay close attention to the precise measures of each dimension of the image volume, because they are the foundation of the linear algebra operations used to process images. Now, for each pixel of an image, the intensity of R, G and B will be expressed by a number, and that number will be an element in one of the three stacked two-dimensional matrices, which together form the image volume. Those numbers are the initial, raw, sensory features being fed into the convolutional network, and the ConvNet’s purpose is to find which of those numbers are significant signals that actually help it classify images more accurately. (Just like other feedforward networks we have discussed.)

Rather than focus on one pixel at a time, a convolutional net takes in square patches of pixels and passes them through a filter. That filter is also a square matrix smaller than the image itself, and equal in size to the patch. It is also called a kernel, which will ring a bell for those familiar with support-vector machines, and the job of the filter is to find patterns in the pixels.


<img src="images/cnn16.gif" >
<center>Fig 1.3.1.  CNN with a kernel of size $(3 \times 3)$ on an image of size $(5 \times 5)$</center>


Imagine two matrices. One is 30x30, and another is 3x3. That is, the filter covers one-hundredth of one image channel’s surface area. We are going to take the dot product of the filter with a patch of the image channel: place the kernel over a part of the image, multiply each kernel element by the image element beneath it, and sum the products (see how it is calculated in the animation above). If the two matrices have high values in the same positions, the dot product’s output will be high. If they don’t, it will be low. In this way, a single value, the output of the dot product, can tell us whether the pixel pattern in the underlying image matches the pixel pattern expressed by our filter.
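The multiply-and-sum step can be tried on a single 3x3 patch. The filter and the two patches below are made-up values chosen so that one patch matches the filter's pattern and the other does not; `patch_response` is a name introduced here for illustration.

```python
import numpy as np

def patch_response(patch, kernel):
    """Multiply element by element, then sum: the dot product of patch and filter."""
    return float(np.sum(patch * kernel))

kernel = np.array([[1., 0., 0.],
                   [0., 1., 0.],
                   [0., 0., 1.]])          # hypothetical filter: a diagonal stroke

matching_patch = np.array([[9., 1., 0.],
                           [1., 9., 1.],
                           [0., 1., 9.]])  # high values where the filter is high

other_patch = np.array([[0., 1., 9.],
                        [1., 9., 1.],
                        [9., 1., 0.]])     # same values, but on the opposite diagonal

print(patch_response(matching_patch, kernel))   # 27.0
print(patch_response(other_patch, kernel))      # 9.0
```

The matching patch scores three times higher than the mirrored one, even though both contain exactly the same pixel values.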

Let’s imagine that our filter expresses a horizontal line, with high values along its second row and low values in the first and third rows. Now picture that we start in the upper lefthand corner of the underlying image, and we move the filter across the image step by step until it reaches the upper righthand corner. The size of the step is known as **stride**. You can move the filter to the right one column at a time, or you can choose to make larger steps.

At each step, you take another dot product, and you place the results of that dot product in a third matrix known as an **activation map**. The width, or number of columns, of the activation map is equal to the number of steps the filter takes to traverse the underlying image. Since *larger strides* lead to fewer steps, a big stride will produce a smaller activation map. This is important, because the size of the matrices that convolutional networks process and produce at each layer is directly proportional to how computationally expensive they are and how much time they take to train. A larger stride means less time and compute.

With a stride of three, a filter superimposed on the first three rows will slide across them and then begin again with rows 4-6 of the same image, producing a matrix of dot products that is 10x10. That same filter representing a horizontal line can be applied to all three channels of the underlying image, R, G and B. And the three 10x10 activation maps can be added together, so that the aggregate activation map for a horizontal line on all three channels of the underlying image is also 10x10.
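The sliding procedure can be sketched with two nested loops. This is a minimal, unoptimized illustration: the image is random data standing in for a 30x30 RGB picture, and `slide_filter` is a helper name introduced here, not a library function.

```python
import numpy as np

def slide_filter(channel, kernel, stride=1):
    """Slide `kernel` over a 2D `channel` and return the activation map."""
    k = kernel.shape[0]
    steps_h = (channel.shape[0] - k) // stride + 1
    steps_w = (channel.shape[1] - k) // stride + 1
    activation_map = np.zeros((steps_h, steps_w))
    for i in range(steps_h):
        for j in range(steps_w):
            r, c = i * stride, j * stride          # top-left corner of the patch
            patch = channel[r:r + k, c:c + k]
            activation_map[i, j] = np.sum(patch * kernel)
    return activation_map

rng = np.random.default_rng(0)
image = rng.random((30, 30, 3))                    # random stand-in for an RGB image
horizontal_line = np.array([[0., 0., 0.],
                            [1., 1., 1.],
                            [0., 0., 0.]])         # high values along the second row

per_channel = [slide_filter(image[:, :, ch], horizontal_line, stride=3)
               for ch in range(3)]
aggregate = sum(per_channel)                       # add the R, G and B maps together

print(per_channel[0].shape, aggregate.shape)       # (10, 10) (10, 10)
print(slide_filter(image[:, :, 0], horizontal_line, stride=1).shape)  # (28, 28)
```

With a stride of 1 the same filter would produce a 28x28 map, which shows how a larger stride shrinks the activation map.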

Now, because images have lines going in many directions, and contain many different kinds of shapes and pixel patterns, you will want to slide other filters across the underlying image in search of those patterns. You could, for example, look for 96 different patterns in the pixels. Those 96 patterns will create a stack of 96 activation maps, resulting in a new volume that is 10x10x96. In the diagram below, we’ve relabeled the input image, the kernels and the output activation maps to make sure we’re clear.

<br/><br/>

<img src="images/cnn15.png">
<center>Fig 1.3.2</center>

What we just described is a convolution. One of the main problems with images is that they are high-dimensional, which means they cost a lot of time and computing power to process. Convolutional networks are designed to reduce the dimensionality of images in a variety of ways. Filter stride is one way to reduce dimensionality. Another way is through pooling. Let us first formally define strides and padding and then we will go on and talk about pooling layers.


**Overview**

The CONV layer’s parameters consist of a set of learnable filters. Every filter is small spatially (along width and height), but extends through the full depth of the input volume. For example, a typical filter on a first layer of a ConvNet might have size 5x5x3 (i.e. 5 pixels width and height, and 3 because images have depth 3, the color channels). During the forward pass, we slide (more precisely, convolve) each filter across the width and height of the input volume and compute dot products between the entries of the filter and the input at any position. As we slide the filter over the width and height of the input volume we will produce a 2-dimensional activation map that gives the responses of that filter at every spatial position. Intuitively, the network will learn filters that activate when they see some type of visual feature such as an edge of some orientation or a blotch of some color on the first layer, or eventually entire honeycomb or wheel-like patterns on higher layers of the network. Now, we will have an entire set of filters in each CONV layer (e.g. 12 filters), and each of them will produce a separate 2-dimensional activation map. We will stack these activation maps along the depth dimension and produce the output volume.
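The forward pass described in this overview can be sketched in NumPy as follows. It assumes a stride of 1, no padding and no bias term, uses random numbers in place of learned filters, and relies on `sliding_window_view` (available in NumPy 1.20 and later). `conv_layer_forward` is a name introduced here for illustration.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv_layer_forward(volume, filters):
    """volume: (H, W, D); filters: (F, K, K, D). Stride 1, no padding, no bias."""
    n_filters, k, _, depth = filters.shape
    assert depth == volume.shape[2], "each filter must span the full input depth"
    # windows[i, j] is the K x K x D patch whose top-left corner is at (i, j)
    windows = sliding_window_view(volume, (k, k, depth))[:, :, 0]
    # dot product of every patch with every filter -> (H-K+1, W-K+1, F)
    return np.einsum('ijabc,fabc->ijf', windows, filters)

rng = np.random.default_rng(0)
volume = rng.random((32, 32, 3))                # stand-in for a 32x32 RGB input
filters = rng.standard_normal((12, 5, 5, 3))    # 12 filters of size 5x5x3

out = conv_layer_forward(volume, filters)
print(out.shape)   # (28, 28, 12)
```

Each of the 12 filters contributes one 28x28 activation map, and the maps are stacked along the last axis to form the output volume.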

<br/><br/><br/><br/>


### STRIDES

Stride controls how the filter convolves around the input volume. In the example we had in Figure 1.3.1, the filter convolves around the input volume by shifting one unit at a time. The amount by which the filter shifts is the stride. In that case, the stride was implicitly set at 1. Stride is normally chosen so that the output size is an integer and not a fraction. Let’s look at an example. Let’s imagine a 7 x 7 input volume, a 3 x 3 filter (disregard the 3rd dimension for simplicity), and a stride of 1. This is the case that we’re accustomed to.


<img src="images/cnn17.png">
<center>Fig 1.3.3 When stride is set to 1.</center>

See if you can try to guess what will happen to the output volume as the stride increases to 2.

<img src="images/cnn18.png">
<center>Fig 1.3.4 When stride is set to 2.</center>

So, as you can see, the receptive field is shifting by 2 units now and the output volume shrinks as well. Notice that if we tried to set our stride to 3, then we’d have issues with spacing and making sure the receptive fields fit on the input volume. Normally, programmers will increase the stride if they want receptive fields to overlap less and if they want smaller spatial dimensions.
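To see why a stride of 3 causes trouble on a 7 x 7 input, it helps to list where each receptive field starts along one dimension. The two helper functions below are introduced here only for illustration.

```python
def window_starts(input_size, filter_size, stride):
    """Left edge of every receptive field that fits entirely inside the input."""
    return list(range(0, input_size - filter_size + 1, stride))

def uncovered_columns(input_size, filter_size, stride):
    """Number of columns at the right edge that no receptive field reaches."""
    last_start = window_starts(input_size, filter_size, stride)[-1]
    return input_size - (last_start + filter_size)

for stride in (1, 2, 3):
    starts = window_starts(7, 3, stride)
    print(stride, starts, len(starts), uncovered_columns(7, 3, stride))
# stride 1 -> starts [0, 1, 2, 3, 4], output width 5, nothing left over
# stride 2 -> starts [0, 2, 4],       output width 3, nothing left over
# stride 3 -> starts [0, 3],          output width 2, one column never covered
```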

<br/><br/><br/>


### PADDING

Now, let’s take a look at padding. Before getting into that, let’s think about a scenario. What happens when you apply three 5 x 5 x 3 filters to a 32 x 32 x 3 input volume? The output volume would be 28 x 28 x 3. Notice that the spatial dimensions decrease. As we keep applying conv layers, the size of the volume will decrease faster than we would like. In the early layers of our network, we want to preserve as much information about the original input volume as possible so that we can extract those low-level features. Let’s say we want to apply the same conv layer but we want the output volume to remain 32 x 32 x 3. To do this, we can apply a zero padding of size 2 to that layer. Zero padding pads the input volume with zeros around the border. If we think about a zero padding of two, then this would result in a 36 x 36 x 3 input volume. (Imagine an image as a 2D matrix: padding adds cells of zeros all around the border of the matrix. If you start with a matrix of size $n \times m$, then after a padding of $p$ you get $(n+2p)\times(m+2p)$; for example, a padding of 1 gives $(n+2)\times(m+2)$.)

<img src="images/cnn19.png">

Above you saw how padding can be used to control the output size of a convolutional layer. In a fully connected network you decide the number of neurons in each hidden layer yourself. Padding gives you similar freedom: you can control the size of a layer's output by deciding how much padding to apply to its input before the convolution. If you have a stride of 1 and you set the size of the zero padding to $$\text{Zero Padding} = \frac{K-1}{2}$$ where K is the filter size (assumed odd), then the input and output volume will always have the same spatial dimensions. This amount of padding is called 'same' padding. The formula for calculating the output size for any given conv layer is $$O = \frac{(W - K + 2P)}{S} + 1$$ where O is the output height/length, W is the input height/length, K is the filter size, P is the padding, and S is the stride.
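The output-size formula and the 'same' padding rule can be checked against the 32 x 32 x 3 example. The function names are ours, and the integer division assumes that $W - K + 2P$ is divisible by $S$.

```python
import numpy as np

def conv_output_size(w, k, p=0, s=1):
    """O = (W - K + 2P) / S + 1, assuming the division is exact."""
    return (w - k + 2 * p) // s + 1

def same_padding(k):
    """Zero padding that keeps the spatial size unchanged for stride 1 and odd K."""
    return (k - 1) // 2

print(conv_output_size(32, 5))             # 28: no padding, the volume shrinks

p = same_padding(5)                        # 2
volume = np.ones((32, 32, 3))
padded = np.pad(volume, ((p, p), (p, p), (0, 0)), mode='constant')  # zeros on the border only

print(padded.shape)                        # (36, 36, 3)
print(conv_output_size(32, 5, p=p))        # 32: same spatial size as the input
```

Note that only the height and width are padded; the depth axis is left untouched.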




### POOLING

Above we have talked about the convolutional layer. It has kernels, and we saw how different strides in a convolution affect the size of the resulting volume. Then we went on to learn about padding and how it helps us keep the height and width from necessarily decreasing after a convolutional layer. Now we look at a different type of layer. Note that it is completely different from the convolutional layer discussed above.

The next layer in a convolutional network has three names: max pooling, downsampling and subsampling. The activation maps are fed into a downsampling layer, and like convolutions, this method is applied one patch at a time. In this case, max pooling simply takes the largest value from one patch of an image, places it in a new matrix next to the max values from other patches, and discards the rest of the information contained in the activation maps.

<img src="images/cnn20.png">


Only the locations on the image that showed the strongest correlation to each feature (the maximum value) are preserved, and those maximum values combine to form a lower-dimensional space.

Much information about lesser values is lost in this step, which has spurred research into alternative methods. But downsampling has the advantage, precisely because information is lost, of decreasing the amount of storage and processing required.

Above we have talked about max pooling, but there are different types of pooling:

* Max Pooling
* Average Pooling
* Sum Pooling
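The three pooling types differ only in how each patch is reduced to a single number. A minimal sketch, assuming a 2x2 window with a stride of 2 and a made-up 4x4 activation map (`pool2d` is a name introduced here):

```python
import numpy as np

def pool2d(activation_map, size=2, stride=2, mode="max"):
    """Reduce each size x size patch to one value using max, average or sum."""
    reducer = {"max": np.max, "average": np.mean, "sum": np.sum}[mode]
    out_h = (activation_map.shape[0] - size) // stride + 1
    out_w = (activation_map.shape[1] - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            out[i, j] = reducer(activation_map[r:r + size, c:c + size])
    return out

a = np.array([[1., 3., 2., 4.],
              [5., 6., 1., 2.],
              [7., 2., 9., 0.],
              [1., 8., 3., 4.]])     # hypothetical activation map

print(pool2d(a, mode="max"))       # [[6, 4], [8, 9]]
print(pool2d(a, mode="average"))   # [[3.75, 2.25], [4.5, 4.0]]
print(pool2d(a, mode="sum"))       # [[15, 9], [18, 16]]
```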



**Pooling in CNN is used mainly for**

Dimension Reduction: When we train a deep learning model, large inputs can make training take a huge amount of time. Consider max pooling with a 5x5 window and a stride of 5, so that the windows do not overlap. It reduces each successive 5x5 region of the given image to a 1x1 region holding the maximum value of that region. Pooling thus reduces 25 (5x5) pixels to a single pixel (1x1), which helps mitigate the curse of dimensionality.

Rotational/Position Invariance Feature Extraction: Pooling can also be used to extract features with a degree of rotational and positional invariance. Consider the same example of pooling with a 5x5 window. Pooling extracts the max value from the given 5x5 region; that is, it extracts the dominant feature value (the max value) from the region irrespective of where in the region that value occurs. Because pooling does not record the position of the max value, small shifts or rotations that keep the feature inside the same window leave the output unchanged.
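Both effects can be observed on a toy 10x10 activation map that contains a single strong activation. The sketch assumes non-overlapping 5x5 windows (a stride of 5), and the helper names are ours.

```python
import numpy as np

def max_pool_5x5(m):
    """Max pooling with non-overlapping 5x5 windows (stride 5)."""
    h, w = m.shape
    return m.reshape(h // 5, 5, w // 5, 5).max(axis=(1, 3))

def single_activation(row, col, size=10):
    """A map that is zero everywhere except for one strong activation."""
    m = np.zeros((size, size))
    m[row, col] = 1.0
    return m

p_corner = max_pool_5x5(single_activation(0, 0))
p_moved = max_pool_5x5(single_activation(3, 4))     # still inside the same window
p_crossed = max_pool_5x5(single_activation(3, 5))   # moved into the next window

print(p_corner.shape)                         # (2, 2): 100 values reduced to 4
print(np.array_equal(p_corner, p_moved))      # True
print(np.array_equal(p_corner, p_crossed))    # False
```

The last line shows the limit of this invariance: once the feature crosses a window boundary, the pooled output changes.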


### ReLU (Rectified Linear Units) Layers

After each conv layer, it is convention to apply a nonlinear layer (or activation layer) immediately afterward. The purpose of this layer is to introduce nonlinearity to a system that basically has just been computing linear operations during the conv layers (just element-wise multiplications and summations). In the past, nonlinear functions like tanh and sigmoid were used, but researchers found that ReLU layers work far better because the network is able to train a lot faster (because of the computational efficiency) without making a significant difference to the accuracy. It also helps to alleviate the vanishing gradient problem, which is the issue where the lower layers of the network train very slowly because the gradient decreases exponentially through the layers (a full explanation is beyond the scope of this section). The ReLU layer applies the function f(x) = max(0, x) to all of the values in the input volume. In basic terms, this layer just changes all the negative activations to 0. This layer increases the nonlinear properties of the model and the overall network without affecting the receptive fields of the conv layer.
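Applied to an activation map, ReLU is a one-line element-wise operation. The values below are made up for illustration.

```python
import numpy as np

def relu(x):
    """f(x) = max(0, x), applied element-wise."""
    return np.maximum(0, x)

activation_map = np.array([[-2.0, 1.5],
                           [0.0, -0.5]])   # hypothetical conv layer output
rectified = relu(activation_map)

print(rectified)         # negatives become 0, positives are unchanged
print(rectified.shape)   # (2, 2): the spatial size is not affected
```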


## 1.4 Benefits of using Convolutional Neural Networks 

Now that we have an understanding of how Convolutional Neural Networks work let’s explore a few of the fundamental principles leveraged by CNNs:

1.  **Sparse Representations**<br/> Let’s assume that you are working on an image classification problem that involves the analysis of large pictures that are millions of pixels in size. A traditional neural network will model the knowledge using matrix multiplication operations that involve every input and every parameter, which easily results in tens of billions of computations. Remember that CNNs are based on convolution operations between an input tensor and a kernel tensor? Well, it turns out that the kernel in convolution functions tends to be drastically smaller than the input, which reduces the number of computations required to train the model or to make predictions. In our sample scenario, a potential CNN algorithm will focus only on relevant features of the input image, requiring fewer parameters in the convolution. The result could be a model that needs a few billion fewer operations and is more efficient than traditional fully-connected neural networks.<br/><br/>
2. **Parameter Sharing**<br/> Another important optimization technique used in CNNs is known as parameter sharing. Conceptually, parameter sharing simply refers to the fact that CNNs tend to reuse the same parameters across different functions in the deep neural network. More specifically, parameter sharing entails that the same weight parameters are used at every position of the input, which allows the model to learn a single set of weights once instead of a different set for every position. Parameter sharing in CNNs typically results in massive savings in memory compared to traditional models.<br/><br/>
3. **Equivariance**<br/> The equivariance property of CNNs can be seen as a consequence of this specific type of parameter sharing. Conceptually, a function is considered equivariant if, upon a change in the input, a similar change is reflected in the output. In mathematical terms, a function f is equivariant to a function g if f(g(x)) = g(f(x)). It turns out that convolutions are equivariant to many data transformation operations, which means that we can predict how specific changes in the input will be reflected in the output.
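The equivariance relation f(g(x)) = g(f(x)) can be checked numerically. The sketch below takes f to be a 1-D convolution and g a translation. The wrap-around border handling, the helper names and the sample values are assumptions chosen so that the shift is exact; real CNN layers handle borders differently. It also compares the 3 shared weights with what a dense layer of the same width would need.

```python
import numpy as np

def circular_conv(signal, kernel):
    """1-D convolution with wrap-around borders (an assumption that makes shifts exact)."""
    n = len(signal)
    return np.array([
        sum(kernel[k] * signal[(i - k) % n] for k in range(len(kernel)))
        for i in range(n)
    ])

def shift(signal, steps=2):
    """The transformation g: translate the signal by `steps` positions."""
    return np.roll(signal, steps)

x = np.array([0.0, 1.0, 3.0, 2.0, 0.0, 0.0, 5.0, 1.0])
w = np.array([1.0, -1.0, 0.5])    # the same 3 weights are reused at every position

conv_then_shift = shift(circular_conv(x, w))   # g(f(x))
shift_then_conv = circular_conv(shift(x), w)   # f(g(x))
equivariant = np.allclose(conv_then_shift, shift_then_conv)

conv_weights = len(w)              # shared across all 8 positions
dense_weights = len(x) * len(x)    # a dense 8 -> 8 layer needs one weight per input/output pair
print(equivariant, conv_weights, dense_weights)
```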


## 1.5 Visualizing Convolutional Neural Networks

In the previous sections, we built up an understanding of convolutional neural networks without referring to any significant mathematics. To go further, however, we need to understand convolutions.

If we just wanted to understand convolutional neural networks, it might suffice to roughly understand convolutions. But the aim of this series is to bring us to the frontier of convolutional neural networks and explore new options. To do that, we’re going to need to understand convolutions very deeply.

Thankfully, with a few examples, convolution becomes quite a straightforward idea.

### Lessons from a Dropped Ball

Say you drop a ball from height $h_1$ and it can only move horizontally, along one dimension. It first hits the ground at some point A, bounces up to height $h_2$, hits the ground again at B, and then again at C. Consider the event that the distance from A to B is 2 units and the distance from B to C is 1 unit. Out of all possible positions of A, B and C, what is the probability of this particular outcome? We can calculate it as the product of two probabilities: the probability that the ball lands 2 units away from A when dropped at A from height $h_1$, and the probability that it lands 1 unit away from B when dropped at B from height $h_2$.

Now consider a broader question: if you drop the ball at A, what is the probability that after the second drop it ends up 3 units away from A horizontally? The case above is only one specific way this can happen. You have to consider all combinations of distances such that the distance from A to C is 3, for example a distance of 0.5 from A to B and 2.5 from B to C, and so on. There are infinitely many combinations, and the probability we are looking for is obtained by summing the probabilities of all of them. In other words, to find the total likelihood of the ball travelling a total distance of 3 units, we can’t consider only one possible way of reaching C. Instead, we consider all the possible ways of partitioning AC into two drops and sum over the probability of each way.

<img src="images/cnn41.png" width="600">

<img src="images/cnn42.png" width="600">

To make this a bit more concrete, we can think about this in terms of positions the ball might land. After the first drop, it will land at an intermediate position a with probability f(a). If it lands at a, it has probability g(c−a) of landing at a position c.

<img src="images/cnn43.png" width="600">

To get the convolution, we consider all intermediate positions.

<img src="images/cnn44.png" width="400">
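The sum over intermediate positions can be written out directly. The sketch below assumes discrete distances and two made-up distributions `f` and `g`, evaluates the sum of f(a)·g(c−a) for every total distance c, and compares the result with NumPy's built-in `np.convolve`.

```python
import numpy as np

# Hypothetical discrete distributions: index i holds the probability of moving i units.
f = np.array([0.1, 0.6, 0.3])   # first drop
g = np.array([0.5, 0.5])        # second drop

def convolve_at(f, g, c):
    """(f * g)(c) = sum over a of f(a) * g(c - a)."""
    total = 0.0
    for a in range(len(f)):
        b = c - a                # distance the second drop must cover
        if 0 <= b < len(g):      # outside this range g is zero
            total += f[a] * g[b]
    return total

totals = np.array([convolve_at(f, g, c) for c in range(len(f) + len(g) - 1)])
print(totals)                    # probability of each total distance 0..3
print(np.convolve(f, g))         # NumPy computes the same sums
```

Because every pair of drops is counted exactly once, the resulting values again sum to 1.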

<br/><br/>

### Visualizing Convolutions

There’s a very nice trick that helps one think about convolutions more easily.

First, an observation. Suppose the probability that a ball lands a certain distance x from where it started is f(x). Then, afterwards, the probability that it started a distance x from where it landed is f(−x).

<img src="images/cnn45.png" width="600">

If we know the ball lands at a position c after the second drop, what is the probability that the previous position was a?

<img src="images/cnn46.png" width="400">

So the probability that the previous position was a is g(−(a−c))=g(c−a).

Now, consider the probability each intermediate position contributes to the ball finally landing at c. We know the probability of the first drop putting the ball into the intermediate position a is f(a). We also know that the probability of it having been in a, if it lands at c is g(c−a).

<img src="images/cnn47.png" width="600">

Summing over the as, we get the convolution.

The advantage of this approach is that it allows us to visualize the evaluation of a convolution at a value c in a single picture. By shifting the bottom half around, we can evaluate the convolution at other values of c. This allows us to understand the convolution as a whole.

For example, we can see that it peaks when the distributions align.

<img src="images/cnn48.png" width="400">

And shrinks as the intersection between the distributions gets smaller.

<img src="images/cnn49.png" width="400">

By using this trick in an animation, it really becomes possible to visually understand convolutions.

Below, we’re able to visualize the convolution of two box functions:

<img src="images/cnn50.gif" width="400">
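A discrete version of the same experiment is a small sketch away, assuming a box of width 5 samples (an arbitrary choice): the overlap grows, peaks when the boxes align, and shrinks again, which gives a triangle.

```python
import numpy as np

box = np.ones(5)                  # a discrete box function of width 5
result = np.convolve(box, box)    # 'full' mode: one value per overlap position

# Overlap grows one sample at a time, peaks when the boxes align, then shrinks.
print(result)
```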

Armed with this perspective, a lot of things become more intuitive.

Let’s consider a non-probabilistic example. Convolutions are sometimes used in audio manipulation. For example, one might use a function with two spikes in it, but zero everywhere else, to create an echo. As our double-spiked function slides, one spike hits a point in time first, adding that signal to the output sound, and later, another spike follows, adding a second, delayed copy.
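The echo can be reproduced with a few lines of NumPy. In this sketch the signal values, the delay of 4 samples and the gain of 0.5 are all made-up parameters; the kernel is the double-spiked function described above.

```python
import numpy as np

signal = np.array([1.0, 0.5, 0.25, 0.0, 0.0, 0.0, 0.0, 0.0])   # a short made-up "sound"

delay, gain = 4, 0.5               # hypothetical echo parameters
echo_kernel = np.zeros(delay + 1)
echo_kernel[0] = 1.0               # first spike: the original sound
echo_kernel[delay] = gain          # second spike: a quieter copy, `delay` samples later

with_echo = np.convolve(signal, echo_kernel)
print(with_echo[:8])               # original samples followed by the delayed, halved copy
```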

# 2. Deep convolutional models

## 2.1 Classical Networks

Now that we have gone through the basics of CNNs, we’ll summarize a lot of the new and important developments in the field of computer vision and convolutional neural networks. We’ll look at some of the most important papers that have been published over the years and discuss why they’re so important. In this section we will discuss three papers, each presenting a different architecture for image classification.

### 2.1.1 LeNet-5

Yann LeCun, Leon Bottou, Yoshua Bengio and Patrick Haffner proposed a neural network architecture for *handwritten and machine-printed character recognition* in the 1990s, which they called LeNet-5. The architecture is straightforward and simple to understand, which is why it is often used as a first step in teaching convolutional neural networks.

<img src="images/cnn24.png" width="800">

The architecture shown above is a very general form of what a convolutional model looks like. Notice how the depth gradually increases through the hidden layers while the height and width decrease. At the beginning we have convolutional and pooling layers, at the end we add fully connected layers, and the last layer is a softmax layer that outputs the probabilities of a 10-way classification. We will look at this architecture more closely below. The more recent architectures covered in the next sections have an overall structure similar to this model.

The LeNet-5 architecture consists of two sets of convolutional and average pooling layers, followed by a flattening convolutional layer, then two fully-connected layers and finally a softmax classifier.

The input for LeNet-5 is a 32×32 grayscale image which passes through the first convolutional layer with 6 feature maps or filters having size 5×5 and a stride of one. The image dimensions change from 32x32x1 to 28x28x6.
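The change from 32 to 28 follows from the usual output-size formula for a convolution without padding. The helper below is an illustrative sketch (the function name is an assumption); it also counts the trainable parameters of this first layer, one bias per filter included.

```python
def conv_output_size(n, f, stride=1, padding=0):
    """Spatial output size of a conv or pooling layer: floor((n + 2p - f) / s) + 1."""
    return (n + 2 * padding - f) // stride + 1

c1_size = conv_output_size(32, 5, stride=1)   # 5x5 filters, stride 1, on the 32x32 input
c1_params = 6 * (5 * 5 * 1 + 1)               # 6 filters, each with 5x5x1 weights plus a bias

print(c1_size, c1_params)
```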

<img src="images/cnn25.jpg" width="500">

Then LeNet-5 applies an *average pooling layer* or sub-sampling layer with a filter size of 2×2 and a stride of two. The resulting image dimensions are reduced to 14x14x6. Next, there is a second convolutional layer with 16 feature maps having size 5×5 and a stride of 1. In this layer, each of the 16 feature maps is connected to only a subset of the 6 feature maps of the previous layer, as shown below.

<img src="images/cnn26.png" width="500">

The main reason is to break the symmetry in the network and to keep the number of connections within reasonable bounds. That’s why the number of trainable parameters in this layer is 1516 instead of 2400 and, similarly, the number of connections is 151600 instead of 240000. The fourth layer (S4) is again an average pooling layer with filter size 2×2 and a stride of 2. This layer is the same as the second layer (S2) except it has 16 feature maps, so the output will be reduced to 5x5x16. The fifth layer (C5) is a fully connected convolutional layer with 120 feature maps each of size 1×1. Each of the 120 units in C5 is connected to all the 400 nodes (5x5x16) in the fourth layer S4. The sixth layer is a fully connected layer (F6) with 84 units. Finally, there is a fully connected softmax output layer ŷ with 10 possible values corresponding to the digits from 0 to 9.
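The figures 1516 and 151600 can be reproduced with a little arithmetic. The sketch below assumes the grouping described in the LeNet-5 paper, in which 6 of the second convolutional layer's maps read 3 input maps, 9 read 4 and 1 reads all 6; the variable names are illustrative only.

```python
kernel_weights = 5 * 5            # each connection between two maps uses one 5x5 kernel

# (number of output maps, number of input maps each one reads); assumed grouping, see above
connection_scheme = [(6, 3), (9, 4), (1, 6)]

sparse_params = sum(maps * (inputs * kernel_weights + 1)   # +1 bias per output map
                    for maps, inputs in connection_scheme)
dense_weights = 16 * 6 * kernel_weights   # every output map reading every input map (biases excluded)

output_positions = 10 * 10        # output maps are 10x10, since 14 - 5 + 1 = 10
sparse_connections = sparse_params * output_positions
dense_connections = dense_weights * output_positions

print(sparse_params, dense_weights, sparse_connections, dense_connections)
```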

Following is the summary of the architecture:

<img src="images/cnn27.jpg" width="500">



### 2.1.2 AlexNet

This paper, titled “ImageNet Classification with Deep Convolutional Neural Networks”, has been cited more than 6,184 times and is widely regarded as one of the most influential publications in the field. Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton created a “large, deep convolutional neural network” that was used to win the 2012 ILSVRC (ImageNet Large-Scale Visual Recognition Challenge). For those who aren’t familiar, this competition can be thought of as the annual Olympics of computer vision, where teams from across the world compete to see who has the best computer vision model for tasks such as classification, localization, detection, and more. 2012 marked the first year in which a CNN was used to achieve a top-5 test error rate of 15.4% (top-5 error is the rate at which, given an image, the model does not output the correct label among its top 5 predictions). The next best entry achieved an error of 26.2%, so this was an astounding improvement that pretty much shocked the computer vision community. Safe to say, CNNs became household names in the competition from then on. The ImageNet dataset contains more than 14 million images, and the challenge classifies images into 1000 classes.
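To make the top-5 metric concrete, here is a small sketch of how such an error rate could be computed from class scores. The `top_k_error` helper, the scores and the labels are all made up for illustration and use only 6 classes instead of 1000.

```python
import numpy as np

def top_k_error(scores, labels, k=5):
    """Fraction of examples whose true label is NOT among the k highest-scoring classes."""
    top_k = np.argsort(scores, axis=1)[:, -k:]        # indices of the k largest scores per row
    hit = (top_k == labels[:, None]).any(axis=1)
    return 1.0 - hit.mean()

# Made-up scores for 3 images over 6 classes.
scores = np.array([[0.1, 0.2, 0.3, 0.9, 0.5, 0.4],
                   [0.9, 0.1, 0.2, 0.3, 0.4, 0.5],
                   [0.2, 0.9, 0.1, 0.3, 0.5, 0.4]])
labels = np.array([3, 1, 4])

top1 = top_k_error(scores, labels, k=1)   # only the single best guess counts
top5 = top_k_error(scores, labels, k=5)   # any of the five best guesses counts
print(top1, top5)
```

The top-5 error can never exceed the top-1 error, since every top-1 hit is also a top-5 hit.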



Let us have a walk-through of this paper:

The highlights of the paper:
* Use ReLU instead of tanh to add non-linearity. It speeds up training by a factor of 6 at the same accuracy.
* Use dropout as a regulariser to deal with overfitting. However, the training time is doubled with a dropout rate of 0.5.
* Use overlapping pooling to reduce the size of the network. It reduces the top-1 and top-5 error rates by 0.4% and 0.3%, respectively.

**The architecture**

The input to AlexNet is an RGB image of size 256×256. This means all images in the training set and all test images need to be of size 256×256. Random crops of size 227×227 were generated from inside the 256×256 images to feed the first layer of AlexNet. (Note that many different crops can be constructed from a single image, so the model is trained several times on the same image with slightly different cropping and thus learns to be invariant to small lateral shifts in the image.)
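A random crop of this kind takes only a few lines of NumPy. The function below is a sketch under assumptions: the name, the random stand-in image and the 50% mirroring (the flip discussed under data augmentation later in this section) are illustrative and not taken from the authors' code.

```python
import numpy as np

def random_crop_and_flip(image, crop=227, rng=None):
    """Return a random crop x crop patch, mirrored left-right half of the time."""
    rng = np.random.default_rng() if rng is None else rng
    h, w, _ = image.shape
    top = rng.integers(0, h - crop + 1)       # upper bound is exclusive
    left = rng.integers(0, w - crop + 1)
    patch = image[top:top + crop, left:left + crop, :]
    if rng.random() < 0.5:
        patch = patch[:, ::-1, :]             # horizontal mirror
    return patch

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(256, 256, 3), dtype=np.uint8)   # stand-in for a real photo
patch = random_crop_and_flip(image, rng=rng)

crops_per_image = (256 - 227 + 1) ** 2 * 2    # distinct crop positions x 2 mirror states
print(patch.shape, crops_per_image)
```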

AlexNet was much larger than previous CNNs used for computer vision tasks (e.g. Yann LeCun’s LeNet paper in 1998). It has 60 million parameters and 650,000 neurons and took five to six days to train on two GTX 580 3GB GPUs. Today there are much more complex CNNs that run very efficiently on faster GPUs, even on very large datasets. But back in 2012, this was huge!

<img src="images/cnn22.png" width="800">
<center>Fig 2.1.1 Architecture of AlexNet</center>

AlexNet consists of 5 Convolutional Layers and 3 Fully Connected Layers. The first Convolutional Layer of AlexNet contains 96 kernels of size 11x11x3. Note that the width and height of a kernel are usually the same and its depth is the same as the number of input channels. The first two Convolutional layers are followed by the Overlapping Max Pooling layers that we describe next. The third, fourth and fifth convolutional layers are connected directly. The fifth convolutional layer is followed by an Overlapping Max Pooling layer, the output of which goes into a series of two fully connected layers. The second fully connected layer feeds into a softmax classifier with 1000 class labels. ReLU nonlinearity is applied after all the convolution and fully connected layers. The figure below summarizes all the layers.

<img src="images/cnn28.jpg" width="800">

**Overlapping Max Pooling**

Max Pooling layers are usually used to downsample the width and height after a layer, keeping the depth the same. Overlapping Max Pool layers are similar to Max Pool layers, except that the adjacent windows over which the max is computed overlap each other. The authors used pooling windows of size 3×3 with a stride of 2 between adjacent windows. This overlapping nature of pooling helped reduce the top-1 error rate by 0.4% and the top-5 error rate by 0.3% when compared to using non-overlapping pooling windows of size 2×2 with a stride of 2, which would give the same output dimensions. (Note: top-5 error is computed as follows. For each test case you take the 5 most probable predictions; if the correct label is among those 5, the image is counted as correctly predicted, otherwise as incorrectly predicted. The fraction of incorrectly predicted images is the top-5 error.)
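The difference between the two pooling schemes is easy to see in one dimension. The sketch below uses made-up values and illustrative helper names; the feature-map width of 55 in the last two lines is an assumption chosen only to show that both window settings produce the same output size.

```python
def max_pool_1d(row, size, stride):
    """Max over sliding windows of `size` elements taken every `stride` elements."""
    return [max(row[i:i + size]) for i in range(0, len(row) - size + 1, stride)]

def pooled_size(n, size, stride):
    """Number of windows that fit along one dimension of length n."""
    return (n - size) // stride + 1

row = [1, 5, 2, 0, 3, 4, 1]

overlapping = max_pool_1d(row, size=3, stride=2)          # windows [1,5,2], [2,0,3], [3,4,1]
non_overlapping = max_pool_1d(row[:6], size=2, stride=2)  # windows [1,5], [2,0], [3,4]
print(overlapping, non_overlapping)

# Same output width for both settings on a 55-wide map (55 is an assumed example size).
print(pooled_size(55, 3, 2), pooled_size(55, 2, 2))
```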

**ReLU Nonlinearity**

An important feature of AlexNet is the use of the ReLU (Rectified Linear Unit) nonlinearity. Tanh or sigmoid activation functions used to be the usual way to train a neural network model. AlexNet showed that with the ReLU nonlinearity, deep CNNs could be trained much faster than with saturating activation functions like tanh or sigmoid. The figure below from the paper shows that using ReLUs (solid curve), AlexNet could achieve a 25% training error rate six times faster than an equivalent network using tanh (dotted curve). This was tested on the CIFAR-10 dataset.

<img src="images/cnn23.png" width="300">

**Reducing Overfitting**

The size of a Neural Network determines its capacity to learn, but if you are not careful, it will try to memorize the examples in the training data without understanding the concept. As a result, the Neural Network will work exceptionally well on the training data but fail to learn the real concept, so it will not work well on new and unseen test data. This is called overfitting.

The authors of this paper used the following methods to reduce overfitting:

* **Data Augmentation**: Overfitting is reduced when the amount of training data is increased, so the authors augmented the images such that the number of images the model is actually trained on is much larger than the number of training images provided in ImageNet. Showing a Neural Net different variations of the same image helps prevent overfitting. Intuitively, you are forcing it not to memorize! If we have an image of a cat in our training set, its mirror image is also a valid image of a cat, so the model does not memorize that particular instance as a cat but instead learns the concept of a cat. We can therefore double the size of the training dataset by simply flipping each image about the vertical axis. In addition, cropping the original image randomly produces additional data that is just a shifted version of the original. Random crops of the same image look very similar but are not exactly the same. This teaches the Neural Network that a minor shift of pixels does not change the fact that the image is still that of a cat. Without data augmentation, the authors would not have been able to use such a large network because it would have suffered from substantial overfitting.

* **Dropout**: With about 60 million parameters to train, the authors experimented with other ways to reduce overfitting too. They applied a technique called dropout, introduced by G.E. Hinton in another paper in 2012. In dropout, a neuron is dropped from the network with a probability of 0.5. When a neuron is dropped, it does not contribute to either forward or backward propagation, so every input goes through a different network architecture. As a result, the learnt weight parameters are more robust and do not get overfitted easily. During testing, there is no dropout and the whole network is used, but the output is scaled by a factor of 0.5 to account for the missed neurons while training. Dropout increases the number of iterations needed to converge by a factor of 2, but without dropout, AlexNet would overfit substantially.
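The dropout behaviour described in the last bullet can be sketched as follows. This follows the variant stated there (drop units while training, scale outputs by 0.5 at test time); many current frameworks instead rescale during training, so treat the function below as an illustration with assumed names rather than any library's implementation.

```python
import numpy as np

def dropout_forward(x, p_drop=0.5, train=True, rng=None):
    """Zero each unit with probability p_drop while training; scale outputs at test time."""
    if train:
        rng = np.random.default_rng() if rng is None else rng
        mask = rng.random(x.shape) >= p_drop    # True = keep the unit
        return x * mask
    return x * (1.0 - p_drop)                   # whole network, scaled by the keep probability

rng = np.random.default_rng(0)
activations = np.ones(10_000)                   # dummy activations, all equal to 1

train_out = dropout_forward(activations, train=True, rng=rng)
test_out = dropout_forward(activations, train=False)

print(train_out.mean(), test_out.mean())        # both are close to 0.5 on average
```

The test-time scaling keeps the expected activation the same as during training, which is the reason for the factor of 0.5.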


### 2.1.3 VGG-16

Simplicity and depth. That’s what a model created in 2014 (which did not win ILSVRC 2014) best utilized, with its 7.3% error rate. Karen Simonyan and Andrew Zisserman of the University of Oxford created a 16-layer CNN that strictly used 3x3 filters with a stride and padding of 1, along with 2x2 max pooling layers with stride 2. Simple enough, right? You will have observed above that the AlexNet authors used kernels of different sizes, while VGG-16 uses only kernels of size $3\times 3$ for simplicity. All the convolutional layers use same padding, so the height and width do not change in a convolutional layer (see the padding section above); they change only at the pooling layers. All the pooling layers use a window of size ($2\times 2$) and a stride of 2, so the height and width are halved after every pooling layer while the depth remains the same. Note that we do not count pooling layers as layers, so the network shown below has only 16 layers: 13 convolutional layers and 3 fully connected layers. The overview of the VGG-16 architecture given below is quite straightforward and simple to understand. I would also recommend reading the introduction and architecture sections of the paper, as the paper too is fairly easy to understand. A major takeaway from this paper is that you don't always have to design very complicated architectures to get good results; even simple architectures can perform well. Also note that the authors experimented with different numbers of layers: 11, 16 and 19. Try to compare the architectures used in these three variants.

We already mentioned that all the kernels are of size $3 \times 3$. The first two convolutional layers have 64 kernels each, and every convolutional layer uses same padding, i.e. the height and width don't change after a convolutional layer. So the input to the first layer has the shape of the input image ($224\times 224 \times 3$) and its output is ($224\times 224 \times 64$). The second layer also has 64 kernels and same padding, so its output is ($224\times 224 \times 64$). Next comes a max pool layer with a window of size ($2\times 2$) and a stride of 2, so the output of the pooling layer is ($112\times 112 \times 64$). Then there are 2 more convolutional layers with 128 kernels, so the output after these two layers has shape ($112\times 112 \times 128$), followed by a max pool layer whose output is ($56\times 56 \times 128$). Then come 3 more convolutional layers with 256 kernels, whose output is ($56\times 56 \times 256$), and again a max pool whose output is ($28\times 28 \times 256$). Then 3 more convolutional layers with 512 kernels and a max pool afterwards, whose output is ($14\times 14 \times 512$). This is followed by 3 more convolutional layers with 512 kernels and a max pool layer at the end, such that the output is ($7\times 7 \times 512$). Finally come two fully connected layers of 4096 neurons each and a 1000-way softmax layer. Counting the convolutional layers above (13) and the 3 fully connected layers at the end gives a total of 16 layers in this architecture.
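The shape bookkeeping in the walkthrough above can be replayed in a short loop. This is plain arithmetic, not a trainable model; the block list simply encodes the number of conv layers and filters per block as described in the text.

```python
# Each entry: (number of 3x3 conv layers in the block, number of filters)
vgg16_blocks = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]

height = width = 224
channels = 3
conv_layers = 0
shapes = []

for n_convs, filters in vgg16_blocks:
    conv_layers += n_convs
    channels = filters                          # same padding: convs change only the depth
    height, width = height // 2, width // 2     # 2x2 max pool with stride 2 halves height and width
    shapes.append((height, width, channels))

fc_layers = 3                                   # 4096, 4096 and the 1000-way softmax layer
total_layers = conv_layers + fc_layers

for shape in shapes:
    print(shape)                                # shape after each block's max pool
print(conv_layers, total_layers)
```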

<img src="images/cnn20.jpg" width="500">

**Main Points**

* The use of only 3x3 sized filters is quite different from AlexNet’s 11x11 filters in the first layer. The authors’ reasoning is that the combination of two 3x3 conv layers has an effective receptive field of 5x5. This in turn simulates a larger filter while keeping the benefits of smaller filter sizes. One of the benefits is a decrease in the number of parameters. Also, with two conv layers, we’re able to use two ReLU layers instead of one.
* 3 conv layers back to back have an effective receptive field of 7x7.
* As the spatial size of the input volumes at each layer decreases (a result of the conv and pool layers), the depth of the volumes increases due to the increased number of filters as you go down the network.
* Interesting to notice that the number of filters doubles after each maxpool layer. This reinforces the idea of shrinking spatial dimensions, but growing depth.
* Worked well on both image classification and localization tasks. The authors used a form of localization as regression (see page 10 of the paper for all details).
* Built model with the Caffe toolbox.
* Used scale jittering as one data augmentation technique during training.
* Used ReLU layers after each conv layer and trained with mini-batch gradient descent.
* Trained on 4 Nvidia Titan Black GPUs for two to three weeks.
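The first two points about stacked 3x3 filters can be checked with simple arithmetic. The helpers below are illustrative, and the channel count `C = 256` is an arbitrary assumption; biases are ignored.

```python
def stacked_receptive_field(n_layers, kernel=3):
    """Receptive field of n stacked stride-1 convolutions with the same kernel size."""
    return 1 + n_layers * (kernel - 1)

def conv_weights(kernel, channels, n_layers=1):
    """Weights (biases ignored) of n conv layers with `channels` input and output channels."""
    return n_layers * kernel * kernel * channels * channels

C = 256   # hypothetical channel count, chosen only for illustration

rf_two, rf_three = stacked_receptive_field(2), stacked_receptive_field(3)
two_3x3, one_5x5 = conv_weights(3, C, n_layers=2), conv_weights(5, C)
three_3x3, one_7x7 = conv_weights(3, C, n_layers=3), conv_weights(7, C)

print(rf_two, rf_three)            # effective receptive fields of the stacks
print(two_3x3 / one_5x5)           # 18/25 of the weights of a single 5x5 layer
print(three_3x3 / one_7x7)         # 27/49 of the weights of a single 7x7 layer
```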


## 2.2 Some advanced networks

### Inception Networks

While VGG achieves a phenomenal accuracy on ImageNet dataset, its deployment on even the most modest sized GPUs is a problem because of huge computational requirements, both in terms of memory and time. It becomes inefficient due to large width of convolutional layers.

For instance, for a convolutional layer with a 3X3 kernel size that takes 512 channels as input and outputs 512 channels, the order of calculations is 9X512X512.

In a convolutional operation at one location, every output channel (512 in the example above) is connected to every input channel, and so we call it a dense connection architecture. GoogLeNet builds on the idea that most of the activations in a deep network are either unnecessary (value of zero) or redundant because of correlations between them. Therefore the most efficient architecture of a deep network will have sparse connections between the activations, which implies that not all 512 output channels will be connected to all 512 input channels. There are techniques to prune such connections, which result in sparse weights/connections. But kernels for sparse matrix multiplication are not optimized in BLAS or cuBLAS (the CUDA version for GPUs) packages, which renders them even slower than their dense counterparts.

So GoogLeNet devised a module called the inception module that approximates a sparse CNN with a normal dense construction (shown in the figure). Since only a small number of neurons are effective, as mentioned earlier, the width (number of convolutional filters) for a particular kernel size is kept small. Also, it uses convolutions of different sizes to capture details at varied scales (5X5, 3X3, 1X1).

Another salient point about the module is that it has a so-called bottleneck layer (1X1 convolutions in the figure). It helps in the massive reduction of the computation requirement as explained below.

Let us take the first inception module of GoogLeNet as an example, which has 192 channels as input. It has just 128 filters of 3X3 kernel size and 32 filters of 5X5 size. The order of computation for the 5X5 filters is 25X32X192, which can blow up as we go deeper into the network, where the width of the network and the number of 5X5 filters increase further. To avoid this, the inception module uses 1X1 convolutions to reduce the number of input channels before feeding them into the larger kernels. So in the first inception module, the input is first fed into 1X1 convolutions with just 16 filters before it is fed into the 5X5 convolutions. This reduces the computations to 16X192 + 25X32X16. All these changes allow the network to have a large width and depth.
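Plugging in the numbers from the paragraphs above shows how much the 1X1 bottleneck saves. These are per-location multiplication counts, as in the text; the variable names are illustrative.

```python
vgg_layer = 3 * 3 * 512 * 512      # dense 3x3 layer with 512 input and 512 output channels

in_channels = 192                  # input depth of the first inception module
n_5x5 = 32                         # number of 5x5 filters
n_reduce = 16                      # 1x1 bottleneck filters in front of the 5x5 branch

direct = 5 * 5 * n_5x5 * in_channels                                     # 25 x 32 x 192
bottleneck = 1 * 1 * n_reduce * in_channels + 5 * 5 * n_5x5 * n_reduce   # 16 x 192 + 25 x 32 x 16
saving = direct / bottleneck

print(vgg_layer, direct, bottleneck, round(saving, 2))
```

With these numbers the 5X5 branch needs roughly a tenth of the multiplications it would need without the bottleneck.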

Another change that GoogLeNet made, was to replace the fully-connected layers at the end with a simple global average pooling which averages out the channel values across the 2D feature map, after the last convolutional layer. This drastically reduces the total number of parameters. This can be understood from AlexNet, where FC layers contain approx. 90% of parameters. Use of a large network width and depth allows GoogLeNet to remove the FC layers without affecting the accuracy. It achieves 93.3% top-5 accuracy on ImageNet and is much faster than VGG.
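Global average pooling itself is a single reduction. The sketch below uses a random stand-in feature volume of shape 7x7x1024 (an assumed size) and compares the number of classifier weights needed on the pooled vector with the number needed on the flattened volume.

```python
import numpy as np

rng = np.random.default_rng(0)
feature_maps = rng.random((7, 7, 1024))       # hypothetical last conv output (H, W, C)

pooled = feature_maps.mean(axis=(0, 1))       # one average per channel, shape (1024,)

n_classes = 1000
gap_head_weights = pooled.size * n_classes              # classifier on the pooled vector
flatten_head_weights = feature_maps.size * n_classes    # classifier on the flattened volume

print(pooled.shape, gap_head_weights, flatten_head_weights)
```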

<img src="images/cnn34.png" width="550">

<br/><br/><br/>

### ResNet  

From what we have seen so far, increasing the depth should increase the accuracy of the network, as long as over-fitting is taken care of. But the problem with increased depth is that the signal required to change the weights, which arises at the end of the network from comparing the ground truth with the prediction, becomes very small at the earlier layers. This essentially means that the earlier layers barely learn. This is called the vanishing gradient problem. The second problem with training deeper networks is that the optimization has to be performed over a huge parameter space, so naively adding layers leads to higher training error. This is called the degradation problem. Residual networks allow training of such deep networks by constructing the network from modules called residual modules, as shown in the figure. The intuition around why it works can be seen as follows:

<img src="images/cnn30.png" width="350">

Imagine a network A which produces x amount of training error. Construct a network B by adding a few layers on top of A, and set the parameter values in those layers so that they do nothing to the outputs from A. Let’s call the additional layers C. This would mean the same x amount of training error for the new network. So while training network B, the training error should not be above the training error of A. Since this DOES happen in practice, the only explanation is that learning the identity mapping (doing nothing to the inputs and just copying them as they are) with the added layers C is not a trivial problem, and the solver fails to achieve it. To solve this, the module shown above creates a direct path between the input and output of the module, which implements an identity mapping, so the added layers C only need to learn features on top of the already available input. Since C learns only the residual, the whole module is called a residual module.
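The identity argument can be demonstrated with a toy residual module. In this sketch two small dense layers stand in for the convolutional layers of the real module (an assumption made to keep the example short); with all-zero weights the residual branch contributes nothing and the module passes its input through unchanged.

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def residual_block(x, w1, w2):
    """y = relu(F(x) + x), where F is two dense layers standing in for conv layers."""
    f = relu(x @ w1) @ w2          # the residual branch (the added layers C)
    return relu(f + x)             # the shortcut adds the input back

rng = np.random.default_rng(0)
x = relu(rng.normal(size=(4, 8)))  # non-negative input, as if it came from an earlier ReLU

# With all-zero weights the branch outputs 0, so the block is exactly the identity mapping.
zeros = np.zeros((8, 8))
y = residual_block(x, zeros, zeros)
print(np.allclose(y, x))
```

Starting from weights near zero, the added layers only have to learn a correction to the identity, which is an easier target than learning the identity itself.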

Also, similar to GoogLeNet, it uses global average pooling followed by the classification layer. Through the changes mentioned, ResNets were trained with network depths as large as 152 layers. It achieves better accuracy than VGGNet and GoogLeNet while being computationally more efficient than VGGNet. ResNet-152 achieves 95.51% top-5 accuracy.

The architecture is similar to the VGGNet consisting mostly of 3X3 filters. From the VGGNet, shortcut connection as described above is inserted to form a residual network. This can be seen in the figure which shows a small snippet of earlier layer synthesis from VGG-19.

The power of residual networks can be judged from one of the experiments in the ResNet paper. The plain 34-layer network had a higher validation error than the plain 18-layer network. This is where we see the degradation problem. And the same 34-layer network, when converted into a residual network, has a much lower training error than the 18-layer residual network.

<img src="images/cnn31.webp" width="550">

<br/><br/><br/><br/>

Finally, let us review these architectures once more:

<img src="images/cnn33.png" width="550">




## 2.3 Understanding and Visualizing CNNs

One of the most debated topics in deep learning is how to interpret and understand a trained model – particularly in the context of high risk industries like healthcare. The term “black box” has often been associated with deep learning algorithms. How can we trust the results of a model if we can’t explain how it works? It’s a legitimate question.

Take the example of a deep learning model trained for detecting cancerous tumours. The model tells you that it is 99% sure that it has detected cancer – but it does not tell you why or how it made that decision.

Did it find an important clue in the MRI scan? Or was it just a smudge on the scan that was incorrectly detected as a tumour? This is a matter of life and death for the patient and doctors cannot afford to be wrong.

Here, we will explore how to visualize a convolutional neural network (CNN). We will get to know the importance of visualizing a CNN model, and the methods to visualize them. We will also take a look at a use case that will help you understand the concept better.

<br/><br/>

### Importance of Visualizing a CNN model

As we have seen in the cancerous tumour example above, it is absolutely crucial that we know what our model is doing – and how it’s making decisions on its predictions. Typically, the reasons listed below are the most important points for a deep learning practitioner to remember:

1. Understanding how the model works
2. Assistance in Hyperparameter tuning
3. Finding out the failures of the model and getting an intuition of why they fail
4. Explaining the decisions to a consumer / end-user or a business executive

Let us look at an example where visualizing a neural network model helped in understanding its follies and improving its performance:

Once upon a time, the US Army wanted to use neural networks to automatically detect camouflaged enemy tanks. The researchers trained a neural net on 50 photos of camouflaged tanks in trees, and 50 photos of trees without tanks. Using standard techniques for supervised learning, the researchers trained the neural network to a weighting that correctly loaded the training set—output “yes” for the 50 photos of camouflaged tanks, and output “no” for the 50 photos of forest.

This did not ensure, or even imply, that new examples would be classified correctly. The neural network might have “learned” 100 special cases that would not generalize to any new problem. Wisely, the researchers had originally taken 200 photos, 100 photos of tanks and 100 photos of trees. They had used only 50 of each for the training set. The researchers ran the neural network on the remaining 100 photos, and without further training the neural network classified all remaining photos correctly. Success confirmed! The researchers handed the finished work to the Pentagon, which soon handed it back, complaining that in their own tests the neural network did no better than chance at discriminating photos.

It turned out that in the researchers’ dataset, photos of camouflaged tanks had been taken on cloudy days, while photos of plain forest had been taken on sunny days. The neural network had learned to distinguish cloudy days from sunny days, instead of distinguishing camouflaged tanks from an empty forest.

<br/><br/>

**Methods of Visualizing a CNN model**

Broadly, the methods of visualizing a CNN model can be categorized into three groups based on their internal workings:

* **Preliminary methods** – Simple methods which show us the overall structure of a trained model
* **Activation based methods** – In these methods, we decipher the activations of the individual neurons or a group of neurons to get an intuition of what they are doing
* **Gradient based methods** – These methods tend to manipulate the gradients that are formed from a forward and backward pass while training a model

We will look at each of them in detail in the sections below. 


<br/><br/>

### Preliminary Methods

**Plotting model architecture**

The simplest thing you can do is to print or plot the model. This also shows the output shape of each layer and the number of parameters it holds.

In Keras, you can do this with *model.summary()*; the output looks like this:
<img src="images/cnn35.png" width="500">

For a more expressive view, you can draw a diagram of the architecture (hint: take a look at the *plot_model* function in keras.utils.vis_utils).

<img src="images/cnn36.png" width="500">

**Visualize filters**

Another way is to plot the filters of a trained model, so that we can understand the behaviour of those filters. For example, the first filter of the first layer of the above model looks like:
<img src="images/cnn37.png" width="300">

Generally, we see that the low level filters work as edge detectors, and as we go higher, they tend to capture high level concepts like objects and faces.

<img src="images/cnn38.png" width="500">
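Plotting filters usually comes down to rescaling each kernel to a displayable range and tiling the results into one image. The following is a minimal sketch under stated assumptions: the helper `filters_to_grid` is hypothetical, the weights are random stand-ins rather than trained values, and the kernel layout `(height, width, in_channels, n_filters)` is assumed to match what Keras uses for `Conv2D` layers.

```python
import numpy as np

def filters_to_grid(kernels, cols=8, pad=1):
    """Tile the first input channel of every filter into one 2-D image.

    kernels: array of shape (kh, kw, in_channels, n_filters).
    """
    kh, kw, _, n_filters = kernels.shape
    rows = int(np.ceil(n_filters / cols))
    grid = np.zeros((rows * (kh + pad) + pad, cols * (kw + pad) + pad))
    for i in range(n_filters):
        f = kernels[:, :, 0, i]
        # Min-max scale each filter to [0, 1] so it can be shown as an image
        span = f.max() - f.min()
        f = (f - f.min()) / span if span > 0 else np.zeros_like(f)
        r, c = divmod(i, cols)
        top = pad + r * (kh + pad)
        left = pad + c * (kw + pad)
        grid[top:top + kh, left:left + kw] = f
    return grid

# Stand-in weights (assumption). With a trained Keras model one would
# typically read them with: kernels = model.layers[0].get_weights()[0]
rng = np.random.RandomState(0)
kernels = rng.randn(3, 3, 1, 32)
grid = filters_to_grid(kernels)
print(grid.shape)
```

The resulting array can be passed to `plt.imshow(grid, cmap='gray')` to view all filters of a layer side by side.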


### Activation Maps

**Maximal Activations**

Another visualization technique is to take a large dataset of images, feed them through the network and keep track of which images maximally activate some neuron. We can then visualize the images to get an understanding of what the neuron is looking for in its receptive field. One such visualization (among others) is shown in *Rich feature hierarchies for accurate object detection and semantic segmentation* by Ross Girshick et al.:

<img src="images/cnn40.jpeg" width="800">

Maximally activating images for some POOL5 (5th pool layer) neurons of an AlexNet. The activation values and the receptive field of the particular neuron are shown in white. (In particular, note that the POOL5 neurons are a function of a relatively large portion of the input image!) It can be seen that some neurons are responsive to upper bodies, text, or specular highlights.

One problem with this approach is that ReLU neurons do not necessarily have any semantic meaning by themselves. Rather, it is more appropriate to think of multiple ReLU neurons as the basis vectors of some space that represents image patches. In other words, the visualization is showing the patches at the edge of the cloud of representations, along the (arbitrary) axes that correspond to the filter weights. This can also be seen by the fact that neurons in a ConvNet operate linearly over the input space, so any arbitrary rotation of that space is a no-op. This point was further argued in *Intriguing properties of neural networks* by Szegedy et al., where they perform a similar visualization along arbitrary directions in the representation space.
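The bookkeeping behind maximal activations can be sketched in a few lines. This is only an illustration: `top_activating_images` is a hypothetical helper, and the activation values are made-up numbers standing in for what a real layer would produce on a dataset.

```python
import numpy as np

def top_activating_images(activations, neuron, k=3):
    """Return indices of the k images with the largest activation for `neuron`.

    activations: array of shape (n_images, n_neurons), one row per image.
    """
    column = activations[:, neuron]
    # argsort is ascending, so take the last k indices and reverse them
    return np.argsort(column)[-k:][::-1]

# Toy activations for 5 images and 2 neurons (made-up numbers)
activations = np.array([
    [0.1, 2.0],
    [0.9, 0.0],
    [0.4, 1.5],
    [0.0, 0.3],
    [0.7, 0.2],
])
top_neuron0 = top_activating_images(activations, neuron=0, k=2)
top_neuron1 = top_activating_images(activations, neuron=1, k=2)
print(top_neuron0)  # [1 4]
print(top_neuron1)  # [0 2]
```

The returned indices point back into the image dataset, so the corresponding images (or the patches inside each neuron's receptive field) can then be displayed.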



**Image Occlusion**

Suppose that a ConvNet classifies an image as a dog. How can we be certain that it’s actually picking up on the dog in the image as opposed to some contextual cues from the background or some other miscellaneous object? One way of investigating which part of the image some classification prediction is coming from is by plotting the probability of the class of interest (e.g. dog class) as a function of the position of an occluder object. That is, we iterate over regions of the image, set a patch of the image to be all zero, and look at the probability of the class. We can visualize the probability as a 2-dimensional heat map. This approach has been used in Matthew Zeiler’s *Visualizing and Understanding Convolutional Networks*:

<img src="images/cnn39.jpeg" width="500">
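The occlusion procedure described above can be sketched as follows. The function `occlusion_heatmap` and the classifier `toy_predict` are hypothetical stand-ins chosen for illustration; a real experiment would plug in the trained network's prediction function instead.

```python
import numpy as np

def occlusion_heatmap(image, predict_proba, class_idx, patch=8, stride=8):
    """Slide a zeroed patch over `image` and record the class probability.

    predict_proba: hypothetical callable mapping an (H, W) image to a
    1-D array of class probabilities.
    """
    h, w = image.shape
    rows = (h - patch) // stride + 1
    cols = (w - patch) // stride + 1
    heatmap = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            occluded = image.copy()
            top, left = i * stride, j * stride
            occluded[top:top + patch, left:left + patch] = 0.0
            heatmap[i, j] = predict_proba(occluded)[class_idx]
    return heatmap

# Toy stand-in classifier (assumption): the probability of class 1 is the
# mean brightness of the top-left 16x16 corner of the image.
def toy_predict(img):
    p = float(img[:16, :16].mean())
    return np.array([1.0 - p, p])

image = np.ones((64, 64))
heatmap = occlusion_heatmap(image, toy_predict, class_idx=1)
print(heatmap.shape)                 # (8, 8)
print(heatmap[0, 0], heatmap[7, 7])  # 0.75 1.0
```

In this toy setup the probability drops only when the occluder covers the top-left corner, which is exactly the region the stand-in classifier relies on. For a Keras model with a `(64, 64, 1)` input, one plausible choice of `predict_proba` would be `lambda img: model.predict(img[None, :, :, None])[0]`.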




## 2.4 Comparison with traditional methods

We have now covered most of the basic concepts behind ConvNet models, and we have repeatedly stated that CNNs perform better on image classification tasks. Here we reinforce that claim by comparing a CNN with an SVM and other traditional classifiers on the sign language digits dataset.

Below you can see that the accuracy of sign digit classification using an SVM is nearly 75%. Later we will see that a CNN does much better, at about 92%.

The table below compares the accuracy obtained with the different algorithms.

| Model | Accuracy |
| :------------ | -----------: |
| Feed forward Neural Networks | 0.7748|
| SVM | 0.7457 |
| kNN | 0.6828 |
| Logistic Regression| 0.7506 |
| Random Forest| 0.8063 |
| **CNN** | **0.9225** |

In [18]:
from sklearn.model_selection import train_test_split
import numpy as np
import matplotlib.pyplot as plt
import os
from sklearn import svm
from sklearn.metrics import accuracy_score

# fix random seed for reproducibility
np.random.seed(7)

X = np.load('data/Sign-language-digits/X.npy')
y_ = np.load('data/Sign-language-digits/Y.npy')

y = np.argmax(y_, axis=1)

# Get shape of the data
print('X shape : {}  Y shape: {}'.format(X.shape, y.shape))

# Flatten each 64x64 image into a 4096-dimensional feature vector
X_ = X.reshape(X.shape[0], -1)

# Divide entire data into train and test set
X_train, X_test, y_train, y_test = train_test_split(X_, y, test_size=0.2, random_state=8)


X shape : (2062, 64, 64)  Y shape: (2062,)


In [19]:
clf = svm.SVC(gamma='scale', decision_function_shape='ovo')
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(accuracy_score(y_test, y_pred))

0.7457627118644068


In [20]:
from sklearn.neighbors import KNeighborsClassifier

clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(accuracy_score(y_test, y_pred))

0.6828087167070218


In [24]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(random_state=0, solver='lbfgs', multi_class='ovr')
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(accuracy_score(y_test, y_pred))



0.7506053268765133




In [27]:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=100, max_depth=None, random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(accuracy_score(y_test, y_pred))

0.8062953995157385


## 2.4 Hand sign digit classification using CNN

Earlier we saw how to classify hand sign digits using a feed-forward neural network. Now let us apply a CNN to the same dataset, and afterwards to the CIFAR-10 dataset.

In [28]:
from sklearn.model_selection import train_test_split
from keras.models import Sequential, load_model
from keras.layers.core import Dense, Dropout, Flatten
from keras.layers.convolutional import Conv2D
from keras.layers.pooling import MaxPooling2D

import numpy as np
import matplotlib.pyplot as plt

import os

# fix random seed for reproducibility
np.random.seed(7)

X = np.load('data/Sign-language-digits/X.npy')
y = np.load('data/Sign-language-digits/Y.npy')



# Get shape of the data
print('X shape : {}  Y shape: {}'.format(X.shape, y.shape))

# Add a channel axis: each sample is a 64x64 image with depth 1
X_ = X.reshape(X.shape[0], 64, 64, 1)

# Divide entire data into train and test set
X_train, X_test, y_train, y_test = train_test_split(X_, y, test_size=0.2, random_state=8)

# Build the CNN: four convolutional layers followed by two dense layers
model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=(64, 64, 1)))
model.add(Conv2D(64, kernel_size=(3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))

model.add(Conv2D(128, kernel_size=(3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(128, kernel_size=(3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))

model.add(Flatten())
model.add(Dense(1024, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(10, activation='softmax'))


model.compile(loss='categorical_crossentropy', metrics=['accuracy'], optimizer='adam')


# training the model and saving metrics in history
history = model.fit(X_train, y_train,
          batch_size=128, epochs=10,
          verbose=2,
          validation_data=(X_test, y_test))

# Evaluate the model
scores = model.evaluate(X_test,y_test)

print('Loss: %.3f' % scores[0])
print('Accuracy: %.3f' % scores[1])


# saving the model
model_path = 'results/keras_sign_lang3.h5'
model.save(model_path)
print('Saved trained model at %s ' % model_path)

X shape : (2062, 64, 64)  Y shape: (2062, 10)
Train on 1649 samples, validate on 413 samples
Epoch 1/10
 - 17s - loss: 2.3205 - acc: 0.1134 - val_loss: 2.3032 - val_acc: 0.1671
Epoch 2/10
 - 16s - loss: 2.2983 - acc: 0.1298 - val_loss: 2.2845 - val_acc: 0.2349
Epoch 3/10
 - 16s - loss: 2.0882 - acc: 0.3269 - val_loss: 1.4760 - val_acc: 0.5932
Epoch 4/10
 - 16s - loss: 1.3475 - acc: 0.5349 - val_loss: 1.0169 - val_acc: 0.6659
Epoch 5/10
 - 16s - loss: 1.0163 - acc: 0.6628 - val_loss: 0.7048 - val_acc: 0.7724
Epoch 6/10
 - 16s - loss: 0.7557 - acc: 0.7392 - val_loss: 0.5618 - val_acc: 0.8257
Epoch 7/10
 - 16s - loss: 0.5694 - acc: 0.8181 - val_loss: 0.4613 - val_acc: 0.8668
Epoch 8/10
 - 16s - loss: 0.4520 - acc: 0.8520 - val_loss: 0.3535 - val_acc: 0.8838
Epoch 9/10
 - 16s - loss: 0.3760 - acc: 0.8781 - val_loss: 0.3031 - val_acc: 0.8983
Epoch 10/10
 - 16s - loss: 0.3260 - acc: 0.8939 - val_loss: 0.2701 - val_acc: 0.9225
Loss: 0.270
Accuracy: 0.923
Saved trained model at results/keras_s

With 4 convolutional layers and 2 dense layers at the end, we obtained an accuracy of more than 92% on the dataset in only 10 epochs. In comparison, the feed-forward network reached only about 77%.
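To get a feel for the size of this network, the output shapes and parameter counts can be traced by hand. The sketch below is a plain-Python approximation of what a model summary would be expected to report; the helper names are hypothetical, and it assumes 'valid' padding with stride 1 for the convolutions and 2x2 pooling, as in the code above. Dropout layers are left out because they neither change the shape nor add parameters.

```python
def conv2d(shape, filters, k=3):
    """'valid' convolution, stride 1: returns (output shape, parameter count)."""
    h, w, c = shape
    params = k * k * c * filters + filters   # weights plus one bias per filter
    return (h - k + 1, w - k + 1, filters), params

def maxpool(shape, p=2):
    """Max pooling halves height and width and has no parameters."""
    h, w, c = shape
    return (h // p, w // p, c), 0

def dense_params(n_in, n_out):
    """Fully connected layer: weight matrix plus one bias per output unit."""
    return n_in * n_out + n_out

shape = (64, 64, 1)
total = 0
stack = [('conv', 32), ('conv', 64), ('pool', None),
         ('conv', 128), ('pool', None),
         ('conv', 128), ('pool', None)]
for kind, filters in stack:
    if kind == 'conv':
        shape, params = conv2d(shape, filters)
    else:
        shape, params = maxpool(shape)
    total += params
    print(kind, shape, params)

n_in = shape[0] * shape[1] * shape[2]   # Flatten: 6 * 6 * 128 = 4608
for n_out in (1024, 10):
    params = dense_params(n_in, n_out)
    total += params
    print('dense', n_out, params)
    n_in = n_out

print('total parameters:', total)   # 4970122
```

Most of the roughly 5 million parameters sit in the first dense layer (4608 x 1024 weights), not in the convolutional layers.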

In [29]:
# On cifar-10 dataset

from __future__ import print_function

import numpy as np

from keras.callbacks import EarlyStopping
from keras.datasets import cifar10
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Flatten
from keras.layers.convolutional import Conv2D
from keras.optimizers import Adam
from keras.layers.pooling import MaxPooling2D
from keras.utils import to_categorical

# For reproducibility
np.random.seed(1000)

# Load the dataset
(X_train, Y_train), (X_test, Y_test) = cifar10.load_data()

if __name__ == '__main__':
    # Create the model
    model = Sequential()

    model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=(32, 32, 3)))
    model.add(Conv2D(64, kernel_size=(3, 3), activation='relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Dropout(0.25))

    model.add(Conv2D(128, kernel_size=(3, 3), activation='relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Conv2D(128, kernel_size=(3, 3), activation='relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Dropout(0.25))

    model.add(Flatten())
    model.add(Dense(1024, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(10, activation='softmax'))

    # Compile the model
    model.compile(loss='categorical_crossentropy',
                  optimizer=Adam(lr=0.0001, decay=1e-6),
                  metrics=['accuracy'])

    # Train the model
    model.fit(X_train / 255.0, to_categorical(Y_train),
              batch_size=128,
              shuffle=True,
              epochs=250,
              validation_data=(X_test / 255.0, to_categorical(Y_test)),
              callbacks=[EarlyStopping(min_delta=0.001, patience=3)])

    # Evaluate the model
    scores = model.evaluate(X_test / 255.0, to_categorical(Y_test))

    print('Loss: %.3f' % scores[0])
    print('Accuracy: %.3f' % scores[1])

Downloading data from https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
Train on 50000 samples, validate on 10000 samples
Epoch 1/250
Epoch 2/250
Epoch 3/250
Epoch 4/250
Epoch 5/250
Epoch 6/250
Epoch 7/250
Epoch 8/250
Epoch 9/250
Epoch 10/250
Epoch 11/250
Epoch 12/250
Epoch 13/250
Epoch 14/250
Epoch 15/250
Epoch 16/250
Epoch 17/250
Epoch 18/250
  640/50000 [..............................] - ETA: 1:35 - loss: 0.9565 - acc: 0.6922

KeyboardInterrupt: 