# Convolutional Neural Network


- Resources:
    1. [An Intuitive Explanation of Convolutional Neural Networks](https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/)
    2. [A Quick Introduction to Neural Networks](https://ujjwalkarn.me/2016/08/09/quick-intro-neural-networks/)
    3. [Convolutional Neural Networks (LeNet)](http://deeplearning.net/tutorial/lenet.html)
    4. [UFLDL Convolutional Neural Network Tutorial](http://ufldl.stanford.edu/tutorial/supervised/ConvolutionalNeuralNetwork/)


### Motivation
   - From Hubel and Wiesel's early work on the cat's visual cortex, we know that the visual cortex contains a complex arrangement of cells.
   - These cells are sensitive to small sub-regions of the visual field, each called a *receptive field*.
   - The sub-regions are tiled to cover the entire visual field. These cells act as local filters over the input space and are well suited to exploit the strong spatially local correlation present in natural images.
   - Additionally, two basic cell types have been identified:
       - Simple cells respond maximally to specific edge-like patterns within their receptive field.
       - Complex cells have larger receptive fields and are locally invariant to the exact position of the pattern.
    
    
## What is \*Convolution\*? 
- In mathematics, convolution combines two functions to produce a third function that expresses how the shape of one is modified by the other.
- In image processing, convolution transforms each pixel in a structured way by computing a weighted sum of the pixel and its neighbors.
- In code, think of an image as an array of pixels: height by width, with 3 channels (RGB) for a color image.


- <img src = "convNN-overview.png" />
- there are 4 main operations in the ConvNet:
    1. Convolution
    2. Non-Linearity (ReLU)
    3. Pooling or Sub Sampling
    4. Classification (Fully Connected Layer)


**Every Image can be represented using a matrix of pixel values**
1. **channel**: a conventional term used to refer to a certain component of an image. 
    - a standard color image has 3 channels: red, green, and blue
    - imagine those as three 2D matrices stacked on top of each other, one for each color, each holding pixel values in the range 0-255
    - a **grayscale** image has just one channel.
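
As a minimal sketch of this representation (assuming NumPy; the array names and the 4x4 size are made up for illustration), a color image can be held as a height x width x 3 array of 8-bit values, and a grayscale image as a single height x width matrix:

```python
import numpy as np

# Hypothetical 4x4 RGB image: height x width x channels, values in 0-255.
rgb_image = np.zeros((4, 4, 3), dtype=np.uint8)
rgb_image[..., 0] = 255  # fill the red channel; green and blue stay 0

# Each channel is its own 2D matrix.
red, green, blue = rgb_image[..., 0], rgb_image[..., 1], rgb_image[..., 2]

# A grayscale image has a single channel, so a 2D matrix is enough.
# Averaging the three channels is one simple (assumed) way to produce it.
gray_image = rgb_image.mean(axis=2).astype(np.uint8)

print(rgb_image.shape)   # (4, 4, 3)
print(gray_image.shape)  # (4, 4)
```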


**The Convolution Step**
- The primary purpose of convolution in a ConvNet is to **extract features from the input image**. Convolution preserves the spatial relationship between pixels by learning image features from small squares of input data.
- In CNN terminology, the small matrix that slides over the image (3-by-3 in this example) is called a **filter**, kernel, or feature detector. The matrix formed by sliding the filter over the image and computing the dot product at each position is called the 'Convolved Feature', 'Activation Map', or **'feature map'**.
- In general, a CNN learns the values of these filters on its own during training. The more filters we have, the more image features get extracted, and the better the network tends to become at recognizing patterns in unseen images.
<img src = "ConvolutionStep.png" />
- The size of the feature map is controlled by three parameters:
    1. **Depth**: the number of filters we use for the convolution operation.
    2. **Stride**: the number of pixels by which we slide the filter matrix over the input matrix. When the stride is 1, we move the filter one pixel at a time. When the stride is 2, the filter jumps 2 pixels at a time.
        - a larger stride produces smaller feature maps
    3. **Zero-padding**
        - sometimes it is convenient to pad the input matrix with zeros around the border, so that we can apply the filter to the bordering elements of the input image matrix.
        - a nice feature of zero-padding is that it allows us to control the size of the feature maps. Adding zero-padding is sometimes referred to as **wide convolution**, and not using zero-padding as **narrow convolution**.
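
The sliding dot product, stride, and zero-padding described above can be sketched in NumPy as follows. This is an illustrative single-channel, single-filter version, not how a deep-learning library implements it; the function name `convolve2d` and the 5x5 image and 3x3 filter values are assumptions chosen for the example.

```python
import numpy as np

def convolve2d(image, kernel, stride=1, padding=0):
    """Slide `kernel` over `image` and take the dot product at each position.

    Like most deep-learning libraries, this does not flip the kernel, so
    strictly speaking it computes a cross-correlation.
    """
    if padding > 0:
        # Zero-padding: surround the image with a border of zeros.
        image = np.pad(image, padding, mode="constant", constant_values=0)
    k_h, k_w = kernel.shape
    out_h = (image.shape[0] - k_h) // stride + 1
    out_w = (image.shape[1] - k_w) // stride + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = image[i * stride:i * stride + k_h,
                           j * stride:j * stride + k_w]
            feature_map[i, j] = np.sum(window * kernel)
    return feature_map

# Illustrative binary image and filter.
image = np.array([[1, 1, 1, 0, 0],
                  [0, 1, 1, 1, 0],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 0],
                  [0, 1, 1, 0, 0]])
kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]])

narrow = convolve2d(image, kernel)             # no padding: 3x3 feature map
wide = convolve2d(image, kernel, padding=1)    # zero-padded: 5x5 feature map
strided = convolve2d(image, kernel, stride=2)  # stride 2: 2x2 feature map
print(narrow)
print(wide.shape, strided.shape)
```

With stride 1 and no padding the 5x5 input shrinks to 3x3. One pixel of zero-padding keeps it at 5x5, and a stride of 2 reduces it to 2x2.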

**ReLU Non-Linearity**
<img src = 'ReLu.png' />

 - ReLU is an element-wise operation that replaces all negative pixel values in the feature map with zero. Its purpose is to introduce non-linearity into the ConvNet, since most real-world data is non-linear while convolution itself is a linear operation.
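
A minimal NumPy sketch of this element-wise operation (the feature-map values are made up for illustration):

```python
import numpy as np

# Hypothetical feature map containing negative values.
feature_map = np.array([[ 3.0, -1.5,  0.0],
                        [-2.0,  4.0, -0.5]])

# ReLU: keep positive values, replace negative values with zero.
rectified = np.maximum(feature_map, 0)
print(rectified)
```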

**The Pooling Step**
- Spatial pooling (also called subsampling or downsampling) reduces the dimensionality of each feature map but retains the most important information.
- In the case of max pooling, we define a spatial neighborhood (say, a 2-by-2 window) and take the largest element from the rectified feature map within that window.
- The function of pooling is to progressively reduce the spatial size of the input representation. Pooling:
    - makes the input representations (feature dimensions) smaller and more manageable
    - reduces the number of parameters and computations in the network, which helps control overfitting
    - makes the network invariant to small transformations, distortions, and translations in the input image
    - helps us arrive at an almost scale-invariant representation of the image
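
Max pooling over non-overlapping windows can be sketched in NumPy as below. The helper name `max_pool2d` and the 4x4 input are assumptions for illustration; this version simply drops trailing rows or columns that do not fill a whole window, which differs from padded pooling in some libraries.

```python
import numpy as np

def max_pool2d(feature_map, k=2):
    """Take the maximum over each non-overlapping k-by-k window."""
    h, w = feature_map.shape
    h_out, w_out = h // k, w // k
    # Drop rows/columns that do not fill a complete window.
    trimmed = feature_map[:h_out * k, :w_out * k]
    # Split each spatial axis into (window index, position inside window).
    windows = trimmed.reshape(h_out, k, w_out, k)
    return windows.max(axis=(1, 3))

rectified = np.array([[1, 1, 2, 4],
                      [5, 6, 7, 8],
                      [3, 2, 1, 0],
                      [1, 2, 3, 4]])
pooled = max_pool2d(rectified, k=2)
print(pooled)  # a 2x2 map holding the largest value of each window
```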
<img src = "maxPool.png" />

**Fully Connected Layer**
- The fully connected layer is a traditional multi-layer perceptron that uses a **softmax activation function** in the output layer (other classifiers such as an SVM can also be used).
- "Fully connected" means that every neuron in the previous layer is connected to every neuron in the next layer.
- The outputs of the convolutional and pooling layers represent high-level features of the input image.
- The purpose of the fully connected layer is to use these features to classify the input image into various classes based on the training dataset.
- Apart from classification, adding a fully connected layer is a cheap way of learning non-linear combinations of these features.
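
To make the softmax output concrete, here is a small pure-Python sketch. The raw scores are hypothetical. Softmax turns the scores from the last layer into positive values that sum to 1, so they can be read as class probabilities.

```python
import math

def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() from overflowing; it does not
    # change the result.
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

scores = [1.0, 2.0, 0.5, 3.0]  # hypothetical raw scores for 4 classes
probs = softmax(scores)
print(probs)
print(probs.index(max(probs)))  # index of the most likely class
```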

## Overall Training process of the Convolutional Neural Network:
1. Initialize all filters and parameters/weights with random values.
2. The network takes a training image as input, goes through the forward propagation step (convolution, ReLU, pooling, and the fully connected layer), and finds the output probabilities for each class.
    - let's say the output probabilities for the boat image above are [0.2, 0.4, 0.1, 0.3]
    - since the weights are randomly assigned at the start, the output probabilities are also random.
3. Calculate the total error at the output layer (summed over all classes).
    - Total Error = $ \sum \frac{1}{2} (targetProbability - outputProbability) ^ 2 $
4. Use backpropagation to calculate the *gradients of the error* with respect to all weights in the network, and use gradient descent to *update all filter values / weights* and parameter values to minimize the output error.
    - the weights are adjusted in proportion to their contribution to the total error.
    - when the same image is input again, the output probabilities might now be [0.1, 0.1, 0.7, 0.1], which is closer to the target vector of [0, 0, 1, 0]
    - this means the network has learned to classify this particular image correctly by adjusting its weights and filters so that the output error is reduced.
    - hyperparameters such as the number of filters, filter sizes, and the architecture of the network are all fixed before step 1 and do not change during training. Only the values of the filter matrices and connection weights get updated.

5. Repeat steps 2 - 4 with all images in the training set.
    - the steps above train the ConvNet: all its weights and parameters have now been optimized to correctly classify images from the training set.
    - the example above uses only 2 sets of alternating convolution and pooling layers. Note that these can be repeated any number of times in a single CNN.
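
As a quick check of the error in step 3, the sketch below evaluates the sum of half squared differences for the two example outputs from steps 2 and 4 against the target vector [0, 0, 1, 0]. The function name `total_error` is an illustrative choice.

```python
def total_error(target, output):
    """Sum over classes of 1/2 * (target - output)^2."""
    return sum(0.5 * (t - o) ** 2 for t, o in zip(target, output))

target = [0, 0, 1, 0]
before_update = [0.2, 0.4, 0.1, 0.3]  # output with random initial weights
after_update = [0.1, 0.1, 0.7, 0.1]   # output after a weight update

print(total_error(target, before_update))  # about 0.55
print(total_error(target, after_update))   # about 0.06
```

The error drops from roughly 0.55 to roughly 0.06, which is what "the output error is reduced" means numerically for this image.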
    
<img src = 'intuition.png' />

-------------
1. ### Review Regular NN:
    - Traditional Neural Networks receive an input (as a single vector), and transform it through a series of hidden layers.
    - each hidden layer is made up of a set of neurons, where each neuron is fully connected to all the neurons in the previous layer. 
    - the last fully-connected layer is called the output layer, and in classification setting it represents the class scores.
    - the problem is: Regular Neural Nets don't scale well to full images. 
        - For example, for an image of size 200x200x3, a single fully connected neuron in the first hidden layer would have 200x200x3 = 120,000 weights.
        - We would want to have several such neurons, so the parameters would add up quickly.
        - so clearly, this **full connectivity** is wasteful, and the huge number of parameters would quickly lead to overfitting.
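
The arithmetic behind this can be spelled out in a few lines. The hidden-layer size of 1,000 neurons is a hypothetical figure, used only to show how quickly the count grows.

```python
# Input image dimensions from the example above.
width, height, depth = 200, 200, 3

# One fully connected neuron needs one weight per input value.
weights_per_neuron = width * height * depth
print(weights_per_neuron)  # 120000

# Hypothetical hidden layer with 1,000 such neurons (biases ignored).
hidden_neurons = 1000
total_weights = weights_per_neuron * hidden_neurons
print(total_weights)  # 120000000
```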

2. ### 3D volumes of Neurons:
    - CNNs take advantage of the fact that the input consists of images and constrain the architecture in a more sensible way.
    - Unlike a traditional neural network, a CNN has neurons arranged in 3 dimensions: **width, height, depth**. For the example image above, we would have width = 200, height = 200, and depth = 3.
    
3. ### Full ConvNet Architecture
    - Convolutional Layer(CONV)
    - Pooling Layer
    - Fully-Connected Layer (FC)
    - **Note**:
        - A ConvNet transforms the original image layer by layer, from the original pixel values to the final class scores. Note also that some layers contain parameters and others don't.
        - In particular, CONV / FC layers perform transformations that are a function of *not only the activations in the input volume, but also of the parameters.*
        - On the other hand, ReLU and POOL layers implement a fixed function. The parameters in the CONV / FC layers are trained with gradient descent so that the class scores the ConvNet computes are consistent with the labels in the training set for each image.
        - each layer may or may not have additional hyperparameters (e.g. CONV / FC / POOL do, but ReLU doesn't)
        
4. ### Convolutional Layer: 
    - the core building block of a Convolutional Neural Network.
    - **Local Connectivity**:
        - we will connect each neuron to only a local region of the input volume.
        - the spatial extent of this connectivity is a hyperparameter called the **receptive field** of the neuron.
        - the extent of the connectivity along the depth axis is always equal to the depth of the input volume.
     - <img src = "connectivity.png" />
    - **Spatial Arrangement**: Three hyperparameters control the size of the output volume: the depth, stride, and zero-padding.
        - depth: corresponds to the number of filters we would like to use, each learning to look for something different in the input.
            - for example, if the first Convolutional Layer takes as input the raw image, then different neurons along the depth dimension may activate in presence of various oriented edges, or blobs of color. 
            - we refer to a set of neurons that are all looking at the same region of the input as a **depth column**.
      
        - stride: when the stride is 1 then we move the filters one pixel at a time.
        - zero-padding: it will be convenient to pad the input volume with zeros around the border.
            - the nice feature of zero-padding is that it will allow us to control the spatial size of the output volume. 
            - we can compute the spatial size of the output volume as a function of the input volume size (W), the receptive field size of the Conv layer neurons (F), the stride with which they are applied (S), and the amount of zero-padding used (P) on the border.
            - $ spatialSize = \frac{W - F + 2P}{S} + 1 $
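
The output size (W - F + 2P) / S + 1 can be wrapped in a small helper to try out different settings. The function name `conv_output_size` and the example values are illustrative. Raising an error when the filter does not tile the input evenly is a design choice of this sketch, not something every library does.

```python
def conv_output_size(w, f, s=1, p=0):
    """Spatial output size for input size w, filter size f, stride s, padding p."""
    span = w - f + 2 * p
    if span % s != 0:
        raise ValueError("the filter does not tile the input evenly with this stride")
    return span // s + 1

print(conv_output_size(5, 3))             # 3: narrow convolution shrinks the input
print(conv_output_size(5, 3, s=1, p=1))   # 5: one pixel of padding preserves the size
print(conv_output_size(5, 3, s=2))        # 2: a larger stride gives a smaller output
print(conv_output_size(28, 5, s=1, p=2))  # 28: a 5x5 filter on a 28x28 input, padded by 2
```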

-----------------------------------------------------------------
Let's code a simple CNN with TensorFlow.
This follows the tutorial by [Aymeric Damien](https://github.com/aymericdamien).

**TensorFlow's MNIST dataset**
- MNIST is a classic problem: look at 28-by-28 pixel images of handwritten digits and determine which digit (0-9) each image represents.
- At the top of the code, when you import everything, load the dataset (it is downloaded to the given directory if it is not already there):
   <br/> `mnist = input_data.read_data_sets('/tmp/data/', one_hot=True)`

In [29]:
#from __future__ import division, print_function, absolute_import
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data
# Import MNIST data
mnist = input_data.read_data_sets('/tmp/data/', one_hot=True)

Extracting /tmp/data/train-images-idx3-ubyte.gz
Extracting /tmp/data/train-labels-idx1-ubyte.gz
Extracting /tmp/data/t10k-images-idx3-ubyte.gz
Extracting /tmp/data/t10k-labels-idx1-ubyte.gz
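
The `one_hot=True` argument used when loading the data means each label is stored as a vector of length 10 with a single 1 at the index of the digit, rather than as the digit itself. A minimal pure-Python sketch of that encoding (the helper name `one_hot` is made up for illustration and is not part of TensorFlow's loader):

```python
def one_hot(label, num_classes=10):
    """Encode an integer class label as a one-hot vector."""
    vector = [0.0] * num_classes
    vector[label] = 1.0
    return vector

encoded = one_hot(3)
print(encoded)  # a 1.0 at index 3, zeros elsewhere

# The original digit is recovered as the index of the largest entry,
# which is what tf.argmax(Y, 1) does per row in the evaluation code.
decoded = encoded.index(max(encoded))
print(decoded)
```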


In [34]:
# 1. setting up all the hyperparameters
learning_rate = 0.001 # alpha
num_steps = 500 # iterations
batch_size = 128
display_step = 10

# 2.  Network Parameters
num_input = 784 # 28 by 28 pixel images
num_classes = 10 # 0 - 9 digits
dropout = 0.75 # probability of keeping a hidden unit (fed to keep_prob)

# tensorflow graph input
X = tf.placeholder(tf.float32,[None, num_input])
Y = tf.placeholder(tf.float32,[None,num_classes])
keep_prob = tf.placeholder(tf.float32) # dropout keep probability
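
To illustrate what the keep probability controls, here is a rough NumPy sketch of dropout as applied at training time. It is not TensorFlow's implementation; the function name and the all-ones input are assumptions for the example. Each unit is kept with probability `keep_prob` and scaled by `1 / keep_prob` so the expected activation stays the same, and with `keep_prob = 1.0` (as fed during evaluation) nothing is dropped.

```python
import numpy as np

def dropout(activations, keep_prob, rng):
    """Zero each unit with probability 1 - keep_prob and rescale the rest."""
    mask = rng.random(activations.shape) < keep_prob
    return np.where(mask, activations / keep_prob, 0.0)

rng = np.random.default_rng(0)
x = np.ones((2, 5))  # hypothetical activations

dropped = dropout(x, keep_prob=0.75, rng=rng)  # entries are 0 or 1 / 0.75
unchanged = dropout(x, keep_prob=1.0, rng=rng) # identical to x
print(dropped)
print(unchanged)
```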


In [31]:
# Create some wrappers for simplicity
def conv2d(x, W, b, strides =1):
    # Conv2D wrapper, with bias and relu activation
    x = tf.nn.conv2d(x, W, strides = [1, strides, strides, 1], padding = 'SAME')
    x = tf.nn.bias_add(x,b)
    x = tf.nn.relu(x)
    return x

def maxpool2d(x,k =2):
    # MaxPool2D Wrapper
    return tf.nn.max_pool(x,ksize=[1,k,k,1], strides = [1,k,k,1], padding = 'SAME')

# Create the model
def conv_net(x, weights, biases, dropout):
    # MNIST data input is a 1-D vector of 784 features (28*28 pixels)
    # Reshape to match picture format
    # Tensor input becomes 4D [Batch size, Height, Width, Channel]
    x = tf.reshape(x,shape = [-1,28,28,1])
    
    # first layer convolution layer
    conv1 = conv2d(x,weights['wc1'], biases['bc1'])
     # max pooling 
    conv1 = maxpool2d(conv1, k = 2)
   
    conv2 = conv2d(conv1, weights['wc2'],biases['bc2'])
    conv2 = maxpool2d(conv2,k = 2)
    
    # Fully connected layer
    # reshape conv2 output to fit fully connected layer input
    fc1 = tf.reshape(conv2, [-1, weights['wd1'].get_shape().as_list()[0]])
    fc1 = tf.add(tf.matmul(fc1,weights['wd1']), biases['bd1'])
    fc1 = tf.nn.relu(fc1)
    fc1 = tf.nn.dropout(fc1, dropout)
    
    out = tf.add(tf.matmul(fc1, weights['out']), biases['out'])
    return out
    

In [32]:
# Store layers weight and bias
weights = {
    'wc1': tf.Variable(tf.random_normal([5,5,1,32])),
    'wc2': tf.Variable(tf.random_normal([5,5,32,64])),
    'wd1': tf.Variable(tf.random_normal([7*7*64, 1024])),
    'out': tf.Variable(tf.random_normal([1024,num_classes]))
}

biases = {
    'bc1': tf.Variable(tf.random_normal([32])),
    'bc2': tf.Variable(tf.random_normal([64])),
    'bd1': tf.Variable(tf.random_normal([1024])),
    'out': tf.Variable(tf.random_normal([num_classes]))
}

# Construct Model
logits = conv_net(X, weights, biases, keep_prob)
prediction = tf.nn.softmax(logits)

# Define loss and optimizer
loss_op = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits = logits, labels = Y))
optimizer = tf.train.AdamOptimizer(learning_rate = learning_rate)
train_op = optimizer.minimize(loss_op)

# Evaluate Model
correct_pred = tf.equal(tf.argmax(prediction,1), tf.argmax(Y,1))
accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))

# initialize the variables
init = tf.global_variables_initializer()
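
The `7*7*64` in the shape of `wd1` comes from tracing the spatial size through the network. With `padding = 'SAME'`, each layer outputs `ceil(input_size / stride)` values per spatial dimension, so the stride-1 convolutions keep the size and each 2x2 pooling step halves it. The sketch below reproduces that trace in plain Python; the layer list is written out by hand from the model above.

```python
import math

def same_padding_output(size, stride):
    """Spatial output size of a layer that uses padding='SAME'."""
    return math.ceil(size / stride)

# (layer name, stride, output channels), following conv_net above.
layers = [
    ("conv1", 1, 32),
    ("pool1", 2, 32),
    ("conv2", 1, 64),
    ("pool2", 2, 64),
]

size, channels = 28, 1  # MNIST input: 28x28 pixels, 1 channel
for name, stride, out_channels in layers:
    size = same_padding_output(size, stride)
    channels = out_channels
    print(name, (size, size, channels))

flattened = size * size * channels
print(flattened)  # number of inputs to the fully connected layer
```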

In [33]:
# Start training
with tf.Session() as sess:
    
    sess.run(init)
    
    for step in range(1,num_steps +1):
        batch_x, batch_y = mnist.train.next_batch(batch_size)
        
        sess.run(train_op, feed_dict = {X:batch_x, Y: batch_y, keep_prob:dropout})
        
        if step % display_step == 0 or step == 1:
            # Calculate batch's loss value and accuracy
            loss, acc = sess.run([loss_op, accuracy], feed_dict = {X:batch_x, Y:batch_y, keep_prob:1.0})
            print("Step "+ str(step) + ", Minibatch Lost = " + 
                  "{:.4f}".format(loss) + ", Training Accuracy = " + "{:.3f}".format(acc))
            
    print("Optimization Finished!")
    
    print("Testing Accuracy: ", sess.run(accuracy, feed_dict = {X: mnist.test.images[:256], 
                                                                Y:mnist.test.labels[:256], keep_prob:1.0}))

Step 1, Minibatch Lost = 44357.6797, Training Accuracy = 0.117
Step 10, Minibatch Lost = 24775.9453, Training Accuracy = 0.250
Step 20, Minibatch Lost = 12615.4619, Training Accuracy = 0.414
Step 30, Minibatch Lost = 6684.8511, Training Accuracy = 0.594
Step 40, Minibatch Lost = 3336.0073, Training Accuracy = 0.781
Step 50, Minibatch Lost = 4144.3545, Training Accuracy = 0.781
Step 60, Minibatch Lost = 2608.1147, Training Accuracy = 0.805
Step 70, Minibatch Lost = 2018.2419, Training Accuracy = 0.867
Step 80, Minibatch Lost = 2419.0928, Training Accuracy = 0.867
Step 90, Minibatch Lost = 2091.0078, Training Accuracy = 0.852
Step 100, Minibatch Lost = 1925.6846, Training Accuracy = 0.898
Step 110, Minibatch Lost = 2503.4480, Training Accuracy = 0.852
Step 120, Minibatch Lost = 1779.9392, Training Accuracy = 0.859
Step 130, Minibatch Lost = 2397.7441, Training Accuracy = 0.836
Step 140, Minibatch Lost = 2417.6355, Training Accuracy = 0.875
Step 150, Minibatch Lost = 2622.3828, Training A