# CNN (Convolutional Neural Network)
We will implement a very basic CNN and train it on the MNIST dataset.

In [1]:
from IPython.display import Image

## Why CNN for image recognition?
Suppose you use an MLP (multilayer perceptron) for image recognition.  
The pictures below recall how an MLP works.  
An MLP flattens the image from 2d (matrix) to 1d (array) and fully connects all nodes to produce a prediction.  
Take a moment to compare the two inputs below, both of which have the same target number, '2'.

In [2]:
Image(url= "https://raw.githubusercontent.com/minsuk-heo/deeplearning/master/img/mlp_overview.png", width=500, height=250)

In [3]:
Image(url= "https://raw.githubusercontent.com/minsuk-heo/deeplearning/master/img/mlp_overview2.png", width=500, height=250)

Even though both have the same target number ('2'), the **input arrays are very different.**  
Since an MLP only sees the input array, it is hard for it to learn that these two inputs share the same target.  
You can add more nodes to the MLP to compensate, but that quickly leads to heavy computation and longer training.

MLP's issues:  
1) when the same number is written on different pixels, the input to the MLP is totally different.    
2) having many nodes results in heavy computation and longer training time.
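The first issue can be seen with a tiny NumPy sketch. The 5 x 5 images and the 3-pixel stroke are made-up toy values (not MNIST data), used only to show that the same shape drawn on different pixels produces flattened arrays with no active inputs in common.

```python
import numpy as np

# Hypothetical toy images: the same 3-pixel vertical stroke,
# drawn in column 1 (left) and in column 3 (right).
left = np.zeros((5, 5), dtype=int)
left[1:4, 1] = 1

right = np.zeros((5, 5), dtype=int)
right[1:4, 3] = 1

# An MLP only sees the flattened 1d arrays.
left_flat = left.flatten()
right_flat = right.flatten()

# Count the input positions that are active in both arrays.
overlap = int(np.sum(left_flat * right_flat))

print(left_flat)
print(right_flat)
print("shared active inputs:", overlap)
```

Both arrays describe the same stroke, yet from the MLP's point of view they activate completely different input nodes.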

## How do you recognize a thing?
To resolve the MLP's issues, let's think about how we actually recognize a number.  
If you look at the number '2', how do you know it is a 2?  
Your brain consciously or unconsciously recognizes the head and tail of the 2 and the diagonal connector between them, as in the picture below. 

In [4]:
Image(url= "https://raw.githubusercontent.com/minsuk-heo/deeplearning/master/img/cnn_filter.png", width=500, height=250)

## CNN recognizes a thing just like you!
A CNN recognizes an object much as you do. It is trained to capture the features of an object (in this case, head, tail, connector, edge) and uses them to recognize the number.

## How does a CNN identify features?
In contrast to an MLP, a CNN uses the 2-dimensional information of the image in order to capture locally connected patterns such as the head, tail, and connector.  

Here is how a CNN finds the features of an object.



In [5]:
Image(url= "https://raw.githubusercontent.com/minsuk-heo/deeplearning/master/img/stride.png", width=500, height=250)

### stride, filter, receptive field
As the picture above shows, a CNN slides a filter from the top left to the bottom right of the input.  
The number of pixels the filter moves at each step is called the **stride**.    
A **filter**, sometimes called a **kernel**, is a feature identifier,  
and we call the area currently covered by the filter a **receptive field**.  
The filter in the picture above identifies diagonally connected pixels.  

In the picture below, the filter detects two diagonally connected pixels in the left input (digit 2),  
while there are no diagonally connected pixels in the right input (digit 1).

In [6]:
Image(url= "https://raw.githubusercontent.com/minsuk-heo/deeplearning/master/img/filter_diff.png", width=500, height=250)

A CNN is just a mathematical model, and so far we have not talked about the math behind the intuitive example.  
Suppose our image is a gray scale image, so each pixel is represented by a number between 0 and 255.  
In the picture below, 0 means white and 255 means black.  

In [7]:
Image(url= "https://raw.githubusercontent.com/minsuk-heo/deeplearning/master/img/elem_mul.png", width=500, height=250)

As the picture above shows, the filter detects a feature by multiplying the filter and the receptive field element-wise and summing the products.  
A larger result means the area is more likely to contain the feature,  
and a smaller result means the area is less likely to contain it.
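Here is a minimal NumPy sketch of that sliding multiply-and-sum. The 5 x 5 input and the 2 x 2 diagonal filter are hypothetical toy values, not the exact numbers from the pictures, and `convolve2d_valid` is a helper name invented for this illustration.

```python
import numpy as np

def convolve2d_valid(image, kernel, stride=1):
    """Slide `kernel` over `image` without padding and return the feature map."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    oh = (ih - kh) // stride + 1
    ow = (iw - kw) // stride + 1
    feature_map = np.zeros((oh, ow), dtype=image.dtype)
    for r in range(oh):
        for c in range(ow):
            # The receptive field is the patch currently under the filter.
            field = image[r * stride:r * stride + kh,
                          c * stride:c * stride + kw]
            # Element-wise multiplication followed by a sum.
            feature_map[r, c] = np.sum(field * kernel)
    return feature_map

# Hypothetical input: a diagonal stroke from top right to bottom left.
image = np.array([
    [0, 0, 0, 0, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0],
])

# Hypothetical filter that responds to the same diagonal direction.
diagonal_filter = np.array([
    [0, 1],
    [1, 0],
])

feature_map = convolve2d_valid(image, diagonal_filter)
print(feature_map)
```

The largest values in `feature_map` sit exactly where the receptive field covers two diagonally connected pixels, which is the "larger number means more chance" idea in code.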

To detect your target, here a number between 0 and 9,  
you will need more filters, for example for straight lines, vertical lines, and curves.  
Your **first convolutional layer** will detect these basic features, and  
the **next convolutional layers** can detect higher-level features such as circles, triangles, and rectangles based on the previously detected features.  
After all features are detected, they are fed into an MLP (fully connected layers) that ends with softmax to classify the input into a target value.

Here is a well-known CNN architecture from the Stanford CNN lecture notes.

In [8]:
Image(url= "https://raw.githubusercontent.com/minsuk-heo/deeplearning/master/img/cnn_architecture.png", width=500, height=250)

Source: http://cs231n.github.io/convolutional-networks/

CONV stands for convolutional layer, which is what we have covered so far.  
FC is the fully connected layer, which classifies the input into one of the given targets.
In this picture, RELU is the activation function of the convolutional layer. We will also use ReLU in the TensorFlow practice in this Jupyter notebook.  

The only term we have not talked about yet is POOL, which is the pooling layer.

### pooling layer
The main purpose of a pooling layer is to reduce the number of parameters and computations in the CNN model,  
which in turn helps control overfitting.

In our stride example for digit 2 (with stride 1),  
the output of sliding the diagonal filter over the input (we call it a **feature map**) is shown below.  
FYI, ReLU(feature map) is called an **activation map**.

In [9]:
Image(url= "https://raw.githubusercontent.com/minsuk-heo/deeplearning/master/img/stride_result.png", width=500, height=250)

For pooling, you can use either **max pooling** or **average pooling**.  
Here is an example of max pooling (2 x 2 filter, stride 2).

In [10]:
Image(url= "https://raw.githubusercontent.com/minsuk-heo/deeplearning/master/img/max_pool.png", width=500, height=250)

As you can see, pooling shrinks the output of the convolutional layer from (4 x 4) to (2 x 2).  
A smaller feature map means fewer parameters in the following layers and less computation time.  
Reducing the number of parameters also helps control **overfitting**.
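The ReLU and max pooling steps can be sketched in a few lines of NumPy. The 4 x 4 feature map below uses made-up values (including some negatives so that ReLU has something to do), and `relu` and `max_pool` are helper names assumed for this illustration only.

```python
import numpy as np

def relu(x):
    """Replace negative values with 0."""
    return np.maximum(x, 0)

def max_pool(feature_map, size=2, stride=2):
    """Keep the largest value of each (size x size) window."""
    h, w = feature_map.shape
    oh = (h - size) // stride + 1
    ow = (w - size) // stride + 1
    pooled = np.zeros((oh, ow), dtype=feature_map.dtype)
    for r in range(oh):
        for c in range(ow):
            window = feature_map[r * stride:r * stride + size,
                                 c * stride:c * stride + size]
            pooled[r, c] = window.max()
    return pooled

# Hypothetical 4 x 4 feature map.
feature_map = np.array([
    [ 0, -1,  0,  1],
    [ 0,  0,  2,  0],
    [-3,  2,  0,  0],
    [ 1,  0,  0, -2],
])

activation_map = relu(feature_map)   # ReLU(feature map)
pooled = max_pool(activation_map)    # 2 x 2 window, stride 2

print(activation_map)
print(pooled)
```

The pooled result is (2 x 2), and the strong responses survive while the surrounding zeros are dropped.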

### zero padding
Lastly, let's look at zero padding, even though it is not shown in the Stanford CNN architecture diagram.  
Zero padding is used in most recent CNNs, mainly for the two reasons below:  
1) reduce the information loss at each convolutional layer.   
2) let the CNN know where the boundary of the input is.

Let's revisit the convolutional layer. As you can see in the image below,  
the output dimension (4 x 4) is smaller than the input dimension (5 x 5). In other words, we lose some information at each convolutional layer.

In [11]:
Image(url= "https://raw.githubusercontent.com/minsuk-heo/deeplearning/master/img/zeropadding1.png", width=500, height=250)

Take a look at the picture below. Zero padding gives the filter more room to slide.  
If your filter size is (3 x 3) and the stride is 1 pixel, the output dimension will be (5 x 5),  
which is exactly the same as the original input.

In [12]:
Image(url= "https://raw.githubusercontent.com/minsuk-heo/deeplearning/master/img/zeropadding.png", width=500, height=250)
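The output sizes above follow the standard formula (input - filter + 2 * padding) / stride + 1. Below is a small sketch of it; `conv_output_size` is a helper name assumed for this illustration, and the 2 x 2 filter in the first call is an assumption (it is what a 5 x 5 to 4 x 4 reduction implies at stride 1).

```python
def conv_output_size(input_size, filter_size, stride=1, padding=0):
    """Output width (or height) of a convolution along one dimension."""
    return (input_size - filter_size + 2 * padding) // stride + 1

# No padding, assumed 2 x 2 filter: 5 x 5 input shrinks to 4 x 4.
no_padding_small = conv_output_size(5, 2)

# No padding, 3 x 3 filter: 5 x 5 input shrinks to 3 x 3.
no_padding = conv_output_size(5, 3)

# One ring of zeros around the input, 3 x 3 filter: size is preserved.
with_padding = conv_output_size(5, 3, padding=1)

print(no_padding_small, no_padding, with_padding)
```

With one pixel of zero padding and a 3 x 3 filter, the formula gives back the original input size, which matches the picture.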

### what if the image is a color image?

In this Jupyter Notebook, I use MNIST, which consists of gray scale images. In practice you are more likely to handle color images composed of RGB channels. This is easy to handle if you think of a color image as three overlapped layers.  
If you have 10 gray scale images of 28 * 28 pixels, your input tensor is (10,28,28,1), because you have only one color channel.  
If you have 10 RGB color images of 28 * 28 pixels, your input tensor is (10,28,28,3), because you have three color channels.

In [13]:
Image(url= "https://raw.githubusercontent.com/minsuk-heo/deeplearning/master/img/rgb.png", width=500, height=250)

The picture above confirms that a color image is just three overlapped color layers.  
We call the number of overlapped layers the **depth**.
Let's look at how one filter works on the three color layers in the images below.  
One filter has three sub filters, one per color layer. Each sub filter slides over its own layer, and at each position the three results are summed (plus a bias) into one pixel of the feature map.

In [14]:
Image(url= "https://raw.githubusercontent.com/minsuk-heo/deeplearning/master/img/rgb1.png", width=500, height=250)

In [15]:
Image(url= "https://raw.githubusercontent.com/minsuk-heo/deeplearning/master/img/rgb2.png", width=500, height=250)

As the picture above shows, each filter outputs one feature map.  
Therefore, if you have 10 filters in the first convolutional layer, the next layer's input will have **depth** 10.
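The relation between the number of filters and the output depth can be checked with a rough NumPy sketch. The random input, the 3 x 3 filter size, and the helper name `conv_layer` are assumptions for illustration; only the shapes matter here.

```python
import numpy as np

def conv_layer(image, filters, biases):
    """Convolve an (H, W, C) image with (kh, kw, C, n_filters) filters.

    No padding, stride 1. Each filter has one sub filter per input channel;
    the per-channel products are summed together, then the bias is added.
    """
    h, w, _ = image.shape
    kh, kw, _, n_filters = filters.shape
    oh, ow = h - kh + 1, w - kw + 1
    out = np.zeros((oh, ow, n_filters))
    for f in range(n_filters):
        for r in range(oh):
            for c in range(ow):
                field = image[r:r + kh, c:c + kw, :]
                out[r, c, f] = np.sum(field * filters[:, :, :, f]) + biases[f]
    return out

rng = np.random.default_rng(0)
image = rng.random((28, 28, 3))                # one hypothetical RGB image
filters = rng.standard_normal((3, 3, 3, 10))   # 10 filters, 3 sub filters each
biases = np.zeros(10)

feature_maps = conv_layer(image, filters, biases)
print(feature_maps.shape)
```

Three input channels go in, and the output has one feature map per filter, so its depth is 10.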

## Train
Let's return to the entire CNN architecture to talk about training.

In [16]:
Image(url= "https://raw.githubusercontent.com/minsuk-heo/deeplearning/master/img/cnn_train.png", width=500, height=250)

As the picture above shows, if the input is a color image, the depth is 3, which is why you see three layers at the input.  
The first convolution layer has four filters, which is why you see four layers at Conv1.  
The second convolution layer has three filters, which is why you see three layers at Conv2.
The pooling layer uses stride 2 and produces (2 x 2) feature maps.  
Flatten then yields 2 x 2 x 3 = 12 values in one array as the input to the fully connected layer.  
In theory, the convolution layers identify features,  
and the fully connected layer classifies the input using all identified features.  
A CNN is trained with supervised learning: given the target values, it uses backpropagation to optimize the parameters of the convolution layers and the fully connected layer.

### Parameter Optimization
The purpose of training is to optimize the parameters of the filters and of the fully connected layer (FC).  
Initially we give these parameters random values, and the CNN keeps updating them with backpropagation until the filters and the FC become meaningful.  
At the start the CNN does not even know which filters to have (head, tail, diagonal connector); it learns them automatically during training.  
By minimizing the difference between target and output, the CNN eventually ends up with meaningful filters (head, tail, diagonal connector, round, triangle) and a meaningful classifier on top of these feature maps.  
The parameters (W in the picture below) can be optimized with the gradient descent algorithm by minimizing the difference between target and output (Loss in the picture below). At a local minimum of Loss, the derivative of Loss with respect to W is 0. Therefore, each time we update W, we move it slightly in the negative direction of the derivative, which takes it step by step toward a local minimum. 

In [17]:
Image(url= "https://raw.githubusercontent.com/minsuk-heo/deeplearning/master/img/sgd.png", width=300, height=150)
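The update rule can be tried on a toy problem. The sketch below assumes a single parameter `w` and a made-up loss (w - 3)^2 whose minimum is at w = 3; the learning rate and step count are arbitrary choices for illustration, not values used later in this notebook.

```python
def loss(w):
    """Hypothetical loss with its minimum at w = 3."""
    return (w - 3.0) ** 2

def grad(w):
    """Derivative of the loss with respect to w."""
    return 2.0 * (w - 3.0)

w = 0.0              # arbitrary starting value
learning_rate = 0.1  # how far we move on each update

for step in range(100):
    # Move w slightly in the negative direction of the derivative.
    w -= learning_rate * grad(w)

print(w, loss(w))
```

Each update shrinks the distance to the minimum, so `w` ends up very close to 3, where the derivative is 0.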

## Tensorflow implementation
Let's implement a CNN with TensorFlow to identify handwritten numbers in the MNIST dataset.

In [18]:
Image(url= "https://raw.githubusercontent.com/minsuk-heo/deeplearning/master/img/practice_cnn.png", width=800, height=200)

In [19]:
import tensorflow as tf

### collect data

In [20]:
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()

The train data has **60000** samples  
The test data has **10000** samples   
Every sample is **28 * 28** pixels  

The image below shows a 28 * 28 pixel sample of the handwritten number '0' from the MNIST data.  
MNIST consists of gray scale images [0 to 255] of handwritten numbers.

![0 from MNIST](https://raw.githubusercontent.com/minsuk-heo/deeplearning/master/img/mnist_sample.png)

### Split train data into train and validation data
Split the train data into train data and validation data so that we can check validation accuracy during training.

In [21]:
x_val  = x_train[50000:60000]
x_train = x_train[0:50000]
y_val  = y_train[50000:60000]
y_train = y_train[0:50000]

In [22]:
print("train data has " + str(x_train.shape[0]) + " samples")
print("every train data is " + str(x_train.shape[1]) 
      + " * " + str(x_train.shape[2]) + " image")

train data has 50000 samples
every train data is 28 * 28 image


In [23]:
print("validation data has " + str(x_val.shape[0]) + " samples")
print("every validation data is " + str(x_val.shape[1]) 
      + " * " + str(x_val.shape[2]) + " image")

validation data has 10000 samples
every validation data is 28 * 28 image


Each of the 28 * 28 pixels has a gray scale value from **0** to **255**

In [24]:
# sample to show gray scale values
print(x_train[0][8])

[  0   0   0   0   0   0   0  18 219 253 253 253 253 253 198 182 247 241
   0   0   0   0   0   0   0   0   0   0]


Each train sample has a label from **0** to **9**

In [25]:
# sample to show labels for the first 9 train data
print(y_train[0:9])

[5 0 4 1 9 2 1 3 1]


test data has **10000** samples  
every test data is **28 * 28** image  

In [26]:
print("test data has " + str(x_test.shape[0]) + " samples")
print("every test data is " + str(x_test.shape[1]) 
      + " * " + str(x_test.shape[2]) + " image")

test data has 10000 samples
every test data is 28 * 28 image


### Reshape
We reshape the data to (samples, height, width, channels) so that it fits the TensorFlow model.

In [27]:
import numpy as np
x_train = np.reshape(x_train, (50000,28,28,1))
x_val = np.reshape(x_val, (10000,28,28,1))
x_test = np.reshape(x_test, (10000,28,28,1))

print(x_train.shape)
print(x_test.shape)

(50000, 28, 28, 1)
(10000, 28, 28, 1)


### Normalize data
Normalization usually speeds up learning and improves performance  
by reducing variance and giving all input features the same range.  
Since every MNIST input already ranges from 0 to 255, normalization here only helps by reducing variance.  
In my experiments with an MLP architecture, normalization turned out to work better than standardization on MNIST.    
I believe this is because ReLU handles 0 differently in both the forward pass and backpropagation.  
Handling 0 differently matters for MNIST, since 1-255 means there is some handwriting on that pixel,  
while 0 means there is none.

In [28]:
x_train = x_train.astype('float32')
x_val = x_val.astype('float32')
x_test = x_test.astype('float32')

gray_scale = 255
x_train /= gray_scale
x_val /= gray_scale
x_test /= gray_scale

### label to one hot encoding value
In order to measure the difference between the softmax output and the target,  
the target value needs to be one-hot encoded.

In [29]:
num_classes = 10
y_train = tf.keras.utils.to_categorical(y_train, num_classes)
y_val = tf.keras.utils.to_categorical(y_val, num_classes)
y_test = tf.keras.utils.to_categorical(y_test, num_classes)
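To see what one-hot encoding produces, here is a NumPy-only sketch. The helper `one_hot` is a stand-in written for illustration (it is not the Keras function used above), and the three labels are the first three train labels printed earlier.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Turn integer labels into rows with a single 1 at the label's index."""
    return np.eye(num_classes, dtype="float32")[labels]

labels = np.array([5, 0, 4])   # first three train labels shown earlier
encoded = one_hot(labels, 10)

print(encoded)
```

Each row has exactly one 1, at the position of the label, so it can be compared directly with the 10 softmax outputs.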

### Implement CNN tensorflow graph

In [30]:
Image(url= "https://raw.githubusercontent.com/minsuk-heo/deeplearning/master/img/practice_cnn.png", width=800, height=200)

We use the image itself (28 x 28) as the input of the CNN.
The target will be a number between 0 and 9.

In [31]:
x = tf.placeholder(tf.float32, shape=[None, 28, 28, 1])
y_ = tf.placeholder(tf.float32, shape=[None, 10])

We initialize the parameters near 0.  
The point of using a **truncated normal** is to avoid large initial weights, which can push saturating functions such as sigmoid or softmax into regions where the gradient is almost zero and the neuron stops learning.  
tf.truncated_normal() draws random numbers from a normal distribution and re-draws any value more than 2 standard deviations from the mean. With mean 0 and stddev=0.1 as used below, all values fall between -0.2 and 0.2.  
It is called truncated because you are cutting off the tails of the normal distribution.  
**tf.random_normal()** draws from the same kind of normal distribution without cutting off the tails, so occasional values can lie further from the mean.  

In [32]:
def weight_variable(shape):
  initial = tf.truncated_normal(shape, stddev=0.1)
  return tf.Variable(initial)

def bias_variable(shape):
  initial = tf.constant(0.1, shape=shape)
  return tf.Variable(initial)
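The truncation idea can be imitated in NumPy as a rough sketch: draw from a normal distribution and re-draw anything further than 2 standard deviations from the mean. The function below is an illustrative stand-in, not TensorFlow's implementation, and the `seed` argument is an assumption added for reproducibility.

```python
import numpy as np

def truncated_normal(shape, stddev=0.1, seed=0):
    """Sample N(0, stddev) and re-draw values beyond 2 standard deviations."""
    rng = np.random.default_rng(seed)
    values = rng.normal(0.0, stddev, size=shape)
    outside = np.abs(values) > 2 * stddev
    while outside.any():
        # Re-draw only the values that fell in the tails.
        values[outside] = rng.normal(0.0, stddev, size=int(outside.sum()))
        outside = np.abs(values) > 2 * stddev
    return values

weights = truncated_normal((5, 5, 1, 16))
print(weights.min(), weights.max())
```

With stddev 0.1, every sampled weight stays within -0.2 and 0.2, so no filter starts out with an unusually large value.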

**Same padding** means that, with stride 1, each output feature map has the same size as the input feature map.  
For example, our MNIST input has shape (28 x 28), so the output of the convolution will also have shape (28 x 28).

In [33]:
def conv2d(x, W):
  return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')

def max_pool_2x2(x):
  return tf.nn.max_pool(x, ksize=[1, 2, 2, 1],
                        strides=[1, 2, 2, 1], padding='SAME')
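With 'SAME' padding, TensorFlow's output size along each spatial dimension is ceil(input / stride), independent of the filter size. The short sketch below traces the spatial size through the two conv/pool pairs built in the following cells; `same_padding_output_size` is a helper name made up for this illustration.

```python
import math

def same_padding_output_size(input_size, stride):
    """Spatial output size of a 'SAME'-padded conv or pool layer."""
    return math.ceil(input_size / stride)

size = 28
after_conv1 = same_padding_output_size(size, stride=1)         # conv, stride 1
after_pool1 = same_padding_output_size(after_conv1, stride=2)  # 2x2 pool, stride 2
after_conv2 = same_padding_output_size(after_pool1, stride=1)
after_pool2 = same_padding_output_size(after_conv2, stride=2)

print(after_conv1, after_pool1, after_conv2, after_pool2)
```

The convolutions keep the size and each pooling step halves it, which is where the (14,14) and (7,7) shapes in the following cells come from.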

The first convolutional layer has 16 filters of size 5 by 5.  

In [34]:
W_conv1 = weight_variable([5, 5, 1, 16])
b_conv1 = bias_variable([16])

We use the ReLU activation function. An activation function brings non-linearity into the model.

In [35]:
h_conv1 = tf.nn.relu(conv2d(x, W_conv1) + b_conv1)

After the convolutional layer, we apply a pooling layer to reduce the activation map size.  
The pooling layer reduces the number of parameters needed downstream and helps control overfitting.

In [36]:
h_pool1 = max_pool_2x2(h_conv1)

After max pooling, we now have a (14,14) input shape.  
Here is the second convolutional layer.

In [37]:
W_conv2 = weight_variable([5, 5, 16, 32])
b_conv2 = bias_variable([32])

h_conv2 = tf.nn.relu(conv2d(h_pool1, W_conv2) + b_conv2)
h_pool2 = max_pool_2x2(h_conv2)

After max pooling, we now have a (7,7) input shape.

#### FC (Fully Connected Layer)
Here is the FC, where we use the activation maps from CONV as features for digit classification.  
Note that we flatten the activation map pixels into one array in order to feed them to the FC.

In [38]:
W_fc1 = weight_variable([7 * 7 * 32, 128])
b_fc1 = bias_variable([128])

h_pool2_flat = tf.reshape(h_pool2, [-1, 7*7*32])
h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, W_fc1) + b_fc1)

The FC has two layers. The first is a hidden layer with 128 nodes, and the second is the output layer with 10 nodes, matching our target range 0 to 9.

In [39]:
W_fc2 = weight_variable([128, 10])
b_fc2 = bias_variable([10])

y_conv = tf.matmul(h_fc1, W_fc2) + b_fc2
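As a quick sanity check, the number of trainable parameters can be counted from the variable shapes defined above. This is only a bookkeeping sketch; the dictionary simply repeats the shapes passed to `weight_variable` and `bias_variable`.

```python
import numpy as np

# Shapes copied from the weight_variable / bias_variable calls above.
shapes = {
    "W_conv1": [5, 5, 1, 16],
    "b_conv1": [16],
    "W_conv2": [5, 5, 16, 32],
    "b_conv2": [32],
    "W_fc1": [7 * 7 * 32, 128],
    "b_fc1": [128],
    "W_fc2": [128, 10],
    "b_fc2": [10],
}

# Number of values in each variable is the product of its dimensions.
counts = {name: int(np.prod(shape)) for name, shape in shapes.items()}
total = sum(counts.values())

for name, count in counts.items():
    print(name, count)
print("total parameters:", total)
```

Most of the parameters sit in the first fully connected layer, because every one of the 7 * 7 * 32 flattened inputs connects to all 128 nodes.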

We will use cross entropy as our loss function.

In [40]:
cross_entropy = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits_v2(labels=y_, logits=y_conv))
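For intuition, the same loss can be written out in NumPy: apply softmax to the logits, then average the negative log probability assigned to the true class. The sketch below is a simplified stand-in for the TensorFlow op, and the 3-class logits and labels are made-up toy values.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def cross_entropy(labels, logits):
    """Mean cross entropy between one-hot labels and softmax(logits)."""
    probs = softmax(logits)
    per_sample = -np.sum(labels * np.log(probs), axis=1)
    return float(np.mean(per_sample))

# Hypothetical batch of two samples with three classes.
logits = np.array([[2.0, 0.0, 0.0],
                   [0.0, 0.0, 0.0]])
labels = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])

print(softmax(logits))
print(cross_entropy(labels, logits))
```

The first sample, where the largest logit matches the label, contributes a smaller loss than the second, where the model is undecided between all three classes.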

We will use the Adam optimizer to optimize our parameters.

In [41]:
train_step = tf.train.AdamOptimizer(0.001).minimize(cross_entropy)

Accuracy is computed by the code block below.

In [42]:
correct_prediction = tf.equal(tf.argmax(y_conv,1), tf.argmax(y_,1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
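The same accuracy computation in plain NumPy may make the two steps easier to follow: compare the index of the largest logit with the index of the 1 in the one-hot target, then average the matches. The logits and targets below are made-up 3-class values for illustration.

```python
import numpy as np

# Hypothetical model outputs for four samples and three classes.
logits = np.array([
    [0.1, 2.0, 0.3],
    [1.5, 0.2, 0.1],
    [0.2, 0.1, 0.9],
    [0.3, 0.8, 0.2],
])

# One-hot targets for the true classes 1, 0, 2, 0.
targets = np.eye(3)[[1, 0, 2, 0]]

# True where the predicted class equals the target class.
correct_prediction = np.argmax(logits, axis=1) == np.argmax(targets, axis=1)

# Cast booleans to floats and average them.
accuracy_value = float(correct_prediction.astype("float32").mean())
print(accuracy_value)
```

Three of the four predictions match their targets, so the averaged value is the fraction of correct predictions.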

#### Train and Test
Here we go.  
We will train for 3 epochs.  
Using mini-batches, we update the parameters every time we pass 500 train samples to the model.

In [43]:
# initialize
init = tf.global_variables_initializer()

# train hyperparameters
epoch_cnt = 3
batch_size = 500
iteration = len(x_train) // batch_size

# Start training
with tf.Session() as sess:
    tf.set_random_seed(777)
    # Run the initializer
    sess.run(init)
    for epoch in range(epoch_cnt):
        avg_loss = 0.
        start = 0; end = batch_size
        
        for i in range(iteration):
            if i%10 == 0:
                train_accuracy = accuracy.eval(feed_dict={x:x_train[start: end], y_: y_train[start: end]})
                print("step "+ str(i) + ": training accuracy: "+str(train_accuracy))
            train_step.run(feed_dict={x:x_train[start: end], y_: y_train[start: end]})
            start += batch_size; end += batch_size    
        
        # Validate model
        val_accuracy = accuracy.eval(feed_dict={x:x_val, y_: y_val})
        print("validation accuracy: "+str(val_accuracy))
        
    test_accuracy = accuracy.eval(feed_dict={x:x_test, y_: y_test}) 
    print("test accuracy: "+str(test_accuracy))

step 0: training accuracy: 0.1
step 10: training accuracy: 0.616
step 20: training accuracy: 0.818
step 30: training accuracy: 0.836
step 40: training accuracy: 0.89
step 50: training accuracy: 0.876
step 60: training accuracy: 0.9
step 70: training accuracy: 0.926
step 80: training accuracy: 0.932
step 90: training accuracy: 0.918
validation accuracy: 0.944
step 0: training accuracy: 0.938
step 10: training accuracy: 0.948
step 20: training accuracy: 0.94
step 30: training accuracy: 0.944
step 40: training accuracy: 0.948
step 50: training accuracy: 0.954
step 60: training accuracy: 0.958
step 70: training accuracy: 0.962
step 80: training accuracy: 0.968
step 90: training accuracy: 0.956
validation accuracy: 0.9699
step 0: training accuracy: 0.966
step 10: training accuracy: 0.962
step 20: training accuracy: 0.946
step 30: training accuracy: 0.962
step 40: training accuracy: 0.972
step 50: training accuracy: 0.974
step 60: training accuracy: 0.98
step 70: training accuracy: 0.972
ste

You get a test accuracy of **97.4%** after just **3** epochs.
I hope you enjoyed learning about CNNs. Thanks!