#### This is an exercise notebook for myself to learn CNNs.
#### The material I used for this exercise is listed below:
    https://machinelearningmastery.com/convolutional-layers-for-deep-learning-neural-networks/
    https://machinelearningmastery.com/pooling-layers-for-convolutional-neural-networks/
    https://machinelearningmastery.com/review-of-architectural-innovations-for-convolutional-neural-networks-for-image-classification/
    https://machinelearningmastery.com/padding-and-stride-for-convolutional-neural-networks/
    http://dataaspirant.com/2017/03/07/difference-between-softmax-function-and-sigmoid-function/
    https://medium.com/data-science-bootcamp/understand-the-softmax-function-in-minutes-f3a59641e86d
    https://machinelearningmastery.com/dropout-for-regularizing-deep-neural-networks/
    Hands on Machine Learning, Chapter 13.

# Convolutional layers:
## 1.Convolutional
    CNN is a type of neural network that is designed to deal with two-dimensional image data, although it can be used to deal with one-dimensional or three-dimensional data. 
    The convolutional layer that performs an operation called 'convolution' is central to CNN.
    A convolution is a linear operation that involves the multiplication of a set of weights with the input. The multiplication is performed between an array of input data and a two-dimensional array of weights, called a filter or a kernel.
    
## 2.Feature Map
    Repeated application of the same filter to an input results in a map of activations called a feature map, indicating the locations and strength of a detected feature in an input, such as an image.
    The output from multiplying the filter with the input array one time is a single value. As the filter is applied multiple times to the input array, the result is a two-dimensional array of output values that represent a filtering of the input. As such, the two-dimensional output array from this operation is called a “feature map“.
    In summary, we have an input, such as an image of pixel values, and we have a filter, which is a set of weights, and the filter is systematically applied to the input data to create a feature map.
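
To make the sliding-filter idea concrete, here is a minimal NumPy sketch of how a feature map could be computed by hand. The function name `feature_map_2d` is my own (not from the referenced articles), and it assumes no padding and a stride of 1.

```python
import numpy as np

def feature_map_2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            # one application of the filter: element-wise multiply, then sum
            patch = image[r:r + kh, c:c + kw]
            out[r, c] = np.sum(patch * kernel)
    return out

# 8x8 image containing a vertical line two pixels wide
image = np.zeros((8, 8))
image[:, 3:5] = 1.0

# 3x3 filter that responds to vertical lines
kernel = np.array([[0.0, 1.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.0, 1.0, 0.0]])

fmap = feature_map_2d(image, kernel)
print(fmap.shape)        # (6, 6)
print(fmap[0].tolist())  # [0.0, 0.0, 3.0, 3.0, 0.0, 0.0]
```

Each output value is a single filter application, and the grid of those values is the feature map.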
    
## 3. Weights
    The innovation of using the convolution operation in a neural network is that the values of the filter are weights to be learned during the training of the network.
    The filter weights represent the structure or feature that the filter will detect and the strength of the activation indicates the degree to which the feature was detected.
    

# Powerful:
## 1.Multiple features:
    Convolutional neural networks do not learn a single filter; they, in fact, learn multiple features in parallel for a given input.
    It is common for a convolutional layer to learn from 32 to 512 filters in parallel for a given input. This gives the model 32, or even 512, different ways of extracting features from an input, or many different ways of both “learning to see” and after training, many different ways of “seeing” the input data.
    
## 2.Multiple channels:
    Color images have multiple channels, typically one for each color channel, such as red, green, and blue.
    A filter must always have the same number of channels as the input, often referred to as “depth“. 
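
Because each filter spans the full depth of its input, the number of weights in a layer grows with both the channel count and the filter count. The helper below is a small sketch of that arithmetic; `conv2d_param_count` is a name I made up for illustration, and it assumes one bias per filter.

```python
def conv2d_param_count(kernel_h, kernel_w, in_channels, n_filters, use_bias=True):
    """Number of learnable parameters in a 2D convolutional layer."""
    # every filter covers all input channels, so its depth equals in_channels
    weights_per_filter = kernel_h * kernel_w * in_channels
    bias_per_filter = 1 if use_bias else 0
    return n_filters * (weights_per_filter + bias_per_filter)

# 32 filters of 3x3 applied to an RGB image (3 channels)
print(conv2d_param_count(3, 3, 3, 32))  # 896
# a single 3x3 filter on a one-channel input
print(conv2d_param_count(3, 3, 1, 1))   # 10
```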
    
## 3.Multiple layers:
    Convolutional layers are not only applied to input data, but they can also be applied to the output of other layers.
    Lower layers extract low-level features, such as lines. Deeper layers extract higher-level features, such as faces, animals, houses, and so on.

## One-Dimensional input data example:

In [13]:
from numpy import asarray
from keras.models import Sequential
from keras.layers import Conv1D

In [14]:
# define input data
data = asarray([0, 0, 0, 1, 1, 0, 0, 0])
data = data.reshape(1, 8, 1)
# create model
model = Sequential()
model.add(Conv1D(1, 3, input_shape=(8, 1)))
# define a vertical line detector
weights = [asarray([[[0]],[[1]],[[0]]]), asarray([0.0])]
# store the weights in the model
model.set_weights(weights)
# confirm they were stored
print(model.get_weights())
# apply filter to input data
yhat = model.predict(data)
print(yhat)
# Note that the feature map has six elements, whereas our input has eight elements.
# We can use padding to get an eight-element feature map.

[array([[[0.]],

       [[1.]],

       [[0.]]], dtype=float32), array([0.], dtype=float32)]
[[[0.]
  [0.]
  [1.]
  [1.]
  [0.]
  [0.]]]

In [None]:
'''
Note:
The first dimension refers to each input sample; in this case, we only have one sample.
The second dimension refers to the length of each sample; in this case, the length is eight. 
The third dimension refers to the number of channels in each sample; in this case, we only have a single channel.
'''

## Two-Dimensional input data example:

In [27]:
from keras.layers import Conv2D
# define input data
# [1,8,8,1]=>[samples, rows, columns, channels]
data = [[0, 0, 0, 1, 1, 0, 0, 0],
        [0, 0, 0, 1, 1, 0, 0, 0],
        [0, 0, 0, 1, 1, 0, 0, 0],
        [0, 0, 0, 1, 1, 0, 0, 0],
        [0, 0, 0, 1, 1, 0, 0, 0],
        [0, 0, 0, 1, 1, 0, 0, 0],
        [0, 0, 0, 1, 1, 0, 0, 0],
        [0, 0, 0, 1, 1, 0, 0, 0]]
data = asarray(data)
data = data.reshape(1, 8, 8, 1)
# create model
model = Sequential()
model.add(Conv2D(1, (3,3), input_shape=(8, 8, 1)))
# define a vertical line detector
detector = [[[[0]],[[1]],[[0]]],
            [[[0]],[[1]],[[0]]],
            [[[0]],[[1]],[[0]]]]
weights = [asarray(detector), asarray([0.0])]
# store the weights in the model
model.set_weights(weights)
# confirm they were stored
print(model.get_weights())
# apply filter to input data
yhat = model.predict(data)
for r in range(yhat.shape[1]):
    # print each column in the row
    print([yhat[0,r,c,0] for c in range(yhat.shape[2])])

[array([[[[0.]],

        [[1.]],

        [[0.]]],


       [[[0.]],

        [[1.]],

        [[0.]]],


       [[[0.]],

        [[1.]],

        [[0.]]]], dtype=float32), array([0.], dtype=float32)]
[0.0, 0.0, 3.0, 3.0, 0.0, 0.0]
[0.0, 0.0, 3.0, 3.0, 0.0, 0.0]
[0.0, 0.0, 3.0, 3.0, 0.0, 0.0]
[0.0, 0.0, 3.0, 3.0, 0.0, 0.0]
[0.0, 0.0, 3.0, 3.0, 0.0, 0.0]
[0.0, 0.0, 3.0, 3.0, 0.0, 0.0]


In [None]:
#For Conv2D:
'''
keras.layers.Conv2D(filters, kernel_size, strides=(1, 1), padding='valid', data_format=None, 
dilation_rate=(1, 1), activation=None, use_bias=True, kernel_initializer='glorot_uniform', 
bias_initializer='zeros', kernel_regularizer=None, bias_regularizer=None, 
activity_regularizer=None, kernel_constraint=None, bias_constraint=None)
'''
'''
When using Conv2D as the first layer in a model, provide the keyword argument:
input_shape (tuple of integers, does not include the batch axis)
e.g. input_shape=(128, 128, 3) for 128x128 RGB pictures in data_format="channels_last"
'''

# Pooling Layers
## 1.Problem of output feature map
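    The feature maps produced by convolutional layers are sensitive to the location of features in the input. Down sampling the feature map makes it less sensitive to small shifts, a property referred to as 'local translation invariance'.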
    Small movements in the position of the feature in the input image will result in a different feature map. This can happen with re-cropping, rotation, shifting, and other minor changes to the input image.
## 2. Pooling layer
    A pooling layer is a new layer added after the convolutional layer. 
    With the common 2×2 pool size and a stride of 2, a pooling layer reduces the size of each feature map by a factor of 2, e.g. each dimension is halved, reducing the number of pixels or values in each feature map to one quarter the size. Other pool sizes and strides give other reduction factors.
    For example, a pooling layer applied to a feature map of 6×6 (36 pixels) will result in an output pooled feature map of 3×3 (9 pixels).
   ### 2.1 Average Pooling
    Average pooling calculates the average value for each patch of the feature map. With a 2×2 pool size, each 2×2 square of the feature map is down sampled to the average value in the square.
    For example:
    [0.0, 0.0, 4.0, 5.0, 0.0, 0.0]
    [0.0, 0.0, 4.0, 5.0, 0.0, 0.0]
    After average pooling:
    [0.0,4.5,0.0]
   ### 2.2 Max Pooling
    Calculate the maximum value for each patch of the feature map.
    For example:
    [0.0, 0.0, 4.0, 5.0, 0.0, 0.0]
    [0.0, 0.0, 4.0, 5.0, 0.0, 0.0]
    After max pooling:
    [0.0,5.0,0.0]
   ### 2.3 Global Pooling
    Global pooling down samples the entire feature map to a single value. 
    This would be the same as setting the pool_size to the size of the input feature map.
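
The three pooling variants above can be sketched in NumPy as follows. `pool2d` is a hypothetical helper of mine, and it assumes non-overlapping windows and a feature map whose dimensions divide evenly by the pool size. It reuses the two-row example from the notes.

```python
import numpy as np

def pool2d(fmap, size=2, mode="max"):
    """Non-overlapping pooling; assumes both dimensions are divisible by `size`."""
    h, w = fmap.shape
    # split the map into (size x size) blocks: axes 1 and 3 index inside a block
    blocks = fmap.reshape(h // size, size, w // size, size)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    return blocks.mean(axis=(1, 3))

fmap = np.array([[0.0, 0.0, 4.0, 5.0, 0.0, 0.0],
                 [0.0, 0.0, 4.0, 5.0, 0.0, 0.0]])

avg_pooled = pool2d(fmap, mode="avg")
max_pooled = pool2d(fmap, mode="max")
print(avg_pooled.tolist())  # [[0.0, 4.5, 0.0]]
print(max_pooled.tolist())  # [[0.0, 5.0, 0.0]]

# global pooling collapses the whole map to one value
global_avg = float(fmap.mean())
global_max = float(fmap.max())
print(global_avg, global_max)  # 1.5 5.0
```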

# Fully Connected Layers
    The feature maps are 'flattened out' into a one-dimensional vector before they are passed to the fully connected layers.
    Fully connected layers are the normal flat feed-forward neural network layer.
    Fully connected layers are used at the end of the network after feature extraction and consolidation have been performed by the convolutional and pooling layers. They are used to create final non-linear combinations of features and for making predictions by the network.
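
As a rough illustration of flattening followed by one fully connected layer, here is a NumPy sketch. The shapes (a 3×3 pooled map with 2 channels, 4 dense units), the random weights, and the ReLU activation are all assumptions chosen for the example, not values from the references.

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical output of the last pooling layer: 3x3 map with 2 channels
feature_maps = rng.random((3, 3, 2))

# 'flatten out': 3 * 3 * 2 = 18 values in a single vector
flat = feature_maps.reshape(-1)

# hypothetical fully connected layer with 4 units
W = rng.standard_normal((flat.size, 4))
b = np.zeros(4)
hidden = np.maximum(0.0, flat @ W + b)  # ReLU makes the combination non-linear

print(flat.shape, hidden.shape)  # (18,) (4,)
```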

# Dropout Layers
    http://jmlr.org/papers/v15/srivastava14a.html
    A single model can be used to simulate having a large number of different network architectures by randomly dropping out nodes during training. This is called dropout and offers a very computationally cheap and remarkably effective regularization method to reduce overfitting and improve generalization error in deep neural networks of all kinds.
    
## 1.1 Dropout 
    By dropping a unit out, we mean temporarily removing it from the network, along with all its incoming and outgoing connections.
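
A minimal NumPy sketch of dropping units during training is shown below. It uses the 'inverted dropout' variant, where the surviving activations are scaled by 1/keep_prob so their expected value is unchanged; that scaling is my assumption about a common implementation and is not described in these notes. The function name `dropout` is also just for illustration.

```python
import numpy as np

def dropout(activations, keep_prob, rng):
    """Zero each unit with probability (1 - keep_prob) and rescale the survivors."""
    mask = rng.random(activations.shape) < keep_prob  # True = unit is kept
    return activations * mask / keep_prob, mask

rng = np.random.default_rng(42)
x = np.ones(10)
dropped, mask = dropout(x, keep_prob=0.5, rng=rng)
print(dropped)  # every entry is either 0.0 (dropped) or 2.0 (kept and rescaled)
```

At prediction time no units are dropped, so the layer is simply skipped.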
    
## 1.2 How
    Dropout is more effective than other standard computationally inexpensive regularizers, such as weight decay, filter norm constraints and sparse activity regularization. 
    Dropout may also be combined with other forms of regularization to yield a further improvement.
### 1.2.1 Where
    Dropout is implemented per-layer in a neural network.
    It can be used with most types of layers, such as dense fully connected layers, convolutional layers, and recurrent layers such as the long short-term memory network layer.
    It is not used on the output layer.
     
### 1.2.2 Parameter
    A new hyperparameter is introduced that specifies the probability at which outputs of the layer are dropped out, or inversely, the probability at which outputs of the layer are retained. 
    Common values, expressed as the probability of retaining a node:
        - between 0.5 and 0.8 for retaining the output of each node in a hidden layer, with 0.5 as a common starting point.
        - a value close to 1.0, such as 0.8, for retaining inputs from the visible layer.
    Note that the rate argument of the Keras Dropout layer is the fraction of inputs to drop, so a retention probability of 0.8 corresponds to rate=0.2.
        
### 1.2.3 Grid Search Parameters
    Rather than guess at a suitable dropout rate for your network, test different rates systematically.
    For example, test values between 1.0 and 0.1 in increments of 0.1.
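
The candidate values for such a search could be generated like this. The sketch treats the values above as retention probabilities and converts them to the `rate` argument of the Keras `Dropout` layer, which is the fraction of inputs dropped. The variable names are mine.

```python
# retention probabilities from 0.1 to 1.0 in increments of 0.1
retain_probs = [round(0.1 * k, 1) for k in range(1, 11)]

# Keras Dropout takes the fraction to DROP, i.e. 1 - retention probability
keras_rates = [round(1.0 - p, 1) for p in retain_probs]

for p, rate in zip(retain_probs, keras_rates):
    print(f"retain={p:.1f} -> Dropout(rate={rate:.1f})")
```

A retention probability of 1.0 maps to rate 0.0, which is the no-dropout baseline.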

# Padding & Stride
## 1. What are border effects
    Reduction in the size of the input to the feature map is referred to as border effects. It is caused by the interaction of the filter with the border of the image.
    This is often not a problem for large images and small filters but can be a problem with small images. 
    This can become a problem as we develop very deep convolutional neural network models with tens or hundreds of layers. We will simply run out of data in our feature maps upon which to operate.
## 2. Padding
    Padding is the addition of pixels to the edge of the image. The added pixels have the value zero, which has no effect on the dot product operation when the filter is applied. Padding ensures that each pixel in the image is given an opportunity to be at the center of the filter.
    model.add(Conv2D(1, (3,3), padding='same', input_shape=(8, 8, 1))) ==> Adds the padding required to the input image (or feature map) to ensure that, with the default stride of 1, the output has the same shape as the input.
## 3.Stride
    The default stride or strides in two dimensions is (1,1) for the height and the width movement.
    The stride can be changed, which has an effect both on how the filter is applied to the image and, in turn, the size of the resulting feature map.
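
The effect of padding and stride on the feature map size can be worked out with a small helper. This is a sketch of the size rules as I understand them for Keras/TensorFlow ('valid' means no padding, 'same' pads so that only the stride shrinks the output); `conv_output_size` is a hypothetical name.

```python
import math

def conv_output_size(n, kernel, stride=1, padding="valid"):
    """Output length along one dimension for input length `n`."""
    if padding == "same":
        # zero padding is added so only the stride reduces the size
        return math.ceil(n / stride)
    # 'valid': the filter must fit entirely inside the input
    return math.ceil((n - kernel + 1) / stride)

print(conv_output_size(8, 3))                            # 6
print(conv_output_size(8, 3, padding="same"))            # 8
print(conv_output_size(8, 3, stride=2))                  # 3
print(conv_output_size(8, 3, stride=2, padding="same"))  # 4
```

The first line reproduces the 8 to 6 reduction seen in the examples earlier in this notebook.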

<table><tr>
    <td> <img src="Padding.png" alt="Drawing" style="width: 300px;"/> </td>
    <td>   </td>
    <td>   </td>
    <td>   </td>
    <td>   </td>
    <td>   </td>
    <td>   </td>
    <td> <img src="Stride.png" alt="Drawing" style="width: 400px;"/> </td>
</tr></table>

# Some Architectures introduction
    The elements of a convolutional neural network, such as convolutional and pooling layers, are relatively straightforward to understand.
    The challenging part of using convolutional neural networks in practice is how to design model architectures that best use these simple elements.

## 1. LeNet-5
    https://ieeexplore.ieee.org/abstract/document/726791
    Keywords: 7 layers, MNIST Dataset(32*32,grayscale image), Classification, 10 categories
### 1.1 Architecture
    In.MNIST images are 28*28, zero-padded to 32*32.
    C1.Convolution layer, 6 filters each with the size of 5×5
    S2.average Pooling(subsampling)
    C3.Convolution layer, 16 filters with a size of 5×5
    S4.average Pooling(subsampling)
    C5.Convolution layer, 120 filters with a size of 5×5 (each resulting feature map is 1×1)
    F6.fully connected layer
    Out.fully connected layer
<table><tr>
    <td> <img src="LeNet-5.png" alt="Drawing" style="width: 400px;"/> </td>
    <td><img src="LeNet-51.png" alt="Drawing" style="width: 400px;"/> </td>
</tr></table>
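
To check how the spatial size shrinks through LeNet-5, the sizes can be traced layer by layer. The sketch assumes 5×5 filters without padding for the convolution layers (including C5, as in the LeNet-5 paper) and 2×2 subsampling with a stride of 2. The helper names are my own.

```python
def conv(size, kernel):
    """Unpadded convolution with stride 1."""
    return size - kernel + 1

def pool(size, window):
    """Non-overlapping subsampling."""
    return size // window

# (layer name, operation, kernel or window size)
layers = [("C1", conv, 5), ("S2", pool, 2),
          ("C3", conv, 5), ("S4", pool, 2),
          ("C5", conv, 5)]

size = 32  # zero-padded MNIST input
trace = {"In": size}
for name, op, arg in layers:
    size = op(size, arg)
    trace[name] = size

print(trace)  # {'In': 32, 'C1': 28, 'S2': 14, 'C3': 10, 'S4': 5, 'C5': 1}
```

By C5 the feature maps are 1×1, so the 120 outputs behave like a flat vector feeding F6.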
    
### 1.2 Compare to modern
    Compared to modern applications, the number of filters is small, but the trend of increasing the number of filters with the depth of the network remains a common pattern in modern usage of the technique.
    In modern terminology, the final section of the architecture is often referred to as the classifier, whereas the convolutional and pooling layers earlier in the model are referred to as the feature extractor.

## 2. AlexNet
    http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks
    Keywords: ImageNet, similar to LeNet-5, Larger and Deeper, Stack Convolutional layers directly, Max pooling, ReLU, softmax, Dropout Method
### 2.1 Architecture
    In.224*224, 3 channels
    C1.Convolution, 96 filters with size of 11*11, stride=4
    S2.Max Pooling, 3*3, stride=2
    C3.Convolution, 256 filters with size of 5*5, stride=1
    S4.Max Pooling, 3*3, stride=2
    C5.Convolution, 384 filters with size of 3*3
    C6.Convolution, 384 filters with size of 3*3
    C7.Convolution, 256 filters with size of 3*3
    F8.Fully connected, 4096
    F9.Fully connected, 4096
    Out.Fully connected, 1000
<table><tr>
    <td> <img src="AlexNet.png" alt="Drawing" style="width: 400px;"/> </td>
    <td><img src="AlexNet1.png" alt="Drawing" style="width: 400px;"/> </td>
</tr></table>


## 3. GoogLeNet
    http://www.image-net.org/challenges/LSVRC/2014/
    https://www.cv-foundation.org/openaccess/content_cvpr_2015/html/Szegedy_Going_Deeper_With_2015_CVPR_paper.html
    Keywords: Inception Modules
   
### 3.1 Inception Module
<img src="GoolgeLeNet.png" alt="Drawing" style="width: 800px;"/>

### 3.2 Architecture
    A simplified architecture
<img src="GoogleNet1.png" alt="Drawing" style="width: 800px;"/>
<img src="GoogleNet2.png" alt="Drawing" style="width: 800px;"/>

## 4. ResNet (Residual Network)
    https://www.cv-foundation.org/openaccess/content_cvpr_2016/html/He_Deep_Residual_Learning_CVPR_2016_paper.html
    Keywords: 152 layers, skip connections(short cut connections)
### 4.1 Residual block
    Skip connections (shortcut connections) are simply connections in the network architecture where the input is kept as-is (not weighted) and passed on to a deeper layer, e.g. skipping the next layer.
    A residual block is a pattern of two convolutional layers with ReLU activation where the output of the block is combined with the input to the block, e.g. the shortcut connection.
<img src="VGG vs ResNet.png" alt="Drawing" style="width: 800px;"/>
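
The residual block pattern can be sketched in a few lines of NumPy. To keep it short, plain weight matrices stand in for the two convolutional layers, which is a simplification of mine. The point is only the shortcut: the unweighted input is added back before the final ReLU.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, w1, w2):
    """Two weighted layers plus an identity shortcut (matrices stand in for convolutions)."""
    out = relu(x @ w1)     # first layer with ReLU
    out = out @ w2         # second layer
    return relu(out + x)   # shortcut: add the input as-is, then activate

x = np.array([1.0, -2.0, 3.0])
zeros = np.zeros((3, 3))

# with all-zero weights the block reduces to relu(x): the input still passes through
y = residual_block(x, zeros, zeros)
print(y.tolist())  # [1.0, 0.0, 3.0]
```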