<a href="https://colab.research.google.com/github/djdtimit/Deep-Learning/blob/master/convolutional_neural_networks.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Convolutional Neural Networks

## Computer Vision Problems

- Image Classification: Cat or not?
- Object Detection: Finding objects in an image and drawing bounding boxes around them
- Neural Style Transfer: Generating a new image that combines the content of one image with the style of another

## Difficulties

- With large images as input, a fully connected architecture has a very large number of parameters -> overfitting and high computational requirements -> solution: convolution
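
To make the parameter blow-up concrete, the sketch below compares one fully connected layer with one convolutional layer. The image size (1000 x 1000 x 3), the 1000 hidden units, and the 32 filters of size 3 x 3 are illustrative assumptions, and the helper names are made up for this example.

```python
def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit in a fully connected layer."""
    return n_inputs * n_units + n_units


def conv_params(f, n_c_prev, n_filters):
    """Weights plus one bias per filter in a convolutional layer."""
    return (f * f * n_c_prev + 1) * n_filters


# Hypothetical input: a 1000 x 1000 RGB image.
height, width, channels = 1000, 1000, 3

dense = dense_params(height * width * channels, n_units=1000)
conv = conv_params(f=3, n_c_prev=channels, n_filters=32)

print(f"fully connected: {dense:,} parameters")  # 3,000,001,000
print(f"convolutional:   {conv:,} parameters")   # 896
```

The convolutional count does not change when the image gets larger, because it depends only on the filter size and the channel counts.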

## Strategy

- Convnets are the best type of machine-learning models for computer-vision
tasks. It’s possible to train one from scratch even on a very small dataset, with
decent results.
- On a small dataset, overfitting will be the main issue. Data augmentation is a
powerful way to fight overfitting when you’re working with image data.
- It’s easy to reuse an existing convnet on a new dataset via feature extraction.
This is a valuable technique for working with small image datasets.
- As a complement to feature extraction, you can use fine-tuning, which adapts to
a new problem some of the representations previously learned by an existing
model. This pushes performance a bit further.


## Idea behind Convolution

<img src="https://drive.google.com/uc?id=1i127PLBw0d-nmwkkVWkt7E4_senC6QGd">

Early layers detect edges (low-level features), later layers detect parts of the face, and the final layers detect whole faces.

## Convolutional operation for vertical edge detection

<img src="https://drive.google.com/uc?id=10-0JTDw55R-I4tedIrdXXumJzsq3h9xX">

## Why will applying this filter detect vertical edges?

<img src="https://drive.google.com/uc?id=1Nu5dNjrrBtw2oqdv5Q7BclpC2q0t5wQl">
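
A minimal sketch of the operation in plain Python, assuming a toy 6 x 6 image with a bright left half and a dark right half. The function name `conv2d_valid` and the pixel values are chosen for illustration. As in most deep learning libraries, the filter is not flipped (strictly speaking this is cross-correlation).

```python
def conv2d_valid(image, kernel):
    """Slide a square kernel over a 2D image (no padding, stride 1)
    and sum the element-wise products at every position."""
    f = len(kernel)
    out_h = len(image) - f + 1
    out_w = len(image[0]) - f + 1
    return [
        [
            sum(
                image[i + a][j + b] * kernel[a][b]
                for a in range(f)
                for b in range(f)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


# Assumed toy image: bright left half (10), dark right half (0).
image = [[10, 10, 10, 0, 0, 0] for _ in range(6)]

# Vertical edge filter: positive left column, negative right column.
vertical_edge = [[1, 0, -1],
                 [1, 0, -1],
                 [1, 0, -1]]

edges = conv2d_valid(image, vertical_edge)
for row in edges:
    print(row)  # each row is [0, 30, 30, 0]
```

The output is large only where the window straddles the bright-to-dark transition. In flat regions the positive and negative columns cancel.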

## Learning to detect edges

<img src="https://drive.google.com/uc?id=1OhqOonrQni3gNDyYtfXczIn1lRNp3xpB">

Instead of hand-picking the numbers in the filter, treat them as parameters. These parameters are learned from the data.

## Padding

<img src="https://drive.google.com/uc?id=1o3Iv4jgdGm8vtJeQwdbwR8Lwo1Q17bWa">

Applying the convolution operation leads to two problems:

- the output shrinks
- pixels at the border are covered by far fewer filter positions than pixels in the center, so much of the information at the edges is thrown away

Solution: padding

Padding extends the input with a border of zeros before the convolution is applied. This addresses both problems.
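
Zero padding can be sketched in a few lines of plain Python. The helper name `zero_pad` and the 3 x 3 input are assumptions for illustration.

```python
def zero_pad(image, p):
    """Return a copy of a 2D list surrounded by p rows and columns of zeros."""
    width = len(image[0]) + 2 * p
    padded = [[0] * width for _ in range(p)]          # top border
    for row in image:
        padded.append([0] * p + list(row) + [0] * p)  # left and right border
    padded.extend([0] * width for _ in range(p))      # bottom border
    return padded


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]

padded = zero_pad(image, p=1)
for row in padded:
    print(row)
```

With p = 1 the 3 x 3 input becomes 5 x 5, so a 3 x 3 filter again produces a 3 x 3 output and the former border pixels are covered by more filter positions.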

## Valid and Same convolutions

<img src="https://drive.google.com/uc?id=1U7-iTVxIjszgFqrdL_k4ZFzbFxg6Q9o2">

Valid convolution: no padding. The filter is applied only at valid window locations, so the output shrinks.

Same convolution: pad in such a way that the output has the same width and height as the input. For stride 1 this requires p = (f - 1) / 2, which is an integer only when f is odd. Odd filter sizes are also the usual choice because the filter then has a central position.
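
The relation between filter size and the padding needed for a same convolution can be checked with a small sketch. It assumes stride 1 and an input size of 28; both helper names are made up for this example.

```python
def same_padding(f):
    """Padding that keeps the output size equal to the input size (stride 1)."""
    if f % 2 == 0:
        raise ValueError("symmetric 'same' padding needs an odd filter size")
    return (f - 1) // 2


def output_size(n, f, p=0):
    """Output height or width of a stride-1 convolution: n + 2p - f + 1."""
    return n + 2 * p - f + 1


n = 28  # assumed input height/width
for f in (3, 5, 7):
    p = same_padding(f)
    print(f"f={f}: valid -> {output_size(n, f)}, same (p={p}) -> {output_size(n, f, p)}")
```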

## Strided Convolution

<img src="https://drive.google.com/uc?id=1ZeAjCZlOkcrdqHnMpWxoYr3krrKj4PWH">

With a stride of 2, the filter moves two positions at a time instead of one when the convolution is applied.
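
The effect of the stride on the output size follows from floor((n + 2p - f) / s) + 1. A minimal sketch, with input sizes picked for illustration and a hypothetical helper name:

```python
def conv_output_size(n, f, p=0, s=1):
    """Output size along one spatial dimension: floor((n + 2p - f) / s) + 1."""
    return (n + 2 * p - f) // s + 1


print(conv_output_size(7, 3, s=1))  # 5
print(conv_output_size(7, 3, s=2))  # 3
print(conv_output_size(8, 3, s=2))  # 3, the last column does not fit a full window
```

The floor means that a window which would extend past the border is simply skipped.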

## Convolution on volumes

<img src="https://drive.google.com/uc?id=1klhJJ2g-sMxM79OU0MbhuWVrTIEfPCia">

The third dimension (the number of channels) must be the same for the input and the filter. The products are computed channel by channel and then all summed, so one filter yields an output with a single channel.
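
A sketch of a single filter applied to a multi-channel input, in plain Python. The 4 x 4 x 3 input of ones and the 3 x 3 x 3 filter of ones are assumptions that make the result easy to check by hand; `conv_volume` is a hypothetical helper.

```python
def conv_volume(volume, kernel):
    """Valid, stride-1 convolution of an (h, w, c) volume with an (f, f, c) filter.

    Products are summed over height, width, and channels, so the result is 2D.
    """
    h, w, c = len(volume), len(volume[0]), len(volume[0][0])
    f = len(kernel)
    if len(kernel[0][0]) != c:
        raise ValueError("filter and input must have the same number of channels")
    return [
        [
            sum(
                volume[i + a][j + b][k] * kernel[a][b][k]
                for a in range(f)
                for b in range(f)
                for k in range(c)
            )
            for j in range(w - f + 1)
        ]
        for i in range(h - f + 1)
    ]


# Assumed toy input: 4 x 4 image with 3 channels, every value equal to 1.
volume = [[[1, 1, 1] for _ in range(4)] for _ in range(4)]
# Assumed 3 x 3 x 3 filter of ones: each output value sums 27 inputs.
kernel = [[[1, 1, 1] for _ in range(3)] for _ in range(3)]

result = conv_volume(volume, kernel)
print(result)  # [[27, 27], [27, 27]]
```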

## Multiple Filters

<img src="https://drive.google.com/uc?id=1sgNcYB3c4rUpLtLdzIIjs27a1M-HUPge">

Applying several filters for different kinds of detection (e.g. vertical or horizontal edges) results in an output whose number of channels equals the number of filters.

<img src="https://drive.google.com/uc?id=1W9NPToxU-5nByL29KT7asQ54zmZH47VK">

Add a bias to the output of each filter and apply an activation function. The result is the input to the next layer.

The learnable parameters are the filter weights and the biases.

The number of parameters is independent of the size of the image. This differs from a fully connected layer and makes the architecture less prone to overfitting than a fully connected one (see above).

<img src="https://drive.google.com/uc?id=1xmVtbkrKaZV0WACM-Vyb1qfYxobNyjS6">

A stride s > 1 quickly reduces the first and second output dimensions (height and width).
With s = 1 and no padding, each of these dimensions shrinks by f - 1, i.e. by 2 for a 3 x 3 filter. With a larger stride, n + 2p - f is additionally divided by s and floored before 1 is added.

After the final convolution layer in this example, 1960 features have been learned. These features are flattened and fed into a final activation function (for example logistic or softmax) to generate the predictions.

Convolution is typically done with 3 x 3 windows and stride = 1.

## Pooling Layers

- to reduce the size of the representation
- to speed up computation
- to make features more robust

### Max Pooling

<img src="https://drive.google.com/uc?id=1uxRFT1YVxGD2prQRo0DNLf4Tr_mxyK_F">

- f = size of the filter (pooling window)
- s = stride

In the example, a 3 x 3 window is moved over the input feature map with a stride of 1, and the maximum of each window is returned.

The number of channels of the input feature map is preserved, because pooling is applied to each channel independently.

Intuition: if a feature is detected anywhere within a pooling window, the max operation preserves it in the output.

Max pooling is usually done with 2 x 2 windows and stride 2.
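
A minimal sketch of max pooling over a single channel, using the common 2 x 2 window with stride 2. The input values and the helper name `max_pool2d` are made up for illustration.

```python
def max_pool2d(feature_map, f=2, s=2):
    """Max pooling over one channel with window size f and stride s."""
    out_h = (len(feature_map) - f) // s + 1
    out_w = (len(feature_map[0]) - f) // s + 1
    return [
        [
            max(
                feature_map[i * s + a][j * s + b]
                for a in range(f)
                for b in range(f)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


feature_map = [[1, 3, 2, 1],
               [2, 9, 1, 1],
               [1, 3, 2, 3],
               [5, 6, 1, 2]]

pooled = max_pool2d(feature_map)
print(pooled)  # [[9, 2], [6, 3]]
```

Replacing `max` with an average over the window would give average pooling. In both cases there is nothing to learn.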

### Average Pooling

<img src="https://drive.google.com/uc?id=1vmStEcB_lVDM4KXLdUN36XKaAeXkv1Bk">

Same idea as max pooling but taking the average

For pooling layers, f and s are hyperparameters. There are no parameters to learn in these layers.

## Neural network example

<img src="https://drive.google.com/uc?id=1hy4m5F06EePJunVqpVtP_ixWtDXTuDSN">

formula: $\lfloor\frac{n+2p-f}{s} + 1\rfloor$
Here: n = 32 (n_c = 3), p = 0, f = 5, s = 1 => 28
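
Written out step by step with the values stated above (n = 32, p = 0, f = 5, s = 1):

$$\left\lfloor \frac{n + 2p - f}{s} + 1 \right\rfloor = \left\lfloor \frac{32 + 2 \cdot 0 - 5}{1} + 1 \right\rfloor = \lfloor 27 + 1 \rfloor = 28$$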

Convention: conv + max pooling count as one layer, since max pooling has no parameters to learn.

Common pattern: n_h and n_w decrease, n_c increases.

Note: n_c in CONV1 and POOL1 should be 8 (see below).

Convolutions operate over 3D tensors, called feature maps.
The convolution operation extracts patches from its input feature map and applies the same transformation to all of these patches, producing an output feature map.



<img src="https://drive.google.com/uc?id=1f-6XGKXZNXliRA7yS8pUBF6rYLCF1QBY">

activation size = n_w x n_h x n_c

number of parameters = (f x f x n_c_previous_layer + bias) x n_c_current_layer

<img src="https://drive.google.com/uc?id=1r70_JlEs9hQ5J8Ovcj2zGv6SzQrK89Ru">

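Parameter sharing: a filter that is useful in one part of the image (e.g. an edge detector) is probably useful in other parts too, so the same weights are reused at every position.

Sparsity of connections: each output value depends only on a small number of inputs (the window under the filter), not on the whole input.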

The fundamental difference between a densely connected layer and a convolution layer:

- Dense layers learn global patterns in their input feature space

- Convolution layers learn local patterns

The patterns convnets learn are translation invariant:

- a convnet can recognize a learned pattern anywhere in the image

- a densely connected network would have to learn the pattern anew if it appeared at a new location

=> convnets need fewer training examples
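
The idea that one filter recognizes its pattern at any position can be illustrated in one dimension. This is only a sketch: the kernel stands in for a hypothetical learned pattern detector, and the two signals are made up.

```python
def correlate1d(signal, kernel):
    """Valid, stride-1 sliding dot product of a 1D signal with a kernel."""
    f = len(kernel)
    return [
        sum(signal[i + a] * kernel[a] for a in range(f))
        for i in range(len(signal) - f + 1)
    ]


kernel = [1, 2, 1]                 # hypothetical learned pattern detector
left = [1, 2, 1, 0, 0, 0, 0, 0]    # pattern at the start of the signal
right = [0, 0, 0, 0, 0, 1, 2, 1]   # same pattern at the end of the signal

resp_left = correlate1d(left, kernel)
resp_right = correlate1d(right, kernel)

print(resp_left)   # [6, 4, 1, 0, 0, 0]
print(resp_right)  # [0, 0, 0, 1, 4, 6]
print(max(resp_left) == max(resp_right))  # True
```

The same weights produce the same peak response in both cases, only at a different output position. A dense layer has separate weights per input position and would need to see the pattern at both locations during training.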

Convnets can learn spatial hierarchies of patterns:

A first convolution layer will learn small local patterns such as edges, a second convolution layer will learn larger patterns made of the features of the first layers, and so on. This allows convnets to efficiently learn increasingly complex and abstract visual concepts.

## Instantiating a small convnet

In [0]:
from keras import layers
from keras import models

Using TensorFlow backend.


In [0]:
model = models.Sequential()
# Conv2D layers extract features; each 2 x 2 MaxPooling2D roughly halves height and width.
model.add(layers.Conv2D(filters=32, kernel_size=(3, 3), activation='relu', input_shape=(28, 28, 1)))
model.add(layers.MaxPooling2D(pool_size=(2, 2)))
model.add(layers.Conv2D(filters=64, kernel_size=(3, 3), activation='relu'))
model.add(layers.MaxPooling2D(pool_size=(2, 2)))
model.add(layers.Conv2D(filters=64, kernel_size=(3, 3), activation='relu'))







In [0]:
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_1 (Conv2D)            (None, 26, 26, 32)        320       
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 13, 13, 32)        0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 11, 11, 64)        18496     
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 5, 5, 64)          0         
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 3, 3, 64)          36928     
Total params: 55,744
Trainable params: 55,744
Non-trainable params: 0
_________________________________________________________________


number of parameters = (f x f x n_c_previous_layer + bias) x n_c_current_layer => (3 x 3 x 1 + 1) x 32 = 320

(3 x 3 x 32 +1) x 64 = 18496

(3 x 3 x 64 +1) x 64 = 36928

The output of every Conv2D and MaxPooling2D layer is a 3D tensor of shape (height, width, channels). The width and height dimensions shrink as described above.
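
The output shapes in the summary can be reproduced with the output-size formula. The sketch below assumes the settings of the model above (3 x 3 convolutions with stride 1 and no padding, 2 x 2 pooling with stride 2); the helper names are made up for illustration.

```python
def conv_out(n, f=3, p=0, s=1):
    """Output size of a convolution along one spatial dimension."""
    return (n + 2 * p - f) // s + 1


def pool_out(n, f=2, s=2):
    """Output size of a pooling layer along one spatial dimension."""
    return (n - f) // s + 1


# (layer name, size function, number of filters or None if channels are unchanged)
layer_specs = [
    ("conv2d_1", conv_out, 32),
    ("max_pooling2d_1", pool_out, None),
    ("conv2d_2", conv_out, 64),
    ("max_pooling2d_2", pool_out, None),
    ("conv2d_3", conv_out, 64),
]

size, channels = 28, 1  # input_shape=(28, 28, 1)
shapes = []
for name, size_fn, n_filters in layer_specs:
    size = size_fn(size)
    if n_filters is not None:
        channels = n_filters
    shapes.append((size, size, channels))
    print(f"{name:16s} -> {(size, size, channels)}")
```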

## Adding a classifier on top of the convnet

In [0]:
# Flatten the (3, 3, 64) feature maps into a vector, then classify into 10 classes.
model.add(layers.Flatten())
model.add(layers.Dense(units=64, activation='relu'))
model.add(layers.Dense(units=10, activation='softmax'))

In [0]:
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_1 (Conv2D)            (None, 26, 26, 32)        320       
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 13, 13, 32)        0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 11, 11, 64)        18496     
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 5, 5, 64)          0         
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 3, 3, 64)          36928     
_________________________________________________________________
flatten_1 (Flatten)          (None, 576)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 64)               

The flatten_1 layer outputs 3 x 3 x 64 = 576 values.

The dense_1 layer contains 576 x 64 + 64 = 36928 parameters.

The dense_2 layer contains 64 x 10 + 10 = 650 parameters.
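
The same arithmetic as a short sketch. `dense_params` is a hypothetical helper, and the convolutional counts are taken from the first model summary above.

```python
def dense_params(n_inputs, n_units):
    """One weight per input-unit pair plus one bias per unit."""
    return n_inputs * n_units + n_units


flatten_out = 3 * 3 * 64                 # output of the last Conv2D layer, flattened
dense_1 = dense_params(flatten_out, 64)
dense_2 = dense_params(64, 10)           # dense_2 receives the 64 outputs of dense_1

conv_total = 320 + 18496 + 36928         # from the first model.summary()
total = conv_total + dense_1 + dense_2

print(flatten_out, dense_1, dense_2)     # 576 36928 650
print(total)                             # 93322
```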

## Training the convnet on MNIST images

In [0]:
from keras.datasets import mnist
from keras.utils import to_categorical

In [0]:
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

Downloading data from https://s3.amazonaws.com/img-datasets/mnist.npz


In [0]:
train_images.shape

(60000, 28, 28)

In [0]:
test_images.shape

(10000, 28, 28)

In [0]:
train_images = train_images.reshape((60000,28,28,1))
train_images = train_images.astype('float32') / 255

In [0]:
test_images = test_images.reshape((10000,28,28,1))
test_images = test_images.astype('float32') / 255

In [0]:
train_labels = to_categorical(train_labels)
test_labels = to_categorical(test_labels)
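
What the one-hot encoding does can be sketched in plain Python. The helper `one_hot` is a hypothetical stand-in that mirrors the idea behind `to_categorical`, not its implementation, and the three labels are made up.

```python
def one_hot(labels, num_classes):
    """Turn integer class labels into one-hot rows."""
    encoded = []
    for label in labels:
        row = [0.0] * num_classes
        row[label] = 1.0  # mark the position of the true class
        encoded.append(row)
    return encoded


labels = [5, 0, 4]  # hypothetical digit labels
encoded = one_hot(labels, num_classes=10)
for label, row in zip(labels, encoded):
    print(label, row)
```

Each label becomes a vector of length 10, which matches the 10 softmax outputs of the model and the `categorical_crossentropy` loss.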

In [0]:
model.compile(optimizer='rmsprop',loss='categorical_crossentropy', metrics=['accuracy'])





In [0]:
model.fit(x=train_images, y=train_labels, batch_size=64,epochs=5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7f909c3aaa90>

## Evaluate on the test data

In [0]:
test_loss, test_acc = model.evaluate(x=test_images, y=test_labels)



In [0]:
test_acc

0.9921