Question 1: What are the advantages of a CNN over a fully connected DNN for image classification?

Although a fully connected DNN works fine for small images (e.g., MNIST), it
breaks down for larger images because of the huge number of parameters it
requires. For example, a 100 × 100 image has 10,000 pixels, and if
the first layer has just 1,000 neurons (which already severely
restricts the amount of information transmitted to the next layer),
this means a total of 10 million connections. And that’s just the first
layer. CNNs solve this problem using partially connected layers and
weight sharing.
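
As a rough check of the figures above, the sketch below counts the parameters of a fully connected first layer on a 100 × 100 grayscale image and compares it with a convolutional layer. The 3 × 3 kernel and the 64 filters are illustrative assumptions, not values taken from the answer.

```python
def dense_params(n_inputs, n_neurons):
    """One weight per input-to-neuron connection, plus one bias per neuron."""
    return n_inputs * n_neurons + n_neurons


def conv_params(kernel_size, in_channels, n_filters):
    """Each filter reuses the same kernel at every position, plus one bias."""
    return (kernel_size * kernel_size * in_channels + 1) * n_filters


n_pixels = 100 * 100                       # 100 x 100 grayscale image
dense_connections = n_pixels * 1000        # connections into 1,000 neurons
dense_total = dense_params(n_pixels, 1000)

# Assumed for illustration: 3 x 3 kernels, 1 input channel, 64 filters.
conv_total = conv_params(3, 1, 64)

print(f"dense connections: {dense_connections:,}")
print(f"dense parameters:  {dense_total:,}")
print(f"conv parameters:   {conv_total:,}")
```

The convolutional layer's parameter count does not depend on the image size, which is the practical effect of weight sharing.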

Question 2: Consider a CNN composed of three convolutional layers, each with 3 × 3 kernels, a stride of 2, and "same" padding. The lowest layer outputs 100 feature maps, the middle one outputs 200, and the top one outputs 400. The input images are RGB images of 200 × 300 pixels.

What is the total number of parameters in the CNN? If we are using 32-bit floats, at least how much RAM will this network require when making a prediction for a single instance? What about when training on a mini-batch of 50 images?

In [4]:
from keras.models import Sequential
from keras.layers import Conv2D

# Layers are ordered as in the question: 100, then 200, then 400 feature maps.
model = Sequential()
model.add(Conv2D(filters=100, input_shape=(200, 300, 3), kernel_size=(3, 3), strides=(2, 2), padding='same'))
model.add(Conv2D(filters=200, kernel_size=(3, 3), strides=(2, 2), padding='same'))
model.add(Conv2D(filters=400, kernel_size=(3, 3), strides=(2, 2), padding='same'))
model.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_1 (Conv2D)            (None, 100, 150, 100)     2800      
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 50, 75, 200)       180200    
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 25, 38, 400)       720400    
Total params: 903,400
Trainable params: 903,400
Non-trainable params: 0
_________________________________________________________________


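Amount of RAM required (32-bit floats, so 4 bytes per value):

Parameters, with the layers ordered as in the question: (3 × 3 × 3 + 1) × 100 + (3 × 3 × 100 + 1) × 200 + (3 × 3 × 200 + 1) × 400 = 2,800 + 180,200 + 720,400 = 903,400 parameters, or about 3.6 MB.

Feature maps for one instance: 4 × 100 × 150 × 100 = 6 MB for the first layer, 4 × 50 × 75 × 200 = 3 MB for the second, and 4 × 25 × 38 × 400 = 1.52 MB for the third.

For 1 instance (prediction): only two consecutive layers have to be in memory at the same time, so at least 6 + 3 = 9 MB of feature maps plus 3.6 MB of parameters, about 12.6 MB.

For a mini-batch of 50 instances (training): backpropagation needs the output of every layer, so 50 × (6 + 3 + 1.52) = 526 MB, plus 50 × 4 × 200 × 300 × 3 = 36 MB for the input images and 3.6 MB for the parameters, about 566 MB (ignoring the gradients).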
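
The estimate for Question 2 can be checked with plain arithmetic. The sketch below is a minimal version of that calculation, assuming the layer order given in the question (100, 200, then 400 feature maps), "same" padding with a stride of 2, and 1 MB = 10^6 bytes. It ignores gradients and any framework overhead, so the results are lower bounds.

```python
BYTES_PER_FLOAT = 4  # 32-bit floats


def conv_params(kernel_size, in_channels, n_filters):
    """Weights plus one bias per filter."""
    return (kernel_size * kernel_size * in_channels + 1) * n_filters


def same_padding_size(size, stride):
    """Output size with "same" padding: ceil(size / stride)."""
    return -(-size // stride)


height, width, channels = 200, 300, 3
total_params = 0
activation_bytes = []  # memory taken by each layer's output, one instance

for n_filters in [100, 200, 400]:
    total_params += conv_params(3, channels, n_filters)
    height = same_padding_size(height, 2)
    width = same_padding_size(width, 2)
    activation_bytes.append(BYTES_PER_FLOAT * height * width * n_filters)
    channels = n_filters

param_bytes = total_params * BYTES_PER_FLOAT

# Prediction: only two consecutive layer outputs must coexist in memory.
peak_pair = max(a + b for a, b in zip(activation_bytes, activation_bytes[1:]))
inference_bytes = peak_pair + param_bytes

# Training: every layer's output is kept for the backward pass.
batch_size = 50
input_bytes = batch_size * BYTES_PER_FLOAT * 200 * 300 * 3
training_bytes = batch_size * sum(activation_bytes) + input_bytes + param_bytes

print(f"parameters: {total_params:,}")
print(f"prediction: {inference_bytes / 1e6:.1f} MB")
print(f"training:   {training_bytes / 1e6:.1f} MB")
```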

Question 3: If your GPU runs out of memory while training a CNN, what are five things you could try to solve the problem?

Five things to try:

Reduce the mini-batch size.

Use a larger stride in one or more layers.

Remove one or more layers.

Use 16-bit floats instead of 32-bit floats.

Distribute the CNN across multiple devices.
    

Question 4: Why would you want to add a max pooling layer rather than a convolutional layer with the same stride?

A max pooling layer has no parameters at all, whereas a convolutional layer
with the same stride has quite a few. Besides reducing computations, memory
usage, and the number of parameters, a max pooling layer also introduces some
level of invariance to small translations.

By inserting a max pooling layer every few layers in
a CNN, it is possible to get some level of translation invariance at a larger scale.
Moreover, max pooling also offers a small amount of rotational invariance and a
slight scale invariance. Such invariance (even if it is limited) can be useful in cases
where the prediction should not depend on these details, such as in classification
tasks.
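
The limited translation invariance described above can be seen on a toy example. The sketch below applies non-overlapping max pooling to a 1D signal; the signal values and the helper names are made up for illustration. A one-pixel shift leaves the pooled output unchanged here, while a two-pixel shift does not, and the outcome in general depends on where each feature sits inside its pooling window.

```python
def max_pool_1d(values, pool_size=2):
    """Non-overlapping max pooling (stride equal to pool_size)."""
    return [max(values[i:i + pool_size])
            for i in range(0, len(values) - pool_size + 1, pool_size)]


def shift_right(values, n):
    """Shift a signal right by n positions, padding with zeros on the left."""
    return [0] * n + values[:len(values) - n]


signal = [9, 0, 0, 0, 7, 0, 0, 0]

pooled = max_pool_1d(signal)
pooled_shift1 = max_pool_1d(shift_right(signal, 1))  # shifted by one pixel
pooled_shift2 = max_pool_1d(shift_right(signal, 2))  # shifted by two pixels

print(pooled, pooled_shift1, pooled_shift2)
```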


Question 5: When would you want to add a local response normalization layer?

A local response normalization layer makes the most strongly activated
neurons inhibit other neurons located at the same position in neighboring feature
maps (such competitive activation has been observed in biological neurons). This
encourages different feature maps to specialize, pushing them apart and forcing them to explore a wider range of features, ultimately improving generalization.
It is typically added in the lower layers, so that the upper layers have a larger pool of low-level features to build on.
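
The inhibition effect can be sketched for a single spatial position. The code below divides each activation by (k + alpha × sum of squared activations in the neighboring feature maps) ** beta. The hyperparameter values and the input activations are illustrative assumptions chosen to make the effect visible, not values from any particular network.

```python
def local_response_normalization(activations, radius=1, k=1.0,
                                 alpha=0.1, beta=0.75):
    """Normalize activations at one position across neighboring feature maps.

    activations[i] is the output of feature map i at that position.
    """
    n = len(activations)
    normalized = []
    for i, a in enumerate(activations):
        low, high = max(0, i - radius), min(n - 1, i + radius)
        sum_sq = sum(activations[j] ** 2 for j in range(low, high + 1))
        normalized.append(a / (k + alpha * sum_sq) ** beta)
    return normalized


# The middle neuron has the same activation in both cases; only its
# neighbor in the next feature map differs.
quiet = local_response_normalization([0.0, 1.0, 0.0])
loud = local_response_normalization([0.0, 1.0, 10.0])

print(quiet[1], loud[1])  # the strongly activated neighbor inhibits it
```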

Question 6: Can you name the main innovations in AlexNet, compared to LeNet-5? What about the main innovations in GoogLeNet, ResNet, SENet?

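Gradual improvements over LeNet-5:

AlexNet - Much larger and deeper than LeNet-5, with convolutional layers stacked directly on top of each other. It used the ReLU activation function, added local response normalization, and was trained on GPUs.

ResNet - Introduced skip connections, which make it possible to train very deep networks.

GoogLeNet - A 22-layer network built from inception modules, which make it wide as well as deep while using far fewer parameters than earlier architectures.

SENet - Introduced the SE (squeeze-and-excitation) block, which recalibrates the feature maps of an existing architecture to boost its performance.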

Question 7: What is a fully convolutional network? How can you convert a dense layer into a convolutional layer?

The idea of FCNs was first introduced in a 2015 paper by Jonathan Long et al., for
semantic segmentation (the task of classifying every pixel in an image according to
the class of the object it belongs to). They pointed out that you could replace the
dense layers at the top of a CNN by convolutional layers. To understand this, let’s look
at an example: suppose a dense layer with 200 neurons sits on top of a convolutional
layer that outputs 100 feature maps, each of size 7 × 7 (this is the feature map size, not
the kernel size). Each neuron will compute a weighted sum of all 100 × 7 × 7 activations
from the convolutional layer (plus a bias term). Now let’s see what happens if we
replace the dense layer with a convolutional layer using 200 filters, each 7 × 7, and with
VALID padding. This layer will output 200 feature maps, each 1 × 1 (since the kernel
is exactly the size of the input feature maps and we are using VALID padding). In
other words, it will output 200 numbers, just like the dense layer did, and if you look
closely at the computations performed by a convolutional layer, you will notice that
these numbers will be precisely the same as the dense layer produced. The only difference
is that the dense layer’s output was a tensor of shape [batch size, 200] while the
convolutional layer will output a tensor of shape [batch size, 1, 1, 200].

To convert a dense layer to a convolutional layer, the number of filters
in the convolutional layer must be equal to the number of units
in the dense layer, the filter size must be equal to the size of the
input feature maps, and you must use VALID padding. The stride
may be set to 1 or more.
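
The equivalence described above can be checked numerically. The sketch below uses NumPy with random weights rather than a trained model; `conv2d_valid` is a hypothetical helper written for this illustration, not a library function. It reshapes the dense layer's weight matrix into 200 kernels of size 7 × 7 × 100 and compares the two outputs.

```python
import numpy as np


def conv2d_valid(x, kernels, bias):
    """Stride-1 convolution with valid padding.

    x: (height, width, channels); kernels: (kh, kw, channels, filters).
    """
    kh, kw, _, n_filters = kernels.shape
    out_h = x.shape[0] - kh + 1
    out_w = x.shape[1] - kw + 1
    out = np.empty((out_h, out_w, n_filters))
    for i in range(out_h):
        for j in range(out_w):
            patch = x[i:i + kh, j:j + kw, :]
            # Weighted sum over the whole patch, for every filter at once.
            out[i, j] = np.tensordot(patch, kernels, axes=3) + bias
    return out


rng = np.random.default_rng(42)
feature_maps = rng.normal(size=(7, 7, 100))        # 100 maps of size 7 x 7
dense_weights = rng.normal(size=(7 * 7 * 100, 200))
dense_bias = rng.normal(size=200)

# Dense layer: flatten the feature maps, then one weighted sum per neuron.
dense_out = feature_maps.reshape(-1) @ dense_weights + dense_bias

# Same weights viewed as 200 filters, each covering the full 7 x 7 input.
kernels = dense_weights.reshape(7, 7, 100, 200)
conv_out = conv2d_valid(feature_maps, kernels, dense_bias)

print(dense_out.shape, conv_out.shape)
print(np.allclose(conv_out.reshape(-1), dense_out))
```

The reshape works because flattening the feature maps and reshaping the weight matrix use the same row-major ordering of (row, column, channel).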

Question 8: What is data augmentation?

Data augmentation artificially increases the size of the training set by generating
many realistic variants of each training instance. This reduces overfitting, making this
a regularization technique. The generated instances should be as realistic as possible.
For example, you can slightly shift, rotate, or resize every picture in the training set.
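
A minimal sketch of the idea, assuming grayscale images stored as a NumPy array of shape (n, height, width): each image is copied with one-pixel shifts in four directions, so the training set becomes five times larger. The helper names and the tiny stand-in dataset are made up for illustration.

```python
import numpy as np


def shift_image(image, dx, dy):
    """Shift a 2D image by (dx, dy) pixels, filling the exposed border with zeros."""
    height, width = image.shape
    shifted = np.zeros_like(image)
    src = image[max(0, -dy):height - max(0, dy),
                max(0, -dx):width - max(0, dx)]
    top, left = max(0, dy), max(0, dx)
    shifted[top:top + src.shape[0], left:left + src.shape[1]] = src
    return shifted


def augment_with_shifts(images, labels,
                        shifts=((1, 0), (-1, 0), (0, 1), (0, -1))):
    """Return the original images plus one shifted copy per (dx, dy) offset."""
    augmented_images = [images]
    augmented_labels = [labels]
    for dx, dy in shifts:
        augmented_images.append(
            np.stack([shift_image(img, dx, dy) for img in images]))
        augmented_labels.append(labels)  # a shift does not change the class
    return np.concatenate(augmented_images), np.concatenate(augmented_labels)


# Tiny stand-in for a training set: two 5 x 5 images with one bright pixel.
images = np.zeros((2, 5, 5))
images[0, 2, 2] = 1.0
images[1, 1, 3] = 1.0
labels = np.array([0, 1])

X_aug, y_aug = augment_with_shifts(images, labels)
print(X_aug.shape, y_aug.shape)
```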

Question 9: What is the main technical difficulty of semantic segmentation?

The main difficulty in this task is that when images go through a regular CNN, they
gradually lose their spatial resolution (due to the layers with strides greater than 1), so a
regular CNN may end up knowing that there’s a person in the image, somewhere in
the bottom left of the image, but it will not be much more precise than that.
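
The loss of resolution is easy to quantify. The sketch below assumes a hypothetical 224 × 224 input and five layers with a stride of 2 and "same" padding (both are illustrative choices), and tracks the feature map size through the network.

```python
def output_size(size, stride):
    """Spatial size after a layer with the given stride and "same" padding."""
    return -(-size // stride)  # ceiling division


height, width = 224, 224      # hypothetical input size
strides = [2, 2, 2, 2, 2]     # hypothetical: five layers with stride 2

overall_stride = 1
for stride in strides:
    height = output_size(height, stride)
    width = output_size(width, stride)
    overall_stride *= stride

print(f"final feature map: {height} x {width}")
print(f"overall stride: {overall_stride}")
```

With an overall stride of 32, neighboring cells of the final feature map are 32 input pixels apart, which is far too coarse for a per-pixel classification.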

Question 10: Build your own CNN from scratch and try to achieve the highest possible accuracy on MNIST.

In [12]:
import keras
from functools import partial
from keras.datasets import mnist

# Load MNIST and add a channel dimension: (n, 28, 28) -> (n, 28, 28, 1).
(X_train, y_train), (X_test, y_test) = mnist.load_data()
X_train = X_train.reshape(-1, 28, 28, 1)

# Shared defaults for the convolutional layers: 3 x 3 kernels, ReLU, "same" padding.
DefaultConv2D = partial(keras.layers.Conv2D,
                        kernel_size=3, activation='relu', padding='same')

# Three convolutional stages, each followed by max pooling, then a dense classifier.
model = keras.models.Sequential([
    DefaultConv2D(filters=64, kernel_size=7, input_shape=[28, 28, 1]),
    keras.layers.MaxPooling2D(pool_size=2),
    DefaultConv2D(filters=128),
    DefaultConv2D(filters=128),
    keras.layers.MaxPooling2D(pool_size=2),
    DefaultConv2D(filters=256),
    DefaultConv2D(filters=256),
    keras.layers.MaxPooling2D(pool_size=2),
    keras.layers.Flatten(),
    keras.layers.Dense(units=128, activation='relu'),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(units=64, activation='relu'),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(units=10, activation='softmax'),
])

# Labels are integer class ids, hence the sparse categorical cross-entropy loss.
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X_train, y_train, epochs=10, batch_size=100)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x28db7adf640>

In [13]:
X_test = X_test.reshape(-1, 28, 28, 1)
model.evaluate(X_test, y_test, verbose=0)

[0.04858801141381264, 0.9883000254631042]

The model reaches about 98.8% accuracy on the test set.