# 8 Convolutional neural networks
The goal of this exercise is to learn the basic stuff about [convolutional neural networks](https://en.wikipedia.org/wiki/Convolutional_neural_network) (CNN or ConvNet). In the previous exercises the building blocks mostly included simple operations that had some kind of activations functions and the each layer was usually fully connected to the previous one. CNNs take into account the spatial nature of the input data, e.g. an image, and they process it by applying one or more  [kernels](https://en.wikipedia.org/wiki/Kernel_%28image_processing%29). In the case of images, this processing i.e. convolving is also known as filtering. The results of processing the input with a single kernel will be a signle channel, but usually a convolutional layer involves more kernels that then produce more channels. These channels are often called **feature maps** because each kernel is specialized for extraction of a certain kind of features from the input. These feature maps are then combined into a single tensor that can be viewed as an image with multiple channels that can then be passed to further convolutional layers.

For example if the input consists of a grayscale image i.e. an image with only one channel and a $5\times 5$ kernel is applied, the result is a single feature map. The borders of the input image are usuallz padded with zeros in order to ensure that the resulting feature maps has the same number of rows and columns as the input image.

If the input consists of a color image i.e. an image with three channels and a $5\times 5$ kernel is applied, what will actually be applied is an $5\times 5\times 3$ kernel that will simultaneously process all three channels and the result will again be a single feature map. However, if e.g. 16 several kernels are applied, then the result will be 16 feature maps. Should they be passed to another convolutional layer, **each** of its kernels would simultaneously process **all** feature maps so their sizes would be e.g. $3\times 3\times 16$ or $5\times 5\times 16$ where 16 is used to reach all feature maps simultaneously.

The convolution is usually followed by applying an element-wise non-linear operation to each of the values in the feature maps. Finally, what offten follows is the summarization i.e. pooling of the information in the feature maps in order to reduce the spatial dimensions and keep only the more important information. A common approach used here is the so called max pooling. It is a non-linear downsampling where the input is divided into a set of non-overlapping rectangles and for each of them only the the maximum value inside of it is kept.

![Model of a neuron](cnn_img/max_pooling_2x2.png)
<center>Figure 1. Max pooling with $2\times 2$ rectangles (taken from [Wikipedia](https://en.wikipedia.org/wiki/File:Max_pooling.png)).</center>

What usually follows after several convolutional layers is putting the values of all feature maps into a single vector, which is then passed further to fully connected or other kinds of layers.

The number of parameters in the convolutional depends on the number of feature maps and the sizes of the kernels. For example is a convolutional layer with 32 kernels of nominal size $3\times 3$ receives 16 feature maps on its input, it will require $16\times 3\times 3\times 32+32$ where the last 32 covers the kernel biases.


## 8.1 The MNIST dataset revisited (2)
In one of the previous exercises the MNIST dataset was used to demonstrate the use of multilayer perceptron. Here we are going to apply a convolutional neural network to the problem of digits classification. We will use the following layers to build our model:

* [tf.nn.relu](https://www.tensorflow.org/api_docs/python/tf/nn/relu)
* [tf.layers.conv2d](https://www.tensorflow.org/api_docs/python/tf/layers/conv2d)
* [tf.layers.max_pooling2d](https://www.tensorflow.org/api_docs/python/tf/layers/max_pooling2d)
* [tf.layers.dense](https://www.tensorflow.org/api_docs/python/tf/layers/dense)

The [tf.layers.dense](https://www.tensorflow.org/api_docs/python/tf/layers/dense) layer has the same effect as the fully connected layer matrix multiplication that was used in the previous exercise with the MNIST dataset.

**Tasks**

1. Study and run the code below. How is the accuracy compared to the ones obtained in the previous exerises with MNIST?
2. Try to change the number and size of convolutional and fully connected layers. What has the greatest impact on the accuracy?
3. What happens to the accuracy if another non-linearity is used instead of ReLU?

In [1]:
#use MNIST data
import tensorflow.compat.v1 as tf
tf.disable_eager_execution()



import input_data
mnist=input_data.read_data_sets("mnist/", one_hot=True)

#settings
learning_rate=0.001
training_epochs_count=5
batch_size=100
batches_count=int(mnist.train.num_examples/batch_size)
display_step=1

activation_function=tf.nn.relu
optimizer_type=tf.train.AdamOptimizer

#architecture
input_size=784
n_channels_1=32
n_channels_2=64
n_classes=10
n_fully_connected=128
kernel_size=5

#data input
x=tf.placeholder(tf.float32, [None, input_size])

#reshaping the input to its image form so that we can apply convolution
layer=tf.reshape(x, [-1, 28, 28, 1])
y=tf.placeholder(tf.float32, [None, n_classes])

#first convolutional layer
#we will apply n_channels_1 kernels of size kernel_size X kernel_size
#we are padding the input in order for the result to have the same number of rows and columns
layer=tf.layers.conv2d(layer, n_channels_1, kernel_size, padding="SAME")
#applying the non-linearity
layer=tf.nn.relu(layer)
#now we downsample the feature maps from 28 X 28 to 14 X 14
layer=tf.layers.max_pooling2d(layer, 2, 2)

#second convolutional layer
#we will apply n_channels_2 kernels of size kernel_size X kernel_size
layer=tf.layers.conv2d(layer, n_channels_2, kernel_size, padding="SAME")
#again, we apply the non-linearity
layer=tf.nn.relu(layer)
#and max pooling again, now each feature map will be of size 7 X 7
layer=tf.layers.max_pooling2d(layer, 2, 2)

#we have n_channel_2 maps of size 7 X 7
#now reshape them into a single vector
layer=tf.reshape(layer, [-1, 7*7*n_channels_2])
#a fully connected layer
layer=tf.layers.dense(layer, n_fully_connected)
#non-linearity
layer=tf.nn.relu(layer)

#final classification
y_predicted=tf.layers.dense(layer, 10)

cost=tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=y_predicted, labels=y))
optimizer=tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(cost)

session=tf.Session();
session.run(tf.global_variables_initializer())

correct_y_predicted=tf.equal(tf.argmax(y_predicted, 1), tf.argmax(y, 1))
accuracy=tf.reduce_mean(tf.cast(correct_y_predicted, tf.float32))

for epoch in range(training_epochs_count):
    for i in range(batches_count):
        batch_x, batch_y = mnist.train.next_batch(batch_size)
        session.run(optimizer, feed_dict={x:batch_x, y:batch_y})
    if ((epoch+1)%display_step==0):
        print("Epoch #"+str(epoch+1)+" "+str(session.run(accuracy, feed_dict={x: mnist.test.images, y: mnist.test.labels})))

session.close()

Extracting mnist/train-images-idx3-ubyte.gz
Extracting mnist/train-labels-idx1-ubyte.gz
Extracting mnist/t10k-images-idx3-ubyte.gz
Extracting mnist/t10k-labels-idx1-ubyte.gz
Instructions for updating:

Future major versions of TensorFlow will allow gradients to flow
into the labels input on backprop by default.

See `tf.nn.softmax_cross_entropy_with_logits_v2`.



  layer=tf.layers.conv2d(layer, n_channels_1, kernel_size, padding="SAME")
  layer=tf.layers.max_pooling2d(layer, 2, 2)
  layer=tf.layers.conv2d(layer, n_channels_2, kernel_size, padding="SAME")
  layer=tf.layers.max_pooling2d(layer, 2, 2)
  layer=tf.layers.dense(layer, n_fully_connected)
  y_predicted=tf.layers.dense(layer, 10)


Epoch #1 0.9854
Epoch #2 0.9891


KeyboardInterrupt: 

## 8.2 Image classification
Image classification is a challenging computer vision problem with the best known competition being [The ImageNet Large Scale Visual Recognition Challenge (ILSVRC)](http://www.image-net.org/challenges/LSVRC/), which includes the ImageNet dataset with millions of $224\times 224$ training images. The class names in one of the tasks there can be found [here](https://gist.github.com/yrevar/942d3a0ac09ec9e5eb3a). One of the most important breakthroughs was when in 2012 the convolutional neural network [AlexNet](https://en.wikipedia.org/wiki/AlexNet) won the first place. Ever since many highly successful convolutional neural networks architectures have been proposed, e.g. [VGG-16](https://arxiv.org/abs/1409.1556), [VGG-19](https://arxiv.org/abs/1409.1556), [ResNet](https://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/He_Deep_Residual_Learning_CVPR_2016_paper.pdf), [Inception](https://arxiv.org/abs/1409.4842), etc. Training such networks requires a lot of time because they have many layers with millions of parameters. In this exercise we are going to experiment with pre-trained models of some of the best known architectures. In order to make things simple, we are going to use [Keras](https://keras.io/), *"a high-level neural networks API, written in Python and capable of running on top of TensorFlow, CNTK, or Theano."* To install Keras, it is enough to type
```
conda install keras
```
in your command prompt/terminal. Alternatively, you can type
```
pip install keras --upgrade
```
and there be any error, then first type
```
conda install pip
```
to refresh your pip and then repeat the first command. Keras already includes APIs to many well-known architectures. Let's first try to classify some images.
### 8.2.1 Using pre-trained models
Try running the following code:

In [8]:
import keras.utils as image
import numpy as np

#choose the architecture
#architecture="resnet"
#architecture="vgg16"
#architecture="vgg19"
architecture="inceptionv3"

if (architecture=="resnet"):
    from tensorflow.keras.applications.resnet50 import ResNet50
    from tensorflow.keras.applications.resnet50 import preprocess_input, decode_predictions
    model=ResNet50(weights="imagenet")
elif (architecture=="vgg16"):
    from tensorflow.keras.applications.vgg16 import VGG16
    from tensorflow.keras.applications.vgg16 import preprocess_input
    model=VGG16(weights="imagenet")
elif (architecture=="vgg19"):
    from tensorflow.keras.applications.vgg19 import VGG19
    from tensorflow.keras.applications.vgg19 import preprocess_input
    model=VGG19(weights="imagenet")
elif (architecture=="inceptionv3"):
    from tensorflow.keras.applications.inception_v3 import InceptionV3
    from tensorflow.keras.applications.inception_v3 import preprocess_input
    model=InceptionV3(weights="imagenet")
    
    from keras.applications.inception_v3 import InceptionV3

#images to be classified
image_paths=["cnn_img/badger.jpg", "cnn_img/rabbit.jpg", "cnn_img/sundial.jpg", "cnn_img/pineapple.jpg", "cnn_img/can.jpg", "cnn_img/accordion.jpg", "cnn_img/old_accordion.jpg", "cnn_img/piano.jpg", "cnn_img/profile.jpg"];
for path in image_paths:
    #loading the image and rescaling it to fit the size for the imagenet architectures
    img=image.load_img(path, target_size=(299, 299))
    x=image.img_to_array(img)
    x=np.expand_dims(x, axis=0)
    x=preprocess_input(x)

    print("Processing image "+path+"...")
    predictions=model.predict(x)
    print("\t", decode_predictions(predictions, top=1)[0][0][1])

Processing image cnn_img/badger.jpg...
	 badger
Processing image cnn_img/rabbit.jpg...
	 wood_rabbit
Processing image cnn_img/sundial.jpg...
	 sundial
Processing image cnn_img/pineapple.jpg...
	 pineapple
Processing image cnn_img/can.jpg...
	 vase
Processing image cnn_img/accordion.jpg...
	 accordion
Processing image cnn_img/old_accordion.jpg...
	 accordion
Processing image cnn_img/piano.jpg...
	 upright
Processing image cnn_img/profile.jpg...
	 accordion


**Tasks**
1. Is there any significant difference between the results of different architectures?

can.jpg is classified as bucket or vase by different architectures

2. Try to classify several other images that you choose on your own. Which cases are problematic?

In pictures where more objects are present, different architectures can classify them in  different order.

### 8.2.2 Creating your own classifier - pincers vs. scissors
Although ImageNet has a lot of classes, sometimes they do not cover some desired cases. Let's assume that we want to tell images with pincers apart from the ones with scissors. Neither pincers nor scissors are among ImageNet classes. Nevertheless, we can still use some parts of the pre-trained models.

Various layers of a deep convolutional network have diferent tasks. The ones closest to the original input image usually look for features such as edges and corners i.e. for low-level features. After them there are layers that look for middle-level features such as circular objects, special curves, etc. Next, there are usually fully connected layers that create high-level semantic features by combining the information from the previous layers. These features are then used by the last layer that performs the actual classification. What we can do here is simply to discard the last layer i.e. not to calculate the class of an image, but to extract the values in on of the fully connected layers. This effectively means that we are going to use the network only as an extractor for high-level features that we would hardly be able to engineer on our own. Let's first see which layers can be found in the VGG-16 network:


In [10]:
from tensorflow.keras.applications.resnet50 import ResNet50
from tensorflow.keras.preprocessing import image
from tensorflow.keras.applications.resnet50 import preprocess_input
from tensorflow.keras.models import Model
import numpy as np

base_model=ResNet50(weights="imagenet")

for layer in base_model.layers:
    print(layer.name)

input_9
conv1_pad
conv1_conv
conv1_bn
conv1_relu
pool1_pad
pool1_pool
conv2_block1_1_conv
conv2_block1_1_bn
conv2_block1_1_relu
conv2_block1_2_conv
conv2_block1_2_bn
conv2_block1_2_relu
conv2_block1_0_conv
conv2_block1_3_conv
conv2_block1_0_bn
conv2_block1_3_bn
conv2_block1_add
conv2_block1_out
conv2_block2_1_conv
conv2_block2_1_bn
conv2_block2_1_relu
conv2_block2_2_conv
conv2_block2_2_bn
conv2_block2_2_relu
conv2_block2_3_conv
conv2_block2_3_bn
conv2_block2_add
conv2_block2_out
conv2_block3_1_conv
conv2_block3_1_bn
conv2_block3_1_relu
conv2_block3_2_conv
conv2_block3_2_bn
conv2_block3_2_relu
conv2_block3_3_conv
conv2_block3_3_bn
conv2_block3_add
conv2_block3_out
conv3_block1_1_conv
conv3_block1_1_bn
conv3_block1_1_relu
conv3_block1_2_conv
conv3_block1_2_bn
conv3_block1_2_relu
conv3_block1_0_conv
conv3_block1_3_conv
conv3_block1_0_bn
conv3_block1_3_bn
conv3_block1_add
conv3_block1_out
conv3_block2_1_conv
conv3_block2_1_bn
conv3_block2_1_relu
conv3_block2_2_conv
conv3_block2_2_bn
conv3_

At the end you can see fc1 and fc2, which stands for fully connected layers. For example We can extract the values of fc2 by using the following code:

In [104]:
#the last layer before the classification layer
model=Model(inputs=base_model.input, outputs=base_model.get_layer(base_model.layers[-2].name).output)

img_path="cnn_img/rabbit.jpg"
img=image.load_img(img_path, target_size=(224, 224))
x=image.img_to_array(img)
x=np.expand_dims(x, axis=0)
x=preprocess_input(x)

features=model.predict(x)
print(features.shape)
feature_layer_size=features.shape[1];

(1, 2048)


These values can now be used as features and that can later be used with another classifier. Let's first extract the features for our pincer and scissors images.

In [96]:
def create_numbered_paths(home_dir, n):
    return [home_dir+str(i)+".jpg" for i in range(n)]

def create_paired_numbered_paths(first_home_dir, second_home_dir, n):
    image_paths=[]
    for p in zip(create_numbered_paths(first_home_dir, n), create_numbered_paths(second_home_dir, n)):
        image_paths.extend(p)
    return image_paths
        
def create_features(paths, verbose=True):
    n=len(paths)
    features=np.zeros((n, feature_layer_size))
    for i in range(n):
        if (verbose==True):
            print("\t%2d / %2d"%(i+1, n))
        img=image.load_img(paths[i], target_size=(224, 224))
        img=image.img_to_array(img)
        img=np.expand_dims(img, axis=0)
        features[i, :]=preprocess_input(model.predict(img))
    
    return features

pincers_dir="cnn_img/pincers/"
scissors_dir="cnn_img/scissors/"

individual_n=50

#combining all image paths
image_paths=create_paired_numbered_paths(pincers_dir, scissors_dir, individual_n)

#marking their classes
image_classes=[]
for i in range(individual_n):
    #0 stands for the pincer image and 0 stands for the scissors image
    image_classes.extend((0, 1))

#number of all images
n=100
#number of training images
n_train=3
#number of test images
n_test=n-n_train

print("Creating training features...")
#here we will store the features of training images
x_train=create_features(image_paths[:n_train])
#train classes
y_train=np.array(image_classes[:n_train])

print("Creating test features...")
#here we will store the features of test images
x_test=create_features(image_paths[n_train:])
#train classes
y_test=np.array(image_classes[n_train:])

Creating training features...
	 1 /  3
	 2 /  3
	 3 /  3
Creating test features...
	 1 / 97
	 2 / 97
	 3 / 97
	 4 / 97
	 5 / 97
	 6 / 97
	 7 / 97
	 8 / 97
	 9 / 97
	10 / 97
	11 / 97
	12 / 97
	13 / 97
	14 / 97
	15 / 97
	16 / 97
	17 / 97
	18 / 97
	19 / 97
	20 / 97
	21 / 97
	22 / 97
	23 / 97
	24 / 97
	25 / 97
	26 / 97
	27 / 97
	28 / 97
	29 / 97
	30 / 97
	31 / 97
	32 / 97
	33 / 97
	34 / 97
	35 / 97
	36 / 97
	37 / 97
	38 / 97
	39 / 97
	40 / 97
	41 / 97
	42 / 97
	43 / 97
	44 / 97
	45 / 97
	46 / 97
	47 / 97
	48 / 97
	49 / 97
	50 / 97
	51 / 97
	52 / 97
	53 / 97
	54 / 97
	55 / 97
	56 / 97
	57 / 97
	58 / 97
	59 / 97
	60 / 97
	61 / 97
	62 / 97
	63 / 97
	64 / 97
	65 / 97
	66 / 97
	67 / 97
	68 / 97
	69 / 97
	70 / 97
	71 / 97
	72 / 97
	73 / 97
	74 / 97
	75 / 97
	76 / 97
	77 / 97
	78 / 97
	79 / 97
	80 / 97
	81 / 97
	82 / 97
	83 / 97
	84 / 97
	85 / 97
	86 / 97
	87 / 97
	88 / 97
	89 / 97
	90 / 97
	91 / 97
	92 / 97
	93 / 97
	94 / 97
	95 / 97
	96 / 97
	97 / 97


Now that for each image we have its features, we will divide the images into a training and a test set. Then we will use a linear SVM classifier to classify them.

In [111]:
from sklearn import svm

def create_svm_classifier(x, y):
    #we will use linear SVM
    C=1
    classifier=svm.SVC(kernel="linear", C=C);
    classifier.fit(x, y)
    return classifier

def calculate_accuracy(classifier, x, y):
    predicted=classifier.predict(x)
    return np.sum(y==predicted)/y.size

#training the model
classifier=create_svm_classifier(x_train, y_train)

#checking the model's accuracy
print("Accuracy: %.2lf%%"%(100*calculate_accuracy(classifier, x_test, y_test)))

Accuracy: 94.00%


**Tasks**

1. How has to be the training set for the accuracy to drop significantly?

50 and 40 -> 100%
30 and 25 -> 98%
20 and 15 -> 97%
 even for 3 examples we get 97%
 for 2 ex -> 54%
 
 
2. Is there any significant gain if more complex SVM models are used?

no


3. What happens if we extract features from another layer, e.g. fc1?

now there are 1000 features

for 50 -> 82% 
for 40 and 30 -> 65%
for 20 -> 76% ???
for 10 -> 61%
for 2  -> 53%

linear performs better then rbf or poly

### 8.2.1 Creating your own classifier - healthy vs. unhealthy food
The previous example was relatively simple because all images were of same size and each of them had a white background, which allowed the extractor to concentrate only on the features of the actual objects. In this example we will use a slightly more complicated case - namely, will will tell images with healthy food apart from the ones with unhealthy food. FIrst let's repeat the same process as we did in the previous example and create the features:

In [106]:
healthy_dir="cnn_img/healthy/"
unhealthy_dir="cnn_img/unhealthy/"

individual_n=100

#combining all image paths
image_paths=create_paired_numbered_paths(healthy_dir, unhealthy_dir, individual_n)

#marking their classes
image_classes=[]
for i in range(individual_n):
    #0 stands for the pincer image and 0 stands for the scissors image
    image_classes.extend((0, 1))

#number of all images
n=200
#number of training images
n_train=100
#number of test images
n_test=n-n_train

print("Creating training features...")
#here we will store the features of training images
x_train=create_features(image_paths[:n_train])
#train classes
y_train=np.array(image_classes[:n_train])

print("Creating test features...")
#here we will store the features of test images
x_test=create_features(image_paths[n_train:])
#train classes
y_test=np.array(image_classes[n_train:])

Creating training features...
	 1 / 100
	 2 / 100
	 3 / 100
	 4 / 100
	 5 / 100
	 6 / 100
	 7 / 100
	 8 / 100
	 9 / 100
	10 / 100
	11 / 100
	12 / 100
	13 / 100
	14 / 100
	15 / 100
	16 / 100
	17 / 100
	18 / 100
	19 / 100
	20 / 100
	21 / 100
	22 / 100
	23 / 100
	24 / 100
	25 / 100
	26 / 100
	27 / 100
	28 / 100
	29 / 100
	30 / 100
	31 / 100
	32 / 100
	33 / 100
	34 / 100
	35 / 100
	36 / 100
	37 / 100
	38 / 100
	39 / 100
	40 / 100
	41 / 100
	42 / 100
	43 / 100
	44 / 100
	45 / 100
	46 / 100
	47 / 100
	48 / 100
	49 / 100
	50 / 100
	51 / 100
	52 / 100
	53 / 100
	54 / 100
	55 / 100
	56 / 100
	57 / 100
	58 / 100
	59 / 100
	60 / 100
	61 / 100
	62 / 100
	63 / 100
	64 / 100
	65 / 100
	66 / 100
	67 / 100
	68 / 100
	69 / 100
	70 / 100
	71 / 100
	72 / 100
	73 / 100
	74 / 100
	75 / 100
	76 / 100
	77 / 100
	78 / 100
	79 / 100
	80 / 100
	81 / 100
	82 / 100
	83 / 100
	84 / 100
	85 / 100
	86 / 100
	87 / 100
	88 / 100
	89 / 100
	90 / 100
	91 / 100
	92 / 100
	93 / 100
	94 / 100
	95 / 100
	96 / 100
	97 / 100


	31 / 100
	32 / 100
	33 / 100
	34 / 100
	35 / 100
	36 / 100
	37 / 100
	38 / 100
	39 / 100
	40 / 100
	41 / 100
	42 / 100
	43 / 100
	44 / 100
	45 / 100
	46 / 100
	47 / 100
	48 / 100
	49 / 100
	50 / 100
	51 / 100
	52 / 100
	53 / 100
	54 / 100
	55 / 100
	56 / 100
	57 / 100
	58 / 100
	59 / 100
	60 / 100
	61 / 100
	62 / 100
	63 / 100
	64 / 100
	65 / 100
	66 / 100
	67 / 100
	68 / 100
	69 / 100
	70 / 100
	71 / 100
	72 / 100
	73 / 100
	74 / 100
	75 / 100
	76 / 100
	77 / 100
	78 / 100
	79 / 100
	80 / 100
	81 / 100
	82 / 100
	83 / 100
	84 / 100
	85 / 100
	86 / 100
	87 / 100
	88 / 100
	89 / 100
	90 / 100
	91 / 100
	92 / 100
	93 / 100
	94 / 100
	95 / 100
	96 / 100
	97 / 100
	98 / 100
	99 / 100
	100 / 100


Now let's train a model and test its accuracy:

In [112]:
classifier=create_svm_classifier(x_train, y_train)
print("Accuracy: %.2lf%%"%(100*calculate_accuracy(classifier, x_test, y_test)))

Accuracy: 94.00%


**Tasks**
1. What is the effect of choosing some other layers for feature extraction?

94% and 65% acc on fc layers
92% (with rbf)

2. Try the whole food classification with another network as feature extractor.

3. What kind of test images are problematic?


there are images in unhealthy category that could be seen as salads (90, 91 eg.) thus healthy. 