# 2.1 Introduction

Hi everyone, I hope you are doing well. In the previous chapter we learned about multimodal learning and how to implement a basic architecture for processing information retrieved from different sources. In this chapter we will talk about processing visual information using convolutional neural networks (CNNs).

What we will learn in this chapter:

1. What feature extraction means in image processing.
2. How convolutional neural networks can be used as feature extractors.
3. The VGG network as a visual feature extractor.
4. How to use pretrained VGG-Net weights for feature extraction.

So, are you ready? Let's start by learning about features.

# 2.2 Feature Extraction

Features, or more specifically image features, are the patterns and structures present in an image that can be used to describe it. For example, if you need to describe an image containing a rectangular object, you can say that the object has 4 sides and that adjacent sides are perpendicular to each other. Here our features are a) 4 sides and b) adjacent sides that are perpendicular to each other. This is a basic explanation of what features are. When we work with real-world images, features become more complex. For example, suppose you need to find the image that contains a yellow car among a thousand other images. What will you do? Well, you will start by looking for a car's features in all the images. Which features? You will look for objects that have 4 wheels, 4 doors, 2 windshields and a yellow color.

Now you may ask, "Why do we need to learn about feature extraction?" The answer is quite simple. You want to generate an image description based on the information present in the image, and this requires feature extraction. Just as you need to identify a car's features to recognize an image of a car, our neural network must know which objects are present in the image. Simple!

## 2.2.1 Feature Engineering

So our task starts with feature engineering. Here, feature engineering means representing the images of cars (we will continue with our example) in a meaningful way, so that these features let us differentiate images of cars from images of other objects. One could think of using raw pixel intensities for the task. But keep in mind that there may be huge variation in the colour, lighting conditions and structure of different cars, so it would be quite difficult to separate cars from other objects using raw pixels alone. For such tasks feature engineering is useful.

We can apply different kinds of algorithms to quantify the colour of images (or objects). For structural analysis of objects we can apply various kinds of image filters, which give us different information about the objects. These features may be based on shape properties, such as the number of edges and corners of the object, or on texture properties, such as the uniformity of intensity levels or the local contrast computed using weighted averages.
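To make the idea of quantifying colour concrete, here is a minimal sketch of one possible hand-crafted colour feature: a coarse histogram per colour channel. The function name `color_histogram`, the choice of 4 bins and the toy "yellow" image are assumptions made for illustration only; they are not part of this chapter's pipeline.

```python
import numpy as np

def color_histogram(image, bins=4):
    """Return a normalized per-channel histogram of an RGB image (H, W, 3)."""
    features = []
    for channel in range(3):
        # Count how many pixel values of this channel fall into each bin
        counts, _ = np.histogram(image[:, :, channel], bins=bins, range=(0, 256))
        features.append(counts)
    features = np.concatenate(features).astype(float)
    # Normalize so that images of different sizes are comparable
    return features / features.sum()

# Toy 2x2 "yellow" image: high red and green, low blue (hypothetical data)
yellow = np.zeros((2, 2, 3), dtype=np.uint8)
yellow[:, :, 0] = 250  # red channel
yellow[:, :, 1] = 240  # green channel
yellow[:, :, 2] = 10   # blue channel

hist = color_histogram(yellow)
print(hist.round(2))
```

The resulting vector has 12 entries (4 bins for each of the 3 channels). For a yellow object, the mass sits in the brightest red and green bins and in the darkest blue bin, which is exactly the kind of hand-designed description feature engineering relies on.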

Now, the feature extraction task always comes with a challenge: *feature tuning*. To apply most image filters you need to tune them so that they give the best feature values for your data set. This tuning can change from one data set to another. So if you have tuned your filters for extracting edge information on one data set, there is no guarantee that the same filters will also work on other data sets. So what should we do? One solution is to apply filters covering all the variations that could be present in the images. For example, if we need to extract shape information from our data set, we could use the following kinds of structural filters with different orientations and sizes.


<img src='./1_haar_features.png' width=500>

But these features have their own limitations. For one, they are not rotation or scale invariant. For example, if we want to detect faces in an image that contains faces of various sizes, we need to repeat the operation at multiple scales (different image sizes). If the faces are not upright, there is less chance of detection. Another problem is redundant features. As we are applying so many filters without knowing their significance for our data, there is a strong chance of redundancy.

So what will you do now? Well, it's time for me to reveal the secret. We will use a convolutional neural network to do this feature extraction for us! *Why a CNN for this?* Let's answer that question.
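The lack of rotation invariance can be seen in a tiny sketch. The 3×3 filter and the patches below are assumed toy values, not filters used elsewhere in this chapter: a filter tuned for a vertical edge responds strongly to that edge, but gives no response once the same patch is rotated by 90 degrees.

```python
import numpy as np

# Hand-tuned filter that responds to a vertical edge (dark left, bright right)
vertical_edge_filter = np.array([[-1, 0, 1],
                                 [-1, 0, 1],
                                 [-1, 0, 1]], dtype=float)

# A patch containing a vertical edge, and the same patch rotated by 90 degrees
patch = np.array([[0, 0, 1],
                  [0, 0, 1],
                  [0, 0, 1]], dtype=float)
rotated_patch = np.rot90(patch)

# Filter response = sum of element-wise products
response = np.sum(vertical_edge_filter * patch)
rotated_response = np.sum(vertical_edge_filter * rotated_patch)
print(response, rotated_response)
```

To catch the rotated edge we would need a second, differently oriented filter, which is why hand-built filter banks grow large and redundant so quickly.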

# 2.3 CNN for Feature extraction

CNNs are neural network architectures designed mainly to solve computer vision problems, where 2D data must be processed. One of the most important properties of a CNN is that it uses spatial context to reach its prediction. This helps it retain positional information, which is an important aspect of images. Let's understand this with an example. Suppose you need to decide whether an image contains a human face with the help of a simple feed-forward neural network. For this we first need to convert the 2D image data into a 1D array, as follows.

<img src='2_1D_Face_Detection.png' width=400>

Now when we pass this array to the network, it will process the information without any contextual knowledge. This means it will not establish any relationship between different parts of the face, such as that the eyes must be positioned above the nose and lips, or that the hair must be positioned above the eyes. This will cost us some detection accuracy.

We can solve this problem by using multiple sub-networks, each of which processes only a specific part of the image; to get the final prediction we merge the predictions from all of the sub-networks. A possible architecture could look like this.
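A small sketch shows what flattening does to neighbouring pixels. The image size of 224×224 and the helper name `flat_index` are assumptions for illustration; the point holds for any row-by-row flattening.

```python
# Assumed toy image size; any width and height would do
height, width = 224, 224

def flat_index(row, col, width):
    """Position of pixel (row, col) after row-by-row (row-major) flattening."""
    return row * width + col

pixel = flat_index(10, 10, width)
right_neighbour = flat_index(10, 11, width)
lower_neighbour = flat_index(11, 10, width)

# Horizontal neighbours stay next to each other in the 1D array
print(right_neighbour - pixel)
# Vertical neighbours end up a full image row apart
print(lower_neighbour - pixel)
```

Two pixels that touch vertically in the image are `width` positions apart in the 1D array, and nothing in a plain feed-forward network tells it that they were ever neighbours.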

<img src='3_2D_Face_Detection.png' width=600>

So how can a CNN help us achieve this? A CNN works in a hierarchical manner, where each layer is responsible for learning a different kind of feature. For example, in our face detection problem some parts of the network will learn features of the eyes, some will learn different variations of the lips, and others may learn variations in the structure of the nose. At the end, all this information is passed to the final layer, which combines it and predicts a probability for each class.

The following figure gives an intuition of how a CNN works.

<img src='4_FaceDetectionNvidia.png' width=600>

Source: https://devblogs.nvidia.com/accelerate-machine-learning-cudnn-deep-neural-network-library/

As you can see, each layer of the network learns a different kind of feature: the initial layers may learn very general features such as edges and corners, the following layers learn more complex structures such as various shapes, and the later layers combine these features to generate more abstract features such as a whole face. So what is the secret behind this? How does this happen?

Well, a convolutional neural network has three main concepts which make it well suited for computer vision applications.

- Local receptive fields
- Shared weights and biases
- Pooling layers

Let's look at each of these briefly.

## 2.3.1 Local Receptive Fields

Local receptive fields help the network use the local information of regions in the image. An image is nothing but a group of regions, and a region is nothing but a group of similar pixels, so a CNN can learn local, region-based information quite well. We connect the input pixels to a layer of hidden neurons; the weights on these connections are what we call filters. But here we do not connect every image pixel to every hidden neuron. Instead, we keep these connections within a small area only.

So each neuron in the hidden layer is connected to a small region (a local field) of the input pixels. For example, let's consider a 5×5 region, which corresponds to 25 input pixels. For a specific hidden neuron, the connections may look like this:

<img src='5_LocalReceptive.png' width=400>

The region shown in blue is the receptive field, and the weights applied over it are known as a **convolutional filter**. We place this receptive field at each position in the image and perform a sum of products using specific coefficients (weights). These coefficients are learned during training and are responsible for extracting specific features from different locations in the image. We will learn more about these weights in the next section.

## 2.3.2 Shared Weights and Biases

The concept of shared weights and biases follows from the operation discussed above. As we have seen, a single filter slides over the entire image to generate its output. This is known as weight and bias sharing. Unlike in a plain feed-forward network, where every connection has its own weight, here every hidden neuron of a feature map uses the same filter weights and the same bias.

What does this mean in practice? Well, suppose a filter of the first layer detects a horizontal edge at one place. Then the same filter can compute the same edge feature at every other location in the image. Reusing filters this way leads to a computationally efficient solution, because far fewer parameters have to be learned. Weight sharing also helps convolutional neural networks cope with translation: the same feature is detected wherever it appears, and together with pooling this makes the network largely insensitive to position. So a face could be anywhere in the image and your classifier can still classify it as a face image.

Similarly, our network tries to detect other kinds of features at different locations in the image; each filter produces its own output, which we call a feature map. To solve an image classification problem you need many convolutional filters at each hidden layer, detecting more abstract features at each level (layer) of the network. Let's have a sneak peek at some filters.
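A quick back-of-the-envelope count shows how much weight sharing saves. The 28×28 input size is an assumption chosen for illustration; the 5×5 receptive field is the one used in the example above.

```python
# Assumed toy setup: a 28x28 input image and a 5x5 receptive field
image_size = 28
field_size = 5

# Number of positions the 5x5 field can take along one axis (no padding)
positions = image_size - field_size + 1
hidden_neurons = positions * positions

# Without sharing: every hidden neuron owns 25 weights and 1 bias
unshared_params = hidden_neurons * (field_size * field_size + 1)

# With sharing: one 5x5 filter and one bias for the whole feature map
shared_params = field_size * field_size + 1

print(hidden_neurons, unshared_params, shared_params)
```

Under these assumptions the feature map has 576 hidden neurons. Giving each its own weights would cost 14976 parameters, while the shared filter needs only 26.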

<img src='6_EdgeFilters.jpg' width=400>

The figure above shows the filters from the first layer of a trained LeNet architecture. As you can clearly see, these 25 filters look like edge detectors, which can detect edges of different kinds (orientations) in the input image.

The whole operation described above is known as convolution; this is how these networks got the name convolutional neural networks, and why the filters are known as convolution filters.

We can summarize the convolution operation with a 5×5 filter in the following equation.

\begin{align}
\sigma\left(b + \sum_{l=0}^{4} \sum_{m=0}^{4} w_{l,m} \, a_{j+l,k+m}\right)
\end{align}

Here $w_{l,m}$ is an element of the weight matrix (remember, these are the 2D filters), $a_{j+l,k+m}$ is the input pixel at position $(j+l, k+m)$, $b$ is the shared bias and $\sigma$ is the activation function. The result is the output of the hidden neuron at position $(j, k)$.

Now let's take a look at the pooling operation.
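The equation above can be sketched in a few lines of NumPy. This is only an illustrative implementation under assumptions: the input is a random 28×28 array, the weights are random instead of learned, and $\sigma$ is taken to be the sigmoid function. The names `feature_map` and `sigmoid` are made up for this sketch.

```python
import numpy as np

def sigmoid(z):
    """Activation function sigma (assumed to be the sigmoid here)."""
    return 1.0 / (1.0 + np.exp(-z))

def feature_map(a, w, b):
    """Apply the equation at every valid position (j, k) of the input `a`."""
    size = w.shape[0]
    out_h = a.shape[0] - size + 1
    out_w = a.shape[1] - size + 1
    out = np.empty((out_h, out_w))
    for j in range(out_h):
        for k in range(out_w):
            # b + sum_l sum_m w[l, m] * a[j + l, k + m]
            z = b + np.sum(w * a[j:j + size, k:k + size])
            out[j, k] = sigmoid(z)
    return out

rng = np.random.default_rng(0)
a = rng.random((28, 28))         # toy input "image" (assumed size)
w = rng.standard_normal((5, 5))  # one shared 5x5 weight matrix
b = 0.1                          # one shared bias

fmap = feature_map(a, w, b)
print(fmap.shape)
```

Note that the same `w` and `b` are used at every position, which is the weight sharing described in this section. Deep learning libraries compute the same thing far more efficiently; the explicit loops are here only to mirror the indices of the equation.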

## 2.3.3 Pooling Layers

Like receptive fields and shared weights, pooling is an important aspect of the CNN architecture. It helps reduce noisy activations for the subsequent layers, and different kinds of pooling operations can be selected for this. A pooling layer compresses the dense activations and reduces the size of its input. Unlike a strided convolution, pooling has no learnable parameters, so it does not have a weight matrix. It selects or aggregates activations from fixed regions of the previous layer and generates the compressed output. The following figure will give you a better understanding of the pooling operation.

<img src='7_pooling.png' width=400>

The figure above shows a 2×2 max pooling operation on a learned feature map. As you can see, the pooling layer moves a 2×2 pixel window over the input feature map in non-overlapping steps, selects the maximum value of each window and puts it on the output map. You can also use the average (or, less commonly, the minimum) of the window as the output value. So when you apply 2×2 max pooling to a 14×14 feature map, the output will be 7×7 pixels. By selecting the maximum value of each window, max pooling keeps only the strongest activations, which are the most likely to mark region boundaries.

Pooling also helps the network learn more abstract features. Because down-sampling removes finer details from the feature maps, the following layers see a coarser view and can describe more general object properties.

As you can see, with the help of local receptive fields, shared weights and pooling, a CNN can learn complex features from images. These features are very useful for identifying the different objects present in an image. We will use them to build the representations from which we generate descriptions for images.
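Here is a minimal NumPy sketch of 2×2 max pooling on a 14×14 feature map. The helper name `max_pool_2x2` and the toy input are assumptions, and the sketch assumes that the height and width are even.

```python
import numpy as np

def max_pool_2x2(feature_map):
    """Non-overlapping 2x2 max pooling; assumes even height and width."""
    h, w = feature_map.shape
    # Split the map into (h/2) x (w/2) blocks of 2x2 pixels each
    blocks = feature_map.reshape(h // 2, 2, w // 2, 2)
    # Keep the largest value inside every block
    return blocks.max(axis=(1, 3))

# Toy 14x14 feature map filled with the values 0..195
fmap = np.arange(14 * 14, dtype=float).reshape(14, 14)
pooled = max_pool_2x2(fmap)
print(pooled.shape)
```

Replacing `max` with `mean` in the last line of the function would give average pooling. Either way, no weights are involved, so there is nothing to learn in this layer.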

# 2.4 VGG-Net as Visual feature extractor

We can get more robust features of an input image using a convolutional neural network (CNN), because it learns features hierarchically. The initial layers of a CNN learn basic features like edges and corners, while the deeper layers learn more abstract features like object shapes and patterns, which give the aggregator network more precise information.

We will use a CNN developed by Oxford's Visual Geometry Group; these networks are also known as VGG-nets. Different flavours of VGG-nets are available with different numbers of layers. We will use VGG-16, which has 13 convolutional layers and 3 fully connected layers, hence the name. The original paper also describes shallower variants with 11 and 13 weight layers and a deeper one, VGG-19. In the ImageNet challenge (ILSVRC) of 2014, these networks won the localization task and took second place in the classification task.

The following is the network architecture of VGG-16:

<img src='8_VGG_16.png' width=400>

As you can see in the architecture above, all convolutional kernels are 3×3, but the number of filters differs from layer to layer. You can see max-pool layers in between the groups of convolutional layers, and all hidden layers use the rectified linear unit (ReLU) as activation function. At the end there are 3 dense layers: two with 4096 units each, and a final layer with a softmax classifier. For an in-depth understanding of this architecture, please refer to the paper written by the creators of the network: https://arxiv.org/abs/1409.1556. I highly encourage you to go through the paper; it gives a very detailed analysis of the building blocks of the network, which will eventually help you create your own versions of it.

One final note on the feature extractor: you are not bound to use VGG-16. You can try out other deep learning architectures, such as the different flavours of residual networks (ResNet) or Google's Inception.

We have discussed enough theory to build our concepts, so it's time to jump right into the practical part. We will write code that uses the VGG architecture for the feature extraction task. Let's start with our implementation. The following code shows how to load the pretrained VGG-Net weights and print the actual architecture.

In [2]:
# For using the VGG architecture, Keras will be used
from keras.applications.vgg16 import VGG16

# Model lets us rebuild a network from chosen input and output tensors
from keras.models import Model

# Load the VGG-16 model with its pretrained weights
model = VGG16()

# Wrap the full network in a new Model (all layers are kept here)
model = Model(inputs=model.inputs, outputs=model.layers[-1].output)

# Let's see how the model looks; summary() prints the table itself
model.summary()


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_2 (InputLayer)         (None, 3, 224, 224)       0         
_________________________________________________________________
block1_conv1 (Conv2D)        (None, 64, 224, 224)      1792      
_________________________________________________________________
block1_conv2 (Conv2D)        (None, 64, 224, 224)      36928     
_________________________________________________________________
block1_pool (MaxPooling2D)   (None, 64, 112, 112)      0         
_________________________________________________________________
block2_conv1 (Conv2D)        (None, 128, 112, 112)     73856     
_________________________________________________________________
block2_conv2 (Conv2D)        (None, 128, 112, 112)     147584    
_________________________________________________________________
block2_pool (MaxPooling2D)   (None, 128, 56, 56)       0         
__________

When you run the code snippet above for the first time, Keras will download the network weights from the server and store them on your local system. The weight file is around 550 MB in size, so the download time depends on your network speed. Once the file has been downloaded to disk, Keras will load the weights from the local directory the next time.
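The parameter counts in the summary printed above can be checked by hand: a convolutional layer has one 3×3 weight matrix per input channel for every filter, plus one bias per filter. The helper name `conv2d_params` is made up for this sketch.

```python
def conv2d_params(kernel_size, in_channels, out_channels):
    """Weights plus one bias per filter for a square convolution kernel."""
    weights = kernel_size * kernel_size * in_channels * out_channels
    biases = out_channels
    return weights + biases

# (layer name, input channels, number of filters) as listed in the summary
layers = [
    ("block1_conv1", 3, 64),
    ("block1_conv2", 64, 64),
    ("block2_conv1", 64, 128),
    ("block2_conv2", 128, 128),
]

counts = {name: conv2d_params(3, c_in, c_out) for name, c_in, c_out in layers}
for name, count in counts.items():
    print(name, count)
```

The results match the `Param #` column of the summary (1792, 36928, 73856 and 147584), while the pooling layers contribute 0 parameters.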


# 2.5 Feature Extraction using pretrained VGG

As you can see in the network architecture, we have 13 convolutional layers with rectified linear units (ReLU) as activation function, 5 max-pooling layers, 1 flatten layer, 2 dense layers and, last of all, the softmax layer.

Now, if you pay attention to the first layer, which is the input layer, it takes a color image of size 224×224 as input. In a practical scenario we will not always get images of the same dimensions. To solve this problem we need to preprocess each input image so that it has the dimensions 224×224. We can do this with the help of Keras' image preprocessing utilities.

Now let's talk about the softmax layer. Softmax turns the scores of the final layer into class probabilities, which are used to generate class predictions. As we are more interested in feature extraction than in generating predictions from our network, we need to remove this layer so that the network outputs a feature vector for our images.
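To see what the softmax layer computes, here is a small pure-Python sketch. The three input scores are made-up values; in VGG-16 the layer produces one probability for each class it was trained on.

```python
import math

def softmax(scores):
    """Turn a list of raw scores into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() numerically stable
    # and does not change the result
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for three classes
probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])
```

The output only tells us which class is most likely. For describing an image we want the richer 4096-dimensional representation that feeds this layer, which is why the softmax layer is dropped in the next snippet.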

Let's see how we can extract features from an input image.

**Note: Before running the following snippet please restart the kernel**

In [1]:
# For using the VGG architecture, Keras will be used
from keras.applications.vgg16 import VGG16
from keras.applications.vgg16 import preprocess_input

# Model lets us rebuild a network from chosen input and output tensors
from keras.models import Model

# We will use Keras for various image processing tasks
from keras.preprocessing.image import img_to_array
from keras.preprocessing.image import load_img

# Load the VGG-16 model from keras
model = VGG16()

# Drop the final softmax layer by taking the output of the layer before it
# (the second 4096-unit dense layer), so the model returns features
model = Model(inputs=model.inputs, outputs=model.layers[-2].output)

# Load an image from file and resize it to 224x224
filename = '9_test_image.png'
image = load_img(filename, target_size=(224, 224))
print("After resize image Width %d and Height %d" % image.size)

# Convert the image pixels to a numpy array
image = img_to_array(image)

# Add a leading batch dimension, as the model expects a batch of images
image = image.reshape((1, image.shape[0], image.shape[1],
                        image.shape[2]))

# Now we will prepare the image for the VGG model;
# this pre-processing converts the image into the form
# the pretrained VGG network expects.
image = preprocess_input(image)

# Now it's time to extract the features
feature = model.predict(image, verbose=0)
print("Feature Vector dimension: ",feature.shape)

Using TensorFlow backend.


After resize image Width 224 and Height 224
Feature Vector dimension:  (1, 4096)


Voila! As you can see, we get a 4096-dimensional feature vector from our pretrained VGG architecture. We will use the same process to generate features for all of the images in our data set. Then we will combine this information with a recurrent neural network to generate descriptions for our images.

# Summary

So this is how we can use a CNN for visual feature extraction. In this chapter we learned what feature extraction is in image processing. Then we learned why CNNs are a good choice for feature extraction. And at the end we used the VGG network in practice to extract features from a test image. I think that is it for visual feature extraction. In the next chapter we will learn about recurrent neural networks for textual information processing. Till then, happy learning!!

# Quiz


## Q.1 Which neural network is best for image analysis?

- a) Recurrent Neural Networks.
- b) Convolutional Neural Networks.
- c) Feed forward neural networks.
- d) None of the above.

Ans: B) Convolutional neural networks are the best-suited algorithm for image analysis. They can tune their filters based on the data set.

## Q.2 What makes CNN best for image analysis?

- a) Local receptive fields.
- b) Shared weights and biases.
- c) Pooling layers.
- d) All of the above.

Ans: D) All of these together make a convolutional neural network.

## Q.3 How many convolution layers are present in the VGG-16 architecture?

- a) 16
- b) 13
- c) 12
- d) None of the above

Ans: B) There are 13 convolution layers in VGG-16; the other 3 are dense layers.