# CNN

Computers see differently than we do: their world consists only of numbers. Every image can be represented as a 2-dimensional array of numbers, known as pixels.

To teach an algorithm how to recognize patterns in images, we use a specific type of Artificial Neural Network: a Convolutional Neural Network (CNN).

The name stems from convolution, one of the most important operations in the network.

Convolutional Neural Networks (CNNs or ConvNets) are ordinary neural networks that explicitly assume the inputs are images.

They are used to analyze and classify images, cluster images by similarity, and perform object recognition within a frame. 

- For example, CNNs are used to identify faces, individuals, street signs, tumors, platypuses, and many other aspects of visual data.

CNNs take biological inspiration from the visual cortex, which has small regions of cells that are sensitive to specific regions of the visual field.

There are many well-known architectures based on CNNs, such as AlexNet, VGG, YOLO, and many more.

## How do Convolutional Neural Networks learn?

Images are made up of pixels. In an 8-bit grayscale image, each pixel is represented by a number between 0 and 255; a color image stores one such number per channel (for example red, green, and blue). This digital representation is what lets computers work with images.
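As a minimal sketch of this representation, the snippet below builds a tiny image as a NumPy array. The 4x4 size and the pixel values are made up for illustration, and NumPy is an assumed choice of library.

```python
import numpy as np

# A hypothetical 4x4 grayscale image: one 8-bit value (0-255) per pixel.
image = np.array(
    [
        [0, 0, 255, 255],
        [0, 0, 255, 255],
        [0, 0, 255, 255],
        [0, 0, 255, 255],
    ],
    dtype=np.uint8,
)
print(image.shape)  # (4, 4): height x width

# A color image adds a third axis for the channels (here the same values
# are simply repeated for red, green, and blue).
rgb_image = np.stack([image, image, image], axis=-1)
print(rgb_image.shape)  # (4, 4, 3): height x width x channels

# Networks usually work with floats, so pixel values are often scaled to [0, 1].
normalized = image.astype(np.float32) / 255.0
```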

There are 5 major operations in CNN image detection/classification.

- Convolution
- Activation map
- Max pooling
- Flattening
- Fully connected layer

The working is simple: 

- The pixels from the image are fed to the convolutional layer, which performs the convolution operation and produces a convolved map
- The convolved map is passed through a ReLU function (activation function) to generate a rectified feature map
- The image is processed with multiple convolution and ReLU layers to locate its features
- Pooling layers downsample the rectified feature maps, keeping the strongest responses for each part of the image
- The pooled feature map is flattened and fed to a fully connected layer to get the final output

## Convolution

Mathematically, a convolution is an integral that combines two functions and shows how one function modifies the other.

The convolution operation works on two signals in 1D and on two images (for example, an image and a filter) in 2D.
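For reference, the standard textbook definitions can be written as below. The symbols $f$, $g$, $I$ (image), $K$ (kernel), and $S$ (output) are notation chosen here for illustration.

```latex
% 1D, continuous: convolution of two signals f and g
(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau

% 2D, discrete: convolution of an image I with a kernel K
(I * K)(i, j) = \sum_{m} \sum_{n} I(i - m,\, j - n)\, K(m, n)

% What most deep learning libraries actually compute (cross-correlation):
% the same sum, but without flipping the kernel
S(i, j) = \sum_{m} \sum_{n} I(i + m,\, j + n)\, K(m, n)
```

Because the kernel weights are learned, the flipped and unflipped forms are interchangeable in practice: the network simply learns the flipped weights.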

The main purpose of a convolutional layer is to detect visual features in images, such as edges, lines, and color changes.

- A very useful property of a convolutional layer is that, once it has learned a feature at a specific point in the image, it can recognize that feature in any other part of the image.

- CNNs make use of filters (also known as kernels or feature detectors) to detect which features, such as edges, are present throughout an image.

- A filter is just a matrix of values, called weights, that are trained to detect specific features. The filter moves over each part of the image to check if the feature it is meant to detect is present. 

- To provide a value representing how confident it is that a specific feature is present, the filter carries out a convolution operation, which is an element-wise product and sum between two matrices.

- When the feature is present in part of an image, the convolution operation between the filter and that part of the image results in a real number with a high value. If the feature is not present, the resulting value is low.

- Each filter detects one kind of feature, so we need N filters to detect N different curves or edges in the image.
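The sliding "element-wise product and sum" described above can be sketched in a few lines of NumPy. This is a minimal illustration, not how a real library implements it: the function name `convolve2d`, the 5x5 image, and the hand-picked vertical-edge filter are all assumptions, and in a real CNN the filter weights are learned during training. Like most CNN libraries, the sketch does not flip the kernel.

```python
import numpy as np


def convolve2d(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding) and return the feature map."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            # Element-wise product between the patch and the filter, then a sum.
            feature_map[i, j] = np.sum(patch * kernel)
    return feature_map


# Hypothetical 5x5 image: dark (0) on the left, bright (1) on the right.
image = np.array(
    [
        [0, 0, 0, 1, 1],
        [0, 0, 0, 1, 1],
        [0, 0, 0, 1, 1],
        [0, 0, 0, 1, 1],
        [0, 0, 0, 1, 1],
    ],
    dtype=float,
)

# Hand-picked filter that responds to dark-to-bright vertical edges.
kernel = np.array(
    [
        [-1, 0, 1],
        [-1, 0, 1],
        [-1, 0, 1],
    ],
    dtype=float,
)

feature_map = convolve2d(image, kernel)
print(feature_map)
# [[0. 3. 3.]
#  [0. 3. 3.]
#  [0. 3. 3.]]
```

The output is 0 where the window sees only the flat dark region and 3 where the window overlaps the edge, which is the "high value when the feature is present" behavior described above.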

Padding refers to the number of pixels added around the border of an image when it is processed by the kernel of a CNN.

One tricky issue when applying convolutional layers is that we tend to lose pixels on the perimeter of our image. Since we typically use small kernels, for any given convolution, we might only lose a few pixels, but this can add up as we apply many successive convolutional layers.

- One solution is to pad the image with zeros (zero-padding), which gives the kernel more room to cover the image. Because border pixels then take part in more convolution windows, information at the edges of the image is preserved and the output can keep the same size as the input.

- Extra zeros are added at the perimeter of the input image so that all the features are captured.
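A small sketch of zero-padding and its effect on output size follows. The helper `conv_output_size` is a hypothetical name introduced here; it applies the usual formula `(n + 2p - k) // s + 1` for input size `n`, kernel size `k`, padding `p`, and stride `s`. The 5x5 input is made up for illustration.

```python
import numpy as np


def conv_output_size(n, k, p=0, s=1):
    """Output length along one axis for input n, kernel k, padding p, stride s."""
    return (n + 2 * p - k) // s + 1


# Hypothetical 5x5 input with values 1..25.
image = np.arange(1, 26, dtype=float).reshape(5, 5)

# Add a one-pixel border of zeros on every side.
padded = np.pad(image, pad_width=1, mode="constant", constant_values=0)
print(image.shape)   # (5, 5)
print(padded.shape)  # (7, 7)

# A 3x3 kernel shrinks an unpadded 5x5 input to 3x3 ...
print(conv_output_size(5, 3))       # 3
# ... while one pixel of zero-padding keeps the output at 5x5.
print(conv_output_size(5, 3, p=1))  # 5
```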

## Activation Map

The feature maps must be passed through a non-linear mapping. 

A bias term is added to each feature map, and the result is passed through a non-linear activation function, typically ReLU.

The purpose of the activation function is to introduce non-linearity into our network. Images are highly non-linear, because they are made of different objects that are not linearly related to each other, and without a non-linear activation a stack of convolutional layers would collapse into a single linear operation.
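As a minimal sketch, ReLU simply replaces negative values with zero. The feature map values and the bias below are made-up numbers for illustration.

```python
import numpy as np


def relu(x):
    """Rectified Linear Unit: keep positive values, replace negatives with 0."""
    return np.maximum(x, 0.0)


# Hypothetical output of a convolution, plus a hypothetical bias term.
feature_map = np.array(
    [
        [-3.0, 0.0, 2.0],
        [1.5, -0.5, 4.0],
    ]
)
bias = 0.5

activation_map = relu(feature_map + bias)
print(activation_map)
# [[0.  0.5 2.5]
#  [2.  0.  4.5]]
```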

## Max Pooling

After ReLU comes a pooling step, in which the CNN downsamples the rectified feature map, reducing its size and saving processing time.

This helps reduce overfitting, which can occur if the CNN is given too much information, especially information that is not relevant to classifying the image.

There are different types of pooling, for example max pooling, average pooling, and min pooling.

In max pooling, a window passes over an image according to a set stride value. At each step, the maximum value within the window is pooled into an output matrix, hence the name max pooling. These values then form a new matrix called a pooled feature map.

- An added benefit of max pooling is that it forces the network to focus on a few neurons instead of all of them, which has a regularizing effect: the network is less likely to overfit the training data and more likely to generalize well.
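The windowed maximum described above can be sketched as follows. The function name `max_pool2d`, the 2x2 window, the stride of 2, and the input values are assumptions chosen for illustration.

```python
import numpy as np


def max_pool2d(feature_map, size=2, stride=2):
    """Slide a size x size window with the given stride and keep each window's maximum."""
    out_h = (feature_map.shape[0] - size) // stride + 1
    out_w = (feature_map.shape[1] - size) // stride + 1
    pooled = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            pooled[i, j] = feature_map[r:r + size, c:c + size].max()
    return pooled


# Hypothetical 4x4 rectified feature map.
feature_map = np.array(
    [
        [1, 3, 2, 0],
        [4, 6, 1, 1],
        [0, 2, 9, 5],
        [1, 0, 3, 7],
    ],
    dtype=float,
)

pooled = max_pool2d(feature_map)
print(pooled)
# [[6. 2.]
#  [2. 9.]]
```

Each 2x2 block of the input is reduced to its largest value, so the 4x4 map becomes a 2x2 pooled feature map.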


## Flattening

After multiple convolution layers and downsampling operations, the 3D representation of the image (i.e. the pooled feature map) is converted into a 1D feature vector that is passed into a multi-layer perceptron to output probabilities.

The rows are concatenated to form a long feature vector. 

If there are multiple pooled feature maps, their rows are also concatenated to form an even longer feature vector.
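A minimal sketch of flattening with NumPy follows. The two 2x2 pooled maps and the (channels, height, width) axis order are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical stack of 2 pooled feature maps, each 2x2,
# stored as (channels, height, width).
pooled = np.array(
    [
        [[6, 2], [2, 9]],
        [[1, 0], [4, 3]],
    ],
    dtype=float,
)
print(pooled.shape)  # (2, 2, 2)

# Rows are concatenated map by map into a single 1D feature vector.
feature_vector = pooled.reshape(-1)
print(feature_vector)        # [6. 2. 2. 9. 1. 0. 4. 3.]
print(feature_vector.shape)  # (8,)
```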

## Fully Connected Layer

In this step, the flattened feature map is passed through a neural network. 

This step is made up of the input layer, the fully connected layer, and the output layer. 

- The fully connected layer plays the role of the hidden layer in an ordinary ANN: every neuron is connected to every neuron in the previous layer.

- The output layer is where we get the predicted classes. 
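The forward pass through these layers can be sketched as two matrix multiplications. Everything specific here is an assumption for illustration: the layer sizes, the random (untrained) weights `W1` and `W2`, and the use of ReLU in the hidden layer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Input layer: a hypothetical flattened feature vector.
feature_vector = np.array([6.0, 2.0, 2.0, 9.0])

# Hypothetical sizes: 3 hidden neurons, 2 output classes.
n_hidden, n_classes = 3, 2

# Random weights stand in for trained ones; biases start at zero.
W1 = rng.normal(size=(n_hidden, feature_vector.size))
b1 = np.zeros(n_hidden)
W2 = rng.normal(size=(n_classes, n_hidden))
b2 = np.zeros(n_classes)

# Fully connected (hidden) layer: every input feeds every hidden neuron.
hidden = np.maximum(W1 @ feature_vector + b1, 0.0)

# Output layer: one raw score per class.
scores = W2 @ hidden + b2
print(scores.shape)  # (2,)
```

During training, the weights and biases are the values that backpropagation adjusts.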

The information is passed through the network and the error of prediction is calculated. The error is then backpropagated through the system to improve the prediction.

![CNN](./images/cnn.webp)

The raw outputs produced by the final dense layer don’t usually add up to one. To be read as the probability of each class, they must be mapped to numbers between zero and one.

The output of this dense layer is therefore passed through the Softmax activation function, which maps all the final dense layer outputs to a vector whose elements sum up to one.
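A minimal sketch of Softmax in NumPy follows. The three raw scores are made-up values, and subtracting the maximum before exponentiating is a common numerical-stability trick that does not change the result.

```python
import numpy as np


def softmax(scores):
    """Map raw scores to positive values that sum to one."""
    shifted = scores - np.max(scores)  # avoids overflow in exp; result is unchanged
    exp = np.exp(shifted)
    return exp / exp.sum()


# Hypothetical raw outputs of the final dense layer for 3 classes.
scores = np.array([2.0, 1.0, 0.1])

probs = softmax(scores)
print(probs)             # one probability per class
print(np.argmax(probs))  # 0: the class with the highest raw score
```

The predicted class is the one with the highest probability, which is always the one with the highest raw score.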