A convolutional layer is a fundamental building block of Convolutional Neural Networks (CNNs). It performs a mathematical operation called convolution, which combines two sets of information: the input data (such as an image) and a filter or kernel. The primary purpose of this layer is to extract features from the input data by detecting patterns, such as edges, textures, and shapes.



![Neural%20network%20with%20many%20convolutional%20layers.webp](attachment:Neural%20network%20with%20many%20convolutional%20layers.webp)

## Why is it Used?

**1. Feature Extraction:**
* Convolutional layers automatically learn to detect features from the raw input data. As the network depth increases, the layers capture more complex patterns and higher-level features.

**2. Spatial Hierarchies:**
* They enable the network to understand spatial hierarchies by combining local information to form more abstract representations. Lower layers might detect edges and simple textures, while higher layers detect more complex patterns like object parts and shapes.

**3. Parameter Efficiency:**
* Convolutional layers use shared weights (the same filter slides across different parts of the input), significantly reducing the number of parameters compared to fully connected layers. This makes the model more efficient and easier to train.
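
To make the saving concrete, here is a rough parameter count. The function names are hypothetical, and the layer sizes (a 32x32x3 input, six 5x5x3 filters, and a dense layer producing the same number of outputs) are assumptions chosen only for illustration.

```python
def conv_param_count(kernel_h, kernel_w, in_channels, num_filters):
    """Weights plus one bias per filter; independent of the image size."""
    return num_filters * (kernel_h * kernel_w * in_channels + 1)


def dense_param_count(num_inputs, num_outputs):
    """One weight per input-output pair, plus one bias per output."""
    return num_inputs * num_outputs + num_outputs


# Assumed sizes: 32x32x3 input, six 5x5x3 filters, 28x28x6 output values.
conv_params = conv_param_count(5, 5, 3, 6)
dense_params = dense_param_count(32 * 32 * 3, 28 * 28 * 6)

print(conv_params)   # 456
print(dense_params)  # 14455392
```

Under these assumptions, the convolutional layer needs a few hundred parameters where a fully connected layer with the same output size would need millions.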

## How Convolutional Layers Work

### Convolutional Layer:

* The convolutional layer is the most important layer of a CNN and is responsible for most of the computation. It involves three components: input data, a filter, and a feature map.
* The convolutional layer is typically the first layer to extract features from an input image. Convolution preserves the spatial relationship between pixels by learning image features from small squares of input data. It is a mathematical operation that takes two inputs: an image matrix and a filter (or kernel).

![Image%20matrix%20multiplies%20kernel%20or%20filter%20matrix.webp](attachment:Image%20matrix%20multiplies%20kernel%20or%20filter%20matrix.webp)

* To illustrate how it works, let's assume we have a color image as input. This image is a 3D array of pixels with three dimensions: height, width, and depth (the depth corresponds to the RGB color channels).
* The filter (also called a kernel) is a small array of weights, typically 3×3 in height and width. For a multi-channel input such as a color image, it also spans the full input depth (e.g., 3×3×3). It is applied to a specific area of the image, and a dot product is computed between the input pixels and the weights in the filter. The filter then shifts by a stride, and the process is repeated until the kernel has slid across the entire image, resulting in an output array.
* The resulting output array is also known as a feature map, activation map, or convolved feature.

![1*D6iRfzDkz-sEzyjYoVZ73w.gif](attachment:1*D6iRfzDkz-sEzyjYoVZ73w.gif)
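
The sliding dot product described above can be sketched in NumPy for a single-channel image. This is a minimal illustration, not an optimized implementation. The function name `conv2d_single_channel` and the example values are assumptions, and, as in most deep-learning libraries, the kernel is not flipped (strictly speaking, this is cross-correlation).

```python
import numpy as np


def conv2d_single_channel(image, kernel, stride=1):
    """Slide `kernel` over a 2D `image` (no padding) and return the feature map."""
    image_h, image_w = image.shape
    kernel_h, kernel_w = kernel.shape
    out_h = (image_h - kernel_h) // stride + 1
    out_w = (image_w - kernel_w) // stride + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            top, left = i * stride, j * stride
            patch = image[top:top + kernel_h, left:left + kernel_w]
            # Dot product between the image patch and the filter weights.
            feature_map[i, j] = np.sum(patch * kernel)
    return feature_map


# Example values (assumed): a 5x5 ramp image and a vertical-edge style kernel.
image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]], dtype=float)

feature_map = conv2d_single_channel(image, kernel)
print(feature_map.shape)  # (3, 3)
print(feature_map)        # every entry is -6.0 for this ramp image
```

Because the ramp image increases by the same amount from left to right everywhere, the kernel responds with the same value at every position.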

* It is important to note that the weights in the filter stay the same as it moves across the image: the same values are applied at every position. The weight values change only during training, when they are adjusted through backpropagation and gradient descent.
* Convolving an image with different filters can perform operations such as edge detection, blurring, and sharpening.
* Besides the weights in the filter, three other important parameters need to be set before training begins:

**1. Number of Filters:** This parameter is responsible for defining the depth of the output. If we have three distinct filters, we have three different feature maps, creating a depth of three. Filters are small matrices, typically of size 3x3, 5x5, or 7x7, with learnable parameters.

**2. Strides:** The stride is the number of pixels by which the filter shifts over the input matrix. When the stride is 1, we move the filter 1 pixel at a time. When the stride is 2, we move the filter 2 pixels at a time, and so on. The figure below shows how convolution works with a stride of 2.

![Strides.webp](attachment:Strides.webp)

**3. Padding:** Sometimes the filter does not fit the input image perfectly. In that case, we have the following options:

1. Zero-padding: Adds a border of zeros around the input matrix so that the filter can also be applied at the edges. This yields an output that is as large as, or larger than, the one obtained without padding.
2. Valid padding: Also known as no padding. The filter is applied only where it fits entirely inside the input, so trailing rows or columns that do not fit are dropped.
3. Same padding: This padding ensures that the output has the same height and width as the input (when the stride is 1).
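
As a small sketch of how stride and padding interact, the helper below computes the output size along one dimension, and `np.pad` shows what zero-padding does to an array. The name `conv_output_size` and the example sizes are assumptions for illustration.

```python
import numpy as np


def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Number of filter positions along one dimension (floor division drops partial fits)."""
    return (input_size + 2 * padding - kernel_size) // stride + 1


# Zero-padding a 5x5 input with a 1-pixel border of zeros gives a 7x7 array.
image = np.ones((5, 5))
padded = np.pad(image, pad_width=1, mode="constant", constant_values=0)
print(padded.shape)  # (7, 7)

print(conv_output_size(5, 3, stride=1, padding=0))  # 3  (valid padding)
print(conv_output_size(5, 3, stride=1, padding=1))  # 5  (same padding for a 3x3 filter)
print(conv_output_size(7, 3, stride=2, padding=0))  # 3
print(conv_output_size(6, 3, stride=2, padding=0))  # 2  (the last column does not fit)
```

For a stride of 1 and an odd filter size, a padding of `(kernel_size - 1) // 2` keeps the output the same size as the input.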

**4. Activation Function:**
After each convolution operation, a Rectified Linear Unit (ReLU) function is applied to the feature map, which introduces nonlinearity.
![ReLU%20activation%20function.webp](attachment:ReLU%20activation%20function.webp)

![ReLU%20operation.webp](attachment:ReLU%20operation.webp)
Other nonlinear functions, such as tanh or sigmoid, can be used instead of ReLU. In practice, ReLU is the most common choice because it is cheap to compute and usually trains better in deep networks than the other two.
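
ReLU itself is a one-line operation: negative values become zero and positive values pass through unchanged. A minimal NumPy sketch, with example values that are assumptions for illustration:

```python
import numpy as np


def relu(x):
    """Element-wise max(x, 0)."""
    return np.maximum(x, 0)


feature_map = np.array([[-3.0, 1.5],
                        [0.0, -0.5]])
activated = relu(feature_map)
print(activated)
# [[0.  1.5]
#  [0.  0. ]]
```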

**5. Pooling Layer:** The pooling layer is responsible for reducing the spatial dimensions (height and width) of its input. Like convolution, it slides a filter across the entire input, but this filter has no weights; instead, it applies an aggregation function to the values in its window to populate the output array. We have two main types of pooling:

1. Max Pooling: As the filter slides through the input, it selects the pixel with the highest value for the output array.
2. Average Pooling: The value selected for the output is obtained by computing the average within the receptive field.

![Pooling%20Layer.webp](attachment:Pooling%20Layer.webp)
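
Both pooling types can be sketched with the same sliding-window loop, changing only the aggregation function. The function name `pool2d`, its parameters, and the example values are assumptions for illustration.

```python
import numpy as np


def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Apply max or average pooling to a 2D feature map (no padding)."""
    if mode not in ("max", "average"):
        raise ValueError("mode must be 'max' or 'average'")
    reduce_fn = np.max if mode == "max" else np.mean
    height, width = feature_map.shape
    out_h = (height - size) // stride + 1
    out_w = (width - size) // stride + 1
    pooled = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            top, left = i * stride, j * stride
            window = feature_map[top:top + size, left:left + size]
            pooled[i, j] = reduce_fn(window)
    return pooled


feature_map = np.array([[1, 3, 2, 4],
                        [5, 6, 1, 2],
                        [7, 2, 9, 0],
                        [1, 8, 3, 4]], dtype=float)

max_pooled = pool2d(feature_map, mode="max")
avg_pooled = pool2d(feature_map, mode="average")
print(max_pooled)  # [[6. 4.] [8. 9.]]
print(avg_pooled)  # [[3.75 2.25] [4.5 4.]]
```

With a 2x2 window and a stride of 2, the 4x4 input shrinks to 2x2, which is the reduction in spatial size that pooling provides.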

**6. Output Feature Maps:**
    
1. Stacking: The output of a convolutional layer is a set of feature maps, one for each filter. These feature maps are stacked to form a 3D volume.
2. Dimensions: If the input image has dimensions H × W × D (where D is the depth, e.g., 3 for RGB), and the convolutional layer has F filters, the output dimensions will be H′ × W′ × F. The height H′ and width W′ depend on the filter size, stride, and padding.
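
The dependence of H′ and W′ on these settings can be written out explicitly. Assuming a square filter of size K, a stride S, and P pixels of zero-padding on each side (K, S, and P are symbols introduced here for illustration), the standard relationship is:

$$
H' = \left\lfloor \frac{H - K + 2P}{S} \right\rfloor + 1,
\qquad
W' = \left\lfloor \frac{W - K + 2P}{S} \right\rfloor + 1
$$

For example, with H = 32, K = 5, S = 1, and P = 0, this gives H′ = (32 − 5 + 0)/1 + 1 = 28.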

## Example Workflow:

**1. Input Image:** Consider a 32x32x3 RGB image.

**2. Filters:** Suppose we have 6 filters of size 5x5x3.

**3. Convolution:** Each filter slides over the image with a certain stride (e.g., stride 1), performing the dot product and producing a 2D feature map.

**4. Activation:** Apply ReLU to each element of the feature map.

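**5. Output:** With no padding, each of the 6 filters produces a 28x28 feature map, since (32 − 5)/1 + 1 = 28, resulting in an output volume of 28x28x6. With same padding (2 pixels of zeros on each side), each feature map would instead be 32x32, giving an output volume of 32x32x6.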
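
The whole workflow can be sketched end to end in NumPy. This is a minimal, unoptimized illustration: the function name `conv_layer`, the random image, and the random filter weights are all assumptions. It runs the layer twice, once without padding and once with 2 pixels of zero-padding, because the output size depends on that choice.

```python
import numpy as np


def conv_layer(image, filters, stride=1, padding=0):
    """Convolve an (H, W, D) image with (F, K, K, D) filters, then apply ReLU.

    Returns an array of shape (H', W', F).
    """
    if padding > 0:
        # Zero-pad height and width only; the depth axis is left untouched.
        image = np.pad(image, ((padding, padding), (padding, padding), (0, 0)))
    height, width, _ = image.shape
    num_filters, kernel_size = filters.shape[0], filters.shape[1]
    out_h = (height - kernel_size) // stride + 1
    out_w = (width - kernel_size) // stride + 1
    output = np.zeros((out_h, out_w, num_filters))
    for i in range(out_h):
        for j in range(out_w):
            top, left = i * stride, j * stride
            patch = image[top:top + kernel_size, left:left + kernel_size, :]
            # Dot product of the patch with every filter at once -> F values.
            output[i, j, :] = np.tensordot(filters, patch, axes=([1, 2, 3], [0, 1, 2]))
    return np.maximum(output, 0)  # ReLU


rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))              # stand-in for a 32x32 RGB image
filters = rng.standard_normal((6, 5, 5, 3))  # 6 filters of size 5x5x3

no_padding = conv_layer(image, filters, stride=1, padding=0)
with_padding = conv_layer(image, filters, stride=1, padding=2)
print(no_padding.shape)    # (28, 28, 6)
print(with_padding.shape)  # (32, 32, 6)
```

In a real network, the filter weights would be learned during training rather than drawn at random.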