## Image representation 

A computer sees an input image as an array of pixel values whose size depends on the image resolution: h x w x d, where h is the height, w is the width, and d is the depth (the number of channels).

For example, a 6 x 6 RGB image is stored as a 6 x 6 x 3 array (the 3 refers to the red, green, and blue values of each pixel), while a 4 x 4 grayscale image is stored as a 4 x 4 x 1 array.
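
As a rough illustration, the two shapes above can be written as NumPy arrays. The variable names and the all-zero pixel values are placeholders chosen for this sketch, and NumPy is an assumed dependency.

```python
import numpy as np

# Hypothetical images filled with zeros; only the shapes matter here.
rgb_image = np.zeros((6, 6, 3), dtype=np.uint8)   # h x w x d, d = 3 (R, G, B)
gray_image = np.zeros((4, 4, 1), dtype=np.uint8)  # h x w x d, d = 1 (intensity)

print(rgb_image.shape)   # (6, 6, 3)
print(gray_image.shape)  # (4, 4, 1)

# Each pixel of the RGB image holds three values, one per channel.
top_left_pixel = rgb_image[0, 0]
print(top_left_pixel.shape)  # (3,)
```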



<img src="https://www.matlabsolutions.com/images/matrix1.png" alt="ml" style="width: 100px; margin-left: auto; margin-right: auto;"/>
<div style="text-align: center;" markdown="1">image matrix with RGB channel</div>

# Filters, Kernels, and Convolutions

## Filters and Kernels in Images

**Filters** (also known as **kernels**) are small matrices used to perform convolution operations on larger matrices, typically representing images. Each filter slides over the input matrix (image) to produce an output matrix (feature map).

- **Filter/Kernel Matrix**: A filter is usually a small matrix with dimensions \( k \times k \), where \( k \) is typically odd (e.g., 3x3, 5x5).

- Filters are used to apply effects like the ones you might find in Photoshop or GIMP, such as blurring, sharpening, outlining, or embossing.
- They are also used in machine learning for feature extraction, a technique for determining the most important portions of an image.

The values in this matrix are weights that are either:

- **static**, as in Sobel and Gaussian kernels, or
- **learned** during the training process of a CNN.


<img src="https://www.cs.toronto.edu/~lczhang/360/lec/w04/imgs/sobel1.png" alt="ml" style="width: 500px; margin-left: auto; margin-right: auto;"/>
<div style="text-align: center;" markdown="1">convolution kernel being used for edge detection</div>

<div style="text-align: center;" markdown="1">
<a href="https://setosa.io/ev/image-kernels/">A beautiful visualization to play around with various static filters</a>
</div>
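
To make the idea of a static kernel concrete, here is a minimal sketch that applies a Sobel kernel and a simple averaging (box blur) kernel to two hand-made 3x3 patches. The patches and variable names are made up for illustration, and the box blur stands in for the Gaussian kernel mentioned above.

```python
import numpy as np

# Fixed (static) kernels: their values are chosen by hand, not learned.
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)   # responds to vertical edges
box_blur = np.ones((3, 3)) / 9.0                # averages a 3x3 neighbourhood

# Hypothetical 3x3 patch: dark on the left, bright on the right (a vertical edge).
edge_patch = np.array([[0, 0, 255],
                       [0, 0, 255],
                       [0, 0, 255]], dtype=float)
flat_patch = np.full((3, 3), 100.0)             # uniform patch, no edge at all

edge_response = np.sum(sobel_x * edge_patch)    # large value: edge detected
flat_response = np.sum(sobel_x * flat_patch)    # zero: no horizontal change
blurred_value = np.sum(box_blur * edge_patch)   # mean of the patch

print(edge_response, flat_response)  # 1020.0 0.0
```

The same kernel gives a strong response on the edge patch and no response on the uniform patch, which is what makes it usable as an edge detector.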

## Convolution Operation

**Convolution** is the process of applying a filter to an input matrix to produce a feature map. This operation involves sliding the filter over the input matrix and computing the dot product of the filter with the overlapping part of the input matrix at each position.

**Definition**: Convolution is a mathematical operation used to combine two sets of information. In signal processing, it describes the way in which a signal is modified by a system.

**Signal Processing**: Convolution in signal processing can be used for filtering signals, applying various effects, and understanding the system's response to different inputs.

**Image Processing**

**Application**: Convolution is used to apply filters to images, which can detect edges, blur the image, sharpen it, etc.

**Process**: An image is represented as a matrix of pixel values. A filter (or kernel) is a smaller matrix that is slid over the image matrix to produce a transformed image. Each position of the filter involves a dot product of the filter coefficients and the corresponding image pixel values.
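
The sliding process described above can be sketched with two plain loops. This is an illustrative, unoptimized version for a single-channel image with no padding and a stride of 1; the function name `convolve2d_valid` and the test image are assumptions of this sketch. Like most deep learning libraries, it does not flip the kernel, so strictly speaking it computes a cross-correlation.

```python
import numpy as np

def convolve2d_valid(image, kernel):
    """Slide `kernel` over a 2D `image` (no padding, stride 1)."""
    k_h, k_w = kernel.shape
    out_h = image.shape[0] - k_h + 1
    out_w = image.shape[1] - k_w + 1
    output = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + k_h, j:j + k_w]
            output[i, j] = np.sum(patch * kernel)  # dot product at this position
    return output

# Hypothetical 5x5 grayscale image with a vertical edge between columns 1 and 2.
image = np.array([[0, 0, 9, 9, 9]] * 5, dtype=float)
sobel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)

feature_map = convolve2d_valid(image, sobel_x)
print(feature_map.shape)  # (3, 3)
print(feature_map[0])     # [36. 36.  0.]
```

The first two output columns overlap the edge and respond strongly; the last column covers a uniform region and responds with zero.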
    

### Dimensions and Shapes

1. **Input Matrix (Image)**:
   - Let's consider an input image of size \( H \times W \).
   - If the image has \( C \) channels (e.g., \( C = 3 \) for an RGB image), the dimensions become \( H \times W \times C \).

2. **Filter (Kernel)**:
   - A filter has dimensions \( k \times k \).
   - For multi-channel images, a filter typically has dimensions \( k \times k \times C \), matching the depth of the input.

3. **Output Matrix (Feature Map)**:
   - The dimensions of the output matrix depend on the input size, filter size, and the stride (step size) with which the filter is applied.
   - If the input is \( H \times W \) and the filter is \( k \times k \) with stride \( s \) (and no padding), the output dimensions \( H' \times W' \) are calculated as:
     
     $ H' = \left\lfloor \frac{H - k}{s} \right\rfloor + 1 $
     
     $ W' = \left\lfloor \frac{W - k}{s} \right\rfloor + 1 $
     

## Summary

- **Filters/Kernels**: Small matrices (e.g., 3x3) used to detect patterns in the input data.
- **Convolution**: The process of sliding a filter over the input matrix to produce a feature map.
- **Dimensions**: 
  - Input matrix: \( H \times W \) (or \( H \times W \times C \) for multi-channel inputs).
  - Filter: \( k \times k \) (or \( k \times k \times C \) for multi-channel inputs).
  - Output matrix: Calculated based on input size, filter size, and stride.

This detailed explanation covers the fundamental concepts of filters, kernels, and convolutions, including the mathematical operations and dimensional transformations involved in CNNs.

## Convolutional Neural Network (CNN) 

## One filter at one spatial patch generates one point of the feature map

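```python
import numpy as np

kernel = np.ones((3, 3)) / 9.0               # example n x n kernel (averaging filter)
image_patch = np.arange(9.0).reshape(3, 3)   # example n x n image patch
# element-wise product followed by a sum gives a single scalar
pixel_out = np.sum(kernel * image_patch)
```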

![a](https://miro.medium.com/v2/resize:fit:800/1*hhV1YD1DCvZZle2ptot49Q.png "Image RGB  to feature map reduction")
<div style="text-align: center;" markdown="1">One filter applied at one spatial patch generates a single scalar</div>


## One filter convolved over the full image generates a 2D feature map



<img src="https://upload.wikimedia.org/wikipedia/commons/9/95/Convolutional_Neural_Network_with_Color_Image_Filter.gif" alt="ml" style="width: 500px; margin-left: auto; margin-right: auto;"/>
<div style="text-align: center;" markdown="1">One filter slides over every spatial patch of the image and generates a 2D matrix</div>
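
As a sketch of what the animation shows, the loop below applies a single 3 x 3 x 3 filter to a small RGB image. The filter spans all channels, so each position still yields one scalar and the result is a 2D map. The function name, the random values, and the 6 x 6 image size are assumptions made for this example.

```python
import numpy as np

def conv_single_filter(image, kernel):
    """Apply one k x k x C filter to an H x W x C image (no padding, stride 1)."""
    k = kernel.shape[0]
    out_h = image.shape[0] - k + 1
    out_w = image.shape[1] - k + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + k, j:j + k, :]          # k x k x C window
            feature_map[i, j] = np.sum(patch * kernel)  # sum over height, width, channels
    return feature_map

rng = np.random.default_rng(0)   # fixed seed, arbitrary values
image = rng.random((6, 6, 3))    # hypothetical 6 x 6 RGB image
kernel = rng.random((3, 3, 3))   # one 3 x 3 x 3 filter

feature_map = conv_single_filter(image, kernel)
print(feature_map.shape)  # (4, 4)
```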


## N filters of size 3x3x3 convolved over the full image generate a 3D feature map volume

<img src="https://miro.medium.com/v2/resize:fit:600/1*EfXAnrwUObQmFtaxcHUlVg.png" alt="ml" style="width: 500px; margin-left: auto; margin-right: auto;"/>
<div style="text-align: center;" markdown="1">N filters convolved over the full spatial region</div>
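
Extending the single-filter case, a bank of N filters can be stored as one array of shape (N, k, k, C), and each filter contributes one channel of the output volume. The sketch below assumes that storage layout, N = 8, and random values purely for illustration; real frameworks use their own layouts and far faster implementations.

```python
import numpy as np

def conv_layer(image, filters):
    """Apply N filters of shape (N, k, k, C) to an H x W x C image."""
    n, k, _, _ = filters.shape
    out_h = image.shape[0] - k + 1
    out_w = image.shape[1] - k + 1
    volume = np.zeros((out_h, out_w, n))
    for f in range(n):
        for i in range(out_h):
            for j in range(out_w):
                patch = image[i:i + k, j:j + k, :]
                volume[i, j, f] = np.sum(patch * filters[f])  # one point per filter
    return volume

rng = np.random.default_rng(0)
image = rng.random((6, 6, 3))        # hypothetical RGB image
filters = rng.random((8, 3, 3, 3))   # N = 8 filters, each 3 x 3 x 3

volume = conv_layer(image, filters)
print(volume.shape)  # (4, 4, 8)
```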


## Full animated view of the convolution by multiple kernels generating feature map volume


    
<img src="https://animatedai.github.io/media/convolution-animation-3x3-kernel.gif" alt="ml" style="width: 500px; margin-left: auto; margin-right: auto;"/>
<div style="text-align: center;" markdown="1"> 
<a href="https://gigazine.net/gsc_news/en/20231025-animated-ai/">image credit</a>
animated convolution operation by a set of filter banks
</div>


## Learning  Patterns in Images

### Background
**Fully Connected Neural Network (FCNN)**

An image is inherently a 3D array with dimensions `W x H` (spatial dimensions) `x D` (channels).

We can do pixel-level learning by flattening this `W x H x D` array into a `(W*H*D) x 1` vector and passing it through a general function approximator such as an FCNN. This approach, however, has several drawbacks:

1. **Architecture**:
   - Each neuron in one layer is connected to every neuron in the next layer.
   - Dense connections across layers.

2. **Feature Learning**:
   - No explicit mechanism for spatial hierarchies.
   - Requires flattened input (e.g., images converted to 1D vectors).

3. **Parameter Efficiency**:
   - Large number of parameters due to dense connections.
   - Prone to overfitting with limited training data.

4. **Computation**:
   - High computational cost due to the large number of connections.
   - Not inherently suited for processing spatial data.

5. **Translation Invariance**:
   - Lacks inherent translation invariance.
   - Sensitive to the position of features in the input.
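
A small sketch of the flattening step and a single dense layer may help make these drawbacks tangible. The 32 x 32 x 3 image size, the layer width of 10, and the random weights are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))       # hypothetical W x H x D image

flat = image.reshape(-1)              # (W*H*D,) vector; the spatial layout is lost
num_outputs = 10                      # arbitrary layer width for this sketch
weights = rng.random((num_outputs, flat.size))
bias = np.zeros(num_outputs)

activations = weights @ flat + bias   # every output neuron sees every pixel
print(flat.shape)     # (3072,)
print(weights.size)   # 30720 weights for only 10 outputs
```

Because each weight is tied to one fixed pixel position, moving a feature to a different location in the image engages a completely different set of weights.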

## Jargon

### Kernels

**Kernels** (also known as **filters**) are small matrices used to perform convolution operations on larger matrices, typically representing images. Each kernel slides over the input matrix (image) to produce an output matrix (feature map). The kernel values are the weights applied during the convolution process.

### Predefined Kernels

**Predefined kernels** are kernels with fixed values that are not learned from data. They are typically used in traditional image processing for tasks such as edge detection, sharpening, and blurring. Examples include the Sobel kernel for edge detection and the Gaussian kernel for blurring.

### Learned Kernels/Convolution Filters

**Learned kernels** (also known as **convolution filters**) are kernels whose values are learned during the training process of a convolutional neural network (CNN). These kernels are adjusted to detect specific features in the input data, such as edges, textures, and patterns, contributing to the network's ability to perform tasks like image classification and object detection.

### Filters

**Filters** are another term for kernels. They are the small matrices used in convolution operations to detect specific features in the input data. Filters are applied across the input matrix to produce activation maps.

### Filterbank

A **filterbank** refers to a collection of multiple filters (kernels) applied to the input data in a single convolutional layer. Each filter in the filterbank detects different features, resulting in multiple activation maps that are combined to form the layer's output.

### Padding and Strides

- **Padding**: Padding involves adding extra pixels (usually zeros) around the border of the input image before applying the convolution operation. Padding helps control the spatial dimensions of the output feature map and can preserve the input dimensions. Common types of padding include:
  - **Valid Padding**: No padding is added, resulting in a smaller output.
  - **Same Padding**: Padding is added to keep the output dimensions the same as the input dimensions.

- **Strides**: The stride is the step size with which the filter moves across the input image during the convolution operation. A stride of 1 means the filter moves one pixel at a time, while a stride of 2 means the filter moves two pixels at a time. Stride affects the spatial dimensions of the output feature map.
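
The effect of padding and stride can be checked on a toy input. The sketch below uses `np.pad` for zero padding and lists the positions a 3 x 3 filter would visit along one axis; the 4 x 4 input and the variable names are assumptions of this example.

```python
import numpy as np

image = np.arange(16, dtype=float).reshape(4, 4)   # hypothetical 4 x 4 input

# Zero padding of P = 1 pixel on every side: 4 x 4 becomes 6 x 6.
padded = np.pad(image, pad_width=1, mode="constant", constant_values=0)
print(padded.shape)  # (6, 6)

# Top-left corners visited by a 3 x 3 filter along one axis of the padded input.
filter_size = 3
last_start = padded.shape[0] - filter_size
positions_stride_1 = list(range(0, last_start + 1, 1))
positions_stride_2 = list(range(0, last_start + 1, 2))
print(positions_stride_1)  # [0, 1, 2, 3] -> 4 outputs, same as the input size
print(positions_stride_2)  # [0, 2]       -> 2 outputs
```

With a 3 x 3 filter and a stride of 1, one pixel of padding is exactly what "same" padding needs; doubling the stride roughly halves the output size.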

### Receptive Field

The **receptive field** is the region of the input image that a particular neuron in a CNN layer is responsive to. As the network goes deeper, neurons in higher layers have larger receptive fields, allowing them to capture more global and complex features from the input image.
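
The growth of the receptive field with depth can be computed with a short recurrence: each layer widens the field by (kernel size - 1) times the current spacing between neighbouring neurons, and each stride multiplies that spacing. The helper below is a sketch using this standard recurrence; the function name and the example layer stacks are made up for illustration.

```python
def receptive_field(layers):
    """Receptive field of one output neuron after a stack of layers.

    `layers` is a list of (kernel_size, stride) pairs, input side first.
    """
    field = 1   # a single input pixel sees only itself
    jump = 1    # distance, in input pixels, between adjacent neurons
    for kernel_size, stride in layers:
        field += (kernel_size - 1) * jump
        jump *= stride
    return field

print(receptive_field([(3, 1)]))                  # 3
print(receptive_field([(3, 1), (3, 1)]))          # 5
print(receptive_field([(3, 1), (2, 2), (3, 1)]))  # 8
```

Two stacked 3x3 convolutions already see a 5x5 region of the input, and inserting a stride-2 layer makes later layers grow the field faster.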

### Activation Map

An **activation map** (or **feature map**) is the output of a convolutional layer after the convolution operation and the application of an activation function. It represents the activations (i.e., responses) of the neurons in the layer to different parts of the input image, highlighting detected features.

We can compute the spatial size of the output volume as a function of the input volume size `(W)`, the receptive field size of the Conv Layer neurons `(F)`, the stride with which they are applied `(S)`, and the amount of zero padding used `(P)` on the border. 

$$ \text{the spatial size of the output volume} = \frac{(W-F+2P)}{S}+1$$

> **For example, for a 7x7 input and a 3x3 filter with stride 1 and pad 0, we would get a 5x5 output. With stride 2, we would get a 3x3 output.**

<table>
     <tr>
         <th>
<a href="https://github.com/vdumoulin/conv_arithmetic?tab=readme-ov-file">  
   image credit </a></th>
   
 </tr>
<tr>
<td>
<img src="https://github.com/vdumoulin/conv_arithmetic/raw/master/gif/no_padding_no_strides.gif" alt="ml" style="width: 200px; margin-left: auto; margin-right: auto;"/>
<div style="text-align: center;" markdown="1"> P=0,S=1,W=4,F=3</div>
</td>
<td>
<img src="https://github.com/vdumoulin/conv_arithmetic/raw/master/gif/padding_strides.gif" alt="ml" style="width: 200px; margin-left: auto; margin-right: auto;"/>
<div style="text-align: center;" markdown="1"> P=1,S=2,W=5,F=3 </div>
</td>
</tr>
<tr>
<td>
<img src="https://github.com/vdumoulin/conv_arithmetic/raw/master/gif/same_padding_no_strides.gif" alt="ml" style="width: 200px; margin-left: auto; margin-right: auto;"/>
<div style="text-align: center;" markdown="1"> P=1,S=1,W=5,F=3 </div>
</td>
<td>
<img src="https://github.com/vdumoulin/conv_arithmetic/raw/master/gif/full_padding_no_strides.gif" alt="ml" style="width: 200px; margin-left: auto; margin-right: auto;"/>
<div style="text-align: center;" markdown="1"> P=2,S=1,W=5,F=3 </div>
</td>
</tr>
</table>
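
The output-size formula is easy to check in code. The helper below is a minimal sketch that evaluates it for the 7x7 example and for the four configurations shown in the table above; the function name is hypothetical, and using floor division for strides that do not fit exactly is an assumption of this sketch.

```python
def conv_output_size(w, f, s=1, p=0):
    """Spatial output size for input size w, filter size f, stride s, padding p."""
    return (w - f + 2 * p) // s + 1   # floor division handles non-exact fits

# The 7x7 example from the text.
print(conv_output_size(7, 3, s=1, p=0))  # 5
print(conv_output_size(7, 3, s=2, p=0))  # 3

# The four animated configurations in the table above.
print(conv_output_size(4, 3, s=1, p=0))  # 2
print(conv_output_size(5, 3, s=2, p=1))  # 3
print(conv_output_size(5, 3, s=1, p=1))  # 5
print(conv_output_size(5, 3, s=1, p=2))  # 7
```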














<div style="text-align: center;" markdown="1"> 
<a href="https://ezyang.github.io/convolution-visualizer/">Receptive field to Feature map reduction , visualization with padding, stride and filter sizes</a>
</div>

## Activation Function

An **activation function** is a non-linear function applied to the output of a convolution operation (or any other neural network layer) to introduce non-linearity into the model. Common activation functions include:

- **ReLU (Rectified Linear Unit)**: $$\text{ReLU}(x) = \max(0, x) $$
- **Sigmoid**: $$\sigma(x) = \frac{1}{1 + e^{-x}}  $$
- **Tanh**: $$\text{Tanh}(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} $$
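
The three functions above translate directly into NumPy. This is a plain sketch of the formulas as written, applied element-wise; the sample `feature_map` values are made up, and in practice `np.tanh` is the more numerically robust choice for large inputs.

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

feature_map = np.array([[-2.0, 0.0], [1.0, 3.0]])  # hypothetical conv output
print(relu(feature_map).tolist())  # [[0.0, 0.0], [1.0, 3.0]]
print(sigmoid(0.0))                # 0.5
print(tanh(0.0))                   # 0.0
```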

### Pooling

**Pooling** (or subsampling) is a downsampling operation that reduces the spatial dimensions of the input, helping to reduce the number of parameters and computations in the network. It also helps make the network invariant to small translations of the input. Common types of pooling include:
- **Max Pooling**: Takes the maximum value from each patch of the input.
- **Average Pooling**: Takes the average value from each patch of the input.

Pooling layers are typically used after convolutional layers to progressively reduce the spatial dimensions of the feature maps while retaining important information.
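
Both pooling types can be sketched with a reshape that groups the feature map into non-overlapping blocks. The function below assumes a 2D feature map and a stride equal to the pool size; its name and the sample values are chosen for illustration only.

```python
import numpy as np

def pool2d(feature_map, size=2, mode="max"):
    """Non-overlapping size x size pooling (stride equal to `size`)."""
    h, w = feature_map.shape
    h_out, w_out = h // size, w // size
    # Trim any leftover rows/columns, then group pixels into size x size blocks.
    trimmed = feature_map[:h_out * size, :w_out * size]
    blocks = trimmed.reshape(h_out, size, w_out, size)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    return blocks.mean(axis=(1, 3))

feature_map = np.array([[1, 3, 2, 4],
                        [5, 7, 6, 8],
                        [9, 2, 1, 0],
                        [3, 4, 5, 6]], dtype=float)

print(pool2d(feature_map, mode="max").tolist())   # [[7.0, 8.0], [9.0, 6.0]]
print(pool2d(feature_map, mode="mean").tolist())  # [[4.0, 5.0], [4.5, 3.0]]
```

A 4 x 4 map becomes 2 x 2 in both cases: max pooling keeps the strongest response in each block, while average pooling keeps the block's mean.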

## CNN Architecture

At a high level, CNN architectures contain an upstream feature extractor followed by a downstream classifier. The feature extraction segment is sometimes referred to as the “backbone” or “body” of the network. The classifier is sometimes referred to as the “head” of the network.

1. **Architecture**:
   - Consists of convolutional layers, pooling layers, and sometimes fully connected layers.
   - Convolutional layers apply filters across the input.

2. **Feature Learning**:
   - Captures spatial hierarchies of features.
   - Preserves spatial relationships in the input data.

3. **Parameter Efficiency**:
   - Parameter sharing: each filter is used across the entire input.
   - Sparse connections: each neuron is connected to a local region of the input.

4. **Computation**:
   - Lower computational cost compared to FCNNs for large inputs.
   - Efficient for image and spatial data processing.

5. **Translation Invariance**:
   - More robust to translation of features due to convolution and pooling.
   - Learns features regardless of their position in the input.
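
The robustness to translation can be illustrated with a toy experiment: the same feature is placed at two positions, and the filter response moves with it, while a global max over the feature map stays unchanged. Everything in this sketch (the helper name, the kernel values, the 7 x 7 image) is an assumption chosen to keep the example small.

```python
import numpy as np

def correlate_valid(image, kernel):
    """Stride-1, no-padding sliding-window filter on a 2D image."""
    k = kernel.shape[0]
    out = np.zeros((image.shape[0] - k + 1, image.shape[1] - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + k, j:j + k] * kernel)
    return out

def peak(feature_map):
    """Position of the strongest response as a plain (row, col) tuple."""
    idx = np.unravel_index(feature_map.argmax(), feature_map.shape)
    return tuple(int(v) for v in idx)

kernel = np.array([[0, 1, 0], [1, 4, 1], [0, 1, 0]], dtype=float)  # arbitrary filter

image = np.zeros((7, 7))
image[2, 2] = 1.0                                     # a single bright "feature"
shifted = np.roll(image, shift=(1, 2), axis=(0, 1))   # same feature, moved to (3, 4)

map_a = correlate_valid(image, kernel)
map_b = correlate_valid(shifted, kernel)

# The response moves with the feature ...
print(peak(map_a), peak(map_b))     # (1, 1) (2, 3)
# ... while a global max pool gives the same value for both positions.
print(map_a.max() == map_b.max())   # True
```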

## Trainable convolution kernel in CNN


<img src="https://learnopencv.com/wp-content/uploads/2023/01/tensorflow-keras-convolution-one-filter-example.png" alt="ml" style="width: 800px; margin-left: auto; margin-right: auto;"/>
<div style="text-align: center;" markdown="1"> 
<a href="https://learnopencv.com">image credit</a>
one full convolution layer in a CNN
</div>

## Parameter Efficiency

### Fully Connected Layer

A fully connected layer has a weight for every connection between each input pixel and each output neuron, leading to a massive number of parameters when the input size is large.

### Convolutional Layer

A convolutional layer uses shared weights in the form of filters that are convolved over the input image. This means the same filter is applied across different regions of the input, drastically reducing the number of parameters while still being able to capture spatial features.

### Example Calculation with Dimensions

#### Input Image

$$
\text{Dimensions}: 32 \times 32 \times 3
$$

#### Fully Connected Layer

$$\text{Input size}: 32 \times 32 \times 3 = 3072$$
$$\text{Output size}: 1000$$

$$
\text{Weights}: 3072 \times 1000 = 3,072,000
$$
$$
\text{Biases}: 1000
$$
$$
\text{Total Parameters}: 3,072,000 + 1000 = 3,073,000
$$

#### Convolutional Layer

$$
\text{Filter Size}: 3 \times 3 \times 3 = 27
$$
$$
\text{Number of Filters}: 32
$$
$$
\text{Weights}: 27 \times 32 = 864
$$
$$
\text{Biases}: 32
$$
$$
\text{Total Parameters}: 864 + 32 = 896
$$

### Conclusion

#### Parameter Reduction Comparison


$$
\begin{array}{|c|c|}
\hline
\text{Layer Type} & \text{Total Parameters} \\
\hline
\text{Fully Connected Layer} & 3,073,000 \\
\hline
\text{Convolutional Layer} & 896 \\
\hline
\end{array}
$$


#### Magnitude of Reduction

$$
\text{Reduction Ratio} = \frac{\text{FCNN parameters}}{\text{CNN parameters}} = \frac{3,073,000}{896} \approx 3,430
$$

So, a convolutional layer with 32 filters uses approximately 3,430 times fewer parameters than a fully connected layer with 1,000 outputs on the same input.
This significant reduction allows CNNs to be more efficient and scalable, particularly for large input sizes, while still being able to effectively capture and learn spatial features in the data.
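
The parameter counts in this example can be reproduced with two small helper functions. The function names are hypothetical; the arithmetic follows the calculation above, assuming one bias per output neuron or per filter.

```python
def dense_params(num_inputs, num_outputs):
    """Weights plus biases of a fully connected layer."""
    return num_inputs * num_outputs + num_outputs

def conv_params(kernel_size, in_channels, num_filters):
    """Weights plus biases of a convolutional layer with square kernels."""
    return kernel_size * kernel_size * in_channels * num_filters + num_filters

fc_total = dense_params(32 * 32 * 3, 1000)   # 3073000
conv_total = conv_params(3, 3, 32)           # 896
ratio = fc_total / conv_total                # about 3429.7
print(fc_total, conv_total)
```

Note that the convolutional count does not depend on the 32 x 32 input size at all, which is why the gap widens as images get larger.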


## Hierarchical Learning


<img src="https://learnopencv.com/wp-content/uploads/2023/01/tensorflow-keras-cnn-hierarchical-structure-2048x1181.png" alt="ml" style="width: 800px; margin-left: auto; margin-right: auto;"/>
<div style="text-align: center;" markdown="1"> 
<a href="https://learnopencv.com">image credit</a>
hierarchical learning of features
</div>
<hr>

## Comparison between Fully Connected Neural Networks (FCNN) and Convolutional Neural Networks (CNN)



## Efficiency in Capturing Features

1. **Spatial Hierarchies**:
   - CNNs inherently capture spatial hierarchies by applying multiple layers of convolutions and pooling.
   - FCNNs lack this ability since they treat the input as a flat vector.

2. **Local Patterns**:
   - CNNs excel at detecting local patterns and combining them to form more complex features.
   - FCNNs do not have a mechanism to naturally detect local patterns.

3. **Parameter Reduction**:
   - CNNs significantly reduce the number of parameters through parameter sharing.
   - FCNNs have a high number of parameters due to dense connections.

4. **Robustness to Input Variations**:
   - CNNs are more robust to variations and translations in the input data.
   - FCNNs are more sensitive to the exact position of features.

## Tabular Comparison

| Feature                           | Fully Connected Neural Network (FCNN) | Convolutional Neural Network (CNN) |
|-----------------------------------|---------------------------------------|------------------------------------|
| **Architecture**                  | Dense connections between layers      | Convolutional and pooling layers   |
| **Feature Learning**              | No spatial hierarchies                | Captures spatial hierarchies       |
| **Input Format**                  | Flattened vectors                     | Preserves 2D structure of images   |
| **Parameter Efficiency**          | Large number of parameters            | Fewer parameters due to sharing and sparsity |
| **Computation Cost**              | High                                  | Lower for large inputs             |
| **Translation Invariance**        | Lacks translation invariance          | Responses shift with the input; pooling adds approximate invariance |
| **Local Pattern Detection**       | Not specialized for local patterns    | Efficiently detects local patterns |
| **Robustness to Input Variations**| Sensitive to position changes         | Robust to position and translation of features |
| **Use Case Suitability**          | General-purpose tasks                 | Tasks involving spatial data (e.g., image processing) |

## Conclusion

In conclusion, CNNs are more efficient and effective than FCNNs for tasks involving spatial data, such as image processing, due to their ability to capture spatial hierarchies, reduce the number of parameters through shared weights, and robustly detect patterns regardless of their position in the input. FCNNs, on the other hand, are more suitable for tasks where spatial structure is not as important.

## References

- <a href="https://machinelearningmastery.com/how-to-visualize-filters-and-feature-maps-in-convolutional-neural-networks/?fbclid=IwAR3SdRsa8Esc_VyjvjASkwQvh5VO4gr_KSxb7xALWwBWEEck59AIlee8baE">VGG layer by layer filter bank and feature map visualization blog with code</a>

- <a href="https://github.com/sthanhng/CNN-Visualization?tab=readme-ov-file">keras repo</a>