## What is object detection?

<!-- > Object detection is a computer vision and image processing system that detects instances of semantic items of a certain class (such as individuals, buildings, or vehicles) in digital photos and videos. It enables the recognition and localization of objects in an image or video. Object detection methods identify objects by using specialized aspects of object classes, and there are neural network-based and non-neural ways for detecting things. Image annotation, vehicle counting, activity recognition, face detection, face recognition, and video object co-segmentation are just a few of the computer vision jobs that can benefit from this approach. It is essential in a variety of applications, including image retrieval and video surveillance. -->

> Object detection combines two tasks: classification and localization. It tells us not only which class an object belongs to, but also where the object is. In short, object detection finds **what** and **where** the (possibly multiple) objects in an image are.

> Unlike a plain classifier, which outputs only a set of class probabilities, an object detection model outputs the probability that the object belongs to a class together with four continuous values that describe the position of the object.

$\hat{y} = [p, x_1, y_1, x_2, y_2]$

> The position of an object is described by a *bounding box*, the rectangular area in which the object is expected to lie. There are two common ways to define a bounding box:
  - ($x_1, y_1$) is the upper-left corner and ($x_2, y_2$) is the bottom-right corner.
  - Two values give the centre point of the box, and the other two values give its height and width.
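
The two bounding box conventions carry the same information, so converting between them is simple arithmetic. The sketch below is only illustrative; the function names are made up for this example, and it assumes image coordinates where $y$ grows downwards.

```python
def corners_to_center(x1, y1, x2, y2):
    """Convert (upper-left, bottom-right) corners to (cx, cy, w, h)."""
    w = x2 - x1
    h = y2 - y1
    return (x1 + w / 2, y1 + h / 2, w, h)


def center_to_corners(cx, cy, w, h):
    """Convert (cx, cy, w, h) back to (x1, y1, x2, y2) corners."""
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


box = (10, 20, 50, 80)
center = corners_to_center(*box)
print(center)                      # (30.0, 50.0, 40, 60)
print(center_to_corners(*center))  # (10.0, 20.0, 50.0, 80.0)
```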

## RCNN model
Instead of the sliding-window approach, RCNN works on region proposals; it was one of the first region-based detection architectures.

Main components of RCNN:
- Region proposals: Candidate regions are proposed in this stage. Each region indicates that there is a high chance an object is present there. The **selective search** algorithm scans the input image for blob-like regions and proposes them as Regions of Interest (ROIs).
- Feature extraction: A pretrained CNN model is applied to each region proposal to extract features from that candidate. The pretrained network is fine-tuned on the warped region proposals using a softmax classification output layer.
- Classification: Linear SVMs classify the features that the pretrained network extracted from each candidate region. The SVMs perform one-vs-rest classification to assign an object to a particular class.

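Disadvantages of RCNN:
1. Object detection is very slow, because the CNN is run separately on every region proposal
1. 3-stage training process (CNN fine-tuning, SVMs, bounding box regressors)
1. Two-stage pipeline (region proposals first, then classification)
1. Training is expensive in terms of both time and storage space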

## Fast RCNN
Fast RCNN consists of a pretrained CNN model whose last pooling layer is replaced by an **ROI Pooling** layer and whose final FC layer is replaced by two branches: a (K + 1)-category softmax branch and a category-specific bounding box regression branch.

The entire image is fed into the backbone CNN model and the features from the last convolutional layer are obtained. Depending on the backbone CNN used, the output feature maps are much smaller than the original image size.

The region proposal windows are obtained from a region proposal algorithm such as selective search.

The ROIs generated by selective search are then projected onto the feature maps by multiplying their coordinates by the subsampling ratio of the backbone. Each projected ROI is passed through the ROI pooling layer, which divides it into a 7x7 grid and takes the max value from each grid cell. This produces 49 values per feature-map channel, whatever the original size of the ROI.

The pooling output is then fed into the successive FC layers, followed by the softmax and bounding box (BB) regression branches. The softmax classification branch produces the probability of each ROI belonging to each of the K categories plus one catch-all background category. The output of the BB regression branch is used to make the bounding boxes from the region proposal algorithm more precise.

Disadvantages of Fast RCNN:
1. Still a two-stage pipeline
1. The selective search algorithm remains a bottleneck because it is very slow at generating region proposals. About 2 seconds of inference time per image is too slow for real-time object detection in videos.


### What is ROI pooling?

In object detection and recognition tasks, Region of Interest (ROI) pooling is a technique used in the Fast R-CNN algorithm to extract fixed-length feature vectors from regions of varying sizes within an input image.

The Fast R-CNN algorithm first proposes candidate regions of interest in the image using a selective search algorithm or some other region proposal method. These proposed regions are then pooled into fixed-size feature maps using ROI pooling.

The ROI pooling layer takes as input the proposed regions of interest and the feature maps generated by a convolutional neural network (CNN) applied to the input image. Each proposed region is first projected onto the feature map and then divided into a fixed grid of bins (for example 7x7), whatever its size or aspect ratio. Unlike the image warping used in RCNN, nothing is resized here; only the bin sizes change from region to region.

Then, the features within each bin are pooled using max-pooling, resulting in a fixed-size feature map for each proposed region of interest. These feature maps are then fed into a fully connected layer that produces a feature vector for each proposed region of interest.

The ROI pooling layer is important because it allows the Fast R-CNN algorithm to handle variable-sized inputs and variable-sized regions of interest within the input image. By pooling the features within each proposed region of interest into a fixed-size feature map, the Fast R-CNN algorithm can extract useful features for object detection and recognition while maintaining spatial information about the regions of interest.
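
The ROI pooling steps described above can be sketched in plain Python. This is a minimal illustration and not the Fast R-CNN implementation: it handles a single feature-map channel stored as a 2D list, uses a 2x2 output grid instead of 7x7 to keep the example small, and assumes a subsampling ratio (`stride`) of 16. The function name and parameters are hypothetical.

```python
import math


def roi_max_pool(feature_map, roi, out_size=2, stride=16):
    """Max-pool one ROI on a single-channel feature map (2D list).

    roi is (x1, y1, x2, y2) in image pixels; stride is the assumed
    subsampling ratio between the image and the feature map.
    """
    rows, cols = len(feature_map), len(feature_map[0])
    # Project the ROI from image coordinates onto the feature map.
    x1, y1, x2, y2 = (int(round(v / stride)) for v in roi)
    # Clip to the feature map and keep at least one cell per side.
    x1, y1 = min(max(x1, 0), cols - 1), min(max(y1, 0), rows - 1)
    x2, y2 = min(max(x2, x1 + 1), cols), min(max(y2, y1 + 1), rows)
    h, w = y2 - y1, x2 - x1

    pooled = []
    for i in range(out_size):
        # Row range of bin i (floor for the start, ceil for the end).
        r0 = y1 + math.floor(i * h / out_size)
        r1 = y1 + math.ceil((i + 1) * h / out_size)
        row = []
        for j in range(out_size):
            c0 = x1 + math.floor(j * w / out_size)
            c1 = x1 + math.ceil((j + 1) * w / out_size)
            # Keep only the maximum activation inside the bin.
            row.append(max(feature_map[r][c]
                           for r in range(r0, r1)
                           for c in range(c0, c1)))
        pooled.append(row)
    return pooled


fmap = [[1, 2, 3, 4],
        [5, 6, 7, 8],
        [9, 10, 11, 12],
        [13, 14, 15, 16]]
print(roi_max_pool(fmap, (0, 0, 64, 64)))   # [[6, 8], [14, 16]]
print(roi_max_pool(fmap, (16, 0, 64, 48)))  # [[7, 8], [11, 12]]
```

Both ROIs produce a 2x2 output even though they cover different areas of the feature map, which is the property the fully connected layers rely on.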

Fast RCNN is a deep learning algorithm used for object detection and recognition. It starts by using a pre-trained CNN model to extract features from an input image. These features are obtained from the last convolutional layer of the CNN and are much smaller in size compared to the original image.

Next, candidate regions of interest (ROIs) are generated using a region proposal algorithm like selective search. The ROIs are then projected onto the feature maps by multiplying their coordinates by the subsampling ratio and passed through an ROI pooling layer. The ROI pooling layer extracts fixed-size feature maps from the variable-sized ROIs using max-pooling.

The output from the ROI pooling layer is then fed into the successive fully-connected layers, followed by two branches - a softmax layer for classifying each ROI into one of K+1 categories (K object classes and a background class) and a category-specific bounding box regression layer for refining the bounding box coordinates.

During training, the algorithm minimizes two losses - a classification loss and a bounding box regression loss - using backpropagation. The resulting model can then be used to detect objects in new images by proposing ROIs and passing them through the trained Fast RCNN model to obtain the class and bounding box coordinates for each detected object.
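
The two losses are usually combined into a single multi-task loss. The form below follows the formulation in the Fast R-CNN paper; the symbols are not used elsewhere in these notes and are introduced only for this formula: $p$ is the predicted class distribution, $u$ is the true class (with $u = 0$ for background), $t^u$ is the predicted box offset for class $u$, $v$ is the target box offset, and $\lambda$ balances the two terms.

$L(p, u, t^u, v) = L_{cls}(p, u) + \lambda \, [u \geq 1] \, L_{loc}(t^u, v)$

Here $L_{cls}(p, u) = -\log p_u$ is the classification loss, and the indicator $[u \geq 1]$ switches off the bounding box loss $L_{loc}$ for background ROIs, which have no ground truth box.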

In summary, Fast RCNN is an effective algorithm for object detection and recognition that uses pre-trained CNN models, ROI pooling layers, and a multi-task loss function to generate accurate predictions for object classes and bounding boxes in an image.

## Faster RCNN
Key points:
- It introduces a **region proposal network** (RPN), a fully convolutional network that generates proposals with various scales and aspect ratios. The RPN acts as an **attention** mechanism that tells the detection network where to look.
- Rather than using **pyramids of images** or **pyramids of filters**, Faster RCNN introduces the concept of anchor boxes. An anchor box is a reference box with a specific scale and aspect ratio. With multiple reference anchor boxes, each location is covered at multiple scales and aspect ratios. This can be thought of as a **pyramid of reference anchor boxes**. Each region is compared against every reference anchor box, which allows objects to be detected at different scales and aspect ratios.
- The convolutional computations are shared between the RPN and the Fast R-CNN detector, which reduces computation time.
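
To make the idea of anchor boxes concrete, the sketch below generates the reference boxes for a single location. It assumes three scales (128, 256, 512 pixels) and three aspect ratios (1:2, 1:1, 2:1), the defaults reported in the Faster R-CNN paper, which gives 9 anchors per location. The function name and its parameters are hypothetical.

```python
import math


def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (x1, y1, x2, y2) anchors centred on (cx, cy).

    ratio is height / width; every anchor of a given scale keeps
    an area of roughly scale ** 2.
    """
    anchors = []
    for scale in scales:
        for ratio in ratios:
            w = scale / math.sqrt(ratio)
            h = scale * math.sqrt(ratio)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


anchors = make_anchors(300, 200)
print(len(anchors))  # 9
print(anchors[1])    # (236.0, 136.0, 364.0, 264.0), the square 128 x 128 anchor
```

In the full network the same set of anchors is placed at every position of the feature map, and the RPN scores and refines each one.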

## IoU
Intersection over Union (IoU) is a metric used to quantify how well a predicted bounding box matches the ground truth bounding box. It is the ratio of the area of intersection of the two boxes to the area of their union, so it ranges from 0 (no overlap) to 1 (identical boxes).
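
A minimal sketch of the IoU computation, assuming boxes are given as `(x1, y1, x2, y2)` corners with `x1 < x2` and `y1 < y2`. The function name is chosen for this example.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap; zero when the boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # overlap 1, union 7
print(iou((0, 0, 2, 2), (0, 0, 2, 2)))  # 1.0
print(iou((0, 0, 1, 1), (5, 5, 6, 6)))  # 0.0
```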

## Non-max suppression
**Non-maximum suppression** is a computer vision method that selects a single entity out of many overlapping entities (for example bounding boxes in object detection). Entities whose probability is below a given bound are usually discarded first. From the remaining entities we repeatedly pick the one with the highest probability, output it as a prediction, and discard any remaining box where $IoU \geq 0.5$ with the box output in the previous step.

- Start by discarding all bounding boxes whose probability is below the probability threshold
- While any bounding boxes remain:
  - Take out the box with the largest probability and output it as a prediction
  - Remove all other boxes whose IoU with that box is at or above the IoU threshold
- Do this for each class
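
The procedure above can be sketched for a single class as follows. This is a plain-Python illustration rather than an optimized implementation; the function names, the `(box, score)` tuple layout, and the default thresholds of 0.5 are assumptions made for this example.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def nms(detections, score_threshold=0.5, iou_threshold=0.5):
    """Non-max suppression over a list of (box, score) for one class."""
    # Step 1: drop low-probability boxes, then sort by score (highest first).
    candidates = sorted(
        (d for d in detections if d[1] >= score_threshold),
        key=lambda d: d[1],
        reverse=True,
    )
    kept = []
    while candidates:
        # Step 2: the highest-scoring remaining box becomes a prediction.
        best = candidates.pop(0)
        kept.append(best)
        # Step 3: discard boxes that overlap it too much.
        candidates = [d for d in candidates
                      if iou(best[0], d[0]) < iou_threshold]
    return kept


detections = [
    ((0, 0, 10, 10), 0.9),
    ((1, 1, 11, 11), 0.8),    # overlaps the first box heavily
    ((20, 20, 30, 30), 0.7),  # separate object
    ((0, 0, 5, 5), 0.3),      # below the score threshold
]
print(nms(detections))
# [((0, 0, 10, 10), 0.9), ((20, 20, 30, 30), 0.7)]
```

For multi-class detection, `nms` would be called once per class on that class's detections.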

## YOLO: You only look once
- The input image is divided into an $S \times S$ grid. If the center of an object falls into a grid cell, that grid cell is responsible for detecting that object.
- Each grid cell predicts $B$ bounding boxes and confidence scores for those boxes.
- These confidence scores reflect how confident the model is that the box contains an object and how accurate it thinks the predicted box is.
- Each bounding box consists of 5 predictions: x, y, w, h and confidence.
  - The $(x, y)$ coordinates represent the center of the box relative to the bounds of the grid cell.
  - The width $w$ and height $h$ are predicted relative to the whole image.
  - The confidence represents the Intersection Over Union (IOU) between the predicted box and any ground truth box.
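
The sketch below shows how a ground truth box could be turned into the targets described above: it finds the responsible grid cell and expresses $(x, y)$ relative to that cell and $(w, h)$ relative to the whole image. The function name is hypothetical, and $S = 7$ with a 448 x 448 input is assumed only because those are the values used in the original YOLO paper.

```python
def encode_box(cx, cy, w, h, img_w, img_h, S=7):
    """Encode a box given in pixels as YOLO-style targets.

    Returns (row, col, x, y, w_rel, h_rel), where (row, col) is the
    responsible grid cell, (x, y) is the centre offset inside that cell
    in [0, 1), and (w_rel, h_rel) are fractions of the image size.
    """
    # Position of the centre measured in grid-cell units.
    gx = cx / img_w * S
    gy = cy / img_h * S
    # The cell containing the centre is responsible for the object.
    col = min(int(gx), S - 1)
    row = min(int(gy), S - 1)
    return row, col, gx - col, gy - row, w / img_w, h / img_h


row, col, x, y, w_rel, h_rel = encode_box(224, 100, 128, 64, 448, 448)
print(row, col)      # 1 3
print(x, y)          # offsets of the centre inside cell (1, 3)
print(w_rel, h_rel)  # box size as a fraction of the image
```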

The YOLO algorithm works using the following techniques:
- Grid cells: The image is divided into an $S \times S$ grid of cells. Every cell detects objects that appear within it. If an object center appears within a certain grid cell, then this cell is responsible for detecting it.
- Bounding box regression: YOLO uses a single regression pass to predict the height, width, center, and class of objects.
- IOU: The IOU is equal to 1 if the predicted bounding box is the same as the real box. Predicted boxes whose IOU with the real box is too low are eliminated.
- Non-Max Suppression: Setting a threshold for the IOU is not always enough, because an object can have multiple boxes with an IOU beyond the threshold, and keeping all of those boxes might include noise. Using non-max suppression, we keep only the boxes with the highest probability score of detection.

## Canny edge detection
Canny edge detection is a fundamental image processing technique used in computer vision to identify the edges of objects in an image. Its overall objective is to identify edges with high accuracy and a low false-positive rate.

To start with, a Gaussian filter is applied to smooth the image and reduce noise. The technique then computes the gradient magnitude and direction at each pixel by applying Sobel operators, which are convolutional filters. The gradient magnitude represents the strength of an edge, while the gradient direction, which is perpendicular to the edge, indicates its orientation.

After computing the gradient magnitude and direction, the technique applies non-maximum suppression to eliminate pixels that are not real edge points. Non-maximum suppression takes advantage of the fact that an edge has a local maximum along the gradient direction, so only the pixels with the highest magnitude along that direction are retained. This thins the edges.

The edge detection process continues with double thresholding, which uses two thresholds: a low threshold and a high threshold. Edge pixels whose magnitude is above the high threshold are considered strong edges, those between the low and high thresholds are considered weak edges, and those below the low threshold are discarded.

The final step in the Canny edge detection process is edge tracking by hysteresis. Weak edge pixels that are connected to a strong edge are retained as part of that edge, while weak pixels that are not connected to any strong edge are eliminated. By doing so, Canny edge detection obtains an accurate edge map that identifies the true edges in the image with few false positives.
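
The last two steps (double thresholding and edge tracking by hysteresis) can be sketched on a small grid of gradient magnitudes. This is a simplified illustration, not a full Canny implementation: it assumes smoothing, gradient computation, and non-maximum suppression have already produced the `magnitude` grid, and it treats the 8 surrounding pixels as neighbours. The function name and threshold values are made up for the example.

```python
from collections import deque


def hysteresis(magnitude, low, high):
    """Return a boolean edge map from a 2D list of gradient magnitudes."""
    rows, cols = len(magnitude), len(magnitude[0])
    edges = [[False] * cols for _ in range(rows)]
    queue = deque()

    # Strong pixels (>= high) are edges and seed the tracking.
    for r in range(rows):
        for c in range(cols):
            if magnitude[r][c] >= high:
                edges[r][c] = True
                queue.append((r, c))

    # Grow from strong pixels into connected weak pixels (>= low).
    while queue:
        r, c = queue.popleft()
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nr, nc = r + dr, c + dc
                if (0 <= nr < rows and 0 <= nc < cols
                        and not edges[nr][nc]
                        and magnitude[nr][nc] >= low):
                    edges[nr][nc] = True
                    queue.append((nr, nc))
    return edges


magnitude = [
    [0, 0, 0, 0, 0],
    [0, 90, 40, 0, 40],  # strong, weak (connected), gap, weak (isolated)
    [0, 0, 0, 0, 0],
]
edge_map = hysteresis(magnitude, low=30, high=80)
print(edge_map[1])  # [False, True, True, False, False]
```

The weak pixel next to the strong one survives, while the isolated weak pixel at the end of the row is dropped.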

## SIFT
SIFT (Scale-Invariant Feature Transform) is a widely used computer vision technique for image feature extraction and matching. SIFT keypoints are local features of an image that are invariant to scale and orientation, and partially invariant to affine distortion. The SIFT algorithm works in the following steps:
1. Scale-space extrema detection: The first step in SIFT is to detect keypoints in an image that are stable across scale and orientation changes. This is done by constructing a scale-space representation of the image and then detecting extrema of the difference-of-Gaussian function.

2. Keypoint localization: Once the keypoints are found, they must be localized accurately. This is done by fitting a 3D quadratic function to the scale-space extrema.

3. Orientation assignment: SIFT computes keypoint orientations based on image gradient directions. A histogram of gradient directions is built around the keypoint, and the peak of the histogram is assigned as the orientation of the keypoint.

4. Descriptor computation: After orientation assignment, a feature descriptor is computed for each keypoint. The descriptor is a 128-dimensional vector that captures the local appearance around the keypoint. The vector is computed using weighted histograms of gradient directions in the image patch around the keypoint.

5. Matching: Once the feature descriptors are computed for keypoints in multiple images, SIFT matches keypoints between the images using a distance metric. The nearest neighbor, or best match, is selected based on the Euclidean distance between feature descriptors.

6. Outlier rejection: To remove outliers in the matching process, a match is rejected when the ratio of its best-match distance to its second-best-match distance is large (close to 1), because such a match is ambiguous.
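
Steps 5 and 6 (nearest-neighbor matching followed by the ratio test) can be sketched as below. The descriptors here are toy 2-dimensional tuples rather than real 128-dimensional SIFT descriptors, the function name is hypothetical, and the ratio threshold of 0.75 is an assumed value chosen for illustration.

```python
import math


def match_descriptors(desc_a, desc_b, ratio=0.75):
    """Match descriptors from image A to image B using the ratio test.

    Returns a list of (index_in_a, index_in_b) pairs.
    """
    matches = []
    for i, da in enumerate(desc_a):
        # Euclidean distance to every descriptor in B, nearest first.
        dists = sorted((math.dist(da, db), j) for j, db in enumerate(desc_b))
        if len(dists) < 2:
            continue  # the ratio test needs a second-best match
        (best, j), (second, _) = dists[0], dists[1]
        # Keep the match only if the best is clearly better than the runner-up.
        if best < ratio * second:
            matches.append((i, j))
    return matches


desc_a = [(0.0, 0.0), (5.0, 5.0)]
desc_b = [(0.1, 0.0), (4.0, 4.0), (6.0, 6.0)]
print(match_descriptors(desc_a, desc_b))  # [(0, 0)]
```

The second descriptor in `desc_a` is equally close to two candidates in `desc_b`, so its match is ambiguous and the ratio test rejects it.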

## CNNs
- https://medium.com/@ishandandekar/introduction-to-convolutional-neural-networks-part-1-c02b9fa3bcf2
- https://medium.com/@ishandandekar/introduction-to-convolutional-neural-networks-part-2-aab33e76cea1



## Comparing object detection algorithms
1. Speed: YOLO (You Only Look Once) is the fastest of these algorithms because it takes a single pass over the image and predicts bounding boxes directly. R-CNN is the slowest, as it runs a CNN on every proposed region. Fast R-CNN runs the CNN only once per image but still depends on the slow selective search step. Faster R-CNN replaces selective search with a Region Proposal Network (RPN), which makes it faster than R-CNN and Fast R-CNN, but its two-stage design still makes it slower than YOLO.

2. Localization accuracy: Faster R-CNN generally localizes objects more accurately than the other three algorithms. R-CNN and Fast R-CNN tend to have lower localization accuracy than Faster R-CNN, while YOLO trades some localization precision for speed.

3. Detection accuracy: Faster R-CNN and recent YOLO versions generally achieve higher detection accuracy than R-CNN and Fast R-CNN.

4. Training time: R-CNN takes much longer to train because it trains several models in separate stages. Fast R-CNN, Faster R-CNN, and YOLO each train a single network, although Fast R-CNN still depends on externally generated region proposals.

5. Object class recognition: YOLO looks at the whole image in a single pass, which helps it use global context when recognizing multiple object classes in an image. Note, however, that the original YOLO predicts only a few boxes per grid cell, so it can struggle with small objects that appear in groups.

## Face recognition
- Face identification: A one-to-many match that compares a query face image against all the template images in the database to determine the identity of the query face.
  - The person's image is compared with the images of all the other people present in the database.
  - One-to-many comparison

- Face verification: A one-to-one match that compares a query face image against a template face image whose identity is being claimed.
  - The person's image is saved in the database, and the person's new input image is compared with the existing image in the database.
  - One-to-one comparison

## Face verification
- Face image embeddings of each user are stored in a database
- A user provides an input image, claiming to be person A
- The distance is calculated as the L2 norm of the difference between the user's face embedding $e_{\text{user}}$ and person A's face embedding $e_{A}$
  $\text{Distance} = \lVert e_{\text{user}} - e_{A} \rVert_2$
- A distance threshold is set. If the computed distance is less than the distance threshold, then the user is verified as person A
- Otherwise, the user is not person A
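
A minimal sketch of the verification check, assuming the face embeddings have already been produced by some embedding model. The toy 3-dimensional embeddings, the function name, and the threshold of 0.6 are all assumptions for illustration; real embeddings are much higher-dimensional and the threshold has to be tuned on validation data.

```python
import math


def verify(embedding, claimed_embedding, threshold=0.6):
    """Return True if the two embeddings are close enough to be the same person."""
    distance = math.dist(embedding, claimed_embedding)  # L2 distance
    return distance < threshold


person_a = (0.10, 0.90, 0.30)       # stored embedding of person A
new_photo = (0.12, 0.88, 0.31)      # embedding of a new photo of person A
someone_else = (0.90, 0.10, 0.50)   # embedding of a different person

print(verify(new_photo, person_a))     # True
print(verify(someone_else, person_a))  # False
```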

## Face identification
- Face image embeddings of all users are stored in a database
- The person of interest is selected
- The person's face embedding is calculated
- The L2 distance between the person's face embedding and every face embedding in the database is computed
- If the smallest distance is less than a distance threshold, the person is identified as the closest match in the database. Otherwise, the person is not in the database.
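
Identification is a search over the whole database for the closest embedding. The sketch below assumes the database is a plain dictionary from (hypothetical) names to toy 3-dimensional embeddings; the function name and the threshold of 0.6 are likewise assumptions.

```python
import math


def identify(embedding, database, threshold=0.6):
    """Return the name of the closest database entry, or None if no entry
    is within the distance threshold."""
    best_name, best_distance = None, float("inf")
    for name, stored in database.items():
        distance = math.dist(embedding, stored)  # L2 distance
        if distance < best_distance:
            best_name, best_distance = name, distance
    return best_name if best_distance < threshold else None


database = {
    "alice": (0.10, 0.90, 0.30),
    "bob": (0.90, 0.10, 0.50),
}
print(identify((0.12, 0.88, 0.31), database))  # alice
print(identify((5.0, 5.0, 5.0), database))     # None
```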

## Siamese network
A **Siamese neural network** is an artificial neural network that uses the same weights while working in tandem on two different inputs to compute comparable output vectors.

### Triplet loss for siamese network
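$L = \max(d(a, p) - d(a, n) + \text{margin}, 0)$

Here $a$ is the anchor input, $p$ is a positive example (same identity as the anchor), $n$ is a negative example (different identity), and $d$ is the distance between their embeddings.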
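
A small worked example of the triplet loss, assuming $d$ is the Euclidean distance between embeddings (some implementations use the squared distance instead). The 2-dimensional embeddings, the function name, and the margin of 0.2 are illustrative choices.

```python
import math


def triplet_loss(anchor, positive, negative, margin=0.2):
    """max(d(a, p) - d(a, n) + margin, 0) with Euclidean distance d."""
    d_ap = math.dist(anchor, positive)
    d_an = math.dist(anchor, negative)
    return max(d_ap - d_an + margin, 0.0)


anchor = (0.0, 0.0)
positive = (0.1, 0.0)

# Easy negative: already much farther away than the positive, so no loss.
print(triplet_loss(anchor, positive, (1.0, 0.0)))  # 0.0

# Hard negative: closer than d(a, p) + margin, so the loss is positive.
print(triplet_loss(anchor, positive, (0.2, 0.0)))
```

The loss becomes zero once the negative is at least `margin` farther from the anchor than the positive is, which is what training pushes the embeddings towards.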

## Inception network
Uses $1 \times 1$ convolutional layers to reduce the computational cost. Such a layer is also called a reduce or bottleneck layer.

The idea was to add a **$1 \times 1$ convolutional layer** before bigger kernels like $3 \times 3$ and $5 \times 5$ to reduce the depth of their input, which in turn reduces the number of operations.
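
The saving can be checked with some quick arithmetic. The example below assumes a 28 x 28 x 192 input, 32 output filters of size 5 x 5, a bottleneck of 16 channels, stride 1, and padding that keeps the spatial size unchanged; these numbers are chosen for illustration and are not taken from the notes above.

```python
def conv_mults(height, width, in_channels, out_channels, kernel):
    """Multiplications for a stride-1 convolution that keeps the spatial size."""
    return height * width * out_channels * kernel * kernel * in_channels


# 5 x 5 convolution applied directly to the 192-channel input.
direct = conv_mults(28, 28, 192, 32, 5)

# 1 x 1 bottleneck down to 16 channels, then the 5 x 5 convolution.
reduced = conv_mults(28, 28, 192, 16, 1) + conv_mults(28, 28, 16, 32, 5)

print(direct)   # 120422400
print(reduced)  # 12443648
```

Under these assumptions the bottleneck cuts the number of multiplications by roughly a factor of ten while producing an output of the same shape.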

1. An Inception block contains four parallel branches:
   - 1 x 1 convolutional layer
   - 1 x 1 convolutional layer + 3 x 3 convolutional layer
   - 1 x 1 convolutional layer + 5 x 5 convolutional layer
   - 3 x 3 pooling layer + 1 x 1 convolutional layer
1. 1 x 1 conv layers are used for depth dimensionality reduction, which reduces the number of floating point operations.
1. Uses Global Average Pooling instead of Flatten
1. Uses auxiliary classifiers for prediction and gradient propagation at intermediate parts of the network

## ResNet
Downsides of deeper networks:
1. Adding too many layers makes the network prone to overfitting on the training data
   - However, overfitting can be addressed using regularization, dropout and batch normalization
1. Vanishing and exploding gradients
   - During backprop, the chained multiplication can make the gradient from the later layers very small by the time it reaches the initial layers of the network. This applies especially to saturating activation functions like `tanh` and `sigmoid`, whose derivatives shrink the gradient.
   - During backprop, the gradient can also grow exponentially during the chained multiplication and take very large values, thus exploding
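
The vanishing effect is easy to see numerically. The derivative of the sigmoid is at most 0.25 (reached at an input of 0), so a chain of sigmoid layers multiplies the gradient by at most 0.25 per layer. The sketch below is a deliberately simplified illustration: it ignores the weight matrices and evaluates every layer at its best case of input 0.

```python
import math


def sigmoid_grad(z):
    """Derivative of the sigmoid function at z."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)


gradient = 1.0
for layer in range(10):
    # Chain rule: multiply by the local derivative of each layer.
    gradient *= sigmoid_grad(0.0)

print(sigmoid_grad(0.0))  # 0.25
print(gradient)           # 0.25 ** 10, below one millionth
```

Even in this best case the gradient reaching the first layer is more than a million times smaller than the gradient at the output after only 10 layers.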

To solve the vanishing gradient problem, ResNet uses a shortcut that allows the gradient to be backpropagated directly to earlier layers. These shortcuts are called skip connections.

Skip connections make it easy for a block to learn the identity function, which means that adding the block should not make the network perform worse than it did without it.
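
The identity behaviour can be shown with a toy residual block that computes `relu(F(x) + x)`, where `F` stands for the main path. This is only a sketch on plain Python lists: the function names are hypothetical, and the main path is replaced by simple stand-in functions instead of real convolutions.

```python
def relu(values):
    """Element-wise ReLU on a list of floats."""
    return [max(v, 0.0) for v in values]


def residual_block(x, main_path):
    """Add the shortcut (x itself) to the main path output, then apply ReLU."""
    fx = main_path(x)
    return relu([f + s for f, s in zip(fx, x)])


def zero_path(x):
    """Stand-in main path whose weights have all been driven to zero."""
    return [0.0 for _ in x]


def small_update_path(x):
    """Stand-in main path that learns a small correction to its input."""
    return [0.1 * v for v in x]


x = [1.0, 2.0, 3.0]  # non-negative, as it would be after an earlier ReLU
print(residual_block(x, zero_path))          # [1.0, 2.0, 3.0]
print(residual_block(x, small_update_path))
```

When the main path outputs zeros the block simply passes its input through, so the block only has to learn a correction on top of the identity.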

Residual blocks:
- Shortcut path: Connects the input directly to the addition at the end of the main path
- Main path: A series of convolutions and activations. The main path consists of three convolutional layers with ReLU activations.

Residual blocks of this kind start with a **1 x 1** conv layer that reduces the channel depth of the input volume, followed by a 3 x 3 conv layer and another 1 x 1 conv layer that restores the channel depth of the output.
This is a good technique for keeping the volume dimensions under control across many layers. This configuration is called a bottleneck residual block.