# Computer Vision - Introduction 

**1.  Object Classification**-
Tells you what the “main subject” of the image is

**2. Object Localization**- Predict and draw a bounding box around an object in an image

**3. Object Detection**- Find multiple objects, classify them, and locate where they are in the image.

![](image/intro.jpeg)

**Why it is difficult**

1. An image can contain a varying number of objects, and we do not know ahead of time how many to expect.
2. Choosing the right crop is not trivial, as we may encounter any number of objects, each of which:
    1. can be at any place.
    2. can be of any aspect ratio.
    3. can be of any size.


# General object detection framework

Typically, there are three steps in an object detection framework. 
1. **Object localisation component** 
<br>
A model or algorithm is used to generate regions of interest or region proposals. These region proposals are a large set of bounding boxes spanning the full image.

Some of the famous approaches:
* **Selective Search**  - A clustering based approach which attempts to group pixels and generate proposals based on the generated clusters.
* **Region Proposal using Deep Learning Model** - Regions are generated from features that a deep learning model extracts from the image.
* **Brute Force** - Similar to a sliding window that is applied to the image, over several ratios and scales. These regions are generated automatically, without taking into account the image features.


![](image/anchor_boxes.png)


**Things to note:**
* The trade-off in region proposal generation is the number of regions vs. the computational complexity.
* Use problem-specific information to reduce the number of ROIs (e.g. pedestrians typically have an aspect ratio of approximately 1.5, so it is not useful to generate ROIs with a ratio of 0.25).
![](image/ped_car.jpg)


2. **Object classification component** 
<br>
In the second step, visual features are extracted for each bounding box. These features are then evaluated to determine whether an object is present in the proposal and, if so, which one.

**Some of the famous approaches:**
* Use pretrained image classification models to extract visual features
* Traditional Computer Vision (filter-based approaches, histogram methods, etc.)

![](image/cnn_layer.PNG)
![](image/feature_map.PNG)

3. **Non maximum suppression**
<br>
In the final post-processing step, the number of detections in a frame is reduced to the actual number of objects present, so that overlapping boxes for the same object collapse into a single bounding box.
<br>
NMS techniques are largely standard across detection frameworks, but this is an important step that may require hyperparameter tuning depending on the scenario.

Predicted
![](image/NMS_1.svg)

Desired
![](image/NMS_2.svg)
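
The step from "Predicted" to "Desired" can be sketched as a greedy NMS in NumPy. This is a minimal sketch rather than the implementation of any particular framework: the function names, the `(x_min, y_min, x_max, y_max)` box layout and the 0.5 overlap threshold are assumptions made for illustration.

```python
import numpy as np

def iou_one_to_many(box, boxes):
    """IoU between one box and an (n, 4) array of boxes in (x_min, y_min, x_max, y_max) form."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: return the indices of the boxes that are kept, highest score first."""
    order = np.argsort(scores)[::-1]          # candidates sorted by descending score
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        overlaps = iou_one_to_many(boxes[best], boxes[rest])
        order = rest[overlaps <= iou_threshold]  # drop boxes that overlap the kept one too much
    return keep

# Two heavily overlapping boxes on one object, plus one box elsewhere (made-up values)
boxes = np.array([[10, 10, 50, 50], [12, 12, 52, 52], [100, 100, 140, 140]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
kept = nms(boxes, scores)
print(kept)
```

The threshold is the hyperparameter mentioned above: lowering it suppresses more aggressively, raising it lets more overlapping boxes survive.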


# Evaluation Metric

**Evaluation**
<br>
Evaluation metric used:
*  Mean Average Precision (mAP or  mAP@0.5 or mAP@0.25) -
    *  It is a number from 0 to 100 and higher values are typically better

# Concepts

## Bounding Box Representation

Bounding box is represented using : 
x_min , y_min , x_max , y_max
![](image/bounding_1.PNG)

But pixel values are next to useless if we don't know the actual dimensions of the image. A better way is to represent all coordinates in fractional form, i.e. divided by the image width or height.
![](image/bounding_2.PNG)

1. From boundary coordinates to centre-size coordinates
<br>
Function - **xy_to_cxcy**
<br>
x_min, y_min, x_max, y_max -> c_x, c_y, w, h

2. From centre-size coordinates to boundary coordinates
<br>
Function - **cxcy_to_xy**
<br>
c_x, c_y, w, h -> x_min, y_min, x_max, y_max

![](image/bounding_3.PNG)

3. Offsets of a bounding box from a prior box (used in the loss function)
<br>
Function - **cxcy_to_gcxgcy**
<br>
    * **g_c_x, g_c_y** - find the offset with respect to the prior box, and scale by the size of the prior box.
    * **g_w, g_h** - scale by the size of the prior box, and convert to log-space.

4. Decoding the predicted offsets to centre-size coordinates
<br>
Function - **gcxgcy_to_cxcy**
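
The four conversions above can be sketched in NumPy as follows. The function names come from the list above, but the bodies are an illustrative reconstruction from the descriptions, not the linked notebook's code. Some implementations additionally multiply the encoded offsets by empirical scaling constants; that is left out here, and the example box and prior values are made up.

```python
import numpy as np

def xy_to_cxcy(xy):
    """(x_min, y_min, x_max, y_max) -> (c_x, c_y, w, h), for an (n, 4) array."""
    return np.concatenate([(xy[:, 2:] + xy[:, :2]) / 2,   # centre
                           xy[:, 2:] - xy[:, :2]], axis=1)  # width, height

def cxcy_to_xy(cxcy):
    """(c_x, c_y, w, h) -> (x_min, y_min, x_max, y_max)."""
    return np.concatenate([cxcy[:, :2] - cxcy[:, 2:] / 2,
                           cxcy[:, :2] + cxcy[:, 2:] / 2], axis=1)

def cxcy_to_gcxgcy(cxcy, priors_cxcy):
    """Encode boxes as offsets from priors: centre offset scaled by prior size, log of size ratio."""
    return np.concatenate([(cxcy[:, :2] - priors_cxcy[:, :2]) / priors_cxcy[:, 2:],
                           np.log(cxcy[:, 2:] / priors_cxcy[:, 2:])], axis=1)

def gcxgcy_to_cxcy(gcxgcy, priors_cxcy):
    """Decode offsets back to centre-size coordinates (inverse of cxcy_to_gcxgcy)."""
    return np.concatenate([gcxgcy[:, :2] * priors_cxcy[:, 2:] + priors_cxcy[:, :2],
                           np.exp(gcxgcy[:, 2:]) * priors_cxcy[:, 2:]], axis=1)

# One box and one prior in fractional coordinates (illustrative values)
boxes_xy = np.array([[0.1, 0.2, 0.5, 0.8]])
priors_cxcy = np.array([[0.25, 0.45, 0.5, 0.5]])

boxes_cxcy = xy_to_cxcy(boxes_xy)
offsets = cxcy_to_gcxgcy(boxes_cxcy, priors_cxcy)
decoded_xy = cxcy_to_xy(gcxgcy_to_cxcy(offsets, priors_cxcy))
print(boxes_cxcy, offsets, decoded_xy)
```

Encoding followed by decoding returns the original box, which is a handy sanity check when wiring these functions into a loss.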



## IOU (Jaccard Index)
 
To measure how well one box matches another, we compute the IOU (intersection-over-union, also known as the Jaccard index) between the two bounding boxes.

Steps to Calculate:
1. Find Intersection
2. Find Union

Jaccard Overlap = Intersection / Union

![](image/IOU.PNG)
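
The two steps can be written out for a single pair of boxes. This is a minimal sketch that assumes boxes are given as `(x_min, y_min, x_max, y_max)` tuples; the function name `iou` is chosen here for illustration.

```python
def iou(box_a, box_b):
    """Jaccard overlap of two boxes given as (x_min, y_min, x_max, y_max)."""
    # Step 1: intersection rectangle (zero width or height if the boxes do not overlap)
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h

    # Step 2: union = sum of both areas minus the part counted twice
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection

    return intersection / union

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # overlap of 1 over a union of 7
```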

Check out the [excel sheet](https://gitlab.com/entirety.ai/meetup-intuition-to-implementation/blob/master/Phase%20-%202/SSD/Anchor%20Boxes.xlsx) for the calculations
![](image/IOU_excel.PNG)


[Codes](https://gitlab.com/entirety.ai/meetup-intuition-to-implementation/blob/master/Phase%20-%202/SSD/SSD_Helper_File.ipynb)

## mAP

Precision measures how accurate your predictions are, i.e. the percentage of your predictions that are correct.

Recall measures how well you find all the positives.
![](image/prec_rec.PNG)

However, the standard Precision and Recall metrics used in image classification cannot be applied directly here, because both the classification and the localisation of a model need to be evaluated. This is where mAP (Mean Average Precision) comes into the picture.

Calculating Precision and Recall requires counts of:
1. True Positives
2. False Positives
3. False Negatives

True Negatives do not appear in either formula, so they are not needed.

Let us see how we can calculate them in the context of object detection.

**True Positive and False Positive**
<br>
Using **IOU** we can determine whether a detection (a Positive) is correct (True) or not (False).
Consider a threshold of 0.5 for IOU and 0.5 for the confidence score:
* score >= 0.5 and IOU >= 0.5 - True Positive
* score >= 0.5 and IOU < 0.5 - False Positive

**False Negative**
<br>
These are all the ground-truth objects that our model has missed.

Using above information we can calculate:
1. Precision for each class = TP/(TP+FP)
2. Recall for each class = TP/(TP+FN)

But the values of Precision and Recall depend heavily on the thresholds assigned to the confidence score and IOU.
- For IOU, we can either fix a threshold, as in the VOC dataset, or evaluate over a range of thresholds (say 0.5 to 0.95), as in the COCO dataset.
- The confidence score varies across models: 50% confidence in my model design might be equivalent to 80% confidence in someone else’s model design, which changes the shape of the precision-recall curve.

To overcome that problem we go for ...
<br>
**Average Precision**
Area under the precision-recall curve (PR curve)

Let us take an oversimplified example where we have just 5 images containing 5 objects of the same class. We consider 10 predictions, and if IOU >= 0.5 we call it a correct prediction.

![](image/prec_rec_table.PNG)

At rank #3
<br>
Precision is the proportion of TP = 2/3 = 0.67.

Recall is the proportion of TP out of the possible positives = 2/5 = 0.4.
<br>
If we plot it
![](image/prec_rec_plot.PNG)

Things to note:
<br>
Precision will have a zig zag pattern because it goes down with false positives and goes up again with true positives.

A good classifier will be good at ranking correct images near the top of the list, and be able to retrieve a lot of correct images before retrieving any incorrect: its precision will stay high as recall increases. A poor classifier will have to take a large hit in precision to get higher recall.

Average Precision(AP) = ![](image/ap_formula_raw.PNG)

Interpolated Average Precision (IAP)
<br>
[Reference from the paper](http://homepages.inf.ed.ac.uk/ckiw/postscript/ijcv_voc09.pdf)
<br>
***Interpolation smooths out the zigzag pattern in the Precision-Recall curve caused by small variations in the ranking of examples. Graphically, at each recall level, we replace the precision value with the maximum precision value to the right of that recall level.***

![](image/interpolated_ap_formula.PNG)
![](image/ap_formula.PNG)

Further, there are variations on where to take the samples when computing the interpolated average precision. Some take samples at a fixed 11 points from 0 to 1: {0, 0.1, 0.2, …, 0.9, 1.0}. This is called the 11-point interpolated average precision. Others sample at every point where the recall changes (area under the curve).

**Interpolated Average Precision**

![](image/ap.PNG)

![](image/interpolated_ap.PNG)


We divide the recall range from 0 to 1.0 into 11 points: 0, 0.1, 0.2, …, 0.9 and 1.0. Next, we compute the average of the maximum precision values at these 11 recall levels.

In our example, AP = (5 × 1.0 + 4 × 0.57 + 2 × 0.5)/11 ≈ 0.75
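
The 11-point calculation can be sketched in plain Python. The ranking table itself is only available as an image, so the `ranked_hits` list below is a hypothetical ranking chosen to be consistent with the numbers quoted in the text (precision 2/3 and recall 0.4 at rank 3, and the AP sum above); the function names are also illustrative.

```python
def precision_recall_at_each_rank(is_correct, n_ground_truth):
    """Running precision and recall down a confidence-ranked list of predictions."""
    precisions, recalls = [], []
    true_positives = 0
    for rank, correct in enumerate(is_correct, start=1):
        true_positives += int(correct)
        precisions.append(true_positives / rank)
        recalls.append(true_positives / n_ground_truth)
    return precisions, recalls

def eleven_point_ap(is_correct, n_ground_truth):
    """Average of the interpolated precision at recall levels 0, 0.1, ..., 1.0."""
    precisions, recalls = precision_recall_at_each_rank(is_correct, n_ground_truth)
    total = 0.0
    for k in range(11):
        level = k / 10
        # Interpolated precision: the best precision at any recall >= this level
        # (small tolerance guards against floating-point rounding)
        reachable = [p for p, r in zip(precisions, recalls) if r >= level - 1e-9]
        total += max(reachable, default=0.0)
    return total / 11

# Hypothetical ranking of 10 predictions (True = IOU >= 0.5), 5 ground-truth objects
ranked_hits = [True, True, False, False, False, True, True, False, False, True]
precisions, recalls = precision_recall_at_each_rank(ranked_hits, n_ground_truth=5)
ap_11 = eleven_point_ap(ranked_hits, n_ground_truth=5)

print(round(precisions[2], 2), recalls[2])  # precision and recall at rank #3
print(round(ap_11, 3))
```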

Issues with Interpolated AP:
1. It is less precise due to interpolation.
2. It loses the ability to measure differences between methods with low AP. Therefore, a new AP calculation was introduced for PASCAL VOC after 2008.

**AP (Area under curve AUC)**
<br>
By interpolating all points, the Average Precision (AP) can be interpreted as an approximated AUC of the Precision - Recall curve. The intention is to reduce the impact of the wiggles in the curve.

Instead of sampling at fixed values, sample the curve at all unique recall values (r₁, r₂, …) at which the maximum precision value drops. With this change, we measure the exact area under the precision-recall curve after the zigzags are removed. Hence no approximation over fixed sample points is needed.

![](image/auc_plot.PNG)

![](image/auc_formulA.PNG)
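
A minimal sketch of the area-under-curve variant follows. It assumes the same kind of input as the worked example (a confidence-ranked list of correct/incorrect flags plus the number of ground-truth objects); the ranking used here is hypothetical and the function name is chosen for illustration.

```python
def average_precision_auc(is_correct, n_ground_truth):
    """Area under the precision-recall curve after removing the zigzags."""
    precisions, recalls = [], []
    true_positives = 0
    for rank, correct in enumerate(is_correct, start=1):
        true_positives += int(correct)
        precisions.append(true_positives / rank)
        recalls.append(true_positives / n_ground_truth)

    # Sentinel points so the curve spans recall 0 to 1
    recalls = [0.0] + recalls + [1.0]
    precisions = [0.0] + precisions + [0.0]

    # Replace each precision with the maximum precision to its right
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])

    # Sum rectangle areas; only steps where recall changes contribute
    return sum((recalls[i + 1] - recalls[i]) * precisions[i + 1]
               for i in range(len(recalls) - 1))

# Hypothetical ranking of 10 predictions over 5 ground-truth objects
ranked_hits = [True, True, False, False, False, True, True, False, False, True]
ap_auc = average_precision_auc(ranked_hits, n_ground_truth=5)
print(round(ap_auc, 3))
```

For this ranking the area splits into three rectangles: 0.4 × 1.0, 0.4 × 4/7 and 0.2 × 0.5.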

[Codes](https://gitlab.com/entirety.ai/meetup-intuition-to-implementation/blob/master/Phase%20-%202/SSD/SSD_without_BN_Dropout_Evaluation.ipynb#Detection%20Explained)

# SSD
**2016 - Google**

![](image/SSD.PNG)

Single Shot Multibox Detector

1. **Single Shot** - Object localization and object classification are both done in a single forward pass of the network.
2. **Multibox** - The name of a bounding box regression technique developed earlier by Szegedy et al.
3. **Detector** - The network detects objects and classifies the detected objects.

VGG-16 
![](image/vgg_16.PNG)


https://towardsdatascience.com/review-ssd-single-shot-detector-object-detection-851a94607d11

https://towardsdatascience.com/review-retinanet-focal-loss-object-detection-38fba6afabe4

https://lilianweng.github.io/lil-log/


**Modification in Vgg Network:**

![](image/cnn_layer.PNG)

1. **Input Image**
300 x 300 instead of 224 x 224
<br>
2. **3rd Pooling Layer**
Ceil mode is used instead of the default floor mode.
This is significant when the dimensions of the preceding feature map are odd rather than even. In this case the input is 75 x 75, which is halved to 38, 38 instead of an inconvenient 37, 37.
<br>
3. **Max Pooling**
5th pooling layer -
Changed from a 2, 2 kernel with stride 2 to a 3, 3 kernel with stride 1 and padding 1. As a result, it no longer halves the dimensions of the feature map from the preceding convolutional layer.
<br>
4. **Linear Layer**
We toss fc8, which is the classification layer.
We rework (using decimation) fc6 and fc7 into convolutional layers conv6 and conv7.

Rework strategy - reparameterize a fully connected layer into a convolutional layer
<br>
**Things to note**
* On an image of size H, W with I input channels, a fully connected layer of output size N is equivalent to a convolutional layer with kernel size equal to the image size H, W and N output channels.
* fc6 with a flattened input size of 7 * 7 * 512 and an output size of 4096 has parameters of dimensions 4096, 7 * 7 * 512. The equivalent convolutional layer conv6 has a 7, 7 kernel size and 4096 output channels, with reshaped parameters of dimensions 4096, 7, 7, 512.
* fc7 with an input size of 4096 (i.e. the output size of fc6) and an output size of 4096 has parameters of dimensions 4096, 4096. The input can be considered a 1, 1 image with 4096 input channels. The equivalent convolutional layer conv7 has a 1, 1 kernel size and 4096 output channels, with reshaped parameters of dimensions 4096, 1, 1, 4096.
* These filters are numerous and large, and therefore computationally expensive. Hence we reduce both their number and the size of each filter by subsampling (decimating) the parameters:
    * conv6 - 1024 filters of size 3 x 3
    * conv7 - 1024 filters of size 1 x 1

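
The equivalence and the subsampling described above can be checked numerically. The sketch below uses small stand-in sizes (16 outputs, 8 channels) instead of 4096 and 512 to stay light, and follows the `N, H, W, C` parameter layout used in the notes. The helper name `decimate` and its `steps` argument are assumptions for illustration, not the tutorial's exact code.

```python
import numpy as np

def decimate(tensor, steps):
    """Keep every steps[d]-th element along dimension d; None keeps that dimension whole."""
    assert tensor.ndim == len(steps)
    for d, step in enumerate(steps):
        if step is not None:
            tensor = np.take(tensor, np.arange(0, tensor.shape[d], step), axis=d)
    return tensor

rng = np.random.default_rng(0)
n_out, h, w, c = 16, 7, 7, 8                      # stand-ins for 4096, 7, 7, 512
image = rng.standard_normal((h, w, c))
fc_weight = rng.standard_normal((n_out, h * w * c))

# Fully connected layer applied to the flattened image
fc_out = fc_weight @ image.reshape(-1)

# Same parameters viewed as n_out kernels of size h x w x c, applied once to the whole image
conv_weight = fc_weight.reshape(n_out, h, w, c)
conv_out = np.einsum("nhwc,hwc->n", conv_weight, image)

# Subsample: every 4th filter, every 3rd row and column of each kernel, all channels
conv6_like = decimate(conv_weight, steps=[4, 3, 3, None])

print(np.allclose(fc_out, conv_out), conv6_like.shape)
```

With the real sizes, the same steps take 4096 filters of size 7 x 7 down to 1024 filters of size 3 x 3.
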
**Auxiliary Connection**
<br>
We stack some more convolutional layers on top of our base network.
These convolutions provide additional feature maps, each progressively smaller than the last.
<br>
Conv8_1, Conv8_2
<br>
Conv9_1, Conv9_2
<br>
Conv10_1, Conv10_2
<br>
Conv11_1, Conv11_2
<br>
The feature maps of Conv8_2, Conv9_2, Conv10_2 and Conv11_2 will be used for detection.

**Multiple output feature maps used for detection:**
1. Conv4_3 - 38 x 38 x 512
2. Conv7 - 19 x 19 x 1024
3. Conv8_2 - 10 x 10 x 512
4. Conv9_2 - 5 x 5 x 256
5. Conv10_2 - 3 x 3 x 256
6. Conv11_2 - 1 x 1 x 256

  

![](image/anchor_boxes_count.PNG)

**Prediction Convolution**
<br>
Two convolutional layers are applied to each feature map: one for class prediction and one for localization prediction.

For each prior at each location on each feature map, we want to predict –
1. the offsets (g_c_x, g_c_y, g_w, g_h) for a bounding box.
2. a set of n_classes scores for the bounding box, where n_classes represents the total number of object types (including a background class).

What we do:
<br>
We need two convolutional layers for each feature map 
1. A **localization prediction convolutional layer** with a 3, 3 kernel evaluating at each location (i.e. with padding and stride of 1) with 4 filters for each prior present at the location.

The 4 filters for a prior calculate the four encoded offsets (g_c_x, g_c_y, g_w, g_h) for the bounding box predicted from that prior.

2. A **class prediction convolutional layer** with a 3, 3 kernel evaluating at each location (i.e. with padding and stride of 1) with n_classes filters for each prior present at the location.

The n_classes filters for a prior calculate a set of n_classes scores for that prior.


Let us take one output feature map and work through it.

Consider the Conv9_2 output feature map of size 5 x 5 x 256.

Step 1:
1. For localization:
    1. Convolution:
    5 x 5 x 256 input -> 3 x 3 kernel with 24 filters [6 (anchor boxes) x 4 (offsets)]
    2. Output of size = 5 x 5 x 24
    3. Reshape to 150 (5 x 5 x 6) x 4

2. For class prediction:
    1. Convolution:
    5 x 5 x 256 input -> 3 x 3 kernel with 126 filters [6 (anchor boxes) x 21 (20 class labels + 1 background)]
    2. Output of size = 5 x 5 x 126
    3. Reshape to 150 (5 x 5 x 6) x 21


Similarly, we do this for all output feature maps and stack the results together. Thus the output of the prediction convolution module is the following:
1. locs = 8732 x 4
2. class scores = 8732 x 21
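
The reshaping and stacking can be traced with placeholder arrays. The sketch below does not run any convolution; it only builds zero arrays with the shapes a prediction layer would output. The number of priors per location for each feature map (4, 6, 6, 6, 4, 4) is an assumption taken from the common SSD300 configuration; it agrees with the 6 anchor boxes stated for Conv9_2 and with the 8732 total.

```python
import numpy as np

n_classes = 21  # 20 class labels + 1 background

# feature map name -> (spatial size, assumed priors per location)
feature_maps = {
    "conv4_3": (38, 4),
    "conv7": (19, 6),
    "conv8_2": (10, 6),
    "conv9_2": (5, 6),
    "conv10_2": (3, 4),
    "conv11_2": (1, 4),
}

locs, scores, priors_per_map = [], [], {}
for name, (size, n_priors) in feature_maps.items():
    # Placeholder outputs of the two prediction convolutions for this feature map
    loc_out = np.zeros((size, size, n_priors * 4))
    cls_out = np.zeros((size, size, n_priors * n_classes))
    # One row per prior: (size * size * n_priors, 4) and (size * size * n_priors, n_classes)
    locs.append(loc_out.reshape(-1, 4))
    scores.append(cls_out.reshape(-1, n_classes))
    priors_per_map[name] = locs[-1].shape[0]

locs = np.concatenate(locs, axis=0)
scores = np.concatenate(scores, axis=0)
print(priors_per_map["conv9_2"], locs.shape, scores.shape)
```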



# MultiBox Loss

The MultiBox loss is a loss function for object detection.

It is a combination of:
1. A localization loss for the predicted locations of the boxes.
2. A confidence loss for the predicted class scores.

# Performance

![](image/map.PNG)

SSD300\* and SSD512\* apply data augmentation for small objects to improve mAP.

Data:
”07”: VOC2007 trainval, ”07+12”: union of VOC2007 and VOC2012 trainval.
”07+12+COCO”: first train on COCO trainval35k then fine-tune on 07+12

# Issues with One Shot Detectors



## Class Imbalance 

There is an extreme foreground-background class imbalance problem in one-stage detectors.

![](image/issue1.PNG)

This imbalance causes two problems:
1. Training is inefficient as most locations are easy negatives that contribute no useful learning signal.
2. The easy negatives can overwhelm training and lead to degenerate models.

## Not Focussed on Hard Examples

**Hard Samples** - Examples where the difference between the true label and the predicted label is large, resulting in a higher loss.
<br>
**Easy Samples** - Examples where the difference between the true label and the predicted label is small, resulting in a lower loss.

Standard cross entropy loss treats hard and easy samples equally. As a result, the small loss values of the many easy samples can add up and overwhelm the rare class.

![](image/ce.PNG)

![](image/ce_graph.PNG)

    The loss from easy examples = 100000 × 0.1 = 10000
    The loss from hard examples = 100 × 2.3 = 230
    The loss from easy examples is about 40× bigger (10000 / 230 ≈ 43).
Thus, CE loss is not a good choice when there is extreme class imbalance.
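
The arithmetic can be reproduced in a few lines. The per-example losses 0.1 and 2.3 match the cross entropy at predicted probabilities of about 0.9 and 0.1 for the true class; those two probabilities are an assumption made here, since the text only quotes the rounded loss values.

```python
import math

def cross_entropy(p_t):
    """Cross entropy loss for one example, where p_t is the probability given to the true class."""
    return -math.log(p_t)

n_easy, n_hard = 100_000, 100   # example counts from the text
p_easy, p_hard = 0.9, 0.1       # assumed probabilities behind the 0.1 and 2.3 losses

easy_total = n_easy * cross_entropy(p_easy)
hard_total = n_hard * cross_entropy(p_hard)
ratio = easy_total / hard_total

print(round(cross_entropy(p_easy), 2), round(cross_entropy(p_hard), 2))
print(round(easy_total), round(hard_total), round(ratio))
```

Without rounding the per-example losses, the ratio comes out slightly higher than the 43 quoted above, but the conclusion is the same: the easy examples dominate the total loss.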

# Approaches to solve:

**CLASS IMBALANCE**

1. **Sampling heuristics**
<br>
Use a fixed foreground-to-background ratio (1:3).
2. **Online hard example mining (OHEM)**
<br>
Select a small set of anchors (e.g., 256), favouring those with the highest loss, for each minibatch.
3. **$\alpha$-balanced cross entropy loss**
<br>
Add a weighting factor α for class 1 and 1 - α for class -1.
![](image/alpha_ce.PNG)

**$\alpha$** is set by inverse class frequency or treated as a hyperparameter to be set by cross validation.

Or
<br>
**$\alpha$** is implicitly implemented by selecting a foreground-to-background ratio of 1:3.
<br>
However, the training procedure is still dominated by easily classified background examples.

**CLASS IMBALANCE + Not Focussed on Hard Examples** 

1. **Focal Loss**

The loss function is reshaped to down-weight easy examples and thus focus training on hard negatives. A modulating factor (1 - pt)^γ is added to the cross entropy loss.
![](image/fl.PNG)

The scaling factor decays to zero as confidence in the correct class increases.
<br>
γ is tested over the range [0, 5] in the experiments.

Properties of Focal Loss:
1. When an example is misclassified and **pt** is small, the modulating factor is near 1 and the loss is unaffected. As **pt → 1**, the factor goes to 0 and the loss for well-classified examples is down-weighted.
2. The focusing parameter $\gamma$ smoothly adjusts the rate at which easy examples are down-weighted. When $\gamma$ = 0, FL is equivalent to CE. When $\gamma$ is increased, the effect of the modulating factor is likewise increased. ($\gamma$ = 2 works best in the experiments.)

For instance, with $\gamma$ = 2, an example classified with pt = 0.9 would have a 100× lower loss compared with CE, and with pt ≈ 0.968 it would have a 1000× lower loss. This in turn increases the importance of correcting misclassified examples.
The loss is scaled down by at most 4× for pt ≤ 0.5 and γ = 2.
<br>


**To handle Class Imbalance:**

2. **$\alpha$-Balanced FL**
![](image/alpha_fl.PNG)

  - γ: Focus more on hard examples.
  - α: Offset class imbalance of number of examples. 
    
**From Paper**
   - Adding α to the equation yields slightly improved accuracy over the form without α.
   - Using the sigmoid activation function for computing p results in greater numerical stability.
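
The down-weighting figures quoted above (4×, 100× and about 1000× for γ = 2) can be checked with a few lines of Python. This is a sketch for a single example, written directly from the modulating-factor description; the function names and the `alpha_t` argument (the class weight, α or 1 - α depending on the class) are chosen here for illustration.

```python
import math

def cross_entropy(p_t):
    """CE for one example, where p_t is the probability assigned to the true class."""
    return -math.log(p_t)

def focal_loss(p_t, gamma=2.0, alpha_t=1.0):
    """alpha_t * (1 - p_t)^gamma * CE(p_t); alpha_t = 1 gives the form without alpha."""
    return alpha_t * (1.0 - p_t) ** gamma * cross_entropy(p_t)

def reduction_factor(p_t, gamma=2.0):
    """How many times smaller the focal loss is than plain cross entropy."""
    return cross_entropy(p_t) / focal_loss(p_t, gamma=gamma)

for p_t in (0.5, 0.9, 0.968):
    print(p_t, round(reduction_factor(p_t), 1))
```

Setting `gamma=0` makes the modulating factor equal to 1, so the function falls back to plain cross entropy, as stated in the properties above.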

**CLASS IMBALANCE + Not Focussed on Hard Examples + Accuracy** 

1. RetinaNet

# Two Stage Detectors

## R-CNN

![](http://image.slidesharecdn.com/lecture29-convolutionalneuralnetworks-visionspring2015-150504114140-conversion-gate02/95/lecture-29-convolutional-neural-networks-computer-vision-spring2015-31-638.jpg?cb=1430740006)

![](https://proxy.duckduckgo.com/iu/?u=https%3A%2F%2Fjhui.github.io%2Fassets%2Frcnn%2Fbound.png&f=1)

**R-CNN Working**

1. Take an input image.
2. Use the Selective Search algorithm to generate approximately 2k proposals.
    ```
    Selective Search:
    1. Generate an initial sub-segmentation to obtain many candidate regions
    2. Use a greedy algorithm to recursively combine similar regions into larger ones
    3. Use the generated regions to produce the final candidate region proposals
    ```
3. Warp all the proposals to a fixed size, which will be the input to the convolutional network.
4. Feed each warped proposal (~2k) into the convolutional network, which produces a 4096-dimensional feature vector.
5. Each feature vector is sent to SVMs to classify the object within the region proposal.
6. The network also outputs 4 values that predict the offsets of the predicted bounding box relative to the ground truth.

**Pros:**
1. It laid the foundation for two-stage detectors.
2. R-CNN achieves a mean average precision (mAP) of 53.7% on PASCAL VOC 2010.

**Cons:**
1. Running the Selective Search algorithm and a convolution pass on each proposal makes training time- and memory-consuming.
2. Inference is extremely slow, taking around 47 seconds for each test image.
3. The Selective Search algorithm is fixed, so no learning happens at that stage. This can lead to the generation of bad candidate region proposals.

## Fast R-CNN

![](https://proxy.duckduckgo.com/iu/?u=https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1600%2F1*E_P1vAEbGT4HNYjqMtIz4g.png&f=1)

![](https://proxy.duckduckgo.com/iu/?u=https%3A%2F%2Fantkillerfarm.github.io%2Fimages%2Farticle%2Ffast_rcnn_p_2.png&f=1)

Fast R-CNN solves a few of the drawbacks of R-CNN:
1. A single convolutional network takes the entire image and generates a feature map.
2. The Regions of Interest (RoIs) generated by the selective search algorithm are then projected onto the feature map.
3. The RoIs are generated at the scale of the original image, but the spatial dimensions of the feature map are smaller, so each RoI is mapped by rescaling it to the feature map scale.
4. Because RoIs of different sizes are cropped from the feature map, they are fed into an [RoI pool layer](https://deepsense.ai/region-of-interest-pooling-explained/), which performs a pooling operation and converts each RoI into a fixed-size feature map.
5. The pooled feature maps are then fed into two separate branches: one for classification and the other for regression.
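
Steps 3 and 4 can be sketched for a single-channel feature map in NumPy. This is an illustrative simplification, not the layer from the paper or from any library: the function names, the rounding rule (floor the minimum, ceil the maximum) and the 2 x 2 output size are assumptions, and the RoI is assumed to be at least as large as the output grid.

```python
import numpy as np

def project_roi(roi, image_size, feature_size):
    """Rescale an RoI from image pixels to feature-map cells (x_min, y_min, x_max, y_max)."""
    scale = feature_size / image_size
    x0, y0, x1, y1 = roi
    return (int(np.floor(x0 * scale)), int(np.floor(y0 * scale)),
            int(np.ceil(x1 * scale)), int(np.ceil(y1 * scale)))

def roi_max_pool(feature_map, roi, output_size=(2, 2)):
    """Max-pool the RoI region of an (H, W) feature map into a fixed output_size grid."""
    x0, y0, x1, y1 = roi                      # max coordinates are exclusive
    region = feature_map[y0:y1, x0:x1]
    out_h, out_w = output_size
    # Bin edges; bins may differ by one cell when the region does not divide evenly
    ys = np.linspace(0, region.shape[0], out_h + 1).astype(int)
    xs = np.linspace(0, region.shape[1], out_w + 1).astype(int)
    pooled = np.empty(output_size, dtype=feature_map.dtype)
    for i in range(out_h):
        for j in range(out_w):
            pooled[i, j] = region[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].max()
    return pooled

feature_map = np.arange(36).reshape(6, 6)     # toy 6 x 6 feature map for a 96 x 96 image
roi_on_map = project_roi((16, 16, 80, 80), image_size=96, feature_size=6)
pooled = roi_max_pool(feature_map, roi_on_map)
print(roi_on_map)
print(pooled)
```

Whatever the size of the RoI, the output has the same fixed shape, which is what lets the two branches that follow use ordinary fully connected layers.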


**For Classification :- Cross Entropy Loss**

**For Regression     :- SmoothL1Loss**

[SmoothL1Loss](https://stats.stackexchange.com/questions/351874/how-to-interpret-smooth-l1-loss) -> It combines L1 loss and L2 loss: it is quadratic for small errors and linear for large ones.
![](image/smooth.png)
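
The piecewise definition can be sketched in NumPy. This sketch uses the common form with the switch point at an absolute error of 1 (0.5 x² below it, |x| - 0.5 above it); the function name is chosen here for illustration.

```python
import numpy as np

def smooth_l1(pred, target):
    """Element-wise smooth L1: quadratic (L2-like) for |error| < 1, linear (L1-like) otherwise."""
    diff = np.abs(np.asarray(pred, dtype=float) - np.asarray(target, dtype=float))
    return np.where(diff < 1.0, 0.5 * diff ** 2, diff - 0.5)

errors = np.array([0.0, 0.5, 1.0, 3.0])
losses = smooth_l1(errors, np.zeros_like(errors))
print(losses)
```

The two pieces meet at an error of 1 with the same value and slope, so the loss stays smooth while being less sensitive to outliers than a pure L2 loss.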

**Pros:**
Fast R-CNN is faster than R-CNN because you don’t have to feed 2000 region proposals to the convolutional neural network every time. Instead, the convolution operation is done only once per image and a feature map is generated from it.

![](https://cdn-images-1.medium.com/max/800/1*m2QO_wbUPA05mY2q4v7mjg.png)
**Cons:**
When you look at the performance of Fast R-CNN at test time, including region proposals slows the algorithm down significantly compared with not using region proposals. Region proposals therefore become the bottleneck of the Fast R-CNN algorithm.

## Faster R-CNN
![](https://lilianweng.github.io/lil-log/assets/images/faster-RCNN.png)
Both of the above algorithms (R-CNN & Fast R-CNN) use selective search to find the region proposals. Selective search is a slow and time-consuming process that affects the performance of the network. Hence, in Faster R-CNN the authors of the paper introduced the Region Proposal Network (RPN).

The RPN gives two outputs, objectness scores and bounding boxes, for K anchor boxes.

Objectness Score :- It predicts whether or not there is an object within an RoI, so it uses 2K predictions.

Bounding Boxes :- It predicts the bounding boxes using 4K coordinates.

Using the RPN we generate the RoIs.

Once the RPN has produced the RoIs, the rest is the same as Fast R-CNN.

The first step of training a classifier is to make a training dataset. The training data consists of the anchors we get from the above process and the ground-truth boxes. The problem we need to solve here is how to use the ground-truth boxes to label the anchors. The basic idea is to label the anchors with higher overlap with the ground-truth boxes as foreground, and the ones with lower overlap as background. Apparently, it needs some tweaks and compromises to separate foreground and background.

![](https://i.stack.imgur.com/RUJ2b.png)
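
The labelling idea can be sketched in NumPy. The thresholds are assumptions, since the notes above do not state any: here an anchor with IoU of at least 0.7 with some ground-truth box is foreground, one below 0.3 with every ground-truth box is background, and anything in between is ignored. The function names and the toy boxes are also illustrative.

```python
import numpy as np

def pairwise_iou(anchors, gt_boxes):
    """IoU matrix of shape (n_anchors, n_gt); boxes are (x_min, y_min, x_max, y_max)."""
    x1 = np.maximum(anchors[:, None, 0], gt_boxes[None, :, 0])
    y1 = np.maximum(anchors[:, None, 1], gt_boxes[None, :, 1])
    x2 = np.minimum(anchors[:, None, 2], gt_boxes[None, :, 2])
    y2 = np.minimum(anchors[:, None, 3], gt_boxes[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (anchors[:, 2] - anchors[:, 0]) * (anchors[:, 3] - anchors[:, 1])
    area_g = (gt_boxes[:, 2] - gt_boxes[:, 0]) * (gt_boxes[:, 3] - gt_boxes[:, 1])
    return inter / (area_a[:, None] + area_g[None, :] - inter)

def label_anchors(anchors, gt_boxes, pos_thresh=0.7, neg_thresh=0.3):
    """Return one label per anchor: 1 = foreground, 0 = background, -1 = ignored."""
    ious = pairwise_iou(anchors, gt_boxes)
    best_iou = ious.max(axis=1)               # best overlap of each anchor with any ground truth
    labels = np.full(len(anchors), -1, dtype=int)
    labels[best_iou < neg_thresh] = 0
    labels[best_iou >= pos_thresh] = 1
    # One of the "tweaks": every ground-truth box keeps its best-matching anchor as foreground
    labels[ious.argmax(axis=0)] = 1
    return labels

anchors = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [4, 4, 14, 14], [50, 50, 60, 60]], dtype=float)
gt_boxes = np.array([[0, 0, 10, 10]], dtype=float)
labels = label_anchors(anchors, gt_boxes)
print(labels)
```

Ignored anchors (label -1) contribute nothing to the RPN loss, which keeps ambiguous overlaps from confusing the foreground/background classifier.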

## Comparison
![](https://i.ytimg.com/vi/v5bFVbQvFRk/maxresdefault.jpg)



## Understanding Basic Faster R-CNN architecture
![](https://raw.githubusercontent.com/chenyuntc/cloud/master/faster-rcnn%E7%9A%84%E5%89%AF%E6%9C%AC%E7%9A%84%E5%89%AF%E6%9C%AC.png)

1. We use VGG16 as the backbone network.
2. Proposal Creator:- It creates RoIs using anchor boxes.
3. Proposal Target Creator:- It subsamples the top RoIs and assigns targets to them.
4. Anchor Target Generator:- It generates targets for the anchors, which are used to calculate the RPN loss and train the RPN.
5. Finally we have 4 values in hand: RPN_reg_loss, RPN_cls_loss, RoI_reg_loss and RoI_cls_loss. We add all 4 of these to get the total loss.

## Feature Pyramid Network

Improving Faster R-CNN with FPN

# References

https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection
<br>
https://towardsdatascience.com/understanding-2d-dilated-convolution-operation-with-examples-in-numpy-and-tensorflow-with-d376b3972b25
<br>
https://machinethink.net/blog/object-detection/
<br>
https://towardsdatascience.com/going-deep-into-object-detection-bed442d92b34
<br>
https://d2l.ai/chapter_computer-vision/anchor.html
<br>
https://medium.com/@jonathan_hui/map-mean-average-precision-for-object-detection-45c121a31173
<br>
https://medium.com/@jonathan_hui/object-detection-speed-and-accuracy-comparison-faster-r-cnn-r-fcn-ssd-and-yolo-5425656ae359
<br>
https://www.coursera.org/lecture/convolutional-neural-networks/non-max-suppression-dvrjH
<br>
https://lilianweng.github.io/lil-log/2017/10/29/object-recognition-for-dummies-part-1.html
<br>
https://towardsdatascience.com/review-retinanet-focal-loss-object-detection-38fba6afabe4
<br>
https://medium.com/@jonathan_hui/understanding-feature-pyramid-networks-for-object-detection-fpn-45b227b9106c
<br>
https://medium.com/@smallfishbigsea/notes-on-focal-loss-and-retinanet-9c614a2367c6
<br>
R-CNN
<br>
http://www.telesens.co/2018/03/11/object-detection-and-classification-using-r-cnns/
<br>
mAP
<br>
https://github.com/rafaelpadilla/Object-Detection-Metrics#11-point-interpolation
<br>
https://arxiv.org/pdf/1607.03476.pdf
<br>
https://tarangshah.com/blog/2018-01-27/what-is-map-understanding-the-statistic-of-choice-for-comparing-object-detection-models/
<br>
https://mc.ai/the-confusing-metrics-of-ap-and-map-for-object-detection/
<br>
https://sanchom.wordpress.com/tag/average-precision/