### SSD object detection architecture

### SSD: Single Shot MultiBox Detector (Liu W. et al., 2016)
[Paper](https://arxiv.org/abs/1512.02325)

*Our approach, named SSD, discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location.*

In [1]:
import torch
import torchvision

assert torch.cuda.is_available(), "CUDA is not available"
%load_ext watermark

In [2]:
%watermark -p torch,torchvision

torch      : 1.10.2
torchvision: 0.11.3



Compared to YOLOv1:

* Default boxes mechanism (anchor boxes from Faster RCNN):

<img src="../assets/4_ssd.png" width="550">


* An explicit background class is used again (as in RCNN)


* Offsets $(t_x, t_y, t_w, t_h)$ prediction (RCNN)


* Thus, we need to encode 4 (offsets) + C (number of classes) values for each default box


* Multiple feature maps to detect objects of various scales in an image:

Paper figure. SSD vs YOLO:

<img src="../assets/3_ssd.png" width="850">
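
The offsets $(t_x, t_y, t_w, t_h)$ mentioned in the comparison above follow the RCNN-style box parametrization: center shifts are scaled by the anchor size, and width and height are predicted as log ratios. A minimal sketch in plain Python, assuming boxes in `(cx, cy, w, h)` form; the names `encode_offsets` and `decode_offsets` are illustrative and not taken from any library:

```python
import math


def encode_offsets(gt, anchor):
    """Offsets (t_x, t_y, t_w, t_h) of a GT box relative to an anchor.

    Both boxes are (cx, cy, w, h): center coordinates, width, height.
    """
    g_cx, g_cy, g_w, g_h = gt
    a_cx, a_cy, a_w, a_h = anchor
    return (
        (g_cx - a_cx) / a_w,   # center shift in units of anchor width
        (g_cy - a_cy) / a_h,   # center shift in units of anchor height
        math.log(g_w / a_w),   # log ratio of widths
        math.log(g_h / a_h),   # log ratio of heights
    )


def decode_offsets(offsets, anchor):
    """Inverse of encode_offsets: recover the (cx, cy, w, h) box."""
    t_x, t_y, t_w, t_h = offsets
    a_cx, a_cy, a_w, a_h = anchor
    return (
        a_cx + t_x * a_w,
        a_cy + t_y * a_h,
        a_w * math.exp(t_w),
        a_h * math.exp(t_h),
    )


anchor = (0.5, 0.5, 0.2, 0.4)
gt = (0.6, 0.5, 0.2, 0.4)      # same size as the anchor, shifted right
offsets = encode_offsets(gt, anchor)
restored = decode_offsets(offsets, anchor)
```

At inference time the network's predicted offsets go through the decoding step to produce boxes in image coordinates.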


SSD300 with VGG backbone:

<img src="../assets/2_ssd.svg" width="950">


*Note: different versions of the paper describe different default boxes for VGG16:* 

* 7308: (3, 6, 6, 6, 6, 6) prior boxes
* 8732: (4, 6, 6, 6, 4, 4) prior boxes
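
Both totals can be checked with a little arithmetic. The sketch below assumes the six square SSD300 feature maps with side lengths 38, 19, 10, 5, 3 and 1 from the paper; the helper name `count_default_boxes` is illustrative only.

```python
# Side lengths of the six square SSD300 feature maps (from the paper).
FEATURE_MAP_SIZES = (38, 19, 10, 5, 3, 1)


def count_default_boxes(boxes_per_location, sizes=FEATURE_MAP_SIZES):
    """Total default boxes: sum over maps of (side * side * boxes per cell)."""
    return sum(s * s * k for s, k in zip(sizes, boxes_per_location))


print(count_default_boxes((3, 6, 6, 6, 6, 6)))  # 7308
print(count_default_boxes((4, 6, 6, 6, 4, 4)))  # 8732
```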

* Preparing default boxes / anchors:


Anchor positions for a feature map $m \times n$:

$$\left(\frac{i+0.5}{m}, \frac{j+0.5}{n}\right), \quad i=0,1,\dots,m-1; \quad j=0,1,\dots,n-1$$
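
As a small illustration, the centers can be enumerated directly, with $i$ and $j$ running over the $m$ and $n$ cells of the map (indices $0$ to $m-1$ and $0$ to $n-1$). The function name `anchor_centers` is a placeholder for this sketch, not a library call.

```python
from itertools import product


def anchor_centers(m, n):
    """Normalized centers ((i + 0.5) / m, (j + 0.5) / n) for every cell of an m x n map."""
    return [((i + 0.5) / m, (j + 0.5) / n) for i, j in product(range(m), range(n))]


centers = anchor_centers(2, 2)
print(centers)  # [(0.25, 0.25), (0.25, 0.75), (0.75, 0.25), (0.75, 0.75)]
```

Every default box of a given cell shares that cell's center; the boxes differ only in scale and aspect ratio.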

Matching anchors to GT:

$anchor=\left\{ 
  \begin{array}{ c l }
    \text{positive}, & \quad IoU_{(anchor, GT)} \ge 0.5 \\
    \text{negative (background)}, & \quad IoU_{(anchor, GT)} < 0.5
  \end{array}
\right.$



So, unlike YOLOv1, multiple default boxes may be matched to the same GT object and marked as positive.
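
The threshold rule can be sketched in plain Python as follows. This is a simplified illustration, assuming boxes in `(x1, y1, x2, y2)` form and at least one GT box; the paper's additional step of first assigning each GT box to its single best-overlapping default box is omitted, and the names `iou` and `match_anchors` are hypothetical.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def match_anchors(anchors, gt_boxes, threshold=0.5):
    """Per anchor: index of the best-overlapping GT box, or -1 for background."""
    matches = []
    for anchor in anchors:
        overlaps = [iou(anchor, gt) for gt in gt_boxes]
        best = max(range(len(gt_boxes)), key=overlaps.__getitem__)
        matches.append(best if overlaps[best] >= threshold else -1)
    return matches


gt_boxes = [(0.0, 0.0, 4.0, 4.0)]
anchors = [
    (0.0, 0.0, 4.0, 4.0),  # IoU = 1.0
    (0.0, 0.0, 4.0, 3.0),  # IoU = 12 / 16 = 0.75
    (2.0, 2.0, 6.0, 6.0),  # IoU = 4 / 28, below the threshold
]
print(match_anchors(anchors, gt_boxes))  # [0, 0, -1]
```

The first two anchors are both positive for the same object, which is exactly the many-to-one matching described above.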



* 3 anchors per location for the largest feature map, 6 anchors for the rest (the 7308-box configuration)

* Model loss:


$I_{i,j} = [IoU(GT_j, anchor_i) \ge 0.5]$,

$N = \sum_{i, j} I_{i,j}$, the number of matched default boxes,


$$Loss = \frac{1}{N}[L_{class} + \alpha L_{loc}],$$


* $L_{loc}$ is the localization loss from the RCNN family (Smooth L1 over the offsets of matched boxes, as in Fast RCNN),


* $L_{class}$ is a bit tricky:

    * With so many default boxes, most of them are negative examples. This imbalance is a serious problem: the model could minimize the loss by predicting only background.

    * To compute $L_{class}$ we keep all positive examples and only the hardest negatives, i.e. those with the highest non-background scores (boxes the detector mistook for an object; this is "hard negative mining"), so that the ratio is at most $\frac{neg}{pos}=\frac{3}{1}$.
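
The selection step can be sketched as below. This is a plain-Python illustration under stated assumptions, not the torchvision implementation. It ranks negatives by their per-box classification loss, which for a background box is high exactly when the model assigns it a high non-background score. The function name `hard_negative_mining` and the toy loss values are made up for the example.

```python
def hard_negative_mining(cls_losses, is_positive, neg_pos_ratio=3):
    """Indices of the boxes that contribute to the classification loss.

    cls_losses  -- per-box classification loss (one float per default box)
    is_positive -- per-box flag, True if the box is matched to a GT object
    """
    positives = [i for i, pos in enumerate(is_positive) if pos]
    negatives = [i for i, pos in enumerate(is_positive) if not pos]
    # Hardest negatives first: background boxes the model is most wrong about.
    negatives.sort(key=lambda i: cls_losses[i], reverse=True)
    kept_negatives = negatives[: neg_pos_ratio * len(positives)]
    return positives, kept_negatives


# Toy example: 1 positive box and 7 background boxes.
cls_losses = [0.2, 2.5, 0.1, 1.7, 0.05, 0.9, 0.3, 1.1]
is_positive = [True, False, False, False, False, False, False, False]
positives, negatives = hard_negative_mining(cls_losses, is_positive)
print(positives, negatives)  # [0] [1, 3, 7]
```

Only the selected boxes enter $L_{class}$; the remaining easy negatives are ignored for that training step.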

Training scheme:

<img src="../assets/5_ssd.svg" width="450">


Inference scheme:

<img src="../assets/6_ssd.svg" width="450">

In [3]:
[x for x in dir(torchvision.models.detection) if 'ssd' in x]

['ssd', 'ssd300_vgg16', 'ssdlite', 'ssdlite320_mobilenet_v3_large']

In [4]:
print(torchvision.models.detection.SSD.__doc__)


    Implements SSD architecture from `"SSD: Single Shot MultiBox Detector" <https://arxiv.org/abs/1512.02325>`_.

    The input to the model is expected to be a list of tensors, each of shape [C, H, W], one for each
    image, and should be in 0-1 range. Different images can have different sizes but they will be resized
    to a fixed size before passing it to the backbone.

    The behavior of the model changes depending if it is in training or evaluation mode.

    During training, the model expects both the input tensors, as well as a targets (list of dictionary),
    containing:
        - boxes (``FloatTensor[N, 4]``): the ground-truth boxes in ``[x1, y1, x2, y2]`` format, with
          ``0 <= x1 < x2 <= W`` and ``0 <= y1 < y2 <= H``.
        - labels (Int64Tensor[N]): the class label for each ground-truth box

    The model returns a Dict[Tensor] during training, containing the classification and regression
    losses.

    During inference, the model requires only the input ten

### DSSD: Deconvolutional single shot detector (Fu C. Y. et al., 2017)
[Paper](https://arxiv.org/abs/1701.06659)

### FSSD: Feature Fusion Single Shot Multibox Detector (Li Z., Zhou F., 2017)
[Paper](https://arxiv.org/abs/1712.00960)

### ASSD: Attentive single shot multibox detector (Yi J., Wu P., Metaxas D. N., 2019)
[Paper](https://arxiv.org/abs/1909.12456)

<img src="../assets/1_ssd.png" width="550">

### U-SSD: Improved SSD Based on U-Net Architecture for End-to-End Table Detection in Document Images (Lee S. H., Chen H. C., 2021)
[Paper](https://www.mdpi.com/2076-3417/11/23/11446/htm)

#### References

* https://paperswithcode.com/method/ssd
* https://pytorch.org/blog/torchvision-ssd-implementation/
* https://pytorch.org/hub/nvidia_deeplearningexamples_ssd/
* https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection