# What models to use for object detection?

- YOLOv5, YOLOX, YOLOR, etc.
- CenterNet
- Faster R-CNN
- Cascade R-CNN
- EfficientDet

# How to train?

- Poisson blending:
![Poisson blending](poisson.png)
- progressive learning (increase resolution during training)
- increase weight decay
- try Adam
- use augmentations (hflip, vflip, rotate90, hsv, translate, scale, clahe, mosaic, mixup, water-augment, transpose)
- image patches: cut the original image (e.g. 1280x720) into many overlapping patches (e.g. 512x320), remove boxes near the patch boundary, then train only on those patches
- another option is to upscale the training images by two times (e.g. 1280x720 to 2560x1440) and split each into four 1280x720 images; after that, discard the split images that contain no bounding boxes
- freeze the backbone for final-epoch fine-tuning and train only the regression head. This reduces GPU memory cost by ~80%, which allows up to a 5x larger batch size or training at even higher resolutions
- change the stride of an early convolutional layer (for example, the second one) from the default 2 to 1. Example for the YOLOv5 model config: `[-1, 1, Conv, [128, 3, 1]],  # 1-P2/4`
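
The image-patch idea above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the helper names (`tile_starts`, `boxes_for_patch`, `make_patches`), the overlap values, and the `min_visible` rule are assumptions made for this example. In particular, "remove boxes near boundary" is interpreted here as dropping boxes that are mostly cut off by the patch border.

```python
def tile_starts(full, patch, overlap):
    """Start offsets of patches of size `patch` covering a side of length `full`."""
    if overlap >= patch:
        raise ValueError("overlap must be smaller than the patch size")
    if patch >= full:
        return [0]
    stride = patch - overlap
    starts = list(range(0, full - patch, stride))
    starts.append(full - patch)  # last patch sits flush with the far edge
    return starts


def boxes_for_patch(boxes, x0, y0, pw, ph, min_visible=0.7):
    """Shift (x1, y1, x2, y2) boxes into patch coordinates.

    Boxes with less than `min_visible` of their area inside the patch are
    dropped (assumed reading of "remove boxes near boundary").
    """
    kept = []
    for bx1, by1, bx2, by2 in boxes:
        ix1, iy1 = max(bx1, x0), max(by1, y0)
        ix2, iy2 = min(bx2, x0 + pw), min(by2, y0 + ph)
        if ix2 <= ix1 or iy2 <= iy1:
            continue  # no overlap with this patch
        visible = (ix2 - ix1) * (iy2 - iy1) / ((bx2 - bx1) * (by2 - by1))
        if visible >= min_visible:
            kept.append((ix1 - x0, iy1 - y0, ix2 - x0, iy2 - y0))
    return kept


def make_patches(img_w, img_h, boxes, pw=512, ph=320, overlap_x=128, overlap_y=64):
    """Return a list of (patch_window, boxes_in_patch_coordinates)."""
    patches = []
    for y0 in tile_starts(img_h, ph, overlap_y):
        for x0 in tile_starts(img_w, pw, overlap_x):
            window = (x0, y0, x0 + pw, y0 + ph)
            patches.append((window, boxes_for_patch(boxes, x0, y0, pw, ph)))
    return patches
```

With these assumed settings, `make_patches(1280, 720, boxes)` yields a 3x3 grid of 512x320 windows. Each window can then be used to slice the image array; patches with an empty box list can be filtered out, as in the upscale-and-split variant.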

# Inference

- try different inference resolutions
- Seq-NMS (for video): https://arxiv.org/abs/1602.08465
- Norfair tracking (for video)
- classification re-score: 
    - crop all predicted boxes with conf > 0.01 (K-fold OOF predictions) into squares. The side length of each square is the longer side of the predicted box, extended by 20%.
    - the IoU of each predicted box is the maximum of its IoU values with the GT boxes of the same image.
    - classification target of each cropped box: iou>0.3, iou>0.4, iou>0.5, iou>0.6, iou>0.7, iou>0.8 and iou>0.9. Simply put, the IoU is divided into 7 bins, e.g. [1,1,1,0,0,0,0] indicates the IoU is between 0.5 and 0.6.
    - we train the classifier with BCELoss on the cropped boxes, resized to 256x256 or 224x224.
    - during inference we average the 7 bin outputs to get the classification score.
    - a very high dropout_rate or drop_path_rate can help a lot to improve the performance of the classification model. We use dropout_rate=0.7 and drop_path_rate=0.5
    - augmentations: hflip, vflip, transpose, 45° rotation and cutout.
    - ensemble
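
The target construction for the re-score classifier can be sketched as follows. This is a hedged illustration in plain Python: the function names are made up for this example, boxes are assumed to be `(x1, y1, x2, y2)` tuples, and "extended by 20%" is assumed to mean multiplying the side length by 1.2.

```python
IOU_THRESHOLDS = (0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)


def box_iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def max_iou(pred_box, gt_boxes):
    """Best IoU between one predicted box and all GT boxes of the image."""
    return max((box_iou(pred_box, g) for g in gt_boxes), default=0.0)


def iou_to_target(iou):
    """7-bin multi-label target: one bit per threshold that the IoU exceeds."""
    return [1.0 if iou > t else 0.0 for t in IOU_THRESHOLDS]


def square_crop(box, expand=0.2):
    """Square window centred on the box; side = longer box side * (1 + expand)."""
    x1, y1, x2, y2 = box
    half = max(x2 - x1, y2 - y1) * (1.0 + expand) / 2.0
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    return (cx - half, cy - half, cx + half, cy + half)


def rescore(bin_probs):
    """Classification score at inference: mean of the 7 bin outputs."""
    return sum(bin_probs) / len(bin_probs)
```

For example, `iou_to_target(0.55)` gives `[1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0]`, matching the bin example above. The window returned by `square_crop` can extend past the image border, so it would need clipping or padding before cropping.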

- attention area: 
    - suppose the model has predicted some boxes B in frame N
    - select the boxes in B with high confidence (larger than a threshold T); these boxes are marked as the "attention area" for frame N + 1
    - for every box predicted in frame N + 1 with confidence > 0.01, if its IoU with an "attention area" is larger than 0.5, boost the score of this box with += attention_area_conf * IoU, then do the confidence filtering.
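
A minimal sketch of the attention-area boost, assuming detections are `(box, conf)` pairs with `(x1, y1, x2, y2)` boxes. The function name, the default thresholds other than 0.01 and 0.5, and the choice to use the best-overlapping attention area when several overlap are assumptions for illustration.

```python
def box_iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def boost_with_attention(prev_dets, cur_dets, attn_thr=0.5, iou_thr=0.5,
                         min_conf=0.01, final_thr=0.3):
    """Boost frame N+1 detections that overlap confident frame N detections."""
    # Attention areas: confident boxes from the previous frame (threshold T).
    attention = [(box, conf) for box, conf in prev_dets if conf > attn_thr]
    boosted = []
    for box, conf in cur_dets:
        if conf <= min_conf:
            continue  # only candidates above the low confidence floor
        best_iou, best_conf = 0.0, 0.0
        for a_box, a_conf in attention:
            iou = box_iou(box, a_box)
            if iou > best_iou:
                best_iou, best_conf = iou, a_conf
        if best_iou > iou_thr:
            conf += best_conf * best_iou  # score += attention_area_conf * IoU
        if conf >= final_thr:  # final confidence filtering
            boosted.append((box, conf))
    return boosted
```

Boosted scores can exceed 1.0 with this formula, so the final threshold applies to the boosted scale rather than to a probability.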