FastNMS on Ultralytics YOLOv3 #366

Open · glenn-jocher opened this issue Mar 4, 2020 · 21 comments

@glenn-jocher (Contributor) commented Mar 4, 2020

@dbolya we've had a request from @Zzh-tju to implement FastNMS in https://github.com/ultralytics/yolov3 per ultralytics/yolov3#679. Can you point us to the location in your code where the function is? Can we use it for boxes rather than masks?

We currently use the multi-class torchvision.ops.boxes.batched_nms() (middle row) as a compromise between speed and accuracy. We apply it once per image (all classes at once) and see an inference speed of 49 ms/img (inference + NMS) at 608 image size, conf_thresh=0.001, on a Tesla T4, giving us about 42.0/62.0 mAP@0.5:0.95/0.5 on COCO2014. We do not do masks though, only boxes.
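For reference, the per-image call looks roughly like this (variable names here are illustrative, not the exact repo code):

```python
import torchvision

# pred: [N, 6] tensor of (x1, y1, x2, y2, conf, cls) detections for one image,
# already filtered by conf_thresh (names illustrative)
boxes = pred[:, :4]    # [N, 4] boxes in (x1, y1, x2, y2) format
scores = pred[:, 4]    # [N] confidences
classes = pred[:, 5]   # [N] class indices
keep = torchvision.ops.boxes.batched_nms(boxes, scores, classes, iou_threshold=0.5)
pred = pred[keep]      # suppression is applied per class, but all classes in one call
```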

BTW, we also developed the 'merge' NMS method below, which is slower simply because it is implemented in Python rather than C, but it may be possible to combine fast and merge to get the best of both worlds.

| NMS method | Time, ms/img | Time, mm:ss | mAP@0.5:0.95 | mAP@0.5 |
| --- | --- | --- | --- | --- |
| 'vision_batched', multi_cls=False | 46 ms | 3:50 | 41.2 | 60.8 |
| 'vision_batched', multi_cls=True | 49 ms | 4:03 | 41.9 | 61.8 |
| 'merge', multi_cls=True | 120 ms | 9:58 | 42.3 | 62.0 |

@dbolya (Owner) commented Mar 4, 2020

Looking at batched_nms, it looks like what we call cross_class NMS, but then I'm not sure what multi_cls=True would mean.

Anyway, here's our implementation of Fast NMS:

```python
def fast_nms(self, boxes, masks, scores, iou_threshold: float = 0.5, top_k: int = 200, second_threshold: bool = False):
    scores, idx = scores.sort(1, descending=True)

    idx = idx[:, :top_k].contiguous()
    scores = scores[:, :top_k]

    num_classes, num_dets = idx.size()

    boxes = boxes[idx.view(-1), :].view(num_classes, num_dets, 4)
    masks = masks[idx.view(-1), :].view(num_classes, num_dets, -1)

    iou = jaccard(boxes, boxes)
    iou.triu_(diagonal=1)
    iou_max, _ = iou.max(dim=1)

    # Now just filter out the ones higher than the threshold
    keep = (iou_max <= iou_threshold)

    # We should also only keep detections over the confidence threshold, but at the cost of
    # maxing out your detection count for every image, you can just not do that. Because we
    # have such a minimal amount of computation per detection (matrix multiplication only),
    # this increase doesn't affect us much (+0.2 mAP for 34 -> 33 fps), so we leave it out.
    # However, when you implement this in your method, you should do this second threshold.
    if second_threshold:
        keep *= (scores > self.conf_thresh)

    # Assign each kept detection to its corresponding class
    classes = torch.arange(num_classes, device=boxes.device)[:, None].expand_as(keep)
    classes = classes[keep]

    boxes = boxes[keep]
    masks = masks[keep]
    scores = scores[keep]

    # Only keep the top cfg.max_num_detections highest scores across all classes
    scores, idx = scores.sort(0, descending=True)
    idx = idx[:cfg.max_num_detections]
    scores = scores[:cfg.max_num_detections]

    classes = classes[idx]
    boxes = boxes[idx]
    masks = masks[idx]

    return boxes, masks, classes, scores
```

It works on boxes, so you can just ignore the mask stuff. The relevant inputs are boxes ([N, 4]) and scores ([N, num_classes]). The inputs and outputs should both be on the GPU (or whatever your fastest device is, and make sure nothing ever touches the CPU in this function), and we pass in all detections with > 0.05 confidence, but I don't think passing everything in will hurt performance much since we take the top 200 anyway. Also, read the big comment about the second threshold.

Most of the code is setup and postprocessing; the core of the algorithm is actually just:

```python
iou = jaccard(boxes, boxes)
iou.triu_(diagonal=1)
iou_max, _ = iou.max(dim=1)

# Now just filter out the ones higher than the threshold
keep = (iou_max <= iou_threshold)
```

which is what's in the paper.
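If you only need boxes, a stripped-down class-agnostic version would look roughly like this (untested sketch, with torchvision's box_iou standing in for our jaccard):

```python
import torch
from torchvision.ops import box_iou

def fast_nms_boxes(boxes, scores, iou_threshold=0.5, top_k=200):
    # boxes: [N, 4] in (x1, y1, x2, y2); scores: [N] confidences
    scores, order = scores.sort(descending=True)
    order = order[:top_k]            # keep only the top_k highest-scoring boxes
    boxes = boxes[order]

    iou = box_iou(boxes, boxes)      # [k, k] pairwise IoU
    iou.triu_(diagonal=1)            # keep only IoU with higher-scoring boxes
    iou_max, _ = iou.max(dim=0)      # each box's max IoU with any higher-scoring box

    keep = iou_max <= iou_threshold  # suppress boxes that overlap a better box too much
    return order[keep]               # indices into the original boxes/scores
```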

@glenn-jocher (Contributor, Author) commented Mar 4, 2020

@dbolya great thanks! torchvision.ops.boxes.batched_nms() just means that the function accepts multiple classes at once. torchvision has a separate torchvision.ops.boxes.nms() function that only accepts single-class boxes, which you'd need to drop into a `for c in classes:`-style Python loop.

We use BCE for class loss, not CE, so it's possible that multiple classes may be above threshold for a given box in our repo. multi_cls=True means that we output multiple detections (same box, different classes) in this case. multi_cls=False means we only pick the very top class above threshold.
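Roughly, the difference between the two settings is this (illustrative sketch, not the exact repo code; cls_scores is a hypothetical [N, num_classes] matrix of per-box class confidences):

```python
import torch

if multi_cls:
    # multi_cls=True: one output per (box, class) pair above threshold,
    # so the same box can appear with several class labels
    box_idx, cls_idx = (cls_scores > conf_thres).nonzero(as_tuple=True)
    scores = cls_scores[box_idx, cls_idx]
else:
    # multi_cls=False: keep only the single best class per box, then threshold
    scores, cls_idx = cls_scores.max(dim=1)
    keep = scores > conf_thres
    box_idx = keep.nonzero(as_tuple=True)[0]
    scores, cls_idx = scores[keep], cls_idx[keep]
```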

I will try to implement this week and post the results here if I'm successful!

@glenn-jocher (Contributor, Author) commented Mar 4, 2020

@dbolya @Zzh-tju I've imported the YOLACT FastNMS functions into ultralytics/yolov3, and get the following results. The times are for inference+NMS on the 5k COCO2014 val images using a Google Colab instance with Tesla T4.

fast_batched below is the YOLACT FastNMS. I call it batched because I only call it once per image (it handles all classes at once). It is faster than torchvision.ops.boxes.batched_nms(), but unfortunately with a mAP penalty of about 0.3-0.4. It may be much faster than torchvision; it's unclear from these tests, as the operations below are likely dominated by inference time rather than NMS time. When I have time I will rerun on a large GCP VM with 16 cores and a V100 for the best comparison metrics.

| NMS method | Time, ms/img | Time, mm:ss | mAP@0.5:0.95 | mAP@0.5 |
| --- | --- | --- | --- | --- |
| 'vision_batched' | 49 ms | 4:03 | 41.9 | 61.8 |
| 'merge' | 120 ms | 9:58 | 42.3 | 62.0 |
| 'fast_batched' | 44 ms | 3:41 | 41.5 | 61.5 |

@dbolya (Owner) commented Mar 4, 2020

Very interesting, and yeah, I'm guessing the situations where Fast NMS would offer a huge speed increase depend on the detector and the rest of the code. Maybe the setup and postprocessing are a little too bloated too.

Also, now that you mention it, I'm fairly sure I could create a fast merge NMS that would be slightly worse than what you list there but almost as fast as Fast NMS. This will have to wait until after a certain very close deadline though >.>

@glenn-jocher (Contributor, Author) commented Mar 4, 2020

Update: I discovered that the majority of the time in ultralytics/yolov3/test.py was spent building pycocotools JSON files for official mAPs. If I turn off this functionality (computing mAP only with internal repo code), I get the following much-improved times for the 5k COCO2014 val images. The machine is a 12-vCPU V100 instance.

python3 test.py --weights yolov3-spp-ultralytics.pt --cfg yolov3-spp --img 608
| NMS method | Time, ms/img | Time, mm:ss | mAP@0.5:0.95 | mAP@0.5 |
| --- | --- | --- | --- | --- |
| 'vision_batched' (default) | 15.2 ms | 1:16 | 41.9 | 61.8 |
| 'merge' | 103.0 ms | 8:35 | 42.3 | 62.0 |
| 'fast_batched' | 14.6 ms | 1:13 | 41.5 | 61.5 |

I get a 4% drop in time for a 1% drop in mAP by switching from vision_batched to fast_batched, which isn't bad, though I suspect img-size reductions may yield slightly more favorable ratios. In any case, both implementations are much faster than the Python for-loop NMS used in merge. Merge simply creates new boxes using a weighted combination of the scores rather than deleting lower-score boxes outright. It seems to provide a +0.4 mAP bump, which might take Fast NMS back to the same mAP produced by vision_batched, but then we'd be back where we started, unfortunately.
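For context, the merge idea is roughly this (illustrative sketch only, not the repo's exact loop; dets is a hypothetical [N, 5] tensor of (x1, y1, x2, y2, conf) sorted by confidence, descending):

```python
import torch
from torchvision.ops import box_iou

merged = []
while dets.shape[0]:
    overlap = box_iou(dets[:1, :4], dets[:, :4])[0] > iou_thres  # boxes overlapping the top box (itself included)
    weights = dets[overlap, 4:5]                                 # their confidences, shape [K, 1]
    top = dets[0].clone()
    top[:4] = (weights * dets[overlap, :4]).sum(0) / weights.sum()  # confidence-weighted box coordinates
    merged.append(top)
    dets = dets[~overlap]                                        # drop everything that was merged in
merged = torch.stack(merged)  # assumes at least one detection
```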

@glenn-jocher (Contributor, Author) commented Mar 4, 2020

To further clarify the timing, I added profiling code to test.py that specifically tracks inference and NMS times in ultralytics/yolov3@e482392. This can be accessed with the --profile flag:

python3 test.py --weights yolov3-spp-ultralytics.pt --img 608 --conf 0.001 --profile

I ran with both default torchvision NMS and the yolact FastNMS, and actually saw a slight speed decrease with FastNMS:

Default: Profile results: 1.3/6.9/8.1 ms inference/NMS/total per image
FastNMS: Profile results: 1.3/7.1/8.4 ms inference/NMS/total per image

So perhaps the slight speed increase from FastNMS observed in the total test time is due simply to a reduced box count produced by this NMS method, which results in less postprocessing work during testing (mAP calculation etc.).

The other surprise was the great amount of total time spent on NMS vs inference. Even under the default settings 6.9/8.1 = 85% of the total time is spent on NMS!

@glenn-jocher (Contributor, Author) commented Mar 4, 2020

CORRECTION: my previous analysis was incorrect; it lacked the torch.cuda.synchronize() calls necessary when profiling CUDA operations. I've fixed this in ultralytics/yolov3@1430a1e. Corrected results, consistent across several runs:

python3 test.py --weights yolov3-spp-ultralytics.pt --img 608 --conf 0.001 --profile

Default: Profile results: 6.6/1.6/8.2 ms inference/NMS/total per image
FastNMS: Profile results: 6.6/1.9/8.5 ms inference/NMS/total per image

The conclusion is that inference uses most (80%) of the runtime in both cases, and that FastNMS appears to run slightly slower than the default torchvision.ops.boxes.batched_nms().
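The timing pattern is roughly this (illustrative sketch, not the exact test.py code; function and variable names are placeholders):

```python
import time
import torch

torch.cuda.synchronize()          # wait for any prior GPU work before starting the clock
t0 = time.time()
pred = model(imgs)                # inference
torch.cuda.synchronize()          # wait for inference kernels to actually finish
t1 = time.time()
out = non_max_suppression(pred, conf_thres, iou_thres)  # NMS
torch.cuda.synchronize()
t2 = time.time()
t_inference, t_nms = t1 - t0, t2 - t1
```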

@Zzh-tju commented Mar 5, 2020

Thanks, so it is carried out on one class, isn't it? e.g., cc_fast_NMS collapses all the classes into 1. How about multi-class?

@Zzh-tju commented Mar 5, 2020

And how many boxes do you choose? (top n)

@glenn-jocher (Contributor, Author) commented:

@Zzh-tju I imported the FastNMS code here. It's very clever, but unfortunately it seems to be a dead end, as it's slower and produces worse mAP than the default method.
https://github.com/ultralytics/yolov3/blob/8b6c8a53182b2415fd61459fc9a0ccbdef8dc904/utils/utils.py#L558-L568

I use all boxes above the --conf threshold; I don't discard any boxes or put any upper limit on the number of boxes.

The times and tests above are for the usual 5000-image COCO val set using yolov3-spp-ultralytics.pt with all 80 classes. Everything else is identical between the default and FastNMS tests. You can reproduce them by simply running

python3 test.py --weights yolov3-spp-ultralytics.pt --img 608 --conf 0.001 --profile

@glenn-jocher (Contributor, Author) commented Mar 5, 2020

@Zzh-tju perhaps I'm not understanding the purpose of the top-n boxes. I assumed it was a memory saver or speed enhancer, so I neglected to implement it; since I saw no out-of-memory errors when running on full-size COCO images, I assumed all was well.

Is it possible that since I did not implement the top-n selection I'm not recreating FastNMS correctly? The code I have is very simple; I think it captures the core intention (the upper-triangular IoU matrix), and a sketch of the skipped top-n step follows it:

```python
# Batched NMS
if batched:
    c = pred[:, 5] * 0 if agnostic else pred[:, 5]  # class-agnostic NMS
    boxes, scores = pred[:, :4].clone(), pred[:, 4]
    if method == 'vision_batch':
        i = torchvision.ops.boxes.batched_nms(boxes, scores, c, iou_thres)
    elif method == 'fast_batch':  # FastNMS from https://github.com/dbolya/yolact
        boxes += c.view(-1, 1) * max_wh  # separate boxes by class
        iou = box_iou(boxes, boxes).triu_(diagonal=1)  # upper triangular iou matrix
        i = iou.max(dim=0)[0] < iou_thres
```
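For completeness, the top-n pre-selection I skipped would look roughly like this (illustrative sketch only; applied just before building the IoU matrix above):

```python
# keep only the top_k highest-confidence boxes before computing the IoU matrix
top_k = 200
order = scores.argsort(descending=True)[:top_k]
boxes, scores, c = boxes[order], scores[order], c[order]
# ... then compute iou = box_iou(boxes, boxes).triu_(diagonal=1) on this reduced set as above
```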

@Zzh-tju commented Mar 5, 2020

Yeah, I mean your batched_nms is cross-class NMS, right? And I'm confused by your mention above: what is the difference between multi_cls=True and False?
Do you mean YOLO provides a box with multiple labels, and if set to False there will be only one class per box? But the NMS for the two modes is still cross-class. Since 'False' can be faster than 'True', I wonder about the setting differences between the two.

Another question: what about doing NMS for each class separately? (Fast NMS vs. traditional NMS)

@glenn-jocher (Contributor, Author) commented Mar 5, 2020

@Zzh-tju it's very simple.

  • All classes are processed correctly, no matter the class count.
  • multi_cls allows more than one label per box.
  • FastNMS appears slower and produces worse mAP regardless of class count or multi_cls.

@glenn-jocher (Contributor, Author) commented Mar 5, 2020

@Zzh-tju ah, I think I understand your confusion. Maybe I should rename multi_cls to multi_label to better explain it. This is what it is doing:
https://en.wikipedia.org/wiki/Multi-label_classification

It's intended for multi-label datasets like OIv5, where a 'person' can also be a 'man' or a 'woman' (i.e. two correct labels for one object). It also helps out COCO mAP a bit, despite COCO being a single-label dataset.

Update: fixed in ultralytics/yolov3@692b006

@Zzh-tju commented Mar 6, 2020

@glenn-jocher yeah, now I just want to know the speed when doing NMS for each class. Traditional NMS must run a for loop over classes, right? So I guess Fast NMS will be faster, since it handles all classes at once.

@glenn-jocher (Contributor, Author) commented:

@Zzh-tju the speeds provided are for NMS over all 80 COCO classes: 1.6 ms per image. The batched methods handle all classes simultaneously.
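For comparison, a traditional per-class loop would look roughly like this (illustrative sketch, not the repo code):

```python
import torch
import torchvision

keep = []
for cls in c.unique():                       # c: [N] class index per box
    m = c == cls
    idx = m.nonzero(as_tuple=True)[0]        # indices of this class's boxes
    k = torchvision.ops.nms(boxes[m], scores[m], iou_thres)  # single-class NMS
    keep.append(idx[k])
keep = torch.cat(keep)
```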

@glenn-jocher (Contributor, Author) commented:

https://github.com/pytorch/vision/blob/b6f28ec1a8c5fdb8d01cc61946e8f87dddcfa830/torchvision/ops/boxes.py#L39

```python
def batched_nms(boxes, scores, idxs, iou_threshold):
    # type: (Tensor, Tensor, Tensor, float)
    """
    Performs non-maximum suppression in a batched fashion.
    Each index value correspond to a category, and NMS
    will not be applied between elements of different categories.
    Parameters
    ----------
    boxes : Tensor[N, 4]
        boxes where NMS will be performed. They
        are expected to be in (x1, y1, x2, y2) format
    scores : Tensor[N]
        scores for each one of the boxes
    idxs : Tensor[N]
        indices of the categories for each one of the boxes.
    iou_threshold : float
        discards all overlapping boxes
        with IoU > iou_threshold
    Returns
    -------
    keep : Tensor
        int64 tensor with the indices of
        the elements that have been kept by NMS, sorted
        in decreasing order of scores
    """
```

@Gaondong commented:

@glenn-jocher (Contributor, Author) commented:

@Gaondong yes I already tried to implement it, and was unable to reproduce their results.

@Gaondong commented:

> @Gaondong yes I already tried to implement it, and was unable to reproduce their results.

Thanks.

@glenn-jocher (Contributor, Author) commented:

@Gaondong see ultralytics/yolov3#679 (comment)

I used this code for Matrix (Soft) NMS:

```python
elif method == 'matrix':  # Matrix NMS from https://arxiv.org/abs/2003.10152
    iou = box_iou(boxes, boxes).triu_(diagonal=1)  # upper triangular iou matrix
    m = iou.max(0)[0].view(-1, 1)  # max values
    decay = torch.exp(-(iou ** 2 - m ** 2) / 0.5).min(0)[0]  # gauss with sigma=0.5
    scores *= decay
    i = torch.full((boxes.shape[0],), fill_value=1).bool()
```
