FastNMS on Ultralytics YOLOv3 #366

Open · glenn-jocher opened this issue Mar 4, 2020 · 21 comments

@glenn-jocher (Contributor) commented Mar 4, 2020

@dbolya we've had a request from @Zzh-tju to implement FastNMS in https://github.com/ultralytics/yolov3 per ultralytics/yolov3#679. Can you point us to the location in your code where the function is? Can we use it for boxes rather than masks?

We currently use the multi-class torchvision.ops.boxes.batched_nms() (middle row) as a compromise between speed and accuracy. We apply it once per image (all classes at once) and see an inference speed of 49 ms/img (inference + NMS) at 608 image size, conf_thresh=0.001, on a Tesla T4, giving us about 42.0/62.0 mAP@0.5:0.95/0.5 on COCO2014. We do not do masks though, only boxes.
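For reference, the per-image call looks roughly like this (variable names here are illustrative, not the exact repo code):

```python
import torchvision

# pred: [N, 6] tensor of (x1, y1, x2, y2, conf, cls) detections for one image,
# already filtered by conf_thresh (names illustrative)
boxes = pred[:, :4]    # [N, 4] boxes in (x1, y1, x2, y2) format
scores = pred[:, 4]    # [N] confidences
classes = pred[:, 5]   # [N] class indices
keep = torchvision.ops.boxes.batched_nms(boxes, scores, classes, iou_threshold=0.5)
pred = pred[keep]      # suppression is applied per class, but all classes in one call
```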

BTW, we also developed the 'merge' NMS method below, which is slower simply because it is implemented in Python rather than C, but it may be possible to combine fast and merge to get the best of both worlds.

| NMS method | Time, ms/img | Time, mm:ss | mAP@0.5:0.95 | mAP@0.5 |
| --- | --- | --- | --- | --- |
| 'vision_batched', multi_cls=False | 46 ms | 3:50 | 41.2 | 60.8 |
| 'vision_batched', multi_cls=True | 49 ms | 4:03 | 41.9 | 61.8 |
| 'merge', multi_cls=True | 120 ms | 9:58 | 42.3 | 62.0 |

@dbolya (Owner) commented Mar 4, 2020

Looking at batched_nms, it looks like what we call cross_class NMS, but then I'm not sure what multi_cls=True would mean.

Anyway, here's our implementation of Fast NMS:

```python
def fast_nms(self, boxes, masks, scores, iou_threshold: float = 0.5, top_k: int = 200, second_threshold: bool = False):
    scores, idx = scores.sort(1, descending=True)

    idx = idx[:, :top_k].contiguous()
    scores = scores[:, :top_k]

    num_classes, num_dets = idx.size()

    boxes = boxes[idx.view(-1), :].view(num_classes, num_dets, 4)
    masks = masks[idx.view(-1), :].view(num_classes, num_dets, -1)

    iou = jaccard(boxes, boxes)
    iou.triu_(diagonal=1)
    iou_max, _ = iou.max(dim=1)

    # Now just filter out the ones higher than the threshold
    keep = (iou_max <= iou_threshold)

    # We should also only keep detections over the confidence threshold, but at the cost of
    # maxing out your detection count for every image, you can just not do that. Because we
    # have such a minimal amount of computation per detection (matrix multiplication only),
    # this increase doesn't affect us much (+0.2 mAP for 34 -> 33 fps), so we leave it out.
    # However, when you implement this in your method, you should do this second threshold.
    if second_threshold:
        keep *= (scores > self.conf_thresh)

    # Assign each kept detection to its corresponding class
    classes = torch.arange(num_classes, device=boxes.device)[:, None].expand_as(keep)
    classes = classes[keep]

    boxes = boxes[keep]
    masks = masks[keep]
    scores = scores[keep]

    # Only keep the top cfg.max_num_detections highest scores across all classes
    scores, idx = scores.sort(0, descending=True)
    idx = idx[:cfg.max_num_detections]
    scores = scores[:cfg.max_num_detections]

    classes = classes[idx]
    boxes = boxes[idx]
    masks = masks[idx]

    return boxes, masks, classes, scores
```

It works on boxes, so you can just ignore the mask stuff. The relevant inputs are boxes ([N, 4]) and scores ([N, num_classes]). The inputs and outputs should both be on the GPU (or whatever your fastest device is, and make sure nothing ever touches the CPU in this function), and we pass in all detections with > 0.05 confidence, but I don't think passing everything in will hurt performance much since we take the top 200 anyway. Also, read the big comment about the second threshold.

Most of the code is setup and postprocessing; the core of the algorithm is actually just:

```python
iou = jaccard(boxes, boxes)
iou.triu_(diagonal=1)
iou_max, _ = iou.max(dim=1)

# Now just filter out the ones higher than the threshold
keep = (iou_max <= iou_threshold)
```

which is what's in the paper.
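If you only need boxes, a stripped-down class-agnostic version would look roughly like this (untested sketch, with torchvision's box_iou standing in for our jaccard):

```python
import torch
from torchvision.ops import box_iou

def fast_nms_boxes(boxes, scores, iou_threshold=0.5, top_k=200):
    # boxes: [N, 4] in (x1, y1, x2, y2); scores: [N] confidences
    scores, order = scores.sort(descending=True)
    order = order[:top_k]            # keep only the top_k highest-scoring boxes
    boxes = boxes[order]

    iou = box_iou(boxes, boxes)      # [k, k] pairwise IoU
    iou.triu_(diagonal=1)            # keep only IoU with higher-scoring boxes
    iou_max, _ = iou.max(dim=0)      # each box's max IoU with any higher-scoring box

    keep = iou_max <= iou_threshold  # suppress boxes that overlap a better box too much
    return order[keep]               # indices into the original boxes/scores
```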

@glenn-jocher (Contributor, Author) commented Mar 4, 2020

@dbolya great thanks! torchvision.ops.boxes.batched_nms() just means that the function accepts multiple classes at once. torchvision has a separate torchvision.ops.boxes.nms() function that only accepts single-class boxes, which you'd need to drop into a `for c in classes:`-style Python loop.

We use BCE for class loss, not CE, so it's possible that multiple classes may be above threshold for a given box in our repo. multi_cls=True means that we output multiple detections (same box, different classes) in this case. multi_cls=False means we only pick the very top class above threshold.
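Roughly, the difference between the two settings is this (illustrative sketch, not the exact repo code; cls_scores is a hypothetical [N, num_classes] matrix of per-box class confidences):

```python
import torch

if multi_cls:
    # multi_cls=True: one output per (box, class) pair above threshold,
    # so the same box can appear with several class labels
    box_idx, cls_idx = (cls_scores > conf_thres).nonzero(as_tuple=True)
    scores = cls_scores[box_idx, cls_idx]
else:
    # multi_cls=False: keep only the single best class per box, then threshold
    scores, cls_idx = cls_scores.max(dim=1)
    keep = scores > conf_thres
    box_idx = keep.nonzero(as_tuple=True)[0]
    scores, cls_idx = scores[keep], cls_idx[keep]
```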

I will try to implement this week and post the results here if I'm successful!

@glenn-jocher (Contributor, Author) commented Mar 4, 2020

@dbolya @Zzh-tju I've imported the YOLACT FastNMS functions into ultralytics/yolov3, and get the following results. The times are for inference+NMS on the 5k COCO2014 val images using a Google Colab instance with Tesla T4.

fast_batched below is the YOLACT FastNMS. I call it batched because I only call it once per image (it handles all classes at once). It is faster than torchvision.ops.boxes.batched_nms(), but unfortunately with a mAP penalty of about 0.3-0.4. It may be much faster than torchvision; it's unclear from these tests, as the operations below are likely dominated by inference time rather than NMS time. When I have time I will rerun on a large GCP VM with 16 cores and a V100 for the best comparison metrics.

| NMS method | Time, ms/img | Time, mm:ss | mAP@0.5:0.95 | mAP@0.5 |
| --- | --- | --- | --- | --- |
| 'vision_batched' | 49 ms | 4:03 | 41.9 | 61.8 |
| 'merge' | 120 ms | 9:58 | 42.3 | 62.0 |
| 'fast_batched' | 44 ms | 3:41 | 41.5 | 61.5 |

@dbolya (Owner) commented Mar 4, 2020

Very interesting, and yeah, I'm guessing the situations where Fast NMS would offer a huge speed increase depend on the detector and the rest of the code. Maybe the setup and postprocessing are a little too bloated too.

Also, now that you mention it, I'm fairly sure I could create a fast merge NMS that would be slightly worse than what you list there but almost as fast as Fast NMS. This will have to wait until after a certain very close deadline though >.>

@glenn-jocher (Contributor, Author) commented Mar 4, 2020

Update: I discovered that the majority of the time in ultralytics/yolov3/test.py was spent building pycocotools JSON files for official mAPs. If I turn off this functionality (computing mAP only with internal repo code), I get the following much-improved times for the 5k COCO2014 val images. The machine is a 12-vCPU V100 instance.

python3 test.py --weights yolov3-spp-ultralytics.pt --cfg yolov3-spp --img 608
| NMS method | Time, ms/img | Time, mm:ss | mAP@0.5:0.95 | mAP@0.5 |
| --- | --- | --- | --- | --- |
| 'vision_batched' (default) | 15.2 ms | 1:16 | 41.9 | 61.8 |
| 'merge' | 103.0 ms | 8:35 | 42.3 | 62.0 |
| 'fast_batched' | 14.6 ms | 1:13 | 41.5 | 61.5 |

I get a 4% drop in time for a 1% drop in mAP by switching from vision_batched to fast_batched, which isn't bad, though I suspect img-size reductions may yield slightly more favorable ratios. In any case, both implementations are much faster than the Python for-loop NMS used in merge. Merge simply creates new boxes using a weighted combination of the scores rather than deleting lower-score boxes outright. It seems to provide a +0.4 mAP bump, which might take Fast NMS back to the same mAP produced by vision_batched, but then we'd be back where we started, unfortunately.
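For context, the merge idea is roughly this (illustrative sketch only, not the repo's exact loop; dets is a hypothetical [N, 5] tensor of (x1, y1, x2, y2, conf) sorted by confidence, descending):

```python
import torch
from torchvision.ops import box_iou

merged = []
while dets.shape[0]:
    overlap = box_iou(dets[:1, :4], dets[:, :4])[0] > iou_thres  # boxes overlapping the top box (itself included)
    weights = dets[overlap, 4:5]                                 # their confidences, shape [K, 1]
    top = dets[0].clone()
    top[:4] = (weights * dets[overlap, :4]).sum(0) / weights.sum()  # confidence-weighted box coordinates
    merged.append(top)
    dets = dets[~overlap]                                        # drop everything that was merged in
merged = torch.stack(merged)  # assumes at least one detection
```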

@glenn-jocher (Contributor, Author) commented Mar 4, 2020

To further clarify the timing, I added profiling code to test.py that specifically tracks inference and NMS times in ultralytics/yolov3@e482392. This can be accessed with the --profile flag:

python3 test.py --weights yolov3-spp-ultralytics.pt --img 608 --conf 0.001 --profile

I ran with both default torchvision NMS and the yolact FastNMS, and actually saw a slight speed decrease with FastNMS:

Default: Profile results: 1.3/6.9/8.1 ms inference/NMS/total per image
FastNMS: Profile results: 1.3/7.1/8.4 ms inference/NMS/total per image

So perhaps the slight speed increase from FastNMS observed in the total test time is due simply to a reduced box count produced by this NMS method, which results in less postprocessing work during testing (mAP calculation etc.).

The other surprise was the great amount of total time spent on NMS vs inference. Even under the default settings 6.9/8.1 = 85% of the total time is spent on NMS!

@glenn-jocher (Contributor, Author) commented Mar 4, 2020

CORRECTION: my previous analysis was incorrect; it lacked the torch.cuda.synchronize() calls necessary when profiling CUDA operations. I've fixed this in ultralytics/yolov3@1430a1e. Corrected results, consistent across several runs:

python3 test.py --weights yolov3-spp-ultralytics.pt --img 608 --conf 0.001 --profile

Default: Profile results: 6.6/1.6/8.2 ms inference/NMS/total per image
FastNMS: Profile results: 6.6/1.9/8.5 ms inference/NMS/total per image

The conclusion is that inference uses most (80%) of the runtime in both cases, and that FastNMS appears to run slightly slower than the default torchvision.ops.boxes.batched_nms().
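The timing pattern is roughly this (illustrative sketch, not the exact test.py code; function and variable names are placeholders):

```python
import time
import torch

torch.cuda.synchronize()          # wait for any prior GPU work before starting the clock
t0 = time.time()
pred = model(imgs)                # inference
torch.cuda.synchronize()          # wait for inference kernels to actually finish
t1 = time.time()
out = non_max_suppression(pred, conf_thres, iou_thres)  # NMS
torch.cuda.synchronize()
t2 = time.time()
t_inference, t_nms = t1 - t0, t2 - t1
```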

@Zzh-tju commented Mar 5, 2020

Thanks, so it is carried out on one class, isn't it? e.g., cc_fast_NMS collapses all the classes into 1. How about multi-class?

@Zzh-tju commented Mar 5, 2020

And how many boxes do you choose? (top n)

@glenn-jocher (Contributor, Author) commented:

@Zzh-tju I imported the FastNMS code here. It's very clever, but unfortunately it seems to be a dead end, as it's slower and produces worse mAP than the default method.
https://github.com/ultralytics/yolov3/blob/8b6c8a53182b2415fd61459fc9a0ccbdef8dc904/utils/utils.py#L558-L568

I use all boxes above the --conf threshold; I don't discard any boxes or put any upper limit on the number of boxes.

The times and tests above are for the usual 5000-image COCO val set using yolov3-spp-ultralytics.pt with all 80 classes. Everything else is identical between the default and FastNMS tests. You can reproduce them by simply running

python3 test.py --weights yolov3-spp-ultralytics.pt --img 608 --conf 0.001 --profile

@glenn-jocher (Contributor, Author) commented Mar 5, 2020

@Zzh-tju perhaps I'm not understanding the purpose of the top-n boxes. I assumed it was a memory saver or speed enhancer, so I neglected to implement it; since I saw no out-of-memory errors when running on full-size COCO images, I assumed all was well.

Is it possible that since I did not implement the top-n selection I'm not recreating FastNMS correctly? The code I have is very simple; I think it captures the core intention (the upper-triangular IoU matrix), and a sketch of the skipped top-n step follows it:

```python
# Batched NMS
if batched:
    c = pred[:, 5] * 0 if agnostic else pred[:, 5]  # class-agnostic NMS
    boxes, scores = pred[:, :4].clone(), pred[:, 4]
    if method == 'vision_batch':
        i = torchvision.ops.boxes.batched_nms(boxes, scores, c, iou_thres)
    elif method == 'fast_batch':  # FastNMS from https://github.com/dbolya/yolact
        boxes += c.view(-1, 1) * max_wh  # separate boxes by class
        iou = box_iou(boxes, boxes).triu_(diagonal=1)  # upper triangular iou matrix
        i = iou.max(dim=0)[0] < iou_thres
```
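For completeness, the top-n pre-selection I skipped would look roughly like this (illustrative sketch only; applied just before building the IoU matrix above):

```python
# keep only the top_k highest-confidence boxes before computing the IoU matrix
top_k = 200
order = scores.argsort(descending=True)[:top_k]
boxes, scores, c = boxes[order], scores[order], c[order]
# ... then compute iou = box_iou(boxes, boxes).triu_(diagonal=1) on this reduced set as above
```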

@Zzh-tju commented Mar 5, 2020

Yeah, I mean your batched_nms is cross-class NMS, right? And I'm confused by your mention above: what is the difference between multi_cls=True and False?
Do you mean YOLO provides a box with multiple labels, and if set to False there will be only one class per box? But the NMS for the two modes is still cross-class. Since 'False' can be faster than 'True', I wonder about the setting differences between the two.

Another question: what about doing NMS for each class separately? (Fast NMS vs. traditional NMS)

@glenn-jocher (Contributor, Author) commented Mar 5, 2020

@Zzh-tju it's very simple.

  • All classes are processed correctly, no matter the class count.
  • multi_cls allows more than one label per box.
  • FastNMS appears slower and produces worse mAP regardless of class count or multi_cls.

@glenn-jocher (Contributor, Author) commented Mar 5, 2020

@Zzh-tju ah, I think I understand your confusion. Maybe I should rename multi_cls to multi_label to better explain it. This is what it is doing:
https://en.wikipedia.org/wiki/Multi-label_classification

It's intended for multi-label datasets like OIv5, where a 'person' can also be a 'man' or a 'woman' (i.e. two correct labels for one object). It also helps out COCO mAP a bit, despite COCO being a single-label dataset.

Update: fixed in ultralytics/yolov3@692b006

@Zzh-tju commented Mar 6, 2020

@glenn-jocher yeah, now I just want to know the speed when doing NMS for each class. Traditional NMS must run a for loop over classes, right? So I guess Fast NMS will be faster, since it handles all classes at once.

@glenn-jocher (Contributor, Author) commented:

@Zzh-tju the speeds provided are for NMS over all 80 COCO classes: 1.6 ms per image. The batched methods handle all classes simultaneously.
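For comparison, a traditional per-class loop would look roughly like this (illustrative sketch, not the repo code):

```python
import torch
import torchvision

keep = []
for cls in c.unique():                       # c: [N] class index per box
    m = c == cls
    idx = m.nonzero(as_tuple=True)[0]        # indices of this class's boxes
    k = torchvision.ops.nms(boxes[m], scores[m], iou_thres)  # single-class NMS
    keep.append(idx[k])
keep = torch.cat(keep)
```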

@glenn-jocher (Contributor, Author) commented:

https://github.com/pytorch/vision/blob/b6f28ec1a8c5fdb8d01cc61946e8f87dddcfa830/torchvision/ops/boxes.py#L39

```python
def batched_nms(boxes, scores, idxs, iou_threshold):
    # type: (Tensor, Tensor, Tensor, float)
    """
    Performs non-maximum suppression in a batched fashion.
    Each index value correspond to a category, and NMS
    will not be applied between elements of different categories.
    Parameters
    ----------
    boxes : Tensor[N, 4]
        boxes where NMS will be performed. They
        are expected to be in (x1, y1, x2, y2) format
    scores : Tensor[N]
        scores for each one of the boxes
    idxs : Tensor[N]
        indices of the categories for each one of the boxes.
    iou_threshold : float
        discards all overlapping boxes
        with IoU > iou_threshold
    Returns
    -------
    keep : Tensor
        int64 tensor with the indices of
        the elements that have been kept by NMS, sorted
        in decreasing order of scores
    """
```

@Gaondong commented:

@glenn-jocher (Contributor, Author) commented:

@Gaondong yes I already tried to implement it, and was unable to reproduce their results.

@Gaondong commented:

> @Gaondong yes I already tried to implement it, and was unable to reproduce their results.

Thanks.

@glenn-jocher (Contributor, Author) commented:

@Gaondong see ultralytics/yolov3#679 (comment)

I used this code for Matrix (Soft) NMS:

```python
elif method == 'matrix':  # Matrix NMS from https://arxiv.org/abs/2003.10152
    iou = box_iou(boxes, boxes).triu_(diagonal=1)  # upper triangular iou matrix
    m = iou.max(0)[0].view(-1, 1)  # max values
    decay = torch.exp(-(iou ** 2 - m ** 2) / 0.5).min(0)[0]  # gauss with sigma=0.5
    scores *= decay
    i = torch.full((boxes.shape[0],), fill_value=1).bool()
```
