# FastNMS on Ultralytics YOLOv3 #366
Anyway, here's our implementation of Fast NMS: `yolact/layers/functions/detection.py`, lines 137 to 180 at commit `092554a`.

It works on boxes, so you can just ignore the mask stuff. The relevant inputs are […]. Most of the code is setup and postprocessing; the core of the algorithm is actually just:

```python
iou = jaccard(boxes, boxes)
iou.triu_(diagonal=1)
iou_max, _ = iou.max(dim=1)

# Now just filter out the ones higher than the threshold
keep = (iou_max <= iou_threshold)
```

which is what's in the paper.
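For readers without the yolact code at hand, here is a hypothetical pure-Python sketch of the same idea (the real code above operates on PyTorch tensors with `jaccard`, `triu_`, and `max`; the function names `iou` and `fast_nms` here are illustrative, not from the repo). The key property of Fast NMS is that a box is dropped whenever *any* higher-scoring box overlaps it above the threshold, even if that higher-scoring box was itself dropped:

```python
def iou(a, b):
    # a, b: boxes in (x1, y1, x2, y2) format
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    if inter == 0.0:
        return 0.0
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def fast_nms(boxes, iou_threshold=0.5):
    # boxes must be pre-sorted by descending score.
    # Keep box i only if its max IoU against all higher-scoring boxes
    # (the column max of the upper-triangular IoU matrix) is below threshold.
    keep = []
    for i in range(len(boxes)):
        col_max = max((iou(boxes[j], boxes[i]) for j in range(i)), default=0.0)
        if col_max <= iou_threshold:
            keep.append(i)
    return keep
```

Unlike traditional greedy NMS, the inner max runs over *all* earlier boxes rather than only the kept ones, which is exactly what makes the matrix formulation parallelizable.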
@dbolya great thanks! We use BCE for class loss, not CE, so it's possible that multiple classes may be above threshold for a given box in our repo. I will try to implement this week and post the results here if I'm successful!
@dbolya @Zzh-tju I've imported the YOLACT FastNMS functions into ultralytics/yolov3, and get the following results. The times are for inference+NMS on the 5k COCO2014 val images using a Google Colab instance with Tesla T4.
Very interesting, and yeah, I'm guessing the situations where Fast NMS would offer a huge speed increase depend on the detector and the rest of the code. Maybe the setup and postprocessing are a little too bloated too. Also, now that you mention it, I'm fairly sure I could create a fast merge NMS that would be slightly worse than what you list there but almost as fast as Fast NMS. This will have to wait until after a certain very close deadline tho >.>
Update: I discovered that a majority of the time in ultralytics/yolov3/test.py was spent building pycocotools JSON files for official mAPs. If I turn off this functionality (computing mAP only with internal repo code), I get the following much improved times for the 5k COCO2014 val images. Machine is a 12-vCPU V100 instance.

```shell
python3 test.py --weights yolov3-spp-ultralytics.pt --cfg yolov3-spp --img 608
```
I get a 4% drop in time for a 1% drop in mAP by switching from vision_batched to fast, which isn't bad, though I suspect img-size reductions may yield slightly more favorable ratios. In any case, both implementations are much faster than the python implementation.
To further clarify the timing, I added profiling code to test.py that specifically tracks inference and NMS times in ultralytics/yolov3@e482392. This can be accessed with the `--profile` flag:

```shell
python3 test.py --weights yolov3-spp-ultralytics.pt --img 608 --conf 0.001 --profile
```

I ran with both the default torchvision NMS and the yolact FastNMS, and actually saw a slight speed *decrease* with FastNMS. Default: […]

So perhaps the slight speed increase from FastNMS observed in the total test time is due simply to the reduced box count produced by this NMS method, which results in less postprocessing work during testing (mAP calculation etc.). The other surprise was the great amount of total time spent on NMS vs inference: even under the default settings, 6.9/8.1 = 85% of the total time is spent on NMS!
CORRECTION: My previous analysis was incorrect; it lacked the […].

```shell
python3 test.py --weights yolov3-spp-ultralytics.pt --img 608 --conf 0.001 --profile
```

Default: […]

Conclusion: inference uses most (80%) of the runtime in both cases, and FastNMS appears to run slightly slower than the default.
Thanks, so it is carried out on one class, isn't it? E.g., cc_fast_NMS collapses all the classes into 1. How about multi-class?
And how many boxes do you choose? (top n) |
@Zzh-tju I imported the FastNMS code here. It's very clever, but unfortunately it seems to be a dead end, as it's slower and produces worse mAP than the default method. I use all boxes above […]. The times and tests above are for the usual 5000-image COCO val set using yolov3-spp-ultralytics.pt for all 80 classes. Everything is exactly the same in the tests between the default output and the FastNMS output. You can reproduce by simply running […].
@Zzh-tju perhaps I'm not understanding the purpose of the top n boxes. I assumed this was a memory saver or speed enhancer, so I neglected to implement it; since I saw no out-of-memory errors when running on full-size COCO images, I assumed all was well. Is it possible that since I did not implement the top n boxes, I'm not recreating FastNMS correctly? The code I have is very simple; I think it captures the core intention (the upper-triangular IoU matrix):

```python
# Batched NMS
if batched:
    c = pred[:, 5] * 0 if agnostic else pred[:, 5]  # class-agnostic NMS
    boxes, scores = pred[:, :4].clone(), pred[:, 4]
    if method == 'vision_batch':
        i = torchvision.ops.boxes.batched_nms(boxes, scores, c, iou_thres)
    elif method == 'fast_batch':  # FastNMS from https://github.com/dbolya/yolact
        boxes += c.view(-1, 1) * max_wh  # separate boxes by class
        iou = box_iou(boxes, boxes).triu_(diagonal=1)  # upper triangular iou matrix
        i = iou.max(dim=0)[0] < iou_thres
```
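The `boxes += c.view(-1, 1) * max_wh` line is the trick that lets one IoU matrix cover all classes: shifting each box by `class_id * max_wh` moves different classes into disjoint coordinate regions, so cross-class IoU is zero and one NMS pass handles everything. A hypothetical pure-Python sketch of just that offset step (names and the `4096` bound are illustrative assumptions, not from the repo):

```python
MAX_WH = 4096  # assumed upper bound on image width/height

def offset_boxes(boxes, class_ids, max_wh=MAX_WH):
    # boxes: list of (x1, y1, x2, y2); class_ids: list of ints.
    # After the shift, boxes of different classes can never overlap,
    # so class-aware NMS reduces to a single class-agnostic pass.
    return [(x1 + c * max_wh, y1 + c * max_wh, x2 + c * max_wh, y2 + c * max_wh)
            for (x1, y1, x2, y2), c in zip(boxes, class_ids)]
```

For example, two identical boxes with classes 0 and 1 end up 4096 pixels apart, so neither suppresses the other. This is the same idea `torchvision.ops.boxes.batched_nms` uses internally.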
Yeah, I mean your batch_nms is cross-class NMS, right? And I'm confused by your mention above: what is the difference between multi_cls=True and False? Another question: how about doing NMS for each class separately? (Fast NMS vs traditional NMS)
@Zzh-tju it's very simple.
@Zzh-tju ah, I think I understand your confusion. Maybe I should rename `multi_cls`. It's intended for multi-label datasets like OIv5, where a 'person' can also be a 'man' or a 'woman' (i.e. two correct labels for one object). It also helps COCO mAP a bit, despite it being a single-label dataset. Update: fixed in ultralytics/yolov3@692b006
@glenn-jocher yeah, now I just want to know the speed when doing NMS for each class. Traditional NMS must loop over the classes, right? So I guess Fast NMS will be faster, since it handles all classes simultaneously in one pass.
@Zzh-tju the speeds provided are for NMS for all 80 COCO classes for each image: 1.6 ms per image for all classes. The batched methods do all classes simultaneously. |
```python
def batched_nms(boxes, scores, idxs, iou_threshold):
    # type: (Tensor, Tensor, Tensor, float)
    """
    Performs non-maximum suppression in a batched fashion.

    Each index value correspond to a category, and NMS
    will not be applied between elements of different categories.

    Parameters
    ----------
    boxes : Tensor[N, 4]
        boxes where NMS will be performed. They
        are expected to be in (x1, y1, x2, y2) format
    scores : Tensor[N]
        scores for each one of the boxes
    idxs : Tensor[N]
        indices of the categories for each one of the boxes.
    iou_threshold : float
        discards all overlapping boxes
        with IoU > iou_threshold

    Returns
    -------
    keep : Tensor
        int64 tensor with the indices of
        the elements that have been kept by NMS, sorted
        in decreasing order of scores
    """
```
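The contract in that docstring can be stated as a hypothetical pure-Python reference (illustrative only, not torchvision's implementation): greedy NMS in descending score order, where a box may only be suppressed by a kept box of the *same* category.

```python
def batched_nms_ref(boxes, scores, idxs, iou_threshold):
    # Reference semantics of category-aware batched NMS.
    # boxes: list of (x1, y1, x2, y2); scores, idxs: parallel lists.
    def iou(a, b):
        x1, y1 = max(a[0], b[0]), max(a[1], b[1])
        x2, y2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
        if inter == 0.0:
            return 0.0
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        return inter / (area(a) + area(b) - inter)

    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # suppress only against kept boxes of the same category
        if all(idxs[i] != idxs[j] or iou(boxes[i], boxes[j]) <= iou_threshold
               for j in keep):
            keep.append(i)
    return keep  # indices in decreasing-score order, per the docstring
```

For instance, two coincident boxes with different category indices are both kept, while a coincident pair in the same category loses the lower-scoring one.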
I saw a new matrix nms. |
@Gaondong yes I already tried to implement it, and was unable to reproduce their results. |
Thanks. |
@Gaondong see ultralytics/yolov3#679 (comment). I used this code for Matrix (Soft) NMS:

```python
elif method == 'matrix':  # Matrix NMS from https://arxiv.org/abs/2003.10152
    iou = box_iou(boxes, boxes).triu_(diagonal=1)  # upper triangular iou matrix
    m = iou.max(0)[0].view(-1, 1)  # max values
    decay = torch.exp(-(iou ** 2 - m ** 2) / 0.5).min(0)[0]  # gauss with sigma=0.5
    scores *= decay
    i = torch.full((boxes.shape[0],), fill_value=1).bool()
```
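To make the decay term concrete, here is a hypothetical scalar sketch of the same computation (function name illustrative; the snippet above does this with tensors). Rather than hard suppression, each box's score is multiplied by `min_i exp(-(iou[i][j]^2 - m[i]^2) / sigma)`, where `m[i]` is box i's own max overlap with any higher-scoring box (the compensation term):

```python
import math

def matrix_nms_decays(iou, sigma=0.5):
    # iou: n x n upper-triangular matrix (list of lists), rows/cols ordered
    # by descending score; iou[i][j] is only nonzero for i < j.
    n = len(iou)
    # m[i]: max IoU of box i with any higher-scoring box (column max)
    m = [max((iou[k][i] for k in range(i)), default=0.0) for i in range(n)]
    decays = []
    for j in range(n):
        terms = [math.exp(-(iou[i][j] ** 2 - m[i] ** 2) / sigma)
                 for i in range(j)]
        decays.append(min(terms, default=1.0))
    return decays
```

The top-scoring box keeps decay 1.0; a box fully overlapped (IoU 1.0) by an unsuppressed top box gets decay exp(-2) with sigma=0.5, so its score shrinks sharply but is never set to zero, matching the soft-suppression behavior of the tensor code.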
@dbolya we've had a request from @Zzh-tju to implement FastNMS in https://github.com/ultralytics/yolov3 per ultralytics/yolov3#679. Can you point us to the location in your code where the function is? Can we use it for boxes rather than masks?

We currently use multi-class `torchvision.ops.boxes.batched_nms()` (middle row) as a compromise between speed and accuracy. We apply it once per image (all classes at once), and see an inference speed of 49 ms/img (inference + NMS) at 608 image size, `conf_thresh=0.001`, on a Tesla T4, giving us about 42.0/62.0 mAP@0.5/0.5...0.95 on COCO2014. We do not do masks though, only boxes.

BTW, we also developed the `merge` NMS method below, which is slower simply because it is implemented in python rather than C, but it may be possible to combine `fast` and `merge` together to get the best of both worlds.

| method | s/img | mm:ss | mAP@0.5:0.95 | mAP@0.5 |
| --- | --- | --- | --- | --- |
| 'vision_batched', multi_cls=False | | | | |
| 'vision_batched', multi_cls=True | | | | |
| 'merge', multi_cls=True | | | | |