
Not able to get 30+ fps processing speed on Nvidia RTX 2080 GPU #35

Closed
rkishore opened this issue May 9, 2019 · 17 comments

@rkishore

rkishore commented May 9, 2019

Hello, first off, thank you for sharing this amazing work. Much appreciated.

I wanted to report that I also could not get 30+ fps on an Nvidia RTX 2080 GPU with 8GB RAM. With video I get 8-10 fps; with images I get ~16fps (0.06sec/image) with the Resnet-101 model, ~20fps (0.05sec/image) with the Resnet-50 model, and 17-18fps (0.055sec/image) with the Darknet-53 model. This is quite impressive, but it's roughly half of what is reported in the paper. For images, I used the python timeit module to wrap the evalimage function to get my numbers. It is also odd that the difference in speed between the models is so small (especially between Resnet-101 and Resnet-50), which suggests to me that something is cutting the processing speed roughly in half for all the models.
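The timeit wrapping described above might look roughly like this; `evalimage_stub` is a hypothetical stand-in for the real evalimage in eval.py, which runs the network on one image and writes out the annotated result:

```python
import timeit

def evalimage_stub(path):
    # Stand-in for eval.py's evalimage: the real function runs the
    # network on one image and writes the annotated result to disk.
    return path.upper()

# Time n calls and report seconds per image and the implied fps.
n = 100
total = timeit.timeit(lambda: evalimage_stub("test_images/0001.png"), number=n)
sec_per_image = total / n
print(f"{sec_per_image:.6f} s/image, {1 / sec_per_image:.1f} fps")
```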

The command I am using is as below (except I change the model name as needed):

python3 eval.py --trained_model=weights/yolact_resnet50_54_800000.pth --score_threshold=0.4 --top_k=100 --images=./test_images:./test_output_images

I also tried using --benchmark but there is no change in the numbers above.

I was wondering if I could get some help to figure this out.

@rkishore rkishore changed the title Not able to get 40+ fps processing speed on Nvidia RTX 2080 GPU Not able to get 30+ fps processing speed on Nvidia RTX 2080 GPU May 9, 2019
@rkishore
Author

rkishore commented May 9, 2019

Just wanted to report that running the benchmark on the COCO dataset as per your instructions gets me much closer to the reported numbers. Now I wonder what the difference is between the --benchmark code and the actual per-image instance segmentation code.

With Resnet-101

python3 eval.py --trained_model=weights/yolact_base_54_800000.pth --benchmark --max_images=1000

Config not specified. Parsed yolact_base_config from the file name.

loading annotations into memory...
Done (t=0.46s)
creating index...
index created!
Loading model... Done.

Processing Images ██████████████████████████████ 1000 / 1000 (100.00%) 29.87 fps

Stats for the last frame:

  Name      | Time (ms)  

----------------+------------
Network Extra | 0.1927
backbone | 6.4748
fpn | 0.6773
proto | 0.4329
pred_heads | 1.6741
makepriors | 0.0078
Detect | 21.1403
Postprocess | 0.9253
Copy | 1.8050
Sync | 0.0101
----------------+------------
Total | 33.3403

Average: 29.87 fps, 33.48 ms

With Resnet-50

python3 eval.py --trained_model=weights/yolact_resnet50_54_800000.pth --benchmark --max_images=1000
= args.display = 0
Config not specified. Parsed yolact_resnet50_config from the file name.

loading annotations into memory...
Done (t=0.44s)
creating index...
index created!
Loading model... Done.

Processing Images ██████████████████████████████ 1000 / 1000 (100.00%) 40.06 fps

Stats for the last frame:

  Name      | Time (ms)  

----------------+------------
Network Extra | 0.2002
backbone | 3.3638
fpn | 0.6709
proto | 0.4280
pred_heads | 1.6549
makepriors | 0.0080
Detect | 15.1751
Postprocess | 0.9490
Copy | 1.7866
Sync | 0.0093
----------------+------------
Total | 24.2458

Average: 40.06 fps, 24.96 ms

With Darknet-53

python3 eval.py --trained_model=weights/yolact_darknet53_54_800000.pth --benchmark --max_images=1000
Config not specified. Parsed yolact_darknet53_config from the file name.

loading annotations into memory...
Done (t=0.45s)
creating index...
index created!
Loading model... Done.

Processing Images ██████████████████████████████ 1000 / 1000 (100.00%) 34.68 fps

Stats for the last frame:

  Name      | Time (ms)  

----------------+------------
Network Extra | 0.1865
backbone | 3.4872
fpn | 0.6769
proto | 0.4471
pred_heads | 1.6641
makepriors | 0.0086
Detect | 19.1224
Postprocess | 0.9185
Copy | 1.8112
Sync | 0.0089
----------------+------------
Total | 28.3314

Average: 34.68 fps, 28.84 ms

@dbolya
Owner

dbolya commented May 10, 2019

The FPS we report comes from the command,
python eval.py --trained_model=<modelname> --benchmark --max_images=400
run on one Titan Xp. --benchmark mode times just the raw model.

Like other papers, our timing only reports the speed of the model itself. That is, timing starts when the image is finished loading and stops when the network outputs masks. Note that this timing does not include 1.) loading the image, 2.) rendering the mask onto the image, or 3.) displaying the image, all of which are included in evalvideo, and the first two of which in evalimage.

Right now, that step 2 is particularly limiting for us, and it's the bottleneck that you see giving you that lower than reported fps. I'm working on fixing this so that we can run the full model from loading to displaying all at 30 fps (see #17), but that's difficult to do with Python (thanks to the GIL) and without direct access to the graphics card (w/o CUDA or a graphics library like OpenGL or Vulkan).

A large amount of time right now is spent rendering the image on the GPU, copying the image to the CPU to draw boxes and text, and then passing the CPU image to OpenCV which just copies it back to the GPU internally. A real production-ready version of this would likely have to be in native C++ using a CUDA matrix as a texture in Vulkan or OpenGL to render directly to the screen, but I'd like to keep the project in native Pytorch for as long as possible (so that everyone can easily start using it / add to it).

Good news is though that I have updated rendering code in the works, and I think I'll be able to get close to that sweet sweet 30 fps with that. It should be out soon, so I'll keep you posted.

@rkishore
Author

@dbolya, thanks for the explanation and for taking the time to respond. Do you know how much time step 1 (i.e. loading the image) adds to the whole equation?

Also, excited to hear about the updated rendering code.

@dbolya
Owner

dbolya commented May 10, 2019

I'm actually really glad you asked that! When I timed it, that step took a whopping 19 ms, which didn't seem right at all.

I then narrowed it down to this line
torch.Tensor(frame).float().cuda()
which took a full 16 ms on its own!

Turns out most of that was coming from the torch.Tensor constructor, so I changed that to
torch.from_numpy(frame).float().cuda()
but that still took 15 ms, most of which came from the .float() on the CPU.

So, I once again rearranged that to get
torch.from_numpy(frame).cuda().float()
which took only 1 ms...

So on the current master branch, step 1 takes 19 ms, but now it's down to 4. I'll push this along with my new rendering code and other speed improvements probably later today. Note though that evalvideo is very multithreaded and the torch.Tensor constructor likely releases the GIL (as it's in C++), so this doesn't look like it had as huge an impact on evaluation (though it did take me from 28 fps on one video to 31).
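Why the order of `.float()` and `.cuda()` matters can be seen from data sizes alone: converting to float32 on the CPU quadruples the number of bytes that then have to cross the PCIe bus to the GPU, whereas copying the raw uint8 frame first ships a quarter of the data and lets the GPU do the cheap conversion. A numpy-only sketch (no GPU required; the 550x550 frame shape is an assumption for illustration):

```python
import numpy as np

# A BGR frame as it comes out of OpenCV: uint8, 1 byte per channel.
frame = np.zeros((550, 550, 3), dtype=np.uint8)

# .float().cuda() order: convert on the CPU first, then copy 4x the bytes.
bytes_float_first = frame.astype(np.float32).nbytes

# .cuda().float() order: copy the small uint8 buffer, convert on the GPU.
bytes_uint8_first = frame.nbytes

print(bytes_uint8_first, bytes_float_first)  # the float-first path moves 4x the data
```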

@dbolya
Owner

dbolya commented May 11, 2019

Pushed the patch. Pull the latest commit and run

python eval.py --trained_model=weights/yolact_resnet50_54_800000.pth --score_threshold=0.15 --top_k=15 --video_multiframe=2 --video=<your_video>

to test the new video speeds. --images should also have received a similar speed boost.

Also let the video play for a little bit before reporting the FPS because it goes up over time in my experience (seems like the first couple frames take longer than the rest even after initialization).

@rkishore
Author

@dbolya, thanks a lot for the patch. With the command you sent, my early tests show 22-23fps with videos (when displaying the output), and 15-16fps when writing to an output video. Definitely an improvement. My GPU is maxed out, so I likely need more GPU cores with this implementation.

For --images, I don't see a big change from before (I am getting ~23fps with Resnet-50 when writing out the output and ~29fps when I comment out the cv2.imwrite for the output image). It is likely that the GPU horsepower I have is insufficient, plus the mask writeout has a sizable penalty. I have an RTX 2080 with 2300 CUDA cores, and AFAIK the Titan Xp has ~1500 more CUDA cores, which may be where the processing speed difference is coming from. What processing speed do you get on average with --images on the Titan Xp? I am using the following command:

python3 eval.py --trained_model=weights/yolact_resnet50_54_800000.pth --score_threshold=0.4 --top_k=15 --images=./test_images:./test_output_images

@dbolya
Owner

dbolya commented May 13, 2019

For --images I'm getting 24.77 fps on a Titan Xp with the command you listed there when timing the whole evalimage function. I get 35.72 fps when I omit the cv2.imwrite call from timing (wow why is my server's SSD so slow?) Finally, I get 45.87 fps (the FPS Resnet50 runs at in benchmark mode) if I also omit the cv2.imread call (keeping the FastBaseTransform in).

That 45.87 comes from timing the following 3 lines:

batch = FastBaseTransform()(frame.unsqueeze(0))
preds = net(batch)
img_numpy = pred_display(preds, frame, None, None, undo_transform=False)

Note that I also included a torch.cuda.synchronize() in the timing for good measure but that doesn't matter because of pred_display's call to .cpu().

I guess the bottleneck on my server is disk operations, but those should be done in a separate thread anyway. I haven't bothered to multithread evalimage because --benchmark on COCO is all I needed for the paper, but if you want really fast evalimages you'd better multithread the data loading and saving. Note that I also didn't bother multithreading savevideo (for writing to an output video). I only specifically optimized evalvideo, the real-time demo one.

Also, when you're timing make sure to discard the first ~2 frametimes because Pytorch initializes things on the first or second pass through the network, so the first call for instance can take up to 4 seconds. You can run evalimage on a dummy image beforehand to counteract this if you'd like.
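The warm-up pattern described above can be sketched generically; `run_model` here is a hypothetical stand-in for a network forward pass (in real Pytorch, CUDA state is initialized lazily, so the first call or two can take seconds):

```python
import time

def run_model(x):
    # Stand-in for a forward pass; the real first call pays
    # one-time initialization costs and should not be averaged in.
    return x * 2

warmup = 2
times = []
for i in range(10):
    start = time.perf_counter()
    run_model(i)
    times.append(time.perf_counter() - start)

steady = times[warmup:]  # discard the warm-up iterations before averaging
avg_ms = 1000 * sum(steady) / len(steady)
print(f"average over {len(steady)} steady-state iterations: {avg_ms:.4f} ms")
```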

@rkishore
Author

@dbolya, thank you.

For --images, we are not that far off in performance. Since your result with cv2.imread is ~10fps slower (35fps with cv2.imread and 45fps without it), it looks like this function takes ~6-7ms?

Also, there is only one cv2.imread in eval.py inside evalimage. When you say you omit this function, I assume you mean you omit it from your time/speed calculations, correct? Because otherwise, where else will you get the input image to process from?

@dbolya
Owner

dbolya commented May 13, 2019

Yeah, I mean I omit it from the calculations.

It looks like this is the performance breakdown:

imread and GPU copy:  6.2 ms
    everything else: 21.8 ms
            imwrite: 12.4 ms
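As a sanity check, this breakdown is consistent with the fps figures quoted earlier in the thread; summing the relevant pieces recovers each measurement:

```python
# Breakdown reported above (all values in milliseconds).
imread_and_copy = 6.2
everything_else = 21.8
imwrite = 12.4

full = imread_and_copy + everything_else + imwrite  # whole evalimage
no_imwrite = imread_and_copy + everything_else      # omit cv2.imwrite
model_only = everything_else                        # omit cv2.imread too

for label, ms in [("full evalimage", full),
                  ("without imwrite", no_imwrite),
                  ("without imread too", model_only)]:
    print(f"{label}: {ms:.1f} ms -> {1000 / ms:.2f} fps")
```

The three fps values this prints (~24.75, ~35.71, ~45.87) match the 24.77 / 35.72 / 45.87 fps measurements reported above to within rounding.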

@rkishore
Author

@dbolya, thank you.

@zimenglan-sysu-512

zimenglan-sysu-512 commented May 20, 2019

hi @dbolya
i use the cmd

CUDA_VISIBLE_DEVICES=0 python3.6 eval.py --trained_model=weights/yolact_base_54_800000.pth --score_threshold=0.3 --top_k=100 --image=0001.png

fast_nms takes 0.11198925971984863 s. So how do I get the reported speed?
BTW, I use a Titan Xp.

@dbolya
Owner

dbolya commented May 20, 2019

@zimenglan-sysu-512 Pytorch uses the first image passed through the network to set itself up, meaning that the first iteration will take much longer than the rest. So the first image you evaluate will be slow (still has some setting up to do), but every image after that will be fast. You need to evaluate multiple images (perhaps with --images or a video with --video) to properly benchmark the speed. Remember to ignore the first frame if you're timing it yourself.

To get the numbers in the paper, download COCO and run
python eval.py --trained_model=<model> --max_images=400 --benchmark

@zimenglan-sysu-512

thanks @dbolya
you are right: evaluation is slow for the first few images, and fast after that.

@Rm1n90

Rm1n90 commented Feb 25, 2020

Hey @dbolya,
I wonder, if I convert the code to C++ with CUDA, what FPS should I expect (assuming the maximum FPS I can achieve now is 14)?

@dbolya
Owner

dbolya commented Mar 10, 2020

@Rm1n90 Idk, I haven't tested it myself. It'll probably be slightly faster, but not that much (maybe 10%?)

@syc10-09

Thanks for your amazing work!
I learned something new from the discussion above, but I still don't know how to solve my problem.

When I run eval.py on the coco2017 dataset with a Titan V, the following results appear:
Config not specified. Parsed yolact_plus_resnet50_config from the file name.

loading annotations into memory...
Done (t=0.46s)
creating index...
index created!
Loading model... Done.

Processing Images ██████████████████████████████ 400 / 400 (100.00%) 19.93 fps
Saving data...
Calculating mAP...

   |  all  |  .50  |  .55  |  .60  |  .65  |  .70  |  .75  |  .80  |  .85  |  .90  |  .95  |

-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+
box | 37.25 | 56.08 | 54.41 | 52.81 | 49.87 | 46.45 | 41.13 | 34.19 | 24.24 | 12.14 | 1.24 |
mask | 35.93 | 53.91 | 51.74 | 49.42 | 46.64 | 43.65 | 38.83 | 32.80 | 23.72 | 14.16 | 4.44 |
-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+

Process finished with exit code 0
The command I am using is as below (except I change the model name as needed):

python3 eval.py --trained_model=weights/yolact_plus_resnet50_54_800000.pth --score_threshold=0.15 --top_k=15 --max_images=400

First, maybe it's a silly question, but I really don't understand the meaning of the top_k parameter. Could you explain it to me?
Second, I don't know why the program runs at 19.93fps. Did I miss something important? What should I do to reach the paper's 33.5fps?
@dbolya @rkishore
I would appreciate your reply!

@damghanian

Hello, first of all, @dbolya thank you for sharing this work. I have a question.
@rkishore how do you calculate time per image from FPS? For example, you said: "I get ~16fps (0.06sec/image) with the Resnet-101 model, ~20fps (0.05sec/image) with the Resnet-50 model and 17-18fps (0.055sec/image) with the Darknet-53 model."
I would appreciate your reply!
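For reference, the conversion being asked about is just the reciprocal of the frame rate; a quick sketch using the fps figures quoted above:

```python
# Seconds per image is simply 1 / (frames per second).
for fps in (16, 20, 17.5):
    print(f"{fps} fps -> {1 / fps:.4f} s/image")
```

This gives 0.0625 s at 16 fps, 0.05 s at 20 fps, and ~0.057 s at 17.5 fps, matching the rounded figures in the quoted comment.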
