
Not able to get 30+ fps processing speed on Nvidia RTX 2080 GPU #35

Closed
rkishore opened this issue May 9, 2019 · 17 comments

@rkishore

rkishore commented May 9, 2019

Hello, first off, thank you for sharing this amazing work. Much appreciated.

I wanted to report that I also could not get 30+ fps on an Nvidia RTX 2080 GPU with 8GB RAM. With video I get 8-10 fps; with images I get ~16fps (0.06sec/image) with the Resnet-101 model, ~20fps (0.05sec/image) with the Resnet-50 model, and 17-18fps (0.055sec/image) with the Darknet-53 model. This is quite impressive, but it's roughly half of what is reported in the paper. For images, I used the python timeit module to wrap the evalimage function to get my numbers. It is also odd that the difference in speed between the models is so small (especially between Resnet-101 and Resnet-50), which suggests to me that something is cutting the processing speed roughly in half for all the models.
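The timeit wrapping described above might look roughly like this; `evalimage_stub` is a hypothetical stand-in for the real evalimage in eval.py, which runs the network on one image and writes out the annotated result:

```python
import timeit

def evalimage_stub(path):
    # Stand-in for eval.py's evalimage: the real function runs the
    # network on one image and writes the annotated result to disk.
    return path.upper()

# Time n calls and report seconds per image and the implied fps.
n = 100
total = timeit.timeit(lambda: evalimage_stub("test_images/0001.png"), number=n)
sec_per_image = total / n
print(f"{sec_per_image:.6f} s/image, {1 / sec_per_image:.1f} fps")
```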

The command I am using is as below (except I change the model name as needed):

python3 eval.py --trained_model=weights/yolact_resnet50_54_800000.pth --score_threshold=0.4 --top_k=100 --images=./test_images:./test_output_images

I also tried using --benchmark but there is no change in the numbers above.

I was wondering if I could get some help to figure this out.

@rkishore rkishore changed the title Not able to get 40+ fps processing speed on Nvidia RTX 2080 GPU Not able to get 30+ fps processing speed on Nvidia RTX 2080 GPU May 9, 2019
@rkishore
Author

rkishore commented May 9, 2019

Just wanted to report that running the benchmark on the COCO dataset as per your instructions gets me much closer to the reported numbers. Now I wonder what the difference is between the --benchmark code and the actual per-image instance segmentation code.

With Resnet-101

python3 eval.py --trained_model=weights/yolact_base_54_800000.pth --benchmark --max_images=1000

Config not specified. Parsed yolact_base_config from the file name.

loading annotations into memory...
Done (t=0.46s)
creating index...
index created!
Loading model... Done.

Processing Images ██████████████████████████████ 1000 / 1000 (100.00%) 29.87 fps

Stats for the last frame:

  Name      | Time (ms)  

----------------+------------
Network Extra | 0.1927
backbone | 6.4748
fpn | 0.6773
proto | 0.4329
pred_heads | 1.6741
makepriors | 0.0078
Detect | 21.1403
Postprocess | 0.9253
Copy | 1.8050
Sync | 0.0101
----------------+------------
Total | 33.3403

Average: 29.87 fps, 33.48 ms

With Resnet-50

python3 eval.py --trained_model=weights/yolact_resnet50_54_800000.pth --benchmark --max_images=1000
= args.display = 0
Config not specified. Parsed yolact_resnet50_config from the file name.

loading annotations into memory...
Done (t=0.44s)
creating index...
index created!
Loading model... Done.

Processing Images ██████████████████████████████ 1000 / 1000 (100.00%) 40.06 fps

Stats for the last frame:

  Name      | Time (ms)  

----------------+------------
Network Extra | 0.2002
backbone | 3.3638
fpn | 0.6709
proto | 0.4280
pred_heads | 1.6549
makepriors | 0.0080
Detect | 15.1751
Postprocess | 0.9490
Copy | 1.7866
Sync | 0.0093
----------------+------------
Total | 24.2458

Average: 40.06 fps, 24.96 ms

With Darknet-53

python3 eval.py --trained_model=weights/yolact_darknet53_54_800000.pth --benchmark --max_images=1000
Config not specified. Parsed yolact_darknet53_config from the file name.

loading annotations into memory...
Done (t=0.45s)
creating index...
index created!
Loading model... Done.

Processing Images ██████████████████████████████ 1000 / 1000 (100.00%) 34.68 fps

Stats for the last frame:

  Name      | Time (ms)  

----------------+------------
Network Extra | 0.1865
backbone | 3.4872
fpn | 0.6769
proto | 0.4471
pred_heads | 1.6641
makepriors | 0.0086
Detect | 19.1224
Postprocess | 0.9185
Copy | 1.8112
Sync | 0.0089
----------------+------------
Total | 28.3314

Average: 34.68 fps, 28.84 ms

@dbolya
Owner

dbolya commented May 10, 2019

The FPS we report comes from the command,
python eval.py --trained_model=<modelname> --benchmark --max_images=400
run on one Titan Xp. --benchmark mode times just the raw model.

Like other papers, our timing only reports the speed of the model itself. That is, timing starts when the image is finished loading and stops when the network outputs masks. Note that this timing does not include 1.) loading the image, 2.) rendering the mask onto the image, or 3.) displaying the image, all of which are included in evalvideo, and the first two of which in evalimage.

Right now, that step 2 is particularly limiting for us, and it's the bottleneck that you see giving you that lower than reported fps. I'm working on fixing this so that we can run the full model from loading to displaying all at 30 fps (see #17), but that's difficult to do with Python (thanks to the GIL) and without direct access to the graphics card (w/o CUDA or a graphics library like OpenGL or Vulkan).

A large amount of time right now is spent rendering the image on the GPU, copying the image to the CPU to draw boxes and text, and then passing the CPU image to OpenCV which just copies it back to the GPU internally. A real production-ready version of this would likely have to be in native C++ using a CUDA matrix as a texture in Vulkan or OpenGL to render directly to the screen, but I'd like to keep the project in native Pytorch for as long as possible (so that everyone can easily start using it / add to it).

Good news is though that I have updated rendering code in the works, and I think I'll be able to get close to that sweet sweet 30 fps with that. It should be out soon, so I'll keep you posted.

@rkishore
Author

@dbolya, thanks for the explanation and for taking the time to respond. Do you know how much time step 1 (i.e. loading the image) adds to the whole equation?

Also, excited to hear about the updated rendering code.

@dbolya
Owner

dbolya commented May 10, 2019

I'm actually really glad you asked that! When I timed it, that step took a whopping 19 ms, which didn't seem right at all.

I then narrowed it down to this line
torch.Tensor(frame).float().cuda()
which took a full 16 ms on its own!

Turns out most of that was coming from the torch.Tensor constructor, so I changed that to
torch.from_numpy(frame).float().cuda()
but that still took 15 ms, most of which came from the .float() on the CPU.

So, I once again rearranged that to get
torch.from_numpy(frame).cuda().float()
which took only 1 ms...

So on the current master branch, step 1 takes 19 ms, but now it's down to 4. I'll push this along with my new rendering code and other speed improvements probably later today. Note though that evalvideo is very multithreaded and the torch.Tensor constructor likely releases the GIL (as it's in C++), so this doesn't look like it had as huge an impact on evaluation (though it did take me from 28 fps on one video to 31).
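Why the order of `.float()` and `.cuda()` matters can be seen from data sizes alone: converting to float32 on the CPU quadruples the number of bytes that then have to cross the PCIe bus to the GPU, whereas copying the raw uint8 frame first ships a quarter of the data and lets the GPU do the cheap conversion. A numpy-only sketch (no GPU required; the 550x550 frame shape is an assumption for illustration):

```python
import numpy as np

# A BGR frame as it comes out of OpenCV: uint8, 1 byte per channel.
frame = np.zeros((550, 550, 3), dtype=np.uint8)

# .float().cuda() order: convert on the CPU first, then copy 4x the bytes.
bytes_float_first = frame.astype(np.float32).nbytes

# .cuda().float() order: copy the small uint8 buffer, convert on the GPU.
bytes_uint8_first = frame.nbytes

print(bytes_uint8_first, bytes_float_first)  # the float-first path moves 4x the data
```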

@dbolya
Owner

dbolya commented May 11, 2019

Pushed the patch. Pull the latest commit and run

python eval.py --trained_model=weights/yolact_resnet50_54_800000.pth --score_threshold=0.15 --top_k=15 --video_multiframe=2 --video=<your_video>

to test the new video speeds. --images should also have received a similar speed boost.

Also let the video play for a little bit before reporting the FPS because it goes up over time in my experience (seems like the first couple frames take longer than the rest even after initialization).

@rkishore
Author

@dbolya, thanks a lot for the patch. With the command you sent, my early tests show 22-23fps with videos (when displaying the output), and 15-16fps when writing to an output video. Definitely an improvement. My GPU is maxed out, so I likely need more GPU cores with this implementation.

For --images, I don't see a big change from before (I am getting ~23fps with Resnet-50 when writing out the output and ~29fps when I comment out the cv2.imwrite for the output image). It is likely that the GPU horsepower I have is insufficient, plus the mask writeout has a sizable penalty. I have an RTX 2080 with 2300 CUDA cores, and AFAIK the Titan Xp has ~1500 more CUDA cores, which may be where the processing speed difference is coming from. What processing speed do you get on average with --images on the Titan Xp? I am using the following command:

python3 eval.py --trained_model=weights/yolact_resnet50_54_800000.pth --score_threshold=0.4 --top_k=15 --images=./test_images:./test_output_images

@dbolya
Owner

dbolya commented May 13, 2019

For --images I'm getting 24.77 fps on a Titan Xp with the command you listed there when timing the whole evalimage function. I get 35.72 fps when I omit the cv2.imwrite call from timing (wow why is my server's SSD so slow?) Finally, I get 45.87 fps (the FPS Resnet50 runs at in benchmark mode) if I also omit the cv2.imread call (keeping the FastBaseTransform in).

That 45.87 comes from timing the following 3 lines:

batch = FastBaseTransform()(frame.unsqueeze(0))
preds = net(batch)
img_numpy = pred_display(preds, frame, None, None, undo_transform=False)

Note that I also included a torch.cuda.synchronize() in the timing for good measure but that doesn't matter because of pred_display's call to .cpu().

I guess the bottleneck on my server is disk operations, but those should be done in a separate thread anyway. I haven't bothered to multithread evalimage because --benchmark on COCO is all I needed for the paper, but if you want really fast evalimages you'd better multithread the data loading and saving. Note that I also didn't bother multithreading savevideo (for writing to an output video). I only specifically optimized evalvideo, the real-time demo one.

Also, when you're timing make sure to discard the first ~2 frametimes because Pytorch initializes things on the first or second pass through the network, so the first call for instance can take up to 4 seconds. You can run evalimage on a dummy image beforehand to counteract this if you'd like.
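The warm-up pattern described above can be sketched generically; `run_model` here is a hypothetical stand-in for a network forward pass (in real Pytorch, CUDA state is initialized lazily, so the first call or two can take seconds):

```python
import time

def run_model(x):
    # Stand-in for a forward pass; the real first call pays
    # one-time initialization costs and should not be averaged in.
    return x * 2

warmup = 2
times = []
for i in range(10):
    start = time.perf_counter()
    run_model(i)
    times.append(time.perf_counter() - start)

steady = times[warmup:]  # discard the warm-up iterations before averaging
avg_ms = 1000 * sum(steady) / len(steady)
print(f"average over {len(steady)} steady-state iterations: {avg_ms:.4f} ms")
```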

@rkishore
Author

@dbolya, thank you.

For --images, we are not that far off in performance. Since your result with cv2.imread is ~10fps slower (35fps with cv2.imread and 45fps without it), it looks like this function takes ~6-7ms?

Also, there is only one cv2.imread in eval.py inside evalimage. When you say you omit this function, I assume you mean you omit it from your time/speed calculations, correct? Because otherwise, where else will you get the input image to process from?

@dbolya
Owner

dbolya commented May 13, 2019

Yeah, I mean I omit it from the calculations.

It looks like this is the performance breakdown:

imread and GPU copy:  6.2 ms
    everything else: 21.8 ms
            imwrite: 12.4 ms
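As a sanity check, this breakdown is consistent with the fps figures quoted earlier in the thread; summing the relevant pieces recovers each measurement:

```python
# Breakdown reported above (all values in milliseconds).
imread_and_copy = 6.2
everything_else = 21.8
imwrite = 12.4

full = imread_and_copy + everything_else + imwrite  # whole evalimage
no_imwrite = imread_and_copy + everything_else      # omit cv2.imwrite
model_only = everything_else                        # omit cv2.imread too

for label, ms in [("full evalimage", full),
                  ("without imwrite", no_imwrite),
                  ("without imread too", model_only)]:
    print(f"{label}: {ms:.1f} ms -> {1000 / ms:.2f} fps")
```

The three fps values this prints (~24.75, ~35.71, ~45.87) match the 24.77 / 35.72 / 45.87 fps measurements reported above to within rounding.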

@rkishore
Author

@dbolya, thank you.

@zimenglan-sysu-512

zimenglan-sysu-512 commented May 20, 2019

hi @dbolya
i use the cmd

CUDA_VISIBLE_DEVICES=0 python3.6 eval.py --trained_model=weights/yolact_base_54_800000.pth --score_threshold=0.3 --top_k=100 --image=0001.png

fast_nms takes 0.11198925971984863 s. So how do I get the reported speed?
BTW, I use a Titan Xp.

@dbolya
Owner

dbolya commented May 20, 2019

@zimenglan-sysu-512 Pytorch uses the first image passed through the network to set itself up, meaning that the first iteration will take much longer than the rest. So the first image you evaluate will be slow (still has some setting up to do), but every image after that will be fast. You need to evaluate multiple images (perhaps with --images or a video with --video) to properly benchmark the speed. Remember to ignore the first frame if you're timing it yourself.

To get the numbers in the paper, download COCO and run
python eval.py --trained_model=<model> --max_images=400 --benchmark

@zimenglan-sysu-512

thanks @dbolya
you are right: evaluation is slow for the first few images, and fast after that.

@Rm1n90

Rm1n90 commented Feb 25, 2020

Hey @dbolya,
I wonder, if I convert the code to C++ with CUDA, what FPS should I expect (assuming the maximum FPS I can achieve now is 14)?

@dbolya
Owner

dbolya commented Mar 10, 2020

@Rm1n90 Idk, I haven't tested it myself. It'll probably be slightly faster, but not that much (maybe 10%?)

@syc10-09

Thanks for your amazing work!
I learned something new from the discussion above, but I still don't know how to solve my problem.

When I run eval.py on the coco2017 dataset with a Titan V, the following results appear:
Config not specified. Parsed yolact_plus_resnet50_config from the file name.

loading annotations into memory...
Done (t=0.46s)
creating index...
index created!
Loading model... Done.

Processing Images ██████████████████████████████ 400 / 400 (100.00%) 19.93 fps
Saving data...
Calculating mAP...

   |  all  |  .50  |  .55  |  .60  |  .65  |  .70  |  .75  |  .80  |  .85  |  .90  |  .95  |

-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+
box | 37.25 | 56.08 | 54.41 | 52.81 | 49.87 | 46.45 | 41.13 | 34.19 | 24.24 | 12.14 | 1.24 |
mask | 35.93 | 53.91 | 51.74 | 49.42 | 46.64 | 43.65 | 38.83 | 32.80 | 23.72 | 14.16 | 4.44 |
-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+

Process finished with exit code 0
The command I am using is as below (except I change the model name as needed):

python3 eval.py --trained_model=weights/yolact_plus_resnet50_54_800000.pth --score_threshold=0.15 --top_k=15 --max_images=400

First, maybe it's a silly question, but I really don't understand the meaning of the top_k parameter. Could you explain it to me?
Second, I don't know why the program runs at 19.93fps. Did I miss something important? What should I do to reach the paper's 33.5fps?
@dbolya @rkishore
I would appreciate your reply!

@damghanian

Hello, first of all, @dbolya thank you for sharing this work. I have a question.
@rkishore how do you calculate time per image from FPS? For example, you said: "I get ~16fps (0.06sec/image) with the Resnet-101 model, ~20fps (0.05sec/image) with the Resnet-50 model and 17-18fps (0.055sec/image) with the Darknet-53 model."
I would appreciate your reply!
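For reference, the conversion being asked about is just the reciprocal of the frame rate; a quick sketch using the fps figures quoted above:

```python
# Seconds per image is simply 1 / (frames per second).
for fps in (16, 20, 17.5):
    print(f"{fps} fps -> {1 / fps:.4f} s/image")
```

This gives 0.0625 s at 16 fps, 0.05 s at 20 fps, and ~0.057 s at 17.5 fps, matching the rounded figures in the quoted comment.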
