Multiple images encoding on a single GPU in parallel (concurrent encoding) #83
Hi, thanks for writing. May I ask a few questions? You say that you encode 24 Full HD frames in 0.05 seconds, which means about 2 ms per frame. That is rather better than I'd expect, but I cannot say it is incorrect; it depends on image and encoder properties. The problem may also be that when there are "only" 24 images, the GPU initialization doesn't amortize well; to be concrete, CUDA initialization can cost as much as encoding 200 frames. How much improvement would you expect from parallel encoding? The thing is that the performance bottleneck is mostly PCIe transfers. I've created an example, encode.c.txt, and some evaluations:
So running in multiple processes improves the performance by some 36%; multi-threaded, somewhat less (there is some room for improvement here). When run within the NVIDIA profiler, GPUJPEG performance is limited by memory transfers of the uncompressed image to the GPU (for encode). From the profiler, I can imagine an improvement of, let's say, 20%, but surely not multiple times. Of course, I am writing about images in CPU RAM; if you have, or can have, them already on the GPU, it may be quite a different story.
Hi, I have an update on the above. I've tweaked the example from my previous post to use pinned memory and now its performance is optimal (it entirely utilizes host-to-device bandwidth according to the NVIDIA profiler); it is here. The point is that the input (raw) image buffer must be allocated with the CUDA API to achieve optimal performance, and CUDA streams should also be used. Is the example above something you were looking for, or have I misunderstood it somehow? Of course, it could be added to the GPUJPEG API, including some thread scheduler, to create a higher-level API for users, but I'd rather keep it simple.
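The pinned-memory-plus-streams approach described above can be sketched with plain CUDA runtime calls. This is a minimal illustration, not GPUJPEG code: the buffer sizes are arbitrary, and in a real encoder the transfer would be followed by encode kernels enqueued on the same stream. It needs a CUDA-capable GPU to run.

```c
#include <cuda_runtime.h>
#include <stdlib.h>

/* One Full HD RGB frame, for illustration. */
enum { W = 1920, H = 1080, FRAME_BYTES = W * H * 3 };

int main(void) {
    unsigned char *host_buf; /* pinned (page-locked) host memory */
    unsigned char *dev_buf;  /* device-side copy */
    cudaStream_t stream;

    /* cudaHostAlloc returns page-locked memory, which is what lets
     * cudaMemcpyAsync reach full host-to-device bandwidth. A buffer
     * from plain malloc() forces an extra staging copy. */
    if (cudaHostAlloc((void **)&host_buf, FRAME_BYTES,
                      cudaHostAllocDefault) != cudaSuccess)
        return EXIT_FAILURE;
    if (cudaMalloc((void **)&dev_buf, FRAME_BYTES) != cudaSuccess)
        return EXIT_FAILURE;
    cudaStreamCreate(&stream);

    /* Fill host_buf with the raw image here, then enqueue the transfer.
     * With one stream per in-flight frame, transfers for one frame can
     * overlap with encoding work for another. */
    cudaMemcpyAsync(dev_buf, host_buf, FRAME_BYTES,
                    cudaMemcpyHostToDevice, stream);
    /* ... enqueue encode work on the same stream ... */
    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    cudaFree(dev_buf);
    cudaFreeHost(host_buf);
    return EXIT_SUCCESS;
}
```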
I am closing this for now since I believe the question was more or less answered (according to my understanding). If not, feel free to reopen the issue.
Thank you for this great library; it has been extremely helpful.
My use case requires encoding n frames at a time. Sequential calls to `gpujpeg_encoder_encode` take a long time to process the n frames. I have employed `std::async` to run multiple concurrent executions of `gpujpeg_encoder_encode`, however I am getting the same delay as not using it at all. I also tried using a separate `gpujpeg_encoder` instance per frame, but that did not help either.
GPU is a T4, running Ubuntu 18.04. There are 24 frames of 1080 x 1920, taking around 0.05 seconds (under both the sequential and the parallel execution schemes with `std::async`). Thank you.
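The fan-out pattern described in the question looks roughly like the sketch below. `encode_frame` here is a hypothetical stand-in for a per-thread GPUJPEG encode call (it just sums the frame's bytes so the sketch runs anywhere); the structure of launching one `std::async` task per frame is the part being illustrated. As the discussion above explains, this alone does not speed things up when all tasks contend for the same PCIe link.

```cpp
#include <cstdint>
#include <future>
#include <numeric>
#include <vector>

// Hypothetical stand-in for a per-thread gpujpeg_encoder_encode() call:
// it "encodes" a frame by summing its bytes, so the sketch is runnable
// without a GPU.
static int encode_frame(const std::vector<std::uint8_t>& frame) {
    return std::accumulate(frame.begin(), frame.end(), 0);
}

// Launch one std::async task per frame, as the question describes,
// then collect the results in order.
std::vector<int> encode_all(const std::vector<std::vector<std::uint8_t>>& frames) {
    std::vector<std::future<int>> futures;
    futures.reserve(frames.size());
    for (const auto& f : frames)
        futures.push_back(std::async(std::launch::async, encode_frame, std::cref(f)));
    std::vector<int> results;
    results.reserve(futures.size());
    for (auto& fut : futures)
        results.push_back(fut.get());
    return results;
}
```

Note that with a real encoder, each task would need its own `gpujpeg_encoder` instance (the encoder is not thread-safe across concurrent calls), and the input buffers would need to be pinned for the transfers to overlap.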