
Encoding multiple images on a single GPU in parallel (concurrent encoding) #83

Closed
ahmednofal opened this issue Dec 20, 2022 · 3 comments


@ahmednofal

Thank you for this great library; it has been extremely helpful.

My use case requires encoding n frames at a time. Sequential calls to gpujpeg_encoder_encode take a long time to process the n frames one after another. I have employed std::async to run multiple concurrent executions of gpujpeg_encoder_encode; however, I am seeing the same delay as when not using it at all.

I also tried using a separate gpujpeg_encoder instance per frame, but that did not help either.
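Schematically, the concurrent attempt looks like the following (a minimal sketch, not my exact code; the GPUJPEG calls follow the library's README, and exact signatures, e.g. the type of the compressed-size output, may differ between versions):

#include <cstdint>
#include <future>
#include <vector>
#include <libgpujpeg/gpujpeg.h>

int main() {
    const int n = 24, width = 1920, height = 1080;
    gpujpeg_init_device(0, 0);

    gpujpeg_parameters param;
    gpujpeg_set_default_parameters(&param);
    gpujpeg_image_parameters param_image;
    gpujpeg_image_set_default_parameters(&param_image);
    param_image.width  = width;
    param_image.height = height;

    // Dummy frames; in the real code these come from the application.
    std::vector<std::vector<uint8_t>> frames(
        n, std::vector<uint8_t>(width * height * 3));

    // One encoder instance per frame, as described above.
    std::vector<gpujpeg_encoder *> encoders(n);
    for (auto &e : encoders) {
        e = gpujpeg_encoder_create(nullptr); // default CUDA stream; older
                                             // versions take parameter structs
    }

    std::vector<std::future<void>> jobs;
    for (int i = 0; i < n; ++i) {
        jobs.push_back(std::async(std::launch::async, [&, i] {
            gpujpeg_encoder_input input;
            gpujpeg_encoder_input_set_image(&input, frames[i].data());
            uint8_t *jpeg = nullptr;
            size_t jpeg_size = 0; // int in older GPUJPEG releases
            gpujpeg_encoder_encode(encoders[i], &param, &param_image,
                                   &input, &jpeg, &jpeg_size);
        }));
    }
    for (auto &j : jobs) {
        j.get(); // finishes no faster than the sequential loop
    }
    for (auto *e : encoders) {
        gpujpeg_encoder_destroy(e);
    }
}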

The GPU is a T4, running Ubuntu 18.04.

There are 24 frames of 1080 x 1920 each, and encoding them takes around 0.05 seconds with both the sequential and the parallel (std::async) execution schemes.

Thank you.

@MartinPulec (Collaborator)

Hi, thanks for writing.

May I ask a few questions first? You say that you encode 24 FullHD frames in 0.05 seconds, i.e. about 2 ms per frame. This is rather more than I'd expect, but I cannot say that it is incorrect; it depends on the image and encoder properties. The problem may also be that with "only" 24 images, the GPU initialization doesn't amortize well; to be concrete, CUDA initialization can cost as much as encoding 200 frames.
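As a hypothetical check, you can encode one throwaway frame before starting the clock, so the one-time CUDA/encoder initialization is excluded from the measurement (encode_one() below is a stand-in for your wrapper around gpujpeg_encoder_encode):

#include <chrono>

double measure_ms_per_frame(gpujpeg_encoder *encoder, uint8_t *frame) {
    encode_one(encoder, frame); // warm-up: pays the init cost, not timed
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < 24; ++i) {
        encode_one(encoder, frame); // steady-state encodes
    }
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count() / 24;
}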

How much improvement would you expect from parallel encoding? The thing is that the performance bottleneck is mostly the PCIe transfers. I've created the example encode.c.txt together with some evaluations:

$ time ./encode 8000 1
real    0m8.680s
$ time ./a.out 2000 4
real    0m6.414s
$ time (./a.out 2000 1 & ./a.out 2000 1 & ./a.out 2000 1 & ./a.out 2000 1 & wait)
real    0m5.501s

So running in multiple processes improves the performance by some 36%, and multi-threading by somewhat less (there is some room for improvement here). When run under the NVIDIA profiler, GPUJPEG's performance is limited by the memory transfers of the uncompressed image to the GPU (for encoding). From the profiler, I can imagine an improvement of, let's say, 20%, but surely not a multiple. Of course, I am writing about images in CPU RAM; if you have them (or can have them) already on the GPU, it may be quite a different story.
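As a rough sanity check that the transfers are indeed the limit (my numbers, assuming the test image is a FullHD RGB frame, i.e. 3 bytes per pixel):

1920 x 1080 x 3 B      ~  6.2 MB per uncompressed frame
8000 frames x 6.2 MB   ~  50 GB
50 GB / 8.68 s         ~  5.7 GB/s

That is in the typical range of pageable host-to-device copies over PCIe 3.0 x16; pinned memory can roughly double it.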

@MartinPulec (Collaborator)

Hi, I have an update on the above. I've tweaked the example from my previous post to use pinned memory, and now its performance is optimal (it entirely saturates the host-to-device bandwidth according to the NVIDIA profiler); it is here. The point is that the input (raw) image buffer must be allocated with the CUDA API to achieve optimal performance, and CUDA streams should be used as well.
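Schematically, the relevant change is this (a sketch rather than the exact attachment; it assumes the current gpujpeg_encoder_create(cudaStream_t) signature):

#include <cstdint>
#include <cuda_runtime.h>
#include <libgpujpeg/gpujpeg.h>

int main() {
    gpujpeg_init_device(0, 0);

    // Pinned (page-locked) host buffer instead of plain malloc()/new[] --
    // this is what lets the host-to-device copy run at full PCIe bandwidth.
    uint8_t *frame = nullptr;
    cudaHostAlloc(reinterpret_cast<void **>(&frame),
                  1920 * 1080 * 3, cudaHostAllocDefault);

    // A dedicated CUDA stream per encoder, so that copies and kernels of
    // independent frames can overlap.
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    gpujpeg_encoder *encoder = gpujpeg_encoder_create(stream);

    // ... fill `frame` and call gpujpeg_encoder_encode() as usual ...

    gpujpeg_encoder_destroy(encoder);
    cudaStreamDestroy(stream);
    cudaFreeHost(frame);
}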

Is the example above something like what you were looking for, or have I misunderstood somehow? Of course, this could be added to the GPUJPEG API, including some thread scheduler to provide a higher-level API for users, but I'd rather keep things simple.

@MartinPulec (Collaborator)

I am closing this for now, since I believe the question was more or less answered (to my understanding). If not, feel free to reopen the issue.
