
Encoding multiple images on a single GPU in parallel (concurrent encoding) #83

Closed
ahmednofal opened this issue Dec 20, 2022 · 3 comments


@ahmednofal

Thank you for this great library; it has been extremely helpful.

My use case requires encoding n frames at a time. Sequential calls to gpujpeg_encoder_encode take a long time to process the n frames one after another. I have employed std::async to run multiple concurrent executions of gpujpeg_encoder_encode; however, I am seeing the same delay as when not using it at all.

I also tried using a separate gpujpeg_encoder instance per frame, but that did not help either.
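Schematically, the concurrent attempt looks like the following (a minimal sketch, not my exact code; the GPUJPEG calls follow the library's README, and exact signatures, e.g. the type of the compressed-size output, may differ between versions):

#include <cstdint>
#include <future>
#include <vector>
#include <libgpujpeg/gpujpeg.h>

int main() {
    const int n = 24, width = 1920, height = 1080;
    gpujpeg_init_device(0, 0);

    gpujpeg_parameters param;
    gpujpeg_set_default_parameters(&param);
    gpujpeg_image_parameters param_image;
    gpujpeg_image_set_default_parameters(&param_image);
    param_image.width  = width;
    param_image.height = height;

    // Dummy frames; in the real code these come from the application.
    std::vector<std::vector<uint8_t>> frames(
        n, std::vector<uint8_t>(width * height * 3));

    // One encoder instance per frame, as described above.
    std::vector<gpujpeg_encoder *> encoders(n);
    for (auto &e : encoders) {
        e = gpujpeg_encoder_create(nullptr); // default CUDA stream; older
                                             // versions take parameter structs
    }

    std::vector<std::future<void>> jobs;
    for (int i = 0; i < n; ++i) {
        jobs.push_back(std::async(std::launch::async, [&, i] {
            gpujpeg_encoder_input input;
            gpujpeg_encoder_input_set_image(&input, frames[i].data());
            uint8_t *jpeg = nullptr;
            size_t jpeg_size = 0; // int in older GPUJPEG releases
            gpujpeg_encoder_encode(encoders[i], &param, &param_image,
                                   &input, &jpeg, &jpeg_size);
        }));
    }
    for (auto &j : jobs) {
        j.get(); // finishes no faster than the sequential loop
    }
    for (auto *e : encoders) {
        gpujpeg_encoder_destroy(e);
    }
}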

The GPU is a T4, running Ubuntu 18.04.

There are 24 frames of 1080 x 1920 each, and encoding them takes around 0.05 seconds with both the sequential and the parallel (std::async) execution schemes.

Thank you.

@MartinPulec (Collaborator)

Hi, thanks for writing.

May I ask a few questions first? You say that you encode 24 FullHD frames in 0.05 seconds, i.e. about 2 ms per frame. This is rather more than I'd expect, but I cannot say that it is incorrect; it depends on the image and encoder properties. The problem may also be that with "only" 24 images, the GPU initialization doesn't amortize well; to be concrete, CUDA initialization can cost as much as encoding 200 frames.
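As a hypothetical check, you can encode one throwaway frame before starting the clock, so the one-time CUDA/encoder initialization is excluded from the measurement (encode_one() below is a stand-in for your wrapper around gpujpeg_encoder_encode):

#include <chrono>

double measure_ms_per_frame(gpujpeg_encoder *encoder, uint8_t *frame) {
    encode_one(encoder, frame); // warm-up: pays the init cost, not timed
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < 24; ++i) {
        encode_one(encoder, frame); // steady-state encodes
    }
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count() / 24;
}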

How much improvement would you expect from parallel encoding? The thing is that the performance bottleneck is mostly the PCIe transfers. I've created the example encode.c.txt together with some evaluations:

$ time ./encode 8000 1
real    0m8.680s
$ time ./a.out 2000 4
real    0m6.414s
$ time (./a.out 2000 1 & ./a.out 2000 1 & ./a.out 2000 1 & ./a.out 2000 1 & wait)
real    0m5.501s

So running in multiple processes improves the performance by some 36%, and multi-threading by somewhat less (there is some room for improvement here). When run under the NVIDIA profiler, GPUJPEG's performance is limited by the memory transfers of the uncompressed image to the GPU (for encoding). From the profiler, I can imagine an improvement of, let's say, 20%, but surely not a multiple. Of course, I am writing about images in CPU RAM; if you have them (or can have them) already on the GPU, it may be quite a different story.
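As a rough sanity check that the transfers are indeed the limit (my numbers, assuming the test image is a FullHD RGB frame, i.e. 3 bytes per pixel):

1920 x 1080 x 3 B      ~  6.2 MB per uncompressed frame
8000 frames x 6.2 MB   ~  50 GB
50 GB / 8.68 s         ~  5.7 GB/s

That is in the typical range of pageable host-to-device copies over PCIe 3.0 x16; pinned memory can roughly double it.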

@MartinPulec (Collaborator)

Hi, I have an update on the above. I've tweaked the example from my previous post to use pinned memory, and now its performance is optimal (it entirely saturates the host-to-device bandwidth according to the NVIDIA profiler); it is here. The point is that the input (raw) image buffer must be allocated with the CUDA API to achieve optimal performance, and CUDA streams should be used as well.
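Schematically, the relevant change is this (a sketch rather than the exact attachment; it assumes the current gpujpeg_encoder_create(cudaStream_t) signature):

#include <cstdint>
#include <cuda_runtime.h>
#include <libgpujpeg/gpujpeg.h>

int main() {
    gpujpeg_init_device(0, 0);

    // Pinned (page-locked) host buffer instead of plain malloc()/new[] --
    // this is what lets the host-to-device copy run at full PCIe bandwidth.
    uint8_t *frame = nullptr;
    cudaHostAlloc(reinterpret_cast<void **>(&frame),
                  1920 * 1080 * 3, cudaHostAllocDefault);

    // A dedicated CUDA stream per encoder, so that copies and kernels of
    // independent frames can overlap.
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    gpujpeg_encoder *encoder = gpujpeg_encoder_create(stream);

    // ... fill `frame` and call gpujpeg_encoder_encode() as usual ...

    gpujpeg_encoder_destroy(encoder);
    cudaStreamDestroy(stream);
    cudaFreeHost(frame);
}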

Is the example above something like what you were looking for, or have I misunderstood somehow? Of course, this could be added to the GPUJPEG API, including some thread scheduler to provide a higher-level API for users, but I'd rather keep things simple.

@MartinPulec (Collaborator)

I am closing this for now, since I believe the question was more or less answered (to my understanding). If not, feel free to reopen the issue.
