Is there a way to avoid the thread lock in cuda driver? #169

coreylowman · 2023-07-10T12:29:13Z

Okay, after spending too much time investigating, it seems the CUDA driver (not NCCL) uses a global mutex, which makes multi-threaded, multi-GPU use quite useless.
https://forums.developer.nvidia.com/t/cuda-wont-concurrently-run-kernels-on-multiple-devices-from-within-same-process/240388
https://forums.developer.nvidia.com/t/multithreaded-tensorrt-performance-drops-dramatically/184882/8
https://forums.developer.nvidia.com/t/cuda-introduces-heavy-locks/61357

Basically, all threads contend for that lock, so the CPU cannot submit kernels fast enough.

Is there any way to avoid that mutex? Multi-threaded code is significantly simpler than multi-process code.
But the tone in the related threads makes me think it's a known issue and nothing is going to be done about it.
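
For comparison, here is a minimal sketch of the multi-process alternative mentioned above, with one worker process per GPU. It assumes that separate processes do not share the driver lock described in the linked threads, which is not verified here. `worker_commands` and the `./worker` binary are hypothetical names used only for illustration; `CUDA_VISIBLE_DEVICES` is the standard environment variable for restricting which devices a process can see.

```rust
use std::process::Command;

/// Build one command per GPU, each restricted to a single device.
/// `worker_bin` is a hypothetical executable that runs the per-GPU work.
fn worker_commands(worker_bin: &str, num_gpus: usize) -> Vec<Command> {
    (0..num_gpus)
        .map(|gpu| {
            let mut cmd = Command::new(worker_bin);
            // Each worker sees exactly one device, which it addresses as device 0.
            cmd.env("CUDA_VISIBLE_DEVICES", gpu.to_string());
            cmd
        })
        .collect()
}
```

Calling `spawn()` on each returned command and then `wait()` on the children would start the workers in parallel. The cost is the one noted above: data has to cross process boundaries, which is more work than sharing memory between threads.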

Edit: I updated the code and internals to reflect that, hopefully saving future devs from being bitten in the same way.

Originally posted by @Narsil in #164 (comment)

coreylowman mentioned this issue Jul 11, 2023: Multi-GPU Support coreylowman/dfdx#595 (Open)

gerwin3 mentioned this issue Dec 29, 2023: Create a Stream with a given device; DeviceBuffer should use the stream's device oddity-ai/async-cuda#3 (Closed)
