ggml : become thread-safe #3960

ggerganov · 2023-11-05T15:56:19Z

We should be able to run inference on multiple graphs, backends and devices in parallel.
Currently, there are CUDA singletons that break this requirement, and there may be other problems as well.
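
To illustrate the requirement, here is a minimal C++ sketch of the per-context alternative to a process-wide singleton. It is not ggml code: `backend_state` and `run_isolated` are hypothetical names invented for this example. A single global `backend_state` shared by all threads would let one thread change the current device while another is mid-operation. Giving each worker its own state removes that shared mutable data.

```cpp
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical stand-in for backend state; these are NOT ggml's real types.
struct backend_state {
    int       device   = 0; // device the next operation would target
    long long ops_done = 0; // number of operations executed so far
};

// Each worker owns exactly one backend_state, so no two threads ever
// touch the same object and no locking is required.
long long run_isolated(int n_threads, int n_ops) {
    assert(n_threads >= 0 && n_ops >= 0);

    std::vector<backend_state> states(n_threads);
    std::vector<std::thread>   workers;
    workers.reserve(n_threads);

    for (int i = 0; i < n_threads; ++i) {
        workers.emplace_back([&states, i, n_ops] {
            backend_state & st = states[i]; // private to this thread
            st.device = i;                  // one (pretend) device per thread
            for (int k = 0; k < n_ops; ++k) {
                st.ops_done += 1;           // stand-in for real graph work
            }
        });
    }
    for (auto & w : workers) {
        w.join();
    }

    // Safe to read after join(): all workers have finished.
    long long total = 0;
    for (const auto & st : states) {
        total += st.ops_done;
    }
    return total;
}
```

Calling `run_isolated(4, 1000)` counts every operation exactly once because the workers never share state. The same loop over one global object would be a data race.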

pseudotensor · 2024-02-27T22:38:59Z

Any updates?

pseudotensor · 2024-03-06T20:41:05Z

FYI, I noticed that while recent llama.cpp has speed improvements, thread safety has gotten worse. When I now run a TTS model and a GGUF model in 2 threads, everything crashes with the latest llama.cpp, which did not used to happen.

I get:

CUDA error: an illegal memory access was encountered
  current device: 1, in function ggml_backend_cuda_buffer_cpy_tensor at /tmp/pip-install-r9_ew1og/llama-cpp-python_858d7004c05a49ac876ecf41f562f687/vendor/llama.cpp/ggml-cuda.cu:11578
  cudaDeviceSynchronize()
GGML_ASSERT: /tmp/pip-install-r9_ew1og/llama-cpp-python_858d7004c05a49ac876ecf41f562f687/vendor/llama.cpp/ggml-cuda.cu:255: !"CUDA error"
!!!! kernel execution error. (m: 3072, n: 77, k: 1024, error: 13)

It worked perfectly under heavy use before.

zsogitbe · 2024-03-14T13:34:51Z

I am a bit confused. I thought that @slaren solved this problem with 'llama : add pipeline parallelism support (#6017)'? Or do you mean something else here?

@slaren said that once he was done with #6017 he would fix the backend so that it releases all CUDA memory, which is a big problem for many of us. I am not impatient; I would just like to understand what is happening.

slaren · 2024-03-14T13:37:29Z

What I meant is that I will work on this after the pipeline parallelism is merged, which is what I am doing. It will still take a while to complete, as fixing this will require infrastructure changes in other parts of the code. Sorry for the confusion.

zsogitbe · 2024-03-14T13:58:47Z

I understand. Thank you for caring about this issue and for working on it! I tried to solve it myself but could not.

DEVDEVIL007 · 2024-03-19T06:17:06Z

#5396

slaren · 2024-03-20T10:28:43Z

#6170 should fix this issue in the CUDA backend.

github-actions · 2024-05-05T01:06:51Z

This issue was closed because it has been inactive for 14 days since being marked as stale.

martindevans · 2024-05-05T01:08:01Z

Is this actually fixed?

zsogitbe · 2024-05-05T04:45:49Z

There are some additional problems in #6909, maybe because of this fix?

slaren · 2024-05-08T00:22:41Z

@martindevans It should be fixed; please report any issues with thread safety. For example, using multiple llama contexts simultaneously, each with a different CUDA GPU on a different thread, should now be possible. CPU and Metal should also be thread-safe; other backends probably are not.

martindevans · 2024-05-08T00:43:57Z

That's great to hear! I'll experiment with removing some of the locks we added into LLamaSharp and will report any bugs. Thanks.
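
For context, the kind of lock mentioned here can be sketched as a single global mutex that serializes every call into a library that is not thread-safe. This is an illustration only, written in C++ rather than C#, and it is not LLamaSharp's actual code: `unsafe_native_call`, `guarded_native_call`, and `run_guarded` are hypothetical names.

```cpp
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical shared state inside a library that is not thread-safe.
static long long  g_shared_counter = 0;
static std::mutex g_native_lock;

// Stand-in for a native call that mutates global state without locking.
void unsafe_native_call() {
    g_shared_counter += 1;
}

// Workaround: take one global lock around every native call, so that
// only one thread at a time is inside the library.
void guarded_native_call() {
    std::lock_guard<std::mutex> lock(g_native_lock);
    unsafe_native_call();
}

long long run_guarded(int n_threads, int n_calls) {
    g_shared_counter = 0; // reset before any worker starts

    std::vector<std::thread> workers;
    workers.reserve(n_threads);
    for (int i = 0; i < n_threads; ++i) {
        workers.emplace_back([n_calls] {
            for (int k = 0; k < n_calls; ++k) {
                guarded_native_call();
            }
        });
    }
    for (auto & w : workers) {
        w.join();
    }
    return g_shared_counter; // safe to read: all workers have joined
}
```

The lock keeps the result correct but also removes all parallelism inside the guarded calls, which is why dropping it becomes attractive once the underlying library is thread-safe.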

pseudotensor · 2024-05-08T00:56:03Z

What about the same GPU? Why isn't that thread-safe too?

slaren · 2024-05-08T01:00:54Z

It should also be thread-safe, but I don't expect that to be a very useful use case.

ggerganov added the refactoring label Nov 5, 2023

wsxiaoys mentioned this issue Nov 30, 2023: fix: avoid llama.cpp's racing TabbyML/tabby#923 (Merged)

ggerganov changed the title from "llama : become thread-safe" to "ggml : become thread-safe" Jan 31, 2024

ggerganov mentioned this issue Jan 31, 2024: Multi Threaded issue with CUDA compiled library 1.5.4 ggerganov/whisper.cpp#1814 (Open)

ggerganov mentioned this issue Mar 6, 2024: GPU Memory Leak #5873 (Closed)

martindevans mentioned this issue Mar 12, 2024: Thread Safety in llama.cpp SciSharp/LLamaSharp#596 (Open)

github-actions bot added the stale label Apr 20, 2024

github-actions bot closed this as completed May 5, 2024
