
Conversation

JohannesGaessler
Collaborator

Fixes #8801 (comment).

The problem is that --split-mode row uses multiple CUDA streams to overlap data transfer with computation, but on CUDA GPUs that are Volta or newer each of these streams allocates a temporary buffer for the stream-k fixup. Because it is not safe to allocate temporary buffers from multiple CUDA streams in parallel, this leads to a race condition. However, since both stream-k decomposition and --split-mode row mitigate tail effects, it is fine to simply not use stream-k when multiple parallel CUDA streams are in use; the performance difference with 3x RTX 4090 is ~1%.
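For illustration, a minimal sketch of the gating logic described above (not the actual llama.cpp code; all names here are hypothetical):

```cpp
// Decide whether to use stream-k decomposition for the MMQ kernels.
// Illustrative sketch only; names are not the real llama.cpp identifiers.
struct mmq_config {
    int compute_capability;  // e.g. 89 for an RTX 4090 (CC 8.9)
    int n_parallel_streams;  // > 1 when --split-mode row overlaps work on multiple streams
};

// Stream-k needs a temporary "fixup" buffer taken from the CUDA memory pool.
// Allocating from that pool concurrently on several streams is racy, so fall
// back to the plain tile-based kernel whenever more than one stream is active;
// both techniques mitigate tail effects, so the performance cost is small.
static bool should_use_stream_k(const mmq_config & cfg) {
    const bool volta_or_newer = cfg.compute_capability >= 70;
    return volta_or_newer && cfg.n_parallel_streams == 1;
}
```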

Long-term, it may make sense to think about how to handle multiple parallel CUDA streams safely. I think a good approach would be to move the parallelism from something CUDA-specific to the level of GGML compute graphs; that would also allow the code to be reused for other backends.

@JohannesGaessler JohannesGaessler added the Review Complexity : Low label Sep 10, 2024
@slaren
Member

slaren commented Sep 10, 2024

You could use a different pool per stream. Since the pools are objects now, it should be a simple change. I thought about moving the implementation of tensor parallelism to ggml_backend_sched, but it is a lot of work for no clear benefit outside of datacenter GPUs with very fast interconnects, and I think it will make backend-specific optimizations harder to implement.
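For reference, a rough sketch of what a per-stream pool could look like, assuming the pool is a self-contained object as described above; the types and members below are hypothetical, not the actual llama.cpp API:

```cpp
#include <cstddef>
#include <memory>
#include <mutex>
#include <unordered_map>

#include <cuda_runtime.h>

// Hypothetical stand-in for the CUDA backend's pool object.
struct cuda_pool {
    // Owns scratch device memory; alloc/free are only safe from one stream at a time.
    void * alloc(std::size_t size);
    void   free(void * ptr, std::size_t size);
};

struct cuda_context {
    std::unordered_map<cudaStream_t, std::unique_ptr<cuda_pool>> pools;
    std::mutex mutex;

    // Giving each stream its own pool means temporary buffers (such as the
    // stream-k fixup buffer) are never allocated from the same pool in parallel.
    cuda_pool & pool_for_stream(cudaStream_t stream) {
        std::lock_guard<std::mutex> lock(mutex);
        std::unique_ptr<cuda_pool> & p = pools[stream];
        if (!p) {
            p = std::make_unique<cuda_pool>();
        }
        return *p;
    }
};
```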

@github-actions github-actions bot added the Nvidia GPU label Sep 10, 2024
@JohannesGaessler JohannesGaessler merged commit 5af118e into ggml-org:master Sep 11, 2024
52 checks passed
dsx1986 pushed a commit to dsx1986/llama.cpp that referenced this pull request Oct 29, 2024
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 15, 2024
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 18, 2024
Nexesenex pushed a commit to Nexesenex/croco.cpp that referenced this pull request Dec 23, 2024
Nexesenex pushed a commit to Nexesenex/croco.cpp that referenced this pull request Feb 25, 2025
@JohannesGaessler JohannesGaessler deleted the cuda-fix-sm-row branch May 5, 2025 20:59



Development

Successfully merging this pull request may close these issues.

Bug: b3188 breaks row split mode for multiple GPUs
