TP: fix 0-sized tensor slices, AllReduce fallback #21808

Open

JohannesGaessler wants to merge 3 commits into ggml-org:master from JohannesGaessler:tp-fix-0-slice

Conversation


JohannesGaessler (Contributor) commented Apr 12, 2026

Partially fixes #21765.

With Qwen 3.5 27b there are only 2 KV heads, so with 3+ GPUs some of them will get zero-sized slices of the data. This edge case is not handled correctly on master. This PR disables the corresponding nodes and memsets the buffer for the AllReduce to 0, so that after the AllReduce all GPUs have the correct data. As of right now the buffer is zeroed out via GGML_SCALE with a factor of 0.0f for the AllReduce fallback implementation. This is not safe w.r.t. NaNs (0.0f * NaN is still NaN), but it seems we currently lack the tooling to properly memset a tensor as part of a ggml_cgraph. The same issue is present in llm_graph_context::build_rs.

Additionally, on master the synchronization of 3+ GPUs is not handled correctly for the AllReduce fallback: in those cases 2+ reduction steps are needed, but the same buffer is reused for each step, causing race conditions. This PR extends the number of buffers accordingly.

@JohannesGaessler JohannesGaessler requested a review from a team as a code owner April 12, 2026 13:01
@github-actions github-actions bot added the Nvidia GPU (Issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels Apr 12, 2026
@JohannesGaessler JohannesGaessler requested a review from CISC as a code owner April 12, 2026 19:41
@JohannesGaessler
Contributor Author

I added a patch for Gemma 4 26b a4b on 3 GPUs. There were aliasing issues because the layer structure differs every 6 layers, and one of the GPUs would receive no tensors at all. The linked issue should now be fully fixed by this PR.

@slavap

slavap commented Apr 14, 2026

Fails for me on four Instinct MI50 cards, but it also fails with the same error on two cards (HIP_VISIBLE_DEVICES=0,1).
Similar problem on 2 GPUs: #21686.
So it fails not only on 3+ GPUs.


Development

Successfully merging this pull request may close these issues.

Misc. bug: crashes when trying to use qwen3.5-27b and gemma4-26b-4a using tensor parallelism.

3 participants