TP: fix 0-sized tensor slices, AllReduce fallback by JohannesGaessler · Pull Request #21808 · ggml-org/llama.cpp

JohannesGaessler · 2026-04-12T13:01:32Z

Partially fixes #21765.

With Qwen 3.5 ~~26b a4b~~ 27b there are only 2 KV heads, so with 3+ GPUs some of them get zero-sized slices of the data. This edge case is not handled correctly on master. This PR disables the corresponding nodes and sets the AllReduce buffer to 0 so that after the AllReduce all GPUs have the correct data. For the AllReduce fallback implementation, the buffer is currently zeroed via GGML_SCALE with a factor of 0.0f. This is not safe w.r.t. NaNs (NaN * 0.0f is still NaN), but it seems we currently lack the tooling to properly memset a tensor as part of a ggml_cgraph. The same issue is present in llm_graph_context::build_rs.
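A minimal sketch of the two points above, not taken from the PR: how an even split of 2 KV heads over 3 GPUs leaves one GPU with a zero-sized slice, and why zeroing a buffer by multiplying with 0.0f differs from overwriting it. The helper names `split_heads`, `zero_by_scale`, and `zero_by_memset` are hypothetical, and the even split is an assumption for illustration rather than the actual llama.cpp split logic.

```cpp
#include <cassert>
#include <cstring>
#include <vector>

// Hypothetical helper: number of KV heads per GPU when heads are dealt out
// as evenly as possible. Illustration only, not the llama.cpp split logic.
std::vector<int> split_heads(int n_heads, int n_gpus) {
    assert(n_heads >= 0 && n_gpus > 0);
    std::vector<int> counts(n_gpus);
    for (int i = 0; i < n_gpus; ++i) {
        // The first (n_heads % n_gpus) GPUs get one extra head.
        counts[i] = n_heads / n_gpus + (i < n_heads % n_gpus ? 1 : 0);
    }
    return counts;
}

// Zeroing by scaling: under IEEE 754, NaN * 0.0f is NaN, so stale NaNs survive.
void zero_by_scale(std::vector<float> & buf) {
    for (float & x : buf) {
        x *= 0.0f;
    }
}

// Zeroing by overwriting the bytes: the previous contents do not matter.
void zero_by_memset(std::vector<float> & buf) {
    if (!buf.empty()) {
        std::memset(buf.data(), 0, buf.size() * sizeof(float));
    }
}
```

With 2 heads and 3 GPUs, `split_heads` returns `{1, 1, 0}`, so the third GPU has nothing to contribute to the AllReduce. If that GPU's buffer holds uninitialized memory containing a NaN, `zero_by_scale` leaves the NaN in place, while `zero_by_memset` produces a clean zero.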

Additionally, on master the synchronization of 3+ GPUs is not handled correctly for the AllReduce fallback. In those cases 2+ reduction steps are needed, but the same buffer is used for each step, so there are race conditions. This PR extends the number of buffers accordingly.
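The hazard can be illustrated with a toy model. This is a sketch under stated assumptions, not the ggml implementation: `reduce_worst_case` is a hypothetical function that models the worst-case ordering of asynchronous work, where every copy into a staging buffer completes before any add reads from it.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model of a multi-step reduction on one GPU (hypothetical, not ggml code).
// `incoming` holds one partial result per reduction step. Assumed worst case:
// all copies into staging buffers finish before any of the adds run.
float reduce_worst_case(float own, const std::vector<float> & incoming, std::size_t n_buffers) {
    assert(n_buffers > 0);
    std::vector<float> staging(n_buffers, 0.0f);

    // Phase 1: every step copies its partial result into a staging buffer.
    // With fewer buffers than steps, later copies overwrite earlier ones.
    for (std::size_t step = 0; step < incoming.size(); ++step) {
        staging[step % n_buffers] = incoming[step];
    }

    // Phase 2: every step adds "its" staging buffer to the accumulator.
    float acc = own;
    for (std::size_t step = 0; step < incoming.size(); ++step) {
        acc += staging[step % n_buffers];
    }
    return acc;
}
```

With `own = 1` and incoming partials `{2, 3}`, the correct sum is 6. One shared buffer yields 7 in this model because the second copy overwrites the first before it is consumed; one buffer per step yields 6.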

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: No

JohannesGaessler · 2026-04-12T19:46:59Z

I added a patch for Gemma 4 26b a4b on 3 GPUs. There were issues with aliasing because every 6 layers the layer structure is different and one of the GPUs would receive no tensors at all. The linked issue should now be fully fixed by this PR.

ggml/src/ggml-backend-meta.cpp

slavap · 2026-04-14T03:07:31Z

Fails for me on four Instinct MI50 cards, but it also fails on two (HIP_VISIBLE_DEVICES=0,1) with the same error.
Similar problem on 2 GPUs: #21686
So it fails not only with 3+ GPUs.

TP: fix 0-sized tensor slices, AllReduce fallback

f91d566

JohannesGaessler requested a review from a team as a code owner April 12, 2026 13:01

JohannesGaessler mentioned this pull request Apr 12, 2026

Misc. bug: crashes when trying to use qwen3.5-27b and gemma4-26b-4a using tensor parallelism. #21765

Open

github-actions bot added the Nvidia GPU and ggml labels Apr 12, 2026

fix layer structure <-> GPU count aliasing

5a8a56a

JohannesGaessler requested a review from CISC as a code owner April 12, 2026 19:41

TheBlueMatt mentioned this pull request Apr 13, 2026

TP: fix arbitrary -ot #21717

Open

gaugarg-nv reviewed Apr 13, 2026

ggml/src/ggml-backend-meta.cpp

add missing std::fill

8cd16dc

slavap mentioned this pull request Apr 14, 2026

Tensor parallelism added to llama.cpp mixa3607/ML-gfx906#18

Open

TP: fix 0-sized tensor slices, AllReduce fallback #21808
JohannesGaessler wants to merge 3 commits into ggml-org:master from JohannesGaessler:tp-fix-0-slice


3 participants
