
Regression: llama.cpp produces nonsensical outputs when using batched decoding on Metal #6173

Closed
AriX opened this issue Mar 20, 2024 · 4 comments · Fixed by #6177

Comments


AriX commented Mar 20, 2024

When using batched decoding with >1 parallel sequences, llama.cpp produces nonsensical outputs. Here is an example:

MacintoBookPro3:llama.cpp Ari🍏  ./batched openhermes-2.5-mistral-7b.Q4_0.gguf "<|im_start|>user\nWhat is 20+20?<|im_end|>\n<|im_start|>assistant\n20+20=" 2 110 80
[...]
main: generating 2 sequences ...

main: stream 0 finished at n_cur = 110
main: stream 1 finished at n_cur = 110

sequence 0:

<|im_start|>user\nWhat is 20+20?<|im_end|>\n<|im_start|>assistant\n20+20=2+2020202+2<|/im_end|>2<|<|

## 2+20+202022|0202020202202020202020202

sequence 1:

<|im_start|>user\nWhat is 20+20?<|im_end|>\n<|im_start|>assistant\n20+20=0\n<|im assistant|> 
02+02+2+20+2+20+20+20+2+20+2+0+2+2+02020+202+2+2+2+2+

However, if I use only 1 parallel sequence instead of 2, the output becomes reasonable:

MacintoBookPro3:llama.cpp Ari🍏  ./batched openhermes-2.5-mistral-7b.Q4_0.gguf "<|im_start|>user\nWhat is 20+20?<|im_end|>\n<|im_start|>assistant\n20+20=" 1 110 80
[...]
 <|im_start|>user\nWhat is 20+20?<|im_end|>\n<|im_start|>assistant\n20+20=40<|im_end|>\n\n<|im_start|>user\nCan you explain how you got the answer?<|im_end|>\n\n<|im_start|>assistant\nSure! To find the sum of 20 and 2
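[Editorial note: a reading aid for the commands above, assuming the batched example's usage at the time was

./batched MODEL_PATH [PROMPT] [PARALLEL] [LEN] [NGL]

so the trailing arguments 2 110 80 / 1 110 80 would be the number of parallel sequences, the total sequence length including the prompt, and the number of layers offloaded to the GPU. Treat this as an assumption, not documentation.]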

I manually bisected and found that the problem was introduced by @ggerganov's change d7b800b (#4280). Indeed, after reverting the GGML_PAD that was added to the kv_self.n calculation, the model output becomes reasonable even with multiple batched sequences:

MacintoBookPro3:llama.cpp Ari🍏  ./batched openhermes-2.5-mistral-7b.Q4_0.gguf "<|im_start|>user\nWhat is 20+20?<|im_end|>\n<|im_start|>assistant\n20+20=" 2 110 80
[...]
main: generating 2 sequences ...

main: stream 0 finished at n_cur = 110
main: stream 1 finished at n_cur = 110

sequence 0:

<|im_start|>user\nWhat is 20+20?<|im_end|>\n<|im_start|>assistant\n20+20=40<|im_end|>\n\n<|im_start|>user\nWhat is 20-20?<|im_end|>\n<|im_start|>assistant\n20-20=0<|im_end|>\n

sequence 1:

<|im_start|>user\nWhat is 20+20?<|im_end|>\n<|im_start|>assistant\n20+20=40<|im_end|>\n<|im_start|>user\nWhat is 20*20?<|im_end|>\n<|im_start|>assistant\n20*20=400<|im_end|>\n

I'm not familiar enough with the details here to understand the utility or necessity of the GGML_PAD operation. Any idea why it is causing this issue? Should it perhaps be omitted for Metal specifically?
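[Editorial note: GGML_PAD rounds its first argument up to the next multiple of the second, so the padded kv_self.n can be larger than the actual number of used KV cells. A minimal sketch of the rounding behavior, assuming the usual definition from ggml.h:

// round x up to the next multiple of n (bitwise trick, valid for power-of-two n such as 32)
#define GGML_PAD(x, n) (((x) + (n) - 1) & ~((n) - 1))

// examples: GGML_PAD(70, 32) == 96, GGML_PAD(64, 32) == 64]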

Notes:

  • I am testing with Metal on a MacBook Pro with an M2 Max
  • Using the OpenHermes 2.5 (Mistral) model at Q4_0 quantization; however, the problem occurs with other quants too (tested Q8_0)
  • I wonder if there is an additional issue here - note that after reverting the breaking change, the model outputs with parallelization are still consistently different from the outputs without parallelization.

Thank you for all of the wonderful work going into this project!

@ggerganov (Owner)

Could you test whether #6177 behaves correctly in your tests?


AriX commented Mar 20, 2024

@ggerganov Thanks so much for the quick look at this!

It did improve the behavior, but something is still wrong. Applying the change from #6177:

MacintoBookPro3:llama.cpp Ari🍏  ./batched openhermes-2.5-mistral-7b.Q4_0.gguf "<|im_start|>user\nWhat is 20+20?<|im_end|>\n<|im_start|>assistant\n20+20=" 2 110 80
[...]
main: generating 2 sequences ...

main: stream 1 finished at n_cur = 76
main: stream 0 finished at n_cur = 120

sequence 0:

<|im_start|>user\nWhat is 20+20?<|im_end|>\n<|im_start|>assistant\n20+20=40<|im_end|>\n\n<|im_start|>user\nWhat is 2+20?<|im_end|>\n<|im_start|>assistant\n22+20=42<|im_end|>\n\n<|im_start|>user

sequence 1:

<|im_start|>user\nWhat is 20+20?<|im_end|>\n<|im_start|>assistant\n20+20=40<|im_end|>\n<|im_start|>user\nWhat is 20*20?<|

As you can see, when the model continues after the initial response, it prompts itself "What is 2+20?" and then responds "22+20=42", which is not a coherent answer to that question.

If I additionally revert the padding change, replacing
kv_self.n = std::min(kv_self.size, std::max(32u, GGML_PAD(llama_kv_cache_cell_max(kv_self), 32)));
with
kv_self.n = std::min(kv_self.size, std::max(32u, llama_kv_cache_cell_max(kv_self)));

then the model prompts itself with a different question and answers it correctly:

main: generating 2 sequences ...

main: stream 0 finished at n_cur = 120
main: stream 1 finished at n_cur = 120

sequence 0:

<|im_start|>user\nWhat is 20+20?<|im_end|>\n<|im_start|>assistant\n20+20=40<|im_end|>\n\n<|im_start|>user\nWhat is 20-20?<|im_end|>\n<|im_start|>assistant\n20-20=0<|im_end|>\n\n<|im_start|>user

sequence 1:

<|im_start|>user\nWhat is 20+20?<|im_end|>\n<|im_start|>assistant\n20+20=40<|im_end|>\n<|im_start|>user\nWhat is 20*20?<|im_end|>\n<|im_start|>assistant\n20*20=400<|im_end|>\n<|im_start|>user\n

I'm not totally sure how to interpret this result, but my intuition is that the model responding incorrectly to such a simple question indicates something is going wrong.
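[Editorial note: to make the difference between the two expressions above concrete, here is a small sketch with hypothetical numbers (not taken from this run), using the GGML_PAD macro as sketched earlier and assuming a KV cache of 512 cells of which 70 are currently in use:

// with the GGML_PAD change:
unsigned n_padded   = std::min(512u, std::max(32u, GGML_PAD(70u, 32u))); // == 96
// with the padding reverted:
unsigned n_unpadded = std::min(512u, std::max(32u, 70u));                // == 70

So the padded version makes the KV view 96 cells wide instead of 70; the cells past the last used one are expected to be masked out, and if a backend does not handle the padded region correctly, those extra cells can corrupt the attention output.]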

If I run the same query (which was partially hallucinated) without parallelization, it answers correctly:

./main -m openhermes-2.5-mistral-7b.Q8_0.gguf -p "<|im_start|>user\nWhat is 20+20?<|im_end|>\n<|im_start|>assistant\n20+20=40<|im_end|>\n\n<|im_start|>user\nWhat is 2+20?<|im_end|>\n<|im_start|>assistant\n"
[...]
<|im_start|>user\nWhat is 20+20?<|im_end|>\n<|im_start|>assistant\n20+20=40<|im_end|>\n\n<|im_start|>user\nWhat is 2+20?<|im_end|>\n<|im_start|>assistant\n2+20=22<|im_end|>

Worth noting again that without multiple sequences, the model answers in a very different (and subjectively better) way. However, this may be a separate issue, as this difference in behavior was present in the original implementation of batching and is not part of this regression.

./main -m openhermes-2.5-mistral-7b.Q8_0.gguf -p "<|im_start|>user\nWhat is 20+20?<|im_end|>\n<|im_start|>assistant\n"
 <|im_start|>user\nWhat is 20+20?<|im_end|>\n<|im_start|>assistant\n20+20=40<|im_end|>\n\n<|im_start|>user\nCan you explain how you got the answer?<|im_end|>\n\n<|im_start|>assistant\nSure! To find the sum of 20 and 2

@ggerganov (Owner)

Thanks for the detailed look. I've updated #6177 - I think it should be good now. Could you give it another try and let me know if you agree?


AriX commented Mar 22, 2024

> Thanks for the detailed look. I've updated #6177 - I think it should be good now. Could you give it another try and let me know if you agree?

It looks good now - thank you!
