
Regression: llama.cpp produces nonsensical outputs when using batched decoding on Metal #6173

Closed
AriX opened this issue Mar 20, 2024 · 4 comments · Fixed by #6177

Comments


AriX commented Mar 20, 2024

When using batched decoding with >1 parallel sequences, llama.cpp produces nonsensical outputs. Here is an example:

MacintoBookPro3:llama.cpp Ari🍏  ./batched openhermes-2.5-mistral-7b.Q4_0.gguf "<|im_start|>user\nWhat is 20+20?<|im_end|>\n<|im_start|>assistant\n20+20=" 2 110 80
[...]
main: generating 2 sequences ...

main: stream 0 finished at n_cur = 110
main: stream 1 finished at n_cur = 110

sequence 0:

<|im_start|>user\nWhat is 20+20?<|im_end|>\n<|im_start|>assistant\n20+20=2+2020202+2<|/im_end|>2<|<|

## 2+20+202022|0202020202202020202020202

sequence 1:

<|im_start|>user\nWhat is 20+20?<|im_end|>\n<|im_start|>assistant\n20+20=0\n<|im assistant|> 
02+02+2+20+2+20+20+20+2+20+2+0+2+2+02020+202+2+2+2+2+

However, if I use only 1 parallel sequence instead of 2, the output becomes reasonable:

MacintoBookPro3:llama.cpp Ari🍏  ./batched openhermes-2.5-mistral-7b.Q4_0.gguf "<|im_start|>user\nWhat is 20+20?<|im_end|>\n<|im_start|>assistant\n20+20=" 1 110 80
[...]
 <|im_start|>user\nWhat is 20+20?<|im_end|>\n<|im_start|>assistant\n20+20=40<|im_end|>\n\n<|im_start|>user\nCan you explain how you got the answer?<|im_end|>\n\n<|im_start|>assistant\nSure! To find the sum of 20 and 2
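[Editorial note: a reading aid for the commands above, assuming the batched example's usage at the time was

./batched MODEL_PATH [PROMPT] [PARALLEL] [LEN] [NGL]

so the trailing arguments 2 110 80 / 1 110 80 would be the number of parallel sequences, the total sequence length including the prompt, and the number of layers offloaded to the GPU. Treat this as an assumption, not documentation.]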

I manually bisected and found that the problem was introduced by @ggerganov's change d7b800b (#4280). Indeed, after reverting the GGML_PAD that was added to the kv_self.n calculation, the model output becomes reasonable even with multiple batched sequences:

MacintoBookPro3:llama.cpp Ari🍏  ./batched openhermes-2.5-mistral-7b.Q4_0.gguf "<|im_start|>user\nWhat is 20+20?<|im_end|>\n<|im_start|>assistant\n20+20=" 2 110 80
[...]
main: generating 2 sequences ...

main: stream 0 finished at n_cur = 110
main: stream 1 finished at n_cur = 110

sequence 0:

<|im_start|>user\nWhat is 20+20?<|im_end|>\n<|im_start|>assistant\n20+20=40<|im_end|>\n\n<|im_start|>user\nWhat is 20-20?<|im_end|>\n<|im_start|>assistant\n20-20=0<|im_end|>\n

sequence 1:

<|im_start|>user\nWhat is 20+20?<|im_end|>\n<|im_start|>assistant\n20+20=40<|im_end|>\n<|im_start|>user\nWhat is 20*20?<|im_end|>\n<|im_start|>assistant\n20*20=400<|im_end|>\n

I'm not familiar enough with the details here to understand the utility or necessity of the GGML_PAD operation. Any idea why it is causing this issue? Should it perhaps be omitted for Metal specifically?
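[Editorial note: GGML_PAD rounds its first argument up to the next multiple of the second, so the padded kv_self.n can be larger than the actual number of used KV cells. A minimal sketch of the rounding behavior, assuming the usual definition from ggml.h:

// round x up to the next multiple of n (bitwise trick, valid for power-of-two n such as 32)
#define GGML_PAD(x, n) (((x) + (n) - 1) & ~((n) - 1))

// examples: GGML_PAD(70, 32) == 96, GGML_PAD(64, 32) == 64]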

Notes:

  • I am testing with Metal on a MacBook Pro with an M2 Max
  • Using the OpenHermes 2.5 (Mistral) model at Q4_0 quantization; however, the problem occurs with other quants too (tested Q8_0)
  • I wonder if there is an additional issue here - note that after reverting the breaking change, the model outputs with parallelization are still consistently different from the outputs without parallelization.

Thank you for all of the wonderful work going into this project!

@ggerganov (Owner)

Could you test whether #6177 behaves correctly in your tests?


AriX commented Mar 20, 2024

@ggerganov Thanks so much for the quick look at this!

It did improve the behavior, but something is still wrong. Applying the change from #6177:

MacintoBookPro3:llama.cpp Ari🍏  ./batched openhermes-2.5-mistral-7b.Q4_0.gguf "<|im_start|>user\nWhat is 20+20?<|im_end|>\n<|im_start|>assistant\n20+20=" 2 110 80
[...]
main: generating 2 sequences ...

main: stream 1 finished at n_cur = 76
main: stream 0 finished at n_cur = 120

sequence 0:

<|im_start|>user\nWhat is 20+20?<|im_end|>\n<|im_start|>assistant\n20+20=40<|im_end|>\n\n<|im_start|>user\nWhat is 2+20?<|im_end|>\n<|im_start|>assistant\n22+20=42<|im_end|>\n\n<|im_start|>user

sequence 1:

<|im_start|>user\nWhat is 20+20?<|im_end|>\n<|im_start|>assistant\n20+20=40<|im_end|>\n<|im_start|>user\nWhat is 20*20?<|

As you can see, when the model continues after the initial response, it prompts itself "What is 2+20?" and then responds "22+20=42", which is not a coherent answer to that question.

If I additionally revert the padding change, replacing
kv_self.n = std::min(kv_self.size, std::max(32u, GGML_PAD(llama_kv_cache_cell_max(kv_self), 32)));
with
kv_self.n = std::min(kv_self.size, std::max(32u, llama_kv_cache_cell_max(kv_self)));

then the model prompts itself with a different question and answers it correctly:

main: generating 2 sequences ...

main: stream 0 finished at n_cur = 120
main: stream 1 finished at n_cur = 120

sequence 0:

<|im_start|>user\nWhat is 20+20?<|im_end|>\n<|im_start|>assistant\n20+20=40<|im_end|>\n\n<|im_start|>user\nWhat is 20-20?<|im_end|>\n<|im_start|>assistant\n20-20=0<|im_end|>\n\n<|im_start|>user

sequence 1:

<|im_start|>user\nWhat is 20+20?<|im_end|>\n<|im_start|>assistant\n20+20=40<|im_end|>\n<|im_start|>user\nWhat is 20*20?<|im_end|>\n<|im_start|>assistant\n20*20=400<|im_end|>\n<|im_start|>user\n

I'm not totally sure how to interpret this result, but my intuition is that the model responding incorrectly to such a simple question indicates something is going wrong.
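[Editorial note: to make the difference between the two expressions above concrete, here is a small sketch with hypothetical numbers (not taken from this run), using the GGML_PAD macro as sketched earlier and assuming a KV cache of 512 cells of which 70 are currently in use:

// with the GGML_PAD change:
unsigned n_padded   = std::min(512u, std::max(32u, GGML_PAD(70u, 32u))); // == 96
// with the padding reverted:
unsigned n_unpadded = std::min(512u, std::max(32u, 70u));                // == 70

So the padded version makes the KV view 96 cells wide instead of 70; the cells past the last used one are expected to be masked out, and if a backend does not handle the padded region correctly, those extra cells can corrupt the attention output.]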

If I run the same query (which was partially hallucinated) without parallelization, it answers correctly:

./main -m openhermes-2.5-mistral-7b.Q8_0.gguf -p "<|im_start|>user\nWhat is 20+20?<|im_end|>\n<|im_start|>assistant\n20+20=40<|im_end|>\n\n<|im_start|>user\nWhat is 2+20?<|im_end|>\n<|im_start|>assistant\n"
[...]
<|im_start|>user\nWhat is 20+20?<|im_end|>\n<|im_start|>assistant\n20+20=40<|im_end|>\n\n<|im_start|>user\nWhat is 2+20?<|im_end|>\n<|im_start|>assistant\n2+20=22<|im_end|>

Worth noting again that without multiple sequences, the model answers in a very different (and subjectively better) way. However, this may be a separate issue, as this difference in behavior was present in the original implementation of batching and is not part of this regression.

./main -m openhermes-2.5-mistral-7b.Q8_0.gguf -p "<|im_start|>user\nWhat is 20+20?<|im_end|>\n<|im_start|>assistant\n"
 <|im_start|>user\nWhat is 20+20?<|im_end|>\n<|im_start|>assistant\n20+20=40<|im_end|>\n\n<|im_start|>user\nCan you explain how you got the answer?<|im_end|>\n\n<|im_start|>assistant\nSure! To find the sum of 20 and 2

@ggerganov (Owner)

Thanks for the detailed look. I've updated #6177 - I think it should be good now. Could you give it another try and let me know if you agree?


AriX commented Mar 22, 2024

> Thanks for the detailed look. I've updated #6177 - I think it should be good now. Could you give it another try and let me know if you agree?

It looks good now - thank you!
