Vulkan: Interactive mode broken #5217
@stduhpf Let's continue here.
I just tested it again on master and it's also fine for me there. So I probably didn't fix it; I simply couldn't reproduce it in the first place. I'm running Linux, and it works fine on Nvidia and AMD GPUs. Any idea what could be causing this issue for you?
@0cc4m I'm running Windows 10, on AMD hardware (RX 5700 XT, latest drivers). I have no idea what the root cause could be, maybe some race condition? It happens consistently, but the way it messes up is different each time, even with the same parameters.
OK, so it was working when I first tried your PR, at commit 0cc4m@a5cca6c (I still have the build I made back then). Somehow it broke since then.
EDIT: never mind, 0cc4m@0f64857 is not working at all; it just falls back to CPU. (I'm too used to working with rebases instead of merges.)
Yeah, so 0cc4m@a5cca6c works, and 0cc4m@48ad459 does not. So the breaking change should be there. @0cc4m
I tried Mistral 7B Instruct, which has an n_vocab of 32000, on my RX 5700 XT on Windows and didn't see any problems. Using the same Dolphin model as your example, which has n_vocab=32001, I ran into similar nondeterministic nonsense responses. After changing BK from 8 to 16 on this line, I get the expected behaviour:

```cpp
std::initializer_list<uint32_t> warptile_s = { vk_device.subgroup_size, 32, 32, 16, 32, 32, 2, 2, 2, vk_device.subgroup_size };
```

Instead of that change doubling the size of

Edit: interestingly, on Arch Linux the RADV driver doesn't appear to run into this issue, but AMDVLK does.
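For reference, a minimal standalone sketch of the change described above, together with the arithmetic behind the n_vocab observation. The warptile_s values are quoted from the comment; treating the fourth entry as BK and using a subgroup size of 64 in place of vk_device.subgroup_size are assumptions, not verified against the source:

```cpp
// Standalone sketch of the reported fix, plus the n_vocab arithmetic.
// Assumptions: the fourth warptile_s entry is BK (as the comment states),
// and subgroup_size = 64 stands in for vk_device.subgroup_size.
#include <cstdint>
#include <cstdio>
#include <initializer_list>

int main() {
    const uint32_t subgroup_size = 64;  // placeholder for vk_device.subgroup_size

    // Before (broken with the n_vocab = 32001 model): BK = 8
    std::initializer_list<uint32_t> warptile_s_old = { subgroup_size, 32, 32,  8, 32, 32, 2, 2, 2, subgroup_size };
    // After (the reported fix): BK = 16
    std::initializer_list<uint32_t> warptile_s_new = { subgroup_size, 32, 32, 16, 32, 32, 2, 2, 2, subgroup_size };
    (void)warptile_s_old;
    (void)warptile_s_new;

    // Why n_vocab matters: 32000 divides evenly by either BK, while 32001
    // divides by neither, so only the 32001 model exercises the shader's
    // boundary handling. Why the larger BK sidesteps the bug there is not
    // explained in the thread.
    for (uint32_t n_vocab : { 32000u, 32001u }) {
        for (uint32_t bk : { 8u, 16u }) {
            std::printf("n_vocab=%u %% BK=%u -> remainder %u\n", n_vocab, bk, n_vocab % bk);
        }
    }
    return 0;
}
```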
@stduhpf Thanks for figuring out the source commit! Really helpful. @Engininja2 Wow, you found it. I was able to reproduce it with
Yep, #5223 fixes it now, thank you! |
Running models in interactive, instruct, or chatML mode, or using the server's chat interface, leads to broken generation when using the Vulkan build with a non-zero number of layers offloaded to the GPU. Simple text completion works properly, though.
Expected behaviour (CLBlast build)
.\v\bin\Release\main.exe -m .\models\Mistral\dolphin-2.6-mistral-7b.Q4_K_M.gguf -t 12 -tb 6 -ngl 33 -c 4096 -p "This is a conversation between User and Llama, a friendly chatbot. Llama is helpful, kind, honest, good at writing, and never fails to answer any requests immediately and with precision." -e -cml -s 0 --temp 0
Vulkan behaviour
.\buildVulkan\bin\Release\main.exe -m .\models\Mistral\dolphin-2.6-mistral-7b.Q4_K_M.gguf -t 12 -tb 6 -ngl 33 -c 4096 -p "This is a conversation between User and Llama, a friendly chatbot. Llama is helpful, kind, honest, good at writing, and never fails to answer any requests immediately and with precision." -e -cml -s 0 --temp 0
The server also seems to have similar issues when re-using cached prompts (for example, when the user submits a second message).
The actual output isn't consistent either; it seems to change every time, even with a fixed seed and zero temperature, given the same user input.
This only happens with Vulkan, and only with at least one layer offloaded to the GPU:
More examples:
Other -ngl values:
CPU only (working as expected)
.\buildVulkan\bin\Release\main.exe -m .\models\Mistral\dolphin-2.6-mistral-7b.Q4_K_M.gguf -t 12 -tb 6 -ngl 0 -c 4096 -p "This is a conversation between User and Llama, a friendly chatbot. Llama is helpful, kind, honest, good at writing, and never fails to answer any requests immediately and with precision." -e -cml -s 0 --temp 0
A single layer offloaded (already broken, but in a different way)
.\buildVulkan\bin\Release\main.exe -m .\models\Mistral\dolphin-2.6-mistral-7b.Q4_K_M.gguf -t 12 -tb 6 -ngl 1 -c 4096 -p "This is a conversation between User and Llama, a friendly chatbot. Llama is helpful, kind, honest, good at writing, and never fails to answer any requests immediately and with precision." -e -cml -s 0 --temp 0
It's funny that it kinda understood the second question, but used the wrong language.
Completion only (no issue here)
CLBlast
.\buildCLBlast\bin\Release\main.exe -m .\models\Mistral\dolphin-2.6-mistral-7b.Q4_K_M.gguf -t 12 -tb 6 -ngl 33 -c 4096 -p "This is a conversation between User and Llama, a friendly chatbot. Llama is helpful, kind, honest, good at writing, and never fails to answer any requests immediately and with precision." -e -s 0 --temp 0 -n 128
Vulkan
.\buildVulkan\bin\Release\main.exe -m .\models\Mistral\dolphin-2.6-mistral-7b.Q4_K_M.gguf -t 12 -tb 6 -ngl 33 -c 4096 -p "This is a conversation between User and Llama, a friendly chatbot. Llama is helpful, kind, honest, good at writing, and never fails to answer any requests immediately and with precision." -e -s 0 --temp 0 -n 128
In case it's relevant:
vulkaninfo --summary