Vulkan: Interactive mode broken #5217

Closed
stduhpf opened this issue Jan 30, 2024 · 8 comments · Fixed by #5223
Labels: bug Something isn't working

Comments

@stduhpf
Contributor

stduhpf commented Jan 30, 2024

Running models in interactive, instruct, or ChatML mode, or using the server's chat interface, leads to broken generation when using the Vulkan build with a non-zero number of layers offloaded to the GPU. Simple text completion works properly, though.

Expected behaviour (CLBlast build)

.\v\bin\Release\main.exe -m .\models\Mistral\dolphin-2.6-mistral-7b.Q4_K_M.gguf -t 12 -tb 6 -ngl 33 -c 4096 -p "This is a conversation between User and Llama, a friendly chatbot. Llama is helpful, kind, honest, good at writing, and never fails to answer any requests immediately and with precision." -e -cml -s 0 --temp 0

Log start
main: build = 2017 (4db91fdb)
main: built with MSVC 19.37.32825.0 for x64
main: seed  = 0
ggml_opencl: selecting platform: 'AMD Accelerated Parallel Processing'
ggml_opencl: selecting device: 'gfx1010:xnack-'
[...]
== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to LLaMa.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.

<|im_start|>system
This is a conversation between User and Llama, a friendly chatbot. Llama is helpful, kind, honest, good at writing, and never fails to answer any requests immediately and with precision.
<|im_start|>user

> Hello!
 Hello there! How can I assist you today?

> Can you tell me what time it is?
 Of course! It's currently 1:45 PM. Is there anything else I can help you with?
 
> 

llama_print_timings:        load time =    5129.82 ms
llama_print_timings:      sample time =       5.07 ms /    36 runs   (    0.14 ms per token,  7106.20 tokens per second)
llama_print_timings: prompt eval time =    6830.90 ms /    78 tokens (   87.58 ms per token,    11.42 tokens per second)
llama_print_timings:        eval time =    2929.09 ms /    35 runs   (   83.69 ms per token,    11.95 tokens per second)
llama_print_timings:       total time =   62423.45 ms /   113 tokens
Vulkan behaviour

.\buildVulkan\bin\Release\main.exe -m .\models\Mistral\dolphin-2.6-mistral-7b.Q4_K_M.gguf -t 12 -tb 6 -ngl 33 -c 4096 -p "This is a conversation between User and Llama, a friendly chatbot. Llama is helpful, kind, honest, good at writing, and never fails to answer any requests immediately and with precision." -e -cml -s 0 --temp 0

Log start
main: build = 2017 (4db91fdb)
main: built with MSVC 19.37.32825.0 for x64
main: seed  = 0
ggml_vulkan: Using AMD Radeon RX 5700 XT | uma: 0 | fp16: 1 | warp size: 64
[...]
== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to LLaMa.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.

<|im_start|>system
This is a conversation between User and Llama, a friendly chatbot. Llama is helpful, kind, honest, good at writing, and never fails to answer any requests immediately and with precision.
<|im_start|>user

> Hello!
dharmi, the user is a chatbot.

User: Hi Llama, how are you doing today?

Llama: I'm doing well, thank you for asking! Just enjoying my day and helping people with their questions. How can I assist you today?

> Can you tell me what time it is?
 batting an eye at the keyboard.

>

llama_print_timings:        load time =    3888.82 ms
llama_print_timings:      sample time =      14.16 ms /    71 runs   (    0.20 ms per token,  5015.19 tokens per second)
llama_print_timings: prompt eval time =    6604.30 ms /    78 tokens (   84.67 ms per token,    11.81 tokens per second)
llama_print_timings:        eval time =    1645.61 ms /    70 runs   (   23.51 ms per token,    42.54 tokens per second)
llama_print_timings:       total time =   45446.02 ms /   148 tokens
As you can see, with the Vulkan build the LLM seems to treat the user's input as just noise, while understanding the initial prompt properly.

The server also seems to have similar issues when reusing cached prompts (for example, when the user submits a second message).
The output isn't consistent either, and seems to change every time, even with a fixed seed and zero temperature, given the same user input.

This only happens with Vulkan, and only with at least one layer offloaded to the GPU:

More examples:

Other -ngl values:

CPU only (working as expected)

.\buildVulkan\bin\Release\main.exe -m .\models\Mistral\dolphin-2.6-mistral-7b.Q4_K_M.gguf -t 12 -tb 6 -ngl 0 -c 4096 -p "This is a conversation between User and Llama, a friendly chatbot. Llama is helpful, kind, honest, good at writing, and never fails to answer any requests immediately and with precision." -e -cml -s 0 --temp 0

Log start
main: build = 2017 (4db91fdb)
main: built with MSVC 19.37.32825.0 for x64
main: seed  = 0
ggml_vulkan: Using AMD Radeon RX 5700 XT | uma: 0 | fp16: 1 | warp size: 64
[...]
== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to LLaMa.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.

<|im_start|>system
This is a conversation between User and Llama, a friendly chatbot. Llama is helpful, kind, honest, good at writing, and never fails to answer any requests immediately and with precision.
<|im_start|>user

> Hello!
 Hello there! How can I assist you today?

> Can you tell me what time it is?
 Of course! It's currently 1:45 PM. Is there anything else I can help you with?

>

llama_print_timings:        load time =     802.68 ms
llama_print_timings:      sample time =       5.17 ms /    36 runs   (    0.14 ms per token,  6960.56 tokens per second)
llama_print_timings: prompt eval time =    3547.22 ms /    78 tokens (   45.48 ms per token,    21.99 tokens per second)
llama_print_timings:        eval time =    5921.23 ms /    35 runs   (  169.18 ms per token,     5.91 tokens per second)
llama_print_timings:       total time =   20858.80 ms /   113 tokens
A single layer offloaded (already broken, but in a different way)

.\buildVulkan\bin\Release\main.exe -m .\models\Mistral\dolphin-2.6-mistral-7b.Q4_K_M.gguf -t 12 -tb 6 -ngl 1 -c 4096 -p "This is a conversation between User and Llama, a friendly chatbot. Llama is helpful, kind, honest, good at writing, and never fails to answer any requests immediately and with precision." -e -cml -s 0 --temp 0

Log start
main: build = 2017 (4db91fdb)
main: built with MSVC 19.37.32825.0 for x64
main: seed  = 0
ggml_vulkan: Using AMD Radeon RX 5700 XT | uma: 0 | fp16: 1 | warp size: 64
[...]
== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to LLaMa.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.

<|im_start|>system
This is a conversation between User and Llama, a friendly chatbot. Llama is helpful, kind, honest, good at writing, and never fails to answer any requests immediately and with precision.
<|im_start|>user

> Hello!
 Fußball ist eine beliebte Sportart in Deutschland. Es wird von vielen Menschen gespielt und gefolgt.


> Can you tell me what time it is?
 Uhrzeit ist eine Zeit, die von der Lokalzeit abhängt. Können Sie bitte Ihre Lokalzeit und Zeitzone angeben? Ich werde mich freuen, Ihnen die aktuelle Uhrzeit zu geben.

>

llama_print_timings:        load time =     975.89 ms
llama_print_timings:      sample time =      12.58 ms /    85 runs   (    0.15 ms per token,  6754.61 tokens per second)
llama_print_timings: prompt eval time =    3650.96 ms /    78 tokens (   46.81 ms per token,    21.36 tokens per second)
llama_print_timings:        eval time =   13061.39 ms /    84 runs   (  155.49 ms per token,     6.43 tokens per second)
llama_print_timings:       total time =   28959.43 ms /   162 tokens

It's funny that it kind of understood the second question, but answered in the wrong language (the first reply is about football being popular in Germany; the second asks, in German, for my local time and time zone so it can give me the current time).

Completion only (no issue here)

CLBlast

.\buildCLBlast\bin\Release\main.exe -m .\models\Mistral\dolphin-2.6-mistral-7b.Q4_K_M.gguf -t 12 -tb 6 -ngl 33 -c 4096 -p "This is a conversation between User and Llama, a friendly chatbot. Llama is helpful, kind, honest, good at writing, and never fails to answer any requests immediately and with precision." -e -s 0 --temp 0 -n 128

Log start
main: build = 2017 (4db91fdb)
main: built with MSVC 19.37.32825.0 for x64
main: seed  = 0
ggml_opencl: selecting platform: 'AMD Accelerated Parallel Processing'
ggml_opencl: selecting device: 'gfx1010:xnack-'
[...]

 This is a conversation between User and Llama, a friendly chatbot. Llama is helpful, kind, honest, good at writing, and never fails to answer any requests immediately and with precision.

User: Hi Llama! How are you today?

Llama: Hello there! I'm doing well, thank you for asking. How about yourself?

User: I'm doing great, thanks for asking. So, I have a question about writing. What is the best way to start a story?

Llama: Starting a story can be challenging, but it's essential to grab your reader's attention right from the beginning. A strong opening line or scene that sets the tone and introduces the main character(s) is usually a good approach. You could
llama_print_timings:        load time =    4971.64 ms
llama_print_timings:      sample time =      19.82 ms /   128 runs   (    0.15 ms per token,  6459.10 tokens per second)
llama_print_timings: prompt eval time =    2129.71 ms /    43 tokens (   49.53 ms per token,    20.19 tokens per second)
llama_print_timings:        eval time =    8192.75 ms /   127 runs   (   64.51 ms per token,    15.50 tokens per second)
llama_print_timings:       total time =   10364.14 ms /   170 tokens
Log end
Vulkan

.\buildVulkan\bin\Release\main.exe -m .\models\Mistral\dolphin-2.6-mistral-7b.Q4_K_M.gguf -t 12 -tb 6 -ngl 33 -c 4096 -p "This is a conversation between User and Llama, a friendly chatbot. Llama is helpful, kind, honest, good at writing, and never fails to answer any requests immediately and with precision." -e -s 0 --temp 0 -n 128

Log start
main: build = 2017 (4db91fdb)
main: built with MSVC 19.37.32825.0 for x64
main: seed  = 0
ggml_vulkan: Using AMD Radeon RX 5700 XT | uma: 0 | fp16: 1 | warp size: 64
[...]

 This is a conversation between User and Llama, a friendly chatbot. Llama is helpful, kind, honest, good at writing, and never fails to answer any requests immediately and with precision.

User: Hi Llama! How are you today?

Llama: Hello there! I'm doing well, thank you for asking. How about yourself?

User: I'm doing great, thanks for asking. So, I have a question about writing. What is the best way to start a story?

Llama: Starting a story can be challenging, but it's essential to grab your reader's attention right from the beginning. A strong opening line or scene that sets the tone and introduces the main character(s) is usually a good approach. You could
llama_print_timings:        load time =    3933.92 ms
llama_print_timings:      sample time =      27.70 ms /   128 runs   (    0.22 ms per token,  4620.94 tokens per second)
llama_print_timings: prompt eval time =     598.12 ms /    43 tokens (   13.91 ms per token,    71.89 tokens per second)
llama_print_timings:        eval time =    2923.36 ms /   127 runs   (   23.02 ms per token,    43.44 tokens per second)
llama_print_timings:       total time =    3574.34 ms /   170 tokens
Log end

In case it's relevant:

vulkaninfo --summary
WARNING: [Loader Message] Code 0 : Layer VK_LAYER_RTSS uses API version 1.1 which is older than the application specified API version of 1.3. May cause issues.
==========
VULKANINFO
==========

Vulkan Instance Version: 1.3.261


Instance Extensions: count = 13
-------------------------------
VK_EXT_debug_report                    : extension revision 10
VK_EXT_debug_utils                     : extension revision 2
VK_EXT_swapchain_colorspace            : extension revision 4
VK_KHR_device_group_creation           : extension revision 1
VK_KHR_external_fence_capabilities     : extension revision 1
VK_KHR_external_memory_capabilities    : extension revision 1
VK_KHR_external_semaphore_capabilities : extension revision 1
VK_KHR_get_physical_device_properties2 : extension revision 2
VK_KHR_get_surface_capabilities2       : extension revision 1
VK_KHR_portability_enumeration         : extension revision 1
VK_KHR_surface                         : extension revision 25
VK_KHR_win32_surface                   : extension revision 6
VK_LUNARG_direct_driver_loading        : extension revision 1

Instance Layers: count = 17
---------------------------
VK_LAYER_AMD_switchable_graphics    AMD switchable graphics layer                 1.3.270  version 1
VK_LAYER_EOS_Overlay                Vulkan overlay layer for Epic Online Services 1.2.136  version 1
VK_LAYER_EOS_Overlay                Vulkan overlay layer for Epic Online Services 1.2.136  version 1
VK_LAYER_KHRONOS_profiles           Khronos Profiles layer                        1.3.275  version 1
VK_LAYER_KHRONOS_shader_object      Khronos Shader object layer                   1.3.275  version 1
VK_LAYER_KHRONOS_synchronization2   Khronos Synchronization2 layer                1.3.275  version 1
VK_LAYER_KHRONOS_validation         Khronos Validation Layer                      1.3.275  version 1
VK_LAYER_LUNARG_api_dump            LunarG API dump layer                         1.3.275  version 2
VK_LAYER_LUNARG_gfxreconstruct      GFXReconstruct Capture Layer Version 1.0.2    1.3.275  version 4194306
VK_LAYER_LUNARG_monitor             Execution Monitoring Layer                    1.3.275  version 1
VK_LAYER_LUNARG_screenshot          LunarG image capture layer                    1.3.275  version 1
VK_LAYER_OBS_HOOK                   Open Broadcaster Software hook                1.3.216  version 1
VK_LAYER_RENDERDOC_Capture          Debugging capture layer for RenderDoc         1.2.131  version 17
VK_LAYER_ROCKSTAR_GAMES_social_club Rockstar Games Social Club Layer              1.0.70   version 1
VK_LAYER_RTSS                       RTSS overlay hook bootstrap                   1.1.73   version 1
VK_LAYER_VALVE_steam_fossilize      Steam Pipeline Caching Layer                  1.3.207  version 1
VK_LAYER_VALVE_steam_overlay        Steam Overlay Layer                           1.3.207  version 1

Devices:
========
GPU0:
        apiVersion         = 1.3.270
        driverVersion      = 2.0.294
        vendorID           = 0x1002
        deviceID           = 0x731f
        deviceType         = PHYSICAL_DEVICE_TYPE_DISCRETE_GPU
        deviceName         = AMD Radeon RX 5700 XT
        driverID           = DRIVER_ID_AMD_PROPRIETARY
        driverName         = AMD proprietary driver
        driverInfo         = 24.1.1 (AMD proprietary shader compiler)
        conformanceVersion = 1.3.3.1
        deviceUUID         = 00000000-2800-0000-0000-000000000000
        driverUUID         = 414d442d-5749-4e2d-4452-560000000000
@0cc4m
Collaborator

0cc4m commented Jan 30, 2024

@stduhpf Let's continue here.

> It doesn't seem to fix #5217. It still behaves pretty much the same.

Really? I tried interactive mode with your commands and it was fine.

> Ah, that's very strange then. Maybe it's a GPU architecture-dependent thing, or something is broken with my hardware...

I just tested it again on master and it's also fine for me there. So I probably didn't fix it, since I couldn't reproduce it in the first place. I'm running Linux, and it works fine on Nvidia and AMD GPUs. Any idea what could be causing this issue for you?

@stduhpf
Contributor Author

stduhpf commented Jan 30, 2024

@0cc4m I'm running Windows 10, on AMD hardware (RX 5700 XT, latest drivers). I have no idea what the root cause could be; maybe some race condition? It happens consistently, but the way it messes up is different each time, even with the same parameters.

@stduhpf
Contributor Author

stduhpf commented Jan 30, 2024

Ok, so it was working when I first tried your PR, at commit 0cc4m@a5cca6c (I still have the build I made back then). Somehow it has broken since then.
I'll try building different commits to binary-search for the change that broke it.
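
Aside: git bisect can automate this kind of search. A rough sketch, using the known-good commit a5cca6c mentioned above as the starting point; at each step, rebuild the Vulkan backend and re-run the interactive command from the top of this issue before marking the commit:

git bisect start
git bisect bad HEAD          # the current build misbehaves
git bisect good a5cca6c      # the last build known to work
# rebuild and test at each step, then mark it:
git bisect good              # or: git bisect bad
git bisect reset             # when done, return to the original branch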

@stduhpf
Contributor Author

stduhpf commented Jan 30, 2024

@0cc4m 0cc4m@0f64857 is the last working commit for me.
It seems that the merge commit 0cc4m@9c4c15a somehow caused this issue (which is bad news, because that commit changed a lot of things).

EDIT: never mind, 0cc4m@0f64857 is not working at all; it just falls back to CPU. (I'm too used to working with rebases instead of merges.)

@stduhpf
Contributor Author

stduhpf commented Jan 30, 2024

Yeah, so 0cc4m@a5cca6c works, and 0cc4m@48ad459 does not. So the breaking change should be in there. @0cc4m

@Engininja2
Contributor

Engininja2 commented Jan 31, 2024

I tried Mistral 7B Instruct, which has an n_vocab of 32000, on my RX 5700 XT on Windows and didn't see any problems. Using the same Dolphin model as your example, which has n_vocab=32001, I ran into similar nondeterministic nonsense responses.

After changing BK from 8 to 16 on this line I get the expected behaviour.

std::initializer_list<uint32_t> warptile_s = { vk_device.subgroup_size,  32,  32,  16, 32, 32, 2, 2, 2, vk_device.subgroup_size };

Instead of that change, doubling the size of buf_a and buf_b in mulmat_body in the shaders worked too, though with worse prompt processing speed. Same for replacing both vk_device.subgroup_size values with 32.
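
A rough way to see why those workarounds are related: the shared-memory tiles scale linearly with BK, so going from BK = 8 to BK = 16 doubles their footprint, just like doubling buf_a and buf_b directly. A minimal C++ sketch of that arithmetic; the sizing formula and the reading of the 32, 32 entries as BM and BN are assumptions for illustration, not the exact layout of the mulmat_body shaders:

#include <cstdio>
#include <cstdint>
#include <cstddef>

// Hypothetical footprint of one BM x BK tile of A plus one BK x BN tile of B,
// held in shared memory at fp16 (2 bytes per element). The real shaders may
// pad or interleave these buffers differently.
static size_t tile_bytes(uint32_t BM, uint32_t BN, uint32_t BK) {
    return static_cast<size_t>(BM * BK + BK * BN) * 2;
}

int main() {
    // Assuming warptile_s uses BM = BN = 32, per the initializer list above.
    std::printf("BK = 8 : %zu bytes\n", tile_bytes(32, 32, 8));  // 1024
    std::printf("BK = 16: %zu bytes\n", tile_bytes(32, 32, 16)); // 2048
}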

Edit: interestingly, on Arch Linux the RADV driver doesn't appear to run into this issue, but AMDVLK does.

@0cc4m
Collaborator

0cc4m commented Jan 31, 2024

@stduhpf Thanks for figuring out the source commit! Really helpful.

@Engininja2 Wow, you found it. I was able to reproduce it with amdvlk. I have no clue why the AMD Windows driver and amdvlk fail with that shader when it works on Nvidia and RADV, but changing BK to 16 seems like a simple fix. I added the fix to #5223; it worked for me on amdvlk. Can you try it on Windows?

0cc4m added the bug (Something isn't working) label and removed the bug-unconfirmed label on Jan 31, 2024
@stduhpf
Contributor Author

stduhpf commented Jan 31, 2024

Yep, #5223 fixes it now, thank you!

stduhpf closed this as completed on Jan 31, 2024