I get garbage output when offloading any layer to the GPU when running Phi-2 models with the Vulkan backend. The issue seems to be mostly with the first and last layers.
`.\buildVulkan\bin\Release\main.exe -m .\models\phi\phi-2.Q4_K_M.gguf -t 12 -tb 6 -p "Here is a reciepe for tomato soup:\n" -e -s 0 --temp 0 -n 128 -ngl X`
(main: build = 2035 (7977a2a0))
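To reproduce all of the runs below in one go, something like the following Python sketch can be used (the sweep loop, output directory, and `ngl_XX.txt` file names are my own additions; the binary, model path, and flags are the ones from the command above):

```python
# Sketch: sweep -ngl from 0 (CPU-only control) to 33 (all layers offloaded)
# and save each run's output. Assumes the same main.exe, model path, and
# flags as the command above.
import subprocess
from pathlib import Path

MAIN = r".\buildVulkan\bin\Release\main.exe"
MODEL = r".\models\phi\phi-2.Q4_K_M.gguf"
PROMPT = r"Here is a reciepe for tomato soup:\n"

out_dir = Path("ngl_sweep")
out_dir.mkdir(exist_ok=True)

for ngl in range(0, 34):
    result = subprocess.run(
        [MAIN, "-m", MODEL, "-t", "12", "-tb", "6",
         "-p", PROMPT, "-e", "-s", "0", "--temp", "0",
         "-n", "128", "-ngl", str(ngl)],
        capture_output=True, text=True)
    (out_dir / f"ngl_{ngl:02d}.txt").write_text(result.stdout, encoding="utf-8")
```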
-ngl 0 (control)
Here is a reciepe for tomato soup:
Ingredients:
- 4 cups of chicken broth
- 2 tablespoons of butter
- 1 onion, chopped
- 2 cloves of garlic, minced
- 2 tomatoes, peeled and diced
- Salt and pepper to taste
- Parsley for garnish
Directions:
- In a large pot, melt the butter over medium heat. Add the onion and garlic and cook until soft, about 10 minutes.
- Stir in the chicken broth and bring to a boil. Reduce the heat and simmer for 15 minutes, stirring occasionally.
- Add the tomatoes and season with salt and pepper. Cook for another 10 minutes,
llama_print_timings: load time = 329.52 ms
llama_print_timings: sample time = 29.20 ms / 128 runs ( 0.23 ms per token, 4382.96 tokens per second)
llama_print_timings: prompt eval time = 310.37 ms / 11 tokens ( 28.22 ms per token, 35.44 tokens per second)
llama_print_timings: eval time = 8578.80 ms / 127 runs ( 67.55 ms per token, 14.80 tokens per second)
llama_print_timings: total time = 8949.84 ms / 138 tokens
Log end
-ngl 1
Here is a reciepe for tomato soup:
Ingredients:- "
[end of text]
llama_print_timings: load time = 641.73 ms
llama_print_timings: sample time = 1.47 ms / 7 runs ( 0.21 ms per token, 4768.39 tokens per second)
llama_print_timings: prompt eval time = 312.33 ms / 11 tokens ( 28.39 ms per token, 35.22 tokens per second)
llama_print_timings: eval time = 666.72 ms / 6 runs ( 111.12 ms per token, 9.00 tokens per second)
llama_print_timings: total time = 983.36 ms / 17 tokens
Log end
Starts OK, but glitches after a few generated tokens. (In this case it generated an EOS token, so generation ended early; with a different prompt or higher temperature, the output is just noisy gibberish.)
Using `-p "Here is a reciepe for tomato soup:\n\n"`:
Here is a reciepe for tomato soup:
- "Tomato SOUP
Mince 1 onion and 2 cloves of garlic in a large pot over medium heat. Dump in 4 B cans of crushed tomatoes Pinch TThe----------------------------- "-- SOUP
Mince 1 onion and 2 cloves of garlic in a large pot over medium heat. Dump in 4 B cans of crushed tomatoes Pinch TThe-------------------------
-ngl 2
Here is a reciepe for tomato soup:
- " S M D B P TThe-------------------------------------------------------
- " S M D B P TThe-----------------------------------------------------
llama_print_timings: load time = 562.83 ms
llama_print_timings: sample time = 27.43 ms / 128 runs ( 0.21 ms per token, 4665.91 tokens per second)
llama_print_timings: prompt eval time = 304.00 ms / 11 tokens ( 27.64 ms per token, 36.18 tokens per second)
llama_print_timings: eval time = 8149.43 ms / 127 runs ( 64.17 ms per token, 15.58 tokens per second)
llama_print_timings: total time = 8507.07 ms / 138 tokens
Log end
(-ngl 2 through 32 all produce the same output; only the inference speed changes.)
-ngl 32
Here is a reciepe for tomato soup:
- " S M D B P TThe-------------------------------------------------------
- " S M D B P TThe------------------------------------------------------
llama_print_timings: load time = 1180.39 ms
llama_print_timings: sample time = 32.76 ms / 128 runs ( 0.26 ms per token, 3906.97 tokens per second)
llama_print_timings: prompt eval time = 184.50 ms / 11 tokens ( 16.77 ms per token, 59.62 tokens per second)
llama_print_timings: eval time = 2464.90 ms / 127 runs ( 19.41 ms per token, 51.52 tokens per second)
llama_print_timings: total time = 2707.77 ms / 138 tokens
Log end
-ngl 33 (all layers)
Here is a reciepe for tomato soup:
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
llama_print_timings: load time = 1076.60 ms
llama_print_timings: sample time = 34.70 ms / 128 runs ( 0.27 ms per token, 3688.97 tokens per second)
llama_print_timings: prompt eval time = 168.66 ms / 11 tokens ( 15.33 ms per token, 65.22 tokens per second)
llama_print_timings: eval time = 1424.31 ms / 127 runs ( 11.21 ms per token, 89.17 tokens per second)
llama_print_timings: total time = 1652.01 ms / 138 tokens
Log end
(The output always repeats a single token, mostly '!', 'G', or 'o'.)
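Not part of the runs above, but to pin down where a GPU run starts to diverge from the CPU control, the outputs saved by the sweep sketch (hypothetical `ngl_XX.txt` file names) can be compared like this:

```python
# Sketch: compare each GPU run against the -ngl 0 control and report
# the first character position where the generated text differs.
from pathlib import Path

out_dir = Path("ngl_sweep")
control = (out_dir / "ngl_00.txt").read_text(encoding="utf-8")

for ngl in range(1, 34):
    text = (out_dir / f"ngl_{ngl:02d}.txt").read_text(encoding="utf-8")
    i = next((k for k, (a, b) in enumerate(zip(control, text)) if a != b),
             min(len(control), len(text)))
    print(f"-ngl {ngl:2d}: first divergence at character {i}: "
          f"{text[max(0, i - 20):i + 20]!r}")
```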