
Phi-2 completely broken on Vulkan #5243

@stduhpf

I get garbage output when offloading any layers to the GPU when running Phi-2 models with the Vulkan backend. The issue seems to come mostly from the first and last layers.

.\buildVulkan\bin\Release\main.exe -m .\models\phi\phi-2.Q4_K_M.gguf -t 12 -tb 6 -p "Here is a reciepe for tomato soup:\n" -e -s 0 --temp 0 -n 128 -ngl X

(main: build = 2035 (7977a2a0))
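To narrow down which offload counts break, the runs below can be reproduced with a small sweep. This is just a PowerShell sketch assuming the same binary and model paths as the command above; the output file names are illustrative.

```powershell
# Sweep -ngl from 0 (CPU only) to 33 (all layers offloaded) using the same
# prompt and sampling settings as above, saving each run's output to a file.
$exe   = ".\buildVulkan\bin\Release\main.exe"
$model = ".\models\phi\phi-2.Q4_K_M.gguf"

foreach ($ngl in 0..33) {
    Write-Host "=== -ngl $ngl ==="
    & $exe -m $model -t 12 -tb 6 -e -s 0 --temp 0 -n 128 -ngl $ngl `
        -p "Here is a reciepe for tomato soup:\n" |
        Tee-Object -FilePath "phi2-vulkan-ngl$ngl.txt"
}
```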

-ngl 0 (control)

Here is a reciepe for tomato soup:

Ingredients:
- 4 cups of chicken broth
- 2 tablespoons of butter
- 1 onion, chopped
- 2 cloves of garlic, minced
- 2 tomatoes, peeled and diced
- Salt and pepper to taste
- Parsley for garnish

Directions:
- In a large pot, melt the butter over medium heat. Add the onion and garlic and cook until soft, about 10 minutes.
- Stir in the chicken broth and bring to a boil. Reduce the heat and simmer for 15 minutes, stirring occasionally.
- Add the tomatoes and season with salt and pepper. Cook for another 10 minutes,
llama_print_timings:        load time =     329.52 ms
llama_print_timings:      sample time =      29.20 ms /   128 runs   (    0.23 ms per token,  4382.96 tokens per second)
llama_print_timings: prompt eval time =     310.37 ms /    11 tokens (   28.22 ms per token,    35.44 tokens per second)
llama_print_timings:        eval time =    8578.80 ms /   127 runs   (   67.55 ms per token,    14.80 tokens per second)
llama_print_timings:       total time =    8949.84 ms /   138 tokens
Log end

-ngl 1

Here is a reciepe for tomato soup:

Ingredients:- "
 [end of text]

llama_print_timings:        load time =     641.73 ms
llama_print_timings:      sample time =       1.47 ms /     7 runs   (    0.21 ms per token,  4768.39 tokens per second)
llama_print_timings: prompt eval time =     312.33 ms /    11 tokens (   28.39 ms per token,    35.22 tokens per second)
llama_print_timings:        eval time =     666.72 ms /     6 runs   (  111.12 ms per token,     9.00 tokens per second)
llama_print_timings:       total time =     983.36 ms /    17 tokens
Log end

Starts out fine, but glitches after a few generated tokens. (In this case it generated an EOS token, so generation ended early, but with a different prompt or higher temperature the output is just noisy gibberish.)

using `-p "Here is a reciepe for tomato soup:\n\n"`
Here is a reciepe for tomato soup:

 - "Tomato SOUP
 Mince 1 onion and 2 cloves of garlic in a large pot over medium heat. Dump in 4 B cans of crushed tomatoes Pinch TThe----------------------------- "-- SOUP
 Mince 1 onion and 2 cloves of garlic in a large pot over medium heat. Dump in 4 B cans of crushed tomatoes Pinch TThe-------------------------

-ngl 2

Here is a reciepe for tomato soup:

- " S M D B P TThe-------------------------------------------------------
- " S M D B P TThe-----------------------------------------------------
llama_print_timings:        load time =     562.83 ms
llama_print_timings:      sample time =      27.43 ms /   128 runs   (    0.21 ms per token,  4665.91 tokens per second)
llama_print_timings: prompt eval time =     304.00 ms /    11 tokens (   27.64 ms per token,    36.18 tokens per second)
llama_print_timings:        eval time =    8149.43 ms /   127 runs   (   64.17 ms per token,    15.58 tokens per second)
llama_print_timings:       total time =    8507.07 ms /   138 tokens
Log end

(-ngl 2 through 32 all produce the same output; only the inference speed changes)

-ngl 32

Here is a reciepe for tomato soup:
- " S M D B P TThe-------------------------------------------------------
- " S M D B P TThe------------------------------------------------------
llama_print_timings:        load time =    1180.39 ms
llama_print_timings:      sample time =      32.76 ms /   128 runs   (    0.26 ms per token,  3906.97 tokens per second)
llama_print_timings: prompt eval time =     184.50 ms /    11 tokens (   16.77 ms per token,    59.62 tokens per second)
llama_print_timings:        eval time =    2464.90 ms /   127 runs   (   19.41 ms per token,    51.52 tokens per second)
llama_print_timings:       total time =    2707.77 ms /   138 tokens
Log end

-ngl 33 (all layers)

Here is a reciepe for tomato soup:
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
llama_print_timings:        load time =    1076.60 ms
llama_print_timings:      sample time =      34.70 ms /   128 runs   (    0.27 ms per token,  3688.97 tokens per second)
llama_print_timings: prompt eval time =     168.66 ms /    11 tokens (   15.33 ms per token,    65.22 tokens per second)
llama_print_timings:        eval time =    1424.31 ms /   127 runs   (   11.21 ms per token,    89.17 tokens per second)
llama_print_timings:       total time =    1652.01 ms /   138 tokens
Log end

(always repeating a single token, mostly '!', 'G', or 'o')
