
Phi-2 completely broken on Vulkan #5243

@stduhpf

I get garbage output when offloading any layers to the GPU when running Phi-2 models with the Vulkan backend. The issue seems to come mostly from the first and last layers.

.\buildVulkan\bin\Release\main.exe -m .\models\phi\phi-2.Q4_K_M.gguf -t 12 -tb 6 -p "Here is a reciepe for tomato soup:\n" -e -s 0 --temp 0 -n 128 -ngl X

(main: build = 2035 (7977a2a0))
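To narrow down which offload counts break, the runs below can be reproduced with a small sweep. This is just a PowerShell sketch assuming the same binary and model paths as the command above; the output file names are illustrative.

```powershell
# Sweep -ngl from 0 (CPU only) to 33 (all layers offloaded) using the same
# prompt and sampling settings as above, saving each run's output to a file.
$exe   = ".\buildVulkan\bin\Release\main.exe"
$model = ".\models\phi\phi-2.Q4_K_M.gguf"

foreach ($ngl in 0..33) {
    Write-Host "=== -ngl $ngl ==="
    & $exe -m $model -t 12 -tb 6 -e -s 0 --temp 0 -n 128 -ngl $ngl `
        -p "Here is a reciepe for tomato soup:\n" |
        Tee-Object -FilePath "phi2-vulkan-ngl$ngl.txt"
}
```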

-ngl 0 (control)

Here is a reciepe for tomato soup:

Ingredients:
- 4 cups of chicken broth
- 2 tablespoons of butter
- 1 onion, chopped
- 2 cloves of garlic, minced
- 2 tomatoes, peeled and diced
- Salt and pepper to taste
- Parsley for garnish

Directions:
- In a large pot, melt the butter over medium heat. Add the onion and garlic and cook until soft, about 10 minutes.
- Stir in the chicken broth and bring to a boil. Reduce the heat and simmer for 15 minutes, stirring occasionally.
- Add the tomatoes and season with salt and pepper. Cook for another 10 minutes,
llama_print_timings:        load time =     329.52 ms
llama_print_timings:      sample time =      29.20 ms /   128 runs   (    0.23 ms per token,  4382.96 tokens per second)
llama_print_timings: prompt eval time =     310.37 ms /    11 tokens (   28.22 ms per token,    35.44 tokens per second)
llama_print_timings:        eval time =    8578.80 ms /   127 runs   (   67.55 ms per token,    14.80 tokens per second)
llama_print_timings:       total time =    8949.84 ms /   138 tokens
Log end

-ngl 1

Here is a reciepe for tomato soup:

Ingredients:- "
 [end of text]

llama_print_timings:        load time =     641.73 ms
llama_print_timings:      sample time =       1.47 ms /     7 runs   (    0.21 ms per token,  4768.39 tokens per second)
llama_print_timings: prompt eval time =     312.33 ms /    11 tokens (   28.39 ms per token,    35.22 tokens per second)
llama_print_timings:        eval time =     666.72 ms /     6 runs   (  111.12 ms per token,     9.00 tokens per second)
llama_print_timings:       total time =     983.36 ms /    17 tokens
Log end

Starts out fine, but glitches after a few generated tokens. (In this case it generated an EOS token, so generation ended early, but with a different prompt or higher temperature the output is just noisy gibberish.)

using `-p "Here is a reciepe for tomato soup:\n\n"`
Here is a reciepe for tomato soup:

 - "Tomato SOUP
 Mince 1 onion and 2 cloves of garlic in a large pot over medium heat. Dump in 4 B cans of crushed tomatoes Pinch TThe----------------------------- "-- SOUP
 Mince 1 onion and 2 cloves of garlic in a large pot over medium heat. Dump in 4 B cans of crushed tomatoes Pinch TThe-------------------------

-ngl 2

Here is a reciepe for tomato soup:

- " S M D B P TThe-------------------------------------------------------
- " S M D B P TThe-----------------------------------------------------
llama_print_timings:        load time =     562.83 ms
llama_print_timings:      sample time =      27.43 ms /   128 runs   (    0.21 ms per token,  4665.91 tokens per second)
llama_print_timings: prompt eval time =     304.00 ms /    11 tokens (   27.64 ms per token,    36.18 tokens per second)
llama_print_timings:        eval time =    8149.43 ms /   127 runs   (   64.17 ms per token,    15.58 tokens per second)
llama_print_timings:       total time =    8507.07 ms /   138 tokens
Log end

(-ngl 2 through 32 all produce the same output; only the inference speed changes)

-ngl 32

Here is a reciepe for tomato soup:
- " S M D B P TThe-------------------------------------------------------
- " S M D B P TThe------------------------------------------------------
llama_print_timings:        load time =    1180.39 ms
llama_print_timings:      sample time =      32.76 ms /   128 runs   (    0.26 ms per token,  3906.97 tokens per second)
llama_print_timings: prompt eval time =     184.50 ms /    11 tokens (   16.77 ms per token,    59.62 tokens per second)
llama_print_timings:        eval time =    2464.90 ms /   127 runs   (   19.41 ms per token,    51.52 tokens per second)
llama_print_timings:       total time =    2707.77 ms /   138 tokens
Log end

-ngl 33 (all layers)

Here is a reciepe for tomato soup:
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
llama_print_timings:        load time =    1076.60 ms
llama_print_timings:      sample time =      34.70 ms /   128 runs   (    0.27 ms per token,  3688.97 tokens per second)
llama_print_timings: prompt eval time =     168.66 ms /    11 tokens (   15.33 ms per token,    65.22 tokens per second)
llama_print_timings:        eval time =    1424.31 ms /   127 runs   (   11.21 ms per token,    89.17 tokens per second)
llama_print_timings:       total time =    1652.01 ms /   138 tokens
Log end

(always repeating a single token, mostly '!', 'G', or 'o')
