
Misc. bug: Vulkan output is gibberish on Intel GPU #17106

@mimi89999

Description


Name and Version

$ ./llama-cli --version
load_backend: loaded RPC backend from /home/michel/llm/llama-b6989-bin-ubuntu-vulkan-x64/build/bin/libggml-rpc.so
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Iris(R) Xe Graphics (TGL GT2) (Intel open-source Mesa driver) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 65536 | int dot: 1 | matrix cores: none
load_backend: loaded Vulkan backend from /home/michel/llm/llama-b6989-bin-ubuntu-vulkan-x64/build/bin/libggml-vulkan.so
load_backend: loaded CPU backend from /home/michel/llm/llama-b6989-bin-ubuntu-vulkan-x64/build/bin/libggml-cpu-icelake.so
version: 6989 (eeee367de)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04.2) 11.4.0 for x86_64-linux-gnu

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

llama-cli

Command line

./llama-cli -hf ggml-org/SmolLM3-3B-GGUF -p "Hello"

Problem description & steps to reproduce

The output I got is gibberish:

user
Hello
assistant
upported D F;

 D R P23D323D4 PP
D3D3DDD
> 

When using the CPU backend by setting -ngl 0, or simply by removing the Vulkan module, the output is correct (see the example invocation below).
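
For comparison, a CPU-only run can be forced with -ngl 0 (assuming the same model and prompt as in the command above):

./llama-cli -hf ggml-org/SmolLM3-3B-GGUF -p "Hello" -ngl 0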

First Bad Commit

No response

Relevant log output

load_backend: loaded RPC backend from /home/michel/llm/llama-b6989-bin-ubuntu-vulkan-x64/build/bin/libggml-rpc.so
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Iris(R) Xe Graphics (TGL GT2) (Intel open-source Mesa driver) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 65536 | int dot: 1 | matrix cores: none
load_backend: loaded Vulkan backend from /home/michel/llm/llama-b6989-bin-ubuntu-vulkan-x64/build/bin/libggml-vulkan.so
load_backend: loaded CPU backend from /home/michel/llm/llama-b6989-bin-ubuntu-vulkan-x64/build/bin/libggml-cpu-icelake.so
* Host huggingface.co:443 was resolved.
* IPv6: 2600:9000:2436:1200:17:b174:6d00:93a1, 2600:9000:2436:1800:17:b174:6d00:93a1, 2600:9000:2436:4e00:17:b174:6d00:93a1, 2600:9000:2436:b400:17:b174:6d00:93a1, 2600:9000:2436:8000:17:b174:6d00:93a1, 2600:9000:2436:ba00:17:b174:6d00:93a1, 2600:9000:2436:c200:17:b174:6d00:93a1, 2600:9000:2436:5800:17:b174:6d00:93a1
* IPv4: 108.138.51.26, 108.138.51.21, 108.138.51.8, 108.138.51.41
*   Trying [2600:9000:2436:1200:17:b174:6d00:93a1]:443...
* ALPN: curl offers h2,http/1.1
* SSL Trust Anchors:
*   CAfile: /etc/ssl/certs/ca-certificates.crt
*   CApath: /etc/ssl/certs
* SSL connection using TLSv1.3 / TLS_AES_128_GCM_SHA256 / X25519MLKEM768 / RSASSA-PSS
* ALPN: server accepted h2
* Server certificate:
*   subject: CN=huggingface.co
*   start date: Apr 13 00:00:00 2025 GMT
*   expire date: May 12 23:59:59 2026 GMT
*   issuer: C=US; O=Amazon; CN=Amazon RSA 2048 M02
*   Certificate level 0: Public key type RSA (2048/112 Bits/secBits), signed using sha256WithRSAEncryption
*   Certificate level 1: Public key type RSA (2048/112 Bits/secBits), signed using sha256WithRSAEncryption
*   Certificate level 2: Public key type RSA (2048/112 Bits/secBits), signed using sha256WithRSAEncryption
*   subjectAltName: "huggingface.co" matches cert's "huggingface.co"
* SSL certificate verified via OpenSSL.
* Established connection to huggingface.co (2600:9000:2436:1200:17:b174:6d00:93a1 port 443) from 2606:4700:110:886d:e733:73d2:1a78:aed9 port 60996 
* using HTTP/2
* [HTTP/2] [1] OPENED stream for https://huggingface.co/v2/ggml-org/SmolLM3-3B-GGUF/manifests/latest
* [HTTP/2] [1] [:method: GET]
* [HTTP/2] [1] [:scheme: https]
* [HTTP/2] [1] [:authority: huggingface.co]
* [HTTP/2] [1] [:path: /v2/ggml-org/SmolLM3-3B-GGUF/manifests/latest]
* [HTTP/2] [1] [user-agent: llama-cpp]
* [HTTP/2] [1] [accept: application/json]
> GET /v2/ggml-org/SmolLM3-3B-GGUF/manifests/latest HTTP/2
Host: huggingface.co
User-Agent: llama-cpp
Accept: application/json

* Request completely sent off
< HTTP/2 200 
< content-type: application/json; charset=utf-8
< content-length: 976
< date: Sat, 08 Nov 2025 15:37:19 GMT
< etag: W/"3d0-7FgnnKEkDoOon2Kc6uh711e78Lk"
< x-powered-by: huggingface-moon
< x-request-id: Root=1-690f63af-7772639f425bcd6d7e052116
< ratelimit: "pages";r=99;t=221
< ratelimit-policy: "fixed window";"pages";q=100;w=300
< cross-origin-opener-policy: same-origin
< referrer-policy: strict-origin-when-cross-origin
< access-control-max-age: 86400
< access-control-allow-origin: https://huggingface.co
< vary: Origin
< access-control-expose-headers: X-Repo-Commit,X-Request-Id,X-Error-Code,X-Error-Message,X-Total-Count,ETag,Link,Accept-Ranges,Content-Range,X-Linked-Size,X-Linked-ETag,X-Xet-Hash
< x-cache: Miss from cloudfront
< via: 1.1 ca098aee4fd72030e464a2f263541478.cloudfront.net (CloudFront)
< x-amz-cf-pop: WAW51-P2
< x-amz-cf-id: RjEdqRQiQiUA1VhiHfocZqNbeoaA_q5hf8kQzchqFeO1bKbP4zg3pQ==
< 
* Connection #0 to host huggingface.co:443 left intact
common_download_file_single_online: using cached file: /home/michel/.cache/llama.cpp/ggml-org_SmolLM3-3B-GGUF_SmolLM3-Q4_K_M.gguf
build: 6989 (eeee367de) with cc (Ubuntu 11.4.0-1ubuntu1~22.04.2) 11.4.0 for x86_64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device Vulkan0 (Intel(R) Iris(R) Xe Graphics (TGL GT2)) (0000:00:02.0) - 6861 MiB free
llama_model_loader: loaded meta data with 26 key-value pairs and 326 tensors from /home/michel/.cache/llama.cpp/ggml-org_SmolLM3-3B-GGUF_SmolLM3-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = smollm3
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                         general.size_label str              = 3.1B
llama_model_loader: - kv   3:                            general.license str              = apache-2.0
llama_model_loader: - kv   4:                          general.languages arr[str,8]       = ["en", "fr", "es", "it", "pt", "zh", ...
llama_model_loader: - kv   5:                        smollm3.block_count u32              = 36
llama_model_loader: - kv   6:                     smollm3.context_length u32              = 65536
llama_model_loader: - kv   7:                   smollm3.embedding_length u32              = 2048
llama_model_loader: - kv   8:                smollm3.feed_forward_length u32              = 11008
llama_model_loader: - kv   9:               smollm3.attention.head_count u32              = 16
llama_model_loader: - kv  10:            smollm3.attention.head_count_kv u32              = 4
llama_model_loader: - kv  11:                     smollm3.rope.freq_base f32              = 5000000,000000
llama_model_loader: - kv  12:   smollm3.attention.layer_norm_rms_epsilon f32              = 0,000001
llama_model_loader: - kv  13:                         smollm3.vocab_size u32              = 128256
llama_model_loader: - kv  14:               smollm3.rope.dimension_count u32              = 128
llama_model_loader: - kv  15:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  16:                         tokenizer.ggml.pre str              = smaug-bpe
llama_model_loader: - kv  17:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  18:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  19:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  20:                tokenizer.ggml.eos_token_id u32              = 128012
llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 128012
llama_model_loader: - kv  22:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  23:                    tokenizer.chat_template str              = {# ───── defaults ───...
llama_model_loader: - kv  24:               general.quantization_version u32              = 2
llama_model_loader: - kv  25:                          general.file_type u32              = 15
llama_model_loader: - type  f32:   73 tensors
llama_model_loader: - type q4_K:  216 tensors
llama_model_loader: - type q6_K:   37 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 1,78 GiB (4,96 BPW) 
load: printing all EOG tokens:
load:   - 128001 ('<|end_of_text|>')
load:   - 128008 ('<|eom_id|>')
load:   - 128009 ('<|eot_id|>')
load:   - 128012 ('<|im_end|>')
load: special tokens cache size = 256
load: token to piece cache size = 0,7997 MB
print_info: arch             = smollm3
print_info: vocab_only       = 0
print_info: n_ctx_train      = 65536
print_info: n_embd           = 2048
print_info: n_embd_inp       = 2048
print_info: n_layer          = 36
print_info: n_head           = 16
print_info: n_head_kv        = 4
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: is_swa_any       = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 4
print_info: n_embd_k_gqa     = 512
print_info: n_embd_v_gqa     = 512
print_info: f_norm_eps       = 0,0e+00
print_info: f_norm_rms_eps   = 1,0e-06
print_info: f_clamp_kqv      = 0,0e+00
print_info: f_max_alibi_bias = 0,0e+00
print_info: f_logit_scale    = 0,0e+00
print_info: f_attn_scale     = 0,0e+00
print_info: n_ff             = 11008
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: n_expert_groups  = 0
print_info: n_group_used     = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = linear
print_info: freq_base_train  = 5000000,0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 65536
print_info: rope_finetuned   = unknown
print_info: model type       = 3B
print_info: model params     = 3,08 B
print_info: general.name     = n/a
print_info: vocab type       = BPE
print_info: n_vocab          = 128256
print_info: n_merges         = 280147
print_info: BOS token        = 128000 '<|begin_of_text|>'
print_info: EOS token        = 128012 '<|im_end|>'
print_info: EOT token        = 128012 '<|im_end|>'
print_info: EOM token        = 128008 '<|eom_id|>'
print_info: PAD token        = 128012 '<|im_end|>'
print_info: LF token         = 198 'Ċ'
print_info: EOG token        = 128001 '<|end_of_text|>'
print_info: EOG token        = 128008 '<|eom_id|>'
print_info: EOG token        = 128009 '<|eot_id|>'
print_info: EOG token        = 128012 '<|im_end|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 36 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 37/37 layers to GPU
load_tensors:   CPU_Mapped model buffer size =   205,49 MiB
load_tensors:      Vulkan0 model buffer size =  1819,10 MiB
..................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 4096
llama_context: n_ctx_seq     = 4096
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = auto
llama_context: kv_unified    = false
llama_context: freq_base     = 5000000,0
llama_context: freq_scale    = 1
llama_context: n_ctx_seq (4096) < n_ctx_train (65536) -- the full capacity of the model will not be utilized
llama_context: Vulkan_Host  output buffer size =     0,49 MiB
llama_kv_cache:    Vulkan0 KV buffer size =   288,00 MiB
llama_kv_cache: size =  288,00 MiB (  4096 cells,  36 layers,  1/1 seqs), K (f16):  144,00 MiB, V (f16):  144,00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context:    Vulkan0 compute buffer size =   254,50 MiB
llama_context: Vulkan_Host compute buffer size =    12,02 MiB
llama_context: graph nodes  = 1105
llama_context: graph splits = 2
common_init_from_params: added <|end_of_text|> logit bias = -inf
common_init_from_params: added <|eom_id|> logit bias = -inf
common_init_from_params: added <|eot_id|> logit bias = -inf
common_init_from_params: added <|im_end|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 4
main: chat template is available, enabling conversation mode (disable it with -no-cnv)
*** User-specified prompt will pre-start conversation, did you mean to set --system-prompt (-sys) instead?
main: chat template example:
<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant


system_info: n_threads = 4 (n_threads_batch = 4) / 8 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | 

main: interactive mode on.
sampler seed: 1112204385
sampler params: 
	repeat_last_n = 64, repeat_penalty = 1,000, frequency_penalty = 0,000, presence_penalty = 0,000
	dry_multiplier = 0,000, dry_base = 1,750, dry_allowed_length = 2, dry_penalty_last_n = 4096
	top_k = 40, top_p = 0,950, min_p = 0,050, xtc_probability = 0,000, xtc_threshold = 0,100, typical_p = 1,000, top_n_sigma = -1,000, temp = 0,800
	mirostat = 0, mirostat_lr = 0,100, mirostat_ent = 5,000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist 
generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 0

== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to the AI.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.
 - Not using system message. To change it, set a different value via -sys PROMPT

user
Hello
assistant
upported D F;

 D R P23D323D4 PP
D3D3DDD
> 
llama_perf_sampler_print:    sampling time =       3,15 ms /    35 runs   (    0,09 ms per token, 11097,02 tokens per second)
llama_perf_context_print:        load time =    2552,59 ms
llama_perf_context_print: prompt eval time =     351,46 ms /     9 tokens (   39,05 ms per token,    25,61 tokens per second)
llama_perf_context_print:        eval time =    3254,03 ms /    25 runs   (  130,16 ms per token,     7,68 tokens per second)
llama_perf_context_print:       total time =    4410,05 ms /    34 tokens
llama_perf_context_print:    graphs reused =         25
llama_memory_breakdown_print: | memory breakdown [MiB]                               | total   free    self   model   context   compute    unaccounted |
llama_memory_breakdown_print: |   - Vulkan0 (Intel(R) Iris(R) Xe Graphics (TGL GT2)) | 11771 = 4421 + (2361 =  1819 +     288 +     254) +        4989 |
llama_memory_breakdown_print: |   - Host                                             |                  217 =   205 +       0 +      12                |
Interrupted by user

Labels: bug (Something isn't working)