Name and Version
$ ./llama-cli --version
load_backend: loaded RPC backend from /home/michel/llm/llama-b6989-bin-ubuntu-vulkan-x64/build/bin/libggml-rpc.so
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Iris(R) Xe Graphics (TGL GT2) (Intel open-source Mesa driver) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 65536 | int dot: 1 | matrix cores: none
load_backend: loaded Vulkan backend from /home/michel/llm/llama-b6989-bin-ubuntu-vulkan-x64/build/bin/libggml-vulkan.so
load_backend: loaded CPU backend from /home/michel/llm/llama-b6989-bin-ubuntu-vulkan-x64/build/bin/libggml-cpu-icelake.so
version: 6989 (eeee367de)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04.2) 11.4.0 for x86_64-linux-gnu
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
llama-cli
Command line
./llama-cli -hf ggml-org/SmolLM3-3B-GGUF -p "Hello"
Problem description & steps to reproduce
The output I got is gibberish:
user
Hello
assistant
upported D F;
D R P23D323D4 PP
D3D3DDD
>
When using the CPU backend, either by setting -ngl 0 or simply by removing the Vulkan module, my output is correct (see the example invocation below).
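For comparison, a minimal sketch of the CPU-only run mentioned above, i.e. the same command with -ngl 0 so that no layers are offloaded to the Vulkan device:

# same reproduction command, but keep all layers on the CPU backend
./llama-cli -hf ggml-org/SmolLM3-3B-GGUF -p "Hello" -ngl 0

With this invocation the assistant reply is coherent, which suggests the problem is in the Vulkan offload path rather than in the model file itself.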
First Bad Commit
No response
Relevant log output
load_backend: loaded RPC backend from /home/michel/llm/llama-b6989-bin-ubuntu-vulkan-x64/build/bin/libggml-rpc.so
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Iris(R) Xe Graphics (TGL GT2) (Intel open-source Mesa driver) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 65536 | int dot: 1 | matrix cores: none
load_backend: loaded Vulkan backend from /home/michel/llm/llama-b6989-bin-ubuntu-vulkan-x64/build/bin/libggml-vulkan.so
load_backend: loaded CPU backend from /home/michel/llm/llama-b6989-bin-ubuntu-vulkan-x64/build/bin/libggml-cpu-icelake.so
* Host huggingface.co:443 was resolved.
* IPv6: 2600:9000:2436:1200:17:b174:6d00:93a1, 2600:9000:2436:1800:17:b174:6d00:93a1, 2600:9000:2436:4e00:17:b174:6d00:93a1, 2600:9000:2436:b400:17:b174:6d00:93a1, 2600:9000:2436:8000:17:b174:6d00:93a1, 2600:9000:2436:ba00:17:b174:6d00:93a1, 2600:9000:2436:c200:17:b174:6d00:93a1, 2600:9000:2436:5800:17:b174:6d00:93a1
* IPv4: 108.138.51.26, 108.138.51.21, 108.138.51.8, 108.138.51.41
* Trying [2600:9000:2436:1200:17:b174:6d00:93a1]:443...
* ALPN: curl offers h2,http/1.1
* SSL Trust Anchors:
* CAfile: /etc/ssl/certs/ca-certificates.crt
* CApath: /etc/ssl/certs
* SSL connection using TLSv1.3 / TLS_AES_128_GCM_SHA256 / X25519MLKEM768 / RSASSA-PSS
* ALPN: server accepted h2
* Server certificate:
* subject: CN=huggingface.co
* start date: Apr 13 00:00:00 2025 GMT
* expire date: May 12 23:59:59 2026 GMT
* issuer: C=US; O=Amazon; CN=Amazon RSA 2048 M02
* Certificate level 0: Public key type RSA (2048/112 Bits/secBits), signed using sha256WithRSAEncryption
* Certificate level 1: Public key type RSA (2048/112 Bits/secBits), signed using sha256WithRSAEncryption
* Certificate level 2: Public key type RSA (2048/112 Bits/secBits), signed using sha256WithRSAEncryption
* subjectAltName: "huggingface.co" matches cert's "huggingface.co"
* SSL certificate verified via OpenSSL.
* Established connection to huggingface.co (2600:9000:2436:1200:17:b174:6d00:93a1 port 443) from 2606:4700:110:886d:e733:73d2:1a78:aed9 port 60996
* using HTTP/2
* [HTTP/2] [1] OPENED stream for https://huggingface.co/v2/ggml-org/SmolLM3-3B-GGUF/manifests/latest
* [HTTP/2] [1] [:method: GET]
* [HTTP/2] [1] [:scheme: https]
* [HTTP/2] [1] [:authority: huggingface.co]
* [HTTP/2] [1] [:path: /v2/ggml-org/SmolLM3-3B-GGUF/manifests/latest]
* [HTTP/2] [1] [user-agent: llama-cpp]
* [HTTP/2] [1] [accept: application/json]
> GET /v2/ggml-org/SmolLM3-3B-GGUF/manifests/latest HTTP/2
Host: huggingface.co
User-Agent: llama-cpp
Accept: application/json
* Request completely sent off
< HTTP/2 200
< content-type: application/json; charset=utf-8
< content-length: 976
< date: Sat, 08 Nov 2025 15:37:19 GMT
< etag: W/"3d0-7FgnnKEkDoOon2Kc6uh711e78Lk"
< x-powered-by: huggingface-moon
< x-request-id: Root=1-690f63af-7772639f425bcd6d7e052116
< ratelimit: "pages";r=99;t=221
< ratelimit-policy: "fixed window";"pages";q=100;w=300
< cross-origin-opener-policy: same-origin
< referrer-policy: strict-origin-when-cross-origin
< access-control-max-age: 86400
< access-control-allow-origin: https://huggingface.co
< vary: Origin
< access-control-expose-headers: X-Repo-Commit,X-Request-Id,X-Error-Code,X-Error-Message,X-Total-Count,ETag,Link,Accept-Ranges,Content-Range,X-Linked-Size,X-Linked-ETag,X-Xet-Hash
< x-cache: Miss from cloudfront
< via: 1.1 ca098aee4fd72030e464a2f263541478.cloudfront.net (CloudFront)
< x-amz-cf-pop: WAW51-P2
< x-amz-cf-id: RjEdqRQiQiUA1VhiHfocZqNbeoaA_q5hf8kQzchqFeO1bKbP4zg3pQ==
<
* Connection #0 to host huggingface.co:443 left intact
common_download_file_single_online: using cached file: /home/michel/.cache/llama.cpp/ggml-org_SmolLM3-3B-GGUF_SmolLM3-Q4_K_M.gguf
build: 6989 (eeee367de) with cc (Ubuntu 11.4.0-1ubuntu1~22.04.2) 11.4.0 for x86_64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device Vulkan0 (Intel(R) Iris(R) Xe Graphics (TGL GT2)) (0000:00:02.0) - 6861 MiB free
llama_model_loader: loaded meta data with 26 key-value pairs and 326 tensors from /home/michel/.cache/llama.cpp/ggml-org_SmolLM3-3B-GGUF_SmolLM3-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = smollm3
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.size_label str = 3.1B
llama_model_loader: - kv 3: general.license str = apache-2.0
llama_model_loader: - kv 4: general.languages arr[str,8] = ["en", "fr", "es", "it", "pt", "zh", ...
llama_model_loader: - kv 5: smollm3.block_count u32 = 36
llama_model_loader: - kv 6: smollm3.context_length u32 = 65536
llama_model_loader: - kv 7: smollm3.embedding_length u32 = 2048
llama_model_loader: - kv 8: smollm3.feed_forward_length u32 = 11008
llama_model_loader: - kv 9: smollm3.attention.head_count u32 = 16
llama_model_loader: - kv 10: smollm3.attention.head_count_kv u32 = 4
llama_model_loader: - kv 11: smollm3.rope.freq_base f32 = 5000000,000000
llama_model_loader: - kv 12: smollm3.attention.layer_norm_rms_epsilon f32 = 0,000001
llama_model_loader: - kv 13: smollm3.vocab_size u32 = 128256
llama_model_loader: - kv 14: smollm3.rope.dimension_count u32 = 128
llama_model_loader: - kv 15: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 16: tokenizer.ggml.pre str = smaug-bpe
llama_model_loader: - kv 17: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 18: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 19: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 20: tokenizer.ggml.eos_token_id u32 = 128012
llama_model_loader: - kv 21: tokenizer.ggml.padding_token_id u32 = 128012
llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 23: tokenizer.chat_template str = {# ───── defaults ───...
llama_model_loader: - kv 24: general.quantization_version u32 = 2
llama_model_loader: - kv 25: general.file_type u32 = 15
llama_model_loader: - type f32: 73 tensors
llama_model_loader: - type q4_K: 216 tensors
llama_model_loader: - type q6_K: 37 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K - Medium
print_info: file size = 1,78 GiB (4,96 BPW)
load: printing all EOG tokens:
load: - 128001 ('<|end_of_text|>')
load: - 128008 ('<|eom_id|>')
load: - 128009 ('<|eot_id|>')
load: - 128012 ('<|im_end|>')
load: special tokens cache size = 256
load: token to piece cache size = 0,7997 MB
print_info: arch = smollm3
print_info: vocab_only = 0
print_info: n_ctx_train = 65536
print_info: n_embd = 2048
print_info: n_embd_inp = 2048
print_info: n_layer = 36
print_info: n_head = 16
print_info: n_head_kv = 4
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 4
print_info: n_embd_k_gqa = 512
print_info: n_embd_v_gqa = 512
print_info: f_norm_eps = 0,0e+00
print_info: f_norm_rms_eps = 1,0e-06
print_info: f_clamp_kqv = 0,0e+00
print_info: f_max_alibi_bias = 0,0e+00
print_info: f_logit_scale = 0,0e+00
print_info: f_attn_scale = 0,0e+00
print_info: n_ff = 11008
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 5000000,0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 65536
print_info: rope_finetuned = unknown
print_info: model type = 3B
print_info: model params = 3,08 B
print_info: general.name = n/a
print_info: vocab type = BPE
print_info: n_vocab = 128256
print_info: n_merges = 280147
print_info: BOS token = 128000 '<|begin_of_text|>'
print_info: EOS token = 128012 '<|im_end|>'
print_info: EOT token = 128012 '<|im_end|>'
print_info: EOM token = 128008 '<|eom_id|>'
print_info: PAD token = 128012 '<|im_end|>'
print_info: LF token = 198 'Ċ'
print_info: EOG token = 128001 '<|end_of_text|>'
print_info: EOG token = 128008 '<|eom_id|>'
print_info: EOG token = 128009 '<|eot_id|>'
print_info: EOG token = 128012 '<|im_end|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 36 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 37/37 layers to GPU
load_tensors: CPU_Mapped model buffer size = 205,49 MiB
load_tensors: Vulkan0 model buffer size = 1819,10 MiB
..................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = false
llama_context: freq_base = 5000000,0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (4096) < n_ctx_train (65536) -- the full capacity of the model will not be utilized
llama_context: Vulkan_Host output buffer size = 0,49 MiB
llama_kv_cache: Vulkan0 KV buffer size = 288,00 MiB
llama_kv_cache: size = 288,00 MiB ( 4096 cells, 36 layers, 1/1 seqs), K (f16): 144,00 MiB, V (f16): 144,00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context: Vulkan0 compute buffer size = 254,50 MiB
llama_context: Vulkan_Host compute buffer size = 12,02 MiB
llama_context: graph nodes = 1105
llama_context: graph splits = 2
common_init_from_params: added <|end_of_text|> logit bias = -inf
common_init_from_params: added <|eom_id|> logit bias = -inf
common_init_from_params: added <|eot_id|> logit bias = -inf
common_init_from_params: added <|im_end|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 4
main: chat template is available, enabling conversation mode (disable it with -no-cnv)
*** User-specified prompt will pre-start conversation, did you mean to set --system-prompt (-sys) instead?
main: chat template example:
<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
system_info: n_threads = 4 (n_threads_batch = 4) / 8 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
main: interactive mode on.
sampler seed: 1112204385
sampler params:
repeat_last_n = 64, repeat_penalty = 1,000, frequency_penalty = 0,000, presence_penalty = 0,000
dry_multiplier = 0,000, dry_base = 1,750, dry_allowed_length = 2, dry_penalty_last_n = 4096
top_k = 40, top_p = 0,950, min_p = 0,050, xtc_probability = 0,000, xtc_threshold = 0,100, typical_p = 1,000, top_n_sigma = -1,000, temp = 0,800
mirostat = 0, mirostat_lr = 0,100, mirostat_ent = 5,000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 0
== Running in interactive mode. ==
- Press Ctrl+C to interject at any time.
- Press Return to return control to the AI.
- To return control without starting a new line, end your input with '/'.
- If you want to submit another line, end your input with '\'.
- Not using system message. To change it, set a different value via -sys PROMPT
user
Hello
assistant
upported D F;
D R P23D323D4 PP
D3D3DDD
>
llama_perf_sampler_print: sampling time = 3,15 ms / 35 runs ( 0,09 ms per token, 11097,02 tokens per second)
llama_perf_context_print: load time = 2552,59 ms
llama_perf_context_print: prompt eval time = 351,46 ms / 9 tokens ( 39,05 ms per token, 25,61 tokens per second)
llama_perf_context_print: eval time = 3254,03 ms / 25 runs ( 130,16 ms per token, 7,68 tokens per second)
llama_perf_context_print: total time = 4410,05 ms / 34 tokens
llama_perf_context_print: graphs reused = 25
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | - Vulkan0 (Intel(R) Iris(R) Xe Graphics (TGL GT2)) | 11771 = 4421 + (2361 = 1819 + 288 + 254) + 4989 |
llama_memory_breakdown_print: | - Host | 217 = 205 + 0 + 12 |
Interrupted by user