Description
Name and Version
~/software/llama.cpp/bin/llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4080, compute capability 8.9, VMM: yes
version: 6921 (a864132)
built with cc (Gentoo 14.3.1_p20250801 p4) 14.3.1 20250801 for x86_64-pc-linux-gnu
Operating systems
Linux
GGML backends
CPU
Hardware
13th Gen Intel(R) Core(TM) i9-13900K; NVIDIA GeForce RTX 4080
Models
unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF:Q4_K_M
Problem description & steps to reproduce
Full command
~/software/llama.cpp/bin/llama-cli -c 262144 -ngl 0 -dev none -t 31 -hf unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF:Q4_K_M
The output is gibberish (a single character or a whole word repeated over and over). During text generation, about 28 GB of RAM is still free.
The behavior does not improve if I decrease the context size.
If I remove -dev none and leave all other options unchanged, the output is fine. In that case llama.cpp allocates a small amount of GPU memory but does not actually use the GPU for compute.
I would appreciate any further debugging ideas or instructions, as well as pointers on how to fix this in the code.
I'm not sure whether this is a problem with my setup or an actual bug.
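The two runs compared above differ only in the -dev none flag. A minimal sketch that prints both command lines, so the pair can be re-run side by side, might look like this. It uses only the flags from the full command; the executable name `llama-cli` on PATH and the helper `build_command` are assumptions for illustration.

```python
import shlex

# Flags copied from the full command in this report.
# "llama-cli" on PATH is an assumption; the report uses ~/software/llama.cpp/bin/llama-cli.
BASE = [
    "llama-cli",
    "-c", "262144",
    "-ngl", "0",
    "-t", "31",
    "-hf", "unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF:Q4_K_M",
]


def build_command(cpu_only: bool) -> list[str]:
    """Return the argument list, adding `-dev none` only for the failing variant."""
    cmd = list(BASE)
    if cpu_only:
        cmd += ["-dev", "none"]
    return cmd


if __name__ == "__main__":
    # First line: the variant that produces gibberish; second line: the variant that works.
    for cpu_only in (True, False):
        print(shlex.join(build_command(cpu_only)))
```

Keeping every other option identical between the two lines makes it easier to attribute any difference in output to -dev none alone.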
First Bad Commit
No response
Relevant log output
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen3moe
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Qwen3-Coder-30B-A3B-Instruct
llama_model_loader: - kv 3: general.finetune str = Instruct
llama_model_loader: - kv 4: general.basename str = Qwen3-Coder-30B-A3B-Instruct
llama_model_loader: - kv 5: general.quantized_by str = Unsloth
llama_model_loader: - kv 6: general.size_label str = 30B-A3B
llama_model_loader: - kv 7: general.license str = apache-2.0
llama_model_loader: - kv 8: general.license.link str = https://huggingface.co/Qwen/Qwen3-Cod...
llama_model_loader: - kv 9: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 10: general.base_model.count u32 = 1
llama_model_loader: - kv 11: general.base_model.0.name str = Qwen3 Coder 30B A3B Instruct
llama_model_loader: - kv 12: general.base_model.0.organization str = Qwen
llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-Cod...
llama_model_loader: - kv 14: general.tags arr[str,2] = ["unsloth", "text-generation"]
llama_model_loader: - kv 15: qwen3moe.block_count u32 = 48
llama_model_loader: - kv 16: qwen3moe.context_length u32 = 262144
llama_model_loader: - kv 17: qwen3moe.embedding_length u32 = 2048
llama_model_loader: - kv 18: qwen3moe.feed_forward_length u32 = 5472
llama_model_loader: - kv 19: qwen3moe.attention.head_count u32 = 32
llama_model_loader: - kv 20: qwen3moe.attention.head_count_kv u32 = 4
llama_model_loader: - kv 21: qwen3moe.rope.freq_base f32 = 10000000.000000
llama_model_loader: - kv 22: qwen3moe.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 23: qwen3moe.expert_used_count u32 = 8
llama_model_loader: - kv 24: qwen3moe.attention.key_length u32 = 128
llama_model_loader: - kv 25: qwen3moe.attention.value_length u32 = 128
llama_model_loader: - kv 26: qwen3moe.expert_count u32 = 128
llama_model_loader: - kv 27: qwen3moe.expert_feed_forward_length u32 = 768
llama_model_loader: - kv 28: qwen3moe.expert_shared_feed_forward_length u32 = 0
llama_model_loader: - kv 29: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 30: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 31: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 32: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 33: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 34: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 35: tokenizer.ggml.padding_token_id u32 = 151654
llama_model_loader: - kv 36: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 37: tokenizer.chat_template str = {# Copyright 2025-present Unsloth. Ap...
llama_model_loader: - kv 38: general.quantization_version u32 = 2
llama_model_loader: - kv 39: general.file_type u32 = 15
llama_model_loader: - kv 40: quantize.imatrix.file str = Qwen3-Coder-30B-A3B-Instruct-GGUF/ima...
llama_model_loader: - kv 41: quantize.imatrix.dataset str = unsloth_calibration_Qwen3-Coder-30B-A...
llama_model_loader: - kv 42: quantize.imatrix.entries_count u32 = 384
llama_model_loader: - kv 43: quantize.imatrix.chunks_count u32 = 154
llama_model_loader: - type f32: 241 tensors
llama_model_loader: - type q4_K: 289 tensors
llama_model_loader: - type q6_K: 49 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K - Medium
print_info: file size = 17.28 GiB (4.86 BPW)
load: printing all EOG tokens:
load: - 151643 ('<|endoftext|>')
load: - 151645 ('<|im_end|>')
load: - 151662 ('<|fim_pad|>')
load: - 151663 ('<|repo_name|>')
load: - 151664 ('<|file_sep|>')
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch = qwen3moe
print_info: vocab_only = 0
print_info: n_ctx_train = 262144
print_info: n_embd = 2048
print_info: n_layer = 48
print_info: n_head = 32
print_info: n_head_kv = 4
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 8
print_info: n_embd_k_gqa = 512
print_info: n_embd_v_gqa = 512
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 5472
print_info: n_expert = 128
print_info: n_expert_used = 8
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 10000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 262144
print_info: rope_finetuned = unknown
print_info: model type = 30B.A3B
print_info: model params = 30.53 B
print_info: general.name = Qwen3-Coder-30B-A3B-Instruct
print_info: n_ff_exp = 768
print_info: vocab type = BPE
print_info: n_vocab = 151936
print_info: n_merges = 151387
print_info: BOS token = 11 ','
print_info: EOS token = 151645 '<|im_end|>'
print_info: EOT token = 151645 '<|im_end|>'
print_info: PAD token = 151654 '<|vision_pad|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
print_info: FIM MID token = 151660 '<|fim_middle|>'
print_info: FIM PAD token = 151662 '<|fim_pad|>'
print_info: FIM REP token = 151663 '<|repo_name|>'
print_info: FIM SEP token = 151664 '<|file_sep|>'
print_info: EOG token = 151643 '<|endoftext|>'
print_info: EOG token = 151645 '<|im_end|>'
print_info: EOG token = 151662 '<|fim_pad|>'
print_info: EOG token = 151663 '<|repo_name|>'
print_info: EOG token = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 0 repeating layers to GPU
load_tensors: offloaded 0/49 layers to GPU
load_tensors: CPU_Mapped model buffer size = 17583.34 MiB
load_tensors: CPU_REPACK model buffer size = 13432.50 MiB
...................................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 262144
llama_context: n_ctx_per_seq = 262144
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = false
llama_context: freq_base = 10000000.0
llama_context: freq_scale = 1
llama_context: CPU output buffer size = 0.58 MiB
llama_kv_cache: CPU KV buffer size = 24576.00 MiB
llama_kv_cache: size = 24576.00 MiB (262144 cells, 48 layers, 1/1 seqs), K (f16): 12288.00 MiB, V (f16): 12288.00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context: CPU compute buffer size = 792.01 MiB
llama_context: graph nodes = 3031
llama_context: graph splits = 1
common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: added <|im_end|> logit bias = -inf
common_init_from_params: added <|fim_pad|> logit bias = -inf
common_init_from_params: added <|repo_name|> logit bias = -inf
common_init_from_params: added <|file_sep|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 262144
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 31
main: chat template is available, enabling conversation mode (disable it with -no-cnv)
main: chat template example:
<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
system_info: n_threads = 31 (n_threads_batch = 31) / 32 | CUDA : ARCHS = 890 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
main: interactive mode on.
sampler seed: 2764127722
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 262144
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 262144, n_batch = 2048, n_predict = -1, n_keep = 0
== Running in interactive mode. ==
- Press Ctrl+C to interject at any time.
- Press Return to return control to the AI.
- To return control without starting a new line, end your input with '/'.
- If you want to submit another line, end your input with '\'.
- Not using system message. To change it, set a different value via -sys PROMPT
> create a hello world program in python
# Python
# Python
# Python
# Python
# Python
# Python
# Python
# Python
# Python
# Python
# Python
# Python
# Python
# Python
# Python
# Python
# Python
# Python
# Python
> EOF by user
llama_perf_sampler_print: sampling time = 6.26 ms / 72 runs ( 0.09 ms per token, 11501.60 tokens per second)
llama_perf_context_print: load time = 32135.05 ms
llama_perf_context_print: prompt eval time = 279.64 ms / 15 tokens ( 18.64 ms per token, 53.64 tokens per second)
llama_perf_context_print: eval time = 5419.63 ms / 56 runs ( 96.78 ms per token, 10.33 tokens per second)
llama_perf_context_print: total time = 26931.33 ms / 71 tokens
llama_perf_context_print: graphs reused = 56
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | - Host | 42951 = 17583 + 24576 + 792 |
llama_memory_breakdown_print: | - CPU_REPACK | 13432 = 13432 + 0 + 0 |
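As a sanity check on the memory figures in the log above, the following sketch recomputes the KV cache size from the logged dimensions (n_ctx = 262144, n_layer = 48, n_embd_k_gqa = n_embd_v_gqa = 512, f16) and re-adds the memory breakdown. The size formula is an assumption chosen because it reproduces the logged 12288.00 MiB per cache; it is not taken from the llama.cpp source, and the helper name `kv_cache_mib` is hypothetical.

```python
MIB = 1024 * 1024


def kv_cache_mib(n_ctx: int, n_layer: int, n_embd_gqa: int, bytes_per_elem: int = 2) -> float:
    """Assumed size of one cache (K or V): one f16 value per cell, layer, and GQA dim."""
    return n_ctx * n_layer * n_embd_gqa * bytes_per_elem / MIB


# Values copied from the log output.
k_mib = kv_cache_mib(n_ctx=262144, n_layer=48, n_embd_gqa=512)
v_mib = kv_cache_mib(n_ctx=262144, n_layer=48, n_embd_gqa=512)

# Memory breakdown rows (MiB): Host = model + context + compute.
host_mib = 17583 + 24576 + 792
repack_mib = 13432
# Assumption: the CPU_REPACK buffer is counted in addition to the Host row.
total_mib = host_mib + repack_mib

if __name__ == "__main__":
    print(f"K cache: {k_mib:.2f} MiB, V cache: {v_mib:.2f} MiB, KV total: {k_mib + v_mib:.2f} MiB")
    print(f"Host: {host_mib} MiB, Host + CPU_REPACK: {total_mib} MiB ({total_mib / 1024:.1f} GiB)")
```

Under these assumptions the computed values match the logged KV buffer size (24576.00 MiB) and the Host row (42951 MiB), so the reported figures are at least internally consistent.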