
Eval bug: Gibberish output on Vulkan backend in Android (Phones) (Termux) #16881

@DevGitPit

Description


Name and Version

ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Adreno (TM) 732 (Qualcomm Technologies Inc. Adreno Vulkan Driver) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 32768 | int dot: 1 | matrix cores: none
version: 6884 (229bf68)
built with clang version 21.1.4 for aarch64-unknown-linux-android24
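
The binary above lives in a `build_vulkan` directory. As a point of reference, a minimal sketch of how such a build is typically produced with the standard llama.cpp CMake Vulkan flow — the exact configure flags and Termux packages used for this report are not stated, so treat this as an assumption rather than the reporter's commands:

```sh
# Hypothetical reconstruction of the build, not the reporter's exact commands.
# Assumes the Vulkan headers/loader and glslc are already installed in Termux.
cmake -B build_vulkan -DGGML_VULKAN=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build_vulkan --config Release -j
```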

dev/llm/llama.cpp/build_vulkan/bin/llama-cli -m /sdcard/download/LFM2-8B-A1B-Q4_0.gguf -ngl 14
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Adreno (TM) 732 (Qualcomm Technologies Inc. Adreno Vulkan Driver) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 32768 | int dot: 1 | matrix cores: none
build: 6884 (229bf68) with clang version 21.1.4 for aarch64-unknown-linux-android24
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device Vulkan0 (Adreno (TM) 732) (unknown id) - 11296 MiB free
llama_model_loader: loaded meta data with 39 key-value pairs and 256 tensors from /sdcard/download/LFM2-8B-A1B-Q4_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = lfm2moe
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = LFM2 8B A1B
llama_model_loader: - kv 3: general.basename str = LFM2
llama_model_loader: - kv 4: general.size_label str = 8B-A1B
llama_model_loader: - kv 5: general.license str = other
llama_model_loader: - kv 6: general.license.name str = lfm1.0
llama_model_loader: - kv 7: general.license.link str = LICENSE
llama_model_loader: - kv 8: general.tags arr[str,5] = ["liquid", "lfm2", "edge", "moe", "te...
llama_model_loader: - kv 9: general.languages arr[str,8] = ["en", "ar", "zh", "fr", "de", "ja", ...
llama_model_loader: - kv 10: lfm2moe.block_count u32 = 24
llama_model_loader: - kv 11: lfm2moe.context_length u32 = 128000
llama_model_loader: - kv 12: lfm2moe.embedding_length u32 = 2048
llama_model_loader: - kv 13: lfm2moe.feed_forward_length u32 = 7168
llama_model_loader: - kv 14: lfm2moe.attention.head_count u32 = 32
llama_model_loader: - kv 15: lfm2moe.attention.head_count_kv arr[i32,24] = [0, 0, 8, 0, 0, 0, 8, 0, 0, 0, 8, 0, ...
llama_model_loader: - kv 16: lfm2moe.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 17: lfm2moe.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 18: lfm2moe.expert_used_count u32 = 4
llama_model_loader: - kv 19: lfm2moe.expert_count u32 = 32
llama_model_loader: - kv 20: lfm2moe.expert_feed_forward_length u32 = 1792
llama_model_loader: - kv 21: lfm2moe.leading_dense_block_count u32 = 2
llama_model_loader: - kv 22: lfm2moe.expert_gating_func u32 = 2
llama_model_loader: - kv 23: lfm2moe.vocab_size u32 = 65536
llama_model_loader: - kv 24: lfm2moe.shortconv.l_cache u32 = 3
llama_model_loader: - kv 25: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 26: tokenizer.ggml.pre str = lfm2
llama_model_loader: - kv 27: tokenizer.ggml.tokens arr[str,65536] = ["<|pad|>", "<|startoftext|>", "<|end...
llama_model_loader: - kv 28: tokenizer.ggml.token_type arr[i32,65536] = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv 29: tokenizer.ggml.merges arr[str,63683] = ["Ċ Ċ", "Ċ ĊĊ", "ĊĊ Ċ", "Ċ �...
llama_model_loader: - kv 30: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 31: tokenizer.ggml.eos_token_id u32 = 7
llama_model_loader: - kv 32: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 33: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 34: tokenizer.ggml.add_sep_token bool = false
llama_model_loader: - kv 35: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 36: tokenizer.chat_template str = {{- bos_token -}}{%- set system_promp...
llama_model_loader: - kv 37: general.quantization_version u32 = 2
llama_model_loader: - kv 38: general.file_type u32 = 2
llama_model_loader: - type f32: 123 tensors
llama_model_loader: - type q4_0: 132 tensors
llama_model_loader: - type q6_K: 1 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_0
print_info: file size = 4.41 GiB (4.54 BPW)
load: printing all EOG tokens:
load: - 2 ('<|endoftext|>')
load: - 7 ('<|im_end|>')
load: special tokens cache size = 507
load: token to piece cache size = 0.3759 MB
print_info: arch = lfm2moe
print_info: vocab_only = 0
print_info: n_ctx_train = 128000
print_info: n_embd = 2048
print_info: n_layer = 24
print_info: n_head = 32
print_info: n_head_kv = [0, 0, 8, 0, 0, 0, 8, 0, 0, 0, 8, 0, 0, 0, 8, 0, 0, 0, 8, 0, 0, 8, 0, 0]
print_info: n_rot = 64
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 64
print_info: n_embd_head_v = 64
print_info: n_gqa = [0, 0, 4, 0, 0, 0, 4, 0, 0, 0, 4, 0, 0, 0, 4, 0, 0, 0, 4, 0, 0, 4, 0, 0]
print_info: n_embd_k_gqa = [0, 0, 512, 0, 0, 0, 512, 0, 0, 0, 512, 0, 0, 0, 512, 0, 0, 0, 512, 0, 0, 512, 0, 0]
print_info: n_embd_v_gqa = [0, 0, 512, 0, 0, 0, 512, 0, 0, 0, 512, 0, 0, 0, 512, 0, 0, 0, 512, 0, 0, 512, 0, 0]
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 7168
print_info: n_expert = 32
print_info: n_expert_used = 4
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 128000
print_info: rope_finetuned = unknown
print_info: model type = 8B.A1B
print_info: model params = 8.34 B
print_info: general.name = LFM2 8B A1B
print_info: n_ff_exp = 1792
print_info: expert_gating_func = sigmoid
print_info: vocab type = BPE
print_info: n_vocab = 65536
print_info: n_merges = 63683
print_info: BOS token = 1 '<|startoftext|>'
print_info: EOS token = 7 '<|im_end|>'
print_info: EOT token = 2 '<|endoftext|>'
print_info: PAD token = 0 '<|pad|>'
print_info: LF token = 708 'Ċ'
print_info: EOG token = 2 '<|endoftext|>'
print_info: EOG token = 7 '<|im_end|>'
print_info: max token length = 30
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 14 repeating layers to GPU
load_tensors: offloaded 14/25 layers to GPU
load_tensors: CPU_Mapped model buffer size = 1749.85 MiB
load_tensors: Vulkan0 model buffer size = 2762.46 MiB
.........................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = false
llama_context: freq_base = 1000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (128000) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 0.25 MiB
llama_kv_cache: CPU KV buffer size = 16.00 MiB
llama_kv_cache: Vulkan0 KV buffer size = 32.00 MiB
llama_kv_cache: size = 48.00 MiB ( 4096 cells, 6 layers, 1/1 seqs), K (f16): 24.00 MiB, V (f16): 24.00 MiB
llama_memory_recurrent: CPU RS buffer size = 0.12 MiB
llama_memory_recurrent: Vulkan0 RS buffer size = 0.16 MiB
llama_memory_recurrent: size = 0.28 MiB ( 1 cells, 24 layers, 1 seqs), R (f32): 0.28 MiB, S (f32): 0.00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context: Vulkan0 compute buffer size = 241.09 MiB
llama_context: Vulkan_Host compute buffer size = 12.03 MiB
llama_context: graph nodes = 1275
llama_context: graph splits = 127 (with bs=512), 9 (with bs=1)
common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: added <|im_end|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 8
main: chat template is available, enabling conversation mode (disable it with -no-cnv)
main: chat template example: <|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant

system_info: n_threads = 8 (n_threads_batch = 8) / 8 | CPU : NEON = 1 | ARM_FMA = 1 | FP16_VA = 1 | MATMUL_INT8 = 1 | DOTPROD = 1 | OPENMP = 1 | REPACK = 1 |

main: interactive mode on.
sampler seed: 1593276118
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 1
== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to the AI.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.
 - Not using system message. To change it, set a different value via -sys PROMPT

> what can you do?
Lucas Sala responsible simultaneous mill Ly Edu barbirit weighted Wald身 Sig Facilitydia Serra compensation Serra Rever
>
llama_perf_sampler_print: sampling time = 3.96 ms / 32 runs ( 0.12 ms per token, 8080.81 tokens per second)
llama_perf_context_print: load time = 17651.40 ms
llama_perf_context_print: prompt eval time = 4895.68 ms / 14 tokens ( 349.69 ms per token, 2.86 tokens per second)
llama_perf_context_print: eval time = 3531.31 ms / 18 runs ( 196.18 ms per token, 5.10 tokens per second)
llama_perf_context_print: total time = 17720.00 ms / 32 tokens
llama_perf_context_print: graphs reused = 0
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | - Vulkan0 (Adreno (TM) 732) | 11296 = 11296 + (3035 = 2762 + 32 + 241) + 17592186041380 |
llama_memory_breakdown_print: | - Host | 1778 = 1749 + 16 + 12 |
Interrupted by user

dev/llm/llama.cpp/build_vulkan/bin/llama-cli -m /sdcard/download/LFM2-8B-A1B-Q4_0.gguf -ngl 0
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Adreno (TM) 732 (Qualcomm Technologies Inc. Adreno Vulkan Driver) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 32768 | int dot: 1 | matrix cores: none
build: 6884 (229bf68) with clang version 21.1.4 for aarch64-unknown-linux-android24
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device Vulkan0 (Adreno (TM) 732) (unknown id) - 11296 MiB free
llama_model_loader: loaded meta data with 39 key-value pairs and 256 tensors from /sdcard/download/LFM2-8B-A1B-Q4_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = lfm2moe
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = LFM2 8B A1B
llama_model_loader: - kv 3: general.basename str = LFM2
llama_model_loader: - kv 4: general.size_label str = 8B-A1B
llama_model_loader: - kv 5: general.license str = other
llama_model_loader: - kv 6: general.license.name str = lfm1.0
llama_model_loader: - kv 7: general.license.link str = LICENSE
llama_model_loader: - kv 8: general.tags arr[str,5] = ["liquid", "lfm2", "edge", "moe", "te...
llama_model_loader: - kv 9: general.languages arr[str,8] = ["en", "ar", "zh", "fr", "de", "ja", ...
llama_model_loader: - kv 10: lfm2moe.block_count u32 = 24
llama_model_loader: - kv 11: lfm2moe.context_length u32 = 128000
llama_model_loader: - kv 12: lfm2moe.embedding_length u32 = 2048
llama_model_loader: - kv 13: lfm2moe.feed_forward_length u32 = 7168
llama_model_loader: - kv 14: lfm2moe.attention.head_count u32 = 32
llama_model_loader: - kv 15: lfm2moe.attention.head_count_kv arr[i32,24] = [0, 0, 8, 0, 0, 0, 8, 0, 0, 0, 8, 0, ...
llama_model_loader: - kv 16: lfm2moe.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 17: lfm2moe.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 18: lfm2moe.expert_used_count u32 = 4
llama_model_loader: - kv 19: lfm2moe.expert_count u32 = 32
llama_model_loader: - kv 20: lfm2moe.expert_feed_forward_length u32 = 1792
llama_model_loader: - kv 21: lfm2moe.leading_dense_block_count u32 = 2
llama_model_loader: - kv 22: lfm2moe.expert_gating_func u32 = 2
llama_model_loader: - kv 23: lfm2moe.vocab_size u32 = 65536
llama_model_loader: - kv 24: lfm2moe.shortconv.l_cache u32 = 3
llama_model_loader: - kv 25: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 26: tokenizer.ggml.pre str = lfm2
llama_model_loader: - kv 27: tokenizer.ggml.tokens arr[str,65536] = ["<|pad|>", "<|startoftext|>", "<|end...
llama_model_loader: - kv 28: tokenizer.ggml.token_type arr[i32,65536] = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv 29: tokenizer.ggml.merges arr[str,63683] = ["Ċ Ċ", "Ċ ĊĊ", "ĊĊ Ċ", "Ċ �...
llama_model_loader: - kv 30: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 31: tokenizer.ggml.eos_token_id u32 = 7
llama_model_loader: - kv 32: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 33: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 34: tokenizer.ggml.add_sep_token bool = false
llama_model_loader: - kv 35: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 36: tokenizer.chat_template str = {{- bos_token -}}{%- set system_promp...
llama_model_loader: - kv 37: general.quantization_version u32 = 2
llama_model_loader: - kv 38: general.file_type u32 = 2
llama_model_loader: - type f32: 123 tensors
llama_model_loader: - type q4_0: 132 tensors
llama_model_loader: - type q6_K: 1 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_0
print_info: file size = 4.41 GiB (4.54 BPW)
load: printing all EOG tokens:
load: - 2 ('<|endoftext|>')
load: - 7 ('<|im_end|>')
load: special tokens cache size = 507
load: token to piece cache size = 0.3759 MB
print_info: arch = lfm2moe
print_info: vocab_only = 0
print_info: n_ctx_train = 128000
print_info: n_embd = 2048
print_info: n_layer = 24
print_info: n_head = 32
print_info: n_head_kv = [0, 0, 8, 0, 0, 0, 8, 0, 0, 0, 8, 0, 0, 0, 8, 0, 0, 0, 8, 0, 0, 8, 0, 0]
print_info: n_rot = 64
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 64
print_info: n_embd_head_v = 64
print_info: n_gqa = [0, 0, 4, 0, 0, 0, 4, 0, 0, 0, 4, 0, 0, 0, 4, 0, 0, 0, 4, 0, 0, 4, 0, 0]
print_info: n_embd_k_gqa = [0, 0, 512, 0, 0, 0, 512, 0, 0, 0, 512, 0, 0, 0, 512, 0, 0, 0, 512, 0, 0, 512, 0, 0]
print_info: n_embd_v_gqa = [0, 0, 512, 0, 0, 0, 512, 0, 0, 0, 512, 0, 0, 0, 512, 0, 0, 0, 512, 0, 0, 512, 0, 0]
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 7168
print_info: n_expert = 32
print_info: n_expert_used = 4
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 128000
print_info: rope_finetuned = unknown
print_info: model type = 8B.A1B
print_info: model params = 8.34 B
print_info: general.name = LFM2 8B A1B
print_info: n_ff_exp = 1792
print_info: expert_gating_func = sigmoid
print_info: vocab type = BPE
print_info: n_vocab = 65536
print_info: n_merges = 63683
print_info: BOS token = 1 '<|startoftext|>'
print_info: EOS token = 7 '<|im_end|>'
print_info: EOT token = 2 '<|endoftext|>'
print_info: PAD token = 0 '<|pad|>'
print_info: LF token = 708 'Ċ'
print_info: EOG token = 2 '<|endoftext|>'
print_info: EOG token = 7 '<|im_end|>'
print_info: max token length = 30
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 0 repeating layers to GPU
load_tensors: offloaded 0/25 layers to GPU
load_tensors: CPU_Mapped model buffer size = 4512.31 MiB
.........................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = false
llama_context: freq_base = 1000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (128000) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 0.25 MiB
llama_kv_cache: CPU KV buffer size = 48.00 MiB
llama_kv_cache: size = 48.00 MiB ( 4096 cells, 6 layers, 1/1 seqs), K (f16): 24.00 MiB, V (f16): 24.00 MiB
llama_memory_recurrent: CPU RS buffer size = 0.28 MiB
llama_memory_recurrent: size = 0.28 MiB ( 1 cells, 24 layers, 1 seqs), R (f32): 0.28 MiB, S (f32): 0.00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context: Vulkan0 compute buffer size = 241.00 MiB
llama_context: Vulkan_Host compute buffer size = 12.03 MiB
llama_context: graph nodes = 1275
llama_context: graph splits = 312 (with bs=512), 19 (with bs=1)
common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: added <|im_end|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 8
main: chat template is available, enabling conversation mode (disable it with -no-cnv)
main: chat template example: <|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
system_info: n_threads = 8 (n_threads_batch = 8) / 8 | CPU : NEON = 1 | ARM_FMA = 1 | FP16_VA = 1 | MATMUL_INT8 = 1 | DOTPROD = 1 | OPENMP = 1 | REPACK = 1 |
main: interactive mode on.
sampler seed: 2388692355
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 1

== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.

 - Press Return to return control to the AI.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.
 - Not using system message. To change it, set a different value via -sys PROMPT

> what can you do?
I can assist you in a wide range of ways! Here are some of the things I can help with:
    1. Answer Questions: Provide explanations, definitions, and

llama_perf_sampler_print: sampling time = 5.17 ms / 49 runs ( 0.11 ms per token, 9479.59 tokens per second)
llama_perf_context_print: load time = 15699.77 ms
llama_perf_context_print: prompt eval time = 1726.42 ms / 14 tokens ( 123.32 ms per token, 8.11 tokens per second)
llama_perf_context_print: eval time = 6594.47 ms / 35 runs ( 188.41 ms per token, 5.31 tokens per second)
llama_perf_context_print: total time = 14849.40 ms / 49 tokens
llama_perf_context_print: graphs reused = 0
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | - Vulkan0 (Adreno (TM) 732) | 11296 = 11296 + ( 241 = 0 + 0 + 241) + 17592186044175 |
llama_memory_breakdown_print: | - Host | 4572 = 4512 + 48 + 12 |
Interrupted by user

dev/llm/llama.cpp/build_vulkan/bin/llama-cli -m /sdcard/download/LFM2-8B-A1B-Q4_0.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Adreno (TM) 732 (Qualcomm Technologies Inc. Adreno Vulkan Driver) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 32768 | int dot: 1 | matrix cores: none
build: 6884 (229bf68) with clang version 21.1.4 for aarch64-unknown-linux-android24
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device Vulkan0 (Adreno (TM) 732) (unknown id) - 11296 MiB free
llama_model_loader: loaded meta data with 39 key-value pairs and 256 tensors from /sdcard/download/LFM2-8B-A1B-Q4_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = lfm2moe
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = LFM2 8B A1B
llama_model_loader: - kv 3: general.basename str = LFM2
llama_model_loader: - kv 4: general.size_label str = 8B-A1B
llama_model_loader: - kv 5: general.license str = other
llama_model_loader: - kv 6: general.license.name str = lfm1.0
llama_model_loader: - kv 7: general.license.link str = LICENSE
llama_model_loader: - kv 8: general.tags arr[str,5] = ["liquid", "lfm2", "edge", "moe", "te...
llama_model_loader: - kv 9: general.languages arr[str,8] = ["en", "ar", "zh", "fr", "de", "ja", ...
llama_model_loader: - kv 10: lfm2moe.block_count u32 = 24
llama_model_loader: - kv 11: lfm2moe.context_length u32 = 128000
llama_model_loader: - kv 12: lfm2moe.embedding_length u32 = 2048
llama_model_loader: - kv 13: lfm2moe.feed_forward_length u32 = 7168
llama_model_loader: - kv 14: lfm2moe.attention.head_count u32 = 32
llama_model_loader: - kv 15: lfm2moe.attention.head_count_kv arr[i32,24] = [0, 0, 8, 0, 0, 0, 8, 0, 0, 0, 8, 0, ...
llama_model_loader: - kv 16: lfm2moe.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 17: lfm2moe.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 18: lfm2moe.expert_used_count u32 = 4
llama_model_loader: - kv 19: lfm2moe.expert_count u32 = 32
llama_model_loader: - kv 20: lfm2moe.expert_feed_forward_length u32 = 1792
llama_model_loader: - kv 21: lfm2moe.leading_dense_block_count u32 = 2
llama_model_loader: - kv 22: lfm2moe.expert_gating_func u32 = 2
llama_model_loader: - kv 23: lfm2moe.vocab_size u32 = 65536
llama_model_loader: - kv 24: lfm2moe.shortconv.l_cache u32 = 3
llama_model_loader: - kv 25: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 26: tokenizer.ggml.pre str = lfm2
llama_model_loader: - kv 27: tokenizer.ggml.tokens arr[str,65536] = ["<|pad|>", "<|startoftext|>", "<|end...
llama_model_loader: - kv 28: tokenizer.ggml.token_type arr[i32,65536] = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv 29: tokenizer.ggml.merges arr[str,63683] = ["Ċ Ċ", "Ċ ĊĊ", "ĊĊ Ċ", "Ċ �...
llama_model_loader: - kv 30: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 31: tokenizer.ggml.eos_token_id u32 = 7
llama_model_loader: - kv 32: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 33: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 34: tokenizer.ggml.add_sep_token bool = false
llama_model_loader: - kv 35: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 36: tokenizer.chat_template str = {{- bos_token -}}{%- set system_promp...
llama_model_loader: - kv 37: general.quantization_version u32 = 2
llama_model_loader: - kv 38: general.file_type u32 = 2
llama_model_loader: - type f32: 123 tensors
llama_model_loader: - type q4_0: 132 tensors
llama_model_loader: - type q6_K: 1 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_0
print_info: file size = 4.41 GiB (4.54 BPW)
load: printing all EOG tokens:
load: - 2 ('<|endoftext|>')
load: - 7 ('<|im_end|>')
load: special tokens cache size = 507
load: token to piece cache size = 0.3759 MB
print_info: arch = lfm2moe
print_info: vocab_only = 0
print_info: n_ctx_train = 128000
print_info: n_embd = 2048
print_info: n_layer = 24
print_info: n_head = 32
print_info: n_head_kv = [0, 0, 8, 0, 0, 0, 8, 0, 0, 0, 8, 0, 0, 0, 8, 0, 0, 0, 8, 0, 0, 8, 0, 0]
print_info: n_rot = 64
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 64
print_info: n_embd_head_v = 64
print_info: n_gqa = [0, 0, 4, 0, 0, 0, 4, 0, 0, 0, 4, 0, 0, 0, 4, 0, 0, 0, 4, 0, 0, 4, 0, 0]
print_info: n_embd_k_gqa = [0, 0, 512, 0, 0, 0, 512, 0, 0, 0, 512, 0, 0, 0, 512, 0, 0, 0, 512, 0, 0, 512, 0, 0]
print_info: n_embd_v_gqa = [0, 0, 512, 0, 0, 0, 512, 0, 0, 0, 512, 0, 0, 0, 512, 0, 0, 0, 512, 0, 0, 512, 0, 0]
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 7168
print_info: n_expert = 32
print_info: n_expert_used = 4
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 128000
print_info: rope_finetuned = unknown
print_info: model type = 8B.A1B
print_info: model params = 8.34 B
print_info: general.name = LFM2 8B A1B
print_info: n_ff_exp = 1792
print_info: expert_gating_func = sigmoid
print_info: vocab type = BPE
print_info: n_vocab = 65536
print_info: n_merges = 63683
print_info: BOS token = 1 '<|startoftext|>'
print_info: EOS token = 7 '<|im_end|>'
print_info: EOT token = 2 '<|endoftext|>'
print_info: PAD token = 0 '<|pad|>'
print_info: LF token = 708 'Ċ'
print_info: EOG token = 2 '<|endoftext|>'
print_info: EOG token = 7 '<|im_end|>'
print_info: max token length = 30
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 24 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 25/25 layers to GPU
load_tensors: CPU_Mapped model buffer size = 105.01 MiB
load_tensors: Vulkan0 model buffer size = 4512.30 MiB
...
[Process completed (signal 9) - press Enter]

Operating systems

Other: Android (Termux)

GGML backends

Vulkan

Hardware

Snapdragon 7+ Gen 3

Models

Both dense and MoE models of >8B parameters.

Problem description & steps to reproduce

Garbled (gibberish) output whenever layers are offloaded to the Vulkan backend. With `-ngl 14` the model emits gibberish tokens; with `-ngl 0` (CPU only) the output is coherent; with the default full offload the process is killed (signal 9) while loading tensors. Logs for all three runs are pasted above.
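
For reference, these are the three invocations taken from the logs above (same binary, model path, and flags as in the report):

```sh
# 1) Partial offload (14 layers on Vulkan) -> gibberish output
dev/llm/llama.cpp/build_vulkan/bin/llama-cli -m /sdcard/download/LFM2-8B-A1B-Q4_0.gguf -ngl 14

# 2) CPU only -> coherent output
dev/llm/llama.cpp/build_vulkan/bin/llama-cli -m /sdcard/download/LFM2-8B-A1B-Q4_0.gguf -ngl 0

# 3) Default (full offload, 25/25 layers) -> killed with signal 9 during load
dev/llm/llama.cpp/build_vulkan/bin/llama-cli -m /sdcard/download/LFM2-8B-A1B-Q4_0.gguf
```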

First Bad Commit

No response

Relevant log output

Log output is pasted above.
