Description
Name and Version
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Adreno (TM) 732 (Qualcomm Technologies Inc. Adreno Vulkan Driver) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 32768 | int dot: 1 | matrix cores: none
version: 6884 (229bf68)
built with clang version 21.1.4 for aarch64-unknown-linux-android24
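For completeness: the exact configure command is not part of the log, but given the clang 21.1.4 / android24 triple above, a native on-device build (e.g. under Termux) along these lines would be a plausible match; -DGGML_VULKAN=ON is llama.cpp's standard Vulkan switch and the build directory name matches the binary path below. This is an assumption, not a confirmed build recipe:

# assumed build, run on the device itself (e.g. in Termux)
cmake -B build_vulkan -DGGML_VULKAN=ON
cmake --build build_vulkan --config Release -j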
dev/llm/llama.cpp/build_vulkan/bin/llama-cli -m /sdcard/download/LFM2-8B-A1B-Q4_0.gguf -ngl 14
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Adreno (TM) 732 (Qualcomm Technologies Inc. Adreno Vulkan Driver) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 32768 | int dot: 1 | matrix cores: none
build: 6884 (229bf68) with clang version 21.1.4 for aarch64-unknown-linux-android24
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device Vulkan0 (Adreno (TM) 732) (unknown id) - 11296 MiB free
llama_model_loader: loaded meta data with 39 key-value pairs and 256 tensors from /sdcard/download/LFM2-8B-A1B-Q4_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = lfm2moe
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = LFM2 8B A1B
llama_model_loader: - kv 3: general.basename str = LFM2
llama_model_loader: - kv 4: general.size_label str = 8B-A1B
llama_model_loader: - kv 5: general.license str = other
llama_model_loader: - kv 6: general.license.name str = lfm1.0
llama_model_loader: - kv 7: general.license.link str = LICENSE
llama_model_loader: - kv 8: general.tags arr[str,5] = ["liquid", "lfm2", "edge", "moe", "te...
llama_model_loader: - kv 9: general.languages arr[str,8] = ["en", "ar", "zh", "fr", "de", "ja", ...
llama_model_loader: - kv 10: lfm2moe.block_count u32 = 24
llama_model_loader: - kv 11: lfm2moe.context_length u32 = 128000
llama_model_loader: - kv 12: lfm2moe.embedding_length u32 = 2048
llama_model_loader: - kv 13: lfm2moe.feed_forward_length u32 = 7168
llama_model_loader: - kv 14: lfm2moe.attention.head_count u32 = 32
llama_model_loader: - kv 15: lfm2moe.attention.head_count_kv arr[i32,24] = [0, 0, 8, 0, 0, 0, 8, 0, 0, 0, 8, 0, ...
llama_model_loader: - kv 16: lfm2moe.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 17: lfm2moe.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 18: lfm2moe.expert_used_count u32 = 4
llama_model_loader: - kv 19: lfm2moe.expert_count u32 = 32
llama_model_loader: - kv 20: lfm2moe.expert_feed_forward_length u32 = 1792
llama_model_loader: - kv 21: lfm2moe.leading_dense_block_count u32 = 2
llama_model_loader: - kv 22: lfm2moe.expert_gating_func u32 = 2
llama_model_loader: - kv 23: lfm2moe.vocab_size u32 = 65536
llama_model_loader: - kv 24: lfm2moe.shortconv.l_cache u32 = 3
llama_model_loader: - kv 25: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 26: tokenizer.ggml.pre str = lfm2
llama_model_loader: - kv 27: tokenizer.ggml.tokens arr[str,65536] = ["<|pad|>", "<|startoftext|>", "<|end...
llama_model_loader: - kv 28: tokenizer.ggml.token_type arr[i32,65536] = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv 29: tokenizer.ggml.merges arr[str,63683] = ["Ċ Ċ", "Ċ ĊĊ", "ĊĊ Ċ", "Ċ �...
llama_model_loader: - kv 30: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 31: tokenizer.ggml.eos_token_id u32 = 7
llama_model_loader: - kv 32: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 33: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 34: tokenizer.ggml.add_sep_token bool = false
llama_model_loader: - kv 35: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 36: tokenizer.chat_template str = {{- bos_token -}}{%- set system_promp...
llama_model_loader: - kv 37: general.quantization_version u32 = 2
llama_model_loader: - kv 38: general.file_type u32 = 2
llama_model_loader: - type f32: 123 tensors
llama_model_loader: - type q4_0: 132 tensors
llama_model_loader: - type q6_K: 1 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_0
print_info: file size = 4.41 GiB (4.54 BPW)
load: printing all EOG tokens:
load: - 2 ('<|endoftext|>')
load: - 7 ('<|im_end|>')
load: special tokens cache size = 507
load: token to piece cache size = 0.3759 MB
print_info: arch = lfm2moe
print_info: vocab_only = 0
print_info: n_ctx_train = 128000
print_info: n_embd = 2048
print_info: n_layer = 24
print_info: n_head = 32
print_info: n_head_kv = [0, 0, 8, 0, 0, 0, 8, 0, 0, 0, 8, 0, 0, 0, 8, 0, 0, 0, 8, 0, 0, 8, 0, 0]
print_info: n_rot = 64
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 64
print_info: n_embd_head_v = 64
print_info: n_gqa = [0, 0, 4, 0, 0, 0, 4, 0, 0, 0, 4, 0, 0, 0, 4, 0, 0, 0, 4, 0, 0, 4, 0, 0]
print_info: n_embd_k_gqa = [0, 0, 512, 0, 0, 0, 512, 0, 0, 0, 512, 0, 0, 0, 512, 0, 0, 0, 512, 0, 0, 512, 0, 0]
print_info: n_embd_v_gqa = [0, 0, 512, 0, 0, 0, 512, 0, 0, 0, 512, 0, 0, 0, 512, 0, 0, 0, 512, 0, 0, 512, 0, 0]
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 7168
print_info: n_expert = 32
print_info: n_expert_used = 4
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 128000
print_info: rope_finetuned = unknown
print_info: model type = 8B.A1B
print_info: model params = 8.34 B
print_info: general.name = LFM2 8B A1B
print_info: n_ff_exp = 1792
print_info: expert_gating_func = sigmoid
print_info: vocab type = BPE
print_info: n_vocab = 65536
print_info: n_merges = 63683
print_info: BOS token = 1 '<|startoftext|>'
print_info: EOS token = 7 '<|im_end|>'
print_info: EOT token = 2 '<|endoftext|>'
print_info: PAD token = 0 '<|pad|>'
print_info: LF token = 708 'Ċ'
print_info: EOG token = 2 '<|endoftext|>'
print_info: EOG token = 7 '<|im_end|>'
print_info: max token length = 30
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 14 repeating layers to GPU
load_tensors: offloaded 14/25 layers to GPU
load_tensors: CPU_Mapped model buffer size = 1749.85 MiB
load_tensors: Vulkan0 model buffer size = 2762.46 MiB
.........................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = false
llama_context: freq_base = 1000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (128000) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 0.25 MiB
llama_kv_cache: CPU KV buffer size = 16.00 MiB
llama_kv_cache: Vulkan0 KV buffer size = 32.00 MiB
llama_kv_cache: size = 48.00 MiB ( 4096 cells, 6 layers, 1/1 seqs), K (f16): 24.00 MiB, V (f16): 24.00 MiB
llama_memory_recurrent: CPU RS buffer size = 0.12 MiB
llama_memory_recurrent: Vulkan0 RS buffer size = 0.16 MiB
llama_memory_recurrent: size = 0.28 MiB ( 1 cells, 24 layers, 1 seqs), R (f32): 0.28 MiB, S (f32): 0.00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context: Vulkan0 compute buffer size = 241.09 MiB
llama_context: Vulkan_Host compute buffer size = 12.03 MiB
llama_context: graph nodes = 1275
llama_context: graph splits = 127 (with bs=512), 9 (with bs=1)
common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: added <|im_end|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 8
main: chat template is available, enabling conversation mode (disable it with -no-cnv)
main: chat template example:
<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant

system_info: n_threads = 8 (n_threads_batch = 8) / 8 | CPU : NEON = 1 | ARM_FMA = 1 | FP16_VA = 1 | MATMUL_INT8 = 1 | DOTPROD = 1 | OPENMP = 1 | REPACK = 1 |
main: interactive mode on.
sampler seed: 1593276118
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 1
== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to the AI.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.
 - Not using system message. To change it, set a different value via -sys PROMPT

> what can you do?
Lucas Sala responsible simultaneous mill Ly Edu barbirit weighted Wald身 Sig Facilitydia Serra compensation Serra Rever
>
llama_perf_sampler_print: sampling time = 3.96 ms / 32 runs ( 0.12 ms per token, 8080.81 tokens per second)
llama_perf_context_print: load time = 17651.40 ms
llama_perf_context_print: prompt eval time = 4895.68 ms / 14 tokens ( 349.69 ms per token, 2.86 tokens per second)
llama_perf_context_print: eval time = 3531.31 ms / 18 runs ( 196.18 ms per token, 5.10 tokens per second)
llama_perf_context_print: total time = 17720.00 ms / 32 tokens
llama_perf_context_print: graphs reused = 0
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | - Vulkan0 (Adreno (TM) 732) | 11296 = 11296 + (3035 = 2762 + 32 + 241) + 17592186041380 |
llama_memory_breakdown_print: | - Host | 1778 = 1749 + 16 + 12 |
Interrupted by user

dev/llm/llama.cpp/build_vulkan/bin/llama-cli -m /sdcard/download/LFM2-8B-A1B-Q4_0.gguf -ngl 0
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Adreno (TM) 732 (Qualcomm Technologies Inc. Adreno Vulkan Driver) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 32768 | int dot: 1 | matrix cores: none
build: 6884 (229bf68) with clang version 21.1.4 for aarch64-unknown-linux-android24
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device Vulkan0 (Adreno (TM) 732) (unknown id) - 11296 MiB free
llama_model_loader: loaded meta data with 39 key-value pairs and 256 tensors from /sdcard/download/LFM2-8B-A1B-Q4_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = lfm2moe
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = LFM2 8B A1B
llama_model_loader: - kv 3: general.basename str = LFM2
llama_model_loader: - kv 4: general.size_label str = 8B-A1B
llama_model_loader: - kv 5: general.license str = other
llama_model_loader: - kv 6: general.license.name str = lfm1.0
llama_model_loader: - kv 7: general.license.link str = LICENSE
llama_model_loader: - kv 8: general.tags arr[str,5] = ["liquid", "lfm2", "edge", "moe", "te...
llama_model_loader: - kv 9: general.languages arr[str,8] = ["en", "ar", "zh", "fr", "de", "ja", ...
llama_model_loader: - kv 10: lfm2moe.block_count u32 = 24
llama_model_loader: - kv 11: lfm2moe.context_length u32 = 128000
llama_model_loader: - kv 12: lfm2moe.embedding_length u32 = 2048
llama_model_loader: - kv 13: lfm2moe.feed_forward_length u32 = 7168
llama_model_loader: - kv 14: lfm2moe.attention.head_count u32 = 32
llama_model_loader: - kv 15: lfm2moe.attention.head_count_kv arr[i32,24] = [0, 0, 8, 0, 0, 0, 8, 0, 0, 0, 8, 0, ...
llama_model_loader: - kv 16: lfm2moe.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 17: lfm2moe.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 18: lfm2moe.expert_used_count u32 = 4
llama_model_loader: - kv 19: lfm2moe.expert_count u32 = 32
llama_model_loader: - kv 20: lfm2moe.expert_feed_forward_length u32 = 1792
llama_model_loader: - kv 21: lfm2moe.leading_dense_block_count u32 = 2
llama_model_loader: - kv 22: lfm2moe.expert_gating_func u32 = 2
llama_model_loader: - kv 23: lfm2moe.vocab_size u32 = 65536
llama_model_loader: - kv 24: lfm2moe.shortconv.l_cache u32 = 3
llama_model_loader: - kv 25: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 26: tokenizer.ggml.pre str = lfm2
llama_model_loader: - kv 27: tokenizer.ggml.tokens arr[str,65536] = ["<|pad|>", "<|startoftext|>", "<|end...
llama_model_loader: - kv 28: tokenizer.ggml.token_type arr[i32,65536] = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv 29: tokenizer.ggml.merges arr[str,63683] = ["Ċ Ċ", "Ċ ĊĊ", "ĊĊ Ċ", "Ċ �...
llama_model_loader: - kv 30: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 31: tokenizer.ggml.eos_token_id u32 = 7
llama_model_loader: - kv 32: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 33: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 34: tokenizer.ggml.add_sep_token bool = false
llama_model_loader: - kv 35: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 36: tokenizer.chat_template str = {{- bos_token -}}{%- set system_promp...
llama_model_loader: - kv 37: general.quantization_version u32 = 2
llama_model_loader: - kv 38: general.file_type u32 = 2
llama_model_loader: - type f32: 123 tensors
llama_model_loader: - type q4_0: 132 tensors
llama_model_loader: - type q6_K: 1 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_0
print_info: file size = 4.41 GiB (4.54 BPW)
load: printing all EOG tokens:
load: - 2 ('<|endoftext|>')
load: - 7 ('<|im_end|>')
load: special tokens cache size = 507
load: token to piece cache size = 0.3759 MB
print_info: arch = lfm2moe
print_info: vocab_only = 0
print_info: n_ctx_train = 128000
print_info: n_embd = 2048
print_info: n_layer = 24
print_info: n_head = 32
print_info: n_head_kv = [0, 0, 8, 0, 0, 0, 8, 0, 0, 0, 8, 0, 0, 0, 8, 0, 0, 0, 8, 0, 0, 8, 0, 0]
print_info: n_rot = 64
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 64
print_info: n_embd_head_v = 64
print_info: n_gqa = [0, 0, 4, 0, 0, 0, 4, 0, 0, 0, 4, 0, 0, 0, 4, 0, 0, 0, 4, 0, 0, 4, 0, 0]
print_info: n_embd_k_gqa = [0, 0, 512, 0, 0, 0, 512, 0, 0, 0, 512, 0, 0, 0, 512, 0, 0, 0, 512, 0, 0, 512, 0, 0]
print_info: n_embd_v_gqa = [0, 0, 512, 0, 0, 0, 512, 0, 0, 0, 512, 0, 0, 0, 512, 0, 0, 0, 512, 0, 0, 512, 0, 0]
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 7168
print_info: n_expert = 32
print_info: n_expert_used = 4
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 128000
print_info: rope_finetuned = unknown
print_info: model type = 8B.A1B
print_info: model params = 8.34 B
print_info: general.name = LFM2 8B A1B
print_info: n_ff_exp = 1792
print_info: expert_gating_func = sigmoid
print_info: vocab type = BPE
print_info: n_vocab = 65536
print_info: n_merges = 63683
print_info: BOS token = 1 '<|startoftext|>'
print_info: EOS token = 7 '<|im_end|>'
print_info: EOT token = 2 '<|endoftext|>'
print_info: PAD token = 0 '<|pad|>'
print_info: LF token = 708 'Ċ'
print_info: EOG token = 2 '<|endoftext|>'
print_info: EOG token = 7 '<|im_end|>'
print_info: max token length = 30
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 0 repeating layers to GPU
load_tensors: offloaded 0/25 layers to GPU
load_tensors: CPU_Mapped model buffer size = 4512.31 MiB
.........................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = false
llama_context: freq_base = 1000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (128000) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 0.25 MiB
llama_kv_cache: CPU KV buffer size = 48.00 MiB
llama_kv_cache: size = 48.00 MiB ( 4096 cells, 6 layers, 1/1 seqs), K (f16): 24.00 MiB, V (f16): 24.00 MiB
llama_memory_recurrent: CPU RS buffer size = 0.28 MiB
llama_memory_recurrent: size = 0.28 MiB ( 1 cells, 24 layers, 1 seqs), R (f32): 0.28 MiB, S (f32): 0.00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context: Vulkan0 compute buffer size = 241.00 MiB
llama_context: Vulkan_Host compute buffer size = 12.03 MiB
llama_context: graph nodes = 1275
llama_context: graph splits = 312 (with bs=512), 19 (with bs=1)
common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: added <|im_end|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 8
main: chat template is available, enabling conversation mode (disable it with -no-cnv)
main: chat template example:
<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant

system_info: n_threads = 8 (n_threads_batch = 8) / 8 | CPU : NEON = 1 | ARM_FMA = 1 | FP16_VA = 1 | MATMUL_INT8 = 1 | DOTPROD = 1 | OPENMP = 1 | REPACK = 1 |
main: interactive mode on.
sampler seed: 2388692355
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 1

== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to the AI.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.
 - Not using system message. To change it, set a different value via -sys PROMPT

> what can you do?
I can assist you in a wide range of ways! Here are some of the things I can help with:
1. Answer Questions: Provide explanations, definitions, and
llama_perf_sampler_print: sampling time = 5.17 ms / 49 runs ( 0.11 ms per token, 9479.59 tokens per second)
llama_perf_context_print: load time = 15699.77 ms
llama_perf_context_print: prompt eval time = 1726.42 ms / 14 tokens ( 123.32 ms per token, 8.11 tokens per second)
llama_perf_context_print: eval time = 6594.47 ms / 35 runs ( 188.41 ms per token, 5.31 tokens per second)
llama_perf_context_print: total time = 14849.40 ms / 49 tokens
llama_perf_context_print: graphs reused = 0
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | - Vulkan0 (Adreno (TM) 732) | 11296 = 11296 + ( 241 = 0 + 0 + 241) + 17592186044175 |
llama_memory_breakdown_print: | - Host | 4572 = 4512 + 48 + 12 |
Interrupted by user

dev/llm/llama.cpp/build_vulkan/bin/llama-cli -m /sdcard/download/LFM2-8B-A1B-Q4_0.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Adreno (TM) 732 (Qualcomm Technologies Inc. Adreno Vulkan Driver) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 32768 | int dot: 1 | matrix cores: none
build: 6884 (229bf68) with clang version 21.1.4 for aarch64-unknown-linux-android24
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device Vulkan0 (Adreno (TM) 732) (unknown id) - 11296 MiB free
llama_model_loader: loaded meta data with 39 key-value pairs and 256 tensors from /sdcard/download/LFM2-8B-A1B-Q4_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = lfm2moe
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = LFM2 8B A1B
llama_model_loader: - kv 3: general.basename str = LFM2
llama_model_loader: - kv 4: general.size_label str = 8B-A1B
llama_model_loader: - kv 5: general.license str = other
llama_model_loader: - kv 6: general.license.name str = lfm1.0
llama_model_loader: - kv 7: general.license.link str = LICENSE
llama_model_loader: - kv 8: general.tags arr[str,5] = ["liquid", "lfm2", "edge", "moe", "te...
llama_model_loader: - kv 9: general.languages arr[str,8] = ["en", "ar", "zh", "fr", "de", "ja", ...
llama_model_loader: - kv 10: lfm2moe.block_count u32 = 24
llama_model_loader: - kv 11: lfm2moe.context_length u32 = 128000
llama_model_loader: - kv 12: lfm2moe.embedding_length u32 = 2048
llama_model_loader: - kv 13: lfm2moe.feed_forward_length u32 = 7168
llama_model_loader: - kv 14: lfm2moe.attention.head_count u32 = 32
llama_model_loader: - kv 15: lfm2moe.attention.head_count_kv arr[i32,24] = [0, 0, 8, 0, 0, 0, 8, 0, 0, 0, 8, 0, ...
llama_model_loader: - kv 16: lfm2moe.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 17: lfm2moe.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 18: lfm2moe.expert_used_count u32 = 4
llama_model_loader: - kv 19: lfm2moe.expert_count u32 = 32
llama_model_loader: - kv 20: lfm2moe.expert_feed_forward_length u32 = 1792
llama_model_loader: - kv 21: lfm2moe.leading_dense_block_count u32 = 2
llama_model_loader: - kv 22: lfm2moe.expert_gating_func u32 = 2
llama_model_loader: - kv 23: lfm2moe.vocab_size u32 = 65536
llama_model_loader: - kv 24: lfm2moe.shortconv.l_cache u32 = 3
llama_model_loader: - kv 25: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 26: tokenizer.ggml.pre str = lfm2
llama_model_loader: - kv 27: tokenizer.ggml.tokens arr[str,65536] = ["<|pad|>", "<|startoftext|>", "<|end...
llama_model_loader: - kv 28: tokenizer.ggml.token_type arr[i32,65536] = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv 29: tokenizer.ggml.merges arr[str,63683] = ["Ċ Ċ", "Ċ ĊĊ", "ĊĊ Ċ", "Ċ �...
llama_model_loader: - kv 30: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 31: tokenizer.ggml.eos_token_id u32 = 7
llama_model_loader: - kv 32: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 33: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 34: tokenizer.ggml.add_sep_token bool = false
llama_model_loader: - kv 35: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 36: tokenizer.chat_template str = {{- bos_token -}}{%- set system_promp...
llama_model_loader: - kv 37: general.quantization_version u32 = 2
llama_model_loader: - kv 38: general.file_type u32 = 2
llama_model_loader: - type f32: 123 tensors
llama_model_loader: - type q4_0: 132 tensors
llama_model_loader: - type q6_K: 1 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_0
print_info: file size = 4.41 GiB (4.54 BPW)
load: printing all EOG tokens:
load: - 2 ('<|endoftext|>')
load: - 7 ('<|im_end|>')
load: special tokens cache size = 507
load: token to piece cache size = 0.3759 MB
print_info: arch = lfm2moe
print_info: vocab_only = 0
print_info: n_ctx_train = 128000
print_info: n_embd = 2048
print_info: n_layer = 24
print_info: n_head = 32
print_info: n_head_kv = [0, 0, 8, 0, 0, 0, 8, 0, 0, 0, 8, 0, 0, 0, 8, 0, 0, 0, 8, 0, 0, 8, 0, 0]
print_info: n_rot = 64
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 64
print_info: n_embd_head_v = 64
print_info: n_gqa = [0, 0, 4, 0, 0, 0, 4, 0, 0, 0, 4, 0, 0, 0, 4, 0, 0, 0, 4, 0, 0, 4, 0, 0]
print_info: n_embd_k_gqa = [0, 0, 512, 0, 0, 0, 512, 0, 0, 0, 512, 0, 0, 0, 512, 0, 0, 0, 512, 0, 0, 512, 0, 0]
print_info: n_embd_v_gqa = [0, 0, 512, 0, 0, 0, 512, 0, 0, 0, 512, 0, 0, 0, 512, 0, 0, 0, 512, 0, 0, 512, 0, 0]
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 7168
print_info: n_expert = 32
print_info: n_expert_used = 4
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 128000
print_info: rope_finetuned = unknown
print_info: model type = 8B.A1B
print_info: model params = 8.34 B
print_info: general.name = LFM2 8B A1B
print_info: n_ff_exp = 1792
print_info: expert_gating_func = sigmoid
print_info: vocab type = BPE
print_info: n_vocab = 65536
print_info: n_merges = 63683
print_info: BOS token = 1 '<|startoftext|>'
print_info: EOS token = 7 '<|im_end|>'
print_info: EOT token = 2 '<|endoftext|>'
print_info: PAD token = 0 '<|pad|>'
print_info: LF token = 708 'Ċ'
print_info: EOG token = 2 '<|endoftext|>'
print_info: EOG token = 7 '<|im_end|>'
print_info: max token length = 30
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 24 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 25/25 layers to GPU
load_tensors: CPU_Mapped model buffer size = 105.01 MiB
load_tensors: Vulkan0 model buffer size = 4512.30 MiB
...
[Process completed (signal 9) - press Enter]
Operating systems
Other (Android; the binary is built for aarch64-unknown-linux-android24)
GGML backends
Vulkan
Hardware
Snapdragon 7+ Gen 3
Models
Both dense and MoE models above 8B parameters (the logs show LFM2-8B-A1B, an 8.34B-parameter MoE).
Problem description & steps to reproduce
Running LFM2-8B-A1B-Q4_0 with the Vulkan backend on the Adreno 732 produces garbled output as soon as layers are offloaded to the GPU. With -ngl 14 the model emits nonsense tokens (first run above); with -ngl 0 (CPU only) the output is coherent (second run); with full offload (default -ngl, third run) the process is killed with signal 9 during model load. Separately, the memory-breakdown line reports an implausibly large "unaccounted" value, which looks like a negative number wrapping around (e.g. 2^44 - 241 = 17592186044175, matching the -ngl 0 run).
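Steps to reproduce (the exact invocations from the logs above):

# garbled output with partial offload (-ngl 14)
dev/llm/llama.cpp/build_vulkan/bin/llama-cli -m /sdcard/download/LFM2-8B-A1B-Q4_0.gguf -ngl 14

# coherent output with CPU only (-ngl 0)
dev/llm/llama.cpp/build_vulkan/bin/llama-cli -m /sdcard/download/LFM2-8B-A1B-Q4_0.gguf -ngl 0

# killed with signal 9 during load with full offload (default -ngl)
dev/llm/llama.cpp/build_vulkan/bin/llama-cli -m /sdcard/download/LFM2-8B-A1B-Q4_0.gguf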
First Bad Commit
No response
Relevant log output
Log output is pasted above (three runs: -ngl 14, -ngl 0, and full offload).