
Bug: llama.cpp fails to run a quantized model with Vulkan in Android Termux #10406

@linxhome

Description


What happened?

  1. llama.cpp fails to run a quantized model with the Vulkan backend in Android Termux.
  2. It runs fine in CPU mode with both the quantized model and the fp16 model.
  3. But as soon as the number of GPU layers is set to a non-zero value, the quantized model crashes with the error "libc++abi: terminating due to uncaught exception of type vk::DeviceLostError: vk::Device::waitForFences: ErrorDeviceLost" (see the reproduction commands below).
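For reference, the pair of invocations below should reproduce both behaviours. This is a sketch assuming the same model file and build as in the log further down; the only difference is `-ngl` (`0` keeps all layers on the CPU, `99` offloads everything to the Vulkan device):

```sh
# CPU-only: runs fine
llama.cpp/android-build/bin/llama-cli \
  -m storage/shared/tmp/llama-cpp/qwen2.5-coder-0.5b-instruct-q8_0.gguf \
  -p "give me a story about 200 words" -n 200 -ngl 0 \
  --batch-size 16 --ctx-size 1024

# All 25 layers offloaded to the Mali-G715 via Vulkan: aborts with vk::DeviceLostError
llama.cpp/android-build/bin/llama-cli \
  -m storage/shared/tmp/llama-cpp/qwen2.5-coder-0.5b-instruct-q8_0.gguf \
  -p "give me a story about 200 words" -n 200 -ngl 99 \
  --batch-size 16 --ctx-size 1024
```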

Name and Version

ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Mali-G715 (Mali-G715) | uma: 1 | fp16: 1 | warp size: 16
version: 4122 (9b75f03)
built with clang version 19.1.3 for aarch64-unknown-linux-android29
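For context, a Vulkan-enabled binary like the one above is normally produced with the standard CMake flow. The exact flags behind this `android-build` directory are not stated in the report, so the following is only a plausible reconstruction:

```sh
# Assumed build commands (not from the report); GGML_VULKAN=ON enables the Vulkan backend
cmake -B android-build -DGGML_VULKAN=ON
cmake --build android-build --config Release -j
```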

What operating system are you seeing the problem on?

No response

Relevant log output

$ llama.cpp/android-build/bin/llama-cli -m storage/shared/tmp/llama-cpp/qwen2.5-coder-0.5b-instruct-q8_0.gguf -p "give me asstory about 200 words" -n 200 -ngl 99 --batch-size 16 --ctx-size 1024 --verbose-prompt
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Mali-G715 (Mali-G715) | uma: 1 | fp16: 1 | warp size: 16
build: 4122 (9b75f03c) with clang version 19.1.3 for aarch64-unknown-linux-android29
main: llama backend init
main: load the model and apply lora adapter, if any
llama_load_model_from_file: using device Vulkan0 (Mali-G715) - 15233 MiB free
llama_model_loader: loaded meta data with 26 key-value pairs and 291 tensors from storage/shared/tmp/llama-cpp/qwen2.5-coder-0.5b-instruct-q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen2.5 Coder 0.5B Instruct AWQ
llama_model_loader: - kv   3:                           general.finetune str              = Instruct-AWQ
llama_model_loader: - kv   4:                           general.basename str              = Qwen2.5-Coder
llama_model_loader: - kv   5:                         general.size_label str              = 0.5B
llama_model_loader: - kv   6:                          qwen2.block_count u32              = 24
llama_model_loader: - kv   7:                       qwen2.context_length u32              = 32768
llama_model_loader: - kv   8:                     qwen2.embedding_length u32              = 896
llama_model_loader: - kv   9:                  qwen2.feed_forward_length u32              = 4864
llama_model_loader: - kv  10:                 qwen2.attention.head_count u32              = 14
llama_model_loader: - kv  11:              qwen2.attention.head_count_kv u32              = 2
llama_model_loader: - kv  12:                       qwen2.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  13:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  14:                          general.file_type u32              = 7
llama_model_loader: - kv  15:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  16:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  17:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  18:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  19:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  20:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  22:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  23:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  24:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  25:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  121 tensors
llama_model_loader: - type q8_0:  170 tensors
llm_load_vocab: special tokens cache size = 22
llm_load_vocab: token to piece cache size = 0.9310 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 151936
llm_load_print_meta: n_merges         = 151387
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 896
llm_load_print_meta: n_layer          = 24
llm_load_print_meta: n_head           = 14
llm_load_print_meta: n_head_kv        = 2
llm_load_print_meta: n_rot            = 64
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 64
llm_load_print_meta: n_embd_head_v    = 64
llm_load_print_meta: n_gqa            = 7
llm_load_print_meta: n_embd_k_gqa     = 128
llm_load_print_meta: n_embd_v_gqa     = 128
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 4864
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 1B
llm_load_print_meta: model ftype      = Q8_0
llm_load_print_meta: model params     = 630.17 M
llm_load_print_meta: model size       = 638.74 MiB (8.50 BPW)
llm_load_print_meta: general.name     = Qwen2.5 Coder 0.5B Instruct AWQ
llm_load_print_meta: BOS token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token        = 151645 '<|im_end|>'
llm_load_print_meta: EOT token        = 151645 '<|im_end|>'
llm_load_print_meta: PAD token        = 151643 '<|endoftext|>'
llm_load_print_meta: LF token         = 148848 'ÄĬ'
llm_load_print_meta: FIM PRE token    = 151659 '<|fim_prefix|>'
llm_load_print_meta: FIM SUF token    = 151661 '<|fim_suffix|>'
llm_load_print_meta: FIM MID token    = 151660 '<|fim_middle|>'
llm_load_print_meta: FIM PAD token    = 151662 '<|fim_pad|>'
llm_load_print_meta: FIM REP token    = 151663 '<|repo_name|>'
llm_load_print_meta: FIM SEP token    = 151664 '<|file_sep|>'
llm_load_print_meta: EOG token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOG token        = 151645 '<|im_end|>'
llm_load_print_meta: EOG token        = 151662 '<|fim_pad|>'
llm_load_print_meta: EOG token        = 151663 '<|repo_name|>'
llm_load_print_meta: EOG token        = 151664 '<|file_sep|>'
llm_load_print_meta: max token length = 256
ggml_vulkan: Compiling shaders..............................Done!
llm_load_tensors: offloading 24 repeating layers to GPU
llm_load_tensors: offloading output layer to GPU
llm_load_tensors: offloaded 25/25 layers to GPU
llm_load_tensors:   CPU_Mapped model buffer size =   137.94 MiB
llm_load_tensors:      Vulkan0 model buffer size =   500.79 MiB
...........................................................
llama_new_context_with_model: n_batch is less than GGML_KQ_MASK_PAD - increasing to 32
llama_new_context_with_model: n_seq_max     = 1
llama_new_context_with_model: n_ctx         = 1024
llama_new_context_with_model: n_ctx_per_seq = 1024
llama_new_context_with_model: n_batch       = 32
llama_new_context_with_model: n_ubatch      = 32
llama_new_context_with_model: flash_attn    = 0
llama_new_context_with_model: freq_base     = 1000000.0
llama_new_context_with_model: freq_scale    = 1
llama_new_context_with_model: n_ctx_per_seq (1024) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
llama_kv_cache_init:    Vulkan0 KV buffer size =    12.00 MiB
llama_new_context_with_model: KV self size  =   12.00 MiB, K (f16):    6.00 MiB, V (f16):    6.00 MiB
llama_new_context_with_model: Vulkan_Host  output buffer size =     0.58 MiB
llama_new_context_with_model:    Vulkan0 compute buffer size =    18.66 MiB
llama_new_context_with_model: Vulkan_Host compute buffer size =     0.23 MiB
llama_new_context_with_model: graph nodes  = 846
llama_new_context_with_model: graph splits = 2
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
libc++abi: terminating due to uncaught exception of type vk::DeviceLostError: vk::Device::waitForFences: ErrorDeviceLost
Aborted
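
The abort happens during the warmup decode that common_init_from_params runs before the real prompt. Two narrowing steps suggested by the log itself (both flags appear in the output above; whether either avoids the Mali-G715 device loss has not been verified here):

```sh
# Skip the warmup run that hits the failing fence wait
llama.cpp/android-build/bin/llama-cli -m qwen2.5-coder-0.5b-instruct-q8_0.gguf \
  -ngl 99 --no-warmup -p "test" -n 16

# Offload only part of the model to see whether partial offload survives
llama.cpp/android-build/bin/llama-cli -m qwen2.5-coder-0.5b-instruct-q8_0.gguf \
  -ngl 8 -p "test" -n 16
```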

    Labels: bug-unconfirmed, medium severity, stale
