What happened?
- llama.cpp fails with a quantized model on the Vulkan backend in Android Termux.
- It runs fine in CPU mode with both the quantized model and the fp16 model.
- But if the GPU layer count is set non-zero, the quantized model crashes with the error "libc++abi: terminating due to uncaught exception of type vk::DeviceLostError: vk::Device::waitForFences: ErrorDeviceLost". A sketch of both invocations follows below.
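For reference, a minimal sketch contrasting the two cases. The failing command is copied from the log below; the CPU-only variant is an assumed reconstruction, obtained by setting `-ngl 0`:

```sh
# CPU mode: runs fine with both the q8_0 and fp16 models (assumed invocation, -ngl 0)
llama.cpp/android-build/bin/llama-cli \
  -m storage/shared/tmp/llama-cpp/qwen2.5-coder-0.5b-instruct-q8_0.gguf \
  -p "give me a story about 200 words" -n 200 -ngl 0 \
  --batch-size 16 --ctx-size 1024

# Vulkan mode: any non-zero -ngl crashes with vk::DeviceLostError (see log below)
llama.cpp/android-build/bin/llama-cli \
  -m storage/shared/tmp/llama-cpp/qwen2.5-coder-0.5b-instruct-q8_0.gguf \
  -p "give me a story about 200 words" -n 200 -ngl 99 \
  --batch-size 16 --ctx-size 1024 --verbose-prompt
```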
Name and Version
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Mali-G715 (Mali-G715) | uma: 1 | fp16: 1 | warp size: 16
version: 4122 (9b75f03)
built with clang version 19.1.3 for aarch64-unknown-linux-android29
What operating system are you seeing the problem on?
No response
Relevant log output
$ llama.cpp/android-build/bin/llama-cli -m storage/shared/tmp/llama-cpp/qwen2.5-coder-0.5b-instruct-q8_0.gguf -p "give me asstory about 200 words" -n 200 -ngl 99 --batch-size 16 --ctx-size 1024 --verbose-prompt
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Mali-G715 (Mali-G715) | uma: 1 | fp16: 1 | warp size: 16
build: 4122 (9b75f03c) with clang version 19.1.3 for aarch64-unknown-linux-android29
main: llama backend init
main: load the model and apply lora adapter, if any
llama_load_model_from_file: using device Vulkan0 (Mali-G715) - 15233 MiB free
llama_model_loader: loaded meta data with 26 key-value pairs and 291 tensors from storage/shared/tmp/llama-cpp/qwen2.5-coder-0.5b-instruct-q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Qwen2.5 Coder 0.5B Instruct AWQ
llama_model_loader: - kv 3: general.finetune str = Instruct-AWQ
llama_model_loader: - kv 4: general.basename str = Qwen2.5-Coder
llama_model_loader: - kv 5: general.size_label str = 0.5B
llama_model_loader: - kv 6: qwen2.block_count u32 = 24
llama_model_loader: - kv 7: qwen2.context_length u32 = 32768
llama_model_loader: - kv 8: qwen2.embedding_length u32 = 896
llama_model_loader: - kv 9: qwen2.feed_forward_length u32 = 4864
llama_model_loader: - kv 10: qwen2.attention.head_count u32 = 14
llama_model_loader: - kv 11: qwen2.attention.head_count_kv u32 = 2
llama_model_loader: - kv 12: qwen2.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 13: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 14: general.file_type u32 = 7
llama_model_loader: - kv 15: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 16: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 17: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 18: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 19: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 20: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 21: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 151643
llama_model_loader: - kv 23: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 24: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
llama_model_loader: - kv 25: general.quantization_version u32 = 2
llama_model_loader: - type f32: 121 tensors
llama_model_loader: - type q8_0: 170 tensors
llm_load_vocab: special tokens cache size = 22
llm_load_vocab: token to piece cache size = 0.9310 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = qwen2
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 151936
llm_load_print_meta: n_merges = 151387
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 32768
llm_load_print_meta: n_embd = 896
llm_load_print_meta: n_layer = 24
llm_load_print_meta: n_head = 14
llm_load_print_meta: n_head_kv = 2
llm_load_print_meta: n_rot = 64
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 64
llm_load_print_meta: n_embd_head_v = 64
llm_load_print_meta: n_gqa = 7
llm_load_print_meta: n_embd_k_gqa = 128
llm_load_print_meta: n_embd_v_gqa = 128
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 4864
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 32768
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 1B
llm_load_print_meta: model ftype = Q8_0
llm_load_print_meta: model params = 630.17 M
llm_load_print_meta: model size = 638.74 MiB (8.50 BPW)
llm_load_print_meta: general.name = Qwen2.5 Coder 0.5B Instruct AWQ
llm_load_print_meta: BOS token = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token = 151645 '<|im_end|>'
llm_load_print_meta: EOT token = 151645 '<|im_end|>'
llm_load_print_meta: PAD token = 151643 '<|endoftext|>'
llm_load_print_meta: LF token = 148848 'ÄĬ'
llm_load_print_meta: FIM PRE token = 151659 '<|fim_prefix|>'
llm_load_print_meta: FIM SUF token = 151661 '<|fim_suffix|>'
llm_load_print_meta: FIM MID token = 151660 '<|fim_middle|>'
llm_load_print_meta: FIM PAD token = 151662 '<|fim_pad|>'
llm_load_print_meta: FIM REP token = 151663 '<|repo_name|>'
llm_load_print_meta: FIM SEP token = 151664 '<|file_sep|>'
llm_load_print_meta: EOG token = 151643 '<|endoftext|>'
llm_load_print_meta: EOG token = 151645 '<|im_end|>'
llm_load_print_meta: EOG token = 151662 '<|fim_pad|>'
llm_load_print_meta: EOG token = 151663 '<|repo_name|>'
llm_load_print_meta: EOG token = 151664 '<|file_sep|>'
llm_load_print_meta: max token length = 256
ggml_vulkan: Compiling shaders..............................Done!
llm_load_tensors: offloading 24 repeating layers to GPU
llm_load_tensors: offloading output layer to GPU
llm_load_tensors: offloaded 25/25 layers to GPU
llm_load_tensors: CPU_Mapped model buffer size = 137.94 MiB
llm_load_tensors: Vulkan0 model buffer size = 500.79 MiB
...........................................................
llama_new_context_with_model: n_batch is less than GGML_KQ_MASK_PAD - increasing to 32
llama_new_context_with_model: n_seq_max = 1
llama_new_context_with_model: n_ctx = 1024
llama_new_context_with_model: n_ctx_per_seq = 1024
llama_new_context_with_model: n_batch = 32
llama_new_context_with_model: n_ubatch = 32
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: n_ctx_per_seq (1024) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
llama_kv_cache_init: Vulkan0 KV buffer size = 12.00 MiB
llama_new_context_with_model: KV self size = 12.00 MiB, K (f16): 6.00 MiB, V (f16): 6.00 MiB
llama_new_context_with_model: Vulkan_Host output buffer size = 0.58 MiB
llama_new_context_with_model: Vulkan0 compute buffer size = 18.66 MiB
llama_new_context_with_model: Vulkan_Host compute buffer size = 0.23 MiB
llama_new_context_with_model: graph nodes = 846
llama_new_context_with_model: graph splits = 2
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
libc++abi: terminating due to uncaught exception of type vk::DeviceLostError: vk::Device::waitForFences: ErrorDeviceLost
Aborted