Name and Version
Summary
Running llama.cpp on an RK3588 (ARM64) platform produces garbled output after user input and eventually aborts. Upon investigation, some intermediate tensors contain inf values, which seem to trigger the incorrect inference results.
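A quick way to confirm such values is to scan the suspect tensor's data for non-finite entries, e.g. from gdb or a temporary debug hook. The helper below is only a minimal sketch (not llama.cpp code); it assumes the tensor data is already available as a contiguous f32 buffer, and the name `check_floats` is made up for illustration:

```c
#include <math.h>
#include <stdio.h>
#include <stddef.h>

// Count (and report the first few) non-finite values in an f32 buffer.
static size_t check_floats(const char * name, const float * data, size_t n) {
    size_t bad = 0;
    for (size_t i = 0; i < n; ++i) {
        if (!isfinite(data[i])) {   // catches both inf and NaN
            if (bad < 8) {
                fprintf(stderr, "%s[%zu] = %f\n", name, i, (double) data[i]);
            }
            bad++;
        }
    }
    return bad;
}

int main(void) {
    // Hypothetical values standing in for a dequantized tensor row.
    float t[4] = { 1.0f, INFINITY, -2.5f, NAN };
    size_t bad = check_floats("example_tensor", t, 4);
    fprintf(stderr, "non-finite values: %zu\n", bad);
    return bad != 0;
}
```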
Environment
- llama-cli version
geduer@ulan:~/dev/llm/llama.cpp/build/bin$ ./llama-cli --version
arm_release_ver of this libmali is 'g6p0-01eac0', rk_so_ver is '7'.
ggml_opencl: selecting platform: 'ARM Platform'
ggml_opencl: selecting device: 'Mali-LODX r0p0'
Unsupported GPU: Mali-LODX r0p0
register_backend: registered backend OpenCL (0 devices)
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (CPU)
version: 4913 (35cae5ba)
built with cc (Ubuntu 13.2.0-4ubuntu3) 13.2.0 for aarch64-linux-gnu
- Hardware: RK3588 (ARM64)
- OS/Compiler: Ubuntu / GCC
- Commit: Latest main branch of llama.cpp (as of )
- Build Commands:
cmake .. -G Ninja -DCMAKE_BUILD_TYPE=Debug -DBUILD_SHARED_LIBS=OFF -DGGML_OPENCL=ON -DGGML_VULKAN_CHECK_RESULTS=ON
ninja
Operating systems
Linux
GGML backends
CPU
Hardware
RK3588-ARM64 CPU
Models
DeepSeek-R1-Distill-Qwen-1.5B-Q4_0-GGUF/deepseek-r1-distill-qwen-1.5b-q4_0.gguf
Problem description & steps to reproduce
- Download the latest llama.cpp source code onto an RK3588 system.
- Download the official DeepSeek R1-1.5B model file.
- Build llama-cli using the Debug configuration:
cd llama-cpp
mkdir build && cd build
cmake .. -G Ninja -DCMAKE_BUILD_TYPE=Debug -DBUILD_SHARED_LIBS=OFF -DGGML_OPENCL=ON -DGGML_VULKAN_CHECK_RESULTS=ON
ninja
- Run the resulting binary:
cd ~/dev/llm/llama.cpp/build/bin && ./llama-cli -m ~/dev/llm/DeepSeek-R1-Distill-Qwen-1.5B-Q4_0-GGUF/deepseek-r1-distill-qwen-1.5b-q4_0.gguf -b 128 -ngl 0 -c 2048 -p \"Hello\"
- Input any question. The generation eventually aborts with an assertion failure, e.g.:
</think>
Hello! How can I assist you today? 😊
> how to debug opencl?
<think>
</think>
To debug OpenCL code,llama-cli: /home/geduer/dev/llm/llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c:10306: ggml_compute_forward_soft_max_f32: Assertion `sum > 0.0' failed.
Aborted (core dumped)
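The abort comes from the `sum > 0.0` assertion in ggml_compute_forward_soft_max_f32 (ggml-cpu.c:10306 in this build). The standalone program below is only an illustration of the failure mode, not the actual ggml code: once an inf reaches the softmax input, the max-subtraction step computes inf - inf = NaN, the row sum becomes NaN, and the assertion fails. This suggests the real problem is wherever the inf values are first produced upstream of the softmax.

```c
#include <assert.h>
#include <math.h>

// Naive softmax with the usual max-subtraction scheme and a sum check;
// compile with `cc demo.c -lm` and run to see the assertion trip.
static void naive_softmax(float * x, int n) {
    float max = -INFINITY;
    for (int i = 0; i < n; ++i) {
        if (x[i] > max) max = x[i];   // max becomes +inf
    }
    float sum = 0.0f;
    for (int i = 0; i < n; ++i) {
        x[i] = expf(x[i] - max);      // inf - inf == NaN, expf(NaN) == NaN
        sum += x[i];                  // sum becomes NaN
    }
    assert(sum > 0.0f);               // same condition as in ggml-cpu.c; fails here
    for (int i = 0; i < n; ++i) {
        x[i] /= sum;
    }
}

int main(void) {
    float row[4] = { 0.5f, 1.0f, INFINITY, -0.5f };
    naive_softmax(row, 4);            // aborts, mirroring the reported crash
    return 0;
}
```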
First Bad Commit
Commit: a4f011e
Relevant log output
geduer@ulan:~/dev/llm/llama.cpp/build$ cmake .. -G Ninja -DCMAKE_BUILD_TYPE=Debug -DBUILD_SHARED_LIBS=OFF -DGGML_OPENCL=ON -DGGML_VULKAN_CHECK_RESULTS=ON
-- The C compiler identification is GNU 13.2.0
-- The CXX compiler identification is GNU 13.2.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Git: /usr/bin/git (found version "2.40.1")
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE
-- ccache found, compilation results will be cached. Disable with GGML_CCACHE=OFF.
-- CMAKE_SYSTEM_PROCESSOR: aarch64
-- Including CPU backend
-- Found OpenMP_C: -fopenmp (found version "4.5")
-- Found OpenMP_CXX: -fopenmp (found version "4.5")
-- Found OpenMP: TRUE (found version "4.5")
-- ARM detected
-- Performing Test GGML_COMPILER_SUPPORTS_FP16_FORMAT_I3E
-- Performing Test GGML_COMPILER_SUPPORTS_FP16_FORMAT_I3E - Failed
-- Performing Test GGML_MACHINE_SUPPORTS_dotprod
-- Performing Test GGML_MACHINE_SUPPORTS_dotprod - Success
-- Performing Test GGML_MACHINE_SUPPORTS_i8mm
-- Performing Test GGML_MACHINE_SUPPORTS_i8mm - Failed
-- Performing Test GGML_MACHINE_SUPPORTS_noi8mm
-- Performing Test GGML_MACHINE_SUPPORTS_noi8mm - Success
-- Performing Test GGML_MACHINE_SUPPORTS_sve
-- Performing Test GGML_MACHINE_SUPPORTS_sve - Failed
-- Performing Test GGML_MACHINE_SUPPORTS_nosve
-- Performing Test GGML_MACHINE_SUPPORTS_nosve - Success
-- Performing Test GGML_MACHINE_SUPPORTS_sme
-- Performing Test GGML_MACHINE_SUPPORTS_sme - Failed
-- Performing Test GGML_MACHINE_SUPPORTS_nosme
-- Performing Test GGML_MACHINE_SUPPORTS_nosme - Failed
-- ARM feature DOTPROD enabled
-- ARM feature FMA enabled
-- ARM feature FP16_VECTOR_ARITHMETIC enabled
-- Adding CPU backend variant ggml-cpu: -mcpu=cortex-a76.cortex-a55+crc+crypto+dotprod+noi8mm+nosve
-- Looking for CL_VERSION_3_0
-- Looking for CL_VERSION_3_0 - found
-- Found OpenCL: /usr/lib/aarch64-linux-gnu/libOpenCL.so (found version "3.0")
-- Found Python3: /usr/bin/python3 (found version "3.11.6") found components: Interpreter
-- OpenCL will use matmul kernels optimized for Adreno
-- Including OpenCL backend
-- Configuring done (7.5s)
-- Generating done (0.2s)
-- Build files have been written to: /home/geduer/dev/llm/llama.cpp/build
geduer@ulan:~/dev/llm/llama.cpp/build$ ninja
[19/222] Generating build details from Git
-- Found Git: /usr/bin/git (found version "2.40.1")
[134/222] Building CXX object examples/perplexity/CMakeFiles/llama-perplexity.dir/perplexity.cpp.o
/home/geduer/dev/llm/llama.cpp/examples/perplexity/perplexity.cpp: In lambda function:
/home/geduer/dev/llm/llama.cpp/examples/perplexity/perplexity.cpp:1736:41: note: parameter passing for argument of type ‘std::pair<double, double>’ when C++17 is enabled changed to match C++14 in GCC 10.1
1736 | return std::make_pair(0., 0.);
| ^
[222/222] Linking CXX executable bin/llama-gemma3-cli
geduer@ulan:~/dev/llm/llama.cpp/build$ cd ~/dev/llm/llama.cpp/build/bin && ./llama-cli -m ~/dev/llm/DeepSeek-R1-Distill-Qwen-1.5B-Q4_0-GGUF/deepseek-r1-distill-qwen-1.5b-q4_0.gguf -b 128 -ngl 0 -c 2048 -p \"Hello\"
arm_release_ver of this libmali is 'g6p0-01eac0', rk_so_ver is '7'.
ggml_opencl: selecting platform: 'ARM Platform'
ggml_opencl: selecting device: 'Mali-LODX r0p0'
Unsupported GPU: Mali-LODX r0p0
register_backend: registered backend OpenCL (0 devices)
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (CPU)
warning: no usable GPU found, --gpu-layers option will be ignored
warning: one possible reason is that llama.cpp was compiled without GPU support
warning: consult docs/build.md for compilation instructions
build: 4913 (35cae5ba) with cc (Ubuntu 13.2.0-4ubuntu3) 13.2.0 for aarch64-linux-gnu (debug)
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_loader: loaded meta data with 26 key-value pairs and 339 tensors from /home/geduer/dev/llm/DeepSeek-R1-Distill-Qwen-1.5B-Q4_0-GGUF/deepseek-r1-distill-qwen-1.5b-q4_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = DeepSeek R1 Distill Qwen 1.5B
llama_model_loader: - kv 3: general.basename str = DeepSeek-R1-Distill-Qwen
llama_model_loader: - kv 4: general.size_label str = 1.5B
llama_model_loader: - kv 5: qwen2.block_count u32 = 28
llama_model_loader: - kv 6: qwen2.context_length u32 = 131072
llama_model_loader: - kv 7: qwen2.embedding_length u32 = 1536
llama_model_loader: - kv 8: qwen2.feed_forward_length u32 = 8960
llama_model_loader: - kv 9: qwen2.attention.head_count u32 = 12
llama_model_loader: - kv 10: qwen2.attention.head_count_kv u32 = 2
llama_model_loader: - kv 11: qwen2.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 12: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 13: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 14: tokenizer.ggml.pre str = deepseek-r1-qwen
llama_model_loader: - kv 15: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 16: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 17: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 151646
llama_model_loader: - kv 19: tokenizer.ggml.eos_token_id u32 = 151643
llama_model_loader: - kv 20: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 21: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 22: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 23: tokenizer.chat_template str = {% if not add_generation_prompt is de...
llama_model_loader: - kv 24: general.quantization_version u32 = 2
llama_model_loader: - kv 25: general.file_type u32 = 2
llama_model_loader: - type f32: 141 tensors
llama_model_loader: - type q4_0: 197 tensors
llama_model_loader: - type q6_K: 1 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_0
print_info: file size = 1011.16 MiB (4.77 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 22
load: token to piece cache size = 0.9310 MB
print_info: arch = qwen2
print_info: vocab_only = 0
print_info: n_ctx_train = 131072
print_info: n_embd = 1536
print_info: n_layer = 28
print_info: n_head = 12
print_info: n_head_kv = 2
print_info: n_rot = 128
print_info: n_swa = 0
print_info: n_swa_pattern = 1
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 6
print_info: n_embd_k_gqa = 256
print_info: n_embd_v_gqa = 256
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 8960
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 10000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 131072
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 0
print_info: ssm_d_inner = 0
print_info: ssm_d_state = 0
print_info: ssm_dt_rank = 0
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 1.5B
print_info: model params = 1.78 B
print_info: general.name = DeepSeek R1 Distill Qwen 1.5B
print_info: vocab type = BPE
print_info: n_vocab = 151936
print_info: n_merges = 151387
print_info: BOS token = 151646 '<|begin▁of▁sentence|>'
print_info: EOS token = 151643 '<|end▁of▁sentence|>'
print_info: EOT token = 151643 '<|end▁of▁sentence|>'
print_info: PAD token = 151643 '<|end▁of▁sentence|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
print_info: FIM MID token = 151660 '<|fim_middle|>'
print_info: FIM PAD token = 151662 '<|fim_pad|>'
print_info: FIM REP token = 151663 '<|repo_name|>'
print_info: FIM SEP token = 151664 '<|file_sep|>'
print_info: EOG token = 151643 '<|end▁of▁sentence|>'
print_info: EOG token = 151662 '<|fim_pad|>'
print_info: EOG token = 151663 '<|repo_name|>'
print_info: EOG token = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: CPU_AARCH64 model buffer size = 702.84 MiB
load_tensors: CPU_Mapped model buffer size = 1003.78 MiB
........................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 2048
llama_context: n_ctx_per_seq = 2048
llama_context: n_batch = 128
llama_context: n_ubatch = 128
llama_context: causal_attn = 1
llama_context: flash_attn = 0
llama_context: freq_base = 10000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (2048) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 0.58 MiB
init: kv_size = 2048, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 28, can_shift = 1
init: CPU KV buffer size = 56.00 MiB
llama_context: KV self size = 56.00 MiB, K (f16): 28.00 MiB, V (f16): 28.00 MiB
llama_context: CPU compute buffer size = 76.19 MiB
llama_context: graph nodes = 986
llama_context: graph splits = 1
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 8
main: chat template is available, enabling conversation mode (disable it with -no-cnv)
*** User-specified prompt will pre-start conversation, did you mean to set --system-prompt (-sys) instead?
main: chat template example:
You are a helpful assistant
<|User|>Hello<|Assistant|>Hi there<|end▁of▁sentence|><|User|>How are you?<|Assistant|>
system_info: n_threads = 8 (n_threads_batch = 8) / 8 | CPU : NEON = 1 | ARM_FMA = 1 | FP16_VA = 1 | DOTPROD = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |
main: interactive mode on.
sampler seed: 3062269646
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 2048
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 2048, n_batch = 128, n_predict = -1, n_keep = 1
== Running in interactive mode. ==
- Press Ctrl+C to interject at any time.
- Press Return to return control to the AI.
- To return control without starting a new line, end your input with '/'.
- If you want to submit another line, end your input with '\'.
- Not using system message. To change it, set a different value via -sys PROMPT
"Hello"<think>
</think>
Hello! How can I assist you today? 😊
> how to debug opencl?
<think>
</think>
To debug OpenCL code,llama-cli: /home/geduer/dev/llm/llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c:10306: ggml_compute_forward_soft_max_f32: Assertion `sum > 0.0' failed.
Aborted (core dumped)