Name and Version
llama.cpp release version: b6765
built with gcc 12.3.1
openEuler 22.03 LTS-SP4, aarch64
CANN: 8.2.RC1
Ascend HDK: 24.1.1.1
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
llama-server
Command line
./llama-server --host 0.0.0.0 --port 8080 -m /model/qwen/Qwen3-30B-A3B-Thinking-2507/Qwen3-30B-A3B-Thinking-2507-F16.gguf --jinja -ngl 10
Problem description & steps to reproduce
Starting llama-server with the command above crashes during the model warm-up run with the following CANN error:
/model/llama.cpp-b6765/ggml/src/ggml-cann/ggml-cann.cpp:69: CANN error
CANN error: EL0003: [PID: 1445] 2025-10-21-03:40:09.141.920 The argument is invalid.
Solution: Try again with a valid argument.
TraceBack (most recent call last):
Convert memory address from virtual to dma physical failed, retCode=0x7020014.[FUNC:MemcpyAsyncTaskInitV3][FILE:memory_task.cc][LINE:694]
rtMemcpyAsync execute failed, reason=[driver error:invalid handle][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53]
[Call][Rts]call rts api [rtMemcpyAsync] failed, retCode is 107017[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
current device: 0, in function ggml_cann_async_memcpy at /model/llama.cpp-b6765/ggml/src/ggml-cann/../ggml-cann/aclnn_ops.h:985
aclrtMemcpyAsync(dst, len, src, len, kind, ctx->stream())
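For reference, the call that aborts is the plain async host/device copy shown in the last line above. Below is a minimal standalone sketch of that call pattern, assuming the standard AscendCL runtime API (acl/acl.h); the device index, buffer size, and host buffer are illustrative only and are not taken from ggml_cann_async_memcpy itself, so this is not a guaranteed reproducer of the bug, just the same call shape.

```cpp
// Hypothetical sketch (not from the llama.cpp repo): issues an async
// host-to-device copy with the same argument shape as the failing call
// aclrtMemcpyAsync(dst, len, src, len, kind, ctx->stream()).
#include <acl/acl.h>
#include <cstdio>
#include <vector>

int main() {
    aclInit(nullptr);          // default ACL config
    aclrtSetDevice(0);         // device 0, as in the trace (illustrative)

    aclrtStream stream = nullptr;
    aclrtCreateStream(&stream);

    const size_t len = 1 << 20;        // 1 MiB, arbitrary size for the sketch
    std::vector<char> host(len, 0);    // plain host buffer (illustrative)

    void *dev = nullptr;
    aclrtMalloc(&dev, len, ACL_MEM_MALLOC_HUGE_FIRST);

    // Same call shape as in ggml_cann_async_memcpy / aclnn_ops.h:985
    aclError err = aclrtMemcpyAsync(dev, len, host.data(), len,
                                    ACL_MEMCPY_HOST_TO_DEVICE, stream);
    printf("aclrtMemcpyAsync returned %d\n", (int)err);

    aclrtSynchronizeStream(stream);
    aclrtFree(dev);
    aclrtDestroyStream(stream);
    aclrtResetDevice(0);
    aclFinalize();
    return 0;
}
```

Built against the CANN toolkit headers and linked with libascendcl, this exercises the same aclrtMemcpyAsync path that the backtrace points at, which may help in narrowing down whether the failure is specific to the buffers llama.cpp passes in.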
First Bad Commit
No response
Relevant log output
llama_init_from_model: model default pooling_type is [0], but [-1] was specified
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = false
llama_context: freq_base = 10000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
ggml_backend_cann_context: device 0 async operator submission is OFF
llama_context: CPU output buffer size = 0.58 MiB
llama_kv_cache: CANN0 KV buffer size = 80.00 MiB
llama_kv_cache: CPU KV buffer size = 304.00 MiB
llama_kv_cache: size = 384.00 MiB ( 4096 cells, 48 layers, 1/1 seqs), K (f16): 192.00 MiB, V (f16): 192.00 MiB
llama_context: layer 38 is assigned to device CANN0 but the Flash Attention tensor is assigned to device CPU (usually due to missing support)
llama_context: Flash Attention was auto, set to disabled
llama_context: CANN0 compute buffer size = 894.25 MiB
llama_context: CANN_Host compute buffer size = 76.01 MiB
llama_context: graph nodes = 3222
llama_context: graph splits = 536 (with bs=512), 117 (with bs=1)
common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: added <|im_end|> logit bias = -inf
common_init_from_params: added <|fim_pad|> logit bias = -inf
common_init_from_params: added <|repo_name|> logit bias = -inf
common_init_from_params: added <|file_sep|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
new_pool_for_device: device 0 use vmm pool
/model/llama.cpp-b6765/ggml/src/ggml-cann/ggml-cann.cpp:69: CANN error
CANN error: EL0003: [PID: 1445] 2025-10-21-03:40:09.141.920 The argument is invalid.
Solution: Try again with a valid argument.
TraceBack (most recent call last):
Convert memory address from virtual to dma physical failed, retCode=0x7020014.[FUNC:MemcpyAsyncTaskInitV3][FILE:memory_task.cc][LINE:694]
rtMemcpyAsync execute failed, reason=[driver error:invalid handle][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53]
[Call][Rts]call rts api [rtMemcpyAsync] failed, retCode is 107017[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
current device: 0, in function ggml_cann_async_memcpy at /model/llama.cpp-b6765/ggml/src/ggml-cann/../ggml-cann/aclnn_ops.h:985
aclrtMemcpyAsync(dst, len, src, len, kind, ctx->stream())
libggml-base.so(+0x148d4)[0xffffb14c48d4]
libggml-base.so(ggml_print_backtrace+0x21c)[0xffffb14c4d8c]
libggml-base.so(ggml_abort+0x134)[0xffffb14c4f54]
libggml-cann.so(+0x2950c)[0xffffb068950c]
libggml-cann.so(+0x298b0)[0xffffb06898b0]
libggml-cann.so(+0x2c8a4)[0xffffb068c8a4]
libggml-base.so(ggml_backend_sched_graph_compute_async+0x830)[0xffffb14dec10]
libllama.so(_ZN13llama_context13graph_computeEP11ggml_cgraphb+0x9c)[0xffffb15f41fc]
libllama.so(_ZN13llama_context14process_ubatchERK12llama_ubatch14llm_graph_typeP22llama_memory_context_iR11ggml_status+0xf4)[0xffffb15f4508]
libllama.so(_ZN13llama_context6decodeERK11llama_batch+0x2c4)[0xffffb15fafe4]
libllama.so(llama_decode+0x10)[0xffffb15fbdd0]
./llama-server(+0x1ab704)[0xaaaac58eb704]
./llama-server(+0x9766c)[0xaaaac57d766c]
./llama-server(+0x280a0)[0xaaaac57680a0]
/lib/aarch64-linux-gnu/libc.so.6(+0x273fc)[0xffffb10273fc]
/lib/aarch64-linux-gnu/libc.so.6(__libc_start_main+0x98)[0xffffb10274cc]
./llama-server(+0x2a4b0)[0xaaaac576a4b0]