Name and Version
llama.cpp release version: b6765
built with gcc 12.3.1
openEuler 22.03 LTS-SP4, aarch64
CANN: 8.2.RC1
Ascend HDK: 24.1.1.1
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
llama-server
Command line
./llama-server --host 0.0.0.0 --port 8080 -m /model/qwen/Qwen3-30B-A3B-Thinking-2507/Qwen3-30B-A3B-Thinking-2507-F16.gguf --jinja -ngl 10
Problem description & steps to reproduce
Starting llama-server with the command above crashes during the model warm-up run with the following CANN error:
/model/llama.cpp-b6765/ggml/src/ggml-cann/ggml-cann.cpp:69: CANN error
CANN error: EL0003: [PID: 1445] 2025-10-21-03:40:09.141.920 The argument is invalid.
Solution: Try again with a valid argument.
TraceBack (most recent call last):
Convert memory address from virtual to dma physical failed, retCode=0x7020014.[FUNC:MemcpyAsyncTaskInitV3][FILE:memory_task.cc][LINE:694]
rtMemcpyAsync execute failed, reason=[driver error:invalid handle][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53]
[Call][Rts]call rts api [rtMemcpyAsync] failed, retCode is 107017[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
current device: 0, in function ggml_cann_async_memcpy at /model/llama.cpp-b6765/ggml/src/ggml-cann/../ggml-cann/aclnn_ops.h:985
aclrtMemcpyAsync(dst, len, src, len, kind, ctx->stream())
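For reference, the call that aborts is the plain async host/device copy shown in the last line above. Below is a minimal standalone sketch of that call pattern, assuming the standard AscendCL runtime API (acl/acl.h); the device index, buffer size, and host buffer are illustrative only and are not taken from ggml_cann_async_memcpy itself, so this is not a guaranteed reproducer of the bug, just the same call shape.

```cpp
// Hypothetical sketch (not from the llama.cpp repo): issues an async
// host-to-device copy with the same argument shape as the failing call
// aclrtMemcpyAsync(dst, len, src, len, kind, ctx->stream()).
#include <acl/acl.h>
#include <cstdio>
#include <vector>

int main() {
    aclInit(nullptr);          // default ACL config
    aclrtSetDevice(0);         // device 0, as in the trace (illustrative)

    aclrtStream stream = nullptr;
    aclrtCreateStream(&stream);

    const size_t len = 1 << 20;        // 1 MiB, arbitrary size for the sketch
    std::vector<char> host(len, 0);    // plain host buffer (illustrative)

    void *dev = nullptr;
    aclrtMalloc(&dev, len, ACL_MEM_MALLOC_HUGE_FIRST);

    // Same call shape as in ggml_cann_async_memcpy / aclnn_ops.h:985
    aclError err = aclrtMemcpyAsync(dev, len, host.data(), len,
                                    ACL_MEMCPY_HOST_TO_DEVICE, stream);
    printf("aclrtMemcpyAsync returned %d\n", (int)err);

    aclrtSynchronizeStream(stream);
    aclrtFree(dev);
    aclrtDestroyStream(stream);
    aclrtResetDevice(0);
    aclFinalize();
    return 0;
}
```

Built against the CANN toolkit headers and linked with libascendcl, this exercises the same aclrtMemcpyAsync path that the backtrace points at, which may help in narrowing down whether the failure is specific to the buffers llama.cpp passes in.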
First Bad Commit
No response
Relevant log output
llama_init_from_model: model default pooling_type is [0], but [-1] was specified
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = false
llama_context: freq_base = 10000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
ggml_backend_cann_context: device 0 async operator submission is OFF
llama_context: CPU output buffer size = 0.58 MiB
llama_kv_cache: CANN0 KV buffer size = 80.00 MiB
llama_kv_cache: CPU KV buffer size = 304.00 MiB
llama_kv_cache: size = 384.00 MiB ( 4096 cells, 48 layers, 1/1 seqs), K (f16): 192.00 MiB, V (f16): 192.00 MiB
llama_context: layer 38 is assigned to device CANN0 but the Flash Attention tensor is assigned to device CPU (usually due to missing support)
llama_context: Flash Attention was auto, set to disabled
llama_context: CANN0 compute buffer size = 894.25 MiB
llama_context: CANN_Host compute buffer size = 76.01 MiB
llama_context: graph nodes = 3222
llama_context: graph splits = 536 (with bs=512), 117 (with bs=1)
common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: added <|im_end|> logit bias = -inf
common_init_from_params: added <|fim_pad|> logit bias = -inf
common_init_from_params: added <|repo_name|> logit bias = -inf
common_init_from_params: added <|file_sep|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
new_pool_for_device: device 0 use vmm pool
/model/llama.cpp-b6765/ggml/src/ggml-cann/ggml-cann.cpp:69: CANN error
CANN error: EL0003: [PID: 1445] 2025-10-21-03:40:09.141.920 The argument is invalid.
Solution: Try again with a valid argument.
TraceBack (most recent call last):
Convert memory address from virtual to dma physical failed, retCode=0x7020014.[FUNC:MemcpyAsyncTaskInitV3][FILE:memory_task.cc][LINE:694]
rtMemcpyAsync execute failed, reason=[driver error:invalid handle][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53]
[Call][Rts]call rts api [rtMemcpyAsync] failed, retCode is 107017[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
current device: 0, in function ggml_cann_async_memcpy at /model/llama.cpp-b6765/ggml/src/ggml-cann/../ggml-cann/aclnn_ops.h:985
aclrtMemcpyAsync(dst, len, src, len, kind, ctx->stream())
libggml-base.so(+0x148d4)[0xffffb14c48d4]
libggml-base.so(ggml_print_backtrace+0x21c)[0xffffb14c4d8c]
libggml-base.so(ggml_abort+0x134)[0xffffb14c4f54]
libggml-cann.so(+0x2950c)[0xffffb068950c]
libggml-cann.so(+0x298b0)[0xffffb06898b0]
libggml-cann.so(+0x2c8a4)[0xffffb068c8a4]
libggml-base.so(ggml_backend_sched_graph_compute_async+0x830)[0xffffb14dec10]
libllama.so(_ZN13llama_context13graph_computeEP11ggml_cgraphb+0x9c)[0xffffb15f41fc]
libllama.so(_ZN13llama_context14process_ubatchERK12llama_ubatch14llm_graph_typeP22llama_memory_context_iR11ggml_status+0xf4)[0xffffb15f4508]
libllama.so(_ZN13llama_context6decodeERK11llama_batch+0x2c4)[0xffffb15fafe4]
libllama.so(llama_decode+0x10)[0xffffb15fbdd0]
./llama-server(+0x1ab704)[0xaaaac58eb704]
./llama-server(+0x9766c)[0xaaaac57d766c]
./llama-server(+0x280a0)[0xaaaac57680a0]
/lib/aarch64-linux-gnu/libc.so.6(+0x273fc)[0xffffb10273fc]
/lib/aarch64-linux-gnu/libc.so.6(__libc_start_main+0x98)[0xffffb10274cc]
./llama-server(+0x2a4b0)[0xaaaac576a4b0]