I was reading the llama.cpp source code.
I am confused about why `llama_build_graph` needs to be called every time `llama_decode` is called.
Couldn't `llama_build_graph` be called once during program initialization instead? That would reduce per-call inference time.
```cpp
static int llama_decode_internal(
        llama_context & lctx,
        llama_batch     batch) {
    // ....
    ggml_cgraph * gf = llama_build_graph(lctx, batch, false);
    // ....
}
```
Thanks