Description
Hello everyone, I'm currently verifying a hybrid inference approach based on llama.cpp. The details are as follows:
Environment and Version Information
- Current version used: commit 6bb4908
- Application platform: Oppo Find X7 Ultra
- Chip: Qualcomm SM8650
Desired Function
The plan is to run the prefill phase on the SM8650's NPU and have llama.cpp handle the subsequent decode phase. llama.cpp remains the core of the overall inference pipeline.
Implementation Method
To verify the feasibility of this approach, I embedded Qualcomm Genie code into llama.cpp. After Genie completes the prefill, the KV cache it generates is converted and then written into llama.cpp's KV cache. The conversion procedure is shown in the figure below:
Personal Understanding of kvcache Storage
My understanding of how llama.cpp stores the KV cache is shown in the figure:
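As a text-only illustration of the kind of layout conversion involved, here is a minimal sketch. It rests on several assumptions that should be checked against the actual code: the source layout `src[head][token][dim]` is hypothetical (not a documented Genie format), the destination layouts reflect my reading of llama.cpp with flash attention disabled (K stored as one contiguous row per cell, V stored transposed with one contiguous row per channel), and `float` is used for simplicity even though the real cache is usually F16. The names `KvShape`, `pack_k_rows`, and `pack_v_transposed` are made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical source layout (an assumption, not a documented Genie format):
// src[head][token][dim], contiguous in that order.
struct KvShape {
    std::size_t n_head_kv;  // number of KV heads
    std::size_t n_tokens;   // number of prefilled tokens
    std::size_t head_dim;   // per-head dimension
};

// K destination (assumed): one contiguous row per cache cell, heads concatenated.
// dst[cell * n_embd + head * head_dim + d], with cell = cell_offset + token.
std::vector<float> pack_k_rows(const std::vector<float>& src, const KvShape& s,
                               std::size_t kv_size, std::size_t cell_offset) {
    assert(src.size() == s.n_head_kv * s.n_tokens * s.head_dim);
    assert(cell_offset + s.n_tokens <= kv_size);
    const std::size_t n_embd = s.n_head_kv * s.head_dim;
    std::vector<float> dst(kv_size * n_embd, 0.0f);
    for (std::size_t h = 0; h < s.n_head_kv; ++h)
        for (std::size_t t = 0; t < s.n_tokens; ++t)
            for (std::size_t d = 0; d < s.head_dim; ++d)
                dst[(cell_offset + t) * n_embd + h * s.head_dim + d] =
                    src[(h * s.n_tokens + t) * s.head_dim + d];
    return dst;
}

// V destination (assumed, flash attention off): transposed, one contiguous
// row of kv_size cells per channel.
// dst[(head * head_dim + d) * kv_size + cell], with cell = cell_offset + token.
std::vector<float> pack_v_transposed(const std::vector<float>& src, const KvShape& s,
                                     std::size_t kv_size, std::size_t cell_offset) {
    assert(src.size() == s.n_head_kv * s.n_tokens * s.head_dim);
    assert(cell_offset + s.n_tokens <= kv_size);
    const std::size_t n_embd = s.n_head_kv * s.head_dim;
    std::vector<float> dst(n_embd * kv_size, 0.0f);
    for (std::size_t h = 0; h < s.n_head_kv; ++h)
        for (std::size_t t = 0; t < s.n_tokens; ++t)
            for (std::size_t d = 0; d < s.head_dim; ++d)
                dst[(h * s.head_dim + d) * kv_size + cell_offset + t] =
                    src[(h * s.n_tokens + t) * s.head_dim + d];
    return dst;
}
```

If the real layouts differ from these assumptions (for example, flash attention enabled, a quantized cache type, or a different RoPE element ordering in K), the index expressions would need to change accordingly.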

Current Problem Encountered
The program runs, but the output it produces is incorrect.
Questions for Consultation
I have two questions:
- a. Is there anything wrong with my current understanding of how llama.cpp stores the KV cache?
- b. If the prefill phase is skipped in llama.cpp and decoding starts directly, is there any other state that needs to be updated besides the KV cache contents and the KV cache head?
I hope to get your help. Thank you again!
