Urgent Help Needed! Problems Encountered in Hybrid Inference Function Verification Based on llama.cpp

Hi all, I'm verifying a hybrid inference setup built on llama.cpp. Details follow:
### Environment and Version Information

- llama.cpp version: commit `6bb4908a17150b49373b5f977685b2e180a04f6f`

- Device: OPPO Find X7 Ultra

- Chip: Qualcomm SM8650

### Desired Function
The plan is to run the prefill phase on the SM8650's NPU and have llama.cpp handle the subsequent decode phase. llama.cpp remains the center of the overall inference pipeline.
### Implementation Method
To test feasibility, I embedded Qualcomm Genie code into llama.cpp. After Genie completes prefill, the KV cache it produces is converted and then written into llama.cpp's KV cache. The conversion is shown in the figure below:

![IMG_export_20250211_181133867.jpg](https://github.com/user-attachments/assets/d77a9ca0-ca79-4620-9c75-c9750a0dfdfd)



### Personal Understanding of kvcache Storage
My understanding of how llama.cpp stores its KV cache is shown in the figure below:
![IMG_export_20250211_181136956.jpg](https://github.com/user-attachments/assets/d1e4b42b-9e0a-44a6-abd5-b0e059b852d5)
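
For comparison with the figure, here is a minimal sketch of the per-layer layout llama.cpp appears to use around this commit when flash attention is off: K is stored one row per cache cell, while V is stored transposed, with one row per embedding element. The source layout `[n_head_kv][n_tokens][head_dim]`, the function names, and the use of `float` (the real cache is typically F16) are all assumptions for illustration. This is not Genie's documented format and not llama.cpp code, so it should be checked against the actual checkout.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// ASSUMED source layout (hypothetical, not Genie's documented format):
//   src[(h * n_tokens + t) * head_dim + d]   i.e. [n_head_kv][n_tokens][head_dim]

// K destination: one contiguous row per cache cell.
//   dst[t * n_embd_gqa + h * head_dim + d], with n_embd_gqa = n_head_kv * head_dim
std::vector<float> pack_k_rows(const std::vector<float>& src,
                               std::size_t n_head_kv, std::size_t n_tokens,
                               std::size_t head_dim, std::size_t kv_size) {
    assert(src.size() == n_head_kv * n_tokens * head_dim);
    assert(n_tokens <= kv_size);
    const std::size_t n_embd_gqa = n_head_kv * head_dim;
    std::vector<float> dst(kv_size * n_embd_gqa, 0.0f);  // unused cells stay zero
    for (std::size_t h = 0; h < n_head_kv; ++h)
        for (std::size_t t = 0; t < n_tokens; ++t)
            for (std::size_t d = 0; d < head_dim; ++d)
                dst[t * n_embd_gqa + h * head_dim + d] =
                    src[(h * n_tokens + t) * head_dim + d];
    return dst;
}

// V destination (transposed): one row of kv_size cells per embedding element.
//   dst[(h * head_dim + d) * kv_size + t]
std::vector<float> pack_v_transposed(const std::vector<float>& src,
                                     std::size_t n_head_kv, std::size_t n_tokens,
                                     std::size_t head_dim, std::size_t kv_size) {
    assert(src.size() == n_head_kv * n_tokens * head_dim);
    assert(n_tokens <= kv_size);
    const std::size_t n_embd_gqa = n_head_kv * head_dim;
    std::vector<float> dst(n_embd_gqa * kv_size, 0.0f);
    for (std::size_t h = 0; h < n_head_kv; ++h)
        for (std::size_t t = 0; t < n_tokens; ++t)
            for (std::size_t d = 0; d < head_dim; ++d)
                dst[(h * head_dim + d) * kv_size + t] =
                    src[(h * n_tokens + t) * head_dim + d];
    return dst;
}
```

Two things may be worth checking against this sketch: whether the V cache in the build under test is transposed (with flash attention enabled it appears to use the same row layout as K), and whether the K values coming from Genie already have RoPE applied in the same convention, since llama.cpp stores K after RoPE.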


### Current Problem Encountered
The program produces incorrect output.
### Questions for Consultation
I have two questions:

- a. Is my understanding of how llama.cpp stores the KV cache correct?

- b. If llama.cpp skips prefill and goes straight to decode, is there any state that must be updated besides the KV cache contents and the KV cache `head`?
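
To make question b concrete, the sketch below is a toy model of the per-cell bookkeeping that may matter in addition to the tensor data and `head`. As far as I recall, llama.cpp around this commit builds the attention mask from each cell's `pos` and `seq_id`, so cells whose metadata is left empty would be masked out even if their K/V data is correct. The types `ToyKvCell` and `ToyKvCache` and both functions are hypothetical stand-ins whose field names are modeled on llama.cpp internals; they are not the real structs and assume an initially empty cache.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

// Toy stand-in for a KV cache cell (NOT the real llama.cpp struct).
struct ToyKvCell {
    std::int32_t pos = -1;            // -1 marks an empty cell
    std::set<std::int32_t> seq_id;    // sequences that own this cell
};

// Toy stand-in for the cache bookkeeping (NOT the real llama.cpp struct).
struct ToyKvCache {
    std::uint32_t head = 0;           // next cell to write
    std::uint32_t used = 0;           // number of occupied cells
    std::vector<ToyKvCell> cells;
    explicit ToyKvCache(std::uint32_t size) : cells(size) {}
};

// Record that cells [0, n_prefill) were filled by an external prefill.
// Assumes the cache was empty beforehand.
// Returns the position the first decoded token should use.
std::int32_t register_external_prefill(ToyKvCache& kv, std::uint32_t n_prefill,
                                       std::int32_t seq) {
    assert(n_prefill <= kv.cells.size());
    for (std::uint32_t i = 0; i < n_prefill; ++i) {
        kv.cells[i].pos = static_cast<std::int32_t>(i);
        kv.cells[i].seq_id.insert(seq);
    }
    kv.used = n_prefill;
    kv.head = n_prefill;
    return static_cast<std::int32_t>(n_prefill);
}

// Causal visibility rule: a cell is attended to only if it belongs to the
// query's sequence and its position is not after the query position.
bool cell_visible(const ToyKvCell& c, std::int32_t seq, std::int32_t q_pos) {
    return c.seq_id.count(seq) > 0 && c.pos >= 0 && c.pos <= q_pos;
}
```

Under this model, the first decode batch would carry a token at position `n_prefill` for the same sequence id. That token has to be sampled from the logits of the last prompt token, which in this design would come from the Genie side rather than from llama.cpp.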

Any help would be greatly appreciated. Thank you!
