
Urgent Help Needed! Problems Encountered in Hybrid Inference Function Verification Based on llama.cpp #11805

@mailonghua

Description

Dear all, I'm currently verifying a hybrid inference setup based on llama.cpp. Here are the details:

Environment and Version Information

  • Current Version Used: commit: 6bb4908

  • Application Platform: OPPO Find X7 Ultra

  • Chip: Qualcomm SM8650

Desired Function

The plan is to run the prefill phase on the SM8650 NPU, with the subsequent decode phase handled by llama.cpp; the overall inference flow remains centered on llama.cpp.

Implementation Method

To verify the feasibility of this approach, I embedded Qualcomm Genie code into llama.cpp. After Genie completes the prefill, the generated KV cache is converted and then written into llama.cpp's KV cache. The specific processing is shown in the figure below:

[Figure: KV cache conversion and injection flow (IMG_export_20250211_181133867.jpg)]

Personal Understanding of kvcache Storage

I have formed my own understanding of how llama.cpp stores the KV cache; the specifics are shown in the figure:

[Figure: my understanding of llama.cpp's KV cache storage (IMG_export_20250211_181136956.jpg)]

Current Problem Encountered

After the converted KV cache is injected, the model's output is logically incorrect.

Questions for Consultation

I have two questions:

  • a. Is there any problem with my understanding of how llama.cpp stores the KV cache?

  • b. If the prefill process is skipped in llama.cpp and decode is carried out directly, then apart from filling the KV cache and updating the KV cache head, is there any other state that needs to be updated?

I hope to get your help. Thank you again!
