
Store KV cache of computed prompts to disk to avoid re-compute in follow-up runs #64

Closed
ggerganov opened this issue Mar 12, 2023 · 10 comments
Labels
enhancement New feature or request good first issue Good for newcomers help wanted Extra attention is needed high priority Very important issue 🦙. llama

Comments

@ggerganov
Owner

Idea from: #23 (comment)

We can add a --cache_prompt flag that, when set, dumps the computed KV cache from prompt processing to disk, in a file whose name is derived from a hash of the prompt. On the next run, it will first check whether a KV cache is stored for that hash and, if so, load it straight from disk instead of recomputing it.
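For illustration, a minimal sketch of that control flow, assuming nothing about the actual llama.cpp internals (the names and file naming scheme here are illustrative; the elided parts are where the existing prompt eval and the KV dump/restore would go):

#include <fstream>
#include <functional>
#include <string>

int main(int argc, char ** argv) {
    const bool cache_prompt = true;                          // would come from the --cache_prompt flag
    const std::string prompt = argc > 1 ? argv[1] : "Hello";

    // Cache file name derived from a hash of the prompt text.
    const std::string path =
        "prompt-" + std::to_string(std::hash<std::string>{}(prompt)) + ".kvcache";

    if (cache_prompt && std::ifstream(path).good()) {
        // ... memcpy the stored KV data back into memory_k / memory_v and restore n_past ...
    } else {
        // ... run the usual prompt evaluation ...
        if (cache_prompt) {
            // ... dump memory_k / memory_v and n_past to `path` ...
        }
    }
    return 0;
}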

Great task for contributing to the project!

@ggerganov ggerganov added enhancement New feature or request help wanted Extra attention is needed good first issue Good for newcomers high priority Very important issue 🦙. llama labels Mar 12, 2023
@ggerganov ggerganov pinned this issue Mar 13, 2023
@Msa360

Msa360 commented Mar 14, 2023

@ggerganov Do you want to store k, v or K, V?

@im-not-tom

im-not-tom commented Mar 17, 2023

Hello and sorry to bother you,

I made an attempt at doing this and would like to ask for a correction, as I have probably gravely misunderstood something.

Without dumping my entire code: I have two functions that basically boil down to memcpy-ing the contents of model.memory_[kv]->data in and out. My understanding is that this, along with saving/restoring n_past, should be enough to save the current state of the prompt.

Am I mistaken? I'm asking, of course, simply because the result of my attempt is a state that appears entirely random, often outputting nothing at all or garbage text.

// Copy the current contents of the k/v memory tensors into the savepoint buffer.
bool gptj_make_savepoint(const struct gptj_model & model, gptj_savepoint & savepoint) {
    size_t nelements = ggml_nelements(model.memory_k);
    assert(nelements == ggml_nelements(model.memory_v));

    // Only F32 memory tensors are handled here.
    assert(ggml_type_size(model.memory_k->type) == sizeof(float));
    assert(ggml_type_size(model.memory_v->type) == sizeof(float));

    // Layout: [ memory_k | memory_v ]
    savepoint.memory.clear();
    savepoint.memory.resize(nelements * 2);
    memcpy(
        &savepoint.memory[0],
        ggml_get_data(model.memory_k),
        sizeof(float) * nelements
    );
    memcpy(
        &savepoint.memory[nelements],
        ggml_get_data(model.memory_v),
        sizeof(float) * nelements
    );
    return true;
}

// Copy the savepoint buffer back into the model's k/v memory tensors.
bool gptj_apply_savepoint(const gptj_savepoint & savepoint, struct gptj_model & model) {
    size_t nelements = savepoint.memory.size() / 2;
    assert(nelements == ggml_nelements(model.memory_k));
    assert(nelements == ggml_nelements(model.memory_v));

    memcpy(
        ggml_get_data(model.memory_k),
        &savepoint.memory[0],
        sizeof(float) * nelements
    );
    memcpy(
        ggml_get_data(model.memory_v),
        &savepoint.memory[nelements],
        sizeof(float) * nelements
    );
    return true;
}
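For completeness, a rough sketch of how such a savepoint could be written to and read back from disk. The struct definition below is illustrative (the original isn't shown above), and the n_past field is an assumption based on the surrounding discussion:

#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

struct gptj_savepoint {
    std::vector<float> memory;  // flattened memory_k followed by memory_v
    int n_past = 0;             // assumed field, per the discussion above
};

bool gptj_savepoint_write(const gptj_savepoint & sp, const std::string & path) {
    std::ofstream out(path, std::ios::binary);
    if (!out) return false;
    const uint64_t n = sp.memory.size();
    out.write(reinterpret_cast<const char *>(&sp.n_past), sizeof(sp.n_past));
    out.write(reinterpret_cast<const char *>(&n), sizeof(n));
    out.write(reinterpret_cast<const char *>(sp.memory.data()), n * sizeof(float));
    return out.good();
}

bool gptj_savepoint_read(const std::string & path, gptj_savepoint & sp) {
    std::ifstream in(path, std::ios::binary);
    if (!in) return false;
    uint64_t n = 0;
    in.read(reinterpret_cast<char *>(&sp.n_past), sizeof(sp.n_past));
    in.read(reinterpret_cast<char *>(&n), sizeof(n));
    sp.memory.resize(n);
    in.read(reinterpret_cast<char *>(sp.memory.data()), n * sizeof(float));
    return in.good();
}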

@setzer22

setzer22 commented Mar 17, 2023

@im-not-tom I got this working on my project and those are basically the steps I followed 👍

I have verified this works by saving the memory, restoring it in a different process (after loading the model again), and then comparing the two logits vectors returned by llama_eval (the one from the original process and the one from the new process). If you do it right, the results should be identical.
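A small sketch of that comparison, for reference (with eps = 0 it demands bit-identical logits, as described):

#include <cmath>
#include <cstdio>
#include <vector>

// Element-wise comparison of logits from two runs.
bool logits_match(const std::vector<float> & a, const std::vector<float> & b, float eps = 0.0f) {
    if (a.size() != b.size()) return false;
    for (size_t i = 0; i < a.size(); ++i) {
        if (std::fabs(a[i] - b[i]) > eps) {
            fprintf(stderr, "mismatch at %zu: %f vs %f\n", i, a[i], b[i]);
            return false;
        }
    }
    return true;
}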

@LostRuins
Collaborator

LostRuins commented Mar 17, 2023

This requires a future input prompt to be identical to the original prompt, correct? So if I wanted to swap out a few words early in the text each time, it would still require reprocessing everything.

I'm struggling to understand why Hugging Face Transformers doesn't have this issue. There, generating from a 500-word input prompt seems to have the same latency as from a 3-word prompt (constant time), whereas with llama.cpp the prompt processing time scales linearly with prompt length.

@setzer22 setzer22 mentioned this issue Mar 18, 2023
@anzz1
Contributor

anzz1 commented Mar 21, 2023

This is a great idea.
As explained in my comment here: #23 (comment)

I think the right way forward would be to separate the concepts of the immutable model and its mutable context into their own separate, simple structs.

Those could then be manipulated and evaluated against in whatever ways new features and implementations see fit: load/save from disk, mmap to memory, transfer over sockets, even to/from hardware devices like a serial port; serve multiple instances by creating threads, or share the model and/or context between processes.

Currently I see the problem of having multiple concurrent PRs, each of which tries to implement new functionality by directly modifying the main program in its own way. A simple C API to access the 'model' and 'context' structs (see the sketch at the end of this comment) could keep the main program lean, clean and fast, while all sorts of new functionality could be added as separate modules that interface with this API.

You've done an absolutely fantastic job making this minimal, fast and ultra-portable zero-dependency port of LLaMA. It's absolutely delightful in its current state, and I think modularity would be the right approach going forward, instead of growing the main program into a large monolith with a web of non-portable #ifdefs scattered everywhere. With every functionality-adding module living in its own .cpp file, independent of the others, any functionality could simply be added or left out via the makefile.

I could see this spawning a whole ecosystem: a new "modules" directory could be added to the repo root, and people could write whatever modules add new functionality. Living inside the modules folder, separated from the main program and included via makefile options, they could also conform less strictly to the rules of no dependencies and full platform compatibility. That would allow people to build new functionality without accounting for every platform, and let users opt in to the features they want with nothing forced upon them.

If a non-modular approach were taken, it would inevitably lead this marvelous, minimalistic codebase to grow into a large monolith and force people to fork the minimal version and cherry-pick the bugfixes and features they want in their own forks, creating a divergence that, in my view, would hurt the project in the long run.
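One possible shape for such a split, purely as an illustration (none of these names are actual llama.cpp symbols):

#include <cstddef>
#include <cstdint>

struct llm_model;     // immutable weights: loaded once, shareable between contexts
struct llm_context;   // mutable state: KV cache, n_past, RNG, logits, ...

extern "C" {
    llm_model *   llm_model_load(const char * path);
    void          llm_model_free(llm_model * model);

    llm_context * llm_context_new(const llm_model * model, int n_ctx);
    void          llm_context_free(llm_context * ctx);

    // State I/O: a module (file, socket, mmap, serial port, ...) only needs these two calls.
    size_t        llm_context_serialize(const llm_context * ctx, uint8_t * dst, size_t dst_size);
    bool          llm_context_deserialize(llm_context * ctx, const uint8_t * src, size_t src_size);
}

The point being that anything able to fill or drain a byte buffer (a file, a socket, an mmap'd region) becomes a state backend without touching the core program.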

@KerfuffleV2
Collaborator

When implementing the feature, this comment may be useful: rustformers/llm#38 (comment)

TL;DR: memory_k and memory_v are compressible in proportion to how much of the context space is still unused. For example, with --ctx_size 512, saving the state after feeding in 1-2 tokens gives about a 95% reduction, after 256 tokens around 50%, and after 511 tokens virtually no compression. That last case is an edge case, though.

zstd compression at the lowest/fastest setting works well, and increasing the compression level doesn't help much. Since the memory is quite large (2 GB with --ctx_size 2048), being able to save something like the Alpaca instruction prefix in about 30 MB is a huge difference compared to having to load/save a massive 2 GB file.

So I think it's really worth using some sort of lightweight compression scheme for at least the memory tensors when saving/restoring state.
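A sketch of what that could look like with libzstd (level 1 is the fastest standard level; the helper names are made up, and the caller is assumed to keep the uncompressed size around for decompression):

#include <zstd.h>     // link with -lzstd
#include <cstddef>
#include <cstdio>
#include <vector>

// Compress an already-contiguous buffer (e.g. the flattened memory_k/memory_v data).
std::vector<char> compress_kv(const void * data, size_t size) {
    std::vector<char> out(ZSTD_compressBound(size));
    const size_t n = ZSTD_compress(out.data(), out.size(), data, size, /*level=*/1);
    if (ZSTD_isError(n)) {
        fprintf(stderr, "zstd: %s\n", ZSTD_getErrorName(n));
        return {};
    }
    out.resize(n);
    return out;
}

// dst_size must be the original (uncompressed) size, stored alongside the compressed data.
bool decompress_kv(const std::vector<char> & in, void * dst, size_t dst_size) {
    const size_t n = ZSTD_decompress(dst, dst_size, in.data(), in.size());
    return !ZSTD_isError(n) && n == dst_size;
}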

@xaedes
Collaborator

xaedes commented Apr 14, 2023

I have implemented functions for getting and setting the rest of the model state.
This additionally includes: the random number generator state, logits, embedding and kv_cache.

It was necessary to store the logits so that we can eval tokens, save state, restart the program, load state and then sample.
Otherwise the sampling did not have access to the required logits and indeed segfaulted on the initially empty logits vector.

For completeness I also stored the embedding vector.

Because the whole state is not in one contiguous memory buffer, I decided on an output-pointer parameter for getting the state data.
The user is responsible for allocating the memory that the state is written to (see the usage sketch after the commit link).

xaedes@075f5f5
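Roughly, usage looks like this (the declarations are paraphrased from the linked commit and should be treated as assumptions; check llama.h in the actual change for the exact signatures):

#include <cstddef>
#include <cstdint>
#include <vector>

struct llama_context;   // opaque handle from llama.h

// Assumed API shape: caller allocates, library fills the buffer.
size_t llama_get_state_size(llama_context * ctx);                       // rng + logits + embedding + kv_cache
size_t llama_copy_state_data(llama_context * ctx, uint8_t * dst);       // write state into the caller's buffer
size_t llama_set_state_data(llama_context * ctx, const uint8_t * src);  // restore state from a buffer

void roundtrip_state(llama_context * ctx) {
    std::vector<uint8_t> buf(llama_get_state_size(ctx));   // caller owns the allocation
    llama_copy_state_data(ctx, buf.data());                // serialize the full state into buf
    // ... write buf to disk, send it over a socket, etc. ...
    llama_set_state_data(ctx, buf.data());                 // restore later, possibly in another process
}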

@abetlen
Collaborator

abetlen commented Apr 15, 2023

@xaedes do you have a PR for this?

@xaedes
Collaborator

xaedes commented Apr 21, 2023

Just created the pull request: #1105

@ejones
Collaborator

ejones commented Apr 29, 2023

I believe #1169 covers this

@ejones ejones closed this as completed Apr 29, 2023
Deadsg pushed a commit to Deadsg/llama.cpp that referenced this issue Dec 19, 2023