
Store KV cache of computed prompts to disk to avoid re-compute in follow-up runs #64

Closed
ggerganov opened this issue Mar 12, 2023 · 10 comments
Labels
enhancement New feature or request good first issue Good for newcomers help wanted Extra attention is needed high priority Very important issue 🦙. llama

Comments

@ggerganov
Owner

Idea from: #23 (comment)

We can add a --cache_prompt flag that, when set, dumps the computed KV cache from prompt processing to disk, in a file whose name is derived from a hash of the prompt. On the next run, it will first check whether a KV cache is stored for that hash and, if so, load it straight from disk instead of recomputing it.
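For illustration, a minimal sketch of that control flow, assuming nothing about the actual llama.cpp internals (the names and file naming scheme here are illustrative; the elided parts are where the existing prompt eval and the KV dump/restore would go):

#include <fstream>
#include <functional>
#include <string>

int main(int argc, char ** argv) {
    const bool cache_prompt = true;                          // would come from the --cache_prompt flag
    const std::string prompt = argc > 1 ? argv[1] : "Hello";

    // Cache file name derived from a hash of the prompt text.
    const std::string path =
        "prompt-" + std::to_string(std::hash<std::string>{}(prompt)) + ".kvcache";

    if (cache_prompt && std::ifstream(path).good()) {
        // ... memcpy the stored KV data back into memory_k / memory_v and restore n_past ...
    } else {
        // ... run the usual prompt evaluation ...
        if (cache_prompt) {
            // ... dump memory_k / memory_v and n_past to `path` ...
        }
    }
    return 0;
}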

Great task for contributing to the project!

@ggerganov ggerganov added enhancement New feature or request help wanted Extra attention is needed good first issue Good for newcomers high priority Very important issue 🦙. llama labels Mar 12, 2023
@ggerganov ggerganov pinned this issue Mar 13, 2023
@Msa360

Msa360 commented Mar 14, 2023

@ggerganov Do you want to store k, v or K, V?

@im-not-tom

im-not-tom commented Mar 17, 2023

Hello and sorry to bother you,

I made an attempt at doing this and would like to ask for a correction, as I have probably gravely misunderstood something.

Without dumping my entire code: I have two functions that basically boil down to memcpy-ing the contents of model.memory_[kv]->data in and out. My understanding is that this, along with saving/restoring n_past, should be enough to save the current state of the prompt.

Am I mistaken? I'm asking, of course, simply because the result of my attempt is a state that appears entirely random, often outputting nothing at all or garbage text.

// Copy the current contents of the k/v memory tensors into the savepoint buffer.
bool gptj_make_savepoint(const struct gptj_model & model, gptj_savepoint & savepoint) {
    size_t nelements = ggml_nelements(model.memory_k);
    assert(nelements == ggml_nelements(model.memory_v));

    // Only F32 memory tensors are handled here.
    assert(ggml_type_size(model.memory_k->type) == sizeof(float));
    assert(ggml_type_size(model.memory_v->type) == sizeof(float));

    // Layout: [ memory_k | memory_v ]
    savepoint.memory.clear();
    savepoint.memory.resize(nelements * 2);
    memcpy(
        &savepoint.memory[0],
        ggml_get_data(model.memory_k),
        sizeof(float) * nelements
    );
    memcpy(
        &savepoint.memory[nelements],
        ggml_get_data(model.memory_v),
        sizeof(float) * nelements
    );
    return true;
}

// Copy the savepoint buffer back into the model's k/v memory tensors.
bool gptj_apply_savepoint(const gptj_savepoint & savepoint, struct gptj_model & model) {
    size_t nelements = savepoint.memory.size() / 2;
    assert(nelements == ggml_nelements(model.memory_k));
    assert(nelements == ggml_nelements(model.memory_v));

    memcpy(
        ggml_get_data(model.memory_k),
        &savepoint.memory[0],
        sizeof(float) * nelements
    );
    memcpy(
        ggml_get_data(model.memory_v),
        &savepoint.memory[nelements],
        sizeof(float) * nelements
    );
    return true;
}
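For completeness, a rough sketch of how such a savepoint could be written to and read back from disk. The struct definition below is illustrative (the original isn't shown above), and the n_past field is an assumption based on the surrounding discussion:

#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

struct gptj_savepoint {
    std::vector<float> memory;  // flattened memory_k followed by memory_v
    int n_past = 0;             // assumed field, per the discussion above
};

bool gptj_savepoint_write(const gptj_savepoint & sp, const std::string & path) {
    std::ofstream out(path, std::ios::binary);
    if (!out) return false;
    const uint64_t n = sp.memory.size();
    out.write(reinterpret_cast<const char *>(&sp.n_past), sizeof(sp.n_past));
    out.write(reinterpret_cast<const char *>(&n), sizeof(n));
    out.write(reinterpret_cast<const char *>(sp.memory.data()), n * sizeof(float));
    return out.good();
}

bool gptj_savepoint_read(const std::string & path, gptj_savepoint & sp) {
    std::ifstream in(path, std::ios::binary);
    if (!in) return false;
    uint64_t n = 0;
    in.read(reinterpret_cast<char *>(&sp.n_past), sizeof(sp.n_past));
    in.read(reinterpret_cast<char *>(&n), sizeof(n));
    sp.memory.resize(n);
    in.read(reinterpret_cast<char *>(sp.memory.data()), n * sizeof(float));
    return in.good();
}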

@setzer22

setzer22 commented Mar 17, 2023

@im-not-tom I got this working on my project and those are basically the steps I followed 👍

I have verified this works by saving the memory, restoring it in a different process (after loading the model again), and then comparing the two logits vectors returned by llama_eval (the one from the original process and the one from the new process). If you do it right, the results should be identical.
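A small sketch of that comparison, for reference (with eps = 0 it demands bit-identical logits, as described):

#include <cmath>
#include <cstdio>
#include <vector>

// Element-wise comparison of logits from two runs.
bool logits_match(const std::vector<float> & a, const std::vector<float> & b, float eps = 0.0f) {
    if (a.size() != b.size()) return false;
    for (size_t i = 0; i < a.size(); ++i) {
        if (std::fabs(a[i] - b[i]) > eps) {
            fprintf(stderr, "mismatch at %zu: %f vs %f\n", i, a[i], b[i]);
            return false;
        }
    }
    return true;
}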

@LostRuins
Collaborator

LostRuins commented Mar 17, 2023

This requires a future input prompt to be identical to the original prompt, correct? So if I wanted to swap out a few words early in the text each time, it would still require reprocessing everything.

I'm struggling to understand why Hugging Face Transformers doesn't have this issue. There, generating from a 500-word input prompt seems to have the same latency as from a 3-word prompt (constant time), whereas with llama.cpp the prompt processing time scales linearly with prompt length.

@setzer22 setzer22 mentioned this issue Mar 18, 2023
@anzz1
Contributor

anzz1 commented Mar 21, 2023

This is a great idea.
As explained in my comment here: #23 (comment)

I think the right way forward would be to separate the concepts of the immutable model and its mutable context into their own separate, simple structs.

Those could then be manipulated and evaluated against in whatever ways new features and implementations see fit: load/save from disk, mmap to memory, transfer over sockets, even to/from hardware devices like a serial port; serve multiple instances by creating threads, or share the model and/or context between processes.

Currently I see the problem of having multiple concurrent PRs, each of which tries to implement new functionality by directly modifying the main program in its own way. A simple C API to access the 'model' and 'context' structs (see the sketch at the end of this comment) could keep the main program lean, clean and fast, while all sorts of new functionality could be added as separate modules that interface with this API.

You've done an absolutely fantastic job making this minimal, fast and ultra-portable zero-dependency port of LLaMA. It's absolutely delightful in its current state, and I think modularity would be the right approach going forward, instead of growing the main program into a large monolith with a web of non-portable #ifdefs scattered everywhere. With every functionality-adding module living in its own .cpp file, independent of the others, any functionality could simply be added or left out via the makefile.

I could see this spawning a whole ecosystem: a new "modules" directory could be added to the repo root, and people could write whatever modules add new functionality. Living inside the modules folder, separated from the main program and included via makefile options, they could also conform less strictly to the rules of no dependencies and full platform compatibility. That would allow people to build new functionality without accounting for every platform, and let users opt in to the features they want with nothing forced upon them.

If a non-modular approach were taken, it would inevitably lead this marvelous, minimalistic codebase to grow into a large monolith and force people to fork the minimal version and cherry-pick the bugfixes and features they want in their own forks, creating a divergence that, in my view, would hurt the project in the long run.
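One possible shape for such a split, purely as an illustration (none of these names are actual llama.cpp symbols):

#include <cstddef>
#include <cstdint>

struct llm_model;     // immutable weights: loaded once, shareable between contexts
struct llm_context;   // mutable state: KV cache, n_past, RNG, logits, ...

extern "C" {
    llm_model *   llm_model_load(const char * path);
    void          llm_model_free(llm_model * model);

    llm_context * llm_context_new(const llm_model * model, int n_ctx);
    void          llm_context_free(llm_context * ctx);

    // State I/O: a module (file, socket, mmap, serial port, ...) only needs these two calls.
    size_t        llm_context_serialize(const llm_context * ctx, uint8_t * dst, size_t dst_size);
    bool          llm_context_deserialize(llm_context * ctx, const uint8_t * src, size_t src_size);
}

The point being that anything able to fill or drain a byte buffer (a file, a socket, an mmap'd region) becomes a state backend without touching the core program.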

@KerfuffleV2
Collaborator

When implementing the feature, this comment may be useful: rustformers/llm#38 (comment)

TL;DR: memory_k and memory_v are compressible in proportion to how much of the context space is still unused. For example, with --ctx_size 512, saving the state after feeding in 1-2 tokens gives about a 95% reduction, after 256 tokens around 50%, and after 511 tokens virtually no compression. That last case is an edge case, though.

zstd compression at the lowest/fastest setting works well, and increasing the compression level doesn't help much. Since the memory is quite large (2 GB with --ctx_size 2048), being able to save something like the Alpaca instruction prefix in about 30 MB is a huge difference compared to having to load/save a massive 2 GB file.

So I think it's really worth using some sort of lightweight compression scheme for at least the memory tensors when saving/restoring state.
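A sketch of what that could look like with libzstd (level 1 is the fastest standard level; the helper names are made up, and the caller is assumed to keep the uncompressed size around for decompression):

#include <zstd.h>     // link with -lzstd
#include <cstddef>
#include <cstdio>
#include <vector>

// Compress an already-contiguous buffer (e.g. the flattened memory_k/memory_v data).
std::vector<char> compress_kv(const void * data, size_t size) {
    std::vector<char> out(ZSTD_compressBound(size));
    const size_t n = ZSTD_compress(out.data(), out.size(), data, size, /*level=*/1);
    if (ZSTD_isError(n)) {
        fprintf(stderr, "zstd: %s\n", ZSTD_getErrorName(n));
        return {};
    }
    out.resize(n);
    return out;
}

// dst_size must be the original (uncompressed) size, stored alongside the compressed data.
bool decompress_kv(const std::vector<char> & in, void * dst, size_t dst_size) {
    const size_t n = ZSTD_decompress(dst, dst_size, in.data(), in.size());
    return !ZSTD_isError(n) && n == dst_size;
}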

@xaedes
Collaborator

xaedes commented Apr 14, 2023

I have implemented functions for getting and setting the rest of the model state.
This additionally includes: the random number generator state, logits, embedding and kv_cache.

It was necessary to store the logits so that we can eval tokens, save state, restart the program, load state and then sample.
Otherwise the sampling did not have access to the required logits and indeed segfaulted on the initially empty logits vector.

For completeness I also stored the embedding vector.

Because the whole state is not in one contiguous memory buffer, I decided on an output-pointer parameter for getting the state data.
The user is responsible for allocating the memory that the state is written to (see the usage sketch after the commit link).

xaedes@075f5f5
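Roughly, usage looks like this (the declarations are paraphrased from the linked commit and should be treated as assumptions; check llama.h in the actual change for the exact signatures):

#include <cstddef>
#include <cstdint>
#include <vector>

struct llama_context;   // opaque handle from llama.h

// Assumed API shape: caller allocates, library fills the buffer.
size_t llama_get_state_size(llama_context * ctx);                       // rng + logits + embedding + kv_cache
size_t llama_copy_state_data(llama_context * ctx, uint8_t * dst);       // write state into the caller's buffer
size_t llama_set_state_data(llama_context * ctx, const uint8_t * src);  // restore state from a buffer

void roundtrip_state(llama_context * ctx) {
    std::vector<uint8_t> buf(llama_get_state_size(ctx));   // caller owns the allocation
    llama_copy_state_data(ctx, buf.data());                // serialize the full state into buf
    // ... write buf to disk, send it over a socket, etc. ...
    llama_set_state_data(ctx, buf.data());                 // restore later, possibly in another process
}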

@abetlen
Collaborator

abetlen commented Apr 15, 2023

@xaedes do you have a PR for this?

@xaedes
Collaborator

xaedes commented Apr 21, 2023

Just created the pull request: #1105

@ejones
Collaborator

ejones commented Apr 29, 2023

I believe #1169 covers this

@ejones ejones closed this as completed Apr 29, 2023
Deadsg pushed a commit to Deadsg/llama.cpp that referenced this issue Dec 19, 2023