Implement caching for evaluated prompts #44

Closed
abetlen opened this issue Apr 8, 2023 · 40 comments
Labels: enhancement (New feature or request), high-priority

Comments

@abetlen (Owner) commented Apr 8, 2023

The goal of this feature is to reduce latency for repeated calls to the chat_completion api by saving the kv_cache keyed by the prompt tokens.

The basic version of this is to simply save the kv_state after the prompt is generated.

Additionally, we should investigate whether it's possible to save and restore the kv_state after the completion has been generated as well.
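
In pseudocode, the idea might look something like the sketch below (save_kv_state / load_kv_state are hypothetical stand-ins for whatever state API llama.cpp ends up exposing):

prompt_cache = {}  # maps tuple(prompt_tokens) -> opaque saved kv state

def eval_prompt_with_cache(llama, prompt_tokens):
    key = tuple(prompt_tokens)
    if key in prompt_cache:
        load_kv_state(llama, prompt_cache[key])   # hypothetical: restore the saved state, skip re-eval
    else:
        llama.eval(prompt_tokens)                 # evaluate the prompt normally
        prompt_cache[key] = save_kv_state(llama)  # hypothetical: snapshot the kv_cache keyed by the prompt
    return llama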

@MillionthOdin16 (Contributor):

I don't understand why we don't just use interactive mode. Almost all the users coming from llama.cpp are used to an interface where they send a message and get a quick response, because there is no clearing of state between messages from the user, meaning there's also no need to reload the state. As I understood it, the KV cache was a way to store the prompt state, because the prompt is used multiple times during the course of a conversation and caching it helps improve responsiveness during long conversations. Given the way it's used in the base llama.cpp executable, and the fact that in the current implementation of interactive mode storing the entire conversation state won't improve performance (it would only allow continuation of previous conversations in a different session), I don't know that this is something they're going to add in the immediate future.

For me, being able to get completions from the bot with full context of the ongoing conversation is my main use case. So there's pretty much no situation where I would want the current conversation context cleared or reset.

And I thought this was similar for the OpenAI implementation, where you send the current message but don't need to send the full message history. Any type of recomputation or loading of model state decreases performance and makes it slower than the base llama.cpp implementation, imo.

After all, I think if people are using chat mode, from a user perspective they want a continuous and performant chat, even if that means running multiple models and independent contexts simultaneously, which reduces scalability in the short term without the ability to load and save states.

@abetlen (Owner, Author) commented Apr 11, 2023

@MillionthOdin16 are you talking about the OpenAI server or just using the Llama class? For the actual OpenAI API each request is entirely independent of all other requests (e.g. you always send the full history to the /chat/completions endpoint), so you do need to reset the model each time. This is why I'm looking into the KV state solution, so we can just reload the state if we've seen, e.g., the first n-1 messages of an n-message chat sent over before.

If you're just looking for interactive mode, I believe that's been implemented in this example, if you just want to use it in a program and don't care about the API: https://github.com/abetlen/llama-cpp-python/blob/main/examples/low_level_api/low_level_api_chat_cpp.py

@MillionthOdin16 (Contributor):

I'm talking about the OpenAI server. My point is that from the user perspective, the most important factor for chat completions is speed of responses. Unfortunately, llama.cpp takes longer to process an initial prompt the longer it is. So for the chat completions endpoint this creates an issue, because the longer the conversation gets, the longer it takes to get a response. The reason we have the issue is that llama.cpp handles prompt processing before generating a response differently than the usual GPU implementations.

So I'm saying that the most efficient solution in this instance might be to not clear the context, and to save that processing time for each subsequent completion by keeping the session going. It diverges from how OpenAI implements it, but it's the only option we have right now, and chat completions isn't usable without it because it's too slow.

@MillionthOdin16 (Contributor):

I'm basically advocating for a temporary hack to prevent the context from being cleared during chat completions so that we get a significant performance boost, until either we get a proper state saving capability, or the prompt processing time issue is resolved.

The issue is frustrating because we're so close to having an API that is performant and chat-capable, but there are just a couple of things holding it back, and I'm advocating for a temporary hack to allow good performance until we can properly implement it.

@0xdevalias commented Apr 11, 2023

Unfortunately, llama.cpp takes longer to process an initial prompt the longer it is

@MillionthOdin16 Is that still meaningfully so since the recent performance regressions appear to have been fixed?

@MillionthOdin16 (Contributor):

So that's not what I mean in this case. I created issue 603 on llama.cpp, and now that we have that performance boost, it would be awesome to get as much of a boost in the API over here as we can. I meant the issue/undetermined cause here: ggerganov/llama.cpp#719

I've seen people more familiar with LLMs mention some oddities about the initial processing in the past, but I haven't seen a straightforward explanation. As I understand it, llama.cpp differs in how it processes the initial prompt before generating tokens, and it's much slower than the transformers implementation (CPU vs GPU aside).

So I was just saying that if we get a performance boost from that avenue, as well as the ability to store a conversation's state, a proper implementation will be much faster than it is at the moment, and we wouldn't need this workaround. Hope that makes sense.

Right now there are so many small, different issues going on in the main repo that it's hard to keep track of, haha.

@oobabooga (Contributor):

+1 to this. Many people are requesting this feature here: oobabooga/text-generation-webui#866

It would be nice to have a use_cache=True flag (or something similar) in Llama.generate.

@abetlen (Owner, Author) commented Apr 13, 2023

@oobabooga I'm still working on the caching API, but for now I've added a reset option to Llama.generate which defaults to True; if you want to continue from the end of a previous generation just call with reset=False.
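
For example, a continuation might look roughly like this (a sketch; parameter names follow the 0.1.x-era generate signature, the model path is a placeholder, and tokenizer details like BOS handling are glossed over):

import itertools
import llama_cpp

llama = llama_cpp.Llama(model_path="./models/7B/ggml-model-q4_0.bin")  # placeholder path

# First generation: evaluates the prompt and keeps the resulting context in memory.
tokens = llama.tokenize(b"My favorite season is")
first = list(itertools.islice(
    llama.generate(tokens, top_k=40, top_p=0.95, temp=0.8, repeat_penalty=1.1), 16))

# Continue from the end of the previous generation: pass only the new text and
# reset=False so the existing context is not cleared.
more = list(itertools.islice(
    llama.generate(llama.tokenize(b" and I like to walk"),
                   top_k=40, top_p=0.95, temp=0.8, repeat_penalty=1.1, reset=False), 16))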

@oobabooga (Contributor):

@abetlen is it safe to use reset=False at all times, or will that cause incorrect generations if you completely replace the prompt with a new one?

@ghost commented Apr 14, 2023

@oobabooga What reset=False basically does is maintain the current model context. So once you feed in a prompt with reset=False, the prompt remains inside llama.cpp and will be kept in memory for faster generation of new output.

E.g. you feed in the prompt "My favorite season is", and the model replies "spring because everything is growing. I like to walk". Generation stops. You then feed in "outside in the spring weather" to generate (not the full prompt!), and to the model the full prompt is now "My favorite season is spring because everything is growing. I like to walk outside in the spring weather".

I tested this in your own webui 😁 by setting reset=False and doing generations in notebook mode; if I clear the textbox, the AI still continues chatting as if the original text was still there. The text is also generated right away with no delay, as it doesn't need to digest the prompt again!

This is getting a bit off-topic, but to implement this in the webui I think the easiest way would be to save the prompt right before sending it to generate. Then, when the user calls generate again, you can compare the saved prompt with the user's prompt to see if they are merely appending to the existing prompt or editing it. If it is an append, call with reset=False and send only the appended text over; otherwise send everything over and force a reset. A sketch of that check is below.
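
Something like this (webui-side pseudo-logic, not part of llama-cpp-python itself):

def plan_generation(saved_prompt: str, new_prompt: str):
    """Decide whether the user's new prompt is a pure append or an edit."""
    if saved_prompt and new_prompt.startswith(saved_prompt):
        # Pure append: keep the model context and send only the new tail.
        return {"reset": False, "text_to_send": new_prompt[len(saved_prompt):]}
    # Edited prompt: force a reset and re-ingest everything.
    return {"reset": True, "text_to_send": new_prompt}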

@gjmulder (Contributor):

for now I've added a reset option to Llama.generate which defaults to True; if you want to continue from the end of a previous generation just call with reset=False.

Will this be exposed through the REST API at some point?

@abetlen (Owner, Author) commented Apr 15, 2023

@oobabooga @eiery @gjmulder this is now pushed to main, just looking for someone to test generate and __call__ before publishing the PyPI release. The code is a bit of a mess right now, but the interface should remain the same.

The process to set the cache from code is

llama = llama_cpp.Llama(...)
llama.set_cache(llama_cpp.LlamaCache())  # pass a LlamaCache instance

then you can call generate and __call__ as usual, and if the prompt contains the previously generated tokens or the previously returned string, the cache will just continue from after those tokens / bytes. For example:
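
(A sketch using the OpenAI-style completion dict that __call__ returns.)

out = llama("Q: Name the planets in the solar system. A:", max_tokens=48)
answer = out["choices"][0]["text"]

# The follow-up prompt starts with the previous prompt plus the returned text,
# so the cached tokens are reused and only the new part needs to be evaluated.
followup = llama("Q: Name the planets in the solar system. A:" + answer +
                 "\nQ: Which one is the largest? A:", max_tokens=32)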

If you're using the REST server it's enough to set the CACHE=1 environment variable.

If it works like you guys expect I'll publish to PyPI tonight or tomorrow.

@gjmulder (Contributor):

$ CACHE=1 python3 -m llama_cpp.server ?

@oobabooga (Contributor):

@abetlen I have made a test where I generated 80 tokens, and then generated another 80 tokens on top of the result without modifying the prompt. These were the results:

With self.model.set_cache(LlamaCache):

Output generated in 51.90 seconds (1.54 tokens/s, 80 tokens, context 761, seed 559310896)
generate cache hit
Output generated in 61.77 seconds (1.30 tokens/s, 80 tokens, context 841, seed 1141193019)

Without set_cache:

Output generated in 51.90 seconds (1.54 tokens/s, 80 tokens, context 761, seed 1808425735)
Output generated in 55.96 seconds (1.43 tokens/s, 80 tokens, context 841, seed 1321984670)

I can see that there is a new generate cache hit message, but I don't seem to get any performance improvement. Not sure if I am doing this correctly.

@abetlen (Owner, Author) commented Apr 15, 2023

$ CACHE=1 python3 -m llama_cpp.server ?

@gjmulder Correct

@oobabooga whoops, for generate I implemented the check but didn't actually remove the old tokens from the list of tokens to eval. Should be fixed now.

@ghost commented Apr 15, 2023

@abetlen Caching is working well for me in your latest release 🎊 .

I'm running it using a modified oobabooga UI with self.model.set_cache(LlamaCache) set and generation starts instantly with no ingestion delay if the previous text is not altered. If the text is altered we get a cache miss and regenerate fully with no issues. The performance increase with caching is huge as seen below.

Loading llama-13B-ggml...
llama.cpp weights detected: models/llama-13B-ggml/ggml-model-q4_0.bin

llama.cpp: loading model from models/llama-13B-ggml/ggml-model-q4_0.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =  73.73 KB
llama_model_load_internal: mem required  = 9807.47 MB (+ 1608.00 MB per state)
llama_init_from_file: kv self size  = 1600.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 
Loading the extension "gallery"... Ok.
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
Output generated in 33.95 seconds (0.38 tokens/s, 13 tokens, context 39, seed 558843024)
generate cache hit
Output generated in 14.88 seconds (1.34 tokens/s, 20 tokens, context 61, seed 1039449246)
generate cache hit
Output generated in 12.17 seconds (1.31 tokens/s, 16 tokens, context 88, seed 523733239)
Output generated in 68.35 seconds (1.08 tokens/s, 74 tokens, context 121, seed 912952673)
generate cache hit
Output generated in 31.92 seconds (1.82 tokens/s, 58 tokens, context 210, seed 1327347234)
Output generated in 66.78 seconds (0.25 tokens/s, 17 tokens, context 349, seed 1946798230)
generate cache hit
Output generated in 24.49 seconds (1.31 tokens/s, 32 tokens, context 379, seed 429283322)
generate cache hit
Output generated in 9.80 seconds (1.12 tokens/s, 11 tokens, context 420, seed 559845450)
Output generated in 77.58 seconds (0.10 tokens/s, 8 tokens, context 472, seed 1239183125)
generate cache hit
Output generated in 17.79 seconds (1.52 tokens/s, 27 tokens, context 492, seed 2013844718)
generate cache hit
Output generated in 7.60 seconds (1.32 tokens/s, 10 tokens, context 527, seed 609475087)
Output generated in 103.58 seconds (0.19 tokens/s, 20 tokens, context 564, seed 1553215150)

@abetlen (Owner, Author) commented Apr 15, 2023

@eiery very glad to hear!

Hopefully the llama_state API gets figured out in the base library soon, and then we're really talking: we could just restore the longest matching saved state from an LRU cache or something.

@abetlen (Owner, Author) commented Apr 16, 2023

@gjmulder or anyone else able to test the server? It's been working on my end, but I want an independent confirmation.

@ghost commented Apr 16, 2023

@eiery very glad to hear!

Hopefully the llama_state API gets figured out in the base library soon, and then we're really talking: we could just restore the longest matching saved state from an LRU cache or something.

Having such a cache would be helpful indeed especially if you do frequent editing. You could also afford to generate multiple replies with different parameters and let the user choose which one they like best.

Honestly, if that's implemented, performance should be excellent until you hit the 2048-token limit and need to rotate the buffer / do tricks like summarization. I guess caching the initial prompt will help if it's a long one, but ingesting over a thousand tokens for every generation will tack on a couple of minutes every time. Luckily there are smart people at llama.cpp working on that...

@abetlen (Owner, Author) commented Apr 16, 2023

@oobabooga @eiery Okay, I've pushed the 0.1.34 release to PyPI and the wheels should be building right now. This includes the new cache API. I'll keep this issue open to track proper cache support, and close #68.

@oobabooga (Contributor):

I have made a new test with llama-cpp-python==0.1.34 and I confirm that the second generation starts immediately when the cache is enabled. Very nice! I'm using it here oobabooga/text-generation-webui@d2ea925

@digiwombat:

I grabbed this. Confirmed speeds are up when hitting cache. Good times. Getting ~1t/s on a 5950X with a 30b model compared to ~0.2t/s before. No errors so far.

I will say that I'd somewhat expect clicking the continue button to always hit the cache, but that has not been the case. Not sure if it's a context-order issue (the context isn't being updated until the next send, rather than at the end of the generation) or a more naive comparison method (comparing the entire context buffer to the most recent context, where any mismatch forces a full regen), but I would expect a cache hit when clicking continue in the webui, assuming no edits to existing context. That could be non-trivial, but kobold.cpp's smartcontext implementation has helped there. Different use case (maintaining world/character data at the head of the context stack), obviously, but a chunk-based cache comparison could be valuable.

I will say, I don't know enough about how/whether context compounds, so maybe keeping later chunks unchanged would be a problem if you regenerate the first context without regenerating everything after it.

@gjmulder (Contributor):

Can I confirm that the cache is only for the /v1/chat/completions endpoint and not the /v1/completions endpoint?

I gave up on using the chat completions endpoint as it seemed to not understand roles and users when using Alpaca. I'm now using alpaca-lora-30B with the completions endpoint, which is producing better responses. 🤷‍♂️

@abetlen (Owner, Author) commented Apr 17, 2023

@gjmulder this should actually work for both APIs, is it not working for /v1/completions?

@gjmulder (Contributor) commented Apr 17, 2023

@abetlen I might be being stupid here... how do I tell for certain that it is enabled?

I'd need to generate the same text from the same prompt with and without caching.

@abetlen (Owner, Author) commented Apr 17, 2023

@gjmulder For the completion endpoint you would just need to pass in the prompt + returned text as the prompt the next time you call the API, for example:
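
(A sketch against a local server started with CACHE=1; the request/response shape follows the OpenAI-compatible /v1/completions API.)

import requests

url = "http://localhost:8000/v1/completions"
prompt = "The three primary colors are"

first = requests.post(url, json={"prompt": prompt, "max_tokens": 32}).json()
text = first["choices"][0]["text"]

# Send the previous prompt plus the returned text so the server can reuse the
# already-evaluated prefix from its cache.
second = requests.post(url, json={"prompt": prompt + text + " Mixing them gives",
                                  "max_tokens": 32}).json()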

@Priestru:

The open-source models we currently work with aren't good enough to produce clean output that never needs correction. Is there an option to keep the cache not just for the last generated prompt, but also (or instead) for the prompt one message before? This would let the user edit the last response and regenerate messages in exchange for a minor latency increase.

I've seen the idea of running two llama.cpp instances in parallel, where one is just used to store "state" in case it's needed, and they exchange those states between each other following the user's actions.

@snxraven:

ggerganov/llama.cpp#1105
This may be relevant

@snxraven:

The above was merged; we should be able to set the cache as needed.

@abetlen (Owner, Author) commented Apr 23, 2023

@snxraven merged in the low-level API here too. Currently working on an implementation for LlamaCache.

@snxraven:

Thank you for the hard work! I am super excited for this!!!

@Priestru commented Apr 24, 2023

There is much hope for success here. With only 1608.00 MB of RAM required for the state of a 13B model, the cost is quite affordable. If we can obtain a state from just one message prior, we would be able to freely regenerate outputs and make necessary fixes. This approach would reduce the context awaiting evaluation to only the most recent output and the new prompt, meaning minimal delay given the speedy 25 ms/token processing facilitated by cuBLAS. It seems we are very close to a highly performant implementation for a very large context. This would be especially beneficial for characters, because their context is written as a personality description and doesn't change; being able to avoid re-evaluating it each time is key to supporting large descriptions. We can go far beyond 2k tokens with this trick in the future.

@Priestru:

One other observation: when you get close to the context window limit, it begins to cut older messages. As a result, caching effectively doesn't work anymore because the previous prompt never matches. Is there anything that could be done about it?

@abetlen (Owner, Author) commented Apr 25, 2023

@snxraven and all, I've pushed the first version of the cache API. It uses the new llama.cpp state API but is currently limited to a single saved state per cache.

@snxraven commented Apr 25, 2023

@abetlen I attempted this with my current codebase but I am only able to see cache misses.

Has anything changed in the new code that would break my standard chat history?

This is the current code which was running the temp cache system:
https://git.ssh.surf/snxraven/llama-cpp-python-djs-bot/src/branch/main/llamabot.js

@abetlen (Owner, Author) commented Apr 25, 2023

@snxraven I did fix a bug that caused unnecessary cache misses, but that should only be in versions after v0.1.37. I would also check the http://localhost:8000/docs page and just test the cache there as well.

@snxraven:

@abetlen I'll give the docs test a try; my test with my codebase was on 0.1.38.

I'll post my findings here later :)

@abetlen (Owner, Author) commented Apr 25, 2023

I'm working through two issues at the moment with the caching. One is the best way to store the llama_state: for a 7B 4-bit quantized model with a context length of 2048 tokens, the llama_state size is ~1 GB. Currently I'm keeping the (single) llama_state object in memory for simplicity, but it's clear I'd probably need to move to a filesystem cache to handle larger cache sizes.

The other issue I'm trying to resolve is when to save the llama_state. Currently this is done after prompt processing, which makes sense for basic completions, but for chats it means the last returned message has to be re-processed each time.
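
One possible direction for the filesystem side is a simple disk-backed store keyed by a hash of the evaluated tokens (purely a sketch, not the library's implementation):

import hashlib
from pathlib import Path

CACHE_DIR = Path("./llama_state_cache")
CACHE_DIR.mkdir(exist_ok=True)

def _key(tokens) -> str:
    # Hash the evaluated token sequence to get a stable filename.
    return hashlib.sha256(repr(list(tokens)).encode("utf-8")).hexdigest()

def save_state(tokens, state_bytes: bytes) -> None:
    (CACHE_DIR / _key(tokens)).write_bytes(state_bytes)

def load_state(tokens):
    path = CACHE_DIR / _key(tokens)
    return path.read_bytes() if path.exists() else None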

@Priestru commented Apr 25, 2023

The other issue I'm trying to resolve is when to save the llama_state. Currently this is done after prompt processing, which makes sense for basic completions, but for chats it means the last returned message has to be re-processed each time.

This shouldn't be a problem. Swipes, regeneration, and editing of the last received message all rely on the state being preserved from before the latest message was generated. With the average output length almost never exceeding 200 tokens (~5 seconds), re-processing is a very cheap price to pay. Once we get a handle on these things I believe one could add a toggle for user preferences, but for a first implementation I believe this is a good compromise. The amount of control the user gains for the price of a little extra reprocessing is immeasurable.

Moreover, GUI front-ends that sit on top of the output, like SillyTavern, rely on editing the LLM's output automatically without the user's input: they cut extra lines, unnecessary symbols, and so on. At the llama-cpp-python level there is literally no way to know what has been cut or changed, so if the state is saved after the output is done it would never match.

I'm working through two issues at the moment with the caching. One is the best way to store the llama_state: for a 7B 4-bit quantized model with a context length of 2048 tokens, the llama_state size is ~1 GB. Currently I'm keeping the (single) llama_state object in memory for simplicity, but it's clear I'd probably need to move to a filesystem cache to handle larger cache sizes.

If we get an option to write the cache to an SSD then, setting aside the reduced RAM requirements, it would immediately give us the ability to load, at any time, a permanent cached state that already has the description, dialogue examples, and other information processed. Not only would this let us freely work with the rest of the temporary context (summarize, truncate, edit), it would also make it possible to chat simultaneously with multiple personalities with only one model loaded. The power this actually brings is hard to overestimate; it's so much more than just freeing RAM.

@abetlen (Owner, Author) commented May 5, 2023

I'm closing this issue in favour of #158.

The current behaviour of generate has been updated in the latest release to allow the model to continue from the end of the longest prompt prefix which has already been evaluated by the model.
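
Conceptually, the prefix matching is just (an illustrative sketch, not the exact library code):

def longest_evaluated_prefix(cached_tokens, prompt_tokens) -> int:
    """Number of leading tokens the new prompt shares with the already-evaluated sequence."""
    n = 0
    for a, b in zip(cached_tokens, prompt_tokens):
        if a != b:
            break
        n += 1
    return n

# Only prompt_tokens[n:] needs to be evaluated; the first n tokens reuse the existing kv cache.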

abetlen closed this as completed May 5, 2023
abetlen unpinned this issue May 5, 2023
xaptronic pushed a commit to xaptronic/llama-cpp-python that referenced this issue Jun 13, 2023