Cache Feature Request #95

Closed
snxraven opened this issue Apr 18, 2023 · 6 comments

Comments

@snxraven

The current implementation of caching is wonderful; it's been a great help in speeding up conversations.

I do notice that this trips up when a second user starts a conversation. Would it be possible to allow for multi-conversation caching?

The main issue currently is that the cache grows large over time, and if the second user submits a question and then the first user submits another, the first user's entire chat history gets re-run all over again.
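Roughly the shape I'm imagining, just as a sketch to illustrate the request (`save_state()` / `load_state()` are hypothetical placeholders here, not something the bindings currently expose):

```python
# Sketch only: key the cached model state by conversation so one user's
# question doesn't force a full re-run of another user's history.
# save_state()/load_state() are hypothetical, not existing API.
conversation_states = {}  # conversation_id -> opaque saved state

def answer(llm, conversation_id, prompt):
    state = conversation_states.get(conversation_id)
    if state is not None:
        llm.load_state(state)   # resume this conversation's KV cache (hypothetical)
    else:
        llm.reset()             # fresh context for a brand-new conversation
    output = llm(prompt, max_tokens=256)
    conversation_states[conversation_id] = llm.save_state()  # hypothetical
    return output
```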

@jmtatsch

I agree, the "dummy" caching feature is already really useful. It makes all the difference between me wanting to use this or going to OpenAI instead ;)

Regarding a real caching feature, we are waiting for upstream to persist the state correctly, right?

Hypothetically, how large would a single state be?
`llama_model_load_internal: mem required = 9807.47 MB (+ 1608.00 MB per state)`
Is that the relevant state size?

@abetlen
Owner

abetlen commented Apr 18, 2023

@snxraven @jmtatsch This is definitely high on my list; unfortunately, at the moment I'm blocked because I can't restore the model state. I've tried the kv_state API in both the Python bindings and plain C++, but I haven't been able to use it to cut down processing times or checkpoint the model.

If anyone gets even a basic example working with that API, I'd be happy to implement this.
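For reference, this is roughly the minimal round-trip I would consider a working example (a sketch only; it assumes the low-level bindings expose `llama_get_state_size`, `llama_copy_state_data`, and `llama_set_state_data`, mirroring the llama.cpp C API, and those names may still change upstream):

```python
import ctypes
import llama_cpp  # low-level ctypes bindings; state functions assumed to be bound

def snapshot(ctx):
    # Assumed API: copy the full state (KV cache, RNG, logits) into a buffer.
    size = llama_cpp.llama_get_state_size(ctx)
    buf = (ctypes.c_uint8 * size)()
    llama_cpp.llama_copy_state_data(ctx, buf)
    return bytes(buf)

def restore(ctx, data):
    # Assumed API: write a previously captured state back into the context.
    buf = (ctypes.c_uint8 * len(data)).from_buffer_copy(data)
    llama_cpp.llama_set_state_data(ctx, buf)
    # Note: the token history / eval position (n_past) would have to be tracked
    # and restored alongside this, or the restored KV cache saves no work.
```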

@snxraven
Author

A relevant issue has been opened by another dev over at llama.cpp: ggerganov/llama.cpp#1054

@snxraven
Author

Furthermore, this issue has been reopened, so it may be fixed within llama.cpp:
ggerganov/llama.cpp#730

If you would like, we can close this issue, since a solution is clearly on the way.

@SagsMug
Contributor

SagsMug commented Apr 21, 2023

I did once do this by simply running multiple instances of llama.
This works because mmap loads the model through the OS page cache, so you're effectively only penalized by the state size.
But this could be undesirable because of the CPU usage and because of keeping multiple multi-GB states in RAM.
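For example (a sketch; the model path is a placeholder, and it assumes `use_mmap` is left at its default of enabled):

```python
from llama_cpp import Llama

# Two independent contexts over the same mmap'd weights: the OS page cache
# shares the model file between them, so the RAM cost is roughly one copy of
# the weights plus one per-context state each (~1.6 GB per the numbers quoted
# above for that model).
llm_user_a = Llama(model_path="./models/7B/ggml-model.bin", use_mmap=True)
llm_user_b = Llama(model_path="./models/7B/ggml-model.bin", use_mmap=True)

print(llm_user_a("User A: Hello!", max_tokens=32))
print(llm_user_b("User B: Hi there!", max_tokens=32))
```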

@snxraven
Author

Closing this in favor of #44
