Description
I want to have longer conversations with the model, but as I understand it, the number of tokens in the prompt plus the number of tokens generated by the model has to stay below the context size (`n_ctx`). As the conversation develops, more tokens accumulate in `messages`, which eventually results in the following error:

`ValueError: Requested tokens (2083) exceed context window of 2048`
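To make the setup concrete, here is a minimal sketch of the kind of chat loop I mean (the model path is just a placeholder): every turn appends to `messages`, so the prompt keeps growing until the error above is raised.

```python
from llama_cpp import Llama

# Placeholder model path; n_ctx matches the 2048-token window from the error above.
llm = Llama(model_path="./models/llama-model.gguf", n_ctx=2048)

messages = [{"role": "system", "content": "You are a helpful assistant."}]

while True:
    user_input = input("> ")
    messages.append({"role": "user", "content": user_input})

    # The whole history is sent as the prompt, so the token count grows
    # every turn until it exceeds n_ctx and the ValueError is raised.
    response = llm.create_chat_completion(messages=messages)
    reply = response["choices"][0]["message"]["content"]
    print(reply)

    messages.append({"role": "assistant", "content": reply})
```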
I am new to all this, but I believe llama.cpp handles this with the `n_keep` parameter (here is a helpful discussion). However, I can't find a way to avoid the error above using `llama-cpp-python`. I also think the chat context has to be reset at some point, because even with `n_keep` (which I don't know how to use through this repo), the context window would still eventually fill up.
So, my question is: how can I handle longer chats when using `create_chat_completion` and avoid the error above? Is there some functionality similar to `n_keep`, combined with resetting `total_tokens` and emptying the current context once it fills up?
Thank you!