Description
I want to have longer conversations with the model, but as I understand it, the number of tokens in the prompt plus the number of tokens generated by the model has to stay below the context size (`n_ctx`). As the conversation develops, more tokens accumulate in `messages`, which eventually results in the following error:

`ValueError: Requested tokens (2083) exceed context window of 2048`
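To make the setup concrete, here is a minimal sketch of the kind of chat loop I mean (the model path is just a placeholder): every turn appends to `messages`, so the prompt keeps growing until the error above is raised.

```python
from llama_cpp import Llama

# Placeholder model path; n_ctx matches the 2048-token window from the error above.
llm = Llama(model_path="./models/llama-model.gguf", n_ctx=2048)

messages = [{"role": "system", "content": "You are a helpful assistant."}]

while True:
    user_input = input("> ")
    messages.append({"role": "user", "content": user_input})

    # The whole history is sent as the prompt, so the token count grows
    # every turn until it exceeds n_ctx and the ValueError is raised.
    response = llm.create_chat_completion(messages=messages)
    reply = response["choices"][0]["message"]["content"]
    print(reply)

    messages.append({"role": "assistant", "content": reply})
```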
I am new to all this, but I believe llama.cpp handles this with the `n_keep` parameter (here is a helpful discussion). However, I can't find a way to avoid the error above using `llama-cpp-python`. I also think the chat context has to be reset at some point, because even with `n_keep` (which I don't know how to use through this repo), the context window would still eventually fill up.
So, my question is: how can I handle longer chats when using `create_chat_completion` and avoid the error above? Is there some functionality similar to `n_keep`, combined with resetting `total_tokens` and emptying the current context once it fills up?
Thank you!