[Question]: Very short responses when getting completions from llama-cpp-python #873
-
Contact Details
No response

What is your question?
I'm using LibreChat with the OpenAI endpoint, but instead of actual OpenAI it is pointed at llama-cpp-python, and the responses come back very short. Our previous chat UI was able to display messages of various lengths. Is there a setting I could change in order to allow longer responses?

More Details
Deployed using docker compose from the git repo. Docker version 24.0.5, from Ubuntu repositories.

What is the main subject of your question?
No response

Screenshots
No response
Replies: 2 comments 12 replies
-
Interesting, maybe max_tokens needs to be sent with the request. It looks like the default for that project is 16 (which is incredibly low), see abetlen/llama-cpp-python#542.

Add a line in api\app\clients\OpenAIClient.js after line 64:

```js
if (!this.modelOptions) {
  this.modelOptions = {
    ...modelOptions,
    model: modelOptions.model || 'gpt-3.5-turbo',
    temperature:
      typeof modelOptions.temperature === 'undefined' ? 0.8 : modelOptions.temperature,
    top_p: typeof modelOptions.top_p === 'undefined' ? 1 : modelOptions.top_p,
    presence_penalty:
      typeof modelOptions.presence_penalty === 'undefined' ? 1 : modelOptions.presence_penalty,
    stop: modelOptions.stop,
  };
}
this.modelOptions.max_tokens = 2000; // new line
```
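If you'd rather not hard-code 2000, a minimal variation of the same patch could read the limit from an environment variable instead. Note this is only a sketch: OPENAI_MAX_TOKENS is a hypothetical variable name, not an existing LibreChat setting.

```js
// Same spot in api\app\clients\OpenAIClient.js, right after the modelOptions block.
// OPENAI_MAX_TOKENS is a made-up name for illustration; fall back to 2000 when it
// is unset or not a valid number.
const envMaxTokens = Number.parseInt(process.env.OPENAI_MAX_TOKENS ?? '', 10);
this.modelOptions.max_tokens = Number.isNaN(envMaxTokens) ? 2000 : envMaxTokens;
```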
-
You can also try editing this line in llama-cpp-python.
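Independently of either patch, you can confirm whether the server honors a larger limit when the client sends one by calling llama-cpp-python's OpenAI-compatible endpoint directly with max_tokens in the request body. A rough sketch, assuming the server is on its default localhost:8000, Node 18+ for the built-in fetch, and a placeholder model name:

```js
// Quick check against the llama-cpp-python server: send max_tokens explicitly
// and compare the reply length with a request that omits it.
async function testMaxTokens() {
  const res = await fetch('http://localhost:8000/v1/chat/completions', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      model: 'local-model', // placeholder; depends on how the server was started
      messages: [{ role: 'user', content: 'Explain what a context window is.' }],
      max_tokens: 512, // without this, the server's low default truncates the answer
    }),
  });
  const data = await res.json();
  console.log(data.choices[0].message.content);
}

testMaxTokens().catch(console.error);
```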