@HanClinto (Contributor) commented on Jul 10, 2024

In #8402, we added the ability to set default request parameters on the command line.

One shortcoming of that PR is that the author failed to update the /props endpoint, so it was still reporting the compiled-in defaults rather than the values actually set on the command line.

Example:

1. Start the server with some changed values for grammar and n_ctx:

   ```sh
   ./llama-server -mu https://huggingface.co/TheBloke/phi-2-GGUF/resolve/main/phi-2.Q4_K_M.gguf --grammar-file "./grammars/no-e.gbnf" -c 1024
   ```

2. Navigate to http://localhost:8080/props and note that -- other than correctly listing the model that is loaded -- none of the default sampling parameters set via the CLI are shown:
```json
{
  "system_prompt": "",
  "default_generation_settings": {
    "n_ctx": 2048,
    "n_predict": -1,
    "model": "/Users/snakamoto/Library/Caches/llama.cpp/phi-2.Q4_K_M.gguf",
    "seed": -1,
    "temperature": 0.800000011920929,
    "dynatemp_range": 0,
    "dynatemp_exponent": 1,
    "top_k": 40,
    "top_p": 0.949999988079071,
    "min_p": 0.0500000007450581,
    "tfs_z": 1,
    "typical_p": 1,
    "repeat_last_n": 64,
    "repeat_penalty": 1,
    "presence_penalty": 0,
    "frequency_penalty": 0,
    "penalty_prompt_tokens": [],
    "use_penalty_prompt_tokens": false,
    "mirostat": 0,
    "mirostat_tau": 5,
    "mirostat_eta": 0.100000001490116,
    "penalize_nl": false,
    "stop": [],
    "n_keep": 0,
    "n_discard": 0,
    "ignore_eos": false,
    "stream": true,
    "logit_bias": [],
    "n_probs": 0,
    "min_keep": 0,
    "grammar": "",
    "samplers": [
      "top_k",
      "tfs_z",
      "typical_p",
      "top_p",
      "min_p",
      "temperature"
    ]
  },
  "total_slots": 1,
  "chat_template": ""
}
```

In particular, note that this endpoint (incorrectly) returns 2048 for n_ctx and a blank string for grammar.

There were a few possible ways to fix this, but the lowest-friction method was to initialize each slot's sampling parameters during init() by copying from the global context's sampling parameters. This is similar to the one-liner we used in #8402, but while that operated at runtime (when the jobs are fired off), this one operates at initialization.
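The shape of the change is roughly the following. This is a minimal sketch with simplified stand-in types; the real server structs (and their field lists) are much larger:

```cpp
#include <string>
#include <vector>

// Simplified stand-ins for llama.cpp's server internals;
// the real types carry many more fields.
struct sampling_params {
    float       temperature = 0.8f;
    std::string grammar;              // empty unless set via the CLI
};

struct server_slot {
    int             id = 0;
    sampling_params sparams;          // per-slot working copy
};

struct server_context {
    sampling_params          default_sparams;  // parsed from the command line
    std::vector<server_slot> slots;

    void init(int n_slots) {
        slots.resize(n_slots);
        for (size_t i = 0; i < slots.size(); ++i) {
            slots[i].id = (int) i;
            // The fix: seed every slot's sampling parameters from the
            // CLI-derived defaults once, at initialization, rather than
            // only applying them per-request at runtime as #8402 did.
            slots[i].sparams = default_sparams;
        }
    }
};
```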

The first slot is then chosen, and its default parameters are serialized to JSON and stored in default_generation_settings_for_props -- the same as before. It's nice to have this serialized and saved this way, because even if a slot's parameters are later overwritten by a request, the value stored in default_generation_settings_for_props will always represent the defaults.
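Continuing the sketch above (again with simplified names -- the real server uses nlohmann::json and a helper that dumps a slot's full parameter set), the snapshot is taken once at startup, before any request can touch the slots:

```cpp
#include <nlohmann/json.hpp>

using json = nlohmann::json;

// Captured once at startup and served verbatim by /props, so later
// per-request overwrites of slot state never leak into it.
static json default_generation_settings_for_props;

static void capture_default_settings(const server_context & ctx) {
    const server_slot & slot = ctx.slots.front();
    default_generation_settings_for_props = json {
        { "temperature", slot.sparams.temperature },
        { "grammar",     slot.sparams.grammar     },
        // ...remaining parameters serialized the same way...
    };
}
```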

And this is what the end result looks like when querying /props:

```json
{
  "system_prompt": "",
  "default_generation_settings": {
    "n_ctx": 1024,
    "n_predict": -1,
    "model": "/Users/snakamoto/Library/Caches/llama.cpp/phi-2.Q4_K_M.gguf",
    "seed": -1,
    "temperature": 0.800000011920929,
    "dynatemp_range": 0,
    "dynatemp_exponent": 1,
    "top_k": 40,
    "top_p": 0.949999988079071,
    "min_p": 0.0500000007450581,
    "tfs_z": 1,
    "typical_p": 1,
    "repeat_last_n": 64,
    "repeat_penalty": 1,
    "presence_penalty": 0,
    "frequency_penalty": 0,
    "penalty_prompt_tokens": [],
    "use_penalty_prompt_tokens": false,
    "mirostat": 0,
    "mirostat_tau": 5,
    "mirostat_eta": 0.100000001490116,
    "penalize_nl": false,
    "stop": [],
    "n_keep": 0,
    "n_discard": 0,
    "ignore_eos": false,
    "stream": true,
    "logit_bias": [],
    "n_probs": 0,
    "min_keep": 0,
    "grammar": "root ::= [^eE]*\n",
    "samplers": [
      "top_k",
      "tfs_z",
      "typical_p",
      "top_p",
      "min_p",
      "temperature"
    ]
  },
  "total_slots": 1,
  "chat_template": ""
}
```

It now contains the correct values of n_ctx = 1024 and our non-blank grammar -- success!
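A quick way to spot-check this locally (assuming curl and jq are available):

```sh
# Pull just the two overridden fields out of the defaults snapshot.
curl -s http://localhost:8080/props \
  | jq '.default_generation_settings | {n_ctx, grammar}'
```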

This solution adds no memory overhead, and I can't think of any edge cases where it falls down. Yesterday, when I first tried to fix this, I got wrapped around the axle with an overly complicated approach. I'm glad I slept on it for a day, because I think today's solution is much more elegant.

Tagging @ngxson in particular for review on this one.

Thank you!

@HanClinto merged commit 278d0e1 into ggml-org:master on Jul 11, 2024