
Feature request: more information in usage #342

@sowbug

I'm using Open WebUI as my ds4 frontend. It can display information about a response, such as tokens per second, if the backend includes it. Here is what a llama.cpp server returns; note the timings block next to the usage block:

curl http://a-llama-cpp-server:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-4-12B-it-qat",
    "messages": [{"role": "user", "content": "Say hello"}],
    "stream": false
  }'
{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello!",
        "reasoning_content": "The user said \"Say hello\".\nThe user wants me to say the word \"hello\" or a variation of it.\n\n    *   Standard: Hello!\n    *   Friendly: Hi there!\n    *   Enthusiastic: Hello, how can I help you today?"
      }
    }
  ],
  "created": 1780761818,
  "model": "gemma-4-12B-it-qat",
  "system_fingerprint": "b9518-7c158fbb4",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 67,
    "prompt_tokens": 18,
    "total_tokens": 85,
    "prompt_tokens_details": {
      "cached_tokens": 1
    }
  },
  "id": "chatcmpl-0BYrADMFNyBokAwslPFLB3TZLIQUg3BF",
  "timings": {
    "cache_n": 1,
    "prompt_n": 17,
    "prompt_ms": 234.305,
    "prompt_per_token_ms": 13.78264705882353,
    "prompt_per_second": 72.55500309425747,
    "predicted_n": 67,
    "predicted_ms": 2435.025,
    "predicted_per_token_ms": 36.34365671641791,
    "predicted_per_second": 27.515117914600463
  }
}
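The derived fields in the timings block follow from the raw counts and durations: each *_per_second value is tokens divided by seconds, and each *_per_token_ms value is milliseconds divided by tokens. The sketch below recomputes them from the four raw fields in the llama.cpp response above. The function name derive_rates is hypothetical and is not part of llama.cpp or Open WebUI.

```python
def derive_rates(timings: dict) -> dict:
    """Recompute llama.cpp-style derived timing fields from the raw counts.

    Assumes `timings` contains prompt_n, prompt_ms, predicted_n and
    predicted_ms, as in the llama.cpp response above.
    """
    rates = {}
    for phase in ("prompt", "predicted"):
        n = timings[f"{phase}_n"]
        ms = timings[f"{phase}_ms"]
        # Milliseconds spent per token; guard against a zero token count.
        rates[f"{phase}_per_token_ms"] = ms / n if n else 0.0
        # Tokens processed per second; guard against a zero duration.
        rates[f"{phase}_per_second"] = n / (ms / 1000.0) if ms else 0.0
    return rates


raw = {"prompt_n": 17, "prompt_ms": 234.305,
       "predicted_n": 67, "predicted_ms": 2435.025}
rates = derive_rates(raw)
print(f"{rates['predicted_per_second']:.2f} tok/s generation")
print(f"{rates['prompt_per_second']:.2f} tok/s prompt processing")
```

The recomputed values match the ones llama.cpp reports, so a backend only needs to measure token counts and elapsed time; the rates are simple arithmetic on top.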

Compare that with a response from the current ds4 server:

curl http://a-ds4-server:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ds4",
    "messages": [{"role": "user", "content": "Say hello"}],
    "stream": false
  }'
{
  "id": "chatcmpl-1",
  "object": "chat.completion",
  "created": 1780761748,
  "model": "ds4",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello! How can I help you today?",
        "reasoning_content": "We need to respond to the user's request. The user said \"Say hello\". That is a simple instruction. As an AI, I should comply and say hello. So I will respond with a greeting."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 6,
    "completion_tokens": 52,
    "total_tokens": 58,
    "prompt_tokens_details": {
      "cached_tokens": 0,
      "cache_write_tokens": 6
    }
  }
}
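To make the gap concrete: token counts in usage say how many tokens were produced, but not how long it took, so a client cannot compute a rate from them alone. A minimal client-side sketch, assuming it reads llama.cpp's timings shape (the function name is hypothetical, and this is not how Open WebUI itself is implemented):

```python
import json
from typing import Optional


def generation_tokens_per_second(response: dict) -> Optional[float]:
    """Return generation throughput if the response carries enough data.

    Looks for a llama.cpp-style top-level `timings` object. The `usage`
    block alone (token counts only) is not enough to compute a rate.
    """
    timings = response.get("timings")
    if not timings:
        return None
    # Prefer the precomputed rate when the backend supplies it.
    if "predicted_per_second" in timings:
        return float(timings["predicted_per_second"])
    # Otherwise derive it from the raw token count and duration.
    n = timings.get("predicted_n")
    ms = timings.get("predicted_ms")
    if n and ms:
        return n / (ms / 1000.0)
    return None


# Trimmed copies of the two responses above.
llama_cpp = json.loads('{"usage": {"completion_tokens": 67}, '
                       '"timings": {"predicted_n": 67, "predicted_ms": 2435.025, '
                       '"predicted_per_second": 27.515117914600463}}')
ds4 = json.loads('{"usage": {"prompt_tokens": 6, "completion_tokens": 52, '
                 '"total_tokens": 58}}')

print(generation_tokens_per_second(llama_cpp))
print(generation_tokens_per_second(ds4))
```

With the ds4 response the function returns None: there is nothing to display beyond raw token counts.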

If ds4 already tracks more statistics internally, such as prompt-processing and generation timings, it would be very helpful to plumb them through to the usage block. I'd be happy to work on this if it's a good first issue, but I wanted to propose the feature before dumping a PR on you.
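One possible shape for the server side, as a rough sketch only: ds4's internals aren't shown here, so RequestTimer, mark_prompt_done, mark_generation_done and to_timings are all hypothetical names. The idea is to take two timestamps per request (end of prompt processing, end of generation) and emit a block using the same field names as the llama.cpp timings shown above.

```python
import time
from dataclasses import dataclass, field


@dataclass
class RequestTimer:
    """Hypothetical per-request timer; ds4's real internals may differ.

    Call mark_prompt_done() when prompt processing (prefill) finishes and
    mark_generation_done() after the last token is produced.
    """
    start: float = field(default_factory=time.perf_counter)
    prompt_done: float = 0.0
    generation_done: float = 0.0

    def mark_prompt_done(self) -> None:
        self.prompt_done = time.perf_counter()

    def mark_generation_done(self) -> None:
        self.generation_done = time.perf_counter()

    def to_timings(self, prompt_n: int, predicted_n: int, cache_n: int = 0) -> dict:
        """Build a block using the same field names as llama.cpp's `timings`.

        prompt_n counts prompt tokens actually processed, excluding cached
        ones; in the llama.cpp example, prompt_tokens (18) = cache_n (1) +
        prompt_n (17).
        """
        prompt_ms = (self.prompt_done - self.start) * 1000.0
        predicted_ms = (self.generation_done - self.prompt_done) * 1000.0
        return {
            "cache_n": cache_n,
            "prompt_n": prompt_n,
            "prompt_ms": prompt_ms,
            "prompt_per_token_ms": prompt_ms / prompt_n if prompt_n else 0.0,
            "prompt_per_second": prompt_n / (prompt_ms / 1000.0) if prompt_ms else 0.0,
            "predicted_n": predicted_n,
            "predicted_ms": predicted_ms,
            "predicted_per_token_ms": predicted_ms / predicted_n if predicted_n else 0.0,
            "predicted_per_second": predicted_n / (predicted_ms / 1000.0) if predicted_ms else 0.0,
        }


# Usage with simulated work standing in for prefill and decoding.
timer = RequestTimer()
time.sleep(0.01)              # stand-in for prompt processing
timer.mark_prompt_done()
time.sleep(0.02)              # stand-in for token generation
timer.mark_generation_done()

response = {"usage": {"prompt_tokens": 6, "completion_tokens": 52, "total_tokens": 58}}
response["timings"] = timer.to_timings(prompt_n=6, predicted_n=52)
```

Reusing llama.cpp's field names would let clients that already understand that block display the numbers unchanged; whether ds4 should instead extend usage itself is a design choice for the maintainers.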