## Summary
The in-process vLLM backend (`mellea/backends/vllm.py`) never sets `mot.usage`, so callers always receive `None` for token counts, regardless of whether generation succeeded.
## Affected code
- `VLLMBackend.post_processing` records tool calls, the generate log, and telemetry metadata, but contains no step that populates `usage`.
- The `processing` method accumulates only the decoded text from `vllm.RequestOutput.outputs[0].text`; the token ID arrays are discarded.
## How other backends handle this
Every other backend that can compute token counts does so unconditionally in its post-processing step:
| Backend | Source of counts |
|---|---|
| HuggingFace | `GenerateDecoderOnlyOutput.sequences` shape |
| OpenAI / LiteLLM | `usage` field in the API response |
| Ollama | `prompt_eval_count` / `eval_count` in the response |
| WatsonX | `usage` field in the API response |
`vllm.RequestOutput` exposes both `prompt_token_ids` and `outputs[0].token_ids`, so counts can be derived without any extra API call.
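A minimal sketch of that derivation. The stand-in classes below only mirror the relevant shape of `vllm.RequestOutput` / `CompletionOutput` (the field names `prompt_token_ids`, `outputs`, and `token_ids` are from the vLLM API; everything else here is illustrative):

```python
from dataclasses import dataclass

# Stand-ins mirroring the relevant shape of the vLLM result objects;
# in the backend these come from the engine, not from us.
@dataclass
class CompletionOutput:
    text: str
    token_ids: list[int]

@dataclass
class RequestOutput:
    prompt_token_ids: list[int]
    outputs: list[CompletionOutput]

def usage_from_request_output(result: RequestOutput) -> dict[str, int]:
    """Derive token counts directly from the generation result,
    without any extra tokenizer pass or API call."""
    prompt_tokens = len(result.prompt_token_ids)
    completion_tokens = len(result.outputs[0].token_ids)
    return {
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "total_tokens": prompt_tokens + completion_tokens,
    }

# Example: a 4-token prompt with a 3-token completion.
out = RequestOutput(prompt_token_ids=[1, 2, 3, 4],
                    outputs=[CompletionOutput("hi", [5, 6, 7])])
usage = usage_from_request_output(out)
```

Because the counts come from the ID arrays already present on the result, this mirrors what the HuggingFace backend does with tensor shapes: no second pass over the text is needed.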
## Expected behaviour
`mot.usage` should be set to `{"prompt_tokens": N, "completion_tokens": M, "total_tokens": N + M}` after every successful vLLM generation, consistent with the other backends.
## Notes
- `generate_from_raw` (batch path, line ~462) also does not set `usage`; the same fix is needed there.
- Discovered while auditing usage consistency across backends, following the fix for #694 ("fix: HuggingFace backend mot.usage always None without telemetry enabled", the HuggingFace usage regression).
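For the batch path, a hedged sketch of what the fix could look like: vLLM returns one `RequestOutput` per prompt, so usage is set per result. The thunk type and the `attach_usage` helper are assumptions about Mellea internals, not existing code; `SimpleNamespace` stands in for `ModelOutputThunk` and the engine results.

```python
from types import SimpleNamespace

def attach_usage(thunks, request_outputs):
    """Pair each output thunk with its RequestOutput and set usage.

    Hypothetical helper: vLLM returns one RequestOutput per prompt in
    the batch, so token counts are derived per result, exactly as in
    the single-prompt path.
    """
    for mot, result in zip(thunks, request_outputs):
        prompt_tokens = len(result.prompt_token_ids)
        completion_tokens = len(result.outputs[0].token_ids)
        mot.usage = {
            "prompt_tokens": prompt_tokens,
            "completion_tokens": completion_tokens,
            "total_tokens": prompt_tokens + completion_tokens,
        }

# Stand-ins for a two-prompt batch (shapes mirror vllm.RequestOutput):
thunks = [SimpleNamespace(usage=None), SimpleNamespace(usage=None)]
results = [
    SimpleNamespace(prompt_token_ids=[1, 2],
                    outputs=[SimpleNamespace(token_ids=[3])]),
    SimpleNamespace(prompt_token_ids=[1, 2, 3],
                    outputs=[SimpleNamespace(token_ids=[4, 5])]),
]
attach_usage(thunks, results)
```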