feat: vLLM (in-process) backend never populates mot.usage #696

@planetf1

Description

Summary

The in-process vLLM backend (mellea/backends/vllm.py) never sets mot.usage, so callers always receive None for token counts, even when generation succeeds.

Affected code

VLLMBackend.post_processing records tool calls, the generate log, and telemetry metadata, but contains no usage-population step.

The processing method accumulates only the decoded text from vllm.RequestOutput.outputs[0].text; the token ID arrays are discarded.

How other backends handle this

Every other backend that can compute token counts does so unconditionally in its post-processing step:

| Backend | Source of counts |
| --- | --- |
| HuggingFace | `GenerateDecoderOnlyOutput.sequences` shape |
| OpenAI / LiteLLM | `usage` field in the API response |
| Ollama | `prompt_eval_count` / `eval_count` in the response |
| WatsonX | `usage` field in the API response |

vllm.RequestOutput exposes both prompt_token_ids and outputs[0].token_ids, so both counts can be derived locally, without any extra API call.

Expected behaviour

mot.usage should be set to {"prompt_tokens": N, "completion_tokens": M, "total_tokens": N+M} after every successful vLLM generation, consistent with other backends.
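A minimal sketch of the derivation, assuming the fix lives in a helper called from post_processing. The dataclasses below are lightweight stand-ins that mimic only the two vllm.RequestOutput fields the issue names (prompt_token_ids and outputs[0].token_ids) so the example is self-contained; usage_from_request_output is a hypothetical name, not an existing mellea function.

```python
from dataclasses import dataclass
from typing import List


# Stand-ins for vllm.CompletionOutput / vllm.RequestOutput; real code
# would receive the actual objects returned by the vLLM engine.
@dataclass
class CompletionOutput:
    text: str
    token_ids: List[int]


@dataclass
class RequestOutput:
    prompt_token_ids: List[int]
    outputs: List[CompletionOutput]


def usage_from_request_output(result: RequestOutput) -> dict:
    """Derive token counts from the IDs vLLM already returns."""
    prompt_tokens = len(result.prompt_token_ids)
    completion_tokens = len(result.outputs[0].token_ids)
    return {
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "total_tokens": prompt_tokens + completion_tokens,
    }
```

post_processing could then assign the returned dict to mot.usage alongside the tool-call and telemetry bookkeeping it already does, mirroring the unconditional population the other backends perform.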
