Quota middleware v2 – streaming-safe token accounting + OpenMeter direct HTTP #41
Conversation
…ing-hooks Implement usage accounting hooks
…rometheus Add metrics helper and usage extra
…upport Add real OpenMeter metering
…leware-for-fastapi Fix quota streaming tail handling
…trics Fix final prompt token accounting
…env-vars Make token limit optional via shared helper
…ta-exceeded Fix 429 response when quota hit during stream
…ng-window-accounting Fix token quota rollback for streaming
…-exceed Handle quota breach mid-stream
Pushed OpenMeter HTTP refactor & README. Ready for another look—thanks!
HashamUlHaq left a comment
Thanks for the feature!
Following is my analysis:
- Pre-request: Estimate with tiktoken (both phases are sketched in code after this analysis)
Before the request is sent to the LLM backend, the middleware:
- Parses the prompt/messages.
- Uses tiktoken (or a fallback) to estimate the number of input tokens (tokens_in).
- Checks if this estimated usage would exceed the user’s quota.
- If over quota: The request is rejected immediately.
- Post-request: Canonical count from LLM response
After the LLM backend (vLLM, Ollama, OpenAI, etc.) returns a response:
- The middleware inspects the response for a usage field.
- If present, it uses the prompt_tokens and completion_tokens from the LLM’s own metrics as the canonical count for both incoming and outgoing tokens.
- If not present, it falls back to counting tokens in the output using tiktoken.
- These canonical numbers are what get reported to the usage metering backend (Prometheus, OpenMeter, etc.).
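To make the two phases above concrete, here is a minimal sketch of the flow as I read it. The in-memory quota dict, the plain exception, and the `cl100k_base` encoding choice are assumptions for illustration, not the middleware's actual helpers; only the tiktoken estimate and the OpenAI-style `usage` field mirror what the PR describes.

```python
# Minimal sketch of the two-phase accounting, assuming an OpenAI-style
# response with a "usage" field. The in-memory quota dict and the plain
# RuntimeError stand in for the middleware's real quota store and 429 path.
import tiktoken

_enc = tiktoken.get_encoding("cl100k_base")
_remaining = {"user-1": 10_000}  # hypothetical per-user token budget


def estimate_tokens(text: str) -> int:
    return len(_enc.encode(text))


def pre_request(user: str, prompt: str) -> int:
    """Phase 1: estimate tokens_in with tiktoken and reject before the backend call."""
    tokens_in = estimate_tokens(prompt)
    if tokens_in > _remaining.get(user, 0):
        raise RuntimeError("quota exceeded")  # the middleware returns HTTP 429 here
    return tokens_in


def post_request(user: str, estimated_in: int, response: dict, output_text: str) -> None:
    """Phase 2: prefer the backend's own usage numbers as the canonical count."""
    usage = response.get("usage") or {}
    tokens_in = usage.get("prompt_tokens", estimated_in)             # never re-estimated
    tokens_out = usage.get("completion_tokens") or estimate_tokens(output_text)
    _remaining[user] = _remaining.get(user, 0) - (tokens_in + tokens_out)
    # canonical (tokens_in, tokens_out) is what gets reported to Prometheus/OpenMeter
```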
Good thing:
In the fallback method, the middleware does NOT recalculate the input tokens (tokens_in)—it only recalculates the output tokens (tokens_out) if the LLM response does not provide a usage field.
Suggestions:
- The fallback method is `len(list(text.encode()))`, which counts UTF-8 bytes rather than tokens and can inflate the numbers significantly. A more accurate approach might be to use `len(text) // 4`.
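For a sense of the gap between the two estimators (plain Python, the prompt is just an illustrative example):

```python
# Rough comparison of the current fallback vs. the suggested heuristic.
text = "Explain the difference between TCP and UDP in two sentences."

byte_count = len(text.encode("utf-8"))  # current fallback: counts UTF-8 bytes, not tokens
heuristic = len(text) // 4              # suggested ~4-characters-per-token heuristic

print(byte_count, heuristic)  # 60 15 -- tiktoken puts this prompt at roughly a dozen tokens
```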
Other suggestions (not as important):
- Set a maximum request size to avoid excessive memory or CPU usage during tokenization.
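A lightweight way to enforce this would be a body-size check before any tokenization runs; the 1 MiB cap and helper name below are arbitrary examples, not something the patch currently defines.

```python
# Hypothetical request-size guard, checked before the prompt is tokenized.
from fastapi import HTTPException, Request

MAX_BODY_BYTES = 1 * 1024 * 1024  # arbitrary 1 MiB example limit


async def read_body_with_limit(request: Request) -> bytes:
    body = await request.body()
    if len(body) > MAX_BODY_BYTES:
        # reject before tiktoken ever sees the payload
        raise HTTPException(status_code=413, detail="request body too large")
    return body
```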
@HashamUlHaq could you review when you have a chance? In particular:
- Scope of the patch
- SSE frames (`data: ...`)
- `[DONE]` sentinels
- plain JSON bodies
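For context on the last three items, this is roughly the kind of parsing the streaming tail has to do when no usage field is available. The OpenAI-style payload layout (`choices[0].delta.content` for stream chunks, `choices[0].message.content` for plain bodies) is an assumption, and this is a simplified sketch rather than the patch itself.

```python
# Simplified sketch of handling the three body shapes listed above, assuming
# OpenAI-style payloads. It only extracts the completion text so tokens_out
# can be counted when the backend does not return a usage field.
import json


def extract_completion_text(raw: bytes) -> str:
    text = raw.decode("utf-8", errors="replace")

    # Plain JSON body (non-streaming response).
    try:
        body = json.loads(text)
    except json.JSONDecodeError:
        body = None
    if isinstance(body, dict):
        message = body.get("choices", [{}])[0].get("message", {})
        return message.get("content") or ""

    # SSE frames ("data: {...}") terminated by a "data: [DONE]" sentinel.
    pieces = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":  # sentinel frame carries no content
            continue
        chunk = json.loads(payload)
        delta = chunk.get("choices", [{}])[0].get("delta", {})
        pieces.append(delta.get("content") or "")
    return "".join(pieces)
```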