
Conversation

@hammadtq
Collaborator

@hammadtq hammadtq commented Jul 21, 2025

@HashamUlHaq could you review when you have a chance? In particular:
Scope of the patch

  • Full rewrite of middleware/quota.py
  • Uses a sliding-window meter store (InMemoryMeterStore + RedisMeterStore)
  • Streaming-aware quota enforcement via the internal _Streamer helper
  • Robust tail-parsing in finalize() (see the sketch after this list) for:
      ◦ SSE frames (data: ...)
      ◦ [DONE] sentinels
      ◦ plain JSON bodies
  • Prometheus usage counters are now updated once with the canonical numbers taken from the LLM response.
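
For reference, a minimal sketch of the kind of tail parsing finalize() does (the helper name and exact handling below are illustrative, not the shipped code):

```python
import json

def _parse_tail(tail: str) -> dict | None:
    """Best-effort extraction of the final `usage` object from a response tail.

    Covers the three shapes listed above: SSE frames (`data: ...`),
    the `[DONE]` sentinel, and plain JSON bodies.
    """
    usage = None
    for line in tail.splitlines():
        line = line.strip()
        if line.startswith("data:"):            # SSE frame prefix
            line = line[len("data:"):].strip()
        if not line or line == "[DONE]":        # blank line or end-of-stream sentinel
            continue
        try:
            payload = json.loads(line)          # frame payload or plain JSON body
        except json.JSONDecodeError:
            continue
        if isinstance(payload, dict):
            usage = payload.get("usage") or usage   # keep the last usage block seen
    return usage
```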

@hammadtq hammadtq marked this pull request as draft July 21, 2025 21:41
@hammadtq hammadtq requested a review from HashamUlHaq July 23, 2025 02:50
@hammadtq hammadtq marked this pull request as ready for review July 23, 2025 02:53
@hammadtq
Collaborator Author

Pushed OpenMeter HTTP refactor & README. Ready for another look—thanks!

@hammadtq hammadtq changed the title Quota middleware v2 — streaming-safe token accounting Quota middleware v2 – streaming-safe token accounting **+ OpenMeter direct HTTP** Jul 23, 2025
@hammadtq hammadtq changed the title Quota middleware v2 – streaming-safe token accounting **+ OpenMeter direct HTTP** Quota middleware v2 – streaming-safe token accounting + OpenMeter direct HTTP Jul 23, 2025
Collaborator

@HashamUlHaq HashamUlHaq left a comment

Thanks for the feature!
Following is my analysis:

  1. Pre-request: Estimate with tiktoken

Before the request is sent to the LLM backend, the middleware:

  • Parses the prompt/messages.
  • Uses tiktoken (or a fallback) to estimate the number of input tokens (tokens_in); see the sketch after this list.
  • Checks if this estimated usage would exceed the user’s quota.
  • If over quota: The request is rejected immediately.
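
A minimal sketch of this pre-request estimate, assuming an OpenAI-style messages list (the helper name, meter, and quota_limit below are hypothetical, not the middleware's actual API):

```python
import tiktoken

def estimate_tokens_in(messages: list[dict], model: str = "gpt-3.5-turbo") -> int:
    """Rough pre-request estimate of tokens_in from the prompt/messages."""
    try:
        enc = tiktoken.encoding_for_model(model)
    except KeyError:
        enc = tiktoken.get_encoding("cl100k_base")   # fallback encoding
    return sum(len(enc.encode(m.get("content", "") or "")) for m in messages)

# Hypothetical pre-request check: reject immediately if the estimate
# would push the user over quota.
# if meter.used(user_id) + estimate_tokens_in(messages) > quota_limit:
#     raise QuotaExceeded()
```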
  2. Post-request: Canonical count from LLM response

After the LLM backend (vLLM, Ollama, OpenAI, etc.) returns a response:

  • The middleware inspects the response for a usage field.
  • If present, it uses the prompt_tokens and completion_tokens from the LLM’s own metrics as the canonical count for both incoming and outgoing tokens.
  • If not present, it falls back to counting tokens in the output using tiktoken.
  • These canonical numbers are what get reported to the usage metering backend (Prometheus, OpenMeter, etc.); a sketch of this selection logic follows.
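
A sketch of that selection logic (the function and argument names are illustrative, not the middleware's actual API):

```python
def canonical_counts(response_json: dict, output_text: str,
                     tokens_in_estimate: int, enc) -> tuple[int, int]:
    """Return (tokens_in, tokens_out) to report to the metering backend.

    Prefers the backend's own `usage` field; otherwise keeps the pre-request
    tokens_in estimate and re-counts only the output locally.
    """
    usage = response_json.get("usage") or {}
    if "prompt_tokens" in usage and "completion_tokens" in usage:
        return usage["prompt_tokens"], usage["completion_tokens"]
    # Fallback: tokens_in stays as estimated, only tokens_out is re-counted.
    return tokens_in_estimate, len(enc.encode(output_text))
```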

Good thing:
In the fallback method, the middleware does NOT recalculate the input tokens (tokens_in)—it only recalculates the output tokens (tokens_out) if the LLM response does not provide a usage field.

Suggestions:

  1. The fallback method is len(list(text.encode())), which counts UTF-8 bytes and can inflate the numbers significantly. A more accurate approach might be len(text) // 4 (roughly four characters per token); see the comparison below.
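
A quick comparison of the two fallbacks on a plain-ASCII string (the exact numbers vary with the text):

```python
text = "Streaming-safe token accounting for quota middleware."

byte_count = len(list(text.encode()))   # current fallback: counts UTF-8 bytes
heuristic  = len(text) // 4             # suggested: ~4 characters per token

print(byte_count, heuristic)            # 53 vs 13 for this sentence
```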

Other suggestions (not as important):

  1. Set a maximum request size to avoid excessive memory or CPU usage during tokenization; a sketch of such a guard follows.
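
Assuming the middleware is Starlette/FastAPI-based (an assumption on my part), such a guard could look roughly like this, with MAX_REQUEST_BYTES as an illustrative limit:

```python
from starlette.requests import Request
from starlette.responses import JSONResponse

MAX_REQUEST_BYTES = 1_000_000  # illustrative cap; tune per deployment

async def reject_oversized(request: Request) -> JSONResponse | None:
    """Return a 413 response for oversized bodies before any tokenization runs."""
    body = await request.body()
    if len(body) > MAX_REQUEST_BYTES:
        return JSONResponse({"error": "request body too large"}, status_code=413)
    return None
```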

@hammadtq hammadtq merged commit e80c891 into dev Jul 23, 2025
2 checks passed
