
feat: Lemonade version mismatch warning, eval perf tracking, MCP stats #637

Merged: kovtcharov merged 6 commits into main from optimize/agent-response-quality on Mar 30, 2026

@kovtcharov
Collaborator

Summary

  • Lemonade version mismatch warning: Warns when the running Lemonade Server version differs from the expected version. Both the CLI version and the version reported by the running server are checked, and minor and patch differences now produce a warning, not only major ones
  • Eval performance tracking: Aggregates per-turn performance data (tok/s, TTFT, token counts) into scenario-level summaries and a Performance section in the scorecard
  • MCP agent UI stats: Captures inference stats from SSE done events and surfaces them in MCP get_messages responses
  • Eval judge prompt improvements: Updated judge and simulator prompts for better scoring
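
The version check described in the first bullet can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the function names `parse_version` and `version_mismatch_warning`, the warning text, and the version numbers are all hypothetical, and the sketch assumes purely numeric `major.minor.patch` strings.

```python
from typing import Optional, Tuple


def parse_version(text: str) -> Tuple[int, int, int]:
    """Parse 'v9.1.4' or '9.1' into (major, minor, patch); missing parts become 0."""
    parts = text.strip().lstrip("vV").split(".")
    nums = [int(p) for p in parts[:3]]  # assumes numeric components only
    while len(nums) < 3:
        nums.append(0)
    return nums[0], nums[1], nums[2]


def version_mismatch_warning(expected: str, actual: str, source: str) -> Optional[str]:
    """Return a warning string if the versions differ, otherwise None.

    `source` is a free-form label such as "CLI" or "server" (hypothetical).
    """
    exp, act = parse_version(expected), parse_version(actual)
    if exp == act:
        return None
    # Report the most significant component that differs.
    if exp[0] != act[0]:
        level = "major"
    elif exp[1] != act[1]:
        level = "minor"
    else:
        level = "patch"
    return (
        f"Lemonade {source} version {actual} differs from "
        f"expected {expected} ({level} mismatch)"
    )
```

Calling the helper once with the CLI-reported version and once with the version from the server's health endpoint would cover both checks mentioned above.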

Test plan

  • python -m pytest tests/unit/test_lemonade_version_check.py -xvs
  • Run gaia eval and verify performance summary in scorecard
  • Verify Lemonade version warning appears when versions differ

kovtcharov and others added 2 commits March 25, 2026 12:36
- Warn when Lemonade Server version doesn't match expected (minor/patch),
  not just on major version mismatch. Also check server-reported version
  from health endpoint and display it during initialization.
- Add performance data collection to eval framework: per-turn inference
  stats (tok/s, TTFT, token counts) aggregated into scenario and
  scorecard summaries.
- Capture inference stats from SSE done events in agent_ui_mcp and
  expose them in get_messages responses.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
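
As an illustration of the per-turn aggregation this commit describes, a scenario-level summary might be computed along these lines. The function name `summarize_performance` and the dictionary keys are assumptions made for this sketch; the actual field names in the eval framework may differ.

```python
from statistics import mean
from typing import Dict, List, Optional


def summarize_performance(turns: List[Dict[str, Optional[float]]]) -> Dict[str, float]:
    """Aggregate per-turn inference stats into one scenario-level summary."""

    def values(key: str) -> List[float]:
        # Skip turns where the stat is missing or None.
        return [t[key] for t in turns if t.get(key) is not None]

    tps = values("tokens_per_second")        # hypothetical key
    ttft = values("time_to_first_token")     # hypothetical key
    return {
        "turns": len(turns),
        "avg_tokens_per_second": mean(tps) if tps else 0.0,
        "avg_time_to_first_token": mean(ttft) if ttft else 0.0,
        "total_input_tokens": sum(values("input_tokens")),
        "total_output_tokens": sum(values("output_tokens")),
    }
```

Rates and latencies are averaged while token counts are summed, so a scorecard could reuse the same function over all turns of all scenarios.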
@github-actions bot added the mcp (MCP integration changes), llm (LLM backend changes), eval (Evaluation framework changes), tests (Test changes), and performance (Performance-critical changes) labels on Mar 27, 2026
…alth

get_status() previously took ctx_size from the first model in
all_models_loaded, which could be an embedding model with an
irrelevant context size. Now iterates to find the first non-embedding
model, falling back to the legacy health.context_size field.
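
The selection logic described in this commit message could look roughly like the sketch below. The keys `all_models_loaded` and `context_size` come from the message; the per-model `ctx_size` key, the `type == "embedding"` marker, and the function name `pick_context_size` are assumptions for illustration, since the real health payload is not shown here.

```python
from typing import Any, Dict, Optional


def pick_context_size(health: Dict[str, Any]) -> Optional[int]:
    """Return the context size of the first non-embedding loaded model.

    Falls back to the legacy top-level `context_size` field when no
    suitable model is found.
    """
    for model in health.get("all_models_loaded") or []:
        # Assumption: embedding models are tagged with type == "embedding".
        if model.get("type") == "embedding":
            continue
        ctx = model.get("ctx_size")
        # Design choice in this sketch: skip models that report no ctx_size.
        if ctx is not None:
            return ctx
    return health.get("context_size")
```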
@kovtcharov marked this pull request as ready for review March 30, 2026 21:45
@kovtcharov self-assigned this Mar 30, 2026
@kovtcharov requested a review from itomek March 30, 2026 22:04
kovtcharov and others added 2 commits March 30, 2026 15:04
… errors

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fix pylint W1404: merge implicit string concatenation in version warning
- Fix flake8 F401: remove unused pytest import in version check tests
- Guard get_mcp_status_report() against None _mcp_manager (crashes when
  MCP dependencies are not installed)
- Use `or 0` pattern in MCP perf logging to handle None stat values
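
The `or 0` pattern in the last bullet matters because `dict.get(key, 0)` still returns `None` when the key is present with a `None` value, and formatting `None` with a numeric format spec raises `TypeError`. A minimal sketch, with a hypothetical function name and stat keys:

```python
from typing import Any, Dict


def format_perf_line(stats: Dict[str, Any]) -> str:
    """Build a log line from inference stats, tolerating missing or None values."""
    # `or 0` covers both a missing key and a key explicitly set to None.
    tps = stats.get("tokens_per_second") or 0
    ttft = stats.get("time_to_first_token") or 0
    out_tokens = stats.get("output_tokens") or 0
    return f"{tps:.1f} tok/s, TTFT {ttft:.2f}s, {out_tokens} output tokens"
```

One trade-off of this pattern is that a genuine value of `0` and a missing value become indistinguishable, which is usually acceptable for log output.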
@kovtcharov added this pull request to the merge queue Mar 30, 2026
Merged via the queue into main with commit 780a711 Mar 30, 2026
36 checks passed
@kovtcharov deleted the optimize/agent-response-quality branch March 30, 2026 22:41

Labels

eval (Evaluation framework changes), llm (LLM backend changes), mcp (MCP integration changes), performance (Performance-critical changes), tests (Test changes)