Add mcp universe benchmark #36

chughtapan · 2025-12-04T01:11:49Z


- Restore tests/utils/logger.py to original StructuredEventLogger signatures, adding new methods for MCP-Universe support
- Move HumanReadableLogger to tests/benchmarks/mcp_universe/reporting.py (matching AppWorld's pattern)
- Fix README install instructions to use UV_GIT_LFS=1 uv pip install
- Remove redundant bfcl and mcpuniverse-eval optional dep groups

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>

- Fix fragile HumanReadableLogger init (use proper constructor)
- Simplify complex ternary expression for expected value extraction
- Remove unused error classification variables (always 0)
- Move late json import to top of file
- Consolidate apply_patch() using dict iteration
- Extract magic numbers as constants (MAX_ITERATIONS, MAX_TOKENS, GITHUB_API_*)
- Extract _find_repo_with_fewest_issues helper to eliminate duplication
- Use specific exceptions instead of broad Exception catching
- Replace assert with explicit ValueError validation
- Remove no-op placeholder methods and their calls

Net reduction: -64 lines

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
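To illustrate the "replace assert with explicit ValueError validation" item, here is a minimal sketch. The function name and the task-count check are hypothetical and not taken from the PR; the point is only that `assert` statements are stripped when Python runs with `-O`, whereas an explicit `raise` always executes.

```python
def validate_task_count(tasks: list[str], expected: int) -> None:
    """Hypothetical validation helper (not the PR's actual code)."""
    # An `assert len(tasks) == expected` here would be skipped under `python -O`;
    # raising ValueError keeps the check active in every run mode.
    if len(tasks) != expected:
        raise ValueError(f"expected {expected} tasks, got {len(tasks)}")
```

Callers can then catch `ValueError` specifically, which also fits the "use specific exceptions" item in the same commit.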

- Add proper type annotations to evaluator_patch.py functions
- Use cast() for json.load() returns in loader.py
- Remove manual secrets loading (FastAgent handles automatically)
- Remove unnecessary mypy/ruff overrides for empty bfcl data dir
- Rename msg -> user_msg to avoid type shadowing

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
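A minimal sketch of the `cast()` pattern mentioned for `json.load()` returns. The function name `load_task` and the `dict[str, Any]` shape are assumptions for illustration; the actual loader.py code is not shown in this PR page.

```python
import json
from pathlib import Path
from typing import Any, cast


def load_task(path: Path) -> dict[str, Any]:
    """Hypothetical loader: json.load() is typed as returning Any."""
    with path.open(encoding="utf-8") as f:
        # cast() only informs the type checker; it performs no runtime
        # validation, so the file is still trusted to contain a JSON object.
        return cast(dict[str, Any], json.load(f))
```

Because `cast()` does nothing at runtime, a file holding a JSON list would still be returned as-is; a stricter loader would check `isinstance(data, dict)` first.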

- Extract helper functions to reduce complexity (C901, PLR0912, PLR0915)
- Add LoggingContext and EvaluationCheck dataclasses to reduce params (PLR0913)
- Use list comprehensions instead of append loops (PERF401)
- Auto-apply evaluator patches on module import for cleaner imports
- Remove all MCP-Universe ruff ignores from pyproject.toml

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
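As a rough illustration of two of the lint fixes above (grouping parameters in a dataclass for PLR0913, and a comprehension in place of an append loop for PERF401), consider the sketch below. The class name `EvaluationCheck` comes from the commit message, but its fields and the helper function are assumptions; the real definitions are not shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvaluationCheck:
    """Field names are illustrative assumptions, not the PR's actual fields."""

    name: str
    passed: bool
    reason: str = ""


def failed_check_names(checks: list[EvaluationCheck]) -> list[str]:
    """Hypothetical helper: one dataclass argument instead of many scalars."""
    # A comprehension replaces the `result = []; for ...: result.append(...)` loop.
    return [check.name for check in checks if not check.passed]
```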

- Extract _validate_test() to handle validation + human logging
- Extract _log_evaluation_results() for cleaner separation
- Remove asyncio.sleep(0), simplify parametrize and assert
- Main test function now cleanly separates run vs validate modes

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>

- Execution phase now only writes to structured JSONL log
- Human-readable log generated during validation by replaying structured events
- Added HumanReadableLogger.from_structured_log() classmethod for replay
- Removed LoggingContext dataclass and simplified _process_message_logs
- Removed unused functions: _find_tool_name, _get_final_assistant_message, _determine_completion_status

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
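The replay design described above can be sketched roughly as follows. Only the names `HumanReadableLogger` and `from_structured_log` come from the commit message; the constructor signature, the `lines` attribute, and the event schema (`type` and `message` keys) are assumptions made for illustration and may differ from reporting.py.

```python
import json
from pathlib import Path


class HumanReadableLogger:
    """Illustrative stand-in; the real class lives in reporting.py."""

    def __init__(self, lines: list[str]) -> None:
        self.lines = lines

    @classmethod
    def from_structured_log(cls, path: Path) -> "HumanReadableLogger":
        """Rebuild readable lines by replaying a JSONL event log."""
        lines: list[str] = []
        with path.open(encoding="utf-8") as f:
            for raw in f:
                if not raw.strip():
                    continue  # tolerate blank lines between events
                event = json.loads(raw)
                # Assumed schema: each event has optional "type" and "message".
                kind = event.get("type", "unknown")
                lines.append(f"[{kind}] {event.get('message', '')}")
        return cls(lines)
```

Keeping the JSONL file as the only artifact written during execution means the readable view can be regenerated later without rerunning the agent.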


Copilot

Pull request overview

This PR adds a comprehensive benchmark integration for MCP-Universe's repository management tasks. The implementation includes test infrastructure, evaluation logic with patches for compatibility issues, human-readable logging capabilities, and configuration for running 28 GitHub-based tasks using FastAgent with the GitHub MCP server v0.15.0.

Key changes:

- Added MCP-Universe benchmark test suite with pytest infrastructure
- Implemented evaluator patches to fix false negatives in the original MCP-Universe evaluation functions
- Created human-readable logging system for test results and manual annotation

Reviewed changes

Copilot reviewed 13 out of 14 changed files in this pull request and generated 1 comment.


File	Description
tests/utils/fastagent_helpers.py	Added JSON serialization mode parameter to model_dump call
tests/conftest.py	Added output_dir fixture for test result directories
tests/benchmarks/mcp_universe/test_mcp_universe.py	Main test implementation for running and validating MCP-Universe tasks
tests/benchmarks/mcp_universe/reporting.py	Human-readable logging infrastructure for benchmark results
tests/benchmarks/mcp_universe/mcp_server_config.json	Docker-based GitHub MCP server configuration
tests/benchmarks/mcp_universe/instruction.txt	Agent instruction template for task execution
tests/benchmarks/mcp_universe/fastagent.config.yaml	FastAgent configuration with pinned GitHub MCP server version
tests/benchmarks/mcp_universe/evaluator_patch.py	Patches for MCP-Universe evaluator compatibility with GitHub MCP Server v0.15.0
tests/benchmarks/mcp_universe/evaluator.py	Evaluation orchestration for repository management tasks
tests/benchmarks/mcp_universe/__init__.py	Package initialization
tests/benchmarks/mcp_universe/README.md	Comprehensive documentation for setup and usage
tests/benchmarks/appworld/mcp_server.py	Removed type ignore comments from decorators
pyproject.toml	Added mcpuniverse dependency and related overrides


tests/benchmarks/mcp_universe/evaluator_patch.py

vinamra57 and others added 30 commits October 17, 2025 11:41

Add MCP-Universe repository management benchmark integration (28 tasks)

cd4f929

Remove MCP-Universe submodule and use git dependency from vinamra57/MCP-Universe fork

51edbf8

Align MCP-Universe benchmark with BFCL patterns and enhance system instructions

3fe0387

Add MCP-Universe benchmark implementation with evaluator patch and test infrastructure

1418851

Merge branch 'main' into add-mcp-universe-benchmark

59c38af

fix dependencies

212c081

update gh mcp server version and remove patch fixes

4efd31f

Merge branch 'chughtapan:main' into add-mcp-universe-benchmark

9d7fbe0

patch evaluator bugs

bf27326

fix imports

cfcab63

fix ci/cd failures

8116e79

fix evaluator errors

0dc0218

improve logs for readability

ebae298

fix bugs in and improve evaluator patch

6e33361

cleanup a bit of code

a09cdae

Revert middleware test URL workarounds

10620c3

Reverts the https:// to file:// URL changes that were introduced as a CI/CD workaround. These tests should use realistic https URLs. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

Merge remote-tracking branch 'origin/main' into add-mcp-universe-benchmark

2a5f138

# Conflicts:
#	pyproject.toml
#	uv.lock

Regenerate uv.lock with mcpuniverse dependency

49f0834

🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

Remove more deadcode

15138d7

nit

bc8091e

Inline task loading, remove loader.py

1f32a2f

🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

Fix evaluator.py to not import deleted loader module

d52875f

🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

Apply ruff-format

95c2c37

🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

Remove duplicate MAX_ITERATIONS constant

0d85d71

Make max_iterations a parameter with default value instead of duplicating the constant in two files. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
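A minimal sketch of the change this commit describes: a single default lives in one place and callers override it through a parameter rather than redefining the constant. The constant's value, the function name, and the `step` callback are all hypothetical; the PR's actual limit and call sites are not shown here.

```python
from collections.abc import Callable

# Illustrative value only; the PR's actual default is not shown on this page.
DEFAULT_MAX_ITERATIONS = 10


def run_until_done(
    step: Callable[[int], bool],
    max_iterations: int = DEFAULT_MAX_ITERATIONS,
) -> int:
    """Hypothetical loop: call step(i) until it reports completion.

    Returns the number of iterations actually used.
    """
    for i in range(1, max_iterations + 1):
        if step(i):
            return i
    return max_iterations
```

A second module can then call `run_until_done(step, max_iterations=5)` without importing or duplicating the constant.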

Tapan Chugh and others added 3 commits December 16, 2025 14:34

nit: fix format

2a36079

nit: fix format

00ca253

🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Add output_dir fixture required by MCP-Universe tests

2493834

chughtapan requested a review from Copilot December 17, 2025 00:47

Copilot started reviewing on behalf of chughtapan December 17, 2025 00:48

Copilot AI reviewed Dec 17, 2025

tests/benchmarks/mcp_universe/evaluator_patch.py (resolved)

chughtapan merged commit 0168d09 into main Dec 17, 2025
9 checks passed

chughtapan deleted the add-mcp-universe-benchmark branch December 17, 2025 00:53

3 participants
