@chughtapan chughtapan commented Dec 4, 2025

vinamra57 and others added 30 commits October 17, 2025 11:41
Reverts the https:// to file:// URL changes that were introduced
as a CI/CD workaround. These tests should use realistic https URLs.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
…hmark

# Conflicts:
#	pyproject.toml
#	uv.lock
🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Restore tests/utils/logger.py to original StructuredEventLogger
  signatures, adding new methods for MCP-Universe support
- Move HumanReadableLogger to tests/benchmarks/mcp_universe/reporting.py
  (matching AppWorld's pattern)
- Fix README install instructions to use UV_GIT_LFS=1 uv pip install
- Remove redundant bfcl and mcpuniverse-eval optional dep groups

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Fix fragile HumanReadableLogger init (use proper constructor)
- Simplify complex ternary expression for expected value extraction
- Remove unused error classification variables (always 0)
- Move late json import to top of file
- Consolidate apply_patch() using dict iteration
- Extract magic numbers as constants (MAX_ITERATIONS, MAX_TOKENS, GITHUB_API_*)
- Extract _find_repo_with_fewest_issues helper to eliminate duplication
- Use specific exceptions instead of broad Exception catching
- Replace assert with explicit ValueError validation
- Remove no-op placeholder methods and their calls

Net reduction: -64 lines
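
Two of the items above (replacing `assert` with explicit `ValueError` validation, and catching specific exceptions instead of a broad `Exception`) can be illustrated with a minimal sketch. The helper name `parse_expected_count` and the payload shape are hypothetical and are not taken from this PR.

```python
import json


def parse_expected_count(raw: str) -> int:
    """Return the non-negative "count" field of a JSON object.

    Hypothetical helper for illustration only; not code from this PR.
    """
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:  # specific, rather than `except Exception`
        raise ValueError(f"invalid JSON: {exc}") from exc

    # Explicit checks still run under `python -O`, which strips assert statements.
    if not isinstance(payload, dict) or "count" not in payload:
        raise ValueError("payload must be an object with a 'count' key")
    count = payload["count"]
    # bool is a subclass of int, so it is rejected separately.
    if isinstance(count, bool) or not isinstance(count, int) or count < 0:
        raise ValueError(f"'count' must be a non-negative integer, got {count!r}")
    return count
```

Callers then handle a single, documented exception type (`ValueError`) rather than an `AssertionError` that disappears in optimized runs.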

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Add proper type annotations to evaluator_patch.py functions
- Use cast() for json.load() returns in loader.py
- Remove manual secrets loading (FastAgent handles automatically)
- Remove unnecessary mypy/ruff overrides for empty bfcl data dir
- Rename msg -> user_msg to avoid type shadowing
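
For the `cast()` item above, a minimal sketch of the general pattern follows. The function name `load_task` is hypothetical; the actual signatures in `loader.py` are not shown in this conversation.

```python
import json
from typing import IO, Any, cast


def load_task(fp: IO[str]) -> dict[str, Any]:
    """Load one JSON object from an open text file (hypothetical helper)."""
    # json.load() is annotated as returning Any. cast() only informs the type
    # checker of the expected shape; it performs no runtime validation.
    return cast(dict[str, Any], json.load(fp))
```

Because `cast()` is a no-op at runtime, a file containing a JSON list would still be returned as a list, so this pattern suits inputs whose shape is already trusted.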

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Extract helper functions to reduce complexity (C901, PLR0912, PLR0915)
- Add LoggingContext and EvaluationCheck dataclasses to reduce params (PLR0913)
- Use list comprehensions instead of append loops (PERF401)
- Auto-apply evaluator patches on module import for cleaner imports
- Remove all MCP-Universe ruff ignores from pyproject.toml
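
The two refactoring patterns named above (a dataclass that bundles related parameters, and a list comprehension in place of an append loop) might look roughly like the sketch below. `CheckSummary` and its fields are hypothetical stand-ins; the fields of the PR's `LoggingContext` and `EvaluationCheck` dataclasses are not shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckSummary:
    """Hypothetical bundle of values that would otherwise be separate arguments."""

    name: str
    passed: bool
    reason: str = ""


def failed_check_names(checks: list[CheckSummary]) -> list[str]:
    # A comprehension replaces the "create list, loop, append" pattern
    # that Ruff's PERF401 rule flags.
    return [check.name for check in checks if not check.passed]
```

Passing one `CheckSummary` instead of three positional values is the kind of change that brings a function back under the argument limit enforced by PLR0913.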

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Extract _validate_test() to handle validation + human logging
- Extract _log_evaluation_results() for cleaner separation
- Remove asyncio.sleep(0), simplify parametrize and assert
- Main test function now cleanly separates run vs validate modes

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Execution phase now only writes to structured JSONL log
- Human-readable log generated during validation by replaying structured events
- Added HumanReadableLogger.from_structured_log() classmethod for replay
- Removed LoggingContext dataclass and simplified _process_message_logs
- Removed unused functions: _find_tool_name, _get_final_assistant_message, _determine_completion_status
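
The replay approach described above (write only structured JSONL during execution, then build the human-readable log from it during validation) can be sketched as follows. `ReplayLogger`, its `handle` method, and the event shape `{"type": ..., "message": ...}` are assumptions for illustration; only the classmethod name `from_structured_log` comes from the commit message, and the real `HumanReadableLogger` may differ.

```python
import json
from pathlib import Path


class ReplayLogger:
    """Hypothetical stand-in for a human-readable logger; not the PR's class."""

    def __init__(self) -> None:
        self.lines: list[str] = []

    def handle(self, event: dict) -> None:
        # Assumed event shape: {"type": "...", "message": "..."}.
        kind = event.get("type", "unknown")
        message = event.get("message", "")
        self.lines.append(f"[{kind}] {message}")

    @classmethod
    def from_structured_log(cls, path: Path) -> "ReplayLogger":
        """Rebuild the readable log by replaying one JSON event per line."""
        logger = cls()
        with path.open(encoding="utf-8") as fh:
            for raw in fh:
                if raw.strip():  # tolerate blank lines in the JSONL file
                    logger.handle(json.loads(raw))
        return logger
```

Keeping the JSONL file as the single source of truth means the readable log can be regenerated at any time without rerunning the agent.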

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Make max_iterations a parameter with default value instead of duplicating
the constant in two files.
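
A minimal sketch of this change, assuming a simple polling loop: the constant is defined once and used as the default of a `max_iterations` parameter, so a second module passes a value instead of redefining it. The name `run_until_done` and the value `10` are illustrative and not taken from the PR.

```python
from typing import Callable

DEFAULT_MAX_ITERATIONS = 10  # illustrative value; defined in exactly one place


def run_until_done(
    step: Callable[[], bool],
    max_iterations: int = DEFAULT_MAX_ITERATIONS,
) -> int:
    """Call step() until it returns True or the budget is spent.

    Returns the number of calls made.
    """
    if max_iterations < 1:
        raise ValueError("max_iterations must be >= 1")
    for attempt in range(1, max_iterations + 1):
        if step():
            return attempt
    return max_iterations
```

A caller that needs a different budget writes `run_until_done(step, max_iterations=3)` and never has to import or copy the constant.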

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Tapan Chugh and others added 3 commits December 16, 2025 14:34
🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Copilot AI left a comment

Pull request overview

This PR adds a comprehensive benchmark integration for MCP-Universe's repository management tasks. The implementation includes test infrastructure, evaluation logic with patches for compatibility issues, human-readable logging capabilities, and configuration for running 28 GitHub-based tasks using FastAgent with the GitHub MCP server v0.15.0.

Key changes:

  • Added MCP-Universe benchmark test suite with pytest infrastructure
  • Implemented evaluator patches to fix false negatives in the original MCP-Universe evaluation functions
  • Created human-readable logging system for test results and manual annotation

Reviewed changes

Copilot reviewed 13 out of 14 changed files in this pull request and generated 1 comment.

Show a summary per file
| File | Description |
| --- | --- |
| `tests/utils/fastagent_helpers.py` | Added JSON serialization mode parameter to `model_dump` call |
| `tests/conftest.py` | Added `output_dir` fixture for test result directories |
| `tests/benchmarks/mcp_universe/test_mcp_universe.py` | Main test implementation for running and validating MCP-Universe tasks |
| `tests/benchmarks/mcp_universe/reporting.py` | Human-readable logging infrastructure for benchmark results |
| `tests/benchmarks/mcp_universe/mcp_server_config.json` | Docker-based GitHub MCP server configuration |
| `tests/benchmarks/mcp_universe/instruction.txt` | Agent instruction template for task execution |
| `tests/benchmarks/mcp_universe/fastagent.config.yaml` | FastAgent configuration with pinned GitHub MCP server version |
| `tests/benchmarks/mcp_universe/evaluator_patch.py` | Patches for MCP-Universe evaluator compatibility with GitHub MCP Server v0.15.0 |
| `tests/benchmarks/mcp_universe/evaluator.py` | Evaluation orchestration for repository management tasks |
| `tests/benchmarks/mcp_universe/__init__.py` | Package initialization |
| `tests/benchmarks/mcp_universe/README.md` | Comprehensive documentation for setup and usage |
| `tests/benchmarks/appworld/mcp_server.py` | Removed type ignore comments from decorators |
| `pyproject.toml` | Added mcpuniverse dependency and related overrides |

@chughtapan chughtapan merged commit 0168d09 into main Dec 17, 2025
9 checks passed
@chughtapan chughtapan deleted the add-mcp-universe-benchmark branch December 17, 2025 00:53
3 participants