@eval-sys

EVAL SYS

Evaluation Systems Organization

EVAL SYS is a living, open-source community for tracking and advancing the agentic capabilities of models. We will release benchmarks, datasets, toolchains, and models to push the field forward. Initiated by LobeHub, we welcome collaboration with research labs, MCP server maintainers, independent contributors, and more.

Join us, contribute, or reach out!


MCPMark: Stress-Testing Comprehensive MCP Use

An evaluation suite for agentic models in real MCP tool environments (Notion / GitHub / Filesystem / Postgres / Playwright).

MCPMark provides a reproducible, extensible benchmark for researchers and engineers: one-command tasks, isolated sandboxes, auto-resume for failures, unified metrics, and aggregated reports.


Pinned

  1. mcpmark Public

    MCP Servers are shaping the future of software. MCPMark is a comprehensive, stress-testing benchmark and a collection of diverse, verifiable tasks designed to evaluate model capabilities in real-wo…

    Python 87 2

  2. mcpmark-experiments Public

    Collection of evaluation results for MCPMark

Repositories

  • mcpmark-experiments Public

    Collection of evaluation results for MCPMark

    0 0 0 0 Updated Aug 27, 2025
  • mcpmark Public

    MCP Servers are shaping the future of software. MCPMark is a comprehensive, stress-testing benchmark and a collection of diverse, verifiable tasks designed to evaluate model capabilities in real-world MCP use.

    Python 87 Apache-2.0 2 2 1 Updated Aug 27, 2025
  • .github Public

    Community health files for the @eval-sys organization

    0 0 0 0 Updated Aug 26, 2025
  • mcp-eval-website
    TypeScript 1 0 0 0 Updated Aug 18, 2025
