EVAL SYS is a living, open-source community for tracking and advancing the agentic capabilities of models. We will be releasing benchmarks, datasets, toolchains, and models to push the field forward. The project was initiated by LobeHub, and we would love to collaborate with research labs, MCP server teams, independent contributors, and more.
Join us, contribute, or reach out!
MCPMark: Stress-Testing Comprehensive MCP Use
An evaluation suite for agentic models in real MCP tool environments (Notion / GitHub / Filesystem / Postgres / Playwright).
MCPMark provides a reproducible, extensible benchmark for researchers and engineers: one-command tasks, isolated sandboxes, auto-resume for failures, unified metrics, and aggregated reports.
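To make "auto-resume for failures" and "aggregated reports" concrete, here is a minimal Python sketch of the general idea. It is not MCPMark's actual API: `run_with_resume`, `pass_rate`, and the JSON state file are hypothetical names chosen for illustration, and a task is assumed to be a plain callable that returns `True` on success.

```python
import json
from pathlib import Path
from typing import Callable, Dict


def run_with_resume(tasks: Dict[str, Callable[[], bool]], state_path: Path) -> Dict[str, bool]:
    """Run each task once, skipping tasks that already have a recorded result.

    Hypothetical helper: results are checkpointed to a JSON file so that a
    later invocation only re-runs tasks that crashed before finishing.
    """
    results = json.loads(state_path.read_text()) if state_path.exists() else {}
    for name, task in tasks.items():
        if name in results:
            continue  # already finished in an earlier run
        try:
            results[name] = bool(task())
        except Exception:
            # A crashed task stays unrecorded, so the next run retries it.
            continue
        # Checkpoint after every completed task.
        state_path.write_text(json.dumps(results))
    return results


def pass_rate(results: Dict[str, bool]) -> float:
    """Aggregate recorded results into one metric: the fraction of passed tasks."""
    return sum(results.values()) / len(results) if results else 0.0
```

Calling `run_with_resume` a second time with the same state file re-runs only the tasks that raised an exception earlier. A task that ran to completion but returned `False` counts as a recorded failure and is not retried.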