This repository contains an example task used in AIMultiple’s agentic benchmarks.
The full task specification is available in task-6-web.md.
This task is designed to evaluate autonomous coding agents on:
- Backend correctness
- Frontend integration
- Authentication & authorization
- Status workflow enforcement
- Data isolation guarantees
- End-to-end functionality
Agents are expected to complete the task in a one-shot setting without human intervention.
Full benchmark results and methodologies:
- https://research.aimultiple.com/ai-coding-benchmark
- https://aimultiple.com/agentic-cli