aimultiple-benchmark/agent-benchmark-example-task


Agent Benchmark – Example Task

This repository contains an example task used in AIMultiple’s agentic benchmarks.

The full task specification is available in task-6-web.md.

Purpose

This task is designed to evaluate autonomous coding agents on:

  • Backend correctness
  • Frontend integration
  • Authentication & authorization
  • Status workflow enforcement
  • Data isolation guarantees
  • End-to-end functionality

Agents are expected to complete the task in a one-shot setting without human intervention.

Benchmark Results

Full benchmark results and methodology are available at:

https://research.aimultiple.com/ai-coding-benchmark

https://aimultiple.com/agentic-cli

https://research.aimultiple.com/agentic-llm/

https://research.aimultiple.com/ai-code-editor/
