autoresearch

Autonomous codebase improvement. Three independent teams run against your code in a loop: one finds problems, one fixes them, one simplifies what's left. Everything happens on a branch. Nothing touches main until you merge.

Where this came from

Andrej Karpathy's autoresearch showed that you can let an agent modify training code, run experiments, keep what works, and throw away what doesn't. You go to sleep, you wake up to a better model. pi-autoresearch and drivelineresearch/autoresearch-claude-code generalized this beyond ML to any codebase.

This project uses up to three teams with information barriers between them instead of a single agent. Each team starts from a clean context and only sees what the coordinator passes to it. The team fixing bugs doesn't know how they were found, and the team simplifying code doesn't know what was broken.

How it works

Each cycle runs three stages:

The Red team reads the codebase and writes a findings report with file paths, line numbers, and descriptions of what's wrong. It doesn't modify anything.
The coordinator strips the analysis methodology from the findings and passes only the "what and where" to the Green team. The Green team fixes issues one at a time, running tests after each commit.
The Refactor team gets the codebase in its current state with no context about what was found or fixed. It picks 3-5 simplifications, runs tests after each change.

The coordinator verifies tests, logs results, and starts the next cycle.

What to expect

On a 25K-line Go project, five cycles produced 49 commits on a feature branch: 31 bug fixes (8 breaking core functionality), 6 new test suites, 5 performance optimizations replacing N+1 query patterns, and about 100 lines of dead code removed.

Your results will depend on the size, test coverage, and existing quality of the target project. Codebases with good test coverage get the most value since the Green team can verify its fixes. Projects with few tests will see the Red team flag missing coverage as a priority.

Install

Claude Code

git clone git@github.com:ehmo/autoresearch.git
cd autoresearch
./install.sh

Creates symlinks in ~/.claude/ for the skill and slash command. The script checks that Claude Code is installed first.

To use a non-default config directory, export CLAUDE_DIR before running:

export CLAUDE_DIR=/custom/path
./install.sh

Codex

Add the contents of agents/codex.md to your project's AGENTS.md or equivalent instruction file.

Other agents

The full protocol is in skills/autoresearch/SKILL.md. It's plain markdown. Drop it into whatever instruction format your agent uses.

Usage

Start a session:

/autoresearch ~/path/to/project

The coordinator detects your stack, finds the test command, creates a branch, and starts running cycles. It asks for confirmation before modifying anything.

Resume after a break:

/autoresearch resume
/autoresearch resume myproject

Check progress:

/autoresearch status

Configuration

Create .autoresearch.yml in your project root to customize behavior:

# Override auto-detected test command
test_command: "make test-unit"

# Limit what the teams can see and modify
include:
  - "src/"
  - "lib/"
exclude:
  - "vendor/"
  - "generated/"

# Stop after this many cycles (default: runs until diminishing returns)
max_cycles: 10

# Skip the refactor stage
teams:
  - red
  - green

Without a config file, autoresearch detects everything automatically and runs all three teams.

How sessions are stored

Session data lives in sessions/<project-name>/ within the autoresearch repo:

sessions/myproject/
  session.md        # What's been done, what's left to try
  results.tsv       # One row per team per cycle
  ideas.md          # Findings deferred for later cycles
  cycles/
    001/
      red-findings.md
      green-patch.md
      refactor-patch.md
      eval-results.md

The sessions/ directory is gitignored.

Requirements

An AI coding agent with sub-agent support for full clean-room separation (Claude Code), or a single-agent setup with reduced separation (Codex, others)
Git
A test suite that exits non-zero on failure

Stack detection covers Go, Node/TypeScript, Rust, Python, Java/Kotlin, Ruby, PHP, Elixir, and anything with a Makefile.

Limitations

Works best on projects with decent test coverage. Without tests, the Green team has no way to verify its fixes don't break things.

Large monorepos benefit from the include/exclude config to keep the teams focused on relevant code.

The information barriers between teams are enforced by sub-agent context separation, not cryptography. Each sub-agent starts fresh with only the information the coordinator passes to it.

Not tested on Windows outside of WSL. The install script uses symlinks which require elevated permissions on native Windows.

Uninstall

./uninstall.sh

Removes symlinks from ~/.claude/. Session data in sessions/ is kept.

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
agents		agents
commands		commands
skills/autoresearch		skills/autoresearch
.gitignore		.gitignore
CLAUDE.md		CLAUDE.md
LICENSE		LICENSE
README.md		README.md
example.autoresearch.yml		example.autoresearch.yml
install.sh		install.sh
uninstall.sh		uninstall.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

autoresearch

Where this came from

How it works

What to expect

Install

Claude Code

Codex

Other agents

Usage

Configuration

How sessions are stored

Requirements

Limitations

Uninstall

License

About

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

autoresearch

Where this came from

How it works

What to expect

Install

Claude Code

Codex

Other agents

Usage

Configuration

How sessions are stored

Requirements

Limitations

Uninstall

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Contributors

Uh oh!

Languages