
πŸ‘‹ Overview

CodeClash is a benchmark for evaluating AI systems on goal-oriented software engineering.

Today's AI coding evals are task-oriented (e.g., HumanEval, SWE-bench): models are given explicit instructions, and correctness is then verified with unit tests.

But building software is fundamentally driven by goals ("improve user retention", "reduce costs", "increase revenue"). Reaching our goals via code is a self-directed, iterative, and often competitive process. To capture this dynamism of real software development, we introduce CodeClash!

Check out our arXiv paper and website for the full details!

🏎️ Quick Start

To start, follow these steps to set up CodeClash and run a test battle:

$ git clone git@github.com:CodeClash-ai/CodeClash.git
$ cd CodeClash
$ pip install -e '.[dev]'
$ python main.py configs/test/battlesnake.yaml

Tip

CodeClash requires Docker to create execution environments. It was developed and tested on Ubuntu 22.04.4 LTS.

Once this works, you should be set up to run a real tournament! To run Claude Sonnet 4.5 against o3 in a BattleSnake tournament with 5 rounds and 1000 competition simulations per round, run:

$ python main.py configs/examples/BattleSnake__claude-sonnet-4-5-20250929__o3__r5__s1000.yaml
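The example config filename encodes the tournament parameters. As a rough, hypothetical illustration (the key names below are assumptions for readability, not the actual CodeClash config schema), it corresponds to settings like:

# Illustrative only: key names are assumptions, not the real config schema.
tournament_settings = {
    "arena": "BattleSnake",
    "agents": ["claude-sonnet-4-5-20250929", "o3"],
    "rounds": 5,                    # "r5" in the filename
    "simulations_per_round": 1000,  # "s1000" in the filename
}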

βš”οΈ How It Works

In CodeClash, 2+ LM agents compete in a code arena over the course of a multi-round tournament.

For the duration of the tournament, each agent iteratively improves its own codebase to achieve a high-level competitive objective (e.g., accumulate resources, survive the longest, etc.).

Each round consists of two phases:

  • Edit phase: LM agents make whatever changes they want to their codebase.
  • Competition phase: The modified codebases are pitted against each other in the arena.

Critically, LMs don't play the game directly. Their code serves as their competitive proxy. The winner is the LM agent who wins the most rounds.
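In pseudocode, a round-by-round sketch of this loop might look like the following (the agent and arena interfaces here are hypothetical illustrations, not the actual CodeClash API):

from collections import Counter

def run_tournament(agents, arena, rounds):
    """Minimal sketch of a CodeClash tournament loop (illustrative only)."""
    wins = Counter()
    for _ in range(rounds):
        # Edit phase: each LM agent modifies its own codebase however it wants.
        for agent in agents:
            agent.edit_codebase()
        # Competition phase: the modified codebases compete in the arena;
        # the LMs never play directly -- their code is their proxy.
        winner = arena.compete([agent.codebase for agent in agents])
        wins[winner] += 1
    # The tournament winner is the agent that won the most rounds.
    return wins.most_common(1)[0][0]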

πŸš€ Get Involved

  • Check out our docs for more details on running different arenas, configuring tournaments, etc.
  • Explore 2000+ tournaments via our viewer.
  • See our contribution guide for what we're excited about!
  • Have a big idea? Open an issue, and let's turn it into an insight!

πŸ’« Contributions

We're actively working on several follow-ups! Check out the Contributing Guide for more.

Contact Person: John Yang, Kilian Lieret (Email: johnby@stanford.edu, kl5675@princeton.edu)

πŸͺͺ License

MIT. Check LICENSE for more information.

✍️ Citation

@misc{yang2025codeclashbenchmarkinggoalorientedsoftware,
    title={CodeClash: Benchmarking Goal-Oriented Software Engineering},
    author={John Yang and Kilian Lieret and Joyce Yang and Carlos E. Jimenez and Ofir Press and Ludwig Schmidt and Diyi Yang},
    year={2025},
    eprint={2511.00839},
    archivePrefix={arXiv},
    primaryClass={cs.SE},
    url={https://arxiv.org/abs/2511.00839},
}

πŸ“• Our Other Projects

SWE-bench · SWE-agent · Mini-SWE-Agent · SWE-ReX · SWE-smith