This repository accompanies the survey Code as Agent Harness: Toward Executable, Verifiable, and Stateful Agent Systems. We study the emerging role of code in agentic AI: code is no longer only a generated artifact, but increasingly serves as an executable, inspectable, and stateful harness through which agents reason, act, model environments, receive feedback, and coordinate. The repository organizes representative papers around three connected layers: Harness Interface, Harness Mechanisms, and Scaling the Harness, covering directions such as coding assistants, GUI/OS automation, scientific discovery, and embodied intelligence.
Tip
We welcome paper suggestions, pull requests, and collaborations on code as agent harness. Please contact us at xuyingn2@illinois.edu, kt42@illinois.edu, twei10@illinois.edu, zihaoli5@illinois.edu, and bei4@illinois.edu. We will keep updating this repository with recent work on code-centric agentic systems and harness engineering.
Note
If you find this resource useful, please cite the survey and star the repo:
@article{ning2026codeasharness,
title = {Code as Agent Harness: Toward Executable, Verifiable, and Stateful Agent Systems},
author = {Ning, Xuying and Tieu, Katherine and Fu, Dongqi and Wei, Tianxin and Li, Zihao and Bei, Yuanchen and others},
journal = {arXiv preprint arXiv:2605.18747},
year = {2026}
}
[2026-05] Our survey Code as Agent Harness: Toward Executable, Verifiable, and Stateful Agent Systems is available on arXiv. Slides and project page links will be added here once available.
- News
- Table of Contents
- Harness Interface
- Harness Mechanisms
- Scaling the Harness: Multi-Agent Code-Centric Systems
- Applications and Emerging Fields
Code as the basic interface between a model and its task environment. Programs convert model outputs into executable, inspectable, and stateful structures: code makes reasoning executable, action programmable, and environment state inspectable.
Programs externalize internal logic into verifiable computation, allowing interpreters, symbolic solvers, execution traces, or process rewards to check and refine intermediate steps.
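As a minimal sketch of executable reasoning, the snippet below runs a model-written program in a fresh namespace and checks the resulting state against a verifier. The names `run_program`, `verify`, and `ALLOWED_BUILTINS` are hypothetical and not taken from any surveyed paper; the restricted builtins only illustrate the idea and are not a real security sandbox.

```python
# Hypothetical helpers for illustration; not an API from any surveyed system.
ALLOWED_BUILTINS = {"range": range, "sum": sum, "len": len, "abs": abs}


def run_program(source: str) -> dict:
    """Execute source in a fresh namespace and return the variables it defined."""
    # Limiting builtins narrows what the program can call; it is NOT a sandbox.
    namespace = {"__builtins__": dict(ALLOWED_BUILTINS)}
    exec(source, namespace)
    namespace.pop("__builtins__", None)
    return namespace


def verify(source: str, check) -> tuple:
    """Return (passed, feedback) so a harness can refine the reasoning step."""
    try:
        state = run_program(source)
    except Exception as exc:  # runtime errors become feedback, not crashes
        return False, f"{type(exc).__name__}: {exc}"
    if check(state):
        return True, "ok"
    return False, f"check failed, final state: {state}"


# A candidate "reasoning step" written as code: sum of squares of 1..3.
candidate = "total = sum(i * i for i in range(1, 4))\nanswer = total\n"
print(verify(candidate, lambda s: s.get("answer") == 14))
```

The feedback string is what a harness would hand back to the model when the check fails or the program raises.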
Generated programs serve as policies, tool calls, behavior trees, or reusable skills for embodied, GUI, software, and tool-use environments.
Program states, repositories, traces, simulators, and tests represent state, dynamics, and feedback signals for agent interaction.
Once code is placed inside the agent loop, the harness must decide what to execute next, preserve useful state, expose the right tools, and convert failures into corrective actions.
Planning is harness control: it structures how the agent externalizes intent into executable steps, schedules interactions with code artifacts and tools, and regulates the trajectory of reasoning, execution, and revision over time.
Memory in code-as-agent-harness systems is a state-management layer: which information stays in the active context, which is compacted, and which is offloaded to durable external storage.
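The three tiers named above (active context, compacted entries, durable external storage) can be sketched roughly as follows. The class `HarnessMemory`, its eviction rule (oldest entry first), and the 40-character digest are assumptions made for illustration; real systems typically use model-written summaries and richer retrieval.

```python
import json
import tempfile
from pathlib import Path


class HarnessMemory:
    """Hypothetical three-tier memory: active, compacted, offloaded."""

    def __init__(self, store_dir: str, max_active: int = 3):
        self.active = []    # entries kept verbatim in the context window
        self.summary = []   # short digests of entries that were evicted
        self.store = Path(store_dir)  # durable storage for full entries
        self.max_active = max_active
        self._count = 0

    def add(self, entry: str) -> None:
        self.active.append(entry)
        while len(self.active) > self.max_active:
            oldest = self.active.pop(0)
            key = f"entry_{self._count:04d}"
            self._count += 1
            # Offload the full text, keep only a truncated digest in context.
            (self.store / f"{key}.json").write_text(json.dumps({"text": oldest}))
            self.summary.append(f"{key}: {oldest[:40]}")

    def recall(self, key: str) -> str:
        """Load an offloaded entry back from durable storage."""
        return json.loads((self.store / f"{key}.json").read_text())["text"]

    def context(self) -> str:
        """What the model would see: digests first, then verbatim entries."""
        return "\n".join(self.summary + self.active)


with tempfile.TemporaryDirectory() as tmp:
    memory = HarnessMemory(tmp, max_active=2)
    for step in ["read README", "ran pytest: 2 failures", "patched utils.py"]:
        memory.add(step)
    print(memory.context())
    print(memory.recall("entry_0000"))
```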
Tool usage is the action and observation layer of the code-agent harness: agents search repositories, inspect files, edit code, run commands, execute tests, call APIs, and verify intermediate results, all under typed schemas, sandboxes, and lifecycle hooks.
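One way to picture typed schemas and lifecycle hooks is a small tool registry that validates arguments before dispatch and notifies hooks around each call. `Tool`, `ToolRegistry`, and the `count_lines` tool are hypothetical names for this sketch; sandboxing is deliberately left out, and production harnesses usually rely on JSON Schema or similar rather than bare Python types.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Tool:
    name: str
    params: Dict[str, type]   # assumed schema: argument name -> Python type
    fn: Callable[..., Any]


@dataclass
class ToolRegistry:
    tools: Dict[str, Tool] = field(default_factory=dict)
    before_hooks: List[Callable[[str, dict], None]] = field(default_factory=list)
    after_hooks: List[Callable[[str, Any], None]] = field(default_factory=list)

    def register(self, tool: Tool) -> None:
        self.tools[tool.name] = tool

    def call(self, name: str, **kwargs: Any) -> Any:
        tool = self.tools[name]
        # Schema check: exact argument names, then per-argument types.
        if set(kwargs) != set(tool.params):
            raise TypeError(f"{name} expects {sorted(tool.params)}, got {sorted(kwargs)}")
        for key, expected in tool.params.items():
            if not isinstance(kwargs[key], expected):
                raise TypeError(f"{name}.{key} must be {expected.__name__}")
        for hook in self.before_hooks:   # e.g. logging or permission checks
            hook(name, kwargs)
        result = tool.fn(**kwargs)
        for hook in self.after_hooks:    # e.g. tracing or result auditing
            hook(name, result)
        return result


log = []
registry = ToolRegistry()
registry.before_hooks.append(lambda name, args: log.append(("before", name)))
registry.after_hooks.append(lambda name, result: log.append(("after", result)))
registry.register(Tool("count_lines", {"text": str}, lambda text: len(text.splitlines())))
print(registry.call("count_lines", text="a\nb"), log)
```

Rejecting malformed calls before the hooks run keeps the trace limited to actions that were actually attempted.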
| Paper | Venue |
|---|---|
| Environment-in-the-Loop: Rethinking Code Migration with LLM-based Agents | arXiv 2026 |
| Test-Time Adaptation for LLM Agents via Environment Interaction | ICLR 2026 |
Iterative debugging closes the harness loop: development environments expose feedback (compiler diagnostics, runtime errors, tests, critique), and the agent transforms these signals into diagnosis, revision, and progressively better debugging behavior.
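The loop described above can be sketched as: run the tests, turn the first failure into a feedback string, and hand it to a reviser. Here `run_tests`, `debug_loop`, and the string-replacing `revise` stub are hypothetical; in a real system the reviser would be a model call, and execution would happen in an isolated environment rather than via `exec`.

```python
import traceback
from typing import Callable, List, Optional, Tuple


def run_tests(source: str, tests: List[Tuple[tuple, object]], func_name: str) -> Optional[str]:
    """Return None if every test passes, otherwise a feedback string."""
    namespace: dict = {}
    try:
        exec(source, namespace)
        func = namespace[func_name]
        for args, expected in tests:
            actual = func(*args)
            if actual != expected:
                return f"{func_name}{args} returned {actual!r}, expected {expected!r}"
    except Exception:
        return traceback.format_exc(limit=1)
    return None


def debug_loop(source: str, tests, func_name: str,
               revise: Callable[[str, str], str], max_rounds: int = 3):
    """Alternate between testing and revising until tests pass or budget ends."""
    history = []
    for _ in range(max_rounds):
        feedback = run_tests(source, tests, func_name)
        if feedback is None:
            return source, history
        history.append(feedback)
        source = revise(source, feedback)  # stand-in for a model call
    # Budget exhausted: accept the last revision only if it passes.
    final_ok = run_tests(source, tests, func_name) is None
    return (source if final_ok else None), history


buggy = "def add(a, b):\n    return a - b\n"
tests = [((1, 2), 3), ((0, 0), 0)]
fixed, history = debug_loop(buggy, tests, "add",
                            revise=lambda src, fb: src.replace("a - b", "a + b"))
print(history)
print(fixed)
```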
| Paper | Venue |
|---|---|
| Towards Agentic Runtime Healing | arXiv 2024 |
| Large Language Model Guided Self-Debugging Code Generation | arXiv 2025 |
| Code Repair with LLMs gives an Exploration-Exploitation Tradeoff | NeurIPS 2024 |
| Debug like a Human: A Large Language Model Debugger via Verifying Runtime Execution Step by Step | ACL 2024 Findings |
| Paper | Venue |
|---|---|
| Interactive Debugging and Steering of Multi-Agent AI Systems | CHI 2025 |
| RGD: Multi-LLM Based Agent Debugger via Refinement and Generation Guidance | International Conference on Agents 2024 |
| Paper | Venue |
|---|---|
| Teaching Your Models to Understand Code via Focal Preference Alignment | arXiv 2025 |
| ReVeal: Self-Evolving Code Agents via Reliable Self-Verification | NeurIPS 2025 |
When multiple agents operate over code, the harness must coordinate roles, share intermediate artifacts, maintain common state, and verify collective progress through repositories, tests, traces, and structured workflows.
Distinct agents own slices of the shared code harness: synthesis, understanding, verification, execution, and planning.
Code-centric multi-agent interaction is artifact-mediated: agents observe and modify shared code, and grounding comes from the objective state exposed by execution.
| Paper | Venue |
|---|---|
| AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation | arXiv 2023 |
| SEW: Self-evolving agentic workflows for automated code generation | arXiv 2025 |
| Paper | Venue |
|---|---|
| ChatDev: Communicative Agents for Software Development | ACL 2024 |
| Hallucination to Consensus: Multi-Agent LLMs for End-to-End JUnit Test Generation | arXiv 2025 |
The topology of agent interaction (chain, cyclic, hierarchical, star, adaptive) is one of the most consequential design decisions in multi-agent code generation.
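To make the contrast concrete, the sketch below encodes two of the listed topologies (chain and star) as adjacency maps and routes a message through stub agents. The role names `planner`, `coder`, and `tester`, the `TOPOLOGIES` table, and the `run` function are assumptions for illustration; only acyclic graphs are handled, since a cyclic topology would also need a stopping rule.

```python
from typing import Callable, Dict, List, Tuple

Agent = Callable[[str], str]

# Hypothetical topologies: each agent maps to the agents it forwards to.
TOPOLOGIES: Dict[str, Dict[str, List[str]]] = {
    "chain": {"planner": ["coder"], "coder": ["tester"], "tester": []},
    "star": {"planner": ["coder", "tester"], "coder": [], "tester": []},
}


def run(topology: Dict[str, List[str]], agents: Dict[str, Agent],
        start: str, message: str) -> List[Tuple[str, str]]:
    """Breadth-first delivery: each agent transforms its input and forwards it."""
    trace = []
    queue = [(start, message)]
    while queue:
        name, incoming = queue.pop(0)
        outgoing = agents[name](incoming)
        trace.append((name, outgoing))
        for successor in topology[name]:
            queue.append((successor, outgoing))
    return trace


# Stub agents that tag the message so the routing is visible in the trace.
agents = {
    "planner": lambda m: m + ">plan",
    "coder": lambda m: m + ">code",
    "tester": lambda m: m + ">test",
}
for name, topology in TOPOLOGIES.items():
    print(name, run(topology, agents, "planner", "task"))
```

In the chain the tester sees the coder's output, while in the star it only sees the plan, which is the kind of difference the topology choice controls.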
Code is uniquely executable, producing objective oracle signals that anchor multi-agent coordination.
| Paper | Venue |
|---|---|
| ChatDev: Communicative Agents for Software Development | ACL 2024 |
| L2MAC: Large language model automatic computer for extensive code generation | ICLR 2024 |
| Paper | Venue |
|---|---|
| AutoSafeCoder: A Multi-Agent Framework for Securing LLM Code Generation through Static Analysis and Fuzz Testing | arXiv 2024 |
| Paper | Venue |
|---|---|
| AutoSafeCoder: A Multi-Agent Framework for Securing LLM Code Generation through Static Analysis and Fuzz Testing | arXiv 2024 |
| Paper | Venue |
|---|---|
| MARCO: Multi-Agent Code Optimization with Real-Time Knowledge Integration for High-Performance Computing | arXiv 2025 |
| Paper | Venue |
|---|---|
| MAGE: A multi-agent engine for automated RTL code generation | DAC 2025 |
How multi-agent systems maintain a consistent shared view of program state.
| Paper | Venue |
|---|---|
| L2MAC: Large language model automatic computer for extensive code generation | ICLR 2024 |
| Paper | Venue |
|---|---|
| HyperAgent: Generalist software engineering agents to solve coding tasks at scale | arXiv 2024 |
| Paper | Venue |
|---|---|
| MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework | ICLR 2024 |
| Paper | Venue |
|---|---|
| ChatDev: Communicative Agents for Software Development | ACL 2024 |
| Cogito, ergo sum: A Neurobiologically-Inspired Cognition-Memory-Growth System for Code Generation | arXiv 2025 |
| Paper | Venue |
|---|---|
| Self-Organized Agents: A LLM Multi-Agent Framework toward Ultra Large-Scale Code Generation and Optimization | arXiv 2024 |
Four levels of formalization for the shared substrate: implicit/file-only, repository-based, execution-based, and blackboard.
How a multi-agent code system decides the shared harness has reached an acceptable final state.
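A termination decision of this kind can be sketched as a predicate over the shared state. The criteria used here (all tests passing, a reviewer quorum, and a round budget) and the names `HarnessState` and `should_stop` are assumptions chosen for illustration, not a taxonomy taken from the survey.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class HarnessState:
    """Hypothetical snapshot of the shared harness after one round."""
    tests_passed: int
    tests_total: int
    approvals: int
    reviewers: int
    rounds_used: int


def should_stop(state: HarnessState, max_rounds: int = 5,
                quorum: float = 0.5) -> Tuple[bool, str]:
    """Return (stop, reason): accept, give up on budget, or keep iterating."""
    tests_ok = state.tests_total > 0 and state.tests_passed == state.tests_total
    # With no reviewers configured, execution results alone decide acceptance.
    review_ok = state.reviewers == 0 or state.approvals / state.reviewers >= quorum
    if tests_ok and review_ok:
        return True, "accepted"
    if state.rounds_used >= max_rounds:
        return True, "budget_exhausted"
    return False, "continue"


print(should_stop(HarnessState(8, 8, 2, 3, 2)))
print(should_stop(HarnessState(7, 8, 3, 3, 2)))
print(should_stop(HarnessState(7, 8, 3, 3, 5)))
```

Checking acceptance before the budget means a passing final round is reported as accepted rather than as an exhausted budget.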
| Paper | Venue |
|---|---|
| AutoSafeCoder: A Multi-Agent Framework for Securing LLM Code Generation through Static Analysis and Fuzz Testing | arXiv 2024 |
| Paper | Venue |
|---|---|
| MARCO: Multi-Agent Code Optimization with Real-Time Knowledge Integration for High-Performance Computing | arXiv 2025 |
| Paper | Venue |
|---|---|
| QualityFlow: An agentic workflow for program synthesis controlled by LLM quality checks | arXiv 2025 |
| Paper | Venue |
|---|---|
| ChatDev: Communicative Agents for Software Development | ACL 2024 |
| MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework | ICLR 2024 |
Code-centric agentic systems become operational in tangible domains where code defines observable state, executable actions, persistent memory, and feedback signals.
Repositories, tests, issue threads, and development tools form a persistent program world; assistants act over it as code-centric agents.
| Paper | Venue |
|---|---|
| Claude Code [Blog] | 2025 |
| Introducing Codex [Blog] | 2025 |
| About GitHub Copilot Cloud Agent [Blog] | 2025 |
| DeepAgents [GitHub] | 2025 |
| Model Context Protocol [Docs] | 2024 |
| Model Context Protocol (MCP): Landscape, Security Threats, and Future Research Directions | ACM TOSEM 2025 |
| The OpenHands Software Agent SDK: A Composable and Extensible Foundation for Production Agents | arXiv 2025 |
| AutoHarness: Improving LLM Agents by Automatically Synthesizing a Code Harness | arXiv 2026 |
| Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses | arXiv 2026 |
| Meta-Harness: End-to-End Optimization of Model Harnesses | arXiv 2026 |
| Natural-Language Agent Harnesses | arXiv 2026 |
| Paper | Venue |
|---|---|
| Learning to Commit: Generating Organic Pull Requests via Online Repository Memory | arXiv 2026 |
| CodeTaste: Can LLMs Generate Human-Level Code Refactorings? | arXiv 2026 |
| SWE-bench+: Enhanced Coding Benchmark for LLMs | ICSE Companion 2025 |
| Paper | Venue |
|---|---|
| Composer: Building a fast frontier model with reinforcement learning [Blog] | 2025 |
| Improving Composer through real-time reinforcement learning [Blog] | 2025 |
| Addendum to GPT-5 system card: GPT-5-Codex [Report] | 2025 |
| Building more with GPT-5.1-Codex-Max [Blog] | 2025 |
| How Anthropic teams use Claude Code [Report] | 2025 |
GUI/OS environments are program worlds in the most literal sense: every observation is rendered code, and every action is a call into another piece of code.
| Paper | Venue |
|---|---|
| 3.5 Models and Computer Use [Blog] | 2024 |
| Introducing Operator [Blog] | 2025 |
| Project Mariner [Blog] | 2025 |
| AutoWebGLM: A Large Language Model-based Web Navigating Agent | KDD 2024 |
Code grounds embodied actions in physical feasibility, accumulates reusable skills as memory, and supports auditable real-world deployment.
Hypotheses are encoded as differential equations or generative models; protocols as XDL or Opentrons scripts; analyses as Jupyter notebooks. Code carries scientific reasoning, scientific action, and the scientific environment itself.
| Paper | Venue |
|---|---|
| AIDE: AI-Driven Exploration in the Space of Code | arXiv 2025 |
| El Agente: An autonomous agent for quantum chemistry | Matter 2025 |
| PaperQA: Retrieval-Augmented Generative Agent for Scientific Research | arXiv 2023 |
| Towards an AI co-scientist | arXiv 2025 |
| Paper | Venue |
|---|---|
| AlphaEvolve: A coding agent for scientific and algorithmic discovery | arXiv 2025 |
As recommendation moves from static prediction toward interactive agents, personalization systems must reason over latent and evolving user preferences through structured, editable preference states and executable feedback pipelines.
| Paper | Venue |
|---|---|
| LLM-Powered User Simulator for Recommender System | AAAI 2025 |
| User Behavior Simulation with Large Language Model-Based Agents | ACM TOIS 2025 |
| Paper | Venue |
|---|---|
| Conversational Recommendation: Formulation, Methods, and Evaluation | SIGIR 2020 |
We thank the broader community for the contributions surveyed here. If your paper should be added or moved, please open a pull request or issue.
This repository is released under the MIT License.




