Awesome Code as Agent Harness Papers

This repository accompanies the survey Code as Agent Harness: Toward Executable, Verifiable, and Stateful Agent Systems. We study the emerging role of code in agentic AI: code is no longer only a generated artifact, but increasingly serves as an executable, inspectable, and stateful harness through which agents reason, act, model environments, receive feedback, and coordinate. The repository organizes representative papers around three connected layers: Harness Interface, Harness Mechanisms, and Scaling the Harness, covering directions such as coding assistants, GUI/OS automation, scientific discovery, and embodied intelligence.

Tip

👋 We welcome paper suggestions, pull requests, and collaborations on code as agent harness. Please contact us at xuyingn2@illinois.edu, kt42@illinois.edu, twei10@illinois.edu, zihaoli5@illinois.edu, and bei4@illinois.edu. We will keep updating this repository with recent work on code-centric agentic systems and harness engineering.

Note

📚 If you find this resource useful, please cite the survey and star the repo:

@article{ning2026codeasharness,
  title   = {Code as Agent Harness: Toward Executable, Verifiable, and Stateful Agent Systems},
  author  = {Ning, Xuying and Tieu, Katherine and Fu, Dongqi and Wei, Tianxin and Li, Zihao and Bei, Yuanchen and others},
  journal = {arXiv preprint arXiv:2605.18747},
  year    = {2026}
}

🔔 News

[2026-05] 🚀 Our survey Code as Agent Harness: Toward Executable, Verifiable, and Stateful Agent Systems is available on arXiv. Slides and project page links will be added here once available.

📋 Table of Contents

🔔 News
📋 Table of Contents
🧩 Harness Interface
🛠️ Harness Mechanisms
👥 Scaling the Harness: Multi-Agent Code-Centric Systems
🚀 Applications and Emerging Fields

🧩 Harness Interface

Code as the basic interface between a model and its task environment. Programs convert model outputs into executable, inspectable, and stateful structures: code makes reasoning executable, action programmable, and environment state inspectable.

💭 Code for Reasoning

Programs externalize internal logic into verifiable computation, allowing interpreters, symbolic solvers, execution traces, or process rewards to check and refine intermediate steps.
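As a minimal sketch of this idea (not the method of any listed paper), the snippet below treats a hard-coded string as a stand-in for model output, runs it in a fresh namespace, and applies a simple check to the result. `GENERATED_PROGRAM`, `run_program`, and `check_answer` are hypothetical names chosen for illustration, and the empty `__builtins__` mapping is only a crude guard, not a real sandbox.

```python
# Hypothetical stand-in for a model's output: the reasoning is written as a program.
GENERATED_PROGRAM = """
price_per_apple = 3
apples = 7
discount = 2
answer = price_per_apple * apples - discount
"""


def run_program(source: str) -> dict:
    """Execute model-written code in a fresh namespace and return its variables."""
    namespace: dict = {"__builtins__": {}}  # assumption: crude guard, not a sandbox
    exec(source, namespace)
    namespace.pop("__builtins__", None)
    return namespace


def check_answer(state: dict, expected_type: type = int) -> bool:
    """Minimal verifier: the program must define `answer` with the expected type."""
    return isinstance(state.get("answer"), expected_type)


state = run_program(GENERATED_PROGRAM)
print(state["answer"], check_answer(state))
```

Because the intermediate variables survive in `state`, a checker can inspect each step instead of only the final answer.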

Program-Delegated Reasoning

Paper	Venue
Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks	TMLR 2023
MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning	ICLR 2024
Chain of Code: Reasoning with a Language Model-Augmented Code Emulator	ICML 2024
Method-Based Reasoning for Large Language Models: Extraction, Reuse, and Continuous Improvement	arXiv 2025
Code-Enabled Language Models Can Outperform Reasoning Models on Diverse Tasks	arXiv 2025
When Do Program-of-Thoughts Work for Reasoning?	AAAI 2024
PAL: Program-aided Language Models	ICML 2023
Show Your Work: Scratchpads for Intermediate Computation with Language Models	arXiv 2021
Reasoning Like Program Executors	EMNLP 2022
Towards Better Understanding of Program-of-Thought Reasoning in Cross-Lingual and Multilingual Environments	ACL 2025 Findings
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models	NeurIPS 2022

Hybrid Symbolic–Neural Execution

Paper	Venue
Self-Verifying Reflection Helps Transformers with CoT Reasoning	NeurIPS 2025
SSR: Socratic Self-Refine for Large Language Model Reasoning	arXiv 2025
CodeSteer: Symbolic-Augmented Language Models via Code/Text Guidance	ICML 2025
Graph of Thoughts: Solving Elaborate Problems with Large Language Models	AAAI 2024
Code-as-Symbolic-Planner: Foundation Model-Based Robot Planning via Symbolic Code Generation	IROS 2025

Iterative Code-Grounded Reasoning

Paper	Venue
NExT: Teaching Large Language Models to Reason about Code Execution	ICML 2024
What I cannot execute, I do not understand: Training and Evaluating LLMs on Program Execution Traces	arXiv 2025
Reasoning Through Execution: Unifying Process and Outcome Rewards for Code Generation	ICML 2025
CodeRL+: Improving Code Generation via Reinforcement with Execution Semantics Alignment	arXiv 2025
RLTF: Reinforcement Learning from Unit Test Feedback	TMLR 2023
RLEF: Grounding Code LLMs in Execution Feedback with Reinforcement Learning	ICML 2025
Execution guided line-by-line code generation	NeurIPS 2025
R1-Code-Interpreter: LLMs Reason with Code via Supervised and Multi-stage Reinforcement Learning	arXiv 2025
CYCLE: Learning to Self-Refine the Code Generation	OOPSLA 2024
StepCoder: Improve Code Generation with Reinforcement Learning from Compiler Feedback	ACL 2024
CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning	NeurIPS 2022
CodePRM: Execution Feedback-enhanced Process Reward Model for Code Generation	ACL 2025 Findings
SatLM: Satisfiability-Aided Language Models Using Declarative Prompting	NeurIPS 2023
Self-Edit: Fault-Aware Code Editor for Code Generation	ACL 2023

🤖 Code for Acting

Generated programs serve as policies, tool calls, behavior trees, or reusable skills for embodied, GUI, software, and tool-use environments.

Grounded Skill Selection

Paper	Venue
Do As I Can, Not As I Say: Grounding Language in Robotic Affordances	CoRL 2022
Robots That Ask for Help: Uncertainty Alignment for Large Language Model Planners	CoRL 2023
Bootstrap Your Own Skills: Learning to Solve New Tasks with Large Language Model Guidance	CoRL 2023
SkillVLA: Tackling Combinatorial Diversity in Dual-Arm Manipulation via Skill Reuse	arXiv 2026
Scaling Up and Distilling Down: Language-Guided Robot Skill Acquisition	CoRL 2023
Lifelong Robot Library Learning: Bootstrapping Composable and Generalizable Skills for Embodied Control with Language Models	ICRA 2024

Programmatic Policy Generation

Paper	Venue
RoboCodeX: Multimodal Code Generation for Robotic Behavior Synthesis	ICML 2024
CP-Agent: Agentic Constraint Programming	arXiv 2025
LLM-Driven Corrective Robot Operation Code Generation with Static Text-Based Simulation	ICRA 2026
NormCode: A Semi-Formal Language for Auditable AI Planning	arXiv 2025
ALRM: Agentic LLM for Robotic Manipulation	arXiv 2026
RACAS: Controlling Diverse Robots With a Single Agentic System	arXiv 2026
ReAct: Synergizing Reasoning and Acting in Language Models	ICLR 2023
GenSwarm: Scalable Multi-Robot Code-Policy Generation and Deployment via Language Models	npj Robotics 2026
Code as Policies: Language Model Programs for Embodied Control	ICRA 2023
Robotic Programmer: Video Instructed Policy Code Generation for Robotic Manipulation	arXiv 2025
Code-BT: A Code-Driven Approach to Behavior Tree Generation for Robot Tasks Planning with Large Language Models	IJCAI 2025

Lifelong Code-Based Agents

Paper	Venue
Growing with Your Embodied Agent: A Human-in-the-Loop Lifelong Code Generation Framework for Long-Horizon Manipulation Skills	arXiv 2025
ViReSkill: Vision-Grounded Replanning with Skill Memory for LLM-Based Planning in Lifelong Robot Learning	arXiv 2025
UI-Voyager: A Self-Evolving GUI Agent Learning via Failed Experience	arXiv 2026
Voyager: An Open-Ended Embodied Agent with Large Language Models	TMLR 2023
Lifelong Language-Conditioned Robotic Manipulation Learning	arXiv 2026

🌍 Code for Environment Modeling

Program states, repositories, traces, simulators, and tests represent state, dynamics, and feedback signals for agent interaction.

Structured World Representations

Paper	Venue
From Programs to Poses: Factored Real-World Scene Generation via Learned Program Libraries	NeurIPS 2025
PoE-World: Compositional World Modeling with Products of Programmatic Experts	NeurIPS 2025
Code2World: A GUI World Model via Renderable Code Generation	arXiv 2026
Code2Worlds: Empowering Coding LLMs for 4D World Generation	arXiv 2026
ViStruct: Visual Structural Knowledge Extraction via Curriculum Guided Code-Vision Representation	EMNLP 2023

Execution-Trace World Modeling

Paper	Venue
SemCoder: Training Code Language Models with Comprehensive Semantics Reasoning	NeurIPS 2024
CWM: An Open-Weights LLM for Research on Code Generation with World Models	arXiv 2025
Reinforcement World Model Learning for LLM-based Agents	arXiv 2026
Agent World Model: Infinity Synthetic Environments for Agentic Reinforcement Learning	arXiv 2026
Aligning Agentic World Models via Knowledgeable Experience Learning	arXiv 2026
WorldCoder, a Model-Based LLM Agent: Building World Models by Writing Code and Interacting with the Environment	NeurIPS 2024

Code-Grounded Evaluation Environments

Paper	Venue
CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution	ICML 2024
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code	ICLR 2025
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?	ICLR 2024
AgentBench: Evaluating LLMs as Agents	ICLR 2024
CoRe: Benchmarking LLMs' Code Reasoning Capabilities through Static Analysis Tasks	NeurIPS 2025
GeoGramBench: Benchmarking the Geometric Program Reasoning in Modern LLMs	arXiv 2025
CodeGlance: Understanding Code Reasoning Challenges in LLMs through Multi-Dimensional Feature Analysis	arXiv 2026
Endless Terminals: Scaling RL Environments for Terminal Agents	arXiv 2026
Reflexion: Language Agents with Verbal Reinforcement Learning	NeurIPS 2023
CRUXEval-X: A Benchmark for Multilingual Code Reasoning, Understanding and Execution	ACL 2025
InterCode: Standardizing and Benchmarking Interactive Coding with Execution Feedback	NeurIPS 2023

🛠️ Harness Mechanisms

Once code is placed inside the agent loop, the harness must decide what to execute next, preserve useful state, expose the right tools, and convert failures into corrective actions.

🗺️ Planning for Code Agents

Planning is harness control: it structures how the agent externalizes intent into executable steps, schedules interactions with code artifacts and tools, and regulates the trajectory of reasoning, execution, and revision over time.

Linear Decomposition Planning

Paper	Venue
A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis	ICLR 2024
ReAct: Synergizing Reasoning and Acting in Language Models	ICLR 2023
Self-planning Code Generation with Large Language Models	TOSEM 2024
Knowledge-Aware Code Generation with Large Language Models	arXiv 2024
PaT: Planning-after-Trial for Efficient Test-Time Code Generation	2025
A Little Help Goes a Long Way: Tutoring LLMs in Solving Competitive Programming through Hints	TSE 2025

Structure-Grounded Planning

Paper	Venue
RPG: A Repository Planning Graph for Unified and Scalable Codebase Generation	ICLR 2026
Code Graph Model (CGM): A Graph-Integrated Large Language Model for Repository-Level Software Engineering Tasks	arXiv 2025
DomAgent: Leveraging Knowledge Graphs and Case-Based Reasoning for Domain-Specific Code Generation	AAMAS 2026
CodePlan: Repository-Level Coding Using LLMs and Planning	FSE 2024
LocAgent: Graph-Guided LLM Agents for Code Localization	ACL 2025
VerilogCoder: Autonomous Verilog Coding Agents with Graph-based Planning and Abstract Syntax Tree (AST)-based Waveform Tracing Tool	AAAI 2025

Search-Based Planning

Paper	Venue
Planning in Natural Language Improves LLM Search for Code Generation	ICLR 2025
Tree-of-Code: A Self-Growing Tree Framework for End-to-End Code Generation and Execution in Complex Tasks	ACL 2025 Findings
Let's Revise Step-by-Step: A Unified Local Search Framework for Code Generation with LLMs	NeurIPS 2025
Meta-Harness: End-to-End Optimization of Model Harnesses	arXiv 2026
DARS: Dynamic Action Re-Sampling to Enhance Coding Agent Performance by Adaptive Tree Traversal	ACL 2025
Generating Code World Models with Large Language Models Guided by Monte Carlo Tree Search	NeurIPS 2024
CodeTree: Agent-guided Tree Search for Code Generation with Large Language Models	NAACL 2025
RethinkMCTS: Refining Erroneous Thoughts in Monte Carlo Tree Search for Code Generation	EMNLP 2025
SFS: Smarter Code Space Search Improves LLM Inference Scaling	ICLR 2025

Orchestration-Based Planning

Paper	Venue
AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation	arXiv 2023
AutoSafeCoder: A Multi-Agent Framework for Securing LLM Code Generation through Static Analysis and Fuzz Testing	arXiv 2024
CodeCoR: An LLM-based self-reflective multi-agent framework for code generation	arXiv 2025
Multi-Agent Code-Orchestrated Generation for Reliable Infrastructure-as-Code	arXiv 2025
SGAgent: Suggestion-Guided LLM-Based Multi-Agent Framework for Repository-Level Software Repair	arXiv 2026
Requirements Development and Formalization for Reliable Code Generation: A Multi-Agent Vision	ASE 2025
AlgoForge: Specializing Code Generation Agents through Collaborative Reinforcement Learning	2025
MapCoder: Multi-Agent Code Generation for Competitive Problem Solving	ACL 2024
Blueprint2Code: a multi-agent pipeline for reliable code generation via blueprint planning and repair	Frontiers in AI 2025
AdaCoder: Adaptive Prompt Compression for Programmatic Visual Question Answering	ACM MM 2024

🧠 Memory and Context Engineering

Memory in code-as-agent-harness systems is a state-management layer: which information stays in the active context, which is compacted, and which is offloaded to durable external storage.
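The three tiers can be sketched as a small data structure. This is an illustrative toy under stated assumptions, not an implementation from any listed paper: the `HarnessMemory` class, its entry-count budget `max_active`, and the truncation used as "compaction" are all hypothetical simplifications (a real harness would typically budget by tokens and summarize with a model).

```python
from dataclasses import dataclass, field


@dataclass
class HarnessMemory:
    max_active: int = 3                           # hypothetical budget, counted in entries
    active: list = field(default_factory=list)    # stays in the prompt verbatim
    summary: list = field(default_factory=list)   # compacted one-line notes
    archive: dict = field(default_factory=dict)   # durable external storage stand-in

    def add(self, key: str, text: str) -> None:
        """Append an entry; evict the oldest entries once the budget is exceeded."""
        self.active.append((key, text))
        while len(self.active) > self.max_active:
            old_key, old_text = self.active.pop(0)
            self.archive[old_key] = old_text                    # offload the full text
            self.summary.append(f"{old_key}: {old_text[:20]}")  # keep a short note

    def context(self) -> str:
        """Render what the model would see: summaries first, then active entries."""
        lines = ["[summary] " + note for note in self.summary]
        lines += [f"[active] {key}: {text}" for key, text in self.active]
        return "\n".join(lines)

    def recall(self, key: str):
        """Fetch an offloaded entry by key, or None if it was never archived."""
        return self.archive.get(key)


mem = HarnessMemory(max_active=3)
for i in range(1, 5):
    mem.add(f"step{i}", f"observation number {i}")
print(mem.context())
```

After four additions the oldest entry has left the active context but remains retrievable through `mem.recall("step1")`.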

Working Memory

Paper	Venue
On the Failure of Latent State Persistence in Large Language Models	arXiv 2025
Live-SWE-agent: Can Software Engineering Agents Self-Evolve on the Fly?	arXiv 2025
CodeMem: Architecting Reproducible Agents via Dynamic MCP and Procedural Memory	arXiv 2025
RepairAgent: An Autonomous, LLM-Based Agent for Program Repair	ICSE 2025
Agentless: Demystifying LLM-based Software Engineering Agents	FSE 2025
SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering	NeurIPS 2024

Semantic Memory

Paper	Venue
From Human Memory to AI Memory: A Survey on Memory Mechanisms in the Era of LLMs	arXiv 2025
Rethinking Memory Mechanisms of Foundation Agents in the Second Half: A Survey	arXiv 2026
AgentSM: Semantic Memory for Agentic Text-to-SQL	arXiv 2026
A Survey on Large Language Models for Code Generation	TOSEM 2026
RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and Generation	EMNLP 2023
AutoCodeRover: Autonomous Program Improvement	ISSTA 2024
CodeAgent: Enhancing Code Generation with Tool-Integrated Agent Systems for Real-World Repo-level Coding Challenges	ACL 2024
A Survey on the Memory Mechanism of Large Language Model-Based Agents	TOIS 2025

Experiential Memory

Paper	Venue
Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory	arXiv 2025
MemGovern: Enhancing Code Agents through Learning from Governed Human Experiences	arXiv 2026
Leveraging Prior Experience: An Expandable Auxiliary Knowledge Base for Text-to-SQL	arXiv 2024
Towards Large Language Models with Human-Like Episodic Memory	Trends in Cognitive Sciences 2025
Episodic Memories Generation and Evaluation Benchmark for Large Language Models	ICLR 2025
ExpeL: LLM Agents Are Experiential Learners	AAAI 2024

Long-Term Memory

Paper	Venue
Memex(RL): Scaling Long-Horizon LLM Agents via Indexed Experience Memory	arXiv 2026
Mem-Gallery: Benchmarking Multimodal Long-Term Conversational Memory for MLLM Agents	arXiv 2026
MemGPT: Towards LLMs as Operating Systems	arXiv 2023
Your Code Agent Can Grow Alongside You with Structured Memory	arXiv 2026
TALM: Dynamic Tree-Structured Multi-Agent Framework with Long-Term Memory for Scalable Code Generation	arXiv 2025
Memory OS of AI Agent	EMNLP 2025
Evaluating Very Long-Term Conversational Memory of LLM Agents	ACL 2024

Multi-Agent Memory

Paper	Venue
SWE-Debate: Competitive Multi-Agent Debate for Software Issue Resolution	ICSE 2026
GameGPT: Multi-agent Collaborative Framework for Game Development	arXiv 2023
AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation	arXiv 2023
MIRIX: Multi-Agent Memory System for LLM-Based Agents	arXiv 2025
Self-Organized Agents: A LLM Multi-Agent Framework toward Ultra Large-Scale Code Generation and Optimization	arXiv 2024
Compressing Code Context for LLM-based Issue Resolution	arXiv 2026
Scaling Long-Horizon LLM Agent via Context-Folding	arXiv 2025
LongCLI-Bench: A Preliminary Benchmark and Study for Long-horizon Agentic Programming in Command-Line Interfaces	arXiv 2026
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?	ICLR 2024
G-Memory: Tracing Hierarchical Memory for Multi-Agent Systems	NeurIPS 2025

🔧 Tool Usage for Code Agents

Tool usage is the action and observation layer of the code-agent harness: agents search repositories, inspect files, edit code, run commands, execute tests, call APIs, and verify intermediate results — all under typed schemas, sandboxes, and lifecycle hooks.
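One way to picture "typed schemas" and "lifecycle hooks" is a registry that validates arguments before a tool runs and records each completed call. The sketch below is an assumption-laden toy: `Tool`, `ToolRegistry`, and the `count_lines` tool are hypothetical, schemas are plain Python types rather than JSON Schema, and no sandboxing is attempted.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Tool:
    name: str
    params: dict              # parameter name -> expected Python type
    fn: Callable[..., Any]


class ToolRegistry:
    def __init__(self) -> None:
        self._tools: dict = {}
        self.log: list = []   # post-call hook output: one record per successful call

    def register(self, tool: Tool) -> None:
        self._tools[tool.name] = tool

    def call(self, name: str, **kwargs: Any) -> Any:
        """Validate arguments against the tool's schema, run it, then log the call."""
        tool = self._tools[name]
        if set(kwargs) != set(tool.params):
            raise TypeError(f"{name}: expected parameters {sorted(tool.params)}")
        for key, expected in tool.params.items():
            if not isinstance(kwargs[key], expected):
                raise TypeError(f"{name}: {key} must be {expected.__name__}")
        result = tool.fn(**kwargs)
        self.log.append((name, kwargs))
        return result


registry = ToolRegistry()
registry.register(Tool("count_lines", {"text": str}, lambda text: len(text.splitlines())))
n_lines = registry.call("count_lines", text="a\nb")
print(n_lines)
```

Rejecting malformed calls before execution gives the agent a structured error to react to instead of an arbitrary failure inside the tool.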

Function-Oriented Tool Use

Paper	Venue
ToolCoder: Teach Code Generation Models to use API search tools	arXiv 2023
CodeQA: Advanced Programming Question-Answering Using LLM Agent and RAG	IEEE TENCON 2024
RAG-Based AI Agents for Enterprise Software Development: Implementation Patterns and Production Deployment	2025
The Devil Is in the Tails: How Long-Tailed Code Distributions Impact Large Language Models	ASE 2023

Environment-Interaction Tool Use

Paper	Venue
Environment-in-the-Loop: Rethinking Code Migration with LLM-based Agents	arXiv 2026
Test-Time Adaptation for LLM Agents via Environment Interaction	ICLR 2026

Verification-Driven Tool Use

Paper	Venue
VeriGuard: Enhancing LLM Agent Safety via Verified Code Generation	arXiv 2025
AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation	arXiv 2023
Agents4PLC: Automating Closed-loop PLC Code Generation and Verification in Industrial Control Systems using LLM-based Agents	arXiv 2025

Workflow-Orchestration Tool Use

Paper	Venue
ToolNet: Connecting Large Language Models with Massive Tools via Tool Graph	arXiv 2024
ControlLLM: Augment Language Models with Tools by Searching on Graphs	ECCV 2024
Agent Harness for Large Language Model Agents: A Survey	Preprints 2026
Executable Code Actions Elicit Better LLM Agents	ICML 2024
OpenHands: An Open Platform for AI Software Developers as Generalist Agents	ICLR 2025
On the Use of Agentic Coding: An Empirical Study of Pull Requests on GitHub	TOSEM 2025

🧪 Feedback-Guided Iterative Debugging

Iterative debugging closes the harness loop: development environments expose feedback (compiler diagnostics, runtime errors, tests, critique), and the agent transforms these signals into diagnosis, revision, and progressively better debugging behavior.
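The loop described here can be sketched in a few lines. This is a simplified illustration, not any listed system: `run_tests`, `debug_loop`, and the `solve` entry point are hypothetical, and `stub_revise` stands in for a model call that would normally read the feedback and propose a patch.

```python
import traceback
from typing import Callable, Optional


def run_tests(source: str, tests: list) -> Optional[str]:
    """Return None if every test passes, otherwise a feedback string."""
    namespace: dict = {}
    try:
        exec(source, namespace)
        for args, expected in tests:
            got = namespace["solve"](*args)
            if got != expected:
                return f"solve{args} returned {got!r}, expected {expected!r}"
    except Exception:
        # Runtime errors and missing definitions become feedback as well.
        return traceback.format_exc(limit=1)
    return None


def debug_loop(source: str, tests: list, revise: Callable[[str, str], str],
               max_rounds: int = 3):
    """Run, collect feedback, revise; stop on success or when rounds run out."""
    history = []
    for _ in range(max_rounds):
        feedback = run_tests(source, tests)
        history.append(feedback)
        if feedback is None:
            return source, history
        source = revise(source, feedback)
    return None, history


BUGGY = "def solve(a, b):\n    return a - b\n"
FIXED = "def solve(a, b):\n    return a + b\n"


def stub_revise(source: str, feedback: str) -> str:
    """Hypothetical stand-in for a model that patches code given the feedback."""
    return FIXED


final, history = debug_loop(BUGGY, [((2, 3), 5)], stub_revise)
print(history)
```

The `history` list keeps every feedback signal, which is the raw material a harness could use for diagnosis or for learning better debugging behavior over time.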

Development Environments for Agentic Coding

Contextual Environments for Repository-Aware Generation

Paper	Venue
On the Impacts of Contexts on Repository-Level Code Generation	NAACL 2025 Findings
A Survey on Model Context Protocol: Architecture, State-of-the-art, Challenges and Future Directions	TechRxiv 2025
CodexGraph: Bridging Large Language Models and Code Repositories via Code Graph Databases	NAACL 2025
RepoAgent: An LLM-Powered Open-Source Framework for Repository-level Code Documentation Generation	EMNLP 2024 (Demo)
Knowledge Graph Based Repository-Level Code Generation	LLM4Code@ICSE 2025
From Glue-Code to Protocols: A Critical Analysis of A2A and MCP Integration for Scalable Agent Systems	arXiv 2025
Retrieval-Augmented Code Generation: A Survey with Focus on Repository-Level Approaches	arXiv 2026
A³-CodGen: A Repository-Level Code Generation Framework for Code Reuse with Local-Aware, Global-Aware, and Third-Party-Library-Aware	TSE 2024

Interactive Environments for Human–LLM Collaboration

Paper	Venue
Conversational AI as a Coding Assistant: Understanding Programmers' Interactions with and Expectations from Large Language Models for Coding	arXiv 2025
The Design Space of LLM-Based AI Coding Assistants: An Analysis of 90 Systems in Academia and Industry	VL/HCC 2025
Language Server Protocol: Defines a Common Protocol for Language Servers [Spec]	—
Deductive Verification via the Debug Adapter Protocol	arXiv 2021
Model Context Protocol (MCP): Landscape, Security Threats, and Future Research Directions	TOSEM 2025
The Programmer's Assistant: Conversational Interaction with a Large Language Model for Software Development	IUI 2023
Human-AI Experience in Integrated Development Environments: A Systematic Literature Review	Empirical Software Engineering 2026

Execution and Validation Environments

Paper	Venue
RepoST: Scalable Repository-Level Coding Environment Construction with Sandbox Testing	arXiv 2025
Klear-CodeTest: Scalable Test Case Generation for Code Reinforcement Learning	arXiv 2025
FeedbackEval: A Benchmark for Evaluating Large Language Models in Feedback-Driven Code Repair Tasks	arXiv 2026
LLMLOOP: Improving LLM-Generated Code and Tests Through Automated Iterative Feedback Loops	ICSME 2025
OpenAgentSafety: A Comprehensive Framework for Evaluating Real-World AI Agent Safety	ICLR 2026
KubeIntellect: A Modular LLM-Orchestrated Agent Framework for End-to-End Kubernetes Management	arXiv 2025
MLDebugging: Towards Benchmarking Code Debugging Across Multi-Library Scenarios	ACL 2025 Findings
ECCO: Can We Improve Model-Generated Code Efficiency Without Sacrificing Functional Correctness?	EMNLP 2024

Engineering Platforms for Deployment and Workflow Integration

Paper	Venue
LLM-Based Multi-Agent Systems for Software Engineering: Literature Review, Vision, and the Road Ahead	TOSEM 2024
AgentMesh: A Cooperative Multi-Agent Generative AI Framework for Software Development Automation	arXiv 2025
ALMAS: an Autonomous LLM-based Multi-Agent Software Engineering Framework	arXiv 2025
From challenges to metrics: An LLM-driven DevOps recommendation system grounded in evidence-based mappings	Array 2025
AI Augmented CI/CD Pipelines: From Code Commit to Production with Autonomous Decisions	IEEE FLLM 2025
A Multi-Agent Coding Assistant for Cloud-Native Development: From Requirements to Deployable Microservices	Preprints 2025
Continuous QoS-compliant Orchestration in the Cloud-Edge Continuum	Software: Practice and Experience 2024
From Code Generation to AI Collaboration: The Role of Multi-Agent Systems in Software Engineering	2025
AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversations	COLM 2024

Feedback Mechanisms for Iterative Debugging

Compilation and Static-Analysis Feedback

Paper	Venue
The Debugging Decay Index: Rethinking Debugging Strategies for Code LLMs	arXiv 2025
Helping LLMs Improve Code Generation Using Feedback from Testing and Static Analysis	Discover Artificial Intelligence 2024
Enhancing LLM Code Generation: A Systematic Evaluation of Multi-Agent Collaboration and Runtime Debugging for Improved Accuracy, Reliability, and Latency	arXiv 2025
Iterative Refinement of Project-Level Code Context for Precise Code Generation with Compiler Feedback	ACL 2024 Findings
Static Analysis as a Feedback Loop: Enhancing LLM-Generated Code Beyond Correctness	arXiv 2025

Runtime Error and Exception Feedback

Paper	Venue
Towards Agentic Runtime Healing	arXiv 2024
Large Language Model Guided Self-Debugging Code Generation	arXiv 2025
Code Repair with LLMs gives an Exploration-Exploitation Tradeoff	NeurIPS 2024
Debug like a Human: A Large Language Model Debugger via Verifying Runtime Execution Step by Step	ACL 2024 Findings

Test-Based Execution Feedback

Paper	Venue
Teaching Large Language Models to Self-Debug	arXiv 2023
Learning to generate unit tests for automated debugging	COLM 2025
TestART: Improving LLM-Based Unit Testing via Co-Evolution of Automated Generation and Repair Iteration	arXiv 2024
From Code to Correctness: Closing the Last Mile of Code Generation with Hierarchical Debugging	ICSE 2026
Revisit Self-Debugging with Self-Generated Tests for Code Generation	ACL 2025
LLM-Based Test-Driven Interactive Code Generation: User Study and Empirical Evaluation	TSE 2024

Critique-Driven Feedback (Human or Auxiliary Agents)

Paper	Venue
Interactive Debugging and Steering of Multi-Agent AI Systems	CHI 2025
RGD: Multi-LLM Based Agent Debugger via Refinement and Generation Guidance	International Conference on Agents 2024

Feedback-Driven Debugging and Self-Improvement

Paper	Venue
Teaching Your Models to Understand Code via Focal Preference Alignment	arXiv 2025
ReVeal: Self-Evolving Code Agents via Reliable Self-Verification	NeurIPS 2025

👥 Scaling the Harness: Multi-Agent Code-Centric Systems

When multiple agents operate over code, the harness must coordinate roles, share intermediate artifacts, maintain common state, and verify collective progress through repositories, tests, traces, and structured workflows.

🎭 Functional Role Specialization

Distinct agents own slices of the shared code harness — synthesis, understanding, verification, execution, and planning.

Program Synthesis Agents

Paper	Venue
AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation	arXiv 2023
MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework	ICLR 2024
ChatDev: Communicative Agents for Software Development	ACL 2024
MAGE: A multi-agent engine for automated RTL code generation	DAC 2025
Self-collaboration Code Generation via ChatGPT	TOSEM 2024

Program Understanding Agents

Paper	Venue
HyperAgent: Generalist software engineering agents to solve coding tasks at scale	arXiv 2024
Lingma SWE-GPT: An Open Development-Process-Centric Language Model for Automated Software Improvement	ISSTA 2025
CleanAgent: Automating data standardization with LLM-based agents	arXiv 2024
MAGIS: LLM-Based Multi-Agent Framework for GitHub Issue Resolution	NeurIPS 2024

Verification Agents

Paper	Venue
QualityFlow: An agentic workflow for program synthesis controlled by LLM quality checks	arXiv 2025
AutoSafeCoder: A Multi-Agent Framework for Securing LLM Code Generation through Static Analysis and Fuzz Testing	arXiv 2024
Hallucination to Consensus: Multi-Agent LLMs for End-to-End JUnit Test Generation	arXiv 2025

Execution Agents

Paper	Venue
AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation	arXiv 2023
HyperAgent: Generalist software engineering agents to solve coding tasks at scale	arXiv 2024
MAGE: A multi-agent engine for automated RTL code generation	DAC 2025

Planning Agents

Paper	Venue
Self-Organized Agents: A LLM Multi-Agent Framework toward Ultra Large-Scale Code Generation and Optimization	arXiv 2024
Self-Evolving Multi-Agent Collaboration Networks for Software Development	ICLR 2025
SOEN-101: Code Generation by Emulating Software Process Models Using Large Language Model Agents	ICSE 2025

💬 Interaction Modes

Code-centric multi-agent interaction is artifact-mediated: agents observe and modify shared code, and grounding comes from the objective state exposed by execution.

Collaborative Synthesis

Paper	Venue
CodePori: Large-Scale System for Autonomous Software Development Using Multi-Agent Technology	arXiv 2024
A Pair Programming Framework for Code Generation via Multi-Plan Exploration and Feedback-Driven Refinement	ASE 2024

Critique and Repair

Paper	Venue
AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation	arXiv 2023
SEW: Self-evolving agentic workflows for automated code generation	arXiv 2025

Adversarial Validation

Paper	Venue
AutoSafeCoder: A Multi-Agent Framework for Securing LLM Code Generation through Static Analysis and Fuzz Testing	arXiv 2024
MAGE: A multi-agent engine for automated RTL code generation	DAC 2025

Reasoning Debate

Paper	Venue
ChatDev: Communicative Agents for Software Development	ACL 2024
Hallucination to Consensus: Multi-Agent LLMs for End-to-End JUnit Test Generation	arXiv 2025

🕸️ Workflow Topology

The topology of agent interaction (chain, cyclic, hierarchical, star, adaptive) is one of the most consequential design decisions in multi-agent code generation.

Pre-Defined Heuristic Topologies (Waterfall / Iterative / Hierarchical / Star)

Paper	Venue
ChatDev: Communicative Agents for Software Development	ACL 2024
MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework	ICLR 2024
L2MAC: Large language model automatic computer for extensive code generation	ICLR 2024
AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation	arXiv 2023
MAGE: A multi-agent engine for automated RTL code generation	DAC 2025
HyperAgent: Generalist software engineering agents to solve coding tasks at scale	arXiv 2024
Self-Organized Agents: A LLM Multi-Agent Framework toward Ultra Large-Scale Code Generation and Optimization	arXiv 2024
Hallucination to Consensus: Multi-Agent LLMs for End-to-End JUnit Test Generation	arXiv 2025

Objective-Driven and Adaptive Topologies

Paper	Venue
FlowReasoner: Reinforcing Query-Level Meta-Agents	arXiv 2025
BOAD: Discovering Hierarchical Software Engineering Agents via Bandit Optimization	arXiv 2025
SEW: Self-evolving agentic workflows for automated code generation	arXiv 2025

⚡ Execution Feedback Integration

Code is uniquely executable, producing objective oracle signals that anchor multi-agent coordination.

Compiler and Syntax Feedback

Paper	Venue
ChatDev: Communicative Agents for Software Development	ACL 2024
L2MAC: Large language model automatic computer for extensive code generation	ICLR 2024

Test Pass/Fail Signals

Paper	Venue
AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation	arXiv 2023
QualityFlow: An agentic workflow for program synthesis controlled by LLM quality checks	arXiv 2025

Fuzzer Crash Traces

Paper	Venue
AutoSafeCoder: A Multi-Agent Framework for Securing LLM Code Generation through Static Analysis and Fuzz Testing	arXiv 2024

Static Analysis Warnings

Paper	Venue
AutoSafeCoder: A Multi-Agent Framework for Securing LLM Code Generation through Static Analysis and Fuzz Testing	arXiv 2024

Performance Profiling Results

Paper	Venue
MARCO: Multi-Agent Code Optimization with Real-Time Knowledge Integration for High-Performance Computing	arXiv 2025

Fine-Grained Simulation Feedback

Paper	Venue
MAGE: A multi-agent engine for automated RTL code generation	DAC 2025

🔄 Shared-Harness Synchronization

How multi-agent systems maintain a consistent shared view of program state.

Shared Blackboard

Paper	Venue
L2MAC: Large language model automatic computer for extensive code generation	ICLR 2024

Parallel Branches with Merge

Paper	Venue
HyperAgent: Generalist software engineering agents to solve coding tasks at scale	arXiv 2024

Structured Context Scheduling

Paper	Venue
MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework	ICLR 2024

Hierarchical Memory

Paper	Venue
ChatDev: Communicative Agents for Software Development	ACL 2024
Cogito, ergo sum: A Neurobiologically-Inspired Cognition-Memory-Growth System for Code Generation	arXiv 2025

Agent Pool Scaling

Paper	Venue
Self-Organized Agents: A LLM Multi-Agent Framework toward Ultra Large-Scale Code Generation and Optimization	arXiv 2024

🏛️ Shared Harness Representation

Four levels of formalization for the shared substrate: implicit/file-only, repository-based, execution-based, and blackboard.

Implicit / File-Only Representation

Paper	Venue
ChatDev: Communicative Agents for Software Development	ACL 2024
MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework	ICLR 2024
CodeCoR: An LLM-based self-reflective multi-agent framework for code generation	arXiv 2025
SEW: Self-evolving agentic workflows for automated code generation	arXiv 2025
CodePori: Large-Scale System for Autonomous Software Development Using Multi-Agent Technology	arXiv 2024
SyncMind: Measuring Agent Out-of-Sync Recovery in Collaborative Software Engineering	ICML 2025

Repository-Based Representation

Paper	Venue
HyperAgent: Generalist software engineering agents to solve coding tasks at scale	arXiv 2024
Lingma SWE-GPT: An Open Development-Process-Centric Language Model for Automated Software Improvement	ISSTA 2025

Execution-Based Representation

Paper	Venue
AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation	arXiv 2023
AutoSafeCoder: A Multi-Agent Framework for Securing LLM Code Generation through Static Analysis and Fuzz Testing	arXiv 2024
QualityFlow: An agentic workflow for program synthesis controlled by LLM quality checks	arXiv 2025
MARCO: Multi-Agent Code Optimization with Real-Time Knowledge Integration for High-Performance Computing	arXiv 2025
Hallucination to Consensus: Multi-Agent LLMs for End-to-End JUnit Test Generation	arXiv 2025
MAGE: A multi-agent engine for automated RTL code generation	DAC 2025

Blackboard / Shared-State Representation

Paper	Venue
L2MAC: Large language model automatic computer for extensive code generation	ICLR 2024
GameGPT: Multi-agent Collaborative Framework for Game Development	arXiv 2023
Cogito, ergo sum: A Neurobiologically-Inspired Cognition-Memory-Growth System for Code Generation	arXiv 2025

🎯 Harness-State Convergence

How a multi-agent code system decides the shared harness has reached an acceptable final state.
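
As a rough illustration of the stopping criteria catalogued below, the following minimal sketch models three of them (test-gated, score-based, and consensus) as predicates over a shared state snapshot. It is not taken from any listed paper: `HarnessState`, the predicate names, the thresholds, and `fix_one_test` are all hypothetical, and implicit convergence is approximated here as simply running out of the round budget.

```python
from dataclasses import dataclass, replace
from typing import Callable, Sequence, Tuple


@dataclass(frozen=True)
class HarnessState:
    """Hypothetical snapshot of the shared harness after one agent round."""
    tests_passed: int
    tests_total: int
    score: float            # assumed judge or simulator score in [0, 1]
    votes: Sequence[bool]   # assumed per-agent accept/reject votes


Criterion = Callable[[HarnessState], bool]


def is_test_gated(state: HarnessState) -> bool:
    # Correctness convergence: every test in a non-empty suite passes.
    return state.tests_total > 0 and state.tests_passed == state.tests_total


def is_score_based(state: HarnessState, threshold: float = 0.9) -> bool:
    # Score-based convergence: the score reaches an assumed threshold.
    return state.score >= threshold


def is_consensus(state: HarnessState, quorum: float = 0.5) -> bool:
    # Consensus convergence: strictly more than `quorum` of agents accept.
    return len(state.votes) > 0 and sum(state.votes) / len(state.votes) > quorum


def run_until_converged(
    step: Callable[[HarnessState], HarnessState],
    state: HarnessState,
    criterion: Criterion,
    max_rounds: int = 10,
) -> Tuple[HarnessState, int, bool]:
    # Apply `step` until the criterion holds or the round budget runs out.
    # Exhausting the budget stands in for "implicit" convergence.
    rounds = 0
    while not criterion(state) and rounds < max_rounds:
        state = step(state)
        rounds += 1
    return state, rounds, criterion(state)


def fix_one_test(state: HarnessState) -> HarnessState:
    # Toy stand-in for one agent round that repairs a single failing test.
    return replace(state, tests_passed=min(state.tests_passed + 1, state.tests_total))


start = HarnessState(tests_passed=1, tests_total=3, score=0.4, votes=(True, False, False))
final, rounds, converged = run_until_converged(fix_one_test, start, is_test_gated)
print(rounds, converged)  # prints "2 True"
```

Passing the criterion as a plain callable keeps the loop independent of which convergence notion a given system uses; security or performance gates could be expressed as further predicates of the same shape.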

Correctness Convergence (Test-Gated)

Paper	Venue
AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation	arXiv 2023
L2MAC: Large language model automatic computer for extensive code generation	ICLR 2024
Hallucination to Consensus: Multi-Agent LLMs for End-to-End JUnit Test Generation	arXiv 2025

Security Convergence

Paper	Venue
AutoSafeCoder: A Multi-Agent Framework for Securing LLM Code Generation through Static Analysis and Fuzz Testing	arXiv 2024

Performance Convergence

Paper	Venue
MARCO: Multi-Agent Code Optimization with Real-Time Knowledge Integration for High-Performance Computing	arXiv 2025

Score-Based Convergence

Paper	Venue
MAGE: A multi-agent engine for automated RTL code generation	DAC 2025
CodeCoR: An LLM-based self-reflective multi-agent framework for code generation	arXiv 2025
Trae Agent: An LLM-based Agent for Software Engineering with Test-time Scaling	arXiv 2025

Consensus Convergence

Paper	Venue
QualityFlow: An agentic workflow for program synthesis controlled by LLM quality checks	arXiv 2025

Implicit Convergence

Paper	Venue
ChatDev: Communicative Agents for Software Development	ACL 2024
MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework	ICLR 2024

🚀 Applications and Emerging Fields

Code-centric agentic systems become operational in tangible domains where code defines observable state, executable actions, persistent memory, and feedback signals.

💻 Code Assistants

Repositories, tests, issue threads, and development tools form a persistent program world; assistants act over it as code-centric agents.

The Repository as a Persistent Program World

Paper	Venue
RepoCoder: Repository-Level Code Completion through Iterative Retrieval and Generation	EMNLP 2023
CodexGraph: Bridging Large Language Models and Code Repositories via Code Graph Databases	NAACL 2025
AutoCodeRover: Autonomous Program Improvement	ISSTA 2024

Agent Harnesses as Executable Development Interfaces

Paper	Venue
Claude Code [Blog]	2025
Introducing Codex [Blog]	2025
About GitHub Copilot Cloud Agent [Blog]	2025
DeepAgents [GitHub]	2025
Model Context Protocol [Docs]	2024
Model Context Protocol (MCP): Landscape, Security Threats, and Future Research Directions	ACM TOSEM 2025
The OpenHands Software Agent SDK: A Composable and Extensible Foundation for Production Agents	arXiv 2025
AutoHarness: Improving LLM Agents by Automatically Synthesizing a Code Harness	arXiv 2026
Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses	arXiv 2026
Meta-Harness: End-to-End Optimization of Model Harnesses	arXiv 2026
Natural-Language Agent Harnesses	arXiv 2026

Execution Feedback as Grounded Verification

Paper	Venue
Agentless: Demystifying LLM-based Software Engineering Agents	arXiv 2024
RepairAgent: An Autonomous, LLM-Based Agent for Program Repair	ICSE 2025
Live-SWE-agent: Can Software Engineering Agents Self-Evolve on the Fly?	arXiv 2025
Code Generation with AlphaCodium: From Prompt Engineering to Flow Engineering	arXiv 2024

Memory and Context Management at Repository Scale

Paper	Venue
RepoAgent: An LLM-Powered Open-Source Framework for Repository-level Code Documentation Generation	EMNLP 2024 (Demo)
ContextBench: A Benchmark for Context Retrieval in Coding Agents	arXiv 2026
CodeMem: Architecting Reproducible Agents via Dynamic MCP and Procedural Memory	arXiv 2025
MemGovern: Enhancing Code Agents through Learning from Governed Human Experiences	arXiv 2026

Developer Intent and Project Conventions as Latent State

Paper	Venue
Learning to Commit: Generating Organic Pull Requests via Online Repository Memory	arXiv 2026
CodeTaste: Can LLMs Generate Human-Level Code Refactorings?	arXiv 2026
SWE-bench+: Enhanced Coding Benchmark for LLMs	ICSE Companion 2025

From Inline Completion to Autonomous SWE Agents

Paper	Venue
Evaluating Large Language Models Trained on Code	arXiv 2021
The Impact of AI on Developer Productivity: Evidence from GitHub Copilot	arXiv 2023
Expectation vs. Experience: Evaluating the Usability of Code Generation Tools Powered by Large Language Models	CHI Extended Abstracts 2022
Reading Between the Lines: Modeling User Behavior and Costs in AI-Assisted Programming	CHI 2024

From Patch Generation to Software Lifecycle Participation

Paper	Venue
SWE-bench: Can Language Models Resolve Real-world GitHub Issues?	ICLR 2024
SWE-lancer: Can frontier LLMs earn $1 million from real-world freelance software engineering?	ICML 2025
SWE-bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?	arXiv 2025
Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces	arXiv 2026
AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents	ACL 2024
τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains	ICLR 2025
AI Augmented CI/CD Pipelines: From Code Commit to Production with Autonomous Decisions	IEEE FLLM 2025
Advances and Frontiers of LLM-based Issue Resolution in Software Engineering: A Comprehensive Survey	arXiv 2026
Alibaba LingmaAgent: Improving Automated Issue Resolution via Comprehensive Repository Exploration	FSE 2025
CodeAgent: Autonomous Communicative Agents for Code Review	EMNLP 2024

Multi-Agent Code Assistance and Shared Repositories

Paper	Venue
ChatDev: Communicative Agents for Software Development	ACL 2024
MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework	ICLR 2024
CodeAgent: Enhancing Code Generation with Tool-Integrated Agent Systems for Real-World Repo-Level Coding Challenges	ACL 2024
METAL: A Multi-Agent Framework for Chart Generation with Test-Time Scaling	ACL 2025

The Harness as a Distillation Surface

Paper	Venue
Composer: Building a fast frontier model with reinforcement learning [Blog]	2025
Improving Composer through real-time reinforcement learning [Blog]	2025
Addendum to GPT-5 system card: GPT-5-Codex [Report]	2025
Building more with GPT-5.1-Codex-Max [Blog]	2025
How Anthropic teams use Claude Code [Report]	2025

Open Challenges for Code-Assistant Harnesses

Paper	Venue
Are "Solved Issues" in SWE-bench Really Solved Correctly? An Empirical Study	arXiv 2025
SWE-Bench++: A Framework for the Scalable Generation of Software Engineering Benchmarks	arXiv 2025
Introducing Aardvark: OpenAI's Agentic Security Researcher [Blog]	2025
Codex Security: Now in Research Preview [Blog]	2026
Why Do Multi-Agent LLM Systems Fail?	arXiv 2025
Which Agent Causes Task Failures and When? On Automated Failure Attribution of LLM Multi-Agent Systems	arXiv 2025
AgenTracer: Who Is Inducing Failure in the LLM Agentic Systems?	arXiv 2025
Where LLM Agents Fail and How They Can Learn from Failures	arXiv 2025
Beyond Static Sandboxing: Learned Capability Governance for Autonomous AI Agents	arXiv 2026
Fault-Tolerant Sandboxing for AI Coding Agents: A Transactional Approach to Safe Autonomous Execution	arXiv 2025
Introducing the Agent Governance Toolkit: Open-Source Runtime Security for AI Agents [Blog]	2026

🖥️ GUI / OS Agents

GUI/OS environments are program worlds in the most literal sense: every observation is rendered code, and every action is a call into another piece of code.

GUI/OS as a Partially Observable Program World

Paper	Venue
WebArena: A Realistic Web Environment for Building Autonomous Agents	ICLR 2024
Mind2Web: Towards a Generalist Agent for the Web	NeurIPS 2023
AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents	ICLR 2025
Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale	ICML 2025
AgentOccam: A Simple Yet Strong Baseline for LLM-Based Web Agents	ICLR 2025
GPT-4V(ision) is a Generalist Web Agent, if Grounded	ICML 2024
WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models	ACL 2024
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments	NeurIPS 2024
Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V	arXiv 2023
WorkArena: How Capable are Web Agents at Solving Common Knowledge Work Tasks?	ICML 2024
CogAgent: A Visual Language Model for GUI Agents	CVPR 2024

Unifying Perception, Action, and Evaluation Through Code

Paper	Venue
Executable Code Actions Elicit Better LLM Agents	ICML 2024
Cradle: Empowering Foundation Agents towards General Computer Control	ICML 2025
TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks	NeurIPS 2025
SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents	ACL 2024
Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs	ECCV 2024
OS-ATLAS: Foundation Action Model for Generalist GUI Agents	ICLR 2025
ShowUI: One Vision-Language-Action Model for GUI Visual Agent	CVPR 2025
Aria-UI: Visual Grounding for GUI Instructions	ACL 2025 Findings
Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents	ICLR 2025
UI-TARS: Pioneering Automated GUI Interaction with Native Agents	arXiv 2025
GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL	arXiv 2026
Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows?	NeurIPS 2024

Memory as Persistent Program State

Paper	Venue
Synapse: Trajectory-as-Exemplar Prompting with Memory for Computer Control	ICLR 2024
AppAgent: Multimodal Agents as Smartphone Users	CHI 2025
Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration	NeurIPS 2024
UI-Voyager: A Self-Evolving GUI Agent Learning via Failed Experience	arXiv 2026
AutoGLM: Autonomous Foundation Agents for GUIs	arXiv 2024
OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis	ACL 2025
PlugMem: A Task-Agnostic Plugin Memory Module for LLM Agents	arXiv 2026

UI Simulators and Sandboxes as Executable Dynamics

Paper	Venue
Reinforcement Learning on Web Interfaces Using Workflow-Guided Exploration	ICLR 2018
WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents	NeurIPS 2022
VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks	ACL 2024
Understanding the Weakness of Large Language Model Agents within a Complex Android Environment	KDD 2024
AndroidLab: Training and Systematic Benchmarking of Android Autonomous Agents	ACL 2025
AgentBench: Evaluating LLMs as Agents	ICLR 2024
Code2World: A GUI World Model via Renderable Code Generation	arXiv 2026

From Simulation to Production: Executable Feedback Loops

Paper	Venue
3.5 Models and Computer Use [Blog]	2024
Introducing Operator [Blog]	2025
Project Mariner [Blog]	2025
AutoWebGLM: A Large Language Model-based Web Navigating Agent	KDD 2024

🤖 Autonomous Embodied Agents

Code grounds embodied actions in physical feasibility, accumulates reusable skills as memory, and supports auditable real-world deployment.

Agent Harness for Grounded and Verifiable Embodied Actions

Paper	Venue
Code as Policies: Language Model Programs for Embodied Control	ICRA 2023
ChatGPT for Robotics: Design Principles and Model Abilities	IEEE Access 2024
Inner Monologue: Embodied Reasoning through Planning with Language Models	CoRL 2022
VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models	CoRL 2023
The Marathon 2: A Navigation System	IROS 2020
PaLM-E: An Embodied Multimodal Language Model	ICML 2023
Gemini Robotics 1.5: Pushing the Frontier of Generalist Robots with Advanced Embodied Reasoning	arXiv 2025
Do As I Can, Not As I Say: Grounding Language in Robotic Affordances	CoRL 2022
Robots That Ask for Help: Uncertainty Alignment for Large Language Model Planners	CoRL 2023
SkillVLA: Tackling Combinatorial Diversity in Dual-Arm Manipulation via Skill Reuse	arXiv 2026
Bootstrap Your Own Skills: Learning to Solve New Tasks with Large Language Model Guidance	CoRL 2023
RoboCodeX: Multimodal Code Generation for Robotic Behavior Synthesis	ICML 2024
Robotic Programmer: Video Instructed Policy Code Generation for Robotic Manipulation	IROS 2025
Code-BT: A Code-Driven Approach to Behavior Tree Generation for Robot Tasks Planning with Large Language Models	IJCAI 2025
LLM-Driven Corrective Robot Operation Code Generation with Static Text-Based Simulation	ICRA 2026
NormCode: A Semi-Formal Language for Auditable AI Planning	arXiv 2025
CP-Agent: Agentic Constraint Programming	arXiv 2025
VeriGuard: Enhancing LLM Agent Safety via Verified Code Generation	arXiv 2025

Reusable Skills as Embodied Memory

Paper	Venue
Voyager: An Open-Ended Embodied Agent with Large Language Models	NeurIPS 2023
Lifelong Robot Library Learning: Bootstrapping Composable and Generalizable Skills for Embodied Control with Language Models	ICRA 2024
Growing with Your Embodied Agent: A Human-in-the-Loop Lifelong Code Generation Framework for Long-Horizon Manipulation Skills	arXiv 2025
ViReSkill: Vision-Grounded Replanning with Skill Memory for LLM-Based Planning in Lifelong Robot Learning	arXiv 2025
Lifelong Language-Conditioned Robotic Manipulation Learning	AAAI 2026
UI-Voyager: A Self-Evolving GUI Agent Learning via Failed Experience	arXiv 2026

Coordinated and Auditable Real-World Deployment

Paper	Venue
GenSwarm: Scalable Multi-Robot Code-Policy Generation and Deployment via Language Models	npj Robotics 2026
Agents4PLC: Automating Closed-Loop PLC Code Generation and Verification in Industrial Control Systems	IEEE TSE 2026
RACAS: Controlling Diverse Robots With a Single Agentic System	arXiv 2026
ALRM: Agentic LLM for Robotic Manipulation	arXiv 2026

🔬 Scientific Discovery Agents

Hypotheses are encoded as differential equations or generative models; protocols as XDL or Opentrons scripts; analyses as Jupyter notebooks. Code carries scientific reasoning, scientific action, and the scientific environment itself.

Scientific Discovery as a Partially Observable Program World

Paper	Venue
The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search	arXiv 2025
ChemCrow: Augmenting large-language models with chemistry tools	Nature MI 2024
Autonomous chemical research with large language models	Nature 2023
Biomni: A General-Purpose Biomedical AI Agent	bioRxiv 2025
Olympiad-Level Formal Mathematical Reasoning with Reinforcement Learning	Nature 2025
The virtual lab of AI agents designs new SARS-CoV-2 nanobodies	Nature 2025

Unifying Ideation, Experimentation, Analysis, and Communication

Paper	Venue
ResearchAgent: Iterative Research Idea Generation over Scientific Literature with Large Language Models	NAACL 2025
BioPlanner: Automatic Evaluation of LLMs on Protocol Planning in Biology	EMNLP 2023
Agent Laboratory: Using LLM Agents as Research Assistants	EMNLP 2025 Findings
AgentRxiv: Towards Collaborative Autonomous Research	arXiv 2025
The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery	arXiv 2024
Towards Scientific Intelligence: A Survey of LLM-based Scientific Agents	arXiv 2026
Executable Code Actions Elicit Better LLM Agents	ICML 2024
A universal system for digitization and automatic execution of the chemical synthesis literature	Science 2020