YennNing/Awesome-Code-as-Agent-Harness-Papers


Awesome Code as Agent Harness Papers


This repository accompanies the survey Code as Agent Harness: Toward Executable, Verifiable, and Stateful Agent Systems. We study the emerging role of code in agentic AI: code is no longer only a generated artifact, but increasingly serves as an executable, inspectable, and stateful harness through which agents reason, act, model environments, receive feedback, and coordinate. The repository organizes representative papers around three connected layers: Harness Interface, Harness Mechanisms, and Scaling the Harness, covering directions such as coding assistants, GUI/OS automation, scientific discovery, and embodied intelligence.

Tip

👋 We welcome paper suggestions, pull requests, and collaborations on code as agent harness. Please contact us at xuyingn2@illinois.edu, kt42@illinois.edu, twei10@illinois.edu, zihaoli5@illinois.edu, and bei4@illinois.edu. We will keep updating this repository with recent work on code-centric agentic systems and harness engineering.

Note

📚 If you find this resource useful, please cite our survey and star the repo:

@article{ning2026codeasharness,
  title   = {Code as Agent Harness: Toward Executable, Verifiable, and Stateful Agent Systems},
  author  = {Ning, Xuying and Tieu, Katherine and Fu, Dongqi and Wei, Tianxin and Li, Zihao and Bei, Yuanchen and others},
  journal = {arXiv preprint arXiv:2605.18747},
  year    = {2026}
}

[Figure: Framework overview]

🔔 News

[2026-05] 🚀 Our survey Code as Agent Harness: Toward Executable, Verifiable, and Stateful Agent Systems is available on arXiv. Slides and project page links will be added here once available.


🧩 Harness Interface

Code as the basic interface between a model and its task environment. Programs convert model outputs into executable, inspectable, and stateful structures: code makes reasoning executable, action programmable, and environment state inspectable.

[Figure: Harness interface]

💭 Code for Reasoning

Programs externalize internal logic into verifiable computation, allowing interpreters, symbolic solvers, execution traces, or process rewards to check and refine intermediate steps.
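
As a minimal sketch of program-delegated reasoning, the snippet below runs a model-written program and reads back a result variable instead of trusting free-text arithmetic. The `generated_program` string stands in for model output, and `run_program` and the `answer` convention are hypothetical names used only for illustration. A real harness would sandbox execution rather than call `exec` directly.

```python
# Hypothetical stand-in for model output: the model writes code, not the final number.
generated_program = """
price_cents = 1250
quantity = 8
discount_percent = 20
answer = price_cents * quantity * (100 - discount_percent) // 100
"""


def run_program(source: str, result_name: str = "answer"):
    """Execute model-written code in a fresh namespace and return one variable.

    Illustration only: exec() is not a security sandbox.
    """
    namespace = {"__builtins__": {}}  # expose no builtins to the program
    exec(source, namespace)
    if result_name not in namespace:
        raise KeyError(f"program did not define {result_name!r}")
    return namespace[result_name]


result = run_program(generated_program)
print(result)  # total in cents, computed by the interpreter
```

Because the interpreter computes the result, the harness can check it, rerun it, or inspect intermediate variables in the namespace.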

Program-Delegated Reasoning

Paper Venue
Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks TMLR 2023
MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning ICLR 2024
Chain of Code: Reasoning with a Language Model-Augmented Code Emulator ICML 2024
Method-Based Reasoning for Large Language Models: Extraction, Reuse, and Continuous Improvement arXiv 2025
Code-Enabled Language Models Can Outperform Reasoning Models on Diverse Tasks arXiv 2025
When Do Program-of-Thoughts Work for Reasoning? AAAI 2024
PAL: Program-aided Language Models ICML 2023
Show Your Work: Scratchpads for Intermediate Computation with Language Models arXiv 2021
Reasoning Like Program Executors EMNLP 2022
Towards Better Understanding of Program-of-Thought Reasoning in Cross-Lingual and Multilingual Environments ACL 2025 Findings
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models NeurIPS 2022

Hybrid Symbolic–Neural Execution

Paper Venue
Self-Verifying Reflection Helps Transformers with CoT Reasoning NeurIPS 2025
SSR: Socratic Self-Refine for Large Language Model Reasoning arXiv 2025
CodeSteer: Symbolic-Augmented Language Models via Code/Text Guidance ICML 2025
Graph of Thoughts: Solving Elaborate Problems with Large Language Models AAAI 2024
Code-as-Symbolic-Planner: Foundation Model-Based Robot Planning via Symbolic Code Generation IROS 2025

Iterative Code-Grounded Reasoning

Paper Venue
NExT: Teaching Large Language Models to Reason about Code Execution ICML 2024
What I cannot execute, I do not understand: Training and Evaluating LLMs on Program Execution Traces arXiv 2025
Reasoning Through Execution: Unifying Process and Outcome Rewards for Code Generation ICML 2025
CodeRL+: Improving Code Generation via Reinforcement with Execution Semantics Alignment arXiv 2025
RLTF: Reinforcement Learning from Unit Test Feedback TMLR 2023
RLEF: Grounding Code LLMs in Execution Feedback with Reinforcement Learning ICML 2025
Execution Guided Line-by-Line Code Generation NeurIPS 2025
R1-Code-Interpreter: LLMs Reason with Code via Supervised and Multi-stage Reinforcement Learning arXiv 2025
CYCLE: Learning to Self-Refine the Code Generation OOPSLA 2024
StepCoder: Improve Code Generation with Reinforcement Learning from Compiler Feedback ACL 2024
CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning NeurIPS 2022
CodePRM: Execution Feedback-enhanced Process Reward Model for Code Generation ACL 2025 Findings
SatLM: Satisfiability-Aided Language Models Using Declarative Prompting NeurIPS 2023
Self-Edit: Fault-Aware Code Editor for Code Generation ACL 2023

🤖 Code for Acting

Generated programs serve as policies, tool calls, behavior trees, or reusable skills for embodied, GUI, software, and tool-use environments.
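
The idea of a generated program acting as a policy can be sketched as follows. `ToyRobot`, its `pick` and `place_in_bin` skills, and `policy_source` are hypothetical and not taken from any listed paper. The point is only that the policy is ordinary code composed from a small, inspectable skill API.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToyRobot:
    """Hypothetical environment exposing a small skill API."""
    holding: Optional[str] = None
    table: set = field(default_factory=lambda: {"red_block", "blue_block"})
    bin_contents: list = field(default_factory=list)
    log: list = field(default_factory=list)

    def pick(self, obj: str) -> None:
        if obj not in self.table or self.holding is not None:
            raise RuntimeError(f"cannot pick {obj}")
        self.table.remove(obj)
        self.holding = obj
        self.log.append(("pick", obj))

    def place_in_bin(self) -> None:
        if self.holding is None:
            raise RuntimeError("nothing to place")
        self.bin_contents.append(self.holding)
        self.log.append(("place_in_bin", self.holding))
        self.holding = None


# Stand-in for a model-generated policy: plain code over the skill API.
policy_source = """
for obj in sorted(robot.table):
    robot.pick(obj)
    robot.place_in_bin()
"""

robot = ToyRobot()
exec(policy_source, {"robot": robot})  # illustration only, not a sandbox
print(robot.bin_contents)
print(robot.log)
```

The action log doubles as an execution trace that the harness can inspect or replay.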

Grounded Skill Selection

Paper Venue
Do As I Can, Not As I Say: Grounding Language in Robotic Affordances CoRL 2022
Robots That Ask for Help: Uncertainty Alignment for Large Language Model Planners CoRL 2023
Bootstrap Your Own Skills: Learning to Solve New Tasks with Large Language Model Guidance CoRL 2023
SkillVLA: Tackling Combinatorial Diversity in Dual-Arm Manipulation via Skill Reuse arXiv 2026
Scaling Up and Distilling Down: Language-Guided Robot Skill Acquisition CoRL 2023
Lifelong Robot Library Learning: Bootstrapping Composable and Generalizable Skills for Embodied Control with Language Models ICRA 2024

Programmatic Policy Generation

Paper Venue
RoboCodeX: Multimodal Code Generation for Robotic Behavior Synthesis ICML 2024
CP-Agent: Agentic Constraint Programming arXiv 2025
LLM-Driven Corrective Robot Operation Code Generation with Static Text-Based Simulation ICRA 2026
NormCode: A Semi-Formal Language for Auditable AI Planning arXiv 2025
ALRM: Agentic LLM for Robotic Manipulation arXiv 2026
RACAS: Controlling Diverse Robots With a Single Agentic System arXiv 2026
ReAct: Synergizing Reasoning and Acting in Language Models ICLR 2023
GenSwarm: Scalable Multi-Robot Code-Policy Generation and Deployment via Language Models npj Robotics 2026
Code as Policies: Language Model Programs for Embodied Control ICRA 2023
Robotic Programmer: Video Instructed Policy Code Generation for Robotic Manipulation arXiv 2025
Code-BT: A Code-Driven Approach to Behavior Tree Generation for Robot Tasks Planning with Large Language Models IJCAI 2025

Lifelong Code-Based Agents

Paper Venue
Growing with Your Embodied Agent: A Human-in-the-Loop Lifelong Code Generation Framework for Long-Horizon Manipulation Skills arXiv 2025
ViReSkill: Vision-Grounded Replanning with Skill Memory for LLM-Based Planning in Lifelong Robot Learning arXiv 2025
UI-Voyager: A Self-Evolving GUI Agent Learning via Failed Experience arXiv 2026
Voyager: An Open-Ended Embodied Agent with Large Language Models TMLR 2023
Lifelong Language-Conditioned Robotic Manipulation Learning arXiv 2026

🌍 Code for Environment Modeling

Program states, repositories, traces, simulators, and tests represent state, dynamics, and feedback signals for agent interaction.
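
One way to picture a program as an environment model is a transition function written in code and checked against observed traces. The grid world, `candidate_model`, and `mismatches` below are assumptions made for illustration, not a method from any listed paper.

```python
def candidate_model(state: dict, action: str) -> dict:
    """Hypothetical model-written transition function for a 1-D grid world."""
    x = state["x"]
    if action == "right":
        x = min(x + 1, 3)  # wall at x = 3
    elif action == "left":
        x = max(x - 1, 0)  # wall at x = 0
    return {"x": x}


def mismatches(model, trace):
    """Return the observed transitions that the code model fails to reproduce."""
    return [(s, a, s2) for (s, a, s2) in trace if model(s, a) != s2]


# Observed (state, action, next_state) triples from interaction.
observed_trace = [
    ({"x": 0}, "right", {"x": 1}),
    ({"x": 3}, "right", {"x": 3}),
    ({"x": 0}, "left", {"x": 0}),
]

errors = mismatches(candidate_model, observed_trace)
print(errors)  # an empty list means the model explains every observation
```

Any non-empty mismatch list is a concrete feedback signal: it names the transitions the model must be revised to explain.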

Structured World Representations

Paper Venue
From Programs to Poses: Factored Real-World Scene Generation via Learned Program Libraries NeurIPS 2025
PoE-World: Compositional World Modeling with Products of Programmatic Experts NeurIPS 2025
Code2World: A GUI World Model via Renderable Code Generation arXiv 2026
Code2Worlds: Empowering Coding LLMs for 4D World Generation arXiv 2026
ViStruct: Visual Structural Knowledge Extraction via Curriculum Guided Code-Vision Representation EMNLP 2023

Execution-Trace World Modeling

Paper Venue
SemCoder: Training Code Language Models with Comprehensive Semantics Reasoning NeurIPS 2024
CWM: An Open-Weights LLM for Research on Code Generation with World Models arXiv 2025
Reinforcement World Model Learning for LLM-based Agents arXiv 2026
Agent World Model: Infinity Synthetic Environments for Agentic Reinforcement Learning arXiv 2026
Aligning Agentic World Models via Knowledgeable Experience Learning arXiv 2026
WorldCoder, a Model-Based LLM Agent: Building World Models by Writing Code and Interacting with the Environment NeurIPS 2024

Code-Grounded Evaluation Environments

Paper Venue
CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution ICML 2024
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code ICLR 2025
SWE-bench: Can Language Models Resolve Real-World GitHub Issues? ICLR 2024
AgentBench: Evaluating LLMs as Agents ICLR 2024
CoRe: Benchmarking LLMs' Code Reasoning Capabilities through Static Analysis Tasks NeurIPS 2025
GeoGramBench: Benchmarking the Geometric Program Reasoning in Modern LLMs arXiv 2025
CodeGlance: Understanding Code Reasoning Challenges in LLMs through Multi-Dimensional Feature Analysis arXiv 2026
Endless Terminals: Scaling RL Environments for Terminal Agents arXiv 2026
Reflexion: Language Agents with Verbal Reinforcement Learning NeurIPS 2023
CRUXEval-X: A Benchmark for Multilingual Code Reasoning, Understanding and Execution ACL 2025
InterCode: Standardizing and Benchmarking Interactive Coding with Execution Feedback NeurIPS 2023

🛠️ Harness Mechanisms

Once code is placed inside the agent loop, the harness must decide what to execute next, preserve useful state, expose the right tools, and convert failures into corrective actions.

[Figure: Harness mechanisms]

🗺️ Planning for Code Agents

Planning is harness control: it structures how the agent externalizes intent into executable steps, schedules interactions with code artifacts and tools, and regulates the trajectory of reasoning, execution, and revision over time.
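
A rough sketch of planning as harness control, in its simplest linear form: the plan is an ordered list of executable steps over shared state, and execution stops at the first failure so the planner can revise. `Step`, `execute_plan`, and the two example steps are hypothetical names chosen for this illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    name: str
    run: Callable[[dict], dict]  # takes harness state, returns updated state


def execute_plan(steps, state):
    """Run steps in order; stop at the first failure so the planner can revise."""
    completed = []
    for step in steps:
        try:
            state = step.run(dict(state))  # copy so a failed step leaves state intact
        except Exception as exc:
            return state, completed, f"{step.name}: {exc}"
        completed.append(step.name)
    return state, completed, None


def write_function(state):
    state["source"] = "def add(a, b):\n    return a + b\n"
    return state


def run_tests(state):
    scope = {}
    exec(state["source"], scope)
    if scope["add"](2, 3) != 5:
        raise ValueError("add(2, 3) should be 5")
    state["tests_passed"] = True
    return state


plan = [Step("write_function", write_function), Step("run_tests", run_tests)]
final_state, done, failure = execute_plan(plan, {})
print(done, failure)
```

The returned failure string is the piece a planner would consume when deciding whether to retry, reorder, or rewrite a step.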

Linear Decomposition Planning

Paper Venue
A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis ICLR 2024
ReAct: Synergizing Reasoning and Acting in Language Models ICLR 2023
Self-planning Code Generation with Large Language Models TOSEM 2024
Knowledge-Aware Code Generation with Large Language Models arXiv 2024
PaT: Planning-after-Trial for Efficient Test-Time Code Generation 2025
A Little Help Goes a Long Way: Tutoring LLMs in Solving Competitive Programming through Hints TSE 2025

Structure-Grounded Planning

Paper Venue
RPG: A Repository Planning Graph for Unified and Scalable Codebase Generation ICLR 2026
Code Graph Model (CGM): A Graph-Integrated Large Language Model for Repository-Level Software Engineering Tasks arXiv 2025
DomAgent: Leveraging Knowledge Graphs and Case-Based Reasoning for Domain-Specific Code Generation AAMAS 2026
CodePlan: Repository-Level Coding Using LLMs and Planning FSE 2024
LocAgent: Graph-Guided LLM Agents for Code Localization ACL 2025
VerilogCoder: Autonomous Verilog Coding Agents with Graph-based Planning and Abstract Syntax Tree (AST)-based Waveform Tracing Tool AAAI 2025

Search-Based Planning

Paper Venue
Planning in Natural Language Improves LLM Search for Code Generation ICLR 2025
Tree-of-Code: A Self-Growing Tree Framework for End-to-End Code Generation and Execution in Complex Tasks ACL 2025 Findings
Let's Revise Step-by-Step: A Unified Local Search Framework for Code Generation with LLMs NeurIPS 2025
Meta-Harness: End-to-End Optimization of Model Harnesses arXiv 2026
DARS: Dynamic Action Re-Sampling to Enhance Coding Agent Performance by Adaptive Tree Traversal ACL 2025
Generating Code World Models with Large Language Models Guided by Monte Carlo Tree Search NeurIPS 2024
CodeTree: Agent-guided Tree Search for Code Generation with Large Language Models NAACL 2025
RethinkMCTS: Refining Erroneous Thoughts in Monte Carlo Tree Search for Code Generation EMNLP 2025
SFS: Smarter Code Space Search Improves LLM Inference Scaling ICLR 2025

Orchestration-Based Planning

Paper Venue
AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation arXiv 2023
AutoSafeCoder: A Multi-Agent Framework for Securing LLM Code Generation through Static Analysis and Fuzz Testing arXiv 2024
CodeCoR: An LLM-based self-reflective multi-agent framework for code generation arXiv 2025
Multi-Agent Code-Orchestrated Generation for Reliable Infrastructure-as-Code arXiv 2025
SGAgent: Suggestion-Guided LLM-Based Multi-Agent Framework for Repository-Level Software Repair arXiv 2026
Requirements Development and Formalization for Reliable Code Generation: A Multi-Agent Vision ASE 2025
AlgoForge: Specializing Code Generation Agents through Collaborative Reinforcement Learning 2025
MapCoder: Multi-Agent Code Generation for Competitive Problem Solving ACL 2024
Blueprint2Code: a multi-agent pipeline for reliable code generation via blueprint planning and repair Frontiers in AI 2025
AdaCoder: Adaptive Prompt Compression for Programmatic Visual Question Answering ACM MM 2024

🧠 Memory and Context Engineering

Memory in code-as-agent-harness systems is a state-management layer: which information stays in the active context, which is compacted, and which is offloaded to durable external storage.
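
The three-way split described above (active, compacted, offloaded) can be sketched with a toy context manager. The class name, the truncation-based "compaction", and the in-memory `archive` dictionary are all simplifying assumptions. A real system would summarize with a model and persist to external storage.

```python
from collections import deque


class ContextManager:
    """Toy state-management layer: active window, compacted notes, durable archive."""

    def __init__(self, max_active: int = 3):
        self.max_active = max_active
        self.active = deque()  # full-text entries kept in the prompt
        self.summary = []      # compacted one-line notes
        self.archive = {}      # stand-in for durable external storage

    def add(self, key: str, text: str) -> None:
        self.active.append((key, text))
        while len(self.active) > self.max_active:
            old_key, old_text = self.active.popleft()
            self.archive[old_key] = old_text                    # offload full text
            self.summary.append(f"{old_key}: {old_text[:20]}")  # naive compaction

    def recall(self, key: str) -> str:
        """Fetch offloaded content back on demand."""
        return self.archive[key]

    def prompt(self) -> str:
        lines = ["[summary] " + s for s in self.summary]
        lines += [f"[{k}] {t}" for k, t in self.active]
        return "\n".join(lines)


ctx = ContextManager(max_active=2)
for i in range(4):
    ctx.add(f"obs{i}", f"output of command number {i}")
print(ctx.prompt())
```

Older observations leave the prompt but remain recoverable through `recall`, which is the distinction between compaction and deletion.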

Working Memory

Paper Venue
On the Failure of Latent State Persistence in Large Language Models arXiv 2025
Live-SWE-agent: Can Software Engineering Agents Self-Evolve on the Fly? arXiv 2025
CodeMem: Architecting Reproducible Agents via Dynamic MCP and Procedural Memory arXiv 2025
RepairAgent: An Autonomous, LLM-Based Agent for Program Repair ICSE 2025
Agentless: Demystifying LLM-based Software Engineering Agents FSE 2025
SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering NeurIPS 2024

Semantic Memory

Paper Venue
From Human Memory to AI Memory: A Survey on Memory Mechanisms in the Era of LLMs arXiv 2025
Rethinking Memory Mechanisms of Foundation Agents in the Second Half: A Survey arXiv 2026
AgentSM: Semantic Memory for Agentic Text-to-SQL arXiv 2026
A Survey on Large Language Models for Code Generation TOSEM 2026
RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and Generation EMNLP 2023
AutoCodeRover: Autonomous Program Improvement ISSTA 2024
CodeAgent: Enhancing Code Generation with Tool-Integrated Agent Systems for Real-World Repo-level Coding Challenges ACL 2024
A Survey on the Memory Mechanism of Large Language Model-Based Agents TOIS 2025

Experiential Memory

Paper Venue
Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory arXiv 2025
MemGovern: Enhancing Code Agents through Learning from Governed Human Experiences arXiv 2026
Leveraging Prior Experience: An Expandable Auxiliary Knowledge Base for Text-to-SQL arXiv 2024
Towards Large Language Models with Human-Like Episodic Memory Trends in Cognitive Sciences 2025
Episodic Memories Generation and Evaluation Benchmark for Large Language Models ICLR 2025
ExpeL: LLM Agents Are Experiential Learners AAAI 2024

Long-Term Memory

Paper Venue
Memex(RL): Scaling Long-Horizon LLM Agents via Indexed Experience Memory arXiv 2026
Mem-Gallery: Benchmarking Multimodal Long-Term Conversational Memory for MLLM Agents arXiv 2026
MemGPT: Towards LLMs as Operating Systems arXiv 2023
Your Code Agent Can Grow Alongside You with Structured Memory arXiv 2026
TALM: Dynamic Tree-Structured Multi-Agent Framework with Long-Term Memory for Scalable Code Generation arXiv 2025
Memory OS of AI Agent EMNLP 2025
Evaluating Very Long-Term Conversational Memory of LLM Agents ACL 2024

Multi-Agent Memory

Paper Venue
SWE-Debate: Competitive Multi-Agent Debate for Software Issue Resolution ICSE 2026
GameGPT: Multi-agent Collaborative Framework for Game Development arXiv 2023
AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation arXiv 2023
MIRIX: Multi-Agent Memory System for LLM-Based Agents arXiv 2025
Self-Organized Agents: A LLM Multi-Agent Framework toward Ultra Large-Scale Code Generation and Optimization arXiv 2024
Compressing Code Context for LLM-based Issue Resolution arXiv 2026
Scaling Long-Horizon LLM Agent via Context-Folding arXiv 2025
LongCLI-Bench: A Preliminary Benchmark and Study for Long-horizon Agentic Programming in Command-Line Interfaces arXiv 2026
SWE-bench: Can Language Models Resolve Real-World GitHub Issues? ICLR 2024
G-Memory: Tracing Hierarchical Memory for Multi-Agent Systems NeurIPS 2025

🔧 Tool Usage for Code Agents

Tool usage is the action and observation layer of the code-agent harness: agents search repositories, inspect files, edit code, run commands, execute tests, call APIs, and verify intermediate results, all under typed schemas, sandboxes, and lifecycle hooks.
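
To make the role of typed schemas and lifecycle hooks concrete, here is a small sketch of a tool registry that validates arguments before execution and notifies hooks around each call. `Tool`, `ToolRegistry`, the `read_file` tool, and the in-memory `files` workspace are hypothetical and do not mirror any specific framework's API.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class Tool:
    name: str
    params: dict  # parameter name -> expected Python type
    fn: Callable[..., Any]


@dataclass
class ToolRegistry:
    tools: dict = field(default_factory=dict)
    pre_hooks: list = field(default_factory=list)   # called before execution
    post_hooks: list = field(default_factory=list)  # called with the result

    def register(self, tool: Tool) -> None:
        self.tools[tool.name] = tool

    def call(self, name: str, **kwargs) -> Any:
        tool = self.tools[name]
        # Schema check: exact parameter set, then per-parameter types.
        if set(kwargs) != set(tool.params):
            raise TypeError(f"{name} expects {sorted(tool.params)}")
        for key, expected in tool.params.items():
            if not isinstance(kwargs[key], expected):
                raise TypeError(f"{name}.{key} must be {expected.__name__}")
        for hook in self.pre_hooks:
            hook(name, kwargs)
        result = tool.fn(**kwargs)
        for hook in self.post_hooks:
            hook(name, result)
        return result


files = {"main.py": "print('hi')\n"}  # in-memory stand-in for a sandboxed workspace
audit_log = []

registry = ToolRegistry()
registry.register(Tool("read_file", {"path": str}, lambda path: files[path]))
registry.pre_hooks.append(lambda name, args: audit_log.append((name, dict(args))))

print(registry.call("read_file", path="main.py"))
```

Rejecting malformed calls before they run keeps invalid model output from reaching the environment, and the pre-hook gives a natural place for logging or permission checks.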

Function-Oriented Tool Use

Paper Venue
ToolCoder: Teach Code Generation Models to use API search tools arXiv 2023
CodeQA: Advanced Programming Question-Answering Using LLM Agent and RAG IEEE TENCON 2024
RAG-Based AI Agents for Enterprise Software Development: Implementation Patterns and Production Deployment 2025
The Devil Is in the Tails: How Long-Tailed Code Distributions Impact Large Language Models ASE 2023

Environment-Interaction Tool Use

Paper Venue
Environment-in-the-Loop: Rethinking Code Migration with LLM-based Agents arXiv 2026
Test-Time Adaptation for LLM Agents via Environment Interaction ICLR 2026

Verification-Driven Tool Use

Paper Venue
VeriGuard: Enhancing LLM Agent Safety via Verified Code Generation arXiv 2025
AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation arXiv 2023
Agents4PLC: Automating Closed-loop PLC Code Generation and Verification in Industrial Control Systems using LLM-based Agents arXiv 2025

Workflow-Orchestration Tool Use

Paper Venue
ToolNet: Connecting Large Language Models with Massive Tools via Tool Graph arXiv 2024
ControlLLM: Augment Language Models with Tools by Searching on Graphs ECCV 2024
Agent Harness for Large Language Model Agents: A Survey Preprints 2026
Executable Code Actions Elicit Better LLM Agents ICML 2024
OpenHands: An Open Platform for AI Software Developers as Generalist Agents ICLR 2025
On the Use of Agentic Coding: An Empirical Study of Pull Requests on GitHub TOSEM 2025

🧪 Feedback-Guided Iterative Debugging

Iterative debugging closes the harness loop: development environments expose feedback (compiler diagnostics, runtime errors, tests, critique), and the agent transforms these signals into diagnosis, revision, and progressively better debugging behavior.
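
The loop described above can be sketched in a few lines: run the candidate, turn test failures or runtime errors into textual feedback, and try the next revision. The `candidates` list replays hypothetical model revisions in a fixed order. In a real loop each revision would be generated from the previous feedback.

```python
def run_tests(source: str):
    """Execute candidate code against fixed tests; return (passed, feedback)."""
    scope = {}
    try:
        exec(source, scope)
        func = scope["median"]
        for args, expected in [([1, 3, 2], 2), ([4, 1, 3, 2], 2.5)]:
            got = func(args)
            if got != expected:
                return False, f"median({args}) returned {got}, expected {expected}"
    except Exception as exc:  # runtime errors become feedback too
        return False, f"{type(exc).__name__}: {exc}"
    return True, "all tests passed"


# Hypothetical sequence of model revisions, replayed in order.
candidates = [
    "def median(xs):\n    return xs[len(xs) // 2]\n",                    # forgets to sort
    "def median(xs):\n    s = sorted(xs)\n    return s[len(s) // 2]\n",  # wrong for even length
    "def median(xs):\n    s = sorted(xs)\n    n = len(s)\n"
    "    return s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2\n",
]

history = []
for attempt, source in enumerate(candidates, start=1):
    passed, feedback = run_tests(source)
    history.append((attempt, feedback))  # a real harness would send this back to the model
    if passed:
        break

for entry in history:
    print(entry)
```

Each failed attempt produces a specific, checkable message, which is what distinguishes execution feedback from a generic "try again" prompt.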

Development Environments for Agentic Coding

Contextual Environments for Repository-Aware Generation

Paper Venue
On the Impacts of Contexts on Repository-Level Code Generation NAACL 2025 Findings
A Survey on Model Context Protocol: Architecture, State-of-the-art, Challenges and Future Directions TechRxiv 2025
CodexGraph: Bridging Large Language Models and Code Repositories via Code Graph Databases NAACL 2025
RepoAgent: An LLM-Powered Open-Source Framework for Repository-level Code Documentation Generation EMNLP 2024 (Demo)
Knowledge Graph Based Repository-Level Code Generation LLM4Code@ICSE 2025
From Glue-Code to Protocols: A Critical Analysis of A2A and MCP Integration for Scalable Agent Systems arXiv 2025
Retrieval-Augmented Code Generation: A Survey with Focus on Repository-Level Approaches arXiv 2026
A³-CodGen: A Repository-Level Code Generation Framework for Code Reuse with Local-Aware, Global-Aware, and Third-Party-Library-Aware TSE 2024

Interactive Environments for Human–LLM Collaboration

Paper Venue
Conversational AI as a Coding Assistant: Understanding Programmers' Interactions with and Expectations from Large Language Models for Coding arXiv 2025
The Design Space of LLM-Based AI Coding Assistants: An Analysis of 90 Systems in Academia and Industry VL/HCC 2025
Language Server Protocol: Defines a Common Protocol for Language Servers [Spec] -
Deductive Verification via the Debug Adapter Protocol arXiv 2021
Model Context Protocol (MCP): Landscape, Security Threats, and Future Research Directions TOSEM 2025
The Programmer's Assistant: Conversational Interaction with a Large Language Model for Software Development IUI 2023
Human-AI Experience in Integrated Development Environments: A Systematic Literature Review Empirical Software Engineering 2026

Execution and Validation Environments

Paper Venue
RepoST: Scalable Repository-Level Coding Environment Construction with Sandbox Testing arXiv 2025
Klear-CodeTest: Scalable Test Case Generation for Code Reinforcement Learning arXiv 2025
FeedbackEval: A Benchmark for Evaluating Large Language Models in Feedback-Driven Code Repair Tasks arXiv 2026
LLMLOOP: Improving LLM-Generated Code and Tests Through Automated Iterative Feedback Loops ICSME 2025
OpenAgentSafety: A Comprehensive Framework for Evaluating Real-World AI Agent Safety ICLR 2026
KubeIntellect: A Modular LLM-Orchestrated Agent Framework for End-to-End Kubernetes Management arXiv 2025
MLDebugging: Towards Benchmarking Code Debugging Across Multi-Library Scenarios ACL 2025 Findings
ECCO: Can We Improve Model-Generated Code Efficiency Without Sacrificing Functional Correctness? EMNLP 2024

Engineering Platforms for Deployment and Workflow Integration

Paper Venue
LLM-Based Multi-Agent Systems for Software Engineering: Literature Review, Vision, and the Road Ahead TOSEM 2024
AgentMesh: A Cooperative Multi-Agent Generative AI Framework for Software Development Automation arXiv 2025
ALMAS: an Autonomous LLM-based Multi-Agent Software Engineering Framework arXiv 2025
From challenges to metrics: An LLM-driven DevOps recommendation system grounded in evidence-based mappings Array 2025
AI Augmented CI/CD Pipelines: From Code Commit to Production with Autonomous Decisions IEEE FLLM 2025
A Multi-Agent Coding Assistant for Cloud-Native Development: From Requirements to Deployable Microservices Preprints 2025
Continuous QoS-compliant Orchestration in the Cloud-Edge Continuum Software: Practice and Experience 2024
From Code Generation to AI Collaboration: The Role of Multi-Agent Systems in Software Engineering 2025
AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversations COLM 2024

Feedback Mechanisms for Iterative Debugging

Compilation and Static-Analysis Feedback

Paper Venue
The Debugging Decay Index: Rethinking Debugging Strategies for Code LLMs arXiv 2025
Helping LLMs Improve Code Generation Using Feedback from Testing and Static Analysis Discover Artificial Intelligence 2024
Enhancing LLM Code Generation: A Systematic Evaluation of Multi-Agent Collaboration and Runtime Debugging for Improved Accuracy, Reliability, and Latency arXiv 2025
Iterative Refinement of Project-Level Code Context for Precise Code Generation with Compiler Feedback ACL 2024 Findings
Static Analysis as a Feedback Loop: Enhancing LLM-Generated Code Beyond Correctness arXiv 2025

Runtime Error and Exception Feedback

Paper Venue
Towards Agentic Runtime Healing arXiv 2024
Large Language Model Guided Self-Debugging Code Generation arXiv 2025
Code Repair with LLMs gives an Exploration-Exploitation Tradeoff NeurIPS 2024
Debug like a Human: A Large Language Model Debugger via Verifying Runtime Execution Step by Step ACL 2024 Findings

Test-Based Execution Feedback

Paper Venue
Teaching Large Language Models to Self-Debug arXiv 2023
Learning to Generate Unit Tests for Automated Debugging COLM 2025
TestART: Improving LLM-Based Unit Testing via Co-Evolution of Automated Generation and Repair Iteration arXiv 2024
From Code to Correctness: Closing the Last Mile of Code Generation with Hierarchical Debugging ICSE 2026
Revisit Self-Debugging with Self-Generated Tests for Code Generation ACL 2025
LLM-Based Test-Driven Interactive Code Generation: User Study and Empirical Evaluation TSE 2024

Critique-Driven Feedback (Human or Auxiliary Agents)

Paper Venue
Interactive Debugging and Steering of Multi-Agent AI Systems CHI 2025
RGD: Multi-LLM Based Agent Debugger via Refinement and Generation Guidance International Conference on Agents 2024

Feedback-Driven Debugging and Self-Improvement

Paper Venue
Teaching Your Models to Understand Code via Focal Preference Alignment arXiv 2025
ReVeal: Self-Evolving Code Agents via Reliable Self-Verification NeurIPS 2025

👥 Scaling the Harness: Multi-Agent Code-Centric Systems

When multiple agents operate over code, the harness must coordinate roles, share intermediate artifacts, maintain common state, and verify collective progress through repositories, tests, traces, and structured workflows.

[Figure: Scaling the harness]

🎭 Functional Role Specialization

Distinct agents own slices of the shared code harness: synthesis, understanding, verification, execution, and planning.

Program Synthesis Agents

Paper Venue
AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation arXiv 2023
MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework ICLR 2024
ChatDev: Communicative Agents for Software Development ACL 2024
MAGE: A multi-agent engine for automated RTL code generation DAC 2025
Self-collaboration Code Generation via ChatGPT TOSEM 2024

Program Understanding Agents

Paper Venue
HyperAgent: Generalist software engineering agents to solve coding tasks at scale arXiv 2024
Lingma SWE-GPT: An Open Development-Process-Centric Language Model for Automated Software Improvement ISSTA 2025
CleanAgent: Automating data standardization with LLM-based agents arXiv 2024
MAGIS: LLM-Based Multi-Agent Framework for GitHub Issue Resolution NeurIPS 2024

Verification Agents

Paper Venue
QualityFlow: An agentic workflow for program synthesis controlled by LLM quality checks arXiv 2025
AutoSafeCoder: A Multi-Agent Framework for Securing LLM Code Generation through Static Analysis and Fuzz Testing arXiv 2024
Hallucination to Consensus: Multi-Agent LLMs for End-to-End JUnit Test Generation arXiv 2025

Execution Agents

Paper Venue
AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation arXiv 2023
HyperAgent: Generalist software engineering agents to solve coding tasks at scale arXiv 2024
MAGE: A multi-agent engine for automated RTL code generation DAC 2025

Planning Agents

Paper Venue
Self-Organized Agents: A LLM Multi-Agent Framework toward Ultra Large-Scale Code Generation and Optimization arXiv 2024
Self-Evolving Multi-Agent Collaboration Networks for Software Development ICLR 2025
SOEN-101: Code Generation by Emulating Software Process Models Using Large Language Model Agents ICSE 2025

💬 Interaction Modes

Code-centric multi-agent interaction is artifact-mediated: agents observe and modify shared code, and grounding comes from the objective state exposed by execution.

Collaborative Synthesis

Paper Venue
CodePori: Large-Scale System for Autonomous Software Development Using Multi-Agent Technology arXiv 2024
A Pair Programming Framework for Code Generation via Multi-Plan Exploration and Feedback-Driven Refinement ASE 2024

Critique and Repair

Paper Venue
AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation arXiv 2023
SEW: Self-evolving agentic workflows for automated code generation arXiv 2025

Adversarial Validation

Paper Venue
AutoSafeCoder: A Multi-Agent Framework for Securing LLM Code Generation through Static Analysis and Fuzz Testing arXiv 2024
MAGE: A multi-agent engine for automated RTL code generation DAC 2025

Reasoning Debate

Paper Venue
ChatDev: Communicative Agents for Software Development ACL 2024
Hallucination to Consensus: Multi-Agent LLMs for End-to-End JUnit Test Generation arXiv 2025

🕸️ Workflow Topology

The topology of agent interaction (chain, cyclic, hierarchical, star, or adaptive) is one of the most consequential design decisions in multi-agent code generation.

Pre-Defined Heuristic Topologies (Waterfall / Iterative / Hierarchical / Star)

Paper Venue
ChatDev: Communicative Agents for Software Development ACL 2024
MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework ICLR 2024
L2MAC: Large language model automatic computer for extensive code generation ICLR 2024
AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation arXiv 2023
MAGE: A multi-agent engine for automated RTL code generation DAC 2025
HyperAgent: Generalist software engineering agents to solve coding tasks at scale arXiv 2024
Self-Organized Agents: A LLM Multi-Agent Framework toward Ultra Large-Scale Code Generation and Optimization arXiv 2024
Hallucination to Consensus: Multi-Agent LLMs for End-to-End JUnit Test Generation arXiv 2025

Objective-Driven and Adaptive Topologies

Paper Venue
FlowReasoner: Reinforcing Query-Level Meta-Agents arXiv 2025
BOAD: Discovering Hierarchical Software Engineering Agents via Bandit Optimization arXiv 2025
SEW: Self-evolving agentic workflows for automated code generation arXiv 2025

⚡ Execution Feedback Integration

Code is uniquely executable, producing objective oracle signals that anchor multi-agent coordination.

Compiler and Syntax Feedback

Paper Venue
ChatDev: Communicative Agents for Software Development ACL 2024
L2MAC: Large language model automatic computer for extensive code generation ICLR 2024

Test Pass/Fail Signals

Paper Venue
AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation arXiv 2023
QualityFlow: An agentic workflow for program synthesis controlled by LLM quality checks arXiv 2025

Fuzzer Crash Traces

Paper Venue
AutoSafeCoder: A Multi-Agent Framework for Securing LLM Code Generation through Static Analysis and Fuzz Testing arXiv 2024

Static Analysis Warnings

Paper Venue
AutoSafeCoder: A Multi-Agent Framework for Securing LLM Code Generation through Static Analysis and Fuzz Testing arXiv 2024

Performance Profiling Results

Paper Venue
MARCO: Multi-Agent Code Optimization with Real-Time Knowledge Integration for High-Performance Computing arXiv 2025

Fine-Grained Simulation Feedback

Paper Venue
MAGE: A multi-agent engine for automated RTL code generation DAC 2025

🔄 Shared-Harness Synchronization

How multi-agent systems maintain a consistent shared view of program state.
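
One simple way to keep a shared view consistent is a blackboard with per-key versions, so a write based on a stale read is rejected and must be retried. This is a generic optimistic-concurrency sketch under assumed names (`Blackboard`, `read`, `write`), not the mechanism of any particular system listed here.

```python
class Blackboard:
    """Toy shared store with per-key versions to detect stale writes."""

    def __init__(self):
        self._data = {}  # key -> (version, value)

    def read(self, key):
        return self._data.get(key, (0, None))

    def write(self, key, value, expected_version):
        current_version, _ = self.read(key)
        if current_version != expected_version:
            return False  # another agent wrote first; caller must re-read
        self._data[key] = (current_version + 1, value)
        return True


board = Blackboard()

# Two agents read the same (empty) entry before either writes.
v_coder, _ = board.read("utils.py")
v_tester, _ = board.read("utils.py")

ok_coder = board.write("utils.py", "def f(): return 1", v_coder)
ok_tester = board.write("utils.py", "def f(): return 2", v_tester)  # stale version

# The tester re-reads and retries against the fresh version.
v_tester, latest = board.read("utils.py")
ok_retry = board.write("utils.py", latest + "  # reviewed", v_tester)
print(ok_coder, ok_tester, ok_retry)
```

The rejected write is the synchronization signal: it forces the second agent to look at the current program state before changing it.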

Shared Blackboard

Paper Venue
L2MAC: Large language model automatic computer for extensive code generation ICLR 2024

Parallel Branches with Merge

Paper Venue
HyperAgent: Generalist software engineering agents to solve coding tasks at scale arXiv 2024

Structured Context Scheduling

Paper Venue
MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework ICLR 2024

Hierarchical Memory

Paper Venue
ChatDev: Communicative Agents for Software Development ACL 2024
Cogito, ergo sum: A Neurobiologically-Inspired Cognition-Memory-Growth System for Code Generation arXiv 2025

Agent Pool Scaling

Paper Venue
Self-Organized Agents: A LLM Multi-Agent Framework toward Ultra Large-Scale Code Generation and Optimization arXiv 2024

🏛️ Shared Harness Representation

Four levels of formalization for the shared substrate: implicit/file-only, repository-based, execution-based, and blackboard.

Implicit / File-Only Representation

Paper Venue
ChatDev: Communicative Agents for Software Development ACL 2024
MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework ICLR 2024
CodeCoR: An LLM-based self-reflective multi-agent framework for code generation arXiv 2025
SEW: Self-evolving agentic workflows for automated code generation arXiv 2025
CodePori: Large-Scale System for Autonomous Software Development Using Multi-Agent Technology arXiv 2024
SyncMind: Measuring Agent Out-of-Sync Recovery in Collaborative Software Engineering ICML 2025

Repository-Based Representation

Paper Venue
HyperAgent: Generalist software engineering agents to solve coding tasks at scale arXiv 2024
Lingma SWE-GPT: An Open Development-Process-Centric Language Model for Automated Software Improvement ISSTA 2025

Execution-Based Representation

Paper Venue
AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation arXiv 2023
AutoSafeCoder: A Multi-Agent Framework for Securing LLM Code Generation through Static Analysis and Fuzz Testing arXiv 2024
QualityFlow: An agentic workflow for program synthesis controlled by LLM quality checks arXiv 2025
MARCO: Multi-Agent Code Optimization with Real-Time Knowledge Integration for High-Performance Computing arXiv 2025
Hallucination to Consensus: Multi-Agent LLMs for End-to-End JUnit Test Generation arXiv 2025
MAGE: A multi-agent engine for automated RTL code generation DAC 2025

Blackboard / Shared-State Representation

Paper Venue
L2MAC: Large language model automatic computer for extensive code generation ICLR 2024
GameGPT: Multi-agent Collaborative Framework for Game Development arXiv 2023
Cogito, ergo sum: A Neurobiologically-Inspired Cognition-Memory-Growth System for Code Generation arXiv 2025

🎯 Harness-State Convergence

How a multi-agent code system decides the shared harness has reached an acceptable final state.
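For intuition, a test-gated stopping rule can be sketched as a loop that revises a candidate until every test passes or a round budget runs out. This is an illustrative sketch under stated assumptions, not the procedure of any listed system: `run_until_converged`, `passed_tests`, `revise`, and the toy `square` task are hypothetical, and `revise` stands in for an LLM repair step.

```python
from typing import Callable


def run_until_converged(
    candidate: str,
    revise: Callable[[str, int], str],
    passed_tests: Callable[[str], int],
    total_tests: int,
    max_rounds: int = 5,
) -> tuple[str, int, bool]:
    """Revise `candidate` until all tests pass or the round budget is spent."""
    for round_idx in range(max_rounds):
        if passed_tests(candidate) == total_tests:
            return candidate, round_idx, True
        candidate = revise(candidate, round_idx)
    return candidate, max_rounds, passed_tests(candidate) == total_tests


# Toy task (hypothetical): the candidate must define square(x).
TESTS = [(2, 4), (3, 9), (-1, 1)]


def passed_tests(source: str) -> int:
    namespace = {}
    exec(source, namespace)  # run the candidate in a scratch namespace
    square = namespace["square"]
    return sum(1 for x, want in TESTS if square(x) == want)


def revise(source: str, round_idx: int) -> str:
    # Stand-in for a model-driven repair step; always returns the fix.
    return "def square(x):\n    return x * x\n"


buggy = "def square(x):\n    return x + x\n"
final, rounds, converged = run_until_converged(buggy, revise, passed_tests, len(TESTS))
```

Swapping the equality check for a score threshold or a vote among reviewers gives the score-based and consensus variants of the same loop.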

Correctness Convergence (Test-Gated)

Paper Venue
AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation arXiv 2023
L2MAC: Large language model automatic computer for extensive code generation ICLR 2024
Hallucination to Consensus: Multi-Agent LLMs for End-to-End JUnit Test Generation arXiv 2025

Security Convergence

Paper Venue
AutoSafeCoder: A Multi-Agent Framework for Securing LLM Code Generation through Static Analysis and Fuzz Testing arXiv 2024

Performance Convergence

Paper Venue
MARCO: Multi-Agent Code Optimization with Real-Time Knowledge Integration for High-Performance Computing arXiv 2025

Score-Based Convergence

Paper Venue
MAGE: A multi-agent engine for automated RTL code generation DAC 2025
CodeCoR: An LLM-based self-reflective multi-agent framework for code generation arXiv 2025
Trae Agent: An LLM-based Agent for Software Engineering with Test-time Scaling arXiv 2025

Consensus Convergence

Paper Venue
QualityFlow: An agentic workflow for program synthesis controlled by LLM quality checks arXiv 2025

Implicit Convergence

Paper Venue
ChatDev: Communicative Agents for Software Development ACL 2024
MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework ICLR 2024

🚀 Applications and Emerging Fields

Code-centric agentic systems become operational in tangible domains where code defines observable state, executable actions, persistent memory, and feedback signals.

Applications

💻 Code Assistants

Repositories, tests, issue threads, and development tools form a persistent program world; assistants act over it as code-centric agents.

The Repository as a Persistent Program World

Paper Venue
RepoCoder: Repository-Level Code Completion through Iterative Retrieval and Generation EMNLP 2023
CodexGraph: Bridging Large Language Models and Code Repositories via Code Graph Databases NAACL 2025
AutoCodeRover: Autonomous Program Improvement ISSTA 2024

Agent Harnesses as Executable Development Interfaces

Paper Venue
Claude Code [Blog] 2025
Introducing Codex [Blog] 2025
About GitHub Copilot Cloud Agent [Blog] 2025
DeepAgents [GitHub] 2025
Model Context Protocol [Docs] 2024
Model Context Protocol (MCP): Landscape, Security Threats, and Future Research Directions ACM TOSEM 2025
The OpenHands Software Agent SDK: A Composable and Extensible Foundation for Production Agents arXiv 2025
AutoHarness: Improving LLM Agents by Automatically Synthesizing a Code Harness arXiv 2026
Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses arXiv 2026
Meta-Harness: End-to-End Optimization of Model Harnesses arXiv 2026
Natural-Language Agent Harnesses arXiv 2026

Execution Feedback as Grounded Verification

Paper Venue
Agentless: Demystifying LLM-based Software Engineering Agents arXiv 2024
RepairAgent: An Autonomous, LLM-Based Agent for Program Repair ICSE 2025
Live-SWE-agent: Can Software Engineering Agents Self-Evolve on the Fly? arXiv 2025
Code Generation with AlphaCodium: From Prompt Engineering to Flow Engineering arXiv 2024

Memory and Context Management at Repository Scale

Paper Venue
RepoAgent: An LLM-Powered Open-Source Framework for Repository-level Code Documentation Generation EMNLP 2024 (Demo)
ContextBench: A Benchmark for Context Retrieval in Coding Agents arXiv 2026
CodeMem: Architecting Reproducible Agents via Dynamic MCP and Procedural Memory arXiv 2025
MemGovern: Enhancing Code Agents through Learning from Governed Human Experiences arXiv 2026

Developer Intent and Project Conventions as Latent State

Paper Venue
Learning to Commit: Generating Organic Pull Requests via Online Repository Memory arXiv 2026
CodeTaste: Can LLMs Generate Human-Level Code Refactorings? arXiv 2026
SWE-bench+: Enhanced Coding Benchmark for LLMs ICSE Companion 2025

From Inline Completion to Autonomous SWE Agents

Paper Venue
Evaluating Large Language Models Trained on Code arXiv 2021
The Impact of AI on Developer Productivity: Evidence from GitHub Copilot arXiv 2023
Expectation vs. Experience: Evaluating the Usability of Code Generation Tools Powered by Large Language Models CHI Extended Abstracts 2022
Reading Between the Lines: Modeling User Behavior and Costs in AI-Assisted Programming CHI 2024

From Patch Generation to Software Lifecycle Participation

Paper Venue
SWE-bench: Can Language Models Resolve Real-World GitHub Issues? ICLR 2024
SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering? ICML 2025
SWE-bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks? arXiv 2025
Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces arXiv 2026
AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents ACL 2024
τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains ICLR 2025
AI Augmented CI/CD Pipelines: From Code Commit to Production with Autonomous Decisions IEEE FLLM 2025
Advances and Frontiers of LLM-based Issue Resolution in Software Engineering: A Comprehensive Survey arXiv 2026
Alibaba LingmaAgent: Improving Automated Issue Resolution via Comprehensive Repository Exploration FSE 2025
CodeAgent: Autonomous Communicative Agents for Code Review EMNLP 2024

Multi-Agent Code Assistance and Shared Repositories

Paper Venue
ChatDev: Communicative Agents for Software Development ACL 2024
MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework ICLR 2024
CodeAgent: Enhancing Code Generation with Tool-Integrated Agent Systems for Real-World Repo-Level Coding Challenges ACL 2024
METAL: A Multi-Agent Framework for Chart Generation with Test-Time Scaling ACL 2025

The Harness as a Distillation Surface

Paper Venue
Composer: Building a fast frontier model with reinforcement learning [Blog] 2025
Improving Composer through real-time reinforcement learning [Blog] 2025
Addendum to GPT-5 system card: GPT-5-Codex [Report] 2025
Building more with GPT-5.1-Codex-Max [Blog] 2025
How Anthropic teams use Claude Code [Report] 2025

Open Challenges for Code-Assistant Harnesses

Paper Venue
Are "Solved Issues" in SWE-bench Really Solved Correctly? An Empirical Study arXiv 2025
SWE-Bench++: A Framework for the Scalable Generation of Software Engineering Benchmarks arXiv 2025
Introducing Aardvark: OpenAI's Agentic Security Researcher [Blog] 2025
Codex Security: Now in Research Preview [Blog] 2026
Why Do Multi-Agent LLM Systems Fail? arXiv 2025
Which Agent Causes Task Failures and When? On Automated Failure Attribution of LLM Multi-Agent Systems arXiv 2025
AgenTracer: Who Is Inducing Failure in the LLM Agentic Systems? arXiv 2025
Where LLM Agents Fail and How They Can Learn from Failures arXiv 2025
Beyond Static Sandboxing: Learned Capability Governance for Autonomous AI Agents arXiv 2026
Fault-Tolerant Sandboxing for AI Coding Agents: A Transactional Approach to Safe Autonomous Execution arXiv 2025
Introducing the Agent Governance Toolkit: Open-Source Runtime Security for AI Agents [Blog] 2026

🖥️ GUI / OS Agents

GUI/OS environments are program worlds in the most literal sense: every observation is rendered code, and every action is a call into another piece of code.

GUI/OS as a Partially Observable Program World

Paper Venue
WebArena: A Realistic Web Environment for Building Autonomous Agents ICLR 2024
Mind2Web: Towards a Generalist Agent for the Web NeurIPS 2023
AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents ICLR 2025
Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale ICML 2025
AgentOccam: A Simple Yet Strong Baseline for LLM-Based Web Agents ICLR 2025
GPT-4V(ision) is a Generalist Web Agent, if Grounded ICML 2024
WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models ACL 2024
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments NeurIPS 2024
Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V arXiv 2023
WorkArena: How Capable are Web Agents at Solving Common Knowledge Work Tasks? ICML 2024
CogAgent: A Visual Language Model for GUI Agents CVPR 2024

Unifying Perception, Action, and Evaluation Through Code

Paper Venue
Executable Code Actions Elicit Better LLM Agents ICML 2024
Cradle: Empowering Foundation Agents towards General Computer Control ICML 2025
TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks NeurIPS 2025
SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents ACL 2024
Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs ECCV 2024
OS-ATLAS: Foundation Action Model for Generalist GUI Agents ICLR 2025
ShowUI: One Vision-Language-Action Model for GUI Visual Agent CVPR 2025
Aria-UI: Visual Grounding for GUI Instructions ACL 2025 Findings
Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents ICLR 2025
UI-TARS: Pioneering Automated GUI Interaction with Native Agents arXiv 2025
GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL arXiv 2026
Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows? NeurIPS 2024

Memory as Persistent Program State

Paper Venue
Synapse: Trajectory-as-Exemplar Prompting with Memory for Computer Control ICLR 2024
AppAgent: Multimodal Agents as Smartphone Users CHI 2025
Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration NeurIPS 2024
UI-Voyager: A Self-Evolving GUI Agent Learning via Failed Experience arXiv 2026
AutoGLM: Autonomous Foundation Agents for GUIs arXiv 2024
OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis ACL 2025
PlugMem: A Task-Agnostic Plugin Memory Module for LLM Agents arXiv 2026

UI Simulators and Sandboxes as Executable Dynamics

Paper Venue
Reinforcement Learning on Web Interfaces Using Workflow-Guided Exploration ICLR 2018
WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents NeurIPS 2022
VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks ACL 2024
Understanding the Weakness of Large Language Model Agents within a Complex Android Environment KDD 2024
AndroidLab: Training and Systematic Benchmarking of Android Autonomous Agents ACL 2025
AgentBench: Evaluating LLMs as Agents ICLR 2024
Code2World: A GUI World Model via Renderable Code Generation arXiv 2026

From Simulation to Production: Executable Feedback Loops

Paper Venue
3.5 Models and Computer Use [Blog] 2024
Introducing Operator [Blog] 2025
Project Mariner [Blog] 2025
AutoWebGLM: A Large Language Model-based Web Navigating Agent KDD 2024

🤖 Autonomous Embodied Agents

Code grounds embodied actions in physical feasibility, accumulates reusable skills as memory, and supports auditable real-world deployment.

Agent Harness for Grounded and Verifiable Embodied Actions

Paper Venue
Code as Policies: Language Model Programs for Embodied Control ICRA 2023
ChatGPT for Robotics: Design Principles and Model Abilities IEEE Access 2024
Inner Monologue: Embodied Reasoning through Planning with Language Models CoRL 2022
VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models CoRL 2023
The Marathon 2: A Navigation System IROS 2020
PaLM-E: An Embodied Multimodal Language Model ICML 2023
Gemini Robotics 1.5: Pushing the Frontier of Generalist Robots with Advanced Embodied Reasoning arXiv 2025
Do As I Can, Not As I Say: Grounding Language in Robotic Affordances CoRL 2022
Robots That Ask for Help: Uncertainty Alignment for Large Language Model Planners CoRL 2023
SkillVLA: Tackling Combinatorial Diversity in Dual-Arm Manipulation via Skill Reuse arXiv 2026
Bootstrap Your Own Skills: Learning to Solve New Tasks with Large Language Model Guidance CoRL 2023
RoboCodeX: Multimodal Code Generation for Robotic Behavior Synthesis ICML 2024
Robotic Programmer: Video Instructed Policy Code Generation for Robotic Manipulation IROS 2025
Code-BT: A Code-Driven Approach to Behavior Tree Generation for Robot Tasks Planning with Large Language Models IJCAI 2025
LLM-Driven Corrective Robot Operation Code Generation with Static Text-Based Simulation ICRA 2026
NormCode: A Semi-Formal Language for Auditable AI Planning arXiv 2025
CP-Agent: Agentic Constraint Programming arXiv 2025
VeriGuard: Enhancing LLM Agent Safety via Verified Code Generation arXiv 2025

Reusable Skills as Embodied Memory

Paper Venue
Voyager: An Open-Ended Embodied Agent with Large Language Models NeurIPS 2023
Lifelong Robot Library Learning: Bootstrapping Composable and Generalizable Skills for Embodied Control with Language Models ICRA 2024
Growing with Your Embodied Agent: A Human-in-the-Loop Lifelong Code Generation Framework for Long-Horizon Manipulation Skills arXiv 2025
ViReSkill: Vision-Grounded Replanning with Skill Memory for LLM-Based Planning in Lifelong Robot Learning arXiv 2025
Lifelong Language-Conditioned Robotic Manipulation Learning AAAI 2026
UI-Voyager: A Self-Evolving GUI Agent Learning via Failed Experience arXiv 2026

Coordinated and Auditable Real-World Deployment

Paper Venue
GenSwarm: Scalable Multi-Robot Code-Policy Generation and Deployment via Language Models npj Robotics 2026
Agents4PLC: Automating Closed-Loop PLC Code Generation and Verification in Industrial Control Systems IEEE TSE 2026
RACAS: Controlling Diverse Robots With a Single Agentic System arXiv 2026
ALRM: Agentic LLM for Robotic Manipulation arXiv 2026

🔬 Scientific Discovery Agents

Hypotheses are encoded as differential equations or generative models; protocols as XDL or Opentrons scripts; analyses as Jupyter notebooks. Code carries scientific reasoning, scientific action, and the scientific environment itself.
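To make "hypotheses as differential equations" concrete, the sketch below writes a logistic-growth hypothesis as a function, integrates it with forward Euler, and scores two candidate growth rates against observations. It is only an illustration: the model, the parameter values, and the "observations" (generated here from the same model) are made up for the example and do not come from any listed paper.

```python
def logistic_rate(population: float, growth_rate: float, capacity: float) -> float:
    """Hypothesis as code: dN/dt = r * N * (1 - N / K)."""
    return growth_rate * population * (1.0 - population / capacity)


def simulate(initial: float, growth_rate: float, capacity: float,
             dt: float = 0.1, steps: int = 200) -> list[float]:
    """Integrate the hypothesis with forward Euler and return the trajectory."""
    trajectory = [initial]
    for _ in range(steps):
        current = trajectory[-1]
        trajectory.append(current + dt * logistic_rate(current, growth_rate, capacity))
    return trajectory


def mean_squared_error(predicted: list[float], observed: list[float]) -> float:
    """Score a hypothesis against observations taken at the same time steps."""
    return sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(observed)


# Synthetic "observations" produced with r = 1.0, for illustration only.
observed = simulate(initial=10.0, growth_rate=1.0, capacity=100.0, steps=20)

# Two competing hypotheses that differ only in the growth rate r.
errors = {
    r: mean_squared_error(simulate(10.0, r, 100.0, steps=20), observed)
    for r in (0.5, 1.0)
}
best_rate = min(errors, key=errors.get)
```

Because each hypothesis is executable, an agent can rank candidates by rerunning the simulation rather than by arguing about them in text.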

Scientific Discovery as a Partially Observable Program World

Paper Venue
The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search arXiv 2025
ChemCrow: Augmenting large-language models with chemistry tools Nature MI 2024
Autonomous chemical research with large language models Nature 2023
Biomni: A General-Purpose Biomedical AI Agent bioRxiv 2025
Olympiad-Level Formal Mathematical Reasoning with Reinforcement Learning Nature 2025
The virtual lab of AI agents designs new SARS-CoV-2 nanobodies Nature 2025

Unifying Ideation, Experimentation, Analysis, and Communication

Paper Venue
ResearchAgent: Iterative Research Idea Generation over Scientific Literature with Large Language Models NAACL 2025
BioPlanner: Automatic Evaluation of LLMs on Protocol Planning in Biology EMNLP 2023
Agent Laboratory: Using LLM Agents as Research Assistants EMNLP 2025 Findings
AgentRxiv: Towards Collaborative Autonomous Research arXiv 2025
The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery arXiv 2024
Towards Scientific Intelligence: A Survey of LLM-based Scientific Agents arXiv 2026
Executable Code Actions Elicit Better LLM Agents ICML 2024
A universal system for digitization and automatic execution of the chemical synthesis literature Science 2020

Memory as Persistent Program State

Paper Venue
AIDE: AI-Driven Exploration in the Space of Code arXiv 2025
El Agente: An autonomous agent for quantum chemistry Matter 2025
PaperQA: Retrieval-Augmented Generative Agent for Scientific Research arXiv 2023
Towards an AI co-scientist arXiv 2025

Simulators as Executable Dynamics

Paper Venue
AlphaEvolve: A coding agent for scientific and algorithmic discovery arXiv 2025

Self-Driving Labs as Executable Feedback Loops

Paper Venue
Self-driving laboratory for accelerated discovery of thin-film materials arXiv 2020
MatPilot: an LLM-enabled AI Materials Scientist under the Framework of Human-Machine Collaboration arXiv 2024
An autonomous laboratory for the accelerated synthesis of inorganic materials Nature 2023

Toward Agentic and Instruction-Following Science

Paper Venue
MLAgentBench: Evaluating Language Agents on Machine Learning Experimentation arXiv 2024
MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering arXiv 2025
A Survey of Scientific Large Language Models: From Data Foundations to Agent Frontiers arXiv 2025
ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery ICLR 2025
DiscoveryBench: Towards Data-Driven Discovery with Large Language Models arXiv 2024

🧠 Agent Personalization

As recommendation moves from static prediction toward interactive agents, personalization systems must reason over latent and evolving user preferences through structured, editable preference states and executable feedback pipelines.
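One way to read "structured, editable preference state" is as a small data structure that feedback updates and that ranking reads back. The sketch below is a toy illustration under that assumption, not a method from the listed papers; `PreferenceState`, its fields, and the catalog entries are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class PreferenceState:
    """Hypothetical preference state: inspectable weights plus an edit log."""
    weights: dict = field(default_factory=dict)  # tag -> score
    log: list = field(default_factory=list)      # audit trail of feedback

    def apply_feedback(self, tags, liked, step=1.0):
        # Raise the weight of liked tags and lower the weight of disliked ones.
        delta = step if liked else -step
        for tag in tags:
            self.weights[tag] = self.weights.get(tag, 0.0) + delta
        self.log.append((tuple(tags), liked))

    def score(self, tags):
        return sum(self.weights.get(tag, 0.0) for tag in tags)

    def rank(self, items):
        # items maps name -> tags; highest score first, ties broken by name.
        return sorted(items, key=lambda name: (-self.score(items[name]), name))


state = PreferenceState()
state.apply_feedback(["jazz", "live"], liked=True)
state.apply_feedback(["metal"], liked=False)

catalog = {"album_a": ["jazz"], "album_b": ["metal", "live"], "album_c": ["pop"]}
ranking = state.rank(catalog)
```

Keeping the weights and the log as plain data means a user or another agent can inspect, edit, or roll back the state directly.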

From Static Recommendation to Interactive Personalization

Paper Venue
LightGCN: Simplifying and Powering Graph Convolution Network for Recommendation SIGIR 2020
DeepFM: A Factorization-Machine Based Neural Network for CTR Prediction IJCAI 2017
Large Language Models are Zero-Shot Rankers for Recommender Systems ECIR 2024
Uncovering ChatGPT's Capabilities in Recommender Systems RecSys 2023
RecoWorld: Building Simulated Environments for Agentic Recommender Systems arXiv 2025
RecMind: Large Language Model Powered Agent for Recommendation NAACL 2024 Findings
Recommender AI Agent: Integrating Large Language Models for Interactive Recommendations ACM TOIS 2025
On Generative Agents in Recommendation SIGIR 2024
iAgent: LLM Agent as a Shield Between User and Recommender Systems ACL 2025 Findings

Preference State as an Editable Artifact

Paper Venue
A-Mem: Agentic Memory for LLM Agents NeurIPS 2025
Evo-Memory: Benchmarking LLM Agent Test-Time Learning with Self-Evolving Memory arXiv 2025
Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory arXiv 2025
MemRec: Collaborative Memory-Augmented Agentic Recommender System arXiv 2026

Feedback as Policy Adaptation

Paper Venue
LLM-Powered User Simulator for Recommender System AAAI 2025
User Behavior Simulation with Large Language Model-Based Agents ACM TOIS 2025

Controllable and Instruction-Following Personalization

Paper Venue
Conversational Recommendation: Formulation, Methods, and Evaluation SIGIR 2020

✨ Acknowledgements

We thank the broader community for the contributions surveyed here. If your paper should be added or moved, please open a pull request or issue.

📄 License

This repository is released under the MIT License.
