
This is the official repository of the paper "Multi-Modal and Multi-Agent Systems Meet Rationality: A Survey"

This survey is the first to specifically examine how multi-modal and multi-agent systems 🤖 are advancing towards rationality 🧠 as defined in cognitive science, identifying their advancements over single-agent and language-only baselines, and discussing open problems in evaluating rationality beyond accuracy.

The fields of multi-modal and multi-agent systems are rapidly evolving, so we highly encourage researchers who would like to feature their amazing work in this dynamic repository to submit a pull request and keep it up to date. 💜

We also have a concurrent work, A Peek into Token Bias: Large Language Models Are Not Yet Genuine Reasoners Paper Code, to be released in the coming week. The survey critiques the limitations of existing benchmarks for evaluating rationality, while this paper goes beyond accuracy and reconceptualizes the evaluation of reasoning capabilities in LLMs as a general, statistically rigorous framework of testable hypotheses. It is designed to determine whether LLMs are capable of genuine reasoning or primarily rely on token bias. Our findings, with statistical guarantees, suggest that LLMs struggle with probabilistic reasoning.

Citations

A bunny 🐰 will be happy if you cite our work.

@misc{jiang2024multimodal,
      title={Multi-Modal and Multi-Agent Systems Meet Rationality: A Survey}, 
      author={Bowen Jiang and Yangxinyu Xie and Xiaomeng Wang and Weijie J. Su and Camillo J. Taylor and Tanwi Mallick},
      year={2024},
      eprint={2406.00252},
      archivePrefix={arXiv},
      primaryClass={cs.AI}
}

Define Rationality

Rationality is the quality of being guided by reason, characterized by logical thinking and decision-making that align with evidence and logical rules. This quality is essential for effective problem-solving, as it ensures that solutions are well-founded and systematically derived. We define four axioms that we expect a rational agent or agent system to satisfy:

  • Information grounding

    The decision of a rational agent is grounded in physical and factual reality. In order to make a sound decision, the agent must be able to integrate sufficient and accurate information from different sources and modalities grounded in reality, without hallucination.

  • Orderability of preference

    When comparing alternatives in a decision scenario, a rational agent can rank the options given the current state and ultimately select the most preferred one according to its expected outcomes. The orderability of preferences ensures the agent can make consistent and logical choices when faced with multiple alternatives. LLM-based evaluations heavily rely on this property.

  • Independence from irrelevant context

    The agent's preference should not be influenced by information irrelevant to the decision problem at hand. LLMs have been shown to exhibit irrational behavior when presented with irrelevant context, leading to confusion and suboptimal decisions. To ensure rationality, an agent must be able to identify and disregard irrelevant information, focusing solely on the factors that directly impact the decision-making processes.

  • Invariance across logically equivalent representations

    The preference of a rational agent remains invariant across equivalent representations of the decision problem, regardless of specific wordings or modalities.

Towards Rationality through Multi-Modal and Multi-Agent Systems

Each field of research in the figure above, such as knowledge retrieval or neuro-symbolic reasoning, addresses one or more fundamental axioms for rational thinking. These requirements are typically intertwined; therefore, an approach that enhances one aspect of rationality often inherently improves others simultaneously.

We include all related works in our survey below, categorized by their fields. Bold fonts are used to mark works that involve multi-modalities. In their original writings, most existing studies do not explicitly base their frameworks on rationality. Our analysis aims to reinterpret these works through the lens of our four axioms of rationality, offering a novel perspective that bridges existing methodologies with rational principles.

Knowledge Retrieval

The parametric nature of LLMs fundamentally limits how much information they can hold. A multi-modal and/or multi-agent system can include planning agents in its framework, akin to the System 2 process, that determine how and where to retrieve external knowledge and what specific information to acquire. Additionally, the system can have summarizing agents that utilize the retrieved knowledge to enrich the system's language outputs with better factuality.
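As a rough illustration of this planner-plus-summarizer pattern, the sketch below uses hypothetical `call_llm` and `search_knowledge_base` helpers; they are stand-ins, not APIs from any of the papers listed here.

```python
# Minimal retrieve-then-summarize loop; all helpers are illustrative stubs.

def call_llm(prompt: str) -> str:
    # Placeholder for a chat/completion call to any LLM backend.
    return "stubbed model response"

def search_knowledge_base(query: str, k: int = 3) -> list[str]:
    # Placeholder for dense or sparse retrieval over an external corpus.
    return [f"stub passage {i} for '{query}'" for i in range(k)]

def answer_with_retrieval(question: str) -> str:
    # Planning agent (System-2-like): decide what external knowledge is needed.
    search_query = call_llm(
        "Write a short search query that would help answer this question.\n"
        f"Question: {question}"
    )
    # Ground the answer in retrieved evidence rather than parametric memory alone.
    evidence = search_knowledge_base(search_query)
    # Summarizing agent: compose the final answer from the evidence only.
    return call_llm(
        "Answer using only the evidence below; say 'unknown' if it is insufficient.\n"
        f"Evidence: {evidence}\nQuestion: {question}"
    )

print(answer_with_retrieval("When was the city of Philadelphia founded?"))
```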

RAG: Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks Paper
Minedojo: Building open-ended embodied agents with internet-scale knowledge Paper Code
ReAct: Synergizing reasoning and acting in language models Paper Code
RA-CM3: Retrieval-Augmented Multimodal Language Modeling Paper
Chameleon: Plug-and-play compositional reasoning with large language models Paper Code
Chain of knowledge: A framework for grounding large language models with structured knowledge bases Paper Code
SIRI: Towards Top-Down Reasoning: An Explainable Multi-Agent Approach for Visual Question Answering Paper
CooperKGC: Multi-Agent Synergy for Improving Knowledge Graph Construction Paper Code
DoraemonGPT: Toward understanding dynamic scenes with large language models Paper Code
WildfireGPT: Tailored Large Language Model for Wildfire Analysis Paper Code
Chain-of-Action: Faithful and Multimodal Question Answering through Large Language Models Paper
CuriousLLM: Elevating Multi-Document QA with Reasoning-Infused Knowledge Graph Prompting Paper Code
Agent Hospital: A Simulacrum of Hospital with Evolvable Medical Agents Paper

Multi-Modal Foundation Models

As a picture is worth a thousand words, multi-modal approaches aim to improve information grounding across channels such as language and vision. By incorporating multi-modal agents, multi-agent systems can greatly expand their capabilities, enabling a richer, more accurate, and contextually aware interpretation of the environment. MMFMs are also particularly adept at promoting invariance by processing multi-modal data in a unified representation. Specifically, their large-scale cross-modal pretraining stage seamlessly tokenizes both vision and language inputs into a joint hidden embedding space, learning cross-modal correlations through a data-driven approach.

Generating output texts from input images requires only a single inference pass, which is quick and straightforward, aligning closely with the System 1 process of fast and automatic thinking in dual-process theories. RLHF and visual instruction-tuning enable more multi-round human-agent interactions and collaborations with other agents. This opens the possibility of subsequent research on the System 2 process in MMFMs.
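The joint-embedding idea can be sketched in a few lines of PyTorch. The toy module below is purely illustrative, not the architecture of any specific model above; it projects image patch features and text tokens into one shared sequence before a single transformer pass.

```python
# Illustrative sketch of a joint vision-language embedding space.
import torch
import torch.nn as nn

class ToyJointEncoder(nn.Module):
    def __init__(self, d_model: int = 256, vocab_size: int = 32000, patch_dim: int = 768):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)   # language tokens
        self.patch_proj = nn.Linear(patch_dim, d_model)        # vision patch features
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, text_ids: torch.Tensor, patch_feats: torch.Tensor) -> torch.Tensor:
        # Both modalities land in the same hidden space, so downstream reasoning
        # is agnostic to which channel the information came from.
        text = self.text_embed(text_ids)          # (B, T_text, d_model)
        vision = self.patch_proj(patch_feats)      # (B, T_patch, d_model)
        joint = torch.cat([vision, text], dim=1)   # one interleaved sequence
        return self.backbone(joint)

# Usage with dummy inputs:
encoder = ToyJointEncoder()
out = encoder(torch.randint(0, 32000, (1, 8)), torch.randn(1, 16, 768))
print(out.shape)  # torch.Size([1, 24, 256])
```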

CLIP: Learning transferable visual models from natural language supervision Paper Code
iNLG: Imagination-guided open-ended text generation Paper Code
BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models Paper Code
Vid2Seq: Large-scale pretraining of a visual language model for dense video captioning Paper Code
MiniGPT-4: Enhancing vision-language understanding with advanced large language models Paper Code
Flamingo: a visual language model for few-shot learning Paper
OpenFlamingo: An open-source framework for training large autoregressive vision-language models Paper Code
LLaVA: Visual Instruction Tuning Paper Code
LLaVA 1.5: Improved Baselines with Visual Instruction Tuning Paper Code
CogVLM: Visual expert for pretrained language models Paper Code
GPT-4V(ision) System Card Paper
Gemini: A Family of Highly Capable Multimodal Models Paper
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context Paper
GPT-4o Website

Large World Models

JEPA: A Path Towards Autonomous Machine Intelligence Paper
Voyager: An open-ended embodied agent with large language models Paper Code
Ghost in the Minecraft: Generally capable agents for open-world environments via large language models with text-based knowledge and memory Paper Code
Objective-Driven AI Slides
LWM: World Model on Million-Length Video And Language With RingAttention Paper Code
Sora: Video generation models as world simulators Website
IWM: Learning and Leveraging World Models in Visual Representation Learning Paper
CubeLLM: Language-Image Models with 3D Understanding Paper Code

Tool Utilizations

A multi-agent system can coordinate agents to determine when and which tool to use, which modality of information the tool expects, how to call the corresponding API, and how to incorporate outputs from the API calls, anchoring subsequent reasoning processes with more accurate information beyond their parametric memory. Besides, using tools requires translating natural language queries into API calls with predefined syntax. Once the planning agent has determined the APIs and their input arguments, the original queries, which may contain irrelevant contexts, become invisible to the tools, and the tools ignore any variance in the original queries as long as they share the equivalent underlying logic, promoting the invariance property.
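To make the routing step concrete, here is a deliberately tiny sketch; the rule-based planner and the tool registry are hypothetical stand-ins for an LLM planner and real APIs. Once the query is reduced to a tool name and typed arguments, the tool never sees the surrounding, possibly irrelevant, context.

```python
# Hedged sketch of tool routing: natural language in, structured API call out.
import json
import math

def tool_sqrt(x: float) -> float:
    return math.sqrt(x)

def tool_weather(city: str) -> str:
    return f"(stub) forecast for {city}"

REGISTRY = {"sqrt": tool_sqrt, "weather": tool_weather}

def plan(query: str) -> dict:
    # An LLM planner would emit this JSON-like structure; we hard-code two cases.
    if "square root" in query:
        number = float(query.split()[-1].rstrip("?"))
        return {"tool": "sqrt", "args": {"x": number}}
    return {"tool": "weather", "args": {"city": "Philadelphia"}}

def execute(query: str) -> str:
    call = plan(query)  # free-form query -> predefined syntax
    result = REGISTRY[call["tool"]](**call["args"])
    # The tool only ever received its typed arguments, not the original query.
    return json.dumps({"call": call["tool"], "result": str(result)})

print(execute("By the way, I love bunnies. What is the square root of 16?"))
```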

Visual Programming: Compositional visual reasoning without training Paper Code
Parsel: Algorithmic Reasoning with Language Models by Composing Decompositions Paper Code
Toolformer: Language Models Can Teach Themselves to Use Tools Paper Code
BabyAGI Code
ViperGPT: Visual inference via python execution for reasoning Paper Code
HuggingGPT: Solving AI tasks with ChatGPT and its friends in Hugging Face Paper Code
Chameleon: Plug-and-play compositional reasoning with large language models Paper Code
AutoGPT: build & use AI agents Code
ToolAlpaca: Generalized tool learning for language models with 3000 simulated cases Paper Code
AssistGPT: A general multi-modal assistant that can plan, execute, inspect, and learn Paper Code
Avis: Autonomous visual information seeking with large language model agent Paper
BuboGPT: Enabling visual grounding in multi-modal llms Paper Code
MemGPT: Towards llms as operating systems Paper Code
MetaGPT: Meta programming for multi-agent collaborative framework Paper Code
Agent LUMOS: Learning agents with unified data, modular design, and open-source llms Paper Code
AutoAct: Automatic Agent Learning from Scratch for QA via Self-Planning Paper Code
Small LLMs Are Weak Tool Learners: A Multi-LLM Agent Paper Code
DeLLMa: A Framework for Decision Making Under Uncertainty with Large Language Models Paper Code
ConAgents: Learning to Use Tools via Cooperative and Interactive Agents Paper Code
Multi-Agent VQA: Exploring Multi-Agent Foundation Models in Zero-Shot Visual Question Answering Paper Code

Web Agents

WebGPT: Browser-assisted question-answering with human feedback Paper
WebShop: Towards scalable real-world web interaction with grounded language agents Paper Code
Pix2Act: From pixels to UI actions: Learning to follow instructions via graphical user interfaces Paper Code
WebGUM: Multimodal web navigation with instruction-finetuned foundation models Paper Code
Mind2Web: Towards a Generalist Agent for the Web Paper Code
WebAgent: A real-world webagent with planning, long context understanding, and program synthesis Paper
CogAgent: A visual language model for GUI agents Paper Code
SeeAct: Gpt-4v (ision) is a generalist web agent, if grounded Paper Code

LLM-Based Evaluation

ChatEval: Towards better llm-based evaluators through multi-agent debate Paper Code
Benchmarking foundation models with language-model-as-an-examiner Paper
CoBBLEr: Benchmarking cognitive biases in large language models as evaluators Paper Code
Large Language Models are Inconsistent and Biased Evaluators Paper

Neuro-Symbolic Reasoning

Neuro-symbolic reasoning is another promising approach to achieving consistent ordering of preferences and invariance by combining the strengths of language and symbolic logic in a multi-agent system. A multi-agent system incorporating symbolic modules can not only understand language queries but also solve them with a level of consistency, providing a faithful and transparent reasoning process based on well-defined rules that adhere to logical principles, which is unachievable by LLMs alone within the natural language space. Neuro-symbolic modules also expect standardized input formats. This layer of abstraction enhances the independence from irrelevant contexts and maintains the invariance of LLMs when handling natural language queries.
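A minimal sketch of this division of labor, under the assumption that an LLM (stubbed here as `llm_formalize`) handles the translation step and a deterministic solver handles inference, might look like the following; the propositions and rules are illustrative only.

```python
# Neuro-symbolic sketch: an LLM stub formalizes the query, a deterministic
# forward-chaining solver does the inference, so any logically equivalent
# phrasing of the query yields the same answer.

def llm_formalize(_query: str):
    # Stub for the translation step a real system would delegate to an LLM.
    facts = {"rainy"}
    rules = [({"rainy"}, "wet_ground"), ({"wet_ground"}, "slippery")]
    return facts, rules

def forward_chain(facts: set[str], rules: list[tuple[set[str], str]]) -> set[str]:
    # Apply Horn-style rules until no new proposition can be derived.
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for body, head in rules:
            if body <= derived and head not in derived:
                derived.add(head)
                changed = True
    return derived

facts, rules = llm_formalize("It is raining; is the ground slippery?")
print("slippery" in forward_chain(facts, rules))  # True, regardless of phrasing
```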

Binder: Binding language models in symbolic languages Paper Code
Parsel: Algorithmic Reasoning with Language Models by Composing Decompositions Paper Code
Sparks of artificial general intelligence: Early experiments with gpt-4 Paper
Logic-LM: Empowering large language models with symbolic solvers for faithful logical reasoning Paper Code
Minding Language Models' (Lack of) Theory of Mind: A Plug-and-Play Multi-Character Belief Tracker Paper Code
Towards formal verification of neuro-symbolic multi-agent systems Paper
What’s Left? Concept Grounding with Logic-Enhanced Foundation Models Paper Code
Ada: Learning adaptive planning representations with natural language guidance Paper
Large language models are neurosymbolic reasoners Paper Code
DoraemonGPT: Toward understanding dynamic scenes with large language models Paper Code
A Neuro-Symbolic Approach to Multi-Agent RL for Interpretability and Probabilistic Decision Making Paper
Conceptual and Unbiased Reasoning in Language Models Paper

Self-Reflection, Multi-Agent Debate, and Collaboration

Due to the probabilistic outputs of LLMs, which resemble the rapid, non-iterative nature of human System 1 cognition, ensuring preference orderability and invariance is challenging. In contrast, algorithms that enable self-reflection and multi-agent systems that promote debate and consensus can slow down the thinking process and help align outputs more closely with the deliberate and logical decision-making typical of System 2 processes, thus enhancing rational reasoning in agents.

Collaborative approaches allow each agent in a system to compare and rank its preferences over candidate answers, whether its own or those of other agents, through critical judgment. This enables the system to discern and output the most dominant decision as a consensus, thereby improving the orderability of preference. At the same time, through such a slow and critical thinking process, errors in initial responses or input prompts are more likely to be detected and corrected.
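A schematic debate-and-consensus loop could look like the sketch below; the prompts and the `call_llm` helper are hypothetical and do not reproduce the exact protocol of any single paper listed here.

```python
# Sketch of multi-agent debate: draft, criticize peers, revise, then vote.
from collections import Counter

def call_llm(prompt: str) -> str:
    return "placeholder answer"  # stub for any chat-completion backend

def debate(question: str, n_agents: int = 3, rounds: int = 2) -> str:
    # Each agent drafts an initial answer independently.
    answers = [call_llm(f"Answer concisely: {question}") for _ in range(n_agents)]
    for _ in range(rounds):
        # Each agent reads its peers' answers and revises its own.
        answers = [
            call_llm(
                f"Question: {question}\n"
                f"Your previous answer: {answers[i]}\n"
                f"Other agents answered: {answers[:i] + answers[i + 1:]}\n"
                "Criticize these answers and give your revised final answer."
            )
            for i in range(n_agents)
        ]
    # Consensus: majority vote over the final-round answers.
    return Counter(answers).most_common(1)[0][0]

print(debate("Is 17 a prime number?"))
```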

Self-Refine: Iterative refinement with self-feedback Paper Code
Reflexion: Language agents with verbal reinforcement learning Paper Code
FORD: Examining Inter-Consistency of Large Language Models Collaboration: An In-depth Analysis via Debate Paper Code
Memorybank: Enhancing large language models with long-term memory Paper Code
LM vs LM: Detecting factual errors via cross examination Paper
Multi-Agent Collaboration: Harnessing the Power of Intelligent LLM Agents Paper
Improving factuality and reasoning in language models through multiagent debate Paper Code
MAD: Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate Paper Code
S3: Social-network Simulation System with Large Language Model-Empowered Agents Paper
ChatDev: Communicative agents for software development Paper Code
ChatEval: Towards better llm-based evaluators through multi-agent debate Paper Code
AutoGen: Enabling next-gen llm applications via multi-agent conversation framework Paper Code
Corex: Pushing the boundaries of complex reasoning through multi-model collaboration Paper Code
DyLAN: Dynamic llm-agent network: An llm-agent collaboration framework with agent team optimization Paper Code
AgentCF: Collaborative learning with autonomous language agents for recommender systems Paper
MetaAgents: Simulating interactions of human behaviors for llm-based task-oriented coordination via collaborative generative agents Paper
Social Learning: Towards Collaborative Learning with Large Language Models Paper
Enhancing Diagnostic Accuracy through Multi-Agent Conversations: Using Large Language Models to Mitigate Cognitive Bias Paper
Combating Adversarial Attacks with Multi-Agent Debate Paper
Debating with More Persuasive LLMs Leads to More Truthful Answers Paper Code

Prompting Strategy and Memory

Works in this section are not all directly related to multi-modal or multi-agent systems.
CoT: Chain-of-thought prompting elicits reasoning in large language models Paper
Language model cascades Paper Code
ReAct: Synergizing reasoning and acting in language models Paper Code
Memorybank: Enhancing large language models with long-term memory Paper Code
Tree of thoughts: Deliberate problem solving with large language models Paper Code
Beyond chain-of-thought, effective graph-of-thought reasoning in large language models Paper Code
Graph of thoughts: Solving elaborate problems with large language models Paper Code
MemoChat: Tuning llms to use memos for consistent long-range open-domain conversation Paper Code
Retroformer: Retrospective large language agents with policy gradient optimization Paper Code
FormatSpread: How I learned to start worrying about prompt formatting Paper Code
ADaPT: As-Needed Decomposition and Planning with Language Models Paper Code
EureQA: Deceiving semantic shortcuts on reasoning chains: How far can models go without hallucination Paper Code
Combating Adversarial Attacks with Multi-Agent Debate Paper
MAD-Bench: How Easy is It to Fool Your Multimodal LLMs? An Empirical Analysis on Deceptive Prompts Paper

This survey builds connections between multi-modal and multi-agent systems and rationality, guided by dual-process theories and the four axioms we expect a rational agent or agent system to satisfy: information grounding, orderability of preference, independence from irrelevant context, and invariance across logically equivalent representations. Our findings suggest that grounding can usually be enhanced by multi-modalities, world models, knowledge retrieval, and tool utilization. The remaining three axioms are typically intertwined, and we sometimes describe their collective characteristics informally using terms such as coherence, consistency, and trustworthiness. These axioms are simultaneously improved by advances in multi-modalities, tool utilization, neuro-symbolic reasoning, self-reflection, and multi-agent collaboration. These fields of research, by either deliberation that slows down the "thinking" process or abstraction that boils tasks down to their logical essence, mimic the "System 2" thinking in human cognition, thereby enhancing the rationality of multi-agent systems in decision-making scenarios, compared to single-agent, language-only baselines that resemble the "System 1" process.

Evaluating Rationality of Agents

The choice of evaluation metrics is important. We find that most benchmarks predominantly focus on the accuracy of the final performance, ignoring the most interesting intermediate reasoning steps and the concept of rationality. A promising direction is to create benchmarks specifically tailored to assess rationality, going beyond existing ones focused on accuracy. These new benchmarks should avoid data contamination and emphasize tasks that demand consistent reasoning across diverse representations and domains. Besides, existing evaluations of rationality provide limited comparisons between multi-modal/multi-agent frameworks and single-agent baselines, and thus fail to fully elucidate the advantages multi-modal/multi-agent frameworks can offer.
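As one concrete direction for going beyond accuracy, a rationality-oriented benchmark could score consistency across logically equivalent representations of the same problem; the sketch below, with a hypothetical `call_llm` stub, illustrates such an invariance check.

```python
# Hedged sketch of an invariance metric: pose logically equivalent rephrasings
# of one problem and score the fraction of paraphrase pairs with equal answers.
from itertools import combinations

def call_llm(prompt: str) -> str:
    return "4"  # placeholder; a real backend would answer the prompt

def invariance_score(equivalent_prompts: list[str]) -> float:
    answers = [call_llm(p).strip().lower() for p in equivalent_prompts]
    pairs = list(combinations(answers, 2))
    agree = sum(a == b for a, b in pairs)
    return agree / len(pairs) if pairs else 1.0

score = invariance_score([
    "What is 2 + 2?",
    "Compute the sum of two and two.",
    "If I have two apples and buy two more, how many apples do I have?",
])
print(f"invariance across equivalent representations: {score:.2f}")
```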

🏳️‍🌈 Our Work 🏳️‍⚧️ A Peek into Token Bias: Large Language Models Are Not Yet Genuine Reasoners Paper Code
is designed to determine whether LLMs are capable of genuine reasoning or if they primarily rely on token bias. We go beyond accuracy and reconceptualize the evaluation of reasoning capabilities in LLMs into a general, statistically rigorous framework of testable hypotheses. Our findings with statistical guarantees suggest that LLMs struggle with probabilistic reasoning, with apparent performance improvements largely attributable to token bias. To be released in the upcoming week.

General Benchmarks or Evaluation Metrics

CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge Paper Data
LogiQA: a challenge dataset for machine reading comprehension with logical reasoning Paper Data
Logiqa 2.0: an improved dataset for logical reasoning in natural language understanding Paper Data
HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering Paper Data
Measuring mathematical problem solving with the math dataset Paper Data
HybridQA: A Dataset of Multi-Hop Question Answering over Tabular and Textual Data Paper Data
Conceptual and Unbiased Reasoning in Language Models Paper
Large language model evaluation via multi AI agents: Preliminary results Paper
AgentBoard: An Analytical Evaluation Board of Multi-turn LLM Agents Paper Code
AgentBench: Evaluating LLMs as Agents Paper Code
Benchmark self-evolving: A multi-agent framework for dynamic llm evaluation Paper Code
Llm-deliberation: Evaluating llms with interactive multi-agent negotiation games Paper Code

Adapting Cognitive Psychology Experiments

Using cognitive psychology to understand GPT-3 Paper Code
On the dangers of stochastic parrots: Can language models be too big? Paper

Testing Grounding against Hallucination

A multi-task, multilingual, multimodal evaluation of chatgpt on reasoning, hallucination, and interactivity Paper Data
Hallucinations in large multilingual translation models Paper Data
Evaluating attribution in dialogue systems: The BEGIN benchmark Paper Data
HaluEval: A large-scale hallucination evaluation benchmark for large language models Paper Data
DialFact: A benchmark for fact-checking in dialogue Paper Data
FaithDial: A faithful benchmark for information-seeking dialogue Paper Data
AIS: Measuring attribution in natural language generation models Paper Data
Why does ChatGPT fall short in providing truthful answers Paper
FADE: Diving deep into modes of fact hallucinations in dialogue systems Paper Code
Hallucinated but factual! inspecting the factuality of hallucinations in abstractive summarization Paper
Exploring and evaluating hallucinations in llm-powered code generation Paper
EureQA: Deceiving semantic shortcuts on reasoning chains: How far can models go without hallucination Paper Code
TofuEval: Evaluating hallucinations of llms on topic-focused dialogue summarization Paper Code
Object hallucination in image captioning Paper Code
Let there be a clock on the beach: Reducing object hallucination in image captioning Paper Code
Evaluating object hallucination in large vision-language models Paper Code
LLaVA-RLHF: Aligning large multimodal models with factually augmented RLHF Paper Code

Testing the Orderability of Preference

Large language models are not robust multiple choice selectors Paper Code
Leveraging large language models for multiple choice question answering Paper Code

Testing the Principle of Invariance

Mind the instructions: a holistic evaluation of consistency and interactions in prompt-based learning Paper
Rethinking benchmark and contamination for language models with rephrased samples Paper Code
From Form(s) to Meaning: Probing the semantic depths of language models using multisense consistency Paper Code
Fantastically ordered prompts and where to find them: Overcoming few-shot prompt order sensitivity Paper Code
On sensitivity of learning with limited labelled data to the effects of randomness: Impact of interactions and systematic choices Paper
Benchmark self-evolving: A multi-agent framework for dynamic llm evaluation Paper Code
Exploring multilingual human value concepts in large language models: Is value alignment consistent, transferable and controllable across languages? Paper Code
Fool your (vision and) language model with embarrassingly simple permutations Paper Code
Large language models are not robust multiple choice selectors Paper Code

Testing Independence from Irrelevant Context

Large language models can be easily distracted by irrelevant context Paper
How easily do irrelevant inputs skew the responses of large language models? Paper Code
Lost in the middle: How language models use long context Paper Code
Making retrieval-augmented language models robust to irrelevant context Paper Code
Towards AI-complete question answering: A set of prerequisite toy tasks Paper Code
CLUTRR: A diagnostic benchmark for inductive reasoning from text Paper Code
Transformers as soft reasoners over language Paper Code
Do prompt-based models really understand the meaning of their prompts? Paper Code
MileBench: Benchmarking MLLMs in long context Paper Code
Mementos: A comprehensive benchmark for multimodal large language model reasoning over image sequences Paper Code
Seedbench-2: Benchmarking multimodal large language models Paper Code
DEMON: Finetuning multimodal llms to follow zero-shot demonstrative instructions Paper Code
