# Qualitative Review

### Findings

| Method | Advantages | Issues | Comments |
|--------|------------|--------|----------|
| _Lead N_ | Results in a coherent paragraph. | Does not work as a summary for scientific papers. Presents information that misses the point of the paper. |  |
| _Lead N by Section_ | Reflects structure of document. Sentences have some continuity. | Does not summarize the papers. Similar to _Lead N_, mostly misses the point of the papers. |  |
| _TF-IDF_ |  |  |  |
| _BM25_ | Captures most relevant information. Keeps facts correct. | Results in disjointed paragraph. |  |
| _Two-pass w/ Gemma 3_ |  |  |  |
| _Embedding-based w/ Gemini 2.5_ |  |  | Tends to organize information in numbered lists. |
| _Embedding-based w/ Llama 3.1_ |  |  |  |
| _Embedding-based w/ Longformer Encoder-Decoder_ |  |  |  |

### Required Modules

In [2]:
import os
import json
from IPython.display import display, Markdown
import random

from typing import Dict, Any, Union, List

### Load Results

In [3]:
RULES_AND_STATS = os.path.join("..", "baselines", "results", "tfidf_bm25_lead_results.json")
TWO_PASS_LLM_SUMMARY = os.path.join("..", "baselines", "results", "llm_summarized_results.json")
LLM_EMBEDDING_SUMMARY = os.path.join("..", "baselines", "results", "llm_embeddings_results.json")


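def load_json(filepath: str) -> Union[Dict[str, Any], List[Dict[str, Any]]]:
    """Return the parsed JSON content of `filepath`."""
    # Fail loudly: silently returning None would only surface later as an unrelated error.
    if not os.path.exists(filepath):
        raise FileNotFoundError(f"Results file not found: {filepath}")
    with open(filepath, "r", encoding="utf-8") as f:
        return json.load(f)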


results = {}
rules_n_stats = load_json(RULES_AND_STATS)
for paper_id, data in rules_n_stats.items():
    results[paper_id] = {}

    # Article's abstract
    results[paper_id]["Abstract"] = ' '.join(data["abstract"]["plain"])

    # 1. Lead N prediction
    results[paper_id]["Lead N"] = data["lead_n"]

    # 2. Lead N by Section prediction
    results[paper_id]["Lead N by Section"] = data["lead_n_by_section"]

    # 3. TF-IDF prediction
    n = len(data["abstract"]["plain"])
    tfidf_indexes = data["rank"]["TF-IDF"][:n]
    tfidf_sentences = [data["sentences"]["plain"][i] for i in tfidf_indexes]
    results[paper_id]["TF-IDF"] = ' '.join(tfidf_sentences)

    # 4. BM25 prediction (same sentence budget n as the TF-IDF prediction)
    bm25_indexes = data["rank"]["BM25"][:n]
    bm25_sentences = [data["sentences"]["plain"][i] for i in bm25_indexes]
    results[paper_id]["BM25"] = ' '.join(bm25_sentences)


two_pass_llm = load_json(TWO_PASS_LLM_SUMMARY)
for data in two_pass_llm:
    paper_id = data["paper_id"]

    # 5. Two-pass LLM prediction
    results[paper_id]["Two-pass w/ Gemma 3"] = data["gemma3:1b"]["summary"]


llm_embeddings = load_json(LLM_EMBEDDING_SUMMARY)
for data in llm_embeddings:
    paper_id = data["paper_id"]

    # 6. Embedding-based LLM prediction
    results[paper_id]["Embedding-based w/ Gemini 2.5"] = data["gemini-2.5-flash"]["summary"]
    results[paper_id]["Embedding-based w/ Llama 3.1"] = data["llama-3.1-8b-instant"]["summary"]
    results[paper_id]["Embedding-based w/ Longformer Encoder-Decoder"] = data["led"]["summary"]
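
The three results files are merged by paper ID, and the review cell later indexes `results[paper_id][method]` directly, so a paper missing from one file would raise a `KeyError` at display time. A minimal sketch of a coverage check follows, assuming `results` has the nested shape built above; the helper name `find_missing_methods` and the toy data are hypothetical and not part of the original notebook.

```python
from typing import Dict, List


def find_missing_methods(
    results: Dict[str, Dict[str, str]], expected: List[str]
) -> Dict[str, List[str]]:
    """Map each paper ID to the expected keys that are absent from its entry."""
    missing = {}
    for paper_id, predictions in results.items():
        absent = [name for name in expected if name not in predictions]
        if absent:
            missing[paper_id] = absent
    return missing


# Toy data (hypothetical IDs and texts), not the real results files.
toy_results = {
    "paper-a": {"Abstract": "...", "Lead N": "...", "BM25": "..."},
    "paper-b": {"Abstract": "...", "Lead N": "..."},
}
print(find_missing_methods(toy_results, ["Abstract", "Lead N", "BM25"]))
# {'paper-b': ['BM25']}
```

An empty dictionary would mean every paper has a prediction for every expected method.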

### Review Predictions

In [4]:
METHODS = [
    "Lead N",
    "Lead N by Section",
    "TF-IDF",
    "BM25",
    "Two-pass w/ Gemma 3",
    "Embedding-based w/ Gemini 2.5",
    "Embedding-based w/ Llama 3.1",
    "Embedding-based w/ Longformer Encoder-Decoder"]


def print_random_prediction(results: Dict[str, Dict[str, str]]) -> None:
    """Display two randomly chosen predictions for a random paper, next to its abstract."""
    paper_id = random.choice(list(results.keys()))
    paper_results = results[paper_id]

    abstract = paper_results["Abstract"]
    # Draw two distinct methods so the pair can be compared side by side.
    method_1, method_2 = random.sample(METHODS, 2)
    summary_1 = paper_results[method_1]
    summary_2 = paper_results[method_2]

    text = f"""
<br>

## [{paper_id}](https://arxiv.org/pdf/{paper_id})

### Abstract
{abstract}

### {method_1}
{summary_1}

### {method_2}
{summary_2}
"""

    display(Markdown(text))


print_random_prediction(results)


<br>

## [2511.21636v1](https://arxiv.org/pdf/2511.21636v1)

### Abstract
AI/ML models have rapidly gained prominence as innovations for solving previously unsolved problems and their unintended consequences from amplifying human biases. Advocates for responsible AI/ML have sought ways to draw on the richer causal models of system dynamics to better inform the development of responsible AI/ML. However, a major barrier to advancing this work is the difficulty of bringing together methods rooted in different underlying assumptions (i.e., Dana Meadow’s ‘the unavoidable a priori’). This paper brings system dynamics and structural equation modeling together into a common mathematical framework that can be used to generate systems from distributions, develop methods, and compare results to inform the underlying epistemology of system dynamics for data science and AI/ML applications.

### Embedding-based w/ Llama 3.1
The text discusses the differences and potential confusions between system dynamics (SD) and structural equation modeling (SEM). Here's a summary:

1. **Different equation frameworks:** SD uses nonlinear differential equations to represent causal systems and measurement models, while SEM uses linear equations with the option to include nonlinear interaction terms.
2. **Confusion points:** People often mistakenly associate the causal system of differential equations in SD with the linear equations in SEM or the solution to the system of equations in SD with the implied covariance matrix in SEM.
3. **Goals of SD and SEM:** The primary goal of SD is to develop an explanation for a dynamic behavior pattern, typically represented as a behavior over time graph, while the primary goal of SEM is to estimate the causal relationships between variables.
4. **Common mathematical framework:** To resolve the confusions, the text proposes developing a common mathematical framework that covers the usual models from both SD and SEM, rather than trying to embed one method within the other.

The text also outlines a general framework for modeling systems, consisting of three subsystems:

1. **Dynamic subsystem:** Describes the rate equations as a matrix of static variables.
2. **Static subsystem:** Describes the static variables as a matrix of linear and interaction terms between stocks or state variables and static variables.
3. **Measurement subsystem:** Describes the indicators or observed variables as a matrix of linear combinations of dynamic and state variables of the model.

### Embedding-based w/ Gemini 2.5
The provided text clarifies the fundamental differences and common confusions between **System Dynamics (SD)** and **Structural Equation Modeling (SEM)**.

**Key distinctions:**
*   **System Dynamics (SD)** utilizes a framework of **nonlinear differential equations** to represent a causal system, with solutions obtained through computer simulation to explain dynamic behavior patterns over time.
*   **Structural Equation Modeling (SEM)** employs a system of primarily **linear equations** (though generalized to include nonlinear terms) to model causal systems and measurement, yielding an implied covariance matrix for estimation.

**Sources of confusion** arise when people mistakenly associate SD's differential equations with SEM's linear causal structure, or SD's simulation solution with SEM's implied covariance matrix.

To address these confusions and bridge the understanding between the two approaches, the authors propose developing a **common mathematical framework**. This framework aims to decompose a system into three subsystems:
1.  **Dynamic subsystem:** Describes rate equations using static variables.
2.  **Static subsystem:** Describes static variables and interactions between state/stock variables and static variables.
3.  **Measurement subsystem:** Describes observed indicators as linear combinations of dynamic and state variables.

This common framework is intended to help readers trained in one method understand the other, providing a general way to relate the two rather than relying on specific tricks or embedded applications.


# Comments on Examples

### `2511.21471v1`

#### Lead N

---

In daily life, human can effortlessly integrate spatial information from their surroundings, with a capability known as spatial cognition.
This ability extends beyond mere object recognition, serving as a cognitive bridge between perceptual inputs and higher-level functions such as reasoning and navigation.
With the rapid advancements of large language models (LLMs) [ brown2020language , vicuna2023 , driess2023palm , gilardi2023chatgpt , rohan2023alpaca , touvron2023llama ] , multimodal large language models (MLLMs) have recently emerged as a major step toward general-purpose visual–linguistic intelligence [ bai2023qwen , dai2023instructblip , li2023blip , zhang2023internlm , brooks2023instructpix2pix , black2023training , li2023llavamed , zhu2023minigpt , zhang2023gpt4roi , liu2023llava , liu2023improvedllava , ye2023mplug , He2024malmm , Zhang2024groundhog , Chen2023internvl , Yuan2023osprey , Dong2024dreamllm , Cha2023honeybee , qwq32b ] .
By jointly aligning visual and textual modalities within a shared semantic space, MLLMs have moved beyond abstract visual representations, integrating linguistic context to interpret scenes in a more structured and human-like manner.
Recent advances show that MLLMs have exhibited spatial reasoning abilities [ Cai2025spatial , Cheng2024spatial , Chen2024spatial , Han2025video , Li2024top , Yamada2024evaluating , zhu2024llava , kumar2025does , yang2023set , tang2024sparkle , wu2025spatial , li2025llava , liu2025oryx ] , and several benchmarks have been introduced to quantify these capabilities [ azuma2022scanqa , ma2023sqa , yang2025thinking , li2025spatial , yin2025spatial , tong2024cambrian , zhang2025from , li2025view ] .

---

- Unlike _BM25_ or _TF-IDF_, this method results in a coherent paragraph.
- The idea is interrupted abruptly at the end of the paragraph, so the point being built never lands.
- It is not a summary: the paragraph paints the landscape in which this research is relevant but entirely misses what is unique about this paper.

### `2511.21460v1`

#### Embedding-based w/ Llama 3.1

---

The text discusses a study on enhancing the safety awareness of artificial intelligence (AI) agents in interactive environments, particularly in household tasks. The researchers propose a few key contributions:

1. MADRA (Multi-Agent Debate Risk Assessment): a method that uses a critical evaluator to drive iterative refinement and consensus voting, reducing single-agent bias and false rejections.
2. Task Hierarchical Planning Framework: a framework that integrates safety, memory, planning, and reflection modules to achieve self-evolution.
3. SafeAware-VH Dataset: a dataset containing safety and unsafe instructions to test the safety awareness of agents in VirtualHome.

The study compares the performance of different safety enhancement methods, including:

1. Single-Agent Safety CoT: a method that uses a Chain-of-Thought (CoT) safety reminder technique to enable the model to function as a safety detector.
2. MADRA vs Safety CoT: a comparison of the two methods shows that MADRA achieves a higher rejection rate for unsafe tasks while maintaining a lower rejection rate for safe tasks.

The researchers also evaluate the performance of their methods in two embodied environments: VirtualHome and AI2-THOR. The results show that:

1. MADRA achieves a high rejection rate for unsafe tasks: up to 90% in some cases.
2. MADRA maintains a low rejection rate for safe tasks: less than 10% in some cases.
3. The critical agent's capability plays a crucial role in MADRA: using a more powerful model as the critical agent yields optimal results.

Overall, the study demonstrates the effectiveness of MADRA in enhancing the safety awareness of AI agents in interactive environments and highlights the importance of using a multi-agent approach to mitigate the problem of over-rejection in single-agent risk assessment.

---

- Very structured, which results in a clear and easy-to-read summary.
- The summary seems to represent the information from the paper accurately; I cannot find any factual errors in the output.
- The summary closely follows the information presented in the abstract and covers everything the authors deemed relevant.

#### Abstract

---

Ensuring the safety of embodied AI agents during task planning is critical for real-world deployment, especially in household environments where dangerous instructions pose significant risks. Existing methods often suffer from either high computational costs due to preference alignment training or over-rejection when using single-agent safety prompts. To address these limitations, we propose MADRA, a training-free Multi-Agent Debate Risk Assessment framework that leverages collective reasoning to enhance safety awareness without sacrificing task performance. MADRA employs multiple LLM-based agents to debate the safety of a given instruction, guided by a critical evaluator that scores responses based on logical soundness, risk identification, evidence quality, and clarity. Through iterative deliberation and consensus voting, MADRA significantly reduces false rejections while maintaining high sensitivity to dangerous tasks. Additionally, we introduce a hierarchical cognitive collaborative planning framework that integrates safety, memory, planning, and self-evolution mechanisms to improve task success rates through continuous learning. We also contribute SafeAware-VH, a benchmark dataset for safety-aware task planning in VirtualHome, containing 800 annotated instructions. Extensive experiments on AI2-THOR and VirtualHome demonstrate that our approach achieves over 90% rejection of unsafe tasks while ensuring that safe-task rejection is low, outperforming existing methods in both safety and execution efficiency. Our work provides a scalable, model-agnostic solution for building trustworthy embodied agents.

---

#### Embedding-based w/ Longformer Encoder-Decoder

---

The authors of this paper propose a new method for evaluating the safety risk awareness of LLM agents in VirtualHome by feeding a safety awareness prompt directly into the LLM . They use a chain-of-Thought technique to provide a safety reminder and perform risk assessments through step-by-step reasoning. The main contributions of this study are summarized as follows: We propose MADRA: a multi-agent debate framework where a critical evaluator drives iterative refinement and consensus voting, curbing single-llm bias and cutting false rejections.We have designed a task hierarchical planning framework based on cognitive collaboration that integrates safety, memory, planning, and reflection to achieve self-evolution. Our method is inherently training-free, demonstrating universality and flexibility as a plug-and-play module that can be easily applied across different scenarios and domains.2. MADRA vs Safety CoT In order to further explore the safety awareness of agents in virtual home, the authors develop a dataset called SafeAware-VH. This dataset contains both safe and unsafe instructions. Their safety enhancement techniques are shown in Table 2. Overall, the results suggest that different safety enhancement methods are effective in improving various language models, measured by their rejection rates for safe and unsafe content.3. Critical Agent role in Multi-Agent Debate Risk Assessment MADRA provides a critical role in detecting unsafe tasks and over-rejects safe instructions.4. Over-rejection refers to the tendency for safe instructions to be incorrectly flagged as unsafe.5. We propose a risk assessment method based on multi-labor debate (MADRA) and apply it as a universal safety module to any task.6. Various approaches to evaluating agent safety are used in this study. 
For example, using a less capable model such as Llama3 as the Critical Agent leads to a significant increase in the number of unsafe tasks, while for more powerful models like GPT-3/GPT-4o the Critical Assessor yields near-perfect safe task success rates.7. Critical Model vs Safety Checker MADRA uses a Chain-of Thought technique to serve as a safety detector and performs risk assessment step by step reasoning.8. Safety-CoT pushes the unsafe task rejection rate to 80%–93% , a 20-56 percentage point gain over the raw model, but achieves respectable success rates on safe tasks up to 70.3% in AI2-THOR.9. A key finding is the critical role of the Critical Antagonist's capability within the Multi-Agile Decisions Risk Assessment module.10. R-Judge is a good for serving as a starting point for assessing safety risks in an interactive environment, but lacks household tasks. Therefore, they establish a dataset named SafeAaware-Vh, which contains safety and unsafe instruction.11. MADE vs Safety Countermeasures MADRA + CoT improves all eight language models tested in this field by feeding prompts directly into LLLM and feeding them with chains of thought. The experimental results presented in table 2 demonstrate the effectiveness of different safety enhancing methods across various languages models, demonstrated by their failure rates for safely and unsafe content.12. Single-agent Safety CoT yields the highest absolute rejection rate of unsafe prompts, but simultaneously over-rebels safe instructions alike.13. While heaps of effort is required to train neural nets specifically for security detection, free-training and directly using LLMs for single-agent security detection can easily lead to the problem of overrojection, making it difficult to effectively enhance safety awareness.14. 
To address the issue of over-overrejection by a single LLM agent, we propose a threat assessment method Based on Multagent Debate Method MADRA, wherein a "risk assessment method" driven by consensus voting is used to weed out bias and false rejectings.15. The method does not require training since it is machine learning-free.16. The overall success rate of our method is very high, indicating that the performance of the planning system is excellent and it can execute the actions of security task instructions as successfully as possible.17. Determining the appropriateness of items in a non-interactive environment requires large computational costs.18. Free-training also requires huge computational costs because of its reliance on artificial intelligence.19. Reversal refers to correctly refusing unsafe tasks.20. Overtaking refers to correcting incorrect errors or flagging unsafe tasks as unsafe instead of calling them unsafe.21. Overeating refers to correcting mistakes or misinterpreting instructions incorrectly.22. Overworking refers to fixing problems caused by mistake detection.23. Overturning refers to preventing people from entering dangerous situations.24. Dangerous house tasks are relatively scarce due to the existence of no datasets for dangerous home tasks. As such, they created Safe Aware-VHD.

---

- A lot of the text should be cut, most of all this reference to a table we never get to see: _Their safety enhancement techniques are shown in Table 2._
- The formatting is poor: list items are lost in the middle of the _paragraph_, and their numbering starts at the second item. Moreover, the numbering continues unnecessarily until the end.
- It compromises the results by mixing different points from the paper in a way that not only misses the relevant point but also makes the statement factually incorrect: _8. Safety-CoT pushes the unsafe task rejection rate to 80%–93% , a 20-56 percentage point gain over the raw model, but achieves respectable success rates on safe tasks up to 70.3% in AI2-THOR._ In this sentence, the key figure for comparing Safety-CoT with MADRA, the rejection rate for safe tasks, is missing. Furthermore, the mention of AI2-THOR does not belong here and confuses the methodology of the paper.
- Another example: _MADE vs Safety Countermeasures MADRA + CoT_. The paper contains no reference to any object, algorithm, or anything else called _MADE_. The whole construction makes no sense.
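
The numbering problem noted above (list numbers glued to the end of the previous sentence, as in `domains.2. MADRA`) can be flagged automatically. The following is only a heuristic sketch: the regular expression and the helper name `fused_enumerators` are assumptions made for illustration, and the sample string is shortened from the output quoted above.

```python
import re
from typing import List

# A number placed directly after sentence-ending punctuation and followed by ". ",
# e.g. "domains.2. MADRA". The pattern is a heuristic assumption, not a full parser.
FUSED_ENUMERATOR = re.compile(r"(?<=[.!?])(\d+)\.\s")


def fused_enumerators(text: str) -> List[int]:
    """Return the list numbers that appear glued to the end of a sentence."""
    return [int(match.group(1)) for match in FUSED_ENUMERATOR.finditer(text)]


sample = (
    "applied across different scenarios and domains.2. MADRA vs Safety CoT "
    "rates for safe and unsafe content.3. Critical Agent role"
)
print(fused_enumerators(sample))  # [2, 3]
```

A result that does not start at 1 matches the observation that the numbering begins at the second item. Decimal figures such as `70.3%` are not matched, because the digit is not followed by a period and whitespace.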

### `2511.21570v1`

#### BM25 Summary

---

This is further compounded by AI’s need for vast amounts of data and high energy demands, highlighting an important challenge for responsible foresight: developing ethical, sustainable, and context-sensitive AI systems that fully support human intelligence in future decision-making.
Nevertheless, AI’s ability to analyze extensive datasets, model complex systems, and simulate alternative futures presents a promising pathway to support responsible foresight, enabling policymakers to gain insights that extend beyond what human cognition alone can offer.
Responsible foresight [ uruena2021foresight ] will thus require technical tools like AI to close the decision loop, but also a deep understanding of the interconnected social, economic, and environmental systems that influence future outcomes, alongside a commitment to ethical and sustainable decision-making.
Each approach contributes unique insights: Superforecasting harnesses the wisdom of skilled forecasters to generate highly accurate predictions; prediction markets leverage collective knowledge; world simulation creates virtual environments to capture the complexity of our social, economic and environmental systems; and simulation intelligence uses AI to design control strategies within simulation worlds providing meaningful insights on efficient and resilient pathways for the future.
Incorporating responsible foresight into policymaking increasingly will involve the use of algorithms to analyze complex data, predict potential outcomes, and offer insight into a range of possible futures.
These complementary methods—Superforecasting, prediction markets, world simulation, simulation intelligence, scenario-building, participatory futures, futures literacy and hybrid intelligence—form a powerful toolkit for responsible computational foresight.
To realize this potential, AI in responsible foresight must be conceived as an assistive tool—a cognitive exoskeleton that enables policymakers to navigate complexity, envision a range of desirable futures, and critically assess the impacts of various decisions.

---

- Although the selected sentences are meaningful, the summary as a whole is disjointed and full of jumps. For example, the first sentence of the BM25 summary starts as a response to a missing sentence.
- The text keeps referring to _responsible foresight_ without ever defining the term, which the abstract does in its first sentence. There are other unclarified references, such as _superforecasting_ and _participatory futures_.
- Some sentences are ridiculously long and overcomplicated. While we can blame the authors of the paper for that, this method cannot produce anything beyond what the text already offers.
- As a note, the _BM25_ method was applied with a variable number of sentences $n$ equal to the number of sentences in the abstract.
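
The last note can be made concrete with a small sketch of the selection step, assuming the ranking is a list of sentence indices sorted from most to least relevant (as the `[:n]` slice in the loading code suggests). The helper name `select_top_ranked` and the toy inputs are hypothetical.

```python
from typing import List


def select_top_ranked(sentences: List[str], ranking: List[int], n: int) -> str:
    """Join the n highest-ranked sentences, kept in ranking order."""
    return " ".join(sentences[i] for i in ranking[:n])


# Toy inputs (hypothetical): four body sentences, a ranking, a two-sentence abstract.
sentences = ["S0.", "S1.", "S2.", "S3."]
ranking = [2, 0, 3, 1]  # assumed to be sorted from most to least relevant
abstract = ["A0.", "A1."]

summary = select_top_ranked(sentences, ranking, n=len(abstract))
print(summary)  # S2. S0.
```

The sentences come out in ranking order rather than document order, which may contribute to the disjointed feel described above.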

#### Embedding-based w/ Gemini 2.5 Summary

---

This text introduces "responsible computational foresight," a field that leverages human-centric AI to empower policymakers.
The central argument is that AI should augment human intelligence, not replace it, by helping explore a broad spectrum of future possibilities—including unexpected, unintended, and desirable ones—beyond mere precise predictions or historical extrapolations.

This approach integrates various computational tools like forecasting, complex systems simulation, AI-assisted scenario-building, and simulation intelligence, alongside human elements such as participatory futures and futures literacy.
These tools, combined with a deep understanding of interconnected social, economic, and environmental systems and a commitment to ethical decision-making, strengthen our ability to anticipate and prepare for a range of outcomes.

Ultimately, responsible computational foresight aims to enhance ethical decision-making, build resilience against uncertainty, and enable society to navigate complexity with informed, adaptive, and inclusive strategies, thereby shaping sustainable and equitable paths for future generations.

---

- The LLM immediately defines the unique concept of the paper, _responsible foresight_.
- No long or overcomplicated sentences.

by [Manuel Velarde](mailto:manuel@velarde.me)