# RAG Summarization with BM25 and LLMs

### Implementation Description

| Step | Description | Rationale | Comment |
|------|-------------|-----------|---------|
| Process source document | Break paper into sentences. Encode the full document together, and each sentence independently. Tokenize and lemmatize text. |  | Using `verbatim_rag` to read HTML files. |
| Select relevant sentences | Calculate **BM25** similarity score between sentences and full document. Request an LLM model to extract the theme | This was the most reliable method for selecting sentences from the baseline results. |  |
| Shape summary from sentences | Use **LLM** to shape selected sentences into summary. | Plain `rank_bm25` results are disjointed and out of order. |  |
| Append supporting citation | Use `verbatim_rag` to identify supporting material in the original paper. |  |  |

#### Required Modules

In [59]:
from typing import Optional, Tuple, List

import os
import sys
import glob
import re
import json
from dotenv import load_dotenv

import pandas as pd

from IPython.display import display, Markdown

import nltk
from rank_bm25 import BM25Okapi

from verbatim_rag.schema import DocumentSchema
from verbatim_rag.chunker_providers import MarkdownChunkerProvider
from verbatim_rag.embedding_providers import SentenceTransformersProvider
from verbatim_rag.vector_stores import LocalMilvusStore
from verbatim_rag import VerbatimIndex, VerbatimRAG
from verbatim_rag.core import LLMClient

from openai import OpenAI

assert nltk.download('wordnet')
assert nltk.download('stopwords')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\mitre\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\mitre\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [52]:
sys.path.append(os.path.abspath("../"))

In [54]:
from utils.metrics import *

#### Constants

Using `verbatim_rag` to load documents.

In [74]:
ARXIV_URL = "https://arxiv.org/pdf/"
HTML_PATH = os.path.join("..", "data", "raw", "htmls")

documents = glob.glob(f"{HTML_PATH}/*.txt")
documents

DOCUMENT_ID = [
    '2511.21398v1',
    '2511.21444v1',
    '2511.21460v1',
    '2511.21471v1',
    '2511.21522v1',
    '2511.21569v1',
    '2511.21570v1',
    '2511.21591v1',
    '2511.21636v1',
    '2511.21678v1',
]

DOCUMENT_ID = documents[0:10]

#DOCUMENT_ID = ["..\\data\\raw\\htmls\\2510.25320v1.txt"]

#### Helper Functions

Method to extract abstract using regular expressions.

In [75]:
def abstract_from_(paper: str) -> Optional[str]:
    """Get abstract from Markdown text."""
    match = re.search(r'## Abstract\s*(.+?)(?=\n##)', paper, re.DOTALL)
    if match:
        abstract = match.group(1).strip()
        # abstract = re.sub(r"^\s*\.\s*\n*", "", abstract)
        return abstract
    return None

Method to display Markdown strings.

In [76]:
def print_markdown_(text: str) -> None:
    """Print Markdown string."""
    display(Markdown(text))

In [77]:
def parse_and_bucket_sections(full_text):
    # 1. Define the 4 Buckets
    buckets = {
        "intro": [],
        "methods": [],
        "results": [],
        "conclusion": []
    }
    
    # 2. Regex to find Top-Level Headers (e.g., "## I Introduction" or "2. Methodology")
    # Matches lines starting with numbering (I, 1, 1.) or Markdowns (##) followed by text
    # We ignore ### (subsections) effectively treating them as body text
    header_pattern = re.compile(r'^(?:##\s+|[IVX\d]+\.?\s+)(.*)', re.MULTILINE)
    
    # Split text by these headers
    # 'split' returns [text_before_first_header, header1, content1, header2, content2...]
    segments = header_pattern.split(full_text)
    
    # Iterate through pairs of (Header, Content)
    # Skip segments[0] (text before first header)
    for i in range(1, len(segments), 2):
        header_title = segments[i].strip().lower()
        content = segments[i+1].strip()
        
        # 3. Semantic Mapping Logic
        if any(x in header_title for x in ['intro', 'background', 'related', 'motivation']):
            buckets['intro'].append(content)
            
        elif any(x in header_title for x in ['method', 'formulation', 'system', 'approach', 'model', 'data', 'architecture']):
            buckets['methods'].append(content)
            
        elif any(x in header_title for x in ['result', 'experiment', 'evaluation', 'perform', 'metric', 'ablation']):
            buckets['results'].append(content)
            
        elif any(x in header_title for x in ['conclusion', 'discussion', 'future']):
            buckets['conclusion'].append(content)
            
        elif 'reference' in header_title:
            continue # Drop references
            
        else:
            # Fallback: If we can't guess, put it in Methods (safest bet for middle sections)
            # Or append to the previous bucket found
            buckets['methods'].append(content)

    # 4. Join the lists back into single text blocks
    return {
        "intro_text": "\n".join(buckets['intro']),
        "methods_text": "\n".join(buckets['methods']),
        "results_text": "\n".join(buckets['results']),
        "conclusion_text": "\n".join(buckets['conclusion'])
    }

## Process Source Documents

The `verbatim_rag` library captures the abstract correctly in all examples, unlike the processed used for the baseline models.

In [79]:
documents = {}
papers = {}
abstracts = {}
document_sections = {}

for document_id in DOCUMENT_ID:
    # paper = DocumentSchema.from_url(url=ARXIV_URL + document_id)
    #document = DocumentSchema.from_url(url=os.path.join(HTML_PATH, document_id + '.txt'))
    document = DocumentSchema.from_url(url=document_id)
    documents[document_id] = document
    papers[document_id] = document.content
    abstracts[document_id] = abstract_from_(document.content)
    document_sections[document_id] = parse_and_bucket_sections(document.content)

2026-01-12 19:49:23,844 - INFO - detected formats: [<InputFormat.HTML: 'html'>]
2026-01-12 19:49:23,961 - INFO - Going to convert document batch...
2026-01-12 19:49:23,962 - INFO - Initializing pipeline for SimplePipeline with options hash 995a146ad601044538e6a923bea22f4e
2026-01-12 19:49:23,962 - INFO - Processing document 2510.24832v1.txt
2026-01-12 19:49:24,168 - INFO - Finished converting document 2510.24832v1.txt in 0.33 sec.
2026-01-12 19:49:24,271 - INFO - detected formats: [<InputFormat.HTML: 'html'>]
2026-01-12 19:49:24,913 - INFO - Going to convert document batch...
2026-01-12 19:49:24,914 - INFO - Initializing pipeline for SimplePipeline with options hash 995a146ad601044538e6a923bea22f4e
2026-01-12 19:49:24,915 - INFO - Processing document 2510.25005v1.txt
2026-01-12 19:49:25,452 - INFO - Finished converting document 2510.25005v1.txt in 1.18 sec.
2026-01-12 19:49:25,604 - INFO - detected formats: [<InputFormat.HTML: 'html'>]
2026-01-12 19:49:25,789 - INFO - Going to convert 

## Obtain section summaries

In [139]:
def build_section_summary(client: OpenAI, prompt_template: str, sentences: str) -> str:
    response = client.responses.create(
        model=OPENAI_MODEL,
        instructions="Only reply with a summarized paragraph.",
        input=prompt_template.format(sentences=sentences)
    )
    
    return response.output_text

def build_summary(client: OpenAI, summary_wordcount: int, prompt_template: str, intro_text: str, methods_text: str, results_text: str, conclusion_text: str) -> str:
    response = client.responses.create(
        model=OPENAI_MODEL,
        instructions="""
- Start directly with the problem or the proposed solution (do not say 'This paper presents...').
- Connect the Methodology and Results logically using transition words (e.g., 'Consequently,' 'Specifically,' 'We observe that').
- Ensure the tone is objective, impersonal, and authoritative.
- Ensure the final word at least 200 relevant words long.
- Do not use bullet points. Write a single paragraph.""",
        input=prompt_template.format(summary_wordcount=summary_wordcount, 
                                     intro_text=intro_text, 
                                     methods_text=methods_text, 
                                     results_text=results_text, 
                                     conclusion_text=conclusion_text)
    )
    
    return response.output_text

def build_refiner(client: OpenAI, prompt_template: str, v1_summary: str):
    response = client.responses.create(
        model=OPENAI_MODEL,
        instructions="""
    Refinement Rules:
    - Remove generic phrases like 'comprehensive experiments show that' or 'in order to'.
    - Merge short sentences where possible to improve flow.
    - Ensure the final word count is between 150-250 words.
    - Crucial: Ensure the primary metric (the number/result) appears in the final text.""",
        input=prompt_template.format(v1_summary=v1_summary)
    )
    
    return response.output_text

def build_exam_questions(client: OpenAI, prompt_template: str, full_text: str):
    response = client.responses.create(
        model=OPENAI_MODEL,
        input=prompt_template.format(full_text=full_text)
    )
    return response.output_text
    
def build_exam_answers(client: OpenAI, prompt_template: str, summary_text: str, questions: str):
    response = client.responses.create(
        model=OPENAI_MODEL,
        input=prompt_template.format(summary_text=summary_text,
                                     questions=questions
                                    )
    )
    return response.output_text

def build_evaluation(client: OpenAI, prompt_template: str, question: str, ground_truth: str, answer: str):
    response = client.responses.create(
        model=OPENAI_MODEL,
        input=prompt_template.format(question=question,
                                     ground_truth=ground_truth, 
                                     answer=answer)
    )
    return response.output_text
    

In [252]:
INTRO_PROMPT = """
Analyze the provided Introduction. Identify the following two elements clearly:
- The Problem Space: What specific limitation, inefficiency, or challenge in the current state of the art is this paper addressing?
- The Proposed Solution: What is the core system, framework, or hypothesis introduced to solve this?

Introduction:
{sentences}
"""

METHODS_PROMPT = """
Analyze the provided Methods/Methodology. Extract the technical specifics using the following keys:
- Data/Environment: What specific datasets or source materials were used? (Do not invent any if they are not mentioned).
- The Architecture: Briefly describe the model structure, algorithm, or experimental design.
- Settings: Mention any critical hyperparameters, baselines, or comparisons used.

Methods:
{sentences}
"""

RESULTS_PROMPT = """
Analyze the provided Results. Extract the key evidence that supports the contribution:
- Quantitative Metrics: Extract the specific main performance numbers strictly as they appear in the text. If more than 2 different metrics and numbers exist, only mention the two most important metrics.
- Comparison: How did it perform relative to the baseline?
- CONSTRAINT: Only extract numbers explicitly written in the text. If no specific quantitative metrics are provided, describe the qualitative improvement (e.g., "significantly faster," "more robust") without inventing data. The numbers themselves must be an important part of the results.

Results:
{sentences}
"""

CONCLUSION_PROMPT = """
Analyze the provided Discussion/Conclusion. Summarize:
- Interpretation: Why do the results matter?
- Limitations: What is one key weakness acknowledged by the authors?
- Impact: What is the final takeaway for the field?

Conclusion:
{sentences}
"""

JOINER_PROMPT = """
You are an expert technical writer. I will provide you with the key component parts of a research paper. Your goal is to weave them into a single, fluid abstract (approx. {summary_wordcount} words).
Avoid including formulas.

Instructions:
1. Tone: Maintain an objective, scholarly tone suitable for the specific domain of the input (e.g., if the content is Computer Science, use CS terminology; if Biology, use Bio terminology).
2. Accuracy: Do not introduce new information or numbers that are not present in the Input Data.
3. Flow: Ensure smooth logical transitions between the Problem, Method, and Results.

Input Data:
- Context: {intro_text}
- Methodology: {methods_text}
- Key Findings: {results_text}
- Implications: {conclusion_text}
"""

REFINER_PROMPT = """
Review the draft abstract below.
Rewrite it to be denser and more concise, mimicking the style of a high-impact publication in this specific field.

Constraints:
- Retain all specific metrics and proper nouns (e.g., model names, dataset names).
- Do not add sensationalist adjectives (e.g., avoid "groundbreaking" unless supported by data).
- Focus on information density: reduce fluff words to allow more space for technical details.

Draft: {v1_summary}
"""

OPENAI_MODEL = 'o4-mini'

In [253]:
load_dotenv()
key = os.getenv("OPENAI_API_KEY")
assert key

client = OpenAI(api_key=key)

In [254]:
rag_summaries_vs_abstract = {}
scoring_results = []

docs = document_sections.items()
#docs = [list(document_sections.items())[-1]]

for k, v in docs:
    print(f"Summarizing document: {k}")
    doc_content = document_sections[k]

    print(f"Summarizing Introduction...")
    intro_summary = build_summary_(client=client, 
                                   prompt_template=INTRO_PROMPT, 
                                   sentences=doc_content["intro_text"])
    print(f"Summarizing Methods...")
    methods_summary = build_summary_(client=client, 
                                     prompt_template=METHODS_PROMPT, 
                                     sentences=doc_content["methods_text"])
    print(f"Summarizing Results...")
    results_summary = build_summary_(client=client, 
                                     prompt_template=RESULTS_PROMPT, 
                                     sentences=doc_content["results_text"])
    print(f"Summarizing Conclusion...")
    conclusion_summary = build_summary_(client=client, 
                                        prompt_template=CONCLUSION_PROMPT, 
                                        sentences=doc_content["conclusion_text"])

    print(f"Formulating cohesive summary...")
    v1_summary = build_summary(client=client, 
                               summary_wordcount=200,
                               prompt_template=JOINER_PROMPT, 
                               intro_text=intro_summary,
                               methods_text=methods_summary, 
                               results_text=results_summary, 
                               conclusion_text=conclusion_summary
                              )

    print(f"Refining summary...")
    refined_summary = build_refiner(client=client, 
                                    prompt_template=REFINER_PROMPT, 
                                    v1_summary=v1_summary)

    curr_abstract = abstracts[k]
    rag_summaries_vs_abstract[k] = {"abstract": curr_abstract, 
                                    "v1_summary": v1_summary,
                                   "refined_summary": refined_summary}

    
    rouge_scores_subsection_llm_v1 = calculate_rouge_score(curr_abstract, 
                                                           v1_summary)
    rouge_scores_subsection_llm_refined = calculate_rouge_score(curr_abstract, 
                                                                refined_summary)
    
    bert_scores_subsection_llm_v1 = calculate_bert_score(curr_abstract, 
                                                         v1_summary)
    bert_scores_subsection_llm_refined = calculate_bert_score(curr_abstract, 
                                                              refined_summary)

    scoring_results.append(
        {
            "paper_id": k, 
            "method": "V1_SCAFFOLDED_TEMPLATING",
            "rouge1": rouge_scores_subsection_llm_v1["rouge1_fmeasure"], 
            "rougeL": rouge_scores_subsection_llm_v1["rougeL_fmeasure"],
            "bert_score_f1": bert_scores_subsection_llm_v1["bertscore_f1"]
        }
    )
    
    scoring_results.append(
            {
                "paper_id": k, 
                "method": "REFINED_SCAFFOLDED_TEMPLATING",
                "rouge1": rouge_scores_subsection_llm_refined["rouge1_fmeasure"], 
                "rougeL": rouge_scores_subsection_llm_refined["rougeL_fmeasure"],
                "bert_score_f1": bert_scores_subsection_llm_refined["bertscore_f1"]
            }
        )

df_results = pd.DataFrame(scoring_results)
df_results

Summarizing document: ..\data\raw\htmls\2510.24832v1.txt
Summarizing Introduction...


2026-01-12 23:15:23,403 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Summarizing Methods...


2026-01-12 23:15:33,853 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Summarizing Results...


2026-01-12 23:15:40,155 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Summarizing Conclusion...


2026-01-12 23:15:47,214 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Formulating cohesive summary...


2026-01-12 23:16:09,100 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Refining summary...


2026-01-12 23:16:20,799 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"
2026-01-12 23:16:20,802 - INFO - Using default tokenizer.
2026-01-12 23:16:20,820 - INFO - Using default tokenizer.


Summarizing document: ..\data\raw\htmls\2510.25005v1.txt
Summarizing Introduction...


2026-01-12 23:16:28,151 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Summarizing Methods...


2026-01-12 23:16:39,003 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Summarizing Results...


2026-01-12 23:16:43,289 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Summarizing Conclusion...


2026-01-12 23:16:47,780 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Formulating cohesive summary...


2026-01-12 23:17:08,249 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Refining summary...


2026-01-12 23:17:42,342 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"
2026-01-12 23:17:42,344 - INFO - Using default tokenizer.
2026-01-12 23:17:42,351 - INFO - Using default tokenizer.


Summarizing document: ..\data\raw\htmls\2510.25007v1.txt
Summarizing Introduction...


2026-01-12 23:17:48,134 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Summarizing Methods...


2026-01-12 23:17:58,752 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Summarizing Results...


2026-01-12 23:18:04,664 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Summarizing Conclusion...


2026-01-12 23:18:08,564 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Formulating cohesive summary...


2026-01-12 23:18:19,656 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Refining summary...


2026-01-12 23:18:40,667 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"
2026-01-12 23:18:40,669 - INFO - Using default tokenizer.
2026-01-12 23:18:40,691 - INFO - Using default tokenizer.


Summarizing document: ..\data\raw\htmls\2510.25014v1.txt
Summarizing Introduction...


2026-01-12 23:18:47,788 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Summarizing Methods...


2026-01-12 23:18:57,120 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Summarizing Results...


2026-01-12 23:19:04,605 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Summarizing Conclusion...


2026-01-12 23:19:08,196 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Formulating cohesive summary...


2026-01-12 23:19:16,525 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Refining summary...


2026-01-12 23:19:37,493 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"
2026-01-12 23:19:37,495 - INFO - Using default tokenizer.
2026-01-12 23:19:37,519 - INFO - Using default tokenizer.


Summarizing document: ..\data\raw\htmls\2510.25065v1.txt
Summarizing Introduction...


2026-01-12 23:19:43,497 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Summarizing Methods...


2026-01-12 23:19:52,112 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Summarizing Results...


2026-01-12 23:19:58,123 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Summarizing Conclusion...


2026-01-12 23:20:03,348 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Formulating cohesive summary...


2026-01-12 23:20:23,305 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Refining summary...


2026-01-12 23:20:37,566 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"
2026-01-12 23:20:37,569 - INFO - Using default tokenizer.
2026-01-12 23:20:37,584 - INFO - Using default tokenizer.


Summarizing document: ..\data\raw\htmls\2510.25091v1.txt
Summarizing Introduction...


2026-01-12 23:20:45,141 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Summarizing Methods...


2026-01-12 23:21:04,068 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Summarizing Results...


2026-01-12 23:21:13,417 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Summarizing Conclusion...


2026-01-12 23:21:20,882 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Formulating cohesive summary...


2026-01-12 23:21:28,523 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Refining summary...


2026-01-12 23:22:01,867 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"
2026-01-12 23:22:01,870 - INFO - Using default tokenizer.
2026-01-12 23:22:01,889 - INFO - Using default tokenizer.


Summarizing document: ..\data\raw\htmls\2510.25101v1.txt
Summarizing Introduction...


2026-01-12 23:22:08,999 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Summarizing Methods...


2026-01-12 23:22:16,069 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Summarizing Results...


2026-01-12 23:22:30,829 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Summarizing Conclusion...


2026-01-12 23:22:35,404 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Formulating cohesive summary...


2026-01-12 23:22:56,567 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Refining summary...


2026-01-12 23:23:23,199 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"
2026-01-12 23:23:23,201 - INFO - Using default tokenizer.
2026-01-12 23:23:23,222 - INFO - Using default tokenizer.


Summarizing document: ..\data\raw\htmls\2510.25179v1.txt
Summarizing Introduction...


2026-01-12 23:23:31,588 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Summarizing Methods...


2026-01-12 23:23:38,613 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Summarizing Results...


2026-01-12 23:23:46,517 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Summarizing Conclusion...


2026-01-12 23:23:51,353 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Formulating cohesive summary...


2026-01-12 23:24:03,071 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Refining summary...


2026-01-12 23:24:10,207 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"
2026-01-12 23:24:10,210 - INFO - Using default tokenizer.
2026-01-12 23:24:10,226 - INFO - Using default tokenizer.


Summarizing document: ..\data\raw\htmls\2510.25205v1.txt
Summarizing Introduction...


2026-01-12 23:24:21,631 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Summarizing Methods...


2026-01-12 23:24:33,355 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Summarizing Results...


2026-01-12 23:24:39,661 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Summarizing Conclusion...


2026-01-12 23:24:51,793 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Formulating cohesive summary...


2026-01-12 23:25:10,138 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Refining summary...


2026-01-12 23:25:28,682 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"
2026-01-12 23:25:28,685 - INFO - Using default tokenizer.
2026-01-12 23:25:28,721 - INFO - Using default tokenizer.


Summarizing document: ..\data\raw\htmls\2510.25223v1.txt
Summarizing Introduction...


2026-01-12 23:25:36,035 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Summarizing Methods...


2026-01-12 23:25:46,097 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Summarizing Results...


2026-01-12 23:25:54,432 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Summarizing Conclusion...


2026-01-12 23:25:57,918 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Formulating cohesive summary...


2026-01-12 23:26:06,220 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Refining summary...


2026-01-12 23:26:26,389 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"
2026-01-12 23:26:26,392 - INFO - Using default tokenizer.
2026-01-12 23:26:26,414 - INFO - Using default tokenizer.


Unnamed: 0,paper_id,method,rouge1,rougeL,bert_score_f1
0,..\data\raw\htmls\2510.24832v1.txt,V1_SCAFFOLDED_TEMPLATING,0.48,0.224,0.629264
1,..\data\raw\htmls\2510.24832v1.txt,REFINED_SCAFFOLDED_TEMPLATING,0.458128,0.216749,0.624795
2,..\data\raw\htmls\2510.25005v1.txt,V1_SCAFFOLDED_TEMPLATING,0.260274,0.164384,0.583295
3,..\data\raw\htmls\2510.25005v1.txt,REFINED_SCAFFOLDED_TEMPLATING,0.30303,0.199134,0.588976
4,..\data\raw\htmls\2510.25007v1.txt,V1_SCAFFOLDED_TEMPLATING,0.262425,0.147117,0.552201
5,..\data\raw\htmls\2510.25007v1.txt,REFINED_SCAFFOLDED_TEMPLATING,0.274882,0.146919,0.549716
6,..\data\raw\htmls\2510.25014v1.txt,V1_SCAFFOLDED_TEMPLATING,0.416667,0.200617,0.641713
7,..\data\raw\htmls\2510.25014v1.txt,REFINED_SCAFFOLDED_TEMPLATING,0.434004,0.241611,0.649669
8,..\data\raw\htmls\2510.25065v1.txt,V1_SCAFFOLDED_TEMPLATING,0.375,0.185185,0.607236
9,..\data\raw\htmls\2510.25065v1.txt,REFINED_SCAFFOLDED_TEMPLATING,0.429752,0.214876,0.611708


In [255]:
[(len(d["abstract"]), len(d["v1_summary"]), len(d["refined_summary"])) for d in rag_summaries_vs_abstract.values()]

[(1341, 1961, 1293),
 (453, 1659, 1199),
 (1356, 2078, 1439),
 (1380, 3069, 1623),
 (1116, 1934, 1469),
 (2025, 2054, 1496),
 (1601, 2245, 1442),
 (1294, 2149, 1382),
 (2025, 2741, 1672),
 (1809, 2311, 1522)]

In [256]:
with open("./summary_templating.json", "w") as f:
    json.dump(rag_summaries_vs_abstract, f)

In [265]:
EXAMINER_PROMPT = """
You are an expert researcher. Read the following Full Paper text.
Generate 5 fact based questions that must be simple enough to be answered from a theoretical scientific abstract for this paper. Keep the answers limited to a single sentence if possible. 
You must ignore the paper's Abstract section and instead only rely on the rest of the paper. Choose your own questions to be answered that you consider would create a high quality scientific abstract based on the Introduction, Methodology, Results and Conclusions.
For each question, provide the exact Ground Truth Answer from the text.

Constraints:
1. Question 1 must be about the **specific problem** addressed.
2. Question 2 must be about the **methodology** used.
3. Question 3 must be about the **quantitative results** (metrics, numbers).
4. Question 4 must be about the **dataset or environment**.
5. Question 5 must be about the **limitations or future work**.

Format output as python compatible JSON list (in text format) with a question property and an answer property for each of the questions.

Full Paper Text: {full_text}
"""

STUDENT_PROMPT = """
You are a student taking a test.
Read the following Summary and answer the Questions based *strictly* on the information provided in that summary.
If the information is not in the summary, answer "NOT MENTIONED".

Summary: {summary_text}

Questions:
{questions}

Format the output as a python list of answers.
"""

GRADER_PROMPT = """
Compare the Student's Answer to the Ground Truth Answer.
Score 1 if the Student's Answer is correct and sufficiently semantically similar to the Ground Truth (typos and incomplete answers are OK, but complete fabrications are not acceptable).
Score 0 if the Student's Answer is incorrect, vague, or "NOT MENTIONED".

Q: {question}
Ground Truth: {ground_truth}
Student Answer: {answer}

Return ONLY the score (0 or 1).
"""

In [266]:
results = []

for paper_id, paper_content in papers.items():
    print("#######", paper_id)
    # 1. Generate the Exam (using Full Text)
    exam_json = build_exam_questions(client=client, 
                                     prompt_template=EXAMINER_PROMPT, 
                                     #full_text=paper_content.replace())
                                     full_text="\n".join([v for v in document_sections[paper_id].values()])
                                    )
    questions = json.loads(exam_json)
    #print(questions)
    
    # Test summary answers
    summary_answers = build_exam_answers(client=client, 
                                         prompt_template=STUDENT_PROMPT, 
                                         summary_text=rag_summaries_vs_abstract[paper_id]["v1_summary"], 
                                         questions=questions)
    summary_answers = json.loads(summary_answers)

    # Test summary answers
    refined_summary_answers = build_exam_answers(client=client, 
                                         prompt_template=STUDENT_PROMPT, 
                                         summary_text=rag_summaries_vs_abstract[paper_id]["refined_summary"], 
                                         questions=questions)
    refined_summary_answers = json.loads(refined_summary_answers)
    
    # Test abstract answers
    baseline_answers = build_exam_answers(client=client, 
                                          prompt_template=STUDENT_PROMPT, 
                                          summary_text=rag_summaries_vs_abstract[paper_id]["abstract"], 
                                          questions=questions)
    baseline_answers = json.loads(baseline_answers)
    
    # 4. Grading
    summary_score = 0
    refined_summary_score = 0
    abstract_score = 0
    
    for i, q in enumerate(questions):
        print("\n")
        print(f"Question: {q['question']}")
        print(f"Ground truth: {q['answer']}")
        
        summary_grade = build_evaluation(client=client,
                                         prompt_template=GRADER_PROMPT, 
                                        question=q['question'], 
                                        ground_truth=q['answer'], 
                                        answer=summary_answers[i])
        print(f"Summary answer: {summary_answers[i]}")
        if "1" in summary_grade: 
            summary_score += 1

        refined_summary_grade = build_evaluation(client=client,
                                         prompt_template=GRADER_PROMPT, 
                                        question=q['question'], 
                                        ground_truth=q['answer'], 
                                        answer=refined_summary_answers[i])
        print(f"Refined summary answer: {refined_summary_answers[i]}")
        if "1" in refined_summary_grade: 
            refined_summary_score += 1
        
        abstract_grade = build_evaluation(client=client,
                                          prompt_template=GRADER_PROMPT, 
                                        question=q['question'], 
                                        ground_truth=q['answer'], 
                                        answer=baseline_answers[i])
        print(f"Abstract answer: {baseline_answers[i]}")
        if "1" in abstract_grade: 
            abstract_score += 1

    results.append({
        "paper_id": paper_id,
        "summary_recall": summary_score / 5.0,
        "refined_summary_recall": summary_score / 5.0,
        "abstract_recall": abstract_score / 5.0
    })
    print()
    
# Calculate Average Improvement
avg_summary_recall = sum(r['summary_recall'] for r in results) / len(results)
avg_refined_summary_recall = sum(r['refined_summary_recall'] for r in results) / len(results)
avg_abstract_recall = sum(r['abstract_recall'] for r in results) / len(results)

print(f"RAG Summary Recall: {avg_summary_recall:.2%}")
print(f"RAG Refined Summary Recall: {avg_refined_summary_recall:.2%}")
print(f"Original Abstract Recall: {avg_abstract_recall:.2%}")

####### ..\data\raw\htmls\2510.24832v1.txt


2026-01-12 23:35:36,361 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"
2026-01-12 23:36:02,440 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"
2026-01-12 23:36:14,849 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"
2026-01-12 23:36:23,453 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"




Question: What specific problem in RLVR data scheduling for LLM mathematical reasoning does this paper address?
Ground truth: Existing RLVR data scheduling methods estimate query difficulty primarily via final solution accuracy and overlook richer query-level characteristics such as the structural complexity of the reasoning tree.


2026-01-12 23:36:25,158 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Summary answer: Existing RLVR data-scheduling methods rely on final solution accuracy as a proxy for problem difficulty and overlook the structural complexity of a query‚Äôs reasoning tree, leading to misprioritization of examples.


2026-01-12 23:36:27,245 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Refined summary answer: Existing RLVR data scheduling methods estimate query difficulty via final solution accuracy, ignoring reasoning‚Äêtree structure and misprioritizing samples.


2026-01-12 23:36:29,220 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Abstract answer: Existing RLVR data scheduling methods rely on path-based metrics to rank queries and overlook the structural complexity of their reasoning trees.


Question: What methodology do the authors propose to better quantify and schedule queries during RLVR training?
Ground truth: They introduce the Reasoning Score (r-score), a tree-based metric quantifying a query's learning potential under a fixed node-editing budget, and propose Re-Schedule, a data scheduling algorithm that dynamically weights queries from structurally simple to complex based on the r-score.


2026-01-12 23:36:32,616 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Summary answer: They introduce the Reasoning Score (r-score), which measures a query‚Äôs maximum accuracy gain under a fixed node-editing budget, and present the Re-Schedule algorithm to build approximate reasoning trees offline, simulate edits to compute r-scores, and dynamically weight samples in an easy-to-hard curriculum.


2026-01-12 23:36:35,682 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Refined summary answer: They introduce the Reasoning Score (r‚Äêscore), defined as the maximum accuracy gain under a fixed node‚Äêediting budget, and propose Re-Schedule, which constructs approximate k-ary reasoning trees offline, simulates node edits to compute r‚Äêscores, and applies an easy‚Äêto‚Äêhard curriculum with dynamic weights during RLVR fine-tuning.


2026-01-12 23:36:37,863 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Abstract answer: They introduce the Reasoning Score (r-score), a metric measuring query difficulty from its reasoning-tree structure, and propose Re-Schedule, a curriculum algorithm that orders queries from structurally simple to complex based on the r-score.


Question: What quantitative improvement does the Re-Schedule method achieve over baseline scheduling approaches?
Ground truth: Re-Schedule significantly improves average accuracy on complex reasoning tasks, achieving gains of up to 3.2% over accuracy-based scheduling baselines.


2026-01-12 23:36:39,931 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Summary answer: Re-Schedule achieves 47.1% accuracy versus 46.9% under a linear schedule and 48.3% versus 47.4% under a sigmoid schedule for single-node fixes, consistently outperforming baseline scheduling approaches.


2026-01-12 23:36:42,017 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Refined summary answer: Under a linear schedule, Re-Schedule achieves 47.1% versus 46.9% (+0.2%), and under a sigmoid schedule, 48.3% versus 47.4% (+0.9%).


2026-01-12 23:36:44,027 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Abstract answer: Re-Schedule improves average accuracy on six math-reasoning benchmarks by up to 3.2%.


Question: On which dataset and evaluation benchmarks do the authors demonstrate the effectiveness of their approach?
Ground truth: They train on the DAPO-Math-17k dataset of integer math problems and evaluate on six benchmarks: AIME24, AIME25, AMC23, MATH-500, Minerva Math, and OlympiadBench.


2026-01-12 23:36:46,040 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Summary answer: They evaluate on DAPA-Math-17K and five other standard math-reasoning benchmarks using Qwen2.5-Math-7B and Qwen2.5-7B models.


2026-01-12 23:36:48,057 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Refined summary answer: They evaluate on six math-reasoning benchmarks, including the DAPA-Math-17K dataset with Qwen2.5-Math-7B and Qwen2.5-7B.


2026-01-12 23:36:49,620 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Abstract answer: NOT MENTIONED


Question: What limitation related to the reasoning tree approximation do the authors identify?
Ground truth: While larger values for the branching factor k and maximum depth d theoretically provide a more accurate approximation and thus a more effective r-score, they also introduce a significant computational overhead.


2026-01-12 23:36:51,313 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Summary answer: NOT MENTIONED


2026-01-12 23:36:52,708 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Refined summary answer: NOT MENTIONED


2026-01-12 23:36:54,921 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Abstract answer: NOT MENTIONED
####### ..\data\raw\htmls\2510.25005v1.txt


2026-01-12 23:37:12,764 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"
2026-01-12 23:37:30,473 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"
2026-01-12 23:37:46,220 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"
2026-01-12 23:37:50,137 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"




Question: What specific problem does the paper address?
Ground truth: The lack of theoretical foundations for counterfactual inference in cyclic structural causal models under shift-scale interventions, due to violations of unique solvability in the presence of feedback loops.


2026-01-12 23:37:52,269 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Summary answer: Addressing the failure of prevailing counterfactual frameworks in cyclic causal systems by providing theoretical foundations for counterfactual inference in cyclic SCMs under shift-scale interventions.


2026-01-12 23:37:54,093 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Refined summary answer: The lack of theoretical foundations for counterfactual inference in cyclic structural causal models under shift-scale interventions, caused by feedback loops violating unique solvability assumptions in existing (acyclic) frameworks.


2026-01-12 23:37:55,616 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Abstract answer: NOT MENTIONED


Question: What methodology do the authors use to establish unique solvability and well-posedness of counterfactuals in cyclic SCMs?
Ground truth: They assume a global ‚Ñìp-contraction condition on the causal mechanism and apply Banach‚Äôs fixed-point theorem to prove that both the original and shift-scale intervened twin SCMs are uniquely solvable (i.e. simple SCMs).


2026-01-12 23:37:57,867 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Summary answer: They impose a global ‚Ñì·µñ-contraction condition on the causal mechanisms and apply Banach‚Äôs fixed-point theorem to guarantee unique solvability of both the original and intervened twin SCMs.


2026-01-12 23:38:00,108 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Refined summary answer: They impose a global ‚Ñì·µñ-contraction condition (Œ∫<1) on the causal mechanisms and apply Banach‚Äôs fixed-point theorem to prove unique solvability of both the original and shift-scale intervened twin SCMs.


2026-01-12 23:38:01,867 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Abstract answer: NOT MENTIONED


Question: What quantitative bound do the authors derive for the tails of counterfactual functionals under Gaussian noise?
Ground truth: They show that for any 1-Lipschitz functional h, P(h(X,X')‚àíE[h(X,X')]‚â•t) ‚â§ exp(‚àít¬≤/(2(1‚àíŒ∫)‚Åª¬≤œÉ¬≤)) for t>0, so (X,X') is sub-Gaussian with proxy (1‚àíŒ∫)‚Åª¬≤œÉ¬≤.


2026-01-12 23:38:04,546 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Summary answer: They derive sub-Gaussian concentration bounds for counterfactual outcomes under Gaussian noise, showing P(h(X,X')‚àíE[h(X,X')]‚â•t) ‚â§ exp(‚àít¬≤/(2(1‚àíŒ∫)‚Åª¬≤œÉ¬≤)), i.e. a variance proxy of (1‚àíŒ∫)‚Åª¬≤œÉ¬≤.


2026-01-12 23:38:07,480 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Refined summary answer: They show that for any 1-Lipschitz functional h, P(h(X,X')‚àíE[h(X,X')]‚â•t) ‚â§ exp(‚àít¬≤/(2(1‚àíŒ∫)‚Åª¬≤œÉ¬≤)), so (X,X') is sub-Gaussian with variance proxy (1‚àíŒ∫)‚Åª¬≤œÉ¬≤.


2026-01-12 23:38:09,365 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Abstract answer: NOT MENTIONED


Question: What illustrative environment do the authors use to demonstrate their theory?
Ground truth: A two-variable linear cyclic SCM modeling consumption C and income I defined by C=0.50¬∑I+1+E_C, I=0.40¬∑C+0.50+E_I with (E_C,E_I)‚ä§~ùí©(0,0.04¬∑I‚ÇÇ).


2026-01-12 23:38:12,013 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Summary answer: An illustrative two-variable linear cyclic SCM modeling consumption and income under Gaussian noise.


2026-01-12 23:38:14,589 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Refined summary answer: A two-variable linear cyclic SCM of consumption C and income I defined by C = 0.50¬∑I + 1 + E_C and I = 0.40¬∑C + 0.50 + E_I with (E_C,E_I)·µÄ ‚àº N(0, 0.04¬∑I‚ÇÇ).


2026-01-12 23:38:16,095 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Abstract answer: NOT MENTIONED


Question: What limitation or future work do the authors identify regarding their intervention class?
Ground truth: They only cover shift-scale interventions with bounded scale factors (|a_j|‚â§1) and do not yet address interventions with larger multiplicative gains, stochastic policies, or more general functional forms.


2026-01-12 23:38:17,965 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Summary answer: They do not evaluate any empirical datasets or comparative baselines.


2026-01-12 23:38:19,663 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Refined summary answer: NOT MENTIONED


2026-01-12 23:38:21,386 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Abstract answer: NOT MENTIONED
####### ..\data\raw\htmls\2510.25007v1.txt


2026-01-12 23:38:40,459 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"
2026-01-12 23:38:47,046 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"
2026-01-12 23:38:54,857 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"
2026-01-12 23:39:04,436 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"




Question: What specific problem does ProFees address in CPT E/M coding?
Ground truth: It addresses the resource-intensive, inconsistent, and error-prone manual process of assigning CPT E/M codes under complex guidelines and variable coder expertise.


2026-01-12 23:39:05,691 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Summary answer: It addresses the resource-intensive, inconsistent, and error-prone manual process of assigning CPT E/M codes under complex guidelines and variable coder expertise.


2026-01-12 23:39:06,921 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Refined summary answer: It addresses the resource-intensive, inconsistent, and error-prone manual process of assigning CPT E/M codes under complex guidelines and variable coder expertise.


2026-01-12 23:39:08,206 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Abstract answer: NOT MENTIONED


Question: What methodology does ProFees use for CPT E/M coding automation?
Ground truth: ProFees employs a modular LLM-based framework combining dynamic few-shot chain-of-thought prompting, explicit LLM-based criticism (RCI), self-consistency via majority voting, and rule-based decision trees.


2026-01-12 23:39:10,807 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Summary answer: ProFees employs a modular LLM-based framework combining dynamic few-shot chain-of-thought prompting, an LLM-based critic for explicit MDM validation, a self-consistency strategy via majority voting over parallel inferences, and a deterministic rule-based CPT decision tree.


2026-01-12 23:39:13,599 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Refined summary answer: ProFees employs a modular LLM-based framework combining dynamic few-shot chain-of-thought prompting, an LLM-based MDM critic with recursive criticism & improvement, self-consistency via K parallel inferences with majority voting, and a deterministic CPT decision tree.


2026-01-12 23:39:15,436 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Abstract answer: NOT MENTIONED


Question: By what percentage did ProFees improve CPT coding accuracy over the commercial coding system?
Ground truth: ProFees achieved 36.85% higher CPT accuracy than the commercial System A.


2026-01-12 23:39:16,896 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Summary answer: 36.85%


2026-01-12 23:39:18,274 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Refined summary answer: 36.85%


2026-01-12 23:39:19,570 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Abstract answer: more than 36%


Question: What was the size of the test dataset used to evaluate ProFees?
Ground truth: The Test dataset comprised 99 encounters.


2026-01-12 23:39:21,099 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Summary answer: NOT MENTIONED


2026-01-12 23:39:22,469 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Refined summary answer: NOT MENTIONED


2026-01-12 23:39:23,808 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Abstract answer: NOT MENTIONED


Question: What future work do the authors propose for ProFees?
Ground truth: Future work includes extending the model to support multiple codes and CPT modifiers, and generating synthetic datasets for edge-case testing and to enrich the VDB.


2026-01-12 23:39:26,770 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Summary answer: Multi-code support and synthetic edge-case generation to enhance generalizability.


2026-01-12 23:39:29,367 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Refined summary answer: Future work includes extending multi-code support and synthetic edge-case generation to improve generalizability.


2026-01-12 23:39:30,929 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Abstract answer: NOT MENTIONED
####### ..\data\raw\htmls\2510.25014v1.txt


2026-01-12 23:39:48,835 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"
2026-01-12 23:39:55,346 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"
2026-01-12 23:40:03,383 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"
2026-01-12 23:40:16,080 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"




Question: What specific challenge does this paper address in using LLMs for in-game trading?
Ground truth: The paper addresses the core tension between LLMs' creative flexibility and the semi-structured procedures of commercial transactions in in-game trading.


2026-01-12 23:40:17,924 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Summary answer: The paper addresses the core tension between LLMs' creative flexibility and the semi-structured procedures of commercial transactions in in-game trading.


2026-01-12 23:40:20,330 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Refined summary answer: LLMs fail to enforce the semi-structured trading procedures‚Äîbrowse, offer, review, confirm‚Äîleading to skipped steps and unwanted purchases.


2026-01-12 23:40:22,258 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Abstract answer: The core tension between LLMs‚Äô creative flexibility and the procedural demands of in-game trading, namely their failure to follow essential procedural flows in rule-governed trading systems


Question: What methodology does the paper introduce to enforce procedural compliance?
Ground truth: The paper introduces Autoregressive State-Tracking Prompting (ASTP), a prompting methodology that makes state-tracking an explicit, autoregressive process embedded in a structured Prime-Guide-Enforce workflow.


2026-01-12 23:40:23,186 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Summary answer: The paper introduces Autoregressive State-Tracking Prompting (ASTP), a prompting methodology that makes state-tracking an explicit, autoregressive process embedded in a structured Prime-Guide-Enforce workflow.


2026-01-12 23:40:26,049 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Refined summary answer: Autoregressive State-Tracking Prompting (ASTP), a Prime-Guide-Enforce workflow that requires explicit inference and emission of the previous dialogue state before each turn, paired with placeholder-based post-processing for numeric precision.


2026-01-12 23:40:28,031 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Abstract answer: Autoregressive State-Tracking Prompting (ASTP)


Question: By how much did ASTP improve procedural compliance according to the results?
Ground truth: ASTP increased adherence to key safeguards from 78.1% to 99.6%.


2026-01-12 23:40:29,191 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Summary answer: ASTP increased adherence to key safeguards from 78.1% to 99.6%.


2026-01-12 23:40:31,358 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Refined summary answer: Procedural compliance improved from 78.1% to 99.6%.


2026-01-12 23:40:32,943 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Abstract answer: NOT MENTIONED


Question: What environment was used to evaluate the proposed method?
Ground truth: All experiments utilized a virtual player LLM interacting with an LLM-driven NPC over 300 dialogues across two scenarios: Specific Item Purchase and Item Recommendation.


2026-01-12 23:40:34,220 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Summary answer: NOT MENTIONED


2026-01-12 23:40:37,188 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Refined summary answer: It was evaluated on 300 in-game trading dialogues from two scenarios using JSON-formatted world data (52 items) and a 20-item merchant inventory.


2026-01-12 23:40:39,579 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Abstract answer: NOT MENTIONED


Question: What future work do the authors suggest?
Ground truth: Future work should investigate the scalability of ASTP across a larger number of states and more complex transition rules.


2026-01-12 23:40:40,797 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Summary answer: Future work should investigate the scalability of ASTP across a larger number of states and more complex transition rules.


2026-01-12 23:40:41,984 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Refined summary answer: NOT MENTIONED


2026-01-12 23:40:43,069 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Abstract answer: NOT MENTIONED
####### ..\data\raw\htmls\2510.25065v1.txt


2026-01-12 23:40:54,596 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"
2026-01-12 23:41:05,940 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"
2026-01-12 23:41:14,833 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"
2026-01-12 23:41:21,674 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"




Question: What specific problem does PM4GRPO aim to address?
Ground truth: Existing GRPO-inspired methods focus solely on optimizing final answers and neglect the underlying reasoning processes, resulting in suboptimal behaviors such as unnecessary verbosity and accidental correctness.


2026-01-12 23:41:23,412 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Summary answer: Existing GRPO-based RL post-training methods optimize only final answers or surface-level text features, neglecting the chain-of-thought process and encouraging verbosity, speculative leaps, or accidental correctness without genuine understanding.


2026-01-12 23:41:25,373 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Refined summary answer: Existing GRPO-based post-training methods optimize only final answers or surface text and neglect chain-of-thought, encouraging verbosity, speculation, or accidental correctness.


2026-01-12 23:41:27,776 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Abstract answer: PM4GRPO addresses the limitation of outcome-centric reward schemes in GRPO-based post-training, which focus only on final answers/formats and neglect the underlying multi-step reasoning process.


Question: What methodology does PM4GRPO use to integrate reasoning process alignment into GRPO-based post-training?
Ground truth: PM4GRPO uses Process Mining techniques‚Äîinductive miner to build process models from policy reasoning traces and alignment-based conformance checking to measure their alignment with a teacher model‚Äôs reasoning as a conformance reward in GRPO.


2026-01-12 23:41:30,627 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Summary answer: PM4GRPO integrates Process Mining into GRPO by using the Inductive Miner to discover a process model from the student‚Äôs generated reasoning traces and alignment-based conformance checking against the teacher‚Äôs traces to compute an F1-based sequence-level reward, which is then combined with standard format and answer rewards under the GSPO objective.


2026-01-12 23:41:33,039 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Refined summary answer: PM4GRPO extends Group Sequence Policy Optimization by applying Process Mining: it uses the Inductive Miner to infer a process model from policy-generated reasoning sequences and alignment-based conformance checking to compute an F1 conformance reward measuring alignment with a teacher‚Äôs reasoning traces.


2026-01-12 23:41:35,718 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Abstract answer: It uses process mining techniques to compute a scalar conformance reward‚Äîmeasuring how closely the policy model‚Äôs reasoning traces align with those of a pretrained teacher model‚Äîand incorporates this reward into GRPO.


Question: What MATH500 accuracy did the 7B-scale PM4GRPO model achieve?
Ground truth: The 7B-scale PM4GRPO model achieved 91.1% accuracy on the MATH500 benchmark.


2026-01-12 23:41:37,083 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Summary answer: 91.1%


2026-01-12 23:41:38,848 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Refined summary answer: 91.1%


2026-01-12 23:41:40,176 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Abstract answer: NOT MENTIONED


Question: Which dataset was used to train the PM4GRPO models?
Ground truth: The models were trained on the DeepMath-103k dataset of mathematical problems with solutions generated by DeepSeek-R1.


2026-01-12 23:41:41,784 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Summary answer: NOT MENTIONED


2026-01-12 23:41:43,426 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Refined summary answer: NOT MENTIONED


2026-01-12 23:41:44,847 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Abstract answer: NOT MENTIONED


Question: What future research direction do the authors suggest?
Ground truth: They suggest exploring Process Mining techniques for quantitatively evaluating reasoning procedures as processes in reinforcement learning for large reasoning models.


2026-01-12 23:41:48,579 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Summary answer: They suggest leveraging process-aware reward design to develop more robust reinforcement learning strategies for large reasoning models.


2026-01-12 23:41:50,052 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Refined summary answer: NOT MENTIONED


2026-01-12 23:41:52,319 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Abstract answer: NOT MENTIONED
####### ..\data\raw\htmls\2510.25091v1.txt


2026-01-12 23:42:14,551 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"
2026-01-12 23:42:27,720 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"
2026-01-12 23:42:37,284 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"
2026-01-12 23:42:45,238 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"




Question: What specific problem does the paper address?
Ground truth: We formulate the d d -day-ahead stock movement prediction as a binary classification problem, aiming to forecast whether the closing price of each constituent stock within the market index will rise after d d trading days.


2026-01-12 23:42:47,102 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Summary answer: NOT MENTIONED


2026-01-12 23:42:49,700 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Refined summary answer: Stock movement prediction in the presence of low signal-to-noise ratios, non-stationarity, dynamic inter-stock dependencies, and multimodal heterogeneity.


2026-01-12 23:42:51,219 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Abstract answer: NOT MENTIONED


Question: What methodology is proposed to solve this problem?
Ground truth: We propose a novel multi-modal architecture that synergistically integrates multi-context hypergraph modeling, LLM-enhanced semantic reasoning, and style-structure expert specialization.


2026-01-12 23:42:54,236 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Summary answer: A unified multi-modal architecture called H3M-SSMoEs, which integrates a hierarchical multi-context hypergraph, a frozen Llama-3.2-1B LLM with lightweight adapters for deep semantic alignment, and a Style-Structured Mixture-of-Experts module that sparsely activates specialized experts based on market regimes.


2026-01-12 23:42:56,743 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Refined summary answer: H3M-SSMoEs: a hierarchical multi-context hypergraph combined with a frozen Llama-3.2-1B LLM with lightweight adapters and a Style-Structured Mixture-of-Experts module.


2026-01-12 23:42:59,426 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Abstract answer: H3M-SSMoEs: a Hypergraph-based Multimodal architecture integrating a Multi-Context Hypergraph for spatiotemporal and inter-stock relational learning, an LLM-enhanced reasoning module for semantic fusion of quantitative and textual data, and a Style-Structured Mixture of Experts for regime-aware specialization.


Question: What quantitative results demonstrate the performance of the proposed method?
Ground truth: Extensive experiments on the DJIA, NASDAQ 100, and S&P 100 indices demonstrate our method's state-of-the-art performance, achieving the highest risk-adjusted returns with Sharpe ratios of 1.585, 2.100, and 1.351, and Calmar ratios of 3.377, 4.380, and 2.075, respectively, while maintaining the lowest maximum drawdowns (14.81%, 16.17%, and 14.27%).


2026-01-12 23:43:02,197 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Summary answer: On the DJIA, H3M-SSMoEs achieved a 50.00% annual return and a Sharpe ratio of 1.585, representing a 57.7% improvement over the strongest baseline.


2026-01-12 23:43:04,250 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Refined summary answer: Backtesting on DJIA yields a 50.00% annual return and a Sharpe ratio of 1.585, a 57.7% improvement over the strongest baseline.


2026-01-12 23:43:05,947 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Abstract answer: NOT MENTIONED


Question: On which datasets or environments was the proposed method evaluated?
Ground truth: We evaluated our method on three major stock indices: DJIA, NASDAQ 100, and S&P 100, using data from January 1, 2020 to August 31, 2025, with a 7:1:2 split into training, validation, and testing sets.


2026-01-12 23:43:07,481 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Summary answer: The model was evaluated on the Dow Jones Industrial Average (DJIA).


2026-01-12 23:43:09,469 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Refined summary answer: Backtesting on the Dow Jones Industrial Average (DJIA).


2026-01-12 23:43:10,724 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Abstract answer: NOT MENTIONED


Question: What limitation or future work does the paper identify?
Ground truth: A promising research direction lies in developing unified frameworks that jointly incorporate hypergraph-informed structural priors, LLM-based semantic reasoning, and specialized MoE processing, balancing representational richness with efficiency.


2026-01-12 23:43:13,160 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Summary answer: The framework‚Äôs complexity and data diversity requirements may challenge real-time deployment and generalization across markets.


2026-01-12 23:43:15,233 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Refined summary answer: The model‚Äôs complexity and data demands may hinder real-time deployment and cross-market generalization.


2026-01-12 23:43:16,770 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Abstract answer: NOT MENTIONED
####### ..\data\raw\htmls\2510.25101v1.txt


2026-01-12 23:43:37,702 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"
2026-01-12 23:43:55,248 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"
2026-01-12 23:44:05,643 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"
2026-01-12 23:44:17,293 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"




Question: What specific limitation of existing agentic KBQA methods does this paper address?
Ground truth: The paper addresses the reliance on process supervision in existing agentic KBQA methods, which provides weak incentives for autonomous exploration and leads to limited robustness and flexibility.


2026-01-12 23:44:19,493 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Summary answer: Reliance on process-supervised, idealized, gold-logical-form reasoning trajectories that are singular and error-free, resulting in brittleness, poor robustness to noisy tool interactions, and limited flexibility to explore alternative reasoning paths.


2026-01-12 23:44:21,412 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Refined summary answer: The paper addresses the reliance on process-supervised gold logical-form trajectories in existing agentic KBQA methods, which leads to brittleness and limited flexibility.


2026-01-12 23:44:23,868 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Abstract answer: The reliance on process supervision, which offers weak incentives for exploration and fails to strengthen agentic reasoning.


Question: What methodology does KnowCoder-A1 use to enhance agentic reasoning in KBQA?
Ground truth: KnowCoder-A1 adopts a multi-stage curriculum reinforcement learning approach that relies mainly on outcome-only supervision.


2026-01-12 23:44:25,262 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Summary answer: A multi-stage curriculum reinforcement learning framework trained exclusively with outcome-only supervision.


2026-01-12 23:44:27,132 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Refined summary answer: KnowCoder-A1 uses a multi-stage curriculum reinforcement learning framework trained solely with outcome-only supervision.


2026-01-12 23:44:28,779 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Abstract answer: A multi-stage curriculum reinforcement learning approach trained under outcome-only supervision with an easy-to-hard curriculum.


Question: What quantitative performance does KnowCoder-A1 achieve on the GrailQA benchmark?
Ground truth: Using 12√ó less training data, it achieves an F1 score of 80.5% on GrailQA, achieving a 3.3% relative improvement over KBQA-o1, the previous SOTA agentic-based approach.


2026-01-12 23:44:30,885 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Summary answer: An F1 score of 80.5% on the generalization-focused GrailQA dataset, with a 3.3% relative improvement over the prior KBQA-o1 baseline.


2026-01-12 23:44:33,165 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Refined summary answer: It achieves an F1 score of 80.5% on GrailQA, representing a 3.3% relative improvement over the KBQA-o1 baseline.


2026-01-12 23:44:35,622 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Abstract answer: Up to an 11.1% relative improvement on the zero-shot subset of GrailQA while using only one-twelfth of the training data.


Question: On which datasets is KnowCoder-A1 evaluated?
Ground truth: KnowCoder-A1 is evaluated on three widely-used KBQA datasets: WebQSP, CWQ, and GrailQA.


2026-01-12 23:44:37,066 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Summary answer: WebQSP, CWQ, and GrailQA.


2026-01-12 23:44:38,466 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Refined summary answer: KnowCoder-A1 is evaluated on the WebQSP, CWQ and GrailQA datasets.


2026-01-12 23:44:40,068 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Abstract answer: NOT MENTIONED


Question: What future research directions do the authors suggest?
Ground truth: Future work may investigate more advanced reflection mechanisms to mitigate remaining error types and extend the curriculum strategy to other complex, agent-based reasoning tasks.


2026-01-12 23:44:42,390 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Summary answer: Investigating advanced reflection mechanisms to address residual error types and extending the curriculum reinforcement learning paradigm to other complex interactive AI tasks.


2026-01-12 23:44:43,842 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Refined summary answer: NOT MENTIONED


2026-01-12 23:44:45,291 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Abstract answer: NOT MENTIONED
####### ..\data\raw\htmls\2510.25179v1.txt


2026-01-12 23:44:57,073 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"
2026-01-12 23:45:07,737 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"
2026-01-12 23:45:14,360 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"
2026-01-12 23:45:21,218 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"




Question: What specific problem does this paper address?
Ground truth: Defending large vision-language models against cross-modal adversarial attacks that exploit visual vulnerabilities and modality shifts in semantic meaning.


2026-01-12 23:45:23,300 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Summary answer: Defending LVLMs against sophisticated cross-modal adversarial attacks, including pixel-level perturbations, hidden intent in benign text‚Äìimage combinations, and ensemble strategies that evade existing defenses.


2026-01-12 23:45:25,927 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Refined summary answer: Vulnerability of LVLMs to pixel-level adversarial perturbations, hidden intent in benign text‚Äìimage pairs, and ensemble attacks that bypass rule-based defenses or require costly computation.


2026-01-12 23:45:28,978 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Abstract answer: Defending multimodal systems against jailbreak attacks.


Question: What methodology do the authors propose to improve LVLM safety?
Ground truth: A model-agnostic Agentic Moderation Framework that coordinates specialized Shield, Responder, Evaluator, and Reflector agents in an iterative, collaborative workflow.


2026-01-12 23:45:31,329 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Summary answer: A model-agnostic Agentic Moderation framework that coordinates specialized SHIELD, Responder, Evaluator, and Reflection agents in an iterative, collaborative workflow without retraining the underlying LVLM.


2026-01-12 23:45:34,023 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Refined summary answer: A model-agnostic Agentic Moderation framework that uses a Coordinator to orchestrate four specialist agents‚ÄîSHIELD, Responder, Evaluator, and Reflector‚Äîin an iterative collaborative workflow without retraining the underlying LVLM.


2026-01-12 23:45:36,749 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Abstract answer: A model-agnostic framework leveraging dynamic, cooperative agents (Shield, Responder, Evaluator, Reflector) for context-aware, interpretable moderation.


Question: What quantitative improvements does Agentic Moderation achieve?
Ground truth: It reduces Attack Success Rate by 7‚Äì19% while keeping Non-Following Rate stable and improving Refusal Rate by 4‚Äì20%.


2026-01-12 23:45:38,717 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Summary answer: A 17% increase in refusal rate on LLaMA with only 0.015 seconds of preprocessing overhead per query.


2026-01-12 23:45:40,823 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Refined summary answer: It boosts refusal rates by 17% over static rule-based and classifier baselines with only a 0.015-second preprocessing overhead per query.


2026-01-12 23:45:42,661 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Abstract answer: It reduces Attack Success Rate by 7‚Äì19%, maintains a stable Non-Following Rate, and improves Refusal Rate by 4‚Äì20%.


Question: On which datasets and with what sampling strategy is the framework evaluated?
Ground truth: On five cross-modality adversarial safety datasets‚ÄîAdvBench, FigStep, Flow-JD, MMSafety, and SIUO‚Äîusing a random sample of 100 instances from each.


2026-01-12 23:45:44,494 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Summary answer: NOT MENTIONED


2026-01-12 23:45:46,087 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Refined summary answer: NOT MENTIONED


2026-01-12 23:45:47,539 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Abstract answer: NOT MENTIONED


Question: What are the main limitations of the framework and proposed future work?
Ground truth: The multi-agent design trades off safety robustness with increased computational cost and latency, and future work will explore adaptive agent scheduling and cost-aware coordination strategies.


2026-01-12 23:45:49,316 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Summary answer: The absence of adaptive coordination among agents, which the authors identify as a direction for future development.


2026-01-12 23:45:51,397 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Refined summary answer: The lack of adaptive coordination among agents is the primary limitation, with future work aiming to develop adaptive coordination strategies.


2026-01-12 23:45:53,214 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Abstract answer: NOT MENTIONED
####### ..\data\raw\htmls\2510.25205v1.txt


2026-01-12 23:46:28,731 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"
2026-01-12 23:46:39,728 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"
2026-01-12 23:46:52,619 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"
2026-01-12 23:47:02,744 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"




Question: What specific objective is addressed by this paper?
Ground truth: Our objective is to adaptively reduce the energy consumption of the autonomous vehicle under a specific perception accuracy, while maintaining good driving performance.


2026-01-12 23:47:04,609 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Summary answer: Our objective is to adaptively reduce the energy consumption of the autonomous vehicle under a target perception accuracy while maintaining strong driving performance.


2026-01-12 23:47:08,907 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Refined summary answer: Our objective is to adaptively reduce the energy consumption of the autonomous‚Äêdriving perception system under a specific perception accuracy, while maintaining good driving performance.


2026-01-12 23:47:10,998 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Abstract answer: To adaptively reduce energy consumption of autonomous vehicle perception while maintaining desired perception accuracy and good driving performance.


Question: What methodology is proposed in this study?
Ground truth: To tackle these challenges, we propose an energy-efficient autonomous driving framework, called EneAD, which includes an adaptive perception module and a robust decision module.


2026-01-12 23:47:13,143 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Summary answer: We propose an energy-efficient autonomous driving framework called EneAD, which comprises an adaptive perception module and a regularized reinforcement-learning decision module.


2026-01-12 23:47:15,806 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Refined summary answer: We propose EneAD, an energy‚Äêefficient framework that adaptively configures perception (model choice, framerate, interpolation) via a lightweight uncertainty-aware classifier and multi-objective Bayesian optimization, and solves driving as a regularized MDP with P-DQN‚Äêstyle actor and value networks.


2026-01-12 23:47:17,246 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Abstract answer: An energy-efficient autonomous driving framework called EneAD, which includes an adaptive perception module and a robust decision module.


Question: What quantitative improvements does EneAD achieve in perception consumption and driving range?
Ground truth: To sum up, our framework EneAD can achieve a 1.9√ó-3.5√ó reduction of perception consumption, a slight reduction of the driving system, and a 3.9%-8.5% increase of driving range.


2026-01-12 23:47:20,935 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Summary answer: EneAD achieves a 1.9√ó‚Äì3.5√ó reduction in perception energy consumption and a 3.9%‚Äì8.5% increase in driving range.


2026-01-12 23:47:22,677 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Refined summary answer: EneAD achieves a 1.9√ó‚Äì3.5√ó reduction in perception energy and a 3.9%‚Äì8.5% increase in driving range.


2026-01-12 23:47:24,957 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Abstract answer: 1.9√ó‚Äì3.5√ó reduction in perception energy consumption, and 3.9%‚Äì8.5% increase in driving range.


Question: On what simulation environment is the framework evaluated?
Ground truth: We simulate the entire autonomous driving pipeline on the Carla simulator, which is a widely used project focused on creating a publicly available virtual environment for autonomous driving.


2026-01-12 23:47:26,360 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Summary answer: NOT MENTIONED


2026-01-12 23:47:27,559 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Refined summary answer: NOT MENTIONED


2026-01-12 23:47:29,223 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Abstract answer: NOT MENTIONED


Question: What limitation or future work is identified regarding the highest-difficulty scenarios?
Ground truth: To address it, more research breakthroughs in autonomous driving perception models are needed in the future.


2026-01-12 23:47:31,206 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Summary answer: NOT MENTIONED


2026-01-12 23:47:32,823 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Refined summary answer: NOT MENTIONED


2026-01-12 23:47:34,744 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Abstract answer: NOT MENTIONED
####### ..\data\raw\htmls\2510.25223v1.txt


2026-01-12 23:47:50,683 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"
2026-01-12 23:48:02,055 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"
2026-01-12 23:48:12,333 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"
2026-01-12 23:48:18,687 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"




Question: What specific problem does the FELA system aim to solve?
Ground truth: Automated feature engineering on industrial-scale event log data.


2026-01-12 23:48:19,993 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Summary answer: Automated feature engineering on industrial-scale event log data.


2026-01-12 23:48:21,455 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Refined summary answer: Automating feature engineering for massive, heterogeneous industrial event logs.


2026-01-12 23:48:22,612 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Abstract answer: Automated feature engineering on industrial-scale event log data.


Question: How does FELA manage the complexity of automated feature engineering according to its architecture?
Ground truth: FELA employs multiple LLM-based agents with specialized roles that collaborate to manage the complexity of automated feature engineering.


2026-01-12 23:48:24,686 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Summary answer: FELA orchestrates specialized LLM agents (idea agents, code agents, critic agents, evaluation agent) via a hierarchical idea‚Äìfeature knowledge structure and an agentic evolution algorithm to manage complexity.


2026-01-12 23:48:26,554 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Refined summary answer: By using a multi-agent framework of specialized LLM agents (idea agents, code agents, critic agents) orchestrated via a hierarchical idea‚Äìfeature knowledge structure and agentic evolution algorithm.


2026-01-12 23:48:27,819 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Abstract answer: FELA employs multiple LLM-based agents with specialized roles that collaborate to manage the complexity of automated feature engineering.


Question: What AUC improvement did FELA achieve on the Taobao dataset compared to LLM-FE?
Ground truth: A notable AUC improvement from 0.641 to 0.653 over LLM-FE.


2026-01-12 23:48:29,646 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Summary answer: AUC improvement from 0.641 to 0.653 on the Taobao conversion prediction task.


2026-01-12 23:48:31,182 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Refined summary answer: An AUC increase from 0.641 to 0.653.


2026-01-12 23:48:33,668 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Abstract answer: NOT MENTIONED


Question: Which datasets were used to evaluate the FELA system?
Ground truth: Three real-world datasets are adopted for evaluation, including Diabetes Health Indicator Dataset (Dia), Tabao Conversion Prediction Data (Taobao), and the User Churn Data in Tencent Game Platform (Tencent).


2026-01-12 23:48:35,458 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Summary answer: ['Taobao conversion prediction', 'Tencent user churn prediction']


2026-01-12 23:48:37,289 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Refined summary answer: Taobao conversion prediction dataset and Tencent user churn prediction dataset.


2026-01-12 23:48:38,862 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Abstract answer: NOT MENTIONED


Question: What future extensions of FELA are suggested by the authors?
Ground truth: Future work will extend FELA toward broader applications, including multimodal data, dynamic environments, and tighter human-in-the-loop collaboration to further enhance controllability, scalability, and domain alignment.


2026-01-12 23:48:41,639 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Summary answer: Enhancing controllability, scalability, and cross-domain applicability.


2026-01-12 23:48:43,124 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Refined summary answer: NOT MENTIONED


2026-01-12 23:48:44,527 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Abstract answer: NOT MENTIONED
RAG Summary Recall: 60.00%
RAG Refined Summary Recall: 60.00%
Original Abstract Recall: 36.00%
