# RAG Summarization with BM25 and LLMs

### Implementation Description

| Step | Description | Rationale | Comment |
|------|-------------|-----------|---------|
| Process source document | Break paper into sentences. Encode the full document together, and each sentence independently. Tokenize and lemmatize text. |  | Using `verbatim_rag` to read HTML files. |
| Select relevant sentences | Calculate the **BM25** similarity score between each sentence and the full document. Request an LLM to extract the theme. | This was the most reliable method for selecting sentences from the baseline results. |  |
| Shape summary from sentences | Use **LLM** to shape selected sentences into summary. | Plain `rank_bm25` results are disjointed and out of order. |  |
| Append supporting citation | Use `verbatim_rag` to identify supporting material in the original paper. |  |  |

#### Required Modules

In [41]:
from typing import Optional, Tuple, List

import os
import sys
import glob
import re
import json
from dotenv import load_dotenv

import pandas as pd

from IPython.display import display, Markdown

import nltk
from rank_bm25 import BM25Okapi

from verbatim_rag.schema import DocumentSchema
from verbatim_rag.chunker_providers import MarkdownChunkerProvider
from verbatim_rag.embedding_providers import SentenceTransformersProvider
from verbatim_rag.vector_stores import LocalMilvusStore
from verbatim_rag import VerbatimIndex, VerbatimRAG
from verbatim_rag.core import LLMClient

from openai import OpenAI

assert nltk.download('wordnet')
assert nltk.download('stopwords')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\mitre\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\mitre\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [42]:
sys.path.append(os.path.abspath("../"))

In [43]:
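# Import only the scoring helpers used in the Metrics section.
from utils.metrics import calculate_rouge_score, calculate_bert_score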

#### Constants

Using `verbatim_rag` to load documents.

In [44]:
ARXIV_URL = "https://arxiv.org/pdf/"
HTML_PATH = os.path.join("..", "data", "raw", "htmls")

documents = glob.glob(f"{HTML_PATH}/*.txt")
documents

DOCUMENT_ID = [
    '2511.21398v1',
    '2511.21444v1',
    '2511.21460v1',
    '2511.21471v1',
    '2511.21522v1',
    '2511.21569v1',
    '2511.21570v1',
    '2511.21591v1',
    '2511.21636v1',
    '2511.21678v1',
]

# Overrides the hard-coded list above: use the first 10 local files found by glob.
DOCUMENT_ID = documents[0:10]

#DOCUMENT_ID = ["..\\data\\raw\\htmls\\2510.25320v1.txt"]

#### Helper Functions

Method to extract abstract using regular expressions.

In [45]:
def abstract_from_(paper: str) -> Optional[str]:
    """Get abstract from Markdown text."""
    match = re.search(r'## Abstract\s*(.+?)(?=\n##)', paper, re.DOTALL)
    if match:
        abstract = match.group(1).strip()
        # abstract = re.sub(r"^\s*\.\s*\n*", "", abstract)
        return abstract
    return None
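
As a quick check of the pattern, the sketch below runs the same regular expression on a made-up Markdown snippet (the sample text is illustrative, not taken from the dataset). It also shows one limitation: the lookahead `(?=\n##)` needs another `##` heading after the abstract, so the function returns `None` when the abstract is the last section.

```python
import re
from typing import Optional


def abstract_from_(paper: str) -> Optional[str]:
    """Same pattern as above: text between '## Abstract' and the next '##' heading."""
    match = re.search(r'## Abstract\s*(.+?)(?=\n##)', paper, re.DOTALL)
    return match.group(1).strip() if match else None


# Hypothetical sample input, for illustration only.
sample = "# Title\n\n## Abstract\n\nWe study X.\nWe show Y.\n\n## 1 Introduction\n\nText."
print(abstract_from_(sample))

# No heading follows the abstract, so the lookahead cannot match.
print(abstract_from_("## Abstract\n\nOnly section."))
```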

Method to display Markdown strings.

In [46]:
def print_markdown_(text: str) -> None:
    """Print Markdown string."""
    display(Markdown(text))

## Process Source Documents

The `verbatim_rag` library captures the abstract correctly in all examples, unlike the process used for the baseline models.

In [47]:
documents = []
papers = []
abstracts = []

for document in DOCUMENT_ID:
    # paper = DocumentSchema.from_url(url=ARXIV_URL + document)
    #document = DocumentSchema.from_url(url=os.path.join(HTML_PATH, document + '.txt'))
    document = DocumentSchema.from_url(url=document)
    documents.append(document)
    papers.append(document.content)
    abstracts.append(abstract_from_(papers[-1]))

2026-01-12 20:07:35,448 - INFO - detected formats: [<InputFormat.HTML: 'html'>]
2026-01-12 20:07:35,568 - INFO - Going to convert document batch...
2026-01-12 20:07:35,570 - INFO - Initializing pipeline for SimplePipeline with options hash 995a146ad601044538e6a923bea22f4e
2026-01-12 20:07:35,571 - INFO - Processing document 2510.24832v1.txt
2026-01-12 20:07:35,773 - INFO - Finished converting document 2510.24832v1.txt in 0.33 sec.
2026-01-12 20:07:35,884 - INFO - detected formats: [<InputFormat.HTML: 'html'>]
2026-01-12 20:07:36,204 - INFO - Going to convert document batch...
2026-01-12 20:07:36,205 - INFO - Initializing pipeline for SimplePipeline with options hash 995a146ad601044538e6a923bea22f4e
2026-01-12 20:07:36,205 - INFO - Processing document 2510.25005v1.txt
2026-01-12 20:07:36,800 - INFO - Finished converting document 2510.25005v1.txt in 0.92 sec.
2026-01-12 20:07:36,966 - INFO - detected formats: [<InputFormat.HTML: 'html'>]
2026-01-12 20:07:37,180 - INFO - Going to convert 

#### Tokenize & Lemmatize

Method to pre-process papers for BM25.

In [48]:
def pre_process_(paper: str) -> Tuple[List[List[str]], List[str]]:
    """Process paper into plain and lemmatized sentences."""
    stop_words = set(nltk.corpus.stopwords.words("english"))
    lemmatizer = nltk.stem.WordNetLemmatizer()

    sentence_split = nltk.sent_tokenize(paper)
    word_split = [
        nltk.word_tokenize(sentence) for
        sentence in sentence_split]

    plain = []
    lemmatized = []
    for i, sentence in enumerate(word_split):
        lemmatized.append([])
        plain.append(sentence_split[i].replace('\n', ''))

        for word in sentence:
            token = word.lower()
            if token.isalpha() and token not in stop_words:
                lemmatized[-1].append(lemmatizer.lemmatize(token))

        # Discard sentences where
        # lemmatization returns nothing
        if not lemmatized[-1]:
            lemmatized.pop()
            plain.pop()

    return lemmatized, plain
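
The key property of `pre_process_` is that the two returned lists stay aligned: a sentence dropped from the token list is also dropped from the plain list, so a BM25 index can be used to look up the original sentence. A minimal sketch of that idea follows. It is a simplified stand-in and not the notebook's pipeline: it assumes a naive regex sentence split, a tiny hand-written stop-word set, and no lemmatization, so it runs without NLTK data.

```python
import re
from typing import List, Tuple

# Tiny stand-in stop-word list; the notebook uses nltk.corpus.stopwords instead.
STOP_WORDS = {"the", "a", "an", "is", "of", "in", "and", "to"}


def pre_process_sketch(text: str) -> Tuple[List[List[str]], List[str]]:
    """Return (token lists, plain sentences) with matching indexes."""
    # Naive sentence split after ., ! or ? followed by whitespace (assumption).
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    tokens_out, plain_out = [], []
    for sentence in sentences:
        words = re.findall(r"\w+", sentence.lower())
        # Keep purely alphabetic, non-stop-word tokens, as in pre_process_.
        tokens = [w for w in words if w.isalpha() and w not in STOP_WORDS]
        if tokens:  # sentences with no usable tokens are skipped in both lists
            tokens_out.append(tokens)
            plain_out.append(sentence.replace("\n", ""))
    return tokens_out, plain_out


tokens, plain = pre_process_sketch("BM25 ranks sentences. (6) 42. The end is near.")
print(tokens)
print(plain)
```

The middle "sentence" contains only numbers, so it disappears from both lists and index 1 refers to the same sentence in `tokens` and `plain`.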

Pre-process each paper.

In [49]:
lemmatized_papers = []
plain_sentences = []

for paper in papers:
    lemmatized, plain = pre_process_(paper)
    lemmatized_papers.append(lemmatized)
    plain_sentences.append(plain)

## Select Relevant Sentences

Method to rank sentences using BM25.

In [50]:
def bm25_rank_(lemmatized: List[List[str]]) -> List[int]:
    """Order index of sentences by BM25 similarity to whole document."""
    sentences = BM25Okapi(lemmatized)
    # Query with the whole document: the tokens of every sentence, flattened.
    scores = sentences.get_scores(sum(lemmatized, []))

    indexes = sorted(
        range(len(scores)),
        key=lambda i: scores[i],
        reverse=True)
    return indexes
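
To make the ranking step more concrete, here is a pure-Python sketch of Okapi BM25 scoring where each sentence is treated as a "document" and the query is the flattened paper. It is a textbook form of the formula with a non-negative IDF, an assumption made for brevity; `rank_bm25` may differ in details such as how it handles negative IDF values. The function name `bm25_scores` and the toy sentences are hypothetical.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(corpus: List[List[str]], query: List[str],
                k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score every tokenized sentence in `corpus` against `query`."""
    n_docs = len(corpus)
    avg_len = sum(len(doc) for doc in corpus) / n_docs
    term_freqs = [Counter(doc) for doc in corpus]
    # Number of sentences that contain each term.
    doc_freq = Counter(term for doc in corpus for term in set(doc))

    scores = [0.0] * n_docs
    for term in query:  # repeated query terms contribute repeatedly
        if term not in doc_freq:
            continue
        # Non-negative IDF variant (assumption, see the note above).
        idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
        for i, tf in enumerate(term_freqs):
            f = tf[term]
            norm = f + k1 * (1 - b + b * len(corpus[i]) / avg_len)
            scores[i] += idf * f * (k1 + 1) / norm
    return scores


sentences = [["bm25", "rank", "sentence"], ["cat", "sit"], ["rank", "sentence", "summary"]]
query = [token for sentence in sentences for token in sentence]  # whole document
scores = bm25_scores(sentences, query)
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(ranking)
```

Because the query repeats every token as often as it occurs in the paper, sentences that share vocabulary with many other sentences are pushed up the ranking, while the off-topic sentence (index 1) lands last.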

Calculate sentence ranking.

In [51]:
bm25_rankings = []

for sentences in lemmatized_papers:
    bm25_rankings.append(bm25_rank_(sentences))

In [76]:
# for s, r in zip(plain_sentences, bm25_rankings):
#     print(s, r)
#s, r

get_plain_(sentences=s, indexes=r, n=15)

['Given a feature implementation produced by the idea agent, along with high-quality examples from previous attempts and the data schema as concrete guidance, the code agent generates highly executable and arbitrarily sophisticated code θ t \\theta\\_{t} to transform raw features:|    | θ t = 𝒜 code  ( { { d i , j } j = 0 M i } i = 0 k , ℋ , θ j &lt; t ) , \\theta\\_{t}=\\mathcal{A}\\_{\\text{code}}(\\{\\{d\\_{i,j}\\}\\_{j=0}^{M\\_{i}}\\}\\_{i=0}^{k},\\mathcal{H},\\theta\\_{j&lt;t}),   |    | (6)   ||----|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----|-------|where { { d i , j } j = 0 M i } i = 0 k \\{\\{d\\_{i,j}\\}\\_{j=0}^{M\\_{i}}\\}\\_{i=0}^{k} denotes selected feature implementations from k k ideas, and t t denotes the iteration step in the FELA system.',
 'In this paper, we propose FELA (Feature Engineering LLM Agents), a mult

Method to gather plain sentences selected by BM25.

In [52]:
def get_plain_(sentences: List[str], indexes: List[int], n: int=15) -> List[str]:
    """Get top n sentences according to indexes."""
    result = []
    for i in indexes[:n]:
        result.append(sentences[i])
    return result

# print(get_plain_(plain_sentences[2], bm25_rankings[2]))
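
`get_plain_` returns sentences in rank order, which is one reason the raw BM25 selection reads as out of order. A possible variant, shown here only as a sketch, keeps the same top-n selection but restores the original document order before the sentences are handed to the LLM. The name `get_plain_in_order` and the toy data are hypothetical.

```python
from typing import List


def get_plain_in_order(sentences: List[str], indexes: List[int], n: int = 15) -> List[str]:
    """Take the top-n ranked indexes, then return sentences in document order."""
    return [sentences[i] for i in sorted(indexes[:n])]


sentences = ["Intro.", "Method.", "Results.", "Conclusion."]
ranking = [2, 0, 3, 1]  # hypothetical BM25 ranking, best first
print(get_plain_in_order(sentences, ranking, n=3))
# prints ['Intro.', 'Results.', 'Conclusion.']
```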

## Shape Summary from Sentences

Query used to format selected sentences into a summary.

In [53]:
QUERY_PROMPT = """
These are 15 sentences selected from a scientific paper:

{sentences}

Please format a concise summary of this paper by rewriting these sentences.
The summary can re-order the sentences.
The summary can discard least relevant sentences.
The summary has to be 5 to 8 sentences long.
Write from the perspective of the reader using phrases like:
  - "the paper claims to ..."
  - "the authors state that ...".
  - "the article asserts ...".

Summary:
"""

OPENAI_MODEL = 'o4-mini'

Method to query the LLM to shape the summary.

**Note:** Requires an OpenAI API key in the `OPENAI_API_KEY` environment variable (loaded from a `.env` file if present).

In [54]:
load_dotenv()
key = os.getenv("OPENAI_API_KEY")
assert key


def build_summary_(sentences: List[str]) -> str:
    """Ask the LLM to rewrite the selected sentences into a summary."""
    client = OpenAI(api_key=key)
    response = client.responses.create(
        model=OPENAI_MODEL,
        instructions="Only reply with the rewritten paragraph.",
        input=QUERY_PROMPT.format(sentences=sentences)
    )
    return response.output_text
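
Note that `QUERY_PROMPT.format(sentences=sentences)` receives a Python list, so the prompt contains the list's `repr` (brackets, quotes, escaped backslashes). One possible alternative, sketched below under the assumption that a numbered list is easier for the model to read, renders the sentences explicitly. `PROMPT_SKETCH` is a shortened stand-in for `QUERY_PROMPT` and `format_prompt` is a hypothetical helper.

```python
from typing import List

# Shortened stand-in for QUERY_PROMPT, for illustration only.
PROMPT_SKETCH = (
    "These are {count} sentences selected from a scientific paper:\n\n"
    "{sentences}\n\n"
    "Summary:\n"
)


def format_prompt(sentences: List[str]) -> str:
    """Render sentences as a numbered list instead of a Python list repr."""
    numbered = "\n".join(f"{i}. {s}" for i, s in enumerate(sentences, start=1))
    return PROMPT_SKETCH.format(count=len(sentences), sentences=numbered)


prompt = format_prompt(["First sentence.", "Second sentence."])
print(prompt)
```

Passing the count as a placeholder also keeps the prompt consistent if `n` in `get_plain_` is changed from 15.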

Generate all summaries.

In [55]:
summaries = []

for s, r in zip(plain_sentences, bm25_rankings):
    summaries.append(build_summary_(get_plain_(s, r)))

2026-01-12 20:08:01,839 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"
2026-01-12 20:08:14,988 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"
2026-01-12 20:08:24,933 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"
2026-01-12 20:08:36,449 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"
2026-01-12 20:08:48,398 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"
2026-01-12 20:08:56,744 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"
2026-01-12 20:09:07,523 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"
2026-01-12 20:09:19,224 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"
2026-01-12 20:09:28,227 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"
2026-01-12 20:09:43,667 - INFO - HTTP Request:

Save temporary data.

In [14]:
# with open('output.json', 'w') as f:
#     json.dump(summaries, f, indent=2)

## Metrics

In [58]:
len(summaries)
[len(s) for s in summaries]

[1250, 1049, 1187, 1168, 903, 1330, 978, 1179, 1369, 1483]

In [79]:
scoring_results = []

for k, curr_abstract, summary in zip(DOCUMENT_ID, abstracts, summaries):
    rouge_scores_llm = calculate_rouge_score(curr_abstract,
                                             summary)
    bert_scores_llm = calculate_bert_score(curr_abstract, 
                                           summary)
    
    scoring_results.append(
        {
            "paper_id": k, 
            "method": "BM25",
            "rouge1": rouge_scores_llm["rouge1_fmeasure"], 
            "rougeL": rouge_scores_llm["rougeL_fmeasure"],
            "bert_score_f1": bert_scores_llm["bertscore_f1"]
        }
    )

df_results = pd.DataFrame(scoring_results)
df_results

2026-01-12 21:34:03,230 - INFO - Using default tokenizer.
2026-01-12 21:34:05,241 - INFO - Using default tokenizer.
2026-01-12 21:34:06,313 - INFO - Using default tokenizer.
2026-01-12 21:34:07,493 - INFO - Using default tokenizer.
2026-01-12 21:34:08,631 - INFO - Using default tokenizer.
2026-01-12 21:34:09,716 - INFO - Using default tokenizer.
2026-01-12 21:34:11,208 - INFO - Using default tokenizer.
2026-01-12 21:34:12,491 - INFO - Using default tokenizer.
2026-01-12 21:34:13,553 - INFO - Using default tokenizer.
2026-01-12 21:34:14,744 - INFO - Using default tokenizer.


|   | paper_id | method | rouge1 | rougeL | bert_score_f1 |
|---|----------|--------|--------|--------|---------------|
| 0 | `..\data\raw\htmls\2510.24832v1.txt` | BM25 | 0.486911 | 0.225131 | 0.647632 |
| 1 | `..\data\raw\htmls\2510.25005v1.txt` | BM25 | 0.194915 | 0.127119 | 0.497903 |
| 2 | `..\data\raw\htmls\2510.25007v1.txt` | BM25 | 0.339152 | 0.154613 | 0.5758 |
| 3 | `..\data\raw\htmls\2510.25014v1.txt` | BM25 | 0.391421 | 0.16622 | 0.573384 |
| 4 | `..\data\raw\htmls\2510.25065v1.txt` | BM25 | 0.387324 | 0.176056 | 0.596 |
| 5 | `..\data\raw\htmls\2510.25091v1.txt` | BM25 | 0.328228 | 0.148796 | 0.553313 |
| 6 | `..\data\raw\htmls\2510.25101v1.txt` | BM25 | 0.36413 | 0.163043 | 0.606928 |
| 7 | `..\data\raw\htmls\2510.25179v1.txt` | BM25 | 0.492582 | 0.272997 | 0.693892 |
| 8 | `..\data\raw\htmls\2510.25205v1.txt` | BM25 | 0.384937 | 0.1841 | 0.603851 |
| 9 | `..\data\raw\htmls\2510.25223v1.txt` | BM25 | 0.512249 | 0.240535 | 0.667206 |
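
For readers unfamiliar with the `rouge1` column, the sketch below computes a ROUGE-1 F-measure from clipped unigram overlap. It only illustrates the standard definition with naive lowercase whitespace tokenization; the implementation behind `calculate_rouge_score` in `utils.metrics` is not shown in this notebook and may tokenize or stem differently. The function name `rouge1_f` and the example strings are hypothetical.

```python
from collections import Counter


def rouge1_f(reference: str, candidate: str) -> float:
    """Unigram-overlap F-measure with naive lowercase whitespace tokenization."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Two of four unigrams match on each side: precision = recall = 0.5.
print(rouge1_f("the model improves accuracy", "the model reduces errors"))
```

Since the summaries here are several times longer than a typical abstract sentence count would suggest, precision tends to limit the F-measure even when the abstract's vocabulary is well covered.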


In [83]:
summaries[0]

'The paper claims to model the space of solution paths for open-ended queries as a “Reasoning Tree,” where each node represents an intermediate reasoning step and each path a potential solution trajectory. The authors argue that existing curriculum learning strategies rely on final solution accuracy and thus overlook richer query-level characteristics like the structural complexity of these trees. To address this, the article asserts a novel metric called the Reasoning Score (r-score), defined as the maximum sum of node evaluations under a fixed budget of selected nodes. The authors state that their Reasoning Tree Schedule (Re-Schedule) leverages this metric to construct a curriculum that prioritizes queries based on their structural richness rather than just difficulty. They integrate this scheduling strategy into reinforcement learning with verifiable rewards (RLVR), employing policy optimization methods such as GRPO. The paper claims that this approach consistently outperfor

## Append Supporting Citation

Method to find supporting evidence for summary using `verbatim_rag`.

In [19]:
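def find_evidence_(summary: str, document: DocumentSchema, n: int) -> str:
    """Index one document and ask VerbatimRAG for evidence supporting the summary.

    `n` is only used to name the local Milvus database file and collection.
    """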
    chunker = MarkdownChunkerProvider(
        min_chunk_size=500,
        max_chunk_size=5000)
    dense_provider = SentenceTransformersProvider(
        model_name="ibm-granite/granite-embedding-english-r2",
        device='cpu')
    vector_store = LocalMilvusStore(
        db_path=f"./rag_test_{n}.db",
        collection_name=f'rag_test_{n}',
        dense_dim=dense_provider.get_dimension(),
        enable_dense=True,
        enable_sparse=False,
        nlist=16384)
    index = VerbatimIndex(
        vector_store=vector_store,
        dense_provider=dense_provider,
        chunker_provider=chunker)
    index.add_documents([document])

    llm_client = LLMClient(model=OPENAI_MODEL, temperature=1.0)
    rag = VerbatimRAG(index, llm_client=llm_client)
    query_string = f"Find supporting evidence for this summary:\n {summary}"

    response = rag.query(query_string)
    return response.answer

Find citations for one summary at a time (here, paper index 9).

In [26]:
n = 9
citation = find_evidence_(summaries[n], documents[n], n)

2026-01-06 14:27:09,325 - INFO - Load pretrained SentenceTransformer: ibm-granite/granite-embedding-english-r2
2026-01-06 14:27:13,036 - INFO - Loaded SentenceTransformers model: ibm-granite/granite-embedding-english-r2
2026-01-06 14:27:14,180 - INFO - Created indexes for collection: rag_test_9
2026-01-06 14:27:14,204 - INFO - Created documents collection: rag_test_9_documents
2026-01-06 14:27:14,204 - INFO - Connected to Milvus Lite: ./rag_test_9.db
Batches: 100%|██████████| 1/1 [00:22<00:00, 22.19s/it]]
2026-01-06 14:27:36,485 - INFO - Added 25 vectors to Milvus
2026-01-06 14:27:36,512 - INFO - Added 1 documents to Milvus
Adding documents: 100%|██████████| 1/1 [00:22<00:00, 22.31s/it]
Batches: 100%|██████████| 1/1 [00:00<00:00,  7.78it/s]


Extracting relevant spans...
Extracting spans (batch mode)...


2026-01-06 14:28:38,926 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


Processing spans...
Generating response...


2026-01-06 14:28:44,593 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


Preview summary and evidence pair.

In [27]:
print_markdown_(abstracts[n])
print_markdown_(summaries[n])
print_markdown_(citation)

MLLMs exhibit strong reasoning on isolated queries, yet they operate *de novo* -solving each problem independently and often repeating the same mistakes. Existing memory-augmented agents mainly store past trajectories for reuse. However, trajectory-based memory suffers from brevity bias, gradually losing essential domain knowledge. More critically, even in truly multimodal problem-solving settings, it records only a single-modality trace of past behavior, failing to preserve how visual attention and logical reasoning jointly contributed to the solution. This is fundamentally misaligned with human cognition: semantic memory is both multimodal and integrated , preserving visual and abstract knowledge through coordinated but distinct representational streams. We thus introduce ViLoMem , a dual-stream memory framework that constructs compact, schema-based memory. It separately encodes visual distraction patterns and logical reasoning errors, enabling MLLMs to learn from their successful and failed experiences. Following a grow-and-refine principle, the system incrementally accumulates and updates multimodal semantic knowledge-preserving stable, generalizable strategies while avoiding catastrophic forgetting.

Across six multimodal benchmarks, ViLoMem consistently improves pass@1 accuracy and substantially reduces repeated visual and logical errors. Ablations confirm the necessity of dual-stream memory with explicit distraction-hallucination separation, demonstrating the value of error-aware multimodal memory for lifelong and cross-domain agentic learning. Our project page will be available at

[https://weihao-bo.github.io/ViLoMeo-page/](https://weihao-bo.github.io/ViLoMeo-page/) .

The paper claims to introduce a dual-stream memory architecture for multimodal LLMs that separately stores textual reasoning guidelines and visual priors. The authors state that each memory bank is queried via cosine similarity thresholds to retrieve top-k relevant entries, followed by a two-stage filtering process that first ranks visual candidates by perceptual embedding and then by textual relevance. The article asserts that an LLM-based error analysis module classifies reasoning failures and generates abstracted logic and visual guidelines whenever the model’s prediction diverges from the ground truth. The authors state that these new guidelines are merged into the respective memory banks through replace-or-merge operations, ensuring that both streams evolve with fresh, context-specific knowledge. The paper claims that final answers are generated by conditioning on the original image and question along with the retrieved dual-stream memories, integrating perception, question understanding, and guided reasoning. The article asserts that this cyclical process of retrieval, error analysis, guideline synthesis, and memory update systematically addresses reasoning errors and common visual pitfalls. The authors conclude that their approach yields more accurate, explainable solutions by continually refining the model’s perceptual and logical priors.

Here is the supporting evidence for the summary: 
- **Memory Retrieval**: [1] (c) Memory Retrieval : Specialized dual-stream retrieval mechanism. Visual memories undergo a two-stage process involving image-embedding retrieval followed by question-specific retrieval, since visual information must be conditioned on both image content and the textual query. Logical memories are retrieved through problem analysis and text-embedding similarity. 
- **Memory Generation**: [2] (b) Memory Generation : An error-attribution framework that employs an LLM for logical analysis and an MLLM for visual analysis, producing structured memory schemas through similarity-based merge and create operations.

### Conclusions on Base Approach

In general, the summaries capture more detail about what each paper attempts to accomplish.
They often enumerate the steps followed in the research at a deeper level than the abstract does.
This level of detail can sometimes produce statements that lack context, but the context can often be found in the evidence section.
Generating evidence from the whole summary often yields clear citations for the first half of the summary while ignoring the second half.
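
One way to probe the last observation, offered as an untested idea and not something evaluated in this notebook, is to request evidence for each summary sentence separately instead of for the whole summary. The sketch below assumes a callable `query_fn` that takes a query string and returns an answer string (for example a wrapper around `rag.query(...).answer`); here it is replaced by a stand-in so the example runs offline. The naive regex split is also an assumption; `nltk.sent_tokenize` could be used instead.

```python
import re
from typing import Callable, List, Tuple


def evidence_per_sentence(summary: str,
                          query_fn: Callable[[str], str]) -> List[Tuple[str, str]]:
    """Query evidence for each summary sentence separately."""
    # Naive split after ., ! or ? followed by whitespace (breaks on "e.g." etc.).
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", summary.strip()) if s]
    results = []
    for sentence in sentences:
        query = f"Find supporting evidence for this statement:\n {sentence}"
        results.append((sentence, query_fn(query)))
    return results


# Stand-in for a real RAG call, so the sketch runs without any services.
def fake_query(query: str) -> str:
    return f"<evidence for a {len(query)}-character query>"


pairs = evidence_per_sentence("The paper claims A. The authors state B.", fake_query)
for sentence, evidence in pairs:
    print(sentence, "->", evidence)
```

This trades one LLM round trip per summary for one per sentence, so cost and latency would grow with summary length.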

| Paper Index | Accurate Summary | Corroborating Evidence | Observation |
|-------------|------------------|------------------------|-------------|
| 0 | Yes. | No. | Consistently returns only 1 piece of evidence, often from the conclusion. |
| 1 | Yes. | Yes. | Evidence is relevant and with high coverage. |
| 2 | Yes. | Yes. |  |
| 3 | Yes. | Yes. | Evidence focused only on first half of summary. |
| 4 | - | - | Somehow VSCodium freezes every time this runs. |
| 5 | No, minor omission mixing different statistics. | Yes. | Evidence offers insight into missing statements in summary. |
| 6 | Yes. | Yes. | Accurate summary and complete evidence. |
| 7 | No. | Yes. | The summary dwells on the details of the research steps and misses the overall point stated in the abstract. |
| 8 | Yes. | No. | Summary is disjointed and goes deep into details while failing to describe the general idea. No evidence was found. |
| 9 | Yes. | Yes, but incomplete. |  |