# Summarization with Verbatim RAG

### Implementation Description

| Step | Description | Rationale |
|------|-------------|-----------|
| Process source document | Break the paper into sentences. Encode the full document as a whole, and encode each sentence independently. |  |
| Select relevant sentences | Calculate the **BM25** similarity score between each sentence and the full document. | In the baseline results, this was the most reliable method for selecting sentences. |

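The sentence-selection step in the table can be sketched in a few lines. The following is a minimal, standard-library illustration of Okapi BM25 scoring in which each sentence is treated as a document and the tokens of the full paper form the query. It is not the code used in this notebook; the helper names `tokenize`, `split_sentences`, `bm25_scores`, and `top_sentences`, the naive regex sentence splitter, and the parameter values `k1=1.5` and `b=0.75` are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphanumeric runs as tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def split_sentences(text):
    """Naive splitter (assumption): break after '.', '!' or '?' plus whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def bm25_scores(sentences, query_tokens, k1=1.5, b=0.75):
    """Score each sentence against the query tokens with Okapi BM25."""
    docs = [tokenize(s) for s in sentences]
    n_docs = len(docs)
    if n_docs == 0:
        return []
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of sentences that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        # Simplification: each distinct query term is counted once.
        for term in set(query_tokens):
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def top_sentences(document, k=3):
    """Return the k sentences that score highest against the full document."""
    sentences = split_sentences(document)
    scores = bm25_scores(sentences, tokenize(document))
    ranked = sorted(zip(scores, sentences), key=lambda p: p[0], reverse=True)
    return [sentence for _, sentence in ranked[:k]]
```

Calling `top_sentences(text, k=5)` on a paper's plain text would return its five highest-scoring sentences under these assumptions. A production setup would more likely rely on a tested BM25 implementation and a proper sentence segmenter.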
### Required Modules

In [7]:
from verbatim_rag.schema import DocumentSchema
from verbatim_rag.chunker_providers import MarkdownChunkerProvider
from verbatim_rag.embedding_providers import SentenceTransformersProvider
from verbatim_rag.vector_stores import LocalMilvusStore
from verbatim_rag import VerbatimIndex, VerbatimRAG
from verbatim_rag.core import LLMClient

### Constant Variables

In [2]:
ARXIV_URL = "https://arxiv.org/pdf/"
DOCUMENT_ID = [
    '2511.21398v1',
    '2511.21444v1',
    '2511.21460v1',
    '2511.21471v1',
    '2511.21522v1',
    '2511.21569v1',
    '2511.21570v1',
    '2511.21591v1',
    '2511.21636v1',
    '2511.21678v1',
]

### Process Source Document

In [3]:
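# Set to False when re-running the notebook to skip the slow PDF download and conversion.
LOAD_PAPERS = True


if LOAD_PAPERS:
    # Download each arXiv PDF and parse it into a DocumentSchema.
    papers = [
        DocumentSchema.from_url(url=ARXIV_URL + document_id)
        for document_id in DOCUMENT_ID
    ]
else:
    print("Papers already loaded!")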

2026-01-04 20:15:12,585 - INFO - detected formats: [<InputFormat.PDF: 'pdf'>]
2026-01-04 20:15:12,663 - INFO - Going to convert document batch...
2026-01-04 20:15:12,664 - INFO - Initializing pipeline for StandardPdfPipeline with options hash e15bc6f248154cc62f8db15ef18a8ab7
2026-01-04 20:15:12,678 - INFO - Loading plugin 'docling_defaults'
2026-01-04 20:15:12,681 - INFO - Registered picture descriptions: ['vlm', 'api']
2026-01-04 20:15:12,695 - INFO - Loading plugin 'docling_defaults'
2026-01-04 20:15:12,700 - INFO - Registered ocr engines: ['auto', 'easyocr', 'ocrmac', 'rapidocr', 'tesserocr', 'tesseract']
2026-01-04 20:15:12,701 - INFO - rapidocr cannot be used because onnxruntime is not installed.
2026-01-04 20:15:12,702 - INFO - easyocr cannot be used because it is not installed.
2026-01-04 20:15:12,979 - INFO - Accelerator device: 'cpu'
[INFO] 2026-01-04 20:15:12,993 [RapidOCR] base.py:22: Using engine_name: torch
[INFO] 2026-01-04 20:15:12,997 [RapidOCR] device_con

In [4]:
chunker = MarkdownChunkerProvider(
    min_chunk_size=500,
    max_chunk_size=5000,
)

dense_provider = SentenceTransformersProvider(
    model_name="ibm-granite/granite-embedding-english-r2",
    device='cpu',
)
vector_store = LocalMilvusStore(
    db_path="./rag_test.db",
    collection_name='rag_test',
    dense_dim=dense_provider.get_dimension(),
    enable_dense=True,
    enable_sparse=False,
    nlist=16384,
)
index = VerbatimIndex(
    vector_store=vector_store,
    dense_provider=dense_provider,
    chunker_provider=chunker,
)

index.add_documents([papers[0]])

2026-01-04 20:23:07,016 - INFO - PyTorch version 2.9.1 available.
2026-01-04 20:23:07,422 - INFO - Load pretrained SentenceTransformer: ibm-granite/granite-embedding-english-r2
2026-01-04 20:23:13,337 - INFO - Loaded SentenceTransformers model: ibm-granite/granite-embedding-english-r2
  from pkg_resources import DistributionNotFound, get_distribution
2026-01-04 20:23:14,032 - INFO - Connected to Milvus Lite: ./rag_test.db
Batches: 100%|██████████| 2/2 [01:40<00:00, 50.38s/it]
2026-01-04 20:24:54,914 - INFO - Added 51 vectors to Milvus
2026-01-04 20:24:54,943 - INFO - Added 1 documents to Milvus
Adding documents: 100%|██████████| 1/1 [01:40<00:00, 100.91s/it]

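With `enable_dense=True` and `enable_sparse=False`, retrieval over the 51 stored vectors relies on dense similarity alone. As a rough illustration of that idea, the sketch below ranks toy chunk vectors by cosine similarity to a query vector. The function names, the toy vectors, and the choice of cosine similarity are assumptions for illustration; the metric and index that `LocalMilvusStore` actually uses are not shown in this notebook.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=2):
    """Return (score, chunk_index) pairs for the k most similar chunks."""
    scored = [
        (cosine_similarity(query_vec, vec), i)
        for i, vec in enumerate(chunk_vecs)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Hypothetical 3-dimensional embeddings standing in for real chunk vectors.
toy_chunks = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
toy_query = [1.0, 0.05, 0.0]
nearest = top_k(toy_query, toy_chunks, k=2)
```

Here `nearest` holds the two chunks whose direction is closest to the query. Real embeddings have hundreds of dimensions, and a vector store replaces this exhaustive scan with an index.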

In [10]:
llm_client = LLMClient(model="gpt-4.1", temperature=1.0)

rag = VerbatimRAG(index, llm_client=llm_client)

response = rag.query("Build a summary of the paper 'Prune4Web: DOM Tree Pruning Programming for Web Agent'.")

print(response.answer)

Batches: 100%|██████████| 1/1 [00:00<00:00, 15.73it/s]


Extracting relevant spans...
Extracting spans (batch mode)...


2026-01-04 20:37:54,986 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


Processing spans...
Generating response...


2026-01-04 20:37:56,919 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


Here is a concise summary of the paper based on the provided verbatim facts:

1. **Technical Core:** [1] DOM Tree Pruning Programming is the technical core of Prune4Web. It offloads the heavy task of element filtering from the LLM itself to a lightweight, dynamically generated program.
2. **Scoring Function Generation:** [2] Step 2: Scoring Function Generation. The core task of the Programmatic Element Filter is to generate a Python scoring function f score t for the current step. We design a heuristic-based Scoring Function Template, where the LLM only needs to generate key parameters for this template. This approach significantly improves the stability and controllability of the generated code while maintaining flexibility. Algorithm 1 shows the pseudo-code of the template. The template mimics human intuition when searching for elements using keywords. It assumes that a target element contains identifiable textual features within the HTML. The template performs tiered, weighted match

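Because the answer is meant to be assembled from verbatim facts, one useful sanity check is to confirm that each quoted span really occurs in the source text. The sketch below does that for plain strings, ignoring differences in whitespace. The helper names `normalize` and `find_unsupported_spans` are hypothetical, and how the spans and source chunks are obtained from the response object is deliberately left out, since that part of the `verbatim_rag` API is not shown here.

```python
import re


def normalize(text):
    """Collapse runs of whitespace so line wrapping does not break matching."""
    return re.sub(r"\s+", " ", text).strip()


def find_unsupported_spans(spans, source_chunks):
    """Return the spans that do not occur verbatim in any source chunk."""
    normalized_chunks = [normalize(chunk) for chunk in source_chunks]
    return [
        span
        for span in spans
        if not any(normalize(span) in chunk for chunk in normalized_chunks)
    ]
```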
[Manuel Velarde](mailto:manuel@velarde.me)