get_page_content over-collects on Markdown when given a comma-separated page list

get_page_content claims to accept three forms: '5-7', '3,8', or '12' (see docstring at pageindex/retrieve.py:111-119). For PDFs it works as advertised. For Markdown docs, '3,8' is treated as the inclusive range [3,8], so it pulls in every heading whose line_num falls in that range.
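For reference, here is roughly how I read the three documented forms. This is a hypothetical stand-in, not the actual _parse_pages code; the name parse_pages_sketch and the handling of mixed specs like '1-3,8' are my assumptions.

```python
def parse_pages_sketch(spec: str) -> list[int]:
    """Hypothetical stand-in for _parse_pages (not the real implementation).

    Turns '5-7', '3,8', or '12' into a sorted list of discrete page numbers.
    """
    pages = set()
    for part in spec.split(','):
        part = part.strip()
        if not part:
            continue
        if '-' in part:
            # 'a-b' is an inclusive range
            start, end = (int(x) for x in part.split('-', 1))
            pages.update(range(start, end + 1))
        else:
            # a bare number is a single page
            pages.add(int(part))
    return sorted(pages)


print(parse_pages_sketch('5-7'))  # [5, 6, 7]
print(parse_pages_sketch('3,8'))  # [3, 8]
print(parse_pages_sketch('12'))   # [12]
```

The point is that '3,8' means two discrete entries, not the range 3 through 8.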

Repro, no LLM, no real PDF:

```python
import json
from pageindex.retrieve import get_page_content
docs = {
  'md':  {'type':'md','structure':[
    {'line_num':5,'text':'L5','nodes':[]},
    {'line_num':10,'text':'L10','nodes':[]},
    {'line_num':50,'text':'L50','nodes':[]},
    {'line_num':100,'text':'L100','nodes':[]}]},
  'pdf': {'type':'pdf','pages':[
    {'page':5,'content':'P5'},{'page':10,'content':'P10'},
    {'page':50,'content':'P50'},{'page':100,'content':'P100'}]}}
print(json.loads(get_page_content(docs, 'md',  '5,100')))
print(json.loads(get_page_content(docs, 'pdf', '5,100')))
```

Got:
- md  -> pages 5, 10, 50, 100
- pdf -> pages 5, 100

Want both to return [5, 100] like the docstring suggests.

The over-collection happens in _get_md_page_content at pageindex/retrieve.py:56-76, which takes min(page_nums)/max(page_nums) and matches everything in that window. _parse_pages already returns a discrete sorted list, so the distinction between a list and a range is lost entirely in the Markdown helper.
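A possible direction for the fix, as a minimal sketch: match on set membership instead of a min/max window. The helpers below (iter_nodes, collect_window, collect_exact) are hypothetical names I made up for illustration; they are not the code in retrieve.py and only return line numbers rather than the real response payload.

```python
def iter_nodes(structure):
    """Walk a nested structure depth-first (assumes children live under 'nodes')."""
    for node in structure:
        yield node
        yield from iter_nodes(node.get('nodes', []))


def collect_window(structure, page_nums):
    """Mimics the behavior described above: everything between min and max."""
    lo, hi = min(page_nums), max(page_nums)
    return [n['line_num'] for n in iter_nodes(structure)
            if lo <= n['line_num'] <= hi]


def collect_exact(structure, page_nums):
    """Sketch of the proposed behavior: only the requested line numbers."""
    wanted = set(page_nums)
    return [n['line_num'] for n in iter_nodes(structure)
            if n['line_num'] in wanted]


structure = [
    {'line_num': 5, 'text': 'L5', 'nodes': []},
    {'line_num': 10, 'text': 'L10', 'nodes': []},
    {'line_num': 50, 'text': 'L50', 'nodes': []},
    {'line_num': 100, 'text': 'L100', 'nodes': []},
]

print(collect_window(structure, [5, 100]))  # [5, 10, 50, 100]
print(collect_exact(structure, [5, 100]))   # [5, 100]
```

Since a range like '5-7' already expands to [5, 6, 7] before it reaches the helper, membership matching should keep range requests working while fixing comma lists.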

Noticed this while looking at how the agentic demo (examples/agentic_vectorless_rag_demo.py) calls get_page_content. On long Markdown docs a comma-list quietly pulls in unrelated sections, which inflates token use.
