get_page_content claims to accept three forms: '5-7', '3,8', or '12' (see docstring at pageindex/retrieve.py:111-119). For PDFs it works as advertised. For Markdown docs '3,8' is treated as the inclusive range [3,8] and pulls in every heading whose line_num lands between them.
Repro, no LLM, no real PDF:
import json
from pageindex.retrieve import get_page_content
# Two in-memory docs with the same numbering: headings (md) vs pages (pdf).
docs = {
    'md': {'type': 'md', 'structure': [
        {'line_num': 5, 'text': 'L5', 'nodes': []},
        {'line_num': 10, 'text': 'L10', 'nodes': []},
        {'line_num': 50, 'text': 'L50', 'nodes': []},
        {'line_num': 100, 'text': 'L100', 'nodes': []},
    ]},
    'pdf': {'type': 'pdf', 'pages': [
        {'page': 5, 'content': 'P5'},
        {'page': 10, 'content': 'P10'},
        {'page': 50, 'content': 'P50'},
        {'page': 100, 'content': 'P100'},
    ]},
}
# Same comma-list request against both doc types.
print(json.loads(get_page_content(docs, 'md', '5,100')))
print(json.loads(get_page_content(docs, 'pdf', '5,100')))
Got:
- md -> pages 5, 10, 50, 100
- pdf -> pages 5, 100
Want both to return [5, 100] like the docstring suggests.
The over-collection happens in _get_md_page_content at pageindex/retrieve.py:56-76, which takes min(page_nums)/max(page_nums) and matches every heading whose line_num falls in that window. _parse_pages already returns a discrete sorted list, so the comma-list only gets collapsed into a range inside the Markdown helper.
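To make the difference concrete, here is a minimal standalone sketch of the two matching strategies. It does not import pageindex and is not the actual code in retrieve.py: parse_pages, iter_nodes, collect_window and collect_exact are hypothetical names, and parse_pages only approximates what _parse_pages is described as doing above. The point is just that switching from a min/max window to set membership gives the documented behaviour for comma-lists.

```python
def parse_pages(spec: str) -> list[int]:
    """Hypothetical stand-in for _parse_pages: '5-7', '3,8' or '12' -> sorted discrete list."""
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            start, end = (int(p) for p in part.split("-", 1))
            pages.update(range(start, end + 1))  # ranges are inclusive
        else:
            pages.add(int(part))
    return sorted(pages)


def iter_nodes(nodes):
    """Yield every node depth-first, following the 'nodes' children key."""
    for node in nodes:
        yield node
        yield from iter_nodes(node.get("nodes", []))


def collect_window(structure, page_nums):
    """Behaviour described in this report: everything between min and max."""
    lo, hi = min(page_nums), max(page_nums)
    return [n["line_num"] for n in iter_nodes(structure) if lo <= n["line_num"] <= hi]


def collect_exact(structure, page_nums):
    """Possible fix (an assumption, not the project's code): match requested numbers only."""
    wanted = set(page_nums)
    return [n["line_num"] for n in iter_nodes(structure) if n["line_num"] in wanted]


# Same heading layout as the 'md' doc in the repro.
structure = [
    {"line_num": 5, "text": "L5", "nodes": []},
    {"line_num": 10, "text": "L10", "nodes": []},
    {"line_num": 50, "text": "L50", "nodes": []},
    {"line_num": 100, "text": "L100", "nodes": []},
]

requested = parse_pages("5,100")
print(collect_window(structure, requested))  # [5, 10, 50, 100]
print(collect_exact(structure, requested))   # [5, 100]
```

Ranges keep working under set membership because the parsed list already contains every integer in the range, so '5-50' would still match the headings at 5, 10 and 50.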
Noticed this while looking at how the agentic demo (examples/agentic_vectorless_rag_demo.py) calls get_page_content. On long Markdown docs a comma-list quietly pulls in unrelated sections, which inflates token use.