Hi team! We've been using PageIndex on large documents (50–100+ pages) and found that a significant portion of LLM
calls during indexing can be resolved locally without losing accuracy.
## Findings
Many LLM calls are for high-confidence decisions that local heuristics can handle:
| Decision Point | Current Behavior | Proposed Optimization |
|---|---|---|
| `find_toc_pages` | 1 LLM call per page scanned | Font/layout analysis — dot leaders, digit-ending lines, TOC keywords — resolves obvious cases locally |
| `verify_toc` | 1 LLM call per TOC entry | Fuzzy string matching pre-confirms title presence when confidence is high |
| `check_title_appearance_in_start` | 1 LLM call per node | Fuzzy match on the first ~300 chars of the target page |
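To make the first row concrete, here is a minimal sketch of a layout-signal TOC detector. The function name, regexes, and thresholds are illustrative, not the actual PageIndex implementation; real code would read line data from PyMuPDF rather than raw page text.

```python
import re

# Illustrative signals: dot leaders ("Introduction ...... 3"),
# lines ending in a page number, and TOC keywords near the top.
DOT_LEADER = re.compile(r"\.{3,}\s*\d+\s*$")
ENDS_IN_DIGIT = re.compile(r"\d+\s*$")
TOC_KEYWORDS = ("contents", "table of contents", "index")

def looks_like_toc_page(page_text: str,
                        dot_leader_min: float = 0.3,
                        digit_end_min: float = 0.5) -> bool:
    """Return True only when the page is confidently a TOC page."""
    lines = [l for l in page_text.splitlines() if l.strip()]
    if not lines:
        return False
    head = "\n".join(lines[:3]).lower()
    has_keyword = any(k in head for k in TOC_KEYWORDS)
    dot_ratio = sum(bool(DOT_LEADER.search(l)) for l in lines) / len(lines)
    digit_ratio = sum(bool(ENDS_IN_DIGIT.search(l)) for l in lines) / len(lines)
    # High-confidence only: require a keyword plus strong layout evidence;
    # everything else falls through to the LLM path unchanged.
    return has_keyword and (dot_ratio >= dot_leader_min
                            or digit_ratio >= digit_end_min)
```

Pages that fail this check are still scanned by the LLM as before, so the heuristic can only skip calls, never change outcomes.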
Initial test on an 86-page paper: ~260 LLM calls → ~150 (~42% reduction) with identical verification accuracy.
Tested on a small set of documents — broader benchmarking across document types would be a useful next step.
Key constraint: heuristics only skip a call when confidence is high. Uncertain cases always escalate to the LLM. The self-verification loop is never bypassed.
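The escalation rule above can be expressed as a small gating pattern. This sketch is hypothetical (the names `verify_with_fallback`, `heuristic`, and `llm_check` are not from the repo): the heuristic returns a confident `True`/`False` or `None`, and only `None` triggers an LLM call.

```python
from typing import Callable, Optional

def verify_with_fallback(heuristic: Callable[[str], Optional[bool]],
                         llm_check: Callable[[str], bool],
                         item: str) -> bool:
    """Use the local decision when confident; otherwise escalate."""
    local = heuristic(item)
    if local is not None:   # high-confidence local decision: skip the LLM
        return local
    return llm_check(item)  # uncertain case: always escalate
```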
Note: Path 4 documents (no TOC, no headings) see minimal savings — no structural signals exist for heuristics to exploit, so the LLM remains essential there.
## Approach
- Font-based TOC detection using PyMuPDF page layout signals (line lengths, dot-leader patterns, digit-ending ratios, TOC keywords)
- Fuzzy title matching via `rapidfuzz` with per-path confidence thresholds
- `LLMCallCounter` to track actual savings per document
- `if_use_heuristics` config flag (default: `yes`); set to `no` to restore original behavior exactly
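As a sketch of the fuzzy pre-confirmation step: the repo uses `rapidfuzz`, but stdlib `difflib` stands in here so the example is self-contained. The function name and the per-path threshold values are illustrative, not the real config.

```python
import difflib

# Hypothetical per-path thresholds; only a match above the threshold
# skips the LLM call for that path.
THRESHOLDS = {"verify_toc": 0.90, "check_title_appearance_in_start": 0.85}

def title_confidently_present(title: str, text: str, path: str) -> bool:
    """True only when a fuzzy match clears the path's threshold."""
    window = text[:300].lower()   # first ~300 chars, as described above
    needle = title.lower()
    best = 0.0
    for i in range(0, max(1, len(window) - len(needle) + 1)):
        score = difflib.SequenceMatcher(
            None, needle, window[i:i + len(needle)]).ratio()
        best = max(best, score)
        if best >= THRESHOLDS[path]:
            return True
    return False
```

When the threshold is not met, the caller falls back to the original LLM verification, so borderline titles are never decided locally.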
Fully backward compatible, no breaking changes.
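For reference, a minimal sketch of the `LLMCallCounter` idea — field and method names here are illustrative, not the repo's actual class:

```python
from dataclasses import dataclass

@dataclass
class LLMCallCounter:
    """Tracks LLM calls made vs. calls skipped by heuristics."""
    made: int = 0
    skipped: int = 0

    def record(self, used_llm: bool) -> None:
        if used_llm:
            self.made += 1
        else:
            self.skipped += 1

    @property
    def savings(self) -> float:
        """Fraction of decisions resolved without an LLM call."""
        total = self.made + self.skipped
        return self.skipped / total if total else 0.0
```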
## Implementation
Full implementation and test results:
https://github.com/Unizoy/pageindex-optimized
Diff showing exact changes:
Unizoy/pageindex-optimized#1
The implementation is complete and tested. Early results look promising, and we'd love feedback from the team on edge cases we may not have hit yet. Happy to open a PR or discuss a different integration approach if preferred.