
fix: add missing commas in LLM prompt JSON formats and guard list return types #274

Open
zhoufengen wants to merge 1 commit into VectifyAI:main from zhoufengen:fix/llm-response-robustness

Conversation

@zhoufengen

Summary

This PR fixes two related robustness issues that cause crashes when the LLM returns malformed or unexpected JSON responses.

Changes

Fix #257: Missing commas in LLM prompt reply formats

Six prompt templates in pageindex/page_index.py were missing a trailing comma after the "thinking" field:

  • check_title_appearance (line 34)
  • check_title_appearance_in_start (line 62)
  • toc_detector_single_page (line 112)
  • check_if_toc_extraction_is_complete (line 132)
  • check_if_toc_transformation_is_complete (line 150)
  • detect_page_index (line 213)

When an LLM follows the format literally, it produces invalid JSON (missing comma between keys). extract_json then fails and returns {}, causing KeyError on toc_detected, completed, or page_index_given_in_toc.
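For illustration, here is a minimal sketch of the before/after shape of one reply format, using field names mentioned in this PR rather than the literal prompt wording in pageindex/page_index.py:

```python
# Sketch only: the real prompts differ in wording; the fix is the trailing comma.
reply_format_before = """{
    "thinking": <your reasoning>
    "toc_detected": "<yes or no>"
}"""  # invalid JSON if the model copies it literally (no comma after "thinking")

reply_format_after = """{
    "thinking": <your reasoning>,
    "toc_detected": "<yes or no>"
}"""  # valid JSON once the model fills in the placeholders
```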

Fix #199: generate_toc_init / generate_toc_continue must return a list

When extract_json fails on a malformed LLM response it returns {} (a dict). Both generate_toc_init and generate_toc_continue passed this through directly. Callers expect a list:

  • process_no_toc calls toc_with_page_number.extend(...) → AttributeError: 'dict' object has no attribute 'extend'
  • meta_processor calls item.get('physical_index') on list items → AttributeError: 'str' object has no attribute 'get'

Both functions now return [] when extract_json returns a non-list, matching the expected contract.
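The guard itself is small. A minimal sketch of the pattern, assuming the extract_json helper described above (the actual code in pageindex/page_index.py may differ in variable names and surrounding logic):

```python
# Inside generate_toc_init / generate_toc_continue (sketch):
# extract_json returns {} on malformed LLM output; callers expect a list,
# so anything that is not a list is coerced to an empty list.
result = extract_json(response)
if not isinstance(result, list):
    return []
return result
```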

Testing

Added tests/test_llm_response_robustness.py with 5 tests covering:

  1. All "thinking" fields in prompt reply formats have trailing commas
  2. generate_toc_init returns [] (not a dict) when LLM returns malformed JSON
  3. generate_toc_continue returns [] (not a dict) when LLM returns malformed JSON
  4. generate_toc_init passes through a valid list unchanged
  5. process_no_toc does not raise AttributeError when generate_toc_init returns []

Run with:

pytest tests/test_llm_response_robustness.py -v

All 5 tests pass.
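For reference, a rough sketch of what the first test could look like; the actual assertions in tests/test_llm_response_robustness.py may differ:

```python
from pathlib import Path

def test_thinking_fields_have_trailing_commas():
    # Heuristic check: every '"thinking": ...' line in the prompt reply
    # formats should end with a comma, since another key always follows it.
    source = Path("pageindex/page_index.py").read_text()
    thinking_lines = [l.strip() for l in source.splitlines()
                      if l.strip().startswith('"thinking"')]
    assert thinking_lines, "expected at least one reply format with a thinking field"
    for line in thinking_lines:
        assert line.endswith(","), f"missing trailing comma: {line}"
```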

Related Issues

Fixes #257
Fixes #199

fix: add missing commas in LLM prompt JSON formats and guard list return types

- Add trailing commas after "thinking" fields in 6 prompt reply formats
  (check_title_appearance, check_title_appearance_in_start,
  toc_detector_single_page, check_if_toc_extraction_is_complete,
  check_if_toc_transformation_is_complete, detect_page_index).
  Without the comma, LLMs that follow the format literally produce invalid
  JSON, causing extract_json to fail and downstream KeyError crashes (VectifyAI#257).

- Guard generate_toc_init and generate_toc_continue to return [] instead of
  a dict when extract_json returns a non-list on malformed LLM output.
  Prevents AttributeError: 'dict' object has no attribute 'extend' in
  process_no_toc and AttributeError: 'str' object has no attribute 'get'
  in meta_processor (VectifyAI#199).

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
