
fix: add missing commas in LLM prompt JSON formats and guard list return types #274

Open
zhoufengen wants to merge 1 commit into VectifyAI:main from zhoufengen:fix/llm-response-robustness

Conversation

@zhoufengen

Summary

This PR fixes two related robustness issues that cause crashes when the LLM returns malformed or unexpected JSON responses.

Changes

Fix #257: Missing commas in LLM prompt reply formats

Six prompt templates in pageindex/page_index.py were missing a trailing comma after the "thinking" field:

  • check_title_appearance (line 34)
  • check_title_appearance_in_start (line 62)
  • toc_detector_single_page (line 112)
  • check_if_toc_extraction_is_complete (line 132)
  • check_if_toc_transformation_is_complete (line 150)
  • detect_page_index (line 213)

When an LLM follows the format literally, it produces invalid JSON (missing comma between keys). extract_json then fails and returns {}, causing KeyError on toc_detected, completed, or page_index_given_in_toc.
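For illustration, here is a minimal sketch of the before/after shape of one reply format, using field names mentioned in this PR rather than the literal prompt wording in pageindex/page_index.py:

```python
# Sketch only: the real prompts differ in wording; the fix is the trailing comma.
reply_format_before = """{
    "thinking": <your reasoning>
    "toc_detected": "<yes or no>"
}"""  # invalid JSON if the model copies it literally (no comma after "thinking")

reply_format_after = """{
    "thinking": <your reasoning>,
    "toc_detected": "<yes or no>"
}"""  # valid JSON once the model fills in the placeholders
```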

Fix #199: generate_toc_init / generate_toc_continue must return a list

When extract_json fails on a malformed LLM response it returns {} (a dict). Both generate_toc_init and generate_toc_continue passed this through directly. Callers expect a list:

  • process_no_toc calls toc_with_page_number.extend(...) → AttributeError: 'dict' object has no attribute 'extend'
  • meta_processor calls item.get('physical_index') on list items → AttributeError: 'str' object has no attribute 'get'

Both functions now return [] when extract_json returns a non-list, matching the expected contract.
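The guard itself is small. A minimal sketch of the pattern, assuming the extract_json helper described above (the actual code in pageindex/page_index.py may differ in variable names and surrounding logic):

```python
# Inside generate_toc_init / generate_toc_continue (sketch):
# extract_json returns {} on malformed LLM output; callers expect a list,
# so anything that is not a list is coerced to an empty list.
result = extract_json(response)
if not isinstance(result, list):
    return []
return result
```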

Testing

Added tests/test_llm_response_robustness.py with 5 tests covering:

  1. All "thinking" fields in prompt reply formats have trailing commas
  2. generate_toc_init returns [] (not a dict) when LLM returns malformed JSON
  3. generate_toc_continue returns [] (not a dict) when LLM returns malformed JSON
  4. generate_toc_init passes through a valid list unchanged
  5. process_no_toc does not raise AttributeError when generate_toc_init returns []

Run with:

pytest tests/test_llm_response_robustness.py -v

All 5 tests pass.
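For reference, a rough sketch of what the first test could look like; the actual assertions in tests/test_llm_response_robustness.py may differ:

```python
from pathlib import Path

def test_thinking_fields_have_trailing_commas():
    # Heuristic check: every '"thinking": ...' line in the prompt reply
    # formats should end with a comma, since another key always follows it.
    source = Path("pageindex/page_index.py").read_text()
    thinking_lines = [l.strip() for l in source.splitlines()
                      if l.strip().startswith('"thinking"')]
    assert thinking_lines, "expected at least one reply format with a thinking field"
    for line in thinking_lines:
        assert line.endswith(","), f"missing trailing comma: {line}"
```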

Related Issues

Fixes #257
Fixes #199

fix: add missing commas in LLM prompt JSON formats and guard list return types

- Add trailing commas after "thinking" fields in 6 prompt reply formats
  (check_title_appearance, check_title_appearance_in_start,
  toc_detector_single_page, check_if_toc_extraction_is_complete,
  check_if_toc_transformation_is_complete, detect_page_index).
  Without the comma, LLMs that follow the format literally produce invalid
  JSON, causing extract_json to fail and downstream KeyError crashes (VectifyAI#257).

- Guard generate_toc_init and generate_toc_continue to return [] instead of
  a dict when extract_json returns a non-list on malformed LLM output.
  Prevents AttributeError: 'dict' object has no attribute 'extend' in
  process_no_toc and AttributeError: 'str' object has no attribute 'get'
  in meta_processor (VectifyAI#199).

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
