Cloud OCR indexing, pageindex dev1 bump, warning cleanup by rejojer · Pull Request #14 · VectifyAI/OpenKB

rejojer · 2026-04-10T17:32:34Z

Summary

index_long_document now fetches per-page markdown from PageIndex cloud via col.get_page_content when PAGEINDEX_API_KEY is set, falling back to local pymupdf on error or empty result. Cloud output preserves tables, math, and section headers that raw pymupdf text extraction loses.
Bumps pageindex to 0.3.0.dev1 to pick up VectifyAI/PageIndex#226 ("fix: poll status=="completed" in cloud add_document"), which makes the cloud add_document poll check status == "completed" instead of the unreliable retrieval_ready flag. Previously, col.add() could hang until the 10 min timeout on otherwise-successful uploads.
Moves warnings.filterwarnings("ignore") before openkb.cli's module imports so pydub's missing-ffmpeg RuntimeWarning, emitted when markitdown pulls pydub in, no longer leaks to stderr. The existing post-import call is kept because markitdown clobbers filter state during its own import.
Also includes the earlier init-prompt simplification that captures the LLM API key to .env (#13, "Simplify init prompts and capture API key to .env").
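The cloud-first, local-fallback behaviour described in the summary can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `fetch_pages`, `cloud_fetch`, and `local_fetch` are hypothetical names, and the two callables stand in for the `col.get_page_content` call and the local pymupdf extraction, whose real signatures are not shown here.

```python
from typing import Callable, Dict, List, Optional

# One entry per page, mirroring the {page, content} shape mentioned in the test plan.
Page = Dict[str, object]


def fetch_pages(
    cloud_fetch: Callable[[], List[Page]],
    local_fetch: Callable[[], List[Page]],
    api_key: Optional[str],
) -> List[Page]:
    """Return cloud pages when a key is set and the call succeeds, else local pages."""
    if api_key:
        try:
            pages = cloud_fetch()
        except Exception:
            # Any cloud failure is treated as "no result" so the local path runs.
            pages = []
        if pages:
            return pages
    # No key, cloud error, or empty cloud result: use the local extractor.
    return local_fetch()
```

In real use, `api_key` would presumably come from `os.environ.get("PAGEINDEX_API_KEY")`, so an unset variable selects the local branch without touching the network.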

Test plan

End-to-end run of index_long_document in cloud mode on a 4-page PDF: upload+poll ~27s, get_page_content ~2s, JSON written to wiki/sources/ with 4 {page, content} entries of high-quality markdown.
Confirmed fallback branch triggers when cloud call raises / returns empty.
Verified installed pageindex==0.3.0.dev1 contains the poll fix.
Smoke test openkb add on a long PDF end-to-end (convert → index → compile) in cloud mode.
Smoke test openkb add in local mode to confirm the non-cloud branch is unchanged.

Simplify init prompts and capture API key to .env

Drop the language and pageindex_threshold prompts from `openkb init`; both fall back to config defaults and can be edited later in `.openkb/config.yaml`. In their place, add an interactive API key prompt that writes `LLM_API_KEY` to `./.env` (chmod 0600) when the user provides one, so first-time setup no longer requires a separate manual step. Also polish the model prompt with provider examples and a link to LiteLLM for others.
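As a rough sketch of the `.env` capture step in this commit, the snippet below writes `LLM_API_KEY` to a file that is readable and writable only by its owner. The function name `write_api_key` is hypothetical, the sketch assumes a POSIX system, and it overwrites any existing file, which the actual `openkb init` code may or may not do.

```python
import os
from pathlib import Path


def write_api_key(env_path: Path, api_key: str) -> None:
    """Write LLM_API_KEY to env_path with mode 0600 (owner read/write only)."""
    # Create the file with restrictive permissions from the start, so the key
    # is never briefly readable by other users.
    fd = os.open(env_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        fh.write(f"LLM_API_KEY={api_key}\n")
    # If the file already existed with a broader mode, tighten it explicitly.
    os.chmod(env_path, 0o600)
```

Opening with an explicit mode and then calling `os.chmod` covers both a fresh file and a pre-existing one.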

When PAGEINDEX_API_KEY is set, index_long_document now fetches per-page markdown via col.get_page_content() instead of running local pymupdf. Cloud OCR produces cleaner output (preserves tables, math, and section headers) than raw pymupdf text extraction. Falls back to local pymupdf if the cloud call raises or returns an empty result.

Picks up the cloud add_document poll fix from VectifyAI/PageIndex#226, which switches the readiness signal from retrieval_ready to status == "completed".
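The readiness check this fix describes amounts to polling until the reported status equals "completed", bounded by a timeout. The sketch below is illustrative only and is not the pageindex implementation: `wait_until_completed` and `get_status` are hypothetical names, and the 600-second default simply mirrors the 10 min timeout mentioned in the PR summary.

```python
import time
from typing import Callable


def wait_until_completed(
    get_status: Callable[[], str],
    timeout: float = 600.0,
    interval: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll get_status() until it returns "completed" or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if get_status() == "completed":
            return True
        # Wait before asking again so the service is not hammered.
        sleep(interval)
    # Timed out without ever seeing "completed".
    return False
```

Keying the loop on a single explicit status value means a successful upload ends the wait as soon as that value is reported, rather than depending on a secondary flag.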

Move warnings.filterwarnings("ignore") to before the module imports so pydub's missing-ffmpeg RuntimeWarning, emitted when markitdown pulls it in, is suppressed. The existing post-import call is kept because markitdown clobbers the filter state during its own import.
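The two-call pattern can be illustrated with a stand-in for the noisy import. In this sketch `simulated_noisy_import` is a hypothetical function that emits a RuntimeWarning and then resets the filter state, loosely imitating what the commit message attributes to markitdown and pydub; the real modules are not imported here.

```python
import warnings
from typing import Callable


def simulated_noisy_import() -> None:
    """Hypothetical stand-in for an import that warns, then clobbers filters."""
    warnings.warn("ffmpeg not found", RuntimeWarning)
    warnings.resetwarnings()


def import_quietly(do_import: Callable[[], None]) -> None:
    # First call: silence warnings emitted while the import runs.
    warnings.filterwarnings("ignore")
    do_import()
    # Second call: re-install the filter in case the import reset it.
    warnings.filterwarnings("ignore")
```

Without the second call, any warning raised after the import would be shown again, because `warnings.resetwarnings()` removes every installed filter.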

rejojer · 2026-04-10T17:42:38Z

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

Generated with Claude Code

If this code review was useful, please react with 👍. Otherwise, react with 👎.

rejojer added 5 commits April 10, 2026 20:51

Merge pull request #13 from VectifyAI/feat/init-api-key-prompt

771452d

Simplify init prompts and capture API key to .env

Bump pageindex to 0.3.0.dev1

e0ab3f9


rejojer merged commit 2e1caf9 into main Apr 10, 2026

rejojer merged 5 commits into main from dev
