
Cloud OCR indexing, pageindex dev1 bump, warning cleanup #14

Merged
rejojer merged 5 commits into main from dev on Apr 10, 2026



rejojer (Member) commented Apr 10, 2026

Summary

  • index_long_document now fetches per-page markdown from PageIndex cloud via col.get_page_content when PAGEINDEX_API_KEY is set, falling back to local pymupdf on error or empty result. Cloud output preserves tables, math, and section headers that raw pymupdf text extraction loses.
  • Bumps pageindex to 0.3.0.dev1 to pick up VectifyAI/PageIndex#226 ("fix: poll status=="completed" in cloud add_document"), which makes the cloud add_document poll check status == "completed" instead of the unreliable retrieval_ready flag. Previously, col.add() could hang until the 10 min timeout on otherwise-successful uploads.
  • Moves warnings.filterwarnings("ignore") before openkb.cli's module imports so pydub's missing-ffmpeg RuntimeWarning, emitted when markitdown pulls pydub in, no longer leaks to stderr. The existing post-import call is kept because markitdown clobbers filter state during its own import.
  • Also includes the earlier init-prompt simplification that captures the LLM API key to .env (Simplify init prompts and capture API key to .env #13).
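
The cloud-first, local-fallback behaviour from the first bullet can be sketched as below. This is a minimal illustration, not the code from this PR: `extract_pages`, `cloud_fetch`, and `local_fetch` are hypothetical names standing in for the `col.get_page_content` call and the local pymupdf path, and the `{page, content}` shape is taken from the test plan.

```python
import os
from typing import Callable, Dict, List

Page = Dict[str, object]  # assumed shape: {"page": int, "content": str}


def extract_pages(
    cloud_fetch: Callable[[], List[Page]],
    local_fetch: Callable[[], List[Page]],
    api_key_env: str = "PAGEINDEX_API_KEY",
) -> List[Page]:
    """Use cloud markdown when an API key is set; fall back to local
    extraction if the key is missing, the call raises, or it returns nothing."""
    if os.environ.get(api_key_env):
        try:
            pages = cloud_fetch()
            if pages:  # an empty result counts as a failure
                return pages
        except Exception:
            # Any cloud error falls through to the local path below.
            pass
    return local_fetch()
```

Passing the two fetchers in as callables keeps the fallback decision testable without network access or a real PDF.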

Test plan

  • End-to-end run of index_long_document in cloud mode on a 4-page PDF: upload+poll ~27s, get_page_content ~2s, JSON written to wiki/sources/ with 4 {page, content} entries of high-quality markdown.
  • Confirmed fallback branch triggers when cloud call raises / returns empty.
  • Verified installed pageindex==0.3.0.dev1 contains the poll fix.
  • Smoke test openkb add on a long PDF end-to-end (convert → index → compile) in cloud mode.
  • Smoke test openkb add in local mode to confirm the non-cloud branch is unchanged.

rejojer added 5 commits April 10, 2026 20:51
Drop the language and pageindex_threshold prompts from `openkb init`;
both fall back to config defaults and can be edited later in
`.openkb/config.yaml`. In their place, add an interactive API key
prompt that writes `LLM_API_KEY` to `./.env` (chmod 0600) when the
user provides one, so first-time setup no longer requires a separate
manual step. Also polish the model prompt with provider examples and
a link to LiteLLM for others.
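
A rough sketch of the `.env` write described in this commit message, under stated assumptions: `write_api_key` is a hypothetical helper, not the function in `openkb`, and it overwrites any existing file, which the real command may not do.

```python
import os
from pathlib import Path


def write_api_key(env_path: Path, key: str, name: str = "LLM_API_KEY") -> bool:
    """Write NAME=key to env_path with mode 0600. Returns False (and writes
    nothing) when the user supplied an empty key."""
    key = key.strip()
    if not key:
        return False
    # Create the file with 0600 from the start so the key is never
    # briefly readable by other users.
    fd = os.open(env_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        fh.write(f"{name}={key}\n")
    # The mode passed to os.open only applies to newly created files,
    # so enforce it explicitly for a pre-existing .env as well.
    os.chmod(env_path, 0o600)
    return True
```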

Simplify init prompts and capture API key to .env

When PAGEINDEX_API_KEY is set, index_long_document now fetches
per-page markdown via col.get_page_content() instead of running
local pymupdf. Cloud OCR produces cleaner output (preserves
tables, math, and section headers) than raw pymupdf text
extraction. Falls back to local pymupdf if the cloud call raises
or returns an empty result.

Picks up the cloud add_document poll fix from VectifyAI/PageIndex#226,
which switches the readiness signal from retrieval_ready to
status == "completed".
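
To illustrate the readiness change, here is a minimal polling loop keyed on `status == "completed"`. It is a sketch only: `wait_until_completed` and `get_status` are hypothetical, the dict response shape is assumed, and the upstream implementation in pageindex may differ. The 600 s default mirrors the 10 min timeout mentioned in the PR description.

```python
import time
from typing import Callable, Dict


def wait_until_completed(
    get_status: Callable[[], Dict],
    timeout_s: float = 600.0,
    interval_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Dict:
    """Poll until the document reports status == "completed".

    The retrieval_ready flag is deliberately ignored: a document can be
    completed while that flag is still false.
    """
    deadline = clock() + timeout_s
    while True:
        doc = get_status()
        if doc.get("status") == "completed":
            return doc
        if clock() >= deadline:
            raise TimeoutError("document did not reach status 'completed'")
        sleep(interval_s)
```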

Move warnings.filterwarnings("ignore") to before the module imports
so pydub's missing-ffmpeg RuntimeWarning, emitted when markitdown
pulls it in, is suppressed. The existing post-import call is kept
because markitdown clobbers the filter state during its own import.
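
Why the position of the filter call matters can be shown with a small self-contained experiment. `emit_import_time_warning` is a hypothetical stand-in for a library that warns while it is being imported (pydub in this PR); the sketch does not reproduce markitdown resetting the filters, only the ordering effect.

```python
import warnings


def emit_import_time_warning() -> None:
    """Stand-in for a module that warns during import."""
    warnings.warn("ffmpeg not found", RuntimeWarning)


def count_leaked_warnings(filter_first: bool) -> int:
    """Return how many warnings escape when the 'ignore' filter is
    installed before (True) or only after (False) the noisy import."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # baseline: show everything
        if filter_first:
            warnings.filterwarnings("ignore")  # pre-import call
        emit_import_time_warning()
        warnings.filterwarnings("ignore")  # existing post-import call
    return len(caught)
```

With only the post-import call, the warning has already been emitted by the time the filter is installed, which matches the leak to stderr this commit addresses.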

rejojer (Member, Author) commented Apr 10, 2026

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.

rejojer merged commit 2e1caf9 into main on Apr 10, 2026