Drop the language and pageindex_threshold prompts from `openkb init`; both fall back to config defaults and can be edited later in `.openkb/config.yaml`. In their place, add an interactive API key prompt that writes `LLM_API_KEY` to `./.env` (chmod 0600) when the user provides one, so first-time setup no longer requires a separate manual step. Also polish the model prompt with provider examples and a link to LiteLLM for other providers.
Simplify init prompts and capture API key to .env
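For reviewers who want to picture the `.env` step, here is a minimal sketch of writing the key with 0600 permissions. It is not the code in this PR: the helper name `save_api_key` is hypothetical, and keeping unrelated lines already present in the file is an assumption about the desired behaviour.

```python
import os
from pathlib import Path


def save_api_key(api_key: str, env_path: Path = Path(".env")) -> bool:
    """Write LLM_API_KEY to env_path with mode 0600 (hypothetical helper).

    Returns False and writes nothing when the user supplied no key.
    """
    api_key = api_key.strip()
    if not api_key:
        return False  # blank answer: leave any existing file untouched

    # Assumption: keep unrelated entries and replace a previous LLM_API_KEY.
    lines = []
    if env_path.exists():
        lines = [
            line
            for line in env_path.read_text().splitlines()
            if not line.startswith("LLM_API_KEY=")
        ]
    lines.append(f"LLM_API_KEY={api_key}")

    # Create the file as 0600 from the start so the key is never written
    # to a file with looser permissions.
    fd = os.open(env_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as fh:
        fh.write("\n".join(lines) + "\n")
    os.chmod(env_path, 0o600)  # also tightens a file that already existed
    return True
```

Passing the mode to `os.open` and then calling `os.chmod` covers both a fresh file and one that already existed with wider permissions.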
When PAGEINDEX_API_KEY is set, index_long_document now fetches per-page markdown via col.get_page_content() instead of running local pymupdf. Cloud OCR produces cleaner output (preserves tables, math, and section headers) than raw pymupdf text extraction. Falls back to local pymupdf if the cloud call raises or returns an empty result.
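The cloud-then-local fallback described above can be sketched as follows. This is an illustration under assumptions, not the body of `index_long_document`: `pages_with_fallback`, `fetch_cloud`, and `extract_local` are hypothetical names, and the real `col.get_page_content()` and pymupdf calls would be wrapped by the two callables.

```python
import logging
from typing import Callable, Dict, List

logger = logging.getLogger(__name__)

Page = Dict[str, object]  # e.g. {"page": 1, "content": "# Title ..."}


def pages_with_fallback(
    fetch_cloud: Callable[[], List[Page]],
    extract_local: Callable[[], List[Page]],
    cloud_enabled: bool,
) -> List[Page]:
    """Prefer cloud pages; use local extraction on error or empty result."""
    if cloud_enabled:
        try:
            pages = fetch_cloud()
            if pages:
                return pages
            logger.warning("cloud returned no pages; using local extraction")
        except Exception:
            # Any cloud failure is non-fatal: log it and fall through.
            logger.warning("cloud fetch failed; using local extraction", exc_info=True)
    return extract_local()
```

In this sketch a caller would pass something like `cloud_enabled=bool(os.environ.get("PAGEINDEX_API_KEY"))`, so the local path stays untouched when no key is set.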
Picks up the cloud add_document poll fix from VectifyAI/PageIndex#226, which switches the readiness signal from retrieval_ready to status == "completed".
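To illustrate the readiness change, here is a rough sketch of a poll loop that keys off `status == "completed"` and ignores `retrieval_ready`. It is not the PageIndex implementation: `wait_until_completed` and `get_status` are hypothetical, the document is assumed to be a mapping with a `status` field, and the 2 second interval is arbitrary.

```python
import time
from typing import Callable, Mapping


def wait_until_completed(
    get_status: Callable[[], Mapping[str, object]],
    timeout_s: float = 600.0,  # 10 minutes
    interval_s: float = 2.0,   # arbitrary polling interval for this sketch
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until the document reports status == "completed".

    Returns True on completion and False if the timeout elapses first.
    The retrieval_ready flag is deliberately not consulted.
    """
    deadline = clock() + timeout_s
    while True:
        doc = get_status()
        if doc.get("status") == "completed":
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

The `sleep` and `clock` parameters exist only so the loop can be exercised without waiting in real time.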
Move warnings.filterwarnings("ignore") to before the module imports
so pydub's missing-ffmpeg RuntimeWarning, emitted when markitdown
pulls it in, is suppressed. The existing post-import call is kept
because markitdown clobbers the filter state during its own import.
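The two-call pattern can be shown with a small self-contained sketch. `noisy_import` is a hypothetical stand-in for an import that both warns and resets filter state; the reset is simulated with `warnings.resetwarnings()`, which is only an assumption about how the filters get lost.

```python
import warnings
from typing import Callable


def noisy_import() -> None:
    """Hypothetical stand-in for an import that warns and clobbers filters."""
    warnings.warn("Couldn't find ffmpeg or avconv", RuntimeWarning)
    warnings.resetwarnings()  # simulated: drops every installed filter


def quiet_import(import_fn: Callable[[], None]) -> None:
    """Install the ignore filter before and after a noisy import."""
    warnings.filterwarnings("ignore")  # hides warnings raised during the import
    import_fn()
    warnings.filterwarnings("ignore")  # re-installs the filter if it was reset
```

With only the post-import call, the import-time warning would already have been emitted; with only the pre-import call, later warnings would reappear once the filters are reset.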
Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

Generated with Claude Code

If this code review was useful, please react with 👍. Otherwise, react with 👎.
Summary
- `index_long_document` now fetches per-page markdown from PageIndex cloud via `col.get_page_content` when `PAGEINDEX_API_KEY` is set, falling back to local pymupdf on error or empty result. Cloud output preserves tables, math, and section headers that raw pymupdf text extraction loses.
- Bump `pageindex` to `0.3.0.dev1` to pick up VectifyAI/PageIndex#226 (fix: poll status=="completed" in cloud add_document), which fixes the cloud `add_document` poll to check `status == "completed"` instead of the unreliable `retrieval_ready` flag (previously caused `col.add()` to hang until the 10 min timeout on otherwise-successful uploads).
- Move `warnings.filterwarnings("ignore")` before `openkb.cli`'s module imports so pydub's missing-ffmpeg `RuntimeWarning`, emitted when markitdown pulls pydub in, no longer leaks to stderr. The existing post-import call is kept because markitdown clobbers filter state during its own import.
- Simplify init prompts and capture API key to `.env` (#13).

Test plan

- `index_long_document` in cloud mode on a 4-page PDF: upload+poll ~27s, `get_page_content` ~2s, JSON written to `wiki/sources/` with 4 `{page, content}` entries of high-quality markdown.
- `pageindex==0.3.0.dev1` contains the poll fix.
- `openkb add` on a long PDF end-to-end (convert → index → compile) in cloud mode.
- `openkb add` in local mode to confirm the non-cloud branch is unchanged.