feat: add PageIndex SDK with local/cloud dual-mode support#207

Merged
KylinMountain merged 26 commits into VectifyAI:dev from KylinMountain:feat/sdk
Apr 6, 2026

@KylinMountain
Collaborator

Summary

Unified Python SDK for document indexing and retrieval, supporting both self-hosted (local) and fully-managed (cloud) modes.

Highlights

  • Dual-mode client: LocalClient (self-hosted, user LLM key) / CloudClient (fully managed, no LLM key)
  • Collection-based multi-document management with SHA-256 dedup
  • Streaming query: col.query(stream=True) returns async-iterable QueryStream
  • Pluggable protocols: DocumentParser, StorageEngine (SQLite default)
  • Cloud backend: actual PageIndex API with SSE streaming via chat/completions
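
The SHA-256 dedup mentioned above could work roughly as follows. This is a minimal sketch, not the SDK's code: `DedupIndex` and `sha256_of_bytes` are hypothetical names, and the in-memory dict stands in for whatever the storage engine actually persists.

```python
import hashlib
from typing import Dict


def sha256_of_bytes(data: bytes) -> str:
    """Return the hex SHA-256 digest of a document's raw bytes."""
    return hashlib.sha256(data).hexdigest()


class DedupIndex:
    """Hypothetical in-memory map from content hash to document ID."""

    def __init__(self) -> None:
        self._by_hash: Dict[str, str] = {}

    def add(self, doc_id: str, data: bytes) -> str:
        """Register doc_id for these bytes, or return the ID already
        registered for identical content."""
        digest = sha256_of_bytes(data)
        # setdefault keeps the first doc_id seen for a given digest.
        return self._by_hash.setdefault(digest, doc_id)
```

Adding the same bytes twice returns the first document's ID, so a caller can skip re-indexing identical files.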

Usage

from pageindex import LocalClient, CloudClient

# Local
client = LocalClient()
col = client.collection()
col.add("paper.pdf")
col.query("What is this about?")

# Cloud
client = CloudClient(api_key="pi-xxx")
col = client.collection()
col.add("paper.pdf")
col.query("What is this about?", stream=True)
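
Since the summary describes the `stream=True` result as an async-iterable, it would presumably be consumed with `async for`. The sketch below uses a hypothetical `FakeQueryStream` in place of the real `QueryStream`, whose event type and methods are not shown in this PR; plain string chunks are an assumption.

```python
import asyncio
from typing import AsyncIterator, List


class FakeQueryStream:
    """Hypothetical stand-in for an async-iterable stream of text chunks."""

    def __init__(self, chunks: List[str]) -> None:
        self._chunks = chunks

    def __aiter__(self) -> AsyncIterator[str]:
        return self._generate()

    async def _generate(self) -> AsyncIterator[str]:
        for chunk in self._chunks:
            await asyncio.sleep(0)  # yield control, as a network stream would
            yield chunk


async def collect(stream) -> str:
    """Consume any async-iterable of strings and join the chunks."""
    parts = []
    async for chunk in stream:
        parts.append(chunk)
    return "".join(parts)


if __name__ == "__main__":
    print(asyncio.run(collect(FakeQueryStream(["Hello", ", ", "world"]))))
```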

@KylinMountain KylinMountain force-pushed the feat/sdk branch 4 times, most recently from f4ca4c5 to 1369cf1 Compare April 1, 2026 09:47
- Critical: preserve text in markdown structure for fallback retrieval
- Cloud: SSE response close, folder cache dict, truncate error body
- Cloud: filter internal tools, async-safe streaming via to_thread
- SQLite: multi-thread connection tracking, context manager
- Security: collection name validation, parse_pages range cap
- Polish: use count_tokens wrapper, _EXAMPLES_DIR naming, QueryStream public
- Backend protocol: add @runtime_checkable
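
For context on the last item, `@runtime_checkable` lets `isinstance()` work against a `typing.Protocol`. A minimal sketch follows; the `Backend` method names here (`add_document`, `query`) are assumptions for illustration, not the SDK's actual protocol.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class Backend(Protocol):
    """Hypothetical protocol; the method names are illustrative only."""

    def add_document(self, path: str) -> str: ...

    def query(self, question: str) -> str: ...


class InMemoryBackend:
    """Satisfies the protocol structurally, without inheriting from it."""

    def add_document(self, path: str) -> str:
        return f"doc:{path}"

    def query(self, question: str) -> str:
        return "stub answer"


class Incomplete:
    """Missing query(), so it does not satisfy the protocol."""

    def add_document(self, path: str) -> str:
        return path
```

Note that the runtime check only verifies that the methods exist; it does not compare signatures or return types.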
- Replace ConfigLoader + config.yaml with Pydantic IndexConfig
- Use bool for config flags (if_add_node_summary etc.) instead of "yes"/"no"
- Enable doc_description by default for better agent QA
- Early API key validation on LocalClient init via litellm provider detection
- Expose index_config parameter on LocalClient for advanced users
- Remove config.yaml dependency from pip package
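
The move from "yes"/"no" strings to booleans might look like the sketch below. To stay dependency-free it uses a stdlib dataclass as a stand-in for the Pydantic `IndexConfig`; the field name `if_add_doc_description`, the accepted string spellings, and the `from_legacy` helper are assumptions.

```python
from dataclasses import dataclass
from typing import Any, Dict

_TRUE = {"yes", "true", "1"}
_FALSE = {"no", "false", "0"}


def to_bool(value: Any) -> bool:
    """Convert a legacy "yes"/"no" style flag to a real bool."""
    if isinstance(value, bool):
        return value
    if isinstance(value, str):
        lowered = value.strip().lower()
        if lowered in _TRUE:
            return True
        if lowered in _FALSE:
            return False
    raise ValueError(f"cannot interpret {value!r} as a boolean flag")


@dataclass(frozen=True)
class IndexConfigSketch:
    """Stand-in for the Pydantic model; only two flags are shown."""

    if_add_node_summary: bool = True  # default here is an assumption
    if_add_doc_description: bool = True  # the PR enables doc_description by default

    @classmethod
    def from_legacy(cls, raw: Dict[str, Any]) -> "IndexConfigSketch":
        """Build a config from a dict that may still hold "yes"/"no" strings."""
        return cls(**{key: to_bool(value) for key, value in raw.items()})
```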
…aming

- Fix return type annotation: dict -> list (tree structure is a list)
- Fix not-found return: {} -> [] for consistency
- Cloud streaming: replace batch-then-yield with asyncio.Queue for
  true real-time event delivery via background thread
…n type, legacy API fix

- Remove client-side dedup in CloudBackend (server responsibility)
- Cloud streaming: real-time via asyncio.Queue instead of batch-then-yield
- Fix get_document_structure return type: dict -> list, not-found returns []
- Fix legacy page_index() API: use IndexConfig instead of deleted ConfigLoader
- Add folder upgrade warning (once only)
- Demo: always upload, no client-side caching
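
The switch from batch-then-yield to an `asyncio.Queue` fed by a background thread can be sketched as follows. This is an illustration of the general pattern under stated assumptions, not the SDK's implementation: `stream_from_thread` and `produce` are hypothetical names, and `produce` stands in for a blocking SSE reader.

```python
import asyncio
import threading
from typing import AsyncIterator, Callable, Iterable

_DONE = object()  # sentinel marking the end of the stream


async def stream_from_thread(
    produce: Callable[[], Iterable[str]],
) -> AsyncIterator[str]:
    """Yield events from a blocking producer as soon as each one arrives."""
    loop = asyncio.get_running_loop()
    queue: asyncio.Queue = asyncio.Queue()

    def worker() -> None:
        try:
            for event in produce():
                # asyncio.Queue is not thread-safe, so each put is
                # scheduled on the event loop's own thread.
                loop.call_soon_threadsafe(queue.put_nowait, event)
        finally:
            # Always signal completion, even if produce() raised.
            loop.call_soon_threadsafe(queue.put_nowait, _DONE)

    threading.Thread(target=worker, daemon=True).start()

    while True:
        item = await queue.get()
        if item is _DONE:
            break
        yield item
```

Compared with collecting every event in the thread and yielding afterwards, the consumer here sees each event as it is produced. This sketch does not propagate producer exceptions to the consumer; a fuller version would pass them through the queue.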
KylinMountain and others added 8 commits April 3, 2026 17:27
Local demo was missing LLM provider configuration, making it fail
on first run without clear guidance.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
….pageindex

Local-only params are now documented. Default storage_path changed from
~/.pageindex (global) to ./.pageindex (project-local) for better isolation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Was only defined on LocalClient but called from PageIndexClient._init_local(),
causing AttributeError when using PageIndexClient directly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…PI improvements

- Extract images from PDF pages preserving text-image reading order
  using pymupdf get_text("dict") blocks. Images saved to
  files/{collection}/{doc_id}/images/ with relative paths in content.
- Add get_document_structure() and get_page_content() to Collection public API
- get_document() now returns structure; add include_text param to populate
  node text from page cache (WARNING in docstring: not for agent/LLM use)
- delete_document() cleans up images directory
- Agent system prompt instructs LLM to preserve image references in answers

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
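
The image extraction described in this commit relies on the block list that PyMuPDF's `page.get_text("dict")` returns, where text blocks have `type` 0 and image blocks have `type` 1. The sketch below walks such a dict and interleaves text with image references in block order. The function name, the image file naming scheme, and the markdown-style `![image](...)` reference are assumptions; the sketch takes the dict as input so it runs without PyMuPDF installed and leaves writing the bytes to disk to the caller.

```python
from typing import Any, Dict, List, Tuple


def interleave_blocks(
    page_dict: Dict[str, Any], image_dir: str, page_no: int
) -> Tuple[str, List[Tuple[str, bytes]]]:
    """Return (content, images): content keeps text and image references in
    block order; images lists (relative_path, raw_bytes) pairs to save."""
    parts: List[str] = []
    images: List[Tuple[str, bytes]] = []
    for block in page_dict.get("blocks", []):
        if block.get("type") == 0:  # text block: join spans per line
            text = "\n".join(
                "".join(span["text"] for span in line.get("spans", []))
                for line in block.get("lines", [])
            ).strip()
            if text:
                parts.append(text)
        elif block.get("type") == 1:  # image block: raw bytes plus extension
            name = f"page{page_no}_img{len(images) + 1}.{block.get('ext', 'png')}"
            path = f"{image_dir}/{name}"
            images.append((path, block["image"]))
            parts.append(f"![image]({path})")
    return "\n\n".join(parts), images
```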
KylinMountain and others added 5 commits April 5, 2026 23:59
Allows callers to specify where extracted PDF images are saved.
Default behavior unchanged (internal .pageindex/files/.../images/).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- delete_document/delete_collection now clean up custom images_dir
- add_document failure path cleans up custom images_dir
- _init_local: explicit model/retrieve_model kwargs now override
  index_config dict values (was reversed)
- CloudBackend.get_document: inline structure fetch instead of calling
  get_document_structure (still 2 HTTP calls but avoids method indirection)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
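
The precedence fix for `_init_local` (explicit keyword arguments win over `index_config` values) amounts to the merge order sketched below. `merge_config` is a hypothetical helper, and treating `None` as "not provided" is an assumption.

```python
from typing import Any, Dict, Optional


def merge_config(
    index_config: Optional[Dict[str, Any]] = None,
    *,
    model: Optional[str] = None,
    retrieve_model: Optional[str] = None,
) -> Dict[str, Any]:
    """Start from index_config, then let explicit kwargs override it."""
    merged = dict(index_config or {})
    explicit = {"model": model, "retrieve_model": retrieve_model}
    # Only kwargs the caller actually passed (non-None) take precedence.
    merged.update({k: v for k, v in explicit.items() if v is not None})
    return merged
```

Applying the dict after the kwargs would reproduce the reversed behavior the commit describes.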
Images always stored internally at .pageindex/files/{collection}/{doc_id}/images/.
Simplifies delete/cleanup logic — no more dual-path handling.
Consumers that need images elsewhere should copy at render time.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Aligns with OpenAI Agents SDK requirement. No 3.11-specific features used.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Cloud API returns tree/ocr data in `result` field, but code only checked
`tree` and `structure` keys. Also normalizes cloud node schema
(page_index → start_index/end_index, prefix_summary → summary) and
OCR response (page_index → page, markdown → content) to match local format.
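
A rough sketch of this normalization, under several assumptions: the helper names are hypothetical, child nodes are assumed to live under a `nodes` key, a single cloud `page_index` is assumed to map to both `start_index` and `end_index`, and the lookup order of the `result`, `tree`, and `structure` keys is a guess.

```python
from typing import Any, Dict, List


def extract_tree(payload: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Find the tree under `result`, `tree`, or `structure`; [] if absent."""
    for key in ("result", "tree", "structure"):
        value = payload.get(key)
        if isinstance(value, list):
            return value
    return []


def normalize_node(node: Dict[str, Any]) -> Dict[str, Any]:
    """Rename cloud node fields to the local schema, recursing into children."""
    renamed = ("page_index", "prefix_summary", "nodes")
    out = {k: v for k, v in node.items() if k not in renamed}
    if "page_index" in node:
        out.setdefault("start_index", node["page_index"])
        out.setdefault("end_index", node["page_index"])
    if "prefix_summary" in node:
        out.setdefault("summary", node["prefix_summary"])
    if "nodes" in node:
        out["nodes"] = [normalize_node(child) for child in node["nodes"]]
    return out


def normalize_ocr_page(page: Dict[str, Any]) -> Dict[str, Any]:
    """Rename cloud OCR fields: page_index -> page, markdown -> content."""
    return {"page": page["page_index"], "content": page["markdown"]}
```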
@KylinMountain KylinMountain changed the base branch from main to dev April 6, 2026 14:47
@KylinMountain KylinMountain marked this pull request as ready for review April 6, 2026 14:47
@claude claude bot left a comment

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@KylinMountain KylinMountain merged commit b63fd97 into VectifyAI:dev Apr 6, 2026