v1.1.0

Latest

@allen2c released this 06 Jul 07:37
· 78 commits to main since this release
8bff5f5

✨ Feat: Introduce Automatic Document Chunking

This release adds automatic document chunking! 🧩 Documents are now split into smaller pieces before embedding, so each vector covers a narrower span of text and vector search can return more precise, relevant results.


🚀 What's New?

  • Automatic Document Splitting: When using dvs.add(), documents are now automatically chunked based on line count or token count. This is handled by the new chunkle and tiktoken dependencies.
  • New Chunking Controls: The dvs.add() method gets new parameters to customize how documents are split:
    • lines_per_chunk
    • tokens_per_chunk
  • Smarter Document & Point Management:
    • The Document type now includes fields like source_id, chunk_index, and is_chunk to track the relationship between chunks and their original source. 🔗
    • Embeddings and database points are now created in batches from the resulting chunks, which makes processing more efficient.
  • Database Indexing: The documents table is now indexed by source_id to allow for quickly finding all chunks related to a single parent document.
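As a rough illustration of the splitting behaviour described above, here is a minimal sketch in plain Python. It is not the library's implementation: the `Document` dataclass is a simplified, hypothetical stand-in that only carries the fields named in these notes, `chunk_document` and `count_tokens` are made-up helper names, and a whitespace split stands in for the tiktoken-based token count. It also assumes a chunk is closed as soon as either limit would be exceeded, which these notes do not confirm.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Document:
    # Hypothetical, simplified stand-in for the library's Document type.
    document_id: str
    content: str
    source_id: Optional[str] = None    # id of the parent document, set on chunks
    chunk_index: Optional[int] = None  # position of the chunk within its parent
    is_chunk: bool = False


def count_tokens(text: str) -> int:
    # Assumption: whitespace split as a stand-in for a tiktoken-based count.
    return len(text.split())


def chunk_document(
    doc: Document,
    lines_per_chunk: int = 20,
    tokens_per_chunk: int = 500,
    token_counter: Callable[[str], int] = count_tokens,
) -> List[Document]:
    """Split doc into chunks, closing a chunk when either limit would be exceeded."""
    chunks: List[Document] = []
    buffer: List[str] = []
    buffer_tokens = 0

    def flush() -> None:
        nonlocal buffer, buffer_tokens
        if buffer:
            index = len(chunks)
            chunks.append(
                Document(
                    document_id=f"{doc.document_id}-{index}",
                    content="".join(buffer),
                    source_id=doc.document_id,
                    chunk_index=index,
                    is_chunk=True,
                )
            )
            buffer = []
            buffer_tokens = 0

    for line in doc.content.splitlines(keepends=True):
        line_tokens = token_counter(line)
        too_many_lines = len(buffer) >= lines_per_chunk
        too_many_tokens = buffer_tokens + line_tokens > tokens_per_chunk
        if buffer and (too_many_lines or too_many_tokens):
            flush()
        buffer.append(line)
        buffer_tokens += line_tokens
    flush()  # emit whatever is left in the buffer
    return chunks
```

Each chunk keeps a `source_id` pointing back at its parent, so joining the chunks in `chunk_index` order reproduces the original text. A single line that exceeds the token limit on its own still becomes one chunk in this sketch.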

🔧 Key Changes

  • dvs.add() Refactor: The core logic is updated to first chunk documents and then process these chunks for embedding and storage.
  • Document Type: Gains new fields, including source_id, chunk_index, and is_chunk, to support chunking and token counting.
  • Dependencies: Added chunkle and tiktoken to pyproject.toml and requirements files.
  • Version Bump: Updated project version from 1.0.0 to 1.1.0. 📦
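
The chunk-then-process flow and the source_id index listed above can be sketched as follows. This is an illustration under stated assumptions, not the project's code: Python's built-in `sqlite3` stands in for the actual database, the table layouts are guesses that only include the fields named in these notes, and `fake_embed`, `batched`, `add_chunks`, and `chunks_for_source` are hypothetical helpers.

```python
import json
import sqlite3
from typing import Iterator, List, Sequence, Tuple

# (document_id, source_id, chunk_index, content)
ChunkRow = Tuple[str, str, int, str]


def batched(items: Sequence[ChunkRow], size: int) -> Iterator[Sequence[ChunkRow]]:
    # Yield consecutive slices of at most `size` items.
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fake_embed(texts: List[str]) -> List[List[float]]:
    # Placeholder embedder: a real setup would call an embedding model here.
    return [[float(len(t)), float(t.count(" "))] for t in texts]


def add_chunks(conn: sqlite3.Connection, chunks: List[ChunkRow], batch_size: int = 2) -> int:
    """Store chunks and their embeddings batch by batch; return the batch count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS documents ("
        "document_id TEXT PRIMARY KEY, source_id TEXT, "
        "chunk_index INTEGER, is_chunk INTEGER, content TEXT)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS points ("
        "point_id TEXT PRIMARY KEY, document_id TEXT, embedding TEXT)"
    )
    # Index on source_id so all chunks of one parent can be found quickly.
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_documents_source_id ON documents (source_id)"
    )
    n_batches = 0
    for batch in batched(chunks, batch_size):
        vectors = fake_embed([row[3] for row in batch])  # one call per batch
        conn.executemany("INSERT INTO documents VALUES (?, ?, ?, 1, ?)", batch)
        conn.executemany(
            "INSERT INTO points VALUES (?, ?, ?)",
            [(f"pt-{row[0]}", row[0], json.dumps(vec)) for row, vec in zip(batch, vectors)],
        )
        n_batches += 1
    conn.commit()
    return n_batches


def chunks_for_source(conn: sqlite3.Connection, source_id: str) -> List[str]:
    # Fetch every chunk of one parent document, in original order.
    rows = conn.execute(
        "SELECT content FROM documents WHERE source_id = ? ORDER BY chunk_index",
        (source_id,),
    )
    return [content for (content,) in rows]
```

Embedding per batch rather than per chunk keeps the number of model calls low, and the index means a lookup by `source_id` does not need to scan the whole documents table.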