✨ Feat: Introduce Automatic Document Chunking
This update rolls out a major enhancement: automatic document chunking! 🧩 Now, documents are intelligently split into smaller, more manageable pieces before embedding, leading to more precise and relevant vector search results.
🚀 What's New?
- Automatic Document Splitting: When using `dvs.add()`, documents are now automatically chunked based on line count or token count. This is handled by the new `chunkle` and `tiktoken` dependencies.
- New Chunking Controls: The `dvs.add()` method gets new parameters to customize how documents are split:
  - `lines_per_chunk`
  - `tokens_per_chunk`
- Smarter Document & Point Management:
  - The `Document` type now includes fields like `source_id`, `chunk_index`, and `is_chunk` to track the relationship between chunks and their original source. 🔗
  - Processing is now more efficient, creating embeddings and database points in batches based on the new chunks.
- Database Indexing: The documents table is now indexed by `source_id` to allow for quickly finding all chunks related to a single parent document.
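
To make the chunk-to-source relationship concrete, here is a minimal sketch of line-based splitting. It is a naive fixed-size splitter written for illustration, not the algorithm `chunkle` uses, and `ChunkRecord` and `chunk_by_lines` are hypothetical names rather than part of the `dvs` API. Only the field names `source_id`, `chunk_index`, and `is_chunk` and the parameter name `lines_per_chunk` come from this PR.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ChunkRecord:
    # Hypothetical stand-in for the Document type; only these three
    # tracking fields are named in the PR description.
    source_id: str
    chunk_index: int
    is_chunk: bool
    content: str


def chunk_by_lines(source_id: str, text: str, lines_per_chunk: int = 40) -> List[ChunkRecord]:
    """Split text into consecutive groups of at most lines_per_chunk lines."""
    if lines_per_chunk < 1:
        raise ValueError("lines_per_chunk must be >= 1")
    lines = text.splitlines()
    chunks: List[ChunkRecord] = []
    # Step through the lines in fixed-size windows; the window number
    # becomes chunk_index so the original order can be rebuilt later.
    for index, start in enumerate(range(0, len(lines), lines_per_chunk)):
        body = "\n".join(lines[start:start + lines_per_chunk])
        chunks.append(
            ChunkRecord(
                source_id=source_id,
                chunk_index=index,
                is_chunk=True,
                content=body,
            )
        )
    return chunks
```

Every chunk carries the same `source_id`, so sorting the chunks of one source by `chunk_index` and joining their `content` with newlines recovers the original line sequence.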
🔧 Key Changes
- `dvs.add()` Refactor: The core logic is updated to first chunk documents and then process these chunks for embedding and storage.
- `Document` Type: Enhanced with new fields to support chunking and token counting.
- Dependencies: Added `chunkle` and `tiktoken` to `pyproject.toml` and requirements files.
- Version Bump: Updated project version from `1.0.0` to `1.1.0`. 📦
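
The chunk-then-embed-then-store flow and the `source_id` index can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual `dvs.add()` implementation: SQLite stands in for the project's database, `fake_embed` is a placeholder for a real embedding call, and the table layout, index name, and all function names are hypothetical.

```python
import json
import sqlite3
from typing import Iterator, List, Sequence, Tuple

# (source_id, chunk_index, content); a simplified chunk shape assumed for this sketch.
Chunk = Tuple[str, int, str]


def batched(items: Sequence[Chunk], batch_size: int) -> Iterator[Sequence[Chunk]]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def fake_embed(texts: Sequence[str]) -> List[List[float]]:
    # Placeholder: a real pipeline would call an embedding model here.
    return [[float(len(text))] for text in texts]


def store_chunks(conn: sqlite3.Connection, chunks: Sequence[Chunk], batch_size: int = 2) -> int:
    """Embed and insert chunks batch by batch; return the number of batches."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS documents "
        "(source_id TEXT, chunk_index INTEGER, content TEXT, embedding TEXT)"
    )
    # Index on source_id so all chunks of one parent can be found quickly.
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_documents_source_id ON documents (source_id)"
    )
    batches = 0
    for batch in batched(chunks, batch_size):
        # One embedding call per batch instead of one per chunk.
        vectors = fake_embed([content for _, _, content in batch])
        rows = [
            (source_id, chunk_index, content, json.dumps(vector))
            for (source_id, chunk_index, content), vector in zip(batch, vectors)
        ]
        conn.executemany("INSERT INTO documents VALUES (?, ?, ?, ?)", rows)
        batches += 1
    conn.commit()
    return batches


def chunks_for_source(conn: sqlite3.Connection, source_id: str) -> List[str]:
    """Return the contents of every chunk of one source, in chunk order."""
    cursor = conn.execute(
        "SELECT content FROM documents WHERE source_id = ? ORDER BY chunk_index",
        (source_id,),
    )
    return [row[0] for row in cursor.fetchall()]
```

With an in-memory connection (`sqlite3.connect(":memory:")`), calling `store_chunks` on three chunks with `batch_size=2` makes two embedding calls, and `chunks_for_source` then returns the chunks of a single parent document in order.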