feat: chunk-aware hybrid search in all text adapters (#854) #1173
Open
pyramation wants to merge 3 commits into
…bedding (#856) Adds search_indexes to ProcessChunks parameter_schema and ProcessFileEmbedding's chunks sub-config with default ['fulltext']. Enables hybrid RAG by opting into fulltext (tsvector), bm25, or trigram search on the chunks content column. Companion to constructive-io/constructive-db#1164
)
- Extract shared getChunksInfo/ChunksInfo into adapters/chunks.ts
- tsvector adapter: lateral subquery for MAX(ts_rank) across chunks
- BM25 adapter: lateral subquery for MIN(bm25_score) across chunks
- trgm adapter: lateral subquery for MAX(similarity) across chunks
- All adapters respect includeChunks option (default: true when @hasChunks present)
- pgvector adapter refactored to use shared ChunksInfo
- Integration tests: chunk-aware tsvector and trgm queries
- Setup SQL: add content, search (tsvector), BM25, trgm indexes on posts_chunks
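The per-adapter lateral subqueries listed in this commit could be assembled roughly as follows. This is a minimal sketch, not the code in this PR: `chunkAwareScore`, `ScoreOptions`, `SearchKind`, the `c` alias, and the field types are hypothetical, and it assumes searchIndexes holds the same names as the search_indexes options ('fulltext', 'bm25', 'trigram').

```typescript
// Hypothetical sketch: builds a chunk-aware score expression as a SQL string.
// Names, types, and the index-name mapping are assumptions, not the PR's code.
type SearchKind = 'tsvector' | 'bm25' | 'trgm';

interface ChunksInfo {
  chunksTable: string;
  parentFk: string;
  parentPk: string;
  contentField: string;
  searchField: string;
  searchIndexes: string[];
}

interface ScoreOptions {
  kind: SearchKind;
  parentAlias: string;    // alias of the parent table in the outer query
  parentScore: string;    // score expression computed on the parent row
  chunkScore: string;     // score expression computed on a chunk row (alias "c")
  chunks?: ChunksInfo;    // absent when the table has no @hasChunks tag
  includeChunks?: boolean; // defaults to true when chunks metadata is present
}

// Assumed mapping from adapter kind to the index name expected in searchIndexes.
const REQUIRED_INDEX: Record<SearchKind, string> = {
  tsvector: 'fulltext',
  bm25: 'bm25',
  trgm: 'trigram',
};

function chunkAwareScore(opts: ScoreOptions): string {
  const { kind, parentAlias, parentScore, chunkScore, chunks } = opts;
  const includeChunks = opts.includeChunks ?? true;

  // Fall back to the parent-only score when chunk querying is disabled
  // or the chunks table lacks the relevant index type.
  if (!includeChunks || !chunks || !chunks.searchIndexes.includes(REQUIRED_INDEX[kind])) {
    return parentScore;
  }

  // BM25 scores are "lower is better"; ts_rank and similarity are "higher is better".
  const lowerIsBetter = kind === 'bm25';
  const aggregate = lowerIsBetter ? 'MIN' : 'MAX';
  const combine = lowerIsBetter ? 'LEAST' : 'GREATEST';

  // Correlated subquery over the chunks that belong to the current parent row.
  // Identifiers are interpolated unquoted here; real code should quote them.
  const subquery =
    `(SELECT ${aggregate}(${chunkScore}) FROM ${chunks.chunksTable} c ` +
    `WHERE c.${chunks.parentFk} = ${parentAlias}.${chunks.parentPk})`;

  return `${combine}(${parentScore}, ${subquery})`;
}
```

In PostgreSQL, GREATEST and LEAST ignore NULL arguments, so a parent with no chunks still gets its own score from an expression shaped like this.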
Summary
Implements chunk-aware text search across all 4 search adapters (tsvector, BM25, trgm, pgvector) via the @hasChunks smart tag (#854). This enables hybrid search: applications can query both vector and text indexes on chunks tables simultaneously, then combine results with RRF or other fusion strategies.

New shared module, adapters/chunks.ts:
- ChunksInfo interface with all chunk metadata fields: chunksTable, parentFk, parentPk, embeddingField, contentField, searchField, searchIndexes
- getChunksInfo(codec): extracts and validates chunk metadata from the @hasChunks smart tag

Adapter changes (all adapters respect the includeChunks option):
- tsvector: (SELECT MAX(ts_rank(search, tsquery)) FROM chunks WHERE fk = parent.id), combined as GREATEST(parent, chunk); higher is better
- BM25: (SELECT MIN(score <@> bm25query) FROM chunks WHERE fk = parent.id), combined as LEAST(parent, chunk); lower is better
- trgm: (SELECT MAX(similarity(content, value)) FROM chunks WHERE fk = parent.id), combined as GREATEST(parent, chunk); higher is better
- pgvector: refactored to use the shared ChunksInfo; combined as LEAST(parent, chunk); lower is better

Each adapter checks searchIndexes from @hasChunks to verify the chunks table has the relevant index type before enabling chunk-aware querying.

Integration tests:
- finds parent via chunk tsvector match (term only in chunks): verifies the parent is found when the search term only exists in chunk content
- finds parent via chunk trgm similarity (term only in chunks): verifies fuzzy matching through chunks
- Setup SQL adds content, search (tsvector), BM25, and trgm indexes on the posts_chunks table

Also includes the prerequisite #856 work (search_indexes parameter on ProcessChunks/ProcessFileEmbedding).
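The Summary mentions combining vector and text results with RRF on the application side. A minimal sketch of reciprocal rank fusion follows, assuming each search returns an ordered list of unique parent ids, best first. The function name `rrfFuse` and the constant k = 60 (a commonly used default) are illustrative and not part of this PR.

```typescript
// Hypothetical application-side helper; not part of the adapters in this PR.
interface FusedResult {
  id: string;
  score: number;
}

// Reciprocal rank fusion: each list contributes 1 / (k + rank) per id,
// where rank is 1-based. Ids that rank well in several lists rise to the top.
function rrfFuse(rankedLists: string[][], k = 60): FusedResult[] {
  const scores = new Map<string, number>();

  for (const list of rankedLists) {
    list.forEach((id, index) => {
      const contribution = 1 / (k + index + 1);
      scores.set(id, (scores.get(id) ?? 0) + contribution);
    });
  }

  // Highest fused score first.
  return Array.from(scores, ([id, score]) => ({ id, score })).sort(
    (a, b) => b.score - a.score
  );
}

// Example: ids ordered by a vector search and by a text search.
const vectorHits = ['a', 'b', 'c'];
const textHits = ['b', 'c', 'a'];
const fused = rrfFuse([vectorHits, textHits]);
console.log(fused.map((r) => r.id).join(','));
```

Because RRF uses only ranks, it does not need the "higher is better" and "lower is better" scores of the different adapters to be on a common scale.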
Review & Testing Checklist for Human
Medium-high risk: core search infrastructure change affecting all 4 adapters.
- includeChunks: false correctly disables chunk querying for all text adapters (tsvector, BM25, trgm)
- … {chunks_table}_{content_field}_bm25_idx) matches what the DB generator creates

Notes
- … @hasChunks smart tag with the search metadata these adapters consume

Link to Devin session: https://app.devin.ai/sessions/2b5a29d83d3f478e8d3d972653b4879c
Requested by: @pyramation