Skip to content

feat: add FK-aware schema indexing with metadata across vector stores#9

Merged
Zay-M3 merged 1 commit intomainfrom
feat/fk-aware-schema-indexing
Apr 4, 2026
Merged

feat: add FK-aware schema indexing with metadata across vector stores#9
Zay-M3 merged 1 commit intomainfrom
feat/fk-aware-schema-indexing

Conversation

@Zay-M3
Copy link
Copy Markdown
Owner

@Zay-M3 Zay-M3 commented Apr 4, 2026

Summary by CodeRabbit

Release Notes

  • New Features

    • Database relationship and foreign-key information are now extracted and indexed alongside table definitions for more comprehensive schema awareness.
    • Vector search now independently queries tables and relationships, merging results for better relevance.
  • Improvements

    • Prompt generation now uses chat-style message formatting for improved SQL generation and business query handling.
    • Metadata filtering added to vector store operations for more granular document organization and retrieval.

@Zay-M3 Zay-M3 self-assigned this Apr 4, 2026
@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Apr 4, 2026

Caution

Review failed

Pull request was closed or merged during review

📝 Walkthrough

Walkthrough

This PR refactors the vector indexing pipeline to support document-based indexing with metadata and relationship tracking. It introduces structured document payloads containing id, content, and metadata fields; updates vector store interfaces to enable kind-based filtering; refactors schema extraction to return both tables and relationships; and converts prompt generation from string output to role-based message lists.

Changes

Cohort / File(s) Summary
Schema Extraction & Relationship Discovery
naturalsql/sql/sqlschema.py
Changed extract_schema() return format to bundle {"tables": {...}, "relationships": [...]} with relationship edge extraction via new _parse_relationship_rows(). Updated formated_for_ia() to accept schema bundle and return list[dict] with {id, content, metadata} entries for both tables and relationships instead of plain strings.
Vector Store Base Interface
naturalsql/vector/stores/base.py
Extended upsert() with optional metadatas: List[dict[str, Any]] | None parameter and query() with optional kind: str | None filter parameter for kind-based document retrieval.
Vector Store Implementations
naturalsql/vector/stores/chroma_store.py, naturalsql/vector/stores/sqlite_store.py
Implemented metadata persistence via metadatas parameter in upsert() and kind-based filtering via where clause in query(). SQLite additionally added metadata_json column, runtime schema migration, and foreign key enforcement.
Vector Manager & Indexing
naturalsql/controller/controllervector.py
Added new index_documents(documents_payload: list[dict[str, Any]]) method accepting document dicts with {id, content, metadata}. Refactored index_tables() as compatibility wrapper delegating to index_documents(). Updated search_relevant_tables() to query separately by kind="table" and kind="relationship", merge/sort results by distance, and enforce limit on collected results.
API & Integration
naturalsql/api.py
Updated build_vector_db() to invoke vm.index_documents(documents_payload) instead of vm.index_tables(formatted), computing payload via extractor.formated_for_ia(schema_bundle) and reporting indexed document count via len(documents_payload).
Prompt Generation
naturalsql/utils/prompt.py
Changed build_prompt() and prompt_query() return types from str to list[dict[str, str]] (role-based message format). Added schema-wrapping in <schema> tags as separate system message, consolidated SQL constraints into "Mandatory rules" block, and updated prompt_query() to defensively read response fields via .get() with fallback for missing keys.

Sequence Diagram

sequenceDiagram
    participant API as NaturalSQL.build_vector_db()
    participant Extractor as SQLSchemaExtractor
    participant VectorMgr as VectorManager
    participant Store as VectorStore (Chroma/SQLite)
    
    API->>Extractor: extract_schema()
    Extractor-->>API: {"tables": {...}, "relationships": [...]}
    
    API->>Extractor: formated_for_ia(schema_bundle)
    Extractor-->>API: [{id, content, metadata}, ...]<br/>(documents_payload)
    
    API->>VectorMgr: index_documents(documents_payload)
    VectorMgr->>VectorMgr: Build embeddings for each document
    
    VectorMgr->>Store: upsert(documents, ids, embeddings, metadatas)
    Store->>Store: Persist with kind filtering (table/relationship)
    Store-->>VectorMgr: Success
    
    Note over API,Store: Later: search_relevant_tables()
    VectorMgr->>Store: query(embedding, limit, kind="table")
    Store-->>VectorMgr: Table results
    
    VectorMgr->>Store: query(embedding, limit, kind="relationship")
    Store-->>VectorMgr: Relationship results
    
    VectorMgr->>VectorMgr: Merge & sort by distance
    VectorMgr-->>API: Ranked results
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • PR #5: Introduces overlapping vector subsystem modifications (VectorStore/Chroma/SQLite upsert/query signatures, VectorManager index_documents flow, document payload patterns).
  • PR #8: Modifies the same prompt generation functions (build_prompt, prompt_query) with overlapping signature and behavior changes.
  • PR #2: Shares core flow changes in schema extraction and vector manager indexing (naturalsql/sql/sqlschema.py and naturalsql/controller/controllervector.py).

Poem

🐰 Hops through vectors with glee,
Documents and metadata dance free,
Tables and relationships entwined,
A structured schema, so refined—
The burrow's indexing, redesigned!

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main changes: adding FK-aware (foreign key) schema indexing with metadata support across vector stores, which is reflected in the schema extraction updates, metadata handling, and vector store enhancements throughout all modified files.
Docstring Coverage ✅ Passed Docstring coverage is 83.33% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/fk-aware-schema-indexing

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@Zay-M3 Zay-M3 merged commit 588cab3 into main Apr 4, 2026
1 check was pending
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant