# Vector Store Usage

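This notebook demonstrates how to use the `VectorStore` class from `nl2sql.knowledge_base.vector_store`: building documents from a data dictionary and from SQL examples, adding them to the store, and running similarity searches with and without scores and metadata filters.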

In [1]:
# Settings: run from the project root so relative paths such as configs/database.yml resolve
import os

if os.getcwd().endswith("notebooks"):
    os.chdir("..")
print(os.getcwd())

/Users/cmcoutosilva/Projects/github/nl2sql-agent
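
The `endswith("notebooks")` check above only works when the notebook is started exactly one level below the project root. A slightly more general approach, sketched here as an assumption rather than as part of `nl2sql`, is to walk up the directory tree until a known marker is found. The helper name `find_project_root` is hypothetical, and the `configs` marker is assumed from the `configs/database.yml` path used later in this notebook.

```python
from pathlib import Path


def find_project_root(start: Path, marker: str = "configs") -> Path:
    """Return the nearest ancestor of `start` (inclusive) that contains `marker`.

    Hypothetical helper: `marker` is assumed to be a file or directory
    that exists only at the project root.
    """
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).exists():
            return candidate
    raise FileNotFoundError(f"No ancestor of {start} contains {marker!r}")
```

It could then replace the check in the cell above with `os.chdir(find_project_root(Path.cwd()))`.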


In [2]:
from nl2sql.database.postgresql import PostgreSQLConnector
from nl2sql.knowledge_base.vector_store import VectorStore
from nl2sql.knowledge_base.data_dictionary import DataDictionary
from nl2sql.knowledge_base.sql_examples import SQLExample
from nl2sql.config import load_schema_config
from nl2sql.utils import print_section

## Initialize Vector Store

In [3]:
# Set up database connector
db_connector = PostgreSQLConnector(config_path="configs/database.yml")

# Initialize vector store
vector_store = VectorStore(
    db_connector,
    collection_name="nl2sql_demo_embeddings"
)

print("Vector store initialized successfully")

Vector store initialized successfully


## Load Knowledge Base Data

In [4]:
# Load data dictionary
data_dictionary = DataDictionary.from_inspector(
    inspector=db_connector.inspector,
    database_schema=load_schema_config()
)

# Load SQL examples
sql_examples = SQLExample.from_yaml("knowledge/sql_examples.yml")

print(f"Loaded data dictionary with {len(data_dictionary.databases)} databases")
print(f"Loaded {len(sql_examples)} SQL examples")

Loaded data dictionary with 1 databases
Loaded 6 SQL examples


## Create Documents from Data Dictionary

In [5]:
# Get documents from data dictionary
schema_documents = vector_store.get_documents_from_data_dictionary(data_dictionary)

print(f"Created {len(schema_documents)} schema documents")
print_section("First Schema Document")
print(f"Content: {schema_documents[0].page_content[:200]}...")
print(f"Metadata: {schema_documents[0].metadata}")

Created 11 schema documents
First Schema Document
Content: TABLE: customers
DESCRIPTION: This dataset has information about the customer and its location. Use it to identify unique customers in the orders dataset and to find the orders delivery location. At o...
Metadata: {'type': 'schema', 'database': 'olist_ecommerce', 'schema': 'ecommerce', 'table': 'customers', 'primary_keys': 'customer_id', 'foreign_keys': ''}
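
The output above suggests that each table becomes one document: a text block starting with `TABLE:` and `DESCRIPTION:`, plus flat metadata in which key lists are stored as plain strings. The following is a minimal sketch of that shape, not the actual `get_documents_from_data_dictionary` implementation. The `Doc` class and `table_to_doc` function are hypothetical, and the comma separator and the exact text after `PRIMARY KEYS:` are assumptions, since the printed content is truncated.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for the document objects printed above."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def table_to_doc(database, schema, table, description, primary_keys, foreign_keys=()):
    """Build one searchable document per table (illustrative only)."""
    pk = ", ".join(primary_keys)  # assumption: keys joined into one string
    fk = ", ".join(foreign_keys)
    content = f"TABLE: {table}\nDESCRIPTION: {description}\nPRIMARY KEYS: {pk}"
    metadata = {
        "type": "schema",
        "database": database,
        "schema": schema,
        "table": table,
        "primary_keys": pk,
        "foreign_keys": fk,
    }
    return Doc(page_content=content, metadata=metadata)


doc = table_to_doc(
    "olist_ecommerce", "ecommerce", "customers",
    "Customer and location information.", ["customer_id"],
)
```

Keeping the table name and schema in metadata, separate from the embedded text, is what makes filtered searches by document type or table possible later on.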


## Create Documents from SQL Examples

In [6]:
# Get documents from SQL examples
sql_documents = vector_store.get_documents_from_sql_examples(sql_examples)

print(f"Created {len(sql_documents)} SQL example documents")
print_section("First SQL Document")
print(f"Content: {sql_documents[0].page_content}")
print(f"Metadata: {sql_documents[0].metadata}")

Created 6 SQL example documents
First SQL Document
Content: Question: How many orders are there in total?
```sql
SELECT COUNT(*)
FROM "ecommerce"."orders"
```
Metadata: {'type': 'example', 'title': 'total_orders'}
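
The printed document pairs the natural-language question with its SQL in a fenced block, so a user question can match the example question while the SQL travels along as context. A small sketch of that layout follows; the function names are hypothetical and this is not the library's `get_documents_from_sql_examples` code.

```python
# The fence marker is built programmatically so this snippet does not
# contain a literal nested code fence.
FENCE = "`" * 3


def example_to_content(question: str, sql: str) -> str:
    """Format a question/SQL pair in the layout shown in the output above."""
    return f"Question: {question}\n{FENCE}sql\n{sql.strip()}\n{FENCE}"


def example_to_metadata(title: str) -> dict:
    """Metadata keys mirror the printed output: a type tag and a title."""
    return {"type": "example", "title": title}


content = example_to_content(
    "How many orders are there in total?",
    'SELECT COUNT(*)\nFROM "ecommerce"."orders"\n',
)
metadata = example_to_metadata("total_orders")
```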


## Add Documents to Vector Store

In [7]:
# Combine all documents
all_documents = schema_documents + sql_documents

# Add to vector store
vector_store.add_documents(all_documents)

print(f"Added {len(all_documents)} documents to vector store")

Added 17 documents to vector store
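
Whether `add_documents` deduplicates on repeated runs is not shown in this notebook. If it does not, re-running the cell above would store the same 17 documents again. One common mitigation, sketched here under that assumption, is to derive a deterministic ID from each document's content and metadata and to skip IDs already seen. The helper names `stable_doc_id` and `dedupe` are hypothetical.

```python
import hashlib
import json


def stable_doc_id(page_content: str, metadata: dict) -> str:
    """Return a SHA-256 hex digest that depends only on content and metadata.

    Keys are sorted so that dict ordering does not change the ID.
    """
    payload = json.dumps(
        {"content": page_content, "metadata": metadata},
        sort_keys=True,
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def dedupe(docs):
    """Map each unique ID to the first (page_content, metadata) pair seen."""
    unique = {}
    for page_content, metadata in docs:
        unique.setdefault(stable_doc_id(page_content, metadata), (page_content, metadata))
    return unique
```

The resulting IDs could be passed to a store that accepts explicit document IDs, or used to filter the list before ingestion.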


## Search Similar Documents
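
Conceptually, a similarity search embeds the query and ranks the stored documents by how close their embeddings are to it. The sketch below illustrates only the ranking step, using cosine similarity over hand-written toy vectors. The labels and vectors are made up for illustration, and the metric actually used depends on how the store and its backend are configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, indexed, k=3):
    """Return the k (similarity, label) pairs most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), label) for label, vec in indexed]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy "embeddings": real ones come from an embedding model and have many more dimensions.
indexed = [
    ("example: total_orders", [0.9, 0.1, 0.0]),
    ("schema: orders", [0.7, 0.3, 0.1]),
    ("schema: customers", [0.1, 0.9, 0.2]),
]
results = top_k([1.0, 0.0, 0.0], indexed, k=2)
```

A real vector store typically replaces the linear scan with an index so that it does not compare the query against every stored vector.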

In [8]:
# Search for similar documents
query = "How many orders are there?"
similar_docs = vector_store.vectorstore.similarity_search(query, k=3)

print_section(f"Similar documents for: '{query}'")
for i, doc in enumerate(similar_docs, 1):
    print(f"\n{i}. Type: {doc.metadata.get('type', 'unknown')}")
    print(f"   Content: {doc.page_content[:120]}...")

Similar documents for: 'How many orders are there?'

1. Type: example
   Content: Question: How many orders are there in total?
```sql
SELECT COUNT(*)
FROM "ecommerce"."orders"
```...

2. Type: example
   Content: Question: What is the distribution of orders by status?
```sql
SELECT "order_status",
       COUNT(*)
FROM "ecommerce"."...

3. Type: schema
   Content: TABLE: orders
DESCRIPTION: This is the core dataset. From each order you might find all other information.
PRIMARY KEYS:...


In [9]:
# Search with scores (in the output below the lowest score ranks first, so the score behaves like a distance)
query = "Show me customer information"
similar_docs_with_scores = vector_store.vectorstore.similarity_search_with_score(query, k=3)

print_section(f"Similar documents with scores for: '{query}'")
for i, (doc, score) in enumerate(similar_docs_with_scores, 1):
    print(f"\n{i}. Score: {score:.4f}")
    print(f"   Type: {doc.metadata.get('type', 'unknown')}")
    print(f"   Content: {doc.page_content[:100]}...")

Similar documents with scores for: 'Show me customer information'

1. Score: 0.2706
   Type: example
   Content: Question: Show me orders with customer information
```sql
SELECT o."order_id",
       o."order_statu...

2. Score: 0.4814
   Type: example
   Content: Question: Which cities have the most customers?
```sql
SELECT "customer_city",
       COUNT(*)
FROM ...

3. Score: 0.5111
   Type: schema
   Content: TABLE: customers
DESCRIPTION: This dataset has information about the customer and its location. Use ...


## Search by Document Type

In [10]:
# Search only in schema documents
query = "customer table structure"
schema_results = vector_store.vectorstore.similarity_search(
    query,
    k=5,
    filter={"type": "schema"}
)

print_section(f"Schema documents for: '{query}'")
for i, doc in enumerate(schema_results, 1):
    print(f"\n{i}. Table: {doc.metadata.get('table', 'unknown')}")
    print(f"   Schema: {doc.metadata.get('schema', 'unknown')}")
    print(f"   Content: {doc.page_content[:100]}...")

Schema documents for: 'customer table structure'

1. Table: customers
   Schema: ecommerce
   Content: TABLE: customers
DESCRIPTION: This dataset has information about the customer and its location. Use ...

2. Table: orders
   Schema: ecommerce
   Content: TABLE: orders
DESCRIPTION: This is the core dataset. From each order you might find all other inform...

3. Table: order_reviews
   Schema: ecommerce
   Content: TABLE: order_reviews
DESCRIPTION: This dataset includes data about the reviews made by the customers...

4. Table: order_items
   Schema: ecommerce
   Content: TABLE: order_items
DESCRIPTION: This dataset includes data about the items purchased within each ord...

5. Table: order_payments
   Schema: ecommerce
   Content: TABLE: order_payments
DESCRIPTION: This dataset includes data about the orders payment options.
PRIM...
