## Vector DB

### 1- Embedding Model Selection

One crucial step is choosing an embedding model that fits the profile of our chunks, in particular the maximum number of tokens it accepts per input. OpenAIEmbedding takes care of this, but when we hand-pick a model, its input limit becomes important. Below are the chunk statistics from step 1.

| Metric         | Characters         | Tokens             |
|----------------|--------------------|--------------------|
| Minimum        | 172                | 58                 |
| Maximum        | 2472               | 746                |
| Mean           | 1625.64            | 500.89             |
| Median         | 1840.0             | 579.0              |
| Standard Dev.  | 639.48             | 193.15             |
| Number of Chunks | **579**          | **579**            |


(**Domain-Specific Embedding:** We could also try domain-specific embedding models and test how they perform.)


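Here, I don't want to use an API. Based on the average chunk size (~500–750 tokens), we use **BAAI/bge-base-en-v1.5**. Note that this model accepts at most 512 tokens per input, so if the token counts above are comparable to the model's own tokenizer, chunks longer than that (the maximum here is 746) will be truncated when embedded.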

In [1]:
from langchain_huggingface import HuggingFaceEmbeddings

embedding_model = HuggingFaceEmbeddings(
    model_name="BAAI/bge-base-en-v1.5",
    model_kwargs={'device': 'cpu'},  # or 'cuda' if available
    encode_kwargs={'normalize_embeddings': True}  # for cosine search
)
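
Before committing to a model, it can help to check how many chunks exceed its input limit, since text beyond the limit is typically cut off before embedding. The sketch below is only an illustration under stated assumptions: `truncation_report` and `whitespace_token_count` are hypothetical helpers, the default of 512 reflects the input limit of `BAAI/bge-base-en-v1.5`, and the whitespace counter is a crude stand-in for the model's real tokenizer.

```python
from typing import Callable, Dict, Iterable


def whitespace_token_count(text: str) -> int:
    """Stand-in token counter (assumption): counts whitespace-separated words."""
    return len(text.split())


def truncation_report(
    texts: Iterable[str],
    count_tokens: Callable[[str], int] = whitespace_token_count,
    max_tokens: int = 512,
) -> Dict[str, float]:
    """Summarize how many texts exceed max_tokens under the given counter."""
    counts = [count_tokens(t) for t in texts]
    over = [c for c in counts if c > max_tokens]
    return {
        "total": len(counts),
        "over_limit": len(over),
        "share_over_limit": len(over) / len(counts) if counts else 0.0,
        "longest": max(counts, default=0),
    }


# Toy input: one short text, one over the limit, one exactly at the limit
sample = ["short chunk", "word " * 600, "word " * 512]
report = truncation_report(sample)
print(report)
```

With a real tokenizer loaded for the chosen model, `count_tokens` could be replaced by something like `lambda t: len(tokenizer.encode(t))`, and `texts` by the `page_content` of each chunk.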

### 2- Create and Persist Chroma Vector Store



In [2]:

## Load Chunks
import pickle

with open("data/chunks.pkl", "rb") as f:
    token_chunks = pickle.load(f)

print(f"✅ Loaded {len(token_chunks)} chunks")



## Create and persist the Chroma vector store
from langchain_community.vectorstores import Chroma

chroma_db = Chroma.from_documents(
    documents=token_chunks,
    embedding=embedding_model,
    persist_directory="chroma_db",
    collection_metadata={
        "hnsw:space": "cosine"  # Set the similarity metric
    }
)

# With Chroma >= 0.4 the store is written to persist_directory automatically;
# chroma_db.persist() is deprecated there and only needed on older versions.
print("✅ Chroma vector store saved to 'chroma_db'")

✅ Loaded 579 chunks
✅ Chroma vector store saved to 'chroma_db'




### 3- Querying Chroma DB



In [None]:
chroma_db = Chroma(
    persist_directory="chroma_db",
    embedding_function=embedding_model,
    collection_metadata={
        "hnsw:space": "cosine"  # Set the similarity metric
    }
)

In [3]:

query = "How does the model deal with missing values?"
results = chroma_db.similarity_search(query, k=3)

for i, doc in enumerate(results):
    print(f"\n🔎 Result {i+1}")
    print(doc.page_content[:300])  # Preview first 300 chars
    print("Metadata:", doc.metadata)



🔎 Result 1
One	of	the	simplest	ways	to	fill	in	missing	values	is	to	carry	forward	the	last	known	value	prior
to	the	missing	one,	an	approach	known	as	
forward	fill
.	No	mathematics	or	complicated	logic	is
required.	Simply	consider	the	experience	of	moving	forward	in	time	with	the	data	that	was
available,	and	y
Metadata: {'creationdate': '2020-03-30T07:09:46+00:00', 'page_label': '38', 'source': 'data/ps.pdf', 'page': 37, 'producer': 'PDF Candy', 'moddate': '2020-03-30T07:09:46+00:00', 'total_pages': 365, 'creator': 'PyPDF'}

🔎 Result 2
One	of	the	simplest	ways	to	fill	in	missing	values	is	to	carry	forward	the	last	known	value	prior
to	the	missing	one,	an	approach	known	as	
forward	fill
.	No	mathematics	or	complicated	logic	is
required.	Simply	consider	the	experience	of	moving	forward	in	time	with	the	data	that	was
available,	and	y
Metadata: {'page_label': '38', 'creator': 'PyPDF', 'source': 'data/ps.pdf', 'total_pages': 365, 'producer': 'PDF Candy', 'creationdate': '2020-03-30T07:09:4
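
Results 1 and 2 in this output are the same passage. One possible cause (an assumption, not confirmed by the run above) is that the chunks were added to the persisted collection more than once, for example by re-running the ingestion cell. A minimal sketch of filtering duplicates out of a result list follows. `dedupe_by_content` and the `Doc` stand-in class are hypothetical, and only the `page_content` attribute of the real documents is relied on.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Iterable, List


@dataclass
class Doc:
    """Minimal stand-in for a retrieved document (assumption)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def dedupe_by_content(docs: Iterable[Doc]) -> List[Doc]:
    """Keep the first occurrence of each distinct page_content, preserving order."""
    seen = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha256(doc.page_content.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


results = [Doc("forward fill"), Doc("forward fill"), Doc("interpolation")]
unique_results = dedupe_by_content(results)
print([d.page_content for d in unique_results])
```

Because filtering can shrink the list, it may make sense to request a larger `k` than needed and trim afterwards.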

In [None]:

query = "How does the model deal with missing values?"
results = chroma_db.similarity_search(query, k=3)

for i, doc in enumerate(results):
    print(f"\n🔎 Result {i+1}")
    print(doc.page_content[:300])  # Preview first 300 chars
    print("Metadata:", doc.metadata)





🔎 Result 1
One	of	the	simplest	ways	to	fill	in	missing	values	is	to	carry	forward	the	last	known	value	prior
to	the	missing	one,	an	approach	known	as	
forward	fill
.	No	mathematics	or	complicated	logic	is
required.	Simply	consider	the	experience	of	moving	forward	in	time	with	the	data	that	was
available,	and	y
Metadata: {'producer': 'PDF Candy', 'source': 'data/ps.pdf', 'page': 37, 'page_label': '38', 'creationdate': '2020-03-30T07:09:46+00:00', 'moddate': '2020-03-30T07:09:46+00:00', 'creator': 'PyPDF', 'total_pages': 365}

🔎 Result 2
Figure	2-9.	
The	dashed	line	shows	the	linear	interpolation	while	the	dotted	line	shows	the	spline
interpolation.
There	are	many	situations	where	a	linear	(or	spline)	interpolation	is	appropriate.	Consider	mean
average	weekly	temperature	where	there	is	a	known	trend	of	rising	or	falling	temperatures
Metadata: {'page_label': '42', 'page': 41, 'creator': 'PyPDF', 'source': 'data/ps.pdf', 'moddate': '2020-03-30T07:09:46+00:00', 'producer': 'PDF Candy', 'to

### 4- Comparing Different Vector Databases

| Vector DB   | Open Source | Index Types Supported      | Metadata Filtering | Hybrid Search        | LangChain Support | Best Use Case                                     |
|-------------|-------------|-----------------------------|---------------------|-----------------------|--------------------|--------------------------------------------------|
| **Chroma**  | ✅ Yes      | HNSW-like ANN               | ✅ Yes              | ❌ No                | ✅ Native          | Local dev, rapid prototyping, simple pipelines   |
| **FAISS**   | ✅ Yes      | Flat, IVF, PQ, HNSW, OPQ    | ❌ No               | ❌ No                | ✅ Full            | In-memory high-speed search, no metadata needed  |
| **Weaviate**| ✅ Yes      | HNSW + optional BM25 Hybrid | ✅ Yes              | ✅ Yes (BM25 + Vector)| ✅ Native          | Hybrid semantic + keyword retrieval at scale     |

HNSW ANN: graph-based Approximate Nearest Neighbour search (HNSW = Hierarchical Navigable Small World)

### Index Type Comparison:

| Index Type | Type    | Description                                                                 | Accuracy        | Speed         | Memory Usage | Time Complexity     | Best Use Case                                         |
|------------|---------|-----------------------------------------------------------------------------|------------------|---------------|--------------|----------------------|--------------------------------------------------------|
| **Flat**   | Vector  | Exhaustive search (compares against all vectors)                            | ✅ Exact         | ❌ Slow       | ❌ High       | O(N)                | Small datasets where precision is critical             |
| **IVF**    | Vector  | Partitions vectors into clusters and searches only a few                    | ⚠️ Approximate   | ✅ Fast       | ✅ Medium     | O(N / K) or O(√N)    | Mid-scale, good speed-accuracy trade-off               |
| **HNSW**   | Vector  | Graph-based ANN with hierarchical layers and greedy traversal               | ✅ Very High     | ✅ Fastest    | ❌ High       | O(log N)            | High-scale, high-accuracy real-time search             |
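| **OPQ**    | Vector  | Learns a rotation of the vectors, then compresses them with product quantization | ⚠️ Approximate (higher than plain PQ) | ✅ Fast       | ✅ Low        | O(N) over compressed codes; sublinear when combined with IVF | Billion-scale retrieval with constrained memory        |
| **BM25**   | Lexical | Keyword-based ranking using term frequency, inverse document frequency, and length normalization | ❌ No semantics  | ✅ Fast       | ✅ Low        | Depends on the posting-list lengths of the query terms | Keyword search; good in hybrid with semantic vectors   |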
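
To make the difference between the Flat and IVF rows concrete, here is a small pure-Python sketch. It is an illustration only, not how any particular library implements these indexes: the function names are hypothetical, and the centroids are assumed to come from a prior k-means training step.

```python
import heapq
from typing import Dict, List, Sequence

Vector = Sequence[float]


def l2_squared(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def flat_search(vectors: Sequence[Vector], query: Vector, k: int) -> List[int]:
    """Exact search: compare the query against every stored vector, O(N)."""
    return heapq.nsmallest(
        k, range(len(vectors)), key=lambda i: l2_squared(vectors[i], query)
    )


def build_ivf_lists(
    vectors: Sequence[Vector], centroids: Sequence[Vector]
) -> Dict[int, List[int]]:
    """Assign each vector index to the inverted list of its nearest centroid."""
    lists: Dict[int, List[int]] = {c: [] for c in range(len(centroids))}
    for i, vec in enumerate(vectors):
        nearest = min(
            range(len(centroids)), key=lambda c: l2_squared(vec, centroids[c])
        )
        lists[nearest].append(i)
    return lists


def ivf_search(vectors, centroids, lists, query, k: int, nprobe: int = 1) -> List[int]:
    """Approximate search: scan only the nprobe clusters closest to the query."""
    probed = heapq.nsmallest(
        nprobe, range(len(centroids)), key=lambda c: l2_squared(query, centroids[c])
    )
    candidates = [i for c in probed for i in lists[c]]
    return heapq.nsmallest(k, candidates, key=lambda i: l2_squared(vectors[i], query))


vectors = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centroids = [(0.0, 0.5), (10.0, 10.5)]  # assumed output of k-means training
lists = build_ivf_lists(vectors, centroids)

query = (0.2, 0.1)
exact = flat_search(vectors, query, k=2)
approx = ivf_search(vectors, centroids, lists, query, k=2, nprobe=1)
print(exact, approx)
```

Raising `nprobe` scans more clusters, which trades speed for recall. With `nprobe` equal to the number of clusters, the IVF search degenerates into the exact scan.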


Hybrid search = **Lexical** search (keywords) + **Semantic** search (vector similarity)
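
One common way to merge a lexical ranking with a semantic ranking is reciprocal rank fusion (RRF). The sketch below uses RRF purely as an illustration; it is an assumption, not necessarily the fusion method any of the databases above uses. The function name, the document IDs, and the constant `k=60` are all illustrative choices.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document IDs into a single ranking.

    Each document scores 1 / (k + rank) for every list it appears in,
    so documents ranked well by both retrievers rise to the top.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by ID for a deterministic order
    return sorted(scores, key=lambda d: (-scores[d], d))


lexical = ["d3", "d1", "d7"]   # e.g. a BM25 ranking (hypothetical IDs)
semantic = ["d1", "d5", "d3"]  # e.g. a vector-similarity ranking
fused = reciprocal_rank_fusion([lexical, semantic])
print(fused)
```

RRF only needs ranks, not raw scores, which avoids having to put BM25 scores and cosine similarities on a common scale.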



| Use Case                                | Use HNSW?                              |
|-----------------------------------------|----------------------------------------|
| Small data (<10k vectors)               | ❌ Not needed, use `FlatL2` or `FlatIP` |
| Large data (>50k vectors)               | ✅ Yes, faster retrieval                |
| You want fast **approximate** retrieval | ✅ Yes                                  |
| You want exact similarity               | ❌ No, use `Flat*` indexes              |





## B. FAISS

**Index types**

| FAISS Index       | Distance Metric | Description                                                             | Use Case                                               | Notes                                                                 |
|-------------------|------------------|-------------------------------------------------------------------------|---------------------------------------------------------|-----------------------------------------------------------------------|
| IndexFlatL2       | L2 (Euclidean)   | Computes exact Euclidean (squared) distance                             | Default for dense vectors (e.g., image, speech)         | Slower but highly accurate; scales poorly to large datasets           |
| IndexFlatIP       | Inner Product    | Computes dot product between vectors                                    | Use with normalized vectors → cosine similarity         | If embeddings are normalized, IP ≈ cosine similarity                  |
| IndexFlat + METRIC_L1 | L1 (Manhattan)   | Exact L1 (absolute value) distance via `faiss.IndexFlat(d, faiss.METRIC_L1)`; FAISS has no dedicated `IndexFlatL1` class | Rarely used; more common in sparse data or anomaly detection | FAISS supports it but with limited optimizations                      |
| IndexFlat         | Set by the metric argument | Generic exact index; `IndexFlatL2` and `IndexFlatIP` are its L2 and inner-product variants | Exact search with a metric other than L2 or IP          | Prefer the explicit classes when L2 or IP is what you need            |
| IndexIVFFlat      | L2 or Inner Product | Inverted index + Flat inside clusters (needs training)                  | Large datasets with >100k vectors                       | Needs training with representative data                              |
| IndexHNSWFlat     | L2 by default; Inner Product via the metric argument | Graph-based ANN over full (uncompressed) vectors | Fast + high recall for big datasets                     | Supports cosine similarity (with normalization)                      |
| IndexPQ           | Compressed (L2)  | Product Quantization — memory-efficient approximate search              | Billions of vectors                                     | Lower recall, fast, low memory                                       |


**Distance Metrics**


| Metric         | FAISS Index       | Formula                                       | Best For                          |
|----------------|-------------------|-----------------------------------------------|-----------------------------------|
| L2             | IndexFlatL2       | ‖a − b‖² (squared Euclidean)                  | Default for most dense vectors    |
| L1             | IndexFlat with `faiss.METRIC_L1` | sum(abs(a_i - b_i))            | Sparse data / rare cases          |
| Inner Product  | IndexFlatIP       | a ⋅ b                                         | Use with normalized vectors       |
| Cosine         | IndexFlatIP + normalized embeddings | cos(θ) = a ⋅ b / (‖a‖ ‖b‖); cosine distance = 1 − cos(θ) | Semantic similarity (e.g., text)  |
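
The relationship between these metrics on unit-length vectors can be checked numerically. The following pure-Python sketch (helper names are illustrative) shows that after normalization the inner product equals the cosine similarity, and that the squared L2 distance becomes 2 − 2·cos(θ), so all three produce the same ranking.

```python
import math
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def normalize(a: Sequence[float]) -> List[float]:
    """Scale a vector to unit length."""
    length = math.sqrt(dot(a, a))
    if length == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / length for x in a]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def l2_squared(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


a, b = [3.0, 4.0], [4.0, 3.0]
a_hat, b_hat = normalize(a), normalize(b)

cos = cosine_similarity(a, b)     # cosine on the raw vectors
ip = dot(a_hat, b_hat)            # inner product on the normalized vectors
l2sq = l2_squared(a_hat, b_hat)   # squared L2 on the normalized vectors

print(f"cosine={cos:.4f}  inner_product={ip:.4f}  l2_squared={l2sq:.4f}")
```

This is why setting `normalize_embeddings=True` and searching with an inner-product index gives cosine-similarity rankings.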


**Choosing strategy**

| Goal                          | Recommended Index         | Notes                                                                 |
|-------------------------------|----------------------------|-----------------------------------------------------------------------|
| High accuracy, small dataset  | IndexFlatL2 or IndexFlatIP | Exact results, slow on big data                                       |
| Semantic search (text)        | IndexFlatIP + normalize embeddings | IP ≈ Cosine if vectors are unit-length                               |
| Large dataset (>100k docs)    | IndexIVFFlat or IndexHNSWFlat | Fast ANN, need training (IVF) or parameter tuning (HNSW)             |
| Memory-efficient search       | IndexPQ                    | Lower recall, but efficient                                           |
| You care about diversity/novelty | IndexFlatIP + MMR         | Combine with MMR post-processing                                     |
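
For the MMR row, the sketch below shows the selection rule in pure Python. It is a simplified illustration with hypothetical names, not the code any library runs. At each step it picks the candidate that maximizes `lambda_mult * relevance - (1 - lambda_mult) * redundancy`, where redundancy is the highest similarity to anything already selected.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr(query_vec: Vector, doc_vecs: Sequence[Vector], k: int = 3,
        lambda_mult: float = 0.7) -> List[int]:
    """Return indices of up to k documents chosen by maximal marginal relevance."""
    relevance = [cosine(query_vec, d) for d in doc_vecs]
    candidates = list(range(len(doc_vecs)))
    selected: List[int] = []

    def score(i: int) -> float:
        # Redundancy: similarity to the closest already-selected document
        redundancy = max(
            (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
        )
        return lambda_mult * relevance[i] - (1 - lambda_mult) * redundancy

    while candidates and len(selected) < k:
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


query_vec = [1.0, 0.5]
doc_vecs = [[1.0, 0.0], [1.0, 0.05], [0.0, 1.0]]  # docs 0 and 1 are near-duplicates

diverse = mmr(query_vec, doc_vecs, k=2, lambda_mult=0.5)
relevance_only = mmr(query_vec, doc_vecs, k=2, lambda_mult=1.0)
print(diverse, relevance_only)
```

With `lambda_mult=1.0` the rule reduces to plain similarity ranking and returns both near-duplicates. Lowering it lets the less similar but more novel document displace one of them.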


In [8]:
# ✅ 1. Load HuggingFace Embeddings
from langchain_huggingface import HuggingFaceEmbeddings

embedding_model = HuggingFaceEmbeddings(
    model_name="BAAI/bge-base-en-v1.5",
    model_kwargs={'device': 'cpu'},  # or 'cuda' if you have GPU
    encode_kwargs={'normalize_embeddings': True}  # for cosine similarity
)

# ✅ 2. Load Chunks
import pickle

with open("data/chunks.pkl", "rb") as f:
    token_chunks = pickle.load(f)

print(f"✅ Loaded {len(token_chunks)} chunks")


In [6]:
# ✅ 3. Create FAISS Vector Store
from langchain_community.vectorstores import FAISS

faiss_db = FAISS.from_documents(
    documents=token_chunks,
    embedding=embedding_model
)

print("✅ FAISS vector store created")

# ✅ 4. Save FAISS to disk
faiss_db.save_local("faiss_index")
print("✅ FAISS index saved to 'faiss_index'")


AttributeError: module 'faiss' has no attribute 'IndexFlatL2'

In [None]:
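query = "How does the model handle missing values?"

# Run an MMR (maximal marginal relevance) search.
# similarity_search_with_score() has no MMR mode, so the MMR-specific method is used.
query_embedding = embedding_model.embed_query(query)
results = faiss_db.max_marginal_relevance_search_with_score_by_vector(
    query_embedding,
    k=3,               # number of final documents
    fetch_k=10,        # number of candidates to rerank from
    lambda_mult=0.7,   # tradeoff: 1.0 = pure relevance, 0.0 = maximum diversity
)

for doc, score in results:
    print(f"[{score:.4f}] {doc.page_content[:200]}...\n")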


In [None]:
query = "What is the main idea of the document?"

# Perform similarity search
results = faiss_db.similarity_search(query, k=5)

# Print results
for i, doc in enumerate(results):
    print(f"\n--- Result {i+1} ---")
    print(doc.page_content)


A useful helper for creating a fresh ChromaDB collection with cosine similarity:

In [None]:
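from typing import Optional

import chromadb
from chromadb.utils import embedding_functions

# Assumption: the original cell relied on a `client` defined elsewhere.
# An in-memory client is used here so the helper can run on its own.
client = chromadb.Client()


def create_similarity_search_collection(collection_name: str, collection_metadata: Optional[dict] = None):
    """Create a fresh ChromaDB collection with sentence-transformer embeddings."""
    try:
        # Delete any existing collection with this name to start fresh
        client.delete_collection(collection_name)
    except Exception:
        # The collection did not exist yet, so there is nothing to delete
        pass

    # Create embedding function
    sentence_transformer_ef = embedding_functions.SentenceTransformerEmbeddingFunction(
        model_name="all-MiniLM-L6-v2"
    )

    # Create new collection using cosine distance for the HNSW index
    return client.create_collection(
        name=collection_name,
        metadata=collection_metadata,
        configuration={
            "hnsw": {"space": "cosine"},
            "embedding_function": sentence_transformer_ef
        }
    )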