## 📌 Indexing Techniques

- **K-Nearest Neighbours (KNN)**
- **Inverted File Index (IVF)**
- **Locality Sensitive Hashing (LSH)**
- **Random Projection**
- **Product Quantization**
- **Hierarchical Navigable Small World (HNSW)**
- **Hierarchical Indexing**
- **Multi-Representation Indexing**
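
As a baseline for the techniques listed above, here is a minimal sketch of exact (brute-force) nearest-neighbour search, which is the result the approximate indexes try to reproduce more cheaply. The 3-dimensional vectors, their ids, and the helper names `cosine_similarity` and `knn_search` are illustrative assumptions, not part of any library.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn_search(query, vectors, k=2):
    """Compare the query with every stored vector and return the k most similar ids."""
    scored = [(cosine_similarity(query, vec), doc_id) for doc_id, vec in vectors.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:k]]


# Toy 3-dimensional "embeddings" (assumed values); real models produce far more dimensions.
toy_vectors = {
    "pancakes": [0.9, 0.1, 0.0],
    "weather": [0.1, 0.9, 0.1],
    "langchain": [0.0, 0.2, 0.9],
    "langgraph": [0.3, 0.1, 0.8],
}

print(knn_search([0.0, 0.1, 1.0], toy_vectors, k=2))
```

Because every query touches every vector, the cost grows linearly with the collection size, which is why the other techniques in the list trade a little accuracy for speed.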


## 📌 Popular Vector/Embedding Databases

- **ChromaDB** – Lightweight, open-source vector database for embeddings.  
- **Milvus** – High-performance, scalable vector database for similarity search.  
- **Pinecone** – Managed vector database for real-time AI applications.  
- **Weaviate** – Cloud-native vector search engine with semantic search capabilities.  
- **FAISS (Facebook AI Similarity Search)** – Library for efficient similarity search and clustering of dense vectors.   


## Storing Data in Vector Databases

In this section, we will demonstrate how to store data in various vector databases.  
Each database relies on one or more indexing techniques optimized for efficient similarity search.

- **ChromaDB** – Uses **Hierarchical Navigable Small World (HNSW)** indexing for fast and scalable approximate nearest neighbor searches.


In [None]:
import getpass
import os
from langchain_openai import OpenAIEmbeddings

if not os.environ.get("OPENAI_API_KEY"):
  os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter API key for OpenAI: ")


embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

In [3]:
from langchain_chroma import Chroma

vector_store = Chroma(
    collection_name="rag_collection",
    embedding_function=embeddings,
    persist_directory="Medical_collection",
)

In [4]:
from uuid import uuid4

from langchain_core.documents import Document

document_1 = Document(
    page_content="I had chocolate chip pancakes and scrambled eggs for breakfast this morning.",
    metadata={"source": "tweet"},
    id=1,
)

document_2 = Document(
    page_content="The weather forecast for tomorrow is cloudy and overcast, with a high of 62 degrees.",
    metadata={"source": "news"},
    id=2,
)

document_3 = Document(
    page_content="Building an exciting new project with LangChain - come check it out!",
    metadata={"source": "tweet"},
    id=3,
)

document_4 = Document(
    page_content="Robbers broke into the city bank and stole $1 million in cash.",
    metadata={"source": "news"},
    id=4,
)

document_5 = Document(
    page_content="Wow! That was an amazing movie. I can't wait to see it again.",
    metadata={"source": "tweet"},
    id=5,
)

document_6 = Document(
    page_content="Is the new iPhone worth the price? Read this review to find out.",
    metadata={"source": "website"},
    id=6,
)

document_7 = Document(
    page_content="The top 10 soccer players in the world right now.",
    metadata={"source": "website"},
    id=7,
)

document_8 = Document(
    page_content="LangGraph is the best framework for building stateful, agentic applications!",
    metadata={"source": "tweet"},
    id=8,
)

document_9 = Document(
    page_content="The stock market is down 500 points today due to fears of a recession.",
    metadata={"source": "news"},
    id=9,
)

document_10 = Document(
    page_content="I have a bad feeling I am going to get deleted :(",
    metadata={"source": "tweet"},
    id=10,
)

documents = [
    document_1,
    document_2,
    document_3,
    document_4,
    document_5,
    document_6,
    document_7,
    document_8,
    document_9,
    document_10,
]
uuids = [str(uuid4()) for _ in range(len(documents))]

vector_store.add_documents(documents=documents, ids=uuids)

['c8da5918-389d-4a75-ac71-c1562c5ff36d',
 '51d232b8-6f47-4055-ba23-a4ccdbff1eec',
 '7c85a091-f7f8-4fe3-8b50-961176937a13',
 '60d8120b-395b-4eaa-8905-fc75112decd7',
 '64d7cb52-f78d-4a38-bbdc-93cc810b42b6',
 'd08a8ac2-b2bb-4980-8451-1c90bc800866',
 'e92a5019-5d5d-48f0-814e-c60b64f0f901',
 '1e04c24b-007f-485a-a929-10a6f083e6a2',
 '2c7801c4-67c7-44fa-8dba-effe3cdd6fab',
 'efc907f4-356a-410f-909d-b35234028ea9']

In [5]:
results = vector_store.similarity_search(
    "LangChain provides abstractions to make working with LLMs easy",
    k=2,
    filter={"source": "tweet"},
)
for res in results:
    print(f"* {res.page_content} [{res.metadata}]")

* Building an exciting new project with LangChain - come check it out! [{'source': 'tweet'}]
* LangGraph is the best framework for building stateful, agentic applications! [{'source': 'tweet'}]


## FAISS

- **FAISS (Facebook AI Similarity Search)** – Provides several indexing options for efficient vector search:  
  - **Flat (brute-force)** – exact search  
  - **IVF (Inverted File)** – partitions vectors for faster approximate search  
  - **HNSW (Hierarchical Navigable Small World)** – graph-based ANN search  
  - **PQ (Product Quantization)** – compresses vectors for memory-efficient search
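
To make the IVF idea above concrete, the following is a toy sketch in plain Python, not FAISS's implementation. It assumes hand-picked centroids (a real index learns them, typically with k-means) and hypothetical helper names `build_ivf` and `ivf_search`; `nprobe` is the number of partitions scanned per query.

```python
import math


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_ivf(vectors, centroids):
    """Assign every vector id to the inverted list of its nearest centroid."""
    lists = {i: [] for i in range(len(centroids))}
    for vec_id, vec in vectors.items():
        nearest = min(range(len(centroids)), key=lambda i: l2(vec, centroids[i]))
        lists[nearest].append(vec_id)
    return lists


def ivf_search(query, vectors, centroids, lists, k=1, nprobe=1):
    """Scan only the nprobe inverted lists whose centroids are closest to the query."""
    probe = sorted(range(len(centroids)), key=lambda i: l2(query, centroids[i]))[:nprobe]
    candidates = [vec_id for i in probe for vec_id in lists[i]]
    candidates.sort(key=lambda vec_id: l2(query, vectors[vec_id]))
    return candidates[:k]


# Assumed toy data: two fixed centroids and four 2-dimensional vectors.
centroids = [[0.0, 0.0], [10.0, 10.0]]
vectors = {"a": [1.0, 1.0], "b": [0.0, 2.0], "c": [9.0, 9.0], "d": [11.0, 10.0]}

inverted_lists = build_ivf(vectors, centroids)
print(inverted_lists)
print(ivf_search([8.0, 8.0], vectors, centroids, inverted_lists, k=1, nprobe=1))
```

With `nprobe=1` only half of the toy vectors are compared against the query. Raising `nprobe` scans more partitions, which improves recall at the cost of speed.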


In [11]:
import getpass
import os

if not os.environ.get("OPENAI_API_KEY"):
  os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter API key for OpenAI: ")

from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

In [12]:
import faiss
from langchain_community.docstore.in_memory import InMemoryDocstore
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings


embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
index = faiss.IndexFlatL2(len(embeddings.embed_query("hello world")))

vector_store = FAISS(
    embedding_function=embeddings,
    index=index,
    docstore=InMemoryDocstore(),
    index_to_docstore_id={},
)

In [10]:
from uuid import uuid4

from langchain_core.documents import Document

document_1 = Document(
    page_content="I had chocolate chip pancakes and scrambled eggs for breakfast this morning.",
    metadata={"source": "tweet"},
)

document_2 = Document(
    page_content="The weather forecast for tomorrow is cloudy and overcast, with a high of 62 degrees.",
    metadata={"source": "news"},
)

document_3 = Document(
    page_content="Building an exciting new project with LangChain - come check it out!",
    metadata={"source": "tweet"},
)

document_4 = Document(
    page_content="Robbers broke into the city bank and stole $1 million in cash.",
    metadata={"source": "news"},
)

document_5 = Document(
    page_content="Wow! That was an amazing movie. I can't wait to see it again.",
    metadata={"source": "tweet"},
)

document_6 = Document(
    page_content="Is the new iPhone worth the price? Read this review to find out.",
    metadata={"source": "website"},
)

document_7 = Document(
    page_content="The top 10 soccer players in the world right now.",
    metadata={"source": "website"},
)

document_8 = Document(
    page_content="LangGraph is the best framework for building stateful, agentic applications!",
    metadata={"source": "tweet"},
)

document_9 = Document(
    page_content="The stock market is down 500 points today due to fears of a recession.",
    metadata={"source": "news"},
)

document_10 = Document(
    page_content="I have a bad feeling I am going to get deleted :(",
    metadata={"source": "tweet"},
)

documents = [
    document_1,
    document_2,
    document_3,
    document_4,
    document_5,
    document_6,
    document_7,
    document_8,
    document_9,
    document_10,
]
uuids = [str(uuid4()) for _ in range(len(documents))]

vector_store.add_documents(documents=documents, ids=uuids)

['80fc960a-40a6-4e88-8d1c-39aa790703d2',
 '290662ee-5e1b-4ba2-aa01-f9e01d0a176a',
 '62328396-3fdf-462a-8d5f-ce72455a7a84',
 'ba85a075-4d2b-4b50-b870-454ff9001aab',
 'b3e88468-a9dc-4ee5-86b9-1d60274b8569',
 'a5fb7453-ac62-4e44-bd62-8abcb86e7d97',
 '53036063-ce25-4172-8236-7d7c5f2dddb7',
 '49330dcc-8eab-4098-b255-abf98c2fce30',
 '17f2b192-4e8d-433c-a104-7a37d8cdefd3',
 'fe62e242-aefd-4cff-96ec-80279f3c1503']

In [11]:
results = vector_store.similarity_search(
    "LangChain provides abstractions to make working with LLMs easy",
    k=2,
    filter={"source": "tweet"},
)
for res in results:
    print(f"* {res.page_content} [{res.metadata}]")

* Building an exciting new project with LangChain - come check it out! [{'source': 'tweet'}]
* LangGraph is the best framework for building stateful, agentic applications! [{'source': 'tweet'}]


## Pinecone

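- **Pinecone Indexing Techniques**:  
  - **Managed approximate nearest neighbor index** – Pinecone builds and maintains the index internally; when creating an index you choose the dimension, distance metric, and deployment spec rather than an index type.  
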

In [13]:
import getpass
import os

from pinecone import Pinecone

if not os.getenv("PINECONE_API_KEY"):
    os.environ["PINECONE_API_KEY"] = getpass.getpass("Enter your Pinecone API key: ")

pinecone_api_key = os.environ.get("PINECONE_API_KEY")

pc = Pinecone(api_key=pinecone_api_key)

In [14]:
from pinecone import ServerlessSpec

index_name = "testing"  # change if desired

if not pc.has_index(index_name):
    pc.create_index(
        name=index_name,
        dimension=1536,
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1"),
    )

index = pc.Index(index_name)


In [15]:
from langchain_openai import OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vector_store = PineconeVectorStore(index=index, embedding=embeddings)



In [16]:
from uuid import uuid4

from langchain_core.documents import Document

document_1 = Document(
    page_content="I had chocolate chip pancakes and scrambled eggs for breakfast this morning.",
    metadata={"source": "tweet"},
)

document_2 = Document(
    page_content="The weather forecast for tomorrow is cloudy and overcast, with a high of 62 degrees.",
    metadata={"source": "news"},
)

document_3 = Document(
    page_content="Building an exciting new project with LangChain - come check it out!",
    metadata={"source": "tweet"},
)

document_4 = Document(
    page_content="Robbers broke into the city bank and stole $1 million in cash.",
    metadata={"source": "news"},
)

document_5 = Document(
    page_content="Wow! That was an amazing movie. I can't wait to see it again.",
    metadata={"source": "tweet"},
)

document_6 = Document(
    page_content="Is the new iPhone worth the price? Read this review to find out.",
    metadata={"source": "website"},
)

document_7 = Document(
    page_content="The top 10 soccer players in the world right now.",
    metadata={"source": "website"},
)

document_8 = Document(
    page_content="LangGraph is the best framework for building stateful, agentic applications!",
    metadata={"source": "tweet"},
)

document_9 = Document(
    page_content="The stock market is down 500 points today due to fears of a recession.",
    metadata={"source": "news"},
)

document_10 = Document(
    page_content="I have a bad feeling I am going to get deleted :(",
    metadata={"source": "tweet"},
)

documents = [
    document_1,
    document_2,
    document_3,
    document_4,
    document_5,
    document_6,
    document_7,
    document_8,
    document_9,
    document_10,
]
uuids = [str(uuid4()) for _ in range(len(documents))]
vector_store.add_documents(documents=documents, ids=uuids)

['8c8fa098-c342-4367-adad-a955359b054e',
 '36084bf9-6e61-4058-a8b2-44038e1a70da',
 '8a831902-d44f-4e12-aeee-c9308fc29ba1',
 '25e0d899-b32a-4321-91e2-12f26c6e2fbd',
 '7101551a-aae7-410f-86fe-6503dcb64edb',
 'f83d3243-c84d-4a99-92e5-127744c32838',
 'ce866965-2ee0-4f71-8f76-4897c9dac834',
 '92566a1e-ad8b-448b-bb14-f0aa0f8f2ce1',
 'fa2edc94-be21-423c-bbf5-44657a485f8d',
 '135487b0-a514-41cd-b991-58bdaf432197']

In [18]:
results = vector_store.similarity_search(
    "Will it be hot tomorrow?",
    k=2,
    filter={"source": "tweet"},
)
for res in results:
    print(f"* {res.page_content} [{res.metadata}]")

* I have a bad feeling I am going to get deleted :( [{'source': 'tweet'}]
* I had chocolate chip pancakes and scrambled eggs for breakfast this morning. [{'source': 'tweet'}]


## Milvus

- **Milvus Indexing Techniques**:  
  - **HNSW (Hierarchical Navigable Small World)** – Graph-based approximate nearest neighbor search.  
  - **IVF (Inverted File)** – Clusters vectors for efficient search within relevant partitions.  
  - **IVF-PQ (Inverted File with Product Quantization)** – Combines IVF with compression for memory-efficient search.  
  - **FLAT** – Brute-force exact search (accurate but slower for large datasets).  
  - **ScaNN (Scalable Nearest Neighbors)** – Optimized high-dimensional vector search.
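
The compression step behind product quantization (the "PQ" in IVF-PQ above) can be sketched in plain Python. This is a toy illustration, not Milvus's implementation: the codebooks are hand-picked assumptions (real ones are trained, usually with k-means per sub-space), and `pq_encode` and `pq_distance` are hypothetical helper names.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def split(vec, m):
    """Split a vector into m equal-length sub-vectors."""
    size = len(vec) // m
    return [vec[i * size:(i + 1) * size] for i in range(m)]


def pq_encode(vec, codebooks):
    """Replace each sub-vector with the index of its nearest centroid in that sub-space."""
    subs = split(vec, len(codebooks))
    return [
        min(range(len(book)), key=lambda j: sq_dist(sub, book[j]))
        for sub, book in zip(subs, codebooks)
    ]


def pq_distance(query, code, codebooks):
    """Approximate squared distance between a raw query and an encoded vector."""
    subs = split(query, len(codebooks))
    return sum(sq_dist(sub, book[idx]) for sub, book, idx in zip(subs, codebooks, code))


# Assumed codebooks: 2 sub-spaces with 2 centroids each.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]

code = pq_encode([0.9, 1.1, 0.1, 0.8], codebooks)
print(code)  # the 4 floats are stored as 2 small integers
print(pq_distance([1.0, 1.0, 0.0, 1.0], code, codebooks))
```

Each stored vector shrinks to one small integer per sub-space, and distances are estimated from the centroids rather than the original values, which is where both the memory savings and the approximation error come from.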


In [None]:
## To use Milvus, first start a local Milvus server with Docker

# Step 1 - start Docker
# Step 2 - run the commands below
# docker pull milvusdb/milvus:v2.0.0
# docker run -d --name milvus_cpu_2.0.0 -p 19530:19530 -p 19121:19121 milvusdb/milvus:v2.0.0

In [None]:
import getpass
import os
from langchain_openai import OpenAIEmbeddings

if not os.environ.get("OPENAI_API_KEY"):
  os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter API key for OpenAI: ")

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

In [None]:
from langchain_milvus import Milvus

vector_store = Milvus(
    embedding_function=embeddings,
    connection_args={"host": "localhost", "port": "19530"},
    index_params={"index_type": "FLAT", "metric_type": "L2"},
)


In [None]:
from uuid import uuid4

from langchain_core.documents import Document

document_1 = Document(
    page_content="I had chocolate chip pancakes and scrambled eggs for breakfast this morning.",
    metadata={"source": "tweet"},
)

document_2 = Document(
    page_content="The weather forecast for tomorrow is cloudy and overcast, with a high of 62 degrees.",
    metadata={"source": "news"},
)

document_3 = Document(
    page_content="Building an exciting new project with LangChain - come check it out!",
    metadata={"source": "tweet"},
)

document_4 = Document(
    page_content="Robbers broke into the city bank and stole $1 million in cash.",
    metadata={"source": "news"},
)

document_5 = Document(
    page_content="Wow! That was an amazing movie. I can't wait to see it again.",
    metadata={"source": "tweet"},
)

document_6 = Document(
    page_content="Is the new iPhone worth the price? Read this review to find out.",
    metadata={"source": "website"},
)

document_7 = Document(
    page_content="The top 10 soccer players in the world right now.",
    metadata={"source": "website"},
)

document_8 = Document(
    page_content="LangGraph is the best framework for building stateful, agentic applications!",
    metadata={"source": "tweet"},
)

document_9 = Document(
    page_content="The stock market is down 500 points today due to fears of a recession.",
    metadata={"source": "news"},
)

document_10 = Document(
    page_content="I have a bad feeling I am going to get deleted :(",
    metadata={"source": "tweet"},
)

documents = [
    document_1,
    document_2,
    document_3,
    document_4,
    document_5,
    document_6,
    document_7,
    document_8,
    document_9,
    document_10,
]
uuids = [str(uuid4()) for _ in range(len(documents))]

vector_store.add_documents(documents=documents, ids=uuids)

In [None]:
results = vector_store.similarity_search(
    "LangChain provides abstractions to make working with LLMs easy",
    k=2,
    expr='source == "tweet"',
)
for res in results:
    print(f"* {res.page_content} [{res.metadata}]")

## Weaviate

- **Weaviate Indexing Techniques**:
  - **HNSW (Hierarchical Navigable Small World)** – Primary indexing method for approximate nearest neighbor search.
  - **Hybrid Search** – A query mode rather than an index type; it combines vector search with keyword-based (BM25) search.
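
One way to picture hybrid search is as a weighted blend of two rankings. The sketch below is a simplified illustration and not Weaviate's exact fusion algorithm: it min-max normalizes each score set and mixes them with a weight `alpha` (1.0 means vector-only, 0.0 means keyword-only). The scores, document ids, and the `hybrid_rank` helper are assumptions made for the example.

```python
def normalize(scores):
    """Min-max scale scores to [0, 1]; a constant score set maps to 1.0."""
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - low) / (high - low) for doc, s in scores.items()}


def hybrid_rank(vector_scores, keyword_scores, alpha=0.5):
    """Blend normalized vector and keyword scores, then rank documents best-first."""
    v, kw = normalize(vector_scores), normalize(keyword_scores)
    docs = set(v) | set(kw)
    # A document missing from one result set contributes 0 for that side.
    fused = {d: alpha * v.get(d, 0.0) + (1 - alpha) * kw.get(d, 0.0) for d in docs}
    return sorted(fused, key=lambda d: (-fused[d], d))


# Assumed scores: similarity from a vector search, relevance from a keyword search.
vector_scores = {"doc_a": 0.90, "doc_b": 0.80, "doc_c": 0.10}
keyword_scores = {"doc_b": 9.0, "doc_c": 7.0}

print(hybrid_rank(vector_scores, keyword_scores, alpha=0.5))
print(hybrid_rank(vector_scores, keyword_scores, alpha=1.0))
```

With equal weighting, `doc_b` moves ahead of `doc_a` because it scores well on both sides; with `alpha=1.0` the order falls back to the pure vector ranking.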


In [16]:
## To use Weaviate, first start a local Weaviate server with Docker

# Step 1 - start Docker
# Step 2 - run the commands below, replacing "Give own path" in the last command with your own data path
# docker pull semitechnologies/weaviate:latest
# docker run -d --name weaviate_local -p 8080:8080 -p 50051:50051 -e QUERY_DEFAULTS_LIMIT=20 -e AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED=true -e PERSISTENCE_DATA_PATH="Give own path" semitechnologies/weaviate:latest

In [11]:
import weaviate
weaviate_client = weaviate.connect_to_local()

In [14]:
from langchain_weaviate.vectorstores import WeaviateVectorStore
from langchain_community.document_loaders import TextLoader
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter


embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
loader = TextLoader("sample.txt")
documents = loader.load()


text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

db = WeaviateVectorStore.from_documents(docs, embeddings, client=weaviate_client)

In [15]:
query = "Explain LLMs"
docs = db.similarity_search(query)

# Print the first 100 characters of each result
for i, doc in enumerate(docs):
    print(f"\nDocument {i + 1}:")
    print(doc.page_content[:100] + "...")


Document 1:
Key Characteristics of Transformers:
Self-Attention Mechanism: The self-attention mechanism allows T...

Document 2:
What Are Transformers?
Transformers are a type of neural network architecture that has revolutionize...
