## Introduction to Vector Databases


A Vector Database is a special type of database optimized for storing, indexing, and searching embeddings (vectors).  

- Embeddings are numerical representations of unstructured data (text, images, audio) that capture semantic meaning.

- VectorDBs are a core building block of most Retrieval-Augmented Generation (RAG) pipelines in LLM applications.
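To make "capture semantic meaning" concrete, here is a minimal sketch of how similarity between embeddings is usually measured. The three vectors are hand-made, illustrative values (not the output of any real embedding model), and `cosine_similarity` is a helper written for this example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: values near 1.0 mean similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made 3-dimensional "embeddings" (assumed values for illustration only).
toy_embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "kitten": [0.85, 0.75, 0.2],
    "car": [0.1, 0.2, 0.95],
}

sim_cat_kitten = cosine_similarity(toy_embeddings["cat"], toy_embeddings["kitten"])
sim_cat_car = cosine_similarity(toy_embeddings["cat"], toy_embeddings["car"])

print(f"cat vs kitten: {sim_cat_kitten:.3f}")
print(f"cat vs car:    {sim_cat_car:.3f}")
```

Related items end up with a higher score than unrelated ones. Real embeddings work the same way, only with hundreds or thousands of dimensions.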



## Why use VectorDBs and not a normal database?

- SQL databases are great for structured data.
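- But similarity search (e.g., "find documents most similar to this query") is inefficient in plain SQL: without a vector index, each query has to be compared against every stored row.

VectorDBs solve this using Approximate Nearest Neighbor (ANN) algorithms, which trade a small amount of accuracy for much faster lookups.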
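The difference between exact and approximate search can be sketched in a few lines. The code below compares a brute-force scan with a toy random-hyperplane LSH index, which is one simple ANN technique chosen here for illustration. Production systems typically use other index types, and the class and function names are hypothetical.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def exact_top_k(query, vectors, k):
    """Exact search: score every stored vector, so cost grows linearly with N."""
    ranked = sorted(range(len(vectors)), key=lambda i: cosine(query, vectors[i]), reverse=True)
    return ranked[:k]

class HyperplaneLSH:
    """Toy ANN index: vectors on the same side of every random hyperplane share a bucket."""

    def __init__(self, dim, n_planes, seed=0):
        rng = random.Random(seed)
        self.planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_planes)]
        self.buckets = {}
        self.vectors = []

    def _key(self, v):
        # One boolean per hyperplane: which side of the plane the vector falls on.
        return tuple(dot(p, v) >= 0 for p in self.planes)

    def add(self, v):
        self.vectors.append(v)
        self.buckets.setdefault(self._key(v), []).append(len(self.vectors) - 1)

    def search(self, query, k):
        # Only vectors in the query's bucket are scored; true neighbors
        # that landed in another bucket are missed (the "approximate" part).
        candidates = self.buckets.get(self._key(query), [])
        ranked = sorted(candidates, key=lambda i: cosine(query, self.vectors[i]), reverse=True)
        return ranked[:k]

rng = random.Random(42)
vectors = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(1000)]

index = HyperplaneLSH(dim=8, n_planes=6, seed=0)
for v in vectors:
    index.add(v)

query = vectors[7]
exact_ids = exact_top_k(query, vectors, k=3)
approx_ids = index.search(query, k=3)
candidate_count = len(index.buckets[index._key(query)])

print("exact: ", exact_ids)
print("approx:", approx_ids)
print(f"scored {candidate_count} of {len(vectors)} vectors")
```

The approximate search scores only the vectors in one bucket instead of all 1000, at the price of possibly missing some true neighbors.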



### Key Features
- Store embeddings + metadata
- Perform similarity search (`top-k nearest neighbors`)
- Hybrid search (keywords + vectors)
- Scale to billions of embeddings
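The first three features can be sketched with a small in-memory store. This is a hypothetical `TinyVectorStore`, not the API of any real VectorDB: it scans every record per query, and its hybrid score is a simple weighted blend of cosine similarity and keyword overlap, assumed here for illustration.

```python
import math
from dataclasses import dataclass, field

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

@dataclass
class Record:
    text: str
    vector: list
    metadata: dict = field(default_factory=dict)

class TinyVectorStore:
    """Hypothetical in-memory store: a list of records scanned on every query."""

    def __init__(self):
        self.records = []

    def add(self, text, vector, metadata=None):
        self.records.append(Record(text, vector, metadata or {}))

    def search(self, query_vector, k=2, where=None, query_text=None, alpha=1.0):
        """Return the top-k (score, text) pairs.

        where:      metadata key/value pairs a record must match exactly.
        query_text: if given, blend in keyword overlap (hybrid search).
        alpha:      weight of the vector score; 1 - alpha goes to keywords.
        """
        query_words = set(query_text.lower().split()) if query_text else set()
        scored = []
        for rec in self.records:
            if where and any(rec.metadata.get(key) != val for key, val in where.items()):
                continue  # metadata filter: skip non-matching records
            score = alpha * cosine(query_vector, rec.vector)
            if query_words:
                overlap = len(query_words & set(rec.text.lower().split())) / len(query_words)
                score += (1 - alpha) * overlap
            scored.append((score, rec.text))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]

store = TinyVectorStore()
store.add("deep learning basics", [0.9, 0.1], {"topic": "ml"})
store.add("machine learning overview", [0.8, 0.3], {"topic": "ml"})
store.add("cooking pasta at home", [0.1, 0.9], {"topic": "food"})

top_hits = store.search([1.0, 0.0], k=2)
food_hits = store.search([1.0, 0.0], k=2, where={"topic": "food"})
hybrid_hits = store.search([1.0, 0.0], k=1, query_text="machine learning", alpha=0.3)

print([text for _, text in top_hits])
print([text for _, text in food_hits])
print([text for _, text in hybrid_hits])
```

With `alpha=0.3` the keyword match outweighs the slightly closer vector, so the hybrid query ranks "machine learning overview" first even though pure vector search prefers "deep learning basics".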



## Popular VectorDBs


### Local / Lightweight
- FAISS (Facebook AI): A similarity-search library rather than a full database; runs locally, fast, scalable
- Chroma: Python-native, easy setup, integrates with LangChain

### Cloud / Production
- Pinecone: Managed cloud VectorDB
- Weaviate: Open-source, vector search with a GraphQL API
- Milvus: Open-source, scalable
- Qdrant: Open-source

In this document, we will introduce Chroma and FAISS, using their LangChain integrations.

In [None]:
### Install dependencies

!pip install langchain langchain-community openai tiktoken




In [None]:

### Set up the OpenAI API key

import os

# Set your OpenAI API key here (avoid committing a real key to version control)
openai_api_key = '<your_api_key>'
os.environ["OPENAI_API_KEY"] = openai_api_key


### Chroma Example

In [None]:
!pip install chromadb



In [None]:
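# The older `langchain.vectorstores` / `langchain.embeddings` import paths are
# deprecated; these integrations now live in the langchain-community package.
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import OpenAIEmbeddings
from langchain_core.documents import Document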

# Example documents
texts = [
    "Machine learning is great for pattern recognition.",
    "Deep learning is a subset of machine learning.",
    "LangChain makes working with LLMs easier."
]

docs = [Document(page_content=t) for t in texts]

# Create the embedding model (vectors are computed when documents are added)
embeddings = OpenAIEmbeddings()

# Store in Chroma
db = Chroma.from_documents(docs, embeddings)

# Perform similarity search
query = "What is deep learning?"
results = db.similarity_search(query, k=2)

for r in results:
    print(r.page_content)


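Deep learning is a subset of machine learning.
Deep learning is a subset of machine learning.

Note: both results are the same sentence. This typically happens when the cell is run more than once in the same session, because each run adds the documents again to Chroma's default collection. Restarting the kernel, or passing a fresh `collection_name` to `Chroma.from_documents`, should return two distinct documents.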


### FAISS Example

In [None]:
!pip install faiss-cpu



In [None]:
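# The older `langchain.vectorstores` / `langchain.embeddings` import paths are
# deprecated; these integrations now live in the langchain-community package.
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import OpenAIEmbeddings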

texts = [
    "Neural networks are inspired by the human brain.",
    "Transformers revolutionized NLP.",
    "Word embeddings capture semantic meaning."
]

embeddings = OpenAIEmbeddings()

# Create FAISS index
db = FAISS.from_texts(texts, embeddings)

# Run a similarity search for the query
query = "How do embeddings work?"
docs = db.similarity_search(query, k=2)

for doc in docs:
    print(doc.page_content)


Word embeddings capture semantic meaning.
Transformers revolutionized NLP.



### Choosing a VectorDB

- Learning / Experimenting: Chroma or FAISS
- Small local projects: Chroma
- Large-scale production: Pinecone, Weaviate, Qdrant, Milvus
- Fully managed cloud (no infra hassle): Pinecone
- Open-source + fast: Qdrant / Weaviate
