# Vector stores
Vector stores are databases that efficiently store and retrieve embeddings. Rather than matching exact keywords, they retrieve data by semantic similarity, so a search can find information with a similar meaning even when the wording differs. The sections below give a simplified breakdown of how vector stores work and the main features you are likely to use.

# 1. Vector Embeddings
* Vector stores use embeddings, which are mathematical representations of text, images, or audio. These embeddings capture the meaning behind the data so that the vector store can find similar content.

* For example, if you have a vector store of different documents and you search "sunny day," it could return documents related to warm weather, beaches, or outdoor activities even if the exact phrase “sunny day” isn’t in those documents.

# 2. Using Vector Stores in LangChain
LangChain provides integrations for different vector store implementations, making it easy to switch between them.
LangChain also has a standard interface for adding, deleting, and searching documents in vector stores.

# 3. Main Methods of Vector Stores
* add_documents: Adds documents to the vector store. Each document has page_content (text) and metadata (extra info like source or date).
* delete: Removes documents by their unique IDs.
* similarity_search: Finds documents that are similar to a given search query. The vector store will convert the query into an embedding and look for other embeddings that are close in meaning.
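
The three operations above can be sketched with a toy in-memory store. This is only an illustration, not LangChain's implementation: `ToyVectorStore` and `toy_embed` are hypothetical names, and the word-count "embedding" is a stand-in for a real embedding model, so it matches shared words rather than meaning.

```python
import math
import uuid
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: a word-count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorStore:
    def __init__(self):
        self._rows = {}  # id -> (embedding, text, metadata)

    def add_documents(self, docs):
        """Embed and store (text, metadata) pairs; return the generated IDs."""
        ids = []
        for text, metadata in docs:
            doc_id = str(uuid.uuid4())
            self._rows[doc_id] = (toy_embed(text), text, metadata)
            ids.append(doc_id)
        return ids

    def delete(self, ids):
        """Remove documents by ID, ignoring IDs that are not present."""
        for doc_id in ids:
            self._rows.pop(doc_id, None)

    def similarity_search(self, query, k=4):
        """Embed the query and return the k most similar (text, metadata) pairs."""
        q = toy_embed(query)
        ranked = sorted(
            self._rows.values(),
            key=lambda row: cosine(q, row[0]),
            reverse=True,
        )
        return [(text, metadata) for _, text, metadata in ranked[:k]]


store = ToyVectorStore()
ids = store.add_documents([
    ("the eiffel tower in paris", {"location": "Paris"}),
    ("big ben in london", {"location": "London"}),
])
print(store.similarity_search("paris tower", k=1))
```

A real vector store replaces the word counts with dense embeddings and the full sort with an index, but the add, delete, and search flow has the same shape.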

# 4. Similarity Metrics
* Vector stores compare embeddings using similarity metrics like:
  * Cosine Similarity: Measures how closely the directions of two vectors align, regardless of their lengths.
  * Euclidean Distance: Measures the straight-line distance between two points (how far apart they are).
  * Dot Product: Measures how much one vector aligns with another, scaled by the lengths of both vectors.
* The similarity metric used can vary by vector store.
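
As a small worked example, the three metrics can be computed by hand on two-dimensional vectors. The vectors below are made up for illustration; real embeddings have hundreds of dimensions.

```python
import math


def dot(a, b):
    """Dot product: sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))


def euclidean_distance(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Dot product divided by the product of the vector lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [1.0, 0.0]
b = [2.0, 0.0]  # same direction as a, twice as long
c = [0.0, 1.0]  # perpendicular to a

print(cosine_similarity(a, b))   # 1.0: identical direction
print(euclidean_distance(a, b))  # 1.0: the points are one unit apart
print(dot(a, b))                 # 2.0: grows with vector length
print(cosine_similarity(a, c))   # 0.0: perpendicular vectors
```

Note how `a` and `b` are a perfect match under cosine similarity yet are not at distance zero under Euclidean distance, which is why the choice of metric can change the ranking.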

# 5. Searching for Similar Documents
* When you search for something, the vector store looks for documents with similar embeddings. Many vector stores use an algorithm called HNSW (Hierarchical Navigable Small World) to make searching through large amounts of data faster.
* You can also control your search by:
  * k: Number of documents to return (e.g., k=5 to return the top 5 matches).
  * filter: Limit results based on metadata, like finding only documents with metadata saying {"source": "tweet"}.
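
The effect of `k` and a metadata filter can be sketched with a brute-force search over a handful of vectors. This is a simplified illustration under stated assumptions: the `search` function, the parameter name `metadata_filter`, and the two-dimensional vectors are all hypothetical, and the filter here is plain exact matching. Real vector stores typically use an index such as HNSW and offer richer filter syntax.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(query_vec, rows, k=4, metadata_filter=None):
    """Return the top-k (text, metadata) pairs, optionally filtered by metadata.

    rows is a list of (vector, text, metadata) tuples.
    """
    # Keep only rows whose metadata matches every key/value in the filter
    candidates = [
        row for row in rows
        if metadata_filter is None
        or all(row[2].get(key) == value for key, value in metadata_filter.items())
    ]
    # Rank the remaining rows by similarity to the query, best first
    candidates.sort(key=lambda row: cosine_similarity(query_vec, row[0]), reverse=True)
    return [(text, metadata) for _, text, metadata in candidates[:k]]


rows = [
    ([0.9, 0.1], "Post about the beach", {"source": "tweet"}),
    ([0.8, 0.2], "Article about summer weather", {"source": "news"}),
    ([0.1, 0.9], "Post about tax filing", {"source": "tweet"}),
]
query_vec = [1.0, 0.0]

print(search(query_vec, rows, k=2))
print(search(query_vec, rows, k=2, metadata_filter={"source": "tweet"}))
```

With the filter applied, the news article is excluded before ranking, so a less similar tweet takes its place in the top 2.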

# Chroma
* Description: Chroma is an open-source vector database designed for AI and ML applications. It supports document retrieval, metadata filtering, and similarity search.
* Use Case: Perfect for lightweight projects or local development that doesn’t need large-scale deployment.
* Integration: Chroma integrates easily with LangChain for seamless embedding search.

In [None]:
pip install langchain chromadb sentence-transformers


In [7]:
from langchain.vectorstores import Chroma
from langchain.embeddings import SentenceTransformerEmbeddings

# Initialize the embedding model
embedding_model = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

# Initialize Chroma with the embedding model
vector_store = Chroma(embedding_function=embedding_model)


In [8]:
from langchain.schema import Document

# Create some sample documents
documents = [
    Document(
        page_content="The Eiffel Tower is one of the most famous landmarks in Paris.",
        metadata={"source": "tourism", "location": "Paris"}
    ),
    Document(
        page_content="Big Ben is a historic clock tower in London, England.",
        metadata={"source": "tourism", "location": "London"}
    ),
    Document(
        page_content="The Louvre is the world's largest art museum and a historic monument in Paris.",
        metadata={"source": "museum", "location": "Paris"}
    ),
]

# Add documents to Chroma vector store
vector_store.add_documents(documents=documents)


['b6c3f620-4e4a-494a-9073-0bb3e41e8f95',
 'e088a01b-c8b4-4b58-b3b2-7ada84d177fb',
 '2b275725-faa1-41f7-aff0-a5e1f01fb5fd']

In [9]:
query = "famous landmarks in Paris"
results = vector_store.similarity_search(query, k=2)

# Display the results
for idx, doc in enumerate(results):
    print(f"Result {idx + 1}:")
    print(f"Content: {doc.page_content}")
    print(f"Metadata: {doc.metadata}")
    print("\n")


Result 1:
Content: The Eiffel Tower is one of the most famous landmarks in Paris.
Metadata: {'location': 'Paris', 'source': 'tourism'}


Result 2:
Content: The Louvre is the world's largest art museum and a historic monument in Paris.
Metadata: {'location': 'Paris', 'source': 'museum'}




In [10]:
# Filter to retrieve only documents related to Paris
results_filtered = vector_store.similarity_search(query, k=2, filter={"location": "Paris"})

# Display the filtered results
for idx, doc in enumerate(results_filtered):
    print(f"Filtered Result {idx + 1}:")
    print(f"Content: {doc.page_content}")
    print(f"Metadata: {doc.metadata}")
    print("\n")


Filtered Result 1:
Content: The Eiffel Tower is one of the most famous landmarks in Paris.
Metadata: {'location': 'Paris', 'source': 'tourism'}


Filtered Result 2:
Content: The Louvre is the world's largest art museum and a historic monument in Paris.
Metadata: {'location': 'Paris', 'source': 'museum'}




# FAISS (Facebook AI Similarity Search)
* Description: FAISS is an open-source library developed by Facebook for fast similarity search and clustering of dense vectors. It's great for large-scale vector storage and retrieval, but it’s more of a library than a standalone database, so you might need additional tools to manage it fully.
* Use Case: Ideal for large datasets that require high-speed similarity searches.
* Integration: Works well with LangChain, and you can use it with Python for easy setup.


In [3]:
pip install faiss-cpu sentence-transformers









In [4]:
from sentence_transformers import SentenceTransformer
import numpy as np

# Initialize the sentence transformer model
model = SentenceTransformer("all-MiniLM-L6-v2")

# Sample documents
documents = [
    "The Eiffel Tower is located in Paris.",
    "The Statue of Liberty is in New York.",
    "The Colosseum is an ancient amphitheater in Rome.",
    "The Louvre is a famous museum in Paris.",
]

# Generate embeddings for each document
embeddings = model.encode(documents)

# FAISS expects a float32 NumPy array
embeddings = np.asarray(embeddings, dtype="float32")


In [5]:
import faiss

# Define the dimension of the embeddings (e.g., 384 for MiniLM)
embedding_dimension = embeddings.shape[1]

# Create a flat (exact, brute-force) FAISS index that ranks by L2 (Euclidean) distance
index = faiss.IndexFlatL2(embedding_dimension)

# Add embeddings to the index
index.add(embeddings)
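
`IndexFlatL2` ranks by squared Euclidean distance, not cosine similarity. When the vectors are unit-length, the two give the same ranking, because squared L2 distance equals 2 - 2 * cosine similarity. The short check below illustrates that identity with made-up vectors in plain Python; the helper names are hypothetical. If cosine scores are wanted directly, a common approach is to normalize the vectors (for example with `faiss.normalize_L2`) and use an inner-product index such as `faiss.IndexFlatIP`.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


a = normalize([3.0, 4.0])  # [0.6, 0.8]
b = normalize([4.0, 3.0])  # [0.8, 0.6]

cosine = dot(a, b)         # for unit vectors, the dot product is the cosine
distance = squared_l2(a, b)

print(cosine)              # about 0.96
print(distance)            # about 0.08
print(2 - 2 * cosine)      # about 0.08, matching the squared L2 distance
```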


In [6]:
# Define a query
query = "famous landmarks in Paris"

# Encode the query to get its embedding
query_embedding = model.encode([query])

# Perform the search (find the 2 closest documents)
k = 2
distances, indices = index.search(query_embedding, k)

# Display results
print("Query:", query)
print("Results:")
for i in range(k):
    print(f"Document {i + 1}: {documents[indices[0][i]]} (Distance: {distances[0][i]})")


Query: famous landmarks in Paris
Results:
Document 1: The Louvre is a famous museum in Paris. (Distance: 0.7133329510688782)
Document 2: The Eiffel Tower is located in Paris. (Distance: 0.8363720178604126)


# How to create and query vector stores

In [15]:
import os
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

# Access the OpenAI API key
openai_api_key = os.getenv('OPENAI_API_KEY')

# Confirm the key was loaded without printing the secret itself
print("OpenAI API key loaded:", openai_api_key is not None)


OpenAI API key loaded: True


In [17]:
from langchain_community.document_loaders import TextLoader
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter

# Load the document, split it into chunks, embed each chunk and load it into the vector store.
raw_documents = TextLoader(r"C:\Users\Admin\Desktop\10-20-2024\data\state_of_the_union.txt",  encoding="utf-8").load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
documents = text_splitter.split_documents(raw_documents)
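
To give a rough idea of what splitting into chunks involves, here is a simplified sketch that breaks text on a separator and merges the pieces back together until a size limit is reached. It is not LangChain's implementation: `split_text` is a hypothetical helper, it ignores chunk overlap, and it keeps any single piece longer than `chunk_size` whole.

```python
def split_text(text, chunk_size=1000, separator="\n\n"):
    """Greedily merge separator-delimited pieces into chunks of at most chunk_size."""
    chunks = []
    current = ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            # Still fits (or the chunk is empty, so an oversized piece is kept whole)
            current = candidate
        else:
            # Adding this piece would exceed the limit: close the chunk, start a new one
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "aaaa\n\nbbbb\n\ncccc"
print(split_text(sample, chunk_size=10))
```

Each resulting chunk is what gets embedded and stored, so chunk size trades off between context per chunk and precision of retrieval.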

In [19]:
pip install langchain-chroma



In [28]:

from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

# Initialize OpenAI embeddings
embedding_model = OpenAIEmbeddings()

# Embed the document chunks and load them into a FAISS vector store
db = FAISS.from_documents(documents, embedding_model)

# Alternative: use Chroma instead (requires the langchain-chroma package)
# from langchain_chroma import Chroma
# db = Chroma.from_documents(documents, embedding=embedding_model)




In [32]:
# Similarity search
query = "What did the president say about Ketanji Brown Jackson"
docs = db.similarity_search(query)
print(docs[0].page_content)

Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. 

Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. 

One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. 

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.


In [30]:
# Similarity search by vector
embedding_vector = embedding_model.embed_query(query)
docs = db.similarity_search_by_vector(embedding_vector)
print(docs[0].page_content)

Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. 

Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. 

One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. 

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.


In [31]:
# Async variant of similarity_search (top-level await works in Jupyter/IPython)
docs = await db.asimilarity_search(query)
docs

[Document(metadata={'source': 'C:\\Users\\Admin\\Desktop\\10-20-2024\\data\\state_of_the_union.txt'}, page_content='Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. \n\nTonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.'),
 Document(metadata={'source': 'C:\\Users\\Admin\\Desktop\\10-20-2024\\data\\state_of_th