# Vector Stores
Vector stores are data structures used to store and retrieve embeddings (vector representations of data). They enable efficient similarity search: documents, queries, or other text are converted into vectors, and the store finds the closest matching vectors under a distance metric (e.g., cosine similarity). LangChain integrates with various vector store backends, such as Pinecone, Chroma, FAISS, and Weaviate, to perform these operations. This section explores FAISS, Chroma, and Pinecone.
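
To make the idea of "closest matching vectors" concrete, here is a minimal sketch of a brute-force cosine-similarity search in plain Python. The three-dimensional vectors and the helper names `cosine_similarity` and `top_k` are hypothetical and chosen only for illustration; real embeddings have hundreds of dimensions, and real vector stores use optimized index structures rather than a linear scan.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query, vectors, k=1):
    """Return (index, score) pairs for the k stored vectors most similar to the query."""
    scored = [(cosine_similarity(query, v), i) for i, v in enumerate(vectors)]
    scored.sort(reverse=True)  # highest similarity first
    return [(i, score) for score, i in scored[:k]]

# Hypothetical toy "embeddings" standing in for stored documents
stored = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
]
query = [0.9, 0.1, 0.0]

best = top_k(query, stored, k=2)
print(best)
```

The query points mostly along the first axis, so the first stored vector ranks highest, followed by the diagonal one.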

- Setup of Embeddings

In [1]:
# =================== Change this section according to the embedding APIs you have access to ===============
from langchain_huggingface import HuggingFaceEmbeddings

model_name = "sentence-transformers/all-MiniLM-L6-v2"
text = "Write a short poem about programming."
embeddings = HuggingFaceEmbeddings(model_name=model_name)


- Setup of Documents/Texts

In [2]:
from langchain_core.documents import Document

# documents
documents = [
    Document(
        page_content="Photosynthesis is the process by which green plants and some other organisms use sunlight to synthesize foods from carbon dioxide and water.",
        metadata={
            "source": "Biology Textbook",
            "author": "Dr. Alice Green",
            "date": "2021-03-15"
        }
    ),
    Document(
        page_content="Quantum entanglement is a physical phenomenon that occurs when pairs or groups of particles are generated such that the quantum state of each particle cannot be described independently of the others.",
        metadata={
            "source": "Physics Journal",
            "author": "Dr. John Quantum",
            "date": "2022-06-01"
        }
    ),
    Document(
        page_content="The theory of plate tectonics describes the large-scale motion of Earth's lithosphere and explains phenomena such as earthquakes and mountain formation.",
        metadata={
            "source": "Geology Magazine",
            "author": "Dr. Emily Stone",
            "date": "2020-11-23"
        }
    ),
    Document(
        page_content="CRISPR is a powerful tool for editing genomes, allowing researchers to easily alter DNA sequences and modify gene function.",
        metadata={
            "source": "Genetics Weekly",
            "author": "Dr. Rachel Gene",
            "date": "2023-01-10"
        }
    )
]

texts = ["Text 1", "Text 2", "Text 3", "Text 4"]

## FAISS

FAISS (Facebook AI Similarity Search) is an **open-source** library developed by Facebook AI Research, designed to efficiently search for similar vectors in large datasets. It provides tools for nearest neighbor search, clustering, and vector quantization, making it well suited to high-dimensional data such as word embeddings or other types of vectors.

FAISS indexes are kept in memory by default, but they can also be persisted to disk for later use:
- In-Memory: When you create an index, FAISS keeps it in memory. This allows fast querying but can be limited by the amount of available system memory, especially for large datasets.
- On Disk: FAISS provides functionality to save and load indexes from disk. This can be useful for large datasets or when you need to persist the index for future use. You can save the index to a file and load it again later without needing to recompute it.
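
The in-memory versus on-disk distinction can be sketched with a toy index. This is not FAISS itself: `ToyIndex`, its methods, and the JSON file format are hypothetical stand-ins, used only to show that a saved index can be reloaded and queried without re-adding (or re-embedding) the data.

```python
import json
import os
import tempfile

class ToyIndex:
    """A hypothetical in-memory index: ids and vectors held in Python lists."""

    def __init__(self):
        self.ids = []
        self.vectors = []

    def add(self, item_id, vector):
        self.ids.append(item_id)
        self.vectors.append(list(vector))

    def search(self, query, k=1):
        # Brute-force squared L2 distance; smaller means closer
        dists = [
            (sum((q - x) ** 2 for q, x in zip(query, vec)), item_id)
            for item_id, vec in zip(self.ids, self.vectors)
        ]
        return [item_id for _, item_id in sorted(dists)[:k]]

    def save(self, path):
        # Persist the in-memory state to disk
        with open(path, "w", encoding="utf-8") as f:
            json.dump({"ids": self.ids, "vectors": self.vectors}, f)

    @classmethod
    def load(cls, path):
        # Rebuild the index from disk without recomputing any vectors
        with open(path, encoding="utf-8") as f:
            data = json.load(f)
        index = cls()
        index.ids = data["ids"]
        index.vectors = data["vectors"]
        return index

index = ToyIndex()
index.add("doc-a", [0.0, 0.0])
index.add("doc-b", [1.0, 1.0])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_index.json")
    index.save(path)
    restored = ToyIndex.load(path)

result = restored.search([0.9, 0.8], k=1)
print(result)
```

The restored index answers queries exactly like the original one; the trade-off is that everything must still fit in memory once loaded.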

In [71]:
from langchain_community.vectorstores import FAISS

# Create the vector store from the documents defined above
vectorstore = FAISS.from_documents(documents, embeddings)

# Search
docs = vectorstore.similarity_search("what can be used for editing genomes", k=1)
print(docs)

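# Save to disk, then load from the same path
vectorstore.save_local("db/faiss_index")
# allow_dangerous_deserialization=True is required because the saved store includes a pickle file;
# only enable it for index files you created yourself or otherwise trust
loaded_vectorstore = FAISS.load_local("db/faiss_index", embeddings, allow_dangerous_deserialization=True)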

[Document(id='6a26b38b-0096-4b03-ae20-583cac160a6a', metadata={'source': 'Genetics Weekly', 'author': 'Dr. Rachel Gene', 'date': '2023-01-10'}, page_content='CRISPR is a powerful tool for editing genomes, allowing researchers to easily alter DNA sequences and modify gene function.')]


## Chroma
Chroma is an **open-source** vector database used for storing and querying high-dimensional embeddings. When a `persist_directory` is specified, it stores indexes and metadata on disk in that directory, using an **SQLite backend** for persistent storage, so the same database can be reopened later.

In [6]:
from langchain_chroma import Chroma

# Create from documents
vectorstore = Chroma.from_documents(
    documents=documents,
    embedding=embeddings,
    persist_directory="./db/chroma_db"
)


# Load existing DB
loaded_vectorstore = Chroma(
    persist_directory="./db/chroma_db",
    embedding_function=embeddings
)

# Search with metadata filtering
docs = loaded_vectorstore.similarity_search(
    "what do we use to edit genomes",
    k=1,
    filter={"source": "Genetics Weekly"}  # must match a metadata value present in the stored documents
)
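
Conceptually, a metadata filter restricts the results to documents whose metadata matches, and only those are ranked by similarity. The sketch below illustrates that behaviour in plain Python; it is not how Chroma implements filtering internally, and `filtered_search`, the two-dimensional vectors, and the record layout are hypothetical.

```python
def dot(a, b):
    """Dot product, used here as a simple similarity score."""
    return sum(x * y for x, y in zip(a, b))

def filtered_search(query_vec, records, metadata_filter, k=1):
    """Keep records whose metadata matches every filter key, then rank by similarity."""
    candidates = [
        r for r in records
        if all(r["metadata"].get(key) == value for key, value in metadata_filter.items())
    ]
    candidates.sort(key=lambda r: dot(query_vec, r["vector"]), reverse=True)
    return candidates[:k]

# Hypothetical records with toy vectors
records = [
    {"text": "CRISPR ...", "vector": [0.9, 0.1], "metadata": {"source": "Genetics Weekly"}},
    {"text": "Photosynthesis ...", "vector": [0.8, 0.3], "metadata": {"source": "Biology Textbook"}},
]
query_vec = [1.0, 0.0]

hits = filtered_search(query_vec, records, {"source": "Biology Textbook"})
misses = filtered_search(query_vec, records, {"source": "news"})
print([h["text"] for h in hits])
print(misses)
```

Note that the filter takes precedence over similarity: the first call returns the biology record even though the other vector is closer to the query, and a filter value that matches no stored document yields an empty result.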

## Pinecone
Pinecone is a **managed** vector database used for storing and retrieving high-dimensional vectors (embeddings). When using Pinecone in LangChain, embeddings are stored in a Pinecone index, which is hosted and managed by Pinecone's infrastructure in the **cloud**.

In [64]:
# Generate the embedding for the text
text = "test"
embedding = embeddings.embed_query(text)
# Get the dimension (length of the embedding)
embedding_dimension = len(embedding)

In [65]:
from pinecone import Pinecone, ServerlessSpec

# Initialize the Pinecone client (with no api_key argument, it reads the PINECONE_API_KEY environment variable)
pc = Pinecone()
# Create the index if it does not already exist
index_name = "langchain-demo"
if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=embedding_dimension,  # must match the output dimension of the embedding model
        metric="cosine",
        spec=ServerlessSpec(
            cloud="aws",  # or "gcp"
            region="us-east-1"  # or your desired region
        )
    )

![title](files/pinecone_index.png)

In [68]:
from langchain_pinecone import PineconeVectorStore
index_name = "langchain-demo"

# Create vectorstore
vectorstore = PineconeVectorStore.from_texts( # you can also use texts directly instead of documents
    texts=texts,
    embedding=embeddings,
    index_name=index_name,
)

# Search
docs = vectorstore.similarity_search("query text", k=3)
print(docs)

[Document(id='ca01e0df-51d7-4a98-81ff-0bdaf26237aa', metadata={}, page_content='Text 1')]
