# MultiVectorStore

Storing multiple vectors per document often improves retrieval, because each vector can capture a different aspect of the document, such as a single passage, a summary, or a question it answers. LangChain has a base `MultiVectorRetriever` that makes this kind of setup easy to query. Most of the complexity lies in how to create the multiple vectors for each document. This notebook covers some of the common ways to create those vectors and use the `MultiVectorRetriever`.

The methods to create multiple vectors per document include:

- Smaller chunks: split a document into smaller chunks and embed those (this is what `ParentDocumentRetriever` does).
- Summary: create a summary of each document and embed it along with (or instead of) the document.
- Hypothetical questions: create hypothetical questions that each document could answer and embed those along with (or instead of) the document.
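
As a rough illustration of the first method, the sketch below cuts one parent document into fixed-size child chunks that each carry the parent's ID in their metadata. The `Chunk` class, the `split_into_children` helper, and the `doc_id` metadata key are hypothetical names chosen for this example; they are not taken from LangChain or from the `MultiVectorStore` used in this notebook.

```python
import uuid
from dataclasses import dataclass, field
from typing import List


@dataclass
class Chunk:
    """A small piece of a parent document, tagged with the parent's ID."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def split_into_children(parent_id: str, text: str, chunk_size: int = 400) -> List[Chunk]:
    """Cut `text` into fixed-size character windows that all point back to `parent_id`."""
    return [
        Chunk(page_content=text[start:start + chunk_size], metadata={"doc_id": parent_id})
        for start in range(0, len(text), chunk_size)
    ]


# Hypothetical parent document: 1000 characters of stand-in text.
parent_id = str(uuid.uuid4())
parent_text = "word " * 200
children = split_into_children(parent_id, parent_text, chunk_size=400)

# Each child is what would be embedded; the metadata links it back to the parent.
print(len(children), [len(c.page_content) for c in children])  # 3 [400, 400, 200]
```

Real splitters usually break on sentence or paragraph boundaries and add overlap between chunks; the fixed character window here only keeps the example short.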


Note that this setup also enables another way of adding embeddings: manually. You can explicitly add questions or queries that should lead to a document being retrieved, which gives you more control over retrieval.
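
To make the manual option concrete, here is a minimal, self-contained sketch. It uses a toy bag-of-words "embedding" instead of a real model, and the `embed`, `cosine`, and `retrieve` helpers, the sample documents, and the IDs are all invented for illustration. The point is only that a hand-written query can surface a document whose own wording shares almost nothing with the user's question.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words 'embedding'; a real setup would call an embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Parent documents keyed by ID (this plays the role of the docstore).
docstore = {
    "doc-refunds": "Customers may return items within 30 days for a full reimbursement.",
    "doc-shipping": "Orders ship within two business days via ground courier.",
}

# Hand-written queries, each linked to the parent document it should surface.
manual_entries = [
    ("how do i get my money back", "doc-refunds"),
    ("when will my package arrive", "doc-shipping"),
]
index = [(embed(question), doc_id) for question, doc_id in manual_entries]


def retrieve(query):
    """Return the parent document linked to the closest hand-written query."""
    query_vec = embed(query)
    _, best_id = max(index, key=lambda entry: cosine(query_vec, entry[0]))
    return docstore[best_id]


print(retrieve("can i get my money back"))
```

The refund document never mentions "money back", yet it is returned because the manually added query stands in for it at search time.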

In [1]:
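import os

# Folder where the vectors, parent documents, and ID database are persisted.
data_folder = os.path.abspath("../data/full_stack")

# Make the project-local modules in the parent directory importable.
import sys
sys.path.append('../')

# Load environment variables (for example OPENAI_API_KEY) from the nearest .env file.
from dotenv import find_dotenv, load_dotenv
load_dotenv(find_dotenv(), override=True)

from langchain_openai import OpenAIEmbeddings
from langchain.vectorstores.chroma import Chroma
from langchain_openai import ChatOpenAI

# Project-local helpers (not part of LangChain).
from load_document import load_document
from cached_docstore import CachedDocStore
from multi_vectorstore import MultiVectorStore
from document_db import DocumentDB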

In [2]:
docs = load_document("./files/state_of_the_union.txt", chunk_it=True, chunk_size=4000)

In [3]:
len(docs)

11

In [4]:
len(docs[0].page_content)

3849

In [5]:
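embedding = OpenAIEmbeddings()

# Vector store that persists the embeddings under data_folder/vectors.
base_vectorstore = Chroma(
    persist_directory=data_folder + "/vectors",
    embedding_function=embedding,
)

# Document store for the full parent documents.
docstore = CachedDocStore(data_folder + "/parent_docs", cached=False)

# Combine both stores; the "chunk" functor is configured with chunk_size=400
# and searches are limited to the single best match (k=1).
multi_vectorstore = MultiVectorStore(
    vectorstore=base_vectorstore,
    docstore=docstore,
    ids_db_path=data_folder,
    functor="chunk",
    func_kwargs={"chunk_size": 400},
    search_kwargs={"k": 1},
)

db = DocumentDB(data_folder, vectorstore=multi_vectorstore)
retriever = db.as_retriever()

llm = ChatOpenAI()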

In [6]:
db.clean()

{'num_added': 0, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 11}

In [7]:
db.upsert_documents(docs)

{'num_added': 11, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 0}

In [None]:
sub_docs = base_vectorstore.similarity_search("justice breyer")

In [None]:
len(sub_docs)

In [None]:
len(sub_docs[0].page_content)

In [None]:
related_docs = retriever.invoke("justice breyer")

In [None]:
len(related_docs)

In [None]:
len(related_docs[0].page_content)
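
The last cells run the same query two ways: directly against `base_vectorstore`, and through `retriever`. Assuming `MultiVectorStore` follows the same general pattern as LangChain's `MultiVectorRetriever`, the first call would return small child chunks and the second would return the larger parent documents those chunks belong to. The sketch below shows that mapping step in isolation. The `parents_from_hits` helper, the `doc_id` key, and the stand-in data are hypothetical and are not the notebook's actual implementation or output.

```python
def parents_from_hits(child_hits, docstore, id_key="doc_id"):
    """Map ranked child hits to their parent documents, keeping first-seen order."""
    seen = []
    for hit in child_hits:
        parent_id = hit["metadata"].get(id_key)
        # Several chunks can share a parent, so each parent is kept only once.
        if parent_id is not None and parent_id not in seen:
            seen.append(parent_id)
    # Skip IDs that are missing from the docstore instead of raising.
    return [docstore[pid] for pid in seen if pid in docstore]


# Stand-in data: two parents of different sizes and three ranked child hits.
docstore = {"p1": "A" * 3500, "p2": "B" * 4000}
child_hits = [
    {"page_content": "A" * 400, "metadata": {"doc_id": "p1"}},
    {"page_content": "A" * 400, "metadata": {"doc_id": "p1"}},
    {"page_content": "B" * 400, "metadata": {"doc_id": "p2"}},
]

related = parents_from_hits(child_hits, docstore)
print(len(child_hits), len(related), len(related[0]))  # 3 2 3500
```

Deduplicating while preserving rank order means the parent of the best-matching chunk comes first, even when several of its chunks match the query.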