# MultiVectorStore

It is often useful to store multiple vectors per document, so that a query can match any one of several representations of the same document. LangChain has a base `MultiVectorRetriever` that makes querying this kind of setup easy. Most of the complexity lies in creating the multiple vectors for each document. This notebook covers some of the common ways to create those vectors and use the `MultiVectorRetriever`.

The methods to create multiple vectors per document include:

- Smaller chunks: split a document into smaller chunks and embed those (this is what `ParentDocumentRetriever` does).
- Summary: create a summary for each document and embed that along with (or instead of) the document.
- Hypothetical questions: create hypothetical questions that each document would be appropriate to answer, and embed those along with (or instead of) the document.
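
What the three methods share is that each small piece of text (chunk, summary, or question) is embedded on its own but carries the ID of the document it came from. The sketch below illustrates that linkage for the "smaller chunks" case using only the standard library. The `Record` class, the helper names, and the `kind` metadata field are hypothetical and exist only for illustration; the `doc_id` key is an assumed convention, and none of this is the actual `MultiVectorStore` implementation.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Record:
    """A piece of text to embed, plus metadata linking it to its parent."""
    text: str
    metadata: Dict[str, str] = field(default_factory=dict)


def split_into_chunks(text: str, chunk_size: int) -> List[str]:
    """Naive fixed-width splitter (a real splitter would respect sentence boundaries)."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def make_child_records(parent_text: str, chunk_size: int = 400) -> Tuple[str, List[Record]]:
    """Create a fresh parent ID and the child records that point back to it."""
    parent_id = str(uuid.uuid4())
    children = [
        # "doc_id" is an assumed key name for the parent link.
        Record(text=chunk, metadata={"doc_id": parent_id, "kind": "chunk"})
        for chunk in split_into_chunks(parent_text, chunk_size)
    ]
    return parent_id, children


# Placeholder text standing in for one 1000-character parent document.
parent_text = "x" * 1000
parent_id, children = make_child_records(parent_text, chunk_size=400)
```

A summary or a hypothetical question would be stored the same way: one more `Record` whose `doc_id` is the same parent ID, with a different `kind`.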


Note that this setup also allows you to add embeddings manually. This is useful because you can explicitly add questions or queries that should lead to a document being retrieved, which gives you more control.
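
To make the manual route concrete, here is a toy sketch of hand-written queries that each point at a parent document. Everything in it is an assumption for illustration: the word-count "embedding" stands in for a real embedding model, the `parents` dictionary plays the role of the docstore, and the document texts and queries are made-up placeholders. It is not how `MultiVectorStore` is implemented.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Parent documents keyed by ID (placeholder texts).
parents = {
    "doc-1": "Full text of a long passage about a Supreme Court nomination ...",
    "doc-2": "Full text of a long passage about infrastructure spending ...",
}

# Hand-written queries, each paired with the parent it should surface.
manual_entries = [
    ("who was nominated to the supreme court", "doc-1"),
    ("how much is spent on roads and bridges", "doc-2"),
]

# Only the short queries are "embedded"; the parents are looked up by ID.
index = [(bag_of_words(query), parent_id) for query, parent_id in manual_entries]


def retrieve(query, k=1):
    """Rank the small vectors, then return the de-duplicated parent texts."""
    q = bag_of_words(query)
    ranked = sorted(index, key=lambda entry: cosine(q, entry[0]), reverse=True)[:k]
    seen, results = set(), []
    for _, parent_id in ranked:
        if parent_id not in seen:
            seen.add(parent_id)
            results.append(parents[parent_id])
    return results


result = retrieve("supreme court nominated")
```

The search runs over the short hand-written queries, but what comes back is the full parent text, which is the same split between a sub-document search and a retriever call that the cells further down show.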

In [1]:
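import os
import sys

# Folder where the vectors, parent documents and ID database are persisted.
data_folder = os.path.abspath("../data/multi_vectorstore")

# Make the local helper modules in the parent directory importable.
sys.path.append('../')

# Load environment variables (e.g. the OpenAI API key) from a .env file.
from dotenv import find_dotenv, load_dotenv
load_dotenv(find_dotenv(), override=True)

from langchain_openai import OpenAIEmbeddings
from langchain.vectorstores.chroma import Chroma

# Local helper modules.
from load_document import load_document
from cached_docstore import CachedDocStore
from multi_vectorstore import MultiVectorStore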

In [2]:
docs = load_document("./files/state_of_the_union.txt", chunk_it=True, chunk_size=4000)

In [3]:
len(docs)

11

In [4]:
len(docs[0].page_content)

3849

In [29]:
embedding = OpenAIEmbeddings()

# Vector store that holds the embeddings of the small sub-documents.
base_vectorstore = Chroma(
    persist_directory=data_folder+"/vectors",
    embedding_function=embedding,
)

# Document store that holds the original (parent) documents.
docstore = CachedDocStore(data_folder+"/parent_docs", cached=False)

multi_vectorstore = MultiVectorStore(
    vectorstore=base_vectorstore,
    docstore=docstore,
    ids_db_path=data_folder,
    search_kwargs={"k": 1},
)

retriever = multi_vectorstore.as_retriever()

In [41]:
ids = multi_vectorstore.add_documents(docs, add_originals=True)

In [42]:
# One entry per method from the list above: smaller chunks, summary, hypothetical questions.
ids = multi_vectorstore.add_documents_multiple(
    docs,
    func_list=[("chunk", {"chunk_size": 400}), "summary", ("question", {"q": 2})],
    add_originals=True,
)

In [43]:
sub_docs = base_vectorstore.similarity_search("justice breyer")

In [44]:
len(sub_docs)

4

In [45]:
len(sub_docs[0].page_content)

390

In [46]:
related_docs = retriever.invoke("justice breyer")

In [47]:
len(related_docs)

4

In [48]:
len(related_docs[0].page_content)

3958