# MultiVectorStore

It is often beneficial to store multiple vectors per document. LangChain has a base `MultiVectorRetriever` that makes querying this type of setup easy. Most of the complexity lies in creating the multiple vectors per document. This notebook covers some of the common ways to create those vectors and use the `MultiVectorRetriever`.

The methods to create multiple vectors per document include:

- Smaller chunks: split a document into smaller chunks and embed those (this is what `ParentDocumentRetriever` does).
- Summary: create a summary for each document, embed that along with (or instead of) the document.
- Hypothetical questions: create hypothetical questions that each document would be appropriate to answer, embed those along with (or instead of) the document.


Note that this also enables another method of adding embeddings: adding them manually. This is useful because you can explicitly add questions or queries that should lead to a document being retrieved, which gives you more control.
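The idea behind all of these methods can be shown with a small, dependency-free sketch: several vectors (chunks or manually added queries) point to one parent document, and a search over those vectors returns the parent. Everything here is hypothetical and for illustration only: `ToyMultiVectorIndex`, `embed`, and `cosine` are not part of LangChain or of the `MultiVectorStore` class used in this notebook, and the bag-of-words "embedding" stands in for a real embedding model.

```python
import math
import uuid
from collections import Counter


def embed(text):
    """Toy bag-of-words 'embedding' (a stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class ToyMultiVectorIndex:
    """Hypothetical illustration: many child vectors map to one parent document."""

    def __init__(self):
        self.parents = {}   # doc_id -> full parent text (plays the role of the docstore)
        self.children = []  # (vector, doc_id) pairs (plays the role of the vector store)

    def add_parent(self, text, chunk_size=5):
        """Store the parent text and index it as chunks of `chunk_size` words."""
        doc_id = str(uuid.uuid4())
        self.parents[doc_id] = text
        words = text.split()
        for i in range(0, len(words), chunk_size):
            self.add_alias(doc_id, " ".join(words[i:i + chunk_size]))
        return doc_id

    def add_alias(self, doc_id, text):
        """Manually add an extra vector (e.g. an expected query) for a parent."""
        self.children.append((embed(text), doc_id))

    def retrieve(self, query):
        """Find the best-matching child vector and return its parent text."""
        q = embed(query)
        _, doc_id = max(self.children, key=lambda child: cosine(q, child[0]))
        return self.parents[doc_id]


index = ToyMultiVectorIndex()
court_id = index.add_parent(
    "The court confirmed a new justice after long hearings in the senate"
)
prices_id = index.add_parent(
    "Inflation and gas prices remain a concern for working families this year"
)

# A manually added query: none of its words appear in the parent text itself.
index.add_alias(prices_id, "cost of living")

court_hit = index.retrieve("new justice")
prices_hit = index.retrieve("cost of living")
```

In this sketch the query "cost of living" only reaches the second document because of the manually added alias, which is the kind of control the paragraph above describes.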

In [1]:
import sys
sys.path.append('../')

from dotenv import find_dotenv, load_dotenv
load_dotenv(find_dotenv(), override=True)

from langchain_openai import OpenAIEmbeddings
from langchain.vectorstores.chroma import Chroma

from load_document import load_document
from cached_docstore import CachedDocStore
from multi_vectorstore import MultiVectorStore

In [2]:
docs = load_document("./files/state_of_the_union.txt", chunk_it=True, chunk_size=4000)

In [3]:
len(docs)

11

In [4]:
len(docs[0].page_content)

3849

In [5]:
embedding = OpenAIEmbeddings()

base_vectorstore = Chroma(
                persist_directory="multi_vectorstore/state_of_the_union",
                embedding_function=embedding,
            )


docstore = CachedDocStore("multi_vectorstore/state_of_the_union/parent_docs", cached=False)

multi_vectorstore = MultiVectorStore(vectorstore=base_vectorstore, docstore=docstore, search_kwargs={"k": 1})

In [10]:
# Bind the lookup methods to short names for use in the cells below
child = multi_vectorstore.get_childs
aliases = multi_vectorstore.get_aliases

In [6]:
multi_vectorstore.add_documents(docs, add_originals=True)

['c481f8af-7871-49ab-ab48-befa838375b2',
 '263e9703-0adb-4d59-aee0-18f153804a34',
 'a460a603-57eb-42de-a374-bdad04bcb827',
 '6604816f-0a8f-4850-9da9-35382f615ba2',
 '0767a427-dc41-4707-a29e-27cf2720d85f',
 'eff27d7d-c1d8-4204-9eec-61c853585c23',
 '20cac85b-581f-4bc8-ab64-0951cc0c9f14',
 '5c762494-1867-473b-9b51-a6364053e324',
 '6d6b477c-3a2d-48e9-b407-eb0d881fe34b',
 'bb489335-a966-4566-8b68-60b1e4cbe5ba',
 '548f88e0-9995-4a2e-9403-2c85db2c1175']

In [20]:
id = docs[1].metadata['doc_id']
id

'9ae9e1c2-0790-44b9-9ee7-ecf179e2c72b'

In [21]:
child(id)

['080d4118-06e5-4a41-bd3e-e85d5abb4038',
 '97d92def-b57b-4b27-a1f7-3b52bf8ef772',
 '51389fb7-c2e0-4901-80b7-b5e2f4438e2b',
 '343fe7f3-3d82-4aa0-8cb6-3561baa1348c',
 'ac18e29a-be62-4bd3-95a7-145e89a33f4b',
 '9027764a-c27b-475b-8d50-2fbf94c17862',
 '3869ad9e-9549-417d-8db0-ef6d99a72194',
 '9396a7fc-3793-4989-a779-4a7462559673',
 '8cf629b4-47c2-4df2-90b6-ae2f03aa50f0',
 '9c25f9d0-5591-4fae-a73f-6dbe6f0aab2c',
 'b3fc9ad4-2c10-402d-a2e0-2329fc7177c4',
 '896fcece-239e-4130-b9dd-4300742e3a9b',
 '7c1fea81-1f94-4106-bc3a-2ea0baf2231b',
 '82d630a6-7ad4-4236-ad3d-d7ea0218f53b',
 'ff0b3255-c8bb-4a7e-8c4f-1cb1611e4a6d',
 '85e6c590-aaca-4755-a140-88d808e58e73',
 '60af3c75-ba8d-4f77-8568-e9e461137653',
 'd0269b53-eac1-4653-a3a2-98eaa8591d27',
 '339cff45-0e8f-4b5d-9742-7239bcf7147a',
 'f7c8c54d-0a7a-43ff-a714-407b83e1015a']

In [22]:
aliases(id)

[]

In [13]:
base_vectorstore.delete([child(id)[0]])

In [18]:
multi_vectorstore.delete([id])

True

In [19]:
multi_vectorstore.add_documents_multiple(docs, func_list=[("chunk", {"chunk_size":400}), "summary", ("question", {"q":2})])

['b6a4d51e-58f3-4f45-b270-af022ef389d5',
 'e6ea146a-edfa-4c6c-9388-49d8710242a4',
 '30daef46-bf1d-47f5-926e-041998dc8c0d',
 'c46e9f09-8faf-4dbb-aa19-a2593702e91f',
 'adf115dd-7bd2-4f26-a1a3-fe7ac1b79bf1',
 '27aa2079-f206-4157-bb7c-efc07ca5e315',
 '62067039-318a-4531-af3d-7ab5c35458fc',
 '0b789c9f-437d-4032-8cbb-dc591556496b',
 'd3306d51-2c5b-49ad-9e13-fd8df956dcb2',
 '8e0f30c1-5dae-494e-9245-deb12301168c',
 '8cfe09a7-960b-4eb5-a20a-c564bb3cacd5']
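The `func_list` argument above mixes bare names (`"summary"`) with `(name, kwargs)` tuples. How `MultiVectorStore` handles this internally is not shown in this notebook; the following is only a minimal sketch of one plausible way to normalize such a list into uniform `(name, kwargs)` pairs before dispatching. The helper name `normalize_func_list` is hypothetical.

```python
def normalize_func_list(func_list):
    """Turn mixed entries into uniform (name, kwargs) pairs.

    Hypothetical helper: a bare string becomes (name, {}), and a
    (name, kwargs) tuple is kept, with the kwargs dict copied.
    """
    normalized = []
    for entry in func_list:
        if isinstance(entry, str):
            normalized.append((entry, {}))
        else:
            name, kwargs = entry
            normalized.append((name, dict(kwargs)))
    return normalized


specs = normalize_func_list(
    [("chunk", {"chunk_size": 400}), "summary", ("question", {"q": 2})]
)
for name, kwargs in specs:
    print(name, kwargs)
```

Normalizing first keeps the dispatch loop simple, since every entry then has the same shape regardless of how the caller wrote it.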

In [None]:
# Query the underlying vector store directly; this searches the embedded sub-documents
sub_docs = base_vectorstore.similarity_search("justice breyer")

In [None]:
len(sub_docs)

In [None]:
len(sub_docs[0].page_content)

In [None]:
# Query through the multi-vector store; this should return the parent documents
related_docs = multi_vectorstore.get_relevant_documents("justice breyer")

In [None]:
len(related_docs)

In [None]:
len(related_docs[0].page_content)