## MultiVectorStore

It is often beneficial to store multiple vectors per document. A transformation is applied to each parent document to create child documents, which are then embedded and stored in the vector store. The methods for creating child documents from a parent document include:

- Smaller chunks: split a document into smaller chunks.
- Summary: create a summary for each document using an LLM.
- Hypothetical questions: use an LLM to create hypothetical questions that each document would be appropriate to answer.
- Custom functions: generate child documents in any other way that improves the quality of the retrieval process.

This lets the retriever search over the child document embeddings to find the best match, and then return the corresponding parent documents, which carry more content.
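The parent/child relationship can be sketched with plain Python. This is a minimal illustration and not the `MultiVectorStore` implementation: the `Doc` class, `split_into_children`, and the dictionary used as a document store are hypothetical stand-ins. The only detail taken from this notebook is that each child carries its parent's id under the `doc_id` metadata key.

```python
from dataclasses import dataclass, field
import uuid


@dataclass
class Doc:
    """Minimal stand-in for a document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def split_into_children(parent_id: str, text: str, chunk_size: int = 400,
                        chunk_overlap: int = 40) -> list[Doc]:
    """Cut `text` into overlapping character windows tagged with the parent id."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    children = []
    for start in range(0, len(text), step):
        children.append(Doc(text[start:start + chunk_size], {"doc_id": parent_id}))
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return children


parent_id = str(uuid.uuid4())
parent = Doc("word " * 200)  # 1000 characters of filler text

# Parents are kept whole in a document store, keyed by id (a dict here).
docstore = {parent_id: parent}

# Children are what would be embedded and written to the vector store.
child_index = split_into_children(parent_id, parent.page_content)
print(len(child_index), child_index[0].metadata["doc_id"] == parent_id)
```

Because every child points back to its parent through `doc_id`, a search hit on a small chunk is enough to locate the full parent text.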


In [1]:
import os
import sys
sys.path.append('../')

from dotenv import find_dotenv, load_dotenv
load_dotenv(find_dotenv(), override=True)

from langchain_openai import OpenAIEmbeddings
from langchain.vectorstores.chroma import Chroma
from langchain_openai import ChatOpenAI

from load_document import load_document, load_unstructured_document
from cached_docstore import CachedDocStore
from multi_vectorstore import MultiVectorStore

#### Load the files and split them into chunks. These chunks will be our parent documents.

In [2]:
docs = load_document("./files/state_of_the_union.txt", chunk_it=True, chunk_size=4000, chunk_overlap=200)

len(docs)

11

In [3]:
len(docs[0].page_content)

3849

#### Initialize the document store to save the parent documents

In [4]:
data_folder = os.path.abspath("../data/multi_vectorstore")

docstore = CachedDocStore(data_folder+"/parent_docs", cached=False)

#### Initialize the base vector store to save the child documents and the embeddings

In [5]:
embedding = OpenAIEmbeddings()

base_vectorstore = Chroma(
    persist_directory=data_folder+"/child_docs",
    embedding_function=embedding,
)

#### Initialize the LLM to be used to create summaries and questions

In [6]:
llm = ChatOpenAI()

#### Initialize the multi vector store

In [7]:
multi_vectorstore = MultiVectorStore(
    vectorstore=base_vectorstore,
    docstore=docstore,
    ids_db_path=data_folder,
)

#### Add documents to the multi vector store
The default transformation is smaller chunks.

In [8]:
ids = multi_vectorstore.add_documents(docs)

len(ids)

11

The returned ids correspond to the parent documents. **`get_by_ids`** can be used to retrieve the parent documents.

In [9]:
doc = multi_vectorstore.get_by_ids([ids[0]])

print(doc[0].page_content[:240])

Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  

Last year COVID-19 kept us apart. This year we are finally together again.


**`get_child_ids`** can be used to get the ids of the child documents of a parent document.

In [10]:
child_ids = multi_vectorstore.get_child_ids(ids[0])

len(child_ids)

9

In [11]:
child_docs = multi_vectorstore.vectorstore.similarity_search("", filter={"doc_id": ids[0]})

len(child_docs[0].page_content)

490

In [12]:
print(child_docs[0].page_content[:240])

Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  

Last year COVID-19 kept us apart. This year we are finally together again.


The transformation can also be set when inserting the documents, using the **`functor`** parameter. This parameter can also be set when the multi vector store is initialized.

In [13]:
ids2 = multi_vectorstore.add_documents(docs, functor="summary", llm=llm)

assert ids == ids2

In [14]:
child_ids = multi_vectorstore.get_child_ids(ids[0])

len(child_ids)

10

The parameter **`func_kwargs`** can be used to pass arguments to the transformation function. This parameter can also be used at the time of the multi vector store initialization.

In [15]:
ids3 = multi_vectorstore.add_documents(docs, functor="question", func_kwargs={"q": 2}, llm=llm)

In [16]:
child_ids = multi_vectorstore.get_child_ids(ids[0])

len(child_ids)

12
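The child counts for the first parent grow from 9 to 10 to 12 across the three `add_documents` calls, which is consistent with one summary and two questions being added on top of the nine chunks. The sketch below shows one way such bookkeeping could work, assuming child ids accumulate per parent id. `ChildIdRegistry` and the id strings are hypothetical and are not taken from the `MultiVectorStore` source.

```python
from collections import defaultdict


class ChildIdRegistry:
    """Hypothetical parent-id -> child-ids map that accumulates across inserts."""

    def __init__(self) -> None:
        self._children: dict[str, list[str]] = defaultdict(list)

    def register(self, parent_id: str, child_ids: list[str]) -> None:
        """Append new child ids for a parent, skipping ids already recorded."""
        known = self._children[parent_id]
        for child_id in child_ids:
            if child_id not in known:
                known.append(child_id)

    def get_child_ids(self, parent_id: str) -> list[str]:
        """Return a copy of the child ids recorded for a parent."""
        return list(self._children[parent_id])


registry = ChildIdRegistry()

registry.register("parent-0", [f"chunk-{i}" for i in range(9)])
after_chunks = len(registry.get_child_ids("parent-0"))

registry.register("parent-0", ["summary-0"])
after_summary = len(registry.get_child_ids("parent-0"))

registry.register("parent-0", ["question-0", "question-1"])
after_questions = len(registry.get_child_ids("parent-0"))

print(after_chunks, after_summary, after_questions)  # 9 10 12
```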

The parameter **`add_originals`** can be used to add the parent documents' embeddings to the vector store. **`get_aliases`** returns the ids of the embedding entries for a parent document.

In [17]:
ids4 = multi_vectorstore.add_documents(docs, functor="none", add_originals=True)

In [18]:
aliases = multi_vectorstore.get_aliases(ids[0])

len(aliases)

1

In [19]:
assert ids[0] == aliases[0]

You can also pass a list of functors, each with its own kwargs, to insert multiple types of child documents at once.

In [20]:
ids = multi_vectorstore.add_documents(
    docs,
    functor=[("chunk", {"chunk_size": 400, "chunk_overlap": 40}), "summary", ("question", {"q": 2})],
    llm=llm,
)
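The `functor` list mixes bare names with `(name, kwargs)` tuples. A minimal sketch of how such a mixed specification could be normalized into uniform pairs is shown here. `normalize_functors` is a hypothetical helper written for illustration, and the way `MultiVectorStore` actually parses this argument may differ.

```python
from typing import Any, Union

FunctorSpec = Union[str, tuple[str, dict[str, Any]]]


def normalize_functors(
    functor: Union[FunctorSpec, list[FunctorSpec]],
) -> list[tuple[str, dict[str, Any]]]:
    """Turn a name, a (name, kwargs) tuple, or a list of either into (name, kwargs) pairs."""
    specs = functor if isinstance(functor, list) else [functor]
    pairs = []
    for spec in specs:
        if isinstance(spec, str):
            pairs.append((spec, {}))  # bare name: no extra arguments
        else:
            name, kwargs = spec
            pairs.append((name, dict(kwargs)))  # copy so the caller's dict is not shared
    return pairs


pairs = normalize_functors(
    [("chunk", {"chunk_size": 400, "chunk_overlap": 40}), "summary", ("question", {"q": 2})]
)
for name, kwargs in pairs:
    print(name, kwargs)
```

Normalizing up front means the code that applies each transformation only has to handle one shape of input.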

You can also specify the transformation function(s) when the multi vector store is initialized. This is beneficial when the multi vector store is managed by another layer, such as a DocumentDB.

In [21]:
multi_vectorstore = MultiVectorStore(
    vectorstore=base_vectorstore,
    docstore=docstore,
    ids_db_path=data_folder,
    functor=[("chunk", {"chunk_size": 400, "chunk_overlap": 40}), "summary", ("question", {"q": 2})],
    llm=llm,
)

### Retrieval

Under the hood, the multi vector store retriever uses the base vector store to search within the child documents, and then uses the document store to retrieve the associated parent documents.
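The two-step lookup can be sketched as follows. This is an illustration under stated assumptions rather than the retriever's actual code: `parents_for_hits`, the dictionary document store, and the hard-coded hits are hypothetical, and only the `doc_id` metadata key comes from this notebook. It also shows why a search over several child documents can return fewer parents: hits that share a parent collapse into one result.

```python
def parents_for_hits(child_hits: list[dict], docstore: dict[str, str]) -> list[str]:
    """Map ranked child hits to parents, keeping the first occurrence of each parent."""
    seen = set()
    parents = []
    for hit in child_hits:
        parent_id = hit["doc_id"]
        if parent_id in seen or parent_id not in docstore:
            continue  # skip repeated parents and ids missing from the store
        seen.add(parent_id)
        parents.append(docstore[parent_id])
    return parents


docstore = {"p1": "full text of parent 1", "p2": "full text of parent 2"}

# Pretend these are the top five child matches from the vector store, best first.
child_hits = [
    {"doc_id": "p2", "text": "chunk"},
    {"doc_id": "p2", "text": "summary"},
    {"doc_id": "p1", "text": "chunk"},
    {"doc_id": "p2", "text": "question"},
    {"doc_id": "p1", "text": "question"},
]

related = parents_for_hits(child_hits, docstore)
print(len(related))  # 2
```

Keeping the first occurrence preserves the ranking of the best-matching child for each parent.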

In [22]:
child_docs = multi_vectorstore.vectorstore.similarity_search("justice breyer")

len(child_docs)

4

In [23]:
len(child_docs[0].page_content)

390

In [24]:
print(child_docs[0].page_content[:250])

Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. 


In [25]:
retriever = multi_vectorstore.as_retriever(search_kwargs={"k": 5})  # k is the number of child documents to use

related_parent_docs = retriever.invoke("justice breyer")

len(related_parent_docs)

2

In [26]:
len(related_parent_docs[0].page_content)

3958

In [27]:
print(related_parent_docs[0].page_content[2456:2706])

Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. 
