# MultiVectorStore

Storing multiple vectors per document is often beneficial, and several use cases call for it. LangChain has a base `MultiVectorRetriever` that makes querying this type of setup easy. Most of the complexity lies in creating the multiple vectors for each document. This notebook covers some common ways to create those vectors and use the `MultiVectorRetriever`.

The methods to create multiple vectors per document include:

- Smaller chunks: split a document into smaller chunks and embed those (this is what `ParentDocumentRetriever` does).
- Summary: create a summary for each document and embed it along with (or instead of) the document.
- Hypothetical questions: create hypothetical questions that each document would be appropriate to answer, and embed those along with (or instead of) the document.
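
What all three methods share is that each derived text (chunk, summary, or question) carries the ID of the document it came from, so a vector match can be traced back to its parent. The following is a minimal, library-free sketch of that bookkeeping for the "smaller chunks" case. The names `split_text` and `build_child_records`, the dictionary layout, and the fixed-width splitting are assumptions made for illustration; they are not the LangChain API or the `MultiVectorStore` implementation used in this notebook.

```python
import uuid


def split_text(text, chunk_size):
    """Naive fixed-width splitter (illustrative only; real splitters respect boundaries)."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def build_child_records(parent_texts, chunk_size=400, id_key="doc_id"):
    """Return (docstore, children): parents keyed by ID, chunks tagged with the parent ID."""
    docstore = {}   # parent_id -> full parent text
    children = []   # small records that would be embedded in the vector store
    for text in parent_texts:
        parent_id = str(uuid.uuid4())
        docstore[parent_id] = text
        for chunk in split_text(text, chunk_size):
            children.append({"page_content": chunk, "metadata": {id_key: parent_id}})
    return docstore, children


# Two hypothetical parent documents of 1000 and 300 characters.
docstore, children = build_child_records(["a" * 1000, "b" * 300], chunk_size=400)

# Every child can be resolved to its parent through the shared ID.
first_parent = docstore[children[0]["metadata"]["doc_id"]]
```

In a real setup only the `children` records would be embedded, while `docstore` holds the full text that is returned once a child matches.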


Note that this setup also enables another way of adding embeddings: manually. You can explicitly add questions or queries that should lead to a given document being retrieved, which gives you more control over retrieval.
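
To make the manual approach concrete, here is a toy sketch in which a hand-written question is registered as an extra search entry for a parent document. Token-overlap (Jaccard) scoring stands in for embedding similarity, and `ToyMultiVectorIndex`, `add_parent`, `add_alias`, and `retrieve` are hypothetical names invented for this illustration, not methods of LangChain or of `MultiVectorStore`.

```python
import re


def tokens(text):
    """Lowercased word set; a crude stand-in for an embedding."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Token-overlap similarity between two strings, in [0, 1]."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


class ToyMultiVectorIndex:
    """Hypothetical index: many search entries can point at one parent document."""

    def __init__(self):
        self.parents = {}   # parent_id -> full text
        self.entries = []   # (searchable_text, parent_id)

    def add_parent(self, parent_id, text):
        self.parents[parent_id] = text
        self.entries.append((text, parent_id))

    def add_alias(self, parent_id, query_text):
        """Manually register a query that should lead to this parent."""
        if parent_id not in self.parents:
            raise KeyError(parent_id)
        self.entries.append((query_text, parent_id))

    def retrieve(self, query):
        """Return the parent text of the best-matching entry, or None if nothing overlaps."""
        if not self.entries:
            return None
        scored = [(jaccard(query, text), pid) for text, pid in self.entries]
        best_score, best_id = max(scored, key=lambda s: s[0])
        return self.parents[best_id] if best_score > 0 else None


index = ToyMultiVectorIndex()
index.add_parent("p1", "Quarterly revenue grew as cloud spending increased.")
index.add_parent("p2", "The court confirmed a new associate justice after a long hearing.")

before = index.retrieve("who replaced breyer")   # no entry shares a word with the query
index.add_alias("p2", "Who replaced Justice Breyer?")
after = index.retrieve("who replaced breyer")    # the alias now leads to p2
```

The alias never replaces the parent text; it only adds one more path by which the same document can be found.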

In [1]:
import os
data_folder = os.path.abspath("../data")

import sys
sys.path.append('../')

from dotenv import find_dotenv, load_dotenv
load_dotenv(find_dotenv(), override=True)

from langchain_openai import OpenAIEmbeddings
from langchain.vectorstores.chroma import Chroma

from load_document import load_document
from cached_docstore import CachedDocStore
from multi_vectorstore import MultiVectorStore

In [2]:
docs = load_document("./files/state_of_the_union.txt", chunk_it=True, chunk_size=4000)

In [3]:
len(docs)

11

In [4]:
len(docs[0].page_content)

3849

In [5]:
embedding = OpenAIEmbeddings()

base_vectorstore = Chroma(
    persist_directory="../data/multi_vectorstore/state_of_the_union",
    embedding_function=embedding,
)


docstore = CachedDocStore(data_folder+"/multi_vectorstore/state_of_the_union/parent_docs", cached=False)

multi_vectorstore = MultiVectorStore(vectorstore=base_vectorstore, docstore=docstore, search_kwargs={"k": 1})

In [6]:
child = multi_vectorstore.get_child_ids
aliases = multi_vectorstore.get_aliases

In [None]:
ids = multi_vectorstore.add_documents(docs, add_originals=True)

In [8]:
i = 10

In [9]:
id = docs[i].metadata['doc_id']
id

'52678216-dc41-491e-8246-f145604d8649'

In [11]:
ids[i]

'eb3166b6-32bb-432f-a597-b73c5a34963a'

In [12]:
id

'52678216-dc41-491e-8246-f145604d8649'

In [None]:
child(id)

In [None]:
aliases(id)

In [None]:
multi_vectorstore.get_by_ids([id])

In [None]:
multi_vectorstore.delete([id])

In [None]:
multi_vectorstore.get_by_ids(ids)

In [7]:
ids = multi_vectorstore.add_documents_multiple(
    docs,
    func_list=[("chunk", {"chunk_size": 400}), "summary", ("question", {"q": 2})],
    add_originals=True,
)

In [None]:
sub_docs = base_vectorstore.similarity_search("justice breyer")

In [None]:
len(sub_docs)

In [None]:
len(sub_docs[0].page_content)

In [None]:
related_docs = multi_vectorstore.get_relevant_documents("justice breyer")

In [None]:
len(related_docs)

In [None]:
len(related_docs[0].page_content)