## MultiVectorDocumentDB

This class extends DocumentDB to support multiple vectors per document. It provides additional functionality for:

- Creating multiple vectors per document (e.g., smaller chunks, summaries, hypothetical questions)
- Using LangChain's MultiVectorRetriever for efficient querying

In [1]:
import os
import sys
sys.path.append('../')

from dotenv import find_dotenv, load_dotenv
load_dotenv(find_dotenv(), override=True)

from document_loaders.load_document import load_document
from multi_vector_document_db import MultiVectorDocumentDB

#### Load files and split them into chunks. These chunks will be our parent documents.

In [2]:
my_file = "./files/state_of_the_union.txt"

def load_docs(file_name):
    return load_document(file_name, chunk_it=True, chunk_size=4000, chunk_overlap=200)
    
docs = load_docs(my_file)

In [3]:
len(docs)

11

In [4]:
len(docs[0].page_content)

3849

#### Initialize the document database

In [5]:
data_folder = os.path.abspath("../data/multi_embedding_document_db")

db = MultiVectorDocumentDB.create(
        data_folder,
        functor=[("chunk", {"chunk_size":500, "chunk_overlap": 50}), "summary", ("question", {"q":2})]
    )
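The `functor` argument mixes bare names with `(name, options)` pairs. A minimal sketch of how such a list could be normalized into uniform pairs is shown here; `normalize_functor_spec` is a hypothetical helper written for illustration, not part of `MultiVectorDocumentDB`, and the meaning of each option (for example `q`, presumably the number of questions per document) is an assumption.

```python
from typing import Any, Dict, List, Tuple, Union

# A spec item is either a bare name or a (name, options) pair.
SpecItem = Union[str, Tuple[str, Dict[str, Any]]]


def normalize_functor_spec(spec: List[SpecItem]) -> List[Tuple[str, Dict[str, Any]]]:
    """Return every item as a (name, options) pair; bare names get empty options."""
    normalized = []
    for item in spec:
        if isinstance(item, str):
            normalized.append((item, {}))
        else:
            name, options = item
            normalized.append((name, dict(options)))  # copy so the input is not shared
    return normalized


spec = [
    ("chunk", {"chunk_size": 500, "chunk_overlap": 50}),
    "summary",
    ("question", {"q": 2}),
]
for name, options in normalize_functor_spec(spec):
    print(name, options)
```

Under this reading, each parent document would get three kinds of child vectors: smaller chunks, a summary, and generated questions.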

**`upsert_documents`** inserts documents into the database, skipping documents that already exist and deleting outdated versions.

In [6]:
db.upsert_documents(docs)

{'num_added': 11, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 0}

The parent documents are stored in the `docstore` associated with the `vectorstore` used by the database. The generator **`yield_keys`** returns the ids of the parent documents in the docstore.

In [7]:
ids = list(db.vectorstore.docstore.yield_keys())

len(ids)

11

The method **`get_by_ids`** returns the list of parent documents associated with a list of ids.

In [8]:
doc = db.vectorstore.get_by_ids([ids[0]])

len(doc[0].page_content)

3849

In [9]:
print(doc[0].page_content[:240])

Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  

Last year COVID-19 kept us apart. This year we are finally together again.


The method **`get_child_ids`** returns the ids of the children of a parent document.

In [10]:
child_ids = db.vectorstore.get_child_ids(ids[0])

len(child_ids)

12

In [11]:
child_docs = db.vectorstore.similarity_search("", k=100, filter={"id": ids[0]})

len(child_docs)

12
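Both calls above return the same number of children for one parent, which suggests that each child carries its parent's id in metadata under the key `id`. The sketch below illustrates that layout with plain dictionaries. It assumes a fixed character window for chunking, which is simpler than whatever splitter the class actually uses, so it is not expected to reproduce the counts in this notebook. `make_children` and `children_of` are hypothetical names.

```python
import uuid


def make_children(parent_id, text, chunk_size=500, chunk_overlap=50):
    """Split text into overlapping character windows, each tagged with the parent id."""
    step = chunk_size - chunk_overlap
    children = []
    for start in range(0, len(text), step):
        children.append({
            "child_id": str(uuid.uuid4()),
            "page_content": text[start:start + chunk_size],
            "metadata": {"id": parent_id},  # assumed link back to the parent
        })
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return children


def children_of(children, parent_id):
    """Select the children whose metadata points at the given parent."""
    return [c for c in children if c["metadata"]["id"] == parent_id]


store = make_children("parent-a", "x" * 1200) + make_children("parent-b", "y" * 300)
print(len(children_of(store, "parent-a")))
print(len(children_of(store, "parent-b")))
```

Filtering on the parent id is what makes both lookups possible: listing the child ids of a parent, and restricting a similarity search to one parent's children.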

In [12]:
len(child_docs[0].page_content)

490

In [13]:
print(child_docs[0].page_content[:240])

Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  

Last year COVID-19 kept us apart. This year we are finally together again.


Note that upserting updated documents inserts only the modified content and deletes the outdated content.

In [14]:
docs2 = load_docs(my_file)

In [15]:
docs2[0].page_content = docs2[0].page_content.upper()

In [16]:
db.upsert_documents(docs2)

{'num_added': 1, 'num_updated': 0, 'num_skipped': 10, 'num_deleted': 1}
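One way to arrive at counts like these is to key every document by a hash of its content and compare the incoming batch with what is already stored. The sketch below is an illustration under that assumption, not the class's actual implementation. It treats each batch as a full snapshot, so anything missing from the batch is deleted; the real class may scope deletions differently (for example per source file). `content_key` and `upsert` are hypothetical names.

```python
import hashlib


def content_key(text):
    """Stable key derived from the document content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def upsert(index, texts):
    """Sync `index` (a set of content keys, mutated in place) with `texts`."""
    incoming = {content_key(t) for t in texts}
    added = incoming - index      # new content
    skipped = incoming & index    # unchanged content
    deleted = index - incoming    # stored content that is no longer present
    index.difference_update(deleted)
    index.update(added)
    return {
        "num_added": len(added),
        "num_updated": 0,
        "num_skipped": len(skipped),
        "num_deleted": len(deleted),
    }


index = set()
texts = [f"chunk {i}" for i in range(11)]
first = upsert(index, texts)

texts[0] = texts[0].upper()  # modify one document, as in the cells above
second = upsert(index, texts)
print(first)
print(second)
```

Because the key depends on the content, a modified document looks like a new document plus a stale one, which is why a single edit shows up as one addition and one deletion rather than an update.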

#### Retrieval

**`as_retriever`** returns a retriever that can be used to query the database for documents. For comparison, we first query the vector store directly, which returns child documents.

In [17]:
child_docs = db.vectorstore.similarity_search("justice breyer", k=5)

In [18]:
len(child_docs)

5

In [19]:
len(child_docs[0].page_content)

390

In [20]:
print(child_docs[0].page_content[:250])

Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. 


In [21]:
# k is the number of child docs to retrieve; the retriever returns the parent docs of those children
retriever = db.as_retriever(k=5)

In [22]:
related_docs = retriever.invoke("justice breyer")

In [23]:
len(related_docs)

1
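Five child hits collapse into a single parent here because several children can belong to the same parent document. The sketch below shows that step in isolation: collect the parent ids of the hits in ranking order, drop repeats, and fetch the parents from the docstore. The dictionaries stand in for real documents, the metadata key `id` is an assumption, and `parents_for_hits` is a hypothetical name.

```python
def parents_for_hits(child_hits, docstore, id_key="id"):
    """Map ranked child hits to unique parent documents, preserving rank order."""
    parent_ids = []
    for hit in child_hits:
        parent_id = hit["metadata"].get(id_key)
        if parent_id is not None and parent_id not in parent_ids:
            parent_ids.append(parent_id)
    # Skip ids that are missing from the docstore instead of failing.
    return [docstore[pid] for pid in parent_ids if pid in docstore]


docstore = {"p1": "full text of parent 1", "p2": "full text of parent 2"}

# Five hits that all come from the same parent yield one result.
same_parent_hits = [{"page_content": f"hit {i}", "metadata": {"id": "p1"}} for i in range(5)]
print(len(parents_for_hits(same_parent_hits, docstore)))

# Hits spread over two parents yield two results, best-ranked parent first.
mixed_hits = [
    {"page_content": "a", "metadata": {"id": "p2"}},
    {"page_content": "b", "metadata": {"id": "p1"}},
    {"page_content": "c", "metadata": {"id": "p2"}},
]
print(parents_for_hits(mixed_hits, docstore))
```

This is why `k` bounds the number of child documents searched but only gives an upper limit on the number of parents returned.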

In [24]:
len(related_docs[0].page_content)

3958

In [25]:
print(related_docs[0].page_content[2456:2706])

Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. 


In [26]:
db.delete_index()