## DocumentDB

This class loads documents from any source into a vector store and keeps them in sync, using an index to track what has already been written.

Specifically, it helps avoid:

- writing duplicated content into the vector store
- re-writing unchanged content
- re-computing embeddings over unchanged content
- manually deleting outdated content
    
The index will work even with documents that have gone through several transformation steps (e.g., via text chunking) with respect to the original source files.
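
One common way to get this behaviour is to fingerprint every chunk and compare fingerprints instead of raw text. The sketch below illustrates that idea only; `fingerprint` is a hypothetical helper, and hashing the content together with the metadata is an assumption, not necessarily what `DocumentDB` does internally.

```python
import hashlib
import json


def fingerprint(page_content: str, metadata: dict) -> str:
    """Hypothetical helper: stable SHA-256 hash of a chunk's text and metadata."""
    # sort_keys makes the hash independent of metadata key order.
    payload = json.dumps(
        {"content": page_content, "metadata": metadata}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


a = fingerprint("We the People", {"source": "us_constitution.pdf"})
b = fingerprint("We the People", {"source": "us_constitution.pdf"})
c = fingerprint("WE THE PEOPLE", {"source": "us_constitution.pdf"})

print(a == b)  # identical chunks share a fingerprint, so the second can be skipped
print(a == c)  # any change to the text yields a different fingerprint
```

Because the fingerprint is computed on the chunk itself, it does not matter how many transformation steps sit between the source file and the chunk.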

In [1]:
import sys
sys.path.append('../')

from dotenv import find_dotenv, load_dotenv
load_dotenv(find_dotenv(), override=True)

from langchain_openai import OpenAIEmbeddings
from langchain.vectorstores.chroma import Chroma

from load_document import load_unstructured_document
from document_db import DocumentDB

#### Load files and split them into chunks. These chunks will be our documents.

In [2]:
files = ["./files/state_of_the_union.txt", "./files/us_constitution.pdf"]

In [3]:
docs = []
for file in files:
    chunks = load_unstructured_document(file, chunk_it=True, chunk_size=1000, chunk_overlap=100)
    print(f"File {file} produced {len(chunks)} documents")
    docs.extend(chunks)

File ./files/state_of_the_union.txt produced 42 documents
File ./files/us_constitution.pdf produced 50 documents


#### Initialize the vector store to save the documents and the embeddings

In [4]:
embedding = OpenAIEmbeddings()

vectorstore = Chroma(
    persist_directory="../data/document_db",
    embedding_function=embedding,
)

#### Set up the document database

`location` is the path to the directory where the database index will be stored. The vector store takes care of storing the documents.

In [5]:
db = DocumentDB(location="../data/document_db", vectorstore=vectorstore)

**`upsert_documents`** inserts documents into the database, skipping documents that already exist and deleting outdated versions.

In [6]:
db.upsert_documents(docs)

{'num_added': 92, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 0}

In [7]:
docs[0].page_content = docs[0].page_content.upper()

In [8]:
db.upsert_documents(docs)

{'num_added': 1, 'num_updated': 0, 'num_skipped': 91, 'num_deleted': 1}
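
The counts above (one added, one deleted, the rest skipped) are what a per-source set comparison would produce when a single chunk changes. The following is a minimal sketch of that bookkeeping, assuming the index stores one hash per chunk grouped by source; `plan_upsert` and its in-memory index are hypothetical and stand in for whatever `DocumentDB` actually persists.

```python
import hashlib
from typing import Dict, List, Set, Tuple


def plan_upsert(
    indexed: Dict[str, Set[str]], docs: List[Tuple[str, str]]
) -> Dict[str, int]:
    """Hypothetical sketch: compare incoming (source, text) chunks to the index."""
    incoming: Dict[str, Set[str]] = {}
    for source, text in docs:
        digest = hashlib.sha256(f"{source}\n{text}".encode("utf-8")).hexdigest()
        incoming.setdefault(source, set()).add(digest)

    stats = {"num_added": 0, "num_updated": 0, "num_skipped": 0, "num_deleted": 0}
    for source, hashes in incoming.items():
        known = indexed.get(source, set())
        stats["num_added"] += len(hashes - known)    # new or changed chunks
        stats["num_skipped"] += len(hashes & known)  # unchanged, no re-embedding
        stats["num_deleted"] += len(known - hashes)  # stale versions to remove
        indexed[source] = hashes
    return stats


index: Dict[str, Set[str]] = {}
docs = [("a.txt", "one"), ("a.txt", "two"), ("b.txt", "three")]
first = plan_upsert(index, docs)

docs[0] = ("a.txt", "ONE")  # modify a single chunk, as in the cell above
second = plan_upsert(index, docs)
print(first)
print(second)
```

In this model a modified chunk is never counted as an update: its new hash is added and its old hash is deleted, which matches `num_updated` staying at 0 in the output above.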

**`as_retriever`** returns a retriever that can be used to query the database for documents.

In [9]:
retriever = db.as_retriever()

In [10]:
results = retriever.invoke("Judge Ketanji Brown Jackson")
print(results[0].page_content)

One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.

A former top litigator in private practice. A former federal public defender. And from a family of public school educators and police officers. A consensus builder. Since she’s been nominated, she’s received a broad range of support—from the Fraternal Order of Police to former judges appointed by Democrats and Republicans.

And if we are to advance liberty and justice, we need to secure the Border and fix the immigration system.

We can do both. At our border, we’ve installed new technology like cutting-edge scanners to better detect drug smuggling.

We’ve set up joint patrols with Mexico and Guatemala to catch more human traffickers.


In [11]:
results = retriever.invoke("What is the 14th ammendmen?")
print(results[0].page_content)

the whole number shall be necessary to a choice. But no person

constitutionally ineligible to the office of President shall be eligible to

that of Vice-President of the United States.

13th Amendment

Section 1

Neither slavery nor involuntary servitude, except as a punishment for

crime whereof the party shall have been duly convicted, shall exist

within the United States, or any place subject to their jurisdiction.

Section 2

Congress shall have power to enforce this article by appropriate

legislation.

14th Amendment

Section 1

All persons born or naturalized in the United States, and subject to the

jurisdiction thereof, are citizens of the United States and of the State

wherein they reside. No State shall make or enforce any law which

shall abridge the privileges or immunities of citizens of the United

States; nor shall any State deprive any person of life, liberty, or

property, without due process of law; nor deny to any person within its


**`delete_documents`** deletes all the documents in the database coming from the same source.

A dummy document is inserted for each source being deleted.

In [12]:
db.delete_documents([files[0]])

{'num_added': 1, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 42}

In [13]:
vectorstore.similarity_search("", filter={"source": files[0]})

[Document(page_content='Deleted DO NOT USE', metadata={'source': './files/state_of_the_union.txt'})]
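
The reported counts (`num_added: 1`, `num_deleted: 42`) are consistent with deletion being modelled as replacing every chunk of a source with a single placeholder. The toy sketch below shows that reading; `delete_source` and the dictionary-based store are hypothetical, and the real `DocumentDB` implementation may differ.

```python
from typing import Dict, List

PLACEHOLDER = "Deleted DO NOT USE"  # text of the dummy document shown above


def delete_source(store: Dict[str, List[str]], source: str) -> Dict[str, int]:
    """Hypothetical sketch: swap all chunks of `source` for one placeholder."""
    removed = len(store.get(source, []))
    store[source] = [PLACEHOLDER]
    return {
        "num_added": 1,  # the placeholder itself
        "num_updated": 0,
        "num_skipped": 0,
        "num_deleted": removed,
    }


# Toy stand-in for the vector store: source -> list of chunk texts.
store = {"a.txt": ["one", "two"], "b.txt": ["three"]}
stats = delete_source(store, "a.txt")
print(stats)
print(store["a.txt"])
```

Under this reading the placeholder remains in the store as an ordinary document, so it still counts toward the total the next time documents are removed.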

**`clean`** erases all documents in the database.

In [14]:
db.clean()

{'num_added': 0, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 51}

**`delete_index`** deletes the database directory.

In [15]:
db.delete_index()