# **Persistent Document Q&A with ChromaDB**: Save/load VectorDB from Disk.

**ChromaDB is a vector database designed for storing and querying embeddings**, which are vector representations of data like text, images, or other objects. It is particularly useful in applications such as recommendation systems, semantic search, and natural language processing tasks where finding similarities or relationships between objects is critical.

### **Key Features of ChromaDB**

1. **Vector Storage**:
   - ChromaDB stores high-dimensional vectors (embeddings) that represent data points.
   - It supports similarity search, where you can query with a vector to find the most similar vectors in the database.

2. **Metadata Storage**:
   - Alongside vectors, ChromaDB can store metadata, which are additional attributes associated with each vector (e.g., a product description, document title, etc.).

3. **Collections**:
   - ChromaDB organizes data into collections, which are logical groupings of vectors and their metadata.

4. **APIs for Querying**:
   - Supports querying based on similarity, metadata filtering, or a combination of both.

5. **Persistence and Scalability**:
   - Can be used transiently (in-memory) or persistently (saved to disk; older releases used DuckDB+Parquet, while releases from 0.4 onward use SQLite).
   - Integrates seamlessly with large-scale workflows.
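
To make the features above concrete, here is a minimal pure-Python sketch of a collection that stores vectors with metadata and answers similarity queries with an optional metadata filter. `ToyCollection`, its methods, and the sample vectors are hypothetical names invented for illustration; this is not how ChromaDB is implemented, only the idea behind it.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


class ToyCollection:
    """Hypothetical stand-in for a collection: ids, vectors, and metadata."""

    def __init__(self):
        self.items = []  # list of (id, vector, metadata) tuples

    def add(self, item_id, vector, metadata):
        self.items.append((item_id, vector, metadata))

    def query(self, vector, n_results=1, where=None):
        """Return the ids of the n_results most similar items.

        `where` is an optional dict; only items whose metadata contains
        every key/value pair in it are considered.
        """
        candidates = [
            (item_id, cosine_similarity(vector, vec))
            for item_id, vec, meta in self.items
            if where is None or all(meta.get(k) == v for k, v in where.items())
        ]
        candidates.sort(key=lambda pair: pair[1], reverse=True)
        return [item_id for item_id, _ in candidates[:n_results]]


collection = ToyCollection()
collection.add("doc-1", [1.0, 0.0], {"source": "page_0"})
collection.add("doc-2", [0.9, 0.1], {"source": "page_1"})
collection.add("doc-3", [0.0, 1.0], {"source": "page_1"})

# Pure similarity search, then the same search restricted by metadata.
nearest = collection.query([1.0, 0.0], n_results=2)
filtered = collection.query([1.0, 0.0], n_results=1, where={"source": "page_1"})
print(nearest, filtered)
```

The filter is applied before ranking, so a filtered query returns the best match among the items that satisfy it rather than discarding results afterwards.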

**Vectorization** is important when building applications because constructing the vector DB is normally a one-time process. You can create and save the DB first with all the relevant documents, then connect the application to it so end users can ask questions.
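
The build-once, load-afterwards pattern can be sketched without any vector database at all. The example below is a toy under stated assumptions: `toy_embed` and `build_or_load_index` are hypothetical helpers, the "embedding" is just a vowel count, and the index is a JSON file rather than a Chroma directory.

```python
import json
import os
import tempfile


def toy_embed(text):
    """Hypothetical embedding: vowel counts. A real app calls an embedding model."""
    return [text.lower().count(v) for v in "aeiou"]


def build_or_load_index(texts, index_path):
    """Build the index once and save it; on later calls, load it from disk."""
    if os.path.exists(index_path):
        with open(index_path, "r", encoding="utf-8") as f:
            return json.load(f), "loaded"
    index = [{"text": t, "vector": toy_embed(t)} for t in texts]
    with open(index_path, "w", encoding="utf-8") as f:
        json.dump(index, f)
    return index, "built"


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "index.json")
    pages = ["Down the Rabbit-Hole", "The Pool of Tears"]
    # The first call pays the embedding cost; the second only reads the file.
    first_index, first_status = build_or_load_index(pages, path)
    second_index, second_status = build_or_load_index(pages, path)
    print(first_status, second_status)
```

With a real embedding model the saving matters more, since every rebuild would repeat the embedding calls for the whole corpus.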

> Save/load VectorDB from disk: this example uses the book Alice's Adventures in Wonderland in PDF format.

In [None]:
# !pip install langchain
# !pip install langchain-openai
# !pip install pypdf2
# !pip install openai
# !pip install chromadb
# !pip install tiktoken
# !pip install python-dotenv

In [29]:
import PyPDF2
import os
from langchain_openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.chains import VectorDBQA
from langchain.llms import OpenAI

In [2]:
# load the OpenAI API key from the environment (or a .env file)
from dotenv import load_dotenv

load_dotenv()
openai_api_key = os.getenv("OPENAI_API_KEY")

In [3]:
import re

# extract the text of every page from a list of open PDF files
def read_and_textify(files):
    """Return [texts, sources], with one entry per page of each PDF.

    Each source is "<file name>_page_<i>", where i is the zero-based page index.
    """
    text_list = []
    sources_list = []
    # publisher boilerplate to strip from every page
    unwanted_phrases = [
        'Download free eBook s of classic literature, books and novels at Planet eBook.',
        'Subscribe to our free eBooks blog and email newsletter.',
        'Free eBooks at Planet eBoo k.com',
    ]

    for file in files:
        pdfReader = PyPDF2.PdfReader(file)
        print("Page Number:", len(pdfReader.pages))
        for i in range(len(pdfReader.pages)):
            pageObj = pdfReader.pages[i]
            text = pageObj.extract_text()

            if text:
                text = ' '.join(text.split())  # normalize spaces
                # remove unwanted phrases (case-insensitive)
                for phrase in unwanted_phrases:
                    text = re.sub(re.escape(phrase), '', text, flags=re.IGNORECASE)

                text_list.append(text)
            else:
                # keep a placeholder so texts and sources stay aligned
                text_list.append('')

            sources_list.append(file.name + "_page_" + str(i))
    return [text_list, sources_list]
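
The cleaning step inside the function can be tried on its own. The sketch below isolates it in a hypothetical helper, `clean_page_text`, and runs it on a made-up sample string; the final `strip()` is an addition that the notebook's function does not perform.

```python
import re


def clean_page_text(text, unwanted_phrases):
    """Collapse whitespace, then delete each unwanted phrase (case-insensitive)."""
    # Normalizing first lets a phrase match even if the PDF broke it across lines.
    text = " ".join(text.split())
    for phrase in unwanted_phrases:
        # re.escape treats the phrase literally, so the "." is not a wildcard.
        text = re.sub(re.escape(phrase), "", text, flags=re.IGNORECASE)
    return text.strip()  # extra step, not in the original function


sample = "Alice was  beginning\nto get very tired Free eBooks at Planet eBoo k.com"
cleaned = clean_page_text(sample, ["Free eBooks at Planet eBoo k.com"])
print(cleaned)
```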

In [4]:
directory = '../data'
files = os.listdir(directory)
files = [open(os.path.join(directory,x), "rb") for x in files if x.endswith(".pdf")]
print(files)

[<_io.BufferedReader name='../data/alices-adventures-in-wonderland.pdf'>]


In [5]:
textify_output = read_and_textify(files)

Page Number: 111


In [6]:
documents = textify_output[0]
sources = textify_output[1]

In [7]:
documents[:5]

[' Alice’s Adventures in Wonderland By Lewis Caroll (1865)',
 'Alice’s Adventures in Wonderland 2All in the Golden Afternoon All in the golden afternoon Full leisurely we glide; For both our oars, with little skill, By little arms are plied, While little hands make vain pretense Our wanderings to guide. Ah, cruel Three! In such an hour, Beneath such dreamy weather, To beg a tale of breath too weak To stir the tiniest feather! Yet what can one poor voice avail Against three tongues together? Imperious Prima flashes forth Her edict to “begin it”: In gentler tones Secunda hopes “There will be nonsense in it.” While Tertia interrupts the tale Not more than once a minute.',
 '3 Anon, to sudden silence won, In fancy they pursue The dream-child moving through a land Of wonders wild and new, In friendly chat with bird or beast— And half believe it true. And ever, as the story drained The wells of fancy dry, And faintly strove that weary one To put the subject by, “The rest next time—” “It is n

In [30]:
persist_directory = '../VectorStore'
# embedding model used to vectorize the documents
embeddings = OpenAIEmbeddings(openai_api_key=openai_api_key)

In [31]:
# build the vector store with metadata; each page's source label (file name and page index) is stored
vectordb = Chroma.from_texts(documents, embeddings, metadatas=[{"source": s} for s in sources], persist_directory=persist_directory)
# chosen model name (not passed to the chain below, which calls OpenAI() with its default model)
model_name = 'gpt-4o-mini'
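
`Chroma.from_texts` pairs each text with the metadata at the same position, so `documents` and `sources` have to stay aligned. The sketch below uses made-up sample data shaped like those two lists to show the pairing, plus an optional filter for empty pages. The filter is an assumption for illustration and is not part of the notebook's pipeline.

```python
# Made-up sample data shaped like the notebook's `documents` and `sources`.
documents = ["Alice was beginning to get very tired", "", "Down, down, down."]
sources = ["book.pdf_page_0", "book.pdf_page_1", "book.pdf_page_2"]

# One metadata dict per text, in the same order.
metadatas = [{"source": s} for s in sources]
if len(documents) != len(metadatas):
    raise ValueError("documents and metadatas must have the same length")

# Optional step (an assumption, not in the original code): drop empty pages
# while keeping each remaining text paired with its own metadata.
kept = [(doc, meta) for doc, meta in zip(documents, metadatas) if doc.strip()]
kept_documents = [doc for doc, _ in kept]
kept_metadatas = [meta for _, meta in kept]
print(kept_metadatas)
```

Filtering the pairs together, rather than filtering `documents` alone, is what keeps every answer's source label pointing at the right page.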

In [33]:
# it's possible to load the persisted db from disk, and use it as normal
vectordb = Chroma(persist_directory=persist_directory, embedding_function=embeddings)
vectordb.get()

{'ids': ['c69f23a8-aea8-405d-a780-c593b6112425',
  '3db5c6c9-ed59-4e5d-a93d-8d381621c877',
  '1435b032-d116-4f68-9844-99702221a02a',
  '6bab0aa2-4e63-47f2-96fc-0ddc196647a6',
  'bfae428c-1aa5-4e15-aafb-002a92b0d041',
  '6c3e1e20-973e-470b-aed4-0ec7cf5e1226',
  '0d7952f4-ea29-46f8-881e-6c4064fb46ca',
  '8ff82a07-a5a3-4e62-b335-100b5fd2452b',
  '326c5bd9-1925-452f-9568-0aca59248aae',
  '8da01c37-9647-4b6d-a8b7-fe09ec9add01',
  '68e04afa-58d4-4b5c-9b58-2390c6f67883',
  '64e2d21d-a496-4fcd-b260-8df79831dbd8',
  '75ecb3a1-7510-441d-b877-8a530921ddf5',
  '0822aee3-3689-46fb-b039-48255395acf0',
  'b37b6046-e8f3-48db-8f6d-677726b1ef79',
  '8b122073-4529-4ba7-bbfd-5524cb86b00c',
  'a8a773b5-4885-482c-a461-20bf63635f14',
  'd25a716c-76f9-4b2e-83b2-481a03bea951',
  '49a62a9e-c18f-409e-8483-1d2199581d84',
  '931d5baf-4ff9-41f1-85b8-7c3d017c6bf7',
  '09239fd5-972f-4aed-86fc-8fffb2f2836b',
  '7d1b8d9d-fc95-4b32-acd5-0689e6a302cc',
  '3095165b-431f-46f6-a94b-3acf7ddef0ed',
  '5fa7730c-67a9-4e2e-9236-

In [34]:
from langchain.chains import VectorDBQAWithSourcesChain

In [None]:
qa = VectorDBQAWithSourcesChain.from_chain_type(
    llm=OpenAI(), k=1, chain_type='stuff', vectorstore=vectordb
)

In [16]:
qa({"question": "Who wrote Alice's Adventures in Wonderland?"}, return_only_outputs=True)

{'answer': " Lewis Carroll wrote Alice's Adventures in Wonderland.\n",
 'sources': '"../data/alices-adventures-in-wonderland.pdf_page_0"'}

In [None]:
qa({"question": "Which animal does Alice follow into Wonderland?"}, return_only_outputs=True)  # ??? wrong: in the book Alice follows the White Rabbit; only one page is retrieved (k=1)

{'answer': ' Alice follows a puppy into Wonderland.\n',
 'sources': "Alice's Adventures in Wonderland, page 36 (Chapter 5)"}
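
The answer above does not match the book, in which Alice follows the White Rabbit. One possible factor is that the chain is built with `k=1`, so only the single best-matching page reaches the model. The toy sketch below uses made-up similarity scores (not values from this notebook) and a hypothetical `top_k` helper to show how a top-1 cut-off can drop a relevant page that a larger `k` would keep.

```python
def top_k(scores, k):
    """Return the k source names with the highest similarity score."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]


# Hypothetical scores for one question; page names and numbers are invented.
scores = {"page_36": 0.83, "page_3": 0.81, "page_50": 0.40}

only_best = top_k(scores, 1)   # the near-tie on page_3 is lost
top_three = top_k(scores, 3)   # page_3 is now part of the context
print(only_best, top_three)
```

Raising `k` when building the chain is one thing to try; whether it fixes this particular question depends on what the retriever actually returns.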

In [23]:
qa({"question": "What does the White Rabbit carry, and why is it important in the story?"}, return_only_outputs=True)

{'answer': " The White Rabbit carries a watch in its waistcoat-pocket. It is important because it triggers Alice's curiosity and leads her on an adventure. \n",
 'sources': "Alice's Adventures in Wonderland by Lewis Carroll - Chapter 1, page 3"}