# Document Question Answering with local persistence

An example of using Chroma DB and LangChain to answer questions over documents, backed by a locally persisted database.
Embeddings and documents are stored on disk, so they can be reused in later sessions without being recomputed.

In [2]:
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.llms import OpenAI
from langchain.chains import VectorDBQA
from langchain.document_loaders import TextLoader
from IPython.display import display

import config
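
The `config` module imported above is a local file that is not shown in this notebook. A minimal sketch of what it might contain, assuming the secrets are read from environment variables: the attribute names `GITHUB_ACCESS_TOKEN` and `OPENAI_KEY` are the ones this notebook uses, while the environment variable names are illustrative assumptions.

```python
# config.py (hypothetical sketch; the real file is not part of this notebook)
import os

# Environment variable names are assumptions; empty string if unset.
GITHUB_ACCESS_TOKEN = os.environ.get("GITHUB_ACCESS_TOKEN", "")
OPENAI_KEY = os.environ.get("OPENAI_API_KEY", "")
```

Keeping the keys out of the notebook itself means the notebook can be shared without exposing them.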

## Load and process documents

Load the documents to do question answering over. If you want to run this over your own documents, this is the section to replace.

### GitHub Extraction

We fetch the contents of a GitHub repository directory and extract its Markdown files for later processing.

In [3]:
# Import the required libraries

import requests
import markdown

# Define the GitHub repository information
repo_url = 'https://api.github.com/repos/dfcantor/obsidian-vault-sync'
directory_path = 'Obsidian Vault/Zettelkasten'

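# Request the directory listing from the GitHub contents API
# (this endpoint lists only the directory's immediate children)
api_url = f'{repo_url}/contents/{directory_path}'

# Authenticate with a personal access token
headers = {
    'Authorization': f'Token {config.GITHUB_ACCESS_TOKEN}'}

# Pass the headers so the request is actually authenticated
response = requests.get(api_url, headers=headers)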
response.raise_for_status()

try:
    contents = response.json()
except ValueError as e:
    print(f"Error decoding JSON response: {e}")
    print(f"Response content: {response.content}")
    raise



Iterate over the contents and extract Markdown documents.

In [4]:
documents_md = []

for item in contents:
    if item['type'] == 'file' and item['name'].endswith('.md'):
        # Download the Markdown file
        download_url = item['download_url']
        response = requests.get(download_url)
        response.raise_for_status()

        # Convert the Markdown text to HTML
        markdown_text = response.text
        html = markdown.markdown(markdown_text)

        documents_md.append(html)
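
The loop above only sees the immediate children of the requested directory, so notes stored in subfolders would be skipped. A minimal sketch of a recursive walk follows. It assumes a hypothetical `fetch_dir(path)` callable that returns the contents listing for a path (for example, a thin wrapper around the `requests.get` call above), and it relies on the `type`, `name`, and `path` fields of each listing entry. A fake in-memory tree stands in for the network so the sketch runs offline.

```python
from typing import Callable, Iterator, List


def walk_markdown(path: str, fetch_dir: Callable[[str], List[dict]]) -> Iterator[dict]:
    """Yield every '.md' file entry under `path`, descending into subdirectories."""
    for item in fetch_dir(path):
        if item["type"] == "dir":
            # Recurse into the subdirectory using its repository-relative path
            yield from walk_markdown(item["path"], fetch_dir)
        elif item["type"] == "file" and item["name"].endswith(".md"):
            yield item


# Fake directory listings (illustrative data, not from the real repository)
fake_tree = {
    "vault": [
        {"type": "file", "name": "a.md", "path": "vault/a.md"},
        {"type": "file", "name": "image.png", "path": "vault/image.png"},
        {"type": "dir", "name": "sub", "path": "vault/sub"},
    ],
    "vault/sub": [
        {"type": "file", "name": "b.md", "path": "vault/sub/b.md"},
    ],
}

found = [item["path"] for item in walk_markdown("vault", fake_tree.__getitem__)]
print(found)
```

Injecting `fetch_dir` keeps the traversal logic separate from the HTTP details, which also makes it easy to test.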

### Save the documents to local disk as a text file

In [5]:
from pathlib import Path

my_string = str(documents_md)
file_path = 'C:\\Users\\DanielCantorBaez\\Documents\\SyncierGPT\\chroma-langchain-custom\\data.txt'

# Write as UTF-8 explicitly so the result does not depend on the platform default encoding
Path(file_path).write_text(my_string, encoding='utf-8')

127325

In [6]:
# Load the text back, using the same encoding it was written with

loader = TextLoader(file_path, encoding='utf-8')
documents = loader.load()

Next, we split the documents into small chunks, so that we can retrieve the chunks most relevant to a query and pass only those to the LLM.

In [7]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)
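
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified fixed-width chunker. It is only an illustration of the two parameters, not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter prefers to break on separators such as paragraph and line boundaries); the function name `chunk_text` is hypothetical.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 0) -> List[str]:
    """Cut `text` into windows of `chunk_size` characters.

    Consecutive windows share `chunk_overlap` characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


chunks = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij', 'j']
```

A non-zero overlap repeats a little text between neighbouring chunks, which can help when an answer straddles a chunk boundary.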

## Initialize ChromaDB

Create an embedding for each chunk and insert them into the Chroma vector database.

In [8]:
OPENAI_API_KEY = config.OPENAI_KEY

embeddings = OpenAIEmbeddings(openai_api_key = OPENAI_API_KEY)
vectordb = Chroma.from_documents(texts, embeddings)
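
Conceptually, a vector store keeps one vector per chunk and answers a query by returning the chunks whose vectors are closest to the query vector. The toy sketch below shows that idea with cosine similarity over hand-written three-dimensional vectors. It is an illustration only: the vectors and texts are made up, and it is not how Chroma is implemented internally.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: Sequence[float], store: List[Tuple[str, Sequence[float]]], k: int = 2) -> List[str]:
    """Return the texts of the k stored entries most similar to the query vector."""
    ranked = sorted(store, key=lambda entry: cosine_similarity(query_vec, entry[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Made-up (text, vector) pairs standing in for embedded chunks
toy_store = [
    ("premium collection", [1.0, 0.0, 0.1]),
    ("garden notes", [0.0, 1.0, 0.0]),
    ("direct debit", [0.9, 0.1, 0.2]),
]

best = top_k([1.0, 0.0, 0.0], toy_store, k=2)
print(best)  # ['premium collection', 'direct debit']
```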

## Create the chain

Initialize the chain we will use for question answering.

In [9]:
qa = VectorDBQA.from_chain_type(llm=OpenAI(openai_api_key = OPENAI_API_KEY), chain_type="stuff", vectorstore=vectordb)
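
The `"stuff"` chain type places all retrieved chunks into a single prompt alongside the question. A rough sketch of that idea is shown below; the prompt wording and the function name `build_stuff_prompt` are assumptions for illustration, not LangChain's actual template.

```python
from typing import List


def build_stuff_prompt(question: str, chunks: List[str]) -> str:
    """Concatenate all retrieved chunks into one context block, then append the question."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_stuff_prompt(
    "How does clearing work?",
    ["Chunk one about premiums.", "Chunk two about direct debit."],
)
print(prompt)
```

Because every chunk goes into one prompt, this approach only works while the combined chunks fit within the model's context window.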



## Initialize Persisted ChromaDB

Create an embedding for each chunk and insert them into the Chroma vector database. The `persist_directory` argument tells ChromaDB where to store the database when it is persisted.

In [10]:
# Embed and store the texts
# Supplying a persist_directory will store the embeddings on disk
persist_directory = 'db'

embedding = OpenAIEmbeddings(openai_api_key = OPENAI_API_KEY)
vectordb = Chroma.from_documents(documents=texts, embedding=embedding, persist_directory=persist_directory)

## Persist the Database
In a notebook, we should call `persist()` to ensure the embeddings are written to disk.
This isn't necessary in a script - the database will be automatically persisted when the client object is destroyed.

In [11]:
vectordb.persist()
vectordb = None

## Load the Database from disk, and create the chain
Be sure to pass the same `persist_directory` and `embedding_function` as you did when you instantiated the database. Initialize the chain we will use for question answering.

In [12]:
# Now we can load the persisted database from disk, and use it as normal. 
vectordb = Chroma(persist_directory=persist_directory, embedding_function=embedding)
qa = VectorDBQA.from_chain_type(llm=OpenAI(openai_api_key = OPENAI_API_KEY), chain_type="stuff", vectorstore=vectordb)
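
Before reloading in a fresh session, it can help to confirm that the persist directory exists and is not empty, since a mistyped path would otherwise go unnoticed until queries return nothing useful. The helper below is a hypothetical sketch that makes no assumption about the files Chroma writes; it only checks that something is there. A temporary directory is used so the example runs on its own.

```python
import tempfile
from pathlib import Path


def has_persisted_data(directory: str) -> bool:
    """Return True if `directory` exists and contains at least one entry."""
    path = Path(directory)
    return path.is_dir() and any(path.iterdir())


with tempfile.TemporaryDirectory() as tmp:
    empty_before = has_persisted_data(tmp)          # directory exists but is empty
    (Path(tmp) / "placeholder.bin").write_bytes(b"\x00")
    filled_after = has_persisted_data(tmp)          # now contains one file

# The temporary directory has been removed, so nothing under it exists any more
missing = has_persisted_data(str(Path(tmp) / "gone"))
print(empty_before, filled_after, missing)
```

In this notebook the check would be called as `has_persisted_data(persist_directory)` before constructing `Chroma(...)`.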

## Ask questions!

Now we can use the chain to ask questions!

In [13]:
query = "How does clearing work?"
display(qa.run(query))

' Depending on the collection settings of the contract, ABS runs specific batch processes to collect the outstanding premium. For example: if a contract has direct debit as collection type, the premium is cleared on the payment due date. This happens independently of the success of the direct debit payment.'

## Cleanup

When you're done with the database, you can delete it from disk. You can delete the specific collection you're working with (if you have several), or delete the entire database by nuking the persistence directory.

In [14]:
# To clean up, you can delete the collection
vectordb.delete_collection()
vectordb.persist()

# Or remove the persist directory entirely
# (shutil works on Windows as well, where `rm -rf` is not available)
import shutil
shutil.rmtree(persist_directory, ignore_errors=True)
