# Document Question Answering with local persistence

An example of using Chroma DB and LangChain to do question answering over documents, with a locally persisted database. 
You can store embeddings and documents, then use them again later.

__This example was built with the following package versions:__

chromadb Version: 0.4.22

openai Version: 1.7.2

langchain Version: 0.1.0
* Since `VectorDBQA` is deprecated in this package version, this notebook uses `RetrievalQA` from `langchain.chains` in its place.

langchain-openai Version: 0.0.2.post1
* Since the OpenAI classes in `langchain_community` are deprecated and will be removed in a future package version, this notebook uses `langchain-openai` in its place.

langchain-core Version: 0.1.10

In [1]:
#Imports from langchain
from langchain.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import TextLoader
from langchain.chains import RetrievalQA

#Imports from langchain_openai
from langchain_openai import OpenAI 
from langchain_openai import OpenAIEmbeddings

#Imports from langchain_core
from langchain_core.vectorstores import VectorStoreRetriever

#Imports from Chroma
import chromadb

## Safely Use the OpenAI API Key Without Exposing it in Your Code
https://medium.com/@itsanirudhjoshi/how-to-safely-use-the-openai-api-key-without-exposing-it-in-your-code-%EF%B8%8F-setting-it-as-an-a10cccbb9a7f

In [2]:
import os
import openai

# Access the API key from the environment variable
api_key = os.environ.get('OPENAI_API_KEY')

# Initialize the OpenAI API client
openai.api_key = api_key
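
If the variable is not set, `os.environ.get` returns `None` and the problem only surfaces later, at the first API call. A minimal sketch of failing early instead is shown below; `require_env` is a hypothetical helper name, not part of `openai` or LangChain.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Usage (assumes OPENAI_API_KEY was exported before the notebook was started):
# api_key = require_env("OPENAI_API_KEY")
```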

## Create a Chroma DB with local persistence

This section creates a database named TheWhiteHouse_db and, within it, a collection named POTUS_Speeches_and_Remarks. It is an example of keeping one database for a broad subject while querying only a particular document collection for an answer. The same database could hold other collections for different contexts.

In [3]:
database_name = "TheWhiteHouse_db"
collection_name   = "POTUS_Speeches_and_Remarks"

chroma_client = chromadb.PersistentClient(path=database_name) 

collection = chroma_client.get_or_create_collection(collection_name)

embedding_function = OpenAIEmbeddings()

# Defines the "POTUS_Speeches_and_Remarks" collection from the "TheWhiteHouse_db" database as the context source
langchain_chroma = Chroma(
    client=chroma_client,
    collection_name=collection_name,
    embedding_function=embedding_function,
)

## Load and process documents

Load documents to do question answering over. If you want to do this over your documents, this is the section you should replace.

Next we split documents into small chunks. This is so we can find the most relevant chunks for a query and pass only those into the LLM.
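
To illustrate what `chunk_size` and `chunk_overlap` control, here is a simplified sketch of fixed-width character chunking. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter prefers to break on separators such as paragraph and line boundaries); `split_fixed` is a hypothetical helper written only for this illustration.

```python
from typing import List


def split_fixed(text: str, chunk_size: int, chunk_overlap: int = 0) -> List[str]:
    """Cut `text` into windows of at most `chunk_size` characters.

    Consecutive windows share `chunk_overlap` characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


print(split_fixed("abcdefghij", chunk_size=4, chunk_overlap=0))
print(split_fixed("abcdefghij", chunk_size=4, chunk_overlap=2))
```

With no overlap, every character lands in exactly one chunk. A non-zero overlap repeats the tail of each chunk at the head of the next, which can help when an answer straddles a chunk boundary.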

In [4]:
# Load and process the text
loader = TextLoader('state_of_the_union.txt')
documents = loader.load()

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)

# Add documents to the "POTUS_Speeches_and_Remarks" collection in the "TheWhiteHouse_db" local database.
# This creates an embedding for each chunk and inserts it into the Chroma vector database.
# There is no need to call persist(), as was necessary in previous versions of ChromaDB.
langchain_chroma.add_documents(documents=texts)

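# Drop the reference so the next section can show the store being re-created from the persisted data
langchain_chroma = None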

## Load the Database from disk, and create the chain
Be sure to pass the same `path`, `collection_name` and `embedding_function` as you did when you instantiated the database. Initialize the chain we will use for question answering.

In [5]:
# Now we can load the persisted database from disk, and use it as normal. 
langchain_chroma = Chroma(
    client=chroma_client,
    collection_name=collection_name,
    embedding_function=embedding_function,
)

llm = OpenAI()

# Using langchain_core.VectorStoreRetriever
retriever = VectorStoreRetriever(vectorstore=langchain_chroma)

qa = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)
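
Conceptually, the retriever embeds the query and returns the stored chunks whose embeddings are closest to it. The following is a toy sketch of that idea in plain Python. The hand-made 3-dimensional vectors stand in for real embeddings, cosine similarity is an assumed metric, and `top_k` and the sample data are hypothetical; Chroma's actual index and distance function may differ.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: Sequence[float],
          store: List[Tuple[str, List[float]]],
          k: int = 2) -> List[str]:
    """Return the texts of the `k` stored chunks most similar to `query_vec`."""
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Made-up chunks and vectors, for illustration only.
store = [
    ("chunk about the Supreme Court", [0.9, 0.1, 0.0]),
    ("chunk about the economy", [0.1, 0.9, 0.0]),
    ("chunk about judicial nominees", [0.8, 0.2, 0.1]),
]
query_vec = [1.0, 0.0, 0.0]
print(top_k(query_vec, store, k=2))
```

Only the chunks returned by this step are passed to the LLM as context, which is why the chunk size chosen earlier affects both retrieval quality and prompt length.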

## Ask questions!

Now we can use the chain to ask questions!

In [6]:
query = "What did the president say about Ketanji Brown Jackson"
# `run` was deprecated in LangChain 0.1.0 and will be removed in 0.2.0. Use invoke instead
qa.invoke(query)

{'query': 'What did the president say about Ketanji Brown Jackson',
 'result': ' The President praised Ketanji Brown Jackson as a highly qualified nominee for the United States Supreme Court, who will continue the legacy of excellence of retiring Justice Stephen Breyer. He also mentioned her background as a former litigator, federal public defender, and member of a family of public school educators and police officers. He also mentioned her broad range of support from various groups and individuals.'}
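
As the output above shows, the chain returns a plain dictionary with `query` and `result` keys, and the answer text begins with a space. A small sketch of pulling out the answer and tidying it for display follows; `format_answer` is a hypothetical helper, and the sample dictionary is shortened from the output above.

```python
import textwrap


def format_answer(response: dict, width: int = 80) -> str:
    """Strip surrounding whitespace from the answer and wrap it to `width` columns."""
    answer = response["result"].strip()
    return textwrap.fill(answer, width=width)


# Shortened stand-in for the dictionary returned by qa.invoke(query).
sample = {
    "query": "What did the president say about Ketanji Brown Jackson",
    "result": " The President praised Ketanji Brown Jackson as a highly "
              "qualified nominee for the United States Supreme Court.",
}
print(format_answer(sample, width=60))
```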

## Cleanup

When you're done with the database, you can delete it from disk. You can delete the specific collection you're working with (if you have several), or delete the entire database by nuking the persistence directory.

In [7]:
# To cleanup, you can delete the collection
chroma_client = chromadb.PersistentClient(path=database_name) 
chroma_client.delete_collection(collection_name)

# Or just nuke the persist directory
#!rm -rf TheWhiteHouse_db/
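
A portable alternative to the shell command can be sketched with the standard library; `remove_persist_dir` is a hypothetical helper name. It assumes no client still has the directory open, which may matter on Windows in particular.

```python
import shutil
from pathlib import Path


def remove_persist_dir(path: str) -> bool:
    """Delete the directory at `path` and everything in it.

    Returns True if a directory was removed, False if none existed.
    """
    target = Path(path)
    if not target.is_dir():
        return False
    shutil.rmtree(target)
    return True


# Usage (this permanently deletes the database directory):
# remove_persist_dir("TheWhiteHouse_db")
```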