# Document Question Answering with local persistence

This example uses Chroma DB and LangChain to answer questions over documents, backed by a database persisted to local disk.
Embeddings and documents are stored once and can be loaded again in a later session.

In [1]:
!pip install --upgrade pip
!pip install -q langchain
!pip install -U langchain-openai
!pip install chromadb==0.3.29


Collecting pip
  Downloading pip-23.3.2-py3-none-any.whl (2.1 MB)
     ---------------------------------------- 2.1/2.1 MB 8.4 MB/s eta 0:00:00


ERROR: To modify pip, please run the following command:
C:\Users\Abdul\AppData\Local\Programs\Python\Python310\python.exe -m pip install --upgrade pip

[notice] A new release of pip is available: 23.0.1 -> 23.3.2
[notice] To update, run: python.exe -m pip install --upgrade pip



Collecting langchain-openai
  Downloading langchain_openai-0.0.2.post1-py3-none-any.whl (28 kB)
Collecting langchain-core<0.2,>=0.1.7
  Downloading langchain_core-0.1.11-py3-none-any.whl (218 kB)
     -------------------------------------- 218.6/218.6 kB 4.4 MB/s eta 0:00:00
Collecting tiktoken<0.6.0,>=0.5.2
  Downloading tiktoken-0.5.2-cp310-cp310-win_amd64.whl (786 kB)
     -------------------------------------- 786.3/786.3 kB 7.1 MB/s eta 0:00:00
Collecting openai<2.0.0,>=1.6.1
  Downloading openai-1.8.0-py3-none-any.whl (222 kB)
     -------------------------------------- 222.3/222.3 kB 6.8 MB/s eta 0:00:00
Collecting packaging<24.0,>=23.2
  Downloading packaging-23.2-py3-none-any.whl (53 kB)
     ---------------------------------------- 53.0/53.0 kB ? eta 0:00:00
Collecting langsmith<0.1.0,>=0.0.63
  Downloading langsmith-0.0.81-py3-none-any.whl (48 kB)
     ---------------------------------------- 48.4/48.4 kB ? eta 0:00:00
Collecting distro<2,>=1.7.0
  Downloading distro-1.9.0-p

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tfx 1.14.0 requires kubernetes<13,>=10.0.1, but you have kubernetes 29.0.0 which is incompatible.
tfx 1.14.0 requires packaging<21,>=20, but you have packaging 23.2 which is incompatible.
tensorflow-intel 2.13.1 requires typing-extensions<4.6.0,>=3.6.6, but you have typing-extensions 4.9.0 which is incompatible.
ml-pipelines-sdk 1.14.0 requires packaging<21,>=20, but you have packaging 23.2 which is incompatible.
google-cloud-bigquery 2.34.4 requires packaging<22.0dev,>=14.3, but you have packaging 23.2 which is incompatible.



Collecting chromadb==0.3.29
  Using cached chromadb-0.3.29-py3-none-any.whl (396 kB)
Collecting hnswlib>=0.7
  Using cached hnswlib-0.8.0.tar.gz (36 kB)
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
  Preparing metadata (pyproject.toml): started
  Preparing metadata (pyproject.toml): finished with status 'done'
Collecting fastapi==0.85.1
  Using cached fastapi-0.85.1-py3-none-any.whl (55 kB)
Collecting starlette==0.20.4
  Using cached starlette-0.20.4-py3-none-any.whl (63 kB)
Building wheels for collected packages: hnswlib
  Building wheel for hnswlib (pyproject.toml): started
  Building wheel for hnswlib (pyproject.toml): finished with status 'error'
Failed to build hnswlib


  error: subprocess-exited-with-error
  
  × Building wheel for hnswlib (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> [5 lines of output]
      running bdist_wheel
      running build
      running build_ext
      building 'hnswlib' extension
      error: Microsoft Visual C++ 14.0 or greater is required. Get it with "Microsoft C++ Build Tools": https://visualstudio.microsoft.com/visual-cpp-build-tools/
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for hnswlib
ERROR: Could not build wheels for hnswlib, which is required to install pyproject.toml-based projects
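
The log above ends with `hnswlib` failing to build, so `chromadb` may not have been installed at all. A minimal sketch for checking this before running the remaining cells, using only the standard library; the `REQUIRED` list and the `missing_packages` helper are illustrative names, not part of any of the libraries involved:

```python
import importlib.util

# Import names (not pip names) that the later cells rely on.
REQUIRED = ["chromadb", "langchain", "langchain_openai"]

def missing_packages(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_packages(REQUIRED)
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("All required packages are importable.")
```

If `chromadb` is reported as missing on Windows, the error message in the log points to installing the Microsoft C++ Build Tools and then rerunning the install cell.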



In [2]:
import chromadb
import chromadb.config

In [3]:
from langchain.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import VectorDBQA
from langchain.document_loaders import TextLoader
# The OpenAI integrations come from the separate langchain-openai package
from langchain_openai import OpenAI, OpenAIEmbeddings

## Load and process documents

Load the documents to do question answering over. If you want to run this over your own documents, this is the section to replace.

Next we split the documents into small chunks, so that we can find the chunks most relevant to a query and pass only those to the LLM.
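
To make the chunking step concrete, here is a simplified sketch in plain Python. It cuts on fixed character offsets only, whereas `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraphs and sentences, so treat it as an illustration of `chunk_size` and `chunk_overlap` rather than a reimplementation. The `split_text` function is a hypothetical helper.

```python
def split_text(text, chunk_size=1000, chunk_overlap=0):
    """Cut text into pieces of at most chunk_size characters.

    Consecutive pieces share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = split_text("a" * 2500, chunk_size=1000, chunk_overlap=0)
print([len(c) for c in chunks])  # [1000, 1000, 500]
```

Smaller chunks make retrieval more precise but give the LLM less context per chunk; a non-zero overlap reduces the chance that a relevant sentence is cut in half at a boundary.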

In [4]:
# Load and process the text
loader = TextLoader('state_of_the_union.txt')
documents = loader.load()

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)

RuntimeError: Error loading state_of_the_union.txt
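
The `RuntimeError` above does not say why loading failed. Two common causes are that the file is not in the notebook's working directory, or that it cannot be decoded with the platform's default encoding. The sketch below, which uses only the standard library and a hypothetical `read_text_file` helper, separates the two cases so the message is more specific:

```python
import tempfile
from pathlib import Path

def read_text_file(path, encoding="utf-8"):
    """Read a text file, reporting the two common failure causes clearly."""
    p = Path(path)
    if not p.is_file():
        raise FileNotFoundError(f"{p.resolve()} does not exist; check the working directory")
    try:
        return p.read_text(encoding=encoding)
    except UnicodeDecodeError as exc:
        raise ValueError(f"{p} is not valid {encoding}; try another encoding") from exc

# Demonstration on a throwaway file
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.txt"
    sample.write_text("hello", encoding="utf-8")
    print(read_text_file(sample))  # hello
```

If the file exists but decoding is the problem, passing an explicit `encoding="utf-8"` to `TextLoader` is likely to help.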

## Initialize PersistedChromaDB

Create embeddings for each chunk and insert into the Chroma vector database. The `persist_directory` argument tells ChromaDB where to store the database when it's persisted.
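
As a rough picture of what the vector database does with these embeddings, the sketch below ranks a few chunks against a query vector. The three-dimensional vectors are made up for illustration (real embeddings have many more dimensions), and cosine similarity is just one common metric, not necessarily the one Chroma is configured to use. `cosine_similarity` and `top_k` are hypothetical helpers.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, store, k=2):
    """store is a list of (text, vector) pairs; return the k most similar texts."""
    ranked = sorted(store, key=lambda item: cosine_similarity(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

# Made-up 3-dimensional "embeddings" for three chunks
store = [
    ("chunk about the economy", [0.9, 0.1, 0.0]),
    ("chunk about the Supreme Court", [0.1, 0.9, 0.1]),
    ("chunk about healthcare", [0.0, 0.2, 0.9]),
]
print(top_k([0.2, 0.8, 0.0], store, k=1))  # ['chunk about the Supreme Court']
```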

In [None]:
# Embed and store the texts
# Supplying a persist_directory will store the embeddings on disk
import os

persist_directory = 'db'
# Read the key from the environment instead of hardcoding it in the notebook
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]
embedding = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)
vectordb = Chroma.from_documents(documents=texts, embedding=embedding, persist_directory=persist_directory)

## Persist the Database
In a notebook, we should call `persist()` to ensure the embeddings are written to disk.
This isn't necessary in a script - the database will be automatically persisted when the client object is destroyed.
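
The persist-then-reload pattern can be illustrated with a toy store that keeps records in memory until `persist()` is called. `ToyStore` is a hypothetical stand-in written for this sketch, with a JSON file in place of Chroma's on-disk format; it only shows why the explicit call and the matching directory matter.

```python
import json
import tempfile
from pathlib import Path

class ToyStore:
    """Hypothetical in-memory store with an explicit persist() step."""

    def __init__(self, persist_directory):
        self.path = Path(persist_directory) / "store.json"
        # Pick up earlier data if this directory was persisted before.
        self.records = json.loads(self.path.read_text()) if self.path.exists() else []

    def add(self, text, vector):
        # Changes live in memory only until persist() is called.
        self.records.append({"text": text, "vector": vector})

    def persist(self):
        self.path.parent.mkdir(parents=True, exist_ok=True)
        self.path.write_text(json.dumps(self.records))

with tempfile.TemporaryDirectory() as tmp:
    store = ToyStore(tmp)
    store.add("hello", [0.1, 0.2])
    store.persist()            # write to disk explicitly
    store = None               # drop the in-memory object
    reloaded = ToyStore(tmp)   # same directory, so the same data comes back
    print(reloaded.records)    # [{'text': 'hello', 'vector': [0.1, 0.2]}]
```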

In [None]:
vectordb.persist()
vectordb = None

## Load the Database from disk, and create the chain
Be sure to pass the same `persist_directory` and `embedding_function` as you did when you instantiated the database. Initialize the chain we will use for question answering.

In [None]:
# Now we can load the persisted database from disk, and use it as normal.
vectordb = Chroma(persist_directory=persist_directory, embedding_function=embedding)
qa = VectorDBQA.from_chain_type(
    llm=OpenAI(openai_api_key=OPENAI_API_KEY),
    chain_type="stuff",
    vectorstore=vectordb,
)



## Ask questions!

Now we can use the chain to ask questions!
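
The chain was created with `chain_type="stuff"`, which means the retrieved chunks are all placed ("stuffed") into a single prompt together with the question. A minimal sketch of that idea follows; the `build_stuff_prompt` helper and the prompt wording are assumptions for illustration, and LangChain's actual prompt template differs.

```python
def build_stuff_prompt(question, chunks):
    """Concatenate all retrieved chunks into one prompt (the "stuff" strategy)."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

prompt = build_stuff_prompt(
    "What did the president say about Ketanji Brown Jackson",
    ["First retrieved chunk.", "Second retrieved chunk."],
)
print(prompt)
```

This strategy is simple and needs only one LLM call, but the combined chunks must fit within the model's context window, which is one reason the documents were split into small pieces earlier.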

In [None]:
query = "What did the president say about Ketanji Brown Jackson"
qa.run(query)

" The president said that he nominated Circuit Court of Appeals Judge Ketanji Brown Jackson for the United States Supreme Court and praised her as one of the nation's top legal minds and a consensus builder with broad support from both Democrats and Republicans."

## Cleanup

When you're done with the database, you can delete it from disk. You can delete the specific collection you're working with (if you have several), or delete the entire database by nuking the persistence directory.

In [None]:
# To clean up, you can delete the collection
vectordb.delete_collection()
vectordb.persist()

# Or just nuke the persist directory
!rm -rf db/

Persisting DB to disk, putting it in the save folder db
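
The `!rm -rf db/` line relies on a Unix shell command and may not work on Windows, which is the platform shown in the install log. A portable alternative, sketched with the standard library only; `remove_persist_directory` is a hypothetical helper name:

```python
import shutil
import tempfile
from pathlib import Path

def remove_persist_directory(path):
    """Delete the directory tree if it exists; return True if something was removed."""
    p = Path(path)
    if p.is_dir():
        shutil.rmtree(p)
        return True
    return False

# Demonstration on a throwaway directory rather than the real 'db' folder
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp) / "db"
    demo_dir.mkdir()
    (demo_dir / "index.bin").write_bytes(b"\x00")
    print(remove_persist_directory(demo_dir))  # True
    print(remove_persist_directory(demo_dir))  # False
```

In the notebook itself, the call would be `remove_persist_directory(persist_directory)`, made after the `vectordb` object is no longer in use.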
