# LangChain multi-doc retriever with ChromaDB

***New Points***
- Multiple Files
- ChromaDB
- Source info
- HuggingFaceH4/zephyr-7b-beta via the Hugging Face Hub (in place of the gpt-3.5-turbo API)

## Setting up LangChain


In [1]:
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain import HuggingFaceHub
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.embeddings import HuggingFaceInstructEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.llms import OpenAI
from langchain.chains import RetrievalQA
from langchain.document_loaders import TextLoader
from langchain.document_loaders import PyPDFLoader
from langchain.document_loaders import DirectoryLoader
import os
from dotenv import load_dotenv

In [2]:
_ = load_dotenv()

OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]
HUGGINGFACEHUB_API_TOKEN = os.environ["HUGGINGFACEHUB_API_TOKEN"]

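# BGE embedding model loaded through the Instructor wrapper.
# Not used below: the vector store is built with HuggingFaceEmbeddings().
instructor_embeddings = HuggingFaceInstructEmbeddings(model_name="BAAI/bge-base-en-v1.5",
                                                      model_kwargs={"device": "cuda"})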

llm = HuggingFaceHub(
    repo_id="HuggingFaceH4/zephyr-7b-beta",
    model_kwargs={"temperature": 0.2, "max_length": 256},
    huggingfacehub_api_token=HUGGINGFACEHUB_API_TOKEN,
)



load INSTRUCTOR_Transformer
max_seq_length  512
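
Reading keys with `os.environ["..."]` as in the cell above raises a bare `KeyError` when the `.env` file is missing an entry. A minimal sketch of a friendlier lookup follows; `require_env` is a hypothetical helper name, not part of LangChain or python-dotenv.

```python
import os


def require_env(name: str) -> str:
    """Return an environment variable's value, or fail with a readable message.

    Hypothetical helper for illustration; call it after load_dotenv().
    """
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(
            f"Missing environment variable {name}; add it to your .env file."
        )
    return value
```

It could stand in for the direct lookups, for example `HUGGINGFACEHUB_API_TOKEN = require_env("HUGGINGFACEHUB_API_TOKEN")`.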




## Load and process multiple documents

In [3]:
# Load and process the PDF files
# loader = TextLoader('single_text_file.txt')
loader = DirectoryLoader('./data/', glob="./*.pdf", loader_cls=PyPDFLoader)

documents = loader.load()

In [4]:
%%time
# Split the text into overlapping chunks (500 characters each, 250 characters of overlap)
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=250)
texts = text_splitter.split_documents(documents)

CPU times: user 4.08 ms, sys: 178 µs, total: 4.26 ms
Wall time: 4.32 ms
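
To get a feel for what `chunk_size=500` and `chunk_overlap=250` mean, here is a simplified sliding-window sketch. It is only an approximation: `RecursiveCharacterTextSplitter` prefers to split on separators such as paragraphs and newlines, so its real chunk boundaries will differ. `sliding_chunks` is a hypothetical name used for illustration.

```python
from typing import List


def sliding_chunks(text: str, chunk_size: int = 500, chunk_overlap: int = 250) -> List[str]:
    """Cut text into fixed-size windows where each window repeats the tail of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


sample = "abcdefghij" * 120  # 1200 characters of stand-in text
chunks = sliding_chunks(sample)
print([len(c) for c in chunks])
```

With a 50% overlap, every stretch of text (apart from the very start and end) lands in two chunks, which roughly doubles the number of embeddings to compute and store.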


In [5]:
len(texts)

221

In [6]:
texts[32]

Document(page_content='As social anxiety, paranoia and loneliness occur naturalistically in the flow of daily life with varying intensi-\nties across hours and days, they can be reliably captured by the experience sampling method (ESM). ESM refers \nto repeated self-report questionnaires that record subjective experiences across moments in the flow of daily \n life38. Compared to traditional retrospective questionnaires, ESM represents these experiences with less recall', metadata={'source': 'data/s41598-023-47912-0.pdf', 'page': 1})
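
Since each chunk carries `source` and `page` metadata, it is easy to check how the chunks are spread across the loaded files. The sketch below uses stand-in objects with a `.metadata` dict instead of real LangChain `Document` instances, and the file names are made up; `chunks_per_source` is a hypothetical helper.

```python
from collections import Counter
from types import SimpleNamespace


def chunks_per_source(docs):
    """Count how many chunks came from each source file."""
    return Counter(doc.metadata.get("source", "unknown") for doc in docs)


# Stand-ins that mimic the .metadata attribute seen in the output above.
fake_texts = [
    SimpleNamespace(metadata={"source": "data/a.pdf", "page": 0}),
    SimpleNamespace(metadata={"source": "data/a.pdf", "page": 1}),
    SimpleNamespace(metadata={"source": "data/b.pdf", "page": 0}),
]
print(chunks_per_source(fake_texts))
```

In the notebook the same function should work on `texts` directly, assuming every chunk exposes `.metadata` as shown above.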

## Create the DB

In [7]:
%%time
# Embed and store the texts
# Supplying a persist_directory will store the embeddings on disk
persist_directory = 'db_500_250'

## here we are using local HuggingFace embeddings (the default HuggingFaceEmbeddings model), not OpenAI embeddings
embedding = HuggingFaceEmbeddings()

vectordb = Chroma.from_documents(documents=texts,
                                 embedding=embedding,
                                 persist_directory=persist_directory)

CPU times: user 2min 5s, sys: 10.8 s, total: 2min 16s
Wall time: 2min 33s


In [8]:
# persist the db to disk
vectordb.persist()
vectordb = None

In [9]:
%%time

# Now we can load the persisted database from disk, and use it as normal.
vectordb = Chroma(persist_directory=persist_directory,
                  embedding_function=embedding)

CPU times: user 12.1 ms, sys: 18.6 ms, total: 30.7 ms
Wall time: 55.4 ms


## Make a retriever

In [10]:
retriever = vectordb.as_retriever()

In [11]:
docs = retriever.get_relevant_documents("What is paranoia?")

In [12]:
len(docs)

4

In [13]:
retriever = vectordb.as_retriever(search_kwargs={"k": 2})

In [14]:
retriever.search_type

'similarity'

In [15]:
retriever.search_kwargs

{'k': 2}
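
Conceptually, a `similarity` retriever with `k=2` embeds the query, scores it against every stored chunk vector, and returns the two best matches. Here is a toy sketch of that idea in plain Python. Cosine similarity is an assumption made for illustration; the metric Chroma actually uses depends on how the collection is configured, and real stores use an approximate index rather than a full scan.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: Sequence[float], doc_vecs: List[Sequence[float]], k: int = 2) -> List[Tuple[int, float]]:
    """Return (index, score) pairs for the k most similar document vectors."""
    scored = [(i, cosine(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Tiny 2-dimensional stand-ins for real embedding vectors.
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
print(top_k([1.0, 0.1], toy_vectors, k=2))
```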

## Make a chain

In [16]:
# create the chain to answer questions
qa_chain = RetrievalQA.from_chain_type(llm=llm,
                                  chain_type="stuff",
                                  retriever=retriever,
                                  return_source_documents=True)

In [17]:
def process_llm_response(llm_response):
    print(llm_response['result'])
    print(llm_response['source_documents'][0].metadata)

In [18]:
# full example
query = "What is paranoia?"
llm_response = qa_chain(query)
process_llm_response(llm_response)

 Paranoia is an exaggerated belief that intentional harm is being done or will be done by others, which can manifest in milder forms as ideas of social reference or more severe forms as persecutory delusions.
{'page': 0, 'source': 'data/s41598-023-47912-0.pdf'}
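
`process_llm_response` prints the metadata of the first source document only, even though the retriever returns two chunks per query. A small sketch that lists every distinct (source, page) pair instead is shown below. The response dict is a stand-in with made-up file names, and `unique_sources` is a hypothetical helper name.

```python
from types import SimpleNamespace


def unique_sources(llm_response):
    """Return distinct (source, page) pairs in the order they were retrieved."""
    seen = []
    for doc in llm_response.get("source_documents", []):
        key = (doc.metadata.get("source"), doc.metadata.get("page"))
        if key not in seen:
            seen.append(key)
    return seen


# Stand-in for the dict returned by qa_chain(query).
fake_response = {
    "result": "example answer",
    "source_documents": [
        SimpleNamespace(metadata={"source": "data/a.pdf", "page": 0}),
        SimpleNamespace(metadata={"source": "data/a.pdf", "page": 0}),
        SimpleNamespace(metadata={"source": "data/b.pdf", "page": 3}),
    ],
}
for source, page in unique_sources(fake_response):
    print(f"{source} (page {page})")
```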


In [19]:
# break it down
query = "How many young adults (or people) took part in this?"
llm_response = qa_chain(query)
# process_llm_response(llm_response)
llm_response

{'query': 'How many young adults (or people) took part in this?',
 'result': " In the study mentioned, there were 2089 non-clinical young adults who took part before the outbreak of the pandemic. However, in the current study, the authors couldn't compare the results to this earlier sample because they didn't test them during the pandemic. The authors also mentioned that they're not sure if their results would be replicated in more diverse samples, as the majority of their current sample were undergraduate students.",
 'source_documents': [Document(page_content='rable to another sample of demographically diverse non-clinical young adults tested before the outbreak of the \n pandemic10 (N = 2089), we could not ascertain the confounding impact of the pandemic on the expression of \nthese phenomena in daily  life36. Third, a majority of our sample were undergraduate students. It is not sure \nwhether our results would be replicated in demographically diverse samples. Finally, we acknowled

In [20]:
query = "How do they measure Momentary social anxiety?"
llm_response = qa_chain(query)
process_llm_response(llm_response)

 Momentary social anxiety is assessed using three items suggested by Kashdan and Steger, such as "I worried that I would say or do something wrong right now." These items have shown good reliability in both the current study and previous ESM studies. The within-person reliability for momentary social anxiety in this study was 0.84, and the between-person reliability was 0.99.
{'page': 5, 'source': 'data/s41598-023-47912-0.pdf'}


In [21]:
query = "What is their data collection method?"
llm_response = qa_chain(query)
process_llm_response(llm_response)

 The study uses a smartphone app called Experience Sampling Method (ESM) to collect data from the participants at random intervals throughout the week. The research team provides support to the participants during the assessment period, including technical assistance and encouragement to answer the prompts. The team also monitors the participants' progress and offers help to increase compliance if necessary.
{'page': 5, 'source': 'data/s41598-023-47912-0.pdf'}


In [22]:
query = "What is ESM?"
llm_response = qa_chain(query)
process_llm_response(llm_response)

 ESM stands for Ecological Momentary Assessment, which is a research method that involves collecting data on a participant's experiences and behaviors in their natural environment at specific moments in time. In this study, participants used a smartphone app to complete ESM questionnaires throughout the day. The research team provided support to the participants during the assessment period.
{'page': 5, 'source': 'data/s41598-023-47912-0.pdf'}


In [23]:
query = "What is the result of this study?"
llm_response = qa_chain(query)
process_llm_response(llm_response)

 I do not have access to the specific results of this study as it is not provided in the given context. The study aims to investigate the replicability of the priming effect of death-related words on prosocial behavior in both healthy and clinical populations. The study was approved by the Survey and Behavioral Research Ethics Committee of The Chinese University of Hong Kong, and informed consent was obtained from all participants. Eligible participants aged 18–30 were recruited from the subject pool of the Introductory Psychology course. However, the specific findings of the study are not mentioned in the provided context.
{'page': 4, 'source': 'data/s41598-023-47912-0.pdf'}


In [24]:
query = "What is the limitations of the current study?"
llm_response = qa_chain(query)
process_llm_response(llm_response)

 The current study has some limitations. Firstly, the sample size is relatively small, with only 100 participants. This limits the generalizability of the findings to other populations. Secondly, all participants were recruited from the subject pool of the Introductory Psychology course, which may not be representative of the general population. Thirdly, the study only included participants with a history of depression or anxiety, and did not include individuals with other mental health conditions. Therefore, the findings may not be applicable to other clinical populations. Finally, the study only assessed the effectiveness of the intervention in reducing symptoms of depression and anxiety, and did not examine other outcomes, such as quality of life or functional impairment. Future studies should address these limitations by recruiting a larger and more diverse sample, including individuals with other mental health conditions, and assessing a wider range of outcomes.
{'page': 4, 'sourc

In [25]:
query = "What is the hypothesis of the study?"
llm_response = qa_chain(query)
process_llm_response(llm_response)

 The passage mentions the need for replications in clinical populations, which suggests that the study may be testing the replicability of previous findings in clinical populations. However, without further context, it is unclear what specific hypothesis the study is testing.
{'page': 4, 'source': 'data/s41598-023-47912-0.pdf'}


In [26]:
query = "What is the final sample size of the study?"
llm_response = qa_chain(query)
process_llm_response(llm_response)

 The final sample size of the study is not provided in the given context. More information is needed to determine the total number of participants who completed the study.
{'page': 4, 'source': 'data/s41598-023-47912-0.pdf'}


In [27]:
query = "Where did the study take place?"
llm_response = qa_chain(query)
process_llm_response(llm_response)

 The study was carried out at The Chi-nese University of Hong Kong, as stated in the methods section.
{'page': 4, 'source': 'data/s41598-023-47912-0.pdf'}


In [28]:
qa_chain.retriever.search_type , qa_chain.retriever.vectorstore

('similarity', <langchain.vectorstores.chroma.Chroma at 0x12a629f90>)

In [29]:
print(qa_chain.combine_documents_chain.llm_chain.prompt.template)

Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

{context}

Question: {question}
Helpful Answer:
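
The `stuff` chain type fills this template by placing the retrieved chunks into `{context}` and the query into `{question}`. A minimal sketch of that assembly step follows, assuming the chunks are joined with a blank line between them (the separator is an assumption, not read from the chain). `build_stuff_prompt` is a hypothetical name.

```python
TEMPLATE = (
    "Use the following pieces of context to answer the question at the end. "
    "If you don't know the answer, just say that you don't know, "
    "don't try to make up an answer.\n\n"
    "{context}\n\n"
    "Question: {question}\n"
    "Helpful Answer:"
)


def build_stuff_prompt(chunks, question, separator="\n\n"):
    """Join all retrieved chunks into one context block and fill the template."""
    context = separator.join(chunks)
    return TEMPLATE.format(context=context, question=question)


print(build_stuff_prompt(["First chunk.", "Second chunk."], "What is paranoia?"))
```

Because every chunk goes into a single prompt, `k` and `chunk_size` together determine how much of the model's context window each question consumes.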


### Chat prompts

In [30]:
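# `.messages` only exists on chat prompt templates; this chain uses a plain PromptTemplate, so the call fails
print(qa_chain.combine_documents_chain.llm_chain.prompt.messages[0].prompt.template)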

AttributeError: 'PromptTemplate' object has no attribute 'messages'

In [None]:
print(qa_chain.combine_documents_chain.llm_chain.prompt.messages[1].prompt.template)
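
The `AttributeError` above shows that the prompt here is a plain `PromptTemplate` with a `.template` string and no `.messages` list. A defensive sketch that handles both shapes is given below. The objects are stand-ins that copy only the attribute layout used in the cells above, and `prompt_templates` is a hypothetical helper.

```python
from types import SimpleNamespace


def prompt_templates(prompt):
    """Return the template strings of a prompt, whether string-style or chat-style."""
    messages = getattr(prompt, "messages", None)
    if messages is None:
        # String-style prompt: a single template.
        return [prompt.template]
    # Chat-style prompt: one template per message.
    return [message.prompt.template for message in messages]


# Stand-ins mimicking the two attribute layouts.
string_prompt = SimpleNamespace(template="Question: {question}")
chat_prompt = SimpleNamespace(
    messages=[
        SimpleNamespace(prompt=SimpleNamespace(template="Context: {context}")),
        SimpleNamespace(prompt=SimpleNamespace(template="{question}")),
    ]
)
print(prompt_templates(string_prompt))
print(prompt_templates(chat_prompt))
```

In the notebook, the argument would be `qa_chain.combine_documents_chain.llm_chain.prompt`.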