# LangChain multi-doc retriever with ChromaDB

***New Points***
- Multiple Files
- ChromaDB
- Source info
- Zephyr-7B (`HuggingFaceH4/zephyr-7b-beta`) via the Hugging Face Hub API

## Setting up LangChain


In [1]:
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain import HuggingFaceHub
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.embeddings import HuggingFaceInstructEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.llms import OpenAI
from langchain.chains import RetrievalQA
from langchain.document_loaders import TextLoader
from langchain.document_loaders import PyPDFLoader
from langchain.document_loaders import DirectoryLoader
import os
from dotenv import load_dotenv

In [2]:
_ = load_dotenv()

OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]
HUGGINGFACEHUB_API_TOKEN = os.environ["HUGGINGFACEHUB_API_TOKEN"]

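# Load a local embedding model on the GPU.
# Note: this object is not used later; the vector store below is built with HuggingFaceEmbeddings().
instructor_embeddings = HuggingFaceInstructEmbeddings(model_name="BAAI/bge-base-en-v1.5",
                                                      model_kwargs={"device": "cuda"})

# Hosted LLM served through the Hugging Face Hub inference API
llm = HuggingFaceHub(
    repo_id="HuggingFaceH4/zephyr-7b-beta",
    model_kwargs={"temperature": 0.2, "max_length": 256},
    huggingfacehub_api_token=HUGGINGFACEHUB_API_TOKEN,
)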

load INSTRUCTOR_Transformer
max_seq_length  512
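
Reading a key with `os.environ["NAME"]` raises a bare `KeyError` when the variable is missing. The following is a minimal sketch of a friendlier lookup; the helper name `require_env` and the variable `EXAMPLE_API_TOKEN` are hypothetical and not part of the notebook.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or fail with a readable message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; add it to your .env file or shell."
        )
    return value


# Hypothetical variable, set here only so the example runs on its own.
os.environ["EXAMPLE_API_TOKEN"] = "dummy-token"
token = require_env("EXAMPLE_API_TOKEN")
print(token)
```

In the cell above, the same helper could wrap the lookups for `OPENAI_API_KEY` and `HUGGINGFACEHUB_API_TOKEN`.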




## Load and process multiple documents

In [3]:
# Load the PDF files in ./data/ (PyPDFLoader yields one Document per page)
# For a single text file instead: loader = TextLoader('single_text_file.txt')
loader = DirectoryLoader('./data/', glob="./*.pdf", loader_cls=PyPDFLoader)

documents = loader.load()

In [4]:
%%time
# Split the documents into chunks of up to 2000 characters with a 400-character overlap
text_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=400)
texts = text_splitter.split_documents(documents)

CPU times: user 2.37 ms, sys: 56 µs, total: 2.42 ms
Wall time: 2.45 ms
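
To make the effect of `chunk_size=2000` and `chunk_overlap=400` concrete, here is a simplified sketch of overlapping chunking. It is a plain fixed-width sliding window, not the separator-aware recursive algorithm that `RecursiveCharacterTextSplitter` implements, and the function name `sliding_window_chunks` is hypothetical.

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int = 2000, chunk_overlap: int = 400) -> List[str]:
    """Cut text into fixed-width chunks where neighbours share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
        start += step
    return chunks


# Synthetic 5000-character input, used only for illustration.
sample = "".join(str(i % 10) for i in range(5000))
chunks = sliding_window_chunks(sample)
print([len(c) for c in chunks])
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.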


In [5]:
len(texts)

36

In [7]:
texts[32]

Document(page_content='(SIAS) and Social Phobia Scale (SPS) using nonparametric item response theory: The SIAS-6 and the SPS-6. Psychol. Assess.  24, \n66–76. https://  doi. org/ 10. 1037/  a0024  544 (2012).\n 65. Fowler, D. et al. The Brief Core Schema Scales (BCSS): psychometric properties and associations with paranoia and grandiosity \nin non-clinical and psychosis samples. Psychol. Med. 36, 749–759 (2006).\n 66. Hughes, M. E., Waite, L. J., Hawkley, L. C. & Cacioppo, J. T. A short scale for measuring loneliness in large surveys: Results from \ntwo population-based studies. Res. Aging  26, 655–672 (2004).\n 67. Kashdan, T. B. & Steger, M. F. Expanding the topography of social anxiety: An experience-sampling assessment of positive emo -\ntions, positive events, and emotion suppression. Psychol. Sci.  17, 120–128 (2006).\n 68. Kashdan, T. B. et al. A contextual approach to experiential avoidance and social anxiety: Evidence from an experimental interaction \nand daily interactions o

## Create the DB

In [8]:
%%time
# Embed and store the texts
# Supplying a persist_directory will store the embeddings on disk
persist_directory = 'db_2000_400'

## Here we use local Hugging Face embeddings (HuggingFaceEmbeddings with its default model) rather than OpenAI embeddings
embedding = HuggingFaceEmbeddings()

vectordb = Chroma.from_documents(documents=texts,
                                 embedding=embedding,
                                 persist_directory=persist_directory)

CPU times: user 36.4 s, sys: 8.68 s, total: 45.1 s
Wall time: 49 s


In [9]:
# Persist the DB to disk, then drop the in-memory reference
vectordb.persist()
vectordb = None

In [10]:
%%time

# Now we can load the persisted database from disk, and use it as normal.
vectordb = Chroma(persist_directory=persist_directory,
                  embedding_function=embedding)

CPU times: user 17.6 ms, sys: 20.5 ms, total: 38.1 ms
Wall time: 99.4 ms


## Make a retriever

In [11]:
retriever = vectordb.as_retriever()

In [12]:
docs = retriever.get_relevant_documents("What is paranoia?")

In [13]:
len(docs)

4

In [14]:
retriever = vectordb.as_retriever(search_kwargs={"k": 2})

In [15]:
retriever.search_type

'similarity'

In [16]:
retriever.search_kwargs

{'k': 2}
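
Conceptually, a `similarity` retriever with `k=2` embeds the query and returns the two stored chunks whose vectors are closest to it. The sketch below shows that idea with a brute-force scan and cosine similarity. Both are assumptions made for illustration: the distance metric Chroma actually applies depends on how the collection is configured, and it uses an index rather than a full scan. The toy vectors and names are made up.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: List[float], store: Dict[str, List[float]], k: int = 2) -> List[Tuple[str, float]]:
    """Score every stored vector against the query and keep the k best matches."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in store.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional "embeddings"; real embedding vectors have hundreds of dimensions.
toy_store = {
    "chunk_a": [1.0, 0.0, 0.0],
    "chunk_b": [0.9, 0.1, 0.0],
    "chunk_c": [0.0, 1.0, 0.0],
    "chunk_d": [0.0, 0.0, 1.0],
}
hits = top_k([1.0, 0.0, 0.0], toy_store, k=2)
print(hits)
```

A smaller `k` keeps the prompt short, which matters here because every retrieved chunk can be up to 2000 characters long.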

## Make a chain

In [17]:
# create the chain to answer questions
qa_chain = RetrievalQA.from_chain_type(llm=llm,
                                  chain_type="stuff",
                                  retriever=retriever,
                                  return_source_documents=True)

In [18]:
def process_llm_response(llm_response):
    print(llm_response['result'])
    print(llm_response['source_documents'][0].metadata)
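
`process_llm_response` prints the metadata of the first source document only. With `k=2` the answer may draw on more than one chunk, so a variant that lists every distinct source can be useful. This is a sketch: `FakeDocument` is a hypothetical stand-in that only mimics the `page_content` and `metadata` attributes of a LangChain `Document`, and the file name is made up.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class FakeDocument:
    """Hypothetical stand-in for a LangChain Document (same two attributes)."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def list_sources(llm_response: Dict[str, Any]) -> List[Tuple[Any, Any]]:
    """Return the distinct (source, page) pairs in retrieval order."""
    seen = []
    for doc in llm_response["source_documents"]:
        entry = (doc.metadata.get("source"), doc.metadata.get("page"))
        if entry not in seen:
            seen.append(entry)
    return seen


# Made-up response with the same shape as the dict returned by the chain.
fake_response = {
    "result": "placeholder answer",
    "source_documents": [
        FakeDocument("...", {"source": "data/example.pdf", "page": 0}),
        FakeDocument("...", {"source": "data/example.pdf", "page": 5}),
        FakeDocument("...", {"source": "data/example.pdf", "page": 0}),
    ],
}
sources = list_sources(fake_response)
for source, page in sources:
    print(f"{source} (page {page})")
```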

In [19]:
# full example
query = "What is paranoia?"
llm_response = qa_chain(query)
process_llm_response(llm_response)

 Paranoia refers to a symptom characterized by excessive or unjustified distrust and suspicion of others' motives or actions. It is often associated with anxiety and related worry processes, and its co-occurrence with social anxiety has raised questions about how the two symptoms may influence each other. The cognitive model of paranoia proposes that social evaluative concerns contribute to the development of paranoia against the backdrop of anxiety and worry processes. Studies have also identified candidate factors that may maintain social anxiety in the context of psychotic experiences, such as beliefs about self/others and worry. The assessment of paranoia and related symptoms can be conducted using ecological momentary assessment and ambulatory assessment methods, which involve collecting data in real-life settings over extended periods.
{'page': 0, 'source': 'data/s41598-023-47912-0.pdf'}


In [21]:
# break it down
query = "How many young adults (or people) took part in this?"
llm_response = qa_chain(query)
# process_llm_response(llm_response)
llm_response

{'query': 'How many young adults (or people) took part in this?',
 'result': ' The text provides no information about the number of young adults or people who participated in this study. Therefore, we cannot answer this question.',
 'source_documents': [Document(page_content='Within-person standardized fixed \neffects Random effectsWithin-person standardized fixed \neffects Random effects\nEstimate (β) 95% CrIEstimate \n(variance) 95% CrI Estimate (β) 95% CrIEstimate \n(variance) 95% CrI\nIntercepts/means\nμSA 2.08 [1.77, 2.41] 0.80 [0.62, 1.05] 1.94 [1.62, 2.27] 0.95 [0.72, 1.29]\nμPAR 2.02 [1.71, 2.33] 0.62 [0.48, 0.81] 1.88 [1.57, 2.20] 0.72 [0.55, 0.98]\nμLONE / / / / 1.96 [1.64, 2.28] 0.84 [0.64, 1.29]\nAutoregressive effects\nϕSA⟶SA 0.50 [0.28, 0.73] 0.15 [0.11, 0.21] 0.41 [0.20, 0.63] 0.20 [0.14, 0.28]\nϕPAR⟶PAR 0.47 [0.24, 0.71] 0.14 [0.10, 0.20] 0.31 [0.11, 0.52] 0.18 [0.13, 0.26]\nϕLONE⟶LONE / / / / 0.61 [0.26, 0.86] 0.11 [0.08, 0.16]\nCross-lagged effects\nϕSA⟶PAR 0.20 [0.01

In [22]:
query = "How do they measure Momentary social anxiety?"
llm_response = qa_chain(query)
process_llm_response(llm_response)

 The text mentions a specific study, "Kashdan et al. (2014)," that measures momentary social anxiety through an experimental interaction and daily interactions of people with social anxiety disorder. However, the text also mentions other studies that use self-report measures to assess social anxiety, such as the Social Phobia Scale (SPS) and the Brief Core Schema Scales (BCSS). It's not clear from the context which specific measure is being referred to in this question, so it's best to say that the text doesn't provide enough information to answer this question.
{'page': 9, 'source': 'data/s41598-023-47912-0.pdf'}


In [23]:
query = "What is their data collection method?"
llm_response = qa_chain(query)
process_llm_response(llm_response)

 The study's data collection took place from June to October 2021, after the peak of the fourth wave of the COVID-19 pandemic in Hong Kong. Participants attended a 1-hour assessment session during which they were screened with the Structured Clinical Interview for Diagnostic and Statistical Manual of Mental Disorders-IV (SCI-DSM-IV; So et al. 59). Participants without any past or current psychiatric diagnosis completed a baseline survey, and were then briefed individually on the ESM procedure. The ESM questionnaires were programmed into a smartphone app (SEMA360) installed on the participant's mobile phone, and participants were asked to answer the same set of items assessing momentary loneliness, social anxiety, and paranoia ten times a day for six consecutive days. The app displayed the items one by one, and the prompt signals were pseudo-randomized into blocks of time intervals within 13 waking hours. The starting time of the ESM assessment was tailored for each participant to maxim

In [24]:
query = "What is ESM?"
llm_response = qa_chain(query)
process_llm_response(llm_response)

 ESM stands for Experience Sampling Method, which is a research technique that involves collecting data on people's experiences and behaviors in their natural environment at multiple points in time. In this study, participants completed brief surveys (ESM questionnaires) multiple times a day over a period of time to assess their social anxiety and paranoia symptoms. The data collected through ESM were analyzed using a statistical technique called Dynamic Structural Equation Modeling (DSEM) to examine the relationships between these symptoms over time.
{'page': 5, 'source': 'data/s41598-023-47912-0.pdf'}


In [25]:
query = "What is the result of this study?"
llm_response = qa_chain(query)
process_llm_response(llm_response)

 The study presented in the text material, titled "Dynamics of attachment and emotion regulation in daily life: Uni- and bidirectional associations," investigates the dynamics of attachment and emotion regulation in daily life through a multilevel time-series analysis. The study found bidirectional associations between attachment and emotion regulation, with attachment predicting emotion regulation and vice versa. The study also found that attachment and emotion regulation were associated with loneliness and paranoia. The results suggest that attachment and emotion regulation are important factors in understanding mental health and well-being over time. The study highlights the importance of longitudinal and ecological momentary assessment methods in understanding the dynamics of mental health and well-being.
{'page': 2, 'source': 'data/s41598-023-47912-0.pdf'}


In [26]:
query = "What are the limitations of the current study?"
llm_response = qa_chain(query)
process_llm_response(llm_response)

 The limitations of the current study, as discussed in the article, include the relatively small sample size, the use of self-reported measures, and the focus on a specific population (young adults with psychosis). Additionally, the study only assessed loneliness and social anxiety over a short period of time, and further research is needed to examine the long-term effects of these variables on mental health outcomes. The study also did not include a control group, which limits the ability to make causal inferences about the relationships between loneliness, social anxiety, and mental health outcomes. Finally, the study did not assess the potential moderating effects of other variables, such as gender or age, on these relationships.
{'page': 2, 'source': 'data/s41598-023-47912-0.pdf'}


In [27]:
query = "What is the hypothesis of the study?"
llm_response = qa_chain(query)
process_llm_response(llm_response)

 The study tests three hypotheses related to the dynamics of social anxiety and paranoia. The first hypothesis is that social anxiety and paranoia are related to each other within individuals over time. The second hypothesis is that social anxiety and paranoia are related to negative self and other schemas, respectively, at the between-person level. The third hypothesis is tested by examining the correlation between the random effects at the between-person level and the levels of negative-self and -other schemas. These hypotheses are represented in a schematic representation of the dynamic structural equation model of social anxiety and paranoia (Model 1) in Figure 1.
{'page': 2, 'source': 'data/s41598-023-47912-0.pdf'}


In [28]:
query = "What is the final sample size of the study?"
llm_response = qa_chain(query)
process_llm_response(llm_response)

 The study had a final sample size of 200 participants.
{'page': 2, 'source': 'data/s41598-023-47912-0.pdf'}


In [29]:
query = "Where did the study take place?"
llm_response = qa_chain(query)
process_llm_response(llm_response)

 The study did not provide information about the location where it was conducted. Therefore, we do not have enough context to answer this question.
{'page': 2, 'source': 'data/s41598-023-47912-0.pdf'}


In [28]:
qa_chain.retriever.search_type, qa_chain.retriever.vectorstore

('similarity', <langchain.vectorstores.chroma.Chroma at 0x1327da7c0>)

In [29]:
print(qa_chain.combine_documents_chain.llm_chain.prompt.template)

Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

{context}

Question: {question}
Helpful Answer:
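
The `stuff` chain type places all retrieved chunks into the `{context}` slot of this template and sends a single prompt to the LLM. Here is a minimal sketch of that assembly using the template text printed above. Joining the chunks with a blank line is an assumption about the separator, and `build_stuff_prompt` is a hypothetical helper, not a LangChain function.

```python
from typing import List

TEMPLATE = (
    "Use the following pieces of context to answer the question at the end. "
    "If you don't know the answer, just say that you don't know, "
    "don't try to make up an answer.\n\n"
    "{context}\n\n"
    "Question: {question}\n"
    "Helpful Answer:"
)


def build_stuff_prompt(chunks: List[str], question: str, separator: str = "\n\n") -> str:
    """Concatenate the retrieved chunks and fill both template slots."""
    context = separator.join(chunks)
    return TEMPLATE.format(context=context, question=question)


prompt = build_stuff_prompt(
    ["First retrieved chunk.", "Second retrieved chunk."],
    "What is ESM?",
)
print(prompt)
```

Because everything goes into one prompt, the combined size of the chunks plus the question has to fit in the model's context window.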


### Chat prompts

In [47]:
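# Note: .messages exists only on a chat prompt, i.e. when the chain is built with a chat model
# (for example ChatOpenAI); with the HuggingFaceHub LLM above, the prompt is the plain template shown earlier.
print(qa_chain.combine_documents_chain.llm_chain.prompt.messages[0].prompt.template)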

Use the following pieces of context to answer the user's question. 
If you don't know the answer, just say that you don't know, don't try to make up an answer.
----------------
{context}


In [48]:
print(qa_chain.combine_documents_chain.llm_chain.prompt.messages[1].prompt.template)

{question}
