# LangChain multi-doc retriever with ChromaDB

***New Points***
- Multiple Files
- ChromaDB
- Source info
- Zephyr-7B via the Hugging Face Hub API (in place of the gpt-3.5-turbo API)

## Setting up LangChain


In [1]:
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain import HuggingFaceHub
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.embeddings import HuggingFaceInstructEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.llms import OpenAI
from langchain.chains import RetrievalQA
from langchain.document_loaders import TextLoader
from langchain.document_loaders import PyPDFLoader
from langchain.document_loaders import DirectoryLoader
import os
from dotenv import load_dotenv

In [2]:
_ = load_dotenv()

OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]
HUGGINGFACEHUB_API_TOKEN = os.environ["HUGGINGFACEHUB_API_TOKEN"]

instructor_embeddings = HuggingFaceInstructEmbeddings(model_name="BAAI/bge-base-en-v1.5",
                                                      model_kwargs={"device": "cuda"})

llm=HuggingFaceHub(
    repo_id="HuggingFaceH4/zephyr-7b-beta", 
    model_kwargs={"temperature":0.2, "max_length":256},
    huggingfacehub_api_token=HUGGINGFACEHUB_API_TOKEN
    )

load INSTRUCTOR_Transformer
max_seq_length  512




## Load and process multiple documents

In [3]:
# Load every PDF file found in ./data/
# loader = TextLoader('single_text_file.txt')
loader = DirectoryLoader('./data/', glob="./*.pdf", loader_cls=PyPDFLoader)

documents = loader.load()

In [4]:
%%time
# Split the documents into overlapping chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=4000, chunk_overlap=1000)
texts = text_splitter.split_documents(documents)

CPU times: user 1.81 ms, sys: 195 µs, total: 2 ms
Wall time: 2.92 ms
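
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a minimal sliding-window sketch. It is not how `RecursiveCharacterTextSplitter` works internally (that class prefers to break on separators such as paragraphs and line breaks); the helper name `sliding_chunks` is hypothetical and only illustrates how consecutive chunks share text.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size windows whose neighbours share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_chunks(sample, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
```

With the notebook's settings (4000 and 1000), each chunk would repeat up to the last 1000 characters of its predecessor, which helps keep sentences that straddle a boundary retrievable from either side.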


In [5]:
len(texts)

20

In [6]:
texts[2]

Document(page_content='2\nVol:.(1234567890) Scientific Reports  |        (2023) 13:20775  | https://doi.org/10.1038/s41598-023-47912-0\nwww.nature.com/scientificreports/which form the core of social anxiety, contribute as an antecedent of  paranoia18. In a longitudinal study with a \ncommunity sample, Aunjitsakul et\xa0al.19 found that social anxiety at baseline predicted an increase in paranoia at \n3-month follow-up. On the other hand, social anxiety has also been proposed to be a consequence of paranoid \nthinking, which inflicts internalized stigma and  shame20–22. Two longitudinal cohort studies with general popula-\ntion  samples9,23 found that paranoia at baseline predicted subsequent emergence of social anxiety, but not vice \nversa. However, these studies did not examine both directions of relationship in the same model. Therefore the \ncovariation of the symptoms, which is conceptually interactive in nature, was not taken into full consideration.\nDelineating the temporal dyn

## Create the DB

In [7]:
%%time
# Embed and store the texts
# Supplying a persist_directory will store the embeddings on disk
persist_directory = 'db_4000_1000'

# Use local Hugging Face sentence-transformers embeddings here instead of OpenAI embeddings
embedding = HuggingFaceEmbeddings()

vectordb = Chroma.from_documents(documents=texts,
                                 embedding=embedding,
                                 persist_directory=persist_directory)

CPU times: user 20 s, sys: 4.48 s, total: 24.5 s
Wall time: 22.8 s


In [8]:
# Persist the DB to disk, then drop the in-memory reference
vectordb.persist()
vectordb = None

In [9]:
%%time

# Now we can load the persisted database from disk, and use it as normal.
vectordb = Chroma(persist_directory=persist_directory,
                  embedding_function=embedding)

CPU times: user 11 ms, sys: 4.45 ms, total: 15.4 ms
Wall time: 17.8 ms


## Make a retriever

In [10]:
retriever = vectordb.as_retriever()

In [11]:
docs = retriever.get_relevant_documents("What is paranoia?")

In [12]:
len(docs)

4

In [13]:
retriever = vectordb.as_retriever(search_kwargs={"k": 2})

In [14]:
retriever.search_type

'similarity'

In [15]:
retriever.search_kwargs

{'k': 2}
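
Conceptually, a `similarity` retriever with `k=2` embeds the query and returns the two stored chunks whose vectors are closest to it. The sketch below shows that idea with hand-made 3-dimensional vectors and cosine similarity. The vectors, chunk labels, and the `top_k` helper are made up for illustration, and Chroma's actual index and distance metric may differ.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, store, k):
    """Return the labels of the k stored vectors most similar to query_vec."""
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [label for label, _ in ranked[:k]]

# Toy "vector store": (label, embedding) pairs with invented values.
toy_store = [
    ("chunk about paranoia", [0.9, 0.1, 0.0]),
    ("chunk about ESM", [0.1, 0.9, 0.0]),
    ("chunk about statistics", [0.0, 0.2, 0.9]),
]
query_vec = [0.8, 0.3, 0.0]
print(top_k(query_vec, toy_store, k=2))
```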

## Make a chain

In [16]:
# create the chain to answer questions
qa_chain = RetrievalQA.from_chain_type(llm=llm,
                                  chain_type="stuff",
                                  retriever=retriever,
                                  return_source_documents=True)
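
The `stuff` chain type places every retrieved chunk into a single prompt alongside the question. A rough sketch of that assembly step follows; the template wording and the `build_stuff_prompt` helper are illustrative assumptions, not LangChain's actual prompt.

```python
def build_stuff_prompt(question, chunks):
    """Join all retrieved chunks into one context block and append the question."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

prompt = build_stuff_prompt(
    "What is ESM?",
    ["First retrieved chunk.", "Second retrieved chunk."],
)
print(prompt)
```

Because everything goes into one prompt, the chunk size multiplied by `k` has to fit within the model's context window.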

In [17]:
def process_llm_response(llm_response):
    print(llm_response['result'])
    print(llm_response['source_documents'][0].metadata)
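
`process_llm_response` prints only the metadata of the first source document. A possible variant that lists every distinct source and page is sketched below. `SimpleNamespace` objects stand in for LangChain `Document` instances so the snippet runs on its own, and the helper name `list_sources` and the file name are hypothetical.

```python
from types import SimpleNamespace

def list_sources(llm_response):
    """Collect distinct (source, page) pairs in the order they were retrieved."""
    seen = []
    for doc in llm_response["source_documents"]:
        entry = (doc.metadata.get("source"), doc.metadata.get("page"))
        if entry not in seen:
            seen.append(entry)
    return seen

# Stand-in response with the same keys the chain returns.
fake_response = {
    "result": "an answer",
    "source_documents": [
        SimpleNamespace(metadata={"source": "data/example.pdf", "page": 5}),
        SimpleNamespace(metadata={"source": "data/example.pdf", "page": 9}),
        SimpleNamespace(metadata={"source": "data/example.pdf", "page": 5}),
    ],
}
for source, page in list_sources(fake_response):
    print(f"{source} (page {page})")
```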

In [18]:
# full example
query = "What is paranoia?"
llm_response = qa_chain(query)
process_llm_response(llm_response)

 Paranoia refers to a symptom characterized by excessive or unreasonable distrust and suspicion of others, often accompanied by feelings of persecution or threat. It can be a symptom of certain mental health conditions, such as paranoid personality disorder or schizophrenia, or it can occur as a temporary or isolated experience. The cognitive model of paranoia suggests that social evaluative concerns contribute to the development of paranoia, as individuals may interpret ambiguous social situations as hostile or malevolent due to negative beliefs about themselves and others. The co-occurrence of paranoia and social anxiety, as well as the role of negative schemas and loneliness, are areas of ongoing research in understanding the nature and potential underlying mechanisms of paranoia.
{'page': 0, 'source': 'data/s41598-023-47912-0.pdf'}


In [19]:
# break it down
query = "How many young adults (or people) took part in this?"
llm_response = qa_chain(query)
# process_llm_response(llm_response)
llm_response

ValueError: Error raised by inference API: Internal Server Error
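
The `Internal Server Error` above comes from the hosted inference API and may be transient. One way to cope is to retry with a short exponential backoff. The sketch below is a generic wrapper; `call_with_retry` and the flaky stand-in function are hypothetical names, and whether a retry actually helps for this particular query is an assumption.

```python
import time

def call_with_retry(func, max_attempts=3, base_delay=1.0):
    """Call func(), retrying on ValueError with exponentially growing delays."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except ValueError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the original error
            time.sleep(base_delay * 2 ** (attempt - 1))

# Stand-in for the remote call: fails twice, then succeeds.
attempts = {"count": 0}

def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ValueError("Error raised by inference API: Internal Server Error")
    return "ok"

result = call_with_retry(flaky_call, max_attempts=3, base_delay=0.01)
print(result, attempts["count"])
```

In the notebook, the callable could be `lambda: qa_chain(query)`.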

In [20]:
query = "How do they measure Momentary social anxiety?"
llm_response = qa_chain(query)
process_llm_response(llm_response)

 The study measures momentary social anxiety using the Social Interaction Anxiety Scale (SIAS-6) developed by Peters et al. (2012). This is a short form of the original Social Interaction Anxiety Scale (SIAS) that consists of 20 items. The SIAS-6 includes six items that assess social anxiety in social situations, such as "I worry about being embarrassed in front of others" and "I feel I am not as good as other people in social situations." Participants are asked to rate their level of agreement with each item on a 5-point Likert scale, ranging from 1 (strongly disagree) to 5 (strongly agree). The SIAS-6 has been found to have good psychometric properties, including reliability and validity, and has been used in previous studies on social anxiety.
{'page': 9, 'source': 'data/s41598-023-47912-0.pdf'}


In [21]:
query = "What is their data collection method?"
llm_response = qa_chain(query)
process_llm_response(llm_response)

 The participants completed retrospective questionnaires at baseline to assess levels of loneliness, paranoia, depression, and social anxiety. They also completed ESM questionnaires throughout a six-day period, with support and guidance provided by the research team. The ESM measures included assessments of momentary loneliness, social anxiety, and paranoia, as well as negative-self and -other schemas. The data were analyzed using DSEM, which allows for the examination of multi-level relationships among ESM variables by decomposing the intensive longitudinal data into within- and between-person variance components using a latent person-mean approach. Missing data was handled with MCMC sampling, and within-person standardized parameters of the fixed effects were computed for interpretation.
{'page': 5, 'source': 'data/s41598-023-47912-0.pdf'}


In [22]:
query = "What is ESM?"
llm_response = qa_chain(query)
process_llm_response(llm_response)

 ESM stands for Experience Sampling Method, which is a research technique that involves collecting data about people's experiences and behaviors in their natural environments at multiple points in time. In this study, participants completed brief surveys on their smartphones at random intervals throughout the day to assess their social anxiety and paranoia levels. The data collected through ESM were analyzed using a statistical technique called Dynamic Structural Equation Modeling (DSEM) to examine the relationships between social anxiety and paranoia over time.
{'page': 5, 'source': 'data/s41598-023-47912-0.pdf'}


In [23]:
query = "What is the result of this study?"
llm_response = qa_chain(query)
process_llm_response(llm_response)

 The study utilized dynamic structural equation modeling (DSEM) to examine the longitudinal relationships between social anxiety and paranoia in individuals with first-episode psychosis. The results showed significant autoregressive effects for both social anxiety and paranoia, indicating carry-over effects across moments. Additionally, there was a cross-lagged effect from social anxiety to paranoia. These findings suggest that social anxiety may contribute to the development of paranoia over time in individuals with first-episode psychosis.
{'page': 8, 'source': 'data/s41598-023-47912-0.pdf'}


In [24]:
query = "What are the limitations of the current study?"
llm_response = qa_chain(query)
process_llm_response(llm_response)

 The current study excluded participants who completed less than one-third of the total ESM questionnaires, which may have introduced selection bias and reduced the generalizability of the results. Additionally, the study's sample size may have been limited, as only 113 participants were included in the final analysis. These limitations should be considered when interpreting the findings of the study.
{'page': 5, 'source': 'data/s41598-023-47912-0.pdf'}


In [25]:
query = "What is the hypothesis of the study?"
llm_response = qa_chain(query)
process_llm_response(llm_response)

 The study aims to investigate the relationship between social anxiety and paranoia, as well as the role of loneliness and social isolation in this relationship. The hypothesis is that social anxiety and paranoia are interconnected, and that loneliness and social isolation may contribute to the development and maintenance of paranoid ideation in individuals with social anxiety. The study uses experience sampling method (ESM) to collect real-time data on participants' social anxiety, paranoia, loneliness, and social isolation, and employs dynamic structural equation modelling (DSEM) to examine the temporal dynamics and reciprocal effects of these variables. The results of the study may shed light on the underlying mechanisms of the association between social anxiety and paranoia, and inform the development of targeted interventions for individuals with social anxiety and paranoid ideation.
{'page': 1, 'source': 'data/s41598-023-47912-0.pdf'}


In [26]:
query = "What is the final sample size of the study?"
llm_response = qa_chain(query)
process_llm_response(llm_response)

 The study excluded responses from participants who completed less than one-third of the total ESM questionnaires (i.e. 20). The final sample size is not explicitly stated in the context provided.
{'page': 5, 'source': 'data/s41598-023-47912-0.pdf'}


In [27]:
query = "Where did the study take place?"
llm_response = qa_chain(query)
process_llm_response(llm_response)

 The study did not provide information about the location where it was conducted.
{'page': 2, 'source': 'data/s41598-023-47912-0.pdf'}


In [None]:
qa_chain.retriever.search_type , qa_chain.retriever.vectorstore

In [None]:
print(qa_chain.combine_documents_chain.llm_chain.prompt.template)

### Chat prompts

In [None]:
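# These two cells only work when the chain is built with a chat model, whose prompt exposes .messages;
# with the HuggingFaceHub LLM used above, the prompt is a plain template (see the previous cell).
print(qa_chain.combine_documents_chain.llm_chain.prompt.messages[0].prompt.template)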

In [None]:
print(qa_chain.combine_documents_chain.llm_chain.prompt.messages[1].prompt.template)
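
Which attribute holds the template depends on the prompt type: a plain prompt exposes `.template`, while a chat prompt exposes `.messages`, each wrapping its own prompt. A defensive helper could handle both cases. The stand-in objects below only mimic those attributes so the sketch runs without LangChain, and `prompt_templates` is a hypothetical name.

```python
from types import SimpleNamespace

def prompt_templates(prompt):
    """Return the template strings of a plain prompt or of each chat message."""
    if hasattr(prompt, "messages"):
        return [message.prompt.template for message in prompt.messages]
    return [prompt.template]

# Stand-ins that mimic the attribute layout of the two prompt kinds.
plain = SimpleNamespace(template="Context: {context}\nQuestion: {question}")
chat = SimpleNamespace(messages=[
    SimpleNamespace(prompt=SimpleNamespace(template="System: {context}")),
    SimpleNamespace(prompt=SimpleNamespace(template="{question}")),
])
print(prompt_templates(plain))
print(prompt_templates(chat))
```

In the notebook, the argument would be `qa_chain.combine_documents_chain.llm_chain.prompt`.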