# Question Answering (QA) over documents using RAG

Here we create a chatbot that can answer questions about documents using Retrieval-Augmented Generation (RAG).  
This notebook covers the following concepts:
- A simple in-memory RAG application built with LangChain on top of a CSV file
- Evaluation of the RAG application's completion performance using the example generator from LangChain

**Use Case:**
- Search and summary of clinical trials

In [162]:
# import os
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file

import pprint
# A function for printing nicely
def nprint(text, indent=2):
    pp = pprint.PrettyPrinter(indent=indent)
    pp.pprint(text)
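
The `.env` file loaded above is presumably where the OpenAI API key lives; the OpenAI integrations used later typically read it from the `OPENAI_API_KEY` environment variable. A minimal sketch of an early check is shown below. The helper name `require_env` and the placeholder key are hypothetical and not part of the original notebook.

```python
import os

def require_env(name, env=None):
    """Return the value of `name`, or raise if it is missing or empty."""
    # `env` defaults to the real process environment; a plain dict can be
    # passed instead to try the check without touching real credentials.
    env = os.environ if env is None else env
    value = env.get(name, "")
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file")
    return value

# Demonstration with an explicit mapping and a placeholder (not a real key)
key = require_env("OPENAI_API_KEY", env={"OPENAI_API_KEY": "sk-placeholder"})
print("Key found, length:", len(key))
```

Calling `require_env("OPENAI_API_KEY")` without the `env` argument checks the actual environment, which fails fast before any embedding or chat request is attempted.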

# Loading Parameters

In [215]:
create_embeddings = True
modelID = "gpt-3.5-turbo"

## Data Description

This dataset was downloaded from the [Clinical Trial Outcome Prediction repository](https://github.com/futianfan/clinical-trial-outcome-prediction). The data is derived from clinical trial records on [ClinicalTrials.gov](https://clinicaltrials.gov), as of February 20, 2021.

### Overview

The dataset provides comprehensive information about various clinical trials. The columns include:

- **NCT ID:** Unique identifiers assigned to each clinical study.
- **Status:** The current status of the trial (e.g., completed, not yet recruiting, active, recruiting, suspended, terminated, unknown status, withdrawn).
- **Why Stop:** Reasons why the trial was stopped (if applicable). This field can be empty.
- **Label:** Binary outcome label indicating the success (1) or failure (0) of the trial.
- **Phase:** The phase of the trial (e.g., Phase I, Phase II, Phase III).
- **Diseases:** List of disease names associated with the trial.
- **ICD Codes:** List of ICD-10 codes corresponding to the diseases.
- **Drugs:** List of drug names used in the trial.
- **SMILES:** List of SMILES (Simplified Molecular Input Line Entry System) representations of the drugs.
- **Criteria:** Eligibility criteria for participants in the trial.

### Column Description

For detailed descriptions of each column, please refer to the [column description documentation](https://github.com/futianfan/clinical-trial-outcome-prediction/blob/main/data/README.md).

This dataset is valuable for analyzing and predicting clinical trial outcomes, exploring the relationships between diseases and treatments, and understanding eligibility criteria and trial phases. It can serve as a starting point for various data-driven applications and research in clinical trials.

Downloading the data:

In [163]:
import pandas as pd

# URL to the raw_data.csv file
url = 'https://raw.githubusercontent.com/futianfan/clinical-trial-outcome-prediction/main/data/raw_data.csv'

# Read the CSV file directly into a pandas DataFrame
df_trials = pd.read_csv(url)

df_trials.head()

Unnamed: 0,nctid,status,why_stop,label,phase,diseases,icdcodes,drugs,smiless,criteria
0,NCT00000172,completed,,1,phase 3,['alzheimer disease'],"[""['G30.8', 'G30.9', 'G30.0', 'G30.1']""]",['galantamine'],['[H][C@]12C[C@@H](O)C=C[C@]11CCN(C)CC3=C1C(O2...,\n Inclusion Criteria:\r\n\r\n ...
1,NCT00000173,completed,,1,phase 3,['alzheimer disease'],"[""['G30.8', 'G30.9', 'G30.0', 'G30.1']""]","['donepezil', 'vitamin e']",['O=S(=O)(C1=CC=CC=C1)C1=CN=C2C(C=CC=C2N2CCNCC...,\n Inclusion Criteria:\r\n\r\n ...
2,NCT00000174,completed,,0,phase 3,"['alzheimer disease', 'cognition disorders']","[""['G30.8', 'G30.9', 'G30.0', 'G30.1']"", ""['F2...",['rivastigmine'],['CCN(C)C(=O)OC1=CC=CC(=C1)[C@H](C)N(C)C'],\n Inclusion Criteria:\r\n\r\n ...
3,NCT00000390,completed,,1,phase 2,['depression'],"[""['F32.A', 'F53.0', 'P91.4', 'Z13.31', 'Z13.3...",['imipramine hydrochloride'],['CN(C)CCCN1C2=CC=CC=C2CCC2=CC=CC=C12'],\n Inclusion Criteria:\r\n\r\n ...
4,NCT00000419,terminated,,0,phase 3,['systemic lupus erythematosus'],"[""['M32.9', 'M32.0', 'M32.11', 'M32.12', 'M32....",['premarin and provera'],['[H][C@@]12CC[C@](OC(C)=O)(C(C)=O)[C@@]1(C)CC...,\n Inclusion Criteria:\r\n\r\n ...


A quick look at the dataset:

In [164]:
# Display non-null counts and dtypes for each column (missing values = total rows - non-null count)
print("Missing Values:")
print(df_trials.info())
print("\n")

# Display statistical summary for categorical columns
print("Statistical Summary (Categorical Columns):")
print(df_trials.describe(include=['object']))

Missing Values:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17614 entries, 0 to 17613
Data columns (total 10 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   nctid     17614 non-null  object
 1   status    17614 non-null  object
 2   why_stop  3624 non-null   object
 3   label     17614 non-null  int64 
 4   phase     17036 non-null  object
 5   diseases  17614 non-null  object
 6   icdcodes  17614 non-null  object
 7   drugs     17614 non-null  object
 8   smiless   17614 non-null  object
 9   criteria  17612 non-null  object
dtypes: int64(1), object(9)
memory usage: 1.3+ MB
None


Statistical Summary (Categorical Columns):
              nctid     status                  why_stop    phase  \
count         17614      17614                      3624    17036   
unique        17614          8                      2421        7   
top     NCT00000172  completed  \n    slow accrual\r\n    phase 2   
freq              1      12369             
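
`info()` reports non-null counts rather than missing counts, so the number of missing values has to be derived. The sketch below does the arithmetic using the figures copied from the output above; on the DataFrame itself, `df_trials.isna().sum()` should give the same counts directly.

```python
# Figures copied from the df_trials.info() output above
total_rows = 17614
non_null = {"why_stop": 3624, "phase": 17036, "criteria": 17612}

# Missing values per column = total rows - non-null count
missing = {col: total_rows - n for col, n in non_null.items()}

for col, n_missing in missing.items():
    print(f"{col}: {n_missing} missing ({n_missing / total_rows:.1%})")
```

Most of the gaps are in `why_stop`, which is expected because the field only applies to trials that were stopped.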

A bit of data pre-processing:

In [165]:
# Correcting the string representation of lists
import ast

# Convert the string representation of lists to actual lists
df_trials['diseases'] = df_trials['diseases'].apply(ast.literal_eval)

# Map label = 1 to 'success' and label = 0 to 'failure'
df_trials['label'] = df_trials['label'].map({1: 'success', 0: 'failure'})

# Replace missing why_stop values with 'not stopped'
df_trials['why_stop'] = df_trials['why_stop'].fillna('not stopped')
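
To illustrate the list conversion step, here is a small standalone sketch of what `ast.literal_eval` does to a single cell value. The sample string is modelled on the `diseases` values shown earlier; the second call is only there to show that `literal_eval` accepts Python literals and rejects arbitrary expressions.

```python
import ast

# A cell value as pandas reads it from the CSV: a string that looks like a list
raw = "['alzheimer disease', 'cognition disorders']"
parsed = ast.literal_eval(raw)

print(type(raw).__name__, "->", type(parsed).__name__)
print(parsed[1])

# literal_eval only parses literals, so a function call is rejected
rejected = False
try:
    ast.literal_eval("__import__('os').getcwd()")
except ValueError:
    rejected = True
print("Expression rejected:", rejected)
```

Note that only the `diseases` column is converted in the cell above; the `drugs` column keeps its string form.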

Looking at a sample of the criteria information of the trials:

In [166]:
# Show the criteria column of a randomly sampled Alzheimer's disease trial
filtered_df = df_trials[df_trials['diseases'].apply(lambda x: x == ['alzheimer disease'])]
criteria_samples = filtered_df.sample(1)
print(criteria_samples['criteria'].values[0])



        Inclusion Criteria:

          1. Male or female, 55 years of age or older,

          2. Diagnosis of probable Alzheimer's disease using the National Institute of Neurological
             and Communicative Disorders and Stroke and the Alzheimer's disease and Related
             Disorders Association criteria by Principal Investigator,

          3. Score 14 to 24 (inclusive) on the Mini-Mental Status Examination,

          4. Global Clinical Dementia Rating (CDR) Scale ≥ 0.5 or greater with CDR memory ≥ 0.5 or
             greater,

          5. Score ≤ 4 or lower on the Hachinski Ischemic Scale,

          6. Score ≤ 5 on the Geriatric Depression Scale (GDS),

          7. Current (stable dose for 4 weeks or longer) or past treatment with
             acetylcholinesterase inhibitors, memantine, or cognitive enhancers are allowed,

          8. Females must not be of childbearing potential (i.e., must be post-menopausal with
             cessation of menses for ≥ 12 months

The criteria column contains important information about each trial, so valuable insight can be extracted from it.  

# Asking questions about the content of trials criteria
We are going to use LangChain to create a vector database from the CSV file. We can then perform a similarity search between the user prompt and the criteria content to retrieve the relevant information. The retrieved information is passed to the LLM so that it can answer the user's question.
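
The retrieval step can be sketched without any external library. The example below ranks documents by cosine similarity between a query vector and document vectors. The three-dimensional vectors and the `trial_*` names are hypothetical stand-ins for real embeddings, and cosine similarity is an assumption about the metric; the vector store used later may be configured differently.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=2):
    """Return the ids of the k documents most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), doc_id)
              for doc_id, vec in doc_vecs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:k]]

# Hypothetical toy embeddings; real ones have far more dimensions
doc_vecs = {
    "trial_a": [0.9, 0.1, 0.0],
    "trial_b": [0.1, 0.9, 0.0],
    "trial_c": [0.8, 0.2, 0.1],
}
query_vec = [1.0, 0.0, 0.0]

print(top_k(query_vec, doc_vecs, k=2))  # ['trial_a', 'trial_c']
```

The text of the top-ranked documents is what gets passed to the LLM together with the question.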

Loading the CSV file:

In [168]:
# Removing non-relevant columns
df_trials_cleaned = df_trials.drop(columns=['smiless','icdcodes'])
df_trials_cleaned.to_csv('../../data/trials_data.csv', index=False)
df_trials_cleaned

Unnamed: 0,nctid,status,why_stop,label,phase,diseases,drugs,criteria
0,NCT00000172,completed,not stopped,success,phase 3,[alzheimer disease],['galantamine'],\n Inclusion Criteria:\r\n\r\n ...
1,NCT00000173,completed,not stopped,success,phase 3,[alzheimer disease],"['donepezil', 'vitamin e']",\n Inclusion Criteria:\r\n\r\n ...
2,NCT00000174,completed,not stopped,failure,phase 3,"[alzheimer disease, cognition disorders]",['rivastigmine'],\n Inclusion Criteria:\r\n\r\n ...
3,NCT00000390,completed,not stopped,success,phase 2,[depression],['imipramine hydrochloride'],\n Inclusion Criteria:\r\n\r\n ...
4,NCT00000419,terminated,not stopped,failure,phase 3,[systemic lupus erythematosus],['premarin and provera'],\n Inclusion Criteria:\r\n\r\n ...
...,...,...,...,...,...,...,...,...
17609,NCT04329923,terminated,\n cohort 1: slow accrual cohort 2: other s...,failure,phase 2,[covid-19],['hydroxychloroquine sulfate 400 mg twice a da...,\n Inclusion Criteria:\r\n\r\n ...
17610,NCT04341727,suspended,\n dsmb recommended study suspension slow a...,failure,phase 3,[coronavirus infection],"['hydroxychloroquine sulfate', 'azithromycin',...",\n Inclusion Criteria:\r\n\r\n ...
17611,NCT04409327,terminated,\n insufficient accrual rate\r\n,failure,phase 2,[covid19],"['rtb101', 'placebo']",\n Inclusion Criteria:\r\n\r\n ...
17612,NCT04437953,withdrawn,\n lack of accrual\r\n,failure,phase 2,"[thrombocytopenia, cancer, liver diseases]",['avatrombopag'],\n Inclusion Criteria:\r\n\r\n ...


In [169]:
from langchain.document_loaders import CSVLoader

file = '../../data/trials_data.csv'
loader = CSVLoader(file_path=file, encoding='utf-8')
docs = loader.load()
print(docs[0].page_content)

nctid: NCT00000172
status: completed
why_stop: not stopped
label: success
phase: phase 3
diseases: ['alzheimer disease']
drugs: ['galantamine']
criteria: Inclusion Criteria:

          -  Probable Alzheimer's disease

          -  Mini-Mental State Examination (MMSE) 10-22 and ADAS greater than or equal to 18

          -  Alzheimer's Disease Assessment Scale cognitive portion (ADAS-cog-11) score of at least
             18

          -  Opportunity for Activities of Daily Living

          -  Caregiver

          -  Subjects who live with or have regular daily visits from a responsible caregiver
             (visit frequency: preferably daily but at least 5 days/week). This includes a friend
             or relative or paid personnel. The caregiver should be capable of assisting with the
             subject's medication, prepared to attend with the subject for assessments, and willing
             to provide information about the subject.

        Exclusion Criteria:

          -  Co

Using an OpenAI embedding model to create vector embeddings:

In [170]:
from langchain_openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()

query = "What are the inclusion criteria for patients with alzheimer?"
 
# sample embedding
embed_query = embeddings.embed_query(query)

print(f'The dimension of the embedding is: {len(embed_query)}')
print('Sample embedding values: \n')
print(embed_query[0:5])

The dimension of the embedding is: 1536
Sample embedding values: 

[-0.005064966157078743, 0.012979785911738873, -0.0032773311249911785, -0.01851109229028225, -0.038861632347106934]


Using DocArrayInMemorySearch from LangChain to create a vector database:

In [217]:
# Build the vector store from the first 100 documents of the docs list
from langchain.vectorstores import DocArrayInMemorySearch

# SET create_embeddings = True TO CREATE THE IN-MEMORY VECTOR STORE

if create_embeddings:
    docs_sub = docs[0:100]
    db_vector = DocArrayInMemorySearch.from_documents(
        docs_sub, 
        embeddings
    )
    # Flip the flag so that re-running this cell does not recompute the embeddings
    create_embeddings = False
    print('Vector store created')
else:
    print('Using the vector store already created')

Using the vector store already created


Let's pick a disease from the available list to investigate further.  
To make an informed choice, let's see which diseases are most common in the first 100 rows:

In [247]:
from collections import defaultdict

# diseases_list = df_trials_cleaned[0:100][['nctid','diseases']]
diseases_column = df_trials_cleaned['diseases'].iloc[:100]

# Dictionary to count frequencies
disease_frequency = defaultdict(int)

# Flatten the lists and count frequencies
for diseases_list in diseases_column:
    if isinstance(diseases_list, str):
        # Convert string representation of list to actual list
        # (ast.literal_eval is safer than eval; ast is imported in the pre-processing cell)
        diseases_list = ast.literal_eval(diseases_list)
    if isinstance(diseases_list, list):
        for disease in diseases_list:
            disease_frequency[disease] += 1

# Sort the diseases by frequency
sorted_disease_frequency = sorted(disease_frequency.items(), key=lambda item: item[1], reverse=True)

# Print the unique diseases and their frequencies
print("Unique diseases and their frequencies in the first 100 rows (sorted):")
for disease, freq in sorted_disease_frequency:
    print(f"{disease}: {freq}")

Unique diseases and their frequencies in the first 100 rows (sorted):
ovarian cancer: 11
breast cancer: 10
leukemia: 8
lymphoma: 8
sarcoma: 7
prostate cancer: 6
lung cancer: 6
bladder cancer: 5
primary peritoneal cavity cancer: 5
head and neck cancer: 5
fallopian tube cancer: 5
hiv infections: 4
colorectal cancer: 4
alzheimer disease: 3
urethral cancer: 3
endometrial cancer: 3
gastric cancer: 3
esophageal cancer: 2
cognition disorders: 1
depression: 1
systemic lupus erythematosus: 1
asthma: 1
lung diseases: 1
cardiovascular diseases: 1
heart diseases: 1
peripheral vascular diseases: 1
thromboembolism: 1
vascular diseases: 1
venous thromboembolism: 1
bacterial infections: 1
pneumonia, pneumocystis carinii: 1
albinism: 1
inborn errors of metabolism: 1
oculocutaneous albinism: 1
platelet storage pool deficiency: 1
pulmonary fibrosis: 1
diabetes mellitus: 1
hypertension: 1
metabolic disease: 1
obesity: 1
sleep apnea syndrome: 1
stage i multiple myeloma: 1
stage ii multiple myeloma: 1
stage
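
The same frequency count can be written more compactly with `collections.Counter`. The sketch below is an alternative to the loop above, not a replacement of it; the three-row `diseases_column` is a hypothetical stand-in for `df_trials_cleaned['diseases'].iloc[:100]` and assumes every entry is already a list.

```python
from collections import Counter
from itertools import chain

# Hypothetical stand-in for the first rows of the diseases column
diseases_column = [
    ["breast cancer"],
    ["alzheimer disease", "cognition disorders"],
    ["breast cancer", "menopausal symptoms"],
]

# Flatten the per-trial lists and count each disease name
disease_frequency = Counter(chain.from_iterable(diseases_column))

# most_common() returns (disease, count) pairs sorted by count, descending
for disease, freq in disease_frequency.most_common():
    print(f"{disease}: {freq}")
```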

In [300]:
query = "What are the most important characteristics of patients taking part in breast cancer trials?"
docs_retrieved = db_vector.similarity_search(query)

Let's double-check that the retrieved information is relevant:

In [301]:
for i, doc in enumerate(docs_retrieved):    
    rowid = doc.metadata['row']
    print('The disease for retrieved document', i, 'is', df_trials_cleaned.iloc[rowid]['diseases'])

print('\nContent of a retrieved document:\n')
print(docs_retrieved[0].page_content)


The disease for retrieved document 0 is ['breast cancer']
The disease for retrieved document 1 is ['breast cancer']
The disease for retrieved document 2 is ['breast cancer', 'menopausal symptoms']
The disease for retrieved document 3 is ['breast cancer']

Content of a retrieved document:

nctid: NCT00002772
status: terminated
why_stop: poor accrual
label: failure
phase: phase 3
diseases: ['breast cancer']
drugs: ['carboplatin', 'carmustine', 'cisplatin', 'cyclophosphamide', 'doxorubicin hydrochloride', 'paclitaxel', 'tamoxifen citrate', 'thiotepa']
criteria: DISEASE CHARACTERISTICS: Histologically proven adenocarcinoma of the breast with at least 4
        involved axillary and/or intramammary lymph nodes No known T4, N3, or M1 disease Dermal
        lymphatic involvement without clinical inflammatory changes (edema, peau d'orange,
        erythema) allowed Must have undergone breast conserving surgery or modified radical
        mastectomy plus axillary lymph node dissection Surgical 
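
The visual check above can be turned into a number. The sketch below computes the share of retrieved documents whose disease list contains the target disease; the helper name `fraction_relevant` is hypothetical, and the input lists are copied from the output above.

```python
def fraction_relevant(retrieved_diseases, target):
    """Share of retrieved documents whose disease list contains `target`."""
    if not retrieved_diseases:
        return 0.0
    hits = sum(1 for diseases in retrieved_diseases if target in diseases)
    return hits / len(retrieved_diseases)

# Disease lists of the four retrieved documents, copied from the output above
retrieved = [
    ["breast cancer"],
    ["breast cancer"],
    ["breast cancer", "menopausal symptoms"],
    ["breast cancer"],
]

print(fraction_relevant(retrieved, "breast cancer"))  # 1.0
```

A value below 1.0 would indicate that some retrieved documents are about other diseases and may dilute the context passed to the LLM.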

Let's ask questions about the disease trials:

In [306]:
from langchain_openai import ChatOpenAI
from langchain.prompts import ChatPromptTemplate

llm = ChatOpenAI(temperature = 0.0, model=modelID)

template = """{qdocs}
Question: {query}
"""
prompt_template = ChatPromptTemplate.from_template(template)

# Join the page content of all retrieved documents into a single string
# (iterate over docs_retrieved, not over the full docs list)
qdocs = "\n\n".join(doc.page_content for doc in docs_retrieved)
query = "What are the most important characteristics of patients taking part in breast cancer trials?"

prompt = prompt_template.format_messages(
                    qdocs=qdocs,
                    query=query)
completion = llm.invoke(prompt)
print(completion.content)

The most important characteristics of patients taking part in breast cancer trials may include:

1. Diagnosis of breast cancer: Patients must have a confirmed diagnosis of breast cancer to be eligible for breast cancer trials.

2. Stage of cancer: Patients may need to have a specific stage of breast cancer to qualify for certain trials, such as early-stage or metastatic breast cancer.

3. Treatment history: Patients' treatment history, including previous therapies and responses, may be important criteria for participation in breast cancer trials.

4. Health status: Patients' overall health status, including any comorbidities or medical conditions, may impact their eligibility for breast cancer trials.

5. Age and gender: Some trials may have specific age or gender requirements for participants.

6. Performance status: Patients' ability to perform daily activities and tolerate treatment may be assessed through performance status criteria.

7. Genetic mutations: Patients with specific ge

## Using the RetrievalQA module
Now we search the database for a specific disease and also ask the model for a summary.



In [314]:
from langchain.chains import RetrievalQA

retriever = db_vector.as_retriever()
qa_stuff = RetrievalQA.from_chain_type(
    llm=llm, 
    chain_type="stuff", 
    retriever=retriever, 
    verbose=True
)

query =  "Please list all trials related to alzheimer disease in a table \
in markdown and summarize each one."

completion = qa_stuff.invoke(query)




> Entering new RetrievalQA chain...

> Finished chain.


In [315]:
from IPython.display import display, Markdown
display(Markdown(completion['result']))

| Trial ID   | Status   | Label   | Phase | Drugs                        | Summary                                                                                                      |
|------------|----------|---------|-------|------------------------------|--------------------------------------------------------------------------------------------------------------|
| NCT00000173| Completed| Success | Phase 3| Donepezil, Vitamin E         | This trial focused on individuals with memory complaints and difficulties, testing the efficacy of donepezil and vitamin E in Alzheimer's disease. |
| NCT00000172| Completed| Success | Phase 3| Galantamine                  | This trial targeted probable Alzheimer's disease patients, assessing the impact of galantamine on cognitive function and daily living activities. |
| NCT00000174| Completed| Failure | Phase 3| Rivastigmine                 | This trial aimed to study mild cognitive impairment in older individuals using rivastigmine, but it did not achieve the desired outcome. |
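
Conceptually, the `stuff` chain type places the text of every retrieved document into a single prompt, followed by the question. The sketch below illustrates that idea in plain Python. The function name `build_stuff_prompt` and the instruction wording are hypothetical and are not LangChain's actual prompt template.

```python
def build_stuff_prompt(documents, question):
    """Concatenate all retrieved documents and append the question."""
    context = "\n\n".join(documents)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}"
    )

# Shortened, hypothetical document texts modelled on the CSV rows
retrieved_texts = [
    "nctid: NCT00000172\ndiseases: ['alzheimer disease']\ndrugs: ['galantamine']",
    "nctid: NCT00000173\ndiseases: ['alzheimer disease']\ndrugs: ['donepezil', 'vitamin e']",
]

prompt = build_stuff_prompt(
    retrieved_texts,
    "Please list all trials related to alzheimer disease in a table.",
)
print(prompt)
```

Because the model only sees the documents that were retrieved, a request to list "all" trials can cover at most the retriever's top results, and the combined text must fit within the model's context window.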