# BioGraphX RAG Pipeline - Complete Demo

This notebook demonstrates the complete Retrieval-Augmented Generation (RAG) pipeline for biomedical question answering.

## Pipeline Overview

```
Question → QuestionAgent → NormalizeAgent → WikipediaAgent → RetrieverAgent → QAModelAgent → EvidenceAgent → ExplanationAgent → Answer
```
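
The chain above can be read as a list of steps that each receive a shared state dictionary and return it enriched. The following is a minimal sketch of that pattern, not the actual `AgentGraphPipeline` implementation; `run_pipeline` and the two toy agents are hypothetical names introduced only for illustration.

```python
from typing import Any, Callable, Dict, List

State = Dict[str, Any]
Agent = Callable[[State], State]


def run_pipeline(question: str, agents: List[Agent]) -> State:
    """Pass a shared state dict through each agent in order."""
    state: State = {"question": question}
    for agent in agents:
        state = agent(state)
    return state


# Toy agents (hypothetical stand-ins for QuestionAgent and NormalizeAgent).
def toy_question_agent(state: State) -> State:
    # Naively treat the last word of the question as the entity.
    words = state["question"].rstrip("?").split()
    return {**state, "entities": [words[-1].lower()]}


def toy_normalize_agent(state: State) -> State:
    # Deduplicate and sort; a real normalizer would map to canonical terms.
    return {**state, "normalized_entities": sorted(set(state["entities"]))}


demo_state = run_pipeline("What is asthma?", [toy_question_agent, toy_normalize_agent])
print(demo_state)
```

Each agent only adds keys, so later steps can rely on everything earlier steps produced, which is the same shape as the `result` dictionary inspected later in this notebook.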

## What You'll Learn

1. How entity extraction works (SciSpaCy)
2. How entity normalization improves retrieval
3. How Wikipedia provides general medical knowledge
4. How vector search retrieves PubMed evidence
5. How the fine-tuned LLM generates answers
6. How all evidence is combined in the final explanation

## Setup

In [1]:
import sys
sys.path.append('..')

from agents import AgentGraphPipeline
import json



## Step 1: Initialize the Pipeline

This loads all 7 agents:
1. QuestionAgent (SciSpaCy NER)
2. NormalizeAgent (Fuzzy matching)
3. WikipediaAgent (General knowledge)
4. RetrieverAgent (ChromaDB vector search)
5. QAModelAgent (Fine-tuned Qwen2.5-1.5B)
6. EvidenceAgent (Format citations)
7. ExplanationAgent (Compile final output)

In [2]:
# Initialize pipeline (loads all models)
pipeline = AgentGraphPipeline()

[Pipeline] Initializing agents...

[QuestionAgent] Loading SciSpaCy biomedical NER...




[NormalizeAgent] Loading canonical entity mappings...
[NormalizeAgent] Loaded 15723 canonical biomedical entities.

[WikipediaAgent] Initialized for medical article retrieval

[RetrieverAgent] Loading ChromaDB & embedder...
[RetrieverAgent] Connecting to ChromaDB at: /Users/dhruvyellanki/Documents/Projects/BioGraphX/data/chroma
[RetrieverAgent] Found existing pubmed_index collection
[QAModelAgent] Loading fine-tuned Qwen...
[QAModelAgent] Local model not found, using fallback: Qwen/Qwen2.5-1.5B-Instruct


`torch_dtype` is deprecated! Use `dtype` instead!
Some parameters are on the meta device because they were offloaded to the disk.


[QAModelAgent] Model loaded successfully!


## Step 2: Ask a Biomedical Question

In [3]:
# Example question
question = "What is asthma?"
print(f"Question: {question}")
print("\n" + "="*60 + "\n")

Question: What is asthma?




## Step 3: Run the Complete Pipeline

In [4]:
# Run all agents sequentially
result = pipeline.run(question)

[QuestionAgent] Extracted raw entities: ['asthma']
[NormalizeAgent] Fuzzy-normalizing entities:
   Raw extracted entities: ['asthma']
   → 'asthma' → 'asthma' (score=100.0)
[WikipediaAgent] Searching Wikipedia for: ['asthma']
[WikipediaAgent] ✓ Found: Asthma
[WikipediaAgent] Retrieved 1 Wikipedia articles



The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


[RetrieverAgent] Retrieved 10 evidence sentences.



## Step 4: Inspect Agent Outputs

Let's see what each agent contributed to the final answer.

### Agent 1: QuestionAgent (Entity Extraction)

In [5]:
print("Extracted Entities:")
print(json.dumps(result.get('entities', []), indent=2))

print("\nEntity Types: DISEASE, CHEMICAL")
print("Model: SciSpaCy en_ner_bc5cdr_md")

Extracted Entities:
[
  "asthma"
]

Entity Types: DISEASE, CHEMICAL
Model: SciSpaCy en_ner_bc5cdr_md


### Agent 2: NormalizeAgent (Fuzzy Matching)

In [6]:
print("Normalized Entities:")
print(json.dumps(result.get('normalized_entities', []), indent=2))

print("\nPurpose: Map variations to canonical forms")
print("Example: 'diabete' - 'diabetes mellitus'")

Normalized Entities:
[
  "asthma"
]

Purpose: Map variations to canonical forms
Example: 'diabete' - 'diabetes mellitus'
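
As a rough illustration of fuzzy normalization, the sketch below scores a raw entity against a small vocabulary with `difflib.SequenceMatcher` from the standard library. This is a stand-in: the notebook does not say which matcher NormalizeAgent uses, and the `normalize` function, the three-term vocabulary, and the cutoff of 50 are assumptions made for the example.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def normalize(entity: str, vocabulary: List[str], cutoff: float = 50.0) -> Tuple[str, float]:
    """Return (best canonical term, similarity score on a 0-100 scale).

    Falls back to the raw entity when no term reaches the cutoff.
    """
    best_term, best_score = entity, 0.0
    for term in vocabulary:
        score = 100.0 * SequenceMatcher(None, entity.lower(), term.lower()).ratio()
        if score > best_score:
            best_term, best_score = term, score
    if best_score < cutoff:
        return entity, best_score
    return best_term, best_score


# Toy vocabulary, not the real list of canonical entities.
vocab = ["asthma", "diabetes mellitus", "hypertension"]
print(normalize("asthma", vocab))
print(normalize("diabete", vocab))
```

An exact match scores 100, as in the log line for `'asthma'` above, while a misspelling such as `'diabete'` still resolves to the closest canonical term as long as it clears the cutoff.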


### Agent 3: WikipediaAgent (General Knowledge)

In [7]:
print("Wikipedia Evidence:")
wiki_evidence = result.get('wikipedia_evidence', [])

for i, wiki in enumerate(wiki_evidence, 1):
    print(f"\n{i}. {wiki['title']}")
    print(f"   URL: {wiki['url']}")
    print(f"   Summary: {wiki['summary'][:200]}...")

Wikipedia Evidence:

1. Asthma
   URL: https://en.wikipedia.org/wiki/Asthma
   Summary: Asthma is a common long-term inflammatory disease of the bronchioles of the lungs. It is characterized by variable and recurring symptoms, reversible airflow obstruction, and easily triggered bronchos...
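
One way to obtain records with `title`, `url`, and `summary` fields like those printed above is the Wikimedia REST page-summary endpoint. The sketch below is an assumption about how such a lookup could work, not a description of WikipediaAgent's internals; `summary_url`, `parse_summary`, and `fetch_summary` are hypothetical helpers, and the User-Agent string is a placeholder.

```python
import json
from urllib.parse import quote
from urllib.request import Request, urlopen


def summary_url(title: str) -> str:
    """Build the REST summary URL for an English Wikipedia page title."""
    slug = quote(title.replace(" ", "_"), safe="")
    return "https://en.wikipedia.org/api/rest_v1/page/summary/" + slug


def parse_summary(payload: dict) -> dict:
    """Reduce the JSON payload to the three fields the notebook prints."""
    return {
        "title": payload.get("title", ""),
        "url": payload.get("content_urls", {}).get("desktop", {}).get("page", ""),
        "summary": payload.get("extract", ""),
    }


def fetch_summary(title: str, timeout: float = 10.0) -> dict:
    """Fetch and parse one page summary (requires network access)."""
    request = Request(summary_url(title), headers={"User-Agent": "rag-demo-sketch/0.1"})
    with urlopen(request, timeout=timeout) as response:
        return parse_summary(json.load(response))


print(summary_url("Asthma"))
```

Keeping the parsing separate from the network call makes the field mapping easy to check offline with a saved payload.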


### Agent 4: RetrieverAgent (PubMed Vector Search)

In [8]:
print("PubMed Evidence (Top 5):")
evidence = result.get('evidence', [])

# Show only the first five hits so the listing matches the "Top 5" label
for i, ev in enumerate(evidence[:5], 1):
    print(f"\n{i}. PMID: {ev.get('pmid')}")
    print(f"   Sentence: {ev.get('sentence', '')[:150]}...")
    if 'score' in ev:
        print(f"   Similarity Score: {ev['score']:.3f}")

PubMed Evidence (Top 5):

1. PMID: deep_learning_1683
   Sentence: Asthma is a syndrome composed of heterogeneous disease entities....

2. PMID: deep_learning_3700
   Sentence: We focus on a specific syndrome-asthma/difficulty breathing....

3. PMID: human_connectome_2142
   Sentence: Brain functional deficits had been reported in asthma patients....

4. PMID: deep_learning_2215
   Sentence: Respiratory ailments afflict a wide range of people and manifests itself through conditions like asthma and sleep apnea....

5. PMID: covid_19_2426
   Sentence: In children, allergy and asthma are among the most prevalent non-communicable chronic diseases, and health care providers taking care of these patient...
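
The retrieval step ranks stored sentences by their similarity to the query. The sketch below shows that ranking idea with cosine similarity in plain Python. To stay self-contained it uses word-count vectors as a stand-in for dense MPNet embeddings and an in-memory list instead of ChromaDB; `embed`, `cosine`, and `top_k` are hypothetical names, and the two corpus entries are copied from the evidence shown in this notebook.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (stand-in for a dense sentence embedding)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_k(query: str, corpus: List[Dict[str, str]], k: int = 5) -> List[Tuple[float, Dict[str, str]]]:
    """Return the k corpus entries most similar to the query, best first."""
    query_vec = embed(query)
    scored = [(cosine(query_vec, embed(doc["sentence"])), doc) for doc in corpus]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


corpus = [
    {"pmid": "deep_learning_1683", "sentence": "Asthma is a syndrome composed of heterogeneous disease entities."},
    {"pmid": "deep_learning_7317", "sentence": "Diabetes has become one of the biggest health problems in the world."},
]
for score, doc in top_k("What is asthma?", corpus, k=1):
    print(f"{doc['pmid']}  score={score:.3f}")
```

A vector database performs the same ranking over precomputed embeddings, typically with an approximate index so that it does not have to score every stored sentence.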

### Agent 5: QAModelAgent (Answer Generation)

In [9]:
print("Generated Answer:")
print("="*60)
print(result.get('answer', 'No answer generated'))
print("="*60)

print("\nModel: Qwen2.5-1.5B (fine-tuned on 14K medical Q&A pairs)")
print("Method: ReAct prompting with evidence")

Generated Answer:
Asthma is a syndrome composed of heterogeneous disease entities characterized by airway inflammation, bronchial hyperresponsiveness, and variable airflow obstruction. It can manifest as difficulty breathing or wheezing during inhalation and exhalation. Asthma affects various age groups but is more common in children and young adults. The condition involves both genetic predispositions and environmental factors such as allergens and infections. Treatment typically includes medications aimed at reducing inflammation and controlling symptoms. Recent studies have also highlighted the role of immune responses, particularly those involving T-helper cells, in asthma pathogenesis. Understanding the heterogeneity of asthma helps in developing personalized treatment strategies tailored to individual patient needs.

Model: Qwen2.5-1.5B (fine-tuned on 14K medical Q&A pairs)
Method: ReAct prompting with evidence
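
To make the idea of prompting with evidence concrete, the sketch below assembles one prompt string from a question, Wikipedia summaries, and PubMed sentences. The template wording and the `build_prompt` helper are assumptions for illustration; the notebook does not show the actual prompt that QAModelAgent sends to the model.

```python
from typing import Dict, List


def build_prompt(question: str, wiki: List[Dict[str, str]], evidence: List[Dict[str, str]]) -> str:
    """Combine the question with retrieved context into one prompt string."""
    lines = ["Answer the question using only the evidence below.", "", "Background:"]
    lines += [f"- {w['title']}: {w['summary']}" for w in wiki]
    lines += ["", "Evidence:"]
    lines += [f"- [{e['pmid']}] {e['sentence']}" for e in evidence]
    lines += ["", f"Question: {question}", "Answer:"]
    return "\n".join(lines)


prompt = build_prompt(
    "What is asthma?",
    wiki=[{"title": "Asthma", "summary": "Asthma is a common long-term inflammatory disease of the bronchioles of the lungs."}],
    evidence=[{"pmid": "deep_learning_1683", "sentence": "Asthma is a syndrome composed of heterogeneous disease entities."}],
)
print(prompt)
```

Tagging each sentence with its identifier in the prompt is what lets the generated answer be traced back to specific evidence afterwards.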


### Agents 6 & 7: EvidenceAgent + ExplanationAgent (Final Output)

In [10]:
from IPython.display import Markdown, display

# Display the complete explanation in markdown format
explanation = result.get('explanation', 'No explanation available')
display(Markdown(explanation))


### Final Answer
Asthma is a syndrome composed of heterogeneous disease entities characterized by airway inflammation, bronchial hyperresponsiveness, and variable airflow obstruction. It can manifest as difficulty breathing or wheezing during inhalation and exhalation. Asthma affects various age groups but is more common in children and young adults. The condition involves both genetic predispositions and environmental factors such as allergens and infections. Treatment typically includes medications aimed at reducing inflammation and controlling symptoms. Recent studies have also highlighted the role of immune responses, particularly those involving T-helper cells, in asthma pathogenesis. Understanding the heterogeneity of asthma helps in developing personalized treatment strategies tailored to individual patient needs.

---

### Wikipedia Knowledge (General Medical Information)
- **[Asthma](https://en.wikipedia.org/wiki/Asthma)** — Asthma is a common long-term inflammatory disease of the bronchioles of the lungs. It is characterized by variable and recurring symptoms, reversible airflow obstruction, and easily triggered bronchospasms. Symptoms include episodes of wheezing, coughing, chest tightness, and shortness of breath.

---

### Supporting Evidence (PubMed Research Literature)
- **PMID deep_learning_1683** — Asthma is a syndrome composed of heterogeneous disease entities.
- **PMID deep_learning_3700** — We focus on a specific syndrome-asthma/difficulty breathing.
- **PMID human_connectome_2142** — Brain functional deficits had been reported in asthma patients.
- **PMID deep_learning_2215** — Respiratory ailments afflict a wide range of people and manifests itself through conditions like asthma and sleep apnea.
- **PMID covid_19_2426** — In children, allergy and asthma are among the most prevalent non-communicable chronic diseases, and health care providers taking care of these patients need guidance.
- **PMID deep_learning_5498** — Although the complex disease of asthma has been defined as being heterogeneous, the extent of its endophenotypes remains unclear.
- **PMID deep_learning_5498** — The introduction of antibody therapies targeting the Type 2 inflammation pathway for patients with severe asthma has resulted in the recognition of an allergic and an eosinophilic phenotype, which are not mutually exclusive.
- **PMID deep_learning_4485** — The purpose of our present study was to develop a forecasting method that would help asthmatic individuals to take evasive action when the probability of an attack was at THEIR PERSONAL THRESHOLD levels.
- **PMID deep_learning_1683** — Although it is agreed that proper asthma endo-typing and appropriate type-specific interventions are crucial in the management of asthma, little data are available regarding pediatric asthma.
- **PMID covid_19_979** — Asthma is increasingly recognized as an underlying risk factor for severe respiratory disease in coronavirus disease 2019 (COVID-19) patients, particularly in the United States.

---

### Reasoning Process Summary
The system produced this answer by:
1. **Extracting biomedical entities** using SciSpaCy  
2. **Normalizing entities** using fuzzy matching against canonical vocabulary  
3. **Retrieving Wikipedia articles** for general medical knowledge
4. **Retrieving PubMed evidence** using ChromaDB (MPNet embeddings)  
5. **Combining all evidence** inside the Qwen-based reasoning model  
6. **Generating a grounded biomedical explanation** with citations

This explanation block is included for transparency, debugging, and evaluation.
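
The rendered block above combines the answer with linked citations. A minimal sketch of how such markdown could be compiled is shown below; `format_explanation` is a hypothetical helper and its section titles are simplified, so it should not be read as the ExplanationAgent's actual code.

```python
from typing import Dict, List


def format_explanation(answer: str, wiki: List[Dict[str, str]], evidence: List[Dict[str, str]]) -> str:
    """Render the answer, Wikipedia links, and PubMed sentences as one markdown string."""
    parts = ["### Final Answer", answer, "", "---", "", "### Wikipedia Knowledge"]
    parts += [f"- **[{w['title']}]({w['url']})**: {w['summary']}" for w in wiki]
    parts += ["", "---", "", "### Supporting Evidence"]
    parts += [f"- **PMID {e['pmid']}**: {e['sentence']}" for e in evidence]
    return "\n".join(parts)


markdown_text = format_explanation(
    "Asthma is a syndrome composed of heterogeneous disease entities.",
    wiki=[{"title": "Asthma", "url": "https://en.wikipedia.org/wiki/Asthma", "summary": "Asthma is a common long-term inflammatory disease."}],
    evidence=[{"pmid": "deep_learning_1683", "sentence": "Asthma is a syndrome composed of heterogeneous disease entities."}],
)
print(markdown_text)
```

The resulting string can be passed to `Markdown(...)` for display, as the cell above does with `result['explanation']`.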


## Step 5: Try Your Own Questions

Test the pipeline with different biomedical questions!

In [11]:
# Try different questions
test_questions = [
    "What is diabetes?",
    "What are the symptoms of COVID-19?",
    "How does aspirin work?",
    "What causes hypertension?"
]

# Pick one or enter your own
my_question = test_questions[0]  # Change index or replace with your question

print(f"Question: {my_question}\n")
result = pipeline.run(my_question)

# Display answer
display(Markdown(result.get('explanation', '')))

Question: What is diabetes?

[QuestionAgent] Extracted raw entities: ['diabetes']
[NormalizeAgent] Fuzzy-normalizing entities:
   Raw extracted entities: ['diabetes']
   → 'diabetes' → 'diabetes' (score=100.0)
[WikipediaAgent] Searching Wikipedia for: ['diabetes']
[WikipediaAgent] ✓ Found: Diabetes
[WikipediaAgent] Retrieved 1 Wikipedia articles

[RetrieverAgent] Retrieved 10 evidence sentences.




### Final Answer
Diabetes is a metabolic disorder characterized by high blood glucose levels, often resulting from either insufficient insulin production or impaired insulin action. It can lead to various health complications such as kidney damage, cardiovascular issues, nerve damage, and vision loss if left untreated. Proper management through medication, diet, exercise, and lifestyle changes is crucial to minimize these risks. The disorder affects millions worldwide, particularly in developing nations, where it poses significant challenges to public health systems.

---

### Wikipedia Knowledge (General Medical Information)
- **[Diabetes](https://en.wikipedia.org/wiki/Diabetes)** — Diabetes mellitus, commonly known as diabetes, is a group of common endocrine diseases characterized by sustained high blood sugar levels. Diabetes is due to either the pancreas not producing enough of the hormone insulin or the cells of the body becoming unresponsive to insulin's effects. Classic symptoms include the three Ps: polydipsia (excessive thirst), polyuria (excessive urination), polyphagia (excessive hunger), weight loss, and blurred vision.

---

### Supporting Evidence (PubMed Research Literature)
- **PMID deep_learning_5395** — Diabetes, a metabolic disorder due to high blood glycemic index in the human body.
- **PMID deep_learning_7317** — Diabetes has become one of the biggest health problems in the world.
- **PMID deep_learning_11130** — According to the World Health Organization (WHO), Diabetes Mellitus (DM) is one of the most prevalent diseases in the world.
- **PMID virtual_reality_208** — Diabetes is a major preventable cause of costly and debilitating renal failure, heart disease, lower limb amputation, and avoidable blindness.
- **PMID covid_19_6364** — Diabetes was likely to be associated with mortality.
- **PMID deep_learning_1806** — Diabetes occurs due to the excess of glucose in the blood that may affect many organs of the body.
- **PMID deep_learning_1264** — Diabetes is responsible for considerable morbidity, healthcare utilisation and mortality in both developed and developing countries.
- **PMID deep_learning_11211** — Diabetes is a global public health disease projected to affect 642 million adults by 2040, with about 75% residing in low- and middle-income countries.
- **PMID deep_learning_7498** — Diabetes mellitus (DM) is a metabolic disorder that causes abnormal blood glucose (BG) regulation that might result in short and long-term health complications and even death if not properly managed.
- **PMID virtual_reality_5054** — Diabetes is a chronic, complex condition requiring sound knowledge and self-management skills to optimize glycemic control and health outcomes.

---

### Reasoning Process Summary
The system produced this answer by:
1. **Extracting biomedical entities** using SciSpaCy  
2. **Normalizing entities** using fuzzy matching against canonical vocabulary  
3. **Retrieving Wikipedia articles** for general medical knowledge
4. **Retrieving PubMed evidence** using ChromaDB (MPNet embeddings)  
5. **Combining all evidence** inside the Qwen-based reasoning model  
6. **Generating a grounded biomedical explanation** with citations

This explanation block is included for transparency, debugging, and evaluation.
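
To try several questions in one go, a small loop that records either the result or the error for each question can be convenient. The sketch below assumes only that the pipeline exposes a callable taking a question string and returning a dictionary, as `pipeline.run` does in this notebook; `run_batch` and the `fake_run` stub are hypothetical and exist so the example runs without loading any models.

```python
from typing import Callable, Dict, List


def run_batch(run: Callable[[str], Dict], questions: List[str]) -> Dict[str, Dict]:
    """Run each question and record either the result or the error message."""
    results: Dict[str, Dict] = {}
    for q in questions:
        try:
            results[q] = run(q)
        except Exception as exc:  # keep going if one question fails
            results[q] = {"error": str(exc)}
    return results


# Stub standing in for pipeline.run.
def fake_run(question: str) -> Dict:
    if not question.strip():
        raise ValueError("empty question")
    return {"answer": f"(stub answer for: {question})"}


batch = run_batch(fake_run, ["What is diabetes?", ""])
for q, res in batch.items():
    print(repr(q), "->", res)
```

In the notebook itself, `run_batch(pipeline.run, test_questions)` would apply the same loop to the real pipeline.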


## Understanding the RAG Pipeline

### Why RAG?

**Traditional LLM**: Relies solely on its training data, which can be outdated, and may hallucinate unsupported facts.

**RAG System**:
1. **Retrieves** relevant evidence from PubMed and Wikipedia
2. **Augments** the question with that evidence
3. **Generates** an answer grounded in the retrieved text

### Benefits:
- **Factual**: Grounded in research papers
- **Current**: Can add new papers without retraining
- **Verifiable**: Provides PMIDs and URLs
- **Transparent**: Shows evidence used

### Key Components:

1. **Vector Database (ChromaDB)**: 298,152 PubMed sentences embedded
2. **Embedding Model**: sentence-transformers/all-mpnet-base-v2 (768-dim)
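3. **Fine-tuned LLM**: Qwen2.5-1.5B trained on medical Q&A (in the run logged above, the local fine-tuned model was not found and the pipeline fell back to `Qwen/Qwen2.5-1.5B-Instruct`)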
4. **Wikipedia**: General medical knowledge
5. **Entity Processing**: SciSpaCy NER + fuzzy normalization

## Performance Metrics

In [12]:
import time

# Measure end-to-end pipeline latency with a monotonic, high-resolution clock
test_question = "What is diabetes?"

start_time = time.perf_counter()
result = pipeline.run(test_question)
end_time = time.perf_counter()

latency = end_time - start_time

print(f"Total Latency: {latency:.2f} seconds")

[QuestionAgent] Extracted raw entities: ['diabetes']
[NormalizeAgent] Fuzzy-normalizing entities:
   Raw extracted entities: ['diabetes']
   → 'diabetes' → 'diabetes' (score=100.0)
[WikipediaAgent] Searching Wikipedia for: ['diabetes']
[WikipediaAgent] ✓ Found: Diabetes
[WikipediaAgent] Retrieved 1 Wikipedia articles

[RetrieverAgent] (cache hit) Retrieved 10 evidence.

Total Latency: 14.88 seconds
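
The measurement above is a single run, and the log shows that the retriever answered from its cache, so it reflects a warm retrieval step. Timing several consecutive calls helps separate the first (cold) call from later ones. The sketch below is one way to do that under the assumption that the workload is a zero-argument callable; `time_calls` is a hypothetical helper and the summation is only a stand-in workload.

```python
import statistics
import time
from typing import Callable, List


def time_calls(fn: Callable[[], object], repeats: int = 3) -> List[float]:
    """Return wall-clock durations in seconds for `repeats` consecutive calls."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


# Stand-in workload; in the notebook this could be: lambda: pipeline.run(test_question)
durations = time_calls(lambda: sum(range(100_000)), repeats=3)
print(f"first call: {durations[0]:.4f}s, median: {statistics.median(durations):.4f}s")
```

Reporting the first call separately from the median makes the effect of caching visible instead of hiding it in one number.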
