# Advanced RAG with LlamaIndex - Medical Literature Demo

The first code cell installs the libraries used by our medical RAG system:
- **llama-index**: Core framework for building RAG applications
- **llama-index-embeddings-huggingface**: To use medical-specific embedding models
- **llama-index-llms-openai**: For GPT integration
- **llama-index-vector-stores-qdrant**: Qdrant vector database integration
- **sentence-transformers**: For creating embeddings
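- **ragatouille**: For ColBERT reranking (listed for reference; the install commands below do not include it, and the reranking step uses a sentence-transformers cross-encoder instead)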
- **qdrant-client**: Client for Qdrant vector database

We're using Qdrant as our vector store because:
1. It runs in-process in Colab (`QdrantClient(":memory:")`), so no external server is required
2. Optimized for vector similarity search
3. Supports metadata filtering
4. Better performance than in-memory stores for larger datasets


In [1]:
!pip install -q llama-index llama-index-embeddings-huggingface llama-index-llms-openai
!pip install -q llama-index-vector-stores-qdrant qdrant-client
!pip install -q pypdf arxiv pubmed-parser biopython
!pip install -q sentence-transformers
!pip install -q gradio plotly pandas

import os
import openai
from getpass import getpass

In [2]:
! pip install llama-index==0.10.57 llama-index-vector-stores-qdrant
! pip show llama-index

Name: llama-index
Version: 0.10.57
Summary: Interface between LLMs and your data
Home-page: https://llamaindex.ai
Author: Jerry Liu
Author-email: jerry@llamaindex.ai
License: MIT
Location: /usr/local/lib/python3.11/dist-packages
Requires: llama-index-agent-openai, llama-index-cli, llama-index-core, llama-index-embeddings-openai, llama-index-indices-managed-llama-cloud, llama-index-legacy, llama-index-llms-openai, llama-index-multi-modal-llms-openai, llama-index-program-openai, llama-index-question-gen-openai, llama-index-readers-file, llama-index-readers-llama-parse
Required-by: 


Securely collects your OpenAI API key using getpass (hides input).
The API key is needed for:
- Using GPT models for answer generation
- Creating hypothetical documents (HyDE technique)
- Query understanding and routing

Using getpass ensures your API key isn't visible in the notebook.
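
A minimal sketch of a variation on this step, assuming the notebook is sometimes run where `OPENAI_API_KEY` is already set: reuse the existing value and prompt only when it is missing. The helper name `load_api_key` and its `prompt_fn` parameter are hypothetical, not part of any library.

```python
import os
from getpass import getpass


def load_api_key(env_var="OPENAI_API_KEY", prompt_fn=getpass):
    """Return the key from the environment, prompting only if it is absent.

    `load_api_key` and `prompt_fn` are illustrative names, not a library API.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        # getpass hides the typed characters, so the key never appears in cell output
        key = prompt_fn(f"Enter your {env_var}: ").strip()
        os.environ[env_var] = key
    return key
```

Calling `load_api_key()` in place of the direct `getpass` call would leave an already-configured key untouched.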


In [3]:
import os
import openai
from getpass import getpass
openai_api_key = getpass("Enter your OpenAI API Key: ")
os.environ["OPENAI_API_KEY"] = openai_api_key

Enter your OpenAI API Key: ··········


Creates sample medical research papers for our demo. In production, you'd load real papers.

The papers cover:
1. **AI in Diabetes**: Shows how AI improves patient outcomes
2. **COVID Vaccine Effectiveness**: Real-world vaccine performance data
3. **Digital Mental Health**: Efficacy of app-based interventions

Each paper has standard sections (Abstract, Methods, Results, etc.) which we'll use
for section-aware chunking - a key technique for better retrieval from structured documents.

In [4]:

import requests
import json
from pathlib import Path

# Create data directory
Path("./medical_papers").mkdir(exist_ok=True)

# Sample medical papers (synthetic text written for this demo, not real publications)
sample_papers = [
    {
        "title": "diabetes_management_ai.txt",
        "content": """
        Title: AI-Driven Approaches in Diabetes Management: A Systematic Review

        Abstract: This systematic review examines the application of artificial intelligence in diabetes management,
        analyzing 127 studies from 2018-2023. Key findings include improved glycemic control through predictive
        algorithms (mean HbA1c reduction of 0.8%), enhanced patient engagement via chatbots (42% increase in
        medication adherence), and early complication detection using retinal imaging analysis (sensitivity 94.2%).

        Methods: We searched PubMed, MEDLINE, and IEEE Xplore for studies implementing AI solutions in diabetes care.
        Inclusion criteria required randomized controlled trials or cohort studies with minimum 6-month follow-up.
        Machine learning approaches were categorized into supervised learning for glucose prediction, reinforcement
        learning for insulin dosing, and deep learning for complication screening.

        Results: Continuous glucose monitoring (CGM) data combined with neural networks achieved 87% accuracy in
        predicting hypoglycemic events 30 minutes in advance. Mobile applications using natural language processing
        showed significant improvements in dietary logging compliance (68% vs 23% traditional methods). Computer
        vision analysis of fundus photographs detected diabetic retinopathy with comparable accuracy to specialists.

        Conclusion: AI technologies demonstrate substantial promise in personalizing diabetes management. However,
        challenges remain in data standardization, algorithm interpretability, and equitable access across diverse
        populations. Future research should focus on prospective validation and real-world implementation studies.
        """
    },
    {
        "title": "covid19_vaccine_effectiveness.txt",
        "content": """
        Title: Real-World Effectiveness of COVID-19 Vaccines: A Multi-Country Analysis

        Abstract: This observational study analyzed vaccine effectiveness across 8 countries covering 125 million
        individuals. Primary outcomes included infection prevention, hospitalization rates, and severe disease.
        Two-dose mRNA vaccines showed 91% effectiveness against hospitalization, declining to 78% after 6 months.
        Booster doses restored effectiveness to 94%.

        Introduction: The rapid development of COVID-19 vaccines necessitated ongoing real-world effectiveness
        monitoring. This study leveraged electronic health records from participating healthcare systems to assess
        vaccine performance across diverse populations and viral variants.

        Methods: Retrospective cohort analysis using propensity score matching compared vaccinated and unvaccinated
        individuals. Cox proportional hazards models adjusted for age, comorbidities, and socioeconomic factors.
        Variant-specific analyses used genomic surveillance data. Waning immunity assessed through time-varying
        effectiveness calculations.

        Results: Among fully vaccinated individuals, breakthrough infection rate was 3.2 per 1000 person-months.
        Vaccine effectiveness against Delta variant: 88% for infection, 96% for severe disease. Omicron variant
        showed reduced effectiveness: 64% for infection, 89% for severe disease. Age-stratified analysis revealed
        lower effectiveness in >75 years (82% vs 93% in 18-64 years). Immunocompromised populations showed
        significantly reduced responses (effectiveness 71%).

        Discussion: Findings support continued booster recommendations, especially for vulnerable populations.
        Variant-specific effectiveness highlights importance of updated vaccine formulations. Study limitations
        include potential unmeasured confounders and varying testing practices across regions.
        """
    },
    {
        "title": "mental_health_digital_interventions.txt",
        "content": """
        Title: Digital Mental Health Interventions: Efficacy and Implementation Challenges

        Abstract: Meta-analysis of 89 randomized controlled trials (n=15,492) examining digital mental health
        interventions. Cognitive behavioral therapy apps showed moderate effect sizes for depression (d=0.54)
        and anxiety (d=0.48). Engagement remained primary challenge with 68% dropout rates by week 8.

        Background: Rising mental health needs and provider shortages drive interest in scalable digital solutions.
        This review synthesizes evidence on smartphone apps, web platforms, and virtual reality interventions for
        common mental health conditions.

        Methodology: Systematic search of databases through December 2023. Included studies required validated
        clinical assessments, minimum 4-week interventions, and control groups. Random-effects models calculated
        pooled effect sizes. Subgroup analyses examined modality, condition, and user characteristics.

        Key Findings: Self-guided interventions showed smaller effects than therapist-supported programs (d=0.31 vs
        d=0.76). Gamification elements improved engagement by 34%. Younger users (<30 years) demonstrated better
        outcomes. Virtual reality exposure therapy for phobias achieved large effect sizes (d=1.12). Text-based
        chatbots reduced suicidal ideation in crisis situations (OR=0.64). However, cultural adaptation remained
        limited with 78% of studies from Western populations.

        Implementation Barriers: Privacy concerns affected 43% of potential users. Integration with existing care
        systems proved challenging. Clinician training needs identified as critical factor. Reimbursement models
        lagged behind technology development.

        Recommendations: Future interventions should prioritize user engagement strategies, cultural competence,
        and seamless clinical integration. Regulatory frameworks need updating to ensure quality while enabling
        innovation. Long-term effectiveness studies beyond 6 months urgently needed.
        """
    }
]

# Save sample papers
for paper in sample_papers:
    with open(f"./medical_papers/{paper['title']}", 'w') as f:
        f.write(paper['content'])

print("✓ Sample medical papers created")

✓ Sample medical papers created


Sets up the core components of our RAG system:

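1. **Embedding Model**: Uses a BioBERT-based sentence-embedding model instead of a generic one
   - Better suited to medical terminology
   - BioBERT is pretrained on biomedical literature (PubMed abstracts and PMC articles); this checkpoint is further fine-tuned on NLI and STS datasets, including MedNLI, as its name indicates
   
2. **LLM Configuration**: `gpt-3.5-turbo` with a low temperature (0.1) to keep answers consistent and close to the retrieved text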

3. **Document Loading**: Reads papers and extracts metadata (title, type)

The embedding model is crucial - medical terms like "myocardial infarction" and
"heart attack" should have similar embeddings, which generic models might miss.

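To make "similar embeddings" concrete: closeness between two embedding vectors is usually measured with cosine similarity. The sketch below uses made-up 3-dimensional vectors purely for illustration; real embeddings from the model have hundreds of dimensions and their values will differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors (hypothetical values, NOT real model output)
myocardial_infarction = [0.9, 0.1, 0.3]
heart_attack = [0.8, 0.2, 0.35]
knee_fracture = [0.1, 0.9, 0.2]

print(cosine_similarity(myocardial_infarction, heart_attack))   # close to 1
print(cosine_similarity(myocardial_infarction, knee_fracture))  # noticeably lower
```

With the real model, the same comparison can be run on vectors returned by `embed_model.get_text_embedding(...)`.
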
In [5]:
from llama_index.core import Document, VectorStoreIndex  # ServiceContext is deprecated and unused here
from llama_index.core.node_parser import SentenceSplitter, SemanticSplitterNodeParser
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.openai import OpenAI
from llama_index.core import Settings
import re

# Initialize embedding model (using medical-specialized model)
embed_model = HuggingFaceEmbedding(
    model_name="pritamdeka/BioBERT-mnli-snli-scinli-scitail-mednli-stsb",
    cache_folder="./embedding_cache"
)

# Initialize LLM
llm = OpenAI(model="gpt-3.5-turbo", temperature=0.1)

# Configure global settings
Settings.llm = llm
Settings.embed_model = embed_model

# Load documents
documents = []
for filename in Path("./medical_papers").glob("*.txt"):
    with open(filename, 'r') as f:
        content = f.read()
        # Extract metadata from content
        title_match = re.search(r'Title: (.+)', content)
        title = title_match.group(1) if title_match else filename.stem

        doc = Document(
            text=content,
            metadata={
                "filename": filename.name,
                "title": title,
                "doc_type": "research_paper"
            }
        )
        documents.append(doc)

print(f"✓ Loaded {len(documents)} documents")


✓ Loaded 3 documents


Defines two chunking strategies (only the section-aware one is applied to the documents below):

1. **Semantic Chunking**: Splits text at natural meaning boundaries
   - Uses embedding similarity to find good split points
   - Prevents breaking up related concepts
   
2. **Section-Aware Chunking**: Preserves paper structure
   - Keeps Abstract, Methods, Results separate
   - Maintains context within sections
   - Adds section metadata for better filtering

Why this matters: A chunk about "vaccine effectiveness" from the Results section
is different from one in the Methods section. Section-aware chunking preserves
this context for more accurate retrieval.
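
The idea can be sketched with the standard library alone, assuming headings appear as `Name:` at the start of a line, as in the sample papers. The function name `split_sections` and the example text are hypothetical; the notebook's own parser follows in the next cell.

```python
import re

SECTION_NAMES = ["Abstract", "Introduction", "Methods", "Results", "Discussion", "Conclusion"]


def split_sections(text, names=SECTION_NAMES):
    """Return {section_name: body} for headings written as 'Name:' at line start."""
    alternation = "|".join(re.escape(n) for n in names)
    heading = re.compile(rf"^\s*({alternation}):[ \t]*", re.MULTILINE)
    matches = list(heading.finditer(text))
    sections = {}
    for i, m in enumerate(matches):
        # A section body runs until the next recognized heading (or end of text)
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        body = text[m.end():end]
        # Collapse the indentation and line breaks of the triple-quoted source
        sections[m.group(1)] = " ".join(body.split())
    return sections


paper = """
    Title: Example Paper

    Abstract: Short summary of the work.

    Methods: Retrospective cohort analysis.

    Results: Effectiveness was 64% for infection.
"""
for name, body in split_sections(paper).items():
    print(f"[{name}] {body}")
```

Because headings must start a line and end with a colon, an in-sentence word such as "traditional methods" is not mistaken for a section boundary. Headings outside the known list (for example "Background" or "Key Findings") are not split out and stay inside the preceding section.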


In [6]:
# Method 1: Semantic chunking (preserves meaning boundaries)
from llama_index.core.node_parser import SemanticSplitterNodeParser

semantic_splitter = SemanticSplitterNodeParser(
    buffer_size=1,  # sentences to include before/after for context
    breakpoint_percentile_threshold=95,  # higher = fewer splits
    embed_model=embed_model
)

# Method 2: Section-aware chunking for medical papers
class MedicalPaperParser:
    def __init__(self, chunk_size=512, chunk_overlap=50):
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        self.sections = ['Abstract', 'Introduction', 'Methods', 'Results', 'Discussion', 'Conclusion']

    def parse(self, document):
        text = document.text
        nodes = []

        # Extract sections
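        for i, section in enumerate(self.sections):
            # Stop at the next known heading ("Name:" at the start of a line) or at end of text.
            # Grouping the alternatives keeps the ":" attached to every heading name, and the
            # last section (no later headings) simply runs to the end.
            later = "|".join(self.sections[i + 1:])
            stop = rf'(?=^\s*(?:{later}):|\Z)' if later else r'(?=\Z)'
            pattern = rf'^\s*{section}:\s*(.+?){stop}'
            match = re.search(pattern, text, re.DOTALL | re.IGNORECASE | re.MULTILINE)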

            if match:
                section_text = match.group(1).strip()
                # Create chunks within sections
                splitter = SentenceSplitter(
                    chunk_size=self.chunk_size,
                    chunk_overlap=self.chunk_overlap
                )
                section_nodes = splitter.get_nodes_from_documents(
                    [Document(text=section_text)]
                )

                # Add section metadata
                for node in section_nodes:
                    node.metadata.update({
                        "section": section,
                        "filename": document.metadata.get("filename"),
                        "title": document.metadata.get("title")
                    })
                nodes.extend(section_nodes)

        return nodes

# Apply section-aware chunking (semantic_splitter is defined above but not used here)
medical_parser = MedicalPaperParser()
all_nodes = []

for doc in documents:
    # Get section-aware chunks
    section_nodes = medical_parser.parse(doc)
    all_nodes.extend(section_nodes)

print(f"✓ Created {len(all_nodes)} chunks using section-aware parsing")

✓ Created 10 chunks using section-aware parsing


Creates our vector database using Qdrant and implements hybrid search.

**Why Qdrant?**
- Runs entirely in-memory for Colab (no external server needed)
- Production-ready with same API
- Supports metadata filtering
- Better performance than default in-memory store

**Hybrid Search combines:**
1. Vector search (semantic similarity)
2. BM25 (keyword matching)

This is crucial for medical queries where users might search for:
- Specific drug names (keyword match works better)
- Conceptual questions (vector search works better)


In [7]:
# ! pip uninstall -y llama-index llama-index-core llama-index-embeddings-openai llama-index-llms-openai llama-index-readers-file llama-index-readers-llama-parse llama-index-cli llama-index-program-openai

In [8]:
! pip list | grep llama

llama-cloud                             0.1.26
llama-cloud-services                    0.6.34
llama-index                             0.10.57
llama-index-agent-openai                0.2.9
llama-index-cli                         0.1.13
llama-index-core                        0.10.57
llama-index-embeddings-huggingface      0.2.3
llama-index-embeddings-openai           0.1.11
llama-index-indices-managed-llama-cloud 0.2.7
llama-index-instrumentation             0.2.0
llama-index-legacy                      0.9.48.post4
llama-index-llms-openai                 0.1.31
llama-index-multi-modal-llms-openai     0.1.9
llama-index-program-openai              0.1.7
llama-index-question-gen-openai         0.1.3
llama-index-readers-file                0.1.33
llama-index-readers-llama-parse         0.1.6
llama-index-vector-stores-qdrant        0.2.17
llama-index-workflows                   0.2.2
llama-parse                             0.4.9


In [9]:
! pip show llama-index
from llama_index.core import (
    StorageContext,
    VectorStoreIndex,
)
from llama_index.vector_stores.qdrant import QdrantVectorStore
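# Note: BM25Retriever is not imported here (it ships in the separate llama-index-retrievers-bm25
# package, which is not installed); a simple BM25 retriever is implemented below instead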
from llama_index.core.retrievers import VectorIndexRetriever, BaseRetriever
from llama_index.core.schema import QueryBundle, NodeWithScore
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams
import numpy as np

# Initialize Qdrant in-memory
client = QdrantClient(":memory:")

# Create collection with proper vector size
vector_size = len(embed_model.get_text_embedding("test"))
client.create_collection(
    collection_name="medical_papers",
    vectors_config=VectorParams(
        size=vector_size,
        distance=Distance.COSINE
    )
)

# Create Qdrant vector store
vector_store = QdrantVectorStore(
    client=client,
    collection_name="medical_papers"
)

# Create storage context with Qdrant
storage_context = StorageContext.from_defaults(
    vector_store=vector_store
)

# Create vector index with Qdrant backend
vector_index = VectorStoreIndex(
    nodes=all_nodes,
    storage_context=storage_context,
    show_progress=True
)

# Simple BM25-like retriever implementation since the module import is not available
class SimpleBM25Retriever(BaseRetriever):
    """Simple BM25-like retriever implementation"""

    def __init__(self, nodes, similarity_top_k=10):
        self.nodes = nodes
        self.similarity_top_k = similarity_top_k
        super().__init__()

        # Build simple term frequency index
        self._build_tf_index()

    def _build_tf_index(self):
        """Build term frequency index for all nodes"""
        import re
        from collections import defaultdict, Counter

        self.doc_term_freqs = []
        self.doc_lengths = []
        all_terms = set()

        for node in self.nodes:
            # Simple tokenization
            text = node.text.lower()
            terms = re.findall(r'\b\w+\b', text)
            term_freq = Counter(terms)

            self.doc_term_freqs.append(term_freq)
            self.doc_lengths.append(len(terms))
            all_terms.update(terms)

        self.vocab = list(all_terms)

        # Calculate document frequencies
        self.doc_freqs = defaultdict(int)
        for term_freq in self.doc_term_freqs:
            for term in term_freq:
                self.doc_freqs[term] += 1

    def _bm25_score(self, query_terms, doc_idx, k1=1.2, b=0.75):
        """Calculate BM25 score for a document"""
        import math

        score = 0.0
        doc_tf = self.doc_term_freqs[doc_idx]
        doc_len = self.doc_lengths[doc_idx]
        avg_doc_len = sum(self.doc_lengths) / len(self.doc_lengths)
        N = len(self.nodes)

        for term in query_terms:
            if term in doc_tf:
                tf = doc_tf[term]
                df = self.doc_freqs[term]

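                # BM25 formula; the "1 +" inside the log (Lucene-style variant) keeps IDF
                # non-negative even when a term appears in more than half of the chunks
                idf = math.log(1 + (N - df + 0.5) / (df + 0.5))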
                score += idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))

        return score

    def _retrieve(self, query_bundle):
        import re

        # Tokenize query
        query_terms = re.findall(r'\b\w+\b', query_bundle.query_str.lower())

        # Score all documents
        scores = []
        for i, node in enumerate(self.nodes):
            score = self._bm25_score(query_terms, i)
            scores.append((node, score))

        # Sort by score and return top k
        scores.sort(key=lambda x: x[1], reverse=True)

        return [NodeWithScore(node=node, score=score)
                for node, score in scores[:self.similarity_top_k]]

# Create BM25-like retriever using our implementation
bm25_retriever = SimpleBM25Retriever(
    nodes=all_nodes,
    similarity_top_k=10
)

# Implement hybrid retriever
class HybridRetriever(BaseRetriever):
    def __init__(self, vector_retriever, bm25_retriever, alpha=0.5):
        self.vector_retriever = vector_retriever
        self.bm25_retriever = bm25_retriever
        self.alpha = alpha  # weight for vector search (1-alpha for BM25)
        super().__init__()

    def _retrieve(self, query_bundle):
        # Get results from both retrievers
        vector_results = self.vector_retriever.retrieve(query_bundle)
        bm25_results = self.bm25_retriever.retrieve(query_bundle)

        # Normalize scores to 0-1 range
        def normalize_scores(results):
            if not results:
                return results
            scores = [r.score for r in results]
            max_score = max(scores) if scores else 1.0
            min_score = min(scores) if scores else 0.0
            score_range = max_score - min_score if max_score != min_score else 1.0

            for result in results:
                result.score = (result.score - min_score) / score_range
            return results

        # Normalize scores
        vector_results = normalize_scores(vector_results)
        bm25_results = normalize_scores(bm25_results)

        # Combine scores
        all_nodes = {}

        # Add vector search results
        for node in vector_results:
            all_nodes[node.node.node_id] = {
                'node': node.node,
                'vector_score': node.score,
                'bm25_score': 0.0
            }

        # Add BM25 results
        for node in bm25_results:
            if node.node.node_id in all_nodes:
                all_nodes[node.node.node_id]['bm25_score'] = node.score
            else:
                all_nodes[node.node.node_id] = {
                    'node': node.node,
                    'vector_score': 0.0,
                    'bm25_score': node.score
                }

        # Calculate hybrid scores
        hybrid_results = []
        for node_id, data in all_nodes.items():
            # Calculate weighted hybrid score
            hybrid_score = (self.alpha * data['vector_score'] +
                          (1 - self.alpha) * data['bm25_score'])
            hybrid_results.append((data['node'], hybrid_score))

        # Sort by score and return top k
        hybrid_results.sort(key=lambda x: x[1], reverse=True)

        return [NodeWithScore(node=node, score=score)
                for node, score in hybrid_results[:10]]

# Create retrievers
vector_retriever = VectorIndexRetriever(
    index=vector_index,
    similarity_top_k=10,
)

hybrid_retriever = HybridRetriever(
    vector_retriever=vector_retriever,
    bm25_retriever=bm25_retriever,
    alpha=0.7  # Favor semantic search
)

print("✓ Built hybrid search index with custom BM25 implementation")

Generating embeddings:   0%|          | 0/10 [00:00<?, ?it/s]

✓ Built hybrid search index with custom BM25 implementation


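Adds reranking to improve retrieval quality.

**What is Reranking?**
After initial retrieval returns the top 10 chunks, reranking:
1. Takes the query paired with each chunk
2. Scores each pair with a more expensive model: a sentence-transformers cross-encoder (`cross-encoder/ms-marco-MiniLM-L-6-v2`), with a TF-IDF cosine-similarity fallback if the cross-encoder cannot be loaded
3. Reorders by that score and keeps the top 5

**Why a cross-encoder?**
- Reads the query and the chunk together, so it can model fine-grained token interactions
- Better at judging whether a chunk actually answers the query than comparing independently computed embeddings
- ColBERT (for example via ragatouille) is a late-interaction alternative; it is not used in the code below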

Example: Query "COVID vaccine side effects in elderly"
- Initial retrieval might get any COVID vaccine papers
- Reranking promotes papers specifically about elderly populations
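
The retrieve-then-rerank flow can be sketched without any model. The scorer below is a deliberately crude stand-in (fraction of query terms found in the document), not a cross-encoder or ColBERT; it only illustrates the interface of scoring each (query, document) pair and returning `(index, score)` pairs sorted by relevance. All names and candidate texts are hypothetical.

```python
import re


def overlap_score(query, document):
    """Toy relevance score: fraction of distinct query terms present in the document."""
    q_terms = set(re.findall(r"\w+", query.lower()))
    d_terms = set(re.findall(r"\w+", document.lower()))
    return len(q_terms & d_terms) / len(q_terms) if q_terms else 0.0


def rerank(query, documents, k=5, score_fn=overlap_score):
    """Score every (query, document) pair and return the top-k (index, score) pairs."""
    scored = [(i, score_fn(query, doc)) for i, doc in enumerate(documents)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


candidates = [
    "COVID vaccine effectiveness across eight countries",
    "Side effects of COVID vaccine in elderly patients",
    "Digital mental health apps and engagement",
]
print(rerank("COVID vaccine side effects in elderly", candidates, k=2))
```

Swapping `score_fn` for a model-based scorer changes the quality of the ordering but not the surrounding flow.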

In [10]:
# Install scikit-learn if not already available
try:
    from sklearn.feature_extraction.text import TfidfVectorizer
except ImportError:
    !pip install scikit-learn
    from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Simple reranker class that mimics RAGPretrainedModel interface
class SimpleReranker:
    """Simple TF-IDF based reranker that mimics RAGPretrainedModel interface"""

    def __init__(self):
        self.vectorizer = TfidfVectorizer(
            stop_words='english',
            ngram_range=(1, 2),
            max_features=5000
        )

    def rerank(self, query, documents, k=5):
        """Rerank documents using TF-IDF similarity"""
        if len(documents) == 0:
            return []

        # Combine query and documents
        all_texts = [query] + documents

        # Fit and transform
        tfidf_matrix = self.vectorizer.fit_transform(all_texts)

        # Calculate similarities between query (first item) and documents
        query_vec = tfidf_matrix[0:1]
        doc_vecs = tfidf_matrix[1:]

        similarities = cosine_similarity(query_vec, doc_vecs).flatten()

        # Sort by similarity and return top k with indices and scores
        indexed_scores = [(i, float(score)) for i, score in enumerate(similarities)]
        indexed_scores.sort(key=lambda x: x[1], reverse=True)

        return indexed_scores[:k]

# Initialize reranker (keeping same variable name and structure)
try:
    # Try to use sentence-transformers cross-encoder if available
    from sentence_transformers import CrossEncoder

    class CrossEncoderReranker:
        def __init__(self):
            self.model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

        def rerank(self, query, documents, k=5):
            if len(documents) == 0:
                return []

            # Create query-document pairs
            pairs = [[query, doc] for doc in documents]

            # Get scores
            scores = self.model.predict(pairs)

            # Sort by score and return top k with indices
            indexed_scores = [(i, float(score)) for i, score in enumerate(scores)]
            indexed_scores.sort(key=lambda x: x[1], reverse=True)

            return indexed_scores[:k]

    reranker = CrossEncoderReranker()
    print("✓ Cross-encoder reranker initialized")
except Exception:
    # Fallback to simple TF-IDF reranker
    reranker = SimpleReranker()
    print("✓ TF-IDF reranker initialized (fallback)")

class RerankedRetriever(BaseRetriever):
    def __init__(self, base_retriever, reranker=None, top_k=5):
        self.base_retriever = base_retriever
        self.reranker = reranker
        self.top_k = top_k
        super().__init__()

    def _retrieve(self, query_bundle):
        # Get initial results
        initial_results = self.base_retriever.retrieve(query_bundle)

        if self.reranker and len(initial_results) > 0:
            try:
                # Rerank using the available reranker
                texts = [node.node.text for node in initial_results]
                scores = self.reranker.rerank(
                    query=query_bundle.query_str,
                    documents=texts,
                    k=self.top_k
                )

                # Return reranked results
                reranked = []
                for idx, score in scores:
                    if idx < len(initial_results):
                        result = initial_results[idx]
                        # Update score with reranking score
                        result.score = score
                        reranked.append(result)
                return reranked[:self.top_k]
            except Exception as e:
                print(f"⚠ Reranking failed: {e}, using original order")
                return initial_results[:self.top_k]
        else:
            # No reranker (or no results): keep the base retriever's order and truncate to top_k
            return initial_results[:self.top_k]

# Create reranked retriever
reranked_retriever = RerankedRetriever(
    base_retriever=hybrid_retriever,
    reranker=reranker,
    top_k=5
)

✓ Cross-encoder reranker initialized


Assembles all components into a complete query engine with:

1. **Response Synthesis**: How to combine multiple retrieved chunks into one answer
   - "compact" mode: Packs as many retrieved chunks as fit into each LLM call to minimize the number of calls
   - "tree_summarize": Hierarchical summarization for long contexts

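2. **Similarity Filtering**: Drops chunks whose score is below 0.5. The cutoff is applied to whatever score the retriever returns; after reranking that is the cross-encoder score (values such as 6.001 in the sample output), not a 0-1 cosine similarity, so the threshold should be tuned accordingly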

3. **HyDE (Hypothetical Document Embeddings)**:
   - Generates a hypothetical perfect answer
   - Searches for documents similar to this ideal answer
   - Often improves retrieval for complex queries

This is where all components come together into a system that can answer questions.
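
The HyDE and similarity-filtering steps can be illustrated with a toy sketch. Everything here is a stand-in under stated assumptions: `embed` is a bag-of-words counter rather than the embedding model, `fake_llm` returns a canned answer rather than calling GPT, and the function names are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Stand-in embedding: bag-of-words counts (a real system would use the embedding model)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def hyde_search(query, chunks, generate_fn, cutoff=0.0):
    """Embed a hypothetical answer instead of the raw query, then rank chunks by similarity."""
    hypothetical = generate_fn(query)  # in the notebook this would be an LLM call
    query_vec = embed(hypothetical)
    scored = [(chunk, cosine(query_vec, embed(chunk))) for chunk in chunks]
    kept = [(c, s) for c, s in scored if s >= cutoff]  # similarity filtering
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


def fake_llm(query):
    # Hypothetical canned draft standing in for the LLM's answer
    return "Dropout rates for mental health apps were high, with many users leaving by week 8."


chunks = [
    "Engagement remained primary challenge with 68% dropout rates by week 8.",
    "Booster doses restored effectiveness to 94%.",
]
query = "How many users quit?"
results = hyde_search(query, chunks, fake_llm, cutoff=0.1)
for chunk, score in results:
    print(round(score, 3), chunk)
```

In this toy setup the raw query shares no words with the relevant chunk, while the hypothetical answer does, which is the intuition behind searching with a drafted answer rather than the question itself.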


In [11]:

from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.response_synthesizers import get_response_synthesizer
from llama_index.core.postprocessor import SimilarityPostprocessor
from llama_index.core.indices.query.query_transform import HyDEQueryTransform

# Create response synthesizer (sources are available afterwards via response.source_nodes)
response_synthesizer = get_response_synthesizer(
    response_mode="compact",  # or "tree_summarize" for longer responses
    use_async=False,
    streaming=False
)

# Add similarity threshold to filter weak matches
similarity_processor = SimilarityPostprocessor(
    similarity_cutoff=0.5
)

# Create query engine
query_engine = RetrieverQueryEngine(
    retriever=reranked_retriever,
    response_synthesizer=response_synthesizer,
    node_postprocessors=[similarity_processor]
)

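# Create a HyDE (Hypothetical Document Embeddings) transform. Note: it is not applied to
# query_engine here; wrapping the engine in TransformQueryEngine would be needed to use it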
hyde_transform = HyDEQueryTransform(include_original=True)

print("✓ Query engine configured with advanced features")

✓ Query engine configured with advanced features


Provides diverse query types to test our system:

1. **Factual**: Specific numbers/statistics
2. **Comparison**: Contrasting different approaches
3. **Analytical**: Understanding challenges/methods
4. **Synthesis**: Combining insights across papers

These queries test different retrieval challenges:
- Factual queries need precise matching
- Comparisons need multiple relevant chunks
- Synthesis needs good coverage across documents

In production, collect real user queries to understand your specific patterns.

In [12]:
medical_queries = [
    # Factual queries
    "What is the effectiveness of COVID-19 vaccines against the Omicron variant?",
    "What percentage improvement in medication adherence was seen with AI chatbots for diabetes?",

    # Comparison queries
    "Compare the effectiveness of self-guided vs therapist-supported digital mental health interventions",
    "How do vaccine effectiveness rates differ between age groups?",

    # Analytical queries
    "What are the main challenges in implementing digital mental health solutions?",
    "What machine learning approaches are used in diabetes management?",

    # Synthesis queries
    "Summarize the key findings about AI applications in healthcare across all papers",
    "What are the common limitations mentioned across these studies?",

    # Specific detail queries
    "What was the sample size in the COVID vaccine effectiveness study?",
    "What is the dropout rate for mental health apps?"
]

print("✓ Example queries loaded")

✓ Example queries loaded


Executes queries and shows detailed results including:

1. **Answer**: The synthesized response
2. **Sources**: Which chunks were used
3. **Metadata**: Section, paper title, relevance scores

This helps you understand:
- Is the retrieval getting the right chunks?
- Are relevance scores meaningful?
- Is the answer properly synthesized?

The source attribution is crucial for medical applications where
users need to verify claims against original research.


In [13]:
print("🔍 Running example queries...\n")

for i, query in enumerate(medical_queries[:3], 1):  # Run first 3 queries
    print(f"Query {i}: {query}")
    print("-" * 80)

    # Get response
    response = query_engine.query(query)

    print(f"Answer: {response.response}")
    print(f"\nSources used: {len(response.source_nodes)}")

    # Show source snippets
    for j, source in enumerate(response.source_nodes[:2], 1):
        print(f"\nSource {j}:")
        print(f"- Section: {source.node.metadata.get('section', 'Unknown')}")
        print(f"- Paper: {source.node.metadata.get('title', 'Unknown')}")
        print(f"- Relevance Score: {source.score:.3f}")
        print(f"- Text snippet: {source.node.text[:200]}...")

    print("\n" + "="*80 + "\n")

🔍 Running example queries...

Query 1: What is the effectiveness of COVID-19 vaccines against the Omicron variant?
--------------------------------------------------------------------------------
Answer: The effectiveness of COVID-19 vaccines against the Omicron variant is 64% for infection and 89% for severe disease.

Sources used: 2

Source 1:
- Section: Results
- Paper: Real-World Effectiveness of COVID-19 Vaccines: A Multi-Country Analysis
- Relevance Score: 6.001
- Text snippet: Among fully vaccinated individuals, breakthrough infection rate was 3.2 per 1000 person-months.
        Vaccine effectiveness against Delta variant: 88% for infection, 96% for severe disease. Omicron ...

Source 2:
- Section: Introduction
- Paper: Real-World Effectiveness of COVID-19 Vaccines: A Multi-Country Analysis
- Relevance Score: 3.600
- Text snippet: The rapid development of COVID-19 vaccines necessitated ongoing real-world effectiveness
        monitoring. This study leveraged electronic health re

Implements query routing - sending different query types to specialized engines.

**Why Query Routing?**
Different queries need different handling:
- Statistical queries → Need exact number extraction
- Comparisons → Need balanced retrieval from multiple sources  
- Synthesis → Need broader coverage

The SubQuestionQueryEngine can also:
- Break complex queries into simpler sub-questions
- Route each sub-question appropriately
- Combine results intelligently

Example: "Compare vaccine effectiveness across age groups and identify which group needs boosters most urgently"
→ Sub-question 1: "What is vaccine effectiveness by age group?"
→ Sub-question 2: "Which age groups show waning immunity?"


In [14]:
from llama_index.core.tools import QueryEngineTool, ToolMetadata
from llama_index.core.query_engine import SubQuestionQueryEngine

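# Placeholders: all three names currently point to the same query_engine, so the
# tools below differ only in name and description. Swap in separately configured
# engines (e.g. different top_k or response mode) to get real specialization.
statistical_engine = query_engine  # TODO: configure for numerical/statistical queries
comparison_engine = query_engine   # TODO: configure for comparative analysis
synthesis_engine = query_engine    # TODO: configure for multi-document synthesis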

# Create query engine tools
query_engine_tools = [
    QueryEngineTool(
        query_engine=statistical_engine,
        metadata=ToolMetadata(
            name="statistical_search",
            description="Best for finding specific numbers, percentages, and statistical results"
        )
    ),
    QueryEngineTool(
        query_engine=comparison_engine,
        metadata=ToolMetadata(
            name="comparison_search",
            description="Best for comparing different treatments, methods, or outcomes"
        )
    ),
    QueryEngineTool(
        query_engine=synthesis_engine,
        metadata=ToolMetadata(
            name="synthesis_search",
            description="Best for summarizing findings across multiple studies"
        )
    )
]

# Create sub-question query engine for complex queries
sub_question_engine = SubQuestionQueryEngine.from_defaults(
    query_engine_tools=query_engine_tools
)

print("✓ Advanced query routing configured")

✓ Advanced query routing configured


Measures system performance with key metrics:

1. **Query Time**: How fast are responses?
2. **Source Count**: How many chunks used per query?
3. **Relevance Scores**: How highly does the retriever or reranker score the returned chunks? These are raw scores, not probabilities.
4. **Source Diversity**: Are we using multiple papers or over-relying on one?

These metrics help identify:
- Performance bottlenecks
- Retrieval quality issues
- Whether you need better chunking/embeddings

In production, also track:
- User satisfaction (thumbs up/down)
- Click-through on sources
- Query abandonment rates
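
The three production signals listed above could be derived from a simple event log. The sketch below assumes a hypothetical log format (the `query_id` and `type` fields and the event names are illustrative, not part of this notebook or of any library) and shows one possible definition of each rate.

```python
def production_metrics(events):
    """Compute satisfaction, source click-through, and abandonment rates."""
    # Collect the set of query ids seen for each event type.
    by_type = {}
    for event in events:
        by_type.setdefault(event["type"], set()).add(event["query_id"])

    submitted = by_type.get("query_submitted", set())
    answered = by_type.get("answer_shown", set())
    up = by_type.get("thumbs_up", set())
    down = by_type.get("thumbs_down", set())
    clicked = by_type.get("source_click", set())
    rated = up | down

    return {
        # Share of rated answers that received a thumbs up.
        "satisfaction": len(up) / len(rated) if rated else None,
        # Share of answered queries where at least one source was opened.
        "source_click_through": len(clicked & answered) / len(answered) if answered else None,
        # Share of submitted queries that never reached an answer.
        "abandonment": len(submitted - answered) / len(submitted) if submitted else None,
    }

# Toy event log for illustration only.
toy_events = [
    {"query_id": 1, "type": "query_submitted"},
    {"query_id": 1, "type": "answer_shown"},
    {"query_id": 1, "type": "thumbs_up"},
    {"query_id": 1, "type": "source_click"},
    {"query_id": 2, "type": "query_submitted"},
    {"query_id": 2, "type": "answer_shown"},
    {"query_id": 2, "type": "thumbs_down"},
    {"query_id": 3, "type": "query_submitted"},
    {"query_id": 4, "type": "query_submitted"},
    {"query_id": 4, "type": "answer_shown"},
]
rates = production_metrics(toy_events)
print(rates)
```

Returning `None` when a denominator is empty keeps "no data yet" distinct from a genuine rate of zero.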

In [15]:
from collections import defaultdict
import time

import numpy as np  # used below for mean/std; imported here so this cell runs on its own

def evaluate_retrieval_quality(query_engine, test_queries):
    """Evaluate retrieval quality metrics"""
    metrics = defaultdict(list)

    for query in test_queries:
        start_time = time.time()
        response = query_engine.query(query)
        query_time = time.time() - start_time

        # Collect metrics
        metrics['query_time'].append(query_time)
        metrics['num_sources'].append(len(response.source_nodes))
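        # Skip nodes without a score and avoid np.mean([]), which yields nan and a warning
        scores = [node.score for node in response.source_nodes if node.score is not None]
        metrics['avg_relevance_score'].append(np.mean(scores) if scores else 0.0)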

        # Check source diversity (different papers)
        unique_papers = set(
            node.node.metadata.get('title', 'Unknown')
            for node in response.source_nodes
        )
        metrics['source_diversity'].append(len(unique_papers))

    # Calculate summary statistics
    summary = {}
    for metric, values in metrics.items():
        summary[metric] = {
            'mean': np.mean(values),
            'std': np.std(values),
            'min': np.min(values),
            'max': np.max(values)
        }

    return summary

# Run evaluation
print("📊 Evaluating retrieval quality...")
eval_results = evaluate_retrieval_quality(query_engine, medical_queries[:5])

for metric, stats in eval_results.items():
    print(f"\n{metric}:")
    print(f"  Mean: {stats['mean']:.3f}")
    print(f"  Std:  {stats['std']:.3f}")
    print(f"  Range: [{stats['min']:.3f}, {stats['max']:.3f}]")

📊 Evaluating retrieval quality...

query_time:
  Mean: 5.226
  Std:  1.038
  Range: [3.973, 6.535]

num_sources:
  Mean: 1.400
  Std:  0.490
  Range: [1.000, 2.000]

avg_relevance_score:
  Mean: 4.599
  Std:  1.287
  Range: [2.683, 6.705]

source_diversity:
  Mean: 1.000
  Std:  0.000
  Range: [1.000, 1.000]
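
The relevance scores reported above fall well outside the 0 to 1 range, so a fixed cutoff such as the 0.3 default used by the Gradio interface in this notebook is hard to interpret against them. One option, shown here only as a sketch, is to min-max normalize the scores within each result list before thresholding. Both helpers are hypothetical and the scores are toy values.

```python
def min_max_normalize(scores):
    """Rescale scores to [0, 1]; if all scores are equal, each maps to 1.0."""
    if not scores:
        return []
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [1.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def filter_by_threshold(items, threshold=0.3):
    """Keep (label, normalized_score) pairs whose normalized score meets the threshold."""
    normalized = min_max_normalize([score for _, score in items])
    return [(label, norm)
            for (label, _), norm in zip(items, normalized)
            if norm >= threshold]

# Toy (label, raw_score) pairs for illustration only.
toy_results = [("chunk_a", 6.0), ("chunk_b", 3.6), ("chunk_c", 2.7)]
kept = filter_by_threshold(toy_results)
print(kept)
```

Note the trade-off: with per-list normalization the top result always scores 1.0 and is always kept, even when every retrieved chunk is weak. An absolute cutoff calibrated on the reranker's own score scale avoids that, at the cost of needing labelled examples to pick the value.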


In [None]:
# Install required packages for the UI
!pip install gradio plotly pandas

import gradio as gr
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import numpy as np
import time
from collections import defaultdict
import json

# Enhanced RAG Analysis Functions
def analyze_retrieval_components(query, top_k=10):
    """Analyze different retrieval components separately"""

    # Import QueryBundle for proper query handling
    from llama_index.core.schema import QueryBundle

    # Create proper QueryBundle object
    query_bundle = QueryBundle(query_str=query)

    # Get results from different retrievers
    vector_results = vector_retriever.retrieve(query_bundle)
    bm25_results = bm25_retriever.retrieve(query_bundle)
    hybrid_results = hybrid_retriever.retrieve(query_bundle)

    # Format results for comparison
    def format_results(results, retriever_type):
        formatted = []
        for i, node in enumerate(results[:top_k]):
            formatted.append({
                'rank': i + 1,
                'retriever': retriever_type,
                'score': round(node.score, 4),
                'section': node.node.metadata.get('section', 'Unknown'),
                'paper': node.node.metadata.get('title', 'Unknown')[:50] + '...',
                'text_preview': node.node.text[:150] + '...',
                'node_id': node.node.node_id[:8]  # Short ID for tracking
            })
        return formatted

    vector_data = format_results(vector_results, 'Vector (Semantic)')
    bm25_data = format_results(bm25_results, 'BM25 (Keyword)')
    hybrid_data = format_results(hybrid_results, 'Hybrid')

    return vector_data, bm25_data, hybrid_data

def create_retrieval_comparison_plot(vector_data, bm25_data, hybrid_data):
    """Create visualization comparing different retrieval methods"""

    # Combine data for plotting
    all_data = vector_data + bm25_data + hybrid_data
    df = pd.DataFrame(all_data)

    if df.empty:
        return go.Figure().add_annotation(text="No data to display",
                                        xref="paper", yref="paper",
                                        x=0.5, y=0.5, showarrow=False)

    # Create subplots with pie chart support
    fig = make_subplots(
        rows=2, cols=2,
        subplot_titles=('Retrieval Scores by Method', 'Score Distribution',
                       'Section Coverage', 'Paper Diversity'),
        specs=[[{"secondary_y": False}, {"secondary_y": False}],
               [{"secondary_y": False}, {"type": "domain"}]]  # domain type for pie chart
    )

    # Plot 1: Scores by rank and method
    colors = {'Vector (Semantic)': '#1f77b4', 'BM25 (Keyword)': '#ff7f0e', 'Hybrid': '#2ca02c'}
    for method in df['retriever'].unique():
        method_data = df[df['retriever'] == method]
        fig.add_trace(
            go.Scatter(x=method_data['rank'], y=method_data['score'],
                      mode='lines+markers', name=method,
                      line=dict(color=colors.get(method, '#333333')),
                      hovertemplate='<b>%{fullData.name}</b><br>' +
                                  'Rank: %{x}<br>' +
                                  'Score: %{y}<br>' +
                                  '<extra></extra>'),
            row=1, col=1
        )

    # Plot 2: Score distribution
    for method in df['retriever'].unique():
        method_scores = df[df['retriever'] == method]['score']
        fig.add_trace(
            go.Box(y=method_scores, name=method,
                  marker_color=colors.get(method, '#333333'),
                  showlegend=False),
            row=1, col=2
        )

    # Plot 3: Section coverage
    section_counts = df.groupby(['retriever', 'section']).size().reset_index(name='count')
    for method in section_counts['retriever'].unique():
        method_sections = section_counts[section_counts['retriever'] == method]
        fig.add_trace(
            go.Bar(x=method_sections['section'], y=method_sections['count'],
                  name=method, marker_color=colors.get(method, '#333333'),
                  showlegend=False),
            row=2, col=1
        )

    # Plot 4: Paper diversity (pie chart for hybrid method)
    hybrid_papers = df[df['retriever'] == 'Hybrid']['paper'].value_counts()
    if not hybrid_papers.empty:
        fig.add_trace(
            go.Pie(labels=hybrid_papers.index, values=hybrid_papers.values,
                  name="Paper Distribution", showlegend=False),
            row=2, col=2
        )

    fig.update_layout(height=800, title_text="RAG Retrieval Analysis Dashboard")
    fig.update_xaxes(title_text="Rank", row=1, col=1)
    fig.update_yaxes(title_text="Relevance Score", row=1, col=1)
    fig.update_yaxes(title_text="Score", row=1, col=2)
    fig.update_xaxes(title_text="Section", row=2, col=1)
    fig.update_yaxes(title_text="Count", row=2, col=1)

    return fig

def create_query_performance_plot(metrics_history):
    """Create performance metrics visualization"""
    if not metrics_history:
        return go.Figure().add_annotation(text="No metrics data available",
                                        xref="paper", yref="paper",
                                        x=0.5, y=0.5, showarrow=False)

    df = pd.DataFrame(metrics_history)

    fig = make_subplots(
        rows=2, cols=2,
        subplot_titles=('Query Response Time', 'Number of Sources Retrieved',
                       'Average Relevance Score', 'Source Diversity'),
        specs=[[{"secondary_y": False}, {"secondary_y": False}],
               [{"secondary_y": False}, {"secondary_y": False}]]
    )

    # Response time trend
    fig.add_trace(
        go.Scatter(x=list(range(len(df))), y=df['query_time'],
                  mode='lines+markers', name='Response Time',
                  line=dict(color='#1f77b4')),
        row=1, col=1
    )

    # Number of sources
    fig.add_trace(
        go.Bar(x=list(range(len(df))), y=df['num_sources'],
              name='Sources', marker_color='#ff7f0e'),
        row=1, col=2
    )

    # Relevance scores
    fig.add_trace(
        go.Scatter(x=list(range(len(df))), y=df['avg_relevance_score'],
                  mode='lines+markers', name='Avg Relevance',
                  line=dict(color='#2ca02c')),
        row=2, col=1
    )

    # Source diversity
    fig.add_trace(
        go.Bar(x=list(range(len(df))), y=df['source_diversity'],
              name='Diversity', marker_color='#d62728'),
        row=2, col=2
    )

    fig.update_layout(height=600, title_text="Query Performance Metrics", showlegend=False)
    fig.update_xaxes(title_text="Query Number", row=1, col=1)
    fig.update_yaxes(title_text="Time (seconds)", row=1, col=1)
    fig.update_xaxes(title_text="Query Number", row=1, col=2)
    fig.update_yaxes(title_text="Count", row=1, col=2)
    fig.update_xaxes(title_text="Query Number", row=2, col=1)
    fig.update_yaxes(title_text="Score", row=2, col=1)
    fig.update_xaxes(title_text="Query Number", row=2, col=2)
    fig.update_yaxes(title_text="Unique Papers", row=2, col=2)

    return fig

# Global variables to track metrics
metrics_history = []

def query_rag_system(query, retrieval_method="Hybrid", top_k=5, similarity_threshold=0.3):
    """Main function to query the RAG system with visualization"""
    global metrics_history

    if not query.strip():
        return "Please enter a query.", None, None, "No metrics available."

    start_time = time.time()

    try:
        # Create temporary reranked retriever for the selected method
        if retrieval_method == "Hybrid":
            selected_retriever = RerankedRetriever(
                base_retriever=hybrid_retriever,
                reranker=reranker,
                top_k=top_k
            )
        else:
            # For non-hybrid methods, use the base retriever directly
            if retrieval_method == "Vector (Semantic)":
                base_retriever = vector_retriever
            else:  # BM25 (Keyword)
                base_retriever = bm25_retriever

            selected_retriever = RerankedRetriever(
                base_retriever=base_retriever,
                reranker=None,  # No reranking for single methods
                top_k=top_k
            )

        # Create temporary query engine with selected retriever
        from llama_index.core.query_engine import RetrieverQueryEngine
        from llama_index.core.response_synthesizers import get_response_synthesizer
        from llama_index.core.postprocessor import SimilarityPostprocessor

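        # Note: similarity_cutoff is compared against raw node scores. Relevance scores
        # in this notebook can exceed 1 (e.g. 6.001 in the example output earlier), so a
        # threshold between 0 and 1 may filter nothing on the reranked path.
        temp_query_engine = RetrieverQueryEngine(
            retriever=selected_retriever,
            response_synthesizer=get_response_synthesizer(response_mode="compact"),
            node_postprocessors=[SimilarityPostprocessor(similarity_cutoff=similarity_threshold)]
        )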

        # Get response
        response = temp_query_engine.query(query)
        query_time = time.time() - start_time

        # Format response with sources
        formatted_response = f"**Answer:** {response.response}\n\n"
        formatted_response += f"**Sources Used:** {len(response.source_nodes)}\n\n"

        if response.source_nodes:
            formatted_response += "**Source Details:**\n"
            for i, source in enumerate(response.source_nodes, 1):
                formatted_response += f"\n**Source {i}:**\n"
                formatted_response += f"- **Section:** {source.node.metadata.get('section', 'Unknown')}\n"
                formatted_response += f"- **Paper:** {source.node.metadata.get('title', 'Unknown')}\n"
                formatted_response += f"- **Relevance Score:** {source.score:.3f}\n"
                formatted_response += f"- **Text:** {source.node.text[:200]}...\n"

        # Collect metrics
        unique_papers = set(node.node.metadata.get('title', 'Unknown') for node in response.source_nodes)
        metrics = {
            'query_time': query_time,
            'num_sources': len(response.source_nodes),
            'avg_relevance_score': np.mean([node.score for node in response.source_nodes]) if response.source_nodes else 0,
            'source_diversity': len(unique_papers),
            'query': query[:30] + '...' if len(query) > 30 else query
        }
        metrics_history.append(metrics)

        # Generate visualizations
        vector_data, bm25_data, hybrid_data = analyze_retrieval_components(query, top_k)
        comparison_plot = create_retrieval_comparison_plot(vector_data, bm25_data, hybrid_data)
        performance_plot = create_query_performance_plot(metrics_history)

        # Format metrics summary
        metrics_summary = f"""
**Query Metrics:**
- Response Time: {query_time:.3f} seconds
- Sources Retrieved: {len(response.source_nodes)}
- Average Relevance: {metrics['avg_relevance_score']:.3f}
- Unique Papers: {metrics['source_diversity']}
- Retrieval Method: {retrieval_method}
        """

        return formatted_response, comparison_plot, performance_plot, metrics_summary

    except Exception as e:
        return f"Error: {str(e)}", None, None, f"Error occurred: {str(e)}"

def reset_metrics():
    """Reset metrics history"""
    global metrics_history
    metrics_history = []
    return "Metrics history cleared!", None

# Create Gradio Interface
def create_rag_interface():
    with gr.Blocks(title="Advanced RAG Medical Literature System", theme=gr.themes.Soft()) as demo:
        gr.Markdown("""
        # 🏥 Advanced RAG Medical Literature System

        This interactive interface allows you to query medical literature using different retrieval methods
        and visualize how the system performs. You can compare semantic search, keyword search, and hybrid approaches.
        """)

        with gr.Row():
            with gr.Column(scale=1):
                gr.Markdown("## 🔍 Query Interface")

                query_input = gr.Textbox(
                    label="Medical Query",
                    placeholder="Enter your medical question (e.g., 'What is the effectiveness of COVID vaccines?')",
                    lines=3
                )

                with gr.Row():
                    retrieval_method = gr.Dropdown(
                        choices=["Hybrid", "Vector (Semantic)", "BM25 (Keyword)"],
                        value="Hybrid",
                        label="Retrieval Method"
                    )

                    top_k = gr.Slider(
                        minimum=1, maximum=20, value=5, step=1,
                        label="Max Sources to Retrieve"
                    )

                similarity_threshold = gr.Slider(
                    minimum=0.0, maximum=1.0, value=0.3, step=0.1,
                    label="Similarity Threshold"
                )

                with gr.Row():
                    submit_btn = gr.Button("🚀 Query System", variant="primary")
                    reset_btn = gr.Button("🔄 Reset Metrics", variant="secondary")

                # Example queries
                gr.Markdown("### 📝 Example Queries:")
                example_queries = [
                    "What is the effectiveness of COVID-19 vaccines against Omicron?",
                    "How do AI chatbots improve diabetes medication adherence?",
                    "What are challenges in digital mental health interventions?",
                    "Compare self-guided vs therapist-supported mental health apps",
                    "What machine learning approaches are used in diabetes management?"
                ]

                for example in example_queries:
                    gr.Button(example, size="sm").click(
                        lambda x=example: x, outputs=query_input
                    )

            with gr.Column(scale=2):
                gr.Markdown("## 📊 Results & Analysis")

                response_output = gr.Markdown(label="System Response")

                with gr.Tabs():
                    with gr.TabItem("🔄 Retrieval Comparison"):
                        comparison_plot = gr.Plot(label="Retrieval Methods Comparison")

                    with gr.TabItem("📈 Performance Metrics"):
                        performance_plot = gr.Plot(label="Query Performance Over Time")

                    with gr.TabItem("📋 Metrics Summary"):
                        metrics_output = gr.Markdown(label="Current Query Metrics")

        # Event handlers
        submit_btn.click(
            fn=query_rag_system,
            inputs=[query_input, retrieval_method, top_k, similarity_threshold],
            outputs=[response_output, comparison_plot, performance_plot, metrics_output]
        )

        reset_btn.click(
            fn=reset_metrics,
            outputs=[metrics_output, performance_plot]
        )

        # Additional information
        with gr.Accordion("ℹ️ System Information", open=False):
            gr.Markdown("""
            ### How It Works:

            - **Vector (Semantic)**: Uses BioBERT embeddings to find semantically similar content
            - **BM25 (Keyword)**: Traditional keyword-based search using term frequency
            - **Hybrid**: Combines both approaches with weighted scoring

            ### Visualizations:

            - **Retrieval Comparison**: Shows how different methods rank the same documents
            - **Performance Metrics**: Tracks response time, source count, and relevance over multiple queries
            - **Section Coverage**: Displays which paper sections are being retrieved
            - **Paper Diversity**: Shows distribution across different research papers

            ### Tips:

            - Try the same query with different retrieval methods to see variations
            - Use specific medical terms for better BM25 results
            - Use conceptual questions for better semantic search results
            - Adjust similarity threshold to filter out less relevant results
            """)

    return demo

# Launch the interface
demo = create_rag_interface()
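demo.launch(
    share=True,  # Creates a temporary public link; this is what makes the app reachable from Colab
    server_name="0.0.0.0",  # Binds to all network interfaces
    server_port=7860,
    debug=True  # Keeps the cell running and prints errors and logs
)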

Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().
* Running on public URL: https://e368302134c3846610.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)
