# Clinical Intelligence System - Capstone Project

This notebook implements a Retrieval-Augmented Generation (RAG) pipeline for answering clinical questions using a trusted set of medical documents. It follows the capstone requirements and incorporates best practices from the provided training courses.

## Steps:
1. Load dataset
2. Create embeddings and store in ChromaDB
3. Explore multiple retrieval strategies
4. Integrate with GPT model for generation
5. Validate using evaluation dataset
6. Test on unseen questions and save results


In [None]:
# Install required packages (uncomment if running locally)
# !pip install openai chromadb pandas langchain

import pandas as pd
import chromadb
from chromadb.utils import embedding_functions
from openai import OpenAI
import numpy as np
import re


In [None]:
# Load the main dataset
rag_dataset = pd.read_csv('capstone1_rag_dataset.csv')
print(f"Dataset loaded with {len(rag_dataset)} documents.")
rag_dataset.head()


In [None]:
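# Initialize OpenAI client and ChromaDB
import os

# OpenAI() reads the key from the OPENAI_API_KEY environment variable
client = OpenAI()

# Define the embedding function first (OpenAI's text-embedding-3-small) so the
# collection can use it; read the key from the environment instead of hard-coding it
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key=os.environ["OPENAI_API_KEY"],
    model_name="text-embedding-3-small",
)

# Create ChromaDB client and collection. Without embedding_function, Chroma
# silently falls back to its default embedding model.
chroma_client = chromadb.Client()
collection = chroma_client.create_collection(name="medical_docs", embedding_function=openai_ef)

# Add documents to ChromaDB
for idx, row in rag_dataset.iterrows():
    collection.add(documents=[row['document']], ids=[str(idx)], metadatas=[{"source": "medical"}])

print("Documents added to ChromaDB.")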


## Retrieval Strategy Exploration
We will implement and compare multiple strategies:
- Semantic Search
- Semantic Search with Threshold Filtering
- Hybrid Search (Keyword + Semantic)
- Reranking based on relevance scores
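
To make the differences between these strategies concrete, here is a minimal sketch that runs the first three on a toy corpus in pure Python, without ChromaDB or the OpenAI API. The documents, the hand-made 2-D vectors, and the helper names (`cosine_distance`, `toy_semantic_search`, `toy_hybrid_search`) are all hypothetical and exist only for illustration. The point it shows is that results are ranked by distance, so smaller values mean closer matches and a threshold acts as an upper bound.

```python
import math

# Toy corpus with hand-made 2-D "embeddings" (hypothetical values, for illustration only).
docs = ["diabetes symptoms", "insulin dosing", "knee surgery recovery"]
doc_vecs = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]

def cosine_distance(a, b):
    """Return 1 - cosine similarity; 0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def toy_semantic_search(query_vec, top_k=2, max_distance=None):
    """Rank documents by ascending distance; optionally drop those beyond max_distance."""
    scored = sorted(
        ((cosine_distance(query_vec, vec), doc) for doc, vec in zip(docs, doc_vecs)),
        key=lambda pair: pair[0],
    )[:top_k]
    return [doc for dist, doc in scored if max_distance is None or dist <= max_distance]

def toy_hybrid_search(query_text, query_vec, top_k=2):
    """Keyword hits first, then semantic hits, de-duplicated while preserving order."""
    terms = query_text.lower().split()
    keyword_hits = [doc for doc in docs if any(term in doc for term in terms)]
    semantic_hits = toy_semantic_search(query_vec, top_k=top_k)
    return list(dict.fromkeys(keyword_hits + semantic_hits))[:top_k]

query_vec = [1.0, 0.1]
plain = toy_semantic_search(query_vec)                        # two closest documents
filtered = toy_semantic_search(query_vec, max_distance=0.1)   # only very close matches
hybrid = toy_hybrid_search("insulin", query_vec)              # keyword match ranked first
print(plain, filtered, hybrid)
```

In this toy setup the threshold removes the second-closest document, and the hybrid variant promotes the keyword match ahead of the semantically closest one. Suitable threshold values for real embeddings depend on the distance metric the collection uses.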


In [None]:
# Basic Semantic Search
def semantic_search(query, top_k=3):
    results = collection.query(query_texts=[query], n_results=top_k)
    return results['documents'][0]

print(semantic_search("What are symptoms of diabetes?"))


In [None]:
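# Threshold Filtering
# Chroma returns distances (smaller means more similar), so keep only results
# whose distance is at or below the threshold. A sensible threshold value
# depends on the collection's distance metric.
def semantic_search_with_threshold(query, top_k=5, threshold=0.75):
    results = collection.query(query_texts=[query], n_results=top_k)
    docs = results['documents'][0]
    distances = results['distances'][0]
    filtered_docs = [doc for doc, dist in zip(docs, distances) if dist <= threshold]
    return filtered_docs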

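# Hybrid Search
def hybrid_search(query, top_k=5):
    # Escape the query so characters such as '?' or '(' are matched literally, not as regex syntax
    pattern = re.compile(re.escape(query), re.IGNORECASE)
    keyword_matches = [doc for doc in rag_dataset['document'] if pattern.search(doc)]
    semantic_results = collection.query(query_texts=[query], n_results=top_k)
    # dict.fromkeys removes duplicates while keeping a stable order, unlike set()
    combined = list(dict.fromkeys(keyword_matches[:top_k] + semantic_results['documents'][0]))
    return combined[:top_k]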

# Reranking
# Fetch a wider candidate pool, then sort by distance in ascending order
# so the closest (most relevant) documents come first
def rerank_search(query, top_k=5):
    results = collection.query(query_texts=[query], n_results=top_k*2)
    docs = results['documents'][0]
    distances = results['distances'][0]
    ranked = sorted(zip(docs, distances), key=lambda x: x[1])
    return [doc for doc, _ in ranked[:top_k]]


In [None]:
# Connect retriever with GPT model for generation
def generate_answer(question, retrieved_docs):
    context = "\n".join(retrieved_docs)
    prompt = (
        "You are a medical assistant. Answer the question based only on the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {question}\n"
        "Answer:"
    )
    response = client.chat.completions.create(model="gpt-4.1-mini", messages=[{"role": "user", "content": prompt}])
    return response.choices[0].message.content


## Evaluation Metrics
We will compute Precision@k, Recall@k, and F1-score for the validation dataset.
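
As a quick sanity check on what these metrics measure, the sketch below works through one toy case by hand. The document IDs are made up, and the helper functions are a minimal restatement of the standard set-based definitions: precision divides the number of relevant hits by the number of retrieved items, recall divides it by the number of relevant items, and F1 is their harmonic mean.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved items that are relevant."""
    top_k = retrieved[:k]
    return len(set(top_k) & set(relevant)) / len(top_k) if top_k else 0.0

def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant items that appear in the top-k retrieved items."""
    top_k = retrieved[:k]
    return len(set(top_k) & set(relevant)) / len(relevant) if relevant else 0.0

def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0 when both are 0."""
    total = precision + recall
    return (2 * precision * recall) / total if total > 0 else 0.0

# Hypothetical IDs: three retrieved documents, two relevant ones, one overlap.
retrieved = ["doc_a", "doc_b", "doc_c"]
relevant = ["doc_a", "doc_d"]

p = precision_at_k(retrieved, relevant, k=3)  # 1 relevant hit out of 3 retrieved
r = recall_at_k(retrieved, relevant, k=3)     # 1 relevant hit out of 2 relevant
f1 = f1_score(p, r)
print(round(p, 3), round(r, 3), round(f1, 3))
```

Because the comparison is exact set membership, a retrieved document only counts as a hit when it matches a reference entry character for character.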


In [None]:
def precision_at_k(retrieved, relevant, k):
    retrieved_k = retrieved[:k]
    return len(set(retrieved_k) & set(relevant)) / len(retrieved_k) if retrieved_k else 0

def recall_at_k(retrieved, relevant, k):
    retrieved_k = retrieved[:k]
    return len(set(retrieved_k) & set(relevant)) / len(relevant) if relevant else 0

def f1_score(precision, recall):
    return (2 * precision * recall) / (precision + recall) if (precision + recall) > 0 else 0

validation_df = pd.read_csv('capstone1_rag_validation.csv')
results = []
for _, row in validation_df.iterrows():
    question = row['question']
    relevant_docs = row['reference_context'].split('|')
    retrieved_docs = semantic_search(question)
    p = precision_at_k(retrieved_docs, relevant_docs, k=3)
    r = recall_at_k(retrieved_docs, relevant_docs, k=3)
    f1 = f1_score(p, r)
    results.append({"question": question, "precision@3": p, "recall@3": r, "f1": f1})

metrics_df = pd.DataFrame(results)
print(metrics_df)


In [None]:
# Answer the unseen test questions and save the results to submission.csv
submission_df = pd.read_csv('capstone1_rag_test_questions.csv')
results = []
for _, row in submission_df.iterrows():
    question = row['question']
    retrieved_docs = semantic_search(question)
    answer = generate_answer(question, retrieved_docs) if retrieved_docs else "The question cannot be answered using the available documents."
    results.append({"question": question, "retrieved_documents": retrieved_docs, "generated_answer": answer})
final_df = pd.DataFrame(results)
final_df.to_csv('submission.csv', index=False)
print("submission.csv created successfully.")
