# Clinical Intelligence System - Capstone Project

This notebook implements a Retrieval-Augmented Generation (RAG) pipeline for answering clinical questions using a trusted set of medical documents. It follows the capstone requirements and incorporates best practices from the provided training courses.

## Steps:
1. Load dataset
2. Create embeddings and store in ChromaDB
3. Explore multiple retrieval strategies
4. Integrate with a GPT model for generation
5. Validate using the evaluation dataset
6. Test on unseen questions and save results


In [None]:
# Install required packages (uncomment if running locally)
# !pip install openai chromadb pandas langchain

import pandas as pd
import chromadb
from chromadb.utils import embedding_functions
from openai import OpenAI
import numpy as np


In [None]:
# Load the main dataset
# Replace with actual file path when running
rag_dataset = pd.read_csv('capstone1_rag_dataset.csv')
print(f"Dataset loaded with {len(rag_dataset)} documents.")
rag_dataset.head()


In [None]:
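# Initialize OpenAI client and ChromaDB
import os

# OpenAI() reads the key from the OPENAI_API_KEY environment variable
client = OpenAI()

# Define embedding function using OpenAI's text-embedding-3-small.
# It must exist before the collection is created so it can be attached to it.
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key=os.environ["OPENAI_API_KEY"],
    model_name="text-embedding-3-small",
)

# Create ChromaDB client and collection; without embedding_function,
# Chroma embeds with its default model and openai_ef is never used
chroma_client = chromadb.Client()
collection = chroma_client.create_collection(name="medical_docs", embedding_function=openai_ef)

# Add documents to ChromaDB (ids are the DataFrame index as strings)
for idx, row in rag_dataset.iterrows():
    collection.add(documents=[row['document']], ids=[str(idx)], metadatas=[{"source": "medical"}])

print("Documents added to ChromaDB.")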


## Retrieval Strategy Exploration
We will implement and compare multiple strategies:
- Semantic Search
- Semantic Search with Threshold Filtering
- Hybrid Search (Keyword + Semantic)
- Reranking based on relevance scores
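
One way to combine a keyword ranking with a semantic ranking is reciprocal rank fusion (RRF). The sketch below illustrates that idea under stated assumptions and is not necessarily how this notebook's hybrid search will be built: `keyword_rank`, `reciprocal_rank_fusion`, and the constant `k=60` are hypothetical names and values, the keyword scorer is a plain term-overlap count rather than BM25, and the toy documents are made up.

```python
import re
from collections import defaultdict


def keyword_rank(query, docs):
    """Rank document ids by how many distinct query terms each document contains.

    `docs` maps a document id to its text. This is a simple term-overlap
    scorer used for illustration, not BM25.
    """
    terms = set(re.findall(r"\w+", query.lower()))
    scores = {
        doc_id: len(terms & set(re.findall(r"\w+", text.lower())))
        for doc_id, text in docs.items()
    }
    # Keep only documents that share at least one term, best match first
    matched = [doc_id for doc_id, score in scores.items() if score > 0]
    return sorted(matched, key=lambda doc_id: scores[doc_id], reverse=True)


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked id lists; each list adds 1 / (k + rank) per id."""
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)


# Toy data (made up for illustration)
docs = {
    "0": "Diabetes symptoms include increased thirst and frequent urination.",
    "1": "Hypertension is often called the silent killer.",
    "2": "Insulin regulates blood glucose levels.",
}
keyword_ids = keyword_rank("symptoms of diabetes", docs)
semantic_ids = ["0", "2", "1"]  # stand-in for ids returned by a vector search
hybrid_ids = reciprocal_rank_fusion([keyword_ids, semantic_ids])
print(hybrid_ids)
```

Because RRF works on ranks rather than raw scores, the keyword and semantic scores do not need to be on the same scale before they are combined.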


In [None]:
# Example: Semantic Search function
def semantic_search(query, top_k=3):
    results = collection.query(query_texts=[query], n_results=top_k)
    return results['documents'][0]

# Example usage
print(semantic_search("What are symptoms of diabetes?"))
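
# Threshold filtering can be layered on the same query by also reading the
# `distances` field that a Chroma query returns alongside `documents`.
# The following is a minimal sketch, assuming a smaller distance means a
# closer match. The helper name `filter_by_distance` and the cutoff 0.8 are
# hypothetical; the cutoff would need tuning on the validation set.

```python
def filter_by_distance(results, max_distance=0.8):
    """Keep documents whose distance is at or below max_distance.

    `results` has the shape of a Chroma query result for a single query:
    {"documents": [[...]], "distances": [[...]]}.
    The 0.8 default is an arbitrary placeholder, not a recommended value.
    """
    docs = results["documents"][0]
    distances = results["distances"][0]
    return [doc for doc, dist in zip(docs, distances) if dist <= max_distance]


# Hand-written stand-in for a query result (values are made up)
example_results = {
    "documents": [["doc about diabetes", "doc about insulin", "doc about fractures"]],
    "distances": [[0.35, 0.62, 1.10]],
}
kept = filter_by_distance(example_results, max_distance=0.8)
print(kept)
```

# When every document is filtered out, the helper returns an empty list, which
# a caller can treat as "no supporting context found".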


In [None]:
# Connect retriever with GPT model for generation
def generate_answer(question, retrieved_docs):
    # Join the retrieved documents into one context block, one per line
    context = "\n".join(retrieved_docs)
    prompt = (
        "You are a medical assistant. Answer the question based only on the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {question}\n"
        "Answer:"
    )
    response = client.chat.completions.create(
        model="gpt-4.1-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Example usage
retrieved = semantic_search("What are symptoms of diabetes?")
print(generate_answer("What are symptoms of diabetes?", retrieved))


In [None]:
# Load validation dataset and evaluate
validation_df = pd.read_csv('capstone1_rag_validation.csv')

# Placeholder for evaluation metrics calculation (Precision@k, Recall@k, F1)
# Implement metric functions and display results in a DataFrame
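
# The metrics named in the placeholder above could be computed per question
# along the following lines. This sketch assumes each validation row provides
# the ids of its relevant documents, which is a guess about the file's layout;
# `precision_recall_f1_at_k` is a hypothetical helper name and the ids below
# are made up.

```python
def precision_recall_f1_at_k(retrieved_ids, relevant_ids, k=3):
    """Return (precision@k, recall@k, F1) for a single query.

    precision@k = hits in the top k / k
    recall@k    = hits in the top k / number of relevant documents
    F1          = harmonic mean of the two (0.0 when both are 0)
    """
    top_k = list(retrieved_ids)[:k]
    relevant = set(relevant_ids)
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    precision = hits / k if k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up example: 2 of the top 3 retrieved ids are among 4 relevant ids
p, r, f1 = precision_recall_f1_at_k(["4", "9", "2"], ["4", "2", "7", "8"], k=3)
print(f"P@3={p:.3f}  R@3={r:.3f}  F1={f1:.3f}")
```

# Averaging the three values over all validation questions gives one summary
# row per retrieval strategy, which can then be collected in a DataFrame.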


In [None]:
# Load test dataset and generate answers
submission_df = pd.read_csv('capstone1_rag_test_questions.csv')
results = []

for _, row in submission_df.iterrows():
    question = row['question']
    retrieved_docs = semantic_search(question)
    answer = generate_answer(question, retrieved_docs) if retrieved_docs else "The question cannot be answered using the available documents."
    results.append({"question": question, "retrieved_documents": retrieved_docs, "generated_answer": answer})

# Save to submission.csv
final_df = pd.DataFrame(results)
final_df.to_csv('submission.csv', index=False)
print("submission.csv created successfully.")
