arshad98333/MedicalDoc-RagSearch

AI-Powered Clinical Guideline Search with HIPAA Guardrails

This repository contains the source code for a production-ready Retrieval-Augmented Generation (RAG) system for the healthcare domain. Clinicians can ask natural language questions against a large corpus of medical documents and receive source-cited answers, while a PHI-detection guardrail blocks queries that appear to contain protected health information.



Vision

The vision behind this project is to bridge the information gap for healthcare professionals. Clinicians are often forced to sift through hundreds of pages of dense clinical guidelines to find critical information, a process that is both time-consuming and prone to error.

This project leverages state-of-the-art AI to create an intelligent assistant that:

  • Understands medical semantics
  • Retrieves precise information instantly
  • Synthesizes trustworthy answers

Ultimately, it aims to improve patient care by providing faster access to knowledge.


Live Demo

A live version of this application is deployed on Streamlit Community Cloud. Access the Application Here


The Problem Solved

Example scenario:

  • A clinician needs the recommended dosage of a drug for a specific patient profile in new clinical guidelines.
  • A keyword search for the drug might yield hundreds of irrelevant results.
  • Searching for "heart attack" might miss documents that only use the term "myocardial infarction."

This system solves that by:

  1. Understanding Semantics → Goes beyond keywords to capture meaning.
  2. Retrieving Relevant Context → Pinpoints the most relevant text.
  3. Synthesizing Accurate Answers → Generates concise, human-readable answers with citations.

Key Features

  • Semantic Search → Natural language queries return conceptually related results.
  • Retrieval-Augmented Generation (RAG) → Synthesized answers grounded in real documents.
  • Source Citations → Transparency via exact referenced text chunks.
  • HIPAA Guardrail → Regex-based PHI detection blocks queries that appear to contain sensitive data before they reach retrieval and generation.
  • Modular & Scalable → Clean, production-ready architecture.
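
The HIPAA guardrail feature can be illustrated with a minimal sketch. The patterns and function names below (`PHI_PATTERNS`, `detect_phi`, `is_query_allowed`) are assumptions made for illustration, not the code in `app/hipaa_guardrail/`, and a handful of regexes covers only a small fraction of real PHI.

```python
import re

# Illustrative patterns only (assumed, not taken from the repository).
# Real PHI takes many more forms than these.
PHI_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "date": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    "mrn": re.compile(r"\bMRN[:#\s]*\d{6,10}\b", re.IGNORECASE),
}


def detect_phi(text):
    """Return the sorted names of every pattern that matches the text."""
    return sorted(
        name for name, pattern in PHI_PATTERNS.items() if pattern.search(text)
    )


def is_query_allowed(text):
    """A query passes the guardrail only when no pattern matches."""
    return not detect_phi(text)


if __name__ == "__main__":
    print(detect_phi("Patient SSN 123-45-6789, call 555-123-4567"))
    print(is_query_allowed("What is the recommended dosage of metformin?"))
```

Returning the names of the matched patterns, rather than a bare boolean, lets the UI tell the user why a query was blocked.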

Technology Stack & Architecture

  • Frontend: Streamlit
  • Vector Database: Pinecone (Serverless)
  • Embedding Model: sentence-transformers/all-MiniLM-L6-v2
  • Language Model: OpenAI GPT-3.5-Turbo
  • PDF Processing: PyMuPDF

Architecture Flow

User Query → [Streamlit UI] → [HIPAA Guardrail]
   ├─(PHI Detected) → BLOCK
   └─(Query Clean) → [RAG Core]
                     ├─ Embed Query (SentenceTransformer)
                     ├─ Query Pinecone Vector DB
                     ├─ Retrieve Top-K Chunks
                     ├─ Augment Prompt w/ Context
                     └─ Generate Answer (OpenAI GPT)

Answer + Sources → [Streamlit UI] → User
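
The flow above can be sketched as a single function that wires the stages together. This is a rough outline under stated assumptions: `answer_query` and its parameters are hypothetical names, and the guardrail, embedder, vector search, and LLM call are passed in as plain callables so the example runs without any external service.

```python
def answer_query(query, contains_phi, embed, search, generate, top_k=3):
    """Run one query through guardrail -> embed -> retrieve -> augment -> generate.

    All four callables are stand-ins (assumed interfaces):
      contains_phi(query) -> bool
      embed(query) -> list of floats
      search(vector, top_k) -> list of text chunks
      generate(prompt) -> str
    """
    # HIPAA guardrail: block before anything leaves the application.
    if contains_phi(query):
        return {"blocked": True, "answer": None, "sources": []}

    # Embed the query and retrieve the top-k chunks.
    query_vector = embed(query)
    chunks = list(search(query_vector, top_k))

    # Augment the prompt with numbered context so the answer can cite sources.
    context = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1)
    )
    prompt = (
        "Answer the question using only the context below. "
        "Cite sources by their bracketed number.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )

    # Generate the answer and return it together with its sources.
    return {"blocked": False, "answer": generate(prompt), "sources": chunks}
```

Passing the stages in as callables keeps the orchestration testable with stubs; in the real application they would wrap SentenceTransformer, Pinecone, and the OpenAI API.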

Core RAG Concepts Implemented

  • Embed → Store → Retrieve workflow:
    • Embed: Chunks converted into vectors.
    • Store: Vectors stored in Pinecone.
    • Retrieve: Query embeddings used for cosine similarity search.
  • RAG Pattern: Retrieved chunks injected into GPT prompt for grounded answers.
  • Data Handling & Chunking: Large PDFs split into smaller overlapping segments.
  • HIPAA Compliance Guardrail: Regex-based PHI detector halts risky queries.
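
Two of the concepts above, overlapping chunking and cosine similarity search, can be shown in a few lines of plain Python. This is a toy sketch: the function names, the character-based chunk sizes, and the in-memory list of vectors are assumptions, and in the actual system Pinecone performs the similarity search.

```python
import math


def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into character windows that share `overlap` characters."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
        start += step
    return chunks


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_chunks(query_vector, stored, k=3):
    """Return the ids of the k stored vectors most similar to the query.

    `stored` is a list of (chunk_id, vector) pairs.
    """
    ranked = sorted(
        stored,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [chunk_id for chunk_id, _ in ranked[:k]]
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.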

Project Structure

├── app/
│   ├── ui/              # Streamlit UI code
│   ├── rag_core/        # Core RAG pipeline logic
│   └── hipaa_guardrail/ # PHI detection module
├── data/
│   └── medical_corpus/  # PDF documents
├── scripts/
│   └── ingest_data.py   # Ingestion script for Pinecone
├── .env                 # Secret API keys (gitignored)
├── requirements.txt     # Dependencies
└── LICENSE              # MIT License

Getting Started (Local Setup)

1. Prerequisites

  • Python 3.9+
  • Git
  • Pinecone & OpenAI API keys

2. Clone the Repository

git clone https://github.com/arshad98333/MedicalDoc-RagSearch.git
cd MedicalDoc-RagSearch

3. Set Up Virtual Environment & Install Dependencies

# Create & activate venv
python -m venv venv
source venv/bin/activate   # On Windows: venv\Scripts\activate

# Install packages
pip install -r requirements.txt

4. Configure Environment Variables

Create a .env file in the root folder:

PINECONE_API_KEY="YOUR_PINECONE_API_KEY"
OPENAI_API_KEY="sk-YOUR_OPENAI_API_KEY"
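
A small startup check can catch a missing key early. The helper below is a sketch under assumptions: `load_api_keys` is a hypothetical name, and it reads only process environment variables. Loading the `.env` file into the environment (for example with the python-dotenv package) is assumed to happen beforehand and is not shown.

```python
import os

# The two variable names come from the .env example above.
REQUIRED_KEYS = ("PINECONE_API_KEY", "OPENAI_API_KEY")


def load_api_keys(env=None):
    """Return the required keys, or raise if any is missing or empty.

    `env` defaults to os.environ; a plain dict can be passed for testing.
    """
    env = os.environ if env is None else env
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise RuntimeError(
            "Missing required environment variables: " + ", ".join(missing)
        )
    return {key: env[key] for key in REQUIRED_KEYS}
```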

5. Ingest Your Data

  • Place PDF files into data/medical_corpus/
  • Run ingestion script:
python scripts/ingest_data.py
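
For orientation, an ingestion run might look roughly like the sketch below. It is not the contents of `scripts/ingest_data.py`: the function names, the index name `medical-docs`, the chunk sizes, and the record layout are all assumptions. It also assumes a Pinecone index with dimension 384 (the output size of all-MiniLM-L6-v2) already exists. Third-party imports sit inside the functions so the pure helpers can be used without them.

```python
import os
from pathlib import Path


def split_into_chunks(text, chunk_size=500, overlap=50):
    """Fixed-size character windows with overlap (sizes are assumed)."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


def build_records(source, chunks, vectors):
    """Pair each chunk with its vector in the dict format Pinecone accepts."""
    return [
        {
            "id": f"{source}-{i}",
            "values": list(vector),
            "metadata": {"source": source, "text": chunk},
        }
        for i, (chunk, vector) in enumerate(zip(chunks, vectors))
    ]


def extract_pdf_text(pdf_path):
    """Concatenate the plain text of every page using PyMuPDF."""
    import fitz  # PyMuPDF

    with fitz.open(pdf_path) as doc:
        return "\n".join(page.get_text() for page in doc)


def ingest_folder(folder="data/medical_corpus", index_name="medical-docs"):
    """Embed every PDF in `folder` and upsert the chunks (hypothetical flow)."""
    from pinecone import Pinecone
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    index = Pinecone(api_key=os.environ["PINECONE_API_KEY"]).Index(index_name)

    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        chunks = split_into_chunks(extract_pdf_text(pdf_path))
        if not chunks:
            continue  # skip PDFs with no extractable text
        vectors = model.encode(chunks).tolist()
        records = build_records(pdf_path.name, chunks, vectors)
        # Upsert in small batches to keep each request modest in size.
        for start in range(0, len(records), 100):
            index.upsert(vectors=records[start:start + 100])
```

Storing the chunk text in the metadata is one way to return the exact source passage alongside each answer without a second lookup.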

6. Run the Application

streamlit run app/ui/main_ui.py

Open your browser at → http://localhost:8501


Deployment

This app is deployed on Streamlit Community Cloud.

Steps:

  1. Push code to a public GitHub repo.
  2. Ensure .env is in .gitignore.
  3. On Streamlit Cloud → Click New app → Select repo & branch.
  4. Set main file path: app/ui/main_ui.py.
  5. Add API keys under Secrets Management.
  6. Click Deploy!

Streamlit handles the deployment automatically.


Future Improvements

  • More Robust PHI Detection: Upgrade regex → NER-based models (spaCy, healthcare-specific).
  • Evaluation Framework: Integrate tools like RAGAs.
  • Async Operations: Improve speed for large docs.
  • Enhanced UI: Conversation history, feedback, in-text highlights.

License

MIT License – see LICENSE.

About

A Retrieval-Augmented Generation (RAG) pipeline designed for healthcare and medical research. This system enables semantic search over medical documents (PDFs, clinical guidelines, research papers) and augments LLMs with trusted medical knowledge for safer, more accurate responses.
