AI-Powered Clinical Guideline Search with HIPAA Guardrails

This repository contains the source code for a production-ready Retrieval-Augmented Generation (RAG) system designed for the healthcare domain. Clinicians ask natural-language questions against a large corpus of medical documents and receive source-cited answers, while a guardrail screens every query for protected health information (PHI) before it reaches the language model.

Vision

The vision behind this project is to bridge the information gap for healthcare professionals. Clinicians are often forced to sift through hundreds of pages of dense clinical guidelines to find critical information, a process that is both time-consuming and prone to error.

This project uses semantic embeddings and a large language model to create an intelligent assistant that:

Understands medical semantics
Retrieves precise information instantly
Synthesizes trustworthy answers

Ultimately, it aims to improve patient care by providing faster access to knowledge.

Live Demo

A live version of this application is deployed on Streamlit Community Cloud. Access the Application Here

The Problem Solved

Example scenario:

A clinician needs to find, in newly published clinical guidelines, the recommended dosage of a drug for a specific patient profile.
A keyword search for the drug might yield hundreds of irrelevant results.
Searching for "heart attack" might miss documents that only use the term "myocardial infarction."

This system solves that by:

Understanding Semantics → Goes beyond keywords to capture meaning.
Retrieving Relevant Context → Pinpoints the most relevant text.
Synthesizing Accurate Answers → Generates concise, human-readable answers with citations.

Key Features

Semantic Search → Natural language queries return conceptually related results.
Retrieval-Augmented Generation (RAG) → Synthesized answers grounded in real documents.
Source Citations → Transparency via exact referenced text chunks.
HIPAA Guardrail → Regex-based PHI detection blocks queries that contain sensitive data before they reach the language model.
Modular & Scalable → Clean, production-ready architecture.

Technology Stack & Architecture

Frontend: Streamlit
Vector Database: Pinecone (Serverless)
Embedding Model: sentence-transformers/all-MiniLM-L6-v2
Language Model: OpenAI GPT-3.5-Turbo
PDF Processing: PyMuPDF

Architecture Flow

User Query → [Streamlit UI] → [HIPAA Guardrail]
   ├─(PHI Detected) → BLOCK
   └─(Query Clean) → [RAG Core]
                     ├─ Embed Query (SentenceTransformer)
                     ├─ Query Pinecone Vector DB
                     ├─ Retrieve Top-K Chunks
                     ├─ Augment Prompt w/ Context
                     └─ Generate Answer (OpenAI GPT)

Answer + Sources → [Streamlit UI] → User
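
The flow above can be sketched as a single function. This is a minimal illustration under stated assumptions, not the repository's code: `contains_phi`, `answer_query`, the two regex patterns, the prompt wording, and the injected `retrieve` and `generate` callables are hypothetical stand-ins for the guardrail module, the Pinecone query, and the OpenAI call.

```python
import re
from typing import Callable, Dict, List

# Hypothetical PHI patterns for illustration only; a real guardrail needs broader coverage.
PHI_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),                 # US Social Security number format
    re.compile(r"\bMRN[:\s]*\d{6,}\b", re.IGNORECASE),    # medical record number (assumed format)
]


def contains_phi(query: str) -> bool:
    """Return True if any PHI pattern matches the query."""
    return any(pattern.search(query) for pattern in PHI_PATTERNS)


def answer_query(
    query: str,
    retrieve: Callable[[str, int], List[str]],
    generate: Callable[[str], str],
    top_k: int = 3,
) -> Dict[str, object]:
    """Gate the query on the guardrail, then retrieve, augment, and generate."""
    if contains_phi(query):
        # PHI detected: block before anything is embedded or sent to an external API.
        return {"blocked": True, "answer": None, "sources": []}

    chunks = retrieve(query, top_k)            # stands in for embed + vector search
    context = "\n\n".join(chunks)
    prompt = (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    return {"blocked": False, "answer": generate(prompt), "sources": chunks}
```

Passing `retrieve` and `generate` as arguments keeps the gating logic testable without network access; in the running application they would wrap the vector database and the language model.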

Core RAG Concepts Implemented

Embed → Store → Retrieve workflow:
- Embed: Chunks converted into vectors.
- Store: Vectors stored in Pinecone.
- Retrieve: Query embeddings used for cosine similarity search.
RAG Pattern: Retrieved chunks injected into GPT prompt for grounded answers.
Data Handling & Chunking: Large PDFs split into smaller overlapping segments.
HIPAA Compliance Guardrail: Regex-based PHI detector halts risky queries.
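
Two of the concepts above, overlapping chunking and cosine-similarity retrieval, can be illustrated in plain Python. This is a sketch, not the project's implementation: the function names, the character-based window, and the default sizes are assumptions (the repository may split by tokens or sentences), and in the real system the similarity search runs inside Pinecone rather than in memory.

```python
import math
from typing import List, Sequence, Tuple


def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    """Split text into character windows that share `overlap` characters with their neighbor."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(
    query_vec: Sequence[float],
    indexed: Sequence[Tuple[str, Sequence[float]]],
    k: int = 3,
) -> List[Tuple[float, str]]:
    """Rank (text, vector) pairs by similarity to the query and keep the best k."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in indexed]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.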

Project Structure

├── app/
│   ├── ui/              # Streamlit UI code
│   ├── rag_core/        # Core RAG pipeline logic
│   └── hipaa_guardrail/ # PHI detection module
├── data/
│   └── medical_corpus/  # PDF documents
├── scripts/
│   └── ingest_data.py   # Ingestion script for Pinecone
├── .env                 # Secret API keys (gitignored)
├── requirements.txt     # Dependencies
└── LICENSE              # MIT License

Getting Started (Local Setup)

1. Prerequisites

Python 3.9+
Git
Pinecone & OpenAI API keys

2. Clone the Repository

git clone https://github.com/arshad98333/MedicalDoc-RagSearch.git
cd MedicalDoc-RagSearch

3. Set Up Virtual Environment & Install Dependencies

# Create & activate venv
python -m venv venv
source venv/bin/activate   # On Windows: venv\Scripts\activate

# Install packages
pip install -r requirements.txt

4. Configure Environment Variables

Create a .env file in the root folder:

PINECONE_API_KEY="YOUR_PINECONE_API_KEY"
OPENAI_API_KEY="sk-YOUR_OPENAI_API_KEY"
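
A startup check along these lines can fail fast when a key is missing. It is a hedged sketch: `load_api_keys` and `REQUIRED_KEYS` are hypothetical names, and the snippet only reads variables that are already in the process environment. Loading the .env file itself is a separate step, commonly done with a helper such as python-dotenv's `load_dotenv()`; whether this project uses it is an assumption to verify against requirements.txt.

```python
import os
from typing import Dict, Mapping, Sequence

# The two variable names come from the .env example above.
REQUIRED_KEYS = ("PINECONE_API_KEY", "OPENAI_API_KEY")


def load_api_keys(
    env: Mapping[str, str] = os.environ,
    required: Sequence[str] = REQUIRED_KEYS,
) -> Dict[str, str]:
    """Return the required keys, or raise with the names of any that are unset or blank."""
    missing = [name for name in required if not env.get(name, "").strip()]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in required}
```

Accepting the environment as a parameter makes the check easy to exercise with a plain dictionary in tests.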

5. Ingest Your Data

Place PDF files into data/medical_corpus/
Run ingestion script:

python scripts/ingest_data.py
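
The internals of ingest_data.py are not shown here, so the following is only a sketch of the bookkeeping such a script typically needs: finding the PDFs, giving each chunk a stable ID, and grouping records into batches for upload. All function names, the ID scheme, and the batch size of 100 are assumptions; text extraction (PyMuPDF) and embedding are left out, and the embedding vector would be added to each record before it is sent to Pinecone.

```python
from pathlib import Path
from typing import Dict, Iterator, List


def find_pdfs(corpus_dir: str) -> List[Path]:
    """List the PDF files directly inside the corpus directory, in a stable order."""
    return sorted(Path(corpus_dir).glob("*.pdf"))


def make_records(source: str, chunks: List[str]) -> List[Dict]:
    """Attach a deterministic ID and citation metadata to each chunk of one document."""
    return [
        {"id": f"{source}-{i}", "metadata": {"source": source, "text": chunk}}
        for i, chunk in enumerate(chunks)
    ]


def batched(records: List[Dict], batch_size: int = 100) -> Iterator[List[Dict]]:
    """Yield consecutive slices of at most `batch_size` records."""
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]
```

Deterministic IDs mean that re-running ingestion on the same file overwrites its earlier records instead of duplicating them, and storing the chunk text in metadata is what lets the UI show the cited passage next to an answer.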

6. Run the Application

streamlit run app/ui/main_ui.py

Open your browser at → http://localhost:8501

Deployment

This app is deployed on Streamlit Community Cloud.

Steps:

Push code to a public GitHub repo.
Ensure .env is in .gitignore.
On Streamlit Cloud → Click New app → Select repo & branch.
Set main file path: app/ui/main_ui.py.
Add API keys under Secrets Management.
Click Deploy!

Streamlit handles the deployment automatically.

Future Improvements

More Robust PHI Detection: Replace the regex detector with NER-based models (e.g. spaCy or healthcare-specific models).
Evaluation Framework: Integrate tools like RAGAs.
Async Operations: Improve speed for large docs.
Enhanced UI: Conversation history, feedback, in-text highlights.

License

MIT License – see LICENSE.
