This repository contains the source code for a production-ready Retrieval-Augmented Generation (RAG) system designed for the healthcare domain. It gives clinicians an intuitive interface for asking natural-language questions against a large corpus of medical documents and receiving source-cited answers, while a PHI-detection guardrail screens every query as a layer of HIPAA protection.
- Vision
- Live Demo
- The Problem Solved
- Key Features
- Technology Stack & Architecture
- Core RAG Concepts Implemented
- Project Structure
- Getting Started (Local Setup)
- Deployment
- Future Improvements
- License
The vision behind this project is to bridge the information gap for healthcare professionals. Clinicians are often forced to sift through hundreds of pages of dense clinical guidelines to find critical information, a process that is both time-consuming and prone to error.
This project leverages state-of-the-art AI to create an intelligent assistant that:
- Understands medical semantics
- Retrieves precise information instantly
- Synthesizes trustworthy answers
Ultimately, it aims to improve patient care by providing faster access to knowledge.
A live version of this application is deployed on Streamlit Community Cloud.
Example scenario:
- A clinician needs the recommended dosage of a drug for a specific patient profile in new clinical guidelines.
- A keyword search for the drug might yield hundreds of irrelevant results.
- Searching for "heart attack" might miss documents that only use the term "myocardial infarction."
This system solves that by:
- Understanding Semantics → Goes beyond keywords to capture meaning.
- Retrieving Relevant Context → Pinpoints the most relevant text.
- Synthesizing Accurate Answers → Generates concise, human-readable answers with citations.
- Semantic Search → Natural language queries return conceptually related results.
- Retrieval-Augmented Generation (RAG) → Synthesized answers grounded in real documents.
- Source Citations → Transparency via exact referenced text chunks.
- HIPAA Guardrail → PHI detection prevents sensitive data leaks.
- Modular & Scalable → Clean, production-ready architecture.
- Frontend: Streamlit
- Vector Database: Pinecone (Serverless)
- Embedding Model: sentence-transformers/all-MiniLM-L6-v2
- Language Model: OpenAI GPT-3.5-Turbo
- PDF Processing: PyMuPDF
User Query → [Streamlit UI] → [HIPAA Guardrail]
    ├─(PHI Detected) → BLOCK
    └─(Query Clean) → [RAG Core]
        ├─ Embed Query (SentenceTransformer)
        ├─ Query Pinecone Vector DB
        ├─ Retrieve Top-K Chunks
        ├─ Augment Prompt w/ Context
        └─ Generate Answer (OpenAI GPT)
Answer + Sources → [Streamlit UI] → User
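The flow above can be sketched in a few lines of plain Python. This is a minimal illustration, not the repository's actual code: the function names (`detect_phi`, `build_prompt`, `answer_query`), the three regex patterns, and the prompt wording are all assumptions, and the `retrieve` and `generate` arguments stand in for the Pinecone query and the OpenAI call.

```python
import re
from typing import Callable, Dict, List

# Illustrative PHI patterns only (assumptions); a real guardrail needs far
# broader coverage than these three.
PHI_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}


def detect_phi(text: str) -> List[str]:
    """Return the names of all patterns that match the text."""
    return [name for name, pattern in PHI_PATTERNS.items() if pattern.search(text)]


def build_prompt(question: str, chunks: List[str]) -> str:
    """Inject retrieved chunks into the prompt, numbered so they can be cited."""
    context = "\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer using only the context below and cite sources by number.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


def answer_query(
    query: str,
    retrieve: Callable[[str, int], List[str]],  # stand-in for embed + vector search
    generate: Callable[[str], str],             # stand-in for the LLM call
    top_k: int = 3,
) -> Dict[str, object]:
    """Gate the query on the guardrail, then retrieve, augment, and generate."""
    findings = detect_phi(query)
    if findings:
        # PHI detected: block before anything leaves the application.
        return {"blocked": True, "reason": findings, "answer": None, "sources": []}
    chunks = retrieve(query, top_k)
    answer = generate(build_prompt(query, chunks))
    return {"blocked": False, "reason": [], "answer": answer, "sources": chunks}
```

Passing the retriever and generator in as callables keeps the guardrail check first in line and makes the flow testable without network access.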
- Embed → Store → Retrieve workflow:
  - Embed: Chunks converted into vectors.
  - Store: Vectors stored in Pinecone.
  - Retrieve: Query embeddings used for cosine similarity search.
- RAG Pattern: Retrieved chunks injected into GPT prompt for grounded answers.
- Data Handling & Chunking: Large PDFs split into smaller overlapping segments.
- HIPAA Compliance Guardrail: Regex-based PHI detector halts risky queries.
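To make the chunking and retrieval concepts above concrete, here is a small standard-library sketch. It is only an illustration under stated assumptions: the function names, the character-based window, and the default sizes (500 characters with 50 overlap) are hypothetical, and in the described system the vectors would come from the embedding model and the similarity search would be performed by Pinecone rather than in Python.

```python
import math
from typing import List, Sequence, Tuple


def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    """Split text into character windows that overlap by `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: Sequence[float],
    stored: Sequence[Tuple[str, Sequence[float]]],  # (chunk id, embedding) pairs
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Rank stored vectors by cosine similarity to the query and keep the best k."""
    scored = [(cosine_similarity(query_vec, vec), chunk_id) for chunk_id, vec in stored]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(chunk_id, score) for score, chunk_id in scored[:k]]
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, which is the usual reason for splitting documents this way.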
├── app/
│ ├── ui/ # Streamlit UI code
│ ├── rag_core/ # Core RAG pipeline logic
│ └── hipaa_guardrail/ # PHI detection module
├── data/
│ └── medical_corpus/ # PDF documents
├── scripts/
│ └── ingest_data.py # Ingestion script for Pinecone
├── .env # Secret API keys (gitignored)
├── requirements.txt # Dependencies
└── LICENSE # MIT License
- Python 3.9+
- Git
- Pinecone & OpenAI API keys
git clone https://github.com/arshad98333/MedicalDoc-RagSearch.git
cd MedicalDoc-RagSearch
# Create & activate venv
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install packages
pip install -r requirements.txt
Create a .env file in the root folder:
PINECONE_API_KEY="YOUR_PINECONE_API_KEY"
OPENAI_API_KEY="sk-YOUR_OPENAI_API_KEY"
- Place PDF files into data/medical_corpus/
- Run ingestion script:
python scripts/ingest_data.py
- Start the Streamlit app:
streamlit run app/ui/main_ui.py
Open your browser at → http://localhost:8501
This app is deployed on Streamlit Community Cloud.
Steps:
- Push code to a public GitHub repo.
- Ensure .env is in .gitignore.
- On Streamlit Cloud → Click New app → Select repo & branch.
- Set main file path: app/ui/main_ui.py.
- Add API keys under Secrets Management.
- Click Deploy!
Streamlit handles the deployment automatically.
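Because the keys live in a .env file locally but in Secrets Management on Streamlit Cloud, the app needs to find them in either place. One possible approach is sketched below; the helper name `resolve_key` and the lookup order are assumptions, not something the repository is known to contain. The `secrets` argument can be any mapping, which in a Streamlit app could be `st.secrets`.

```python
import os
from typing import Mapping, Optional

# Key names taken from the .env example in the setup instructions.
REQUIRED_KEYS = ("PINECONE_API_KEY", "OPENAI_API_KEY")


def resolve_key(
    name: str,
    secrets: Mapping[str, str],
    environ: Optional[Mapping[str, str]] = None,
) -> str:
    """Look a key up in the secrets mapping first, then in the environment."""
    environ = os.environ if environ is None else environ
    value = secrets.get(name) or environ.get(name)
    if not value:
        # Fail fast with a clear message instead of a late API error.
        raise RuntimeError(f"Missing required key: {name}")
    return value
```

Calling this once per entry in `REQUIRED_KEYS` at startup surfaces a missing secret immediately after deployment rather than on the first query.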
- More Robust PHI Detection: Upgrade regex → NER-based models (spaCy, healthcare-specific).
- Evaluation Framework: Integrate tools like RAGAs.
- Async Operations: Improve speed for large docs.
- Enhanced UI: Conversation history, feedback, in-text highlights.
MIT License – see LICENSE.