# MediVault AI — Offline Clinical Intelligence Platform

A fully offline clinical intelligence platform that captures doctor-patient consultations via browser-based audio recording or direct file upload, transcribes speech using a local Whisper ASR container, performs LLM-driven speaker diarization, and generates structured SOAP notes with ICD-10 and CPT billing codes — all without any cloud dependency. Approved notes are embedded into a persistent ChromaDB vector store, enabling retrieval-augmented clinical Q&A and PDF knowledge base ingestion entirely within a local Docker environment.
MediVault AI demonstrates how modern open-source AI components can be composed into a clinical documentation workflow that operates entirely on-premises. The platform accepts raw consultation audio, produces a structured SOAP note with billing codes, stores approved notes in a semantic vector database, and exposes a conversational clinical Q&A interface backed by retrieval-augmented generation — all without transmitting any data to external services.
This makes MediVault AI suitable for:
- Clinical AI research — reference implementation of a full speech-to-note pipeline
- Air-gapped environments — run fully offline with Ollama and locally hosted models
- Clinical informatics engineering — integrate Whisper ASR, Ollama inference, Flowise chain orchestration, and ChromaDB vector storage
- Healthcare AI prototyping — build and evaluate offline clinical documentation tooling
- The clinician records a consultation in the browser or uploads a WAV or MP3 file.
- The React frontend sends the audio to the FastAPI backend.
- The backend forwards the audio to the Whisper ASR container and receives a timestamped transcript.
- The backend sends the transcript segments to Ollama for LLM-driven speaker diarization — each segment is labelled Doctor or Patient.
- The diarized transcript is sent to the Flowise SOAP Generator, which invokes an LLMChain with a specialty-aware prompt via Ollama and returns a structured SOAP note.
- The clinician reviews and edits the SOAP note, then requests ICD-10 and CPT billing codes from the backend.
- After review, the clinician approves the note — the backend embeds it into ChromaDB using `nomic-embed-text` via the direct Python client.
- Approved notes and uploaded clinical PDFs are immediately available to the Clinical QA system, which retrieves relevant passages and passes them to Ollama for grounded answers.
All inference runs through Ollama on the host machine. No data leaves the local environment.
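A condensed sketch of the transcription and diarization legs of this pipeline, assuming the default service URLs on the Docker network (the `process_consultation` helper and the diarization prompt are illustrative, not the backend's actual code):

```python
import requests

WHISPER = "http://medivault-whisper:9000"     # Whisper ASR container
OLLAMA = "http://host.docker.internal:11434"  # Ollama on the host

def process_consultation(audio_path: str) -> dict:
    # 1. Speech-to-text: the Whisper webservice returns timestamped segments.
    with open(audio_path, "rb") as f:
        transcript = requests.post(
            f"{WHISPER}/asr",
            params={"task": "transcribe", "output": "json"},
            files={"audio_file": f},
        ).json()

    # 2. LLM-driven diarization: ask Ollama to label every segment.
    prompt = (
        "Label each consultation segment as Doctor or Patient:\n"
        + "\n".join(seg["text"] for seg in transcript["segments"])
    )
    labels = requests.post(
        f"{OLLAMA}/api/generate",
        json={"model": "llama3.1:8b", "prompt": prompt, "stream": False},
    ).json()["response"]

    return {"segments": transcript["segments"], "diarization": labels}
```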
The application follows a modular five-service architecture. The React frontend communicates exclusively with the FastAPI backend. The backend delegates speech-to-text to the Whisper container, invokes Flowise LLM chains for SOAP generation, writes and queries embeddings directly to ChromaDB, and calls Ollama for diarization, billing codes, and clinical Q&A. Flowise flows are auto-provisioned at startup so the stack is fully operational on the first `docker compose up`.
```mermaid
graph TB
subgraph Client Layer
UI[React UI<br/>port 3000]
end
subgraph Backend Layer
API[FastAPI<br/>port 5001]
end
subgraph Flowise Orchestration Layer
FW[Flowise<br/>port 3001]
end
subgraph Vector Store
CDB[ChromaDB<br/>port 8100]
end
subgraph Speech Processing
WH[Whisper ASR<br/>port 9000]
end
subgraph LLM Inference
OL[Ollama<br/>host machine]
end
UI -->|HTTP / Axios| API
API -->|Audio upload| WH
API -->|Chain invocation| FW
API -->|Direct Python client| CDB
FW -->|LLM calls| OL
API -->|Embeddings + diarization| OL
```
### Frontend (React + Vite)
- Consultation Recorder — mode toggle between browser recording and file upload, specialty selector, diarized transcript view with Doctor/Patient colour-coded labels
- SOAP Note Editor — human-in-the-loop review with editable sections, billing code generation, and approve-to-knowledge-base action
- Clinical Chat — conversational Q&A with cited source documents
- Knowledge Base — document list with PDF upload and document deletion
- Nginx serves the production build and proxies all `/api/` requests to the backend
### Backend Services
- API Server (`server.py`): FastAPI application with CORS middleware, request validation, and all route handlers
- Whisper Client (`services/whisper_client.py`): Submits audio to the Whisper ASR container and returns timestamped segments
- LLM Client (`services/llm_client.py`): Calls Ollama directly for speaker diarization, billing code generation, and clinical Q&A
- Flowise Client (`services/flowise_client.py`): Invokes Flowise prediction and upsert endpoints
- Flowise Provisioner (`services/flowise_provisioner.py`): Auto-creates the three Flowise flows at API startup if they do not already exist
- Chroma Client (`services/chroma_client.py`): Writes and queries the `clinical_kb` ChromaDB collection using the direct Python client
- PDF Service (`services/pdf_service.py`): Validates and extracts text from uploaded PDF files
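As an illustration of the PDF service's core job, text extraction with `pypdf` can be as small as the sketch below (the size guard mirrors `MAX_FILE_SIZE`; the function name and validation details are assumptions, not the module's actual API):

```python
import os
from pypdf import PdfReader

def extract_pdf_text(path: str, max_bytes: int = 10 * 1024 * 1024) -> str:
    # Reject oversized uploads before parsing, mirroring MAX_FILE_SIZE.
    if os.path.getsize(path) > max_bytes:
        raise ValueError("PDF exceeds the configured MAX_FILE_SIZE limit")
    reader = PdfReader(path)
    # extract_text() returns None for pages without a text layer (e.g. scans).
    return "\n".join((page.extract_text() or "") for page in reader.pages)
```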
### External Integration
- LLM inference: Ollama running natively on the host machine, accessed from containers via `host.docker.internal:11434`
- LLM orchestration: Flowise running as a Docker service, auto-provisioned with three flows at startup
- Vector store: ChromaDB running as a Docker service with a persistent named volume
| Service | Container | Host Port | Description |
|---|---|---|---|
| `medivault-api` | `medivault-api` | 5001 | FastAPI backend — transcription, SOAP generation, RAG, billing codes |
| `medivault-ui` | `medivault-ui` | 3000 | React frontend — served by Nginx, proxies `/api/` to the backend |
| `medivault-flowise` | `medivault-flowise` | 3001 | Flowise — LLM chain orchestration, auto-provisioned flows |
| `medivault-chromadb` | `medivault-chromadb` | 8100 | ChromaDB — persistent vector store for clinical knowledge base |
| `medivault-whisper` | `medivault-whisper` | 9000 | Whisper ASR — speech-to-text with timestamped segment output |
Ollama is intentionally not a Docker service. Running Ollama inside a container typically forfeits GPU acceleration. Ollama must run natively on the host so the backend and Flowise containers can reach it via `host.docker.internal:11434`.
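To confirm this wiring from inside a container, a one-off check against Ollama's model-listing endpoint (`/api/tags`) looks like this (a sketch, not part of the project's code):

```python
import requests

# localhost inside a container is the container itself; Ollama on the host
# is only reachable through the host.docker.internal alias.
resp = requests.get("http://host.docker.internal:11434/api/tags", timeout=5)
models = [m["name"] for m in resp.json().get("models", [])]
print("Ollama reachable; installed models:", models)
assert any(m.startswith("llama3.1") for m in models), "run: ollama pull llama3.1:8b"
```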
- Clinician records or uploads consultation audio in the web UI.
- The backend transcribes the audio via Whisper ASR and receives timestamped segments.
- The backend calls Ollama to classify each segment as Doctor or Patient.
- The diarized transcript is sent to the Flowise SOAP Generator — Flowise invokes the LLMChain via Ollama and returns structured SOAP JSON.
- The clinician reviews the note and requests billing codes — the backend calls Ollama directly for ICD-10 and CPT suggestions.
- The clinician approves the note — the backend embeds it into ChromaDB via the direct Python client.
- The clinician asks a clinical question — the backend queries ChromaDB for relevant passages, passes them to Ollama, and returns a grounded answer with cited sources.
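For the Ollama-direct steps, billing-code suggestion is essentially one structured-output call. A sketch assuming Ollama's JSON mode (`format: "json"`); the prompt wording and response schema are illustrative, not the project's actual prompt:

```python
import json
import requests

def suggest_billing_codes(soap_note: str) -> dict:
    # format="json" constrains the model to emit syntactically valid JSON.
    resp = requests.post(
        "http://host.docker.internal:11434/api/generate",
        json={
            "model": "llama3.1:8b",
            "format": "json",
            "stream": False,
            "prompt": (
                "Suggest billing codes for this SOAP note. Respond as JSON: "
                '{"icd10": ["..."], "cpt": ["..."]}\n\n' + soap_note
            ),
        },
    )
    return json.loads(resp.json()["response"])
```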
Before you begin, ensure you have the following installed and configured:
- Docker and Docker Compose (v2)
- Ollama installed natively on the host machine with the required models:
```bash
ollama pull llama3.1:8b
ollama pull nomic-embed-text
```

Verify your setup:

```bash
docker --version
docker compose version
docker ps
ollama list
```

Clone the repository:

```bash
git clone https://github.com/cld2labs/MediVaultAI.git
cd MediVaultAI
```

Create the environment file:

```bash
cp .env.example .env
```

Open `.env` and confirm the Ollama and service URLs match your environment. See Environment Variables for all available settings.
Start the stack:

```bash
# Standard (attached)
docker compose up --build

# Detached (background)
docker compose up -d --build
```

Once containers are running:
- Frontend UI: http://localhost:3000
- Backend API: http://localhost:5001
- API Docs (Swagger): http://localhost:5001/docs
- Flowise Canvas: http://localhost:3001
```bash
# Health check
curl http://localhost:5001/health

# View running containers
docker compose ps
```

View logs:

```bash
# All services
docker compose logs -f

# Backend only
docker compose logs -f medivault-api

# Flowise only
docker compose logs -f medivault-flowise
```

Stop the stack:

```bash
docker compose down
```

Run the backend and frontend directly on the host without Docker. Start the required containers first:
```bash
docker compose up medivault-chromadb medivault-whisper medivault-flowise
```

### Backend (Python / FastAPI)
```bash
cd api
python -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install -r requirements.txt
cp ../.env.example ../.env  # configure CHROMA_HOST=localhost, WHISPER_ENDPOINT=http://localhost:9000
uvicorn server:app --reload --port 5001
```

### Frontend (Node / Vite)
```bash
cd ui
npm install
npm run dev
```

The Vite dev server proxies `/api/` to `http://localhost:5001`. Open http://localhost:5173.
```
MediVaultAI/
├── api/ # FastAPI backend
│ ├── config.py # All environment-driven settings
│ ├── models.py # Pydantic request/response schemas
│ ├── server.py # FastAPI app, routes, and middleware
│ ├── services/
│ │ ├── chroma_client.py # ChromaDB direct Python client
│ │ ├── flowise_client.py # Flowise prediction and upsert
│ │ ├── flowise_provisioner.py # Auto-provision flows at startup
│ │ ├── llm_client.py # Ollama calls for diarization, billing, QA
│ │ ├── pdf_service.py # PDF validation and text extraction
│ │ └── whisper_client.py # Whisper ASR transcription
│ ├── Dockerfile
│ └── requirements.txt
├── ui/ # React frontend
│ ├── src/
│ │ ├── App.jsx
│ │ ├── components/
│ │ │ ├── ClinicalChat.jsx
│ │ │ ├── ConsultationRecorder.jsx
│ │ │ ├── FlowCanvas.jsx
│ │ │ ├── Header.jsx
│ │ │ ├── KnowledgeBase.jsx
│ │ │ ├── LandingPage.jsx
│ │ │ ├── SoapNoteEditor.jsx
│ │ │ └── StatusBadge.jsx
│ │ └── main.jsx
│ ├── Dockerfile
│ └── nginx.conf
├── docs/
│ └── assets/ # Documentation images
├── docker-compose.yaml # Main orchestration file
├── .env.example # Environment variable reference
├── CONTRIBUTING.md
├── DISCLAIMER.md
├── LICENSE.md
├── README.md
├── SECURITY.md
├── TERMS_AND_CONDITIONS.md
└── TROUBLESHOOTING.md
```
Record or upload a consultation:
- Open the application at http://localhost:3000.
- Click Launch App from the landing page.
- Select a clinical specialty from the dropdown.
- Click Record to capture audio via the browser microphone, or click Upload to submit a WAV or MP3 file.
- Submit the audio to trigger transcription.
Generate a SOAP note:
- After transcription completes, review the diarized transcript — Doctor segments are shown in purple, Patient segments in cyan.
- Click Generate SOAP Note.
- The SOAP note (Subjective, Objective, Assessment, Plan) appears in the right panel with extracted keywords.
- Edit any section in the human-in-the-loop editor before proceeding.
Generate billing codes:
- After the SOAP note is displayed, click Generate Billing Codes.
- Review the suggested ICD-10 diagnosis codes and CPT procedure codes.
- All billing codes require clinician verification before use.
Approve to knowledge base:
- After reviewing the SOAP note and billing codes, enter an optional patient reference.
- Click Approve & Save.
- The note is embedded into the ChromaDB `clinical_kb` collection and becomes immediately available in Clinical QA.
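What this approval step amounts to, sketched from the host against the mapped ports (`localhost:8100` for ChromaDB, `localhost:11434` for Ollama); the ID scheme and metadata fields are illustrative:

```python
import uuid
import requests
import chromadb

chroma = chromadb.HttpClient(host="localhost", port=8100)
kb = chroma.get_or_create_collection("clinical_kb")

def approve_note(note_text: str, patient_ref: str = "") -> str:
    # Embed the note with nomic-embed-text via Ollama's embeddings endpoint.
    emb = requests.post(
        "http://localhost:11434/api/embeddings",
        json={"model": "nomic-embed-text", "prompt": note_text},
    ).json()["embedding"]
    note_id = str(uuid.uuid4())
    kb.add(
        ids=[note_id],
        embeddings=[emb],
        documents=[note_text],
        metadatas=[{"type": "soap_note", "patient_ref": patient_ref}],
    )
    return note_id
```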
Clinical QA:
- Open the Clinical Chat panel.
- Enter any clinical question.
- The backend retrieves semantically relevant passages from the knowledge base and passes them to Ollama.
- The answer is displayed with cited source documents.
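Behind this chat, retrieval plus grounded generation can be sketched as follows (same hedged assumptions as the approval sketch above; the prompt template is illustrative):

```python
import requests
import chromadb

chroma = chromadb.HttpClient(host="localhost", port=8100)
kb = chroma.get_or_create_collection("clinical_kb")

def clinical_qa(question: str, k: int = 3) -> str:
    # Embed the question, retrieve the k nearest passages, then ground the answer.
    q_emb = requests.post(
        "http://localhost:11434/api/embeddings",
        json={"model": "nomic-embed-text", "prompt": question},
    ).json()["embedding"]
    hits = kb.query(query_embeddings=[q_emb], n_results=k)
    context = "\n---\n".join(hits["documents"][0])
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "llama3.1:8b",
            "stream": False,
            "prompt": f"Answer using only this context:\n{context}\n\nQuestion: {question}",
        },
    )
    return resp.json()["response"]
```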
Knowledge base management:
- Open the Knowledge Base panel.
- Upload PDF clinical guidelines using the document upload control.
- Remove any document by ID using the delete control.
The table below compares inference performance across providers, deployment modes, and hardware profiles. The workload covers the full MediVault AI consultation pipeline: Whisper transcription, diarization, SOAP generation, and billing codes.
| Provider | Model | Deployment | Context Window | Avg Input Tokens | Avg Output Tokens | Avg Tokens / Request | P50 Latency (ms) | P95 Latency (ms) | Throughput (req/s) | Hardware |
|---|---|---|---|---|---|---|---|---|---|---|
| OpenAI (Cloud) | gpt-4o-mini + whisper-1 | API (Cloud) | 128K | 558 | 310 | 867 | 9,500 | 171,900 | 0.006 | N/A |
| Intel OPEA EI | meta-llama/Llama-3.1-8B-Instruct + BAAI/bge-base-en-v1.5 | Enterprise (On-Prem) | 128K | 588 | 372 | 960 | 68,463 | 156,099 | 0.0035 | CPU-only (Xeon) |
Notes:
- All metrics use the same MediVault AI workload and identical inputs (~1.9 min of audio). Token counts may vary slightly per run due to non-deterministic model output.
- OpenAI metrics are averaged over 5 zero-shot runs. P95 is elevated because SOAP generation routes through a local Flowise intermediary.
An 8-billion-parameter open-weight instruction-tuned model from Meta (July 2024 release), designed for on-prem and enterprise deployment.
| Attribute | Details |
|---|---|
| Parameters | 8.0B total |
| Architecture | Transformer with Grouped Query Attention (GQA) — 32 layers, 32 Q-heads / 8 KV-heads |
| Context Window | 128,000 tokens (128K) native |
| Reasoning Mode | Standard instruction-following |
| Tool / Function Calling | Supported via structured prompts |
| Structured Output | JSON-structured responses supported |
| Multilingual | English-focused with multilingual capabilities |
| Benchmarks | MMLU: 73.0%, GSM8K: 84.4%, HumanEval: 72.6% |
| Quantization Formats | GGUF (Q4_K_M ~4.9 GB, Q8_0 ~8.5 GB), AWQ (int4), GPTQ (int4) |
| Inference Runtimes | Ollama, vLLM, llama.cpp, LMStudio, TGI |
| Fine-Tuning | Full fine-tuning and adapter-based (LoRA); community adapters available |
| License | Llama 3.1 Community License (permits commercial use with conditions) |
| Deployment | Local, on-prem, air-gapped, cloud — full data sovereignty |
A 109M-parameter English text embedding model from the Beijing Academy of Artificial Intelligence (BAAI), optimised for dense retrieval and semantic similarity tasks.
| Attribute | Details |
|---|---|
| Parameters | 109M total |
| Architecture | BERT-based bi-encoder |
| Embedding Dimension | 768 |
| Max Sequence Length | 512 tokens |
| Task | Dense retrieval / semantic similarity |
| Benchmarks | MTEB (English) avg: 63.55 |
| Quantization Formats | FP32, FP16, INT8 (ONNX) |
| Inference Runtimes | vLLM, Hugging Face Transformers, ONNX Runtime |
| Fine-Tuning | Full fine-tuning and adapter-based (LoRA) |
| License | MIT |
| Deployment | Local, on-prem, air-gapped — full data sovereignty |
OpenAI's cost-efficient multimodal model, accessible exclusively via cloud API.
| Attribute | Details |
|---|---|
| Parameters | Not publicly disclosed |
| Architecture | Multimodal Transformer (text + image input, text output) |
| Context Window | 128,000 tokens input / 16,384 tokens max output |
| Reasoning Mode | Standard inference (no explicit chain-of-thought toggle) |
| Tool / Function Calling | Supported; parallel function calling |
| Structured Output | JSON mode and strict JSON schema adherence supported |
| Multilingual | Broad multilingual support |
| Benchmarks | MMLU: ~87%, strong HumanEval and MBPP scores |
| Pricing | $0.15 / 1M input tokens, $0.60 / 1M output tokens (Batch API: 50% discount) |
| Fine-Tuning | Supervised fine-tuning via OpenAI API |
| License | Proprietary (OpenAI Terms of Use) |
| Deployment | Cloud-only — OpenAI API or Azure OpenAI Service. No self-hosted or on-prem option |
| Knowledge Cutoff | October 2023 |
| Capability | Meta-Llama-3.1-8B-Instruct | GPT-4o-mini |
|---|---|---|
| SOAP note generation | Yes | Yes |
| Billing code extraction (ICD-10 / CPT) | Yes | Yes |
| Speaker diarization classification | Yes | Yes |
| Clinical QA with RAG | Yes | Yes |
| Function / tool calling | Yes | Yes |
| JSON structured output | Yes | Yes |
| On-prem / air-gapped deployment | Yes | No |
| Data sovereignty | Full (weights run locally) | No (data sent to cloud API) |
| Open weights | Yes (Llama 3.1 Community License) | No (proprietary) |
| Custom fine-tuning | Full fine-tuning + LoRA adapters | Supervised fine-tuning (API only) |
| Quantization for edge devices | GGUF / AWQ / GPTQ | N/A |
| Multimodal (image input) | No | Yes |
| Native context window | 128K | 128K |
Both models support SOAP generation, billing codes, and clinical QA with RAG. However, only Meta-Llama-3.1-8B-Instruct offers open weights, data sovereignty, and local deployment flexibility — making it suitable for air-gapped, regulated, or cost-sensitive clinical environments. GPT-4o-mini offers lower latency and higher throughput via OpenAI's cloud infrastructure, with added multimodal capabilities.
Three Flowise flows are automatically provisioned when the stack starts. No manual flow configuration is required.
- **SOAP Generator**: an LLMChain composed of a ChatPromptTemplate and a ChatOllama node. The prompt is specialty-aware and instructs the model to produce a structured SOAP note from diarized consultation transcript segments.
- **Clinical QA**: a ConversationalRetrievalQAChain composed of a ChromaDB retriever, a BufferMemory node for conversation history, and a ChatOllama node. The chain returns answers with `returnSourceDocuments: true`.
- **Knowledge Base Upsert**: a flow combining PlainText input, OllamaEmbeddings, and a ChromaDB sink for document ingestion operations.
Inspect all live flow topologies by opening the Flowise canvas at http://localhost:3001.
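Programmatic access to a provisioned flow goes through Flowise's prediction endpoint (`POST /api/v1/prediction/{flowId}`). A minimal sketch; the flow ID is a placeholder you would copy from the canvas or the provisioner's logs:

```python
import requests

FLOWISE = "http://localhost:3001"
SOAP_FLOW_ID = "<flow-id-from-canvas>"  # placeholder, not a real ID

resp = requests.post(
    f"{FLOWISE}/api/v1/prediction/{SOAP_FLOW_ID}",
    json={"question": "Doctor: What brings you in today?\nPatient: A dry cough..."},
    timeout=120,  # LLM chains can take a while on CPU
)
print(resp.json())  # the chain's structured SOAP output
```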
All variables are defined in `.env` (copied from `.env.example`). The backend reads them at startup via `python-dotenv`.
| Variable | Description | Default |
|---|---|---|
| `FLOWISE_ENDPOINT` | Internal URL of the Flowise service | `http://medivault-flowise:3001` |
| `FLOWISE_API_KEY` | Flowise API key for authenticated requests | (empty — auth disabled) |
| Variable | Description | Default |
|---|---|---|
| `OLLAMA_BASE_URL` | URL of the Ollama service on the host | `http://host.docker.internal:11434` |
| `OLLAMA_MODEL` | Ollama model used for LLM inference | `llama3.1:8b` |
| `OLLAMA_EMBED_MODEL` | Ollama model used for embeddings | `nomic-embed-text` |
| Variable | Description | Default |
|---|---|---|
| `CHROMA_HOST` | ChromaDB service hostname | `medivault-chromadb` |
| `CHROMA_PORT` | ChromaDB internal port | `8000` |
| Variable | Description | Default |
|---|---|---|
| `WHISPER_ENDPOINT` | Internal URL of the Whisper ASR service | `http://medivault-whisper:9000` |
| `WHISPER_MODEL` | Whisper model size | `small` |
| Variable | Description | Default |
|---|---|---|
| `MAX_AUDIO_SIZE` | Maximum accepted audio file size in bytes | `26214400` (25 MB) |
| `MAX_FILE_SIZE` | Maximum accepted document file size in bytes | `10485760` (10 MB) |
| Variable | Description | Default |
|---|---|---|
| `BACKEND_PORT` | Port the FastAPI server listens on | `5001` |
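A minimal sketch of how `config.py` plausibly surfaces these settings with `python-dotenv` (variable names and defaults match the tables above; the exact module layout is an assumption):

```python
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env at startup; real values override these defaults

OLLAMA_BASE_URL = os.getenv("OLLAMA_BASE_URL", "http://host.docker.internal:11434")
OLLAMA_MODEL = os.getenv("OLLAMA_MODEL", "llama3.1:8b")
OLLAMA_EMBED_MODEL = os.getenv("OLLAMA_EMBED_MODEL", "nomic-embed-text")
CHROMA_HOST = os.getenv("CHROMA_HOST", "medivault-chromadb")
CHROMA_PORT = int(os.getenv("CHROMA_PORT", "8000"))
WHISPER_ENDPOINT = os.getenv("WHISPER_ENDPOINT", "http://medivault-whisper:9000")
MAX_AUDIO_SIZE = int(os.getenv("MAX_AUDIO_SIZE", str(25 * 1024 * 1024)))
MAX_FILE_SIZE = int(os.getenv("MAX_FILE_SIZE", str(10 * 1024 * 1024)))
BACKEND_PORT = int(os.getenv("BACKEND_PORT", "5001"))
```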
### Backend

- Framework: FastAPI (Python 3.11+) with Uvicorn ASGI server
- LLM Orchestration: Flowise — auto-provisioned chains for SOAP generation
- LLM Inference: Ollama — runs natively on host for diarization, billing codes, and clinical QA
- Vector DB Client: chromadb-client — direct Python client for all upsert and query operations
- PDF Processing: pypdf for text extraction from uploaded PDF files
- Config Management: python-dotenv for environment variable injection at startup
- Data Validation: Pydantic v2 for request/response schema enforcement
### Frontend

- Framework: React 18 with Vite (fast HMR and production bundler)
- Styling: Tailwind CSS with dark mode
- Icons: Lucide React
- HTTP Client: Axios
- Flow Visualisation: @xyflow/react
- Production Server: Nginx — serves the built assets and proxies `/api/` to the backend container
### Infrastructure

| Component | Technology |
|---|---|
| Containerisation | Docker Compose (5 services) |
| LLM inference | Ollama (host machine) |
| LLM orchestration | Flowise |
| Speech-to-text | Whisper ASR (onerahmet/openai-whisper-asr-webservice, faster_whisper engine) |
| Vector store | ChromaDB (containerised, persistent named volume) |
For common issues and solutions, see TROUBLESHOOTING.md.
Quick diagnostic commands:
```bash
# Health check
curl http://localhost:5001/health

# View logs for all services
docker compose logs -f

# View logs for a specific service
docker compose logs -f medivault-api

# Check container health status
docker compose ps

# Restart a single service
docker compose restart medivault-flowise

# Rebuild and restart the entire stack
docker compose down && docker compose up --build
```

This project is licensed under the MIT License. See LICENSE.md for details.
MediVault AI is provided as-is for demonstration and educational purposes. While we strive for accuracy:
- All AI-generated SOAP notes, ICD-10 codes, CPT codes, and clinical Q&A responses must be reviewed and approved by a licensed clinician before use in any clinical context
- Do not rely solely on AI-generated outputs without independent clinical verification
- Do not use this system with real patient data without implementing full HIPAA, GDPR, and applicable regulatory compliance measures
- The quality of outputs depends on the underlying Ollama model and the content of the ingested knowledge base
For full disclaimer details, see DISCLAIMER.md.
