Gemini Live Agent Challenge Hackathon: A multimodal AI assistant that helps doctors analyze patient symptoms during rounds using voice and vision.
- Problem Statement
- Solution
- Architecture
- Tech Stack
- Features
- Demo Scenario
- Getting Started
- Deploying to Google Cloud
- Google Cloud Services Used
- Project Structure
During hospital rounds, physicians must rapidly assess patient symptoms, recall drug interactions, reference clinical guidelines, and make critical decisions, often under time pressure and with limited access to reference materials. Current tools require manual lookup and switching between multiple systems.
MedsightAI is a real-time clinical decision support assistant powered by Gemini Live API. A doctor opens the web app and:
- Speaks to the AI assistant naturally
- Shows symptoms (rash, wound, X-ray) through the webcam
- Gets real-time analysis of both voice and image inputs
- Hears the AI respond with spoken clinical insights
- Can interrupt at any time (barge-in support)
The AI provides differential diagnoses, severity assessments, drug interaction checks, and clinical guideline references, all through a natural voice conversation.
graph TD
subgraph Client ["Browser (Web App)"]
UI["UI (Vanilla JS)"]
Mic["Mic (PCM Audio)"]
Cam["Webcam (JPEG)"]
Insights["Clinical Insights Panel"]
end
subgraph Backend ["FastAPI Backend (Cloud Run)"]
Proxy["WebSocket Proxy"]
GS["Gemini Live Session"]
Tools["agent_tools.py"]
OCR["OCR & Report Engine"]
end
subgraph Google ["Google Cloud & Gemini"]
GLA["Gemini Live API (2.5 Flash Audio)"]
VAI["Gemini 2.5 Flash (Symptom/OCR)"]
end
subgraph Ext ["External Services"]
FDA["OpenFDA API"]
end
Mic -->|Audio Binary| Proxy
Cam -->|JPEG Frames| Proxy
Proxy <--> GS
GS <-->|Bidirectional Stream| GLA
GS --> Tools
Tools -->|Differential Diagnosis| VAI
Tools -->|Drug Interactions| FDA
GS --> OCR
OCR -->|Medical Reports| Insights
Insights -.-> UI
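The diagram shows the browser sending both microphone audio and webcam JPEG frames to the WebSocket proxy. A minimal sketch of how the proxy might tell the two apart is given below. It assumes binary WebSocket messages, JPEG detection by magic bytes, and 16 kHz PCM input; `route_frame`, `RoutedFrame`, and the sample rate are hypothetical names and values for illustration, not taken from `main.py`.

```python
from dataclasses import dataclass

# Every JPEG file starts with these three bytes (SOI marker plus the next marker prefix).
JPEG_MAGIC = b"\xff\xd8\xff"


@dataclass
class RoutedFrame:
    kind: str        # "video" or "audio"
    mime_type: str   # MIME type to attach when forwarding upstream
    payload: bytes   # the untouched message body


def route_frame(message: bytes, sample_rate: int = 16000) -> RoutedFrame:
    """Classify one binary WebSocket message as a JPEG frame or PCM audio.

    Assumption: anything that is not a JPEG is raw PCM audio from the mic.
    """
    if not message:
        raise ValueError("empty frame")
    if message.startswith(JPEG_MAGIC):
        return RoutedFrame("video", "image/jpeg", message)
    return RoutedFrame("audio", f"audio/pcm;rate={sample_rate}", message)
```

Raw PCM can in principle begin with the same three bytes, so a real proxy may prefer an explicit type prefix or separate message envelopes over sniffing.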
| Layer | Technology |
|---|---|
| AI Model | Gemini 2.5 Flash (Native Audio Dialog) |
| Live API | Gemini Live API (WebSocket, real-time multimodal) |
| SDK | Google GenAI Python SDK (google-genai) |
| Backend | Python 3.11 + FastAPI + Uvicorn |
| Frontend | Vanilla JS + Web Audio API + MediaDevices API |
| Styling | Custom CSS (Dark Medical Theme) |
| Deployment | Google Cloud Run |
| Container | Docker |
| CI/CD | Google Cloud Build |
- Voice input: speak naturally to the AI
- Webcam video: show symptoms, X-rays, wounds
- Voice output: AI responds with natural speech
- Barge-in: interrupt the AI at any time
This project breaks the "text box" paradigm. It acts as a true Live Agent for clinical environments where hands are often sterilized or busy. The interaction is fully natural:
- The agent "Sees, Hears, and Speaks" via Gemini's multimodal Live API.
- Interruptions (barge-in) are handled gracefully (e.g., "Wait, they are allergic to penicillin").
- The agent weaves real-time video observation together with medical fact-checking.
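Barge-in means that audio already queued for playback must be dropped the moment the doctor starts speaking, and any late chunks from the interrupted response must be ignored. The sketch below shows one way to model that logic. It is written in Python for consistency with the backend, although in this project playback happens in the browser; `PlaybackBuffer` and the turn counter are hypothetical and not taken from the repository.

```python
from collections import deque
from typing import Deque, Optional


class PlaybackBuffer:
    """Queue of AI audio chunks that can be flushed when the user interrupts."""

    def __init__(self) -> None:
        self._pending: Deque[bytes] = deque()
        self.turn = 0  # incremented on every interruption

    def enqueue(self, chunk: bytes, turn: int) -> bool:
        """Queue a chunk, rejecting chunks that belong to an interrupted turn."""
        if turn != self.turn:
            return False
        self._pending.append(chunk)
        return True

    def next_chunk(self) -> Optional[bytes]:
        """Return the next chunk to play, or None when nothing is queued."""
        return self._pending.popleft() if self._pending else None

    def interrupt(self) -> int:
        """Drop all queued audio and start a new turn; returns the new turn id."""
        self._pending.clear()
        self.turn += 1
        return self.turn
```

Tagging chunks with a turn id keeps stale audio from the interrupted answer out of the queue even if it arrives after the flush.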
- Symptom Analysis: differential diagnoses from visual observation via Gemini Vision.
- Drug Interactions: safety checks leveraging the OpenFDA API.
- Clinical Guidelines: evidence-based treatment protocols for 12+ major conditions.
- Risk Assessment: validated NEWS2 (National Early Warning Score 2) calculation.
- Prescription OCR: multimodal insight from handwritten medical slips.
- Automated PDF Reports: summarized clinical prescriptions generated from live audio transcripts.
- Dark medical theme with glassmorphism.
- Concurrent streaming bubbles for Doctor & AI transcripts.
- Optimized print stylesheets for medical letterhead export.
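The Risk Assessment feature refers to NEWS2, which sums a banded score for each vital sign. The sketch below illustrates that calculation using the published NEWS2 thresholds, under two simplifying assumptions: it covers SpO2 Scale 1 only (not Scale 2 for hypercapnic respiratory failure), and consciousness is reduced to alert versus not alert. The function and parameter names are hypothetical and may differ from what `agent_tools.py` implements.

```python
from typing import List, Tuple

INF = float("inf")

# Each table is a list of (inclusive upper bound, score), checked in order.
RESP_RATE = [(8, 3), (11, 1), (20, 0), (24, 2), (INF, 3)]             # breaths/min
SPO2_SCALE1 = [(91, 3), (93, 2), (95, 1), (INF, 0)]                   # percent
SYSTOLIC_BP = [(90, 3), (100, 2), (110, 1), (219, 0), (INF, 3)]       # mmHg
PULSE = [(40, 3), (50, 1), (90, 0), (110, 1), (130, 2), (INF, 3)]     # beats/min
TEMPERATURE = [(35.0, 3), (36.0, 1), (38.0, 0), (39.0, 1), (INF, 2)]  # degrees C


def _band(value: float, bands: List[Tuple[float, int]]) -> int:
    """Return the score of the first band whose upper bound covers value."""
    for upper, score in bands:
        if value <= upper:
            return score
    raise ValueError(f"value out of range: {value}")


def news2_score(resp_rate: float, spo2: float, on_oxygen: bool,
                systolic_bp: float, pulse: float, alert: bool,
                temp_c: float) -> Tuple[int, str]:
    """Return (aggregate score, risk band) for one set of observations."""
    scores = [
        _band(resp_rate, RESP_RATE),
        _band(spo2, SPO2_SCALE1),
        2 if on_oxygen else 0,   # supplemental oxygen scores 2
        _band(systolic_bp, SYSTOLIC_BP),
        _band(pulse, PULSE),
        0 if alert else 3,       # new confusion, or V/P/U on the ACVPU scale
        _band(temp_c, TEMPERATURE),
    ]
    total = sum(scores)
    if total >= 7:
        risk = "high"
    elif total >= 5:
        risk = "medium"
    elif 3 in scores:
        risk = "low-medium"      # a single parameter scoring 3
    else:
        risk = "low"
    return total, risk
```

For example, `news2_score(22, 95, False, 105, 95, True, 38.5)` scores 2 + 1 + 0 + 1 + 1 + 0 + 1 and returns `(6, "medium")`.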
Doctor opens MedsightAI and points the webcam at a rash.
Doctor: "MedsightAI, what do you think about this rash on the patient's forearm?"
MedsightAI: "I can see what appears to be an erythematous, raised rash on the forearm. Let me run an analysis..."
The Clinical Insights panel populates with:
- Contact Dermatitis: 75% confidence
- Cellulitis: 60% confidence
- Recommended tests: Skin biopsy, CBC, IgE levels
Doctor interrupts: "Wait, the patient is allergic to penicillin. What antibiotics are safe?"
MedsightAI immediately stops speaking and responds:
MedsightAI: "Given the penicillin allergy, amoxicillin is contraindicated. Safe alternatives include azithromycin, doxycycline, or trimethoprim-sulfamethoxazole..."
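A drug-safety question like the one in this scenario is the kind of request the OpenFDA-backed tool would handle. As a rough sketch, the helpers below build a query against the public openFDA drug label endpoint and pull the `drug_interactions` section out of a response. They perform no network call, and `build_label_query` and `extract_interactions` are hypothetical helpers; how `agent_tools.py` actually queries OpenFDA is not shown in this README.

```python
from typing import List
from urllib.parse import urlencode

# Public openFDA drug labeling endpoint.
OPENFDA_LABEL_URL = "https://api.fda.gov/drug/label.json"


def build_label_query(generic_name: str, limit: int = 1) -> str:
    """Build a label search URL for a drug's generic name."""
    name = generic_name.strip().lower()
    if not name:
        raise ValueError("generic_name must not be empty")
    params = {"search": f'openfda.generic_name:"{name}"', "limit": limit}
    return f"{OPENFDA_LABEL_URL}?{urlencode(params)}"


def extract_interactions(label_response: dict) -> List[str]:
    """Collect the free-text drug interaction sections from a label response."""
    sections: List[str] = []
    for result in label_response.get("results", []):
        # Not every label carries this section, so default to an empty list.
        sections.extend(result.get("drug_interactions", []))
    return sections
```

The returned text is unstructured label prose, so it would still need to be summarized by the model before being spoken back to the doctor.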
- Python 3.11+
- A Google AI Studio API key
git clone https://github.com/YOUR_USERNAME/medsight-ai.git
cd medsight-ai
cd backend
# Create virtual environment
python -m venv .venv
source .venv/bin/activate # macOS/Linux
# Install dependencies
pip install -r requirements.txt
# Set your API key in .env
cp .env.example .env
python main.py
Navigate to http://localhost:8000 in your browser.
- Google Cloud SDK installed
- A GCP project with billing enabled
export GEMINI_API_KEY=your_key_here
chmod +x infrastructure/deploy.sh
./infrastructure/deploy.sh my-project-id us-central1
medsight-ai/
├── backend/
│   ├── main.py              # FastAPI server
│   ├── gemini_live.py       # Gemini Live API wrapper
│   ├── agent_tools.py       # Clinical reasoning tools (NEWS2, OpenFDA)
│   └── requirements.txt     # Python dependencies
├── frontend/
│   ├── index.html           # Main web page
│   ├── css/styles.css       # Premium dark theme
│   └── js/                  # WebSocket & Media handlers
├── prompts/
│   └── system_prompt.txt    # Clinical system prompt
├── infrastructure/          # Cloud Run & Build configs
├── Dockerfile               # Container definition
└── README.md
MedsightAI is a demonstration project built for the Gemini Live Agent Challenge hackathon. It is not a certified medical device and should not be used for actual clinical decision-making.
Built with ❤️ for the Gemini Live Agent Challenge
Powered by Google Gemini and Google Cloud