A real-time voice support backend that bridges browser clients to Google's Gemini Multimodal Live API over WebSockets. It streams bidirectional PCM audio, dispatches tool calls against a MongoDB database, and powers an AI customer support agent that can look up orders, process refunds, file complaints, and escalate to humans — all through natural voice conversation.
Browser (Next.js Frontend)
    │
    │  WebSocket (/ws/voice?user_id=N)
    │  ▲ PCM 16-bit audio (16 kHz)   ▼ PCM 16-bit audio (24 kHz) + JSON
    │
┌───▼────────────────────────────────────────────────┐
│  Voice Agent Server (FastAPI)                      │
│                                                    │
│  ┌────────────────┐     ┌────────────────────────┐ │
│  │ WebSocket Hub  │◄───►│ Gemini Live API Proxy  │ │
│  │ (client I/O)   │     │ (wss://generative...)  │ │
│  └────────────────┘     └──────────┬─────────────┘ │
│                                    │               │
│                         ┌──────────▼─────────────┐ │
│                         │ Tool Dispatcher        │ │
│                         │   get_user_profile     │ │
│                         │   check_order_status   │ │
│                         │   get_order_details    │ │
│                         │   initiate_refund      │ │
│                         │   file_complaint       │ │
│                         │   escalate_to_human    │ │
│                         └──────────┬─────────────┘ │
│                                    │               │
│                         ┌──────────▼─────────────┐ │
│                         │ MongoDB (Motor async)  │ │
│                         └────────────────────────┘ │
└────────────────────────────────────────────────────┘

- Framework: FastAPI with native WebSocket support
- ASGI Server: Uvicorn
- AI Model: Gemini Multimodal Live API (gemini-3.1-flash-live-preview)
- Database Driver: Motor (async MongoDB driver)
- WebSocket Client: websockets (for Gemini connection)
- Runtime: Python 3.11+
- Deployment: Docker → Google Cloud Run
voice_agent_server/
├── main.py # FastAPI app, WebSocket handler, Gemini proxy logic
├── tools.py # Tool implementations (MongoDB queries, refund logic)
├── requirements.txt # Python dependencies
├── Dockerfile # Production container image
├── .dockerignore # Files excluded from Docker build
└── .env # Environment variables (not committed)
- Python 3.11+
- A Google AI API key with access to the Gemini Multimodal Live API
- A MongoDB Atlas cluster (or local MongoDB instance)
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

Create a .env file in the project root (parent directory) or in voice_agent_server/:
GEMINI_API_KEY=your-google-ai-api-key
MONGODB_URI=mongodb+srv://<username>:<password>@<cluster>.mongodb.net/?appName=<app>

| Variable | Description |
|---|---|
| GEMINI_API_KEY | Google AI API key for the Gemini Multimodal Live API |
| MONGODB_URI | MongoDB connection string for order, user, and ticket data |
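
A minimal sketch of how the two variables above might be validated at startup, assuming they are read from the process environment. `Settings` and `load_settings` are hypothetical names for illustration and are not taken from main.py.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    gemini_api_key: str
    mongodb_uri: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the required variables and fail fast if any is missing or empty."""
    required = ("GEMINI_API_KEY", "MONGODB_URI")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return Settings(
        gemini_api_key=env["GEMINI_API_KEY"],
        mongodb_uri=env["MONGODB_URI"],
    )
```

Failing at startup rather than on the first WebSocket connection makes a misconfigured deployment visible immediately.
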
uvicorn main:app --host 0.0.0.0 --port 8080

The server will be available at ws://localhost:8080/ws/voice.
ws://host:port/ws/voice?user_id=<int>
The user_id query parameter identifies the customer for the session. All tool calls will default to this user.
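
As an illustration of the endpoint format, the sketch below builds the connection URL and extracts user_id from it using only the standard library. `build_voice_url` and `parse_user_id` are hypothetical helpers; the actual server presumably reads the parameter through FastAPI.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_voice_url(host: str, port: int, user_id: int) -> str:
    """Build the WebSocket URL for a given customer."""
    return f"ws://{host}:{port}/ws/voice?{urlencode({'user_id': user_id})}"


def parse_user_id(url: str) -> int:
    """Extract user_id from the query string.

    Raises ValueError if the parameter is absent or not an integer.
    """
    values = parse_qs(urlsplit(url).query).get("user_id")
    if not values:
        raise ValueError("user_id query parameter is required")
    return int(values[0])
```
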
- Binary frames (client to server): Raw 16-bit PCM audio at 16 kHz, mono. The server base64-encodes each frame and forwards it to Gemini as realtimeInput.audio.
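
The base64 forwarding step could look roughly like the sketch below. The `data` and `mimeType` field names and the `audio/pcm;rate=16000` MIME string are assumptions about the Gemini message shape, and `wrap_pcm_chunk` is a hypothetical helper rather than the function used in main.py.

```python
import base64
import json


def wrap_pcm_chunk(pcm: bytes, sample_rate: int = 16000) -> str:
    """Wrap raw 16-bit mono PCM bytes in a JSON message carrying base64 audio."""
    if len(pcm) % 2:
        raise ValueError("16-bit PCM must contain an even number of bytes")
    message = {
        "realtimeInput": {
            "audio": {
                # Assumed field names; check them against the Live API reference.
                "data": base64.b64encode(pcm).decode("ascii"),
                "mimeType": f"audio/pcm;rate={sample_rate}",
            }
        }
    }
    return json.dumps(message)
```

For scale, 20 ms of 16 kHz mono audio is 320 samples, or 640 bytes before base64 encoding.
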
- Binary frames (server to client): Raw 16-bit PCM audio at 24 kHz from Gemini's voice response. Play directly via AudioContext.
- Text frames (JSON):
The agent has access to six server-side tools that execute against MongoDB. Gemini decides when to call them based on conversation context.
| Tool | Description |
|---|---|
| get_user_profile | Fetches name, phone, email, wallet balance, total orders, lifetime spend (LTV), and customer tier |
| check_order_status | Lists all user orders with status, amount, timestamps, and refund eligibility |
| get_order_details | Detailed view of a single order including hours since delivery and refund window status |
| initiate_refund | Processes a refund; enforces the 2-hour delivery window and calculates the amount based on LTV tier |
| file_complaint | Files a structured complaint (categories: food_quality, late_delivery, missing_items, wrong_order, hygiene, other) |
| escalate_to_human | Creates an escalation ticket and signals handoff to a human agent |
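
One way to route tool calls by name is a lookup table of async handlers, sketched below. The handlers here are stand-ins that return fixed data instead of querying MongoDB, and `TOOLS`, `dispatch_tool`, and the `order_id` argument are hypothetical; the real signatures in tools.py may differ.

```python
import asyncio
from typing import Any, Awaitable, Callable

ToolHandler = Callable[..., Awaitable[dict[str, Any]]]


# Stand-in handlers; real implementations would query MongoDB via Motor.
async def get_user_profile(user_id: int) -> dict[str, Any]:
    return {"user_id": user_id, "tier": "unknown"}


async def get_order_details(user_id: int, order_id: str) -> dict[str, Any]:
    return {"user_id": user_id, "order_id": order_id}


TOOLS: dict[str, ToolHandler] = {
    "get_user_profile": get_user_profile,
    "get_order_details": get_order_details,
}


async def dispatch_tool(
    name: str, args: dict[str, Any], session_user_id: int
) -> dict[str, Any]:
    """Run the named tool, defaulting user_id to the session's customer."""
    handler = TOOLS.get(name)
    if handler is None:
        return {"error": f"unknown tool: {name}"}
    # Arguments supplied by the model take precedence over the session default.
    call_args = {"user_id": session_user_id, **args}
    try:
        return await handler(**call_args)
    except TypeError as exc:  # missing or unexpected arguments
        return {"error": str(exc)}
```

Returning an error dictionary instead of raising lets the failure be reported back to the model as a tool result, so the conversation can continue.
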
Refunds are governed by two rules:
- Time window: Only orders delivered within the last 2 hours are eligible.
- LTV-based amount: The refund percentage scales with customer lifetime spend:
| Lifetime Spend | Refund Percentage |
|---|---|
| ₹10,000+ | 100% |
| ₹5,000+ | 75% |
| ₹2,000+ | 50% |
| Below ₹2,000 | 30% |
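
The two rules can be sketched as a pair of pure functions. This is an illustration only: treating the 2-hour boundary as inclusive and rounding to two decimals are assumptions, and `refund_percentage` and `compute_refund` are hypothetical names.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

REFUND_WINDOW = timedelta(hours=2)
# (minimum lifetime spend in rupees, refund percentage), highest tier first
REFUND_TIERS = [(10_000, 100), (5_000, 75), (2_000, 50), (0, 30)]


def refund_percentage(lifetime_spend: float) -> int:
    """Map lifetime spend to a refund percentage using the tier table."""
    for minimum, percent in REFUND_TIERS:
        if lifetime_spend >= minimum:
            return percent
    return REFUND_TIERS[-1][1]  # negative spend falls back to the lowest tier


def compute_refund(
    order_amount: float,
    lifetime_spend: float,
    delivered_at: datetime,
    now: Optional[datetime] = None,
) -> Optional[float]:
    """Return the refund amount, or None if the order is outside the window."""
    now = now or datetime.now(timezone.utc)
    if now - delivered_at > REFUND_WINDOW:
        return None
    return round(order_amount * refund_percentage(lifetime_spend) / 100, 2)
```

For example, a ₹400 order delivered one hour ago for a customer with ₹6,000 lifetime spend falls in the 75% tier, giving a ₹300 refund.
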
The server configures Gemini with a detailed system instruction that defines the agent's personality and operational rules:
- Tone: Natural, empathetic, concise — like a real phone support agent
- Primary goal: Resolve issues while minimizing refund costs
- Refund handling: Always check user profile and order eligibility first; offer alternatives (complaints, re-delivery) before processing
- Escalation: Only when the customer is genuinely distressed or explicitly requests a human
- Voice model: Kore (Gemini prebuilt voice)
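
A rough sketch of how these rules and the voice choice might be packed into the initial setup message sent to Gemini. The JSON field names reflect my understanding of the Live API setup message and should be treated as assumptions; the instruction text is a placeholder paraphrase, not the server's actual prompt, and `build_setup_message` is a hypothetical helper.

```python
import json

# Placeholder summary of the rules above, not the real system instruction.
SYSTEM_INSTRUCTION = (
    "You are a voice support agent. Be natural, empathetic, and concise. "
    "Check the user profile and order eligibility before any refund, "
    "offer alternatives first, and escalate only when the customer is "
    "distressed or explicitly asks for a human."
)


def build_setup_message(model: str, voice_name: str = "Kore") -> str:
    """Build a JSON setup message with audio output and a prebuilt voice."""
    setup = {
        "setup": {
            "model": f"models/{model}",
            "generationConfig": {
                "responseModalities": ["AUDIO"],
                "speechConfig": {
                    "voiceConfig": {
                        "prebuiltVoiceConfig": {"voiceName": voice_name}
                    }
                },
            },
            "systemInstruction": {"parts": [{"text": SYSTEM_INSTRUCTION}]},
        }
    }
    return json.dumps(setup)
```
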
Build and deploy the container directly:
gcloud run deploy voice-server-agent \
--source . \
--region <your-region> \
--allow-unauthenticated \
--set-env-vars "GEMINI_API_KEY=<key>,MONGODB_URI=<uri>"

Cloud Run automatically sets the PORT environment variable. The Dockerfile reads it via ${PORT}.

docker build -t voice-agent-server .
docker run -p 8080:8080 \
-e GEMINI_API_KEY=<key> \
-e MONGODB_URI=<uri> \
voice-agent-server

uvicorn main:app --host 0.0.0.0 --port 8080 --reload

curl http://localhost:8080/
# → {"status": "ok", "service": "voice-agent-server"}

This project is part of the Zomato Live Agent monorepo. See the root LICENSE for details.