A production-ready, cleanly architected voice assistant that connects a Twilio phone call to OpenAI’s Realtime API over WebSockets for two-way, low-latency audio conversations.
- Clean architecture with Single Responsibility Principle (SRP)
- Real-time audio: Twilio Media Streams ⇄ OpenAI Realtime API
- Interruption handling (user can interrupt the AI mid-response)
- Modular, testable services with clear boundaries
- Simple local setup and Twilio webhook wiring
- Graceful call termination: model triggers an end_call tool → assistant plays a short farewell → call hangs up automatically
├── main.py # Application entrypoint (FastAPI + orchestration)
├── config.py # Centralized configuration (.env file / environment variables)
├── services/ # SRP-aligned service modules
│ ├── __init__.py # Package exports
│ ├── audio_service.py # Audio conversion, timing, buffering, marks
│ ├── connection_manager.py # WebSocket plumbing (Twilio ⇄ OpenAI)
│ ├── log_utils.py # Structured logging helpers
│ ├── openai_service.py # Session + events + conversation control
│ └── twilio_service.py # TwiML + Twilio payload helpers
├── requirements.txt # Python dependencies
├── .env.example # Environment variables template
├── CODE_OF_CONDUCT.md # Contribution/code of conduct
├── LICENSE # MIT license
├── Promptingrealtimemodels.md # Realtime prompting notes
└── README.md # This file
- Python 3.9+
- Twilio account and a phone number with Voice capability
- OpenAI API key with Realtime API access
- A tunneling tool (e.g. ngrok) to expose your local server
- Create a virtual environment and install dependencies
python3 -m venv env
source env/bin/activate
pip install -r requirements.txt
- Configure environment
cp .env.example .env
# open .env and set your values
Key variables (see “Configuration” for the full list); only OPENAI_API_KEY has no default:
- OPENAI_API_KEY=...
- PORT=5050 (default)
- TEMPERATURE=0.8 (default)
- SHOW_TIMING_MATH=false (default)
- Run the app locally
python main.py
You should see Uvicorn running on 0.0.0.0:5050 by default.
- Expose your server with ngrok (or similar)
ngrok http 5050
Copy the HTTPS forwarding URL for use in the next step.
- Point your Twilio phone number to the webhook
In Twilio Console → Phone Numbers → Manage → Active Numbers → your number:
- “A call comes in” → Webhook → https://YOUR-NGROK-SUBDOMAIN.ngrok.app/incoming-call
- Save
Call your number and start talking to the assistant.
Configuration is centralized in `config.py` and loaded from environment variables.
Variable | Description | Default |
---|---|---|
OPENAI_API_KEY | OpenAI API key | — |
PORT | HTTP server port | 5050 |
TEMPERATURE | Model temperature (0–1) | 0.8 |
SHOW_TIMING_MATH | Verbose timing logs for interruption math | false |
COMPANY_NAME | Brand used in prompts | Acme Realty |
VOICE | OpenAI voice id (e.g., alloy) | alloy |
END_CALL_GRACE_SECONDS | Pause after farewell before hangup (seconds) | 3 |
END_CALL_WATCHDOG_SECONDS | Timeout if farewell audio doesn’t start (seconds) | 4 |
REALTIME_SESSION_RENEW_SECONDS | Preemptive session renewal cadence (seconds) | 3300 |
TWILIO_ACCOUNT_SID | Optional: enable REST hangup | — |
TWILIO_AUTH_TOKEN | Optional: enable REST hangup | — |
`config.py` builds the OpenAI Realtime WebSocket URL and headers dynamically using these values.
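To make the table above concrete, here is a minimal sketch of how such values could be read from the environment with the documented defaults. The `Settings` class, `load_settings` function, and `rest_hangup_enabled` property are hypothetical names used for illustration and are not taken from the repository's `config.py`; building the Realtime URL and headers is deliberately left out.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


def _as_bool(raw: str) -> bool:
    """Interpret common truthy strings such as "true" or "1"."""
    return raw.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the variables listed in the table."""
    openai_api_key: str
    port: int
    temperature: float
    show_timing_math: bool
    company_name: str
    voice: str
    end_call_grace_seconds: float
    end_call_watchdog_seconds: float
    realtime_session_renew_seconds: int
    twilio_account_sid: Optional[str]
    twilio_auth_token: Optional[str]

    @property
    def rest_hangup_enabled(self) -> bool:
        # REST hangup is only possible when both Twilio credentials are set.
        return bool(self.twilio_account_sid and self.twilio_auth_token)


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read settings from `env` (defaults to os.environ), applying defaults."""
    env = os.environ if env is None else env
    api_key = env.get("OPENAI_API_KEY", "").strip()
    if not api_key:
        raise ValueError("OPENAI_API_KEY must be set")
    return Settings(
        openai_api_key=api_key,
        port=int(env.get("PORT", "5050")),
        temperature=float(env.get("TEMPERATURE", "0.8")),
        show_timing_math=_as_bool(env.get("SHOW_TIMING_MATH", "false")),
        company_name=env.get("COMPANY_NAME", "Acme Realty"),
        voice=env.get("VOICE", "alloy"),
        end_call_grace_seconds=float(env.get("END_CALL_GRACE_SECONDS", "3")),
        end_call_watchdog_seconds=float(env.get("END_CALL_WATCHDOG_SECONDS", "4")),
        realtime_session_renew_seconds=int(
            env.get("REALTIME_SESSION_RENEW_SECONDS", "3300")
        ),
        twilio_account_sid=env.get("TWILIO_ACCOUNT_SID") or None,
        twilio_auth_token=env.get("TWILIO_AUTH_TOKEN") or None,
    )
```

Passing a plain dictionary instead of `os.environ` keeps the loader easy to unit test, which fits the project's goal of testable modules.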
- Twilio hits `/incoming-call` and receives TwiML that connects the call to a WebSocket media stream (`/media-stream`).
- `WebSocketConnectionManager` opens a second WebSocket to OpenAI Realtime.
- `AudioService` converts/labels audio, tracks timestamps, buffers, and manages “marks”.
- `OpenAIService` initializes the session, processes events, and handles truncation when the caller speaks.
- The system streams audio in both directions with minimal latency.
- When the model calls the `end_call` tool, the app queues a brief farewell via `response.create`, ignores user interruptions during the goodbye, then ends the call using Twilio REST (if credentials are set) or by closing the Twilio media stream WebSocket.
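The first step of this flow, returning TwiML that connects the call to the media stream, can be sketched with the standard library alone. The function name `build_stream_twiml` is hypothetical, and the repository's `twilio_service.py` may build the response differently (for example with Twilio's helper library); only the `<Response><Connect><Stream url="wss://..."/>` shape and the `/media-stream` path come from the description above.

```python
from xml.etree.ElementTree import Element, SubElement, tostring


def build_stream_twiml(host: str, path: str = "/media-stream") -> str:
    """Return TwiML that connects the incoming call to a WebSocket stream.

    `host` is the public hostname of the server (e.g. the ngrok domain).
    """
    response = Element("Response")
    connect = SubElement(response, "Connect")
    # Twilio opens a WebSocket to this URL and streams call audio over it.
    SubElement(connect, "Stream", {"url": f"wss://{host}{path}"})
    body = tostring(response, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>' + body
```

In a FastAPI handler for `/incoming-call`, the returned string would typically be sent back with an XML media type, for example `Response(content=twiml, media_type="application/xml")`.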
- Trigger: the model issues an `end_call` function call when the user says goodbye or asks to end the call.
- Farewell: the app sends a `response.create` with `response.instructions` so the assistant speaks one short goodbye.
- Finalize: after a short grace period, the call ends via Twilio REST or, if Twilio credentials are not configured, by closing the Twilio WS stream.
- Watchdog: if farewell audio doesn’t begin within `END_CALL_WATCHDOG_SECONDS`, the app finalizes anyway to avoid stalling.
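The trigger, farewell, finalize, and watchdog steps above can be sketched as a small asyncio coordinator. This is an illustrative sketch under stated assumptions, not the repository's implementation: `EndCallCoordinator`, `farewell_event`, and the `send_to_openai` / `hang_up` callbacks are hypothetical names, and for simplicity the grace period here starts when farewell audio begins rather than when playback finishes.

```python
import asyncio
import json
from typing import Awaitable, Callable


def farewell_event(instructions: str) -> str:
    """Serialize a response.create event asking for one short goodbye."""
    return json.dumps(
        {"type": "response.create", "response": {"instructions": instructions}}
    )


class EndCallCoordinator:
    """Hypothetical helper; create it inside a running event loop."""

    def __init__(
        self,
        send_to_openai: Callable[[str], Awaitable[None]],
        hang_up: Callable[[], Awaitable[None]],
        grace_seconds: float = 3.0,     # END_CALL_GRACE_SECONDS
        watchdog_seconds: float = 4.0,  # END_CALL_WATCHDOG_SECONDS
    ) -> None:
        self._send = send_to_openai
        self._hang_up = hang_up
        self._grace = grace_seconds
        self._watchdog = watchdog_seconds
        self._farewell_started = asyncio.Event()
        self.ending = False  # callers can check this to ignore interruptions

    def on_farewell_audio_started(self) -> None:
        """Call when the first farewell audio chunk arrives from the model."""
        self._farewell_started.set()

    async def run(self, instructions: str) -> str:
        """Run the end-of-call sequence; return "graceful" or "watchdog"."""
        self.ending = True
        await self._send(farewell_event(instructions))
        try:
            await asyncio.wait_for(self._farewell_started.wait(), self._watchdog)
        except asyncio.TimeoutError:
            # Farewell audio never started: finalize anyway to avoid stalling.
            await self._hang_up()
            return "watchdog"
        # Simplification: pause, then hang up (real code may await playback).
        await asyncio.sleep(self._grace)
        await self._hang_up()
        return "graceful"
```

The `hang_up` callback is where the two finalization paths would diverge: with Twilio credentials it could end the call through the REST API, and without them it could simply close the Twilio media stream WebSocket.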
- Change the AI personality in `config.py` (the `SYSTEM_MESSAGE`).
- Adjust timing behavior in `AudioTimingManager` (inside `audio_service.py`).
- Swap audio formats in `AudioFormatConverter` if your integration needs change.
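As a reference point when adjusting timing behavior, the interruption math commonly used in Twilio and OpenAI Realtime integrations can be sketched as follows. This is an assumption about the general technique, not a copy of `AudioTimingManager`: the function and parameter names are hypothetical, and the event shapes (`conversation.item.truncate` for OpenAI, `clear` for Twilio Media Streams) should be checked against the current documentation of both services.

```python
from typing import Any, Dict, Optional, Tuple

Event = Dict[str, Any]


def build_interruption_events(
    latest_media_timestamp_ms: int,
    response_start_timestamp_ms: Optional[int],
    last_assistant_item_id: Optional[str],
    stream_sid: str,
) -> Optional[Tuple[Event, Event]]:
    """Return (truncate_event, clear_event), or None if nothing is playing.

    Timestamps are Twilio media timestamps in milliseconds:
    - latest_media_timestamp_ms: timestamp of the newest inbound audio frame
    - response_start_timestamp_ms: timestamp when the assistant's current
      response started playing (None when no response is in progress)
    """
    if response_start_timestamp_ms is None or last_assistant_item_id is None:
        return None

    # How much of the assistant's audio the caller heard before interrupting.
    elapsed_ms = max(0, latest_media_timestamp_ms - response_start_timestamp_ms)

    # Tell OpenAI to cut the assistant item at the point the caller heard.
    truncate_event: Event = {
        "type": "conversation.item.truncate",
        "item_id": last_assistant_item_id,
        "content_index": 0,
        "audio_end_ms": elapsed_ms,
    }
    # Tell Twilio to drop any assistant audio still buffered for playback.
    clear_event: Event = {"event": "clear", "streamSid": stream_sid}
    return truncate_event, clear_event
```

For example, if the response started at 12,000 ms and the caller speaks when the latest media timestamp is 13,500 ms, the truncate event carries `audio_end_ms` of 1,500.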
- WebSocket errors: verify `OPENAI_API_KEY`, network egress, and that Realtime API access is enabled for your key.
- No audio: ensure your Twilio number is configured to call the correct tunnel URL and that the tunnel is active.
- Interruption isn’t working: set `SHOW_TIMING_MATH=true` and watch server logs to confirm timestamps/marks.
- Realtime schema errors (e.g., `unknown_parameter`): the app uses `{"type":"response.create","response":{"instructions":"..."}}` and sets modalities in the session. If you change session config, align payloads with the latest Realtime docs.
MIT (see `LICENSE`).