Description
Is your feature or improvement request related to a problem? Please describe.
Asterisk’s WebSocket media driver (chan_websocket) allows streaming binary audio data from external applications (like AI voice bots) but does not provide any mechanism to know when a specific portion of audio has actually been played.
This causes synchronization issues for AI-driven real-time streaming systems that generate audio dynamically (e.g., Text-to-Speech or conversational AI). Without playback progress acknowledgment, the application has no way to determine when the audio it sent has finished playing.
This leads to two major problems:
- Over-buffering: the application keeps sending new audio before earlier audio has finished playing, which inflates latency.
- Under-buffering: the application waits too long to send the next chunk, causing audible playback gaps.
Describe the solution you'd like
Introduce a playback progress acknowledgment mechanism for the WebSocket media driver.
Possible designs:
- Mark-based acknowledgment
  - Allow clients to send a `MARK id=<uuid>` text control command that sets a logical boundary in the playback queue.
  - When Asterisk finishes playing all media queued before that mark, it responds with `MARK_PLAYED id=<uuid>`.

Example:
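For illustration only (the `MARK` / `MARK_PLAYED` frames are the proposal here, not an existing chan_websocket command, and the UUID is made up), an exchange might look like:

```
client   -> asterisk (binary): <320 bytes of u-law, ~40 ms of audio>
client   -> asterisk (binary): <320 bytes of u-law, ~40 ms of audio>
client   -> asterisk (text):   MARK id=3f2a9c1e-5b7d-4c21-9e0f-2d6a8b14c370
client   -> asterisk (binary): <next audio chunk>
asterisk -> client   (text):   MARK_PLAYED id=3f2a9c1e-5b7d-4c21-9e0f-2d6a8b14c370
```

The `MARK_PLAYED` frame would be sent only once all media queued before the mark has actually been played to the channel, so the client can use it as a reliable pacing signal.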
Describe alternatives you've considered
We explored existing Asterisk WebSocket control messages:
- REPORT_QUEUE_DRAINED: notifies only when the entire queue is empty, so it is not useful for partial playback acknowledgment.
- FLUSH_MEDIA: clears buffered audio but provides no confirmation of playback.
- MEDIA_XOFF / MEDIA_XON: flow control only, unrelated to playback timing.
We also attempted to simulate playback timing locally by estimating real-time audio consumption (8 kHz μ-law plays at 8 bytes per millisecond; see the sketch below), but this approach is only an approximation and cannot confirm actual playback progress inside Asterisk.
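A minimal sketch of this workaround (hypothetical helper code on the client side, not part of Asterisk; assumes 8 kHz μ-law):

```python
# Client-side estimation of when queued audio *should* finish playing.
# 8 kHz mu-law is 8000 samples/s at 1 byte/sample = 8 bytes per millisecond.
# This drifts from Asterisk's real jitter-buffer state and is silently
# wrong after a FLUSH_MEDIA or a network stall, which is why it is only
# an approximation.
import time

ULAW_BYTES_PER_MS = 8

class PlaybackEstimator:
    def __init__(self) -> None:
        # Time at which all audio sent so far is estimated to finish.
        self.play_until = time.monotonic()

    def on_chunk_sent(self, chunk: bytes) -> None:
        duration_s = len(chunk) / ULAW_BYTES_PER_MS / 1000.0
        # Playback starts either now or when the previous chunk ends.
        self.play_until = max(self.play_until, time.monotonic()) + duration_s

    def wait_until_drained(self) -> None:
        # Sleep until the estimate says the queue is empty.
        time.sleep(max(0.0, self.play_until - time.monotonic()))
```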
Additional context
Use case: AI-driven voice bots and real-time speech generation systems streaming audio to Asterisk over WebSocket.
Modern AI engines (like the OpenAI Realtime API, ElevenLabs, or custom TTS models) generate audio in small, variable-sized chunks. These systems need an acknowledgment that specific chunks have been played (see the client sketch after this list) so they can dynamically:
- Generate the next segment of speech
- Handle interruptions or “barge-in” events
- Avoid excessive buffering and latency
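As a sketch, assuming the proposed `MARK` / `MARK_PLAYED` frames plus the existing FLUSH_MEDIA command (the endpoint URL and helper names below are hypothetical), a voice-bot client could pace generation and handle barge-in like this:

```python
# Illustrative client loop built on the *proposed* MARK/MARK_PLAYED
# frames; of the control messages used here, only FLUSH_MEDIA exists
# in chan_websocket today.
import asyncio
import uuid
import websockets  # third-party: pip install websockets

async def speak(ws, tts_chunks, barge_in: asyncio.Event) -> None:
    """Send TTS chunks one at a time, waiting for playback confirmation."""
    for chunk in tts_chunks:
        if barge_in.is_set():
            await ws.send("FLUSH_MEDIA")  # caller interrupted: drop queued audio
            return
        mark_id = str(uuid.uuid4())
        await ws.send(chunk)                  # binary frame: one audio segment
        await ws.send(f"MARK id={mark_id}")   # text frame: boundary after it
        # Generate nothing more until Asterisk confirms playback, keeping
        # buffering (and thus barge-in latency) minimal. Binary frames
        # received in the meantime (e.g. caller audio) are skipped.
        while await ws.recv() != f"MARK_PLAYED id={mark_id}":
            pass

async def main() -> None:
    barge_in = asyncio.Event()
    # Hypothetical media endpoint; 3 x 200 ms of u-law silence as demo audio.
    async with websockets.connect("ws://pbx.example.com:8088/media") as ws:
        await speak(ws, [b"\xff" * 1600] * 3, barge_in)

asyncio.run(main())
```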
Adding per-chunk or progress-based acknowledgment would significantly improve synchronization for real-time applications and make Asterisk more compatible with emerging AI voice technologies.
Proposed area: WebSocket Media Driver (chan_websocket)