[improvement]: Add playback progress acknowledgment for WebSocket media (per-chunk or byte-level acknowledgment) #1574

@gauravs456

Description

Is your feature or improvement request related to a problem? Please describe.

Asterisk’s WebSocket media driver (chan_websocket) lets external applications (such as AI voice bots) stream binary audio into a channel, but it provides no mechanism for the application to learn when a specific portion of that audio has actually been played.

This causes synchronization issues for AI-driven real-time streaming systems that generate audio dynamically (e.g., Text-to-Speech or conversational AI). Without playback progress acknowledgment, the application has no way to determine when the audio it sent has finished playing.

This leads to two major problems:

  1. Over-buffering: the application keeps sending new audio before earlier audio has finished playing, which increases latency.
  2. Under-buffering: the application waits too long to send the next chunk, which causes gaps in playback.

Describe the solution you'd like

Introduce a playback progress acknowledgment mechanism for the WebSocket media driver.

Possible designs:

  1. Mark-based acknowledgment

    • Allow clients to send a MARK id=<uuid> text control command that sets a logical boundary in the playback queue.
    • When Asterisk finishes playing all media queued before that mark, it responds with MARK_PLAYED id=<uuid>.

    Example:
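A possible exchange (illustrative only; the exact wire format does not exist yet and would be settled during implementation):

```
App → Asterisk   <binary frames: audio for sentence 1>
App → Asterisk   MARK id=<uuid-1>              (text frame)
App → Asterisk   <binary frames: audio for sentence 2>
Asterisk → App   MARK_PLAYED id=<uuid-1>       (sentence 1 has fully played)
```

And a minimal client sketch in Python, using the third-party `websockets` package. The `MARK`/`MARK_PLAYED` strings are the proposed (not yet existing) messages, and the URL and TTS source are placeholders:

```python
import asyncio
import uuid

import websockets  # third-party: pip install websockets

ASTERISK_WS = "ws://asterisk.example.com:8088/media"  # placeholder URL


def tts_segments():
    """Stand-in for a streaming TTS engine; yields raw mu-law chunks."""
    yield b"\xff" * 1600  # roughly 200 ms of mu-law silence at 8 kHz


async def play_and_wait(ws, audio: bytes) -> None:
    """Send one audio segment followed by a MARK, then block until the
    matching MARK_PLAYED arrives (proposed semantics, per this issue)."""
    mark_id = str(uuid.uuid4())
    await ws.send(audio)                 # binary frame: media toward the channel
    await ws.send(f"MARK id={mark_id}")  # text frame: proposed control message
    while True:
        msg = await ws.recv()
        # Skip binary frames (inbound call audio) and unrelated control text.
        if isinstance(msg, str) and msg == f"MARK_PLAYED id={mark_id}":
            return


async def main():
    async with websockets.connect(ASTERISK_WS) as ws:
        for segment in tts_segments():
            await play_and_wait(ws, segment)


asyncio.run(main())
```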

Describe alternatives you've considered

We explored existing Asterisk WebSocket control messages:

  • REPORT_QUEUE_DRAINED: Notifies only when the entire queue is empty, so it cannot acknowledge partial playback (the closest workaround today; see the sketch after this list).
  • FLUSH_MEDIA: Clears buffered audio but provides no confirmation of playback.
  • MEDIA_XOFF / MEDIA_XON: Flow control only, unrelated to playback timing.
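With today's messages, the closest approximation is whole-queue granularity. A rough sketch, assuming the notification arrives as a text frame containing the literal string `REPORT_QUEUE_DRAINED` (the exact framing is whatever chan_websocket defines):

```python
async def send_and_drain(ws, audio: bytes) -> None:
    """All-or-nothing synchronization: wait until Asterisk reports the
    entire queue empty. The client cannot overlap generating segment
    N+1 with playback of segment N, which is exactly the limitation
    this feature request targets."""
    await ws.send(audio)
    while True:
        msg = await ws.recv()
        if isinstance(msg, str) and "REPORT_QUEUE_DRAINED" in msg:
            return
```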

We also tried estimating playback timing locally from real-time audio consumption (1 ms per 8 bytes for μ-law), but this is only an approximation and cannot confirm actual playback progress inside Asterisk.
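For reference, the arithmetic behind that estimate: μ-law telephony audio is 8000 one-byte samples per second, i.e. 8 bytes per millisecond, so a 1600-byte chunk represents 200 ms of speech. A local pacing loop built on that looks roughly like the sketch below (illustrative only; it cannot observe Asterisk's real playout position, network delay, or jitter buffering):

```python
import asyncio
import time

ULAW_BYTES_PER_MS = 8  # 8000 samples/s * 1 byte/sample = 8 bytes/ms


async def paced_send(ws, chunks, lead_ms: float = 200.0) -> None:
    """Throttle sending so the estimated send position never runs more
    than lead_ms ahead of wall-clock time. Only an estimate: the true
    playout position inside Asterisk can drift, which is why real
    acknowledgments are being requested."""
    start = time.monotonic()
    sent_ms = 0.0
    for chunk in chunks:
        elapsed_ms = (time.monotonic() - start) * 1000.0
        ahead_ms = sent_ms - elapsed_ms
        if ahead_ms > lead_ms:
            await asyncio.sleep((ahead_ms - lead_ms) / 1000.0)
        await ws.send(chunk)
        sent_ms += len(chunk) / ULAW_BYTES_PER_MS
```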

Additional context

Use case: AI-driven voice bots and real-time speech generation systems streaming audio to Asterisk over WebSocket.

Modern AI engines (like the OpenAI Realtime API, ElevenLabs, or custom TTS models) generate audio in small, variable-sized chunks. These systems need to know when specific chunks have been played so they can dynamically:

  • Generate the next segment of speech
  • Handle interruptions or “barge-in” events (see the sketch after this list)
  • Avoid excessive buffering and latency
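For the barge-in case, the existing FLUSH_MEDIA message could be combined with the proposed marks. A hypothetical sketch, where `pending_marks` is a client-side set of MARK ids that have not yet been acknowledged:

```python
async def handle_barge_in(ws, pending_marks: set) -> None:
    """Caller started speaking: discard queued bot audio and forget
    marks that will never fire for the flushed audio."""
    await ws.send("FLUSH_MEDIA")  # existing control message: clears buffered audio
    pending_marks.clear()         # proposed MARKs covering flushed audio won't be played
```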

Adding per-chunk or progress-based acknowledgment would significantly improve synchronization for real-time applications and make Asterisk more compatible with emerging AI voice technologies.

Proposed area: WebSocket Media Driver (chan_websocket)
