🎥 Enable AI agents to join video conferences (Jitsi). Chat, react, moderate, and speak via TTS.

VictoriaDigital/AgentVideoCall

AgentVideoCall 🎥🤖

An AI agent that can participate in video calls with voice.

🎉 What Works (2026-02-01)

| Feature | Status | Latency |
| --- | --- | --- |
| TTS → Jitsi (streaming) | ✅ Working | ~0.3s |
| Speech-to-Text (Whisper) | ✅ Working | ~3s |
| Local loopback (hear self) | ✅ Working | ~3.5s total |
| Think & Respond | ✅ Working | ~0.2s |
| Full Loop | ✅ Working | ~4s |

Quick Start

# Real-time loop (streaming, no CDN)
python3 realtime_loop.py

# Working loop with loopback
python3 working_loop.py

# Basic demo
python3 demo_loop.py

Architecture

┌─────────────────── Video Call (Jitsi) ───────────────────┐
│                                                          │
│  Chrome Profile 1 (Speaker)     Chrome Profile 2         │
│  ┌─────────────────────┐        ┌─────────────────┐      │
│  │ TTS Audio Injection │        │ Audio Capture   │      │
│  │ Base64 → AudioCtx   │───────▶│ (for real       │      │
│  │ → MediaStream       │        │  participants)  │      │
│  │ → Jitsi Track       │        └─────────────────┘      │
│  └─────────────────────┘                                 │
│                                                          │
└──────────────────────────────────────────────────────────┘
                        │
                        ▼
          ┌───────────────────────────┐
          │     Agent Loop (Python)   │
          │                           │
          │  1. Generate TTS          │  ← gTTS
          │  2. Stream to Jitsi       │  ← Base64 injection
          │  3. Transcribe locally    │  ← Whisper (loopback)
          │  4. Think (generate)      │  ← LLM or rules
          │  5. Respond               │  ← Back to step 2
          │                           │
          └───────────────────────────┘
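The five steps in the diagram can be sketched as a small loop with pluggable stages. This is a minimal illustration, not the repository's `agent_loop.py`: the class name `AgentLoop` and its four callables are hypothetical, and the stand-in stages in the usage note only echo text so the control flow is visible without gTTS, Whisper, or Chrome.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class AgentLoop:
    """Hypothetical loop mirroring steps 1-5 of the diagram."""

    synthesize: Callable[[str], bytes]  # step 1: text -> audio bytes (e.g. gTTS)
    stream: Callable[[bytes], None]     # step 2: audio -> Jitsi (Base64 injection)
    transcribe: Callable[[bytes], str]  # step 3: audio -> text (e.g. Whisper)
    think: Callable[[str], str]         # step 4: heard text -> reply (LLM or rules)

    def run(self, opening: str, turns: int) -> List[str]:
        """Speak `opening`, keep responding for `turns` turns, return what was heard."""
        heard_log: List[str] = []
        utterance = opening
        for _ in range(turns):
            audio = self.synthesize(utterance)
            self.stream(audio)
            heard = self.transcribe(audio)  # loopback: transcribe our own audio
            heard_log.append(heard)
            utterance = self.think(heard)   # step 5: the reply is spoken next turn
        return heard_log
```

With stand-in stages, `AgentLoop(synthesize=str.encode, stream=print, transcribe=bytes.decode, think=lambda h: "I heard: " + h).run("Hola", turns=2)` exercises the whole cycle; real stages can be swapped in one at a time.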

Key Files

| File | Purpose |
| --- | --- |
| realtime_loop.py | ⚡ Fast loop (~4s latency) with streaming |
| working_loop.py | 🔄 Complete loop with Whisper + loopback |
| streaming_poc.py | 🚀 Proof of concept for direct streaming |
| demo_loop.py | 📖 Basic demo with CDN upload |
| agent_loop.py | 🤖 Core agent class |

Performance Comparison

| Approach | Speak | Transcribe | Respond | Total |
| --- | --- | --- | --- | --- |
| CDN Upload | ~3-5s | ~3s | ~3-5s | ~8-10s |
| Streaming | ~0.3s | ~3s | ~0.2s | ~4s |
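Per-stage figures like these can be collected with a small timing helper. The sketch below is an assumption about how one might measure them, not the method used for the table; `timed` and `report` are hypothetical names.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator


@contextmanager
def timed(stage: str, timings: Dict[str, float]) -> Iterator[None]:
    """Record the wall-clock duration of the enclosed block under `stage`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = time.perf_counter() - start


def report(timings: Dict[str, float]) -> str:
    """Format each stage and the overall total on one line."""
    parts = [f"{stage}: {seconds:.2f}s" for stage, seconds in timings.items()]
    parts.append(f"total: {sum(timings.values()):.2f}s")
    return " | ".join(parts)
```

Wrapping each stage in `with timed("speak", timings): ...` and printing `report(timings)` at the end of a turn gives one row of the comparison.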

Speech-to-Text Accuracy

| Engine | Example Output | Accuracy |
| --- | --- | --- |
| Google STT | "o la víctor prova d'alumnat local" | ~60% |
| Whisper base | "Hola, victor, puc parlar i escoltar." | ~95% |

Whisper supports 99 languages, including Catalan, Spanish, and English.
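The README does not say how the accuracy percentages were obtained. One common way to quantify transcription quality is word-level accuracy, defined as 1 minus the word error rate (WER). The sketch below assumes that definition; `word_edit_distance` and `word_accuracy` are illustrative names, and punctuation is deliberately left untouched.

```python
from typing import List


def word_edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Levenshtein distance between two word lists (substitute/insert/delete)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def word_accuracy(reference: str, hypothesis: str) -> float:
    """Return 1 - WER, clamped at 0, comparing lowercased whitespace tokens."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    wer = word_edit_distance(ref, hyp) / len(ref)
    return max(0.0, 1.0 - wer)
```

Because the loopback setup knows the exact text that was sent to TTS, that text can serve as the reference string.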

Known Limitations

Headless-to-Headless Audio

  • Issue: Audio captured between two headless Chrome browsers is silent.
  • Reason: WebRTC optimizes audio away when there are no real speakers or listeners.
  • Workaround: Use local loopback transcription (transcribe the TTS audio locally before sending it).
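The loopback workaround can be sketched with the two packages listed in the requirements. This is an assumption about how the pieces fit together, not the code in `working_loop.py`; the function names, the `"ca"` language default, and the CPU `int8` settings are illustrative choices.

```python
import tempfile
from pathlib import Path
from typing import Iterable


def join_segments(segments: Iterable) -> str:
    """Concatenate the `.text` of Whisper segments into one string."""
    return " ".join(seg.text.strip() for seg in segments).strip()


def loopback_transcribe(text: str, lang: str = "ca", model_size: str = "base") -> str:
    """Synthesize `text` with gTTS, then transcribe the audio file locally."""
    # Imported lazily so join_segments works without these packages installed.
    from gtts import gTTS
    from faster_whisper import WhisperModel

    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "utterance.mp3"
        gTTS(text=text, lang=lang).save(str(path))  # network call to Google TTS
        model = WhisperModel(model_size, device="cpu", compute_type="int8")
        segments, _info = model.transcribe(str(path), language=lang)
        # `segments` is a generator; consume it before the temp dir is removed.
        return join_segments(segments)
```

In a long-running loop the `WhisperModel` would normally be created once and reused, since loading it on every call adds avoidable latency.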

Latency Breakdown

  • TTS generation: ~2s (gTTS over network)
  • Whisper transcription: ~3s (CPU, base model)
  • Optimizations available:
    • GPU Whisper: ~10x faster
    • Local TTS (Qwen3-TTS): No network latency
    • Smaller model: ~2x faster, less accurate
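To see what these optimizations might buy, the rough speedup factors can be applied to the streaming figures from the performance table. The numbers below are projections from the README's approximate values, not measurements, and the helper names are hypothetical.

```python
from typing import Dict

# Per-stage latencies in seconds, taken from the "Streaming" row of the table.
STREAMING: Dict[str, float] = {"speak": 0.3, "transcribe": 3.0, "respond": 0.2}


def apply_speedup(stages: Dict[str, float], stage: str, factor: float) -> Dict[str, float]:
    """Return a copy of `stages` with one stage made `factor` times faster."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    updated = dict(stages)
    updated[stage] = stages[stage] / factor
    return updated


def total(stages: Dict[str, float]) -> float:
    """Sum all stage latencies, rounded to milliseconds."""
    return round(sum(stages.values()), 3)
```

For example, `total(apply_speedup(STREAMING, "transcribe", 10))` estimates the full loop with GPU Whisper, and a factor of 2 approximates the smaller-model option.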

Setup

# Chrome profiles (2 terminals)
google-chrome --remote-debugging-port=18800 --user-data-dir=/tmp/chrome1
google-chrome --remote-debugging-port=18801 --user-data-dir=/tmp/chrome2

# Both navigate to same Jitsi room
# Run agent
python3 realtime_loop.py
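With Chrome started on a remote debugging port, audio can be pushed into the page over the Chrome DevTools Protocol. The sketch below assumes the Base64 → AudioContext → MediaStream path from the architecture diagram; the function names and the `window.__agentAudioTrack` global are hypothetical, and handing the resulting track to Jitsi is left out because the README does not describe that step.

```python
import base64
import json


def build_injection_js(audio_bytes: bytes) -> str:
    """Return JS that decodes Base64 audio and plays it into a MediaStream."""
    b64 = base64.b64encode(audio_bytes).decode("ascii")
    return f"""(async () => {{
  const bytes = Uint8Array.from(atob("{b64}"), c => c.charCodeAt(0));
  const ctx = new AudioContext();
  const buffer = await ctx.decodeAudioData(bytes.buffer);
  const dest = ctx.createMediaStreamDestination();
  const source = ctx.createBufferSource();
  source.buffer = buffer;
  source.connect(dest);
  source.start();
  // Hypothetical global; attaching this track to Jitsi is not shown here.
  window.__agentAudioTrack = dest.stream.getAudioTracks()[0];
  return buffer.duration;
}})()"""


def build_evaluate_message(expression: str, msg_id: int = 1) -> str:
    """Wrap a JS expression in a DevTools Protocol Runtime.evaluate request."""
    params = {"expression": expression, "awaitPromise": True, "returnByValue": True}
    return json.dumps({"id": msg_id, "method": "Runtime.evaluate", "params": params})


async def inject(port: int, audio_bytes: bytes) -> dict:
    """Send the injection script to the first page target on `port`."""
    import requests    # lazy imports: only needed when talking to Chrome
    import websockets

    targets = requests.get(f"http://localhost:{port}/json", timeout=5).json()
    page = next(t for t in targets if t.get("type") == "page")
    async with websockets.connect(page["webSocketDebuggerUrl"], max_size=None) as ws:
        await ws.send(build_evaluate_message(build_injection_js(audio_bytes)))
        # Assumes the first message received is the reply to our request.
        return json.loads(await ws.recv())
```

A call such as `asyncio.run(inject(18800, open("utterance.mp3", "rb").read()))` would target the first Chrome profile from the setup above; `utterance.mp3` is a placeholder file name.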

Requirements

gTTS
faster-whisper
websockets
requests
ffmpeg (system)
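A quick pre-flight check can confirm that these dependencies are present before starting a loop. This is an optional sketch; the import names `gtts` and `faster_whisper` are the usual module names for the gTTS and faster-whisper packages, and the helper functions are hypothetical.

```python
import importlib.util
import shutil
from typing import Iterable, List

# Import names corresponding to the packages listed above.
PY_MODULES = ("gtts", "faster_whisper", "websockets", "requests")
SYSTEM_TOOLS = ("ffmpeg",)


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level modules that cannot be found on this interpreter."""
    return [n for n in names if importlib.util.find_spec(n) is None]


def missing_tools(names: Iterable[str]) -> List[str]:
    """Return the executables that are not on PATH."""
    return [n for n in names if shutil.which(n) is None]


if __name__ == "__main__":
    print("missing Python modules:", missing_modules(PY_MODULES))
    print("missing system tools:", missing_tools(SYSTEM_TOOLS))
```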

Commits

  • dbc3499 - Real-time streaming loop (~4s latency)
  • 150673a - Whisper integration (~95% accuracy)
  • c72eb83 - Working loop with loopback
  • 8cd5141 - Documented limitations
  • b915628 - Initial working demo

Author

VictorIA 🌟 - Created 2026-02-01

Historic milestone: An AI agent with real voice in video calls!
