AI Agent that can participate in video calls with voice.
| Feature | Status | Latency |
|---|---|---|
| TTS → Jitsi (streaming) | ✅ Working | ~0.3s |
| Speech-to-Text (Whisper) | ✅ Working | ~3s |
| Local loopback (hear self) | ✅ Working | ~3.5s total |
| Think & Respond | ✅ Working | ~0.2s |
| Full Loop | ✅ Working | ~4s |
# Real-time loop (streaming, no CDN)
python3 realtime_loop.py
# Working loop with loopback
python3 working_loop.py
# Basic demo
python3 demo_loop.py

┌─────────────────── Video Call (Jitsi) ───────────────────┐
│                                                          │
│  Chrome Profile 1 (Speaker)       Chrome Profile 2       │
│  ┌─────────────────────┐          ┌─────────────────┐    │
│  │ TTS Audio Injection │          │ Audio Capture   │    │
│  │ Base64 → AudioCtx   │─────────▶│ (for real       │    │
│  │ → MediaStream       │          │ participants)   │    │
│  │ → Jitsi Track       │          └─────────────────┘    │
│  └─────────────────────┘                                 │
│                                                          │
└──────────────────────────────────────────────────────────┘
                             │
                             ▼
               ┌───────────────────────────┐
               │    Agent Loop (Python)    │
               │                           │
               │ 1. Generate TTS           │ → gTTS
               │ 2. Stream to Jitsi        │ → Base64 injection
               │ 3. Transcribe locally     │ → Whisper (loopback)
               │ 4. Think (generate)       │ → LLM or rules
               │ 5. Respond                │ → Back to step 2
               │                           │
               └───────────────────────────┘
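
The five steps of the agent loop can be sketched as a small class with pluggable stages. This is only an illustration of the control flow, not the code in `agent_loop.py`: the `AgentLoop` name, its fields, and the stub stages are all hypothetical stand-ins for gTTS, the Jitsi injection, Whisper, and the LLM or rules.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class AgentLoop:
    """Hypothetical skeleton of the five-step loop; every callable is a stand-in."""
    synthesize: Callable[[str], bytes]   # step 1: text -> audio bytes (e.g. gTTS)
    inject: Callable[[bytes], None]      # step 2: push audio into the Jitsi tab
    transcribe: Callable[[bytes], str]   # step 3: loopback transcription (e.g. Whisper)
    think: Callable[[str], str]          # step 4: LLM or rules

    def run(self, opening_line: str, turns: int) -> List[str]:
        """Speak `opening_line`, then keep responding for `turns` iterations."""
        heard_log = []
        text = opening_line
        for _ in range(turns):
            audio = self.synthesize(text)    # 1. generate TTS
            self.inject(audio)               # 2. stream to Jitsi
            heard = self.transcribe(audio)   # 3. transcribe locally
            heard_log.append(heard)
            text = self.think(heard)         # 4. think; 5. respond on the next pass
        return heard_log


if __name__ == "__main__":
    # Stub stages so the sketch runs without a browser, network, or model.
    sent = []
    loop = AgentLoop(
        synthesize=lambda text: text.encode("utf-8"),
        inject=sent.append,
        transcribe=lambda audio: audio.decode("utf-8"),
        think=lambda heard: heard + "!",
    )
    print(loop.run("Hola", turns=3))
```

Keeping each stage behind a plain callable makes it easy to swap gTTS for a local TTS engine, or the rules for an LLM, without touching the loop itself.
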
| File | Purpose |
|---|---|
| `realtime_loop.py` | Fast loop (~4s latency) with streaming |
| `working_loop.py` | Complete loop with Whisper + loopback |
| `streaming_poc.py` | Proof of concept for direct streaming |
| `demo_loop.py` | Basic demo with CDN upload |
| `agent_loop.py` | Core agent class |
| Approach | Speak | Transcribe | Respond | Total |
|---|---|---|---|---|
| CDN Upload | ~3-5s | ~3s | ~3-5s | ~8-10s |
| Streaming | ~0.3s | ~3s | ~0.2s | ~4s |
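
The streaming approach avoids the CDN round trip by sending the audio to the page as a Base64 string. A minimal sketch of how such a payload might be built follows. It assumes the agent talks to Chrome over the DevTools protocol; `build_inject_message` and the page-side helper `window.playAgentAudio` are hypothetical names, not taken from this repository.

```python
import base64
import json


def build_inject_message(audio: bytes, message_id: int = 1) -> str:
    """Wrap audio bytes in a DevTools `Runtime.evaluate` call.

    `window.playAgentAudio` is a hypothetical page-side helper that would
    decode the Base64 string, feed it to an AudioContext, and expose the
    result as a MediaStream track for Jitsi.
    """
    b64 = base64.b64encode(audio).decode("ascii")
    # json.dumps quotes the string safely for use inside a JS expression.
    expression = f"window.playAgentAudio({json.dumps(b64)})"
    return json.dumps({
        "id": message_id,
        "method": "Runtime.evaluate",
        "params": {"expression": expression, "awaitPromise": True},
    })


if __name__ == "__main__":
    print(build_inject_message(b"\x00\x01fake-mp3-bytes"))
```

The resulting string would then be sent over the tab's DevTools WebSocket, for example with the `websockets` package. The audio never leaves the machine before it reaches the browser, which is presumably where most of the time saved over the CDN upload comes from.
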
| Engine | Example Output | Accuracy |
|---|---|---|
| Google STT | "o la víctor prova d'alumnat local" | ~60% |
| Whisper base | "Hola, victor, puc parlar i escoltar." | ~95% |
Whisper supports 99 languages, including Catalan, Spanish, and English.
- Issue: Audio capture between headless Chrome browsers returns silence
- Reason: WebRTC optimizes away audio when there are no real speakers or listeners
- Solution: Use local loopback transcription (transcribe TTS before sending)
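
The loopback workaround could look roughly like the sketch below, which transcribes the generated TTS file directly with `faster-whisper` instead of capturing it from the call. The function names, the `int8` compute type, and the Catalan language code `"ca"` are assumptions for illustration; the repository's scripts may be organised differently.

```python
from typing import Any


def load_whisper(model_size: str = "base") -> Any:
    """Load a CPU Whisper model; imported lazily so the helper below stays testable."""
    from faster_whisper import WhisperModel
    return WhisperModel(model_size, device="cpu", compute_type="int8")


def transcribe_loopback(model: Any, audio_path: str, language: str = "ca") -> str:
    """Transcribe the TTS file locally instead of capturing it from the call."""
    segments, _info = model.transcribe(audio_path, language=language)
    # faster-whisper yields segments lazily; join their text into one string.
    return " ".join(segment.text.strip() for segment in segments)
```

A call such as `transcribe_loopback(load_whisper(), "tts.mp3")` would run before the audio is injected, so the agent "hears" its own line without relying on WebRTC capture.
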
- TTS generation: ~2s (gTTS over network)
- Whisper transcription: ~3s (CPU, base model)
- Optimizations available:
  - GPU Whisper: ~10x faster
  - Local TTS (Qwen3-TTS): No network latency
  - Smaller model: ~2x faster, less accurate
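
To see what these optimizations might buy, the per-stage figures can be summed into a rough latency budget. The sketch below uses the streaming numbers from the comparison table (0.3s speak, 3s transcribe, 0.2s respond) and treats the ~10x GPU speedup as an assumption rather than a measurement; `loop_latency` is a hypothetical helper.

```python
def loop_latency(speak: float, transcribe: float, respond: float,
                 whisper_speedup: float = 1.0) -> float:
    """Sum per-stage latencies (seconds), scaling transcription by a speedup factor."""
    return round(speak + transcribe / whisper_speedup + respond, 2)


if __name__ == "__main__":
    print(loop_latency(0.3, 3.0, 0.2))                        # CPU, base model
    print(loop_latency(0.3, 3.0, 0.2, whisper_speedup=10.0))  # assumed GPU speedup
```

With these inputs the budget comes to 3.5s on CPU, in line with the ~4s full loop, and 0.8s if transcription really were ten times faster. Transcription dominates the loop, so it is the stage most worth optimizing.
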
# Chrome profiles (2 terminals)
google-chrome --remote-debugging-port=18800 --user-data-dir=/tmp/chrome1
google-chrome --remote-debugging-port=18801 --user-data-dir=/tmp/chrome2
# Both navigate to same Jitsi room
# Run agent
python3 realtime_loop.py

- gTTS
- faster-whisper
- websockets
- requests
- ffmpeg (system)
- `dbc3499` - Real-time streaming loop (~4s latency)
- `150673a` - Whisper integration (~95% accuracy)
- `c72eb83` - Working loop with loopback
- `8cd5141` - Documented limitations
- `b915628` - Initial working demo
VictorIA - Created 2026-02-01
Historic milestone: An AI agent with real voice in video calls!