BridgeSpeak
A cross-agent skill from BridgeMind that gives Claude Code, Hermes, and OpenClaw agents the ability to speak.
Powered by OpenAI's gpt-realtime-2. Plays on the user's speakers via the system's native audio player.
AI coding agents are getting good enough to work alongside you — but they still only talk through text. When you're walking the dog, driving, cooking, or staring at another monitor, a silent terminal is dead air. A short voice read-out — "Build complete. All 47 tests passed." — turns the agent into a teammate you can actually leave running.
The OpenAI Realtime API (gpt-realtime-2) ships flagship voices (marin, cedar) that finally sound human enough that you stop noticing they're synthesized. But wiring a WebSocket, decoding base64 PCM chunks, wrapping them in a WAV header, and piping that to the right system audio player on macOS / Linux / Windows is a half-day of plumbing every agent author writes from scratch.
BridgeSpeak is that half-day, packaged once. Drop it into any agent that follows the agentskills.io standard. Your agent gains a single capability: `bash speak.sh "text"`. Everything else — auth, transport, audio formats, player detection, headless fallback — is encapsulated.
| Component | Type | What It Does |
|---|---|---|
| `bridgespeak` | Skill | The capability. Auto-loaded when the user asks the agent to speak / read aloud / narrate / announce. Tells the agent how to invoke `speak.sh`, which voice to pick, and when not to speak. |
| `speak.py` | Script | Python WebSocket client. Connects to gpt-realtime-2, streams pcm16 @ 24 kHz mono, wraps as WAV, plays with the system's native player. ~280 lines, no external dependencies beyond `websockets`. |
| `speak.sh` | Wrapper | POSIX shell entry point for macOS / Linux. Picks Python 3, forwards args. |
| `speak.ps1` | Wrapper | PowerShell entry point for Windows. |
Install as a Claude Code plugin:

```bash
claude plugin install bridgespeak@bridgemind-plugins
```

Or copy the skill in manually:

```bash
# Project-level
mkdir -p .claude/skills
cp -r skills/bridgespeak .claude/skills/

# Personal / global
mkdir -p ~/.claude/skills
cp -r skills/bridgespeak ~/.claude/skills/
```

Then make the script executable:

```bash
chmod +x ~/.claude/skills/bridgespeak/scripts/speak.sh
```

For Hermes:

```bash
# Hermes loads skills from skills/<category>/<name>/
mkdir -p ~/.hermes/skills/voice
cp -r skills/bridgespeak ~/.hermes/skills/voice/bridgespeak
cp -r scripts ~/.hermes/skills/voice/bridgespeak/scripts
```

Hermes will pick up `metadata.hermes.required_environment_variables` and pass `OPENAI_API_KEY` through.

For OpenClaw:

```bash
# OpenClaw default skill workspace
mkdir -p ~/.openclaw/workspace/skills
cp -r skills/bridgespeak ~/.openclaw/workspace/skills/bridgespeak
cp -r scripts ~/.openclaw/workspace/skills/bridgespeak/scripts
```

Or publish to clawdhub: `clawdhub publish ./skills/bridgespeak`.

For development, symlink the skill and scripts from a checkout:

```bash
ln -s "$(pwd)/skills/bridgespeak" ~/.claude/skills/bridgespeak
ln -s "$(pwd)/scripts" ~/.claude/skills/bridgespeak/scripts
```

Install the one Python dependency:

```bash
python3 -m pip install --user websockets
```

That's the only Python dep. Audio playback uses the system's native player (no install needed on macOS; one `apt install` on Linux if you don't already have `paplay` / `aplay` / `ffplay`; built-in on Windows via `Media.SoundPlayer`).

Set your API key:

```bash
export OPENAI_API_KEY=sk-...
```

Or persist it to a chmod-600 config file — see api-key-setup.md.
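If you prefer the config file, the setup looks like this. The path `~/.config/bridgespeak/config.json` is the one the script reads; the `api_key` field name is an assumption here, so check api-key-setup.md for the exact schema:

```bash
mkdir -p ~/.config/bridgespeak
cat > ~/.config/bridgespeak/config.json <<'EOF'
{ "api_key": "sk-..." }
EOF
chmod 600 ~/.config/bridgespeak/config.json
```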
The script does exactly four things:
- Auth — reads `OPENAI_API_KEY` (env or `~/.config/bridgespeak/config.json`). Fails fast with a copy-pasteable setup snippet if missing.
- WebSocket — opens `wss://api.openai.com/v1/realtime?model=gpt-realtime-2`, sends `session.update` (voice, format) → `conversation.item.create` (the text) → `response.create`.
- Stream → WAV — collects `response.output_audio.delta` events, base64-decodes each chunk, concatenates, and wraps the result in a 44-byte RIFF/WAVE header (24 kHz mono signed-LE).
- Play — detects the platform's native audio player (`afplay` on macOS, `paplay` / `aplay` / `ffplay` on Linux, `Media.SoundPlayer` / `ffplay` on Windows), plays the WAV from a temp file, deletes it on exit.
No live two-way conversation. No microphone. No barge-in. One round trip per invocation, audio out, done.
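For orientation, the whole round trip fits in a few dozen lines. Here is a minimal sketch of the same four steps, assuming the event and field names listed above; the exact `session.update` payload is documented in references/realtime-protocol.md, and the `websockets` header keyword is `additional_headers` on recent releases (`extra_headers` on older ones). speak.py is the authoritative version.

```python
import asyncio, base64, json, os, shutil, subprocess, wave

import websockets  # the skill's one dependency

REALTIME_URL = "wss://api.openai.com/v1/realtime?model=gpt-realtime-2"

async def speak(text: str, voice: str = "marin", out: str = "/tmp/bridgespeak.wav"):
    headers = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"}
    async with websockets.connect(REALTIME_URL, additional_headers=headers) as ws:
        # 1+2. Configure the session, enqueue the text, request a response.
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {"voice": voice, "output_audio_format": "pcm16"},
        }))
        await ws.send(json.dumps({
            "type": "conversation.item.create",
            "item": {"type": "message", "role": "user",
                     "content": [{"type": "input_text", "text": text}]},
        }))
        await ws.send(json.dumps({"type": "response.create"}))
        # 3. Collect base64 PCM16 deltas until the response completes.
        pcm = bytearray()
        async for raw in ws:
            event = json.loads(raw)
            if event["type"] == "response.output_audio.delta":
                pcm += base64.b64decode(event["delta"])
            elif event["type"] == "response.done":
                break
    # Wrap the raw PCM in a WAV container: 24 kHz, mono, 16-bit signed LE.
    with wave.open(out, "wb") as f:
        f.setnchannels(1)
        f.setsampwidth(2)
        f.setframerate(24000)
        f.writeframes(bytes(pcm))
    # 4. Play with the first native player found, mirroring the skill's order.
    player = next((p for p in ("afplay", "paplay", "aplay", "ffplay")
                   if shutil.which(p)), None)
    if player:
        args = [player, "-autoexit", "-nodisp", out] if player == "ffplay" else [player, out]
        subprocess.run(args, check=False)

if __name__ == "__main__":
    asyncio.run(speak("All 47 tests passed."))
```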
SKILL.md follows the agentskills.io open standard — the same format Claude Code, Cursor, Codex, Gemini CLI, Goose, Cline, Roo, and 30+ others consume. Hermes and OpenClaw both extend the same base spec; the manifest carries optional metadata.hermes and metadata.openclaw blocks for first-class integration with their respective registries, but the skill works without them.
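For reference, the manifest's frontmatter looks roughly like this. The `metadata.hermes.required_environment_variables` key comes from the install notes above; the `metadata.openclaw` field name shown for the python3 requirement is illustrative, not the literal key:

```yaml
---
name: bridgespeak
description: Read short text aloud via OpenAI's Realtime API (gpt-realtime-2).
metadata:
  hermes:
    required_environment_variables: [OPENAI_API_KEY]
  openclaw:
    requirements: [python3]  # illustrative key name
---
```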
When the user says "speak that summary", the agent calls:
```bash
bash ~/.claude/skills/bridgespeak/scripts/speak.sh \
  --voice marin \
  "All 47 tests passed. Build took 12 seconds."
```

That's it. The agent doesn't need to know about WebSockets, base64, PCM16, or audio playback. The skill teaches it when to speak, which voice to pick, and when not to (long output, sensitive data, headless environments, cost-sensitive contexts).
Install BridgeSpeak if your agent needs to:
- Announce completion of long-running tasks while you're away from the terminal
- Read short status / summary out loud during multitasking (driving, cooking, walking)
- Provide accessibility narration (low vision, RSI, hands-busy)
- Speak code review findings hands-free
- Read AI-generated content aloud for proofing (you catch issues by ear that you miss by eye)
Don't install BridgeSpeak if:
- Your agent runs only in CI / headless environments (use a TTS-to-file workflow instead)
- You need real-time two-way voice chat (use the Realtime API directly, or BridgeMind's `bridgevoice` Tauri app for on-device dictation)
- You're cost-sensitive and synthesizing long passages (~$0.24/minute on Realtime). For batch TTS, OpenAI's `/v1/audio/speech` endpoint is ~16× cheaper, but that is not what this skill provides: BridgeSpeak is Realtime-only by design (a comparison call is sketched below)
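For comparison, a batch-TTS call to that endpoint is a single HTTPS request, roughly as follows. The model and voice here are the classic TTS ones, not the Realtime flagships; check OpenAI's TTS docs for current options:

```bash
curl https://api.openai.com/v1/audio/speech \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "tts-1", "voice": "alloy", "input": "All 47 tests passed."}' \
  --output status.mp3
```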
gpt-realtime-2 ships ten voices. Default is marin (warm, natural, OpenAI's recommended flagship). Full guide: voices.md.
```bash
speak.sh --voice marin "warm, natural"            # default flagship
speak.sh --voice cedar "deeper, calm, formal"     # other flagship
speak.sh --voice shimmer --instructions "excited" "PR merged!"
speak.sh --voice sage --instructions "slow, pedagogical" "Reading the migration plan."
```

Once a voice has produced audio in a session, the API locks it for that session. The script opens one fresh session per invocation, so each call can pick any voice freely.
gpt-realtime-2 bills both audio and text tokens:
| Bucket | $/1M tokens |
|---|---|
| Text input | 4.00 |
| Text output | 24.00 |
| Audio input | 32.00 |
| Audio output | 64.00 (~$0.24/minute of speech) |
A 30-second status read-out is roughly 12¢. A 5-minute narration is roughly $1.20. Set a monthly cap on platform.openai.com/account/billing. The skill nudges agents to speak summaries, not full output.
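Those per-clip figures follow directly from the audio-output row; a quick sanity check in Python:

```python
def estimate_cost(seconds_of_speech: float, rate_per_minute: float = 0.24) -> float:
    """Rough audio-output cost for one clip, using the ~$0.24/min figure above.
    Ignores the comparatively tiny text-input token cost."""
    return seconds_of_speech / 60 * rate_per_minute

print(f"${estimate_cost(30):.2f}")   # $0.12 -> a 30 s status read-out is ~12 cents
print(f"${estimate_cost(300):.2f}")  # $1.20 -> a 5 min narration is ~$1.20
```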
```
BridgeSpeak/
├── .claude-plugin/
│   └── plugin.json
├── skills/
│   └── bridgespeak/
│       ├── SKILL.md
│       └── references/
│           ├── realtime-protocol.md
│           ├── voices.md
│           ├── api-key-setup.md
│           └── troubleshooting.md
├── scripts/
│   ├── speak.py    # Python WebSocket client (the workhorse)
│   ├── speak.sh    # POSIX wrapper (macOS / Linux)
│   └── speak.ps1   # PowerShell wrapper (Windows)
├── README.md
├── LICENSE
├── CHANGELOG.md
└── CONTRIBUTING.md
```
BridgeSpeak is a standard agentskills.io skill. The base spec is supported by 30+ tools.
| Tool | Skill | Plugin | Notes |
|---|---|---|---|
| Claude Code | ✅ | ✅ | Full plugin support via .claude-plugin/ |
| Hermes (NousResearch) | ✅ | — | metadata.hermes block declares env vars |
| OpenClaw | ✅ | — | metadata.openclaw block declares python3 requirement; install via clawdhub |
| Cursor | ✅ | — | Drop into .cursor/skills/ |
| OpenAI Codex | ✅ | — | Skill format |
| Gemini CLI | ✅ | — | Skill format |
| Cline / Roo Code | ✅ | — | Skill format |
| GitHub Copilot | ✅ | — | Reference via .github/copilot-instructions.md |
| Continue.dev | ✅ | — | Skill format |
| Goose | ✅ | — | Skill format |
| OS | Python | Player | Status |
|---|---|---|---|
| macOS 12+ | system Python 3 or `brew install python3` | `afplay` (preinstalled) | ✅ |
| Ubuntu / Debian | `apt install python3 python3-pip` | `paplay` / `aplay` / `ffplay` | ✅ |
| Fedora / Arch | system Python 3 | `paplay` / `aplay` / `ffplay` | ✅ |
| Windows 10+ | `winget install Python.Python.3.12` | `Media.SoundPlayer` (built-in) or `ffplay` | ✅ |
| Headless / SSH / CI | any Python 3.9+ | — | ✅ via `--no-play --output out.wav` |
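In headless environments the skill synthesizes to a file instead of playing, using the flags from the last row above:

```bash
bash ~/.claude/skills/bridgespeak/scripts/speak.sh --no-play --output out.wav "Build complete."
```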
- Not speech-to-text. Output only. For on-device dictation, see BridgeVoice — BridgeMind's Tauri Whisper app.
- Not a live two-way voice agent. One round-trip per call. No microphone capture, no barge-in, no interruption handling.
- Not a TTS API wrapper. It uses the Realtime model specifically. For batch TTS at ~16× lower cost, use OpenAI's `/v1/audio/speech` endpoint directly.
- Not a sandbox. It will speak whatever text you give it. Audio is broadcast — treat it like any other output channel for prompt-injection purposes (don't read aloud anything that could leak secrets in earshot of others).
- Not a guarantee of low latency. First-audio latency is typically 0.5–2s on a good connection; longer on flaky networks. The Realtime API was designed for live conversation; this skill uses it for one-shot synthesis.
- OpenAI Realtime API — gpt-realtime-2
- OpenAI Realtime WebSocket guide
- OpenAI Realtime sessions API reference
- Managing Realtime costs
- Agent Skills specification (agentskills.io)
- Claude Code Skills documentation
- Hermes Agent (NousResearch)
- OpenClaw
PRs welcome — especially for:
- New player backends (e.g., `pw-play` for PipeWire-only systems)
- Streaming playback (pipe deltas to `ffplay -i -` as they arrive instead of buffering; see the sketch after this list)
- Additional voice/tone presets
- Per-agent installer scripts (Hermes, OpenClaw, Cursor)
- Cost-tracking telemetry (opt-in)
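For the streaming-playback item, the rough shape (not in the repo, and untested) is to decode each delta as it arrives and write it straight to ffplay's stdin, using the `ffplay -i -` invocation named above:

```python
import base64
import subprocess

# ffplay can consume raw 24 kHz mono s16le PCM from stdin, so each
# response.output_audio.delta can be played as it arrives instead of
# being buffered into a WAV first.
ffplay = subprocess.Popen(
    ["ffplay", "-f", "s16le", "-ar", "24000", "-ac", "1",
     "-nodisp", "-autoexit", "-i", "-"],
    stdin=subprocess.PIPE,
)

def on_delta(event: dict) -> None:
    """Call from the WebSocket receive loop for each audio delta event."""
    ffplay.stdin.write(base64.b64decode(event["delta"]))
    ffplay.stdin.flush()

def on_done() -> None:
    ffplay.stdin.close()  # EOF lets ffplay drain its buffer and exit (-autoexit)
    ffplay.wait()
```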
See CONTRIBUTING.md.
MIT. See LICENSE. True open source. No license traps. Ship freely.
BridgeMind is an agentic organization — AI agents are teammates, not tools. We build open-source plugins for the builder community to ship faster through vibe coding.
Other open-source projects in the BridgeMind family:
- BridgeWard — prompt-injection defense for any agent reading untrusted content
- BridgeSecurity — app-sec vulnerability detection skill
- BridgeUI — design instincts for your agent
- BridgeRemotion — Remotion expert skill for marketing videos
- BridgeMotion — MIT-licensed React video framework
Built by BridgeMind. Give agents a voice. Ship audio-first.