BridgeSpeak

Give agents a voice. Ship audio-first.

A cross-agent skill from BridgeMind that gives Claude Code, Hermes, and OpenClaw agents the ability to speak.
Powered by OpenAI's gpt-realtime-2. Plays on the user's speakers via the system's native audio player.



Why BridgeSpeak?

AI coding agents are getting good enough to work alongside you — but they still only talk through text. When you're walking the dog, driving, cooking, or staring at another monitor, a silent terminal is dead air. A short voice read-out — "Build complete. All 47 tests passed." — turns the agent into a teammate you can actually leave running.

The OpenAI Realtime API (gpt-realtime-2) ships flagship voices (marin, cedar) that finally sound human enough that you stop noticing they're synthesized. But wiring a WebSocket, decoding base64 PCM chunks, wrapping them in a WAV header, and piping that to the right system audio player on macOS / Linux / Windows is a half-day of plumbing every agent author writes from scratch.

BridgeSpeak is that half-day, packaged once. Drop it into any agent that follows the agentskills.io standard. Your agent gains a single capability: bash speak.sh "text". Everything else — auth, transport, audio formats, player detection, headless fallback — is encapsulated.


What's Inside

| Component | Type | What It Does |
|---|---|---|
| `bridgespeak` | Skill | The capability. Auto-loaded when the user asks the agent to speak / read aloud / narrate / announce. Tells the agent how to invoke `speak.sh`, which voice to pick, and when not to speak. |
| `speak.py` | Script | Python WebSocket client. Connects to gpt-realtime-2, streams pcm16 @ 24 kHz mono, wraps it as WAV, plays it with the system's native player. ~280 lines, no external dependencies beyond `websockets`. |
| `speak.sh` | Wrapper | POSIX shell entry point for macOS / Linux. Picks Python 3, forwards args. |
| `speak.ps1` | Wrapper | PowerShell entry point for Windows. |

Install

As a Claude Code plugin

claude plugin install bridgespeak@bridgemind-plugins

Or copy the skill manually (Claude Code, Cursor, Codex, Gemini CLI, …)

# Project-level
mkdir -p .claude/skills
cp -r skills/bridgespeak .claude/skills/

# Personal / global
mkdir -p ~/.claude/skills
cp -r skills/bridgespeak ~/.claude/skills/

Then make the script executable:

chmod +x ~/.claude/skills/bridgespeak/scripts/speak.sh

Hermes (NousResearch)

# Hermes loads skills from skills/<category>/<name>/
mkdir -p ~/.hermes/skills/voice
cp -r skills/bridgespeak ~/.hermes/skills/voice/bridgespeak
cp -r scripts ~/.hermes/skills/voice/bridgespeak/scripts

Hermes will pick up metadata.hermes.required_environment_variables and pass OPENAI_API_KEY through.

OpenClaw

# OpenClaw default skill workspace
mkdir -p ~/.openclaw/workspace/skills
cp -r skills/bridgespeak ~/.openclaw/workspace/skills/bridgespeak
cp -r scripts ~/.openclaw/workspace/skills/bridgespeak/scripts

Or publish to clawdhub: clawdhub publish ./skills/bridgespeak.

Or symlink during development

ln -s "$(pwd)/skills/bridgespeak" ~/.claude/skills/bridgespeak
ln -s "$(pwd)/scripts"           ~/.claude/skills/bridgespeak/scripts

Install the runtime dependency

python3 -m pip install --user websockets

That's the only Python dep. Audio playback uses the system's native player (no install needed on macOS; one apt install on Linux if you don't already have paplay / aplay / ffplay; built-in on Windows via Media.SoundPlayer).
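The player detection can be sketched in a few lines of Python — a simplified take on what speak.py does (the function name and exact candidate order here are illustrative, not the script's actual internals):

```python
import platform
import shutil

def find_player():
    """Return the first available native audio player for this OS, or None.

    A simplified sketch of speak.py's detection logic; the candidate
    order is illustrative.
    """
    candidates = {
        "Darwin": ["afplay"],                    # preinstalled on macOS
        "Linux": ["paplay", "aplay", "ffplay"],  # PulseAudio, ALSA, FFmpeg
        "Windows": ["ffplay"],                   # Media.SoundPlayer is driven via PowerShell
    }.get(platform.system(), ["ffplay"])
    for name in candidates:
        if shutil.which(name):                   # on PATH and executable?
            return name
    return None                                  # headless: fall back to --no-play
```

If nothing is found, the script can still write the WAV to disk instead of playing it.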

Set your OpenAI API key

export OPENAI_API_KEY=sk-...

Or persist to a chmod-600 config file — see api-key-setup.md.
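The lookup order (env var first, then the config file) can be sketched like this — the `"api_key"` JSON field name is an assumption; see api-key-setup.md for the actual format:

```python
import json
import os
from pathlib import Path
from typing import Optional

CONFIG_PATH = Path.home() / ".config" / "bridgespeak" / "config.json"

def load_api_key() -> Optional[str]:
    """Env var wins; otherwise fall back to the chmod-600 config file.

    The "api_key" field name is an assumption; see api-key-setup.md.
    """
    key = os.environ.get("OPENAI_API_KEY")
    if key:
        return key
    try:
        return json.loads(CONFIG_PATH.read_text()).get("api_key")
    except (FileNotFoundError, json.JSONDecodeError):
        return None
```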


How It Works

One-shot text-to-audio over WebSocket

The script does exactly four things:

  1. Auth — reads OPENAI_API_KEY (env or ~/.config/bridgespeak/config.json). Fails fast with a copy-pasteable setup snippet if missing.
  2. WebSocket — opens wss://api.openai.com/v1/realtime?model=gpt-realtime-2, sends session.update (voice, format) → conversation.item.create (the text) → response.create.
  3. Stream → WAV — collects response.output_audio.delta events, base64-decodes each chunk, concatenates, wraps the result in a 44-byte RIFF/WAVE header (24 kHz mono signed-LE).
  4. Play — detects the platform's native audio player (afplay on macOS, paplay/aplay/ffplay on Linux, Media.SoundPlayer/ffplay on Windows), plays the WAV from a temp file, deletes it on exit.

No live two-way conversation. No microphone. No barge-in. One round trip per invocation, audio out, done.
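Steps 2 and 3 can be sketched offline — the three client events sent per invocation, and the WAV wrapping. Treat the exact payload shapes as an approximation; references/realtime-protocol.md is the authoritative version:

```python
import struct

def protocol_messages(text: str, voice: str = "marin"):
    """The three client events sent per invocation (payload shapes are a
    sketch; see references/realtime-protocol.md for the exact schema)."""
    return [
        {"type": "session.update",
         "session": {"voice": voice, "output_audio_format": "pcm16"}},
        {"type": "conversation.item.create",
         "item": {"type": "message", "role": "user",
                  "content": [{"type": "input_text", "text": text}]}},
        {"type": "response.create"},
    ]

def pcm16_to_wav(pcm: bytes, rate: int = 24000, channels: int = 1) -> bytes:
    """Prepend the 44-byte RIFF/WAVE header to raw signed-LE PCM16 audio."""
    byte_rate = rate * channels * 2          # bytes per second
    block_align = channels * 2               # bytes per sample frame
    header = struct.pack(
        "<4sI4s4sIHHIIHH4sI",
        b"RIFF", 36 + len(pcm), b"WAVE",     # RIFF chunk
        b"fmt ", 16, 1, channels,            # fmt chunk: PCM, channel count
        rate, byte_rate, block_align, 16,    # rates, alignment, 16-bit samples
        b"data", len(pcm),                   # data chunk
    )
    return header + pcm
```

The base64-decoded `response.output_audio.delta` chunks are simply concatenated and passed to `pcm16_to_wav` before playback.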

Cross-agent format

SKILL.md follows the agentskills.io open standard — the same format Claude Code, Cursor, Codex, Gemini CLI, Goose, Cline, Roo, and 30+ others consume. Hermes and OpenClaw both extend the same base spec; the manifest carries optional metadata.hermes and metadata.openclaw blocks for first-class integration with their respective registries, but the skill works without them.

What the agent actually does

When the user says "speak that summary", the agent calls:

bash ~/.claude/skills/bridgespeak/scripts/speak.sh \
  --voice marin \
  "All 47 tests passed. Build took 12 seconds."

That's it. The agent doesn't need to know about WebSockets, base64, PCM16, or audio playback. The skill teaches it when to speak, which voice to pick, and when not to (long output, sensitive data, headless environments, cost-sensitive contexts).


When to Use BridgeSpeak

Install BridgeSpeak if your agent needs to:

  • Announce completion of long-running tasks while you're away from the terminal
  • Read short status / summary out loud during multitasking (driving, cooking, walking)
  • Provide accessibility narration (low vision, RSI, hands-busy)
  • Speak code review findings hands-free
  • Read AI-generated content aloud for proofing (you catch issues by ear that you miss by eye)

Don't install BridgeSpeak if:

  • Your agent runs only in CI / headless environments (use a TTS-to-file workflow instead)
  • You need real-time two-way voice chat (use the Realtime API directly, or BridgeMind's bridgevoice Tauri app for on-device dictation)
  • You're cost-sensitive and synthesizing long passages (~$0.24/minute on Realtime). For batch TTS, OpenAI's /v1/audio/speech endpoint is ~16× cheaper — but is not what this skill provides (per project requirement, BridgeSpeak is Realtime-only)

Voices

gpt-realtime-2 ships ten voices. Default is marin (warm, natural, OpenAI's recommended flagship). Full guide: voices.md.

speak.sh --voice marin   "warm, natural"           # default flagship
speak.sh --voice cedar   "deeper, calm, formal"    # other flagship
speak.sh --voice shimmer --instructions "excited"  "PR merged!"
speak.sh --voice sage    --instructions "slow, pedagogical"  "Reading the migration plan."

Once a voice has produced audio in a session, the API locks it in for that session. The script opens a fresh session per invocation, so every call can choose any voice.


Cost Awareness

gpt-realtime-2 bills both audio and text tokens:

| Bucket | $ / 1M tokens |
|---|---|
| Text input | 4.00 |
| Text output | 24.00 |
| Audio input | 32.00 |
| Audio output | 64.00 (~$0.24 / minute of speech) |

A 30-second status read-out is roughly $0.12; a 5-minute narration is roughly $1.20. Set a monthly cap on platform.openai.com/account/billing. The skill nudges agents to speak summaries, not full output.
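Those estimates follow directly from the audio-output rate — a quick sanity check (the constant comes from the table above; the helper name is illustrative):

```python
AUDIO_OUT_PER_MINUTE = 0.24  # ~$/minute of synthesized speech, from the pricing table

def speech_cost(seconds: float) -> float:
    """Rough audio-output cost in dollars; ignores the (small) text-token share."""
    return round(AUDIO_OUT_PER_MINUTE * seconds / 60, 2)

# speech_cost(30) -> 0.12; speech_cost(300) -> 1.2
```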


Project Layout

BridgeSpeak/
├── .claude-plugin/
│   └── plugin.json
├── skills/
│   └── bridgespeak/
│       ├── SKILL.md
│       └── references/
│           ├── realtime-protocol.md
│           ├── voices.md
│           ├── api-key-setup.md
│           └── troubleshooting.md
├── scripts/
│   ├── speak.py        # Python WebSocket client (the workhorse)
│   ├── speak.sh        # POSIX wrapper (macOS / Linux)
│   └── speak.ps1       # PowerShell wrapper (Windows)
├── README.md
├── LICENSE
├── CHANGELOG.md
└── CONTRIBUTING.md

Compatibility

BridgeSpeak is a standard agentskills.io skill. The base spec is supported by 30+ tools.

| Tool | Notes |
|---|---|
| Claude Code | Full plugin support via `.claude-plugin/` |
| Hermes (NousResearch) | `metadata.hermes` block declares env vars |
| OpenClaw | `metadata.openclaw` block declares python3 requirement; install via clawdhub |
| Cursor | Drop into `.cursor/skills/` |
| OpenAI Codex | Skill format |
| Gemini CLI | Skill format |
| Cline / Roo Code | Skill format |
| GitHub Copilot | Reference via `.github/copilot-instructions.md` |
| Continue.dev | Skill format |
| Goose | Skill format |

Platform support

| OS | Python | Player |
|---|---|---|
| macOS 12+ | system Python 3 or `brew install python3` | `afplay` (preinstalled) |
| Ubuntu / Debian | `apt install python3 python3-pip` | `paplay` / `aplay` / `ffplay` |
| Fedora / Arch | system Python 3 | `paplay` / `aplay` / `ffplay` |
| Windows 10+ | `winget install Python.Python.3.12` | `Media.SoundPlayer` (built-in) or `ffplay` |
| Headless / SSH / CI | any Python 3.9+ | none needed — `--no-play --output out.wav` |

What BridgeSpeak Is Not

  • Not speech-to-text. Output only. For on-device dictation, see BridgeVoice — BridgeMind's Tauri Whisper app.
  • Not a live two-way voice agent. One round-trip per call. No microphone capture, no barge-in, no interruption handling.
  • Not a TTS API wrapper. It uses the Realtime model specifically. For batch TTS at ~16× lower cost, use OpenAI's /v1/audio/speech endpoint directly.
  • Not a sandbox. It will speak whatever text you give it. Audio is broadcast — treat it like any other output channel for prompt-injection purposes (don't read aloud anything that could leak secrets in earshot of others).
  • Not a guarantee of low latency. First-audio latency is typically 0.5–2s on a good connection; longer on flaky networks. The Realtime API was designed for live conversation; this skill uses it for one-shot synthesis.

Authoritative References


Contributing

PRs welcome — especially for:

  • New player backends (e.g., pw-play for PipeWire-only systems)
  • Streaming playback (pipe deltas to ffplay -i - as they arrive instead of buffering)
  • Additional voice/tone presets
  • Per-agent installer scripts (Hermes, OpenClaw, Cursor)
  • Cost-tracking telemetry (opt-in)

See CONTRIBUTING.md.


License

MIT. See LICENSE. True open source. No license traps. Ship freely.


About BridgeMind

BridgeMind is an agentic organization — AI agents are teammates, not tools. We build open-source plugins for the builder community to ship faster through vibe coding.

Other open-source projects in the BridgeMind family:

  • BridgeWard — prompt-injection defense for any agent reading untrusted content
  • BridgeSecurity — app-sec vulnerability detection skill
  • BridgeUI — design instincts for your agent
  • BridgeRemotion — Remotion expert skill for marketing videos
  • BridgeMotion — MIT-licensed React video framework

Built by BridgeMind. Give agents a voice. Ship audio-first.
