BridgeSpeak
A cross-agent skill from BridgeMind that gives Claude Code, Hermes, and OpenClaw agents the ability to speak.
Powered by OpenAI's gpt-realtime-2. Plays on the user's speakers via the system's native audio player.
AI coding agents are getting good enough to work alongside you — but they still only talk through text. When you're walking the dog, driving, cooking, or staring at another monitor, a silent terminal is dead air. A short voice read-out — "Build complete. All 47 tests passed." — turns the agent into a teammate you can actually leave running.
The OpenAI Realtime API (gpt-realtime-2) ships flagship voices (marin, cedar) that finally sound human enough that you stop noticing they're synthesized. But wiring a WebSocket, decoding base64 PCM chunks, wrapping them in a WAV header, and piping that to the right system audio player on macOS / Linux / Windows is a half-day of plumbing every agent author writes from scratch.
BridgeSpeak is that half-day, packaged once. Drop it into any agent that follows the agentskills.io standard. Your agent gains a single capability: `bash speak.sh "text"`. Everything else — auth, transport, audio formats, player detection, headless fallback — is encapsulated.
| Component | Type | What It Does |
|---|---|---|
| `bridgespeak` | Skill | The capability. Auto-loaded when the user asks the agent to speak / read aloud / narrate / announce. Tells the agent how to invoke `speak.sh`, which voice to pick, and when not to speak. |
| `speak.py` | Script | Python WebSocket client. Connects to gpt-realtime-2, streams pcm16 @ 24 kHz mono, wraps as WAV, plays with the system's native player. ~280 lines, no external dependencies beyond `websockets`. |
| `speak.sh` | Wrapper | POSIX shell entry point for macOS / Linux. Picks Python 3, forwards args. |
| `speak.ps1` | Wrapper | PowerShell entry point for Windows. |
Install as a Claude Code plugin:

```bash
claude plugin install bridgespeak@bridgemind-plugins
```

Or copy the skill in manually:

```bash
# Project-level
mkdir -p .claude/skills
cp -r skills/bridgespeak .claude/skills/

# Personal / global
mkdir -p ~/.claude/skills
cp -r skills/bridgespeak ~/.claude/skills/
```

Then make the script executable:

```bash
chmod +x ~/.claude/skills/bridgespeak/scripts/speak.sh
```

For Hermes:

```bash
# Hermes loads skills from skills/<category>/<name>/
mkdir -p ~/.hermes/skills/voice
cp -r skills/bridgespeak ~/.hermes/skills/voice/bridgespeak
cp -r scripts ~/.hermes/skills/voice/bridgespeak/scripts
```

Hermes will pick up `metadata.hermes.required_environment_variables` and pass `OPENAI_API_KEY` through.

For OpenClaw:

```bash
# OpenClaw default skill workspace
mkdir -p ~/.openclaw/workspace/skills
cp -r skills/bridgespeak ~/.openclaw/workspace/skills/bridgespeak
cp -r scripts ~/.openclaw/workspace/skills/bridgespeak/scripts
```

Or publish to clawdhub: `clawdhub publish ./skills/bridgespeak`.

For development, symlink the skill and scripts from a checkout:

```bash
ln -s "$(pwd)/skills/bridgespeak" ~/.claude/skills/bridgespeak
ln -s "$(pwd)/scripts" ~/.claude/skills/bridgespeak/scripts
```

Install the one Python dependency:

```bash
python3 -m pip install --user websockets
```

That's the only Python dep. Audio playback uses the system's native player (no install needed on macOS; one `apt install` on Linux if you don't already have `paplay` / `aplay` / `ffplay`; built-in on Windows via `Media.SoundPlayer`).

Set your API key:

```bash
export OPENAI_API_KEY=sk-...
```

Or persist it to a chmod-600 config file — see api-key-setup.md.
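If you prefer the config file, the setup looks like this. The path `~/.config/bridgespeak/config.json` is the one the script reads; the `api_key` field name is an assumption here, so check api-key-setup.md for the exact schema:

```bash
mkdir -p ~/.config/bridgespeak
cat > ~/.config/bridgespeak/config.json <<'EOF'
{ "api_key": "sk-..." }
EOF
chmod 600 ~/.config/bridgespeak/config.json
```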
The script does exactly four things:
- Auth — reads `OPENAI_API_KEY` (env or `~/.config/bridgespeak/config.json`). Fails fast with a copy-pasteable setup snippet if missing.
- WebSocket — opens `wss://api.openai.com/v1/realtime?model=gpt-realtime-2`, sends `session.update` (voice, format) → `conversation.item.create` (the text) → `response.create`.
- Stream → WAV — collects `response.output_audio.delta` events, base64-decodes each chunk, concatenates, and wraps the result in a 44-byte RIFF/WAVE header (24 kHz mono signed-LE).
- Play — detects the platform's native audio player (`afplay` on macOS, `paplay` / `aplay` / `ffplay` on Linux, `Media.SoundPlayer` / `ffplay` on Windows), plays the WAV from a temp file, deletes it on exit.
No live two-way conversation. No microphone. No barge-in. One round trip per invocation, audio out, done.
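For orientation, the whole round trip fits in a few dozen lines. Here is a minimal sketch of the same four steps, assuming the event and field names listed above; the exact `session.update` payload is documented in references/realtime-protocol.md, and the `websockets` header keyword is `additional_headers` on recent releases (`extra_headers` on older ones). speak.py is the authoritative version.

```python
import asyncio, base64, json, os, shutil, subprocess, wave

import websockets  # the skill's one dependency

REALTIME_URL = "wss://api.openai.com/v1/realtime?model=gpt-realtime-2"

async def speak(text: str, voice: str = "marin", out: str = "/tmp/bridgespeak.wav"):
    headers = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"}
    async with websockets.connect(REALTIME_URL, additional_headers=headers) as ws:
        # 1+2. Configure the session, enqueue the text, request a response.
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {"voice": voice, "output_audio_format": "pcm16"},
        }))
        await ws.send(json.dumps({
            "type": "conversation.item.create",
            "item": {"type": "message", "role": "user",
                     "content": [{"type": "input_text", "text": text}]},
        }))
        await ws.send(json.dumps({"type": "response.create"}))
        # 3. Collect base64 PCM16 deltas until the response completes.
        pcm = bytearray()
        async for raw in ws:
            event = json.loads(raw)
            if event["type"] == "response.output_audio.delta":
                pcm += base64.b64decode(event["delta"])
            elif event["type"] == "response.done":
                break
    # Wrap the raw PCM in a WAV container: 24 kHz, mono, 16-bit signed LE.
    with wave.open(out, "wb") as f:
        f.setnchannels(1)
        f.setsampwidth(2)
        f.setframerate(24000)
        f.writeframes(bytes(pcm))
    # 4. Play with the first native player found, mirroring the skill's order.
    player = next((p for p in ("afplay", "paplay", "aplay", "ffplay")
                   if shutil.which(p)), None)
    if player:
        args = [player, "-autoexit", "-nodisp", out] if player == "ffplay" else [player, out]
        subprocess.run(args, check=False)

if __name__ == "__main__":
    asyncio.run(speak("All 47 tests passed."))
```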
SKILL.md follows the agentskills.io open standard — the same format Claude Code, Cursor, Codex, Gemini CLI, Goose, Cline, Roo, and 30+ others consume. Hermes and OpenClaw both extend the same base spec; the manifest carries optional metadata.hermes and metadata.openclaw blocks for first-class integration with their respective registries, but the skill works without them.
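For reference, the manifest's frontmatter looks roughly like this. The `metadata.hermes.required_environment_variables` key comes from the install notes above; the `metadata.openclaw` field name shown for the python3 requirement is illustrative, not the literal key:

```yaml
---
name: bridgespeak
description: Read short text aloud via OpenAI's Realtime API (gpt-realtime-2).
metadata:
  hermes:
    required_environment_variables: [OPENAI_API_KEY]
  openclaw:
    requirements: [python3]  # illustrative key name
---
```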
When the user says "speak that summary", the agent calls:
```bash
bash ~/.claude/skills/bridgespeak/scripts/speak.sh \
  --voice marin \
  "All 47 tests passed. Build took 12 seconds."
```

That's it. The agent doesn't need to know about WebSockets, base64, PCM16, or audio playback. The skill teaches it when to speak, which voice to pick, and when not to (long output, sensitive data, headless environments, cost-sensitive contexts).
Install BridgeSpeak if your agent needs to:
- Announce completion of long-running tasks while you're away from the terminal
- Read short status / summary out loud during multitasking (driving, cooking, walking)
- Provide accessibility narration (low vision, RSI, hands-busy)
- Speak code review findings hands-free
- Read AI-generated content aloud for proofing (you catch issues by ear that you miss by eye)
Don't install BridgeSpeak if:
- Your agent runs only in CI / headless environments (use a TTS-to-file workflow instead)
- You need real-time two-way voice chat (use the Realtime API directly, or BridgeMind's `bridgevoice` Tauri app for on-device dictation)
- You're cost-sensitive and synthesizing long passages (~$0.24/minute on Realtime). For batch TTS, OpenAI's `/v1/audio/speech` endpoint is ~16× cheaper, but that is not what this skill provides: BridgeSpeak is Realtime-only by design (a comparison call is sketched below)
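For comparison, a batch-TTS call to that endpoint is a single HTTPS request, roughly as follows. The model and voice here are the classic TTS ones, not the Realtime flagships; check OpenAI's TTS docs for current options:

```bash
curl https://api.openai.com/v1/audio/speech \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "tts-1", "voice": "alloy", "input": "All 47 tests passed."}' \
  --output status.mp3
```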
gpt-realtime-2 ships ten voices. Default is marin (warm, natural, OpenAI's recommended flagship). Full guide: voices.md.
```bash
speak.sh --voice marin "warm, natural"            # default flagship
speak.sh --voice cedar "deeper, calm, formal"     # other flagship
speak.sh --voice shimmer --instructions "excited" "PR merged!"
speak.sh --voice sage --instructions "slow, pedagogical" "Reading the migration plan."
```

Once a voice has produced audio in a session, the API locks it for that session. The script opens one fresh session per invocation, so each call can pick any voice freely.
gpt-realtime-2 bills both audio and text tokens:
| Bucket | $/1M tokens |
|---|---|
| Text input | 4.00 |
| Text output | 24.00 |
| Audio input | 32.00 |
| Audio output | 64.00 (~$0.24/minute of speech) |
A 30-second status read-out is roughly 12¢. A 5-minute narration is roughly $1.20. Set a monthly cap on platform.openai.com/account/billing. The skill nudges agents to speak summaries, not full output.
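Those per-clip figures follow directly from the audio-output row; a quick sanity check in Python:

```python
def estimate_cost(seconds_of_speech: float, rate_per_minute: float = 0.24) -> float:
    """Rough audio-output cost for one clip, using the ~$0.24/min figure above.
    Ignores the comparatively tiny text-input token cost."""
    return seconds_of_speech / 60 * rate_per_minute

print(f"${estimate_cost(30):.2f}")   # $0.12 -> a 30 s status read-out is ~12 cents
print(f"${estimate_cost(300):.2f}")  # $1.20 -> a 5 min narration is ~$1.20
```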
```
BridgeSpeak/
├── .claude-plugin/
│   └── plugin.json
├── skills/
│   └── bridgespeak/
│       ├── SKILL.md
│       └── references/
│           ├── realtime-protocol.md
│           ├── voices.md
│           ├── api-key-setup.md
│           └── troubleshooting.md
├── scripts/
│   ├── speak.py    # Python WebSocket client (the workhorse)
│   ├── speak.sh    # POSIX wrapper (macOS / Linux)
│   └── speak.ps1   # PowerShell wrapper (Windows)
├── README.md
├── LICENSE
├── CHANGELOG.md
└── CONTRIBUTING.md
```
BridgeSpeak is a standard agentskills.io skill. The base spec is supported by 30+ tools.
| Tool | Skill | Plugin | Notes |
|---|---|---|---|
| Claude Code | ✅ | ✅ | Full plugin support via .claude-plugin/ |
| Hermes (NousResearch) | ✅ | — | metadata.hermes block declares env vars |
| OpenClaw | ✅ | — | metadata.openclaw block declares python3 requirement; install via clawdhub |
| Cursor | ✅ | — | Drop into .cursor/skills/ |
| OpenAI Codex | ✅ | — | Skill format |
| Gemini CLI | ✅ | — | Skill format |
| Cline / Roo Code | ✅ | — | Skill format |
| GitHub Copilot | ✅ | — | Reference via .github/copilot-instructions.md |
| Continue.dev | ✅ | — | Skill format |
| Goose | ✅ | — | Skill format |
| OS | Python | Player | Status |
|---|---|---|---|
| macOS 12+ | system Python 3 or `brew install python3` | `afplay` (preinstalled) | ✅ |
| Ubuntu / Debian | `apt install python3 python3-pip` | `paplay` / `aplay` / `ffplay` | ✅ |
| Fedora / Arch | system Python 3 | `paplay` / `aplay` / `ffplay` | ✅ |
| Windows 10+ | `winget install Python.Python.3.12` | `Media.SoundPlayer` (built-in) or `ffplay` | ✅ |
| Headless / SSH / CI | any Python 3.9+ | — | ✅ via `--no-play --output out.wav` |
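In headless environments the skill synthesizes to a file instead of playing, using the flags from the last row above:

```bash
bash ~/.claude/skills/bridgespeak/scripts/speak.sh --no-play --output out.wav "Build complete."
```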
- Not speech-to-text. Output only. For on-device dictation, see BridgeVoice — BridgeMind's Tauri Whisper app.
- Not a live two-way voice agent. One round-trip per call. No microphone capture, no barge-in, no interruption handling.
- Not a TTS API wrapper. It uses the Realtime model specifically. For batch TTS at ~16× lower cost, use OpenAI's `/v1/audio/speech` endpoint directly.
- Not a sandbox. It will speak whatever text you give it. Audio is broadcast — treat it like any other output channel for prompt-injection purposes (don't read aloud anything that could leak secrets in earshot of others).
- Not a guarantee of low latency. First-audio latency is typically 0.5–2s on a good connection; longer on flaky networks. The Realtime API was designed for live conversation; this skill uses it for one-shot synthesis.
- OpenAI Realtime API — gpt-realtime-2
- OpenAI Realtime WebSocket guide
- OpenAI Realtime sessions API reference
- Managing Realtime costs
- Agent Skills specification (agentskills.io)
- Claude Code Skills documentation
- Hermes Agent (NousResearch)
- OpenClaw
PRs welcome — especially for:
- New player backends (e.g., `pw-play` for PipeWire-only systems)
- Streaming playback (pipe deltas to `ffplay -i -` as they arrive instead of buffering; see the sketch after this list)
- Additional voice/tone presets
- Per-agent installer scripts (Hermes, OpenClaw, Cursor)
- Cost-tracking telemetry (opt-in)
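For the streaming-playback item, the rough shape (not in the repo, and untested) is to decode each delta as it arrives and write it straight to ffplay's stdin, using the `ffplay -i -` invocation named above:

```python
import base64
import subprocess

# ffplay can consume raw 24 kHz mono s16le PCM from stdin, so each
# response.output_audio.delta can be played as it arrives instead of
# being buffered into a WAV first.
ffplay = subprocess.Popen(
    ["ffplay", "-f", "s16le", "-ar", "24000", "-ac", "1",
     "-nodisp", "-autoexit", "-i", "-"],
    stdin=subprocess.PIPE,
)

def on_delta(event: dict) -> None:
    """Call from the WebSocket receive loop for each audio delta event."""
    ffplay.stdin.write(base64.b64decode(event["delta"]))
    ffplay.stdin.flush()

def on_done() -> None:
    ffplay.stdin.close()  # EOF lets ffplay drain its buffer and exit (-autoexit)
    ffplay.wait()
```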
See CONTRIBUTING.md.
MIT. See LICENSE. True open source. No license traps. Ship freely.
BridgeMind is an agentic organization — AI agents are teammates, not tools. We build open-source plugins for the builder community to ship faster through vibe coding.
Other open-source projects in the BridgeMind family:
- BridgeWard — prompt-injection defense for any agent reading untrusted content
- BridgeSecurity — app-sec vulnerability detection skill
- BridgeUI — design instincts for your agent
- BridgeRemotion — Remotion expert skill for marketing videos
- BridgeMotion — MIT-licensed React video framework
Built by BridgeMind. Give agents a voice. Ship audio-first.