A self-hosted voice agent built with LiveKit Agents, wired
to local STT, LLM, and TTS services. Everything runs on localhost.
```
      ┌──────────────────────────┐
      │ agent/                   │  LiveKit Agents worker (Python)
      │ agents.py + plugins      │  ── glues STT → LLM → TTS together
      └────────────┬─────────────┘
                   │ HTTP
     ┌─────────────┼─────────────┐
     ▼             ▼             ▼
┌──────────┐  ┌──────────┐  ┌──────────┐
│   stt/   │  │   llm/   │  │   tts/   │
│ Parakeet │  │   vLLM   │  │  Kokoro  │
│  :8989   │  │  :8080   │  │  :8880   │
└──────────┘  └──────────┘  └──────────┘
```
```
VA/
├── agent/   LiveKit agent + custom STT/TTS LiveKit plugins   (was VAtest/)
├── stt/     Parakeet FastAPI service                         (was parakeet-FastAPI/)
├── llm/     vLLM start scripts (Llama-3 / Qwen)
└── tts/     Kokoro FastAPI service                           (was kokoro-FastAPI/)
```
| Service | Folder | Default URL | Source repo origin |
|---|---|---|---|
| Agent | agent/ | connects to LiveKit Cloud | (custom) |
| STT | stt/ | http://localhost:8989 | NVIDIA NeMo Parakeet TDT 0.6B |
| LLM | llm/ | http://localhost:8080/v1 | vLLM serving Llama-3.1-8B-Instruct or Qwen3.5-9B (AWQ) |
| TTS | tts/ | http://localhost:8880 | Kokoro 82M |
- Linux with NVIDIA GPU(s), CUDA 12.4 toolkit
- Python 3.11+ (Parakeet) / 3.12+ (agent)
- `uv` package manager: `curl -LsSf https://astral.sh/uv/install.sh | sh`
- `espeak-ng` for Kokoro TTS: `sudo apt install espeak-ng`
- A LiveKit Cloud project (or self-hosted LiveKit server). Credentials live in `agent/.env`.

Note: the original `.venv/` folders were intentionally not copied; recreate them with `uv sync` inside each subfolder (instructions below). All other config files (`.env`, `pyproject.toml`, `uv.lock`, `requirements.txt`) were preserved.
Run each of these once. They each create a fresh .venv inside the corresponding folder.
```bash
# 1. Agent
cd VA/agent
uv sync

# 2. STT (Parakeet)
cd ../stt
uv sync            # or: pip install -r requirements.txt

# 3. LLM (vLLM) — see llm/README.md for details
cd ../llm
uv venv --python 3.12
uv pip install vllm

# 4. TTS (Kokoro) — installed by start-gpu.sh on first run, no manual step needed
cd ../tts
```

You'll need 4 terminals, one per service. Bring them up in this order:
```bash
cd VA/stt
chmod +x start.sh
./start.sh
```

The script sets `CUDA_VISIBLE_DEVICES=0`, downloads/loads the model, and serves `POST /v1/transcribe/parakeet` on http://0.0.0.0:8989.
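The endpoint consumes raw 16-bit PCM in the request body (the agent's plugin sends the same format). A minimal sketch of preparing such a payload — the helper name and the commented `requests` call are illustrative, not part of the repo:

```python
import struct

# Hypothetical helper: clamp float samples to [-1, 1] and pack them as
# 16-bit little-endian PCM, the body format the Parakeet service expects.
STT_URL = "http://localhost:8989/v1/transcribe/parakeet"

def floats_to_pcm16(samples: list[float]) -> bytes:
    """Convert float audio samples to raw 16-bit LE PCM bytes."""
    clamped = [max(-1.0, min(1.0, s)) for s in samples]
    return struct.pack(f"<{len(clamped)}h", *(int(s * 32767) for s in clamped))

# Sending it (requires the STT service to be running):
# import requests
# pcm = floats_to_pcm16(samples)
# r = requests.post(STT_URL, params={"sample_rate": 16000}, data=pcm)
# print(r.json())
```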
GPU:

```bash
cd VA/tts
./start-gpu.sh
```

CPU-only (slower):

```bash
cd VA/tts
./start-cpu.sh
```

The first run downloads the Kokoro v1.0 voice model into `api/src/models/v1_0/`. The server listens on http://0.0.0.0:8880 with an OpenAI-compatible `/v1/audio/speech` endpoint.
```bash
cd VA/llm
source .venv/bin/activate   # if you used plain pip; uv users can skip
./start-llama.sh            # Llama-3.1-8B-Instruct-AWQ (recommended)
# or
./start-qwen.sh             # Qwen3.5-9B-AWQ (strongest)
```

Both serve on port 8080. See `llm/README.md` for installation, VRAM math, and per-model tuning.
Once STT, TTS, and LLM are all reachable, start the agent:
```bash
cd VA/agent
uv run python agents.py dev       # interactive dev mode
# or
uv run python agents.py start     # production worker
# or
uv run python agents.py console   # local terminal session, no LiveKit room
```

The agent registers with LiveKit Cloud using credentials in `agent/.env` (`LIVEKIT_URL`, `LIVEKIT_API_KEY`, `LIVEKIT_API_SECRET`) and waits for incoming sessions.
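The variable names are confirmed by the agent; the values below are placeholders — substitute your own project's URL and keys:

```bash
# agent/.env — placeholder values, use your LiveKit Cloud project's credentials
LIVEKIT_URL=wss://your-project.livekit.cloud
LIVEKIT_API_KEY=your-api-key
LIVEKIT_API_SECRET=your-api-secret
```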
- `agent/parakeet.py` — custom LiveKit `STT` plugin → `POST {STT_URL}/v1/transcribe/parakeet` with raw 16-bit PCM (sample rate 24 kHz).
- `agent/kokoro.py` — custom LiveKit `TTS` plugin (uses the `openai` Python client) → `POST {TTS_URL}/v1/audio/speech` with `response_format="pcm"`. Includes a `ClauseTokenizer` that splits LLM output at clause boundaries so TTS starts speaking before the LLM finishes.
- `agent/agents.py` — wires `ParakeetSTT`, `openai.LLM`, `KokoroTTS`, and Silero VAD into a `livekit.agents.AgentSession`.
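The clause-splitting idea can be sketched roughly as follows. This is a simplified stand-in, not the repo's actual `ClauseTokenizer`; the boundary regex and the `min_len` threshold are assumptions:

```python
import re
from typing import Iterable, Iterator

# Emit a chunk as soon as a clause boundary (punctuation + whitespace)
# appears in the buffered stream, so TTS can start before the LLM finishes.
_CLAUSE_END = re.compile(r"[.!?;:,]\s")

def clause_chunks(token_stream: Iterable[str], min_len: int = 12) -> Iterator[str]:
    """Yield clause-sized text chunks from an incremental token stream."""
    buf = ""
    for token in token_stream:
        buf += token
        while True:
            # First boundary that leaves a reasonably long clause; short
            # fragments are held back to avoid choppy audio.
            cut = next((m.end() for m in _CLAUSE_END.finditer(buf)
                        if m.end() >= min_len), None)
            if cut is None:
                break
            yield buf[:cut].strip()
            buf = buf[cut:]
    if buf.strip():
        yield buf.strip()  # flush whatever remains when the stream ends
```

Each yielded chunk would be handed to the TTS request as soon as it is complete, while the LLM keeps streaming.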
If you change ports, update the URLs in agents.py:
```python
stt_plugin = ParakeetSTT(server_url="http://localhost:8989", language="en")
llm = openai.LLM(base_url="http://localhost:8080/v1", ...)
tts = KokoroTTS(base_url="http://localhost:8880/v1", ...)
```

Smoke-test each service from the command line:

```bash
# STT
curl -X POST "http://localhost:8989/v1/transcribe/parakeet?sample_rate=16000" \
  --data-binary @stt/test_audio.wav

# TTS
curl -X POST http://localhost:8880/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model":"tts-1","voice":"af_sky","input":"hello world","response_format":"mp3"}' \
  --output /tmp/hello.mp3

# LLM
curl http://localhost:8080/v1/models
```

- `ModuleNotFoundError` after `uv sync`: make sure you ran `uv sync` inside each service folder and that you're using `uv run ...` (or activated `.venv/bin/activate`).
- STT or TTS connection refused: confirm the service is up on the expected port — the agent uses hard-coded URLs in `agents.py`.
- vLLM `No available memory for the cache blocks`: `--gpu-memory-utilization` is too small for the model. On a 12 GB card with STT also running, use 0.80. See `llm/README.md` for the math.
- GPU OOM at runtime: most likely Kokoro TTS is running on GPU and crowding out vLLM. Switch TTS to `tts/start-cpu.sh`; Kokoro on CPU is still real-time on modern x86. The default layout assumes a single 12 GB GPU; with two GPUs, edit `llm/start-qwen.sh` to use `CUDA_VISIBLE_DEVICES=1` and raise utilization to 0.92.
- `NVMLError_InvalidArgument` from vLLM: `CUDA_VISIBLE_DEVICES` points at a GPU index that doesn't exist. Run `nvidia-smi -L` and pick a valid index.
- espeak missing for Kokoro: `sudo apt install espeak-ng`. On non-Debian distros, set `ESPEAK_DATA_PATH` in `tts/start-cpu.sh` to your espeak-ng data directory.
- LiveKit auth error: check `agent/.env` and that the project URL/keys match a live LiveKit Cloud project.
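The budget arithmetic behind `--gpu-memory-utilization` can be sketched like this. All figures are illustrative assumptions (the checkpoint size in particular); the authoritative numbers are in `llm/README.md`:

```python
# Illustrative VRAM budget for a single 12 GB card shared with STT.
total_vram_gb = 12.0
gpu_memory_utilization = 0.80            # value suggested above for a 12 GB card

vllm_budget_gb = total_vram_gb * gpu_memory_utilization   # what vLLM may claim
leftover_gb = total_vram_gb - vllm_budget_gb              # left for STT et al.

weights_gb = 5.5                         # assumed size of an 8B AWQ checkpoint
kv_cache_gb = vllm_budget_gb - weights_gb                 # remainder → KV cache

print(f"vLLM budget {vllm_budget_gb:.1f} GB = weights {weights_gb:.1f} GB "
      f"+ KV cache {kv_cache_gb:.1f} GB; {leftover_gb:.1f} GB for other processes")
```

If the KV-cache remainder goes negative (or near zero), vLLM raises the "No available memory for the cache blocks" error above.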
Each service keeps its own README and config:
- `agent/pyproject.toml` — agent dependencies (LiveKit Agents, OpenAI plugin, etc.)
- `stt/README.md`, `stt/requirements.txt` — Parakeet service details
- `llm/README.md` — vLLM install, model options, GPU layout
- `tts/README.md` — full Kokoro docs (Docker, voices, options)