Hosted voice-cloning TTS API. Clone your voice once on a Mac, push it to the cloud, synthesize from anywhere via REST API — no Apple Silicon required at call time.
Core thesis: "Clone your voice once, own it forever, use it anywhere via API."
Client
│
│ HTTPS /v1/*
▼
Cloudflare Worker (Hono)
├── API key hashes → KV
├── Voice + job metadata → D1 (SQLite)
└── Ref audio + cache → R2
│
│ POST trigger (<1s response)
▼
Modal trigger endpoint (CPU)
│
│ .spawn()
▼
Modal synthesize_job (A100 GPU)
│ Qwen3-TTS inference (~60–90s cold / ~10–15s warm)
│
│ POST /v1/jobs/:id/complete
▼
Cloudflare Worker (webhook)
└── writes WAV to R2, marks job ready
Synthesis is always async: the Worker has a 30s CPU limit, while Modal cold starts take 60–90s. On a cache miss the client receives 202 Accepted immediately and polls for completion; a cache hit returns the audio directly with 200.
User endpoints require Authorization: Bearer aw_<key>.
Admin endpoint (POST /v1/keys) requires Authorization: Bearer <ADMIN_SECRET>.
No auth required.
{ "status": "ok", "backends": ["qwen3-1.7b", "qwen3-0.6b"] }

Create a new user API key. Admin-only.
Headers: Authorization: Bearer <ADMIN_SECRET>
Body (optional): { "label": "my-app" }
201 Created:
{ "api_key": "aw_...", "label": "my-app", "created_at": "2026-05-04T00:00:00.000Z" }

401 — wrong or missing ADMIN_SECRET.
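The key handling described above (an `aw_` prefix on user keys, SHA-256 hashes stored for lookup and ownership) can be sketched as follows. This is an illustration in Python, not the Worker's `auth.ts`; the key length and hex encoding are assumptions, and `generate_api_key`, `hash_key`, and `bearer_header` are hypothetical names.

```python
import hashlib
import secrets

KEY_PREFIX = "aw_"  # prefix shown in the Authorization examples above


def generate_api_key() -> str:
    # Assumption: 32 random bytes, hex-encoded. The real key format may differ.
    return KEY_PREFIX + secrets.token_hex(32)


def hash_key(api_key: str) -> str:
    # SHA-256 hex digest of the UTF-8 key; only this hash would be stored.
    return hashlib.sha256(api_key.encode("utf-8")).hexdigest()


def bearer_header(api_key: str) -> dict:
    # Header shape required by the user endpoints.
    return {"Authorization": f"Bearer {api_key}"}
```

Storing only the hash means a leaked KV or D1 dump does not expose usable keys, which is presumably why the plaintext key is returned once at creation.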
Register a new cloned voice.
Auth: User API key
Content-Type: multipart/form-data
| Field | Type | Required | Notes |
|---|---|---|---|
| name | string | ✓ | Friendly label |
| ref_audio | WAV file | ✓ | PCM WAV, 5–60s, max 20MB |
| ref_text | string | — | Transcript of reference audio (optional, used at inference time) |
| backend | string | — | qwen3-1.7b (default) or qwen3-0.6b |
| lang | string | — | BCP-47 code, default en. Supported: en zh ja ko es fr de it pt ru |
| family | string | — | Optional grouping label (stored, not used by inference) |
201 Created:
{ "voice_id": "uuid", "name": "picard", "backend": "qwen3-1.7b", "created_at": "..." }

400 — name or ref_audio missing.
422 — non-WAV content-type, duration outside 5–60s, or file > 20MB.
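Checking the reference audio locally before upload can save a round trip. The sketch below applies the documented limits (PCM WAV, 5-60s, max 20MB) with Python's standard `wave` module. It is a client-side approximation; the server's RIFF parser may accept or reject edge cases differently, and `check_ref_audio` is a hypothetical helper name.

```python
import io
import wave

MAX_BYTES = 20 * 1024 * 1024          # 20MB upload limit from the table above
MIN_SECONDS, MAX_SECONDS = 5.0, 60.0  # allowed reference duration


def check_ref_audio(data: bytes) -> list:
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    if len(data) > MAX_BYTES:
        problems.append("file larger than 20MB")
    try:
        # wave.open raises wave.Error for non-RIFF or non-PCM input.
        with wave.open(io.BytesIO(data), "rb") as wav:
            duration = wav.getnframes() / wav.getframerate()
    except (wave.Error, EOFError, ZeroDivisionError):
        return problems + ["not a readable PCM WAV file"]
    if not MIN_SECONDS <= duration <= MAX_SECONDS:
        problems.append(f"duration {duration:.1f}s outside 5-60s")
    return problems
```

Calling `check_ref_audio(open("ref.wav", "rb").read())` before the multipart POST gives an early hint about a likely 422.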
List voices owned by this API key.
200:
[
{ "voice_id": "uuid", "name": "picard", "backend": "qwen3-1.7b", "lang": "en", "created_at": "..." }
]

Get full voice record. Returns all fields except api_key_hash (stripped server-side).
404 — not found or belongs to a different key.
Delete voice, all cached synthesis WAVs, and all job records/audio.
204 No Content on success. 404 — not found or belongs to a different key.
Synthesize speech from a cloned voice.
Auth: User API key
Body:
{
"voice_id": "uuid",
"text": "The text to synthesize.",
"lang": "en",
"backend": "qwen3-1.7b"
}

lang and backend are optional — backend falls back to the voice's registered backend. Supported lang values: en zh ja ko es fr de it pt ru (others will cause the job to fail with an unsupported language error).
200 — Cache hit (X-Cache: HIT):
Binary audio/wav stream with headers X-Cache: HIT, X-Voice-Id, X-Backend, X-Synthesis-Time-Ms.
202 — Cache miss:
{ "job_id": "uuid", "status": "pending" }

Response headers: X-Cache: MISS, X-Voice-Id, X-Backend, X-Synthesis-Time-Ms: -1.
400 — voice_id or text missing.
404 — voice not found or wrong owner.
422 — invalid backend value.
500 — reference audio missing from R2.
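Because the endpoint answers with either audio (200) or a job reference (202), a client has to branch on the status code. A minimal sketch of that branching, independent of any HTTP library; `SynthResult` and `parse_synthesize_response` are hypothetical names, not part of the API.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class SynthResult:
    cached: bool                   # True when the Worker served a cache hit
    audio: Optional[bytes] = None  # WAV bytes on a 200 response
    job_id: Optional[str] = None   # job to poll on a 202 response


def parse_synthesize_response(status: int, body: bytes) -> SynthResult:
    if status == 200:
        # Cache hit: the body is the WAV itself.
        return SynthResult(cached=True, audio=body)
    if status == 202:
        # Cache miss: the body is JSON carrying the pending job's id.
        return SynthResult(cached=False, job_id=json.loads(body)["job_id"])
    # 400, 404, 422, 500 and anything else are surfaced as errors.
    raise RuntimeError(f"synthesize failed with HTTP {status}: {body[:200]!r}")
```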
Poll synthesis job status.
200:
{
"job_id": "uuid",
"status": "pending | ready | failed",
"audio_url": "https://.../v1/jobs/uuid/audio",
"error": "optional error message"
}

audio_url only appears when status is "ready".
404 — not found or belongs to a different key.
Download synthesized WAV.
200 — Binary audio/wav with X-Job-Id header.
404 — job not found, wrong owner, or not yet ready.
500 — job shows ready in D1 but audio missing from R2 (self-heals: marks job failed).
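The job endpoints above imply a simple polling loop on the client. The sketch below keeps the HTTP call abstract: `get_status` is any callable returning the parsed JSON from GET /v1/jobs/:id. The 180s timeout and 2s interval are assumptions chosen to cover the 60-90s cold start, not values from the service.

```python
import time
from typing import Callable


def wait_for_job(
    get_status: Callable[[], dict],
    timeout_s: float = 180.0,   # assumption: comfortably above a cold start
    interval_s: float = 2.0,    # assumption: polling interval
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll until the job is ready and return its audio_url."""
    deadline = clock() + timeout_s
    while True:
        job = get_status()
        if job["status"] == "ready":
            return job["audio_url"]
        if job["status"] == "failed":
            raise RuntimeError(job.get("error", "synthesis failed"))
        if clock() >= deadline:
            raise TimeoutError("job still pending after timeout")
        sleep(interval_s)
```

Injecting `sleep` and `clock` keeps the loop testable without real delays.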
The hosted billing flow is documented in full in docs/openapi.yaml. Summary:
| Endpoint | Auth | Purpose |
|---|---|---|
| GET /v1/signup | — | HTML signup form |
| POST /v1/checkout | — | Creates a Stripe Checkout Session; returns { url }. Existing active subscribers get a URL to /v1/checkout/already-subscribed instead of a second Stripe session. |
| GET /v1/checkout/success?token=… | — | Polled by the browser after Stripe redirect. Shows the API key once (browser navigation) or signals `{ ready: true |
| GET /v1/checkout/already-subscribed | — | HTML page shown when a customer attempts to subscribe a second time. |
| POST /v1/webhooks/stripe | Stripe signature | Handles checkout.session.completed, customer.subscription.updated (period sync + revoke on unpaid), and customer.subscription.deleted. |
| GET /v1/usage | User API key | Current-period character usage anchored to the customer's real Stripe billing window. |
Text cap: POST /v1/synthesize rejects text longer than 5000 characters (422). Rate limiting per key is a known gap; see roadmap.
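A client can reject obviously invalid requests before sending them. The sketch below builds the POST /v1/synthesize body and applies the documented constraints (required fields, 5000-character cap, supported languages and backends). `build_synthesize_body` is a hypothetical helper, and Python's `len` counts code points, which may not match how the server counts characters.

```python
from typing import Optional

SUPPORTED_LANGS = {"en", "zh", "ja", "ko", "es", "fr", "de", "it", "pt", "ru"}
BACKENDS = {"qwen3-1.7b", "qwen3-0.6b"}
MAX_TEXT_CHARS = 5000  # documented cap; the server's counting unit is an assumption


def build_synthesize_body(
    voice_id: str,
    text: str,
    lang: Optional[str] = None,
    backend: Optional[str] = None,
) -> dict:
    if not voice_id or not text:
        raise ValueError("voice_id and text are required")
    if len(text) > MAX_TEXT_CHARS:
        raise ValueError(f"text exceeds {MAX_TEXT_CHARS} characters")
    if lang is not None and lang not in SUPPORTED_LANGS:
        raise ValueError(f"unsupported lang: {lang}")
    if backend is not None and backend not in BACKENDS:
        raise ValueError(f"invalid backend: {backend}")
    body = {"voice_id": voice_id, "text": text}
    # Optional fields are omitted so the server applies its own defaults.
    if lang is not None:
        body["lang"] = lang
    if backend is not None:
        body["backend"] = backend
    return body
```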
Called by Modal when synthesis finishes. Not for direct use.
Auth: Authorization: Bearer <MODAL_CALLBACK_SECRET> (timing-safe compare — CF Workers synchronous extension, not standard Web Crypto)
Body:
{
"job_id": "uuid",
"voice_id": "uuid",
"cache_key": "sha256hex",
"wav_b64": "base64-encoded-wav | null",
"error": "error message | null"
}

(voice_id is sent by Modal but not consumed by the webhook — ownership is derived from the job record in D1.)
Idempotent — re-delivery after a job has already completed is silently accepted (checked via status !== "pending" || completed_at IS NOT NULL).
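The two checks the webhook performs, a timing-safe secret comparison and a replay guard, can be illustrated in Python. The real handler runs in TypeScript on the Worker; here `hmac.compare_digest` stands in for the Workers timing-safe compare, and `secret_matches` and `is_replay` are hypothetical names.

```python
import hmac
from typing import Optional


def secret_matches(auth_header: Optional[str], expected_secret: str) -> bool:
    """Constant-time check of an 'Authorization: Bearer <secret>' header."""
    prefix = "Bearer "
    if not auth_header or not auth_header.startswith(prefix):
        return False
    presented = auth_header[len(prefix):]
    # compare_digest avoids leaking how many leading bytes matched.
    return hmac.compare_digest(
        presented.encode("utf-8"), expected_secret.encode("utf-8")
    )


def is_replay(job: dict) -> bool:
    # Mirrors the documented guard: status != "pending" OR completed_at set.
    return job["status"] != "pending" or job["completed_at"] is not None
```

A handler would return 401 when `secret_matches` is false and acknowledge without writing anything when `is_replay` is true.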
1. POST /v1/synthesize
↓ cache miss
2. D1: INSERT job (status=pending, completed_at=null)
3. Fetch ref audio bytes from R2
4. POST Modal trigger URL — body: job_id, voice_id, ref_audio_b64, ref_text, text, lang,
backend_name, cache_key, callback_url, callback_secret
↓ trigger returns <1s; 202 returned to client immediately
5. Modal trigger .spawn()s synthesize_job on A100
6. synthesize_job:
a. Load Qwen3CudaBackend (warm: ~1s, cold: 60–90s)
b. Write ref audio bytes to temp WAV
c. Run Qwen3-TTS inference → output PCM WAV
d. POST /v1/jobs/:id/complete { wav_b64: <base64>, cache_key, job_id }
7. Worker webhook:
a. Verify MODAL_CALLBACK_SECRET (timing-safe sync compare)
b. Replay guard: skip if status != pending OR completed_at is not null
c. Decode WAV, store at jobs/{id}.wav in R2
d. Store in synthesis cache at voices/{voice_id}/cache/{cache_key}.wav
e. UPDATE job: status=ready, audio_key=…, completed_at=now
8. Client polls GET /v1/jobs/:id until status=ready
9. Client downloads GET /v1/jobs/:id/audio → WAV stream
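Step 4 of the flow above sends a single JSON payload to the Modal trigger. A sketch of how that payload could be assembled, using the field names listed in step 4; deriving `callback_url` from a base URL is an assumption, and `build_trigger_payload` and `worker_base_url` are hypothetical names.

```python
import base64
from typing import Optional


def build_trigger_payload(
    *,
    job_id: str,
    voice_id: str,
    ref_audio: bytes,
    ref_text: Optional[str],
    text: str,
    lang: str,
    backend_name: str,
    cache_key: str,
    worker_base_url: str,
    callback_secret: str,
) -> dict:
    return {
        "job_id": job_id,
        "voice_id": voice_id,
        # Reference audio travels inline as base64, so Modal needs no R2 access.
        "ref_audio_b64": base64.b64encode(ref_audio).decode("ascii"),
        "ref_text": ref_text,
        "text": text,
        "lang": lang,
        "backend_name": backend_name,
        "cache_key": cache_key,
        # Assumption: the callback URL is the documented completion route.
        "callback_url": f"{worker_base_url.rstrip('/')}/v1/jobs/{job_id}/complete",
        # The secret rides in the payload and comes back in the callback header.
        "callback_secret": callback_secret,
    }
```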
Cache key: sha256(text + "\x00" + lang + "\x00" + backend) — null-byte separators prevent collisions.
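To see why the separators matter, here is the cache key formula in Python. UTF-8 encoding of the joined string is an assumption; the Worker's `synthCacheKey` may encode differently, so keys computed here are not guaranteed to match the server's.

```python
import hashlib


def synth_cache_key(text: str, lang: str, backend: str) -> str:
    # Join with NUL bytes so field boundaries stay unambiguous.
    material = "\x00".join((text, lang, backend))
    return hashlib.sha256(material.encode("utf-8")).hexdigest()


def naive_key(text: str, lang: str, backend: str) -> str:
    # Plain concatenation, shown only to illustrate the collision it allows.
    return hashlib.sha256((text + lang + backend).encode("utf-8")).hexdigest()
```

With plain concatenation, `("ab", "c", "x")` and `("a", "bc", "x")` hash the same bytes and collide; with the NUL separators they produce different keys.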
| Binding | Type | Purpose |
|---|---|---|
| API_KEYS | KV Namespace | API key hash → metadata |
| DB | D1 Database | voices + jobs tables |
| STORAGE | R2 Bucket | ref WAVs, synthesis cache, job audio |
| ADMIN_SECRET | Secret | gates POST /v1/keys |
| MODAL_TRIGGER_URL | Var | URL of the Modal trigger web endpoint |
| MODAL_CALLBACK_SECRET | Secret | authenticates Modal → Worker callbacks |
| STRIPE_SECRET_KEY | Secret | Stripe live secret key (sk_live_...) — required for the billing flow |
| STRIPE_WEBHOOK_SECRET | Secret | Stripe webhook signing secret (whsec_...) |
| STRIPE_PRICE_ID | Var | Stripe Price ID for the Hobby plan |
| APP_BASE_URL | Var | Public base URL of the deployed Worker (used for Checkout success/cancel URLs) |
No Modal-native secret configuration required. The callback secret is passed in the trigger payload by the Worker and forwarded in the callback request header.
- Node.js 20+, Wrangler 3+
- Python 3.11+, Modal CLI: pip install modal && modal setup
- Cloudflare account with Workers, D1, R2, KV enabled
npx wrangler kv namespace create API_KEYS
npx wrangler d1 create afterwords-voices
npx wrangler r2 bucket create afterwords-voices

Copy the resulting IDs into api/wrangler.toml.
npx wrangler d1 execute afterwords-voices --file=migrations/0001_voices.sql --remote
npx wrangler d1 execute afterwords-voices --file=migrations/0002_billing.sql --remote
npx wrangler d1 execute afterwords-voices --file=migrations/0003_jobs_char_count.sql --remote
npx wrangler d1 execute afterwords-voices --file=migrations/0004_stripe_customer_unique.sql --remote

(Drop --remote for local dev against miniflare.)
cd inference
pip install -r requirements.txt
modal deploy app.py

Copy the printed trigger URL into api/wrangler.toml as MODAL_TRIGGER_URL.
cd api
npx wrangler secret put ADMIN_SECRET
npx wrangler secret put MODAL_CALLBACK_SECRET # must match the secret sent in trigger payload
npx wrangler secret put STRIPE_SECRET_KEY # billing flow
npx wrangler secret put STRIPE_WEBHOOK_SECRET # webhook signature verification

Set STRIPE_PRICE_ID and APP_BASE_URL as [vars] in wrangler.toml (already pinned in the committed config).
cd api
npm install
npx wrangler deploy

cd api
npm install
npm test # vitest + @cloudflare/vitest-pool-workers — no wrangler needed

Tests use a real miniflare D1/R2/KV environment. The migration is applied automatically via applyD1Migrations. The suite has 67 tests covering auth, cache, storage, and all routes.
cd inference
pip install -r requirements.txt
pytest tests/ -v # no GPU or Modal auth needed — model is mocked

cd api
cp .dev.vars.example .dev.vars # fill in ADMIN_SECRET and MODAL_CALLBACK_SECRET
npx wrangler dev

| Column | Type | Notes |
|---|---|---|
| voice_id | TEXT PK | UUID |
| api_key_hash | TEXT | SHA-256 of API key (ownership) |
| name | TEXT | Friendly name |
| backend | TEXT | qwen3-1.7b or qwen3-0.6b |
| ref_text | TEXT | Optional transcript |
| lang | TEXT | BCP-47, default en |
| family | TEXT | Optional grouping label |
| ref_audio_key | TEXT | R2 key: voices/{id}/ref.wav |
| created_at | TEXT | ISO-8601 |
| metadata | TEXT | JSON blob |

| Column | Type | Notes |
|---|---|---|
| job_id | TEXT PK | UUID |
| voice_id | TEXT | FK → voices |
| api_key_hash | TEXT | Ownership check (no cross-key access) |
| status | TEXT | pending → ready or failed |
| audio_key | TEXT | R2 key when ready |
| error | TEXT | Error message on failure |
| created_at | TEXT | ISO-8601 |
| completed_at | TEXT | Stamped on first terminal transition; replay guard for Modal callbacks |
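Since D1 is SQLite, the jobs table can be explored locally with Python's `sqlite3`. The DDL below is reconstructed from the column list above and is an assumption, not the contents of the real migration files. It shows how scoping every lookup by api_key_hash yields the "not found or belongs to a different key" behaviour, and how a conditional UPDATE stamps completed_at only on the first terminal transition. Whether the Worker uses a conditional UPDATE or a read-then-check is not stated; `open_db`, `get_job_for_owner`, and `mark_ready` are hypothetical names.

```python
import sqlite3


def open_db() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")
    db.row_factory = sqlite3.Row
    # Assumed schema, derived from the documented columns.
    db.execute(
        """CREATE TABLE jobs (
               job_id TEXT PRIMARY KEY,
               voice_id TEXT,
               api_key_hash TEXT,
               status TEXT,
               audio_key TEXT,
               error TEXT,
               created_at TEXT,
               completed_at TEXT
           )"""
    )
    return db


def get_job_for_owner(db, job_id: str, api_key_hash: str):
    # A job owned by another key is indistinguishable from a missing one.
    return db.execute(
        "SELECT * FROM jobs WHERE job_id = ? AND api_key_hash = ?",
        (job_id, api_key_hash),
    ).fetchone()


def mark_ready(db, job_id: str, audio_key: str, now: str) -> bool:
    # Only a still-pending, never-completed job can transition.
    cur = db.execute(
        "UPDATE jobs SET status = 'ready', audio_key = ?, completed_at = ? "
        "WHERE job_id = ? AND status = 'pending' AND completed_at IS NULL",
        (audio_key, now, job_id),
    )
    return cur.rowcount == 1
```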
voices/{voice_id}/ref.wav ← reference audio
voices/{voice_id}/cache/{cache_key}.wav ← synthesis cache
jobs/{job_id}.wav ← individual job output
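The key layout above can be captured in a few helpers, which keeps path construction in one place. These are hypothetical functions mirroring the documented layout, not code from `storage.ts`.

```python
def ref_audio_key(voice_id: str) -> str:
    return f"voices/{voice_id}/ref.wav"


def cache_audio_key(voice_id: str, cache_key: str) -> str:
    return f"voices/{voice_id}/cache/{cache_key}.wav"


def job_audio_key(job_id: str) -> str:
    return f"jobs/{job_id}.wav"


def voice_prefix(voice_id: str) -> str:
    # Reference audio and cache entries share this prefix; job audio does not.
    return f"voices/{voice_id}/"
```

One consequence of the layout: a prefix listing under `voices/{voice_id}/` covers the reference audio and the cache, while job audio has to be located through the job records when a voice is deleted.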
afterwords-cloud/
  api/
    src/
      index.ts ← Hono app, route mounting, global error handler
      types.ts ← Voice, Job, Bindings interfaces
      auth.ts ← generateApiKey, hashKey, authMiddleware
      cache.ts ← synthCacheKey (SHA-256, null-byte separated)
      storage.ts ← Storage class: R2/D1/KV wrappers
      routes/
        keys.ts ← POST /v1/keys
        voices.ts ← POST/GET/DELETE /v1/voices (WAV validation, RIFF parser)
        synthesize.ts ← POST /v1/synthesize (cache check, job creation, trigger)
        jobs.ts ← GET /v1/jobs/:id, audio download, POST complete webhook
    test/ ← vitest suite (67 tests)
    wrangler.toml
    vitest.config.mts
  inference/
    app.py ← Modal trigger endpoint + synthesize_job (A100)
    backend.py ← Qwen3CudaBackend (PyTorch Qwen3-TTS)
    requirements.txt
    tests/
      test_backend.py ← pytest suite (mocked, no GPU needed)
  migrations/
    0001_voices.sql ← D1 schema
  docs/
    superpowers/
      specs/ ← v1 design spec
      plans/ ← v1 implementation plan (7 tasks)
- afterwords — local Apple Silicon TTS server. backends/qwen3.py is the MLX counterpart to inference/backend.py here. The voices/ directory has reference WAVs useful for smoke testing.