adrianwedd/afterwords-cloud

afterwords-cloud

Hosted voice-cloning TTS API. Clone your voice once on a Mac, push it to the cloud, synthesize from anywhere via REST API — no Apple Silicon required at call time.

Core thesis: "Clone your voice once, own it forever, use it anywhere via API."


Architecture

Client
  │
  │  HTTPS /v1/*
  ▼
Cloudflare Worker (Hono)
  ├── API key hashes       → KV
  ├── Voice + job metadata → D1 (SQLite)
  └── Ref audio + cache    → R2
  │
  │  POST trigger (<1s response)
  ▼
Modal trigger endpoint (CPU)
  │
  │  .spawn()
  ▼
Modal synthesize_job (A100 GPU)
  │  Qwen3-TTS inference (~60–90s cold / ~10–15s warm)
  │
  │  POST /v1/jobs/:id/complete
  ▼
Cloudflare Worker (webhook)
  └── writes WAV to R2, marks job ready

On a cache miss, synthesis is always asynchronous: the Worker has a 30s CPU limit, while Modal cold starts take 60–90s. The client receives 202 Accepted immediately and polls for completion.


API Reference

Authentication

User endpoints require Authorization: Bearer aw_<key>. The admin endpoint (POST /v1/keys) requires Authorization: Bearer <ADMIN_SECRET>.
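As a small illustration of the two credential types, the helper below builds the Authorization header for either a user key or the admin secret. The function name and the client-side prefix check are assumptions made for this sketch; the API itself only specifies the header format.

```python
def auth_headers(token: str, admin: bool = False) -> dict[str, str]:
    """Build the Authorization header for a user key or the admin secret."""
    if not token:
        raise ValueError("token must not be empty")
    # User keys are documented as aw_<key>; the admin secret has no fixed prefix.
    if not admin and not token.startswith("aw_"):
        raise ValueError("user API keys are expected to start with 'aw_'")
    return {"Authorization": f"Bearer {token}"}
```

Checking the prefix locally is only a convenience: it catches the mistake of sending the admin secret to a user endpoint before any request is made.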


GET /v1/health

No auth required.

{ "status": "ok", "backends": ["qwen3-1.7b", "qwen3-0.6b"] }

POST /v1/keys

Create a new user API key. Admin-only.

Headers: Authorization: Bearer <ADMIN_SECRET>

Body (optional): { "label": "my-app" }

201 Created:

{ "api_key": "aw_...", "label": "my-app", "created_at": "2026-05-04T00:00:00.000Z" }

401 — wrong or missing ADMIN_SECRET.


POST /v1/voices

Register a new cloned voice.

Auth: User API key

Content-Type: multipart/form-data

| Field | Type | Required | Notes |
|---|---|---|---|
| name | string | yes | Friendly label |
| ref_audio | WAV file | yes | PCM WAV, 5–60s, max 20MB |
| ref_text | string | no | Transcript of reference audio (optional, used at inference time) |
| backend | string | no | qwen3-1.7b (default) or qwen3-0.6b |
| lang | string | no | BCP-47 code, default en. Supported: en zh ja ko es fr de it pt ru |
| family | string | no | Optional grouping label (stored, not used by inference) |

201 Created:

{ "voice_id": "uuid", "name": "picard", "backend": "qwen3-1.7b", "created_at": "..." }

400 — name or ref_audio missing. 422 — non-WAV content-type, duration outside 5–60s, or file > 20MB.
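Checking a reference clip locally before upload avoids a round trip that would end in a 422. The sketch below uses Python's standard wave module to apply the documented limits. It is a client-side approximation under stated assumptions: the server uses its own RIFF parser, and "20MB" is taken here to mean 20 × 1024 × 1024 bytes.

```python
import io
import wave

MAX_BYTES = 20 * 1024 * 1024          # "max 20MB" (binary megabytes assumed)
MIN_SECONDS, MAX_SECONDS = 5.0, 60.0  # documented duration window


def check_ref_audio(data: bytes) -> float:
    """Return the clip duration in seconds, or raise ValueError if it would be rejected."""
    if len(data) > MAX_BYTES:
        raise ValueError("file exceeds 20MB")
    try:
        # wave only accepts uncompressed PCM and raises wave.Error otherwise.
        with wave.open(io.BytesIO(data), "rb") as wav:
            frames, rate = wav.getnframes(), wav.getframerate()
    except (wave.Error, EOFError) as exc:
        raise ValueError(f"not a PCM WAV file: {exc}") from exc
    if rate <= 0:
        raise ValueError("invalid sample rate")
    duration = frames / rate
    if not MIN_SECONDS <= duration <= MAX_SECONDS:
        raise ValueError(f"duration {duration:.1f}s is outside 5-60s")
    return duration
```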


GET /v1/voices

List voices owned by this API key.

200:

[
  { "voice_id": "uuid", "name": "picard", "backend": "qwen3-1.7b", "lang": "en", "created_at": "..." }
]

GET /v1/voices/:id

Get full voice record. Returns all fields except api_key_hash (stripped server-side).

404 — not found or belongs to a different key.


DELETE /v1/voices/:id

Delete voice, all cached synthesis WAVs, and all job records/audio.

204 No Content on success. 404 — not found or belongs to a different key.


POST /v1/synthesize

Synthesize speech from a cloned voice.

Auth: User API key

Body:

{
  "voice_id": "uuid",
  "text": "The text to synthesize.",
  "lang": "en",
  "backend": "qwen3-1.7b"
}

lang and backend are optional — backend falls back to the voice's registered backend. Supported lang values: en zh ja ko es fr de it pt ru (others will cause the job to fail with an unsupported language error).

200 — Cache hit: binary audio/wav stream with headers X-Cache: HIT, X-Voice-Id, X-Backend, X-Synthesis-Time-Ms.

202 — Cache miss:

{ "job_id": "uuid", "status": "pending" }

Response headers: X-Cache: MISS, X-Voice-Id, X-Backend, X-Synthesis-Time-Ms: -1.

400 — voice_id or text missing. 404 — voice not found or wrong owner. 422 — invalid backend value. 500 — reference audio missing from R2.


GET /v1/jobs/:id

Poll synthesis job status.

200:

{
  "job_id": "uuid",
  "status": "pending | ready | failed",
  "audio_url": "https://.../v1/jobs/uuid/audio",
  "error": "optional error message"
}

audio_url only appears when status is "ready".

404 — not found or belongs to a different key.
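A polling loop for this endpoint might look like the sketch below. The HTTP call is injected as fetch_status so the loop stays independent of any particular HTTP library. The 2s interval and 180s timeout are assumptions, chosen to sit comfortably above the 60–90s cold start.

```python
import time
from typing import Callable


def wait_for_job(
    fetch_status: Callable[[str], dict],
    job_id: str,
    interval: float = 2.0,
    timeout: float = 180.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the job is ready and return its audio_url.

    fetch_status is a caller-supplied function that performs
    GET /v1/jobs/:id and returns the decoded JSON body.
    """
    deadline = time.monotonic() + timeout
    while True:
        job = fetch_status(job_id)
        if job["status"] == "ready":
            return job["audio_url"]
        if job["status"] == "failed":
            raise RuntimeError(job.get("error") or "synthesis failed")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} still pending after {timeout}s")
        sleep(interval)
```

Passing sleep as a parameter makes the loop easy to exercise in tests without real delays.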


GET /v1/jobs/:id/audio

Download synthesized WAV.

200 — Binary audio/wav with X-Job-Id header. 404 — job not found, wrong owner, or not yet ready. 500 — job shows ready in D1 but audio missing from R2 (self-heals: marks job failed).


Billing endpoints

The hosted billing flow is documented in full in docs/openapi.yaml. Summary:

| Endpoint | Auth | Purpose |
|---|---|---|
| GET /v1/signup | | HTML signup form |
| POST /v1/checkout | | Creates a Stripe Checkout Session; returns { url }. Existing active subscribers get a URL to /v1/checkout/already-subscribed instead of a second Stripe session. |
| GET /v1/checkout/success?token=… | | Polled by the browser after Stripe redirect. Shows the API key once (browser navigation) or signals `{ ready: true |
| GET /v1/checkout/already-subscribed | | HTML page shown when a customer attempts to subscribe a second time. |
| POST /v1/webhooks/stripe | Stripe signature | Handles checkout.session.completed, customer.subscription.updated (period sync + revoke on unpaid), and customer.subscription.deleted. |
| GET /v1/usage | User API key | Current-period character usage anchored to the customer's real Stripe billing window. |

Text cap: POST /v1/synthesize rejects text longer than 5000 characters (422). Rate limiting per key is a known gap; see roadmap.


POST /v1/jobs/:id/complete (Modal webhook — internal)

Called by Modal when synthesis finishes. Not for direct use.

Auth: Authorization: Bearer <MODAL_CALLBACK_SECRET> (timing-safe compare using the Cloudflare Workers synchronous extension, not standard Web Crypto)

Body:

{
  "job_id": "uuid",
  "voice_id": "uuid",
  "cache_key": "sha256hex",
  "wav_b64": "base64-encoded-wav | null",
  "error": "error message | null"
}

(voice_id is sent by Modal but not consumed by the webhook — ownership is derived from the job record in D1.)

Idempotent — re-delivery after a job has already completed is silently accepted (checked via status !== "pending" || completed_at IS NOT NULL).


Async Synthesis Flow

1.  POST /v1/synthesize
      ↓ cache miss
2.  D1: INSERT job (status=pending, completed_at=null)
3.  Fetch ref audio bytes from R2
4.  POST Modal trigger URL — body: job_id, voice_id, ref_audio_b64, ref_text, text, lang,
                                   backend_name, cache_key, callback_url, callback_secret
      ↓ trigger returns <1s; 202 returned to client immediately
5.  Modal trigger .spawn()s synthesize_job on A100
6.  synthesize_job:
      a. Load Qwen3CudaBackend (warm: ~1s, cold: 60–90s)
      b. Write ref audio bytes to temp WAV
      c. Run Qwen3-TTS inference → output PCM WAV
      d. POST /v1/jobs/:id/complete  { wav_b64: <base64>, cache_key, job_id }
7.  Worker webhook:
      a. Verify MODAL_CALLBACK_SECRET (timing-safe sync compare)
      b. Replay guard: skip if status != pending OR completed_at is not null
      c. Decode WAV, store at jobs/{id}.wav in R2
      d. Store in synthesis cache at voices/{voice_id}/cache/{cache_key}.wav
      e. UPDATE job: status=ready, audio_key=…, completed_at=now
8.  Client polls GET /v1/jobs/:id until status=ready
9.  Client downloads GET /v1/jobs/:id/audio  →  WAV stream

Cache key: sha256(text + "\x00" + lang + "\x00" + backend) — null-byte separators prevent collisions.
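A Python rendering of the cache key formula is shown below. UTF-8 encoding is an assumption; the hex output matches the "sha256hex" form used in the webhook payload. The Worker's own implementation lives in cache.ts.

```python
import hashlib


def synth_cache_key(text: str, lang: str, backend: str) -> str:
    """sha256(text + NUL + lang + NUL + backend) as a hex string."""
    joined = "\x00".join((text, lang, backend))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()
```

Without the separators, ("ab", "c") and ("a", "bc") would concatenate to the same string and share a cache entry; with them, the inputs hash differently.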


Configuration

Cloudflare Worker (api/wrangler.toml)

| Binding | Type | Purpose |
|---|---|---|
| API_KEYS | KV Namespace | API key hash → metadata |
| DB | D1 Database | voices + jobs tables |
| STORAGE | R2 Bucket | ref WAVs, synthesis cache, job audio |
| ADMIN_SECRET | Secret | gates POST /v1/keys |
| MODAL_TRIGGER_URL | Var | URL of the Modal trigger web endpoint |
| MODAL_CALLBACK_SECRET | Secret | authenticates Modal → Worker callbacks |
| STRIPE_SECRET_KEY | Secret | Stripe live secret key (sk_live_...); required for the billing flow |
| STRIPE_WEBHOOK_SECRET | Secret | Stripe webhook signing secret (whsec_...) |
| STRIPE_PRICE_ID | Var | Stripe Price ID for the Hobby plan |
| APP_BASE_URL | Var | Public base URL of the deployed Worker (used for Checkout success/cancel URLs) |

Modal (inference/)

No Modal-native secret configuration required. The callback secret is passed in the trigger payload by the Worker and forwarded in the callback request header.


Setup & Deployment

Prerequisites

  • Node.js 20+, Wrangler 3+
  • Python 3.11+, Modal CLI: pip install modal && modal setup
  • Cloudflare account with Workers, D1, R2, KV enabled

1. Provision Cloudflare resources

npx wrangler kv namespace create API_KEYS
npx wrangler d1 create afterwords-voices
npx wrangler r2 bucket create afterwords-voices

Copy the resulting IDs into api/wrangler.toml.

2. Apply D1 schema

npx wrangler d1 execute afterwords-voices --file=migrations/0001_voices.sql --remote
npx wrangler d1 execute afterwords-voices --file=migrations/0002_billing.sql --remote
npx wrangler d1 execute afterwords-voices --file=migrations/0003_jobs_char_count.sql --remote
npx wrangler d1 execute afterwords-voices --file=migrations/0004_stripe_customer_unique.sql --remote

(Drop --remote for local dev against miniflare.)

3. Deploy Modal inference app

cd inference
pip install -r requirements.txt
modal deploy app.py

Copy the printed trigger URL into api/wrangler.toml as MODAL_TRIGGER_URL.

4. Set Worker secrets

cd api
npx wrangler secret put ADMIN_SECRET
npx wrangler secret put MODAL_CALLBACK_SECRET   # must match the secret sent in trigger payload
npx wrangler secret put STRIPE_SECRET_KEY       # billing flow
npx wrangler secret put STRIPE_WEBHOOK_SECRET   # webhook signature verification

Set STRIPE_PRICE_ID and APP_BASE_URL as [vars] in wrangler.toml (already pinned in the committed config).

5. Deploy Worker

cd api
npm install
npx wrangler deploy

Development

Worker tests

cd api
npm install
npm test      # vitest + @cloudflare/vitest-pool-workers — no wrangler needed

Tests run against a real miniflare D1/R2/KV environment. The D1 schema is applied automatically via applyD1Migrations. The suite has 67 tests covering auth, cache, storage, and all routes.

Inference tests

cd inference
pip install -r requirements.txt
pytest tests/ -v   # no GPU or Modal auth needed — model is mocked

Local Worker dev

cd api
cp .dev.vars.example .dev.vars   # fill in ADMIN_SECRET and MODAL_CALLBACK_SECRET
npx wrangler dev

Data Model

voices

| Column | Type | Notes |
|---|---|---|
| voice_id | TEXT PK | UUID |
| api_key_hash | TEXT | SHA-256 of API key (ownership) |
| name | TEXT | Friendly name |
| backend | TEXT | qwen3-1.7b or qwen3-0.6b |
| ref_text | TEXT | Optional transcript |
| lang | TEXT | BCP-47, default en |
| family | TEXT | Optional grouping label |
| ref_audio_key | TEXT | R2 key: voices/{id}/ref.wav |
| created_at | TEXT | ISO-8601 |
| metadata | TEXT | JSON blob |

jobs

| Column | Type | Notes |
|---|---|---|
| job_id | TEXT PK | UUID |
| voice_id | TEXT | FK → voices |
| api_key_hash | TEXT | Ownership check (no cross-key access) |
| status | TEXT | pending, ready, or failed |
| audio_key | TEXT | R2 key when ready |
| error | TEXT | Error message on failure |
| created_at | TEXT | ISO-8601 |
| completed_at | TEXT | Stamped on first terminal transition; replay guard for Modal callbacks |
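Since D1 is SQLite, the jobs table and its two access rules (ownership scoping and the replay guard) can be tried locally with Python's sqlite3 module. This is a sketch under assumptions: the NOT NULL and DEFAULT constraints are guesses, the foreign key to voices is omitted for brevity, and the real schema is whatever migrations/0001_voices.sql defines.

```python
import sqlite3

# Column names follow the jobs table above; constraints are assumptions.
SCHEMA = """
CREATE TABLE jobs (
    job_id       TEXT PRIMARY KEY,
    voice_id     TEXT NOT NULL,
    api_key_hash TEXT NOT NULL,
    status       TEXT NOT NULL DEFAULT 'pending',
    audio_key    TEXT,
    error        TEXT,
    created_at   TEXT NOT NULL,
    completed_at TEXT
);
"""


def get_job(conn: sqlite3.Connection, job_id: str, api_key_hash: str):
    """Return the job row only if it belongs to the caller's key, else None (a 404)."""
    return conn.execute(
        "SELECT job_id, status, audio_key, error FROM jobs "
        "WHERE job_id = ? AND api_key_hash = ?",
        (job_id, api_key_hash),
    ).fetchone()


def complete_job(conn: sqlite3.Connection, job_id: str, audio_key: str, now: str) -> bool:
    """Mark a job ready; returns False when the replay guard rejects the update."""
    cur = conn.execute(
        "UPDATE jobs SET status = 'ready', audio_key = ?, completed_at = ? "
        "WHERE job_id = ? AND status = 'pending' AND completed_at IS NULL",
        (audio_key, now, job_id),
    )
    return cur.rowcount == 1
```

Putting the guard in the WHERE clause makes the transition atomic, so a second callback for the same job updates zero rows.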

R2 Layout

voices/{voice_id}/ref.wav                    ← reference audio
voices/{voice_id}/cache/{cache_key}.wav      ← synthesis cache
jobs/{job_id}.wav                            ← individual job output
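The layout above maps onto three small key builders. The function names are hypothetical; the key formats are taken directly from the listing.

```python
def ref_audio_key(voice_id: str) -> str:
    """R2 key of a voice's reference audio."""
    return f"voices/{voice_id}/ref.wav"


def cache_audio_key(voice_id: str, cache_key: str) -> str:
    """R2 key of a cached synthesis result for a voice."""
    return f"voices/{voice_id}/cache/{cache_key}.wav"


def job_audio_key(job_id: str) -> str:
    """R2 key of an individual job's output."""
    return f"jobs/{job_id}.wav"


def voice_prefix(voice_id: str) -> str:
    """Common prefix covering a voice's reference audio and cache entries."""
    return f"voices/{voice_id}/"
```

One consequence of this layout is that a voice's reference audio and cache share a prefix, while job audio sits under jobs/ and has to be located through the job records in D1.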

Repository Structure

afterwords-cloud/
  api/
    src/
      index.ts          ← Hono app, route mounting, global error handler
      types.ts          ← Voice, Job, Bindings interfaces
      auth.ts           ← generateApiKey, hashKey, authMiddleware
      cache.ts          ← synthCacheKey (SHA-256, null-byte separated)
      storage.ts        ← Storage class: R2/D1/KV wrappers
      routes/
        keys.ts         ← POST /v1/keys
        voices.ts       ← POST/GET/DELETE /v1/voices (WAV validation, RIFF parser)
        synthesize.ts   ← POST /v1/synthesize (cache check, job creation, trigger)
        jobs.ts         ← GET /v1/jobs/:id, audio download, POST complete webhook
    test/               ← vitest suite (67 tests)
    wrangler.toml
    vitest.config.mts
  inference/
    app.py              ← Modal trigger endpoint + synthesize_job (A100)
    backend.py          ← Qwen3CudaBackend (PyTorch Qwen3-TTS)
    requirements.txt
    tests/
      test_backend.py   ← pytest suite (mocked, no GPU needed)
  migrations/
    0001_voices.sql     ← D1 schema
    0002_billing.sql
    0003_jobs_char_count.sql
    0004_stripe_customer_unique.sql
  docs/
    superpowers/
      specs/            ← v1 design spec
      plans/            ← v1 implementation plan (7 tasks)

Related

  • afterwords — local Apple Silicon TTS server. backends/qwen3.py is the MLX counterpart to inference/backend.py here. The voices/ directory has reference WAVs useful for smoke testing.

About

WIP — SaaS layer over afterwords engine. ElevenLabs-style voice cloning API with provenance + backend choice. Working title.
