Calm is a modular Python project that generates custom meditation audio using LLM-structured scripts and Text-to-Speech (TTS), with optional background ambience and simple ducking.
- LLM script generation with OpenAI Structured Outputs (schema-validated)
- TTS via OpenAI (`tts-1`/`tts-1-hd`/`gpt-4o-mini-tts`), with speaking speed control and voice mapping
- Audio stitching with pause handling, optional background music, and automatic looping
- Content-addressed caching for LLM outputs and TTS segments
- Clean Python API and a simple CLI
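
For orientation, here is a minimal sketch of what Python API usage could look like. `MeditationGenerator` comes from the architecture diagram below; the import path, constructor, and `generate(...)` keyword arguments are illustrative assumptions rather than the exact signature.

```python
# Hypothetical usage sketch -- names and arguments are assumptions, not the exact API.
from calm import MeditationGenerator  # assumed import path

generator = MeditationGenerator()

# Parameters mirror the CLI flags documented below.
audio_path = generator.generate(
    meditation_type="mindfulness",
    duration_minutes=5,
    focus="stress relief",
    voice="female",
    background_music="ocean_waves",
)
print(f"Rendered meditation saved to {audio_path}")
```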
- `calm/core/` — types, settings, and centralized `instructions.py`
- `calm/providers/` — LLM and TTS adapters (OpenAI by default)
- `calm/audio/` — audio engine (stitching, background mixing)
- `calm/utils/` — helpers like hashing and caching
- `examples/` — sample generation script(s)
- `scripts/` — CLI entrypoint
- `cache/` — cached scripts/segments and background assets
- `output/` — final rendered audio
```mermaid
flowchart TD
    U[User / CLI / API] --> G[MeditationGenerator]
    G --> LLM[LLM - OpenAI]
    LLM --> TTS[TTS - OpenAI]
    TTS --> AE[Audio Engine]
    AE --> OUT[output/*.mp3]
    subgraph Cache
        C1[cache/llm]
        C2[cache/tts]
        C3[cache/backgrounds]
    end
    LLM <-->|read/write| C1
    TTS <-->|read/write| C2
    AE -->|optional bg| C3
```
- Python 3.9+
- ffmpeg (for `pydub`): `brew install ffmpeg`
- Create venv and install dependencies:

  ```bash
  python -m venv .venv && source .venv/bin/activate
  pip install -r requirements.txt
  ```

- Configure environment:

  ```bash
  cp .env.example .env
  # Set at least: OPENAI_API_KEY
  ```

Loaded from environment (see `calm/core/settings.py`):
- `OPENAI_API_KEY`: your OpenAI API key (required for real LLM/TTS)
- `OPENAI_TTS_VOICE`: default `nova`
- `CALM_TTS_MODEL`: default `openai:gpt-4o-mini-tts` (can use `openai:tts-1-hd`)
- `CALM_MODEL`: default `openai:gpt-4o`
- `CALM_TTS`: default `openai-tts`
- `CALM_OUTPUT_DIR`: default `output`
- `CALM_CACHE_DIR`: default `cache`
- `CALM_DEFAULT_FORMAT`: default `mp3`
- `CALM_TTS_RESPONSE_FORMAT`: default `wav`
- `CALM_TTS_SPEED`: default `1.0` (mapped to OpenAI `speed`, 0.25–4.0)
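
A sample `.env` reflecting the defaults above (only `OPENAI_API_KEY` is strictly required; the other lines just restate the defaults):

```
OPENAI_API_KEY=your-openai-api-key
OPENAI_TTS_VOICE=nova
CALM_MODEL=openai:gpt-4o
CALM_TTS=openai-tts
CALM_TTS_MODEL=openai:gpt-4o-mini-tts
CALM_TTS_RESPONSE_FORMAT=wav
CALM_TTS_SPEED=1.0
CALM_OUTPUT_DIR=output
CALM_CACHE_DIR=cache
CALM_DEFAULT_FORMAT=mp3
```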
```bash
python -m scripts.calm_cli \
--type mindfulness \
--duration 5 \
--focus "stress relief" \
--voice female \
--background ocean_waves \
--instructions "Include a focus on ocean imagery and gentle wave sounds"
```

- Place assets under `cache/backgrounds/`, named `<name>.mp3` or `<name>.wav`.
- Example: `cache/backgrounds/nature_sounds.mp3` and `cache/backgrounds/soothing_flute.mp3`.
- Set `background_music` to `"nature_sounds"` or `"soothing_flute"` in config.
- The engine loops/trims the track to match the final duration and ducks it under speech.
- Provider: OpenAI client with Structured Outputs
- Schema is defined in code and converted to `calm/core/types.py` models (see the sketch below)
- Prompts are centralized in `calm/core/instructions.py` for easy tuning
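
The exact models live in `calm/core/types.py`; as a rough sketch of the shape implied by the prompt template below (alternating `speech`/`pause` segments, `target_seconds`, a total time target), they might look something like this, with field names as assumptions:

```python
# Illustrative sketch -- the real models are in calm/core/types.py and may differ.
from typing import List, Literal, Union

from pydantic import BaseModel, Field


class SpeechSegment(BaseModel):
    type: Literal["speech"] = "speech"
    text: str  # a short, gently worded paragraph


class PauseSegment(BaseModel):
    type: Literal["pause"] = "pause"
    target_seconds: float = Field(gt=0)  # silence requested by the script


class MeditationScript(BaseModel):
    title: str
    total_target_seconds: int  # kept within ±15% of the requested duration
    segments: List[Union[SpeechSegment, PauseSegment]]
```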
We use a structured prompt paradigm to make the model's behavior predictable and the output schema-stable.
Template:
```text
<OBJECTIVE_AND_PERSONA>
You are a world-class meditation teacher and scriptwriter. Your task is to produce a calm, inclusive, secular meditation script tailored to the provided type, focus, and duration.
</OBJECTIVE_AND_PERSONA>
<INSTRUCTIONS>
To complete the task, follow these steps:
1. Plan intro (10–15%), body (70–80%), outro (10–15%).
2. Write short 'speech' paragraphs in a gentle, invitational style.
3. After every 2–4 sentences, add a 'pause' with target_seconds.
4. Keep total_target_seconds within ±15% by adjusting pauses.
5. Reflect the requested type and focus with simple, supportive imagery.
6. Conclude by gently transitioning back.
</INSTRUCTIONS>
<CONSTRAINTS>
Dos and don'ts:
1. Do: be gentle, inclusive, secular; prefer "you may", "if you like".
2. Don't: bracketed directions, sound-effect words, medical claims, timers, or background-audio references in speech.
</CONSTRAINTS>
<RECAP>
Gentle tone, three-part structure, alternate speech/pause, adjust pauses to meet time, strict schema.
</RECAP>
```
At runtime we embed context from the user request inside `<CONTEXT>` via `build_user_instruction`, and enforce the schema with Pydantic models through Structured Outputs.
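
A minimal sketch of that call with the OpenAI Python SDK; `build_user_instruction` and the prompt template are the pieces named in this README, while the import paths, argument names, and the `MeditationScript` model follow the sketch above and are assumptions about the actual wiring:

```python
# Sketch of schema-enforced script generation -- module/attribute names are assumptions.
from openai import OpenAI

from calm.core import instructions  # centralizes the prompt template and helpers
from calm.core.types import MeditationScript  # assumed model name

client = OpenAI()  # reads OPENAI_API_KEY from the environment

completion = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[
        # The template above (constant name is an assumption).
        {"role": "system", "content": instructions.SYSTEM_PROMPT},
        {
            "role": "user",
            # build_user_instruction embeds the request details inside <CONTEXT>.
            "content": instructions.build_user_instruction(
                meditation_type="mindfulness",
                duration_minutes=5,
                focus="stress relief",
            ),
        },
    ],
    response_format=MeditationScript,  # Pydantic model; schema enforced via Structured Outputs
)
script = completion.choices[0].message.parsed  # validated MeditationScript instance
```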
- Provider: OpenAI `tts-1`/`tts-1-hd` with streaming to file
- Voices supported: `alloy`, `ash`, `fable`, `nova`, `shimmer`, `echo`, `onyx`, `sage`, `coral`
- `voice` mapping supports `male`/`female` as conveniences (`ash`/`nova`)
- `speaking_rate` maps to `speed` (0.25–4.0; default 0.8)
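
Per speech segment, synthesis could look roughly like this with the OpenAI SDK's streaming helper (the path and sample text are illustrative; model, voice, and speed mirror values mentioned above):

```python
# Sketch: synthesize one speech segment and stream it to a file.
from pathlib import Path

from openai import OpenAI

client = OpenAI()
segment_path = Path("cache/tts/segment_000.wav")
segment_path.parent.mkdir(parents=True, exist_ok=True)

with client.audio.speech.with_streaming_response.create(
    model="tts-1",        # or tts-1-hd / gpt-4o-mini-tts
    voice="nova",         # "female" maps to nova, "male" to ash
    input="Take a slow, easy breath in... and let it go.",
    speed=0.8,            # speaking_rate maps onto this (0.25-4.0)
    response_format="wav",
) as response:
    response.stream_to_file(segment_path)
```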
- LLM script cache: `cache/llm/`, keyed by config and model
- TTS segment cache: `cache/tts/`, keyed by text+voice+speed
- Pauses cached in `cache/pauses/`
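
Content addressing can be as simple as hashing everything that affects a segment; a rough sketch follows (the real helpers live in `calm/utils/` and may key things differently):

```python
# Sketch of a content-addressed cache path for a TTS segment.
import hashlib
from pathlib import Path


def tts_cache_path(text: str, voice: str, speed: float, cache_dir: str = "cache/tts") -> Path:
    """Derive a deterministic file name from the inputs that determine the audio."""
    key = hashlib.sha256(f"{text}|{voice}|{speed}".encode("utf-8")).hexdigest()
    return Path(cache_dir) / f"{key}.wav"


path = tts_cache_path("Take a slow, easy breath in...", "nova", 0.8)
# If path exists, reuse the cached segment; otherwise synthesize and write it there.
```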
```bash
python -m examples.generate_samples
```

Outputs are saved in `output/` with filenames like `meditation_<type>_<duration>m.mp3`.
```bash
pytest -q
```

- We use a structured prompt with explicit sections (OBJECTIVE_AND_PERSONA, INSTRUCTIONS, CONSTRAINTS, RECAP) to constrain style and structure while keeping outputs consistent.
- Request-specific details are injected via `<CONTEXT>` from `build_user_instruction`, separating stable guidance from dynamic inputs.
- Responses are parsed with OpenAI Structured Outputs into Pydantic models, ensuring schema correctness and reducing retries.
- The design is inspired by best practices such as hierarchical prompting and explicit constraints.
- The prompt budgets time: three-part structure with a total duration target within ±15% and explicit `pause` segments carrying `target_seconds`.
- During synthesis, actual speech durations are measured from the rendered audio files.
- In the final mix, we adjust only pause lengths to close any remaining gap when the total is off by more than ~3 seconds, minimizing audible artifacts (a sketch of this adjustment follows below).
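
A sketch of that adjustment, scaling only the pause durations so the stitched total lands on the target (the actual engine may distribute the correction differently):

```python
# Sketch: close the gap between measured and target duration by resizing pauses only.
def adjust_pauses(
    pause_seconds: list[float],
    speech_seconds: list[float],
    target_total: float,
    tolerance: float = 3.0,
) -> list[float]:
    total = sum(pause_seconds) + sum(speech_seconds)
    gap = target_total - total
    if abs(gap) <= tolerance or not pause_seconds:
        return pause_seconds  # close enough, or nothing to adjust
    # Spread the correction across pauses in proportion to their length,
    # keeping every pause above a short minimum.
    pause_total = sum(pause_seconds)
    return [max(0.5, p + gap * (p / pause_total)) for p in pause_seconds]
```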
- Segment-first pipeline: synthesize each `speech` segment, generate precise `pause` silences, then stitch in order.
- Optional background: loop/trim the chosen track to the final duration; apply simple ducking (stronger under speech, lighter during pauses), as sketched below.
- Export via `pydub`/ffmpeg with configurable format/bitrate (default MP3 192k).
- Content-addressed caching for LLM scripts, TTS segments, and generated pauses reduces cost and speeds up repeats.
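
With `pydub`, the loop/trim-and-duck step could look roughly like this (file paths and the gain value are illustrative; the engine in `calm/audio/` is the source of truth):

```python
# Sketch: loop/trim background to the narration length and duck it under speech.
from pydub import AudioSegment

voice = AudioSegment.from_file("output/voice_only.mp3")                   # stitched speech + pauses
background = AudioSegment.from_file("cache/backgrounds/ocean_waves.mp3")  # chosen ambience

# Loop the background until it covers the narration, then trim to the exact length.
while len(background) < len(voice):
    background += background
background = background[: len(voice)]

# Simple ducking: a fixed gain reduction keeps the music under the voice.
ducked = background - 14  # ~14 dB quieter; pauses could use a lighter reduction
mixed = voice.overlay(ducked)
mixed.export("output/meditation_with_background.mp3", format="mp3", bitrate="192k")
```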
- We keep things simple: background music is ducked with a plain gain reduction rather than more complex processing, which keeps the mixing code easy to maintain.
- To hit the target duration, we resize silent pauses rather than re-synthesizing speech, which costs less and keeps the narration consistent.
- We balance reliability with creativity: clear structural rules make outputs predictable, while still leaving room to dial creativity up or down when needed.
- OpenAI powers the LLM and TTS by default, but the provider adapters are designed so either backend can be swapped for another service with minimal changes.
The following major enhancements are planned for future releases. These are not yet implemented, but are priorities for upcoming development:
- Streaming audio output: Enable real-time, chunked audio generation and playback, allowing users to listen to meditations as they are being created, instead of waiting for the full file to finish processing.
- Async FastAPI server for text-to-text and TTS: Provide an HTTP API with endpoints for both script generation and audio synthesis, implemented using FastAPI with asynchronous processing for scalability and low-latency user experience.
- Streamlit-based UI for rapid prototyping: Develop an interactive web interface for testing meditation types, voices, and background options—ideal for demoing and manual QA without using the command line.
- Advanced caching strategy: Improve caching mechanisms for LLM outputs, TTS segments, and audio mixes to further reduce latency, avoid repeated computation, and support cache expiration/invalidation for dynamic changes.
- Smooth ducking with fades: Implement ducking so that background music volume transitions smoothly using fade-ins and fade-outs at the start and end of silence, preventing abrupt changes in volume.