Calm is a modular Python project that generates custom meditation audio using LLM-structured scripts and Text-to-Speech (TTS), with optional background ambience and simple ducking.
- LLM script generation with OpenAI Structured Outputs (schema-validated)
- TTS via OpenAI (`tts-1`/`tts-1-hd`/`gpt-4o-mini-tts`), with speaking speed control and voice mapping
- Audio stitching with pause handling, optional background music, and automatic looping
- Content-addressed caching for LLM outputs and TTS segments
- Clean Python API and a simple CLI
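
For orientation, here is a minimal sketch of what Python API usage could look like. `MeditationGenerator` comes from the architecture diagram below; the import path, constructor, and `generate(...)` keyword arguments are illustrative assumptions rather than the exact signature.

```python
# Hypothetical usage sketch -- names and arguments are assumptions, not the exact API.
from calm import MeditationGenerator  # assumed import path

generator = MeditationGenerator()

# Parameters mirror the CLI flags documented below.
audio_path = generator.generate(
    meditation_type="mindfulness",
    duration_minutes=5,
    focus="stress relief",
    voice="female",
    background_music="ocean_waves",
)
print(f"Rendered meditation saved to {audio_path}")
```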
- `calm/core/` — types, settings, and centralized `instructions.py`
- `calm/providers/` — LLM and TTS adapters (OpenAI by default)
- `calm/audio/` — audio engine (stitching, background mixing)
- `calm/utils/` — helpers like hashing and caching
- `examples/` — sample generation script(s)
- `scripts/` — CLI entrypoint
- `cache/` — cached scripts/segments and background assets
- `output/` — final rendered audio
```mermaid
flowchart TD
    U[User / CLI / API] --> G[MeditationGenerator]
    G --> LLM[LLM - OpenAI]
    LLM --> TTS[TTS - OpenAI]
    TTS --> AE[Audio Engine]
    AE --> OUT[output/*.mp3]
    subgraph Cache
        C1[cache/llm]
        C2[cache/tts]
        C3[cache/backgrounds]
    end
    LLM <-->|read/write| C1
    TTS <-->|read/write| C2
    AE -->|optional bg| C3
```
- Python 3.9+
- ffmpeg (for `pydub`): `brew install ffmpeg`
- Create venv and install dependencies:

  ```bash
  python -m venv .venv && source .venv/bin/activate
  pip install -r requirements.txt
  ```

- Configure environment:

  ```bash
  cp .env.example .env
  # Set at least: OPENAI_API_KEY
  ```

Loaded from environment (see `calm/core/settings.py`):
- `OPENAI_API_KEY`: your OpenAI API key (required for real LLM/TTS)
- `OPENAI_TTS_VOICE`: default `nova`
- `CALM_TTS_MODEL`: default `openai:gpt-4o-mini-tts` (can use `openai:tts-1-hd`)
- `CALM_MODEL`: default `openai:gpt-4o`
- `CALM_TTS`: default `openai-tts`
- `CALM_OUTPUT_DIR`: default `output`
- `CALM_CACHE_DIR`: default `cache`
- `CALM_DEFAULT_FORMAT`: default `mp3`
- `CALM_TTS_RESPONSE_FORMAT`: default `wav`
- `CALM_TTS_SPEED`: default `1.0` (mapped to OpenAI `speed`, 0.25–4.0)
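
A sample `.env` reflecting the defaults above (only `OPENAI_API_KEY` is strictly required; the other lines just restate the defaults):

```
OPENAI_API_KEY=your-openai-api-key
OPENAI_TTS_VOICE=nova
CALM_MODEL=openai:gpt-4o
CALM_TTS=openai-tts
CALM_TTS_MODEL=openai:gpt-4o-mini-tts
CALM_TTS_RESPONSE_FORMAT=wav
CALM_TTS_SPEED=1.0
CALM_OUTPUT_DIR=output
CALM_CACHE_DIR=cache
CALM_DEFAULT_FORMAT=mp3
```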
```bash
python -m scripts.calm_cli \
--type mindfulness \
--duration 5 \
--focus "stress relief" \
--voice female \
--background ocean_waves \
--instructions "Include a focus on ocean imagery and gentle wave sounds"
```

- Place assets under `cache/backgrounds/`, named `<name>.mp3` or `<name>.wav`.
- Example: `cache/backgrounds/nature_sounds.mp3` and `cache/backgrounds/soothing_flute.mp3`.
- Set `background_music` to `"nature_sounds"` or `"soothing_flute"` in config.
- The engine loops/trims the track to match the final duration and ducks it under speech.
- Provider: OpenAI client with Structured Outputs
- Schema is defined in code and converted to `calm/core/types.py` models (see the sketch below)
- Prompts are centralized in `calm/core/instructions.py` for easy tuning
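
The exact models live in `calm/core/types.py`; as a rough sketch of the shape implied by the prompt template below (alternating `speech`/`pause` segments, `target_seconds`, a total time target), they might look something like this, with field names as assumptions:

```python
# Illustrative sketch -- the real models are in calm/core/types.py and may differ.
from typing import List, Literal, Union

from pydantic import BaseModel, Field


class SpeechSegment(BaseModel):
    type: Literal["speech"] = "speech"
    text: str  # a short, gently worded paragraph


class PauseSegment(BaseModel):
    type: Literal["pause"] = "pause"
    target_seconds: float = Field(gt=0)  # silence requested by the script


class MeditationScript(BaseModel):
    title: str
    total_target_seconds: int  # kept within ±15% of the requested duration
    segments: List[Union[SpeechSegment, PauseSegment]]
```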
We use a structured prompt paradigm to make the model's behavior predictable and the output schema-stable.
Template:
```text
<OBJECTIVE_AND_PERSONA>
You are a world-class meditation teacher and scriptwriter. Your task is to produce a calm, inclusive, secular meditation script tailored to the provided type, focus, and duration.
</OBJECTIVE_AND_PERSONA>
<INSTRUCTIONS>
To complete the task, follow these steps:
1. Plan intro (10–15%), body (70–80%), outro (10–15%).
2. Write short 'speech' paragraphs in a gentle, invitational style.
3. After every 2–4 sentences, add a 'pause' with target_seconds.
4. Keep total_target_seconds within ±15% by adjusting pauses.
5. Reflect the requested type and focus with simple, supportive imagery.
6. Conclude by gently transitioning back.
</INSTRUCTIONS>
<CONSTRAINTS>
Dos and don'ts:
1. Do: be gentle, inclusive, secular; prefer "you may", "if you like".
2. Don't: bracketed directions, sound-effect words, medical claims, timers, or background-audio references in speech.
</CONSTRAINTS>
<RECAP>
Gentle tone, three-part structure, alternate speech/pause, adjust pauses to meet time, strict schema.
</RECAP>
```
At runtime we embed context from the user request inside `<CONTEXT>` via `build_user_instruction`, and enforce the schema with Pydantic models through Structured Outputs.
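
A minimal sketch of that call with the OpenAI Python SDK; `build_user_instruction` and the prompt template are the pieces named in this README, while the import paths, argument names, and the `MeditationScript` model follow the sketch above and are assumptions about the actual wiring:

```python
# Sketch of schema-enforced script generation -- module/attribute names are assumptions.
from openai import OpenAI

from calm.core import instructions  # centralizes the prompt template and helpers
from calm.core.types import MeditationScript  # assumed model name

client = OpenAI()  # reads OPENAI_API_KEY from the environment

completion = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[
        # The template above (constant name is an assumption).
        {"role": "system", "content": instructions.SYSTEM_PROMPT},
        {
            "role": "user",
            # build_user_instruction embeds the request details inside <CONTEXT>.
            "content": instructions.build_user_instruction(
                meditation_type="mindfulness",
                duration_minutes=5,
                focus="stress relief",
            ),
        },
    ],
    response_format=MeditationScript,  # Pydantic model; schema enforced via Structured Outputs
)
script = completion.choices[0].message.parsed  # validated MeditationScript instance
```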
- Provider: OpenAI `tts-1`/`tts-1-hd` with streaming to file
- Voices supported: `alloy`, `ash`, `fable`, `nova`, `shimmer`, `echo`, `onyx`, `sage`, `coral`
- `voice` mapping supports `male`/`female` as conveniences (`ash`/`nova`)
- `speaking_rate` maps to `speed` (0.25–4.0; default 0.8)
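
Per speech segment, synthesis could look roughly like this with the OpenAI SDK's streaming helper (the path and sample text are illustrative; model, voice, and speed mirror values mentioned above):

```python
# Sketch: synthesize one speech segment and stream it to a file.
from pathlib import Path

from openai import OpenAI

client = OpenAI()
segment_path = Path("cache/tts/segment_000.wav")
segment_path.parent.mkdir(parents=True, exist_ok=True)

with client.audio.speech.with_streaming_response.create(
    model="tts-1",        # or tts-1-hd / gpt-4o-mini-tts
    voice="nova",         # "female" maps to nova, "male" to ash
    input="Take a slow, easy breath in... and let it go.",
    speed=0.8,            # speaking_rate maps onto this (0.25-4.0)
    response_format="wav",
) as response:
    response.stream_to_file(segment_path)
```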
- LLM script cache: `cache/llm/`, keyed by config and model
- TTS segment cache: `cache/tts/`, keyed by text+voice+speed
- Pauses cached in `cache/pauses/`
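
Content addressing can be as simple as hashing everything that affects a segment; a rough sketch follows (the real helpers live in `calm/utils/` and may key things differently):

```python
# Sketch of a content-addressed cache path for a TTS segment.
import hashlib
from pathlib import Path


def tts_cache_path(text: str, voice: str, speed: float, cache_dir: str = "cache/tts") -> Path:
    """Derive a deterministic file name from the inputs that determine the audio."""
    key = hashlib.sha256(f"{text}|{voice}|{speed}".encode("utf-8")).hexdigest()
    return Path(cache_dir) / f"{key}.wav"


path = tts_cache_path("Take a slow, easy breath in...", "nova", 0.8)
# If path exists, reuse the cached segment; otherwise synthesize and write it there.
```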
```bash
python -m examples.generate_samples
```

Outputs are saved in `output/` with filenames like `meditation_<type>_<duration>m.mp3`.
```bash
pytest -q
```

- We use a structured prompt with explicit sections (OBJECTIVE_AND_PERSONA, INSTRUCTIONS, CONSTRAINTS, RECAP) to constrain style and structure while keeping outputs consistent.
- Request-specific details are injected via `<CONTEXT>` from `build_user_instruction`, separating stable guidance from dynamic inputs.
- Responses are parsed with OpenAI Structured Outputs into Pydantic models, ensuring schema correctness and reducing retries.
- The design is inspired by best practices such as hierarchical prompting and explicit constraints.
- The prompt budgets time: three-part structure with a total duration target within ±15% and explicit `pause` segments carrying `target_seconds`.
- During synthesis, actual speech durations are measured from the rendered audio files.
- In the final mix, we adjust only pause lengths to close any remaining gap when the total is off by more than ~3 seconds, minimizing audible artifacts (a sketch of this adjustment follows below).
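
A sketch of that adjustment, scaling only the pause durations so the stitched total lands on the target (the actual engine may distribute the correction differently):

```python
# Sketch: close the gap between measured and target duration by resizing pauses only.
def adjust_pauses(
    pause_seconds: list[float],
    speech_seconds: list[float],
    target_total: float,
    tolerance: float = 3.0,
) -> list[float]:
    total = sum(pause_seconds) + sum(speech_seconds)
    gap = target_total - total
    if abs(gap) <= tolerance or not pause_seconds:
        return pause_seconds  # close enough, or nothing to adjust
    # Spread the correction across pauses in proportion to their length,
    # keeping every pause above a short minimum.
    pause_total = sum(pause_seconds)
    return [max(0.5, p + gap * (p / pause_total)) for p in pause_seconds]
```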
- Segment-first pipeline: synthesize each `speech` segment, generate precise `pause` silences, then stitch in order.
- Optional background: loop/trim the chosen track to the final duration; apply simple ducking (stronger under speech, lighter during pauses), as sketched below.
- Export via `pydub`/ffmpeg with configurable format/bitrate (default MP3 192k).
- Content-addressed caching for LLM scripts, TTS segments, and generated pauses reduces cost and speeds up repeats.
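
With `pydub`, the loop/trim-and-duck step could look roughly like this (file paths and the gain value are illustrative; the engine in `calm/audio/` is the source of truth):

```python
# Sketch: loop/trim background to the narration length and duck it under speech.
from pydub import AudioSegment

voice = AudioSegment.from_file("output/voice_only.mp3")                   # stitched speech + pauses
background = AudioSegment.from_file("cache/backgrounds/ocean_waves.mp3")  # chosen ambience

# Loop the background until it covers the narration, then trim to the exact length.
while len(background) < len(voice):
    background += background
background = background[: len(voice)]

# Simple ducking: a fixed gain reduction keeps the music under the voice.
ducked = background - 14  # ~14 dB quieter; pauses could use a lighter reduction
mixed = voice.overlay(ducked)
mixed.export("output/meditation_with_background.mp3", format="mp3", bitrate="192k")
```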
- We keep things simple: background music is ducked with a plain gain reduction rather than more complex processing, which keeps the mixing code easy to maintain.
- To hit the target duration, we resize silent pauses rather than re-synthesizing speech, which costs less and keeps the narration consistent.
- We balance reliability with creativity: clear structural rules make outputs predictable, while still leaving room to dial creativity up or down when needed.
- OpenAI powers the LLM and TTS by default, but the provider adapters are designed so either backend can be swapped for another service with minimal changes.
The following major enhancements are planned for future releases. These are not yet implemented, but are priorities for upcoming development:
- Streaming audio output: Enable real-time, chunked audio generation and playback, allowing users to listen to meditations as they are being created, instead of waiting for the full file to finish processing.
- Async FastAPI server for text-to-text and TTS: Provide an HTTP API with endpoints for both script generation and audio synthesis, implemented using FastAPI with asynchronous processing for scalability and low-latency user experience.
- Streamlit-based UI for rapid prototyping: Develop an interactive web interface for testing meditation types, voices, and background options—ideal for demoing and manual QA without using the command line.
- Advanced caching strategy: Improve caching mechanisms for LLM outputs, TTS segments, and audio mixes to further reduce latency, avoid repeated computation, and support cache expiration/invalidation for dynamic changes.
- Smooth ducking with fades: Implement ducking so that background music volume transitions smoothly using fade-ins and fade-outs at the start and end of silence, preventing abrupt changes in volume.