Calm — Meditation Audio Generator

Calm is a modular Python project that generates custom meditation audio using LLM-structured scripts and Text-to-Speech (TTS), with optional background ambience and simple ducking.

Features

  • LLM script generation with OpenAI Structured Outputs (schema-validated)
  • TTS via OpenAI (tts-1/tts-1-hd/gpt-4o-mini-tts), with speaking speed control and voice mapping
  • Audio stitching with pause handling, optional background music, and automatic looping
  • Content-addressed caching for LLM outputs and TTS segments
  • Clean Python API and a simple CLI

Project layout

  • calm/core/ — types, settings, and centralized instructions.py
  • calm/providers/ — LLM and TTS adapters (OpenAI by default)
  • calm/audio/ — audio engine (stitching, background mixing)
  • calm/utils/ — helpers like hashing and caching
  • examples/ — sample generation script(s)
  • scripts/ — CLI entrypoint
  • cache/ — cached scripts/segments and background assets
  • output/ — final rendered audio

High-Level Design

```mermaid
flowchart TD
  U[User / CLI / API] --> G[MeditationGenerator]
  G --> LLM[LLM - OpenAI]
  LLM --> TTS[TTS - OpenAI]
  TTS --> AE[Audio Engine]
  AE --> OUT[output/*.mp3]

  subgraph Cache
    C1[cache/llm]
    C2[cache/tts]
    C3[cache/backgrounds]
  end
  LLM <-->|read/write| C1
  TTS <-->|read/write| C2
  AE -->|optional bg| C3
```

Requirements

  • Python 3.9+
  • ffmpeg (for pydub):
    brew install ffmpeg

Setup

  1. Create venv and install dependencies:
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
  2. Configure environment:
cp .env.example .env
# Set at least: OPENAI_API_KEY

Default settings

Loaded from environment (see calm/core/settings.py):

  • OPENAI_API_KEY: your OpenAI API key (required for real LLM/TTS)
  • OPENAI_TTS_VOICE: default nova
  • CALM_TTS_MODEL: default openai:gpt-4o-mini-tts (can use openai:tts-1-hd)
  • CALM_MODEL: default openai:gpt-4o
  • CALM_TTS: default openai-tts
  • CALM_OUTPUT_DIR: default output
  • CALM_CACHE_DIR: default cache
  • CALM_DEFAULT_FORMAT: default mp3
  • CALM_TTS_RESPONSE_FORMAT: default wav
  • CALM_TTS_SPEED: default 1.0 (mapped to OpenAI speed, 0.25–4.0)
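As an illustrative sketch (the helper names are hypothetical, not the actual `calm/core/settings.py` API), loading environment-driven defaults and clamping the speed into OpenAI's supported range could look like:

```python
import os

# Hypothetical helpers illustrating how settings.py-style defaults might load.
def get_setting(name: str, default: str) -> str:
    """Read a setting from the environment, falling back to a default."""
    return os.environ.get(name, default)

def clamp_speed(speed: float) -> float:
    """Clamp CALM_TTS_SPEED into OpenAI's supported 0.25-4.0 range."""
    return min(max(speed, 0.25), 4.0)
```

For example, `get_setting("CALM_DEFAULT_FORMAT", "mp3")` returns `"mp3"` unless the variable is set.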

Usage

CLI

python -m scripts.calm_cli \
  --type mindfulness \
  --duration 5 \
  --focus "stress relief" \
  --voice female \
  --background ocean_waves \
  --instructions "Include a focus on ocean imagery and gentle wave sounds"
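The flags above could be wired up with `argparse` roughly like this (a sketch; the real `scripts/calm_cli.py` may differ in names and defaults):

```python
import argparse

# Illustrative parser mirroring the CLI flags shown above.
parser = argparse.ArgumentParser(prog="calm_cli")
parser.add_argument("--type", default="mindfulness", help="meditation type")
parser.add_argument("--duration", type=int, default=5, help="length in minutes")
parser.add_argument("--focus", default=None, help="session focus, e.g. 'stress relief'")
parser.add_argument("--voice", default="nova", help="voice name or male/female alias")
parser.add_argument("--background", default=None, help="background track name")
parser.add_argument("--instructions", default=None, help="extra guidance for the script")

args = parser.parse_args([
    "--type", "mindfulness",
    "--duration", "5",
    "--focus", "stress relief",
])
```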

Background music

  • Place assets under cache/backgrounds/ named <name>.mp3 or <name>.wav.
  • Example: cache/backgrounds/nature_sounds.mp3 and cache/backgrounds/soothing_flute.mp3.
  • Set background_music to "nature_sounds" or "soothing_flute" in config.
  • The engine loops/trims the track to match the final duration and ducks it under speech.
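The loop/trim arithmetic is simple; this pure-Python sketch (helper name hypothetical — the real engine does the equivalent with pydub `AudioSegment`s) computes how many repeats of the track are needed before trimming:

```python
import math

def plan_background(track_ms: int, target_ms: int) -> tuple:
    """Return (repeats, trim_ms): repeat the track enough times to cover
    the target duration, then trim the concatenation to target_ms."""
    if track_ms <= 0:
        raise ValueError("track must be non-empty")
    repeats = math.ceil(target_ms / track_ms)
    return repeats, target_ms
```

For example, a 90-second track for a 5-minute mix yields 4 repeats trimmed to 300 000 ms.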

LLM and structured outputs

  • Provider: OpenAI client with Structured Outputs
  • Schema is defined in code and converted to calm/core/types.py models
  • Prompts are centralized in calm/core/instructions.py for easy tuning
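A sketch of what the schema and parse step might look like (field names are illustrative, not the exact `calm/core/types.py` models; the parse call uses the OpenAI SDK's Structured Outputs helper):

```python
from typing import List, Literal, Optional
from pydantic import BaseModel

# Illustrative schema; the real models live in calm/core/types.py.
class Segment(BaseModel):
    kind: Literal["speech", "pause"]
    text: Optional[str] = None              # set for speech segments
    target_seconds: Optional[float] = None  # set for pause segments

class MeditationScript(BaseModel):
    title: str
    total_target_seconds: float
    segments: List[Segment]

def generate_script(client, system_prompt: str, user_prompt: str) -> MeditationScript:
    """Parse a schema-validated script with OpenAI Structured Outputs."""
    completion = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        response_format=MeditationScript,
    )
    return completion.choices[0].message.parsed
```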

Prompt engineering

We use a structured prompt paradigm to make the model's behavior predictable and the output schema-stable.

Template:

<OBJECTIVE_AND_PERSONA>
You are a world-class meditation teacher and scriptwriter. Your task is to produce a calm, inclusive, secular meditation script tailored to the provided type, focus, and duration.
</OBJECTIVE_AND_PERSONA>

<INSTRUCTIONS>
To complete the task, follow these steps:
1. Plan intro (10–15%), body (70–80%), outro (10–15%).
2. Write short 'speech' paragraphs in a gentle, invitational style.
3. After every 2–4 sentences, add a 'pause' with target_seconds.
4. Keep total_target_seconds within ±15% by adjusting pauses.
5. Reflect the requested type and focus with simple, supportive imagery.
6. Conclude by gently transitioning back.
</INSTRUCTIONS>

<CONSTRAINTS>
Dos and don'ts:
1. Do: be gentle, inclusive, secular; prefer "you may", "if you like".
2. Don't: bracketed directions, sound-effect words, medical claims, timers, or background-audio references in speech.
</CONSTRAINTS>

<RECAP>
Gentle tone, three-part structure, alternate speech/pause, adjust pauses to meet time, strict schema.
</RECAP>

At runtime we embed context from the user request inside <CONTEXT> via build_user_instruction, and enforce schema with Pydantic Structured Outputs.
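A minimal sketch of that embedding step (the function name matches `build_user_instruction` from the source, but the body and field labels here are illustrative):

```python
def build_user_instruction(med_type: str, focus: str, duration_min: int,
                           extra: str = "") -> str:
    """Wrap request-specific details in a <CONTEXT> block that is appended
    to the stable prompt sections."""
    lines = [
        f"Type: {med_type}",
        f"Focus: {focus}",
        f"Target duration: {duration_min} minutes",
    ]
    if extra:
        lines.append(f"Additional instructions: {extra}")
    return "<CONTEXT>\n" + "\n".join(lines) + "\n</CONTEXT>"
```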

TTS

  • Provider: OpenAI tts-1/tts-1-hd with streaming to file
  • Voices supported: alloy, ash, fable, nova, shimmer, echo, onyx, sage, coral
  • voice mapping supports male/female as conveniences (ash/nova)
  • speaking_rate maps to speed (0.25–4.0; default 0.8)
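The male/female convenience mapping can be sketched as follows (helper name hypothetical):

```python
VOICES = {"alloy", "ash", "fable", "nova", "shimmer",
          "echo", "onyx", "sage", "coral"}

def resolve_voice(name: str) -> str:
    """Map convenience aliases to OpenAI voice names; pass real voices through."""
    aliases = {"male": "ash", "female": "nova"}
    voice = aliases.get(name.lower(), name.lower())
    if voice not in VOICES:
        raise ValueError(f"unknown voice: {name}")
    return voice
```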

Caching

  • LLM script cache: cache/llm/ keyed by config and model
  • TTS segment cache: cache/tts/ keyed by text+voice+speed
  • Pauses cached in cache/pauses/
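Content-addressed keys can be derived by hashing exactly the inputs that determine the output; a sketch (hypothetical helper, not necessarily the repo's key format):

```python
import hashlib
import json

def tts_cache_key(text: str, voice: str, speed: float) -> str:
    """Deterministic key for a TTS segment: identical inputs yield the
    same key, so the cached file can be reused."""
    payload = json.dumps({"text": text, "voice": voice, "speed": speed},
                         sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```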

Running examples

python -m examples.generate_samples

Outputs are saved in output/ with filenames like meditation_<type>_<duration>m.mp3.

Tests

pytest -q

Design decisions and rationale

Approach to prompt engineering

  • We use a structured prompt with explicit sections (OBJECTIVE_AND_PERSONA, INSTRUCTIONS, CONSTRAINTS, RECAP) to constrain style and structure while keeping outputs consistent.
  • Request-specific details are injected via <CONTEXT> from build_user_instruction, separating stable guidance from dynamic inputs.
  • Responses are parsed with OpenAI Structured Outputs into Pydantic models, ensuring schema correctness and reducing retries.
  • The design is inspired by best practices such as hierarchical prompting and explicit constraints.

How we ensure timing accuracy

  • The prompt budgets time: three-part structure with a total duration target within ±15% and explicit pause segments carrying target_seconds.
  • During synthesis, actual speech durations are measured from rendered audio files.
  • In the final mix, we adjust only pause lengths to close any remaining gap when off by more than ~3 seconds, minimizing audible artifacts.
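The gap-closing step can be sketched as proportional scaling of the pause lengths (the threshold and floor values here are illustrative):

```python
def adjust_pauses(pauses, gap_seconds, threshold=3.0, floor=0.5):
    """Distribute a timing gap across pauses proportionally to their length.
    gap_seconds > 0 means the mix is short and pauses should grow; gaps
    within the threshold are left alone to avoid audible artifacts."""
    total = sum(pauses)
    if abs(gap_seconds) <= threshold or total == 0:
        return list(pauses)
    return [max(floor, p + gap_seconds * p / total) for p in pauses]
```

For example, a mix 4 s short with pauses of 5, 5, and 10 s grows them to 6, 6, and 12 s, while a 2 s gap is left untouched.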

Audio processing strategy

  • Segment-first pipeline: synthesize each speech segment, generate precise pause silences, then stitch in order.
  • Optional background: loop/trim the chosen track to duration; apply simple ducking (stronger under speech, lighter during pauses).
  • Export via pydub/ffmpeg with configurable format/bitrate (default MP3 192k).
  • Content-addressed caching for LLM scripts, TTS segments, and generated pauses reduces cost and speeds up repeats.
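The stitch order determines each segment's offset in the final mix; a sketch of that bookkeeping (pure duration math — the real engine concatenates pydub `AudioSegment`s):

```python
def timeline(segments):
    """Given (kind, seconds) pairs in stitch order, return
    (kind, start_s, end_s) triples plus the total duration."""
    out, cursor = [], 0.0
    for kind, secs in segments:
        out.append((kind, cursor, cursor + secs))
        cursor += secs
    return out, cursor
```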

Trade-offs considered

  • Simple ducking: the background track's volume is lowered by a fixed gain under speech rather than processed with sidechain compression, keeping the pipeline easy to maintain.
  • Duration matching adjusts silent pauses instead of re-synthesizing speech, which costs less and keeps the narration consistent.
  • Reliability vs. creativity: strict rules for script structure make outputs predictable, while prompt instructions still leave room to dial creativity up or down.
  • Provider flexibility: OpenAI backs the default LLM and TTS adapters, but the provider layer is designed so other services can be swapped in.

Planned Features and Roadmap

The following major enhancements are planned for future releases. These are not yet implemented, but are priorities for upcoming development:

  • Streaming audio output: Enable real-time, chunked audio generation and playback, allowing users to listen to meditations as they are being created, instead of waiting for the full file to finish processing.
  • Async FastAPI server for text-to-text and TTS: Provide an HTTP API with endpoints for both script generation and audio synthesis, implemented using FastAPI with asynchronous processing for scalability and low-latency user experience.
  • Streamlit-based UI for rapid prototyping: Develop an interactive web interface for testing meditation types, voices, and background options—ideal for demoing and manual QA without using the command line.
  • Advanced caching strategy: Improve caching mechanisms for LLM outputs, TTS segments, and audio mixes to further reduce latency, avoid repeated computation, and support cache expiration/invalidation for dynamic changes.
  • Smooth ducking with fades: Implement ducking so that background music volume transitions smoothly using fade-ins and fade-outs at the start and end of silence, preventing abrupt changes in volume.
