A privacy-focused audio transcription toolkit that converts video recordings into searchable, summarized notes in Obsidian. Runs entirely locally with optional cloud summarization.
```
Video Recording → Audio Extraction → Transcription → AI Summary → Obsidian Note
     (.mov)            (.mp3)            (.md)      (Claude/Ollama) (Person note updated)
```
Key Features:
- Local transcription - Uses faster-whisper (large-v3 model) for high-quality, offline speech-to-text
- Multi-track support - Handles separate mic/system audio tracks from screen recordings
- Speaker identification - Merges tracks into a speaker-tagged conversation
- AI summarization - Generates structured summaries using Claude API or local Ollama
- Obsidian integration - Automatically updates person notes with dated summaries
- Hungarian language - Optimized for Hungarian transcription and summaries
- Fully airgapped option - Can run 100% offline with Ollama
```bash
# Process all new recordings interactively
./process_calls.js
```

Or step by step:

```bash
./1_video2audio.sh recording.mov     # Extract audio tracks
node 2_audio2text.js recording       # Transcribe to text
node 3_text2notes.js recording.md    # Summarize and add to Obsidian
```

Requirements:

- macOS (tested on Apple Silicon)
- Node.js 18+
- Python 3
- ffmpeg (`brew install ffmpeg`)
Installation:

1. Clone and install dependencies:

   ```bash
   git clone <repo-url>
   cd transcribe
   npm install
   pip3 install faster-whisper
   ```

2. Configure environment:

   ```bash
   cp .env.example .env
   ```

   Edit `.env` with your settings (see Configuration below).

3. First run: the Whisper model (~3GB) downloads automatically on the first transcription.
Edit `.env` to configure the toolkit:
Choose between cloud or local summarization:
```bash
# Option 1: Claude API (default, requires internet)
LLM_BACKEND=claude
ANTHROPIC_API_KEY=sk-ant-...

# Option 2: Ollama (local, airgapped)
LLM_BACKEND=ollama
OLLAMA_HOST=http://localhost:11434
OLLAMA_MODEL=qwen2.5:32b
```

Ollama setup (for airgapped operation):
```bash
brew install ollama
ollama serve                 # Start the server (runs in foreground)

# In a new terminal:
ollama pull qwen2.5:32b      # Downloads ~20GB, requires ~20GB RAM to run
```

For better quality with more RAM: `ollama pull qwen2.5:72b` (~40GB+ RAM).
Note: `ollama serve` must be running whenever you use the Ollama backend. Alternatively, install the Ollama app from ollama.ai, which runs automatically in the menu bar.
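Before summarizing a long transcript, it can help to confirm that the server is up and the configured model is actually pulled. Below is a minimal sketch using Ollama's `GET /api/tags` endpoint (the same one the troubleshooting section curls); `checkOllama` is an illustrative helper, not part of the toolkit:

```javascript
// Sketch: verify the configured Ollama server is reachable and the model is pulled.
// checkOllama is a hypothetical helper; env var names match .env above.
async function checkOllama(
  host = process.env.OLLAMA_HOST ?? 'http://localhost:11434',
  model = process.env.OLLAMA_MODEL ?? 'qwen2.5:32b'
) {
  try {
    const res = await fetch(`${host}/api/tags`); // global fetch requires Node 18+
    if (!res.ok) return { ok: false, reason: `HTTP ${res.status}` };
    const { models = [] } = await res.json();
    return models.some(m => m.name === model)
      ? { ok: true }
      : { ok: false, reason: `model ${model} not pulled` };
  } catch {
    return { ok: false, reason: 'server not reachable (is `ollama serve` running?)' };
  }
}
```

A pipeline script could call this once at startup and print the `reason` instead of failing mid-run.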
```bash
OBSIDIAN_VAULT_PATH=/path/to/your/vault/Call - meet notes
TRANSCRIPTS_FOLDER=Transcripts
```

Edit `prompts.yaml` to customize summarization scenarios:
- `leadership_mentoring` - Engineering leadership discussions
- `first_meeting` - Initial introductory calls
- `technical_discussion` - Architecture and tech decisions
- `career_planning` - Career development sessions
- `general` - Catch-all for other conversations
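Each scenario pairs a key with the prompt text used for that conversation type. The exact schema of `prompts.yaml` isn't shown here, so the entry below is only a hypothetical illustration of the idea:

```yaml
# Hypothetical entry; field names are illustrative - check prompts.yaml for the real schema.
technical_discussion:
  description: Architecture and tech decisions
  prompt: |
    Summarize the conversation in Hungarian. Focus on the
    architectural options discussed, the decision reached,
    and any open follow-up questions.
```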
```bash
./process_calls.js
```

Scans `recordings/` for `.mov` files and processes them interactively:
- Asks which files to process
- Runs the full pipeline
- Moves completed files to Trash
- Shows summary at the end
Extract audio from video:

```bash
./1_video2audio.sh recording.mov
# Creates: recording-mic.mp3, recording-mac.mp3
```

Transcribe audio:

```bash
node 2_audio2text.js recording
# Creates: recording.json, recording.md (merged transcript)
```

Summarize and add to Obsidian:

```bash
node 3_text2notes.js recordings/recording.md
# Interactive: select person, scenario, descriptor
# Updates person note and moves transcript to vault
```

All scripts support:
- `-v, --verbose` - Show detailed output
- `-f, --force` - Overwrite existing files without prompting
- `-h, --help` - Show help
`1_video2audio.sh` extracts audio tracks from video files using ffmpeg:

- First track → `-mic.mp3` (usually your microphone)
- Second track → `-mac.mp3` (system audio / other participant)
- Additional tracks → `-mac-2.mp3`, `-mac-3.mp3`, etc.
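Because this track-to-filename mapping is deterministic, downstream scripts can predict output names from the input file alone. A sketch of that rule (a hypothetical helper, not the script's actual code):

```javascript
// Sketch of the track → filename rule described above (hypothetical helper).
function audioName(videoFile, trackIndex) {
  const base = videoFile.replace(/\.mov$/i, ''); // strip the .mov extension
  if (trackIndex === 0) return `${base}-mic.mp3`;    // first track: microphone
  if (trackIndex === 1) return `${base}-mac.mp3`;    // second track: system audio
  return `${base}-mac-${trackIndex}.mp3`;           // extras: -mac-2, -mac-3, ...
}
```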
`2_audio2text.js` transcribes using faster-whisper with the large-v3 model:
- Runs locally on CPU (no GPU required, though slower than GPU)
- Processing takes roughly 2-3x the audio's duration on Apple Silicon (1 hour of audio ≈ 2-3 hours)
- Anti-hallucination parameters for reliable output
- Prompts for speaker names to label the conversation
`3_text2notes.js` summarizes transcripts and integrates with Obsidian:
- Generates context-aware summaries based on conversation type
- Creates/updates person notes with dated sections
- Links to full transcript for reference
- All output in Hungarian
The toolkit expects MOV files with two separate audio tracks for speaker separation:
- Track 1: Your microphone (you)
- Track 2: System/desktop audio (other participants on the call)
This simple setup provides effective speaker identification without complex diarization - each track becomes a separate speaker in the transcript.
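The per-track approach means no diarization model is needed: tag every segment of a track with that track's speaker, then interleave the segments by start time. A sketch of the idea, assuming segments of the form `{ start, text }` (the toolkit's real segment format may differ):

```javascript
// Sketch: merge two tracks' segments into one speaker-tagged transcript.
// The { start, text } segment shape is an assumption for illustration.
function mergeTracks(micSegments, macSegments, names = { mic: 'Me', mac: 'Them' }) {
  const tagged = [
    ...micSegments.map(s => ({ ...s, speaker: names.mic })),
    ...macSegments.map(s => ({ ...s, speaker: names.mac })),
  ];
  tagged.sort((a, b) => a.start - b.start); // interleave by start time
  return tagged.map(s => `${s.speaker}: ${s.text}`).join('\n');
}
```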
OBS Studio is recommended for recording. An example profile is provided in `examples/obs-profile-basic.ini`.
Quick setup:
1. Install the profile:
   - Copy `examples/obs-profile-basic.ini` to your OBS profiles folder:
     - macOS: `~/Library/Application Support/obs-studio/basic/profiles/CallRecording/basic.ini`
   - Edit the file and update `FilePath` to point to your `recordings/` folder
2. Configure audio routing in OBS:
   - Go to Settings > Audio
   - Set "Mic/Auxiliary Audio" to your microphone
   - Set "Desktop Audio" to capture system sound
3. Assign tracks to sources:
   - Edit > Advanced Audio Properties
   - Your mic: Enable Track 1 only
   - Desktop Audio: Enable Track 2 only
4. Add a video source (required by OBS, but we only use the audio):
   - Add a Display Capture or Window Capture source
   - The example profile uses low video settings (720p, 30fps, 2Mbps) to minimize file size
The example profile is optimized for this audio-focused workflow:
| Setting | Value | Why |
|---|---|---|
| Resolution | 720p | Minimum needed; video is incidental |
| Frame rate | 30 fps | Sufficient for screen content |
| Video bitrate | 2 Mbps | Low since video isn't used |
| Audio bitrate | 160 kbps | Good quality for voice |
| Format | MOV (hybrid) | Supports multiple audio tracks |
| Audio tracks | 1 + 2 | Mic and system audio separated |
A 1-hour call produces roughly 1-1.5 GB with these settings (mostly video). The extracted audio is ~15 MB per track.
```
transcribe/
├── recordings/               # Input/output directory (gitignored)
│   ├── *.mov                 # Input video files
│   ├── *-mic.mp3             # Extracted mic audio
│   ├── *-mac.mp3             # Extracted system audio
│   ├── *.json                # Transcription data
│   └── *.md                  # Merged transcript
├── examples/
│   └── obs-profile-basic.ini # OBS Studio profile for recording
├── 1_video2audio.sh          # Audio extraction script
├── 2_audio2text.js           # Transcription script
├── 2_transcribe.py           # Python transcription engine
├── 3_text2notes.js           # Summarization script
├── process_calls.js          # Pipeline orchestrator
├── prompts.yaml              # Summarization prompts
├── .env.example              # Environment template
└── .env                      # Your configuration (gitignored)
```
On Apple Silicon M-series with large-v3 model:
- Transcription: roughly 2-3x the audio's duration (1 hour of audio takes 2-3 hours)
- Summarization: ~10-30 seconds with Claude, longer with local Ollama
For faster transcription, you can switch to the medium or small model in `2_audio2text.js`, at some cost in accuracy.
"Cannot connect to Ollama"

- Start the server: `ollama serve`
- Check it's running: `curl http://localhost:11434/api/tags`

"Model not found"

- Pull the model first: `ollama pull qwen2.5:32b`

Transcription is slow

- This is expected with large-v3 on CPU
- Consider using a smaller model for drafts

ffmpeg not found

- Install with: `brew install ffmpeg`
MIT