Turns raw audio/video into polished, magazine-quality interview transcripts with title/thumbnail suggestions and chapter timestamps — all in one command.
- ElevenLabs Scribe v2 transcribes the audio with speaker diarization
- Gemini 3 Pro listens to the original audio alongside the raw transcript and cleans it up — removing filler words, backchannel noise, and false starts while preserving what the speakers actually said
- Claude 4.6 Opus reads the finished transcript and generates YouTube title/thumbnail combos and chapter timestamps
You only need to do this once on your machine.
1. Install Python (if you don't already have it)
Open Terminal (search for "Terminal" in Spotlight on Mac) and type:
python3 --version
If you see a version number (like Python 3.11.5), you're good — skip to step 2. If not, install it:
- Mac: Go to https://www.python.org/downloads/ and download the latest version. Run the installer.
- After installing, close and reopen Terminal, then run python3 --version again.
2. Download or clone this project
If someone sent you the folder, just put it somewhere you can find it (like your Documents folder). In Terminal, navigate to it:
cd ~/Documents/transcripts
(Replace the path with wherever you put the folder.)
3. Set up the Python environment
Run these commands one at a time in Terminal:
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
4. Set up API keys
You need three API keys. Ask Dwarkesh if you don't have them.
- ElevenLabs API key — from https://elevenlabs.io
- Gemini API key — from https://aistudio.google.com/apikey
- Anthropic API key — from https://console.anthropic.com/settings/keys
Add them to your shell config so they're always available. Run this in Terminal (paste the whole block, replacing the placeholder values with your actual keys):
echo 'export ELEVENLABS_API_KEY="your-elevenlabs-key-here"' >> ~/.zshrc
echo 'export GEMINI_API_KEY="your-gemini-key-here"' >> ~/.zshrc
echo 'export ANTHROPIC_API_KEY="your-anthropic-key-here"' >> ~/.zshrc
source ~/.zshrc
To verify they're set:
echo $ELEVENLABS_API_KEY
echo $GEMINI_API_KEY
echo $ANTHROPIC_API_KEY
All three should print your keys.
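If you'd rather check from Python, here is a minimal sketch (a hypothetical helper, not part of transcribe.py) that reports which of the three keys are missing from the environment:

```python
import os

# The three keys the pipeline expects in the environment.
REQUIRED_KEYS = ["ELEVENLABS_API_KEY", "GEMINI_API_KEY", "ANTHROPIC_API_KEY"]

def missing_keys(env=os.environ):
    """Return the names of any required API keys that are unset or empty."""
    return [name for name in REQUIRED_KEYS if not env.get(name)]

if __name__ == "__main__":
    missing = missing_keys()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All three API keys are set.")
```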
Every time you open a new Terminal window, activate the environment first:
cd ~/Documents/transcripts
source venv/bin/activate
Then run:
python transcribe.py your-audio-file.mp3
This will:
- Upload the audio to ElevenLabs for transcription (a few minutes, depending on length)
- Upload the audio to Gemini and enhance each chunk (~5+ minutes for a long episode)
- Generate title/thumbnail suggestions and chapter timestamps with Claude (~1 minute)
- Save everything in a project folder
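The steps above can be sketched as a three-stage pipeline. This is a hypothetical outline, not the actual code in transcribe.py; the stage functions are injected as parameters here so the sketch stays self-contained, while the real script calls the ElevenLabs, Gemini, and Anthropic APIs directly:

```python
def run_pipeline(audio_path, transcribe, enhance, postproduce,
                 raw_only=False, do_postprod=True):
    """Orchestrate the three stages of the pipeline (illustrative sketch).

    transcribe(path)        -> diarized raw transcript   (ElevenLabs stage)
    enhance(path, raw)      -> cleaned transcript        (Gemini stage)
    postproduce(transcript) -> titles + chapters         (Claude stage)
    """
    raw = transcribe(audio_path)
    if raw_only:                       # --raw: stop after transcription
        return {"raw": raw}
    cleaned = enhance(audio_path, raw)
    result = {"raw": raw, "transcript": cleaned}
    if do_postprod:                    # --no-postprod skips this stage
        result["postprod"] = postproduce(cleaned)
    return result
```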
For an input file called episode.mp3, everything goes into projects/episode/:
projects/
episode/
transcript.md # The cleaned, polished transcript
raw.md # The raw transcript before cleanup (for comparison)
postprod.md # Title/thumbnail suggestions + chapter timestamps
.cache/ # Cached API results (hidden)
Each episode gets its own folder. The projects/ directory is gitignored.
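The mapping from input file to output paths is simple: the project folder is the input filename minus its extension. A small illustrative helper (hypothetical, shown only to make the layout concrete):

```python
from pathlib import Path

def project_dir(input_file, root="projects"):
    """Map an input file to its project folder: episode.mp3 -> projects/episode."""
    return Path(root) / Path(input_file).stem

def output_paths(input_file):
    """The files the pipeline writes for one episode."""
    base = project_dir(input_file)
    return {
        "transcript": base / "transcript.md",  # cleaned, polished transcript
        "raw": base / "raw.md",                # raw transcript for comparison
        "postprod": base / "postprod.md",      # titles, thumbnails, chapters
        "cache": base / ".cache",              # cached API results
    }
```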
Specify number of speakers (helps with diarization accuracy):
python transcribe.py episode.mp3 --speakers 2
Just get the raw transcript (skip cleanup and post-production — much faster):
python transcribe.py episode.mp3 --raw
Skip post-production (just transcript, no titles/timestamps):
python transcribe.py episode.mp3 --no-postprod
Save both raw and cleaned versions:
python transcribe.py episode.mp3 --save-raw
Force a fresh run (ignore all cached results):
python transcribe.py episode.mp3 --no-cache
Results are cached automatically inside each project's .cache/ folder. If you run the same file again, the pipeline skips any steps that already completed — no repeated API calls. If something fails partway through, just re-run the same command and it picks up where it left off. Use --no-cache to force a completely fresh run.
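The resume behavior boils down to a check-before-compute pattern. A minimal sketch of how per-step caching like this typically works (the real pipeline's cache format and keys may differ):

```python
import json
from pathlib import Path

def cached_step(cache_dir, name, compute, force=False):
    """Run compute() unless a cached result for this step already exists.

    force=True mirrors --no-cache: recompute and overwrite the cache.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{name}.json"
    if path.exists() and not force:      # step already completed: reuse it
        return json.loads(path.read_text())
    result = compute()                   # do the work (e.g. an API call)
    path.write_text(json.dumps(result))  # persist so a re-run can skip it
    return result
```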
Audio: .mp3, .wav, .m4a, .flac, .ogg
Video: .mp4, .mov, .avi, .mkv, .webm (audio is automatically extracted)
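Extension-based detection plus an ffmpeg extraction step is one common way to implement this. The sketch below assumes ffmpeg is installed; the actual pipeline may use a different tool or flags:

```python
import subprocess
from pathlib import Path

AUDIO_EXTS = {".mp3", ".wav", ".m4a", ".flac", ".ogg"}
VIDEO_EXTS = {".mp4", ".mov", ".avi", ".mkv", ".webm"}

def is_video(path):
    """True if the file looks like a video based on its extension."""
    return Path(path).suffix.lower() in VIDEO_EXTS

def extract_audio(video_path, out_path):
    """Pull the audio track out of a video with ffmpeg.

    -vn drops the video stream; libmp3lame encodes the audio as MP3.
    (Assumes ffmpeg is on PATH; illustrative, not the pipeline's exact command.)
    """
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(video_path),
         "-vn", "-acodec", "libmp3lame", str(out_path)],
        check=True,
    )
```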
The pipeline splits the raw transcript into chunks of ~4000 tokens each and sends each chunk to Gemini along with the full audio file. Gemini listens to the audio to correct transcription errors and uses the editorial prompt to clean up the text — removing filler words, deleting backchannel-only turns ("Mm-hmm", "Yeah"), merging interrupted thoughts, adding paragraph breaks, and smoothing grammar. The goal is a transcript that reads like a written magazine interview while faithfully preserving what the speakers actually said.
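The chunking step amounts to greedily packing transcript segments (e.g. speaker turns) until the token budget is reached. A simplified sketch, with token counting approximated by whitespace words rather than the real tokenizer:

```python
def split_into_chunks(segments, max_tokens=4000,
                      count_tokens=lambda s: len(s.split())):
    """Greedily pack segments into chunks of roughly max_tokens each.

    count_tokens is a stand-in for a real tokenizer; here it just counts
    whitespace-separated words.
    """
    chunks, current, current_tokens = [], [], 0
    for seg in segments:
        n = count_tokens(seg)
        if current and current_tokens + n > max_tokens:
            chunks.append(current)           # budget exceeded: start a new chunk
            current, current_tokens = [], 0
        current.append(seg)
        current_tokens += n
    if current:
        chunks.append(current)
    return chunks
```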
After the transcript is finalized, it's sent to Claude 4.6 Opus for post-production: generating YouTube title/thumbnail combinations (5 titles with 3 thumbnail text ideas each) and chapter timestamps (spaced 8-15 minutes apart).
Each step is cached as it completes, so if the process is interrupted you can resume without re-doing work.