End-to-end pipeline for turning podcast episodes into readable, speaker-attributed transcripts.
This README is the canonical documentation for the repo.
- Ingests episode metadata from RSS-exported CSV.
- Builds/updates a global episode manifest.
- Downloads audio and runs ASR transcription.
- Runs speaker diarization and word-to-speaker alignment.
- Produces deterministic cleanup output (`clean_python`).
- Produces LLM-assisted label-correction output (`clean_llm`).
- Runs a separate web formatting stage for the final Markdown/pages.
- Linux/macOS shell
- Conda
- `ffmpeg` available on the system
- Optional but recommended: NVIDIA GPU + CUDA
- Hugging Face account/token (for diarization model)
- OpenAI API key (for LLM cleanup pass)
- Minimum required: Python `3.11` (see `pyproject.toml`: `requires-python = ">=3.11"`).
- Tested with: Python `3.11` (Conda env `pds_env`).
- Expected to work with: Python `3.11`+.
- Python `3.9`/`3.10` are not supported; `pdscript.cli` also enforces this at runtime and exits early on Python < 3.11.
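The runtime guard amounts to an interpreter-version check at startup; a minimal sketch is below (the actual check in `pdscript/cli.py` may differ in wording and exit behavior):

```python
import sys

MIN_VERSION = (3, 11)

def ensure_supported_python(version_info=None):
    """Exit early when the interpreter is older than MIN_VERSION."""
    version = (version_info or sys.version_info)[:2]
    if version < MIN_VERSION:
        sys.exit(
            f"pdscript requires Python {MIN_VERSION[0]}.{MIN_VERSION[1]}+; "
            f"found {version[0]}.{version[1]}."
        )
```

Calling this at the top of the CLI entrypoint makes the failure mode a clear message rather than a later `SyntaxError` or import failure.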
```bash
# cd to repo root
conda create -n pds_env python=3.11 -y
conda run -n pds_env python -m pip install -U pip
conda run -n pds_env python -m pip install -r transcription/requirements.txt

# Optional packaging install (if you want a shell command entrypoint):
# conda run -n pds_env python -m pip install -e .

# Additional runtime deps used by diarization + LLM cleanup scripts
conda run -n pds_env python -m pip install \
  openai \
  huggingface_hub \
  pyannote.audio \
  torch \
  torchaudio \
  soundfile \
  numpy
```

Podcast/site-specific values are config-driven.
- Active config: `transcription/config/podcast.yaml`
- Starter template: `transcription/config/podcast.template.yaml`
Required: `podcast.rss_feed_url`

Value precedence (highest wins):
- CLI flag
- YAML config
- RSS-derived value (where available)
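A minimal `transcription/config/podcast.yaml` might look like the sketch below. Only `podcast.rss_feed_url` is documented as required here; everything else is an illustrative placeholder — check `podcast.template.yaml` for the real schema:

```yaml
podcast:
  rss_feed_url: "https://example.com/feed.xml"  # required
  # Additional podcast/site-specific keys live here; see
  # transcription/config/podcast.template.yaml for the full set.
```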
```bash
# cd to repo root

# Hugging Face (for pyannote model access)
conda run -n pds_env hf auth login

# OpenAI (for clean_llm pass)
export OPENAI_API_KEY='your_key_here'
```

To make the OpenAI key persist across shell sessions, add the export line to your shell rc file (for example `~/.bashrc`):
```bash
export OPENAI_API_KEY='your_key_here'
```

- `faster-whisper` model: typically `small` (configurable).
- Download happens automatically on first run and is cached under repo-local model/cache dirs.
- Hugging Face model: `pyannote/speaker-diarization-community-1` (requires an HF token + accepted access terms on Hugging Face).
- OpenAI model default in the script: `gpt-5-nano`, used for speaker-label correction (not full transcript rewriting).
```text
PodcastTranscriptor/
  episodes_source.csv           # recommended episode metadata source
  dev_log.md                    # concise running engineering log
  AGENTS.md                     # repo-level operating preferences
  README.md                     # canonical docs (this file)
  pyproject.toml                # package metadata
  pdscript/
    __init__.py
    cli.py                      # package CLI entrypoint (`python -m pdscript.cli`)
    config.py                   # YAML config loader/helpers
  transcription/
    config/
      podcast.yaml              # active podcast/site config
      podcast.template.yaml     # starter template for new podcasts
    manifests/
      pipeline_manifest.csv     # global pipeline state by episode
    scripts/
      build_manifest.py
      transcribe_batch.py
      speaker_batch.py
      clean_dialogue_batch.py
      render_transcripts.py
    artifacts/
      01_whisper_transcript/
        audio/                  # downloaded audio
        transcripts/            # whisper txt/json (+ partial files while running)
      02_diarization/
        md/                     # speaker-attributed markdown
        diarization/            # diarization json + rttm
        debug/                  # words.csv + segments.csv
      03_clean_python/
        md/                     # deterministic cleaned markdown
        json/                   # deterministic cleaned json
      04_clean_llm/
        json/                   # LLM-cleaned canonical json
        raw/                    # raw LLM outputs (json)
        meta/                   # per-episode clean/validation stats
      05_webformat/
        md/                     # website-facing markdown generated from 04_clean_llm/json
      old/                      # archived/legacy scratch outputs
    logs/                       # run logs (generated)
    tmp/                        # temp/cache (generated)
    models/                     # local model cache (generated)
```
Input CSV: `episodes_source.csv`. You can also pass any CSV path via `--episodes-csv`.

Key fields used downstream: `guid`, `title`, `pub_date_iso`, `link`, `audio_url`.
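As a sketch of how a stage might consume those fields (the field names come from the list above; the actual loader lives in the pipeline scripts):

```python
import csv
import io

# The episode fields the downstream stages rely on.
EPISODE_FIELDS = ("guid", "title", "pub_date_iso", "link", "audio_url")

def load_episodes(csv_file):
    """Yield one dict per episode, keeping only the downstream fields."""
    for row in csv.DictReader(csv_file):
        yield {field: row.get(field, "") for field in EPISODE_FIELDS}

# Example with an in-memory CSV standing in for episodes_source.csv:
sample = io.StringIO(
    "guid,title,pub_date_iso,link,audio_url\n"
    "ep-001,Pilot,2024-01-01,https://example.com/1,https://example.com/1.mp3\n"
)
episodes = list(load_episodes(sample))
```

Missing columns fall back to empty strings rather than raising, which keeps a partially filled CSV usable.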
Run all stages sequentially:

```bash
# cd to repo root
conda run -n pds_env python -m pdscript.cli --all
```

Run one stage at a time:

```bash
# cd to repo root
conda run -n pds_env python -m pdscript.cli manifest
conda run -n pds_env python -m pdscript.cli transcribe
conda run -n pds_env python -m pdscript.cli speaker
conda run -n pds_env python -m pdscript.cli clean-python
conda run -n pds_env python -m pdscript.cli clean-llm
conda run -n pds_env python -m pdscript.cli render
conda run -n pds_env python -m pdscript.cli status
```

Build the manifest:

```bash
# cd to repo root
conda run -n pds_env python -m pdscript.cli manifest
```

Or explicitly choose a source CSV:

```bash
# cd to repo root
conda run -n pds_env python -m pdscript.cli manifest \
  --episodes-csv /path/to/your_episodes_source.csv
```

Output: `transcription/manifests/pipeline_manifest.csv`
Purpose:
- Single state table that tracks each episode across transcription, diarization, and cleanup.
```bash
# cd to repo root
conda run -n pds_env python -m pdscript.cli transcribe \
  --model-size small \
  --device cuda \
  --compute-type int8_float16 \
  --download-retries 3 \
  --retry-delay-sec 3 \
  --episode-progress-step 5
```

Outputs:
- `transcription/artifacts/01_whisper_transcript/audio/<base>.mp3`
- `transcription/artifacts/01_whisper_transcript/transcripts/<base>.txt`
- `transcription/artifacts/01_whisper_transcript/transcripts/<base>.json`
- Live partials during the run: `*.partial.txt`, `*.partial.json`
Behavior:
- Episode-by-episode processing, resumable on rerun.
- Errors are recorded per episode without stopping the entire batch.
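The resume behavior amounts to a completed-output check before each episode; a simplified sketch follows (the real per-episode bookkeeping in the batch scripts is more involved):

```python
from pathlib import Path

def should_process(base: str, transcripts_dir: Path, redo: bool = False) -> bool:
    """Skip an episode when its final transcript JSON already exists.

    `--redo` forces reprocessing regardless of existing outputs.
    """
    final_json = transcripts_dir / f"{base}.json"
    return redo or not final_json.exists()
```

Because only the final (non-`.partial`) file counts as done, an interrupted run naturally resumes at the first unfinished episode.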
```bash
# cd to repo root
conda run -n pds_env python -m pdscript.cli speaker \
  --manifest transcription/manifests/pipeline_manifest.csv \
  --model-size small \
  --device cuda \
  --compute-type int8_float16 \
  --min-speakers 1 \
  --max-speakers 15 \
  --telemetry-interval-sec 30
```

Outputs:
- `transcription/artifacts/02_diarization/md/<base>.md`
- `transcription/artifacts/02_diarization/diarization/<base>.diarization.json`
- `transcription/artifacts/02_diarization/debug/<base>.words.csv`
- `transcription/artifacts/02_diarization/debug/<base>.segments.csv`
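A common way to do word-to-speaker alignment — and one plausible reading of how `words.csv` is produced, though not necessarily the pipeline's exact rule — is to assign each ASR word to the diarization segment covering the word's midpoint:

```python
def assign_speakers(words, segments):
    """words: [(text, start, end)]; segments: [(speaker, start, end)].

    Tag each word with the speaker whose segment contains the word midpoint;
    words with no overlapping segment fall back to "UNKNOWN".
    """
    labeled = []
    for text, start, end in words:
        mid = (start + end) / 2
        speaker = next(
            (spk for spk, s, e in segments if s <= mid < e),
            "UNKNOWN",
        )
        labeled.append((text, speaker))
    return labeled

words = [("hello", 0.0, 0.4), ("there", 0.5, 0.9), ("hi", 2.0, 2.3)]
segments = [("SPEAKER_00", 0.0, 1.0), ("SPEAKER_01", 1.9, 3.0)]
```

Using the midpoint rather than the word boundaries makes the assignment robust to small timing disagreements between the ASR and diarization models.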
```bash
# cd to repo root
conda run -n pds_env python -m pdscript.cli clean-python \
  --segments-dir transcription/artifacts/02_diarization/debug
```

Outputs:
- `transcription/artifacts/03_clean_python/md/<base>.clean.md`
- `transcription/artifacts/03_clean_python/json/<base>.clean.json`
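One typical deterministic cleanup rule — shown here as an illustration, not necessarily part of the exact rule set `clean-python` applies — is merging consecutive segments from the same speaker into a single turn:

```python
def merge_consecutive(segments):
    """segments: [(speaker, text)] -> merge runs of the same speaker."""
    merged = []
    for speaker, text in segments:
        if merged and merged[-1][0] == speaker:
            # Same speaker as the previous turn: append the text.
            merged[-1] = (speaker, merged[-1][1] + " " + text)
        else:
            merged.append((speaker, text))
    return merged

turns = [("A", "Hello."), ("A", "Welcome back."), ("B", "Thanks.")]
```

Rules like this are deterministic by construction, so rerunning the stage on the same input always yields identical output.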
```bash
# cd to repo root
conda run -n pds_env python -m pdscript.cli clean-llm \
  --segments-dir transcription/artifacts/02_diarization/debug \
  --llm-model gpt-5-nano \
  --llm-max-words-per-chunk 700 \
  --llm-overlap-words 100 \
  --llm-request-timeout-sec 180 \
  --llm-max-retries 4 \
  --llm-retry-backoff-sec 6
```

Outputs:
- `transcription/artifacts/04_clean_llm/json/<base>.clean.json`
- `transcription/artifacts/04_clean_llm/raw/<base>.llm_raw.json`
- `transcription/artifacts/04_clean_llm/meta/<base>.clean_meta.json`
- Live partials while running: `transcription/artifacts/04_clean_llm/json/*.clean.partial.json`, `transcription/artifacts/04_clean_llm/raw/*.llm_raw.partial.json`
Current LLM behavior:
- Works on chunked turns with context windows.
- Returns speaker-label corrections by line index.
- Pipeline applies label changes deterministically to original text/timestamps.
- Includes retry/backoff handling and chunk-level logging.
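Applying the corrections deterministically can be sketched as an index-based relabel that never touches text or timestamps (the correction shape and field names here are illustrative; see the clean-llm script for the actual format):

```python
def apply_label_corrections(turns, corrections):
    """turns: list of dicts with 'speaker', 'text', 'start'.
    corrections: {line_index: new_speaker_label}.

    Only the speaker field changes; text and timestamps pass through untouched.
    """
    fixed = []
    for i, turn in enumerate(turns):
        new = dict(turn)
        if i in corrections:
            new["speaker"] = corrections[i]
        fixed.append(new)
    return fixed

turns = [
    {"speaker": "SPEAKER_00", "text": "Welcome to the show.", "start": 0.0},
    {"speaker": "SPEAKER_00", "text": "Thanks for having me.", "start": 2.1},
]
fixed = apply_label_corrections(turns, {1: "SPEAKER_01"})
```

Keeping the LLM's output restricted to index/label pairs is what makes the apply step safe: a hallucinated rewrite of the transcript text simply has no channel to land in.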
```bash
# cd to repo root
conda run -n pds_env python -m pdscript.cli render \
  --clean-json-dir transcription/artifacts/04_clean_llm/json \
  --web-md-dir transcription/artifacts/05_webformat/md \
  --episodes-dir episodes
```

Outputs:
- `transcription/artifacts/05_webformat/md/<base>.clean.md`
- `episodes/<slug>.md` (Jekyll page with permalink `/episodes/<episode_number>/`)
Purpose:
- Stage 5 only produces canonical LLM outputs; Stage 6 owns all final markdown/page rendering.
- This decoupling means website format/theme changes do not require rerunning Stage 5.
```bash
# cd to repo root
conda run -n pds_env python -m pdscript.cli status
```

Tail the latest log:

```bash
# cd to repo root
LATEST_LOG=$(ls -1t transcription/logs/*.log | head -n 1)
tail -f "$LATEST_LOG"
```

List partial files from in-progress or interrupted runs:

```bash
# cd to repo root
find transcription/artifacts -type f \( -name '*.partial.txt' -o -name '*.partial.json' -o -name '*.partial.md' \) | sort
```

Log hygiene:
- Keep only the latest active `.log` in `transcription/logs/`.
- Move older logs to `transcription/logs/old/`.
- Default behavior is resumable and skips completed outputs.
- Use `--redo` only when intentionally reprocessing.
- Batch jobs continue past per-episode errors and log them.
- Target: GitHub Pages + Jekyll (`just-the-docs` theme) using the generated Markdown transcripts.
- For public repos, GitHub Pages hosting is free (subject to GitHub Pages usage limits).
- Use Jekyll with the `just-the-docs` theme (`_config.yml` + `Gemfile`).
- Keep generated transcript pages under `episodes/` (produced by `pdscript.cli render`).
- Use GitHub Actions for build/deploy (`.github/workflows/pages.yml`).
- In GitHub `Settings -> Pages`, set the source to `GitHub Actions`.
- Push to `main` to publish site updates.
This repo now includes:
- `_config.yml` (site config)
- `Gemfile` (Jekyll + just-the-docs gems)
- `.github/workflows/pages.yml` (build + deploy workflow)
- `index.md` and `episodes/index.md` (site entry pages)

To enable publishing:
- Push these files to `main`.
- In GitHub, open `Settings -> Pages`.
- Set `Source` to `GitHub Actions`.
- Wait for the `Deploy Jekyll site to Pages` workflow to finish.
Published URL pattern: `https://alik-git.github.io/TheoryOfAnythingTranscripts/`
To reduce unnecessary GitHub Actions usage, automatic Pages deploys run only when website files change:
- `.github/workflows/pages.yml`
- `_config.yml`
- `Gemfile`, `Gemfile.lock`
- `_includes/**`
- `assets/**`
- `index.md`
- `episodes/**`

If you push only transcription pipeline code (for example under `pdscript/` or `transcription/scripts/`), Pages will not redeploy automatically. You can still trigger a deploy manually from the Actions tab via `workflow_dispatch`.
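The trigger filter in `.github/workflows/pages.yml` presumably looks something like the `paths` block below; this is a sketch reconstructed from the list above, not a verbatim copy of the workflow:

```yaml
on:
  workflow_dispatch:        # manual deploys from the Actions tab
  push:
    branches: [main]
    paths:
      - ".github/workflows/pages.yml"
      - "_config.yml"
      - "Gemfile"
      - "Gemfile.lock"
      - "_includes/**"
      - "assets/**"
      - "index.md"
      - "episodes/**"
```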
```bash
# cd to repo root
bundle install
bundle exec jekyll serve
```

- Keep `dev_log.md` concise with timestamped `##` headings.
- Keep legacy artifacts under `transcription/artifacts/old/`.