MediaScribe

Language: English | Chinese

MediaScribe is a practical CLI for transcribing and summarizing audio, text, and video with reusable staged workflows.

Highlights

One tool for audio transcription, transcript summary, text summary, and video summary
Reusable staged workflow: transcript first, summary later; or run everything end to end
Local and cloud ASR support with provider-based architecture
Video pipeline supports subtitle-first summary, ASR fallback, and audio-only extraction
Output keeps source metadata such as original file paths and remote URLs
Summary and transcription logic are abstracted as standalone services for reuse in other Python scripts

Why MediaScribe

MediaScribe is designed around separation of concerns:

transcription is a standalone service
summary is a standalone service
video is an orchestration layer that prefers subtitles, then falls back to audio extraction + ASR + summary
provider adapters keep cloud and local implementations isolated and easy to reuse elsewhere

Command Names

Preferred command names

mediascribe
mediascribe-transcriber
mediascribe-text

Architecture Snapshot

mediascribe
  -> transcription_service / audio_summary_service / text_summary_service / video_summary_service
    -> scanner / subtitle_fetch_service / media_extract_service / video_input_service
      -> asr providers (local / azure / aliyun / iflytek)
      -> summary providers (litellm-backed model routing)
      -> ffmpeg / yt-dlp

Audio
  -> transcript
  -> optional summary

Text
  -> summary

Video
  -> subtitles first
  -> or extract/download audio
  -> transcript
  -> summary

Install

Recommended with uv

# Windows PowerShell
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"

# macOS / Linux
curl -LsSf https://astral.sh/uv/install.sh | sh

Create the environment

uv venv --python 3.11

For local summary generation, install Ollama, make sure it is running, and pull the default model:

ollama pull qwen2.5:3b
# If Ollama is not already running in the background:
ollama serve

Install by scenario

# Cloud ASR + summary
uv sync

# Add remote video support (yt-dlp)
uv sync --extra video

# Add local ASR
uv sync --extra local

# Local ASR + remote video
uv sync --extra local --extra video

# Development
uv sync --extra local --extra dev

Minimal `.env`

# Local ASR (usually needed for local diarization)
HF_TOKEN=hf_xxx

# Local summary model via Ollama
# These are also the built-in defaults.
MEDIASCRIBE_LLM_MODEL=ollama/qwen2.5:3b
MEDIASCRIBE_LLM_API_BASE=http://localhost:11434

# Azure ASR
AZURE_SPEECH_KEY=xxx
AZURE_SPEECH_REGION=westus2

# Optional cloud summary keys
# ANTHROPIC_API_KEY=sk-xxx
# GEMINI_API_KEY=xxx
# DEEPSEEK_API_KEY=xxx

Remote video auth examples

YTDLP_COOKIES_FILE=.\cookies\global.txt
YTDLP_COOKIES_FROM_BROWSER=chrome:Profile 12
YTDLP_SITE_COOKIE_MAP=bilibili.com=.\cookies\bilibili_profile12.txt

See .env.example for a fuller template

Local Summary Configuration

MediaScribe now defaults to a local summary model:

model: ollama/qwen2.5:3b
API base: http://localhost:11434
status: manually verified end to end through the current MediaScribe summary flow

You can override that in either place:

MEDIASCRIBE_LLM_MODEL=ollama/llama3.2:3b
MEDIASCRIBE_LLM_API_BASE=http://localhost:11434

uv run mediascribe-text .\notes --llm-model ollama/qwen2.5:3b --llm-api-base http://localhost:11434

If you want a cloud model instead, pass --llm-model and provide the matching API key in .env.

Manual integration check:

uv run python scripts/manual_check_ollama_summary.py

Quick Start

Audio: transcript + summary

uv run mediascribe ".\meeting.wav" --asr azure

Audio: transcript only

uv run mediascribe ".\meeting.wav" --asr azure --no-summary

Audio directory

uv run mediascribe .\audios --asr azure -o .\output

Existing transcripts: summary only

uv run mediascribe .\output --summary-only

Text or notes directory

uv run mediascribe-text .\notes

Text summary with explicit local model override

uv run mediascribe-text .\notes --llm-model ollama/qwen2.5:3b --llm-api-base http://localhost:11434

Local video summary

uv run mediascribe video ".\lesson.mp4" --asr azure

Remote video summary

uv run mediascribe video "https://www.youtube.com/watch?v=aircAruvnKk"

Typical Workflows

1. One-shot audio flow

uv run mediascribe ".\meeting.wav" --asr azure

2. Transcript first, summary later

uv run mediascribe ".\meeting.wav" --asr azure --no-summary -o .\output
uv run mediascribe .\output --summary-only

3. Summarize raw text from another script

from pathlib import Path

from mediascribe.text_summary_service import summarize_raw_text_to_file

summary_path = summarize_raw_text_to_file(
    "Long text to summarize",
    output_dir=Path("manual_output"),
    source_name="manual-note",
    llm_model="ollama/qwen2.5:3b",
    llm_api_base="http://localhost:11434",
)

print(summary_path)

4. Video summary with subtitle-first strategy

uv run mediascribe video ".\lesson.mp4" --asr azure

Default video strategy

try subtitles first
if no usable subtitles, extract or download audio
run ASR if needed
generate summary

5. Video -> audio only -> audio summary later

uv run mediascribe video ".\lesson.mp4" --extract-audio-only -o .\output
uv run mediascribe ".\output\media\lesson.wav" --asr azure

6. Speaker naming

uv run mediascribe ".\meeting.wav" --speaker-name Alice --speaker-name Bob

Video ASR path

uv run mediascribe video ".\lesson.mp4" --force-asr --asr azure --speaker-name Alice --speaker-name Bob

Hardware and Cost Notes

Local ASR (`--asr local`)

Uses significantly more local CPU / GPU / RAM
Better when you prefer local processing and can accept heavier hardware usage
For long media or weaker machines, cloud ASR is often the more practical path

Cloud ASR (`--asr azure` / `aliyun` / `iflytek`)

Reduces local hardware usage
Usually handles long recordings more comfortably
May incur cloud ASR cost

Summary generation

Default path uses a local Ollama model: ollama/qwen2.5:3b
Default local endpoint is http://localhost:11434
Cloud models are still supported through --llm-model plus the matching API key
If you only want transcripts, use --no-summary

Video Notes and Auth

Supported video inputs

local video files
remote URLs such as YouTube, Bilibili, and other yt-dlp-supported sources

Authentication order

When cookies are configured, MediaScribe tries

site-specific cookie file
global cookie file
browser profile cookies
unauthenticated request

Inspect the auth resolution path with

uv run mediascribe doctor-video-auth "https://www.bilibili.com/video/BV1VtcYzTEZn/"

Azure long audio behavior

For Azure fast transcription, MediaScribe auto-splits risky inputs before upload:

auto-split when duration is over 60 minutes
auto-split when file size is over 150 MB
default chunk size is 30 minutes

Performance Reference

These are practical end-to-end timing references gathered during real runs on the current machine and network environment.

Scenario	Input Duration	Method	End-to-End Time	Approx Speed
Local MP4 `cleaned_...mp4`	6m06s	local video -> extract audio -> Azure ASR -> cloud summary	1m41s	about 3.6x realtime
YouTube `3DlXq9nsQOE`	18m30s	remote video -> download audio -> Azure ASR -> cloud summary	2m26s	about 7.6x realtime
YouTube `aircAruvnKk` with subtitles	18m26s	remote subtitles -> cloud summary	31s	about 35.9x realtime
Facebook `1ahSKdqfDU`	19s	remote video -> download audio -> Azure ASR -> cloud summary	36s	about 0.5x realtime
Bilibili `BV1VtcYzTEZn`	2m59s	subtitle fail -> audio fallback -> Azure ASR -> cloud summary	58s to 1m04s	about 2.8x to 3.1x realtime

More details

Project Layout

mediascribe/
  cli.py
  transcription_service.py
  text_summary_service.py
  video_summary_service.py
  asr/
  summary/
  ffmpeg_utils.py
  yt_dlp_auth.py

docs/
examples/

Note

The public project name is MediaScribe
The canonical Python package path is mediascribe

Documentation Map

Quick start: docs/quickstart.md
Local workspace: docs/local-workspace.md
Architecture: docs/architecture.md
ASR API and extension guide: docs/asr.md
Summary API and extension guide: docs/summary.md
Third-party provider integration: docs/plugin-providers.md
Benchmark notes: docs/benchmark-notes.md
Custom provider examples: examples/custom_providers/README.md

License

This project is licensed under the MIT License. See LICENSE.

Release Notes

See CHANGELOG.md for release notes.

Local Model Hardware Reference

These are practical starting points, not hard limits. Real usage varies with audio length, concurrency, device type, and whether CPU/GPU memory is shared.

Local speech transcription

Model / feature	GPU path	CPU path	Notes
`Whisper small`	Around `2 GB VRAM`	Roughly `8 GB RAM`	Good for lower-spec machines
`Whisper medium`	Around `5 GB VRAM`	Roughly `16 GB RAM`	Current default and a solid balance
`Whisper turbo`	Around `6 GB VRAM`	Roughly `16 GB RAM`	Faster, but still fairly heavy
`Whisper large`	Around `10 GB VRAM`	Roughly `16-32 GB RAM`	Best quality, highest local cost
`pyannote.audio` diarization add-on	Extra GPU headroom recommended	More comfortable on `16 GB RAM+`	Adds noticeable load on top of local ASR

Local text summary

Model	Download size	CPU path	GPU path	Notes
`ollama/qwen2.5:3b`	About `1.9 GB`	`6-8 GB RAM`	`4-6 GB VRAM`	Recommended default for Chinese-friendly summaries
`ollama/llama3.2:1b`	About `1.3 GB`	`4-6 GB RAM`	`2-3 GB VRAM`	Smallest practical option
`ollama/llama3.2:3b`	About `2.0 GB`	`6-8 GB RAM`	`4-6 GB VRAM`	Good general-purpose fallback
`ollama/phi4-mini`	About `2.5 GB`	`8 GB RAM`	`4-6 GB VRAM`	Useful when you want stronger structured output

Whole-machine suggestions

Hardware budget	Suggested setup	Notes
`8 GB RAM`	`Whisper small` + `llama3.2:1b`	Or use cloud ASR to reduce local load
`16 GB RAM`	`Whisper medium` + `qwen2.5:3b`	Most practical balanced setup
`32 GB RAM` or `8-12 GB VRAM`	Stronger Whisper variants plus local diarization	More comfortable for long local runs

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
.github/workflows		.github/workflows
assets		assets
docs		docs
examples/custom_providers		examples/custom_providers
manual_checks/ollama_summary		manual_checks/ollama_summary
mediascribe		mediascribe
notes		notes
scripts		scripts
tests		tests
.env.example		.env.example
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
CLAUDE.md		CLAUDE.md
LICENSE		LICENSE
README.md		README.md
README.zh-CN.md		README.zh-CN.md
pyproject.toml		pyproject.toml
requirements-dev.txt		requirements-dev.txt
requirements-local.txt		requirements-local.txt
requirements.txt		requirements.txt
uv.lock		uv.lock

Folders and files

Latest commit

History

Repository files navigation

MediaScribe

Highlights

Why MediaScribe

Command Names

Architecture Snapshot

Install

Minimal .env

Local Summary Configuration

Quick Start

Audio: transcript + summary

Audio: transcript only

Audio directory

Existing transcripts: summary only

Text or notes directory

Text summary with explicit local model override

Local video summary

Remote video summary

Typical Workflows

1. One-shot audio flow

2. Transcript first, summary later

3. Summarize raw text from another script

4. Video summary with subtitle-first strategy

5. Video -> audio only -> audio summary later

6. Speaker naming

Hardware and Cost Notes

Local ASR (--asr local)

Cloud ASR (--asr azure / aliyun / iflytek)

Summary generation

Video Notes and Auth

Supported video inputs

Authentication order

Azure long audio behavior

Performance Reference

Project Layout

Documentation Map

License

Release Notes

Local Model Hardware Reference

Local speech transcription

Local text summary

Whole-machine suggestions

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Uh oh!

Contributors 1

Languages

Minimal `.env`

Local ASR (`--asr local`)

Cloud ASR (`--asr azure` / `aliyun` / `iflytek`)

Packages