Generate SRT/LRC subtitles and Markdown transcripts from audio/video files using multiple speech-to-text engines, with auto-correction, LLM polishing, and multimodal content summarization.
TingShuo recursively scans directories for media files, transcribes them using your choice of STT engine, and outputs subtitle files in SRT, LRC, or Markdown transcript format. Features include LLM-based auto-correction of typos and verbal mistakes, subtitle polishing via LLM or NLP, and content summarization with multimodal video analysis.
- 4 STT Engines: faster-whisper, Vosk, OpenAI Whisper, whisper.cpp
- 3 Output Formats: SRT (SubRip), LRC (lyrics), and MD (Markdown transcript)
- Markdown Transcript: Generate clean, structured transcripts from speeches and lectures
- Auto-Correction: Fix typos, wrong characters, and verbal mistakes automatically via LLM
- Content Summarization: Summarize audio/video content with multimodal video analysis (keyframe extraction + vision LLM)
- Subtitle Translation: Translate subtitles to multiple target languages using NLLB or LLM
- Multi-language UI: Interface supports English, Chinese, Japanese, Korean, French, German, Spanish, Italian, Portuguese, Russian
- LLM Polishing: Merge fragmented subtitles into natural sentences via Ollama or OpenAI-compatible API
- NLP Polishing: Sentence boundary detection via nltk (no LLM required)
- CLI + GUI + Web: Full command-line interface, tkinter graphical interface, and Web interface
- Recursive Scanning: Process entire directory trees of media files
- HuggingFace Mirror: Built-in support for an HF mirror (useful in mainland China)
- Flexible Output: Save subtitles alongside source files or to a custom directory
- Settings Persistence: UI language and preferences saved to `~/.config/tingshuo/settings.json`
# Base install (no STT engine included)
pip install tingshuo
# With a specific engine:
# (quotes keep shells such as zsh from treating the brackets as glob patterns)
pip install "tingshuo[faster-whisper]" # Recommended
pip install "tingshuo[vosk]"
pip install "tingshuo[whisper]"
pip install "tingshuo[whisper-cpp]"
# With NLP polishing:
pip install "tingshuo[nlp]"
# Everything:
pip install "tingshuo[all]"
# From source (editable install):
git clone https://github.com/cycleuser/TingShuo.git
cd TingShuo
pip install -e ".[faster-whisper,nlp]"

Requirements:
- Python 3.9+
- ffmpeg must be installed and available on your PATH
  - Linux: `sudo apt install ffmpeg`
  - macOS: `brew install ffmpeg`
  - Windows: Download from ffmpeg.org and add to PATH

Basic transcription (SRT):
tingshuo -i ./videos -e faster-whisper -f srt
Generate LRC files to a specific output directory:
tingshuo -i ./audio -e vosk -f lrc -o ./subtitles
With LLM polishing (Ollama):
tingshuo -i ./media --polish-llm --ollama-model qwen2.5
With LLM polishing (OpenAI-compatible API):
tingshuo -i ./media --polish-llm --api-url https://api.example.com --api-key sk-xxx --api-model gpt-4o-mini
With NLP polishing:
tingshuo -i ./media --polish-nlp -l en
Generate Markdown transcript from lectures:
tingshuo -i ./lectures -f md --polish-llm --ollama-model qwen2.5
Auto-correct typos and verbal mistakes:
tingshuo -i ./media --auto-correct --ollama-model qwen2.5
Auto-correct + LLM polishing combined:
tingshuo -i ./media --auto-correct --polish-llm --ollama-model qwen2.5
Generate content summary:
tingshuo -i ./media --summarize --ollama-model qwen2.5
Summarize with multimodal video analysis (OpenAI-compatible API):
tingshuo -i ./videos --summarize --api-url https://api.example.com --api-key sk-xxx --api-model gpt-4o-mini
Specify language and model:
tingshuo -i ./videos -e faster-whisper -m large-v3 -l zh
Use HuggingFace mirror (mainland China):
tingshuo -i ./videos -e faster-whisper --hf-mirror https://hf-mirror.com
Translate subtitles to multiple languages (NLLB):
tingshuo -i ./videos -e faster-whisper --translate --target-lang zh,ja,ko
Translate subtitles using LLM:
tingshuo -i ./videos -e faster-whisper --translate --target-lang zh --trans-backend llm --ollama-model qwen2.5
Download a model before transcription:
tingshuo --download -e faster-whisper -m large-v3
tingshuo --download -e faster-whisper -m large-v3 --hf-mirror https://hf-mirror.com
Download all models for an engine:
tingshuo --download-all -e faster-whisper
List installed Ollama models:
tingshuo --list-ollama-models
tingshuo --list-ollama-models --ollama-url http://192.168.1.100:11434
Launch the graphical interface:
tingshuo --gui
The GUI provides:
- Directory selection with browse buttons
- Engine and model selection dropdowns
- Language dropdown with common languages
- Model download buttons
- Format toggle (SRT/LRC/MD)
- Auto-correction checkbox
- Content summary checkbox
- Polishing options (None / LLM / NLP)
- Translation panel
- Ollama model dropdown
- Multi-language interface
- Progress bar and real-time log output
Launch the Gradio-based Web interface:
python web_ui.py
Open http://localhost:7860 in your browser.
The Web UI provides:
- File upload (drag and drop)
- Engine, Model, Language selection
- Format selection (SRT/LRC/MD)
- Polishing options (Ollama/NLP)
- Translation settings
- Real-time logs and file download
usage: tingshuo [-h] [--version] [--gui] [-i DIR] [-o DIR] [-f {srt,lrc,md}]
[--no-recursive] [-e ENGINE] [-m NAME] [-l CODE]
[--hf-mirror URL] [--download] [--download-all]
[--list-ollama-models] [--auto-correct]
[--polish-llm | --polish-nlp]
[--ollama-url URL] [--ollama-model NAME] [--api-url URL]
[--api-key KEY] [--api-model NAME] [-v]
[--translate] [--target-lang CODES]
[--trans-backend {nllb,llm}] [--nllb-model NAME]
[--summarize] [--keyframe-interval SECONDS]
| Argument | Description |
|---|---|
| `-i, --input DIR` | Input directory containing audio/video files (required) |
| `-o, --output DIR` | Output directory for subtitles (default: same as source) |
| `-f, --format {srt,lrc,md}` | Output format: srt, lrc, or md (Markdown transcript) (default: srt) |
| `--no-recursive` | Do not scan subdirectories |

| Argument | Description |
|---|---|
| `-e, --engine` | Engine: faster-whisper, vosk, whisper, whisper-cpp (default: faster-whisper) |
| `-m, --model NAME` | Model name or path (default: engine-specific, usually "base") |
| `-l, --language CODE` | Language code: zh, en, ja, etc. Use "auto" for auto-detection (default: auto) |

| Argument | Description |
|---|---|
| `--hf-mirror URL` | HuggingFace mirror URL, e.g. https://hf-mirror.com |

| Argument | Description |
|---|---|
| `--download` | Download the model specified by -e and -m, then exit |
| `--download-all` | Download all known models for the engine specified by -e, then exit |
| `--list-ollama-models` | List installed Ollama models from the server (uses --ollama-url), then exit |

| Argument | Description |
|---|---|
| `--polish-llm` | Polish with LLM (Ollama or OpenAI-compatible API) |
| `--polish-nlp` | Polish with NLP sentence segmentation (nltk) |

| Argument | Description |
|---|---|
| `--auto-correct` | Auto-correct typos, wrong characters, and verbal mistakes using LLM |

| Argument | Description |
|---|---|
| `--ollama-url URL` | Ollama API URL (default: http://localhost:11434) |
| `--ollama-model NAME` | Ollama model name (default: qwen2.5) |
| `--api-url URL` | OpenAI-compatible API base URL |
| `--api-key KEY` | API key for OpenAI-compatible service |
| `--api-model NAME` | Model name for API |

| Argument | Description |
|---|---|
| `--gui` | Launch graphical interface |
| `-v, --verbose` | Enable debug logging |
| `--version` | Show version and exit |

| Argument | Description |
|---|---|
| `--translate` | Enable subtitle translation to target language(s) |
| `--target-lang CODES` | Comma-separated target language codes, e.g. zh,en,ja |
| `--trans-backend {nllb,llm}` | Translation backend: nllb (Meta's NLLB models) or llm (default: nllb) |
| `--nllb-model NAME` | NLLB model name (default: facebook/nllb-200-distilled-600M) |

| Argument | Description |
|---|---|
| `--summarize` | Generate a content summary (.summary.md) alongside the output |
| `--keyframe-interval SECONDS` | Seconds between keyframe extractions for video summarization (default: 60) |

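When batch jobs are driven from a script, it can help to assemble the command line programmatically. The following is a minimal sketch that uses only flags listed in the tables above; the helper name `build_tingshuo_command` and its parameters are hypothetical and not part of TingShuo itself.

```python
from typing import List, Optional


def build_tingshuo_command(
    input_dir: str,
    engine: str = "faster-whisper",
    fmt: str = "srt",
    language: Optional[str] = None,
    polish: Optional[str] = None,  # "llm", "nlp", or None
    ollama_model: Optional[str] = None,
) -> List[str]:
    """Assemble an argument list from flags documented in the reference above.

    This helper is illustrative only; it is not shipped with TingShuo.
    """
    if polish not in (None, "llm", "nlp"):
        raise ValueError("polish must be 'llm', 'nlp', or None")
    cmd = ["tingshuo", "-i", input_dir, "-e", engine, "-f", fmt]
    if language:
        cmd += ["-l", language]
    if polish:
        # The usage line shows --polish-llm and --polish-nlp as alternatives,
        # so a single parameter selects at most one of them.
        cmd.append(f"--polish-{polish}")
    if ollama_model:
        cmd += ["--ollama-model", ollama_model]
    return cmd


if __name__ == "__main__":
    import shlex

    # Print the command instead of running it, so the sketch works
    # even where tingshuo is not installed.
    print(shlex.join(build_tingshuo_command(
        "./lectures", fmt="md", polish="llm", ollama_model="qwen2.5")))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` once TingShuo is installed.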
Audio: mp3, wav, flac, aac, ogg, wma, m4a, opus
Video: mp4, mkv, avi, mov, wmv, flv, webm, ts, m4v, mpg, mpeg
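To preview which files a recursive scan would pick up, the extension lists above can be applied with `pathlib`. This is a rough sketch rather than TingShuo's own scanner: the function name `find_media_files` is hypothetical, and matching extensions case-insensitively is an assumption.

```python
from pathlib import Path
from typing import List

# Extensions copied from the supported-format lists above.
AUDIO_EXTS = {"mp3", "wav", "flac", "aac", "ogg", "wma", "m4a", "opus"}
VIDEO_EXTS = {"mp4", "mkv", "avi", "mov", "wmv", "flv", "webm", "ts", "m4v", "mpg", "mpeg"}
MEDIA_EXTS = AUDIO_EXTS | VIDEO_EXTS


def find_media_files(root: str, recursive: bool = True) -> List[Path]:
    """Return media files under root, sorted by path.

    recursive=False mirrors the idea behind --no-recursive: only the
    top-level directory is inspected.
    """
    base = Path(root)
    candidates = base.rglob("*") if recursive else base.glob("*")
    return sorted(
        p for p in candidates
        # Assumption: extensions are compared case-insensitively.
        if p.is_file() and p.suffix.lower().lstrip(".") in MEDIA_EXTS
    )
```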
SRT (SubRip Text):
1
00:00:01,500 --> 00:00:04,200
This is the first subtitle line.

2
00:00:05,000 --> 00:00:08,300
This is the second subtitle line.
LRC (Lyrics):
[ti:filename]
[re:TingShuo v0.1.3]
[00:01.50]This is the first subtitle line.
[00:05.00]This is the second subtitle line.
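The two timestamp styles shown above differ in precision and layout: SRT uses `HH:MM:SS,mmm` (milliseconds), while LRC uses `[MM:SS.cc]` (centiseconds). A small sketch of both conversions from a time in seconds follows; the function names are illustrative and not taken from TingShuo.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def lrc_timestamp(seconds: float) -> str:
    """Format seconds as an LRC time tag, [MM:SS.cc] (centiseconds)."""
    total_cs = round(seconds * 100)
    minutes, rem = divmod(total_cs, 6000)
    secs, cs = divmod(rem, 100)
    return f"[{minutes:02d}:{secs:02d}.{cs:02d}]"


if __name__ == "__main__":
    print(srt_timestamp(1.5))  # 00:00:01,500
    print(lrc_timestamp(1.5))  # [00:01.50]
```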
MD (Markdown Transcript):
## Introduction
This is the opening section of the speech, organized into
natural paragraphs by the LLM.
## Main Topic
The speaker then moved on to discuss the main topic,
with key points organized into readable paragraphs.

faster-whisper: CTranslate2-based Whisper implementation. Fast, supports GPU acceleration.
pip install faster-whisper
Models: tiny, base, small, medium, large-v2, large-v3
Vosk: Lightweight offline speech recognition. Lower accuracy but very fast on CPU.
pip install vosk
Models: Downloaded automatically by language, or specify a local path with -m /path/to/model.
OpenAI Whisper: The original Whisper model from OpenAI.
pip install openai-whisper
Models: tiny, base, small, medium, large
whisper.cpp: C++ implementation of Whisper via Python bindings. Very fast on CPU.
pip install pywhispercpp
Models: tiny, base, small, medium, large
LLM polishing sends subtitle segments to an LLM to merge fragments into complete, natural sentences.
With Ollama (local):
- Install and start Ollama
- Pull a model: `ollama pull qwen2.5`
- Run: `tingshuo -i ./media --polish-llm --ollama-model qwen2.5`

With Ollama (LAN):
tingshuo -i ./media --polish-llm --ollama-url http://192.168.1.100:11434 --ollama-model qwen2.5
With OpenAI-compatible API:
tingshuo -i ./media --polish-llm --api-url https://api.openai.com --api-key sk-xxx --api-model gpt-4o-mini
NLP polishing uses nltk sentence tokenization to detect sentence boundaries and merge fragments. No LLM or network access required.
pip install nltk
tingshuo -i ./media --polish-nlp -l en
NLP polishing supports English, German, French, Spanish, Italian, Portuguese, and more via nltk. For Chinese, Japanese, and Korean, it uses punctuation-based sentence splitting instead.
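Punctuation-based splitting for Chinese, Japanese, and Korean text can be approximated with a regular expression. The sketch below is one possible approach under stated assumptions, not TingShuo's actual code: the function names are hypothetical, the set of sentence-ending marks is a guess, and timestamps are ignored for brevity.

```python
import re
from typing import List

# Assumed sentence terminators: full-width CJK marks plus ASCII ! and ?.
# A sentence is a run of non-terminators followed by any trailing terminators,
# so sequences such as "？！" stay attached to the sentence they end.
_SENTENCE = re.compile(r"[^。！？!?]+[。！？!?]*")


def split_cjk_sentences(text: str) -> List[str]:
    """Split text into sentences at CJK sentence-ending punctuation."""
    return [m.strip() for m in _SENTENCE.findall(text) if m.strip()]


def merge_fragments(fragments: List[str]) -> List[str]:
    """Join subtitle fragments and re-split them at sentence boundaries."""
    return split_cjk_sentences("".join(f.strip() for f in fragments))


if __name__ == "__main__":
    for sentence in merge_fragments(["今天天气", "很好。我们", "去公园吧！"]):
        print(sentence)
```

A full implementation would also need to carry each fragment's start and end time through to the merged sentence.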
TingShuo can generate clean, structured Markdown transcripts from speeches, lectures, and presentations. Instead of timestamped subtitles, the MD format produces flowing text organized into sections and paragraphs.
# Generate Markdown transcript (uses LLM to structure paragraphs)
tingshuo -i ./lectures -f md --polish-llm --ollama-model qwen2.5
# With auto-correction for cleaner output
tingshuo -i ./lectures -f md --auto-correct --polish-llm --ollama-model qwen2.5
The LLM organizes the raw transcription into logical sections with Markdown headers and paragraphs. If no LLM is configured, a simple paragraph grouping fallback is used.
TingShuo can automatically fix transcription errors before polishing or output. This includes:
- Typos and wrong characters (错别字): Common misrecognitions from STT engines
- Verbal mistakes (口误): Slips of the tongue in speech
- Filler words: Remove "um", "uh", "嗯", "那个", etc. when they add no meaning
# Auto-correct only
tingshuo -i ./media --auto-correct --ollama-model qwen2.5
# Auto-correct + LLM polishing (correction happens first, then polishing)
tingshuo -i ./media --auto-correct --polish-llm --ollama-model qwen2.5
# Auto-correct with OpenAI-compatible API
tingshuo -i ./media --auto-correct --api-url https://api.example.com --api-key sk-xxx --api-model gpt-4o-mini
Auto-correction preserves segment boundaries (timestamps remain unchanged) and works with all output formats (SRT, LRC, MD).
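The idea of correcting text while leaving timestamps untouched can be illustrated with a small data model. Everything here is an assumption for illustration: the `Segment` class, `correct_segments`, and the toy `remove_fillers` function (a stand-in for the LLM call) are hypothetical names, not TingShuo's internals.

```python
from dataclasses import dataclass, replace
from typing import Callable, List


@dataclass(frozen=True)
class Segment:
    start: float  # seconds
    end: float    # seconds
    text: str


def correct_segments(
    segments: List[Segment], correct: Callable[[str], str]
) -> List[Segment]:
    """Apply a text corrector to each segment, keeping start/end as they are."""
    return [replace(seg, text=correct(seg.text)) for seg in segments]


def remove_fillers(text: str) -> str:
    """Toy corrector standing in for an LLM: drops English filler words."""
    fillers = {"um", "uh"}
    words = [w for w in text.split() if w.strip(",.").lower() not in fillers]
    return " ".join(words)


if __name__ == "__main__":
    raw = [Segment(1.5, 4.2, "Um, this is the first line.")]
    for seg in correct_segments(raw, remove_fillers):
        print(seg.start, seg.end, seg.text)
```

Because only the `text` field is replaced, the segment count and timing are identical before and after correction, which is what allows the result to be written to any of the output formats.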
TingShuo can generate a content summary (.summary.md) alongside the normal output. For video files, it supports multimodal analysis using keyframe extraction and vision-capable LLMs.
# Summarize using Ollama
tingshuo -i ./media --summarize --ollama-model qwen2.5
# Summarize using OpenAI-compatible API
tingshuo -i ./media --summarize --api-url https://api.example.com --api-key sk-xxx --api-model gpt-4o-mini
For video files, TingShuo extracts keyframes using ffmpeg and sends them along with the transcript to a vision-capable LLM for comprehensive analysis:
# Multimodal summary with keyframe extraction (default: 60s intervals)
tingshuo -i ./videos --summarize --api-url https://api.example.com --api-key sk-xxx --api-model gpt-4o-mini
# Custom keyframe interval (every 30 seconds)
tingshuo -i ./videos --summarize --keyframe-interval 30 --api-url https://api.example.com --api-key sk-xxx --api-model gpt-4o-mini
# With Ollama multimodal models (e.g., llava, llama3.2-vision)
tingshuo -i ./videos --summarize --ollama-model llava
The multimodal summary integrates:
- Spoken content from the transcript
- Visual elements: slides, diagrams, charts, demonstrations
- Key visual information that complements the spoken content
If the LLM does not support vision, TingShuo automatically falls back to a text-only summary.
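To get a feel for how `--keyframe-interval` affects the number of frames sent to the model, the sampling points can be computed directly. This is a sketch under assumptions: the helper names are hypothetical, sampling is assumed to start at 0 seconds, and the ffmpeg invocation shown is one common way to grab a single frame, which may differ from what TingShuo actually runs.

```python
import math
from typing import List


def keyframe_timestamps(duration_s: float, interval_s: int = 60) -> List[int]:
    """Return sample times in seconds: 0, interval, 2*interval, ... below duration."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return list(range(0, math.ceil(duration_s), interval_s))


def ffmpeg_frame_command(video: str, timestamp_s: int, out_path: str) -> List[str]:
    """Build an ffmpeg command that writes one frame at the given time.

    -ss before -i seeks in the input, -frames:v 1 stops after one video
    frame, and -y overwrites an existing output file.
    """
    return [
        "ffmpeg", "-ss", str(timestamp_s), "-i", video,
        "-frames:v", "1", "-y", out_path,
    ]


if __name__ == "__main__":
    # A 150-second clip at the default interval yields three frames.
    for t in keyframe_timestamps(150):
        print(" ".join(ffmpeg_frame_command("talk.mp4", t, f"frame_{t:05d}.jpg")))
```

Halving the interval roughly doubles the number of images in the request, which matters for vision models with per-request image or token limits.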
TingShuo can automatically translate generated subtitles to multiple target languages. Translated subtitles are saved as separate files with language codes (e.g., video.zh.srt, video.ja.srt).
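The naming pattern from the example (video.zh.srt, video.ja.srt) can be reproduced with `pathlib`. The helper below is a hypothetical illustration of that pattern, assuming the language code is inserted between the file stem and the format extension.

```python
from pathlib import Path
from typing import List


def translated_paths(source: str, target_langs: str, fmt: str = "srt") -> List[Path]:
    """Map a media file and a comma-separated language list to output paths.

    target_langs uses the same comma-separated form as --target-lang.
    """
    src = Path(source)
    langs = [code.strip() for code in target_langs.split(",") if code.strip()]
    return [src.with_name(f"{src.stem}.{lang}.{fmt}") for lang in langs]


if __name__ == "__main__":
    for path in translated_paths("videos/video.mp4", "zh,ja"):
        print(path.name)
```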
The NLLB backend uses Meta's NLLB (No Language Left Behind) models for high-quality offline translation supporting 200+ languages.
# Install dependencies
pip install transformers sentencepiece
# Translate to Chinese and Japanese
tingshuo -i ./videos -e faster-whisper --translate --target-lang zh,ja
# Use a larger NLLB model for better quality
tingshuo -i ./videos --translate --target-lang zh --nllb-model facebook/nllb-200-distilled-1.3B
Available NLLB models: facebook/nllb-200-distilled-600M (default), facebook/nllb-200-distilled-1.3B, facebook/nllb-200-3.3B
The LLM backend uses Ollama or an OpenAI-compatible API for translation.
# Translate using Ollama
tingshuo -i ./videos --translate --target-lang zh --trans-backend llm --ollama-model qwen2.5
# Translate using OpenAI API
tingshuo -i ./videos --translate --target-lang zh --trans-backend llm --api-url https://api.openai.com --api-key sk-xxx --api-model gpt-4o-mini
For users in mainland China who have difficulty downloading models from HuggingFace:
tingshuo -i ./videos -e faster-whisper --hf-mirror https://hf-mirror.com
Or set the environment variable directly:
export HF_ENDPOINT=https://hf-mirror.com
tingshuo -i ./videos -e faster-whisper
This project is licensed under the GNU General Public License v3.0. See LICENSE for details.

