EasyScript

Standalone desktop transcription & translation app powered by Whisper AI. Transcribe audio/video files or live microphone input with real-time sentence splitting, speaker diarization, and multilingual translation.

Built for macOS (Apple Silicon optimized with Metal GPU acceleration) and Windows (CUDA support).

Screenshots

Editor Mode — Main Interface

Load audio files, detect silence, transcribe speech, and identify speakers — all from a clean, intuitive interface with waveform visualization.

Waveform & Speaker Diarization

Interactive waveform viewer with color-coded speech/silence/breath segments. After running speaker diarization, each segment is labeled with the identified speaker (Speaker A, B, C...) in distinct colors.

Speech Segments with Speaker Labels

Transcribed segments are displayed with speaker identification, timecodes, and the transcribed text. Filter by All, Speech, Silence, or Translation tabs.

Display Mode Settings

Choose how text is displayed: Natural sentences, Word by word, By punctuation, or Max words per line — flexible formatting for different use cases.

Translation — Language Selection

Translate transcriptions into 30+ languages. Click "+" to add a target language from the language picker with support for Vietnamese, Chinese, Japanese, Korean, Thai, and many more.

Translation — Text View

View translations in a continuous text format. Add multiple target languages (e.g., Vietnamese + Japanese) with tabs to switch between them.

Live Mode — Real-time Transcription

Record from your microphone and get real-time transcription with sentence-level splitting. Choose the Turbo model for fast Vietnamese transcription, or any other model for your language.

Features

Editor Mode — File-based Transcription

Audio/Video file support — Load any audio or video file for analysis
Silence & breath detection — Detect silent and breath segments using FFmpeg-based VAD
AI Transcription — Speech-to-text using Whisper (MLX on Apple Silicon, faster-whisper on CUDA/CPU)
Speaker diarization — Identify who speaks when using pyannote-audio 4.0
Interactive waveform — Zoomable waveform viewer with playback, cut markers, and segment navigation
Search & Replace — Find and replace text across all segments
Multiple display modes — Natural sentences, word-by-word, punctuation-based, or max-words-per-line
Translation — Translate segments using Ollama (local) or Claude API (cloud)
Export — Export to XML (Premiere Pro timeline), SRT subtitles (original or after cuts)

Live Mode — Real-time Transcription

Microphone input — Record and transcribe from any connected microphone
System Audio — Capture browser tab audio (browser mode only)
Real-time sentence splitting — Sentences split at punctuation marks (.!?;,) as you speak
Live translation — Each sentence is translated immediately after finalization
Speech / Translation tabs — Speech tab shows fast real-time transcription; Translation tab holds position until translation completes for reading comfort
Pause / Continue / Stop — Full session control with data preservation
Non-blocking pipeline — Heavy Whisper inference runs asynchronously, never blocking audio capture

Transcription Models

Model	Size	Speed	Quality
Tiny	~75MB	Fastest	Basic
Base	~140MB	Fast	Good
Small	~460MB	Medium	Better
Medium	~1.5GB	Slow	Great
Turbo	~800MB	Fast	Best for Vietnamese
Large V3	~3GB	Slowest	Best overall

GPU Acceleration

Apple Silicon (M1/M2/M3/M4) — MLX backend with Metal GPU, optimized for macOS
NVIDIA GPU — faster-whisper with CUDA acceleration
CPU fallback — Automatic fallback when no GPU available

Installation

Option 1: Download Standalone App (Recommended)

Download EasyScript.app from Releases and drag to Applications.

Option 2: Build from Source

Requirements:

Python 3.11+
FFmpeg (brew install ffmpeg)
macOS 13+ (for Apple Silicon MLX) or Windows with NVIDIA GPU

# Clone repository
git clone https://github.com/XavierChuu/EasyScript.git
cd EasyScript

# Build standalone app
chmod +x scripts/build_app.sh
./scripts/build_app.sh

# App will be at dist/EasyScript.app (macOS)
open dist/EasyScript.app

Option 3: Development Mode

# Setup backend
cd backend
python3.11 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

# Start backend server
python server.py
# Server runs at http://localhost:9876

# In another terminal, serve frontend
npx serve ./plugin
# Open http://localhost:3000 in browser

Usage Guide

Editor Mode

1. Load Audio

Click the folder icon next to the audio path field
Select any audio or video file (MP3, WAV, M4A, MP4, MOV, etc.)
Audio info (duration, sample rate, channels) will be displayed

2. Detect Silence

Click "Detect Silence" to analyze the audio
The waveform will show speech (blue), silence (gray), and breath (orange) segments
Adjust cut settings (padding, min silence, threshold) via the gear icon on the waveform

3. Transcribe

Select your preferred Model (Turbo recommended for Vietnamese)
Select Language (or leave as Auto-detect)
Optionally check "Identify speakers after transcribe"
Click "Transcribe" — progress bar shows real-time status with ETA
Right-click to resume transcription from the current playhead position

4. Speaker Diarization

Requires a HuggingFace token (configure in Settings)
Accept terms at pyannote/speaker-diarization-3.1
Click "Speakers" to identify who speaks when
Speakers are labeled as Speaker A, Speaker B, etc.

5. Translate

Switch to the Translation tab in segments
Click "+" to add a target language
Click "Translate" to translate all segments
Supports Ollama (local, free) or Claude API (cloud, higher quality)

6. Export

Export XML — Premiere Pro compatible timeline with cuts applied
Export SRT (Original) — Subtitles with original timecodes
Export SRT (After Cuts) — Subtitles adjusted for silence removal
Choose output folder via the folder selector

Live Mode

1. Select Input Source

Microphone — Select from available microphones
System Audio — Capture browser tab audio (only works in browser mode, not in bundled app)

2. Configure

Select transcription Model (Turbo recommended)
Select Language
Optionally enable Translation with target language

3. Start Recording

Click "Start Live" to begin real-time transcription
Speech is transcribed and split into sentences in real-time
Switch between Speech and Translation tabs:
- Speech tab — Shows transcription as fast as possible
- Translation tab — Holds position until translation for current sentence is ready

4. Session Controls

Pause — Temporarily stop recording, stay in focus mode
Continue — Resume recording without clearing data
Stop — End session, return to full UI with all data preserved for export
New — Start a fresh session (clears previous data)

Settings

HuggingFace Token

Required for speaker diarization. Get your token at huggingface.co/settings/tokens.

Translation Provider

Ollama (Local) — Free, runs locally. Install Ollama and pull a model
Claude API (Cloud) — Higher quality translations. Requires an Anthropic API key

Architecture

EasyScript/
├── plugin/                 # Frontend (HTML/CSS/JS)
│   ├── index.html          # Main UI layout
│   ├── index.js            # App logic, WebSocket handling
│   └── styles.css          # Styling
├── backend/                # Python backend
│   ├── server.py           # FastAPI server + WebSocket live streaming
│   ├── transcriber.py      # Whisper transcription (MLX / faster-whisper)
│   ├── silence_detector.py # FFmpeg-based silence/breath detection
│   ├── diarizer.py         # Speaker diarization (pyannote-audio)
│   ├── translator.py       # Translation (Ollama / Claude API)
│   ├── main.py             # PyWebView launcher
│   ├── easyscript.spec     # PyInstaller build spec
│   └── requirements.txt    # Python dependencies
├── scripts/
│   └── build_app.sh        # Build standalone app script
└── README.md

Tech Stack

Frontend: HTML/CSS/JS (runs in browser or PyWebView)
Backend: Python 3.11, FastAPI, uvicorn
Transcription: mlx-whisper (Apple Silicon Metal GPU) / faster-whisper (CUDA/CPU)
Speaker ID: pyannote-audio 4.0 + torchcodec
Live mode: WebSocket streaming, webrtcvad sentence splitting
Translation: Ollama (local) / Claude API (cloud)
Distribution: PyInstaller bundled .app

API Endpoints

The backend exposes a REST API at http://localhost:9876:

Method	Endpoint	Description
GET	`/health`	Server health check + GPU info
POST	`/upload`	Upload audio/video file
POST	`/analyze`	Detect silence/breath segments
POST	`/transcribe`	Transcribe audio to text
POST	`/diarize`	Speaker diarization
POST	`/translate`	Batch translate segments
POST	`/translate/one`	Translate single segment
GET	`/models`	List available models
GET	`/model/status`	Check if model is cached
WebSocket	`/ws/live`	Live transcription stream

System Requirements

macOS: 13.0+ (Ventura or later), Apple Silicon recommended
Windows: Windows 10+, NVIDIA GPU recommended
RAM: 8GB minimum, 16GB recommended for large models
Storage: ~1GB for app + model storage (varies by model size)
FFmpeg: Required (bundled in standalone app)

License

MIT License

Credits

Whisper by OpenAI
mlx-whisper by Apple MLX team
faster-whisper by SYSTRAN
pyannote-audio for speaker diarization
FastAPI for the backend framework

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
backend		backend
docs/images		docs/images
plugin		plugin
scripts		scripts
.gitignore		.gitignore
CLAUDE.md		CLAUDE.md
README.md		README.md

Folders and files

Latest commit

History

Repository files navigation

EasyScript

Screenshots

Editor Mode — Main Interface

Waveform & Speaker Diarization

Speech Segments with Speaker Labels

Display Mode Settings

Translation — Language Selection

Translation — Text View

Live Mode — Real-time Transcription

Features

Editor Mode — File-based Transcription

Live Mode — Real-time Transcription

Transcription Models

GPU Acceleration

Installation

Option 1: Download Standalone App (Recommended)

Option 2: Build from Source

Option 3: Development Mode

Usage Guide

Editor Mode

1. Load Audio

2. Detect Silence

3. Transcribe

4. Speaker Diarization

5. Translate

6. Export

Live Mode

1. Select Input Source

2. Configure

3. Start Recording

4. Session Controls

Settings

HuggingFace Token

Translation Provider

Architecture

Tech Stack

API Endpoints

System Requirements

License

Credits

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages