Releases: bevsxyz/WhisperForge
0.5.5 - 2026-06-08
Release Notes
Install whisperforge 0.5.5
Install prebuilt binaries via shell script
curl --proto '=https' --tlsv1.2 -LsSf https://github.com/bevsxyz/WhisperForge/releases/download/v0.5.5/whisperforge-installer.sh | shInstall prebuilt binaries via powershell script
powershell -ExecutionPolicy Bypass -c "irm https://github.com/bevsxyz/WhisperForge/releases/download/v0.5.5/whisperforge-installer.ps1 | iex"Install prebuilt binaries via Homebrew
brew install bevsxyz/tap/whisperforgeInstall prebuilt binaries into your npm project
npm install whisperforge@0.5.5Download whisperforge 0.5.5
| File | Platform | Checksum |
|---|---|---|
| whisperforge-aarch64-apple-darwin.tar.xz | Apple Silicon macOS | checksum |
| whisperforge-x86_64-pc-windows-msvc.zip | x64 Windows | checksum |
| whisperforge-x86_64-pc-windows-msvc.msi | x64 Windows | checksum |
| whisperforge-aarch64-unknown-linux-gnu.tar.xz | ARM64 Linux | checksum |
| whisperforge-x86_64-unknown-linux-gnu.tar.xz | x64 Linux | checksum |
WhisperForge v0.4.0
0.4.0 — 2026-05-22
Breaking — Phase E: foundation refactor (crate merger + CLI UX + device selection).
Removed (breaking)
whisperforge-convertcrate removed — model conversion lives inwf convert.whisperforgebinary alias dropped — onlywfships.--cpuflag on the transcribe path — replaced by--device cpu.--taskflag on the transcribe path — parsed-then-runtime-errored before; less code, clearer help.--wgpuflag (vestigial; was already replaced by feature-driven dispatch).
Added
wfsubcommand surface:transcribe,convert,list-models. No more implicit-default subcommand — must specify one explicitly.wf list-models: tabulates.mpkmodels under the models directory with precision,n_audio_layer,n_mels, file size.--device <auto|cpu|wgpu|cuda>onwf transcribe.autoprefers CUDA → WGPU → CPU based on compiled-in features.- Native CUDA backend via optional
burn-cuda0.21.0 (opt-in--features cuda; requires CUDA toolkit at build time). --models-dir <PATH>andWF_MODELS_DIRenv var honored bytranscribeandlist-models.- Friendly missing-model error pointing at
wf list-modelsand the matchingwf convertinvocation.
Changed
whisperforge-clicrate renamed towhisperforge. Workspace shrinks 5 → 4 crates.- mise
release-checksmoke-tests--device cudaand the missing-model friendly error. - Project docs (CLAUDE.md + README + per-crate READMEs) updated for the new CLI surface; Roadmap marks Phase E ✅ COMPLETE and Phase D (WASM) ⏸ DEFERRED. Next: Phase F = streaming realtime.
Migration
| Before | After |
|---|---|
whisperforge-cli -a foo.wav -m tiny_en |
wf transcribe -a foo.wav -m tiny_en |
cargo run -p whisperforge-convert -- --output models/tiny_en |
wf convert --output models/tiny_en |
wf -a foo.wav --cpu |
wf transcribe -a foo.wav --device cpu |
wf -a foo.wav --wgpu |
wf transcribe -a foo.wav --device wgpu (or default auto) |
wf -a foo.wav --task translate |
(removed — never worked) |
Deferred (Phase E carve-outs, queued for later phases)
- VRAM-aware
--encoder-batch-sizeauto-tune. Users override with--encoder-batch-sizeuntil a workload justifies the heuristic cost. - Windows wgpu runtime fallback. Upstream
windows-crate conflict betweenwgpu-halandgpu-allocatoris compile-time; no runtime probe can rescue it. Windows binary still ships CPU-only.
WhisperForge v0.3.1
Changelog
All notable changes to WhisperForge are documented in this file.
Format follows Keep a Changelog.
All workspace crates are versioned together.
0.2.0 — 2026-04-30
Added
Core transcription pipeline (whisperforge-core)
- Whisper model architecture for all sizes (tiny.en through large-v2/v3), built on Burn 0.20.
- Audio pipeline: WAV loading with format dispatch (f32/i16/i24/i32), rubato resampling to 16 kHz mono, Slaney-scale mel spectrogram matching Python Whisper (power spectrum, center-padding, 80-band filter bank).
HybridDecoderwith quality-gated temperature fallback matching faster-whisper SOTA: compression ratio gate (2.4), log-probability gate (−1.0), no-speech threshold (0.6), temperature sequence [0.0, 0.2, 0.4, 0.6, 0.8, 1.0].KvCache<B>+forward_decoder_cached: O(n) per step via static cross-attention K,V and growing self-attention K,V — ~2.6× speedup over naive O(n²) decoder.batch_mel_spectrograms: all audio chunks mel-encoded in a singleforward_encodercall before sequential decoding.transcribe_with_timestamps: per-token timestamps via cross-attention peak, ~100 ms precision.extract_speaker_embedding: mean-pool encoder output + L2-normalise → speaker fingerprint.TranscriptionResult/TranscriptionSegment: fully serde-serializable;token_timestampsandspeakerfields skipped when absent.- 0.8% average WER on LJSpeech
tiny.enbenchmark.
VAD and alignment (whisperforge-align)
VoiceActivityDetectorviaearshotwith configurable threshold.AudioSegmenter: splits audio into voice spans ≤30 s, filters silence.BatchedTranscriber<B>: real batch transcription pipeline —batch_mel_spectrograms→forward_encoder→ per-segment KV-cached greedy decode →HybridDecoder.SrtWriter/SrtEntry: generates well-formed SRT subtitle files.
Speaker diarization (whisperforge-diarize)
SpeakerDiarizer: agglomerative single-linkage clustering of speaker embeddings.cluster_embeddings: cosine similarity with configurable threshold,SPEAKER_NNlabel assignment.
Model conversion (whisperforge-convert)
- Converts OpenAI Whisper safetensors (via HuggingFace
hf-hub) to BurnNamedMpkformat. - Correct tensor name mapping for all encoder and decoder layers including cross-attention.
- Config auto-detection for all model sizes from tensor shapes.
CLI binary (whisperforge / wf)
- Two binary aliases:
whisperforgeandwf(same binary, shorter alias). --audio-file/-a: WAV input.--model/-m: model directory undermodels/(default:tiny_en_converted).--output-format text|srt|json: plain text, SRT subtitles, or JSON with segments and timestamps.--decoding-preset fast|balanced|accurate: pre-configured beam size and temperature sequences.--beam-size,--temperature,--length-penalty,--no-speech-threshold: per-run overrides.--vad-enabled/--vad-threshold: voice activity detection; silence segments skipped.--wgpu: WGPU GPU backend (Vulkan, DX12, or Metal — no CUDA required).--diarize/--diarize-threshold: speaker labels ([SPEAKER_NN]:) in SRT and JSON.--task transcribe|translate,--language: Whisper task and language selection.- Automatic 30 s chunking with 1 s overlap for long audio.
Notes
- Model files (
.mpk,.cfg,tokenizer.json) are not bundled — download from HuggingFace and convert withwhisperforge-convert. See README for instructions. - Requires Rust 1.85+ (Rust 2024 edition).
--wgpurequires Vulkan/DX12/Metal drivers.whisperforge-alignhas known pre-existing test failures; always excluded from the test suite.- Phase 7 Option B (ResNet293 speaker embeddings via burn-onnx) is deferred pending Burn 0.21 stable.
v0.3.0
🎯 WhisperForge 0.3.0: Quantization & Performance
🚀 Major Features
Phase C: INT8 Post-Training Quantization ✅
- 4× model size reduction: Tiny (150 MB → 37 MB), ideal for edge deployment
--quantize int8flag in model converter- Transparent loading—no code changes required
- Full precision (FP32) and quantized (INT8) models interoperable
Phase B.5: GPU-Accelerated Mel Spectrogram ✅
- CubeCL DFT kernel for GPU mel filterbank matmul
--features cubecl-stftenables GPU STFT pipeline- Faster audio preprocessing on WGPU backend
Burn 0.21 & burn-flex Migration
- Latest Burn with improved numerical stability
- CPU fallback (burn-flex) seamlessly handles CPU inference
- Better WGPU runtime integration
📊 What's Changed
- Quantized models fully compatible with CLI and library API
- Streaming audio pipeline (Phase B) now fully integrated
- Fixed EOT suppression at step 0 for robustness
- Improved error handling across all crates
📦 All 5 Crates Published
whisperforge-corev0.3.0 — librarywhisperforge-cliv0.3.0 — binarywhisperforge-convertv0.3.0 — model converterwhisperforge-alignv0.3.0 — VAD + SRTwhisperforge-diarizev0.3.0 — speaker diarization
🛠️ Dependency Updates
Simplified workspace dependency management: inter-crate deps now use workspace version automatically.
📖 Next Phase
Phase D: WASM Target—browser-native speech-to-text with wasm-bindgen.
Full Changelog: v0.2.0...v0.3.0
WhisperForge v0.2.0
Changelog
All notable changes to WhisperForge are documented in this file.
Format follows Keep a Changelog.
All workspace crates are versioned together.
0.2.0 — 2026-04-30
Added
Core transcription pipeline (whisperforge-core)
- Whisper model architecture for all sizes (tiny.en through large-v2/v3), built on Burn 0.20.
- Audio pipeline: WAV loading with format dispatch (f32/i16/i24/i32), rubato resampling to 16 kHz mono, Slaney-scale mel spectrogram matching Python Whisper (power spectrum, center-padding, 80-band filter bank).
HybridDecoderwith quality-gated temperature fallback matching faster-whisper SOTA: compression ratio gate (2.4), log-probability gate (−1.0), no-speech threshold (0.6), temperature sequence [0.0, 0.2, 0.4, 0.6, 0.8, 1.0].KvCache<B>+forward_decoder_cached: O(n) per step via static cross-attention K,V and growing self-attention K,V — ~2.6× speedup over naive O(n²) decoder.batch_mel_spectrograms: all audio chunks mel-encoded in a singleforward_encodercall before sequential decoding.transcribe_with_timestamps: per-token timestamps via cross-attention peak, ~100 ms precision.extract_speaker_embedding: mean-pool encoder output + L2-normalise → speaker fingerprint.TranscriptionResult/TranscriptionSegment: fully serde-serializable;token_timestampsandspeakerfields skipped when absent.- 0.8% average WER on LJSpeech
tiny.enbenchmark.
VAD and alignment (whisperforge-align)
VoiceActivityDetectorviaearshotwith configurable threshold.AudioSegmenter: splits audio into voice spans ≤30 s, filters silence.BatchedTranscriber<B>: real batch transcription pipeline —batch_mel_spectrograms→forward_encoder→ per-segment KV-cached greedy decode →HybridDecoder.SrtWriter/SrtEntry: generates well-formed SRT subtitle files.
Speaker diarization (whisperforge-diarize)
SpeakerDiarizer: agglomerative single-linkage clustering of speaker embeddings.cluster_embeddings: cosine similarity with configurable threshold,SPEAKER_NNlabel assignment.
Model conversion (whisperforge-convert)
- Converts OpenAI Whisper safetensors (via HuggingFace
hf-hub) to BurnNamedMpkformat. - Correct tensor name mapping for all encoder and decoder layers including cross-attention.
- Config auto-detection for all model sizes from tensor shapes.
CLI binary (whisperforge / wf)
- Two binary aliases:
whisperforgeandwf(same binary, shorter alias). --audio-file/-a: WAV input.--model/-m: model directory undermodels/(default:tiny_en_converted).--output-format text|srt|json: plain text, SRT subtitles, or JSON with segments and timestamps.--decoding-preset fast|balanced|accurate: pre-configured beam size and temperature sequences.--beam-size,--temperature,--length-penalty,--no-speech-threshold: per-run overrides.--vad-enabled/--vad-threshold: voice activity detection; silence segments skipped.--wgpu: WGPU GPU backend (Vulkan, DX12, or Metal — no CUDA required).--diarize/--diarize-threshold: speaker labels ([SPEAKER_NN]:) in SRT and JSON.--task transcribe|translate,--language: Whisper task and language selection.- Automatic 30 s chunking with 1 s overlap for long audio.
Notes
- Model files (
.mpk,.cfg,tokenizer.json) are not bundled — download from HuggingFace and convert withwhisperforge-convert. See README for instructions. - Requires Rust 1.85+ (Rust 2024 edition).
--wgpurequires Vulkan/DX12/Metal drivers.whisperforge-alignhas known pre-existing test failures; always excluded from the test suite.- Phase 7 Option B (ResNet293 speaker embeddings via burn-onnx) is deferred pending Burn 0.21 stable.