Fast Russian speech recognition on Apple Silicon — up to 330x realtime
MLX port of GigaAM-v3 (220M params, Conformer + CTC/RNNT) by Salute Developers. Produces punctuated, normalized text directly. No PyTorch required.
pip install git+https://github.com/aystream/gigaam-mlx.git

from gigaam_mlx import load_model, transcribe
model, tokenizer = load_model() # auto-downloads from HuggingFace
text = transcribe(model, tokenizer, "meeting.wav")
print(text)

# Transcribe any audio/video file (CTC — fast, default)
gigaam-mlx recording.mkv
# Use RNNT for higher quality
gigaam-mlx recording.mkv --model-type rnnt
# Output subtitles
gigaam-mlx call.wav --output-dir ./transcripts --format srt

Outputs .srt (subtitles) and .txt (plain text). Model weights download automatically on first run.
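For readers unfamiliar with the .srt format: each cue is a running index, a `HH:MM:SS,mmm --> HH:MM:SS,mmm` time range, and the text, followed by a blank line. The sketch below only illustrates that layout. The helper names `srt_timestamp` and `to_srt` and the example segments are hypothetical, and this is not how gigaam-mlx necessarily builds its subtitles internally.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: list[tuple[float, float, str]]) -> str:
    """Render (start, end, text) segments as SRT cues, numbered from 1."""
    cues = []
    for index, (start, end, text) in enumerate(segments, start=1):
        cues.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"
        )
    # Cues are separated by a blank line.
    return "\n".join(cues)


# Hypothetical segments: (start seconds, end seconds, text)
print(to_srt([(0.0, 2.5, "First sentence."), (2.5, 5.04, "Second sentence.")]))
```

Note that SRT uses a comma, not a period, before the milliseconds.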
MacBook Pro M2 Max, 20-second audio chunk (avg of 3 runs, warmed up):
| Backend | Model | Time | Realtime factor |
|---|---|---|---|
| MLX (this) | v3_e2e_ctc | 0.06s | ~330x |
| MLX (this) | v3_e2e_rnnt | 0.26s | ~77x |
| PyTorch MPS | v3_e2e_rnnt | 0.76s | ~26x |
| PyTorch CPU | v3_e2e_rnnt | 1.13s | ~18x |
| ONNX CPU | v3_e2e_ctc | 1.66s | ~12x |
Full 18-minute video: CTC 21.5s (~50x realtime), RNNT 25.0s (~42x realtime).
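The realtime factor reported above is the audio duration divided by the processing time. A minimal sketch of that arithmetic, using the timings from this section (the helper name `realtime_factor` is illustrative and not part of gigaam-mlx):

```python
def realtime_factor(audio_seconds: float, processing_seconds: float) -> float:
    """Return how many seconds of audio are processed per second of wall time."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


# 20-second chunk, timings taken from the benchmark table
chunk_ctc = realtime_factor(20.0, 0.06)
chunk_rnnt = realtime_factor(20.0, 0.26)

# 18-minute video (1080 s), end-to-end timing for CTC
video_ctc = realtime_factor(18 * 60, 21.5)

print(f"CTC chunk:  ~{chunk_ctc:.0f}x")
print(f"RNNT chunk: ~{chunk_rnnt:.0f}x")
print(f"CTC video:  ~{video_ctc:.0f}x")
```

The chunk figures measure a single warmed-up 20-second pass, while the full-video figures cover the whole file end to end, which is why the two sets of numbers differ.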
| Variant | Speed | Quality | Use case |
|---|---|---|---|
| CTC (default) | ~330x realtime | Good | Batch processing, speed-critical |
| RNNT | ~77x realtime | Better | When accuracy matters most |
# Higher quality with RNNT
model, tokenizer = load_model("rnnt")

- Up to 330x realtime on Apple Silicon (M1/M2/M3/M4)
- Russian + English — recognizes English words/terms in Russian speech
- Punctuation built-in — end-to-end model, no post-processing
- No PyTorch — pure MLX + librosa + numpy
- Any format — video and audio via ffmpeg (mkv, mp4, wav, mp3, ...)
- Auto-download — model weights from HuggingFace Hub
- macOS with Apple Silicon (M1+)
- Python >= 3.10
- ffmpeg (brew install ffmpeg)
GigaAM model family (source)
Audio/Video → ffmpeg (16kHz mono) → Mel spectrogram (librosa)
→ Conformer encoder (16 layers, 768d, 16 heads, RoPE)
→ CTC/RNNT head → greedy decode → punctuated text
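The first stage of the pipeline above converts any input to 16 kHz mono audio. One way to do that is to have ffmpeg write raw 16-bit PCM to stdout, as sketched below. The function names `build_ffmpeg_command` and `decode_audio` are hypothetical, and the exact flags gigaam-mlx passes to ffmpeg may differ. The flags themselves are standard ffmpeg options.

```python
import shutil
import subprocess


def build_ffmpeg_command(path: str, sample_rate: int = 16000) -> list[str]:
    """Build an ffmpeg command that decodes any input to mono 16-bit PCM on stdout."""
    return [
        "ffmpeg",
        "-nostdin",               # never wait for keyboard input
        "-loglevel", "error",     # keep stderr quiet unless something fails
        "-i", path,               # input: any container ffmpeg can read
        "-vn",                    # drop video streams
        "-ac", "1",               # downmix to mono
        "-ar", str(sample_rate),  # resample (16 kHz by default)
        "-f", "s16le",            # raw signed 16-bit little-endian samples
        "-",                      # write to stdout
    ]


def decode_audio(path: str) -> bytes:
    """Run ffmpeg and return the raw PCM bytes (requires ffmpeg on PATH)."""
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg not found; install it with: brew install ffmpeg")
    result = subprocess.run(
        build_ffmpeg_command(path), capture_output=True, check=True
    )
    return result.stdout


print(" ".join(build_ffmpeg_command("recording.mkv")))
```

The returned bytes can be viewed as int16 samples and scaled to floats in [-1, 1] before computing the mel spectrogram.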
The model is a 220M parameter Conformer pretrained on 700,000 hours of Russian speech. The v3_e2e_ctc variant produces punctuated, normalized text directly — no language model or post-processing needed.
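Greedy CTC decoding, the step that makes a language model unnecessary here, is simple: take the most likely label for each frame, collapse consecutive repeats, then drop the blank symbol. The sketch below illustrates the technique on a toy character vocabulary. The function name, the vocabulary, and the blank index are assumptions for illustration; the real model uses its own tokenizer, and RNNT decoding works differently and is not shown.

```python
def ctc_greedy_decode(frame_ids: list[int], vocab: list[str], blank_id: int) -> str:
    """Collapse repeated per-frame labels, then drop blanks.

    frame_ids holds the argmax label for each encoder frame.
    """
    tokens = []
    previous = None
    for token_id in frame_ids:
        # Emit a token only when it differs from the previous frame
        # and is not the blank symbol.
        if token_id != previous and token_id != blank_id:
            tokens.append(vocab[token_id])
        previous = token_id
    return "".join(tokens)


# Toy vocabulary; index 0 is the blank symbol (an assumption for this sketch).
vocab = ["<blank>", "H", "e", "l", "o", "."]

# A blank between the two "l" runs keeps them from being merged into one.
frames = [1, 1, 0, 2, 3, 3, 0, 3, 4, 4, 0, 5]
print(ctc_greedy_decode(frames, vocab, blank_id=0))
```

Because the end-to-end variants include punctuation in their output vocabulary, punctuation marks come out of this step like any other token.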
pip install gigaam-mlx[convert]
python -m gigaam_mlx.convert --model v3_e2e_ctc --output-dir ./weights_ctc
python -m gigaam_mlx.convert --model v3_e2e_rnnt --output-dir ./weights_rnnt

- GigaAM by Salute Developers / SberDevices — original model (paper, InterSpeech 2025)
- MLX by Apple — ML framework for Apple Silicon
- ai-sage/GigaAM-v3 — HuggingFace transformers integration
MIT — same as the original GigaAM model.