Streaming on-device speech recognition on the VoxRT custom inference runtime. NeMo FastConformer (32M parameters), 16 kHz mono in, P&C-aware text out, cache-aware streaming with ~1.1 s chunks.
- Current version:
v0.1.2 - Minimum iOS: 16.0
- Architectures shipped:
arm64(iPhone / iPad, NEON-accelerated) - License: Apache-2.0 (Swift wrapper) · proprietary (compiled runtime, redistribution allowed via this Swift Package)
- Upstream model license: CC-BY-4.0 (NVIDIA NeMo)
VoxRT is a from-scratch inference runtime for on-device speech models. No ONNX Runtime, no PyTorch Mobile, no LiteRT — a custom Rust core sized and tuned for streaming voice workloads on phone-class hardware.
VoxrtAsr is the streaming-ASR product on that runtime, alongside the free VoxrtSilero VAD demo. Both share the same runtime crate and the same NEON kernel set. The runtime is the product; the models are what it runs.
Commercial wake-word / KWS / domain-specific ASR models built on the same runtime live at voxrt.com.
Measured at ship time, arm64 device build, post-warmup, RTF = wall-time-per-chunk ÷ chunk audio duration (lower is better):
| Device | RTF | per-chunk latency |
|---|---|---|
| iPhone 13 Pro Max (A15 Bionic) | 0.08–0.10 | ~90 ms / 1.12 s chunk |
At RTF ≈ 0.10 you've got ~90 % of one core free during live transcription. CTC mode is ~15 % cheaper per chunk than RNN-T at the cost of marginally lower accuracy (CTC: 4.895 % WER on LibriSpeech test-clean vs 3.267 % for RNN-T).
- Swift wrapper source: ~20 KB total
VoxrtAsrNative.xcframework(compressed): ~5 MB device slice- Streaming model
streaming_medium_pc.vxrt: ~61 MB fp16 on disk (downloaded separately) - Native heap at runtime: ~150 MB steady-state (weights expand to f32 for inference; mmap'd zero-copy at load time)
In Xcode: File → Add Package Dependencies → paste:
https://github.com/VoxRT/voxrt-asr-ios
…and pin to v0.1.2.
Or in Package.swift:
dependencies: [
.package(url: "https://github.com/VoxRT/voxrt-asr-ios.git", from: "0.1.2"),
],The model weights are NOT bundled — you fetch them once from
voxrt-asr-models:
https://github.com/VoxRT/voxrt-asr-models/releases/download/v0.1.2/streaming_medium_pc.vxrt
SHA-256: 93416c104937ee28b4d69f34ce3f858c2d536bae8eddb25f3bda204b7f9fdec8
You decide where it lives. Three common patterns:
- Bundle in app resources — drag
streaming_medium_pc.vxrtinto your Xcode project. Works offline from first launch. Adds ~61 MB to your app. - Download on first run —
URLSessionfetch intoFileManager.default.urls(for: .applicationSupportDirectory, ...). Smaller App Store binary; needs network at first launch. - App Thinning / On-Demand Resources — Apple's per-asset delivery if you want App Store to host the file.
import VoxrtAsr
// 1. Resolve the bundled model URL.
guard let modelURL = Bundle.main.url(forResource: "streaming_medium_pc",
withExtension: "vxrt") else {
fatalError("streaming_medium_pc.vxrt not found in bundle")
}
// 2. Build the engine. `init(modelURL:)` memory-maps the file via
// `Data(contentsOf:options: .mappedIfSafe)` under the hood — no
// eager copy. RNN-T decoder is the recommended default
// (higher accuracy); pass `.ctc` for the ~15 % cheaper head.
let engine = try VoxrtAsrStreamingEngine(modelURL: modelURL)
// (Convenience: same as above for the default bundle + name)
// let engine = try VoxrtAsrStreamingEngine.fromBundleResource()
// 3. Feed PCM (Float32, 16 kHz, mono, [-1, 1]) blocks of any size.
// processPcm returns the text emitted during this call — often
// "" until ~1.12 s of audio has accumulated, then non-empty
// every chunk boundary.
let delta = try engine.processPcm(pcmFloatArray)
if !delta.isEmpty {
print("delta: \(delta)")
}
// 4. When the utterance ends, drain the tail.
let tail = try engine.stop()engine.processPcm / stop / reset are synchronous and stateful — same shape as VoxrtSileroVadEngine.processPcm in the companion VAD library. The engine does NOT own a worker thread. You drive it from your own capture / IO thread.
The canonical streaming pattern — capture-thread owns the AVAudioEngine tap, engine is just a stateful function.
import AVFAudio
import VoxrtAsr
// NOTE: tap callbacks fire on a real-time audio thread. Pre-size
// + reuse buffers; do not allocate per callback in production.
let session = AVAudioSession.sharedInstance()
try session.setCategory(.playAndRecord, mode: .measurement)
try session.setActive(true)
let audioEngine = AVAudioEngine()
let input = audioEngine.inputNode
let hwFormat = input.outputFormat(forBus: 0) // 44.1 / 48 kHz
let voxrtFormat = AVAudioFormat( // engine target
commonFormat: .pcmFormatFloat32,
sampleRate: 16_000,
channels: 1,
interleaved: true,
)!
let converter = AVAudioConverter(from: hwFormat, to: voxrtFormat)!
guard let modelURL = Bundle.main.url(forResource: "streaming_medium_pc",
withExtension: "vxrt") else { fatalError() }
let asr = try VoxrtAsrStreamingEngine(modelURL: modelURL)
// 3200 samples @ 16 kHz = 200 ms — the recommended push block.
let scratchCapacity: AVAudioFrameCount = 3_200
let voxrtBuf = AVAudioPCMBuffer(pcmFormat: voxrtFormat,
frameCapacity: scratchCapacity)!
var cumulativeTranscript = ""
input.installTap(
onBus: 0,
bufferSize: 4_096,
format: hwFormat
) { hwBuf, _ in
voxrtBuf.frameLength = 0
var error: NSError?
converter.convert(to: voxrtBuf, error: &error) { _, status in
status.pointee = .haveData
return hwBuf
}
if error != nil { return }
guard let f32 = voxrtBuf.floatChannelData?[0] else { return }
let n = Int(voxrtBuf.frameLength)
let samples = Array(UnsafeBufferPointer(start: f32, count: n))
do {
let delta = try asr.processPcm(samples)
if !delta.isEmpty {
DispatchQueue.main.async {
cumulativeTranscript += delta
// update UI with cumulativeTranscript
}
}
} catch {
// surface error to UI
}
}
try audioEngine.start()
// ... later, on stop:
audioEngine.stop()
input.removeTap(onBus: 0)
let tail = try asr.stop()
if !tail.isEmpty {
DispatchQueue.main.async { cumulativeTranscript += tail }
}- Sample rate: 16 000 Hz. No automatic resampling. Phone mic hardware delivers 44.1 / 48 kHz to
AVAudioEngine; convert viaAVAudioConverterto 16 kHz Float32 mono before feedingprocessPcm. Feeding the wrong rate is the #1 source of "transcript is gibberish" bugs. - Sample format:
[Float]PCM in[-1, 1], mono, native endian. - Buffer size: any. The engine internally accumulates to its steady-state chunk size (17 920 samples ≈ 1.12 s) and emits text every chunk.
- Latency: one chunk (~1.12 s) of inherent buffering. Output text becomes available chunk-by-chunk from
processPcmreturn values.
- The engine is a synchronous, stateful function. It does NOT own a queue. Each
processPcmcall blocks on the calling thread for the duration of the inference work — typically theAVAudioEnginetap thread for live mic. Marshal text deltas back to UI viaDispatchQueue.main.async(or your concurrency framework of choice). - One instance is single-thread-at-a-time. Serialise
processPcm/stop/resetagainst each other on a given instance. - One engine instance handles a stream of utterances. Between utterances, call
engine.reset()to zero the K/V cache + LSTM state without paying weight-load cost again.
iOS requires a usage-description string for microphone access. Add to your app's Info.plist:
<key>NSMicrophoneUsageDescription</key>
<string>Used for on-device speech recognition.</string>AVAudioSession.requestRecordPermission(...) triggers the user prompt the first time mic capture is initiated. Without the Info.plist key the app crashes with a privacy-violation exception on first request.
Recommended: RNN-T — higher accuracy, modest extra cost. This is the SDK default; you only need to pass an explicit decoder constant if you specifically want CTC.
| Decoder | Constant | WER on LibriSpeech-500 | Per-chunk cost | When to use |
|---|---|---|---|---|
| RNN-T ★ | .rnnt |
3.267 % | ~50 ms | Recommended default. Higher accuracy. LSTM state survives chunk boundaries. |
| CTC | .ctc |
4.895 % | ~5 ms | Battery-constrained long sessions, or background transcription where the ~1.6 % WER hit is acceptable. |
Both decoders run the same Conformer encoder; the head is selected at session-create time. TDT is not supported on streaming-medium-pc (no duration head) — passing .tdt fails the session creation.
v0.1.2 ships only arm64 for physical devices, NEON-optimized. Simulator slices (arm64-sim + x86_64) are included for build convenience but are not part of the supported production target list.
| Target | Status |
|---|---|
| iOS arm64 (device) | ✅ Shipped |
| iOS arm64 simulator | ✅ Shipped (build-time only) |
| iOS x86_64 simulator | ✅ Shipped (build-time only) |
| macOS arm64 | 🟡 Coming soon |
| macOS x86_64 (AVX) | 🟡 Coming soon |
| visionOS / tvOS / watchOS | ☁️ On request |
voxrt-asr-ios/
├── Package.swift # SPM manifest (binaryTarget URL + checksum)
├── README.md # this file
├── LICENSE # Swift wrapper terms (Apache-2.0)
├── LICENSE-BINARY # compiled runtime terms (proprietary)
└── Sources/
└── VoxrtAsr/
└── VoxrtAsr.swift # Swift wrapper (open, Apache-2.0)
The compiled VoxrtAsrNative.xcframework lives on the GitHub Release page for this tag, downloaded by SwiftPM via the URL+checksum pinned in Package.swift.
- The Swift wrapper (
Sources/VoxrtAsr/) is licensed under Apache-2.0. SeeLICENSE. - The compiled
VoxrtAsrNative.xcframeworkis proprietary VoxRT runtime code owned by Elephant Enterprises LLC, redistributable as part of this unmodified Swift Package. SeeLICENSE-BINARY. - The streaming-medium-pc model weights are derived from
nvidia/stt_en_fastconformer_hybrid_medium_streaming_80ms_pc, released under CC-BY-4.0. Attribution travels with the model on the voxrt-asr-models repo. - Commercial integration / custom-model packaging questions: help@voxrt.com.
- VoxRT runtime + commercial models: voxrt.com
- Android counterpart: voxrt-asr-android
- ASR model weights & versions: voxrt-asr-models
- VAD companion: voxrt-silero-ios
- Bugs / questions: open an issue on this repo