VoxrtAsr for iOS

Streaming on-device speech recognition on the VoxRT custom inference runtime. NeMo FastConformer (32M parameters), 16 kHz mono in, P&C-aware text out, cache-aware streaming with ~1.1 s chunks.

Current version: v0.1.2
Minimum iOS: 16.0
Architectures shipped: arm64 (iPhone / iPad, NEON-accelerated)
License: Apache-2.0 (Swift wrapper) · proprietary (compiled runtime, redistribution allowed via this Swift Package)
Upstream model license: CC-BY-4.0 (NVIDIA NeMo)

What is VoxRT?

VoxRT is a from-scratch inference runtime for on-device speech models. No ONNX Runtime, no PyTorch Mobile, no LiteRT — a custom Rust core sized and tuned for streaming voice workloads on phone-class hardware.

VoxrtAsr is the streaming-ASR product on that runtime, alongside the free VoxrtSilero VAD demo. Both share the same runtime crate and the same NEON kernel set. The runtime is the product; the models are what it runs.

Commercial wake-word / KWS / domain-specific ASR models built on the same runtime live at voxrt.com.

Performance

Measured at ship time, arm64 device build, post-warmup, RTF = wall-time-per-chunk ÷ chunk audio duration (lower is better):

Device	RTF	per-chunk latency
iPhone 13 Pro Max (A15 Bionic)	0.08–0.10	~90 ms / 1.12 s chunk

At RTF ≈ 0.10 you've got ~90 % of one core free during live transcription. CTC mode is ~15 % cheaper per chunk than RNN-T at the cost of marginally lower accuracy (CTC: 4.895 % WER on LibriSpeech test-clean vs 3.267 % for RNN-T).

Binary footprint

Swift wrapper source: ~20 KB total
VoxrtAsrNative.xcframework (compressed): ~5 MB device slice
Streaming model streaming_medium_pc.vxrt: ~61 MB fp16 on disk (downloaded separately)
Native heap at runtime: ~150 MB steady-state (weights expand to f32 for inference; mmap'd zero-copy at load time)

Install

In Xcode: File → Add Package Dependencies → paste:

https://github.com/VoxRT/voxrt-asr-ios

…and pin to v0.1.2.

Or in Package.swift:

dependencies: [
    .package(url: "https://github.com/VoxRT/voxrt-asr-ios.git", from: "0.1.2"),
],

Get the streaming model

The model weights are NOT bundled — you fetch them once from voxrt-asr-models:

https://github.com/VoxRT/voxrt-asr-models/releases/download/v0.1.2/streaming_medium_pc.vxrt

SHA-256: 93416c104937ee28b4d69f34ce3f858c2d536bae8eddb25f3bda204b7f9fdec8

You decide where it lives. Three common patterns:

Bundle in app resources — drag streaming_medium_pc.vxrt into your Xcode project. Works offline from first launch. Adds ~61 MB to your app.
Download on first run — URLSession fetch into FileManager.default.urls(for: .applicationSupportDirectory, ...). Smaller App Store binary; needs network at first launch.
App Thinning / On-Demand Resources — Apple's per-asset delivery if you want App Store to host the file.

Quick start

import VoxrtAsr

// 1. Resolve the bundled model URL.
guard let modelURL = Bundle.main.url(forResource: "streaming_medium_pc",
                                     withExtension: "vxrt") else {
    fatalError("streaming_medium_pc.vxrt not found in bundle")
}

// 2. Build the engine. `init(modelURL:)` memory-maps the file via
//    `Data(contentsOf:options: .mappedIfSafe)` under the hood — no
//    eager copy. RNN-T decoder is the recommended default
//    (higher accuracy); pass `.ctc` for the ~15 % cheaper head.
let engine = try VoxrtAsrStreamingEngine(modelURL: modelURL)

// (Convenience: same as above for the default bundle + name)
//    let engine = try VoxrtAsrStreamingEngine.fromBundleResource()

// 3. Feed PCM (Float32, 16 kHz, mono, [-1, 1]) blocks of any size.
//    processPcm returns the text emitted during this call — often
//    "" until ~1.12 s of audio has accumulated, then non-empty
//    every chunk boundary.
let delta = try engine.processPcm(pcmFloatArray)
if !delta.isEmpty {
    print("delta: \(delta)")
}

// 4. When the utterance ends, drain the tail.
let tail = try engine.stop()

engine.processPcm / stop / reset are synchronous and stateful — same shape as VoxrtSileroVadEngine.processPcm in the companion VAD library. The engine does NOT own a worker thread. You drive it from your own capture / IO thread.

Live microphone example

The canonical streaming pattern — capture-thread owns the AVAudioEngine tap, engine is just a stateful function.

import AVFAudio
import VoxrtAsr

// NOTE: tap callbacks fire on a real-time audio thread. Pre-size
// + reuse buffers; do not allocate per callback in production.

let session = AVAudioSession.sharedInstance()
try session.setCategory(.playAndRecord, mode: .measurement)
try session.setActive(true)

let audioEngine = AVAudioEngine()
let input = audioEngine.inputNode
let hwFormat = input.outputFormat(forBus: 0)        // 44.1 / 48 kHz
let voxrtFormat = AVAudioFormat(                    // engine target
    commonFormat: .pcmFormatFloat32,
    sampleRate: 16_000,
    channels: 1,
    interleaved: true,
)!
let converter = AVAudioConverter(from: hwFormat, to: voxrtFormat)!

guard let modelURL = Bundle.main.url(forResource: "streaming_medium_pc",
                                     withExtension: "vxrt") else { fatalError() }
let asr = try VoxrtAsrStreamingEngine(modelURL: modelURL)

// 3200 samples @ 16 kHz = 200 ms — the recommended push block.
let scratchCapacity: AVAudioFrameCount = 3_200
let voxrtBuf = AVAudioPCMBuffer(pcmFormat: voxrtFormat,
                                 frameCapacity: scratchCapacity)!
var cumulativeTranscript = ""

input.installTap(
    onBus: 0,
    bufferSize: 4_096,
    format: hwFormat
) { hwBuf, _ in
    voxrtBuf.frameLength = 0
    var error: NSError?
    converter.convert(to: voxrtBuf, error: &error) { _, status in
        status.pointee = .haveData
        return hwBuf
    }
    if error != nil { return }
    guard let f32 = voxrtBuf.floatChannelData?[0] else { return }
    let n = Int(voxrtBuf.frameLength)
    let samples = Array(UnsafeBufferPointer(start: f32, count: n))

    do {
        let delta = try asr.processPcm(samples)
        if !delta.isEmpty {
            DispatchQueue.main.async {
                cumulativeTranscript += delta
                // update UI with cumulativeTranscript
            }
        }
    } catch {
        // surface error to UI
    }
}

try audioEngine.start()
// ... later, on stop:
audioEngine.stop()
input.removeTap(onBus: 0)
let tail = try asr.stop()
if !tail.isEmpty {
    DispatchQueue.main.async { cumulativeTranscript += tail }
}

Audio contract

Sample rate: 16 000 Hz. No automatic resampling. Phone mic hardware delivers 44.1 / 48 kHz to AVAudioEngine; convert via AVAudioConverter to 16 kHz Float32 mono before feeding processPcm. Feeding the wrong rate is the #1 source of "transcript is gibberish" bugs.
Sample format: [Float] PCM in [-1, 1], mono, native endian.
Buffer size: any. The engine internally accumulates to its steady-state chunk size (17 920 samples ≈ 1.12 s) and emits text every chunk.
Latency: one chunk (~1.12 s) of inherent buffering. Output text becomes available chunk-by-chunk from processPcm return values.

Threading

The engine is a synchronous, stateful function. It does NOT own a queue. Each processPcm call blocks on the calling thread for the duration of the inference work — typically the AVAudioEngine tap thread for live mic. Marshal text deltas back to UI via DispatchQueue.main.async (or your concurrency framework of choice).
One instance is single-thread-at-a-time. Serialise processPcm / stop / reset against each other on a given instance.
One engine instance handles a stream of utterances. Between utterances, call engine.reset() to zero the K/V cache + LSTM state without paying weight-load cost again.

Permissions

iOS requires a usage-description string for microphone access. Add to your app's Info.plist:

<key>NSMicrophoneUsageDescription</key>
<string>Used for on-device speech recognition.</string>

AVAudioSession.requestRecordPermission(...) triggers the user prompt the first time mic capture is initiated. Without the Info.plist key the app crashes with a privacy-violation exception on first request.

Decoder selection

Recommended: RNN-T — higher accuracy, modest extra cost. This is the SDK default; you only need to pass an explicit decoder constant if you specifically want CTC.

Decoder	Constant	WER on LibriSpeech-500	Per-chunk cost	When to use
RNN-T ★	`.rnnt`	3.267 %	~50 ms	Recommended default. Higher accuracy. LSTM state survives chunk boundaries.
CTC	`.ctc`	4.895 %	~5 ms	Battery-constrained long sessions, or background transcription where the ~1.6 % WER hit is acceptable.

Both decoders run the same Conformer encoder; the head is selected at session-create time. TDT is not supported on streaming-medium-pc (no duration head) — passing .tdt fails the session creation.

Architectures roadmap

v0.1.2 ships only arm64 for physical devices, NEON-optimized. Simulator slices (arm64-sim + x86_64) are included for build convenience but are not part of the supported production target list.

Target	Status
iOS arm64 (device)	✅ Shipped
iOS arm64 simulator	✅ Shipped (build-time only)
iOS x86_64 simulator	✅ Shipped (build-time only)
macOS arm64	🟡 Coming soon
macOS x86_64 (AVX)	🟡 Coming soon
visionOS / tvOS / watchOS	☁️ On request

Project layout

voxrt-asr-ios/
├── Package.swift                # SPM manifest (binaryTarget URL + checksum)
├── README.md                    # this file
├── LICENSE                      # Swift wrapper terms (Apache-2.0)
├── LICENSE-BINARY               # compiled runtime terms (proprietary)
└── Sources/
    └── VoxrtAsr/
        └── VoxrtAsr.swift       # Swift wrapper (open, Apache-2.0)

The compiled VoxrtAsrNative.xcframework lives on the GitHub Release page for this tag, downloaded by SwiftPM via the URL+checksum pinned in Package.swift.

License

The Swift wrapper (Sources/VoxrtAsr/) is licensed under Apache-2.0. See LICENSE.
The compiled VoxrtAsrNative.xcframework is proprietary VoxRT runtime code owned by Elephant Enterprises LLC, redistributable as part of this unmodified Swift Package. See LICENSE-BINARY.
The streaming-medium-pc model weights are derived from nvidia/stt_en_fastconformer_hybrid_medium_streaming_80ms_pc, released under CC-BY-4.0. Attribution travels with the model on the voxrt-asr-models repo.
Commercial integration / custom-model packaging questions: help@voxrt.com.

Links

VoxRT runtime + commercial models: voxrt.com
Android counterpart: voxrt-asr-android
ASR model weights & versions: voxrt-asr-models
VAD companion: voxrt-silero-ios
Bugs / questions: open an issue on this repo

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

VoxrtAsr for iOS

What is VoxRT?

Performance

Binary footprint

Install

Get the streaming model

Quick start

Live microphone example

Audio contract

Threading

Permissions

Decoder selection

Architectures roadmap

Project layout

License

Links

About

Licenses found

Uh oh!

Releases 2

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
Sources/VoxrtAsr		Sources/VoxrtAsr
.gitignore		.gitignore
LICENSE		LICENSE
LICENSE-BINARY		LICENSE-BINARY
Package.swift		Package.swift
README.md		README.md

Folders and files

Latest commit

History

Repository files navigation

VoxrtAsr for iOS

What is VoxRT?

Performance

Binary footprint

Install

Get the streaming model

Quick start

Live microphone example

Audio contract

Threading

Permissions

Decoder selection

Architectures roadmap

Project layout

License

Links

About

Topics

Resources

License

Licenses found

Uh oh!

Stars

Watchers

Forks

Releases 2

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages