
Add MIGraphX execution mode for AMD GPUs #3

Open
maherr wants to merge 3 commits into avencera:master from maherr:add-migraphx-feature


Conversation


maherr commented Apr 19, 2026

Adds an `ExecutionMode::MiGraphX` variant gated behind a new `migraphx` Cargo feature. The new path forwards to ONNX Runtime's MIGraphX execution provider so users on AMD GPUs can get an ORT-accelerated path without touching the CUDA or CoreML code.

This is purely additive: existing modes, file layouts, and feature sets are unchanged.

What changed

`Cargo.toml`

  • New `migraphx = ["ort/migraphx"]` feature.
  • (First commit) Moves the two `[target.'cfg(...)'.dependencies]` tables for `ndarray-linalg-default` below the core `[dependencies]` block. In the current ordering, `ort`, `libloading`, `tracing`, `thiserror`, `crossbeam-channel`, and `rayon` all fall inside `[target.'cfg(not(target_arch = "x86_64"))'.dependencies]`, so `cargo check` fails on x86_64 with "unresolved module or unlinked crate `ort`". This is a pre-existing issue unrelated to MIGraphX, but the new feature can't be verified without the fix, so it's bundled here as a separate commit (see the sketch below). Happy to split it into a standalone PR if preferred.
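For context, a minimal sketch of the corrected layout. Versions and backend feature names here are placeholders, not the repo's actual pins:

```toml
# Untargeted dependencies come first. In TOML, a [table] header scopes
# every following key until the next header, so deps written after a
# [target.'cfg(...)'.dependencies] header silently join that table.
[dependencies]
ndarray-npy = "0.8"
ort = { version = "2", default-features = false }
tracing = "0.1"

# Target-specific tables go last, after the untargeted block.
[target.'cfg(target_arch = "x86_64")'.dependencies]
ndarray-linalg = { version = "0.16", features = ["intel-mkl-static"] }

[target.'cfg(not(target_arch = "x86_64"))'.dependencies]
ndarray-linalg = { version = "0.16", features = ["openblas-static"] }
```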

`src/inference.rs`

  • `ExecutionMode::MiGraphX` variant plus an `is_migraphx()` helper that mirrors `is_cuda()` / `is_coreml()`.
  • `validate()` returns the same feature-gated error pattern used by `coreml` and `cuda` when the feature is off.
  • `with_execution_mode()` attaches the MIGraphX EP with device 0 and `SameAsRequested` arena growth (sketched below).
  • New `migraphx_mode_requires_feature` unit test following the existing pattern.
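A sketch of the shape of these changes, assuming ort 2.x's provider-builder API; the MIGraphX builder's exact method names and the simplified error type here are assumptions, not confirmed crate surface:

```rust
use ort::session::builder::SessionBuilder;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ExecutionMode {
    Cpu,
    Cuda,
    CoreMl,
    MiGraphX,
}

impl ExecutionMode {
    pub fn is_migraphx(&self) -> bool {
        matches!(self, ExecutionMode::MiGraphX)
    }

    /// Mirrors the coreml/cuda pattern: selecting a mode whose Cargo
    /// feature is off is an error at validation time.
    pub fn validate(&self) -> Result<(), String> {
        if self.is_migraphx() && !cfg!(feature = "migraphx") {
            return Err("ExecutionMode::MiGraphX requires the `migraphx` feature".into());
        }
        Ok(())
    }
}

#[cfg(feature = "migraphx")]
fn with_execution_mode(
    builder: SessionBuilder,
    mode: ExecutionMode,
) -> ort::Result<SessionBuilder> {
    use ort::execution_providers::MIGraphXExecutionProvider;

    if mode.is_migraphx() {
        // Device 0; the SameAsRequested arena-growth setting would be
        // applied the same way the CUDA path does it (builder method
        // name assumed, omitted here).
        return builder.with_execution_providers([
            MIGraphXExecutionProvider::default().with_device_id(0).build(),
        ]);
    }
    Ok(builder)
}
```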

`src/models.rs` / `src/pipeline/config.rs`

  • `required_files(MiGraphX)` reuses the CPU file set, since MIGraphX loads stock ONNX models directly (no split-backend assets needed).
  • `segmentation_step_seconds(MiGraphX)` mirrors the CUDA step. Both are sketched below.
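Roughly, as match arms; the constants below are illustrative stand-ins for the repo's actual file lists and step values, and `ExecutionMode` is the enum from the sketch above:

```rust
// Placeholder asset lists; the real lists live in src/models.rs.
const CPU_MODEL_FILES: &[&str] = &["segmentation.onnx", "embedding.onnx"];
const CUDA_MODEL_FILES: &[&str] = &["segmentation.cuda.onnx", "embedding.cuda.onnx"];
const COREML_MODEL_FILES: &[&str] = &["segmentation.mlmodelc", "embedding.mlmodelc"];

// Placeholder step sizes; the point is only that MiGraphX takes the CUDA arm.
const GPU_STEP_SECONDS: f32 = 0.5;
const CPU_STEP_SECONDS: f32 = 2.0;

fn required_files(mode: ExecutionMode) -> &'static [&'static str] {
    match mode {
        // MIGraphX loads stock ONNX models, so it shares the CPU asset set.
        ExecutionMode::Cpu | ExecutionMode::MiGraphX => CPU_MODEL_FILES,
        ExecutionMode::Cuda => CUDA_MODEL_FILES,
        ExecutionMode::CoreMl => COREML_MODEL_FILES,
    }
}

fn segmentation_step_seconds(mode: ExecutionMode) -> f32 {
    match mode {
        // GPU backends take the denser CUDA step; CPU keeps its larger one.
        ExecutionMode::Cuda | ExecutionMode::MiGraphX => GPU_STEP_SECONDS,
        _ => CPU_STEP_SECONDS,
    }
}
```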

`src/pipeline.rs`

  • `ExecutionMode::MiGraphX` added to the concurrent inference path match arm (sketched below), so segmentation and embedding stream in parallel instead of running sequentially. Measured on an RX 9070:

    | Input              | speakrs wall (serial) | speakrs wall (concurrent) | Delta |
    |--------------------|-----------------------|---------------------------|-------|
    | 3-min call         | 12.34s                | 9.73s                     | -21%  |
    | 20-min VoxConverse | 61.87s                | 44.06s                    | -29%  |

    Scheduling-only win; segment counts and batching unchanged.
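The selector change itself is essentially one added arm. `inference_path()` and the `Sequential` routing are named in the commit message below; the `InferencePath` enum as written here is an assumption, and `ExecutionMode` is the enum from the earlier sketch:

```rust
enum InferencePath {
    Sequential,
    Concurrent,
}

fn inference_path(mode: ExecutionMode) -> InferencePath {
    match mode {
        // MiGraphX joins the arm CoreML and CUDA already use, so
        // run_concurrent_inference overlaps segmentation and embedding.
        ExecutionMode::Cuda | ExecutionMode::CoreMl | ExecutionMode::MiGraphX => {
            InferencePath::Concurrent
        }
        _ => InferencePath::Sequential,
    }
}
```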

Notes

  • Verified with `cargo check --all-features`, `cargo check --no-default-features`, and `cargo check --no-default-features --features migraphx` on x86_64 Linux.
  • Happy to split, squash, or rework in any shape that's easier to review.

Thanks for your time on this.

maherr added 3 commits April 19, 2026 12:07

The `[target.'cfg(target_arch = "x86_64")'.dependencies]` and
`[target.'cfg(not(target_arch = "x86_64"))'.dependencies]` tables were
declared between `ndarray-npy` and `ort`, so everything after them in the
`[dependencies]` section (`ort`, `libloading`, `tracing`, `thiserror`,
`crossbeam-channel`, `rayon`) was silently re-scoped into the second
target-specific table. On x86_64 those deps became unreachable and
`cargo check` failed with "unresolved module or unlinked crate `ort`".

Moving the two target tables below the untargeted deps restores the
intended scoping without changing which backend is selected on either
arch.

Adds an `ExecutionMode::MiGraphX` variant gated behind a new `migraphx`
Cargo feature, forwarding to ONNX Runtime's MIGraphX execution provider.
Users on AMD GPUs can now select an ORT-accelerated path without touching
the existing CUDA or CoreML code paths.

* New `migraphx = ["ort/migraphx"]` feature.
* `ExecutionMode::MiGraphX` variant plus an `is_migraphx()` helper that
  mirrors `is_cuda()` / `is_coreml()`.
* `validate()` returns the same feature-gated error pattern used by
  `coreml` and `cuda` when the feature is off.
* `with_execution_mode()` attaches the MIGraphX execution provider with
  device 0 and `SameAsRequested` arena growth. Users who need compiled
  graph caching can set the ORT-standard env vars
  (`ORT_MIGRAPHX_LOAD_COMPILED_MODEL`, `ORT_MIGRAPHX_SAVE_COMPILED_MODEL`,
  `ORT_MIGRAPHX_SAVE_COMPILE_PATH`); no programmatic cache configuration
  is added here.
* `required_files(MiGraphX)` reuses the CPU file set since MIGraphX loads
  stock ONNX models directly (no split-backend assets).
* `segmentation_step_seconds(MiGraphX)` mirrors the CUDA step.
* Added `migraphx_mode_requires_feature` unit test following the existing
  `coreml_modes_require_feature` / `cuda_modes_require_feature` pattern.

Verified against VoxConverse (232 files) on an RX 9070 (RDNA 4, gfx1201)
using a patched onnxruntime build: 10.65% strict DER at 15.47x realtime.
Background and patch set for the RDNA 4 ORT+MIGraphX stack:
https://maherr.dev/rdna4-missing-rung/

The `inference_path()` selector previously routed MIGraphX to `Sequential`,
so segmentation ran to completion before embedding began. The existing
`run_concurrent_inference` machinery (streaming segmentation windows over
a bounded crossbeam channel into `ConcurrentEmbeddingRunner::run_masked`)
already handles the MIGraphX `EmbeddingPath::Masked` case with no
MIGraphX-specific gaps. This one-line change adds `MiGraphX` to the
same match arm used for CoreML and CUDA.

Measured on an RX 9070 (gfx1201) with the MIGraphX provider built
against ORT 1.24.2:

- 3-min call, speakrs alone: 12.34s -> 9.1s (-26%)
- 20-min VoxConverse file, speakrs alone: 61.87s -> 44.06s (-28.8%)
- Gain scales with audio length: segmentation fully overlaps with
  embedding, so the CPU-side segmentation prelude is absorbed.
- Inside a parallel Whisper + speakrs wrapper the end-to-end saving
  is smaller (~9% on 3-min) because GPU contention on the shared
  device partially offsets the overlap, but it remains positive.
- Segment counts and batching are unchanged (10x32 + 1x11 on the
  3-min file, before and after). This is a scheduling change, not a
  modeling change.
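For the record, a minimal sketch of the streaming shape described above, built on a bounded crossbeam channel; `SegWindow` and `run_concurrent` are illustrative stand-ins, not speakrs's actual types:

```rust
use crossbeam_channel::bounded;
use std::thread;

// Illustrative stand-in for one segmentation window's output.
struct SegWindow;

fn run_concurrent(windows: Vec<SegWindow>) {
    // Small bound = backpressure: segmentation can't race far ahead of
    // embedding and balloon memory.
    let (tx, rx) = bounded::<SegWindow>(8);

    let seg = thread::spawn(move || {
        for w in windows {
            // Per-chunk segmentation inference would happen here.
            if tx.send(w).is_err() {
                break; // embedding side hung up
            }
        }
        // tx drops here, closing the channel and ending the consumer loop.
    });

    for _window in rx {
        // Embedding inference consumes windows as they arrive, overlapping
        // with segmentation still running on the other thread.
    }
    seg.join().unwrap();
}
```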
maherr added a commit to maherr/maherr.github.io that referenced this pull request Apr 21, 2026

TEST set: 15.47x -> 20.37x realtime, 10.65%/7.85%/6.90% ->
10.76%/7.96%/7.01% DER across the three conventions. Reflects the
streaming-segmentation optimization that shipped 2026-04-18 (concurrent
seg+emb dispatch in speakrs; details at ROCm/AMDMIGraphX#4792 and
avencera/speakrs#3). DEV numbers unchanged (13.63x chart, 6.84% strict
remain the pre-streaming baseline).

Also updates the per-dollar speed margin from 22% to 60% to reflect the
new 20.37x/$550 figure and adjusts the strict-DER margin vs pyannote 3.1
from -0.65pp to -0.54pp accordingly.

maherr commented May 1, 2026

Bumping in case this slipped past. The change is gated behind the new `migraphx` Cargo feature, so default builds are unaffected. Happy to split the 3 commits into separate PRs (target-table reorder, MIGraphX mode, concurrent inference path) if that would be easier to review, or to rebase if there's been drift.
