
Add MIGraphX execution mode for AMD GPUs #3

Open
maherr wants to merge 3 commits into avencera:master from maherr:add-migraphx-feature


Conversation


maherr commented Apr 19, 2026

Adds an `ExecutionMode::MiGraphX` variant gated behind a new `migraphx` Cargo feature. The new path forwards to ONNX Runtime's MIGraphX execution provider so users on AMD GPUs can get an ORT-accelerated path without touching the CUDA or CoreML code.

This is purely additive: existing modes, file layouts, and feature sets are unchanged.

What changed

`Cargo.toml`

  • New `migraphx = ["ort/migraphx"]` feature.
  • (First commit) Moves the two `[target.'cfg(...)'.dependencies]` tables for `ndarray-linalg-default` below the core `[dependencies]` block. In the current ordering, `ort`, `libloading`, `tracing`, `thiserror`, `crossbeam-channel`, and `rayon` all fall inside `[target.'cfg(not(target_arch = "x86_64"))'.dependencies]`, so `cargo check` fails on x86_64 with "unresolved module or unlinked crate `ort`". This is a pre-existing issue unrelated to MIGraphX, but the new feature can't be verified without the fix, so it's bundled here as a separate commit (see the sketch below). Happy to split it into a standalone PR if preferred.
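For context, a minimal sketch of the corrected layout. Versions and backend feature names here are placeholders, not the repo's actual pins:

```toml
# Untargeted dependencies come first. In TOML, a [table] header scopes
# every following key until the next header, so deps written after a
# [target.'cfg(...)'.dependencies] header silently join that table.
[dependencies]
ndarray-npy = "0.8"
ort = { version = "2", default-features = false }
tracing = "0.1"

# Target-specific tables go last, after the untargeted block.
[target.'cfg(target_arch = "x86_64")'.dependencies]
ndarray-linalg = { version = "0.16", features = ["intel-mkl-static"] }

[target.'cfg(not(target_arch = "x86_64"))'.dependencies]
ndarray-linalg = { version = "0.16", features = ["openblas-static"] }
```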

`src/inference.rs`

  • `ExecutionMode::MiGraphX` variant plus an `is_migraphx()` helper that mirrors `is_cuda()` / `is_coreml()`.
  • `validate()` returns the same feature-gated error pattern used by `coreml` and `cuda` when the feature is off.
  • `with_execution_mode()` attaches the MIGraphX EP with device 0 and `SameAsRequested` arena growth (sketched below).
  • New `migraphx_mode_requires_feature` unit test following the existing pattern.
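A sketch of the shape of these changes, assuming ort 2.x's provider-builder API; the MIGraphX builder's exact method names and the simplified error type here are assumptions, not confirmed crate surface:

```rust
use ort::session::builder::SessionBuilder;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ExecutionMode {
    Cpu,
    Cuda,
    CoreMl,
    MiGraphX,
}

impl ExecutionMode {
    pub fn is_migraphx(&self) -> bool {
        matches!(self, ExecutionMode::MiGraphX)
    }

    /// Mirrors the coreml/cuda pattern: selecting a mode whose Cargo
    /// feature is off is an error at validation time.
    pub fn validate(&self) -> Result<(), String> {
        if self.is_migraphx() && !cfg!(feature = "migraphx") {
            return Err("ExecutionMode::MiGraphX requires the `migraphx` feature".into());
        }
        Ok(())
    }
}

#[cfg(feature = "migraphx")]
fn with_execution_mode(
    builder: SessionBuilder,
    mode: ExecutionMode,
) -> ort::Result<SessionBuilder> {
    use ort::execution_providers::MIGraphXExecutionProvider;

    if mode.is_migraphx() {
        // Device 0; the SameAsRequested arena-growth setting would be
        // applied the same way the CUDA path does it (builder method
        // name assumed, omitted here).
        return builder.with_execution_providers([
            MIGraphXExecutionProvider::default().with_device_id(0).build(),
        ]);
    }
    Ok(builder)
}
```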

`src/models.rs` / `src/pipeline/config.rs`

  • `required_files(MiGraphX)` reuses the CPU file set, since MIGraphX loads stock ONNX models directly (no split-backend assets needed).
  • `segmentation_step_seconds(MiGraphX)` mirrors the CUDA step. Both are sketched below.
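Roughly, as match arms; the constants below are illustrative stand-ins for the repo's actual file lists and step values, and `ExecutionMode` is the enum from the sketch above:

```rust
// Placeholder asset lists; the real lists live in src/models.rs.
const CPU_MODEL_FILES: &[&str] = &["segmentation.onnx", "embedding.onnx"];
const CUDA_MODEL_FILES: &[&str] = &["segmentation.cuda.onnx", "embedding.cuda.onnx"];
const COREML_MODEL_FILES: &[&str] = &["segmentation.mlmodelc", "embedding.mlmodelc"];

// Placeholder step sizes; the point is only that MiGraphX takes the CUDA arm.
const GPU_STEP_SECONDS: f32 = 0.5;
const CPU_STEP_SECONDS: f32 = 2.0;

fn required_files(mode: ExecutionMode) -> &'static [&'static str] {
    match mode {
        // MIGraphX loads stock ONNX models, so it shares the CPU asset set.
        ExecutionMode::Cpu | ExecutionMode::MiGraphX => CPU_MODEL_FILES,
        ExecutionMode::Cuda => CUDA_MODEL_FILES,
        ExecutionMode::CoreMl => COREML_MODEL_FILES,
    }
}

fn segmentation_step_seconds(mode: ExecutionMode) -> f32 {
    match mode {
        // GPU backends take the denser CUDA step; CPU keeps its larger one.
        ExecutionMode::Cuda | ExecutionMode::MiGraphX => GPU_STEP_SECONDS,
        _ => CPU_STEP_SECONDS,
    }
}
```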

`src/pipeline.rs`

  • `ExecutionMode::MiGraphX` added to the concurrent inference path match arm (sketched below), so segmentation and embedding stream in parallel instead of running sequentially. Measured on an RX 9070:

    | Input              | speakrs wall (serial) | speakrs wall (concurrent) | Delta |
    |--------------------|-----------------------|---------------------------|-------|
    | 3-min call         | 12.34s                | 9.73s                     | -21%  |
    | 20-min VoxConverse | 61.87s                | 44.06s                    | -29%  |

    Scheduling-only win; segment counts and batching unchanged.
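The selector change itself is essentially one added arm. `inference_path()` and the `Sequential` routing are named in the commit message below; the `InferencePath` enum as written here is an assumption, and `ExecutionMode` is the enum from the earlier sketch:

```rust
enum InferencePath {
    Sequential,
    Concurrent,
}

fn inference_path(mode: ExecutionMode) -> InferencePath {
    match mode {
        // MiGraphX joins the arm CoreML and CUDA already use, so
        // run_concurrent_inference overlaps segmentation and embedding.
        ExecutionMode::Cuda | ExecutionMode::CoreMl | ExecutionMode::MiGraphX => {
            InferencePath::Concurrent
        }
        _ => InferencePath::Sequential,
    }
}
```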

Notes

  • Verified with `cargo check --all-features`, `cargo check --no-default-features`, and `cargo check --no-default-features --features migraphx` on x86_64 Linux.
  • Happy to split, squash, or rework in any shape that's easier to review.

Thanks for your time on this.

maherr added 3 commits April 19, 2026 12:07

The `[target.'cfg(target_arch = "x86_64")'.dependencies]` and
`[target.'cfg(not(target_arch = "x86_64"))'.dependencies]` tables were
declared between `ndarray-npy` and `ort`, so everything after them in the
`[dependencies]` section (`ort`, `libloading`, `tracing`, `thiserror`,
`crossbeam-channel`, `rayon`) was silently re-scoped into the second
target-specific table. On x86_64 those deps became unreachable and
`cargo check` failed with "unresolved module or unlinked crate `ort`".

Moving the two target tables below the untargeted deps restores the
intended scoping without changing which backend is selected on either
arch.

Adds an `ExecutionMode::MiGraphX` variant gated behind a new `migraphx`
Cargo feature, forwarding to ONNX Runtime's MIGraphX execution provider.
Users on AMD GPUs can now select an ORT-accelerated path without touching
the existing CUDA or CoreML code paths.

* New `migraphx = ["ort/migraphx"]` feature.
* `ExecutionMode::MiGraphX` variant plus an `is_migraphx()` helper that
  mirrors `is_cuda()` / `is_coreml()`.
* `validate()` returns the same feature-gated error pattern used by
  `coreml` and `cuda` when the feature is off.
* `with_execution_mode()` attaches the MIGraphX execution provider with
  device 0 and `SameAsRequested` arena growth. Users who need compiled
  graph caching can set the ORT-standard env vars
  (`ORT_MIGRAPHX_LOAD_COMPILED_MODEL`, `ORT_MIGRAPHX_SAVE_COMPILED_MODEL`,
  `ORT_MIGRAPHX_SAVE_COMPILE_PATH`); no programmatic cache configuration
  is added here.
* `required_files(MiGraphX)` reuses the CPU file set since MIGraphX loads
  stock ONNX models directly (no split-backend assets).
* `segmentation_step_seconds(MiGraphX)` mirrors the CUDA step.
* Added `migraphx_mode_requires_feature` unit test following the existing
  `coreml_modes_require_feature` / `cuda_modes_require_feature` pattern.

Verified against VoxConverse (232 files) on an RX 9070 (RDNA 4, gfx1201)
using a patched onnxruntime build: 10.65% strict DER at 15.47x realtime.
Background and patch set for the RDNA 4 ORT+MIGraphX stack:
https://maherr.dev/rdna4-missing-rung/

The `inference_path()` selector previously routed MIGraphX to `Sequential`,
so segmentation ran to completion before embedding began. The existing
`run_concurrent_inference` machinery (streaming segmentation windows over
a bounded crossbeam channel into `ConcurrentEmbeddingRunner::run_masked`)
already handles the MIGraphX `EmbeddingPath::Masked` case with no
MIGraphX-specific gaps. This one-line change adds `MiGraphX` to the
same match arm used for CoreML and CUDA.

Measured on an RX 9070 (gfx1201) with the MIGraphX provider built
against ORT 1.24.2:

- 3-min call, speakrs alone: 12.34s -> 9.1s (-26%)
- 20-min VoxConverse file, speakrs alone: 61.87s -> 44.06s (-28.8%)
- Gain scales with audio length: segmentation fully overlaps with
  embedding, so the CPU-side segmentation prelude is absorbed.
- Inside a parallel Whisper + speakrs wrapper the end-to-end saving
  is smaller (~9% on 3-min) because GPU contention on the shared
  device partially offsets the overlap, but it remains positive.
- Segment counts and batching are unchanged (10x32 + 1x11 on the
  3-min file, before and after). This is a scheduling change, not a
  modeling change.
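For the record, a minimal sketch of the streaming shape described above, built on a bounded crossbeam channel; `SegWindow` and `run_concurrent` are illustrative stand-ins, not speakrs's actual types:

```rust
use crossbeam_channel::bounded;
use std::thread;

// Illustrative stand-in for one segmentation window's output.
struct SegWindow;

fn run_concurrent(windows: Vec<SegWindow>) {
    // Small bound = backpressure: segmentation can't race far ahead of
    // embedding and balloon memory.
    let (tx, rx) = bounded::<SegWindow>(8);

    let seg = thread::spawn(move || {
        for w in windows {
            // Per-chunk segmentation inference would happen here.
            if tx.send(w).is_err() {
                break; // embedding side hung up
            }
        }
        // tx drops here, closing the channel and ending the consumer loop.
    });

    for _window in rx {
        // Embedding inference consumes windows as they arrive, overlapping
        // with segmentation still running on the other thread.
    }
    seg.join().unwrap();
}
```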
maherr added a commit to maherr/maherr.github.io that referenced this pull request Apr 21, 2026

TEST set: 15.47x -> 20.37x realtime, 10.65%/7.85%/6.90% ->
10.76%/7.96%/7.01% DER across the three conventions. Reflects the
streaming-segmentation optimization that shipped 2026-04-18 (concurrent
seg+emb dispatch in speakrs; details at ROCm/AMDMIGraphX#4792 and
avencera/speakrs#3). DEV numbers unchanged (13.63x chart, 6.84% strict
remain the pre-streaming baseline).

Also updates the per-dollar speed margin from 22% to 60% to reflect the
new 20.37x/$550 figure and adjusts the strict-DER margin vs pyannote 3.1
from -0.65pp to -0.54pp accordingly.

maherr commented May 1, 2026

Bumping in case this slipped past. The change is gated behind the new `migraphx` Cargo feature, so default builds are unaffected. Happy to split the 3 commits into separate PRs (target-table reorder, MIGraphX mode, concurrent inference path) if that would be easier to review, or to rebase if there's been drift.
