A direct Neural Engine encoder backend, ~2x faster than CoreML and faster than Metal #3903

sbryngelson · 2026-06-22T08:10:47Z

I built a whisper.cpp encoder backend that runs the encoder directly on the Apple Neural
Engine via ANEForge (no CoreML). On M-series it is the fastest of the three Apple-Silicon
paths: about 2x faster than the CoreML encoder at every size, 1.1x to 1.3x faster than
Metal on tiny and base and roughly on par with it on small and medium, and it uses ~5x
less energy.

model	ANEForge (ms)	CoreML (ms)	Metal (ms)	vs CoreML	vs Metal
tiny	5.7	11.2	7.3	2.0x	1.3x
base	12.2	23.0	13.5	1.9x	1.1x
small	40.3	77.2	40.9	1.9x	1.0x
medium	117.6	236	120	2.0x	1.0x
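
The two ratio columns follow directly from the timings. As a quick check, here is a minimal Python sketch that recomputes them; the dictionary keys and the `speedups` helper are illustrative names, and the numbers are copied from the table rather than re-measured.

```python
# Encoder times in ms, copied from the table in this post.
times_ms = {
    "tiny":   {"aneforge": 5.7,   "coreml": 11.2,  "metal": 7.3},
    "base":   {"aneforge": 12.2,  "coreml": 23.0,  "metal": 13.5},
    "small":  {"aneforge": 40.3,  "coreml": 77.2,  "metal": 40.9},
    "medium": {"aneforge": 117.6, "coreml": 236.0, "metal": 120.0},
}

def speedups(row):
    """Return (vs CoreML, vs Metal) as ratios rounded to one decimal."""
    return (
        round(row["coreml"] / row["aneforge"], 1),
        round(row["metal"] / row["aneforge"], 1),
    )

for model, row in times_ms.items():
    vs_coreml, vs_metal = speedups(row)
    print(f"{model}: {vs_coreml}x vs CoreML, {vs_metal}x vs Metal")
```

Rounded to one decimal, the Metal ratio for small and medium comes out at 1.0x, so the margin over Metal at those sizes is within a few percent.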

It produces the same transcripts as the reference encoder (cosine similarity 0.999), so the
speedup is not a quality tradeoff. The speed comes from an ANE-native layout: channels-first
1x1-conv projections and query-tiled attention, so the full score matrix is never materialized.
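
To illustrate the query-tiling idea only: because softmax is taken per query row, attention can be computed one block of queries at a time, and the result matches the untiled computation. The sketch below is plain Python, not the ANEForge kernel (which is not shown in this thread); the function names, tile size, and matrix shapes are all assumptions chosen for the example.

```python
import math
import random

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Reference attention: builds the full len(q) x len(k) score matrix."""
    scale = 1.0 / math.sqrt(len(k[0]))
    scores = [[scale * sum(a * b for a, b in zip(qi, kj)) for kj in k] for qi in q]
    weights = [softmax(row) for row in scores]
    dim = len(v[0])
    return [[sum(w[j] * v[j][d] for j in range(len(v))) for d in range(dim)]
            for w in weights]

def query_tiled_attention(q, k, v, tile=4):
    """Same result, but only a tile x len(k) block of scores exists at a time."""
    out = []
    for start in range(0, len(q), tile):
        out.extend(attention(q[start:start + tile], k, v))
    return out

def cosine(a, b):
    """Cosine similarity between two matrices, flattened to vectors."""
    fa = [x for row in a for x in row]
    fb = [x for row in b for x in row]
    dot = sum(x * y for x, y in zip(fa, fb))
    norm_a = math.sqrt(sum(x * x for x in fa))
    norm_b = math.sqrt(sum(y * y for y in fb))
    return dot / (norm_a * norm_b)

# Hypothetical sizes: 10 positions, 8 channels, fixed seed for repeatability.
rng = random.Random(0)
def rand_matrix(rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

q, k, v = rand_matrix(10, 8), rand_matrix(10, 8), rand_matrix(10, 8)
full = attention(q, k, v)
tiled = query_tiled_attention(q, k, v, tile=4)
print("cosine(full, tiled) =", cosine(full, tiled))
```

The tiled version never holds more than `tile` rows of scores, which is the memory property the post describes; how the real backend sizes and schedules its tiles is not stated here.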

It drops in like the CoreML and OpenVINO backends you already ship: the same fill-embd_enc
seam, about 30 lines plus one CMake line, gated by an env var and inert when unset. The
ANE work lives in the external ANEForge package, so the part in whisper.cpp is a thin
shim. Encoder only; the decoder is unchanged.

Code, patch, and benchmark: https://github.com/sbryngelson/whisper-aneforge

Would you take it as an optional backend? I am happy to open a PR mirroring the CoreML one.

ggerganov (Maintainer) · 2026-06-22T10:50:33Z

Have you run any additional benchmarks on longer audio? In the README, I only see tests with jfk.wav.

sbryngelson (Author) · 2026-06-22T13:09:57Z

@ggerganov Fair point.
I did a multi-window run: whisper-tiny on a 199 s speech (the Bush/Columbia address, 8 encoder windows), end to end:

ANEForge encoder: 8.2 ms/window
stock Metal: 15.9 ms/window

So ~1.9x, stable across all 8 encoder windows with no drift.
The transcript comes out 99.2% word-identical to the stock Metal encoder (3 words out of 370 differ: love/loved and two name spellings).
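
The two headline figures above are simple ratios of the reported numbers. The sketch below recomputes them and adds one possible way to score word-level agreement between two transcripts. The `word_identical_fraction` helper is hypothetical: the thread does not say how the 99.2% was computed, so the alignment method (Python's `difflib`) and the lowercasing are assumptions.

```python
from difflib import SequenceMatcher

def word_identical_fraction(reference, candidate):
    """Fraction of words that align exactly between two transcripts."""
    a = reference.lower().split()
    b = candidate.lower().split()
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / max(len(a), len(b))

# Figures reported in this post.
speedup = 15.9 / 8.2           # stock Metal ms/window over ANEForge ms/window
identical = (370 - 3) / 370    # 3 differing words out of 370

print(f"speedup: {speedup:.2f}x")
print(f"word-identical: {identical * 100:.1f}%")

# Toy example with one substituted word (not taken from the real transcript).
print(word_identical_fraction("I love New York", "I loved New York"))
```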

Added this to RESULTS.md in the repo. Happy to run a specific file or a larger model if you have one in mind.

sbryngelson (Author) · 2026-06-23T00:36:49Z

Opened the PR: #3905. It applies on current master and I verified whisper-cli builds with it. Gated by ANEFORGE_ENCODER, so the build behaves exactly like stock when the variable is unset (the dispatch dylib is dlopened only when enabled, no new link-time dependency). Happy to adjust the seam or gating to fit how you would prefer an optional backend to land.
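
The gating pattern described here (nothing loaded unless the variable is set, and no link-time dependency) can be sketched as follows. This is a Python illustration using `ctypes`, not the C++ shim from the PR. Only the variable name `ANEFORGE_ENCODER` comes from this thread; the library file name, the `load_optional_backend` helper, and the rule that any non-empty value enables the backend are assumptions.

```python
import ctypes
import os

ENV_VAR = "ANEFORGE_ENCODER"               # variable named in this thread
LIB_NAME = "libaneforge_dispatch.dylib"    # hypothetical file name

def load_optional_backend(environ=None, loader=ctypes.CDLL, lib_name=LIB_NAME):
    """Return a library handle, or None if the backend is disabled or missing.

    Assumption: any non-empty value of ENV_VAR enables the backend.
    """
    environ = os.environ if environ is None else environ
    if not environ.get(ENV_VAR):
        return None          # unset or empty: the loader is never called
    try:
        return loader(lib_name)   # ctypes.CDLL uses dlopen on POSIX systems
    except OSError:
        return None          # library not found: caller keeps the stock path

handle = load_optional_backend()
print("optional backend active" if handle is not None else "using stock encoder")
```

Returning `None` in both the disabled and the missing-library case keeps the caller's fallback logic to a single check.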
