Question: where would a flash-resident tiny language runtime fit relative to MNN’s on-device LLM direction? #4273

Alpha-Guardian · 2026-03-17T08:48:57Z

Hi MNN folks,

I wanted to share a small edge-language-runtime experiment and ask how people here would think about it relative to MNN’s current on-device LLM and Edge AI direction.

We built a public demo line called Engram and deployed it on a commodity ESP32-C3.

Current public numbers:

Host-side benchmark capability:
- LogiQA = 0.392523
- IFEval = 0.780037

Published board proof:
- LogiQA 642 = 249 / 642 = 0.3878504672897196
- host_full_match = 642 / 642
- runtime artifact size = 1,380,771 bytes
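
For anyone who wants to re-derive the board-side figures, here is a minimal arithmetic check. It uses only the numbers quoted above; the variable names are made up for this sketch, and reporting the size in both MB and MiB is just a convenience, not something the published proof does.

```python
# Figures copied from the post above; variable names are illustrative.
BOARD_CORRECT = 249
BOARD_TOTAL = 642
ARTIFACT_BYTES = 1_380_771

# Board-side LogiQA accuracy: correct answers over total questions.
board_accuracy = BOARD_CORRECT / BOARD_TOTAL

# Artifact size in decimal megabytes and binary mebibytes.
artifact_mb = ARTIFACT_BYTES / 1_000_000
artifact_mib = ARTIFACT_BYTES / (1024 * 1024)

print(f"board accuracy: {board_accuracy:.6f}")
print(f"artifact size:  {artifact_mb:.2f} MB / {artifact_mib:.2f} MiB")
```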

Important scope note:

This is not presented as unrestricted open-input native LLM generation on MCU.

The board-side path is closer to a flash-resident, table-driven runtime with:

- packed token weights
- hashed lookup structures
- fixed compiled probe batches
- streaming fold / checksum style execution over precompiled structures

So this is not simply a smaller graph-driven dense inference stack. It feels more like a task-specialized language runtime whose behavior has been compiled into a highly constrained executable form.
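
To make the four ingredients above more concrete, here is a toy Python sketch of how a table-driven runtime of this general shape could be put together. It is not Engram's code: the 4-bit packing, the FNV-1a hash, the open-addressing table, the sum-of-weights scoring rule, and every function name are assumptions chosen only for illustration.

```python
# Toy illustration only, NOT Engram's implementation. All names, sizes,
# and the 4-bit packing scheme are assumptions made for this sketch.

def fnv1a32(data: bytes) -> int:
    """32-bit FNV-1a hash, a common choice for small lookup tables."""
    h = 0x811C9DC5
    for b in data:
        h ^= b
        h = (h * 0x01000193) & 0xFFFFFFFF
    return h

def pack_nibbles(values):
    """Pack 4-bit weights (0..15) two per byte, low nibble first."""
    values = list(values)
    if len(values) % 2:
        values.append(0)  # pad to an even count
    return bytes((hi << 4) | lo for lo, hi in zip(values[0::2], values[1::2]))

def unpack_nibble(packed: bytes, index: int) -> int:
    """Read the 4-bit weight stored at position `index`."""
    byte = packed[index // 2]
    return (byte >> 4) if index % 2 else (byte & 0x0F)

def build_table(token_weights: dict, slots: int = 16):
    """Host-side 'compile' step: open-addressing table from token to weight index."""
    keys, idx, weights = [None] * slots, [0] * slots, []
    for token, weight in token_weights.items():
        slot = fnv1a32(token.encode()) % slots
        while keys[slot] is not None:  # linear probing on collision
            slot = (slot + 1) % slots
        keys[slot], idx[slot] = token, len(weights)
        weights.append(weight)
    return keys, idx, pack_nibbles(weights)

def lookup(table, token: str) -> int:
    """Device-side lookup; unknown tokens contribute a weight of 0."""
    keys, idx, packed = table
    slot = fnv1a32(token.encode()) % len(keys)
    for _ in range(len(keys)):
        if keys[slot] is None:
            return 0
        if keys[slot] == token:
            return unpack_nibble(packed, idx[slot])
        slot = (slot + 1) % len(keys)
    return 0

def run_probe_batch(table, probes):
    """Stream over a fixed batch: score options, pick the best, fold into a checksum."""
    checksum, answers = 0x811C9DC5, []
    for options in probes:
        scores = [sum(lookup(table, t) for t in opt.split()) for opt in options]
        best = scores.index(max(scores))
        answers.append(best)
        checksum = ((checksum ^ best) * 0x01000193) & 0xFFFFFFFF
    return answers, checksum

table = build_table({"yes": 9, "no": 2, "maybe": 5})
probes = [("yes", "no"), ("no maybe", "yes")]
answers, checksum = run_probe_batch(table, probes)
print(answers, hex(checksum))
```

In a design like this, the same fixed batch run on host and on board should yield identical answers and an identical checksum, which is one plausible way a full host/board match could be verified. Whether Engram does it this way is not stated here.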

Repo:
https://github.com/Alpha-Guardian/Engram

The thing I’m curious about is whether systems like this should be viewed as:

- an extreme endpoint of on-device LLM specialization
- outside the normal graph/runtime deployment path
- or an adjacent class of edge language systems that future frameworks may need to account for

Would be very interested in how people here think about that boundary.

wangzhaode (Maintainer) · 2026-04-07T02:55:09Z

Thanks for sharing the Engram experiment — really interesting work, especially getting something running on an ESP32-C3 with only ~1.38 MB. It's great to see people pushing the boundary of what's possible on extreme edge hardware.

To answer your question about positioning: MNN's primary focus is on mobile devices (smartphones) as the first-class target. Beyond that, we also support embedded platforms (mainly ARM-based) and PC/Mac. For now, we don't target extremely resource-constrained scenarios like ESP32-class MCUs.

So I'd say Engram falls into the "adjacent class of edge language systems" bucket — it's solving a different problem with a fundamentally different paradigm (table-driven / hash-lookup vs. dense tensor computation). MNN's graph-based runtime and operator optimization are designed for devices that have enough compute (CPU/GPU/NPU) to run full neural networks, just more efficiently.

That said, the middle ground is definitely worth watching. Chips like ARM Cortex-M7 or mid-range RISC-V could be an overlap area where both approaches have something to contribute. And ideas from extreme compression — like your packed token weights — could also inspire new quantization formats that frameworks like MNN could learn from.

Appreciate you bringing this up — it's a useful boundary to think about, and we'd love to see where Engram goes next.
