Question: where would a flash-resident tiny language runtime fit relative to MNN’s on-device LLM direction? #4273
Replies: 1 comment
Thanks for sharing the Engram experiment. It's really interesting work, especially getting something running on an ESP32-C3 with only ~1.38 MB. It's great to see people pushing the boundary of what's possible on extreme edge hardware.

To answer your question about positioning: MNN's primary focus is mobile devices (smartphones) as the first-class target. Beyond that, we also support embedded platforms (mainly ARM-based) and PC/Mac. For now, we don't target extremely resource-constrained scenarios like ESP32-class MCUs.

So I'd say Engram falls into the "adjacent class of edge language systems" bucket: it's solving a different problem with a fundamentally different paradigm (table-driven / hash-lookup vs. dense tensor computation). MNN's graph-based runtime and operator optimizations are designed for devices that have enough compute (CPU/GPU/NPU) to run full neural networks, just more efficiently.

That said, the middle ground is definitely worth watching. Chips like ARM Cortex-M7 or mid-range RISC-V could be an overlap area where both approaches have something to contribute. And ideas from extreme compression, like your packed token weights, could also inspire new quantization formats that frameworks like MNN could learn from.

Appreciate you bringing this up. It's a useful boundary to think about, and we'd love to see where Engram goes next.
Hi MNN folks,
I wanted to share a small edge-language-runtime experiment and ask how people here would think about it relative to MNN’s current on-device LLM and Edge AI direction.
We built a public demo line called Engram and deployed it on a commodity ESP32-C3.
Current public numbers:
Host-side benchmark capability
- LogiQA = 0.392523
- IFEval = 0.780037

Published board proof
- LogiQA 642 = 249 / 642 = 0.3878504672897196
- host_full_match = 642 / 642
- 1,380,771 bytes

Important scope note:
This is not presented as unrestricted open-input native LLM generation on MCU.
The board-side path is closer to a flash-resident, table-driven runtime with:
So this is not simply a smaller graph-driven dense inference stack. It feels more like a task-specialized language runtime whose behavior has been compiled into a highly constrained executable form.
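To make the distinction concrete, here is a toy contrast between a table-driven lookup and a dense scoring step. It is only an illustrative sketch: it is not Engram's actual board-side code and not how MNN executes graphs. The names `bucket`, `build_table`, `table_lookup`, `dense_predict`, the label set, and the choice of CRC32 as the hash are all assumptions made up for this example.

```python
import zlib

# Hypothetical label set, used by both toy paths below.
LABELS = ["yes", "no", "maybe"]


def bucket(text: str, n_buckets: int) -> int:
    """Map an input string to a table slot with a cheap hash (CRC32 here)."""
    return zlib.crc32(text.encode("utf-8")) % n_buckets


def build_table(examples, n_buckets: int = 8):
    """Precompute answers offline; on a device this table would sit in flash."""
    table = [None] * n_buckets
    for text, label in examples:
        # Colliding inputs simply overwrite each other in this toy version.
        table[bucket(text, n_buckets)] = LABELS.index(label)
    return table


def table_lookup(table, text: str):
    """Table-driven path: one hash and one read, no arithmetic on weights."""
    slot = table[bucket(text, len(table))]
    return None if slot is None else LABELS[slot]


def dense_predict(weights, features):
    """Dense path: a matrix-vector product followed by an argmax."""
    scores = [sum(w * x for w, x in zip(row, features)) for row in weights]
    return LABELS[scores.index(max(scores))]


if __name__ == "__main__":
    table = build_table([("is water wet", "yes")])
    print(table_lookup(table, "is water wet"))
    print(dense_predict([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]], [0.2, 0.9]))
```

The lookup path trades generality for a fixed, tiny cost per query, while the dense path generalizes to unseen inputs but needs multiply-accumulate work proportional to the weight matrix.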
Repo:
https://github.com/Alpha-Guardian/Engram
The thing I’m curious about is whether systems like this should be viewed as:
Would be very interested in how people here think about that boundary.
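For anyone who wants to sanity-check the public numbers above, the arithmetic can be reproduced in a few lines. This only recomputes figures quoted in this post; nothing is measured independently. The variable names are made up for the example, and `reported_bytes` is the 1,380,771-byte figure exactly as listed.

```python
# Figures copied from the post above.
correct, total = 249, 642          # board-side LogiQA result
host_logiqa = 0.392523             # host-side LogiQA score
reported_bytes = 1_380_771         # byte count listed under the board proof

board_accuracy = correct / total

print(f"board LogiQA accuracy: {board_accuracy:.6f}")
print(f"host minus board:      {host_logiqa - board_accuracy:.6f}")
print(f"size: {reported_bytes / 1e6:.2f} MB (decimal), "
      f"{reported_bytes / 2**20:.2f} MiB (binary)")
```

The decimal-megabyte conversion lines up with the "~1.38 MB" figure used elsewhere in the thread.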