This project demonstrates running a Large Language Model (LLM) directly on an ESP32-S3 microcontroller. It implements an inference engine for small transformer models, heavily optimized for embedded systems.
The model is a 260K-parameter TinyLlamas checkpoint trained on the TinyStories dataset. Despite its small size, it generates simple, coherent text suitable for embedded applications.
Based on llama2.c with extensive ESP32-specific optimizations.
- ESP32-S3 with 2MB PSRAM (ESP32-S3FH4R2)
- 8MB flash (7.9MB app partition)
- ~1.5MB available RAM
Achieves ~32.83 tokens/second through:
- SIMD Acceleration: ESP32-S3 vector instructions via ESP-DSP
- Memory Alignment: 16-byte aligned allocations for SIMD operations
- Dual-Core Processing: Parallel computation across both cores
- Optimized Clocks: CPU at 240MHz, PSRAM at 80MHz
- Assembly Optimizations: Custom float division routines
- Efficient Math: Lookup tables for activation functions
- Real-time token streaming via WebSocket
- Adjustable temperature and max tokens
- Mobile-responsive design
- Access at http://<device-ip> (the address the device receives from your WiFi network, shown in the serial monitor output)
- LLM Engine: Complete transformer implementation with top-p sampling
- Memory Management: Custom aligned allocators for SIMD efficiency
- WiFi Manager: Station mode connectivity
- Embedded Model: 1MB model + 6KB tokenizer built into firmware
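To make the aligned-allocation idea concrete, the following is a minimal sketch in portable C11. The helper names are hypothetical and this is not the project's allocator; on the device, ESP-IDF's `heap_caps_aligned_alloc()` would be the likely counterpart, which is an assumption about the implementation.

```c
#include <stdint.h>
#include <stdlib.h>

#define SIMD_ALIGN 16u  /* 16-byte alignment, matching the SIMD requirement above */

/* Allocate room for n floats on a 16-byte boundary; returns NULL on failure.
 * The caller releases the buffer with free(). */
static float *alloc_aligned_floats(size_t n) {
    if (n > SIZE_MAX / sizeof(float)) {
        return NULL;  /* n * sizeof(float) would overflow */
    }
    size_t bytes = n * sizeof(float);
    /* aligned_alloc (C11) expects the size to be a multiple of the alignment. */
    size_t padded = (bytes + SIMD_ALIGN - 1u) & ~(size_t)(SIMD_ALIGN - 1u);
    if (padded == 0) {
        padded = SIMD_ALIGN;
    }
    return (float *)aligned_alloc(SIMD_ALIGN, padded);
}

/* Returns 1 if the pointer sits on a 16-byte boundary, 0 otherwise. */
static int is_simd_aligned(const void *p) {
    return ((uintptr_t)p & (SIMD_ALIGN - 1u)) == 0;
}
```

Rounding the size up to a multiple of 16 also means a vector loop can read whole 16-byte chunks without running past the end of the buffer.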
├── main/                 # Application entry point
├── components/
│   ├── llm/              # LLM inference engine
│   │   ├── assets/       # Embedded model & tokenizer
│   │   └── src/          # Core implementation + ASM
│   └── web/              # Web server & WiFi
│       └── src/          # HTTP/WebSocket server
└── partitions.csv        # 8MB partition table
Requires ESP-IDF v5.3.2 or later.
# Copy the template
cp components/web/include/wifi_config.h.template components/web/include/wifi_config.h
# Edit with your WiFi credentials
# Set WIFI_SSID and WIFI_PASS in wifi_config.h
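
For orientation, an edited wifi_config.h might look roughly like the fragment below. Only the names WIFI_SSID and WIFI_PASS come from this README; that they are `#define` macros, the header guard name, and the placeholder values are assumptions, and the real template may contain more.

```c
/* components/web/include/wifi_config.h (illustrative contents only) */
#ifndef WIFI_CONFIG_H   /* guard name is an assumption */
#define WIFI_CONFIG_H

#define WIFI_SSID "your-network-name"   /* placeholder: your WiFi network name */
#define WIFI_PASS "your-password"       /* placeholder: your WiFi password */

#endif /* WIFI_CONFIG_H */
```

Since this file holds credentials, it is worth keeping it out of version control.
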
# Set up ESP-IDF environment
. ~/esp/esp-idf/export.sh
# Configure for ESP32-S3
idf.py set-target esp32s3
# Build the project
idf.py build
# Flash and monitor (replace /dev/ttyUSB0 with your device's serial port)
idf.py -p /dev/ttyUSB0 flash monitor
- After flashing, the ESP32 connects to your WiFi network
- Find the device IP in the serial monitor output
- Navigate to http://<device-ip> in your browser
- Enter prompts and watch tokens stream in real time
The device provides detailed logs including:
- Memory usage statistics
- Performance metrics per layer
- Token generation progress
Key options in idf.py menuconfig:
- I2C GPIO pins (if using external peripherals)
- I2C clock speed
- Model: 260K parameters, 5 layers, 64 dimensions
- Vocabulary: 512 tokens optimized for simple stories
- Context: 512 token maximum sequence length
- Sampling: Top-p (nucleus) sampling with temperature control
- Memory: ~1.5MB runtime memory requirement
- Simple vocabulary suitable for basic stories
- 512 token context window
- No dynamic model loading (embedded in firmware)
- WiFi required for web interface
Contributions welcome! Areas for improvement:
- Further SIMD optimizations
- Model quantization support
- External storage for larger models
- Additional sampling methods
This project is intended to be open source. Please add a LICENSE file to clarify terms.