A local-first runtime for making small models useful on constrained hardware.
TinyMind is not another chatbot wrapper. It is a scaffold built around one idea:
Intelligence is not only in the model weights. It is also in the loop around them: routing, memory, tools, cache reuse, verification, and hardware-aware scheduling.
- Prompt router: classifies requests into tool-only, summarization, small coding, hard coding, reasoning, chat, etc.
- Provider abstraction: supports mock, Ollama, and llama.cpp server.
- SQLite memory: stores notes and traces, retrieves relevant memories before generation.
- Safe tools: simple file listing/reading and allowlisted shell command execution.
- Verifier loop: hard tasks can be checked by a verifier model.
- Benchmark harness: JSONL task runner with routing, latency, token estimate, and pass-rate reporting.
- FastAPI server: optional /ask, /remember, /memory, /health API.
- Hardware profiles: RTX 4060, Mac M3 Pro, Jetson Orin Nano, Raspberry Pi 5.
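The "safe tools" item above mentions allowlisted shell execution. A minimal sketch of that idea follows. The allowlist contents and the function names `is_allowed` and `run_allowlisted` are hypothetical and are not taken from TinyMind's code.

```python
import shlex
import subprocess

# Hypothetical allowlist; the real TinyMind list is not shown in this README.
ALLOWED_COMMANDS = {"ls", "cat", "pwd", "echo"}


def is_allowed(command: str, allowlist=ALLOWED_COMMANDS) -> bool:
    """Return True only if the first token of the command is allowlisted."""
    try:
        tokens = shlex.split(command)
    except ValueError:  # unbalanced quotes and similar parse errors
        return False
    return bool(tokens) and tokens[0] in allowlist


def run_allowlisted(command: str, timeout: float = 5.0) -> str:
    """Run an allowlisted command without a shell and return its stdout."""
    if not is_allowed(command):
        raise PermissionError(f"command not allowlisted: {command!r}")
    # shell=False (the default) passes ';', '&&', and '|' as plain arguments,
    # so they cannot chain a second command.
    result = subprocess.run(
        shlex.split(command), capture_output=True, text=True, timeout=timeout
    )
    return result.stdout
```

Checking only the first token is deliberately conservative: `ls; rm -rf /` is rejected because its first token is `ls;`, which is not on the list.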
Small raw models are often weak. Small models inside a strong runtime can become useful.
TinyMind tries to make that concrete:
tiny local model + memory + tools + routing + verifier > bigger model for every task
cd tinymind-runtime
python -m venv .venv
source .venv/bin/activate
pip install -e .
# Works offline through mock fallback
python -m tinymind.cli --config configs/tinymind.example.json route "Fix this Python function: def add(a,b): return a-b"
python -m tinymind.cli --config configs/tinymind.example.json ask "Explain KV cache in one paragraph"
Or use the console script after install:
tinymind --config configs/tinymind.example.json ask "Design a local AI router for a Raspberry Pi"
Install and run a small model:
ollama pull llama3.2:3b
ollama pull qwen2.5-coder:7b
Then ask:
tinymind --config configs/tinymind.example.json ask "Fix this Python bug: def add(a,b): return a-b" --json
If Ollama is not running, TinyMind falls back to the mock provider so the scaffold still works.
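One way such a fallback could work is a quick health probe before choosing a provider. The sketch below assumes Ollama's default local address and its model-listing endpoint; `pick_provider` is a hypothetical helper, not TinyMind's actual function.

```python
import urllib.request

# Ollama's default local address; adjust if your install differs.
OLLAMA_TAGS_URL = "http://127.0.0.1:11434/api/tags"


def pick_provider(url: str = OLLAMA_TAGS_URL, timeout: float = 1.0) -> str:
    """Return "ollama" if the server answers, otherwise "mock"."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return "ollama" if response.status == 200 else "mock"
    except OSError:  # URLError, connection refused, and timeouts all land here
        return "mock"
```

A short timeout keeps the CLI responsive when nothing is listening on the port.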
Example llama.cpp server:
./llama-server \
-m models/qwen2.5-coder-7b-q4_k_m.gguf \
--host 127.0.0.1 --port 8080 \
-c 32768 \
--cache-type-k q8_0 --cache-type-v q8_0 \
--metrics
Then configure llamacpp-main.endpoint in configs/tinymind.example.json if needed.
tinymind --config configs/tinymind.example.json route "Summarize this long log..."
tinymind --config configs/tinymind.example.json ask "What should this repo do?"
tinymind --config configs/tinymind.example.json remember \
"Project goal" \
"TinyMind should make small local models useful with routing, tools, and memory." \
--tags tinymind local-ai --importance 3
tinymind --config configs/tinymind.example.json memory "local models"
tinymind --config configs/tinymind.example.json bench examples/tasks.jsonl
pip install -e '.[server]'
TINYMIND_CONFIG=configs/tinymind.example.json uvicorn tinymind.server.app:app --reload --port 8765
Ask:
curl -s http://127.0.0.1:8765/ask \
-H 'Content-Type: application/json' \
-d '{"prompt":"Explain TinyMind in 5 bullets"}' | jq
User prompt
↓
Feature extraction
↓
Router
├── tool-only path
├── small local model path
├── medium local coding model path
└── hard reasoning / verifier path
↓
Memory retrieval
↓
Provider: Ollama / llama.cpp / mock
↓
Verifier loop for hard tasks
↓
Trace stored back into SQLite memory
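The flow above can be sketched as a toy loop. Everything here is a stand-in: `MiniAgent`, its keyword rules, the mock provider, and the mock verifier are hypothetical and only illustrate the order of the stages, not TinyMind's real agent.py.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Trace:
    prompt: str
    route: str
    answer: str
    verified: Optional[bool]  # None means the verifier was not run


@dataclass
class MiniAgent:
    """Toy stand-in for the orchestration loop; every name is hypothetical."""
    memories: list = field(default_factory=list)
    traces: list = field(default_factory=list)

    def route(self, prompt: str) -> str:
        text = prompt.lower()
        if text.startswith(("list files", "read file")):
            return "tool-only"
        if any(word in text for word in ("prove", "design", "architecture")):
            return "hard"
        if any(word in text for word in ("fix", "bug", "def ", "refactor")):
            return "coding"
        return "small"

    def retrieve(self, prompt: str, limit: int = 3) -> list:
        # Naive word-overlap retrieval, standing in for the SQLite memory.
        words = set(prompt.lower().split())
        hits = [m for m in self.memories if words & set(m.lower().split())]
        return hits[:limit]

    def generate(self, prompt: str, context: list) -> str:
        # Mock provider: a real one would call Ollama or llama.cpp here.
        return f"[mock answer to: {prompt}] (context items: {len(context)})"

    def verify(self, answer: str) -> bool:
        # Mock verifier: accept any non-empty answer.
        return bool(answer.strip())

    def ask(self, prompt: str) -> Trace:
        route = self.route(prompt)
        if route == "tool-only":
            answer = "[tool output placeholder]"  # no model call on this path
        else:
            answer = self.generate(prompt, self.retrieve(prompt))
        verified = self.verify(answer) if route == "hard" else None
        trace = Trace(prompt, route, answer, verified)
        self.traces.append(trace)  # stands in for writing the trace to SQLite
        return trace
```

Only the hard route pays for verification, which is the point of routing: cheap requests stay cheap.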
tinymind/
  agent.py                orchestration loop
  router.py               prompt classification + route decisions
  memory.py               SQLite memory and FTS retrieval
  providers/              mock, Ollama, llama.cpp
  tools/                  safe file/shell tools
  bench/harness.py        JSONL benchmark runner
  server/app.py           optional FastAPI API
configs/
  tinymind.example.json   hardware, model, and routing config
examples/
  tasks.jsonl             sample benchmark tasks
scripts/
  smoke_test.sh           local smoke test
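The layout lists memory.py as SQLite memory with FTS retrieval. A minimal sketch of that pattern is shown here, assuming the local SQLite build includes the FTS5 extension. The table name `notes`, its columns, and the function names are hypothetical and may not match TinyMind's schema.

```python
import sqlite3


def open_memory(path: str = ":memory:") -> sqlite3.Connection:
    """Open a database with one FTS5 table (requires FTS5 in the SQLite build)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS notes USING fts5(title, body, tags)"
    )
    return conn


def remember(conn, title: str, body: str, tags=()) -> None:
    conn.execute(
        "INSERT INTO notes (title, body, tags) VALUES (?, ?, ?)",
        (title, body, " ".join(tags)),
    )
    conn.commit()


def search(conn, query: str, limit: int = 5) -> list:
    # Quote each term so punctuation in user input is not parsed as FTS5 syntax.
    terms = ['"' + t.replace('"', '""') + '"' for t in query.split()]
    if not terms:
        return []
    rows = conn.execute(
        "SELECT title, body FROM notes WHERE notes MATCH ? ORDER BY rank LIMIT ?",
        (" OR ".join(terms), limit),
    )
    return rows.fetchall()
```

Joining terms with OR favors recall; the `rank` ordering then puts notes that match more terms first.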
This MVP is deliberately small. The high-value roadmap is:
Add:
- auto-detect RAM/VRAM/CPU/GPU
- generate recommended llama.cpp/Ollama/MLX flags
- decide context length, quantization, and KV cache precision
- output per-device configs for RTX 4060, Mac M-series, Jetson, Raspberry Pi
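A rough sketch of what such a profiler might look like follows. The memory thresholds and the chosen context lengths, quantizations, and KV cache types are illustrative assumptions, not measured recommendations. The function names are hypothetical, and the emitted flags are limited to the llama-server options already shown in this README.

```python
import os


def detect_ram_gb():
    """Best-effort RAM detection on POSIX; returns None where unsupported."""
    try:
        return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 2**30
    except (AttributeError, ValueError, OSError):
        return None


def recommend(memory_gb: float) -> dict:
    """Map available memory to settings. All thresholds are assumptions."""
    if memory_gb >= 16:
        context, quant, kv = 32768, "Q5_K_M", "f16"
    elif memory_gb >= 8:
        context, quant, kv = 16384, "Q4_K_M", "q8_0"
    else:
        context, quant, kv = 4096, "Q4_K_M", "q8_0"
    return {
        "context": context,
        "quantization": quant,
        "kv_cache": kv,
        "cpu_threads": os.cpu_count() or 1,
        # Flags mirror the llama-server example earlier in this README.
        "flags": ["-c", str(context), "--cache-type-k", kv, "--cache-type-v", kv],
    }
```

For example, `recommend(detect_ram_gb() or 4)` falls back to the most conservative tier when detection is unavailable.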
Replace rules with a tiny local classifier trained from traces:
prompt → task_kind, difficulty, risk, best_model, expected_cost
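As a hedged illustration of "a tiny local classifier trained from traces", here is a multinomial Naive Bayes over prompt tokens in plain Python. It predicts only a single label such as task_kind; the other outputs (difficulty, risk, best_model, expected_cost) would need their own targets. The class name `TinyRouter` and the trace format are assumptions.

```python
import math
from collections import Counter, defaultdict


class TinyRouter:
    """Multinomial Naive Bayes over prompt tokens with add-one smoothing."""

    def __init__(self):
        self.token_counts = defaultdict(Counter)  # label -> token -> count
        self.label_counts = Counter()             # label -> number of traces
        self.vocab = set()

    def fit(self, traces):
        """traces: iterable of (prompt, label) pairs, e.g. from stored traces."""
        for prompt, label in traces:
            tokens = prompt.lower().split()
            self.token_counts[label].update(tokens)
            self.label_counts[label] += 1
            self.vocab.update(tokens)
        return self

    def predict(self, prompt: str):
        tokens = prompt.lower().split()
        total_traces = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.token_counts.items():
            total_tokens = sum(counts.values())
            # Log prior plus smoothed log likelihood of each token.
            score = math.log(self.label_counts[label] / total_traces)
            for token in tokens:
                score += math.log(
                    (counts[token] + 1) / (total_tokens + len(self.vocab))
                )
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

A model this small trains in milliseconds on a few thousand traces, so it could be refit during idle time.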
For llama.cpp, aggressively use cache_prompt and stable system/project prefixes. Long-term goal:
SOUL.md + USER.md + PROJECT.md + TOOLS.md → cached once → reused across runs
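The prefix-reuse idea can be sketched against the llama.cpp server's /completion endpoint, which accepts a `cache_prompt` field. The prompt template, the helper names, and the decision to skip missing files are assumptions made for illustration; the endpoint address matches the llama-server example earlier in this README.

```python
import json
import urllib.request
from pathlib import Path

PREFIX_FILES = ["SOUL.md", "USER.md", "PROJECT.md", "TOOLS.md"]


def build_prefix(root: str = ".") -> str:
    """Concatenate prefix files in a fixed order; missing files are skipped."""
    parts = []
    for name in PREFIX_FILES:
        path = Path(root) / name
        if path.is_file():
            text = path.read_text(encoding="utf-8").strip()
            parts.append(f"# {name}\n{text}")
    return "\n\n".join(parts)


def build_payload(prefix: str, user_prompt: str, n_predict: int = 256) -> dict:
    # The prefix must come first and stay byte-identical between calls,
    # otherwise the server cannot reuse the cached tokens.
    return {
        "prompt": f"{prefix}\n\nUser: {user_prompt}\nAssistant:",
        "n_predict": n_predict,
        "cache_prompt": True,
    }


def complete(payload: dict,
             endpoint: str = "http://127.0.0.1:8080/completion") -> str:
    """Send the payload to a running llama.cpp server and return its text."""
    request = urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.loads(response.read())["content"]
```

Anything that changes per request, such as timestamps or retrieved memories, should go after the stable prefix so it does not invalidate the cached portion.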
Idle job that turns traces into:
- facts
- preferences
- project summaries
- repo maps
- failure notes
- reusable skills
Expand examples/tasks.jsonl into a real benchmark:
- coding patches
- log triage
- shell planning
- structured extraction
- memory recall
- tool-use accuracy
- latency/watt/cost reporting
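A minimal sketch of a JSONL runner that reports pass rate and latency is shown below. The task fields `id`, `prompt`, and `expect_contains` are a hypothetical schema and may not match examples/tasks.jsonl. The token estimate uses a rough four-characters-per-token heuristic.

```python
import json
import time


def run_bench(lines, answer_fn):
    """Run JSONL tasks through answer_fn and summarize pass rate and latency."""
    results = []
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines in the task file
        task = json.loads(line)
        start = time.perf_counter()
        answer = answer_fn(task["prompt"])
        latency = time.perf_counter() - start
        results.append({
            "id": task.get("id"),
            "passed": task["expect_contains"].lower() in answer.lower(),
            "latency_s": latency,
            "tokens_est": max(1, len(answer) // 4),  # rough heuristic
        })
    total = len(results)
    return {
        "tasks": total,
        "pass_rate": sum(r["passed"] for r in results) / total if total else 0.0,
        "mean_latency_s": (
            sum(r["latency_s"] for r in results) / total if total else 0.0
        ),
        "results": results,
    }
```

Passing `answer_fn` as a parameter lets the same harness compare the mock provider, a small model, and a larger one on identical tasks.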
Add provider adapters for:
- RWKV
- Mamba/SSM models
- BitNet/bitnet.cpp
- MLX
- WebGPU/WebNN later
Do not chase a bigger model first.
Build the runtime that makes small models useful.