daniel-silva-perez/tinymind-runtime

TinyMind Runtime

A local-first runtime for making small models useful on constrained hardware.

TinyMind is not another chatbot wrapper. It is a scaffold for testing one idea:

Intelligence is not only in model weights. Intelligence is in the loop: routing, memory, tools, cache reuse, verification, and hardware-aware scheduling.

What this MVP includes

  • Prompt router: classifies requests into tool-only, summarization, small coding, hard coding, reasoning, chat, etc.
  • Provider abstraction: supports mock, Ollama, and llama.cpp server.
  • SQLite memory: stores notes and traces, and retrieves relevant memories before generation.
  • Safe tools: simple file listing/reading and allowlisted shell command execution.
  • Verifier loop: hard tasks can be checked by a verifier model.
  • Benchmark harness: JSONL task runner with routing, latency, token estimate, and pass-rate reporting.
  • FastAPI server: optional /ask, /remember, /memory, /health API.
  • Hardware profiles: RTX 4060, Mac M3 Pro, Jetson Orin Nano, Raspberry Pi 5.
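
To make the "safe tools" item concrete, here is a minimal sketch of allowlisted shell execution. It is not TinyMind's actual tool code: the `ALLOWED_COMMANDS` set and the `run_allowed` name are hypothetical, and a POSIX system is assumed. The point is that the executable is checked against an allowlist and the command is never passed to a shell.

```python
import shlex
import subprocess

# Hypothetical allowlist for illustration; the real list belongs to the project.
ALLOWED_COMMANDS = {"echo", "ls", "pwd"}


def run_allowed(command: str, timeout: float = 5.0) -> str:
    """Run a command only if its executable is allowlisted; never use a shell."""
    argv = shlex.split(command)
    if not argv or argv[0] not in ALLOWED_COMMANDS:
        raise PermissionError(f"command not allowed: {command!r}")
    # Without a shell, pipes, redirects, and ';' are passed as plain arguments.
    result = subprocess.run(
        argv, capture_output=True, text=True, timeout=timeout, check=False
    )
    return result.stdout
```

Because the argument list is executed directly, an input such as `echo hi; rm x` only prints its arguments instead of running a second command.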

Why this matters

Small raw models are often weak. Small models inside a strong runtime can become useful.

TinyMind tries to make that concrete:

tiny local model + memory + tools + routing + verifier > bigger model for every task

Quick start

cd tinymind-runtime
python -m venv .venv
source .venv/bin/activate
pip install -e .

# Works offline through mock fallback
python -m tinymind.cli --config configs/tinymind.example.json route "Fix this Python function: def add(a,b): return a-b"
python -m tinymind.cli --config configs/tinymind.example.json ask "Explain KV cache in one paragraph"

Or use the console script after install:

tinymind --config configs/tinymind.example.json ask "Design a local AI router for a Raspberry Pi"

Use with Ollama

Install Ollama, then pull a small general model and a small coding model:

ollama pull llama3.2:3b
ollama pull qwen2.5-coder:7b

Then ask:

tinymind --config configs/tinymind.example.json ask "Fix this Python bug: def add(a,b): return a-b" --json

If Ollama is not running, TinyMind falls back to the mock provider so the scaffold still works.
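
One way such a fallback could work is sketched below, under assumptions. The function names are hypothetical and this is not TinyMind's provider code. It probes Ollama's default local address (`http://127.0.0.1:11434`, endpoint `/api/tags`) and selects the mock provider when the probe fails.

```python
import urllib.error
import urllib.request
from typing import Callable

OLLAMA_URL = "http://127.0.0.1:11434"  # Ollama's default local address


def ollama_is_up(base_url: str = OLLAMA_URL, timeout: float = 0.5) -> bool:
    """Return True if the Ollama HTTP API answers on /api/tags."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def pick_provider(probe: Callable[[], bool] = ollama_is_up) -> str:
    """Hypothetical selector: prefer Ollama, otherwise fall back to mock."""
    return "ollama" if probe() else "mock"
```

Passing the probe in as a callable keeps the selection logic testable without a running server.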

Use with llama.cpp server

Example llama.cpp server:

./llama-server \
  -m models/qwen2.5-coder-7b-q4_k_m.gguf \
  --host 127.0.0.1 --port 8080 \
  -c 32768 \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --metrics

Then, if the server listens on a different address, update llamacpp-main.endpoint in configs/tinymind.example.json.

CLI commands

Inspect routing

tinymind --config configs/tinymind.example.json route "Summarize this long log..."

Ask through router + memory + provider

tinymind --config configs/tinymind.example.json ask "What should this repo do?"

Store memory

tinymind --config configs/tinymind.example.json remember \
  "Project goal" \
  "TinyMind should make small local models useful with routing, tools, and memory." \
  --tags tinymind local-ai --importance 3

Search memory

tinymind --config configs/tinymind.example.json memory "local models"
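
For readers curious what SQLite-backed storage and full-text search can look like, here is a minimal sketch using SQLite's FTS5 extension. The table layout and function names are assumptions for illustration, not the schema in `memory.py`, and the Python build must include FTS5.

```python
import sqlite3
from typing import Iterable


def open_memory(path: str = ":memory:") -> sqlite3.Connection:
    """Open a database with a hypothetical FTS5 notes table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS notes USING fts5(title, body, tags)"
    )
    return conn


def remember(conn: sqlite3.Connection, title: str, body: str,
             tags: Iterable[str] = ()) -> None:
    """Store one note; tags are kept as a space-separated string."""
    conn.execute(
        "INSERT INTO notes (title, body, tags) VALUES (?, ?, ?)",
        (title, body, " ".join(tags)),
    )
    conn.commit()


def search(conn: sqlite3.Connection, query: str, limit: int = 5) -> list:
    """Return (title, body) rows matching every query term, best match first."""
    # Quote each term so user punctuation is not parsed as FTS5 query syntax.
    terms = " ".join('"' + t.replace('"', '""') + '"' for t in query.split())
    if not terms:
        return []
    rows = conn.execute(
        "SELECT title, body FROM notes WHERE notes MATCH ? ORDER BY rank LIMIT ?",
        (terms, limit),
    )
    return rows.fetchall()
```

With the note from the "Store memory" example inserted, `search(conn, "local models")` would return that row, since both terms appear in its body.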

Run benchmark

tinymind --config configs/tinymind.example.json bench examples/tasks.jsonl
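
A JSONL task runner of this kind can be sketched in a few lines. The task fields used here (`prompt`, `expect`) and the substring pass check are assumptions for illustration; the actual format of `examples/tasks.jsonl` and the harness's scoring may differ.

```python
import json
import time
from typing import Callable, Iterable


def run_bench(lines: Iterable[str], answer: Callable[[str], str]) -> dict:
    """Run each JSONL task through `answer` and report pass rate and latency."""
    total = 0
    passed = 0
    latencies = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the task file
        task = json.loads(line)
        start = time.perf_counter()
        output = answer(task["prompt"])
        latencies.append(time.perf_counter() - start)
        total += 1
        # Hypothetical pass criterion: expected text appears in the output.
        if task["expect"].lower() in output.lower():
            passed += 1
    return {
        "tasks": total,
        "pass_rate": passed / total if total else 0.0,
        "mean_latency_s": sum(latencies) / total if total else 0.0,
    }
```

Any callable that maps a prompt to a string can be passed as `answer`, so the same runner works against a mock provider or a real model.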

Optional API server

pip install -e '.[server]'
TINYMIND_CONFIG=configs/tinymind.example.json uvicorn tinymind.server.app:app --reload --port 8765

Ask:

curl -s http://127.0.0.1:8765/ask \
  -H 'Content-Type: application/json' \
  -d '{"prompt":"Explain TinyMind in 5 bullets"}' | jq

Architecture

User prompt
  ↓
Feature extraction
  ↓
Router
  ├── tool-only path
  ├── small local model path
  ├── medium local coding model path
  └── hard reasoning / verifier path
  ↓
Memory retrieval
  ↓
Provider: Ollama / llama.cpp / mock
  ↓
Verifier loop for hard tasks
  ↓
Trace stored back into SQLite memory
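
The routing and verifier steps in the diagram can be sketched as below. This is a toy under stated assumptions: the keyword rules, the route labels, and the retry prompt are invented for illustration and are not the rules in `router.py` or `agent.py`.

```python
from typing import Callable, Tuple

Generate = Callable[[str], str]
Verify = Callable[[str, str], bool]


def route(prompt: str) -> str:
    """Toy keyword router; the labels mirror the four paths in the diagram."""
    text = prompt.lower()
    if text.startswith(("ls ", "cat ", "list files")):
        return "tool-only"
    if any(k in text for k in ("prove", "design", "architecture")):
        return "hard"
    if any(k in text for k in ("def ", "bug", "fix", "traceback")):
        return "coding"
    return "small"


def answer_with_verifier(prompt: str, generate: Generate, verify: Verify,
                         max_attempts: int = 3) -> Tuple[str, int]:
    """Generate, check, and retry with feedback until accepted or out of attempts."""
    draft = ""
    for attempt in range(1, max_attempts + 1):
        draft = generate(prompt)
        if verify(prompt, draft):
            return draft, attempt
        # Feed the rejected draft back so the next attempt can differ.
        prompt = f"{prompt}\n\nPrevious answer was rejected:\n{draft}\nTry again."
    return draft, max_attempts
```

Substring rules like these are crude (for example, "prefix" contains "fix"), which is one reason the roadmap proposes replacing them with a trained classifier.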

Files

tinymind/
  agent.py              orchestration loop
  router.py             prompt classification + route decisions
  memory.py             SQLite memory and FTS retrieval
  providers/            mock, Ollama, llama.cpp
  tools/                safe file/shell tools
  bench/harness.py      JSONL benchmark runner
  server/app.py         optional FastAPI API
configs/
  tinymind.example.json hardware, model, and routing config
examples/
  tasks.jsonl           sample benchmark tasks
scripts/
  smoke_test.sh         local smoke test

What to build next

This MVP is deliberately small. The high-value roadmap is:

1. Hardware optimizer

Add:

  • auto-detect RAM/VRAM/CPU/GPU
  • generate recommended llama.cpp/Ollama/MLX flags
  • decide context length, quantization, and KV cache precision
  • output per-device configs for RTX 4060, Mac M-series, Jetson, Raspberry Pi
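
A first step toward such an optimizer might look like the sketch below. It only reads total RAM and CPU count (VRAM and GPU detection are left out), and the thresholds and returned keys are illustrative assumptions, not measured recommendations or real TinyMind config fields.

```python
import os


def total_ram_gb() -> float:
    """Total physical RAM in GiB on POSIX systems; 0.0 if it cannot be read."""
    try:
        return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 2**30
    except (ValueError, OSError, AttributeError):
        return 0.0  # e.g. Windows, where os.sysconf is unavailable


def recommend(ram_gb: float, cpu_count: int) -> dict:
    """Map detected resources to settings. Thresholds are illustrative only."""
    if ram_gb >= 16:
        ctx, quant = 32768, "q4_k_m"
    elif ram_gb >= 8:
        ctx, quant = 8192, "q4_k_m"
    else:
        ctx, quant = 4096, "q4_0"
    return {
        "ctx_size": ctx,
        "quant": quant,
        "cache_type_k": "q8_0",
        "cache_type_v": "q8_0",
        "threads": max(1, cpu_count - 1),  # leave one core for the OS
    }


if __name__ == "__main__":
    print(recommend(total_ram_gb(), os.cpu_count() or 1))
```

Keeping detection separate from recommendation lets the thresholds be tuned per device without touching the probing code.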

2. Real routing model

Replace rules with a tiny local classifier trained from traces:

prompt → task_kind, difficulty, risk, best_model, expected_cost
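
The classifier's output could be captured in a small record type, and the same record could label stored traces as training rows. The sketch below uses the five field names from the mapping above; the field types, value ranges, and the `to_training_row` helper are assumptions.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RouteDecision:
    task_kind: str        # e.g. "small_coding", "summarization"
    difficulty: float     # assumed scale: 0.0 (trivial) to 1.0 (hard)
    risk: str             # assumed labels: "low", "medium", "high"
    best_model: str       # model identifier, e.g. an Ollama tag
    expected_cost: float  # assumed unit: seconds of local compute


def to_training_row(prompt: str, decision: RouteDecision) -> str:
    """Serialize one (prompt, label) pair as a JSONL line for later training."""
    return json.dumps({"prompt": prompt, "label": asdict(decision)}, sort_keys=True)
```

Writing one such line per trace would give the future classifier a dataset in the same JSONL style the benchmark already uses.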

3. Persistent prefix cache

For llama.cpp, aggressively use cache_prompt and stable system/project prefixes. Long-term goal:

SOUL.md + USER.md + PROJECT.md + TOOLS.md → cached once → reused across runs
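
As a rough sketch of the idea, the prefix documents can be joined in a fixed order and sent with `cache_prompt` enabled on the llama.cpp server's `/completion` endpoint. The helper names and the heading format of the prefix are hypothetical. How much of the prompt is actually reused depends on the server keeping the matching slot, and surviving a server restart is a separate problem that this does not solve.

```python
import json
import urllib.request
from typing import Dict

PREFIX_FILES = ("SOUL.md", "USER.md", "PROJECT.md", "TOOLS.md")


def build_prefix(docs: Dict[str, str]) -> str:
    """Join prefix documents in a fixed order so the text is identical every run."""
    return "\n\n".join(
        f"# {name}\n{docs[name].strip()}" for name in PREFIX_FILES if name in docs
    )


def completion_payload(prefix: str, user_prompt: str, n_predict: int = 256) -> dict:
    """Build a /completion request body that asks the server to reuse its cache."""
    return {
        "prompt": f"{prefix}\n\n{user_prompt}",
        "n_predict": n_predict,
        "cache_prompt": True,
    }


def complete(payload: dict,
             endpoint: str = "http://127.0.0.1:8080/completion") -> str:
    """POST the payload to a running llama.cpp server and return the text."""
    req = urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.load(resp)["content"]
```

The fixed file order matters: any change in the prefix text, even whitespace, means the cached tokens no longer match and must be recomputed.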

4. Memory compiler

Idle job that turns traces into:

  • facts
  • preferences
  • project summaries
  • repo maps
  • failure notes
  • reusable skills

5. Local agent benchmark

Expand examples/tasks.jsonl into a real benchmark:

  • coding patches
  • log triage
  • shell planning
  • structured extraction
  • memory recall
  • tool-use accuracy
  • latency/watt/cost reporting

6. Non-Transformer backend experiments

Add provider adapters for:

  • RWKV
  • Mamba/SSM models
  • BitNet/bitnet.cpp
  • MLX
  • WebGPU/WebNN later

Design principle

Do not chase a bigger model first.

Build the runtime that makes small models useful.
