TinyMind Runtime

A local-first runtime for making small models useful on constrained hardware.

TinyMind is not another chatbot wrapper. It is a scaffold built around one idea:

Intelligence is not only in model weights. Intelligence is in the loop: routing, memory, tools, cache reuse, verification, and hardware-aware scheduling.

What this MVP includes

Prompt router: classifies requests into tool-only, summarization, small coding, hard coding, reasoning, chat, etc.
Provider abstraction: supports mock, Ollama, and llama.cpp server.
SQLite memory: stores notes and traces, and retrieves relevant memories before generation.
Safe tools: simple file listing/reading and allowlisted shell command execution.
Verifier loop: hard tasks can be checked by a verifier model.
Benchmark harness: JSONL task runner with routing, latency, token estimate, and pass-rate reporting.
FastAPI server: optional /ask, /remember, /memory, /health API.
Hardware profiles: RTX 4060, Mac M3 Pro, Jetson Orin Nano, Raspberry Pi 5.
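
As an illustration of the "allowlisted shell command execution" item above, here is a minimal sketch of how such a tool could work. The allowlist contents and the function name run_allowlisted are assumptions made for this example, not TinyMind's actual code.

```python
import shlex
import subprocess

# Hypothetical allowlist; the list TinyMind actually ships may differ.
ALLOWED_COMMANDS = {"ls", "cat", "echo", "pwd"}


def run_allowlisted(command: str, timeout: float = 5.0) -> str:
    """Run a command only if its executable name is on the allowlist."""
    argv = shlex.split(command)
    if not argv or argv[0] not in ALLOWED_COMMANDS:
        raise PermissionError(f"command not allowed: {command!r}")
    # No shell is involved, so characters such as ';' or '|' are passed
    # to the program as literal arguments and are never interpreted.
    result = subprocess.run(
        argv, capture_output=True, text=True, timeout=timeout, check=False
    )
    return result.stdout
```

Checking only the first token is deliberately strict: anything that is not explicitly listed is rejected before a process is started.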

Why this matters

Small raw models are often weak. Small models inside a strong runtime can become useful.

TinyMind tries to make that concrete:

tiny local model + memory + tools + routing + verifier > bigger model for every task

Quick start

cd tinymind-runtime
python -m venv .venv
source .venv/bin/activate
pip install -e .

# Works offline through mock fallback
python -m tinymind.cli --config configs/tinymind.example.json route "Fix this Python function: def add(a,b): return a-b"
python -m tinymind.cli --config configs/tinymind.example.json ask "Explain KV cache in one paragraph"

Or use the console script after install:

tinymind --config configs/tinymind.example.json ask "Design a local AI router for a Raspberry Pi"

Use with Ollama

Install Ollama, make sure it is running, then pull a small general model and a coding model:

ollama pull llama3.2:3b
ollama pull qwen2.5-coder:7b

Then ask:

tinymind --config configs/tinymind.example.json ask "Fix this Python bug: def add(a,b): return a-b" --json

If Ollama is not running, TinyMind falls back to the mock provider so the scaffold still works.
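
The fallback behaviour can be sketched roughly as follows. This is not TinyMind's provider code; the function names and the mock output format are assumptions. The request shape follows Ollama's /api/generate endpoint with streaming disabled.

```python
import json
import urllib.request


def mock_generate(prompt: str) -> str:
    # Hypothetical stand-in for the mock provider.
    return f"[mock] {prompt[:60]}"


def generate(
    prompt: str,
    model: str = "llama3.2:3b",
    endpoint: str = "http://127.0.0.1:11434/api/generate",
    timeout: float = 30.0,
) -> tuple[str, str]:
    """Return (provider_name, text), falling back to the mock on any failure."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        endpoint, data=body, headers={"Content-Type": "application/json"}
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return "ollama", json.loads(resp.read())["response"]
    except (OSError, ValueError, KeyError):
        # OSError covers connection refused, timeouts, and HTTP errors;
        # ValueError and KeyError cover malformed or unexpected responses.
        return "mock", mock_generate(prompt)
```

Returning the provider name alongside the text makes it visible in traces whether an answer came from a real model or from the mock.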

Use with llama.cpp server

Example llama.cpp server:

./llama-server \
  -m models/qwen2.5-coder-7b-q4_k_m.gguf \
  --host 127.0.0.1 --port 8080 \
  -c 32768 \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --metrics

Then, if needed, set llamacpp-main.endpoint in configs/tinymind.example.json to match the host and port used above.

CLI commands

Inspect routing

tinymind --config configs/tinymind.example.json route "Summarize this long log..."
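
To give a feel for what a rule-based routing decision might look like, here is a minimal sketch. The patterns, task labels, and model labels are illustrative assumptions; the real rules live in router.py and may differ.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    task_kind: str
    model: str


# Hypothetical rules, checked in order; the first match wins.
RULES = [
    (re.compile(r"\b(list files|read file|run command)\b", re.I),
     Route("tool_only", "none")),
    (re.compile(r"\b(summari[sz]e|tl;dr)\b", re.I),
     Route("summarization", "small")),
    (re.compile(r"\b(fix|bug|refactor|def |class )", re.I),
     Route("small_coding", "medium-coder")),
    (re.compile(r"\b(prove|design|architecture|step by step)\b", re.I),
     Route("reasoning", "medium+verifier")),
]


def route(prompt: str) -> Route:
    """Classify a prompt with ordered regex rules, defaulting to chat."""
    for pattern, decision in RULES:
        if pattern.search(prompt):
            return decision
    return Route("chat", "small")
```

Ordered rules are easy to inspect and debug, which is useful while collecting the traces a learned router would later be trained on.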

Ask through router + memory + provider

tinymind --config configs/tinymind.example.json ask "What should this repo do?"

Store memory

tinymind --config configs/tinymind.example.json remember \
  "Project goal" \
  "TinyMind should make small local models useful with routing, tools, and memory." \
  --tags tinymind local-ai --importance 3

Search memory

tinymind --config configs/tinymind.example.json memory "local models"
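
The store and search commands above could be backed by something like the following sketch. It assumes the Python build's SQLite includes the FTS5 extension; the table layout and function names are hypothetical and are not taken from memory.py.

```python
import sqlite3


def open_memory(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    # title, body, and tags are full-text indexed; importance is stored only.
    conn.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS notes "
        "USING fts5(title, body, tags, importance UNINDEXED)"
    )
    return conn


def remember(conn, title, body, tags=(), importance=1):
    conn.execute(
        "INSERT INTO notes (title, body, tags, importance) VALUES (?, ?, ?, ?)",
        (title, body, " ".join(tags), importance),
    )
    conn.commit()


def search(conn, query, limit=5):
    # Quote each term so user text is not parsed as FTS5 query syntax.
    terms = " ".join('"' + t.replace('"', '""') + '"' for t in query.split())
    if not terms:
        return []
    rows = conn.execute(
        "SELECT title, body FROM notes WHERE notes MATCH ? "
        "ORDER BY bm25(notes) LIMIT ?",
        (terms, limit),
    )
    return rows.fetchall()
```

Space-separated quoted terms are combined with an implicit AND, so a note must contain every query word to be returned.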

Run benchmark

tinymind --config configs/tinymind.example.json bench examples/tasks.jsonl
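
A benchmark run of this kind boils down to a small loop. The sketch below assumes each JSONL line has "prompt" and "expect" fields and scores by substring match; the actual schema of examples/tasks.jsonl and the scoring in bench/harness.py may differ.

```python
import json
import time


def run_bench(lines, answer_fn):
    """Run answer_fn over JSONL task lines and report aggregate metrics.

    Each line is assumed (hypothetically) to be a JSON object with
    'prompt' and 'expect' keys.
    """
    results = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        task = json.loads(line)
        start = time.perf_counter()
        output = answer_fn(task["prompt"])
        latency = time.perf_counter() - start
        results.append({
            "passed": task["expect"].lower() in output.lower(),
            "latency_s": latency,
            "tokens_est": len(output) // 4,  # rough: about 4 characters per token
        })
    n = len(results)
    return {
        "tasks": n,
        "pass_rate": sum(r["passed"] for r in results) / n if n else 0.0,
        "mean_latency_s": sum(r["latency_s"] for r in results) / n if n else 0.0,
        "tokens_est": sum(r["tokens_est"] for r in results),
    }
```

Passing the answer function in as a parameter lets the same loop score the mock provider, a single model, or the full routed pipeline.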

Optional API server

pip install -e '.[server]'
TINYMIND_CONFIG=configs/tinymind.example.json uvicorn tinymind.server.app:app --reload --port 8765

Ask:

curl -s http://127.0.0.1:8765/ask \
  -H 'Content-Type: application/json' \
  -d '{"prompt":"Explain TinyMind in 5 bullets"}' | jq

Architecture

User prompt
  ↓
Feature extraction
  ↓
Router
  ├── tool-only path
  ├── small local model path
  ├── medium local coding model path
  └── hard reasoning / verifier path
  ↓
Memory retrieval
  ↓
Provider: Ollama / llama.cpp / mock
  ↓
Verifier loop for hard tasks
  ↓
Trace stored back into SQLite memory
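
The flow in the diagram can be sketched as a single function that takes each stage as a callable. This is an illustrative outline, not the contents of agent.py; the parameter names, the "hard" route label, and the retry prompt are assumptions.

```python
from typing import Callable


def answer(
    prompt: str,
    route: Callable[[str], str],
    retrieve: Callable[[str], list],
    generate: Callable[[str], str],
    verify: Callable[[str, str], bool],
    store: Callable[[dict], None],
    max_attempts: int = 2,
) -> str:
    """Route, retrieve memory, generate, optionally verify, then store a trace."""
    kind = route(prompt)
    context = "\n".join(retrieve(prompt))
    full_prompt = f"{context}\n\n{prompt}" if context else prompt

    draft = generate(full_prompt)
    attempts = 1
    # Only hard routes pay the cost of verification and retries.
    while kind == "hard" and attempts < max_attempts and not verify(prompt, draft):
        draft = generate(
            full_prompt + "\n\nThe previous answer failed verification. Try again."
        )
        attempts += 1

    store({"prompt": prompt, "route": kind, "attempts": attempts, "answer": draft})
    return draft
```

Because every stage is injected, each one can be replaced by a stub in tests or swapped for a different backend without touching the loop.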

Files

tinymind/
  agent.py              orchestration loop
  router.py             prompt classification + route decisions
  memory.py             SQLite memory and FTS retrieval
  providers/            mock, Ollama, llama.cpp
  tools/                safe file/shell tools
  bench/harness.py      JSONL benchmark runner
  server/app.py         optional FastAPI API
configs/
  tinymind.example.json hardware, model, and routing config
examples/
  tasks.jsonl           sample benchmark tasks
scripts/
  smoke_test.sh         local smoke test

What to build next

This MVP is deliberately small. The high-value roadmap is:

1. Hardware optimizer

Add:

auto-detect RAM/VRAM/CPU/GPU
generate recommended llama.cpp/Ollama/MLX flags
decide context length, quantization, and KV cache precision
output per-device configs for RTX 4060, Mac M-series, Jetson, Raspberry Pi
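
One possible shape for the decision step is a pure function from available memory to settings. The thresholds below are illustrative placeholders, not tuned recommendations, and the names are hypothetical; the generated flags are the same llama-server options used in the llama.cpp example earlier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Recommendation:
    context_length: int
    kv_cache_type: str


def recommend(memory_gb: float) -> Recommendation:
    """Pick settings from usable memory (VRAM, or unified/system RAM).

    Thresholds are assumptions for illustration only.
    """
    if memory_gb >= 16:
        return Recommendation(32768, "f16")
    if memory_gb >= 8:
        return Recommendation(32768, "q8_0")
    if memory_gb >= 4:
        return Recommendation(8192, "q8_0")
    return Recommendation(4096, "q8_0")


def llama_server_flags(rec: Recommendation) -> list:
    """Translate a recommendation into llama-server command-line flags."""
    return [
        "-c", str(rec.context_length),
        "--cache-type-k", rec.kv_cache_type,
        "--cache-type-v", rec.kv_cache_type,
    ]
```

Keeping detection separate from the decision function means the thresholds can be tested and tuned per device without needing that hardware at hand.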

2. Real routing model

Replace rules with a tiny local classifier trained from traces:

prompt → task_kind, difficulty, risk, best_model, expected_cost

3. Persistent prefix cache

For llama.cpp, set cache_prompt on every request and keep the system/project prefix stable so the server can reuse the cached prefix instead of reprocessing it. Long-term goal:

SOUL.md + USER.md + PROJECT.md + TOOLS.md → cached once → reused across runs
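
Prefix reuse only works if the prefix is byte-identical from one request to the next. The sketch below shows one way to build such a prefix and a request body for the llama.cpp server's /completion endpoint; the helper names and the section header format are assumptions.

```python
# Fixed order so the prefix never changes between runs.
PREFIX_ORDER = ["SOUL.md", "USER.md", "PROJECT.md", "TOOLS.md"]


def build_prefix(docs: dict) -> str:
    """Concatenate the available docs in a fixed order with stable headers."""
    return "".join(
        f"## {name}\n{docs[name].strip()}\n\n"
        for name in PREFIX_ORDER
        if name in docs
    )


def completion_payload(prefix: str, user_prompt: str, n_predict: int = 256) -> dict:
    """Build a /completion request body with the stable prefix first."""
    return {
        "prompt": prefix + user_prompt,
        "n_predict": n_predict,
        # Ask the server to reuse KV cache for the part of the prompt
        # it has already processed in a previous request.
        "cache_prompt": True,
    }
```

Anything that varies per request, such as timestamps or retrieved memories, belongs after the prefix, since a change early in the prompt invalidates everything cached after it.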

4. Memory compiler

A background job that runs while the machine is idle and turns traces into:

facts
preferences
project summaries
repo maps
failure notes
reusable skills

5. Local agent benchmark

Expand examples/tasks.jsonl into a real benchmark:

coding patches
log triage
shell planning
structured extraction
memory recall
tool-use accuracy
latency/watt/cost reporting

6. Non-Transformer backend experiments

Add provider adapters for:

RWKV
Mamba/SSM models
BitNet/bitnet.cpp
MLX
WebGPU/WebNN later

Design principle

Do not chase a bigger model first.

Build the runtime that makes small models useful.
