Persistent memory layer for Claude Code (and any LLM agent). Most AI agents have amnesia. HippoAgent gives them a hippocampus.
HippoAgent is not a replacement for Claude Code, Cursor, or your favourite agent. It's the memory module they're missing. Plug it in via MCP and:
- Episodes you and the agent worked on persist across sessions.
- Procedures the agent figured out get compiled into deterministic macros — the next call doesn't even hit the LLM.
- Failure → success patterns are rescued and replayed during sleep cycles.
8+ neuroscientific mechanisms (DG pattern separation, TCM, synaptic tagging, lateral inhibition, engram crossover, schema priming, …) all opt-in via flags.
pip install hippoagent
claude plugin install hippoagent

Or manually drop a .mcp.json in your project root:
{
"mcpServers": {
"hippoagent": {
"command": "python",
"args": ["-m", "hippoagent.mcp_server"]
}
}
}

Restart Claude Code and the hippoagent-memory skill auto-activates at the start of every conversation. To disable it temporarily: export HIPPO_DISABLED=1.
5 iterations of the same task family on Anthropic Claude Opus 4.7, 8 tasks per iter (digit-sum suite). After the 3rd iteration HippoAgent has compiled the procedure into a deterministic macro that bypasses the LLM entirely:
| Iter | HippoAgent tokens | HippoAgent latency | Raw LLM tokens | Raw LLM latency |
|---|---|---|---|---|
| 0 (cold) | 4225 | 4.47s | 59 | 0.67s |
| 1 | 1711 (-60%) | 0.92s (-79%) | 59 | 0.71s |
| 2 | 687 (-84%) | 0.52s (-88%) | 59 | 0.85s |
| 3 | 0 ✅ | 0.22s (-95%) | 59 | 1.08s |
| 4 | 0 ✅ | 0.24s (-95%) | 59 | 0.69s |
Break-even at iter 3. Macro fast-path hit rate: 70%. The agent stops asking the LLM for tasks it has solved before — your token bill drops to zero on recurrent work and your replies come back in 200 ms instead of 700 ms.
Bench data: data/bench_learning_curve_anthropic_n5.{results,by_iter}.json.
Beyond digit-sum: 5 TRAIN tasks (URL parsing, date format, capitalize, reverse, word count) → sleep consolidate → 5 HELD-OUT tasks with fresh inputs of the same families.
| Phase | Success | Rate |
|---|---|---|
| TRAIN | 5/5 | 100% |
| HELD-OUT | 5/5 | 100% |
The agent retrieves 3 relevant skills per held-out task and applies them
without per-task re-discovery. Bench data: data/bench_held_out_practical.{results.json,summary.md}.
12 tasks at increasing skill-chaining depth (apply ROT3 + REVERSE in 1-5 chained transformations to fresh inputs). 4 providers, 1 iter:
| Provider | Lv1 | Lv2 | Lv3 | Lv4 | Lv5 | overall |
|---|---|---|---|---|---|---|
| Anthropic raw | 100% | 50% | 0% | 0% | 0% | 42% |
| Anthropic HippoAgent | 100% | 100% | 100% | 100% | 100% | 100% |
| DeepSeek raw | 50% | 0% | 0% | 0% | 0% | 25% |
| DeepSeek HippoAgent | 100% | 100% | 100% | 100% | 100% | 100% |
| OpenRouter raw | 100% | 0% | 0% | 0% | 0% | 33% |
| OpenRouter HippoAgent | 100% | 100% | 100% | 100% | 100% | 100% |
| Groq raw | 50% | 0% | 0% | 0% | 0% | 25% |
| Groq HippoAgent | 100% | 50% | 100% | 50% | 100% | 83% |
The accuracy gap GROWS with composition depth. Single-shot LLMs collapse to 0% by Lv3; persistent-memory agents stay at 100% up to Lv5 on 3/4 providers — the empirical signature of compositional reasoning over consolidated skills.
Compiled-macro hit rate (skills bypassing the LLM entirely): OpenRouter 92%, Groq 75%, DeepSeek 58%, Anthropic 0% (Anthropic uses the full ReAct loop — the 100% accuracy comes from memory recall, not caching).
Bench data (replicable): data/bench_compositional_4providers.*.json.
python scripts/bench_with_without_hippo.py \
--suite compositional \
--providers anthropic,deepseek,openrouter,groq \
--conditions raw,hippo_warm

HippoAgent (formerly Engram) is named after the neuroscientific term for a permanent memory trace (Lashley, Tulving). The system turns experience into inspectable, fitness-tracked, mergeable artifacts.
If you want the end-to-end map of the moving parts — components,
configuration knobs, task flow, MCP integration, multi-model bench
harness, test isolation — read docs/PLATFORM.md.
That doc is the contract; everything below is narrative.
For a 5-minute MCP integration walkthrough (Claude Code, Cursor,
opencode, Cline, Continue) see docs/MCP_QUICKSTART.md.
The v0.2 push consolidates the prototype into a vendible system. See
CHANGELOG.md and the R&D reports
(SECURITY_AUDIT.md, ARCHITECTURE_AUDIT.md,
QA_AUDIT.md, RND_MEMORIE.md,
RND_PERFORMANCE.md, RND_UX.md)
for the full picture.
- 13 CVEs closed — RCE in /api/ide/run and the WS terminal (auth token + binary allowlist + shlex.split), SSRF blocklist in web_fetch, sensitive-path deny-list, plaintext API-key leak in /api/settings/providers redacted, computer-use kill-switch hotkey deny-list, dashboard CORS lock + session token, editfmt config-file deny-list, MCP server schema validation + audit log + rate limit, prompt-injection wrapper around external content (<untrusted_content>), Docker-sandboxed Python executor (opt-in via HIPPO_PYTHON_EXEC_BACKEND=docker).
- Active memory v2 — 7 enhancements to the original 6 mechanisms + 5 new ones (11 total). All five new mechanisms are zero-LLM-call (pure numpy / string ops on existing memory). See RND_EXPLORATION.md for the diary of how they were found:
  - 7 — Working Memory Pruning (RND_MEMORIE.md): wake-loop char-budget compressor; critical for small-context models. Validated on Ollama qwen2.5:7b: −54% token usage with lower variance vs the unpatched build.
  - 8 — Trace Alignment / Reverse Replay (RND_TRACE_ALIGNMENT.md): Needleman-Wunsch on observation embeddings finds the exact divergence step between a failed run and its success-twin. Two modes: action-divergence (same situation, different decision) and input-divergence (same tool, wrong file/query). Inspired by sharp-wave reverse replay in CA1 place cells.
  - 9 — Lateral Inhibition (anti-Hebbian): when a winner skill consolidates on a task, its near-clone rivals are nudged away from that task vector. Földiák (1990) competitive specialisation. Empirically: −0.067 cosine differentiation at step 50 vs the Hebbian-only baseline. Disabled by default; opt in via lateral_inhibition_enabled. A sketch of the nudge follows this list.
  - 10 — Spontaneous Reactivation: a default-mode rehearsal stage during sleep. Skills not used in N days get their last_used_at pushed forward by half the decay cutoff so they don't fall over the retirement cliff. Born & Wilhelm (2012) spaced-repetition substrate. Disabled by default; opt in via spontaneous_reactivation_enabled.
  - 11 — Salience by Surprise: replay priority gains a fourth term that boosts episodes whose num_steps deviates from the skill's average — Buzsáki (2015) prediction-error replay. Combined with multi-skill smallest-deviation logic so a trace that is typical for skill A but anomalous for skill B doesn't double-count. Disabled by default (sleep_replay_priority_surprise=0.0).
- Performance — hot paths sped up 16× to 4,700×: LRU-cached embeddings, vectorised skill clustering (corpus @ corpus.T), in-memory recall index with optional FAISS, mtime+size repomap cache.
- Architecture — dashboard.py 2,338-LOC monolith → 159-LOC entry point + an 11-file dashboard_routes/ package. LLM provider registry moved to providers.yaml + a Pydantic ProviderSpec. New pydantic-settings-based Settings v2 singleton. Lightweight Alembic-style migrations.
- UX — production-grade design system at hippoagent/static/dashboard.css (WCAG 2.1 AA-verified contrasts, 4 px scale, 1.25 type ratio, light theme). /skills page redesigned with KPI grid + responsive card grid + filter pills + accessibility. CLI banner with contextual tips and a grouped /help.
- CI/CD — 3 OS × 4 Python = 11 jobs, plus a dedicated security workflow (pip-audit, safety, bandit, ruff S-rules) running weekly. Multi-stage Dockerfile (~500 MB), non-root user, HEALTHCHECK. pip extras ([headless], [mcp-only], [tui], [vision], [full], [dev]) — the default install is now minimal and sane.
- Tests — 113 → 1072+ (+849%). Coverage 46% → 59%. Ruff: 33 → 0 errors. Recent additions (FORGIA #27–#89): tests/test_bench_harness.py, test_bench_compare.py, test_bench_summary_md.py, test_bench_cli.py, test_bench_recall_ablation.py, test_clean_bench_data.py, test_jsonutil.py, test_corruption_guards.py, test_data_dir_isolation.py, test_auto_fallback.py, test_config_env.py, test_makefile_help.py, test_wake_used_macro.py, test_sleep_report_n_llm_calls.py, test_real_provider_smoke.py, test_mcp_e2e_smoke.py. Original suites preserved: tests/security/ (path traversal, SSRF, secrets redaction, prompt injection, executor isolation, editfmt sensitive, pentest validation), tests/test_settings.py, test_settings_v2.py, test_provider_registry.py, test_migrations.py, test_mcp_server.py, test_mcp_server_security.py, test_dashboard_api.py, test_cli.py, tests/perf/test_perf.py (10 benchmarks), tests/test_rnd_active_memory.py, test_trace_alignment.py, test_lateral_inhibition.py.
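The anti-Hebbian nudge in mechanism 9 is small enough to sketch. A minimal illustration, assuming unit-norm numpy embeddings; the function name and the inhibition rate `beta` are illustrative, not HippoAgent's actual API:

```python
import numpy as np

def anti_hebbian_nudge(rival_trigger: np.ndarray,
                       task_vec: np.ndarray,
                       beta: float = 0.02) -> np.ndarray:
    """Push a rival skill's trigger embedding AWAY from the task vector the
    winning skill just consolidated on, then re-normalise to unit length."""
    nudged = rival_trigger - beta * task_vec
    return nudged / np.linalg.norm(nudged)
```

Repeated over consolidation steps, near-clone skills drift apart in embedding space instead of competing for the same tasks — the cosine differentiation measured above.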
You give the agent a task in plain language. It thinks, uses tools (Python sandbox, file I/O, shell, web fetch, screenshots, webcam, vision LLM, computer use), retrieves any relevant past skills it has consolidated, and answers. Every conversation is an episode in memory. Every few episodes, you trigger a sleep cycle that distills new procedural skills (NREM), recombines them creatively (REM), merges duplicates, and promotes/retires by Bayesian fitness. Tomorrow the agent is genuinely better at what you asked it yesterday — and you can read every lesson it learned.
Multi-model bench harness (FORGIA #27, 2026-05-09): same task suite, 3 conditions (raw / hippo_cold / hippo_warm), 3-4 real providers, fail-isolated.
Headline result — memory_recall suite (the discriminative one):
| Provider | raw | hippo_cold | hippo_warm |
|---|---|---|---|
| anthropic | 0.50 | 1.00 | 1.00 (latency −56 %) |
| deepseek | 0.50 | 1.00 | 1.00 |
| openrouter | 0.50 | 1.00 | 1.00 |
The 50 % raw failure is the 3 query tasks ("What was the value I told you?") — with no shared context, the LLM has no place to retrieve from. HippoAgent's recall pipeline retrieves the seed episode and the query phase succeeds 100 % on every provider. +50 percentage-point accuracy uplift, three different LLMs.
Hardened result — hard_memory_recall (12 tasks: direct + paraphrased + synthesis):
| Provider | raw | hippo_cold | hippo_warm |
|---|---|---|---|
| anthropic | 0.50 | 1.00 | 1.00 (latency −51 %) |
| deepseek | 0.50 | 0.92 | 0.92 (lost the synthesis) |
| openrouter | 0.50 | 1.00 | 1.00 |
The headline holds: +42–50 pp uplift across paraphrased queries and multi-step synthesis. DeepSeek lost the multi-step task (retrieved both facts but failed the addition) — HippoAgent provides the memory, arithmetic composition is on the model.
Skill-compounding suite (8 digit-sum tasks): hippo_warm latency −41 % vs hippo_cold on anthropic (compiled-macro fast-path engaging). Default trivia suite: raw wins (~50 tokens vs ~3 000), hippo costs structural overhead but proves end-to-end transport works.
All raw data committed at data/bench_*.{results,summary}.json. Full
analysis in docs/PLATFORM.md.
Reproduce locally:
make bench-help # list available task suites
make bench-mock # offline smoke (no API key needed)
make bench-real # run on every provider with an env key set
make bench-memory # the discriminative recall suite
make bench-summary # render the latest summary as markdown
make bench-csv # CSV (Excel-friendly)
make bench-quick # mock + 2 tasks (CI smoke)
make bench-clean # dry-run of transient bench dirs
make stats # project size + test count
make bench-compare BEFORE=... AFTER=...  # diff two bench summaries

The bench script supports many flags useful in CI: --quiet, --max-tasks N, --task-id ID, --providers auto|csv, --suite NAME, --n-iter N, --consolidate-every K, --save-md, --memory-stats, --show-failures, --print-config, --list-providers, --clean-data, --output-dir PATH. See python scripts/bench_with_without_hippo.py --help.
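For instance, a quiet CI-friendly run combining flags from that list (the particular combination is illustrative):

```bash
python scripts/bench_with_without_hippo.py \
  --suite memory_recall \
  --providers anthropic,groq \
  --conditions raw,hippo_warm \
  --quiet --max-tasks 4 --save-md
```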
Tool-use verified live across 4 providers (same task: write a real file to Desktop), all via native tool-use API (no fragile JSON-in-text parsing):
| Provider | Model | Steps | Tokens | Outcome |
|---|---|---|---|---|
| Ollama (local, free) | qwen2.5:1.5b | 2 | 4,655 | ✓ wrote file to disk |
| Ollama (local, free) | qwen2.5:7b-instruct | 2 | 4,699 | ✓ wrote file to disk |
| Groq (free tier) | llama-3.3-70b-versatile | 4 | 11,794 | ✓ wrote file to disk |
| Anthropic | claude-haiku-4-5 | 2 | 7,200 | ✓ wrote file to disk |
Computer-use end-to-end verified live:
- ✓ shell_run: whoami, systeminfo, wmic, ver — Claude summarizes the system
- ✓ Task "open Calc + take a screenshot + describe it + close it" — 4 steps, success
- ✓ vision_describe — Claude correctly describes the OMNEX logo
- ✓ web_fetch + web_search (DuckDuckGo) — retrieved Nature/ScienceDirect papers
- ✓ desktop_screenshot 2560×1600 + describe via Anthropic vision
- ✓ Sleep cycle: 18 episodes → 6 NREM + 2 REM + 2 merges + 6 facts in 103 s
Plan mode + auto-fallback chain verified.
HippoAgent is a working prototype of an idea: an LLM agent that becomes measurably more competent over time without ever updating its weights. It does so by mimicking the two-stage memory consolidation model from neuroscience (NREM slow-wave + REM paradoxical sleep). Every "lesson" the agent learns is a structured, versioned, fitness-tracked artifact you can read, edit, share, retire — not an opaque parameter shift.
Today's LLM agents have two ways to "learn":
- Fine-tuning — costly, centralized, opaque, irreversible.
- RAG / context — they "know" things but don't remember the session; no consolidation, no transfer, no growth curve.
The space in between — what humans actually do during sleep — is empty in production systems. Voyager (Wang et al.) introduced a skill library for Minecraft. MemGPT/Letta layered tiered memory. Reflexion added critique. But nobody has closed the loop with a consolidation cycle that:
- replays episodes (success and failure),
- extracts invariant patterns into procedural skills,
- recombines existing skills creatively,
- tests them under a fitness function,
- prunes the losers, promotes the winners,
- and measures itself with held-out tasks.
That loop is the bet of HippoAgent.
┌──────────────────────────────────────────────────────────────────────┐
│ HippoAgent │
│ │
│ ╔═══════════════ WAKE ════════════════╗ ╔═════ SLEEP ══════════╗ │
│ ║ Task → Memory retrieval (skills ║ ║ NREM: cluster ║ │
│ ║ + similar episodes) ║ ║ episodes → ║ │
│ ║ ReAct loop with tool use: ║ ║ distill skills ║ │
│ ║ • run_python (sandboxed subproc) ║ ║ + semantic facts ║ │
│ ║ • syntax_check, find_function ║ ║ ║ │
│ ║ • submit_solution ║ ║ REM: pick 2 skills, ║ │
│ ║ Reflexion-style self-critique on ║ ║ propose hybrid ║ │
│ ║ failure → 1-shot retry. ║ ║ ║ │
│ ║ Episode persisted to memory. ║ ║ Curator: merge ║ │
│ ╚══════════════════ ▼ ════════════════╝ ║ semantic dups ║ │
│ │ ║ ║ │
│ Episodes (SQLite + ║ Pruning: Bayesian ║ │
│ embeddings + causal graph) ║ fitness → promote / ║ │
│ ║ retire skills. ║ │
│ Skills (JSON files + ╚══════════ ▲ ══════════╝ │
│ version chain + lineage DAG ──────────────┘ │
│ + Beta-Binomial fitness) │
│ │
│ Semantic facts (decoupled from time) │
└──────────────────────────────────────────────────────────────────────┘
Observability layer: every action emits a structured event →
structlog → metrics registry → dashboard (FastAPI + vis-network).
| Module | What it does | Key novelty |
|---|---|---|
| episode.py | Episode + Trace — full ReAct trajectory, immutable | timestamped, embedding-indexed |
| memory.py | Episodic memory: SQLite + dense recall + causal graph (networkx) | clustering for replay; A→B causal edges via shared skill |
| semantic.py | Semantic memory: facts decoupled from time | Tulving-style separation |
| skill.py | Skill library — versioned JSON files + index | Bayesian Beta-Binomial fitness, lineage DAG, status lifecycle |
| tools.py | Sandboxed Python executor + AST analyzer | subprocess isolation, timeout, output cap |
| wake.py | ReAct loop with skill+episode injection | tolerant parser, Reflexion critique, A/B toggle |
| sleep.py | Multi-stage consolidation engine | NREM + REM + Curator + Pruning |
| prompts.py | All LLM prompts, in one auditable file | "experience as artifact" thesis |
| observability.py | EventBus + structlog + metrics registry | every step emits a typed event |
| dashboard.py | FastAPI + HTML dashboard | skill lineage graph (vis-network) |
| cli.py | typer + rich CLI | hippo run/wake/sleep/benchmark/skills/episodes/dashboard |
| benchmark/ | 18 HumanEval-style coding tasks + evaluator | wake/heldout split, Wilson CIs, two-prop z-test |
- Two-stage consolidation: Walker & Stickgold (2004), Diekelmann & Born (2010). NREM consolidates declarative memory; REM enables creative recombination.
- Episodic vs semantic vs procedural: Tulving (1972, 1985).
- Fast hippocampal replay → slow cortical consolidation: McClelland, McNaughton & O'Reilly (1995) — the model HippoAgent's wake/sleep split mirrors.
- Reflexion: Shinn et al. (2023) — verbal RL on agents (no gradients) — used here for the self-critique retry.
- Voyager: Wang et al. (2023) — skill library for embodied agents — closest prior art for the procedural memory.
- Bayesian fitness for small N: Beta-Binomial conjugate prior — robust posterior mean even after 1–2 trials.
- Declarative → procedural transition: Anderson (1982) ACT-R, Logan (1988) instance theory — the basis for procedural compilation.
- Hippocampal forward sweeps: Pfeiffer & Foster (2013), Diba & Buzsáki (2007) — predictive replay before action — the basis for forward replay.
- Hebbian plasticity: Hebb (1949) "cells that fire together wire together" — the basis for the trigger-embedding drift on success.
- Counterfactual reasoning in episodic memory: Gershman & Daw (2017) — alternative trajectories during offline replay drive learning beyond mere reinforcement.
Most LLM-agent memory systems are passive: they retrieve past prompts and dump them into context. HippoAgent's memory is active — the act of using a skill makes it stronger, faster, and more discriminating. Six mechanisms compound:
1. Procedural compilation — once a skill has succeeded N times with high fitness, the DREAMER (during sleep) distills its successful traces into a parameterised macro: a list of tool calls with {{TASK}} and {{LAST_OBSERVATION}} placeholders. At wake time, when a strongly-matching task arrives, the macro is executed deterministically — zero LLM tokens, no model latency between steps. The skill is not just remembered, it is compiled, the way deliberate actions become motor reflexes. (compilation.py)
2. Forward replay — before the wake loop fires, the agent looks up the top skill's past successful trajectories and projects an expected action sequence. The block is injected as ## PREDICTED PATH in the user prompt. This anchors the LLM's reasoning (less drift on familiar tasks) and lets us detect divergence (a learning signal). Pure retrieval — no extra LLM call. (wake.py:_forward_replay_block)
3. Hebbian skill embedding — every successful application drifts the skill's trigger embedding toward the task that just succeeded (new = (1 − α)·current + α·task, α = 0.05, then re-normalised; see the sketch after this list). Skills become more retrievable for the kind of task they keep solving — the library shapes itself to its workload over time without any retraining. (skill.py:_hebbian_update)
4. Counterfactual REM — when a skill keeps failing (fitness < 0.5, trials ≥ 3), the dreamer doesn't just decrement Bayesian counts. It reads the failed trajectory and synthesises an alternative strategy — a candidate counterfactual skill with the failed skill as parent. The alternative competes for retrieval on future similar tasks; if it wins, it supersedes the broken approach without manual intervention. (sleep.py:_stage_counterfactual)
5. Schema formation — once a domain has accumulated enough skills (≥3 with cosine similarity ≥ 0.62 on triggers), the dreamer writes a SCHEMA: a meta-skill whose body is a one-line rubric for picking among the children. Lineage edges (relation='specialises') connect each schema to its specifics, building a 2-level hierarchy that stays navigable as the library grows. Tulving's episodic→semantic transition. (sleep.py:_stage_schema)
6. Self-suggested practice — for skills sitting in the uncertain middle (fitness 0.45–0.65), the dreamer writes 2 concrete practice prompts that would plausibly trigger them. They appear in the dashboard's skill detail under "📚 Practice prompts" with one-click "▶ run in chat" buttons. Running a prompt feeds real evidence into the Bayesian fitness — so the skill is decisively promoted or retired instead of lingering in ambiguity. The agent literally suggests its own training set. (sleep.py:_stage_practice)
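A minimal sketch of the Hebbian drift from mechanism 3, assuming unit-norm numpy embeddings (the function mirrors the formula in the text; the signature is illustrative, not skill.py's real one):

```python
import numpy as np

def hebbian_update(trigger: np.ndarray,
                   task: np.ndarray,
                   alpha: float = 0.05) -> np.ndarray:
    """new = (1 - alpha) * current + alpha * task, then re-normalised:
    the trigger drifts toward the kind of task the skill keeps solving."""
    new = (1.0 - alpha) * trigger + alpha * task
    return new / np.linalg.norm(new)
```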
Together these turn memory from an archive into an organ that grows.
python scripts/demo_active_memory.py

Seeds a library with six skills, runs a complete sleep cycle (NREM + REM + Curator + compilation + counterfactual + schema + practice + pruning) using a scripted mock LLM, then re-runs a matching task to demonstrate the macro fast-path. Typical output:
Sleep cycle — six mechanisms fire in one pass
duration : 1.37s (12 LLM calls, 3080 tokens)
🔧 macros : 1
🌀 cf : 1
🌳 schemas : 1
📚 practice : 6 prompts written
promoted : 1 retired : 1
Wake — re-running a similar task uses the macro fast-path
steps : 2
tokens used : 0
llm calls : 0 (macro fired, no model invoked)
Run python scripts/bench_macro.py --repeats 3 --latency 0.5 to reproduce:
COLD (ReAct) median wall=1.296s llm_calls=2 tokens=560
HOT (macro) median wall=0.202s llm_calls=0 tokens=0
→ speed-up : 6.41x
→ time saved : 1094 ms per task
→ token saved : 560 per task
The simulated LLM models a real-world 0.5 s/call latency (typical for hosted models). On the hot path the macro fires deterministically — not a single token leaves the box.
Visit /active-memory (or click "Active memory" in the nav) for a live KPI
panel — compiled-macro coverage, Hebbian-tuned skills, counterfactual lineage,
schema hierarchy. The /api/active-memory/stats JSON endpoint exposes the
same metrics for external tooling.
Every chat turn surfaces 👍 / 👎 buttons. Clicking them feeds the same Bayesian fitness machinery used by automatic outcomes — no separate "feedback model" or fine-tune. A 👎 on a successful turn flips the episode to failure and records a failure trial against each applied skill; mis-promoted skills decay back into the candidate pool. A 👍 simply boosts the trial count with a success.
So the library has three sources of evidence, all converging on the same fitness posterior:
- automatic validator outcomes during the wake loop;
- counterfactual REM evaluations during sleep;
- explicit user up- / down-votes from the chat surface.
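Concretely, all three streams just increment the same two counters of a Beta-Binomial posterior. A minimal sketch with a uniform Beta(1, 1) prior — the prior hyper-parameters are an assumption, not HippoAgent's tuned values:

```python
from dataclasses import dataclass

@dataclass
class SkillFitness:
    a: float = 1.0  # prior + observed successes (Beta(1,1) = uniform prior; assumed)
    b: float = 1.0  # prior + observed failures

    def record(self, success: bool) -> None:
        if success:
            self.a += 1
        else:
            self.b += 1

    @property
    def posterior_mean(self) -> float:
        # Well-behaved even after 1-2 trials, unlike a naive success ratio
        return self.a / (self.a + self.b)

f = SkillFitness()
f.record(True)    # automatic validator pass during the wake loop
f.record(True)    # user 👍
f.record(False)   # user 👎 flips an outcome to failure
print(f.posterior_mean)  # 0.6
```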
cd ProgettiAI/HippoAgent
python -m venv .venv
source .venv/Scripts/activate # or .venv/bin/activate on POSIX
pip install -e .
hippo dashboard  # → http://127.0.0.1:8765

The first time you open the dashboard, you'll land on /welcome — pick a
provider, paste an API key (or run Ollama locally — no key needed), test the
connection, save. Everything is configurable from the ⚙ Settings page;
no env-var voodoo required.
docker compose up
# or: docker build -t hippoagent . && docker run -p 8765:8765 -v $PWD/data:/app/data hippoagent

The container picks up any *_API_KEY env vars you pass through. To use a
host-running Ollama, the compose file already maps host.docker.internal.
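For reference, a compose file of the shape described would look roughly like this — an illustrative sketch, not the repo's actual docker-compose.yml:

```yaml
services:
  hippoagent:
    build: .
    ports:
      - "8765:8765"
    volumes:
      - ./data:/app/data        # persist episodes/skills across container restarts
    environment:
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY:-}   # any *_API_KEY passes through
    extra_hosts:
      - "host.docker.internal:host-gateway"        # reach a host-running Ollama
```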
There are three ways to interact with the agent. They share the same backend (memory, skills, sleep cycles) — just pick what suits you.
hippo dashboard → http://127.0.0.1:8765
| page | what it's for |
|---|---|
| Chat ⭐ | the main UI — type a task, get an answer with applied skills |
| Settings | pick LLM provider, paste API key, test connection, switch models live |
| Episodes | every task ever executed; click any row for the full ReAct trajectory |
| Skills | consolidated lessons with Bayesian fitness, status, lineage |
| Lineage | interactive graph of how skills derive from one another |
| Events | live event stream (every memory write, retrieval, LLM call) |
| Metrics | counters/histograms |
hippo chat # interactive REPL: type tasks, /sleep, /skills, /quit
hippo run "your task here" # one-shot
hippo wake --n-tasks 5 # run the bench wake-set
hippo sleep # consolidation cycle on demand
hippo benchmark # held-out evaluation
hippo skills list / show <id>
hippo episodes list / show <id>
hippo providers list / scan / models <p> / active

HippoAgent ships an MCP (Model Context Protocol) server, so any MCP-aware
client can use it as a memory-augmented agent. Add this to your client's
mcp.json (or equivalent):
{
"mcpServers": {
"hippoagent": {
"command": "hippo",
"args": ["mcp"],
"env": {
"HIPPO_LLM_PROVIDER": "groq",
"GROQ_API_KEY": "your-key",
"HIPPO_MODEL": "llama-3.3-70b-versatile"
}
}
}
}

Once registered, the host (Claude Code, Cursor, etc.) can call:
| MCP tool | what it does |
|---|---|
| hippo_run_task | full wake loop — agent uses its own tools + skills |
| hippo_consolidate | trigger a sleep cycle |
| hippo_recall | semantic search over past episodes (embeddings) |
| hippo_recall_explain | semantic recall + per-component score breakdown |
| hippo_search | keyword search over episode task_text (LIKE) |
| hippo_episode_list | paginated listing of episodes (limit/offset/outcome) |
| hippo_episode_get | one episode in full (trajectory + critique) |
| hippo_episode_pin / hippo_episode_unpin | protect / release an episode from decay-pruning |
| hippo_forget | delete one episode by id (privacy / GDPR) |
| hippo_metrics_history | token-usage timeseries bucketed by day |
| hippo_skills_for | preview which skills HippoAgent would inject for a task |
| hippo_skill_promote / hippo_skill_retire / hippo_skill_edit | manual curation |
| hippo_skill_export / hippo_skill_import | portable JSON bundles (share skills between installations) |
| hippo_skill_test | render the prompt-context for a (skill, task) pair without calling the LLM |
| hippo_skill_top | top-k skills by fitness / recency / activity |
| hippo_skill_lineage | walk the parent_skills DAG ancestry of a skill |
| hippo_skill_compare | diff two skills (body / fitness / trials) |
| hippo_skill_similar | top-k skills by Jaccard overlap on body tokens |
| hippo_skill_describe | deterministic 1-line natural-language summary of a skill (no LLM) |
| hippo_skill_merge | manually merge skill A into B (sum trials, retire A, lineage tracked) |
| hippo_episodes_by_skill | every episode whose skills_used includes a given skill |
| hippo_provider_switch | switch the active LLM provider at runtime (anthropic / openai / groq / …) |
| hippo_remember | store one fact directly in semantic memory — no episode, no sleep cycle |
| hippo_facts_recall | semantic search over facts (cosine on proposition embedding) |
| hippo_facts_search | keyword/substring search over facts (LIKE on proposition) |
| hippo_facts_list | paginated listing of all facts (newest-first) |
| hippo_fact_forget | delete one fact by id (privacy / GDPR) |
| hippo_skills_search | keyword/substring search over skills (LIKE on name+trigger+body) |
| hippo_skill_bundles / hippo_compound_skills / hippo_skill_antagonists | structural introspection |
| hippo_status | counts + active provider |
| hippo_health | deep preflight at startup — 3-tier reachability + counts + flag + tool_count |
| hippo_stats | aggregate metrics (episodes by outcome, skills, token usage) |
| hippo_audit_tail | last N records of the MCP audit log (forensics) |
And read MCP resources:
- hippo://skills/list and hippo://skills/{id} — consolidated skills
- hippo://episodes/recent and hippo://episodes/{id} — past trajectories
The result: your Claude Code (or whatever) gets a second-brain agent it can delegate to — and the second brain remembers across sessions, not just within one conversation.
from hippoagent.agent import HippoAgent
agent = HippoAgent.build()
result = agent.run_task(
task_id="my-task",
task_text="Write a Python function that ...",
validator=lambda ans: (bool(ans.strip()), "non-empty"),
)
print(result.episode.final_answer)
agent.consolidate()  # nightly sleep cycle

powershell -ExecutionPolicy Bypass -File scripts\install_desktop_shortcut.ps1

This creates two shortcuts on your Desktop:
- HippoAgent Dashboard — double-click to launch the web UI (browser opens to http://127.0.0.1:8765).
- HippoAgent CLI — opens a shell with the venv activated, ready for hippo … commands.
HippoAgent is provider-agnostic. Set ONE of these env vars (or run Ollama locally) and you're done.
The first matching one wins (or force with HIPPO_LLM_PROVIDER=<name>).
| family | provider (alias) | env var | base URL |
|---|---|---|---|
| native | anthropic | ANTHROPIC_API_KEY | Anthropic SDK |
| native | ollama | OLLAMA_HOST (defaults to http://localhost:11434) | local |
| US/EU | openai, openrouter, mistral, groq, xai (grok), perplexity, fireworks, together, cerebras, gemini (google), nvidia, huggingface (hf), deepinfra, hyperbolic, novita, lepton, anyscale, azure | <NAME>_API_KEY | each provider's /v1 |
| China | moonshot (kimi), deepseek, qwen (dashscope/alibaba), zhipu (glm), baichuan, yi (lingyi/01ai), doubao (ark), hunyuan (tencent), stepfun (step), minimax, spark (iflytek) | <NAME>_API_KEY | each provider's /v1 |
| local OpenAI-compat | lmstudio, vllm, localai, tabby | <NAME>_API_KEY (any non-empty) | localhost |
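A minimal sketch of the "first matching one wins" rule (the priority order and function are illustrative; the real registry lives in providers.yaml):

```python
import os

PRIORITY = ["anthropic", "openai", "groq", "deepseek", "openrouter"]  # illustrative order

def detect_provider() -> str:
    forced = os.environ.get("HIPPO_LLM_PROVIDER")
    if forced:
        return forced                              # explicit override always wins
    for name in PRIORITY:
        if os.environ.get(f"{name.upper()}_API_KEY"):
            return name                            # first configured key wins
    return "ollama"                                # local fallback, no key needed
```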
Per-stage model overrides (Claude defaults are tuned; for other providers you usually want to set these):
export HIPPO_MODEL=qwen2.5:7b # all stages
export HIPPO_MODEL_EXECUTOR=qwen2.5:7b # ReAct loop only
export HIPPO_MODEL_DREAMER=qwen2.5:14b # NREM/REM synthesis (smarter recommended)
export HIPPO_MODEL_CRITIC=qwen2.5:1.5b     # cheap critic

Discover what's actually reachable from your setup:
hippo providers list # all known providers + env-var status
hippo providers scan # query /v1/models on every configured one (real discovery)
hippo providers models kimi # list one provider's models
hippo providers active       # which provider is selected right now

Examples:
HIPPO_LLM_PROVIDER=kimi MOONSHOT_API_KEY=sk-... hippo wake
HIPPO_LLM_PROVIDER=deepseek DEEPSEEK_API_KEY=sk-... HIPPO_MODEL=deepseek-reasoner hippo wake
HIPPO_LLM_PROVIDER=ollama OLLAMA_MODEL=qwen2.5:7b hippo wake
HIPPO_LLM_PROVIDER=groq GROQ_API_KEY=gsk-... HIPPO_MODEL=llama-3.3-70b-versatile hippo wake

For tests / dev:

pip install -e ".[dev]"

# Tests (fully offline, mock LLM)
HIPPO_OFFLINE=1 pytest
# CLI
hippo --help
hippo run "Define a Python function that returns the n-th prime"
hippo wake --n-tasks 5 # run wake-set (records episodes)
hippo sleep # run consolidation cycle
hippo benchmark # run held-out tasks with consolidated skills
hippo skills list
hippo skills show <id>
hippo episodes list
hippo dashboard        # → http://127.0.0.1:8765

python run_demo.py --n-wake 10 --n-heldout 8

What it does, in order:
- Reset.
- Baseline: held-out tasks, without skills, without past episodes — pure model.
- Wipe the episodes the baseline accidentally created (clean slate).
- Wake: run the wake-set; the agent records everything.
- Sleep: NREM clusters episodes → distills skills; REM proposes hybrids; Curator merges semantic duplicates; pruning promotes/retires by fitness.
- Hippo: re-run the held-out tasks with the consolidated skill library.
- Compare: pass-rate, avg-steps, avg-tokens, skill-reuse-rate. 95% Wilson interval on rates. Two-proportion z-test for the gap.
The script saves a JSON report under data/reports/.
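For reference, the two statistical primitives used in the Compare step, sketched from the standard formulas (not HippoAgent's exact code):

```python
import math

def wilson_95(successes: int, n: int) -> tuple[float, float]:
    """95% Wilson score interval for a pass rate."""
    if n == 0:
        return (0.0, 1.0)
    z = 1.96
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return (centre - half, centre + half)

def two_prop_z(s1: int, n1: int, s2: int, n2: int) -> float:
    """Two-proportion z statistic for the baseline-vs-hippo pass-rate gap."""
    p = (s1 + s2) / (n1 + n2)                      # pooled proportion
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    if se == 0:                                    # both rates 0% or 100%
        return 0.0
    return ((s1 / n1) - (s2 / n2)) / se
```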
A small run (10 wake / 8 held-out) on Claude Haiku 4.5 produced these consolidated skills, derived autonomously from the agent's own behavior:
- Verify test cases before blaming algorithm (NREM, fitness=0.70)
- Validate test runner output before debug (REM hybrid, 0.70)
- Harden JSON code payloads end-to-end (REM hybrid, 0.70)
- Escape newlines in JSON code strings (the Curator merged 4 near-duplicates of this)
These are not generic "be helpful" stubs — they are concrete lessons the
agent extracted from its own failure modes (mostly: malformed JSON in
run_python calls when the code contained newlines).
The pass-rate gap on N=8 held-out is not statistically significant (p≈0.5 by two-prop z-test) — and that is the honest scientific finding at this scale. Ramping to N≥30 is the next experiment; the architecture is ready for it.
- ✅ 23 unit + integration tests passing
- ✅ Real LLM client + offline mock for deterministic CI
- ✅ Sandboxed Python execution (subprocess + timeout)
- ✅ Structured logging + event bus + metrics registry
- ✅ Bayesian fitness with conjugate prior (not naive ratios)
- ✅ Skill lineage DAG (networkx) — full provenance from episode → skill → REM hybrid → merge
- ✅ Web dashboard with skill lineage visualization
- ✅ Reproducible benchmark with deterministic seed split
- ✅ Statistical primitives (Wilson interval, two-prop z-test) for honest reporting
- ✅ A/B toggle (--no-skills) so baseline-vs-hippo is a single CLI flag
The vision_describe tool dispatches to the right multimodal endpoint per
provider. Defaults shipped (override via HIPPO_VISION_MODEL):
| Provider | Default vision model | Status |
|---|---|---|
| Anthropic | claude-haiku-4-5-20251001 | ✓ verified live |
| OpenAI | gpt-4o-mini | ✓ |
| Google Gemini | gemini-1.5-flash | ✓ free tier |
| Groq | meta-llama/llama-4-scout-17b-16e-instruct | ✓ verified live, free tier |
| OpenRouter | anthropic/claude-haiku-4.5 | ✓ verified live |
| xAI Grok | grok-4 | ✓ (paid) |
| Mistral | pixtral-12b-latest | ✓ |
| Alibaba Qwen | qwen-vl-plus | ✓ |
| Zhipu GLM | glm-4v | ✓ |
| Moonshot Kimi | moonshot-v1-8k-vision-preview | ✓ |
| 01.AI Yi | yi-vision | ✓ |
| ByteDance Doubao | doubao-vision-pro-32k | ✓ |
| NVIDIA NIM | meta/llama-3.2-90b-vision-instruct | ✓ |
| Together / Fireworks | Llama 3.2 Vision 90B | ✓ |
| HuggingFace router | Llama 3.2 Vision 90B | ✓ |
| Ollama (local) | llava (override with OLLAMA_VISION_MODEL) | ✓ — needs ollama pull llava (or qwen2-vl, llama3.2-vision, bakllava, moondream) |
| DeepSeek | n/a | ✗ DeepSeek V4 API doesn't accept image_url blocks |
The dispatcher prefers the HIPPO_VISION_MODEL env var > OLLAMA_VISION_MODEL
(for Ollama) > the default in the table above. So you can mix: text inference
on a cheap, fast model, vision on a multimodal-capable one — same single call.
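That preference order, as a sketch (illustrative, not the dispatcher's real code):

```python
import os

def resolve_vision_model(provider: str, table_defaults: dict[str, str]) -> str:
    if os.environ.get("HIPPO_VISION_MODEL"):           # 1. explicit override wins everywhere
        return os.environ["HIPPO_VISION_MODEL"]
    if provider == "ollama" and os.environ.get("OLLAMA_VISION_MODEL"):
        return os.environ["OLLAMA_VISION_MODEL"]       # 2. Ollama-specific override
    return table_defaults[provider]                    # 3. default from the table above
```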
# Use Ollama for everything but route vision to Anthropic:
HIPPO_LLM_PROVIDER=ollama OLLAMA_MODEL=qwen2.5:7b \
HIPPO_VISION_MODEL=ignored # vision uses provider, set ANTHROPIC_API_KEY too
hippo dashboard
# Use DeepSeek for text, fallback to Groq for vision:
# (DeepSeek V4 has no vision; agent calls vision_describe → falls through to
# the configured fallback chain.)

When the active provider hits rate-limit / quota / 5xx errors, HippoAgent transparently
tries the next configured provider. Configure in /settings → "Fallback chain":
primary: Anthropic Claude (fast, paid)
↓ on 429/quota
fallback 1: Groq llama 70b (free tier, very fast)
↓ on quota
fallback 2: Ollama qwen 7b (local, always available)
Result: zero-downtime LLM access, degrading gracefully from paid → free tier → local.
Recoverable error patterns: 429, rate, quota, billing, credit,
limit, 503, 504, timeout, connection, overload.
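A minimal sketch of that retry logic — the pattern list is from above; the client interface is an assumption for illustration:

```python
RECOVERABLE = ("429", "rate", "quota", "billing", "credit",
               "limit", "503", "504", "timeout", "connection", "overload")

def complete_with_fallback(chain, prompt):
    """Try each provider in order; advance only on recoverable errors."""
    last_err = None
    for client in chain:                  # e.g. [anthropic, groq, ollama]
        try:
            return client.complete(prompt)
        except Exception as err:          # illustrative; real code may match typed errors
            msg = str(err).lower()
            if any(pat in msg for pat in RECOVERABLE):
                last_err = err            # recoverable → try the next provider
                continue
            raise                         # non-recoverable → surface immediately
    raise last_err
```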
Press 📋 Plan first in the chat instead of Send. The agent produces a numbered plan (3-7 steps) WITHOUT executing anything. Review it, then click ✓ Approve & execute to run the plan, or ✗ Reject.
Useful for: cautious computer-use tasks, multi-step shell ops, anything where you want to know what will happen before it does.
HippoAgent exposes a single master switch plus 6 granular toggles in
/settings:
| Capability | Default | Description |
|---|---|---|
| Sandbox master | ON | When OFF, all permissions are unrestricted |
| Filesystem | home | strict (data/ only), home (user dir), full (anywhere) |
| Computer use | OFF | mouse/keyboard control via pyautogui |
| Webcam | OFF | capture frames + describe via vision LLM |
| Shell | OFF | arbitrary cmd.exe / /bin/sh commands |
| Web | ON | web_fetch + DuckDuckGo search |
| Vision | ON | describe images via multimodal LLM |
Two preset buttons:
- 🔓 Unleash — sandbox OFF + all permissions ON. Full PC access.
- 🔒 Lockdown — sandbox ON, filesystem = strict, only web + vision.
When the sandbox is OFF the agent has full access to your machine: it can read/write any file, run any shell command, control your mouse and keyboard, capture your webcam, fetch the web, and describe images. Use this only with models you trust.
HippoAgent runs on Termux (Android 10+) with a few caveats:
pkg install python git ffmpeg-essentials
git clone https://github.com/aureliocpr-ctrl/hippoagent.git
cd hippoagent
python -m venv .venv && source .venv/bin/activate
# Skip pyautogui (no display server) and opencv (heavy):
pip install -e . --no-deps
pip install anthropic openai sentence-transformers numpy scipy scikit-learn \
networkx pydantic structlog typer rich fastapi uvicorn jinja2 \
python-dotenv httpx mcp textual pillow
hippo dashboard --host 0.0.0.0 --port 8765

Open http://<phone-ip>:8765 from any device on the same network. CLI works
fully (hippo chat, hippo tui). Webcam/computer-use are no-op on Android,
everything else (incl. Ollama via Termux+proot or remote, Groq, Gemini,
Anthropic) works normally.
- The Python sandbox is subprocess -I, not a real isolation layer. A task like "write a calculator and save it to the desktop" will actually write to your disk, because the LLM-generated code can call open(), os.makedirs, requests.get, etc. That's by design for this prototype (it's how you get an agent that does things), but it means you should only run HippoAgent against models and tasks you trust. For untrusted use, run the whole agent in Docker with no host volume mounts, or wrap the executor in seccomp/Firejail.
- API keys saved via the Settings page are written to data/user_settings.json in plaintext. Don't commit that file (it's in .gitignore). Production deployments should swap the storage for an OS keychain or Vault.
- Sandbox is subprocess -I, not seccomp/Firejail/Docker — fine for non-adversarial models, not for production.
- No active learning: the agent doesn't choose which tasks to attempt to maximize learning.
- No explicit forgetting curve over time — only fitness-based pruning.
- Single-agent. The lineage DAG is ready for skill-sharing across instances; the marketplace is not built.
HippoAgent/
├── hippoagent/ # core library
│ ├── __init__.py
│ ├── config.py # all hyper-params, single source
│ ├── observability.py # event bus, logger, metrics
│ ├── llm.py # Anthropic client + MockLLM
│ ├── embedding.py # sentence-transformers wrapper
│ ├── tools.py # PythonExecutor, CodeAnalyzer, tool registry
│ ├── episode.py # Episode + Trace
│ ├── memory.py # EpisodicMemory (vector + causal graph)
│ ├── semantic.py # SemanticMemory (facts)
│ ├── skill.py # Skill + SkillLibrary (lineage, fitness, lifecycle)
│ ├── prompts.py # all prompts, auditable
│ ├── wake.py # ReAct loop with skill injection + critique
│ ├── sleep.py # NREM + REM + Curator + Pruning
│ ├── agent.py # high-level orchestrator
│ ├── cli.py # typer + rich CLI
│ └── dashboard.py # FastAPI dashboard
├── benchmark/ # task suite + evaluator + statistics
├── tests/ # pytest suite
├── data/ # episodes/skills/semantic/runs/reports (gitignored)
├── run_demo.py # full baseline-vs-hippo experiment
├── pyproject.toml
└── README.md
MIT, but please cite the underlying neuroscience papers if you build on the consolidation idea.