Persistent memory layer for Claude Code (and any LLM agent). Most AI agents have amnesia. HippoAgent gives them a hippocampus.
HippoAgent is not a replacement for Claude Code, Cursor, or your favourite agent. It's the memory module they're missing. Plug it in via MCP and:
- Episodes you and the agent worked on persist across sessions.
- Procedures the agent figured out get compiled into deterministic macros — the next call doesn't even hit the LLM.
- Failure → success patterns are rescued and replayed during sleep cycles.
8+ neuroscientific mechanisms (DG pattern separation, TCM, synaptic tagging, lateral inhibition, engram crossover, schema priming, …) all opt-in via flags.
pip install hippoagent
claude plugin install hippoagent

Or manually drop a .mcp.json in your project root:
{
"mcpServers": {
"hippoagent": {
"command": "python",
"args": ["-m", "hippoagent.mcp_server"]
}
}
}

Restart Claude Code and the hippoagent-memory skill auto-activates at the start of every conversation. To disable it temporarily: export HIPPO_DISABLED=1.
5 iterations of the same task family on Anthropic Claude Opus 4.7, 8 tasks per iter (digit-sum suite). After the 3rd iteration HippoAgent has compiled the procedure into a deterministic macro that bypasses the LLM entirely:
| Iter | HippoAgent tokens | HippoAgent latency | Raw LLM tokens | Raw LLM latency |
|---|---|---|---|---|
| 0 (cold) | 4225 | 4.47s | 59 | 0.67s |
| 1 | 1711 (-60%) | 0.92s (-79%) | 59 | 0.71s |
| 2 | 687 (-84%) | 0.52s (-88%) | 59 | 0.85s |
| 3 | 0 ✅ | 0.22s (-95%) | 59 | 1.08s |
| 4 | 0 ✅ | 0.24s (-95%) | 59 | 0.69s |
Break-even at iter 3. Macro fast-path hit rate: 70%. The agent stops asking the LLM for tasks it has solved before — your token bill drops to zero on recurrent work and your replies come back in 200 ms instead of 700 ms.
Bench data: data/bench_learning_curve_anthropic_n5.{results,by_iter}.json.
Beyond digit-sum: 5 TRAIN tasks (URL parsing, date format, capitalize, reverse, word count) → sleep consolidate → 5 HELD-OUT tasks with fresh inputs of the same families.
| Phase | Success | Rate |
|---|---|---|
| TRAIN | 5/5 | 100% |
| HELD-OUT | 5/5 | 100% |
The agent retrieves 3 relevant skills per held-out task and applies them
without per-task re-discovery. Bench data: data/bench_held_out_practical.{results.json,summary.md}.
12 tasks at increasing skill-chaining depth (apply ROT3 + REVERSE in 1-5 chained transformations to fresh inputs). 4 providers, 1 iter:
| Provider | Lv1 | Lv2 | Lv3 | Lv4 | Lv5 | overall |
|---|---|---|---|---|---|---|
| Anthropic raw | 100% | 50% | 0% | 0% | 0% | 42% |
| Anthropic HippoAgent | 100% | 100% | 100% | 100% | 100% | 100% |
| DeepSeek raw | 50% | 0% | 0% | 0% | 0% | 25% |
| DeepSeek HippoAgent | 100% | 100% | 100% | 100% | 100% | 100% |
| OpenRouter raw | 100% | 0% | 0% | 0% | 0% | 33% |
| OpenRouter HippoAgent | 100% | 100% | 100% | 100% | 100% | 100% |
| Groq raw | 50% | 0% | 0% | 0% | 0% | 25% |
| Groq HippoAgent | 100% | 50% | 100% | 50% | 100% | 83% |
The accuracy gap GROWS with composition depth. Single-shot LLMs collapse to 0% by Lv3; persistent-memory agents stay at 100% up to Lv5 on 3/4 providers — the empirical signature of compositional reasoning over consolidated skills.
Compiled-macro hit rate (skills bypassing the LLM entirely): OpenRouter 92%, Groq 75%, DeepSeek 58%, Anthropic 0% (Anthropic uses the full ReAct loop — the 100% accuracy comes from memory recall, not caching).
Bench data (replicable): data/bench_compositional_4providers.*.json.
python scripts/bench_with_without_hippo.py \
--suite compositional \
--providers anthropic,deepseek,openrouter,groq \
--conditions raw,hippo_warm

HippoAgent (formerly Engram) is named after the neuroscientific term for a permanent memory trace (Lashley, Tulving). The system turns experience into inspectable, fitness-tracked, mergeable artifacts.
If you want the end-to-end map of the moving parts — components,
configuration knobs, task flow, MCP integration, multi-model bench
harness, test isolation — read docs/PLATFORM.md.
That doc is the contract; everything below is narrative.
For a 5-minute MCP integration walkthrough (Claude Code, Cursor,
opencode, Cline, Continue) see docs/MCP_QUICKSTART.md.
The v0.2 push consolidates the prototype into a vendible system. See
CHANGELOG.md and the R&D reports
(SECURITY_AUDIT.md, ARCHITECTURE_AUDIT.md,
QA_AUDIT.md, RND_MEMORIE.md,
RND_PERFORMANCE.md, RND_UX.md)
for the full picture.
- 13 CVEs closed — RCE in /api/ide/run and the WS terminal (auth token + binary allowlist + shlex.split), SSRF blocklist in web_fetch, sensitive-path deny-list, plaintext API-key leak in /api/settings/providers redacted, computer-use kill-switch hotkey deny-list, dashboard CORS lock + session token, editfmt config-file deny-list, MCP server schema validation + audit log + rate limit, prompt-injection wrapper around external content (<untrusted_content>), Docker-sandboxed Python executor (opt-in via HIPPO_PYTHON_EXEC_BACKEND=docker).
- Active memory v2 — 7 enhancements to the original 6 mechanisms + 5 new ones (11 total). All five new mechanisms are zero-LLM-call (pure numpy / string ops on existing memory). See RND_EXPLORATION.md for the diary of how they were found:
  - 7 — Working Memory Pruning (RND_MEMORIE.md): wake-loop char-budget compressor; critical for small-context models. Validated on Ollama qwen2.5:7b: −54% token usage with lower variance vs the unpatched build.
  - 8 — Trace Alignment / Reverse Replay (RND_TRACE_ALIGNMENT.md): Needleman-Wunsch on observation embeddings finds the exact divergence step between a failed run and its success-twin. Two modes: action-divergence (same situation, different decision) and input-divergence (same tool, wrong file/query). Inspired by sharp-wave reverse replay in CA1 place cells.
  - 9 — Lateral Inhibition (anti-Hebbian): when a winner skill consolidates on a task, its near-clone rivals are nudged away from that task vector. Földiák (1990) competitive specialisation. Empirically: −0.067 cosine differentiation at step 50 vs the Hebbian-only baseline. Disabled by default; opt in via lateral_inhibition_enabled. A sketch of the nudge follows this list.
  - 10 — Spontaneous Reactivation: a default-mode rehearsal stage during sleep. Skills not used in N days get their last_used_at pushed forward by half the decay cutoff so they don't fall over the retirement cliff. Born & Wilhelm (2012) spaced-repetition substrate. Disabled by default; opt in via spontaneous_reactivation_enabled.
  - 11 — Salience by Surprise: replay priority gains a fourth term that boosts episodes whose num_steps deviates from the skill's average — Buzsáki (2015) prediction-error replay. Combined with multi-skill smallest-deviation logic so a trace that is typical for skill A but anomalous for skill B doesn't double-count. Disabled by default (sleep_replay_priority_surprise=0.0).
- Performance — hot paths sped up 16× to 4,700×: LRU-cached embeddings, vectorised skill clustering (corpus @ corpus.T), in-memory recall index with optional FAISS, mtime+size repomap cache.
- Architecture — dashboard.py 2,338-LOC monolith → 159-LOC entry point + an 11-file dashboard_routes/ package. LLM provider registry moved to providers.yaml + a Pydantic ProviderSpec. New pydantic-settings-based Settings v2 singleton. Lightweight Alembic-style migrations.
- UX — production-grade design system at hippoagent/static/dashboard.css (WCAG 2.1 AA-verified contrasts, 4 px scale, 1.25 type ratio, light theme). /skills page redesigned with KPI grid + responsive card grid + filter pills + accessibility. CLI banner with contextual tips and a grouped /help.
- CI/CD — 3 OS × 4 Python = 11 jobs, plus a dedicated security workflow (pip-audit, safety, bandit, ruff S-rules) running weekly. Multi-stage Dockerfile (~500 MB), non-root user, HEALTHCHECK. pip extras ([headless], [mcp-only], [tui], [vision], [full], [dev]) — the default install is now minimal and sane.
- Tests — 113 → 1072+ (+849%). Coverage 46% → 59%. Ruff: 33 → 0 errors. Recent additions (FORGIA #27–#89): tests/test_bench_harness.py, test_bench_compare.py, test_bench_summary_md.py, test_bench_cli.py, test_bench_recall_ablation.py, test_clean_bench_data.py, test_jsonutil.py, test_corruption_guards.py, test_data_dir_isolation.py, test_auto_fallback.py, test_config_env.py, test_makefile_help.py, test_wake_used_macro.py, test_sleep_report_n_llm_calls.py, test_real_provider_smoke.py, test_mcp_e2e_smoke.py. Original suites preserved: tests/security/ (path traversal, SSRF, secrets redaction, prompt injection, executor isolation, editfmt sensitive, pentest validation), tests/test_settings.py, test_settings_v2.py, test_provider_registry.py, test_migrations.py, test_mcp_server.py, test_mcp_server_security.py, test_dashboard_api.py, test_cli.py, tests/perf/test_perf.py (10 benchmarks), tests/test_rnd_active_memory.py, test_trace_alignment.py, test_lateral_inhibition.py.
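The anti-Hebbian nudge in mechanism 9 is small enough to sketch. A minimal illustration, assuming unit-norm numpy embeddings; the function name and the inhibition rate `beta` are illustrative, not HippoAgent's actual API:

```python
import numpy as np

def anti_hebbian_nudge(rival_trigger: np.ndarray,
                       task_vec: np.ndarray,
                       beta: float = 0.02) -> np.ndarray:
    """Push a rival skill's trigger embedding AWAY from the task vector the
    winning skill just consolidated on, then re-normalise to unit length."""
    nudged = rival_trigger - beta * task_vec
    return nudged / np.linalg.norm(nudged)
```

Repeated over consolidation steps, near-clone skills drift apart in embedding space instead of competing for the same tasks — the cosine differentiation measured above.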
You give the agent a task in plain language. It thinks, uses tools (Python sandbox, file I/O, shell, web fetch, screenshots, webcam, vision LLM, computer use), retrieves any relevant past skills it has consolidated, and answers. Every conversation is an episode in memory. Every few episodes, you trigger a sleep cycle that distills new procedural skills (NREM), recombines them creatively (REM), merges duplicates, and promotes/retires by Bayesian fitness. Tomorrow the agent is genuinely better at what you asked it yesterday — and you can read every lesson it learned.
Multi-model bench harness (FORGIA #27, 2026-05-09): same task suite, 3 conditions (raw / hippo_cold / hippo_warm), 3-4 real providers, fail-isolated.
Headline result — memory_recall suite (the discriminative one):
| Provider | raw | hippo_cold | hippo_warm |
|---|---|---|---|
| anthropic | 0.50 | 1.00 | 1.00 (latency −56 %) |
| deepseek | 0.50 | 1.00 | 1.00 |
| openrouter | 0.50 | 1.00 | 1.00 |
The 50 % raw failure is the 3 query tasks ("What was the value I told you?") — with no shared context, the LLM has no place to retrieve from. HippoAgent's recall pipeline retrieves the seed episode and the query phase succeeds 100 % on every provider. +50 percentage-point accuracy uplift, three different LLMs.
Hardened result — hard_memory_recall (12 tasks: direct + paraphrased + synthesis):
| Provider | raw | hippo_cold | hippo_warm |
|---|---|---|---|
| anthropic | 0.50 | 1.00 | 1.00 (latency −51 %) |
| deepseek | 0.50 | 0.92 | 0.92 (lost the synthesis) |
| openrouter | 0.50 | 1.00 | 1.00 |
The headline holds: +42–50 pp uplift across paraphrased queries and multi-step synthesis. DeepSeek lost the multi-step task (retrieved both facts but failed the addition) — HippoAgent provides the memory, arithmetic composition is on the model.
Skill-compounding suite (8 digit-sum tasks): hippo_warm latency −41 % vs hippo_cold on anthropic (compiled-macro fast-path engaging). Default trivia suite: raw wins (~50 tokens vs ~3 000), hippo costs structural overhead but proves end-to-end transport works.
All raw data committed at data/bench_*.{results,summary}.json. Full
analysis in docs/PLATFORM.md.
Reproduce locally:
make bench-help # list available task suites
make bench-mock # offline smoke (no API key needed)
make bench-real # run on every provider with an env key set
make bench-memory # the discriminative recall suite
make bench-summary # render the latest summary as markdown
make bench-csv # CSV (Excel-friendly)
make bench-quick # mock + 2 tasks (CI smoke)
make bench-clean # dry-run of transient bench dirs
make stats # project size + test count
make bench-compare BEFORE=... AFTER=...  # diff two bench summaries

The bench script supports many flags useful in CI: --quiet, --max-tasks N, --task-id ID, --providers auto|csv, --suite NAME, --n-iter N, --consolidate-every K, --save-md, --memory-stats, --show-failures, --print-config, --list-providers, --clean-data, --output-dir PATH. See python scripts/bench_with_without_hippo.py --help.
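For instance, a quiet CI-friendly run combining flags from that list (the particular combination is illustrative):

```bash
python scripts/bench_with_without_hippo.py \
  --suite memory_recall \
  --providers anthropic,groq \
  --conditions raw,hippo_warm \
  --quiet --max-tasks 4 --save-md
```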
Tool-use verified live across 4 providers (same task: write a real file to Desktop), all via native tool-use API (no fragile JSON-in-text parsing):
| Provider | Model | Steps | Tokens | Outcome |
|---|---|---|---|---|
| Ollama (local, free) | qwen2.5:1.5b | 2 | 4,655 | ✓ wrote file to disk |
| Ollama (local, free) | qwen2.5:7b-instruct | 2 | 4,699 | ✓ wrote file to disk |
| Groq (free tier) | llama-3.3-70b-versatile | 4 | 11,794 | ✓ wrote file to disk |
| Anthropic | claude-haiku-4-5 | 2 | 7,200 | ✓ wrote file to disk |
Computer-use end-to-end verified live:
- ✓ shell_run: whoami, systeminfo, wmic, ver — Claude summarizes the system
- ✓ Task "open Calc + take a screenshot + describe it + close it" — 4 steps, success
- ✓ vision_describe — Claude correctly describes the OMNEX logo
- ✓ web_fetch + web_search (DuckDuckGo) — retrieved Nature/ScienceDirect papers
- ✓ desktop_screenshot 2560×1600 + describe via Anthropic vision
- ✓ Sleep cycle: 18 episodes → 6 NREM + 2 REM + 2 merges + 6 facts in 103 s
Plan mode + auto-fallback chain verified.
HippoAgent is a working prototype of an idea: an LLM agent that becomes measurably more competent over time without ever updating its weights. It does so by mimicking the two-stage memory consolidation model from neuroscience (NREM slow-wave + REM paradoxical sleep). Every "lesson" the agent learns is a structured, versioned, fitness-tracked artifact you can read, edit, share, retire — not an opaque parameter shift.
Today's LLM agents have two ways to "learn":
- Fine-tuning — costly, centralized, opaque, irreversible.
- RAG / context — they "know" things but don't remember the session; no consolidation, no transfer, no growth curve.
The space in between — what humans actually do during sleep — is empty in production systems. Voyager (Wang et al.) introduced a skill library for Minecraft. MemGPT/Letta layered tiered memory. Reflexion added critique. But nobody has closed the loop with a consolidation cycle that:
- replays episodes (success and failure),
- extracts invariant patterns into procedural skills,
- recombines existing skills creatively,
- tests them under a fitness function,
- prunes the losers, promotes the winners,
- and measures itself with held-out tasks.
That loop is the bet of HippoAgent.
┌──────────────────────────────────────────────────────────────────────┐
│ HippoAgent │
│ │
│ ╔═══════════════ WAKE ════════════════╗ ╔═════ SLEEP ══════════╗ │
│ ║ Task → Memory retrieval (skills ║ ║ NREM: cluster ║ │
│ ║ + similar episodes) ║ ║ episodes → ║ │
│ ║ ReAct loop with tool use: ║ ║ distill skills ║ │
│ ║ • run_python (sandboxed subproc) ║ ║ + semantic facts ║ │
│ ║ • syntax_check, find_function ║ ║ ║ │
│ ║ • submit_solution ║ ║ REM: pick 2 skills, ║ │
│ ║ Reflexion-style self-critique on ║ ║ propose hybrid ║ │
│ ║ failure → 1-shot retry. ║ ║ ║ │
│ ║ Episode persisted to memory. ║ ║ Curator: merge ║ │
│ ╚══════════════════ ▼ ════════════════╝ ║ semantic dups ║ │
│ │ ║ ║ │
│ Episodes (SQLite + ║ Pruning: Bayesian ║ │
│ embeddings + causal graph) ║ fitness → promote / ║ │
│ ║ retire skills. ║ │
│ Skills (JSON files + ╚══════════ ▲ ══════════╝ │
│ version chain + lineage DAG ──────────────┘ │
│ + Beta-Binomial fitness) │
│ │
│ Semantic facts (decoupled from time) │
└──────────────────────────────────────────────────────────────────────┘
Observability layer: every action emits a structured event →
structlog → metrics registry → dashboard (FastAPI + vis-network).
| Module | What it does | Key novelty |
|---|---|---|
| episode.py | Episode + Trace — full ReAct trajectory, immutable | timestamped, embedding-indexed |
| memory.py | Episodic memory: SQLite + dense recall + causal graph (networkx) | clustering for replay; A→B causal edges via shared skill |
| semantic.py | Semantic memory: facts decoupled from time | Tulving-style separation |
| skill.py | Skill library — versioned JSON files + index | Bayesian Beta-Binomial fitness, lineage DAG, status lifecycle |
| tools.py | Sandboxed Python executor + AST analyzer | subprocess isolation, timeout, output cap |
| wake.py | ReAct loop with skill+episode injection | tolerant parser, Reflexion critique, A/B toggle |
| sleep.py | Multi-stage consolidation engine | NREM + REM + Curator + Pruning |
| prompts.py | All LLM prompts, in one auditable file | "experience as artifact" thesis |
| observability.py | EventBus + structlog + metrics registry | every step emits a typed event |
| dashboard.py | FastAPI + HTML dashboard | skill lineage graph (vis-network) |
| cli.py | typer + rich CLI | hippo run/wake/sleep/benchmark/skills/episodes/dashboard |
| benchmark/ | 18 HumanEval-style coding tasks + evaluator | wake/heldout split, Wilson CIs, two-prop z-test |
- Two-stage consolidation: Walker & Stickgold (2004), Diekelmann & Born (2010). NREM consolidates declarative memory; REM enables creative recombination.
- Episodic vs semantic vs procedural: Tulving (1972, 1985).
- Fast hippocampal replay → slow cortical consolidation: McClelland, McNaughton & O'Reilly (1995) — the model HippoAgent's wake/sleep split mirrors.
- Reflexion: Shinn et al. (2023) — verbal RL on agents (no gradients) — used here for the self-critique retry.
- Voyager: Wang et al. (2023) — skill library for embodied agents — closest prior art for the procedural memory.
- Bayesian fitness for small N: Beta-Binomial conjugate prior — robust posterior mean even after 1–2 trials.
- Declarative → procedural transition: Anderson (1982) ACT-R, Logan (1988) instance theory — the basis for procedural compilation.
- Hippocampal forward sweeps: Pfeiffer & Foster (2013), Diba & Buzsáki (2007) — predictive replay before action — the basis for forward replay.
- Hebbian plasticity: Hebb (1949) "cells that fire together wire together" — the basis for the trigger-embedding drift on success.
- Counterfactual reasoning in episodic memory: Gershman & Daw (2017) — alternative trajectories during offline replay drive learning beyond mere reinforcement.
Most LLM-agent memory systems are passive: they retrieve past prompts and dump them into context. HippoAgent's memory is active — the act of using a skill makes it stronger, faster, and more discriminating. Six mechanisms compound:
1. Procedural compilation — once a skill has succeeded N times with high fitness, the DREAMER (during sleep) distills its successful traces into a parameterised macro: a list of tool calls with {{TASK}} and {{LAST_OBSERVATION}} placeholders. At wake time, when a strongly-matching task arrives, the macro is executed deterministically — zero LLM tokens, no model latency between steps. The skill is not just remembered, it is compiled, the way deliberate actions become motor reflexes. (compilation.py)
2. Forward replay — before the wake loop fires, the agent looks up the top skill's past successful trajectories and projects an expected action sequence. The block is injected as ## PREDICTED PATH in the user prompt. This anchors the LLM's reasoning (less drift on familiar tasks) and lets us detect divergence (a learning signal). Pure retrieval — no extra LLM call. (wake.py:_forward_replay_block)
3. Hebbian skill embedding — every successful application drifts the skill's trigger embedding toward the task that just succeeded (new = (1 − α)·current + α·task, α = 0.05, then re-normalised; see the sketch after this list). Skills become more retrievable for the kind of task they keep solving — the library shapes itself to its workload over time without any retraining. (skill.py:_hebbian_update)
4. Counterfactual REM — when a skill keeps failing (fitness < 0.5, trials ≥ 3), the dreamer doesn't just decrement Bayesian counts. It reads the failed trajectory and synthesises an alternative strategy — a candidate counterfactual skill with the failed skill as parent. The alternative competes for retrieval on future similar tasks; if it wins, it supersedes the broken approach without manual intervention. (sleep.py:_stage_counterfactual)
5. Schema formation — once a domain has accumulated enough skills (≥3 with cosine similarity ≥ 0.62 on triggers), the dreamer writes a SCHEMA: a meta-skill whose body is a one-line rubric for picking among the children. Lineage edges (relation='specialises') connect each schema to its specifics, building a 2-level hierarchy that stays navigable as the library grows. Tulving's episodic→semantic transition. (sleep.py:_stage_schema)
6. Self-suggested practice — for skills sitting in the uncertain middle (fitness 0.45–0.65), the dreamer writes 2 concrete practice prompts that would plausibly trigger them. They appear in the dashboard's skill detail under "📚 Practice prompts" with one-click "▶ run in chat" buttons. Running a prompt feeds real evidence into the Bayesian fitness — so the skill is decisively promoted or retired instead of lingering in ambiguity. The agent literally suggests its own training set. (sleep.py:_stage_practice)
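A minimal sketch of the Hebbian drift from mechanism 3, assuming unit-norm numpy embeddings (the function mirrors the formula in the text; the signature is illustrative, not skill.py's real one):

```python
import numpy as np

def hebbian_update(trigger: np.ndarray,
                   task: np.ndarray,
                   alpha: float = 0.05) -> np.ndarray:
    """new = (1 - alpha) * current + alpha * task, then re-normalised:
    the trigger drifts toward the kind of task the skill keeps solving."""
    new = (1.0 - alpha) * trigger + alpha * task
    return new / np.linalg.norm(new)
```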
Together these turn memory from an archive into an organ that grows.
python scripts/demo_active_memory.py

Seeds a library with six skills, runs a complete sleep cycle (NREM + REM + Curator + compilation + counterfactual + schema + practice + pruning) using a scripted mock LLM, then re-runs a matching task to demonstrate the macro fast-path. Typical output:
Sleep cycle — six mechanisms fire in one pass
duration : 1.37s (12 LLM calls, 3080 tokens)
🔧 macros : 1
🌀 cf : 1
🌳 schemas : 1
📚 practice : 6 prompts written
promoted : 1 retired : 1
Wake — re-running a similar task uses the macro fast-path
steps : 2
tokens used : 0
llm calls : 0 (macro fired, no model invoked)
Run python scripts/bench_macro.py --repeats 3 --latency 0.5 to reproduce:
COLD (ReAct) median wall=1.296s llm_calls=2 tokens=560
HOT (macro) median wall=0.202s llm_calls=0 tokens=0
→ speed-up : 6.41x
→ time saved : 1094 ms per task
→ token saved : 560 per task
The simulated LLM models a real-world 0.5 s/call latency (typical for hosted models). On the hot path the macro fires deterministically — not a single token leaves the box.
Visit /active-memory (or click "Active memory" in the nav) for a live KPI
panel — compiled-macro coverage, Hebbian-tuned skills, counterfactual lineage,
schema hierarchy. The /api/active-memory/stats JSON endpoint exposes the
same metrics for external tooling.
Every chat turn surfaces 👍 / 👎 buttons. Clicking them feeds the same Bayesian fitness machinery used by automatic outcomes — no separate "feedback model" or fine-tune. A 👎 on a successful turn flips the episode to failure and records a failure trial against each applied skill; mis-promoted skills decay back into the candidate pool. A 👍 simply boosts the trial count with a success.
So the library has three sources of evidence, all converging on the same fitness posterior:
- automatic validator outcomes during the wake loop;
- counterfactual REM evaluations during sleep;
- explicit user up- / down-votes from the chat surface.
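Concretely, all three streams just increment the same two counters of a Beta-Binomial posterior. A minimal sketch with a uniform Beta(1, 1) prior — the prior hyper-parameters are an assumption, not HippoAgent's tuned values:

```python
from dataclasses import dataclass

@dataclass
class SkillFitness:
    a: float = 1.0  # prior + observed successes (Beta(1,1) = uniform prior; assumed)
    b: float = 1.0  # prior + observed failures

    def record(self, success: bool) -> None:
        if success:
            self.a += 1
        else:
            self.b += 1

    @property
    def posterior_mean(self) -> float:
        # Well-behaved even after 1-2 trials, unlike a naive success ratio
        return self.a / (self.a + self.b)

f = SkillFitness()
f.record(True)    # automatic validator pass during the wake loop
f.record(True)    # user 👍
f.record(False)   # user 👎 flips an outcome to failure
print(f.posterior_mean)  # 0.6
```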
cd ProgettiAI/HippoAgent
python -m venv .venv
source .venv/Scripts/activate # or .venv/bin/activate on POSIX
pip install -e .
hippo dashboard  # → http://127.0.0.1:8765

The first time you open the dashboard, you'll land on /welcome — pick a
provider, paste an API key (or run Ollama locally — no key needed), test the
connection, save. Everything is configurable from the ⚙ Settings page;
no env-var voodoo required.
docker compose up
# or: docker build -t hippoagent . && docker run -p 8765:8765 -v $PWD/data:/app/data hippoagent

The container picks up any *_API_KEY env vars you pass through. To use a
host-running Ollama, the compose file already maps host.docker.internal.
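For reference, a compose file of the shape described would look roughly like this — an illustrative sketch, not the repo's actual docker-compose.yml:

```yaml
services:
  hippoagent:
    build: .
    ports:
      - "8765:8765"
    volumes:
      - ./data:/app/data        # persist episodes/skills across container restarts
    environment:
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY:-}   # any *_API_KEY passes through
    extra_hosts:
      - "host.docker.internal:host-gateway"        # reach a host-running Ollama
```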
There are three ways to interact with the agent. They share the same backend (memory, skills, sleep cycles) — just pick what suits you.
hippo dashboard → http://127.0.0.1:8765
| page | what it's for |
|---|---|
| Chat ⭐ | the main UI — type a task, get an answer with applied skills |
| Settings | pick LLM provider, paste API key, test connection, switch models live |
| Episodes | every task ever executed; click any row for the full ReAct trajectory |
| Skills | consolidated lessons with Bayesian fitness, status, lineage |
| Lineage | interactive graph of how skills derive from one another |
| Events | live event stream (every memory write, retrieval, LLM call) |
| Metrics | counters/histograms |
hippo chat # interactive REPL: type tasks, /sleep, /skills, /quit
hippo run "your task here" # one-shot
hippo wake --n-tasks 5 # run the bench wake-set
hippo sleep # consolidation cycle on demand
hippo benchmark # held-out evaluation
hippo skills list / show <id>
hippo episodes list / show <id>
hippo providers list / scan / models <p> / active

HippoAgent ships an MCP (Model Context Protocol) server, so any MCP-aware
client can use it as a memory-augmented agent. Add this to your client's
mcp.json (or equivalent):
{
"mcpServers": {
"hippoagent": {
"command": "hippo",
"args": ["mcp"],
"env": {
"HIPPO_LLM_PROVIDER": "groq",
"GROQ_API_KEY": "your-key",
"HIPPO_MODEL": "llama-3.3-70b-versatile"
}
}
}
}

Once registered, the host (Claude Code, Cursor, etc.) can call:
| MCP tool | what it does |
|---|---|
| hippo_run_task | full wake loop — agent uses its own tools + skills |
| hippo_consolidate | trigger a sleep cycle |
| hippo_recall | semantic search over past episodes (embeddings) |
| hippo_recall_explain | semantic recall + per-component score breakdown |
| hippo_search | keyword search over episode task_text (LIKE) |
| hippo_episode_list | paginated listing of episodes (limit/offset/outcome) |
| hippo_episode_get | one episode in full (trajectory + critique) |
| hippo_episode_pin / hippo_episode_unpin | protect / release an episode from decay-pruning |
| hippo_forget | delete one episode by id (privacy / GDPR) |
| hippo_metrics_history | token-usage timeseries bucketed by day |
| hippo_skills_for | preview which skills HippoAgent would inject for a task |
| hippo_skill_promote / hippo_skill_retire / hippo_skill_edit | manual curation |
| hippo_skill_export / hippo_skill_import | portable JSON bundles (share skills between installations) |
| hippo_skill_test | render the prompt-context for a (skill, task) pair without calling the LLM |
| hippo_skill_top | top-k skills by fitness / recency / activity |
| hippo_skill_lineage | walk the parent_skills DAG ancestry of a skill |
| hippo_skill_compare | diff two skills (body / fitness / trials) |
| hippo_skill_similar | top-k skills by Jaccard overlap on body tokens |
| hippo_skill_describe | deterministic 1-line natural-language summary of a skill (no LLM) |
| hippo_skill_merge | manually merge skill A into B (sum trials, retire A, lineage tracked) |
| hippo_episodes_by_skill | every episode whose skills_used includes a given skill |
| hippo_provider_switch | switch the active LLM provider at runtime (anthropic / openai / groq / …) |
| hippo_remember | store one fact directly in semantic memory — no episode, no sleep cycle |
| hippo_facts_recall | semantic search over facts (cosine on proposition embedding) |
| hippo_facts_search | keyword/substring search over facts (LIKE on proposition) |
| hippo_facts_list | paginated listing of all facts (newest-first) |
| hippo_fact_forget | delete one fact by id (privacy / GDPR) |
| hippo_skills_search | keyword/substring search over skills (LIKE on name+trigger+body) |
| hippo_skill_bundles / hippo_compound_skills / hippo_skill_antagonists | structural introspection |
| hippo_status | counts + active provider |
| hippo_health | deep preflight at startup — 3-tier reachability + counts + flag + tool_count |
| hippo_stats | aggregate metrics (episodes by outcome, skills, token usage) |
| hippo_audit_tail | last N records of the MCP audit log (forensics) |
And read MCP resources:
- hippo://skills/list and hippo://skills/{id} — consolidated skills
- hippo://episodes/recent and hippo://episodes/{id} — past trajectories
The result: your Claude Code (or whatever) gets a second-brain agent it can delegate to — and the second brain remembers across sessions, not just within one conversation.
from hippoagent.agent import HippoAgent
agent = HippoAgent.build()
result = agent.run_task(
task_id="my-task",
task_text="Write a Python function that ...",
validator=lambda ans: (bool(ans.strip()), "non-empty"),
)
print(result.episode.final_answer)
agent.consolidate()  # nightly sleep cycle

powershell -ExecutionPolicy Bypass -File scripts\install_desktop_shortcut.ps1

This creates two shortcuts on your Desktop:
- HippoAgent Dashboard — double-click to launch the web UI (browser opens to http://127.0.0.1:8765).
- HippoAgent CLI — opens a shell with the venv activated, ready for hippo … commands.
HippoAgent is provider-agnostic. Set ONE of these env vars (or run Ollama locally) and you're done.
The first matching one wins (or force with HIPPO_LLM_PROVIDER=<name>).
| family | provider (alias) | env var | base URL |
|---|---|---|---|
| native | anthropic | ANTHROPIC_API_KEY | Anthropic SDK |
| native | ollama | OLLAMA_HOST (defaults to http://localhost:11434) | local |
| US/EU | openai, openrouter, mistral, groq, xai (grok), perplexity, fireworks, together, cerebras, gemini (google), nvidia, huggingface (hf), deepinfra, hyperbolic, novita, lepton, anyscale, azure | <NAME>_API_KEY | each provider's /v1 |
| China | moonshot (kimi), deepseek, qwen (dashscope/alibaba), zhipu (glm), baichuan, yi (lingyi/01ai), doubao (ark), hunyuan (tencent), stepfun (step), minimax, spark (iflytek) | <NAME>_API_KEY | each provider's /v1 |
| local OpenAI-compat | lmstudio, vllm, localai, tabby | <NAME>_API_KEY (any non-empty) | localhost |
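A minimal sketch of the "first matching one wins" rule (the priority order and function are illustrative; the real registry lives in providers.yaml):

```python
import os

PRIORITY = ["anthropic", "openai", "groq", "deepseek", "openrouter"]  # illustrative order

def detect_provider() -> str:
    forced = os.environ.get("HIPPO_LLM_PROVIDER")
    if forced:
        return forced                              # explicit override always wins
    for name in PRIORITY:
        if os.environ.get(f"{name.upper()}_API_KEY"):
            return name                            # first configured key wins
    return "ollama"                                # local fallback, no key needed
```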
Per-stage model overrides (Claude defaults are tuned; for other providers you usually want to set these):
export HIPPO_MODEL=qwen2.5:7b # all stages
export HIPPO_MODEL_EXECUTOR=qwen2.5:7b # ReAct loop only
export HIPPO_MODEL_DREAMER=qwen2.5:14b # NREM/REM synthesis (smarter recommended)
export HIPPO_MODEL_CRITIC=qwen2.5:1.5b     # cheap critic

Discover what's actually reachable from your setup:
hippo providers list # all known providers + env-var status
hippo providers scan # query /v1/models on every configured one (real discovery)
hippo providers models kimi # list one provider's models
hippo providers active       # which provider is selected right now

Examples:
HIPPO_LLM_PROVIDER=kimi MOONSHOT_API_KEY=sk-... hippo wake
HIPPO_LLM_PROVIDER=deepseek DEEPSEEK_API_KEY=sk-... HIPPO_MODEL=deepseek-reasoner hippo wake
HIPPO_LLM_PROVIDER=ollama OLLAMA_MODEL=qwen2.5:7b hippo wake
HIPPO_LLM_PROVIDER=groq GROQ_API_KEY=gsk-... HIPPO_MODEL=llama-3.3-70b-versatile hippo wake

For tests / dev:

pip install -e ".[dev]"

# Tests (fully offline, mock LLM)
HIPPO_OFFLINE=1 pytest
# CLI
hippo --help
hippo run "Define a Python function that returns the n-th prime"
hippo wake --n-tasks 5 # run wake-set (records episodes)
hippo sleep # run consolidation cycle
hippo benchmark # run held-out tasks with consolidated skills
hippo skills list
hippo skills show <id>
hippo episodes list
hippo dashboard        # → http://127.0.0.1:8765

python run_demo.py --n-wake 10 --n-heldout 8

What it does, in order:
- Reset.
- Baseline: held-out tasks, without skills, without past episodes — pure model.
- Wipe the episodes the baseline accidentally created (clean slate).
- Wake: run the wake-set; the agent records everything.
- Sleep: NREM clusters episodes → distills skills; REM proposes hybrids; Curator merges semantic duplicates; pruning promotes/retires by fitness.
- Hippo: re-run the held-out tasks with the consolidated skill library.
- Compare: pass-rate, avg-steps, avg-tokens, skill-reuse-rate. 95% Wilson interval on rates. Two-proportion z-test for the gap.
The script saves a JSON report under data/reports/.
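For reference, the two statistical primitives used in the Compare step, sketched from the standard formulas (not HippoAgent's exact code):

```python
import math

def wilson_95(successes: int, n: int) -> tuple[float, float]:
    """95% Wilson score interval for a pass rate."""
    if n == 0:
        return (0.0, 1.0)
    z = 1.96
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return (centre - half, centre + half)

def two_prop_z(s1: int, n1: int, s2: int, n2: int) -> float:
    """Two-proportion z statistic for the baseline-vs-hippo pass-rate gap."""
    p = (s1 + s2) / (n1 + n2)                      # pooled proportion
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    if se == 0:                                    # both rates 0% or 100%
        return 0.0
    return ((s1 / n1) - (s2 / n2)) / se
```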
A small run (10 wake / 8 held-out) on Claude Haiku 4.5 produced these consolidated skills, derived autonomously from the agent's own behavior:
- Verify test cases before blaming algorithm (NREM, fitness=0.70)
- Validate test runner output before debug (REM hybrid, 0.70)
- Harden JSON code payloads end-to-end (REM hybrid, 0.70)
- Escape newlines in JSON code strings (the Curator merged 4 near-duplicates of this)
These are not generic "be helpful" stubs — they are concrete lessons the
agent extracted from its own failure modes (mostly: malformed JSON in
run_python calls when the code contained newlines).
The pass-rate gap on N=8 held-out is not statistically significant (p≈0.5 by two-prop z-test) — and that is the honest scientific finding at this scale. Ramping to N≥30 is the next experiment; the architecture is ready for it.
- ✅ 23 unit + integration tests passing
- ✅ Real LLM client + offline mock for deterministic CI
- ✅ Sandboxed Python execution (subprocess + timeout)
- ✅ Structured logging + event bus + metrics registry
- ✅ Bayesian fitness with conjugate prior (not naive ratios)
- ✅ Skill lineage DAG (networkx) — full provenance from episode → skill → REM hybrid → merge
- ✅ Web dashboard with skill lineage visualization
- ✅ Reproducible benchmark with deterministic seed split
- ✅ Statistical primitives (Wilson interval, two-prop z-test) for honest reporting
- ✅ A/B toggle (--no-skills) so baseline-vs-hippo is a single CLI flag
The vision_describe tool dispatches to the right multimodal endpoint per
provider. Defaults shipped (override via HIPPO_VISION_MODEL):
| Provider | Default vision model | Status |
|---|---|---|
| Anthropic | claude-haiku-4-5-20251001 | ✓ verified live |
| OpenAI | gpt-4o-mini | ✓ |
| Google Gemini | gemini-1.5-flash | ✓ free tier |
| Groq | meta-llama/llama-4-scout-17b-16e-instruct | ✓ verified live, free tier |
| OpenRouter | anthropic/claude-haiku-4.5 | ✓ verified live |
| xAI Grok | grok-4 | ✓ (paid) |
| Mistral | pixtral-12b-latest | ✓ |
| Alibaba Qwen | qwen-vl-plus | ✓ |
| Zhipu GLM | glm-4v | ✓ |
| Moonshot Kimi | moonshot-v1-8k-vision-preview | ✓ |
| 01.AI Yi | yi-vision | ✓ |
| ByteDance Doubao | doubao-vision-pro-32k | ✓ |
| NVIDIA NIM | meta/llama-3.2-90b-vision-instruct | ✓ |
| Together / Fireworks | Llama 3.2 Vision 90B | ✓ |
| HuggingFace router | Llama 3.2 Vision 90B | ✓ |
| Ollama (local) | llava (override with OLLAMA_VISION_MODEL) | ✓ — needs ollama pull llava (or qwen2-vl, llama3.2-vision, bakllava, moondream) |
| DeepSeek | n/a | ✗ DeepSeek V4 API doesn't accept image_url blocks |
The dispatcher prefers the HIPPO_VISION_MODEL env var > OLLAMA_VISION_MODEL
(for Ollama) > the default in the table above. So you can mix: text inference
on a cheap, fast model, vision on a multimodal-capable one — same single call.
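That preference order, as a sketch (illustrative, not the dispatcher's real code):

```python
import os

def resolve_vision_model(provider: str, table_defaults: dict[str, str]) -> str:
    if os.environ.get("HIPPO_VISION_MODEL"):           # 1. explicit override wins everywhere
        return os.environ["HIPPO_VISION_MODEL"]
    if provider == "ollama" and os.environ.get("OLLAMA_VISION_MODEL"):
        return os.environ["OLLAMA_VISION_MODEL"]       # 2. Ollama-specific override
    return table_defaults[provider]                    # 3. default from the table above
```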
# Use Ollama for everything but route vision to Anthropic:
HIPPO_LLM_PROVIDER=ollama OLLAMA_MODEL=qwen2.5:7b \
HIPPO_VISION_MODEL=ignored # vision uses provider, set ANTHROPIC_API_KEY too
hippo dashboard
# Use DeepSeek for text, fallback to Groq for vision:
# (DeepSeek V4 has no vision; agent calls vision_describe → falls through to
# the configured fallback chain.)

When the active provider hits rate-limit / quota / 5xx errors, HippoAgent transparently
tries the next configured provider. Configure in /settings → "Fallback chain":
primary: Anthropic Claude (fast, paid)
↓ on 429/quota
fallback 1: Groq llama 70b (free tier, very fast)
↓ on quota
fallback 2: Ollama qwen 7b (local, always available)
Result: zero-downtime LLM access, degrading gracefully from paid → free tier → local.
Recoverable error patterns: 429, rate, quota, billing, credit,
limit, 503, 504, timeout, connection, overload.
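A minimal sketch of that retry logic — the pattern list is from above; the client interface is an assumption for illustration:

```python
RECOVERABLE = ("429", "rate", "quota", "billing", "credit",
               "limit", "503", "504", "timeout", "connection", "overload")

def complete_with_fallback(chain, prompt):
    """Try each provider in order; advance only on recoverable errors."""
    last_err = None
    for client in chain:                  # e.g. [anthropic, groq, ollama]
        try:
            return client.complete(prompt)
        except Exception as err:          # illustrative; real code may match typed errors
            msg = str(err).lower()
            if any(pat in msg for pat in RECOVERABLE):
                last_err = err            # recoverable → try the next provider
                continue
            raise                         # non-recoverable → surface immediately
    raise last_err
```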
Press 📋 Plan first in the chat instead of Send. The agent produces a numbered plan (3-7 steps) WITHOUT executing anything. Review it, then click ✓ Approve & execute to run the plan, or ✗ Reject.
Useful for: cautious computer-use tasks, multi-step shell ops, anything where you want to know what will happen before it does.
HippoAgent exposes a single master switch plus 6 granular toggles in
/settings:
| Capability | Default | Description |
|---|---|---|
| Sandbox master | ON | When OFF, all permissions are unrestricted |
| Filesystem | home | strict (data/ only), home (user dir), full (anywhere) |
| Computer use | OFF | mouse/keyboard control via pyautogui |
| Webcam | OFF | capture frames + describe via vision LLM |
| Shell | OFF | arbitrary cmd.exe / /bin/sh commands |
| Web | ON | web_fetch + DuckDuckGo search |
| Vision | ON | describe images via multimodal LLM |
Two preset buttons:
- 🔓 Unleash — sandbox OFF + all permissions ON. Full PC access.
- 🔒 Lockdown — sandbox ON, filesystem = strict, only web + vision.
When the sandbox is OFF the agent has full access to your machine: it can read/write any file, run any shell command, control your mouse and keyboard, capture your webcam, fetch the web, and describe images. Use this only with models you trust.
HippoAgent runs on Termux (Android 10+) with a few caveats:
pkg install python git ffmpeg-essentials
git clone https://github.com/aureliocpr-ctrl/hippoagent.git
cd hippoagent
python -m venv .venv && source .venv/bin/activate
# Skip pyautogui (no display server) and opencv (heavy):
pip install -e . --no-deps
pip install anthropic openai sentence-transformers numpy scipy scikit-learn \
networkx pydantic structlog typer rich fastapi uvicorn jinja2 \
python-dotenv httpx mcp textual pillow
hippo dashboard --host 0.0.0.0 --port 8765

Open http://<phone-ip>:8765 from any device on the same network. CLI works
fully (hippo chat, hippo tui). Webcam/computer-use are no-op on Android,
everything else (incl. Ollama via Termux+proot or remote, Groq, Gemini,
Anthropic) works normally.
- The Python sandbox is subprocess -I, not a real isolation layer. A task like "write a calculator and save it to the desktop" will actually write to your disk, because the LLM-generated code can call open(), os.makedirs, requests.get, etc. That's by design for this prototype (it's how you get an agent that does things), but it means you should only run HippoAgent against models and tasks you trust. For untrusted use, run the whole agent in Docker with no host volume mounts, or wrap the executor in seccomp/Firejail.
- API keys saved via the Settings page are written to data/user_settings.json in plaintext. Don't commit that file (it's in .gitignore). Production deployments should swap the storage for an OS keychain or Vault.
- Sandbox is subprocess -I, not seccomp/Firejail/Docker — fine for non-adversarial models, not for production.
- No active learning: the agent doesn't choose which tasks to attempt to maximize learning.
- No explicit forgetting curve over time — only fitness-based pruning.
- Single-agent. The lineage DAG is ready for skill-sharing across instances; the marketplace is not built.
HippoAgent/
├── hippoagent/ # core library
│ ├── __init__.py
│ ├── config.py # all hyper-params, single source
│ ├── observability.py # event bus, logger, metrics
│ ├── llm.py # Anthropic client + MockLLM
│ ├── embedding.py # sentence-transformers wrapper
│ ├── tools.py # PythonExecutor, CodeAnalyzer, tool registry
│ ├── episode.py # Episode + Trace
│ ├── memory.py # EpisodicMemory (vector + causal graph)
│ ├── semantic.py # SemanticMemory (facts)
│ ├── skill.py # Skill + SkillLibrary (lineage, fitness, lifecycle)
│ ├── prompts.py # all prompts, auditable
│ ├── wake.py # ReAct loop with skill injection + critique
│ ├── sleep.py # NREM + REM + Curator + Pruning
│ ├── agent.py # high-level orchestrator
│ ├── cli.py # typer + rich CLI
│ └── dashboard.py # FastAPI dashboard
├── benchmark/ # task suite + evaluator + statistics
├── tests/ # pytest suite
├── data/ # episodes/skills/semantic/runs/reports (gitignored)
├── run_demo.py # full baseline-vs-hippo experiment
├── pyproject.toml
└── README.md
MIT, but please cite the underlying neuroscience papers if you build on the consolidation idea.