⚡ HippoAgent


Persistent memory layer for Claude Code (and any LLM agent). Most AI agents have amnesia. HippoAgent gives them a hippocampus.

HippoAgent is not a replacement for Claude Code, Cursor, or your favourite agent. It's the memory module they're missing. Plug it in via MCP and:

  • Episodes you and the agent worked on persist across sessions.
  • Procedures the agent figured out get compiled into deterministic macros — the next call doesn't even hit the LLM.
  • Failure → success patterns are rescued and replayed during sleep cycles.

8+ neuroscientific mechanisms (DG pattern separation, TCM, synaptic tagging, lateral inhibition, engram crossover, schema priming, …) all opt-in via flags.

⚡ Install in 2 minutes (Claude Code)

pip install hippoagent
claude plugin install hippoagent

Or manually drop .mcp.json in your project root:

{
  "mcpServers": {
    "hippoagent": {
      "command": "python",
      "args": ["-m", "hippoagent.mcp_server"]
    }
  }
}

Restart Claude Code and the hippoagent-memory skill auto-activates at the start of every conversation. To disable temporarily: export HIPPO_DISABLED=1.

📈 Learning curve — what changes when memory is on (real LLMs)

5 iterations of the same task family on Anthropic Claude Opus 4.7, 8 tasks per iter (digit-sum suite). After the 3rd iteration HippoAgent has compiled the procedure into a deterministic macro that bypasses the LLM entirely:

| Iter | HippoAgent tokens | HippoAgent latency | Raw LLM tokens | Raw LLM latency |
|------|-------------------|--------------------|----------------|-----------------|
| 0 (cold) | 4225 | 4.47s | 59 | 0.67s |
| 1 | 1711 (-60%) | 0.92s (-79%) | 59 | 0.71s |
| 2 | 687 (-84%) | 0.52s (-88%) | 59 | 0.85s |
| 3 | 0 | 0.22s (-95%) | 59 | 1.08s |
| 4 | 0 | 0.24s (-95%) | 59 | 0.69s |

Break-even at iter 3. Macro fast-path hit rate: 70%. The agent stops asking the LLM for tasks it has solved before — your token bill drops to zero on recurrent work and your replies come back in 200 ms instead of 700 ms.

Bench data: data/bench_learning_curve_anthropic_n5.{results,by_iter}.json.

🎯 Held-out generalization — practical text tasks

Beyond digit-sum: 5 TRAIN tasks (URL parsing, date format, capitalize, reverse, word count) → sleep consolidate → 5 HELD-OUT tasks with fresh inputs of the same families.

| Phase | Success | Rate |
|-------|---------|------|
| TRAIN | 5/5 | 100% |
| HELD-OUT | 5/5 | 100% |

The agent retrieves 3 relevant skills per held-out task and applies them without per-task re-discovery. Bench data: data/bench_held_out_practical.{results.json,summary.md}.

📊 Headline — compositional generalization (real LLMs, 96 calls)

12 tasks at increasing skill-chaining depth (apply ROT3 + REVERSE in 1-5 chained transformations to fresh inputs). 4 providers, 1 iter:

| Provider | Lv1 | Lv2 | Lv3 | Lv4 | Lv5 | Overall |
|----------|-----|-----|-----|-----|-----|---------|
| Anthropic raw | 100% | 50% | 0% | 0% | 0% | 42% |
| Anthropic HippoAgent | 100% | 100% | 100% | 100% | 100% | 100% |
| DeepSeek raw | 50% | 0% | 0% | 0% | 0% | 25% |
| DeepSeek HippoAgent | 100% | 100% | 100% | 100% | 100% | 100% |
| OpenRouter raw | 100% | 0% | 0% | 0% | 0% | 33% |
| OpenRouter HippoAgent | 100% | 100% | 100% | 100% | 100% | 100% |
| Groq raw | 50% | 0% | 0% | 0% | 0% | 25% |
| Groq HippoAgent | 100% | 50% | 100% | 50% | 100% | 83% |

The accuracy gap GROWS with composition depth. Single-shot LLMs collapse to 0% at Lv3; persistent-memory agents stay at 100% up to Lv5 on 3/4 providers. That is the empirical signature of compositional reasoning over consolidated skills.

Compiled-macro hit rate (skills bypassing the LLM entirely): OpenRouter 92%, Groq 75%, DeepSeek 58%, Anthropic 0% (Anthropic uses the full ReAct loop — the 100% accuracy comes from memory recall, not caching).

Bench data (replicable): data/bench_compositional_4providers.*.json.

python scripts/bench_with_without_hippo.py \
    --suite compositional \
    --providers anthropic,deepseek,openrouter,groq \
    --conditions raw,hippo_warm

HippoAgent (formerly Engram) — the former name comes from the neuroscientific term for a permanent memory trace (Lashley, Tulving). The system turns experience into inspectable, fitness-tracked, mergeable artifacts.

📚 Platform reference (start here)

If you want the end-to-end map of the moving parts — components, configuration knobs, task flow, MCP integration, multi-model bench harness, test isolation — read docs/PLATFORM.md. That doc is the contract; everything below is narrative.

For a 5-minute MCP integration walkthrough (Claude Code, Cursor, opencode, Cline, Continue) see docs/MCP_QUICKSTART.md.

🚀 What's new in v0.2.0 (production hardening)

The v0.2 push consolidates the prototype into a vendible system. See CHANGELOG.md and the six R&D reports (SECURITY_AUDIT.md, ARCHITECTURE_AUDIT.md, QA_AUDIT.md, RND_MEMORIE.md, RND_PERFORMANCE.md, RND_UX.md) for the full picture.

  • 13 CVEs closed — RCE in /api/ide/run and the WS terminal (auth token + binary allowlist + shlex.split), SSRF blocklist in web_fetch, sensitive-path deny-list, plaintext API-key leak in /api/settings/providers redacted, computer-use kill-switch hotkey deny-list, dashboard CORS lock + session token, editfmt config-file deny-list, MCP server schema validation + audit log + rate limit, prompt-injection wrapper around external content (<untrusted_content>), Docker sandboxed Python executor (opt-in via HIPPO_PYTHON_EXEC_BACKEND=docker).
  • Active memory v2 — 7 enhancements to the original 6 mechanisms + 5 new ones (11 total). All five new mechanisms are zero LLM call (pure numpy / string ops on existing memory). See RND_EXPLORATION.md for the diary of how they were found:
    • 7 — Working Memory Pruning (RND_MEMORIE.md): wake-loop char-budget compressor; critical for small-context models. Validated on Ollama qwen2.5:7b: −54 % token usage with lower variance vs unpatched build.
    • 8 — Trace Alignment / Reverse Replay (RND_TRACE_ALIGNMENT.md): Needleman-Wunsch on observation embeddings finds the exact divergence step between a failed run and its success-twin. Two modes: action-divergence (same situation, different decision) + input-divergence (same tool, wrong file/query). Inspired by sharp-wave reverse replay in CA1 place cells. A minimal alignment sketch follows this list.
    • 9 — Lateral Inhibition (Anti-Hebbian): when a winner skill consolidates on a task, its near-clone rivals are nudged away from that task vector. Földiák 1990 competitive specialisation. Empirically: −0.067 cosine differentiation at step 50 vs Hebbian-only baseline. Disabled by default; opt in via lateral_inhibition_enabled.
    • 10 — Spontaneous Reactivation: a default-mode rehearsal stage during sleep. Skills not used in N days get their last_used_at pushed forward by half the decay cutoff so they don't fall over the retirement cliff. Born & Wilhelm 2012 spaced-repetition substrate. Disabled by default; opt in via spontaneous_reactivation_enabled.
    • 11 — Salience by Surprise: replay priority gains a fourth term that boosts episodes whose num_steps deviates from the skill's average — Buzsáki 2015 prediction-error replay. Combined with multi-skill smallest-deviation logic so a typical-for-skill-A / anomalous-for-skill-B trace doesn't double-count. Disabled by default (sleep_replay_priority_surprise=0.0).
  • Performance — hot paths sped up 16× to 4,700×: LRU-cached embeddings, vectorised skill clustering (corpus @ corpus.T), in-memory recall index with optional FAISS, mtime+size repomap cache.
  • Architecture — dashboard.py 2,338 LOC monolith → 159 LOC entry-point + 11-file dashboard_routes/ package. LLM provider registry moved to providers.yaml + Pydantic ProviderSpec. New pydantic-settings-based Settings v2 singleton. Lightweight Alembic-style migrations.
  • UX — Production-grade design system at hippoagent/static/dashboard.css (WCAG 2.1 AA verified contrasts, 4 px scale, 1.25 type ratio, light theme). /skills page redesigned with KPI grid + responsive card grid + filter pills + accessibility. CLI banner with contextual tips and grouped /help.
  • CI/CD — 3 OS × 4 Python = 11 jobs, dedicated security workflow (pip-audit, safety, bandit, ruff S-rules) running weekly. Multi-stage Dockerfile (~500 MB), non-root user, HEALTHCHECK. pip extras ([headless], [mcp-only], [tui], [vision], [full], [dev]) — default install is now minimal sane.
  • Tests — 113 → 1072+ (+849 %). Coverage 46 % → 59 %. Ruff: 33 → 0 errors. Recent additions (FORGIA #27–#89): tests/test_bench_harness.py, test_bench_compare.py, test_bench_summary_md.py, test_bench_cli.py, test_bench_recall_ablation.py, test_clean_bench_data.py, test_jsonutil.py, test_corruption_guards.py, test_data_dir_isolation.py, test_auto_fallback.py, test_config_env.py, test_makefile_help.py, test_wake_used_macro.py, test_sleep_report_n_llm_calls.py, test_real_provider_smoke.py, test_mcp_e2e_smoke.py. Original suites preserved: tests/security/ (path traversal, SSRF, secrets redaction, prompt injection, executor isolation, editfmt sensitive, pentest validation), tests/test_settings.py, test_settings_v2.py, test_provider_registry.py, test_migrations.py, test_mcp_server.py, test_mcp_server_security.py, test_dashboard_api.py, test_cli.py, tests/perf/test_perf.py (10 benchmarks), tests/test_rnd_active_memory.py, test_trace_alignment.py, test_lateral_inhibition.py.
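
For mechanism 8, the alignment core is small enough to sketch. A minimal illustration of Needleman-Wunsch over observation embeddings, assuming unit-norm numpy vectors; this is just the DP scoring core, not the two-mode divergence classifier from RND_TRACE_ALIGNMENT.md:

import numpy as np

def align_traces(a, b, gap=-0.5):
    """Global alignment of two trajectories of unit-norm observation embeddings.
    Cosine similarity (dot product) scores matches; the traceback through S
    yields the aligned step pairs, and the first low-similarity pair marks the
    divergence step between a failed run and its success-twin."""
    n, m = len(a), len(b)
    S = np.zeros((n + 1, m + 1))
    S[:, 0] = gap * np.arange(n + 1)   # leading gaps in trace b
    S[0, :] = gap * np.arange(m + 1)   # leading gaps in trace a
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = S[i - 1, j - 1] + float(a[i - 1] @ b[j - 1])
            S[i, j] = max(match, S[i - 1, j] + gap, S[i, j - 1] + gap)
    return S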

What Engram does, in one breath

You give the agent a task in plain language. It thinks, uses tools (Python sandbox, file I/O, shell, web fetch, screenshots, webcam, vision LLM, computer use), retrieves any relevant past skills it has consolidated, and answers. Every conversation is an episode in memory. Every few episodes, you trigger a sleep cycle that distills new procedural skills (NREM), recombines them creatively (REM), merges duplicates, and promotes/retires by Bayesian fitness. Tomorrow the agent is genuinely better at what you asked it yesterday — and you can read every lesson it learned.

Verified working today

Multi-model bench harness (FORGIA #27, 2026-05-09): same task suite, 3 conditions (raw / hippo_cold / hippo_warm), 3-4 real providers, fail-isolated.

Headline result — memory_recall suite (the discriminative one):

| Provider | raw | hippo_cold | hippo_warm |
|----------|-----|------------|------------|
| anthropic | 0.50 | 1.00 | 1.00 (latency −56%) |
| deepseek | 0.50 | 1.00 | 1.00 |
| openrouter | 0.50 | 1.00 | 1.00 |

The 50% raw failure comes from the 3 query tasks ("What was the value I told you?") — with no shared context, the LLM has nothing to retrieve from. HippoAgent's recall pipeline retrieves the seed episode, and the query phase succeeds 100% on every provider: a +50 percentage-point accuracy uplift across three different LLMs.

Hardened result — hard_memory_recall (12 tasks: direct + paraphrased + synthesis):

| Provider | raw | hippo_cold | hippo_warm |
|----------|-----|------------|------------|
| anthropic | 0.50 | 1.00 | 1.00 (latency −51%) |
| deepseek | 0.50 | 0.92 | 0.92 (lost the synthesis) |
| openrouter | 0.50 | 1.00 | 1.00 |

The headline holds: +42–50 pp uplift across paraphrased queries and multi-step synthesis. DeepSeek lost the multi-step task (retrieved both facts but failed the addition) — HippoAgent provides the memory, arithmetic composition is on the model.

Skill-compounding suite (8 digit-sum tasks): hippo_warm latency −41% vs hippo_cold on anthropic (compiled-macro fast-path engaging). Default trivia suite: raw wins (~50 tokens vs ~3,000); hippo adds structural overhead but proves the end-to-end transport works.

All raw data committed at data/bench_*.{results,summary}.json. Full analysis in docs/PLATFORM.md.

Reproduce locally:

make bench-help          # list available task suites
make bench-mock          # offline smoke (no API key needed)
make bench-real          # run on every provider with an env key set
make bench-memory        # the discriminative recall suite
make bench-summary       # render the latest summary as markdown
make bench-csv           # CSV (Excel-friendly)
make bench-quick         # mock + 2 tasks (CI smoke)
make bench-clean         # dry-run of transient bench dirs
make stats               # project size + test count
make bench-compare BEFORE=... AFTER=...   # diff two bench summaries

The bench script supports many flags useful in CI: --quiet --max-tasks N --task-id ID --providers auto|csv --suite NAME --n-iter N --consolidate-every K --save-md --memory-stats --show-failures --print-config --list-providers --clean-data --output-dir PATH. See python scripts/bench_with_without_hippo.py --help.

Tool-use verified live across 4 providers (same task: write a real file to Desktop), all via native tool-use API (no fragile JSON-in-text parsing):

| Provider | Model | Steps | Tokens | Outcome |
|----------|-------|-------|--------|---------|
| Ollama (local, free) | qwen2.5:1.5b | 2 | 4,655 | ✓ wrote file to disk |
| Ollama (local, free) | qwen2.5:7b-instruct | 2 | 4,699 | ✓ wrote file to disk |
| Groq (free tier) | llama-3.3-70b-versatile | 4 | 11,794 | ✓ wrote file to disk |
| Anthropic | claude-haiku-4-5 | 2 | 7,200 | ✓ wrote file to disk |

Computer-use end-to-end verified live:

  • ✓ shell_run: whoami, systeminfo, wmic, ver — Claude summarises the system
  • ✓ Task "open Calc + screenshot + describe + close" — 4 steps, success
  • ✓ vision_describe — Claude correctly describes the OMNEX logo
  • ✓ web_fetch + web_search (DuckDuckGo) — Nature/ScienceDirect papers
  • ✓ desktop_screenshot 2560×1600 + describe via Anthropic vision
  • ✓ Sleep cycle: 18 episodes → 6 NREM + 2 REM + 2 merges + 6 facts in 103s

Plan mode + auto-fallback chain verified.

HippoAgent is a working prototype of an idea: an LLM agent that becomes measurably more competent over time without ever updating its weights. It does so by mimicking the two-stage memory consolidation model from neuroscience (NREM slow-wave + REM paradoxical sleep). Every "lesson" the agent learns is a structured, versioned, fitness-tracked artifact you can read, edit, share, retire — not an opaque parameter shift.

Why this exists

Today's LLM agents have two ways to "learn":

  1. Fine-tuning — costly, centralized, opaque, irreversible.
  2. RAG / context — they "know" things but don't remember the session; no consolidation, no transfer, no growth curve.

The space in between — what humans actually do during sleep — is empty in production systems. Voyager (Wang et al.) introduced a skill library for Minecraft. MemGPT/Letta layered tiered memory. Reflexion added critique. But nobody has closed the loop with a consolidation cycle that:

  • replays episodes (success and failure),
  • extracts invariant patterns into procedural skills,
  • recombines existing skills creatively,
  • tests them under a fitness function,
  • prunes the losers, promotes the winners,
  • and measures itself with held-out tasks.

That loop is the bet of HippoAgent.

Architecture (one screen)

┌──────────────────────────────────────────────────────────────────────┐
│                            HippoAgent                                │
│                                                                      │
│  ╔═══════════════ WAKE ════════════════╗   ╔═════ SLEEP ══════════╗  │
│  ║  Task → Memory retrieval (skills    ║   ║  NREM: cluster        ║  │
│  ║          + similar episodes)        ║   ║   episodes →          ║  │
│  ║  ReAct loop with tool use:          ║   ║   distill skills      ║  │
│  ║   • run_python (sandboxed subproc)  ║   ║   + semantic facts    ║  │
│  ║   • syntax_check, find_function     ║   ║                       ║  │
│  ║   • submit_solution                 ║   ║  REM: pick 2 skills,  ║  │
│  ║  Reflexion-style self-critique on   ║   ║   propose hybrid      ║  │
│  ║   failure → 1-shot retry.           ║   ║                       ║  │
│  ║  Episode persisted to memory.       ║   ║  Curator: merge       ║  │
│  ╚══════════════════ ▼ ════════════════╝   ║   semantic dups       ║  │
│                      │                     ║                       ║  │
│              Episodes (SQLite +            ║  Pruning: Bayesian    ║  │
│              embeddings + causal graph)    ║   fitness → promote / ║  │
│                                            ║   retire skills.      ║  │
│              Skills (JSON files +          ╚══════════ ▲ ══════════╝  │
│              version chain + lineage DAG ──────────────┘              │
│              + Beta-Binomial fitness)                                │
│                                                                      │
│              Semantic facts (decoupled from time)                    │
└──────────────────────────────────────────────────────────────────────┘

Observability layer: every action emits a structured event →
   structlog → metrics registry → dashboard (FastAPI + vis-network).

What's actually inside

| Module | What it does | Key novelty |
|--------|--------------|-------------|
| episode.py | Episode + Trace — full ReAct trajectory | immutable, timestamped, embedding-indexed |
| memory.py | Episodic memory: SQLite + dense recall + causal graph (networkx) | clustering for replay; A→B causal edges via shared skill |
| semantic.py | Semantic memory: facts decoupled from time | Tulving-style separation |
| skill.py | Skill library — versioned JSON files + index | Bayesian Beta-Binomial fitness, lineage DAG, status lifecycle |
| tools.py | Sandboxed Python executor + AST analyzer | subprocess isolation, timeout, output cap |
| wake.py | ReAct loop with skill+episode injection | tolerant parser, Reflexion critique, A/B toggle |
| sleep.py | Multi-stage consolidation engine | NREM + REM + Curator + Pruning |
| prompts.py | All LLM prompts, in one auditable file | "experience as artifact" thesis |
| observability.py | EventBus + structlog + metrics registry | every step emits a typed event |
| dashboard.py | FastAPI + HTML dashboard | skill lineage graph (vis-network) |
| cli.py | typer + rich CLI | hippo run/wake/sleep/benchmark/skills/episodes/dashboard |
| benchmark/ | 18 HumanEval-style coding tasks + evaluator | wake/heldout split, Wilson CIs, two-prop z-test |

Scientific anchors

  • Two-stage consolidation: Walker & Stickgold (2004), Diekelmann & Born (2010). NREM consolidates declarative memory; REM enables creative recombination.
  • Episodic vs semantic vs procedural: Tulving (1972, 1985).
  • Fast hippocampal replay → slow cortical consolidation: McClelland, McNaughton & O'Reilly (1995) — the model HippoAgent's wake/sleep split mirrors.
  • Reflexion: Shinn et al. (2023) — verbal RL on agents (no gradients) — used here for the self-critique retry.
  • Voyager: Wang et al. (2023) — skill library for embodied agents — closest prior art for the procedural memory.
  • Bayesian fitness for small N: Beta-Binomial conjugate prior — robust posterior mean even after 1–2 trials.
  • Declarative → procedural transition: Anderson (1982) ACT-R, Logan (1988) instance theory — the basis for procedural compilation.
  • Hippocampal forward sweeps: Pfeiffer & Foster (2013), Diba & Buzsáki (2007) — predictive replay before action — the basis for forward replay.
  • Hebbian plasticity: Hebb (1949) "cells that fire together wire together" — the basis for the trigger-embedding drift on success.
  • Counterfactual reasoning in episodic memory: Gershman & Daw (2017) — alternative trajectories during offline replay drive learning beyond mere reinforcement.

What makes the memory active (not just stored)

Most LLM-agent memory systems are passive: they retrieve past prompts and dump them into context. HippoAgent's memory is active — the act of using a skill makes it stronger, faster, and more discriminating. Six mechanisms compound:

  1. Procedural compilation — once a skill has succeeded N times with high fitness, the DREAMER (during sleep) distills its successful traces into a parameterised macro: a list of tool calls with {{TASK}} and {{LAST_OBSERVATION}} placeholders. At wake time, when a strongly-matching task arrives, the macro is executed deterministically — zero LLM tokens, no model latency between steps. The skill is not just remembered, it is compiled the way deliberate actions become motor reflexes. (compilation.py)

  2. Forward replay — before the wake loop fires, the agent looks up the top skill's past successful trajectories and projects an expected action sequence. The block is injected as ## PREDICTED PATH in the user prompt. This anchors the LLM's reasoning (less drift on familiar tasks) and lets us detect divergence (a learning signal). Pure retrieval — no extra LLM call. (wake.py:_forward_replay_block; sketched after this list)

  3. Hebbian skill embedding — every successful application drifts the skill's trigger embedding toward the task that just succeeded (new = (1 - α)·current + α·task, α = 0.05, then re-normalised). Skills become more retrievable for the kind of task they keep solving — the library shapes itself to its workload over time without any retraining. (skill.py:_hebbian_update; sketched after this list)

  4. Counterfactual REM — when a skill keeps failing (fitness < 0.5, trials ≥ 3), the dreamer doesn't just decrement Bayesian counts. It reads the failed trajectory and synthesises an alternative strategy — a candidate counterfactual skill with the failed skill as parent. The alternative competes for retrieval on future similar tasks; if it wins, it supersedes the broken approach without manual intervention. (sleep.py:_stage_counterfactual)

  5. Schema formation — once a domain has accumulated enough skills (≥3 with cosine similarity ≥ 0.62 on triggers), the dreamer writes a SCHEMA: a meta-skill whose body is a one-line rubric for picking among the children. Lineage edges (relation='specialises') connect each schema to its specifics, building a 2-level hierarchy that becomes navigable as the library grows. Tulving's episodic→semantic transition. (sleep.py:_stage_schema)

  6. Self-suggested practice — for skills sitting in the uncertain middle (fitness 0.45–0.65), the dreamer writes 2 concrete practice prompts that would plausibly trigger them. They appear in the dashboard's skill detail under "📚 Practice prompts" with one-click "▶ run in chat" buttons. Running a prompt feeds real evidence into the Bayesian fitness — so the skill is decisively promoted or retired instead of lingering in ambiguity. The agent literally suggests its own training set. (sleep.py:_stage_practice)
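
Mechanisms 2 and 3 are pure retrieval and arithmetic, so they fit in a few lines. A minimal sketch, assuming unit-norm numpy embeddings and an illustrative episode shape (outcome, skills_used, trajectory fields); the real implementations live in wake.py and skill.py:

import numpy as np

ALPHA = 0.05  # Hebbian drift rate from mechanism 3

def hebbian_update(trigger_emb, task_emb):
    # new = (1 - alpha) * current + alpha * task, then re-normalised to unit length
    new = (1.0 - ALPHA) * trigger_emb + ALPHA * task_emb
    return new / np.linalg.norm(new)

def forward_replay_block(skill_id, episodes):
    # Project the expected action sequence from the skill's most recent
    # successful trajectory; the block is injected verbatim into the prompt.
    wins = [e for e in episodes
            if e["outcome"] == "success" and skill_id in e["skills_used"]]
    if not wins:
        return ""
    steps = [step["tool"] for step in wins[-1]["trajectory"]]
    return "## PREDICTED PATH\n" + "\n".join(
        f"{i + 1}. {tool}" for i, tool in enumerate(steps))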

Together these turn memory from an archive into an organ that grows.

End-to-end demo (no API keys needed)

python scripts/demo_active_memory.py

Seeds a library with six skills, runs a complete sleep cycle (NREM + REM + Curator + compilation + counterfactual + schema + practice + pruning) using a scripted mock LLM, then re-runs a matching task to demonstrate the macro fast-path. Typical output:

Sleep cycle — six mechanisms fire in one pass
  duration    : 1.37s  (12 LLM calls, 3080 tokens)
  🔧 macros   : 1
  🌀 cf       : 1
  🌳 schemas  : 1
  📚 practice : 6 prompts written
  promoted    : 1   retired : 1

Wake — re-running a similar task uses the macro fast-path
  steps       : 2
  tokens used : 0
  llm calls   : 0  (macro fired, no model invoked)

Empirical evidence — macro speed-up

Run python scripts/bench_macro.py --repeats 3 --latency 0.5 to reproduce:

COLD (ReAct)    median  wall=1.296s  llm_calls=2  tokens=560
HOT  (macro)    median  wall=0.202s  llm_calls=0  tokens=0
→ speed-up      :   6.41x
→ time saved    : 1094 ms per task
→ token saved   :   560 per task

The simulated LLM models a real-world 0.5 s/call latency (typical for hosted models). On the hot path the macro fires deterministically — not a single token leaves the box.
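
Concretely, a compiled macro is just data plus a loop. A hypothetical sketch; the field names are illustrative, not compilation.py's real schema:

macro = {
    "skill_id": "skill-0042",
    "steps": [
        {"tool": "run_python",
         "args": {"code": "print(sum(int(c) for c in '{{TASK}}' if c.isdigit()))"}},
        {"tool": "submit_solution", "args": {"answer": "{{LAST_OBSERVATION}}"}},
    ],
}

def run_macro(macro, task_text, tools):
    """Substitute placeholders and execute each tool call in order.
    No LLM is invoked at any step; this is the zero-token hot path."""
    last_obs = ""
    for step in macro["steps"]:
        args = {k: v.replace("{{TASK}}", task_text)
                     .replace("{{LAST_OBSERVATION}}", last_obs)
                for k, v in step["args"].items()}
        last_obs = str(tools[step["tool"]](**args))
    return last_obs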

Live dashboard

Visit /active-memory (or click "Active memory" in the nav) for a live KPI panel — compiled-macro coverage, Hebbian-tuned skills, counterfactual lineage, schema hierarchy. The /api/active-memory/stats JSON endpoint exposes the same metrics for external tooling.

Closing the loop — explicit user feedback

Every chat turn surfaces 👍 / 👎 buttons. Clicking them feeds the same Bayesian fitness machinery used by automatic outcomes — no separate "feedback model" or fine-tune. A 👎 on a successful turn flips the episode to failure and records a failure trial against each applied skill; mis-promoted skills decay back into the candidate pool. A 👍 simply boosts the trial count with a success.

So the library has three sources of evidence, all converging on the same fitness posterior:

  1. automatic validator outcomes during the wake loop;
  2. counterfactual REM evaluations during sleep;
  3. explicit user up- / down-votes from the chat surface.
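
All three feed the same conjugate update. A minimal sketch of the posterior mean; the uniform prior a = b = 1 is an assumption, not necessarily skill.py's choice:

def fitness(successes, failures, a=1.0, b=1.0):
    # Posterior mean of Beta(a + successes, b + failures) over Bernoulli trials.
    return (a + successes) / (a + b + successes + failures)

print(fitness(2, 0))  # 0.75: two wins lift a skill without saturating at 1.0
print(fitness(0, 3))  # 0.20: three failures drop it below the 0.5 counterfactual threshold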

Install

Option A — Python (local)

cd ProgettiAI/HippoAgent
python -m venv .venv
source .venv/Scripts/activate           # or .venv/bin/activate on POSIX
pip install -e .
hippo dashboard                         # → http://127.0.0.1:8765

The first time you open the dashboard, you'll land on /welcome — pick a provider, paste an API key (or run Ollama locally — no key needed), test the connection, save. Everything is configurable from the ⚙ Settings page; no env-var voodoo required.

Option B — Docker

docker compose up
# or:  docker build -t hippoagent . && docker run -p 8765:8765 -v $PWD/data:/app/data hippoagent

The container picks up any *_API_KEY env vars you pass through. To use a host-running Ollama, the compose file already maps host.docker.internal.

Using HippoAgent

There are three ways to interact with the agent. They share the same backend (memory, skills, sleep cycles) — just pick what suits you.

1. Web dashboard

hippo dashboard                         # → http://127.0.0.1:8765

| Page | What it's for |
|------|---------------|
| Chat | the main UI — type a task, get an answer with applied skills |
| Settings | pick LLM provider, paste API key, test connection, switch models live |
| Episodes | every task ever executed; click any row for the full ReAct trajectory |
| Skills | consolidated lessons with Bayesian fitness, status, lineage |
| Lineage | interactive graph of how skills derive from one another |
| Events | live event stream (every memory write, retrieval, LLM call) |
| Metrics | counters/histograms |

2. CLI

hippo chat                       # interactive REPL: type tasks, /sleep, /skills, /quit
hippo run "your task here"       # one-shot
hippo wake --n-tasks 5           # run the bench wake-set
hippo sleep                      # consolidation cycle on demand
hippo benchmark                  # held-out evaluation
hippo skills list / show <id>
hippo episodes list / show <id>
hippo providers list / scan / models <p> / active

3. As an MCP server inside Claude Code / Cursor / opencode / Cline / Zed

HippoAgent ships an MCP (Model Context Protocol) server, so any MCP-aware client can use it as a memory-augmented agent. Add this to your client's mcp.json (or equivalent):

{
  "mcpServers": {
    "hippoagent": {
      "command": "hippo",
      "args": ["mcp"],
      "env": {
        "HIPPO_LLM_PROVIDER": "groq",
        "GROQ_API_KEY": "your-key",
        "HIPPO_MODEL": "llama-3.3-70b-versatile"
      }
    }
  }
}

Once registered, the host (Claude Code, Cursor, etc.) can call:

| MCP tool | What it does |
|----------|--------------|
| hippo_run_task | full wake loop — agent uses its own tools + skills |
| hippo_consolidate | trigger a sleep cycle |
| hippo_recall | semantic search over past episodes (embeddings) |
| hippo_recall_explain | semantic recall + per-component score breakdown |
| hippo_search | keyword search over episode task_text (LIKE) |
| hippo_episode_list | paginated listing of episodes (limit/offset/outcome) |
| hippo_episode_get | one episode in full (trajectory + critique) |
| hippo_episode_pin / hippo_episode_unpin | protect / release an episode from decay-pruning |
| hippo_forget | delete one episode by id (privacy / GDPR) |
| hippo_metrics_history | token-usage timeseries bucketed by day |
| hippo_skills_for | preview which skills HippoAgent would inject for a task |
| hippo_skill_promote / hippo_skill_retire / hippo_skill_edit | manual curation |
| hippo_skill_export / hippo_skill_import | portable JSON bundles (share skills between installations) |
| hippo_skill_test | render the prompt-context for a (skill, task) pair without calling the LLM |
| hippo_skill_top | top-k skills by fitness / recency / activity |
| hippo_skill_lineage | walk the parent_skills DAG ancestry of a skill |
| hippo_skill_compare | diff two skills (body / fitness / trials) |
| hippo_skill_similar | top-k skills by Jaccard overlap on body tokens |
| hippo_skill_describe | deterministic 1-line natural-language summary of a skill (no LLM) |
| hippo_skill_merge | manually merge skill A into B (sum trials, retire A, lineage tracked) |
| hippo_episodes_by_skill | every episode whose skills_used includes a given skill |
| hippo_provider_switch | switch the active LLM provider at runtime (anthropic / openai / groq / …) |
| hippo_remember | store one fact directly in semantic memory — no episode, no sleep cycle |
| hippo_facts_recall | semantic search over facts (cosine on proposition embedding) |
| hippo_facts_search | keyword/substring search over facts (LIKE on proposition) |
| hippo_facts_list | paginated listing of all facts (newest-first) |
| hippo_fact_forget | delete one fact by id (privacy / GDPR) |
| hippo_skills_search | keyword/substring search over skills (LIKE on name+trigger+body) |
| hippo_skill_bundles / hippo_compound_skills / hippo_skill_antagonists | structural introspection |
| hippo_status | counts + active provider |
| hippo_health | deep preflight at startup — 3-tier reachability + counts + flag + tool_count |
| hippo_stats | aggregate metrics (episodes by outcome, skills, token usage) |
| hippo_audit_tail | last N records of the MCP audit log (forensics) |

And read MCP resources:

  • hippo://skills/list and hippo://skills/{id} — consolidated skills
  • hippo://episodes/recent and hippo://episodes/{id} — past trajectories

The result: your Claude Code (or whatever) gets a second-brain agent it can delegate to — and the second brain remembers across sessions, not just within one conversation.

4. Programmatically

from hippoagent.agent import HippoAgent
agent = HippoAgent.build()
result = agent.run_task(
    task_id="my-task",
    task_text="Write a Python function that ...",
    validator=lambda ans: (bool(ans.strip()), "non-empty"),
)
print(result.episode.final_answer)
agent.consolidate()              # nightly sleep cycle

Desktop launchers (Windows)

powershell -ExecutionPolicy Bypass -File scripts\install_desktop_shortcut.ps1

This creates two shortcuts on your Desktop:

  • HippoAgent Dashboard — double-click to launch the web UI (browser opens to http://127.0.0.1:8765).
  • HippoAgent CLI — opens a shell with the venv activated, ready for hippo … commands.

Providers — bring any LLM you like

HippoAgent is provider-agnostic. Set ONE of these env vars (or run Ollama locally) and you're done. The first matching one wins (or force with HIPPO_LLM_PROVIDER=<name>).
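
The selection rule is simple enough to sketch. Illustrative only: the real registry lives in providers.yaml, and this priority order is an assumption:

import os

PRIORITY = ["anthropic", "ollama", "openai", "groq", "deepseek"]  # assumed order

def detect_provider():
    forced = os.environ.get("HIPPO_LLM_PROVIDER")
    if forced:
        return forced                       # explicit override always wins
    for name in PRIORITY:
        env = "OLLAMA_HOST" if name == "ollama" else f"{name.upper()}_API_KEY"
        if os.environ.get(env):
            return name                     # first matching env var wins
    return "ollama"                         # assumed local fallback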

| Family | Provider (alias) | Env var | Base URL |
|--------|------------------|---------|----------|
| native | anthropic | ANTHROPIC_API_KEY | Anthropic SDK |
| native | ollama | OLLAMA_HOST (defaults to http://localhost:11434) | local |
| US/EU | openai, openrouter, mistral, groq, xai (grok), perplexity, fireworks, together, cerebras, gemini (google), nvidia, huggingface (hf), deepinfra, hyperbolic, novita, lepton, anyscale, azure | <NAME>_API_KEY | each provider's /v1 |
| China | moonshot (kimi), deepseek, qwen (dashscope/alibaba), zhipu (glm), baichuan, yi (lingyi/01ai), doubao (ark), hunyuan (tencent), stepfun (step), minimax, spark (iflytek) | <NAME>_API_KEY | each provider's /v1 |
| local OpenAI-compat | lmstudio, vllm, localai, tabby | <NAME>_API_KEY (any non-empty) | localhost |

Per-stage model overrides (Claude defaults are tuned; for other providers you usually want to set these):

export HIPPO_MODEL=qwen2.5:7b               # all stages
export HIPPO_MODEL_EXECUTOR=qwen2.5:7b      # ReAct loop only
export HIPPO_MODEL_DREAMER=qwen2.5:14b      # NREM/REM synthesis (smarter recommended)
export HIPPO_MODEL_CRITIC=qwen2.5:1.5b      # cheap critic

Discover what's actually reachable from your setup:

hippo providers list      # all known providers + env-var status
hippo providers scan      # query /v1/models on every configured one (real discovery)
hippo providers models kimi    # list one provider's models
hippo providers active    # which provider is selected right now

Examples:

HIPPO_LLM_PROVIDER=kimi     MOONSHOT_API_KEY=sk-...                              hippo wake
HIPPO_LLM_PROVIDER=deepseek DEEPSEEK_API_KEY=sk-... HIPPO_MODEL=deepseek-reasoner hippo wake
HIPPO_LLM_PROVIDER=ollama   OLLAMA_MODEL=qwen2.5:7b                              hippo wake
HIPPO_LLM_PROVIDER=groq     GROQ_API_KEY=gsk-...   HIPPO_MODEL=llama-3.3-70b-versatile hippo wake

For tests / dev:

pip install -e ".[dev]"

Quick run

# Tests (fully offline, mock LLM)
HIPPO_OFFLINE=1 pytest

# CLI
hippo --help
hippo run "Define a Python function that returns the n-th prime"
hippo wake --n-tasks 5             # run wake-set (records episodes)
hippo sleep                        # run consolidation cycle
hippo benchmark                    # run held-out tasks with consolidated skills
hippo skills list
hippo skills show <id>
hippo episodes list
hippo dashboard                    # → http://127.0.0.1:8765

End-to-end demo (the actual experiment)

python run_demo.py --n-wake 10 --n-heldout 8

What it does, in order:

  1. Reset.
  2. Baseline: held-out tasks, without skills, without past episodes — pure model.
  3. Wipe the episodes the baseline accidentally created (clean slate).
  4. Wake: run the wake-set; the agent records everything.
  5. Sleep: NREM clusters episodes → distills skills; REM proposes hybrids; Curator merges semantic duplicates; pruning promotes/retires by fitness.
  6. Hippo: re-run the held-out tasks with the consolidated skill library.
  7. Compare: pass-rate, avg-steps, avg-tokens, skill-reuse-rate. 95% Wilson interval on rates. Two-proportion z-test for the gap.

The script saves a JSON report under data/reports/.
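
For reference, both statistics in step 7 are a few lines each. A self-contained sketch of the Wilson interval and the two-proportion z statistic (benchmark/ holds the canonical versions):

import math

def wilson_interval(k, n, z=1.96):
    # 95% Wilson score interval for k successes out of n trials.
    p = k / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

def two_prop_z(k1, n1, k2, n2):
    # z statistic for the gap between two pass-rates (pooled standard error).
    p = (k1 + k2) / (n1 + n2)
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    return (k1 / n1 - k2 / n2) / se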

First-run observations

A small run (10 wake / 8 held-out) on Claude Haiku 4.5 produced these consolidated skills, derived autonomously from the agent's own behavior:

  • Verify test cases before blaming algorithm (NREM, fitness=0.70)
  • Validate test runner output before debug (REM hybrid, 0.70)
  • Harden JSON code payloads end-to-end (REM hybrid, 0.70)
  • Escape newlines in JSON code strings (the Curator merged 4 near-duplicates of this)

These are not generic "be helpful" stubs — they are concrete lessons the agent extracted from its own failure modes (mostly: malformed JSON in run_python calls when the code contained newlines).

The pass-rate gap on N=8 held-out is not statistically significant (p≈0.5 by two-prop z-test) — and that is the honest scientific finding at this scale. Ramping to N≥30 is the next experiment; the architecture is ready for it.

What makes this a prototype, not a toy

  • ✅ 23 unit + integration tests passing
  • ✅ Real LLM client + offline mock for deterministic CI
  • ✅ Sandboxed Python execution (subprocess + timeout)
  • ✅ Structured logging + event bus + metrics registry
  • ✅ Bayesian fitness with conjugate prior (not naive ratios)
  • ✅ Skill lineage DAG (networkx) — full provenance from episode → skill → REM hybrid → merge
  • ✅ Web dashboard with skill lineage visualization
  • ✅ Reproducible benchmark with deterministic seed split
  • ✅ Statistical primitives (Wilson interval, two-prop z-test) for honest reporting
  • ✅ A/B toggle (--no-skills) so baseline-vs-hippo is a single CLI flag

👁 Vision — works with any multimodal provider

The vision_describe tool dispatches to the right multimodal endpoint per provider. Defaults shipped (override via HIPPO_VISION_MODEL):

| Provider | Default vision model | Status |
|----------|----------------------|--------|
| Anthropic | claude-haiku-4-5-20251001 | ✓ verified live |
| OpenAI | gpt-4o-mini | |
| Google Gemini | gemini-1.5-flash | ✓ free tier |
| Groq | meta-llama/llama-4-scout-17b-16e-instruct | ✓ verified live, free tier |
| OpenRouter | anthropic/claude-haiku-4.5 | ✓ verified live |
| xAI Grok | grok-4 | ✓ (paid) |
| Mistral | pixtral-12b-latest | |
| Alibaba Qwen | qwen-vl-plus | |
| Zhipu GLM | glm-4v | |
| Moonshot Kimi | moonshot-v1-8k-vision-preview | |
| 01.AI Yi | yi-vision | |
| ByteDance Doubao | doubao-vision-pro-32k | |
| NVIDIA NIM | meta/llama-3.2-90b-vision-instruct | |
| Together / Fireworks | Llama 3.2 Vision 90B | |
| HuggingFace router | Llama 3.2 Vision 90B | |
| Ollama (local) | llava (override with OLLAMA_VISION_MODEL) | ✓ — needs ollama pull llava (or qwen2-vl, llama3.2-vision, bakllava, moondream) |
| DeepSeek | n/a | ✗ DeepSeek V4 API doesn't accept image_url blocks |
The dispatcher prefers HIPPO_VISION_MODEL env var > OLLAMA_VISION_MODEL (for Ollama) > the default in the table above. So you can mix: text inference on cheap-fast model, vision on a multimodal-capable one — same single call.
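
That precedence is easy to pin down in code. A sketch (the function name and defaults dict are hypothetical):

import os

def pick_vision_model(provider, defaults):
    explicit = os.environ.get("HIPPO_VISION_MODEL")
    if explicit:
        return explicit                              # global override
    if provider == "ollama" and os.environ.get("OLLAMA_VISION_MODEL"):
        return os.environ["OLLAMA_VISION_MODEL"]     # Ollama-specific override
    return defaults[provider]                        # table default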

# Use Ollama for everything but route vision to Anthropic
# (set ANTHROPIC_API_KEY too; HIPPO_VISION_MODEL selects the vision endpoint):
HIPPO_LLM_PROVIDER=ollama OLLAMA_MODEL=qwen2.5:7b \
  HIPPO_VISION_MODEL=claude-haiku-4-5-20251001 ANTHROPIC_API_KEY=sk-... \
  hippo dashboard

# Use DeepSeek for text, fallback to Groq for vision:
# (DeepSeek V4 has no vision; agent calls vision_describe → falls through to
# the configured fallback chain.)

🔁 Auto-fallback provider chain

When the active provider hits rate-limit / quota / 5xx, Engram transparently tries the next configured provider. Configure in /settings → "Fallback chain":

primary: Anthropic Claude   (fast, paid)
   ↓ on 429/quota
fallback 1: Groq llama 70b  (free tier, very fast)
   ↓ on quota
fallback 2: Ollama qwen 7b  (local, always available)

Result: zero-downtime LLM access, scaling free → paid → premium automatically. Recoverable error patterns: 429, rate, quota, billing, credit, limit, 503, 504, timeout, connection, overload.
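
The retry loop behind this is a simple pattern match on the error text. A sketch, assuming each provider object exposes a complete() call (names illustrative):

RECOVERABLE = ("429", "rate", "quota", "billing", "credit",
               "limit", "503", "504", "timeout", "connection", "overload")

def call_with_fallback(chain, prompt):
    last_err = None
    for provider in chain:                  # e.g. [anthropic, groq, ollama]
        try:
            return provider.complete(prompt)
        except Exception as exc:
            if any(pat in str(exc).lower() for pat in RECOVERABLE):
                last_err = exc
                continue                    # recoverable: try the next provider
            raise                           # anything else propagates immediately
    raise RuntimeError("all providers in the fallback chain failed") from last_err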

📋 Plan mode

Press 📋 Plan first in the chat instead of Send. The agent produces a numbered plan (3-7 steps) WITHOUT executing anything. Review it, then click ✓ Approve & execute to run the plan, or ✗ Reject.

Useful for: cautious computer-use tasks, multi-step shell ops, anything where you want to know what will happen before it does.

🔐 Permissions & Sandbox

Engram exposes a single master switch plus 6 granular toggles in /settings:

| Capability | Default | Description |
|------------|---------|-------------|
| Sandbox master | ON | when OFF, all permissions are unrestricted |
| Filesystem | home | strict (data/ only), home (user dir), full (anywhere) |
| Computer use | OFF | mouse/keyboard control via pyautogui |
| Webcam | OFF | capture frames + describe via vision LLM |
| Shell | OFF | arbitrary cmd.exe / /bin/sh commands |
| Web | ON | web_fetch + DuckDuckGo search |
| Vision | ON | describe images via multimodal LLM |

Two preset buttons:

  • 🔓 Unleash — sandbox OFF + all permissions ON. Full PC access.
  • 🔒 Lockdown — sandbox ON, filesystem = strict, only web + vision.

When the sandbox is OFF the agent has full access to your machine: it can read/write any file, run any shell command, control your mouse and keyboard, capture your webcam, fetch the web, and describe images. Use this only with models you trust.

📱 Android / Termux

Engram runs on Termux (Android 10+) with a few caveats:

pkg install python git ffmpeg-essentials
git clone https://github.com/aureliocpr-ctrl/hippoagent.git
cd hippoagent
python -m venv .venv && source .venv/bin/activate
# Skip pyautogui (no display server) and opencv (heavy):
pip install -e . --no-deps
pip install anthropic openai sentence-transformers numpy scipy scikit-learn \
            networkx pydantic structlog typer rich fastapi uvicorn jinja2 \
            python-dotenv httpx mcp textual pillow
hippo dashboard --host 0.0.0.0 --port 8765

Open http://<phone-ip>:8765 from any device on the same network. CLI works fully (hippo chat, hippo tui). Webcam/computer-use are no-op on Android, everything else (incl. Ollama via Termux+proot or remote, Groq, Gemini, Anthropic) works normally.

⚠ Security notes

  • The Python sandbox is subprocess -I, not a real isolation layer. A task like "write a calculator and save it to the desktop" will actually write to your disk, because the LLM-generated code can call open(), os.makedirs, requests.get, etc. That's by design for this prototype (it's how you get an agent that does things), but it means only run HippoAgent against models and tasks you trust. For untrusted use, run the whole agent in Docker with no host volume mounts, or wrap the executor in seccomp/Firejail.
  • API keys saved via the Settings page are written to data/user_settings.json in plaintext. Don't commit that file (it's in .gitignore). Production deployments should swap the storage for an OS keychain or Vault.

What's still toy

  • Sandbox is subprocess -I, not seccomp/Firejail/Docker — fine for non-adversarial models, not for production.
  • Benchmark is small (18 tasks). Real validation would need ≥100 tasks across more domains.
  • No active learning: the agent doesn't choose which tasks to attempt to maximize learning.
  • No explicit forgetting curve over time — only fitness-based pruning.
  • Single-agent. The lineage DAG is ready for skill-sharing across instances; the marketplace is not built.

Repository layout

HippoAgent/
├── hippoagent/                # core library
│   ├── __init__.py
│   ├── config.py              # all hyper-params, single source
│   ├── observability.py       # event bus, logger, metrics
│   ├── llm.py                 # Anthropic client + MockLLM
│   ├── embedding.py           # sentence-transformers wrapper
│   ├── tools.py               # PythonExecutor, CodeAnalyzer, tool registry
│   ├── episode.py             # Episode + Trace
│   ├── memory.py              # EpisodicMemory (vector + causal graph)
│   ├── semantic.py            # SemanticMemory (facts)
│   ├── skill.py               # Skill + SkillLibrary (lineage, fitness, lifecycle)
│   ├── prompts.py             # all prompts, auditable
│   ├── wake.py                # ReAct loop with skill injection + critique
│   ├── sleep.py               # NREM + REM + Curator + Pruning
│   ├── agent.py               # high-level orchestrator
│   ├── cli.py                 # typer + rich CLI
│   └── dashboard.py           # FastAPI dashboard
├── benchmark/                 # task suite + evaluator + statistics
├── tests/                     # pytest suite
├── data/                      # episodes/skills/semantic/runs/reports (gitignored)
├── run_demo.py                # full baseline-vs-hippo experiment
├── pyproject.toml
└── README.md

License

MIT, but please cite the underlying neuroscience papers if you build on the consolidation idea.
