A cognitive memory system for AI agents, grounded in the taxonomy from Measuring Progress Toward AGI: A Cognitive Framework.
Most agent frameworks treat memory as a single vector store you dump context into. This project implements memory the way the cognitive science paper describes it: separate stores for different memory types, a unified retriever that queries across them, a forgetting service that prunes stale knowledge, and an event bus that lets future modules (working memory, learning, metacognition) subscribe without touching core code.
Built on Gemini Embedding 2 for natively multimodal embeddings — text, images, audio, and video share a single vector space.
Status: Phase 1 in progress. Semantic memory write path is functional. Read path and event bus next.
```mermaid
graph TB
    subgraph "Phase 1 — Memory System"
        CLI["demo/cli.py"]
        EMB["GeminiEmbedder"]
        SS["SemanticStore"]
        ES["EpisodicStore"]
        PS["ProceduralStore"]
        RET["UnifiedRetriever"]
        FGT["ForgettingService"]
        BUS["EventBus"]
        DB[(ChromaDB)]
        CLI --> SS
        CLI --> ES
        CLI --> PS
        SS --> EMB
        ES --> EMB
        PS --> EMB
        SS --> DB
        ES --> DB
        PS --> DB
        RET --> SS
        RET --> ES
        RET --> PS
        FGT --> SS
        FGT --> ES
        FGT --> PS
        SS -.->|emit| BUS
        ES -.->|emit| BUS
        PS -.->|emit| BUS
    end
    subgraph "Phase 2 — Working Memory + Learning"
        WM["WorkingMemory"]
        LRN["LearningModule"]
        WM -.->|subscribe| BUS
        LRN -.->|subscribe| BUS
    end
    subgraph "Phase 3 — Metacognition"
        META["MetacognitiveMonitor"]
        META -.->|subscribe| BUS
    end
    style SS fill:#2d6a4f,color:#fff
    style EMB fill:#2d6a4f,color:#fff
    style CLI fill:#2d6a4f,color:#fff
    style DB fill:#2d6a4f,color:#fff
    style ES fill:#555,color:#aaa
    style PS fill:#555,color:#aaa
    style RET fill:#555,color:#aaa
    style FGT fill:#555,color:#aaa
    style BUS fill:#555,color:#aaa
    style WM fill:#333,color:#666
    style LRN fill:#333,color:#666
    style META fill:#333,color:#666
```
Green = built. Grey = planned (Phase 1). Dark = future phases.
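The EventBus that lets later phases subscribe without touching store code can be as small as a synchronous publish/subscribe map. The sketch below is illustrative only; the class shape and the `memory.stored` event name are assumptions, not the project's actual API:

```python
from collections import defaultdict
from typing import Callable


class EventBus:
    """Minimal synchronous pub/sub bus (illustrative sketch)."""

    def __init__(self) -> None:
        # event type -> list of handler callables
        self._subscribers: dict[str, list[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, event_type: str, handler: Callable[[dict], None]) -> None:
        self._subscribers[event_type].append(handler)

    def emit(self, event_type: str, payload: dict) -> None:
        # Deliver the payload to every handler registered for this event type.
        for handler in self._subscribers[event_type]:
            handler(payload)


# A future WorkingMemory module could subscribe without any change to store code:
bus = EventBus()
seen: list[dict] = []
bus.subscribe("memory.stored", seen.append)
bus.emit("memory.stored", {"id": "abc123", "memory_type": "semantic"})
```

Stores only ever call `emit`; subscribers register themselves, so adding Phase 2 and 3 modules never requires editing Phase 1 code.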
The paper identifies distinct memory sub-types that behave differently — different decay rates, retrieval patterns, and update semantics. This project implements them as separate ChromaDB collections behind a shared interface.
```mermaid
classDiagram
    class MemoryRecord {
        +str content
        +str memory_type
        +str modality
        +str id
        +datetime created_at
        +float importance
        +list~float~ embedding
        +str media_ref
    }
    class SemanticMemory {
        +str category
        +float confidence
        +str supersedes
        +list~str~ related_ids
    }
    class EpisodicMemory {
        +str context
        +float emotional_valence
        +str session_id
    }
    class ProceduralMemory {
        +str trigger
        +list~str~ steps
        +int execution_count
    }
    MemoryRecord <|-- SemanticMemory
    MemoryRecord <|-- EpisodicMemory
    MemoryRecord <|-- ProceduralMemory
```
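The class diagram above maps naturally onto Python dataclasses. This is a sketch of the hierarchy; the field defaults are illustrative assumptions, not necessarily what `models/base.py` defines:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class MemoryRecord:
    """Fields shared by all memory types (defaults are assumed, not canonical)."""
    content: str
    memory_type: str
    modality: str = "text"
    id: str = ""
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    importance: float = 0.5
    embedding: list[float] = field(default_factory=list)
    media_ref: str = ""


@dataclass
class SemanticMemory(MemoryRecord):
    """Factual knowledge; supersedes/related_ids link revisions of a fact."""
    category: str = ""
    confidence: float = 1.0
    supersedes: str = ""
    related_ids: list[str] = field(default_factory=list)


fact = SemanticMemory(
    content="Python was created by Guido van Rossum",
    memory_type="semantic",
    category="programming",
)
```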
How a memory goes from raw content to a persisted vector:
```mermaid
sequenceDiagram
    participant C as CLI
    participant S as SemanticStore
    participant E as GeminiEmbedder
    participant G as Gemini API
    participant D as ChromaDB
    C->>S: store(SemanticMemory)
    S->>E: embed_text(content)
    E->>G: embed_content(text, 768 dims)
    G-->>E: [0.023, -0.41, 0.87, ...]
    E-->>S: embedding vector
    S->>D: add(id, embedding, metadata)
    D-->>S: persisted
    S-->>C: record_id
```
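The write path above can be sketched in a few lines. This is not the project's implementation: `FakeEmbedder` stands in for the real Gemini call, and a plain dict stands in for the ChromaDB collection, so the control flow is visible without external dependencies:

```python
import uuid


class FakeEmbedder:
    """Stand-in for GeminiEmbedder: returns a fixed-size dummy vector."""

    def embed_text(self, text: str) -> list[float]:
        return [0.0] * 768  # real embedder calls the Gemini API here


class SemanticStore:
    """Illustrative write path; a dict stands in for the ChromaDB collection."""

    def __init__(self, embedder) -> None:
        self.embedder = embedder
        self.collection: dict[str, dict] = {}

    def store(self, content: str, metadata: dict) -> str:
        record_id = str(uuid.uuid4())
        embedding = self.embedder.embed_text(content)  # S->>E: embed_text(content)
        self.collection[record_id] = {                 # S->>D: add(id, embedding, metadata)
            "embedding": embedding,
            "metadata": metadata,
            "document": content,
        }
        return record_id                               # S-->>C: record_id


store = SemanticStore(FakeEmbedder())
rid = store.store("Water boils at 100°C at sea level", {"category": "physics"})
```

The store owns the embed-then-persist sequence, so callers never touch the embedder or the database directly.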
All modalities go through the same GeminiEmbedder and land in the same vector space. A text query can retrieve an image memory. An audio clip can be compared to text descriptions.
```mermaid
graph LR
    T["text"] -->|embed_text| EMB["GeminiEmbedder"]
    I["image bytes"] -->|embed_image| EMB
    A["audio bytes"] -->|embed_audio| EMB
    EMB --> V["768-dim vector space"]
    style V fill:#2d6a4f,color:#fff
```
The embedding model is Gemini Embedding 2 — natively multimodal, mapping text, images, audio, and video into a single embedding space. Matryoshka support allows truncation from 3072 down to 768 dimensions with minimal accuracy loss.
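Matryoshka-style truncation amounts to keeping the leading components and re-normalizing to unit length. A sketch on a toy vector (the real model emits 3072 dimensions, truncated here to a stand-in size):

```python
import math


def truncate_embedding(vec: list[float], dims: int) -> list[float]:
    """Keep the leading `dims` components, then re-normalize to unit length."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head)) or 1.0  # guard against zero vectors
    return [x / norm for x in head]


full = [0.5, -0.5, 0.5, 0.5]         # toy 4-dim "full" embedding
small = truncate_embedding(full, 2)  # first 2 dims, re-normalized to unit length
```

Because Matryoshka-trained models pack the most information into the leading dimensions, cosine similarities computed on the truncated vectors stay close to those on the full vectors.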
We ran an experiment to test whether audio embeddings encode emotional tone or just acoustic structure. Four songs across three languages (English, Hindi, Arabic), embedded as raw bytes with no metadata passed to the model.
```mermaid
graph LR
    subgraph "Audio Pipeline"
        SONG["song.mp3"] -->|ffmpeg| CHUNKS["60s chunks"]
        CHUNKS -->|embed_audio| VECS["chunk vectors"]
        VECS -->|average| AV["song vector"]
    end
    subgraph "Probe Vectors"
        P1["grief / loss"]
        P2["melancholy"]
        P3["joy / happiness"]
        P4["peaceful / calm"]
        P5["tension / dread"]
        P6["neutral"]
    end
    AV -->|cosine similarity| SCORES["ranked scores"]
    P1 --> SCORES
    P2 --> SCORES
    P3 --> SCORES
    P4 --> SCORES
    P5 --> SCORES
    P6 --> SCORES
```
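The scoring step is plain cosine similarity between the averaged song vector and each probe vector, sorted descending. A toy-dimension sketch (real vectors are 768-dim; the numbers here are illustrative, not the experiment's):

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def rank_probes(song_vec: list[float],
                probes: dict[str, list[float]]) -> list[tuple[str, float]]:
    """Score the song vector against every probe and rank best-first."""
    scores = {name: cosine(song_vec, vec) for name, vec in probes.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy 3-dim vectors standing in for real embeddings.
song = [0.9, 0.1, 0.0]
probes = {
    "grief / loss": [1.0, 0.0, 0.0],
    "joy / happiness": [0.0, 1.0, 0.0],
    "neutral": [0.0, 0.0, 1.0],
}
ranking = rank_probes(song, probes)
```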
Results summary:
| Song | Genre | Language | Top Match | Runner-up |
|---|---|---|---|---|
| Schindler's List Theme | Orchestral | Instrumental | melancholy (0.70) | grief (0.68) |
| Phir Se | Bollywood | Hindi | grief (0.69) | melancholy (0.69) |
| Rasputin | Disco-pop | English | joy (0.69) | tension (0.67) |
| Didi | Rai | Arabic/French | joy (0.66) | melancholy (0.63) |
The model correctly separated sad from happy songs, made nuanced distinctions within each cluster (melancholy vs. grief, joy vs. tension), and worked cross-lingually on raw audio bytes with no transcription.
Grief ranked last on both happy songs. Neutral ranked last on both sad songs. The largest winner-to-loser gap was 0.087 (Phir Se: grief vs. neutral).
Full methodology and analysis: experiments/audio_emotion_probe_results.md
Architectural implication: emotional_valence on episodic memories can be derived directly from the embedding — no separate sentiment analysis pipeline needed.
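A minimal sketch of that idea, assuming valence is read off the joy-vs-grief similarity gap; the axis choice, clamping, and the similarity numbers below are all hypothetical, not values from the experiment:

```python
def emotional_valence(sim_to_joy: float, sim_to_grief: float) -> float:
    """Map the joy-vs-grief similarity gap onto a rough [-1, 1] valence score."""
    return max(-1.0, min(1.0, sim_to_joy - sim_to_grief))


# Hypothetical similarity scores for a happy and a sad recording:
v_happy = emotional_valence(0.69, 0.60)  # positive gap -> positive valence
v_sad = emotional_valence(0.60, 0.69)    # negative gap -> negative valence
```

The gap between opposing probes is more informative than either raw similarity, since all scores in the experiment clustered in a narrow 0.6 to 0.7 band.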
```
agentic-memory/
├── config.py                 # API keys, model config, ChromaDB path
├── models/
│   ├── base.py               # MemoryRecord dataclass
│   └── semantic.py           # SemanticMemory (factual knowledge)
├── utils/
│   └── embeddings.py         # GeminiEmbedder — text, image, audio
├── stores/
│   ├── base.py               # Abstract BaseStore interface
│   └── semantic_store.py     # ChromaDB-backed semantic store
├── retrieval/                # (planned) Unified cross-store retriever
├── events/                   # (planned) Event bus for store/retrieve signals
├── api/
│   └── app.py                # FastAPI boundary over stores/retriever/events
├── demo/
│   └── cli.py                # CLI for storing and querying memories
├── web/                      # Next.js playground UI for memory workflows
├── experiments/
│   ├── audio_emotion_probe.py            # Cross-modal emotion probe script
│   └── audio_emotion_probe_results.md    # Full results and analysis
├── media/                    # Audio/image files (not committed)
└── research-docs/
    └── measuring-progress-toward-agi-a-cognitive-framework.pdf
```
```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

Create a `.env` file:

```
GEMINI_API_KEY=your_key_here
```
System dependency for audio chunking:
```bash
# arch
sudo pacman -S ffmpeg

# ubuntu/debian
sudo apt install ffmpeg
```

Run the Python API locally:

```bash
.venv/bin/python -m uvicorn api.app:app --reload
```

Key routes:

- POST /api/memories/semantic
- POST /api/memories/episodic/text
- POST /api/memories/episodic/file
- POST /api/retrieval/query
- GET /api/episodes/recent
- GET /api/episodes/session/{session_id}
- GET /api/episodes/time-range
- GET /api/events
- GET /api/overview
Interactive docs:
http://localhost:8000/docs
The repository now includes a Next.js playground in web/ that talks to the Python API.
```bash
cd web
npm install
NEXT_PUBLIC_MEMORY_API_BASE_URL=http://localhost:8000 npm run dev
```

For a Vercel deployment such as memory.agentclash.dev, set:

```
NEXT_PUBLIC_MEMORY_API_BASE_URL=<your deployed Python API base URL>
```

Store a fact:
```bash
python demo/cli.py store "Python was created by Guido van Rossum"
```

Run the audio emotion probe:

```bash
python experiments/audio_emotion_probe.py "song.mp3"
```

The repo also includes a deterministic offline evaluation harness for episodic memory. It runs fixed synthetic fixtures through mixed-store retrieval, temporal recall, session reconstruction, recent-event lookup, and cross-modal media-backed episodes.

Run it with:

```bash
pytest tests/test_offline_episodic_eval.py
```

or:

```bash
python tests/test_offline_episodic_eval.py
```

Benchmark mapping and rationale: docs/offline_episodic_eval.md
This project is built on the cognitive taxonomy from the DeepMind paper Measuring Progress Toward AGI. The paper distinguishes three faculties that most agent frameworks conflate:
- Memory — passive storage and retrieval (semantic facts, episodic events, procedural skills)
- Working Memory — active manipulation of information for a current goal (sits under Executive Functions, not Memory)
- Learning — acquisition and consolidation of new knowledge into long-term memory
The architecture implements these as separate systems. Phase 1 builds the memory stores. Phase 2 adds working memory (an active scratchpad) and a learning module (consolidation from working memory to long-term stores). Phase 3 adds metacognitive monitoring — the system's ability to assess confidence in its own retrieved context.
MIT