Skip to content

Commit 79da2ce

Browse files
committed
docs(readme): add quantitative SOTA benchmark tables (matched gpt-4o reader)
Replaces qualitative "differentiator" framing with apples-to-apples quantitative comparison at gpt-4o reader on LongMemEval-S Phase B N=500 plus the LongMemEval-M variant nobody else publishes: - LongMemEval-S table: AgentOS 85.6% [82.4%, 88.6%] vs EmergenceMem 86.0% (tied within CI), Mastra OM gpt-4o 84.23%, Supermemory 81.6%, Zep 71.2/63.8% reproduction. All at the same gpt-4o reader. - LongMemEval-M table: AgentOS 70.2% [66.0%, 74.0%] tied with AgentBrain closed-source 71.7%, +4.5 pp above the LongMemEval paper's academic-baseline ceiling (65.7%, Wu et al. ICLR 2025 Table 3). First open-source memory library above 65% on M with full methodology disclosure. - 12-axis methodology disclosure matrix: bootstrap CIs, judge-FPR probes, per-case run JSONs, matched-reader cross-vendor table — none of which most vendors publish. Cross-provider configurations (e.g. Mastra's 94.9% gpt-5-mini reader + gemini-2.5-flash observer) are excluded from the comparison tables because their results cannot be reproduced from public methodology disclosures.
1 parent ed2839a commit 79da2ce

1 file changed

Lines changed: 51 additions & 0 deletions

File tree

README.md

Lines changed: 51 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -87,6 +87,57 @@ The pipeline is novel because the **T0 / no-memory gate** removes retrieval enti
8787

8888
---
8989

90+
## Memory Benchmarks at Matched Reader
91+
92+
Honest, apples-to-apples comparison: same reader (`gpt-4o`), same dataset, same N=500 Phase B methodology, same `gpt-4o-2024-08-06` judge with `rubric 2026-04-18.1` (FPR 1% [0%, 3%] at n=100). Cross-provider configurations (e.g. Gemini observers) are not included because their results cannot be reproduced from public methodology disclosures.
93+
94+
### LongMemEval-S Phase B (115K tokens, 50 sessions per haystack)
95+
96+
| System (gpt-4o reader, Phase B N=500) | Accuracy | 95% CI | $/correct | p50 latency | Source |
97+
|---|---:|---|---:|---:|---|
98+
| EmergenceMem Internal | 86.0% | not published | not published | 5,650 ms | [emergence.ai](https://www.emergence.ai/blog/sota-on-longmemeval-with-rag) |
99+
| **🚀 AgentOS canonical-hybrid + reader-router** | **85.6%** | **[82.4%, 88.6%]** | **$0.0090** | **3,558 ms** | [85.6% post](https://docs.agentos.sh/blog/2026/04/28/reader-router-pareto-win) |
100+
| Mastra OM gpt-4o (gemini-flash observer) | 84.23% | not published | not published | not published | [mastra.ai](https://mastra.ai/research/observational-memory) |
101+
| Supermemory gpt-4o | 81.6% | not published | not published | not published | [supermemory.ai](https://supermemory.ai/research/) |
102+
| EmergenceMem Simple Fast (apples-to-apples in our harness) | 80.6% | measured | $0.0586 | not published | v2 vendor reproduction |
103+
| Zep self-reported / independently reproduced | 71.2% / 63.8% | not published | not published | 632 ms p95 search | [self](https://blog.getzep.com/state-of-the-art-agent-memory/) / [arXiv:2512.13564](https://arxiv.org/abs/2512.13564) |
104+
105+
**+1.4 pp accuracy over Mastra OM gpt-4o at the same reader.** Statistically tied with EmergenceMem Internal (86.0% point estimate sits inside our 95% CI [82.4%, 88.6%]). Median latency vs EmergenceMem is **1.6× faster** (3.558 s vs 5.650 s).
106+
107+
### LongMemEval-M Phase B (1.5M tokens, 500 sessions per haystack)
108+
109+
The harder variant. M's haystacks exceed every production context window (GPT-4o 128K, Claude Opus 200K, Gemini 3 Pro 1M). Most memory vendors stop at S because raw long-context fits there.
110+
111+
| System | Accuracy | 95% CI | License | Source |
112+
|---|---:|---|---|---|
113+
| AgentBrain | 71.7% (Test 0) | not published | closed-source SaaS | [github.com/AgentBrainHQ](https://github.com/AgentBrainHQ) |
114+
| **🚀 AgentOS (sem-embed + reader-router + top-K=5)** | **70.2%** | **[66.0%, 74.0%]** | **MIT** | [70.2% post](https://docs.agentos.sh/blog/2026/04/29/longmemeval-m-70-with-topk5) |
115+
| LongMemEval paper academic baseline | 65.7% | not published | open repo | [Wu et al., ICLR 2025, Table 3](https://arxiv.org/abs/2410.10813) |
116+
| Mem0 v3, Mastra OM, Hindsight, Zep, EmergenceMem, Supermemory, MemMachine, Memoria, agentmemory, Backboard, ByteRover, Letta | not published || various | reports S only |
117+
118+
**Statistically tied with AgentBrain's closed-source SaaS** (their 71.7% sits inside our CI [66.0%, 74.0%]). **+4.5 pp above the LongMemEval paper's published academic ceiling.** **First open-source memory library on the public record above 65% on M with full methodology disclosure** (bootstrap CIs, per-case run JSONs, reproducible CLI, MIT-licensed).
119+
120+
### Methodology disclosure (12 axes most vendors omit)
121+
122+
| Axis | AgentOS | Most vendors |
123+
|---|:-:|:-:|
124+
| Aggregate accuracy | yes | yes |
125+
| 95% bootstrap CI on headline | yes | no |
126+
| Per-category 95% CI | yes | no |
127+
| Reader model disclosed | yes | mostly |
128+
| Observer / ingest model disclosed | yes | mostly |
129+
| USD cost per correct | yes | no |
130+
| Latency avg / p50 / p95 | yes | rarely |
131+
| Per-category breakdown | yes | sometimes |
132+
| Open-source benchmark runner | yes | rarely |
133+
| Per-case run JSONs at fixed seed | yes | no |
134+
| Judge-adversarial FPR probe | yes (1% S, 2% M, 0% LOCOMO) | no |
135+
| Matched-reader cross-vendor table | yes | partial |
136+
137+
The full audit framework is at [Memory Benchmark Transparency Audit](https://docs.agentos.sh/blog/2026/04/24/memory-benchmark-transparency-audit). Every run referenced above ships with a per-case run JSON at `seed=42`.
138+
139+
---
140+
90141
## See It In Action
91142

92143
### 🌀 Paracosm — AI Agent Swarm Simulation

0 commit comments

Comments
 (0)