You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
docs(readme): add quantitative SOTA benchmark tables (matched gpt-4o reader)
Replaces qualitative "differentiator" framing with apples-to-apples
quantitative comparison at gpt-4o reader on LongMemEval-S Phase B N=500
plus the LongMemEval-M variant nobody else publishes:
- LongMemEval-S table: AgentOS 85.6% [82.4%, 88.6%] vs EmergenceMem
86.0% (tied within CI), Mastra OM gpt-4o 84.23%, Supermemory 81.6%,
Zep 71.2/63.8% reproduction. All at the same gpt-4o reader.
- LongMemEval-M table: AgentOS 70.2% [66.0%, 74.0%] tied with
AgentBrain closed-source 71.7%, +4.5 pp above the LongMemEval paper's
academic-baseline ceiling (65.7%, Wu et al. ICLR 2025 Table 3).
First open-source memory library above 65% on M with full
methodology disclosure.
- 12-axis methodology disclosure matrix: bootstrap CIs, judge-FPR
probes, per-case run JSONs, matched-reader cross-vendor table — none
of which most vendors publish.
Cross-provider configurations (e.g. Mastra's 94.9% gpt-5-mini reader +
gemini-2.5-flash observer) are excluded from the comparison tables
because their results cannot be reproduced from public methodology
disclosures.
Copy file name to clipboardExpand all lines: README.md
+51Lines changed: 51 additions & 0 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -87,6 +87,57 @@ The pipeline is novel because the **T0 / no-memory gate** removes retrieval enti
87
87
88
88
---
89
89
90
+
## Memory Benchmarks at Matched Reader
91
+
92
+
Honest, apples-to-apples comparison: same reader (`gpt-4o`), same dataset, same N=500 Phase B methodology, same `gpt-4o-2024-08-06` judge with `rubric 2026-04-18.1` (FPR 1% [0%, 3%] at n=100). Cross-provider configurations (e.g. Gemini observers) are not included because their results cannot be reproduced from public methodology disclosures.
93
+
94
+
### LongMemEval-S Phase B (115K tokens, 50 sessions per haystack)
95
+
96
+
| System (gpt-4o reader, Phase B N=500) | Accuracy | 95% CI | $/correct | p50 latency | Source |
97
+
|---|---:|---|---:|---:|---|
98
+
| EmergenceMem Internal | 86.0% | not published | not published | 5,650 ms |[emergence.ai](https://www.emergence.ai/blog/sota-on-longmemeval-with-rag)|
| Mastra OM gpt-4o (gemini-flash observer) | 84.23% | not published | not published | not published |[mastra.ai](https://mastra.ai/research/observational-memory)|
101
+
| Supermemory gpt-4o | 81.6% | not published | not published | not published |[supermemory.ai](https://supermemory.ai/research/)|
102
+
| EmergenceMem Simple Fast (apples-to-apples in our harness) | 80.6% | measured | $0.0586 | not published | v2 vendor reproduction |
103
+
| Zep self-reported / independently reproduced | 71.2% / 63.8% | not published | not published | 632 ms p95 search |[self](https://blog.getzep.com/state-of-the-art-agent-memory/) / [arXiv:2512.13564](https://arxiv.org/abs/2512.13564)|
104
+
105
+
**+1.4 pp accuracy over Mastra OM gpt-4o at the same reader.** Statistically tied with EmergenceMem Internal (86.0% point estimate sits inside our 95% CI [82.4%, 88.6%]). Median latency vs EmergenceMem is **1.6× faster** (3.558 s vs 5.650 s).
106
+
107
+
### LongMemEval-M Phase B (1.5M tokens, 500 sessions per haystack)
108
+
109
+
The harder variant. M's haystacks exceed every production context window (GPT-4o 128K, Claude Opus 200K, Gemini 3 Pro 1M). Most memory vendors stop at S because raw long-context fits there.
110
+
111
+
| System | Accuracy | 95% CI | License | Source |
112
+
|---|---:|---|---|---|
113
+
| AgentBrain | 71.7% (Test 0) | not published | closed-source SaaS |[github.com/AgentBrainHQ](https://github.com/AgentBrainHQ)|
| LongMemEval paper academic baseline | 65.7% | not published | open repo |[Wu et al., ICLR 2025, Table 3](https://arxiv.org/abs/2410.10813)|
116
+
| Mem0 v3, Mastra OM, Hindsight, Zep, EmergenceMem, Supermemory, MemMachine, Memoria, agentmemory, Backboard, ByteRover, Letta | not published | — | various | reports S only |
117
+
118
+
**Statistically tied with AgentBrain's closed-source SaaS** (their 71.7% sits inside our CI [66.0%, 74.0%]). **+4.5 pp above the LongMemEval paper's published academic ceiling.****First open-source memory library on the public record above 65% on M with full methodology disclosure** (bootstrap CIs, per-case run JSONs, reproducible CLI, MIT-licensed).
119
+
120
+
### Methodology disclosure (12 axes most vendors omit)
The full audit framework is at [Memory Benchmark Transparency Audit](https://docs.agentos.sh/blog/2026/04/24/memory-benchmark-transparency-audit). Every run referenced above ships with a per-case run JSON at `seed=42`.
0 commit comments