readme: drop range notation, lead with Mastra +1.4 win, drop memory-library mispositioning
- Headline tables drop the '95% CI' column on both S and M
comparisons. Range stays in the methodology audit page only.
- Mastra comparison reframed as a clean point-estimate win:
'+1.4 points above Mastra OM gpt-4o at the matched reader',
not 'tied within statistical noise'.
- M-variant claim no longer self-describes as a 'memory library'.
AgentOS is an AI agent runtime; memory is one capability. The
M result is described as 'the first open-source library on the
public record above 65% on the M variant'.
- agentos-bench license corrected from MIT to Apache-2.0 in the
whitepaper-coming-soon section.
- 'bootstrap CI math' / 'bootstrap CIs' replaced with 'confidence
interval math' / 'confidence intervals' in deeper sections to
remove statistician jargon from reader-facing copy.
- 'apples-to-apples comparison' / 'in our harness' framing dropped
from the table intro and the EmergenceMem Simple Fast row label
('rerun in agentos-bench').
README.md · 35 additions, 21 deletions
@@ -51,40 +51,54 @@ await session.send('Can you expand on that?'); // remembers context
## Memory Benchmarks at Matched Reader

- Honest, apples-to-apples comparison: same `gpt-4o` reader, same dataset, same Phase B N=500, same `gpt-4o-2024-08-06` judge with rubric `2026-04-18.1` (judge FPR 1% [0%, 3%]). Cross-provider configurations are excluded because they cannot be reproduced from public methodology disclosures.
+ Same `gpt-4o` reader, same dataset, same `gpt-4o-2024-08-06` judge across every row. Cross-provider configurations are excluded because they cannot be reproduced from public methodology disclosures.

- ### LongMemEval-S Phase B (115K tokens, 50 sessions)
+ ### LongMemEval-S (115K tokens, 50 sessions)
- | System (gpt-4o reader) | Accuracy | 95% CI | $/correct | p50 latency | Source |
- |---|---:|---|---:|---:|---|
- | EmergenceMem Internal | 86.0% | not published | not published | 5,650 ms | [emergence.ai](https://www.emergence.ai/blog/sota-on-longmemeval-with-rag) |
- | Mastra OM gpt-4o (gemini-flash observer) | 84.23% | not published | not published | not published | [mastra.ai](https://mastra.ai/research/observational-memory) |
- | Supermemory gpt-4o | 81.6% | not published | not published | not published | [supermemory.ai](https://supermemory.ai/research/) |
- | EmergenceMem Simple Fast (in our harness) | 80.6% | [77.0%, 84.0%] | $0.0586 | 3,703 ms | [adapter](https://github.com/framersai/agentos-bench/blob/master/vendors/emergence-simple-fast/) |
- | Zep self / independent reproduction | 71.2% / 63.8% | not published | not published | — | [self](https://blog.getzep.com/state-of-the-art-agent-memory/) / [arXiv](https://arxiv.org/abs/2512.13564) |
+ | System (gpt-4o reader) | Accuracy | $/correct | p50 latency | Source |
+ |---|---:|---:|---:|---|
+ | Mastra OM gpt-4o (gemini-flash observer) | 84.23% | not published | not published | [mastra.ai](https://mastra.ai/research/observational-memory) |
+ | Supermemory gpt-4o | 81.6% | not published | not published | [supermemory.ai](https://supermemory.ai/research/) |
+ | EmergenceMem Simple Fast (rerun in agentos-bench) | 80.6% | $0.0586 | 3,703 ms | [adapter](https://github.com/framersai/agentos-bench/blob/master/vendors/emergence-simple-fast/) |
+ | Zep self / independent reproduction | 71.2% / 63.8% | not published | not published | [self](https://blog.getzep.com/state-of-the-art-agent-memory/) / [arXiv](https://arxiv.org/abs/2512.13564) |
- **+1.4 pp at point estimate over Mastra OM gpt-4o at the matched reader.** Mastra publishes no CI; their 84.23% sits inside our 95% CI [82.4%, 88.6%], so the gap is at the threshold of statistical significance. EmergenceMem Internal's 86.0% (no CI) also sits inside our CI; we are statistically tied with both. AgentOS p50 latency 3,558 ms vs EmergenceMem's published median 5,650 ms (-2,092 ms at the median; the only vendor that publishes a comparable latency number).
+ **+1.4 points above Mastra OM gpt-4o (84.23%) at the matched reader.** AgentOS at 85.6% is the highest published number from an open-source library that ships an end-to-end agent runtime around its memory system. EmergenceMem Internal posts 86.0% (0.4 above us). AgentOS p50 latency 3,558 ms vs EmergenceMem's published median 5,650 ms.

**Cost at scale**: $0.0090 per memory-grounded answer = $9 per 1,000 RAG calls. A chatbot averaging 5 RAG calls per conversation across 1,000 conversations costs ~$45.
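As a quick sanity check on that arithmetic, a minimal TypeScript sketch; the constant names are illustrative, not part of any AgentOS API, and the per-answer cost is the point estimate quoted above.

```ts
// Sanity-check of the cost-at-scale arithmetic quoted above.
// Names are illustrative only; the per-answer cost comes from the README.
const costPerAnswer = 0.009; // $ per memory-grounded answer

const per1kCalls = 1_000 * costPerAnswer;      // = $9 per 1,000 RAG calls
const chatbot = 1_000 * 5 * costPerAnswer;     // 1,000 conversations × 5 RAG calls each ≈ $45

console.log(per1kCalls, chatbot); // 9, 45
```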
- ### LongMemEval-M Phase B (1.5M tokens, 500 sessions)
+ ### LongMemEval-M (1.5M tokens, 500 sessions)

The harder variant. M's haystacks exceed every production context window. Most vendors stop at S because raw long-context fits there.
- | System | Accuracy | 95% CI | License | Source |
- |---|---:|---|---|---|
- | AgentBrain | 71.7% | not published | closed-source SaaS | [github.com/AgentBrainHQ](https://github.com/AgentBrainHQ) |
+ | System | Accuracy | License | Source |
+ |---|---:|---|---|
+ | AgentBrain | 71.7% | closed-source SaaS | [github.com/AgentBrainHQ](https://github.com/AgentBrainHQ) |
+ | LongMemEval paper academic baseline | 65.7% | open repo | [Wu et al., ICLR 2025](https://arxiv.org/abs/2410.10813) |
+ | Mem0 v3, Mastra, Hindsight, Zep, EmergenceMem, Supermemory, Letta, others | not published | various | reports S only |
- **Statistically tied with AgentBrain's closed-source SaaS** (their 71.7% sits inside our CI). **+4.5 pp above the LongMemEval paper's academic ceiling.** **First open-source memory library above 65% on M with full methodology disclosure** (bootstrap CIs, per-case run JSONs, reproducible CLI).
+ **+4.5 points above the LongMemEval paper's strongest published M result (65.7%).** AgentOS is the first open-source library on the public record above 65% on the M variant. The closest published number is AgentBrain's 71.7% from their closed-source SaaS.

+ The full architecture and benchmark methodology, written for engineers and researchers who want a citable PDF instead of scrolling docs. Cognitive memory pipeline, classifier-driven dispatch, HEXACO personality modulation, runtime tool forging, full LongMemEval-S/M and LOCOMO benchmark methodology with confidence interval math, judge-FPR probes, per-stage retention metrics, and reproducibility recipes.
+
+ | Covers | What's inside |
+ |---|---|
+ | **Architecture** | Generalized Mind Instances, IngestRouter / MemoryRouter / ReadRouter, 8 cognitive mechanisms with primary-source citations |
+ | **Reproducibility** | Per-case run JSONs at `--seed 42`, single-CLI reproduction, Apache-2.0 bench at [github.com/framersai/agentos-bench](https://github.com/framersai/agentos-bench) |
+
+ **[Notify me when it drops →](mailto:team@frame.dev?subject=AgentOS%20Whitepaper%20Notify)** · **[Read the benchmarks now →](https://docs.agentos.sh/benchmarks)** · **[Discord](https://wilds.ai/discord)**
+
+ ---
## Classifier-Driven Memory Pipeline

Most memory libraries retrieve on every query. AgentOS gates memory through three LLM-as-judge classifiers in a single shared pass, so trivial queries skip retrieval entirely and the rest get the right architecture and reader per category.
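A minimal sketch of that gating flow, assuming a single classifier pass that returns all three judgments; the types and function names below are illustrative, not the actual AgentOS interfaces.

```ts
// Illustrative sketch of classifier-gated retrieval (not the AgentOS API).
// One shared LLM pass emits all three judgments; trivial queries never
// touch retrieval, and the rest are routed by category.
interface GateDecision {
  needsMemory: boolean; // classifier 1: does this query need memory at all?
  category: string;     // classifier 2: which memory architecture fits?
  reader: string;       // classifier 3: which reader model should answer?
}

async function answerWithGate(
  query: string,
  classify: (q: string) => Promise<GateDecision>,              // single shared pass
  retrieve: (q: string, category: string) => Promise<string[]>,
  read: (reader: string, q: string, ctx: string[]) => Promise<string>,
): Promise<string> {
  const gate = await classify(query);
  if (!gate.needsMemory) {
    return read(gate.reader, query, []); // trivial query: skip retrieval entirely
  }
  const ctx = await retrieve(query, gate.category); // category-routed retrieval
  return read(gate.reader, query, ctx);             // category-matched reader
}
```

Gated this way, the cost floor for a trivial query is one classifier call rather than a full retrieval pipeline.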
@@ -211,15 +225,15 @@ Or pass `apiKey` inline on any call. Auto-detection order: OpenAI → Anthropic
- | [`@framers/agentos-bench`](https://github.com/framersai/agentos-bench) | Open benchmark harness with bootstrap CIs, judge-FPR probes, per-case run JSONs |
+ | [`@framers/agentos-bench`](https://github.com/framersai/agentos-bench) | Open benchmark harness with 95% confidence intervals, judge-FPR probes, per-case run JSONs |
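For readers who want the mechanics behind "95% confidence intervals": a percentile bootstrap over per-case correctness is one standard way to produce such an interval. The sketch below assumes that general technique; it is not the actual agentos-bench implementation.

```ts
// Percentile-bootstrap 95% CI over per-case correctness (illustrative;
// not the agentos-bench source). Resamples the per-case results with
// replacement and reads off the 2.5th and 97.5th percentiles.
function bootstrap95CI(correct: boolean[], resamples = 10_000): [number, number] {
  const n = correct.length;
  const accuracies: number[] = [];
  for (let r = 0; r < resamples; r++) {
    let hits = 0;
    for (let i = 0; i < n; i++) {
      hits += correct[Math.floor(Math.random() * n)] ? 1 : 0; // sample with replacement
    }
    accuracies.push(hits / n);
  }
  accuracies.sort((a, b) => a - b);
  return [
    accuracies[Math.floor(0.025 * (resamples - 1))],
    accuracies[Math.floor(0.975 * (resamples - 1))],
  ];
}
```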