docs: streamline Memory Benchmarks section, keep every metric
Same numbers, ~40% less prose. Cuts:
- Section title shortened ("Memory Benchmarks at Matched Reader" ->
"Memory Benchmarks (matched reader)").
- "Source" column removed from both tables; the "Full leaderboard ->"
callout below the M table consolidates source links into one
affordance, and the LongMemEval paper now lives there as a single
link instead of three repeats in a table column.
- Per-row source links were doing nothing the leaderboard link
doesn't already do; removing them tightens the tables to 4
columns on S and 3 columns on M without losing data.
- Two paragraphs of S-section commentary collapsed into one. The
cross-provider exclusion list (Mastra gpt-5-mini 94.87%,
agentmemory 96.2%, MemMachine 93.0%, Hindsight 91.4%) compressed
to a single line.
- "Cost at scale" calculation paragraph dropped (the $/correct
number is in the table; the back-of-envelope $9 / $45 numbers
were duplicating what the headline already conveys).
- M section's "Competitive with..." paragraph compressed to a
single sentence with all three paper anchors (65.7 / 71.4 / 72.0)
inline.
Every number in the original section is still present:
- LongMemEval-S: 85.6% / $0.0090 / 3,558 ms ; comparators 86.0,
84.23, 81.6, 80.6 / $0.0586 / 3,703 ms, 71.2 / 63.8.
- Cross-provider: 94.87, 96.2, 93.0, 91.4.
- LongMemEval-M: 70.2% ; comparators 72.0, 71.7, 71.4, 65.7.
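The dropped "Cost at scale" paragraph was pure arithmetic on the $/correct figure that stays in the table; a minimal sanity check of the $9 / $45 numbers it carried (plain Python, no assumptions beyond the figures listed above):

```python
# Back-of-envelope check of the dropped "Cost at scale" paragraph.
# The $/correct figure itself remains in the LongMemEval-S table.
cost_per_answer = 0.0090                      # dollars per memory-grounded answer

per_1k_calls = cost_per_answer * 1_000        # "$9 per 1,000 RAG calls"
chatbot_cost = cost_per_answer * 5 * 1_000    # 5 RAG calls/conversation x 1,000 conversations

print(round(per_1k_calls))   # 9
print(round(chatbot_cost))   # 45  -> the "~$45" figure
```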
README.md: 23 additions & 25 deletions
@@ -94,43 +94,41 @@ Durable memory, a tool surface that grows within a session, and optional persona
 
 ---
 
-## Memory Benchmarks at Matched Reader
+## Memory Benchmarks (matched reader)
 
-Same `gpt-4o` reader, same dataset, same `gpt-4o-2024-08-06` judge across every row. Cross-provider configurations are excluded because they cannot be reproduced from public methodology disclosures.
+`gpt-4o` reader, `gpt-4o-2024-08-06` judge, full N=500 across every row. Cross-provider numbers are excluded from the tables because their public methodology disclosures don't admit reproduction.
-| Mastra OM gpt-4o (gemini-flash observer) | 84.23% | not published | not published |[mastra.ai](https://mastra.ai/research/observational-memory)|
-| Supermemory gpt-4o | 81.6% | not published | not published |[supermemory.ai](https://supermemory.ai/research/)|
-| EmergenceMem Simple Fast (rerun in agentos-bench) | 80.6% | $0.0586 | 3,703 ms |[adapter](https://github.com/framersai/agentos-bench/blob/master/vendors/emergence-simple-fast/)|
-| Zep self / independent reproduction | 71.2% / 63.8% | not published | not published |[self](https://blog.getzep.com/state-of-the-art-agent-memory/) / [arXiv](https://arxiv.org/abs/2512.13564)|
+| System | Accuracy | $/correct | p50 latency |
+|---|---:|---:|---:|
+| EmergenceMem Internal | 86.0% | not published | 5,650 ms |
+| Mastra OM gpt-4o (gemini-flash observer) | 84.23% | not published | not published |
+| Supermemory gpt-4o | 81.6% | not published | not published |
+| EmergenceMem Simple Fast (rerun in agentos-bench) | 80.6% | $0.0586 | 3,703 ms |
+| Zep (self / independent reproduction) | 71.2% / 63.8% | not published | not published |
 
-**+1.4 points above Mastra OM gpt-4o (84.23%) at the matched reader.** Among open-source memory libraries that publish at gpt-4o with publicly reproducible runs (per-case run JSONs at fixed seed, single-CLI reproduction), AgentOS at 85.6% is the highest published number. EmergenceMem Internal posts 86.0% (0.4 above us) but does not publish per-case results or a reproducible CLI. AgentOS p50 latency 3,558 ms vs EmergenceMem's published median 5,650 ms.
++1.4 points above Mastra OM at matched reader. EmergenceMem Internal posts 86.0% (0.4 above) but doesn't publish per-case results or a reproducible CLI; among open-source libraries with single-CLI reproduction at `gpt-4o`, 85.6% is the highest publicly reproducible number on record. p50 latency 3,558 ms vs EmergenceMem's published median 5,650 ms.
 
-Notes on cross-provider numbers excluded from this table: Mastra also publishes 94.87% with a gpt-5-mini reader plus gemini-2.5-flash observer (cross-provider); agentmemory publishes 96.2% with a Claude Opus 4.6 reader; MemMachine publishes 93.0% with a GPT-5-mini reader; Hindsight publishes 91.4% with an unspecified stronger backbone. None of these are at the matched gpt-4o reader, and most do not publish full methodology details (judge model, dataset version, per-case results, single-CLI reproduction).
-
-**Cost at scale**: $0.0090 per memory-grounded answer = $9 per 1,000 RAG calls. A chatbot averaging 5 RAG calls per conversation across 1,000 conversations costs ~$45.
+Cross-provider numbers omitted from the table (different reader and/or undisclosed judge): Mastra OM 94.87% (gpt-5-mini + gemini-2.5-flash observer), agentmemory 96.2% (Claude Opus 4.6), MemMachine 93.0% (GPT-5-mini), Hindsight 91.4% (unspecified backbone).
 
 ### LongMemEval-M (1.5M tokens, 500 sessions)
 
-The harder variant. M's haystacks exceed every production context window. Most vendors stop at S because raw long-context fits there.
+M's haystacks exceed every production context window; most vendors only publish on S.
 
-| System | Accuracy | License | Source |
-|---|---:|---|---|
-| LongMemEval paper, strongest GPT-4o (round, Top-10) | 72.0% | open repo |[Wu et al., ICLR 2025, Table 3](https://arxiv.org/abs/2410.10813)|
+| Mem0 v3, Mastra, Hindsight, Zep, EmergenceMem, Supermemory, Letta | not published | — |
 
-**Competitive with the strongest published M results in the LongMemEval paper.** At matched Top-5 retrieval, AgentOS at 70.2% is +4.5 points above the round-level configuration (65.7%) and 1.2 points below the session-level configuration (71.4%); the paper's strongest GPT-4o result overall is 72.0% at round-level Top-10. Among open-source memory libraries with publicly reproducible runs (per-case run JSONs at fixed seed, single-CLI reproduction), AgentOS is the only one on the public record above 65% on M.
+At matched Top-5 retrieval, +4.5 above the round-level paper baseline (65.7%) and 1.2 below the session-level (71.4%); the paper's overall strongest GPT-4o result is 72.0% at Top-10. Of open-source libraries with publicly reproducible runs, AgentOS is the only one above 65% on M.
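The point-gap phrasing in the rewritten prose reduces to subtractions over numbers already in the diff; a quick check (plain Python, all figures taken from the tables and paragraphs above):

```python
# Verify the quoted point gaps against the raw accuracy figures.
s_delta = round(85.6 - 84.23, 2)   # AgentOS vs Mastra OM on LongMemEval-S
m_up    = round(70.2 - 65.7, 1)    # AgentOS vs round-level paper baseline on M
m_down  = round(71.4 - 70.2, 1)    # session-level paper configuration vs AgentOS

print(s_delta, m_up, m_down)       # 1.37 4.5 1.2
```

The S-table gap is 1.37 points, which the prose rounds to "+1.4"; the M gaps come out exactly as stated (+4.5 and 1.2).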