
Commit 05abd08

docs: streamline Memory Benchmarks section, keep every metric
Same numbers, ~40% less prose. Cuts:

- Section title shortened ("Memory Benchmarks at Matched Reader" -> "Memory Benchmarks (matched reader)").
- "Source" column removed from both tables; the "Full leaderboard ->" callout below the M table consolidates source links into one affordance, and the LongMemEval paper now lives there as a single link instead of three repeats in a table column.
- Per-row source links were doing nothing the leaderboard link doesn't already do; removing them tightens the tables to 4 columns on S and 3 columns on M without losing data.
- Two paragraphs of S-section commentary collapsed into one. The cross-provider exclusion list (Mastra gpt-5-mini 94.87%, agentmemory 96.2%, MemMachine 93.0%, Hindsight 91.4%) compressed to a single line.
- "Cost at scale" calculation paragraph dropped (the $/correct number is in the table; the back-of-envelope $9 / $45 numbers were duplicating what the headline already conveys).
- M section's "Competitive with..." paragraph compressed to a single sentence with all three paper anchors (65.7 / 71.4 / 72.0) inline.

Every number in the original section is still present:

- LongMemEval-S: 85.6% / $0.0090 / 3,558 ms; comparators 86.0, 84.23, 81.6, 80.6 / $0.0586 / 3,703 ms, 71.2 / 63.8.
- Cross-provider: 94.87, 96.2, 93.0, 91.4.
- LongMemEval-M: 70.2%; comparators 72.0, 71.7, 71.4, 65.7.
1 parent d537242 commit 05abd08

1 file changed: README.md (23 additions & 25 deletions)
@@ -94,43 +94,41 @@ Durable memory, a tool surface that grows within a session, and optional persona

 ---

-## Memory Benchmarks at Matched Reader
+## Memory Benchmarks (matched reader)

-Same `gpt-4o` reader, same dataset, same `gpt-4o-2024-08-06` judge across every row. Cross-provider configurations are excluded because they cannot be reproduced from public methodology disclosures.
+`gpt-4o` reader, `gpt-4o-2024-08-06` judge, full N=500 across every row. Cross-provider numbers are excluded from the tables because their public methodology disclosures don't admit reproduction.

 ### LongMemEval-S (115K tokens, 50 sessions)

-| System (gpt-4o reader) | Accuracy | $/correct | p50 latency | Source |
-|---|---:|---:|---:|---|
-| EmergenceMem Internal | 86.0% | not published | 5,650 ms | [emergence.ai](https://www.emergence.ai/blog/sota-on-longmemeval-with-rag) |
-| **🚀 AgentOS canonical-hybrid + reader-router** | **85.6%** | **$0.0090** | **3,558 ms** | [post](https://docs.agentos.sh/blog/2026/04/28/reader-router-pareto-win) |
-| Mastra OM gpt-4o (gemini-flash observer) | 84.23% | not published | not published | [mastra.ai](https://mastra.ai/research/observational-memory) |
-| Supermemory gpt-4o | 81.6% | not published | not published | [supermemory.ai](https://supermemory.ai/research/) |
-| EmergenceMem Simple Fast (rerun in agentos-bench) | 80.6% | $0.0586 | 3,703 ms | [adapter](https://github.com/framersai/agentos-bench/blob/master/vendors/emergence-simple-fast/) |
-| Zep self / independent reproduction | 71.2% / 63.8% | not published | not published | [self](https://blog.getzep.com/state-of-the-art-agent-memory/) / [arXiv](https://arxiv.org/abs/2512.13564) |
+| System | Accuracy | $/correct | p50 latency |
+|---|---:|---:|---:|
+| EmergenceMem Internal | 86.0% | not published | 5,650 ms |
+| **AgentOS** (canonical-hybrid + reader-router) | **85.6%** | **$0.0090** | **3,558 ms** |
+| Mastra OM gpt-4o (gemini-flash observer) | 84.23% | not published | not published |
+| Supermemory gpt-4o | 81.6% | not published | not published |
+| EmergenceMem Simple Fast (rerun in agentos-bench) | 80.6% | $0.0586 | 3,703 ms |
+| Zep (self / independent reproduction) | 71.2% / 63.8% | not published | not published |

-**+1.4 points above Mastra OM gpt-4o (84.23%) at the matched reader.** Among open-source memory libraries that publish at gpt-4o with publicly reproducible runs (per-case run JSONs at fixed seed, single-CLI reproduction), AgentOS at 85.6% is the highest published number. EmergenceMem Internal posts 86.0% (0.4 above us) but does not publish per-case results or a reproducible CLI. AgentOS p50 latency 3,558 ms vs EmergenceMem's published median 5,650 ms.
++1.4 points above Mastra OM at matched reader. EmergenceMem Internal posts 86.0% (0.4 above) but doesn't publish per-case results or a reproducible CLI; among open-source libraries with single-CLI reproduction at `gpt-4o`, 85.6% is the highest publicly reproducible number located. p50 latency 3,558 ms vs EmergenceMem's published median 5,650 ms.

-Notes on cross-provider numbers excluded from this table: Mastra also publishes 94.87% with a gpt-5-mini reader plus gemini-2.5-flash observer (cross-provider); agentmemory publishes 96.2% with a Claude Opus 4.6 reader; MemMachine publishes 93.0% with a GPT-5-mini reader; Hindsight publishes 91.4% with an unspecified stronger backbone. None of these are at the matched gpt-4o reader, and most do not publish full methodology details (judge model, dataset version, per-case results, single-CLI reproduction).
-
-**Cost at scale**: $0.0090 per memory-grounded answer = $9 per 1,000 RAG calls. A chatbot averaging 5 RAG calls per conversation across 1,000 conversations costs ~$45.
+Cross-provider numbers omitted from the table (different reader and/or undisclosed judge): Mastra OM 94.87% (gpt-5-mini + gemini-2.5-flash observer), agentmemory 96.2% (Claude Opus 4.6), MemMachine 93.0% (GPT-5-mini), Hindsight 91.4% (unspecified backbone).

 ### LongMemEval-M (1.5M tokens, 500 sessions)

-The harder variant. M's haystacks exceed every production context window. Most vendors stop at S because raw long-context fits there.
+M's haystacks exceed every production context window; most vendors only publish on S.

-| System | Accuracy | License | Source |
-|---|---:|---|---|
-| LongMemEval paper, strongest GPT-4o (round, Top-10) | 72.0% | open repo | [Wu et al., ICLR 2025, Table 3](https://arxiv.org/abs/2410.10813) |
-| AgentBrain | 71.7% | closed-source SaaS | [github.com/AgentBrainHQ](https://github.com/AgentBrainHQ) |
-| LongMemEval paper, strongest GPT-4o at Top-5 (session) | 71.4% | open repo | [Wu et al., ICLR 2025, Table 3](https://arxiv.org/abs/2410.10813) |
-| **🚀 AgentOS** (sem-embed + reader-router + top-K=5) | **70.2%** | **Apache-2.0** | [post](https://docs.agentos.sh/blog/2026/04/29/longmemeval-m-70-with-topk5) |
-| LongMemEval paper, GPT-4o at Top-5 (round) | 65.7% | open repo | [Wu et al., ICLR 2025, Table 3](https://arxiv.org/abs/2410.10813) |
-| Mem0 v3, Mastra, Hindsight, Zep, EmergenceMem, Supermemory, Letta, others | not published | various | reports S only |
+| System | Accuracy | License |
+|---|---:|---|
+| LongMemEval paper, GPT-4o round Top-10 (paper's best) | 72.0% | open repo |
+| AgentBrain | 71.7% | closed-source SaaS |
+| LongMemEval paper, GPT-4o session Top-5 | 71.4% | open repo |
+| **AgentOS** (sem-embed + reader-router + Top-5) | **70.2%** | **Apache-2.0** |
+| LongMemEval paper, GPT-4o round Top-5 | 65.7% | open repo |
+| Mem0 v3, Mastra, Hindsight, Zep, EmergenceMem, Supermemory, Letta | not published | |

-**Competitive with the strongest published M results in the LongMemEval paper.** At matched Top-5 retrieval, AgentOS at 70.2% is +4.5 points above the round-level configuration (65.7%) and 1.2 points below the session-level configuration (71.4%); the paper's strongest GPT-4o result overall is 72.0% at round-level Top-10. Among open-source memory libraries with publicly reproducible runs (per-case run JSONs at fixed seed, single-CLI reproduction), AgentOS is the only one on the public record above 65% on M.
+At matched Top-5 retrieval, +4.5 above the round-level paper baseline (65.7%) and 1.2 below the session-level (71.4%); the paper's overall strongest GPT-4o result is 72.0% at Top-10. Of open-source libraries with publicly reproducible runs, AgentOS is the only one above 65% on M.

-> **[Full benchmarks page →](https://github.com/framersai/agentos-bench/blob/master/results/LEADERBOARD.md)** · **[Reproducible run JSONs →](https://github.com/framersai/agentos-bench/tree/master/results/runs)** · **[Methodology audit →](https://agentos.sh/en/blog/agentos-memory-sota-longmemeval/)**
+> **[Full leaderboard →](https://github.com/framersai/agentos-bench/blob/master/results/LEADERBOARD.md)** · **[Run JSONs →](https://github.com/framersai/agentos-bench/tree/master/results/runs)** · **[Methodology audit →](https://agentos.sh/en/blog/agentos-memory-sota-longmemeval/)** · **[LongMemEval paper](https://arxiv.org/abs/2410.10813)** (Wu et al., ICLR 2025, Table 3)

 ---

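The "Cost at scale" paragraph dropped by this commit was pure arithmetic on the table's $/correct figure. A minimal sketch of that back-of-envelope math, using only numbers from the diff above (the 5-calls-per-conversation chatbot is the original paragraph's own illustrative workload):

```python
# Back-of-envelope cost math from the removed "Cost at scale" paragraph.
cost_per_correct = 0.0090  # USD per memory-grounded answer (AgentOS, LongMemEval-S row)

# $0.0090 per answer => $9 per 1,000 RAG calls.
print(f"${cost_per_correct * 1_000:.2f} per 1,000 RAG calls")  # $9.00

# A chatbot averaging 5 RAG calls per conversation across 1,000 conversations: ~$45.
calls_per_conversation, conversations = 5, 1_000
print(f"~${cost_per_correct * calls_per_conversation * conversations:.0f} total")  # ~$45
```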
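Both commentary paragraphs lean on "publicly reproducible runs (per-case run JSONs at fixed seed, single-CLI reproduction)". A sketch of what recomputing a headline accuracy from those per-case run JSONs could look like, assuming a hypothetical one-record-per-question schema with a boolean judge verdict (the actual field names in agentos-bench's `results/runs` are not specified in this diff):

```python
import json
from pathlib import Path

def accuracy(run_dir: str) -> float:
    """Recompute accuracy from per-case run JSONs (one file per question).

    NOTE: "judge_correct" is a hypothetical field name used for illustration;
    consult the agentos-bench repo for the real record schema.
    """
    verdicts = [
        bool(json.loads(p.read_text())["judge_correct"])
        for p in sorted(Path(run_dir).glob("*.json"))
    ]
    return sum(verdicts) / len(verdicts)

# e.g. a LongMemEval-S run directory (hypothetical path) should yield 0.856
# for the 85.6% row: accuracy("results/runs/longmemeval-s-agentos")
```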