Aurra Memory Benchmarks

Benchmarks comparing memory infrastructure systems for AI agents on the LoCoMo long-term conversational memory dataset.

Background

We built Aurra as a memory layer for AI agents and wanted to compare it honestly against existing systems. This repo contains the benchmark code, results, and methodology.

See the writeup: Mem0 thinks our 2023 conversation happened in 2026 (blog link — update before publishing)

April 29, 2026 baseline

| Metric | Aurra | Mem0 |
| --- | --- | --- |
| Total memories captured | 2,685 | 780 (capped at 100/conv by free-tier API) |
| Memories with fabricated dates | 0 (0.00%) | 179 (22.95%) |
| Judge-rated useful | 42.4% | 28.2% |
| Judge-rated hallucinated | 55.3% | 64.5% |
| Judge-rated junk | 2.6% | 5.9% |
| Judge-rated misattributed | 1.7% | 7.2% |

LoCoMo conversations are dated 2023; Mem0 stamps most memories with 2026, the date of the benchmark run. See results/date_audit.txt for examples and results/DAY1_SUMMARY.md for full details.
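
For reference, a minimal sketch of the kind of check the date audit performs, assuming memories are exported as JSON records with a system-assigned `date` field (field names are illustrative, not the repo's actual schema):

```python
# Illustrative date audit: flag memories stamped with a year other than the
# conversation year. The "date" field name is an assumption about the export
# format, not the repo's actual schema.
import json
import re

CONVERSATION_YEAR = "2023"  # LoCoMo conversations take place in 2023

def audit_dates(path: str) -> None:
    with open(path) as f:
        memories = json.load(f)

    fabricated = [
        m for m in memories
        if any(
            year != CONVERSATION_YEAR
            for year in re.findall(r"\b(20\d{2})\b", str(m.get("date", "")))
        )
    ]

    pct = 100 * len(fabricated) / max(len(memories), 1)
    print(f"{len(fabricated)}/{len(memories)} memories ({pct:.2f}%) dated outside {CONVERSATION_YEAR}")
```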

Caveat on judge scoring: the LLM judge scores against LoCoMo's event_summary ground truth, which is brief and incomplete. Memories about real but unsummarized content get flagged as hallucinated, so absolute numbers are inflated for both systems. Relative comparison is the meaningful signal.

What this measures

Two metrics, both with LLM-as-judge scoring:

  1. Memory quality (junk rate) — Of the memories a system stores from a conversation, what percentage are useful vs hallucinated vs duplicates vs junk?
  2. Answer accuracy — Given questions about the conversation, can the system retrieve the right memories to answer correctly?
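
As a rough illustration of the junk-rate metric, here is a minimal sketch of an LLM-as-judge classification call. The label set mirrors the categories reported above, but the prompt wording and model string are placeholders, not the actual scripts in scoring/:

```python
# Minimal LLM-as-judge sketch: classify one stored memory against the
# conversation's ground-truth event summary. Prompt and model name are
# illustrative only.
import anthropic

LABELS = ["useful", "hallucinated", "junk", "misattributed", "duplicate"]

def judge_memory(memory: str, event_summary: str) -> str:
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    prompt = (
        "Ground-truth summary of the conversation:\n"
        f"{event_summary}\n\n"
        f"Stored memory:\n{memory}\n\n"
        f"Classify the memory as exactly one of: {', '.join(LABELS)}. "
        "Answer with the label only."
    )
    response = client.messages.create(
        model="claude-opus-4-1",  # placeholder for the Claude Opus judge
        max_tokens=10,
        messages=[{"role": "user", "content": prompt}],
    )
    label = response.content[0].text.strip().lower()
    return label if label in LABELS else "junk"
```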

Systems tested

  • Mem0 (cloud, free tier) — mem0ai Python SDK
  • Aurra — direct API calls
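
For orientation, a hedged sketch of what per-session ingestion looks like on the Mem0 side with the mem0ai platform client (the session/turn shape is simplified; the Aurra side is plain HTTP against its own API and is not shown here; see run_ingestion.py for the real loop):

```python
# Sketch of ingesting one LoCoMo session into Mem0 via the mem0ai platform
# client. Turn structure is simplified for illustration.
from mem0 import MemoryClient

client = MemoryClient()  # reads MEM0_API_KEY from the environment

def ingest_session(session_turns, conversation_id: str) -> None:
    # Each turn is assumed to be a {"speaker": ..., "text": ...} dict here.
    messages = [
        {"role": "user", "content": f'{turn["speaker"]}: {turn["text"]}'}
        for turn in session_turns
    ]
    # user_id scopes memories to one conversation, mirroring the per-tenant
    # isolation described in the methodology notes.
    client.add(messages, user_id=conversation_id)
```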

Reproducing

git clone https://github.com/aurra-memory/benchmarks
cd benchmarks
pip install -r requirements.txt

cp .env.example .env
# Edit .env to add MEM0_API_KEY, ANTHROPIC_API_KEY, AURRA_API_KEY

# 1. Run ingestion for both systems
python3 run_ingestion.py both

# 2. Score memories for junk rate
python3 scoring/junk_classifier.py both

# 3. Run Q&A accuracy
python3 run_qa_accuracy.py both

# 4. Generate charts
python3 scoring/generate_charts.py

Methodology notes

  • 10 conversations, 5,882 turns, 272 sessions (full LoCoMo10 dataset)
  • Each conversation runs under an isolated tenant_id to prevent cross-contamination
  • Q&A: stratified random sample of 500 questions across categories 1-4 (excluding adversarial cat 5)
  • Judge model: Claude Opus
  • Mem0 ingests asynchronously; we wait 120s after each conversation and add a 30s throttle between sessions to avoid free-tier rate limiting
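
A minimal sketch of the pacing just described (120s post-conversation wait, 30s inter-session throttle); ingest_session stands in for whichever system's ingestion call is under test, and the conversation dict shape is a placeholder:

```python
# Illustrative pacing loop for asynchronous Mem0 ingestion; see
# run_ingestion.py for the real implementation.
import time

POST_CONVERSATION_WAIT_S = 120  # let asynchronous extraction finish
INTER_SESSION_THROTTLE_S = 30   # stay under free-tier rate limits

def ingest_conversation(conversation, ingest_session) -> None:
    tenant_id = conversation["id"]  # one isolated tenant per conversation
    for session in conversation["sessions"]:
        ingest_session(session, tenant_id)
        time.sleep(INTER_SESSION_THROTTLE_S)
    time.sleep(POST_CONVERSATION_WAIT_S)
```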

Caveats

  • Mem0 results reflect the free tier, which caps memories at 100 per conversation; the paid tier may behave differently
  • Mem0's extraction LLM is closed-source; Aurra uses Claude Opus
  • LoCoMo conversations are synthetic
  • LLM-as-judge introduces grader variance

License

MIT

Citation

@inproceedings{maharana2024locomo,
  title={Evaluating Very Long-Term Conversational Memory of LLM Agents},
  author={Maharana, Adyasha and Lee, Dong-Ho and Tulyakov, Sergey and Bansal, Mohit and Barbieri, Francesco and Fang, Yuwei},
  booktitle={ACL},
  year={2024}
}
