Memory Probe

📄 Paper: Diagnosing Retrieval vs. Utilization Bottlenecks in LLM Agent Memory

A diagnostic framework that tests whether LLM memory agents actually use the memories they retrieve. It evaluates three memory strategies on the LOCOMO dataset, using LLM-as-judge probes to measure retrieval relevance and memory utilization and to classify failures.
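As a rough illustration of what a utilization probe looks like, the sketch below builds a judge prompt that asks whether an agent's answer actually drew on the retrieved memories. The function name, labels, and prompt wording are hypothetical, not the repo's actual prompts.

```python
# Hypothetical sketch of an LLM-as-judge utilization probe.
# The prompt text and verdict labels are illustrative, not the repo's own.
def build_utilization_probe(question, retrieved_memories, agent_answer):
    """Build a judge prompt asking whether the answer used the retrieved memories."""
    memories = "\n".join(f"- {m}" for m in retrieved_memories)
    return (
        "You are grading a memory-augmented agent.\n"
        f"Question: {question}\n"
        f"Retrieved memories:\n{memories}\n"
        f"Agent answer: {agent_answer}\n"
        "Did the answer use information from the retrieved memories? "
        "Reply with exactly one of: USED, IGNORED, NOT_RETRIEVED."
    )

probe = build_utilization_probe(
    "Where did Alice move in May?",
    ["Alice said she moved to Austin in May."],
    "Alice moved to Austin.",
)
```

The returned string would then be sent to the judge model, whose one-word verdict separates retrieval failures (relevant memory never surfaced) from utilization failures (memory surfaced but ignored).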

Strategies

  • Default RAG — stores raw conversation chunks (3 turns each), no LLM at write time
  • Extracted Facts — LLM extracts structured facts per session with conflict resolution (A-MEM / Mem0 style)
  • Summarized Episodes — LLM summarizes each session into one entry (MemGPT style)
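The three write-time strategies can be sketched as follows. This is a minimal illustration, assuming a placeholder `call_llm` client; the real implementations live in the repo and may differ in prompts and storage details.

```python
# Hypothetical sketch of the three write-time memory strategies.
def call_llm(prompt):
    # Stubbed here so the sketch runs offline; the repo uses a real LLM client.
    return f"[LLM output for: {prompt[:40]}...]"

def default_rag_write(turns, chunk_size=3):
    """Default RAG: store raw conversation chunks of `chunk_size` turns,
    with no LLM call at write time."""
    return [" ".join(turns[i:i + chunk_size])
            for i in range(0, len(turns), chunk_size)]

def extracted_facts_write(session_text):
    """Extracted Facts: LLM extracts structured facts per session,
    resolving conflicts with existing memory (A-MEM / Mem0 style)."""
    return call_llm(f"Extract atomic facts, resolving conflicts:\n{session_text}")

def summarized_episode_write(session_text):
    """Summarized Episodes: LLM compresses the whole session
    into a single memory entry (MemGPT style)."""
    return call_llm(f"Summarize this session into one episode:\n{session_text}")
```

The key contrast is where LLM compute is spent: Default RAG pays nothing at write time, while the other two trade write-time LLM calls for denser, smaller memory stores.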

Setup

pip install -r requirements.txt
export OPENAI_API_KEY=your_key_here

Usage

# Pilot run (5 questions, 1 strategy)
python run_experiment.py --pilot --strategy basic_rag

# Full experiment (all strategies, all conversations)
python run_experiment.py

# Top-k ablation
python run_experiment.py --top-k 3 5 10

# Single strategy with custom workers
python run_experiment.py --strategy extracted_facts --workers 10

# Analyze results
python analyze_results.py results/results_TIMESTAMP.json
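If you want to post-process a results file yourself rather than use `analyze_results.py`, a minimal sketch might look like the following. The JSON schema assumed here (a list of per-question records with `strategy` and `judge_correct` fields) is hypothetical; the actual schema is whatever `run_experiment.py` writes.

```python
# Hypothetical sketch of aggregating a results file by strategy.
# Assumes records like {"strategy": "...", "judge_correct": true}; the
# real schema is defined by run_experiment.py and may differ.
import json
from collections import defaultdict

def accuracy_by_strategy(path):
    """Return {strategy: fraction of questions judged correct}."""
    with open(path) as f:
        records = json.load(f)
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["strategy"]] += 1
        correct[r["strategy"]] += int(r["judge_correct"])
    return {s: correct[s] / total[s] for s in total}
```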

Citation

If you use this work in your research, please cite:

@misc{yuan2026diagnosingretrievalvsutilization,
      title={Diagnosing Retrieval vs. Utilization Bottlenecks in LLM Agent Memory}, 
      author={Boqin Yuan and Yue Su and Kun Yao},
      year={2026},
      eprint={2603.02473},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2603.02473}, 
}
