LLMs have no memory. The context window is a register file, not RAM — we need a real memory layer.
LMM is a research prototype that explores a radical idea: what if a small language model could serve as the memory substrate for a larger working model? Instead of stuffing everything into the context window or maintaining flat markdown files, LMM compresses interaction history into model weights and generates contextual memory on demand.
Read the full essay: LMM: When Large Language Models Learn to Remember
Current LLM systems have a fundamental architectural gap. In computer architecture terms:
- Context window = CPU registers (fast, finite, temporary)
- CLAUDE.md / .cursorrules = crib notes taped to the monitor
- Re-scanning codebases = reading the entire hard drive into registers every time
What's missing is the memory hierarchy — L1/L2 cache, RAM, virtual memory, and the OS that manages data flow between layers.
LMM explores one piece of this: a small model (1.7B parameters) that ingests your interaction history and generates relevant context on demand. Your memory lives in the model's weights, runs locally, and belongs to you — not to any model provider.
This is the first implementation: fine-tuning Qwen3-1.7B on Claude Code session data to generate contextual memory. It includes:
- Data pipeline: Parse Claude Code sessions → extract cognitions → build training pairs
- Two-stage training: Continual Pre-Training (CPT) on raw documents + Supervised Fine-Tuning (SFT) on context→memory pairs
- Evaluation framework: Keyword coverage metrics, diversity analysis, per-category breakdown
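To make the pair-building step concrete, here is a minimal sketch of capping how many training pairs may share one memory (the idea behind the `--max-per-memory` flag). The `{"context": ..., "memory": ...}` record shape and the `build_pairs` function are illustrative assumptions, not the repo's actual schema, which lives in `scripts/03_build_training_data.py`.

```python
from collections import defaultdict

def build_pairs(cognitions, max_per_memory=3):
    """Pair context chunks with extracted memories, capping repeats.

    `cognitions` is a list of {"context": str, "memory": str} dicts --
    a simplified stand-in for what the real pipeline produces.
    """
    seen = defaultdict(int)
    pairs = []
    for item in cognitions:
        memory = item["memory"].strip()
        # Cap how many training pairs may share one memory, so a few
        # frequent memories cannot dominate the dataset (the "parrot
        # problem" observed in Run 3).
        if seen[memory] >= max_per_memory:
            continue
        seen[memory] += 1
        pairs.append({"input": item["context"], "target": memory})
    return pairs

cognitions = [
    {"context": "session chunk A", "memory": "project uses Python"},
    {"context": "session chunk B", "memory": "project uses Python"},
    {"context": "session chunk C", "memory": "project uses Python"},
    {"context": "session chunk D", "memory": "project uses Python"},
    {"context": "session chunk E", "memory": "user prefers pytest"},
]
pairs = build_pairs(cognitions, max_per_memory=3)
print(len(pairs))  # 4: the repeated memory is capped at 3
```

Without the cap, every occurrence of a dominant memory becomes a training pair, which is exactly how a handful of memories came to cover 49% of Run 3's data.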
Four rounds of experiments revealed both the promise and limits of pure generation:
| Run | Approach | Val Loss | Key Finding |
|---|---|---|---|
| 1 | Baseline SFT | 0.4917 | Model learns the format, but outputs are generic |
| 2 | Salience-weighted | 0.4476 | Weighting helps, but coverage still low |
| 3 | + Codebase knowledge | 0.6978 | "Parrot problem" — 49% of pairs shared 4 memories |
| 4 | CPT + deduped SFT | 1.6497 | Diversity fixed (Jaccard 0.048), but specificity limited |
Key insights:
- Pure generation without retrieval hits fundamental limits at 1.7B scale. The model learns patterns ("this project uses Python", "the user prefers X") but can't memorize specifics (exact function names, API details).
- Data quality dominates. The "parrot problem" in Run 3 (model repeating the same 4 memories) was a data issue, not a model issue. Deduplication + chunk-based pairs fixed it.
- Memory should be retrieval + generation, not generation alone. The next architecture should: store raw data → retrieve relevant chunks → generate/synthesize contextual memory.
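The two headline metrics above can be sketched in a few lines. This is an illustrative stdlib version, not the implementation in `src/lmm/eval/`; the function names and tokenization (lowercased whitespace split) are assumptions.

```python
from itertools import combinations

def keyword_coverage(keywords, generated):
    """Fraction of expected keywords that appear in the generated memory."""
    text = generated.lower()
    hits = sum(1 for kw in keywords if kw.lower() in text)
    return hits / len(keywords) if keywords else 0.0

def mean_pairwise_jaccard(texts):
    """Average Jaccard similarity between the token sets of all output pairs.

    Near-identical outputs score close to 1.0 (the parrot problem);
    diverse outputs score near 0.0.
    """
    sets = [set(t.lower().split()) for t in texts]
    pairs = list(combinations(sets, 2))
    if not pairs:
        return 0.0
    return sum(len(a & b) / len(a | b) for a, b in pairs) / len(pairs)

print(keyword_coverage(["pytest", "CUDA"], "The project tests with pytest"))  # 0.5
print(mean_pairwise_jaccard(["a b c", "a b c", "a b c"]))  # 1.0
```

On this kind of measure, Run 3's repetitive outputs would score near 1.0, while Run 4's deduplicated outputs score around 0.048.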
- Python 3.11+
- PyTorch 2.0+ with CUDA
- At least 8 GB of GPU VRAM (tested on an RTX 4090)
```bash
# Clone the repo
git clone https://github.com/Waerden001/LMM.git
cd LMM

# Install dependencies
pip install -e ".[dev]"

# Download the base model
python -c "from transformers import AutoTokenizer, AutoModelForCausalLM; \
AutoTokenizer.from_pretrained('Qwen/Qwen3-1.7B'); \
AutoModelForCausalLM.from_pretrained('Qwen/Qwen3-1.7B', dtype='auto')"
```

LMM trains on Claude Code session data. You'll need your own sessions from `~/.claude/projects/`.
```bash
# Step 1: Parse sessions into structured format
python scripts/01_parse_sessions.py

# Step 2: Extract cognitions (key knowledge) from sessions
python scripts/02_extract_cognitions.py

# Step 2a (optional): Ingest codebase knowledge
python scripts/02a_ingest_codebases.py

# Step 3: Build training pairs
python scripts/03_build_training_data.py --use-chunk-pairs --max-per-memory 3
```

```bash
# Two-stage training: CPT then SFT
WANDB_MODE=offline python scripts/04_train.py --stage both --use-chunk-pairs --max-per-memory 3

# Or run stages separately:
WANDB_MODE=offline python scripts/04_train.py --stage cpt
WANDB_MODE=offline python scripts/04_train.py --stage sft --from-checkpoint checkpoints/run_cpt/best
```

```bash
# Evaluate the trained model
python scripts/05_evaluate.py

# Try the interactive demo
python scripts/06_demo.py
```

```
LMM/
├── src/lmm/      # Core library
│   ├── config.py    # Training & model configuration
│   ├── data/        # Dataset classes (SFT, CPT)
│   ├── training/    # Trainer, data collation
│   ├── eval/        # Evaluation metrics
│   └── utils/       # Parsing, cognition extraction
├── scripts/      # Pipeline scripts (00-06)
├── configs/      # Project configuration (YAML)
├── data/         # Generated data (gitignored)
├── results/      # Evaluation results
└── docs/         # Blog posts & research plan
```
Your memory belongs to you, not to any model provider. Core principles:
- Local training: The model runs on your hardware
- No data upload: Raw sessions and trained models never leave your machine
- Your weights, your memory: The fine-tuned model is yours — it encodes your interaction patterns
This repo does not include any pre-trained checkpoints or session data. You train your own LMM on your own data.
The experiments point clearly toward a retrieval + generation architecture:
- Store raw interaction data in a structured knowledge base
- Retrieve relevant chunks given the current context (semantic search, not brute-force)
- Generate synthesized memory from retrieved chunks (the LMM's role shifts from "remember everything" to "synthesize what's retrieved")
This combines the best of both worlds: retrieval provides specificity, generation provides contextual adaptation. The 1.7B model doesn't need to memorize every function name — it just needs to be good at synthesizing relevant retrieved information into useful context.
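The store → retrieve → synthesize flow can be sketched as follows. A real system would use embeddings and a vector index rather than bag-of-words cosine, and the function names and prompt shape here are hypothetical; the sketch only illustrates how retrieval supplies specifics while the LMM's job reduces to synthesis.

```python
import math
from collections import Counter

def _vec(text):
    # Bag-of-words term counts; a stand-in for an embedding.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, chunks, k=2):
    """Rank stored chunks by similarity to the current context."""
    q = _vec(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, _vec(c)), reverse=True)
    return ranked[:k]

def build_synthesis_prompt(query, retrieved):
    """The LMM's role: synthesize retrieved chunks, not recall them."""
    context = "\n".join(f"- {c}" for c in retrieved)
    return (f"Relevant history:\n{context}\n\n"
            f"Current task: {query}\nSynthesize a short contextual memory.")

chunks = [
    "user prefers pytest over unittest",
    "eval metrics live in src/lmm/eval",
    "training uses Qwen3-1.7B with CPT then SFT",
]
top = retrieve("how do we run eval metrics", chunks)
print(top[0])  # "eval metrics live in src/lmm/eval"
```

Retrieval hands the model exact names and paths verbatim, so the 1.7B model never has to store them in its weights.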
- Blog Post (English) — Full essay on the LMM vision
- Blog Post (中文) — Chinese version
- Research Plan — Detailed research plan and architecture
- Implementation Notes — Development notes
An LLM's context window is a register file, not RAM. We need a real memory layer.
LMM (Large Memory Model) explores a radical idea: use a small language model as the memory substrate for a larger working model. Instead of stuffing everything into the context window or maintaining flat markdown files, compress interaction history into model weights and generate contextual memory on demand.
Read the full essay (Chinese): LMM: When Large Language Models Learn to Remember
Four rounds of experiments revealed both the promise and the limits of the pure-generation approach:
- Runs 1-2: the model learns the memory format and general patterns, but cannot retain specifics (function names, API parameters)
- Run 3: the "parrot problem" appeared: 49% of training pairs shared only 4 memories, so the model was merely parroting them
- Run 4: CPT + deduplicated SFT fixed diversity (Jaccard similarity 0.048), but at 1.7B scale the coverage of pure generation remains limited
Key conclusion: memory should be retrieval + generation, not generation alone. Next architecture: store raw data → retrieve relevant chunks → generate/synthesize contextual memory.
Your memory belongs to you, not to any model provider:
- The model is trained locally on your hardware
- Raw session data and trained models never leave your machine
- This repo contains no pre-trained weights or session data; you train your own LMM on your own data
See the Quick Start guide in the English section above. Main steps:
- Install dependencies: `pip install -e ".[dev]"`
- Download the base model Qwen3-1.7B
- Prepare data: run `scripts/01-03` on your Claude Code sessions
- Train: `python scripts/04_train.py --stage both`
- Evaluate: `python scripts/05_evaluate.py`
The detailed research plan and implementation notes are in the docs/ directory.