Memory-augmented inference library for Recursive Language Models (RLMs), built on top of rlm. Read the RLM blog post for background on the paradigm.
RLM lets LLMs manage their own context via a Python REPL. The problem: every run starts cold. The model makes the same mistakes, ignores the same patterns, and never learns which approaches work for which problem types.
Mem-RLM wraps RLM with a multi-timescale memory layer that records trajectories, learns which strategies work, and injects the best strategy into future runs. It gets better over time.
10 medium-difficulty prompts (multi-step computation, algorithm implementation, combinatorics, linear algebra, graph theory). Raw RLM has no memory — every run starts cold. MemRLM accumulates strategies across rounds.
| | Score | Avg Iters | Errors | Tokens |
|---|---|---|---|---|
| Raw RLM (baseline) | 0.450 | 11.0 | 6 | 48,832 |
| MemRLM Round 1 | 0.410 | 11.4 | 6 | 52,576 |
| MemRLM Round 2 | 0.470 | 11.9 | 3 | 47,263 |
| MemRLM Round 3 | 0.565 | 13.8 | 4 | 76,661 |
+26% score improvement over baseline by Round 3. Nano struggles enough that strategy memory makes a real difference — errors drop, more prompts reach correct answers.
| | Score | Avg Iters | Errors | Tokens |
|---|---|---|---|---|
| Raw RLM (baseline) | 0.855 | 6.0 | 1 | 21,061 |
| MemRLM Round 1 | 0.720 | 5.2 | 2 | 19,401 |
| MemRLM Round 2 | 0.925 | 4.9 | 1 | 17,238 |
| MemRLM Round 3 | 0.860 | 5.4 | 2 | 19,252 |
+8% score improvement at peak (Round 2). Mini is strong enough to handle most prompts on its own, so the baseline is already high and the gain is smaller.
Takeaway: MemRLM helps most when the base model struggles. Weaker models benefit more from accumulated strategy guidance.
| Timescale | What happens | When |
|---|---|---|
| Fast | Standard RLM execution (code generation, REPL, iteration) | Every iteration |
| Medium | Score the run, record trajectory, update strategy ratings | Every completion |
| Slow | Extract new strategies from successful trajectory clusters | Every N completions |
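The three timescales compose into a single loop per completion. A schematic sketch (the function names and the `EXTRACT_EVERY` default here are illustrative, not the library's actual internals):

```python
EXTRACT_EVERY = 5  # "every N completions" (illustrative default)

def run_with_memory(prompt, completions_seen, *, run_rlm, score, record,
                    update_ratings, extract_strategies):
    # Fast: standard RLM execution (code generation, REPL, iteration)
    result = run_rlm(prompt)
    # Medium: score the run, record the trajectory, update strategy ratings
    s = score(result)
    record(result, s)
    update_ratings(s)
    # Slow: periodically mine successful trajectories for new strategies
    if (completions_seen + 1) % EXTRACT_EVERY == 0:
        extract_strategies()
    return result
```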
```shell
git clone https://github.com/dtunai/Mem-RLM.git
cd Mem-RLM
uv sync --all-groups
```

Requires the RLM reference library (included as a local dependency in `rlm-reference/`).
```python
from memrlm import MemRLM

mem = MemRLM(
    backend="openai",
    backend_kwargs={"model_name": "gpt-4.1-mini"},
    environment="local",
    environment_tag="math",
)

# Seed a strategy
mem.register_strategy(
    environment_tag="math",
    name="direct_compute",
    env_tip=(
        "When solving math problems:\n"
        "1. Parse the problem into variables\n"
        "2. Compute the answer directly in Python\n"
        "3. Assign the result and use FINAL_VAR()"
    ),
)

result = mem.completion("What is the sum of the first 100 prime numbers?")
print(result.response)
# The sum of the first 100 prime numbers is 24133.
```

Every call records a trajectory. When `auto_evaluate=True`, it also scores the run, updates strategy ratings, and extracts new strategies from patterns in successful runs.
Strategy selection — Epsilon-greedy (default 10% exploration). Picks the highest-scoring active strategy 90% of the time, random strategy 10% of the time. Returns None on cold start (no strategies yet).
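The selection rule can be sketched as follows (a minimal illustration; the `Strategy` fields shown here are assumptions, not the library's actual types):

```python
import random
from dataclasses import dataclass

@dataclass
class Strategy:
    name: str
    avg_score: float     # weighted running average of past run scores
    active: bool = True  # deactivated strategies are skipped in exploitation

def select_strategy(strategies, epsilon=0.1, rng=random):
    """Epsilon-greedy: best active strategy 90% of the time, random 10%."""
    if not strategies:
        return None  # cold start: nothing recorded yet
    if rng.random() < epsilon:
        return rng.choice(strategies)  # explore (may pick a deactivated one)
    active = [s for s in strategies if s.active]
    if not active:
        return rng.choice(strategies)
    return max(active, key=lambda s: s.avg_score)  # exploit
```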
Strategy injection — The selected strategy's env_tip is appended after RLM_SYSTEM_PROMPT. Curly braces are escaped to avoid conflicts with RLM's .format() call.
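Because RLM builds its system prompt with `str.format()`, literal braces in an injected tip must be doubled before concatenation. A minimal sketch of that escaping step (the template and function name here are illustrative):

```python
RLM_SYSTEM_PROMPT = "You are an RLM. Task: {task}"  # illustrative template

def inject_strategy(env_tip: str) -> str:
    # Double curly braces so a later .format() treats them as literals.
    escaped = env_tip.replace("{", "{{").replace("}", "}}")
    return RLM_SYSTEM_PROMPT + "\n\n" + escaped

prompt = inject_strategy('Use d = {"x": 1} as a lookup table')
print(prompt.format(task="sum primes"))
```

Without the escaping step, `.format()` would raise a `KeyError` on any brace-containing tip.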
Trajectory recording — Every completion (successful or partial) is recorded with full iteration data, token counts, timing, and error flags. Partial runs from timeouts/budget errors are captured via RLMLogger.
Scoring — LLM-based evaluation (0-1 range). The scorer sees the task prompt, final response, actual code execution output (stdout/stderr from the Python REPL), and execution stats. No regex or heuristics — the LLM judges correctness and assigns a score. Correct answers score 0.7-1.0, incorrect 0.0-0.5.
Strategy extraction — After enough trajectories accumulate (default 5), the evaluator analyzes trajectory patterns and uses an LLM to generate new strategies. Deduplicates on (environment_tag, feature_hash).
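Deduplication on `(environment_tag, feature_hash)` can be pictured as below (a sketch; the normalization and hashing scheme are assumptions, not the library's actual implementation):

```python
import hashlib

def feature_hash(env_tip: str) -> str:
    # Hash the whitespace/case-normalized tip so near-identical strategies collide.
    normalized = " ".join(env_tip.lower().split())
    return hashlib.sha256(normalized.encode()).hexdigest()[:16]

seen = set()

def add_if_new(environment_tag: str, env_tip: str) -> bool:
    key = (environment_tag, feature_hash(env_tip))
    if key in seen:
        return False  # duplicate strategy for this tag: skip
    seen.add(key)
    return True
```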
Strategy lifecycle — Weighted running average scoring (alpha=0.3). Strategies that drop below 0.25 avg score after 5+ uses get deactivated, but remain eligible during exploration.
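The lifecycle rule above can be sketched as a single update function (an illustration of the stated constants, not the library's code):

```python
ALPHA = 0.3              # weight given to the newest run's score
DEACTIVATE_BELOW = 0.25  # average-score threshold
MIN_USES = 5             # grace period before deactivation kicks in

def update_rating(avg_score: float, use_count: int, new_score: float):
    """Weighted running average; deactivate low scorers after MIN_USES runs."""
    if use_count == 0:
        avg = new_score  # first observation seeds the average
    else:
        avg = ALPHA * new_score + (1 - ALPHA) * avg_score
    use_count += 1
    active = not (use_count >= MIN_USES and avg < DEACTIVATE_BELOW)
    return avg, use_count, active
```

A deactivated strategy keeps its rating and can still be drawn during the 10% exploration step, so it gets a chance to recover.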
```python
MemRLM(
    # RLM pass-through
    backend="openai",
    backend_kwargs={"model_name": "gpt-4.1-mini"},
    environment="local",
    environment_kwargs=None,
    max_depth=1,
    max_iterations=30,
    max_budget=None,
    max_timeout=None,
    verbose=False,
    # MemRLM-specific
    environment_tag="default",     # groups strategies by problem type
    db_url="sqlite:///memrlm.db",  # SQLAlchemy database URL
    auto_evaluate=False,           # set True for benchmark/eval mode
    strategy_override=None,        # force a specific env_tip string
    epsilon=0.1,                   # exploration rate
    **rlm_kwargs,                  # any other RLM params (custom_tools, compaction, etc.)
)
```

MemRLM uses SQLAlchemy ORM, so any supported database works:
```python
# SQLite (default)
MemRLM(db_url="sqlite:///memrlm.db")

# PostgreSQL
MemRLM(db_url="postgresql://user:pass@localhost/memrlm")

# MySQL
MemRLM(db_url="mysql+pymysql://user:pass@localhost/memrlm")
```

Bare file paths still work for backward compatibility: `db_path="memrlm.db"`.
Methods:

- `completion(prompt, root_prompt=None)` — run with strategy injection + recording
- `register_strategy(environment_tag, name, env_tip, description="")` — seed a strategy manually

Store access: `mem.store` exposes `list_trajectories()`, `get_strategies_for_tag()`, and other query methods.
```
memrlm/
  core.py       MemRLM              wraps RLM, orchestrates the loop
  models.py     ORM models          SQLAlchemy models for all entities
  store.py      StrategyStore       database CRUD (trajectories, strategies, run records)
  recorder.py   TrajectoryRecorder  builds Trajectory from RLMChatCompletion
  evaluator.py  Evaluator           LLM-based scoring + strategy extraction
  selector.py   StrategySelector    epsilon-greedy selection
  types.py      Trajectory, Strategy, RunRecord dataclasses
```
```shell
# Basic completion with strategy memory
uv run python examples/quickstart.py

# Custom tools pass-through
uv run python examples/custom_tools.py

# Depth > 1 with rlm_query() subcalls
uv run python examples/depth_metadata.py

# Parallel child RLMs with rlm_query_batched()
uv run python examples/rlm_query_batched.py

# Context compaction
uv run python examples/compaction.py

# Trajectory capture (automatic in MemRLM)
uv run python examples/logger.py
```

```shell
uv run pytest      # 60 tests, no API keys needed
uv run ruff check  # lint
```

MIT