MemRLM

Memory-augmented inference library for Recursive Language Models (RLMs), built on top of rlm. Read the RLM blog post for background on the paradigm.

RLM lets LLMs manage their own context via a Python REPL. The problem: every run starts cold. The model makes the same mistakes, ignores the same patterns, and never learns which approaches work for which problem types.

Mem-RLM wraps RLM with a multi-timescale memory layer that records trajectories, learns which strategies work, and injects the best strategy into future runs. It gets better over time.

Benchmark

10 medium-difficulty prompts (multi-step computation, algorithm implementation, combinatorics, linear algebra, graph theory). Raw RLM has no memory — every run starts cold. MemRLM accumulates strategies across rounds.

gpt-4.1-nano

| Run | Score | Avg Iters | Errors | Tokens |
|---|---|---|---|---|
| Raw RLM (baseline) | 0.450 | 11.0 | 6 | 48,832 |
| MemRLM Round 1 | 0.410 | 11.4 | 6 | 52,576 |
| MemRLM Round 2 | 0.470 | 11.9 | 3 | 47,263 |
| MemRLM Round 3 | 0.565 | 13.8 | 4 | 76,661 |

+26% score improvement over baseline by Round 3. Nano struggles enough that strategy memory makes a real difference — errors drop, more prompts reach correct answers.

gpt-4.1-mini

| Run | Score | Avg Iters | Errors | Tokens |
|---|---|---|---|---|
| Raw RLM (baseline) | 0.855 | 6.0 | 1 | 21,061 |
| MemRLM Round 1 | 0.720 | 5.2 | 2 | 19,401 |
| MemRLM Round 2 | 0.925 | 4.9 | 1 | 17,238 |
| MemRLM Round 3 | 0.860 | 5.4 | 2 | 19,252 |

+8% score improvement at peak (Round 2). Mini is strong enough to handle most prompts on its own, so the signal is smaller since the baseline is already high.

Takeaway: MemRLM helps most when the base model struggles. Weaker models benefit more from accumulated strategy guidance.

Three timescales

| Timescale | What happens | When |
|---|---|---|
| Fast | Standard RLM execution (code generation, REPL, iteration) | Every iteration |
| Medium | Score the run, record trajectory, update strategy ratings | Every completion |
| Slow | Extract new strategies from successful trajectory clusters | Every N completions |
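The three timescales compose into one per-call control flow. The sketch below is illustrative only: the function names and counters are stand-ins, not MemRLM internals.

```python
# Stand-in call counters so the control flow is observable.
calls = {"fast": 0, "medium": 0, "slow": 0}

def fast_rlm(prompt):
    """Stand-in for the fast timescale: RLM's code-gen + REPL loop."""
    calls["fast"] += 1
    return f"answer to {prompt}"

def record_and_score(result):
    """Stand-in for the medium timescale: record trajectory, update ratings."""
    calls["medium"] += 1

def extract_strategies():
    """Stand-in for the slow timescale: mine new strategies from trajectories."""
    calls["slow"] += 1

def completion(prompt, n_done, extract_every=5):
    """One call through all three timescales (control flow only)."""
    result = fast_rlm(prompt)                  # fast: every run
    record_and_score(result)                   # medium: every completion
    if (n_done + 1) % extract_every == 0:
        extract_strategies()                   # slow: every N completions
    return result

for i in range(10):
    completion(f"task {i}", i)
```

With 10 completions and `extract_every=5`, the slow timescale fires twice (after completions 5 and 10).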

Install

git clone https://github.com/dtunai/Mem-RLM.git
cd Mem-RLM
uv sync --all-groups

Requires the RLM reference library (included as local dependency in rlm-reference/).

Quickstart

from memrlm import MemRLM

mem = MemRLM(
    backend="openai",
    backend_kwargs={"model_name": "gpt-4.1-mini"},
    environment="local",
    environment_tag="math",
)

# Seed a strategy
mem.register_strategy(
    environment_tag="math",
    name="direct_compute",
    env_tip=(
        "When solving math problems:\n"
        "1. Parse the problem into variables\n"
        "2. Compute the answer directly in Python\n"
        "3. Assign the result and use FINAL_VAR()"
    ),
)

result = mem.completion("What is the sum of the first 100 prime numbers?")
print(result.response)
# The sum of the first 100 prime numbers is 24133.

Every call records a trajectory. When auto_evaluate=True, it also scores the run, updates strategy ratings, and extracts new strategies from patterns in successful runs.

How it works

Strategy selection — Epsilon-greedy (default 10% exploration). Picks the highest-scoring active strategy 90% of the time, random strategy 10% of the time. Returns None on cold start (no strategies yet).
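The selection rule can be sketched in a few lines. This is an illustrative reimplementation, not MemRLM's actual code; the `active` and `avg_score` field names are assumptions.

```python
import random

def select_strategy(strategies, epsilon=0.1, rng=random):
    """Epsilon-greedy pick over recorded strategies; None on cold start."""
    if not strategies:
        return None  # cold start: nothing recorded yet
    active = [s for s in strategies if s.get("active", True)]
    if rng.random() < epsilon or not active:
        # Explore: any strategy, including deactivated ones.
        return rng.choice(strategies)
    # Exploit: highest-rated active strategy.
    return max(active, key=lambda s: s["avg_score"])
```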

Strategy injection — The selected strategy's env_tip is appended after RLM_SYSTEM_PROMPT. Curly braces are escaped to avoid conflicts with RLM's .format() call.
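The escaping step amounts to doubling braces before the combined prompt passes through `str.format()`. A minimal sketch; the prompt text and variable names here are invented for illustration:

```python
def escape_braces(text: str) -> str:
    """Double { and } so a later str.format() call treats them as literals."""
    return text.replace("{", "{{").replace("}", "}}")

# Hypothetical system prompt that RLM formats later with the task.
system_prompt = "You are an RLM. Task: {task}"
env_tip = 'Build results as dicts like {"answer": x}, then call FINAL_VAR(answer).'

combined = system_prompt + "\n\n" + escape_braces(env_tip)
# The later .format() fills {task} but leaves the tip's braces intact.
rendered = combined.format(task="sum the first 100 primes")
```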

Trajectory recording — Every completion (successful or partial) is recorded with full iteration data, token counts, timing, and error flags. Partial runs from timeouts/budget errors are captured via RLMLogger.

Scoring — LLM-based evaluation (0-1 range). The scorer sees the task prompt, final response, actual code execution output (stdout/stderr from the Python REPL), and execution stats. No regex or heuristics — the LLM judges correctness and assigns a score. Correct answers score 0.7-1.0, incorrect 0.0-0.5.

Strategy extraction — After enough trajectories accumulate (default 5), the evaluator analyzes trajectory patterns and uses an LLM to generate new strategies. Deduplicates on (environment_tag, feature_hash).
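Deduplication on (environment_tag, feature_hash) can be illustrated with a simple content hash over the normalized tip text. The normalization and hashing shown here are assumptions, not MemRLM's exact scheme:

```python
import hashlib

def feature_hash(env_tip: str) -> str:
    """Stable hash of a strategy tip, insensitive to case and whitespace."""
    normalized = " ".join(env_tip.lower().split())
    return hashlib.sha256(normalized.encode()).hexdigest()[:16]

seen: set[tuple[str, str]] = set()

def register_if_new(environment_tag: str, env_tip: str) -> bool:
    """Insert only if (tag, hash) is unseen; returns False for duplicates."""
    key = (environment_tag, feature_hash(env_tip))
    if key in seen:
        return False
    seen.add(key)
    return True
```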

Strategy lifecycle — Weighted running average scoring (alpha=0.3). Strategies that drop below 0.25 avg score after 5+ uses get deactivated, but remain eligible during exploration.
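The rating update and deactivation rule can be sketched as an exponentially weighted average plus a threshold check. Field names are assumptions; the constants match those stated above:

```python
def update_rating(avg: float, new_score: float, alpha: float = 0.3) -> float:
    """Weighted running average: recent runs count more than old ones."""
    return alpha * new_score + (1 - alpha) * avg

def maybe_deactivate(strategy: dict, min_score: float = 0.25, min_uses: int = 5) -> dict:
    """Deactivate chronically low scorers; they remain pickable during exploration."""
    if strategy["uses"] >= min_uses and strategy["avg_score"] < min_score:
        strategy["active"] = False
    return strategy
```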

API

MemRLM(
    # RLM pass-through
    backend="openai",
    backend_kwargs={"model_name": "gpt-4.1-mini"},
    environment="local",
    environment_kwargs=None,
    max_depth=1,
    max_iterations=30,
    max_budget=None,
    max_timeout=None,
    verbose=False,
    # MemRLM-specific
    environment_tag="default",   # groups strategies by problem type
    db_url="sqlite:///memrlm.db",  # SQLAlchemy database URL
    auto_evaluate=False,         # set True for benchmark/eval mode
    strategy_override=None,      # force a specific env_tip string
    epsilon=0.1,                 # exploration rate
    **rlm_kwargs,                # any other RLM params (custom_tools, compaction, etc.)
)

Storage backends

MemRLM uses SQLAlchemy ORM, so any supported database works:

# SQLite (default)
MemRLM(db_url="sqlite:///memrlm.db")

# PostgreSQL
MemRLM(db_url="postgresql://user:pass@localhost/memrlm")

# MySQL
MemRLM(db_url="mysql+pymysql://user:pass@localhost/memrlm")

Bare file paths still work for backward compatibility: db_path="memrlm.db".

Methods:

  • completion(prompt, root_prompt=None) — Run with strategy injection + recording
  • register_strategy(environment_tag, name, env_tip, description="") — Seed a strategy manually

Store access: mem.store exposes list_trajectories(), get_strategies_for_tag(), and other query methods.

Architecture

memrlm/
  core.py        MemRLM               wraps RLM, orchestrates the loop
  models.py      ORM models           SQLAlchemy models for all entities
  store.py       StrategyStore        database CRUD (trajectories, strategies, run records)
  recorder.py    TrajectoryRecorder   builds Trajectory from RLMChatCompletion
  evaluator.py   Evaluator            LLM-based scoring + strategy extraction
  selector.py    StrategySelector     epsilon-greedy selection
  types.py       dataclasses          Trajectory, Strategy, RunRecord

Examples

# Basic completion with strategy memory
uv run python examples/quickstart.py

# Custom tools pass-through
uv run python examples/custom_tools.py

# Depth > 1 with rlm_query() subcalls
uv run python examples/depth_metadata.py

# Parallel child RLMs with rlm_query_batched()
uv run python examples/rlm_query_batched.py

# Context compaction
uv run python examples/compaction.py

# Trajectory capture (automatic in MemRLM)
uv run python examples/logger.py

Tests

uv run pytest        # 60 tests, no API keys needed
uv run ruff check    # lint

License

MIT
