MemRLM

Memory-augmented inference library for Recursive Language Models (RLMs), built on top of rlm. Read the RLM blog post for background on the paradigm.

RLM lets LLMs manage their own context via a Python REPL. The problem: every run starts cold. The model makes the same mistakes, ignores the same patterns, and never learns which approaches work for which problem types.

Mem-RLM wraps RLM with a multi-timescale memory layer that records trajectories, learns which strategies work, and injects the best strategy into future runs. It gets better over time.

Benchmark

10 medium-difficulty prompts (multi-step computation, algorithm implementation, combinatorics, linear algebra, graph theory). Raw RLM has no memory — every run starts cold. MemRLM accumulates strategies across rounds.

gpt-4.1-nano

| Run | Score | Avg Iters | Errors | Tokens |
|---|---|---|---|---|
| Raw RLM (baseline) | 0.450 | 11.0 | 6 | 48,832 |
| MemRLM Round 1 | 0.410 | 11.4 | 6 | 52,576 |
| MemRLM Round 2 | 0.470 | 11.9 | 3 | 47,263 |
| MemRLM Round 3 | 0.565 | 13.8 | 4 | 76,661 |

+26% score improvement over baseline by Round 3. Nano struggles enough that strategy memory makes a real difference — errors drop, more prompts reach correct answers.

gpt-4.1-mini

| Run | Score | Avg Iters | Errors | Tokens |
|---|---|---|---|---|
| Raw RLM (baseline) | 0.855 | 6.0 | 1 | 21,061 |
| MemRLM Round 1 | 0.720 | 5.2 | 2 | 19,401 |
| MemRLM Round 2 | 0.925 | 4.9 | 1 | 17,238 |
| MemRLM Round 3 | 0.860 | 5.4 | 2 | 19,252 |

+8% score improvement at peak (Round 2). Mini is strong enough to handle most prompts on its own, so the signal is smaller since the baseline is already high.

Takeaway: MemRLM helps most when the base model struggles. Weaker models benefit more from accumulated strategy guidance.

Three timescales

| Timescale | What happens | When |
|---|---|---|
| Fast | Standard RLM execution (code generation, REPL, iteration) | Every iteration |
| Medium | Score the run, record trajectory, update strategy ratings | Every completion |
| Slow | Extract new strategies from successful trajectory clusters | Every N completions |
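The three timescales compose into one per-call control flow. The sketch below is illustrative only: the function names and counters are stand-ins, not MemRLM internals.

```python
# Stand-in call counters so the control flow is observable.
calls = {"fast": 0, "medium": 0, "slow": 0}

def fast_rlm(prompt):
    """Stand-in for the fast timescale: RLM's code-gen + REPL loop."""
    calls["fast"] += 1
    return f"answer to {prompt}"

def record_and_score(result):
    """Stand-in for the medium timescale: record trajectory, update ratings."""
    calls["medium"] += 1

def extract_strategies():
    """Stand-in for the slow timescale: mine new strategies from trajectories."""
    calls["slow"] += 1

def completion(prompt, n_done, extract_every=5):
    """One call through all three timescales (control flow only)."""
    result = fast_rlm(prompt)                  # fast: every run
    record_and_score(result)                   # medium: every completion
    if (n_done + 1) % extract_every == 0:
        extract_strategies()                   # slow: every N completions
    return result

for i in range(10):
    completion(f"task {i}", i)
```

With 10 completions and `extract_every=5`, the slow timescale fires twice (after completions 5 and 10).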

Install

git clone https://github.com/dtunai/Mem-RLM.git
cd Mem-RLM
uv sync --all-groups

Requires the RLM reference library (included as local dependency in rlm-reference/).

Quickstart

from memrlm import MemRLM

mem = MemRLM(
    backend="openai",
    backend_kwargs={"model_name": "gpt-4.1-mini"},
    environment="local",
    environment_tag="math",
)

# Seed a strategy
mem.register_strategy(
    environment_tag="math",
    name="direct_compute",
    env_tip=(
        "When solving math problems:\n"
        "1. Parse the problem into variables\n"
        "2. Compute the answer directly in Python\n"
        "3. Assign the result and use FINAL_VAR()"
    ),
)

result = mem.completion("What is the sum of the first 100 prime numbers?")
print(result.response)
# The sum of the first 100 prime numbers is 24133.

Every call records a trajectory. When auto_evaluate=True, it also scores the run, updates strategy ratings, and extracts new strategies from patterns in successful runs.

How it works

Strategy selection — Epsilon-greedy (default 10% exploration). Picks the highest-scoring active strategy 90% of the time, random strategy 10% of the time. Returns None on cold start (no strategies yet).
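The selection rule can be sketched in a few lines. This is an illustrative reimplementation, not MemRLM's actual code; the `active` and `avg_score` field names are assumptions.

```python
import random

def select_strategy(strategies, epsilon=0.1, rng=random):
    """Epsilon-greedy pick over recorded strategies; None on cold start."""
    if not strategies:
        return None  # cold start: nothing recorded yet
    active = [s for s in strategies if s.get("active", True)]
    if rng.random() < epsilon or not active:
        # Explore: any strategy, including deactivated ones.
        return rng.choice(strategies)
    # Exploit: highest-rated active strategy.
    return max(active, key=lambda s: s["avg_score"])
```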

Strategy injection — The selected strategy's env_tip is appended after RLM_SYSTEM_PROMPT. Curly braces are escaped to avoid conflicts with RLM's .format() call.
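The escaping step amounts to doubling braces before the combined prompt passes through `str.format()`. A minimal sketch; the prompt text and variable names here are invented for illustration:

```python
def escape_braces(text: str) -> str:
    """Double { and } so a later str.format() call treats them as literals."""
    return text.replace("{", "{{").replace("}", "}}")

# Hypothetical system prompt that RLM formats later with the task.
system_prompt = "You are an RLM. Task: {task}"
env_tip = 'Build results as dicts like {"answer": x}, then call FINAL_VAR(answer).'

combined = system_prompt + "\n\n" + escape_braces(env_tip)
# The later .format() fills {task} but leaves the tip's braces intact.
rendered = combined.format(task="sum the first 100 primes")
```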

Trajectory recording — Every completion (successful or partial) is recorded with full iteration data, token counts, timing, and error flags. Partial runs from timeouts/budget errors are captured via RLMLogger.

Scoring — LLM-based evaluation (0-1 range). The scorer sees the task prompt, final response, actual code execution output (stdout/stderr from the Python REPL), and execution stats. No regex or heuristics — the LLM judges correctness and assigns a score. Correct answers score 0.7-1.0, incorrect 0.0-0.5.

Strategy extraction — After enough trajectories accumulate (default 5), the evaluator analyzes trajectory patterns and uses an LLM to generate new strategies. Deduplicates on (environment_tag, feature_hash).
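Deduplication on (environment_tag, feature_hash) can be illustrated with a simple content hash over the normalized tip text. The normalization and hashing shown here are assumptions, not MemRLM's exact scheme:

```python
import hashlib

def feature_hash(env_tip: str) -> str:
    """Stable hash of a strategy tip, insensitive to case and whitespace."""
    normalized = " ".join(env_tip.lower().split())
    return hashlib.sha256(normalized.encode()).hexdigest()[:16]

seen: set[tuple[str, str]] = set()

def register_if_new(environment_tag: str, env_tip: str) -> bool:
    """Insert only if (tag, hash) is unseen; returns False for duplicates."""
    key = (environment_tag, feature_hash(env_tip))
    if key in seen:
        return False
    seen.add(key)
    return True
```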

Strategy lifecycle — Weighted running average scoring (alpha=0.3). Strategies that drop below 0.25 avg score after 5+ uses get deactivated, but remain eligible during exploration.
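The rating update and deactivation rule can be sketched as an exponentially weighted average plus a threshold check. Field names are assumptions; the constants match those stated above:

```python
def update_rating(avg: float, new_score: float, alpha: float = 0.3) -> float:
    """Weighted running average: recent runs count more than old ones."""
    return alpha * new_score + (1 - alpha) * avg

def maybe_deactivate(strategy: dict, min_score: float = 0.25, min_uses: int = 5) -> dict:
    """Deactivate chronically low scorers; they remain pickable during exploration."""
    if strategy["uses"] >= min_uses and strategy["avg_score"] < min_score:
        strategy["active"] = False
    return strategy
```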

API

MemRLM(
    # RLM pass-through
    backend="openai",
    backend_kwargs={"model_name": "gpt-4.1-mini"},
    environment="local",
    environment_kwargs=None,
    max_depth=1,
    max_iterations=30,
    max_budget=None,
    max_timeout=None,
    verbose=False,
    # MemRLM-specific
    environment_tag="default",   # groups strategies by problem type
    db_url="sqlite:///memrlm.db",  # SQLAlchemy database URL
    auto_evaluate=False,         # set True for benchmark/eval mode
    strategy_override=None,      # force a specific env_tip string
    epsilon=0.1,                 # exploration rate
    **rlm_kwargs,                # any other RLM params (custom_tools, compaction, etc.)
)

Storage backends

MemRLM uses SQLAlchemy ORM, so any supported database works:

# SQLite (default)
MemRLM(db_url="sqlite:///memrlm.db")

# PostgreSQL
MemRLM(db_url="postgresql://user:pass@localhost/memrlm")

# MySQL
MemRLM(db_url="mysql+pymysql://user:pass@localhost/memrlm")

Bare file paths still work for backward compatibility: db_path="memrlm.db".

Methods:

  • completion(prompt, root_prompt=None) — Run with strategy injection + recording
  • register_strategy(environment_tag, name, env_tip, description="") — Seed a strategy manually

Store access: mem.store exposes list_trajectories(), get_strategies_for_tag(), and other query methods.

Architecture

memrlm/
  core.py        MemRLM               wraps RLM, orchestrates the loop
  models.py      ORM models           SQLAlchemy models for all entities
  store.py       StrategyStore        database CRUD (trajectories, strategies, run records)
  recorder.py    TrajectoryRecorder   builds Trajectory from RLMChatCompletion
  evaluator.py   Evaluator            LLM-based scoring + strategy extraction
  selector.py    StrategySelector     epsilon-greedy selection
  types.py       dataclasses          Trajectory, Strategy, RunRecord

Examples

# Basic completion with strategy memory
uv run python examples/quickstart.py

# Custom tools pass-through
uv run python examples/custom_tools.py

# Depth > 1 with rlm_query() subcalls
uv run python examples/depth_metadata.py

# Parallel child RLMs with rlm_query_batched()
uv run python examples/rlm_query_batched.py

# Context compaction
uv run python examples/compaction.py

# Trajectory capture (automatic in MemRLM)
uv run python examples/logger.py

Tests

uv run pytest        # 60 tests, no API keys needed
uv run ruff check    # lint

License

MIT
