# Recursive Language Models: The End of Context Rot

How to escape the "Context Window" trap by teaching models to read like humans.

It feels like the "Context Wars" are over. In 2023, we fought for 32k context windows. By 2024, 128k was standard. Today, Google's Gemini, Anthropic's Claude, and several other model families offer context windows of up to 1 million tokens. You can literally dump the entire Harry Potter series, the Linux Kernel, and your company's entire Slack history into a single prompt.

So, problem solved, right?

Wrong.

While we solved the Storage problem (fitting the text into the window), we haven't solved the Reasoning problem. The resulting degradation is known as Context Rot, and it is closely related to the "Lost in the Middle" phenomenon.

As you fill that 1M window, the model gets "dumber". Its attention mechanism—the very heart of the Transformer—becomes diluted. When you ask a question about line 14,032, a model looking at 1,000,000 tokens struggles to separate the signal from the noise. It's like trying to find a specific needle in a haystack, but the haystack is the size of Texas.

We don't need bigger windows. We need a fundamental shift in how models interact with data. We need them to stop "ingesting" data and start "reading" it.

This is where Recursive Language Models (RLMs) come in.

In this deep dive, based on the recent paper from MIT OASYS, we aren't just going to explain the theory. We are going to build an RLM from scratch.

## Setup

Before running this notebook, please make sure you have your API key set up.

You can export it in your terminal before running Jupyter:
```bash
export GOOGLE_API_KEY=your_api_key_here
```

Or set it directly in the cell below:

```python
!pip install google-genai

import os
os.environ["GOOGLE_API_KEY"] = "your_api_key_here"
```

## From "Seeing" to "Seeking"

To understand the breakthrough, we have to look at the math.

In a standard Language Model (like GPT), the probability of the next word ($y$) depends on everything that came before it ($x$).
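
In the usual autoregressive notation, this can be written as follows (here $\theta$ stands for the model weights and $y_t$ for the $t$-th output token):

$$
P_\theta(y \mid x) = \prod_{t=1}^{T} P_\theta(y_t \mid x, y_{<t})
$$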

This $x$ represents your massive 1-million-token prompt. The core mechanism of the Transformer, Self-Attention, has to compare every token to every other token to find relationships. This leads to a complexity of roughly $O(N^2)$.

Think of this like a librarian who, every time you ask a question, has to re-read every single book in the library simultaneously before answering.

As the library grows ($N \to \infty$), the librarian becomes overwhelmed. The "signal" of the answer gets drowned out by the "noise" of millions of irrelevant words. This is Context Rot.

## Active Information Retrieval

Recursive Language Models flip this logic. Instead of forcing the model to *see* the text, we *hide* the text. We treat the context $x$ as an Environment State—like a database or a file system—rather than input tokens.

The model doesn't generate the answer immediately. It generates a Program ($\pi$).
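
One way to write this down, as a sketch rather than the paper's exact notation (the function name $\mathrm{Exec}$ and the next state $S_1$ are assumed names introduced for illustration):

$$
\pi \sim P_\theta(\cdot \mid S_0), \qquad o = \mathrm{Exec}(\pi, x), \qquad S_1 = (S_0, \pi, o)
$$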

Here is the narrative arc of this formula:
1. $S_0$ is the user's prompt (e.g., "Find the secret code").
2. The model cannot *see* the context $x$. It only knows "There is a variable named `context` in memory".
3. Because it can't answer yet, $P_\theta$ (the model) decides to write a program $\pi$ to go look for information.
4. This program $\pi$ executes in the environment to produce an Observation ($o$).
5. Finally, the model looks at its own program and the result it got back ($o$), and decides what to do next.
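
The five steps above can be sketched in a few lines of Python. This is a toy illustration, not the paper's implementation: `stub_model` is a hypothetical stand-in that returns a hard-coded program instead of calling a real model, and `run_program` is an assumed helper name.

```python
import contextlib
import io

def stub_model(state):
    """Hypothetical stand-in for P_theta: returns a program (pi) as a string."""
    # A real model would generate this code; here it is hard-coded.
    return "idx = context.find('CODE='); print(context[idx:idx + 9])"

def run_program(program, context):
    """Execute pi against the hidden context and capture the observation o."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(program, {"context": context})
    return buffer.getvalue().strip()

# The context x lives in the environment; the model only ever sees S_0.
x = "filler " * 1000 + "CODE=1234 " + "filler " * 1000
s0 = "Find the secret code. A variable named `context` is in memory."

pi = stub_model(s0)       # step 3: the model writes a program
o = run_program(pi, x)    # step 4: the program runs and yields an observation
s1 = (s0, pi, o)          # step 5: the next state holds only S_0, pi, and o

print(o)
print(len(x), "characters in context;", len(o), "characters observed")
```

The point of the sketch is the last line: the environment holds thousands of characters, while the state handed back to the model contains only the short program and its short output.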

Why is this revolutionary? Because the model's "working memory" (what it pays attention to) is now just the Program $\pi$ and the Observation $o$.

- **Old Way**: The model must attend to 1,000,000 tokens simultaneously ($O(N^2)$ cognitive load).
- **RLM Way**: The model attends only to the specific chunk it requested (e.g., 500 tokens). While the system might scan the whole library, the model never holds the entire load at once.

We have decoupled the Size of the Data from the Cognitive Load of the model.
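
A back-of-the-envelope calculation makes the gap concrete. This sketch counts only token-to-token comparisons for one attention pass and ignores constants, layers, heads, and any attention optimizations, so treat the numbers as an order-of-magnitude illustration.

```python
def attention_pairs(num_tokens: int) -> int:
    """Token-to-token comparisons for one full self-attention pass (N^2)."""
    return num_tokens * num_tokens

full_window = attention_pairs(1_000_000)   # the "Old Way": everything at once
single_chunk = attention_pairs(500)        # one chunk the RLM asked for

print(f"full window : {full_window:,} pairs")
print(f"one chunk   : {single_chunk:,} pairs")
print(f"ratio       : {full_window // single_chunk:,}x")
```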

## The "Recursive" in RLM

The "Recursive" in RLM isn't just a buzzword. It literally means the model can call itself.

Imagine asking the model to "Summarize this 10,000 page document." An RLM without recursion might try to read the first 100 pages using code. But that's still too much work for a single call.

In a Recursive system, the generated program $\pi$ can invoke the RLM function again:
- $M_d$: The Model at recursion depth $d$.
- $M_{d+1}$: A sub-agent called by the parent, one level deeper (so a Depth 0 manager calls Depth 1 workers).

### Example Flow
1. **Manager Agent (Depth 0)**: "This document is huge. I'll split it into 3 parts and hire 3 sub-agents."
   - Code: `for part in parts: rlm.completion("Summarize this", part)`
2. **Worker Agent A (Depth 1)**: "I received Part 1. It's still big, but I can read it."
   - Code: `read(part); summarize()`
3. **Manager Agent**: Receives 3 summaries, combines them, and answers.

This turns a linear processing task into a tree of calls. The model dynamically creates its own "org chart" of sub-agents to handle the complexity.
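
The manager/worker flow can be sketched as a plain recursive function. This is an illustrative sketch under stated assumptions: `stub_llm` is a hypothetical stand-in for a model call, and the fixed `max_chars` threshold and `fanout` are choices made here, whereas in a real RLM the model decides how to split.

```python
def stub_llm(prompt: str, text: str) -> str:
    """Hypothetical stand-in for a model call: returns a short fake summary."""
    return f"[summary of {len(text)} chars]"

def recursive_summarize(text: str, max_chars: int = 1000, fanout: int = 3) -> str:
    """Summarize directly if small enough, otherwise split and recurse."""
    if len(text) <= max_chars:
        # Leaf: a worker that can read its whole part.
        return stub_llm("Summarize this", text)
    # Manager: split into `fanout` parts and delegate each to a sub-agent.
    step = -(-len(text) // fanout)  # ceiling division
    parts = [text[i:i + step] for i in range(0, len(text), step)]
    summaries = [recursive_summarize(p, max_chars, fanout) for p in parts]
    # Combine the partial summaries; recurse again if they are still too long.
    return recursive_summarize(" ".join(summaries), max_chars, fanout)

document = "word " * 2000  # 10,000 characters
result = recursive_summarize(document)
print(result)
```

No single call in this sketch ever receives more than `max_chars` characters, no matter how large `document` is.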

## A Simple Implementation

Let's turn this math into Python. We need to build a system where the model can write code that references itself.

**Note**: For clarity and simplicity, we execute Python code in a local environment. In production, this "Environment" would be a secure Docker container to prevent the model from deleting your files!
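
As a small step in that direction, generated code can at least be run in a separate process with a timeout. This is a sketch only, and `run_in_subprocess` is a hypothetical helper name: a subprocess is not a security sandbox and does not restrict file or network access, so it does not replace a container.

```python
import subprocess
import sys

def run_in_subprocess(code: str, timeout_s: float = 5.0) -> str:
    """Run generated code in a separate Python process and return its output."""
    try:
        completed = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return "Error: execution timed out"
    # Return stdout on success, stderr otherwise.
    return completed.stdout if completed.returncode == 0 else completed.stderr

print(run_in_subprocess("print(2 + 2)"))
```

Note that a separate process does not share variables with the notebook, so the context would have to be passed in some other way (for example through a file or stdin).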

### Step 1: The Environment
This is our "Computer". It stores variables and runs code.

```python
import io
import contextlib

class SimpleREPL:
    def __init__(self, variables=None):
        # Keep the caller's dict (even an empty one) so state stays shared
        self.variables = variables if variables is not None else {}

    def execute(self, code):
        # Capture everything the code prints so it can be returned as text
        f = io.StringIO()
        with contextlib.redirect_stdout(f):
            try:
                # The Magic: We run the code with access to 'self.variables'
                exec(code, self.variables)
            except Exception as e:
                print(f"Error: {e}")
        return f.getvalue()
```
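
A short usage sketch shows the two properties the agent relies on: variables persist between calls, and errors come back as text instead of crashing the loop. The class is repeated in condensed form only so the snippet runs on its own.

```python
import contextlib
import io

class SimpleREPL:  # condensed copy of the Step 1 class
    def __init__(self, variables=None):
        self.variables = variables if variables is not None else {}

    def execute(self, code):
        f = io.StringIO()
        with contextlib.redirect_stdout(f):
            try:
                exec(code, self.variables)
            except Exception as e:
                print(f"Error: {e}")
        return f.getvalue()

repl = SimpleREPL({"context": "alpha beta gamma"})

# State persists between calls, like cells in a notebook.
repl.execute("words = context.split()")           # no output, but 'words' is stored
print(repl.execute("print(len(words))"))          # uses 'words' from the previous call
print(repl.execute("print(undefined_name)"))      # the error is captured as text
```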

### Step 2: LLM Client

We need a standard client to talk to the model. We'll use the Google Gen AI SDK (`google-genai`) for this.

You can install it with: `pip install google-genai`

```python
from google import genai
import os

class GeminiClient:
    def __init__(self):
        # 1. Initialize the Google Gen AI SDK client
        # The client will automatically use 'GOOGLE_API_KEY' from environment variables if not provided,
        # or we can explicitly pass it.
        api_key = os.getenv("GOOGLE_API_KEY") or os.getenv("GEMINI_API_KEY")
        self.client = genai.Client(api_key=api_key)
        self.model_name = "gemini-2.0-flash"

    def completion(self, prompt: str) -> str:
        # Standard completion wrapper
        try:
            response = self.client.models.generate_content(
                model=self.model_name,
                contents=prompt
            )
            return response.text
        except Exception as e:
            return f"Error: {e}"

    def chat(self, messages: list[dict]) -> str:
        # 2. Convert RLM messages to Gemini format
        # For simplicity, we concatenate them into a single string here.
        full_prompt = "\n".join([f"{m['role'].upper()}: {m['content']}" for m in messages])

        try:
            response = self.client.models.generate_content(
                model=self.model_name,
                contents=full_prompt
            )
            # Helper to strip markdown code blocks from the response
            return self._clean_code(response.text)
        except Exception as e:
            # Return valid Python that prints the error. Using repr() keeps
            # quotes inside the message from breaking the generated code.
            message = f"Error: {e}"
            return f"print({message!r})"

    def _clean_code(self, text):
        if "```" in text:
            try:
                return text.split("```python")[1].split("```")[0].strip()
            except IndexError:
                # Fallback if no python tag or different format
                return text.split("```")[1].strip()
        return text
```
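
The string-splitting in `_clean_code` works for the common case but keeps a stray language tag when the fence is labeled something other than `python`. A regex-based variant is one possible alternative. This is a sketch, and `extract_code` and `FENCE` are hypothetical names, not part of the original code.

```python
import re

FENCE = "`" * 3  # three backticks, built this way so the snippet embeds cleanly

def extract_code(text: str) -> str:
    """Return the body of the first fenced block, or the text itself if none."""
    # Optional language tag after the opening fence, then the body (non-greedy).
    pattern = re.compile(FENCE + r"[a-zA-Z0-9_+-]*\n(.*?)" + FENCE, re.DOTALL)
    match = pattern.search(text)
    return match.group(1).strip() if match else text.strip()

reply = "Sure, here it is:\n" + FENCE + "py\nprint(len(context))\n" + FENCE + "\nDone."
print(extract_code(reply))
```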

### Step 3: The Recursive Agent

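This is the RLM. Notice how `completion` creates a REPL, adds the context, and, crucially, adds an `llm_query` helper to the variables so that generated code can call back into the model. In this simplified version the helper makes a plain LLM call; a fully recursive variant would expose the agent itself (for example as `rlm`) so that sub-calls can write and run code too.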

```python
class SimpleRLM:
    def __init__(self, client):
        self.client = client

    def completion(self, prompt, context=None):
        # 1. Define the Proxy Function
        # This allows the model to "call itself" (or at least call the LLM)
        def llm_query(query):
            return self.client.completion(query)

        # 2. Prepare the Environment
        variables = {
            "context": context,
            "llm_query": llm_query
        }
        repl = SimpleREPL(variables)

        # 3. The System Prompt
        system_msg = (
            "You are a Recursive Language Model. "
            "You have a variable 'context' (str). "
            "You have a function 'llm_query(prompt)' to ask the LLM questions. "
            "Write Python code to process the context. "
            "If the context is large, split it and call llm_query on chunks. "
            "Set 'final_answer' variable if you have the result, or print it. "
            "IMPORTANT: If you define a function, YOU MUST CALL IT at the end."
        )

        # 4. Get the Program
        messages = [
            {"role": "system", "content": system_msg},
            {"role": "user", "content": prompt}
        ]

        # (In a real app, you'd loop this: Thought -> Code -> Obs -> Thought)
        print(f"--- RLM Thinking on: {prompt} ---")
        code = self.client.chat(messages)
        print(f"--- Generated Code ---\n{code}")

        # 5. Execute and Return
        result = repl.execute(code)

        # If the code set a 'final_answer' variable, return that.
        if "final_answer" in repl.variables:
            return repl.variables["final_answer"]
        return result.strip()
```
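
The single-shot version above stops after one program. The "Thought -> Code -> Obs" loop mentioned in the code comment could look roughly like this. It is a sketch under stated assumptions: `ScriptedClient` is a hypothetical stub that replays canned programs in place of a real model, and `rlm_loop` and `run` are names introduced here.

```python
import contextlib
import io

def run(code, variables):
    """Execute code against shared variables and return captured stdout."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        try:
            exec(code, variables)
        except Exception as e:
            print(f"Error: {e}")
    return buffer.getvalue()

class ScriptedClient:
    """Hypothetical stub that replays canned programs instead of calling an LLM."""
    def __init__(self, programs):
        self.programs = iter(programs)

    def chat(self, messages):
        return next(self.programs)

def rlm_loop(client, prompt, context, max_steps=5):
    variables = {"context": context}
    messages = [{"role": "user", "content": prompt}]
    for _ in range(max_steps):
        code = client.chat(messages)               # Thought -> Code
        observation = run(code, variables)         # Code -> Observation
        if "final_answer" in variables:            # the program declared it is done
            return variables["final_answer"]
        # Feed the program and its output back so the next step can build on them.
        messages.append({"role": "assistant", "content": code})
        messages.append({"role": "user", "content": f"Observation:\n{observation}"})
    return "No final answer within the step budget."

client = ScriptedClient([
    "print(len(context))",                         # step 1: inspect the size
    "final_answer = context.split()[-1]",          # step 2: answer
])
answer = rlm_loop(client, "What is the last word?", "one two three")
print(answer)
```

The `max_steps` budget is a design choice made here to guarantee termination; a real system would also want to cap recursion depth and total model calls.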

## Step 4: Recursive Trace

Let's walk through a real Recursive Trace to see the "Active Attention" claim in action.

Imagine you have a dense, 100,000-character document—perhaps a legal contract or a technical manual—and a simple prompt: "Summarize this."

A standard LLM may attempt to eat the elephant in one bite. It forces all 100,000 characters into its active memory simultaneously.
- **The Action**: It attends to every token at once, treating the noise and the signal with equal weight.
- **The Failure**: The "Lost in the Middle" phenomenon kicks in. The model hallucinates or forgets crucial details buried in paragraph 400 because its attention mechanism is stretched too thin.
- **The Cost**: You pay to process all 100,000 characters in a single call, and much of that compute is wasted on "cognitive overload."

The Recursive Language Model acts less like an overwhelmed reader and more like a Project Manager.

### Phase 1: The Assessment (Level 0)
The model receives the input but doesn't read it yet. It checks `len(context)` and sees 100,000. It immediately realizes: "This is above my optimal cognitive load."
It makes an executive decision: "I will split this into 5 manageable chunks and hire sub-agents to process them."

### Phase 2: The Plan (Code Generation)
The model writes and executes a Python script to handle the logistics, creating a Map-Reduce architecture on the fly.

```python
# Example Usage Mockup
# Ensure you have your GOOGLE_API_KEY set in your environment variables before running this.

client = GeminiClient()
rlm = SimpleRLM(client)
# Create a dummy large context: 20 characters x 5,000 = 100,000 characters
large_context = "This is a sentence. " * 5000
answer = rlm.completion("Summarize this text", context=large_context)
print(answer)
```

Output (truncated):

```
--- RLM Thinking on: Summarize this text ---
--- Generated Code ---
context = """
The Amazon rainforest is a vast and biodiverse ecosystem located in the Amazon basin of South America. Covering an area of approximately 8 million square kilometers, it spans across nine countries: Brazil, Peru, Colombia, Venezuela, Ecuador, Bolivia, Guyana, Suriname, and French Guiana. The majority of the rainforest is within Brazil.

The Amazon rainforest is often referred to as the "lungs of the Earth" due to its crucial role in producing oxygen and absorbing carbon dioxide. It plays a vital role in regulating the global climate and maintaining the stability of the planet's ecosystems. However, this label is somewhat misleading as the rainforest itself consumes nearly all of the oxygen it produces through respiration. Its real importance lies in its role in carbon sequestration and influencing regional and global weather patterns.

Biodiversity is exceptionally high in the Amazon rainforest, with an es
```

### Phase 3: The Delegation (Level 1)
Five "Sub-Agents" wake up. Each one receives exactly 20,000 characters—a perfectly manageable workload.
- Agent 1 reads the intro and summarizes it perfectly.
- Agent 3 reads the middle section (usually the "lost" part) with full clarity because, to Agent 3, this is the only context that exists.

They all return their partial summaries to the manager.

### Phase 4: The Synthesis
Back at Level 0, the Manager receives 5 concise summaries. The massive noise of the original document is gone. It calls itself one last time with the compressed context to generate a flowing, cohesive final answer.
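
For reference, the kind of program the manager might write across Phases 2 to 4 could look like the sketch below. It is illustrative, not actual model output: `llm_query` is replaced by a hypothetical stub so the snippet runs without an API key, and the prompt wording is an assumption.

```python
def llm_query(prompt: str) -> str:
    """Hypothetical stub for the `llm_query` helper; a real one calls the model."""
    return f"summary({len(prompt)} chars in prompt)"

context = "This is a sentence. " * 5000  # 100,000 characters

# Map: split into 5 equal chunks and summarize each one independently.
num_chunks = 5
chunk_size = len(context) // num_chunks
chunks = [context[i:i + chunk_size] for i in range(0, len(context), chunk_size)]
partials = [llm_query("Summarize this text:\n" + chunk) for chunk in chunks]

# Reduce: combine the partial summaries in one final call.
final_answer = llm_query("Combine these summaries into one:\n" + "\n".join(partials))

print(len(chunks), [len(c) for c in chunks])
print(final_answer)
```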

By switching to this recursive method, we drastically changed the math of the operation:
- Map Step Load: 20k characters per sub-agent (manageable).
- Reduce Step Load: ~2k tokens of summaries (pure signal).
- Max Cognitive Load: 20k characters in any single call.

We processed a 100k-character document, but the model never had to "hold" more than 20% of it at any given time.

---
The full source code of the original implementation is available at github.com/alexzhang13/rlm.