# Compaction Cookbook: Incremental Session Memory Strategy

This notebook demonstrates an efficient compaction strategy that uses **incremental background summarization** instead of summarizing everything at compaction time.

## The Problem

Traditional compaction summarizes the entire conversation when context gets full. This is:
- **Slow**: Requires a blocking API call at the moment the user is waiting
- **Disruptive**: The user experiences latency at the worst possible time

## The Solution: Session Memory

Instead, we maintain a **running summary** that updates incrementally in the background:
1. Periodically summarize new messages into a "session memory"
2. Track which messages have been summarized
3. At compaction time, just use the pre-computed summary + unsummarized messages

**Key benefit**: The compaction itself is instant - no API call needed when context is full.

**Trade-off**: This adds overhead from periodic summarization calls, so it doesn't reduce total API cost. The value is in eliminating user-facing latency.
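
The three steps above can be sketched offline, without any API call. This is a minimal illustration, not the implementation used later in this notebook: `RunningSummary` and `fake_summarize` are hypothetical names, and the stub summarizer only records how many messages it was given.

```python
from typing import Callable

Message = dict[str, str]


def fake_summarize(new_messages: list[Message], existing: str) -> str:
    """Hypothetical stand-in for a model call: notes how many messages were folded in."""
    return f"{existing}[{len(new_messages)} msgs]"


class RunningSummary:
    """Hypothetical helper: a summary plus the number of messages it already covers."""

    def __init__(self, summarize: Callable[[list[Message], str], str]):
        self.summarize = summarize
        self.summary = ""
        self.covered = 0  # count of leading messages already folded into the summary

    def update(self, messages: list[Message]) -> None:
        """Steps 1 and 2: summarize only the new messages, then move the boundary."""
        new = messages[self.covered:]
        if new:
            self.summary = self.summarize(new, self.summary)
            self.covered = len(messages)

    def compact(self, messages: list[Message]) -> tuple[str, list[Message]]:
        """Step 3: no summarization here, just the stored summary and the unsummarized tail."""
        return self.summary, messages[self.covered:]


history = [{"role": "user", "content": f"message {i}"} for i in range(5)]
memory = RunningSummary(fake_summarize)
memory.update(history[:2])  # background update after 2 messages
memory.update(history[:4])  # background update after 4 messages

summary, tail = memory.compact(history)
print(summary)  # [2 msgs][2 msgs]
print(tail)     # [{'role': 'user', 'content': 'message 4'}]
```

Note that `compact` never calls the summarizer, which is why compaction itself adds no latency.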

## Setup

In [None]:
import anthropic
from anthropic.types import MessageParam

client = anthropic.Anthropic()
MODEL = "claude-sonnet-4-5-20250929"

## Session Memory Manager

This class manages the incremental summarization strategy:

In [None]:
from dataclasses import dataclass


@dataclass
class SessionMemory:
    """Manages incremental session summarization for fast compaction."""
    min_tokens_to_init: int = 1000  # Tokens before first summarization
    min_tokens_between_updates: int = 500  # Tokens between updates
    summary: str = ""
    last_summarized_count: int = 0  # Message count at last summarization
    tokens_at_last_update: int = 0
    current_tokens: int = 0

    def update_tokens(self, tokens: int):
        """Update current token count (call after each API response)."""
        self.current_tokens = tokens

    def should_summarize(self) -> bool:
        """Check if we should run background summarization."""
        if self.current_tokens < self.min_tokens_to_init:
            return False

        tokens_since = self.current_tokens - self.tokens_at_last_update
        return tokens_since >= self.min_tokens_between_updates
 
    def compact_conversation(self, messages: list[MessageParam], summarize_fn) -> str:
        """Fold messages added since the last update into the running summary.

        Returns the current summary, which is unchanged if there is nothing new.
        """
        new_messages = messages[self.last_summarized_count :]
        if not new_messages:
            return self.summary

        # summarize_fn(new_messages, existing_summary) returns the updated summary text
        self.summary = summarize_fn(new_messages, self.summary)
        self.last_summarized_count = len(messages)
        self.tokens_at_last_update = self.current_tokens

        print(f"  [Background] Summarized {len(new_messages)} messages at {self.current_tokens} tokens")

        return self.summary

## Summarization Function

The summarization function calls Claude to extract key information:

In [None]:
def summarize_messages(messages: list[MessageParam], existing_memory: str) -> str:
    """Use Claude to incrementally summarize conversation messages."""
    conversation_text = "\n".join(f"{msg['role'].upper()}: {msg['content']}" for msg in messages)

    if existing_memory:
        prompt = f"""Update this session memory with new conversation turns.

<existing_summary>
{existing_memory}
</existing_summary>

<new_messages>
{conversation_text}
</new_messages>

Return only the updated summary."""
    else:
        prompt = f"""Summarize this conversation concisely.

<messages>
{conversation_text}
</messages>

Capture: topics discussed, key decisions, important context. Return only the summary."""

    response = client.messages.create(
        model=MODEL,
        max_tokens=500,
        messages=[{"role": "user", "content": prompt}],
    )

    return response.content[0].text

## Real Conversation Loop with Session Memory

Now let's run a real conversation with Claude while demonstrating background summarization:

In [None]:
# Create session memory with a low init threshold so summarization triggers in this short demo
session_memory = SessionMemory(
    min_tokens_to_init=200,
    min_tokens_between_updates=500,
)

SYSTEM_PROMPT = """You are a helpful coding assistant. Keep responses concise but informative."""

user_questions = [
    "What are Python decorators and why are they useful?",
    "Show me a simple decorator example that logs function calls.",
    "How do I create a decorator that accepts arguments?",
    "Now explain Python's async/await syntax briefly.",
    "What's the difference between asyncio.gather and asyncio.wait?",
]

print("=" * 60)
print("CONVERSATION WITH BACKGROUND SUMMARIZATION")
print("=" * 60)

messages: list[MessageParam] = []
for i, question in enumerate(user_questions, 1):
    print(f"\n[Turn {i}] USER: {question}")
    print("-" * 40)

    messages.append({"role": "user", "content": question})

    response = client.messages.create(
        model=MODEL,
        max_tokens=1024,
        system=SYSTEM_PROMPT,
        messages=messages,
    )

    reply = response.content[0].text
    assistant_msg: MessageParam = {"role": "assistant", "content": reply}
    messages.append(assistant_msg)

    # Approximate the context size as this turn's input plus output tokens
    total_tokens = response.usage.input_tokens + response.usage.output_tokens
    session_memory.update_tokens(total_tokens)

    print(f"ASSISTANT RESPONSE: {reply}")

    if not session_memory.should_summarize():
        print(f"\nConversation at {total_tokens} tokens, no summarization needed yet.")
        continue

    print(f"\nConversation at {total_tokens} tokens, running background summarization...")
    # Fold the new messages into session memory (run synchronously here for simplicity).
    # The full history stays in `messages` so that last_summarized_count remains a valid
    # index; the history is only replaced when compaction actually happens.
    session_memory.compact_conversation(messages, summarize_messages)

    print("\n" + "=" * 60)
    print("SESSION MEMORY SUMMARY")
    print("=" * 60)
    print(session_memory.summary)

In [None]:
print("=" * 60)
print("COMPACTION DEMONSTRATION")
print("=" * 60)

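print("\nBefore compaction:")
print(f"  Total messages: {len(messages)}")

# Compaction is just a list slice: reuse the pre-computed summary and keep only
# the messages added since the last background summarization.
summary = session_memory.summary
kept_messages = messages[session_memory.last_summarized_count :]

print("\nAfter compaction:")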
print(f"  Messages kept (unsummarized): {len(kept_messages)}")
for msg in kept_messages:
    content = msg["content"]
    preview = content[:50] + "..." if len(content) > 50 else content
    print(f"    - {msg['role']}: {preview}")

print(f"\nPre-computed summary:")
print("-" * 40)
print(summary)
print("-" * 40)
print("\n(Compaction was instant - no API call needed!)")

## Key Benefits

1. **Instant compaction**: No API call needed at compaction time - summary already exists
2. **Non-blocking**: Background summarization doesn't interrupt the user
3. **No lost context**: Messages after the last summarization are preserved verbatim
4. **Configurable thresholds**: Control when summarization happens based on token count
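
To make the threshold behaviour concrete, here is a small sketch that replays the same two-threshold rule as `should_summarize` over a list of token counts and reports which turns would trigger a summarization. The function name `summarization_turns` and the token counts are made up for illustration.

```python
def summarization_turns(
    token_counts: list[int],
    min_tokens_to_init: int,
    min_tokens_between_updates: int,
) -> list[int]:
    """Return the 1-based turns at which a background summarization would run."""
    triggered = []
    tokens_at_last_update = 0
    for turn, tokens in enumerate(token_counts, start=1):
        if tokens < min_tokens_to_init:
            continue  # conversation is still too short to summarize
        if tokens - tokens_at_last_update >= min_tokens_between_updates:
            triggered.append(turn)
            tokens_at_last_update = tokens
    return triggered


# Hypothetical context sizes after each of six turns
counts = [150, 400, 700, 1200, 1500, 1800]

low = summarization_turns(counts, min_tokens_to_init=200, min_tokens_between_updates=500)
high = summarization_turns(counts, min_tokens_to_init=1000, min_tokens_between_updates=500)
print(low)   # [3, 4, 6]
print(high)  # [4, 6]
```

Raising `min_tokens_to_init` delays the first summarization; raising `min_tokens_between_updates` spaces out the later ones, which means fewer background calls but a longer unsummarized tail at compaction time.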

## Production Considerations

In a real implementation (like Claude Code's `sessionMemory.ts`):

- **Background execution**: Summarization runs in a forked process so that it does not block the main conversation
- **Tool call awareness**: Don't summarize mid-tool-use, which would leave orphaned tool results
- **File persistence**: Session memory is saved to disk (`.claude/session-memory.md`)
- **Threshold tuning**: Default is 10K tokens to init, 5K between updates
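
As an illustration of the tool call point, the sketch below shows one possible way to pick a summarization boundary that never separates a `tool_result` block from the `tool_use` block it answers. It is an assumption about how such a check could look, not a description of any production code: `has_block`, `safe_boundary`, and the `get_weather` tool are hypothetical, while the `tool_use` and `tool_result` block types follow the Messages API content format.

```python
from typing import Any

Message = dict[str, Any]


def has_block(message: Message, block_type: str) -> bool:
    """True if the message content is a list containing a block of the given type."""
    content = message["content"]
    if isinstance(content, str):
        return False
    return any(isinstance(b, dict) and b.get("type") == block_type for b in content)


def safe_boundary(messages: list[Message], boundary: int) -> int:
    """Move the boundary earlier while the kept tail would start with a tool_result.

    messages[:boundary] get summarized and messages[boundary:] are kept verbatim,
    so a tail that starts with a tool_result would have lost its tool_use.
    """
    while 0 < boundary < len(messages) and has_block(messages[boundary], "tool_result"):
        boundary -= 1
    return boundary


conversation: list[Message] = [
    {"role": "user", "content": "What's the weather in Paris?"},
    {"role": "assistant", "content": [
        {"type": "tool_use", "id": "toolu_01", "name": "get_weather", "input": {"city": "Paris"}},
    ]},
    {"role": "user", "content": [
        {"type": "tool_result", "tool_use_id": "toolu_01", "content": "18C, cloudy"},
    ]},
    {"role": "assistant", "content": "It is 18C and cloudy in Paris."},
]

print(safe_boundary(conversation, 2))  # 1: keep the tool_use together with its result
print(safe_boundary(conversation, 3))  # 3: the tail starts with plain text, nothing to fix
```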

## Continuing After Compaction

After compaction, we can continue the conversation using the summary as context:

In [None]:
# Build the compacted message history
compacted_messages: list[MessageParam] = [
    {"role": "user", "content": f"[Previous conversation summary]\n{summary}\n\n[Continuing conversation]"},
    {"role": "assistant", "content": "I understand. I have context from our previous discussion. How can I help?"},
]
compacted_messages.extend(kept_messages)

# Continue the conversation
print("=" * 60)
print("CONTINUING CONVERSATION AFTER COMPACTION")
print("=" * 60)

follow_up = "Based on what we discussed, how would I combine a decorator with an async function?"
print(f"\nUSER: {follow_up}")
print("-" * 40)

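# Send the follow-up on top of the compacted history
response = client.messages.create(
    model=MODEL,
    max_tokens=1024,
    system=SYSTEM_PROMPT,
    messages=compacted_messages + [{"role": "user", "content": follow_up}],
)
tokens = response.usage.input_tokens + response.usage.output_tokens
print(f"ASSISTANT: {response.content[0].text}")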

print(f"\n[Context: {len(compacted_messages)} messages, {tokens} tokens instead of {session_memory.current_tokens}]")

## Summary

This notebook demonstrated the **incremental session memory** pattern for efficient context compaction:

| Approach | At Compaction Time | Cost Distribution |
|----------|-------------------|-------------------|
| **Traditional** | Summarize all messages (slow) | All cost at once |
| **Session Memory** | Use pre-computed summary (instant) | Cost spread over time |

The key insight is that **summarization work can be done incrementally in the background**, making the actual compaction operation nearly instant. This pattern is particularly valuable for long-running conversations where context management is critical.
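
The demo loop in this notebook calls the summarizer inline for simplicity. A rough sketch of actually moving that work off the main path is shown below, using a worker thread from the standard library. Everything here is illustrative: `slow_summarize` is a stand-in that sleeps instead of calling a model, and a thread is only one possible choice of background mechanism.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_summarize(new_messages: list[str], existing: str) -> str:
    """Hypothetical stand-in for a model call that takes a while to return."""
    time.sleep(0.2)
    return (existing + " " + " | ".join(new_messages)).strip()


executor = ThreadPoolExecutor(max_workers=1)
messages = ["turn 1", "turn 2", "turn 3"]

# Hand the worker a snapshot so later appends do not change what gets summarized
snapshot = list(messages)
pending = executor.submit(slow_summarize, snapshot, "")

# The conversation keeps going while the summary is being computed
messages.append("turn 4")

# Collect the result later, for example just before the next update or at compaction
summary = pending.result()
kept = messages[len(snapshot):]
executor.shutdown()

print(summary)  # turn 1 | turn 2 | turn 3
print(kept)     # ['turn 4']
```

Recording the snapshot length at submit time is what keeps the summarized count consistent with the summary, even though the message list grew in the meantime.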

## Evaluation: Response Time with Full vs Compacted Context

Let's measure how much faster follow-up responses are when using the compacted context vs the full conversation history.

In [None]:
import time


def timed_chat(user_message: str, messages: list[MessageParam]) -> tuple[str, float, int]:
    """Send a message and return response, elapsed time, and input tokens."""
    start_time = time.perf_counter()

    response = client.messages.create(
        model=MODEL,
        max_tokens=1024,
        system=SYSTEM_PROMPT,
        messages=messages + [{"role": "user", "content": user_message}],
    )

    elapsed = time.perf_counter() - start_time
    return response.content[0].text, elapsed, response.usage.input_tokens


# Build fresh message lists for fair comparison (before any follow-ups)

# Full context: original conversation messages
full_context_messages = messages.copy()

# Compacted context: summary + unsummarized messages
compacted_context_messages: list[MessageParam] = [
    {"role": "user", "content": f"[Previous conversation summary]\n{summary}\n\n[Continuing conversation]"},
    {"role": "assistant", "content": "I understand. I have context from our previous discussion. How can I help?"},
]
compacted_context_messages.extend(kept_messages)

# The follow-up question to test
follow_up_question = "Can you give me a quick example combining decorators with async?"

print("=" * 60)
print("RESPONSE TIME COMPARISON: FULL vs COMPACTED CONTEXT")
print("=" * 60)
print(f"\nQuestion: {follow_up_question}")

# Test 1: Full context
print("\n" + "-" * 60)
print("[1] FULL CONTEXT (original conversation)")
print("-" * 60)
print(f"Messages: {len(full_context_messages)} | ", end="")
full_response, full_time, full_tokens = timed_chat(follow_up_question, full_context_messages)
print(f"Input tokens: {full_tokens} | Response time: {full_time:.2f}s")
print(f"\nAnswer:\n{full_response}")

# Test 2: Compacted context
print("\n" + "-" * 60)
print("[2] COMPACTED CONTEXT (session memory)")
print("-" * 60)
print(f"Messages: {len(compacted_context_messages)} | ", end="")
compact_response, compact_time, compact_tokens = timed_chat(
    follow_up_question, compacted_context_messages
)
print(f"Input tokens: {compact_tokens} | Response time: {compact_time:.2f}s")
print(f"\nAnswer:\n{compact_response}")

# Results
print("\n" + "=" * 60)
print("COMPARISON")
print("=" * 60)

token_reduction = full_tokens - compact_tokens
token_reduction_pct = (token_reduction / full_tokens) * 100
time_saved = full_time - compact_time
time_saved_pct = (time_saved / full_time) * 100 if full_time > 0 else 0

print(f"\nToken reduction: {token_reduction:,} tokens ({token_reduction_pct:.1f}% smaller)")
print(f"Time saved: {time_saved:.2f}s ({time_saved_pct:.1f}% faster)")
print(f"\nFull context:     {full_tokens:,} tokens → {full_time:.2f}s")
print(f"Compacted context: {compact_tokens:,} tokens → {compact_time:.2f}s")

### Evaluation Results

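The cell above prints the measured token counts and response times for one request per context, so the exact timings vary from run to run. Qualitatively, the comparison looks like this: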

| Metric | Full Context | Compacted Context |
|--------|--------------|-------------------|
| **Input Tokens** | All messages | Summary + recent |
| **Response Time** | Baseline | Faster |

**Why compaction speeds up responses:**
- Fewer input tokens = faster time-to-first-token
- Smaller context = lower cost *per subsequent turn*

**Important trade-off**: Session memory does NOT reduce total API cost. It adds overhead from periodic background summarization calls. The benefits are:

1. **No compaction latency** - user never waits for a summarization call
2. **Faster subsequent responses** - smaller context after compaction
3. **Lower cost per turn** - after compaction, each API call is cheaper

The value is in **user experience** (no blocking) and **per-turn efficiency** after compaction, not total cost reduction.
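
A small worked calculation shows why incremental summarization tends to cost more in total than a single summarization pass. The figures are hypothetical (a 10,000-token conversation, updates every 2,500 tokens, a 500-token summary), prompt wrapper tokens are ignored, and the prices are the same per-million-token values used in the cost cell of this notebook.

```python
INPUT_COST_PER_M = 3.00    # dollars per million input tokens
OUTPUT_COST_PER_M = 15.00  # dollars per million output tokens


def cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one call given its input and output token counts."""
    return (input_tokens * INPUT_COST_PER_M + output_tokens * OUTPUT_COST_PER_M) / 1_000_000


def one_shot_cost(conversation_tokens: int, summary_tokens: int) -> float:
    """Traditional compaction: read the whole conversation once, write one summary."""
    return cost(conversation_tokens, summary_tokens)


def incremental_cost(conversation_tokens: int, chunk_tokens: int, summary_tokens: int) -> float:
    """Session memory: each update reads a new chunk plus the existing summary."""
    total = 0.0
    has_summary = False
    for start in range(0, conversation_tokens, chunk_tokens):
        chunk = min(chunk_tokens, conversation_tokens - start)
        input_tokens = chunk + (summary_tokens if has_summary else 0)
        total += cost(input_tokens, summary_tokens)
        has_summary = True
    return total


one_shot = one_shot_cost(10_000, 500)
incremental = incremental_cost(10_000, 2_500, 500)
print(f"One-shot:    ${one_shot:.4f}")     # One-shot:    $0.0375
print(f"Incremental: ${incremental:.4f}")  # Incremental: $0.0645
```

Under these assumptions the incremental approach rewrites the summary four times and re-reads it three times, which is where the extra cost comes from.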

In [None]:
# Cost evaluation

# Pricing (per million tokens) - Sonnet 4.5
INPUT_COST_PER_M = 3.00
OUTPUT_COST_PER_M = 15.00


def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate cost in dollars."""
    return (input_tokens * INPUT_COST_PER_M + output_tokens * OUTPUT_COST_PER_M) / 1_000_000


print("=" * 60)
print("COST COMPARISON: SESSION MEMORY vs TRADITIONAL")
print("=" * 60)

# Session Memory: background summarization overhead + compacted follow-up
print("\n[Session Memory]")
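# Rough per-call estimates for the background summaries, not measured values
bg_summarize_calls = 2
bg_input = bg_summarize_calls * 800
bg_output = bg_summarize_calls * 200
print(f"  Background summaries: ~{bg_input:,} input, ~{bg_output:,} output")
print(f"  Follow-up:            {compact_tokens:,} input")

sm_input = bg_input + compact_tokens
# Approximate output tokens from response length (about 4 characters per token)
sm_output = bg_output + len(compact_response) // 4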
sm_cost = estimate_cost(sm_input, sm_output)
print(f"  Total: ${sm_cost:.4f}")

# Traditional: compaction + full-context follow-up
print("\n[Traditional]")
print(f"  Compaction:  {full_tokens:,} input")
print(f"  Follow-up:   {full_tokens:,} input")

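# The full context is read twice: once to summarize it, once for the follow-up
trad_input = full_tokens * 2
# Assume ~300 output tokens for the summary, plus the follow-up at about 4 characters per token
trad_output = 300 + len(full_response) // 4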
trad_cost = estimate_cost(trad_input, trad_output)
print(f"  Total: ${trad_cost:.4f}")

# Result
print("\n" + "-" * 60)
cost_diff = sm_cost - trad_cost
print(f"Difference: ${cost_diff:+.4f}")
if cost_diff > 0:
    print("→ Session memory costs more, but eliminates compaction latency")
else:
    print("→ Session memory costs less due to smaller follow-up context")