# Instant Compaction with Session Memory

Traditional compaction is slow: when you hit the context limit, you wait for a summary.

With **Instant compaction** the session memory is proactively generated once a soft token threshold is reached. Once the user triggers a compaction or a hard limit is reached, the summary is already available, so the user doesn't need to wait.

Result: Instant compaction, no waiting.


```
TRADITIONAL COMPACTION (slow)
─────────────────────────────
Turn 1 → Turn 2 → Turn 3 → ... → Turn N → CONTEXT FULL!
                                              │
                                              ▼
                                    ┌─────────────────┐
                                    │ Generate summary│
                                    │ ( USER WAITS !) │
                                    └─────────────────┘
                                              │
                                              ▼
                                         Continue


SESSION MEMORY COMPACTION (instant)
───────────────────────────────────
Turn 1 → Turn 2 → ... → Turn K → Turn K+1 → ... → Turn N → ..  → CONTEXT FULL!
                            │                         │            │
                (soft threshold met:              (update          │
                   10k tokens init)                trigger)        │
                            │                                      │
                            │                         │            │
                            ▼                         ▼            │
                       ┌────────┐                ┌────────┐        │
                       │ Update │                │ Update │        │
                       │ memory │ (background)   │ memory │        │
                       └────────┘                └────────┘        │
                            │                         │            │
                            ▼                         ▼            ▼
                     📝 session-memory.md ──────────────────► INSTANT SWAP!
                       (continuously updated)
```

**Update triggers:** The first summary is generated once the conversation reaches the initial 10k tokens. After that, updates can run after every turn, or periodically at natural breakpoints (e.g. every ~5k tokens or after 3+ tool calls).
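As a rough illustration of the trigger rules above, the check can be written as a small policy object. This is a sketch only: the `UpdateTriggerPolicy` class and its field names are hypothetical, and the default numbers are the example values from the text rather than tuned recommendations.

```python
from dataclasses import dataclass


@dataclass
class UpdateTriggerPolicy:
    """Hypothetical policy object; thresholds mirror the example values in the text."""
    init_tokens: int = 10_000          # soft threshold for the first summary
    update_every_tokens: int = 5_000   # token growth that triggers an update
    update_every_tool_calls: int = 3   # tool calls that trigger an update

    def should_update(
        self,
        has_memory: bool,
        total_tokens: int,
        tokens_at_last_update: int,
        tool_calls_since_update: int,
    ) -> bool:
        # No memory yet: wait for the soft threshold before the first summary.
        if not has_memory:
            return total_tokens >= self.init_tokens
        # Memory exists: update on enough new tokens OR enough tool calls.
        grew_enough = total_tokens - tokens_at_last_update >= self.update_every_tokens
        enough_tools = tool_calls_since_update >= self.update_every_tool_calls
        return grew_enough or enough_tools


policy = UpdateTriggerPolicy()
first = policy.should_update(False, 10_500, 0, 0)       # soft threshold reached
quiet = policy.should_update(True, 12_000, 10_500, 1)   # too little new activity
tools = policy.should_update(True, 12_000, 10_500, 3)   # three tool calls since update
```

Keeping the decision in one place makes it easy to swap in a different breakpoint rule without touching the chat loop.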

## Fundamentals: writing a compaction prompt

Make sure you have a well-structured session memory prompt.

Some best practices include:
- Use chain-of-thought before summarizing: analyze first, then output
- Enumerate exactly what to preserve: file paths, code snippets, errors, user corrections
- Weight recency heavily, since the end of the conversation is the active context
- Require verbatim quotes for next steps to prevent task drift
- Use structured sections with token budgets per section
- Include a "Current State" section that always reflects the moment of compaction

Some pitfalls include:
- Vague prompts like "summarize this conversation" produce lossy output
- Treating all messages equally loses the active working context
- Paraphrasing next steps introduces subtle drift that compounds
- Omitting error history causes the model to retry failed approaches
- Dropping user corrections makes the model revert to old behaviors
- Omitting token limits lets one section consume the entire summary
- Summarizing for human readability instead of model continuity
- Asking the agent to compress tool-call results into the summary; the agent can retrieve those results again later if it needs them
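One way to enforce per-section budgets is to check the generated summary after the fact. The sketch below is an assumption-laden illustration, not part of the notebook's pipeline: the helper names are hypothetical, it assumes the summary uses `## ` headings like the prompt that follows, and it counts whitespace-separated words as a rough stand-in for tokens.

```python
import re


def section_word_counts(summary: str) -> dict[str, int]:
    """Map each '## Heading' in a markdown summary to the word count of its body."""
    counts: dict[str, int] = {}
    current = None
    for line in summary.splitlines():
        match = re.match(r"^##\s+(.*)", line)
        if match:
            current = match.group(1).strip()
            counts[current] = 0
        elif current is not None:
            counts[current] += len(line.split())
    return counts


def over_budget(summary: str, max_words: int = 500) -> list[str]:
    """Return the headings whose bodies exceed max_words (hypothetical budget)."""
    return [name for name, n in section_word_counts(summary).items() if n > max_words]


sample = (
    "## User Intent\nRefactor the parser module.\n"
    "## Active Work\n" + "detail " * 501 + "\n"
)
too_long = over_budget(sample)
```

A caller could re-run summarization, or truncate the offending section, whenever `over_budget` returns a non-empty list.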


In [None]:
SESSION_CREATION_PROMPT = """
<analysis-instructions>
Before generating your summary, analyze the transcript in <think>...</think> tags:
1. What did the user originally request? (Exact phrasing)
2. What actions succeeded? What failed and why?
3. Did the user correct or redirect the assistant at any point?
4. What was actively being worked on at the end?
5. What tasks remain incomplete or pending?
6. What specific details (IDs, paths, values, names) must survive compression?
</analysis-instructions>

<summary-format>
## User Intent
The user's original request and any refinements. Use direct quotes for key requirements.
If the user's goal evolved during the conversation, capture that progression.

## Completed Work
Actions successfully performed. Be specific:
- What was created, modified, or deleted
- Exact identifiers (file paths, record IDs, URLs, names)
- Specific values, configurations, or settings applied

## Errors & Corrections
- Problems encountered and how they were resolved
- Approaches that failed (so they aren't retried)
- User corrections: "don't do X", "actually I meant Y", "that's wrong because..."
Capture corrections verbatim; these represent learned preferences.

## Active Work
What was in progress when the session ended. Include:
- The specific task being performed
- Direct quotes showing exactly where work left off
- Any partial results or intermediate state

## Pending Tasks
Remaining items the user requested that haven't been started.
Distinguish between "explicitly requested" and "implied/assumed."

## Key References
Important details needed to continue:
- Identifiers: IDs, paths, URLs, names, keys
- Values: numbers, dates, configurations, credentials (redacted)
- Context: relevant background information, constraints, preferences
- Citations: sources referenced during the conversation
</summary-format>

<preserve-rules>
Always preserve when present:
- Exact identifiers (IDs, paths, URLs, keys, names)
- Error messages verbatim
- User corrections and negative feedback
- Specific values, formulas, or configurations
- Technical constraints or requirements discovered
- The precise state of any in-progress work
</preserve-rules>

<compression-rules>
- Weight recent messages more heavily; the end of the transcript is the active context
- Omit pleasantries, acknowledgments, and filler ("Sure!", "Great question")
- Omit system context that will be re-injected separately
- Keep each section under 500 words; condense older content to make room for recent
- If you must cut details, preserve: user corrections > errors > active work > completed work
</compression-rules>
"""

## Traditional compacting example
In traditional compaction, you generate one summary once the token threshold is reached.

In [22]:
# setup, we are using haiku for demo purposes
import anthropic
from dataclasses import dataclass, field

client = anthropic.Anthropic()
MODEL = "claude-haiku-4-5-20251001"

In [35]:
import time

class TraditionalCompactingChatSession:
    """Traditional chat session with compaction after the fact."""
    def __init__(self, context_limit: int = 700):
        self.context_limit = context_limit
        self.messages = []
        self.current_tokens = 0
        self.tokens_before_compaction = None  # Track for showing reduction
        self.summary = None
    
    def compact(self):
        prev_msg_count = len(self.messages)
        self.tokens_before_compaction = self.current_tokens
       
        compaction_prompt = SESSION_CREATION_PROMPT + "\n\nTranscript:\n"
        for msg in self.messages:
            role = "User" if msg["role"] == "user" else "Assistant"
            compaction_prompt += f"{role}: {msg['content']}\n"
        
        start_time = time.perf_counter()
        response = client.messages.create(
            model=MODEL,
            max_tokens=1024,
            system="You are a helpful assistant that summarizes conversations.",
            messages=[{"role": "user", "content": compaction_prompt}]
        )
        elapsed = time.perf_counter() - start_time
        
        # Generate new summary message
        self.summary = response.content[0].text
        self.messages = [{
            "role": "user",
            "content": f"""This session is being continued from a previous conversation. Here is the session memory:\n{self.summary}\n\nContinue from where we left off."""
        }]
        print(f"\n{'=' * 60}")
        curr_msg_count = len(self.messages)
        print(f"🔄 Compaction messages: {prev_msg_count} → {curr_msg_count}")
        print(f"⏱️  Compaction time: {elapsed:.2f}s (user waiting...)")
    
    def chat(self, user_message: str):
        if self.current_tokens >= self.context_limit:
            print("\n🧹 Context limit exceeded, compacting session memory...")
            self.compact()
        
        self.messages.append({"role": "user", "content": user_message})
        
        response = client.messages.create(
            model=MODEL,
            max_tokens=1024,
            system="You are a helpful coding assistant. Be concise but thorough.",
            messages=self.messages
        )
        
        assistant_message = response.content[0].text
        self.messages.append({"role": "assistant", "content": assistant_message})
        
        self.current_tokens = response.usage.input_tokens
        
        # Show token reduction if we just compacted
        if self.tokens_before_compaction is not None:
            reduction = self.tokens_before_compaction - self.current_tokens
            pct = (reduction / self.tokens_before_compaction) * 100
            print(f"✅ Tokens reduced: {self.tokens_before_compaction:,} → {self.current_tokens:,} ({reduction:,} tokens saved, {pct:.0f}% reduction)")
            print(f"{'=' * 60}")
            self.tokens_before_compaction = None
      
        return assistant_message, response.usage



### Example use of traditional compaction

In [41]:
session = TraditionalCompactingChatSession()

messages = [
    "Explain Python decorators with a simple example.",
    "Now show me a decorator that logs function arguments.",
    "How do I make a decorator that accepts parameters?",
]

print("Starting conversation with traditional compacting chat session...\n")

turn_count = 0
 
for i, message in enumerate(messages, 1):
    response, usage = session.chat(message)
    turn_count += 1
    print(
        f"\n{'=' * 60}\n"
        f"Turn {turn_count:2d}: Input={usage.input_tokens:7,} tokens | "
        f"Output={usage.output_tokens:5,} tokens | "
        f"Messages={len(session.messages):2d}"
    )
    print(f"\nUser: {message}\nAssistant: {response}\n{'-'*40}\n")

Starting conversation with traditional compacting chat session...


Turn  1: Input=     48 tokens | Output=  418 tokens | Messages= 2

User: Explain Python decorators with a simple example.
Assistant: # Python Decorators Explained

A **decorator** is a function that modifies or enhances another function or class without changing its source code. It wraps a function and executes code before and/or after the wrapped function runs.

## Simple Example

```python
def my_decorator(func):
    def wrapper(*args, **kwargs):
        print("Something before the function")
        result = func(*args, **kwargs)
        print("Something after the function")
        return result
    return wrapper

@my_decorator
def say_hello(name):
    print(f"Hello, {name}!")

say_hello("Alice")
```

**Output:**
```
Something before the function
Hello, Alice!
Something after the function
```

## How It Works

1. `my_decorator` takes a function as input
2. `wrapper` is a new function that:
   - Runs code **before*

In [42]:
response, _ = session.chat("What did we just talk about?")
print("\nFinal assistant response:")
print(response)


🧹 Context limit exceeded, compacting session memory...

🔄 Compaction messages: 6 → 1
⏱️  Compaction time: 5.97s (user waiting...)
✅ Tokens reduced: 963 → 721 (242 tokens saved, 25% reduction)

Final assistant response:
We just finished a **three-part tutorial on Python decorators**, progressing from basics to advanced:

1. **Basic decorators** – Simple wrapper functions using `def decorator(func)` pattern with a nested `wrapper` function
2. **Logging decorators** – Capturing function arguments and return values, introducing `@wraps(func)` from `functools` to preserve metadata
3. **Parameterized decorators** – The "three-level nesting" pattern where decorators themselves accept arguments:
   - `@repeat(times=3)` – calls a function multiple times
   - `@rate_limit(max_calls, time_window)` – throttles function calls

The key insight was that **parameterized decorators have three nested functions**: outer (parameters) → middle (decorator) → inner (wrapper).



As a result, the user experiences a wait whenever compaction occurs. It is only a few seconds in this example, but for long-context compaction the wait can be much longer.

## Instant Compaction


The key insight: **build the session memory in the background** so it's ready when you need it.

```
Turn 1 → Turn 2 → ... → Turn K  → Turn K+1 → ... → CONTEXT FULL!
                           │           │                 │
                     (threshold)  (update)          INSTANT!
                           ↓           ↓                 ↓
                    [Background]  [Background]    [Just swap in
                     memory init   memory update   pre-built memory]
```

The `InstantCompactingChatSession` class uses **threading** for background execution:
1. **`threading.Thread`** - runs memory updates in background without blocking
2. **Thread-safe state** - uses `threading.Lock` to safely update shared memory
3. **Daemon threads** - background work doesn't prevent program exit
4. **Instant compaction** - when context is full, just swap in the pre-built memory
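Stripped of the API calls, the threading pattern in the list above looks roughly like the following. This is a minimal sketch under stated assumptions: `BackgroundSummary` and `slow_summarize` are hypothetical names, and the `time.sleep` stands in for a real summarization request.

```python
import threading
import time


class BackgroundSummary:
    """Keeps a summary fresh in a daemon thread (illustrative, hypothetical class)."""

    def __init__(self, summarize):
        self._summarize = summarize   # assumed callable: list[str] -> str
        self._lock = threading.Lock()
        self._thread = None
        self._summary = None

    def _worker(self, snapshot):
        result = self._summarize(snapshot)   # slow call runs outside the lock
        with self._lock:                     # lock only guards the shared write
            self._summary = result

    def trigger(self, messages) -> bool:
        """Start an update unless one is already in flight."""
        if self._thread is not None and self._thread.is_alive():
            return False
        snapshot = list(messages)            # copy so later appends don't race
        self._thread = threading.Thread(
            target=self._worker, args=(snapshot,), daemon=True
        )
        self._thread.start()
        return True

    def wait(self, timeout: float = 5.0):
        if self._thread is not None:
            self._thread.join(timeout=timeout)

    def read(self):
        with self._lock:
            return self._summary


def slow_summarize(msgs):
    time.sleep(0.2)   # stand-in for an API call
    return f"{len(msgs)} messages summarized"


memory = BackgroundSummary(slow_summarize)
first = memory.trigger(["hi", "hello"])    # starts the worker
second = memory.trigger(["hi", "hello"])   # skipped: update still running
memory.wait()
summary = memory.read()
```

Taking a snapshot before starting the thread means the worker never sees messages appended after the trigger, which is the same reason the class below copies `self.messages` before handing them off.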

In [24]:
import threading
import time


class InstantCompactingChatSession:
    """
    Maintains session memory via incremental background updates.
    
    Key insight: By updating memory in the background after each turn,
    the summary is already ready when compaction is needed - instant swap!
    """

    def __init__(
        self,
        context_limit: int = 2000,
        min_tokens_to_init: int = 500,
        min_tokens_between_updates: int = 300,
    ):
        # Thresholds
        self.context_limit = context_limit
        self.min_tokens_to_init = min_tokens_to_init
        self.min_tokens_between_updates = min_tokens_between_updates

        # Conversation state
        self.messages = []
        self.current_tokens = 0

        # Session memory state
        self.session_memory = None
        self.last_summarized_index = 0
        self.tokens_at_last_update = 0

        # Background update tracking
        self._update_thread: threading.Thread | None = None
        self.last_update_time = None
        self._lock = threading.Lock()

    def chat(self, user_message: str):
        """Process a chat turn with background session memory updates."""
        if self.current_tokens >= self.context_limit:
            self.compact()

        self.messages.append({"role": "user", "content": user_message})

        response = client.messages.create(
            model=MODEL,
            max_tokens=1024,
            system="You are a helpful coding assistant. Be concise but thorough.",
            messages=self.messages,
        )

        assistant_message = response.content[0].text
        self.messages.append({"role": "assistant", "content": assistant_message})

        self.current_tokens = response.usage.input_tokens

        # KEY DIFFERENCE: Trigger background memory update if needed
        if self._should_init_memory() or self._should_update_memory():
            self._trigger_background_update()
            status = "initializing" if self.session_memory is None else "updating"
            print(f"   [Background] Session memory {status}...")

        return assistant_message, response.usage
    
    # Helper methods to determine when to init/update/compact
    def _should_init_memory(self) -> bool:
        return (
            self.session_memory is None
            and self.current_tokens >= self.min_tokens_to_init
        )

    # Helper method to determine if memory should be updated
    def _should_update_memory(self) -> bool:
        if self.session_memory is None:
            return False
        tokens_since = self.current_tokens - self.tokens_at_last_update
        return tokens_since >= self.min_tokens_between_updates

    def _build_transcript(self, messages: list[dict]) -> str:
        lines = []
        for msg in messages:
            role = "User" if msg["role"] == "user" else "Assistant"
            lines.append(f"{role}: {msg['content']}")
        return "\n\n".join(lines)

    def _create_session_memory(self, messages: list[dict]) -> str:
        """Generate initial session memory from messages."""
        transcript = self._build_transcript(messages)

        response = client.messages.create(
            model=MODEL,
            max_tokens=1024,
            system="""You are a session memory agent. Compress the conversation into a structured summary 
that preserves all information needed to continue work seamlessly. Optimize for the assistant's 
ability to continue working, not human readability.""",
            messages=[
                {
                    "role": "user",
                    "content": f"""Conversation transcript:
{transcript}

Create session memory using these instructions:
{SESSION_CREATION_PROMPT}

First analyze in <think>...</think> tags, then output the structured summary.""",
                }
            ],
        )
        return response.content[0].text

    def _update_session_memory(self, new_messages: list[dict]) -> str:
        """Update existing session memory with new messages."""
        transcript = self._build_transcript(new_messages)

        response = client.messages.create(
            model=MODEL,
            max_tokens=1024,
            system="""You are a session memory agent. Update the existing session memory with new information 
from the recent conversation. Preserve important existing details while integrating new content.""",
            messages=[
                {
                    "role": "user",
                    "content": f"""Current session memory:
{self.session_memory}

New messages to integrate:
{transcript}

Update the session memory following these guidelines:
{SESSION_CREATION_PROMPT}

Output only the updated session memory (no analysis tags needed for updates).""",
                }
            ],
        )
        return response.content[0].text

    def _background_memory_update(
        self, messages_snapshot: list[dict], snapshot_index: int, current_tokens: int
    ):
        """Run session memory update in a background thread."""
        try:
            if self.session_memory is None:
                new_memory = self._create_session_memory(messages_snapshot)
            else:
                new_messages = messages_snapshot[self.last_summarized_index :]
                if not new_messages:
                    return
                new_memory = self._update_session_memory(new_messages)

            # Update state (thread-safe)
            with self._lock:
                self.session_memory = new_memory
                self.last_summarized_index = snapshot_index
                self.tokens_at_last_update = current_tokens
                self.last_update_time = time.time()

        except Exception as e:
            print(f"   [Background] Error updating memory: {e}")

    def _trigger_background_update(self):
        """Trigger a background session memory update."""
        if self._update_thread is not None and self._update_thread.is_alive():
            return

        messages_snapshot = self.messages.copy()
        snapshot_index = len(messages_snapshot)
        current_tokens = self.current_tokens

        self._update_thread = threading.Thread(
            target=self._background_memory_update,
            args=(messages_snapshot, snapshot_index, current_tokens),
            daemon=True,
        )
        self._update_thread.start()

    def wait_for_memory(self, timeout: float = 30.0):
        """Wait for any pending background update to complete."""
        if self._update_thread is not None and self._update_thread.is_alive():
            self._update_thread.join(timeout=timeout)

    def compact(self):
        """INSTANT compaction using pre-built session memory."""
        prev_msg_count = len(self.messages)

        if self.session_memory is None:
            if self._update_thread is not None and self._update_thread.is_alive():
                print("   ⏳ Waiting for background memory update...")
                self._update_thread.join(timeout=30.0)

            if self.session_memory is None:
                print("   ⚠️  No pre-built memory, creating synchronously...")
                start = time.perf_counter()
                self.session_memory = self._create_session_memory(self.messages)
                elapsed = time.perf_counter() - start
                print(f"   ⏱️  Took {elapsed:.2f}s (but should be instant normally!)")
                self.last_summarized_index = len(self.messages)

        unsummarized = self.messages[self.last_summarized_index :]

        summary_message = {
            "role": "user",
            "content": f"""This session is being continued from a previous conversation.

Here is the session memory:
{self.session_memory}

Continue from where we left off.""",
        }

        self.messages = [summary_message] + unsummarized
        self.last_summarized_index = 1

        print(f"\n{'=' * 60}")
        print(f"⚡ INSTANT COMPACTION! Messages: {prev_msg_count} → {len(self.messages)}")
        print(f"   Kept {len(unsummarized)} unsummarized messages")
        print(f"   Session memory was pre-built (no wait time!)")
        print(f"{'=' * 60}")

    



### Example use of Instant Compaction

In [44]:
# Low thresholds for demo - in production you'd use higher values
session = InstantCompactingChatSession(
    context_limit=700,
    min_tokens_to_init=200,
    min_tokens_between_updates=150,
)

messages = [
    "Explain Python decorators with a simple example.",
    "Now show me a decorator that logs function arguments.",
    "How do I make a decorator that accepts parameters?",
]

print("=" * 60)
print("INSTANT COMPACTING SESSION")
print("=" * 60)
print("Session memory builds in background, so compaction is instant!\n")

turn_count= 0
for i, message in enumerate(messages, 1):
    turn_count += 1
    response, usage = session.chat(message)
    
    memory_status = "ready" if session.session_memory else "not yet"
    print(
        f"\n{'=' * 60}\n"
        f"Turn {turn_count:2d}: Input={usage.input_tokens:7,} tokens | "
        f"Output={usage.output_tokens:5,} tokens | "
        f"Messages={len(session.messages):2d}"
    )

INSTANT COMPACTING SESSION
Session memory builds in background, so compaction is instant!


Turn  1: Input=     48 tokens | Output=  393 tokens | Messages= 2
   [Background] Session memory initializing...

Turn  2: Input=    454 tokens | Output=  476 tokens | Messages= 4
   [Background] Session memory initializing...

Turn  3: Input=    943 tokens | Output=  564 tokens | Messages= 6


In [45]:
response, _ = session.chat("What did we just talk about?")
print("\nFinal assistant response:")
print(response)


⚡ INSTANT COMPACTION! Messages: 6 → 3
   Kept 2 unsummarized messages
   Session memory was pre-built (no wait time!)
   [Background] Session memory updating...

Final assistant response:
We just covered **parameterized decorators**: how to create decorators that accept their own parameters.

The key concept: you need **three levels of nesting** instead of two:

1. **Outer function**: accepts decorator parameters (e.g., `repeat(times=3)`)
2. **Middle function**: the decorator itself (takes the function to decorate)
3. **Inner function**: the wrapper (executes the actual behavior)

I showed two examples:
- **`@repeat(times=3)`**: runs a function multiple times
- **`@rate_limit(max_calls=3, time_window=10)`**: prevents function calls exceeding a rate limit

This was a follow-up to our earlier conversation about Python decorators and logging decorators.


In [26]:
# Side-by-side comparison: Traditional vs Instant compaction

print("=" * 70)
print("COMPARISON: Traditional vs Instant Compaction")
print("=" * 70)

messages = [
    "Explain Python decorators with a simple example.",
    "Now show me a decorator that logs function arguments.",
    "How do I make a decorator that accepts parameters?",
]

# Traditional approach
print("\n📊 TRADITIONAL COMPACTION:")
print("-" * 40)
traditional = TraditionalCompactingChatSession(context_limit=500)

for i, msg in enumerate(messages, 1):
    response, usage = traditional.chat(msg)
    print(f"  Turn {i}: {usage.input_tokens:,} tokens")

# Force a compaction to measure time
start = time.perf_counter()
traditional.compact()
traditional_compaction_time = time.perf_counter() - start

# Instant approach  
print("\n⚡ INSTANT COMPACTION:")
print("-" * 40)
instant = InstantCompactingChatSession(
    context_limit=500,
    min_tokens_to_init=100,
    min_tokens_between_updates=100,
)

for i, msg in enumerate(messages, 1):
    response, usage = instant.chat(msg)
    print(f"  Turn {i}: {usage.input_tokens:,} tokens | Memory: {'ready' if instant.session_memory else 'building...'}")

# Wait for background to finish
instant.wait_for_memory()

# Measure instant compaction time
start = time.perf_counter()
instant.compact()
instant_compaction_time = time.perf_counter() - start

print("\n" + "=" * 70)
print("RESULTS:")
print(f"  Traditional compaction time: {traditional_compaction_time:.2f}s (user waiting)")
print(f"  Instant compaction time:     {instant_compaction_time:.4f}s (just a swap!)")
print(f"  Speedup: {traditional_compaction_time/max(instant_compaction_time, 0.001):.0f}x faster")
print("=" * 70)

COMPARISON: Traditional vs Instant Compaction

📊 TRADITIONAL COMPACTION:
----------------------------------------
  Turn 1: 48 tokens
  Turn 2: 444 tokens
  Turn 3: 915 tokens

🔄 Compaction triggered! Messages: 6 → 1
⏱️  Compaction time: 5.69s (user waiting...)

⚡ INSTANT COMPACTION:
----------------------------------------
  Turn 1: 48 tokens | Memory: building...
   [Background] Session memory initializing...
  Turn 2: 452 tokens | Memory: building...
   [Background] Session memory initializing...
  Turn 3: 1,024 tokens | Memory: building...

⚡ INSTANT COMPACTION! Messages: 6 → 3
   Kept 2 unsummarized messages
   Session memory was pre-built (no wait time!)

RESULTS:
  Traditional compaction time: 5.69s (user waiting)
  Instant compaction time:     0.0002s (just a swap!)
  Speedup: 5692x faster


## Advanced: Adding Prompt Caching


Prompt caching makes the background updates much cheaper: cached input tokens are billed at roughly **one tenth** of the normal input price. The trick:
1. Pass the **full conversation** to the background summarizer
2. Add `cache_control` markers so subsequent requests hit the cache
3. Only the new "summarize this" instruction is billed at full price

```
Main chat:         [System + Turn 1 + Turn 2 + ... + Turn N]
                                    ↓
                              (cached automatically)

Background update: [System + Turn 1 + Turn 2 + ... + Turn N] + [Summarize instruction]
                              ↑                                        ↑
                         CACHE HIT! (10x cheaper)              Only this is billed
```

### How the Caching Works

The key is in `_add_cache_control()` and `_create_session_memory_cached()`:

```python
# 1. Mark the last conversation message with cache_control
{
    "role": "user",
    "content": [{
        "type": "text",
        "text": msg["content"],
        "cache_control": {"type": "ephemeral"}  # <-- This creates a cache breakpoint
    }]
}

# 2. Also mark the system prompt
system=[{
    "type": "text",
    "text": "You are a session memory agent...",
    "cache_control": {"type": "ephemeral"}
}]
```

**Why this works:**
- The first background update creates a cache entry for `[System + Messages]`
- Subsequent updates with the same message prefix get **cache hits**
- Only the new summarization instruction is billed at full price
- Cache entries have a 5-minute TTL, so rapid updates benefit most

**Cost math:**
- Without caching: 5,000 tokens × $3.00/1M = $0.015 per update
- With caching: 500 new tokens × $3.00/1M + 4,500 cached × $0.30/1M = $0.00285 per update
- **Savings: ~80%** on background summarization costs
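The arithmetic above can be reproduced with a few lines of Python. This is a sketch under the same illustrative assumptions as the bullets: the `update_cost` helper is hypothetical, the prices are the example figures ($3.00 and $0.30 per million input tokens), and it ignores any extra charge for writing the cache entry on the first request.

```python
def update_cost(
    total_tokens: int,
    cached_tokens: int,
    base_per_mtok: float = 3.00,     # example base input price, USD per 1M tokens
    cached_per_mtok: float = 0.30,   # example cache-read price, USD per 1M tokens
) -> float:
    """Input cost in USD for one background update (hypothetical helper)."""
    fresh_tokens = total_tokens - cached_tokens
    return (fresh_tokens * base_per_mtok + cached_tokens * cached_per_mtok) / 1_000_000


uncached = update_cost(5_000, cached_tokens=0)
cached = update_cost(5_000, cached_tokens=4_500)
savings = 1 - cached / uncached

print(f"${uncached:.5f} -> ${cached:.5f} ({savings:.0%} saved)")
# prints: $0.01500 -> $0.00285 (81% saved)
```

The savings scale with the cached fraction of the prompt, so long conversations with short summarization instructions benefit most.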

In [None]:
class CachedInstantCompactingSession:
    """
    Session memory with prompt caching for cheaper background updates.
    
    Key optimization: By passing the full conversation with cache_control markers,
    background summarization requests get cache hits on 90%+ of input tokens.
    """

    def __init__(
        self,
        context_limit: int = 2000,
        min_tokens_to_init: int = 500,
        min_tokens_between_updates: int = 300,
        system_prompt: str = "You are a helpful coding assistant. Be concise but thorough.",
    ):
        self.context_limit = context_limit
        self.min_tokens_to_init = min_tokens_to_init
        self.min_tokens_between_updates = min_tokens_between_updates
        self.system_prompt = system_prompt

        self.messages = []
        self.current_tokens = 0

        self.session_memory = None
        self.last_summarized_index = 0
        self.tokens_at_last_update = 0

        self._update_thread = None
        self._lock = threading.Lock()

        # Track cache stats
        self.total_cache_read = 0
        self.total_cache_created = 0
        self.total_input_tokens = 0

    def _should_init_memory(self) -> bool:
        return self.session_memory is None and self.current_tokens >= self.min_tokens_to_init

    def _should_update_memory(self) -> bool:
        if self.session_memory is None:
            return False
        return (self.current_tokens - self.tokens_at_last_update) >= self.min_tokens_between_updates

    def _should_compact(self) -> bool:
        return self.current_tokens >= self.context_limit

    def _add_cache_control(self, messages: list[dict]) -> list[dict]:
        """
        Add cache_control markers to messages for prompt caching.
        
        Strategy: Mark the last message with cache_control so the entire
        conversation prefix gets cached for subsequent requests.
        """
        if not messages:
            return messages

        cached_messages = []
        for i, msg in enumerate(messages):
            if i == len(messages) - 1:
                # Last message: add cache_control marker
                cached_messages.append({
                    "role": msg["role"],
                    "content": [
                        {
                            "type": "text",
                            "text": msg["content"],
                            "cache_control": {"type": "ephemeral"},
                        }
                    ],
                })
            else:
                cached_messages.append(msg)

        return cached_messages

    def _create_session_memory_cached(self, messages: list[dict]) -> tuple[str, dict]:
        """
        Generate session memory using the FULL conversation with caching.
        
        This passes the entire conversation + summarize instruction, so subsequent
        calls with the same conversation prefix will hit the cache.
        """
        # Build conversation with cache marker on last message
        cached_messages = self._add_cache_control(messages)

        # Add the summarization instruction as a new user message
        cached_messages.append({
            "role": "user",
            "content": f"""Based on our conversation above, create a session memory summary.

{SESSION_CREATION_PROMPT}

Output the structured summary directly.""",
        })

        response = client.messages.create(
            model=MODEL,
            max_tokens=1024,
            system=[
                {
                    "type": "text",
                    "text": """You are a session memory agent. When asked, compress the conversation 
into a structured summary that preserves all information needed to continue work seamlessly.""",
                    "cache_control": {"type": "ephemeral"},
                }
            ],
            messages=cached_messages,
        )

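        # Extract cache stats. usage.input_tokens counts only the tokens that were
        # neither read from nor written to the cache. The cache fields can be
        # missing or None, so fall back to 0.
        cache_stats = {
            "cache_read": getattr(response.usage, "cache_read_input_tokens", 0) or 0,
            "cache_created": getattr(response.usage, "cache_creation_input_tokens", 0) or 0,
            "input_tokens": response.usage.input_tokens,
        }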

        return response.content[0].text, cache_stats

    def _background_memory_update(
        self, messages_snapshot: list[dict], snapshot_index: int, current_tokens: int
    ):
        """Run cached session memory update in background thread."""
        try:
            new_memory, cache_stats = self._create_session_memory_cached(messages_snapshot)

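            with self._lock:
                self.total_cache_read += cache_stats["cache_read"]
                self.total_cache_created += cache_stats["cache_created"]
                self.total_input_tokens += cache_stats["input_tokens"]

                # Apply the result only if the history it summarized is still in
                # place. compact() may have replaced self.messages while this
                # request was running, which would make snapshot_index stale.
                if self.messages[:snapshot_index] == messages_snapshot:
                    self.session_memory = new_memory
                    self.last_summarized_index = snapshot_index
                    self.tokens_at_last_update = current_tokens

            # Show cache performance. input_tokens excludes cached tokens, so the
            # hit rate is measured against the full prompt size.
            prompt_tokens = (
                cache_stats["input_tokens"] + cache_stats["cache_read"] + cache_stats["cache_created"]
            )
            if cache_stats["cache_read"] > 0:
                pct = (cache_stats["cache_read"] / prompt_tokens) * 100
                print(f"   [Cache] {cache_stats['cache_read']:,} read ({pct:.0f}% hit rate)")
            else:
                print(f"   [Cache] {cache_stats['cache_created']:,} created (first request)")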

        except Exception as e:
            print(f"   [Background] Error: {e}")

    def _trigger_background_update(self):
        if self._update_thread is not None and self._update_thread.is_alive():
            return

        self._update_thread = threading.Thread(
            target=self._background_memory_update,
            args=(self.messages.copy(), len(self.messages), self.current_tokens),
            daemon=True,
        )
        self._update_thread.start()

    def wait_for_memory(self, timeout: float = 30.0):
        if self._update_thread is not None and self._update_thread.is_alive():
            self._update_thread.join(timeout=timeout)

    def compact(self):
        prev_msg_count = len(self.messages)

        if self.session_memory is None:
            if self._update_thread is not None and self._update_thread.is_alive():
                print("   ⏳ Waiting for background update...")
                self._update_thread.join(timeout=30.0)

            if self.session_memory is None:
                print("   ⚠️  Creating memory synchronously...")
                self.session_memory, _ = self._create_session_memory_cached(self.messages)
                self.last_summarized_index = len(self.messages)

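        with self._lock:
            # Keep only the messages the current memory does not cover yet
            unsummarized = self.messages[self.last_summarized_index :]

            self.messages = [
                {
                    "role": "user",
                    "content": f"Session memory:\n{self.session_memory}\n\nContinue from where we left off.",
                }
            ] + unsummarized
            self.last_summarized_index = 1
            # Token counts restart from the compacted history. Reset the baseline,
            # otherwise the next update would not trigger until the conversation
            # outgrew its pre-compaction size.
            self.tokens_at_last_update = 0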

        print(f"\n{'=' * 60}")
        print(f"⚡ INSTANT COMPACTION! Messages: {prev_msg_count} → {len(self.messages)}")
        print(f"{'=' * 60}")

    def chat(self, user_message: str):
        if self._should_compact():
            self.compact()

        self.messages.append({"role": "user", "content": user_message})

        response = client.messages.create(
            model=MODEL,
            max_tokens=1024,
            system=self.system_prompt,
            messages=self.messages,
        )

        assistant_message = response.content[0].text
        self.messages.append({"role": "assistant", "content": assistant_message})
        self.current_tokens = response.usage.input_tokens

        if self._should_init_memory() or self._should_update_memory():
            self._trigger_background_update()
            print("   [Background] Updating session memory...")

        return assistant_message, response.usage

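    def get_cache_savings(self) -> dict:
        """Estimate input-cost savings from caching, in base-input-token units."""
        # usage.input_tokens excludes cached tokens, so the full prompt volume is
        # the sum of uncached, cache-read, and cache-written tokens.
        total_prompt = self.total_input_tokens + self.total_cache_read + self.total_cache_created

        if total_prompt == 0:
            savings_pct = 0.0
        else:
            # Assumed pricing: cache reads cost 0.1x base input, and cache writes
            # cost 1.25x (5-minute ephemeral cache). Check current pricing.
            actual_cost = (
                self.total_input_tokens
                + self.total_cache_read * 0.1
                + self.total_cache_created * 1.25
            )
            savings_pct = (total_prompt - actual_cost) / total_prompt * 100

        # Same keys on every path, so callers can index the result safely
        return {
            "total_input": total_prompt,
            "cache_read": self.total_cache_read,
            "cache_created": self.total_cache_created,
            "savings_pct": savings_pct,
        }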