# Checkpoint with Resume

## Overview

This notebook demonstrates **checkpoint persistence and resumption** in agent workflows: how to save workflow state and restart execution from a specific checkpoint, with no human-in-the-loop requests inside the workflow itself.

### Text Processing Pipeline:

```
Input: "hello world"
    ↓
UpperCaseExecutor → "HELLO WORLD"
    ↓
ReverseTextExecutor → "DLROW OLLEH"
    ↓
SubmitToLowerAgent → Build AgentExecutorRequest
    ↓
LowerAgent (Azure OpenAI) → "dlrow olleh"
    ↓
FinalizeFromAgent → Yield output
```

### Key Concepts:

1. **FileCheckpointStorage**: Persist state to JSON files
2. **Executor State**: Per-executor local state (`ctx.get_state()`, `ctx.set_state()`)
3. **Shared State**: Cross-executor visibility (`ctx.set_shared_state()`)
4. **Superstep Boundaries**: Automatic checkpoint creation
5. **Interactive Resume**: Choose which checkpoint to resume from
6. **Idempotent Execution**: Resume skips completed steps

### What You Learn:

- Configure `FileCheckpointStorage` and `with_checkpointing()`
- Persist executor state for observability
- Share state across executors via `shared_state`
- List and inspect checkpoints programmatically
- Interactively select a checkpoint to resume from
- Understand checkpoint metadata structure

## Prerequisites

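- Azure OpenAI configured through the `AZURE_OPENAI_ENDPOINT` and `AZURE_OPENAI_CHAT_DEPLOYMENT_NAME` environment variables (this notebook loads them from `../../.env`)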
- Azure CLI authentication: `az login`
- Agent Framework installed: `pip install agent-framework`
- Filesystem access for writing checkpoint JSON files
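
The workflow factory later reads both variables with `os.getenv`, so a quick pre-flight check can surface missing configuration before any Azure call is made. This is a minimal sketch; the helper name `missing_env_vars` is hypothetical and not part of Agent Framework.

```python
import os
from typing import Mapping, Optional, Sequence

# Names taken from the os.getenv calls used later in this notebook.
REQUIRED_VARS = ("AZURE_OPENAI_ENDPOINT", "AZURE_OPENAI_CHAT_DEPLOYMENT_NAME")


def missing_env_vars(
    names: Sequence[str] = REQUIRED_VARS,
    env: Optional[Mapping[str, str]] = None,
) -> list[str]:
    """Return the names that are unset or empty in `env` (defaults to os.environ)."""
    source = os.environ if env is None else env
    return [name for name in names if not source.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print(f"Missing environment variables: {', '.join(missing)}")
    else:
        print("Azure OpenAI environment variables are set.")
```

Run it after `load_dotenv(...)` so that values from the `.env` file are taken into account.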

## Setup and Imports

In [None]:
from dotenv import load_dotenv
import asyncio
import os
from pathlib import Path
from typing import TYPE_CHECKING, Any

from agent_framework import (
    AgentExecutor,
    AgentExecutorRequest,
    AgentExecutorResponse,
    ChatMessage,
    Executor,
    FileCheckpointStorage,
    RequestInfoExecutor,
    Role,
    WorkflowBuilder,
    WorkflowContext,
    handler,
)
from agent_framework.azure import AzureOpenAIChatClient
from azure.identity import AzureCliCredential

load_dotenv('../../.env')

if TYPE_CHECKING:
    from agent_framework import Workflow
    from agent_framework._workflows._checkpoint import WorkflowCheckpoint

## Configure Checkpoint Directory

In [None]:
# Define temporary directory for checkpoint files
CHECKPOINT_DIR = Path("./tmp/checkpoints")
CHECKPOINT_DIR.mkdir(parents=True, exist_ok=True)

print(f"Checkpoint directory: {CHECKPOINT_DIR.absolute()}")
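
To see what lands in this directory after a run, the files can be listed directly. The sketch below assumes one `*.json` file per checkpoint, which matches the `glob("*.json")` cleanup used later in `main()`. The helper name `list_checkpoint_files` is hypothetical, and nothing is assumed about the internal layout of each file.

```python
from pathlib import Path


def list_checkpoint_files(directory: Path) -> list[Path]:
    """Return the *.json files in `directory`, oldest first by modification time."""
    files = [p for p in directory.glob("*.json") if p.is_file()]
    # Tie-break on name so the order is stable when mtimes are equal.
    return sorted(files, key=lambda p: (p.stat().st_mtime, p.name))


if __name__ == "__main__":
    for path in list_checkpoint_files(Path("./tmp/checkpoints")):
        print(f"{path.name}  ({path.stat().st_size} bytes)")
```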

## Create Pipeline Executors

### 1. UpperCaseExecutor

**Demonstrates:**
- Executor-local state persistence
- Shared state for cross-executor visibility
- Simple text transformation

In [None]:
class UpperCaseExecutor(Executor):
    """Uppercases the input text and persists both local and shared state."""

    @handler
    async def to_upper_case(self, text: str, ctx: WorkflowContext[str]) -> None:
        result = text.upper()
        print(f"UpperCaseExecutor: '{text}' -> '{result}'")

        # Persist executor-local state (captured in checkpoints)
        prev = await ctx.get_state() or {}
        count = int(prev.get("count", 0)) + 1
        await ctx.set_state({
            "count": count,
            "last_input": text,
            "last_output": result,
        })

        # Write to shared_state for downstream executors
        await ctx.set_shared_state("original_input", text)
        await ctx.set_shared_state("upper_output", result)

        await ctx.send_message(result)
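
The read-increment-write pattern on executor state can be exercised without Azure or the framework. The sketch below uses a hypothetical `InMemoryContext` that mimics only the context methods this handler calls; it is a test double under that assumption, and the real `WorkflowContext` may behave differently (for example in how state is copied or persisted).

```python
from typing import Any, Optional


class InMemoryContext:
    """Hypothetical test double; NOT the agent_framework WorkflowContext."""

    def __init__(self) -> None:
        self._state: Optional[dict] = None
        self.shared: dict = {}
        self.sent: list = []

    async def get_state(self) -> Optional[dict]:
        return self._state

    async def set_state(self, state: dict) -> None:
        self._state = dict(state)

    async def set_shared_state(self, key: str, value: Any) -> None:
        self.shared[key] = value

    async def send_message(self, message: Any) -> None:
        self.sent.append(message)


async def to_upper_case(text: str, ctx: InMemoryContext) -> None:
    """Same state logic as the handler above, minus the framework decorator."""
    result = text.upper()
    prev = await ctx.get_state() or {}  # first call: no state yet, so start from {}
    count = int(prev.get("count", 0)) + 1
    await ctx.set_state({"count": count, "last_input": text, "last_output": result})
    await ctx.set_shared_state("original_input", text)
    await ctx.set_shared_state("upper_output", result)
    await ctx.send_message(result)


async def run_twice() -> InMemoryContext:
    """Invoke the handler logic twice against one context."""
    ctx = InMemoryContext()
    await to_upper_case("hello world", ctx)
    await to_upper_case("second call", ctx)
    return ctx
```

In a notebook cell, `ctx = await run_twice()` leaves `count` at 2 and the shared `original_input` set to the second input, which shows that shared state is overwritten per key while the counter accumulates.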

### 2. ReverseTextExecutor

Simple text reversal with state tracking.

In [None]:
class ReverseTextExecutor(Executor):
    """Reverses the input text and persists local state."""

    @handler
    async def reverse_text(self, text: str, ctx: WorkflowContext[str]) -> None:
        result = text[::-1]
        print(f"ReverseTextExecutor: '{text}' -> '{result}'")

        prev = await ctx.get_state() or {}
        count = int(prev.get("count", 0)) + 1
        await ctx.set_state({
            "count": count,
            "last_input": text,
            "last_output": result,
        })

        await ctx.send_message(result)

### 3. SubmitToLowerAgent

**Demonstrates:**
- Reading shared state written by other executors
- Building an `AgentExecutorRequest` for the LLM agent
- Targeting a specific executor by ID

In [None]:
class SubmitToLowerAgent(Executor):
    """Builds AgentExecutorRequest to send to lowercasing agent."""

    def __init__(self, id: str, agent_id: str):
        super().__init__(id=id)
        self._agent_id = agent_id

    @handler
    async def submit(self, text: str, ctx: WorkflowContext[AgentExecutorRequest]) -> None:
        # Read shared_state written by UpperCaseExecutor
        orig = await ctx.get_shared_state("original_input")
        upper = await ctx.get_shared_state("upper_output")
        print(f"LowerAgent (shared_state): original_input='{orig}', upper_output='{upper}'")

        # Build deterministic prompt for agent
        prompt = f"Convert the following text to lowercase. Return ONLY the transformed text.\n\nText: {text}"

        await ctx.send_message(
            AgentExecutorRequest(messages=[ChatMessage(Role.USER, text=prompt)], should_respond=True),
            target_id=self._agent_id,
        )

### 4. FinalizeFromAgent

Consumes the `AgentExecutorResponse` and yields the final output.

In [None]:
class FinalizeFromAgent(Executor):
    """Consumes the AgentExecutorResponse and yields the final result."""

    @handler
    async def finalize(self, response: AgentExecutorResponse, ctx: WorkflowContext[Any, str]) -> None:
        result = response.agent_run_response.text or ""

        prev = await ctx.get_state() or {}
        count = int(prev.get("count", 0)) + 1
        await ctx.set_state({
            "count": count,
            "last_output": result,
            "final": True,
        })

        await ctx.yield_output(result)

## Build Workflow with Checkpointing

### Graph Structure:

```
UpperCaseExecutor (start)
    ↓
ReverseTextExecutor
    ↓
SubmitToLowerAgent
    ↓
LowerAgent (AgentExecutor)
    ↓
FinalizeFromAgent
```

### Checkpointing Configuration:

- **`.with_checkpointing(checkpoint_storage=...)`**: Enable persistence
- **Superstep boundaries**: Checkpoints are created automatically at the end of each superstep, once the executors active in that superstep have finished
- **State captured**: Executor states, shared state, messages, iteration count
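
To make the superstep idea concrete, here is a toy model in plain Python: a linear pipeline that writes one JSON file after each step and can restart from any of them. It is a sketch for intuition only, not how Agent Framework implements checkpointing; the file names and the `pending_message` key are invented for this example.

```python
import json
from pathlib import Path
from tempfile import TemporaryDirectory

# Toy linear pipeline mirroring the text transformations in this notebook.
STEPS = [
    ("upper", str.upper),
    ("reverse", lambda s: s[::-1]),
    ("lower", str.lower),
]


def run(text: str, directory: str, start: int = 0) -> str:
    """Run steps from index `start`, writing one checkpoint file per step."""
    for i in range(start, len(STEPS)):
        name, fn = STEPS[i]
        text = fn(text)
        checkpoint = {"iteration_count": i, "completed": name, "pending_message": text}
        (Path(directory) / f"superstep_{i}.json").write_text(json.dumps(checkpoint))
    return text


def resume(directory: str, iteration: int) -> str:
    """Reload a checkpoint and continue with the steps that come after it."""
    data = json.loads((Path(directory) / f"superstep_{iteration}.json").read_text())
    return run(data["pending_message"], directory, start=data["iteration_count"] + 1)


def demo() -> tuple:
    """Run the full pipeline, then resume from the first checkpoint."""
    with TemporaryDirectory() as tmp:
        full = run("hello world", tmp)
        resumed = resume(tmp, 0)  # skips "upper", replays "reverse" and "lower"
        return full, resumed


if __name__ == "__main__":
    print(demo())
```

Resuming from the first checkpoint skips the uppercase step and still reaches the same final text, which is the behavior the real workflow relies on when it resumes from a stored checkpoint.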

In [None]:
def create_workflow(checkpoint_storage: FileCheckpointStorage) -> "Workflow":
    """Build workflow with checkpointing enabled."""
    
    # Instantiate pipeline executors
    upper_case_executor = UpperCaseExecutor(id="upper_case_executor")
    reverse_text_executor = ReverseTextExecutor(id="reverse_text_executor")

    # Configure agent for lowercasing
    endpoint = os.getenv("AZURE_OPENAI_ENDPOINT")
    deployment_name = os.getenv("AZURE_OPENAI_CHAT_DEPLOYMENT_NAME")
    chat_client = AzureOpenAIChatClient(
        deployment_name=deployment_name,
        endpoint=endpoint,
        credential=AzureCliCredential()
    )
    lower_agent = AgentExecutor(
        chat_client.create_agent(
            instructions="You transform text to lowercase. Reply with ONLY the transformed text."
        ),
        id="lower_agent",
    )

    submit_lower = SubmitToLowerAgent(id="submit_lower", agent_id=lower_agent.id)
    finalize = FinalizeFromAgent(id="finalize")

    # Build workflow with checkpointing
    return (
        WorkflowBuilder(max_iterations=5)
        .add_edge(upper_case_executor, reverse_text_executor)
        .add_edge(reverse_text_executor, submit_lower)
        .add_edge(submit_lower, lower_agent)
        .add_edge(lower_agent, finalize)
        .set_start_executor(upper_case_executor)
        .with_checkpointing(checkpoint_storage=checkpoint_storage)  # Enable persistence
        .build()
    )

print("✓ Workflow factory created")

## Checkpoint Summary Helper

Displays checkpoint metadata including:
- Checkpoint ID
- Iteration count
- Message count
- Executor states
- Shared state values
- Status

In [None]:
def render_checkpoint_summary(checkpoints: list["WorkflowCheckpoint"]) -> None:
    """Display human-friendly checkpoint metadata using framework summaries."""
    if not checkpoints:
        return

    print("\nCheckpoint summary:")
    for cp in sorted(checkpoints, key=lambda c: c.timestamp):
        summary = RequestInfoExecutor.checkpoint_summary(cp)
        msg_count = sum(len(v) for v in cp.messages.values())
        state_keys = sorted(cp.executor_states.keys())
        orig = cp.shared_state.get("original_input")
        upper = cp.shared_state.get("upper_output")

        line = (
            f"- {summary.checkpoint_id} | iter={summary.iteration_count} | "
            f"messages={msg_count} | states={state_keys}"
        )
        if summary.status:
            line += f" | status={summary.status}"
        line += f" | shared_state: original_input='{orig}', upper_output='{upper}'"
        print(line)

## Main Execution Flow

### Workflow:

1. **Clean checkpoint directory** for deterministic demo
2. **Run workflow** with initial message
3. **Observe events** as they stream
4. **Inspect checkpoints** created during run
5. **Interactive selection** of checkpoint to resume
6. **Resume execution** from selected point

In [None]:
async def main():
    """Run workflow with checkpointing and demonstrate resume."""
    
    # Clear existing checkpoints for clean run
    for file in CHECKPOINT_DIR.glob("*.json"):
        file.unlink()

    # Create checkpoint storage
    checkpoint_storage = FileCheckpointStorage(storage_path=CHECKPOINT_DIR)

    workflow = create_workflow(checkpoint_storage=checkpoint_storage)

    # === PHASE 1: Initial Run ===
    print("\n" + "="*70)
    print("PHASE 1: Initial Workflow Run")
    print("="*70)
    print("\nRunning workflow with initial message...\n")
    
    async for event in workflow.run_stream(message="hello world"):
        print(f"Event: {event}")

    # === PHASE 2: Inspect Checkpoints ===
    print("\n" + "="*70)
    print("PHASE 2: Checkpoint Inspection")
    print("="*70)
    
    all_checkpoints = await checkpoint_storage.list_checkpoints()
    if not all_checkpoints:
        print("No checkpoints found!")
        return

    workflow_id = all_checkpoints[0].workflow_id
    render_checkpoint_summary(all_checkpoints)

    # === PHASE 3: Interactive Resume Selection ===
    print("\n" + "="*70)
    print("PHASE 3: Resume from Checkpoint")
    print("="*70)
    
    sorted_cps = sorted(
        [cp for cp in all_checkpoints if cp.workflow_id == workflow_id],
        key=lambda c: c.timestamp
    )

    print("\nAvailable checkpoints to resume from:")
    for idx, cp in enumerate(sorted_cps):
        summary = RequestInfoExecutor.checkpoint_summary(cp)
        line = f"  [{idx}] id={summary.checkpoint_id} iter={summary.iteration_count}"
        if summary.status:
            line += f" status={summary.status}"
        msg_count = sum(len(v) for v in cp.messages.values())
        line += f" messages={msg_count}"
        print(line)

    user_input = input(
        "\nEnter checkpoint index (or paste checkpoint id) to resume from, or press Enter to skip: "
    ).strip()

    if not user_input:
        print("No checkpoint selected. Exiting without resuming.")
        return

    chosen_cp_id: str | None = None

    # Try as index first
    if user_input.isdigit():
        idx = int(user_input)
        if 0 <= idx < len(sorted_cps):
            chosen_cp_id = sorted_cps[idx].checkpoint_id
    
    # Fall back to direct id match (allow prefix)
    if chosen_cp_id is None:
        for cp in sorted_cps:
            if cp.checkpoint_id.startswith(user_input):
                chosen_cp_id = cp.checkpoint_id
                break

    if chosen_cp_id is None:
        print("Input did not match any checkpoint. Exiting without resuming.")
        return

    # Resume from checkpoint
    new_workflow = create_workflow(checkpoint_storage=checkpoint_storage)

    print(f"\nResuming from checkpoint: {chosen_cp_id}\n")
    async for event in new_workflow.run_stream_from_checkpoint(
        chosen_cp_id,
        checkpoint_storage=checkpoint_storage
    ):
        print(f"Resumed Event: {event}")
    
    print("\n" + "="*70)
    print("Resume completed successfully!")
    print("="*70)
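
The selection logic in `main()` is easier to test when pulled out into a pure function. The sketch below follows the same index-then-prefix order; the name `resolve_checkpoint_id` is hypothetical and the function works on plain id strings rather than checkpoint objects.

```python
from typing import Optional, Sequence


def resolve_checkpoint_id(user_input: str, checkpoint_ids: Sequence[str]) -> Optional[str]:
    """Map user input to a checkpoint id: list index first, then id prefix."""
    text = user_input.strip()
    if not text:
        return None
    # isdecimal() rather than isdigit(), so characters such as "²"
    # are never passed to int().
    if text.isdecimal():
        idx = int(text)
        if 0 <= idx < len(checkpoint_ids):
            return checkpoint_ids[idx]
    # Non-numeric input, or an out-of-range index, falls back to a prefix match.
    for cp_id in checkpoint_ids:
        if cp_id.startswith(text):
            return cp_id
    return None
```

One consequence of this ordering: a numeric input that is out of range as an index is still tried as an id prefix, so `"9"` can select an id that starts with `9` when the list has fewer than ten entries.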

## Run the Demo

In [None]:
await main()

## Expected Output Pattern

```
======================================================================
PHASE 1: Initial Workflow Run
======================================================================

Running workflow with initial message...

UpperCaseExecutor: 'hello world' -> 'HELLO WORLD'
Event: ExecutorInvokeEvent(executor_id=upper_case_executor)
Event: ExecutorCompletedEvent(executor_id=upper_case_executor)
ReverseTextExecutor: 'HELLO WORLD' -> 'DLROW OLLEH'
Event: ExecutorInvokeEvent(executor_id=reverse_text_executor)
Event: ExecutorCompletedEvent(executor_id=reverse_text_executor)
LowerAgent (shared_state): original_input='hello world', upper_output='HELLO WORLD'
Event: ExecutorInvokeEvent(executor_id=submit_lower)
Event: ExecutorInvokeEvent(executor_id=lower_agent)
Event: ExecutorInvokeEvent(executor_id=finalize)

======================================================================
PHASE 2: Checkpoint Inspection
======================================================================

Checkpoint summary:
- abc123... | iter=0 | messages=1 | states=['upper_case_executor'] | shared_state: original_input='hello world', upper_output='HELLO WORLD'
- def456... | iter=1 | messages=1 | states=['reverse_text_executor', 'upper_case_executor'] | shared_state: original_input='hello world', upper_output='HELLO WORLD'
- ghi789... | iter=2 | messages=0 | states=['finalize', 'lower_agent', 'reverse_text_executor', 'submit_lower', 'upper_case_executor'] | shared_state: original_input='hello world', upper_output='HELLO WORLD'

======================================================================
PHASE 3: Resume from Checkpoint
======================================================================

Available checkpoints to resume from:
  [0] id=abc123... iter=0 messages=1
  [1] id=def456... iter=1 messages=1
  [2] id=ghi789... iter=2 messages=0

Enter checkpoint index (or paste checkpoint id) to resume from, or press Enter to skip: 1

Resuming from checkpoint: def456...

LowerAgent (shared_state): original_input='hello world', upper_output='HELLO WORLD'
Resumed Event: ExecutorInvokeEvent(executor_id=submit_lower)
Resumed Event: ExecutorInvokeEvent(executor_id=lower_agent)
Resumed Event: ExecutorInvokeEvent(executor_id=finalize)

======================================================================
Resume completed successfully!
======================================================================
```

## Key Takeaways

### 1. State Persistence Patterns

#### Executor State (Local)
```python
# Write
await ctx.set_state({"count": 1, "last_input": text})

# Read
state = await ctx.get_state() or {}
count = state.get("count", 0)
```

**Use for:**
- Iteration counts
- Last processed inputs/outputs
- Executor-specific metadata
- Debugging information

#### Shared State (Cross-Executor)
```python
# Write (any executor)
await ctx.set_shared_state("original_input", text)

# Read (any executor)
orig = await ctx.get_shared_state("original_input")
```

**Use for:**
- Original inputs
- Workflow-wide configuration
- Cross-executor data sharing
- Audit trails

### 2. Checkpoint Configuration

```python
storage = FileCheckpointStorage(storage_path="./checkpoints")

workflow = (
    WorkflowBuilder()
    .add_edge(...)
    .with_checkpointing(checkpoint_storage=storage)
    .build()
)
```

**Automatic Capture:**
- Executor states (via `ctx.set_state()`)
- Shared state (via `ctx.set_shared_state()`)
- Messages in flight
- Iteration counts
- Workflow metadata

### 3. Checkpoint Inspection

```python
# List all checkpoints
checkpoints = await storage.list_checkpoints()

# Filter by workflow ID
workflow_cps = [cp for cp in checkpoints if cp.workflow_id == wf_id]

# Get summary
summary = RequestInfoExecutor.checkpoint_summary(cp)
```

**Available Fields:**
- `checkpoint_id`: Unique identifier
- `workflow_id`: Parent workflow
- `iteration_count`: Superstep number
- `timestamp`: Creation time
- `executor_states`: Dict of executor states
- `shared_state`: Workflow-wide state
- `messages`: Pending messages per executor
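
These fields are enough to pick a resume point programmatically instead of interactively. The sketch below uses a hypothetical `CheckpointRecord` dataclass as a stand-in for the framework's checkpoint object, and it assumes `timestamp` is an ISO-8601 string in one consistent format so that string comparison matches chronological order.

```python
from dataclasses import dataclass, field
from typing import Any, Iterable, Optional


@dataclass
class CheckpointRecord:
    """Hypothetical stand-in exposing the fields listed above."""

    checkpoint_id: str
    workflow_id: str
    iteration_count: int
    timestamp: str  # assumed ISO-8601, same format for every record
    executor_states: dict[str, Any] = field(default_factory=dict)
    shared_state: dict[str, Any] = field(default_factory=dict)
    messages: dict[str, list[Any]] = field(default_factory=dict)


def pending_message_count(cp: CheckpointRecord) -> int:
    """Total number of pending messages across all executors."""
    return sum(len(v) for v in cp.messages.values())


def latest_checkpoint(
    checkpoints: Iterable[CheckpointRecord], workflow_id: str
) -> Optional[CheckpointRecord]:
    """Return the newest checkpoint for `workflow_id`, or None if there is none."""
    candidates = [cp for cp in checkpoints if cp.workflow_id == workflow_id]
    return max(candidates, key=lambda cp: cp.timestamp, default=None)
```

A checkpoint with zero pending messages typically corresponds to a finished run, so a caller that wants to replay work may prefer an earlier checkpoint over the latest one.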

### 4. Resume Patterns

#### Basic Resume
```python
async for event in workflow.run_stream_from_checkpoint(
    checkpoint_id,
    checkpoint_storage=storage
):
    process(event)
```

#### Resume with Responses (HITL)
```python
async for event in workflow.run_stream_from_checkpoint(
    checkpoint_id,
    checkpoint_storage=storage,
    responses={request_id: human_response}
):
    process(event)
```

### 5. Superstep Boundaries

Checkpoints are created automatically:
- At the end of each superstep, after its executors complete
- Before `RequestInfoExecutor` pauses
- At workflow completion

**Resume behavior:**
- Skips completed executors
- Restores in-flight messages
- Continues from exact state

### 6. Production Patterns

Topics to plan for when moving this pattern to production:
- Database-backed storage
- Checkpoint expiration/cleanup
- Error recovery strategies
- Idempotent executor design
- Multi-tenant checkpointing
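
For the expiration/cleanup item, a file-based store can be pruned by file age. The sketch below is one possible approach, assuming one `*.json` file per checkpoint (as in this notebook's directory) and that no run still needs the older files. The helper name `prune_old_checkpoints` and the `now` parameter are hypothetical, and a database-backed store would need a different mechanism.

```python
import time
from pathlib import Path
from typing import Optional


def prune_old_checkpoints(
    directory: Path, max_age_seconds: float, now: Optional[float] = None
) -> list[str]:
    """Delete *.json files older than `max_age_seconds`; return deleted names, sorted."""
    current = time.time() if now is None else now
    deleted = []
    # Materialize the listing first so files are not removed mid-iteration.
    for path in list(directory.glob("*.json")):
        if current - path.stat().st_mtime > max_age_seconds:
            path.unlink()
            deleted.append(path.name)
    return sorted(deleted)


if __name__ == "__main__":
    removed = prune_old_checkpoints(Path("./tmp/checkpoints"), max_age_seconds=24 * 3600)
    print(f"Removed {len(removed)} checkpoint file(s)")
```

Passing `now` explicitly keeps the function deterministic in tests; in normal use it defaults to the current time.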