# Magentic Workflow with Checkpointing

## Overview

This notebook demonstrates **checkpointing** in Magentic workflows: saving workflow state and resuming execution from a saved point. Checkpointing is critical for:

1. **Long-running workflows**: Save progress to avoid re-execution on failures
2. **Human-in-the-loop**: Pause for human review/approval, then resume
3. **Cost optimization**: Avoid re-running expensive LLM operations
4. **Debugging**: Reproduce and analyze specific execution states

### Key Concepts:

1. **FileCheckpointStorage**: Persists workflow state to disk
2. **Checkpoint Capture**: Automatic state snapshots at key points
3. **Resume Execution**: Continue from saved state with new inputs
4. **Request/Response Pattern**: Handle human interactions via checkpoints
5. **Multi-Stage Checkpointing**: Multiple save/resume cycles

### Checkpointing Lifecycle:

```
Workflow Start
    ↓
Execute → Checkpoint 1 (Plan Review Request)
    ↓
Pause & Save State
    ↓
Human Review
    ↓
Resume with Response
    ↓
Continue Execution → Checkpoint 2 (Intermediate State)
    ↓
Final Result
```

### Storage Structure:

- **Checkpoint Directory**: Specified path for state files
- **State Files**: JSON-serialized workflow states
- **Executor Context**: Agent conversation histories
- **Orchestrator State**: Plan, round count, agent states
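
To make the storage structure concrete, here is a minimal sketch of a JSON round trip for a state snapshot. The field names (`checkpoint_id`, `orchestrator`, `executors`, `pending_requests`) and the `write_state` / `read_state` helpers are illustrative assumptions, not the framework's actual on-disk schema.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical snapshot layout; the real field names are defined by the framework.
example_state = {
    "checkpoint_id": "example-0001",
    "orchestrator": {"plan": ["research", "compute", "report"], "round_count": 1},
    "executors": {"ResearcherAgent": {"messages": []}},
    "pending_requests": ["plan-review"],
}


def write_state(directory: Path, state: dict) -> Path:
    """Serialize one state snapshot to <directory>/<checkpoint_id>.json."""
    path = directory / f"{state['checkpoint_id']}.json"
    path.write_text(json.dumps(state, indent=2), encoding="utf-8")
    return path


def read_state(path: Path) -> dict:
    """Load a snapshot back into a dictionary."""
    return json.loads(path.read_text(encoding="utf-8"))


# Round trip through a temporary directory to show nothing is lost.
with tempfile.TemporaryDirectory() as tmp:
    saved = write_state(Path(tmp), example_state)
    restored = read_state(saved)
    round_trip_ok = restored == example_state
```

Because the snapshot is plain JSON, it can be opened in any editor when debugging a paused workflow.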

## Prerequisites

- OpenAI API key configured: `OPENAI_API_KEY` environment variable
- Agent Framework installed: `pip install agent-framework`
- Write permissions for checkpoint directory
- Special models for agents:
  - ResearcherAgent: `gpt-4o-search-preview`
  - CoderAgent: OpenAI Assistants with code interpreter

## Setup and Imports

In [None]:
import asyncio
import logging
import os
from pathlib import Path

from dotenv import load_dotenv
from agent_framework import (
    ChatAgent,
    FileCheckpointStorage,
    HostedCodeInterpreterTool,
    MagenticAgentDeltaEvent,
    MagenticAgentMessageEvent,
    MagenticBuilder,
    MagenticCallbackEvent,
    MagenticCallbackMode,
    MagenticFinalResultEvent,
    MagenticOrchestratorMessageEvent,
    MagenticPlanReviewRequestEvent,
    MagenticWorkflow,
    RequestInfoEvent,
    WorkflowOutputEvent,
)
from agent_framework.openai import OpenAIChatClient, OpenAIResponsesClient

logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger(__name__)

# Load environment variables from .env file
load_dotenv('../../.env')


## Configure Checkpoint Storage

### FileCheckpointStorage

- **checkpoint_path**: Directory for state files
- Creates directory if it doesn't exist
- Each checkpoint gets unique identifier
- Supports multiple concurrent workflows

### Best Practices:

- Use absolute paths
- Ensure write permissions
- Consider cleanup policies (old checkpoints)
- Monitor storage usage for long-running workflows
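
The first, second, and fourth practices can be sketched with the standard library alone. The helper names `prepare_checkpoint_dir` and `directory_size_bytes` are hypothetical and not part of the Agent Framework.

```python
import os
import tempfile
from pathlib import Path


def prepare_checkpoint_dir(path: str) -> Path:
    """Return an absolute, existing, writable directory for checkpoints."""
    directory = Path(path).expanduser().resolve()
    directory.mkdir(parents=True, exist_ok=True)
    if not os.access(directory, os.W_OK):
        raise PermissionError(f"Checkpoint directory is not writable: {directory}")
    return directory


def directory_size_bytes(directory: Path) -> int:
    """Sum the sizes of all regular files below the directory."""
    return sum(p.stat().st_size for p in directory.rglob("*") if p.is_file())


# Demo in a temporary location so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = prepare_checkpoint_dir(str(Path(tmp) / "nested" / "checkpoints"))
    (demo_dir / "state.json").write_text("12345", encoding="utf-8")
    demo_is_absolute = demo_dir.is_absolute()
    demo_size = directory_size_bytes(demo_dir)
```

Calling `directory_size_bytes` periodically is one simple way to monitor storage growth during long-running workflows.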

In [None]:
# Create checkpoint directory
checkpoint_dir = Path("./checkpoints").resolve()  # Absolute path, per the best practices above
checkpoint_dir.mkdir(parents=True, exist_ok=True)

print(f"Checkpoint directory: {checkpoint_dir.absolute()}")

# Initialize checkpoint storage
checkpoint_storage = FileCheckpointStorage(checkpoint_path=str(checkpoint_dir))

## Create Specialized Agents

The same agents as in the basic Magentic workflow:
- **ResearcherAgent**: Information gathering
- **CoderAgent**: Code execution and analysis

In [None]:
async def run_checkpoint_workflow() -> None:
    researcher_agent = ChatAgent(
        name="ResearcherAgent",
        description="Specialist in research and information gathering",
        instructions=(
            "You are a Researcher. You find information without additional computation or quantitative analysis."
        ),
        chat_client=OpenAIChatClient(model_id="gpt-4o-search-preview"),
    )

    coder_agent = ChatAgent(
        name="CoderAgent",
        description="A helpful assistant that writes and executes code to process and analyze data.",
        instructions="You solve questions using code. Please provide detailed analysis and computation process.",
        chat_client=OpenAIResponsesClient(),
        tools=HostedCodeInterpreterTool(),
    )

## Define Event Callback with Checkpoint Detection

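Enhanced callback that:
1. Prints normal workflow events (orchestrator messages, streamed deltas, agent messages, final result)
2. Detects human-input requests (`RequestInfoEvent`, `MagenticPlanReviewRequestEvent`)
3. Sets `checkpoint_needed` to signal that the workflow is waiting for human input

The callback does not resume the workflow itself; resumption happens later, using the checkpoint ID.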

In [None]:
    last_stream_agent_id: str | None = None
    stream_line_open: bool = False
    checkpoint_needed: bool = False

    async def on_event(event: MagenticCallbackEvent) -> None:
        """
        Process workflow events including checkpoint triggers.
        """
        nonlocal last_stream_agent_id, stream_line_open, checkpoint_needed

        if isinstance(event, MagenticOrchestratorMessageEvent):
            print(f"\n[ORCH:{event.kind}]\n\n{getattr(event.message, 'text', '')}\n{'-' * 26}")

        elif isinstance(event, MagenticAgentDeltaEvent):
            if last_stream_agent_id != event.agent_id or not stream_line_open:
                if stream_line_open:
                    print()
                print(f"\n[STREAM:{event.agent_id}]: ", end="", flush=True)
                last_stream_agent_id = event.agent_id
                stream_line_open = True
            print(event.text, end="", flush=True)

        elif isinstance(event, MagenticAgentMessageEvent):
            if stream_line_open:
                print(" (final)")
                stream_line_open = False
                print()
            msg = event.message
            if msg is not None:
                response_text = (msg.text or "").replace("\n", " ")
                print(f"\n[AGENT:{event.agent_id}] {msg.role.value}\n\n{response_text}\n{'-' * 26}")

        elif isinstance(event, MagenticFinalResultEvent):
            print("\n" + "=" * 50)
            print("FINAL RESULT:")
            print("=" * 50)
            if event.message is not None:
                print(event.message.text)
            print("=" * 50)

        # Checkpoint detection
        elif isinstance(event, (RequestInfoEvent, MagenticPlanReviewRequestEvent)):
            print("\n" + "!" * 50)
            print("CHECKPOINT NEEDED: Workflow requires human input")
            print(f"Event Type: {type(event).__name__}")
            print("!" * 50)
            checkpoint_needed = True

## Build Workflow with Checkpoint Storage

### Key Addition:
- **`.with_checkpoint_storage(...)`**: Enable state persistence

### Configuration:
- All standard Magentic settings
- Checkpoint storage integration
- Automatic state snapshots at request points

In [None]:
    print("\nBuilding Magentic Workflow with Checkpointing...")

    workflow: MagenticWorkflow = (
        MagenticBuilder()
        .participants(researcher=researcher_agent, coder=coder_agent)
        .on_event(on_event, mode=MagenticCallbackMode.STREAMING)
        .with_checkpoint_storage(checkpoint_storage)  # Enable checkpointing
        .with_standard_manager(
            chat_client=OpenAIChatClient(),
            max_round_count=10,
            max_stall_count=3,
            max_reset_count=2,
        )
        .build()
    )

## Define Task and Initial Execution

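This task is intended to trigger a plan review request, which creates a checkpoint. If no request is raised, the run completes without pausing and Phase 1 below reports that no checkpoint was created.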

In [None]:
    task = (
        "I am preparing a report on the energy efficiency of different machine learning model architectures. "
        "Compare the estimated training and inference energy consumption of ResNet-50, BERT-base, and GPT-2 "
        "on standard datasets. Then, estimate the CO2 emissions associated with each, assuming training "
        "on an Azure Standard_NC6s_v3 VM for 24 hours. Provide tables for clarity."
    )

    print(f"\nTask: {task}")
    print("\nStarting initial workflow execution...")

## Execute Until First Checkpoint

### Execution Flow:

1. Workflow generates plan
2. Requests human review (creates checkpoint)
3. Execution pauses
4. State saved to disk

### Checkpoint Contains:
- Current orchestrator state
- Agent conversation histories
- Round counters
- Pending requests

In [None]:
    try:
        checkpoint_id: str | None = None
        request_event: RequestInfoEvent | None = None

        print("\n[PHASE 1: Execute until checkpoint]\n")

        async for event in workflow.run_stream(task):
            print(event)

            # Capture checkpoint information
            if isinstance(event, RequestInfoEvent):
                request_event = event
                checkpoint_id = event.checkpoint_id
                print(f"\n✓ Checkpoint created with ID: {checkpoint_id}")
                print(f"Request type: {type(event.request).__name__}")
                break  # Pause execution

        if checkpoint_id is None:
            print("\n⚠ No checkpoint was created - workflow may have completed without requiring input.")
            return

## Simulate Human Review

In production:
- Human reviews the plan
- Provides approval or modifications
- System constructs response object

Here we simulate approval.

In [None]:
        print("\n" + "=" * 50)
        print("SIMULATING HUMAN REVIEW")
        print("=" * 50)
        print("Human response: Plan approved. Proceed with execution.")
        print("=" * 50 + "\n")

        # Construct approval response
        from agent_framework import MagenticPlanReviewReply

        human_response = MagenticPlanReviewReply(
            approved=True,
            message="Plan looks good. Please proceed with the energy efficiency analysis.",
        )

## Resume Workflow from Checkpoint

### Resume Process:

1. **Load checkpoint state** from storage
2. **Apply human response** to pending request
3. **Continue execution** from saved point
4. **Complete workflow** or hit next checkpoint

### Key Methods:

- **`workflow.run_stream_from_checkpoint(...)`**: Resume with streaming
- **`checkpoint_id`**: Identifies saved state
- **`responses`**: Dict mapping request IDs to response objects

In [None]:
        print("\n[PHASE 2: Resume from checkpoint]\n")

        # Resume execution with human response
        output: str | None = None
        async for event in workflow.run_stream_from_checkpoint(
            checkpoint_id=checkpoint_id,
            responses={request_event.request.request_id: human_response},  # type: ignore
        ):
            print(event)
            if isinstance(event, WorkflowOutputEvent):
                output = str(event.data)

        if output is not None:
            print(f"\nWorkflow completed with result:\n\n{output}")
        else:
            print("\nWorkflow completed without output.")

    except Exception as e:
        print(f"Workflow execution failed: {e}")
        import traceback
        traceback.print_exc()

## Run the Complete Checkpointed Workflow

In [None]:
await run_checkpoint_workflow()

## Expected Output Pattern

```
Checkpoint directory: /path/to/checkpoints

Building Magentic Workflow with Checkpointing...

Task: I am preparing a report on the energy efficiency...

[PHASE 1: Execute until checkpoint]

[ORCH:planning]
Generated execution plan:
1. Research energy consumption data
2. Calculate CO2 emissions
3. Create comparison tables
--------------------------

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
CHECKPOINT NEEDED: Workflow requires human input
Event Type: MagenticPlanReviewRequestEvent
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

✓ Checkpoint created with ID: checkpoint_abc123
Request type: MagenticPlanReviewRequest

==================================================
SIMULATING HUMAN REVIEW
==================================================
Human response: Plan approved. Proceed with execution.
==================================================

[PHASE 2: Resume from checkpoint]

[ORCH:executing]
Proceeding with approved plan...
--------------------------

[STREAM:ResearcherAgent]: Finding energy data...

[AGENT:ResearcherAgent] assistant
ResNet-50: 50 kWh, BERT-base: 1500 kWh, GPT-2: 280 kWh
--------------------------

[STREAM:CoderAgent]: Calculating emissions...

==================================================
FINAL RESULT:
==================================================
Energy Efficiency Report with CO2 Analysis...
==================================================
```

## Inspect Checkpoint Files

In [None]:
# List checkpoint files created
print("\nCheckpoint files created:")
for checkpoint_file in checkpoint_dir.glob("*"):
    print(f"  - {checkpoint_file.name} ({checkpoint_file.stat().st_size} bytes)")
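
Beyond file names and sizes, it can help to see which top-level keys each state file contains. The sketch below assumes the state files are JSON with a `.json` extension, which may not match the storage implementation exactly; `summarize_checkpoints` is a hypothetical helper.

```python
import json
import tempfile
from pathlib import Path


def summarize_checkpoints(directory: Path) -> list[dict]:
    """Return name, size, and top-level keys for each JSON file in the directory."""
    summaries = []
    for path in sorted(directory.glob("*.json")):
        try:
            data = json.loads(path.read_text(encoding="utf-8"))
            keys = sorted(data) if isinstance(data, dict) else []
        except (json.JSONDecodeError, UnicodeDecodeError):
            keys = None  # Unreadable file: report it rather than fail
        summaries.append({"name": path.name, "bytes": path.stat().st_size, "keys": keys})
    return summaries


# Demo with one valid and one corrupt file.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp)
    (demo_dir / "a.json").write_text('{"plan": [], "round_count": 1}', encoding="utf-8")
    (demo_dir / "b.json").write_text("{not json", encoding="utf-8")
    demo_summaries = summarize_checkpoints(demo_dir)
```

In this notebook it could be called as `summarize_checkpoints(checkpoint_dir)` after the workflow has run.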

## Key Takeaways

### 1. Checkpoint Configuration

```python
# Create storage
storage = FileCheckpointStorage(checkpoint_path="./checkpoints")

# Add to workflow
workflow = (
    MagenticBuilder()
    .participants(...)
    .with_checkpoint_storage(storage)  # Enable checkpointing
    .with_standard_manager(...)
    .build()
)
```

### 2. Checkpoint Lifecycle

#### Phase 1: Initial Execution
```python
async for event in workflow.run_stream(task):
    if isinstance(event, RequestInfoEvent):
        checkpoint_id = event.checkpoint_id
        request = event.request
        break  # Pause for human input
```

#### Phase 2: Resume
```python
async for event in workflow.run_stream_from_checkpoint(
    checkpoint_id=checkpoint_id,
    responses={request.request_id: human_response},
):
    print(event)  # Continue execution
```

### 3. Request/Response Pattern

#### Request Types:
- **MagenticPlanReviewRequest**: Plan approval/modification
- **Custom Request Types**: Extend for domain-specific needs

#### Response Construction:
```python
from agent_framework import MagenticPlanReviewReply

response = MagenticPlanReviewReply(
    approved=True,  # or False to reject
    message="Optional feedback or modifications"
)
```

### 4. Storage Backends

#### FileCheckpointStorage (This Example)
- Simple, local file persistence
- Good for development and single-machine deployments
- Limited scalability

#### Custom Storage (Production)
- Implement `CheckpointStorage` interface
- Use databases (PostgreSQL, MongoDB)
- Cloud storage (Azure Blob, S3)
- Enable distributed workflows

### 5. What Gets Checkpointed?

- **Orchestrator State**: Current plan, round counts
- **Agent Contexts**: Conversation histories
- **Executor States**: Custom executor data
- **Pending Requests**: Open human-in-the-loop requests
- **Workflow Metadata**: Timestamps, IDs

### 6. Multi-Stage Checkpointing

Workflows can checkpoint multiple times:

```python
# First checkpoint: Plan review
checkpoint_1 = event.checkpoint_id

# Resume
async for event in workflow.run_stream_from_checkpoint(...):
    # Second checkpoint: Data approval
    if isinstance(event, RequestInfoEvent):
        checkpoint_2 = event.checkpoint_id
        break

# Resume again
async for event in workflow.run_stream_from_checkpoint(...):
    print(event)  # Continue to completion
```

### 7. Error Handling with Checkpoints

```python
# save_checkpoint_id / load_last_checkpoint_id are application-defined helpers
try:
    async for event in workflow.run_stream(task):
        # Save checkpoint IDs as they appear
        if isinstance(event, RequestInfoEvent):
            save_checkpoint_id(event.checkpoint_id)
except Exception:
    # Resume from the last saved checkpoint
    last_checkpoint = load_last_checkpoint_id()
    async for event in workflow.run_stream_from_checkpoint(
        checkpoint_id=last_checkpoint
    ):
        print(event)  # Retry execution
```

### 8. Production Best Practices

#### Storage Management
- Implement checkpoint expiration (TTL)
- Clean up old checkpoints
- Monitor storage usage
- Use compression for large states

#### Resumption Logic
- Validate checkpoint exists before resume
- Handle checkpoint corruption gracefully
- Provide fallback to restart
- Log checkpoint operations

#### Security
- Encrypt sensitive checkpoint data
- Validate checkpoint ownership
- Audit checkpoint access
- Secure storage credentials
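
As a partial illustration, the sketch below attaches an HMAC-SHA256 tag to a serialized checkpoint so that tampering can be detected before resuming. Note that this provides integrity and authenticity only, not encryption; confidentiality would need a separate encryption layer. The function names and the demo key are hypothetical.

```python
import hashlib
import hmac


def sign_checkpoint(payload: bytes, key: bytes) -> str:
    """Return a hex HMAC-SHA256 tag for the serialized checkpoint."""
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def verify_checkpoint(payload: bytes, key: bytes, signature: str) -> bool:
    """Check the tag using a constant-time comparison."""
    expected = sign_checkpoint(payload, key)
    return hmac.compare_digest(expected, signature)


# Demo key only; a real key would come from a secret manager.
demo_key = b"demo-key-not-for-production"
demo_payload = b'{"round_count": 1}'
demo_signature = sign_checkpoint(demo_payload, demo_key)
valid = verify_checkpoint(demo_payload, demo_key, demo_signature)
tampered = verify_checkpoint(b'{"round_count": 99}', demo_key, demo_signature)
```

Using a per-user or per-tenant key is one possible way to tie a checkpoint to its owner, since a tag produced with a different key will not verify.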

### 9. Use Cases

#### Long-Running Workflows
- Multi-hour data processing
- Complex research tasks
- Batch operations

#### Human Approval Workflows
- Plan review and modification
- Data validation
- Quality assurance gates
- Compliance checks

#### Cost Optimization
- Avoid re-running expensive LLM calls
- Resume after rate limit errors
- Incremental processing

#### Debugging and Development
- Reproduce specific states
- Test resume logic
- Analyze workflow behavior

### 10. Checkpoint ID Management

```python
# Store checkpoint IDs persistently
import json
from datetime import datetime

def save_checkpoint_metadata(checkpoint_id: str, task: str) -> None:
    metadata = {
        "checkpoint_id": checkpoint_id,
        "task": task,
        "timestamp": datetime.now().isoformat()
    }
    with open("checkpoint_metadata.json", "w") as f:
        json.dump(metadata, f)

def load_checkpoint_id() -> str:
    with open("checkpoint_metadata.json") as f:
        return json.load(f)["checkpoint_id"]
```

### 11. Advanced Patterns

- **With Plan Review**: See magentic_human_plan_update.ipynb
- **Custom Storage**: Implement database-backed persistence
- **Distributed Workflows**: Share checkpoints across workers
- **Versioned Checkpoints**: Support workflow schema evolution
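
Versioned checkpoints could be handled with a small migration table applied at load time. This sketch is purely illustrative: the `schema_version` field, the `rounds` to `round_count` rename, and the `migrate` helper are all assumptions, not part of the framework.

```python
CURRENT_SCHEMA_VERSION = 2


def _v1_to_v2(state: dict) -> dict:
    """Hypothetical change: v1 stored 'rounds', v2 stores 'round_count'."""
    upgraded = dict(state)  # Shallow copy so the caller's dict is not modified
    upgraded["round_count"] = upgraded.pop("rounds", 0)
    upgraded["schema_version"] = 2
    return upgraded


# Maps a source version to the function that upgrades it by one version.
MIGRATIONS = {1: _v1_to_v2}


def migrate(state: dict) -> dict:
    """Upgrade a state dict step by step to CURRENT_SCHEMA_VERSION."""
    version = state.get("schema_version", 1)  # Assumption: unversioned means v1
    if version > CURRENT_SCHEMA_VERSION:
        raise ValueError(f"Checkpoint version {version} is newer than supported")
    while version < CURRENT_SCHEMA_VERSION:
        step = MIGRATIONS.get(version)
        if step is None:
            raise ValueError(f"No migration registered for version {version}")
        state = step(state)
        version = state["schema_version"]
    return state


old_state = {"rounds": 3, "plan": ["research"]}
new_state = migrate(old_state)
```

Applying migrations one version at a time keeps each step small and lets old checkpoints survive several schema changes.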

### 12. Comparison: With and Without Checkpointing

| Feature | Checkpointing | No Checkpointing |
|---------|--------------|------------------|
| **Resumption** | Can resume from any checkpoint | Must restart from beginning |
| **Cost** | Avoids redundant LLM calls | Re-executes everything |
| **Complexity** | Requires storage management | Simpler architecture |
| **Human-in-loop** | Natural pause/resume | Requires custom state management |
| **Debugging** | Can inspect/replay states | Limited historical visibility |