# Checkpoint with Human-in-the-Loop

## Overview

This notebook demonstrates the **checkpoint + human-in-the-loop pattern**, which combines workflow persistence with interactive human review. The pattern is essential for production workflows that:

1. **Require human approval** at critical decision points
2. **Run across sessions** - pause, exit program, resume later
3. **Pre-supply responses** - provide answers at restore time
4. **Maintain state** across interruptions

### Workflow: Release Notes Approval Pipeline

```
User Brief
    ↓
BriefPreparer (normalize & prepare)
    ↓
WriterAgent (draft release notes)
    ↓
ReviewGateway (request human approval)
    ↓
┌─────────────────────────────────┐
│  RequestInfoExecutor            │
│  (pause for human decision)     │
│  • "approve" → Finalize         │
│  • feedback → Revision loop     │
└─────────────────────────────────┘
    ↓
If approved: FinalizeExecutor
If feedback: WriterAgent (revise)
```

### Key Features:

1. **FileCheckpointStorage**: Persist workflow state to disk
2. **RequestInfoExecutor**: Pause workflow for human approval
3. **Checkpoint Summaries**: Inspect pending requests
4. **Resume with Responses**: Pre-supply answers at restore
5. **Revision Loops**: Iterate based on human feedback
6. **State Persistence**: Executor state + shared state

### Pause/Resume Flow:

1. **Initial Run**: Execute until human approval needed → checkpoint created
2. **Program Exit**: Checkpoint saved with "awaiting human response" status
3. **Later Resume**: Restart program, select checkpoint
4. **Pre-supply Response**: Provide decision before resume
5. **Auto-apply**: Workflow resumes without re-emitting RequestInfoEvent
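
The five steps above can be sketched with the standard library alone. This is only an illustration of the idea, not the framework's checkpoint format: the `save_checkpoint` and `resume` helpers, the JSON layout, and the `req-1` id are all hypothetical.

```python
from __future__ import annotations

import json
import tempfile
from pathlib import Path


def save_checkpoint(directory: Path, checkpoint_id: str, pending: dict[str, dict]) -> Path:
    """Write a toy checkpoint holding the requests that still await a human."""
    status = "awaiting human response" if pending else "completed"
    path = directory / f"{checkpoint_id}.json"
    path.write_text(json.dumps({"id": checkpoint_id, "status": status, "pending": pending}))
    return path


def resume(path: Path, responses: dict[str, str] | None = None) -> dict:
    """Reload a toy checkpoint and apply any pre-supplied responses."""
    checkpoint = json.loads(path.read_text())
    responses = responses or {}
    answered = {rid: responses[rid] for rid in checkpoint["pending"] if rid in responses}
    still_pending = [rid for rid in checkpoint["pending"] if rid not in responses]
    return {"answered": answered, "still_pending": still_pending}


with tempfile.TemporaryDirectory() as tmp:
    # Steps 1-2: the run pauses and the checkpoint is written before exit.
    cp_path = save_checkpoint(Path(tmp), "cp-1", {"req-1": {"draft": "v1", "iteration": 1}})
    # Step 3 without a response: the request would have to be asked again.
    without_answer = resume(cp_path)
    # Steps 4-5: a pre-supplied response is applied at restore time.
    with_answer = resume(cp_path, {"req-1": "approve"})

print(without_answer)
print(with_answer)
```

The second call leaves nothing pending, which mirrors the "auto-apply" step: the answer is matched to the stored request by its id, so the request does not need to be shown again.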

## Prerequisites

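- Azure OpenAI configured through the `AZURE_OPENAI_ENDPOINT` and `AZURE_OPENAI_CHAT_DEPLOYMENT_NAME` environment variables (loaded here from a `.env` file)
- Azure CLI authentication: `az login`
- Agent Framework installed: `pip install agent-framework`
- `python-dotenv` installed, since the setup cell imports `load_dotenv`
- Write permissions for the checkpoint directory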

## Setup and Imports

In [3]:
import asyncio
from collections.abc import AsyncIterable
from dataclasses import dataclass
from pathlib import Path
from typing import TYPE_CHECKING, Any
import os
from dotenv import load_dotenv

from agent_framework import (
    AgentExecutor,
    AgentExecutorRequest,
    AgentExecutorResponse,
    ChatMessage,
    Executor,
    FileCheckpointStorage,
    RequestInfoEvent,
    RequestInfoExecutor,
    RequestInfoMessage,
    RequestResponse,
    Role,
    WorkflowBuilder,
    WorkflowContext,
    WorkflowOutputEvent,
    WorkflowRunState,
    WorkflowStatusEvent,
    handler,
)
from agent_framework.azure import AzureOpenAIChatClient
from azure.identity import AzureCliCredential

if TYPE_CHECKING:
    from agent_framework import Workflow
    from agent_framework._workflows._checkpoint import WorkflowCheckpoint

# Load environment variables from the .env file two levels up
# (the path is resolved relative to the current working directory)
load_dotenv('../../.env')

True

In [4]:
# Verify environment variables are loaded
print("🔍 Checking Azure OpenAI environment variables...")
endpoint = os.getenv('AZURE_OPENAI_ENDPOINT')
deployment_name = os.getenv('AZURE_OPENAI_CHAT_DEPLOYMENT_NAME')

if not endpoint or not deployment_name:
    raise ValueError(
        "❌ Azure OpenAI environment variables not found.\n"
        "Please set AZURE_OPENAI_ENDPOINT and AZURE_OPENAI_CHAT_DEPLOYMENT_NAME in your .env file"
    )

print(f"✅ AZURE_OPENAI_ENDPOINT: {endpoint}")
print(f"✅ AZURE_OPENAI_CHAT_DEPLOYMENT_NAME: {deployment_name}")

🔍 Checking Azure OpenAI environment variables...
✅ AZURE_OPENAI_ENDPOINT: https://kd-foundry-project-resource.openai.azure.com/
✅ AZURE_OPENAI_CHAT_DEPLOYMENT_NAME: gpt-4o


## Configure Checkpoint Directory

We create a dedicated directory for this sample to:
- Isolate from other samples
- Enable clean-up after demonstration
- Make checkpoint files easy to inspect

In [5]:
# Create temporary checkpoint directory
CHECKPOINT_DIR = Path("./tmp/checkpoints_hitl")
CHECKPOINT_DIR.mkdir(parents=True, exist_ok=True)

print(f"Checkpoint directory: {CHECKPOINT_DIR.absolute()}")

Checkpoint directory: c:\Users\kapildhanger\OneDrive - Microsoft\Microsoft_Kapil\Azure_learning\agent-framework\agent-framework\python\samples\getting_started\workflows\checkpoint\notebooks\tmp\checkpoints_hitl


## Define Request/Response Models

### HumanApprovalRequest

Subclasses `RequestInfoMessage` for typed human approval requests.

**Design Principles:**
- **Simple primitive types**: Ensures reliable checkpoint serialization
- **Context fields**: `draft` and `iteration` provide review context
- **Reconstruction-friendly**: Framework can rebuild from JSON

**Fields:**
- `prompt`: Instructions for human reviewer
- `draft`: Current release notes draft
- `iteration`: Draft number, so the reviewer can see how many revisions have already happened

In [6]:
@dataclass
class HumanApprovalRequest(RequestInfoMessage):
    """Message sent to the human reviewer via RequestInfoExecutor.
    
    These fields are intentionally simple because they are serialized into
    checkpoints. Keeping them primitive types guarantees the
    pending_requests_from_checkpoint helper can reconstruct them on resume.
    """
    prompt: str = ""
    draft: str = ""
    iteration: int = 0
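
To see why primitive fields matter, here is a standalone sketch of a JSON round trip using only the standard library. `ApprovalPayload` is a hypothetical stand-in for `HumanApprovalRequest`, and `dataclasses.asdict` plus `json` stand in for whatever serialization the framework performs internally, which may differ.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ApprovalPayload:
    """Hypothetical stand-in for HumanApprovalRequest (no framework base class)."""

    prompt: str = ""
    draft: str = ""
    iteration: int = 0


original = ApprovalPayload(prompt="Review the draft.", draft="Meet the new grinder.", iteration=2)

# str and int fields map directly onto JSON types, so nothing is lost.
encoded = json.dumps(asdict(original))
restored = ApprovalPayload(**json.loads(encoded))

print(restored == original)
```

A field holding a `datetime`, a set, or a custom object would need an explicit encoder and decoder to survive the same trip, which is the extra complexity the simple field types avoid.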

## Create Workflow Executors

### 1. BriefPreparer

**Responsibilities:**
- Normalize user brief (collapse whitespace, add period)
- Store in shared state for cross-executor visibility
- Create deterministic prompt for writer
- Kick off agent execution

**Why minimal?** Keeping the first executor simple makes checkpoint state easier to reason about.

In [7]:
class BriefPreparer(Executor):
    """Normalizes the user brief and sends a single AgentExecutorRequest."""

    def __init__(self, id: str, agent_id: str) -> None:
        super().__init__(id=id)
        self._agent_id = agent_id

    @handler
    async def prepare(self, brief: str, ctx: WorkflowContext[AgentExecutorRequest, str]) -> None:
        # Collapse errant whitespace so the prompt is stable between runs
        normalized = " ".join(brief.split()).strip()
        if not normalized.endswith("."):
            normalized += "."
        
        # Persist the cleaned brief in shared state so downstream executors
        # and future checkpoints can recover the original intent
        await ctx.set_shared_state("brief", normalized)
        
        prompt = (
            "You are drafting product release notes. Summarize the brief below in two sentences. "
            "Keep it positive and end with a call to action.\n\n"
            f"BRIEF: {normalized}"
        )
        
        # Hand the prompt to the writer agent
        await ctx.send_message(
            AgentExecutorRequest(messages=[ChatMessage(Role.USER, text=prompt)], should_respond=True),
            target_id=self._agent_id,
        )
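
The normalization step inside `prepare` is easy to check in isolation. The helper below, `normalize_brief`, is a hypothetical extraction of the same logic for illustration and is not part of the framework.

```python
def normalize_brief(brief: str) -> str:
    """Collapse runs of whitespace and make sure the brief ends with a period."""
    normalized = " ".join(brief.split())
    if not normalized.endswith("."):
        normalized += "."
    return normalized


messy = "  Introduce   our smart\n coffee grinder  "
clean = normalize_brief(messy)

print(clean)
# Applying the function twice changes nothing, so the prompt stays stable
# when a run is repeated with an already-normalized brief.
print(normalize_brief(clean) == clean)
```

Note that an empty brief normalizes to a lone `"."`, so a real pipeline may want to reject empty input before this step.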

### 2. ReviewGateway

**Responsibilities:**
- Route agent drafts to human reviewers
- Persist iteration count and draft in executor state
- Process human decisions (approve vs. revise)
- Loop back to writer with feedback if needed

**Key Methods:**

#### `on_agent_response()`
- Captures agent's draft
- Increments iteration counter
- Stores state for checkpoint persistence
- Sends `HumanApprovalRequest` to `RequestInfoExecutor`

#### `on_human_feedback()`
- Receives `RequestResponse[HumanApprovalRequest, str]`
- If "approve": Route to finalizer
- If feedback: Send revision request back to writer
- Uses `feedback.original_request` for context

In [8]:
class ReviewGateway(Executor):
    """Routes agent drafts to humans and optionally back for revisions."""

    def __init__(self, id: str, reviewer_id: str, writer_id: str, finalize_id: str) -> None:
        super().__init__(id=id)
        self._reviewer_id = reviewer_id
        self._writer_id = writer_id
        self._finalize_id = finalize_id

    @handler
    async def on_agent_response(
        self,
        response: AgentExecutorResponse,
        ctx: WorkflowContext[HumanApprovalRequest, str],
    ) -> None:
        # Capture the agent output for reviewer and checkpoint persistence
        draft = response.agent_run_response.text or ""
        iteration = int((await ctx.get_state() or {}).get("iteration", 0)) + 1
        await ctx.set_state({"iteration": iteration, "last_draft": draft})
        
        # Emit human approval request - this pauses workflow until answer is supplied
        await ctx.send_message(
            HumanApprovalRequest(
                prompt="Review the draft. Reply 'approve' or provide edit instructions.",
                draft=draft,
                iteration=iteration,
            ),
            target_id=self._reviewer_id,
        )

    @handler
    async def on_human_feedback(
        self,
        feedback: RequestResponse[HumanApprovalRequest, str],
        ctx: WorkflowContext[AgentExecutorRequest | str, str],
    ) -> None:
        # RequestResponse gives us both human data and original request context
        reply = (feedback.data or "").strip()
        state = await ctx.get_state() or {}
        draft = state.get("last_draft") or (feedback.original_request.draft if feedback.original_request else "")

        if reply.lower() == "approve":
            # Human approval - send to finalizer
            await ctx.send_message(draft, target_id=self._finalize_id)
            return

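        # Feedback provided - loop back to writer for revision
        guidance = reply or "Tighten the copy and emphasize customer benefit."
        # Leave the counter unchanged here: on_agent_response increments it when
        # the revised draft arrives, so bumping it now would count each revision twice
        iteration = int(state.get("iteration", 1))
        await ctx.set_state({"iteration": iteration, "last_draft": draft})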
        
        prompt = (
            "Revise the launch note. Respond with the new copy only.\n\n"
            f"Previous draft:\n{draft}\n\n"
            f"Human guidance: {guidance}"
        )
        await ctx.send_message(
            AgentExecutorRequest(messages=[ChatMessage(Role.USER, text=prompt)], should_respond=True),
            target_id=self._writer_id,
        )
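
The approve-versus-revise decision in `on_human_feedback` can be exercised without the framework. The sketch below pulls that branch into a hypothetical pure function, `route_reply`; the name and the returned tuples are illustrative assumptions, not framework API.

```python
from __future__ import annotations

DEFAULT_GUIDANCE = "Tighten the copy and emphasize customer benefit."


def route_reply(reply: str | None) -> tuple[str, str]:
    """Return ("finalize", "") on approval, otherwise ("revise", guidance)."""
    text = (reply or "").strip()
    if text.lower() == "approve":
        return ("finalize", "")
    # Anything else, including an empty reply, is treated as revision guidance.
    return ("revise", text or DEFAULT_GUIDANCE)


print(route_reply("  Approve "))
print(route_reply("approved"))
print(route_reply(""))
```

Two consequences of the exact-match rule are worth knowing: `"approved"` or `"ok"` trigger a revision rather than an approval, and an empty reply also triggers a revision using the default guidance.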

### 3. FinalizeExecutor

**Responsibilities:**
- Store approved text in executor state
- Yield final output (completes workflow)
- Provide observability for diagnostics

In [9]:
class FinalizeExecutor(Executor):
    """Publishes the approved text."""

    @handler
    async def publish(self, text: str, ctx: WorkflowContext[Any, str]) -> None:
        # Store output for diagnostics/UI
        await ctx.set_state({"published_text": text})
        # Yield final output - completes workflow
        await ctx.yield_output(text)

## Build Workflow with Checkpointing

### Graph Structure:

```
BriefPreparer → WriterAgent
WriterAgent → ReviewGateway
ReviewGateway → RequestInfoExecutor (approval)
RequestInfoExecutor → ReviewGateway (human decision)
ReviewGateway → WriterAgent (revisions)
ReviewGateway → FinalizeExecutor (approval)
```
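
Because two of these edges point backwards, the graph contains loops. The standalone sketch below lists the edges as plain tuples and reports which nodes sit on a cycle; `find_cycle_nodes` is a hypothetical helper written for this illustration. The builder in the next cell passes `max_iterations=6`, which presumably keeps such loops bounded.

```python
EDGES = [
    ("BriefPreparer", "WriterAgent"),
    ("WriterAgent", "ReviewGateway"),
    ("ReviewGateway", "RequestInfoExecutor"),
    ("RequestInfoExecutor", "ReviewGateway"),
    ("ReviewGateway", "WriterAgent"),
    ("ReviewGateway", "FinalizeExecutor"),
]


def find_cycle_nodes(edges):
    """Return the sorted list of nodes that can reach themselves."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, set()).add(dst)
        graph.setdefault(dst, set())

    def reachable(start):
        # Iterative depth-first search over everything downstream of `start`.
        seen, stack = set(), list(graph[start])
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(graph[node])
        return seen

    return sorted(node for node in graph if node in reachable(node))


cycle_nodes = find_cycle_nodes(EDGES)
print(cycle_nodes)
```

`BriefPreparer` and `FinalizeExecutor` are the only nodes outside a loop: the first runs once at the start, and the second ends the workflow.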

### Checkpointing Configuration:

- **`.with_checkpointing(checkpoint_storage=...)`**: Enables persistence
- **Automatic snapshots**: Created at superstep boundaries
- **Identical graph**: The workflow graph is the same with or without checkpointing

In [10]:
def create_workflow(*, checkpoint_storage: FileCheckpointStorage | None = None):
    """Assemble the workflow graph used by both initial run and resume."""

    # Create Azure OpenAI agent
    endpoint = os.getenv("AZURE_OPENAI_ENDPOINT")
    deployment_name = os.getenv("AZURE_OPENAI_CHAT_DEPLOYMENT_NAME")
    chat_client = AzureOpenAIChatClient(
        deployment_name=deployment_name,
        endpoint=endpoint,
        credential=AzureCliCredential()
    )
    writer = AgentExecutor(
        chat_client.create_agent(
            instructions="Write concise, warm release notes that sound human and helpful.",
        ),
        id="writer",
    )

    # RequestInfoExecutor - lynchpin for human-in-the-loop
    review = RequestInfoExecutor(id="request_info")
    finalise = FinalizeExecutor(id="finalise")
    gateway = ReviewGateway(
        id="review_gateway",
        reviewer_id=review.id,
        writer_id=writer.id,
        finalize_id=finalise.id,
    )
    prepare = BriefPreparer(id="prepare_brief", agent_id=writer.id)

    # Build the workflow graph (the revision loop makes it cyclic, so it is not a DAG)
    builder = (
        WorkflowBuilder(max_iterations=6)
        .set_start_executor(prepare)
        .add_edge(prepare, writer)
        .add_edge(writer, gateway)
        .add_edge(gateway, review)
        .add_edge(review, gateway)  # Human resumes loop
        .add_edge(gateway, writer)  # Revisions
        .add_edge(gateway, finalise)
    )

    # Opt-in to persistence when caller provides storage
    if checkpoint_storage:
        builder = builder.with_checkpointing(checkpoint_storage=checkpoint_storage)

    return builder.build()

print("✓ Workflow factory created")

✓ Workflow factory created


## Helper Functions

### Checkpoint Summary Display

Uses the framework's `RequestInfoExecutor.checkpoint_summary()` to display:
- Checkpoint ID
- Iteration count
- Target executors
- Executor states
- Status ("awaiting human response", "completed")
- Draft preview
- Pending request IDs

In [None]:
def render_checkpoint_summary(checkpoints: list) -> None:
    """Pretty-print saved checkpoints with framework summaries."""
    print("\nCheckpoint summary:")
    for summary in [
        RequestInfoExecutor.checkpoint_summary(cp) 
        for cp in sorted(checkpoints, key=lambda c: c.timestamp)
    ]:
        line = (
            f"- {summary.checkpoint_id} | iter={summary.iteration_count} "
            f"| targets={summary.targets} | states={summary.executor_states}"
        )
        if summary.status:
            line += f" | status={summary.status}"
        if summary.draft_preview:
            line += f" | draft_preview={summary.draft_preview}"
        if summary.pending_requests:
            line += f" | pending_request_id={summary.pending_requests[0].request_id}"
        print(line)

### Event Processing

Collects events and extracts:
- **WorkflowOutputEvent**: Final result
- **RequestInfoEvent**: Pending human approvals
- **WorkflowStatusEvent**: State transitions

In [None]:
def print_events(events: list[Any]) -> tuple[str | None, list[tuple[str, HumanApprovalRequest]]]:
    """Echo workflow events to console and collect outstanding requests."""
    completed_output: str | None = None
    requests: list[tuple[str, HumanApprovalRequest]] = []

    for event in events:
        print(f"Event: {event}")
        if isinstance(event, WorkflowOutputEvent):
            completed_output = event.data
        if isinstance(event, RequestInfoEvent) and isinstance(event.data, HumanApprovalRequest):
            requests.append((event.request_id, event.data))
        elif isinstance(event, WorkflowStatusEvent) and event.state in {
            WorkflowRunState.IN_PROGRESS_PENDING_REQUESTS,
            WorkflowRunState.IDLE_WITH_PENDING_REQUESTS,
        }:
            print(f"Workflow state: {event.state.name}")

    return completed_output, requests

### Interactive Response Collection

Prompts user for approval decisions via CLI.

In [None]:
def prompt_for_responses(requests: list[tuple[str, HumanApprovalRequest]]) -> dict[str, str] | None:
    """Interactive CLI prompt for any live RequestInfo requests."""
    if not requests:
        return None
    
    answers: dict[str, str] = {}
    for request_id, request in requests:
        print("\n=== Human approval needed ===")
        print(f"request_id: {request_id}")
        if request.iteration:
            print(f"Iteration: {request.iteration}")
        print(request.prompt)
        print("Draft: \n---\n" + request.draft + "\n---")
        answer = input("Type 'approve' or enter revision guidance (or 'exit' to quit): ").strip()
        if answer.lower() == "exit":
            raise SystemExit("Stopped by user.")
        answers[request_id] = answer
    return answers

### Pre-supply Responses for Resume

**Key Feature**: Provide human answers BEFORE resuming checkpoint.

**Benefits:**
- Workflow resumes immediately with answers
- No re-emission of RequestInfoEvent
- Ideal for offline approval workflows
- Supports batch processing of approvals

In [None]:
def maybe_pre_supply_responses(cp) -> dict[str, str] | None:
    """Offer to collect responses before resuming a checkpoint."""
    pending = RequestInfoExecutor.pending_requests_from_checkpoint(cp)
    if not pending:
        return None

    print(
        "This checkpoint still has pending human input. Provide the responses now so the resume step "
        "applies them immediately and does not re-emit the original RequestInfo event."
    )
    choice = input("Pre-supply responses for this checkpoint? [y/N]: ").strip().lower()
    if choice not in {"y", "yes"}:
        return None

    answers: dict[str, str] = {}
    for item in pending:
        iteration = item.iteration or 0
        print(f"\nPending draft (iteration {iteration} | request_id={item.request_id}):")
        draft_text = (item.draft or "").strip()
        if draft_text:
            print("Draft:\n---\n" + draft_text + "\n---")
        else:
            print("Draft: [not captured in checkpoint payload - refer to your notes/log]")
        prompt_text = (item.prompt or "Review the draft").strip()
        print(prompt_text)
        answer = input("Response ('approve' or guidance, 'exit' to abort): ").strip()
        if answer.lower() == "exit":
            raise SystemExit("Resume aborted by user.")
        answers[item.request_id] = answer
    return answers

### Stream Consumer Helper

In [None]:
async def consume(stream: AsyncIterable[Any]) -> list[Any]:
    """Materialize an async event stream into a list."""
    return [event async for event in stream]
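
As a quick check of what `consume` does, the sketch below drains a hypothetical `fake_stream` generator that stands in for `workflow.run_stream(...)`. The event names are made up for illustration.

```python
import asyncio
from collections.abc import AsyncIterator


async def fake_stream() -> AsyncIterator[str]:
    """Hypothetical event source standing in for workflow.run_stream(...)."""
    for name in ("started", "draft_ready", "awaiting_human"):
        await asyncio.sleep(0)  # yield control, as a real stream would
        yield name


async def consume(stream):
    """Same one-liner as the helper above: drain the stream into a list."""
    return [event async for event in stream]


# In a notebook cell, write `events = await consume(fake_stream())` instead,
# because asyncio.run() cannot be called from an already running event loop.
events = asyncio.run(consume(fake_stream()))
print(events)
```

Materializing the stream keeps the later helpers simple, at the cost of seeing events only once the run has paused or finished. Iterating with `async for` directly would give live output instead.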

## Run Interactive Session

Executes workflow with human-in-the-loop until:
- Completion (WorkflowOutputEvent)
- Pause for human input (RequestInfoEvent)

In [None]:
async def run_interactive_session(workflow, initial_message: str) -> str | None:
    """Run the workflow until it either finishes or pauses for human input."""
    pending_responses: dict[str, str] | None = None
    completed_output: str | None = None
    first = True

    while completed_output is None:
        if first:
            events = await consume(workflow.run_stream(initial_message))
            first = False
        elif pending_responses:
            events = await consume(workflow.send_responses_streaming(pending_responses))
        else:
            break

        completed_output, requests = print_events(events)
        if completed_output is None:
            pending_responses = prompt_for_responses(requests)

    return completed_output
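
The same run-then-respond loop can be tested without a model or a keyboard by scripting the replies. The sketch below is a simplified, synchronous stand-in: `FakeWorkflow`, its `run` and `send_response` methods, and the dictionary shapes are all hypothetical and do not mirror the framework's API.

```python
from __future__ import annotations


class FakeWorkflow:
    """Hypothetical stand-in: produces a draft, then waits for a reply."""

    def __init__(self) -> None:
        self.iteration = 0

    def run(self, brief: str) -> dict:
        self.iteration = 1
        return {"request_id": "req-1", "draft": f"{brief} (draft 1)"}

    def send_response(self, request: dict, reply: str) -> dict:
        if reply.strip().lower() == "approve":
            return {"output": request["draft"]}
        # Any other reply asks for a new draft and a new pending request.
        self.iteration += 1
        base = request["draft"].rsplit(" (draft", 1)[0]
        return {"request_id": f"req-{self.iteration}", "draft": f"{base} (draft {self.iteration})"}


def run_session(workflow: FakeWorkflow, brief: str, replies: list[str]) -> str | None:
    """Drive the loop with scripted replies instead of input()."""
    step = workflow.run(brief)
    for reply in replies:
        step = workflow.send_response(step, reply)
        if "output" in step:
            return step["output"]
    return None  # ran out of replies while the workflow was still paused


approved = run_session(FakeWorkflow(), "Grinder launch", ["shorter please", "approve"])
paused = run_session(FakeWorkflow(), "Grinder launch", ["shorter please"])
print(approved)
print(paused)
```

Returning `None` when the replies run out corresponds to the `break` in `run_interactive_session`: the session ends with the workflow still waiting, which is exactly the state a checkpoint captures.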

## Resume from Checkpoint

### Resume Process:

1. **Select checkpoint** by ID
2. **Optional: Pre-supply responses** for pending requests
3. **Call `run_stream_from_checkpoint()`** with checkpoint_id and responses
4. **Continue execution** until next pause or completion

In [None]:
async def resume_from_checkpoint(
    workflow,
    checkpoint_id: str,
    storage: FileCheckpointStorage,
    pre_supplied: dict[str, str] | None,
) -> None:
    """Resume a stored checkpoint and continue until completion or another pause."""
    print(f"\nResuming from checkpoint: {checkpoint_id}")
    events = await consume(
        workflow.run_stream_from_checkpoint(
            checkpoint_id,
            checkpoint_storage=storage,
            responses=pre_supplied,
        )
    )
    completed_output, requests = print_events(events)
    
    if pre_supplied and not requests and completed_output is None:
        print("Pre-supplied responses applied automatically; workflow is now waiting for the next step.")

    pending = prompt_for_responses(requests)
    while completed_output is None and pending:
        events = await consume(workflow.send_responses_streaming(pending))
        completed_output, requests = print_events(events)
        if completed_output is None:
            pending = prompt_for_responses(requests)
        else:
            break

    if completed_output:
        print(f"Workflow completed with: {completed_output}")

## Main Execution Flow

### Workflow:

1. **Clean checkpoint directory**
2. **Create workflow with checkpointing**
3. **Run initial session** (may pause for approval)
4. **List checkpoints** created
5. **Optionally resume** from selected checkpoint
6. **Pre-supply responses** if checkpoint has pending requests

In [None]:
async def main() -> None:
    """Entry point used by both initial run and subsequent resumes."""
    
    # Clean existing checkpoints for deterministic demo
    for file in CHECKPOINT_DIR.glob("*.json"):
        file.unlink()

    storage = FileCheckpointStorage(storage_path=CHECKPOINT_DIR)
    workflow = create_workflow(checkpoint_storage=storage)

    brief = (
        "Introduce our limited edition smart coffee grinder. Mention the $249 price, highlight the "
        "sensor that auto-adjusts the grind, and invite customers to pre-order on the website."
    )

    print("Running workflow (human approval required)...")
    completed = await run_interactive_session(workflow, initial_message=brief)
    if completed:
        print(f"Initial run completed with final copy: {completed}")
    else:
        print("Initial run paused for human input.")

    checkpoints = await storage.list_checkpoints()
    if not checkpoints:
        print("No checkpoints recorded.")
        return

    render_checkpoint_summary(checkpoints)

    sorted_cps = sorted(checkpoints, key=lambda c: c.timestamp)
    print("\nAvailable checkpoints:")
    for idx, cp in enumerate(sorted_cps):
        print(f"  [{idx}] id={cp.checkpoint_id} iter={cp.iteration_count}")

    selection = input("\nResume from which checkpoint? (press Enter to skip): ").strip()
    if not selection:
        print("No resume selected. Exiting.")
        return

    try:
        idx = int(selection)
    except ValueError:
        print("Invalid input; exiting.")
        return

    if not 0 <= idx < len(sorted_cps):
        print("Index out of range; exiting.")
        return

    chosen = sorted_cps[idx]
    summary = RequestInfoExecutor.checkpoint_summary(chosen)
    if summary.status == "completed":
        print("Selected checkpoint already reflects a completed workflow; nothing to resume.")
        return

    # Pre-supply responses if desired
    pre_responses = maybe_pre_supply_responses(chosen)

    resumed_workflow = create_workflow()
    await resume_from_checkpoint(resumed_workflow, chosen.checkpoint_id, storage, pre_responses)
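
The index validation in `main` mixes parsing with `input()` and `print()`, which makes it awkward to test. One option, sketched here as an assumption rather than a change to the sample, is a small pure helper; `parse_selection` is a hypothetical name.

```python
from __future__ import annotations


def parse_selection(raw: str, count: int) -> int | None:
    """Return a valid index into `count` items, or None to skip the resume."""
    text = raw.strip()
    if not text:
        return None  # user pressed Enter
    try:
        idx = int(text)
    except ValueError:
        return None  # not a number
    return idx if 0 <= idx < count else None


print(parse_selection(" 1 ", 3))
print(parse_selection("7", 3))
print(parse_selection("latest", 3))
```

Unlike `main`, this sketch collapses the three failure cases into a single `None`, so a caller that wants distinct messages would need to return a reason as well.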

## Run the Demo

In [None]:
await main()

## Key Takeaways

### 1. Checkpoint + HITL Pattern

**Combination Benefits:**
- Workflows survive program restarts
- Human decisions preserved across sessions
- Offline approval workflows possible
- Audit trail via checkpoint files

### 2. Pre-supplying Responses

**When to use:**
- Offline approval collected via email/UI
- Batch processing of multiple approvals
- Automated testing with mock responses

**How it works:**
```python
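# Pre-supplying prevents re-emission of the RequestInfoEvent.
# Keys are the pending request IDs read from the checkpoint.
responses = {"request_abc123": "approve"}
# run_stream_from_checkpoint returns an async stream, so iterate over it
# rather than awaiting it directly
async for event in workflow.run_stream_from_checkpoint(
    checkpoint_id,
    checkpoint_storage=storage,
    responses=responses,  # Applied immediately on resume
):
    print(event)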
```

### 3. Checkpoint Summary Helpers

**RequestInfoExecutor.checkpoint_summary()**:
- Status: "awaiting human response", "completed"
- Pending requests with IDs
- Draft previews
- Iteration counts

**RequestInfoExecutor.pending_requests_from_checkpoint()**:
- Extract pending HumanApprovalRequest objects
- Access draft, iteration, prompt fields
- Enable pre-supply UI

### 4. State Persistence

**Executor State:**
```python
await ctx.set_state({"iteration": 2, "last_draft": "..."})
state = await ctx.get_state()
```

**Shared State:**
```python
await ctx.set_shared_state("brief", normalized_brief)
brief = await ctx.get_shared_state("brief")
```

Both are persisted in checkpoints!

### 5. Production Patterns

The same pattern extends to production concerns that this notebook does not implement:
- Web API integration
- Database-backed storage
- Email/Slack notifications
- Multi-reviewer workflows
- Timeout handling