# Chapter 15: Building AI Agents for Network Operations

This notebook demonstrates AI agent patterns applied to **network engineering tasks**.
We'll build working examples from simple tool-calling loops to multi-step network
troubleshooting workflows.

### What is an AI Agent?

An AI agent is a program where an LLM (like Claude) acts as the "brain" in a loop:
1. **Perceive**: Read input (alerts, device output, user questions)
2. **Reason**: Decide what to do next
3. **Act**: Call tools (query devices, run commands, check logs)
4. **Repeat**: Keep going until the task is done

**Networking analogy**: Think of an agent like an event-driven automation system
(StackStorm, Ansible AWX) -- but instead of hardcoded playbooks, an LLM decides
which "playbook" to run based on context.
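
The four steps above can be sketched without any LLM at all. In this minimal illustration, a rule-based `decide` function stands in for the model and `TOOLS` holds a single fake tool. Every name and value here is hypothetical and exists only to show the shape of the loop.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical tool registry: one fake tool that returns canned output.
TOOLS: Dict[str, Callable[[str], str]] = {
    "check_interface": lambda device: f"{device} Gi0/1 is down",
}

def decide(observations: List[str]) -> Optional[Tuple[str, str]]:
    """Stand-in for the LLM: choose the next tool call, or None when done."""
    if not any("Gi0/1" in o for o in observations):
        return ("check_interface", "router-core-01")
    return None  # enough information gathered

def run_loop(alert: str, max_steps: int = 5) -> List[str]:
    observations = [alert]                          # 1. Perceive
    for _ in range(max_steps):                      # 4. Repeat (bounded)
        action = decide(observations)               # 2. Reason
        if action is None:
            break
        tool_name, arg = action
        observations.append(TOOLS[tool_name](arg))  # 3. Act
    return observations

history = run_loop("ALERT: packet loss on router-core-01")
print(history)
```

The `max_steps` bound matters as much as the loop itself: an agent that never decides it is done should still stop.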

## Cell 1: Setup and Imports

Install packages and configure the API client. You'll need an Anthropic API key
from [console.anthropic.com](https://console.anthropic.com/).

In [None]:
# Install required packages
!pip install anthropic pydantic networkx -q

import os
import json
import time
import asyncio
from datetime import datetime
from dataclasses import dataclass, asdict
from typing import Dict, List, Optional, Any
from enum import Enum
from collections import Counter
from getpass import getpass

import anthropic

# Set up API key
if 'ANTHROPIC_API_KEY' not in os.environ:
    os.environ['ANTHROPIC_API_KEY'] = getpass('Enter your Anthropic API key: ')

client = anthropic.Anthropic()
MODEL = "claude-sonnet-4-20250514"

print(f"Using model: {MODEL}")
print("Setup complete.")

## Cell 2: Simple Network Agent with Tool Calling

The most basic agent pattern: the LLM receives a question, decides which
tool(s) to call, processes the results, and responds.

This is **the core pattern** behind all AI agents. Everything else in this
notebook builds on this loop:

```
User asks question
  → LLM decides which tool to call
    → Tool returns result
      → LLM reads result and decides: call another tool, or respond?
        → Repeat until done
```

**Networking parallel**: This is like a recursive DNS resolver -- it keeps
querying until it has the answer, then responds to the original client.
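
To make the loop concrete, here is a sketch of what the conversation history might look like after one tool round trip, written as plain dictionaries in the Messages API shape: a `tool_use` block from the assistant, then a `tool_result` block sent back in a user turn. The `toolu_example_1` ID and the tool output are made up for illustration (real IDs are generated by the API), and `unanswered_tool_calls` is a hypothetical helper, not part of any SDK.

```python
import json

conversation = [
    {"role": "user", "content": "Is Gi0/2 on router-core-01 up?"},
    # Assistant turn: the model asks for a tool call instead of answering.
    {"role": "assistant", "content": [
        {"type": "tool_use", "id": "toolu_example_1",   # hypothetical ID
         "name": "get_interface_status",
         "input": {"device": "router-core-01", "interface": "GigabitEthernet0/2"}},
    ]},
    # User turn: our code runs the tool and returns the result under the same ID.
    {"role": "user", "content": [
        {"type": "tool_result", "tool_use_id": "toolu_example_1",
         "content": json.dumps({"status": "down/down"})},
    ]},
]

def unanswered_tool_calls(messages):
    """Return IDs of tool_use blocks that have no matching tool_result."""
    requested, answered = set(), set()
    for msg in messages:
        if isinstance(msg["content"], str):
            continue  # plain-text turn, no blocks to inspect
        for block in msg["content"]:
            if block["type"] == "tool_use":
                requested.add(block["id"])
            elif block["type"] == "tool_result":
                answered.add(block["tool_use_id"])
    return requested - answered

print(unanswered_tool_calls(conversation))
```

A check like this is handy when debugging a loop: every tool call the model requests needs a result before the next request is sent.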

In [None]:
class NetworkAgent:
    """Basic network operations agent that uses a tool-calling loop.

    This agent has access to simulated network tools:
    - get_interface_status: like 'show interface' on a router
    - get_bgp_neighbors: like 'show bgp summary'
    - ping_device: like running a ping test

    In production, these tools would SSH into real devices via Netmiko/NAPALM.
    Here we simulate the responses for demonstration.
    """

    def __init__(self):
        self.conversation = []
        self.tools = self._define_tools()

    def _define_tools(self):
        """Define the tools available to the agent.

        Each tool has a name, description (Claude reads this to decide
        when to use it), and an input schema (what parameters it needs).
        """
        return [
            {
                "name": "get_interface_status",
                "description": "Get the status of a network interface on a device. "
                               "Returns interface state, IP, speed, errors, and traffic counters.",
                "input_schema": {
                    "type": "object",
                    "properties": {
                        "device": {"type": "string", "description": "Device hostname (e.g., router-core-01)"},
                        "interface": {"type": "string", "description": "Interface name (e.g., GigabitEthernet0/1)"}
                    },
                    "required": ["device", "interface"]
                }
            },
            {
                "name": "get_bgp_neighbors",
                "description": "Get BGP neighbor summary for a device. "
                               "Returns neighbor IPs, AS numbers, state, and prefixes received.",
                "input_schema": {
                    "type": "object",
                    "properties": {
                        "device": {"type": "string", "description": "Device hostname"}
                    },
                    "required": ["device"]
                }
            },
            {
                "name": "ping_device",
                "description": "Ping a target IP from a source device. Returns success rate and latency.",
                "input_schema": {
                    "type": "object",
                    "properties": {
                        "source_device": {"type": "string", "description": "Device to ping from"},
                        "target_ip": {"type": "string", "description": "IP address to ping"}
                    },
                    "required": ["source_device", "target_ip"]
                }
            }
        ]

    def _execute_tool(self, tool_name, tool_input):
        """Execute a tool and return the result.

        In production, these would call Netmiko/NAPALM to query real devices.
        Here we return simulated but realistic output.
        """
        if tool_name == "get_interface_status":
            device = tool_input["device"]
            interface = tool_input["interface"]
            # Simulated interface data
            interfaces = {
                ("router-core-01", "GigabitEthernet0/1"): {
                    "status": "up/up", "ip": "10.0.1.1/24",
                    "speed": "1Gbps", "input_errors": 0, "output_errors": 0,
                    "input_rate": "450 Mbps", "output_rate": "380 Mbps"
                },
                ("router-core-01", "GigabitEthernet0/2"): {
                    "status": "down/down", "ip": "10.0.2.1/24",
                    "speed": "1Gbps", "input_errors": 1523, "output_errors": 47,
                    "input_rate": "0 bps", "output_rate": "0 bps"
                },
            }
            key = (device, interface)
            if key in interfaces:
                return json.dumps(interfaces[key])
            return json.dumps({"status": "up/up", "ip": "N/A", "speed": "1Gbps",
                             "input_errors": 0, "output_errors": 0,
                             "input_rate": "100 Mbps", "output_rate": "50 Mbps"})

        elif tool_name == "get_bgp_neighbors":
            return json.dumps({
                "device": tool_input["device"],
                "neighbors": [
                    {"neighbor_ip": "203.0.113.2", "remote_as": 65002,
                     "state": "Established", "prefixes_received": 542000,
                     "uptime": "45d 12:30:00", "description": "ISP-A Lumen"},
                    {"neighbor_ip": "198.51.100.1", "remote_as": 65003,
                     "state": "Idle", "prefixes_received": 0,
                     "uptime": "0:00:00", "description": "ISP-B Cogent"},
                ]
            })

        elif tool_name == "ping_device":
            target = tool_input["target_ip"]
            if target == "198.51.100.1":
                return json.dumps({"target": target, "packets_sent": 5,
                                 "packets_received": 0, "loss": "100%",
                                 "min_rtt": None, "avg_rtt": None, "max_rtt": None})
            return json.dumps({"target": target, "packets_sent": 5,
                             "packets_received": 5, "loss": "0%",
                             "min_rtt": "1.2ms", "avg_rtt": "2.1ms", "max_rtt": "3.5ms"})

        return json.dumps({"error": f"Unknown tool: {tool_name}"})

    def run(self, user_input, max_iterations=10):
        """Run the agent loop.

        The agent keeps calling tools until it has enough information
        to answer the user's question, or hits the iteration limit.
        """
        print(f"User: {user_input}\n")

        self.conversation = [{"role": "user", "content": user_input}]

        for iteration in range(max_iterations):
            # Ask Claude what to do next
            response = client.messages.create(
                model=MODEL,
                max_tokens=1024,
                system="You are a network operations assistant. Use the available tools "
                       "to gather information from network devices and answer questions. "
                       "Always check multiple data sources when troubleshooting.",
                tools=self.tools,
                messages=self.conversation
            )

            # Add Claude's response to conversation history
            self.conversation.append({"role": "assistant", "content": response.content})

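            # If Claude is done (no more tool calls), return the answer
            if response.stop_reason == "end_turn":
                # Join every text block so the method always returns here,
                # even if the reply has several text blocks or none
                answer = "".join(b.text for b in response.content if b.type == "text")
                print(f"Agent: {answer}")
                return answer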

            # If Claude wants to call tools, execute them
            if response.stop_reason == "tool_use":
                tool_results = []
                for block in response.content:
                    if block.type == "tool_use":
                        print(f"  [Tool Call] {block.name}({json.dumps(block.input)})")
                        result = self._execute_tool(block.name, block.input)
                        print(f"  [Result] {result[:120]}...")
                        tool_results.append({
                            "type": "tool_result",
                            "tool_use_id": block.id,
                            "content": result
                        })

                # Feed tool results back to Claude
                self.conversation.append({"role": "user", "content": tool_results})

        return "Max iterations reached"


# ----- Test the network agent -----
agent = NetworkAgent()
print("=" * 60)
print("DEMO: Network troubleshooting agent")
print("=" * 60)
response = agent.run(
    "The ISP-B link on router-core-01 seems to be down. "
    "Can you check the BGP neighbor status, the interface GigabitEthernet0/2, "
    "and try to ping the ISP-B peer at 198.51.100.1?"
)

## Cell 3: Error Handling for Network Agents

Network operations are inherently unreliable -- SSH sessions time out,
devices are unreachable, APIs rate-limit. Robust error handling is critical.

**Networking parallel**: This is like implementing BFD (Bidirectional Forwarding
Detection) for your agent -- detect failures quickly, classify them, and react appropriately.

In [None]:
class ErrorType(Enum):
    """Classify errors to determine the right recovery strategy.

    Like how a routing protocol classifies link failures:
    - Transient (flap) -> wait and retry
    - Permanent (cable cut) -> reroute immediately
    - Resource (table full) -> wait for resources to free up
    """
    TRANSIENT = "transient"      # SSH timeout, API 503 -> retry
    PERMANENT = "permanent"      # Wrong credentials, device decommissioned -> don't retry
    RESOURCE = "resource"        # API rate limit, memory full -> wait longer, then retry
    DEVICE = "device"            # Device unreachable, not in inventory -> skip
    AUTH = "auth"                # Authentication failure -> alert operator

def classify_error(exception):
    """Classify an error to decide how to handle it."""
    error_msg = str(exception).lower()

    if isinstance(exception, TimeoutError) or "timeout" in error_msg:
        return ErrorType.TRANSIENT
    elif isinstance(exception, PermissionError) or "auth" in error_msg:
        return ErrorType.AUTH
    elif "rate limit" in error_msg or isinstance(exception, MemoryError):
        return ErrorType.RESOURCE
    elif "unreachable" in error_msg or "connection refused" in error_msg:
        return ErrorType.DEVICE
    elif isinstance(exception, (KeyError, ValueError)):
        return ErrorType.PERMANENT
    else:
        return ErrorType.TRANSIENT  # Default: assume transient and retry

@dataclass
class ErrorContext:
    """Full context about an error -- useful for post-incident review."""
    error_type: ErrorType
    message: str
    device: str
    step: str
    attempt: int
    timestamp: datetime

class ErrorLogger:
    """Track all errors during an agent run for debugging."""

    def __init__(self):
        self.errors = []

    def log(self, context: ErrorContext):
        self.errors.append(context)
        print(f"  [{context.error_type.value.upper()}] {context.device}: {context.message}")

    def summary(self):
        by_type = Counter(e.error_type.value for e in self.errors)
        by_device = Counter(e.device for e in self.errors)
        return {"total": len(self.errors), "by_type": dict(by_type), "by_device": dict(by_device)}

# ----- Demo: Classify network errors -----
logger = ErrorLogger()

test_errors = [
    (TimeoutError("SSH connection timed out"), "router-core-01", "get_interface"),
    (PermissionError("TACACS authentication failed"), "switch-dist-02", "login"),
    (ConnectionError("Device unreachable"), "switch-access-05", "ping"),
    (TimeoutError("SSH connection timed out"), "router-core-01", "get_bgp"),
    (RuntimeError("API rate limit exceeded"), "api-server", "query"),
]

for exc, device, step in test_errors:
    ctx = ErrorContext(
        error_type=classify_error(exc),
        message=str(exc),
        device=device,
        step=step,
        attempt=1,
        timestamp=datetime.now()
    )
    logger.log(ctx)

print("\nError Summary:")
print(json.dumps(logger.summary(), indent=2))

## Cell 4: Retry with Exponential Backoff

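When SSH connections time out or APIs rate-limit, you need automatic retries.
Exponential backoff gives a struggling device or API progressively more time to
recover. Adding random jitter to each delay also avoids thundering-herd problems
(many clients retrying at the same instant).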

**Networking parallel**: This is the same concept as Ethernet's binary exponential
backoff for collision handling in CSMA/CD -- wait longer after each failure.
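
The `RetryPolicy` in this cell uses fixed delays. As a hedged sketch of one common variation, often called "full jitter", the delay can be drawn at random between zero and the exponential ceiling. The function name `backoff_with_jitter` and its parameters are assumptions made for this example, not part of the notebook's classes.

```python
import random

def backoff_with_jitter(attempt, initial_delay=1.0, max_delay=60.0, rng=random):
    """Return a random delay in [0, min(max_delay, initial_delay * 2**attempt)]."""
    ceiling = min(max_delay, initial_delay * (2 ** attempt))
    return rng.uniform(0, ceiling)

rng = random.Random(42)  # seeded only so the demo is repeatable
for attempt in range(5):
    ceiling = min(60.0, 1.0 * 2 ** attempt)
    delay = backoff_with_jitter(attempt, rng=rng)
    print(f"attempt {attempt}: ceiling {ceiling:.0f}s, chose {delay:.2f}s")
```

Because each client picks a different point below the ceiling, retries from many agents spread out instead of arriving in synchronized waves.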

In [None]:
class RetryPolicy:
    """Configure retry behavior for network operations."""

    def __init__(self, max_attempts=3, backoff_type="exponential",
                 initial_delay=1, max_delay=60):
        self.max_attempts = max_attempts
        self.backoff_type = backoff_type
        self.initial_delay = initial_delay
        self.max_delay = max_delay

    def get_delay(self, attempt):
        """Calculate wait time before next retry."""
        if self.backoff_type == "exponential":
            delay = self.initial_delay * (2 ** attempt)
        elif self.backoff_type == "linear":
            delay = self.initial_delay * (attempt + 1)
        else:
            delay = self.initial_delay
        return min(delay, self.max_delay)


class RetryExecutor:
    """Execute functions with automatic retry on failure."""

    def __init__(self, policy=None):
        self.policy = policy or RetryPolicy()

    def execute(self, fn, args=None, kwargs=None):
        args = args or ()
        kwargs = kwargs or {}
        last_error = None

        for attempt in range(self.policy.max_attempts):
            try:
                print(f"  Attempt {attempt + 1}/{self.policy.max_attempts}...")
                return fn(*args, **kwargs)
            except Exception as e:
                last_error = e
                error_type = classify_error(e)
                print(f"    Failed: {e}")

                # Don't retry permanent or auth errors
                if error_type in (ErrorType.PERMANENT, ErrorType.AUTH):
                    print(f"    Error type '{error_type.value}' is not retryable. Giving up.")
                    raise

                if attempt < self.policy.max_attempts - 1:
                    delay = self.policy.get_delay(attempt)
                    print(f"    Waiting {delay:.1f}s before retry...")
                    time.sleep(delay)

        raise Exception(f"Failed after {self.policy.max_attempts} attempts") from last_error


# ----- Demo: Simulate a flaky SSH connection -----
ssh_attempt_count = 0

def flaky_ssh_connection():
    """Simulates an SSH connection that fails twice, then succeeds."""
    global ssh_attempt_count
    ssh_attempt_count += 1
    if ssh_attempt_count < 3:
        raise TimeoutError("SSH connection to router-core-01 timed out")
    return {"device": "router-core-01", "output": "show version output here..."}

print("Simulating flaky SSH with exponential backoff:\n")
executor = RetryExecutor(RetryPolicy(max_attempts=4, initial_delay=0.5))
try:
    result = executor.execute(flaky_ssh_connection)
    print(f"\n  Success: Connected to {result['device']}")
except Exception as e:
    print(f"\n  Failed permanently: {e}")

## Cell 5: State Management

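An explicit state object records what the agent was asked to do, which step it
is on, and what each step returned. Serializing it to JSON makes a run easy to
log, inspect, or hand off.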
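
One reason to keep state explicit is checkpointing: a serialized state can be reloaded after a crash or restart. The sketch below round-trips a state object through JSON. `MiniState`, `save_checkpoint`, and `load_checkpoint` are hypothetical names for this example, and `MiniState` is a cut-down stand-in for `AgentState`, not a replacement for it.

```python
import json
from dataclasses import dataclass, field, asdict
from datetime import datetime
from typing import Dict

@dataclass
class MiniState:
    """Cut-down, hypothetical stand-in for a full agent state object."""
    task_id: str
    status: str
    current_step: int
    # String keys on purpose: JSON object keys are always strings,
    # so integer keys would not survive a round trip unchanged.
    results: Dict[str, str] = field(default_factory=dict)
    updated_at: str = field(default_factory=lambda: datetime.now().isoformat())

def save_checkpoint(state: MiniState) -> str:
    """Serialize the state to a JSON string (could be written to disk or a DB)."""
    return json.dumps(asdict(state))

def load_checkpoint(blob: str) -> MiniState:
    """Rebuild the state object from a JSON string."""
    return MiniState(**json.loads(blob))

state = MiniState(task_id="chg-001", status="executing", current_step=2,
                  results={"1": "config backed up"})
restored = load_checkpoint(save_checkpoint(state))
print(restored.current_step, restored.results)
```

Timestamps are stored as ISO strings here so the whole object is JSON-safe without a custom encoder.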

In [None]:
@dataclass
class AgentState:
    """Explicit state object for agent execution."""
    task_id: str
    user_input: str
    status: str  # "planning", "executing", "complete", "error"
    current_step: int
    plan: List[str]
    results: Dict[int, str]
    errors: List[str]
    created_at: datetime
    updated_at: datetime
    
    def to_dict(self):
        """Convert to dictionary."""
        return {
            "task_id": self.task_id,
            "user_input": self.user_input,
            "status": self.status,
            "current_step": self.current_step,
            "plan": self.plan,
            "results": self.results,
            "errors": self.errors,
            "created_at": self.created_at.isoformat(),
            "updated_at": self.updated_at.isoformat()
        }
    
    def to_json(self):
        """Convert to JSON string."""
        return json.dumps(self.to_dict(), indent=2)

# Demo: Create and manage state
import uuid

state = AgentState(
    task_id=str(uuid.uuid4()),
    user_input="Write a poem about Python",
    status="planning",
    current_step=0,
    plan=[
        "Understand requirements",
        "Generate poem structure",
        "Write poem",
        "Review and refine"
    ],
    results={},
    errors=[],
    created_at=datetime.now(),
    updated_at=datetime.now()
)

print("Initial State:")
print(state.to_json())

# Update state as steps execute
state.status = "executing"
state.current_step = 1
state.results[0] = "Requirements understood"
state.updated_at = datetime.now()

print("\nUpdated State:")
print(state.to_json())

## Cell 6: DAG-Based Workflow for Network Operations

Complex network tasks have dependencies. For example, you can't verify BGP
until OSPF is converged, and you can't test end-to-end until both are up.

A DAG (Directed Acyclic Graph) enforces execution order automatically.

**Networking parallel**: This is like a maintenance change window checklist
where steps have dependencies -- except the computer enforces the order.
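
A DAG also tells you which steps are independent and could run at the same time. As a small sketch using the standard library's `graphlib` (Python 3.9+) rather than networkx, the loop below groups steps into batches whose dependencies are all satisfied. The step names mirror the change-window demo in this cell; the batching itself is an illustration, not something the `DAGWorkflow` class does.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Map each step to the set of steps it depends on.
deps = {
    "backup_config": set(),
    "verify_ospf": set(),
    "add_bgp_peer": {"backup_config"},
    "verify_bgp": {"add_bgp_peer", "verify_ospf"},
    "verify_end_to_end": {"verify_bgp"},
}

sorter = TopologicalSorter(deps)
sorter.prepare()  # raises graphlib.CycleError on a circular dependency

batches = []
while sorter.is_active():
    ready = sorted(sorter.get_ready())  # steps whose dependencies are all done
    batches.append(ready)
    sorter.done(*ready)                 # mark complete to unlock the next batch

for i, batch in enumerate(batches, 1):
    print(f"batch {i}: {batch}")
```

Here the config backup and the OSPF check land in the same first batch, since neither depends on the other.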

In [None]:
import networkx as nx

class DAGWorkflow:
    """Execute network operations in dependency order using a DAG."""

    def __init__(self):
        self.graph = nx.DiGraph()
        self.results = {}

    def add_step(self, step_id, action, dependencies=None):
        """Add a step. Dependencies must complete before this step runs."""
        self.graph.add_node(step_id, action=action)
        if dependencies:
            for dep in dependencies:
                self.graph.add_edge(dep, step_id)

    def validate(self):
        """Check that the workflow has no circular dependencies."""
        if not nx.is_directed_acyclic_graph(self.graph):
            raise ValueError("Workflow contains circular dependencies!")
        return True

    def execute(self):
        """Execute all steps in topological (dependency) order."""
        self.validate()
        order = list(nx.topological_sort(self.graph))
        print(f"Execution order: {' -> '.join(order)}\n")

        for step_id in order:
            action = self.graph.nodes[step_id]["action"]
            deps = list(self.graph.predecessors(step_id))
            inputs = {dep: self.results[dep] for dep in deps}

            print(f"  [{step_id}] Running...")
            try:
                result = action(**inputs) if inputs else action()
                self.results[step_id] = result
                print(f"  [{step_id}] Done: {result}")
            except Exception as e:
                print(f"  [{step_id}] FAILED: {e}")
                raise
        return self.results


# ----- Demo: Network change window workflow -----
# Simulated steps for a BGP peer addition change window

def backup_config():
    return "Config backed up: router-core-01_backup_20260207.cfg"

def verify_ospf():
    return "OSPF: 12 neighbors, all Full/DR or Full/BDR"

def add_bgp_peer(backup_config=None):
    return "BGP peer 198.51.100.5 (AS 65004) added"

def verify_bgp(add_bgp_peer=None, verify_ospf=None):
    return "BGP peer 198.51.100.5 state: Established, 12000 prefixes received"

def verify_end_to_end(verify_bgp=None):
    return "Traceroute to 8.8.8.8 via new path: 3 hops, 2.1ms avg RTT"

workflow = DAGWorkflow()
workflow.add_step("backup_config", backup_config)
workflow.add_step("verify_ospf", verify_ospf)
workflow.add_step("add_bgp_peer", add_bgp_peer, dependencies=["backup_config"])
workflow.add_step("verify_bgp", verify_bgp, dependencies=["add_bgp_peer", "verify_ospf"])
workflow.add_step("verify_end_to_end", verify_end_to_end, dependencies=["verify_bgp"])

print("=" * 60)
print("DEMO: Network change window workflow (DAG)")
print("=" * 60)
results = workflow.execute()
print(f"\nAll steps completed successfully.")

## Cell 7: Tool Executor with Validation

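A tool executor sits between the LLM and your tool functions. It looks up the
handler by name, checks that required parameters are present, catches
exceptions so a failing tool returns an error instead of crashing the agent,
and records each executed call for later review.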
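
Checking that required parameters exist is a good start, but an LLM can also send the wrong type or an extra field. The sketch below extends the idea with simple type checks against the `properties` in an input schema. It is a deliberately small illustration, not a full JSON Schema validator; `validate_params` and `JSON_TYPES` are hypothetical names, and rejecting unexpected parameters is a stricter choice than JSON Schema's default.

```python
# Map JSON Schema type names to Python types (only the few used here).
JSON_TYPES = {"string": str, "number": (int, float), "integer": int, "boolean": bool}

def validate_params(params, input_schema):
    """Return a list of problems; an empty list means the call looks valid."""
    problems = []
    properties = input_schema.get("properties", {})
    for name in input_schema.get("required", []):
        if name not in params:
            problems.append(f"missing required parameter: {name}")
    for name, value in params.items():
        if name not in properties:
            problems.append(f"unexpected parameter: {name}")
            continue
        type_name = properties[name].get("type")
        expected = JSON_TYPES.get(type_name)
        # bool is a subclass of int in Python, so reject it for non-boolean fields
        if expected and expected is not bool and isinstance(value, bool):
            problems.append(f"{name}: expected {type_name}, got bool")
        elif expected and not isinstance(value, expected):
            problems.append(f"{name}: expected {type_name}, got {type(value).__name__}")
    return problems

schema = {
    "type": "object",
    "properties": {"device": {"type": "string"}, "count": {"type": "integer"}},
    "required": ["device"],
}
print(validate_params({"device": "router-core-01", "count": 5}, schema))
print(validate_params({"count": "five"}, schema))
```

Returning a list of problems, rather than stopping at the first one, gives the model everything it needs to correct the call in a single retry.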

In [None]:
class ToolExecutor:
    """Execute tools with validation and error handling."""
    
    def __init__(self):
        self.tools = {}
        self.execution_history = []
    
    def register_tool(self, definition, handler):
        """Register a tool with definition and handler."""
        self.tools[definition["name"]] = {
            "definition": definition,
            "handler": handler
        }
    
    def execute(self, tool_name, parameters):
        """Execute a tool with parameter validation."""
        if tool_name not in self.tools:
            return {"error": f"Unknown tool: {tool_name}"}
        
        tool_info = self.tools[tool_name]
        
        # Validate parameters
        validation_error = self._validate_parameters(
            parameters, 
            tool_info["definition"]
        )
        if validation_error:
            return {"error": f"Invalid parameters: {validation_error}"}
        
        # Execute with error handling
        try:
            result = tool_info["handler"](**parameters)
            execution_record = {
                "tool": tool_name,
                "parameters": parameters,
                "result": result,
                "status": "success",
                "timestamp": datetime.now().isoformat()
            }
        except Exception as e:
            result = {"error": str(e)}
            execution_record = {
                "tool": tool_name,
                "parameters": parameters,
                "result": result,
                "status": "error",
                "timestamp": datetime.now().isoformat(),
                "exception_type": type(e).__name__
            }
        
        self.execution_history.append(execution_record)
        return result
    
    def _validate_parameters(self, params, tool_definition):
        """Validate parameters against tool definition."""
        required_params = tool_definition.get("input_schema", {}).get("required", [])
        
        for req_param in required_params:
            if req_param not in params:
                return f"Missing required parameter: {req_param}"
        
        return None
    
    def get_history(self):
        """Get execution history."""
        return self.execution_history

# Demo: Register and execute tools
executor = ToolExecutor()

# Register tools
executor.register_tool(
    {
        "name": "multiply",
        "description": "Multiply two numbers",
        "input_schema": {
            "type": "object",
            "properties": {
                "a": {"type": "number"},
                "b": {"type": "number"}
            },
            "required": ["a", "b"]
        }
    },
    lambda a, b: a * b
)

executor.register_tool(
    {
        "name": "concatenate",
        "description": "Concatenate strings",
        "input_schema": {
            "type": "object",
            "properties": {
                "text1": {"type": "string"},
                "text2": {"type": "string"}
            },
            "required": ["text1", "text2"]
        }
    },
    lambda text1, text2: f"{text1} {text2}"
)

# Execute tools
print("Executing multiply(5, 3):")
result1 = executor.execute("multiply", {"a": 5, "b": 3})
print(f"Result: {result1}\n")

print("Executing concatenate('Hello', 'World'):")
result2 = executor.execute("concatenate", {"text1": "Hello", "text2": "World"})
print(f"Result: {result2}\n")

print("Executing multiply with missing parameter:")
result3 = executor.execute("multiply", {"a": 5})
print(f"Result: {result3}\n")

print("Execution History:")
for record in executor.get_history():
    print(f"  {record['tool']}: {record['status']}")

## Cell 8: Decision Making Framework

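Multi-criteria decision analysis (MCDA) scores each option against several
criteria, multiplies each score by a weight, and picks the option with the
highest weighted sum.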
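
A weighted sum only reflects the weights if every criterion is on a comparable scale. One common approach, sketched here under that assumption, is min-max normalization to the range 0-1, flipping the scale for criteria where lower is better (latency, cost). The options, weights, and the `min_max_normalize` helper are hypothetical and exist only for this illustration.

```python
def min_max_normalize(values, higher_is_better=True):
    """Scale values to 0-1; flip the scale when lower raw values are better."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0 for _ in values]  # all options tie on this criterion
    scaled = [(v - lo) / (hi - lo) for v in values]
    return scaled if higher_is_better else [1.0 - s for s in scaled]

# Hypothetical options: (name, accuracy, latency in ms, cost per 1k tokens)
options = [("A", 0.95, 2000, 0.03), ("B", 0.92, 800, 0.003), ("C", 0.90, 1000, 0.0005)]
weights = [0.5, 0.3, 0.2]  # accuracy, speed, cost

columns = [
    min_max_normalize([o[1] for o in options]),                          # accuracy
    min_max_normalize([o[2] for o in options], higher_is_better=False),  # latency
    min_max_normalize([o[3] for o in options], higher_is_better=False),  # cost
]
totals = {
    name: sum(w * col[i] for w, col in zip(weights, columns))
    for i, (name, *_rest) in enumerate(options)
}
best = max(totals, key=totals.get)
print(best, {k: round(v, 3) for k, v in totals.items()})
```

Min-max scaling is sensitive to outliers and to which options are in the set, so treat it as one reasonable default rather than the only choice.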

In [None]:
class MultiCriteriaDecisionAnalysis:
    """MCDA framework for evaluating options against multiple criteria."""
    
    def __init__(self, criteria, weights):
        """
        Initialize MCDA.
        
        Args:
            criteria: List of evaluation functions
            weights: List of weights for each criterion (must sum to 1)
        """
        if len(criteria) != len(weights):
            raise ValueError("Number of criteria must match number of weights")
        
        if abs(sum(weights) - 1.0) > 0.01:
            raise ValueError("Weights must sum to 1")
        
        self.criteria = criteria
        self.weights = weights
    
    def evaluate_option(self, option):
        """Evaluate a single option against all criteria."""
        scores = []
        for criterion in self.criteria:
            score = criterion(option)
            scores.append(score)
        return scores
    
    def decide(self, options):
        """Choose the best option."""
        scores = {}
        detailed_scores = {}
        
        for option in options:
            criterion_scores = self.evaluate_option(option)
            weighted_score = sum(s * w for s, w in zip(criterion_scores, self.weights))
            scores[option] = weighted_score
            detailed_scores[option] = criterion_scores
        
        best_option = max(options, key=lambda o: scores[o])
        
        return {
            "best_option": best_option,
            "scores": scores,
            "detailed_scores": detailed_scores
        }

# Demo: Choose best LLM model
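# frozen=True makes instances hashable; decide() uses options as dict keys,
# and a plain @dataclass (eq=True, not frozen) is unhashable
@dataclass(frozen=True)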
class LLMModel:
    name: str
    accuracy: float  # 0-1
    latency_ms: float
    cost_per_1k: float
    reliability: float  # 0-1

models = [
    LLMModel("GPT-4", 0.95, 2000, 0.03, 0.98),
    LLMModel("Claude 3.5 Sonnet", 0.92, 800, 0.003, 0.99),
    LLMModel("Gemini Pro", 0.90, 1000, 0.0005, 0.95),
]

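# Define criteria and weights
# NOTE: these raw scores are on different scales. 1 / cost_per_1k ranges from
# about 33 to 2000 here, so cost efficiency dominates the weighted sum no matter
# what the weights say. Normalize each criterion to 0-1 for the weights to matter.
criteria = [
    lambda m: m.accuracy,              # Accuracy (weight: 0.4)
    lambda m: 1 / (m.latency_ms / 1000),  # Speed (weight: 0.3)
    lambda m: 1 / m.cost_per_1k,       # Cost efficiency (weight: 0.2)
    lambda m: m.reliability             # Reliability (weight: 0.1)
]
weights = [0.4, 0.3, 0.2, 0.1]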

mcda = MultiCriteriaDecisionAnalysis(criteria, weights)
result = mcda.decide(models)

print("Model Selection Results:")
print(f"Best Choice: {result['best_option'].name}")
print(f"\nScores:")
for model, score in sorted(result['scores'].items(), key=lambda x: x[1], reverse=True):
    print(f"  {model.name}: {score:.3f}")

print(f"\nDetailed Scores (Accuracy, Speed, Cost, Reliability):")
for model, scores in result['detailed_scores'].items():
    print(f"  {model.name}: {[f'{s:.2f}' for s in scores]}")

## Cell 9: Circuit Breaker Pattern

Prevents your agent from hammering a broken service. After too many failures,
the circuit "opens" and rejects requests immediately (fast-fail). After a
cooldown period, it tries again.

**Networking parallel**: This is very similar to route flap dampening in BGP.
When a route flaps too many times, its penalty crosses the suppress threshold
and the router stops using it. Once the penalty decays below the reuse
threshold, the route is un-suppressed. Same concept, different domain.

In [None]:
class CircuitState(Enum):
    """Circuit breaker states."""
    CLOSED = "closed"          # Normal operation
    OPEN = "open"              # Failing, reject requests
    HALF_OPEN = "half_open"    # Testing recovery

class CircuitBreaker:
    """Prevent cascading failures using circuit breaker pattern."""
    
    def __init__(self, failure_threshold=5, recovery_timeout=10):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.last_failure_time = None
        self.call_log = []
    
    def call(self, fn, *args, **kwargs):
        """Execute function with circuit breaker protection."""
        if self.state == CircuitState.OPEN:
            if time.time() - self.last_failure_time > self.recovery_timeout:
                print(f"  [Circuit] Transitioning to HALF_OPEN")
                self.state = CircuitState.HALF_OPEN
            else:
                self.call_log.append(("blocked", "circuit open"))
                raise Exception("Circuit breaker is OPEN")
        
        try:
            result = fn(*args, **kwargs)
            self.on_success()
            self.call_log.append(("success", None))
            return result
        except Exception as e:
            self.on_failure()
            self.call_log.append(("failure", str(e)))
            raise
    
    def on_success(self):
        """Handle successful call."""
        self.failure_count = 0
        if self.state == CircuitState.HALF_OPEN:
            print(f"  [Circuit] Recovered - transitioning to CLOSED")
            self.state = CircuitState.CLOSED
    
    def on_failure(self):
        """Handle failed call."""
        self.failure_count += 1
        self.last_failure_time = time.time()
        
        if self.failure_count >= self.failure_threshold:
            print(f"  [Circuit] Failure threshold reached - opening circuit")
            self.state = CircuitState.OPEN
    
    def get_state(self):
        return self.state.value

# Demo: Simulate service with intermittent failures
failing_call_count = 0

def unreliable_service():
    global failing_call_count
    failing_call_count += 1
    
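    # Fails the first 3 times (enough to trip the breaker), then succeeds,
    # so the HALF_OPEN probe after the cooldown can close the circuit again
    if failing_call_count <= 3:
        raise Exception("Service unavailable")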
    return "Success!"

breaker = CircuitBreaker(failure_threshold=3, recovery_timeout=2)

print("Testing circuit breaker:")
for i in range(12):
    print(f"\nCall {i+1}:")
    try:
        result = breaker.call(unreliable_service)
        print(f"  Result: {result}")
    except Exception as e:
        print(f"  Error: {e}")
    print(f"  Circuit state: {breaker.get_state()}")
    
    if i == 6:
        print("\n  [Waiting for recovery timeout...]")
        time.sleep(2.5)

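## Cell 10: Planning Agent

Before changing a network, engineers write a Method of Procedure (MoP). This
agent asks Claude to draft one as structured JSON, then walks through the steps
in a simulated execution.

In [None]: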
class PlanningAgent:
    """Agent that creates a plan (MoP) before executing network changes."""

    def __init__(self):
        pass

    def create_plan(self, task):
        """Ask Claude to create a step-by-step plan for a network task."""
        system_prompt = """You are a senior network engineer creating a Method of Procedure (MoP)
for a network change. Given a task, create a detailed step-by-step plan.

Format your response as JSON:
{
  "task": "description of the change",
  "risk_level": "low/medium/high",
  "estimated_impact": "description of potential impact",
  "rollback_plan": "how to undo the change",
  "steps": [
    {"id": 1, "action": "specific action", "verification": "how to verify this step", "dependencies": []},
    {"id": 2, "action": "next action", "verification": "verification command", "dependencies": [1]}
  ]
}"""

        response = client.messages.create(
            model=MODEL,
            max_tokens=1500,
            system=system_prompt,
            messages=[{"role": "user", "content": f"Create a MoP for: {task}"}]
        )

        import re  # local import keeps this cell self-contained

        response_text = response.content[0].text
        try:
            # The model may wrap the JSON in prose, so grab the outermost {...} span
            json_match = re.search(r'\{[\s\S]*\}', response_text)
            if json_match:
                return json.loads(json_match.group())
            return json.loads(response_text)
        except json.JSONDecodeError:
            # Fall back to the raw text so the caller can still inspect the plan
            return {"raw_plan": response_text}

    def execute_plan(self, plan):
        """Simulate executing the plan (in production, this would make real changes)."""
        results = {}
        if "steps" in plan:
            for step in plan["steps"]:
                step_id = step.get("id")
                action = step.get("action")
                verification = step.get("verification", "N/A")
                print(f"  Step {step_id}: {action}")
                print(f"    Verify: {verification}")
                results[step_id] = f"Completed: {action}"
        return results


# ----- Demo: Plan a network change -----
agent = PlanningAgent()
print("=" * 60)
print("DEMO: Planning agent for network changes")
print("=" * 60)
print("\nCreating MoP for: 'Add a new eBGP peer to ISP-C (AS 65004) on router-core-01'\n")

plan = agent.create_plan(
    "Add a new eBGP peer to ISP-C (AS 65004, peer IP 192.0.2.1) on router-core-01. "
    "The new peer should receive a default route and our /24 prefix 198.51.100.0/24. "
    "Apply route-maps for inbound and outbound filtering."
)

print("Generated Plan:")
print(json.dumps(plan, indent=2))
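
Each step in the generated plan carries a `dependencies` list, but `execute_plan` walks the steps in the order the model returned them. The sketch below shows one way to order steps by their dependencies before execution, using the standard library's `graphlib` (Python 3.9+). The helper name `order_steps` and the sample plan are illustrative assumptions, not part of the notebook's agent.

```python
from graphlib import CycleError, TopologicalSorter


def order_steps(plan):
    """Return the plan's steps in an order that respects "dependencies".

    Raises ValueError if a step depends on an unknown id or the
    dependencies form a cycle.
    """
    steps = {step["id"]: step for step in plan.get("steps", [])}
    graph = {}
    for step_id, step in steps.items():
        deps = step.get("dependencies", [])
        unknown = [dep for dep in deps if dep not in steps]
        if unknown:
            raise ValueError(f"Step {step_id} depends on unknown step(s): {unknown}")
        graph[step_id] = deps  # node -> list of predecessors

    try:
        ordered_ids = list(TopologicalSorter(graph).static_order())
    except CycleError as exc:
        raise ValueError(f"Circular dependency in plan: {exc.args[1]}") from exc
    return [steps[step_id] for step_id in ordered_ids]


# Hypothetical plan with steps listed out of order
sample_plan = {
    "steps": [
        {"id": 2, "action": "Configure BGP neighbor", "dependencies": [1]},
        {"id": 1, "action": "Back up running config", "dependencies": []},
        {"id": 3, "action": "Verify BGP session", "dependencies": [2]},
    ]
}
ordered = order_steps(sample_plan)
print([step["id"] for step in ordered])  # [1, 2, 3]
```

Rejecting unknown ids and cycles up front is a deliberate choice here: a plan produced by an LLM may be malformed, and it is safer to fail before any step runs than partway through a change.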

## Cell 11: Monitoring and Metrics

Track agent performance and behavior: task outcomes, task durations, tool calls, and errors.

In [None]:
class AgentMetrics:
    """Track metrics for agent execution."""
    
    def __init__(self):
        self.task_counts = Counter()
        self.task_durations = []
        self.tool_calls = Counter()
        self.errors = Counter()
        self.concurrent_tasks = 0
    
    def record_task(self, status):
        """Record task completion."""
        self.task_counts[status] += 1
    
    def record_duration(self, duration_seconds):
        """Record task duration."""
        self.task_durations.append(duration_seconds)
    
    def record_tool_call(self, tool_name, status):
        """Record tool call."""
        self.tool_calls[f"{tool_name}:{status}"] += 1
    
    def record_error(self, error_type):
        """Record error."""
        self.errors[error_type] += 1
    
    def get_summary(self):
        """Get metrics summary."""
        durations = self.task_durations
        total_tasks = sum(self.task_counts.values())

        return {
            "task_counts": dict(self.task_counts),
            "total_tasks": total_tasks,
            "success_rate": self.task_counts["success"] / total_tasks if total_tasks else 0,
            "avg_duration_seconds": sum(durations) / len(durations) if durations else 0,
            "min_duration_seconds": min(durations) if durations else 0,
            "max_duration_seconds": max(durations) if durations else 0,
            "tool_calls": dict(self.tool_calls),
            "errors": dict(self.errors),
            "total_errors": sum(self.errors.values())
        }

# Demo: Simulate agent execution and metrics
metrics = AgentMetrics()

# Simulate some task execution
for i in range(10):
    if i < 8:
        metrics.record_task("success")
        metrics.record_duration(2.3 + (i % 3) * 0.5)
        metrics.record_tool_call("search", "success")
        metrics.record_tool_call("extract", "success")
    else:
        metrics.record_task("error")
        metrics.record_duration(1.5)
        metrics.record_error("timeout")

print("Agent Metrics Summary:")
summary = metrics.get_summary()
for key, value in summary.items():
    if isinstance(value, float):
        print(f"{key}: {value:.3f}")
    else:
        print(f"{key}: {value}")
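
The demo above records outcomes and durations by hand. One possible way to keep that bookkeeping next to the work being measured is a context manager, sketched below. `MiniMetrics` is a hypothetical stand-in that exposes the same `record_*` method names as `AgentMetrics` so the block runs on its own, and `track_task` is an illustrative helper name.

```python
import time
from collections import Counter
from contextlib import contextmanager


class MiniMetrics:
    """Hypothetical stand-in exposing the same record_* methods as AgentMetrics."""

    def __init__(self):
        self.task_counts = Counter()
        self.task_durations = []
        self.errors = Counter()

    def record_task(self, status):
        self.task_counts[status] += 1

    def record_duration(self, duration_seconds):
        self.task_durations.append(duration_seconds)

    def record_error(self, error_type):
        self.errors[error_type] += 1


@contextmanager
def track_task(metrics):
    """Record outcome and duration of the wrapped block, then re-raise any error."""
    start = time.perf_counter()
    try:
        yield
    except Exception as exc:
        metrics.record_task("error")
        metrics.record_error(type(exc).__name__)
        raise
    else:
        metrics.record_task("success")
    finally:
        # Runs for both outcomes, so every task contributes a duration
        metrics.record_duration(time.perf_counter() - start)


demo_metrics = MiniMetrics()

with track_task(demo_metrics):
    time.sleep(0.01)  # stands in for a real tool call

try:
    with track_task(demo_metrics):
        raise TimeoutError("SSH timed out")
except TimeoutError:
    pass  # the error is still visible to the caller

print(dict(demo_metrics.task_counts))  # {'success': 1, 'error': 1}
print(dict(demo_metrics.errors))       # {'TimeoutError': 1}
```

Because the helper only calls `record_task`, `record_duration`, and `record_error`, an `AgentMetrics` instance could be passed in place of `MiniMetrics`.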

## Cell 12: Async/Parallel Workflow Execution

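Run steps concurrently as soon as their dependencies have completed. With `asyncio` this is cooperative concurrency on a single thread, which suits I/O-bound work such as waiting on device responses.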

In [None]:
class AsyncWorkflow:
    """Execute workflows with async tasks."""
    
    def __init__(self):
        self.steps = {}
        self.results = {}
    
    def add_step(self, step_id, async_action, dependencies=None):
        """Add an async step."""
        self.steps[step_id] = {
            "action": async_action,
            "dependencies": dependencies or []
        }
    
    async def execute(self):
        """Execute workflow asynchronously."""
        completed = set()
        
        while len(completed) < len(self.steps):
            # Find executable steps
            executable = [
                (step_id, step) for step_id, step in self.steps.items()
                if step_id not in completed 
                and all(dep in completed for dep in step["dependencies"])
            ]
            
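            if not executable:
                # Remaining steps have circular or unknown dependencies
                remaining = [step_id for step_id in self.steps if step_id not in completed]
                raise RuntimeError(f"Cannot schedule steps: {remaining}")
            
            # Run every ready step concurrently; dependency results are passed
            # as keyword arguments named after the dependency's step_id
            step_ids = [step_id for step_id, _ in executable]
            coroutines = [
                step["action"](**{dep: self.results[dep] for dep in step["dependencies"]})
                for _, step in executable
            ]
            batch_results = await asyncio.gather(*coroutines)
            
            # Store results and mark the whole batch as completed
            for step_id, result in zip(step_ids, batch_results):
                self.results[step_id] = result
                completed.add(step_id)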
        
        return self.results

# Demo: Async workflow
async def task_a():
    await asyncio.sleep(0.5)
    return "Result A"

async def task_b():
    await asyncio.sleep(0.3)
    return "Result B"

async def task_c(a=None, b=None):
    await asyncio.sleep(0.2)
    return f"Combined: {a} and {b}"

async def run_async_demo():
    workflow = AsyncWorkflow()
    
    # A and B execute in parallel, C depends on both
    workflow.add_step("a", task_a)
    workflow.add_step("b", task_b)
    workflow.add_step("c", task_c, dependencies=["a", "b"])
    
    start = time.time()
    results = await workflow.execute()
    duration = time.time() - start
    
    print(f"Async Workflow Results (completed in {duration:.2f}s):")
    for step_id, result in results.items():
        print(f"  {step_id}: {result}")

# Run async demo
await run_async_demo()
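
When the concurrent steps are real device checks, one slow or unreachable device should not stall or fail the whole batch. The sketch below shows one way to give each check its own timeout with `asyncio.wait_for` and to collect failures as values with `asyncio.gather(..., return_exceptions=True)`. The helper `run_checks` and the two sample checks are hypothetical, and the sleeps stand in for device I/O.

```python
import asyncio


async def run_checks(checks, timeout_seconds=1.0):
    """Run named coroutine functions concurrently, each with its own timeout.

    Returns {name: result_or_exception}; a failure does not cancel sibling checks.
    """
    names = list(checks)
    guarded = [
        asyncio.wait_for(checks[name](), timeout=timeout_seconds)
        for name in names
    ]
    outcomes = await asyncio.gather(*guarded, return_exceptions=True)
    return dict(zip(names, outcomes))


# Hypothetical checks: sleeps stand in for device I/O
async def fast_check():
    await asyncio.sleep(0.05)
    return "ok"


async def slow_check():
    await asyncio.sleep(5)  # exceeds the timeout used below
    return "never returned"


async def demo_checks():
    outcomes = await run_checks(
        {"ping": fast_check, "traceroute": slow_check},
        timeout_seconds=0.2,
    )
    for name, outcome in outcomes.items():
        status = "timed out" if isinstance(outcome, asyncio.TimeoutError) else outcome
        print(f"  {name}: {status}")
    return outcomes
```

In a notebook, run it with `await demo_checks()`; in a plain script, use `asyncio.run(demo_checks())`.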

## Cell 13: Human-in-the-Loop for Network Changes

For network changes, you almost always want human approval before
the agent makes modifications. This pattern pauses the agent and waits
for an operator to approve or reject the proposed change.

**Networking parallel**: This is like a change advisory board (CAB)
workflow with the request and record-keeping automated. The agent proposes
the change, a human reviews and approves it, and then the agent executes it.

In [None]:
class NetworkChangeApproval:
    """Require human approval before executing network changes.

    Read-only operations (show commands) run automatically.
    Write operations (config changes) require approval.
    """

    def __init__(self):
        self.pending = {}
        self.history = []

    def request_approval(self, change_description, change_details):
        """Request human approval for a network change."""
        # Count pending and processed changes so two open requests never share an ID
        change_id = f"CHG-{len(self.history) + len(self.pending) + 1:04d}"
        self.pending[change_id] = {
            "id": change_id,
            "description": change_description,
            "details": change_details,
            "status": "pending",
            "requested_at": datetime.now()
        }

        print(f"\n{'='*50}")
        print(f"  APPROVAL REQUIRED: {change_id}")
        print(f"{'='*50}")
        print(f"  Change: {change_description}")
        print(f"  Details: {json.dumps(change_details, indent=4)}")
        print(f"{'='*50}")
        return change_id

    def approve(self, change_id, approver="operator"):
        if change_id in self.pending:
            self.pending[change_id]["status"] = "approved"
            self.pending[change_id]["approved_by"] = approver
            self.history.append(self.pending.pop(change_id))
            print(f"  APPROVED by {approver}: {change_id}")
            return True
        return False

    def reject(self, change_id, reason=""):
        if change_id in self.pending:
            self.pending[change_id]["status"] = "rejected"
            self.pending[change_id]["rejection_reason"] = reason
            self.history.append(self.pending.pop(change_id))
            print(f"  REJECTED: {change_id} -- {reason}")
            return True
        return False

    def execute_with_approval(self, action, description, details, auto_approve=False):
        """Execute a change, requiring approval first."""
        change_id = self.request_approval(description, details)

        if auto_approve:
            print("\n  (Auto-approving for demo purposes)")
            self.approve(change_id, approver="auto-demo")
            result = action()
            print(f"  Executed: {result}")
            return {"status": "approved", "result": result, "change_id": change_id}
        else:
            # In production, this would wait for a webhook/API/Slack callback
            self.reject(change_id, reason="No operator available (demo mode)")
            return {"status": "rejected", "change_id": change_id}


# ----- Demo: Change approval workflow -----
approver = NetworkChangeApproval()

# Change 1: Adding an ACL (auto-approved for demo)
result1 = approver.execute_with_approval(
    action=lambda: "ACL 'BLOCK-SCANNER' applied to Gi0/3 inbound",
    description="Add security ACL to block known scanner IPs",
    details={
        "device": "router-edge-01",
        "acl_name": "BLOCK-SCANNER",
        "interface": "GigabitEthernet0/3",
        "direction": "inbound",
        "entries": ["deny ip 185.220.0.0/16 any", "deny ip 45.148.10.0/24 any"]
    },
    auto_approve=True
)

# Change 2: Shutting down an interface (rejected -- too risky without human review)
result2 = approver.execute_with_approval(
    action=lambda: "Interface GigabitEthernet0/1 shut down",
    description="Shut down uplink interface for maintenance",
    details={
        "device": "router-core-01",
        "interface": "GigabitEthernet0/1",
        "action": "shutdown",
        "impact": "Loss of connectivity to datacenter spine"
    },
    auto_approve=False
)

print(f"\nChange History: {len(approver.history)} changes processed")
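
The class docstring says read-only operations run automatically while write operations need approval, but the demo routes every change through approval. A minimal sketch of that gate follows, assuming commands arrive as CLI text. The verb allow-list and the function name `requires_approval` are hypothetical; a real deployment would need a list matched to its platforms.

```python
# Hypothetical allow-list of command verbs treated as read-only
READ_ONLY_VERBS = {"show", "ping", "traceroute"}


def requires_approval(commands: str) -> bool:
    """Return True unless every non-empty line starts with a read-only verb."""
    lines = [line.strip() for line in commands.splitlines() if line.strip()]
    if not lines:
        return True  # nothing recognizable: fail closed
    return any(line.split()[0].lower() not in READ_ONLY_VERBS for line in lines)


print(requires_approval("show ip bgp summary"))                            # False
print(requires_approval("configure terminal\ninterface Gi0/1\nshutdown"))  # True
print(requires_approval("show version\nreload"))                           # True
```

The check is an allow-list rather than a deny-list on purpose: any command the gate does not recognize is treated as a write and sent for approval.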

## Cell 14: Putting It All Together -- Network Operations Agent

This cell combines the patterns above into a single network operations agent
with error logging, a retry policy, and metrics.

In [None]:
class NetworkOpsAgent:
    """Complete network operations agent with error handling, retry, and metrics."""

    def __init__(self):
        self.metrics = AgentMetrics()
        self.logger = ErrorLogger()
        self.retry = RetryExecutor(RetryPolicy(max_attempts=3, initial_delay=0.5))
        self.task_count = 0  # drives task IDs; counts every task, not only failures

    def run_task(self, task_description, device="unknown"):
        """Run a network operations task with full monitoring."""
        self.task_count += 1
        task_id = f"TASK-{self.task_count:03d}"
        start_time = time.time()

        try:
            print(f"[{task_id}] Starting: {task_description} on {device}")

            # Simulate the task (in production, this would call real tools)
            result = f"Completed: {task_description}"

            duration = time.time() - start_time
            self.metrics.record_task("success")
            self.metrics.record_duration(duration)
            self.metrics.record_tool_call("ssh", "success")

            print(f"[{task_id}] Success ({duration:.2f}s)")
            return {"status": "success", "result": result}

        except Exception as e:
            duration = time.time() - start_time
            self.metrics.record_task("error")
            self.metrics.record_duration(duration)  # keep failed tasks in the duration stats
            self.metrics.record_error(type(e).__name__)

            ctx = ErrorContext(
                error_type=classify_error(e),
                message=str(e), device=device,
                step=task_id, attempt=1,
                timestamp=datetime.now()
            )
            self.logger.log(ctx)

            print(f"[{task_id}] Error ({duration:.2f}s): {e}")
            return {"status": "error", "error": str(e)}

    def get_report(self):
        return {
            "metrics": self.metrics.get_summary(),
            "errors": self.logger.summary()
        }


# ----- Demo: Run network operations -----
ops_agent = NetworkOpsAgent()

tasks = [
    ("Check BGP neighbor status", "router-core-01"),
    ("Verify OSPF adjacencies", "router-core-01"),
    ("Audit ACL configurations", "switch-dist-01"),
    ("Generate interface utilization report", "switch-dist-02"),
    ("Check NTP sync status", "router-edge-01"),
]

for task, device in tasks:
    ops_agent.run_task(task, device)
    time.sleep(0.2)

print("\n" + "=" * 50)
print("Network Operations Agent Report:")
print("=" * 50)
report = ops_agent.get_report()
print(json.dumps(report, indent=2))
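
`NetworkOpsAgent` builds a retry executor, but the simulated task never passes through it. The sketch below illustrates what wrapping a tool call in retry with exponential backoff could look like. Since the `RetryExecutor` interface is not shown in this part of the notebook, the block uses a hypothetical stand-in, `retry_call`, and a made-up flaky tool.

```python
import time


def retry_call(func, max_attempts=3, initial_delay=0.5, sleep=time.sleep):
    """Call func() until it succeeds, doubling the delay after each failure.

    Hypothetical helper; re-raises the last exception once attempts run out.
    """
    delay = initial_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception as exc:
            if attempt == max_attempts:
                raise
            print(f"  attempt {attempt} failed ({exc}); retrying in {delay}s")
            sleep(delay)
            delay *= 2


# Hypothetical flaky tool: fails twice, then succeeds
attempts_seen = []


def flaky_show_command():
    attempts_seen.append(1)
    if len(attempts_seen) < 3:
        raise ConnectionError("SSH session dropped")
    return "BGP neighbors: 2 established"


delays = []
# Record the delays instead of sleeping so the example finishes instantly
outcome = retry_call(flaky_show_command, sleep=delays.append)
print(outcome)  # BGP neighbors: 2 established
print(delays)   # [0.5, 1.0]
```

Inside `run_task`, the line that simulates the task is where a wrapped tool call of this kind would go.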

## Cell 15: Try It Yourself -- Network Agent Playground

Use this cell to experiment with the network agent. Try different
troubleshooting scenarios and see how the agent uses tools to investigate.

In [None]:
# -----------------------------------------------------------------------
# TRY IT: Ask the network agent a question
# -----------------------------------------------------------------------
# Modify the query below and run the cell. The agent has access to:
# - get_interface_status(device, interface)
# - get_bgp_neighbors(device)
# - ping_device(source_device, target_ip)
#
# Try queries like:
# - "What's the status of all BGP peers on router-core-01?"
# - "Is interface GigabitEthernet0/1 on router-core-01 healthy?"
# - "Can router-core-01 reach 203.0.113.2?"
# -----------------------------------------------------------------------

agent = NetworkAgent()

your_query = "Check if there are any interface errors on router-core-01 GigabitEthernet0/1 and GigabitEthernet0/2"

print("=" * 60)
print("YOUR QUERY")
print("=" * 60)
response = agent.run(your_query)

## Summary

This notebook demonstrated AI agent patterns applied to network operations:

1. **Network Agent with Tool Calling** -- The core pattern: LLM decides which network tools to call (show commands, pings) and reasons about the results
2. **Error Classification** -- Categorize failures (SSH timeout vs. auth failure vs. device unreachable) like routing protocols classify link failures
3. **Retry with Exponential Backoff** -- Same concept as CSMA/CD backoff: wait longer after each failure
4. **State Management** -- Track what the agent has done, what it knows, and what's left
5. **DAG Workflows** -- Execute network changes in dependency order (like a MoP checklist)
6. **Tool Validation** -- Validate parameters before calling network tools
7. **Decision Making** -- Multi-criteria analysis for choosing between options
8. **Circuit Breaker** -- Like BGP route flap dampening: stop calling a service after repeated failures, then try again after a recovery timeout
9. **Planning Agent** -- Generate a Method of Procedure before making changes
10. **Monitoring & Metrics** -- Track agent performance (success rates, durations)
11. **Async Workflows** -- Run independent checks in parallel (ping + traceroute + show commands)
12. **Human-in-the-Loop** -- Require operator approval for config changes (like a CAB workflow)

### Key Takeaway for Network Engineers

AI agents are **event-driven automation with an LLM as the decision engine**. Instead of writing
an if/else playbook for every possible scenario, you give the agent tools (SSH, APIs, databases)
and let the LLM figure out which tools to use and in what order.

The patterns in this notebook (retry, circuit breaker, DAG workflows, human approval) are the
same reliability patterns you already use in networking -- just applied to AI-driven automation.