## Summary

This notebook demonstrates a failure agent built with LangGraph, using mock data in place of production data stores. The agent:

- **Receives** machine failure alerts with error codes and details
- **Analyzes** the failure using multiple retrieval tools
- **Synthesizes** information from technical documentation, maintenance history, and expert knowledge
- **Generates** comprehensive incident reports with step-by-step repair instructions
- **Maintains** conversation history for audit and learning purposes

The LangGraph framework provides:
- Clear workflow definition with nodes and edges
- Automatic tool binding and execution
- Message history management
- Extensibility for adding new tools and decision logic
- Support for async operations

To use this in production:
1. Replace mock databases with real MongoDB connections and vector embeddings
2. Configure OpenAI API credentials
3. Add persistent storage for incident reports
4. Implement checkpointing for long-running operations
5. Add error handling and retry logic
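
As a rough illustration of the retry logic in step 5, the sketch below wraps an async call in exponential backoff. The helper name `call_with_retry` and its parameters are assumptions made for this example, not part of the notebook or of LangGraph, and a production version would catch narrower exception types.

```python
import asyncio
import random


async def call_with_retry(make_call, max_attempts=3, base_delay=0.5):
    """Await make_call(), retrying failed attempts with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await make_call()
        except Exception:  # assumption: narrow this to transient errors in real use
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Back off base_delay, 2x, 4x, ... plus up to 10% random jitter
            delay = base_delay * 2 ** (attempt - 1)
            await asyncio.sleep(delay + random.uniform(0, delay / 10))
```

In a notebook cell this could be used as `result = await call_with_retry(lambda: failure_agent.ainvoke(initial_state))`.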

In [None]:
# Graph structure information
graph_info = """
AGENT GRAPH STRUCTURE:
──────────────────────────────────────────────────────────────────

Nodes:
  - agent: Processes input and calls LLM with tool bindings
  - tools: Executes tool calls and returns results

Edges:
  - START → agent: Entry point
  - agent → tools: When agent calls tools
  - tools → agent: Loop back for next reasoning step
  - agent → END: When no more tool calls needed

State Schema:
  - messages: Annotated list of BaseMessage objects
              (Maintains conversation history)

Routing Logic:
  - If last message has tool_calls → route to "tools" node
  - Otherwise → route to END (terminate)

Execution Flow:
  1. User provides alert details via HumanMessage
  2. Agent receives message and decides what tools to use
  3. Agent calls appropriate retrieval and analysis tools
  4. Tool results returned as ToolMessages
  5. Agent synthesizes results and generates incident report
  6. Agent generates final summary
  7. Graph terminates with complete incident documentation
"""

print(graph_info)

# Show how to use the agent
usage_example = """
USAGE EXAMPLE:
──────────────────────────────────────────────────────────────────

# Input an alert
initial_state = {
    "messages": [
        HumanMessage(content="Alert: Machine MACH-001 reported error E001 at 10:30 AM")
    ]
}

# Run the agent asynchronously
import asyncio
result = await failure_agent.ainvoke(initial_state)

# Access the conversation history
for message in result["messages"]:
    print(f"{message.__class__.__name__}: {message.content}")

# Incidents are stored in INCIDENT_REPORTS for later retrieval
print(f"Total incidents created: {len(INCIDENT_REPORTS)}")
"""

print(usage_example)

In [None]:
# Visualize the agent workflow
import textwrap

# Agent flow diagram
flow_diagram = """
┌──────────────────────────────────────────────────────────────────┐
│                     FAILURE AGENT WORKFLOW                       │
└──────────────────────────────────────────────────────────────────┘

                              START
                                │
                                ▼
                        ┌──────────────┐
                        │    AGENT     │
                        │  Node: Call  │
                        │   the LLM    │
                        └──────────────┘
                                │
                    ┌───────────┴───────────┐
                    │                       │
              Tool Calls?              No Calls
                    │                       │
                    ▼                       ▼
            ┌──────────────┐         ┌─────────────────┐
            │    TOOLS     │         │   AGENT SENDS   │
            │ Node: Process│         │   FINAL MESSAGE │
            │ Tool Results │         └─────────────────┘
            └──────────────┘                 │
                    │                        ▼
                    │                       END
                    │
                    └────────────────────────┘
                            ▲
                            │
        Continue loop while agent has tool calls


TOOLS AVAILABLE TO THE AGENT:
──────────────────────────────────────────────────────────────────

1. retrieve_manual(query, n=3)
   - Searches technical documentation
   - Returns relevant manuals and procedures
   - Helps understand error codes and prevention

2. retrieve_work_orders(query, n=3)
   - Finds related maintenance history
   - Shows previous occurrences and resolutions
   - Provides proven repair strategies

3. retrieve_interviews(query, n=3)
   - Accesses maintenance technician expertise
   - Provides practical troubleshooting tips
   - Includes lessons learned from field experience

4. generate_incident_report(error_code, error_name, root_cause, 
                           repair_instructions, machine_id)
   - Creates formal incident documentation
   - Stores structured repair procedures
   - Enables knowledge base building

──────────────────────────────────────────────────────────────────
"""

print(flow_diagram)

## 7. Visualize the Agent Flow

Visualize the graph structure to understand the agent's workflow.
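
One lightweight way to draw the same structure is to emit Mermaid flowchart text from an edge list. The sketch below is independent of LangGraph: the `EDGES` list is written by hand from the graph described in this notebook, and `to_mermaid` is a hypothetical helper.

```python
# Edges of the agent graph as (source, target, label); copied by hand for this sketch
EDGES = [
    ("START", "agent", ""),
    ("agent", "tools", "tool calls"),
    ("tools", "agent", ""),
    ("agent", "END", "no tool calls"),
]


def to_mermaid(edges):
    """Render an edge list as Mermaid flowchart source text."""
    lines = ["flowchart TD"]
    for src, dst, label in edges:
        # Labelled edges use the -->|label| form, unlabelled ones a plain arrow
        arrow = f"-->|{label}|" if label else "-->"
        lines.append(f"    {src} {arrow} {dst}")
    return "\n".join(lines)


print(to_mermaid(EDGES))
```

The printed text can be pasted into any Mermaid renderer.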

In [None]:
# Example: Simulated incident report that would be generated
example_incident_report = {
    "incident_id": "INC-1001",
    "timestamp": "2025-01-26T10:30:00",
    "error_code": "E001",
    "error_name": "Motor Overheating",
    "machine_id": "MACH-2024-001",
    "root_cause": "The motor coolant pump is clogged with debris, preventing proper heat dissipation",
    "repair_instructions": [
        {
            "step": 1,
            "description": "Turn off the machine and allow it to cool for 30 minutes"
        },
        {
            "step": 2,
            "description": "Remove the coolant pump cover using a 15mm wrench"
        },
        {
            "step": 3,
            "description": "Inspect the pump inlet for debris and clean if necessary"
        },
        {
            "step": 4,
            "description": "Check coolant levels and top up with ISO VG 32 coolant if needed"
        },
        {
            "step": 5,
            "description": "Replace the pump cover and run the machine at idle for 5 minutes"
        },
        {
            "step": 6,
            "description": "Monitor temperature for 30 minutes and confirm normal operation"
        }
    ]
}

print("\n" + "=" * 80)
print("EXAMPLE INCIDENT REPORT OUTPUT")
print("=" * 80)
print(json.dumps(example_incident_report, indent=2))

print("\n✓ Generated incident reports are stored for future reference")

In [None]:
# Test Scenario 2: Belt Misalignment
print("\n" + "=" * 80)
print("TEST SCENARIO 2: BELT MISALIGNMENT (E002)")
print("=" * 80)

test_input_2 = """
Alert Details:
- Error Code: E002
- Error Name: Belt Misalignment
- Machine ID: MACH-2024-005
- Timestamp: 2025-01-26 11:15:00
- Severity: Medium

Please analyze this failure and generate an incident report.
"""

initial_state_2 = {
    "messages": [HumanMessage(content=test_input_2)]
}

print("\n📨 Input Alert:")
print(test_input_2)
print("\n🤖 Agent Processing...")
print("\nThe agent would:")
print("  1. Retrieve technical documentation about belt alignment")
print("  2. Find previous work orders with similar issues")
print("  3. Access maintenance technician expertise on belt alignment")
print("  4. Create comprehensive incident report with step-by-step repair guide")

In [None]:
# Test Scenario 1: Motor Overheating
print("=" * 80)
print("TEST SCENARIO 1: MOTOR OVERHEATING (E001)")
print("=" * 80)

test_input_1 = """
Alert Details:
- Error Code: E001
- Error Name: Motor Overheating
- Machine ID: MACH-2024-001
- Timestamp: 2025-01-26 10:30:00
- Severity: High

Please analyze this failure, retrieve relevant information, and generate an incident report 
with repair instructions.
"""

initial_state_1 = {
    "messages": [HumanMessage(content=test_input_1)]
}

print("\n📨 Input Alert:")
print(test_input_1)
print("\n🤖 Agent Processing...")

# The agent is not invoked in this cell; the prints below only outline the expected steps.
print("\nNote: To run the agent from a notebook cell, use: await failure_agent.ainvoke(initial_state_1)")
print("The agent would then:")
print("  1. Call retrieve_manual('E001 motor overheating')")
print("  2. Call retrieve_work_orders('motor overheating')")
print("  3. Call retrieve_interviews('motor overheating diagnosis')")
print("  4. Generate incident report with root cause and repair steps")

## 6. Test the Failure Agent

Execute the failure agent with sample failure scenarios and observe how it diagnoses issues and suggests solutions.
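
Notebook cells support top-level `await`, but a plain Python script needs an event loop to drive `ainvoke`. The sketch below shows one way to wrap a scenario run with `asyncio.run`. `StubAgent` is a hypothetical stand-in for the compiled graph, and messages are simplified to plain dicts rather than `HumanMessage` objects.

```python
import asyncio


class StubAgent:
    """Hypothetical stand-in for the compiled graph; it only echoes the alert."""

    async def ainvoke(self, state):
        alert = state["messages"][-1]["content"]
        reply = {"role": "ai", "content": "Report drafted for: " + alert}
        return {"messages": state["messages"] + [reply]}


async def run_scenario(agent, alert_text):
    """Send one alert through the agent and return the final message content."""
    state = {"messages": [{"role": "human", "content": alert_text}]}
    result = await agent.ainvoke(state)
    return result["messages"][-1]["content"]


def run_scenario_sync(agent, alert_text):
    """Entry point for plain scripts, where no event loop is running yet."""
    return asyncio.run(run_scenario(agent, alert_text))
```

Inside a notebook, `await run_scenario(agent, alert_text)` would be used directly, since `asyncio.run` cannot be called from an already running event loop.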

In [None]:
# Build the StateGraph
workflow = StateGraph(FailureAgentState)

# Add nodes
workflow.add_node("agent", agent_node)
workflow.add_node("tools", process_tool_calls)

# Add edges
workflow.add_edge(START, "agent")
workflow.add_conditional_edges("agent", should_continue)
workflow.add_edge("tools", "agent")

# Compile the graph
failure_agent = workflow.compile()

print("✓ Failure Agent graph compiled successfully")

## 5. Compile the Graph

Create the StateGraph and compile it into an executable agent.
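
To make the control flow of the compiled graph concrete, here is a plain-Python sketch of the loop it runs: agent, then tools while tool calls are pending, then END. Everything in it (`fake_agent`, `fake_tools`, `run_graph`, the dict-based messages, and the `END` placeholder) is a simplification invented for illustration and is not how LangGraph executes graphs internally.

```python
END = "__end__"  # placeholder sentinel for this sketch


def fake_agent(state):
    """Hypothetical agent: requests one tool call, then answers once it has a result."""
    has_tool_result = any(m["role"] == "tool" for m in state["messages"])
    if has_tool_result:
        return {"role": "ai", "content": "Incident report ready", "tool_calls": []}
    call = {"name": "lookup", "args": {"code": "E001"}}
    return {"role": "ai", "content": "", "tool_calls": [call]}


def fake_tools(state):
    """Hypothetical tools node: answers every pending tool call."""
    calls = state["messages"][-1]["tool_calls"]
    return [{"role": "tool", "content": f"result for {c['name']}"} for c in calls]


def route(state):
    """Same decision as should_continue: tools if calls are pending, else END."""
    return "tools" if state["messages"][-1].get("tool_calls") else END


def run_graph(state, max_steps=10):
    """Walk START -> agent -> (tools -> agent)* -> END, appending messages."""
    for _ in range(max_steps):
        state["messages"].append(fake_agent(state))
        if route(state) == END:
            return state
        state["messages"].extend(fake_tools(state))
    raise RuntimeError("step limit reached without finishing")
```

The `max_steps` guard is an addition of this sketch; it keeps a misbehaving agent from looping forever.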

In [None]:
# Define the Routing Logic
def should_continue(state: FailureAgentState) -> str:
    """
    Decide whether to continue with tool execution or end the conversation.
    """
    messages = state["messages"]
    last_message = messages[-1]
    
    # If the last message has tool calls, route to the tools node
    if hasattr(last_message, "tool_calls") and last_message.tool_calls:
        return "tools"
    
    # Otherwise, end the agent
    return END

print("✓ Routing logic defined")

In [None]:
# Define the Tool Execution Node
async def process_tool_calls(state: FailureAgentState) -> FailureAgentState:
    """
    Process tool calls from the agent and return the results.
    """
    messages = state["messages"]
    last_message = messages[-1]
    
    tool_results = []
    
    # Check if the last message has tool calls
    if hasattr(last_message, "tool_calls") and last_message.tool_calls:
        for tool_call in last_message.tool_calls:
            tool_name = tool_call["name"]
            tool_input = tool_call["args"]
            
            print(f"\n🔧 Executing tool: {tool_name}")
            print(f"   Input: {tool_input}")
            
            # Find and execute the tool
            for tool in tools:
                if tool.name == tool_name:
                    result = await tool.ainvoke(tool_input)
                    print(f"   Result: {result[:100]}...")
                    
                    tool_message = ToolMessage(
                        content=result,
                        tool_call_id=tool_call["id"]
                    )
                    tool_results.append(tool_message)
                    break
    
    return {
        "messages": tool_results
    }

print("✓ Tool execution node defined")

In [None]:
# Define the Agent Node
async def agent_node(state: FailureAgentState) -> FailureAgentState:
    """
    The agent node processes messages and calls the LLM to decide next steps.
    """
    # Create the prompt template
    prompt = ChatPromptTemplate.from_messages([
        (
            "system",
            """You are the Failure Agent. Your role is to:
1. Receive alert details about machine failures
2. Retrieve additional context from manuals, work orders, and maintenance expertise
3. Analyze the root cause of the failure
4. Generate a comprehensive incident report with repair instructions

Use your tools strategically to gather all necessary information before generating the incident report.
After the incident report is generated, acknowledge the completion with a brief summary."""
        ),
        MessagesPlaceholder(variable_name="messages"),
    ])
    
    # Format the messages
    formatted_prompt = await prompt.ainvoke({"messages": state["messages"]})
    
    # Get the response from the model
    response = await llm_with_tools.ainvoke(formatted_prompt)
    
    return {
        "messages": [response]
    }

print("✓ Agent node defined")

In [None]:
# Initialize the LLM
llm = ChatOpenAI(
    model="gpt-4",
    temperature=0,
    api_key=os.getenv("OPENAI_API_KEY")
)

# Bind tools to the model
llm_with_tools = llm.bind_tools(tools)

print("✓ Language model configured with tools")

## 4. Build the Agent Graph

Construct the LangGraph workflow by defining nodes for agent logic, tool execution, and decision-making processes.
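
One detail worth sketching for the tool execution node is what happens when the model asks for a tool that does not exist or a tool raises. A possible approach, shown here with plain functions and dicts, is to look tools up by name and always return a reply tied to the call's id. `TOOL_REGISTRY`, `lookup_manual`, and `execute_tool_call` are hypothetical names for this example only.

```python
import json


def lookup_manual(query: str) -> str:
    """Hypothetical tool used only for this sketch."""
    return json.dumps({"query": query, "hits": 1})


# Name -> callable lookup instead of scanning a list of tools
TOOL_REGISTRY = {"lookup_manual": lookup_manual}


def execute_tool_call(tool_call: dict) -> dict:
    """Run one tool call and always return a reply tied to its id."""
    func = TOOL_REGISTRY.get(tool_call["name"])
    if func is None:
        # Unknown tool: report the problem instead of silently skipping the call
        content = json.dumps({"error": f"unknown tool: {tool_call['name']}"})
    else:
        try:
            content = func(**tool_call["args"])
        except Exception as exc:
            content = json.dumps({"error": str(exc)})
    return {"tool_call_id": tool_call["id"], "content": content}
```

Returning an error payload rather than nothing gives the model a chance to correct its next call.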

In [None]:
# Define Tool Functions

@tool
def retrieve_manual(query: str, n: int = 3) -> str:
    """
    Retrieve relevant technical manuals for the alert via semantic search.
    
    Args:
        query: The search query for technical documentation
        n: Number of results to return (default 3)
    
    Returns:
        JSON string containing relevant manual excerpts
    """
    # Mock implementation - in production would use vector search with embeddings
    results = []
    for error_code, manual in MOCK_MANUALS.items():
        if query.lower() in manual["title"].lower() or query.lower() in manual["content"].lower():
            results.append({
                "error_code": error_code,
                "title": manual["title"],
                "content": manual["content"],
                "relevance_score": 0.95
            })
    
    return json.dumps(results[:n])

@tool
def retrieve_work_orders(query: str, n: int = 3) -> str:
    """
    Retrieve related work orders for the alert via semantic search.
    
    Args:
        query: The search query for work orders
        n: Number of results to return (default 3)
    
    Returns:
        JSON string containing related work order information
    """
    # Mock implementation - in production would use vector search
    results = []
    for wo_id, wo in MOCK_WORKORDERS.items():
        if query.lower() in wo["title"].lower() or query.lower() in wo["observations"].lower():
            results.append({
                "work_order_id": wo_id,
                "title": wo["title"],
                "observations": wo["observations"],
                "date": wo["date"],
                "relevance_score": 0.88
            })
    
    return json.dumps(results[:n])

@tool
def retrieve_interviews(query: str, n: int = 3) -> str:
    """
    Retrieve interviews and expertise related to the alert via semantic search.
    
    Args:
        query: The search query for maintenance expertise
        n: Number of results to return (default 3)
    
    Returns:
        JSON string containing relevant interview excerpts
    """
    # Mock implementation - in production would use vector search
    results = []
    for int_id, interview in MOCK_INTERVIEWS.items():
        if query.lower() in interview["text"].lower() or query.lower() in interview["technician"].lower():
            results.append({
                "interview_id": int_id,
                "technician": interview["technician"],
                "expertise": interview["text"],
                "relevance_score": 0.92
            })
    
    return json.dumps(results[:n])

@tool
def generate_incident_report(
    error_code: str,
    error_name: str,
    root_cause: str,
    repair_instructions: List[Dict[str, Any]],
    machine_id: str
) -> str:
    """
    Generate and store an incident report for the failure alert.
    
    Args:
        error_code: The error code for the incident
        error_name: Human-readable name of the error
        root_cause: Root cause analysis inferred from context
        repair_instructions: List of repair steps (3-6 steps)
        machine_id: ID of the affected machine
    
    Returns:
        JSON string with incident report confirmation
    """
    report = {
        "incident_id": f"INC-{len(INCIDENT_REPORTS) + 1001}",
        "timestamp": datetime.now().isoformat(),
        "error_code": error_code,
        "error_name": error_name,
        "root_cause": root_cause,
        "repair_instructions": repair_instructions,
        "machine_id": machine_id,
        "status": "created"
    }
    
    INCIDENT_REPORTS.append(report)
    
    return json.dumps({
        "success": True,
        "incident_id": report["incident_id"],
        "message": "Incident report created successfully"
    })

# Get all tools
tools = [retrieve_manual, retrieve_work_orders, retrieve_interviews, generate_incident_report]

print("✓ Tool functions defined and registered")

In [None]:
# Mock database for demonstration
# In production, these would connect to MongoDB with vector search

MOCK_MANUALS = {
    "E001": {
        "title": "Error E001: Motor Overheating",
        "content": "The motor may be overheating due to excessive load or insufficient cooling. Check coolant levels and ensure ventilation is not blocked."
    },
    "E002": {
        "title": "Error E002: Belt Misalignment",
        "content": "Belt misalignment can cause uneven wear and reduced efficiency. Inspect belt tension and alignment guides."
    }
}

MOCK_WORKORDERS = {
    "WO-1001": {
        "title": "Replace Motor Bearings",
        "observations": "Previous motor failure resolved by replacing worn bearings",
        "date": "2025-12-15"
    },
    "WO-1002": {
        "title": "Coolant System Maintenance",
        "observations": "Coolant flush and filter replacement prevented overheating",
        "date": "2025-11-20"
    }
}

MOCK_INTERVIEWS = {
    "INT-001": {
        "technician": "John Smith",
        "text": "When we see E001 errors, the first thing to check is the coolant pump. 9 out of 10 times it's just debris in the pump."
    },
    "INT-002": {
        "technician": "Maria Garcia",
        "text": "E002 belt issues often happen when the drive sprockets are misaligned. Check both ends of the shaft alignment."
    }
}

INCIDENT_REPORTS = []

print("✓ Mock database initialized")

## 3. Create Tool Functions

The failure agent uses four main tools to diagnose failures and generate incident reports:
- **retrieve_manual**: Search technical manuals for relevant information
- **retrieve_work_orders**: Find related maintenance work orders
- **retrieve_interviews**: Access maintenance staff expertise and historical insights
- **generate_incident_report**: Create and store incident reports
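
The mock retrieval tools match the whole query as a single substring, so a multi-word query such as `E001 motor overheating` does not match the title `Error E001: Motor Overheating` (the colon breaks the substring). As a stopgap before real vector search, a token-overlap match could be used. The helpers `tokenize` and `keyword_search` below are assumptions made for this sketch.

```python
import re


def tokenize(text: str) -> set:
    """Lowercase the text and split it into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def keyword_search(query: str, documents: dict, n: int = 3) -> list:
    """Rank documents by the fraction of query tokens they contain."""
    query_tokens = tokenize(query)
    if not query_tokens:
        return []
    scored = []
    for doc_id, text in documents.items():
        overlap = len(query_tokens & tokenize(text)) / len(query_tokens)
        if overlap > 0:
            scored.append((overlap, doc_id))
    # Highest overlap first; ties keep their original insertion order
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored[:n]]
```

The overlap fraction could also stand in for the hard-coded `relevance_score` values in the mock tools.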

In [None]:
# Define State Schema
from typing import TypedDict

class FailureAgentState(TypedDict):
    """State schema for the Failure Agent"""
    messages: Annotated[List[BaseMessage], add_messages]
    
print("✓ State schema defined")

## 2. Define the State Schema

The state schema maintains the conversation history and messages throughout the agent's execution.
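
The `add_messages` annotation tells LangGraph to merge each node's returned messages into the existing list rather than overwrite it. The sketch below imitates only the append behaviour with plain lists. `append_messages` and `apply_update` are hypothetical names, and the real reducer does more than this (for example, it can update an existing message that has the same ID).

```python
def append_messages(existing: list, update: list) -> list:
    """Simplified reducer: return the old history followed by the new messages."""
    return existing + update


def apply_update(state: dict, node_output: dict) -> dict:
    """Merge a node's partial output into the state using the reducer."""
    merged = dict(state)
    merged["messages"] = append_messages(state["messages"], node_output["messages"])
    return merged
```

This is why the nodes in this notebook return only their new messages, such as `{"messages": [response]}`, instead of the full history.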

In [None]:
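# Install the required libraries directly in the notebook
# (asyncio ships with Python and must not be pip-installed; the dotenv module comes from python-dotenv)
!pip install python-dotenv pymongo voyageai openai langchain langchain-openai langgraph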

In [None]:
# Import Required Libraries
import os
import json
from datetime import datetime
from typing import Any, Dict, Optional, List
from dotenv import load_dotenv

# LangChain and LangGraph imports
from langchain_core.tools import tool
from langchain_core.messages import BaseMessage, HumanMessage, AIMessage, ToolMessage
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_openai import ChatOpenAI
from langgraph.graph import StateGraph, START, END
from langgraph.types import StateSnapshot
from langgraph.graph.message import add_messages
from typing import Annotated

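# MongoDB integration (local module)
# Note: the mock @tool functions defined later reuse the names retrieve_work_orders
# and retrieve_interviews, so they shadow these imports.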
from mongodb_integration import (
    retrieve_manuals,
    retrieve_work_orders,
    retrieve_interviews,
    insert_incident_report,
    close_mongo_client,
)

# Load environment variables
load_dotenv()

print("✓ All libraries imported successfully")
print("✓ MongoDB integration loaded")

# Failure Agent using LangGraph

This notebook implements a Python version of the failure agent that processes machine failure alerts and generates incident reports. The agent uses LangGraph as the agentic framework to coordinate multi-step reasoning and tool execution.

## Summary

This notebook demonstrates a production-ready failure agent implementation using LangGraph. The agent:

- **Receives** machine failure alerts with error codes and details
- **Analyzes** the failure using multiple retrieval tools
- **Synthesizes** information from technical documentation, maintenance history, and expert knowledge
- **Generates** comprehensive incident reports with step-by-step repair instructions
- **Maintains** conversation history for audit and learning purposes

The LangGraph framework provides:
- Clear workflow definition with nodes and edges
- Automatic tool binding and execution
- Message history management
- Extensibility for adding new tools and decision logic
- Support for async operations

To use this in production:
1. Replace mock databases with real MongoDB connections and vector embeddings
2. Configure OpenAI API credentials
3. Add persistent storage for incident reports
4. Implement checkpointing for long-running operations
5. Add error handling and retry logic

In [None]:
# Graph structure information
graph_info = f"""
AGENT GRAPH STRUCTURE:
‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ

Nodes:
  - agent: Processes input and calls LLM with tool bindings
  - tools: Executes tool calls and returns results

Edges:
  - START ‚Üí agent: Entry point
  - agent ‚Üí tools: When agent calls tools
  - tools ‚Üí agent: Loop back for next reasoning step
  - agent ‚Üí END: When no more tool calls needed

State Schema:
  - messages: Annotated list of BaseMessage objects
              (Maintains conversation history)

Routing Logic:
  - If last message has tool_calls ‚Üí route to "tools" node
  - Otherwise ‚Üí route to END (terminate)

Execution Flow:
  1. User provides alert details via HumanMessage
  2. Agent receives message and decides what tools to use
  3. Agent calls appropriate retrieval and analysis tools
  4. Tool results returned as ToolMessages
  5. Agent synthesizes results and generates incident report
  6. Agent generates final summary
  7. Graph terminates with complete incident documentation
"""

print(graph_info)

# Show how to use the agent
usage_example = """
USAGE EXAMPLE:
‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ

# Input an alert
initial_state = {
    "messages": [
        HumanMessage(content="Alert: Machine MACH-001 reported error E001 at 10:30 AM")
    ]
}

# Run the agent asynchronously
import asyncio
result = await failure_agent.ainvoke(initial_state)

# Access the conversation history
for message in result["messages"]:
    print(f"{message.__class__.__name__}: {message.content}")

# Incidents are stored in INCIDENT_REPORTS for later retrieval
print(f"Total incidents created: {len(INCIDENT_REPORTS)}")
"""

print(usage_example)

In [None]:
# Visualize the agent workflow
import textwrap

# Agent flow diagram
flow_diagram = """
‚îå‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îê
‚îÇ                     FAILURE AGENT WORKFLOW                       ‚îÇ
‚îî‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îò

                              START
                                ‚îÇ
                                ‚ñº
                        ‚îå‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îê
                        ‚îÇ    AGENT     ‚îÇ
                        ‚îÇ  Node: Call  ‚îÇ
                        ‚îÇ Language LLM ‚îÇ
                        ‚îî‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îò
                                ‚îÇ
                    ‚îå‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚î¥‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îê
                    ‚îÇ                       ‚îÇ
              Tool Calls?              No Calls
                    ‚îÇ                       ‚îÇ
                    ‚ñº                       ‚ñº
            ‚îå‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îê         ‚îå‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îê
            ‚îÇ    TOOLS     ‚îÇ         ‚îÇ   AGENT SENDS   ‚îÇ
            ‚îÇ Node: Process‚îÇ         ‚îÇ   FINAL MESSAGE ‚îÇ
            ‚îÇ Tool Results ‚îÇ         ‚îî‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îò
            ‚îî‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îò                 ‚îÇ
                    ‚îÇ                        ‚ñº
                    ‚îÇ                       END
                    ‚îÇ
                    ‚îî‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îò
                            ‚ñ≤
                            ‚îÇ
        Continue loop while agent has tool calls


TOOLS AVAILABLE TO THE AGENT:
‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ

1. retrieve_manual(query, n=3)
   - Searches technical documentation
   - Returns relevant manuals and procedures
   - Helps understand error codes and prevention

2. retrieve_work_orders(query, n=3)
   - Finds related maintenance history
   - Shows previous occurrences and resolutions
   - Provides proven repair strategies

3. retrieve_interviews(query, n=3)
   - Accesses maintenance technician expertise
   - Provides practical troubleshooting tips
   - Includes lessons learned from field experience

4. generate_incident_report(error_code, error_name, root_cause, 
                           repair_instructions, machine_id)
   - Creates formal incident documentation
   - Stores structured repair procedures
   - Enables knowledge base building

‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
"""

print(flow_diagram)

## 8. Visualize the Agent Flow

Visualize the graph structure to understand the agent's workflow.

In [None]:
# Example: Simulated incident report that would be generated
example_incident_report = {
    "incident_id": "INC-1001",
    "timestamp": "2025-01-26T10:30:00",
    "error_code": "E001",
    "error_name": "Motor Overheating",
    "machine_id": "MACH-2024-001",
    "root_cause": "The motor coolant pump is clogged with debris, preventing proper heat dissipation",
    "repair_instructions": [
        {
            "step": 1,
            "description": "Turn off the machine and allow it to cool for 30 minutes"
        },
        {
            "step": 2,
            "description": "Remove the coolant pump cover using a 15mm wrench"
        },
        {
            "step": 3,
            "description": "Inspect the pump inlet for debris and clean if necessary"
        },
        {
            "step": 4,
            "description": "Check coolant levels and top up with ISO VG 32 coolant if needed"
        },
        {
            "step": 5,
            "description": "Replace the pump cover and run the machine at idle for 5 minutes"
        },
        {
            "step": 6,
            "description": "Monitor temperature for 30 minutes and confirm normal operation"
        }
    ]
}

print("\n" + "=" * 80)
print("EXAMPLE INCIDENT REPORT OUTPUT")
print("=" * 80)
print(json.dumps(example_incident_report, indent=2))

print("\n‚úì Generated incident reports are stored for future reference")

In [None]:
# Test Scenario 2: Belt Misalignment
print("\n" + "=" * 80)
print("TEST SCENARIO 2: BELT MISALIGNMENT (E002)")
print("=" * 80)

test_input_2 = """
Alert Details:
- Error Code: E002
- Error Name: Belt Misalignment
- Machine ID: MACH-2024-005
- Timestamp: 2025-01-26 11:15:00
- Severity: Medium

Please analyze this failure and generate an incident report.
"""

initial_state_2 = {
    "messages": [HumanMessage(content=test_input_2)]
}

print("\nüì® Input Alert:")
print(test_input_2)
print("\nü§ñ Agent Processing...")
print("\nThe agent would:")
print("  1. Retrieve technical documentation about belt alignment")
print("  2. Find previous work orders with similar issues")
print("  3. Access maintenance technician expertise on belt alignment")
print("  4. Create comprehensive incident report with step-by-step repair guide")

In [None]:
# Test Scenario 1: Motor Overheating
print("=" * 80)
print("TEST SCENARIO 1: MOTOR OVERHEATING (E001)")
print("=" * 80)

test_input_1 = """
Alert Details:
- Error Code: E001
- Error Name: Motor Overheating
- Machine ID: MACH-2024-001
- Timestamp: 2025-01-26 10:30:00
- Severity: High

Please analyze this failure, retrieve relevant information, and generate an incident report 
with repair instructions.
"""

initial_state_1 = {
    "messages": [HumanMessage(content=test_input_1)]
}

print("\nüì® Input Alert:")
print(test_input_1)
print("\nü§ñ Agent Processing...")

# Run the agent (Note: this requires async context, so we'll show the structure)
print("\nNote: In a production environment, use: await failure_agent.ainvoke(initial_state_1)")
print("The agent would then:")
print("  1. Call retrieve_manual('E001 motor overheating')")
print("  2. Call retrieve_work_orders('motor overheating')")
print("  3. Call retrieve_interviews('motor overheating diagnosis')")
print("  4. Generate incident report with root cause and repair steps")

## 7. Test the Failure Agent

Execute the failure agent with sample failure scenarios and observe how it diagnoses issues and suggests solutions.

In [None]:
# Build the StateGraph
workflow = StateGraph(FailureAgentState)

# Add nodes
workflow.add_node("agent", agent_node)
workflow.add_node("tools", process_tool_calls)

# Add edges
workflow.add_edge(START, "agent")
workflow.add_conditional_edges("agent", should_continue)
workflow.add_edge("tools", "agent")

# Compile the graph
failure_agent = workflow.compile()

print("‚úì Failure Agent graph compiled successfully")

## 6. Compile the Graph

Create the StateGraph and compile it into an executable agent.

In [None]:
# Define the Routing Logic
def should_continue(state: FailureAgentState) -> str:
    """
    Decide whether to continue with tool execution or end the conversation.
    """
    messages = state["messages"]
    last_message = messages[-1]
    
    # If the last message has tool calls, route to the tools node
    if hasattr(last_message, "tool_calls") and last_message.tool_calls:
        return "tools"
    
    # Otherwise, end the agent
    return END

print("✓ Routing logic defined")

In [None]:
# Define the Tool Execution Node
async def process_tool_calls(state: FailureAgentState) -> FailureAgentState:
    """
    Process tool calls from the agent and return the results.
    """
    messages = state["messages"]
    last_message = messages[-1]
    
    tool_results = []
    
    # Check if the last message has tool calls
    if hasattr(last_message, "tool_calls") and last_message.tool_calls:
        for tool_call in last_message.tool_calls:
            tool_name = tool_call["name"]
            tool_input = tool_call["args"]
            
            print(f"\n🔧 Executing tool: {tool_name}")
            print(f"   Input: {tool_input}")
            
            # Find and execute the tool
            for tool in tools:
                if tool.name == tool_name:
                    result = await tool.ainvoke(tool_input)
                    print(f"   Result: {result[:100]}...")
                    
                    tool_message = ToolMessage(
                        content=result,
                        tool_call_id=tool_call["id"]
                    )
                    tool_results.append(tool_message)
                    break
    
    return {
        "messages": tool_results
    }

print("✓ Tool execution node defined")

In [None]:
# Define the Agent Node
async def agent_node(state: FailureAgentState) -> FailureAgentState:
    """
    The agent node processes messages and calls the LLM to decide next steps.
    """
    # Create the prompt template
    prompt = ChatPromptTemplate.from_messages([
        (
            "system",
            """You are the Failure Agent. Your role is to:
1. Receive alert details about machine failures
2. Retrieve additional context from manuals, work orders, and maintenance expertise
3. Analyze the root cause of the failure
4. Generate a comprehensive incident report with repair instructions

Use your tools strategically to gather all necessary information before generating the incident report.
After the incident report is generated, acknowledge the completion with a brief summary."""
        ),
        MessagesPlaceholder(variable_name="messages"),
    ])
    
    # Format the messages
    formatted_prompt = await prompt.ainvoke({"messages": state["messages"]})
    
    # Get the response from the model
    response = await llm_with_tools.ainvoke(formatted_prompt)
    
    return {
        "messages": [response]
    }

print("✓ Agent node defined")

In [None]:
# Initialize the LLM
llm = ChatOpenAI(
    model="gpt-4",
    temperature=0,
    api_key=os.getenv("OPENAI_API_KEY")
)

# Bind tools to the model
llm_with_tools = llm.bind_tools(tools)

print("✓ Language model configured with tools")

## 5. Build the Agent Graph

Construct the LangGraph workflow by defining nodes for agent logic, tool execution, and decision-making processes.
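
The agent/tools loop can be sketched without LangGraph or an LLM. The following is an illustrative simulation only: `stub_agent`, `stub_tools`, `route`, and the `END` sentinel are stand-ins invented for this sketch, and messages are plain dictionaries rather than LangChain message objects.

```python
import asyncio
from typing import Dict, List

END = "END"  # placeholder sentinel for this sketch

State = Dict[str, List[dict]]


async def stub_agent(state: State) -> State:
    """Hypothetical model: request one tool call, then finish."""
    already_called = any(m["role"] == "tool" for m in state["messages"])
    if already_called:
        return {"messages": [{"role": "ai", "content": "Report ready.", "tool_calls": []}]}
    call = {"id": "call-1", "name": "retrieve_manual", "args": {"query": "E001"}}
    return {"messages": [{"role": "ai", "content": "", "tool_calls": [call]}]}


async def stub_tools(state: State) -> State:
    """Answer every tool call in the last message with a canned result."""
    last = state["messages"][-1]
    results = [
        {"role": "tool", "content": f"result for {c['name']}", "tool_call_id": c["id"]}
        for c in last["tool_calls"]
    ]
    return {"messages": results}


def route(state: State) -> str:
    """Go to the tools node if the last message requests tools, else stop."""
    last = state["messages"][-1]
    return "tools" if last.get("tool_calls") else END


async def run(state: State) -> State:
    node = "agent"
    while node != END:
        update = await (stub_agent(state) if node == "agent" else stub_tools(state))
        # Append the node's new messages to the history
        state = {"messages": state["messages"] + update["messages"]}
        # After the agent, route conditionally; after tools, always return to the agent
        node = route(state) if node == "agent" else "agent"
    return state


final = asyncio.run(run({"messages": [{"role": "human", "content": "Alert E001"}]}))
print([m["role"] for m in final["messages"]])
```

The loop mirrors the graph's shape: the agent node decides, the routing function inspects the last message, and the tools node always hands control back to the agent.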

In [None]:
# Define Tool Functions using MongoDB Vector Search

@tool
def retrieve_manual(query: str, n: int = 3) -> str:
    """
    Retrieve relevant technical manuals for the alert via MongoDB vector search with Voyage AI embeddings.
    
    Args:
        query: The search query for technical documentation
        n: Number of results to return (default 3)
    
    Returns:
        JSON string containing relevant manual excerpts
    """
    try:
        return retrieve_manuals(query, n)
    except Exception as e:
        print(f"Error retrieving manuals: {e}")
        return json.dumps({"error": str(e)})


@tool
def retrieve_work_orders(query: str, n: int = 3) -> str:
    """
    Retrieve related work orders for the alert via MongoDB vector search with Voyage AI embeddings.
    
    Args:
        query: The search query for work orders
        n: Number of results to return (default 3)
    
    Returns:
        JSON string containing related work order information
    """
    try:
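        # NOTE: this name resolves to the tool defined here, not to a separate
        # search helper; the helper needs a distinct name or import alias.
        return retrieve_work_orders(query, n)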
    except Exception as e:
        print(f"Error retrieving work orders: {e}")
        return json.dumps({"error": str(e)})


@tool
def retrieve_interviews(query: str, n: int = 3) -> str:
    """
    Retrieve interviews and expertise related to the alert via MongoDB vector search with Voyage AI embeddings.
    
    Args:
        query: The search query for maintenance expertise
        n: Number of results to return (default 3)
    
    Returns:
        JSON string containing relevant interview excerpts
    """
    try:
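        # NOTE: this name resolves to the tool defined here, not to a separate
        # search helper; the helper needs a distinct name or import alias.
        return retrieve_interviews(query, n)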
    except Exception as e:
        print(f"Error retrieving interviews: {e}")
        return json.dumps({"error": str(e)})


@tool
def generate_incident_report(
    error_code: str,
    error_name: str,
    root_cause: str,
    repair_instructions: List[Dict[str, Any]],
    machine_id: str
) -> str:
    """
    Generate and store an incident report in MongoDB for the failure alert.
    
    Args:
        error_code: The error code for the incident
        error_name: Human-readable name of the error
        root_cause: Root cause analysis inferred from context
        repair_instructions: List of repair steps (3-6 steps)
        machine_id: ID of the affected machine
    
    Returns:
        JSON string with incident report confirmation
    """
    try:
        return insert_incident_report(
            error_code=error_code,
            error_name=error_name,
            root_cause=root_cause,
            repair_instructions=repair_instructions,
            machine_id=machine_id
        )
    except Exception as e:
        print(f"Error creating incident report: {e}")
        return json.dumps({"error": str(e)})


# Get all tools
tools = [retrieve_manual, retrieve_work_orders, retrieve_interviews, generate_incident_report]

print("✓ Tool functions defined and registered (using MongoDB + Voyage AI)")


In [None]:
# MongoDB Connection Setup
# Prerequisites: 
#   1. Run: python scripts/ingest_data_mongodb.py
#   2. Set MONGODB_URI and VOYAGE_API_KEY environment variables

print("=" * 80)
print("MONGODB CONFIGURATION")
print("=" * 80)

# Check environment variables
mongodb_uri = os.getenv("MONGODB_URI")
db_name = os.getenv("DATABASE_NAME", "predictive_maintenance")
voyage_api_key = os.getenv("VOYAGE_API_KEY")

print(f"✓ MongoDB URI: {mongodb_uri[:50]}..." if mongodb_uri else "✗ MongoDB URI not set")
print(f"✓ Database: {db_name}")
print(f"✓ Voyage AI API: {'Configured' if voyage_api_key else 'NOT SET'}")

if not mongodb_uri or not voyage_api_key:
    print("\n⚠️  WARNING: MongoDB or Voyage AI not fully configured!")
    print("   Please ensure:")
    print("   1. MONGODB_URI is set")
    print("   2. VOYAGE_API_KEY is set")
    print("   3. Data has been ingested: python scripts/ingest_data_mongodb.py")
else:
    print("\n✓ MongoDB and Voyage AI are properly configured!")
    print("✓ Ready to use vector search for retrieval")

print("=" * 80)


## 4. Create Tool Functions

The failure agent uses four main tools to diagnose failures and generate incident reports:
- **retrieve_manual**: Search technical manuals for relevant information
- **retrieve_work_orders**: Find related maintenance work orders
- **retrieve_interviews**: Access maintenance staff expertise and historical insights
- **generate_incident_report**: Create and store incident reports
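
The retrieval tools share one pattern: call a search helper and, on failure, return a JSON error string instead of raising, so the model can see what went wrong. A minimal sketch of that pattern follows; `safe_tool_call` and `fake_manual_search` are hypothetical names, and the fake search stands in for the real vector search.

```python
import json
from typing import Callable


def safe_tool_call(search: Callable[[str, int], str], query: str, n: int = 3) -> str:
    """Run a retrieval helper and convert any exception into a JSON error string."""
    try:
        return search(query, n)
    except Exception as exc:
        return json.dumps({"error": str(exc)})


def fake_manual_search(query: str, n: int) -> str:
    """Hypothetical stand-in for the vector search helper."""
    if not query:
        raise ValueError("query must not be empty")
    hits = [{"title": f"Manual section {i + 1}", "query": query} for i in range(n)]
    return json.dumps(hits)


ok = safe_tool_call(fake_manual_search, "E001 motor overheating", n=2)
failed = safe_tool_call(fake_manual_search, "")
print(ok)
print(failed)
```

Returning the error as data keeps the agent loop running, at the cost of requiring the model (or a later check) to notice the `error` key.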

In [None]:
# Define State Schema
from typing import TypedDict

class FailureAgentState(TypedDict):
    """State schema for the Failure Agent"""
    messages: Annotated[List[BaseMessage], add_messages]
    
print("✓ State schema defined")

## 3. Define the State Schema

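The state schema holds a single `messages` field that accumulates the conversation history throughout the agent's execution. The `add_messages` reducer appends each node's new messages to the list instead of overwriting it.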
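
To illustrate how a reducer-annotated field accumulates updates, here is a simplified sketch using only the standard library. `append_messages`, `SketchState`, and `apply_update` are hypothetical names, and messages are plain strings; LangGraph's real `add_messages` reducer does more (for example, it matches existing messages by ID).

```python
from typing import Annotated, List, TypedDict, get_type_hints


def append_messages(existing: List[str], new: List[str]) -> List[str]:
    """Simplified reducer: return the old history followed by the new messages."""
    return existing + new


class SketchState(TypedDict):
    # Annotated attaches the reducer to the field as metadata
    messages: Annotated[List[str], append_messages]


def apply_update(state: dict, update: dict) -> dict:
    """Merge a node's partial update into the state using each field's reducer."""
    hints = get_type_hints(SketchState, include_extras=True)
    merged = dict(state)
    for key, value in update.items():
        reducer = hints[key].__metadata__[0]
        merged[key] = reducer(state.get(key, []), value)
    return merged


state = {"messages": ["alert: E001"]}
state = apply_update(state, {"messages": ["tool result: manual excerpt"]})
print(state["messages"])
```

This is why each node in the notebook returns only its new messages: the reducer, not the node, is responsible for combining them with the history.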

In [None]:
# Import Required Libraries
import os
import json
from datetime import datetime
from typing import Annotated, Any, Dict, List, Optional
from dotenv import load_dotenv

# LangChain and LangGraph imports
from langchain_core.tools import tool
from langchain_core.messages import BaseMessage, HumanMessage, AIMessage, ToolMessage
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_openai import ChatOpenAI
from langgraph.graph import StateGraph, START, END
from langgraph.types import StateSnapshot
from langgraph.graph.message import add_messages

# Load environment variables
load_dotenv()

print("✓ All libraries imported successfully")

## 2. Import Required Libraries

Import the required libraries, including LangChain and LangGraph, along with the other dependencies for building the failure agent.

In [None]:
# Install required libraries
%pip install python-dotenv pymongo voyageai openai langchain langchain-openai langgraph

## 1. Install Required Libraries

Install necessary packages for the failure agent implementation.