# Reflexion Orchestration Agent

## Overview
This example shows how to implement the Reflexion pattern using Strands multiagent orchestration. The system generates an initial response, critically evaluates it, and refines it through systematic reflection until it reaches an acceptable quality standard or hits the iteration limit.

## Agent Details
<div style="float: left; margin-right: 20px;">
    
|Feature             |Description                                        |
|--------------------|---------------------------------------------------|
|Native tools used   |generate_initial_answer, generate_revised_answer  |
|Custom tools created|None (uses built-in reflection capabilities)      |
|Agent Structure     |Multi-agent orchestration with feedback loops     |
|AWS services used   |Amazon Bedrock                                     |

</div>

## Architecture

<div style="text-align:center">
    <img src="./images/reflexion.png" alt="Reflexion Architecture" width="800">
    <p>The system consists of two specialized agents connected in an iterative feedback loop:</p>
    <p><em>Reflexion Architecture: User Query → [Draft Agent] ⟷ [Revisor Agent] → Final Response</em></p>
</div>

## Key Features

* **Iterative self-improvement**: Implements reflection and revision cycles for quality enhancement through systematic feedback loops
* **Multi-agent orchestration**: Creates a system with Draft Agent and Revisor Agent working in iterative sequence
* **Quality assessment**: Multi-dimensional evaluation including completeness, clarity, and actionability with configurable thresholds
* **State management**: Maintains revision state across iterations for consistent improvement and progress tracking
* **Feedback loops**: Internal revision loops with configurable maximum iterations to prevent infinite cycles

In [None]:
!pip3 install -r ./requirements.txt --quiet --upgrade
!pip3 install strands-agents strands-agents-tools --quiet


## Importing dependency packages

Now let's import all the necessary libraries and modules for our Reflexion implementation. This includes standard Python libraries, AWS SDK components, Strands framework modules, and custom helper functions.

In [None]:
import time
import boto3
import ipywidgets as widgets
import uuid
import pandas as pd
import numpy as np
import os
import shutil
import sqlite3
import functools
import requests
import pytz
import warnings
from IPython.display import Image, display
from botocore.config import Config
from typing import Annotated, Literal, Optional, Union
from typing_extensions import TypedDict
from bs4 import BeautifulSoup
from datetime import date, datetime

from typing import List, Dict, Any
import re
import json


from strands import Agent
from strands import tool
from strands.models import BedrockModel
from strands.agent.conversation_manager import SlidingWindowConversationManager

from strands.multiagent.graph import GraphBuilder
from strands.agent import AgentResult
from strands.types.content import Message
from strands.types.streaming import StopReason
from strands.telemetry.metrics import EventLoopMetrics

import logging

## Import airline domain tools

Now we'll import the comprehensive set of airline domain tools from MAbench and TauBench. These tools provide the actual functionality that our Reflexion agent will execute, including flight booking, reservation management, and customer service operations.

In [None]:
import sys
sys.path.append('../data/ma-bench/')
sys.path.append('../data/tau-bench/')

from mabench.environments.airline.tools.book_reservation import book_reservation
from mabench.environments.airline.tools.calculate import calculate
from mabench.environments.airline.tools.cancel_reservation import cancel_reservation
from mabench.environments.airline.tools.get_reservation_details import get_reservation_details
from mabench.environments.airline.tools.get_user_details import get_user_details
from mabench.environments.airline.tools.list_all_airports import list_all_airports
from mabench.environments.airline.tools.search_direct_flight import search_direct_flight
from mabench.environments.airline.tools.search_onestop_flight import search_onestop_flight
from mabench.environments.airline.tools.send_certificate import send_certificate
from mabench.environments.airline.tools.think import think
from mabench.environments.airline.tools.transfer_to_human_agents import transfer_to_human_agents
from mabench.environments.airline.tools.update_reservation_baggages import update_reservation_baggages
from mabench.environments.airline.tools.update_reservation_flights import update_reservation_flights
from mabench.environments.airline.tools.update_reservation_passengers import update_reservation_passengers

domain = "airline"

from tau_bench.envs.airline.data import *
from tau_bench.envs.airline.tasks import *
from tau_bench.envs.airline.wiki import WIKI

## Configure Strands Framework

Now let's set up the core Strands framework components that will power our Reflexion multiagent system. We need to configure the Amazon Bedrock connection and logging to ensure our two-agent pipeline runs smoothly.

### Framework Setup Process

First, we'll establish the **AWS region** and create a `BedrockModel` instance that the two Reflexion agents (Draft and Revisor) will share, so that all agents use the same LLM configuration and behave consistently. You can also give each agent a different LLM configuration.

Finally, we'll configure **logging** to minimize noise during execution so we can focus on the Reflexion execution flow and results.

In [None]:
region = "us-east-1"

bedrock_model_taubench = BedrockModel(region_name=region)

# Disable logging
logging.basicConfig(level=logging.CRITICAL)
for logger_name in ["strands", "graph", "event_loop", "registry", "sliding_window_conversation_manager", "bedrock", "streaming"]:
    logging.getLogger(logger_name).setLevel(logging.CRITICAL)

## Reflexion with State Management

## Define the Reflexion State


In [None]:
from dataclasses import dataclass

@dataclass
class ReflexionState:
    """State for reflexion workflow"""
    user_query: str = ""
    response: str = ""
    reflection: str = ""
    needs_revision: bool = False
    max_iterations: int = 5
    revision_count: int = 0

In [None]:
def normalize_prompt(prompt) -> str:
    if isinstance(prompt, list):
        # assume list of dicts like [{"text": "..."}]
        texts = [p.get("text", "") for p in prompt if isinstance(p, dict)]
        return " ".join(t.strip() for t in texts if t)
    if isinstance(prompt, dict) and "text" in prompt:
        return prompt["text"].strip()
    if isinstance(prompt, str):
        return prompt.strip()
    return str(prompt).strip()

## Create Common Agent and Prompts 
We will first create the flight tool executor agent and the reflection prompt, both of which are used by the two nodes of the Reflexion graph.

### Create Flight Tool Executor Agent

Now let's create a comprehensive flight tool executor that serves as our baseline ReAct agent. This agent has direct access to all airline tools and reasons step by step to handle customer queries, without the structured up-front planning of an approach such as ReWOO.


#### System Prompt Configuration

First, we define a **system prompt template** that establishes the agent's role and operational guidelines. The prompt includes **policy integration** from the airline WIKI to ensure compliance with business rules, **geographic validation** to verify airport locations match user-specified states, and **accuracy requirements** to prevent hallucinated information.
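As a minimal sketch of the substitution step, the snippet below fills a `{policy}` placeholder with a literal `str.replace`, which is how the template in the next cell is filled. The shortened template and the sample policy text are invented for illustration. It also shows one reason a literal replacement can be the safer choice: should the template ever contain other literal braces (for example a JSON snippet in the instructions), `str.format` would try to interpret them as fields.

```python
# Shortened, hypothetical template that also contains literal JSON braces.
template = 'Reply with JSON such as {"status": "ok"}.\n<policy>\n{policy}\n</policy>'

# Invented policy text standing in for the airline WIKI.
sample_policy = "Basic economy tickets cannot be changed."

# str.replace treats "{policy}" as a plain substring and ignores all other braces.
filled = template.replace("{policy}", sample_policy)
print(filled)

# str.format parses every brace pair as a replacement field, so it fails here.
try:
    template.format(policy=sample_policy)
    format_failed = False
except (KeyError, ValueError, IndexError):
    format_failed = True
print("str.format failed:", format_failed)
```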

In [None]:
system_prompt_template = """
You are a helpful assistant for a travel website. Help the user answer any questions.

<instructions>
- You MUST refer to the <policy> and follow its guidelines to answer the user's question accurately.
- Remember to check that the airport city is in the state mentioned by the user. For example, Houston is in Texas.
- Infer the U.S. state in which the airport city resides. For example, Houston is in Texas.
- You should not use made-up or placeholder arguments.
</instructions>

<policy>
{policy}
</policy>
"""

prompt = system_prompt_template.replace("{policy}", WIKI)



#### FlightToolExecutor Class

Next, we create a custom agent class that extends the base `Agent` with comprehensive airline tool access. This agent has access to the **complete airline tool suite** including booking, search, update, and customer service operations. It uses the **shared Bedrock model** for consistency and incorporates **policy-aware prompting** to ensure compliant responses.


In [None]:

class FlightToolExecutor(Agent):
    def __init__(self):
        super().__init__(
            model=bedrock_model_taubench,
            system_prompt=prompt,
            tools=[
                book_reservation, calculate, cancel_reservation, get_reservation_details,
                get_user_details, list_all_airports, search_direct_flight, search_onestop_flight,
                send_certificate, think, transfer_to_human_agents, update_reservation_baggages,
                update_reservation_flights, update_reservation_passengers
            ]
        )

flight_executor = FlightToolExecutor()

#### Define Reflection System Prompt

Now let's establish the specialized system prompt that guides our reflection agent in critically evaluating flight assistant responses. This prompt ensures comprehensive quality assessment across multiple dimensions to determine if responses need improvement.

#### Multi-Dimensional Quality Assessment

The `reflection_system_prompt` defines a structured evaluation framework that analyzes flight assistant responses on key quality dimensions:

- **Completeness**
- **Final Answer**
- **Clarity**
- **Actionability**
- **User Experience**
- **Missing Information**

#### Decision Framework

The prompt concludes with a **binary decision mechanism** (REVISE or ACCEPT) along with reasoning, enabling the Reflexion system to make informed decisions about whether responses meet quality standards or require iterative improvement through additional revision cycles.

This systematic approach ensures that our Reflexion agent maintains consistent evaluation criteria while focusing on real database-driven responses rather than hallucinated information. You can change this prompt to suit your use case.



In [None]:
reflection_system_prompt="""You are analyzing a flight assistant's response that uses real flight database tools.
        
IMPORTANT: The flight data comes from real database queries, NOT hallucination.
        
Analyze the response quality on these dimensions:
1. **Completeness**: Does it address all parts of the user's query? 
2. **Final Answer**: If the user query clearly states the final goal and it can be fulfilled as per the policy, does the response show that?
3. **Clarity**: Is the information presented clearly and logically?
4. **Actionability**: Are next steps or options clearly presented?
5. **User Experience**: Is the tone helpful and appropriate?
6. **Missing Information**: What important details are missing?
7. **Decision**: REVISE or ACCEPT
8. **Reason**: Why this decision was made


"""

## Draft Node

Now let's create the draft node. This node has access to the `generate_initial_answer` tool and serves as the entry point for our Reflexion system, generating initial responses with built-in self-reflection capabilities.

### Create Initial Answer Generation Tool

Now let's implement the core Reflexion tool that generates initial responses and performs self-reflection. This tool combines the baseline flight executor with a specialized reflection agent to evaluate response quality and determine if revision is needed.

### Initial Response with Self-Reflection

The `generate_initial_answer` tool implements the first stage of the Reflexion pattern by following a structured process:

1. **Initial Response Generation**: Uses the baseline `flight_executor` to generate an initial answer with full airline tool access
2. **Critical Self-Reflection**: Creates a specialized reflection agent that evaluates the response quality using the `reflection_system_prompt`
3. **Revision Decision**: Analyzes the reflection text to determine if the response needs improvement (looks for "REVISE" keyword)
4. **Structured Output**: Returns a formatted response containing the original answer, reflection analysis, and revision decision

This tool forms the foundation of our Reflexion system, enabling **systematic self-evaluation** of response quality before presenting results to users. The reflection agent critically examines aspects like completeness, accuracy, and helpfulness to determine if iterative improvement is necessary.

### Draft Agent Implementation

The Draft Agent extends the base `Agent` class with custom `stream_async()` methods required for multiagent graph compatibility. It uses the shared Bedrock model and has access to the `generate_initial_answer` tool, making it the starting point for all Reflexion workflows where initial responses are generated and evaluated for quality.

In [None]:
@tool
def generate_initial_answer(query: str) -> str:
    """Generate initial answer, reflect, and decide if revision needed"""
    
    flight_response = flight_executor(query)
    answer_text = str(flight_response)
    
    reflection_agent = Agent(
        model=bedrock_model_taubench,
        system_prompt=reflection_system_prompt
    )
    
    reflection_prompt = f"""
Original Query: {query}
Flight Assistant's Answer: {answer_text}

Remember: The flight data comes from real database queries.
Please provide a critical reflection of this answer:"""
    
    reflection_response = reflection_agent(reflection_prompt)
    reflection_text = str(reflection_response)
    
    needs_revision = "REVISE" in reflection_text.upper()
    
    return f"**Answer**: {answer_text}\n**Self-Reflection**: {reflection_text}\n**Needs Revision**: {needs_revision}"


## Revisor Agent

Now let's explore the **Revisor Agent**, which serves as the iterative improvement engine in our Reflexion pattern. This agent takes the initial response from the Draft Agent and continuously refines it through self-reflection and query optimization until the quality meets our standards.

### Query Improvement System

The Revisor Agent starts with a specialized query improvement system that analyzes reflection feedback and creates better queries to guide improved responses. We do this so that the agent can address specific issues identified during self-reflection rather than just regenerating the same type of response.

The `query_improver_system_prompt` teaches the agent to transform vague or problematic queries into more specific, actionable ones. For example, if the original query was "Book me a flight from NYC to LA tomorrow" and the reflection identified that "Agent booked immediately without showing options", the improved query becomes "Please SEARCH and SHOW ME available flight options from NYC to LA tomorrow. I want to see different times, prices, and airlines before deciding. DO NOT book anything until I confirm."


In [None]:
query_improver_system_prompt="""You are a query improvement specialist. Based on reflection analysis, improve the original user query to address identified issues and guide better responses.

Examples:
Original: "Book me a flight from NYC to LA tomorrow"
Issue: "Agent booked immediately without showing options"
Improved: "Please SEARCH and SHOW ME available flight options from NYC to LA tomorrow. I want to see different times, prices, and airlines before deciding. DO NOT book anything until I confirm."

Now improve the provided query based on the specific reflection issues identified."""



### Response Analysis and Extraction

The `extract_answer_reflection_revision` function parses the structured output from the revision process to extract four key components. We do this so that we can programmatically determine whether another revision cycle is needed and what specific improvements should be made.

This function uses regex patterns to extract:
- **Answer**: The actual response content
- **Self-Reflection**: The agent's analysis of its own response quality
- **Needs Revision**: Boolean flag indicating if further improvement is required
- **User Query**: The current query being processed
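To make the expected input format concrete, here is a small sketch that runs the same four regular expressions against a sample string. The sample text is invented for illustration; real tool output will be longer and may span many lines.

```python
import re

# Invented sample in the layout the revision tool returns.
sample_text = (
    "\n**User-Query**: Cancel reservation ABC123\n"
    "**Answer**: The reservation is eligible for cancellation.\n"
    "**Self-Reflection**: Decision: ACCEPT. The answer is complete.\n"
    "**Needs Revision**: False"
)

# Each pattern captures the text after one label, up to the next label.
answer = re.search(r"\*\*Answer\*\*:(.*?)(?=\*\*Self-Reflection\*\*:)", sample_text, re.DOTALL)
reflection = re.search(r"\*\*Self-Reflection\*\*:(.*?)(?=\*\*Needs Revision\*\*:|$)", sample_text, re.DOTALL)
flag = re.search(r"\*\*Needs Revision\*\*:\s*(True|False)", sample_text)
query = re.search(r"\*\*User-Query\*\*:\s*(.*?)(?=\n|$)", sample_text)

parsed = {
    "answer": answer.group(1).strip(),
    "self_reflection": reflection.group(1).strip(),
    "needs_revision": flag.group(1) == "True",  # convert the text flag to a bool
    "user_query": query.group(1).strip(),
}
print(parsed)
```

Note that the output of `generate_initial_answer` has no `**User-Query**` line, so for draft results the function falls back to `"[Not found]"` for that field.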

In [None]:

def extract_answer_reflection_revision(tool_result):
    content_text = tool_result["content"][0]["text"]

    answer_match = re.search(r"\*\*Answer\*\*:(.*?)(?=\*\*Self-Reflection\*\*:)", content_text, re.DOTALL)
    answer = answer_match.group(1).strip() if answer_match else "[Not found]"

    reflection_match = re.search(r"\*\*Self-Reflection\*\*:(.*?)(?=\*\*Needs Revision\*\*:|$)", content_text, re.DOTALL)
    self_reflection = reflection_match.group(1).strip() if reflection_match else "[Not found]"

    revision_match = re.search(r"\*\*Needs Revision\*\*:\s*(True|False)", content_text)
    needs_revision = revision_match.group(1) == "True" if revision_match else False

    query_match = re.search(r"\*\*User-Query\*\*:\s*(.*?)(?=\n|$)", content_text)
    query = query_match.group(1).strip() if query_match else "[Not found]"

    return {
        "answer": answer,
        "self_reflection": self_reflection,
        "needs_revision": needs_revision,
        "user_query": query
    }

### Revised Answer Generation Tool

The `generate_revised_answer` tool orchestrates the complete revision cycle by combining query improvement, response regeneration, and quality assessment. We implement this as a single tool so that all revision steps happen atomically and maintain consistency throughout the improvement process.

The tool follows this workflow:
1. **Query Improvement**: Uses the query improver agent to create a better version of the user query based on current reflection feedback
2. **Response Regeneration**: Executes the improved query using the flight executor to generate a revised answer
3. **Quality Assessment**: Applies the reflection system prompt to analyze the new response and determine if further revision is needed
4. **Structured Output**: Returns all components in the standardized format for the next iteration

### Iterative Improvement Process

The Revisor Agent creates a feedback loop where each iteration builds upon the insights from previous attempts. We do this so that the system can progressively address different quality issues rather than getting stuck on the same problems.

The revision decision is made by checking whether "REVISE" appears in the reflection text, a simple mechanism that lets the reflection agent signal when it believes the response quality is sufficient. Because the check is a case-insensitive substring match, it also fires on words such as "revised", so the reflection output should state its decision unambiguously. The `max_iterations` cap, rather than the keyword check, is what prevents infinite revision loops.

The Revisor Agent represents the core innovation of the Reflexion pattern - the ability to iteratively improve responses through structured self-reflection and targeted query refinement, ensuring that the final output meets quality standards before being presented to the user.
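If the plain substring check turns out to be too eager, one possible alternative is to read only the decision line. The sketch below is not part of the notebook: `parse_decision` is a hypothetical helper, and it assumes the reflection model follows the `**Decision**: REVISE or ACCEPT` format requested in the reflection prompt.

```python
import re

def parse_decision(reflection_text: str, default: bool = True) -> bool:
    """Hypothetical helper: return True when the Decision line says REVISE.

    Falls back to `default` when no recognizable Decision line is present.
    """
    # Match the word "decision", any punctuation or spaces, then the verdict as a whole word.
    match = re.search(r"decision\W*(REVISE|ACCEPT)\b", reflection_text, flags=re.IGNORECASE)
    if match is None:
        return default
    return match.group(1).upper() == "REVISE"

# Invented reflection that accepts the answer but mentions the word "revised".
sample = "**Decision**: ACCEPT\n**Reason**: The revised answer covers everything."

print("REVISE" in sample.upper())   # substring check reports True
print(parse_decision(sample))       # decision-line check reports False
```

Defaulting to `True` when no decision is found errs on the side of another revision pass, which stays bounded by the iteration cap; passing `default=False` would favor lower cost instead.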



In [None]:
@tool
def generate_revised_answer(current_user_query, current_response="", current_reflection="") -> str:
    """Generate revised answer, reflect, and decide if further revision needed"""
    
    query_improver = Agent(
        model=bedrock_model_taubench,
        system_prompt=query_improver_system_prompt
    )
    
    improved_query = query_improver(f"Current user query: {current_user_query}\nCurrent Response: {current_response}\nReflection: {current_reflection}\nCreate better query:")
    
    flight_response = flight_executor(str(improved_query))
    revised_answer = str(flight_response)
    
    revision_reflection_agent = Agent(
        model=bedrock_model_taubench,
        system_prompt=reflection_system_prompt
    )
    
    new_reflection = revision_reflection_agent(f"Task: {current_user_query} Revised Answer: {revised_answer} Analyze and decide:")
    reflection_text = str(new_reflection)
    
    needs_revision = "REVISE" in reflection_text.upper()
    
    return f"\n**User-Query**: {current_user_query}\n**Answer**: {revised_answer}\n**Self-Reflection**: {reflection_text}\n**Needs Revision**: {needs_revision}"



## Custom Agent Classes

Now let's examine the **custom agent classes** that implement the Reflexion pattern within the Strands multiagent graph framework. These agents extend the base Agent class with specialized streaming behavior to handle the iterative reflection and revision process.

### Draft Agent Implementation

The `DraftAgent` class serves as the entry point for our Reflexion system, generating initial responses with built-in self-reflection capabilities. We implement this as a custom agent class so that it can seamlessly integrate with the multiagent graph while providing the structured output format required for the revision process.

The agent initializes with the `generate_initial_answer` tool and uses the `bedrock_model_taubench` for consistent model behavior across the system. We do this so that both agents in the Reflexion pattern use the same underlying language model for coherent reasoning.

In the `stream_async` method, the agent normalizes the input prompt and calls its tool to generate the initial response with self-reflection. The `extract_answer_reflection_revision` function then parses the structured output to separate the answer, reflection, and revision decision components. We structure it this way so that the multiagent graph can easily pass the parsed components to the next agent in the workflow.



In [None]:


class DraftAgent(Agent):
    def __init__(self):
        super().__init__(
            model=bedrock_model_taubench,
            tools=[generate_initial_answer],
            name="draft",
            description="Generates flight assistance answers with self-reflection"
        )
    
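    async def stream_async(self, prompt, **kwargs):
        # **kwargs absorbs any extra arguments the graph passes through
        # Flatten the incoming prompt (string or content blocks) to plain text
        prompt = normalize_prompt(prompt)
        # Call the tool directly so every query gets a draft plus a reflection
        result = self.tool.generate_initial_answer(query=prompt)
        # Parsed fields are handy for inspection; the raw result is forwarded below
        extracted = extract_answer_reflection_revision(result)

        message = Message(content=[{"text": str(result)}])
        print("DEBUG: DRAFT AGENT RESULT: \n", json.dumps(message), "\n")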
        agent_result = AgentResult(
            stop_reason="end_turn",
            message=message,
            metrics=EventLoopMetrics(),
            state=None 
        )
        yield {"result": agent_result}



### Revisor Agent Implementation

The `RevisorAgent` class handles the iterative improvement process, managing state across multiple revision cycles until the response quality is satisfactory. We implement this with sophisticated state management so that the agent can track revision history and prevent infinite loops.

The agent maintains a `ReflexionState` object that tracks the user query, current response, reflection analysis, revision status, and iteration count. We do this so that each revision cycle builds upon previous insights rather than starting from scratch.

### State Management and Flow Control

The revisor agent's `stream_async` method implements complex input parsing to handle both initial queries and continuation from the draft agent. When receiving input from the draft agent (identified by the "From draft:" marker), it extracts the previous response and reflection to initialize the state properly.
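The marker-based parsing can be sketched in isolation as follows. `split_on_marker` is a hypothetical helper, and the sample input is invented; the exact text the graph forwards to a downstream node may differ between Strands versions, so treat the layout here as an assumption.

```python
from typing import Tuple

def split_on_marker(text: str, marker: str = "From draft:") -> Tuple[str, str]:
    """Hypothetical helper: split input into (text before marker, text after marker).

    Returns (text, "") when the marker is absent, e.g. for a direct user query.
    """
    start = text.find(marker)
    if start == -1:
        return text.strip(), ""
    return text[:start].strip(), text[start + len(marker):].strip()

# Invented input that mimics a task followed by the draft node's output.
sample_input = "Original Task: Cancel reservation ABC123 From draft: **Answer**: Done."
before, after = split_on_marker(sample_input)
print(before)
print(after)
```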

The revision loop continues as long as `needs_revision` is true and the `revision_count` hasn't exceeded `max_iterations` (set to 5). We implement this safeguard so that the system gracefully handles cases where the agent cannot achieve satisfactory quality within reasonable bounds.
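The loop guard described above can be illustrated without any framework code. In this sketch, `LoopState` is a trimmed-down stand-in for `ReflexionState`, and `always_unhappy` is a stub reviser that never accepts its own output, so the only thing that stops the loop is the iteration cap.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

@dataclass
class LoopState:
    """Trimmed-down stand-in for ReflexionState."""
    response: str = ""
    needs_revision: bool = False
    revision_count: int = 0
    max_iterations: int = 5

def run_revisions(state: LoopState, revise: Callable[[str], Tuple[str, bool]]) -> LoopState:
    """Call `revise` until it reports no further revision or the cap is reached."""
    while state.needs_revision and state.revision_count < state.max_iterations:
        state.revision_count += 1
        state.response, state.needs_revision = revise(state.response)
    return state

def always_unhappy(response: str) -> Tuple[str, bool]:
    """Stub reviser: appends a marker and always asks for another pass."""
    return response + "+", True

final = run_revisions(LoopState(response="draft", needs_revision=True), always_unhappy)
print(final.revision_count)  # 5: the cap stops the loop
```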

### Multiagent Graph Integration

Both agent classes yield `AgentResult` objects with properly formatted messages and state information. The draft agent passes its results forward, while the revisor agent can either continue revising or provide the final polished response. We structure the results this way so that the multiagent graph can properly route information between agents and maintain conversation flow.

The debug print statements help track the flow of information between agents during development and troubleshooting. These custom agent implementations demonstrate how the Strands framework can be extended to support sophisticated multi-turn reasoning patterns like Reflexion.


In [None]:
class RevisorAgent(Agent):
    def __init__(self):
        super().__init__(
            model=bedrock_model_taubench,
            tools=[generate_revised_answer],
            name="revisor",
            description="Revises flight responses"
        )

    async def stream_async(self, input_data, **kwargs):
        # **kwargs absorbs any extra arguments the graph passes through
        input_text = normalize_prompt(input_data)

        # Defaults for the case where no draft output is present in the input
        user_query = input_text
        draft_response = ""
        draft_reflection = ""
        needs_revision = False

        # Extract draft agent result
        draft_start = input_text.find('From draft:')
        if draft_start != -1:
            # Assumption: the text before the marker carries the original task
            user_query = input_text[:draft_start].strip() or input_text
            draft_content = input_text[draft_start + len('From draft:'):].strip()
            # Parse the draft result
            extracted = extract_answer_reflection_revision({'content': [{'text': draft_content}]})
            draft_response = extracted['answer']
            draft_reflection = extracted['self_reflection']
            needs_revision = extracted['needs_revision']

        state = ReflexionState(
            user_query=user_query,
            response=draft_response,
            reflection=draft_reflection,
            needs_revision=needs_revision,
            revision_count=0,
            max_iterations=5
        )

        print(f"Revisor starting: revision_count={state.revision_count}, needs_revision={state.needs_revision}")

        # Revise until the reflection accepts the answer or the iteration cap is hit
        while state.needs_revision and state.revision_count < state.max_iterations:
            state.revision_count += 1

            result = self.tool.generate_revised_answer(
                current_user_query=state.user_query,
                current_response=state.response,
                current_reflection=state.reflection
            )

            extracted = extract_answer_reflection_revision(result)
            state.response = extracted["answer"]
            state.reflection = extracted["self_reflection"]
            state.needs_revision = extracted["needs_revision"]

        # Report the final answer together with the last reflection
        result = f"**Answer**: {state.response}\n**Final Reflection**: {state.reflection}\n**Revision Complete**: After {state.revision_count} revisions"

        message = Message(content=[{"text": str(result)}])
        print("DEBUG: REVISOR AGENT RESULT: \n", json.dumps(message), "\n")
        agent_result = AgentResult(
            stop_reason="end_turn",
            message=message,
            metrics=EventLoopMetrics(),
            state=state.__dict__
        )
        yield {"result": agent_result}

## Build Reflexion Graph

## Graph Construction and Execution

Now let's create the **Reflexion multiagent graph** that orchestrates the interaction between our Draft and Revisor agents. This function builds the complete workflow using the Strands GraphBuilder to create a seamless reflection and revision pipeline.

### Graph Creation Function

The `create_reflexion_graph` function instantiates both agent classes and connects them in a sequential workflow. We do this so that the draft agent's output automatically flows to the revisor agent for quality assessment and potential improvement.

First, we create instances of both `DraftAgent` and `RevisorAgent` classes, ensuring they're properly initialized with their respective tools and configurations. We instantiate them separately so that each agent maintains its own state and tool access throughout the workflow.

### Graph Builder Configuration

Using the Strands `GraphBuilder`, we add both agents as nodes in our multiagent graph. The `add_node` method registers each agent with a unique identifier ("draft" and "revisor") that allows the graph to route messages and maintain execution flow.

We then establish the connection between agents using `add_edge(draft_node, revisor_node)`, creating a direct path from the draft agent's output to the revisor agent's input. We do this so that the reflection process happens automatically without requiring manual intervention or complex routing logic.

### Entry Point and Execution Flow

The `set_entry_point("draft")` call designates the draft agent as the starting point for all user queries. We configure it this way so that every interaction begins with initial response generation, followed by the reflection and revision process.

When the graph executes, it follows this flow:
1. User query enters at the draft agent
2. Draft agent generates initial response with self-reflection
3. Output automatically routes to revisor agent
4. Revisor agent performs iterative improvement until quality standards are met
5. Final polished response is returned to the user

### Graph Instantiation

The final line `reflexion_graph = create_reflexion_graph()` builds and stores the complete multiagent graph, making it ready for execution. We create this as a reusable object so that multiple queries can be processed through the same reflection pipeline without rebuilding the graph structure.

This simple but powerful setup demonstrates how the Strands multiagent graph framework can orchestrate complex reasoning patterns like Reflexion with minimal configuration code, while maintaining full control over agent behavior and state management.


In [None]:
from strands.multiagent.graph import GraphBuilder, GraphState

def create_reflexion_graph():
    """Create reflexion graph with state management"""
    
    draft_agent = DraftAgent()
    revisor_agent = RevisorAgent()
    
    builder = GraphBuilder()
    
    draft_node = builder.add_node(draft_agent, "draft")
    revisor_node = builder.add_node(revisor_agent, "revisor")
    
    builder.add_edge(draft_node, revisor_node)
    builder.set_entry_point("draft")
    
    return builder.build()

reflexion_graph = create_reflexion_graph()

## Load Dataset

Now let's load the **TauBench evaluation dataset** that contains real airline customer service scenarios. We do this so that we can test our Reflexion system against standardized benchmarks and measure its performance on authentic customer queries like:

- Flight changes
- Cancellations  
- Booking modifications

This loads the **single-turn airline tasks** from TauBench, which provides us with a collection of customer queries along with their expected outcomes for evaluation purposes.

In [None]:
output_path = os.path.join("..", "data", "tau-bench", "tau_bench", "envs", f"{domain}", "tasks_singleturn.json")
with open(output_path, "r") as file:
    tasks = json.load(file)

## Testing and Evaluation

Now let's examine the **testing framework** that demonstrates how to execute and analyze the Reflexion graph performance. This function provides comprehensive insights into the graph execution process and helps validate the quality improvement achieved through the reflection pattern.

### Test Function Setup

The `test_reflexion_graph` function creates a fresh instance of the reflexion graph for each test execution. We do this so that each test starts with a clean state and doesn't carry over any residual information from previous executions.

The function begins by printing the test prompt and execution status, providing clear visibility into what query is being processed. We include this logging so that developers can track the progression of different test cases and identify any patterns in the reflection behavior.

### Performance Monitoring

The test function captures execution timing using `time.time()` measurements around the graph execution. We measure this so that we can evaluate the performance impact of the reflection process compared to single-pass approaches.

The timing measurement helps identify whether the quality improvements from reflection justify the additional computational cost, which is crucial for production deployment decisions.
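As a small, framework-free sketch of this kind of measurement, the hypothetical `timed` helper below wraps any callable and returns its result together with the elapsed wall-clock time. It uses `time.perf_counter()`, a monotonic clock intended for measuring intervals; `fake_pipeline` is a stand-in for the real graph call.

```python
import time
from typing import Any, Callable, Tuple

def timed(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Hypothetical helper: run `fn` and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

def fake_pipeline(query: str) -> str:
    """Stand-in for the graph call; sleeps briefly to simulate work."""
    time.sleep(0.01)
    return f"answer to: {query}"

result, elapsed = timed(fake_pipeline, "Cancel reservation ABC123")
print(f"{elapsed:.3f}s")
```

Running the same query through a single-pass agent and through the Reflexion graph with a helper like this gives a direct view of the latency the reflection loop adds.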

### Graph Execution Analysis

After executing the graph with the user query, the function extracts comprehensive execution metadata from the result object. We analyze these metrics so that we can understand how the multiagent graph performed and whether both agents completed successfully.

The key metrics include:
- **Graph Status**: Overall execution success or failure
- **Total Nodes**: Number of agents in the graph (should be 2 for our Reflexion pattern)
- **Completed Nodes**: How many agents finished execution
- **Execution Order**: The sequence in which agents were invoked

### Node-Level Result Inspection

The function iterates through each node's results to display the individual agent outputs. We examine each node separately so that we can trace the evolution from initial draft response to final refined answer.

For each agent (draft and revisor), the function shows the execution status and result content, allowing developers to see exactly how the reflection process improved the response quality.

### Test Execution Example

The final section demonstrates how to run a test using a specific question from the tasks dataset. We select `question_id = 43` and extract the corresponding user query to test the system with real airline customer service scenarios.

This testing approach provides a complete view of the Reflexion pattern in action, showing both the technical execution details and the practical quality improvements achieved through iterative self-reflection and revision.

In [None]:
def test_reflexion_graph(user_query):
    reflexion_graph = create_reflexion_graph()
    
    print("=== Testing Reflexion Graph ===")
    print(f"Test Prompt: {user_query}")
    print("\n--- Executing Graph ---")
    start= time.time()
    result = reflexion_graph(user_query)
    exec_time= time.time()-start
    print(f"\n EXEC Time: {exec_time}")
    print(f"\nGraph Status: {result.status}")
    print(f"Total Nodes: {result.total_nodes}")
    print(f"Completed Nodes: {result.completed_nodes}")
    print(f"Execution Order: {[node.node_id for node in result.execution_order]}")
    
    print("\n--- Node Results ---")
    for node_id, node_result in result.results.items():
        print(f"\n{node_id.upper()}:")
        print(f"Status: {node_result.status}")
        if hasattr(node_result, 'result') and node_result.result:
            print(f"Result: {str(node_result.result)}...")
    
    return result

# Test with different question_id
question_id = 43
task = tasks[question_id]
user_query = task["question"]
print(user_query)
reflexion_response = test_reflexion_graph(user_query)

## Congrats!

Congratulations! You've successfully created and tested a Reflexion pattern implementation using Strands multiagent orchestration. This system demonstrates:

- **Iterative self-improvement** through systematic reflection and revision
- **Multi-agent orchestration** with specialized roles and feedback loops
- **Quality-driven processing** with configurable thresholds and iteration limits
- **State management** across revision cycles for consistent improvement

The reflexion pattern is particularly useful for applications requiring high-quality responses, such as content generation, problem-solving, and complex reasoning tasks where initial responses can be systematically improved through reflection and revision.