# ReAct Agent: Deep Dive Analysis

**Professional Analysis of Reasoning and Acting Framework**

This notebook provides a comprehensive analysis of the ReAct (Reasoning + Acting) framework implementation with:
- Visual representation of agent execution graphs
- Step-by-step reasoning trace analysis
- Performance metrics and comparisons
- Critical evaluation of tool usage patterns

---

In [None]:
# Core imports
import sys
from pathlib import Path
import json
import time
from datetime import datetime

# Add project to path
sys.path.insert(0, str(Path.cwd().parent))

# Visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import networkx as nx
from IPython.display import display, HTML, Image, Markdown

# Project imports
from src.agents.agent_factory import AgentFactory
from src.agents.agent_executor import AgentExecutor
from src.database.db_manager import DatabaseManager
from src.config import config

# Styling
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 10

print("All imports loaded successfully")
print(f"OpenAI API Key configured: {bool(config.openai_api_key)}")
print(f"Database path: {config.database_path}")

## 1. ReAct Framework Overview

### What is ReAct?

ReAct combines **Reasoning** (thinking about what to do) and **Acting** (using tools to do it) in a unified framework.

**Key Insight**: Instead of just generating text, the LLM can reason about which tools to use and when, creating a transparent decision-making process.

### ReAct Loop:

```
Query → [Thought → Action → Observation]* → Final Answer
```

Where:
- **Thought**: LLM reasons about what to do next
- **Action**: LLM selects a tool and provides input
- **Observation**: Tool returns results to LLM
- **Repeat** until LLM has enough information to answer
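
The loop above can be sketched in a few lines. This is a minimal illustration rather than this project's `AgentExecutor`: `run_react`, `scripted_policy`, and the `lookup` tool are hypothetical names, and a scripted function stands in for the LLM so the example runs offline.

```python
from typing import Callable, Dict, List, Tuple

# The policy stands in for the LLM. Given the trace so far, it returns a dict
# with a "thought" plus either ("tool", "input") or a final "answer".
Policy = Callable[[List[dict]], dict]
Tool = Callable[[str], str]


def run_react(query: str, policy: Policy, tools: Dict[str, Tool],
              max_iterations: int = 5) -> Tuple[str, List[dict]]:
    """Repeat Thought -> Action -> Observation until an answer is produced."""
    trace = [{"type": "query", "content": query}]
    for _ in range(max_iterations):
        decision = policy(trace)
        trace.append({"type": "thought", "content": decision["thought"]})
        if "answer" in decision:
            return decision["answer"], trace
        tool_name, tool_input = decision["tool"], decision["input"]
        trace.append({"type": "action", "tool": tool_name, "input": tool_input})
        trace.append({"type": "observation", "content": tools[tool_name](tool_input)})
    # Safety net: stop instead of looping forever.
    return "Stopped: iteration limit reached", trace


def scripted_policy(trace: List[dict]) -> dict:
    """Hypothetical LLM stand-in: call the tool once, then answer."""
    observations = [s["content"] for s in trace if s["type"] == "observation"]
    if not observations:
        return {"thought": "I need to search by mood.",
                "tool": "lookup", "input": "melancholic"}
    return {"thought": "I have enough information.",
            "answer": f"Found: {observations[-1]}"}


tools = {"lookup": lambda mood: f"songs tagged '{mood}'"}
answer, trace = run_react("Find melancholic songs", scripted_policy, tools)
print(answer)
print([step["type"] for step in trace])
```

The `max_iterations` cap is the simplest guard against an agent that never decides it has enough information.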

---

## 2. Initialize Agent and Tools

We'll test with three query types to understand how ReAct adapts:
1. **Database Query**: Requires searching the Pink Floyd database
2. **Currency Query**: Requires calling an external currency API
3. **Multi-Tool Query**: Requires using both tools

In [None]:
# Initialize agent factory
factory = AgentFactory()

# Create agent with gpt-4o-mini (fast and cost-effective)
model_name = "gpt-4o-mini"
agent = factory.create_agent(model_name)
executor = AgentExecutor(agent, model_name)

# Display available tools
print("Available Tools:")
print("="*60)
for i, tool in enumerate(factory.get_tools(), 1):
    print(f"{i}. {tool.name}")
    print(f"   Description: {tool.description[:100]}...")
    print()

print(f"Agent initialized with model: {model_name}")
print(f"{len(factory.get_tools())} tools available")

## 3. Test Case 1: Database Query

**Query**: "Find melancholic Pink Floyd songs from the 1970s"

**Expected Behavior**:
- Agent should recognize this requires database search
- Should use `pink_floyd_database` tool
- Should parse mood="melancholic" and decade="1970s"
- Should format results clearly
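
These expectations can be turned into an automated check on the reasoning trace. A minimal sketch, assuming the trace is a list of dicts with `type`, `tool`, and `input` keys (the shape the analysis cells in this notebook read). `check_tool_call` and the sample trace are hypothetical, and the real input format of `pink_floyd_database` may differ.

```python
from typing import Dict, List


def check_tool_call(trace: List[dict], expected_tool: str,
                    expected_terms: List[str]) -> Dict[str, bool]:
    """Report whether the expected tool was called and whether each term
    appears (case-insensitively) in any of that tool's inputs."""
    inputs = [str(step.get("input", "")).lower()
              for step in trace
              if step.get("type") == "action" and step.get("tool") == expected_tool]
    report = {"tool_called": bool(inputs)}
    for term in expected_terms:
        report[f"mentions_{term}"] = any(term.lower() in text for text in inputs)
    return report


# Illustrative trace, not real agent output.
sample_trace = [
    {"type": "query", "content": "Find melancholic Pink Floyd songs from the 1970s"},
    {"type": "action", "tool": "pink_floyd_database",
     "input": {"mood": "melancholic", "decade": "1970s"}},
    {"type": "observation", "content": "..."},
]
report = check_tool_call(sample_trace, "pink_floyd_database", ["melancholic", "1970s"])
print(report)
```

After a real run, pass `result1["reasoning_trace"]` in place of `sample_trace`.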

In [None]:
# Execute database query
query1 = "Find melancholic Pink Floyd songs from the 1970s"

print(f"Query: {query1}")
print("="*60)
print("Executing...\n")

result1 = executor.execute(query1)

# Display answer
print("\n" + "="*60)
print("ANSWER:")
print("="*60)
print(result1["answer"])
print("\n" + "="*60)

# Display metrics
metrics = result1["metrics"]
print(f"\nExecution Time: {metrics['execution_time_seconds']}s")
print(f"Tokens Used: {metrics['estimated_tokens']['total']}")
print(f"Estimated Cost: ${metrics['estimated_cost_usd']:.6f}")
print(f"Reasoning Steps: {metrics['num_steps']}")

### 3.1 Visualize Reasoning Trace - Database Query

In [None]:
def visualize_reasoning_trace(trace, title="ReAct Execution Graph"):
    """
    Create a visual representation of the ReAct reasoning trace.
    
    Uses NetworkX to build a directed graph showing:
    - Query node (starting point)
    - Thought nodes (reasoning)
    - Action nodes (tool calls)
    - Observation nodes (tool results)
    - Answer node (final output)
    """
    G = nx.DiGraph()
    
    # Color mapping for different node types
    colors = {
        'query': '#FF1493',      # Pink
        'thought': '#4169E1',    # Blue
        'action': '#32CD32',     # Green
        'observation': '#FFA500',  # Orange
        'answer': '#9370DB'      # Purple
    }
    
    node_colors = []
    node_sizes = []
    labels = {}
    
    # Add query node
    G.add_node(0)
    node_colors.append(colors['query'])
    node_sizes.append(3000)
    labels[0] = "Query"
    
    prev_node = 0
    node_id = 1
    
    # Process trace steps
    for step in trace:
        step_type = step.get('type', '')
        
        if step_type in ['thought', 'action', 'observation']:
            G.add_node(node_id)
            G.add_edge(prev_node, node_id)
            
            # Set color and label based on type
            if step_type == 'action':
                node_colors.append(colors['action'])
                node_sizes.append(2500)
                tool = step.get('tool', 'unknown')
                labels[node_id] = f"Action\n{tool}"
            elif step_type == 'observation':
                node_colors.append(colors['observation'])
                node_sizes.append(2500)
                labels[node_id] = "Observation"
            else:
                node_colors.append(colors['thought'])
                node_sizes.append(2000)
                labels[node_id] = "Thought"
            
            prev_node = node_id
            node_id += 1
    
    # Add answer node
    G.add_node(node_id)
    G.add_edge(prev_node, node_id)
    node_colors.append(colors['answer'])
    node_sizes.append(3000)
    labels[node_id] = "Answer"
    
    # Create visualization
    plt.figure(figsize=(14, 8))
    
    # Force-directed (spring) layout; a fixed seed keeps node positions reproducible
    pos = nx.spring_layout(G, k=2, iterations=50, seed=42)
    
    # Draw graph
    nx.draw_networkx_nodes(G, pos, 
                          node_color=node_colors,
                          node_size=node_sizes,
                          alpha=0.9)
    
    nx.draw_networkx_edges(G, pos,
                          edge_color='gray',
                          arrows=True,
                          arrowsize=20,
                          arrowstyle='->',
                          width=2)
    
    nx.draw_networkx_labels(G, pos, labels,
                           font_size=9,
                           font_weight='bold')
    
    plt.title(title, fontsize=16, fontweight='bold', pad=20)
    plt.axis('off')
    plt.tight_layout()
    
    # Add legend
    legend_elements = [
        plt.Line2D([0], [0], marker='o', color='w', markerfacecolor=colors['query'], markersize=10, label='Query'),
        plt.Line2D([0], [0], marker='o', color='w', markerfacecolor=colors['thought'], markersize=10, label='Thought'),
        plt.Line2D([0], [0], marker='o', color='w', markerfacecolor=colors['action'], markersize=10, label='Action'),
        plt.Line2D([0], [0], marker='o', color='w', markerfacecolor=colors['observation'], markersize=10, label='Observation'),
        plt.Line2D([0], [0], marker='o', color='w', markerfacecolor=colors['answer'], markersize=10, label='Answer')
    ]
    plt.legend(handles=legend_elements, loc='upper right', fontsize=10)
    
    plt.show()
    
    return G

# Visualize the database query execution
print("Database Query Execution Graph:")
print("="*60)
graph1 = visualize_reasoning_trace(
    result1["reasoning_trace"],
    title="ReAct: Database Query Execution"
)

### 3.2 Detailed Trace Analysis - Database Query

In [None]:
def analyze_trace(trace, query):
    """
    Analyze reasoning trace with critical evaluation.
    """
    print(f"Query: {query}")
    print("="*80)
    print()
    
    action_count = 0
    tools_used = set()
    
    for i, step in enumerate(trace, 1):
        step_type = step.get('type', '')
        
        if step_type == 'query':
            print(f"[QUERY] STEP {i}: USER QUERY")
            print(f"   Content: {step.get('content', '')[:100]}...")
            
        elif step_type == 'thought':
            print(f"\n[THOUGHT] STEP {i}: THOUGHT (Reasoning)")
            print(f"   {step.get('content', '')[:200]}...")
            
        elif step_type == 'action':
            action_count += 1
            tool = step.get('tool', 'unknown')
            tools_used.add(tool)
            print(f"\n[ACTION] STEP {i}: ACTION (Tool Call #{action_count})")
            print(f"   Tool: {tool}")
            # default=str keeps the dump from failing on non-JSON-serializable inputs
            print(f"   Input: {json.dumps(step.get('input', {}), indent=6, default=str)}")
            
        elif step_type == 'observation':
            print(f"\n[OBSERVATION] STEP {i}: OBSERVATION (Tool Result)")
            content = step.get('content', '')
            if len(content) > 300:
                print(f"   Result: {content[:300]}...")
                print(f"   (truncated - full length: {len(content)} chars)")
            else:
                print(f"   Result: {content}")
    
    print("\n" + "="*80)
    print("SUMMARY:")
    print(f"  Total Steps: {len(trace)}")
    print(f"  Tool Calls: {action_count}")
    # Sort for stable output; sets have no guaranteed order
    print(f"  Tools Used: {', '.join(sorted(tools_used)) or 'none'}")
    print("="*80)

analyze_trace(result1["reasoning_trace"], query1)

### 3.3 Critical Analysis - Database Query

**What went well:**
- Agent correctly identified need for database tool
- Extracted relevant parameters (mood, decade)
- Formatted results in user-friendly way

**Potential improvements:**
- Could the agent have been more specific in the query?
- Is one tool call sufficient or should it verify results?
- How does answer quality compare to ground truth?
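
The ground-truth question can be made measurable by checking which expected titles appear in the agent's answer. A rough sketch: `title_recall` is a hypothetical helper, the titles are placeholders rather than database contents, and substring matching only measures recall (it cannot detect extra titles the agent should not have listed).

```python
from typing import List, Tuple


def title_recall(answer: str, expected_titles: List[str]) -> Tuple[float, List[str]]:
    """Return the fraction of expected titles mentioned in the answer,
    plus the titles that were missed (case-insensitive substring match)."""
    text = answer.lower()
    missing = [t for t in expected_titles if t.lower() not in text]
    if not expected_titles:
        return 1.0, missing
    return 1 - len(missing) / len(expected_titles), missing


# Placeholder data for illustration only.
expected = ["Song A", "Song B", "Song C", "Song D"]
answer_text = "Here are some matches: Song A, song b, and Song D."
recall, missing = title_recall(answer_text, expected)
print(f"Recall: {recall:.2f}, missing: {missing}")
```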

---

## 4. Test Case 2: Currency Query

**Query**: "What is the current exchange rate from USD to EUR?"

**Expected Behavior**:
- Agent should recognize this requires currency data
- Should use `currency_price_checker` tool
- Should parse currency pair (USD/EUR)
- Should provide current rate with context

In [None]:
# Execute currency query
query2 = "What is the current exchange rate from USD to EUR?"

print(f"Query: {query2}")
print("="*60)
print("Executing...\n")

result2 = executor.execute(query2)

# Display answer
print("\n" + "="*60)
print("ANSWER:")
print("="*60)
print(result2["answer"])
print("\n" + "="*60)

# Display metrics
metrics = result2["metrics"]
print(f"\nExecution Time: {metrics['execution_time_seconds']}s")
print(f"Tokens Used: {metrics['estimated_tokens']['total']}")
print(f"Estimated Cost: ${metrics['estimated_cost_usd']:.6f}")
print(f"Reasoning Steps: {metrics['num_steps']}")

### 4.1 Visualize Reasoning Trace - Currency Query

In [None]:
# Visualize the currency query execution
print("Currency Query Execution Graph:")
print("="*60)
graph2 = visualize_reasoning_trace(
    result2["reasoning_trace"],
    title="ReAct: Currency Query Execution"
)

### 4.2 Detailed Trace Analysis - Currency Query

In [None]:
analyze_trace(result2["reasoning_trace"], query2)

---

## 5. Test Case 3: Multi-Tool Query

**Query**: "I want energetic Pink Floyd music to listen to, and also tell me the EUR to GBP exchange rate"

**Expected Behavior**:
- Agent must recognize TWO distinct tasks
- Should use BOTH tools sequentially
- Should combine results coherently
- **This tests agent's ability to handle complex, multi-step queries**

In [None]:
# Execute multi-tool query
query3 = "I want energetic Pink Floyd music to listen to, and also tell me the EUR to GBP exchange rate"

print(f"Query: {query3}")
print("="*60)
print("Executing...\n")

result3 = executor.execute(query3)

# Display answer
print("\n" + "="*60)
print("ANSWER:")
print("="*60)
print(result3["answer"])
print("\n" + "="*60)

# Display metrics
metrics = result3["metrics"]
print(f"\nExecution Time: {metrics['execution_time_seconds']}s")
print(f"Tokens Used: {metrics['estimated_tokens']['total']}")
print(f"Estimated Cost: ${metrics['estimated_cost_usd']:.6f}")
print(f"Reasoning Steps: {metrics['num_steps']}")

### 5.1 Visualize Reasoning Trace - Multi-Tool Query

In [None]:
# Visualize the multi-tool query execution
print("Multi-Tool Query Execution Graph:")
print("="*60)
graph3 = visualize_reasoning_trace(
    result3["reasoning_trace"],
    title="ReAct: Multi-Tool Query Execution"
)

### 5.2 Detailed Trace Analysis - Multi-Tool Query

In [None]:
analyze_trace(result3["reasoning_trace"], query3)

### 5.3 Critical Analysis - Multi-Tool Query

**Key Observations:**
1. Did the agent correctly identify both sub-tasks?
2. Were tools called in logical order?
3. Was the final answer coherent and complete?
4. How many additional reasoning steps were needed vs single-tool queries?

**This demonstrates the agent's ability to:**
- Decompose complex queries
- Coordinate multiple tools
- Synthesize disparate information

---

## 6. Comparative Analysis

Compare performance across all three test cases.

In [None]:
# Collect metrics
comparison_data = {
    'Query Type': ['Database Only', 'Currency Only', 'Multi-Tool'],
    'Query': [
        query1[:40] + '...',
        query2[:40] + '...',
        query3[:40] + '...'
    ],
    'Execution Time (s)': [
        result1['metrics']['execution_time_seconds'],
        result2['metrics']['execution_time_seconds'],
        result3['metrics']['execution_time_seconds']
    ],
    'Total Tokens': [
        result1['metrics']['estimated_tokens']['total'],
        result2['metrics']['estimated_tokens']['total'],
        result3['metrics']['estimated_tokens']['total']
    ],
    'Cost (USD)': [
        result1['metrics']['estimated_cost_usd'],
        result2['metrics']['estimated_cost_usd'],
        result3['metrics']['estimated_cost_usd']
    ],
    'Reasoning Steps': [
        result1['metrics']['num_steps'],
        result2['metrics']['num_steps'],
        result3['metrics']['num_steps']
    ],
    # Note: this counts tool calls (action steps), not distinct tools
    'Tools Used': [
        len([s for s in result1['reasoning_trace'] if s.get('type') == 'action']),
        len([s for s in result2['reasoning_trace'] if s.get('type') == 'action']),
        len([s for s in result3['reasoning_trace'] if s.get('type') == 'action'])
    ]
}

df = pd.DataFrame(comparison_data)

# Display table
print("\nPerformance Comparison:")
print("="*80)
display(df.style.background_gradient(cmap='Blues', subset=['Execution Time (s)', 'Total Tokens', 'Reasoning Steps']))

# Create visualizations
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# 1. Execution Time
axes[0, 0].bar(df['Query Type'], df['Execution Time (s)'], color=['#FF1493', '#4169E1', '#32CD32'])
axes[0, 0].set_title('Execution Time Comparison', fontweight='bold')
axes[0, 0].set_ylabel('Seconds')
axes[0, 0].set_ylim(0, max(df['Execution Time (s)']) * 1.2)

# 2. Token Usage
axes[0, 1].bar(df['Query Type'], df['Total Tokens'], color=['#FF1493', '#4169E1', '#32CD32'])
axes[0, 1].set_title('Token Usage Comparison', fontweight='bold')
axes[0, 1].set_ylabel('Tokens')

# 3. Cost
axes[1, 0].bar(df['Query Type'], df['Cost (USD)'], color=['#FF1493', '#4169E1', '#32CD32'])
axes[1, 0].set_title('Cost Comparison', fontweight='bold')
axes[1, 0].set_ylabel('USD')
axes[1, 0].ticklabel_format(style='plain', axis='y')

# 4. Complexity (Steps + Tools)
x = range(len(df))
width = 0.35
axes[1, 1].bar([i - width/2 for i in x], df['Reasoning Steps'], width, label='Reasoning Steps', color='#4169E1')
axes[1, 1].bar([i + width/2 for i in x], df['Tools Used'], width, label='Tools Used', color='#32CD32')
axes[1, 1].set_title('Complexity Comparison', fontweight='bold')
axes[1, 1].set_ylabel('Count')
axes[1, 1].set_xticks(x)
axes[1, 1].set_xticklabels(df['Query Type'])
axes[1, 1].legend()

plt.tight_layout()
plt.show()

## 7. Critical Insights

### 7.1 ReAct Framework Strengths

1. **Transparency**: Every decision is visible in the reasoning trace
2. **Tool Selection**: Agent autonomously chooses correct tools
3. **Adaptability**: Handles single-tool and multi-tool queries
4. **Iterative Refinement**: Can call tools multiple times if needed

### 7.2 Observed Patterns

**From our experiments:**

- **Database queries**: Typically 1 tool call, fast execution
- **Currency queries**: 1 tool call, similar performance
- **Multi-tool queries**: 2+ tool calls, proportionally more tokens/time

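**Complexity scaling** (expectations to check against the Section 6 comparison, which covers only three queries):
- Tokens increase with the number of tool calls, since each observation is added to the context
- Execution time increases with each extra tool call because the loop runs its steps sequentially; parallel tool execution could reduce this
- Cost scales predictably with token usage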
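
The link between tokens and cost is plain arithmetic: input and output tokens are billed at separate rates. A sketch with placeholder numbers: `estimate_cost` is not part of this project, and the prices and token counts below are hypothetical, not actual figures for any model.

```python
def estimate_cost(input_tokens: int, output_tokens: int,
                  usd_per_million_input: float,
                  usd_per_million_output: float) -> float:
    """Cost in USD for one run, given per-million-token prices."""
    return (input_tokens * usd_per_million_input
            + output_tokens * usd_per_million_output) / 1_000_000


# Hypothetical rates; substitute your provider's current pricing.
PRICE_IN, PRICE_OUT = 1.0, 4.0

single_tool = estimate_cost(1_000, 200, PRICE_IN, PRICE_OUT)
multi_tool = estimate_cost(2_500, 400, PRICE_IN, PRICE_OUT)
print(f"single-tool: ${single_tool:.6f}, multi-tool: ${multi_tool:.6f}")
```

Because each loop iteration typically sends the accumulated trace back to the model, input tokens tend to dominate as the number of steps grows.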

### 7.3 Potential Failure Modes

1. **Hallucination**: Agent might "make up" tool results
2. **Tool Selection Error**: Choosing wrong tool for task
3. **Incomplete Parsing**: Missing parameters in tool input
4. **Infinite Loops**: Agent keeps calling tools without progress
5. **Context Overflow**: Too many iterations exceed context window
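
Some of these failure modes can be spotted directly in a reasoning trace. As one example, the sketch below flags identical tool calls that repeat, a common symptom of a loop without progress. `find_repeated_actions` is a hypothetical helper, and the trace shape (dicts with `type`, `tool`, `input`) is assumed to match the one read by the analysis cells above.

```python
import json
from collections import Counter
from typing import List, Tuple


def find_repeated_actions(trace: List[dict],
                          threshold: int = 2) -> List[Tuple[str, str, int]]:
    """Return (tool, serialized_input, count) for every identical tool call
    that appears at least `threshold` times in the trace."""
    calls = Counter(
        (step.get("tool", "unknown"),
         json.dumps(step.get("input"), sort_keys=True, default=str))
        for step in trace
        if step.get("type") == "action"
    )
    return [(tool, arg, n) for (tool, arg), n in calls.items() if n >= threshold]


# Illustrative trace in which the same call is issued twice.
looping_trace = [
    {"type": "action", "tool": "currency_price_checker", "input": "USD/EUR"},
    {"type": "observation", "content": "timeout"},
    {"type": "action", "tool": "currency_price_checker", "input": "USD/EUR"},
    {"type": "observation", "content": "timeout"},
]
print(find_repeated_actions(looping_trace))
```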

### 7.4 Production Considerations

**Must implement:**
- Max iteration limits (prevent infinite loops)
- Input validation (sanitize tool inputs)
- Error handling (graceful degradation)
- Logging (debug tool usage patterns)
- Rate limiting (prevent API abuse)
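
As an illustration of input validation, the sketch below normalizes a currency pair before it reaches an external API. It assumes the tool accepts a pair written like `USD/EUR`; the actual input format of `currency_price_checker` is not shown in this notebook, and `parse_currency_pair` is a hypothetical helper.

```python
import re
from typing import Tuple


def parse_currency_pair(raw: str) -> Tuple[str, str]:
    """Parse 'USD/EUR', 'usd-eur', or 'USD EUR' into ('USD', 'EUR').
    Raises ValueError on anything that is not two three-letter codes."""
    parts = [p for p in re.split(r"[\s/\-]+", raw.strip().upper()) if p]
    if len(parts) != 2 or not all(re.fullmatch(r"[A-Z]{3}", p) for p in parts):
        raise ValueError(f"Expected two 3-letter currency codes, got {raw!r}")
    return parts[0], parts[1]


print(parse_currency_pair("usd/eur"))
try:
    parse_currency_pair("USD; DROP TABLE")
except ValueError as err:
    print(f"Rejected: {err}")
```

This checks only the shape of the codes, not whether they are real currencies; a whitelist of supported codes would be a stricter option.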

**Nice to have:**
- Tool call caching (avoid redundant calls)
- Parallel tool execution (speed up multi-tool queries)
- Confidence scoring (flag uncertain answers)
- User feedback loop (improve over time)
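
Tool call caching could look like the following time-limited cache. This is a sketch under assumptions: `TTLCache` and `fake_rate_lookup` are hypothetical, and a sensible expiry depends on how quickly the underlying data (for example, exchange rates) goes stale.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """Cache tool results keyed by (tool_name, tool_input) for a limited time."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._store: Dict[Tuple[str, str], Tuple[float, Any]] = {}

    def call(self, tool_name: str, tool_input: str,
             tool: Callable[[str], Any]) -> Any:
        key = (tool_name, tool_input)
        now = self.clock()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]  # still fresh: skip the real call
        result = tool(tool_input)
        self._store[key] = (now, result)
        return result


calls = []


def fake_rate_lookup(pair: str) -> float:
    """Hypothetical stand-in for an external currency API."""
    calls.append(pair)
    return 0.9


cache = TTLCache(ttl_seconds=60)
cache.call("currency", "USD/EUR", fake_rate_lookup)
cache.call("currency", "USD/EUR", fake_rate_lookup)  # served from cache
print(f"Underlying tool was called {len(calls)} time(s)")
```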

---

## 8. Database Verification

Let's verify the agent's answer against the actual database contents to check its accuracy.

In [None]:
# Check what's actually in the database
db_manager = DatabaseManager(config.database_path)

# Get melancholic songs from 1970s
melancholic_songs = db_manager.get_songs_by_mood("melancholic")
songs_1970s = [s for s in melancholic_songs if 1970 <= s.year <= 1979]

print("Ground Truth: Melancholic Pink Floyd Songs from 1970s")
print("="*60)
print(f"Found {len(songs_1970s)} songs:\n")

for song in songs_1970s:
    print(f"  - {song.title} ({song.year}) - {song.album}")

print("\n" + "="*60)
print("\nAgent Answer Accuracy Check:")
print("Compare the agent's answer above with this ground truth.")
print("Did the agent:")
print("  [ ] Find all songs?")
print("  [ ] Include only songs from 1970s?")
print("  [ ] Include only melancholic songs?")
print("  [ ] Format results clearly?")

---

## 9. Model Comparison (Optional)

Compare different models on the same query.

In [None]:
# Compare gpt-4o-mini vs gpt-4o
models = ["gpt-4o-mini", "gpt-4o"]
test_query = "Find melancholic Pink Floyd songs"

results = {}

print(f"Testing query: {test_query}")
print("="*60)

for model in models:
    print(f"\nTesting {model}...")
    
    try:
        # Separate names so the `agent` and `executor` from Section 2 are not overwritten
        cmp_agent = factory.create_agent(model)
        cmp_executor = AgentExecutor(cmp_agent, model)
        result = cmp_executor.execute(test_query)
        
        results[model] = {
            'time': result['metrics']['execution_time_seconds'],
            'tokens': result['metrics']['estimated_tokens']['total'],
            'cost': result['metrics']['estimated_cost_usd'],
            'steps': result['metrics']['num_steps'],
            'answer_length': len(result['answer'])
        }
        
        print(f"  [OK] Time: {results[model]['time']}s")
        print(f"  [OK] Cost: ${results[model]['cost']:.6f}")
        print(f"  [OK] Steps: {results[model]['steps']}")
        
    except Exception as e:
        print(f"  [ERROR] Error: {e}")
        results[model] = None

# Compare results
if all(results.values()):
    print("\n" + "="*60)
    print("Model Comparison Summary:")
    print("="*60)
    
    comparison_df = pd.DataFrame(results).T
    display(comparison_df.style.highlight_min(color='lightgreen', subset=['time', 'cost']))
    
    # Calculate differences
    if len(models) == 2:
        m1, m2 = models
        time_diff = (results[m2]['time'] / results[m1]['time'] - 1) * 100
        cost_diff = (results[m2]['cost'] / results[m1]['cost'] - 1) * 100
        
        print(f"\n{m2} vs {m1}:")
        print(f"  Time: {time_diff:+.1f}% {'slower' if time_diff > 0 else 'faster'}")
        print(f"  Cost: {cost_diff:+.1f}% {'more expensive' if cost_diff > 0 else 'cheaper'}")

---

## 10. Conclusion

### Key Takeaways

1. **ReAct provides transparency**: Every decision is traceable
2. **Tool selection is autonomous**: The agent chooses tools on its own, though three test queries are too small a sample to establish reliability
3. **Performance tracks complexity**: Cost and time rise with the number of tool calls and tokens
4. **Multi-tool queries work**: Agent can coordinate multiple tools effectively

### When to Use ReAct

**Good for:**
- Tasks requiring external data/tools
- Situations needing transparency/explainability
- Multi-step reasoning workflows
- Dynamic tool selection scenarios

**Not ideal for:**
- Pure text generation (no tools needed)
- Latency-critical applications (multiple LLM calls)
- Simple classification tasks (overkill)
- When tool reliability is uncertain

### Next Steps

1. **Add more tools**: Expand agent capabilities
2. **Implement caching**: Reduce redundant tool calls
3. **Add validation**: Verify tool results before using
4. **Monitor in production**: Track failure modes
5. **A/B test models**: Find optimal cost/performance balance

---

**Further Reading:**
- ReAct Paper: https://arxiv.org/abs/2210.03629
- LangChain Agents: https://python.langchain.com/docs/modules/agents/
- LangGraph: https://langchain-ai.github.io/langgraph/