# LLM Reasoning Framework Comparison Analysis

This notebook compares three reasoning frameworks for LLM-based agents:
- **ReAct** (Reasoning + Acting): Alternates between reasoning and action steps
- **Chain-of-Thought (CoT)**: Uses linear step-by-step reasoning
- **Tree-of-Thoughts (ToT)**: Explores multiple reasoning paths and selects the best

## Evaluation Tasks
The frameworks are evaluated on three distinct task types:
1. **Code Generation**: Implement Python programs such as Conway's Game of Life
2. **Itinerary Planning**: Generate optimized travel routes with constraints  
3. **Procedure Structuring**: Transform vague instructions into clear procedures

## Experimental Setup
- Each framework is tested on 3 examples per task type (9 tasks total)
- Each task is executed 3 times per framework to measure consistency
- Standardized prompts and temperature (0.3) across all frameworks
- Metrics: token usage, execution time, validation scores, success rates
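
The number of runs implied by this setup can be sketched as follows. The framework labels and task ids are illustrative (the ids mirror the `code_001`-style ids used later in the notebook), and `build_run_grid` is a hypothetical helper, not part of the notebook's code.

```python
from itertools import product

# Illustrative labels; the ids mirror those defined later in the notebook.
FRAMEWORKS = ["react", "cot", "tot"]
TASK_IDS = [f"{prefix}_{n:03d}" for prefix in ("code", "itin", "proc") for n in (1, 2, 3)]
RUNS_PER_TASK = 3


def build_run_grid(frameworks, task_ids, runs_per_task):
    """Return every (framework, task_id, run_number) combination."""
    return list(product(frameworks, task_ids, range(1, runs_per_task + 1)))


run_grid = build_run_grid(FRAMEWORKS, TASK_IDS, RUNS_PER_TASK)
print(len(run_grid))  # 3 frameworks x 9 tasks x 3 runs = 81
```

Enumerating the grid up front makes it easy to confirm the expected total before any API calls are made.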

## 1. Install and Import Required Libraries

In [None]:
# Install required packages (run once)
# !pip install langchain langchain-google-genai langchain-openai langchain-mistralai
# !pip install python-dotenv pandas numpy matplotlib seaborn plotly streamlit gradio
# !pip install psutil tiktoken jupyter

In [None]:
# Standard library imports
import os
import sys
import json
import time
import logging
from datetime import datetime
from pathlib import Path
from typing import Dict, List, Any, Optional, Tuple
from dataclasses import dataclass, asdict
import re
import ast

# Data analysis imports
import pandas as pd
import numpy as np

# Visualization imports
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# LLM and framework imports
from langchain_core.language_models.llms import LLM  # replaces the deprecated langchain.llms.base path
from langchain_google_genai import GoogleGenerativeAI
from langchain_openai import OpenAI
from langchain_mistralai.chat_models import ChatMistralAI
from dotenv import load_dotenv

# System monitoring
import psutil

# Add project modules to path
sys.path.append(os.path.dirname(os.path.abspath('')))

# Configure plotting
plt.style.use('default')
sns.set_palette("husl")

# Load environment variables
load_dotenv()

print("All libraries imported successfully!")
print(f"Python version: {sys.version}")
print(f"Working directory: {os.getcwd()}")

## 2. Define Prompt Templates for Each Task Category

Standardized prompt templates ensure consistency across all reasoning frameworks.

In [None]:
class PromptTemplates:
    """Standardized prompt templates for different reasoning frameworks."""
    
    @staticmethod
    def get_react_template(task_prompt: str, task_type: str) -> str:
        """ReAct framework template with Thought-Action-Observation structure."""
        return f"""You are solving a {task_type} task. Use the ReAct framework: alternate between Thought and Action steps.

Task: {task_prompt}

Follow this exact format:
Thought: [Your reasoning about what to do next]
Action: [The specific action you're taking]
Observation: [What you learned from the action]

Continue this Thought-Action-Observation cycle until you reach a final answer.
When you have the complete solution, end with:
Final Answer: [Your complete solution]

Important guidelines:
- For code generation: Think through the algorithm step by step, then implement incrementally
- For itinerary planning: Consider constraints, calculate distances/times, optimize step by step  
- For procedure structuring: Analyze the vague instructions, identify key steps, organize logically

Begin:
"""
    
    @staticmethod
    def get_cot_template(task_prompt: str, task_type: str) -> str:
        """Chain-of-Thought template with linear step-by-step reasoning."""
        return f"""You are solving a {task_type} task. Use Chain-of-Thought reasoning: break down the problem into clear, logical steps.

Task: {task_prompt}

Think through this step by step:

Step 1: [Understand the problem and identify key requirements]
Step 2: [Break down the problem into smaller components]
Step 3: [Plan your approach or algorithm]
Step 4: [Implement/work through the first part]
Step 5: [Continue with subsequent parts]
...
Step N: [Complete the solution and verify]

Final Solution: [Your complete answer]

Guidelines for each task type:
- Code Generation: Analyze requirements → Design algorithm → Implement incrementally → Test logic
- Itinerary Planning: Parse constraints → Research options → Calculate costs/times → Optimize route
- Procedure Structuring: Identify core objectives → Break into logical steps → Sequence properly → Add details

Let's work through this systematically:
"""
    
    @staticmethod
    def get_tot_template(task_prompt: str, task_type: str, num_branches: int = 3) -> str:
        """Tree-of-Thoughts template with multiple path exploration."""
        # Build the per-branch lines so the template always matches num_branches
        branches = range(1, num_branches + 1)
        approach_lines = "\n".join(
            f"Approach {i}: [Describe potential method {i}]" for i in branches
        )
        assessment_lines = "\n".join(
            f"Approach {i} Assessment: [Pros, cons, feasibility - Rate 1-10]" for i in branches
        )
        return f"""You are solving a {task_type} task using Tree-of-Thoughts reasoning. Explore multiple approaches and select the best one.

Task: {task_prompt}

Follow this structure:

APPROACH GENERATION:
Generate {num_branches} different approaches to solve this problem:

{approach_lines}

APPROACH EVALUATION:
Evaluate each approach:

{assessment_lines}

BEST APPROACH SELECTION:
Selected Approach: [Choose the highest-rated approach and explain why]

DETAILED EXECUTION:
Now implement the selected approach step by step:
Step 1: [First implementation step]
Step 2: [Second implementation step]
...
Step N: [Final step]

Final Solution: [Complete solution using the best approach]

Task-specific considerations:
- Code Generation: Consider different algorithms, data structures, complexity trade-offs
- Itinerary Planning: Explore different route options, transportation modes, optimization criteria
- Procedure Structuring: Try different organizational frameworks, sequencing approaches

Begin exploration:
"""

# Test the templates
print("Prompt templates defined successfully!")
print(f"ReAct template length: {len(PromptTemplates.get_react_template('test task', 'test_type'))}")
print(f"CoT template length: {len(PromptTemplates.get_cot_template('test task', 'test_type'))}")
print(f"ToT template length: {len(PromptTemplates.get_tot_template('test task', 'test_type'))}")

## 3. Implement Agent Classes: ReAct, CoT, ToT

Each agent class wraps a LangChain LLM, builds its framework-specific prompt, and parses the response into reasoning steps and a final answer.

In [None]:
@dataclass
class ExecutionMetrics:
    """Metrics collected during agent execution."""
    tokens_used: int
    execution_time: float
    memory_usage: float
    reasoning_steps: int
    final_answer: str
    intermediate_steps: List[str]
    success: bool
    error_message: Optional[str] = None


class BaseAgent:
    """Base class for all reasoning framework agents."""
    
    def __init__(self, llm: LLM, temperature: float = 0.3, max_tokens: int = 2048):
        self.llm = llm
        self.temperature = temperature
        self.max_tokens = max_tokens
        self.framework_name = self.__class__.__name__
        
    def execute_task(self, task_prompt: str, task_type: str) -> ExecutionMetrics:
        """Execute a task using the specific reasoning framework."""
        full_prompt = self.get_framework_prompt(task_prompt, task_type)
        
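        def _run_task():
            response = self.llm.invoke(full_prompt)
            # Chat models such as ChatMistralAI return a message object; plain LLMs return str
            return getattr(response, "content", response)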
        
        result, exec_time, memory_usage, success, error = self._measure_execution(_run_task)
        
        if not success:
            return ExecutionMetrics(
                tokens_used=0,
                execution_time=exec_time,
                memory_usage=memory_usage,
                reasoning_steps=0,
                final_answer="",
                intermediate_steps=[],
                success=False,
                error_message=error
            )
        
        # Parse response
        reasoning_steps = self._extract_reasoning_steps(result)
        final_answer = self._extract_final_answer(result)
        tokens_used = self._count_tokens(full_prompt + str(result))
        
        return ExecutionMetrics(
            tokens_used=tokens_used,
            execution_time=exec_time,
            memory_usage=memory_usage,
            reasoning_steps=len(reasoning_steps),
            final_answer=final_answer,
            intermediate_steps=reasoning_steps,
            success=True
        )
    
    def get_framework_prompt(self, task_prompt: str, task_type: str) -> str:
        """Generate the framework-specific prompt."""
        raise NotImplementedError
    
    def _measure_execution(self, func, *args, **kwargs) -> tuple:
        """Measure execution time and memory usage."""
        start_time = time.time()
        start_memory = psutil.Process(os.getpid()).memory_info().rss / 1024 / 1024  # MB
        
        try:
            result = func(*args, **kwargs)
            success = True
            error = None
        except Exception as e:
            result = None
            success = False
            error = str(e)
        
        end_time = time.time()
        end_memory = psutil.Process(os.getpid()).memory_info().rss / 1024 / 1024  # MB
        
        execution_time = end_time - start_time
        memory_usage = end_memory - start_memory
        
        return result, execution_time, memory_usage, success, error
    
    def _count_tokens(self, text: str) -> int:
        """Estimate token count with a rough 4-characters-per-token heuristic."""
        return len(text) // 4
    
    def _extract_reasoning_steps(self, response: str) -> List[str]:
        """Extract reasoning steps from the response."""
        return [response]
    
    def _extract_final_answer(self, response: str) -> str:
        """Extract the final answer from the response."""
        return response.strip()

print("Base agent class defined successfully!")
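
`_measure_execution` wraps a callable and reports its elapsed time together with a success flag and an error message. A minimal standalone sketch of the same pattern is shown below. It assumes `time.perf_counter` (a monotonic clock) in place of `time.time` and leaves out the memory reading; `measure` and `flaky` are hypothetical names used only for illustration.

```python
import time
from typing import Any, Callable, Optional, Tuple


def measure(func: Callable[..., Any], *args, **kwargs) -> Tuple[Any, float, bool, Optional[str]]:
    """Run func and return (result, elapsed_seconds, success, error_message)."""
    start = time.perf_counter()
    try:
        result, success, error = func(*args, **kwargs), True, None
    except Exception as exc:  # report the failure instead of raising
        result, success, error = None, False, str(exc)
    elapsed = time.perf_counter() - start
    return result, elapsed, success, error


def flaky(x: int) -> int:
    """Hypothetical task that fails on negative input."""
    if x < 0:
        raise ValueError("negative input")
    return x * 2


ok = measure(flaky, 21)
bad = measure(flaky, -1)
print(ok[0], ok[2])    # 42 True
print(bad[2], bad[3])  # False negative input
```

Returning the error as data rather than raising lets a caller record failed runs alongside successful ones.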

In [None]:
class ReActAgent(BaseAgent):
    """ReAct agent that alternates between reasoning and acting."""
    
    def get_framework_prompt(self, task_prompt: str, task_type: str) -> str:
        """Generate ReAct-specific prompt."""
        return PromptTemplates.get_react_template(task_prompt, task_type)
    
    def _extract_reasoning_steps(self, response: str) -> List[str]:
        """Extract Thought-Action-Observation cycles from ReAct response."""
        steps = []
        
        # Find all Thought-Action-Observation patterns
        thought_pattern = r"Thought:\s*(.*?)(?=\nAction:|$)"
        action_pattern = r"Action:\s*(.*?)(?=\nObservation:|$)"
        observation_pattern = r"Observation:\s*(.*?)(?=\nThought:|Final Answer:|$)"
        
        thoughts = re.findall(thought_pattern, response, re.DOTALL | re.IGNORECASE)
        actions = re.findall(action_pattern, response, re.DOTALL | re.IGNORECASE)
        observations = re.findall(observation_pattern, response, re.DOTALL | re.IGNORECASE)
        
        # Combine into reasoning steps
        max_len = max(len(thoughts), len(actions), len(observations))
        for i in range(max_len):
            step_parts = []
            if i < len(thoughts):
                step_parts.append(f"Thought: {thoughts[i].strip()}")
            if i < len(actions):
                step_parts.append(f"Action: {actions[i].strip()}")
            if i < len(observations):
                step_parts.append(f"Observation: {observations[i].strip()}")
            
            if step_parts:
                steps.append(" | ".join(step_parts))
        
        return steps
    
    def _extract_final_answer(self, response: str) -> str:
        """Extract the final answer from ReAct response."""
        # Greedy match with DOTALL keeps multi-line answers (e.g. code) intact
        final_answer_pattern = r"Final Answer:\s*(.*)"
        match = re.search(final_answer_pattern, response, re.DOTALL | re.IGNORECASE)
        
        if match:
            return match.group(1).strip()
        
        lines = response.strip().split('\n')
        return lines[-1] if lines else ""


class CoTAgent(BaseAgent):
    """Chain-of-Thought agent that uses linear step-by-step reasoning."""
    
    def get_framework_prompt(self, task_prompt: str, task_type: str) -> str:
        """Generate CoT-specific prompt."""
        return PromptTemplates.get_cot_template(task_prompt, task_type)
    
    def _extract_reasoning_steps(self, response: str) -> List[str]:
        """Extract numbered steps from CoT response."""
        steps = []
        
        # Find all numbered steps
        step_pattern = r"Step\s*(\d+):\s*(.*?)(?=\nStep\s*\d+:|Final Solution:|$)"
        matches = re.findall(step_pattern, response, re.DOTALL | re.IGNORECASE)
        
        for step_num, step_content in matches:
            steps.append(f"Step {step_num}: {step_content.strip()}")
        
        return steps
    
    def _extract_final_answer(self, response: str) -> str:
        """Extract the final solution from CoT response."""
        final_patterns = [
            # Greedy matches with DOTALL keep multi-line solutions (e.g. code) intact
            r"Final Solution:\s*(.*)",
            r"Final Answer:\s*(.*)",
            r"Solution:\s*(.*)"
        ]
        
        for pattern in final_patterns:
            match = re.search(pattern, response, re.DOTALL | re.IGNORECASE)
            if match:
                return match.group(1).strip()
        
        lines = response.strip().split('\n')
        return lines[-1] if lines else ""


class ToTAgent(BaseAgent):
    """Tree-of-Thoughts agent that explores multiple reasoning paths."""
    
    def get_framework_prompt(self, task_prompt: str, task_type: str) -> str:
        """Generate ToT-specific prompt."""
        return PromptTemplates.get_tot_template(task_prompt, task_type)
    
    def _extract_reasoning_steps(self, response: str) -> List[str]:
        """Extract the different phases of ToT reasoning."""
        steps = []
        
        # Extract approaches
        approach_pattern = r"Approach\s*(\d+):\s*(.*?)(?=\nApproach\s*\d+:|APPROACH EVALUATION:|$)"
        approaches = re.findall(approach_pattern, response, re.DOTALL | re.IGNORECASE)
        
        for approach_num, approach_content in approaches:
            steps.append(f"Generated Approach {approach_num}: {approach_content.strip()}")
        
        # Extract evaluations
        eval_pattern = r"Approach\s*(\d+)\s*Assessment:\s*(.*?)(?=\nApproach\s*\d+\s*Assessment:|BEST APPROACH SELECTION:|$)"
        evaluations = re.findall(eval_pattern, response, re.DOTALL | re.IGNORECASE)
        
        for eval_num, eval_content in evaluations:
            steps.append(f"Evaluated Approach {eval_num}: {eval_content.strip()}")
        
        # Extract selected approach
        selection_pattern = r"Selected Approach:\s*(.*?)(?=\nDETAILED EXECUTION:|$)"
        selection_match = re.search(selection_pattern, response, re.DOTALL | re.IGNORECASE)
        if selection_match:
            steps.append(f"Selected Best Approach: {selection_match.group(1).strip()}")
        
        return steps
    
    def _extract_final_answer(self, response: str) -> str:
        """Extract the final solution from ToT response."""
        # Greedy match with DOTALL keeps multi-line solutions (e.g. code) intact
        final_pattern = r"Final Solution:\s*(.*)"
        match = re.search(final_pattern, response, re.DOTALL | re.IGNORECASE)
        
        if match:
            return match.group(1).strip()
        
        lines = response.strip().split('\n')
        return lines[-1] if lines else ""

print("All agent classes implemented successfully!")
print("Available agents: ReActAgent, CoTAgent, ToTAgent")
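
To see how the extraction patterns behave, the sketch below applies the ReAct patterns to a hand-written response. `SAMPLE` is invented for illustration and is not model output. The final-answer pattern used here is a greedy variant, chosen on the assumption that answers such as code span several lines and should be captured through the end of the response.

```python
import re

# Hand-written sample in the ReAct format; not real model output.
SAMPLE = """Thought: I need a grid representation.
Action: Choose a list of lists.
Observation: Lists allow simple indexing.
Thought: Now count neighbors.
Action: Loop over the eight offsets.
Observation: Edge cells need bounds checks.
Final Answer: class Grid:
    pass
"""

FLAGS = re.DOTALL | re.IGNORECASE
thoughts = re.findall(r"Thought:\s*(.*?)(?=\nAction:|$)", SAMPLE, FLAGS)
actions = re.findall(r"Action:\s*(.*?)(?=\nObservation:|$)", SAMPLE, FLAGS)
observations = re.findall(r"Observation:\s*(.*?)(?=\nThought:|Final Answer:|$)", SAMPLE, FLAGS)

# Zip the three lists into one entry per Thought-Action-Observation cycle.
steps = [
    f"Thought: {t.strip()} | Action: {a.strip()} | Observation: {o.strip()}"
    for t, a, o in zip(thoughts, actions, observations)
]

# Greedy variant: captures everything after the marker, including later lines.
match = re.search(r"Final Answer:\s*(.*)", SAMPLE, FLAGS)
final_answer = match.group(1).strip() if match else ""

print(len(steps))
print(final_answer)
```

A non-greedy pattern that stops at the first newline would return only `class Grid:` for this sample, which is why the capture group matters for code-generation tasks.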

## 4. Define Task Examples for Each Task Type

Each category (code generation, itinerary planning, procedure structuring) has three example tasks, each with its own validation criteria, so the same tasks can be run repeatedly.

In [None]:
@dataclass
class Task:
    """A single task instance."""
    id: str
    task_type: str
    title: str
    prompt: str
    expected_output_type: str
    validation_criteria: List[str]


class TaskGenerator:
    """Generates tasks for different categories."""
    
    @staticmethod
    def get_code_generation_tasks() -> List[Task]:
        """Generate code generation tasks."""
        return [
            Task(
                id="code_001",
                task_type="code_generation",
                title="Conway's Game of Life",
                prompt="""Implement Conway's Game of Life in Python. Requirements:
- Create a Grid class that can initialize with a given size
- Implement the four rules of Conway's Game of Life:
  1. Any live cell with fewer than 2 live neighbors dies (underpopulation)
  2. Any live cell with 2 or 3 live neighbors survives
  3. Any live cell with more than 3 live neighbors dies (overpopulation)
  4. Any dead cell with exactly 3 live neighbors becomes alive (reproduction)
- Include methods to: display the grid, advance one generation, count live neighbors
- Provide a simple test case with a known pattern (e.g., blinker or glider)
- Make it runnable as a script that shows several generations""",
                expected_output_type="python_code",
                validation_criteria=[
                    "Contains a Grid class",
                    "Implements the four rules correctly",
                    "Has neighbor counting logic",
                    "Includes display functionality",
                    "Provides a test case"
                ]
            ),
            
            Task(
                id="code_002", 
                task_type="code_generation",
                title="Binary Search Tree Implementation",
                prompt="""Create a Binary Search Tree (BST) implementation in Python. Requirements:
- Implement a Node class and BST class
- Include methods: insert, search, delete, inorder_traversal
- Handle edge cases (empty tree, single node, etc.)
- Implement tree balancing check method
- Add a method to find the minimum and maximum values
- Include comprehensive test cases showing insertion, deletion, and traversal
- Make the code well-documented with docstrings""",
                expected_output_type="python_code", 
                validation_criteria=[
                    "Contains Node and BST classes",
                    "Implements all required methods",
                    "Handles edge cases",
                    "Includes balancing check",
                    "Has min/max finding methods",
                    "Contains test cases"
                ]
            ),
            
            Task(
                id="code_003",
                task_type="code_generation", 
                title="Text Analysis Tool",
                prompt="""Build a text analysis tool in Python that processes a text file. Requirements:
- Read text from a file or string input
- Count: words, sentences, paragraphs, characters
- Find: most common words, average word length, reading time estimate
- Implement sentiment analysis (simple positive/negative word counting)
- Create word frequency distribution
- Generate a summary report in both text and JSON formats
- Handle different file encodings and basic error cases
- Include a command-line interface""",
                expected_output_type="python_code",
                validation_criteria=[
                    "Reads text input properly",
                    "Implements all counting features",
                    "Has word frequency analysis", 
                    "Includes sentiment analysis",
                    "Outputs in multiple formats",
                    "Has CLI interface"
                ]
            )
        ]
    
    @staticmethod
    def get_itinerary_planning_tasks() -> List[Task]:
        """Generate itinerary planning tasks."""
        return [
            Task(
                id="itin_001",
                task_type="itinerary_planning",
                title="European City Tour",
                prompt="""Plan a 7-day European tour itinerary. Constraints:
- Budget: $2000 USD total
- Start and end in London
- Must visit: Paris, Amsterdam, Berlin
- Interests: Museums, historical sites, local cuisine
- Transportation: Train preferred, flights if necessary
- Accommodation: Mid-range hotels/hostels
- Travel dates: flexible, summer preferred
- Create day-by-day schedule with specific activities, costs, and travel times
- Include backup options for bad weather""",
                expected_output_type="structured_itinerary",
                validation_criteria=[
                    "Covers all 7 days",
                    "Visits all required cities",
                    "Stays within budget",
                    "Includes specific activities",
                    "Shows transportation details",
                    "Has cost breakdown"
                ]
            ),
            
            Task(
                id="itin_002",
                task_type="itinerary_planning", 
                title="Business Trip Optimization",
                prompt="""Optimize a business trip itinerary. Constraints:
- Duration: 3 days
- Cities: New York, Philadelphia, Washington DC
- Meetings scheduled: NYC (Day 1, 2pm), Philadelphia (Day 2, 10am), DC (Day 3, 11am)
- Requirements: Minimize travel time, stay near meeting locations
- Budget: $1500 for accommodation and transport
- Need reliable internet for virtual meetings
- Prefer train travel when possible
- Include time for one business dinner and one cultural activity""",
                expected_output_type="structured_itinerary",
                validation_criteria=[
                    "Accommodates all meetings",
                    "Minimizes travel time",
                    "Stays within budget", 
                    "Includes business and cultural activities",
                    "Shows transportation logistics"
                ]
            ),
            
            Task(
                id="itin_003",
                task_type="itinerary_planning",
                title="Family Vacation Planning", 
                prompt="""Plan a family vacation for 2 adults and 2 children (ages 8, 12). Constraints:
- Destination: Orlando, Florida
- Duration: 5 days
- Budget: $3000 total
- Must include: Disney World (2 days), Universal Studios (1 day)
- Requirements: Family-friendly restaurants, nearby accommodation
- Special needs: One child has food allergies (nuts)
- Transportation: Flying from Chicago
- Want to include one non-theme park activity
- Create detailed daily plans with timing and alternatives""",
                expected_output_type="structured_itinerary",
                validation_criteria=[
                    "Accommodates family needs",
                    "Includes all required attractions",
                    "Considers food allergies",
                    "Stays within budget",
                    "Has detailed daily schedules"
                ]
            )
        ]
    
    @staticmethod 
    def get_procedure_structuring_tasks() -> List[Task]:
        """Generate procedure structuring tasks."""
        return [
            Task(
                id="proc_001",
                task_type="procedure_structuring",
                title="Software Deployment Process",
                prompt="""Restructure this vague deployment instruction into clear steps:
"Deploy the new version to production. Make sure to backup everything first and test it. Don't forget about the database migration and updating the configs. If something breaks, roll back. Also notify the team when done and update documentation."

Transform this into a detailed, step-by-step procedure that could be followed by any team member.""",
                expected_output_type="structured_procedure",
                validation_criteria=[
                    "Clear sequential steps",
                    "Includes all mentioned tasks",
                    "Has verification points",
                    "Covers error handling",
                    "Specifies responsibilities"
                ]
            ),
            
            Task(
                id="proc_002",
                task_type="procedure_structuring",
                title="Customer Onboarding Process",
                prompt="""Convert this unclear onboarding description into a structured procedure:
"New customers need to sign up, verify their info, get set up with accounts, learn how to use the system, and start their subscription. Someone should welcome them and make sure they understand everything. We also need to collect their preferences and set up their profile properly."

Create a comprehensive onboarding procedure with clear steps, timelines, and responsibilities.""",
                expected_output_type="structured_procedure", 
                validation_criteria=[
                    "Logical step sequence",
                    "Clear timelines",
                    "Defined responsibilities",
                    "Covers all mentioned elements",
                    "Includes quality checkpoints"
                ]
            ),
            
            Task(
                id="proc_003", 
                task_type="procedure_structuring",
                title="Emergency Response Protocol",
                prompt="""Reorganize this emergency response description into a clear protocol:
"When there's a system outage, everyone needs to know what to do. First figure out what's wrong, then fix it, and tell people about it. Make sure to keep track of what happened and write it down later. Someone should be in charge and coordinate everything. Don't panic and follow the escalation rules."

Transform this into a detailed emergency response protocol with specific roles, actions, and communication procedures.""",
                expected_output_type="structured_procedure",
                validation_criteria=[
                    "Clear command structure",
                    "Specific action steps", 
                    "Communication protocols",
                    "Documentation requirements",
                    "Escalation procedures"
                ]
            )
        ]
    
    @classmethod
    def get_all_tasks(cls) -> Dict[str, List[Task]]:
        """Get all tasks organized by type."""
        return {
            "code_generation": cls.get_code_generation_tasks(),
            "itinerary_planning": cls.get_itinerary_planning_tasks(), 
            "procedure_structuring": cls.get_procedure_structuring_tasks()
        }

# Initialize task generator and get all tasks
task_generator = TaskGenerator()
all_tasks = task_generator.get_all_tasks()

print("Task definitions loaded successfully!")
for task_type, tasks in all_tasks.items():
    print(f"  {task_type}: {len(tasks)} tasks")
    for task in tasks:
        print(f"    - {task.title} ({task.id})")
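
Each `Task` carries its `validation_criteria` as free text, so a checker is still needed to turn an answer into a score. As one possible sketch for `python_code` outputs, the snippet below extracts code from an answer, parses it with `ast`, and checks for required class names. `extract_code`, `validate_python_answer`, and the scoring rule are assumptions made for illustration; the notebook does not define them.

```python
import ast
import re
from typing import List, Tuple

FENCE = "`" * 3  # built indirectly to avoid a literal fence inside this example


def extract_code(answer: str) -> str:
    """Return the first fenced code block if present, else the raw answer."""
    pattern = FENCE + r"(?:python)?\s*\n(.*?)" + FENCE
    match = re.search(pattern, answer, re.DOTALL)
    return match.group(1) if match else answer


def validate_python_answer(answer: str, required_classes: List[str]) -> Tuple[float, List[str]]:
    """Score an answer from 0 to 100 by syntax validity and required class names."""
    code = extract_code(answer)
    try:
        tree = ast.parse(code)
    except SyntaxError as exc:
        return 0.0, [f"Syntax error: {exc.msg}"]
    defined = {node.name for node in ast.walk(tree) if isinstance(node, ast.ClassDef)}
    issues = [f"Missing class: {name}" for name in required_classes if name not in defined]
    found = len(required_classes) - len(issues)
    score = 100.0 * found / len(required_classes) if required_classes else 100.0
    return score, issues


sample = "Here is the code:\n" + FENCE + "python\nclass Grid:\n    pass\n" + FENCE
full = validate_python_answer(sample, ["Grid"])
partial = validate_python_answer(sample, ["Grid", "Node"])
broken = validate_python_answer("def broken(:", ["Grid"])
print(full)
print(partial)
print(broken[0])
```

A structural check like this only covers criteria such as "Contains a Grid class"; behavioral criteria (for example, whether the rules are implemented correctly) would need the code to be executed against test patterns.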

## 5. Create Task Runner to Execute Agents on Tasks

The task runner loops over all agent-task combinations, executes each task three times per agent, and collects the outputs.
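
Repeating each task yields several measurements per framework, and their spread is what indicates consistency. A minimal sketch of that aggregation using only the standard library is shown below. It assumes results are available as plain dictionaries; the records are synthetic placeholders, not experimental results, and `summarize_by_framework` is a hypothetical helper.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Synthetic placeholder records, not experimental results.
records = [
    {"framework": "react", "execution_time": 2.0, "success": True},
    {"framework": "react", "execution_time": 4.0, "success": True},
    {"framework": "cot", "execution_time": 1.0, "success": True},
    {"framework": "cot", "execution_time": 1.0, "success": False},
]


def summarize_by_framework(rows):
    """Group rows by framework and report mean time, spread, and success rate."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["framework"]].append(row)
    summary = {}
    for name, group in grouped.items():
        times = [r["execution_time"] for r in group]
        summary[name] = {
            "runs": len(group),
            "mean_time": mean(times),
            "time_spread": pstdev(times),  # population standard deviation across runs
            "success_rate": sum(r["success"] for r in group) / len(group),
        }
    return summary


summary = summarize_by_framework(records)
print(summary["react"])
print(summary["cot"])
```

A lower `time_spread` for the same mean suggests more repeatable behavior across runs.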

In [None]:
class LLMManager:
    """Manages LLM initialization and configuration."""
    
    def __init__(self):
        self.available_models = {
            'gemini-2.0-flash-exp': self._create_gemini,
            'gemini-1.5-pro': self._create_gemini, 
            'gpt-3.5-turbo': self._create_openai,
            'gpt-4': self._create_openai,
            'mistral-small': self._create_mistral,
            'mistral-medium': self._create_mistral
        }
    
    def _create_gemini(self, model_name: str, **kwargs) -> LLM:
        """Create Google Gemini model."""
        api_key = os.getenv('GOOGLE_API_KEY')
        if not api_key:
            raise ValueError("GOOGLE_API_KEY not found in environment variables")
        
        return GoogleGenerativeAI(
            model=model_name,
            google_api_key=api_key,
            temperature=kwargs.get('temperature', 0.3),
            max_output_tokens=kwargs.get('max_tokens', 2048)
        )
    
    def _create_openai(self, model_name: str, **kwargs) -> LLM:
        """Create OpenAI model."""
        api_key = os.getenv('OPENAI_API_KEY')
        if not api_key:
            raise ValueError("OPENAI_API_KEY not found in environment variables")
        
        return OpenAI(
            model=model_name,
            openai_api_key=api_key,
            temperature=kwargs.get('temperature', 0.3),
            max_tokens=kwargs.get('max_tokens', 2048)
        )
    
    def _create_mistral(self, model_name: str, **kwargs) -> LLM:
        """Create Mistral model."""
        api_key = os.getenv('MISTRAL_API_KEY')
        if not api_key:
            raise ValueError("MISTRAL_API_KEY not found in environment variables")
        
        return ChatMistralAI(
            model=model_name,
            mistral_api_key=api_key,
            temperature=kwargs.get('temperature', 0.3),
            max_tokens=kwargs.get('max_tokens', 2048)
        )
    
    def create_llm(self, model_name: str, **kwargs) -> LLM:
        """Create an LLM instance."""
        if model_name not in self.available_models:
            raise ValueError(f"Model {model_name} not supported. Available: {list(self.available_models.keys())}")
        
        return self.available_models[model_name](model_name, **kwargs)


@dataclass
class ExperimentResult:
    """Single experiment result."""
    timestamp: str
    framework: str
    task_id: str
    task_type: str
    run_number: int
    success: bool
    tokens_used: int
    execution_time: float
    memory_usage: float
    reasoning_steps: int
    final_answer: str
    intermediate_steps: List[str]
    validation_score: float
    validation_passed: bool
    validation_issues: List[str]
    error_message: Optional[str] = None


class ExperimentRunner:
    """Orchestrates the comparison experiment across all frameworks and tasks."""
    
    def __init__(self, model_name: str = 'gemini-2.0-flash-exp', temperature: float = 0.3, runs_per_task: int = 3):
        self.model_name = model_name
        self.temperature = temperature
        self.runs_per_task = runs_per_task
        
        # Initialize components
        self.llm_manager = LLMManager()
        self.all_tasks = TaskGenerator.get_all_tasks()  # classmethod; avoids relying on a notebook-level global
        
        # Agent classes
        self.agent_classes = {
            'react': ReActAgent,
            'cot': CoTAgent,
            'tot': ToTAgent
        }
        
        print("Experiment runner initialized:")
        print(f"  Model: {self.model_name}")
        print(f"  Temperature: {self.temperature}")
        print(f"  Runs per task: {self.runs_per_task}")
        print(f"  Frameworks: {list(self.agent_classes.keys())}")
    
    def run_single_experiment(self, framework: str, task: Task, run_number: int) -> ExperimentResult:
        """Run a single experiment: one framework on one task."""
        timestamp = datetime.now().isoformat()
        
        try:
            # Create LLM and agent
            llm = self.llm_manager.create_llm(self.model_name, temperature=self.temperature)
            agent_class = self.agent_classes[framework]
            agent = agent_class(llm, temperature=self.temperature)
            
            # Execute task
            metrics = agent.execute_task(task.prompt, task.task_type)
            
            # Simple validation (placeholder - would use real validators)
            validation_passed = metrics.success and len(metrics.final_answer) > 50
            validation_score = 75.0 if validation_passed else 25.0
            validation_issues = [] if validation_passed else ["Output too short or task failed"]
            
            # Create result
            result = ExperimentResult(
                timestamp=timestamp,
                framework=framework,
                task_id=task.id,
                task_type=task.task_type,
                run_number=run_number,
                success=metrics.success,
                tokens_used=metrics.tokens_used,
                execution_time=metrics.execution_time,
                memory_usage=metrics.memory_usage,
                reasoning_steps=metrics.reasoning_steps,
                final_answer=metrics.final_answer,
                intermediate_steps=metrics.intermediate_steps,
                validation_score=validation_score,
                validation_passed=validation_passed,
                validation_issues=validation_issues,
                error_message=metrics.error_message
            )
            
        except Exception as e:
            # Handle any unexpected errors
            result = ExperimentResult(
                timestamp=timestamp,
                framework=framework,
                task_id=task.id,
                task_type=task.task_type,
                run_number=run_number,
                success=False,
                tokens_used=0,
                execution_time=0.0,
                memory_usage=0.0,
                reasoning_steps=0,
                final_answer="",
                intermediate_steps=[],
                validation_score=0.0,
                validation_passed=False,
                validation_issues=[f"Experiment error: {str(e)}"],
                error_message=str(e)
            )
        
        return result
    
    def run_quick_test(self) -> List[ExperimentResult]:
        """Run a quick test with one task per type and one run each."""
        print("Running quick test (1 task per type, 1 run each)...")
        
        results = []
        
        # Select first task from each type
        for task_type, tasks in self.all_tasks.items():
            if tasks:
                task = tasks[0]  # First task of each type
                print(f"\nTesting task: {task.title} ({task.id})")
                
                for framework in self.agent_classes.keys():
                    print(f"  Framework: {framework.upper()}", end=" ")
                    
                    result = self.run_single_experiment(framework, task, 1)
                    results.append(result)
                    
                    status = "✓" if result.success else "✗"
                    print(f"- {status}")
        
        return results

# Initialize experiment runner
runner = ExperimentRunner()
print("\nTask runner ready for experiments!")

## 6. Implement Evaluation and Logging Utilities

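Utilities to convert experiment results into a DataFrame, plot success rates, token usage, latency, reasoning steps, and validation scores, and print summary statistics. Code-correctness and itinerary validation are not implemented in this cell; the experiment runner above still uses a placeholder check. Full outputs are kept in the saved results for manual interpretability scoring.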
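
The experiment runner above scores outputs with a placeholder length check. As a rough idea of what a code-correctness check could look like, the sketch below pulls the first fenced Python block out of an answer and tests whether it parses. The helper name `check_python_syntax` and its `(passed, issues)` return shape are assumptions made for illustration; they are not part of this notebook's validators.

```python
import ast
import re
from typing import List, Tuple

# Matches the body of a fenced block, with or without a "python" tag.
_CODE_BLOCK = re.compile(r"`{3}(?:python)?[ \t]*\n(.*?)`{3}", re.DOTALL)


def check_python_syntax(answer: str) -> Tuple[bool, List[str]]:
    """Return (passed, issues) for the first Python block found in `answer`.

    Hypothetical helper: it only checks that the code parses, not that it
    behaves correctly.
    """
    match = _CODE_BLOCK.search(answer)
    code = match.group(1) if match else answer  # fall back to the raw text
    try:
        ast.parse(code)
    except SyntaxError as exc:
        return False, [f"Syntax error on line {exc.lineno}: {exc.msg}"]
    return True, []
```

A parse check like this catches only malformed code; verifying Game of Life behavior would still need the generated code to be executed against known patterns.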

In [None]:
class ResultAnalyzer:
    """Analyzes and visualizes experiment results."""
    
    @staticmethod
    def results_to_dataframe(results: List[ExperimentResult]) -> pd.DataFrame:
        """Convert results to pandas DataFrame for analysis."""
        data = []
        for result in results:
            data.append({
                'timestamp': result.timestamp,
                'framework': result.framework,
                'task_id': result.task_id,
                'task_type': result.task_type,
                'run_number': result.run_number,
                'success': result.success,
                'tokens_used': result.tokens_used,
                'execution_time': result.execution_time,
                'memory_usage': result.memory_usage,
                'reasoning_steps': result.reasoning_steps,
                'validation_score': result.validation_score,
                'validation_passed': result.validation_passed,
                'error_message': result.error_message,
                'answer_length': len(result.final_answer),
                'steps_count': len(result.intermediate_steps)
            })
        return pd.DataFrame(data)
    
    @staticmethod
    def plot_success_rates(df: pd.DataFrame):
        """Plot success rates by framework."""
        fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))
        
        # Success rate by framework
        success_by_framework = df.groupby('framework')['success'].mean()
        ax1.bar(success_by_framework.index, success_by_framework.values)
        ax1.set_title('Success Rate by Framework')
        ax1.set_ylabel('Success Rate')
        ax1.set_ylim(0, 1)
        
        # Success rate by task type
        success_by_task = df.groupby('task_type')['success'].mean()
        ax2.bar(success_by_task.index, success_by_task.values)
        ax2.set_title('Success Rate by Task Type')
        ax2.set_ylabel('Success Rate')
        ax2.set_ylim(0, 1)
        ax2.tick_params(axis='x', rotation=45)
        
        plt.tight_layout()
        plt.show()
    
    @staticmethod
    def plot_performance_metrics(df: pd.DataFrame):
        """Plot performance metrics comparison."""
        fig, axes = plt.subplots(2, 2, figsize=(15, 12))
        
        # Execution time
        df.boxplot(column='execution_time', by='framework', ax=axes[0,0])
        axes[0,0].set_title('Execution Time by Framework')
        axes[0,0].set_ylabel('Time (seconds)')
        
        # Token usage
        df.boxplot(column='tokens_used', by='framework', ax=axes[0,1])
        axes[0,1].set_title('Token Usage by Framework')
        axes[0,1].set_ylabel('Tokens')
        
        # Reasoning steps
        df.boxplot(column='reasoning_steps', by='framework', ax=axes[1,0])
        axes[1,0].set_title('Reasoning Steps by Framework')
        axes[1,0].set_ylabel('Steps')
        
        # Validation scores
        df.boxplot(column='validation_score', by='framework', ax=axes[1,1])
        axes[1,1].set_title('Validation Scores by Framework')
        axes[1,1].set_ylabel('Score')
        
        fig.suptitle('')  # drop the automatic "Boxplot grouped by framework" title
        plt.tight_layout()
        plt.show()
    
    @staticmethod
    def plot_task_type_analysis(df: pd.DataFrame):
        """Plot analysis by task type and framework."""
        fig, axes = plt.subplots(1, 3, figsize=(18, 6))
        
        metrics = ['validation_score', 'execution_time', 'tokens_used']
        titles = ['Validation Score', 'Execution Time', 'Token Usage']
        
        for i, (metric, title) in enumerate(zip(metrics, titles)):
            pivot = df.groupby(['task_type', 'framework'])[metric].mean().unstack()
            pivot.plot(kind='bar', ax=axes[i])
            axes[i].set_title(f'{title} by Task Type and Framework')
            axes[i].set_ylabel(metric.replace('_', ' ').title())
            axes[i].tick_params(axis='x', rotation=45)
            axes[i].legend(title='Framework')
        
        plt.tight_layout()
        plt.show()
    
    @staticmethod
    def generate_summary_stats(df: pd.DataFrame) -> Dict[str, Any]:
        """Generate comprehensive summary statistics."""
        summary = {
            'total_experiments': len(df),
            'overall_success_rate': df['success'].mean(),
            'avg_execution_time': df['execution_time'].mean(),
            'avg_tokens_used': df['tokens_used'].mean(),
            'avg_validation_score': df['validation_score'].mean(),
        }
        
        # Framework-specific stats
        framework_stats = {}
        for framework in df['framework'].unique():
            framework_df = df[df['framework'] == framework]
            framework_stats[framework] = {
                'success_rate': framework_df['success'].mean(),
                'avg_execution_time': framework_df['execution_time'].mean(),
                'avg_tokens_used': framework_df['tokens_used'].mean(),
                'avg_validation_score': framework_df['validation_score'].mean(),
                'avg_reasoning_steps': framework_df['reasoning_steps'].mean(),
            }
        
        summary['framework_stats'] = framework_stats
        
        # Task type stats
        task_type_stats = {}
        for task_type in df['task_type'].unique():
            task_df = df[df['task_type'] == task_type]
            task_type_stats[task_type] = {
                'success_rate': task_df['success'].mean(),
                'avg_execution_time': task_df['execution_time'].mean(),
                'avg_tokens_used': task_df['tokens_used'].mean(),
                'avg_validation_score': task_df['validation_score'].mean(),
            }
        
        summary['task_type_stats'] = task_type_stats
        
        return summary
    
    @staticmethod
    def print_summary(summary: Dict[str, Any]):
        """Print formatted summary statistics."""
        print("=" * 60)
        print("EXPERIMENT SUMMARY")
        print("=" * 60)
        
        print(f"Total Experiments: {summary['total_experiments']}")
        print(f"Overall Success Rate: {summary['overall_success_rate']:.1%}")
        print(f"Average Validation Score: {summary['avg_validation_score']:.1f}")
        print(f"Average Execution Time: {summary['avg_execution_time']:.2f}s")
        print(f"Average Tokens Used: {summary['avg_tokens_used']:.0f}")
        
        print("\nFRAMEWORK COMPARISON:")
        print("-" * 40)
        for framework, stats in summary['framework_stats'].items():
            print(f"{framework.upper()}:")
            print(f"  Success Rate: {stats['success_rate']:.1%}")
            print(f"  Avg Score: {stats['avg_validation_score']:.1f}")
            print(f"  Avg Time: {stats['avg_execution_time']:.2f}s")
            print(f"  Avg Tokens: {stats['avg_tokens_used']:.0f}")
            print(f"  Avg Steps: {stats['avg_reasoning_steps']:.1f}")
        
        print("\nTASK TYPE PERFORMANCE:")
        print("-" * 40)
        for task_type, stats in summary['task_type_stats'].items():
            print(f"{task_type.replace('_', ' ').title()}:")
            print(f"  Success Rate: {stats['success_rate']:.1%}")
            print(f"  Avg Score: {stats['avg_validation_score']:.1f}")
            print(f"  Avg Time: {stats['avg_execution_time']:.2f}s")
        
        print("=" * 60)

# Initialize analyzer
analyzer = ResultAnalyzer()
print("Result analyzer ready for data analysis!")

## 7. Run Experiments and Collect Results

Run the experiment loop (a quick test by default) and save results and summaries to the `results/` directory for further analysis.

**Note:** Before running experiments, make sure to:
1. Set up your API keys in the `.env` file
2. Configure your preferred LLM model
3. Adjust the `runs_per_task` parameter as needed

In [None]:
# Check configuration before running experiments
print("Configuration Check:")
print(f"  GOOGLE_API_KEY: {'✓ Set' if os.getenv('GOOGLE_API_KEY') else '✗ Not set'}")
print(f"  OPENAI_API_KEY: {'✓ Set' if os.getenv('OPENAI_API_KEY') else '✗ Not set'}")
print(f"  MISTRAL_API_KEY: {'✓ Set' if os.getenv('MISTRAL_API_KEY') else '✗ Not set'}")
print(f"  Available frameworks: {list(runner.agent_classes.keys())}")
print(f"  Total tasks: {sum(len(tasks) for tasks in runner.all_tasks.values())}")

# Create results directory if it doesn't exist
results_dir = Path("results")
results_dir.mkdir(exist_ok=True)
print(f"  Results directory: {results_dir.absolute()}")

print("\n⚠️  Make sure at least one API key is set before running experiments!")

In [None]:
# Run a quick test to verify everything works
print("Running quick test experiment...")
print("This will test each framework on one task from each category.")
print("(Set DEMO_MODE=True to use mock results instead of real API calls)")

DEMO_MODE = True  # Set to False to use real API calls

if DEMO_MODE:
    print("\n🔸 DEMO MODE: Using mock results (no API calls)")
    
    # Create mock results for demonstration
    demo_results = []
    frameworks = ['react', 'cot', 'tot']
    
    for task_type, tasks in runner.all_tasks.items():
        task = tasks[0]  # First task of each type
        for framework in frameworks:
            # Generate realistic mock metrics
            base_time = {'react': 2.5, 'cot': 1.8, 'tot': 3.2}[framework]
            base_tokens = {'react': 850, 'cot': 650, 'tot': 1200}[framework]
            base_steps = {'react': 6, 'cot': 5, 'tot': 8}[framework]
            
            result = ExperimentResult(
                timestamp=datetime.now().isoformat(),
                framework=framework,
                task_id=task.id,
                task_type=task.task_type,
                run_number=1,
                success=True,
                tokens_used=base_tokens + np.random.randint(-100, 200),
                execution_time=base_time + np.random.uniform(-0.5, 1.0),
                memory_usage=np.random.uniform(5, 15),
                reasoning_steps=base_steps + np.random.randint(-2, 3),
                final_answer=f"Mock {framework} solution for {task.title}",
                intermediate_steps=[f"Step {i+1}: Mock reasoning step" for i in range(base_steps)],
                validation_score=np.random.uniform(60, 95),
                validation_passed=True,
                validation_issues=[],
                error_message=None
            )
            demo_results.append(result)
    
    quick_results = demo_results
    print(f"Generated {len(quick_results)} mock results")
    
else:
    print("\n🔸 LIVE MODE: Making real API calls")
    # Requires at least one valid API key; real calls may incur cost.
    quick_results = runner.run_quick_test()

print(f"\nQuick test completed! Generated {len(quick_results)} results.")

In [None]:
# Convert results to DataFrame for analysis
df = analyzer.results_to_dataframe(quick_results)

print("Results DataFrame Info:")
print(f"Shape: {df.shape}")
print(f"Frameworks: {df['framework'].unique()}")
print(f"Task types: {df['task_type'].unique()}")
print(f"Success rate: {df['success'].mean():.1%}")

# Display first few rows
print("\nFirst 5 results:")
display(df[['framework', 'task_type', 'success', 'validation_score', 'execution_time', 'tokens_used']].head())

In [None]:
# Generate visualizations
print("Generating performance analysis plots...")

# Success rates
analyzer.plot_success_rates(df)

In [None]:
# Performance metrics comparison
analyzer.plot_performance_metrics(df)

In [None]:
# Task type analysis
analyzer.plot_task_type_analysis(df)

In [None]:
# Generate and display summary statistics
summary = analyzer.generate_summary_stats(df)
analyzer.print_summary(summary)

In [None]:
# Save results to files
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")

# Save DataFrame to CSV
csv_file = results_dir / f"experiment_results_{timestamp}.csv"
df.to_csv(csv_file, index=False)
print(f"Results saved to: {csv_file}")

# Save summary statistics to JSON
summary_file = results_dir / f"summary_stats_{timestamp}.json"
with open(summary_file, 'w') as f:
    json.dump(summary, f, indent=2)
print(f"Summary saved to: {summary_file}")

# Save detailed results to JSON
detailed_file = results_dir / f"detailed_results_{timestamp}.json"
detailed_data = [asdict(result) for result in quick_results]
with open(detailed_file, 'w') as f:
    json.dump(detailed_data, f, indent=2)
print(f"Detailed results saved to: {detailed_file}")

print("\n✅ Analysis complete! Files saved to results/ directory.")

## Next Steps

### Running Full Experiments

To run the complete experiment with all tasks and multiple runs:

1. **Set up API keys** in your `.env` file:
   ```
   GOOGLE_API_KEY=your_key_here
   OPENAI_API_KEY=your_key_here
   MISTRAL_API_KEY=your_key_here
   ```

2. **Configure experiment parameters**:
   - Model: Choose from available models (gemini-2.0-flash-exp, gpt-4, etc.)
   - Temperature: 0.3 (recommended for consistency)
   - Runs per task: 3 (to measure run-to-run consistency)

3. **Run full experiment**:
   ```python
   # Set DEMO_MODE = False in the experiment cell above
   # OR use the command-line runner:
   # python run_experiment.py
   ```

4. **Use Streamlit dashboard**:
   ```bash
   streamlit run streamlit_app.py
   ```
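
The full run differs from the quick test mainly in how much it iterates: every framework on every task, repeated `runs_per_task` times. The sketch below enumerates that plan without calling any model. `build_experiment_plan` is a hypothetical helper and the task IDs and task-type keys are placeholders, so treat it as an illustration of the loop shape rather than the notebook's runner.

```python
from itertools import product
from typing import Dict, Iterator, List, Tuple


def build_experiment_plan(
    frameworks: List[str],
    tasks_by_type: Dict[str, List[str]],
    runs_per_task: int = 3,
) -> Iterator[Tuple[str, str, int]]:
    """Yield (framework, task_id, run_number) for every planned run."""
    all_task_ids = [tid for ids in tasks_by_type.values() for tid in ids]
    for task_id, framework, run in product(
        all_task_ids, frameworks, range(1, runs_per_task + 1)
    ):
        yield framework, task_id, run


# Placeholder task IDs; the real ones come from the task generator.
plan = list(build_experiment_plan(
    ["react", "cot", "tot"],
    {"code_generation": ["code_1"], "itinerary_planning": ["trip_1"]},
    runs_per_task=3,
))
print(len(plan))  # 2 tasks x 3 frameworks x 3 runs = 18
```

In a live run, each tuple would map to one call of `runner.run_single_experiment(framework, task, run_number)`, with the task object looked up by its ID.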

### Analysis Features

This notebook provides:
- ✅ **Framework Comparison**: ReAct vs CoT vs ToT
- ✅ **Task Type Analysis**: Code generation, itinerary planning, procedure structuring  
- ✅ **Performance Metrics**: Token usage, execution time, memory usage
- ✅ **Success Rates**: Validation scores and pass/fail rates
- ✅ **Visualization**: Charts and graphs for easy comparison
- ✅ **Statistical Analysis**: Summary statistics and trends
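
Because each task is meant to be run several times, one statistic worth adding is the spread of scores across repeated runs. The sketch below groups scores by framework and task and reports the mean and sample standard deviation. `score_consistency` is a hypothetical helper, and the scores are made up purely to show the calculation.

```python
from collections import defaultdict
from statistics import mean, stdev
from typing import Dict, List, Tuple


def score_consistency(
    rows: List[Tuple[str, str, float]],
) -> Dict[Tuple[str, str], Tuple[float, float]]:
    """Map (framework, task_id) to (mean, sample std dev) of its scores."""
    groups: Dict[Tuple[str, str], List[float]] = defaultdict(list)
    for framework, task_id, score in rows:
        groups[(framework, task_id)].append(score)
    return {
        # A single run has no spread, so report 0.0 instead of raising.
        key: (mean(scores), stdev(scores) if len(scores) > 1 else 0.0)
        for key, scores in groups.items()
    }


# Made-up scores for illustration only.
rows = [
    ("cot", "code_1", 70.0), ("cot", "code_1", 80.0), ("cot", "code_1", 90.0),
    ("tot", "code_1", 85.0), ("tot", "code_1", 85.0), ("tot", "code_1", 85.0),
]
stats = score_consistency(rows)
```

With the DataFrame built in this notebook, a comparable figure should be available from `df.groupby(['framework', 'task_id'])['validation_score'].std()`.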

### Extending the Framework

To add new reasoning frameworks:
1. Create a new agent class inheriting from `BaseAgent`
2. Implement the `get_framework_prompt()` method
3. Add extraction methods for reasoning steps and final answers
4. Register in the `ExperimentRunner`
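
As a rough illustration of these steps, the sketch below defines a new framework against a stand-in base class. The real `BaseAgent` is defined earlier in the notebook and may expect a different interface; the stand-in `BaseAgentSketch`, the `DraftReviseAgent` name, and the `extract_final_answer` method are all assumptions made so the example runs on its own.

```python
from abc import ABC, abstractmethod


class BaseAgentSketch(ABC):
    """Stand-in for the notebook's BaseAgent (interface assumed)."""

    def __init__(self, llm=None):
        self.llm = llm  # the runner constructs agents as agent_class(llm)

    @abstractmethod
    def get_framework_prompt(self, task_prompt: str) -> str:
        """Wrap the task in framework-specific instructions."""


class DraftReviseAgent(BaseAgentSketch):
    """Hypothetical framework: draft an answer, then revise it once."""

    def get_framework_prompt(self, task_prompt: str) -> str:
        return (
            "First write a draft answer, then critique and revise it.\n"
            "End with a line starting 'Final Answer:'.\n\n"
            f"Task: {task_prompt}"
        )

    @staticmethod
    def extract_final_answer(response: str) -> str:
        # Keep everything after the last 'Final Answer:' marker, if present.
        marker = "Final Answer:"
        _, found, tail = response.rpartition(marker)
        return tail.strip() if found else response.strip()


# Registration mirrors the agent_classes dict used by ExperimentRunner.
agent_classes = {"draft_revise": DraftReviseAgent}
```

Keeping the answer marker in the prompt and the extraction method in the same class makes it harder for the two to drift apart when the prompt is edited.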

To add new task types:
1. Define tasks in `TaskGenerator`
2. Create validation logic in `TaskValidator`
3. Update visualization code as needed