# Agent Evaluation and Benchmarking Analysis

## 1. Introduction

This notebook analyzes the performance metrics produced by the `experiments/run_benchmarks.py` script. We compare the **Optimized Agent (PCA+Ensemble)** against the **Baseline Agent (Shallow MLP)** on four metrics: accuracy, training time, inference time, and peak memory.

## 2. Setup and Data Loading

We rely on the `final_benchmarks.json` file created by the main evaluation script.

In [1]:
# CODE CELL 1: Setup and Data Loading
import json

import matplotlib.pyplot as plt
import pandas as pd
from tabulate import tabulate

# The path is relative, so run the notebook from the repository root
RESULTS_PATH = 'experiments/results/final_benchmarks.json'

try:
    with open(RESULTS_PATH, 'r') as f:
        data = json.load(f)

    df = pd.DataFrame(data)

    # Coerce metric columns to numeric; non-numeric entries such as "N/A" become NaN
    for col in ['Accuracy', 'Train Time (s)', 'Predict Time (s)', 'Peak Memory (MB)']:
        df[col] = pd.to_numeric(df[col], errors='coerce')

    print(f"Successfully loaded {len(df)} records from {RESULTS_PATH}")
    print("\nDataFrame Head:")
    print(df.head())

except FileNotFoundError:
    print(f"ERROR: Results file not found at {RESULTS_PATH}.")
    print("Please run 'python -m experiments.run_benchmarks' first.")
    df = None
except Exception as e:
    print(f"An error occurred during data loading: {e}")
    df = None

ERROR: Results file not found at experiments/results/final_benchmarks.json.
Please run 'python -m experiments.run_benchmarks' first.
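
The loading cell passes the parsed JSON straight to `pd.DataFrame`, so a record that lacks a metric only shows up later as a NaN or a `KeyError`. Below is a minimal sketch of an up-front check. It assumes the results file holds a list of per-agent records, a layout this notebook does not show. `missing_columns` is a hypothetical helper, and the example record is made up for illustration.

```python
# Column names taken from the loading and table cells of this notebook.
REQUIRED_COLUMNS = [
    'Agent', 'Accuracy', 'Train Time (s)',
    'Predict Time (s)', 'Models', 'Peak Memory (MB)',
]


def missing_columns(records):
    """Map each record's index to the required keys it lacks.

    Records that contain every required key are left out of the result.
    Assumes `records` is a list of dicts (an assumed file layout).
    """
    problems = {}
    for index, record in enumerate(records):
        missing = [col for col in REQUIRED_COLUMNS if col not in record]
        if missing:
            problems[index] = missing
    return problems


# Hypothetical record for illustration only; these are not benchmark results.
example_records = [
    {'Agent': 'Baseline', 'Accuracy': 0.5, 'Train Time (s)': 1.0,
     'Predict Time (s)': 0.1, 'Models': 1},
]
print(missing_columns(example_records))
```

Calling the helper on `data` right after `json.load` would name the missing field before any conversion or plotting runs.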


## 3. Comparative Analysis

### 3.1 Performance Table

The table below compares the two agents on each recorded metric.

In [2]:
# CODE CELL 2: Display Comparison Table
if df is not None:
    # Select and format columns for clean display
    display_df = df[['Agent', 'Accuracy', 'Train Time (s)', 'Predict Time (s)', 'Models', 'Peak Memory (MB)']].copy()
    display_df['Accuracy'] = display_df['Accuracy'].map('{:.4f}'.format)

    print("--- Comparative Benchmark Table ---")
    print(tabulate(display_df, headers='keys', tablefmt='fancy_grid', showindex=False))
else:
    print("Cannot display table: Data not loaded.")

Cannot display table: Data not loaded.


### 3.2 Visualization of Time and Accuracy

We visualize the key trade-offs between speed and performance.

In [3]:
# CODE CELL 3: Visualization - Accuracy and Total Time

if df is not None:
    # Drop rows with missing metrics; copy so the new column added below
    # does not trigger pandas' SettingWithCopyWarning
    plot_df = df.dropna(subset=['Accuracy', 'Train Time (s)', 'Predict Time (s)']).copy()

    if not plot_df.empty:
        fig, axes = plt.subplots(1, 2, figsize=(14, 6))

        # --- Plot 1: Accuracy ---
        axes[0].bar(plot_df['Agent'], plot_df['Accuracy'], color=['skyblue', 'salmon'])
        axes[0].set_title('Agent Accuracy Comparison', fontsize=14)
        axes[0].set_ylabel('Accuracy', fontsize=12)
        axes[0].set_ylim(plot_df['Accuracy'].min() * 0.98, 1.0) # Zoom in on high accuracy
        axes[0].tick_params(axis='x', rotation=15)

        # --- Plot 2: Total Time ---
        # Calculate Total Time (Train + Predict)
        plot_df['Total Time (s)'] = plot_df['Train Time (s)'] + plot_df['Predict Time (s)']

        axes[1].bar(plot_df['Agent'], plot_df['Total Time (s)'], color=['lightgreen', 'darkorange'])
        axes[1].set_title('Total Execution Time (Train + Predict)', fontsize=14)
        axes[1].set_ylabel('Time (seconds)', fontsize=12)
        axes[1].tick_params(axis='x', rotation=15)

        plt.tight_layout()
        plt.show()
        print("Visualization complete.")
    else:
        print("Cannot plot: Numerical data for plotting is missing.")
else:
    print("Cannot plot: Data not loaded.")

Cannot plot: Data not loaded.
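
The speed-versus-accuracy trade-off in this section can also be reduced to two numbers: the absolute accuracy gain and the ratio of total times. The sketch below is one way to compute them from two benchmark records. `summarize_tradeoff` is a hypothetical helper, and the values are placeholders rather than measured results.

```python
def summarize_tradeoff(baseline, optimized):
    """Return the accuracy gain and total-time ratio of optimized vs. baseline.

    Each argument is a dict with the 'Accuracy', 'Train Time (s)' and
    'Predict Time (s)' keys used elsewhere in this notebook.
    """
    def total_time(record):
        return record['Train Time (s)'] + record['Predict Time (s)']

    return {
        'accuracy_gain': optimized['Accuracy'] - baseline['Accuracy'],
        'time_ratio': total_time(optimized) / total_time(baseline),
    }


# Placeholder values for illustration only; these are not benchmark results.
baseline = {'Accuracy': 0.5, 'Train Time (s)': 1.5, 'Predict Time (s)': 0.5}
optimized = {'Accuracy': 0.75, 'Train Time (s)': 5.0, 'Predict Time (s)': 1.0}

summary = summarize_tradeoff(baseline, optimized)
print(summary)
```

A `time_ratio` above 1 means the optimized agent is slower overall. The helper assumes the baseline's total time is non-zero.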


## 4. Conclusion

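The expected trade-off is that the Optimized Agent consumes more resources (time and memory) but delivers significantly higher accuracy, due to PCA's efficiency and the power of the deep ensemble. The outputs recorded above do not yet confirm this, because `final_benchmarks.json` was not found when the cells ran. Rerun the notebook after generating the file. If the results show the trade-off and the extra time fits the competition's limits, they justify submitting the Optimized Agent to the time-constrained ML-Arena competition.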