# AgentBench Jolt Analysis Visualization

This notebook visualizes the results of the AgentBench jolt detection analysis. It loads the synthetic AgentBench data, processes it, and creates publication-ready figures showing the performance trajectory and detected jolts.

In [1]:
import os
import sys
import numpy as np
import pandas as pd
from datetime import datetime
import matplotlib.pyplot as plt
import seaborn as sns

# Add parent directory to path to import our modules
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

# Import our modules
from agentbench_jolt.analyzer import (
    preprocess_agentbench_data,
    detect_agentbench_jolt,
    plot_agentbench_jolt
)

# Set plot style
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_context("paper", font_scale=1.5)
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['savefig.dpi'] = 300
plt.rcParams['font.family'] = 'serif'

## Load and Preprocess Data

First, we'll load the synthetic AgentBench data and preprocess it for analysis.

In [2]:
# Load synthetic data
data_path = "../data/synthetic_agentbench_data.csv"
try:
    df = pd.read_csv(data_path)
    print(f"Loaded data with {len(df)} entries")
    print(f"Unique dates: {df['date'].nunique()}")
    print(f"Unique models: {df['model'].nunique()}")

    # Show the first few rows; a bare df.head() inside a try block is not rendered
    print(df.head())
except FileNotFoundError:
    print(f"Data file not found: {data_path}")
    print("Please run the run_agentbench_analysis.py script first to generate the data.")

ParserError: Error tokenizing data. C error: Expected 1 fields in line 7, saw 8
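
The `ParserError` above ("Expected 1 fields in line 7, saw 8") is what pandas raises when early lines of a file contain fewer delimited fields than later ones, for example when free-text lines precede the real header row. Whether that is the cause here cannot be determined from this notebook. The following is a minimal sketch under that assumption; `read_csv_skipping_preamble` is a hypothetical helper, not part of `agentbench_jolt`, and its header heuristic is deliberately naive.

```python
import io

import pandas as pd


def read_csv_skipping_preamble(source, sep=","):
    """Read a CSV whose header may be preceded by non-tabular lines.

    Hypothetical helper: treats the first line with the largest number of
    separators as the header row and skips everything before it. Quoted
    fields containing the separator can fool this heuristic.
    """
    if isinstance(source, str):
        # A string is interpreted as a file path
        with open(source, "r", encoding="utf-8") as handle:
            text = handle.read()
    else:
        # Otherwise assume a file-like object
        text = source.read()

    lines = text.splitlines()
    if not lines:
        raise ValueError("empty input")

    sep_counts = [line.count(sep) for line in lines]
    header_row = sep_counts.index(max(sep_counts))  # first line with most separators
    return pd.read_csv(io.StringIO(text), sep=sep, skiprows=header_row)
```

If the file does start with a preamble, `df = read_csv_skipping_preamble(data_path)` could stand in for the `pd.read_csv(data_path)` call; inspecting the first lines of the file by hand is the safer first step.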


In [None]:
# Preprocess data
processed_df = preprocess_agentbench_data(
    df,
    metric='median_score',
    aggregation='max'
)

# Display preprocessed data
processed_df.head()
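
The implementation of `preprocess_agentbench_data` is not shown in this notebook. As a rough sketch of what `metric='median_score'` with `aggregation='max'` plausibly means, the example below assumes the function reduces the per-model rows to one value per date, the same reduction a later plotting cell performs with `groupby('date')`. The name `frontier_by_date` and the toy data are hypothetical.

```python
import pandas as pd


def frontier_by_date(df, metric="median_score", aggregation="max"):
    """Collapse per-model rows into one aggregated value per date (assumed behavior)."""
    out = df.copy()
    out["date"] = pd.to_datetime(out["date"])
    # One row per date, holding the aggregated metric across all models
    series = out.groupby("date")[metric].agg(aggregation)
    return series.sort_index().reset_index()


# Toy input: two models on the first date, one on the second
toy = pd.DataFrame({
    "date": ["2023-02-01", "2023-01-01", "2023-01-01"],
    "model": ["b", "a", "b"],
    "median_score": [0.6, 0.4, 0.5],
})
print(frontier_by_date(toy))
```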

## Detect Jolts

Now we'll apply the jolt detection algorithm to identify super-exponential acceleration in the AgentBench performance data.

In [None]:
# Detect jolt
jolt_detected, jolt_info, derivs = detect_agentbench_jolt(
    processed_df,
    hybrid_final_threshold=0.6,
    hybrid_peak_norm_factor=15.0,
    hybrid_duration_norm_multiplier=2.0
)

print(f"Jolt detected: {jolt_detected}")
if jolt_detected:
    print(f"Jolt date: {jolt_info.get('jolt_date', 'Unknown')}")
    print(f"Jolt score: {jolt_info.get('score', 0):.2f}")
    print(f"Peak score component: {jolt_info.get('components', {}).get('peak_score', 0):.2f}")
    print(f"Pattern score component: {jolt_info.get('components', {}).get('pattern_score', 0):.2f}")
    print(f"Duration score component: {jolt_info.get('components', {}).get('duration_score', 0):.2f}")
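
How `detect_agentbench_jolt` scores a jolt internally is not shown here. For intuition only: exponential growth is a straight line in log space, so super-exponential growth appears as a positive second difference of the log series. The sketch below illustrates that idea with a hypothetical function, assuming evenly spaced, strictly positive values; it is not the detector's actual algorithm.

```python
import numpy as np


def log_second_difference(values):
    """Second difference of log(values); positive entries indicate a rising growth rate."""
    values = np.asarray(values, dtype=float)
    if values.size < 3 or np.any(values <= 0):
        raise ValueError("need at least three strictly positive values")
    return np.diff(np.log(values), n=2)


steps = np.arange(6)
exponential = 2.0 ** steps               # constant growth rate
super_exponential = 2.0 ** (steps ** 2)  # growth rate itself increases

print(log_second_difference(exponential))        # approximately zero everywhere
print(log_second_difference(super_exponential))  # constant and positive
```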

## Visualize Results

Let's create publication-ready visualizations of the AgentBench performance trajectory and the detected jolt.

In [None]:
# Create the main visualization
fig = plot_agentbench_jolt(
    processed_df, 
    derivs, 
    jolt_info,
    title="AgentBench Performance Jolt Analysis (Synthetic Data)"
)

plt.show()

## Create Additional Visualizations

The following cells explore the data further: the top models over time, the monthly distribution of scores, and the maximum score per task.

In [3]:
# Plot the performance of all models over time
plt.figure(figsize=(14, 8))

# Convert date to datetime if it's not already
if not pd.api.types.is_datetime64_dtype(df['date']):
    df['date'] = pd.to_datetime(df['date'])

# Get the top 5 models by final performance
final_date = df['date'].max()
top_models = df[df['date'] == final_date].sort_values('median_score', ascending=False)['model'].head(5).tolist()

# Plot the top models
for model in top_models:
    model_data = df[df['model'] == model].sort_values('date')
    plt.plot(model_data['date'], model_data['median_score'], marker='o', linewidth=2, label=model)

# Plot the maximum performance for each date
max_by_date = df.groupby('date')['median_score'].max().reset_index()
plt.plot(max_by_date['date'], max_by_date['median_score'], color='black', linewidth=3, 
         linestyle='--', label='Maximum Performance')

# Add jolt marker if detected
if jolt_detected and 'jolt_date' in jolt_info:
    jolt_date = datetime.strptime(jolt_info['jolt_date'], '%Y-%m-%d')
    jolt_y = max_by_date[max_by_date['date'] == jolt_date]['median_score'].values
    if len(jolt_y) > 0:
        plt.axvline(x=jolt_date, color='red', linestyle='--', alpha=0.7, label='Jolt Point')
        plt.scatter([jolt_date], [jolt_y[0]], color='red', s=150, zorder=5)

plt.title('Top 5 Models Performance Over Time', fontsize=16)
plt.xlabel('Date', fontsize=14)
plt.ylabel('Median Score', fontsize=14)
plt.legend(title='Model', title_fontsize=12, fontsize=10, loc='best')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

NameError: name 'df' is not defined

<Figure size 1400x800 with 0 Axes>

In [4]:
# Plot the distribution of performance scores by month
plt.figure(figsize=(14, 8))

# Extract month from date
df['month'] = df['date'].dt.strftime('%Y-%m')

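# Create violin plot with months in chronological order
month_order = sorted(df['month'].unique())
sns.violinplot(x='month', y='median_score', data=df, order=month_order, inner='quartile')

# Add jolt marker if detected and its month is present in the data
if jolt_detected and 'jolt_date' in jolt_info:
    jolt_date = datetime.strptime(jolt_info['jolt_date'], '%Y-%m-%d')
    jolt_month = jolt_date.strftime('%Y-%m')
    if jolt_month in month_order:
        # Categorical x positions are 0-based indices into month_order
        plt.axvline(x=month_order.index(jolt_month), color='red',
                    linestyle='--', alpha=0.7, label='Jolt Month')
        plt.legend(loc='best')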

plt.title('Distribution of Model Performance by Month', fontsize=16)
plt.xlabel('Month', fontsize=14)
plt.ylabel('Median Score', fontsize=14)
plt.xticks(rotation=45)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

NameError: name 'df' is not defined

<Figure size 1400x800 with 0 Axes>

In [5]:
# Plot the task-specific performance over time
plt.figure(figsize=(14, 10))

# Get the maximum performance for each task by date
task_columns = ['os_score', 'db_score', 'kg_score', 'web_score', 'code_score']
task_names = ['Operating System', 'Database', 'Knowledge Graph', 'Web Browsing', 'Coding']

for col, name in zip(task_columns, task_names):
    max_by_date = df.groupby('date')[col].max().reset_index()
    plt.plot(max_by_date['date'], max_by_date[col], marker='o', linewidth=2, label=name)

# Add jolt marker if detected
if jolt_detected and 'jolt_date' in jolt_info:
    jolt_date = datetime.strptime(jolt_info['jolt_date'], '%Y-%m-%d')
    plt.axvline(x=jolt_date, color='red', linestyle='--', alpha=0.7, label='Jolt Point')

plt.title('Maximum Task-Specific Performance Over Time', fontsize=16)
plt.xlabel('Date', fontsize=14)
plt.ylabel('Score', fontsize=14)
plt.legend(title='Task', title_fontsize=12, fontsize=10, loc='best')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

NameError: name 'df' is not defined

<Figure size 1400x1000 with 0 Axes>

## Save Results

Finally, let's save the main visualization for inclusion in the paper.

In [6]:
# Create results directory if it doesn't exist
os.makedirs("../results/figures", exist_ok=True)

# Save the main visualization
fig = plot_agentbench_jolt(
    processed_df, 
    derivs, 
    jolt_info,
    title="AgentBench Performance Jolt Analysis",
    save_path="../results/figures/agentbench_jolt_analysis.png"
)

print("Results saved to ../results/figures/agentbench_jolt_analysis.png")

NameError: name 'processed_df' is not defined
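
The cell above saves only the figure. To keep the numeric results alongside it, the detection summary could also be written to JSON. This is a sketch, assuming `jolt_info` is a plain dictionary (as the `.get` calls earlier suggest) that may hold NumPy or timestamp values; `save_jolt_summary` and the output file name are hypothetical.

```python
import json
import os


def save_jolt_summary(jolt_detected, jolt_info, path):
    """Write the detection flag and jolt_info dictionary to a JSON file."""
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    summary = {"jolt_detected": bool(jolt_detected), "jolt_info": jolt_info}
    with open(path, "w", encoding="utf-8") as handle:
        # default=str converts values json cannot serialize (e.g. timestamps) to strings
        json.dump(summary, handle, indent=2, default=str)
    return path
```

A call such as `save_jolt_summary(jolt_detected, jolt_info, "../results/agentbench_jolt_summary.json")` would place the summary next to the figures directory.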

## Conclusion

This analysis demonstrates the application of our jolt detection methodology to AgentBench performance data. Because the data is synthetic, it shows how the pipeline identifies periods of super-exponential acceleration in AI agent capabilities, not that such acceleration has occurred. Detecting such periods has important implications for understanding the pace of AI progress and potential governance challenges.

The synthetic data used in this analysis serves as a placeholder until historical AgentBench data becomes available. The same methodology can be applied to real data when it is obtained.