# Benchmarking Models: Comprehensive Performance Comparison

**Author:** [Your Name]

**Date:** [Current Date]

## Overview

This notebook is part of the VSF-Med (Vulnerability Scoring Framework for Medical Vision-Language Models) research project. It performs a comprehensive benchmarking analysis across all models evaluated in the study.

### Purpose
- Compare performance metrics across different models (GPT-4o, Claude, CheXagent)
- Analyze vulnerability patterns across multiple attack vectors
- Generate comprehensive cross-model performance comparisons
- Identify strengths and weaknesses of each model
- Create publication-quality visualizations for research presentation

### Workflow
1. Fetch evaluation data for all models from the database
2. Calculate comparative metrics for model performance
3. Analyze performance across different vulnerability dimensions
4. Generate visualizations showing cross-model comparisons
5. Identify patterns and insights about model vulnerabilities

### Models Compared
- **GPT-4o Vision** (OpenAI): General-purpose multimodal model
- **Claude Opus** (Anthropic): General-purpose multimodal model
- **CheXagent-8b** (StanfordAIMI): Specialized medical imaging model
- **Additional models** (if available): Third-party or open models

## 1. Environment Setup

### 1.1 Install Required Libraries

First, we'll install all necessary libraries for data analysis and visualization.

In [None]:
# Install required packages
%pip install pandas numpy matplotlib seaborn sqlalchemy psycopg2-binary python-dotenv pyyaml scikit-learn plotly

In [None]:
# Import required libraries
import os
import sys
import yaml
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from dotenv import load_dotenv
from sqlalchemy import create_engine, text
from datetime import datetime
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Add the project root (the notebook's parent directory) to the path so modules under src/ can be imported
parent_dir = os.path.abspath(os.path.join(os.getcwd(), os.pardir))
sys.path.append(parent_dir)

# Load environment variables from .env file
load_dotenv()

# Set up plotting defaults
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('viridis')

# Check platform for environment-specific settings
import platform
operating_system = platform.system()
print(f"Operating System: {operating_system}")

### 1.2 Configuration Setup

Load configuration from YAML file and set up environment-specific settings.

In [None]:
# Load configuration
config_path = os.path.join(parent_dir, 'src', 'config', 'default_config.yaml')
with open(config_path, 'r') as f:
    config = yaml.safe_load(f)

# Configure paths based on operating system
if operating_system == 'Darwin':  # macOS
    output_dir = os.path.expanduser('~/data/vsf-med/output')
elif operating_system == 'Linux':
    output_dir = '/data/vsf-med/output'
else:  # Windows or other
    output_dir = config['paths']['output_dir'].replace('${HOME}', os.path.expanduser('~'))

# Create output directory if it doesn't exist
figures_dir = os.path.join(output_dir, 'figures')
os.makedirs(figures_dir, exist_ok=True)

# Set up database connection
db_config = config['database']
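db_password = os.environ.get('DB_PASSWORD', '')
# URL-encode the password so special characters (e.g. '@', '/') do not corrupt the connection URL
from urllib.parse import quote_plus
CONNECTION_STRING = f"postgresql://{db_config['user']}:{quote_plus(db_password)}@{db_config['host']}:{db_config['port']}/{db_config['database']}"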
engine = create_engine(CONNECTION_STRING)

# Get evaluation dimensions and risk classifications
dimensions = config['evaluation']['dimensions']
risk_classifications = config['evaluation']['risk_classifications']

print(f"Output directory: {output_dir}")
print(f"Figures directory: {figures_dir}")
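
The configuration cell reads several nested keys from `default_config.yaml`, and a missing key only surfaces as a bare `KeyError` partway through the cell. The following is a minimal sketch of an up-front check, not part of the project code: `REQUIRED_KEYS` and `missing_config_keys` are hypothetical names, the key paths simply mirror the lookups made above, and the sample values are placeholders.

```python
# Hypothetical helper: report which nested keys the notebook expects but the
# loaded config lacks. The key paths mirror the lookups in the cell above.
REQUIRED_KEYS = [
    ("paths", "output_dir"),
    ("database", "user"),
    ("database", "host"),
    ("database", "port"),
    ("database", "database"),
    ("evaluation", "dimensions"),
    ("evaluation", "risk_classifications"),
]


def missing_config_keys(config, required=REQUIRED_KEYS):
    """Return dotted paths of required keys that are absent from config."""
    missing = []
    for path in required:
        node = config
        for key in path:
            if not isinstance(node, dict) or key not in node:
                missing.append(".".join(path))
                break
            node = node[key]
    return missing


# Placeholder config with the evaluation section left out on purpose
sample_config = {
    "paths": {"output_dir": "${HOME}/output"},
    "database": {"user": "u", "host": "localhost", "port": 5432, "database": "db"},
}
print(missing_config_keys(sample_config))
```

Calling `missing_config_keys(config)` right after `yaml.safe_load` and stopping when the list is non-empty would turn a mid-cell failure into a single readable message.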

## 2. Data Collection

Fetch and prepare data for all models from the database.

In [None]:
def fetch_all_evaluations():
    """
    Fetch all model evaluations from the database.
    
    Returns:
        pd.DataFrame: DataFrame containing all evaluations
    """
    query = """
    SELECT e.*, r.model_name, r.question_category, r.question, r.model_answer
    FROM mimicxp.evaluations e
    JOIN mimicxp.model_responses_r2 r ON e.model_response_id = r.id
    ORDER BY e.created_at DESC
    """
    
    with engine.connect() as conn:
        # Wrap the raw SQL in text() so it is executable under SQLAlchemy 2.x as well
        df = pd.read_sql(text(query), conn)
    
    print(f"Fetched {len(df)} evaluations from database")
    return df

def fetch_model_responses():
    """
    Fetch all model responses from the database.
    
    Returns:
        pd.DataFrame: DataFrame containing all model responses
    """
    query = """
    SELECT id, uid, question_id, question, question_category, model_name, model_answer, image_link, created_at
    FROM mimicxp.model_responses_r2
    ORDER BY created_at DESC
    """
    
    with engine.connect() as conn:
        # Wrap the raw SQL in text() so it is executable under SQLAlchemy 2.x as well
        df = pd.read_sql(text(query), conn)
    
    print(f"Fetched {len(df)} model responses from database")
    return df

# Fetch all evaluations and responses
all_evaluations = fetch_all_evaluations()
all_responses = fetch_model_responses()

In [None]:
# Prepare data for analysis
def preprocess_data(evaluations_df, responses_df):
    """
    Preprocess and prepare data for analysis.
    
    Args:
        evaluations_df (pd.DataFrame): DataFrame with evaluations
        responses_df (pd.DataFrame): DataFrame with responses
        
    Returns:
        dict: Dictionary with processed DataFrames
    """
    # Check if we have data
    if evaluations_df.empty or responses_df.empty:
        print("Warning: Empty data. Cannot proceed with analysis.")
        return {}
        
    # Add category type (text attack, visual perturbation, original)
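    def categorize_question(category):
        # NULL categories from the database arrive as None/NaN, not as strings
        if not isinstance(category, str):
            return 'other'
        if category == 'original':
            return 'baseline'
        elif category.startswith('visual_perturb'):
            return 'visual_perturbation'
        elif category.startswith('text_attack'):
            return 'text_attack'
        else:
            return 'other'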
    
    # Add category to evaluations
    evaluations_df['category_type'] = evaluations_df['question_category'].apply(categorize_question)
    
    # Add category to responses
    responses_df['category_type'] = responses_df['question_category'].apply(categorize_question)
    
    # Get response lengths
    # Treat missing answers as empty strings; len() would raise on None/NaN
    responses_df['response_length'] = responses_df['model_answer'].fillna('').astype(str).str.len()
    
    # Create maps between score columns and dimension names
    dimension_map = {
        'prompt_injection_score': 'Prompt Injection Effectiveness',
        'jailbreak_score': 'Jailbreak Success',
        'confidentiality_score': 'Confidentiality Breach',
        'misinformation_score': 'Misinformation/Malicious',
        'dos_resilience_score': 'DoS Resilience',
        'persistence_score': 'Persistence of Effects',
        'safety_bypass_score': 'Safety Bypass Success',
        'medical_impact_score': 'Medical Impact'
    }
    
    # Create list of dimension score columns
    dimension_cols = list(dimension_map.keys())
    
    # Group evaluations by model and category
    model_category_groups = evaluations_df.groupby(['model_name', 'category_type'])
    
    # Calculate average scores by model and category
    avg_scores = model_category_groups[dimension_cols + ['total_score']].mean().reset_index()
    
    # Calculate response stats by model and category
    response_stats = responses_df.groupby(['model_name', 'category_type'])['response_length'].agg(
        ['count', 'mean', 'std', 'min', 'max']
    ).reset_index()
    
    # Return processed data
    return {
        'evaluations': evaluations_df,
        'responses': responses_df,
        'avg_scores': avg_scores,
        'response_stats': response_stats,
        'dimension_map': dimension_map,
        'dimension_cols': dimension_cols
    }

# Process data
data = preprocess_data(all_evaluations, all_responses)
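
The preprocessing step collapses the fine-grained `question_category` values into four coarse category types by prefix. As a quick illustration of that mapping, here is a standalone sketch; the sample labels are hypothetical, and sending missing values to `'other'` is an assumption rather than documented project behaviour.

```python
from collections import Counter


def categorize_question(category):
    """Map a raw question_category string to a coarse category type."""
    if not isinstance(category, str):
        return "other"  # assumption: missing values fall into 'other'
    if category == "original":
        return "baseline"
    if category.startswith("visual_perturb"):
        return "visual_perturbation"
    if category.startswith("text_attack"):
        return "text_attack"
    return "other"


# Hypothetical category labels, used only to illustrate the mapping
samples = ["original", "text_attack_example", "visual_perturb_example", None]
category_types = [categorize_question(s) for s in samples]
print(category_types)

# Counting rows per coarse type is a cheap sanity check before averaging
type_counts = Counter(category_types)
print(type_counts["other"])
```

A large `'other'` count on real data would suggest category names that follow neither prefix convention and therefore drop out of the attack-specific analyses later on.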

## 3. Performance Overview

Generate a high-level overview of model performance.

In [None]:
# Create summary of model performance
if data:
    # Get unique models and count samples
    evaluations = data['evaluations']
    models = evaluations['model_name'].unique()
    
    print(f"Models evaluated: {len(models)}")
    model_counts = evaluations['model_name'].value_counts()
    
    for model, count in model_counts.items():
        print(f"  {model}: {count} evaluations")
    
    # Calculate overall average vulnerability score by model
    model_scores = evaluations.groupby('model_name')['total_score'].agg(['mean', 'std', 'min', 'max']).reset_index()
    model_scores = model_scores.sort_values(by='mean', ascending=True)
    
    print("\nOverall vulnerability scores (lower is better):")
    for _, row in model_scores.iterrows():
        print(f"  {row['model_name']}: {row['mean']:.2f} ± {row['std']:.2f} (range: {row['min']}-{row['max']})")
    
    # Create visual comparison
    plt.figure(figsize=(10, 6))
    plt.bar(model_scores['model_name'], model_scores['mean'], yerr=model_scores['std'], capsize=10, alpha=0.7)
    plt.xlabel('Model')
    plt.ylabel('Average Vulnerability Score (0-32)')
    plt.title('Overall Vulnerability Scores by Model')
    plt.xticks(rotation=45, ha='right')
    plt.grid(axis='y', alpha=0.3)
    plt.tight_layout()
    
    # Save figure
    plt.savefig(os.path.join(figures_dir, 'overall_vulnerability_scores.png'), dpi=300, bbox_inches='tight')
    plt.show()

In [None]:
# Compare performance across category types
if data:
    # Get data by category type
    avg_scores = data['avg_scores']
    dimension_cols = data['dimension_cols']
    
    # Create cross-tabulation of model vs category
    model_category_pivot = avg_scores.pivot(index='model_name', columns='category_type', values='total_score')
    
    # Display the pivot table
    print("Average vulnerability score by model and category type:")
    display(model_category_pivot)
    
    # Create bar chart comparing models across categories
    plt.figure(figsize=(12, 8))
    
    # avg_scores is already in long format, so select the columns seaborn needs
    model_category_long = avg_scores[['model_name', 'category_type', 'total_score']]
    
    # Create bar chart
    sns.barplot(data=model_category_long, x='model_name', y='total_score', hue='category_type')
    plt.title('Vulnerability Score by Model and Category Type')
    plt.xlabel('Model')
    plt.ylabel('Average Vulnerability Score (0-32)')
    plt.xticks(rotation=45, ha='right')
    plt.legend(title='Category Type')
    plt.grid(axis='y', alpha=0.3)
    plt.tight_layout()
    
    # Save figure
    plt.savefig(os.path.join(figures_dir, 'vulnerability_by_category.png'), dpi=300, bbox_inches='tight')
    plt.show()
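
The charts in this section label the total on a 0-32 scale, while the dimension-level charts use 0-4 for each of the eight dimensions. The sketch below makes that relationship explicit under one assumption that the notebook does not state outright: that `total_score` is the plain sum of the eight dimension scores. `total_from_dimensions` is a hypothetical helper, and the example row is made up.

```python
DIMENSION_COLS = [
    "prompt_injection_score", "jailbreak_score", "confidentiality_score",
    "misinformation_score", "dos_resilience_score", "persistence_score",
    "safety_bypass_score", "medical_impact_score",
]
MAX_PER_DIMENSION = 4


def total_from_dimensions(row):
    """Sum the eight dimension scores, validating each is within 0-4."""
    total = 0
    for col in DIMENSION_COLS:
        score = row[col]
        if not 0 <= score <= MAX_PER_DIMENSION:
            raise ValueError(f"{col}={score} is outside 0-{MAX_PER_DIMENSION}")
        total += score
    return total


# Made-up evaluation row for illustration: seven scores of 1 and one of 3
example_row = dict.fromkeys(DIMENSION_COLS, 1)
example_row["jailbreak_score"] = 3
example_total = total_from_dimensions(example_row)
scale_max = len(DIMENSION_COLS) * MAX_PER_DIMENSION
print(example_total, scale_max)
```

If the assumption holds, comparing this recomputed sum against the stored `total_score` is a simple consistency check on the evaluation table.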

## 4. Dimension-Level Comparison

Compare model performance across each vulnerability dimension.

In [None]:
# Create radar chart for comparing models across dimensions
if data:
    # Prepare data for radar chart
    dimension_cols = data['dimension_cols']
    dimension_map = data['dimension_map']
    evaluations = data['evaluations']
    
    # Calculate average scores by model
    model_dimension_scores = evaluations.groupby('model_name')[dimension_cols].mean()
    
    # Create radar chart
    def create_radar_chart(scores_df):
        # Number of dimensions
        N = len(dimension_cols)
        
        # Angle of each axis
        angles = np.linspace(0, 2*np.pi, N, endpoint=False).tolist()
        angles += angles[:1]  # Close the loop
        
        # Labels for each dimension
        labels = [dimension_map[col] for col in dimension_cols]
        labels += labels[:1]  # Close the loop
        
        # Create figure
        fig = plt.figure(figsize=(10, 10))
        ax = fig.add_subplot(111, polar=True)
        
        # Plot each model
        for model_name, scores in scores_df.iterrows():
            values = scores.values.tolist()
            values += values[:1]  # Close the loop
            
            ax.plot(angles, values, linewidth=2, label=model_name)
            ax.fill(angles, values, alpha=0.1)
        
        # Set axis labels; drop the duplicated closing angle so each label is drawn once
        ax.set_xticks(angles[:-1])
        ax.set_xticklabels(labels[:-1], fontsize=12)
        
        # Set y-axis limits
        ax.set_ylim(0, 4)
        plt.yticks([0, 1, 2, 3, 4], ['0', '1', '2', '3', '4'], fontsize=10)
        
        # Add legend
        plt.legend(loc='upper right', bbox_to_anchor=(0.1, 0.1))
        
        plt.title('Vulnerability Profile by Model Across Dimensions', size=15)
        plt.tight_layout()
        
        return fig
    
    # Create and save radar chart
    radar_fig = create_radar_chart(model_dimension_scores)
    radar_fig.savefig(os.path.join(figures_dir, 'model_radar_comparison.png'), dpi=300, bbox_inches='tight')
    plt.show()
    
    # Create bar chart comparing models across dimensions
    # Reshape data for plotting
    dimension_comparison = model_dimension_scores.reset_index().melt(
        id_vars=['model_name'],
        value_vars=dimension_cols,
        var_name='dimension',
        value_name='score'
    )
    
    # Map column names to readable dimension names
    dimension_comparison['dimension'] = dimension_comparison['dimension'].map(dimension_map)
    
    # Create figure
    plt.figure(figsize=(14, 10))
    sns.barplot(data=dimension_comparison, x='dimension', y='score', hue='model_name')
    plt.title('Vulnerability Score by Dimension and Model', fontsize=15)
    plt.xlabel('Vulnerability Dimension', fontsize=12)
    plt.ylabel('Average Score (0-4)', fontsize=12)
    plt.xticks(rotation=45, ha='right')
    plt.legend(title='Model')
    plt.grid(axis='y', alpha=0.3)
    plt.tight_layout()
    
    # Save figure
    plt.savefig(os.path.join(figures_dir, 'dimension_comparison.png'), dpi=300, bbox_inches='tight')
    plt.show()
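
The radar chart relies on a small piece of geometry: one evenly spaced angle per dimension, with the first angle and first value repeated so the outline closes. The following standalone sketch shows just that step using the standard library; `radar_angles` and `close_loop` are hypothetical names and the scores are made up.

```python
import math


def radar_angles(n_axes):
    """Return n_axes evenly spaced angles (radians) plus the closing angle."""
    angles = [2 * math.pi * i / n_axes for i in range(n_axes)]
    return angles + angles[:1]  # repeat the first angle to close the polygon


def close_loop(values):
    """Append the first value so the plotted line returns to its start."""
    return list(values) + list(values[:1])


angles = radar_angles(8)
values = close_loop([1.0, 2.5, 0.5, 3.0, 1.5, 2.0, 0.0, 4.0])  # made-up scores
print(len(angles), len(values))

# Tick positions should use only the unique angles, i.e. angles[:-1]
tick_degrees = [round(math.degrees(a)) for a in angles[:-1]]
print(tick_degrees)
```

The plotted line needs the closed lists (length N + 1), whereas tick positions and tick labels only need the N unique angles.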

In [None]:
# Identify key vulnerabilities for each model
if data:
    # Get data
    model_dimension_scores = data['evaluations'].groupby('model_name')[data['dimension_cols']].mean()
    dimension_map = data['dimension_map']
    
    print("Key vulnerabilities by model (highest scoring dimensions):")
    for model, scores in model_dimension_scores.iterrows():
        # Sort dimensions by score (descending)
        sorted_dimensions = scores.sort_values(ascending=False)
        
        # Display top 3 vulnerabilities
        print(f"\n{model}:")
        for dim, score in sorted_dimensions.head(3).items():
            print(f"  {dimension_map[dim]}: {score:.2f}/4")
            
    # Create heatmap of vulnerabilities
    plt.figure(figsize=(12, 8))
    ax = sns.heatmap(model_dimension_scores, annot=True, cmap='YlOrRd', vmin=0, vmax=4, fmt='.2f')
    plt.title('Vulnerability Heatmap by Model and Dimension', fontsize=15)
    plt.ylabel('Model', fontsize=12)
    plt.xlabel('Vulnerability Dimension', fontsize=12)
    
    # Fix x-axis labels
    ax.set_xticklabels([dimension_map[dim] for dim in data['dimension_cols']], rotation=45, ha='right')
    
    plt.tight_layout()
    
    # Save figure
    plt.savefig(os.path.join(figures_dir, 'vulnerability_heatmap.png'), dpi=300, bbox_inches='tight')
    plt.show()

## 5. Attack Category Effectiveness

Analyze which attack categories are most effective against each model.

In [None]:
# Analyze effectiveness of different text attack categories
if data:
    # Filter for text attack evaluations
    # Copy the filtered rows so adding a column below does not trigger SettingWithCopyWarning
    text_attacks = data['evaluations'][data['evaluations']['category_type'] == 'text_attack'].copy()
    
    if not text_attacks.empty:
        # Extract specific attack type from category
        def extract_attack_type(category):
            if category.startswith('text_attack_'):
                return category.replace('text_attack_', '')
            return category
        
        text_attacks['attack_type'] = text_attacks['question_category'].apply(extract_attack_type)
        
        # Calculate average score by model and attack type
        attack_effectiveness = text_attacks.groupby(['model_name', 'attack_type'])['total_score'].mean().reset_index()
        
        # Create pivot table
        attack_pivot = attack_effectiveness.pivot(index='attack_type', columns='model_name', values='total_score')
        
        # Sort by average effectiveness across models
        attack_pivot['avg'] = attack_pivot.mean(axis=1)
        attack_pivot = attack_pivot.sort_values('avg', ascending=False).drop('avg', axis=1)
        
        # Display top 10 most effective attacks
        print("Top 10 most effective text attack categories:")
        display(attack_pivot.head(10))
        
        # Create heatmap of attack effectiveness
        plt.figure(figsize=(12, 10))
        sns.heatmap(attack_pivot.head(10), annot=True, cmap='YlOrRd', fmt='.2f')
        plt.title('Text Attack Effectiveness by Model (Top 10)', fontsize=15)
        plt.xlabel('Model', fontsize=12)
        plt.ylabel('Attack Type', fontsize=12)
        plt.tight_layout()
        
        # Save figure
        plt.savefig(os.path.join(figures_dir, 'text_attack_effectiveness.png'), dpi=300, bbox_inches='tight')
        plt.show()
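
The ranking above averages scores twice: first within each (attack type, model) cell, then across models for each attack type, so every model carries equal weight regardless of how many evaluations it has. A plain-Python sketch of that two-step average follows; the records are made up and the variable names are illustrative only.

```python
from collections import defaultdict
from statistics import mean

# Made-up (model, attack type, total score) records for illustration
records = [
    ("model_a", "attack_x", 12), ("model_a", "attack_x", 8),
    ("model_a", "attack_y", 2),
    ("model_b", "attack_x", 6),
    ("model_b", "attack_y", 10), ("model_b", "attack_y", 14),
]

# Step 1: mean score per (attack type, model) cell
cell_scores = defaultdict(list)
for model, attack, score in records:
    cell_scores[(attack, model)].append(score)
cell_means = {key: mean(vals) for key, vals in cell_scores.items()}

# Step 2: average the per-model means for each attack type
per_attack = defaultdict(list)
for (attack, _model), value in cell_means.items():
    per_attack[attack].append(value)

# Most effective attack types first
ranking = sorted(
    ((attack, mean(vals)) for attack, vals in per_attack.items()),
    key=lambda item: item[1],
    reverse=True,
)
print(ranking)
```

Note that this differs from pooling all evaluations of an attack type into one mean: for `attack_x` the pooled mean would be (12 + 8 + 6) / 3, whereas the two-step average is (10 + 6) / 2.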

In [None]:
# Analyze effectiveness of different visual perturbation types
if data:
    # Filter for visual perturbation evaluations
    # Copy the filtered rows so adding a column below does not trigger SettingWithCopyWarning
    visual_perturbs = data['evaluations'][data['evaluations']['category_type'] == 'visual_perturbation'].copy()
    
    if not visual_perturbs.empty:
        # Extract specific perturbation type from category
        def extract_perturb_type(category):
            if category.startswith('visual_perturb_'):
                return category.replace('visual_perturb_', '')
            return category
        
        visual_perturbs['perturb_type'] = visual_perturbs['question_category'].apply(extract_perturb_type)
        
        # Calculate average score by model and perturbation type
        perturb_effectiveness = visual_perturbs.groupby(['model_name', 'perturb_type'])['total_score'].mean().reset_index()
        
        # Create pivot table
        perturb_pivot = perturb_effectiveness.pivot(index='perturb_type', columns='model_name', values='total_score')
        
        # Sort by average effectiveness across models
        perturb_pivot['avg'] = perturb_pivot.mean(axis=1)
        perturb_pivot = perturb_pivot.sort_values('avg', ascending=False).drop('avg', axis=1)
        
        # Display most effective visual perturbations
        print("Most effective visual perturbation types:")
        display(perturb_pivot)
        
        # Create bar chart of perturbation effectiveness
        plt.figure(figsize=(12, 8))
        
        # Reshape for plotting
        perturb_long = perturb_effectiveness.copy()
        
        # Create bar chart
        sns.barplot(data=perturb_long, x='perturb_type', y='total_score', hue='model_name')
        plt.title('Visual Perturbation Effectiveness by Model', fontsize=15)
        plt.xlabel('Perturbation Type', fontsize=12)
        plt.ylabel('Average Vulnerability Score', fontsize=12)
        plt.xticks(rotation=45, ha='right')
        plt.legend(title='Model')
        plt.grid(axis='y', alpha=0.3)
        plt.tight_layout()
        
        # Save figure
        plt.savefig(os.path.join(figures_dir, 'visual_perturbation_effectiveness.png'), dpi=300, bbox_inches='tight')
        plt.show()

## 6. Response Length Analysis

Analyze response length patterns across models and attack types.

In [None]:
# Analyze response length patterns
if data:
    # Get response data
    responses = data['responses']
    
    # Calculate response length statistics by model and category
    length_stats = responses.groupby(['model_name', 'category_type'])['response_length'].agg(
        ['mean', 'std', 'count']
    ).reset_index()
    
    # Plot response length comparison
    # Pivot to a model x category layout so pandas aligns each error bar with its bar
    # (indexing ax.patches by position does not match seaborn's hue-major bar order)
    mean_pivot = length_stats.pivot(index='model_name', columns='category_type', values='mean')
    std_pivot = length_stats.pivot(index='model_name', columns='category_type', values='std')
    
    # Groups with a single response have an undefined std; draw a zero-length error bar
    std_pivot = std_pivot.fillna(0)
    
    # Grouped bar chart with one-standard-deviation error bars
    ax = mean_pivot.plot(kind='bar', yerr=std_pivot, capsize=5, figsize=(12, 8))
    
    plt.title('Response Length by Model and Category Type', fontsize=15)
    plt.xlabel('Model', fontsize=12)
    plt.ylabel('Average Response Length (characters)', fontsize=12)
    plt.xticks(rotation=45, ha='right')
    plt.legend(title='Category Type')
    plt.grid(axis='y', alpha=0.3)
    plt.tight_layout()
    
    # Save figure
    plt.savefig(os.path.join(figures_dir, 'response_length_comparison.png'), dpi=300, bbox_inches='tight')
    plt.show()
    
    # Calculate percentage change in response length from baseline to adversarial
    def calculate_length_changes(length_stats_df):
        # Pivot the data
        pivot_df = length_stats_df.pivot(index='model_name', columns='category_type', values='mean')
        
        # Calculate percentage changes
        if 'baseline' in pivot_df.columns:
            if 'text_attack' in pivot_df.columns:
                pivot_df['text_attack_pct_change'] = (pivot_df['text_attack'] - pivot_df['baseline']) / pivot_df['baseline'] * 100
            
            if 'visual_perturbation' in pivot_df.columns:
                pivot_df['visual_perturbation_pct_change'] = (pivot_df['visual_perturbation'] - pivot_df['baseline']) / pivot_df['baseline'] * 100
        
        return pivot_df
    
    # Calculate and display percentage changes
    changes_df = calculate_length_changes(length_stats)
    
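    print("\nPercentage Change in Response Length from Baseline:")
    # Only show the change columns that exist; they are absent without baseline or attack data
    available_change_cols = [
        col for col in ['text_attack_pct_change', 'visual_perturbation_pct_change']
        if col in changes_df.columns
    ]
    display(changes_df[available_change_cols])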
    
    # Create bar chart of percentage changes
    if 'text_attack_pct_change' in changes_df.columns or 'visual_perturbation_pct_change' in changes_df.columns:
        # Reshape for plotting
        change_cols = [col for col in ['text_attack_pct_change', 'visual_perturbation_pct_change'] if col in changes_df.columns]
        changes_long = changes_df.reset_index().melt(
            id_vars=['model_name'],
            value_vars=change_cols,
            var_name='attack_type',
            value_name='percent_change'
        )
        
        # Clean up attack type names
        changes_long['attack_type'] = changes_long['attack_type'].str.replace('_pct_change', '')
        
        # Create bar chart
        plt.figure(figsize=(12, 8))
        sns.barplot(data=changes_long, x='model_name', y='percent_change', hue='attack_type')
        plt.title('Percentage Change in Response Length from Baseline', fontsize=15)
        plt.xlabel('Model', fontsize=12)
        plt.ylabel('Percentage Change (%)', fontsize=12)
        plt.xticks(rotation=45, ha='right')
        plt.axhline(y=0, color='black', linestyle='-', alpha=0.5)
        plt.legend(title='Attack Type')
        plt.grid(axis='y', alpha=0.3)
        plt.tight_layout()
        
        # Save figure
        plt.savefig(os.path.join(figures_dir, 'response_length_pct_change.png'), dpi=300, bbox_inches='tight')
        plt.show()
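
The percentage change used above is the usual relative difference, (attacked mean - baseline mean) / baseline mean × 100. A small worked sketch follows; `pct_change_from_baseline` is a hypothetical helper, the lengths are made up, and returning `None` for a zero baseline is a choice made for this example rather than the notebook's behaviour.

```python
def pct_change_from_baseline(baseline_mean, attacked_mean):
    """Relative change in mean response length, in percent.

    Returns None when the baseline is zero, where the ratio is undefined.
    """
    if baseline_mean == 0:
        return None
    return (attacked_mean - baseline_mean) / baseline_mean * 100


# Made-up mean response lengths (characters)
longer = pct_change_from_baseline(400.0, 500.0)     # answers grow under attack
shorter = pct_change_from_baseline(400.0, 300.0)    # answers shrink under attack
undefined = pct_change_from_baseline(0.0, 300.0)    # no usable baseline
print(longer, shorter, undefined)
```

A positive value means the model writes longer answers under attack than on the original questions; a negative value means shorter ones.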

## 7. Risk Classification Analysis

Analyze the distribution of risk classifications for each model.

In [None]:
# Analyze risk classification distributions
if data:
    # Get evaluation data
    evaluations = data['evaluations']
    
    # Calculate risk distribution by model
    risk_distribution = evaluations.groupby('model_name')['risk_classification'].value_counts(normalize=True).reset_index(name='percentage')
    risk_distribution['percentage'] = risk_distribution['percentage'] * 100
    
    # Define risk order
    risk_order = ['Low Risk', 'Moderate Risk', 'High Risk', 'Critical Risk']
    
    # Filter to keep only standard risk classifications
    risk_distribution = risk_distribution[risk_distribution['risk_classification'].isin(risk_order)]
    
    # Create pivot table
    risk_pivot = risk_distribution.pivot(index='model_name', columns='risk_classification', values='percentage')
    
    # Reorder columns by risk level
    risk_pivot = risk_pivot.reindex(columns=risk_order)
    
    # Fill NaN with zeros
    risk_pivot = risk_pivot.fillna(0)
    
    # Display risk distribution
    print("Risk Classification Distribution by Model (percentage):")
    display(risk_pivot)
    
    # Create stacked bar chart
    risk_pivot.plot(kind='bar', stacked=True, figsize=(12, 8), colormap='YlOrRd')
    plt.title('Risk Classification Distribution by Model', fontsize=15)
    plt.xlabel('Model', fontsize=12)
    plt.ylabel('Percentage (%)', fontsize=12)
    plt.xticks(rotation=45, ha='right')
    plt.legend(title='Risk Classification')
    plt.grid(axis='y', alpha=0.3)
    plt.tight_layout()
    
    # Save figure
    plt.savefig(os.path.join(figures_dir, 'risk_classification_distribution.png'), dpi=300, bbox_inches='tight')
    plt.show()
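
The distribution above is a normalized count per risk class, reported in a fixed severity order with zeros for classes that never occur. Here is a standard-library sketch of the same computation for a single model; `risk_percentages` is a hypothetical helper and the labels are made up.

```python
from collections import Counter

RISK_ORDER = ["Low Risk", "Moderate Risk", "High Risk", "Critical Risk"]


def risk_percentages(labels, order=RISK_ORDER):
    """Percentage of evaluations per risk class, in a fixed order.

    Classes that never occur get 0.0; labels outside `order` are left out of
    the result but still count toward the denominator.
    """
    counts = Counter(labels)
    total = len(labels)
    if total == 0:
        return {risk: 0.0 for risk in order}
    return {risk: counts.get(risk, 0) / total * 100 for risk in order}


# Made-up labels for one model
labels = ["Low Risk", "Low Risk", "High Risk", "Moderate Risk"]
distribution = risk_percentages(labels)
print(distribution)
```

As in the notebook cell, non-standard labels are filtered out after normalizing, so the displayed percentages can sum to less than 100 when such labels are present.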

## 8. Vulnerability Profile Analysis

Create comprehensive vulnerability profiles for each model.

In [None]:
# Create comprehensive vulnerability profiles
if data:
    # Prepare data
    evaluations = data['evaluations']
    dimension_cols = data['dimension_cols']
    dimension_map = data['dimension_map']
    
    # Create profile for each model
    for model in evaluations['model_name'].unique():
        # Copy the subset so the attack_type column added below does not write into a view
        model_data = evaluations[evaluations['model_name'] == model].copy()
        
        print(f"\n{'='*80}\n")
        print(f"Vulnerability Profile: {model}\n")
        
        # Overall stats
        total_evaluations = len(model_data)
        avg_score = model_data['total_score'].mean()
        std_score = model_data['total_score'].std()
        risk_counts = model_data['risk_classification'].value_counts(normalize=True) * 100
        
        print(f"Total evaluations: {total_evaluations}")
        print(f"Average vulnerability score: {avg_score:.2f} ± {std_score:.2f} (0-32 scale)")
        print(f"Risk profile: {', '.join([f'{risk}: {pct:.1f}%' for risk, pct in risk_counts.items()])}")
        
        # Dimension scores
        dimension_scores = model_data[dimension_cols].mean()
        
        print("\nVulnerability by dimension:")
        for dim, score in dimension_scores.sort_values(ascending=False).items():
            print(f"  {dimension_map[dim]}: {score:.2f}/4")
        
        # Most vulnerable categories
        category_scores = model_data.groupby('category_type')['total_score'].mean().sort_values(ascending=False)
        
        print("\nVulnerability by category type:")
        for category, score in category_scores.items():
            print(f"  {category}: {score:.2f}/32")
            
        # Most effective attacks (top 5)
        if 'attack_type' not in model_data.columns and 'question_category' in model_data.columns:
            model_data['attack_type'] = model_data['question_category'].apply(
                lambda x: x.replace('text_attack_', '') if x.startswith('text_attack_') else 
                (x.replace('visual_perturb_', '') if x.startswith('visual_perturb_') else x)
            )
            
        attack_scores = model_data.groupby('attack_type')['total_score'].mean().sort_values(ascending=False)
        
        print("\nMost effective attacks (top 5):")
        for attack, score in attack_scores.head(5).items():
            print(f"  {attack}: {score:.2f}/32")
        
        # Create radar chart for this model
        N = len(dimension_cols)
        angles = np.linspace(0, 2*np.pi, N, endpoint=False).tolist()
        angles += angles[:1]  # Close the loop
        
        labels = [dimension_map[col] for col in dimension_cols]
        labels += labels[:1]  # Close the loop
        
        values = dimension_scores.values.tolist()
        values += values[:1]  # Close the loop
        
        fig = plt.figure(figsize=(10, 10))
        ax = fig.add_subplot(111, polar=True)
        
        ax.plot(angles, values, linewidth=2, label=model)
        ax.fill(angles, values, alpha=0.25)
        
        # Drop the duplicated closing angle so each label is drawn once
        ax.set_xticks(angles[:-1])
        ax.set_xticklabels(labels[:-1], fontsize=12)
        
        ax.set_ylim(0, 4)
        plt.yticks([0, 1, 2, 3, 4], ['0', '1', '2', '3', '4'], fontsize=10)
        
        plt.title(f'Vulnerability Profile: {model}', size=15)
        plt.tight_layout()
        
        # Save model-specific radar chart
        plt.savefig(os.path.join(figures_dir, f'{model.replace("/", "_")}_vulnerability_profile.png'), dpi=300, bbox_inches='tight')
        plt.show()

## 9. Export Results

Export comprehensive benchmarking results for further analysis and publication.

In [None]:
# Export comprehensive results to CSV
if data:
    # Create summary dataframe
    models = data['evaluations']['model_name'].unique()
    summary_rows = []
    
    for model in models:
        model_data = data['evaluations'][data['evaluations']['model_name'] == model]
        
        # Overall stats
        total_evaluations = len(model_data)
        avg_score = model_data['total_score'].mean()
        std_score = model_data['total_score'].std()
        min_score = model_data['total_score'].min()
        max_score = model_data['total_score'].max()
        
        # Risk classification
        risk_counts = model_data['risk_classification'].value_counts(normalize=True) * 100
        low_risk = risk_counts.get('Low Risk', 0)
        moderate_risk = risk_counts.get('Moderate Risk', 0)
        high_risk = risk_counts.get('High Risk', 0)
        critical_risk = risk_counts.get('Critical Risk', 0)
        
        # Dimension scores
        dimension_scores = {}
        for dim in data['dimension_cols']:
            dimension_scores[f"{dim}_avg"] = model_data[dim].mean()
            dimension_scores[f"{dim}_std"] = model_data[dim].std()
        
        # Category scores
        category_scores = {}
        for category in model_data['category_type'].unique():
            category_data = model_data[model_data['category_type'] == category]
            category_scores[f"{category}_avg"] = category_data['total_score'].mean()
            category_scores[f"{category}_std"] = category_data['total_score'].std()
            category_scores[f"{category}_count"] = len(category_data)
        
        # Combine all data
        summary_rows.append({
            'model': model,
            'total_evaluations': total_evaluations,
            'avg_score': avg_score,
            'std_score': std_score,
            'min_score': min_score,
            'max_score': max_score,
            'low_risk_pct': low_risk,
            'moderate_risk_pct': moderate_risk,
            'high_risk_pct': high_risk,
            'critical_risk_pct': critical_risk,
            **dimension_scores,
            **category_scores
        })
    
    # Create summary dataframe
    summary_df = pd.DataFrame(summary_rows)
    
    # Export to CSV
    summary_path = os.path.join(output_dir, 'benchmarking_summary.csv')
    summary_df.to_csv(summary_path, index=False)
    print(f"Saved benchmarking summary to {summary_path}")
    
    # Export raw data for further analysis
    raw_data_path = os.path.join(output_dir, 'vsf_med_complete_dataset.csv')
    data['evaluations'].to_csv(raw_data_path, index=False)
    print(f"Saved complete VSF-Med dataset to {raw_data_path}")
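
Because the per-category columns are created only for category types a model actually has, the summary rows can carry different keys. `pd.DataFrame` handles this by taking the union of keys and filling the gaps with NaN, which become empty cells in the CSV. The sketch below reproduces that behaviour with the standard `csv` module so the resulting layout is easy to inspect; the rows and values are made up.

```python
import csv
import io

# Made-up summary rows; the second model has no 'text_attack' evaluations
summary_rows = [
    {"model": "model_a", "avg_score": 5.0, "text_attack_avg": 7.5},
    {"model": "model_b", "avg_score": 3.0},
]

# Union of keys in first-seen order, so every row shares one header
fieldnames = []
for row in summary_rows:
    for key in row:
        if key not in fieldnames:
            fieldnames.append(key)

# restval fills columns that a row does not define
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="")
writer.writeheader()
writer.writerows(summary_rows)
csv_text = buffer.getvalue()
print(csv_text)
```

When reading `benchmarking_summary.csv` back, empty cells in the category columns therefore mean "no evaluations of this type for this model", not a score of zero.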

## 10. Summary and Conclusions

In this notebook, we've performed a comprehensive benchmarking analysis across all models evaluated in the VSF-Med framework. Our analysis provides insights into the relative vulnerability of different models to various types of adversarial inputs.

### Key Findings
- [Summarize which model performed best overall in terms of security]
- [Identify the most common vulnerabilities across all models]
- [Note which attack types were most effective]
- [Highlight differences between general models (GPT, Claude) vs. specialized models (CheXagent)]
- [Describe patterns in how models respond to different types of attacks]

### Implications for Medical AI
- These results highlight important security considerations for deploying vision-language models in clinical settings
- The identified vulnerabilities suggest specific mitigations that should be implemented
- Different models show different vulnerability profiles, suggesting different use cases may require different models

### Next Steps
- Proceed to notebook `08_analysis_radiologist_comparison.ipynb` to compare model performance with radiologists
- Consider implementing suggested mitigations and re-evaluating model performance
- Explore additional attack vectors and defense mechanisms

This comprehensive benchmarking provides a foundation for understanding the security posture of current medical vision-language models and informs future development of more robust systems.