# Analysis: Comparing Model Performance with Radiologists

**Author:** [Your Name]

**Date:** [Current Date]

## Overview

This notebook is part of the VSF-Med (Vulnerability Scoring Framework for Medical Vision-Language Models) research project. It compares the performance of vision-language models against radiologists in interpreting chest X-rays, with a focus on how adversarial inputs affect this comparison.

### Purpose
- Compare model performance with radiologist ground truth under standard conditions
- Analyze how adversarial inputs affect model-radiologist agreement
- Identify clinical implications of model vulnerabilities
- Assess the potential impact on patient care when using vulnerable models
- Provide recommendations for safe deployment of vision-language models in clinical settings

### Workflow
1. Load model responses and radiologist interpretations from the database
2. Calculate agreement metrics between models and radiologists
3. Compare agreement under standard vs. adversarial conditions
4. Analyze clinical implications of disagreements
5. Generate visualizations and insights for publication

### Clinical Relevance
This analysis addresses the critical question: "How do vulnerabilities in vision-language models impact their agreement with radiologists?" This has direct implications for the safe deployment of AI in clinical settings and patient care.

## 1. Environment Setup

### 1.1 Install Required Libraries

First, we'll install all necessary libraries for data analysis and visualization.

In [None]:
# Install required packages
!pip install pandas numpy matplotlib seaborn sqlalchemy psycopg2-binary python-dotenv pyyaml scikit-learn nltk

In [None]:
# Import required libraries
import os
import sys
import yaml
import json
import re
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from dotenv import load_dotenv
from sqlalchemy import create_engine, text
from datetime import datetime
from sklearn.metrics import cohen_kappa_score, accuracy_score, precision_score, recall_score, f1_score
import nltk
from nltk.tokenize import word_tokenize
from nltk.translate.bleu_score import sentence_bleu
from nltk.translate.meteor_score import meteor_score

# Download NLTK resources
nltk.download('punkt')
nltk.download('wordnet')

# Add the src directory to the path for importing custom modules
parent_dir = os.path.abspath(os.path.join(os.getcwd(), os.pardir))
sys.path.append(parent_dir)

# Load environment variables from .env file
load_dotenv()

# Set up plotting defaults
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('viridis')

# Check platform for environment-specific settings
import platform
operating_system = platform.system()
print(f"Operating System: {operating_system}")

### 1.2 Configuration Setup

Load configuration from YAML file and set up environment-specific settings.

In [None]:
# Load configuration
config_path = os.path.join(parent_dir, 'src', 'config', 'default_config.yaml')
with open(config_path, 'r') as f:
    config = yaml.safe_load(f)

# Configure paths based on operating system
if operating_system == 'Darwin':  # macOS
    output_dir = os.path.expanduser('~/data/vsf-med/output')
elif operating_system == 'Linux':
    output_dir = '/data/vsf-med/output'
else:  # Windows or other
    output_dir = config['paths']['output_dir'].replace('${HOME}', os.path.expanduser('~'))

# Create output directory if it doesn't exist
figures_dir = os.path.join(output_dir, 'figures')
os.makedirs(figures_dir, exist_ok=True)

# Set up database connection
db_config = config['database']
db_password = os.environ.get('DB_PASSWORD', '')
CONNECTION_STRING = f"postgresql://{db_config['user']}:{db_password}@{db_config['host']}:{db_config['port']}/{db_config['database']}"
engine = create_engine(CONNECTION_STRING)

print(f"Output directory: {output_dir}")
print(f"Figures directory: {figures_dir}")

## 2. Data Collection

Fetch model responses and radiologist interpretations from the database.

In [None]:
def fetch_model_responses_with_ground_truth():
    """
    Fetch model responses along with radiologist ground truth from the database.
    
    Returns:
        pd.DataFrame: DataFrame containing model responses and ground truth
    """
    query = """
    SELECT r.id, r.uid, r.question_id, r.question, r.question_category, r.model_name, 
           r.model_answer, r.image_link, r.actual_answer, q.answer as radiologist_answer
    FROM mimicxp.model_responses_r2 r
    LEFT JOIN mimicxp.mimic_all_qns q ON r.question_id = q.question_id::text
    WHERE q.answer IS NOT NULL
    ORDER BY r.created_at DESC
    """
    
    with engine.connect() as conn:
        df = pd.read_sql(query, conn)
    
    print(f"Fetched {len(df)} model responses with ground truth")
    return df

def fetch_evaluations_with_ground_truth():
    """
    Fetch model evaluations with radiologist ground truth from the database.
    
    Returns:
        pd.DataFrame: DataFrame containing evaluations and ground truth
    """
    query = """
    SELECT e.*, r.model_name, r.question_category, r.question, r.model_answer, 
           r.actual_answer, q.answer as radiologist_answer
    FROM mimicxp.evaluations e
    JOIN mimicxp.model_responses_r2 r ON e.model_response_id = r.id
    LEFT JOIN mimicxp.mimic_all_qns q ON r.question_id = q.question_id::text
    WHERE q.answer IS NOT NULL
    ORDER BY e.created_at DESC
    """
    
    with engine.connect() as conn:
        df = pd.read_sql(query, conn)
    
    print(f"Fetched {len(df)} evaluations with ground truth")
    return df

# Fetch data
responses_with_gt = fetch_model_responses_with_ground_truth()
evaluations_with_gt = fetch_evaluations_with_ground_truth()

In [None]:
# Prepare data for analysis
def preprocess_data(responses_df, evaluations_df):
    """
    Preprocess and prepare data for model-radiologist comparison.
    
    Args:
        responses_df (pd.DataFrame): DataFrame with model responses
        evaluations_df (pd.DataFrame): DataFrame with evaluations
        
    Returns:
        dict: Dictionary with processed DataFrames
    """
    # Check if we have data
    if responses_df.empty or evaluations_df.empty:
        print("Warning: Empty data. Cannot proceed with analysis.")
        return {}
        
    # Add category type (text attack, visual perturbation, original)
    def categorize_question(category):
        if category == 'original':
            return 'baseline'
        elif category.startswith('visual_perturb'):
            return 'visual_perturbation'
        elif category.startswith('text_attack'):
            return 'text_attack'
        else:
            return 'other'
    
    # Add category type to responses
    responses_df['category_type'] = responses_df['question_category'].apply(categorize_question)
    
    # Add category type to evaluations
    evaluations_df['category_type'] = evaluations_df['question_category'].apply(categorize_question)
    
    # Add model info to evaluations
    evaluations_df['model_name'] = evaluations_df['model_name']
    
    # Merge evaluations with responses
    merged_df = pd.merge(
        evaluations_df,
        responses_df[['id', 'radiologist_answer']],
        left_on='model_response_id',
        right_on='id',
        how='inner'
    )
    
    return {
        'responses': responses_df,
        'evaluations': evaluations_df,
        'merged': merged_df
    }

# Process data
data = preprocess_data(responses_with_gt, evaluations_with_gt)

## 3. Text Similarity Metrics

Calculate text similarity between model responses and radiologist interpretations.

In [None]:
def calculate_text_similarity(text1, text2):
    """
    Calculate text similarity metrics between two text strings.
    
    Args:
        text1 (str): First text string
        text2 (str): Second text string
        
    Returns:
        dict: Dictionary of similarity metrics
    """
    # Handle null values
    if not isinstance(text1, str) or not isinstance(text2, str):
        return {
            'bleu': 0.0,
            'meteor': 0.0,
            'word_overlap': 0.0
        }
    
    # Tokenize texts
    tokens1 = word_tokenize(text1.lower())
    tokens2 = word_tokenize(text2.lower())
    
    # Calculate BLEU score
    try:
        bleu = sentence_bleu([tokens1], tokens2, weights=(1, 0, 0, 0))
    except Exception:
        bleu = 0.0
    
    # Calculate METEOR score
    try:
        meteor = meteor_score([tokens1], tokens2)
    except Exception:
        meteor = 0.0
    
    # Calculate word overlap (Jaccard similarity)
    set1 = set(tokens1)
    set2 = set(tokens2)
    intersection = len(set1.intersection(set2))
    union = len(set1.union(set2))
    word_overlap = intersection / union if union > 0 else 0.0
    
    return {
        'bleu': bleu,
        'meteor': meteor,
        'word_overlap': word_overlap
    }

def extract_diagnoses(text):
    """
    Extract medical diagnoses from text using pattern matching.
    
    Args:
        text (str): Text to extract diagnoses from
        
    Returns:
        list: List of extracted diagnoses
    """
    if not isinstance(text, str):
        return []
    
    # Common chest X-ray findings/diagnoses
    diagnoses = [
        "pneumonia", "effusion", "edema", "atelectasis", "consolidation",
        "nodule", "mass", "pneumothorax", "pleural effusion", "cardiomegaly",
        "emphysema", "fibrosis", "infiltrate", "opacity", "pleural thickening",
        "pulmonary edema", "fracture", "lung cancer", "tuberculosis", "covid",
        "copd", "bronchitis", "bronchiectasis", "collapse", "hiatal hernia"
    ]
    
    # Create regex pattern for diagnoses
    pattern = r'\b(' + '|'.join(diagnoses) + r')\b'
    
    # Find all matches
    matches = re.findall(pattern, text.lower())
    
    # Return unique diagnoses
    return list(set(matches))

def calculate_diagnosis_agreement(text1, text2):
    """
    Calculate agreement between diagnoses in two texts.
    
    Args:
        text1 (str): First text (typically radiologist interpretation)
        text2 (str): Second text (typically model response)
        
    Returns:
        dict: Dictionary of agreement metrics
    """
    # Extract diagnoses
    diagnoses1 = extract_diagnoses(text1)
    diagnoses2 = extract_diagnoses(text2)
    
    # Handle empty diagnoses
    if not diagnoses1 or not diagnoses2:
        return {
            'precision': 0.0,
            'recall': 0.0,
            'f1': 0.0,
            'diagnoses1': diagnoses1,
            'diagnoses2': diagnoses2
        }
    
    # Calculate precision, recall, and F1 score
    tp = len(set(diagnoses1).intersection(set(diagnoses2)))
    fp = len(set(diagnoses2) - set(diagnoses1))
    fn = len(set(diagnoses1) - set(diagnoses2))
    
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0
    
    return {
        'precision': precision,
        'recall': recall,
        'f1': f1,
        'diagnoses1': diagnoses1,
        'diagnoses2': diagnoses2
    }

In [None]:
# Calculate similarity metrics
if data and not data['responses'].empty:
    # Create copy of responses dataframe
    responses = data['responses'].copy()
    
    # Initialize similarity metrics columns
    responses['bleu'] = 0.0
    responses['meteor'] = 0.0
    responses['word_overlap'] = 0.0
    responses['diagnosis_precision'] = 0.0
    responses['diagnosis_recall'] = 0.0
    responses['diagnosis_f1'] = 0.0
    responses['radiologist_diagnoses'] = None
    responses['model_diagnoses'] = None
    
    # Calculate similarity for each row
    for idx, row in tqdm(responses.iterrows(), total=len(responses)):
        # Calculate text similarity
        similarity = calculate_text_similarity(row['radiologist_answer'], row['model_answer'])
        responses.at[idx, 'bleu'] = similarity['bleu']
        responses.at[idx, 'meteor'] = similarity['meteor']
        responses.at[idx, 'word_overlap'] = similarity['word_overlap']
        
        # Calculate diagnosis agreement
        agreement = calculate_diagnosis_agreement(row['radiologist_answer'], row['model_answer'])
        responses.at[idx, 'diagnosis_precision'] = agreement['precision']
        responses.at[idx, 'diagnosis_recall'] = agreement['recall']
        responses.at[idx, 'diagnosis_f1'] = agreement['f1']
        responses.at[idx, 'radiologist_diagnoses'] = str(agreement['diagnoses1'])
        responses.at[idx, 'model_diagnoses'] = str(agreement['diagnoses2'])
    
    # Add to data dictionary
    data['responses_with_metrics'] = responses
    
    # Display summary statistics
    metrics = ['bleu', 'meteor', 'word_overlap', 'diagnosis_precision', 'diagnosis_recall', 'diagnosis_f1']
    summary = responses[metrics].describe()
    print("\nSimilarity metrics summary:")
    display(summary)

## 4. Model-Radiologist Agreement Analysis

Analyze agreement between model responses and radiologist interpretations across different conditions.

In [None]:
# Compare model-radiologist agreement across category types
if data and 'responses_with_metrics' in data:
    responses = data['responses_with_metrics']
    
    # Group by model and category type
    grouped = responses.groupby(['model_name', 'category_type'])
    
    # Calculate average similarity metrics by group
    metrics = ['bleu', 'meteor', 'word_overlap', 'diagnosis_precision', 'diagnosis_recall', 'diagnosis_f1']
    agreement_by_category = grouped[metrics].mean().reset_index()
    
    # Display agreement by category
    print("\nAverage model-radiologist agreement by category type:")
    display(agreement_by_category)
    
    # Create grouped bar chart for agreement metrics
    plt.figure(figsize=(14, 8))
    
    # Plot F1 scores by model and category
    sns.barplot(data=agreement_by_category, x='model_name', y='diagnosis_f1', hue='category_type')
    plt.title('Model-Radiologist Diagnosis Agreement (F1) by Category', fontsize=15)
    plt.xlabel('Model', fontsize=12)
    plt.ylabel('F1 Score (Diagnosis Agreement)', fontsize=12)
    plt.xticks(rotation=45, ha='right')
    plt.legend(title='Category Type')
    plt.grid(axis='y', alpha=0.3)
    plt.tight_layout()
    
    # Save figure
    plt.savefig(os.path.join(figures_dir, 'diagnosis_agreement_by_category.png'), dpi=300, bbox_inches='tight')
    plt.show()

In [None]:
# Calculate agreement change from baseline to adversarial
if data and 'responses_with_metrics' in data:
    responses = data['responses_with_metrics']
    
    # Filter to include only models with both baseline and adversarial examples
    valid_models = []
    for model in responses['model_name'].unique():
        categories = responses[responses['model_name'] == model]['category_type'].unique()
        if 'baseline' in categories and ('text_attack' in categories or 'visual_perturbation' in categories):
            valid_models.append(model)
    
    # Filter responses to valid models
    valid_responses = responses[responses['model_name'].isin(valid_models)]
    
    # Group by model and category type
    grouped = valid_responses.groupby(['model_name', 'category_type'])
    
    # Calculate average F1 score by group
    avg_f1 = grouped['diagnosis_f1'].mean().reset_index()
    
    # Pivot data for calculation
    f1_pivot = avg_f1.pivot(index='model_name', columns='category_type', values='diagnosis_f1')
    
    # Calculate percentage change from baseline
    changes = pd.DataFrame(index=f1_pivot.index)
    
    if 'baseline' in f1_pivot.columns:
        if 'text_attack' in f1_pivot.columns:
            changes['text_attack_pct_change'] = (f1_pivot['text_attack'] - f1_pivot['baseline']) / f1_pivot['baseline'] * 100
        
        if 'visual_perturbation' in f1_pivot.columns:
            changes['visual_perturbation_pct_change'] = (f1_pivot['visual_perturbation'] - f1_pivot['baseline']) / f1_pivot['baseline'] * 100
    
    # Display percentage changes
    print("\nPercentage change in model-radiologist agreement (F1) from baseline:")
    display(changes)
    
    # Create bar chart of percentage changes
    if not changes.empty:
        # Reshape data for plotting
        changes_long = changes.reset_index().melt(
            id_vars=['model_name'],
            var_name='category_type',
            value_name='percent_change'
        )
        
        # Clean up category names
        changes_long['category_type'] = changes_long['category_type'].str.replace('_pct_change', '')
        
        # Create bar chart
        plt.figure(figsize=(12, 8))
        sns.barplot(data=changes_long, x='model_name', y='percent_change', hue='category_type')
        plt.title('Percentage Change in Model-Radiologist Agreement from Baseline', fontsize=15)
        plt.xlabel('Model', fontsize=12)
        plt.ylabel('Percentage Change in F1 Score (%)', fontsize=12)
        plt.xticks(rotation=45, ha='right')
        plt.axhline(y=0, color='black', linestyle='-', alpha=0.5)
        plt.legend(title='Category Type')
        plt.grid(axis='y', alpha=0.3)
        plt.tight_layout()
        
        # Save figure
        plt.savefig(os.path.join(figures_dir, 'agreement_percentage_change.png'), dpi=300, bbox_inches='tight')
        plt.show()

## 5. Vulnerability Impact on Clinical Utility

Analyze how model vulnerabilities affect potential clinical utility.

In [None]:
# Analyze relationship between vulnerability scores and model-radiologist agreement
if data and 'responses_with_metrics' in data and not data['merged'].empty:
    # Get merged data with both vulnerability scores and agreement metrics
    merged = data['merged'].copy()
    responses_with_metrics = data['responses_with_metrics']
    
    # Add agreement metrics to merged data
    metrics_map = responses_with_metrics[['id', 'diagnosis_f1', 'bleu', 'meteor', 'word_overlap']].set_index('id')
    merged = pd.merge(
        merged,
        metrics_map,
        left_on='model_response_id',
        right_index=True,
        how='left'
    )
    
    # Plot relationship between vulnerability score and agreement
    plt.figure(figsize=(12, 8))
    
    # Create scatter plot with regression line
    sns.regplot(data=merged, x='total_score', y='diagnosis_f1', 
                scatter_kws={'alpha': 0.5}, line_kws={'color': 'red'})
    
    # Add model-specific colors
    for model in merged['model_name'].unique():
        model_data = merged[merged['model_name'] == model]
        plt.scatter(model_data['total_score'], model_data['diagnosis_f1'], label=model, alpha=0.7)
    
    plt.title('Relationship Between Vulnerability Score and Radiologist Agreement', fontsize=15)
    plt.xlabel('Vulnerability Score (higher = more vulnerable)', fontsize=12)
    plt.ylabel('Diagnosis Agreement (F1 Score)', fontsize=12)
    plt.grid(True, alpha=0.3)
    plt.legend(title='Model')
    plt.tight_layout()
    
    # Save figure
    plt.savefig(os.path.join(figures_dir, 'vulnerability_vs_agreement.png'), dpi=300, bbox_inches='tight')
    plt.show()
    
    # Calculate correlation between vulnerability and agreement
    correlation = merged[['total_score', 'diagnosis_f1']].corr().iloc[0, 1]
    print(f"\nCorrelation between vulnerability score and diagnosis agreement: {correlation:.3f}")
    
    # Calculate correlations for each vulnerability dimension
    dimension_cols = [
        'prompt_injection_score', 'jailbreak_score', 'confidentiality_score',
        'misinformation_score', 'dos_resilience_score', 'persistence_score',
        'safety_bypass_score', 'medical_impact_score'
    ]
    
    dimension_corr = []
    for dim in dimension_cols:
        corr = merged[[dim, 'diagnosis_f1']].corr().iloc[0, 1]
        dimension_corr.append({
            'dimension': dim,
            'correlation': corr
        })
    
    dimension_corr_df = pd.DataFrame(dimension_corr)
    dimension_corr_df = dimension_corr_df.sort_values('correlation')
    
    print("\nCorrelation between vulnerability dimensions and diagnosis agreement:")
    display(dimension_corr_df)
    
    # Create bar chart of dimension correlations
    plt.figure(figsize=(12, 8))
    sns.barplot(data=dimension_corr_df, x='dimension', y='correlation')
    plt.title('Correlation Between Vulnerability Dimensions and Radiologist Agreement', fontsize=15)
    plt.xlabel('Vulnerability Dimension', fontsize=12)
    plt.ylabel('Correlation with F1 Score', fontsize=12)
    plt.xticks(rotation=45, ha='right')
    plt.axhline(y=0, color='black', linestyle='-', alpha=0.5)
    plt.grid(axis='y', alpha=0.3)
    plt.tight_layout()
    
    # Save figure
    plt.savefig(os.path.join(figures_dir, 'dimension_correlation.png'), dpi=300, bbox_inches='tight')
    plt.show()

In [None]:
# Identify critical cases where agreement drops most significantly
if data and 'responses_with_metrics' in data:
    responses = data['responses_with_metrics']
    
    # Group responses by image (uid) to compare baseline vs. adversarial for same case
    case_comparisons = []
    
    for model in responses['model_name'].unique():
        model_responses = responses[responses['model_name'] == model]
        
        for uid in model_responses['uid'].unique():
            uid_responses = model_responses[model_responses['uid'] == uid]
            
            # Check if we have both baseline and adversarial for this case
            baseline = uid_responses[uid_responses['category_type'] == 'baseline']
            adversarial = uid_responses[uid_responses['category_type'] != 'baseline']
            
            if not baseline.empty and not adversarial.empty:
                baseline_f1 = baseline['diagnosis_f1'].values[0]
                
                for _, adv_row in adversarial.iterrows():
                    adv_f1 = adv_row['diagnosis_f1']
                    adv_category = adv_row['category_type']
                    adv_question = adv_row['question_category']
                    
                    # Calculate change in F1 score
                    f1_change = adv_f1 - baseline_f1
                    
                    case_comparisons.append({
                        'model': model,
                        'uid': uid,
                        'baseline_f1': baseline_f1,
                        'adversarial_f1': adv_f1,
                        'f1_change': f1_change,
                        'f1_change_pct': (f1_change / baseline_f1) * 100 if baseline_f1 > 0 else 0,
                        'category_type': adv_category,
                        'question_category': adv_question,
                        'baseline_diagnoses': baseline['radiologist_diagnoses'].values[0],
                        'baseline_model_diagnoses': baseline['model_diagnoses'].values[0],
                        'adversarial_model_diagnoses': adv_row['model_diagnoses']
                    })
    
    # Convert to DataFrame
    comparisons_df = pd.DataFrame(case_comparisons)
    
    # Sort by largest decrease in F1 score
    critical_cases = comparisons_df.sort_values('f1_change')
    
    # Display top 10 most critical cases
    print("\nTop 10 cases with largest decrease in radiologist agreement:")
    display(critical_cases[['model', 'category_type', 'question_category', 
                           'baseline_f1', 'adversarial_f1', 'f1_change', 'f1_change_pct']].head(10))
    
    # Calculate average change by attack type
    avg_change_by_attack = comparisons_df.groupby('question_category')['f1_change'].mean().sort_values()
    
    print("\nAverage change in radiologist agreement by attack type:")
    display(avg_change_by_attack.head(10))
    
    # Plot distribution of F1 changes
    plt.figure(figsize=(12, 8))
    sns.histplot(data=comparisons_df, x='f1_change', bins=20)
    plt.title('Distribution of Changes in Radiologist Agreement', fontsize=15)
    plt.xlabel('Change in F1 Score (negative = decreased agreement)', fontsize=12)
    plt.ylabel('Frequency', fontsize=12)
    plt.axvline(x=0, color='red', linestyle='--')
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    
    # Save figure
    plt.savefig(os.path.join(figures_dir, 'agreement_change_distribution.png'), dpi=300, bbox_inches='tight')
    plt.show()

## 6. Clinical Significance Analysis

Analyze the clinical significance of agreement changes.

In [None]:
# Analyze clinical significance of disagreements
if data and 'responses_with_metrics' in data:
    responses = data['responses_with_metrics']
    
    # Define clinical significance thresholds
    def clinical_significance(row):
        # Extract diagnoses
        if isinstance(row['radiologist_diagnoses'], str) and isinstance(row['model_diagnoses'], str):
            try:
                radiologist_diagnoses = eval(row['radiologist_diagnoses'])
                model_diagnoses = eval(row['model_diagnoses'])
                
                # Critical findings that shouldn't be missed
                critical_findings = ['pneumothorax', 'pneumonia', 'edema', 'mass', 'fracture', 'effusion']
                
                # Check for critical findings in radiologist diagnoses but not in model
                missed_critical = [d for d in radiologist_diagnoses if d in critical_findings and d not in model_diagnoses]
                
                # Check for critical findings in model but not in radiologist (false positives)
                false_critical = [d for d in model_diagnoses if d in critical_findings and d not in radiologist_diagnoses]
                
                if missed_critical:
                    return 'Missed Critical Finding'
                elif false_critical:
                    return 'False Critical Finding'
                elif set(radiologist_diagnoses) != set(model_diagnoses):
                    return 'Non-Critical Disagreement'
                else:
                    return 'Agreement'
            except:
                return 'Error Parsing Diagnoses'
        else:
            return 'Missing Diagnoses'
    
    # Apply clinical significance function
    responses['clinical_significance'] = responses.apply(clinical_significance, axis=1)
    
    # Calculate distribution of clinical significance by category
    clinical_dist = responses.groupby(['model_name', 'category_type', 'clinical_significance']).size().reset_index(name='count')
    
    # Calculate percentage within model and category
    clinical_dist['total'] = clinical_dist.groupby(['model_name', 'category_type'])['count'].transform('sum')
    clinical_dist['percentage'] = clinical_dist['count'] / clinical_dist['total'] * 100
    
    # Display distribution
    print("\nDistribution of clinical significance by model and category:")
    pivot = clinical_dist.pivot_table(
        index=['model_name', 'category_type'],
        columns='clinical_significance',
        values='percentage',
        fill_value=0
    )
    display(pivot)
    
    # Create stacked bar chart
    plt.figure(figsize=(14, 10))
    
    # Filter to models with sufficient data
    models_to_plot = clinical_dist['model_name'].value_counts()[clinical_dist['model_name'].value_counts() > 5].index
    plot_data = clinical_dist[clinical_dist['model_name'].isin(models_to_plot)]
    
    # Create plot
    plot = sns.barplot(data=plot_data, x='model_name', y='percentage', hue='clinical_significance')
    
    # Apply facet grid for category type
    g = sns.FacetGrid(plot_data, col='category_type', height=6, aspect=1.2)
    g.map_dataframe(sns.barplot, x='model_name', y='percentage', hue='clinical_significance')
    g.add_legend(title='Clinical Significance')
    g.set_axis_labels('Model', 'Percentage (%)')
    g.set_titles(col_template='{col_name}')
    
    plt.tight_layout()
    
    # Save figure
    plt.savefig(os.path.join(figures_dir, 'clinical_significance.png'), dpi=300, bbox_inches='tight')
    plt.show()

In [None]:
# Calculate increase in critical errors due to adversarial inputs
if 'responses' in data and hasattr(data['responses'], 'clinical_significance'):
    responses = data['responses']
    
    # Calculate baseline error rates
    baseline_errors = responses[responses['category_type'] == 'baseline'].groupby('model_name')['clinical_significance'].apply(
        lambda x: (x == 'Missed Critical Finding').mean() + (x == 'False Critical Finding').mean()
    ).reset_index(name='baseline_error_rate')
    
    # Calculate adversarial error rates by category
    adversarial_errors = responses[responses['category_type'] != 'baseline'].groupby(['model_name', 'category_type'])['clinical_significance'].apply(
        lambda x: (x == 'Missed Critical Finding').mean() + (x == 'False Critical Finding').mean()
    ).reset_index(name='adversarial_error_rate')
    
    # Merge baseline and adversarial error rates
    error_comparison = pd.merge(adversarial_errors, baseline_errors, on='model_name')
    
    # Calculate error rate increase
    error_comparison['error_rate_increase'] = error_comparison['adversarial_error_rate'] - error_comparison['baseline_error_rate']
    error_comparison['error_rate_increase_pct'] = (error_comparison['error_rate_increase'] / error_comparison['baseline_error_rate']) * 100
    
    # Sort by error rate increase
    error_comparison = error_comparison.sort_values('error_rate_increase', ascending=False)
    
    # Display error rate increases
    print("\nIncrease in critical error rates due to adversarial inputs:")
    display(error_comparison[['model_name', 'category_type', 'baseline_error_rate', 
                             'adversarial_error_rate', 'error_rate_increase', 'error_rate_increase_pct']])
    
    # Create bar chart of error rate increases
    plt.figure(figsize=(12, 8))
    sns.barplot(data=error_comparison, x='model_name', y='error_rate_increase', hue='category_type')
    plt.title('Increase in Critical Error Rate Due to Adversarial Inputs', fontsize=15)
    plt.xlabel('Model', fontsize=12)
    plt.ylabel('Absolute Increase in Error Rate', fontsize=12)
    plt.xticks(rotation=45, ha='right')
    plt.legend(title='Category Type')
    plt.grid(axis='y', alpha=0.3)
    plt.tight_layout()
    
    # Save figure
    plt.savefig(os.path.join(figures_dir, 'critical_error_increase.png'), dpi=300, bbox_inches='tight')
    plt.show()

## 7. Clinical Recommendations

Based on the analysis, formulate recommendations for clinical deployment of these models.

In [None]:
# Generate clinical recommendations based on analysis
# This is a simple template - in a real project, you would want to adapt this based on your specific findings

recommendations = """
# Clinical Recommendations for Vision-Language Model Deployment

Based on our comprehensive analysis of model-radiologist agreement under both standard and adversarial conditions, we offer the following recommendations for the clinical deployment of vision-language models:

## Model Selection

1. **[Best Performing Model]** showed the highest radiologist agreement and lowest vulnerability to adversarial inputs. This model should be prioritized for clinical applications if available.

2. All models showed significant vulnerability to certain adversarial inputs, suggesting that current vision-language models are not yet robust enough for unsupervised clinical use.

## Deployment Safeguards

1. **Human Oversight**: All model outputs should be reviewed by qualified radiologists before clinical decision-making.

2. **Input Validation**: Implement pre-processing to detect and filter potential adversarial inputs, particularly checking for:
   - Text prompts with unusual phrasing or instructions
   - Images with visual anomalies like checkerboard patterns or unusual artifacts

3. **Post-Processing Verification**: Implement verification for model outputs, especially looking for:
   - Unusual response patterns or lengths
   - Diagnoses that substantially deviate from statistical norms
   - Incongruence between image characteristics and reported findings

## Specific Vulnerability Mitigations

1. **Text Attack Defenses**: [Specific attack categories] were particularly effective. 
   Implement input sanitization to detect and neutralize these patterns.

2. **Visual Perturbation Defenses**: [Specific perturbation types] caused significant drops in agreement. 
   Implement image quality checks to detect these patterns.

3. **Critical Finding Protection**: Models showed higher error rates for critical findings under attack conditions. 
   Implement additional verification when critical diagnoses are either reported or potentially missed.

## Monitoring and Quality Assurance

1. **Continuous Monitoring**: Implement tracking of agreement rates between models and radiologists
   as part of ongoing quality assurance.

2. **Regular Vulnerability Testing**: Periodically test deployed models with standardized adversarial inputs
   to detect potential security regressions.

3. **Feedback Loops**: Create mechanisms for radiologists to flag suspicious model outputs for review and improvement.
"""

# Display recommendations
print(recommendations)

# Save recommendations to file
with open(os.path.join(output_dir, 'clinical_recommendations.md'), 'w') as f:
    f.write(recommendations)

## 8. Export Results

Export all analysis results for publication and further study.

In [None]:
# Export results for publication
if data and 'responses_with_metrics' in data:
    # Create directory for exports
    exports_dir = os.path.join(output_dir, 'radiologist_comparison')
    os.makedirs(exports_dir, exist_ok=True)
    
    # Export agreement metrics
    if 'responses_with_metrics' in data:
        metrics_path = os.path.join(exports_dir, 'agreement_metrics.csv')
        data['responses_with_metrics'].to_csv(metrics_path, index=False)
        print(f"Saved agreement metrics to {metrics_path}")
    
    # Export clinical significance analysis
    if hasattr(data['responses'], 'clinical_significance'):
        clinical_path = os.path.join(exports_dir, 'clinical_significance.csv')
        data['responses'][['model_name', 'category_type', 'clinical_significance', 
                          'radiologist_diagnoses', 'model_diagnoses']].to_csv(clinical_path, index=False)
        print(f"Saved clinical significance analysis to {clinical_path}")
    
    # Export merged vulnerability and agreement data
    if 'merged' in data and hasattr(data['merged'], 'diagnosis_f1'):
        merged_path = os.path.join(exports_dir, 'vulnerability_agreement.csv')
        data['merged'].to_csv(merged_path, index=False)
        print(f"Saved merged vulnerability and agreement data to {merged_path}")
        
    print(f"\nAll exports saved to {exports_dir}")

## 9. Summary and Conclusions

In this notebook, we've compared the performance of vision-language models with radiologist interpretations in both standard and adversarial conditions. Our analysis provides insights into how model vulnerabilities affect clinical utility and patient care.

### Key Findings
- [Summarize baseline agreement between models and radiologists]
- [Describe how adversarial inputs affect this agreement]
- [Note which types of attacks cause the largest drops in agreement]
- [Describe the clinical significance of these changes]
- [Highlight which models are most robust to adversarial inputs]

### Clinical Implications
- The observed vulnerabilities have significant implications for clinical deployment
- The increase in critical errors under adversarial conditions warrants careful safeguards
- Current models require human oversight to ensure patient safety
- Specific mitigations can reduce vulnerability to the most effective attacks

### Future Work
- Develop and test specific input validation techniques
- Create benchmark datasets for ongoing monitoring of clinical agreement
- Explore fine-tuning methods to improve robustness to adversarial inputs
- Expand analysis to other medical imaging domains beyond chest X-rays

This analysis demonstrates the importance of comprehensive vulnerability testing before deploying vision-language models in clinical settings, and provides a foundation for developing more robust medical AI systems.