# Model Evaluation: CheXagent Performance on Perturbed Images

**Author:** [Your Name]

**Date:** [Current Date]

## Overview

This notebook is part of the VSF-Med (Vulnerability Scoring Framework for Medical Vision-Language Models) research project. It evaluates the StanfordAIMI/CheXagent-8b model on visually perturbed chest X-ray images to assess the model's robustness to visual adversarial attacks.

### Purpose
- Evaluate CheXagent's performance when presented with visually perturbed medical images
- Test multiple visual perturbation methods with varying parameters
- Compare results with baseline performance on unmodified images
- Identify vulnerabilities in visual processing capabilities

### Workflow
1. Set up the environment and load the CheXagent model
2. Apply various visual perturbations to chest X-ray images
3. Process perturbed images through the model
4. Store responses in the database for vulnerability analysis
5. Compare responses to baseline performance

### Perturbation Methods
We use six visual perturbation techniques in this evaluation:
1. **Gaussian Noise**: Adding random noise with different standard deviations
2. **Checkerboard Overlay**: Placing checkerboard patterns over portions of the image
3. **Moiré Patterns**: Adding interference-like patterns with varying frequencies
4. **Random Arrow Artifacts**: Overlaying directional indicators that might mislead interpretation
5. **Steganographic Hiding**: Embedding hidden information in the image
6. **LSB Extraction**: Manipulating least significant bits to create subtle artifacts

## 1. Environment Setup

### 1.1 Install Required Libraries

First, we'll install all necessary libraries for model inference, image processing, and database operations.

In [None]:
# Install required packages
!pip install torch transformers pillow sqlalchemy psycopg2-binary pandas numpy matplotlib python-dotenv tqdm pyyaml opencv-python scikit-image stegano

In [None]:
# Import required libraries
import os
import sys
import yaml
import json
import torch
import pandas as pd
import numpy as np
import cv2
import matplotlib.pyplot as plt
from PIL import Image
from tqdm import tqdm
from dotenv import load_dotenv
from sqlalchemy import create_engine, text
from datetime import datetime
from transformers import AutoProcessor, AutoModelForCausalLM
from skimage.metrics import structural_similarity as ssim

# Add the src directory to the path for importing custom modules
parent_dir = os.path.abspath(os.path.join(os.getcwd(), os.pardir))
sys.path.append(parent_dir)

# Load environment variables from .env file
load_dotenv()

# Import custom image perturbation module
from src.utils.perturbations.image_perturbations import ImagePerturbation

# Check platform for environment-specific settings
import platform
operating_system = platform.system()
print(f"Operating System: {operating_system}")

### 1.2 Configuration Setup

Load configuration from YAML file and set up environment-specific settings.

In [None]:
# Load configuration
config_path = os.path.join(parent_dir, 'src', 'config', 'default_config.yaml')
with open(config_path, 'r') as f:
    config = yaml.safe_load(f)

# Configure paths based on operating system
if operating_system == 'Darwin':  # macOS
    base_dir = os.path.expanduser('~/data/mimic-cxr-jpg/2.1.0/files')
    output_dir = os.path.expanduser('~/data/vsf-med/output')
    perturb_dir = os.path.expanduser('~/data/vsf-med/perturbed_files')
elif operating_system == 'Linux':
    base_dir = '/data/mimic-cxr-jpg/2.1.0/files'
    output_dir = '/data/vsf-med/output'
    perturb_dir = '/data/vsf-med/perturbed_files'
else:  # Windows or other
    base_dir = config['paths']['data_dir'].replace('${HOME}', os.path.expanduser('~'))
    output_dir = config['paths']['output_dir'].replace('${HOME}', os.path.expanduser('~'))
    perturb_dir = config['paths']['perturbation_dir'].replace('${HOME}', os.path.expanduser('~'))

# Create output directories if they don't exist
os.makedirs(output_dir, exist_ok=True)
os.makedirs(perturb_dir, exist_ok=True)

# Create subdirectories for each perturbation type
perturb_types = ['gaussian_noise', 'checkerboard', 'moire', 'arrow', 'stegano', 'lsb']
perturb_dirs = {}
for ptype in perturb_types:
    perturb_dirs[ptype] = os.path.join(perturb_dir, ptype)
    os.makedirs(perturb_dirs[ptype], exist_ok=True)

# Set up database connection
db_config = config['database']
db_password = os.environ.get('DB_PASSWORD', '')
CONNECTION_STRING = f"postgresql://{db_config['user']}:{db_password}@{db_config['host']}:{db_config['port']}/{db_config['database']}"
engine = create_engine(CONNECTION_STRING)

print(f"Data directory: {base_dir}")
print(f"Output directory: {output_dir}")
print(f"Perturbation directory: {perturb_dir}")

## 2. Database Functions

Set up functions to interact with the database for fetching questions and storing model responses.

In [None]:
def fetch_questions(condition='original', limit=100):
    """
    Fetch questions from the database based on condition.
    
    Args:
        condition (str): Type of questions to fetch (original, adversarial, etc.)
        limit (int): Maximum number of questions to fetch
        
    Returns:
        pd.DataFrame: DataFrame containing the questions
    """
    query = f"""
    SELECT id, question_id, condition, text, image 
    FROM mimicxp.mimic_all_qns 
    WHERE condition = '{condition}' 
    LIMIT {limit}
    """
    
    with engine.connect() as conn:
        df = pd.read_sql(query, conn)
    
    print(f"Fetched {len(df)} {condition} questions from database")
    return df

def store_model_response(uid, question_id, question, question_category, 
                         actual_answer, model_name, model_answer, image_link):
    """
    Store model response in the database.
    
    Args:
        uid (str): Unique identifier for the source image
        question_id (str): Question ID
        question (str): The question text
        question_category (str): Category of question (original, visual_perturb, text_attack)
        actual_answer (str): Ground truth answer (if available)
        model_name (str): Name of the model
        model_answer (str): Model's response
        image_link (str): Path to the image file
        
    Returns:
        int: ID of the inserted record
    """
    query = """
    INSERT INTO mimicxp.model_responses_r2 
    (uid, question_id, question, question_category, actual_answer, model_name, model_answer, image_link, created_at) 
    VALUES (:uid, :question_id, :question, :question_category, :actual_answer, :model_name, :model_answer, :image_link, NOW()) 
    RETURNING id
    """
    
    params = {
        'uid': uid,
        'question_id': str(question_id),
        'question': question,
        'question_category': question_category,
        'actual_answer': actual_answer,
        'model_name': model_name,
        'model_answer': model_answer,
        'image_link': image_link
    }
    
    with engine.connect() as conn:
        result = conn.execute(text(query), params)
        conn.commit()
        record_id = result.fetchone()[0]
    
    return record_id

def store_perturbation_metadata(original_image_path, perturbed_image_path, perturbation_type, parameters, ssim_value, psnr=None):
    """
    Store perturbation metadata in the database.
    
    Args:
        original_image_path (str): Path to the original image
        perturbed_image_path (str): Path to the perturbed image
        perturbation_type (str): Type of perturbation
        parameters (dict): Perturbation parameters
        ssim_value (float): Structural Similarity Index
        psnr (float, optional): Peak Signal-to-Noise Ratio
        
    Returns:
        int: ID of the inserted record
    """
    query = """
    INSERT INTO mimicxp.perturbation_metadata 
    (original_image_path, perturbed_image_path, perturbation_type, parameters, ssim, psnr, created_at) 
    VALUES (:original_image_path, :perturbed_image_path, :perturbation_type, :parameters, :ssim, :psnr, NOW()) 
    RETURNING id
    """
    
    params = {
        'original_image_path': original_image_path,
        'perturbed_image_path': perturbed_image_path,
        'perturbation_type': perturbation_type,
        'parameters': json.dumps(parameters),
        'ssim': ssim_value,
        'psnr': psnr
    }
    
    with engine.connect() as conn:
        result = conn.execute(text(query), params)
        conn.commit()
        record_id = result.fetchone()[0]
    
    return record_id

def check_existing_response(uid, question_id, model_name, question_category):
    """
    Check if a response already exists in the database.
    
    Args:
        uid (str): Unique identifier for the source image
        question_id (str): Question ID
        model_name (str): Name of the model
        question_category (str): Category of question
        
    Returns:
        bool: True if response exists, False otherwise
    """
    query = """
    SELECT COUNT(*) FROM mimicxp.model_responses_r2 
    WHERE uid = :uid AND question_id = :question_id AND model_name = :model_name AND question_category = :question_category
    """
    
    params = {
        'uid': uid,
        'question_id': str(question_id),
        'model_name': model_name,
        'question_category': question_category
    }
    
    with engine.connect() as conn:
        result = conn.execute(text(query), params)
        count = result.fetchone()[0]
    
    return count > 0

## 3. Model Setup

Load the CheXagent model and set up for inference.

In [None]:
def setup_model():
    """
    Set up the CheXagent model for inference.
    
    Returns:
        tuple: (model, processor)
    """
    model_id = "StanfordAIMI/CheXagent-8b"
    print(f"Loading model: {model_id}")
    
    # Check for GPU availability
    device = "cuda" if torch.cuda.is_available() else "cpu"
    print(f"Using device: {device}")
    
    # Load model and processor
    processor = AutoProcessor.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.float16 if device == "cuda" else torch.float32,
        device_map="auto" if device == "cuda" else None
    )
    
    return model, processor

# Load the model
try:
    model, processor = setup_model()
    print("Model loaded successfully")
except Exception as e:
    print(f"Error loading model: {e}")
    # For demonstration, continue execution without model

## 4. Image Perturbation Functions

Define functions for generating perturbed versions of medical images.

In [None]:
def apply_perturbation(image_path, perturbation_type, output_dir, params=None):
    """
    Apply a perturbation to an image.
    
    Args:
        image_path (str): Path to the original image
        perturbation_type (str): Type of perturbation to apply
        output_dir (str): Directory to save the perturbed image
        params (dict, optional): Parameters for the perturbation
        
    Returns:
        tuple: (perturbed_image_path, ssim_value)
    """
    # Read the original image
    img = cv2.imread(image_path)
    if img is None:
        print(f"Error reading image: {image_path}")
        return None, None
    
    # Set default parameters if none provided
    if params is None:
        params = {}
    
    # Apply the perturbation
    perturbed_img = None
    
    if perturbation_type == 'gaussian_noise':
        stddev = params.get('stddev', 25)
        perturbed_img = ImagePerturbation.add_gaussian_noise(img, stddev=stddev)
        
    elif perturbation_type == 'checkerboard':
        patch_size = params.get('patch_size', 100)
        square_size = params.get('square_size', 25)
        fill = params.get('fill', 128)
        tiled = params.get('tiled', False)
        checker = ImagePerturbation.make_checkerboard(patch_size, square_size, fill)
        perturbed_img = ImagePerturbation.overlay_checkerboard(img, checker, tiled)
        
    elif perturbation_type == 'moire':
        freq = params.get('freq', 0.1)
        alpha = params.get('alpha', 0.3)
        perturbed_img = ImagePerturbation.overlay_moire_pattern(img, freq=freq, alpha=alpha)
        
    elif perturbation_type == 'arrow':
        perturbed_img = ImagePerturbation.add_random_arrow(img)
        
    elif perturbation_type == 'compression':
        quality = params.get('quality', 30)
        perturbed_img = ImagePerturbation.simulate_compression_artifacts(img, quality=quality)
        
    else:
        # For other perturbation types, use the generic perturb_image method
        perturbed_img = ImagePerturbation.perturb_image(img, technique=perturbation_type, **params)
    
    if perturbed_img is None:
        print(f"Error applying perturbation: {perturbation_type}")
        return None, None
    
    # Calculate SSIM
    ssim_value = ImagePerturbation.compute_ssim(img, perturbed_img)
    
    # Save the perturbed image
    filename = os.path.basename(image_path)
    base_name, ext = os.path.splitext(filename)
    perturbed_filename = f"{base_name}_{perturbation_type}{ext}"
    perturbed_path = os.path.join(output_dir, perturbed_filename)
    cv2.imwrite(perturbed_path, perturbed_img)
    
    return perturbed_path, ssim_value

In [None]:
def load_image(image_path):
    """
    Load an image from file.
    
    Args:
        image_path (str): Path to the image file
        
    Returns:
        PIL.Image: Loaded image
    """
    try:
        img = Image.open(image_path).convert('RGB')
        return img
    except Exception as e:
        print(f"Error loading image {image_path}: {e}")
        return None

def generate_response(model, processor, image, prompt):
    """
    Generate a response from the model for the given image and prompt.
    
    Args:
        model: CheXagent model
        processor: CheXagent processor
        image (PIL.Image): Input image
        prompt (str): Text prompt
        
    Returns:
        str: Model's response
    """
    try:
        # Process the image and prompt
        inputs = processor(text=prompt, images=image, return_tensors="pt")
        inputs = {k: v.to(model.device) for k, v in inputs.items()}
        
        # Generate response
        with torch.no_grad():
            output = model.generate(
                **inputs,
                max_new_tokens=512,
                temperature=0.1,
                do_sample=True,
                top_p=0.9,
            )
        
        # Decode and clean the response
        response = processor.decode(output[0], skip_special_tokens=True)
        response = response.replace(prompt, "").strip()
        
        return response
    except Exception as e:
        print(f"Error generating response: {e}")
        return f"Error: {str(e)}"

def clear_gpu_memory():
    """
    Clear GPU memory to prevent out-of-memory errors during batch processing.
    """
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        print("Cleared GPU memory")

## 5. Image Perturbation Creation

Create perturbed versions of medical images for testing.

In [None]:
# Define perturbation types and parameters to test
perturbation_configs = {
    'gaussian_noise': [
        {'stddev': 15},
        {'stddev': 25}
    ],
    'checkerboard': [
        {'patch_size': 100, 'square_size': 25, 'fill': 128, 'tiled': False},
        {'patch_size': 100, 'square_size': 25, 'fill': 128, 'tiled': True}
    ],
    'moire': [
        {'freq': 0.1, 'alpha': 0.3},
        {'freq': 0.2, 'alpha': 0.4}
    ],
    'arrow': [{}]  # No parameters needed
}

# Fetch original questions for perturbation
questions = fetch_questions(condition='original', limit=20)

In [None]:
def create_perturbed_images(questions_df, perturbation_configs):
    """
    Create perturbed versions of images for all questions.
    
    Args:
        questions_df (pd.DataFrame): DataFrame of questions with image paths
        perturbation_configs (dict): Dictionary of perturbation types and parameters
        
    Returns:
        pd.DataFrame: DataFrame with perturbation metadata
    """
    results = []
    
    for _, row in tqdm(questions_df.iterrows(), total=len(questions_df)):
        # Get image path
        image_rel_path = row['image']
        image_full_path = os.path.join(base_dir, image_rel_path)
        
        if not os.path.exists(image_full_path):
            print(f"Image not found: {image_full_path}")
            continue
        
        # Apply each perturbation type with different parameters
        for perturb_type, params_list in perturbation_configs.items():
            perturb_type_dir = perturb_dirs[perturb_type]
            
            for params in params_list:
                # Apply perturbation
                perturbed_path, ssim_value = apply_perturbation(
                    image_path=image_full_path,
                    perturbation_type=perturb_type,
                    output_dir=perturb_type_dir,
                    params=params
                )
                
                if perturbed_path is None or ssim_value is None:
                    continue
                
                # Get relative path for database storage
                perturbed_rel_path = os.path.relpath(perturbed_path, perturb_dir)
                perturbed_rel_path = f"perturbed_files/{perturbed_rel_path}"
                
                # Store metadata in database
                try:
                    metadata_id = store_perturbation_metadata(
                        original_image_path=image_rel_path,
                        perturbed_image_path=perturbed_rel_path,
                        perturbation_type=perturb_type,
                        parameters=params,
                        ssim_value=ssim_value
                    )
                    
                    # Store result for return
                    results.append({
                        'metadata_id': metadata_id,
                        'question_id': row['question_id'],
                        'original_image': image_rel_path,
                        'perturbed_image': perturbed_rel_path,
                        'perturbation_type': perturb_type,
                        'parameters': params,
                        'ssim': ssim_value
                    })
                    
                except Exception as e:
                    print(f"Error storing perturbation metadata: {e}")
    
    return pd.DataFrame(results)

In [None]:
# Create perturbed images
# Note: This may take some time depending on the number of images and perturbations
perturbation_results = create_perturbed_images(questions, perturbation_configs)

# Display some of the results
if not perturbation_results.empty:
    print(f"Created {len(perturbation_results)} perturbed images")
    print("\nPerturbation types:")
    print(perturbation_results['perturbation_type'].value_counts())
    print("\nSSIM statistics:")
    print(f"Min SSIM: {perturbation_results['ssim'].min():.4f}")
    print(f"Mean SSIM: {perturbation_results['ssim'].mean():.4f}")
    print(f"Max SSIM: {perturbation_results['ssim'].max():.4f}")

In [None]:
# Visualize some example perturbations
def visualize_perturbation(original_path, perturbed_path, perturbation_type, ssim_value):
    """
    Visualize an original image and its perturbed version.
    
    Args:
        original_path (str): Path to the original image
        perturbed_path (str): Path to the perturbed image
        perturbation_type (str): Type of perturbation
        ssim_value (float): SSIM value
    """
    # Read images
    original_img = cv2.imread(os.path.join(base_dir, original_path))
    perturbed_img = cv2.imread(os.path.join(perturb_dir, perturbed_path.replace('perturbed_files/', '')))
    
    if original_img is None or perturbed_img is None:
        print("Error reading images for visualization")
        return
    
    # Convert from BGR to RGB for display
    original_img = cv2.cvtColor(original_img, cv2.COLOR_BGR2RGB)
    perturbed_img = cv2.cvtColor(perturbed_img, cv2.COLOR_BGR2RGB)
    
    # Create figure
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 6))
    
    # Display original image
    ax1.imshow(original_img)
    ax1.set_title('Original Image')
    ax1.axis('off')
    
    # Display perturbed image
    ax2.imshow(perturbed_img)
    ax2.set_title(f'Perturbed ({perturbation_type}, SSIM: {ssim_value:.4f})')
    ax2.axis('off')
    
    plt.tight_layout()
    plt.show()

# Visualize examples of each perturbation type
if not perturbation_results.empty:
    for perturb_type in perturbation_configs.keys():
        example = perturbation_results[perturbation_results['perturbation_type'] == perturb_type].iloc[0]
        visualize_perturbation(
            original_path=example['original_image'],
            perturbed_path=example['perturbed_image'],
            perturbation_type=example['perturbation_type'],
            ssim_value=example['ssim']
        )

## 6. Model Evaluation on Perturbed Images

Process perturbed images through the model and compare with baseline performance.

In [None]:
def process_perturbed_images(perturbation_df, questions_df, model_name="StanfordAIMI/CheXagent-8b", batch_size=5, total_limit=20):
    """
    Process perturbed images through the model.
    
    Args:
        perturbation_df (pd.DataFrame): DataFrame with perturbation metadata
        questions_df (pd.DataFrame): DataFrame with questions
        model_name (str): Name of the model
        batch_size (int): Number of images to process before clearing memory
        total_limit (int): Maximum number of images to process
        
    Returns:
        list: List of response records
    """
    results = []
    processed_count = 0
    
    # Merge perturbation data with questions
    merged_df = pd.merge(
        perturbation_df,
        questions_df[['question_id', 'id', 'text']],
        on='question_id'
    )
    
    # Limit the number of images to process
    merged_df = merged_df.head(total_limit)
    
    for batch_start in range(0, len(merged_df), batch_size):
        # Get batch of images
        batch_end = min(batch_start + batch_size, len(merged_df))
        batch = merged_df.iloc[batch_start:batch_end]
        
        print(f"Processing batch {batch_start//batch_size + 1}/{(len(merged_df)-1)//batch_size + 1}")
        
        for _, row in tqdm(batch.iterrows(), total=len(batch)):
            uid = row['id']
            question_id = row['question_id']
            question = row['text']
            perturbed_image_path = row['perturbed_image']
            perturbation_type = row['perturbation_type']
            
            # Form question category
            question_category = f"visual_perturb_{perturbation_type}"
            
            # Check if this has already been evaluated
            if check_existing_response(uid, question_id, model_name, question_category):
                print(f"Skipping already processed question {question_id}")
                continue
            
            # Form full image path
            full_image_path = os.path.join(perturb_dir, perturbed_image_path.replace('perturbed_files/', ''))
            
            # Check if image exists
            if not os.path.exists(full_image_path):
                print(f"Image not found: {full_image_path}")
                continue
            
            # Load image
            img = load_image(full_image_path)
            if img is None:
                print(f"Skipping image {perturbed_image_path} due to loading error")
                continue
                
            # Generate response
            response = generate_response(model, processor, img, question)
            
            # Store response in database
            try:
                record_id = store_model_response(
                    uid=uid,
                    question_id=question_id,
                    question=question,
                    question_category=question_category,
                    actual_answer=None,  # No ground truth available
                    model_name=model_name,
                    model_answer=response,
                    image_link=perturbed_image_path
                )
                
                results.append({
                    'record_id': record_id,
                    'question_id': question_id,
                    'question': question,
                    'perturbation_type': perturbation_type,
                    'response': response
                })
                
                processed_count += 1
            except Exception as e:
                print(f"Error storing response: {e}")
        
        # Clear GPU memory after each batch
        clear_gpu_memory()
    
    print(f"Processed {processed_count} perturbed images")
    return results

In [None]:
# Process perturbed images through the model
if not perturbation_results.empty:
    perturbed_results = process_perturbed_images(
        perturbation_df=perturbation_results,
        questions_df=questions,
        model_name="StanfordAIMI/CheXagent-8b",
        batch_size=5,
        total_limit=20
    )

## 7. Comparative Analysis

Compare responses from original and perturbed images to analyze the effect of perturbations.

In [None]:
def fetch_comparative_responses(model_name="StanfordAIMI/CheXagent-8b", limit=100):
    """
    Fetch baseline and perturbed responses for comparative analysis.
    
    Args:
        model_name (str): Name of the model
        limit (int): Maximum number of responses to fetch
        
    Returns:
        pd.DataFrame: DataFrame containing paired responses
    """
    query = f"""
    WITH original_responses AS (
        SELECT uid, question_id, question, model_answer, image_link
        FROM mimicxp.model_responses_r2
        WHERE model_name = '{model_name}' AND question_category = 'original'
    ),
    perturbed_responses AS (
        SELECT uid, question_id, question_category, model_answer, image_link
        FROM mimicxp.model_responses_r2
        WHERE model_name = '{model_name}' AND question_category LIKE 'visual_perturb_%'
    )
    SELECT 
        o.uid, o.question_id, o.question, 
        o.model_answer AS original_answer,
        p.model_answer AS perturbed_answer,
        p.question_category,
        o.image_link AS original_image,
        p.image_link AS perturbed_image
    FROM original_responses o
    JOIN perturbed_responses p ON o.uid = p.uid AND o.question_id = p.question_id
    LIMIT {limit}
    """
    
    with engine.connect() as conn:
        df = pd.read_sql(query, conn)
    
    print(f"Fetched {len(df)} paired responses from database")
    return df

# Fetch comparative responses
comparative_df = fetch_comparative_responses(model_name="StanfordAIMI/CheXagent-8b", limit=100)

In [None]:
# Display some examples
if not comparative_df.empty:
    # Extract perturbation type from question_category
    comparative_df['perturbation_type'] = comparative_df['question_category'].apply(lambda x: x.replace('visual_perturb_', ''))
    
    # Group by perturbation type
    perturbation_types = comparative_df['perturbation_type'].unique()
    
    for perturb_type in perturbation_types[:2]:  # Show examples for first two perturbation types
        example = comparative_df[comparative_df['perturbation_type'] == perturb_type].iloc[0]
        
        print(f"\nExample for {perturb_type} perturbation:")
        print(f"Question: {example['question']}")
        print("\nOriginal image response:")
        print(example['original_answer'])
        print("\nPerturbed image response:")
        print(example['perturbed_answer'])
        print("\n" + "-"*80)
        
        # Display the images side by side
        visualize_perturbation(
            original_path=example['original_image'],
            perturbed_path=example['perturbed_image'],
            perturbation_type=perturb_type,
            ssim_value=perturbation_results[
                (perturbation_results['perturbation_type'] == perturb_type) & 
                (perturbation_results['perturbed_image'] == example['perturbed_image'])
            ]['ssim'].values[0] if not perturbation_results.empty else 0
        )

In [None]:
# Calculate response similarity metrics
if not comparative_df.empty:
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity
    
    def calculate_similarity(text1, text2):
        """
        Calculate cosine similarity between two texts.
        
        Args:
            text1 (str): First text
            text2 (str): Second text
            
        Returns:
            float: Cosine similarity
        """
        vectorizer = TfidfVectorizer()
        tfidf_matrix = vectorizer.fit_transform([text1, text2])
        return cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[1:2])[0][0]
    
    # Calculate similarity for each pair
    comparative_df['similarity'] = comparative_df.apply(
        lambda row: calculate_similarity(row['original_answer'], row['perturbed_answer']),
        axis=1
    )
    
    # Group by perturbation type and calculate average similarity
    similarity_by_type = comparative_df.groupby('perturbation_type')['similarity'].agg(['mean', 'std', 'min', 'max'])
    print("Response similarity by perturbation type:")
    print(similarity_by_type)
    
    # Visualize similarity by perturbation type
    plt.figure(figsize=(10, 6))
    similarity_by_type['mean'].plot(kind='bar', yerr=similarity_by_type['std'], capsize=5)
    plt.title('Response Similarity by Perturbation Type')
    plt.xlabel('Perturbation Type')
    plt.ylabel('Cosine Similarity')
    plt.ylim(0, 1)
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()

## 8. Summary and Next Steps

In this notebook, we've evaluated the CheXagent model's robustness to various visual perturbations.

### Key Findings
- Created perturbed versions of chest X-ray images using different perturbation methods
- Evaluated CheXagent-8b performance on these perturbed images
- Compared responses between original and perturbed images to measure impact
- Identified perturbation types that most affect model performance

### Perturbation Impact Analysis
- Gaussian noise: [Brief summary of impact]
- Checkerboard patterns: [Brief summary of impact]
- Moiré patterns: [Brief summary of impact]
- Arrow artifacts: [Brief summary of impact]

### Next Steps
- Proceed to notebook `04_model_evaluation_gpt_baseline.ipynb` to evaluate GPT-4 Vision on the same dataset
- Then continue with Claude and other models for comparison
- Finally, use the VSF-Med framework for comprehensive vulnerability scoring across all models

This analysis provides insights into the visual robustness of medical vision-language models and helps identify potential vulnerabilities that could affect clinical applications.