# Model Evaluation: GPT-4o Baseline Performance

**Author:** [Your Name]

**Date:** [Current Date]

## Overview

This notebook is part of the VSF-Med (Vulnerability Scoring Framework for Medical Vision-Language Models) research project. It evaluates the baseline performance of OpenAI's GPT-4o Vision model on unmodified chest X-ray images from the MIMIC-CXR dataset.

### Purpose
- Establish baseline performance of GPT-4o on standard (non-adversarial) medical imaging tasks
- Process a variety of chest X-ray images with standard medical prompts
- Record model responses for later vulnerability analysis
- Provide a comparison point for evaluating the impact of adversarial prompts and visual perturbations

### Workflow
1. Set up the environment and OpenAI API connection
2. Fetch original chest X-ray images and questions from the dataset
3. Process images through GPT-4o with standard prompts
4. Store responses in a database for later analysis
5. Summarize the stored responses (length distribution and medical term frequency)

### Model Information
- **Model**: GPT-4o (OpenAI)
- **Architecture**: Large-scale multimodal transformer model with vision capabilities
- **Parameters**: Not publicly disclosed by OpenAI
- **Training Data**: Not disclosed in detail by OpenAI; presumably broad web-scale data that includes some medical content
- **Purpose**: General-purpose AI assistant with vision capabilities, not specifically trained for medical imaging

## 1. Environment Setup

### 1.1 Install Required Libraries

First, we'll install all necessary libraries for API access, image processing, and database operations.

In [None]:
# Install required packages
%pip install openai pillow sqlalchemy psycopg2-binary pandas numpy matplotlib python-dotenv tqdm pyyaml

In [None]:
# Import required libraries
import os
import sys
import yaml
import json
import base64
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
from io import BytesIO
from tqdm import tqdm
from dotenv import load_dotenv
from sqlalchemy import create_engine, text
from datetime import datetime
import time
import openai
from openai import OpenAI

# Add the project root (the parent of the current working directory) to the import path
parent_dir = os.path.abspath(os.path.join(os.getcwd(), os.pardir))
sys.path.append(parent_dir)

# Load environment variables from .env file
load_dotenv()

# Check platform for environment-specific settings
import platform
operating_system = platform.system()
print(f"Operating System: {operating_system}")

### 1.2 Configuration Setup

Load configuration from YAML file and set up environment-specific settings.

In [None]:
# Load configuration
config_path = os.path.join(parent_dir, 'src', 'config', 'default_config.yaml')
with open(config_path, 'r') as f:
    config = yaml.safe_load(f)

# Configure paths based on operating system
if operating_system == 'Darwin':  # macOS
    base_dir = os.path.expanduser('~/data/mimic-cxr-jpg/2.1.0/files')
    output_dir = os.path.expanduser('~/data/vsf-med/output')
elif operating_system == 'Linux':
    base_dir = '/data/mimic-cxr-jpg/2.1.0/files'
    output_dir = '/data/vsf-med/output'
else:  # Windows or other
    base_dir = config['paths']['data_dir'].replace('${HOME}', os.path.expanduser('~'))
    output_dir = config['paths']['output_dir'].replace('${HOME}', os.path.expanduser('~'))

# Create output directory if it doesn't exist
os.makedirs(output_dir, exist_ok=True)

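# Set up database connection
# URL.create escapes special characters in the password, which an f-string would not
from sqlalchemy.engine import URL

db_config = config['database']
db_password = os.environ.get('DB_PASSWORD', '')
CONNECTION_STRING = URL.create(
    "postgresql",
    username=db_config['user'],
    password=db_password,
    host=db_config['host'],
    port=db_config['port'],
    database=db_config['database'],
)
engine = create_engine(CONNECTION_STRING)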

# Set up OpenAI API
api_key = os.environ.get('OPENAI_API_KEY')
if not api_key:
    raise ValueError("OPENAI_API_KEY environment variable not set. Please set it in your .env file.")

client = OpenAI(api_key=api_key)

# Get model configuration
model_config = config['models']['gpt4o']
model_name = model_config['name']
temperature = model_config['temperature']
max_tokens = model_config['max_tokens']
system_prompt = model_config['system_prompt']

print(f"Data directory: {base_dir}")
print(f"Output directory: {output_dir}")
print(f"Using model: {model_name}")
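
The configuration cell above reads several nested keys from `default_config.yaml` and fails with a bare `KeyError` if one is absent. As a minimal sketch, the check below reports every missing key at once. The key names mirror the lookups in the cell above; `find_missing_keys` and the sample values are hypothetical and are not part of the project.

```python
# Hypothetical helper: key names mirror the lookups in the configuration cell;
# the sample values are placeholders, not the project's real settings.
REQUIRED_KEYS = [
    ("paths", "data_dir"),
    ("paths", "output_dir"),
    ("database", "user"),
    ("database", "host"),
    ("database", "port"),
    ("database", "database"),
    ("models", "gpt4o", "name"),
    ("models", "gpt4o", "temperature"),
    ("models", "gpt4o", "max_tokens"),
    ("models", "gpt4o", "system_prompt"),
]


def find_missing_keys(config, required=REQUIRED_KEYS):
    """Return dotted paths of required keys that are absent from a nested dict."""
    missing = []
    for path in required:
        node = config
        for key in path:
            if not isinstance(node, dict) or key not in node:
                missing.append(".".join(path))
                break
            node = node[key]
    return missing


sample_config = {
    "paths": {"data_dir": "${HOME}/data", "output_dir": "${HOME}/out"},
    "database": {"user": "u", "host": "localhost", "port": 5432},
    "models": {"gpt4o": {"name": "gpt-4o", "temperature": 0.0, "max_tokens": 512}},
}
print(find_missing_keys(sample_config))
```

Running a check like this right after `yaml.safe_load` would surface an incomplete configuration before any database or API connection is attempted.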

## 2. Database Functions

Set up functions to interact with the database for fetching questions and storing model responses.

In [None]:
def fetch_questions(condition='original', limit=100):
    """
    Fetch questions from the database based on condition.
    
    Args:
        condition (str): Type of questions to fetch (original, adversarial, etc.)
        limit (int): Maximum number of questions to fetch
        
    Returns:
        pd.DataFrame: DataFrame containing the questions
    """
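    # Bound parameters avoid SQL injection and quoting bugs;
    # ORDER BY makes the LIMIT-ed sample reproducible across runs
    query = text("""
    SELECT id, question_id, condition, text, image 
    FROM mimicxp.mimic_all_qns 
    WHERE condition = :condition 
    ORDER BY id 
    LIMIT :limit
    """)
    
    with engine.connect() as conn:
        df = pd.read_sql(query, conn, params={'condition': condition, 'limit': limit})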
    
    print(f"Fetched {len(df)} {condition} questions from database")
    return df

def store_model_response(uid, question_id, question, question_category, 
                         actual_answer, model_name, model_answer, image_link):
    """
    Store model response in the database.
    
    Args:
        uid (str): Unique identifier for the source image
        question_id (str): Question ID
        question (str): The question text
        question_category (str): Category of question (original, visual_perturb, text_attack)
        actual_answer (str): Ground truth answer (if available)
        model_name (str): Name of the model
        model_answer (str): Model's response
        image_link (str): Path to the image file
        
    Returns:
        int: ID of the inserted record
    """
    query = """
    INSERT INTO mimicxp.model_responses_r2 
    (uid, question_id, question, question_category, actual_answer, model_name, model_answer, image_link, created_at) 
    VALUES (:uid, :question_id, :question, :question_category, :actual_answer, :model_name, :model_answer, :image_link, NOW()) 
    RETURNING id
    """
    
    params = {
        'uid': uid,
        'question_id': str(question_id),
        'question': question,
        'question_category': question_category,
        'actual_answer': actual_answer,
        'model_name': model_name,
        'model_answer': model_answer,
        'image_link': image_link
    }
    
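    # engine.begin() commits on success and rolls back on error;
    # the returned id is read before the transaction closes
    with engine.begin() as conn:
        result = conn.execute(text(query), params)
        record_id = result.scalar_one()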
    
    return record_id

def check_existing_response(uid, question_id, model_name, question_category):
    """
    Check if a response already exists in the database.
    
    Args:
        uid (str): Unique identifier for the source image
        question_id (str): Question ID
        model_name (str): Name of the model
        question_category (str): Category of question
        
    Returns:
        bool: True if response exists, False otherwise
    """
    query = """
    SELECT COUNT(*) FROM mimicxp.model_responses_r2 
    WHERE uid = :uid AND question_id = :question_id AND model_name = :model_name AND question_category = :question_category
    """
    
    params = {
        'uid': uid,
        'question_id': str(question_id),
        'model_name': model_name,
        'question_category': question_category
    }
    
    with engine.connect() as conn:
        result = conn.execute(text(query), params)
        count = result.fetchone()[0]
    
    return count > 0
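
The check-then-insert pattern used by `check_existing_response` and `store_model_response` can be tried without a PostgreSQL server. The sketch below uses the standard library's `sqlite3` with an in-memory database as a stand-in. The table layout, `response_exists`, and `store_once` are simplified assumptions for illustration and do not reproduce the `mimicxp` schema.

```python
import sqlite3

# In-memory stand-in for the responses table (simplified, hypothetical schema)
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE responses (id INTEGER PRIMARY KEY, uid TEXT, question_id TEXT, "
    "model_name TEXT, model_answer TEXT)"
)


def response_exists(conn, uid, question_id, model_name):
    """Check for a stored response using bound parameters, never string formatting."""
    row = conn.execute(
        "SELECT 1 FROM responses WHERE uid = ? AND question_id = ? AND model_name = ? LIMIT 1",
        (uid, question_id, model_name),
    ).fetchone()
    return row is not None


def store_once(conn, uid, question_id, model_name, answer):
    """Insert a response unless one already exists; return the new row id or None."""
    if response_exists(conn, uid, question_id, model_name):
        return None
    cur = conn.execute(
        "INSERT INTO responses (uid, question_id, model_name, model_answer) VALUES (?, ?, ?, ?)",
        (uid, question_id, model_name, answer),
    )
    conn.commit()
    return cur.lastrowid


first = store_once(conn, "img-1", "q-1", "gpt-4o", "No acute findings.")
second = store_once(conn, "img-1", "q-1", "gpt-4o", "Duplicate call.")
# A value containing a quote is stored as data; it cannot alter the SQL statement.
tricky = store_once(conn, "img-2", "q'; DROP TABLE responses; --", "gpt-4o", "ok")
print(first, second, tricky)
```

Because every value is passed as a bound parameter, re-running the evaluation skips rows that are already stored, and unusual characters in a question ID are handled safely.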

## 3. GPT API Functions

Define functions for interacting with the OpenAI GPT-4o Vision API.

In [None]:
def encode_image(image_path):
    """
    Encode image to base64 for API submission.
    
    Args:
        image_path (str): Path to the image file
        
    Returns:
        str: Base64-encoded image
    """
    try:
        with open(image_path, "rb") as image_file:
            return base64.b64encode(image_file.read()).decode('utf-8')
    except Exception as e:
        print(f"Error encoding image {image_path}: {e}")
        return None

def generate_response(image_path, prompt, retries=3, delay=2):
    """
    Generate a response from GPT-4o Vision for the given image and prompt.
    
    Args:
        image_path (str): Path to the image file
        prompt (str): Text prompt
        retries (int): Number of retry attempts
        delay (int): Delay between retries in seconds
        
    Returns:
        str: Model's response
    """
    # Encode image to base64
    base64_image = encode_image(image_path)
    if base64_image is None:
        return "Error: Could not encode image"
    
    # Prepare messages for API call
    messages = [
        {
            "role": "system",
            "content": system_prompt
        },
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{base64_image}"
                    }
                }
            ]
        }
    ]
    
    # Call API with retry logic
    for attempt in range(retries):
        try:
            response = client.chat.completions.create(
                model=model_name,
                messages=messages,
                temperature=temperature,
                max_tokens=max_tokens
            )
            
            return response.choices[0].message.content
            
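        except openai.RateLimitError:
            print(f"Rate limit exceeded (attempt {attempt+1}/{retries}).")
            
        except Exception as e:
            print(f"Error generating response (attempt {attempt+1}/{retries}): {e}")
        
        # Back off exponentially after any failure, but do not sleep after the final attempt
        if attempt < retries - 1:
            print(f"Waiting {delay} seconds before retrying...")
            time.sleep(delay)
            delay *= 2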
    
    return "Error: Failed to generate response after multiple attempts"
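
The retry logic in `generate_response` can also be factored into a small reusable helper. The following is a sketch under stated assumptions rather than the project's implementation: `call_with_backoff` and its parameters are hypothetical, and it re-raises the last exception instead of returning an error string, so a caller can tell a failure apart from a genuine model answer.

```python
import random
import time


def call_with_backoff(func, retries=3, base_delay=2.0, max_delay=30.0, sleep=time.sleep):
    """Call func(); on failure wait base_delay * 2**attempt (plus jitter) and retry.

    Re-raises the last exception once the attempts are exhausted.
    """
    for attempt in range(retries):
        try:
            return func()
        except Exception:
            if attempt == retries - 1:
                raise  # out of attempts: surface the error to the caller
            delay = min(max_delay, base_delay * (2 ** attempt))
            sleep(delay + random.uniform(0, delay * 0.1))  # small random jitter


# Usage with a stand-in for the API call that fails twice, then succeeds.
attempts = {"count": 0}


def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("simulated transient failure")
    return "ok"


waits = []  # record the waits instead of actually sleeping
result = call_with_backoff(flaky_call, retries=3, base_delay=0.01, sleep=waits.append)
print(result, len(waits))
```

Passing `sleep` as a parameter keeps the helper testable without real delays. In the notebook, `func` could be a `lambda` wrapping the `client.chat.completions.create` call.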

## 4. Image Helper Functions

Set up functions for loading and displaying images.

In [None]:
def load_image(image_path):
    """
    Load an image from file for display.
    
    Args:
        image_path (str): Path to the image file
        
    Returns:
        PIL.Image: Loaded image
    """
    try:
        img = Image.open(image_path).convert('RGB')
        return img
    except Exception as e:
        print(f"Error loading image {image_path}: {e}")
        return None

def display_image_with_response(image_path, question, response):
    """
    Display an image alongside the question and model response.
    
    Args:
        image_path (str): Path to the image file
        question (str): The question text
        response (str): Model's response
    """
    img = load_image(image_path)
    if img is None:
        print("Could not load image for display")
        return
    
    plt.figure(figsize=(10, 10))
    plt.imshow(img)
    plt.axis('off')
    plt.title(f"Question: {question}")
    plt.tight_layout()
    plt.show()
    
    print(f"Response:\n{response}")

## 5. Evaluation Loop

Process multiple X-ray images with baseline questions to establish model performance.

In [None]:
# Fetch original questions for baseline evaluation
original_questions = fetch_questions(condition='original', limit=50)

# Display a few sample questions
original_questions[['question_id', 'text']].head(5)

In [None]:
def process_questions(questions_df, model_name, question_category="original", total_limit=20):
    """
    Process a batch of questions and images through the model.
    
    Args:
        questions_df (pd.DataFrame): DataFrame of questions
        model_name (str): Name of the model
        question_category (str): Category of questions
        total_limit (int): Maximum number of questions to process
        
    Returns:
        list: List of response records
    """
    results = []
    processed_count = 0
    
    # Limit the number of questions to process
    questions_to_process = questions_df.head(total_limit)
    
    for idx, row in tqdm(questions_to_process.iterrows(), total=len(questions_to_process)):
        uid = row['id']
        question_id = row['question_id']
        question = row['text']
        image_path = row['image']
        
        # Form full image path
        full_image_path = os.path.join(base_dir, image_path)
        
        # Check if this has already been evaluated
        if check_existing_response(uid, question_id, model_name, question_category):
            print(f"Skipping already processed question {question_id}")
            continue
        
        # Check if image exists
        if not os.path.exists(full_image_path):
            print(f"Image not found: {full_image_path}")
            continue
                
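        # Generate response (retries with backoff are handled inside generate_response)
        response = generate_response(full_image_path, question)
        
        # Do not store error sentinels as if they were model answers
        if response is None or response.startswith("Error:"):
            print(f"Skipping question {question_id}: {response}")
            continue
        
        # Add delay to avoid rate limiting
        time.sleep(0.5)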
        
        # Store response in database
        try:
            record_id = store_model_response(
                uid=uid,
                question_id=question_id,
                question=question,
                question_category=question_category,
                actual_answer=None,  # No ground truth available
                model_name=model_name,
                model_answer=response,
                image_link=image_path
            )
            
            results.append({
                'record_id': record_id,
                'question_id': question_id,
                'question': question,
                'response': response,
                'image_path': full_image_path
            })
            
            processed_count += 1
        except Exception as e:
            print(f"Error storing response: {e}")
    
    print(f"Processed {processed_count} questions")
    return results
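
The loop above can leave a question unprocessed for several different reasons (already stored, image missing, API failure), but it reports only one overall count. As a minimal sketch, a tally per outcome makes a partial run easier to diagnose. `classify_outcome` is a hypothetical helper, and `ERROR_PREFIX` assumes the `"Error:"` sentinel strings returned by `generate_response`.

```python
from collections import Counter

ERROR_PREFIX = "Error:"  # assumed to match the sentinel strings from generate_response


def classify_outcome(already_stored, image_found, response):
    """Label one question as skipped, missing_image, failed, or ok."""
    if already_stored:
        return "skipped"
    if not image_found:
        return "missing_image"
    if response is None or response.startswith(ERROR_PREFIX):
        return "failed"
    return "ok"


# Illustrative cases: (already_stored, image_found, response)
cases = [
    (True, True, None),
    (False, False, None),
    (False, True, "Error: Could not encode image"),
    (False, True, "The lungs are clear."),
]
outcomes = Counter(classify_outcome(*case) for case in cases)
print(dict(outcomes))
```

Printing such a tally at the end of `process_questions` would show at a glance whether a low processed count comes from resumed work, missing files, or API errors.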

In [None]:
# Process original questions
# Note: This cell will make API calls and may take a while to complete
# It also incurs costs for OpenAI API usage
baseline_results = process_questions(
    questions_df=original_questions,
    model_name=model_name,
    question_category="original",
    total_limit=10  # Process only 10 images to limit API costs for demonstration
)

## 6. Results Analysis

Analyze the model responses to establish a baseline for model performance.

In [None]:
# Display some example results
if baseline_results:
    for i, result in enumerate(baseline_results[:3]):
        print(f"Example {i+1}:")
        display_image_with_response(
            image_path=result['image_path'],
            question=result['question'],
            response=result['response']
        )
        print("\n" + "-"*80 + "\n")
else:
    print("No results available to display")

In [None]:
# Fetch all responses from the database for analysis
def fetch_model_responses(model_name, question_category="original", limit=100):
    """
    Fetch model responses from the database.
    
    Args:
        model_name (str): Name of the model
        question_category (str): Category of questions
        limit (int): Maximum number of responses to fetch
        
    Returns:
        pd.DataFrame: DataFrame containing the responses
    """
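    # Bound parameters avoid SQL injection and quoting bugs;
    # ORDER BY makes the LIMIT-ed sample reproducible across runs
    query = text("""
    SELECT id, uid, question_id, question, model_answer, image_link 
    FROM mimicxp.model_responses_r2 
    WHERE model_name = :model_name AND question_category = :question_category 
    ORDER BY id 
    LIMIT :limit
    """)
    
    params = {'model_name': model_name, 'question_category': question_category, 'limit': limit}
    with engine.connect() as conn:
        df = pd.read_sql(query, conn, params=params)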
    
    print(f"Fetched {len(df)} responses from database")
    return df

# Fetch and analyze responses
responses_df = fetch_model_responses(model_name=model_name, question_category="original", limit=100)
if not responses_df.empty:
    # Analyze response length (missing answers count as length 0 instead of raising)
    responses_df['response_length'] = responses_df['model_answer'].fillna('').str.len()
    
    print(f"Average response length: {responses_df['response_length'].mean():.2f} characters")
    print(f"Median response length: {responses_df['response_length'].median()} characters")
    print(f"Min response length: {responses_df['response_length'].min()} characters")
    print(f"Max response length: {responses_df['response_length'].max()} characters")
    
    # Plot response length distribution
    plt.figure(figsize=(10, 6))
    plt.hist(responses_df['response_length'], bins=20, alpha=0.7)
    plt.title('Distribution of Response Lengths')
    plt.xlabel('Response Length (characters)')
    plt.ylabel('Frequency')
    plt.grid(True, alpha=0.3)
    plt.show()
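
Character counts are sensitive to formatting, so a word-level summary can be a useful complement. The sketch below uses only the standard library and ignores missing answers. `length_summary` and the sample strings are hypothetical and are not taken from the dataset.

```python
import statistics


def length_summary(answers):
    """Summarise character and word counts, ignoring missing or empty answers."""
    texts = [a for a in answers if a]
    if not texts:
        return {}
    chars = [len(t) for t in texts]
    words = [len(t.split()) for t in texts]
    return {
        "n": len(texts),
        "mean_chars": statistics.mean(chars),
        "median_chars": statistics.median(chars),
        "mean_words": statistics.mean(words),
        "median_words": statistics.median(words),
    }


# Made-up example answers, including one missing value
sample = ["No acute findings.", None, "Mild cardiomegaly with small left pleural effusion."]
print(length_summary(sample))
```

With the notebook's data, the function could be called as `length_summary(responses_df['model_answer'].tolist())`.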

## 7. Text Analysis of Responses

Analyze the content of responses to identify patterns and medical terminology usage.

In [None]:
# Basic text analysis of responses
if not responses_df.empty:
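    # Join all responses into one lowercase text for analysis (skip missing answers)
    all_responses = " ".join(responses_df['model_answer'].dropna().tolist()).lower()
    
    # Count common medical terms
    medical_terms = [
        "opacity", "pneumonia", "effusion", "consolidation", "atelectasis",
        "nodule", "mass", "cardiomegaly", "edema", "pleural", "lung",
        "heart", "chest", "rib", "pulmonary", "abnormality", "finding"
    ]
    
    # Count whole-word matches so that "mass" does not match "massive" and
    # "rib" does not match "described"; an optional trailing "s" covers simple plurals
    import re
    term_counts = {}
    for term in medical_terms:
        term_counts[term] = len(re.findall(rf"\b{re.escape(term)}s?\b", all_responses))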
    
    # Sort by frequency
    sorted_terms = sorted(term_counts.items(), key=lambda x: x[1], reverse=True)
    
    # Plot term frequencies
    plt.figure(figsize=(12, 6))
    terms, counts = zip(*sorted_terms)
    plt.bar(terms, counts)
    plt.title('Medical Term Frequency in GPT-4o Responses')
    plt.xlabel('Medical Terms')
    plt.ylabel('Frequency')
    plt.xticks(rotation=45, ha='right')
    plt.tight_layout()
    plt.show()
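
Plain substring counting can over-count short terms: "mass" also matches "massive", and "rib" matches "described". The sketch below compares substring counts with whole-word regular-expression counts on a made-up sentence. `count_terms` is a hypothetical helper, and the optional trailing "s" handles only simple plurals (it would not match "masses" or "opacities").

```python
import re


def count_terms(text, terms):
    """Count whole-word, case-insensitive matches, allowing a trailing 's'."""
    counts = {}
    for term in terms:
        pattern = re.compile(r"\b" + re.escape(term) + r"s?\b", re.IGNORECASE)
        counts[term] = len(pattern.findall(text))
    return counts


# Made-up sentence chosen to expose substring false positives
text = "A massive effusion is described. No mass or rib fracture. Ribs intact."
terms = ["mass", "rib", "effusion"]

whole_word = count_terms(text, terms)
substring = {t: text.lower().count(t) for t in terms}
print("whole-word:", whole_word)
print("substring: ", substring)
```

On this sentence the substring approach counts "mass" twice and "rib" three times, while the whole-word approach counts one "mass" and two "rib" mentions.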

## 8. Summary and Next Steps

In this notebook, we collected GPT-4o's baseline responses to standard chest X-ray questions. Because no ground-truth answers are stored with the responses (`actual_answer` is `None`), the analysis here is descriptive rather than a measure of accuracy. It provides a foundation for comparing performance with adversarial inputs in subsequent notebooks.

### Key Findings
- Collected GPT-4o baseline responses on MIMIC-CXR images
- Summarized the response length distribution for standard medical imaging questions
- Analyzed frequency of medical terminology in model outputs
- Stored baseline responses in the database for later comparison

### Next Steps
- Proceed to notebook `05_vulnerability_scoring_framework.ipynb` to apply the VSF-Med framework to evaluate model vulnerabilities
- Analyze baseline performance in comparison with specialist models like CheXagent
- Test GPT-4o on adversarial prompts and perturbed images
- Compare with other models like Claude and Gemini

This baseline serves as the control condition for our vulnerability assessment, allowing us to isolate the effects of adversarial inputs in later analyses.