# Model Evaluation: CheXagent Baseline Performance

**Author:** [Your Name]

**Date:** [Current Date]

## Overview

This notebook is part of the VSF-Med (Vulnerability Scoring Framework for Medical Vision-Language Models) research project. It evaluates the baseline performance of the StanfordAIMI/CheXagent-8b model on unmodified chest X-ray images from the MIMIC-CXR dataset.

### Purpose
- Establish baseline performance of the CheXagent model on standard (non-adversarial) inputs
- Process a variety of chest X-ray images with medical prompts
- Record model responses for later vulnerability analysis
- Provide a comparison point for evaluating the impact of adversarial prompts and visual perturbations

### Workflow
1. Set up the environment and load the CheXagent model
2. Fetch original chest X-ray images and questions from the dataset
3. Process images through the model with standard prompts
4. Store responses in a database for later analysis
5. Summarize the stored responses (example outputs and response-length statistics)

### Model Information
- **Model**: StanfordAIMI/CheXagent-8b
- **Architecture**: Vision-Language Model optimized for chest radiograph interpretation
- **Parameters**: 8 billion parameters
- **Training Data**: Medical imaging datasets including chest X-rays
- **Purpose**: Assist in medical image interpretation and diagnosis

## 1. Environment Setup

### 1.1 Install Required Libraries

First, we'll install all necessary libraries for model inference, image processing, and database operations.

In [None]:
# Install required packages
!pip install torch transformers pillow sqlalchemy psycopg2-binary pandas numpy matplotlib python-dotenv tqdm pyyaml

In [None]:
# Import required libraries
import os
import sys
import yaml
import json
import torch
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
from tqdm import tqdm
from dotenv import load_dotenv
from sqlalchemy import create_engine, text
from datetime import datetime
from transformers import AutoProcessor, AutoModelForCausalLM

# Add the project root (parent of the notebook directory) to the path so modules under src/ can be imported
parent_dir = os.path.abspath(os.path.join(os.getcwd(), os.pardir))
sys.path.append(parent_dir)

# Load environment variables from .env file
load_dotenv()

# Check platform for environment-specific settings
import platform
operating_system = platform.system()
print(f"Operating System: {operating_system}")

### 1.2 Configuration Setup

Load the configuration from the YAML file and set up environment-specific settings (data paths, output directory, and database connection).

In [None]:
# Load configuration
config_path = os.path.join(parent_dir, 'src', 'config', 'default_config.yaml')
with open(config_path, 'r') as f:
    config = yaml.safe_load(f)

# Configure paths based on operating system
if operating_system == 'Darwin':  # macOS
    base_dir = os.path.expanduser('~/data/mimic-cxr-jpg/2.1.0/files')
    output_dir = os.path.expanduser('~/data/vsf-med/output')
elif operating_system == 'Linux':
    base_dir = '/data/mimic-cxr-jpg/2.1.0/files'
    output_dir = '/data/vsf-med/output'
else:  # Windows or other
    base_dir = config['paths']['data_dir'].replace('${HOME}', os.path.expanduser('~'))
    output_dir = config['paths']['output_dir'].replace('${HOME}', os.path.expanduser('~'))

# Create output directory if it doesn't exist
os.makedirs(output_dir, exist_ok=True)

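# Set up database connection
from sqlalchemy.engine import URL

db_config = config['database']
db_password = os.environ.get('DB_PASSWORD', '')
# Build the URL with URL.create so special characters in the password are escaped
CONNECTION_STRING = URL.create(
    drivername="postgresql",
    username=db_config['user'],
    password=db_password,
    host=db_config['host'],
    port=db_config['port'],
    database=db_config['database'],
)
engine = create_engine(CONNECTION_STRING)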

print(f"Data directory: {base_dir}")
print(f"Output directory: {output_dir}")
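
The configuration cell reads `paths.data_dir` and `paths.output_dir`, plus the `database` keys `user`, `host`, `port`, and `database`. A missing key only surfaces as a `KeyError` partway through the cell, so a pre-flight check can help. The following is a minimal sketch of such a check; the helper name `find_missing_config_keys` and the example values are hypothetical and not part of the project.

```python
# Hypothetical helper: report which expected keys are absent from a loaded config dict.
REQUIRED_KEYS = {
    "paths": ["data_dir", "output_dir"],
    "database": ["user", "host", "port", "database"],
}

def find_missing_config_keys(config, required=REQUIRED_KEYS):
    """Return a sorted list of 'section.key' strings missing from config."""
    missing = []
    for section, keys in required.items():
        section_values = config.get(section) or {}
        for key in keys:
            if key not in section_values:
                missing.append(f"{section}.{key}")
    return sorted(missing)

# Example values are placeholders, not the project's real settings.
example_config = {
    "paths": {"data_dir": "${HOME}/data", "output_dir": "${HOME}/out"},
    "database": {"user": "demo", "host": "localhost", "port": 5432},
}
print(find_missing_config_keys(example_config))
```

Calling the helper right after `yaml.safe_load` would let the notebook stop with a readable message before any path or connection is built.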

## 2. Database Functions

Set up functions to interact with the database for fetching questions and storing model responses.

In [None]:
def fetch_questions(condition='original', limit=100):
    """
    Fetch questions from the database based on condition.
    
    Args:
        condition (str): Type of questions to fetch (original, adversarial, etc.)
        limit (int): Maximum number of questions to fetch
        
    Returns:
        pd.DataFrame: DataFrame containing the questions
    """
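    # Bind values as parameters instead of interpolating them into the SQL string.
    # ORDER BY id keeps the selected subset stable across runs.
    query = text("""
    SELECT id, question_id, condition, text, image
    FROM mimicxp.mimic_all_qns
    WHERE condition = :condition
    ORDER BY id
    LIMIT :limit
    """)

    with engine.connect() as conn:
        df = pd.read_sql(query, conn, params={'condition': condition, 'limit': int(limit)})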
    
    print(f"Fetched {len(df)} {condition} questions from database")
    return df

def store_model_response(uid, question_id, question, question_category, 
                         actual_answer, model_name, model_answer, image_link):
    """
    Store model response in the database.
    
    Args:
        uid (str): Unique identifier for the source image
        question_id (str): Question ID
        question (str): The question text
        question_category (str): Category of question (original, visual_perturb, text_attack)
        actual_answer (str): Ground truth answer (if available)
        model_name (str): Name of the model
        model_answer (str): Model's response
        image_link (str): Path to the image file
        
    Returns:
        int: ID of the inserted record
    """
    query = """
    INSERT INTO mimicxp.model_responses_r2 
    (uid, question_id, question, question_category, actual_answer, model_name, model_answer, image_link, created_at) 
    VALUES (:uid, :question_id, :question, :question_category, :actual_answer, :model_name, :model_answer, :image_link, NOW()) 
    RETURNING id
    """
    
    params = {
        'uid': uid,
        'question_id': str(question_id),
        'question': question,
        'question_category': question_category,
        'actual_answer': actual_answer,
        'model_name': model_name,
        'model_answer': model_answer,
        'image_link': image_link
    }
    
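    # engine.begin() commits on success and rolls back on error;
    # read the RETURNING row before the transaction closes
    with engine.begin() as conn:
        result = conn.execute(text(query), params)
        record_id = result.fetchone()[0]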
    
    return record_id

def check_existing_response(uid, question_id, model_name, question_category):
    """
    Check if a response already exists in the database.
    
    Args:
        uid (str): Unique identifier for the source image
        question_id (str): Question ID
        model_name (str): Name of the model
        question_category (str): Category of question
        
    Returns:
        bool: True if response exists, False otherwise
    """
    query = """
    SELECT COUNT(*) FROM mimicxp.model_responses_r2 
    WHERE uid = :uid AND question_id = :question_id AND model_name = :model_name AND question_category = :question_category
    """
    
    params = {
        'uid': uid,
        'question_id': str(question_id),
        'model_name': model_name,
        'question_category': question_category
    }
    
    with engine.connect() as conn:
        result = conn.execute(text(query), params)
        count = result.fetchone()[0]
    
    return count > 0
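
The helpers above combine two ideas: values are passed as bound parameters, and a response is stored only if no matching row exists. The sketch below illustrates both with the standard-library `sqlite3` module and an in-memory table, so it runs without the project's PostgreSQL database. The simplified schema and the function names `response_exists` and `store_once` are assumptions made for illustration.

```python
import sqlite3

# In-memory stand-in for the responses table (hypothetical, simplified schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE model_responses (id INTEGER PRIMARY KEY, uid TEXT, "
    "question_id TEXT, model_name TEXT, question_category TEXT, model_answer TEXT)"
)

def response_exists(conn, uid, question_id, model_name, question_category):
    """Return True if a matching row is already stored."""
    row = conn.execute(
        "SELECT COUNT(*) FROM model_responses WHERE uid = :uid AND question_id = :qid "
        "AND model_name = :model AND question_category = :cat",
        {"uid": uid, "qid": question_id, "model": model_name, "cat": question_category},
    ).fetchone()
    return row[0] > 0

def store_once(conn, uid, question_id, model_name, question_category, answer):
    """Insert a response unless one exists; return the new row id or None."""
    if response_exists(conn, uid, question_id, model_name, question_category):
        return None
    cur = conn.execute(
        "INSERT INTO model_responses "
        "(uid, question_id, model_name, question_category, model_answer) "
        "VALUES (?, ?, ?, ?, ?)",
        (uid, question_id, model_name, question_category, answer),
    )
    conn.commit()
    return cur.lastrowid

first = store_once(conn, "u1", "q1", "demo-model", "original", "No acute findings.")
second = store_once(conn, "u1", "q1", "demo-model", "original", "No acute findings.")
# A value containing a quote is stored literally because it is bound, not interpolated.
third = store_once(conn, "u2", "q2", "demo-model", "original", "Patient's film is clear.")
print(first, second, third)
```

The second call returns `None` because the row already exists, which is the same skip behavior the evaluation loop relies on when a notebook run is repeated.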

## 3. Model Setup

Load the CheXagent model and set up for inference.

In [None]:
def setup_model():
    """
    Set up the CheXagent model for inference.
    
    Returns:
        tuple: (model, processor)
    """
    model_id = "StanfordAIMI/CheXagent-8b"
    print(f"Loading model: {model_id}")
    
    # Check for GPU availability
    device = "cuda" if torch.cuda.is_available() else "cpu"
    print(f"Using device: {device}")
    
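    # Load model and processor.
    # Assumption: the CheXagent-8b checkpoint ships custom modeling code on the
    # Hugging Face Hub, so the Auto classes need trust_remote_code=True.
    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.float16 if device == "cuda" else torch.float32,
        device_map="auto" if device == "cuda" else None,
        trust_remote_code=True
    )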
    
    return model, processor

# Load the model
try:
    model, processor = setup_model()
    print("Model loaded successfully")
except Exception as e:
    print(f"Error loading model: {e}")
    # Define placeholders so later cells can check for a missing model
    model, processor = None, None
    # For demonstration, continue execution without model
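
The choice of `float16` on GPU versus `float32` on CPU has a direct memory cost. As a rough worked example, the arithmetic below estimates the memory needed for the weights alone, using the nominal figure of 8 billion parameters from the model description. The exact parameter count of the checkpoint may differ, and activations and the generation cache are not included, so treat the numbers as a lower bound.

```python
def weight_memory_gib(num_parameters, bytes_per_parameter):
    """Memory needed for the weights alone, in GiB (2**30 bytes)."""
    return num_parameters * bytes_per_parameter / 2**30

NUM_PARAMETERS = 8_000_000_000  # nominal "8 billion parameters" from the model description

fp16_gib = weight_memory_gib(NUM_PARAMETERS, 2)  # float16: 2 bytes per parameter
fp32_gib = weight_memory_gib(NUM_PARAMETERS, 4)  # float32: 4 bytes per parameter
print(f"float16 weights: {fp16_gib:.1f} GiB")
print(f"float32 weights: {fp32_gib:.1f} GiB")
```

Under these assumptions the CPU fallback needs about twice the memory of the GPU path for the weights, which is worth checking before running the notebook on a machine without a GPU.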

## 4. Inference Functions

Define functions for processing images and generating responses from the model.

In [None]:
def load_image(image_path):
    """
    Load an image from file.
    
    Args:
        image_path (str): Path to the image file
        
    Returns:
        PIL.Image: Loaded image
    """
    try:
        img = Image.open(image_path).convert('RGB')
        return img
    except Exception as e:
        print(f"Error loading image {image_path}: {e}")
        return None

def generate_response(model, processor, image, prompt):
    """
    Generate a response from the model for the given image and prompt.
    
    Args:
        model: CheXagent model
        processor: CheXagent processor
        image (PIL.Image): Input image
        prompt (str): Text prompt
        
    Returns:
        str: Model's response
    """
    try:
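        # Process the image and prompt
        inputs = processor(text=prompt, images=image, return_tensors="pt")
        # Move tensors to the model's device; cast floating-point tensors
        # (e.g. pixel values) to the model's dtype so float16 weights do not
        # receive float32 inputs
        inputs = {
            k: v.to(model.device, dtype=model.dtype) if torch.is_floating_point(v) else v.to(model.device)
            for k, v in inputs.items()
        }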
        
        # Generate response
        with torch.no_grad():
            output = model.generate(
                **inputs,
                max_new_tokens=512,
                temperature=0.1,
                do_sample=True,
                top_p=0.9,
            )
        
        # Decode and clean the response
        response = processor.decode(output[0], skip_special_tokens=True)
        response = response.replace(prompt, "").strip()
        
        return response
    except Exception as e:
        print(f"Error generating response: {e}")
        return f"Error: {str(e)}"

def clear_gpu_memory():
    """
    Clear GPU memory to prevent out-of-memory errors during batch processing.
    """
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        print("Cleared GPU memory")
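
`generate_response` samples with `do_sample=True` and `temperature=0.1`. Dividing the logits by a temperature below 1 sharpens the token distribution, so sampling behaves almost like greedy decoding, although repeated runs are still not guaranteed to produce identical text. The sketch below shows the effect on a toy distribution; the logits are made up for illustration and do not come from the model.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
probs_default = softmax_with_temperature(toy_logits, 1.0)
probs_low = softmax_with_temperature(toy_logits, 0.1)
print([round(p, 4) for p in probs_default])
print([round(p, 4) for p in probs_low])
```

At temperature 1.0 the top token holds roughly 63% of the probability mass, while at 0.1 it holds more than 99.9%. If strict run-to-run reproducibility matters for the baseline, setting `do_sample=False` is one option to consider.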

## 5. Evaluation Loop

Process multiple X-ray images with baseline questions to establish model performance.

In [None]:
# Fetch original questions for baseline evaluation
original_questions = fetch_questions(condition='original', limit=200)

# Display a few sample questions
original_questions[['question_id', 'text']].head(5)

In [None]:
def process_batch(questions_df, model_name="StanfordAIMI/CheXagent-8b", question_category="original", batch_size=10, total_limit=50):
    """
    Process a batch of questions and images through the model.
    
    Args:
        questions_df (pd.DataFrame): DataFrame of questions
        model_name (str): Name of the model
        question_category (str): Category of questions
        batch_size (int): Number of questions to process before clearing memory
        total_limit (int): Maximum number of questions to process
        
    Returns:
        list: List of response records
    """
    results = []
    processed_count = 0
    
    # Limit the number of questions to process
    questions_to_process = questions_df.head(total_limit)
    
    for batch_start in range(0, len(questions_to_process), batch_size):
        # Get batch of questions
        batch_end = min(batch_start + batch_size, len(questions_to_process))
        batch = questions_to_process.iloc[batch_start:batch_end]
        
        print(f"Processing batch {batch_start//batch_size + 1}/{(len(questions_to_process)-1)//batch_size + 1}")
        
        for _, row in tqdm(batch.iterrows(), total=len(batch)):
            uid = row['id']
            question_id = row['question_id']
            question = row['text']
            image_path = row['image']
            
            # Form full image path
            full_image_path = os.path.join(base_dir, image_path)
            
            # Check if this has already been evaluated
            if check_existing_response(uid, question_id, model_name, question_category):
                print(f"Skipping already processed question {question_id}")
                continue
            
            # Load image
            img = load_image(full_image_path)
            if img is None:
                print(f"Skipping image {image_path} due to loading error")
                continue
                
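            # Generate response
            response = generate_response(model, processor, img, question)

            # generate_response reports failures as "Error: ..." strings; skip them
            # so they are not stored as model answers
            if response.startswith("Error:"):
                print(f"Skipping question {question_id} due to generation error")
                continue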
            
            # Store response in database
            try:
                record_id = store_model_response(
                    uid=uid,
                    question_id=question_id,
                    question=question,
                    question_category=question_category,
                    actual_answer=None,  # No ground truth available
                    model_name=model_name,
                    model_answer=response,
                    image_link=image_path
                )
                
                results.append({
                    'record_id': record_id,
                    'question_id': question_id,
                    'question': question,
                    'response': response
                })
                
                processed_count += 1
            except Exception as e:
                print(f"Error storing response: {e}")
        
        # Clear GPU memory after each batch
        clear_gpu_memory()
    
    print(f"Processed {processed_count} questions")
    return results
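
`process_batch` walks the question list in fixed-size slices and reports progress as "batch i of N", where N is computed as `(n - 1) // batch_size + 1`. The slicing and the batch count can be sketched in isolation as follows; `iter_batches` and `count_batches` are hypothetical names used only for this illustration.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def count_batches(n_items, batch_size):
    """Ceiling division, equal to (n - 1) // batch_size + 1 for n > 0."""
    return (n_items + batch_size - 1) // batch_size

question_ids = list(range(20))  # stand-in for 20 question rows
batches = list(iter_batches(question_ids, 5))
print(len(batches), [len(b) for b in batches])
```

With `total_limit=20` and `batch_size=5`, as in the call below, this yields four batches of five questions, and GPU memory is cleared after each one.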

In [None]:
# Process a batch of original questions
# Note: This cell will process images and may take a while to complete
baseline_results = process_batch(
    questions_df=original_questions,
    model_name="StanfordAIMI/CheXagent-8b",
    question_category="original",
    batch_size=5,  # Process 5 images before clearing memory
    total_limit=20  # Process up to 20 images in total
)

## 6. Results Analysis

Inspect example responses and summarize response lengths as a first characterization of the baseline.

In [None]:
# Display some example results
if baseline_results:
    for i, result in enumerate(baseline_results[:3]):
        print(f"Example {i+1}:")
        print(f"Question: {result['question']}")
        print(f"Response: {result['response']}")
        print("\n" + "-"*80 + "\n")
else:
    print("No results available to display")

In [None]:
# Fetch all responses from the database for analysis
def fetch_model_responses(model_name="StanfordAIMI/CheXagent-8b", question_category="original", limit=100):
    """
    Fetch model responses from the database.
    
    Args:
        model_name (str): Name of the model
        question_category (str): Category of questions
        limit (int): Maximum number of responses to fetch
        
    Returns:
        pd.DataFrame: DataFrame containing the responses
    """
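    # Bind values as parameters instead of interpolating them into the SQL string
    query = text("""
    SELECT id, uid, question_id, question, model_answer, image_link
    FROM mimicxp.model_responses_r2
    WHERE model_name = :model_name AND question_category = :question_category
    LIMIT :limit
    """)

    params = {'model_name': model_name, 'question_category': question_category, 'limit': int(limit)}

    with engine.connect() as conn:
        df = pd.read_sql(query, conn, params=params)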
    
    print(f"Fetched {len(df)} responses from database")
    return df

# Fetch and analyze responses
responses_df = fetch_model_responses(model_name="StanfordAIMI/CheXagent-8b", question_category="original", limit=100)
if not responses_df.empty:
    # Analyze response length (treat missing answers as empty strings)
    responses_df['response_length'] = responses_df['model_answer'].fillna('').str.len()
    
    print(f"Average response length: {responses_df['response_length'].mean():.2f} characters")
    print(f"Median response length: {responses_df['response_length'].median()} characters")
    print(f"Min response length: {responses_df['response_length'].min()} characters")
    print(f"Max response length: {responses_df['response_length'].max()} characters")
    
    # Plot response length distribution
    plt.figure(figsize=(10, 6))
    plt.hist(responses_df['response_length'], bins=20, alpha=0.7)
    plt.title('Distribution of Response Lengths')
    plt.xlabel('Response Length (characters)')
    plt.ylabel('Frequency')
    plt.grid(True, alpha=0.3)
    plt.show()
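
Because `generate_response` returns a string beginning with `Error:` when generation fails, any such rows already in the database would be counted as ordinary responses in the length statistics. The sketch below shows one way to exclude them, along with missing answers, before summarizing. It uses only the standard library, and both the function name `summarize_lengths` and the sample answers are invented for illustration.

```python
import statistics

def summarize_lengths(answers):
    """Return length statistics over usable answers plus a count of skipped ones."""
    usable = [a for a in answers if a and not a.startswith("Error:")]
    lengths = [len(a) for a in usable]
    if not lengths:
        return {"n": 0, "skipped": len(answers)}
    return {
        "n": len(lengths),
        "skipped": len(answers) - len(lengths),
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
        "min": min(lengths),
        "max": max(lengths),
    }

# Invented sample data: two usable answers, one missing, one failed generation.
sample_answers = ["No focal consolidation.", None, "Error: out of memory", "Clear lungs."]
summary = summarize_lengths(sample_answers)
print(summary)
```

Reporting the skipped count alongside the statistics makes it visible when a run produced many failures rather than silently shifting the distribution.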

## 7. Summary and Next Steps

In this notebook, we recorded baseline responses from the CheXagent-8b model on unmodified chest X-ray images and summarized their length distribution. Because no ground-truth answers are stored (`actual_answer` is `None`), the baseline consists of the recorded responses themselves rather than accuracy scores. These responses are the reference point for comparison with adversarial inputs in subsequent notebooks.

### Key Findings
- Collected CheXagent-8b baseline responses on unmodified MIMIC-CXR images
- Summarized response-length statistics for standard medical imaging questions
- Stored baseline responses in the database for later comparison

### Next Steps
- Proceed to notebook `03_model_evaluation_chexagent_perturbed.ipynb` to evaluate the model on visually perturbed images
- Then compare with other models in subsequent notebooks
- Finally, use the VSF-Med framework to comprehensively evaluate vulnerabilities

This baseline serves as the control condition for our vulnerability assessment, allowing us to isolate the effects of adversarial inputs in later analyses.