# 🧪 SLR Abstract Screening Experiment
#### Experiment Information
- **ID**: 016
- **Date**: 08/14
#### 🎯 Goal
- 
#### ⚙️ Configuration
- **LLM**: GPT-4o
- **Data**: BM
- **Examples**: 0
- **Output**: Likert
#### 📝 Notes
- 


## 🔧 Setup and Configuration

In [1]:
# Essential imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import json
from pathlib import Path
import os
from dotenv import load_dotenv
from openai import OpenAI

# Configure pandas display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.width', 1000)
pd.set_option('display.float_format', lambda x: '%.3f' % x)

# Set plotting style
sns.set_theme()  # Apply seaborn's default theme
plt.rcParams['figure.figsize'] = (12, 8)

In [2]:
# Data Import 

# Define the data paths for both datasets
DATA_PATH_1 = "../data/SSOT_manual_LB_20250808_120908.csv" # ⬅️ Change this path if needed
DATA_PATH_2 = "../data/SSOT_manual_BM_20250813_132621.csv" # ⬅️ Change this path if needed

# Load the first dataset (df_LB)
try:
    df_LB = pd.read_csv(DATA_PATH_1)
    print(f"✓ First dataset loaded successfully")
    print(f"✓ Shape of dataset 1: {df_LB.shape}")
except FileNotFoundError:
    print(f"❌ Error: LB dataset file not found: {DATA_PATH_1}")
except Exception as e:
    print(f"❌ Error loading the first dataset: {str(e)}")

# Load the second dataset (df_BM)
try:
    df_BM = pd.read_csv(DATA_PATH_2)
    print(f"\n✓ Second dataset loaded successfully")
    print(f"✓ Shape of dataset 2: {df_BM.shape}")
except FileNotFoundError:
    print(f"❌ Error: BM dataset file not found: {DATA_PATH_2}")
except Exception as e:
    print(f"❌ Error loading the second dataset: {str(e)}")

# Display basic information about both datasets
print("\nFirst few rows of dataset 1:\n")
display(df_LB.head())

print("\nFirst few rows of dataset 2:\n")
display(df_BM.head())

✓ First dataset loaded successfully
✓ Shape of dataset 1: (3944, 15)

✓ Second dataset loaded successfully
✓ Shape of dataset 2: (917, 13)

First few rows of dataset 1:



Unnamed: 0,ID,abstract,acmid,author,doi,outlet,title_full,url,year,qualtrics_id,wos_id,ebsco_id,stage_1,stage_2,stage_3
0,Bindu2018503,Online social networks have become immensely p...,,"Bindu, P V and Mishra, R and Thilagam, P S",10.1007/s10844-017-0494-z,Journal of Intelligent Information Systems,{Discovering spammer communities in TWITTER},https://www.scopus.com/inward/record.uri?eid=2...,2018,12,,,True,False,False
1,Moraga2018470,This article explores the ways Latinos—as audi...,,"Moraga, J E",10.1177/0193723518797030,Journal of Sport and Social Issues,"{On ESPN Deportes: Latinos, Sport MEDIA, and t...",https://www.scopus.com/inward/record.uri?eid=2...,2018,22,,,True,False,False
2,Lanosga20181676,This study of American investigative reporting...,,"Lanosga, G and Martin, J",10.1177/1464884916683555,JOURNALISm,"{JOURNALISts, sources, and policy outcomes: In...",https://www.scopus.com/inward/record.uri?eid=2...,2018,47,,,True,False,True
3,Warner2018720,"In this study, we test the indirect and condit...",,"Warner, B R and Jennings, F J and Bramlett, J ...",10.1080/15205436.2018.1472283,Mass Communication and Society,{A MultiMEDIA Analysis of Persuasion in the 20...,https://www.scopus.com/inward/record.uri?eid=2...,2018,50,,,True,False,False
4,Burrows20181117,Professional communicators produce a diverse r...,,"Burrows, E",10.1177/0163443718764807,"MEDIA, Culture and Society",{Indigenous MEDIA producers' perspectives on o...,https://www.scopus.com/inward/record.uri?eid=2...,2018,56,,,True,False,False



First few rows of dataset 2:



Unnamed: 0,(internal) id,(source) id,abstract,title_full,journal,authors,tags,consensus,labeled_at...9,code,stage_1,stage_2,stage_3
0,33937314,175,There is a worry that serious forms of politic...,Is Context the Key? The (Non-)Differential Eff...,Polit. Commun.,,,o,,-1,True,False,False
1,33937315,113,The electoral model of democracy holds the ide...,POLITICAL NEWS IN ONLINE AND PRINT NEWSPAPERS ...,Digit. Journal.,,,o,,-1,True,False,False
2,33937316,122,Machine learning is a field at the intersectio...,Machine Learning for Sociology,Annu. Rev. Sociol.,,,o,,-1,True,False,False
3,33937317,467,Research on digital glocalization has found th...,Improving Health in Low-Income Communities Wit...,J. Commun.,,,o,,-1,True,False,False
4,33937318,10,Political scientists often wish to classify do...,Using Word Order in Political Text Classificat...,Polit. Anal.,,,o,,-1,True,False,False


## 🧫 Define Experiment Parameters

In [3]:
from datetime import datetime

# Experiment Metadata
EXPERIMENT_ID = "016"  # ⬅️ Change this for each new experiment
EXPERIMENT_DATE = "2025-08-14"  # ⬅️ Update the date
EXPERIMENT_CATEGORY = "Testing"  # ⬅️ Category of experiment
EXPERIMENT_GOAL = "Test Set Up"  # ⬅️ What are you testing?

# Model Configuration
MODEL_NAME = "gpt-4o"
TEMPERATURE = 0.0
MAX_TOKENS = 4000

# Print experiment info
print("🧪 EXPERIMENT SETUP")
print("=" * 50)
print(f"ID: {EXPERIMENT_ID}")
print(f"Date: {EXPERIMENT_DATE}")
print(f"Category: {EXPERIMENT_CATEGORY}")
print(f"🎯Goal: {EXPERIMENT_GOAL}")
print(f"Model: {MODEL_NAME} (temp={TEMPERATURE})")
print("=" * 50)
print("✅ Experiment configuration loaded")

🧪 EXPERIMENT SETUP
ID: 016
Date: 2025-08-14
Category: Testing
🎯Goal: Test Set Up
Model: gpt-4o (temp=0.0)
✅ Experiment configuration loaded
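
One way to keep these settings attached to later results is to serialize them as a small JSON manifest. The sketch below is only an illustration: `ExperimentConfig` and its field names are hypothetical (they mirror the constants in the cell above), and nothing in the notebook currently writes such a manifest.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ExperimentConfig:
    """Hypothetical container; field names mirror the notebook constants."""
    experiment_id: str
    experiment_date: str
    category: str
    goal: str
    model_name: str
    temperature: float
    max_tokens: int


# Values copied from the configuration cell above.
config = ExperimentConfig(
    experiment_id="016",
    experiment_date="2025-08-14",
    category="Testing",
    goal="Test Set Up",
    model_name="gpt-4o",
    temperature=0.0,
    max_tokens=4000,
)

# Serialize to JSON text; this string could be written next to a results CSV.
manifest = json.dumps(asdict(config), indent=2)
print(manifest)

# Round trip: the manifest is enough to rebuild an identical configuration.
restored = ExperimentConfig(**json.loads(manifest))
```

Freezing the dataclass makes accidental edits to a running experiment's settings raise an error instead of silently changing what gets logged.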


## 📣 Set up Basic API Call

In [4]:
import os
import json
from openai import OpenAI
from dotenv import load_dotenv
from datetime import datetime

# Load environment variables
load_dotenv()

# Get the API key from environment variables
api_key = os.getenv("OPENAI_API_KEY")

# Validate API key
if not api_key:
    print("⚠️  Error: OPENAI_API_KEY not found.")
    print("Please make sure you have a .env file with OPENAI_API_KEY='sk-...'")
else:
    print("✅ OpenAI API Key loaded successfully.")
    client = OpenAI(api_key=api_key)
    print("✅ OpenAI client initialized.")

# Enhanced analysis function for abstract screening
def screen_abstract_llm(abstract_text, system_prompt, user_prompt_template, 
                       model="gpt-4o", temperature=0.0):
    """
    Screen an abstract using LLM with system and user prompts.
    
    Args:
        abstract_text (str): The abstract to analyze
        system_prompt (str): The system prompt defining the role
        user_prompt_template (str): Template with {abstract} placeholder
        model (str): The OpenAI model to use
        temperature (float): Temperature setting for response randomness
    
    Returns:
        dict: Result with decision, reasoning, and metadata
    """
    if 'client' not in globals():
        return {"error": "OpenAI client is not initialized. Please check your API key."}

    try:
        # Insert abstract into user prompt template
        user_prompt = user_prompt_template.format(abstract=abstract_text)
        
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt}
            ],
            temperature=temperature,
            max_tokens=MAX_TOKENS  # Defined with the model configuration above
        )
        
        if response and response.choices:
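            result = {
                # Heuristic only: any occurrence of "include" (even "should not be included")
                # yields INCLUDE. Likert runs ignore this field and parse "reasoning" instead.
                "decision": "INCLUDE" if "INCLUDE" in response.choices[0].message.content.upper() else "EXCLUDE",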
                "reasoning": response.choices[0].message.content,
                "model": model,
                "temperature": temperature,
                "timestamp": datetime.now().isoformat(),
                "error": None
            }
            return result
        else:
            return {"error": "API Error: Empty or invalid response."}
            
    except Exception as e:
        return {"error": f"API Error: {e}"}

print("✅ Enhanced screening function defined.")

✅ OpenAI API Key loaded successfully.
✅ OpenAI client initialized.
✅ Enhanced screening function defined.
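
`screen_abstract_llm` builds the user message with `str.format`, so two properties of that method matter for prompt templates. The sketch below uses made-up template strings, not the notebook's prompts. It shows that braces inside the substituted abstract are left alone, while a literal brace in the template itself has to be doubled.

```python
# Illustrative templates only; these strings are not taken from the notebook.
template = "Abstract: {abstract}"

# Braces inside the substituted value are NOT re-parsed by str.format.
abstract_with_braces = "We model the set {x | x > 0} and write \\frac{a}{b}."
filled = template.format(abstract=abstract_with_braces)
print(filled)

# A pass-through template simply returns the already formatted prompt.
pass_through = "{abstract}".format(abstract=filled)

# Literal braces in the template itself must be doubled.
json_template = 'Answer as JSON: {{"score": 3}}\nAbstract: {abstract}'
print(json_template.format(abstract="Example abstract."))

# Without doubling, str.format treats the brace pair as a placeholder.
format_error = None
try:
    'Answer as JSON: {"score": 3}\n{abstract}'.format(abstract="x")
except KeyError as exc:
    format_error = exc
    print(f"Unescaped brace raised KeyError: {exc}")
```

This is why a caller can format a full prompt first and then pass `"{abstract}"` as the template: the second `format` call inserts the finished prompt verbatim.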


## 🏛️ Set Up System Prompt 

In [5]:
# System prompt configuration
SYSTEM_PROMPT_ID = "SYS_001"  # ⬅️ Change this ID for different system prompts
SYSTEM_PROMPT_DESCRIPTION = "Generic expert literature review screener for systematic reviews"

# Define the system prompt that sets the LLM's role
SYSTEM_PROMPT = """You are an expert in scientific literature review and systematic review methodology.

Your task is to screen research abstracts and decide whether they should be INCLUDED or EXCLUDED from a systematic literature review based on provided criteria.

INSTRUCTIONS:
1. Carefully read the provided inclusion/exclusion criteria
2. Review any example abstracts to understand the decision-making pattern
3. Apply the criteria systematically to the given abstract and title
4. Provide your decision in the exact format requested
5. Base your reasoning strictly on the provided criteria

Be consistent, objective, and systematic in your evaluation. Do not make up additional criteria beyond what is provided. Focus only on what is explicitly stated in the instructions."""

print(f"✅ System prompt defined")
print(f"📋 ID: {SYSTEM_PROMPT_ID}")
print(f"📏 Length: {len(SYSTEM_PROMPT)} characters")
print(f"📄 Description: {SYSTEM_PROMPT_DESCRIPTION}")

✅ System prompt defined
📋 ID: SYS_001
📏 Length: 759 characters
📄 Description: Generic expert literature review screener for systematic reviews


## 👩🏻‍⚕️ Create User Prompt


In [6]:
# User prompt configuration
USER_PROMPT_ID = "USR_005"  # ⬅️ Change this ID for different user prompts
USER_PROMPT_DESCRIPTION = "Basic screening with criteria, no examples from CSV files, Likert 1-5 relevance scale"

# File paths for modular components
CRITERIA_FILE = "../prompts/Criteria_BM_01.csv"  # ⬅️ Change criteria file here
EXAMPLES_FILE = None  # ⬅️ Change examples file here (or set to None)

# Output configuration
OUTPUT_FORMAT = "Likert"  # ⬅️ Options: "Binary", "Yes/Maybe/No", "Likert"
DECISION_OPTIONS = ["1", "2", "3", "4", "5"] # ⬅️ Change according to the output format

# Additional metadata for results tracking
DOMAIN = "political_communication" # ⬅️ Change this to the domain of the study
TOPIC = "Computational Text Analysis Methods"  # ⬅️ Change this to the topic of the study
DATASET_SOURCE = "BM"  # ⬅️ Which dataset (BM/LB)

# Define the user prompt template with placeholders
USER_PROMPT_TEMPLATE = """## SCREENING TASK:
You are screening abstracts for a systematic literature review on {topic} in {domain}. Your task is to evaluate the relevance of each abstract based on how well it meets the inclusion criteria.

## INCLUSION/EXCLUSION CRITERIA:
{criteria_text}

{examples_section}

## RELEVANCE SCALE:
Rate the relevance of each abstract on a scale from 1 to 5:

**5 - Highly Relevant**: Abstract clearly and strongly meets all inclusion criteria. Definitely should be included.

**4 - Relevant**: Abstract meets most inclusion criteria with good alignment to the research focus. Should likely be included.

**3 - Moderately Relevant**: Abstract shows some relevance but has unclear aspects or partial alignment with criteria. Requires careful consideration.

**2 - Minimally Relevant**: Abstract has limited relevance, meets few criteria, or has significant gaps. Should likely be excluded.

**1 - Not Relevant**: Abstract clearly does not meet inclusion criteria or is completely outside the research scope. Definitely should be excluded.

## ABSTRACT TO SCREEN:
**Title:** {title}
**Abstract:** {abstract}

## YOUR EVALUATION:
Provide your relevance rating and detailed reasoning:

**Relevance Score:** [Choose exactly one number: 1, 2, 3, 4, or 5]

**Reasoning:** [Explain your rating by specifically addressing how the abstract aligns with each inclusion/exclusion criterion. For scores of 3, clearly identify what information would be needed to make a definitive decision. For scores of 1-2, specify which criteria are not met. For scores of 4-5, highlight the strong alignment with criteria.]"""

print(f"✅ User prompt configuration and template loaded")
print(f"📋 ID: {USER_PROMPT_ID}")
print(f"📄 Description: {USER_PROMPT_DESCRIPTION}")
print(f"📁 Criteria: {CRITERIA_FILE}")
print(f"📁 Examples: {EXAMPLES_FILE}")
print(f"🎯 Output: {OUTPUT_FORMAT}")
print(f"🔬 Topic: {TOPIC} | Domain: {DOMAIN} | Source: {DATASET_SOURCE}")
print(f"📏 Template length: {len(USER_PROMPT_TEMPLATE)} characters")

✅ User prompt configuration and template loaded
📋 ID: USR_005
📄 Description: Basic screening with criteria, no examples from CSV files, Likert 1-5 relevance scale
📁 Criteria: ../prompts/Criteria_BM_01.csv
📁 Examples: None
🎯 Output: Likert
🔬 Topic: Computational Text Analysis Methods | Domain: political_communication | Source: BM
📏 Template length: 1601 characters
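
The template asks the model for a `**Relevance Score:**` line, and the experiment function further down turns that score into two binary labels (3 or higher for stage 2, 4 or higher for stage 3). Below is a minimal sketch of one way to parse such a response, anchored on the label so that an echoed scale line such as `**5 - Highly Relevant**` is not mistaken for the rating. The helper names `parse_likert_score` and `to_binary` are hypothetical, and the sample responses are invented for illustration.

```python
import re
from typing import Optional


def parse_likert_score(response_text: str) -> Optional[int]:
    """Return the 1-5 rating that follows 'Relevance Score', or None if absent."""
    match = re.search(r"Relevance Score\W*([1-5])\b", response_text, re.IGNORECASE)
    return int(match.group(1)) if match else None


def to_binary(score: Optional[int], threshold: int) -> Optional[int]:
    """Map a Likert score to 1/0 at the given threshold; None stays None."""
    if score is None:
        return None
    return 1 if score >= threshold else 0


# Invented responses that resemble the requested output format.
clean = "**Relevance Score:** 4\n\n**Reasoning:** Meets most criteria."
echoed_scale = "**5 - Highly Relevant** is not warranted.\n**Relevance Score:** 2"
no_score = "I cannot rate this abstract."

for text in (clean, echoed_scale, no_score):
    score = parse_likert_score(text)
    # Columns: parsed score, stage 2 label (3+), stage 3 label (4+)
    print(score, to_binary(score, threshold=3), to_binary(score, threshold=4))
```

Returning `None` for an unparseable response keeps the failure visible, so the caller can decide whether to retry, drop the row, or substitute a default.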


## ✅ Validation Check

In [8]:
def validate_experiment_setup(df, dataset_source="BM"):
    """
    Validate that all required variables and data are available for the experiment.
    
    Args:
        df: DataFrame to be used in experiment
        dataset_source: Dataset identifier
    
    Returns:
        bool: True if all validations pass, False otherwise
    """
    
    print("🔍 VALIDATION CHECK")
    print("=" * 50)
    
    validation_passed = True
    
    # Check required global variables
    required_vars = {
        'EXPERIMENT_ID': globals().get('EXPERIMENT_ID'),
        'SYSTEM_PROMPT_ID': globals().get('SYSTEM_PROMPT_ID'), 
        'USER_PROMPT_ID': globals().get('USER_PROMPT_ID'),
        'SYSTEM_PROMPT': globals().get('SYSTEM_PROMPT'),
        'USER_PROMPT_TEMPLATE': globals().get('USER_PROMPT_TEMPLATE'),
        'CRITERIA_FILE': globals().get('CRITERIA_FILE'),
        'DECISION_OPTIONS': globals().get('DECISION_OPTIONS'),
        'MODEL_NAME': globals().get('MODEL_NAME'),
        'TEMPERATURE': globals().get('TEMPERATURE'),
        'TOPIC': globals().get('TOPIC'),
        'DOMAIN': globals().get('DOMAIN')
    }
    
    # Optional variables that can be None
    optional_vars = {
        'EXAMPLES_FILE': globals().get('EXAMPLES_FILE')
    }
    
    print("📋 Checking required variables:")
    for var_name, var_value in required_vars.items():
        if var_value is None:
            print(f"   ❌ {var_name}: NOT DEFINED")
            validation_passed = False
        else:
            print(f"   ✅ {var_name}: {str(var_value)[:50]}{'...' if len(str(var_value)) > 50 else ''}")
    
    print("📋 Checking optional variables:")
    for var_name, var_value in optional_vars.items():
        if var_value is None:
            print(f"   ✅ {var_name}: None (optional - will run without examples)")
        else:
            print(f"   ✅ {var_name}: {str(var_value)[:50]}{'...' if len(str(var_value)) > 50 else ''}")
    
    # Check DataFrame structure
    print(f"\n📊 Checking DataFrame structure:")
    required_columns = ['abstract', 'title_full', 'stage_2', 'stage_3']
    
    if df is None:
        print(f"   ❌ DataFrame is None")
        validation_passed = False
    else:
        print(f"   ✅ DataFrame shape: {df.shape}")
        
        for col in required_columns:
            if col in df.columns:
                print(f"   ✅ Column '{col}': Present")
            else:
                print(f"   ❌ Column '{col}': MISSING")
                validation_passed = False
    
    # Check data availability
    if df is not None and all(col in df.columns for col in required_columns):
        print(f"\n📈 Checking data availability:")
        stage2_true = len(df[df['stage_2'] == True])
        stage2_false = len(df[df['stage_2'] == False])
        stage3_true = len(df[df['stage_3'] == True])
        stage3_false = len(df[df['stage_3'] == False])
        
        print(f"   📊 Stage 2 True: {stage2_true}")
        print(f"   📊 Stage 2 False: {stage2_false}")
        print(f"   📊 Stage 3 True: {stage3_true}")
        print(f"   📊 Stage 3 False: {stage3_false}")
        
        if stage3_true < 10:
            print(f"   ⚠️  Warning: Only {stage3_true} stage_3=True examples available")
        if stage3_false < 10:
            print(f"   ⚠️  Warning: Only {stage3_false} stage_3=False examples available")
    
    # Check file paths
    print(f"\n📁 Checking file paths:")
    import os
    
    # CRITERIA_FILE is required
    if CRITERIA_FILE and os.path.exists(CRITERIA_FILE):
        print(f"   ✅ Criteria file: {CRITERIA_FILE}")
    elif CRITERIA_FILE:
        print(f"   ❌ Criteria file: {CRITERIA_FILE} (NOT FOUND)")
        validation_passed = False
    else:
        print(f"   ❌ Criteria file: NOT SPECIFIED")
        validation_passed = False
    
    # EXAMPLES_FILE is optional
    if EXAMPLES_FILE is None:
        print(f"   ✅ Examples file: None (will run without examples)")
    elif os.path.exists(EXAMPLES_FILE):
        print(f"   ✅ Examples file: {EXAMPLES_FILE}")
    else:
        print(f"   ❌ Examples file: {EXAMPLES_FILE} (NOT FOUND)")
        validation_passed = False
    
    # Check API function
    print(f"\n🤖 Checking API function:")
    if 'screen_abstract_llm' in globals():
        print(f"   ✅ screen_abstract_llm function: Available")
    else:
        print(f"   ❌ screen_abstract_llm function: NOT DEFINED")
        validation_passed = False
    
    # Final result
    print("\n" + "=" * 50)
    if validation_passed:
        print("✅ ALL VALIDATIONS PASSED - Ready to run experiment!")
    else:
        print("❌ VALIDATION FAILED - Please fix the issues above before running")
    
    return validation_passed

# Run validation
validation_result = validate_experiment_setup(df_BM, "BM")

🔍 VALIDATION CHECK
📋 Checking required variables:
   ✅ EXPERIMENT_ID: 016
   ✅ SYSTEM_PROMPT_ID: SYS_001
   ✅ USER_PROMPT_ID: USR_005
   ✅ SYSTEM_PROMPT: You are an expert in scientific literature review ...
   ✅ USER_PROMPT_TEMPLATE: ## SCREENING TASK:
You are screening abstracts for...
   ✅ CRITERIA_FILE: ../prompts/Criteria_BM_01.csv
   ✅ DECISION_OPTIONS: ['1', '2', '3', '4', '5']
   ✅ MODEL_NAME: gpt-4o
   ✅ TEMPERATURE: 0.0
   ✅ TOPIC: Computational Text Analysis Methods
   ✅ DOMAIN: political_communication
📋 Checking optional variables:
   ✅ EXAMPLES_FILE: None (optional - will run without examples)

📊 Checking DataFrame structure:
   ✅ DataFrame shape: (917, 13)
   ✅ Column 'abstract': Present
   ✅ Column 'title_full': Present
   ✅ Column 'stage_2': Present
   ✅ Column 'stage_3': Present

📈 Checking data availability:
   📊 Stage 2 True: 166
   📊 Stage 2 False: 751
   📊 Stage 3 True: 96
   📊 Stage 3 False: 821

📁 Checking file paths:
   ✅ Criteria file: ../prompts/Criteria_BM_01
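
The counts in the validation output show how imbalanced the labels are, which affects how the later metrics should be read. The sketch below works through the arithmetic using only the counts printed above. The "reject everything" screener is a hypothetical baseline added for illustration, not something the notebook runs.

```python
# Counts copied from the validation output above.
stage2_true, stage2_false = 166, 751
stage3_true, stage3_false = 96, 821
total = stage3_true + stage3_false  # 917 rows in the BM dataset

stage2_prevalence = stage2_true / total
stage3_prevalence = stage3_true / total
print(f"Stage 2 prevalence: {stage2_prevalence:.3f}")
print(f"Stage 3 prevalence: {stage3_prevalence:.3f}")

# Hypothetical baseline: a screener that rejects every abstract at stage 3.
tp, fp, tn, fn = 0, 0, stage3_false, stage3_true

accuracy = (tp + tn) / (tp + fp + tn + fn)
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0

print(f"Baseline accuracy: {accuracy:.3f}")
print(f"Baseline precision: {precision:.3f}, recall: {recall:.3f}, F1: {f1:.3f}")
```

With roughly one in ten abstracts positive at stage 3, a screener that finds nothing still scores about 90% accuracy, so recall and F1 are the more informative numbers. The default sample in `run_classification_experiment` (5 positives and 45 negatives) keeps approximately the same base rate.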

## 🔬 Set Up Function

In [9]:
import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix 
from datetime import datetime
import os
import time

def run_classification_experiment(
    df, 
    n_total_examples=50,  # ⬅️ Total number of examples to test
    n_stage3_true=5,     # ⬅️ Number of stage_3=True examples
    n_stage3_false=45,    # ⬅️ Number of stage_3=False examples
    dataset_source="BM",  # ⬅️ Dataset identifier (LB/BM)
    batch_size=20,        # ⬅️ Batch size for processing (max 20 to avoid timeouts)
    save_results=True,    # ⬅️ Whether to save results to CSV
    verbose=True          # ⬅️ Print progress updates
):
    """
    Run LLM classification experiment on abstracts with batch processing.
    
    Args:
        df: DataFrame with abstracts (must have 'abstract', 'title_full', 'stage_2', 'stage_3')
        n_total_examples: Total number of examples to test (informational only;
            the actual sample size is n_stage3_true + n_stage3_false)
        n_stage3_true: Number of stage_3=True examples to include
        n_stage3_false: Number of stage_3=False examples to include
        dataset_source: Dataset identifier for results filename
        batch_size: Number of examples to process in each batch (max 20)
        save_results: Whether to save results to CSV
        verbose: Whether to print progress
    
    Returns:
        dict: Results including metrics and DataFrame
    """
    
    # Validate batch size
    if batch_size > 20:
        print("⚠️  Warning: Batch size > 20 may cause timeouts. Setting to 20.")
        batch_size = 20
    
    if verbose:
        print(f"🧪 Starting Classification Experiment with Batch Processing")
        print(f"📊 Dataset: {dataset_source}")
        print(f"🎯 Total examples: {n_total_examples}")
        print(f"✅ Stage 3 True: {n_stage3_true}")
        print(f"❌ Stage 3 False: {n_stage3_false}")
        print(f"📦 Batch size: {batch_size}")
        print("=" * 50)
    
    # Sample examples
    stage3_true_samples = df[df['stage_3'] == True].sample(n=n_stage3_true, random_state=42)
    stage3_false_samples = df[df['stage_3'] == False].sample(n=n_stage3_false, random_state=42)
    
    # Combine samples
    test_samples = pd.concat([stage3_true_samples, stage3_false_samples]).reset_index(drop=True)
    
    if verbose:
        print(f"📝 Sampled {len(test_samples)} examples")
    
    # Load criteria and examples text
    def load_criteria_text(criteria_file):
        try:
            criteria_df = pd.read_csv(criteria_file)
            criteria_text = ""
            
            # Add inclusion criteria
            inclusion_criteria = criteria_df[criteria_df['type'] == 'inclusion']
            if len(inclusion_criteria) > 0:
                criteria_text += "**INCLUSION CRITERIA:**\n"
                for _, row in inclusion_criteria.iterrows():
                    criteria_text += f"- **{row['criterion_id']}**: {row['description']}\n"
                    if pd.notna(row['examples']) and row['examples'].strip():
                        criteria_text += f"  *Examples: {row['examples']}*\n"
            
            # Add exclusion criteria
            exclusion_criteria = criteria_df[criteria_df['type'] == 'exclusion']
            if len(exclusion_criteria) > 0:
                criteria_text += "\n**EXCLUSION CRITERIA:**\n"
                for _, row in exclusion_criteria.iterrows():
                    criteria_text += f"- **{row['criterion_id']}**: {row['description']}\n"
                    if pd.notna(row['examples']) and row['examples'].strip():
                        criteria_text += f"  *Examples: {row['examples']}*\n"
            
            return criteria_text
        except Exception as e:
            return f"Error loading criteria: {e}"
    
    def load_examples_text(examples_file):
        if not examples_file:
            return ""
        try:
            examples_df = pd.read_csv(examples_file)
            examples_text = "\n## EXAMPLE DECISIONS:\n"
            
            for _, row in examples_df.iterrows():
                decision_label = "INCLUDE" if row['decision'].upper() == 'INCLUDE' else "EXCLUDE"
                examples_text += f"\n**{decision_label} Example:**\n"
                examples_text += f"*Title:* {row['title']}\n"
                examples_text += f"*Abstract:* {row['abstract_text'][:200]}{'...' if len(row['abstract_text']) > 200 else ''}\n"
                examples_text += f"→ **{decision_label}** ({row['reasoning']})\n"
            
            return examples_text
        except Exception as e:
            return f"\n## EXAMPLES:\nError loading examples: {e}\n"
    
    # Load prompt components
    criteria_text = load_criteria_text(CRITERIA_FILE)
    examples_section = load_examples_text(EXAMPLES_FILE) if EXAMPLES_FILE else ""
    
    # Initialize results list
    results_list = []
    
    # Calculate number of batches
    total_examples = len(test_samples)
    num_batches = (total_examples + batch_size - 1) // batch_size  # Ceiling division
    
    if verbose:
        print(f"📦 Processing {total_examples} examples in {num_batches} batch(es)")
        print(f"⏱️  Estimated time: ~{num_batches * 2} minutes (2 min per batch)")
    
    # Process examples in batches
    for batch_idx in range(num_batches):
        start_idx = batch_idx * batch_size
        end_idx = min(start_idx + batch_size, total_examples)
        batch_samples = test_samples.iloc[start_idx:end_idx]
        
        if verbose:
            print(f"\n🔄 Processing Batch {batch_idx + 1}/{num_batches} (examples {start_idx + 1}-{end_idx})")
        
        batch_start_time = time.time()
        
        for idx, row in batch_samples.iterrows():
            sample_number = start_idx + (idx - batch_samples.index[0]) + 1
            
            try:
                # Create complete prompt
                complete_prompt = USER_PROMPT_TEMPLATE.format(
                    topic=TOPIC,
                    domain=DOMAIN,
                    criteria_text=criteria_text,
                    examples_section=examples_section,
                    title=row['title_full'],
                    abstract=row['abstract']
                )
                
                # Call LLM
                llm_result = screen_abstract_llm(
                    abstract_text=complete_prompt,
                    system_prompt=SYSTEM_PROMPT,
                    user_prompt_template="{abstract}",  # Just pass through since we formatted above
                    model=MODEL_NAME,
                    temperature=TEMPERATURE
                )
                
                # Surface API failures instead of silently scoring them:
                # on failure screen_abstract_llm returns {"error": ...} with no 'reasoning' key
                if llm_result.get('error'):
                    raise RuntimeError(llm_result['error'])
                
                # Parse LLM decision - extract Likert score from response
                llm_reasoning = llm_result.get('reasoning', 'No reasoning provided')
                
                # Extract Likert score (1-5) from the "Relevance Score" line only,
                # so an echoed scale label such as "**5 - Highly Relevant**" is not mistaken for it
                import re
                llm_score = None
                score_match = re.search(r'Relevance Score\W*([1-5])\b', llm_reasoning, re.IGNORECASE)
                if score_match:
                    llm_score = int(score_match.group(1))
                
                # Fallback: take the first standalone number 1-5 in the response
                if llm_score is None:
                    scores = re.findall(r'\b[1-5]\b', llm_reasoning)
                    if scores:
                        llm_score = int(scores[0])
                    else:
                        llm_score = 3  # Default to middle score if no score found
                
                # Convert Likert scores to binary for evaluation
                # Stage 2: 3,4,5 = positive (worth further review)
                stage2_llm_binary = 1 if llm_score >= 3 else 0
                # Stage 3: 4,5 = positive (definitely include)  
                stage3_llm_binary = 1 if llm_score >= 4 else 0
                
                stage2_binary = 1 if row['stage_2'] else 0
                stage3_binary = 1 if row['stage_3'] else 0
                
                # Store result
                result_row = {
                    'example_id': sample_number,
                    'title': row['title_full'],
                    'abstract': row['abstract'],
                    'stage_2_true': row['stage_2'],
                    'stage_3_true': row['stage_3'],
                    'stage_2_binary': stage2_binary,
                    'stage_3_binary': stage3_binary,
                    'llm_score': llm_score,  # New: Likert score (1-5)
                    'llm_decision': str(llm_score),  # For compatibility
                    'stage2_llm_binary': stage2_llm_binary,  # New: Binary for stage 2 (3+ = positive)
                    'stage3_llm_binary': stage3_llm_binary,  # New: Binary for stage 3 (4+ = positive)
                    'llm_reasoning': llm_reasoning,
                    'experiment_id': EXPERIMENT_ID,
                    'dataset_source': dataset_source,
                    'system_prompt_id': SYSTEM_PROMPT_ID,
                    'user_prompt_id': USER_PROMPT_ID,
                    'model': MODEL_NAME,
                    'temperature': TEMPERATURE,
                    'timestamp': datetime.now().isoformat()
                }
                
                results_list.append(result_row)
                
            except Exception as e:
                if verbose:
                    print(f"❌ Error processing example {sample_number}: {e}")
                
                # Store error result
                result_row = {
                    'example_id': sample_number,
                    'title': row['title_full'],
                    'abstract': row['abstract'],
                    'stage_2_true': row['stage_2'],
                    'stage_3_true': row['stage_3'],
                    'stage_2_binary': 1 if row['stage_2'] else 0,
                    'stage_3_binary': 1 if row['stage_3'] else 0,
                    'llm_score': 0,  # Error case
                    'llm_decision': 'ERROR',
                    'stage2_llm_binary': 0,
                    'stage3_llm_binary': 0,
                    'llm_reasoning': f'Processing error: {e}',
                    'experiment_id': EXPERIMENT_ID,
                    'dataset_source': dataset_source,
                    'system_prompt_id': SYSTEM_PROMPT_ID,
                    'user_prompt_id': USER_PROMPT_ID,
                    'model': MODEL_NAME,
                    'temperature': TEMPERATURE,
                    'timestamp': datetime.now().isoformat()
                }
                
                results_list.append(result_row)
        
        # Batch completion info
        batch_time = time.time() - batch_start_time
        if verbose:
            print(f"✅ Batch {batch_idx + 1} completed in {batch_time:.1f}s")
            if batch_idx < num_batches - 1:  # Not the last batch
                print(f"⏳ Brief pause before next batch...")
                time.sleep(2)  # Small delay between batches
    
    # Create results DataFrame
    results_df = pd.DataFrame(results_list)
    
    # Filter out errors for analysis
    valid_results = results_df[results_df['llm_decision'] != 'ERROR']
    
    # LIKERT SCALE ANALYSIS
    if len(valid_results) > 0:
        # Overall Likert distribution
        likert_counts = valid_results['llm_score'].value_counts().sort_index()
        
        # Likert distribution by Stage 2 ground truth
        stage2_true_scores = valid_results[valid_results['stage_2_true'] == True]['llm_score'].value_counts().sort_index()
        stage2_false_scores = valid_results[valid_results['stage_2_true'] == False]['llm_score'].value_counts().sort_index()
        
        # Likert distribution by Stage 3 ground truth
        stage3_true_scores = valid_results[valid_results['stage_3_true'] == True]['llm_score'].value_counts().sort_index()
        stage3_false_scores = valid_results[valid_results['stage_3_true'] == False]['llm_score'].value_counts().sort_index()
        
        # BINARY CLASSIFICATION METRICS
        # Stage 2 evaluation (3,4,5 vs 1,2)
        y_true_stage2 = valid_results['stage_2_binary'].values
        y_pred_stage2 = valid_results['stage2_llm_binary'].values
        
        accuracy_stage2 = accuracy_score(y_true_stage2, y_pred_stage2)
        precision_stage2 = precision_score(y_true_stage2, y_pred_stage2, zero_division=0)
        recall_stage2 = recall_score(y_true_stage2, y_pred_stage2, zero_division=0)
        f1_stage2 = f1_score(y_true_stage2, y_pred_stage2, zero_division=0)
        # labels=[0, 1] keeps the matrix 2x2 even if only one class appears in the sample
        tn2, fp2, fn2, tp2 = confusion_matrix(y_true_stage2, y_pred_stage2, labels=[0, 1]).ravel()
        
        # Stage 3 evaluation (4,5 vs 1,2,3)
        y_true_stage3 = valid_results['stage_3_binary'].values
        y_pred_stage3 = valid_results['stage3_llm_binary'].values
        
        accuracy_stage3 = accuracy_score(y_true_stage3, y_pred_stage3)
        precision_stage3 = precision_score(y_true_stage3, y_pred_stage3, zero_division=0)
        recall_stage3 = recall_score(y_true_stage3, y_pred_stage3, zero_division=0)
        f1_stage3 = f1_score(y_true_stage3, y_pred_stage3, zero_division=0)
        # labels=[0, 1] keeps the matrix 2x2 even if only one class appears in the sample
        tn3, fp3, fn3, tp3 = confusion_matrix(y_true_stage3, y_pred_stage3, labels=[0, 1]).ravel()
        
    else:
        # Handle case with no valid results
        likert_counts = pd.Series(dtype=int)
        stage2_true_scores = stage2_false_scores = pd.Series(dtype=int)
        stage3_true_scores = stage3_false_scores = pd.Series(dtype=int)
        accuracy_stage2 = precision_stage2 = recall_stage2 = f1_stage2 = 0.0
        accuracy_stage3 = precision_stage3 = recall_stage3 = f1_stage3 = 0.0
        tp2 = fp2 = tn2 = fn2 = tp3 = fp3 = tn3 = fn3 = 0
    
    # Updated metrics dictionary with Likert analysis
    metrics = {
        'stage_2_metrics': {
            'accuracy': accuracy_stage2,
            'precision': precision_stage2,
            'recall': recall_stage2,
            'f1_score': f1_stage2,
            'tp': int(tp2),
            'fp': int(fp2),
            'tn': int(tn2),
            'fn': int(fn2),
            'threshold': '3+ (moderate to high relevance)'
        },
        'stage_3_metrics': {
            'accuracy': accuracy_stage3,
            'precision': precision_stage3,
            'recall': recall_stage3,
            'f1_score': f1_stage3,
            'tp': int(tp3),
            'fp': int(fp3),
            'tn': int(tn3),
            'fn': int(fn3),
            'threshold': '4+ (high relevance)'
        },
        'likert_analysis': {
            'overall_distribution': {f'score_{i}': int(likert_counts.get(i, 0)) for i in range(1, 6)},
            'stage2_true_distribution': {f'score_{i}': int(stage2_true_scores.get(i, 0)) for i in range(1, 6)},
            'stage2_false_distribution': {f'score_{i}': int(stage2_false_scores.get(i, 0)) for i in range(1, 6)},
            'stage3_true_distribution': {f'score_{i}': int(stage3_true_scores.get(i, 0)) for i in range(1, 6)},
            'stage3_false_distribution': {f'score_{i}': int(stage3_false_scores.get(i, 0)) for i in range(1, 6)}
        },
        'total_examples': len(results_df),
        'successful_classifications': len(valid_results),
        'errors': len(results_df) - len(valid_results)
    }
    
    # Enhanced results printing
    if verbose:
        print(f"\n📊 EXPERIMENT RESULTS")
        print("=" * 50)
        print(f"📊 Likert Scale Distribution:")
        for i in range(1, 6):
            count = likert_counts.get(i, 0)
            print(f"   Score {i}: {count}")
        
        print(f"\n📈 Stage 2 Evaluation (3+ = positive):")
        print(f"   Accuracy:  {accuracy_stage2:.3f}")
        print(f"   Precision: {precision_stage2:.3f}")
        print(f"   Recall:    {recall_stage2:.3f}")
        print(f"   F1 Score:  {f1_stage2:.3f}")
        print(f"   TP: {tp2}, FP: {fp2}, TN: {tn2}, FN: {fn2}")
        
        print(f"\n📈 Stage 3 Evaluation (4+ = positive):")
        print(f"   Accuracy:  {accuracy_stage3:.3f}")
        print(f"   Precision: {precision_stage3:.3f}")
        print(f"   Recall:    {recall_stage3:.3f}")
        print(f"   F1 Score:  {f1_stage3:.3f}")
        print(f"   TP: {tp3}, FP: {fp3}, TN: {tn3}, FN: {fn3}")
        
        print(f"\n📊 Score Distribution by Ground Truth:")
        print(f"   Stage 2 True:  [1:{stage2_true_scores.get(1,0)}, 2:{stage2_true_scores.get(2,0)}, 3:{stage2_true_scores.get(3,0)}, 4:{stage2_true_scores.get(4,0)}, 5:{stage2_true_scores.get(5,0)}]")
        print(f"   Stage 2 False: [1:{stage2_false_scores.get(1,0)}, 2:{stage2_false_scores.get(2,0)}, 3:{stage2_false_scores.get(3,0)}, 4:{stage2_false_scores.get(4,0)}, 5:{stage2_false_scores.get(5,0)}]")
        print(f"   Stage 3 True:  [1:{stage3_true_scores.get(1,0)}, 2:{stage3_true_scores.get(2,0)}, 3:{stage3_true_scores.get(3,0)}, 4:{stage3_true_scores.get(4,0)}, 5:{stage3_true_scores.get(5,0)}]")
        print(f"   Stage 3 False: [1:{stage3_false_scores.get(1,0)}, 2:{stage3_false_scores.get(2,0)}, 3:{stage3_false_scores.get(3,0)}, 4:{stage3_false_scores.get(4,0)}, 5:{stage3_false_scores.get(5,0)}]")
        
        print(f"\n📋 Processing Summary:")
        print(f"   Total examples: {len(results_df)}")
        print(f"   Successful: {len(valid_results)}")
        print(f"   Errors: {len(results_df) - len(valid_results)}")
    
    # Save results
    if save_results:
        # Create filename with timestamp
        timestamp = datetime.now().strftime("%m%d%H%M")
        filename = f"{EXPERIMENT_ID}_{dataset_source}_{timestamp}.csv"
        results_dir = "../results"
        os.makedirs(results_dir, exist_ok=True)
        output_path = os.path.join(results_dir, filename)
        
        results_df.to_csv(output_path, index=False)
        
        if verbose:
            print(f"\n💾 Results saved to: {output_path}")
    
    return {
        'results_df': results_df,
        'metrics': metrics,
        'filename': filename if save_results else None
    }

print("✅ Classification experiment function with Likert scale analysis defined")
print("🚀 Ready to run: run_classification_experiment(df_BM, batch_size=20)")

✅ Classification experiment function with Likert scale analysis defined
🚀 Ready to run: run_classification_experiment(df_BM, batch_size=20)
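
The two evaluations above rest on turning a 1-5 Likert score into a binary label with a cutoff (3+ for Stage 2, 4+ for Stage 3). A minimal sketch of that mapping is shown below; `likert_to_binary` is a hypothetical helper for illustration, not the function that builds `stage2_llm_binary` and `stage3_llm_binary` in this notebook.

```python
# Hypothetical helper for illustration; the notebook's own column-building
# code is defined elsewhere and may differ.
def likert_to_binary(score, threshold):
    """Return 1 if the Likert score meets the threshold, else 0."""
    return int(score >= threshold)

scores = [1, 2, 3, 4, 5]

# Stage 2: scores 3, 4, 5 count as positive
stage2_pred = [likert_to_binary(s, 3) for s in scores]
# Stage 3: scores 4, 5 count as positive
stage3_pred = [likert_to_binary(s, 4) for s in scores]

print(stage2_pred)  # [0, 0, 1, 1, 1]
print(stage3_pred)  # [0, 0, 0, 1, 1]
```

A score of 3 is therefore the only value that is positive for Stage 2 but negative for Stage 3.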


## 🚀 Run experiment! 

In [10]:
# Run experiment with default settings
results = run_classification_experiment(df_BM)

🧪 Starting Classification Experiment with Batch Processing
📊 Dataset: BM
🎯 Total examples: 50
✅ Stage 3 True: 5
❌ Stage 3 False: 45
📦 Batch size: 20
📝 Sampled 50 examples
📦 Processing 50 examples in 3 batch(es)
⏱️  Estimated time: ~6 minutes (2 min per batch)

🔄 Processing Batch 1/3 (examples 1-20)
✅ Batch 1 completed in 154.0s
⏳ Brief pause before next batch...

🔄 Processing Batch 2/3 (examples 21-40)
✅ Batch 2 completed in 160.4s
⏳ Brief pause before next batch...

🔄 Processing Batch 3/3 (examples 41-50)
✅ Batch 3 completed in 77.7s

📊 EXPERIMENT RESULTS
📊 Likert Scale Distribution:
   Score 1: 34
   Score 2: 10
   Score 3: 3
   Score 4: 2
   Score 5: 1

📈 Stage 2 Evaluation (3+ = positive):
   Accuracy:  0.880
   Precision: 0.833
   Recall:    0.500
   F1 Score:  0.625
   TP: 5, FP: 1, TN: 39, FN: 5

📈 Stage 3 Evaluation (4+ = positive):
   Accuracy:  0.960
   Precision: 1.000
   Recall:    0.600
   F1 Score:  0.750
   TP: 3, FP: 0, TN: 45, FN: 2

📊 Score Distribution by Ground Truth:
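
As a sanity check, the reported metrics can be recomputed by hand from the confusion-matrix counts printed above. The sketch below uses the standard definitions; `binary_metrics` is a hypothetical helper, and the notebook itself relies on the scikit-learn functions.

```python
def binary_metrics(tp, fp, tn, fn):
    """Recompute accuracy, precision, recall and F1 from raw counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Counts copied from the run output above
stage2 = binary_metrics(tp=5, fp=1, tn=39, fn=5)
stage3 = binary_metrics(tp=3, fp=0, tn=45, fn=2)

for name, m in (("Stage 2", stage2), ("Stage 3", stage3)):
    print(name, {k: round(v, 3) for k, v in m.items()})
```

The rounded values match the printed results: 0.880 / 0.833 / 0.500 / 0.625 for Stage 2 and 0.960 / 1.000 / 0.600 / 0.750 for Stage 3.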

## 📊 Results Analysis

In [None]:
# Load results file - you can specify the exact file path here
RESULTS_FILE_PATH = "../results/0002_LB_08131412.csv"  # ⬅️ Change this to your specific file path

# Alternative: Set to None to auto-load the most recent file
# RESULTS_FILE_PATH = None

if RESULTS_FILE_PATH:
    # Load specific file
    if os.path.exists(RESULTS_FILE_PATH):
        print(f"📁 Loading specified file: {os.path.basename(RESULTS_FILE_PATH)}")
        df_results = pd.read_csv(RESULTS_FILE_PATH)
    else:
        print(f"❌ Error: File not found: {RESULTS_FILE_PATH}")
        df_results = None
else:
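    # Auto-load the most recently modified results file
    results_dir = "../results"
    # Skip experiment_summary.csv: it lives in the same directory and would
    # sort after the numbered result files if files were ordered by name
    result_files = [
        f for f in os.listdir(results_dir)
        if f.endswith('.csv') and f != 'experiment_summary.csv'
    ]
    if result_files:
        latest_file = max(
            result_files,
            key=lambda f: os.path.getmtime(os.path.join(results_dir, f))
        )
        file_path = os.path.join(results_dir, latest_file)
        print(f"📁 Auto-loading most recent file: {latest_file}")
        df_results = pd.read_csv(file_path)
    else:
        print("❌ No result files found in ../results directory")
        df_results = None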

# Continue with analysis if file was loaded successfully
if df_results is not None:
    print(f"\n📊 RESULTS OVERVIEW")
    print("=" * 50)
    print(f"Shape: {df_results.shape}")
    print(f"Columns: {list(df_results.columns)}")
    
    print(f"\n🎯 DECISION SUMMARY")
    print("=" * 30)
    print(df_results['llm_decision'].value_counts())
    
    print(f"\n📈 PERFORMANCE PREVIEW")
    print("=" * 30)
    print("Stage 2 vs LLM:")
    print(pd.crosstab(df_results['stage_2_true'], df_results['llm_decision']))
    print("\nStage 3 vs LLM:")
    print(pd.crosstab(df_results['stage_3_true'], df_results['llm_decision']))
    
    print(f"\n📋 FIRST FEW RESULTS")
    print("=" * 30)
    display(df_results[['example_id', 'stage_2_true', 'stage_3_true', 'llm_decision', 'llm_reasoning']].head())
else:
    print("❌ Could not load results file for analysis")
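
A note on picking the "most recent" results file: ordering file names alphabetically does not necessarily return the latest run, because the summary CSV sits in the same directory and names starting with a letter sort after names starting with a digit. The sketch below compares name-based and modification-time-based selection in a temporary directory; `newest_result_file` and the second file name are made up for the demonstration.

```python
import os
import tempfile

def newest_result_file(results_dir, exclude=("experiment_summary.csv",)):
    """Return the path of the most recently modified CSV, ignoring excluded names."""
    candidates = [
        os.path.join(results_dir, f)
        for f in os.listdir(results_dir)
        if f.endswith(".csv") and f not in exclude
    ]
    return max(candidates, key=os.path.getmtime) if candidates else None

with tempfile.TemporaryDirectory() as tmp:
    # Illustrative file names; modification times are set explicitly below
    names = ["016_BM_08141334.csv", "002_LB_08131412.csv", "experiment_summary.csv"]
    for i, name in enumerate(names):
        path = os.path.join(tmp, name)
        with open(path, "w") as fh:
            fh.write("example_id\n")
        os.utime(path, (1000 + i, 1000 + i))  # later entries look newer

    by_name = sorted(f for f in os.listdir(tmp) if f.endswith(".csv"))[-1]
    by_mtime = os.path.basename(newest_result_file(tmp))

print(by_name)   # experiment_summary.csv
print(by_mtime)  # 002_LB_08131412.csv
```

Modification time is only a proxy as well (copying or re-saving a file changes it), so setting `RESULTS_FILE_PATH` explicitly remains the safest option.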

In [None]:
# Display full reasoning for first 5 examples
if df_results is not None:
    print("🤖 FULL LLM REASONING EXAMPLES")
    print("=" * 80)

    for idx in range(min(5, len(df_results))):
        row = df_results.iloc[idx]
        title = str(row['title'])  # guard against missing (NaN) titles
        print(f"\n📋 EXAMPLE {row['example_id']} - {row['llm_decision']}")
        print(f"🎯 Ground Truth: Stage 2={row['stage_2_true']}, Stage 3={row['stage_3_true']}")
        print(f"📖 Title: {title[:100]}{'...' if len(title) > 100 else ''}")
        print(f"\n💭 FULL REASONING:")
        print("-" * 60)
        print(row['llm_reasoning'])
        print("-" * 60)
else:
    print("❌ No results loaded; run the loading cell above first")

## ➕ Add experiment info to the summary file

In [11]:
def add_experiment_to_summary(results_dict, summary_file="../results/experiment_summary.csv"):
    """Add new experiment results to the summary DataFrame with confusion matrix metrics and Likert analysis"""
    
    new_row = pd.DataFrame({
        'experiment_id': [EXPERIMENT_ID],
        'experiment_date': [EXPERIMENT_DATE],
        'experiment_category': [EXPERIMENT_CATEGORY],
        'experiment_goal': [EXPERIMENT_GOAL],
        'system_prompt_id': [SYSTEM_PROMPT_ID],
        'user_prompt_id': [USER_PROMPT_ID],
        'model_name': [MODEL_NAME],
        'temperature': [TEMPERATURE],
        'max_tokens': [MAX_TOKENS],
        'criteria_file': [CRITERIA_FILE],
        'examples_file': [EXAMPLES_FILE],
        'output_format': [OUTPUT_FORMAT],
        'domain': [DOMAIN],
        'topic': [TOPIC],
        'dataset_source': [DATASET_SOURCE],
        'n_total_examples': [results_dict['metrics']['total_examples']],
        'n_successful': [results_dict['metrics']['successful_classifications']],
        'n_errors': [results_dict['metrics']['errors']],
        # Stage 2 metrics
        'stage2_accuracy': [results_dict['metrics']['stage_2_metrics']['accuracy']],
        'stage2_precision': [results_dict['metrics']['stage_2_metrics']['precision']],
        'stage2_recall': [results_dict['metrics']['stage_2_metrics']['recall']],
        'stage2_f1': [results_dict['metrics']['stage_2_metrics']['f1_score']],
        'stage2_tp': [results_dict['metrics']['stage_2_metrics']['tp']],
        'stage2_fp': [results_dict['metrics']['stage_2_metrics']['fp']],
        'stage2_tn': [results_dict['metrics']['stage_2_metrics']['tn']],
        'stage2_fn': [results_dict['metrics']['stage_2_metrics']['fn']],
        # Stage 3 metrics
        'stage3_accuracy': [results_dict['metrics']['stage_3_metrics']['accuracy']],
        'stage3_precision': [results_dict['metrics']['stage_3_metrics']['precision']],
        'stage3_recall': [results_dict['metrics']['stage_3_metrics']['recall']],
        'stage3_f1': [results_dict['metrics']['stage_3_metrics']['f1_score']],
        'stage3_tp': [results_dict['metrics']['stage_3_metrics']['tp']],
        'stage3_fp': [results_dict['metrics']['stage_3_metrics']['fp']],
        'stage3_tn': [results_dict['metrics']['stage_3_metrics']['tn']],
        'stage3_fn': [results_dict['metrics']['stage_3_metrics']['fn']],
        # Likert scale overall distribution
        'likert_score_1': [results_dict['metrics']['likert_analysis']['overall_distribution']['score_1']],
        'likert_score_2': [results_dict['metrics']['likert_analysis']['overall_distribution']['score_2']],
        'likert_score_3': [results_dict['metrics']['likert_analysis']['overall_distribution']['score_3']],
        'likert_score_4': [results_dict['metrics']['likert_analysis']['overall_distribution']['score_4']],
        'likert_score_5': [results_dict['metrics']['likert_analysis']['overall_distribution']['score_5']],
        # Likert distribution for Stage 2 True
        'stage2_true_score_1': [results_dict['metrics']['likert_analysis']['stage2_true_distribution']['score_1']],
        'stage2_true_score_2': [results_dict['metrics']['likert_analysis']['stage2_true_distribution']['score_2']],
        'stage2_true_score_3': [results_dict['metrics']['likert_analysis']['stage2_true_distribution']['score_3']],
        'stage2_true_score_4': [results_dict['metrics']['likert_analysis']['stage2_true_distribution']['score_4']],
        'stage2_true_score_5': [results_dict['metrics']['likert_analysis']['stage2_true_distribution']['score_5']],
        # Likert distribution for Stage 2 False
        'stage2_false_score_1': [results_dict['metrics']['likert_analysis']['stage2_false_distribution']['score_1']],
        'stage2_false_score_2': [results_dict['metrics']['likert_analysis']['stage2_false_distribution']['score_2']],
        'stage2_false_score_3': [results_dict['metrics']['likert_analysis']['stage2_false_distribution']['score_3']],
        'stage2_false_score_4': [results_dict['metrics']['likert_analysis']['stage2_false_distribution']['score_4']],
        'stage2_false_score_5': [results_dict['metrics']['likert_analysis']['stage2_false_distribution']['score_5']],
        # Likert distribution for Stage 3 True
        'stage3_true_score_1': [results_dict['metrics']['likert_analysis']['stage3_true_distribution']['score_1']],
        'stage3_true_score_2': [results_dict['metrics']['likert_analysis']['stage3_true_distribution']['score_2']],
        'stage3_true_score_3': [results_dict['metrics']['likert_analysis']['stage3_true_distribution']['score_3']],
        'stage3_true_score_4': [results_dict['metrics']['likert_analysis']['stage3_true_distribution']['score_4']],
        'stage3_true_score_5': [results_dict['metrics']['likert_analysis']['stage3_true_distribution']['score_5']],
        # Likert distribution for Stage 3 False
        'stage3_false_score_1': [results_dict['metrics']['likert_analysis']['stage3_false_distribution']['score_1']],
        'stage3_false_score_2': [results_dict['metrics']['likert_analysis']['stage3_false_distribution']['score_2']],
        'stage3_false_score_3': [results_dict['metrics']['likert_analysis']['stage3_false_distribution']['score_3']],
        'stage3_false_score_4': [results_dict['metrics']['likert_analysis']['stage3_false_distribution']['score_4']],
        'stage3_false_score_5': [results_dict['metrics']['likert_analysis']['stage3_false_distribution']['score_5']],
        # Thresholds for documentation
        'stage2_threshold': [results_dict['metrics']['stage_2_metrics']['threshold']],
        'stage3_threshold': [results_dict['metrics']['stage_3_metrics']['threshold']],
        # Existing columns
        'results_filename': [results_dict['filename']],
        'timestamp': [datetime.now().isoformat()]
    })
    
    # Load existing summary or create new one
    if os.path.exists(summary_file):
        existing_summary = pd.read_csv(summary_file)
        updated_summary = pd.concat([existing_summary, new_row], ignore_index=True)
        print(f"✅ Added experiment {EXPERIMENT_ID} to existing summary")
    else:
        updated_summary = new_row
        print(f"✅ Created new summary file with experiment {EXPERIMENT_ID}")
    
    # Save updated summary
    updated_summary.to_csv(summary_file, index=False)
    print(f"💾 Summary saved to: {summary_file}")
    
    # Display last 5 rows for verification
    print(f"\n📋 LAST 5 EXPERIMENTS IN SUMMARY:")
    print("=" * 50)
    display(updated_summary.tail())
    
    # Display Likert summary for the new experiment
    print(f"\n📊 LIKERT SCALE SUMMARY FOR EXPERIMENT {EXPERIMENT_ID}:")
    print("=" * 50)
    print(f"Overall Distribution: [1:{new_row['likert_score_1'].iloc[0]}, 2:{new_row['likert_score_2'].iloc[0]}, 3:{new_row['likert_score_3'].iloc[0]}, 4:{new_row['likert_score_4'].iloc[0]}, 5:{new_row['likert_score_5'].iloc[0]}]")
    print(f"Stage 2 True:  [1:{new_row['stage2_true_score_1'].iloc[0]}, 2:{new_row['stage2_true_score_2'].iloc[0]}, 3:{new_row['stage2_true_score_3'].iloc[0]}, 4:{new_row['stage2_true_score_4'].iloc[0]}, 5:{new_row['stage2_true_score_5'].iloc[0]}]")
    print(f"Stage 2 False: [1:{new_row['stage2_false_score_1'].iloc[0]}, 2:{new_row['stage2_false_score_2'].iloc[0]}, 3:{new_row['stage2_false_score_3'].iloc[0]}, 4:{new_row['stage2_false_score_4'].iloc[0]}, 5:{new_row['stage2_false_score_5'].iloc[0]}]")
    print(f"Stage 3 True:  [1:{new_row['stage3_true_score_1'].iloc[0]}, 2:{new_row['stage3_true_score_2'].iloc[0]}, 3:{new_row['stage3_true_score_3'].iloc[0]}, 4:{new_row['stage3_true_score_4'].iloc[0]}, 5:{new_row['stage3_true_score_5'].iloc[0]}]")
    print(f"Stage 3 False: [1:{new_row['stage3_false_score_1'].iloc[0]}, 2:{new_row['stage3_false_score_2'].iloc[0]}, 3:{new_row['stage3_false_score_3'].iloc[0]}, 4:{new_row['stage3_false_score_4'].iloc[0]}, 5:{new_row['stage3_false_score_5'].iloc[0]}]")
    
    print(f"\n📊 SUMMARY STATS:")
    print(f"   Total experiments: {len(updated_summary)}")
    print(f"   Unique experiment IDs: {updated_summary['experiment_id'].nunique()}")
    print(f"   Datasets used: {updated_summary['dataset_source'].unique().tolist()}")
    
    return updated_summary

# Append the current experiment to the summary file
summary_df = add_experiment_to_summary(results)

✅ Added experiment 016 to existing summary
💾 Summary saved to: ../results/experiment_summary.csv

📋 LAST 5 EXPERIMENTS IN SUMMARY:


Unnamed: 0,experiment_id,experiment_date,experiment_category,experiment_goal,system_prompt_id,user_prompt_id,model_name,temperature,max_tokens,criteria_file,examples_file,output_format,domain,topic,dataset_source,n_total_examples,n_successful,n_errors,stage2_accuracy,stage2_precision,stage2_recall,stage2_f1,stage2_tp,stage2_fp,stage2_tn,stage2_fn,stage3_accuracy,stage3_precision,stage3_recall,stage3_f1,stage3_tp,stage3_fp,stage3_tn,stage3_fn,results_filename,timestamp,llm_include_count,llm_maybe_count,llm_exclude_count,llm_error_count,likert_score_1,likert_score_2,likert_score_3,likert_score_4,likert_score_5,stage2_true_score_1,stage2_true_score_2,stage2_true_score_3,stage2_true_score_4,stage2_true_score_5,stage2_false_score_1,stage2_false_score_2,stage2_false_score_3,stage2_false_score_4,stage2_false_score_5,stage3_true_score_1,stage3_true_score_2,stage3_true_score_3,stage3_true_score_4,stage3_true_score_5,stage3_false_score_1,stage3_false_score_2,stage3_false_score_3,stage3_false_score_4,stage3_false_score_5,stage2_threshold,stage3_threshold
11,12,2025-08-14,Testing,Test Set Up,SYS_001,USR_001,gpt-4o,0.0,4000,../prompts/Criteria_BM_01.csv,../prompts/exmpl_five_BM_01.csv,Binary,political_communication,Computational Text Analysis Methods,BM,50,50,0,0.84,0.667,0.4,0.5,4,2,38,6,0.94,0.667,0.8,0.727,4.0,2.0,43.0,1.0,012_BM_08141308.csv,2025-08-14T13:09:04.489721,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
12,13,2025-08-14,Testing,Test Set Up,SYS_001,USR_003,gpt-4o,0.0,4000,../prompts/Criteria_BM_01.csv,,Yes/Maybe/No,political_communication,Computational Text Analysis Methods,BM,50,50,0,0.86,0.8,0.4,0.533,4,1,39,6,,,,,,,,,013_BM_08141313.csv,2025-08-14T13:14:14.235853,5.0,0.0,45.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,
13,14,2025-08-14,Testing,Test Set Up,SYS_001,USR_004,gpt-4o,0.0,4000,../prompts/Criteria_BM_01.csv,../prompts/exmpl_single_BM_01.csv,Yes/Maybe/No,political_communication,Computational Text Analysis Methods,BM,50,50,0,0.86,1.0,0.3,0.462,3,0,40,7,,,,,,,,,014_BM_08141316.csv,2025-08-14T13:19:18.530402,3.0,0.0,47.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,
14,15,2025-08-14,Testing,Test Set Up,SYS_001,USR_004,gpt-4o,0.0,4000,../prompts/Criteria_BM_01.csv,../prompts/exmpl_five_BM_01.csv,Yes/Maybe/No,political_communication,Computational Text Analysis Methods,BM,50,50,0,0.84,0.667,0.4,0.5,4,2,38,6,,,,,,,,,015_BM_08141326.csv,2025-08-14T13:28:50.888383,6.0,0.0,44.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,
15,16,2025-08-14,Testing,Test Set Up,SYS_001,USR_005,gpt-4o,0.0,4000,../prompts/Criteria_BM_01.csv,,Likert,political_communication,Computational Text Analysis Methods,BM,50,50,0,0.88,0.833,0.5,0.625,5,1,39,5,0.96,1.0,0.6,0.75,3.0,0.0,45.0,2.0,016_BM_08141334.csv,2025-08-14T13:34:32.727516,,,,,34.0,10.0,3.0,2.0,1.0,2.0,3.0,2.0,2.0,1.0,32.0,7.0,1.0,0.0,0.0,0.0,1.0,1.0,2.0,1.0,34.0,9.0,2.0,0.0,0.0,3+ (moderate to high relevance),4+ (high relevance)



📊 LIKERT SCALE SUMMARY FOR EXPERIMENT 016:
Overall Distribution: [1:34, 2:10, 3:3, 4:2, 5:1]
Stage 2 True:  [1:2, 2:3, 3:2, 4:2, 5:1]
Stage 2 False: [1:32, 2:7, 3:1, 4:0, 5:0]
Stage 3 True:  [1:0, 2:1, 3:1, 4:2, 5:1]
Stage 3 False: [1:34, 2:9, 3:2, 4:0, 5:0]

📊 SUMMARY STATS:
   Total experiments: 16
   Unique experiment IDs: 16
   Datasets used: ['LB', 'BM']
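
The per-class Likert distributions printed above are enough to see how Stage 2 precision and recall would change under a different cutoff. The sketch below sweeps the threshold using the Stage 2 True and Stage 2 False counts from experiment 016; `sweep_thresholds` is a hypothetical helper, and with only 10 positives among 50 examples the numbers should be read as indicative at best.

```python
# Counts copied from the Likert summary for experiment 016
stage2_true = {1: 2, 2: 3, 3: 2, 4: 2, 5: 1}
stage2_false = {1: 32, 2: 7, 3: 1, 4: 0, 5: 0}

def sweep_thresholds(true_counts, false_counts):
    """Return (threshold, tp, fp, fn, precision, recall) for cutoffs 2 to 5."""
    rows = []
    for threshold in range(2, 6):
        tp = sum(n for s, n in true_counts.items() if s >= threshold)
        fn = sum(n for s, n in true_counts.items() if s < threshold)
        fp = sum(n for s, n in false_counts.items() if s >= threshold)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        rows.append((threshold, tp, fp, fn, round(precision, 3), round(recall, 3)))
    return rows

rows = sweep_thresholds(stage2_true, stage2_false)
for threshold, tp, fp, fn, precision, recall in rows:
    print(f"{threshold}+: TP={tp} FP={fp} FN={fn} precision={precision} recall={recall}")
```

At the 3+ cutoff this reproduces the reported TP=5, FP=1, FN=5. Lowering the cutoff to 2+ would raise recall to 0.8 at a precision of 0.5 on this sample.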


## 📝 Conclusions and Next Steps

### Key Findings
- 

### Next Steps
- [Suggest follow-up experiments]
- [List potential improvements]
- [Identify areas for further investigation]