# 🧪 SLR Abstract Screening Experiment
#### Experiment Information
- **ID**: 011
- **Date**: 08/14
#### 🎯 Goal
- Test the setup on a small dataset
#### ⚙️ Configuration
- **LLM** : GPT-4o
- **Data**: BM
- **Examples** : Single
- **Output**: Binary
#### 📝 Notes
- 


## 🔧 Setup and Configuration

In [16]:
# Essential imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import json
from pathlib import Path
import os
from dotenv import load_dotenv
from openai import OpenAI

# Configure pandas display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.width', 1000)
pd.set_option('display.float_format', lambda x: '%.3f' % x)

# Set plotting style
sns.set_theme()  # Apply seaborn's default theme
plt.rcParams['figure.figsize'] = (12, 8)

In [17]:
# Data Import 

# Define the data paths for both datasets
DATA_PATH_1 = "../data/SSOT_manual_LB_20250808_120908.csv" # ⬅️ Change this path if needed
DATA_PATH_2 =  "../data/SSOT_manual_BM_20250813_132621.csv" # ⬅️ Change this path if needed

# Load the first dataset (df_LB)
try:
    df_LB = pd.read_csv(DATA_PATH_1)
    print(f"✓ First dataset loaded successfully")
    print(f"✓ Shape of dataset 1: {df_LB.shape}")
except FileNotFoundError:
    print(f"❌ Error: LB dataset not found at {DATA_PATH_1}")
except Exception as e:
    print(f"❌ Error loading the first dataset: {str(e)}")

# Load the second dataset (df_BM)
try:
    df_BM = pd.read_csv(DATA_PATH_2)
    print(f"\n✓ Second dataset loaded successfully")
    print(f"✓ Shape of dataset 2: {df_BM.shape}")
except FileNotFoundError:
    print(f"❌ Error: BM dataset not found at {DATA_PATH_2}")
except Exception as e:
    print(f"❌ Error loading the second dataset: {str(e)}")

# Display basic information about both datasets
print("\nFirst few rows of dataset 1:\n")
display(df_LB.head())

print("\nFirst few rows of dataset 2:\n")
display(df_BM.head())

✓ First dataset loaded successfully
✓ Shape of dataset 1: (3944, 15)

✓ Second dataset loaded successfully
✓ Shape of dataset 2: (917, 13)

First few rows of dataset 1:



Unnamed: 0,ID,abstract,acmid,author,doi,outlet,title_full,url,year,qualtrics_id,wos_id,ebsco_id,stage_1,stage_2,stage_3
0,Bindu2018503,Online social networks have become immensely p...,,"Bindu, P V and Mishra, R and Thilagam, P S",10.1007/s10844-017-0494-z,Journal of Intelligent Information Systems,{Discovering spammer communities in TWITTER},https://www.scopus.com/inward/record.uri?eid=2...,2018,12,,,True,False,False
1,Moraga2018470,This article explores the ways Latinos—as audi...,,"Moraga, J E",10.1177/0193723518797030,Journal of Sport and Social Issues,"{On ESPN Deportes: Latinos, Sport MEDIA, and t...",https://www.scopus.com/inward/record.uri?eid=2...,2018,22,,,True,False,False
2,Lanosga20181676,This study of American investigative reporting...,,"Lanosga, G and Martin, J",10.1177/1464884916683555,JOURNALISm,"{JOURNALISts, sources, and policy outcomes: In...",https://www.scopus.com/inward/record.uri?eid=2...,2018,47,,,True,False,True
3,Warner2018720,"In this study, we test the indirect and condit...",,"Warner, B R and Jennings, F J and Bramlett, J ...",10.1080/15205436.2018.1472283,Mass Communication and Society,{A MultiMEDIA Analysis of Persuasion in the 20...,https://www.scopus.com/inward/record.uri?eid=2...,2018,50,,,True,False,False
4,Burrows20181117,Professional communicators produce a diverse r...,,"Burrows, E",10.1177/0163443718764807,"MEDIA, Culture and Society",{Indigenous MEDIA producers' perspectives on o...,https://www.scopus.com/inward/record.uri?eid=2...,2018,56,,,True,False,False



First few rows of dataset 2:



Unnamed: 0,(internal) id,(source) id,abstract,title_full,journal,authors,tags,consensus,labeled_at...9,code,stage_1,stage_2,stage_3
0,33937314,175,There is a worry that serious forms of politic...,Is Context the Key? The (Non-)Differential Eff...,Polit. Commun.,,,o,,-1,True,False,False
1,33937315,113,The electoral model of democracy holds the ide...,POLITICAL NEWS IN ONLINE AND PRINT NEWSPAPERS ...,Digit. Journal.,,,o,,-1,True,False,False
2,33937316,122,Machine learning is a field at the intersectio...,Machine Learning for Sociology,Annu. Rev. Sociol.,,,o,,-1,True,False,False
3,33937317,467,Research on digital glocalization has found th...,Improving Health in Low-Income Communities Wit...,J. Commun.,,,o,,-1,True,False,False
4,33937318,10,Political scientists often wish to classify do...,Using Word Order in Political Text Classificat...,Polit. Anal.,,,o,,-1,True,False,False


## 🧫 Define Experiment Parameters

In [18]:
from datetime import datetime

# Experiment Metadata
EXPERIMENT_ID = "011"  # ⬅️ Change this for each new experiment
EXPERIMENT_DATE = "2025-08-14"  # ⬅️ Update the date
EXPERIMENT_CATEGORY = "Testing"  # ⬅️ Category of experiment
EXPERIMENT_GOAL = "Test Set Up"  # ⬅️ What are you testing?

# Model Configuration
MODEL_NAME = "gpt-4o"
TEMPERATURE = 0.0
MAX_TOKENS = 4000

# Print experiment info
print("🧪 EXPERIMENT SETUP")
print("=" * 50)
print(f"ID: {EXPERIMENT_ID}")
print(f"Date: {EXPERIMENT_DATE}")
print(f"Category: {EXPERIMENT_CATEGORY}")
print(f"🎯Goal: {EXPERIMENT_GOAL}")
print(f"Model: {MODEL_NAME} (temp={TEMPERATURE})")
print("=" * 50)
print("✅ Experiment configuration loaded")

🧪 EXPERIMENT SETUP
ID: 011
Date: 2025-08-14
Category: Testing
🎯Goal: Test Set Up
Model: gpt-4o (temp=0.0)
✅ Experiment configuration loaded


## 📣 Set up Basic API Call

In [19]:
import os
import json
from openai import OpenAI
from dotenv import load_dotenv
from datetime import datetime

# Load environment variables
load_dotenv()

# Get the API key from environment variables
api_key = os.getenv("OPENAI_API_KEY")

# Validate API key
if not api_key:
    print("⚠️  Error: OPENAI_API_KEY not found.")
    print("Please make sure you have a .env file with OPENAI_API_KEY='sk-...'")
else:
    print("✅ OpenAI API Key loaded successfully.")
    client = OpenAI(api_key=api_key)
    print("✅ OpenAI client initialized.")

# Enhanced analysis function for abstract screening
def screen_abstract_llm(abstract_text, system_prompt, user_prompt_template, 
                       model="gpt-4o", temperature=0.0):
    """
    Screen an abstract using LLM with system and user prompts.
    
    Args:
        abstract_text (str): The abstract to analyze
        system_prompt (str): The system prompt defining the role
        user_prompt_template (str): Template with {abstract} placeholder
        model (str): The OpenAI model to use
        temperature (float): Temperature setting for response randomness
    
    Returns:
        dict: Result with decision, reasoning, and metadata
    """
    if 'client' not in globals():
        return {"error": "OpenAI client is not initialized. Please check your API key."}

    try:
        # Insert abstract into user prompt template
        user_prompt = user_prompt_template.format(abstract=abstract_text)
        
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt}
            ],
            temperature=temperature,
            max_tokens=MAX_TOKENS  # set in the experiment parameters cell
        )
        
        if response and response.choices:
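            content = response.choices[0].message.content or ""
            result = {
                # NOTE: substring match. Any reply containing "INCLUDE"
                # (e.g. "should not be included") is labeled INCLUDE.
                "decision": "INCLUDE" if "INCLUDE" in content.upper() else "EXCLUDE",
                "reasoning": content,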
                "model": model,
                "temperature": temperature,
                "timestamp": datetime.now().isoformat(),
                "error": None
            }
            return result
        else:
            return {"error": "API Error: Empty or invalid response."}
            
    except Exception as e:
        return {"error": f"API Error: {e}"}

print("✅ Enhanced screening function defined.")

✅ OpenAI API Key loaded successfully.
✅ OpenAI client initialized.
✅ Enhanced screening function defined.
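
The screening function above labels a reply INCLUDE whenever that word appears anywhere in the text, so a reply whose reasoning says "should not be included" would also match. A minimal sketch of a stricter alternative follows. It reads only the word that comes right after the `Decision:` label requested in the user prompt. `parse_decision` is a hypothetical helper, not part of this notebook, and the regular expression assumes the model reproduces that label.

```python
import re

def parse_decision(reply, options=("INCLUDE", "EXCLUDE")):
    """Return the option named right after a 'Decision:' label, else 'UNKNOWN'."""
    # Tolerate optional markdown bold markers and quotes around the value.
    match = re.search(r'decision\*{0,2}\s*:\s*\*{0,2}\s*"?(\w+)', reply, flags=re.IGNORECASE)
    if not match:
        return "UNKNOWN"
    value = match.group(1).upper()
    return value if value in options else "UNKNOWN"

# Hypothetical replies, written for illustration only
print(parse_decision("**Decision:** INCLUDE\n**Reasoning:** applies topic modeling"))
print(parse_decision("**Decision:** EXCLUDE\n**Reasoning:** should not be included"))
print(parse_decision("This abstract looks relevant."))
```

Returning `UNKNOWN` for unparseable replies keeps them distinguishable from real EXCLUDE decisions, so they can be counted separately rather than folded into the negative class.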


## 🏛️ Set Up System Prompt 

In [20]:
# System prompt configuration
SYSTEM_PROMPT_ID = "SYS_001"  # ⬅️ Change this ID for different system prompts
SYSTEM_PROMPT_DESCRIPTION = "Generic expert literature review screener for systematic reviews"

# Define the system prompt that sets the LLM's role
SYSTEM_PROMPT = """You are an expert in scientific literature review and systematic review methodology.

Your task is to screen research abstracts and decide whether they should be INCLUDED or EXCLUDED from a systematic literature review based on provided criteria.

INSTRUCTIONS:
1. Carefully read the provided inclusion/exclusion criteria
2. Review any example abstracts to understand the decision-making pattern
3. Apply the criteria systematically to the given abstract and title
4. Provide your decision in the exact format requested
5. Base your reasoning strictly on the provided criteria

Be consistent, objective, and systematic in your evaluation. Do not make up additional criteria beyond what is provided. Focus only on what is explicitly stated in the instructions."""

print(f"✅ System prompt defined")
print(f"📋 ID: {SYSTEM_PROMPT_ID}")
print(f"📏 Length: {len(SYSTEM_PROMPT)} characters")
print(f"📄 Description: {SYSTEM_PROMPT_DESCRIPTION}")

✅ System prompt defined
📋 ID: SYS_001
📏 Length: 759 characters
📄 Description: Generic expert literature review screener for systematic reviews


## 👩🏻‍⚕️ Create User Prompt


In [21]:
# User prompt configuration
USER_PROMPT_ID = "USR_001"  # ⬅️ Change this ID for different user prompts
USER_PROMPT_DESCRIPTION = "Basic CTAM screening with criteria and examples from CSV files"

# File paths for modular components
CRITERIA_FILE = "../prompts/Criteria_BM_01.csv"  # ⬅️ Change criteria file here
EXAMPLES_FILE = "../prompts/exmpl_single_BM_01.csv"  # ⬅️ Change examples file here (or set to None)

# Output configuration
OUTPUT_FORMAT = "Binary"  # ⬅️ Options: "Binary", "Yes/Maybe/No", "Likert"
DECISION_OPTIONS = ["INCLUDE", "EXCLUDE"] # ⬅️ Change according to the output format

# Additional metadata for results tracking
DOMAIN = "political_communication" # ⬅️ Change this to the domain of the study
TOPIC = "Computational Text Analysis Methods"  # ⬅️ Change this to the topic of the study
DATASET_SOURCE = "BM"  # ⬅️ Which dataset (BM/LB)

# Define the user prompt template with placeholders
USER_PROMPT_TEMPLATE = """## SCREENING TASK:
You are screening abstracts for a systematic literature review on {topic} in {domain}.

## INCLUSION/EXCLUSION CRITERIA:
{criteria_text}

{examples_section}

## ABSTRACT TO SCREEN:
**Title:** {title}
**Abstract:** {abstract}

## YOUR DECISION:
Based strictly on the criteria above, provide your decision as either "{decision_include}" or "{decision_exclude}" followed by your reasoning:

**Decision:** 
**Reasoning:** """

print(f"✅ User prompt configuration and template loaded")
print(f"📋 ID: {USER_PROMPT_ID}")
print(f"📄 Description: {USER_PROMPT_DESCRIPTION}")
print(f"📁 Criteria: {CRITERIA_FILE}")
print(f"📁 Examples: {EXAMPLES_FILE}")
print(f"🎯 Output: {OUTPUT_FORMAT}")
print(f"🔬 Topic: {TOPIC} | Domain: {DOMAIN} | Source: {DATASET_SOURCE}")
print(f"📏 Template length: {len(USER_PROMPT_TEMPLATE)} characters")

✅ User prompt configuration and template loaded
📋 ID: USR_001
📄 Description: Basic CTAM screening with criteria and examples from CSV files
📁 Criteria: ../prompts/Criteria_BM_01.csv
📁 Examples: ../prompts/exmpl_single_BM_01.csv
🎯 Output: Binary
🔬 Topic: Computational Text Analysis Methods | Domain: political_communication | Source: BM
📏 Template length: 437 characters
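
The placeholders in the template are filled with a single `str.format` call. The sketch below uses a shortened stand-in template (hypothetical, for illustration only) to show one property that matters for this data: braces inside a substituted value, such as the LB titles wrapped in `{...}`, are inserted verbatim, because `str.format` does not re-scan the values it substitutes.

```python
# Shortened stand-in for USER_PROMPT_TEMPLATE (hypothetical, illustration only)
template = (
    "Screening for a review on {topic} in {domain}.\n"
    "**Title:** {title}\n"
    "**Abstract:** {abstract}\n"
    'Answer "{decision_include}" or "{decision_exclude}".'
)

prompt = template.format(
    topic="Computational Text Analysis Methods",
    domain="political_communication",
    title="{Discovering spammer communities in TWITTER}",  # braces are kept as-is
    abstract="Online social networks have become immensely p...",
    decision_include="INCLUDE",
    decision_exclude="EXCLUDE",
)
print(prompt)

# Passing the finished prompt through a bare "{abstract}" template leaves it unchanged.
passthrough = "{abstract}".format(abstract=prompt)
print(passthrough == prompt)
```

Literal braces written into the template itself are a different case: they would have to be doubled (`{{` and `}}`) to survive formatting.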


## ✅ Validation Check

In [22]:
# =============================================================================
# ✅ EXPERIMENT VALIDATION CHECK
# =============================================================================

def validate_experiment_setup(df, dataset_source="BM"):
    """
    Validate that all required variables and data are available for the experiment.
    
    Args:
        df: DataFrame to be used in experiment
        dataset_source: Dataset identifier
    
    Returns:
        bool: True if all validations pass, False otherwise
    """
    
    print("🔍 VALIDATION CHECK")
    print("=" * 50)
    
    validation_passed = True
    
    # Check required global variables
    required_vars = {
        'EXPERIMENT_ID': globals().get('EXPERIMENT_ID'),
        'SYSTEM_PROMPT_ID': globals().get('SYSTEM_PROMPT_ID'), 
        'USER_PROMPT_ID': globals().get('USER_PROMPT_ID'),
        'SYSTEM_PROMPT': globals().get('SYSTEM_PROMPT'),
        'USER_PROMPT_TEMPLATE': globals().get('USER_PROMPT_TEMPLATE'),
        'CRITERIA_FILE': globals().get('CRITERIA_FILE'),
        'EXAMPLES_FILE': globals().get('EXAMPLES_FILE'),
        'DECISION_OPTIONS': globals().get('DECISION_OPTIONS'),
        'MODEL_NAME': globals().get('MODEL_NAME'),
        'TEMPERATURE': globals().get('TEMPERATURE'),
        'TOPIC': globals().get('TOPIC'),
        'DOMAIN': globals().get('DOMAIN')
    }
    
    print("📋 Checking required variables:")
    for var_name, var_value in required_vars.items():
        if var_value is None:
            print(f"   ❌ {var_name}: NOT DEFINED")
            validation_passed = False
        else:
            print(f"   ✅ {var_name}: {str(var_value)[:50]}{'...' if len(str(var_value)) > 50 else ''}")
    
    # Check DataFrame structure
    print(f"\n📊 Checking DataFrame structure:")
    required_columns = ['abstract', 'title_full', 'stage_2', 'stage_3']
    
    if df is None:
        print(f"   ❌ DataFrame is None")
        validation_passed = False
    else:
        print(f"   ✅ DataFrame shape: {df.shape}")
        
        for col in required_columns:
            if col in df.columns:
                print(f"   ✅ Column '{col}': Present")
            else:
                print(f"   ❌ Column '{col}': MISSING")
                validation_passed = False
    
    # Check data availability
    if df is not None and all(col in df.columns for col in required_columns):
        print(f"\n📈 Checking data availability:")
        stage2_true = len(df[df['stage_2'] == True])
        stage2_false = len(df[df['stage_2'] == False])
        stage3_true = len(df[df['stage_3'] == True])
        stage3_false = len(df[df['stage_3'] == False])
        
        print(f"   📊 Stage 2 True: {stage2_true}")
        print(f"   📊 Stage 2 False: {stage2_false}")
        print(f"   📊 Stage 3 True: {stage3_true}")
        print(f"   📊 Stage 3 False: {stage3_false}")
        
        if stage3_true < 10:
            print(f"   ⚠️  Warning: Only {stage3_true} stage_3=True examples available")
        if stage3_false < 10:
            print(f"   ⚠️  Warning: Only {stage3_false} stage_3=False examples available")
    
    # Check file paths
    print(f"\n📁 Checking file paths:")
    import os
    
    if CRITERIA_FILE and os.path.exists(CRITERIA_FILE):
        print(f"   ✅ Criteria file: {CRITERIA_FILE}")
    elif CRITERIA_FILE:
        print(f"   ❌ Criteria file: {CRITERIA_FILE} (NOT FOUND)")
        validation_passed = False
    else:
        print(f"   ❌ Criteria file: NOT SPECIFIED")
        validation_passed = False
    
    if EXAMPLES_FILE:
        if os.path.exists(EXAMPLES_FILE):
            print(f"   ✅ Examples file: {EXAMPLES_FILE}")
        else:
            print(f"   ❌ Examples file: {EXAMPLES_FILE} (NOT FOUND)")
            validation_passed = False
    else:
        print(f"   ℹ️  Examples file: None (will run without examples)")
    
    # Check API function
    print(f"\n🤖 Checking API function:")
    if 'screen_abstract_llm' in globals():
        print(f"   ✅ screen_abstract_llm function: Available")
    else:
        print(f"   ❌ screen_abstract_llm function: NOT DEFINED")
        validation_passed = False
    
    # Final result
    print("\n" + "=" * 50)
    if validation_passed:
        print("✅ ALL VALIDATIONS PASSED - Ready to run experiment!")
    else:
        print("❌ VALIDATION FAILED - Please fix the issues above before running")
    
    return validation_passed

# Run validation
validation_result = validate_experiment_setup(df_BM, "BM")

🔍 VALIDATION CHECK
📋 Checking required variables:
   ✅ EXPERIMENT_ID: 011
   ✅ SYSTEM_PROMPT_ID: SYS_001
   ✅ USER_PROMPT_ID: USR_001
   ✅ SYSTEM_PROMPT: You are an expert in scientific literature review ...
   ✅ USER_PROMPT_TEMPLATE: ## SCREENING TASK:
You are screening abstracts for...
   ✅ CRITERIA_FILE: ../prompts/Criteria_BM_01.csv
   ✅ EXAMPLES_FILE: ../prompts/exmpl_single_BM_01.csv
   ✅ DECISION_OPTIONS: ['INCLUDE', 'EXCLUDE']
   ✅ MODEL_NAME: gpt-4o
   ✅ TEMPERATURE: 0.0
   ✅ TOPIC: Computational Text Analysis Methods
   ✅ DOMAIN: political_communication

📊 Checking DataFrame structure:
   ✅ DataFrame shape: (917, 13)
   ✅ Column 'abstract': Present
   ✅ Column 'title_full': Present
   ✅ Column 'stage_2': Present
   ✅ Column 'stage_3': Present

📈 Checking data availability:
   📊 Stage 2 True: 166
   📊 Stage 2 False: 751
   📊 Stage 3 True: 96
   📊 Stage 3 False: 821

📁 Checking file paths:
   ✅ Criteria file: ../prompts/Criteria_BM_01.csv
   ✅ Examples file: ../prompts/exmpl_si

## 🔬 Set Up Function

In [23]:
import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix 
from datetime import datetime
import os

def run_classification_experiment(
    df, 
    n_total_examples=50,  # ⬅️ Total number of examples to test
    n_stage3_true=5,     # ⬅️ Number of stage_3=True examples
    n_stage3_false=45,    # ⬅️ Number of stage_3=False examples
    dataset_source="BM",  # ⬅️ Dataset identifier (LB/BM)
    save_results=True,    # ⬅️ Whether to save results to CSV
    verbose=True          # ⬅️ Print progress updates
):
    """
    Run LLM classification experiment on abstracts.
    
    Args:
        df: DataFrame with abstracts (must have 'abstract', 'title_full', 'stage_2', 'stage_3')
        n_total_examples: Total number of examples to test
        n_stage3_true: Number of stage_3=True examples to include
        n_stage3_false: Number of stage_3=False examples to include
        dataset_source: Dataset identifier for results filename
        save_results: Whether to save results to CSV
        verbose: Whether to print progress
    
    Returns:
        dict: Results including metrics and DataFrame
    """
    
    if verbose:
        print(f"🧪 Starting Classification Experiment")
        print(f"📊 Dataset: {dataset_source}")
        print(f"🎯 Total examples: {n_total_examples}")
        print(f"✅ Stage 3 True: {n_stage3_true}")
        print(f"❌ Stage 3 False: {n_stage3_false}")
        print("=" * 50)
    
    # Sample examples
    stage3_true_samples = df[df['stage_3'] == True].sample(n=n_stage3_true, random_state=42)
    stage3_false_samples = df[df['stage_3'] == False].sample(n=n_stage3_false, random_state=42)
    
    # Combine samples
    test_samples = pd.concat([stage3_true_samples, stage3_false_samples]).reset_index(drop=True)
    
    if verbose:
        print(f"📝 Sampled {len(test_samples)} examples")
    
    # Load criteria and examples text
    def load_criteria_text(criteria_file):
        try:
            criteria_df = pd.read_csv(criteria_file)
            criteria_text = ""
            
            # Add inclusion criteria
            inclusion_criteria = criteria_df[criteria_df['type'] == 'inclusion']
            if len(inclusion_criteria) > 0:
                criteria_text += "**INCLUSION CRITERIA:**\n"
                for _, row in inclusion_criteria.iterrows():
                    criteria_text += f"- **{row['criterion_id']}**: {row['description']}\n"
                    if pd.notna(row['examples']) and row['examples'].strip():
                        criteria_text += f"  *Examples: {row['examples']}*\n"
            
            # Add exclusion criteria
            exclusion_criteria = criteria_df[criteria_df['type'] == 'exclusion']
            if len(exclusion_criteria) > 0:
                criteria_text += "\n**EXCLUSION CRITERIA:**\n"
                for _, row in exclusion_criteria.iterrows():
                    criteria_text += f"- **{row['criterion_id']}**: {row['description']}\n"
                    if pd.notna(row['examples']) and row['examples'].strip():
                        criteria_text += f"  *Examples: {row['examples']}*\n"
            
            return criteria_text
        except Exception as e:
            return f"Error loading criteria: {e}"
    
    def load_examples_text(examples_file):
        if not examples_file:
            return ""
        try:
            examples_df = pd.read_csv(examples_file)
            examples_text = "\n## EXAMPLE DECISIONS:\n"
            
            for _, row in examples_df.iterrows():
                decision_label = "INCLUDE" if row['decision'].upper() == 'INCLUDE' else "EXCLUDE"
                examples_text += f"\n**{decision_label} Example:**\n"
                examples_text += f"*Title:* {row['title']}\n"
                examples_text += f"*Abstract:* {row['abstract_text'][:200]}{'...' if len(row['abstract_text']) > 200 else ''}\n"
                examples_text += f"→ **{decision_label}** ({row['reasoning']})\n"
            
            return examples_text
        except Exception as e:
            return f"\n## EXAMPLES:\nError loading examples: {e}\n"
    
    # Load prompt components
    criteria_text = load_criteria_text(CRITERIA_FILE)
    examples_section = load_examples_text(EXAMPLES_FILE) if EXAMPLES_FILE else ""
    
    # Initialize results list
    results_list = []
    
    # Process each example
    for idx, row in test_samples.iterrows():
        if verbose and (idx + 1) % 10 == 0:
            print(f"🔄 Processing example {idx + 1}/{len(test_samples)}")
        
        try:
            # Create complete prompt
            complete_prompt = USER_PROMPT_TEMPLATE.format(
                topic=TOPIC,
                domain=DOMAIN,
                criteria_text=criteria_text,
                examples_section=examples_section,
                title=row['title_full'],
                abstract=row['abstract'],
                decision_include=DECISION_OPTIONS[0],
                decision_exclude=DECISION_OPTIONS[1]
            )
            
            # Call LLM
            llm_result = screen_abstract_llm(
                abstract_text=complete_prompt,
                system_prompt=SYSTEM_PROMPT,
                user_prompt_template="{abstract}",  # Just pass through since we formatted above
                model=MODEL_NAME,
                temperature=TEMPERATURE
            )
            
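            # Treat API-level failures as errors rather than silent EXCLUDEs
            if llm_result.get('error'):
                raise RuntimeError(llm_result['error'])

            # Parse LLM decision
            llm_decision = llm_result.get('decision', 'UNKNOWN')
            llm_reasoning = llm_result.get('reasoning', 'No reasoning provided')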
            
            # Convert to binary for evaluation
            llm_binary = 1 if llm_decision == 'INCLUDE' else 0
            stage2_binary = 1 if row['stage_2'] else 0
            stage3_binary = 1 if row['stage_3'] else 0
            
            # Store result
            result_row = {
                'example_id': idx + 1,
                'title': row['title_full'],
                'abstract': row['abstract'],
                'stage_2_true': row['stage_2'],
                'stage_3_true': row['stage_3'],
                'stage_2_binary': stage2_binary,
                'stage_3_binary': stage3_binary,
                'llm_decision': llm_decision,
                'llm_binary': llm_binary,
                'llm_reasoning': llm_reasoning,
                'experiment_id': EXPERIMENT_ID,
                'dataset_source': dataset_source,
                'system_prompt_id': SYSTEM_PROMPT_ID,
                'user_prompt_id': USER_PROMPT_ID,
                'model': MODEL_NAME,
                'temperature': TEMPERATURE,
                'timestamp': datetime.now().isoformat()
            }
            
            results_list.append(result_row)
            
        except Exception as e:
            if verbose:
                print(f"❌ Error processing example {idx + 1}: {e}")
            
            # Store error result
            result_row = {
                'example_id': idx + 1,
                'title': row['title_full'],
                'abstract': row['abstract'],
                'stage_2_true': row['stage_2'],
                'stage_3_true': row['stage_3'],
                'stage_2_binary': 1 if row['stage_2'] else 0,
                'stage_3_binary': 1 if row['stage_3'] else 0,
                'llm_decision': 'ERROR',
                'llm_binary': 0,
                'llm_reasoning': f'Processing error: {e}',
                'experiment_id': EXPERIMENT_ID,
                'dataset_source': dataset_source,
                'system_prompt_id': SYSTEM_PROMPT_ID,
                'user_prompt_id': USER_PROMPT_ID,
                'model': MODEL_NAME,
                'temperature': TEMPERATURE,
                'timestamp': datetime.now().isoformat()
            }
            
            results_list.append(result_row)
    
    # Create results DataFrame
    results_df = pd.DataFrame(results_list)
    
    # Calculate detailed metrics for stage_2
    valid_results_stage2 = results_df[results_df['llm_decision'] != 'ERROR']
    if len(valid_results_stage2) > 0:
        y_true_stage2 = valid_results_stage2['stage_2_binary'].values
        y_pred_stage2 = valid_results_stage2['llm_binary'].values
        
        # Basic metrics
        accuracy_stage2 = accuracy_score(y_true_stage2, y_pred_stage2)
        precision_stage2 = precision_score(y_true_stage2, y_pred_stage2, zero_division=0)
        recall_stage2 = recall_score(y_true_stage2, y_pred_stage2, zero_division=0)
        f1_stage2 = f1_score(y_true_stage2, y_pred_stage2, zero_division=0)
        
        # Confusion matrix metrics (labels fixed so the matrix is 2x2 even if one class is absent)
        tn2, fp2, fn2, tp2 = confusion_matrix(y_true_stage2, y_pred_stage2, labels=[0, 1]).ravel()
    else:
        accuracy_stage2 = precision_stage2 = recall_stage2 = f1_stage2 = 0.0
        tp2 = fp2 = tn2 = fn2 = 0
    
    # Calculate detailed metrics for stage_3
    valid_results_stage3 = results_df[results_df['llm_decision'] != 'ERROR']
    if len(valid_results_stage3) > 0:
        y_true_stage3 = valid_results_stage3['stage_3_binary'].values
        y_pred_stage3 = valid_results_stage3['llm_binary'].values
        
        # Basic metrics
        accuracy_stage3 = accuracy_score(y_true_stage3, y_pred_stage3)
        precision_stage3 = precision_score(y_true_stage3, y_pred_stage3, zero_division=0)
        recall_stage3 = recall_score(y_true_stage3, y_pred_stage3, zero_division=0)
        f1_stage3 = f1_score(y_true_stage3, y_pred_stage3, zero_division=0)
        
        # Confusion matrix metrics (labels fixed so the matrix is 2x2 even if one class is absent)
        tn3, fp3, fn3, tp3 = confusion_matrix(y_true_stage3, y_pred_stage3, labels=[0, 1]).ravel()
    else:
        accuracy_stage3 = precision_stage3 = recall_stage3 = f1_stage3 = 0.0
        tp3 = fp3 = tn3 = fn3 = 0
    
    # Updated metrics dictionary
    metrics = {
        'stage_2_metrics': {
            'accuracy': accuracy_stage2,
            'precision': precision_stage2,
            'recall': recall_stage2,
            'f1_score': f1_stage2,
            'tp': int(tp2),
            'fp': int(fp2),
            'tn': int(tn2),
            'fn': int(fn2)
        },
        'stage_3_metrics': {
            'accuracy': accuracy_stage3,
            'precision': precision_stage3,
            'recall': recall_stage3,
            'f1_score': f1_stage3,
            'tp': int(tp3),
            'fp': int(fp3),
            'tn': int(tn3),
            'fn': int(fn3)
        },
        'total_examples': len(results_df),
        'successful_classifications': len(valid_results_stage2),
        'errors': len(results_df) - len(valid_results_stage2)
    }
    
    # Enhanced results printing
    if verbose:
        print("\n📊 EXPERIMENT RESULTS")
        print("=" * 50)
        print(f"📈 Stage 2 Evaluation:")
        print(f"   Accuracy:  {accuracy_stage2:.3f}")
        print(f"   Precision: {precision_stage2:.3f}")
        print(f"   Recall:    {recall_stage2:.3f}")
        print(f"   F1 Score:  {f1_stage2:.3f}")
        print(f"   TP: {tp2}, FP: {fp2}, TN: {tn2}, FN: {fn2}")
        print(f"\n📈 Stage 3 Evaluation:")
        print(f"   Accuracy:  {accuracy_stage3:.3f}")
        print(f"   Precision: {precision_stage3:.3f}")
        print(f"   Recall:    {recall_stage3:.3f}")
        print(f"   F1 Score:  {f1_stage3:.3f}")
        print(f"   TP: {tp3}, FP: {fp3}, TN: {tn3}, FN: {fn3}")
        print(f"\n📋 Processing Summary:")
        print(f"   Total examples: {len(results_df)}")
        print(f"   Successful: {len(valid_results_stage2)}")
        print(f"   Errors: {len(results_df) - len(valid_results_stage2)}")
    
    # Save results
    if save_results:
        # Create filename with timestamp
        timestamp = datetime.now().strftime("%m%d%H%M")
        filename = f"{EXPERIMENT_ID}_{dataset_source}_{timestamp}.csv"
        results_dir = "../results"
        os.makedirs(results_dir, exist_ok=True)
        output_path = os.path.join(results_dir, filename)
        
        results_df.to_csv(output_path, index=False)
        
        if verbose:
            print(f"\n💾 Results saved to: {output_path}")
    
    return {
        'results_df': results_df,
        'metrics': metrics,
        'filename': filename if save_results else None
    }

print("✅ Classification experiment function defined")
print("🚀 Ready to run: run_classification_experiment(df_BM)")

✅ Classification experiment function defined
🚀 Ready to run: run_classification_experiment(df_BM)


## 🚀 Run experiment! 

In [24]:
# Run experiment with default settings
results = run_classification_experiment(df_BM)

🧪 Starting Classification Experiment
📊 Dataset: BM
🎯 Total examples: 50
✅ Stage 3 True: 5
❌ Stage 3 False: 45
📝 Sampled 50 examples
🔄 Processing example 10/50
🔄 Processing example 20/50
🔄 Processing example 30/50
🔄 Processing example 40/50
🔄 Processing example 50/50

📊 EXPERIMENT RESULTS
📈 Stage 2 Evaluation:
   Accuracy:  0.840
   Precision: 0.750
   Recall:    0.300
   F1 Score:  0.429
   TP: 3, FP: 1, TN: 39, FN: 7

📈 Stage 3 Evaluation:
   Accuracy:  0.940
   Precision: 0.750
   Recall:    0.600
   F1 Score:  0.667
   TP: 3, FP: 1, TN: 44, FN: 2

📋 Processing Summary:
   Total examples: 50
   Successful: 50
   Errors: 0

💾 Results saved to: ../results/011_BM_08141215.csv
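
The reported scores follow directly from the confusion-matrix counts printed above. As a sanity check, the sketch below recomputes them in plain Python from the standard definitions. `metrics_from_counts` is a hypothetical helper, not part of this notebook.

```python
def metrics_from_counts(tp, fp, tn, fn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Counts copied from the run output above
stage2 = metrics_from_counts(tp=3, fp=1, tn=39, fn=7)
stage3 = metrics_from_counts(tp=3, fp=1, tn=44, fn=2)

for name, m in (("Stage 2", stage2), ("Stage 3", stage3)):
    print(name, {k: round(v, 3) for k, v in m.items()})
```

The sample is heavily imbalanced, so accuracy alone is a weak signal here: labeling every abstract EXCLUDE would already score 0.80 against stage 2 (40 of 50 negatives) and 0.90 against stage 3 (45 of 50). Recall is the more informative number for screening.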


## 📊 Results Analysis

## ➕ Add experiment info to the summary file

In [31]:
def add_experiment_to_summary(results_dict, summary_file="../results/experiment_summary.csv"):
    """Add new experiment results to the summary DataFrame with confusion matrix metrics"""
    
    new_row = pd.DataFrame({
        'experiment_id': [EXPERIMENT_ID],
        'experiment_date': [EXPERIMENT_DATE],
        'experiment_category': [EXPERIMENT_CATEGORY],
        'experiment_goal': [EXPERIMENT_GOAL],
        'system_prompt_id': [SYSTEM_PROMPT_ID],
        'user_prompt_id': [USER_PROMPT_ID],
        'model_name': [MODEL_NAME],
        'temperature': [TEMPERATURE],
        'max_tokens': [MAX_TOKENS],
        'criteria_file': [CRITERIA_FILE],
        'examples_file': [EXAMPLES_FILE],
        'output_format': [OUTPUT_FORMAT],
        'domain': [DOMAIN],
        'topic': [TOPIC],
        'dataset_source': [DATASET_SOURCE],
        'n_total_examples': [results_dict['metrics']['total_examples']],
        'n_successful': [results_dict['metrics']['successful_classifications']],
        'n_errors': [results_dict['metrics']['errors']],
        # Stage 2 metrics
        'stage2_accuracy': [results_dict['metrics']['stage_2_metrics']['accuracy']],
        'stage2_precision': [results_dict['metrics']['stage_2_metrics']['precision']],
        'stage2_recall': [results_dict['metrics']['stage_2_metrics']['recall']],
        'stage2_f1': [results_dict['metrics']['stage_2_metrics']['f1_score']],
        'stage2_tp': [results_dict['metrics']['stage_2_metrics']['tp']],
        'stage2_fp': [results_dict['metrics']['stage_2_metrics']['fp']],
        'stage2_tn': [results_dict['metrics']['stage_2_metrics']['tn']],
        'stage2_fn': [results_dict['metrics']['stage_2_metrics']['fn']],
        # Stage 3 metrics
        'stage3_accuracy': [results_dict['metrics']['stage_3_metrics']['accuracy']],
        'stage3_precision': [results_dict['metrics']['stage_3_metrics']['precision']],
        'stage3_recall': [results_dict['metrics']['stage_3_metrics']['recall']],
        'stage3_f1': [results_dict['metrics']['stage_3_metrics']['f1_score']],
        'stage3_tp': [results_dict['metrics']['stage_3_metrics']['tp']],
        'stage3_fp': [results_dict['metrics']['stage_3_metrics']['fp']],
        'stage3_tn': [results_dict['metrics']['stage_3_metrics']['tn']],
        'stage3_fn': [results_dict['metrics']['stage_3_metrics']['fn']],
        'results_filename': [results_dict['filename']],
        'timestamp': [datetime.now().isoformat()]
    })
    
    # Load existing summary or create new one
    if os.path.exists(summary_file):
        existing_summary = pd.read_csv(summary_file)
        updated_summary = pd.concat([existing_summary, new_row], ignore_index=True)
    else:
        updated_summary = new_row
    
    # Save updated summary
    updated_summary.to_csv(summary_file, index=False)
    print(f"✅ Experiment {EXPERIMENT_ID} added to summary: {summary_file}")

    return updated_summary

## 📝 Conclusions and Next Steps

### Key Findings
- 

### Next Steps
- [Suggest follow-up experiments]
- [List potential improvements]
- [Identify areas for further investigation]