# Verification Case Studies

This notebook provides detailed case studies of how different verification strategies affect optimization results. We'll examine examples where verification:

1. Successfully caught and corrected errors
2. Failed to catch errors
3. Incorrectly rejected valid updates
4. Showed different behaviors across verification strategies

These case studies will help illustrate the strengths and weaknesses of each verification approach.
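
The cells below read a handful of keys from each record in the results file. As a rough sketch of the layout they appear to assume (inferred from those cells, not from a documented schema; the record and the `check_alignment` helper are hypothetical and the values are made up), a record and a quick length check might look like this:

```python
# Hypothetical record: keys are the ones this notebook reads, values are invented.
example_record = {
    "sample_id": "demo-001",
    "question": "What is 17 + 25?",
    "answer": "42",
    "verification_strategy": "process",
    "verification_threshold": 0.7,
    "performance_history": [0.0, 1.0],            # one score per iteration
    "predictions": ["17 + 25 = 32", "17 + 25 = 42"],  # one solution per iteration
    "verification_metrics": [                      # one entry per step between iterations
        {
            "verification_applied": True,
            "verification_confidence": 0.4,
            "corrections": "Calculation error: 17 + 25 is 42, not 32.",
        },
    ],
}


def check_alignment(record):
    """Return a list of problems with the list lengths (empty if none)."""
    problems = []
    n_scores = len(record.get("performance_history", []))
    n_preds = len(record.get("predictions", []))
    n_metrics = len(record.get("verification_metrics", []))
    if n_preds != n_scores:
        problems.append(f"{n_preds} predictions but {n_scores} scores")
    # Assumption: verification_metrics[i-1] describes the step into iteration i
    if n_scores and n_metrics != n_scores - 1:
        problems.append(f"{n_metrics} verification entries for {n_scores} scores")
    return problems


print(check_alignment(example_record))  # []
```

Running a check like this over the loaded `results` before the analysis can flag records where the indexing used later (entry `i-1` of `verification_metrics` paired with scores `i-1` and `i`) would not line up. The real file may use different lengths or carry extra fields.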

In [None]:
import json
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from IPython.display import HTML, display
import re
import os

# Load verification results
RESULTS_FILE = "../results/verification/verification_results.json"

with open(RESULTS_FILE, 'r') as f:
    results = json.load(f)

## Helper Functions for Analysis

In [None]:
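def highlight_differences(original, updated, add_color='#d4f7d4', remove_color='#ffd4d4'):
    """Return an HTML line diff of two texts, with added and removed lines color-coded"""
    import difflib
    import html

    diff = difflib.Differ().compare(original.splitlines(), updated.splitlines())

    result = []
    for line in diff:
        # Escape the text so that '<' or '&' inside a solution is shown literally;
        # '&nbsp;' keeps empty lines from collapsing to zero height
        text = html.escape(line[2:]) or '&nbsp;'
        if line.startswith('+ '):
            result.append(f"<div style='background-color: {add_color}'>{text}</div>")
        elif line.startswith('- '):
            result.append(f"<div style='background-color: {remove_color}'>{text}</div>")
        elif line.startswith('  '):
            result.append(f"<div>{text}</div>")
        # Lines starting with '? ' are Differ's intraline hints and are skipped

    # Every line is its own <div>, so no <br> separators are needed
    return ''.join(result)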

def display_verification_case(result, iteration=1):
    """Display one case: the solutions before and after `iteration`, plus the verification details for that step"""
    question = result.get('question', 'N/A')
    answer = result.get('answer', 'N/A')
    
    strategy = result.get('verification_strategy', 'N/A')
    threshold = result.get('verification_threshold', 'N/A')
    
    performance_history = result.get('performance_history', [])
    predictions = result.get('predictions', [])
    verification_metrics = result.get('verification_metrics', [])
    
    # Display basic information
    print(f"Strategy: {strategy}, Threshold: {threshold}")
    print(f"Question: {question}")
    print(f"Correct Answer: {answer}")
    print(f"Performance History: {performance_history}")
    print()
    
    # Display iteration of interest
    if len(predictions) > iteration and iteration > 0:
        print(f"=== Iteration {iteration} ===")
        
        # Display the prediction before verification
        prev_prediction = predictions[iteration-1]
        print("Previous solution:")
        print(prev_prediction)
        print()
        
        # Display verification details if available
        if verification_metrics and len(verification_metrics) >= iteration:
            vm = verification_metrics[iteration-1]
            print(f"Verification applied: {vm.get('verification_applied', 'N/A')}")
            print(f"Verification confidence: {vm.get('verification_confidence', 'N/A')}")
            if 'corrections' in vm and vm['corrections']:
                print("Corrections:")
                print(vm['corrections'])
            print()
        
        # Display the updated prediction
        curr_prediction = predictions[iteration]
        print("Updated solution:")
        print(curr_prediction)
        print()
        
        # Display differences
        print("Differences:")
        display(HTML(highlight_differences(prev_prediction, curr_prediction)))

## 1. Successful Verification Cases

Let's find examples where verification successfully caught and corrected errors.

In [None]:
# Find samples where verification was applied and performance improved
successful_cases = []

for result in results:
    if result['verification_strategy'] == 'none':
        continue
        
    performance_history = result['performance_history']
    verification_metrics = result.get('verification_metrics', [])
    
    for i in range(1, len(performance_history)):
        # Check if verification was applied in this iteration
        was_verified = False
        if verification_metrics and i <= len(verification_metrics):
            was_verified = verification_metrics[i-1].get('verification_applied', False)
        
        # Check if performance improved
        if was_verified and performance_history[i] > performance_history[i-1]:
            successful_cases.append((result, i))
            break

print(f"Found {len(successful_cases)} successful verification cases")

# Display a few examples
for i, (result, iteration) in enumerate(successful_cases[:3]):
    print(f"\n--- Successful Case {i+1} ---\n")
    display_verification_case(result, iteration)

## 2. Failed Verification Cases

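Now let's examine cases where verification was applied but the score did not improve: it either stayed the same or dropped. Note that a score that was already at its maximum and stayed there is also counted here, even though there was no error left to catch.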
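
To make the "did not improve" bucket easier to interpret, the verified steps can be split into finer labels. The sketch below is one way to do that. `label_verified_steps` is a hypothetical helper, and `max_score=1.0` is an assumption about the scoring scale rather than something the results file states.

```python
def label_verified_steps(performance_history, verification_metrics, max_score=1.0):
    """Label each verified step as improved, worsened, or unchanged.

    Assumes verification_metrics[i-1] describes the step into iteration i,
    which is how the cells in this notebook index the two lists.
    """
    labels = []
    for i in range(1, len(performance_history)):
        if i > len(verification_metrics):
            break  # no verification entry recorded for this step
        if not verification_metrics[i - 1].get("verification_applied", False):
            continue
        before, after = performance_history[i - 1], performance_history[i]
        if after > before:
            label = "improved"
        elif after < before:
            label = "worsened"
        elif before >= max_score:
            label = "unchanged_at_max"     # already at the assumed maximum
        else:
            label = "unchanged_below_max"  # score stayed below the assumed maximum
        labels.append((i, label))
    return labels


# Made-up history: stays wrong, gets fixed, stays right, then regresses
history = [0.0, 0.0, 1.0, 1.0, 0.0]
metrics = [{"verification_applied": True}] * 4
print(label_verified_steps(history, metrics))
# [(1, 'unchanged_below_max'), (2, 'improved'), (3, 'unchanged_at_max'), (4, 'worsened')]
```

Only `worsened` and `unchanged_below_max` steps are candidates for a verification failure in the sense of a missed or introduced error.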

In [None]:
# Find samples where verification was applied but performance didn't improve
failed_cases = []

for result in results:
    if result['verification_strategy'] == 'none':
        continue
        
    performance_history = result['performance_history']
    verification_metrics = result.get('verification_metrics', [])
    
    for i in range(1, len(performance_history)):
        # Check if verification was applied in this iteration
        was_verified = False
        if verification_metrics and i <= len(verification_metrics):
            was_verified = verification_metrics[i-1].get('verification_applied', False)
        
        # Check if performance didn't improve or got worse
        if was_verified and performance_history[i] <= performance_history[i-1]:
            failed_cases.append((result, i))
            break

print(f"Found {len(failed_cases)} failed verification cases")

# Display a few examples
for i, (result, iteration) in enumerate(failed_cases[:3]):
    print(f"\n--- Failed Case {i+1} ---\n")
    display_verification_case(result, iteration)

## 3. Comparative Analysis Across Verification Strategies

Let's compare how different verification strategies handle the same example.
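
If a strategy was run with several thresholds, a sample has more than one record per strategy, so it can help to key the comparison on the (strategy, threshold) pair. A minimal sketch, using a hypothetical helper and made-up records:

```python
from collections import defaultdict


def final_scores_by_sample(records):
    """Map sample_id -> {(strategy, threshold): final score}."""
    table = defaultdict(dict)
    for record in records:
        # threshold is None when the record has no 'verification_threshold' key
        config = (record["verification_strategy"], record.get("verification_threshold"))
        table[record["sample_id"]][config] = record["performance_history"][-1]
    return dict(table)


# Made-up records for a single sample
toy_records = [
    {"sample_id": "s1", "verification_strategy": "none",
     "performance_history": [0.0, 0.0]},
    {"sample_id": "s1", "verification_strategy": "process",
     "verification_threshold": 0.5, "performance_history": [0.0, 1.0]},
    {"sample_id": "s1", "verification_strategy": "process",
     "verification_threshold": 0.9, "performance_history": [0.0, 0.0]},
]

scores = final_scores_by_sample(toy_records)
print(scores["s1"][("process", 0.5)])  # 1.0
print(scores["s1"][("process", 0.9)])  # 0.0
```

Keyed this way, two runs of the same strategy at different thresholds stay distinguishable instead of appearing as repeated rows for one strategy.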

In [None]:
# Group results by sample_id
samples_by_id = {}
for result in results:
    sample_id = result['sample_id']
    if sample_id not in samples_by_id:
        samples_by_id[sample_id] = []
    samples_by_id[sample_id].append(result)

# Find samples that have results for all strategies
strategies = set([result['verification_strategy'] for result in results])
comparative_samples = []

for sample_id, sample_results in samples_by_id.items():
    sample_strategies = set([r['verification_strategy'] for r in sample_results])
    if sample_strategies == strategies:
        comparative_samples.append(sample_id)

print(f"Found {len(comparative_samples)} samples with results for all strategies")

# Select a sample for comparison
if comparative_samples:
    sample_id = comparative_samples[0]
    sample_results = samples_by_id[sample_id]
    
    print(f"\n=== Comparative Analysis for Sample {sample_id} ===\n")
    
    # Print question and answer once
    question = sample_results[0]['question']
    answer = sample_results[0]['answer']
    print(f"Question: {question}")
    print(f"Correct Answer: {answer}\n")
    
    # Compare final performance
    print("Final Performance by Strategy:")
    for result in sample_results:
        strategy = result['verification_strategy']
        threshold = result.get('verification_threshold', 'N/A')
        final_score = result['performance_history'][-1]
        
        print(f"{strategy} (threshold={threshold}): {final_score}")
    
    print("\nDetailed Comparison of First Iteration:")
    # Compare first iteration across strategies
    for result in sample_results:
        print(f"\n--- {result['verification_strategy']} ---")
        display_verification_case(result, 1)

## 4. Impact of Confidence Threshold

Let's examine how different confidence thresholds affect verification outcomes.
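
Reading the definitions off the cell below, the rates it reports use two different denominators. For each (strategy, threshold) pair, with $N_{\text{iter}}$ the number of verification entries and $N_{\text{applied}}$ the number of those where verification was applied:

$$
\text{applied rate} = \frac{N_{\text{applied}}}{N_{\text{iter}}},
\qquad
\text{improved rate} = \frac{N_{\text{improved}}}{N_{\text{applied}}},
\qquad
\text{worsened rate} = \frac{N_{\text{worsened}}}{N_{\text{applied}}},
\qquad
\text{unchanged rate} = \frac{N_{\text{unchanged}}}{N_{\text{applied}}}
$$

An applied verification with no later score in `performance_history` still counts toward $N_{\text{applied}}$ but toward none of the three outcomes, so

$$
\text{improved rate} + \text{worsened rate} + \text{unchanged rate} \le 1,
$$

with equality only when every applied verification is followed by a recorded score.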

In [None]:
# Group results by strategy and threshold
threshold_impacts = {}

for result in results:
    if result['verification_strategy'] == 'none':
        continue
        
    strategy = result['verification_strategy']
    threshold = result['verification_threshold']
    
    key = (strategy, threshold)
    if key not in threshold_impacts:
        threshold_impacts[key] = {
            'verification_applied_count': 0,
            'total_iterations': 0,
            'improved_after_verification': 0,
            'worsened_after_verification': 0,
            'unchanged_after_verification': 0
        }
    
    verification_metrics = result.get('verification_metrics', [])
    performance_history = result['performance_history']
    
    for i in range(len(verification_metrics)):
        threshold_impacts[key]['total_iterations'] += 1
        
        if verification_metrics[i].get('verification_applied', False):
            threshold_impacts[key]['verification_applied_count'] += 1
            
            # Check performance change
            if i+1 < len(performance_history):
                if performance_history[i+1] > performance_history[i]:
                    threshold_impacts[key]['improved_after_verification'] += 1
                elif performance_history[i+1] < performance_history[i]:
                    threshold_impacts[key]['worsened_after_verification'] += 1
                else:
                    threshold_impacts[key]['unchanged_after_verification'] += 1

# Convert to DataFrame
threshold_data = []
for (strategy, threshold), impacts in threshold_impacts.items():
    # Calculate percentages
    applied_rate = impacts['verification_applied_count'] / impacts['total_iterations'] if impacts['total_iterations'] > 0 else 0
    improved_rate = impacts['improved_after_verification'] / impacts['verification_applied_count'] if impacts['verification_applied_count'] > 0 else 0
    worsened_rate = impacts['worsened_after_verification'] / impacts['verification_applied_count'] if impacts['verification_applied_count'] > 0 else 0
    unchanged_rate = impacts['unchanged_after_verification'] / impacts['verification_applied_count'] if impacts['verification_applied_count'] > 0 else 0
    
    threshold_data.append({
        'strategy': strategy,
        'threshold': threshold,
        'applied_rate': applied_rate,
        'improved_rate': improved_rate,
        'worsened_rate': worsened_rate,
        'unchanged_rate': unchanged_rate
    })

threshold_df = pd.DataFrame(threshold_data)
threshold_df.sort_values(['strategy', 'threshold'], inplace=True)
threshold_df

In [None]:
# Plot threshold impact
plt.figure(figsize=(15, 10))

# Plot verification application rate
plt.subplot(2, 2, 1)
for strategy in threshold_df['strategy'].unique():
    strategy_data = threshold_df[threshold_df['strategy'] == strategy]
    plt.plot(
        strategy_data['threshold'],
        strategy_data['applied_rate'],
        marker='o',
        label=strategy
    )
plt.xlabel('Confidence Threshold')
plt.ylabel('Verification Application Rate')
plt.title('Verification Application Rate vs. Threshold')
plt.legend()
plt.grid(True, alpha=0.3)

# Plot improved rate
plt.subplot(2, 2, 2)
for strategy in threshold_df['strategy'].unique():
    strategy_data = threshold_df[threshold_df['strategy'] == strategy]
    plt.plot(
        strategy_data['threshold'],
        strategy_data['improved_rate'],
        marker='o',
        label=strategy
    )
plt.xlabel('Confidence Threshold')
plt.ylabel('Improvement Rate After Verification')
plt.title('Improvement Rate vs. Threshold')
plt.legend()
plt.grid(True, alpha=0.3)

# Plot worsened rate
plt.subplot(2, 2, 3)
for strategy in threshold_df['strategy'].unique():
    strategy_data = threshold_df[threshold_df['strategy'] == strategy]
    plt.plot(
        strategy_data['threshold'],
        strategy_data['worsened_rate'],
        marker='o',
        label=strategy
    )
plt.xlabel('Confidence Threshold')
plt.ylabel('Worsening Rate After Verification')
plt.title('Worsening Rate vs. Threshold')
plt.legend()
plt.grid(True, alpha=0.3)

# Plot unchanged rate
plt.subplot(2, 2, 4)
for strategy in threshold_df['strategy'].unique():
    strategy_data = threshold_df[threshold_df['strategy'] == strategy]
    plt.plot(
        strategy_data['threshold'],
        strategy_data['unchanged_rate'],
        marker='o',
        label=strategy
    )
plt.xlabel('Confidence Threshold')
plt.ylabel('Unchanged Rate After Verification')
plt.title('Unchanged Rate vs. Threshold')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## 5. Error Categories and Verification Effectiveness

Let's analyze the types of errors that each verification strategy is most effective at catching.
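
The classifier in the next cell uses plain substring matching, which is simple but can fire inside longer words. The sketch below contrasts it with whole-word matching on two keywords taken from the `hallucination` list. Both helper functions and the feedback strings are hypothetical illustrations.

```python
import re

# Two keywords copied from the notebook's 'hallucination' category
keywords = ["invent", "make up"]


def substring_match(text, keywords):
    """Keywords that occur anywhere in the text, as in the notebook's classifier."""
    text = text.lower()
    return [k for k in keywords if k in text]


def whole_word_match(text, keywords):
    """Keywords that occur as whole words or phrases only."""
    text = text.lower()
    return [k for k in keywords if re.search(r"\b" + re.escape(k) + r"\b", text)]


feedback = "The inventory total is wrong: 12 + 9 is 21."
print(substring_match(feedback, keywords))   # ['invent']
print(whole_word_match(feedback, keywords))  # []
```

Whole-word matching avoids the false hit on "inventory", but it also misses inflected forms such as "invented", so neither variant is strictly better. Spot-checking a sample of classified corrections by hand is a reasonable sanity check for whichever rule is used.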

In [None]:
# Define error categories
error_categories = {
    'hallucination': [
        'hallucinate', 'fabricate', 'invent', 'make up', 'create facts', 
        'not based on', 'no evidence', 'unsupported claim'
    ],
    'logical_error': [
        'logical error', 'invalid reasoning', 'fallacy', 'reasoning error',
        'incorrect logic', 'illogical step', 'non sequitur'
    ],
    'mathematical_error': [
        'calculation error', 'arithmetic error', 'computational mistake',
        'incorrect computation', 'wrong calculation', 'mathematical error'
    ],
    'factual_error': [
        'factual error', 'incorrect fact', 'wrong information', 'inaccurate claim',
        'factually incorrect', 'wrong assumption'
    ]
}

# Function to classify error type in verification feedback
def classify_error_type(text):
    text = text.lower()
    error_types = []
    
    for error_type, keywords in error_categories.items():
        for keyword in keywords:
            if keyword.lower() in text:
                error_types.append(error_type)
                break
    
    return error_types or ['other_error']

# Analyze verification corrections
error_analysis = []

for result in results:
    if result['verification_strategy'] == 'none':
        continue
        
    verification_metrics = result.get('verification_metrics', [])
    
    for i, vm in enumerate(verification_metrics):
        if vm.get('verification_applied', False) and 'corrections' in vm and vm['corrections']:
            error_types = classify_error_type(vm['corrections'])
            
            # Check if performance improved after this verification
            performance_improved = False
            if i+1 < len(result['performance_history']):
                performance_improved = result['performance_history'][i+1] > result['performance_history'][i]
            
            for error_type in error_types:
                error_analysis.append({
                    'strategy': result['verification_strategy'],
                    'threshold': result['verification_threshold'],
                    'error_type': error_type,
                    'performance_improved': performance_improved,
                    'sample_id': result['sample_id'],
                    'iteration': i
                })

error_df = pd.DataFrame(error_analysis)

# Calculate effectiveness by error type and strategy
effectiveness = error_df.groupby(['strategy', 'error_type'])['performance_improved'].agg(
    ['count', 'mean']
).reset_index()
effectiveness.columns = ['strategy', 'error_type', 'count', 'effectiveness']
effectiveness = effectiveness[effectiveness['count'] >= 3]  # Filter for types with enough samples
effectiveness.sort_values(['strategy', 'effectiveness'], ascending=[True, False], inplace=True)

effectiveness

In [None]:
# Visualize error type effectiveness by strategy
plt.figure(figsize=(12, 8))

# Get unique error types and strategies
# (a separate name keeps the `strategies` set from Section 3 from being overwritten)
error_types = effectiveness['error_type'].unique()
plot_strategies = effectiveness['strategy'].unique()

# Set up bar positions
x = np.arange(len(error_types))
width = 0.8 / max(len(plot_strategies), 1)  # avoid dividing by zero if nothing passed the count filter
offsets = np.linspace(-0.4 + width/2, 0.4 - width/2, len(plot_strategies))

# Plot bars for each strategy
for i, strategy in enumerate(plot_strategies):
    strategy_data = effectiveness[effectiveness['strategy'] == strategy]

    # Create a dictionary mapping error types to effectiveness for this strategy
    strategy_dict = dict(zip(strategy_data['error_type'], strategy_data['effectiveness']))

    # Get effectiveness values for each error type (0 if not present, which
    # includes pairs dropped by the count >= 3 filter above)
    values = [strategy_dict.get(error_type, 0) for error_type in error_types]

    plt.bar(x + offsets[i], values, width, label=strategy)

plt.xlabel('Error Type')
plt.ylabel('Effectiveness (Fraction Improved After Verification)')
plt.title('Verification Effectiveness by Error Type and Strategy')
plt.xticks(x, error_types)
plt.legend()
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()

## 6. Detailed Error Case Studies

Let's examine specific examples of each error type and how the different verification strategies handled them.

In [None]:
# Function to find examples of specific error types
def find_error_examples(error_type, limit=3):
    examples = []
    for result in results:
        if result['verification_strategy'] == 'none':
            continue
            
        verification_metrics = result.get('verification_metrics', [])
        
        for i, vm in enumerate(verification_metrics):
            if vm.get('verification_applied', False) and 'corrections' in vm and vm['corrections']:
                corrections = vm['corrections'].lower()
                
                # Check if this correction addresses the specified error type
                for keyword in error_categories.get(error_type, []):
                    if keyword.lower() in corrections:
                        examples.append((result, i+1))  # verification_metrics[i] describes the step into iteration i+1
                        break
                        
                if len(examples) >= limit:
                    break
        
        if len(examples) >= limit:
            break
            
    return examples

# Let's look at examples of hallucinations
hallucination_examples = find_error_examples('hallucination')
print(f"Found {len(hallucination_examples)} hallucination examples\n")

for i, (result, iteration) in enumerate(hallucination_examples):
    print(f"=== Hallucination Example {i+1} ===\n")
    display_verification_case(result, iteration)

In [None]:
# Let's look at examples of logical errors
logical_error_examples = find_error_examples('logical_error')
print(f"Found {len(logical_error_examples)} logical error examples\n")

for i, (result, iteration) in enumerate(logical_error_examples):
    print(f"=== Logical Error Example {i+1} ===\n")
    display_verification_case(result, iteration)

In [None]:
# Let's look at examples of mathematical errors
mathematical_error_examples = find_error_examples('mathematical_error')
print(f"Found {len(mathematical_error_examples)} mathematical error examples\n")

for i, (result, iteration) in enumerate(mathematical_error_examples):
    print(f"=== Mathematical Error Example {i+1} ===\n")
    display_verification_case(result, iteration)

## 7. Process vs. Outcome Verification Comparison

Let's directly compare process verification and outcome verification on the same examples.
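
The cell below pairs records with nested loops, which rescans `results` once per sample ID. A one-pass index is an alternative sketch. `pair_by_sample` is a hypothetical helper and the toy records are made up.

```python
def pair_by_sample(records, strategies=("process", "outcome")):
    """Map sample_id -> {strategy: record} for samples that have every strategy."""
    index = {}
    for record in records:
        strategy = record["verification_strategy"]
        if strategy in strategies:
            # Keeps the first record seen per (sample, strategy);
            # the nested-loop version below keeps the last one instead
            index.setdefault(record["sample_id"], {}).setdefault(strategy, record)
    return {sid: recs for sid, recs in index.items() if len(recs) == len(strategies)}


# Made-up records: only s1 has both strategies
toy_records = [
    {"sample_id": "s1", "verification_strategy": "process"},
    {"sample_id": "s1", "verification_strategy": "outcome"},
    {"sample_id": "s2", "verification_strategy": "process"},
    {"sample_id": "s2", "verification_strategy": "none"},
]

pairs = pair_by_sample(toy_records)
print(list(pairs))  # ['s1']
```

Because dictionaries preserve insertion order, the paired samples come out in the order they first appear in the records.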

In [None]:
# Find examples where both process and outcome verification were applied
def find_comparison_examples(limit=3):
    examples = []
    
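    # First, collect all sample IDs in first-seen order (iterating a set of
    # string IDs can give a different order, and so different examples, on each run)
    sample_ids = list(dict.fromkeys(result['sample_id'] for result in results))

    # Find samples where both process and outcome verification were applied
    for sample_id in sample_ids: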
        process_result = None
        outcome_result = None
        
        for result in results:
            if result['sample_id'] != sample_id:
                continue
                
            if result['verification_strategy'] == 'process':
                process_result = result
            elif result['verification_strategy'] == 'outcome':
                outcome_result = result
        
        if process_result and outcome_result:
            # Find an iteration where both applied verification
            for i in range(min(len(process_result.get('verification_metrics', [])), 
                              len(outcome_result.get('verification_metrics', [])))):
                
                process_applied = process_result['verification_metrics'][i].get('verification_applied', False)
                outcome_applied = outcome_result['verification_metrics'][i].get('verification_applied', False)
                
                if process_applied and outcome_applied:
                    examples.append((sample_id, i+1, process_result, outcome_result))
                    break
                    
            if len(examples) >= limit:
                break
    
    return examples

comparison_examples = find_comparison_examples()
print(f"Found {len(comparison_examples)} comparison examples\n")

for i, (sample_id, iteration, process_result, outcome_result) in enumerate(comparison_examples):
    print(f"=== Comparison Example {i+1} ===\n")
    
    # Print question and answer once
    question = process_result['question']
    answer = process_result['answer']
    print(f"Question: {question}")
    print(f"Correct Answer: {answer}\n")
    
    print(f"Process Verification (Iteration {iteration}):")
    display_verification_case(process_result, iteration)
    
    print(f"\nOutcome Verification (Iteration {iteration}):")
    display_verification_case(outcome_result, iteration)
    
    # Compare final performance
    process_final = process_result['performance_history'][-1]
    outcome_final = outcome_result['performance_history'][-1]
    
    print("\nFinal Performance:\n")
    print(f"Process Verification: {process_final}")
    print(f"Outcome Verification: {outcome_final}")

## 8. Qualitative Analysis of Verification Feedback

Let's examine the quality and characteristics of feedback provided by different verification strategies.

In [None]:
# Collect verification feedback samples
feedback_samples = []

for result in results:
    if result['verification_strategy'] == 'none':
        continue
        
    strategy = result['verification_strategy']
    verification_metrics = result.get('verification_metrics', [])
    
    for vm in verification_metrics:
        if vm.get('verification_applied', False) and 'corrections' in vm and vm['corrections']:
            feedback_samples.append({
                'strategy': strategy,
                'corrections': vm['corrections'],
                'correction_length': len(vm['corrections'].split())
            })

feedback_df = pd.DataFrame(feedback_samples)

# Analyze feedback length by strategy
feedback_length = feedback_df.groupby('strategy')['correction_length'].agg(
    ['mean', 'std', 'min', 'max']
).reset_index()

print("Verification Feedback Length Statistics:")
feedback_length

In [None]:
# Plot feedback length distribution
plt.figure(figsize=(10, 6))

for strategy in feedback_df['strategy'].unique():
    strategy_data = feedback_df[feedback_df['strategy'] == strategy]['correction_length']
    plt.hist(strategy_data, alpha=0.5, bins=20, label=strategy)

plt.xlabel('Feedback Length (words)')
plt.ylabel('Frequency')
plt.title('Distribution of Verification Feedback Length by Strategy')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

In [None]:
# Sample feedback from each strategy
print("Sample Verification Feedback by Strategy:")

for strategy in feedback_df['strategy'].unique():
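    strategy_rows = feedback_df[feedback_df['strategy'] == strategy]['corrections']
    # sample() raises ValueError when asked for more rows than exist, so cap the request
    strategy_feedback = strategy_rows.sample(min(3, len(strategy_rows)))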
    
    print(f"\n=== {strategy.upper()} Verification Feedback ===\n")
    
    for i, feedback in enumerate(strategy_feedback):
        print(f"Example {i+1}:\n{feedback}\n")

## 9. Conclusions and Recommendations

Based on our analysis, we can draw the following conclusions:

1. **Verification Effectiveness**: [Fill in based on actual results]
   - Process verification appears to be most effective for [fill in]
   - Outcome verification excels at [fill in]
   - Hybrid approaches offer [fill in]

2. **Confidence Threshold Impact**: [Fill in based on actual results]
   - Higher thresholds tend to [fill in]
   - Lower thresholds [fill in]
   - The optimal threshold appears to be around [fill in]

3. **Error Type Handling**: [Fill in based on actual results]
   - Hallucinations are best caught by [fill in]
   - Logical errors are most effectively addressed by [fill in]
   - Mathematical errors [fill in]

4. **Feedback Quality**: [Fill in based on actual results]
   - Process verification feedback tends to be [fill in]
   - Outcome verification feedback is characterized by [fill in]

### Recommendations

Based on our findings, we recommend:

1. [Fill in recommendation]
2. [Fill in recommendation]
3. [Fill in recommendation]
4. [Fill in recommendation]

These recommendations should be implemented in the VTGD framework to maximize the effectiveness of verification in TextGrad optimization.