# Notebook 03: DSPy Judge Optimization Using Gold Standard Labels

## 📋 **Overview**
This notebook optimizes a DSPy judge model using the **gold standard labels from Notebook 02**. It trains a Gemini 2.5 Flash judge to match Claude 4.5 Sonnet's expert evaluations using DSPy's MIPROv2 optimizer, creating an optimized judge that can accurately evaluate customer support conversations. 

## 🔄 **Complete Workflow with Example**

Continuing from Notebooks 00-02, let's trace how our flight booking conversation becomes part of judge training and optimization:

### **Starting Point (from Notebook 02)**
We begin with the gold standard labeled dataset from Claude 4.5 Sonnet:

**Gold Standard Training Example:**
```python
{
  'conversation_id': 'Session:2057187615:12852',
  'output_transcript': '''Company: Southwest Airlines
Transcript so far: Customer: i just booked my flight and i have received a email but im not sure if it went through or not, i cant go to the web site and see my itinerary
Agent: Hello! I understand your concern about your flight booking. Let me help you verify your reservation.
Customer: The email just says payment received with an order number, no confirmation code
Agent: I can look up your booking with the order number. Can you provide that along with your full name?
Customer: Order #12345, John Smith
Support: Thank you for providing that information. I found your booking! Your confirmation code is ABC123. Your flight is confirmed for tomorrow at 2:30 PM.''',
  'dspy_response': {
    'reasoning': "The agent successfully resolved the customer's booking concern by locating the reservation and providing the confirmation code. The customer's initial worry about whether the booking went through was addressed completely.",
    'satisfied': "true"  # Claude's expert judgment
  },
  'satisfied': "true"  # Extracted label for training
}
```

### **Phase 1: Baseline Judge Evaluation**
1. **Load gold standard dataset** (20 expert-labeled examples from Notebook 02) from `datasets/gold_standard_judge_result`
2. **Configure Gemini 2.5 Flash** as the judge model to be optimized
3. **Test baseline performance** against Claude's expert labels

**Baseline Judge Configuration:**
```python
judge_model = dspy.LM("gemini/gemini-2.5-flash", temperature=0)
baseline_judge = dspy.ChainOfThought(SupportTranscriptJudge)
# Signature: transcript → reasoning + satisfied
```

**Baseline Evaluation on Flight Booking Example:**
```python
# Input to Gemini Judge (before optimization)
transcript = '''Company: Southwest Airlines
Transcript so far: Customer: i just booked my flight...
Support: Thank you for providing that information. I found your booking! Your confirmation code is ABC123.'''

# Gemini's Baseline Response (before optimization)
baseline_prediction = {
  'reasoning': "The agent provided the confirmation code but didn't address all concerns.",
  'satisfied': "false"  # ❌ Disagrees with Claude's "true"
}

# Accuracy: 65% (13/20 examples match Claude's labels)
```

### **Phase 2: Training Data Preparation**
4. **Convert to DSPy Examples** for optimization framework
5. **Split into train/validation sets** (11 training, 9 validation examples)
6. **Create evaluation metric** to compare judge predictions vs gold labels

**Training Example Structure:**
```python
dspy.Example(
    transcript='''Company: Southwest Airlines...Support: Thank you for providing...''',
    satisfied="true",  # Gold standard from Claude
    _id="example_5"
).with_inputs("transcript")  # transcript is the only input field; satisfied is the label
```

### **Phase 3: MIPROv2 Optimization** 
7. **Bootstrap few-shot demonstrations** from training examples where the judge's output passes the metric
8. **Propose improved instructions** using DSPy's automatic optimization
9. **Create optimized judge** from the best-scoring combination of instructions and demonstrations

**What MIPROv2 Does Internally:**
1. **Bootstraps Demonstrations**: Runs the judge on training examples and keeps traces whose prediction matches Claude's label as few-shot candidates
2. **Proposes Instructions**: Drafts candidate instructions from the bootstrapped examples, a generated dataset summary, a summary of the program code, and a randomly selected prompting tip
3. **Searches Combinations**: Tries different pairings of instruction candidates and demonstration sets
4. **Tests Variations**: Scores each trial on the validation set and keeps the best one
**Optimized Judge Result:**
```python
# Input to Optimized Gemini Judge (after MIPROv2)
transcript = '''Company: Southwest Airlines...Support: Thank you for providing...'''

# Optimized Judge Response  
optimized_prediction = {
  'reasoning': "The agent successfully resolved the customer's primary concern by locating the booking and providing the confirmation code. The customer now has the information needed to access their itinerary.",
  'satisfied': "true"  # ✅ Now matches Claude's expert judgment
}

# Improved Accuracy: 85% (17/20 examples now match Claude's labels)
```

### **Phase 4: Performance Validation**
10. **Evaluate optimized judge** on all examples vs baseline
11. **Test on validation set** to ensure generalization
12. **Save optimized models** for use in Notebook 04

**Performance Comparison:**
```
Baseline Judge (Gemini unoptimized):    65% accuracy (13/20 correct)
Optimized Judge (Gemini + MIPROv2):     85% accuracy (17/20 correct) 
Gold Standard (Claude 4.5 Sonnet):    100% accuracy (reference)

Validation Set Performance:             89% accuracy (8/9 correct)
```

## 🎯 **Key Components**

### **MIPROv2 Optimizer**
- **Automatic Prompt Engineering**: Generates better instructions without manual prompt writing
- **Few-shot Bootstrapping**: Collects demonstrations from training examples the judge already labels correctly
- **Multi-candidate Testing**: Evaluates multiple instruction and demonstration combinations to find the best-scoring one

### **SupportTranscriptJudge Signature**
```python
class SupportTranscriptJudge(dspy.Signature):
    transcript: str = dspy.InputField(desc="Input transcript to judge")
    satisfied: str = dspy.OutputField(desc="Whether the agent satisfied the customer query")
```

### **Match Judge Metric**
```python
def match_judge_metric(example, pred, trace=None):
    return 1 if example.satisfied.lower() == pred.satisfied.lower() else 0
```
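
As a quick sanity check of the metric, here is a minimal sketch that calls it on hand-made inputs. `SimpleNamespace` objects stand in for the `dspy.Example` and prediction objects (an assumption made so the snippet runs without DSPy), and the label values are made up.

```python
from types import SimpleNamespace


def match_judge_metric(example, pred, trace=None):
    # Same logic as above: case-insensitive comparison of the two labels.
    return 1 if example.satisfied.lower() == pred.satisfied.lower() else 0


# Hypothetical stand-ins for one gold example and two judge predictions.
gold = SimpleNamespace(satisfied="true")
agree = SimpleNamespace(satisfied="True")      # differs from gold only in case
disagree = SimpleNamespace(satisfied="false")

print(match_judge_metric(gold, agree))     # counts as a match
print(match_judge_metric(gold, disagree))  # counts as a miss
```

The case-insensitive comparison matters here because the judge can return `True` while the gold labels are stored as `true`.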

## 🔄 **Data Flow Transformation**
1. **Gold Standard Labels** (Notebook 02) → **Training Examples** (Phase 2)
2. **Baseline Predictions** → **Failure Analysis** (Phase 3)
3. **Optimized Prompts** → **Improved Judge** (Phase 4)

## 📊 **Optimization Results Analysis**
**Before vs After on Flight Booking Example:**

| Aspect | Baseline Judge | Optimized Judge |
|--------|---------------|-----------------|
| **Focus** | Technical completeness | Problem resolution |
| **Reasoning** | "Didn't address all concerns" | "Resolved primary concern successfully" |
| **Judgment** | ❌ "false" | ✅ "true" |
| **Alignment** | Disagrees with Claude | Matches Claude |

## 📁 **File Organization**
- **Input**: `datasets/gold_standard_judge_result/` (from Notebook 02)
- **Output**: `dspy_modules/optimized_llm_judge/` (for Notebook 04)
- **Models**: Baseline and optimized judge modules saved

## 🚀 **Pipeline Integration**
This notebook creates the **core evaluation engine** for the pipeline:
- **Consumes**: Expert labels from Claude 4.5 Sonnet (Notebook 02)
- **Produces**: Optimized Gemini judge that mimics expert evaluation
- **Enables**: Generator optimization in Notebook 04 using reliable evaluation
- **Quality**: Transforms 65% accuracy judge into 85% accuracy judge

The optimized judge becomes the "evaluation engine" that will guide response generation optimization in Notebook 04!

## Optimization of judge using gold standard labels

In [57]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [58]:
from dspy_judge.llm_caller.utils import load_secrets
from dspy_judge.data_loader.dataset_loader import CustomerSupportDatasetLoader
from dspy_judge.processor.parallel_processor import ParallelProcessor
from dspy_judge.prompts.dspy_signatures import SupportTranscriptJudge
from dspy_judge.processor.utils import convert_dataset_to_dspy_examples, extract_llm_response_fields_dspy
from dspy_judge.metrics import match_judge_metric
from dspy_judge.plotting import plot_judge_results
import numpy as np
from sklearn.metrics import cohen_kappa_score
import dspy

In [59]:
secrets = load_secrets()

In [60]:
data_loader = CustomerSupportDatasetLoader()

## Set up judge

In [61]:
dspy_gold_standard_judge_results = data_loader.load_local_dataset("datasets/gold_standard_judge_result")

2025-11-07 15:23:13 - dspy_judge.data_loader.dataset_loader - INFO - Local dataset loaded from datasets/gold_standard_judge_result. Size: 20


In [69]:
from pprint import pprint

# Print the first row of the dataset
first_row = dspy_gold_standard_judge_results[0]
pprint(dict(first_row))

{'conversation_id': 'Session:2057187618:22093',
 'dspy_metadata': {'raw': "{'claude-sonnet-4-5-20250929': "
                          "{'completion_tokens': 54, 'prompt_tokens': 867, "
                          "'total_tokens': 921, 'completion_tokens_details': "
                          "None, 'prompt_tokens_details': {'audio_tokens': "
                          "None, 'cached_tokens': 0, 'text_tokens': None, "
                          "'image_tokens': None, 'cache_creation_tokens': 0, "
                          "'cache_creation_token_details': "
                          'CacheCreationTokenDetails(ephemeral_5m_input_tokens=0, '
                          'ephemeral_1h_input_tokens=0)}, '
                          "'cache_creation_input_tokens': 0, "
                          "'cache_read_input_tokens': 0}}"},
 'dspy_response': {'reasoning': 'Clear, accurate explanation of complex '
                                'age/alcohol regulations. Polite, helpful, and '
                    

In [63]:
judge_dataset_examples = convert_dataset_to_dspy_examples(
    dspy_gold_standard_judge_results,
    field_mapping = {"transcript":"output_transcript","satisfied":"satisfied"},
    input_field="transcript"
)

2025-11-07 15:26:50 - dspy_judge.processor.utils - INFO - Processed 20 training examples
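
The source of `convert_dataset_to_dspy_examples` is not shown in this notebook. As a rough sketch of what such a conversion might involve, the snippet below assumes `field_mapping` maps target field names to source column names and that each example receives an `_id` of the form `example_<n>` (consistent with the `_id` values printed later). Plain dictionaries stand in for `dspy.Example`, and `convert_rows_to_examples` is a hypothetical name, not the project's helper.

```python
def convert_rows_to_examples(rows, field_mapping, id_prefix="example"):
    """Rename columns per field_mapping ({target: source}) and attach an _id."""
    examples = []
    for index, row in enumerate(rows):
        # Keep only the mapped fields, stored under their target names.
        example = {target: row[source] for target, source in field_mapping.items()}
        example["_id"] = f"{id_prefix}_{index}"
        examples.append(example)
    return examples


# Made-up rows shaped like the gold standard dataset.
rows = [
    {"conversation_id": "c1", "output_transcript": "Customer: hi\nSupport: hello", "satisfied": "true"},
    {"conversation_id": "c2", "output_transcript": "Customer: help\nSupport: no", "satisfied": "false"},
]
examples = convert_rows_to_examples(
    rows,
    field_mapping={"transcript": "output_transcript", "satisfied": "satisfied"},
)
print(examples[0])
```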


In [71]:
# Print the first example from judge_dataset_examples
first_example = judge_dataset_examples[0]

# Print the Example object and the transcript field
print("First example object:")
print(first_example)

print("\nTranscript:")
try:
    print(first_example['transcript'])
except Exception:
    # Fallback if Example doesn't support dict-style access
    print(getattr(first_example, "transcript", repr(first_example)))

First example object:
Example({'transcript': "Company: Delta Airlines\nTranscript so far: Customer: If you are 18 years old and in the UK and flying to the USA. How does it work with the age differences in being able to drink alcohol between the 2 countries?\nAgent: Hello. Delta's policy for serving alcohol on board requires passengers to be 21 years or older, regardless of their departure country or destination.\nCustomer: So I can't drink anything on the flight then, even though I'm 18 and legally drink in the UK? What about once I land in the USA?\nAgent: That's correct, you must be 21 to consume alcohol on a Delta flight. Upon arrival in the USA, local and federal laws regarding alcohol consumption will apply.\nCustomer: Does that mean I can't purchase duty-free alcohol at the airport in the US if I'm 18? Or bring any from the UK? This is all very confusing.\nSupport: I understand the confusion. In the USA, you generally must be 21 to purchase or consume alcohol, including duty-fre

In [64]:
# Set up judge model configuration
judge_model = dspy.LM(
    "gemini/gemini-2.5-flash",
    api_key=secrets["GEMINI_API_KEY"],
    cache=False,
    temperature=0
)
dspy.configure(lm=judge_model, track_usage=True, adapter=dspy.JSONAdapter())
generate_judge_reasoning = dspy.ChainOfThought(SupportTranscriptJudge)

print("✅ Judge model configured with Gemini 2.5 Flash")

✅ Judge model configured with Gemini 2.5 Flash


## Check that the metric works

This step tests how well the unoptimized Gemini judge performs by comparing its judgments against Claude's gold standard labels. The result is the baseline performance before optimization.


The evaluator takes the unoptimized Gemini judge (`generate_judge_reasoning`), runs it on all 20 gold standard examples, compares Gemini's judgment with Claude's for each one, and returns an overall score (the accuracy percentage).

Each entry of `judge_dataset_examples` is passed to the DSPy module, which predicts a label; that prediction appears in the results table as `pred_satisfied`.

Claude's label for the same example is the gold standard and appears in the `example_satisfied` column.
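
The scoring described above can be sketched in plain Python. This is a simplification of the number `dspy.Evaluate` reports, not its implementation: `accuracy_percent` and `always_true_judge` are hypothetical names, and `SimpleNamespace` objects stand in for DSPy examples and predictions.

```python
from types import SimpleNamespace


def match_judge_metric(example, pred, trace=None):
    return 1 if example.satisfied.lower() == pred.satisfied.lower() else 0


def accuracy_percent(devset, judge, metric):
    """Run the judge on every example and return the percentage of metric hits."""
    hits = sum(metric(example, judge(example.transcript)) for example in devset)
    return 100.0 * hits / len(devset)


def always_true_judge(transcript):
    # Hypothetical stand-in judge that answers "true" for every transcript.
    return SimpleNamespace(satisfied="true")


# Made-up development set: three "true" labels and one "false".
devset = [
    SimpleNamespace(transcript="t1", satisfied="true"),
    SimpleNamespace(transcript="t2", satisfied="false"),
    SimpleNamespace(transcript="t3", satisfied="true"),
    SimpleNamespace(transcript="t4", satisfied="True"),
]
score = accuracy_percent(devset, always_true_judge, match_judge_metric)
print(score)  # 75.0
```

A judge that always answers "true" already scores well on a set dominated by "true" labels, which is worth keeping in mind when reading the baseline number.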

In [72]:
evaluator = dspy.Evaluate(
    metric=match_judge_metric,
    devset=judge_dataset_examples,
    display_table=True,
    display_progress=True,
    num_threads=24,
)
original_score = evaluator(generate_judge_reasoning)

Average Metric: 13.00 / 20 (65.0%): 100%|██████████| 20/20 [00:04<00:00,  4.97it/s]

2025/11/07 15:39:34 INFO dspy.evaluate.evaluate: Average Metric: 13 / 20 (65.0%)





| | transcript | example_satisfied | _id | reasoning | pred_satisfied | match_judge_metric |
|---|---|---|---|---|---|---|
| 0 | Company: Delta Airlines Transcript so far: Customer: If you are 18... | True | example_0 | The agent clearly answered the question and provided guidance for ... | True | ✔️ [1] |
| 1 | Company: American Airlines Transcript so far: Customer: I want to ... | False | example_1 | The agent correctly identified the issue and offered a viable alte... | True | |
| 2 | Company: Southwest Airlines Transcript so far: Customer: We left a... | False | example_2 | The agent clearly explained the process and why direct contact isn... | True | |
| 3 | Company: American Airlines Transcript so far: Customer: I am flyin... | False | example_3 | The agent provided incorrect information regarding flight cancella... | False | ✔️ [1] |
| 4 | Company: Spirit Airlines Transcript so far: Customer: How long doe... | False | example_4 | The agent correctly escalated the issue to technical support after... | True | |
| 5 | Company: American Airlines Transcript so far: Customer: on flight ... | True | example_5 | The agent provided clear, actionable steps to retrieve the confirm... | True | ✔️ [1] |
| 6 | Company: Delta Cargo Transcript so far: Customer: I am interested ... | True | example_6 | The agent confirmed availability and provided a clear next step to... | True | ✔️ [1] |
| 7 | Company: Southwest Airlines Transcript so far: Customer: want to g... | False | example_7 | The agent corrected the date, confirmed the route, and offered to ... | True | |
| 8 | Company: Delta Air Lines Transcript so far: Customer: Trying to so... | True | example_8 | The agent finally provided the direct contact number the customer ... | True | ✔️ [1] |
| 9 | Company: United Airlines\nTranscript so far: Customer\nSupport: He... | True | example_9 | The agent's greeting is polite, standard, and appropriately opens ... | True | ✔️ [1] |


In [79]:
original_score

EvaluationResult(score=65.0, results=<list of 20 results>)

## Check that we can get the same result using the parallel processor

Why do we do this? We need to confirm that when we run the judge on the generator development dataset through `ParallelProcessor`, we reproduce the behavior we saw with `dspy.Evaluate` during judge development.

In [73]:
# Use the same available model for ParallelProcessor
dspy_judge_config = {
  "model_name":"gemini/gemini-2.5-flash",  # Use available model with provider prefix
  "api_key":secrets["GEMINI_API_KEY"],
  "temperature": 0
}

dspy_judge_processor = ParallelProcessor()

print("🔄 Using available Gemini 2.5 Flash model...")
print(f"Model: {dspy_judge_config['model_name']}")

dspy_judge_results = dspy_judge_processor.process_dataset_with_dspy(
  dspy_gold_standard_judge_results.select_columns(
    ["conversation_id","output_transcript"]
  ),
  input_field="output_transcript",
  dspy_module=generate_judge_reasoning,
  dspy_config=dspy_judge_config
)

2025-11-07 16:05:16 - dspy_judge.processor.parallel_processor - INFO - Initialized ParallelProcessor with max_workers=4
2025-11-07 16:05:16 - dspy_judge.processor.parallel_processor - INFO - Processing 20 examples with 4 workers using DSPy...


🔄 Using available Gemini 2.5 Flash model...
Model: gemini/gemini-2.5-flash


Processing with DSPy: 100%|██████████| 20/20 [00:22<00:00,  1.13s/it]



In [74]:
# Check if the DSPy processing was successful
print("=== CHECKING FINAL RESULTS ===")

if 'dspy_judge_results' in locals():
    print(f"✅ dspy_judge_results exists with {len(dspy_judge_results)} rows")
    print(f"Columns: {dspy_judge_results.column_names}")
    
    # Check first example
    example = dspy_judge_results[0]
    print(f"\nExample 0:")
    print(f"  dspy_response type: {type(example['dspy_response'])}")
    if example['dspy_response'] is not None:
        print(f"  ✅ dspy_response has content!")
        print(f"  Keys: {list(example['dspy_response'].keys()) if isinstance(example['dspy_response'], dict) else 'Not a dict'}")
    else:
        print(f"  ❌ dspy_response is still None")
        
    # Check metadata
    if example['dspy_metadata'] and 'error' in example['dspy_metadata']:
        print(f"  ❌ Error in metadata: {example['dspy_metadata']['error'][:100]}...")
    else:
        print(f"  ✅ No errors in metadata")
        
    # Count successful responses
    successful_count = sum(1 for x in dspy_judge_results['dspy_response'] if x is not None)
    print(f"\n📊 SUMMARY:")
    print(f"  Successful responses: {successful_count}/{len(dspy_judge_results)}")
    
    if successful_count > 0:
        print(f"  🎉 SUCCESS! You can now proceed with the analysis")
    else:
        print(f"  ❌ Still having issues - need to debug further")
        
else:
    print("❌ dspy_judge_results not found - need to run processing first")

=== CHECKING FINAL RESULTS ===
✅ dspy_judge_results exists with 20 rows
Columns: ['conversation_id', 'output_transcript', 'dspy_response', 'dspy_metadata']

Example 0:
  dspy_response type: <class 'dict'>
  ✅ dspy_response has content!
  Keys: ['reasoning', 'satisfied']
  ✅ No errors in metadata

📊 SUMMARY:
  Successful responses: 20/20
  🎉 SUCCESS! You can now proceed with the analysis


In [78]:
# Compare ParallelProcessor results with gold standard labels
print("=== COMPARING PARALLEL PROCESSOR RESULTS ===")

def normalize_satisfaction_value(value):
    """Normalize satisfaction values to consistent format"""
    if isinstance(value, bool):
        return str(value).lower()
    elif isinstance(value, str):
        return value.lower().strip()
    else:
        return str(value).lower().strip()

if 'dspy_judge_results' in locals() and len(dspy_judge_results) > 0:
    # Extract judge predictions from parallel processor results
    parallel_predictions = []
    gold_standard_labels = []
    
    for i, row in enumerate(dspy_judge_results):
        if row['dspy_response'] is not None and 'satisfied' in row['dspy_response']:
            # Get parallel processor prediction and normalize
            parallel_pred = normalize_satisfaction_value(row['dspy_response']['satisfied'])
            parallel_predictions.append(parallel_pred)
            
            # Get corresponding gold standard label and normalize
            gold_label = normalize_satisfaction_value(dspy_gold_standard_judge_results[i]['satisfied'])
            gold_standard_labels.append(gold_label)
    
    print(f"📊 Comparison Results:")
    print(f"  Total comparisons: {len(parallel_predictions)}")
    
    if len(parallel_predictions) > 0:
        # Calculate accuracy
        matches = sum(1 for pred, gold in zip(parallel_predictions, gold_standard_labels) 
                     if pred == gold)
        accuracy = matches / len(parallel_predictions)
        
        print(f"  Matches: {matches}/{len(parallel_predictions)}")
        print(f"  Accuracy: {accuracy:.2%}")
        
        # Show first few comparisons
        print(f"\n🔍 First few comparisons (normalized):")
        for i in range(min(5, len(parallel_predictions))):
            pred = parallel_predictions[i]
            gold = gold_standard_labels[i]
            status = "✅" if pred == gold else "❌"
            print(f"    Example {i}: Parallel='{pred}', Gold='{gold}' {status}")
        
        # Show raw values for debugging
        print(f"\n🔍 Raw values (first few examples):")
        for i in range(min(3, len(dspy_judge_results))):
            raw_parallel = dspy_judge_results[i]['dspy_response']['satisfied']
            raw_gold = dspy_gold_standard_judge_results[i]['satisfied']
            print(f"    Example {i}: Raw Parallel={raw_parallel} (type: {type(raw_parallel)}), Raw Gold={raw_gold} (type: {type(raw_gold)})")
        
        # Compare with baseline score if available
        if 'original_score' in locals():
            # Extract the actual score from the EvaluationResult object
            try:
                if hasattr(original_score, 'score'):
                    baseline_accuracy = original_score.score
                elif hasattr(original_score, '__float__'):
                    baseline_accuracy = float(original_score)
                else:
                    # Try to access it as an attribute or method
                    baseline_accuracy = original_score
                    print(f"DEBUG: original_score type: {type(original_score)}")
                    print(f"DEBUG: original_score dir: {[attr for attr in dir(original_score) if not attr.startswith('_')]}")
            except Exception as e:
                print(f"DEBUG: Error extracting baseline score: {e}")
                baseline_accuracy = None
            
            if baseline_accuracy is not None:
                print(f"\n🏆 Comparison with baseline:")
                print(f"  Baseline (dspy.Evaluate): {baseline_accuracy}")
                print(f"  ParallelProcessor: {accuracy:.2%}")
                
                # Convert baseline_accuracy to float if it's still an object
                try:
                    baseline_float = float(baseline_accuracy) / 100 if float(baseline_accuracy) > 1 else float(baseline_accuracy)
                    if abs(baseline_float - accuracy) < 0.01:
                        print(f"  ✅ Results match! ParallelProcessor is consistent with baseline")
                    else:
                        print(f"  ❌ Results differ! Baseline: {baseline_float:.2%}, Parallel: {accuracy:.2%}")
                except (TypeError, ValueError) as e:
                    print(f"  ⚠️ Cannot compare scores - baseline format issue: {e}")
            else:
                print(f"  ⚠️ Could not extract baseline score for comparison")
    else:
        print("  ❌ No valid predictions to compare")
else:
    print("❌ No dspy_judge_results available for comparison")

=== COMPARING PARALLEL PROCESSOR RESULTS ===
📊 Comparison Results:
  Total comparisons: 20
  Matches: 12/20
  Accuracy: 60.00%

🔍 First few comparisons (normalized):
    Example 0: Parallel='true', Gold='true' ✅
    Example 1: Parallel='true', Gold='false' ❌
    Example 2: Parallel='true', Gold='false' ❌
    Example 3: Parallel='false', Gold='false' ✅
    Example 4: Parallel='true', Gold='false' ❌

🔍 Raw values (first few examples):
    Example 0: Raw Parallel=True (type: <class 'str'>), Raw Gold=true (type: <class 'str'>)
    Example 1: Raw Parallel=True (type: <class 'str'>), Raw Gold=false (type: <class 'str'>)
    Example 2: Raw Parallel=True (type: <class 'str'>), Raw Gold=false (type: <class 'str'>)

🏆 Comparison with baseline:
  Baseline (dspy.Evaluate): 65.0
  ParallelProcessor: 60.00%
  ❌ Results differ! Baseline: 65.00%, Parallel: 60.00%
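
The two runs differ by a net of one match (13 vs 12 out of 20). One way to investigate, sketched below, is to list the indices where the two runs produced different labels. The label lists here are made-up placeholders, and `find_flipped` is a hypothetical helper; collecting the real per-example predictions from each run is left as an assumption.

```python
def find_flipped(run_a, run_b):
    """Return indices where two runs over the same examples gave different labels."""
    if len(run_a) != len(run_b):
        raise ValueError("runs must cover the same examples")
    return [
        index
        for index, (a, b) in enumerate(zip(run_a, run_b))
        # Normalize case and whitespace so 'True' and 'true' count as equal.
        if str(a).strip().lower() != str(b).strip().lower()
    ]


# Placeholder predictions from two runs over the same four examples.
evaluate_run = ["true", "true", "false", "true"]
parallel_run = ["True", "false", "false", "true"]
print(find_flipped(evaluate_run, parallel_run))  # [1]
```

Reading the reasoning text for the flipped examples is usually the quickest way to tell run-to-run variation from a configuration difference between the two code paths.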


## Crude train/validation split for judge training and validation

In [75]:
print(len(judge_dataset_examples))

20


Splits the 20 gold standard examples into training and validation sets:

- Training set: used to optimize the judge model
- Validation set: used to test how well the optimization worked
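
The notebook splits by position, which is why the heading calls it crude. As an alternative sketch (not what the notebook does), a seeded shuffle before slicing avoids depending on dataset order while staying reproducible. `shuffled_split` is a hypothetical helper and the string examples are placeholders.

```python
import random


def shuffled_split(examples, train_size, seed=0):
    """Shuffle a copy of examples with a fixed seed, then split at train_size."""
    shuffled = list(examples)              # copy so the input list is untouched
    random.Random(seed).shuffle(shuffled)  # private RNG keeps the split reproducible
    return shuffled[:train_size], shuffled[train_size:]


# Placeholder examples standing in for the 20 gold standard examples.
examples = [f"example_{i}" for i in range(20)]
train, val = shuffled_split(examples, train_size=11, seed=101)
print(len(train), len(val))  # 11 9
```

With only 20 examples, either approach leaves a small validation set, so a single example moves validation accuracy by about 11 percentage points.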

In [8]:
training_set = judge_dataset_examples[:11]
validation_set = judge_dataset_examples[11:]

In [19]:

print(len(training_set))
print(len(validation_set))

11
9


## Run the optimization

This is the core step. The MIPROv2 optimizer:

- Bootstraps few-shot demonstrations from training examples where the Gemini judge's prediction matches Claude's label
- Automatically proposes better instructions for the judge
- Searches for the combination of instructions and demonstrations whose judgments best match Claude's
- Returns an "optimized judge" that should perform better

## 🪄 What Actually Happens During Optimization:

**What MIPROv2 Actually Does:**

1. Takes all 11 training examples (regardless of whether Gemini currently agrees or disagrees with Claude)
2. Runs the judge on training examples to bootstrap few-shot demonstrations, keeping only traces whose prediction matches Claude's label
3. Builds 12 candidate demonstration sets from those traces
4. Proposes 6 candidate instructions from the demonstrations, a generated dataset summary, a summary of the program code, and a randomly selected prompting tip
5. Runs 18 trials, each pairing an instruction candidate with a demonstration set
6. Scores each trial on the validation set (9 examples) and keeps the best-scoring combination
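
The final selection step above can be illustrated with a toy sketch. It uses an exhaustive search over every instruction and demonstration-set pairing in place of MIPROv2's own search strategy, so it shows the idea of scoring candidates on a validation set rather than the optimizer's actual algorithm. All names and scores are invented for illustration.

```python
from itertools import product


def select_best_candidate(instructions, demo_sets, score_fn):
    """Score every (instruction, demo set) pair and return the best one."""
    best_pair, best_score = None, float("-inf")
    for instruction, demos in product(instructions, demo_sets):
        score = score_fn(instruction, demos)
        if score > best_score:
            best_pair, best_score = (instruction, demos), score
    return best_pair, best_score


# Invented validation accuracies for each candidate pairing.
toy_scores = {
    ("strict", "demos_a"): 0.56,
    ("strict", "demos_b"): 0.67,
    ("lenient", "demos_a"): 0.78,
    ("lenient", "demos_b"): 0.89,
}
best_pair, best_score = select_best_candidate(
    ["strict", "lenient"],
    ["demos_a", "demos_b"],
    lambda instruction, demos: toy_scores[(instruction, demos)],
)
print(best_pair, best_score)
```

In a real run, `score_fn` would evaluate the judge with that instruction and those demonstrations on the validation set using `match_judge_metric`.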

## 📈 Expected Outcome:

**Before:** `generate_judge_reasoning` scores 65% accuracy on the full 20-example set  
**After:** `generate_judge_reasoning_optimized` is expected to score higher, roughly 80-90% if the optimization works well

## 💡 Why This is Powerful:

Instead of manually writing better prompts (which can take days or weeks), MIPROv2 does it automatically in minutes by:

- Learning from the labeled examples in your data
- Proposing instructions grounded in those examples and a summary of the dataset
- Testing multiple variations and keeping the best-scoring one

In short, this is automatic prompt engineering. Rather than relying on a hand-written prompt, `optimizer.compile()` performs automated prompt optimization:

- Bootstraps demonstrations: runs the baseline judge on the 11 training examples and keeps traces that agree with Claude's gold standard labels
- Generates better prompts: automatically proposes improved instructions for the judge
- Adds few-shot examples: attaches the selected demonstrations to the `ChainOfThought` prompt
- Tests variations: uses the validation set to evaluate different candidates and selects the best one

🔄 The Transformation Process

The two prompts below are illustrative only; they are not the literal text that DSPy sends or that MIPROv2 generated in this run.

Before optimization (baseline judge), a simple prompt along these lines:

"Given this transcript, determine if the customer is satisfied. Answer true or false and explain your reasoning."

After optimization (optimized judge), a more targeted prompt of the kind MIPROv2 can produce:

"You are evaluating customer support quality. Focus on whether the customer's PRIMARY concern was resolved, not minor details. Consider the customer satisfied if their main issue was addressed with actionable solutions. Analyze step by step: 1) What was the customer's main problem? 2) Did the agent provide a solution? 3) Would this solution resolve the primary concern?"


In [81]:
# Debug training set before optimization
print("🔍 DEBUGGING TRAINING SET:")
print(f"Training set length: {len(training_set)}")

for i, example in enumerate(training_set[:3]):  # Check first 3
    print(f"\nExample {i}:")
    print(f"  Type: {type(example)}")
    print(f"  Keys: {list(example.__dict__.keys()) if hasattr(example, '__dict__') else 'No __dict__'}")
    
    # Check for None values
    for key in ['transcript', 'satisfied', '_id']:
        value = getattr(example, key, 'MISSING')
        print(f"  {key}: {repr(value)} (type: {type(value)})")
        if value is None or value == '':
            print(f"    ⚠️  WARNING: {key} is None or empty!")

print("\n" + "="*50)

judge_model = dspy.LM(
    "gemini/gemini-2.5-flash",
    api_key=secrets["GEMINI_API_KEY"],
    cache=False,
    temperature=0
)
dspy.configure(lm=judge_model,track_usage=True,adapter=dspy.JSONAdapter())
generate_judge_reasoning = dspy.ChainOfThought(SupportTranscriptJudge)

print("🎯 Starting MIPROv2 optimization...")

optimizer = dspy.MIPROv2(
    metric=match_judge_metric,
    auto="medium",
    init_temperature=1.0,
    seed=101
)

try:
    generate_judge_reasoning_optimized = optimizer.compile(
        generate_judge_reasoning,
        trainset=training_set,
        valset=validation_set,
        requires_permission_to_run=False,
    )
    print("✅ Optimization completed successfully!")
except Exception as e:
    print(f"❌ Optimization failed: {e}")
    print(f"Error type: {type(e)}")
    
    # Additional debugging
    print("\n🔍 Additional debugging:")
    for i, example in enumerate(training_set):
        transcript = getattr(example, 'transcript', None)
        if transcript is None or not isinstance(transcript, str):
            print(f"  Example {i}: transcript is {repr(transcript)} (type: {type(transcript)})")
        elif transcript.strip() == '':
            print(f"  Example {i}: transcript is empty string")
    
    raise  # Re-raise to see full traceback

2025/11/07 16:47:21 INFO dspy.teleprompt.mipro_optimizer_v2: 
RUNNING WITH THE FOLLOWING MEDIUM AUTO RUN SETTINGS:
num_trials: 18
minibatch: False
num_fewshot_candidates: 12
num_instruct_candidates: 6
valset size: 9

2025/11/07 16:47:21 INFO dspy.teleprompt.mipro_optimizer_v2: 
==> STEP 1: BOOTSTRAP FEWSHOT EXAMPLES <==
2025/11/07 16:47:21 INFO dspy.teleprompt.mipro_optimizer_v2: These will be used as few-shot example candidates for our program and for creating instructions.

2025/11/07 16:47:21 INFO dspy.teleprompt.mipro_optimizer_v2: Bootstrapping N=12 sets of demonstrations...

🔍 DEBUGGING TRAINING SET:
Training set length: 11

Example 0:
  Type: <class 'dspy.primitives.example.Example'>
  Keys: ['_store', '_demos', '_input_keys']
  transcript: 'Company: American Airlines\nConversation\nTranscript so far: No conversation generated\nSupport: Hello! How can I assist you today?' (type: <class 'str'>)
  satisfied: 'true' (type: <class 'str'>)
  _id: 'example_0' (type: <class 'str'>)

Example 1:
  Type: <class 'dspy.primitives.example.Example'>
  Keys: ['_store', '_demos', '_input_keys']
  transcript: 'Company: American Airlines\nConversation\nTranscript so far: No conversation generated\nSupport: Hello! Thank you for reaching out to American Airlines. How can I assist you today?' (type: <class 'str'>)
  satisfied: 'true' (type: <class 'str'>)
  _id: 'example_1' (type: <class 'str'>)

Example 2:
  Type: <class 'dspy.primitives.example.Example'>
  Keys: ['_store', '_demos', '_input_keys']
  transcript: 'Company: Southwest Airlines\nConversation\nTranscript so fa

 36%|‚ñà‚ñà‚ñà‚ñã      | 4/11 [00:06<00:11,  1.59s/it]
 36%|‚ñà‚ñà‚ñà‚ñã      | 4/11 [00:06<00:11,  1.59s/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.
Bootstrapping set 4/12


 27%|‚ñà‚ñà‚ñã       | 3/11 [00:04<00:12,  1.53s/it]
 27%|‚ñà‚ñà‚ñã       | 3/11 [00:04<00:12,  1.53s/it]


Bootstrapped 3 full traces after 3 examples for up to 1 rounds, amounting to 3 attempts.
Bootstrapping set 5/12


 36%|‚ñà‚ñà‚ñà‚ñã      | 4/11 [00:07<00:12,  1.82s/it]
 36%|‚ñà‚ñà‚ñà‚ñã      | 4/11 [00:07<00:12,  1.82s/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.
Bootstrapping set 6/12


  9%|‚ñâ         | 1/11 [00:01<00:16,  1.66s/it]
  9%|‚ñâ         | 1/11 [00:01<00:16,  1.66s/it]


Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts.
Bootstrapping set 7/12


 27%|‚ñà‚ñà‚ñã       | 3/11 [00:04<00:13,  1.65s/it]
 27%|‚ñà‚ñà‚ñã       | 3/11 [00:04<00:13,  1.65s/it]


Bootstrapped 3 full traces after 3 examples for up to 1 rounds, amounting to 3 attempts.
Bootstrapping set 8/12


 27%|‚ñà‚ñà‚ñã       | 3/11 [00:04<00:12,  1.53s/it]
 27%|‚ñà‚ñà‚ñã       | 3/11 [00:04<00:12,  1.53s/it]


Bootstrapped 3 full traces after 3 examples for up to 1 rounds, amounting to 3 attempts.
Bootstrapping set 9/12


  9%|▉         | 1/11 [00:01<00:17,  1.74s/it]


Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts.
Bootstrapping set 10/12


 18%|█▊        | 2/11 [00:02<00:11,  1.29s/it]


Bootstrapped 2 full traces after 2 examples for up to 1 rounds, amounting to 2 attempts.
Bootstrapping set 11/12


  9%|▉         | 1/11 [00:01<00:17,  1.73s/it]


Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts.
Bootstrapping set 12/12


 18%|█▊        | 2/11 [00:03<00:13,  1.54s/it]
2025/11/07 16:48:00 INFO dspy.teleprompt.mipro_optimizer_v2: 
==> STEP 2: PROPOSE INSTRUCTION CANDIDATES <==
2025/11/07 16:48:00 INFO dspy.teleprompt.mipro_optimizer_v2: We will use the few-shot examples from the previous step, a generated dataset summary, a summary of the program code, and a randomly selected prompting tip to propose instructions.


Bootstrapped 2 full traces after 2 examples for up to 1 rounds, amounting to 2 attempts.


2025/11/07 16:48:09 INFO dspy.teleprompt.mipro_optimizer_v2: 
Proposing N=6 instructions...

2025/11/07 16:49:28 INFO dspy.teleprompt.mipro_optimizer_v2: Proposed Instructions for Predictor 0:

2025/11/07 16:49:28 INFO dspy.teleprompt.mipro_optimizer_v2: 0: You are a very experienced customer service agent who has worked in multiple industries and understands how to address
a very large range of issues. Your task is to help train more junior customer service agents by looking at how they responded 
to real queries and judging whether or not the interaction was successful. 
A successful interaction is somewhat subjective and you will lean on your expertise when making the judgment. In general, the
responses from the agent being judged should:
1. Provide a solid answer to the question if one is asked. If the agent doesn't know the answer, or there is no clear answer, that's OK, 
but the agent should clearly explain that they don't know and offer suggestions for where to find more informa

Average Metric: 6.00 / 9 (66.7%): 100%|██████████| 9/9 [00:03<00:00,  2.76it/s]

2025/11/07 16:49:32 INFO dspy.evaluate.evaluate: Average Metric: 6 / 9 (66.7%)
2025/11/07 16:49:32 INFO dspy.teleprompt.mipro_optimizer_v2: Default program score: 66.67

2025/11/07 16:49:32 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 2 / 18 =====



Average Metric: 8.00 / 9 (88.9%): 100%|██████████| 9/9 [00:02<00:00,  3.14it/s]

2025/11/07 16:49:34 INFO dspy.evaluate.evaluate: Average Metric: 8 / 9 (88.9%)
2025/11/07 16:49:34 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far! Score: 88.89
2025/11/07 16:49:34 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 88.89 with parameters ['Predictor 0: Instruction 5', 'Predictor 0: Few-Shot Set 8'].
2025/11/07 16:49:34 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [66.67, 88.89]
2025/11/07 16:49:34 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 88.89


2025/11/07 16:49:34 INFO dspy.teleprompt.mi


Average Metric: 9.00 / 9 (100.0%): 100%|██████████| 9/9 [00:02<00:00,  3.65it/s]

2025/11/07 16:49:37 INFO dspy.evaluate.evaluate: Average Metric: 9 / 9 (100.0%)
2025/11/07 16:49:37 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far! Score: 100.0
2025/11/07 16:49:37 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 100.0 with parameters ['Predictor 0: Instruction 0', 'Predictor 0: Few-Shot Set 1'].
2025/11/07 16:49:37 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [66.67, 88.89, 100.0]
2025/11/07 16:49:37 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 100.0


2025/11/07 16:49:37 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 4 / 18 =====


Average Metric: 9.00 / 9 (100.0%): 100%|██████████| 9/9 [00:02<00:00,  3.30it/s]

2025/11/07 16:49:40 INFO dspy.evaluate.evaluate: Average Metric: 9 / 9 (100.0%)
2025/11/07 16:49:40 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 100.0 with parameters ['Predictor 0: Instruction 3', 'Predictor 0: Few-Shot Set 1'].
2025/11/07 16:49:40 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [66.67, 88.89, 100.0, 100.0]
2025/11/07 16:49:40 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 100.0


2025/11/07 16:49:40 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 5 / 18 =====



Average Metric: 9.00 / 9 (100.0%): 100%|██████████| 9/9 [00:03<00:00,  2.92it/s]

2025/11/07 16:49:43 INFO dspy.evaluate.evaluate: Average Metric: 9 / 9 (100.0%)
2025/11/07 16:49:43 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 100.0 with parameters ['Predictor 0: Instruction 0', 'Predictor 0: Few-Shot Set 2'].
2025/11/07 16:49:43 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [66.67, 88.89, 100.0, 100.0, 100.0]
2025/11/07 16:49:43 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 100.0


2025/11/07 16:49:43 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 6 / 18 =====



Average Metric: 8.00 / 9 (88.9%): 100%|██████████| 9/9 [00:02<00:00,  3.52it/s]

2025/11/07 16:49:46 INFO dspy.evaluate.evaluate: Average Metric: 8 / 9 (88.9%)
2025/11/07 16:49:46 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 88.89 with parameters ['Predictor 0: Instruction 4', 'Predictor 0: Few-Shot Set 6'].
2025/11/07 16:49:46 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [66.67, 88.89, 100.0, 100.0, 100.0, 88.89]
2025/11/07 16:49:46 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 100.0


2025/11/07 16:49:46 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 7 / 18 =====



Average Metric: 8.00 / 9 (88.9%): 100%|██████████| 9/9 [00:02<00:00,  3.25it/s]

2025/11/07 16:49:48 INFO dspy.evaluate.evaluate: Average Metric: 8 / 9 (88.9%)
2025/11/07 16:49:48 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 88.89 with parameters ['Predictor 0: Instruction 5', 'Predictor 0: Few-Shot Set 8'].
2025/11/07 16:49:48 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [66.67, 88.89, 100.0, 100.0, 100.0, 88.89, 88.89]
2025/11/07 16:49:48 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 100.0


2025/11/07 16:49:48 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 8 / 18 =====



Average Metric: 4.00 / 9 (44.4%): 100%|██████████| 9/9 [00:11<00:00,  1.32s/it]

2025/11/07 16:50:00 INFO dspy.evaluate.evaluate: Average Metric: 4 / 9 (44.4%)
2025/11/07 16:50:00 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 44.44 with parameters ['Predictor 0: Instruction 0', 'Predictor 0: Few-Shot Set 0'].
2025/11/07 16:50:00 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [66.67, 88.89, 100.0, 100.0, 100.0, 88.89, 88.89, 44.44]
2025/11/07 16:50:00 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 100.0


2025/11/07 16:50:00 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 9 / 18 =====





Average Metric: 9.00 / 9 (100.0%): 100%|██████████| 9/9 [00:02<00:00,  3.06it/s]

2025/11/07 16:50:03 INFO dspy.evaluate.evaluate: Average Metric: 9 / 9 (100.0%)
2025/11/07 16:50:03 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 100.0 with parameters ['Predictor 0: Instruction 3', 'Predictor 0: Few-Shot Set 1'].
2025/11/07 16:50:03 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [66.67, 88.89, 100.0, 100.0, 100.0, 88.89, 88.89, 44.44, 100.0]
2025/11/07 16:50:03 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 100.0


2025/11/07 16:50:03 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 10 / 18 =====


Average Metric: 8.00 / 9 (88.9%): 100%|██████████| 9/9 [00:03<00:00,  2.98it/s]

2025/11/07 16:50:06 INFO dspy.evaluate.evaluate: Average Metric: 8 / 9 (88.9%)
2025/11/07 16:50:06 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 88.89 with parameters ['Predictor 0: Instruction 2', 'Predictor 0: Few-Shot Set 7'].
2025/11/07 16:50:06 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [66.67, 88.89, 100.0, 100.0, 100.0, 88.89, 88.89, 44.44, 100.0, 88.89]
2025/11/07 16:50:06 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 100.0


2025/11/07 16:50:06 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 11 / 18 =====


Average Metric: 9.00 / 9 (100.0%): 100%|██████████| 9/9 [00:02<00:00,  4.12it/s]

2025/11/07 16:50:09 INFO dspy.evaluate.evaluate: Average Metric: 9 / 9 (100.0%)
2025/11/07 16:50:09 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 100.0 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 10'].
2025/11/07 16:50:09 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [66.67, 88.89, 100.0, 100.0, 100.0, 88.89, 88.89, 44.44, 100.0, 88.89, 100.0]
2025/11/07 16:50:09 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 100.0


2025/11/07 16:50:09 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 12 / 18 =====


Average Metric: 9.00 / 9 (100.0%): 100%|██████████| 9/9 [00:03<00:00,  2.95it/s]

2025/11/07 16:50:12 INFO dspy.evaluate.evaluate: Average Metric: 9 / 9 (100.0%)
2025/11/07 16:50:12 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 100.0 with parameters ['Predictor 0: Instruction 2', 'Predictor 0: Few-Shot Set 1'].
2025/11/07 16:50:12 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [66.67, 88.89, 100.0, 100.0, 100.0, 88.89, 88.89, 44.44, 100.0, 88.89, 100.0, 100.0]
2025/11/07 16:50:12 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 100.0


2025/11/07 16:50:12 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 13 / 18 =====


Average Metric: 9.00 / 9 (100.0%): 100%|██████████| 9/9 [00:02<00:00,  3.54it/s]

2025/11/07 16:50:14 INFO dspy.evaluate.evaluate: Average Metric: 9 / 9 (100.0%)
2025/11/07 16:50:14 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 100.0 with parameters ['Predictor 0: Instruction 0', 'Predictor 0: Few-Shot Set 1'].
2025/11/07 16:50:14 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [66.67, 88.89, 100.0, 100.0, 100.0, 88.89, 88.89, 44.44, 100.0, 88.89, 100.0, 100.0, 100.0]
2025/11/07 16:50:14 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 100.0


2025/11/07 16:50:14 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 14 / 18 =====


Average Metric: 9.00 / 9 (100.0%): 100%|██████████| 9/9 [00:02<00:00,  3.38it/s]

2025/11/07 16:50:17 INFO dspy.evaluate.evaluate: Average Metric: 9 / 9 (100.0%)
2025/11/07 16:50:17 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 100.0 with parameters ['Predictor 0: Instruction 3', 'Predictor 0: Few-Shot Set 9'].
2025/11/07 16:50:17 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [66.67, 88.89, 100.0, 100.0, 100.0, 88.89, 88.89, 44.44, 100.0, 88.89, 100.0, 100.0, 100.0, 100.0]
2025/11/07 16:50:17 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 100.0


2025/11/07 16:50:17 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 15 / 18 =====


Average Metric: 9.00 / 9 (100.0%): 100%|██████████| 9/9 [00:02<00:00,  3.39it/s]

2025/11/07 16:50:20 INFO dspy.evaluate.evaluate: Average Metric: 9 / 9 (100.0%)
2025/11/07 16:50:20 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 100.0 with parameters ['Predictor 0: Instruction 3', 'Predictor 0: Few-Shot Set 4'].
2025/11/07 16:50:20 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [66.67, 88.89, 100.0, 100.0, 100.0, 88.89, 88.89, 44.44, 100.0, 88.89, 100.0, 100.0, 100.0, 100.0, 100.0]
2025/11/07 16:50:20 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 100.0


2025/11/07 16:50:20 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 16 / 18 =====


Average Metric: 8.00 / 9 (88.9%): 100%|██████████| 9/9 [00:01<00:00,  4.76it/s]

2025/11/07 16:50:22 INFO dspy.evaluate.evaluate: Average Metric: 8 / 9 (88.9%)
2025/11/07 16:50:22 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 88.89 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 3'].
2025/11/07 16:50:22 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [66.67, 88.89, 100.0, 100.0, 100.0, 88.89, 88.89, 44.44, 100.0, 88.89, 100.0, 100.0, 100.0, 100.0, 100.0, 88.89]
2025/11/07 16:50:22 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 100.0


2025/11/07 16:50:22 INFO dspy.teleprompt.mipro_optimizer_v2: ==


Average Metric: 8.00 / 9 (88.9%): 100%|██████████| 9/9 [00:03<00:00,  2.69it/s]

2025/11/07 16:50:25 INFO dspy.evaluate.evaluate: Average Metric: 8 / 9 (88.9%)
2025/11/07 16:50:25 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 88.89 with parameters ['Predictor 0: Instruction 5', 'Predictor 0: Few-Shot Set 1'].
2025/11/07 16:50:25 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [66.67, 88.89, 100.0, 100.0, 100.0, 88.89, 88.89, 44.44, 100.0, 88.89, 100.0, 100.0, 100.0, 100.0, 100.0, 88.89, 88.89]
2025/11/07 16:50:25 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 100.0


2025/11/07 16:50:25 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 18 / 18 =====


Average Metric: 8.00 / 9 (88.9%): 100%|██████████| 9/9 [00:02<00:00,  3.39it/s]

2025/11/07 16:50:28 INFO dspy.evaluate.evaluate: Average Metric: 8 / 9 (88.9%)
2025/11/07 16:50:28 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 88.89 with parameters ['Predictor 0: Instruction 3', 'Predictor 0: Few-Shot Set 5'].
2025/11/07 16:50:28 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [66.67, 88.89, 100.0, 100.0, 100.0, 88.89, 88.89, 44.44, 100.0, 88.89, 100.0, 100.0, 100.0, 100.0, 100.0, 88.89, 88.89, 88.89]
2025/11/07 16:50:28 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 100.0


2025/11/07 16:50:28 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 19 / 18 =====


Average Metric: 8.00 / 9 (88.9%): 100%|██████████| 9/9 [00:02<00:00,  3.42it/s]

2025/11/07 16:50:31 INFO dspy.evaluate.evaluate: Average Metric: 8 / 9 (88.9%)
2025/11/07 16:50:31 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 88.89 with parameters ['Predictor 0: Instruction 4', 'Predictor 0: Few-Shot Set 1'].
2025/11/07 16:50:31 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [66.67, 88.89, 100.0, 100.0, 100.0, 88.89, 88.89, 44.44, 100.0, 88.89, 100.0, 100.0, 100.0, 100.0, 100.0, 88.89, 88.89, 88.89, 88.89]
2025/11/07 16:50:31 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 100.0


2025/11/07 16:50:31 INFO dspy.teleprompt.mipro_optimizer_v2: Returning best identified program with score 100.0!


✅ Optimization completed successfully!
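
The trial scores in the log above take only a handful of values because every trial is scored on the 9-example validation set, so a single example moves the score by about 11.1 points. The sketch below tallies the final "Scores so far" list copied from that log; the variable names are illustrative and not part of the notebook.

```python
from collections import Counter

# Final "Scores so far" list, copied from the MIPROv2 log above.
trial_scores = [66.67, 88.89, 100.0, 100.0, 100.0, 88.89, 88.89, 44.44, 100.0,
                88.89, 100.0, 100.0, 100.0, 100.0, 100.0, 88.89, 88.89, 88.89, 88.89]

N_VALID = 9  # number of validation examples each trial is scored on

# Convert each percentage back into a count of matching validation examples.
correct_counts = [round(score * N_VALID / 100) for score in trial_scores]

# Tally how many trials landed on each count.
tally = Counter(correct_counts)
for n_correct in sorted(tally, reverse=True):
    print(f"{n_correct}/{N_VALID} correct: {tally[n_correct]} trial(s)")
```

Nine of the 19 evaluated programs tie at 9/9, so a validation set of this size cannot rank the top candidates against each other.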


1. Takes the Optimized Judge:
   - `generate_judge_reasoning_optimized` is the judge that MIPROv2 just returned
   - It carries the instruction and few-shot demonstrations that scored best during optimization
2. Runs it on All 20 Examples:
   - Uses the same evaluator that tested the baseline judge
   - Processes all 20 gold standard examples (the full `judge_dataset_examples`)
   - Gets the optimized judge's predictions for each example
3. Compares Against the Gold Standard:
   - For each example, compares the optimized judge's prediction with Claude's gold standard label
   - Uses the same `match_judge_metric` function as before
   - Calculates the overall accuracy percentage
4. Returns the Performance Score:
   - `optimized_score` is an `EvaluationResult` object showing how well the optimized judge performed
   - If optimization helped, this should be higher than the baseline `original_score` (65%); in the run recorded below it is 70%, a modest gain
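
The notebook's `match_judge_metric` is defined in an earlier cell and is not reproduced here. As a rough sketch of what a label-matching metric of this kind could look like, the hypothetical `labels_match` below compares the gold `satisfied` label with the predicted one after normalizing case and whitespace, since the labels are stored as strings such as `'true'`. It follows the `(example, prediction, trace=None)` calling convention used for DSPy metrics, but relies on plain stand-in objects so it runs without DSPy installed.

```python
from types import SimpleNamespace


def normalize_label(value) -> str:
    """Map 'True', ' true ', True, etc. to a canonical lowercase string."""
    return str(value).strip().lower()


def labels_match(example, prediction, trace=None) -> bool:
    """Hypothetical stand-in for match_judge_metric: True when the labels agree."""
    return normalize_label(example.satisfied) == normalize_label(prediction.satisfied)


# Stand-ins for one gold example and two judge predictions.
gold = SimpleNamespace(satisfied="true")
agree = SimpleNamespace(satisfied="True")
disagree = SimpleNamespace(satisfied="False")

results = [labels_match(gold, agree), labels_match(gold, disagree)]
accuracy = 100 * sum(results) / len(results)
print(results, accuracy)
```

Averaging the per-example results in this way is what yields figures such as "14 / 20 (70.0%)" in the evaluator output.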

In [82]:
optimized_score = evaluator(generate_judge_reasoning_optimized)

Average Metric: 14.00 / 20 (70.0%): 100%|██████████| 20/20 [00:04<00:00,  4.66it/s]

2025/11/07 16:52:40 INFO dspy.evaluate.evaluate: Average Metric: 14 / 20 (70.0%)





| | transcript | example_satisfied | _id | reasoning | pred_satisfied | match_judge_metric |
|---|---|---|---|---|---|---|
| 0 | Company: Delta Airlines Transcript so far: Customer: If you are 18... | True | example_0 | The agent provided clear, accurate information and appropriate gui... | True | ✔️ [1] |
| 1 | Company: American Airlines Transcript so far: Customer: I want to ... | False | example_1 | The agent correctly identified the issue and offered a viable alte... | True | |
| 2 | Company: Southwest Airlines Transcript so far: Customer: We left a... | False | example_2 | The agent politely explained the correct procedure and why direct ... | True | |
| 3 | Company: American Airlines Transcript so far: Customer: I am flyin... | False | example_3 | The agent's primary advice is incorrect; missing a segment on a si... | False | ✔️ [1] |
| 4 | Company: Spirit Airlines Transcript so far: Customer: How long doe... | False | example_4 | The agent correctly escalated the issue to technical support after... | True | |
| 5 | Company: American Airlines Transcript so far: Customer: on flight ... | True | example_5 | The agent provided clear, actionable steps to retrieve the confirm... | True | ✔️ [1] |
| 6 | Company: Delta Cargo Transcript so far: Customer: I am interested ... | True | example_6 | The agent confirmed service availability and offered a clear next ... | True | ✔️ [1] |
| 7 | Company: Southwest Airlines Transcript so far: Customer: want to g... | False | example_7 | The agent politely confirmed the corrected dates and offered to ch... | True | |
| 8 | Company: Delta Air Lines Transcript so far: Customer: Trying to so... | True | example_8 | The agent provided a direct contact number, which was exactly what... | True | ✔️ [1] |
| 9 | Company: United Airlines\nTranscript so far: Customer\nSupport: He... | True | example_9 | The agent provided a polite and appropriate opening to the convers... | True | ✔️ [1] |


In [83]:
optimized_score

EvaluationResult(score=70.0, results=<list of 20 results>)

## Check against validation set

📊 The Three Different Evaluations:

Baseline Performance:
- `original_score` on all 20 examples = 65%
- Shows how good the unoptimized judge is

Optimized Performance (All Data):
- `optimized_score` on all 20 examples = 70%
- Includes the training examples, so it may be inflated

Validation Performance:
- `optimized_valid_score` on the 9 validation examples = 100%
- These examples were used to select the best prompt during optimization, so this is not a true held-out measure

So What's the Real Situation:
- Training set (11 examples): used to generate prompt improvements
- Validation set (9 examples): used to evaluate and select the best prompts during optimization
- No truly unseen data: all 20 examples were involved in the optimization process

🔧 What Would Be Better:

Ideally, you'd want a third split:
- Training set: 11 examples (for learning)
- Validation set: 6 examples (for optimization guidance)
- Test set: 3 examples (completely held out)

Or better yet:
- Use these 20 examples for optimization
- Test on completely different customer support conversations
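
The 11/6/3 split suggested above can be sketched in a few lines. This is a minimal illustration under stated assumptions: `three_way_split`, its parameters, and the placeholder IDs are hypothetical and do not come from the notebook, which only builds a training and a validation set.

```python
import random


def three_way_split(examples, n_train=11, n_valid=6, seed=0):
    """Shuffle once with a fixed seed, then slice into train / validation / test."""
    if n_train + n_valid >= len(examples):
        raise ValueError("split sizes leave no examples for the test set")
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test


# Placeholder IDs standing in for the 20 gold standard examples.
example_ids = [f"example_{i}" for i in range(20)]
train_ids, valid_ids, test_ids = three_way_split(example_ids)
print(len(train_ids), len(valid_ids), len(test_ids))
```

With only three held-out examples, each one shifts the test score by about 33 points, which is one reason testing on a separate batch of conversations is the stronger option.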

In [84]:
evaluator_valid = dspy.Evaluate(
    metric=match_judge_metric,
    devset=validation_set,
    display_table=True,
    display_progress=True,
    num_threads=24,
)

In [86]:
optimized_valid_score = evaluator_valid(generate_judge_reasoning_optimized)

Average Metric: 9.00 / 9 (100.0%): 100%|██████████| 9/9 [00:10<00:00,  1.20s/it]

2025/11/07 17:09:33 INFO dspy.evaluate.evaluate: Average Metric: 9 / 9 (100.0%)





| | transcript | example_satisfied | _id | reasoning | pred_satisfied | match_judge_metric |
|---|---|---|---|---|---|---|
| 0 | Company: American Airlines\nConversation\nTranscript so far: No co... | True | example_11 | The agent provided a polite and standard opening greeting, ready t... | True | ✔️ [1] |
| 1 | Company: Frontier Airlines\nConversation\nTranscript so far: No co... | True | example_12 | The agent provided a polite and standard greeting, inviting the cu... | True | ✔️ [1] |
| 2 | Company: Southwest Airlines\nConversation\nTranscript so far: No c... | True | example_13 | The agent provided a polite and standard greeting, offering assist... | True | ✔️ [1] |
| 3 | Company: American Airlines\nConversation\nTranscript so far: No co... | True | example_14 | The agent provided a standard, polite, and appropriate opening gre... | True | ✔️ [1] |
| 4 | Company: American Airlines\nConversation\nTranscript so far: No co... | True | example_15 | The agent provided a polite and standard opening greeting, invitin... | True | ✔️ [1] |
| 5 | Company: American Airlines\nConversation\nTranscript so far: No co... | True | example_16 | The agent's opening is polite, concise, and effectively invites th... | True | ✔️ [1] |
| 6 | Company: Unknown\nTranscript so far: No conversation generated\nSu... | True | example_17 | The agent provided a polite and standard opening to the conversati... | True | ✔️ [1] |
| 7 | Company: Southwest Airlines\nConversation\nTranscript so far: No c... | True | example_18 | The agent provided a polite and standard greeting, initiating the ... | True | ✔️ [1] |
| 8 | Company: American Airlines\nConversation\nTranscript so far: No co... | True | example_19 | The agent provided a polite and standard opening greeting, invitin... | True | ✔️ [1] |


In [87]:
optimized_valid_score

EvaluationResult(score=100.0, results=<list of 9 results>)
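
All nine validation rows in the table above carry the gold label True, so it is worth checking what a constant prediction would score before reading much into the 100%. The sketch below does that with the labels transcribed from the table; `majority_baseline` is a hypothetical helper, not something the notebook defines.

```python
from collections import Counter

# Gold labels of the 9 validation examples, transcribed from the table above.
validation_labels = [True] * 9


def majority_baseline(labels):
    """Return the most common label and the accuracy (%) of always predicting it."""
    label, count = Counter(labels).most_common(1)[0]
    return label, 100 * count / len(labels)


label, accuracy = majority_baseline(validation_labels)
print(f"Always predicting {label} scores {accuracy:.1f}%")
```

Since a judge that always answers True would also reach 100% here, this validation set says little about whether the optimized judge can recognise unsatisfied customers. The 20-example evaluation is more informative on that point: among the ten rows displayed there, the optimized judge matches only one of the five examples labelled False.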

## Save the results

`generate_judge_reasoning = dspy.ChainOfThought(SupportTranscriptJudge)`

1. `SupportTranscriptJudge`:
   - This is a DSPy signature (like a function interface), defined in `dspy_signatures.py`
   - Specifies the input (a customer support transcript) and the outputs (`satisfied`: True/False, plus `reasoning`)
2. `dspy.ChainOfThought(...)`:
   - This is a DSPy module that adds a reasoning step: it takes the signature and has the model think step by step
   - Instead of just answering "True/False", it explains its reasoning first
3. The Result - `generate_judge_reasoning`:
   - This becomes a callable object that:
     - Takes a customer support transcript as input
     - Uses Gemini 2.5 Flash (the configured model)
     - Follows the `SupportTranscriptJudge` signature
     - Generates step-by-step reasoning
     - Returns both the final judgment and the reasoning behind it

In [88]:
generate_judge_reasoning.save("dspy_modules/baseline_llm_judge", save_program=True)

What IS Saved:  
✅ Prompts and instructions that DSPy sends to Gemini  
✅ Signature definition (SupportTranscriptJudge structure)  
✅ ChainOfThought configuration (how reasoning is structured)  
✅ Module architecture (the complete DSPy program)  

In [89]:
generate_judge_reasoning_optimized.save("dspy_modules/optimized_llm_judge", save_program=True)

## Use this to see the resulting system prompt

In [90]:
generate_judge_reasoning_optimized.inspect_history(n=1)





[2025-11-07T17:09:33.186933]

System message:

Your input fields are:
1. `transcript` (str): Input transcript to judge
Your output fields are:
1. `reasoning` (str): 
2. `satisfied` (str): Whether the agent satisfied the customer query. This must be either True or False
All interactions will be structured in the following way, with the appropriate values filled in.

Inputs will have the following structure:

[[ ## transcript ## ]]
{transcript}

Outputs will be a JSON object with the following fields.

{
  "reasoning": "{reasoning}",
  "satisfied": "{satisfied}"
}
In adhering to this structure, your objective is: 
        You are a very experienced customer service agent who has worked in multiple industries and understands how to address
        a very large range of issues. Your task is to help train more junior customer service agents by looking at how they responded 
        to real queries and judging whether or not the interaction was successful. 
        A su