# 05 - Evaluator-Optimizer: Self-Critique and Improvement

## What is Evaluator-Optimizer?

The Evaluator-Optimizer pattern generates an initial output, then uses further LLM passes to critique it and revise it in light of that critique. This creates a feedback loop in which the model reflects on its own work.

## Why use it for privilege review?

Privilege decisions have serious consequences, so a first-pass answer deserves a second look:
- First-pass classifications may miss nuances
- Self-critique can catch errors before human review
- Documented reasoning shows the analysis was thorough
- The process mirrors a senior lawyer reviewing a junior lawyer's work

## How it works
```
Document
    ↓
Generator: "Initial classification: PRIVILEGED"
    ↓
Evaluator: "Critique: You missed the third party CC..."
    ↓
Optimizer: "Revised classification: UNCERTAIN - waiver risk"
    ↓
Final Output (with full reasoning trail)
```
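
The flow above can be sketched as three chained calls. This is a minimal sketch rather than the notebook's implementation: `run_pipeline` and the three stub functions are hypothetical stand-ins that return canned strings, so the example runs without an API key.

```python
from typing import Callable, Dict

def run_pipeline(
    document: str,
    generate: Callable[[str], str],
    evaluate: Callable[[str, str], str],
    optimize: Callable[[str, str, str], str],
) -> Dict[str, str]:
    """Run generate -> evaluate -> optimize and keep every stage's output."""
    initial = generate(document)
    critique = evaluate(document, initial)
    final = optimize(document, initial, critique)
    return {"initial": initial, "critique": critique, "final": final}

# Hypothetical stubs standing in for the three LLM calls.
def stub_generate(doc: str) -> str:
    return "CLASSIFICATION: PRIVILEGED"

def stub_evaluate(doc: str, initial: str) -> str:
    return "CRITIQUE: third party in CC was not considered"

def stub_optimize(doc: str, initial: str, critique: str) -> str:
    return "REVISED_CLASSIFICATION: UNCERTAIN"

trail = run_pipeline("email text", stub_generate, stub_evaluate, stub_optimize)
for stage, output in trail.items():
    print(f"{stage}: {output}")
```

Returning all three outputs, not just the final one, is what makes the reasoning trail available for later review.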

## Australian Law Reference

- Evidence Act 1995 (Cth) ss 118-119 (client legal privilege for legal advice and for litigation)
- Self-critique helps catch missed waiver issues and incorrect dominant purpose analysis

## Step 1: Setup

Import libraries and create the OpenAI client.

**What this does:**
- `from openai import OpenAI`: loads the OpenAI client library
- `from IPython.display import display, Markdown`: renders formatted output in the notebook
- `client = OpenAI()`: creates the API client, which by default reads the key from the `OPENAI_API_KEY` environment variable
- `MODEL = "gpt-4.1-nano"`: the model used for the generator, evaluator, and optimizer

In [None]:
from openai import OpenAI
from IPython.display import display, Markdown

client = OpenAI()
MODEL = "gpt-4.1-nano"

print(f"Client configured with model: {MODEL}")
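
Because `OpenAI()` reads the key from the `OPENAI_API_KEY` environment variable by default, a quick check before the first call can save a confusing authentication error later. The helper below is a hypothetical convenience, not part of the OpenAI library; it accepts any mapping so it can be tried without touching the real environment.

```python
import os

def api_key_configured(env=None) -> bool:
    """Return True when a non-empty OPENAI_API_KEY is present in the mapping."""
    env = os.environ if env is None else env
    return bool(env.get("OPENAI_API_KEY", "").strip())

# Example mappings (the key value is a placeholder, not a real key).
print(api_key_configured({"OPENAI_API_KEY": "sk-example"}))
print(api_key_configured({}))
```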

## Step 2: Create Test Email with Subtle Issues

This step creates an email with subtle privilege issues that a first-pass analysis might miss.

**What this does:**
- Creates an email that appears privileged on the surface
- Contains subtle issues: the advice is forwarded to an external party, and the email mixes legal and business content
- Tests whether self-critique catches what initial analysis misses

In [None]:
test_email = {
    "id": "DOC001",
    "content": """
From: sarah.chen@acmecorp.com.au
To: michael.wong@wongpartners.com.au
CC: david.wilson@buildright.com.au
Date: 2024-03-18
Subject: FW: BuildRight dispute - settlement proposal

Michael,

Further to your advice, I'm forwarding this to David Wilson at BuildRight 
to open settlement discussions.

David - as discussed on the phone, we're prepared to offer $350,000 to 
resolve this matter. This is based on our legal analysis of the limitation 
clause in the contract.

Michael has advised that this represents a reasonable compromise given 
the risks on both sides.

Can we arrange a without prejudice meeting next week?

Regards,
Sarah Chen
General Counsel
ACME Corporation Pty Ltd

--- Original Message ---
From: michael.wong@wongpartners.com.au
To: sarah.chen@acmecorp.com.au
Date: 2024-03-15
Subject: BuildRight dispute - confidential advice

Sarah,

My confidential advice is as follows:

1. The limitation clause caps liability at $500,000
2. BuildRight's claim overstates damages
3. I recommend offering $350,000 as settlement

This advice is strictly confidential and subject to legal professional privilege.

Michael Wong
Partner, Wong & Partners
"""
}

print(f"Test email created: {test_email['id']}")
print("Note: This email forwards confidential legal advice to the opposing party")
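
The third-party disclosure in this email is visible in its headers, so a deterministic pre-check could flag it before any LLM call. The sketch below is an illustration only: `external_domains` and the `trusted` allow-list are assumptions introduced here, and the sample reuses just the header lines of the test email.

```python
import re

def external_domains(email_text: str, trusted_domains: set) -> set:
    """Return recipient domains (To/CC/BCC lines) that are not in trusted_domains."""
    found = set()
    for line in email_text.splitlines():
        if re.match(r"^(To|CC|BCC):", line.strip(), re.IGNORECASE):
            found.update(d.lower() for d in re.findall(r"@([\w.-]+)", line))
    return found - trusted_domains

sample_headers = """From: sarah.chen@acmecorp.com.au
To: michael.wong@wongpartners.com.au
CC: david.wilson@buildright.com.au"""

# Assumed allow-list: the client and its external law firm.
trusted = {"acmecorp.com.au", "wongpartners.com.au"}
print(external_domains(sample_headers, trusted))
```

A check like this does not decide privilege; it only surfaces recipients that the evaluator (or a human) should look at.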

## Step 3: Create the Generator

The generator produces an initial privilege classification.

**What this does:**
- Analyses the document for privilege indicators
- Makes an initial classification decision
- This is the "first draft" that will be critiqued and improved
- Simulates a junior lawyer's first-pass analysis

In [None]:
def generator(document):
    """Generate initial privilege classification"""
    
    messages = [
        {"role": "system", "content": """You are a legal analyst assessing documents for Australian legal professional privilege.
Apply Evidence Act 1995 (Cth) ss 118-119."""},
        {"role": "user", "content": f"""
Analyse this document for legal professional privilege.

{document['content']}

Consider:
1. Is a lawyer involved?
2. Is legal advice being sought or given?
3. Was it made in confidence?

Provide your initial classification.

Format:
CLASSIFICATION: [PRIVILEGED/NOT_PRIVILEGED/UNCERTAIN]
CONFIDENCE_SCORE: [0-100]
REASONING: [Your analysis in 2-3 sentences]
"""}
    ]
    
    response = client.chat.completions.create(
        model=MODEL,
        messages=messages
    )
    
    return response.choices[0].message.content

# Generate initial classification
print("Generator producing initial classification...")
initial_result = generator(test_email)
display(Markdown(f"### Initial Classification (Generator)\n\n{initial_result}"))
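
The generator is asked for one of three labels, but nothing forces the model to comply. A small validation step, sketched below, could normalise the label before it is used downstream. `extract_label` and `ALLOWED_LABELS` are hypothetical names, and falling back to `UNCERTAIN` is an assumed design choice so that malformed output is treated cautiously.

```python
ALLOWED_LABELS = {"PRIVILEGED", "NOT_PRIVILEGED", "UNCERTAIN"}

def extract_label(output_text: str) -> str:
    """Return the CLASSIFICATION label, or 'UNCERTAIN' if missing or unexpected."""
    for line in output_text.splitlines():
        # Drop surrounding whitespace and Markdown bold markers.
        cleaned = line.strip().strip("*").strip()
        if cleaned.upper().startswith("CLASSIFICATION:"):
            label = cleaned.split(":", 1)[1].strip().strip("*").strip().upper()
            return label if label in ALLOWED_LABELS else "UNCERTAIN"
    return "UNCERTAIN"

print(extract_label("CLASSIFICATION: PRIVILEGED\nCONFIDENCE_SCORE: 85"))
print(extract_label("**CLASSIFICATION:** not_privileged"))
print(extract_label("The document looks privileged."))
```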

## Step 4: Create the Evaluator

The evaluator critiques the initial classification, looking for errors or missed issues.

**What this does:**
- Reviews the initial classification against the original document
- Looks for errors, omissions, or missed nuances
- Specifically checks for waiver issues and third-party disclosures
- Simulates a senior lawyer reviewing a junior's work

In [None]:
def evaluator(document, initial_classification):
    """Evaluate and critique the initial classification"""
    
    messages = [
        {"role": "system", "content": """You are a senior Australian litigation partner reviewing a junior lawyer's privilege assessment.
Your job is to find errors, omissions, and missed issues. Be thorough and critical.
Apply Evidence Act 1995 (Cth) ss 118-119 and relevant case law on waiver."""},
        {"role": "user", "content": f"""
Review this privilege assessment for errors or missed issues.

DOCUMENT:
{document['content']}

INITIAL ASSESSMENT:
{initial_classification}

Critically evaluate:
1. Did the analyst correctly identify all parties?
2. Did they check for third parties who shouldn't have access?
3. Did they consider waiver by disclosure?
4. Is the dominant purpose test correctly applied?
5. Are there any red flags they missed?

Format:
ERRORS_FOUND: [Yes/No]
CRITIQUE: [Detailed critique of the initial assessment]
MISSED_ISSUES: [List specific issues the initial assessment missed]
RECOMMENDED_ACTION: [Should the classification be revised?]
"""}
    ]
    
    response = client.chat.completions.create(
        model=MODEL,
        messages=messages
    )
    
    return response.choices[0].message.content

# Evaluate the initial classification
print("Evaluator critiquing initial classification...")
evaluation_result = evaluator(test_email, initial_result)
display(Markdown(f"### Evaluation (Critique)\n\n{evaluation_result}"))
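
One possible use of the evaluator's `ERRORS_FOUND` field is to decide whether the optimizer call is needed at all. The helper below is a hypothetical sketch of that gate; it treats anything other than a clear "No" as a reason to revise, which is an assumption chosen to err on the side of caution.

```python
def needs_revision(evaluation_text: str) -> bool:
    """Return False only when the evaluator clearly reports ERRORS_FOUND: No."""
    for line in evaluation_text.splitlines():
        cleaned = line.strip().strip("*").strip()
        if cleaned.upper().startswith("ERRORS_FOUND:"):
            answer = cleaned.split(":", 1)[1].strip().strip("*").strip().lower()
            return answer.rstrip(".") != "no"
    # Field missing: fail safe and send the document through the optimizer.
    return True

print(needs_revision("ERRORS_FOUND: No\nCRITIQUE: Assessment is sound."))
print(needs_revision("ERRORS_FOUND: Yes\nCRITIQUE: CC recipient ignored."))
```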

## Step 5: Create the Optimizer

The optimizer revises the classification based on the evaluator's critique.

**What this does:**
- Takes the initial classification and the critique
- Produces a revised, improved classification
- Addresses the specific issues raised by the evaluator
- Creates a final output with full reasoning trail

In [None]:
def optimizer(document, initial_classification, evaluation):
    """Optimize classification based on critique"""
    
    messages = [
        {"role": "system", "content": """You are a senior Australian legal privilege expert.
Your job is to produce a revised, improved privilege classification based on initial analysis and critique.
Apply Evidence Act 1995 (Cth) ss 118-119 and case law on waiver."""},
        {"role": "user", "content": f"""
Produce a revised privilege classification that addresses the critique.

DOCUMENT:
{document['content']}

INITIAL CLASSIFICATION:
{initial_classification}

CRITIQUE:
{evaluation}

Produce an improved classification that:
1. Addresses all issues raised in the critique
2. Correctly analyses waiver issues
3. Provides clear reasoning
4. Is defensible under Australian law

Format:
REVISED_CLASSIFICATION: [PRIVILEGED/NOT_PRIVILEGED/PARTIAL_PRIVILEGE/PRIVILEGE_WAIVED/UNCERTAIN]
CONFIDENCE_SCORE: [0-100]
CHANGES_MADE: [What changed from initial classification and why]
WAIVER_ANALYSIS: [Detailed analysis of any waiver issues]
LEGAL_BASIS: [Relevant statutes and cases]
FINAL_REASONING: [3-4 sentences explaining the revised decision]
ESCALATION_REQUIRED: [Yes/No]
ESCALATION_REASON: [If yes, why human review is needed]
"""}
    ]
    
    response = client.chat.completions.create(
        model=MODEL,
        messages=messages
    )
    
    return response.choices[0].message.content

# Optimize based on critique
print("Optimizer producing revised classification...")
optimized_result = optimizer(test_email, initial_result, evaluation_result)
display(Markdown(f"### Optimized Classification (Final)\n\n{optimized_result}"))
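
The notebook runs a single evaluate/optimize round. If more than one round were wanted, the loop could look like the sketch below. Everything here is illustrative: `refine_until_clean`, `max_rounds`, and the two stubs are hypothetical, and the stubs return canned text in place of LLM calls.

```python
from typing import Callable, List, Tuple

def refine_until_clean(
    document: str,
    initial: str,
    evaluate: Callable[[str, str], str],
    optimize: Callable[[str, str, str], str],
    max_rounds: int = 3,
) -> Tuple[str, List[str]]:
    """Alternate evaluate/optimize until no errors are reported or max_rounds is hit."""
    current, critiques = initial, []
    for _ in range(max_rounds):
        critique = evaluate(document, current)
        critiques.append(critique)
        if "ERRORS_FOUND: No" in critique:
            break
        current = optimize(document, current, critique)
    return current, critiques

# Hypothetical stubs: the evaluator objects once, then accepts the revision.
def stub_evaluate(doc: str, classification: str) -> str:
    if "WAIVED" in classification:
        return "ERRORS_FOUND: No"
    return "ERRORS_FOUND: Yes\nMISSED_ISSUES: disclosure to opposing party"

def stub_optimize(doc: str, classification: str, critique: str) -> str:
    return "REVISED_CLASSIFICATION: PRIVILEGE_WAIVED"

final, history = refine_until_clean(
    "email text", "CLASSIFICATION: PRIVILEGED", stub_evaluate, stub_optimize
)
print(final)
print(f"evaluator calls: {len(history)}")
```

The `max_rounds` cap matters because each extra round adds two LLM calls and there is no guarantee the evaluator will ever stop objecting.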

## Step 6: Complete Evaluation-Optimization Trail

Display the full analysis showing how the classification improved through self-critique.

**What this does:**
- Shows the complete reasoning trail from initial to final
- Shows whether, and how, self-critique caught the waiver issue
- Creates an audit trail for legal defensibility
- Senior lawyers can see exactly how the conclusion was reached

In [None]:
summary = f"""
# Evaluator-Optimizer: Complete Analysis Trail

## Document: {test_email['id']}

---

## Stage 1: Generator (Initial Classification)

{initial_result}

---

## Stage 2: Evaluator (Critique)

{evaluation_result}

---

## Stage 3: Optimizer (Revised Classification)

{optimized_result}

---

## Improvement Summary (illustrative)

The values in this table are hard-coded from one example run; read the actual values from the three stage outputs above.

| Aspect | Before Critique | After Critique |
|--------|-----------------|----------------|
| Classification | PRIVILEGED | PARTIAL_PRIVILEGE |
| Confidence | 85% | 70% |
| Waiver identified | ❌ No | ✅ Yes |
| Escalation | ❌ No | ✅ Yes |

**Key issue the critique is expected to catch:** Legal advice forwarded to the opposing party (BuildRight), creating a potential waiver of privilege.
"""

display(Markdown(summary))
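
A before/after table can also be built from the stage outputs themselves instead of typed in by hand. The sketch below shows one way; `field` and `summary_table` are hypothetical helpers, and the two sample strings are made-up outputs in the notebook's format, not real model responses.

```python
import re

def field(text: str, name: str) -> str:
    """Return the value of a 'NAME: value' line, or 'Not found'."""
    pattern = rf"^[ \t*]*{re.escape(name)}[ \t*]*:[ \t*]*(.*)$"
    match = re.search(pattern, text, re.MULTILINE)
    return match.group(1).strip() if match else "Not found"

def summary_table(initial: str, final: str) -> str:
    """Build a Markdown before/after table from generator and optimizer output."""
    rows = [
        ("Classification", field(initial, "CLASSIFICATION"),
         field(final, "REVISED_CLASSIFICATION")),
        ("Confidence", field(initial, "CONFIDENCE_SCORE"),
         field(final, "CONFIDENCE_SCORE")),
        ("Escalation", "n/a", field(final, "ESCALATION_REQUIRED")),
    ]
    lines = ["| Aspect | Before Critique | After Critique |", "|---|---|---|"]
    lines += [f"| {a} | {b} | {c} |" for a, b, c in rows]
    return "\n".join(lines)

# Made-up sample outputs for demonstration only.
sample_initial = "CLASSIFICATION: PRIVILEGED\nCONFIDENCE_SCORE: 85"
sample_final = (
    "REVISED_CLASSIFICATION: PRIVILEGE_WAIVED\n"
    "CONFIDENCE_SCORE: 70\n"
    "ESCALATION_REQUIRED: Yes"
)
print(summary_table(sample_initial, sample_final))
```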

## Step 7: Export to CSV for Senior Lawyer Review

Create a CSV output showing the complete evaluation-optimization process.

**What this does:**
- Records initial classification, critique, and final classification
- Shows how the analysis improved through self-critique
- Flags documents where self-critique changed the outcome
- Includes blank columns for senior lawyer review and sign-off

In [None]:
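import re
import pandas as pd
from datetime import datetime

def parse_result(result_text, field):
    """Extract a single-line field value from the LLM output.

    Tolerates leading whitespace and Markdown bold around the label
    (e.g. "**CLASSIFICATION:** PRIVILEGED"), which models sometimes emit.
    """
    pattern = re.compile(rf"^[ \t*]*{re.escape(field)}[ \t*]*:[ \t*]*(.*)$")
    for line in result_text.splitlines():
        match = pattern.match(line)
        if match:
            return match.group(1).strip(" \t*")
    return "Not found"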

# Build the CSV row
csv_row = {
    "doc_id": test_email['id'],
    "initial_classification": parse_result(initial_result, "CLASSIFICATION"),
    "initial_confidence": parse_result(initial_result, "CONFIDENCE_SCORE"),
    "errors_found": parse_result(evaluation_result, "ERRORS_FOUND"),
    "final_classification": parse_result(optimized_result, "REVISED_CLASSIFICATION"),
    "final_confidence": parse_result(optimized_result, "CONFIDENCE_SCORE"),
    "classification_changed": "Yes" if parse_result(initial_result, "CLASSIFICATION") != parse_result(optimized_result, "REVISED_CLASSIFICATION") else "No",
    # The output template always contains the label "WAIVER_ANALYSIS", so a plain
    # substring test for "waiver" would always be true. Heuristic instead: treat a
    # WAIVER_ANALYSIS value that is missing or opens with a negation as "No".
    "waiver_identified": "No" if parse_result(optimized_result, "WAIVER_ANALYSIS").lower().startswith(("no ", "no.", "none", "n/a", "not ")) else "Yes",
    "legal_basis": parse_result(optimized_result, "LEGAL_BASIS"),
    "escalation_required": parse_result(optimized_result, "ESCALATION_REQUIRED"),
    "escalation_reason": parse_result(optimized_result, "ESCALATION_REASON"),
    # Blank columns for senior lawyer HITL review
    "reviewer_notes": "",
    "reviewer_decision": "",
    "reviewed_by": "",
    "review_date": ""
}

# Create DataFrame and export
df = pd.DataFrame([csv_row])
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
csv_filename = f"privilege_review_evaluator_{timestamp}.csv"

# Display preview
display(Markdown("### CSV Preview for HITL Review"))
display(df[['doc_id', 'initial_classification', 'final_classification', 'classification_changed', 'waiver_identified', 'escalation_required']])

# Save to file
df.to_csv(csv_filename, index=False)
display(Markdown(f"**Exported:** `{csv_filename}`"))
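
With more than one document, the same row structure extends to a list of rows, and reviewers may want to see first the documents where self-critique changed the outcome. The sketch below uses the standard-library `csv` module so it has no dependencies; `rows_to_csv`, `changed_only`, and the two sample rows are hypothetical and use only a subset of the columns above.

```python
import csv
import io

def rows_to_csv(rows: list) -> str:
    """Serialise review rows (dicts with identical keys) to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

def changed_only(rows: list) -> list:
    """Keep rows where self-critique changed the classification."""
    return [r for r in rows
            if r["initial_classification"] != r["final_classification"]]

# Made-up rows for demonstration only.
rows = [
    {"doc_id": "DOC001", "initial_classification": "PRIVILEGED",
     "final_classification": "PRIVILEGE_WAIVED"},
    {"doc_id": "DOC002", "initial_classification": "NOT_PRIVILEGED",
     "final_classification": "NOT_PRIVILEGED"},
]
print(rows_to_csv(changed_only(rows)))
```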

## Conclusion: Evaluator-Optimizer for LPP Classification

### What We Built

A self-critique system that improves classifications through reflection:
```
Document
    ↓
Generator: "PRIVILEGED (85%)"
    ↓
Evaluator: "You missed the third party CC to the opposing party..."
    ↓
Optimizer: "PARTIAL_PRIVILEGE (70%) - waiver risk identified"
    ↓
Final Output with full reasoning trail
```

### Why Evaluator-Optimizer Works for Privilege

- **Catches errors:** In the example run, self-critique found the waiver issue the generator missed
- **Reduces false confidence:** Confidence dropped from 85% to 70%, reflecting the newly identified waiver risk
- **Audit trail:** Shows exactly how the conclusion was reached
- **Mirrors legal practice:** Senior lawyer reviewing junior lawyer's work

### What We Observed

These figures come from one example run; the model's output can vary between runs.

| Stage | Classification | Confidence | Waiver Found |
|-------|---------------|------------|--------------|
| Generator (Initial) | PRIVILEGED | 85% | ❌ No |
| Optimizer (Final) | PARTIAL_PRIVILEGE | 70% | ✅ Yes |

The critical issue, legal advice forwarded to the opposing party, was caught by self-critique.

### Comparison to Other Patterns

| Aspect | Prompt Chaining | Routing | Parallelization | Orchestrator-Worker | Evaluator-Optimizer |
|--------|-----------------|---------|-----------------|---------------------|---------------------|
| Error detection | None | None | Model disagreement | Worker findings | Self-critique |
| Improvement loop | No | No | No | No | Yes |
| Best for | Simple docs | Mixed types | High-stakes | Complex docs | Catching subtle errors |
| LLM calls | Sequential | 1 + specialist | Parallel | Many | 3 (generate, evaluate, optimize) |

### Limitations

**Self-critique isn't infallible**
- The model may not catch all its own errors
- Some issues require external knowledge the model doesn't have

**Three LLM calls per document**
- Higher cost and latency than single-pass
- May not be necessary for straightforward documents

**Evaluator quality matters**
- A weak critique produces a weak optimization
- Consider using a more capable model for the evaluator
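
Using a more capable model for the evaluator only could be expressed as a per-stage lookup, sketched below. `STAGE_MODELS` and `model_for` are hypothetical names, and `"gpt-4.1"` is an assumption used purely as an example of a larger model; substitute whichever model is available to you.

```python
DEFAULT_MODEL = "gpt-4.1-nano"  # the model used throughout this notebook

# Assumption: "gpt-4.1" stands in for "a more capable model" in this sketch.
STAGE_MODELS = {"evaluator": "gpt-4.1"}

def model_for(stage: str) -> str:
    """Return the model name for a stage, falling back to the default."""
    return STAGE_MODELS.get(stage, DEFAULT_MODEL)

for stage in ("generator", "evaluator", "optimizer"):
    print(f"{stage}: {model_for(stage)}")
```

Each stage function would then pass `model=model_for("evaluator")` (or the relevant stage name) in place of the shared `MODEL` constant.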

### Complete Workflow Series

You've now seen all 5 workflow patterns:

1. **Prompt Chaining** - Fixed sequential steps
2. **Routing** - Branch by document type
3. **Parallelization** - Multiple models for consensus
4. **Orchestrator-Worker** - Dynamic task breakdown
5. **Evaluator-Optimizer** - Self-critique and improvement

### Next Notebook

`06_evaluation_comparison.ipynb` - Compare all patterns on the same test set and have an LLM judge assess which is most robust for production use.