# 03 - Parallelization: Multi-Model Consensus

## What is Parallelization?

Parallelization sends the same task to multiple models simultaneously and compares their results. Instead of trusting one model's judgment, you get consensus or flag disagreements.

## Why use it for privilege review?

Privilege decisions are high-stakes:
- A false negative (failing to flag a privileged document) can lead to its disclosure, which may waive privilege permanently
- A false positive (over-claiming) wastes lawyer time but is safer

Multiple models provide:
- **Consensus:** If all models agree, confidence in the result is higher
- **Disagreement detection:** If models disagree, the document is flagged for human review
- **Defensibility:** "Three independent AI systems reached the same conclusion"

## How it works
```
Document ──┬──→ Model A ──→ PRIVILEGED
           ├──→ Model B ──→ PRIVILEGED
           └──→ Model C ──→ NOT_PRIVILEGED
                              ↓
                    Disagreement detected
                              ↓
                    Escalate to human review
```
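
The decision rule in the diagram can be sketched for any number of models. This is a minimal illustration, not the notebook's implementation: the function name `consensus` is hypothetical, and the extra rule that a unanimous `UNCERTAIN` still escalates is an assumption of this sketch.

```python
# Labels that count as an actual decision (assumption for this sketch)
DEFINITE_LABELS = {"PRIVILEGED", "NOT_PRIVILEGED"}

def consensus(labels):
    """Return (final_label, escalate) for a list of model classifications."""
    unique = set(labels)
    # Accept only when every model returned the same definite label
    if len(unique) == 1 and unique <= DEFINITE_LABELS:
        return labels[0], False
    # Any disagreement, or no definite answer, goes to a human
    return "UNCERTAIN", True

print(consensus(["PRIVILEGED", "PRIVILEGED", "NOT_PRIVILEGED"]))  # ('UNCERTAIN', True)
print(consensus(["PRIVILEGED", "PRIVILEGED", "PRIVILEGED"]))      # ('PRIVILEGED', False)
```

Requiring unanimity rather than a majority keeps the rule conservative: a single dissenting model is enough to trigger review.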

## Australian Law Reference

- Evidence Act 1995 (Cth) ss 118-119 (client legal privilege: s 118 legal advice, s 119 litigation)
- For high-stakes privilege decisions, consensus reduces risk of error

## Step 1: Setup

Import libraries and create clients for multiple models.

**Prerequisites - Install Ollama:**
1. Download and install Ollama from https://ollama.ai
2. In terminal, run: `ollama pull llama3.2:1b`
3. Verify with: `ollama list`
4. Once running, Ollama serves its API at `http://localhost:11434` by default
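
Before running the notebook it can help to confirm that the local server is answering. The sketch below uses only the standard library; the helper name `ollama_is_reachable` is hypothetical, and it assumes the server responds to a plain HTTP GET on its base URL.

```python
import urllib.error
import urllib.request

def ollama_is_reachable(base_url="http://localhost:11434", timeout=2.0):
    """Return True if an HTTP server answers with status 200 at base_url."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status
        return False

print("Ollama reachable:", ollama_is_reachable())
```

If this prints `False`, start Ollama and rerun the check before continuing.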

**What this does:**
- `from openai import OpenAI`: imports the client class from the `openai` package
- `openai_client = OpenAI()`: creates a client for the OpenAI cloud API (the key is read from the `OPENAI_API_KEY` environment variable)
- `ollama_client = OpenAI(base_url=...)`: creates a client that points at the local Ollama server
- Ollama provides an OpenAI-compatible API, so the same library works for both
- This gives us two independent models: one cloud, one local

In [None]:
from openai import OpenAI
from IPython.display import display, Markdown

# OpenAI client (cloud)
openai_client = OpenAI()

# Ollama client (local) - uses OpenAI-compatible API
ollama_client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # Ollama doesn't need a real key
)

# Models
MODEL_OPENAI = "gpt-4.1-nano"
MODEL_LLAMA = "llama3.2:1b"

print("Clients configured:")
print(f"  Cloud: OpenAI {MODEL_OPENAI}")
print(f"  Local: Ollama {MODEL_LLAMA}")

## Step 2: Create Test Email

A synthetic email for testing parallel classification.

**What this does:**
- Creates a test email with some ambiguity
- The email is CC'd to a non-lawyer (the client's CFO), which raises a possible waiver question
- We want to see if both models agree on the classification

In [None]:
test_email = {
    "id": "DOC001",
    "content": """
From: sarah.chen@acmecorp.com.au
To: michael.wong@wongpartners.com.au
CC: john.smith@acmecorp.com.au
Date: 2024-03-15
Subject: RE: BuildRight dispute - next steps

Michael,

Thanks for your advice on our potential liability exposure.

I've copied in John Smith (our CFO) as he needs to understand 
the financial implications of your recommendations.

Can you please confirm:
1. Whether the $500k limitation clause is enforceable
2. If we should make a settlement offer

John - please treat this as confidential.

Regards,
Sarah Chen
General Counsel
ACME Corporation Pty Ltd
"""
}

print(f"Test email created: {test_email['id']}")
print("Note: This email has a CC to a non-lawyer (CFO) - potential waiver issue")

## Step 3: Create Classification Functions for Each Model

Define separate functions to call each model with the same prompt.

**What this does:**
- `classify_with_openai()` — sends the document to GPT-4.1-nano (cloud)
- `classify_with_llama()` — sends the document to Llama 3.2 (local)
- Both use identical prompts so results are comparable
- Each model analyses independently - no knowledge of the other's answer

In [None]:
PRIVILEGE_PROMPT = """You are an Australian legal privilege expert.
Apply Evidence Act 1995 (Cth) ss 118-119 and the dominant purpose test from Esso v Commissioner of Taxation (1999).

Analyse this document for legal professional privilege:

{content}

Consider:
1. Is a lawyer involved?
2. Is legal advice being sought or given?
3. Was it made in confidence?
4. Has privilege been waived by disclosure to third parties?

Respond in this format:
CLASSIFICATION: [PRIVILEGED/NOT_PRIVILEGED/UNCERTAIN]
CONFIDENCE_SCORE: [0-100]
WAIVER_RISK: [None/Low/Medium/High]
REASONING: [2-3 sentences explaining your decision]
"""

def classify_with_openai(document):
    """Classify using OpenAI GPT (cloud)"""
    response = openai_client.chat.completions.create(
        model=MODEL_OPENAI,
        messages=[{"role": "user", "content": PRIVILEGE_PROMPT.format(content=document['content'])}]
    )
    return response.choices[0].message.content

def classify_with_llama(document):
    """Classify using Llama 3.2 (local via Ollama)"""
    response = ollama_client.chat.completions.create(
        model=MODEL_LLAMA,
        messages=[{"role": "user", "content": PRIVILEGE_PROMPT.format(content=document['content'])}]
    )
    return response.choices[0].message.content

print("Classification functions created:")
print("  - classify_with_openai()")
print("  - classify_with_llama()")
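
The two functions above differ only in the client and model name, so they can be folded into one helper. This is a sketch under assumptions: `classify`, `PROMPT` and `StubClient` are hypothetical names, and the stub only mimics the shape of an OpenAI-compatible client so the example runs without network access.

```python
from types import SimpleNamespace

# Placeholder prompt for the sketch; the notebook would pass PRIVILEGE_PROMPT
PROMPT = "Classify this document for privilege:\n\n{content}"

def classify(client, model, document, prompt=PROMPT):
    """Send one document to one model through an OpenAI-compatible client."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt.format(content=document["content"])}],
    )
    return response.choices[0].message.content

class StubClient:
    """Hypothetical stand-in that returns a fixed reply instead of calling an API."""
    def __init__(self, reply):
        def create(model, messages):
            message = SimpleNamespace(content=reply)
            return SimpleNamespace(choices=[SimpleNamespace(message=message)])
        self.chat = SimpleNamespace(completions=SimpleNamespace(create=create))

stub = StubClient("CLASSIFICATION: PRIVILEGED")
print(classify(stub, "any-model", {"content": "example text"}))
```

With real clients the calls would look like `classify(openai_client, MODEL_OPENAI, test_email, PRIVILEGE_PROMPT)`, which also guarantees that every model receives the identical prompt.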

## Step 4: Run Models in Parallel

Execute both models on the same document and compare results.

**What this does:**
- Submits the test email to OpenAI and Llama concurrently, one thread per model
- Each model analyses independently with no knowledge of the other
- Displays results side by side for comparison
- This is the core of parallelization - multiple independent opinions

In [None]:
import concurrent.futures

def run_parallel_classification(document):
    """Run both models in parallel and return results"""
    
    with concurrent.futures.ThreadPoolExecutor() as executor:
        # Submit both tasks
        future_openai = executor.submit(classify_with_openai, document)
        future_llama = executor.submit(classify_with_llama, document)
        
        # Get results
        result_openai = future_openai.result()
        result_llama = future_llama.result()
    
    return {
        "openai": result_openai,
        "llama": result_llama
    }

# Run parallel classification
print("Running parallel classification...")
results = run_parallel_classification(test_email)

display(Markdown("### OpenAI GPT-4.1-nano Result:\n"))
display(Markdown(results["openai"]))

display(Markdown("\n---\n"))

display(Markdown("### Llama 3.2 (Local) Result:\n"))
display(Markdown(results["llama"]))
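
In the cell above, an exception in either model call (for example, Ollama not running) propagates from `.result()` and the other model's answer is lost. The sketch below shows one way to record failures per model instead. The names `run_all`, `fake_cloud` and `fake_local` are hypothetical, and the two stub classifiers stand in for the real API calls.

```python
import concurrent.futures

def run_all(classifiers, document):
    """Run each classifier in its own thread; record errors instead of raising."""
    results = {}
    workers = max(1, len(classifiers))
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as executor:
        futures = {executor.submit(fn, document): name for name, fn in classifiers.items()}
        for future in concurrent.futures.as_completed(futures):
            name = futures[future]
            try:
                results[name] = {"ok": True, "text": future.result()}
            except Exception as exc:
                # Keep the failure visible so the document can be escalated
                results[name] = {"ok": False, "text": f"{type(exc).__name__}: {exc}"}
    return results

def fake_cloud(document):
    return "CLASSIFICATION: PRIVILEGED"

def fake_local(document):
    raise ConnectionError("local server not running")

outcome = run_all({"openai": fake_cloud, "llama": fake_local}, {"content": "example"})
print(outcome)
```

A failed model is treated here as missing evidence, so a natural follow-up is to escalate any document where one of the entries has `ok` set to `False`.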

## Step 5: Compare Results and Detect Disagreements

Analyse the parallel results to determine consensus or flag disagreements.

**What this does:**
- Extracts the classification from each model's response
- Compares the classifications
- If models agree → high confidence in the result
- If models disagree → flag for human review
- This is the safety net that single-model approaches lack
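
The prompt asks each model for four fields, including `WAIVER_RISK` and `REASONING`, which a reviewer may also want to see. As a sketch, all four can be extracted in one pass. The name `parse_response` and the sample text are illustrative assumptions, not actual model output.

```python
import re

FIELDS = ("CLASSIFICATION", "CONFIDENCE_SCORE", "WAIVER_RISK", "REASONING")

def parse_response(text):
    """Parse 'FIELD: value' lines into a dict; missing fields map to None."""
    parsed = dict.fromkeys(FIELDS)
    for line in text.splitlines():
        # Tolerate markdown emphasis such as "**CLASSIFICATION:** X"
        cleaned = line.replace("*", "").strip()
        key, sep, value = cleaned.partition(":")
        key = key.strip().upper()
        if sep and key in parsed and parsed[key] is None:
            parsed[key] = value.strip().strip("[]")
    # Convert the score to an integer when one is present
    digits = re.search(r"\d+", parsed["CONFIDENCE_SCORE"] or "")
    parsed["CONFIDENCE_SCORE"] = int(digits.group()) if digits else None
    return parsed

sample = """**CLASSIFICATION:** PRIVILEGED
CONFIDENCE_SCORE: 85%
WAIVER_RISK: Medium
REASONING: Advice is sought from a lawyer: the CC needs review."""

print(parse_response(sample))
```

Returning `None` for a missing field, rather than a default value, makes it easy to tell a malformed response apart from a genuine low score.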

In [None]:
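import re

def _field_value(result_text, name):
    """Return the text after 'NAME:' on the first matching line, or None"""
    for line in result_text.splitlines():
        # Tolerate markdown emphasis and indentation, e.g. "**CLASSIFICATION:** X"
        cleaned = line.replace("*", "").strip()
        if cleaned.upper().startswith(f"{name}:"):
            return cleaned.split(":", 1)[1].strip()
    return None

def parse_classification(result_text):
    """Extract classification from model response"""
    value = _field_value(result_text, "CLASSIFICATION")
    label = (value or "").strip("[]. ").upper().replace(" ", "_")
    # Anything outside the three labels requested in the prompt is UNKNOWN
    return label if label in ("PRIVILEGED", "NOT_PRIVILEGED", "UNCERTAIN") else "UNKNOWN"

def parse_confidence(result_text):
    """Extract confidence score from model response as an integer (0 if missing)"""
    value = _field_value(result_text, "CONFIDENCE_SCORE")
    match = re.search(r"\d+", value or "")
    return int(match.group()) if match else 0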

def compare_results(results):
    """Compare model results and determine consensus"""
    
    openai_class = parse_classification(results["openai"])
    llama_class = parse_classification(results["llama"])
    
    openai_conf = parse_confidence(results["openai"])
    llama_conf = parse_confidence(results["llama"])
    
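    # Determine consensus. Only agreement on a definite label counts:
    # a shared UNCERTAIN or UNKNOWN (unparseable output) is not a decision
    # and still needs human review.
    if openai_class == llama_class and openai_class in ("PRIVILEGED", "NOT_PRIVILEGED"):
        consensus = True
        final_classification = openai_class
        escalation_required = "No"
        escalation_reason = "Models agree"
    elif openai_class == llama_class:
        consensus = False
        final_classification = "UNCERTAIN"
        escalation_required = "Yes"
        escalation_reason = f"No definite classification: both models returned {openai_class}"
    else:
        consensus = False
        final_classification = "UNCERTAIN"
        escalation_required = "Yes"
        escalation_reason = f"Model disagreement: OpenAI={openai_class}, Llama={llama_class}"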
    
    return {
        "openai_classification": openai_class,
        "openai_confidence": openai_conf,
        "llama_classification": llama_class,
        "llama_confidence": llama_conf,
        "consensus": consensus,
        "final_classification": final_classification,
        "escalation_required": escalation_required,
        "escalation_reason": escalation_reason
    }

# Compare the results
comparison = compare_results(results)

display(Markdown(f"""
### Consensus Analysis

| Model | Classification | Confidence |
|-------|---------------|------------|
| OpenAI GPT-4.1-nano | {comparison['openai_classification']} | {comparison['openai_confidence']}% |
| Llama 3.2 (Local) | {comparison['llama_classification']} | {comparison['llama_confidence']}% |

**Consensus reached:** {comparison['consensus']}

**Final classification:** {comparison['final_classification']}

**Escalation required:** {comparison['escalation_required']}

**Reason:** {comparison['escalation_reason']}
"""))

## Step 6: Export to CSV for Senior Lawyer Review

Create a CSV output showing both model results and consensus analysis.

**What this does:**
- Records both models' classifications and confidence scores
- Shows whether consensus was reached
- Flags disagreements for escalation
- Includes blank columns for senior lawyer to make final decision
- The human reviewer can see exactly why the document was escalated

In [None]:
import pandas as pd
from datetime import datetime

# Build the CSV row
csv_row = {
    "doc_id": test_email['id'],
    "model_a": "OpenAI GPT-4.1-nano",
    "model_a_classification": comparison['openai_classification'],
    "model_a_confidence": comparison['openai_confidence'],
    "model_b": "Llama 3.2 (Local)",
    "model_b_classification": comparison['llama_classification'],
    "model_b_confidence": comparison['llama_confidence'],
    "consensus_reached": comparison['consensus'],
    "final_classification": comparison['final_classification'],
    "escalation_required": comparison['escalation_required'],
    "escalation_reason": comparison['escalation_reason'],
    # Blank columns for senior lawyer HITL review
    "reviewer_notes": "",
    "reviewer_decision": "",
    "reviewed_by": "",
    "review_date": ""
}

# Create DataFrame and export
df = pd.DataFrame([csv_row])
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
csv_filename = f"privilege_review_parallelization_{timestamp}.csv"

# Display preview
display(Markdown("### CSV Preview for HITL Review"))
display(df[['doc_id', 'model_a_classification', 'model_b_classification', 'consensus_reached', 'escalation_required']])

# Save to file
df.to_csv(csv_filename, index=False)
display(Markdown(f"**Exported:** `{csv_filename}`"))
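
For more than one document, the same export can be written as a batch with escalated documents listed first so the reviewer sees them at the top. The sketch below uses the standard `csv` module rather than pandas so it runs without extra dependencies. The function `write_review_csv`, the reduced column set, and the row for `DOC002` are hypothetical illustrations.

```python
import csv
import io

COLUMNS = ["doc_id", "final_classification", "escalation_required", "reviewer_decision"]

def write_review_csv(rows, handle):
    """Write review rows to an open text handle, escalated documents first."""
    # False sorts before True, so rows with escalation_required == "Yes" come first
    ordered = sorted(rows, key=lambda row: row["escalation_required"] != "Yes")
    writer = csv.DictWriter(handle, fieldnames=COLUMNS, extrasaction="ignore")
    writer.writeheader()
    for row in ordered:
        # Leave the reviewer column blank unless the row already has a value
        writer.writerow({"reviewer_decision": "", **row})
    return len(ordered)

rows = [
    {"doc_id": "DOC001", "final_classification": "PRIVILEGED", "escalation_required": "No"},
    {"doc_id": "DOC002", "final_classification": "UNCERTAIN", "escalation_required": "Yes"},
]
buffer = io.StringIO()
write_review_csv(rows, buffer)
print(buffer.getvalue())
```

To write a file instead of an in-memory buffer, pass a handle opened with `open(csv_filename, "w", newline="")`.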

## Conclusion: Parallelization for LPP Classification

### What We Built

A multi-model consensus system that runs two independent models in parallel:
```
Document ──┬──→ OpenAI GPT-4.1-nano ──→ PRIVILEGED (85%)
           │
           └──→ Llama 3.2 (Local) ────→ UNCERTAIN (10%)
                                              ↓
                                   Disagreement detected
                                              ↓
                                   Escalate to human review
```

### Why Parallelization Works for Privilege

- **Safety net:** One model's error is caught whenever the other model disagrees
- **Defensibility:** "Two independent AI systems were consulted"
- **Appropriate caution:** Disagreements default to human review
- **Cost balance:** A cloud model plus a local model with no per-call API fee keeps redundancy affordable

### What We Observed

| Model | Classification | Confidence | Notes |
|-------|---------------|------------|-------|
| OpenAI GPT-4.1-nano | PRIVILEGED | 85% | Focused on lawyer involvement and dominant purpose |
| Llama 3.2 (Local) | UNCERTAIN | 10% | Questioned the lawyer-client relationship |

The disagreement was legitimate - the email was CC'd to a non-lawyer (the CFO), which raises a genuine question about waiver.

### Comparison to Other Patterns

| Aspect | Prompt Chaining | Routing | Parallelization |
|--------|-----------------|---------|-----------------|
| Models used | 1 | 1 per type | Multiple |
| Error detection | None | None | Disagreement flagged |
| Cost | Low | Medium | Higher |
| Best for | Clear-cut cases | Mixed document types | High-stakes decisions |

### Limitations

**Cost increases with more models**
- Each additional model adds its own API or compute cost for every document
- Balance accuracy needs against budget

**Consensus isn't always correct**
- Two models can agree on the wrong answer
- Parallelization reduces but doesn't eliminate error
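
One possible mitigation, offered here as an assumption rather than part of the notebook, is to spot-check a random sample of documents where the models agreed. The function `sample_for_audit`, the 10% rate, and the generated document IDs are all hypothetical.

```python
import random

def sample_for_audit(doc_ids, rate=0.1, seed=0):
    """Pick a reproducible random subset of consensus documents for human spot checks."""
    if not doc_ids:
        return []
    # Always audit at least one document when any are available
    count = max(1, round(len(doc_ids) * rate))
    return sorted(random.Random(seed).sample(list(doc_ids), count))

consensus_docs = [f"DOC{n:03d}" for n in range(1, 51)]
audit = sample_for_audit(consensus_docs, rate=0.1)
print(len(audit), "documents selected for spot check")
```

A fixed seed keeps the sample reproducible, which helps when the selection itself needs to be explained later.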

**Different model strengths**
- `llama3.2:1b` is a small 1B-parameter model and is likely less capable than GPT-4.1-nano, the cloud model used here
- Consider using similarly-capable models for fairer comparison

### Next Notebook

`04_orchestrator_worker.ipynb` - Dynamic task breakdown for complex documents.