# 02 - Routing: Document Type Classification

## What is Routing?

Routing directs incoming documents to specialised classifiers based on document type. Instead of one generic classifier handling everything, each document type gets expert analysis.

## Why use it for privilege review?

Real discovery datasets contain mixed document types:
- Emails between lawyers and clients
- Internal business emails
- Board minutes and resolutions
- File notes and attendance memos
- Financial spreadsheets
- HR documents

Each type has different privilege indicators. An email needs party analysis; a board minute needs "legal advice" agenda item detection; a file note needs author role analysis.

## How it works
```
Document → Router (what type?) → Specialist Classifier → Result
                  ↓
     ┌────────────┼────────────┐
     ↓            ↓            ↓
   Email        Board        File
Classifier   Classifier   Classifier
```
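
The flow in the diagram can be sketched without any model calls. This is a minimal illustration only: `toy_router`, `SPECIALISTS` and `toy_route_and_classify` are hypothetical names, the router is a keyword heuristic, and the specialists are stubs. The router and classifiers built in the steps below use an LLM instead.

```python
# Hypothetical stand-ins: the notebook's real router and classifiers call an LLM.
def toy_router(text):
    """Guess the document type from surface features only."""
    upper = text.upper()
    if "FROM:" in upper and "TO:" in upper:
        return "EMAIL"
    if "BOARD" in upper and "MINUTES" in upper:
        return "BOARD_MINUTE"
    if "FILE NOTE" in upper:
        return "FILE_NOTE"
    return "OTHER"

# Dispatch table: document type -> specialist function (stubs here).
SPECIALISTS = {
    "EMAIL": lambda text: "email analysis",
    "BOARD_MINUTE": lambda text: "board minute analysis",
    "FILE_NOTE": lambda text: "file note analysis",
}

def toy_route_and_classify(text):
    """Route the text, then hand it to the matching specialist if one exists."""
    doc_type = toy_router(text)
    specialist = SPECIALISTS.get(doc_type)
    if specialist is None:
        return doc_type, "manual review"
    return doc_type, specialist(text)

print(toy_route_and_classify("From: a@x.com\nTo: b@y.com\nSubject: Advice"))
print(toy_route_and_classify("Quarterly sales spreadsheet"))
```

Keeping the type-to-specialist mapping in one dictionary means a new document type needs one new entry rather than another branch.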

## Australian Law Reference

- Evidence Act 1995 (Cth) ss 118-119
- *AWB Ltd v Cole* (2006) - in-house counsel considerations

## Step 1: Setup

Import libraries and create the OpenAI client.

**What this does:**
- `from openai import OpenAI`: loads the OpenAI library
- `from IPython.display import display, Markdown`: for formatted output
- `client = OpenAI()`: creates the OpenAI client, which reads your API key from the `OPENAI_API_KEY` environment variable (a key kept in `.env` has to be loaded into the environment first)
- `MODEL = "gpt-4.1-nano"`: sets which model to use (cheap and fast for testing)

In [None]:
from openai import OpenAI
from IPython.display import display, Markdown

client = OpenAI()
MODEL = "gpt-4.1-nano"
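
Because `OpenAI()` looks for the key in the `OPENAI_API_KEY` environment variable, a quick check before creating the client can save a confusing error later. The sketch below uses only the standard library; `api_key_available` is a hypothetical helper, not part of the OpenAI package.

```python
import os

def api_key_available(env=None):
    """Return True if OPENAI_API_KEY is set to a non-empty value."""
    # `env` defaults to the real process environment; a plain dict can be
    # passed instead, which makes the check easy to try out.
    env = os.environ if env is None else env
    return bool(env.get("OPENAI_API_KEY", "").strip())

if not api_key_available():
    print("OPENAI_API_KEY is not set; set it before creating the client.")
```

If the key lives in a `.env` file, one common approach is to load that file into the environment first, for example with the third-party `python-dotenv` package (assumed to be installed separately).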

## Step 2: Create Test Documents

Three different document types to demonstrate routing.

**What this does:**
- Creates an email (client to lawyer)
- Creates a board minute (internal corporate record)
- Creates a file note (lawyer's attendance memo)
- Each type requires different privilege analysis

In [None]:
# Document 1: Email - client seeking legal advice
doc_email = {
    "id": "DOC001",
    "type": "unknown",  # Router will determine
    "content": """
From: sarah.chen@acmecorp.com.au
To: michael.wong@wongpartners.com.au
Date: 2024-03-15
Subject: Urgent - Contract dispute advice needed

Hi Michael,

We've received a letter from BuildRight Pty Ltd claiming we breached the construction contract.
They're claiming $2.3 million in damages.

Can you please advise on our potential liability exposure?

Regards,
Sarah Chen
General Counsel
ACME Corporation Pty Ltd
"""
}

# Document 2: Board minute - internal corporate record
doc_board = {
    "id": "DOC002",
    "type": "unknown",
    "content": """
ACME CORPORATION PTY LTD
BOARD OF DIRECTORS MEETING
Minutes - 10 March 2024

PRESENT: J. Smith (Chair), A. Jones, B. Lee, S. Chen (General Counsel)

ITEM 4: LEGAL UPDATE - BUILDRIGHT DISPUTE (CONFIDENTIAL)

The General Counsel provided a confidential legal briefing on the BuildRight 
contract dispute. The Board received legal advice on the company's exposure 
and litigation strategy. 

RESOLVED: That the legal advice be noted and the recommended strategy be adopted.
This item was discussed in confidence for the purpose of obtaining legal advice.

ITEM 5: MARKETING BUDGET 2024

The CFO presented the proposed marketing budget of $1.2 million.
RESOLVED: That the marketing budget be approved.
"""
}

# Document 3: File note - lawyer's record
doc_filenote = {
    "id": "DOC003",
    "type": "unknown",
    "content": """
FILE NOTE
PRIVILEGED AND CONFIDENTIAL

Client: ACME Corporation Pty Ltd
Matter: BuildRight Dispute
Author: Michael Wong, Partner
Date: 16 March 2024

ATTENDANCE ON CLIENT

Attended upon Sarah Chen (General Counsel) by telephone.

Discussed:
1. Receipt of letter of demand from BuildRight
2. Reviewed contract clause 14.3 - limitation of liability
3. Advised that the limitation clause may cap exposure at $500,000
4. Recommended obtaining expert quantity surveyor report

ACTION: Brief counsel on prospects of defending claim.

Time: 0.8 hours
"""
}

documents = [doc_email, doc_board, doc_filenote]
print(f"Created {len(documents)} test documents for routing")

## Step 3: The Router - Identify Document Type

The router examines each document and determines its type.

**What this does:**
- Analyses the document structure and content
- Classifies as: EMAIL, BOARD_MINUTE, FILE_NOTE, or OTHER
- Returns the document type so we can route to the correct specialist classifier

In [None]:
def router(document):
    """Determine document type for routing to specialist classifier"""
    
    messages = [
        {"role": "system", "content": "You are a legal document classifier. Identify document types accurately."},
        {"role": "user", "content": f"""
Analyse this document and determine its type.

Document:
{document['content']}

Classify as ONE of:
- EMAIL: Electronic correspondence with From/To/Subject fields
- BOARD_MINUTE: Corporate board meeting minutes with resolutions
- FILE_NOTE: Lawyer's attendance memo or file note
- OTHER: Does not fit above categories

Respond in this format:
DOCUMENT_TYPE: [type]
CONFIDENCE: [High/Medium/Low]
INDICATORS: [what features identified this type]
"""}
    ]
    
    response = client.chat.completions.create(
        model=MODEL,
        messages=messages
    )
    
    return response.choices[0].message.content

# Test the router on all documents
for doc in documents:
    result = router(doc)
    display(Markdown(f"### Document {doc['id']}\n\n{result}\n\n---"))
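
The router returns free text, and a model may not follow the requested format exactly (extra brackets, lower case, Markdown emphasis). A possible way to normalise the reply before acting on it is sketched below; `parse_router_output` and `VALID_TYPES` are hypothetical names introduced for illustration.

```python
import re

VALID_TYPES = {"EMAIL", "BOARD_MINUTE", "FILE_NOTE", "OTHER"}

def parse_router_output(text):
    """Pull DOCUMENT_TYPE and CONFIDENCE out of the router's free-text reply."""
    fields = {}
    for line in text.splitlines():
        # Tolerate leading whitespace and emphasis such as **DOCUMENT_TYPE:**
        cleaned = line.strip().replace("*", "")
        match = re.match(r"(DOCUMENT_TYPE|CONFIDENCE)\s*:\s*(.+)", cleaned, re.IGNORECASE)
        if match:
            key = match.group(1).upper()
            value = match.group(2).strip().strip("[]").strip()
            fields.setdefault(key, value)  # keep the first occurrence only
    doc_type = fields.get("DOCUMENT_TYPE", "OTHER").upper().replace(" ", "_")
    if doc_type not in VALID_TYPES:
        doc_type = "OTHER"  # anything unexpected falls back to the catch-all
    confidence = fields.get("CONFIDENCE", "Unknown").capitalize()
    return {"doc_type": doc_type, "confidence": confidence}

print(parse_router_output("**DOCUMENT_TYPE:** [Board Minute]\nCONFIDENCE: high"))
```

Falling back to `OTHER` for any unrecognised value keeps a malformed reply from being routed to a specialist by accident.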

## Step 4: Specialist Classifier - Email

A classifier tuned specifically for email privilege analysis.

**What this does:**
- Analyses email-specific fields: From, To, CC, Subject
- Checks for lawyer involvement via email domains and titles
- Looks for advice-seeking language in the body
- Tailored prompts for email structure

In [None]:
def classifier_email(document):
    """Specialist privilege classifier for emails"""
    
    messages = [
        {"role": "system", "content": """You are an Australian legal privilege expert specialising in email communications.
Apply Evidence Act 1995 (Cth) ss 118-119 and the dominant purpose test from Esso v Commissioner of Taxation (1999)."""},
        {"role": "user", "content": f"""
Analyse this EMAIL for legal professional privilege.

{document['content']}

For emails, consider:
1. Are the sender/recipient lawyers? (Check domains like law firms, titles like Partner, Counsel)
2. Is legal advice being sought or given in the body?
3. Are any non-privileged third parties CC'd?
4. Is there a confidentiality notice?

Respond in this format:
CLASSIFICATION: [PRIVILEGED/NOT_PRIVILEGED/UNCERTAIN]
CONFIDENCE_SCORE: [0-100]
LAWYER_IDENTIFIED: [name and role, or "None"]
DOMINANT_PURPOSE: [Legal advice/Business operational/Mixed]
WAIVER_RISK: [None/Low/Medium/High]
LEGAL_BASIS: [relevant statute or case]
REASONING: [2-3 sentences explaining the decision]
"""}
    ]
    
    response = client.chat.completions.create(
        model=MODEL,
        messages=messages
    )
    
    return response.choices[0].message.content

# Test on email document
email_result = classifier_email(doc_email)
display(Markdown(f"### Email Classifier Result\n\n{email_result}"))
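
The party analysis described above (From, To, CC and their domains) can also be done deterministically before or alongside the LLM call. The following sketch uses Python's standard `email` package and assumes the document text begins with RFC 822-style headers, as the test email does; `email_parties` is a hypothetical helper.

```python
from email import message_from_string
from email.utils import getaddresses

def email_parties(raw_text):
    """Return sender/recipient addresses and their domains from the headers."""
    # lstrip() matters: a leading blank line would make the parser treat
    # everything as body text and find no headers at all.
    msg = message_from_string(raw_text.lstrip())
    parties = {}
    for header in ("From", "To", "Cc"):
        addresses = [addr for _, addr in getaddresses(msg.get_all(header, []))]
        parties[header.lower()] = addresses
    domains = {addr.split("@", 1)[1].lower()
               for addrs in parties.values() for addr in addrs if "@" in addr}
    parties["domains"] = sorted(domains)
    return parties

sample = """
From: sarah.chen@acmecorp.com.au
To: michael.wong@wongpartners.com.au
Subject: Urgent - Contract dispute advice needed

Hi Michael,
"""
print(email_parties(sample))
```

A list of domains like this could be compared against a known list of law-firm domains, though a domain alone does not establish that a party is acting as a lawyer.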

## Step 5: Specialist Classifier - Board Minutes

A classifier tuned specifically for board minute privilege analysis.

**What this does:**
- Looks for agenda items marked "legal" or "confidential"
- Checks if legal counsel was present and provided advice
- Distinguishes privileged items from general business items
- Board minutes often contain mixed content - some items privileged, others not

In [None]:
def classifier_board_minute(document):
    """Specialist privilege classifier for board minutes"""
    
    messages = [
        {"role": "system", "content": """You are an Australian legal privilege expert specialising in corporate board minutes.
Apply Evidence Act 1995 (Cth) ss 118-119. Note that board minutes may contain PARTIAL privilege - some items privileged, others not."""},
        {"role": "user", "content": f"""
Analyse these BOARD MINUTES for legal professional privilege.

{document['content']}

For board minutes, consider:
1. Was legal counsel present at the meeting?
2. Are there specific agenda items involving legal advice?
3. Are those items marked confidential or for legal advice purposes?
4. Are there non-privileged business items that should be separated?

Respond in this format:
CLASSIFICATION: [PRIVILEGED/NOT_PRIVILEGED/PARTIAL_PRIVILEGE/UNCERTAIN]
CONFIDENCE_SCORE: [0-100]
LEGAL_COUNSEL_PRESENT: [Yes/No - name if identified]
PRIVILEGED_ITEMS: [list item numbers that are privileged]
NON_PRIVILEGED_ITEMS: [list item numbers that are NOT privileged]
LEGAL_BASIS: [relevant statute or case]
REASONING: [2-3 sentences explaining the decision]
REDACTION_RECOMMENDATION: [if partial, what should be redacted vs released]
"""}
    ]
    
    response = client.chat.completions.create(
        model=MODEL,
        messages=messages
    )
    
    return response.choices[0].message.content

# Test on board minute document
board_result = classifier_board_minute(doc_board)
display(Markdown(f"### Board Minute Classifier Result\n\n{board_result}"))
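
Since board minutes can mix privileged and non-privileged items, it may help to split them into agenda items before analysis. The sketch below assumes headings of the form `ITEM n: TITLE`, as in the test document; other minute formats would need a different pattern. `split_agenda_items` is a hypothetical helper.

```python
import re

ITEM_HEADING = re.compile(r"^ITEM\s+(\d+):\s*(.+)$", re.MULTILINE)

def split_agenda_items(minutes_text):
    """Split minutes into items keyed by their 'ITEM n: TITLE' headings."""
    matches = list(ITEM_HEADING.finditer(minutes_text))
    items = []
    for i, match in enumerate(matches):
        # An item's body runs from its heading to the next heading (or the end).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(minutes_text)
        title = match.group(2).strip()
        body = minutes_text[match.end():end].strip()
        items.append({
            "number": int(match.group(1)),
            "title": title,
            "body": body,
            # Crude flag only: the word "legal" is a cue for review, not a finding.
            "mentions_legal": "legal" in (title + " " + body).lower(),
        })
    return items

sample_minutes = """ITEM 4: LEGAL UPDATE - BUILDRIGHT DISPUTE (CONFIDENTIAL)

The Board received legal advice on the company's exposure.

ITEM 5: MARKETING BUDGET 2024

The CFO presented the proposed marketing budget.
"""
for item in split_agenda_items(sample_minutes):
    print(item["number"], item["title"], item["mentions_legal"])
```

Each item could then be sent to the classifier separately, which lines up with the item-level redaction recommendation the prompt asks for.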

## Step 6: Specialist Classifier - File Notes

A classifier tuned specifically for lawyer file notes and attendance memos.

**What this does:**
- Checks if the author is a lawyer
- Looks for privilege markers ("Privileged and Confidential")
- Identifies if legal advice was given or recorded
- File notes created by lawyers for the purpose of recording legal advice are typically privileged

In [None]:
def classifier_file_note(document):
    """Specialist privilege classifier for lawyer file notes"""
    
    messages = [
        {"role": "system", "content": """You are an Australian legal privilege expert specialising in lawyer file notes and attendance memos.
Apply Evidence Act 1995 (Cth) ss 118-119. File notes created by lawyers recording legal advice are typically privileged."""},
        {"role": "user", "content": f"""
Analyse this FILE NOTE for legal professional privilege.

{document['content']}

For file notes, consider:
1. Is the author a lawyer? (Check for titles: Partner, Solicitor, Counsel)
2. Is it marked "Privileged" or "Confidential"?
3. Does it record legal advice given to a client?
4. Was it created for the dominant purpose of legal advice?

Respond in this format:
CLASSIFICATION: [PRIVILEGED/NOT_PRIVILEGED/UNCERTAIN]
CONFIDENCE_SCORE: [0-100]
AUTHOR_IS_LAWYER: [Yes/No - name and title if identified]
PRIVILEGE_MARKING: [Yes/No - quote if present]
RECORDS_LEGAL_ADVICE: [Yes/No]
LEGAL_BASIS: [relevant statute or case]
REASONING: [2-3 sentences explaining the decision]
"""}
    ]
    
    response = client.chat.completions.create(
        model=MODEL,
        messages=messages
    )
    
    return response.choices[0].message.content

# Test on file note document
filenote_result = classifier_file_note(doc_filenote)
display(Markdown(f"### File Note Classifier Result\n\n{filenote_result}"))
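
The author and marking checks listed above can be pre-computed from the note's header lines. This is a rough sketch that assumes `Label: value` header lines like those in the test file note; `file_note_fields` is a hypothetical helper and the comma-separated title is an assumption about how authors are recorded.

```python
import re

def file_note_fields(note_text):
    """Extract 'Label: value' header lines and check for a privilege marking."""
    fields = {}
    for line in note_text.splitlines():
        match = re.match(r"^(Client|Matter|Author|Date):\s*(.+)$", line.strip())
        if match:
            fields[match.group(1).lower()] = match.group(2).strip()
    # A marking is a cue only; it does not by itself make a document privileged.
    fields["privilege_marking"] = bool(
        re.search(r"\bprivileged\b", note_text, re.IGNORECASE)
    )
    # Assumption: a title after the comma hints at the author's role.
    author = fields.get("author", "")
    fields["author_title"] = author.split(",", 1)[1].strip() if "," in author else ""
    return fields

sample_note = """FILE NOTE
PRIVILEGED AND CONFIDENTIAL

Client: ACME Corporation Pty Ltd
Author: Michael Wong, Partner
Date: 16 March 2024
"""
print(file_note_fields(sample_note))
```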

## Step 7: Complete Routing Pipeline

Combine the router and specialist classifiers into a single function.

**What this does:**
- Takes any document
- Routes it to the correct specialist classifier based on type
- Returns the appropriate privilege analysis
- Demonstrates the full routing pattern in action

In [None]:
def route_and_classify(document):
    """Complete routing pipeline: identify type, route to specialist"""
    
    # Step 1: Route - determine document type
    route_result = router(document)
    
    # Extract document type from router result
    doc_type = "OTHER"
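    for line in route_result.split('\n'):
        # Tolerate leading whitespace, brackets and lower case in the model's reply
        if line.strip().upper().startswith("DOCUMENT_TYPE:"):
            doc_type = line.split(':', 1)[1].strip().strip('[]').upper()
            break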
    
    # Step 2: Route to specialist classifier
    if doc_type == "EMAIL":
        classification = classifier_email(document)
        classifier_used = "Email Specialist"
    elif doc_type == "BOARD_MINUTE":
        classification = classifier_board_minute(document)
        classifier_used = "Board Minute Specialist"
    elif doc_type == "FILE_NOTE":
        classification = classifier_file_note(document)
        classifier_used = "File Note Specialist"
    else:
        classification = "Document type not recognised - requires manual review"
        classifier_used = "None - Manual Review Required"
    
    return {
        "doc_id": document['id'],
        "route_result": route_result,
        "doc_type": doc_type,
        "classifier_used": classifier_used,
        "classification": classification
    }

# Run the complete pipeline on all documents
display(Markdown("# Complete Routing Pipeline Results\n"))

for doc in documents:
    result = route_and_classify(doc)
    display(Markdown(f"""
## Document: {result['doc_id']}

**Routed as:** {result['doc_type']}  
**Classifier used:** {result['classifier_used']}

### Classification Result:
{result['classification']}

---
"""))

## Step 8: Export to CSV for Senior Lawyer Review

Create a CSV output for human-in-the-loop (HITL) review.

**What this does:**
- Runs all documents through the routing pipeline again (a second round of API calls, so results may differ slightly from Step 7)
- Extracts key fields from each classification
- Creates a structured CSV with routing information included
- Adds blank columns for senior lawyer review and sign-off
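
One optional extension to the review columns is a priority flag derived from the model's own confidence score, so reviewers can start with the least certain documents. The sketch below is illustrative: `parse_confidence` and `review_priority` are hypothetical helpers, and the threshold of 70 is an assumed value, not one taken from this notebook.

```python
import re

def parse_confidence(raw):
    """Convert strings such as '85', '85%' or '[85]' to an int in 0-100, else None."""
    match = re.search(r"\d+", str(raw))
    if not match:
        return None
    value = int(match.group())
    return value if 0 <= value <= 100 else None

def review_priority(classification, raw_score, threshold=70):
    """Flag rows a reviewer may want to look at first (threshold is assumed)."""
    score = parse_confidence(raw_score)
    # Missing scores and uncertain or unparsed classifications go to the top.
    if score is None or classification.strip().upper() in {"UNCERTAIN", "NOT FOUND"}:
        return "HIGH"
    return "HIGH" if score < threshold else "NORMAL"

print(review_priority("PRIVILEGED", "92"))
print(review_priority("UNCERTAIN", "95"))
print(review_priority("PRIVILEGED", "Not found"))
```

A model's self-reported confidence is not a calibrated probability, so a flag like this is better treated as a way to order the queue than as a reason to skip review.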

In [None]:
import pandas as pd
from datetime import datetime

def parse_result(result_text, field):
    """Extract a field value from the LLM output"""
    for line in result_text.split('\n'):
        line = line.strip()  # tolerate indented lines in the model's reply
        if line.startswith(field + ':'):
            return line.split(':', 1)[1].strip()
    return "Not found"

# Build CSV rows for all documents
csv_rows = []

for doc in documents:
    result = route_and_classify(doc)
    
    row = {
        "doc_id": result['doc_id'],
        "doc_type": result['doc_type'],
        "classifier_used": result['classifier_used'],
        "classification": parse_result(result['classification'], "CLASSIFICATION"),
        "confidence_score": parse_result(result['classification'], "CONFIDENCE_SCORE"),
        "legal_basis": parse_result(result['classification'], "LEGAL_BASIS"),
        "reasoning": parse_result(result['classification'], "REASONING"),
        # Blank columns for senior lawyer HITL review
        "reviewer_notes": "",
        "reviewer_decision": "",
        "reviewed_by": "",
        "review_date": ""
    }
    csv_rows.append(row)

# Create DataFrame and export
df = pd.DataFrame(csv_rows)
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
csv_filename = f"privilege_review_routing_{timestamp}.csv"

# Display preview
display(Markdown("### CSV Preview for HITL Review"))
display(df[['doc_id', 'doc_type', 'classifier_used', 'classification', 'confidence_score']])

# Save to file
df.to_csv(csv_filename, index=False)
display(Markdown(f"**Exported:** `{csv_filename}`"))
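
The export cell calls `route_and_classify` again for every document, which repeats the API calls made in Step 7. One way to avoid that is to run the pipeline once, keep the results keyed by document id, and reuse them for both display and export. The sketch below uses a stub in place of the real pipeline so it runs without an API key; `run_pipeline_once` and `stub_pipeline` are hypothetical names.

```python
def run_pipeline_once(documents, pipeline):
    """Run `pipeline` on each document once and index the results by document id."""
    results = {}
    for doc in documents:
        if doc["id"] not in results:  # skip repeats rather than calling the model again
            results[doc["id"]] = pipeline(doc)
    return results

# Stub pipeline so the sketch runs offline; it records which ids were processed.
calls = []
def stub_pipeline(doc):
    calls.append(doc["id"])
    return {"doc_id": doc["id"], "doc_type": "OTHER"}

docs = [{"id": "DOC001"}, {"id": "DOC002"}, {"id": "DOC001"}]
cached = run_pipeline_once(docs, stub_pipeline)
print(len(calls), sorted(cached))
```

In the notebook, passing `route_and_classify` as `pipeline` would give a dictionary whose values could feed both the Step 7 display and the CSV rows.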

## Conclusion: Routing for LPP Classification

### What We Built

A document routing system that directs documents to specialist classifiers:
```
Document → Router → What type?
                ↓
    ┌───────────┼───────────┐
    ↓           ↓           ↓
  EMAIL    BOARD_MINUTE  FILE_NOTE
Classifier  Classifier   Classifier
    ↓           ↓           ↓
  Result     Result      Result
```

### Why Routing Works for Privilege

- **Specialist knowledge:** Each classifier is tuned for its document type
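- **Tailored questions:** Each prompt asks what matters for its type (parties for emails, agenda items for board minutes, authorship for file notes)
- **Partial privilege:** The board minute classifier is prompted to separate privileged items from ordinary business items, a distinction a generic classifier might miss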
- **Extensible:** Easy to add new document types (contracts, briefs, letters)

### Comparison to Prompt Chaining

| Aspect | Prompt Chaining | Routing |
|--------|-----------------|---------|
| Structure | Fixed sequential steps | Branch by document type |
| Best for | Single document type | Mixed document sets |
| Flexibility | Same analysis for all | Tailored per type |
| Complexity | Simpler | Requires multiple classifiers |

### Limitations

**Router accuracy is critical**
- If the router misclassifies a document, it goes to the wrong specialist
- A misrouted file note analysed as an email would miss key privilege indicators
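
The router already reports a `CONFIDENCE` value that the pipeline does not use. One possible mitigation, sketched below, is to send a document to a specialist only when the router reports High or Medium confidence and to fall back to manual review otherwise. `choose_route` is a hypothetical helper and the cut-off is an assumption.

```python
def choose_route(doc_type, confidence,
                 known_types=("EMAIL", "BOARD_MINUTE", "FILE_NOTE")):
    """Send a document to a specialist only when the router is reasonably sure."""
    if doc_type not in known_types:
        return "MANUAL_REVIEW"
    # Assumed cut-off: anything other than High or Medium goes to a human.
    if confidence.strip().lower() not in {"high", "medium"}:
        return "MANUAL_REVIEW"
    return doc_type

print(choose_route("EMAIL", "High"))
print(choose_route("FILE_NOTE", "Low"))
print(choose_route("OTHER", "High"))
```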

**Maintaining multiple classifiers**
- Each classifier needs to be updated if the law changes
- More code to maintain than a single classifier

**Unknown document types**
- Documents that don't fit categories need a fallback or manual review
- The "OTHER" category is a catch-all that may need human intervention

### Next Notebook

`03_parallelization.ipynb` - Run multiple models in parallel and flag disagreements for consensus.