# Week 11 - Banking & Compliance Evaluation
### BenchRight LLM Evaluation Master Program (18 Weeks)

---

## 🎯 Learning Objectives

By the end of this notebook, you will:

1. Understand the unique challenges of evaluating LLMs for compliance summarization
2. Use a ComplianceJudge to evaluate regulatory summaries
3. Score summaries on both correctness and completeness
4. Detect and track omitted critical details
5. Analyze patterns in summarization failures

---

## 🧠 Why Compliance Summarization is Different

### The Stakes

Unlike general summarization, compliance summarization has **high-stakes failure modes**:

| Failure Type | General Summary | Compliance Summary |
|--------------|-----------------|--------------------|
| Missing detail | Reader misses context | Potential violation |
| Inaccurate number | Minor confusion | Wrong threshold applied |
| Softened language | Tone shift | Mandatory becomes optional |

### What We Evaluate

1. **Correctness:** Is the summary accurate?
2. **Completeness:** Are all critical details included?
3. **Omissions:** What specific details were left out?
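
Some of these checks can be approximated deterministically before any LLM judge is involved. The sketch below is a hypothetical pre-check, not part of the course code: the name `precheck_summary` and the regular expressions are assumptions made for illustration. It flags dollar amounts and durations that appear in the regulatory text but not in the summary, and notes when mandatory wording ("must", "shall") has been replaced by advisory wording.

```python
import re
from typing import Dict, List, Union

# Hypothetical patterns for this sketch; real regulations need a richer set.
AMOUNT_RE = re.compile(r"\$\d+(?:,\d{3})*(?:\.\d+)?")
DURATION_RE = re.compile(
    r"\b(?:\d+|one|two|three|four|five|six|seven|eight|nine|ten)\s+"
    r"(?:calendar\s+|business\s+)?(?:hours?|days?|months?|years?)\b",
    re.IGNORECASE,
)
MANDATORY_RE = re.compile(r"\b(?:must|shall)\b", re.IGNORECASE)
ADVISORY_RE = re.compile(r"\b(?:should|may|can|could)\b", re.IGNORECASE)


def precheck_summary(regulatory_text: str, summary: str) -> Dict[str, Union[List[str], bool]]:
    """Flag figures missing from the summary and mandatory wording that was softened."""
    summary_lower = summary.lower()
    figures = AMOUNT_RE.findall(regulatory_text) + DURATION_RE.findall(regulatory_text)
    # Exact, case-insensitive match: "30 calendar days" is NOT satisfied by "30 days".
    missing = [figure for figure in figures if figure.lower() not in summary_lower]
    softened = (
        MANDATORY_RE.search(regulatory_text) is not None
        and MANDATORY_RE.search(summary) is None
        and ADVISORY_RE.search(summary) is not None
    )
    return {"missing_figures": missing, "softened_language": softened}


source = (
    "Financial institutions must verify the identity of each customer opening an account. "
    "The institution must maintain records of the verification process for at least "
    "five years after the account is closed. Failure to comply may result in penalties "
    "up to $500,000 per violation."
)
good_summary = (
    "Banks must verify customer identity when opening accounts. Records must be kept "
    "five years after account closure. Penalty up to $500,000 per violation."
)
poor_summary = (
    "Banks should verify customer identity when opening accounts. "
    "Records should be kept for documentation purposes."
)

print(precheck_summary(source, good_summary))
print(precheck_summary(source, poor_summary))
```

A check like this only catches literal omissions. It cannot tell whether a figure that is present has been attached to the wrong requirement, which is why a judge with domain knowledge is still needed.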

---

## 🛠️ Step 1: Setup & Dependencies

In [None]:
# Standard library imports
import sys
import json
from typing import Dict, List, Any

# Make modules in the current directory importable (e.g. when running in Colab)
sys.path.insert(0, '.')

# For data display
try:
    from IPython.display import display, HTML
except ImportError:
    display = print

print("✅ Setup complete!")

---

## 📋 Step 2: Define the Compliance Judge System Prompt

In [None]:
COMPLIANCE_JUDGE_PROMPT = """You are an expert compliance officer evaluating regulatory summaries.

Your task is to evaluate a summary of a regulatory text based on two criteria:

1. **Correctness (0.0-1.0):** Does the summary accurately represent the regulatory requirements?
   - 1.0: Fully accurate, no misrepresentations
   - 0.7-0.9: Minor inaccuracies that don't affect compliance
   - 0.4-0.6: Significant inaccuracies that could mislead compliance decisions
   - 0.0-0.3: Major errors or misrepresentations

2. **Completeness (0.0-1.0):** Are all critical details included?
   - 1.0: All critical details present (amounts, deadlines, penalties, requirements)
   - 0.7-0.9: Most critical details present, minor omissions
   - 0.4-0.6: Some critical details missing
   - 0.0-0.3: Major critical details missing

CRITICAL DETAILS to check for:
- Monetary thresholds and amounts
- Time limits and deadlines
- Penalty amounts and consequences
- Specific requirements (documents, actions, retention periods)
- Key terms ("must", "shall", prohibited actions)

Respond ONLY with a valid JSON object:
{
    "correctness_score": <float 0.0-1.0>,
    "completeness_score": <float 0.0-1.0>,
    "omitted_details": ["<list of critical details missing from summary>"],
    "rationale": "<brief explanation of scores>"
}
"""

print("📋 Compliance Judge System Prompt defined!")
print(f"\n📝 Prompt length: {len(COMPLIANCE_JUDGE_PROMPT)} characters")
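
The prompt asks the judge for a specific JSON shape, but a model may not always follow it. As a minimal sketch, the hypothetical helper `validate_judge_payload` below (a name introduced here, not part of the course code) checks a parsed response against the schema the prompt describes: the four keys, scores in the range 0.0 to 1.0, and a list of strings for the omissions.

```python
from typing import Any, List

REQUIRED_KEYS = ("correctness_score", "completeness_score", "omitted_details", "rationale")
SCORE_KEYS = ("correctness_score", "completeness_score")


def validate_judge_payload(payload: Any) -> List[str]:
    """Return a list of schema problems; an empty list means the payload is usable."""
    if not isinstance(payload, dict):
        return ["payload is not a JSON object"]

    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in payload]

    for key in SCORE_KEYS:
        if key not in payload:
            continue
        value = payload[key]
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            problems.append(f"{key} is not a number")
        elif not 0.0 <= value <= 1.0:
            problems.append(f"{key} out of range: {value}")

    if "omitted_details" in payload:
        omitted = payload["omitted_details"]
        if not (isinstance(omitted, list) and all(isinstance(item, str) for item in omitted)):
            problems.append("omitted_details is not a list of strings")

    if "rationale" in payload and not isinstance(payload["rationale"], str):
        problems.append("rationale is not a string")

    return problems


valid = {
    "correctness_score": 0.95,
    "completeness_score": 0.9,
    "omitted_details": [],
    "rationale": "Accurate and complete.",
}
invalid = {
    "correctness_score": 1.4,
    "completeness_score": "high",
    "omitted_details": "none",
}

print(validate_judge_payload(valid))
print(validate_judge_payload(invalid))
```

Returning a list of problems rather than raising on the first one makes it easier to log everything that went wrong with a single judge response.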

---

## 🏛️ Step 3: Implement the ComplianceJudge Class

In [None]:
class ComplianceJudge:
    """
    LLM-as-Judge for evaluating regulatory and compliance summaries.
    
    This specialized judge evaluates:
    - Correctness: Accuracy of the summary
    - Completeness: Whether critical details are included
    - Omissions: Specific details that were left out
    """
    
    def __init__(self, client, model: str = "gpt-4o-mini"):
        """
        Initialize the ComplianceJudge.
        
        Args:
            client: An LLM client with chat.completions.create() method
            model: Model to use for judging (default: gpt-4o-mini)
        """
        self.client = client
        self.model = model
        self.system_prompt = COMPLIANCE_JUDGE_PROMPT
    
    def evaluate_summary(
        self,
        regulatory_text: str,
        summary: str,
    ) -> Dict[str, Any]:
        """
        Evaluate a regulatory summary.
        
        Args:
            regulatory_text: The original regulatory text
            summary: The model-generated summary
            
        Returns:
            Dictionary with correctness_score, completeness_score, 
            omitted_details, and rationale
        """
        user_prompt = f"""Regulatory Text:
{regulatory_text}

Summary to Evaluate:
{summary}

Evaluate the summary for correctness and completeness."""

        try:
            response = self.client.chat.completions.create(
                model=self.model,
                messages=[
                    {"role": "system", "content": self.system_prompt},
                    {"role": "user", "content": user_prompt},
                ],
                temperature=0.0,
            )
            
            result_text = response.choices[0].message.content
            result = json.loads(result_text)
            
            return {
                "correctness_score": float(result.get("correctness_score", 0.0)),
                "completeness_score": float(result.get("completeness_score", 0.0)),
                "omitted_details": result.get("omitted_details", []),
                "rationale": result.get("rationale", ""),
            }
            
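        # ValueError covers json.JSONDecodeError and non-numeric scores passed to float();
        # AttributeError covers a JSON response that is not an object (no .get()).
        except (ValueError, KeyError, TypeError, AttributeError) as e: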
            return {
                "correctness_score": 0.0,
                "completeness_score": 0.0,
                "omitted_details": ["Error parsing judge response"],
                "rationale": f"Error: {str(e)}",
            }
    
    def compute_combined_score(
        self,
        correctness_score: float,
        completeness_score: float,
        correctness_weight: float = 0.5,
    ) -> float:
        """
        Compute a weighted combined score.
        
        Args:
            correctness_score: Score for accuracy (0-1)
            completeness_score: Score for completeness (0-1)
            correctness_weight: Weight for correctness (completeness = 1 - this)
            
        Returns:
            Weighted combined score
        """
        completeness_weight = 1.0 - correctness_weight
        return (correctness_score * correctness_weight + 
                completeness_score * completeness_weight)


print("✅ ComplianceJudge class defined!")
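
A real model may wrap its JSON in a Markdown code fence or add a sentence before it, and a direct `json.loads` call fails on both. The following is a hedged sketch of a more tolerant parser; `extract_json_object` is a hypothetical helper, not part of the class above. It tries the raw text first, then the contents of a fenced block, then the span from the first `{` to the last `}`.

```python
import json
import re
from typing import Any, Dict, Optional

# Build the fence marker programmatically so this example can itself sit inside a fenced block.
FENCE = "`" * 3
FENCE_RE = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL | re.IGNORECASE)


def extract_json_object(text: str) -> Optional[Dict[str, Any]]:
    """Return the first JSON object found in `text`, or None if there is none."""
    candidates = [text]

    fenced = FENCE_RE.search(text)
    if fenced:
        candidates.append(fenced.group(1))

    start, end = text.find("{"), text.rfind("}")
    if start != -1 and end > start:
        candidates.append(text[start:end + 1])

    for candidate in candidates:
        try:
            parsed = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict):
            return parsed
    return None


plain = '{"correctness_score": 0.9}'
fenced_reply = FENCE + 'json\n{"correctness_score": 0.9}\n' + FENCE
chatty_reply = 'Here is my evaluation: {"completeness_score": 0.4} Hope that helps.'

print(extract_json_object(plain))
print(extract_json_object(fenced_reply))
print(extract_json_object(chatty_reply))
print(extract_json_object("no json here"))
```

Returning `None` instead of a zero score lets the caller distinguish "the judge could not be parsed" from "the judge scored the summary 0.0", which matters when averaging results.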

---

## 🤖 Step 4: Create a Mock LLM Client for Demonstration

In [None]:
class MockComplianceJudgeClient:
    """
    Mock client that simulates LLM responses for compliance evaluation.
    
    This allows demonstration without an API key.
    """
    
    class MockChat:
        class MockCompletions:
            def create(self, model: str, messages: List[Dict], temperature: float = 0.0):
                """Simulate a compliance evaluation response."""
                user_msg = messages[-1]["content"].lower()

                # Isolate the summary under evaluation: the text between the
                # "Summary to Evaluate:" header and the closing instruction.
                # Matching on the summary alone keeps figures quoted in the
                # regulatory text from selecting a branch.
                after_header = user_msg.partition("summary to evaluate:")[2]
                summary = after_header.partition("evaluate the summary")[0].strip()

                # Determine response based on summary quality indicators.
                # The factual-error check runs before the softened-language check,
                # so a summary that says "should" AND misstates figures is scored
                # as inaccurate rather than merely softened.
                if all(marker in summary for marker in ("$500,000", "five years", "penalty")):
                    # Good summary with critical details
                    response = {
                        "correctness_score": 0.95,
                        "completeness_score": 0.90,
                        "omitted_details": [],
                        "rationale": "Summary accurately captures all key requirements including penalties and retention period."
                    }
                elif any(marker in summary for marker in ("60 days", "$10,000", "inform")):
                    # Inaccurate summary
                    response = {
                        "correctness_score": 0.20,
                        "completeness_score": 0.30,
                        "omitted_details": [
                            "Correct deadline (30 days, not 60)",
                            "Correct threshold ($5,000, not $10,000)",
                            "Customer notification prohibition"
                        ],
                        "rationale": "Summary contains multiple factual errors that would lead to compliance violations."
                    }
                elif "should" in summary or "around" in summary:
                    # Summary with softened language or approximations
                    response = {
                        "correctness_score": 0.50,
                        "completeness_score": 0.40,
                        "omitted_details": [
                            "Specific monetary threshold",
                            "Exact time deadline",
                            "Penalty amount",
                            "Mandatory vs optional distinction"
                        ],
                        "rationale": "Summary softens mandatory requirements and omits critical numerical details."
                    }
                elif "documentation" in summary and len(summary) < 200:
                    # Very brief summary missing details
                    response = {
                        "correctness_score": 0.70,
                        "completeness_score": 0.25,
                        "omitted_details": [
                            "Specific time deadline",
                            "Monetary threshold",
                            "Penalty for non-compliance",
                            "Retention period",
                            "Required verification documents"
                        ],
                        "rationale": "Summary is technically accurate but critically incomplete, missing most actionable details."
                    }
                else:
                    # Default moderate response
                    response = {
                        "correctness_score": 0.75,
                        "completeness_score": 0.65,
                        "omitted_details": [
                            "Some specific thresholds or deadlines"
                        ],
                        "rationale": "Summary captures main requirements but may be missing some specific details."
                    }
                
                class MockMessage:
                    content = json.dumps(response)
                
                class MockChoice:
                    message = MockMessage()
                
                class MockResponse:
                    choices = [MockChoice()]
                
                return MockResponse()
        
        completions = MockCompletions()
    
    chat = MockChat()


# Create the mock client
mock_client = MockComplianceJudgeClient()
print("✅ Mock compliance evaluation client created!")
print("   (Replace with real OpenAI client for production use)")
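
For unit tests it can be simpler to stub the client with a fixed response than to encode keyword rules. The sketch below assumes only the interface the judge relies on (`client.chat.completions.create(...)` returning an object with `choices[0].message.content`); `make_stub_client` is a hypothetical helper built on `types.SimpleNamespace` from the standard library.

```python
import json
from types import SimpleNamespace
from typing import Any, Dict


def make_stub_client(payload: Dict[str, Any]) -> SimpleNamespace:
    """Build a client whose chat.completions.create() always returns `payload` as JSON."""

    def create(**kwargs: Any) -> SimpleNamespace:
        # Arguments such as model, messages and temperature are accepted and ignored.
        message = SimpleNamespace(content=json.dumps(payload))
        return SimpleNamespace(choices=[SimpleNamespace(message=message)])

    completions = SimpleNamespace(create=create)
    return SimpleNamespace(chat=SimpleNamespace(completions=completions))


stub = make_stub_client({
    "correctness_score": 0.2,
    "completeness_score": 0.3,
    "omitted_details": ["Customer notification prohibition"],
    "rationale": "Canned response for a test.",
})

response = stub.chat.completions.create(model="any-model", messages=[], temperature=0.0)
print(response.choices[0].message.content)
```

A fixed-response stub makes each test state exactly what the judge "said", so failures point at the parsing or scoring code rather than at the mock's heuristics.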

---

## 🏛️ Step 5: Initialize the ComplianceJudge

In [None]:
# Create the ComplianceJudge with our mock client
judge = ComplianceJudge(
    client=mock_client,
    model="gpt-4o-mini",
)

print("✅ ComplianceJudge initialized!")
print(f"   Model: {judge.model}")

---

## 📊 Step 6: Define Regulatory Test Cases

In [None]:
# Define regulatory texts and summaries of varying quality
test_cases = [
    {
        "name": "KYC - Complete Summary",
        "regulatory_text": """
Financial institutions must verify the identity of each customer opening an account.
Verification requires obtaining the customer's name, date of birth, address, and 
identification number. For individuals, acceptable identification includes a 
government-issued photo ID (passport, driver's license) or two forms of non-photo 
identification. The institution must maintain records of the verification process 
for at least five years after the account is closed. Failure to comply may result 
in penalties up to $500,000 per violation.
        """.strip(),
        "summary": """
Banks must verify customer identity when opening accounts using name, DOB, address, 
and ID number. Acceptable ID: government photo ID or two non-photo IDs. Records 
must be kept five years after account closure. Penalty up to $500,000 per violation.
        """.strip(),
        "expected_quality": "high",
    },
    {
        "name": "KYC - Missing Critical Details",
        "regulatory_text": """
Financial institutions must verify the identity of each customer opening an account.
Verification requires obtaining the customer's name, date of birth, address, and 
identification number. For individuals, acceptable identification includes a 
government-issued photo ID (passport, driver's license) or two forms of non-photo 
identification. The institution must maintain records of the verification process 
for at least five years after the account is closed. Failure to comply may result 
in penalties up to $500,000 per violation.
        """.strip(),
        "summary": """
Banks should verify customer identity when opening accounts. Records should be 
kept for documentation purposes.
        """.strip(),
        "expected_quality": "low",
    },
    {
        "name": "SAR - Inaccurate Summary",
        "regulatory_text": """
Banks must file a Suspicious Activity Report (SAR) within 30 calendar days of 
detecting suspicious activity. Suspicious activity includes transactions over 
$5,000 that appear to involve money laundering. The bank must not notify the 
customer that a SAR has been filed. SAR records must be retained for five years.
        """.strip(),
        "summary": """
Banks should file SARs within 60 days for transactions over $10,000. Customers 
should be informed when a SAR is filed about their account.
        """.strip(),
        "expected_quality": "inaccurate",
    },
    {
        "name": "Data Retention - Moderate Summary",
        "regulatory_text": """
Under the Bank Secrecy Act, financial institutions must retain records of 
transactions over $3,000 for a minimum of five years. Records must include 
the customer's name, account number, transaction amount, and date. Electronic 
records are acceptable if they can be produced within 24 hours upon regulatory 
request. Institutions must designate a compliance officer responsible for 
ensuring retention requirements are met.
        """.strip(),
        "summary": """
Banks must keep transaction records over $3,000 for five years. Records need 
customer name, account number, amount, and date. Electronic records must be 
producible within 24 hours. A compliance officer must be designated.
        """.strip(),
        "expected_quality": "high",
    },
    {
        "name": "AML - Partial Summary",
        "regulatory_text": """
Financial institutions must implement an Anti-Money Laundering (AML) program 
that includes: (1) internal policies and procedures, (2) designation of a 
compliance officer, (3) ongoing employee training, and (4) independent testing. 
The program must be approved by the board of directors. Failure to maintain 
an adequate program may result in penalties up to $1,000,000 per day.
        """.strip(),
        "summary": """
Banks need an AML program with policies, a compliance officer, training, and 
testing. The board must approve it.
        """.strip(),
        "expected_quality": "moderate",
    },
]

print(f"📊 Defined {len(test_cases)} regulatory test cases:")
for i, tc in enumerate(test_cases, 1):
    print(f"   {i}. {tc['name']} (expected: {tc['expected_quality']})")
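
Hand-written summaries of varying quality can be supplemented with controlled degradations of a known-good summary, so that each variant differs in exactly one way. The sketch below is an assumption-laden illustration rather than part of the course code: `soften_modals` and `drop_figure_sentences` are hypothetical helpers that produce a "softened" variant and a variant with every digit-bearing sentence removed.

```python
import re


def soften_modals(summary: str) -> str:
    """Replace mandatory modals with an advisory one (must/shall -> should).

    Note: the replacement is always lowercase, so a sentence-initial "Must"
    would lose its capital letter.
    """
    return re.sub(r"\b(?:must|shall)\b", "should", summary, flags=re.IGNORECASE)


def drop_figure_sentences(summary: str) -> str:
    """Remove every sentence that contains a digit or a dollar sign."""
    sentences = re.split(r"(?<=[.!?])\s+", summary.strip())
    kept = [sentence for sentence in sentences if not re.search(r"[\d$]", sentence)]
    return " ".join(kept)


good_summary = (
    "Banks must keep transaction records over $3,000 for five years. "
    "Records need customer name, account number, amount, and date. "
    "Electronic records must be producible within 24 hours. "
    "A compliance officer must be designated."
)

print(soften_modals(good_summary))
print(drop_figure_sentences(good_summary))
```

Because each variant has a known defect, a judge that fails to lower the correctness score for the softened variant, or the completeness score for the stripped one, has a measurable blind spot.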

---

## 🧪 Step 7: Evaluate All Test Cases

In [None]:
# Evaluate all test cases
print("🔍 Evaluating Regulatory Summaries...")
print("=" * 80)

results = []
for tc in test_cases:
    result = judge.evaluate_summary(
        regulatory_text=tc["regulatory_text"],
        summary=tc["summary"]
    )
    
    combined = judge.compute_combined_score(
        result["correctness_score"],
        result["completeness_score"]
    )
    
    results.append({
        "name": tc["name"],
        "expected_quality": tc["expected_quality"],
        "correctness_score": result["correctness_score"],
        "completeness_score": result["completeness_score"],
        "combined_score": combined,
        "omitted_details": result["omitted_details"],
        "rationale": result["rationale"],
    })
    
    print(f"\n📋 {tc['name']}")
    print("-" * 60)
    print(f"   Correctness:  {result['correctness_score']:.2f}")
    print(f"   Completeness: {result['completeness_score']:.2f}")
    print(f"   Combined:     {combined:.2f}")
    if result["omitted_details"]:
        print(f"   Omissions:    {len(result['omitted_details'])} detail(s)")

print("\n" + "=" * 80)
print("✅ Evaluation complete!")
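
The combined score depends on how correctness and completeness are weighted, and the default of 0.5 is a choice rather than a rule. The standalone sketch below reproduces the weighted mean as a plain function and adds a hypothetical min-based alternative (`conservative_score`, an assumption introduced here) to show how much the aggregate moves for an accurate but very incomplete summary.

```python
def combined_score(correctness: float, completeness: float, correctness_weight: float = 0.5) -> float:
    """Weighted mean of the two scores; completeness gets 1 - correctness_weight."""
    if not 0.0 <= correctness_weight <= 1.0:
        raise ValueError("correctness_weight must be between 0 and 1")
    return correctness * correctness_weight + completeness * (1.0 - correctness_weight)


def conservative_score(correctness: float, completeness: float) -> float:
    """Alternative aggregate: the weaker of the two scores."""
    return min(correctness, completeness)


# Illustrative scores for an accurate but very incomplete summary.
correctness, completeness = 0.70, 0.25

for weight in (0.3, 0.5, 0.8):
    score = combined_score(correctness, completeness, weight)
    print(f"correctness_weight={weight:.1f} -> combined={score:.3f}")

print(f"min-based -> {conservative_score(correctness, completeness):.3f}")
```

With these inputs the weighted mean is 0.385, 0.475, and 0.610 for the three weights, while the min-based aggregate stays at 0.25. A heavy correctness weight can therefore lift a critically incomplete summary above a review threshold, which is worth keeping in mind when choosing the weight.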

---

## 📊 Step 8: Display Results Summary Table

In [None]:
print("📊 Results Summary Table")
print("=" * 90)
print(f"{'Test Case':<35} {'Expected':<12} {'Correct':<10} {'Complete':<10} {'Combined':<10}")
print("-" * 90)

for r in results:
    # Determine status symbol
    if r["combined_score"] >= 0.8:
        status = "✅"
    elif r["combined_score"] >= 0.5:
        status = "⚠️"
    else:
        status = "❌"
    
    print(f"{status} {r['name']:<33} {r['expected_quality']:<12} {r['correctness_score']:<10.2f} {r['completeness_score']:<10.2f} {r['combined_score']:<10.2f}")

print("-" * 90)

# Calculate averages
avg_correctness = sum(r["correctness_score"] for r in results) / len(results)
avg_completeness = sum(r["completeness_score"] for r in results) / len(results)
avg_combined = sum(r["combined_score"] for r in results) / len(results)

print(f"{'AVERAGE':<35} {'':<12} {avg_correctness:<10.2f} {avg_completeness:<10.2f} {avg_combined:<10.2f}")
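
Each test case carries an `expected_quality` label, so the table can also be checked for agreement between what the judge scored and what the author expected. The sketch below is one hypothetical way to do that: the band cut-offs reuse the 0.8 and 0.5 status boundaries from the table above, while the mapping from labels to bands (`EXPECTED_TO_BAND`) and the sample rows are assumptions made for illustration.

```python
from typing import Any, Dict, List

# Assumed mapping from expected-quality labels to score bands.
EXPECTED_TO_BAND: Dict[str, str] = {
    "high": "high",
    "moderate": "moderate",
    "low": "low",
    "inaccurate": "low",
}


def band_for(score: float) -> str:
    """Bucket a combined score using the same cut-offs as the status symbols."""
    if score >= 0.8:
        return "high"
    if score >= 0.5:
        return "moderate"
    return "low"


def find_disagreements(rows: List[Dict[str, Any]]) -> List[str]:
    """Return the names of rows whose score band differs from the expected band."""
    return [
        row["name"]
        for row in rows
        if band_for(row["combined_score"]) != EXPECTED_TO_BAND[row["expected_quality"]]
    ]


# Illustrative rows, not actual notebook output.
sample_rows = [
    {"name": "Complete summary", "expected_quality": "high", "combined_score": 0.93},
    {"name": "Thorough summary scored as moderate", "expected_quality": "high", "combined_score": 0.70},
    {"name": "Wrong figures", "expected_quality": "inaccurate", "combined_score": 0.25},
]

print(find_disagreements(sample_rows))
```

A disagreement does not say which side is wrong. It may point to a weak judge, or to a test case whose expected label needs revisiting.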

---

## 🔍 Step 9: Analyze Omitted Details

In [None]:
print("🔍 Omitted Details Analysis")
print("=" * 80)

# Collect all omitted details
all_omissions = []
for r in results:
    if r["omitted_details"]:
        print(f"\n📋 {r['name']}:")
        for detail in r["omitted_details"]:
            print(f"   ❌ {detail}")
            all_omissions.append(detail)

# Count omission types (simplified categorization)
print("\n" + "=" * 80)
print("📈 Omission Categories")
print("-" * 40)

categories = {
    "monetary": 0,
    "deadline": 0,
    "penalty": 0,
    "requirement": 0,
    "other": 0,
}

for omission in all_omissions:
    omission_lower = omission.lower()
    # Check "penalty" first: otherwise "Penalty amount" matches the keyword
    # "amount" and is counted as monetary.
    if any(word in omission_lower for word in ["penalty", "fine", "violation"]):
        categories["penalty"] += 1
    elif any(word in omission_lower for word in ["threshold", "amount", "$", "monetary"]):
        categories["monetary"] += 1
    elif any(word in omission_lower for word in ["deadline", "time", "days", "period"]):
        categories["deadline"] += 1
    elif any(word in omission_lower for word in ["requirement", "must", "mandatory", "document"]):
        categories["requirement"] += 1
    else:
        categories["other"] += 1

for category, count in sorted(categories.items(), key=lambda x: -x[1]):
    if count > 0:
        print(f"   {category.capitalize()}: {count}")
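
A single-label categorization forces each omission into the first category that matches. As an alternative sketch, the code below tags each omission with every matching category and tallies the tags with `collections.Counter`. The keyword lists are copied from the cell above; the function names `tag_omission` and `count_categories` are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

CATEGORY_KEYWORDS: Dict[str, Tuple[str, ...]] = {
    "monetary": ("threshold", "amount", "$", "monetary"),
    "deadline": ("deadline", "time", "days", "period"),
    "penalty": ("penalty", "fine", "violation"),
    "requirement": ("requirement", "must", "mandatory", "document"),
}


def tag_omission(omission: str) -> List[str]:
    """Return every category whose keywords appear in the omission, or ["other"]."""
    text = omission.lower()
    tags = [
        category
        for category, keywords in CATEGORY_KEYWORDS.items()
        if any(keyword in text for keyword in keywords)
    ]
    return tags or ["other"]


def count_categories(omissions: Iterable[str]) -> Counter:
    """Tally category tags across all omissions (one omission may add several tags)."""
    counts: Counter = Counter()
    for omission in omissions:
        counts.update(tag_omission(omission))
    return counts


sample_omissions = [
    "Penalty amount",
    "Exact time deadline",
    "Customer notification prohibition",
]

for category, count in count_categories(sample_omissions).most_common():
    print(f"{category.capitalize()}: {count}")
```

Substring matching remains crude in either version: "fine" also matches "defined" and "time" matches "sometimes", so for real judge output a word-boundary regex or a judge-assigned category would be more reliable.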

---

## 📋 Step 10: View Detailed Rationales

In [None]:
print("📋 Detailed Evaluation Rationales")
print("=" * 80)

for r in results:
    print(f"\n📌 {r['name']}")
    print(f"   Scores: Correctness={r['correctness_score']:.2f}, Completeness={r['completeness_score']:.2f}")
    print(f"   Rationale: {r['rationale']}")

---

## 🎓 Step 11: Define Score Thresholds for Compliance Use

In [None]:
# Define thresholds for compliance use (each value is an inclusive lower bound)
THRESHOLDS = {
    "auto_approve": 0.90,      # At or above: summary can be used without further review
    "review_required": 0.70,   # At or above, but below auto_approve: human review
    "reject": 0.50,            # At or above, but below review_required: major review; below: reject
}

print("📋 Compliance Use Recommendations")
print("=" * 80)
print(f"{'Test Case':<35} {'Combined':<10} {'Recommendation':<20}")
print("-" * 80)

for r in results:
    score = r["combined_score"]
    
    if score >= THRESHOLDS["auto_approve"]:
        recommendation = "✅ Auto-approve"
    elif score >= THRESHOLDS["review_required"]:
        recommendation = "⚠️ Review required"
    elif score >= THRESHOLDS["reject"]:
        recommendation = "❌ Major review needed"
    else:
        recommendation = "🚫 Reject"
    
    print(f"{r['name']:<35} {score:<10.2f} {recommendation:<20}")

print("\n📌 Thresholds:")
print(f"   Auto-approve: ≥ {THRESHOLDS['auto_approve']}")
print(f"   Review required: ≥ {THRESHOLDS['review_required']}")
print(f"   Major review: ≥ {THRESHOLDS['reject']}")
print(f"   Reject: < {THRESHOLDS['reject']}")
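
A combined score alone can hide a single missing critical detail. The sketch below wraps the same thresholds in a function and adds one extra guard that is an assumption of this example, not part of the notebook: any reported omission blocks auto-approval, regardless of score. The name `recommend` and its return labels are hypothetical.

```python
from typing import Dict, List, Optional

# Same values as the thresholds defined in this step.
DEFAULT_THRESHOLDS: Dict[str, float] = {
    "auto_approve": 0.90,
    "review_required": 0.70,
    "reject": 0.50,
}


def recommend(
    combined_score: float,
    omitted_details: List[str],
    thresholds: Optional[Dict[str, float]] = None,
) -> str:
    """Map a score and omission list to a recommendation label."""
    limits = thresholds or DEFAULT_THRESHOLDS
    if combined_score < limits["reject"]:
        return "reject"
    if combined_score < limits["review_required"]:
        return "major_review"
    # Assumed guard: a high score never auto-approves while omissions are reported.
    if combined_score < limits["auto_approve"] or omitted_details:
        return "review_required"
    return "auto_approve"


print(recommend(0.925, []))
print(recommend(0.925, ["Penalty amount"]))
print(recommend(0.70, []))
print(recommend(0.45, []))
```

Keeping the decision in one function makes the policy testable at its boundaries (0.50, 0.70, 0.90) and keeps display code separate from the rule itself.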

---

## 🎓 Mini-Project: Your Compliance Evaluation

### Task

Create your own regulatory text and evaluate different summary qualities.

### Template

In [None]:
# Your regulatory text
my_regulatory_text = """
# Paste a regulatory text here or write your own
# Include: monetary thresholds, deadlines, penalties, requirements
"""

# Your good summary
my_good_summary = """
# Write a summary that includes all critical details
"""

# Your poor summary
my_poor_summary = """
# Write a summary that omits critical details
"""

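# Evaluate both
# good_result = judge.evaluate_summary(my_regulatory_text, my_good_summary)
# poor_result = judge.evaluate_summary(my_regulatory_text, my_poor_summary)

# Compare results (evaluate_summary does not return a combined score, so compute it)
# good_combined = judge.compute_combined_score(
#     good_result["correctness_score"], good_result["completeness_score"])
# poor_combined = judge.compute_combined_score(
#     poor_result["correctness_score"], poor_result["completeness_score"])
# print(f"Good summary: {good_combined:.2f}")
# print(f"Poor summary: {poor_combined:.2f}")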

---

## 🤔 Paul-Elder Critical Thinking Questions

Reflect on these questions as you complete the exercises:

### Question 1: RISK ASSESSMENT
**What are the potential consequences if a compliance summary incorrectly states a reporting deadline as 60 days instead of 30 days?**

*Consider: Regulatory violations, audit findings, potential fines, reputational damage, and the chain of decisions that might rely on this summary.*

### Question 2: TRUST CALIBRATION
**Should a compliance officer trust an LLM-generated summary of a regulation they haven't read themselves? Under what conditions?**

*Consider: The role of human oversight, the complexity of the regulation, the stakes involved, available verification methods, and organizational liability.*

### Question 3: OMISSION DETECTION
**How can we systematically detect what an LLM summary has omitted from a regulatory text, and why is this particularly challenging?**

*Consider: The difficulty of proving a negative, the need for domain expertise to identify critical vs. non-critical details, and how omission detection differs from error detection.*

---

## ⚠️ Limitations and Risks

### What This Evaluation DOESN'T Cover

1. **Regulatory Updates:** Regulations change; summaries may become outdated
2. **Jurisdiction:** Rules vary by location; summaries may not be universally applicable
3. **Interconnections:** Regulations interact; isolated summaries miss context
4. **Legal Nuance:** Subtle legal distinctions may be lost in summarization
5. **Liability:** Using LLM summaries doesn't transfer legal responsibility

### Required Safeguards for Production Use

- **Human Review:** All summaries should be reviewed by qualified compliance staff
- **Source Linking:** Always provide links to original regulatory text
- **Version Control:** Track which version of each regulation was summarized
- **Audit Trail:** Log all summary generations and reviews
- **Periodic Validation:** Re-evaluate summaries when regulations change
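
Several of these safeguards (source linking, version control, audit trail) come down to recording what was summarized, from which version, and who reviewed it. The sketch below shows one possible record shape using only the standard library. The function `build_audit_record`, its field names, and the sample version label are all hypothetical. Hashing the source text lets a later check detect whether the regulation changed since the summary was evaluated.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import Any, Dict, Optional


def sha256_hex(text: str) -> str:
    """Hex digest of the UTF-8 encoding of `text`."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def build_audit_record(
    regulatory_text: str,
    summary: str,
    regulation_version: str,
    evaluation: Dict[str, Any],
    reviewed_by: Optional[str] = None,
) -> Dict[str, Any]:
    """Assemble one audit-trail entry for a generated and evaluated summary."""
    return {
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "regulation_version": regulation_version,
        "source_sha256": sha256_hex(regulatory_text),
        "summary_sha256": sha256_hex(summary),
        "evaluation": evaluation,
        "reviewed_by": reviewed_by,  # None until a human reviewer signs off
    }


record = build_audit_record(
    regulatory_text="SAR records must be retained for five years.",
    summary="Keep SAR records for five years.",
    regulation_version="example-version-1",  # placeholder label for this sketch
    evaluation={"correctness_score": 0.9, "completeness_score": 0.9, "omitted_details": []},
)

print(json.dumps(record, indent=2))
```

In practice such records would be appended to durable storage. The sketch only shows what might be captured, not where it is kept.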

---

## ✅ Knowledge Mastery Checklist

Before moving to Week 12, ensure you can check all boxes:

- [ ] I understand why compliance summarization is different from general summarization
- [ ] I can explain the difference between correctness and completeness scores
- [ ] I can use the ComplianceJudge to evaluate regulatory summaries
- [ ] I understand what types of critical details are most important in compliance
- [ ] I can identify when human review is essential vs. when automation is acceptable
- [ ] I know the risks of omission in regulatory summarization
- [ ] I can set appropriate score thresholds for compliance use cases

---

**Week 11 Complete!** 🎉

**Next:** *Week 12 - Healthcare Use Cases*