# Complex Nested Structure Evaluation

This notebook demonstrates how to evaluate complex nested data structures using the stickler library.

## What You'll Learn
- How to define deeply nested structured models
- How to handle optional fields and missing data
- How to evaluate nested objects within lists
- How to interpret detailed confusion matrices for complex structures

## Setup and Imports

In [None]:
from typing import Optional, List
from stickler.structured_object_evaluator.models.structured_model import StructuredModel
from stickler.structured_object_evaluator.models.comparable_field import ComparableField
from stickler.comparators.levenshtein import LevenshteinComparator
from stickler.comparators.numeric import NumericComparator
from stickler.comparators.exact import ExactComparator
from stickler.structured_object_evaluator.evaluator import StructuredModelEvaluator
from pprint import pprint
import json

## 1. Define Complex Nested Structures

Let's model a veterinary record with several levels of nesting: a record holds an owner, the owner holds contact details, and the record also holds a list of pets:

In [None]:
class Contact(StructuredModel):
    """Contact information with phone and optional email."""

    phone: str = ComparableField(
        comparator=ExactComparator(),
        threshold=1.0,
        weight=2.0,  # Phone is important
    )

    email: Optional[str] = ComparableField(
        default=None, comparator=ExactComparator(), threshold=1.0, weight=1.0
    )


class Owner(StructuredModel):
    """Pet owner with nested contact information."""

    id: int = ComparableField(comparator=ExactComparator(), threshold=1.0, weight=2.0)

    name: str = ComparableField(
        comparator=LevenshteinComparator(), threshold=0.9, weight=2.0
    )

    contact: Contact = ComparableField(
        threshold=0.8,  # Allow some flexibility in contact matching
        weight=1.5,
    )


class Pet(StructuredModel):
    """Pet model with optional fields."""

    # Set a high match threshold for pet matching in lists
    match_threshold = 0.8

    petId: int = ComparableField(
        comparator=ExactComparator(),
        threshold=1.0,
        weight=3.0,  # Pet ID is very important for matching
    )

    name: str = ComparableField(
        comparator=LevenshteinComparator(), threshold=0.8, weight=2.0
    )

    species: Optional[str] = ComparableField(
        default=None, comparator=LevenshteinComparator(), threshold=0.9, weight=2.0
    )

    breed: Optional[str] = ComparableField(
        default=None, comparator=LevenshteinComparator(), threshold=0.8, weight=1.0
    )

    birthdate: Optional[str] = ComparableField(
        default=None, comparator=ExactComparator(), threshold=1.0, weight=1.5
    )

    weight: Optional[float] = ComparableField(
        default=None,
        comparator=NumericComparator(),
        threshold=0.95,  # Very strict for weight
        weight=1.0,
    )


class VeterinaryRecord(StructuredModel):
    """Main record containing owner and pets."""

    recordId: int = ComparableField(
        comparator=ExactComparator(), threshold=1.0, weight=2.0
    )

    owner: Owner = ComparableField(threshold=0.8, weight=2.0)

    pets: List[Pet] = ComparableField(
        weight=3.0  # Pets list is the most important
    )


print("✅ Complex nested models defined")
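
The `threshold` arguments above can be read as the minimum similarity a field must reach to count as a match. The sketch below illustrates that idea with a plain-Python normalized Levenshtein similarity. It is an assumption about the general technique, not stickler's implementation: `levenshtein_distance`, `normalized_similarity`, and `meets_threshold` are hypothetical helpers, and `LevenshteinComparator` may normalize its score differently.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Edit distance via dynamic programming (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,  # delete char_a
                    current[j - 1] + 1,  # insert char_b
                    previous[j - 1] + substitution_cost,  # substitute
                )
            )
        previous = current
    return previous[-1]


def normalized_similarity(a: str, b: str) -> float:
    """Map edit distance to [0, 1]; 1.0 means identical strings.

    Assumption: normalize by the longer string's length.
    """
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / longest


def meets_threshold(a: str, b: str, threshold: float) -> bool:
    """Hypothetical helper: does the similarity reach the field threshold?"""
    return normalized_similarity(a, b) >= threshold


# One dropped letter in a long name still clears a 0.9 threshold,
# while one wrong letter in a three-letter name does not clear 0.8.
print(meets_threshold("Sarah Johnson", "Sara Johnson", threshold=0.9))
print(meets_threshold("Max", "Mix", threshold=0.8))
```

Under this normalization, short strings are penalized more heavily per edit, which is one reason to pick thresholds per field rather than globally.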

## 2. Create Sample Data with Various Error Types

Let's create realistic test data that demonstrates different types of extraction errors:

In [None]:
# Ground truth data
ground_truth_data = {
    "recordId": 4721,
    "owner": {
        "id": 1501,
        "name": "Sarah Johnson",
        "contact": {
            "phone": "555-689-1234"
            # No email in ground truth
        },
    },
    "pets": [
        {
            "petId": 3501,
            "name": "Max",
            "species": "Dog",
            "breed": "Golden Retriever",
            "birthdate": "2018-05-12",
            # No weight in ground truth
        },
        {
            "petId": 3512,
            "name": "Buttons",
            "species": "Cat",
            # Minimal info for second pet
        },
    ],
}

# Prediction data with various error types
prediction_data = {
    "recordId": 4721,  # Correct
    "owner": {
        "id": 1501,  # Correct
        "name": "Sarah Johnson",  # Correct
        "contact": {
            "phone": "666-689-1234",  # ERROR: Wrong phone number (False Discovery)
            "email": "sjohnson@example.com",  # ERROR: Extra field (False Alarm)
        },
    },
    "pets": [
        {
            "petId": 3501,  # Correct
            "name": "Max",  # Correct
            "species": "Dog",  # Correct
            "breed": "Golden Retriever",  # Correct
            "birthdate": "2008-05-12",  # ERROR: Wrong year (False Discovery)
            "weight": 68.5,  # ERROR: Extra field (False Alarm)
        },
        {
            "petId": 3512,  # Correct
            "name": "Buttons",  # Correct
            # ERROR: Missing species (False Negative)
        },
    ],
}

print("Ground Truth Data:")
print(json.dumps(ground_truth_data, indent=2))
print("\nPrediction Data (with errors):")
print(json.dumps(prediction_data, indent=2))
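
The comments in the prediction data label each error as a False Discovery, False Alarm, or False Negative. As a rough illustration of how those labels follow from comparing two values, here is a minimal sketch. `classify_field` and `classify_fields` are hypothetical helpers, exact equality stands in for a real comparator, and the "TN" label for a field absent on both sides is an assumption; stickler's own classification logic may differ.

```python
from typing import Any, Dict, List, Optional


def classify_field(expected: Optional[Any], predicted: Optional[Any]) -> str:
    """Label one field by comparing ground truth with the prediction."""
    if expected is None and predicted is None:
        return "TN"  # absent on both sides (assumed label)
    if expected is None:
        return "FA"  # False Alarm: predicted a value that should not exist
    if predicted is None:
        return "FN"  # False Negative: missed a value that exists
    # Both present: exact equality is used here for simplicity
    return "TP" if expected == predicted else "FD"


def classify_fields(
    ground_truth: Dict[str, Any], prediction: Dict[str, Any], fields: List[str]
) -> Dict[str, str]:
    """Apply classify_field to each named field of two flat dicts."""
    return {
        name: classify_field(ground_truth.get(name), prediction.get(name))
        for name in fields
    }


# Contact block and second pet, copied from the sample data above
gt_contact = {"phone": "555-689-1234"}
pred_contact = {"phone": "666-689-1234", "email": "sjohnson@example.com"}
gt_pet = {"petId": 3512, "name": "Buttons", "species": "Cat"}
pred_pet = {"petId": 3512, "name": "Buttons"}

print(classify_fields(gt_contact, pred_contact, ["phone", "email"]))
print(classify_fields(gt_pet, pred_pet, ["petId", "name", "species"]))
```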

## 3. Create Model Instances and Display

In [None]:
def display_veterinary_record(record: VeterinaryRecord, title: str) -> None:
    """Display a veterinary record in a readable format."""
    print(f"\n{title}:")
    print(f"  Record ID: {record.recordId}")
    print(f"  Owner: {record.owner.name} (ID: {record.owner.id})")
    print(f"    Phone: {record.owner.contact.phone}")
    if record.owner.contact.email:
        print(f"    Email: {record.owner.contact.email}")

    print(f"  Pets ({len(record.pets)}):")
    for i, pet in enumerate(record.pets, 1):
        print(f"    {i}. {pet.name} (ID: {pet.petId})")
        if pet.species:
            print(f"       Species: {pet.species}")
        if pet.breed:
            print(f"       Breed: {pet.breed}")
        if pet.birthdate:
            print(f"       Born: {pet.birthdate}")
        if pet.weight is not None:
            print(f"       Weight: {pet.weight} lbs")


# Create model instances
ground_truth = VeterinaryRecord(**ground_truth_data)
prediction = VeterinaryRecord(**prediction_data)

# Display both records
display_veterinary_record(ground_truth, "Ground Truth Record")
display_veterinary_record(prediction, "Predicted Record")

## 4. Perform Detailed Comparison

In [None]:
# Compare using the built-in method
comparison_result = ground_truth.compare_with(prediction, include_confusion_matrix=True)

print("🔍 HIGH-LEVEL COMPARISON RESULTS")
print("=" * 50)
print(f"Overall Score: {comparison_result['overall_score']:.3f}")
print(f"All Fields Matched: {comparison_result['all_fields_matched']}")

print("\n📋 Field-by-Field Scores:")
for field, score in comparison_result["field_scores"].items():
    print(f"  {field:12}: {score:.3f}")
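
To build intuition for how per-field `weight` values could shape an overall score, the sketch below computes a weighted average of field scores. This is an assumption made for illustration: `weighted_overall_score` is a hypothetical helper, the scores are made up, and stickler's actual aggregation formula may differ.

```python
from typing import Dict


def weighted_overall_score(
    field_scores: Dict[str, float], weights: Dict[str, float]
) -> float:
    """Weighted average of field scores; fields without a weight count as 1.0."""
    total_weight = sum(weights.get(name, 1.0) for name in field_scores)
    if total_weight == 0:
        raise ValueError("at least one field with a non-zero weight is required")
    weighted_sum = sum(
        score * weights.get(name, 1.0) for name, score in field_scores.items()
    )
    return weighted_sum / total_weight


# Weights mirror VeterinaryRecord above; the scores are illustrative only
example_weights = {"recordId": 2.0, "owner": 2.0, "pets": 3.0}
example_scores = {"recordId": 1.0, "owner": 0.8, "pets": 0.9}

print(f"{weighted_overall_score(example_scores, example_weights):.3f}")
```

With this scheme, an error in `pets` (weight 3.0) moves the overall score more than the same error in `recordId` (weight 2.0).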

## 5. Detailed Evaluation with Confusion Matrix

In [None]:
# Use evaluator for detailed analysis
evaluator = StructuredModelEvaluator()
result = evaluator.evaluate(ground_truth, prediction)

print("📊 DETAILED EVALUATION METRICS")
print("=" * 50)
print(f"Overall Precision: {result['overall']['precision']:.3f}")
print(f"Overall Recall:    {result['overall']['recall']:.3f}")
print(f"Overall F1 Score:  {result['overall']['f1']:.3f}")
print(f"Overall ANLS:      {result['overall']['anls_score']:.3f}")
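
Precision, recall, and F1 follow the standard definitions over confusion-matrix counts. The sketch below computes them from counts tallied by hand from the sample data, assuming each leaf field counts once and that false positives are the sum of False Discoveries and False Alarms. `precision_recall_f1` is a hypothetical helper, and the library's own aggregation may count differently, so these numbers need not match the output above.

```python
from typing import Dict


def precision_recall_f1(tp: int, fd: int, fa: int, fn: int) -> Dict[str, float]:
    """Standard precision/recall/F1, with FP assumed to be FD + FA."""
    fp = fd + fa
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Hand tally of the sample data: 9 correct fields, 2 wrong values
# (phone, birthdate), 2 extra fields (email, weight), 1 missing (species)
hand_counted = precision_recall_f1(tp=9, fd=2, fa=2, fn=1)

for name, value in hand_counted.items():
    print(f"{name:10}: {value:.3f}")
```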

## 6. Field-Level Analysis

In [None]:
print("\n📋 FIELD-LEVEL ANALYSIS")
print("=" * 50)


def analyze_field_metrics(field_name: str, metrics: dict, indent: int = 0) -> None:
    """Recursively analyze field metrics."""
    prefix = "  " * indent

    if isinstance(metrics, dict):
        # Check for ANLS score (simple fields)
        if "anls_score" in metrics:
            print(f"{prefix}{field_name:20}: ANLS {metrics['anls_score']:.3f}")

        # Check for nested structure (complex fields)
        elif "overall" in metrics:
            if "anls_score" in metrics["overall"]:
                print(
                    f"{prefix}{field_name:20}: ANLS {metrics['overall']['anls_score']:.3f} (nested)"
                )

            # Show confusion matrix for nested fields
            overall = metrics["overall"]
            if any(k in overall for k in ["tp", "fp", "fn"]):
                tp = overall.get("tp", 0)
                fp = overall.get("fp", 0)
                fn = overall.get("fn", 0)
                print(f"{prefix}                     TP:{tp} FP:{fp} FN:{fn}")

            # Recurse into nested fields if they exist
            if "fields" in metrics:
                for nested_field, nested_metrics in metrics["fields"].items():
                    analyze_field_metrics(nested_field, nested_metrics, indent + 1)


# Analyze all fields
for field_name, field_metrics in result["fields"].items():
    analyze_field_metrics(field_name, field_metrics)
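
When the nesting gets deep, it can help to flatten the metrics into dotted paths and sort them so the weakest fields surface first. The sketch below does this for a dictionary shaped like the one the cell above assumes (`"overall"`, `"fields"`, `"anls_score"` keys). `flatten_metrics` is a hypothetical helper and `sample_metrics` is made-up data, not real evaluator output.

```python
from typing import Any, Dict


def flatten_metrics(metrics: Any, path: str = "") -> Dict[str, float]:
    """Collect ANLS scores from a nested metrics dict into dotted paths."""
    flat: Dict[str, float] = {}
    if not isinstance(metrics, dict):
        return flat

    # Simple fields carry the score directly; nested fields under "overall"
    overall = metrics.get("overall", metrics)
    if path and isinstance(overall, dict) and "anls_score" in overall:
        flat[path] = overall["anls_score"]

    # Recurse into child fields, extending the dotted path
    for name, nested in metrics.get("fields", {}).items():
        child_path = f"{path}.{name}" if path else name
        flat.update(flatten_metrics(nested, child_path))
    return flat


# Made-up metrics in the assumed shape
sample_metrics = {
    "fields": {
        "recordId": {"anls_score": 1.0},
        "owner": {
            "overall": {"anls_score": 0.8},
            "fields": {
                "name": {"anls_score": 1.0},
                "contact": {
                    "overall": {"anls_score": 0.5},
                    "fields": {"phone": {"anls_score": 0.0}},
                },
            },
        },
    }
}

flat = flatten_metrics(sample_metrics)
worst_first = sorted(flat.items(), key=lambda item: item[1])
for field_path, score in worst_first:
    print(f"{field_path:25}: {score:.3f}")
```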

## 7. Error Analysis - What Went Wrong?

In [None]:
print("\n🔧 ERROR ANALYSIS")
print("=" * 50)

# Get confusion matrix data
cm = comparison_result.get("confusion_matrix", {})


def analyze_errors_recursively(path: str, metrics: dict, indent: int = 0) -> None:
    """Recursively analyze errors in the structure."""
    prefix = "  " * indent

    if isinstance(metrics, dict):
        # Look for overall metrics
        overall = metrics.get("overall", metrics)

        if isinstance(overall, dict):
            fp = overall.get("fp", 0)
            fn = overall.get("fn", 0)
            fd = overall.get("fd", 0)
            fa = overall.get("fa", 0)

            if fp > 0 or fn > 0 or fd > 0 or fa > 0:
                # Label the top-level entry instead of printing an empty path
                print(f"{prefix}{path or '(root)'}:")
                if fd > 0:
                    print(f"{prefix}  ❌ {fd} False Discovery (incorrect values)")
                if fa > 0:
                    print(f"{prefix}  ⚠️  {fa} False Alarm (extra fields)")
                if fp > 0 and fd == 0 and fa == 0:
                    # Only an aggregate FP count is present, so report it as-is
                    print(f"{prefix}  ❗ {fp} False Positive (no FD/FA breakdown)")
                if fn > 0:
                    print(f"{prefix}  📭 {fn} False Negative (missing fields)")

        # Recurse into nested fields
        if "fields" in metrics:
            for field_name, field_metrics in metrics["fields"].items():
                new_path = f"{path}.{field_name}" if path else field_name
                analyze_errors_recursively(new_path, field_metrics, indent + 1)


# Analyze errors starting from root
if cm:
    analyze_errors_recursively("", cm)
else:
    print("No confusion matrix data available")

## 8. Beautiful Results Display

Let's use the library's pretty-printing function, `print_confusion_matrix`, to display the evaluation results:

In [None]:
# Import the pretty-printing helper
from stickler.structured_object_evaluator.utils.pretty_print import (
    print_confusion_matrix,
)

print("🎨 BEAUTIFUL RESULTS DISPLAY")
print("=" * 60)

# Use the pretty printer with our evaluation results
print_confusion_matrix(result, show_details=True, nested_detail="detailed")

print("\n🎯 Pretty printing features demonstrated:")
print("• Colored confusion matrix with visual bars")
print("• Hierarchical field breakdown for nested structures")
print("• Detailed error analysis with TP/FP/FN/FD classification")
print("• Field filtering and sorting capabilities")
print("• Works with ANY evaluation result type!")

## 9. Summary and Insights

In [None]:
print("\n💡 KEY INSIGHTS")
print("=" * 50)

print("📊 What the metrics tell us:")
print(
    f"  • Overall score of {comparison_result['overall_score']:.3f} reflects the field-level errors introduced in the prediction"
)
print("  • The system correctly identified pets and basic structure")
print("  • Main issues were in specific field values (phone, birthdate)")
print("  • Some fields were hallucinated (extra email, weight)")
print("  • Some fields were missed (missing species for second pet)")

print("\n🎯 Actionable improvements:")
print("  • Review phone number extraction - OCR issue?")
print("  • Check date parsing logic - year extraction error")
print("  • Validate field extraction to prevent hallucination")
print("  • Improve recall for optional fields like species")

print("\n🏆 What worked well:")
print("  • Correct record and pet ID extraction")
print("  • Accurate name extraction")
print("  • Proper list structure preservation")
print("  • Consistent breed information")

## Key Takeaways

🎯 **Complex Structure Evaluation:**

1. **Hierarchical Analysis**: The library evaluates nested structures recursively
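2. **Optional Field Handling**: Optional fields default to `None`, so a missing value is scored as a False Negative and an extra value as a False Alarm instead of raising an error
3. **List Matching**: Pet lists use the Hungarian algorithm for optimal matching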
4. **Field Weights**: Critical fields (IDs, names) can have higher impact
5. **Error Classification**: Different error types (FD, FA, FN) provide actionable insights
6. **Readable Output**: `print_confusion_matrix()` renders the results as a formatted confusion matrix
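
Takeaway 3 refers to optimal list matching. The sketch below shows what "optimal" means here: choose the pairing of ground-truth and predicted items that maximizes total similarity, then drop pairs below a match threshold. To stay dependency-free it searches all permutations rather than running the Hungarian algorithm, which reaches the same optimum in polynomial time. `best_assignment` and `field_overlap` are hypothetical helpers, and this is not stickler's implementation.

```python
from itertools import permutations
from typing import Any, Callable, Dict, List, Sequence, Tuple


def field_overlap(a: Dict[str, Any], b: Dict[str, Any]) -> float:
    """Toy similarity: fraction of keys (from either dict) with equal values."""
    keys = set(a) | set(b)
    if not keys:
        return 1.0
    return sum(a.get(key) == b.get(key) for key in keys) / len(keys)


def best_assignment(
    gt_items: Sequence[Any],
    pred_items: Sequence[Any],
    similarity: Callable[[Any, Any], float],
    match_threshold: float = 0.0,
) -> List[Tuple[int, int]]:
    """Return (gt_index, pred_index) pairs that maximize total similarity."""
    n_gt, n_pred = len(gt_items), len(pred_items)
    size = max(n_gt, n_pred)
    scores = [[similarity(g, p) for p in pred_items] for g in gt_items]

    def score(i: int, j: int) -> float:
        # Indices beyond either list are padding and contribute nothing
        return scores[i][j] if i < n_gt and j < n_pred else 0.0

    best_total, best_perm = -1.0, ()
    for perm in permutations(range(size)):  # exhaustive: small lists only
        total = sum(score(i, perm[i]) for i in range(size))
        if total > best_total:
            best_total, best_perm = total, perm

    return [
        (i, best_perm[i])
        for i in range(n_gt)
        if best_perm[i] < n_pred and score(i, best_perm[i]) >= match_threshold
    ]


gt_pets = [{"petId": 3501, "name": "Max"}, {"petId": 3512, "name": "Buttons"}]
# Same pets in reversed order, plus one spurious prediction
pred_pets = [
    {"petId": 3512, "name": "Buttons"},
    {"petId": 3501, "name": "Max"},
    {"petId": 9999, "name": "Rex"},
]

pairs = best_assignment(gt_pets, pred_pets, field_overlap, match_threshold=0.8)
print(pairs)
```

Because the search is factorial in list length, a real implementation would use a polynomial-time solver such as `scipy.optimize.linear_sum_assignment`. Predictions left unpaired, like the third pet here, are the ones that would be reported as extra items.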

🚀 **Perfect for Complex Documents:**
- Medical records and forms
- Legal documents with nested clauses
- Financial statements with multiple sections
- Technical specifications with hierarchical data
- Any structured document with nested lists and objects

📚 **Advanced Features Demonstrated:**
1. **Match Thresholds**: `Pet.match_threshold` sets how similar two pets must be to be paired during list matching
2. **Optional Fields**: Handle missing vs. extra data appropriately
3. **Multiple Comparators**: Different comparison strategies per field type
4. **Weighted Scoring**: Important fields contribute more to overall score
5. **Detailed Error Analysis**: Pinpoint specific issues for improvement
6. **Pretty Printing**: Visual confusion matrices and field analysis