# Quick Start: Structured Object Evaluation

This notebook demonstrates how to use the `stickler` library to evaluate the accuracy of structured data extraction.

## What You'll Learn
- How to define structured data models
- How to compare individual objects
- How to evaluate sets of objects (list comparison)
- How to interpret evaluation metrics

## Setup and Imports

In [1]:
from stickler.structured_object_evaluator.models.structured_model import StructuredModel
from stickler.structured_object_evaluator.models.comparable_field import ComparableField
from stickler.comparators.levenshtein import LevenshteinComparator
from stickler.structured_object_evaluator.evaluator import StructuredModelEvaluator
from typing import List

## 1. Define Your Data Structure

First, let's define a simple invoice structure with comparison rules for each field:

In [2]:
class Invoice(StructuredModel):
    """Simple invoice model for demonstration."""

    invoice_number: str = ComparableField(
        comparator=LevenshteinComparator(),
        threshold=0.9,  # Strict matching for invoice numbers
        weight=2.0,  # High importance
    )

    vendor: str = ComparableField(
        comparator=LevenshteinComparator(),
        threshold=0.7,  # Allow some variation in vendor names
        weight=1.0,
    )

    total: float = ComparableField(
        # No comparator is given here, so the library's default comparator applies
        threshold=0.95,  # Very strict for monetary amounts
        weight=2.0,  # High importance
    )


print("✅ Invoice model defined with comparison rules")

✅ Invoice model defined with comparison rules
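
To make the thresholds concrete, here is a minimal sketch of a normalized Levenshtein similarity. It assumes the score is `1 - distance / max(len(a), len(b))`, which is consistent with the `vendor` score of 0.789 reported in section 2, but it is not taken from the `LevenshteinComparator` source. The helper names `levenshtein_distance` and `normalized_similarity` are hypothetical.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Edit distance via dynamic programming (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def normalized_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein_distance(a, b) / longest


# Vendor strings used in the next section of this notebook
score = normalized_similarity("ABC Corporation", "ABC Corporation Inc")
print(f"{score:.3f}")  # 0.789
print(score >= 0.7)    # True: passes the vendor threshold
```

Under this assumption, a threshold of 0.9 on a 12-character invoice number tolerates at most one edited character.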


## 2. Compare Individual Objects

Let's create two invoice objects and compare them:

In [3]:
# Create ground truth and prediction
ground_truth = Invoice(
    invoice_number="INV-2024-001", vendor="ABC Corporation", total=1500.00
)

prediction = Invoice(
    invoice_number="INV-2024-001",  # Perfect match
    vendor="ABC Corporation Inc",  # Close match - should pass 0.7 threshold
    total=1499.95,  # Off by 0.05 - scores 0.000 in the comparison below, so it fails the 0.95 threshold
)

print("Ground Truth:", ground_truth)
print("Prediction:  ", prediction)

Ground Truth: extra_fields={} invoice_number='INV-2024-001' vendor='ABC Corporation' total=1500.0
Prediction:   extra_fields={} invoice_number='INV-2024-001' vendor='ABC Corporation Inc' total=1499.95


In [4]:
# Compare using the built-in compare_with method
result = ground_truth.compare_with(prediction, include_confusion_matrix=True)

print("\n📊 Comparison Results:")
print(f"Overall Score: {result['overall_score']:.3f}")
print(f"All Fields Matched: {result['all_fields_matched']}")

print("\n📋 Field-by-Field Scores:")
for field, score in result["field_scores"].items():
    print(f"  {field:15}: {score:.3f}")


📊 Comparison Results:
Overall Score: 0.558
All Fields Matched: False

📋 Field-by-Field Scores:
  invoice_number : 1.000
  vendor         : 0.789
  total          : 0.000
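
The overall score above is consistent with a weight-normalized average of the field scores, using the weights declared on the `Invoice` model. Here is a minimal sketch under that assumption; `weighted_overall` is a hypothetical helper, not part of the library.

```python
def weighted_overall(field_scores: dict, weights: dict) -> float:
    """Weight-normalized average of per-field scores."""
    total_weight = sum(weights[name] for name in field_scores)
    weighted_sum = sum(score * weights[name] for name, score in field_scores.items())
    return weighted_sum / total_weight


# Field scores copied from the output above; weights copied from the Invoice model
field_scores = {"invoice_number": 1.000, "vendor": 0.789, "total": 0.000}
weights = {"invoice_number": 2.0, "vendor": 1.0, "total": 2.0}

overall = weighted_overall(field_scores, weights)
print(f"{overall:.3f}")  # 0.558
```

Because `total` carries a weight of 2.0 out of 5.0, its score of 0.000 alone removes 40% of the achievable overall score.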


## 3. Using the Evaluator for Detailed Analysis

The evaluator provides additional metrics like precision, recall, and F1 scores:

In [5]:
# Use the evaluator for detailed analysis
evaluator = StructuredModelEvaluator()
eval_result = evaluator.evaluate(ground_truth, prediction)

print("📈 Detailed Evaluation Metrics:")
print(f"Overall Precision: {eval_result['overall']['precision']:.3f}")
print(f"Overall Recall:    {eval_result['overall']['recall']:.3f}")
print(f"Overall F1 Score:  {eval_result['overall']['f1']:.3f}")
print(f"Overall ANLS:      {eval_result['overall']['anls_score']:.3f}")

print("\n📋 Field-Level Analysis:")
for field, metrics in eval_result["fields"].items():
    if isinstance(metrics, dict) and "anls_score" in metrics:
        print(f"  {field:15}: ANLS {metrics['anls_score']:.3f}")

📈 Detailed Evaluation Metrics:
Overall Precision: 0.667
Overall Recall:    1.000
Overall F1 Score:  0.800
Overall ANLS:      0.558

📋 Field-Level Analysis:
  invoice_number : ANLS 1.000
  vendor         : ANLS 0.789
  total          : ANLS 0.000
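
Precision, recall, and F1 follow from confusion-matrix counts. The sketch below uses one set of counts that reproduces the numbers above (2 true positives, 1 false positive, 0 false negatives). These counts are inferred from the output, not read from the evaluator, and `precision_recall_f1` is a hypothetical helper.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple:
    """Standard definitions, returning 0.0 when a denominator is zero."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Inferred counts: two fields matched, one predicted field did not match
precision, recall, f1 = precision_recall_f1(tp=2, fp=1, fn=0)
print(f"{precision:.3f} {recall:.3f} {f1:.3f}")  # 0.667 1.000 0.800
```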


## 4. List Comparison - The Real Power

Now let's see how to compare lists of objects, which is where this library really shines:

In [6]:
# Define a model that contains lists of invoices
class InvoiceBatch(StructuredModel):
    """A batch of invoices for processing."""

    batch_id: str = ComparableField(
        comparator=LevenshteinComparator(), threshold=0.9, weight=1.0
    )

    invoices: List[Invoice] = ComparableField(
        weight=3.0  # This is the most important field
    )


print("✅ InvoiceBatch model defined")

✅ InvoiceBatch model defined


In [7]:
# Create ground truth batch
gt_batch = InvoiceBatch(
    batch_id="BATCH-2024-001",
    invoices=[
        Invoice(invoice_number="INV-001", vendor="Company A", total=1000.00),
        Invoice(invoice_number="INV-002", vendor="Company B", total=2000.00),
        Invoice(invoice_number="INV-003", vendor="Company C", total=1500.00),
    ],
)

# Create prediction batch with some differences
pred_batch = InvoiceBatch(
    batch_id="BATCH-2024-001",
    invoices=[
        Invoice(
            invoice_number="INV-001", vendor="Company A", total=1000.00
        ),  # Perfect match
        Invoice(
            invoice_number="INV-002", vendor="Company B Ltd", total=2000.00
        ),  # Slight vendor difference
        Invoice(
            invoice_number="INV-004", vendor="Company D", total=1800.00
        ),  # Different invoice entirely
    ],
)

print("Ground Truth Batch:")
for i, inv in enumerate(gt_batch.invoices):
    print(f"  {i + 1}. {inv.invoice_number} | {inv.vendor} | ${inv.total}")

print("\nPredicted Batch:")
for i, inv in enumerate(pred_batch.invoices):
    print(f"  {i + 1}. {inv.invoice_number} | {inv.vendor} | ${inv.total}")

Ground Truth Batch:
  1. INV-001 | Company A | $1000.0
  2. INV-002 | Company B | $2000.0
  3. INV-003 | Company C | $1500.0

Predicted Batch:
  1. INV-001 | Company A | $1000.0
  2. INV-002 | Company B Ltd | $2000.0
  3. INV-004 | Company D | $1800.0


In [8]:
# Compare the batches
batch_result = gt_batch.compare_with(pred_batch, include_confusion_matrix=True)

print("🔍 Batch Comparison Results:")
print(f"Overall Score: {batch_result['overall_score']:.3f}")
print(f"All Fields Matched: {batch_result['all_fields_matched']}")

print("\n📊 Field Scores:")
for field, score in batch_result["field_scores"].items():
    print(f"  {field:15}: {score:.3f}")

🔍 Batch Comparison Results:
Overall Score: 0.948
All Fields Matched: True

📊 Field Scores:
  batch_id       : 1.000
  invoices       : 0.931


In [9]:
# Use evaluator for detailed batch analysis
batch_eval_result = evaluator.evaluate(gt_batch, pred_batch)

print("📈 Detailed Batch Evaluation:")
print(f"Overall Precision: {batch_eval_result['overall']['precision']:.3f}")
print(f"Overall Recall:    {batch_eval_result['overall']['recall']:.3f}")
print(f"Overall F1 Score:  {batch_eval_result['overall']['f1']:.3f}")

print("\n📋 Field-Level Analysis:")
for field, metrics in batch_eval_result["fields"].items():
    if isinstance(metrics, dict):
        if "anls_score" in metrics:
            print(f"  {field:15}: ANLS {metrics['anls_score']:.3f}")
        elif "overall" in metrics and "anls_score" in metrics["overall"]:
            print(
                f"  {field:15}: ANLS {metrics['overall']['anls_score']:.3f} (list field)"
            )

📈 Detailed Batch Evaluation:
Overall Precision: 1.000
Overall Recall:    1.000
Overall F1 Score:  1.000

📋 Field-Level Analysis:
  batch_id       : ANLS 1.000
  invoices       : ANLS 0.931 (list field)
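
The `invoices` score depends on how ground-truth and predicted items are paired before they are scored. The sketch below shows the idea of an optimal one-to-one pairing on a small similarity matrix. It uses a brute-force search over permutations rather than the Hungarian algorithm itself; for lists this small both find the same optimum. The matrix values are hypothetical and `best_assignment` is not a library function.

```python
from itertools import permutations


def best_assignment(similarity):
    """Find the one-to-one pairing with the highest total similarity.

    similarity[i][j] is the score between ground-truth item i and
    predicted item j. Assumes a square matrix (equal list lengths).
    """
    size = len(similarity)
    best_total = float("-inf")
    best_perm = ()
    for perm in permutations(range(size)):
        total = sum(similarity[i][perm[i]] for i in range(size))
        if total > best_total:
            best_total, best_perm = total, perm
    return list(enumerate(best_perm)), best_total


# Hypothetical scores: the predictions arrive in a different order than the ground truth
similarity = [
    [0.2, 1.0, 0.3],
    [0.9, 0.2, 0.4],
    [0.3, 0.1, 0.8],
]

pairs, total = best_assignment(similarity)
print(pairs)           # [(0, 1), (1, 0), (2, 2)]
print(f"{total:.1f}")  # 2.7
```

Brute force costs O(n!) and is only practical for a handful of items; the Hungarian algorithm reaches the same optimum in polynomial time, which is why it suits longer lists.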


## 5. Understanding List Comparison

The library uses the Hungarian algorithm to optimally match objects in lists. Let's see what happened:

In [10]:
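# Get confusion matrix for detailed analysis
# Note: the .get(..., 0) defaults below print 0 whenever a key is absent,
# so all-zero output can mean the counts are stored under different keys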
cm = batch_result.get("confusion_matrix", {})
if cm and "fields" in cm and "invoices" in cm["fields"]:
    invoice_metrics = cm["fields"]["invoices"]

    print("🎯 List Comparison Analysis:")
    print(
        f"True Positives (TP):  {invoice_metrics.get('tp', 0)} (correctly matched invoices)"
    )
    print(
        f"False Positives (FP): {invoice_metrics.get('fp', 0)} (incorrect matches + extra predictions)"
    )
    print(
        f"False Negatives (FN): {invoice_metrics.get('fn', 0)} (missed ground truth invoices)"
    )

    print("\n📈 Derived Metrics:")
    derived = invoice_metrics.get("derived", {})
    print(f"Precision: {derived.get('cm_precision', 0):.3f}")
    print(f"Recall:    {derived.get('cm_recall', 0):.3f}")
    print(f"F1 Score:  {derived.get('cm_f1', 0):.3f}")
else:
    print("Confusion matrix data not available")

🎯 List Comparison Analysis:
True Positives (TP):  0 (correctly matched invoices)
False Positives (FP): 0 (incorrect matches + extra predictions)
False Negatives (FN): 0 (missed ground truth invoices)

📈 Derived Metrics:
Precision: 0.000
Recall:    0.000
F1 Score:  0.000
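
These all-zero counts are at odds with the pretty-printed table in section 6, which reports 3 true positives for `invoices`. Since every lookup above falls back to 0 when a key is missing, the counts may simply sit at a different level of the dictionary. Here is a small sketch for locating a key inside a nested result. `find_key_paths` is a hypothetical helper, and the `sample` layout is invented for illustration, not the library's actual structure.

```python
def find_key_paths(data, target, path=()):
    """Yield (path, value) for every occurrence of `target` in nested dicts and lists."""
    if isinstance(data, dict):
        for key, value in data.items():
            if key == target:
                yield path + (key,), value
            yield from find_key_paths(value, target, path + (key,))
    elif isinstance(data, list):
        for index, item in enumerate(data):
            yield from find_key_paths(item, target, path + (index,))


# Hypothetical layout, for illustration only
sample = {"fields": {"invoices": {"overall": {"tp": 3, "fp": 0, "fn": 0}}}}

for path, value in find_key_paths(sample, "tp"):
    print(" -> ".join(map(str, path)), "=", value)
# fields -> invoices -> overall -> tp = 3
```

Running `list(find_key_paths(batch_result["confusion_matrix"], "tp"))` on the real result shows where the counts are stored, so the lookups can point at the right level.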


## 6. Beautiful Results Display

The library includes pretty-printing functions for displaying results:

In [11]:
# Import the pretty-print function
from stickler.structured_object_evaluator.utils.pretty_print import (
    print_confusion_matrix,
)

print("🎨 Beautiful Results Display:")
print("=" * 50)

# Use the pretty printer with our batch evaluation results
print_confusion_matrix(batch_eval_result, show_details=True)

print("\n🎯 The pretty printer works with ANY evaluation result:")
print("• Individual comparisons: model.compare_with()")
print("• Evaluator results: evaluator.evaluate()")
print("• Bulk evaluator results: bulk_evaluator.compute()")
print("• Includes colors, visual bars, and detailed breakdowns!")

🎨 Beautiful Results Display:
=== CONFUSION MATRIX SUMMARY ===

--- Raw Counts ---
Metric             Count
-------------------------
True Positive          4
False Positive         0
True Negative          0
False Negative         0
False Discovery        0

--- Derived Metrics ---
Metric               Value Visual                
--------------------------------------------------
Precision          100.00% ████████████████████  
Recall             100.00% ████████████████████  
F1 Score           100.00% ████████████████████  
Accuracy           100.00% ████████████████████  


=== FIELD-LEVEL METRICS ===

Field                    TP     FP     TN     FN     FD     Prec   Recall       F1      Acc Visual                
-----------------------------------------------------------------------------------------------------------------------------
batch_id                  1      0      0      0      0  100.00%  100.00%  100.00%  100.00% ████████████████████  

invoices                  3 

## Key Takeaways

🎯 **What We Learned:**

1. **Individual Object Comparison**: Simple field-by-field scoring
2. **List Comparison**: Optimal matching using the Hungarian algorithm
3. **Flexible Thresholds**: Different comparison rules per field type
4. **Weighted Importance**: Critical fields can have higher impact
5. **Multiple Metrics**: ANLS scores, precision/recall, confusion matrices
6. **Beautiful Output**: `print_confusion_matrix()` for gorgeous results display

🚀 **Perfect for:**
- Document extraction evaluation (invoices, receipts, forms)
- Entity extraction assessment
- OCR quality measurement
- ML model evaluation on structured outputs

📚 **Next Steps:**
1. Try with your own data structures
2. Experiment with different comparators and thresholds
3. Use the non-match analysis for debugging
4. Scale up to larger datasets with the bulk evaluator
5. Use `print_confusion_matrix()` for beautiful results display
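
When scaling up to many documents, per-document confusion counts are commonly summed before deriving metrics (micro-averaging). The sketch below illustrates that idea in plain Python. It is not the library's bulk evaluator API, the counts are hypothetical, and `micro_average` is a hypothetical helper.

```python
def micro_average(per_document_counts):
    """Sum TP/FP/FN over documents, then derive precision, recall, and F1."""
    tp = sum(counts["tp"] for counts in per_document_counts)
    fp = sum(counts["fp"] for counts in per_document_counts)
    fn = sum(counts["fn"] for counts in per_document_counts)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts for three evaluated documents
document_counts = [
    {"tp": 4, "fp": 0, "fn": 0},
    {"tp": 2, "fp": 1, "fn": 0},
    {"tp": 3, "fp": 1, "fn": 2},
]

summary = micro_average(document_counts)
print(f"Precision: {summary['precision']:.3f}")  # Precision: 0.818
print(f"Recall:    {summary['recall']:.3f}")     # Recall:    0.818
```

Summing counts first gives larger documents more influence than averaging per-document scores would, which is usually the desired behavior for field-level extraction metrics.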