# Stickler Example: Evaluating Structured Extraction

This notebook demonstrates how to use Stickler to evaluate structured data extraction results, using a shipping invoice example from the blog post.

## Setup

First, let's import required dependencies and define our data schema classes.

In [None]:
from typing import List, Optional

from stickler.structured_object_evaluator import StructuredModel, ComparableField
from stickler.comparators.levenshtein import LevenshteinComparator
from stickler.comparators.exact import ExactComparator
from stickler.comparators.numeric import NumericComparator

## Schema Definition

We'll define our hierarchical data schema using Stickler's StructuredModel base class. Each field is annotated with evaluation parameters through ComparableField.
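
To build intuition for the `threshold` values used in the schema, here is a minimal sketch of a normalized Levenshtein similarity and a threshold check. It is an illustration under stated assumptions, not Stickler's implementation: the helper names are hypothetical, and `LevenshteinComparator` may normalize strings or apply thresholds differently.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def normalized_similarity(a: str, b: str) -> float:
    """Map edit distance to [0, 1]; 1.0 means the strings are identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / longest


def passes_threshold(a: str, b: str, threshold: float) -> bool:
    """Assumption: a field counts as a match when similarity >= threshold."""
    return normalized_similarity(a, b) >= threshold


# One extra character in a ten-character string still clears a 0.8 threshold.
print(passes_threshold("Acme Corp", "Acme Corp.", threshold=0.8))
```

Under this reading, a lower threshold tolerates more character-level noise, which is why free-text fields such as addresses get 0.7 while identifiers use exact matching.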

In [2]:
# Define component classes for the shipping invoice structure
class Sender(StructuredModel):
    """Sender information for shipment."""
    company: str = ComparableField(
        comparator=LevenshteinComparator(),
        threshold=0.8,
        weight=1.0,
        description="Sender's company name"
    )
    address: Optional[str] = ComparableField(
        comparator=LevenshteinComparator(),
        threshold=0.7,
        weight=0.8,
        description="Street address of the sender as it appears",
        default=None
    )
    city: Optional[str] = ComparableField(
        comparator=LevenshteinComparator(),
        threshold=0.7,
        weight=0.8,
        description="City of the sender",
        default=None
    )


class Recipient(StructuredModel):
    """Recipient information for shipment."""
    company: str = ComparableField(
        comparator=LevenshteinComparator(),
        threshold=0.8,
        weight=1.0,
        description="Recipient's company name as it appears"
    )
    address: Optional[str] = ComparableField(
        comparator=LevenshteinComparator(),
        threshold=0.7,
        weight=0.8,
        description="Street address of the recipient as it appears",
        default=None
    )
    zipCode: Optional[str] = ComparableField(
        comparator=ExactComparator(),  # Zip codes are strings; exact matching preserves leading zeros
        threshold=1.0,
        weight=0.8,
        description="Zipcode of the recipient's address",
        default=None
    )


class AdditionalCharge(StructuredModel):
    """Additional charge applied to shipment."""
    description: str = ComparableField(
        comparator=LevenshteinComparator(),
        threshold=0.8,
        weight=1.0,
        description="Description of the additional charge as it appears"
    )
    amount: Optional[float] = ComparableField(
        comparator=NumericComparator(),
        threshold=0.95,
        weight=1.0,
        description="Amount of the additional charge",
        default=None
    )


class MostRecentShipment(StructuredModel):
    """Shipment with the most recent date information."""
    sender: Optional[Sender] = ComparableField(
        threshold=0.8,
        weight=1.0,
        description="Sender information",
        default=None
    )
    recipient: Optional[Recipient] = ComparableField(
        threshold=0.8,
        weight=1.0,
        description="Recipient information",
        default=None
    )
    shipmentNumber: Optional[str] = ComparableField(
        comparator=ExactComparator(),
        threshold=1.0,
        weight=2.0,
        description="Shipment number, used as ID",
        default=None
    )
    additionalCharges: Optional[List[AdditionalCharge]] = ComparableField(
        weight=0.8,
        description="Details of additional charges applied to the shipment",
        default=None
    )


class ShippingInvoice(StructuredModel):
    """Main shipping invoice structure."""
    mostRecentShipment: Optional[MostRecentShipment] = ComparableField(
        threshold=0.8,
        weight=3.0,
        description="Shipment with the most recent date (closest to today's date), always the last item in the list",
        default=None
    )


## Test Data

Let's create sample ground truth and prediction data with some deliberate errors to demonstrate Stickler's evaluation capabilities.

In [3]:
gt_json = {
    "mostRecentShipment": {
        "sender": {
            "company": "Schumm, Cronin And Grady"
        },
        "recipient": {
            "address": "49418 Renner Key",
            "company": "Stiedemann - Hermann"
        },
        "shipmentNumber": "8186386200",
        "additionalCharges": [{
                "amount": 8.62,
                "description": "Change Of Address Fee"
            },
            {
                "amount": 5.78,
                "description": "Priority Service Charge"
            }
        ]
    }
}

pred_json = {
    "mostRecentShipment": {
        "sender": {
            "company": "Schumm, Cronin And Graddy"  # Typo: Graddy vs Grady
        },
        "recipient": {
            "address": "49418 Renner Key",
            "company": "Stiedemann - Hermann"
        },
        "shipmentNumber": "8186386208",  # Last-digit typo: should end in 0, not 8
        "additionalCharges": [{
                "amount": 8.62,
                "description": "Change Of Address Fees"  # Slight wording change
            }
            # The second additional charge is missing
        ]
    }
}

## Basic Evaluation

Now let's perform the basic evaluation between ground truth and prediction.

In [4]:
# Create structured objects from our JSON data
gt_invoice = ShippingInvoice(**gt_json)
pred_invoice = ShippingInvoice(**pred_json)

# Perform comparison with confusion matrix metrics
results = gt_invoice.compare_with(pred_invoice, include_confusion_matrix=True)

## Analyzing Results

Let's examine the evaluation results at different levels of the hierarchy.

In [5]:
# Overall document-level results
print("Overall Document Results:")
print(results['confusion_matrix']['overall'])

# Aggregate results (including all nested fields)
print("\nAggregate Results:")
print(results['confusion_matrix']['aggregate'])

Overall Document Results:
{'tp': 0, 'fa': 0, 'fd': 1, 'fp': 1, 'tn': 0, 'fn': 0, 'similarity_score': 0.0, 'all_fields_matched': False, 'derived': {'cm_precision': 0.0, 'cm_recall': 0.0, 'cm_f1': 0.0, 'cm_accuracy': 0.0}}

Aggregate Results:
{'tp': 5, 'fa': 0, 'fd': 1, 'fp': 1, 'tn': 3, 'fn': 2, 'derived': {'cm_precision': 0.8333333333333334, 'cm_recall': 0.7142857142857143, 'cm_f1': 0.7692307692307692, 'cm_accuracy': 0.7272727272727273}}
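
The `derived` values can be checked by hand. The sketch below recomputes them from the raw counts using the standard precision, recall, F1, and accuracy formulas, assuming `fp = fa + fd` as the counts shown suggest. The function name `derived_metrics` is hypothetical and not part of Stickler.

```python
def derived_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Recompute precision, recall, F1, and accuracy from confusion counts."""

    def safe_div(numerator: float, denominator: float) -> float:
        # Return 0.0 instead of raising when a denominator is zero.
        return numerator / denominator if denominator else 0.0

    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    f1 = safe_div(2 * precision * recall, precision + recall)
    accuracy = safe_div(tp + tn, tp + tn + fp + fn)
    return {
        "cm_precision": precision,
        "cm_recall": recall,
        "cm_f1": f1,
        "cm_accuracy": accuracy,
    }


# Counts copied from the aggregate results printed above.
aggregate = {"tp": 5, "fa": 0, "fd": 1, "fp": 1, "tn": 3, "fn": 2}
metrics = derived_metrics(
    aggregate["tp"], aggregate["fp"], aggregate["tn"], aggregate["fn"]
)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these counts, precision is 5/6, recall is 5/7, and accuracy is 8/11, which agrees with the printed aggregate results.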


## Deep Dive: Additional Charges

Let's examine how Stickler handles the list of additional charges, where we have both missing entries and text variations.

In [6]:
# Extract additional charges for detailed examination
gt_additional_charges = gt_invoice.mostRecentShipment.additionalCharges
pred_additional_charges = pred_invoice.mostRecentShipment.additionalCharges

print("Ground Truth Additional Charges:")
print(gt_additional_charges)
print("\nPredicted Additional Charges:")
print(pred_additional_charges)

Ground Truth Additional Charges:
[AdditionalCharge(extra_fields={}, description='Change Of Address Fee', amount=8.62), AdditionalCharge(extra_fields={}, description='Priority Service Charge', amount=5.78)]

Predicted Additional Charges:
[AdditionalCharge(extra_fields={}, description='Change Of Address Fees', amount=8.62)]


### Hungarian Matching Analysis

Examine how Stickler matches items in the additional charges list using the Hungarian algorithm.

In [7]:
from stickler.structured_object_evaluator.models.hungarian_helper import HungarianHelper

# Analyze how items are matched
hungarian_helper = HungarianHelper()
hungarian_info = hungarian_helper.get_complete_matching_info(gt_additional_charges, pred_additional_charges)
matched_pairs = hungarian_info["matched_pairs"]

print("Matched pairs (ground truth index, prediction index, similarity score):")
print(matched_pairs)

Matched pairs (ground truth index, prediction index, similarity score):
[(0, 0, np.float64(0.9772727272727273))]
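
To see where a score of roughly 0.977 could come from, the sketch below scores each ground-truth/prediction pair as the equal-weight average of a description similarity and an amount similarity, then picks the best one-to-one pairing. Several things here are assumptions for illustration: the helper names are hypothetical, the amount comparison is simplified to exact equality, and the pairing uses an exhaustive search rather than the Hungarian algorithm. Exhaustive search finds the same optimum on tiny lists but scales factorially, so it is only a stand-in.

```python
from itertools import permutations


def levenshtein_similarity(a: str, b: str) -> float:
    """Normalized edit-distance similarity in [0, 1]."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                   # delete ch_a
                current[j - 1] + 1,                # insert ch_b
                previous[j - 1] + (ch_a != ch_b),  # substitute or match
            ))
        previous = current
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - previous[-1] / longest


def charge_similarity(gt: dict, pred: dict) -> float:
    """Assumed scoring: equal-weight average over the two fields."""
    description = levenshtein_similarity(gt["description"], pred["description"])
    amount = 1.0 if gt["amount"] == pred["amount"] else 0.0  # simplification
    return (description + amount) / 2


def best_assignment(gt_items: list, pred_items: list) -> list:
    """Exhaustively search one-to-one pairings, maximizing total similarity."""
    scores = [[charge_similarity(g, p) for p in pred_items] for g in gt_items]
    n_gt, n_pred = len(gt_items), len(pred_items)
    if n_pred <= n_gt:
        # Each prediction index p is paired with a distinct ground-truth index.
        candidates = [
            [(g, p) for p, g in enumerate(perm)]
            for perm in permutations(range(n_gt), n_pred)
        ]
    else:
        # Each ground-truth index g is paired with a distinct prediction index.
        candidates = [
            [(g, p) for g, p in enumerate(perm)]
            for perm in permutations(range(n_pred), n_gt)
        ]
    best_pairs, best_total = [], -1.0
    for pairs in candidates:
        total = sum(scores[g][p] for g, p in pairs)
        if total > best_total:
            best_pairs, best_total = pairs, total
    return sorted((g, p, scores[g][p]) for g, p in best_pairs)


gt_charges = [
    {"description": "Change Of Address Fee", "amount": 8.62},
    {"description": "Priority Service Charge", "amount": 5.78},
]
pred_charges = [{"description": "Change Of Address Fees", "amount": 8.62}]

for gt_idx, pred_idx, score in best_assignment(gt_charges, pred_charges):
    print(gt_idx, pred_idx, round(score, 4))
```

The description pair differs by one character out of 22, giving a similarity of 1 - 1/22, and the amounts are identical, so the average lands close to the score reported for the matched pair. The second ground-truth charge has no counterpart and stays unmatched.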


### Field-Level Evaluation

Field-level results show how an individual field performed. For `additionalCharges`, compare the `overall` and `aggregate` entries.

In [8]:
# Overall results for additionalCharges
results['confusion_matrix']['fields']['mostRecentShipment']['fields']['additionalCharges']['overall']

{'tp': 1,
 'fa': 0,
 'fd': 0,
 'fp': 0,
 'tn': 0,
 'fn': 1,
 'derived': {'cm_precision': 1.0,
  'cm_recall': 0.5,
  'cm_f1': 0.6666666666666666,
  'cm_accuracy': 0.5}}

In [9]:
# Aggregate results for additionalCharges
results['confusion_matrix']['fields']['mostRecentShipment']['fields']['additionalCharges']['aggregate']

{'tp': 2,
 'fa': 0,
 'fd': 0,
 'fp': 0,
 'tn': 0,
 'fn': 2,
 'derived': {'cm_precision': 1.0,
  'cm_recall': 0.5,
  'cm_f1': 0.6666666666666666,
  'cm_accuracy': 0.5}}
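
The nested lookups above follow a regular pattern: each level of the schema adds a `['fields'][name]` step. A small helper can walk that path from a dotted string. This is a sketch: `field_metrics` is a hypothetical function, not part of Stickler, and it assumes the result layout shown in this notebook. The sample dictionary below only mimics that layout with the counts printed above.

```python
def field_metrics(results: dict, path: str, level: str = "overall") -> dict:
    """Return the 'overall' or 'aggregate' metrics for a dotted field path."""
    node = results["confusion_matrix"]
    for name in path.split("."):
        # Descend one schema level; a missing field raises KeyError.
        node = node["fields"][name]
    return node[level]


# Stand-in for the structure returned by compare_with, trimmed to raw counts.
sample_results = {
    "confusion_matrix": {
        "fields": {
            "mostRecentShipment": {
                "fields": {
                    "additionalCharges": {
                        "overall": {"tp": 1, "fa": 0, "fd": 0, "fp": 0, "tn": 0, "fn": 1},
                        "aggregate": {"tp": 2, "fa": 0, "fd": 0, "fp": 0, "tn": 0, "fn": 2},
                    }
                }
            }
        }
    }
}

path = "mostRecentShipment.additionalCharges"
print(field_metrics(sample_results, path))
print(field_metrics(sample_results, path, level="aggregate"))
```

In the counts above, `overall` reports one matched charge and one missing charge, while `aggregate` reports two matched and two missing nested fields, which suggests that `aggregate` counts at the level of the fields inside each list item.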

## Conclusion

This notebook demonstrates Stickler's key features:
- Hierarchical evaluation of nested structures
- Field-specific comparison strategies
- Weighted scoring based on business importance
- Detailed metrics at multiple levels of granularity