# Introduction to Stickler: Single Document Evaluation

## Overview

This notebook introduces the Stickler library for evaluating structured data extraction from any workflow that produces key-value pairs or structured outputs. Document processing is the primary example here, but the same approach applies to other sources, such as product information scraped from websites, API responses, database records, or log files. Any scenario where you need to verify that extracted information matches expected values can benefit from Stickler's evaluation framework.

## Why Use Stickler?

- **Standardized Metrics**: Industry-standard evaluation methods for structured data extraction
- **Structured Data Focus**: Purpose-built for evaluating key-value extraction accuracy
- **Field-Level Analysis**: Evaluate performance across different field types
- **Clear Reporting**: Generate readable evaluation reports
- **Open Source**: Community-driven library with transparent implementations

## What You'll Learn

1. **Stickler Setup** - Install and configure the library
2. **Data Structure Definition** - Define comparison rules for your fields
3. **Single Document Evaluation** - Run evaluation on one document
4. **Result Interpretation** - Understand Stickler's output reports

## Prerequisites
- Python 3.12+
- Basic understanding of structured data concepts


## Setup and dependencies
Stickler is an open-source library ([GitHub](https://github.com/awslabs/stickler), [Piwheels](https://www.piwheels.org/project/stickler-eval/)) for structured data evaluation. Install it with pip before running the cells below; the Piwheels link lists the package under the name `stickler-eval`.
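The install step can be sketched as follows. This is a minimal check, assuming the distribution name is `stickler-eval` (taken from the Piwheels link above) and that it provides an importable `stickler` package (as the imports in the next cell suggest). The helper `pip_install_command` is illustrative and not part of Stickler.

```python
import importlib.util
import sys

# Assumption: distribution name taken from the Piwheels URL above.
PACKAGE_NAME = "stickler-eval"
# Assumption: the import name used by the cells in this notebook.
MODULE_NAME = "stickler"


def pip_install_command(package: str) -> list[str]:
    """Build a pip command that targets the interpreter running this notebook."""
    return [sys.executable, "-m", "pip", "install", package]


if importlib.util.find_spec(MODULE_NAME) is None:
    # Print the command instead of running it, so the cell has no side effects.
    print("Not installed. Run:", " ".join(pip_install_command(PACKAGE_NAME)))
else:
    print(f"{MODULE_NAME} is already importable.")
```

Inside a notebook cell, the `%pip install stickler-eval` magic should have the same effect, under the same assumption about the package name.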

**Import Required Modules**

In [2]:
import json
from typing import List
from stickler import StructuredModel, ComparableField
from stickler.comparators import ExactComparator, NumericComparator, LevenshteinComparator
from stickler.structured_object_evaluator.evaluator import StructuredModelEvaluator
from stickler.structured_object_evaluator.utils.pretty_print import print_confusion_matrix

## 1. Data sample

Let's consider a single invoice document with three essential fields to extract: the invoice number, vendor name, and total amount. The following example shows both the ground truth (what we expect) and the prediction (what our system extracted):

In [3]:
# Ground truth data (what we expect)
ground_truth_data = {
    'invoice_number': 'INV-2023-001',
    'total': 1500.00,
    'vendor': 'ABC Corporation'
}

# Predicted data (what our system extracted)
prediction_data = {
    'invoice_number': 'INV-2023-001',  # Perfect match
    'total': 1499.95,                  # Off by 0.05 - numerically very close
    'vendor': 'ABC Corp'               # Abbreviated - should still match
}

This example represents common real-world scenarios in document extraction where we encounter exact matches for standardized fields like invoice numbers, text variations for company names, and small numerical differences in monetary values. We will use this data to demonstrate how the evaluation framework processes these different types of matches and calculates the corresponding accuracy metrics.
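To see why strict equality is not enough for this sample, the following plain-Python check (no Stickler involved) compares the two dictionaries field by field. The variable name `exact_matches` is illustrative.

```python
# Same sample data as in the cell above.
ground_truth_data = {
    "invoice_number": "INV-2023-001",
    "total": 1500.00,
    "vendor": "ABC Corporation",
}
prediction_data = {
    "invoice_number": "INV-2023-001",
    "total": 1499.95,
    "vendor": "ABC Corp",
}

# Strict equality per field: True only when the values are identical.
exact_matches = {
    field: prediction_data.get(field) == expected
    for field, expected in ground_truth_data.items()
}

for field, is_equal in exact_matches.items():
    print(f"{field:<15} exact match: {is_equal}")
```

Only `invoice_number` is strictly equal; `total` and `vendor` differ, which is why the model in the next section attaches a comparator and a threshold to each field instead of relying on equality.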

## 2. Define your data structure

Before we can evaluate our extracted data, we need to define how each field should be compared and weighted. We create a structured model that specifies the comparison rules for each field in our invoice document:

In [4]:
class Invoice(StructuredModel):
    """Simple invoice model for demonstration."""
    
    invoice_number: str = ComparableField(
        comparator=LevenshteinComparator(),
        threshold=0.9,
        weight=2.0  # This field is important
    )
    
    total: float = ComparableField(
        threshold=0.8,  # More forgiving threshold for demo
        weight=2.0  # This field is important
    )

    vendor: str = ComparableField(
        comparator=LevenshteinComparator(),
        threshold=0.5,  # Much more forgiving for demo purposes
        weight=1.0
    )


In this structure, we define three fields with different comparison rules. The invoice number and total amount are given higher weights (2.0) to reflect their importance in the evaluation. We use the Levenshtein comparator for text fields, with a strict threshold (0.9) for invoice numbers but a more lenient one (0.5) for vendor names to accommodate common variations in company names. The total field sets a threshold of 0.8 but no explicit comparator, so it relies on `ComparableField`'s default. The intent is to tolerate minor numerical differences, although the report in section 4 counts this field as a mismatch, so a numeric comparator such as the imported `NumericComparator` may be the better fit.
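The effect of these thresholds can be illustrated with a small pure-Python sketch. It does not call Stickler. It assumes that similarity is normalized as `1 - distance / max_length`, that a field matches when the score is at least the threshold, and that `total` is compared as text because the model sets no comparator for it. All three are assumptions about the library, not documented behavior.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein_distance(a, b) / longest


# field -> (ground truth, prediction, threshold from the Invoice model)
cases = {
    "invoice_number": ("INV-2023-001", "INV-2023-001", 0.9),
    "total": (str(1500.00), str(1499.95), 0.8),  # assumption: compared as text
    "vendor": ("ABC Corporation", "ABC Corp", 0.5),
}

for field, (truth, predicted, threshold) in cases.items():
    score = similarity(truth, predicted)
    verdict = "match" if score >= threshold else "no match"
    print(f"{field:<15} score={score:.3f} threshold={threshold} -> {verdict}")
```

Under these assumptions the vendor scores about 0.533 and clears its 0.5 threshold, while the total scores about 0.286 as text and falls well short of 0.8, even though the two amounts differ by only 0.05. That is consistent with the report in section 4, where `total` is counted as a false positive.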

## 3. Initialize Data Structure

Now we convert our raw data into structured objects that follow our defined Invoice model. We use the from_json method to create Invoice instances from both our prediction and ground truth data:

In [5]:
ground_truth = Invoice.from_json(ground_truth_data)
prediction = Invoice.from_json(prediction_data)

print("Ground truth values:")
print(json.dumps(ground_truth_data, indent=4))

print("Predicted values:")
print(json.dumps(prediction_data, indent=4))



Ground truth values:
{
    "invoice_number": "INV-2023-001",
    "total": 1500.0,
    "vendor": "ABC Corporation"
}
Predicted values:
{
    "invoice_number": "INV-2023-001",
    "total": 1499.95,
    "vendor": "ABC Corp"
}


The printed output echoes the dictionaries we passed to `from_json`. Each one contains exactly the three fields defined on the `Invoice` model, so no values fall outside the declared structure. The resulting `ground_truth` and `prediction` objects carry the comparison rules we defined in the previous step and are what we pass to the evaluator next.

## 4. Structured model evaluation

We use the StructuredModelEvaluator to compare our prediction against the ground truth and visualize the results:

In [6]:
evaluator = StructuredModelEvaluator()
result = evaluator.evaluate(ground_truth, prediction)
print(print_confusion_matrix(result, show_details=True))  

=== CONFUSION MATRIX SUMMARY ===

--- Raw Counts ---
Metric             Count
-------------------------
True Positive          2
False Positive         1
True Negative          0
False Negative         0
False Discovery        1

--- Derived Metrics ---
Metric               Value Visual                
--------------------------------------------------
Precision           66.67% █████████████░░░░░░░  
Recall             100.00% ████████████████████  
F1 Score            80.00% ████████████████░░░░  
Accuracy            66.67% █████████████░░░░░░░  


=== FIELD-LEVEL METRICS ===

Field                    TP     FP     TN     FN     FD     Prec   Recall       F1      Acc Visual                
-----------------------------------------------------------------------------------------------------------------------------
invoice_number            1      0      0      0      0  100.00%  100.00%  100.00%  100.00% ████████████████████  

total                     0      1      0      0      1  

The output provides a comprehensive view of the evaluation results in three sections:

1. **Confusion Matrix Summary** shows the overall performance:
   - Raw counts (True Positives, False Positives, True Negatives, False Negatives, False Discoveries)
   - Derived metrics (Precision, Recall, F1-score, Accuracy)
   - Visual representation of metrics using progress bars

2. **Field-Level Metrics** breaks down performance by field:
   - Individual metrics for each field (invoice_number, vendor, total)
   - Shows Precision, Recall, F1-score, and Accuracy per field
   - Visual indicators of performance levels

3. **Confusion Matrix Visualization** provides a visual representation of matches and mismatches:
   - Shows distribution of True Positives (T), False Positives (F), and False Discoveries (D)
   - Includes percentage breakdown of each category
   - Helps quickly identify patterns in matching performance

This detailed breakdown shows how well the extraction is performing both overall and at the field level, making it easier to identify areas for potential improvement.
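The derived metrics in the summary can be reproduced from the raw counts with the standard formulas. The sketch below assumes that the false discovery is already included in the false positive count rather than added on top of it; with that assumption the numbers agree with the report above. The function name `derived_metrics` is illustrative.

```python
def derived_metrics(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    """Standard precision, recall, F1, and accuracy from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": accuracy,
    }


# Raw counts from the confusion matrix summary above.
metrics = derived_metrics(tp=2, fp=1, tn=0, fn=0)

for name, value in metrics.items():
    print(f"{name:<10} {value:.2%}")
```

With two true positives and one false positive, precision and accuracy are both 2/3 (66.67%), recall is 100%, and F1 is 80%, matching the summary table.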

## Conclusion

The structured data evaluation framework provides a systematic way to assess extraction accuracy through multiple levels of analysis. By calculating precision, recall, and F1-scores at both the overall document level and individual field level, we can thoroughly understand the performance of our extraction system.

The detailed evaluation output, including confusion matrices and visual representations, helps identify patterns in matching performance and highlights areas that may need attention. This comprehensive approach to evaluation is essential for real-world applications where understanding the accuracy and reliability of extracted data is crucial.

### Next Steps

Now that you've learned the basics of Stickler for single document evaluation, here are recommended next steps:

#### 1. **Experiment with Different Configurations**
- Adjust threshold values to see how they affect matching performance
- Try different comparators (ExactComparator, NumericComparator) for various field types
- Modify field weights to reflect your specific use case priorities

#### 2. **Scale to Multiple Documents**
- Explore the companion notebook: `04-01-Stickler-Bulk-Evaluation.ipynb`
- Learn batch evaluation techniques for larger datasets
- Understand aggregate performance metrics across document collections

#### 3. **Apply to Your Own Data**
- Define structured models for your specific document types or data sources
- Customize comparison rules based on your field characteristics
- Integrate Stickler evaluation into your existing data processing workflows

#### 4. **Community and Resources**
- Visit the [Stickler GitHub repository](https://github.com/awslabs/stickler) for documentation and examples
- Contribute to the open-source project with feedback or improvements
- Share your use cases and evaluation results with the community