# Understanding Structured Data Evaluation Metrics

## Overview
In this workshop, you'll learn to measure and analyze key evaluation metrics for structured data extraction models. Understanding these metrics is crucial for:
- Assessing model accuracy at field and document levels
- Identifying problematic fields and patterns
- Optimizing extraction performance
- Quality assurance of automated data extraction

## Key Metrics We'll Cover
1. Performance metrics
   - Precision: the share of extracted values that are correct
   - Recall: the share of expected values that were extracted correctly
   - F1-score: the harmonic mean of precision and recall
   - Accuracy: the share of all fields, including empty ones, that are handled correctly

2. Evaluation levels
   - Overall document level: aggregate metrics across all fields
   - Field-level analysis: individual performance for each field type

## Use case
We'll evaluate an invoice processing system that extracts key information from business documents. This demonstrates real-world structured data evaluation in a scenario requiring high accuracy across multiple fields. 

## Prerequisites
- Python 3.10+

## 1. Data Sample

Let's consider invoice documents with three essential fields to extract: the invoice number, vendor name, and total amount. The following example shows both the ground truth (what we expect) and the prediction (what our system extracted) for five sample documents. Together, the samples cover the five situations that arise when an extracted field is compared against ground truth:

1. **Correct Match (True Positive)**: When our system extracts "INV-001" and that's correct
2. **Correct Empty (True Negative)**: When our system and ground truth both say a field is empty
3. **False Find (False Alarm)**: When our system finds "INV-003" but there shouldn't be anything
4. **Wrong Value (False Discovery)**: When our system finds "Global Svcs" but it should be "Global Services"
5. **Missed Value (False Negative)**: When our system misses "INV-005" that exists in the document

Each document in our example shows different combinations of these scenarios, helping us understand where our system performs well and where it needs improvement.

In [1]:
import json
ground_truth_data = [
    {'DocumentID': 1, 'invoice_number': 'INV-001', 'vendor': 'Acme Corp', 'total': 1500.00},
    {'DocumentID': 2, 'invoice_number': 'INV-002', 'vendor': '', 'total': 2200.50},
    {'DocumentID': 3, 'invoice_number': '', 'vendor': 'SuperTech', 'total': None},
    {'DocumentID': 4, 'invoice_number': 'INV-004', 'vendor': 'Global Services', 'total': 3750.00},
    {'DocumentID': 5, 'invoice_number': 'INV-005', 'vendor': 'Mega Industries', 'total': 1200.75}
]

prediction_data = [
    {'DocumentID': 1, 'invoice_number': 'INV-001', 'vendor': 'Acme Corp', 'total': 1500.00},  # All true positives
    {'DocumentID': 2, 'invoice_number': 'INV-002', 'vendor': '', 'total': 2250.00},  # invoice_number: True positive, vendor: True negative, total: False discovery
    {'DocumentID': 3, 'invoice_number': 'INV-003', 'vendor': '', 'total': None},  # invoice_number: False alarm, vendor: False negative, total: True negative
    {'DocumentID': 4, 'invoice_number': '', 'vendor': 'Global Svcs', 'total': 3750.00},  # invoice_number: False negative, vendor: False discovery, total: True positive
    {'DocumentID': 5, 'invoice_number': '', 'vendor': '', 'total': None}  # invoice_number: False negative, vendor: False negative, total: False negative
]

print(json.dumps(ground_truth_data, indent=4))
print(json.dumps(prediction_data, indent=4))

[
    {
        "DocumentID": 1,
        "invoice_number": "INV-001",
        "vendor": "Acme Corp",
        "total": 1500.0
    },
    {
        "DocumentID": 2,
        "invoice_number": "INV-002",
        "vendor": "",
        "total": 2200.5
    },
    {
        "DocumentID": 3,
        "invoice_number": "",
        "vendor": "SuperTech",
        "total": null
    },
    {
        "DocumentID": 4,
        "invoice_number": "INV-004",
        "vendor": "Global Services",
        "total": 3750.0
    },
    {
        "DocumentID": 5,
        "invoice_number": "INV-005",
        "vendor": "Mega Industries",
        "total": 1200.75
    }
]
[
    {
        "DocumentID": 1,
        "invoice_number": "INV-001",
        "vendor": "Acme Corp",
        "total": 1500.0
    },
    {
        "DocumentID": 2,
        "invoice_number": "INV-002",
        "vendor": "",
        "total": 2250.0
    },
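    {
        "DocumentID": 3,
        "invoice_number": "INV-003",
        "vendor": "",
        "total": null
    },
    {
        "DocumentID": 4,
        "invoice_number": "",
        "vendor": "Global Svcs",
        "total": 3750.0
    },
    {
        "DocumentID": 5,
        "invoice_number": "",
        "vendor": "",
        "total": null
    }
]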

## 2. Compare and classify

The sample data above represents common scenarios in document processing. Before we can calculate evaluation metrics, we need to classify each predicted field by comparing it with the ground truth. These classifications are the basis for precision, recall, and the other metrics. The function below assigns each field to one of five categories:
* True positive (TP): the prediction matches the ground truth and both have actual values.
* True negative (TN): the prediction and the ground truth are both empty.
* False alarm (FA): the prediction has a value but the ground truth is empty.
* False discovery (FD): the prediction and the ground truth both have values, but the values differ.
* False negative (FN): the prediction is empty but the ground truth has a value.

In [2]:
def compare_and_classify_field_predictions(ground_truth_data, prediction_data):
    """
    Compare ground truth and prediction data at field level and count metrics.
    
    Args:
        ground_truth_data: List of dictionaries containing ground truth values
        prediction_data: List of dictionaries containing prediction values
    
    Returns:
        Dictionary with counts of TP, TN, FA, FD, FN for each field type
    """
    # Collect field names from both datasets in first-seen order, excluding DocumentID.
    # A list (rather than a set) keeps the order of the results deterministic.
    fields = list(dict.fromkeys(
        key for doc in ground_truth_data + prediction_data for key in doc
        if key != 'DocumentID'
    ))
    
    # Initialize counters for each field
    results = {field: {'TP': 0, 'TN': 0, 'FA': 0, 'FD': 0, 'FN': 0} for field in fields}
    
    # Check if value is empty or None
    def is_empty(value):
        if value is None:
            return True
        if isinstance(value, str) and value.strip() == '':
            return True
        return False
    
    # Check if two values match
    def values_match(val1, val2):
        if isinstance(val1, (int, float)) and isinstance(val2, (int, float)):
            # For numeric values, allow small tolerance
            return abs(val1 - val2) < 0.01
        return val1 == val2
    
    # Process each document pair; strict=True (Python 3.10+) raises if the lists differ in length
    for gt_doc, pred_doc in zip(ground_truth_data, prediction_data, strict=True):
        if gt_doc['DocumentID'] != pred_doc['DocumentID']:
            raise ValueError("Document IDs don't match")
        
        # Check each field
        for field in fields:
            # Get values, defaulting to None if the field doesn't exist in a document
            gt_value = gt_doc.get(field, None)
            pred_value = pred_doc.get(field, None)
            
            gt_empty = is_empty(gt_value)
            pred_empty = is_empty(pred_value)
            
            if not gt_empty and not pred_empty:
                # Both have values
                if values_match(gt_value, pred_value):
                    # Values match - True Positive
                    results[field]['TP'] += 1
                else:
                    # Values don't match - False Discovery
                    results[field]['FD'] += 1
            elif gt_empty and pred_empty:
                # Both empty - True Negative
                results[field]['TN'] += 1
            elif gt_empty and not pred_empty:
                # Ground truth empty, prediction has value - False Alarm
                results[field]['FA'] += 1
            elif not gt_empty and pred_empty:
                # Ground truth has value, prediction empty - False Negative
                results[field]['FN'] += 1
    
    return results

evaluation_results = compare_and_classify_field_predictions(ground_truth_data, prediction_data)
print("Field-level count:")
print(json.dumps(evaluation_results, indent=4))

Field-level count:
{
    "invoice_number": {
        "TP": 2,
        "TN": 0,
        "FA": 1,
        "FD": 0,
        "FN": 2
    },
    "vendor": {
        "TP": 1,
        "TN": 1,
        "FA": 0,
        "FD": 1,
        "FN": 2
    },
    "total": {
        "TP": 2,
        "TN": 1,
        "FA": 0,
        "FD": 1,
        "FN": 1
    }
}
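
A quick sanity check on these counts: every (document, field) pair lands in exactly one category, so the five counts for each field should add up to the number of documents. The sketch below is only illustrative. The counts are copied by hand from the output above, and `counts_are_complete` is a hypothetical helper name, not part of the workshop code.

```python
# Counts copied from the field-level output above.
field_counts = {
    "invoice_number": {"TP": 2, "TN": 0, "FA": 1, "FD": 0, "FN": 2},
    "vendor": {"TP": 1, "TN": 1, "FA": 0, "FD": 1, "FN": 2},
    "total": {"TP": 2, "TN": 1, "FA": 0, "FD": 1, "FN": 1},
}
num_documents = 5


def counts_are_complete(counts, expected_total):
    """Return True if the five categories account for every document."""
    return sum(counts.values()) == expected_total


for field, counts in field_counts.items():
    status = "ok" if counts_are_complete(counts, num_documents) else "MISMATCH"
    print(f"{field}: {sum(counts.values())} of {num_documents} classified ({status})")
```

If a field reports a mismatch, the usual suspects are a document that was skipped or a branch in the classification logic that did not increment any counter.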


## 3. Compute evaluation metrics

Understanding these metrics helps us assess different aspects of our extraction system:

* **Precision** measures how accurate our positive predictions are. A high precision means when our system extracts a value, it's usually correct.
  * Calculation: Precision = TP / (TP + FA + FD)
  * Example: If precision is 0.90, 90% of the fields we extracted are correct.

* **Recall** measures how complete our extractions are. A high recall means we're not missing many values that should be extracted.
  * Calculation: Recall = TP / (TP + FN + FD)
  * Example: If recall is 0.85, we successfully found 85% of all fields that should have been extracted.

* **F1-score** balances precision and recall. It's particularly useful when you need a single metric to compare different models.
  * Calculation: F1-score = 2 * (Precision * Recall) / (Precision + Recall)
  * A high F1-score indicates good performance in both precision and recall.
  * A low F1-score suggests the system is struggling with either precision, recall, or both.

* **Accuracy** measures the overall correctness of predictions, including both positive and negative cases.
  * Calculation: Accuracy = (TP + TN) / (TP + TN + FA + FD + FN)
  * Particularly valuable when correct identification of empty fields is as important as correct extractions
  * Example: In legal document processing, correctly identifying when a field is empty can be crucial
  * In sparse datasets, where most fields are empty, accuracy can be dominated by true negatives, so read it alongside precision and recall
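
To make the formulas concrete, here is the arithmetic for the `vendor` field worked step by step, using the counts from the classification step (TP=1, TN=1, FA=0, FD=1, FN=2). This is an illustrative sketch only. It uses `fractions.Fraction` so the intermediate values stay exact.

```python
from fractions import Fraction

# vendor counts from the classification step
TP, TN, FA, FD, FN = 1, 1, 0, 1, 2

precision = Fraction(TP, TP + FA + FD)                 # 1 / (1 + 0 + 1)
recall = Fraction(TP, TP + FN + FD)                    # 1 / (1 + 2 + 1)
f1 = 2 * precision * recall / (precision + recall)     # harmonic mean of the two
accuracy = Fraction(TP + TN, TP + TN + FA + FD + FN)   # (1 + 1) / 5

for name, value in [("precision", precision), ("recall", recall),
                    ("f1", f1), ("accuracy", accuracy)]:
    print(f"{name:9s} = {value} = {float(value):.4f}")
```

Note that a false discovery (FD) counts against both precision and recall: the system extracted something wrong, and it also failed to extract the right value.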

Key points about these metrics:
- Each metric provides different insights into system performance
- The importance of each metric depends on your specific use case and requirements
- Precision and recall often have a trade-off relationship
- F1-score provides a balanced view of precision and recall
- Accuracy gives insight into overall system reliability, including empty field handling

When evaluating document processing systems:
- Consider your specific use case requirements
- Look at all metrics to get a complete picture of system performance
- Align metric priorities with business requirements and error impact
- Pay attention to both extraction accuracy and empty field handling

The following code implements these metrics calculations:



In [3]:
import collections

def calculate_metrics(evaluation_results):
    """
    Calculates precision, recall, F1-score, and accuracy from evaluation results
    for individual fields and for an overall summary.
    
    Args:
        evaluation_results: Dictionary where keys are field names and values are
                            dictionaries containing counts for 'TP', 'TN', 'FA',
                            'FD', and 'FN'.
    
    Returns:
        Dictionary with metrics (precision, recall, f1, accuracy) for each field
        and an "overall" summary.
    """

    def _safe_divide(numerator, denominator):
        """Helper to perform division, returning 0.0 on ZeroDivisionError."""
        try:
            return numerator / denominator
        except ZeroDivisionError:
            return 0.0

    def _calculate_metrics_from_counts(counts):
        """Helper function to calculate metrics for a single set of counts."""
        TP = counts.get('TP', 0)
        TN = counts.get('TN', 0)
        FA = counts.get('FA', 0)
        FD = counts.get('FD', 0)
        FN = counts.get('FN', 0)
        
        metrics = {}

        # Use the safe_divide helper function for each calculation
        precision = _safe_divide(TP, (TP + FA + FD))
        recall = _safe_divide(TP, (TP + FN + FD))
        
        f1 = _safe_divide((2 * precision * recall), (precision + recall))
        accuracy = _safe_divide((TP + TN), (TP + FA + FD + TN + FN))
        
        metrics['precision'] = precision
        metrics['recall'] = recall
        metrics['f1'] = f1
        metrics['accuracy'] = accuracy
        metrics['counts'] = counts
        
        return metrics

    # Calculate metrics for each field using a dictionary comprehension
    metrics = {
        field: _calculate_metrics_from_counts(counts)
        for field, counts in evaluation_results.items()
    }
    
    # Aggregate counts for overall metrics using collections.Counter
    overall_counts = collections.Counter()
    for counts in evaluation_results.values():
        overall_counts.update(counts)
    
    # Calculate overall metrics using the same helper function
    overall_metrics = _calculate_metrics_from_counts(overall_counts)
    
    # Add overall metrics to the results dictionary
    metrics['overall'] = overall_metrics
    
    return metrics

# Calculate metrics from the classification counts
metrics_results = calculate_metrics(evaluation_results)

print(json.dumps(metrics_results, indent=4))

{
    "invoice_number": {
        "precision": 0.6666666666666666,
        "recall": 0.5,
        "f1": 0.5714285714285715,
        "accuracy": 0.4,
        "counts": {
            "TP": 2,
            "TN": 0,
            "FA": 1,
            "FD": 0,
            "FN": 2
        }
    },
    "vendor": {
        "precision": 0.5,
        "recall": 0.25,
        "f1": 0.3333333333333333,
        "accuracy": 0.4,
        "counts": {
            "TP": 1,
            "TN": 1,
            "FA": 0,
            "FD": 1,
            "FN": 2
        }
    },
    "total": {
        "precision": 0.6666666666666666,
        "recall": 0.5,
        "f1": 0.5714285714285715,
        "accuracy": 0.6,
        "counts": {
            "TP": 2,
            "TN": 1,
            "FA": 0,
            "FD": 1,
            "FN": 1
        }
    },
    "overall": {
        "precision": 0.625,
        "recall": 0.4166666666666667,
        "f1": 0.5,
        "accuracy": 0.4666666666666667,
        "counts": {
            "TP": 5,
            "TN": 2,
            "FA": 1,
            "FD": 2,
            "FN": 5
        }
    }
}

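Note that the `overall` entry pools the counts from every field before computing the metrics (often called micro-averaging). An alternative that this workshop does not compute is the unweighted mean of the per-field scores (macro-averaging). The sketch below contrasts the two for F1. The counts are copied from the classification output, and `f1_from_counts`, `micro_f1`, and `macro_f1` are illustrative names rather than part of the workshop code.

```python
# Counts copied from the field-level classification output.
field_counts = {
    "invoice_number": {"TP": 2, "TN": 0, "FA": 1, "FD": 0, "FN": 2},
    "vendor": {"TP": 1, "TN": 1, "FA": 0, "FD": 1, "FN": 2},
    "total": {"TP": 2, "TN": 1, "FA": 0, "FD": 1, "FN": 1},
}


def f1_from_counts(c):
    """F1 using the workshop's precision and recall definitions."""
    precision_den = c["TP"] + c["FA"] + c["FD"]
    recall_den = c["TP"] + c["FN"] + c["FD"]
    precision = c["TP"] / precision_den if precision_den else 0.0
    recall = c["TP"] / recall_den if recall_den else 0.0
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Micro: sum the counts across fields first, then compute one F1.
pooled = {
    key: sum(c[key] for c in field_counts.values())
    for key in ("TP", "TN", "FA", "FD", "FN")
}
micro_f1 = f1_from_counts(pooled)

# Macro: compute F1 per field, then take the unweighted mean.
per_field = [f1_from_counts(c) for c in field_counts.values()]
macro_f1 = sum(per_field) / len(per_field)

print(f"micro F1 (pooled counts):  {micro_f1:.4f}")
print(f"macro F1 (mean of fields): {macro_f1:.4f}")
```

The two numbers diverge more when fields have very different numbers of non-empty values, so it is worth stating which one a report uses.
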
## 4. Visualize the results

Effective visualization of evaluation metrics helps us:
1. Quickly identify performance patterns
2. Compare performance across different fields
3. Communicate results to stakeholders
4. Identify areas needing improvement

Our visualization provides two key views:

### Overall Metrics View
Shows system-wide performance with visual progress bars where:
- Each bar is scaled to 100% for easy comparison
- Key metrics (Precision, Recall, F1-score, Accuracy) are displayed with both numeric values and visual bars

### Field-Level Details View
Provides a detailed breakdown per field showing:
- Raw counts (TP, FA, TN, FN, FD)
- Calculated metrics (Precision, Recall, F1-score, Accuracy)
- Visual F1-score representation for quick performance assessment

The following code generates these visualizations:


In [4]:
def visualize_metrics(metrics_results):
    """
    Create a visual representation of the evaluation metrics.
    
    Args:
        metrics_results: Dictionary with metrics and counts from calculate_metrics function
    """
    # Get overall metrics for the summary
    overall = metrics_results['overall']
    
    # Create a visual bar with filled and empty blocks based on percentage
    def create_visual_bar(percentage, width=20):
        filled = int(percentage * width / 100)
        return '█' * filled + '░' * (width - filled)
    

    
    print("\n--- Overall Metrics ---")
    print("Metric               Value Visual                ")
    print("--------------------------------------------------")
    
    prec = overall['precision'] * 100
    recall = overall['recall'] * 100
    f1 = overall['f1'] * 100
    accuracy = overall['accuracy'] * 100
    
    print(f"Precision           {prec:6.2f}% {create_visual_bar(prec)}  ")
    print(f"Recall              {recall:6.2f}% {create_visual_bar(recall)}  ")
    print(f"F1 Score            {f1:6.2f}% {create_visual_bar(f1)}  ")
    print(f"Accuracy            {accuracy:6.2f}% {create_visual_bar(accuracy)}  ")
    
    # === FIELD-LEVEL METRICS ===
    print("\n\n=== FIELD-LEVEL METRICS ===\n")
    
    # Header
    print("Field                    TP     FA     TN     FN     FD     Prec   Recall       F1      Acc     Visual                ")
    print("-----------------------------------------------------------------------------------------------------------------------------")
    
    # Print metrics for each field except 'overall'
    for field, metrics in metrics_results.items():
        if field == 'overall':
            continue
        
        field_counts = metrics['counts']
        tp = field_counts['TP']
        fa = field_counts['FA']
        tn = field_counts['TN']
        fn = field_counts['FN']
        fd = field_counts['FD']
        
        prec = metrics['precision'] * 100
        recall = metrics['recall'] * 100
        f1 = metrics['f1'] * 100
        accuracy = metrics['accuracy'] * 100
        
        # Choose which metric to visualize (using F1 score)
        visual = create_visual_bar(f1)
        
        print(f"{field:20s} {tp:6d} {fa:6d} {tn:6d} {fn:6d} {fd:6d} {prec:8.2f}% {recall:6.2f}% {f1:9.2f}% {accuracy:6.2f}%    {visual}  ")

visualize_metrics(metrics_results)


--- Overall Metrics ---
Metric               Value Visual                
--------------------------------------------------
Precision            62.50% ████████████░░░░░░░░  
Recall               41.67% ████████░░░░░░░░░░░░  
F1 Score             50.00% ██████████░░░░░░░░░░  
Accuracy             46.67% █████████░░░░░░░░░░░  


=== FIELD-LEVEL METRICS ===

Field                    TP     FA     TN     FN     FD     Prec   Recall       F1      Acc     Visual                
-----------------------------------------------------------------------------------------------------------------------------
invoice_number            2      1      0      2      0    66.67%  50.00%     57.14%  40.00%    ███████████░░░░░░░░░  
vendor                    1      0      1      2      1    50.00%  25.00%     33.33%  40.00%    ██████░░░░░░░░░░░░░░  
total                     2      0      1      1      1    66.67%  50.00%     57.14%  60.00%    ███████████░░░░░░░░░  
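
With more than a handful of fields, it can help to sort the per-field results so the weakest fields surface first. The following is a minimal sketch, assuming a dictionary shaped like the one returned by `calculate_metrics`. The sample values are copied from the table above and rounded, and `rank_fields` is a hypothetical helper, not part of the workshop code.

```python
def rank_fields(metrics_results, metric="f1"):
    """Return (field, score) pairs sorted from worst to best, skipping 'overall'."""
    scores = [
        (field, field_metrics[metric])
        for field, field_metrics in metrics_results.items()
        if field != "overall"
    ]
    return sorted(scores, key=lambda pair: pair[1])


# Rounded values from the field-level table above.
sample_metrics = {
    "invoice_number": {"f1": 0.5714, "recall": 0.50},
    "vendor": {"f1": 0.3333, "recall": 0.25},
    "total": {"f1": 0.5714, "recall": 0.50},
    "overall": {"f1": 0.50, "recall": 0.4167},
}

for field, score in rank_fields(sample_metrics):
    print(f"{field:20s} F1 = {score:.2%}")
```

Passing `metric="recall"` ranks by recall instead, which is useful when missed values are the main concern.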


## Concluding Remarks

In this workshop, we built a foundation for evaluating structured data extraction systems through:
1. A classification framework (TP, TN, FA, FD, FN) for comparing extracted versus ground truth values
2. Key performance metrics (precision, recall, F1-score, accuracy) derived from these classifications
3. Visualization tools to analyze performance at system and field levels

Important considerations:
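- Our example used exact matching for text and a fixed 0.01 tolerance for numbers. Real applications usually need more sophisticated comparison logic:
  * Case and punctuation sensitivity
  * Numeric tolerances suited to each field
  * Approximate matching for longer texts
  * Format standardization (for example, dates and currency amounts)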
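
As one possible starting point for these comparison options, the sketch below normalizes text before comparing and allows an optional similarity threshold. It relies only on the standard library (`difflib.SequenceMatcher` and `math.isclose`). The helper names, the threshold values, and the choice of `SequenceMatcher` as the similarity measure are assumptions made for illustration, not something the workshop prescribes.

```python
import math
import string
from difflib import SequenceMatcher


def normalize_text(value):
    """Lowercase, strip ASCII punctuation, and collapse whitespace."""
    without_punct = value.translate(str.maketrans("", "", string.punctuation))
    return " ".join(without_punct.lower().split())


def text_matches(expected, predicted, threshold=1.0):
    """Compare normalized strings; a threshold below 1.0 allows approximate matches."""
    a, b = normalize_text(expected), normalize_text(predicted)
    return SequenceMatcher(None, a, b).ratio() >= threshold


def numbers_match(expected, predicted, abs_tol=0.01):
    """Compare numbers within an absolute tolerance."""
    return math.isclose(expected, predicted, abs_tol=abs_tol)


print(text_matches("Acme Corp.", "ACME  corp"))                         # case/punctuation ignored
print(text_matches("Global Services", "Global Svcs"))                   # strict comparison
print(text_matches("Global Services", "Global Svcs", threshold=0.8))    # approximate comparison
print(numbers_match(2200.50, 2250.00))
```

A looser threshold turns some false discoveries into true positives, so the threshold should be chosen per field: an approximate match may be acceptable for a vendor name but not for an invoice number.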

- Our example used a flat structure, but real documents often have complex nested data:
  * Lists (e.g., line items in invoices)
  * Nested objects (e.g., address details)
  * Arrays of objects (e.g., multiple parties in contracts)
  * These require additional logic for matching and metric calculations
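
One simple way to reuse a flat, field-by-field comparison on nested data is to flatten each document into dotted keys first. The sketch below is a minimal illustration under a strong assumption: list items are matched by position, which breaks down if the extraction returns line items in a different order. The `flatten` helper and the sample invoice are hypothetical.

```python
def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts and lists into a single-level dict with dotted keys."""
    if isinstance(record, dict):
        items = record.items()
    elif isinstance(record, list):
        # Assumption: list items are matched by position (index).
        items = ((str(i), item) for i, item in enumerate(record))
    else:
        # Leaf value: store it under the accumulated key.
        return {parent_key: record}

    flat = {}
    for key, value in items:
        full_key = f"{parent_key}{sep}{key}" if parent_key else str(key)
        flat.update(flatten(value, full_key, sep))
    return flat


# Hypothetical nested invoice, for illustration only.
nested_invoice = {
    "DocumentID": 1,
    "vendor": {"name": "Acme Corp", "address": {"city": "Springfield"}},
    "line_items": [
        {"description": "Widget", "amount": 100.0},
        {"description": "Gadget", "amount": 250.0},
    ],
}

for key, value in flatten(nested_invoice).items():
    print(f"{key}: {value}")
```

Because `DocumentID` stays a top-level key, flattened ground truth and prediction documents keep the same shape as the flat samples used earlier. For unordered lists, a matching step that pairs items before flattening would be needed.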

This framework provides a starting point - adapt the matching logic, handling of nested structures, and visualization to meet your specific evaluation needs.

### Next Steps

* Customize the evaluation framework for your specific document types and field structures
* Implement appropriate matching techniques for different field types in your data
* Integrate this evaluation process into your model development and monitoring workflows
* Use the insights gained to iteratively improve your extraction models
