# Understanding Structured Data Evaluation Metrics

## Overview
In this workshop, you'll learn to measure and analyze key evaluation metrics for structured data extraction models. Understanding these metrics is crucial for:
- Assessing model accuracy at field and document levels
- Identifying problematic fields and patterns
- Optimizing extraction performance
- Quality assurance of automated data extraction

## Key Metrics We'll Cover
1. Performance metrics
- Precision: Share of extracted fields that are correct
- Recall: Share of expected fields that were extracted correctly
- F1-Score: Harmonic mean of precision and recall

2. Evaluation levels
- Overall Document Level: Aggregate metrics across all fields
- Field-Level Analysis: Individual performance for each field type
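
As a minimal sketch of how the three performance metrics are defined in terms of true positives (TP), false positives (FP), and false negatives (FN), the helper below computes them from raw counts. The function name `prf1` and the example counts are illustrative only and are not part of the evaluation library used later in this workshop.

```python
def prf1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, f1) from raw counts.

    Each metric falls back to 0.0 when its denominator is zero.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Hypothetical counts: 8 correct extractions, 2 wrong extractions, 1 missed field
precision, recall, f1 = prf1(tp=8, fp=2, fn=1)
print(f"precision={precision:.2%} recall={recall:.2%} f1={f1:.2%}")
```

The same function applies at both evaluation levels: pass the counts for a single field to get field-level metrics, or the counts summed over all fields to get document-level metrics.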

## Use case
We'll evaluate an invoice processing system that extracts key information from business documents. This demonstrates real-world structured data evaluation in a scenario requiring high accuracy across multiple fields. 

## Prerequisites
- AWS account with Bedrock access
- Python 3.10+
- boto3 library



## Setup and dependencies

In [1]:
from genaidp_lib.key_information_evaluation.structured_object_evaluator.models.structured_model import StructuredModel
from genaidp_lib.key_information_evaluation.structured_object_evaluator.models.comparable_field import ComparableField
from genaidp_lib.key_information_evaluation.common.comparators.levenshtein import LevenshteinComparator
from genaidp_lib.key_information_evaluation.structured_object_evaluator.evaluator import StructuredModelEvaluator
from genaidp_lib.key_information_evaluation.structured_object_evaluator.utils.pretty_print import print_confusion_matrix

## 1. Data sample

Let's consider a single invoice document with three essential fields to extract: the invoice number, vendor name, and total amount. The following example shows both the ground truth (what we expect) and the prediction (what our system extracted):

In [2]:
# Ground truth data (what we expect)
ground_truth_data = {
    'invoice_number': 'INV-2023-001',
    'vendor': 'ABC Corporation',
    'total': 1500.00
}

# Predicted data (what our system extracted)
prediction_data = {
    'invoice_number': 'INV-2023-001',  # Perfect match
    'vendor': 'ABC Corp',              # Abbreviated - should still match
    'total': 1499.95                   # Off by 0.05 - very close numerically
}

This example represents common real-world scenarios in document extraction where we encounter exact matches for standardized fields like invoice numbers, text variations for company names, and small numerical differences in monetary values. We will use this data to demonstrate how the evaluation framework processes these different types of matches and calculates the corresponding accuracy metrics.
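
To see why plain equality is too strict for this kind of data, a field-by-field exact comparison can be sketched without any library. This is only a baseline for illustration; the variable name `exact_match` is not part of the framework.

```python
ground_truth_data = {
    'invoice_number': 'INV-2023-001',
    'vendor': 'ABC Corporation',
    'total': 1500.00,
}
prediction_data = {
    'invoice_number': 'INV-2023-001',
    'vendor': 'ABC Corp',
    'total': 1499.95,
}

# Compare each expected field with the predicted value using strict equality
exact_match = {
    field: prediction_data.get(field) == expected
    for field, expected in ground_truth_data.items()
}

for field, matched in exact_match.items():
    print(f"{field}: {'match' if matched else 'mismatch'}")
```

Under strict equality only the invoice number matches. The abbreviated vendor name and the slightly different total are both rejected, which is the gap that comparators with thresholds are meant to address.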

## 2. Define your data structure

Before we can evaluate our extracted data, we need to define how each field should be compared and weighted. We create a structured model that specifies the comparison rules for each field in our invoice document:

In [3]:
class Invoice(StructuredModel):
    """Simple invoice model for demonstration."""
    
    invoice_number: str = ComparableField(
        comparator=LevenshteinComparator(),
        threshold=0.9,
        weight=2.0  # This field is important
    )
    
    vendor: str = ComparableField(
        comparator=LevenshteinComparator(),
        threshold=0.5,  # Much more forgiving for demo purposes
        weight=1.0
    )
    
    total: float = ComparableField(
        threshold=0.8,  # More forgiving threshold for demo
        weight=2.0  # This field is important
    )


In this structure, we define three fields with different comparison rules. The invoice number and total amount are given a higher weight (2.0) than the vendor name (1.0) to reflect their importance in the evaluation. We use the Levenshtein comparator for text fields, with a strict threshold (0.9) for invoice numbers but a more lenient one (0.5) for vendor names to accommodate common variations in company names. The total amount does not set an explicit comparator, so it relies on the library's default, and uses a threshold of 0.8 that is intended to tolerate minor numerical differences.
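
To build intuition for the thresholds on the text fields, the sketch below computes a normalized Levenshtein similarity in plain Python. It is not the library's `LevenshteinComparator`: the normalization (one minus the edit distance divided by the longer string's length) and the `>=` threshold check are assumptions made for illustration, and the library may score strings differently.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


vendor_score = similarity("ABC Corporation", "ABC Corp")
invoice_score = similarity("INV-2023-001", "INV-2023-001")

print(f"vendor: {vendor_score:.3f}, passes 0.5: {vendor_score >= 0.5}")
print(f"invoice_number: {invoice_score:.3f}, passes 0.9: {invoice_score >= 0.9}")
```

Under this assumed normalization, "ABC Corp" is 7 deletions away from the 15-character "ABC Corporation", giving a similarity of about 0.533. That clears a 0.5 threshold but would fail a 0.9 threshold, which illustrates why the vendor field is configured more leniently than the invoice number.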

## 3. Initialize data structure

Now we convert our raw data into structured objects that follow our defined Invoice model. We use the from_json method to create Invoice instances from both our prediction and ground truth data:

In [4]:
prediction = Invoice.from_json(prediction_data)
ground_truth = Invoice.from_json(ground_truth_data)

print("Predicted values:")
for field_name in prediction.__fields__:
    print(f"{field_name}: {getattr(prediction, field_name)}")

print("\nGround truth values:")
for field_name in ground_truth.__fields__:
    print(f"{field_name}: {getattr(ground_truth, field_name)}")

Predicted values:
extra_fields: {}
invoice_number: INV-2023-001
vendor: ABC Corp
total: 1499.95

Ground truth values:
extra_fields: {}
invoice_number: INV-2023-001
vendor: ABC Corporation
total: 1500.0


The printed output shows our data has been successfully structured according to our Invoice model. We can see the exact values for each field, and the empty extra_fields dictionary indicates that all our data fits within our defined structure. These structured objects will allow us to perform our evaluation using the comparison rules we defined in the previous step.
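
The idea behind an empty extra_fields dictionary can be illustrated without the library: keys declared on the model are kept as fields, and any other key would be collected separately. The sketch below is a plain-Python illustration of that idea only. `DECLARED_FIELDS`, `split_declared_and_extra`, and the `currency` key are hypothetical, and the library's actual handling of undeclared keys may differ.

```python
# Field names declared on the Invoice model in the previous step
DECLARED_FIELDS = ("invoice_number", "vendor", "total")


def split_declared_and_extra(raw: dict) -> tuple[dict, dict]:
    """Partition a raw record into declared fields and everything else."""
    declared = {key: value for key, value in raw.items() if key in DECLARED_FIELDS}
    extra = {key: value for key, value in raw.items() if key not in DECLARED_FIELDS}
    return declared, extra


# The workshop data contains only declared keys, so nothing is left over
_, extra_from_sample = split_declared_and_extra(
    {"invoice_number": "INV-2023-001", "vendor": "ABC Corp", "total": 1499.95}
)
print(extra_from_sample)

# A hypothetical record with an undeclared key ("currency")
declared, extra = split_declared_and_extra(
    {"invoice_number": "INV-2023-001", "total": 1499.95, "currency": "USD"}
)
print(declared)
print(extra)
```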

## 4. Structured model evaluation

We use the StructuredModelEvaluator to compare our prediction against the ground truth and visualize the results:

In [5]:
evaluator = StructuredModelEvaluator()
result = evaluator.evaluate(ground_truth, prediction)
print(print_confusion_matrix(result, show_details=True))  

=== CONFUSION MATRIX SUMMARY ===

--- Raw Counts ---
Metric             Count
-------------------------
True Positive          2
False Positive         1
True Negative          0
False Negative         0
False Discovery        1

--- Derived Metrics ---
Metric               Value Visual                
--------------------------------------------------
Precision           66.67% █████████████░░░░░░░  
Recall             100.00% ████████████████████  
F1 Score            80.00% ████████████████░░░░  
Accuracy            66.67% █████████████░░░░░░░  


=== FIELD-LEVEL METRICS ===

Field                    TP     FP     TN     FN     FD     Prec   Recall       F1      Acc Visual                
-----------------------------------------------------------------------------------------------------------------------------
invoice_number            1      0      0      0      0  100.00%  100.00%  100.00%  100.00% ████████████████████  

total                     0      1      0      0      1  

The full output is organized into three sections. The excerpt above is cut off partway through the field-level table, so the vendor row and the third section are not shown:

1. **Confusion Matrix Summary** shows the overall performance:
   - Raw counts (True Positives, False Positives, True Negatives, False Negatives, False Discoveries)
   - Derived metrics (Precision, Recall, F1-score, Accuracy)
   - Visual representation of metrics using progress bars

2. **Field-Level Metrics** breaks down performance by field:
   - Individual metrics for each field (invoice_number, vendor, total)
   - Shows Precision, Recall, F1-score, and Accuracy per field
   - Visual indicators of performance levels

3. **Confusion Matrix Visualization** provides a visual representation of matches and mismatches:
   - Shows distribution of True Positives (T), False Positives (F), and False Discoveries (D)
   - Includes percentage breakdown of each category
   - Helps quickly identify patterns in matching performance

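This detailed breakdown helps us understand how well the extraction system performs both overall and at the field level, making it easier to identify areas for improvement. In this run, invoice_number is counted as a true positive, while total is counted as a false positive (false discovery) despite the small gap between 1499.95 and 1500.00. That single miss is what brings precision down to 66.67%.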
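
The derived metrics in the summary can be reproduced by hand from the raw counts reported above (TP=2, FP=1, TN=0, FN=0). The sketch below does this in plain Python. It treats the False Discovery count as a breakdown of the false positives rather than an additional error, which is an assumption; it is consistent with the reported figures, since adding it separately would not give 66.67%.

```python
# Raw counts copied from the confusion matrix summary
tp, fp, tn, fn = 2, 1, 0, 0

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fp + tn + fn)

for name, value in [
    ("Precision", precision),
    ("Recall", recall),
    ("F1 Score", f1),
    ("Accuracy", accuracy),
]:
    print(f"{name:<12}{value:>8.2%}")
```

The printed values (66.67%, 100.00%, 80.00%, 66.67%) agree with the Derived Metrics table. Recall stays at 100% because no expected field is counted as a false negative, while the one false positive lowers both precision and accuracy to two out of three.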

## Conclusion

The structured data evaluation framework provides a systematic way to assess extraction accuracy through multiple levels of analysis. By calculating precision, recall, and F1-scores at both the overall document level and individual field level, we can thoroughly understand the performance of our extraction system.

The detailed evaluation output, including confusion matrices and visual representations, helps identify patterns in matching performance and highlights areas that may need attention. This comprehensive approach to evaluation is essential for real-world applications where understanding the accuracy and reliability of extracted data is crucial.

### Next steps
For further exploration, the framework can be adapted to evaluate different types of structured data and various matching requirements, making it a versatile tool for assessing data extraction performance. 
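
As one example of a different matching requirement, monetary amounts are often better compared by relative difference than by string distance. The function below is a standalone sketch of such a rule. `relative_similarity` is a hypothetical helper, not a comparator shipped with the library, and plugging it into a `ComparableField` would depend on the library's comparator interface, which is not covered in this workshop.

```python
def relative_similarity(expected: float, actual: float) -> float:
    """Score in [0, 1]: 1.0 for identical values, lower as the relative gap grows."""
    scale = max(abs(expected), abs(actual))
    if scale == 0:
        return 1.0  # both values are zero, so they are identical
    return max(0.0, 1.0 - abs(expected - actual) / scale)


score = relative_similarity(1500.00, 1499.95)
print(f"similarity: {score:.6f}, passes 0.8 threshold: {score >= 0.8}")
```

With this rule, a 0.05 difference on a total of 1500.00 scores very close to 1.0, so it would pass a 0.8 threshold comfortably. Whether that tolerance is appropriate depends on the use case; for financial reconciliation an exact match may be the safer choice.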