# Evaluation Methods Demonstration

This notebook demonstrates the various evaluation methods available in the IDP library for comparing expected values with actual extraction results. It covers:

1. All evaluation methods with both match and no-match scenarios
2. Threshold testing for applicable methods
3. Edge cases:
   - Attribute not found in actual results
   - Attribute not found in expected results
   - Attribute not found in either actual or expected results

In [None]:
# Let's make sure that modules are autoreloaded
%load_ext autoreload
%autoreload 2

ROOTDIR="../.."
# First uninstall existing package (to ensure we get the latest version)
%pip uninstall -y idp_common

# Install the IDP common package with all components in development mode
%pip install -q -e "{ROOTDIR}/lib/idp_common_pkg[dev, all]"

# Note: We can also install specific components like:
# %pip install -q -e "{ROOTDIR}/lib/idp_common_pkg[ocr,classification,extraction,evaluation]"

# Optionally use a .env file for environment variables
try:
    from dotenv import load_dotenv
    load_dotenv()  
except ImportError:
    pass  

In [None]:
# Import necessary libraries
import sys
import os
import json
from typing import Dict, Any, List, Tuple, Optional
import logging

# Add parent directory to path to import the library
project_root = os.path.abspath(os.path.join(os.getcwd(), '..'))
if project_root not in sys.path:
    sys.path.append(project_root)

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger()

# Import IDP libraries
from idp_common.evaluation.models import EvaluationMethod
from idp_common.evaluation.comparator import compare_values
from idp_common.evaluation.service import EvaluationService
from idp_common.models import Document, Section, Status

print("Libraries imported successfully")

## Part 1: Comparing Individual Values with Different Methods

We'll test each evaluation method with matching and non-matching examples.

In [None]:
def test_comparison(method: EvaluationMethod, expected: Any, actual: Any, 
                    threshold: float = 0.8, document_class: str = "TestDoc",
                    attr_name: str = "test_attr", attr_description: str = "Test attribute"):
    """Test a comparison method and print results."""
    
    print(f"\n{'-'*60}")
    print(f"Method: {method.name}")
    print(f"Expected: {expected}")
    print(f"Actual: {actual}")
    
    if method in [EvaluationMethod.FUZZY, EvaluationMethod.SEMANTIC]:
        print(f"Threshold: {threshold}")
    
    # Set up LLM config for the LLM method
    llm_config = None
    if method == EvaluationMethod.LLM:
        llm_config = {
            "model": "us.amazon.nova-lite-v1:0",
            "temperature": 0.0,
            "top_k": 250,
            "system_prompt": "You are an evaluator that helps determine if the predicted and expected values match for document attribute extraction.",
            "task_prompt": """I need to evaluate attribute extraction for a document of class: {DOCUMENT_CLASS}.

For the attribute named "{ATTRIBUTE_NAME}" described as "{ATTRIBUTE_DESCRIPTION}":
- Expected value: {EXPECTED_VALUE}
- Actual value: {ACTUAL_VALUE}

Do these values match in meaning, taking into account formatting differences, word order, abbreviations, and semantic equivalence?
Provide your assessment as a JSON with three fields:
- "match": boolean (true if they match, false if not)
- "score": number between 0 and 1 representing the confidence/similarity score
- "reason": brief explanation of your decision

Respond ONLY with the JSON and nothing else.  Here's the exact format:
{
  "match": true or false,
  "score": 0.0 to 1.0,
  "reason": "Your explanation here"
}
"""
        }
    
    # Perform the comparison
    matched, score, reason = compare_values(
        expected=expected,
        actual=actual,
        method=method,
        threshold=threshold,
        document_class=document_class,
        attr_name=attr_name,
        attr_description=attr_description,
        llm_config=llm_config
    )
    
    print(f"Matched: {matched}")
    print(f"Score: {score}")
    if reason:
        print(f"Reason: {reason}")
        
    return matched, score, reason

### Test 1: EXACT Method
Testing exact string matching with both match and non-match cases.
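
The third example in the next cell expects a match despite differences in casing and punctuation, which suggests that values are normalized before they are compared. The sketch below shows one way such a normalization could work. The helper `normalized_exact_match` is hypothetical and is not the `idp_common` implementation, whose exact rules may differ.

```python
import string


def normalized_exact_match(expected, actual) -> bool:
    """Compare two values as strings, ignoring case, punctuation, and whitespace.

    Hypothetical helper for illustration only.
    """
    def normalize(value) -> str:
        text = str(value).lower()
        # Drop ASCII punctuation and all whitespace before comparing.
        return "".join(
            ch for ch in text
            if ch not in string.punctuation and not ch.isspace()
        )

    return normalize(expected) == normalize(actual)


print(normalized_exact_match("Account Number: 12345", "account number 12345"))
print(normalized_exact_match("Account #12345", "Account #12346"))
```

Normalizing both sides in the same way keeps the comparison symmetric: swapping expected and actual never changes the outcome.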

In [None]:
# EXACT method - Match
test_comparison(EvaluationMethod.EXACT, "Account #12345", "Account #12345")

# EXACT method - No match
test_comparison(EvaluationMethod.EXACT, "Account #12345", "Account #12346")

# EXACT method - Match with different casing and punctuation
test_comparison(EvaluationMethod.EXACT, "Account Number: 12345", "account number 12345")

### Test 2: NUMERIC_EXACT Method
Testing numeric comparison with different formats.
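
To make "different formats" concrete, here is a minimal sketch of numeric normalization, assuming that currency symbols and thousands separators are ignored and that parentheses denote a negative amount (accounting notation). The helpers `parse_number` and `numeric_exact_match` are hypothetical; the library's own parsing rules may differ.

```python
from decimal import Decimal, InvalidOperation


def parse_number(value):
    """Convert a formatted amount such as "$1,250.00" to a Decimal.

    Returns None when the value cannot be read as a number.
    Hypothetical helper for illustration only.
    """
    text = str(value).strip()
    # Accounting notation: "(1,250.00)" means -1250.00.
    negative = text.startswith("(") and text.endswith(")")
    if negative:
        text = text[1:-1]
    text = text.replace("$", "").replace(",", "").strip()
    try:
        number = Decimal(text)
    except InvalidOperation:
        return None
    return -number if negative else number


def numeric_exact_match(expected, actual) -> bool:
    """True when both values parse to the same number."""
    left, right = parse_number(expected), parse_number(actual)
    return left is not None and left == right


print(numeric_exact_match("$1,250.00", 1250))
print(numeric_exact_match("(1,250.00)", "-1250"))
```

`Decimal` is used instead of `float` so that amounts such as `0.10` compare exactly rather than through binary floating-point approximations.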

In [None]:
# NUMERIC_EXACT method - Match
test_comparison(EvaluationMethod.NUMERIC_EXACT, "$1,250.00", 1250)

# NUMERIC_EXACT method - No match
test_comparison(EvaluationMethod.NUMERIC_EXACT, "$1,250.00", 1251)

# NUMERIC_EXACT method - Match with different formats
test_comparison(EvaluationMethod.NUMERIC_EXACT, "(1,250.00)", "-1250")

### Test 3: FUZZY Method
Testing fuzzy comparison with different thresholds.
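
The role of the threshold can be sketched with the standard library's `difflib.SequenceMatcher`. This is only an illustration: the similarity measure used by the FUZZY method is not shown in this notebook and may produce different scores, and `fuzzy_match` is a hypothetical helper.

```python
from difflib import SequenceMatcher


def fuzzy_match(expected: str, actual: str, threshold: float = 0.8):
    """Return (matched, score), where score is a similarity ratio in [0, 1].

    Hypothetical helper; the score itself does not depend on the threshold.
    """
    score = SequenceMatcher(None, expected.lower(), actual.lower()).ratio()
    return score >= threshold, score


for threshold in (0.6, 0.8, 0.9):
    matched, score = fuzzy_match("John A. Smith", "John Smith", threshold)
    print(f"threshold={threshold}: matched={matched}, score={score:.3f}")
```

Because the score is computed independently of the threshold, a single comparison can be re-checked against several thresholds without rerunning it, which is what the `score >= 0.6` check in the next cell does.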

In [None]:
# FUZZY method - High match
test_comparison(EvaluationMethod.FUZZY, "John A. Smith", "John Smith", threshold=0.8)

# FUZZY method - Medium match 
matched, score, _ = test_comparison(EvaluationMethod.FUZZY, "John A. Smith", "John Simpson", threshold=0.8)
print(f"With threshold=0.6: {score >= 0.6}")

# FUZZY method - Low match
test_comparison(EvaluationMethod.FUZZY, "John Alexander Smith", "Jane Marie Johnson", threshold=0.8)

### Test 4: HUNGARIAN Method
Testing list comparison using the Hungarian algorithm.

The Hungarian method is used for optimal matching between two lists when order doesn't matter. It automatically pairs items from each list for the maximum possible match score.

**New Feature:** The Hungarian method now supports a comparator setting, passed to `compare_values` as `comparator_type` and written in attribute configuration as `hungarian_comparator`. It can be set to:
- `EXACT`: Exact string matching (default)
- `FUZZY`: Fuzzy string matching with similarity scoring
- `NUMERIC`: Numeric comparison for handling different number formats

This allows for greater flexibility when comparing lists of different types of values.
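
To illustrate what "optimal matching when order doesn't matter" means, the sketch below finds the best one-to-one pairing by brute force over permutations. For small lists this reaches the same optimum as the Hungarian algorithm, but it is not that algorithm; a real implementation would more likely rely on something like SciPy's `scipy.optimize.linear_sum_assignment`. The function names and the choice to normalize by the longer list are assumptions made for this example.

```python
from itertools import permutations


def exact_score(a, b) -> float:
    """Pairwise scorer: 1.0 for identical strings, else 0.0."""
    return 1.0 if str(a) == str(b) else 0.0


def best_assignment_score(expected, actual, scorer=exact_score) -> float:
    """Average score of the best one-to-one pairing between two lists.

    Brute-force stand-in for the Hungarian algorithm (small lists only).
    Assumes a symmetric scorer and normalizes by the longer list, so
    unmatched items lower the score.
    """
    if not expected and not actual:
        return 1.0
    short, long_ = sorted((list(expected), list(actual)), key=len)
    best = 0.0
    # Try every way of pairing each item of the shorter list
    # with a distinct item of the longer list.
    for chosen in permutations(long_, len(short)):
        total = sum(scorer(s, c) for s, c in zip(short, chosen))
        best = max(best, total)
    return best / len(long_)


expected_items = ["Deposit: $500", "Withdrawal: $150", "Transfer: $200"]
print(best_assignment_score(expected_items, ["Deposit: $500", "Transfer: $200", "Withdrawal: $150"]))
print(best_assignment_score(expected_items, ["Deposit: $500", "Withdrawal: $150", "Transfer: $210"]))
```

Swapping `scorer` for a fuzzy or numeric comparison function mirrors the idea behind the comparator options listed above.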

In [None]:
# HUNGARIAN method - Full match with EXACT comparator (default)
expected_list = ["Deposit: $500", "Withdrawal: $150", "Transfer: $200"]
actual_list = ["Deposit: $500", "Transfer: $200", "Withdrawal: $150"]
test_comparison(EvaluationMethod.HUNGARIAN, expected_list, actual_list)

# HUNGARIAN method - Test with explicit comparator_type
print("\nDemonstrating the new comparator_type functionality:")
print("-" * 60)

# Test with EXACT comparator
expected_list = ["Deposit: $500", "Withdrawal: $150", "Transfer: $200"]
actual_list = ["Deposit: $500", "Withdrawal: $150", "Transfer: $210"]  # Small difference in last value
matched, score, reason = compare_values(
    expected=expected_list,
    actual=actual_list,
    method=EvaluationMethod.HUNGARIAN,
    comparator_type="EXACT"  # Using the new comparator_type parameter
)
print(f"HUNGARIAN with EXACT comparator:")
print(f"  Expected: {expected_list}")
print(f"  Actual: {actual_list}")
print(f"  Result: Matched={matched}, Score={score:.2f}")

# Test with FUZZY comparator
matched, score, reason = compare_values(
    expected=expected_list,
    actual=actual_list,
    method=EvaluationMethod.HUNGARIAN,
    comparator_type="FUZZY",  # Using fuzzy comparator
    threshold=0.7  # Explicit threshold for fuzzy matching
)
print(f"\nHUNGARIAN with FUZZY comparator (threshold 0.7):")
print(f"  Expected: {expected_list}")
print(f"  Actual: {actual_list}")
print(f"  Result: Matched={matched}, Score={score:.2f}")

# Test with NUMERIC comparator (non-matching case)
expected_num_list = ["$500", "$150.00", "$200"]
actual_num_list = ["500", "150", "210"]  # Last value differs numerically (210 vs. 200)
matched, score, reason = compare_values(
    expected=expected_num_list,
    actual=actual_num_list,
    method=EvaluationMethod.HUNGARIAN,
    comparator_type="NUMERIC"  # Using numeric comparator
)
print(f"\nHUNGARIAN with NUMERIC comparator (non-matching case):")
print(f"  Expected: {expected_num_list}")
print(f"  Actual: {actual_num_list}")
print(f"  Result: Matched={matched}, Score={score:.2f}")
print(f"  Note: The values '$200' and '210' don't match numerically")

# Test with NUMERIC comparator (matching case)
expected_num_list = ["$500", "$150.00", "$200"]
actual_num_list = ["500", "150", "200"]  # Exact numeric matches after normalization
matched, score, reason = compare_values(
    expected=expected_num_list,
    actual=actual_num_list,
    method=EvaluationMethod.HUNGARIAN,
    comparator_type="NUMERIC"  # Using numeric comparator
)
print(f"\nHUNGARIAN with NUMERIC comparator (matching case):")
print(f"  Expected: {expected_num_list}")
print(f"  Actual: {actual_num_list}")
print(f"  Result: Matched={matched}, Score={score:.2f}")
print(f"  Note: All numeric values match after normalization")

# HUNGARIAN method - Non-list values (should convert to list)
test_comparison(EvaluationMethod.HUNGARIAN, "Single item", "Single item")

### Test 5: LLM Method
Testing semantic comparison using a Large Language Model.
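
The LLM method relies on a prompt template with placeholders such as `{DOCUMENT_CLASS}` and on a JSON reply containing `match`, `score`, and `reason`. The sketch below shows how such a template could be filled and how a reply could be parsed defensively. Both helpers are hypothetical and no model is called; how `idp_common` actually performs these steps is not shown in this notebook.

```python
import json


def fill_prompt(template: str, values: dict) -> str:
    """Substitute {PLACEHOLDER} tokens in a prompt template.

    str.format would trip over the literal JSON braces in the template,
    so each placeholder is replaced with a plain string substitution.
    """
    for name, value in values.items():
        template = template.replace("{" + name + "}", str(value))
    return template


def parse_verdict(response_text: str):
    """Extract (match, score, reason) from a reply; fall back to a non-match."""
    start, end = response_text.find("{"), response_text.rfind("}")
    try:
        data = json.loads(response_text[start:end + 1])
        return bool(data["match"]), float(data["score"]), str(data.get("reason", ""))
    except (ValueError, KeyError, TypeError):
        return False, 0.0, "Could not parse model response"


prompt = fill_prompt(
    'Class: {DOCUMENT_CLASS}. Expected: {EXPECTED_VALUE}. Reply as {"match": true or false}',
    {"DOCUMENT_CLASS": "BankStatement", "EXPECTED_VALUE": "$1,250"},
)
print(prompt)
print(parse_verdict('{"match": true, "score": 0.95, "reason": "Same amounts"}'))
```

Treating an unparseable reply as a non-match is one possible policy; raising an error instead would make such failures more visible.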

In [None]:
# LLM method - High semantic match (different wording, same meaning)
test_comparison(
    EvaluationMethod.LLM,
    "Monthly statement showing deposits of $1,250, withdrawals of $850, ending balance of $2,400.",
    "Statement with deposits totaling $1,250 and withdrawals of $850, leaving a balance of $2,400.",
    document_class="BankStatement",
    attr_name="statement_summary",
    attr_description="Summary of the bank statement"
)

In [None]:
# LLM method - No semantic match (different meaning)
test_comparison(
    EvaluationMethod.LLM,
    "Monthly statement showing deposits of $1,250, withdrawals of $850, ending balance of $2,400.",
    "Statement with deposits of $2,500 and withdrawals of $1,200, leaving a balance of $3,800.",
    document_class="BankStatement",
    attr_name="statement_summary",
    attr_description="Summary of the bank statement"
)

In [None]:
# LLM method - Partial semantic match (some differences)
test_comparison(
    EvaluationMethod.LLM,
    "Policy effective date: January 15, 2023 to January 14, 2024",
    "Policy period begins on Jan 15, 2023 and expires on Jan 15, 2024",
    document_class="InsurancePolicy",
    attr_name="policy_period",
    attr_description="The dates during which the insurance policy is effective"
)

### Test 6: SEMANTIC Method
Testing semantic comparison using embeddings with different thresholds.
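
Embedding-based comparison usually reduces to a similarity score between two vectors, which is then checked against the threshold. The sketch below assumes cosine similarity and uses tiny hand-written vectors in place of real embeddings; the embedding model and the similarity measure used by the SEMANTIC method are not shown in this notebook, and the helper names are hypothetical.

```python
import math


def cosine_similarity(u, v) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def semantic_match(expected_vec, actual_vec, threshold: float = 0.8):
    """Return (matched, score) for two embedding vectors."""
    score = cosine_similarity(expected_vec, actual_vec)
    return score >= threshold, score


# Toy vectors standing in for real text embeddings.
for threshold in (0.5, 0.7, 0.8, 0.9):
    matched, score = semantic_match([1.0, 1.0], [1.0, 0.0], threshold)
    print(f"Threshold {threshold}: Match={matched}, Score={score:.4f}")
```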

In [None]:
# SEMANTIC method - High similarity (different wording but same meaning)
test_comparison(
    EvaluationMethod.SEMANTIC,
    "Monthly statement showing deposits of $1,250, withdrawals of $850, ending balance of $2,400.",
    "Statement with deposits totaling $1,250 and withdrawals of $850, leaving a balance of $2,400.",
    threshold=0.8
)

# SEMANTIC method - Medium similarity (related content)
matched, score, _ = test_comparison(
    EvaluationMethod.SEMANTIC,
    "Policy effective date: January 15, 2023 to January 14, 2024",
    "Coverage period: From Jan 15, 2023 through January 15, 2024",
    threshold=0.8
)
print(f"With threshold=0.7: {score >= 0.7}")

# SEMANTIC method - Low similarity (different content)
test_comparison(
    EvaluationMethod.SEMANTIC,
    "Monthly statement showing deposits of $1,250, withdrawals of $850, ending balance of $2,400.",
    "Insurance policy with a premium of $850 per year and a deductible of $500.",
    threshold=0.8
)

# SEMANTIC method - Different threshold test
print("\nTesting different thresholds with medium similarity content:")
for threshold in [0.5, 0.7, 0.8, 0.9]:
    matched, score, _ = test_comparison(
        EvaluationMethod.SEMANTIC,
        "The patient was diagnosed with hypertension and prescribed lisinopril 10mg daily.",
        "Patient has high blood pressure and was given medication to take once per day.",
        threshold=threshold
    )
    print(f"Threshold {threshold}: Match={matched}, Score={score:.4f}")

### Comparing SEMANTIC vs LLM Methods

Let's compare results from the SEMANTIC method (using embeddings) with the LLM method on various examples to understand their respective strengths.

In [None]:
# Define a function to compare both methods
def compare_semantic_vs_llm(expected: str, actual: str, description: str = "Test comparison"):
    """Compare SEMANTIC and LLM methods on the same input."""
    print(f"\n{'-'*100}")
    print(f"Comparison: {description}")
    print(f"Expected: {expected}")
    print(f"Actual: {actual}")
    print(f"{'-'*100}")
    
    # Run semantic comparison
    semantic_matched, semantic_score, semantic_reason = test_comparison(
        EvaluationMethod.SEMANTIC,
        expected,
        actual,
        threshold=0.8
    )
    
    # Run LLM comparison
    llm_matched, llm_score, llm_reason = test_comparison(
        EvaluationMethod.LLM,
        expected,
        actual,
        document_class="TestDoc",
        attr_name="test_attr",
        attr_description="Test attribute"
    )
    
    # Compare results
    print("\nComparison of results:")
    print(f"{'Method':<10} {'Matched':<10} {'Score':<10} {'Reason'}")
    print(f"{'-'*80}")
    print(f"{'SEMANTIC':<10} {semantic_matched!s:<10} {semantic_score:<10.4f} {semantic_reason or ''}")
    print(f"{'LLM':<10} {llm_matched!s:<10} {llm_score:<10.4f} {llm_reason or ''}")
    
    return {
        "semantic": (semantic_matched, semantic_score, semantic_reason),
        "llm": (llm_matched, llm_score, llm_reason)
    }

# Test cases where both methods should give similar results
compare_semantic_vs_llm(
    "Patient diagnosed with pneumonia and prescribed antibiotics for 10 days.",
    "The patient has pneumonia and was given a 10-day course of antibiotics.",
    "Similar medical information with different wording"
)

# Test cases where SEMANTIC might be better
compare_semantic_vs_llm(
    "Total payment due: $1,543.27",
    "Amount to be paid: $1,543.27",
    "Financial information with different formats but exact amounts"
)

# Test cases where LLM might be better
compare_semantic_vs_llm(
    "Room temperature maintained at 72°F during the experiment.",
    "The experiment was conducted in standard laboratory conditions at room temperature.",
    "Implicit vs explicit information"
)

## Part 2: Document-Level Evaluation with EvaluationService

In [None]:
# Setup test config
test_config = {
    "classes": [
        {
            "name": "TestDocument",
            "attributes": [
                {
                    "name": "exact_match_attr",
                    "description": "Attribute for exact matching",
                    "evaluation_method": "EXACT"
                },
                {
                    "name": "numeric_attr",
                    "description": "Attribute for numeric matching",
                    "evaluation_method": "NUMERIC_EXACT"
                },
                {
                    "name": "fuzzy_attr",
                    "description": "Attribute for fuzzy matching",
                    "evaluation_method": "FUZZY",
                    "evaluation_threshold": 0.8
                },
                {
                    "name": "list_attr",
                    "description": "Attribute for list comparison",
                    "evaluation_method": "HUNGARIAN",
                    "hungarian_comparator": "EXACT"
                },
                {
                    "name": "list_attr_fuzzy",
                    "description": "Attribute for list comparison with fuzzy matching",
                    "evaluation_method": "HUNGARIAN",
                    "hungarian_comparator": "FUZZY",
                    "evaluation_threshold": 0.7  # Threshold for fuzzy matching within Hungarian
                },
                {
                    "name": "list_attr_numeric",
                    "description": "Attribute for list comparison with numeric matching",
                    "evaluation_method": "HUNGARIAN",
                    "hungarian_comparator": "NUMERIC"
                },
                {
                    "name": "llm_attr",
                    "description": "Attribute for semantic comparison",
                    "evaluation_method": "LLM"
                },
                {
                    "name": "missing_in_actual",
                    "description": "Attribute missing in actual results",
                    "evaluation_method": "EXACT"
                },
                {
                    "name": "missing_in_expected",
                    "description": "Attribute missing in expected results",
                    "evaluation_method": "EXACT"
                },
                {
                    "name": "missing_everywhere",
                    "description": "Attribute missing in both expected and actual",
                    "evaluation_method": "EXACT"
                },
                {
                    "name": "semantic_attr",
                    "description": "Attribute for semantic embedding comparison",
                    "evaluation_method": "SEMANTIC",
                    "evaluation_threshold": 0.8
                }
            ]
        }
    ],
    "evaluation": {
        "llm_method": {
            "model": "us.amazon.nova-lite-v1:0",
            "temperature": 0.0,
            "top_k": 250,
            "system_prompt": "You are an evaluator for document extraction attributes.",
            "task_prompt": """I need to evaluate attribute extraction for a document of class: {DOCUMENT_CLASS}.

For the attribute named "{ATTRIBUTE_NAME}" described as "{ATTRIBUTE_DESCRIPTION}":
- Expected value: {EXPECTED_VALUE}
- Actual value: {ACTUAL_VALUE}

Do these values match in meaning, taking into account formatting differences, word order, abbreviations, and semantic equivalence?
Provide your assessment as a JSON with three fields:
- "match": boolean (true if they match, false if not)
- "score": number between 0 and 1 representing the confidence/similarity score
- "reason": brief explanation of your decision

Respond ONLY with the JSON and nothing else.  Here's the exact format:
{
  "match": true or false,
  "score": 0.0 to 1.0,
  "reason": "Your explanation here"
}
"""
        }
    }
}

In [None]:
# Mock S3 retrieval function returning expected or actual extraction results
def mock_s3_get_json(uri: str) -> Dict[str, Any]:
    """Mock S3 file retrieval."""
    if "expected" in uri:
        return {
            "exact_match_attr": "Exact Match Value",
            "numeric_attr": "$1,250.00",
            "fuzzy_attr": "John Alexander Smith",
            "list_attr": ["Item 1", "Item 2", "Item 3"],
            "list_attr_fuzzy": ["Payment method: Credit Card", "Due date: Jan 15, 2023", "Reference: ABC-123"],
            "list_attr_numeric": ["$500.00", "$150.00", "$200.00"],
            "llm_attr": "Monthly statement showing deposits of $1,250, withdrawals of $850, ending balance of $2,400.",
            "semantic_attr": "Patient was diagnosed with hypertension and prescribed lisinopril 10mg daily.",
            "missing_in_actual": "This value exists in expected only",
            # missing_in_expected is intentionally omitted
            # missing_everywhere is intentionally omitted
        }
    else:  # actual results
        return {
            "exact_match_attr": "Exact Match Value",  # Exact match
            "numeric_attr": 1250,  # Numeric match
            "fuzzy_attr": "John A Smith",  # Fuzzy match
            "list_attr": ["Item 1", "Item 3", "Item 2"],  # List with different order
            "list_attr_fuzzy": ["Reference: ABC123", "Payment: CC", "Due: January 15, 2023"],  # Fuzzy matches
            "list_attr_numeric": [500, 150, 200],  # Numeric comparison (now matches)
            "llm_attr": "Statement with deposits totaling $1,250 and withdrawals of $850, leaving a balance of $2,400.",  # Semantic match
            "semantic_attr": "Patient has high blood pressure and was given medication to take once per day.",  # Semantic embedding match
            # missing_in_actual is intentionally omitted
            "missing_in_expected": "This value exists in actual only",
            # missing_everywhere is intentionally omitted
        }

# Set up mock storage - we'll still use this for S3
class MockS3:
    # Store report content for later display
    report_content = ""
    results_content = {}
    
    @staticmethod
    def get_json_content(uri: str) -> Dict[str, Any]:
        return mock_s3_get_json(uri)
    
    @staticmethod
    def write_content(content: Any, bucket: str, key: str, content_type: Optional[str] = None):
        print(f"Writing content to s3://{bucket}/{key}")
        if key.endswith("results.json"):
            # Store the results for later access
            MockS3.results_content = content
            print(f"Evaluation results summary: {json.dumps(content.get('overall_metrics', {}), indent=2)}")
        elif key.endswith("report.md"):
            # Store the markdown report for later display
            MockS3.report_content = content

In [None]:
# Create mock documents for evaluation
def create_test_document(doc_id: str, is_expected: bool = False) -> Document:
    """Create a test document with a section."""
    section = Section(
        section_id="sec-001",
        classification="TestDocument",
        extraction_result_uri=f"s3://test-bucket/{doc_id}/{'expected' if is_expected else 'actual'}/extraction.json"
    )
    
    doc = Document(
        id=doc_id,
        sections=[section],
        input_key=doc_id,
        input_bucket="test-bucket",
        output_bucket="test-bucket",
    )
    
    return doc

# Create test documents
actual_doc = create_test_document("test-doc-001")
expected_doc = create_test_document("test-doc-001-baseline", is_expected=True)

In [None]:
# Evaluate document
# Only patch S3 module - use real Bedrock
import idp_common.evaluation.service
idp_common.evaluation.service.s3 = MockS3

# Create evaluation service
evaluation_service = EvaluationService(region="us-east-1", config=test_config)

# Evaluate document
result_doc = evaluation_service.evaluate_document(actual_doc, expected_doc, store_results=True)

# Print results
if hasattr(result_doc, 'evaluation_result'):
    eval_result = result_doc.evaluation_result
    print(f"\nOverall metrics: {eval_result.overall_metrics}")
    
    # Check section results
    for section_result in eval_result.section_results:
        print(f"\nSection {section_result.section_id} - Class: {section_result.document_class}")
        print(f"Metrics: {section_result.metrics}")
        
        # Print attribute details
        print("\nAttribute Details:")
        print("-" * 100)
        print(f"{'Name':<20} {'Method':<15} {'Expected':<25} {'Actual':<25} {'Matched':<10} {'Score':<10} {'Reason'}")
        print("-" * 100)
        
        for attr in section_result.attributes:
            expected_val = str(attr.expected)[:25]
            actual_val = str(attr.actual)[:25]
            method = attr.evaluation_method
            reason = attr.reason[:50] + "..." if attr.reason and len(attr.reason) > 50 else (attr.reason or "")
            print(f"{attr.name:<20} {method:<15} {expected_val:<25} {actual_val:<25} {attr.matched!s:<10} {attr.score:<10.2f} {reason}")

### Display the Evaluation Report

Let's display the markdown evaluation report that was generated:

In [None]:
from IPython.display import Markdown, display

# Display the markdown report
if MockS3.report_content:
    display(Markdown(MockS3.report_content))
else:
    print("No evaluation report was generated.")

## Part 3: Smart Attribute Discovery and Evaluation

This section demonstrates the "Smart attribute discovery and evaluation" feature of the EvaluationService, which:
1. Automatically discovers attributes in the data not defined in configuration
2. Applies default evaluation methods to unconfigured attributes
3. Properly handles attributes found only in expected data, only in actual data, or in both
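
The discovery step can be pictured as simple set arithmetic over attribute names. The sketch below is a hypothetical illustration, not the `EvaluationService` implementation: it collects every name seen in the expected or actual data, keeps the configured method where one exists, and assigns a default method (LLM, per the summary later in this section) to the rest.

```python
def plan_evaluation(configured: dict, expected: dict, actual: dict,
                    default_method: str = "LLM"):
    """Return ({attribute: method}, discovered_names).

    `configured` maps attribute names to their configured method.
    Hypothetical helper for illustration only.
    """
    plan = dict(configured)
    # Any name present in the data but absent from configuration is "discovered".
    discovered = sorted((set(expected) | set(actual)) - set(configured))
    for name in discovered:
        plan[name] = default_method
    return plan, discovered


configured = {"invoice_number": "EXACT", "amount_due": "NUMERIC_EXACT"}
expected = {"invoice_number": "INV-12345", "amount_due": "$1,250.00",
            "reference_number": "REF-98765", "issue_date": "January 15, 2023"}
actual = {"invoice_number": "INV-12345", "amount_due": 1250,
          "purchase_order": "PO-54321", "issue_date": "01/15/2023"}

plan, discovered = plan_evaluation(configured, expected, actual)
print(discovered)
print(plan)
```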

In [None]:
# Define a minimal configuration with only some attributes defined
minimal_config = {
    "classes": [
        {
            "name": "InvoiceDocument",
            "attributes": [
                {
                    "name": "invoice_number",
                    "description": "The unique identifier for the invoice",
                    "evaluation_method": "EXACT"
                },
                {
                    "name": "amount_due",
                    "description": "The total amount to be paid",
                    "evaluation_method": "NUMERIC_EXACT"
                }
                # Note: Other attributes are intentionally omitted from configuration
            ]
        }
    ],
    "evaluation": {
        "llm_method": {
            "model": "us.amazon.nova-lite-v1:0",
            "temperature": 0.0,
            "top_k": 250,
            "system_prompt": "You are an evaluator for document extraction attributes.",
            "task_prompt": """I need to evaluate attribute extraction for a document of class: {DOCUMENT_CLASS}.

For the attribute named "{ATTRIBUTE_NAME}" described as "{ATTRIBUTE_DESCRIPTION}":
- Expected value: {EXPECTED_VALUE}
- Actual value: {ACTUAL_VALUE}

Do these values match in meaning, taking into account formatting differences, word order, abbreviations, and semantic equivalence?
Provide your assessment as a JSON with three fields:
- "match": boolean (true if they match, false if not)
- "score": number between 0 and 1 representing the confidence/similarity score
- "reason": brief explanation of your decision

Respond ONLY with the JSON and nothing else.  Here's the exact format:
{
  "match": true or false,
  "score": 0.0 to 1.0,
  "reason": "Your explanation here"
}
"""
        }
    }
}

In [None]:
# Override S3 mock data function to simulate invoice data with unconfigured attributes
def mock_invoice_s3_get_json(uri: str) -> Dict[str, Any]:
    """Mock S3 file retrieval with invoice data including unconfigured attributes."""
    if "expected" in uri:
        return {
            # Configured attributes
            "invoice_number": "INV-12345",  # Configured with EXACT method
            "amount_due": "$1,250.00",     # Configured with NUMERIC_EXACT method
            
            # Unconfigured attributes - only in expected
            "reference_number": "REF-98765",
            
            # Unconfigured attributes - in both expected and actual
            "issue_date": "January 15, 2023",
            "due_date": "February 15, 2023",
            "vendor_name": "Acme Corporation Inc.",
            "payment_terms": "Net 30"
        }
    else:  # actual results
        return {
            # Configured attributes
            "invoice_number": "INV-12345",  # Exact match with expected
            "amount_due": 1250,            # Numeric match with expected
            
            # Unconfigured attributes - only in actual
            "purchase_order": "PO-54321",
            
            # Unconfigured attributes - in both expected and actual with various match qualities
            "issue_date": "01/15/2023",                     # Different format but same date
            "due_date": "02/15/2023",                       # Different format but same date
            "vendor_name": "ACME Corp.",                    # Abbreviated but similar
            "payment_terms": "Payment due within 30 days"    # Different wording but same meaning
        }

# Create test invoice documents
def create_invoice_document(doc_id: str, is_expected: bool = False) -> Document:
    """Create a test invoice document with a section."""
    section = Section(
        section_id="inv-001",
        classification="InvoiceDocument",
        extraction_result_uri=f"s3://test-bucket/{doc_id}/{'expected' if is_expected else 'actual'}/extraction.json"
    )
    
    doc = Document(
        id=doc_id,
        sections=[section],
        input_key=doc_id,
        input_bucket="test-bucket",
        output_bucket="test-bucket",
    )
    
    return doc

# Set up a new mock for this example
class MockInvoiceS3:
    # Store report content for later display
    report_content = ""
    results_content = {}
    
    @staticmethod
    def get_json_content(uri: str) -> Dict[str, Any]:
        return mock_invoice_s3_get_json(uri)
    
    @staticmethod
    def write_content(content: Any, bucket: str, key: str, content_type: Optional[str] = None):
        print(f"Writing content to s3://{bucket}/{key}")
        if key.endswith("results.json"):
            # Store the results for later access
            MockInvoiceS3.results_content = content
        elif key.endswith("report.md"):
            # Store the markdown report for later display
            MockInvoiceS3.report_content = content

In [None]:
# Evaluate document with smart attribute discovery
# Patch S3 module with our invoice mock
idp_common.evaluation.service.s3 = MockInvoiceS3

# Create test documents
actual_invoice = create_invoice_document("invoice-001")
expected_invoice = create_invoice_document("invoice-001-baseline", is_expected=True)

# Create evaluation service with minimal config
evaluation_service = EvaluationService(region="us-east-1", config=minimal_config)

# Evaluate document
print("Evaluating invoice document with smart attribute discovery...")
result_doc = evaluation_service.evaluate_document(actual_invoice, expected_invoice, store_results=True)

# Print results with focus on discovered attributes
if hasattr(result_doc, 'evaluation_result'):
    eval_result = result_doc.evaluation_result
    print(f"\nOverall metrics: {eval_result.overall_metrics}")
    
    # Check section results
    for section_result in eval_result.section_results:
        print(f"\nSection {section_result.section_id} - Class: {section_result.document_class}")
        
        # Find configured vs unconfigured attributes
        configured_attrs = []
        unconfigured_attrs = []
        
        for attr in section_result.attributes:
            # Check if the attribute has a message about being unconfigured
            if attr.reason and "attribute not in the configuration" in attr.reason:
                unconfigured_attrs.append(attr)
            else:
                configured_attrs.append(attr)
        
        # Print summary of attribute counts
        print(f"Total attributes evaluated: {len(section_result.attributes)}")
        print(f"  - Configured attributes: {len(configured_attrs)}")
        print(f"  - Auto-discovered attributes: {len(unconfigured_attrs)}")
        
        # Print attribute details - CONFIGURED
        if configured_attrs:
            print("\nCONFIGURED Attribute Details:")
            print("-" * 100)
            print(f"{'Name':<20} {'Method':<15} {'Expected':<25} {'Actual':<25} {'Matched':<10} {'Score':<10}")
            print("-" * 100)
            
            for attr in configured_attrs:
                expected_val = str(attr.expected)[:25]
                actual_val = str(attr.actual)[:25]
                method = attr.evaluation_method
                print(f"{attr.name:<20} {method:<15} {expected_val:<25} {actual_val:<25} {attr.matched!s:<10} {attr.score:<10.2f}")
        
        # Print attribute details - UNCONFIGURED
        if unconfigured_attrs:
            print("\nAUTO-DISCOVERED Attribute Details:")
            print("-" * 100)
            print(f"{'Name':<20} {'Method':<15} {'Expected':<25} {'Actual':<25} {'Matched':<10} {'Score':<10} {'Reason'}")
            print("-" * 100)
            
            for attr in unconfigured_attrs:
                expected_val = str(attr.expected)[:25]
                actual_val = str(attr.actual)[:25]
                method = attr.evaluation_method
                reason = attr.reason[:50] + "..." if attr.reason and len(attr.reason) > 50 else (attr.reason or "")
                print(f"{attr.name:<20} {method:<15} {expected_val:<25} {actual_val:<25} {attr.matched!s:<10} {attr.score:<10.2f} {reason}")

### Display the Smart Attribute Discovery Evaluation Report

Let's display the markdown evaluation report that was generated for the smart attribute discovery scenario:

In [None]:
# Display the markdown report
if MockInvoiceS3.report_content:
    display(Markdown(MockInvoiceS3.report_content))
else:
    print("No evaluation report was generated.")

### Smart Attribute Discovery Scenario Summary

The smart attribute discovery and evaluation feature provides the following benefits:

1. **Auto-discovery of attributes**
   - Finds attributes not explicitly defined in the configuration
   - Compares all data fields across expected and actual results
   - Works with minimal or even no attribute configuration

2. **Default Evaluation Method**
   - Applies the LLM method to unconfigured attributes by default
   - Provides semantic comparison for discovered attributes
   - Attaches a reason noting that the attribute was not in the configuration

3. **Handles All Possible Cases**
   - Attributes in both expected and actual results
   - Attributes only in expected results (false negatives)
   - Attributes only in actual results (false positives)
   - Attributes that don't exist in either (true negatives)

4. **Benefits**
   - Exploratory evaluation without complete configuration
   - Comprehensive metrics that include all found attributes
   - Flexibility as extraction models evolve or change output formats
   - Identification of potential new attributes to add to configuration

This feature is particularly useful during the early stages of implementation when the complete attribute schema may not be fully defined, or when handling variations in extraction outputs that contain unexpected information.
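The four cases listed above can be illustrated with a small standalone sketch. It is not the library's implementation: `is_empty` and `classify_attributes` are hypothetical helpers written for this example, and the rule that `None` and empty values count as "missing" is an assumption.

```python
from typing import Any, Dict, Set


def is_empty(value: Any) -> bool:
    """Assumption: None and empty strings/lists/dicts count as 'no value'."""
    return value is None or (isinstance(value, (str, list, dict)) and len(value) == 0)


def classify_attributes(
    expected: Dict[str, Any], actual: Dict[str, Any], configured: Set[str]
) -> Dict[str, Dict[str, str]]:
    """Label every attribute seen in expected, actual, or the configuration."""
    result: Dict[str, Dict[str, str]] = {}
    for name in sorted(set(expected) | set(actual) | configured):
        has_expected = not is_empty(expected.get(name))
        has_actual = not is_empty(actual.get(name))
        if has_expected and has_actual:
            case = "both"            # compare the two values with an evaluation method
        elif has_expected:
            case = "expected_only"   # false negative
        elif has_actual:
            case = "actual_only"     # false positive
        else:
            case = "neither"         # true negative
        source = "configured" if name in configured else "auto-discovered"
        result[name] = {"case": case, "source": source}
    return result


# Illustrative data only; the attribute names are made up for this sketch.
expected = {"invoice_number": "INV-1", "total": "100.00", "po_number": "PO-9", "notes": ""}
actual = {"invoice_number": "INV-1", "total": "100.00", "currency": "USD", "notes": None}
configured = {"invoice_number", "total"}

cases = classify_attributes(expected, actual, configured)
for name, info in cases.items():
    print(f"{name:<16} {info['source']:<16} {info['case']}")
```

Here `po_number`, `currency`, and `notes` are not in the configuration, so the sketch labels them as auto-discovered, while each still falls into exactly one of the four cases.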

## Summary of All Demonstrated Features

This notebook has demonstrated:

1. All evaluation methods available in the IDP library:
   - EXACT - Exact string matching
   - NUMERIC_EXACT - Numeric value matching
   - FUZZY - Fuzzy string matching with adjustable thresholds
   - SEMANTIC - Semantic similarity comparison using Titan embeddings
   - HUNGARIAN - List comparison using the Hungarian algorithm with configurable comparator types:
     - EXACT comparator - Exact string matching for list items
     - FUZZY comparator - Fuzzy string matching for list items
     - NUMERIC comparator - Numeric comparison for list items
   - LLM - Semantic comparison using Large Language Models

2. Semantic Comparison Methods:
   - SEMANTIC - Uses Bedrock Titan embeddings and cosine similarity for efficient matching
   - LLM - Uses Bedrock Claude for more nuanced semantic understanding with reasoning

3. Trade-offs between the SEMANTIC and LLM methods:
   - SEMANTIC is faster and less expensive than LLM-based evaluation
   - LLM provides explanations for matches/mismatches
   - SEMANTIC works well for standard text comparisons
   - LLM better understands implicit information and complex reasoning

4. Handling of edge cases:
   - Attributes missing in actual results
   - Attributes missing in expected results
   - Attributes missing in both actual and expected results
   - Empty string values

5. Full document evaluation with mixed evaluation methods
   - Comprehensive metrics calculation
   - Detailed attribute-level results

6. Threshold sensitivity analysis for fuzzy and semantic matching
   - How different threshold values affect match results
   - Trade-offs between precision and recall

7. Smart attribute discovery and evaluation:
   - Auto-discovery of attributes not in configuration
   - Default semantic evaluation with the LLM method
   - Comprehensive handling of all attribute cases
   - Support for exploratory evaluation and evolving schemas
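As a reminder of how threshold sensitivity works in practice, the following minimal sketch sweeps a threshold over a few string pairs. It uses `difflib.SequenceMatcher` from the standard library as a stand-in similarity function; the similarity measure behind the library's FUZZY method may differ, and `fuzzy_score` and `sweep_thresholds` are hypothetical names introduced only for this illustration.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Sequence, Tuple


def fuzzy_score(expected: str, actual: str) -> float:
    """Similarity in [0, 1] after lowercasing and trimming (stand-in measure)."""
    return SequenceMatcher(None, expected.lower().strip(), actual.lower().strip()).ratio()


def sweep_thresholds(
    pairs: Sequence[Tuple[str, str]], thresholds: Sequence[float]
) -> Dict[float, List[bool]]:
    """For each threshold, record which pairs would count as a match."""
    return {
        t: [fuzzy_score(expected, actual) >= t for expected, actual in pairs]
        for t in thresholds
    }


# Illustrative pairs only.
pairs = [
    ("Acme Corporation", "Acme Corporation"),
    ("Acme Corporation", "Acme Corp"),
    ("Acme Corporation", "Globex Inc"),
]
thresholds = [0.5, 0.7, 0.9]

sweep = sweep_thresholds(pairs, thresholds)
for t in thresholds:
    print(f"threshold={t:.1f} matches={sum(sweep[t])}/{len(pairs)} {sweep[t]}")
```

Raising the threshold can only remove matches, never add them: a strict threshold rejects near-misses such as abbreviations, while a lenient one risks accepting values that are actually different.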
