# End-to-End Document Processing with IDP Common Package

This notebook demonstrates how to process a document using the modular Document-based approach with:

1. OCR Service - Convert a PDF document to text using AWS Textract
2. Classification Service - Classify document pages into sections using Bedrock with the multi-modal page-based method
3. Extraction Service - Extract structured information from sections using Bedrock
4. Evaluation Service - Evaluate accuracy of extracted information

Each step uses the unified Document object model for data flow and consistency.
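
To make the data flow concrete, here is a minimal sketch of the pattern: every step receives a document object, enriches it, and returns it to the next step. The classes `SketchPage` and `SketchDocument`, the step functions, and the status strings are hypothetical stand-ins for illustration only; they are not the `idp_common` classes used later in this notebook.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class SketchPage:
    """Hypothetical stand-in for a page: its OCR text and assigned class."""
    page_id: str
    text: str = ""
    classification: str = ""


@dataclass
class SketchDocument:
    """Hypothetical stand-in for the shared document object."""
    id: str
    status: str = "QUEUED"
    pages: Dict[str, SketchPage] = field(default_factory=dict)


def run_ocr(doc: SketchDocument) -> SketchDocument:
    # Pretend OCR: attach one page of text, then hand the same object back.
    doc.pages["1"] = SketchPage(page_id="1", text="Dear Sir, ...")
    doc.status = "OCR_DONE"
    return doc


def run_classification(doc: SketchDocument) -> SketchDocument:
    # Pretend classification: label every page, then hand the same object back.
    for page in doc.pages.values():
        page.classification = "letter"
    doc.status = "CLASSIFIED"
    return doc


# Each step takes the document and returns it, so steps chain naturally.
doc = SketchDocument(id="demo")
for step in (run_ocr, run_classification):
    doc = step(doc)

print(doc.status, doc.pages["1"].classification)
```

Because each step has the same "document in, document out" shape, steps can be added, removed, or reordered without changing how data is passed between them.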

> **Note**: This notebook uses AWS services including S3, Textract, and Bedrock. You need valid AWS credentials with appropriate permissions to run this notebook.

## 1. Install Dependencies

The IDP common package supports granular installation through extras. You can install:
- `[core]` - Just core functionality 
- `[ocr]` - OCR service with Textract dependencies
- `[classification]` - Classification service dependencies
- `[extraction]` - Extraction service dependencies
- `[evaluation]` - Evaluation service dependencies
- `[all]` - All of the above

In [None]:
# Let's make sure that modules are autoreloaded
%load_ext autoreload
%autoreload 2

ROOTDIR="../.."
# First uninstall existing package (to ensure we get the latest version)
%pip uninstall -y idp_common

# Install the IDP common package with all components in development mode
%pip install -q -e "{ROOTDIR}/lib/idp_common_pkg[dev, all]"

# Note: We can also install specific components like:
# %pip install -q -e "{ROOTDIR}/lib/idp_common_pkg[ocr,classification,extraction,evaluation]"

# Check installed version
%pip show idp_common | grep -E "Version|Location"

# Optionally use a .env file for environment variables
try:
    from dotenv import load_dotenv
    load_dotenv()  
except ImportError:
    pass  

## 2. Import Libraries and Set Up Environment
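
The setup cell in this section defines a small `parse_s3_uri` helper for splitting `s3://bucket/key` URIs. As a hedged alternative, the sketch below shows a stricter variant that rejects strings that are not S3 URIs. The name `split_s3_uri` is hypothetical and is not part of `idp_common`.

```python
from typing import Tuple


def split_s3_uri(uri: str) -> Tuple[str, str]:
    """Split an s3://bucket/key URI into (bucket, key), validating the prefix."""
    prefix = "s3://"
    if not uri.startswith(prefix):
        raise ValueError(f"Not an S3 URI: {uri!r}")
    # Everything up to the first "/" is the bucket; the rest is the key.
    bucket, _, key = uri[len(prefix):].partition("/")
    if not bucket:
        raise ValueError(f"Missing bucket name in: {uri!r}")
    return bucket, key


print(split_s3_uri("s3://my-bucket/doc/sections/1/result.json"))
```

Splitting on the first `/` only keeps keys that contain characters such as `?` or `#` intact, which a general-purpose URL parser would treat as query or fragment separators.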

In [None]:
import os
import json
import time
import boto3
import logging
import datetime

# Import base libraries
from idp_common.models import Document, Status, Section, Page
from idp_common import ocr, classification, extraction, evaluation

# Configure logging 
logging.basicConfig(level=logging.WARNING)  # Set root logger to WARNING (less verbose)
logging.getLogger('idp_common.ocr.service').setLevel(logging.INFO)  # Focus on service logs
logging.getLogger('idp_common.classification.service').setLevel(logging.DEBUG)  # Enable classification logs
logging.getLogger('idp_common.bedrock.client').setLevel(logging.DEBUG)  # show prompts

logging.getLogger('idp_common.evaluation.service').setLevel(logging.DEBUG)  # Enable evaluation logs
logging.getLogger('textractor').setLevel(logging.WARNING)  # Suppress textractor logs

# Set environment variables
os.environ['METRIC_NAMESPACE'] = 'IDP-Notebook-Example'
os.environ['AWS_REGION'] = boto3.session.Session().region_name or 'us-east-1'

# Get AWS account ID for unique bucket names
sts_client = boto3.client('sts')
account_id = sts_client.get_caller_identity()["Account"]
region = os.environ['AWS_REGION']

# Define sample PDF path 
SAMPLE_PDF_PATH = f"{ROOTDIR}/samples/rvl_cdip_package.pdf"

# Create unique bucket names based on account ID and region
input_bucket_name =  os.getenv("IDP_INPUT_BUCKET_NAME", f"idp-notebook-input-{account_id}-{region}")
output_bucket_name = os.getenv("IDP_OUTPUT_BUCKET_NAME", f"idp-notebook-output-{account_id}-{region}")

# Helper function to parse S3 URIs
def parse_s3_uri(uri):
    parts = uri.replace("s3://", "").split("/")
    bucket = parts[0]
    key = "/".join(parts[1:])
    return bucket, key

# Helper function to load JSON from S3
# Note: relies on the global s3_client, which is created in section 3 before this helper is called
def load_json_from_s3(uri):
    bucket, key = parse_s3_uri(uri)
    response = s3_client.get_object(Bucket=bucket, Key=key)
    content = response['Body'].read().decode('utf-8')
    return json.loads(content)

# Set ROOT_DIR - used to locate example images from local directory
# OR set CONFIGURATION_BUCKET to S3 Configuration bucket name (contains config_library)
os.environ['ROOT_DIR'] = ROOTDIR

print("Environment setup:")
print(f"METRIC_NAMESPACE: {os.environ.get('METRIC_NAMESPACE')}")
print(f"AWS_REGION: {os.environ.get('AWS_REGION')}")
print(f"Input bucket: {input_bucket_name}")
print(f"Output bucket: {output_bucket_name}")
print(f"SAMPLE_PDF_PATH: {SAMPLE_PDF_PATH}")

## 3. Set Up S3 Buckets and Upload Sample File

In [None]:
# Create S3 client
s3_client = boto3.client('s3')

# Function to create a bucket if it doesn't exist
def ensure_bucket_exists(bucket_name):
    try:
        s3_client.head_bucket(Bucket=bucket_name)
        print(f"Bucket {bucket_name} already exists")
    except Exception:
        try:
            if region == 'us-east-1':
                s3_client.create_bucket(Bucket=bucket_name)
            else:
                s3_client.create_bucket(
                    Bucket=bucket_name,
                    CreateBucketConfiguration={'LocationConstraint': region}
                )
            print(f"Created bucket: {bucket_name}")
            
            # Wait for bucket to be accessible
            waiter = s3_client.get_waiter('bucket_exists')
            waiter.wait(Bucket=bucket_name)
        except Exception as e:
            print(f"Error creating bucket {bucket_name}: {str(e)}")
            raise

# Ensure both buckets exist
ensure_bucket_exists(input_bucket_name)
ensure_bucket_exists(output_bucket_name)

# Upload the sample file to S3
sample_file_key = "sample-" + datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S") + ".pdf"
with open(SAMPLE_PDF_PATH, 'rb') as file_data:
    s3_client.upload_fileobj(file_data, input_bucket_name, sample_file_key)

print(f"Uploaded sample file to: s3://{input_bucket_name}/{sample_file_key}")

## 4. Set Up Configuration

In [None]:
# Few-shot configuration from config_library
import yaml
with open(f"{ROOTDIR}/config_library/pattern-2/rvl-cdip-package-sample-with-few-shot-examples/config.yaml", 'r') as file:
    CONFIG = yaml.safe_load(file)

print("Configuration loaded")

## 5. Process Document with OCR
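
The OCR cell in this section notes that requesting features such as `LAYOUT` or `TABLES` uses the Textract `analyze_document` API, while an empty feature list uses the basic `detect_document_text` API. The sketch below illustrates that selection logic in isolation. The helper `choose_textract_call` is hypothetical, and `OcrService` may implement this differently; only the operation names and the `Document` / `FeatureTypes` request parameters are taken from the boto3 Textract client.

```python
from typing import Any, Dict, List, Optional, Tuple

# Feature names listed in this notebook's OCR cell comment
VALID_FEATURES = {"LAYOUT", "FORMS", "SIGNATURES", "TABLES"}


def choose_textract_call(
    bucket: str, key: str, features: Optional[List[str]] = None
) -> Tuple[str, Dict[str, Any]]:
    """Return the Textract operation name and its keyword arguments (hypothetical helper)."""
    features = features or []
    unknown = set(features) - VALID_FEATURES
    if unknown:
        raise ValueError(f"Unsupported features: {sorted(unknown)}")

    kwargs: Dict[str, Any] = {"Document": {"S3Object": {"Bucket": bucket, "Name": key}}}
    if features:
        # Any requested feature means the richer analysis API is needed.
        kwargs["FeatureTypes"] = list(features)
        return "analyze_document", kwargs
    # No features requested: plain text detection is enough.
    return "detect_document_text", kwargs


operation, request = choose_textract_call("my-bucket", "pages/1/image.jpg", ["LAYOUT"])
print(operation, request)
# With credentials configured, the call could then be issued as:
#   getattr(boto3.client("textract"), operation)(**request)
```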

In [None]:
# Initialize a new Document
document = Document(
    id="rvl-cdip-package",
    input_bucket=input_bucket_name,
    input_key=sample_file_key,
    output_bucket=output_bucket_name,
    status=Status.QUEUED
)

print(f"Created document with ID: {document.id}")
print(f"Status: {document.status.value}")

# Create OCR service with Textract
# Valid features are 'LAYOUT', 'FORMS', 'SIGNATURES', 'TABLES' (uses analyze_document API)
# or leave it empty (to use basic detect_document_text API)
ocr_service = ocr.OcrService(
    region=region,
    enhanced_features=['LAYOUT']
)

# Process document with OCR
print("\nProcessing document with OCR...")
start_time = time.time()
document = ocr_service.process_document(document)
ocr_time = time.time() - start_time

print(f"OCR processing completed in {ocr_time:.2f} seconds")
print(f"Document status: {document.status.value}")
print(f"Number of pages processed: {document.num_pages}")

# Show pages information
print("\nProcessed pages:")
for page_id, page in document.pages.items():
    print(f"Page {page_id}:")
    print(f"  Image URI: {page.image_uri}")
    print(f"  Raw Text URI: {page.raw_text_uri}")
    print(f"  Parsed Text URI: {page.parsed_text_uri}")
print("\nMetering:")
print(json.dumps(document.metering))

## 6. Classify the Document
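
Classification assigns a class to each page and then groups pages into sections. One plausible way to form sections, shown here as a sketch, is to merge consecutive pages that share the same class. The function `group_pages_into_sections` and the assumption that page IDs are numeric strings are illustrative only; the grouping performed by `ClassificationService` may differ.

```python
from itertools import groupby
from typing import Dict, List


def group_pages_into_sections(page_classes: Dict[str, str]) -> List[dict]:
    """Group consecutive pages that share a class into numbered sections."""
    # Assumption: page IDs are numeric strings, so sort them numerically.
    ordered = sorted(page_classes.items(), key=lambda item: int(item[0]))
    sections = []
    runs = groupby(ordered, key=lambda item: item[1])
    for index, (label, run) in enumerate(runs, start=1):
        sections.append({
            "section_id": str(index),
            "classification": label,
            "page_ids": [page_id for page_id, _ in run],
        })
    return sections


pages = {"1": "letter", "2": "letter", "3": "form", "4": "email", "5": "email"}
sections = group_pages_into_sections(pages)
for section in sections:
    print(section)
```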

In [None]:
# Verify that Config specifies => "classificationMethod": "textbasedHolisticClassification"
print("*****************************************************************")
print(f'CONFIG classificationMethod: {CONFIG["classification"].get("classificationMethod")}')
print("*****************************************************************")

# Create classification service with Bedrock backend
# The classification method is set in the config
classification_service = classification.ClassificationService(
    config=CONFIG, 
    backend="bedrock" 
)

# Classify the document
print("\nClassifying document...")
start_time = time.time()
document = classification_service.classify_document(document)
classification_time = time.time() - start_time
print(f"Classification completed in {classification_time:.2f} seconds")
print(f"Document status: {document.status.value}")

In [None]:
# Show classification results
if document.sections:
    print("\nDetected sections:")
    for section in document.sections:
        print(f"Section {section.section_id}: {section.classification}")
        print(f"  Pages: {section.page_ids}")
else:
    print("\nNo sections detected")

# Show page classification
print("\nPage-level classifications:")
for page_id, page in sorted(document.pages.items()):
    print(f"Page {page_id}: {page.classification}")

## 7. Extract Information from Document Sections

In [None]:
# Create extraction service with Bedrock
extraction_service = extraction.ExtractionService(config=CONFIG)

print("\nExtracting information from document sections...")
extracted_results = {}

n = 3 # Only process first 3 sections to save time
# Process each section directly using the section_id
for section in document.sections[:n]:  
    print(f"\nProcessing section {section.section_id} (class: {section.classification})")
    
    # Process section directly with the original document
    start_time = time.time()
    document = extraction_service.process_document_section(
        document=document,
        section_id=section.section_id
    )
    extraction_time = time.time() - start_time
    print(f"Extraction for section {section.section_id} completed in {extraction_time:.2f} seconds")
    
print(f"\nExtraction for first {n} sections complete.")

In [None]:
print("\nShow extraction results...\n")

document_dict = document.to_dict()
sections_json = json.dumps(document_dict["sections"][:n], indent=2)
print(f"{sections_json}...")

## 8. Final Document Status Summary

In [None]:
# Update document status to COMPLETED
document.status = Status.COMPLETED

# Display final document state
print("Final Document State:")
print(f"Document ID: {document.id}")
print(f"Status: {document.status.value}")
print(f"Number of pages: {document.num_pages}")
print(f"Number of sections: {len(document.sections)}")

# Show document serialization capabilities
print("\nDocument can be serialized to JSON:")
document_dict = document.to_dict()
document_json = json.dumps(document_dict, indent=2)  
print(f"{document_json}")

## 9. Evaluate Results

In this section, we'll demonstrate how to evaluate extraction results by comparing them with expected (ground truth) values. The evaluation process involves:

1. Creating a ground truth document with expected values
2. Comparing the actual extraction results against expected values
3. Calculating metrics (precision, recall, F1 score)
4. Generating an evaluation report
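
To illustrate step 3, the sketch below computes precision, recall, and F1 from two flat attribute dictionaries. The counting convention is an assumption made for this example (a wrong value counts as both a false positive and a false negative), and `score_attributes` is a hypothetical helper; `EvaluationService` may count matches differently.

```python
from typing import Dict, Optional


def score_attributes(expected: Dict[str, Optional[str]],
                     actual: Dict[str, Optional[str]]) -> Dict[str, float]:
    """Compute precision, recall and F1 over attribute-level exact matches."""
    tp = fp = fn = 0
    for name in set(expected) | set(actual):
        exp, act = expected.get(name), actual.get(name)
        if exp is not None and act == exp:
            tp += 1  # correct value extracted
        else:
            if act is not None:
                fp += 1  # a value was produced that is wrong or unexpected
            if exp is not None:
                fn += 1  # an expected value was not recovered
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Illustrative values only
expected = {"sender_name": "William E. Clarke", "form_id": "2030053328", "to_address": "TI New York"}
actual = {"sender_name": "William E. Clarke", "form_id": "2030053329", "cc_address": "unknown"}
metrics = score_attributes(expected, actual)
print(metrics)
```

In this example one attribute matches, one is wrong, one is missing, and one is unexpected, so both precision and recall come out to one third.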

#### Evaluation helper function

In [None]:
# Helper function to create a ground truth document from an existing document and expected results
def create_ground_truth_document(source_document, expected_results_dict):
    """Creates a ground truth document for evaluation from an existing document and expected results.
    
    Args:
        source_document: The original document to copy structure from
        expected_results_dict: Dictionary mapping section IDs to expected attribute values
        
    Returns:
        Document: A document with the same structure but with expected results
    """
    # Create a new document with the same core attributes
    ground_truth = Document(
        id=source_document.id,
        input_bucket=source_document.input_bucket,
        input_key=source_document.input_key,
        output_bucket=source_document.output_bucket,
        status=Status.COMPLETED
    )
    
    # Copy sections and add expected result URIs
    for section in source_document.sections:
        # Create section with same structure
        expected_section = Section(
            section_id=section.section_id,
            classification=section.classification,
            confidence=1.0,
            page_ids=section.page_ids.copy(),
            extraction_result_uri=section.extraction_result_uri  # Copy the URI from actual document
        )
        ground_truth.sections.append(expected_section)
    
    # Copy pages
    for page_id, page in source_document.pages.items():
        ground_truth.pages[page_id] = page
    
    # Store expected results to S3 for sections that have extraction results
    for section_id, expected_data in expected_results_dict.items():
        # Find the section in the document
        for section in ground_truth.sections:
            if section.section_id == section_id and section.extraction_result_uri:
                # Load the original extraction result as template
                uri = section.extraction_result_uri
                bucket, key = parse_s3_uri(uri)
                
                try:
                    # Get the original result structure
                    response = s3_client.get_object(Bucket=bucket, Key=key)
                    result_data = json.loads(response['Body'].read().decode('utf-8'))
                    
                    # Replace the inference_result with our expected data
                    if "inference_result" in result_data:
                        result_data["inference_result"] = expected_data
                    else:
                        # Or just replace the entire content if no inference_result key
                        result_data = expected_data
                    
                    # Write back to S3 with a different key for expected values
                    expected_key = key.replace("/result.json", "/expected.json")
                    s3_client.put_object(
                        Bucket=bucket,
                        Key=expected_key,
                        Body=json.dumps(result_data).encode('utf-8')
                    )
                    
                    # Update the section's extraction URI to point to our expected data
                    section.extraction_result_uri = f"s3://{bucket}/{expected_key}"
                    print(f"Stored expected results for section {section_id} at {section.extraction_result_uri}")
                except Exception as e:
                    print(f"Error storing expected results for section {section_id}: {e}")
    
    return ground_truth

#### Set up ground truth

In [None]:
# Define expected results for extraction (ground truth)
# Customize values to showcase different evaluation methods from CONFIG
expected_results = {
    "1": {  # Section 1 (Letter)
        # For sender_name with LLM matching - intentionally create a variant that should match semantically
        "sender_name": "William E. Clarke",  
        # For sender_address with LLM matching - formatting differences should still match
        "sender_address": "206 maple Street\nP.O. Box 1056\nMurray Kentucky 42071-1056"  
    },
    "2": {  # Section 2 (Form)
        # For form_type with FUZZY matching (threshold 0.7) - added qualifier but should still match
        "form_type": "LAB SERVICES CONSISTENCY REPORT - Annual Edition",  
        # For form_id with NUMERIC_EXACT - should match
        "form_id": 2030053328  
    },
    "3": {  # Section 3 (Email)
        # For from_address with default matching (LLM) - match
        "from_address": "Kelahan, Benjamin",  
        # For to_address field with LLM matching
        "to_address": "TI Minnesota, TI New York"  
    }
}

# Create ground truth document using the helper function
expected_document = create_ground_truth_document(document, expected_results)


#### Run evaluation

In [None]:
# Create the evaluation service
evaluation_service = evaluation.EvaluationService(config=CONFIG)

# Run evaluation
print("Running document evaluation...")
start_time = time.time()
document = evaluation_service.evaluate_document(
    actual_document=document,
    expected_document=expected_document
)
evaluation_time = time.time() - start_time

print(f"Evaluation completed in {evaluation_time:.2f} seconds")
print(f"Evaluation report URI: {document.evaluation_report_uri}")

#### Display evaluation results

In [None]:
# Show structured evaluation result
print("Evaluation result object")
if document.evaluation_result:
    print(f"{document.evaluation_result}")
else:
    print("ERROR.. No evaluation_result found")

# Read the evaluation report from S3 and display it in the notebook
from IPython.display import Markdown, display

print("Reading markdown report from S3...")
if document.evaluation_report_uri:
    bucket, key = parse_s3_uri(document.evaluation_report_uri)
    response = s3_client.get_object(Bucket=bucket, Key=key)
    s3_markdown = response['Body'].read().decode('utf-8')
    print(f"Successfully read report from {document.evaluation_report_uri}")

    # Render the markdown content read from S3
    # (kept inside the branch so s3_markdown is always defined when displayed)
    display(Markdown(s3_markdown))
else:
    print("No evaluation report URI found")

## 10. Clean Up (Optional)

In [None]:
# Function to delete all objects in a bucket, then the bucket itself
def delete_bucket_objects(bucket_name):
    try:
        # List objects page by page; a single list_objects_v2 call returns at most 1,000 keys
        paginator = s3_client.get_paginator('list_objects_v2')
        deleted_any = False
        for page in paginator.paginate(Bucket=bucket_name):
            if 'Contents' in page:
                delete_keys = {'Objects': [{'Key': obj['Key']} for obj in page['Contents']]}
                s3_client.delete_objects(Bucket=bucket_name, Delete=delete_keys)
                deleted_any = True

        if deleted_any:
            print(f"Deleted all objects in bucket {bucket_name}")
        else:
            print(f"Bucket {bucket_name} is already empty")
            
        # Delete bucket
        s3_client.delete_bucket(Bucket=bucket_name)
        print(f"Deleted bucket {bucket_name}")
    except Exception as e:
        print(f"Error cleaning up bucket {bucket_name}: {str(e)}")

# Uncomment the following lines to delete the buckets
# print("Cleaning up resources...")
# delete_bucket_objects(input_bucket_name)
# delete_bucket_objects(output_bucket_name)
# print("Cleanup complete")

## Conclusion

This notebook demonstrates the end-to-end processing flow using AWS services and the unified Document model:

1. **Document Creation** - Initialize a Document object with input/output locations
2. **OCR Processing** - Convert PDF to text using AWS Textract via OcrService
3. **Classification** - Identify document types and sections with a Bedrock-hosted LLM via ClassificationService
4. **Extraction** - Extract structured information with a Bedrock-hosted LLM via ExtractionService
5. **Evaluation** - Compare extraction results against expected values and generate metrics
6. **Document Model** - Document object is consistently used between all services
7. **Result Storage** - Extraction results are stored in S3 with URIs tracked in the Document

Key benefits of this approach:

1. **Modularity** - Each service has a clear responsibility
2. **Consistency** - Same data model flows through the entire pipeline
3. **Performance** - Focused document pattern reduces resource usage
4. **Flexibility** - Support for multiple backends (Bedrock, SageMaker)
5. **Maintainability** - Standardized patterns across services
6. **Measurement** - Built-in evaluation capabilities to measure accuracy

This example uses a workflow with:
1. S3 buckets (created specifically for this demo)
2. AWS Textract OCR processing
3. LLM inferencing via Bedrock
4. A document sample (rvl_cdip_package.pdf)

The Evaluation Service specifically provides:

1. Multiple evaluation methods (EXACT, NUMERIC_EXACT, FUZZY, LLM)
2. Per-attribute and document-level metrics
3. Markdown and JSON format reporting
4. Integration with the Document model
5. Configuration-driven evaluation methods
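
As a rough illustration of how per-attribute methods such as EXACT, NUMERIC_EXACT, and FUZZY can differ, the sketch below implements simple comparators with the standard library. These functions are hypothetical and are not the `idp_common` implementation; in particular, using `difflib.SequenceMatcher` for the fuzzy score is an assumption. The default threshold of 0.7 mirrors the value mentioned in the ground truth cell.

```python
from difflib import SequenceMatcher


def exact_match(expected, actual) -> bool:
    """EXACT-style comparison: equal after trimming surrounding whitespace."""
    return str(expected).strip() == str(actual).strip()


def numeric_exact_match(expected, actual) -> bool:
    """NUMERIC_EXACT-style comparison: equal as numbers, ignoring thousands separators."""
    try:
        return float(str(expected).replace(",", "")) == float(str(actual).replace(",", ""))
    except ValueError:
        return False  # at least one side is not numeric


def fuzzy_match(expected, actual, threshold: float = 0.7) -> bool:
    """FUZZY-style comparison: similarity ratio at or above the threshold."""
    ratio = SequenceMatcher(None, str(expected).lower(), str(actual).lower()).ratio()
    return ratio >= threshold


print(exact_match("Kelahan, Benjamin", "Kelahan, Benjamin "))
print(numeric_exact_match(2030053328, "2030053328"))
print(fuzzy_match("LAB SERVICES CONSISTENCY REPORT - Annual Edition",
                  "LAB SERVICES CONSISTENCY REPORT"))
```

Choosing the method per attribute lets identifiers be held to strict equality while free-text fields tolerate formatting differences.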