# Holistic Packet Classification with IDP Common Package

This notebook demonstrates how to use the holistic packet classification capability of the IDP Common Package to classify multi-document packets, in which each document may span multiple pages. The holistic approach examines the packet as a whole to identify the boundaries between the different document types it contains.

**Key Benefits of Holistic Packet Classification:**
1. Properly handles multi-page documents within a packet
2. Detects logical document boundaries
3. Identifies document types in the context of the whole packet
4. Handles documents where individual pages may not be clearly classifiable on their own
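
To make the idea of document boundaries concrete, the sketch below assumes a holistic classifier returns a list of segments, each with a document type and an inclusive page range. The `Segment` class and the `expand_segments` function are hypothetical illustrations and are not part of `idp_common`.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """Hypothetical holistic result: one logical document inside the packet."""
    doc_type: str
    first_page: int  # 1-based, inclusive
    last_page: int   # 1-based, inclusive


def expand_segments(segments, num_pages):
    """Map each page number to (segment index, doc_type); reject gaps and overlaps."""
    page_map = {}
    for index, seg in enumerate(segments):
        if seg.first_page > seg.last_page:
            raise ValueError(f"segment {index} has an empty page range")
        for page in range(seg.first_page, seg.last_page + 1):
            if page < 1 or page > num_pages:
                raise ValueError(f"page {page} is outside the packet")
            if page in page_map:
                raise ValueError(f"page {page} belongs to more than one segment")
            page_map[page] = (index, seg.doc_type)
    missing = [p for p in range(1, num_pages + 1) if p not in page_map]
    if missing:
        raise ValueError(f"pages without a segment: {missing}")
    return page_map


segments = [
    Segment("letter", 1, 2),
    Segment("invoice", 3, 3),
    Segment("invoice", 4, 5),  # a second invoice, distinct from the first
]
page_map = expand_segments(segments, num_pages=5)
```

Because each segment keeps its own index, two adjacent documents of the same type (pages 3 and 4-5 here) stay separate, which a purely page-by-page label cannot express.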

The notebook demonstrates how to process a document with:

1. OCR Service - Convert a PDF document to text using Amazon Textract
2. Classification Service - Classify document pages into sections using Bedrock with the text-based holistic classification method
3. Summarization Service - Summarize individual sections and the whole document using Bedrock

Each step uses the unified Document object model for data flow and consistency.

> **Note**: This notebook uses AWS services including S3, Textract, and Bedrock. You need valid AWS credentials with appropriate permissions to run this notebook.

## 1. Install Dependencies

The IDP Common Package supports granular installation through extras, so you can install only the components you need:
- `[core]` - Just core functionality 
- `[ocr]` - OCR service with Textract dependencies
- `[classification]` - Classification service dependencies
- `[extraction]` - Extraction service dependencies
- `[evaluation]` - Evaluation service dependencies
- `[all]` - All of the above
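
After installing, it can be useful to check which components actually import in the current environment. The sketch below is a generic helper built on `importlib`; it assumes that a component whose extra was not installed fails at import time with `ImportError`, which is an assumption about the package rather than documented behavior. The helper name `available_components` is hypothetical.

```python
import importlib


def available_components(package, components):
    """Return {component: bool} indicating whether package.component imports cleanly."""
    status = {}
    for name in components:
        try:
            importlib.import_module(f"{package}.{name}")
            status[name] = True
        except ImportError:
            # Covers a missing package, a missing submodule, or a missing dependency
            status[name] = False
    return status


# Demonstrated on a standard-library package so the example runs anywhere
print(available_components("json", ["decoder", "no_such_component"]))

# In this notebook one might call (assumed component names taken from the imports below):
# available_components("idp_common", ["ocr", "classification", "summarization"])
```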

In [None]:
# Let's make sure that modules are autoreloaded
%load_ext autoreload
%autoreload 2

ROOTDIR="../.."
# First uninstall existing package (to ensure we get the latest version)
%pip uninstall -y idp_common

# Install the IDP common package with all components in development mode
%pip install -q -e "{ROOTDIR}/lib/idp_common_pkg[dev, all]"

# Note: We can also install specific components like:
# %pip install -q -e "{ROOTDIR}/lib/idp_common_pkg[ocr,classification,extraction,evaluation]"

# Check installed version
%pip show idp_common | grep -E "Version|Location"

# Optionally use a .env file for environment variables
try:
    from dotenv import load_dotenv
    load_dotenv()  
except ImportError:
    pass  

## 2. Import Libraries and Set Up Environment

In [None]:
import os
import json
import time
import boto3
import logging
import datetime
import copy

# Import base libraries
from idp_common.models import Document, Status, Section, Page
from idp_common import ocr, classification, extraction, evaluation, summarization
from idp_common import s3
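# Optionally load environment variables from a .env file (python-dotenv may not be installed)
try:
    from dotenv import load_dotenv
    load_dotenv()
except ImportError:
    pass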

# Configure logging 
logging.basicConfig(level=logging.WARNING)  # Set root logger to WARNING (less verbose)
logging.getLogger('idp_common.ocr.service').setLevel(logging.INFO)  # Focus on service logs
logging.getLogger('textractor').setLevel(logging.WARNING)  # Suppress textractor logs
logging.getLogger('idp_common.evaluation.service').setLevel(logging.INFO)  # Enable evaluation logs
logging.getLogger('idp_common.bedrock.client').setLevel(logging.INFO)  # show prompts


# Set environment variables
os.environ['METRIC_NAMESPACE'] = 'IDP-Notebook-Example'
os.environ['AWS_REGION'] = boto3.session.Session().region_name or 'us-east-1'

# Get AWS account ID for unique bucket names
sts_client = boto3.client('sts')
account_id = sts_client.get_caller_identity()["Account"]
region = os.environ['AWS_REGION']

# Define sample PDF path 
SAMPLE_PDF_PATH = f"{ROOTDIR}/samples/rvl_cdip_package.pdf"

# Create unique bucket names based on account ID and region
input_bucket_name =  os.getenv("IDP_INPUT_BUCKET_NAME", f"idp-notebook-input-{account_id}-{region}")
output_bucket_name = os.getenv("IDP_OUTPUT_BUCKET_NAME", f"idp-notebook-output-{account_id}-{region}")

# Helper function to parse S3 URIs
def parse_s3_uri(uri):
    parts = uri.replace("s3://", "").split("/")
    bucket = parts[0]
    key = "/".join(parts[1:])
    return bucket, key

# Helper function to load JSON from S3
# Note: uses the global s3_client created in section 3, so call it only after that cell has run
def load_json_from_s3(uri):
    bucket, key = parse_s3_uri(uri)
    response = s3_client.get_object(Bucket=bucket, Key=key)
    content = response['Body'].read().decode('utf-8')
    return json.loads(content)

print("Environment setup:")
print(f"METRIC_NAMESPACE: {os.environ.get('METRIC_NAMESPACE')}")
print(f"AWS_REGION: {os.environ.get('AWS_REGION')}")
print(f"Input bucket: {input_bucket_name}")
print(f"Output bucket: {output_bucket_name}")
print(f"SAMPLE_PDF_PATH: {SAMPLE_PDF_PATH}")
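
The `parse_s3_uri` helper above accepts any string and removes `s3://` wherever it occurs. A slightly stricter variant is sketched below; the name `parse_s3_uri_strict` is hypothetical, and the function only assumes the usual `s3://bucket/key` layout.

```python
def parse_s3_uri_strict(uri):
    """Split 's3://bucket/key' into (bucket, key); raise ValueError on malformed input."""
    prefix = "s3://"
    if not uri.startswith(prefix):
        raise ValueError(f"not an S3 URI: {uri!r}")
    # partition splits on the first '/', so keys may themselves contain slashes
    bucket, _, key = uri[len(prefix):].partition("/")
    if not bucket:
        raise ValueError(f"S3 URI has no bucket: {uri!r}")
    return bucket, key


bucket, key = parse_s3_uri_strict("s3://my-bucket/path/to/result.json")
print(bucket, key)
```

Failing early on a malformed URI tends to give a clearer error than a later `get_object` call with an empty or wrong bucket name.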

## 3. Set Up S3 Buckets and Upload Sample File

In [None]:
# Create S3 client
s3_client = boto3.client('s3')

# Function to create a bucket if it doesn't exist
def ensure_bucket_exists(bucket_name):
    try:
        s3_client.head_bucket(Bucket=bucket_name)
        print(f"Bucket {bucket_name} already exists")
    except Exception:
        try:
            if region == 'us-east-1':
                s3_client.create_bucket(Bucket=bucket_name)
            else:
                s3_client.create_bucket(
                    Bucket=bucket_name,
                    CreateBucketConfiguration={'LocationConstraint': region}
                )
            print(f"Created bucket: {bucket_name}")
            
            # Wait for bucket to be accessible
            waiter = s3_client.get_waiter('bucket_exists')
            waiter.wait(Bucket=bucket_name)
        except Exception as e:
            print(f"Error creating bucket {bucket_name}: {str(e)}")
            raise

# Ensure both buckets exist
ensure_bucket_exists(input_bucket_name)
ensure_bucket_exists(output_bucket_name)

# Upload the sample file to S3
sample_file_key = "sample-" + datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S") + ".pdf"
with open(SAMPLE_PDF_PATH, 'rb') as file_data:
    s3_client.upload_fileobj(file_data, input_bucket_name, sample_file_key)

print(f"Uploaded sample file to: s3://{input_bucket_name}/{sample_file_key}")
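
`ensure_bucket_exists` treats every `head_bucket` failure as "bucket missing", including permission errors. The sketch below shows one way to tell the cases apart by inspecting the error code that botocore attaches to its `ClientError` exceptions as `response['Error']['Code']`. The function name `bucket_status` is hypothetical, and it reads the `response` attribute by duck typing so the block does not need to import botocore.

```python
def bucket_status(s3_client, bucket_name):
    """Return 'exists', 'missing', or 'forbidden' for a bucket; re-raise anything else."""
    try:
        s3_client.head_bucket(Bucket=bucket_name)
        return "exists"
    except Exception as exc:
        # botocore's ClientError carries a dict like {'Error': {'Code': '404', ...}}
        response = getattr(exc, "response", None) or {}
        code = str(response.get("Error", {}).get("Code", ""))
        if code in ("404", "NoSuchBucket", "NotFound"):
            return "missing"
        if code in ("403", "AccessDenied", "Forbidden"):
            return "forbidden"
        raise


# Possible usage in this notebook (requires the s3_client created above):
# if bucket_status(s3_client, input_bucket_name) == "missing":
#     ...create the bucket...
```

A `forbidden` result usually means the bucket name is owned by another account or the credentials lack permission, in which case attempting to create the bucket would fail as well.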

## 4. Set Up Configuration

In [None]:
import yaml
with open(f"{ROOTDIR}/config.yml", 'r') as file:
    CONFIG = yaml.safe_load(file)
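
Since the classification method is driven entirely by the configuration, it may help to fail fast when the loaded file does not select the holistic method. The sketch below relies only on the `classification.classificationMethod` key and the `textbasedHolisticClassification` value used later in this notebook; the helper names and the `example_config` dictionary are hypothetical.

```python
def get_classification_method(config):
    """Return the configured classification method, or None if it is not set."""
    classification_cfg = config.get("classification") or {}
    return classification_cfg.get("classificationMethod")


def require_method(config, expected="textbasedHolisticClassification"):
    """Raise ValueError unless the config selects the expected classification method."""
    method = get_classification_method(config)
    if method != expected:
        raise ValueError(
            f"classificationMethod is {method!r}, expected {expected!r}"
        )
    return method


# Minimal stand-in for the loaded CONFIG; a real config file contains many more keys
example_config = {
    "classification": {"classificationMethod": "textbasedHolisticClassification"}
}
method = require_method(example_config)
print(method)
```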

## 5. Process Document with OCR

In [None]:
# Initialize a new Document
document = Document(
    id="rvl-cdip-package",
    input_bucket=input_bucket_name,
    input_key=sample_file_key,
    output_bucket=output_bucket_name,
    status=Status.QUEUED
)

print(f"Created document with ID: {document.id}")
print(f"Status: {document.status.value}")

# Create OCR service with Textract
# Valid features are 'LAYOUT', 'FORMS', 'SIGNATURES', 'TABLES' (uses analyze_document API)
# or leave it empty (to use basic detect_document_text API)
ocr_service = ocr.OcrService(
    region=region,
    enhanced_features=['LAYOUT']
)

# Process document with OCR
print("\nProcessing document with OCR...")
start_time = time.time()
document = ocr_service.process_document(document)
ocr_time = time.time() - start_time

print(f"OCR processing completed in {ocr_time:.2f} seconds")
print(f"Document status: {document.status.value}")
print(f"Number of pages processed: {document.num_pages}")

# Show pages information
print("\nProcessed pages:")
for page_id, page in document.pages.items():
    print(f"Page {page_id}:")
    print(f"  Image URI: {page.image_uri}")
    print(f"  Raw Text URI: {page.raw_text_uri}")
    print(f"  Parsed Text URI: {page.parsed_text_uri}")
print("\nMetering:")
print(json.dumps(document.metering))
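
The `start_time = time.time()` pattern recurs in every processing step of this notebook. One way to factor it out is a small context manager, sketched below with the hypothetical name `timed`; it uses only the standard library.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results=None):
    """Print the wall-clock time of the enclosed block and optionally record it."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if results is not None:
            results[label] = elapsed
        print(f"{label} completed in {elapsed:.2f} seconds")


timings = {}
with timed("Example step", timings):
    time.sleep(0.01)  # stand-in for a call such as ocr_service.process_document(document)
```

Collecting the timings in one dictionary makes it easy to compare the OCR, classification, and summarization steps at the end of a run.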

## 6. Classify the Document

In [None]:
# Verify that CONFIG specifies "classificationMethod": "textbasedHolisticClassification"
print("*****************************************************************")
print(f'CONFIG classificationMethod: {CONFIG["classification"].get("classificationMethod")}')
print("*****************************************************************")

# Create classification service with Bedrock backend
# The classification method is set in the config
classification_service = classification.ClassificationService(
    config=CONFIG, 
    backend="bedrock" 
)

# Classify the document
print("\nClassifying document...")
start_time = time.time()
document = classification_service.classify_document(document)
classification_time = time.time() - start_time
print(f"Classification completed in {classification_time:.2f} seconds")
print(f"Document status: {document.status.value}")

### Show classification results

In [None]:
if document.sections:
    print("\nDetected sections:")
    for section in document.sections:
        print(f"Section {section.section_id}: {section.classification}")
        print(f"  Pages: {section.page_ids}")
else:
    print("\nNo sections detected")

# Show page classification
print("\nPage-level classifications:")
for page_id, page in sorted(document.pages.items()):
    print(f"Page {page_id}: {page.classification}")
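
For comparison, the sketch below shows what a purely page-level approach can recover: it groups consecutive pages that share a class into sections. The function `group_pages` and the sample data are hypothetical and do not reflect how `idp_common` builds sections.

```python
from itertools import groupby


def group_pages(page_classes):
    """Group consecutive pages with the same class.

    page_classes: list of (page_id, doc_class) tuples in page order.
    Returns a list of (doc_class, [page_ids]) tuples.
    """
    sections = []
    for doc_class, run in groupby(page_classes, key=lambda item: item[1]):
        sections.append((doc_class, [page_id for page_id, _ in run]))
    return sections


pages = [(1, "letter"), (2, "letter"), (3, "invoice"), (4, "invoice"), (5, "invoice")]
sections = group_pages(pages)
print(sections)
```

This grouping merges pages 3 to 5 into a single invoice even if the packet actually contains two invoices back to back, which is the kind of boundary the holistic method is meant to detect.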

# Summarization service

## PART 1: Processing Individual Sections

In [None]:
summarization_service = summarization.SummarizationService(config=CONFIG)

print("=== PART 1: Processing Individual Sections ===")
n = 3  # Only process first 3 sections to save time
# Process each section directly using the section_id
for section in document.sections[:n]:  
    print(f"\nProcessing section {section.section_id} (class: {section.classification})")
    
    # Process section directly with the original document
    start_time = time.time()
    document, section_metering = summarization_service.process_document_section(
        document=document,
        section_id=section.section_id
    )
    summarization_time = time.time() - start_time
    print(f"Summarization for section {section.section_id} completed in {summarization_time:.2f} seconds")
    
    # Print the summary content if available
    if section.attributes and 'summary_uri' in section.attributes:
        summary_uri = section.attributes['summary_uri']
        summary_md_uri = section.attributes.get('summary_md_uri')
        
        print(f"\nJSON Summary URI: {summary_uri}")
        if summary_md_uri:
            print(f"Markdown Summary URI: {summary_md_uri}")
        
        # Get and display JSON summary
        try:
            # Get the JSON summary content from S3
            summary_content = s3.get_json_content(summary_uri)
            print("\nJSON Summary Content:")
            
            # Check if there's a specific summary field in the content
            if isinstance(summary_content, dict):
                if 'summary' in summary_content:
                    print("Summary field found in JSON:")
                    print(summary_content['summary'][:300] + "..." if len(summary_content['summary']) > 300 else summary_content['summary'])
                elif 'content' in summary_content:
                    print("Content field found in JSON:")
                    # Convert to a string first so that non-string content can be sliced safely
                    content_text = str(summary_content['content'])
                    print(content_text[:300] + "..." if len(content_text) > 300 else content_text)
                else:
                    # Print the whole content if no specific summary field
                    print("Full JSON content (truncated):")
                    json_text = json.dumps(summary_content, indent=2)
                    print(json_text[:300] + "..." if len(json_text) > 300 else json_text)
            else:
                print(summary_content)
        except Exception as e:
            print(f"Error retrieving JSON summary: {e}")
            
        # Get and display Markdown summary if available
        if summary_md_uri:
            try:
                # Get the markdown summary content from S3
                markdown_content = s3.get_text_content(summary_md_uri)
                print("\nMarkdown Summary Content (first 300 chars):")
                print(markdown_content[:300] + "..." if len(markdown_content) > 300 else markdown_content)
                
                # Display the rendered markdown
                from IPython.display import Markdown, display
                print("\nRendered Markdown Summary:")
                display(Markdown(markdown_content))
            except Exception as e:
                print(f"Error retrieving markdown summary: {e}")
    else:
        print("No summary available for this section")
    
print(f"\nSummarization for first {n} sections complete.")
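
The truncate-and-print expressions in the cell above could be collected into a single helper. The sketch below uses the hypothetical name `preview` and serializes non-string values with `json.dumps` before truncating.

```python
import json


def preview(value, limit=300):
    """Return value as text, cut to `limit` characters with '...' appended if truncated."""
    if isinstance(value, str):
        text = value
    else:
        # default=str keeps the helper from failing on values JSON cannot encode
        text = json.dumps(value, indent=2, default=str)
    return text[:limit] + "..." if len(text) > limit else text


print(preview("A short summary."))
print(preview("x" * 1000, limit=20))
print(preview({"summary": "Section covers an invoice."}, limit=200))
```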

## PART 2: Processing Document with Sections

In [None]:
document_with_sections = copy.deepcopy(document)

# Process the entire document using the section-based approach
start_time = time.time()
document_with_sections = summarization_service.process_document(
    document=document_with_sections,
    store_results=True
)
summarization_time = time.time() - start_time
print(f"Document summarization with sections completed in {summarization_time:.2f} seconds")

In [None]:
# Print the combined summary report URI
if document_with_sections.summary_report_uri:
    print(f"\nCombined Summary Report URI: {document_with_sections.summary_report_uri}")
    
    # Try to get and display the markdown summary
    try:
        # Extract bucket and key from the s3 URI
        uri_parts = document_with_sections.summary_report_uri.replace("s3://", "").split("/", 1)
        bucket = uri_parts[0]
        key = uri_parts[1]
        
        # Use boto3 to get the object directly
        s3_client = boto3.client('s3')
        response = s3_client.get_object(Bucket=bucket, Key=key)
        markdown_content = response['Body'].read().decode('utf-8')
        
        # Display a preview of the summary
        print("\nSummary Preview (first 500 chars):")
        print(markdown_content[:500] + "..." if len(markdown_content) > 500 else markdown_content)
        
        # Display the full markdown summary in a rendered cell
        from IPython.display import Markdown, display
        print("\nFull Rendered Summary:")
        display(Markdown(markdown_content))
        
        # Also check if JSON summary exists
        json_key = key.replace("summary.md", "summary.json")
        try:
            json_response = s3_client.get_object(Bucket=bucket, Key=json_key)
            summary_json = json.loads(json_response['Body'].read().decode('utf-8'))
            # print("\nJSON Summary Structure:")
            # print(f"Keys: {list(summary_json.keys())}")
            
            # Check for section summaries
            if 'metadata' in summary_json and 'section_summaries' in summary_json['metadata']:
                print(f"\nSection Summaries: {list(summary_json['metadata']['section_summaries'].keys())}")
        except Exception as e:
            print(f"Note: JSON summary not found or couldn't be parsed: {e}")
            
    except Exception as e:
        print(f"Error retrieving summary: {e}")
else:
    print("No summary available")

# Check individual section summaries if available
# if document_with_sections.sections:
#     print("\nIndividual Section Summaries:")
#     for section in document_with_sections.sections:
#         if section.attributes and 'summary_md_uri' in section.attributes:
#             # print(f"Section {section.section_id} ({section.classification}) Summary: {section.attributes['summary_md_uri']}")
#             print(f"{section.attributes['summary_md_uri']}")


## PART 3: Processing Document without Sections

In [None]:
# This cell is commented out; uncomment it to run the fallback approach on a copy of the document without sections
# document_without_sections = copy.deepcopy(document)
# document_without_sections.sections = []  # Remove all sections

# # Process the document without sections (should use the fallback approach)
# start_time = time.time()
# document_without_sections = summarization_service.process_document(
#     document=document_without_sections,
#     store_results=True
# )
# summarization_time = time.time() - start_time
# print(f"Document summarization without sections completed in {summarization_time:.2f} seconds")

# # Print the summary report URI
# if document_without_sections.summary_report_uri:
#     print(f"\nWhole Document Summary Report URI: {document_without_sections.summary_report_uri}")
    
#     # Try to get and display the markdown summary
#     try:
#         # Extract bucket and key from the s3 URI
#         uri_parts = document_without_sections.summary_report_uri.replace("s3://", "").split("/", 1)
#         bucket = uri_parts[0]
#         key = uri_parts[1]
        
#         # Use boto3 to get the object directly
#         s3_client = boto3.client('s3')
#         response = s3_client.get_object(Bucket=bucket, Key=key)
#         summary_md = response['Body'].read().decode('utf-8')
        
#         print("\nWhole Document Summary (first 500 chars):")
#         print(summary_md[:500] + "..." if len(summary_md) > 500 else summary_md)
#     except Exception as e:
#         print(f"Error retrieving whole document summary: {e}")
# else:
#     print("No whole document summary available")