# Intelligent Document Processing with Amazon Nova 2 Lite

This notebook demonstrates how to build a comprehensive **Intelligent Document Processing (IDP)** system using **Amazon Nova 2 Lite** on Amazon Bedrock.

We showcase how a single foundation model can handle the full IDP lifecycle: document classification and summarization, structured data extraction with schema enforcement, validation with extended thinking, and computational analysis with the built-in code interpreter.

### What You'll Learn

| Capability | Description |
|---|---|
| **Document Classification** | Automatically identify document types (bank statements, claims, medical forms) |
| **Summarization** | Generate structured summaries from PDFs and scanned images |
| **Structured Extraction** | Extract schema-compliant JSON using tool configurations (blueprints) |
| **Bounding Box Visualization** | Localize extracted fields spatially on document images |
| **Extended Thinking** | Compare reasoning depth levels (low/medium/high) for complex extraction |
| **Multi-Turn Document Q&A** | Interactive conversation over documents with context retention |
| **Code Interpreter** | Built-in Python sandbox for computation, validation, and analysis |
| **End-to-End Pipeline** | Classify → Extract → Validate → Analyze → Report |

### Sample Documents

This notebook processes the following documents from the `samples/` folder:

| File | Type | Format |
|---|---|---|
| `BankStatement.pdf` | Bank statement (native PDF) | Single-page PDF |
| `BankStatement.jpg` | Bank statement (scanned) | JPEG image |
| `claim-form.png` | CMS-1500 medical claim form | PNG image |
| `claims-pack.pdf` | Multi-page claims package | 10-page PDF (scanned images) |
| `sample1_cms-1500-P.pdf` | CMS-1500 medical claim form | Single-page PDF |

### Known PDF Limitations

Some PDFs may not be processed directly by Nova due to unsupported internal formats:
- **CMYK color profiles**: PDFs using CMYK instead of RGB color space
- **ICC color profiles**: PDFs with embedded ICC color management profiles
- **Transparency masks**: PDFs with alpha channels or transparency layers
- **SVG images**: PDFs containing embedded SVG graphics

When this happens, our utility functions automatically **fall back to converting PDF pages to JPEG images** using PyMuPDF, then send the images instead. This preserves all visual content while avoiding the format restrictions.

> **Note**: The image fallback uses **PyMuPDF**.
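
The fallback described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual `utils.py` code: the function names `render_pdf_pages_as_jpeg` and `jpeg_pages_to_image_blocks` are hypothetical, and the sketch assumes PyMuPDF 1.22 or later (for JPEG output from `Pixmap.tobytes`) and the Converse API image content block shape.

```python
from typing import List, Optional


def render_pdf_pages_as_jpeg(pdf_path: str, dpi: int = 150,
                             max_pages: Optional[int] = None) -> List[bytes]:
    """Render each PDF page to JPEG bytes (hypothetical helper)."""
    import fitz  # PyMuPDF; imported lazily so the rest of the sketch runs without it

    pages = []
    with fitz.open(pdf_path) as doc:
        for index, page in enumerate(doc):
            if max_pages is not None and index >= max_pages:
                break
            pix = page.get_pixmap(dpi=dpi)     # rasterize the page (RGB by default)
            pages.append(pix.tobytes("jpeg"))  # encode the raster as JPEG bytes
    return pages


def jpeg_pages_to_image_blocks(jpeg_pages: List[bytes]) -> List[dict]:
    """Wrap JPEG bytes in Converse API image content blocks, one per page."""
    return [
        {"image": {"format": "jpeg", "source": {"bytes": data}}}
        for data in jpeg_pages
    ]
```

In a fallback path like this, the resulting image blocks would take the place of the single `document` block in the user message.
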
### Prerequisites

- An AWS account with Amazon Bedrock access enabled
- Amazon Nova 2 Lite model access enabled
- Python 3.10+ with `boto3`, `Pillow`, and `PyMuPDF`


---
## 1. Setup and Installation

In [None]:
import subprocess, sys, os

def pip_install(*packages):
    subprocess.check_call(
        [sys.executable, '-m', 'pip', 'install', '-q'] + list(packages),
        env={k: v for k, v in os.environ.items() if 'BASH_FUNC' not in k}
    )

def conda_install(*packages):
    conda_exe = os.environ.get('CONDA_EXE', '')
    if conda_exe and os.path.exists(conda_exe):
        result = subprocess.run(
            [conda_exe, 'install', '-y', '-q', '-c', 'conda-forge'] + list(packages),
            capture_output=True, text=True,
            env={k: v for k, v in os.environ.items() if 'BASH_FUNC' not in k}
        )
        return result.returncode == 0
    return False

# Install PyMuPDF (PDF-to-image conversion)
try:
    import fitz
    print(f'✓ PyMuPDF already installed (version {fitz.VersionBind})')
except ImportError:
    print('Installing PyMuPDF...')
    if not conda_install('pymupdf'):
        pip_install('PyMuPDF')
    import fitz
    print(f'✓ PyMuPDF installed (version {fitz.VersionBind})')

# Install remaining dependencies
pip_install('boto3>=1.40.26', 'botocore>=1.40.26', 'pillow')
print('✓ All dependencies installed')

In [None]:
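# Restart the kernel so newly installed packages are picked up.
# This JavaScript hook only works in the classic Jupyter Notebook UI;
# in JupyterLab, restart from the Kernel menu instead.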
from IPython.core.display import HTML
HTML("<script>Jupyter.notebook.kernel.restart()</script>")

In [None]:
import boto3
import json
import base64
import sagemaker
import os
import logging
import time
from io import BytesIO
from pathlib import Path
from pprint import pprint
from datetime import datetime
from botocore.config import Config
from botocore.exceptions import ClientError
from IPython.display import display, Image, IFrame, Markdown, JSON

logging.basicConfig(level=logging.INFO, format="%(levelname)s: %(message)s")

In [None]:
sts_client = boto3.client('sts')
account_id = sts_client.get_caller_identity()["Account"]
session = sagemaker.Session()

REGION = boto3.session.Session().region_name
NOVA_LITE_MODEL_ID = "us.amazon.nova-2-lite-v1:0"
SAMPLES_DIR = "samples"

bedrock_client = boto3.client(
    "bedrock-runtime",
    region_name=REGION,
    config=Config(connect_timeout=300, read_timeout=300, retries={"max_attempts": 3})
)

print(f"Bedrock client initialized in region: {REGION}")
print(f"Nova 2 Lite model: {NOVA_LITE_MODEL_ID}")

print(f"\nSample documents:")
SUPPORTED_EXTENSIONS = {'.pdf', '.png', '.jpg', '.jpeg', '.gif', '.webp', '.csv', '.html', '.txt', '.md', '.doc', '.docx', '.xls', '.xlsx'}
sample_files = sorted(f for f in Path(SAMPLES_DIR).iterdir() if f.is_file() and f.suffix.lower() in SUPPORTED_EXTENSIONS)

for f in sample_files:
    size_kb = f.stat().st_size / 1024
    print(f"  - {f.name} ({size_kb:.1f} KB)")

---
## 2. Import Utility Functions

All reusable helpers are defined in `utils.py`, including document format detection, Bedrock Converse API wrappers, response parsing, and bounding box visualization functions.
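
For orientation, here is a rough sketch of what response-parsing helpers of this kind might do. It is not the code in `utils.py`: the names `sketch_extract_text` and `sketch_usage_line` are hypothetical, and the sample response is made up. The sketch relies only on the Converse API response layout (`output.message.content` and `usage`).

```python
def sketch_extract_text(response: dict) -> str:
    """Concatenate all text blocks from a Converse API response."""
    content = response.get("output", {}).get("message", {}).get("content", [])
    return "\n".join(block["text"] for block in content if "text" in block)


def sketch_usage_line(response: dict) -> str:
    """Format the token counts reported in the response's usage section."""
    usage = response.get("usage", {})
    return (f"tokens in={usage.get('inputTokens', 0)} "
            f"out={usage.get('outputTokens', 0)} "
            f"total={usage.get('totalTokens', 0)}")


# Made-up response, shaped like a Converse API result
sample_response = {
    "output": {"message": {"role": "assistant", "content": [
        {"text": '{"document_type": "INVOICE"}'},
    ]}},
    "usage": {"inputTokens": 1200, "outputTokens": 15, "totalTokens": 1215},
}

print(sketch_extract_text(sample_response))
print(sketch_usage_line(sample_response))
```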

In [None]:
from utils import (
    get_document_format, is_image_format, build_content_block,
    build_content_blocks_from_pdf_images,
    invoke_nova, extract_text, extract_tool_input, show_usage,
    get_document_image, get_color_for_field, draw_bounding_boxes,
    invoke_nova_with_reasoning, extract_reasoning_and_text,
    invoke_nova_with_code_interpreter, extract_code_interpreter_results,
    CODE_INTERPRETER_TOOL
)

print("Utility functions loaded from utils.py")

---
## 3. Document Classification

Document classification is the critical first step in any IDP pipeline: it routes each document to the appropriate extraction logic.

Here we use Amazon Nova's multimodal understanding to classify each document as a bank statement, medical claim form, invoice, or other. We use **Nova 2 Lite** for fast, cost-effective classification.
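
Because the model is asked for raw JSON but may still wrap it in a Markdown fence, the parsing step benefits from being defensive. The sketch below is one possible approach, with a hypothetical helper name `parse_classification`; it strips an optional fence, parses the JSON, and rejects categories outside the four listed in the prompt. Unlike a plain `split("\n", 1)[1]`, it does not raise `IndexError` when the fence has no newline.

```python
import json
import re
from typing import Optional

# Categories taken from the classification prompt
ALLOWED_TYPES = {"BANK_STATEMENT", "MEDICAL_CLAIM_FORM", "INVOICE", "OTHER"}

# Matches an optional language tag after the opening fence, then the payload
_FENCE_RE = re.compile(r"^```[a-zA-Z]*\s*\n?(.*?)\n?```\s*$", re.DOTALL)


def parse_classification(text: str) -> Optional[dict]:
    """Return the parsed classification dict, or None if it is unusable."""
    cleaned = text.strip()
    match = _FENCE_RE.match(cleaned)
    if match:
        cleaned = match.group(1).strip()
    try:
        result = json.loads(cleaned)
    except json.JSONDecodeError:
        return None
    if not isinstance(result, dict):
        return None
    if result.get("document_type") not in ALLOWED_TYPES:
        return None
    return result
```

Returning `None` for every failure mode keeps the caller simple: it can store the raw response and move on to the next document.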

In [None]:
CLASSIFY_SYSTEM = """You are an expert document classifier for an Intelligent Document Processing system.
Analyze documents and classify them into predefined categories with high accuracy.
Always respond with valid JSON only - no markdown formatting, no explanation outside the JSON."""

CLASSIFY_PROMPT = """Analyze the provided document and classify it into one of these categories:

- BANK_STATEMENT: Bank account statements showing transactions, balances, account details
- MEDICAL_CLAIM_FORM: Medical/healthcare claim forms (CMS-1500, UB-04, HCFA)
- INVOICE: Bills or invoices for goods/services
- OTHER: Documents that don't fit the above categories

Return a JSON object with these fields:
{"document_type": "<CATEGORY>", "confidence": "HIGH|MEDIUM|LOW", "reasoning": "<brief explanation>", "detected_fields": ["<key field types found>"]}"""

classification_results = {}

for fp in sample_files:
    fmt = get_document_format(str(fp))
    
    # For multi-page PDFs, classify each page individually
    if fmt == "pdf":
        try:
            import fitz
            doc = fitz.open(str(fp))
            num_pages = len(doc)
            doc.close()
        except Exception:
            num_pages = 1
        
        if num_pages > 1:
            print(f"\n{'='*60}")
            print(f"Classifying: {fp.name} ({num_pages} pages, classifying each page)")
            print("=" * 60)
            
            page_images = build_content_blocks_from_pdf_images(str(fp), max_pages=num_pages)
            page_results = []
            
            for page_num, img_block in enumerate(page_images, 1):
                print(f"\n  --- Page {page_num}/{num_pages} ---")
                # Send single page image for classification
                messages = [{"role": "user", "content": [
                    {"text": CLASSIFY_PROMPT},
                    img_block
                ]}]
                resp = bedrock_client.converse(
                    modelId=NOVA_LITE_MODEL_ID,
                    messages=messages,
                    system=[{"text": CLASSIFY_SYSTEM}],
                    inferenceConfig={"maxTokens": 1024, "temperature": 0, "topP": 0.9}
                )
                txt = extract_text(resp)
                print(f"  {txt[:150]}...")
                show_usage(resp)
                
                try:
                    clean = txt.strip()
                    if clean.startswith("```"):
                        clean = clean.split("\n", 1)[1].rsplit("```", 1)[0]
                    result = json.loads(clean)
                    result["page"] = page_num
                    page_results.append(result)
                except json.JSONDecodeError:
                    page_results.append({"page": page_num, "raw_response": txt})
            
            classification_results[fp.name] = {"pages": page_results, "total_pages": num_pages}
            continue
    
    # Single-page documents (images and single-page PDFs)
    print(f"\n{'='*60}")
    print(f"Classifying: {fp.name}")
    print("=" * 60)
    resp = invoke_nova(bedrock_client,
        CLASSIFY_PROMPT, file_paths=str(fp), model_id=NOVA_LITE_MODEL_ID,
        system_prompt=CLASSIFY_SYSTEM, max_tokens=1024
    )
    txt = extract_text(resp)
    print(txt)
    show_usage(resp)
    try:
        clean = txt.strip()
        if clean.startswith("```"):
            clean = clean.split("\n", 1)[1].rsplit("```", 1)[0]
        classification_results[fp.name] = json.loads(clean)
    except json.JSONDecodeError:
        classification_results[fp.name] = {"raw_response": txt}

In [None]:
print("\n" + "=" * 70)
print("CLASSIFICATION SUMMARY")
print("=" * 70)
dash = "-"
print(f"{'File':<30} {'Page':<6} {'Type':<25} {'Confidence'}")
print("-" * 70)
for fn, r in classification_results.items():
    if "pages" in r:
        for pr in r["pages"]:
            page = pr.get("page", "?")
            print(f"{fn:<30} {str(page):<6} {pr.get('document_type', '?'):<25} {pr.get('confidence', '?')}")
    else:
        print(f"{fn:<30} {dash:<6} {r.get('document_type', '?'):<25} {r.get('confidence', '?')}")
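
Once each file has a classification, the result can drive routing to an extraction blueprint. The following is a sketch under assumptions: the helper names `dominant_type` and `route` are hypothetical, the majority-vote rule for multi-page files is one possible policy among several, and the mapping reuses the tool names of the blueprints defined in Section 5.

```python
from collections import Counter
from typing import Optional

# Tool names match the blueprints in Section 5; the mapping itself is an assumption
ROUTES = {
    "BANK_STATEMENT": "extract_bank_statement",
    "MEDICAL_CLAIM_FORM": "extract_cms1500",
}


def dominant_type(result: dict) -> Optional[str]:
    """Return the most common document_type for a single- or multi-page result."""
    pages = result.get("pages", [result])
    # Pages that failed to parse carry only "raw_response" and are skipped
    types = [p["document_type"] for p in pages if "document_type" in p]
    if not types:
        return None
    return Counter(types).most_common(1)[0][0]


def route(result: dict) -> Optional[str]:
    """Return the extraction tool name, or None if no blueprint applies."""
    return ROUTES.get(dominant_type(result))
```

For a mixed package such as `claims-pack.pdf`, routing page by page may be more appropriate than a single vote for the whole file.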

---
## 4. Document Summarization

Amazon Nova can generate comprehensive summaries from documents, extracting key information and presenting it in a structured format.

Here we extract summaries and key insights from each document type, processing both native PDF and scanned image documents with Nova 2 Lite through the Bedrock Converse API.

In [None]:
SUMMARY_SYSTEM = """You are an expert document analyst. Extract and present information in a clear, structured format.
Be thorough and accurate - do not fabricate information not present in the document."""

BANK_PROMPT = """Analyze this bank statement and provide a comprehensive summary:

1. **Account Information**: Account holder, account number, statement period, bank name
2. **Financial Summary**: Opening balance, closing balance, total credits, total debits
3. **Transaction Overview**: Number of transactions, largest transaction, recurring patterns
4. **Key Insights**: Notable observations about spending patterns or account activity

Present your analysis in well-structured markdown."""

print("Processing: BankStatement.pdf")
resp = invoke_nova(bedrock_client, BANK_PROMPT, f"{SAMPLES_DIR}/BankStatement.pdf",
                   model_id=NOVA_LITE_MODEL_ID, system_prompt=SUMMARY_SYSTEM, max_tokens=4096)
show_usage(resp)
Markdown(extract_text(resp))

In [None]:
print("Processing: BankStatement.jpg (scanned image)")
resp = invoke_nova(bedrock_client, BANK_PROMPT, f"{SAMPLES_DIR}/BankStatement.jpg",
                   model_id=NOVA_LITE_MODEL_ID, system_prompt=SUMMARY_SYSTEM, max_tokens=4096)
show_usage(resp)
Markdown(extract_text(resp))

In [None]:
CLAIM_PROMPT = """Analyze this medical claim form and provide a comprehensive summary:

1. **Claimant Information**: Name, policy number, contact details
2. **Claim Details**: Type of claim, date of incident/loss, description
3. **Financial Details**: Claimed amount, deductibles
4. **Form Status**: Whether form appears complete, any missing fields
5. **Key Observations**: Notable details or flags

Present in well-structured markdown."""

print("Processing: claim-form.png")
resp = invoke_nova(bedrock_client, CLAIM_PROMPT, f"{SAMPLES_DIR}/claim-form.png",
                   model_id=NOVA_LITE_MODEL_ID, system_prompt=SUMMARY_SYSTEM, max_tokens=4096)
show_usage(resp)
Markdown(extract_text(resp))

---
## 5. Structured Data Extraction with Blueprints

One of the most powerful IDP capabilities is extracting structured JSON data that conforms to a predefined schema. We achieve this using the Converse API's **tool configuration**: the tool schema acts as a "blueprint" that constrains Nova to return data matching the exact structure you need.

This approach gives you:
- **Schema enforcement**: output always matches your expected structure
- **Type safety**: numbers, booleans, arrays are properly typed
- **Completeness**: required fields are always attempted
- **Consistency**: same schema produces comparable outputs across documents

### 5.1 Bank Statement Blueprint

In [None]:
bank_statement_tool = {
    "tools": [
        {
            "toolSpec": {
                "name": "extract_bank_statement",
                "description": "Extract structured data from a bank statement document.",
                "inputSchema": {
                    "json": {
                        "type": "object",
                        "properties": {
                            "bank_name": {
                                "type": "string"
                            },
                            "account_holder_name": {
                                "type": "string"
                            },
                            "account_number": {
                                "type": "string"
                            },
                            "statement_period": {
                                "type": "object",
                                "properties": {
                                    "start_date": {
                                        "type": "string",
                                        "format": "date"
                                    },
                                    "end_date": {
                                        "type": "string",
                                        "format": "date"
                                    }
                                },
                                "required": [
                                    "start_date",
                                    "end_date"
                                ]
                            },
                            "opening_balance": {
                                "type": "number"
                            },
                            "closing_balance": {
                                "type": "number"
                            },
                            "total_credits": {
                                "type": "number"
                            },
                            "total_debits": {
                                "type": "number"
                            },
                            "currency": {
                                "type": "string"
                            },
                            "transactions": {
                                "type": "array",
                                "items": {
                                    "type": "object",
                                    "properties": {
                                        "date": {
                                            "type": "string",
                                            "format": "date"
                                        },
                                        "description": {
                                            "type": "string"
                                        },
                                        "amount": {
                                            "type": "number"
                                        },
                                        "type": {
                                            "type": "string",
                                            "enum": [
                                                "credit",
                                                "debit"
                                            ]
                                        },
                                        "balance": {
                                            "type": "number"
                                        }
                                    }
                                }
                            }
                        },
                        "required": [
                            "bank_name",
                            "account_holder_name",
                            "account_number",
                            "statement_period",
                            "opening_balance",
                            "closing_balance"
                        ]
                    }
                }
            }
        }
    ],
    "toolChoice": {
        "tool": {
            "name": "extract_bank_statement"
        }
    }
}

print("Bank Statement Blueprint defined.")

In [None]:
print("Extracting structured data: BankStatement.pdf")
print("=" * 60)

resp = invoke_nova(bedrock_client,
    model_id=NOVA_LITE_MODEL_ID,
    prompt="Extract all structured data from this bank statement document.",
    file_paths=f"{SAMPLES_DIR}/BankStatement.pdf",
    system_prompt="You are a document data extraction expert. Extract all data accurately.",
    tool_config=bank_statement_tool, max_tokens=8000
)

bank_data = extract_tool_input(resp)
show_usage(resp)

if bank_data:
    print("\nExtracted Bank Statement Data:")
    print(json.dumps(bank_data, indent=2))
else:
    print("Fallback to text response:")
    print(extract_text(resp))

In [None]:
print("Extracting structured data: BankStatement.jpg (scanned)")
print("=" * 60)

resp = invoke_nova(bedrock_client,
    model_id=NOVA_LITE_MODEL_ID,
    prompt="Extract all structured data from this scanned bank statement image.",
    file_paths=f"{SAMPLES_DIR}/BankStatement.jpg",
    system_prompt="You are a document data extraction expert. Extract data accurately from scanned images.",
    tool_config=bank_statement_tool, max_tokens=8000
)

bank_img_data = extract_tool_input(resp)
show_usage(resp)

if bank_img_data:
    print("\nExtracted Bank Statement Data (from image):")
    print(json.dumps(bank_img_data, indent=2))
else:
    print("Fallback to text response:")
    print(extract_text(resp))
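
A quick arithmetic check can flag extraction errors in the totals: the opening balance plus credits minus debits should equal the closing balance. The sketch below uses a hypothetical helper, `reconcile_statement`, and assumes `total_credits` and `total_debits` are extracted as positive amounts. Since those two fields are optional in the blueprint, the helper returns `None` when any input is missing.

```python
from typing import Optional


def reconcile_statement(data: dict, tolerance: float = 0.01) -> Optional[float]:
    """Return the closing-balance discrepancy, or None if totals are missing.

    Assumes total_credits and total_debits are reported as positive amounts.
    """
    keys = ("opening_balance", "closing_balance", "total_credits", "total_debits")
    if any(data.get(k) is None for k in keys):
        return None
    expected_closing = (data["opening_balance"]
                        + data["total_credits"]
                        - data["total_debits"])
    diff = round(data["closing_balance"] - expected_closing, 2)
    # Treat sub-cent differences as floating-point noise
    return 0.0 if abs(diff) <= tolerance else diff
```

A non-zero result does not say which field is wrong, only that the four values are inconsistent; comparing the PDF and JPEG extractions of the same statement is one way to narrow it down.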

### 5.2 CMS-1500 Medical Claim Form Blueprint

In [None]:
cms1500_tool = {
    "tools": [
        {
            "toolSpec": {
                "name": "extract_cms1500",
                "description": "Extract structured data from a CMS-1500 Health Insurance Claim Form.",
                "inputSchema": {
                    "json": {
                        "type": "object",
                        "properties": {
                            "insurance_type": {
                                "type": "string"
                            },
                            "insured_id": {
                                "type": "string"
                            },
                            "patient": {
                                "type": "object",
                                "properties": {
                                    "name": {
                                        "type": "string"
                                    },
                                    "date_of_birth": {
                                        "type": "string",
                                        "format": "date"
                                    },
                                    "gender": {
                                        "type": "string",
                                        "enum": [
                                            "M",
                                            "F"
                                        ]
                                    },
                                    "address": {
                                        "type": "string"
                                    },
                                    "city": {
                                        "type": "string"
                                    },
                                    "state": {
                                        "type": "string"
                                    },
                                    "zip": {
                                        "type": "string"
                                    },
                                    "phone": {
                                        "type": "string"
                                    }
                                },
                                "required": [
                                    "name"
                                ]
                            },
                            "insured": {
                                "type": "object",
                                "properties": {
                                    "name": {
                                        "type": "string"
                                    },
                                    "policy_group_number": {
                                        "type": "string"
                                    },
                                    "plan_name": {
                                        "type": "string"
                                    }
                                }
                            },
                            "diagnosis_codes": {
                                "type": "array",
                                "items": {
                                    "type": "string"
                                }
                            },
                            "service_lines": {
                                "type": "array",
                                "items": {
                                    "type": "object",
                                    "properties": {
                                        "date_from": {
                                            "type": "string",
                                            "format": "date"
                                        },
                                        "date_to": {
                                            "type": "string",
                                            "format": "date"
                                        },
                                        "place_of_service": {
                                            "type": "string"
                                        },
                                        "procedure_code": {
                                            "type": "string"
                                        },
                                        "modifier": {
                                            "type": "string"
                                        },
                                        "diagnosis_pointer": {
                                            "type": "string"
                                        },
                                        "charges": {
                                            "type": "number"
                                        },
                                        "units": {
                                            "type": "integer"
                                        }
                                    }
                                }
                            },
                            "total_charges": {
                                "type": "number"
                            },
                            "amount_paid": {
                                "type": "number"
                            },
                            "balance_due": {
                                "type": "number"
                            },
                            "referring_provider": {
                                "type": "string"
                            },
                            "rendering_provider": {
                                "type": "object",
                                "properties": {
                                    "name": {
                                        "type": "string"
                                    },
                                    "npi": {
                                        "type": "string"
                                    }
                                }
                            },
                            "facility": {
                                "type": "object",
                                "properties": {
                                    "name": {
                                        "type": "string"
                                    },
                                    "address": {
                                        "type": "string"
                                    },
                                    "npi": {
                                        "type": "string"
                                    }
                                }
                            },
                            "patient_signature_on_file": {
                                "type": "boolean"
                            },
                            "accept_assignment": {
                                "type": "boolean"
                            }
                        },
                        "required": [
                            "patient",
                            "diagnosis_codes",
                            "service_lines",
                            "total_charges"
                        ]
                    }
                }
            }
        }
    ],
    "toolChoice": {
        "tool": {
            "name": "extract_cms1500"
        }
    }
}

print("CMS-1500 Blueprint defined.")

In [None]:
print("Extracting structured data: sample1_cms-1500-P.pdf")
print("=" * 60)

resp = invoke_nova(bedrock_client,
    model_id=NOVA_LITE_MODEL_ID,
    prompt="Extract all structured data from this CMS-1500 health insurance claim form.",
    file_paths=f"{SAMPLES_DIR}/sample1_cms-1500-P.pdf",
    system_prompt="You are a medical billing data extraction expert. Extract all fields accurately.",
    tool_config=cms1500_tool, max_tokens=8000
)

cms_data = extract_tool_input(resp)
show_usage(resp)

if cms_data:
    print("\nExtracted CMS-1500 Data:")
    print(json.dumps(cms_data, indent=2))
else:
    print("Fallback to text response:")
    print(extract_text(resp))

---
## 5.3 Document Visualization with Bounding Boxes

After extracting structured data, we can ask Amazon Nova to also return the **spatial location** (bounding boxes) of each extracted field within the document. We then overlay these bounding boxes on the rendered document image to visualize what was extracted and where.

This is particularly useful for:
- **Verification**: Confirming the model extracted data from the correct location
- **Debugging**: Identifying extraction errors or misaligned fields
- **UI Integration**: Highlighting extracted fields in a document viewer

### 5.3.1 Extract Data with Bounding Boxes from Bank Statement (Image)

We ask Nova to return both the extracted field values AND their spatial locations as normalized bounding box coordinates (0-1 range relative to document dimensions).
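
To draw a normalized box on an image, each coordinate is scaled by the image width or height. The helper below, with the hypothetical name `bbox_to_pixels`, sketches that conversion together with basic sanity checks. It is not the implementation of `draw_bounding_boxes` in `utils.py`, and the clamping behavior is an assumption.

```python
from typing import List, Optional, Tuple


def bbox_to_pixels(bbox: List[float], width: int,
                   height: int) -> Optional[Tuple[int, int, int, int]]:
    """Convert a normalized [x1, y1, x2, y2] box to pixel coordinates.

    Returns None when the box is malformed or degenerate.
    """
    if not isinstance(bbox, (list, tuple)) or len(bbox) != 4:
        return None
    try:
        # Clamp to [0, 1] in case the model slightly overshoots the page edges
        x1, y1, x2, y2 = (min(max(float(v), 0.0), 1.0) for v in bbox)
    except (TypeError, ValueError):
        return None
    if x2 <= x1 or y2 <= y1:
        return None
    return (round(x1 * width), round(y1 * height),
            round(x2 * width), round(y2 * height))


print(bbox_to_pixels([0.15, 0.12, 0.45, 0.16], 1000, 2000))  # (150, 240, 450, 320)
```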

In [None]:
BB_EXTRACT_PROMPT = """Analyze this document image and extract key fields with their spatial locations.

For each field you find, return:
- field_name: the name/type of the field
- value: the extracted text value
- bbox: bounding box as [x1, y1, x2, y2], giving the top-left corner (x1,y1) and bottom-right corner (x2,y2).
  ALL coordinates are NORMALIZED fractions from 0.0 to 1.0 relative to the full document dimensions.
  (0,0) is the top-left of the document, (1,1) is the bottom-right.
  IMPORTANT: x2 must be GREATER than x1, and y2 must be GREATER than y1.
  These are absolute positions, NOT widths/heights.

Return a JSON object with a "fields" array containing objects with field_name, value, and bbox.
Example format:
{"fields": [{"field_name": "account_number", "value": "1234567890", "bbox": [0.15, 0.12, 0.45, 0.16]}]}

Extract ALL visible fields including headers, names, numbers, dates, amounts, and any other text content.
Return ONLY valid JSON."""

BB_SYSTEM = "You are a document analysis expert that extracts text with precise spatial locations. Return only valid JSON."

# Process bank statement image
print("Extracting fields with bounding boxes: BankStatement.jpg")
print("=" * 60)

resp = invoke_nova(bedrock_client,
    prompt=BB_EXTRACT_PROMPT,
    file_paths=f"{SAMPLES_DIR}/BankStatement.jpg",
    model_id=NOVA_LITE_MODEL_ID,
    system_prompt=BB_SYSTEM,
    max_tokens=8000
)

show_usage(resp)
txt = extract_text(resp)

# Parse the response
try:
    clean = txt.strip()
    if clean.startswith("```"):
        clean = clean.split("\n", 1)[1].rsplit("```", 1)[0]
    bb_data = json.loads(clean)
    fields = bb_data.get("fields", [])
    print(f"Extracted {len(fields)} fields with bounding boxes")
    for f in fields[:5]:
        print(f"  {f['field_name']}: {str(f['value'])[:50]} | bbox: {f.get('bbox')}")
    if len(fields) > 5:
        print(f"  ... and {len(fields)-5} more fields")
except json.JSONDecodeError:
    print("Could not parse JSON response, showing raw text:")
    print(txt[:1000])
    fields = []

In [None]:
# Visualize bounding boxes on the bank statement image
if fields:
    doc_img = get_document_image(f"{SAMPLES_DIR}/BankStatement.jpg")
    if doc_img:
        # Print raw bboxes for debugging coordinate issues
        print("Raw bounding boxes from model:")
        for _f in fields[:5]:
            fn = _f.get("field_name", ""); bb = _f.get("bbox"); print(f"  {fn}: bbox={bb}")
        print()
        annotated_img, legend = draw_bounding_boxes(
            doc_img, fields, title="Bank Statement - Extracted Fields"
        )
        
        # Display the annotated image
        print(f"Document size: {doc_img.size[0]}x{doc_img.size[1]} pixels")
        print(f"Fields visualized: {len([f for f in fields if f.get('bbox')])}")
        print()
        
        # Show legend
        print("Legend:")
        for name, value, color in legend[:15]:
            print(f"  [{color}] {name}: {value}")
        
        # Display in notebook
        display(annotated_img)
    else:
        print("Could not load document image")
else:
    print("No fields with bounding boxes to visualize")

### 5.3.2 Extract Data with Bounding Boxes from Insurance Claim Form

In [None]:
# Process claim form image
print("Extracting fields with bounding boxes: claim-form.png")
print("=" * 60)

resp = invoke_nova(bedrock_client,
    prompt=BB_EXTRACT_PROMPT,
    file_paths=f"{SAMPLES_DIR}/claim-form.png",
    model_id=NOVA_LITE_MODEL_ID,
    system_prompt=BB_SYSTEM,
    max_tokens=8000
)

show_usage(resp)
txt = extract_text(resp)

try:
    clean = txt.strip()
    if clean.startswith("```"):
        clean = clean.split("\n", 1)[1].rsplit("```", 1)[0]
    bb_claim = json.loads(clean)
    claim_fields = bb_claim.get("fields", [])
    print(f"Extracted {len(claim_fields)} fields with bounding boxes")
except json.JSONDecodeError:
    print("Could not parse JSON response")
    claim_fields = []

if claim_fields:
    doc_img = get_document_image(f"{SAMPLES_DIR}/claim-form.png")
    if doc_img:
        annotated_img, legend = draw_bounding_boxes(
            doc_img, claim_fields, title="Insurance Claim Form - Extracted Fields"
        )
        print(f"\nDocument size: {doc_img.size[0]}x{doc_img.size[1]} pixels")
        print(f"Fields visualized: {len([f for f in claim_fields if f.get('bbox')])}")
        display(annotated_img)

---
## 6. Extraction Mode Comparison: Standard vs Extended Thinking

Nova 2 Lite can operate in two modes:
- **Standard mode** (default): Fast, efficient extraction ideal for high-volume processing
- **Extended thinking mode**: Deeper reasoning for complex documents requiring validation and cross-referencing

Let's compare both modes on the same document to see the quality/speed tradeoff.

In [None]:
COMPARE_PROMPT = """Extract all key-value pairs from this document as a JSON object.
Be thorough: include every field, amount, date, name, and identifier."""

doc_path = f"{SAMPLES_DIR}/sample1_cms-1500-P.pdf"
import time

comparison_results = {}

# Mode 1: Standard extraction (no reasoning)
print("=" * 60)
print("MODE 1: Standard Extraction (no extended thinking)")
print("=" * 60)

start = time.time()
resp_standard = invoke_nova(bedrock_client,
    prompt=COMPARE_PROMPT, file_paths=doc_path,
    model_id=NOVA_LITE_MODEL_ID,
    system_prompt="Extract all data as JSON.", max_tokens=4096
)
elapsed_standard = time.time() - start
txt_standard = extract_text(resp_standard)
show_usage(resp_standard)
print(f"Time: {elapsed_standard:.2f}s")
print(f"Output length: {len(txt_standard)} chars")
print(txt_standard[:800])

comparison_results["standard"] = {
    "time": elapsed_standard,
    "length": len(txt_standard),
    "tokens": resp_standard.get("usage", {}) if resp_standard else {}
}

# Mode 2: Extended thinking extraction (medium effort)
print("\n" + "=" * 60)
print("MODE 2: Extended Thinking Extraction (medium effort)")
print("=" * 60)

start = time.time()
resp_thinking = invoke_nova_with_reasoning(bedrock_client,
    prompt=COMPARE_PROMPT,
    file_paths=doc_path,
    reasoning_effort="medium",
    system_prompt="Extract all data as JSON. Reason carefully about each field."
)
elapsed_thinking = time.time() - start
reasoning, txt_thinking = extract_reasoning_and_text(resp_thinking)
show_usage(resp_thinking)
print(f"Time: {elapsed_thinking:.2f}s")
print(f"Reasoning: {reasoning[:100]}...")
print(f"Output length: {len(txt_thinking)} chars")
print(txt_thinking[:800])

comparison_results["extended_thinking"] = {
    "time": elapsed_thinking,
    "length": len(txt_thinking),
    "tokens": resp_thinking.get("usage", {}) if resp_thinking else {}
}

# Summary
print("\n" + "=" * 60)
print("COMPARISON SUMMARY")
print("=" * 60)
print(f"{'Mode':<25} {'Time (s)':<12} {'Output Len':<12} {'Input Tok':<12} {'Output Tok'}")
print("-" * 75)
for mode, data in comparison_results.items():
    t = data["tokens"]
    print(f"{mode:<25} {data['time']:<12.2f} {data['length']:<12} "
          f"{t.get('inputTokens','?'):<12} {t.get('outputTokens','?')}")

print("\nExtended thinking typically produces more thorough extraction,")
print("especially for complex documents with many interrelated fields.")

---
## 7. Multi-Document Processing and Business Insights

In production IDP systems, you often need to process multiple documents and generate aggregate insights.

Here we process all sample documents, extract structured data from each, and then use Nova to generate cross-document business intelligence: relationships, patterns, and actionable insights across the document portfolio.
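
Model responses sometimes wrap JSON in a Markdown code fence, so the cells in this notebook strip the fence inline before parsing. As a minimal sketch, that logic could be collected into one helper. The name `parse_json_response` and its `fallback` parameter are hypothetical and not part of the notebook's existing utilities.

```python
import json

FENCE = "`" * 3  # Markdown code-fence marker, built indirectly for readability


def parse_json_response(text, fallback=None):
    """Parse model output as JSON, tolerating an optional Markdown code fence.

    Hypothetical helper for illustration; returns `fallback` when parsing fails.
    """
    clean = (text or "").strip()
    if clean.startswith(FENCE):
        # Drop the opening fence line (it may carry a language tag such as "json").
        _, _, clean = clean.partition("\n")
        # Drop the closing fence and anything after it, if present.
        clean = clean.rsplit(FENCE, 1)[0]
    try:
        return json.loads(clean)
    except json.JSONDecodeError:
        return fallback


fenced = FENCE + 'json\n{"document_type": "bank_statement"}\n' + FENCE
print(parse_json_response(fenced))
print(parse_json_response("not json", fallback={"raw_text": "not json"}))
```

Using `str.partition` instead of indexing into `split` means a response that consists of a bare fence with no newline falls through to the fallback rather than raising an `IndexError`.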

In [None]:
all_extractions = {}

UNIVERSAL_EXTRACT = """Analyze this document and extract ALL information as a structured JSON object.
Include:
- document_type: what kind of document this is
- all_fields: a dictionary of every field/value pair found
- entities: list of people, organizations, addresses found
- financial_data: any monetary amounts, totals, balances
- dates: all dates found with their context
- reference_numbers: any IDs, account numbers, policy numbers, claim numbers

Return valid JSON only."""

EXTRACT_SYS = """You are an expert document processing system. Extract all information accurately.
Return only valid JSON - no markdown, no explanation."""

for fp in sample_files:
    print(f"\nProcessing: {fp.name}")
    print("-" * 40)
    resp = invoke_nova(bedrock_client,
        UNIVERSAL_EXTRACT, file_paths=str(fp), model_id=NOVA_LITE_MODEL_ID,
        system_prompt=EXTRACT_SYS, max_tokens=6000
    )
    txt = extract_text(resp)
    show_usage(resp)
    try:
        clean = txt.strip()
        if clean.startswith("```"):
            clean = clean.split("\n", 1)[1].rsplit("```", 1)[0]
        all_extractions[fp.name] = json.loads(clean)
        print(f"  Extracted {len(json.dumps(all_extractions[fp.name]))} chars of structured data")
    except json.JSONDecodeError:
        all_extractions[fp.name] = {"raw_text": txt[:500]}
        print(f"  Stored as raw text ({len(txt)} chars)")

print(f"\nTotal documents processed: {len(all_extractions)}")

### 7.1 Aggregate Business Insights

In [None]:
INSIGHTS_PROMPT = f"""You are a business analyst. Based on the following extracted data from multiple documents,
provide comprehensive business insights:

1. **Document Portfolio Summary**: Types and counts of documents processed
2. **Financial Overview**: All monetary amounts found across documents, totals
3. **Entity Analysis**: Key people, organizations mentioned across documents
4. **Cross-Document Relationships**: Any connections between documents (shared entities, references)
5. **Risk Flags**: Any anomalies, missing data, or concerns identified
6. **Recommendations**: Actionable next steps based on the document analysis

Extracted data from all documents:
{json.dumps(all_extractions, indent=2, default=str)[:15000]}

Provide your analysis in well-structured markdown."""

print("Generating aggregate business insights...")
print("=" * 60)

resp = invoke_nova(bedrock_client,
    prompt=INSIGHTS_PROMPT,
    model_id=NOVA_LITE_MODEL_ID,
    system_prompt="You are an expert business analyst specializing in document intelligence and cross-document analysis.",
    max_tokens=6000
)

show_usage(resp)
Markdown(extract_text(resp))

---
## 8. Interactive Document Q&A (Multi-Turn Conversation)

A powerful capability of Amazon Nova is **multi-turn conversation over documents**. You can attach a document to the first message, then ask a series of follow-up questions. Because every request carries the full message history, the model has the entire conversation context on each turn.

This is valuable for:
- **Human-in-the-loop workflows**: An operator reviews extracted data and asks clarifying questions
- **Deep document investigation**: Drilling into specific sections, cross-referencing data points
- **Conversational data extraction**: Iteratively pulling different pieces of information

The Converse API supports this natively through its `messages` list: each turn adds a user message and the assistant's reply.
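
The shape of that history can be sketched without calling the service. The helpers `add_user_turn` and `add_assistant_turn` below are hypothetical, and the document block is a plain text placeholder rather than a real attachment.

```python
def add_user_turn(messages, question, attachment=None):
    """Append a user message; `attachment` is an optional extra content block."""
    content = [{"text": question}]
    if attachment is not None:
        content.append(attachment)
    messages.append({"role": "user", "content": content})


def add_assistant_turn(messages, content):
    """Append the assistant's content blocks so later turns include them."""
    messages.append({"role": "assistant", "content": content})


history = []
# Placeholder only; a real block would come from a helper such as build_content_block.
doc_block = {"text": "[document content block goes here]"}

add_user_turn(history, "Who is the patient?", attachment=doc_block)
add_assistant_turn(history, [{"text": "(model answer)"}])
add_user_turn(history, "What is the total charge?")

# Roles alternate user, assistant, user; only the first user message has the attachment.
print([m["role"] for m in history])
```

The whole `history` list is what gets passed as `messages` on every request, which is why follow-up questions can stay text-only.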

### 8.1 Multi-Turn Q&A on a Medical Claim Form

Let's demonstrate a multi-turn conversation where we:
1. Send the CMS-1500 form and ask for a high-level overview
2. Follow up with specific questions about diagnosis codes
3. Ask about financial details
4. Request a risk assessment

In [None]:
# Interactive Document Q&A - Multi-turn conversation over a document
doc_path = f"{SAMPLES_DIR}/sample1_cms-1500-P.pdf"
doc_block = build_content_block(doc_path)

QA_SYSTEM = "You are a medical claims analyst. Answer questions about the attached document accurately and concisely."

# Build conversation turns
conversation_turns = [
    "Please review this CMS-1500 claim form and give me a brief overview: Who is the patient, what is the diagnosis, and what is the total amount claimed?",
    "What specific procedure codes are listed? Are they consistent with the diagnosis codes?",
    "Break down the charges by service line. Which service has the highest charge?",
    "Based on your analysis, are there any red flags or concerns with this claim that should be reviewed?"
]

# Run multi-turn conversation
messages = []

for turn_num, question in enumerate(conversation_turns, 1):
    print(f"\n{'='*60}")
    print(f"Turn {turn_num}: USER")
    print(f"{'='*60}")
    print(question)
    
    # First turn includes the document; subsequent turns are text-only
    if turn_num == 1:
        user_content = [{"text": question}, doc_block]
    else:
        user_content = [{"text": question}]
    
    messages.append({"role": "user", "content": user_content})
    
    # Call Nova
    resp = bedrock_client.converse(
        modelId=NOVA_LITE_MODEL_ID,
        messages=messages,
        system=[{"text": QA_SYSTEM}],
        inferenceConfig={"maxTokens": 2000, "temperature": 0, "topP": 0.9}
    )
    
    # Extract assistant response
    assistant_content = resp["output"]["message"]["content"]
    answer = "\n".join(c["text"] for c in assistant_content if "text" in c)
    
    print(f"\n{'='*60}")
    print(f"Turn {turn_num}: ASSISTANT")
    print(f"{'='*60}")
    print(answer[:1000])
    if len(answer) > 1000:
        print(f"... ({len(answer) - 1000} more chars)")
    show_usage(resp)
    
    # Append assistant response to conversation history
    messages.append({"role": "assistant", "content": assistant_content})

print(f"\n\nConversation complete: {len(messages)} messages ({len(messages)//2} turns)")

### 8.2 Comparing Documents in Conversation

Nova can also compare multiple documents within a conversation. This is useful for detecting differences between document versions, cross-referencing claims with supporting evidence, or validating data across related documents.

In [None]:
# Compare bank statement (PDF) with its scanned image version
pdf_block = build_content_block(f"{SAMPLES_DIR}/BankStatement.pdf")
img_block = build_content_block(f"{SAMPLES_DIR}/BankStatement.jpg")

compare_messages = [{
    "role": "user",
    "content": [
        {"text": """I have two versions of the same bank statement - one is the original PDF 
and the other is a scanned image. Please compare them:

1. Can you read both documents clearly?
2. Are there any differences in the data between the two versions?
3. Which version provides better data quality for automated extraction?
4. Are there any fields that are readable in one version but not the other?

Document 1 (PDF):"""},
        pdf_block,
        {"text": "\nDocument 2 (Scanned Image):"},
        img_block
    ]
}]

resp = bedrock_client.converse(
    modelId=NOVA_LITE_MODEL_ID,
    messages=compare_messages,
    system=[{"text": "You are a document quality analyst. Compare the documents thoroughly."}],
    inferenceConfig={"maxTokens": 2000, "temperature": 0, "topP": 0.9}
)

comparison = extract_text(resp)
print("DOCUMENT COMPARISON RESULTS")
print("=" * 60)
print(comparison)
show_usage(resp)

---
## 9. Extended Thinking (Reasoning Modes)

**Nova 2 Lite** supports **extended thinking**, an optional capability that enables deeper reasoning for complex problems. You control the reasoning depth with three effort levels:

| Effort Level | Best For | Token Cost |
|---|---|---|
| **Low** | Tasks with moderate complexity, code review, structured analysis | Low overhead |
| **Medium** | Multi-step tasks, debugging, planning with constraints | Moderate overhead |
| **High** | STEM reasoning, advanced problem-solving, deep validation | Highest (up to 128K tokens) |

For IDP, extended thinking is valuable for:
- **Validating** extracted data for consistency and completeness
- **Cross-referencing** fields within complex documents
- **Detecting anomalies** or potential fraud indicators

Reasoning content is currently returned as `[REDACTED]` in responses. You are still charged for reasoning tokens, since they are what improves the output quality.

> **API**: `additionalModelRequestFields={"reasoningConfig": {"type": "enabled", "maxReasoningEffort": "low|medium|high"}}`

### 9.1 Comparing Reasoning Levels on Complex Document Extraction

Let's compare all three reasoning levels on the CMS-1500 medical claim form, a complex, densely structured document where deeper reasoning should yield more accurate and complete extraction.
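
Each effort level corresponds to one value of `maxReasoningEffort` in the request's `reasoningConfig`. The sketch below only assembles the keyword arguments for a Converse call and does not contact the service. `build_converse_kwargs` and the model ID string are hypothetical placeholders, and the notebook's own `invoke_nova_with_reasoning` helper may be implemented differently.

```python
VALID_EFFORTS = ("low", "medium", "high")


def build_converse_kwargs(model_id, messages, system_prompt=None, reasoning_effort=None):
    """Assemble keyword arguments for a Converse call (hypothetical helper)."""
    kwargs = {"modelId": model_id, "messages": messages}
    if system_prompt:
        kwargs["system"] = [{"text": system_prompt}]
    if reasoning_effort is not None:
        if reasoning_effort not in VALID_EFFORTS:
            raise ValueError(f"reasoning_effort must be one of {VALID_EFFORTS}")
        # Extended thinking is switched on through the model-specific request fields.
        kwargs["additionalModelRequestFields"] = {
            "reasoningConfig": {"type": "enabled", "maxReasoningEffort": reasoning_effort}
        }
    return kwargs


example = build_converse_kwargs(
    model_id="example-model-id",  # placeholder, not a real model identifier
    messages=[{"role": "user", "content": [{"text": "Validate this claim."}]}],
    reasoning_effort="medium",
)
print(example["additionalModelRequestFields"])
```

Leaving `reasoning_effort` as `None` omits the field entirely, which corresponds to standard mode; the resulting dictionary could then be passed on as `bedrock_client.converse(**kwargs)`.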

In [None]:
REASONING_PROMPT = """Analyze this CMS-1500 medical claim form and extract ALL data.
Then perform a thorough validation:

1. Extract every field from the form including patient info, insured info, diagnosis codes, procedure codes, charges
2. Validate: Do the line item charges sum to the total? Are all required fields present?
3. Cross-reference: Are the diagnosis codes consistent with the procedures?
4. Flag any anomalies, missing data, or potential issues

Return your findings as a JSON object with keys: extracted_data, validation_results, anomalies_found"""

REASONING_SYSTEM = "You are a medical billing auditor. Extract data accurately and validate thoroughly. Return valid JSON only."

doc_path = f"{SAMPLES_DIR}/sample1_cms-1500-P.pdf"
results_by_effort = {}

for effort in ["low", "medium", "high"]:
    print(f"\n{'='*60}")
    print(f"Reasoning Effort: {effort.upper()}")
    print("=" * 60)

    start = time.time()
    resp = invoke_nova_with_reasoning(
        bedrock_client,
        prompt=REASONING_PROMPT,
        file_paths=doc_path,
        reasoning_effort=effort,
        system_prompt=REASONING_SYSTEM
    )
    elapsed = time.time() - start

    reasoning, answer = extract_reasoning_and_text(resp)
    show_usage(resp)
    print(f"Response time: {elapsed:.2f}s")
    print(f"Reasoning content: {reasoning[:100]}...")
    print(f"\nAnswer (first 1500 chars):")
    print(answer[:1500])
    if len(answer) > 1500:
        print(f"... ({len(answer) - 1500} more characters)")

    results_by_effort[effort] = {
        "time": elapsed,
        "answer_length": len(answer),
        "tokens": resp.get("usage", {}) if resp else {},
        "answer": answer
    }

In [None]:
# Compare results across reasoning levels
print("\n" + "=" * 60)
print("REASONING LEVEL COMPARISON")
print("=" * 60)
print(f"{'Effort':<10} {'Time (s)':<12} {'Input Tok':<12} {'Output Tok':<12} {'Answer Len'}")
print("-" * 60)
for effort, data in results_by_effort.items():
    tokens = data["tokens"]
    print(f"{effort:<10} {data['time']:<12.2f} {tokens.get('inputTokens','?'):<12} "
          f"{tokens.get('outputTokens','?'):<12} {data['answer_length']}")

print("\nKey insight: Higher reasoning effort typically produces more thorough")
print("validation, catches more anomalies, and provides more detailed cross-referencing.")

---
## 10. Built-in Code Interpreter

Amazon Nova 2 includes a **built-in code interpreter** (`nova_code_interpreter`) that can generate and execute Python code in an isolated sandbox. This is powerful for IDP because after extracting data from documents, you can:

- **Compute statistics** from financial data (totals, averages, trends)
- **Validate calculations** (verify line items sum to totals)
- **Generate visualizations** (charts, graphs from extracted data)
- **Run data transformations** (normalize dates, currencies, formats)

> **API**: `toolConfig={"tools": [{"systemTool": {"name": "nova_code_interpreter"}}]}`

The code interpreter runs Python in a sandbox and returns `{stdOut, stdErr, exitCode, isError}`.
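
The result payload described above can be unpacked with a small helper. This is a sketch under stated assumptions: `summarize_tool_result` is a hypothetical name, and it assumes each text item in a `toolResult` block holds a JSON string with the `stdOut`, `stdErr`, `exitCode`, and `isError` keys. The sample payload is made up for illustration.

```python
import json


def summarize_tool_result(tool_result):
    """Collect sandbox output from a toolResult block (hypothetical helper)."""
    summary = {"stdout": "", "stderr": "", "exit_code": None, "is_error": False}
    for item in tool_result.get("content", []):
        if "text" not in item:
            continue
        try:
            payload = json.loads(item["text"])
        except json.JSONDecodeError:
            payload = None
        if not isinstance(payload, dict):
            # Not the expected JSON object: keep the raw text as output.
            summary["stdout"] += item["text"]
            continue
        summary["stdout"] += payload.get("stdOut") or ""
        summary["stderr"] += payload.get("stdErr") or ""
        if payload.get("exitCode") is not None:
            summary["exit_code"] = payload["exitCode"]
        summary["is_error"] = summary["is_error"] or bool(payload.get("isError"))
    return summary


# Illustrative payload only; real results come from the model response.
sample = {"content": [{"text": json.dumps(
    {"stdOut": "total=175.0\n", "stdErr": "", "exitCode": 0, "isError": False}
)}]}
print(summarize_tool_result(sample))
```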

### 10.1 Financial Analysis from Extracted Bank Statement Data

We'll provide the previously extracted bank statement data to Nova with the code interpreter enabled, asking it to compute statistics and generate analysis.
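
For reference, the first and fourth analysis tasks are simple enough to compute locally, which gives a baseline to compare the sandbox output against. The sketch below uses the fallback sample transactions from the cell that follows (dates omitted for brevity). The function name `summarize_transactions` is hypothetical.

```python
from collections import Counter

# Fallback sample transactions from this section: (description, amount, type).
SAMPLE_TRANSACTIONS = [
    ("SALARY DEPOSIT", 5000.00, "credit"),
    ("RENT PAYMENT", 1500.00, "debit"),
    ("GROCERY STORE", 125.50, "debit"),
    ("UTILITY BILL", 200.00, "debit"),
    ("ONLINE TRANSFER IN", 750.00, "credit"),
    ("GROCERY STORE", 98.75, "debit"),
    ("INSURANCE PREMIUM", 350.00, "debit"),
    ("SALARY DEPOSIT", 5000.00, "credit"),
    ("UTILITY BILL", 180.00, "debit"),
]


def summarize_transactions(transactions):
    """Compute totals, net change, and descriptions that occur more than once."""
    credits = sum(amount for _, amount, kind in transactions if kind == "credit")
    debits = sum(amount for _, amount, kind in transactions if kind == "debit")
    counts = Counter(description for description, _, _ in transactions)
    return {
        "total_credits": round(credits, 2),
        "total_debits": round(debits, 2),
        "net_change": round(credits - debits, 2),
        "recurring": sorted(d for d, n in counts.items() if n > 1),
    }


print(summarize_transactions(SAMPLE_TRANSACTIONS))
```

If the code interpreter reports different totals for the same input, the discrepancy is worth inspecting in the generated code before trusting the analysis.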

In [None]:
# Use code interpreter to analyze extracted bank statement data
CI_PROMPT = """I have extracted the following bank statement data. 
Please analyze it using Python code:

1. Calculate total credits, total debits, and net change
2. Find the largest and smallest transactions
3. Group transactions by type and compute subtotals
4. Identify any recurring transactions (similar descriptions)
5. Print a clear summary table of the analysis

Here is the extracted data (from a bank statement):
""" + json.dumps(bank_data if bank_data else {"note": "bank_data not available - use sample data",
    "transactions": [
        {"date": "2024-01-05", "description": "SALARY DEPOSIT", "amount": 5000.00, "type": "credit"},
        {"date": "2024-01-07", "description": "RENT PAYMENT", "amount": 1500.00, "type": "debit"},
        {"date": "2024-01-10", "description": "GROCERY STORE", "amount": 125.50, "type": "debit"},
        {"date": "2024-01-12", "description": "UTILITY BILL", "amount": 200.00, "type": "debit"},
        {"date": "2024-01-15", "description": "ONLINE TRANSFER IN", "amount": 750.00, "type": "credit"},
        {"date": "2024-01-18", "description": "GROCERY STORE", "amount": 98.75, "type": "debit"},
        {"date": "2024-01-20", "description": "INSURANCE PREMIUM", "amount": 350.00, "type": "debit"},
        {"date": "2024-01-25", "description": "SALARY DEPOSIT", "amount": 5000.00, "type": "credit"},
        {"date": "2024-01-28", "description": "UTILITY BILL", "amount": 180.00, "type": "debit"},
    ]
}, indent=2)

CI_SYSTEM = "You are a financial analyst. Use the code interpreter to perform calculations and analysis. Show your work with Python code."

print("Requesting code interpreter analysis of bank statement data...")
print("=" * 60)

messages = [{"role": "user", "content": [{"text": CI_PROMPT}]}]
resp = invoke_nova_with_code_interpreter(bedrock_client, messages, system_prompt=CI_SYSTEM)

show_usage(resp)

# Parse the response to find code and results
content_list = resp["output"]["message"]["content"] if resp else []
for item in content_list:
    if "text" in item:
        print("\n--- Model Response ---")
        print(item["text"])
    elif "toolUse" in item:
        tu = item["toolUse"]
        print(f"\n--- Generated Python Code ---")
        print(tu.get("input", {}).get("code", ""))
        print(f"\n(Tool Use ID: {tu.get('toolUseId', 'N/A')})")
    elif "toolResult" in item:
        tr = item["toolResult"]
        print(f"\n--- Code Execution Result ---")
        for c in tr.get("content", []):
            if "text" in c:
                try:
                    result = json.loads(c["text"])
                    print(f"Exit Code: {result.get('exitCode', 'N/A')}")
                    if result.get("stdOut"):
                        print(f"Output:\n{result['stdOut']}")
                    if result.get("stdErr"):
                        print(f"Errors:\n{result['stdErr']}")
                except json.JSONDecodeError:
                    print(c["text"])

### 10.2 Medical Claim Validation with Code Interpreter

Use the code interpreter to validate extracted CMS-1500 data: check that charges sum correctly, required fields are present, and codes are properly formatted.
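
To make the checks concrete, here is a minimal local sketch of the kind of validation the model is asked to write. The regular expressions are deliberately simplified assumptions (one letter plus two digits with an optional decimal part for diagnosis codes, exactly five digits for procedure codes) and do not cover the full ICD-10 or CPT code sets. `validate_claim` is a hypothetical name, and the claim is the sample data used in this section.

```python
import re

# Simplified patterns, assumed for illustration only.
ICD10_PATTERN = re.compile(r"[A-Z]\d{2}(\.\d{1,4})?")
CPT_PATTERN = re.compile(r"\d{5}")


def validate_claim(claim):
    """Return a dict of check name -> bool for a CMS-1500-style claim."""
    lines = claim.get("service_lines", [])
    line_total = round(sum(line.get("charges", 0) for line in lines), 2)
    return {
        "charges_sum_to_total": line_total == round(claim.get("total_charges", 0), 2),
        "diagnosis_codes_valid": all(
            ICD10_PATTERN.fullmatch(code) is not None
            for code in claim.get("diagnosis_codes", [])
        ),
        "procedure_codes_valid": all(
            CPT_PATTERN.fullmatch(line.get("procedure_code", "")) is not None
            for line in lines
        ),
        "patient_name_present": bool(claim.get("patient", {}).get("name")),
    }


sample_claim = {
    "patient": {"name": "John Smith", "date_of_birth": "1985-03-15", "gender": "M"},
    "diagnosis_codes": ["J06.9", "R05.9"],
    "service_lines": [
        {"procedure_code": "99213", "charges": 150.00, "units": 1},
        {"procedure_code": "87081", "charges": 25.00, "units": 1},
    ],
    "total_charges": 175.00,
}

for check, passed in validate_claim(sample_claim).items():
    print(f"{'PASS' if passed else 'FAIL'}  {check}")
```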

In [None]:
CI_VALIDATE_PROMPT = """I have extracted the following CMS-1500 medical claim data.
Use Python code to perform thorough validation:

1. Verify that individual line item charges sum to total_charges
2. Check that all required fields are present and non-empty
3. Validate diagnosis code format (should be ICD-10 format like X##.## or X##)
4. Validate procedure code format (should be 5-digit CPT codes)
5. Check date formats are consistent
6. Print a validation report with PASS/FAIL for each check

Extracted CMS-1500 data:
""" + json.dumps(cms_data if cms_data else {"note": "using sample data",
    "patient": {"name": "John Smith", "date_of_birth": "1985-03-15", "gender": "M"},
    "diagnosis_codes": ["J06.9", "R05.9"],
    "service_lines": [
        {"procedure_code": "99213", "charges": 150.00, "units": 1, "date_from": "2024-01-15"},
        {"procedure_code": "87081", "charges": 25.00, "units": 1, "date_from": "2024-01-15"}
    ],
    "total_charges": 175.00
}, indent=2)

print("Requesting code interpreter validation of CMS-1500 data...")
print("=" * 60)

messages = [{"role": "user", "content": [{"text": CI_VALIDATE_PROMPT}]}]
resp = invoke_nova_with_code_interpreter(bedrock_client, messages,
    system_prompt="You are a medical billing auditor. Validate claim data rigorously using Python code.")

show_usage(resp)

content_list = resp["output"]["message"]["content"] if resp else []
for item in content_list:
    if "text" in item:
        print("\n--- Model Response ---")
        print(item["text"])
    elif "toolUse" in item:
        tu = item["toolUse"]
        print(f"\n--- Validation Code ---")
        print(tu.get("input", {}).get("code", ""))
    elif "toolResult" in item:
        tr = item["toolResult"]
        print(f"\n--- Validation Results ---")
        for c in tr.get("content", []):
            if "text" in c:
                try:
                    result = json.loads(c["text"])
                    if result.get("stdOut"):
                        print(result["stdOut"])
                    if result.get("stdErr"):
                        print(f"Errors: {result['stdErr']}")
                except json.JSONDecodeError:
                    print(c["text"])

---
## 11. End-to-End IDP Pipeline

This section combines all Nova 2 Lite capabilities into a realistic **end-to-end Intelligent Document Processing pipeline**:

```
┌─────────────┐     ┌──────────────┐     ┌─────────────────┐     ┌────────────────┐     ┌─────────────┐
│  1. CLASSIFY│ ──▶ │  2. EXTRACT  │ ──▶ │  3. VALIDATE    │ ──▶ │  4. ANALYZE    │ ──▶ │  5. REPORT  │
│  (Standard) │     │  (Blueprint) │     │  (Extended      │     │  (Code         │     │  (Standard) │
│             │     │              │     │   Thinking)     │     │   Interpreter) │     │             │
└─────────────┘     └──────────────┘     └─────────────────┘     └────────────────┘     └─────────────┘
```

Each stage uses a different Nova 2 Lite capability for its task.

In [None]:
def run_idp_pipeline(file_path):
    """
    Run the complete IDP pipeline on a single document.
    
    Steps:
    1. Classify (standard mode - fast)
    2. Extract (blueprint mode - structured)
    3. Validate (Nova 2 Lite + Extended Thinking - deep reasoning)
    4. Analyze (Nova 2 Lite + Code Interpreter - compute & verify)
    5. Report (standard mode - comprehensive summary)
    """
    pipeline_results = {"file": str(file_path), "stages": {}}
    
    print(f"\n{'#'*70}")
    print(f"# IDP PIPELINE: {Path(file_path).name}")
    print(f"{'#'*70}")
    
    # ── STAGE 1: CLASSIFY ──
    print(f"\n▶ STAGE 1: Document Classification (Nova 2 Lite)")
    print("-" * 50)
    
    resp = invoke_nova(bedrock_client,
        prompt=CLASSIFY_PROMPT, file_paths=str(file_path),
        model_id=NOVA_LITE_MODEL_ID, system_prompt=CLASSIFY_SYSTEM, max_tokens=512
    )
    txt = extract_text(resp)
    show_usage(resp)
    try:
        clean = txt.strip()
        if clean.startswith("```"):
            clean = clean.split("\n", 1)[1].rsplit("```", 1)[0]
        classification = json.loads(clean)
    except json.JSONDecodeError:
        classification = {"document_type": "UNKNOWN", "raw": txt[:200]}
    
    doc_type = classification.get("document_type", "UNKNOWN")
    print(f"  Classified as: {doc_type} (confidence: {classification.get('confidence', '?')})")
    pipeline_results["stages"]["classify"] = classification
    
    # ── STAGE 2: EXTRACT ──
    print(f"\n▶ STAGE 2: Structured Extraction (Blueprint)")
    print("-" * 50)
    
    extract_prompt = f"""Extract ALL structured data from this {doc_type} document.
Return a comprehensive JSON object with every field, value, date, amount, 
name, address, and identifier found in the document.
Include a 'metadata' key with document_type, page_count estimate, and extraction_confidence."""

    resp = invoke_nova(bedrock_client,
        prompt=extract_prompt, file_paths=str(file_path),
        model_id=NOVA_LITE_MODEL_ID,
        system_prompt="You are an expert document data extraction system. Return only valid JSON.",
        max_tokens=6000
    )
    txt = extract_text(resp)
    show_usage(resp)
    try:
        clean = txt.strip()
        if clean.startswith("```"):
            clean = clean.split("\n", 1)[1].rsplit("```", 1)[0]
        extracted = json.loads(clean)
    except json.JSONDecodeError:
        extracted = {"raw_extraction": txt[:1000]}
    
    print(f"  Extracted {len(json.dumps(extracted))} chars of structured data")
    pipeline_results["stages"]["extract"] = extracted
    
    # ── STAGE 3: VALIDATE ──
    print(f"\n▶ STAGE 3: Deep Validation (Nova 2 Lite + Extended Thinking)")
    print("-" * 50)
    
    validate_prompt = f"""You are a document validation expert. Carefully validate this extracted data:

{json.dumps(extracted, indent=2, default=str)[:8000]}

Perform these validation checks:
1. COMPLETENESS: Are all expected fields present for a {doc_type}?
2. CONSISTENCY: Do numeric values add up? Are dates logical?
3. FORMAT: Are IDs, codes, and references in expected formats?
4. ANOMALIES: Anything unusual or potentially incorrect?

Return a JSON object with: 
{{"validation_status": "PASS|WARN|FAIL", "checks": [...], "issues_found": [...], "confidence_score": 0-100}}"""

    resp = invoke_nova_with_reasoning(bedrock_client,
        prompt=validate_prompt,
        reasoning_effort="medium",
        system_prompt="You are a meticulous document auditor. Validate data thoroughly. Return valid JSON only."
    )
    reasoning, answer = extract_reasoning_and_text(resp)
    show_usage(resp)
    try:
        clean = answer.strip()
        if clean.startswith("```"):
            clean = clean.split("\n", 1)[1].rsplit("```", 1)[0]
        validation = json.loads(clean)
    except json.JSONDecodeError:
        validation = {"validation_status": "UNKNOWN", "raw": answer[:500]}
    
    print(f"  Validation status: {validation.get('validation_status', '?')}")
    print(f"  Confidence score: {validation.get('confidence_score', '?')}")
    issues = validation.get("issues_found", [])
    if issues:
        print(f"  Issues found: {len(issues)}")
        for issue in issues[:3]:
            print(f"    - {issue if isinstance(issue, str) else json.dumps(issue)[:100]}")
    pipeline_results["stages"]["validate"] = validation
    
    # ── STAGE 4: ANALYZE ──
    print(f"\n▶ STAGE 4: Computational Analysis (Code Interpreter)")
    print("-" * 50)
    
    analyze_prompt = f"""Analyze this extracted document data using Python code:

{json.dumps(extracted, indent=2, default=str)[:6000]}

Write Python code to:
1. Count and categorize all fields found
2. Sum any monetary amounts
3. List all dates found and check chronological order
4. Print a structured analysis summary"""

    messages = [{"role": "user", "content": [{"text": analyze_prompt}]}]
    resp = invoke_nova_with_code_interpreter(bedrock_client, messages,
        system_prompt="You are a data analyst. Use Python to analyze document data.")
    show_usage(resp)
    
    analysis_output = ""
    content_list = resp["output"]["message"]["content"] if resp else []
    for item in content_list:
        if "text" in item:
            analysis_output += item["text"]
        elif "toolResult" in item:
            for c in item["toolResult"].get("content", []):
                if "text" in c:
                    try:
                        r = json.loads(c["text"])
                        if r.get("stdOut"):
                            analysis_output += r["stdOut"]
                    except json.JSONDecodeError:
                        analysis_output += c["text"]
    
    print(f"  Analysis output ({len(analysis_output)} chars):")
    print(analysis_output[:800] if analysis_output else "  [No output]")
    pipeline_results["stages"]["analyze"] = {"output": analysis_output[:2000]}
    
    # ── STAGE 5: REPORT ──
    print(f"\n▶ STAGE 5: Final Report Generation")
    print("-" * 50)
    
    report_prompt = f"""Generate a comprehensive processing report for this document.

Classification: {json.dumps(classification, default=str)}
Extraction summary: {len(json.dumps(extracted))} chars extracted
Validation: {json.dumps(validation, default=str)[:2000]}

Create a concise executive summary covering:
1. Document type and key identifiers
2. Critical data points extracted
3. Validation status and any concerns
4. Recommended next steps

Format as clear markdown."""

    resp = invoke_nova(bedrock_client,
        prompt=report_prompt,
        model_id=NOVA_LITE_MODEL_ID,
        system_prompt="You are a document processing report writer. Be concise and actionable.",
        max_tokens=2000
    )
    report = extract_text(resp)
    show_usage(resp)
    print(report[:1000])
    pipeline_results["stages"]["report"] = report
    
    return pipeline_results

print("IDP Pipeline function defined.")

### 11.1 Run Pipeline on Sample Documents

In [None]:
# Run the full pipeline on two different document types
pipeline_outputs = {}

# Pipeline on CMS-1500 (complex medical form)
result = run_idp_pipeline(f"{SAMPLES_DIR}/sample1_cms-1500-P.pdf")
pipeline_outputs["cms1500"] = result

In [None]:
# Pipeline on bank statement
result = run_idp_pipeline(f"{SAMPLES_DIR}/BankStatement.pdf")
pipeline_outputs["bank_statement"] = result

### 11.2 Pipeline Summary

In [None]:
print("=" * 70)
print("IDP PIPELINE - COMPLETE SUMMARY")
print("=" * 70)
print(f"\nDocuments processed: {len(pipeline_outputs)}")
print()

for doc_name, result in pipeline_outputs.items():
    stages = result.get("stages", {})
    classify = stages.get("classify", {})
    validate = stages.get("validate", {})
    print(f"📄 {result.get('file', doc_name)}")
    print(f"   Type: {classify.get('document_type', '?')}")
    print(f"   Confidence: {classify.get('confidence', '?')}")
    print(f"   Validation: {validate.get('validation_status', '?')} "
          f"(score: {validate.get('confidence_score', '?')})")
    issues = validate.get("issues_found", [])
    print(f"   Issues: {len(issues)}")
    print()

---
## 12. Conclusion

This notebook demonstrated what **Amazon Nova 2 Lite** can do for Intelligent Document Processing:

| Capability | Implementation |
|---|---|
| **Document Classification** | Multimodal classification with JSON output |
| **Summarization** | Rich markdown summaries from PDFs and images |
| **Structured Extraction** | Tool configs (blueprints) forcing schema-compliant JSON |
| **Bounding Box Visualization** | Spatial field localization with image overlay |
| **Mode Comparison** | Standard vs Extended Thinking quality/speed analysis |
| **Batch Processing** | Multi-document extraction with aggregate insights |
| **Document Q&A** | Multi-turn conversation and document comparison |
| **Extended Thinking** | 3-level reasoning for deep validation & anomaly detection |
| **Code Interpreter** | Built-in Python sandbox for computation & validation |
| **End-to-End Pipeline** | Classify → Extract → Validate → Analyze → Report |

### Key Takeaways

1. **Nova 2 Lite** provides an excellent balance of capability and cost-efficiency for IDP workloads
2. **Tool configurations** (Converse API) act as blueprints for enforcing structured output schemas
3. **Extended thinking** enables quality/cost tradeoff control via low/medium/high reasoning effort levels
4. **Built-in code interpreter** allows in-flight computation and validation without external infrastructure
5. **Multimodal understanding** handles PDFs, scanned images, and multi-page documents seamlessly
6. **Multi-turn conversation** enables interactive document investigation and cross-document comparison
7. **End-to-end pipelines** combining multiple Nova capabilities deliver production-grade IDP workflows

### What's Possible Beyond This Notebook

- **Web Grounding** (`nova_grounding`): Enrich extracted data with real-time web information (e.g., look up medical codes, verify company details)
- **Fine-tuning**: Customize Nova 2 Lite via SFT/RFT on Amazon Bedrock for domain-specific extraction
- **Batch Processing at Scale**: Integrate with Amazon S3 and AWS Step Functions for production pipelines
- **Content Safety**: Add [Amazon Bedrock Guardrails](https://docs.aws.amazon.com/bedrock/latest/userguide/guardrails.html) for PII detection and content filtering
- **Multimodal Embeddings**: Use Nova Multimodal Embeddings for semantic document search and retrieval

### Resources

- [Amazon Nova Documentation](https://docs.aws.amazon.com/nova/latest/nova2-userguide/what-is-nova-2.html)
- [Amazon Bedrock Converse API](https://docs.aws.amazon.com/bedrock/latest/userguide/conversation-inference.html)
- [Extended Thinking Guide](https://docs.aws.amazon.com/nova/latest/nova2-userguide/extended-thinking.html)
- [Built-in Tools (Code Interpreter & Web Grounding)](https://docs.aws.amazon.com/nova/latest/userguide/tool-built-in.html)
- [Amazon Nova Samples on GitHub](https://github.com/aws-samples/amazon-nova-samples)