# 🏋️‍♂️ Azure AI Content Understanding - Classifier and Analyzer Demo

###  Background
At Contoso Insurance, our customers upload their claim forms to our website through our customer portal or fax them to an 800 number that drops them into the same location.
We find that they often create a single file that contains not only the claim document but also all the supporting documents.
Because this may include statements, bills, and receipts from many different doctors, hospitals, and labs, there is no way to extract data from the varied document types without parsing the raw OCR data or creating custom input templates.

#  A Solution - Content Understanding  
Content Understanding allows schema-based extraction from pre-classified document types, even when they are bundled into one file!

This notebook demonstrates how to use the Azure [AI Content Understanding](https://learn.microsoft.com/en-us/azure/ai-services/content-understanding/overview) service to:
1. Create a classifier to categorize bundled documents inside a single PDF file
2. Create 3 custom analyzers to extract specific fields from specific document types.
3. Combine the classifier and the 3 analyzers to create an enhanced classifier.
4. Classify, logically split, and analyze multiple documents bundled in a single PDF file.
   
If you’d like to learn more before getting started, see the official documentation:  
 - [Content Understanding classifier](https://learn.microsoft.com/en-us/azure/ai-services/content-understanding/concepts/classifier)  
 - [Understanding Analyzers in Azure AI Services](https://learn.microsoft.com/en-us/azure/ai-services/content-understanding/concepts/analyzer-templates?tabs=document)  
 - [Azure AI Content Understanding document solutions](https://learn.microsoft.com/en-us/azure/ai-services/content-understanding/document/overview)  

## 0. Prerequisites  
To get started, make sure you have the following resources and permissions:

1. An Azure subscription. If you don't have an Azure subscription, create a free account.
2. <mark>Create an AI Foundry Resource in the Azure Portal in one of the following supported regions: westus, swedencentral, or australiaeast.</mark>    
This will be your Content Understanding Resource.  

**FYI only -  no need to do this**  
If you wanted to integrate your Content Understanding Resource with an AI Foundry project, you would  

3. From the Azure Portal, create an Azure AI Foundry hub-based project in one of the following supported regions: westus, swedencentral, or australiaeast.    
4. Create a project from the home page of AI Foundry Studio, or the Content Understanding landing page.  
5. Add the Content Understanding Resource as a connected resource to the project.
  
  Keep track of the [AI Content Understanding release notes](https://learn.microsoft.com/en-us/azure/ai-services/content-understanding/whats-new) to know when Content Understanding becomes fully integrated with the new type of AI Foundry projects.

## 1. Import Required Libraries

In [None]:
import json
import logging
import os
import sys
import uuid
from pathlib import Path
from dotenv import find_dotenv, load_dotenv
from azure.identity import DefaultAzureCredential, get_bearer_token_provider

load_dotenv(find_dotenv())

print("✅ Libraries imported successfully!")

## 2. Import Azure Content Understanding Client

The `AzureContentUnderstandingClient` class handles all API interactions with the Azure AI service.  
There is currently no official SDK for Content Understanding.

In [None]:
try:
    from content_understanding_client import AzureContentUnderstandingClient
    print("✅ Azure Content Understanding Client imported successfully!")
except ImportError:
    print("❌ Error: Make sure 'content_understanding_client.py' is in the same directory as this notebook.")
    raise

## 3. Load Environment Variables and set credentials

Before executing this cell the first time  
Modify your .env file to add the following entries at the bottom  
  
SERVICE_FOR_CU = The endpoint of the Azure AI Services resource in Foundry  
SERVICE_API_FOR_CU = The api version for the service - 2025-05-01-preview will work  
SAMPLE_CLAIMS_BUNDLE = The path to "Data/sample claim submission.pdf"   


<img src="../images/cu-endpoint.png" alt="Env Vars for Content Understanding" width="70%"/>

**Save the .env file**
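
Once the `.env` file has been saved and reloaded, a quick check like the one below can confirm that the three entries are present. This is only a minimal sketch: `missing_env_vars` and `REQUIRED_VARS` are hypothetical names introduced here for illustration and are not part of the notebook or the service.

```python
import os

# The three entries this notebook expects in the .env file.
REQUIRED_VARS = ("SERVICE_FOR_CU", "SERVICE_API_FOR_CU", "SAMPLE_CLAIMS_BUNDLE")


def missing_env_vars(required=REQUIRED_VARS, env=None):
    """Return the names in `required` that are unset or blank in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not (env.get(name) or "").strip()]


missing = missing_env_vars()
if missing:
    print(f"Missing .env entries: {', '.join(missing)}")
else:
    print("All required .env entries are set.")
```

The check reads `os.environ`, so it only reflects the `.env` file after `load_dotenv(override=True)` has run.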

In [None]:
# always refresh all vars
load_dotenv(override=True)
# For authentication, we will use token-based auth. You can use key based auth, only one of them is required
content_understanding_endpoint = os.getenv("SERVICE_FOR_CU")
# retrieve the API version to use
content_understanding_api_version = os.getenv("SERVICE_API_FOR_CU")
# retrieve the path of the sample bundle file
SAMPLE_CLAIMS_BUNDLE = os.getenv("SAMPLE_CLAIMS_BUNDLE")
print(SAMPLE_CLAIMS_BUNDLE)

# Setup credentials
credential = DefaultAzureCredential(
    exclude_managed_identity_credential=True,
    exclude_client_secret_credential=True,
    exclude_environment_credential=True,
    exclude_workload_identity_credential=True,
    exclude_shared_token_cache_credential=True,
    exclude_azure_powershell_credential=True,
    exclude_azure_developer_cli_credential=True,
)
token_provider = get_bearer_token_provider(credential, "https://cognitiveservices.azure.com/.default")

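# Fail early with a clear message if the path is missing from the .env file
if not SAMPLE_CLAIMS_BUNDLE:
    raise ValueError("SAMPLE_CLAIMS_BUNDLE is not set. Add it to your .env file.")
file_location = Path(SAMPLE_CLAIMS_BUNDLE)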

print("📋 Configuration Summary:")
print(f"   Content Understanding Endpoint: {content_understanding_endpoint}")
print(f"   Content Understanding API Version: {content_understanding_api_version}")
print(f"   Document to Analyze: {file_location.name if file_location.exists() else '❌ File not found'}")

## 4. Example of Defining a Classifier Schema

The classifier schema defines:
  - **Categories**: Document types to classify (e.g., Legal, Medical)
  - **description (Optional)**: An optional field used to provide additional context or hints for categorizing or splitting documents.   
    - This can be helpful when the category name alone isn’t descriptive enough. If the category name is already clear and self-explanatory, this field can be omitted.

- **This classifier should identify these document types**
  - Completed_Claim_Form
  - HIPAA_Release
  - Signed_Physician_Statement
  - Pathology_Report
  - Doctor_Office_Visit_Report
  - Scanner_Report
  - Other_Document_Type
  - Itemized_Bill_for_Lab_Services
  - Itemized_Bill_for_Radiology_Services
  - Itemized_Bill_from_Other_Service_Providers_Type
  - UB04_Bill

- **splitMode Options**: Defines how multi-page documents should be split before classification or analysis. There are 3 options:  
  - `"auto"`: Automatically split based on content.  
    - For example, if two categories are defined as “invoice” and “application form”:
       - A PDF with only one invoice will be classified as a single document.
       - A PDF containing two invoices and one application form will be automatically split into three classified sections.
  - `"none"`: No splitting.  
    - The entire multi-page document is treated as a single unit for classification and analysis.
  - `"perPage"`: Split by page.  
    - Each page is treated as a separate document. This is useful when you’ve built custom analyzers designed to operate on a per-page basis.
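
The structure described above (a `categories` dict plus one of the three `splitMode` values) can be sanity-checked locally before a schema is sent to the service. The sketch below is an illustration only: `validate_classifier_schema` is a hypothetical helper, and it treats `splitMode` as required, which may be stricter than the service itself.

```python
VALID_SPLIT_MODES = {"auto", "none", "perPage"}


def validate_classifier_schema(schema):
    """Return a list of problems found in a classifier schema dict (empty if none)."""
    problems = []

    categories = schema.get("categories")
    if not isinstance(categories, dict) or not categories:
        problems.append("'categories' must be a non-empty dict")
    else:
        for name, details in categories.items():
            # 'description' is optional, but when present it should be text.
            description = details.get("description")
            if description is not None and not isinstance(description, str):
                problems.append(f"category '{name}': description must be a string")

    split_mode = schema.get("splitMode")
    if split_mode not in VALID_SPLIT_MODES:
        problems.append(
            f"'splitMode' must be one of {sorted(VALID_SPLIT_MODES)}, got {split_mode!r}"
        )
    return problems
```

An empty list means the schema passed these local checks; it does not guarantee that the service will accept it.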

  **Below is my schema definition**

In [None]:
# This classifier schema automatically splits the bundle into documents based on content.
# It defines field descriptions and classifier document categories and their descriptions
classifier_schema = {
    "categories": {
        "Completed_Claim_Form": {"description": "a Completed Claim Form"},
        "HIPAA_Release": {"description": "a HIPAA Release"},
        "Signed_Physician_Statement": {"description": "a Signed Physician Statement"},
        "Itemized_Bill_for_Lab_Services": {"description": "an Itemized Bill from a laboratory for lab tests and services. This type includes Statements, Invoices, Account Summaries, and any document that has dollar amounts on it sent from a lab."},
        "Itemized_Bill_for_Radiology_Services": {"description": "an Itemized Bill from a radiology department for imaging services. This type includes Statements, Invoices, Account Summaries, and any document that has dollar amounts on it sent from a radiology provider."},
        "Itemized_Bill_from_Other_Service_Providers_Type": {"description": "an Itemized Bill from a provider other than the laboratory, radiology, or hospital provider types listed above. This type includes Statements, Invoices, Account Summaries, and any document that has dollar amounts on it."},
        "UB04_Bill": {"description": "A special type of itemized bill. It will have the notation on it UB04 or UB-04 or UB 04."},
        "Pathology_Report": {"description": "a Pathology Report"},
        "Doctor_Office_Visit_Report": {"description": "a Doctor Office Visit Report. It contains a narrative of the visit, including symptoms, diagnosis, and treatment plan. It does not include any billing information."},
        "Scanner_Report": {"description": "a Scanner Report that lists issues with the scan of the documents"},
        "Other_Document_Type": {"description": "A document type other than the ones specified"},
    },
    "splitMode": "auto",  # IMPORTANT: Automatically detect document boundaries. Can change mode for your needs.
}

# List out the doc types here
print("📄 Classifier DocTypes:")
for category, details in classifier_schema["categories"].items():
    print(f"   • {category}: {details['description'][:60]}...")

## 5. Initialize Content Understanding Client

Create the client that will communicate with Azure AI services.

⚠️ Important:
You must update the code below to match your Azure authentication method.
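
For context, Azure AI services accept either a bearer token or a resource key on each request. The sketch below shows how the two options differ at the HTTP header level. It is an assumption-laden illustration: `build_auth_headers` is a hypothetical function, and how `AzureContentUnderstandingClient` actually builds its headers is not shown in this notebook.

```python
def build_auth_headers(subscription_key=None, token_provider=None):
    """Build HTTP auth headers for an Azure AI services request.

    Only one of the two arguments is needed; the key is used if both are given.
    """
    if subscription_key:
        # Key-based auth: the resource key goes in this header.
        return {"Ocp-Apim-Subscription-Key": subscription_key}
    if token_provider:
        # token_provider is a zero-argument callable returning a bearer token string,
        # which is what azure.identity.get_bearer_token_provider produces.
        return {"Authorization": f"Bearer {token_provider()}"}
    raise ValueError("Provide either subscription_key or token_provider")
```

Token-based auth avoids storing a long-lived key in the `.env` file, which is why this notebook uses `DefaultAzureCredential`.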


In [None]:
# Initialize the Azure Content Understanding client
try:
    content_understanding_client = AzureContentUnderstandingClient(
        endpoint=content_understanding_endpoint,
        api_version=content_understanding_api_version,
        token_provider=token_provider,
    )
    print("✅ Content Understanding client initialized successfully!")
    print("   Ready to create classifiers and analyzers.")
except Exception as e:
    print(f"❌ Failed to initialize client: {e}")
    raise

## 6A. Create a Custom Analyzer (analyzer_6A) for the document title

Now let's create a schema for a custom analyzer that can extract specific fields from documents.
This analyzer will:
- Extract the title from each of the documents in the bundle

Note: we only define the schema here. The document types that use it are specified later, in the enhanced classifier schema.

In [None]:
# Define analyzer schema with custom fields
analyzer_schema_analyzer_6A = {
    "description": "Analyzer_with_document_fields - extracts key document information from a bundle of documents in a single pdf submitted for claims",
    "baseAnalyzerId": "prebuilt-documentAnalyzer",  # Built on top of the general document analyzer
    "config": {
        "returnDetails": True,
        "enableLayout": True,          # Extract layout information
        "enableBarcode": False,        # Skip barcode detection
        "enableFormula": False,        # Skip formula detection
        "estimateFieldSourceAndConfidence": True, # Set to True if you want to estimate the field location (aka grounding) and confidence
        "disableContentFiltering": False,
    },
    "fieldSchema": {
        "fields": {
            "title_on_first_page_of_document": {
                "type": "string",
                "method": "generate",
                "description": "This is the title of the document. It will typically be the line of text with the largest sized font near the top of the page. The value should be \"None\" if there is no title or it cannot be determined."
            }
        }
    }
}

# Generate unique analyzer ID
analyzer_id_analyzer_6A = "Analyzer_with_document_fields_analyzer_6A_" + str(uuid.uuid4())

# Create the analyzer
try:
    print(f"🔨 Creating custom analyzer 6A: {analyzer_id_analyzer_6A}")
    print("\n📋 Analyzer will extract:")
    for field_name, field_info in analyzer_schema_analyzer_6A["fieldSchema"]["fields"].items():
        print(f"   • {field_name}: {field_info['description']}")
    
    response = content_understanding_client.begin_create_analyzer(analyzer_id_analyzer_6A, analyzer_schema_analyzer_6A)
    result = content_understanding_client.poll_result(response)
    
    # just printing the fields created
    print("\n✅ Analyzer_with_document_fields created successfully!")
    print(f"   Analyzer ID analyzer_6A: {analyzer_id_analyzer_6A}")
    
except Exception as e:
    print(f"\n❌ Error creating analyzer: {e}")
    analyzer_id_analyzer_6A = None  # Set to None if creation failed

## 6B. Create a Custom Analyzer (analyzer_6B) for the billed expenses

Next let's create a schema for a custom analyzer that can extract specific fields from documents.
This analyzer will:
- Extract all the billed expenses from each of the documents in the bundle as a table (array)  

Note: we only define the schema here. The document types that use it are specified later, in the enhanced classifier schema.

In [None]:
# Define analyzer schema with custom fields
analyzer_schema_analyzer_6B = {
    "description": "Analyzer_with_document_fields - extracts key document information from a bundle of documents in a single pdf submitted for claims",
    "baseAnalyzerId": "prebuilt-documentAnalyzer",  # Built on top of the general document analyzer
    "config": {
        "returnDetails": True,
        "enableLayout": True,          # Extract layout information
        "enableBarcode": False,        # Skip barcode detection
        "enableFormula": False,        # Skip formula detection
        "estimateFieldSourceAndConfidence": True, # Set to True if you want to estimate the field location (aka grounding) and confidence
        "disableContentFiltering": False,
    },
    "fieldSchema": {
        "fields": {
            "title_on_first_page_of_document": {
                "type": "string",
                "method": "generate",
                "description": "This is the title of the document. It will typically be the line of text with the largest sized font near the top of the page. The value should be \"None\" if there is no title or it cannot be determined."
            },
            "Expenses": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "Expense_Amount": {
                            "type": "number",
                            "method": "generate",
                            "description": "A table of the expense item amounts billed to the patient or insurance company. These are charges for procedures, professional services, lab tests performed and other medical services. They will be numeric with 2 decimal places. Keep the 2 decimal places even if they are .00. They will typically be on the document pages in a tabular layout with the expensed dollar amounts all in the same column. You will typically find the other columns to extract (ICD code, CPT code, etc.) in this table, usually on the same line as the amount. Only capture positive amounts that are actual charges (not totals, subtotals, adjustments, refunds, negative values, or amounts that are zero). All dollar amounts for expenses must be captured. The document may contain multiple pages of expenses within a single document."
                        },
                        "ICD_Code": {
                            "type": "string",
                            "method": "generate",
                            "description": "The ICD code associated with the expense if there is one. If there is no ICD code, use \"\". The ICD code is usually on the same line of the table as the amount."
                        },
                        "Date": {
                            "type": "date",
                            "method": "generate",
                            "description": "The date of the expense. The date is usually on the same line of the table as the amount. Format the date as mm/dd/yyyy."
                        },
                        "Expense_Description": {
                            "type": "string",
                            "method": "generate",
                            "description": "The description of the expense. This may be the procedure name. It is usually on the same line of the table as the amount."
                        },
                        "Surgeon_Name_or_Provider": {
                            "type": "string",
                            "method": "generate",
                            "description": "The surgeon or provider if this expense was a surgical procedure."
                        },
                        "CPT_Code": {
                            "type": "string",
                            "method": "generate",
                            "description": "The CPT code associated with the expense. It is usually on the same line of the table as the amount."
                        },
                        "Ref_Page": {
                            "type": "number",
                            "method": "generate",
                            "description": "The page number within the document."
                        },
                        "Drug_Name": {
                            "type": "string",
                            "method": "generate",
                            "description": "If the expense charge was for a drug, put the drug name here. If not for a drug put N/A in this field."
                        },
                        "Expense_Type": {
                            "type": "string",
                            "method": "generate",
                            "description": "Categorize each expense into one of four categories based on the description, ICD10, CPT code, or other context. The 4 categories are: 1. Cancer_History_Expenses, 2. Diagnostic_Tests_and_Labs_Expenses, 3. Surgical_Events_Expenses, 4. Cancer_Treatment_Expenses. Put every expense into one of the four. If it was for an exam, a lab test or other diagnostic test that diagnosed cancer or remission, make it a #1. If it was for a lab or diagnostic test, make it a #2. If it was for a surgical procedure, make it a #3. Everything else is a #4. Use the full name, not just the number, when filling in the field."
                        }
                    },
                    "method": "generate"
                },
                "method": "generate",
                "description": "Expenses are charges billed to either the patient or insurance company. They are single charges for a procedure, test or other medical service. They do not include payments, adjustments, refunds, total balances, subtotals. Other than these exceptions all other dollar amounts may be an expense and should be reviewed."
            }
        }
    }
}

# Generate unique analyzer ID
analyzer_id_analyzer_6B = "Analyzer_with_document_fields_analyzer_6B_" + str(uuid.uuid4())

# Create the analyzer
try:
    print(f"🔨 Creating custom analyzer 6B: {analyzer_id_analyzer_6B}")
    print("\n📋 Analyzer will extract:")
    for field_name, field_info in analyzer_schema_analyzer_6B["fieldSchema"]["fields"].items():
        print(f"   • {field_name}: {field_info['description']}")

    response = content_understanding_client.begin_create_analyzer(analyzer_id_analyzer_6B, analyzer_schema_analyzer_6B)
    result = content_understanding_client.poll_result(response)
    
    # just printing the fields created
    print("\n✅ Analyzer_with_document_fields created successfully!")
    print(f"   Analyzer ID analyzer_6B: {analyzer_id_analyzer_6B}")

except Exception as e:
    print(f"\n❌ Error creating analyzer: {e}")
    analyzer_id_analyzer_6B = None  # Set to None if creation failed

## 6C. Create a Custom Analyzer (analyzer_6C) for the patient information

Next let's create a schema for a custom analyzer that can extract specific fields from documents.
This analyzer will:
- Extract all the patient information  

Note: we only define the schema here. The document types that use it are specified later, in the enhanced classifier schema.
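
The analyzer definitions in this notebook share the same base analyzer and `config` block and differ only in their `fields`. One way to reduce the repetition is sketched below. `BASE_ANALYZER`, `build_analyzer_schema`, and `string_field` are hypothetical names introduced for illustration; the settings themselves are copied from the cells in this notebook.

```python
import copy

# Settings copied from the analyzer definitions in this notebook.
BASE_ANALYZER = {
    "baseAnalyzerId": "prebuilt-documentAnalyzer",
    "config": {
        "returnDetails": True,
        "enableLayout": True,
        "enableBarcode": False,
        "enableFormula": False,
        "estimateFieldSourceAndConfidence": True,
        "disableContentFiltering": False,
    },
}


def build_analyzer_schema(description, fields):
    """Return a new analyzer schema combining the shared settings with `fields`."""
    schema = copy.deepcopy(BASE_ANALYZER)  # avoid sharing the nested config dict
    schema["description"] = description
    schema["fieldSchema"] = {"fields": fields}
    return schema


def string_field(description):
    """Shorthand for a generated string field."""
    return {"type": "string", "method": "generate", "description": description}


patient_schema = build_analyzer_schema(
    "Extracts patient information from a claim form",
    {"Gender": string_field("The gender of the patient.")},
)
```

Because each call deep-copies the shared settings, changing the `config` of one schema does not affect the others.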

In [None]:
# Define analyzer schema with custom fields
analyzer_schema_analyzer_6C = {
    "description": "Analyzer_with_document_fields - extracts key document information from a bundle of documents in a single pdf submitted for claims",
    "baseAnalyzerId": "prebuilt-documentAnalyzer",  # Built on top of the general document analyzer
    "config": {
        "returnDetails": True,
        "enableLayout": True,          # Extract layout information
        "enableBarcode": False,        # Skip barcode detection
        "enableFormula": False,        # Skip formula detection
        "estimateFieldSourceAndConfidence": True, # Set to True if you want to estimate the field location (aka grounding) and confidence
        "disableContentFiltering": False,
    },
    "fieldSchema": {
        "fields": {
            "title_on_first_page_of_document": {
                "type": "string",
                "method": "generate",
                "description": "This is the title of the document. It will typically be the line of text with the largest sized font near the top of the page. The value should be \"None\" if there is no title or it cannot be determined."
            },
            "Patient_First_Name": {
                "type": "string",
                "method": "generate",
                "description": "The first name of the patient. This is usually on the first page of the document."
            },
            "Patient_Last_Name": {
                "type": "string",
                "method": "generate",
                "description": "The last name of the patient. This is usually on the first page of the document."
            },
            "DOB": {
                "type": "string",
                "method": "generate",
                "description": "The DOB of the patient. This is usually on the first page of the document. Put into YYYY-MM-DD format."
            },
            "Gender": {
                "type": "string",
                "method": "generate",
                "description": "The gender of the patient. This is usually on the first page of the document."
            },
            "Policy_Number": {
                "type": "string",
                "method": "generate",
                "description": "The policy number of the patient. This is usually on the first page of the document. If the field is missing, use \"\"."
            }
        }
    }
}

# Generate unique analyzer ID
analyzer_id_analyzer_6C = "Analyzer_with_document_fields_analyzer_6C_" + str(uuid.uuid4())

# Create the analyzer
try:
    print(f"🔨 Creating custom analyzer: {analyzer_id_analyzer_6C}")
    print("\n📋 Analyzer will extract:")
    for field_name, field_info in analyzer_schema_analyzer_6C["fieldSchema"]["fields"].items():
        print(f"   • {field_name}: {field_info['description']}")

    response = content_understanding_client.begin_create_analyzer(analyzer_id_analyzer_6C, analyzer_schema_analyzer_6C)
    result = content_understanding_client.poll_result(response)
    
    # just printing the fields created
    print("\n✅ Analyzer_with_document_fields created successfully!")
    print(f"   Analyzer ID: {analyzer_id_analyzer_6C}")

except Exception as e:
    print(f"\n❌ Error creating analyzer: {e}")
    analyzer_id_analyzer_6C = None  # Set to None if creation failed

## 7. Create an Enhanced Classifier Schema with 3 Custom Analyzers

Now we'll create a new enhanced classifier schema that classifies the documents and also specifies which of our 3 custom analyzers extracts fields from each document type.  
This combines classification with field extraction in one operation.  

We are using the 3 analyzers from the previous cells: analyzer_6A, analyzer_6B, and analyzer_6C.


**This is now the schema my Classifier will use**  
It will be used instead of the one created in the earlier step #4 because it defines the analyzer to use for each document type.  

Note: For each document type, one of the 3 analyzers: analyzer_6A or analyzer_6B or analyzer_6C is specified.
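
Since the enhanced schema repeats every category and description from the basic classifier schema, it could also be derived from that schema with a category-to-analyzer mapping. The following is a minimal sketch under that assumption; `attach_analyzers` is a hypothetical helper, not something the notebook or the service provides.

```python
import copy


def attach_analyzers(base_schema, analyzer_for_category, default_analyzer_id):
    """Return a copy of `base_schema` with an "analyzerId" added to every category.

    Categories missing from `analyzer_for_category` get `default_analyzer_id`.
    """
    enhanced = copy.deepcopy(base_schema)  # leave the original schema untouched
    for name, details in enhanced["categories"].items():
        details["analyzerId"] = analyzer_for_category.get(name, default_analyzer_id)
    return enhanced


# Example with placeholder IDs; in the notebook these would be the generated analyzer IDs.
example = attach_analyzers(
    {"categories": {"UB04_Bill": {"description": "a bill"}}, "splitMode": "auto"},
    {"UB04_Bill": "analyzer-for-bills"},
    "analyzer-for-titles",
)
```

Deriving the schema this way keeps the category descriptions in one place, so the two schemas cannot drift apart.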

In [None]:
# Split based on content! That means per document within the bundle
# Define field descriptions and classifier document categories and their descriptions.
# Notice that the analyzer to use is specified for each document type.
enhanced_classifier_with_document_metadata_and_fields_schema = {
    "categories": {
        "Completed_Claim_Form": {"description": "a Completed Claim Form", "analyzerId": analyzer_id_analyzer_6C},
        "HIPAA_Release": {"description": "a HIPAA Release", "analyzerId": analyzer_id_analyzer_6A},
        "Signed_Physician_Statement": {"description": "a Signed Physician Statement", "analyzerId": analyzer_id_analyzer_6A},
        "Pathology_Report": {"description": "a Pathology Report", "analyzerId": analyzer_id_analyzer_6A},
        "Doctor_Office_Visit_Report": {"description": "a Doctor Office Visit Report. It contains a narrative of the visit, including symptoms, diagnosis, and treatment plan. It does not include any billing information.", "analyzerId": analyzer_id_analyzer_6A},
        "Scanner_Report": {"description": "a Scanner Report that lists issues with the scan of the documents", "analyzerId": analyzer_id_analyzer_6A},
        "Other_Document_Type": {"description": "A document type other than the ones specified", "analyzerId": analyzer_id_analyzer_6A},
        "Itemized_Bill_for_Lab_Services": {"description": "an Itemized Bill from a laboratory for lab tests and services. This type includes Statements, Invoices, Account Summaries, and any document that has dollar amounts on it sent from a lab.", "analyzerId": analyzer_id_analyzer_6B},
        "Itemized_Bill_for_Radiology_Services": {"description": "an Itemized Bill from a radiology department for imaging services. This type includes Statements, Invoices, Account Summaries, and any document that has dollar amounts on it sent from a radiology provider.", "analyzerId": analyzer_id_analyzer_6B},
        "Itemized_Bill_from_Other_Service_Providers_Type": {"description": "an Itemized Bill from a provider other than the laboratory, radiology, or hospital provider types listed above. This type includes Statements, Invoices, Account Summaries, and any document that has dollar amounts on it.", "analyzerId": analyzer_id_analyzer_6B},
        "UB04_Bill": {"description": "A special type of itemized bill. It will have the notation on it UB04 or UB-04 or UB 04.", "analyzerId": analyzer_id_analyzer_6B},
    },
    "splitMode": "auto",  # IMPORTANT: Automatically detect document boundaries. Can change mode for your needs.
}
# Just printing out my doc types here
print("📄 Classifier DocTypes:")
for category, details in enhanced_classifier_with_document_metadata_and_fields_schema["categories"].items():
    print(f"   • {category}: {details['description'][:60]}...")

## 8. Then, create the Classifier  

It takes the Classifier Schema as an input parameter

In [None]:
# These are the analyzer ids created in prior cells
print(f"Using analyzer from prior cell analyzer_6A: {analyzer_id_analyzer_6A}")
print(f"Using analyzer from prior cell analyzer_6B: {analyzer_id_analyzer_6B}")
print(f"Using analyzer from prior cell analyzer_6C: {analyzer_id_analyzer_6C}")

# This is the classifier we are creating now
# Generate unique enhanced classifier ID
enhanced_classifier_id = "classifier_based_on_doc_type_9" + str(uuid.uuid4())
print(f"🔨 Creating classifier: {enhanced_classifier_id}")

# Create the enhanced classifier
if analyzer_id_analyzer_6A and analyzer_id_analyzer_6B and analyzer_id_analyzer_6C:  # Only create if all of the previous analyzers were successfully created
    try:
        response = content_understanding_client.begin_create_classifier(enhanced_classifier_id, enhanced_classifier_with_document_metadata_and_fields_schema)
        result = content_understanding_client.poll_result(response)

        print("\n✅ Enhanced classifier created successfully!")
        print("\n📋 Configuration:")
        print("   • Medical documents in claim bundle → 3 Custom analyzers with field extraction → 1 enhanced Classifier")

        print(f"\n   • These document types below can use the classifier {enhanced_classifier_id} and the custom analyzer - analyzer_id: {analyzer_id_analyzer_6C} and this schema: enhanced_classifier_with_document_metadata_and_fields_schema")
        print("\n      - Completed_Claim_Form")

        print(f"\n   • These document types below can use the classifier {enhanced_classifier_id} and the custom analyzer - analyzer_id: {analyzer_id_analyzer_6A} and this schema: enhanced_classifier_with_document_metadata_and_fields_schema")
        print("\n      - HIPAA_Release")
        print("      - Signed_Physician_Statement")
        print("      - Pathology_Report")
        print("      - Doctor_Office_Visit_Report")
        print("      - Scanner_Report")
        print("      - Other_Document_Type")

        print(f"\n   • These document types below can use the classifier {enhanced_classifier_id} and the custom analyzer - analyzer_id: {analyzer_id_analyzer_6B} and this schema: enhanced_classifier_with_document_metadata_and_fields_schema")
        print("\n      - Itemized_Bill_for_Lab_Services")
        print("      - Itemized_Bill_for_Radiology_Services")
        print("      - Itemized_Bill_from_Other_Service_Providers_Type")
        print("      - UB04_Bill")

    except Exception as e:
        print(f"\n❌ Error creating enhanced classifier: {e}")
        enhanced_classifier_id = None  # Set to None so later cells skip classification
else:
    print("⚠️  Skipping enhanced classifier creation - one or more analyzers were not created successfully.")
    enhanced_classifier_id = None  # Set to None so later cells skip classification

## 9. Process Document with Enhanced Classifier 

**This step is:**
1.  Reading the PDF document bundle file  
2.  Classifying the documents within it
3.  Extracting the fields from the document based on document type



In [None]:
# using the classifier that breaks the bundle into documents
#
# Checking that all the analyzers were created
if analyzer_id_analyzer_6A and analyzer_id_analyzer_6B and analyzer_id_analyzer_6C:
    print(f"🔨 Using analyzer: {analyzer_id_analyzer_6A} and {analyzer_id_analyzer_6B}  and {analyzer_id_analyzer_6C}")
else:
    print("⚠️  Skipping analyzer usage - analyzer was not created successfully in previous cell")


if enhanced_classifier_id and analyzer_id_analyzer_6A and analyzer_id_analyzer_6B and analyzer_id_analyzer_6C:
    print(f"🔨 Using classifier: {enhanced_classifier_id}")
    try:
        # Check if document exists
        if not file_location.exists():
            raise FileNotFoundError(f"Document not found at {file_location}")
    
        # Process with enhanced classifier
        print("📄 Processing document with enhanced classifier")
        print(f"   Document: {file_location.name}")
        print("\n⏳ Processing with classification + field extraction...")

        response = content_understanding_client.begin_classify(classifier_id=enhanced_classifier_id, file_location=str(file_location))
        enhanced_result = content_understanding_client.poll_result(response, timeout_seconds=920, polling_interval_seconds=25)
        
        print("\n✅ Enhanced processing completed!")
        
    except Exception as e:
        print(f"\n❌ Error processing document: {e}")
else:
    print("⚠️  Skipping enhanced classification - enhanced classifier was not created.")

## 10. Create a function to do the parsing and execute it
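
Each extracted field in the result is a small dict whose value sits under a type-specific key, such as `valueString` or `valueArray` in the parsing code of this section. A recursive helper along the following lines could flatten such fields into plain Python values. This is a sketch: `field_value` and `VALUE_KEYS` are hypothetical, and the key names other than `valueString` and `valueArray` are assumptions based on the same naming pattern.

```python
# Assumed mapping from a field's declared type to the key that holds its value.
VALUE_KEYS = {
    "string": "valueString",
    "number": "valueNumber",
    "date": "valueDate",
    "array": "valueArray",
    "object": "valueObject",
}


def field_value(field, default=None):
    """Return the plain Python value of one extracted field.

    Arrays and objects are converted recursively; missing values become `default`.
    """
    key = VALUE_KEYS.get(field.get("type"))
    if key is None or key not in field:
        return default
    value = field[key]
    if key == "valueArray":
        return [field_value(item, default) for item in value]
    if key == "valueObject":
        return {name: field_value(sub, default) for name, sub in value.items()}
    return value
```

With a helper like this, an `Expenses` field would come back as a list of dicts, one per expense row, which is convenient for loading into a table.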

In [None]:
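def parse_json_and_display_results(enhanced_result):
    """Print the category, page range, and extracted fields for each document in the classifier result."""
    data = enhanced_result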
    # Extract the main result data
    result_data = data.get("result", {})
    contents = result_data.get("contents", [])
    
    print("\n📊 DOCUMENT ANALYSIS RESULTS")
    print("=" * 70)
    print(f"Total sections(documents) found: {len(contents)}")
    
    # Process each document section
    for i, content in enumerate(contents, 1):
        print(f"\n{'='*70}")
        print(f"DOCUMENT #{i}")
        print(f"{'='*70}")
        
        # Basic document information
        category = content.get('category', 'Unknown')
        start_page = content.get('startPageNumber', '?')
        end_page = content.get('endPageNumber', '?')
        
        # Calculate number of pages
        if start_page != '?' and end_page != '?':
            num_pages = end_page - start_page + 1
        else:
            num_pages = '?'
        
        print(f"📁 Type of Document: {category}")
        print(f"📄 Document Starting Page in Bundle: {start_page}")
        print(f"📄 Document Ending Page in Bundle: {end_page}")
        print(f"📄 Number of Pages in Document: {num_pages}")
        
        # Extract and display fields
        fields = content.get('fields', {})
        field_count = len(fields)
        print(f"Fields extracted from this Document: {field_count}")
        
        if fields:
            # Handle title field
            if 'title_on_first_page_of_document' in fields:
                title_field = fields['title_on_first_page_of_document']
                title_value = title_field.get('valueString', 'N/A')
                print(f"📄 Document Title: {title_value}")
            
            # Handle patient information fields
            patient_fields = ['Patient_First_Name', 'Patient_Last_Name', 'DOB', 'Gender', 'Policy_Number']
            for field_name in patient_fields:
                if field_name in fields:
                    field_data = fields[field_name]
                    field_value = field_data.get('valueString', 'N/A')
                    print(f"📄 {field_name}: {field_value}")
            
            # Handle Expenses array
            if 'Expenses' in fields:
                expenses_field = fields['Expenses']
                expenses_array = expenses_field.get('valueArray', [])
                print(f"📄 Expenses: Found {len(expenses_array)} expense entries")
                
                for idx, expense in enumerate(expenses_array, 1):
                    print(f"    💰 Expense #{idx}:")
                    expense_obj = expense.get('valueObject', {})
                    
                    # Define expense fields to extract in order of importance
                    expense_fields = [
                        'Expense_Amount', 'Expense_Description', 'Date', 'CPT_Code',
                        'ICD_Code', 'Expense_Type', 'Surgeon_Name_or_Provider', 
                        'Ref_Page', 'Drug_Name'
                    ]
                    
                    for exp_field in expense_fields:
                        if exp_field in expense_obj:
                            field_data = expense_obj[exp_field]
                            field_type = field_data.get('type', 'unknown')
                            
                            # Get value based on type
                            if field_type == 'number':
                                field_value = field_data.get('valueNumber')
                                if field_value is None:
                                    # No number was extracted; avoid formatting a missing value
                                    field_value = 'N/A'
                                elif exp_field == 'Expense_Amount':
                                    field_value = f"${field_value:.2f}"
                                elif exp_field == 'Ref_Page':
                                    # Offset the page by start_page to get the page number within the bundle
                                    field_value = int(round(field_value))
                                    if start_page != '?':
                                        field_value = field_value + start_page - 1
                            elif field_type == 'date':
                                field_value = field_data.get('valueDate', 'N/A')
                            else:
                                field_value = field_data.get('valueString', 'N/A')
                            
                            print(f"      📄 {exp_field}: {field_value}")
            
            # Calculate word-level confidence statistics for this document
            pages = content.get('pages', [])
            if pages:
                total_words = 0
                total_confidence = 0
                min_confidence = 1.0
                max_confidence = 0.0
                
                for page in pages:
                    words = page.get('words', [])
                    for word in words:
                        confidence = word.get('confidence', 0)
                        if confidence > 0:  # Only count words with confidence scores
                            total_words += 1
                            total_confidence += confidence
                            min_confidence = min(min_confidence, confidence)
                            max_confidence = max(max_confidence, confidence)
                
                if total_words > 0:
                    avg_confidence = total_confidence / total_words
                    print(f"\n📊 Confidence Information:")
                    print(f"   📄 Word-level confidence - Avg: {avg_confidence:.3f}, Min: {min_confidence:.3f}, Max: {max_confidence:.3f}")
                    print(f"   📄 Total words with confidence scores: {total_words}")
                else:
                    print(f"\n📊 Confidence Information: No word-level confidence scores available")
        
        if field_count == 0:
            print("   📄 No fields were extracted from this document")

# Call the function to parse and display the results
parse_json_and_display_results(enhanced_result)

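The parser above reports `Ref_Page` as a page number within the bundle by computing `field_value + start_page - 1`. The sketch below isolates that arithmetic. It assumes, as the code above implies, that `Ref_Page` is 1-based and counted from the first page of the individual document. `to_bundle_page` is a hypothetical helper name, not part of the service or the sample client.

```python
def to_bundle_page(ref_page: int, start_page: int) -> int:
    """Convert a 1-based page number inside a document to a 1-based page number in the bundle."""
    if ref_page < 1 or start_page < 1:
        raise ValueError("Page numbers are expected to be 1-based")
    return start_page + ref_page - 1


# A document that starts on page 4 of the bundle:
print(to_bundle_page(1, 4))  # 4 (its first page is page 4 of the bundle)
print(to_bundle_page(3, 4))  # 6 (its third page is page 6 of the bundle)
```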
## 11. Create a function to give a Summary and execute it

In [None]:
def parse_json_summary(enhanced_result):
    """
    Display a concise summary of the document analysis results.
    """
    try:
        data = enhanced_result
        print("📊 DOCUMENT BUNDLE SUMMARY")
        print("=" * 50)
        
        result_data = data.get("result", {})
        contents = result_data.get("contents", [])
        if contents:  
            last_end_page = contents[-1].get('endPageNumber', '?')  
        else:  
            last_end_page = '?'
        
        print(f"Total documents found: {len(contents)}")
        print(f"Total pages in bundle: {last_end_page}")
        
        # Summary table
        print("\n📋 Document Summary:")
        print("-" * 80)
        print(f"{'#':<3} {'Document Type':<50} {'Pages':<8} {'Fields':<8}")
        print("-" * 80)
        
        total_expenses = 0
        for i, content in enumerate(contents, 1):
            category = content.get('category', 'Unknown')
            start_page = content.get('startPageNumber', '?')
            end_page = content.get('endPageNumber', '?')
            
            if start_page != '?' and end_page != '?':
                page_range = f"{start_page}-{end_page}"
                num_pages = end_page - start_page + 1
            else:
                page_range = '?'
                num_pages = '?'
            
            fields = content.get('fields', {})
            field_count = len(fields)
            
            # Count expenses
            if 'Expenses' in fields:
                expenses = fields['Expenses'].get('valueArray', [])
                expense_count = len(expenses)
                total_expenses += expense_count
                field_info = f"{field_count} (+{expense_count} expenses)"
            else:
                field_info = str(field_count)
            
            print(f"{i:<3} {category:<50} {page_range:<8} {field_info:<8}")
        
        print("-" * 80)
        print(f"\n💰 Total expenses found across all documents: {total_expenses}")
        
        # Show which documents have patient info vs expenses
        print(f"\n📝 Field Distribution:")
        print(f"   • Insurance Claim Form: Patient information fields")
        print(f"   • Billing Statements: Expense details + document titles")
        print(f"   • Other Documents: Document titles only")
        
        return contents
        
    except (AttributeError, TypeError) as e:
        # enhanced_result is None or not shaped like a classifier result
        print(f"❌ Error: could not read enhanced_result ({e}). Run the document processing cell first.")
        return None

# Display the summary
summary_data = parse_json_summary(enhanced_result)

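The summary above counts expense entries but does not add up their amounts. The sketch below totals `Expense_Amount` across all documents, assuming the same nesting the parser in section 10 reads (`Expenses` → `valueArray` → `valueObject` → `Expense_Amount` → `valueNumber`). `total_expense_amount` is a hypothetical helper, and `sample_result` is made-up data that only mirrors that shape. It is not real service output.

```python
from typing import Any, Dict, Optional


def total_expense_amount(enhanced_result: Optional[Dict[str, Any]]) -> float:
    """Sum every numeric Expense_Amount found in the result's documents."""
    total = 0.0
    contents = (enhanced_result or {}).get("result", {}).get("contents", [])
    for content in contents:
        expenses = content.get("fields", {}).get("Expenses", {}).get("valueArray", [])
        for expense in expenses:
            amount = expense.get("valueObject", {}).get("Expense_Amount", {}).get("valueNumber")
            # Skip entries where no number was extracted
            if isinstance(amount, (int, float)):
                total += amount
    return total


# Made-up payload for illustration only
sample_result = {
    "result": {
        "contents": [
            {
                "category": "Billing Statement",
                "startPageNumber": 1,
                "endPageNumber": 2,
                "fields": {
                    "Expenses": {
                        "valueArray": [
                            {"valueObject": {"Expense_Amount": {"type": "number", "valueNumber": 120.5}}},
                            {"valueObject": {"Expense_Amount": {"type": "number", "valueNumber": 30.0}}},
                            {"valueObject": {"Expense_Description": {"type": "string", "valueString": "Lab work"}}},
                        ]
                    }
                },
            },
            {"category": "Other", "startPageNumber": 3, "endPageNumber": 3, "fields": {}},
        ]
    }
}

print(f"${total_expense_amount(sample_result):.2f}")  # $150.50
```

With the real result from the processing cell, the call would be `total_expense_amount(enhanced_result)`. Entries without a numeric amount are skipped rather than treated as zero-value errors.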
## 12. Lab Summary 

Congratulations! You've successfully:
1. ✅ Created a basic classifier to categorize documents
2. ✅ Created 3 custom analyzers to extract specific fields from specific types of documents
3. ✅ Combined them into an enhanced classifier for intelligent document processing
4. ✅ Processed a sample file using the enhanced classifier

## Next Steps

Content Understanding can do so much more than just documents. It can process audio, video and images too!

[Learn more here: What is Azure AI Content Understanding (preview)?](https://learn.microsoft.com/en-us/azure/ai-services/content-understanding/overview)