# Extract Custom Fields from Your File

This notebook demonstrates how to use analyzers to extract custom fields from your input files.

## Prerequisites
1. Ensure your Azure AI service is configured by following [these steps](../README.md#configure-azure-ai-service-resource).
2. Install the required packages to run the sample.

In [1]:
%pip install -r ../requirements.txt

Defaulting to user installation because normal site-packages is not writeable

[notice] A new release of pip is available: 24.0 -> 25.1
[notice] To update, run: pip install --upgrade pip
Note: you may need to restart the kernel to use updated packages.


## Analyzer Templates

Below is a collection of analyzer templates designed to extract fields from various input file types.

You can modify these templates to suit your specific needs. For additional verified templates from Microsoft, see the [analyzer templates README](../analyzer_templates/README.md).

In [2]:
extraction_templates = {
    "invoice":            ('../analyzer_templates/invoice.json',         '../data/invoice.pdf'            ),
    "chart":              ('../analyzer_templates/image_chart.json',     '../data/pieChart.jpg'           ),
    "call_recording":     ('../analyzer_templates/call_recording_analytics.json', '../data/callCenterRecording.mp3'),
    "conversation_audio": ('../analyzer_templates/conversational_audio_analytics.json', '../data/callCenterRecording.mp3'),
    "marketing_video":    ('../analyzer_templates/marketing_video.json', '../data/video.mp4'              ),
    "PostCallAnalyticsOnText":  ('../CU-Demo-Assets/analyzers/PostCallAnalyticsOnText.json', '../CU-Demo-Assets/en-us-conversation-transcripts/sample0 - Insurance - Customer calls about purchasing life insurance.wav.results.txt'),
    "PostCallAnalyticsCustomized": ('../CU-Demo-Assets/analyzers/PostCallAnalyticsCustomized.json', '../CU-Demo-Assets/en-US-mono-audio/sample4 - Healthcare - Customer calls about new policy quote.wav'),
    "PostCallAnalyticsCustomizedDE": ('../CU-Demo-Assets/analyzers/PostCallAnalyticsWithGermanResult.json', '../CU-Demo-Assets/de-DE-mono-audio/sample4 - Healthcare - Customer calls about new policy quote- German.wav'),
    "ImageTextAnalysis": ('../CU-Demo-Assets/analyzers/imageTextAnalysis.json', '../CU-Demo-Assets/en-us-images-w-text/Designer.png'),
    "MultiLanguageDemo": ('../CU-Demo-Assets/analyzers/MultiLanguageDemo.json', '../CU-Demo-Assets/pt-BR-mono-audio/sample4 - Healthcare - Customer calls about new policy quote - Brazilian.wav'),
    "MultiLanguageDemoOnText":  ('../CU-Demo-Assets/analyzers/MultiLanguageDemoOnText.json', '../CU-Demo-Assets/pt-BR-conversation-transcripts/sample6 - Financial - Customer calls about transferring funds - Brazilian.wav.results.txt'),
    "PostCallAnalyticsCustomizedMultiLanguageMoreFields":  ('../CU-Demo-Assets/analyzers/PostCallAnalyticsCustomizedMultiLanguageMoreFields.json', '../CU-Demo-Assets/de-DE-mono-audio/sample4 - Healthcare - Customer calls about new policy quote- German.wav'),
    "ACSTesting":  ('../CU-Demo-Assets/analyzers/ACSTesting.json', '../CU-Demo-Assets/en-US-mono-audio/sample4 - Healthcare - Customer calls about new policy quote.wav'),
    "ACSTestingIRCLog": ('../CU-Demo-Assets/analyzers/ACSTestingIRCLog.json', '../CU-Demo-Assets/en-us-text-examples/linux.log.txt'),
    "CarInsuranceCallAnalysis":  ('../CU-Demo-Assets/analyzers/CarInsuranceCallAnalysis.json', '../CU-Demo-Assets/en-US-mono-audio/Car Insurance Call.wav'),
    "StoryAnalysis": ('../CU-Demo-Assets/analyzers/HTML-ShortStoryAnalyzer.json', '../CU-Demo-Assets/en-us-text-examples/extended_short_story.html'),
    "StoryAnalysisWObject": ('../CU-Demo-Assets/analyzers/HTML-ShortStoryAnalyzerWObject.json', '../CU-Demo-Assets/en-us-text-examples/extended_short_story.html')
}
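For orientation, a template is a JSON document that describes the analyzer. The sketch below builds a minimal one in Python, using only keys that appear in the create-analyzer response later in this notebook (`description`, `baseAnalyzerId`, `config`, `fieldSchema`). The description text and the choice of fields are illustrative assumptions; the real templates in this repository may contain more settings.

```python
import json

# Minimal analyzer template sketch. The keys mirror the create-analyzer
# response shown in this notebook; the values are illustrative assumptions.
minimal_template = {
    "description": "Sample call analytics analyzer",
    "baseAnalyzerId": "prebuilt-callCenter",
    "config": {"returnDetails": True, "locales": ["en-US"]},
    "fieldSchema": {
        "fields": {
            "Summary": {
                "type": "string",
                "method": "generate",
                "description": "A one-paragraph summary",
            },
            "Topics": {
                "type": "array",
                "description": "Top 5 topics mentioned",
                "items": {"type": "string"},
            },
        }
    },
}

# Serialize to the JSON form that would be stored in a template file.
template_json = json.dumps(minimal_template, indent=2)
print(template_json)
```

Each entry under `fields` names one value to extract, and its `description` tells the service what to produce for that field.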

Choose the analyzer template to use and set an ID for the analyzer that will be created from it.

In [3]:
import uuid

ANALYZER_TEMPLATE = "PostCallAnalyticsCustomized"
ANALYZER_ID = "cu-demo-" + str(uuid.uuid4())

(analyzer_template_path, analyzer_sample_file_path) = extraction_templates[ANALYZER_TEMPLATE]
print(f"Analyzer template path: {analyzer_template_path}")
print(f"Analyzer sample file path: {analyzer_sample_file_path}")

Analyzer template path: ../CU-Demo-Assets/analyzers/PostCallAnalyticsCustomized.json
Analyzer sample file path: ../CU-Demo-Assets/en-US-mono-audio/sample4 - Healthcare - Customer calls about new policy quote.wav
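Because the dictionary lookup only returns paths, a typo in the template name or a missing file would surface later as a less obvious error. A minimal sketch of an early check follows; `resolve_template` is a hypothetical helper, not part of the sample repository.

```python
from pathlib import Path


def resolve_template(templates, name):
    """Return (template_path, sample_path) for `name`, failing early with a clear message."""
    if name not in templates:
        available = ", ".join(sorted(templates))
        raise KeyError(f"Unknown template {name!r}. Available: {available}")
    template_path, sample_path = templates[name]
    # Collect every path that does not exist so the error names all of them.
    missing = [p for p in (template_path, sample_path) if not Path(p).exists()]
    if missing:
        raise FileNotFoundError(f"Missing file(s) for {name!r}: {missing}")
    return template_path, sample_path
```

In this notebook it could replace the direct lookup, for example `resolve_template(extraction_templates, ANALYZER_TEMPLATE)`.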


## Create Azure AI Content Understanding Client

> The [AzureContentUnderstandingClient](../python/content_understanding_client.py) is a utility class containing functions to interact with the Content Understanding API. Until the official Content Understanding SDK is released, it can serve as a lightweight SDK.


In [4]:
import logging
import json
import os
import sys
from pathlib import Path
from dotenv import find_dotenv, load_dotenv
from azure.identity import DefaultAzureCredential, get_bearer_token_provider

load_dotenv(find_dotenv())
logging.basicConfig(level=logging.INFO)

AZURE_AI_ENDPOINT = os.getenv("dev_preview2_endpoint")
AZURE_AI_API_VERSION = os.getenv("dev_preview2_api_version")
AZURE_AI_SUBSCRIPTION_KEY = os.getenv("dev_preview2_subscription_key", None)

# Add the parent directory to the path to use shared modules
parent_dir = Path.cwd().parent
sys.path.append(str(parent_dir))
from python.content_understanding_client_dev_preview2 import AzureContentUnderstandingClient

credential = DefaultAzureCredential()
token_provider = get_bearer_token_provider(credential, "https://cognitiveservices.azure.com/.default")

client = AzureContentUnderstandingClient(
    endpoint=AZURE_AI_ENDPOINT,
    api_version=AZURE_AI_API_VERSION,
    subscription_key=AZURE_AI_SUBSCRIPTION_KEY,
    token_provider=token_provider,
    x_ms_useragent="azure-ai-content-understanding-python/field_extraction", # This header is used for sample usage telemetry; comment out this line to opt out.
)

INFO:azure.identity._credentials.environment:No environment configuration found.
INFO:azure.identity._credentials.managed_identity:ManagedIdentityCredential will use IMDS
INFO:azure.core.pipeline.policies.http_logging_policy:Request URL: 'http://169.254.169.254/metadata/identity/oauth2/token?api-version=REDACTED&resource=REDACTED'
Request method: 'GET'
Request headers:
    'User-Agent': 'azsdk-python-identity/1.19.0 Python/3.11.11 (Linux-6.8.0-1026-azure-x86_64-with-glibc2.36)'
No body was attached to the request
INFO:azure.core.pipeline.policies.http_logging_policy:Response status: 400
Response headers:
    'Content-Type': 'application/json; charset=utf-8'
    'Server': 'IMDS/150.870.65.1544'
    'x-ms-request-id': '2b077227-0fc9-4df4-b8de-0a7484936562'
    'Date': 'Fri, 02 May 2025 23:59:48 GMT'
    'Content-Length': '88'
INFO:azure.core.pipeline.policies.http_logging_policy:Request URL: 'http://169.254.169.254/metadata/identity/oauth2/token?api-version=REDACTED&resource=REDACTED'
Re
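The client cell reads its endpoint and API version with `os.getenv`, which silently returns `None` when a variable is missing. A minimal sketch of a stricter lookup follows; `require_env` is a hypothetical helper, not part of the sample repository.

```python
import os


def require_env(name):
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name!r} is not set; check your .env file.")
    return value
```

For example, `require_env("dev_preview2_endpoint")` would fail at configuration time rather than on the first request. The subscription key is optional here, since the client is also given a token provider, so it can keep using `os.getenv`.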

## List all analyzers created in your resource

Before creating a new analyzer, list the analyzers that already exist in your resource, including the prebuilt ones.

In [5]:
response = client.get_all_analyzers()
print(f"Number of analyzers in your resource: {len(response['value'])}")
print(f"First 3 analyzer details: {json.dumps(response['value'][:3], indent=2)}")

Number of analyzers in your resource: 7
First 3 analyzer details: [
  {
    "analyzerId": "prebuilt-callCenter",
    "description": "Analyze call center conversations to extract transcripts, summaries, sentiment, and more.",
    "createdAt": "2025-05-01T00:00:00Z",
    "config": {
      "returnDetails": false,
      "disableContentFiltering": false,
      "segmentationMode": "noSegmentation",
      "tableFormat": "html"
    },
    "fieldSchema": {
      "name": "PostCallAnalytics",
      "fields": {
        "Summary": {
          "type": "string",
          "description": "A one-paragraph summary"
        },
        "Topics": {
          "type": "array",
          "description": "Top 5 topics mentioned",
          "items": {
            "type": "string"
          }
        },
        "Companies": {
          "type": "array",
          "description": "List of companies mentioned",
          "items": {
            "type": "string"
          }
        },
        "People": {
          "typ
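The full JSON listing is verbose. The sketch below condenses it to one line per analyzer, assuming each entry has the shape shown in the listing above (`analyzerId` plus an optional `fieldSchema.fields` mapping). `summarize_analyzers` and the second sample entry are hypothetical, included only to show the handling of a missing schema.

```python
def summarize_analyzers(analyzers, limit=3):
    """Return (analyzerId, sorted field names) pairs for the first `limit` analyzers."""
    summary = []
    for analyzer in analyzers[:limit]:
        # Some analyzers may not define a field schema, so default to empty.
        fields = analyzer.get("fieldSchema", {}).get("fields", {})
        summary.append((analyzer.get("analyzerId"), sorted(fields)))
    return summary


# Sample data shaped like the listing above (abridged).
sample_value = [
    {
        "analyzerId": "prebuilt-callCenter",
        "fieldSchema": {"fields": {"Summary": {"type": "string"}, "Topics": {"type": "array"}}},
    },
    {"analyzerId": "analyzer-without-schema"},  # hypothetical entry
]

for analyzer_id, field_names in summarize_analyzers(sample_value):
    print(f"{analyzer_id}: {field_names}")
```

With the real response, the call would be `summarize_analyzers(response["value"])`.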

## Create Analyzer from the Template

In [6]:
response = client.begin_create_analyzer(ANALYZER_ID, analyzer_template_path=analyzer_template_path)
result = client.poll_result(response)

print(json.dumps(result, indent=2))

INFO:python.content_understanding_client_dev_preview2:Analyzer cu-demo-13b055c7-d293-4bb3-911e-3ea3d8bb38e6 create request accepted.
INFO:python.content_understanding_client_dev_preview2:Request result is ready after 0.00 seconds.


{
  "id": "f16a208e-a1b9-4421-8aba-eb7e60e5610d",
  "status": "Succeeded",
  "result": {
    "analyzerId": "cu-demo-13b055c7-d293-4bb3-911e-3ea3d8bb38e6",
    "description": "Customized Post Call Analytics Analyzer incl. Issue- and Resolution Summaries and extended sentiment descriptions",
    "tags": {
      "projectId": "",
      "templateId": "postCallAnalytics-2024-12-01"
    },
    "createdAt": "2025-05-02T23:59:51Z",
    "lastModifiedAt": "2025-05-02T23:59:51Z",
    "baseAnalyzerId": "prebuilt-callCenter",
    "config": {
      "locales": [
        "en-US",
        "de-DE",
        "pt-BR"
      ],
      "returnDetails": true,
      "disableContentFiltering": false,
      "segmentationMode": "noSegmentation",
      "tableFormat": "html",
      "estimateFieldSourceAndConfidence": false
    },
    "fieldSchema": {
      "fields": {
        "Summary": {
          "type": "string",
          "method": "generate",
          "description": "A one-paragraph summary"
        },
        "
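Creating an analyzer is a long-running operation: the request is accepted first and the result is fetched by polling. The implementation of `poll_result` is not shown in this notebook, so the following is only a generic sketch of such a loop. `poll_until_done` and `get_status` are hypothetical names, and the `Running` and `Failed` status values are assumptions; only `Succeeded` appears in the output above.

```python
import time


def poll_until_done(get_status, timeout_seconds=120, interval_seconds=2, sleep=time.sleep):
    """Call `get_status()` until it reports a terminal status or the timeout elapses."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        result = get_status()
        status = str(result.get("status", "")).lower()
        if status == "succeeded":
            return result
        if status == "failed":  # assumed failure status
            raise RuntimeError(f"Operation failed: {result}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Operation still {status!r} after {timeout_seconds} seconds")
        sleep(interval_seconds)


# Simulated operation that succeeds on the third poll.
responses = iter([{"status": "Running"}, {"status": "Running"}, {"status": "Succeeded", "result": {}}])
final = poll_until_done(lambda: next(responses), sleep=lambda _: None)
print(final["status"])
```

Passing `sleep` as a parameter keeps the loop testable without real waiting.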

# Extract Conversation (WEBVTT) from either Batch Transcription JSON or Content Understanding JSON

## Extract Fields Using the Analyzer

After the analyzer is successfully created, we can use it to analyze our input files.

In [7]:
# To analyze with the prebuilt analyzer instead, uncomment the next line:
# ANALYZER_ID = "prebuilt-callCenter"
response = client.begin_analyze(ANALYZER_ID, file_location=analyzer_sample_file_path)
result = client.poll_result(response)

print(json.dumps(result, indent=2))

INFO:python.content_understanding_client_dev_preview2:Analyzing file ../CU-Demo-Assets/en-US-mono-audio/sample4 - Healthcare - Customer calls about new policy quote.wav with analyzer: cu-demo-13b055c7-d293-4bb3-911e-3ea3d8bb38e6
INFO:python.content_understanding_client_dev_preview2:Request eb426a6e-86f5-4c66-9e8d-4852620ea88f in progress ...
INFO:python.content_understanding_client_dev_preview2:Request eb426a6e-86f5-4c66-9e8d-4852620ea88f in progress ...
INFO:python.content_understanding_client_dev_preview2:Request eb426a6e-86f5-4c66-9e8d-4852620ea88f in progress ...
INFO:python.content_understanding_client_dev_preview2:Request eb426a6e-86f5-4c66-9e8d-4852620ea88f in progress ...
INFO:python.content_understanding_client_dev_preview2:Request eb426a6e-86f5-4c66-9e8d-4852620ea88f in progress ...
INFO:python.content_understanding_client_dev_preview2:Request eb426a6e-86f5-4c66-9e8d-4852620ea88f in progress ...
INFO:python.content_understanding_client_dev_preview2:Request eb426a6e-86f5-4c66-

{
  "id": "eb426a6e-86f5-4c66-9e8d-4852620ea88f",
  "status": "Succeeded",
  "result": {
    "analyzerId": "cu-demo-13b055c7-d293-4bb3-911e-3ea3d8bb38e6",
    "apiVersion": "2025-05-01-preview",
    "createdAt": "2025-05-02T23:59:52Z",
    "stringEncoding": "utf8",
    "contents": [
      {
        "markdown": "# Audio: 00:00.000 => 03:02.112\n\nTranscript\n```\nWEBVTT\n\n00:00.080 --> 00:03.680\n<v Speaker 1>Hi, thank you for calling Contoso, who am I speaking with today?\n\n00:04.560 --> 00:06.240\n<v Speaker 1>Hi, my name is Mary Rondo.\n\n00:07.120 --> 00:09.360\n<v Speaker 1>I was trying to enroll myself with Contoso.\n\n00:10.160 --> 00:13.200\n<v Speaker 1>Hi Mary, are you calling because you need health insurance?\n\n00:14.080 --> 00:16.400\n<v Speaker 1>Yes, I am calling to sign up for insurance.\n\n00:17.280 --> 00:21.280\n<v Speaker 1>Great, if you can answer a few questions, we can get you signed up in a jiffy.\n\n00:22.160 --> 00:22.640\n<v Speaker 1>Okay.\n\n00:23.520 -->
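For audio input, the `markdown` value in the result embeds the transcript as WEBVTT. The sketch below pulls the individual cues out of that string with a regular expression. `parse_webvtt_cues` is a hypothetical helper, and the pattern assumes the cue layout visible in the output above: a timestamp line followed by a `<v Speaker>` line.

```python
import re

# One cue: "MM:SS.mmm --> MM:SS.mmm" (hours optional), then "<v Speaker>text" on the next line.
TIMESTAMP = r"(?:\d{2}:)?\d{2}:\d{2}\.\d{3}"
CUE_PATTERN = re.compile(
    rf"(?P<start>{TIMESTAMP}) --> (?P<end>{TIMESTAMP})\n<v (?P<speaker>[^>]+)>(?P<text>[^\n]*)"
)


def parse_webvtt_cues(markdown):
    """Return a list of cue dicts (start, end, speaker, text) found in `markdown`."""
    return [match.groupdict() for match in CUE_PATTERN.finditer(markdown)]


FENCE = "`" * 3  # built this way so the sample contains no literal code fence
sample_markdown = (
    "# Audio: 00:00.000 => 03:02.112\n\nTranscript\n" + FENCE + "\nWEBVTT\n\n"
    "00:00.080 --> 00:03.680\n<v Speaker 1>Hi, thank you for calling Contoso, who am I speaking with today?\n\n"
    "00:04.560 --> 00:06.240\n<v Speaker 1>Hi, my name is Mary Rondo.\n" + FENCE + "\n"
)

for cue in parse_webvtt_cues(sample_markdown):
    print(f"[{cue['start']} - {cue['end']}] {cue['speaker']}: {cue['text']}")
```

With the real result, the input would be `result["result"]["contents"][0]["markdown"]`, following the structure of the JSON output above.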

## Clean Up
Optionally, delete the sample analyzer from your resource. In typical usage, you would keep the analyzer and reuse it to analyze many files.

In [8]:
client.delete_analyzer(ANALYZER_ID)

INFO:python.content_understanding_client_dev_preview2:Analyzer cu-demo-13b055c7-d293-4bb3-911e-3ea3d8bb38e6 deleted.


<Response [204]>