# Extract Content from Your File

This notebook demonstrates how to use the Content Understanding API to extract semantic content from multimodal files.

## Prerequisites
1. Ensure your Azure AI service is configured by following these [steps](../README.md#configure-azure-ai-service-resource).
2. Install the required packages to run the sample.

In [1]:
%pip install -r ../requirements.txt

Collecting azure-identity (from -r ../requirements.txt (line 1))
  Downloading azure_identity-1.23.0-py3-none-any.whl.metadata (81 kB)
Collecting python-dotenv (from -r ../requirements.txt (line 2))
  Using cached python_dotenv-1.1.0-py3-none-any.whl.metadata (24 kB)
Collecting requests (from -r ../requirements.txt (line 3))
  Using cached requests-2.32.3-py3-none-any.whl.metadata (4.6 kB)
Collecting Pillow (from -r ../requirements.txt (line 4))
  Downloading pillow-11.2.1-cp312-cp312-win_amd64.whl.metadata (9.1 kB)
Collecting azure-core>=1.31.0 (from azure-identity->-r ../requirements.txt (line 1))
  Downloading azure_core-1.34.0-py3-none-any.whl.metadata (42 kB)
Collecting cryptography>=2.5 (from azure-identity->-r ../requirements.txt (line 1))
  Downloading cryptography-44.0.3-cp39-abi3-win_amd64.whl.metadata (5.7 kB)
Collecting msal>=1.30.0 (from azure-identity->-r ../requirements.txt (line 1))
  Downloading msal-1.32.3-py3-none-any.whl.metadata (11 kB)
Collecting msal-extensions>=


[notice] A new release of pip is available: 25.0.1 -> 25.1.1
[notice] To update, run: python.exe -m pip install --upgrade pip


## Create Azure AI Content Understanding Client

> The [AzureContentUnderstandingClient](../python/content_understanding_client.py) is a utility class containing functions to interact with the Content Understanding API. Until the official Content Understanding SDK is released, it can be regarded as a lightweight SDK.


In [None]:
import logging
import json
import os
import sys
import uuid
from pathlib import Path
from dotenv import find_dotenv, load_dotenv
from azure.identity import DefaultAzureCredential, get_bearer_token_provider

load_dotenv(find_dotenv())
logging.basicConfig(level=logging.INFO)

AZURE_AI_ENDPOINT = os.getenv("AZURE_AI_ENDPOINT","https://wap-dataplatformandpxp-dev-01-aai.cognitiveservices.azure.com/")

AZURE_AI_API_VERSION = os.getenv("AZURE_AI_API_VERSION", "2024-12-01-preview")
# Read the key from the environment or the .env file; never hard-code a secret in a notebook.
AZURE_AI_SUBSCRIPTION_KEY = os.getenv("AZURE_AI_SUBSCRIPTION_KEY")

# Add the parent directory to the path to use shared modules
parent_dir = Path(Path.cwd()).parent
sys.path.append(str(parent_dir))
from python.content_understanding_client import AzureContentUnderstandingClient

credential = DefaultAzureCredential()
token_provider = get_bearer_token_provider(credential, "https://cognitiveservices.azure.com/.default")

client = AzureContentUnderstandingClient(
    endpoint=AZURE_AI_ENDPOINT,
    api_version=AZURE_AI_API_VERSION,
    # token_provider=token_provider,
    subscription_key=AZURE_AI_SUBSCRIPTION_KEY,
    x_ms_useragent="azure-ai-content-understanding-python/content_extraction", # This header is used for sample usage telemetry, please comment out this line if you want to opt out.
)

# Utility function to save images
from PIL import Image
from io import BytesIO
import re

def save_image(image_id: str, response):
    raw_image = client.get_image_from_analyze_operation(
        analyze_response=response,
        image_id=image_id,
    )
    image = Image.open(BytesIO(raw_image))
    # image.show()
    Path(".cache").mkdir(exist_ok=True)
    image.save(f".cache/{image_id}.jpg", "JPEG")


INFO:azure.identity._credentials.environment:No environment configuration found.
INFO:azure.identity._credentials.managed_identity:ManagedIdentityCredential will use IMDS
INFO:azure.core.pipeline.policies.http_logging_policy:Request URL: 'http://169.254.169.254/metadata/identity/oauth2/token?api-version=REDACTED&resource=REDACTED'
Request method: 'GET'
Request headers:
    'User-Agent': 'azsdk-python-identity/1.23.0 Python/3.12.10 (Windows-10-10.0.19044-SP0)'
No body was attached to the request
INFO:azure.identity._credentials.chained:DefaultAzureCredential acquired a token from AzureCliCredential


## Document Content

The Content Understanding API extracts all textual content from a specified document file. In addition to text extraction, it performs layout analysis to identify and categorize the tables and figures in the document. The output is returned as structured markdown, which keeps it easy to read and to process further.
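
As a rough sketch of how that markdown could be consumed once a result is available, the helper below pulls the `markdown` string out of each entry under `result` -> `contents` and lists the table header cells it finds. The `sample_result` dictionary is a hypothetical, heavily shortened stand-in for a real response, and `extract_markdown` and `table_headers` are illustrative names that are not part of the client.

```python
import re


def extract_markdown(result: dict) -> str:
    """Join the markdown of every entry under result -> contents."""
    contents = result.get("result", {}).get("contents", [])
    parts = [c["markdown"] for c in contents if isinstance(c.get("markdown"), str)]
    return "\n\n".join(parts)


def table_headers(markdown: str) -> list[str]:
    """Return the text of every <th> cell embedded in the markdown."""
    return re.findall(r"<th>(.*?)</th>", markdown, flags=re.DOTALL)


# Hypothetical, shortened stand-in for an analyze response (assumption).
sample_result = {
    "status": "Succeeded",
    "result": {
        "contents": [
            {
                "markdown": "# INVOICE\n\n<table>\n<tr>\n"
                "<th>SALESPERSON</th>\n<th>TERMS</th>\n</tr>\n</table>"
            }
        ]
    },
}

print(table_headers(extract_markdown(sample_result)))
```

Using `.get` with defaults keeps the helper from raising when a response has no `contents`, which mirrors the defensive style used later in this notebook.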



In [6]:
ANALYZER_ID = "content-doc-sample-" + str(uuid.uuid4())
ANALYZER_TEMPLATE_FILE = '../analyzer_templates/content_document.json'
ANALYZER_SAMPLE_FILE = '../data/invoice.pdf'

# Create analyzer
response = client.begin_create_analyzer(ANALYZER_ID, analyzer_template_path=ANALYZER_TEMPLATE_FILE)
result = client.poll_result(response)

# Analyze file
response = client.begin_analyze(ANALYZER_ID, file_location=ANALYZER_SAMPLE_FILE)
result = client.poll_result(response)

print(json.dumps(result, indent=2))
client.delete_analyzer(ANALYZER_ID)

INFO:python.content_understanding_client:Analyzer content-doc-sample-75d21c4a-6b06-4d50-bd47-a6f643828dc3 create request accepted.
INFO:python.content_understanding_client:Request result is ready after 0.00 seconds.
INFO:python.content_understanding_client:Analyzing file ../data/invoice.pdf with analyzer: content-doc-sample-75d21c4a-6b06-4d50-bd47-a6f643828dc3
INFO:python.content_understanding_client:Request bf2eb774-780c-4ffd-aeeb-57a27d1dc5ce in progress ...
INFO:python.content_understanding_client:Request bf2eb774-780c-4ffd-aeeb-57a27d1dc5ce in progress ...
INFO:python.content_understanding_client:Request result is ready after 4.55 seconds.


{
  "id": "bf2eb774-780c-4ffd-aeeb-57a27d1dc5ce",
  "status": "Succeeded",
  "result": {
    "analyzerId": "content-doc-sample-75d21c4a-6b06-4d50-bd47-a6f643828dc3",
    "apiVersion": "2024-12-01-preview",
    "createdAt": "2025-05-15T06:33:09Z",
    "contents": [
      {
        "markdown": "CONTOSO LTD.\n\n\n# INVOICE\n\nContoso Headquarters\n123 456th St\nNew York, NY, 10001\n\nINVOICE: INV-100\n\nINVOICE DATE: 11/15/2019\n\nDUE DATE: 12/15/2019\n\nCUSTOMER NAME: MICROSOFT CORPORATION\n\nSERVICE PERIOD: 10/14/2019 - 11/14/2019\n\nCUSTOMER ID: CID-12345\n\nMicrosoft Corp\n123 Other St,\nRedmond WA, 98052\n\nBILL TO:\n\nMicrosoft Finance\n\n123 Bill St,\n\nRedmond WA, 98052\n\nSHIP TO:\n\nMicrosoft Delivery\n\n123 Ship St,\n\nRedmond WA, 98052\n\nSERVICE ADDRESS:\nMicrosoft Services\n123 Service St,\nRedmond WA, 98052\n\n\n<table>\n<tr>\n<th>SALESPERSON</th>\n<th>P.O. NUMBER</th>\n<th>REQUISITIONER</th>\n<th>SHIPPED VIA</th>\n<th>F.O.B. POINT</th>\n<th>TERMS</th>\n</tr>\n<tr>\n<td></t

INFO:python.content_understanding_client:Analyzer content-doc-sample-75d21c4a-6b06-4d50-bd47-a6f643828dc3 deleted.


<Response [204]>

## Audio Content
The audio output supports detailed analysis of spoken language, which developers can use in applications such as voice recognition, customer service analytics, and conversational AI. Its structure makes it easy to extract and analyze individual components of a conversation for further processing or insights.

1. Speaker Identification: Each phrase is attributed to a specific speaker (in this case, "Speaker 1"). This keeps conversations with multiple participants easy to follow.
1. Timing Information: Each transcription includes precise timing data:
    - startTimeMs: The time (in milliseconds) when the phrase begins.
    - endTimeMs: The time (in milliseconds) when the phrase ends.
    This information is crucial for applications like video subtitles, allowing synchronization between the audio and the text.
1. Text Content: The actual spoken text is provided, which in this instance is "Thank you for calling Woodgrove Travel." This is the main content of the transcription.
1. Confidence Score: Each transcription phrase includes a confidence score (0.933 in this case), indicating the likelihood that the transcription is accurate. A higher score suggests greater reliability.
1. Word-Level Breakdown: The transcription is further broken down into individual words, each with its own timing information. This allows for detailed analysis of speech patterns and can be useful for applications such as language processing or speech recognition improvement.
1. Locale Specification: The locale is specified as "en-US," indicating that the transcription is in American English. This is important for ensuring that the transcription algorithms account for regional dialects and pronunciations.
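
The timing and confidence fields described above can be combined into subtitle-style cues. The following is a minimal sketch rather than part of the client: it assumes each phrase is a dictionary with `speaker`, `startTimeMs`, `endTimeMs`, and `text` keys (as seen in `transcriptPhrases` in the video output further down) plus a `confidence` key, whose exact name is an assumption. The second sample phrase uses a made-up confidence value purely to show the filter.

```python
def ms_to_timestamp(ms: int) -> str:
    """Format milliseconds as MM:SS.mmm, the style used in the WEBVTT output."""
    minutes, remainder = divmod(ms, 60_000)
    seconds, millis = divmod(remainder, 1000)
    return f"{minutes:02d}:{seconds:02d}.{millis:03d}"


def phrases_to_cues(phrases: list[dict], min_confidence: float = 0.0) -> list[str]:
    """Build one cue line per phrase, skipping phrases below min_confidence."""
    cues = []
    for phrase in phrases:
        # "confidence" is an assumed key name; a missing value is treated as 1.0.
        if phrase.get("confidence", 1.0) < min_confidence:
            continue
        start = ms_to_timestamp(phrase["startTimeMs"])
        end = ms_to_timestamp(phrase["endTimeMs"])
        speaker = phrase.get("speaker", "unknown")
        cues.append(f"{start} --> {end} [{speaker}] {phrase['text']}")
    return cues


# Illustrative phrases; the 0.4 confidence is hypothetical.
sample_phrases = [
    {"speaker": "Speaker 1", "startTimeMs": 80, "endTimeMs": 2160,
     "text": "Thank you for calling Woodgrove Travel.", "confidence": 0.933},
    {"speaker": "Speaker 1", "startTimeMs": 2960, "endTimeMs": 4560,
     "text": "My name is Isabella Taylor.", "confidence": 0.4},
]

for cue in phrases_to_cues(sample_phrases, min_confidence=0.5):
    print(cue)
```

Filtering on a confidence threshold is one possible design choice; an application that needs a complete transcript would keep every phrase and flag the low-confidence ones instead.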


In [7]:
ANALYZER_ID = "content-audio-sample-" + str(uuid.uuid4())
ANALYZER_TEMPLATE_FILE = '../analyzer_templates/audio_transcription.json'
ANALYZER_SAMPLE_FILE = '../data/audio.wav'

# Create analyzer
response = client.begin_create_analyzer(ANALYZER_ID, analyzer_template_path=ANALYZER_TEMPLATE_FILE)
result = client.poll_result(response)

# Analyze file
response = client.begin_analyze(ANALYZER_ID, file_location=ANALYZER_SAMPLE_FILE)
result = client.poll_result(response)

print(json.dumps(result, indent=2))
client.delete_analyzer(ANALYZER_ID)

INFO:python.content_understanding_client:Analyzer content-audio-sample-58a98841-cbe4-49ad-b88f-4c85b85d60d1 create request accepted.
INFO:python.content_understanding_client:Request result is ready after 0.00 seconds.
INFO:python.content_understanding_client:Analyzing file ../data/audio.wav with analyzer: content-audio-sample-58a98841-cbe4-49ad-b88f-4c85b85d60d1
INFO:python.content_understanding_client:Request 4fc0c801-c2de-4d93-9f22-f5a7887d3697 in progress ...
INFO:python.content_understanding_client:Request 4fc0c801-c2de-4d93-9f22-f5a7887d3697 in progress ...
INFO:python.content_understanding_client:Request 4fc0c801-c2de-4d93-9f22-f5a7887d3697 in progress ...
INFO:python.content_understanding_client:Request result is ready after 7.11 seconds.


{
  "id": "4fc0c801-c2de-4d93-9f22-f5a7887d3697",
  "status": "Succeeded",
  "result": {
    "analyzerId": "content-audio-sample-58a98841-cbe4-49ad-b88f-4c85b85d60d1",
    "apiVersion": "2024-12-01-preview",
    "createdAt": "2025-05-15T06:33:31Z",
    "contents": [
      {
        "markdown": "```WEBVTT\n\n00:00.080 --> 00:02.160\n<v Speaker 1>Thank you for calling Woodgrove Travel.\n\n00:02.960 --> 00:04.560\n<v Speaker 1>My name is Isabella Taylor.\n\n00:05.360 --> 00:06.880\n<v Speaker 1>How may I assist you today?\n\n00:07.680 --> 00:10.240\n<v Speaker 2>Hi Isabella, my name is John Smith.\n\n00:11.120 --> 00:17.920\n<v Speaker 2>I recently traveled from New York City to Los Angeles on a business trip, and I had a terrible experience with my flight.\n\n00:18.720 --> 00:20.880\n<v Speaker 1>I'm sorry to hear that, John.\n\n00:21.680 --> 00:27.200\n<v Speaker 1>Could you please provide me with the details of your flight, such as the airline name and flight number?\n\n00:28.000 --> 0

INFO:python.content_understanding_client:Analyzer content-audio-sample-58a98841-cbe4-49ad-b88f-4c85b85d60d1 deleted.


<Response [204]>

## Video Content
Video output provides detailed information about audiovisual content, specifically video shots. Here are the key features it offers:

1. Shot Information: Each shot is defined by a start and end time, along with a unique identifier. For example, the shot from 00:00.000 to 00:01.467 includes a transcript and key frames.
1. Transcript: The API includes a transcript of the audio, formatted in WEBVTT, which allows for easy synchronization with the video. It captures spoken content and specifies the timing of the dialogue.
1. Key Frames: It provides a series of key frames (images) that represent important moments in the video shot, allowing users to visualize the content at specific timestamps.
1. Description: Each shot is accompanied by a description, providing context about the visuals presented. This helps in understanding the scene or subject matter without watching the video.
1. Audiovisual Metadata: Each entry includes details about the video such as its dimensions (`width` and `height`), its kind (`audioVisual`), and its key frame timestamps (`KeyFrameTimesMs`).
1. Transcript Phrases: The output includes specific phrases from the transcript, along with timing and speaker information, enhancing the usability for applications like closed captioning or search functionalities.
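
To make the features above concrete, here is a small sketch that summarizes one entry of `contents` from a video result. The `summarize_shot` helper is illustrative and not part of the client, and the regular expression for the shot header assumes the `# Shot MM:SS.mmm => MM:SS.mmm` heading format seen in the sample output. The `sample_content` dictionary is a trimmed-down stand-in for a real entry.

```python
import re

# Assumed heading format: "# Shot 00:00.000 => 00:01.467"
SHOT_HEADER = re.compile(r"^# Shot (\d{2}:\d{2}\.\d{3}) => (\d{2}:\d{2}\.\d{3})", re.MULTILINE)
# Key frames are referenced in the markdown as image links such as keyFrame.726.jpg
KEYFRAME = re.compile(r"(keyFrame\.\d+)\.jpg")


def summarize_shot(content: dict) -> dict:
    """Collect the shot boundaries, duration, key frame IDs, and phrase texts."""
    markdown = content.get("markdown", "")
    header = SHOT_HEADER.search(markdown)
    return {
        "shot": header.groups() if header else None,
        "duration_ms": content.get("endTimeMs", 0) - content.get("startTimeMs", 0),
        "keyframes": KEYFRAME.findall(markdown),
        "phrases": [p.get("text", "") for p in content.get("transcriptPhrases", [])],
    }


# Trimmed-down stand-in for one entry of result["result"]["contents"].
sample_content = {
    "markdown": "# Shot 00:00.000 => 00:01.467\n## Key Frames\n- 00:00.726 ![](keyFrame.726.jpg)",
    "kind": "audioVisual",
    "startTimeMs": 0,
    "endTimeMs": 1467,
    "KeyFrameTimesMs": [726],
    "transcriptPhrases": [
        {"speaker": "speaker", "startTimeMs": 1400, "endTimeMs": 6560,
         "text": "When it comes to the neural TTS, in order to get a good voice, it's better to have good data."}
    ],
}

summary = summarize_shot(sample_content)
print(summary["shot"], summary["duration_ms"], summary["keyframes"])
```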

In [8]:
ANALYZER_ID = "content-video-sample-" + str(uuid.uuid4())
ANALYZER_TEMPLATE_FILE = '../analyzer_templates/content_video.json'
ANALYZER_SAMPLE_FILE = '../data/FlightSimulator.mp4'

# Create analyzer
response = client.begin_create_analyzer(ANALYZER_ID, analyzer_template_path=ANALYZER_TEMPLATE_FILE)
result = client.poll_result(response)

# Analyze file
response = client.begin_analyze(ANALYZER_ID, file_location=ANALYZER_SAMPLE_FILE)
result = client.poll_result(response)

print(json.dumps(result, indent=2))

# Save keyframes (optional)
keyframe_ids = set()
result_data = result.get("result", {})
contents = result_data.get("contents", [])

# Iterate over contents to find keyframes if available
for content in contents:
    # Extract keyframe IDs from "markdown" if it exists and is a string
    markdown_content = content.get("markdown", "")
    if isinstance(markdown_content, str):
        keyframe_ids.update(re.findall(r"(keyFrame\.\d+)\.jpg", markdown_content))

# Output the results
print("Unique Keyframe IDs:", keyframe_ids)

# Save all keyframe images
for keyframe_id in keyframe_ids:
    save_image(keyframe_id, response)

# Delete analyzer
client.delete_analyzer(ANALYZER_ID)

INFO:python.content_understanding_client:Analyzer content-video-sample-fc3512b0-ee6a-4baa-a367-b44d4bc49b5e create request accepted.
INFO:python.content_understanding_client:Request result is ready after 0.00 seconds.
INFO:python.content_understanding_client:Analyzing file ../data/FlightSimulator.mp4 with analyzer: content-video-sample-fc3512b0-ee6a-4baa-a367-b44d4bc49b5e
INFO:python.content_understanding_client:Request b82ef1eb-0792-4647-bece-1c9c26ff8a43 in progress ...
INFO:python.content_understanding_client:Request b82ef1eb-0792-4647-bece-1c9c26ff8a43 in progress ...
INFO:python.content_understanding_client:Request b82ef1eb-0792-4647-bece-1c9c26ff8a43 in progress ...
INFO:python.content_understanding_client:Request b82ef1eb-0792-4647-bece-1c9c26ff8a43 in progress ...
INFO:python.content_understanding_client:Request b82ef1eb-0792-4647-bece-1c9c26ff8a43 in progress ...
INFO:python.content_understanding_client:Request b82ef1eb-0792-4647-bece-1c9c26ff8a43 in progress ...
INFO:python.c

{
  "id": "b82ef1eb-0792-4647-bece-1c9c26ff8a43",
  "status": "Succeeded",
  "result": {
    "analyzerId": "content-video-sample-fc3512b0-ee6a-4baa-a367-b44d4bc49b5e",
    "apiVersion": "2024-12-01-preview",
    "createdAt": "2025-05-15T06:34:27Z",
    "contents": [
      {
        "markdown": "# Shot 00:00.000 => 00:01.467\n## Transcript\n```\nWEBVTT\n\n00:01.400 --> 00:06.560\n<v Speaker>When it comes to the neural TTS, in order to get a good voice, it's better to have good data.\n```\n## Key Frames\n- 00:00.726 ![](keyFrame.726.jpg)",
        "fields": {},
        "kind": "audioVisual",
        "startTimeMs": 0,
        "endTimeMs": 1467,
        "width": 1080,
        "height": 608,
        "KeyFrameTimesMs": [
          726
        ],
        "transcriptPhrases": [
          {
            "speaker": "speaker",
            "startTimeMs": 1400,
            "endTimeMs": 6560,
            "text": "When it comes to the neural TTS, in order to get a good voice, it's better to have good 

INFO:python.content_understanding_client:Analyzer content-video-sample-fc3512b0-ee6a-4baa-a367-b44d4bc49b5e deleted.


<Response [204]>

## Video Content with Face
This is a gated feature. To request access, complete the [Azure AI Resource Face Gating](https://learn.microsoft.com/en-us/legal/cognitive-services/computer-vision/limited-access-identity?context=%2Fazure%2Fai-services%2Fcomputer-vision%2Fcontext%2Fcontext#registration-process) registration process and select `[Video Indexer] Facial Identification (1:N or 1:1 matching) to search for a face in a media or entertainment video archive to find a face within a video and generate metadata for media or entertainment use cases only` in the registration form.

In [9]:
ANALYZER_ID = "content-video-face-sample-" + str(uuid.uuid4())
ANALYZER_TEMPLATE_FILE = '../analyzer_templates/face_aware_in_video.json'
ANALYZER_SAMPLE_FILE = '../data/FlightSimulator.mp4'

# Create analyzer
response = client.begin_create_analyzer(ANALYZER_ID, analyzer_template_path=ANALYZER_TEMPLATE_FILE)
result = client.poll_result(response)

# Analyze file
response = client.begin_analyze(ANALYZER_ID, file_location=ANALYZER_SAMPLE_FILE)
result = client.poll_result(response)

print(json.dumps(result, indent=2))

INFO:python.content_understanding_client:Analyzer content-video-face-sample-3d0d5702-ceb6-498b-9ca3-a7daa31adcbc create request accepted.
INFO:python.content_understanding_client:Request result is ready after 0.00 seconds.
INFO:python.content_understanding_client:Analyzing file ../data/FlightSimulator.mp4 with analyzer: content-video-face-sample-3d0d5702-ceb6-498b-9ca3-a7daa31adcbc
INFO:python.content_understanding_client:Request 301830ba-542f-476b-8cba-07a47d91b262 in progress ...
INFO:python.content_understanding_client:Request 301830ba-542f-476b-8cba-07a47d91b262 in progress ...
INFO:python.content_understanding_client:Request 301830ba-542f-476b-8cba-07a47d91b262 in progress ...
INFO:python.content_understanding_client:Request 301830ba-542f-476b-8cba-07a47d91b262 in progress ...
INFO:python.content_understanding_client:Request 301830ba-542f-476b-8cba-07a47d91b262 in progress ...
INFO:python.content_understanding_client:Request 301830ba-542f-476b-8cba-07a47d91b262 in progress ...
INF

{
  "id": "301830ba-542f-476b-8cba-07a47d91b262",
  "status": "Succeeded",
  "result": {
    "analyzerId": "content-video-face-sample-3d0d5702-ceb6-498b-9ca3-a7daa31adcbc",
    "apiVersion": "2024-12-01-preview",
    "createdAt": "2025-05-15T06:35:43Z",
    "contents": [
      {
        "markdown": "# Shot 00:00.000 => 00:01.467\n## Transcript\n```\nWEBVTT\n\n00:01.400 --> 00:06.560\n<v Speaker>When it comes to the neural TTS, in order to get a good voice, it's better to have good data.\n```\n## Key Frames\n- 00:00.726 ![](keyFrame.726.jpg)",
        "fields": {
          "description": {
            "type": "string",
            "valueString": "An aerial view of a lush, mountainous island surrounded by vibrant blue ocean waters is shown. The logos of \"Flight Simulator\" and \"Microsoft Azure AI\" are displayed prominently over the image."
          },
          "audioDescription": {
            "type": "string",
            "valueString": "We are greeted with a breathtaking aerial vi

### Get and Save Key Frames and Face Thumbnails

In [10]:
# Initialize sets for unique face IDs and keyframe IDs
face_ids = set()
keyframe_ids = set()

# Extract unique face IDs safely
result_data = result.get("result", {})
contents = result_data.get("contents", [])

# Iterate over contents to find faces and keyframes if available
for content in contents:
    # Safely retrieve face IDs if "faces" exists and is a list
    faces = content.get("faces", [])
    if isinstance(faces, list):
        for face in faces:
            face_id = face.get("faceId")
            if face_id:
                face_ids.add(f"face.{face_id}")

    # Extract keyframe IDs from "markdown" if it exists and is a string
    markdown_content = content.get("markdown", "")
    if isinstance(markdown_content, str):
        keyframe_ids.update(re.findall(r"(keyFrame\.\d+)\.jpg", markdown_content))

# Output the results
print("Unique Face IDs:", face_ids)
print("Unique Keyframe IDs:", keyframe_ids)

# Save all face images
for face_id in face_ids:
    save_image(face_id, response)

# Save all keyframe images
for keyframe_id in keyframe_ids:
    save_image(keyframe_id, response)

Unique Face IDs: {'face.P1000N', 'face.P1003N', 'face.P1023N'}
Unique Keyframe IDs: {'keyFrame.31614', 'keyFrame.4884', 'keyFrame.2640', 'keyFrame.6534', 'keyFrame.7788', 'keyFrame.9768', 'keyFrame.15444', 'keyFrame.29106', 'keyFrame.12078', 'keyFrame.12804', 'keyFrame.4059', 'keyFrame.42636', 'keyFrame.21714', 'keyFrame.10560', 'keyFrame.20196', 'keyFrame.22473', 'keyFrame.34584', 'keyFrame.8976', 'keyFrame.5709', 'keyFrame.23232', 'keyFrame.26532', 'keyFrame.33891', 'keyFrame.40590', 'keyFrame.38181', 'keyFrame.30822', 'keyFrame.32406', 'keyFrame.41283', 'keyFrame.24816', 'keyFrame.39897', 'keyFrame.18579', 'keyFrame.36861', 'keyFrame.20955', 'keyFrame.27390', 'keyFrame.36069', 'keyFrame.43230', 'keyFrame.2046', 'keyFrame.16929', 'keyFrame.28248', 'keyFrame.17754', 'keyFrame.726', 'keyFrame.14190', 'keyFrame.38676', 'keyFrame.14817', 'keyFrame.25674'}


## Clean Up
Optionally, delete the sample analyzer from your resource. In typical usage scenarios, you would analyze multiple files using the same analyzer.

In [None]:
client.delete_analyzer(ANALYZER_ID)
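
Since a typical workflow analyzes several files with the same analyzer, one possible pattern is to loop over the files and delete the analyzer in a `finally` block. The sketch below uses only the client methods called earlier in this notebook (`begin_analyze`, `poll_result`, `delete_analyzer`); `analyze_files` and `FakeClient` are hypothetical names, and the fake client exists only so the example runs without an Azure resource.

```python
def analyze_files(client, analyzer_id: str, file_paths: list[str]) -> dict:
    """Analyze several files with one analyzer, then delete the analyzer."""
    results = {}
    try:
        for path in file_paths:
            response = client.begin_analyze(analyzer_id, file_location=path)
            results[path] = client.poll_result(response)
    finally:
        # Runs even if an analysis fails, so the analyzer is not left behind.
        client.delete_analyzer(analyzer_id)
    return results


class FakeClient:
    """Hypothetical stand-in that mimics the three client calls used above."""

    def __init__(self):
        self.deleted = []

    def begin_analyze(self, analyzer_id, file_location):
        return {"file": file_location}

    def poll_result(self, response):
        return {"status": "Succeeded", "file": response["file"]}

    def delete_analyzer(self, analyzer_id):
        self.deleted.append(analyzer_id)


fake = FakeClient()
results = analyze_files(fake, "my-analyzer", ["a.pdf", "b.pdf"])
print(sorted(results), fake.deleted)
```

With the real client, `fake` would be replaced by the `client` object created at the top of this notebook and the file list by paths under `../data`.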