<a href="https://colab.research.google.com/github/genaiconference/Agentic_RAG_Workshop/blob/main/01_Handling_Multi_Modal_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Document Processing and Analysis using Azure Document Intelligence

This notebook demonstrates how to:
- process a document with Azure Document Intelligence,
- extract information from it,
- process images within the document,
- create text chunks, and
- add custom metadata to the chunks.


## Setup and Installations
Install the libraries needed for document processing, data handling, and calling Azure Document Intelligence and OpenAI.

In [5]:
!pip install -r requirements.txt



## Load Environment Variables and Initialize Clients
Load the API keys and endpoint from environment variables and initialize the OpenAI chat model. The Azure Document Intelligence key and endpoint are only read here; they are passed to `di_utils.analyze_document` in a later cell.

In [2]:
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI
import os

load_dotenv()

# Azure Document Intelligence credentials, read from the .env file
doc_intelligence_endpoint = os.getenv("doc_intelligence_endpoint")
doc_intelligence_key = os.getenv("doc_intelligence_key")


llm = ChatOpenAI(
    api_key=os.getenv("OPENAI_API_KEY"),
    temperature=0
)
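Because `os.getenv` returns `None` when a variable is missing, a misconfigured `.env` file only surfaces later as an authentication error. The following is a minimal sketch of an early check, not part of the original notebook; `require_env` is a hypothetical helper name, and the variable names are the ones used in the cell above.

```python
import os


def require_env(names, env=None):
    """Return the requested variables as a dict, raising if any is missing or empty.

    `require_env` is a hypothetical helper for illustration; `env` defaults to
    os.environ and can be replaced with a plain dict for testing.
    """
    env = os.environ if env is None else env
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in names}


# Demonstration with a stand-in dict instead of the real environment.
example = require_env(
    ["doc_intelligence_endpoint"],
    env={"doc_intelligence_endpoint": "https://example.invalid"},
)
```

In the notebook itself, the call would presumably follow `load_dotenv()` and list `doc_intelligence_endpoint`, `doc_intelligence_key`, and `OPENAI_API_KEY`.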

## Define Paths and Load Document

In [6]:
import os
import pickle
from IPython.display import Markdown

import di_utils

# --- Configuration ---
INPUT_FILE_PATH = "LEAVE POLICY v9.pdf"
PICKLE_FILENAME = "leave_policy.pkl"
IMG_SAVE_DIR = os.path.join(os.getcwd(), "images")
PICKLE_DIR = os.path.join(os.getcwd(), "DI_output")
PICKLE_FILE_PATH = os.path.join(PICKLE_DIR, PICKLE_FILENAME)


# --- Ensure output directories exist ---
os.makedirs(IMG_SAVE_DIR, exist_ok=True)
os.makedirs(PICKLE_DIR, exist_ok=True)

## Run Document Intelligence

In [7]:
# Step 1: Analyze Document Layout
md_result = di_utils.analyze_document(INPUT_FILE_PATH, doc_intelligence_key, doc_intelligence_endpoint)

# Step 2: Build Span Map
span_map = di_utils.extract_span_map(md_result)

# Step 3: Process Images and Integrate with Text
image_chunks = di_utils.process_images(INPUT_FILE_PATH, IMG_SAVE_DIR, md_result, llm)
result_with_image_descp = di_utils.insert_figures_into_full_text(span_map, md_result, image_chunks)

# Step 4: Save Intermediate Output
di_utils.save_pickle(result_with_image_descp, PICKLE_FILE_PATH)

# Step 5: Chunk Document into Parent Docs and Add Metadata
final_parents = di_utils.generate_parents(md_result, result_with_image_descp, llm)

[INFO] Pickle saved to /content/DI_output/leave_policy.pkl
[INFO] Creating parent documents...
[INFO] Created 14 parent documents.
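Step 4 saves the intermediate result so that the Document Intelligence and image-description calls need not be repeated on every run. The sketch below shows one way to reuse such a file. It assumes the file is a standard `pickle` dump, which the notebook does not confirm for `di_utils.save_pickle`; `load_or_compute` is a hypothetical helper, not a `di_utils` function.

```python
import os
import pickle


def load_or_compute(pickle_path, compute):
    """Return the cached object at pickle_path, or call compute() and cache its result.

    Hypothetical helper for illustration. Only unpickle files you created
    yourself, since unpickling untrusted data can execute arbitrary code.
    """
    if os.path.exists(pickle_path):
        with open(pickle_path, "rb") as f:
            return pickle.load(f)

    result = compute()
    # Create the parent directory if the path has one.
    os.makedirs(os.path.dirname(pickle_path) or ".", exist_ok=True)
    with open(pickle_path, "wb") as f:
        pickle.dump(result, f)
    return result
```

Under that assumption, Steps 1 to 4 could be wrapped in a function and passed as `compute`, with `PICKLE_FILE_PATH` as the cache location.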


In [4]:
display(Markdown(final_parents[6].page_content))

print(final_parents[6].metadata)

##1.2 Sick Leave

 Objective & Eligibility  
To give associates time off to recover and recuperate in the event of illness or accidents.
All internal associates are eligible for this policy.  
### Procedures and Conditions  
☐
The company may, at its discretion, grant sick leave to an associate, based on the
nature of sickness.
☐  
[ In case of absence beyond 3 days, the associate is required to submit a medical
certificate.  
[ In case where absence is likely to be prolonged beyond 3 days, the likely duration of
☐  
absence along with the doctor's certificate certifying the same should be submitted.  
[ Any sick leave in excess of 30 continuous days will require the approval of the division
☐  
management team. The management team will review the case and may at its
discretion approve additional sick leave with pay or without pay (Loss of Pay - LOP).  
[ The company may request the associate to undergo a medical examination by a
☐  
nominated medical practitioner if felt necessary.
☐
Sick leaves can neither be accumulated nor encashed nor carried forward to the next
year.  
<!-- PageFooter="Novartis Hyderabad Business use only" -->
<!-- PageFooter="Back to Contents" -->
<!-- PageNumber="Page 3 of 7" -->
<!-- PageBreak -->  
### 1.3 Parental Leave  
Both Birthing mother and Non-birthing parent are eligible for Parental Leave of 26 weeks
following the birth, surrogacy or adoption of a child.  
Associates must avail the Parental Leave within one year of child's birth/surrogacy/adoption.
Any Parental Leave that is not used within one year of a covered life event will be forfeited
and will not be financially compensated, or carried forward. For details, refer to the policy
document - Parental Leave Policy - India

{'Header 1': 'Encashment', 'Header 2': '1.2 Sick Leave', 'page_number': 3, 'custom_metadata': '1.2 Sick Leave'}
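The metadata attached to each parent document can be used to select chunks before retrieval. The sketch below filters by exact metadata match. It uses a small stand-in class, `ParentDoc`, that exposes the same `page_content` and `metadata` attributes seen above; both `ParentDoc` and `filter_by_metadata` are hypothetical names for illustration, and the sample values are copied from the printed metadata.

```python
from dataclasses import dataclass, field


@dataclass
class ParentDoc:
    """Stand-in for the parent document objects (assumed shape: text plus a metadata dict)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def filter_by_metadata(docs, criteria):
    """Return the docs whose metadata equals every key/value pair in criteria."""
    return [
        doc for doc in docs
        if all(doc.metadata.get(key) == value for key, value in criteria.items())
    ]


# Sample data mirroring the metadata keys printed above.
sample_docs = [
    ParentDoc("sick leave text", {"Header 2": "1.2 Sick Leave", "page_number": 3}),
    ParentDoc("other text", {"Header 2": "Some other section", "page_number": 5}),
]
page_three = filter_by_metadata(sample_docs, {"page_number": 3})
```

A dict is used for `criteria` rather than keyword arguments because keys such as `'Header 2'` contain spaces.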
