# PDF Text Extraction using Azure Document Intelligence and OpenAI

This notebook demonstrates how to:
1. Extract text from PDF files using Azure Document Intelligence (formerly Form Recognizer)
2. Process the extracted text using Azure OpenAI service
3. Generate insights from the PDF content

**Prerequisites:**
- An Azure Document Intelligence resource (endpoint and API key)
- An Azure OpenAI resource (endpoint and API key) with a deployed chat model
- PDF file(s) to process

## 1. Install Required Packages

Let's first install the necessary packages for working with Azure services and PDFs.

In [None]:
# Install required packages
%pip install azure-ai-documentintelligence
%pip install azure-identity
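# The OpenAI calls in this notebook use the pre-1.0 interface (openai.ChatCompletion),
# which was removed in openai 1.0, so pin an earlier release
%pip install "openai<1"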
%pip install python-dotenv


## 2. Import Required Libraries

Now let's import the necessary libraries for our project.

In [None]:
import os
from pathlib import Path
from IPython.display import display, Markdown
from dotenv import load_dotenv
import openai
from azure.core.credentials import AzureKeyCredential
from azure.ai.documentintelligence import DocumentIntelligenceClient

# Load environment variables from .env file
load_dotenv()

## 3. Set Up Azure Credentials

To use Azure services, we need to set up our credentials. For security, we'll load these from environment variables.

First, create a `.env` file in the same directory as this notebook with the following content:
```
# Azure OpenAI Configuration
AZURE_OPENAI_API_KEY=your_azure_openai_api_key
AZURE_OPENAI_ENDPOINT=your_azure_openai_endpoint
AZURE_OPENAI_DEPLOYMENT_NAME=your_azure_openai_deployment_name

# Azure Document Intelligence Configuration
AZURE_DOC_INTELLIGENCE_KEY=your_azure_document_intelligence_key
AZURE_DOC_INTELLIGENCE_ENDPOINT=your_azure_document_intelligence_endpoint
```

In [None]:
# Setup Azure Document Intelligence
doc_intelligence_key = os.environ.get("AZURE_DOC_INTELLIGENCE_KEY")
doc_intelligence_endpoint = os.environ.get("AZURE_DOC_INTELLIGENCE_ENDPOINT")

# Verify credentials are loaded
if not doc_intelligence_key or not doc_intelligence_endpoint:
    print("⚠️ Azure Document Intelligence credentials not found in environment variables.")
else:
    print("✓ Azure Document Intelligence credentials loaded.")

# Setup Azure OpenAI
openai.api_key = os.environ.get("AZURE_OPENAI_API_KEY")
openai.api_base = os.environ.get("AZURE_OPENAI_ENDPOINT")
openai.api_type = "azure"
openai.api_version = "2023-12-01-preview"  # Update this if needed
deployment_name = os.environ.get("AZURE_OPENAI_DEPLOYMENT_NAME")

# Verify OpenAI credentials are loaded
if not openai.api_key or not openai.api_base or not deployment_name:
    print("⚠️ Azure OpenAI credentials not found in environment variables.")
else:
    print("✓ Azure OpenAI credentials loaded.")
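The two checks above can also be folded into a single helper that reports exactly which variables are missing. This is a minimal sketch, not part of the original notebook: `REQUIRED_VARS` and `find_missing_vars` are hypothetical names, and the variable list simply mirrors the `.env` template shown earlier.

```python
import os

# Names taken from the .env template above
REQUIRED_VARS = [
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_DEPLOYMENT_NAME",
    "AZURE_DOC_INTELLIGENCE_KEY",
    "AZURE_DOC_INTELLIGENCE_ENDPOINT",
]

def find_missing_vars(names, env=None):
    """Return the names that are unset or empty in env (defaults to os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

missing = find_missing_vars(REQUIRED_VARS)
if missing:
    print("Missing environment variables: " + ", ".join(missing))
else:
    print("All required environment variables are set.")
```

Listing the missing names makes a typo in the `.env` file easier to spot than a generic "credentials not found" message.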

## 4. PDF Processing Functions

Let's create the functions that will:
1. Extract text from PDFs using Azure Document Intelligence
2. Process the extracted text using Azure OpenAI

In [None]:
def extract_text_from_pdf(pdf_path):
    """
    Extract text from a PDF using Azure Document Intelligence service.
    
    Args:
        pdf_path (str): Path to the PDF file
        
    Returns:
        str: Extracted text from the PDF
    """
    print(f"Processing file: {pdf_path}")
    
    # Initialize Document Intelligence client
    document_intelligence_client = DocumentIntelligenceClient(
        endpoint=doc_intelligence_endpoint,
        credential=AzureKeyCredential(doc_intelligence_key)
    )
    
    # Read PDF file
    with open(pdf_path, "rb") as f:
        document_bytes = f.read()
    
    # Analyze the document
    print("Sending document to Azure Document Intelligence...")
    # Pass the PDF bytes positionally: this SDK does not accept a `document` keyword
    poller = document_intelligence_client.begin_analyze_document(
        "prebuilt-layout",  # Using the layout model
        document_bytes,
        content_type="application/pdf"
    )
    
    print("Processing document...")
    # Get results
    result = poller.result()
    
    # Extract text from the result
    extracted_text = ""
    page_count = len(result.pages)
    print(f"Document has {page_count} pages.")
    
    for i, page in enumerate(result.pages, 1):
        print(f"Processing page {i}/{page_count}...")
        # page.lines can be None when no text is detected on a page
        page_text = "".join(line.content + "\n" for line in (page.lines or []))
        
        extracted_text += f"\n--- Page {i} ---\n{page_text}\n"
    
    print("Text extraction complete.")
    return extracted_text
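Because `extract_text_from_pdf` writes a `--- Page N ---` marker before each page, the combined string can be split back into pages later, for example to send only selected pages to the model. The sketch below is an illustration under that assumption. `split_pages` is a hypothetical helper, and it would be confused by a document line that happens to match the marker exactly.

```python
import re

# Matches the marker lines written by extract_text_from_pdf
PAGE_MARKER = re.compile(r"^--- Page (\d+) ---$", re.MULTILINE)

def split_pages(extracted_text):
    """Map page number -> page text, using the '--- Page N ---' markers."""
    parts = PAGE_MARKER.split(extracted_text)
    # With one capture group, split() yields:
    # [text_before_first_marker, "1", page_1_text, "2", page_2_text, ...]
    pages = {}
    for number, body in zip(parts[1::2], parts[2::2]):
        pages[int(number)] = body.strip()
    return pages

sample = "\n--- Page 1 ---\nHello\nWorld\n\n\n--- Page 2 ---\nBye\n\n"
pages = split_pages(sample)
print(sorted(pages))
```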

In [None]:
def process_with_openai_agent(text, prompt=None):
    """
    Process the extracted text using Azure OpenAI.
    
    Args:
        text (str): Extracted text from the PDF
        prompt (str, optional): Custom prompt for the OpenAI agent
        
    Returns:
        str: The model's reply, or an error message if the API call fails
    """
    print("Processing extracted text with Azure OpenAI...")
    
    # Prepare prompt (only the first 3,000 characters of the text are sent)
    system_message = "You are an AI assistant that helps analyze and summarize document content."
    
    if not prompt:
        user_message = f"Analyze the following document content and provide a comprehensive summary, key points, and any notable information:\n\n{text[:3000]}..."
    else:
        user_message = f"{prompt}\n\n{text[:3000]}..."
    
    # Call Azure OpenAI API
    try:
        response = openai.ChatCompletion.create(
            engine=deployment_name,
            messages=[
                {"role": "system", "content": system_message},
                {"role": "user", "content": user_message}
            ],
            temperature=0.3,
            max_tokens=1000
        )
        
        return response['choices'][0]['message']['content']
    except Exception as e:
        return f"Error calling Azure OpenAI: {str(e)}"
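Note that `process_with_openai_agent` sends only `text[:3000]`, so anything past the first 3,000 characters of a longer document is never seen by the model. One possible workaround, sketched here as an assumption rather than as part of the original workflow, is to split the text into chunks and process each one separately. `chunk_text` is a hypothetical helper that breaks at line ends where it can.

```python
def chunk_text(text, max_chars=3000):
    """Split text into chunks of at most max_chars, breaking at line ends where possible."""
    chunks, current = [], ""
    for line in text.splitlines(keepends=True):
        # A single line longer than max_chars has to be hard-split
        while len(line) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(line[:max_chars])
            line = line[max_chars:]
        # Start a new chunk if this line would overflow the current one
        if len(current) + len(line) > max_chars:
            chunks.append(current)
            current = ""
        current += line
    if current:
        chunks.append(current)
    return chunks

demo = "a\n" * 10
demo_chunks = chunk_text(demo, max_chars=6)
print(len(demo_chunks), [len(c) for c in demo_chunks])
```

Each chunk could then be passed to `process_with_openai_agent` in turn and the partial answers combined. Whether that yields a good overall summary depends on the document and the prompt.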

## 5. Process a PDF File

Now let's use the functions we've created to process a PDF file.

In [None]:
# Set the path to the PDF file (update the path as needed)
pdf_path = "/Users/alvintai/Downloads/PUB/EN13945-4.pdf"

# Verify the file exists
if not os.path.exists(pdf_path):
    print(f"Error: File not found - {pdf_path}")
else:
    print(f"Found PDF file: {pdf_path}")
    
    # Extract text from PDF
    try:
        extracted_text = extract_text_from_pdf(pdf_path)
        
        # Display the first 1000 characters of the extracted text
        print("\nPreview of extracted text:")
        print(extracted_text[:1000] + "...")
        
        # Save the extracted text to a file
        output_path = Path(pdf_path).stem + "_extracted_text.txt"
        with open(output_path, "w", encoding="utf-8") as f:
            f.write(extracted_text)
        
        print(f"\nExtracted text saved to: {output_path}")
        
    except Exception as e:
        print(f"Error: {str(e)}")

## 6. Process with Azure OpenAI

Now let's analyze the extracted text using Azure OpenAI.

In [None]:
# Check if we have extracted text available
output_path = Path(pdf_path).stem + "_extracted_text.txt"
if os.path.exists(output_path):
    # Load the extracted text
    with open(output_path, "r", encoding="utf-8") as f:
        extracted_text = f.read()
    
    # Process with Azure OpenAI
    response = process_with_openai_agent(extracted_text)
    
    # Display the response
    display(Markdown("## Azure OpenAI Analysis"))
    display(Markdown(response))
else:
    print("No extracted text available. Please run the previous cell first.")

## 7. Custom Queries

You can also send custom queries to the OpenAI agent about the PDF content.

In [None]:
# Custom prompt for the OpenAI agent
custom_prompt = "Extract and list all the key technical specifications mentioned in the document."

# Check if we have extracted text available
output_path = Path(pdf_path).stem + "_extracted_text.txt"
if os.path.exists(output_path):
    # Load the extracted text
    with open(output_path, "r", encoding="utf-8") as f:
        extracted_text = f.read()
    
    # Process with Azure OpenAI
    response = process_with_openai_agent(extracted_text, prompt=custom_prompt)
    
    # Display the response
    display(Markdown("## Custom Query Results"))
    display(Markdown(response))
else:
    print("No extracted text available. Please run the previous cells first.")
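When several custom prompts are needed, the load-and-ask pattern above can be wrapped in a small loop. The following is a rough sketch: `run_queries` is a hypothetical helper, and `fake_ask` is a stand-in with the same call shape as `process_with_openai_agent` so that the example runs without Azure access.

```python
def run_queries(text, prompts, ask):
    """Run each named prompt against the same text and collect the answers."""
    results = {}
    for name, prompt in prompts.items():
        results[name] = ask(text, prompt=prompt)
    return results

# Stand-in for process_with_openai_agent (same signature, no network call)
def fake_ask(text, prompt=None):
    return f"[{prompt}] over {len(text)} characters"

answers = run_queries(
    "sample document text",
    {
        "specs": "List the technical specifications.",
        "risks": "List any stated risks.",
    },
    fake_ask,
)
for name, answer in answers.items():
    print(f"{name}: {answer}")
```

In the notebook itself, `process_with_openai_agent` would be passed in place of `fake_ask`, with `extracted_text` as the first argument.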

## Conclusion

In this notebook, we've demonstrated how to:

1. Extract text from PDF documents using Azure Document Intelligence
2. Process and analyze the extracted text using Azure OpenAI
3. Run custom queries against the document content

This workflow can be adapted for various document processing tasks such as:
- Information extraction and summarization
- Document classification
- Data extraction for further processing
- Compliance checking
- And much more!