# Document Processing
Using the Azure OpenAI service to extract key entities from a PDF without using the Azure AI Document Intelligence service

### 1. Install required libraries

In [None]:
# Install the PDF library, the OpenAI SDK, and the .env loader used below
%pip install PyPDF2 openai python-dotenv

### 2. Import helper libraries and load credentials from .env file

In [1]:
import os
from openai import AzureOpenAI
from dotenv import load_dotenv
load_dotenv()

True
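
Because `os.getenv` returns `None` for a missing variable, a typo in the `.env` file only surfaces later as an authentication or URL error. The following is a minimal sketch of an up-front check, assuming the three variable names read in this notebook; `missing_env_vars` is a hypothetical helper, not part of `python-dotenv`.

```python
import os

# Variable names read elsewhere in this notebook
REQUIRED_VARS = ["OPENAI_API_ENDPOINT", "OPENAI_API_KEY", "BLOB_SAS_URL"]

def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given mapping (default: os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("Missing environment variables:", ", ".join(missing))
```

Running this right after `load_dotenv()` reports all missing names at once instead of failing one cell at a time.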

### 3. Create AOAI client

In [2]:
# Create the AOAI client from the endpoint and API key loaded from .env
client = AzureOpenAI(
    azure_endpoint=os.getenv("OPENAI_API_ENDPOINT"),
    api_key=os.getenv("OPENAI_API_KEY"),
    api_version="2023-05-15",
)

### 4. Set up PDF information
This step uses a sample PDF document from blob storage, downloaded through a SAS URL.

In [3]:
import os
import shutil
from urllib.request import urlopen

import PyPDF2

# Name of the Azure OpenAI deployment to call (replace with your own deployment name)
my_ai_model = "gpt-4o"
# SAS URL of the sample PDF in blob storage, loaded from the .env file.
# Avoid printing it: the URL embeds a signed access token.
pdf_file_url = os.getenv("BLOB_SAS_URL")

local_file_name = "sample-ukho-doc-process-using-aoai.pdf"
# Download the file from `pdf_file_url` and save it locally under `local_file_name`
with urlopen(pdf_file_url) as response, open(local_file_name, 'wb') as out_file:
    shutil.copyfileobj(response, out_file)

### 5. Read the PDF document

In [4]:
processed_text_list = []
# Open the PDF file in binary mode
with open(local_file_name, 'rb') as pdf_file:
    pdf_reader = PyPDF2.PdfReader(pdf_file)
    # Iterate through each page and extract its text
    for page in pdf_reader.pages:
        # Pages without a text layer can yield nothing; fall back to an empty string
        page_text = page.extract_text() or ""
        processed_text_list.append(page_text)

# Combine the text extracted from all pages into a single string
combined_text = "\n".join(processed_text_list)
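
Text extracted this way tends to contain irregular spacing and empty lines, which add prompt tokens without adding information. The following is a small sketch of a cleanup pass that could run before the text is sent to the model; `clean_extracted_text` is a hypothetical helper and the sample string is made up for illustration.

```python
import re

def clean_extracted_text(text):
    """Collapse runs of spaces/tabs and drop blank lines from extracted PDF text."""
    lines = (re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)

# Made-up sample resembling raw extraction output
sample = "AGREEMENT   between \n\n\n  UKHO\tand CHC  "
print(clean_extracted_text(sample))
```

In this notebook it could be applied as `combined_text = clean_extracted_text(combined_text)`; whether that helps depends on how much layout the model needs to see.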

### 6. Format the messages to send to the GPT model

In [5]:
messages = [
        {
            "role": "system",
            "content": """You are an assistant, a backend processor.
- User input is messy raw text extracted from a PDF by PyPDF2.
- Answer in a polite and positive tone.
"""
        },
        {
            "role": "user",
            "content": "Summarize the content:\n" + combined_text
        }
    ]
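
The prompt above asks for a summary. If the goal is to pull out specific entities instead, the same message structure could request a JSON object and the reply could be parsed with the standard library. This is a sketch under assumptions: `build_entity_messages` and `parse_entities` are hypothetical helpers, and the field names are chosen only for illustration.

```python
import json

def build_entity_messages(document_text, fields):
    """Build chat messages asking the model to return the given fields as a JSON object."""
    system_prompt = (
        "You are an assistant, a backend processor.\n"
        "- User input is messy raw text extracted from a PDF by PyPDF2.\n"
        "- Reply with a single JSON object and nothing else.\n"
        "- Use null for any field that the document does not state."
    )
    user_prompt = (
        "Extract these fields: " + ", ".join(fields)
        + "\n\nDocument text:\n" + document_text
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]

def parse_entities(reply_text):
    """Parse the model reply as JSON; return None if it is not a JSON object."""
    try:
        parsed = json.loads(reply_text)
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None

# Hypothetical field names; in the notebook, pass combined_text as the document text
entity_messages = build_entity_messages("(extracted text goes here)", ["agreement_date", "parties"])
```

Models sometimes wrap JSON in extra prose, so `parse_entities` returns `None` rather than raising, leaving the caller to decide whether to retry.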

### 7. Invoke GPT Model

In [6]:
response = client.chat.completions.create(
    model=my_ai_model,  # for Azure OpenAI, this is the deployment name
    messages=messages
)

print(response.choices[0].message.content)

The Agreement dated 15 November 2011, is between the United Kingdom Hydrographic Office (UKHO), on behalf of the Secretary of State for Defence, and Cleowent Harbour Commission (CHC). This non-exclusive Agreement ensures the exchange of hydrographic surveys, data, and related information. 

Key Points:
1. **Purpose and Exchange:**
   - The UKHO and CHC will exchange marine data, surveys, navigational products, and related information.
   - Defined terms and exchange details are included in Appendices.

2. **Usage and Intellectual Property:**
   - UKHO can use and reproduce CHC’s material in its products and services, crediting CHC. 
   - License fees from UKHO’s use of CHC's data will be transferred to CHC annually.

3. **Licensing:**
   - CHC grants UKHO the right to license its intellectual property to third parties.
   - Both parties agree to protect each other's intellectual property rights.

4. **Liability and Implementation:**
   - Liabilities rest with the supplying party unless
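
The printed reply stops mid-sentence. One way to tell whether the model itself was cut off is the `finish_reason` field on each choice, which the chat completions API sets to `"length"` when the token limit was reached. The following is a minimal sketch that uses a stand-in object so it runs without a live call; `was_truncated` is a hypothetical helper.

```python
from types import SimpleNamespace

def was_truncated(response):
    """Return True if the first choice stopped because the token limit was reached."""
    return response.choices[0].finish_reason == "length"

# Stand-in with the same attribute layout as a chat completion response
fake_response = SimpleNamespace(choices=[SimpleNamespace(finish_reason="length")])
print(was_truncated(fake_response))
```

In the notebook, the same function could be called on the real `response`; a result of `False` would suggest the cut-off happened when the output was displayed or exported, not in the model.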

### 8. Print the token usage

In [7]:
# print total token usage
print(response.usage)

CompletionUsage(completion_tokens=362, prompt_tokens=5066, total_tokens=5428)
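
The prompt accounts for most of the tokens here (5066 of 5428), since the whole document is sent in one message. The following is a rough sketch of turning these counts into a cost estimate; `estimate_cost` is a hypothetical helper and the rates are placeholders, not real prices, so substitute the rates for your own deployment.

```python
def estimate_cost(prompt_tokens, completion_tokens, prompt_rate_per_1k, completion_rate_per_1k):
    """Estimate the cost of one call from token counts and per-1,000-token rates."""
    prompt_cost = (prompt_tokens / 1000) * prompt_rate_per_1k
    completion_cost = (completion_tokens / 1000) * completion_rate_per_1k
    return prompt_cost + completion_cost

# Token counts from the usage output above; the rates are placeholders
cost = estimate_cost(5066, 362, prompt_rate_per_1k=1.0, completion_rate_per_1k=2.0)
print(f"Estimated cost: {cost:.3f} (in the currency of the rates)")
```

With a live response, the counts can be read from `response.usage.prompt_tokens` and `response.usage.completion_tokens` instead of being typed in.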
