<a href="https://colab.research.google.com/github/aswinaus/ML/blob/main/summarize_and_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
%pip install azure-ai-formrecognizer openai

In [None]:
# Step 1: Parse document using Azure Document Intelligence
from azure.ai.formrecognizer import DocumentAnalysisClient
from azure.core.credentials import AzureKeyCredential
from openai import OpenAI

# Read the API keys from Colab secrets (add them under the Secrets panel in Colab)
from google.colab import userdata
DOCUMENTINTEL_KEY = userdata.get('DOCUMENTINTEL_KEY')

OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')

import nest_asyncio
nest_asyncio.apply()


from google.colab import drive
drive.mount('/content/drive')

data_dir = '/content/drive/MyDrive' # Input a data dir path from your mounted Google Drive

# Azure Document Intelligence setup
endpoint = "https://documentsclassifier.cognitiveservices.azure.com/"

document_analysis_client = DocumentAnalysisClient(
    endpoint=endpoint,
    credential=AzureKeyCredential(DOCUMENTINTEL_KEY)
)

# Analyze a document
with open(f"{data_dir}/RAG/data/10k/lyft_10k_2023.pdf", "rb") as f:
    # "prebuilt-document" is one of the pre-trained models that Azure Document Intelligence
    # provides (others target invoices, receipts, identity documents, etc.). It is a
    # general-purpose model that extracts text, layout, and other key information
    # from many document types. `document=f` passes the open file handle to analyze.
    poller = document_analysis_client.begin_analyze_document("prebuilt-document", document=f)
result = poller.result()

# Extract the full document text; result.content is the concatenated text of all pages
extracted_text = result.content

# Step 2: Send extracted content to GPT for classification
client = OpenAI(api_key=OPENAI_API_KEY)

prompt = f"""
Given the following document text, analyze its content.
Identify the underlying financial or tax-related theme, such as compliance, reporting, audit, accounting, policy, corporate finance, personal taxation, investment
or regulatory matters.
If there is a subcategory, make sure it is included as comma-separated values in the response.
Based on this analysis, classify the document into the most appropriate financial or tax-related category
that best represents its primary subject matter.
Also return the number of pages.
Include the precise page number where tax or related content occurs in the document.
If it occurs on more than one page, list the pages comma-separated, like 1,2.
Return JSON with fields: category, subcategory, confidence, description, Number of Pages, Page Number.
Document:
{extracted_text}
"""

response = client.chat.completions.create(
    model="gpt-4o",  # or another chat model, e.g. "gpt-4-turbo"
    messages=[{"role": "user", "content": prompt}]
)

print(response.choices[0].message.content)

## Chunking the document

### Subtask:
Divide the extracted text into smaller, manageable chunks. This prevents exceeding the LLM's context window and reduces the number of tokens processed per API call.


**Reasoning**:
Divide the extracted text into smaller chunks based on a suitable chunk size, store the chunks in a list, and keep track of the starting page number for each chunk.
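
The chunking idea can be sketched as a small standalone function. This is a minimal sketch, not the notebook's exact cell: the name `chunk_with_offsets` and its parameters are hypothetical, and instead of an estimated page it records each chunk's starting character offset in the original text, which can later be mapped to a page.

```python
from typing import List, Tuple


def chunk_with_offsets(text: str, max_chars: int = 15000, sep: str = "\n\n") -> List[Tuple[int, str]]:
    """Greedily pack sep-delimited sections into chunks of roughly max_chars.

    Returns (start_offset, chunk_text) pairs, where start_offset is the index
    of the chunk's first character in the original text. A single section
    longer than max_chars is kept whole rather than split.
    """
    chunks: List[Tuple[int, str]] = []
    current = ""
    current_start = 0
    offset = 0  # position of the current section in the original text
    for section in text.split(sep):
        if current and len(current) + len(sep) + len(section) > max_chars:
            # The current chunk is full: close it and start a new one here
            chunks.append((current_start, current))
            current, current_start = section, offset
        elif current:
            current += sep + section
        else:
            current, current_start = section, offset
        offset += len(section) + len(sep)
    if current:
        chunks.append((current_start, current))
    return chunks


sample = "aaaa\n\nbbbb\n\ncccc"
for start, chunk in chunk_with_offsets(sample, max_chars=10):
    print(start, repr(chunk))
```

Because each chunk is an exact slice of the input, `text[start:start + len(chunk)] == chunk` holds for every pair, which makes the offsets safe to use for page lookups.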



In [None]:
# Determine a suitable chunk size (adjust as needed based on LLM token limits)
# A rough estimate of characters per token is 4, so for a 4096 token limit,
# a chunk size around 15000 characters might be a starting point.
# However, splitting by pages or sections is often more effective for maintaining context.
# Let's try splitting by double newlines, which often indicate paragraph breaks or section changes.
# If this doesn't create reasonably sized chunks, we can adjust the strategy.

chunks = []
chunk_page_numbers = []
current_chunk = ""
current_page_number = 1
# Estimate the average number of characters per page from the page count in result.pages
characters_per_page = len(extracted_text) / len(result.pages) if len(result.pages) > 0 else 0

# Split by double newlines to get potential sections
sections = extracted_text.split('\n\n')

# A simple approach to associate chunks with page numbers.
# This assumes a relatively even distribution of text across pages.
# A more accurate approach would involve analyzing the layout information from Document Intelligence result.
char_count = 0
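for section in sections:
    section_length = len(section) + 2  # Add 2 for the removed double newline
    # Advance the estimated page once the running character count passes the current page's share
    if characters_per_page > 0 and char_count + section_length > current_page_number * characters_per_page:
        current_page_number += 1

    # Only close a chunk that has content, so no empty chunk is ever stored
    if current_chunk and len(current_chunk) + section_length > 15000:  # Example chunk size limit
        chunks.append(current_chunk)
        # Record the estimated page at the point where the chunk is closed
        chunk_page_numbers.append(current_page_number)
        current_chunk = section
    else:
        # Avoid a leading separator on the first section of a chunk
        current_chunk = f"{current_chunk}\n\n{section}" if current_chunk else section

    char_count += section_length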

# Add the last chunk
if current_chunk:
    chunks.append(current_chunk)
    chunk_page_numbers.append(current_page_number)

print(f"Number of chunks: {len(chunks)}")
print(f"Starting page numbers for chunks: {chunk_page_numbers}")

Number of chunks: 1
Starting page numbers for chunks: [2]


## Summarization of chunks

### Subtask:
Use the LLM to summarize each chunk. This distills the key information from each section, reducing the overall amount of text that needs to be considered for classification.


**Reasoning**:
Initialize an empty list to store summaries and iterate through the chunks to generate summaries using the LLM.



In [None]:
summaries = []
for i, chunk in enumerate(chunks):
    prompt = f"""
    Summarize the following document chunk, focusing on financial or tax-related information if present.
    If there is no significant financial or tax content, provide a brief general summary of the section.
    Make sure to keep the summary concise.

    Document Chunk:
    {chunk}
    """
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )
    summary = response.choices[0].message.content
    summaries.append(summary)

print(f"Generated {len(summaries)} summaries.")

In [None]:
relevant_chunk_indices = []

for i, summary in enumerate(summaries):
    prompt = f"""
    Does the following summary contain significant financial or tax-related information relevant to classifying the document as financial or tax-related?
    Respond with only 'yes' or 'no'.

    Summary:
    {summary}
    """
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=5 # Keep the response short
    )
    answer = response.choices[0].message.content.strip().lower()

    if "yes" in answer:
        relevant_chunk_indices.append(i)

print(f"Indices of relevant summaries: {relevant_chunk_indices}")

## Final classification with relevant summaries

### Subtask:
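Send only the summaries of the relevant chunks, along with the original prompt, to the LLM for the final classification. This keeps the final prompt far smaller than the entire document and well within the context window. Note that the summarization step has still processed the full document once, so the saving applies to the final call, not to the pipeline as a whole.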


**Reasoning**:
Construct the final prompt using the relevant summaries and call the LLM for classification, then parse the JSON response.
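
Models often wrap JSON in a markdown code fence or surround it with prose, so a bare `json.loads` on the response can fail. The following is a minimal sketch of a more tolerant parser; the helper name `parse_llm_json` is hypothetical, and it only covers the three cases noted in the comments.

```python
import json
import re
from typing import Any, Dict

FENCE = "`" * 3  # a markdown code fence, built here to keep this example readable
FENCE_RE = re.compile(FENCE + r"(?:json)?\s*([\s\S]*?)\s*" + FENCE)


def parse_llm_json(raw: str) -> Dict[str, Any]:
    """Return the first JSON object found in raw, or {} if nothing parses."""
    candidates = []
    # Case 1: JSON inside a fenced block
    match = FENCE_RE.search(raw)
    if match:
        candidates.append(match.group(1))
    # Case 2: the whole response is bare JSON
    candidates.append(raw.strip())
    # Case 3: JSON embedded in prose; take the outermost {...} span
    start, end = raw.find("{"), raw.rfind("}")
    if start != -1 and end > start:
        candidates.append(raw[start:end + 1])
    for candidate in candidates:
        try:
            parsed = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict):
            return parsed
    return {}


print(parse_llm_json('Result: {"category": "Tax", "confidence": 0.9}'))
```

Returning an empty dict on failure keeps later cells runnable, at the cost of hiding the parse error; raising an exception instead may be preferable in a production pipeline.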



In [None]:
# Create a list of relevant summaries
relevant_summaries = [summaries[i] for i in relevant_chunk_indices]

# Join the relevant summaries into a single string, with a separator line between them
summaries_string = "\n---\n".join(relevant_summaries)

# Construct a new prompt for the LLM
final_prompt = f"""
Given the following summaries of relevant sections from a document, analyze their content.
Identify the underlying financial or tax-related theme, such as compliance, reporting, audit, accounting, policy, corporate finance, personal taxation, investment
or regulatory matters.
If there is a subcategory, make sure it is included as comma-separated values in the response.
Based on this analysis, classify the document into the most appropriate financial or tax-related category
that best represents its primary subject matter.
Also return the number of pages, which is {len(result.pages)}.
Include the precise page number where tax or related content occurs in the document.
If it occurs on more than one page, list the pages comma-separated, like 1,2.
Return JSON with fields: category, confidence, description, Number of Pages, Page Number, and subcategory.

Relevant Summaries:
{summaries_string}
"""

# Call the OpenAI API with the new prompt
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": final_prompt}]
)

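# Parse the JSON response from the LLM.
# json.loads raises JSONDecodeError if the model wraps the JSON in a markdown code fence
# or adds surrounding prose, so catch that case instead of failing the cell.
import json
try:
    classification_result = json.loads(response.choices[0].message.content)
    print(json.dumps(classification_result, indent=2))
except json.JSONDecodeError as e:
    print(f"Response is not bare JSON: {e}")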

In [None]:
import re

# Extract the JSON string from the raw LLM response
raw_response_content = response.choices[0].message.content
json_match = re.search(r'```json\s*([\s\S]*?)\s*```', raw_response_content)

# Use the fenced block if present; otherwise try the raw response as-is
json_string = json_match.group(1) if json_match else raw_response_content

classification_result = {}
try:
    # Parse the extracted JSON string
    classification_result = json.loads(json_string)
    print("\nParsed JSON result:")
    print(json.dumps(classification_result, indent=2))
except json.JSONDecodeError as e:
    # Leave classification_result empty so later cells can still add fields to it
    print(f"\nFailed to decode JSON from the LLM response: {e}")



## Extracting page numbers

### Subtask:
Because the LLM only sees summaries of chunks in the final step, it cannot report reliable page numbers itself. Instead, use the page numbers recorded during chunking for the chunks that were identified as relevant.


**Reasoning**:
Look up the page number recorded in `chunk_page_numbers` for each relevant chunk index, join the results into a comma-separated string, and store that string in `classification_result` under `'Page Number'`.
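
The page estimate used earlier assumes text is spread evenly across pages. A more precise lookup can be sketched from the layout data, assuming each page in `result.pages` exposes `page_number` and a list of `spans` with an `offset` into `result.content`, and that those offsets line up with Python string indices. The helper names below are hypothetical, and the `SimpleNamespace` objects are stand-ins for the real pages.

```python
from bisect import bisect_right
from types import SimpleNamespace
from typing import List, Sequence, Tuple


def build_page_index(pages: Sequence) -> List[Tuple[int, int]]:
    """Return sorted (start_offset, page_number) pairs, one per page span."""
    return sorted(
        (span.offset, page.page_number)
        for page in pages
        for span in page.spans
    )


def page_for_offset(index: List[Tuple[int, int]], offset: int) -> int:
    """Return the page whose span starts at or before the given character offset."""
    if not index:
        raise ValueError("page index is empty")
    starts = [start for start, _ in index]
    pos = bisect_right(starts, offset) - 1
    return index[max(pos, 0)][1]


# Stand-in objects that mimic the assumed shape of result.pages
pages = [
    SimpleNamespace(page_number=1, spans=[SimpleNamespace(offset=0, length=100)]),
    SimpleNamespace(page_number=2, spans=[SimpleNamespace(offset=100, length=50)]),
]
index = build_page_index(pages)
print(page_for_offset(index, 0), page_for_offset(index, 120))
```

With a chunk's starting character offset in hand, `page_for_offset` gives the page on which that chunk begins, without relying on an average characters-per-page figure.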



In [None]:
# Look up the recorded page for each relevant chunk, dropping duplicates while keeping order
relevant_page_numbers = list(dict.fromkeys(chunk_page_numbers[i] for i in relevant_chunk_indices))

page_numbers_str = ",".join(map(str, relevant_page_numbers))

classification_result['Page Number'] = page_numbers_str

print(classification_result)

## Summary:

### Data Analysis Key Findings

*   The document text was divided into smaller chunks based on double newlines and a maximum size limit, with each chunk associated with a page number estimated from the average number of characters per page.
*   An LLM (gpt-4o) was used to generate concise summaries for each chunk, prioritizing financial or tax-related content.
*   The LLM was then used to identify which of these summaries contained significant financial or tax-related information relevant to classification.
*   The final classification was performed by sending only the relevant summaries to the LLM, which keeps the final prompt much smaller than the full document text.
*   The LLM's response, which included the classification result in JSON format, was extracted from a markdown code block within the raw output.
*   The estimated page numbers corresponding to the identified relevant chunks were added to the final classification result.

### Insights or Next Steps

*   Consider refining the chunking strategy to be more robust, potentially using Document Intelligence layout information for more accurate page or section breaks, especially for complex document structures.
*   Explore alternative methods for identifying relevant chunks that might be faster or less token-intensive than using the LLM for a simple yes/no decision on each summary.
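
One cheaper alternative to the per-summary yes/no call is a keyword pre-filter, sketched below. The term list and the `min_hits` threshold are illustrative assumptions, not values taken from this notebook, and a filter like this may miss relevant text that uses different vocabulary.

```python
import re
from typing import Iterable, List

# Hypothetical keyword stems; tune these for the documents at hand
FINANCE_TERMS = ("tax", "revenue", "audit", "income", "deferred", "liabilit", "compliance")
FINANCE_RE = re.compile(r"\b(?:" + "|".join(FINANCE_TERMS) + r")\w*", re.IGNORECASE)


def keyword_relevant_indices(texts: Iterable[str], min_hits: int = 2) -> List[int]:
    """Return indices of texts containing at least min_hits finance-related terms."""
    return [
        i for i, text in enumerate(texts)
        if len(FINANCE_RE.findall(text)) >= min_hits
    ]


samples = [
    "Income tax expense rose; deferred tax liabilities grew.",
    "The company opened a new office in Austin.",
    "Taxes",
]
print(keyword_relevant_indices(samples))
```

The filter could run on the raw chunks before summarization, so that only chunks passing it are sent to the LLM at all.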
