# PDF Document Analysis with Azure Document Intelligence

## Overview
This notebook demonstrates how to use Azure's Document Intelligence API to extract information from a diverse set of PDF documents (e.g. reports, articles, slide decks, screen shots, sanned documents with handwritten notes). We analyze document content for test, figures, tables and handwritten content, and organize results in a structured format. The goal is to simplify PDF data extraction and prepare the data to process them later in a RAG pipeline based on different business use-cases.

### Requirements
- **Python 3.11.9**
- The `requirements.txt` file in this repo lists all dependencies.

### Data Sources
The PDF files analyzed in this notebook are located in the `data/sample_docs/raw` directory. Ensure these files are available locally or update the file paths accordingly.

### Outline
[1. Setup and data preparation](#1)

[2. Processing PDF docs with Azure Document Intelligence](#2)

[3. Extract information from Azure Document Intelligence output](#3)

[4. Draw bounding boxes around detected objects into PDFs](#4)

[5. Replacing figures and tables by their description generated by GPT 4.0](#5)

[6. Implementing a multi modal RAG pipeline](#6)


### Usage
- Ensure your Azure API credentials are set up in a `.env` file in the root directory.
- Run each cell sequentially to ensure dependencies and data are loaded correctly.


----

## 1. Setup and data preparation <a class="anchor" id="1"></a>

In [4]:
# Import standard libraries
import os  # For interacting with the operating system (file paths, environment variables)
from typing import List, Dict  # For specifying data types (e.g., lists of strings)
import pickle  # For serializing Python objects to disk
import pandas as pd  # For processing tabular data as dataframes
from dotenv import load_dotenv  # For loading environment variables from a .env fil
import itertools  # For iterating over data structures
from collections import defaultdict  # For creating dictionaries with default values
import numpy as np  # For numerical operations


# Import the base64 library for encoding binary data
import base64

# Libraries for PDF processing
import fitz  # PyMuPDF library, used for chunking PDFs
from io import BytesIO  # For converting between file-like objects and bytes    

#Semantic Search 
from sentence_transformers import SentenceTransformer, util
import torch 

# Import libraries to use MS Azure services
import requests # For making HTTP requests to Azure services
import time # For pausing the program execution

## Azure Document Intelligence libraries
from azure.ai.documentintelligence import DocumentIntelligenceClient  # Main client for Azure Document Intelligence
from azure.core.credentials import AzureKeyCredential  # For Azure API credential handling
from azure.ai.documentintelligence.models import AnalyzeDocumentRequest, ContentFormat, AnalyzeResult  # Models and format types for analyzing documents

from openai import AzureOpenAI  # Custom class for interacting with Azure OpenAI services

In [11]:
# File paths for test documents used in PDF analysis
# These are sample PDF files stored locally, representing various document types:
# - Annual financial reports
# - Handwritten mock documents
# - Financial presentations
# - Regulatory documents

file_paths = {
    "deutschebank_annual_report_2023": '/Users/ann-kathrin/repositories/PDFDataToolkit/data/sample_docs/raw/Annual-Report-2023.pdf',
    "deutschebank_annual_report_2022": '/Users/ann-kathrin/repositories/PDFDataToolkit/data/sample_docs/raw/Annual-Report-2022.pdf',
    "deutschebank_handwritten_mock": '/Users/ann-kathrin/repositories/PDFDataToolkit/data/sample_docs/raw/Mock_handwritten.pdf',
    "deutschebank_trade_finance_article": '/Users/ann-kathrin/repositories/PDFDataToolkit/data/sample_docs/raw/IFC and Deutsche Bank partner to boost trade finance in Africa – Corporates and Institutions.pdf',
    "deutschebank_quarterly_presentation_q2_2024": '/Users/ann-kathrin/repositories/PDFDataToolkit/data/sample_docs/raw/Deutsche-Bank-Q2-2024-Presentation.pdf',
    "capital_requirements_document": '/Users/ann-kathrin/repositories/PDFDataToolkit/data/sample_docs/raw/capitalrequirementsregulations.pdf',
    "monzo_annual report_2024": "/Users/ann-kathrin/repositories/PDFDataToolkit/data/sample_docs/raw/monzo-annual-report-2024.pdf",
    "Mock-handwritten": "/Users/ann-kathrin/repositories/PDFDataToolkit/data/sample_docs/raw/Mock_handwritten.pdf"
}

In [9]:
#Chunking the documents for testing purposes in smaller sub documents to process them with Azure Doc Intelligence without generating costs
def chunk_document(file_path, page_selection):
    """
    Extracts certain pages from a PDF without changing the quality and saves the new document
    in the same folder as the original document with the filename of the original document
    appended with "selected_pages".
    
    :param file_path: The path to the original PDF document.
    :param page_selection: A list of page numbers to extract.
    """
    # Open the document
    input_doc = fitz.open(file_path)
    
    # Create a new PDF document
    output_doc = fitz.open()
    
    # Extract the selected pages
    for num in page_selection:
        output_doc.insert_pdf(input_doc, from_page=num, to_page=num)
    
    # Generate the new filename
    base_name = os.path.basename(file_path)
    name, ext = os.path.splitext(base_name)
    page_suffix = "_".join(map(str, page_selection))
    new_filename = f"{name}_{page_suffix}{ext}"
    new_filefolder=os.path.dirname(file_path).replace('/raw', '/chunked')
    new_file_path = os.path.join(new_filefolder, new_filename)
    
    # Save the new document
    output_doc.save(new_file_path)
    output_doc.close()
    input_doc.close()

    return new_file_path

In [13]:
# Directory for saving the chunked documents
#chunked_dir = '/Users/ann-kathrin/repositories/PDFDataToolkit/data/sample_docs/chunked/'

chunk_pages = {
    "deutschebank_annual_report_2023": [[8, 9, 15, 120], [9, 15, 120], [120],
                                        [420]#factsheet with only graphs
                                        ],
    "deutschebank_annual_report_2022": [[8, 9, 15, 119], [9, 15, 119], [118]],
    "deutschebank_quarterly_presentation_q2_2024": [[8, 9, 17],
                                                    [1,21],#flashy infos
                                                    [10,22],#bar charts
                                                    [38]#line graph
                                                    ],
    "capital_requirements_document": [[266]],
    "monzo_annual report_2024": [
        [55,64], #diagrams
        [16, 26], #flashy infos and pictograms
        [34,61], # donut charts
        [46], #flowchart
        [63,78] #pictures
        ],
    "Mock-handwritten": [[0]]
}

# Initialize a dictionary to hold the chunked file paths
chunked_file_paths = {}

# Process each document and its associated chunks, print paths for verification
for doc_name, page_chunks in chunk_pages.items():
    # Create a list to store paths for the current document
    for pages in page_chunks:
        chunked_file_path = chunk_document(file_paths[doc_name], pages)
        chunked_file_paths[f'{doc_name}_{"_".join(map(str, pages))}'] = chunked_file_path
        print(f"Created chunked document at: {chunked_file_path}")

Created chunked document at: /Users/ann-kathrin/repositories/PDFDataToolkit/data/sample_docs/chunked/Annual-Report-2023_8_9_15_120.pdf
Created chunked document at: /Users/ann-kathrin/repositories/PDFDataToolkit/data/sample_docs/chunked/Annual-Report-2023_9_15_120.pdf
Created chunked document at: /Users/ann-kathrin/repositories/PDFDataToolkit/data/sample_docs/chunked/Annual-Report-2023_120.pdf
Created chunked document at: /Users/ann-kathrin/repositories/PDFDataToolkit/data/sample_docs/chunked/Annual-Report-2023_420.pdf
Created chunked document at: /Users/ann-kathrin/repositories/PDFDataToolkit/data/sample_docs/chunked/Annual-Report-2022_8_9_15_119.pdf
Created chunked document at: /Users/ann-kathrin/repositories/PDFDataToolkit/data/sample_docs/chunked/Annual-Report-2022_9_15_119.pdf
Created chunked document at: /Users/ann-kathrin/repositories/PDFDataToolkit/data/sample_docs/chunked/Annual-Report-2022_118.pdf
Created chunked document at: /Users/ann-kathrin/repositories/PDFDataToolkit/data

## 2. Processing PDF docs with Azure Document Intelligence <a class="anchor" id="2"></a>
Process whole PDFs or only selected pages in PDFs with Azure Document Intelligence

In [14]:
def get_document_intelligence_client(endpoint: str, key: str) -> DocumentIntelligenceClient:
    """
    Initialize and return a DocumentIntelligenceClient.
    
    :param endpoint: The endpoint of the Azure Document Intelligence service.
    :param key: The API key for the Azure Document Intelligence service.
    :return: An instance of DocumentIntelligenceClient.
    """
    return DocumentIntelligenceClient(endpoint=endpoint, credential=AzureKeyCredential(key))

#process entire documents with Azure client
def analyze_document(file_path: str, client: DocumentIntelligenceClient) -> dict:
    """
    Analyze a document using Azure Document Intelligence.
    
    :param file_path: The path to the document file.
    :param client: An instance of DocumentIntelligenceClient.
    :return: The result of the document analysis.
    """
    # Analyze the document using Azure Document Intelligence
    
    with open(file_path, "rb") as f:
        poller = client.begin_analyze_document(
            model_id="prebuilt-layout", analyze_request=f, content_type="application/pdf", output_content_format=ContentFormat.MARKDOWN)
        result = poller.result()
    
    return result

In [15]:
# Load credentials from .env file -> API key for smmaller docs
load_dotenv()

AZURE_API_KEY = os.getenv("AZURE_API_KEY")
AZURE_ENDPOINT = os.getenv("AZURE_ENDPOINT")

# Create a client object
document_intelligence_client = get_document_intelligence_client(AZURE_ENDPOINT, AZURE_API_KEY)

In [2293]:
# Load credentials from .env file -> API key for larger docs
load_dotenv()

AZURE_DOCINTEL_LARGE_API_KEY = os.getenv("AZURE_DOCINTEL_LARGE_API_KEY")
AZURE_DOCINTEL_LARGE_ENDPOINT = os.getenv("AZURE_DOCINTEL_LARGE_ENDPOINT")

# Create a client object
document_intelligence_client_large = get_document_intelligence_client(AZURE_DOCINTEL_LARGE_ENDPOINT, AZURE_DOCINTEL_LARGE_API_KEY)

In [16]:
#Apply the function to the selected pages
chunked_file_paths = {}

# Process each document and its associated chunks, print paths for verification
for doc_name, page_chunks in chunk_pages.items():
    # Create a list to store paths for the current document
    for pages in page_chunks:
        chunked_file_path = chunk_document(file_paths[doc_name], pages)
        chunked_file_paths[f'{doc_name}_{"_".join(map(str, pages))}'] = chunked_file_path
        print(f"Created chunked document at: {chunked_file_path}")

Created chunked document at: /Users/ann-kathrin/repositories/PDFDataToolkit/data/sample_docs/chunked/Annual-Report-2023_8_9_15_120.pdf
Created chunked document at: /Users/ann-kathrin/repositories/PDFDataToolkit/data/sample_docs/chunked/Annual-Report-2023_9_15_120.pdf
Created chunked document at: /Users/ann-kathrin/repositories/PDFDataToolkit/data/sample_docs/chunked/Annual-Report-2023_120.pdf
Created chunked document at: /Users/ann-kathrin/repositories/PDFDataToolkit/data/sample_docs/chunked/Annual-Report-2023_420.pdf
Created chunked document at: /Users/ann-kathrin/repositories/PDFDataToolkit/data/sample_docs/chunked/Annual-Report-2022_8_9_15_119.pdf
Created chunked document at: /Users/ann-kathrin/repositories/PDFDataToolkit/data/sample_docs/chunked/Annual-Report-2022_9_15_119.pdf
Created chunked document at: /Users/ann-kathrin/repositories/PDFDataToolkit/data/sample_docs/chunked/Annual-Report-2022_118.pdf
Created chunked document at: /Users/ann-kathrin/repositories/PDFDataToolkit/data

In [17]:
chunked_file_paths

{'deutschebank_annual_report_2023_8_9_15_120': '/Users/ann-kathrin/repositories/PDFDataToolkit/data/sample_docs/chunked/Annual-Report-2023_8_9_15_120.pdf',
 'deutschebank_annual_report_2023_9_15_120': '/Users/ann-kathrin/repositories/PDFDataToolkit/data/sample_docs/chunked/Annual-Report-2023_9_15_120.pdf',
 'deutschebank_annual_report_2023_120': '/Users/ann-kathrin/repositories/PDFDataToolkit/data/sample_docs/chunked/Annual-Report-2023_120.pdf',
 'deutschebank_annual_report_2023_420': '/Users/ann-kathrin/repositories/PDFDataToolkit/data/sample_docs/chunked/Annual-Report-2023_420.pdf',
 'deutschebank_annual_report_2022_8_9_15_119': '/Users/ann-kathrin/repositories/PDFDataToolkit/data/sample_docs/chunked/Annual-Report-2022_8_9_15_119.pdf',
 'deutschebank_annual_report_2022_9_15_119': '/Users/ann-kathrin/repositories/PDFDataToolkit/data/sample_docs/chunked/Annual-Report-2022_9_15_119.pdf',
 'deutschebank_annual_report_2022_118': '/Users/ann-kathrin/repositories/PDFDataToolkit/data/sample_

In [18]:
# Process each chunked file and store results in a structured dictionary with analyse_document
processed_results = {
    doc_name: analyze_document(path, document_intelligence_client)
    for doc_name, path in chunked_file_paths.items()
}
processed_results

{'deutschebank_annual_report_2023_8_9_15_120': {'apiVersion': '2024-07-31-preview', 'modelId': 'prebuilt-layout', 'stringIndexType': 'textElements', 'content': '<!-- PageHeader="Letter from the Chairman of the Supervisory Board" -->\n<!-- PageHeader="Deutsche Bank Annual Report 2023" -->\n\n\n# Letter from the Chairman of the Supervisory Board\n\nDear Shareholders,\n\nLast year, Deutsche Bank proved its resilience in a volatile environment, further increasing revenues and delivering its best\npre-tax profit in 16 years. Thanks to its prudent risk management, the bank has a strong and stable balance sheet and has\nfurther strengthened its capital base. As a result, we are this year once again in a position to substantially increase capital\ndistributions to shareholders and we are pleased that the Supervisory Board and Management Board will propose to you a\n50% increase in the dividend to € 0.45 per share at this year\'s Annual General Meeting.\n\nIn this economic environment, the resi

In [2294]:
#Process one whole document and store results in a structured dictionary with analyse_document
processed_monzo = {
    'monzo_annual report_2024': analyze_document(file_paths['monzo_annual report_2024'], document_intelligence_client_large)
}

In [2197]:
# Save the extracted information to a pickle file to load it in a seperate notebook and not have to run the Azure API again
output_dir = '/Users/ann-kathrin/repositories/PDFDataToolkit/data/output_azuredocintelligence/dict_raw'
for doc_name, result in processed_results.items():
    output_filepath = os.path.join(output_dir, doc_name + "_dictpages.pkl")
    with open(output_filepath, 'wb') as f:
        pickle.dump(result, f)

In [2295]:
output_filepath_monzo= os.path.join(output_dir, 'monzo_annual report_2024' + "_dictpages.pkl")
with open(output_filepath_monzo, 'wb') as f:
    pickle.dump(processed_monzo, f)

## 3. Extract information from Azure Doc Intelligence output <a class="anchor" id="3"></a>

#### Helping functions

In [19]:
#Extract headers from tables
def extract_header_info(table):
    headers = {}
    multi_level = False
    
    for cell in table.cells:
        if cell.kind == 'columnHeader':
            row_span = cell.row_span if cell.row_span is not None else 1
            column_span = cell.column_span if cell.column_span is not None else 1

            if row_span > 1 or column_span > 1:
                multi_level = True
            
            # Use (rowIndex, columnIndex) as keys to map header contents
            key = (cell.row_index, cell.column_index)
            if key not in headers:
                headers[key] = cell.content

    # Flatten headers
    flattened_headers = []
    for row_index in range(table.row_count):
        row_headers = []
        for col_index in range(table.column_count):
            if (row_index, col_index) in headers:
                row_headers.append(headers[(row_index, col_index)])
            else:
                row_headers.append('')
        flattened_headers.append(row_headers)
    
    # Combine header rows for flattened headers
    flattened_combined = []
    for col_index in range(table.column_count):
        combined_header = '-'.join(filter(None, [flattened_headers[row_index][col_index] for row_index in range(table.row_count)]))
        flattened_combined.append(combined_header)


    return {
        "multilevel": multi_level,
        "flattened_list": flattened_combined
    }

In [20]:
#Extract tables in raw format
def extract_table_raw(table):
    for cell in table.cells:
        raw_data={
            "kind": cell.kind,
            "rowIndex": cell.row_index,
            "columnIndex": cell.column_index,
            "rowSpan": cell.row_span if hasattr(cell, 'row_span') else 1,
            "columnSpan": cell.column_span if hasattr(cell, 'column_span') else 1,
            "content": cell.content
        }
    return raw_data

In [21]:
#Transform raw table data into flat table format (list of lists), no joined columns and rows, format could be used to store data in database 
def to_flat_table(table):
    # Initialize empty matrix for the table
    data = [['' for _ in range(table.column_count)] for _ in range(table.row_count)]
    
    # Fill the table with content
    for cell in table.cells:
        row_span = cell.row_span if cell.row_span is not None else 1
        column_span = cell.column_span if cell.column_span is not None else 1

        # Fill in the data for each cell, respecting row_span and column_span
        for r in range(cell.row_index, cell.row_index + row_span):
            for c in range(cell.column_index, cell.column_index + column_span):
                data[r][c] = cell.content
    return data

In [22]:
#Transform raw table data into multi-level table format (pandas DataFrame)
def to_multilevel_table(table):
    # Extract header info
    header_info = extract_header_info(table)
    
    # Prepare multi-level column headers
    multi_level_columns = pd.MultiIndex.from_tuples(
        [(header_info['flattened_list'][col_index],) for col_index in range(table.column_count)]
    )
    
    # Prepare content data
    content_data = [['' for _ in range(table.column_count)] for _ in range(table.row_count)]
    
    for cell in table.cells:
        row_span = cell.row_span if cell.row_span is not None else 1
        column_span = cell.column_span if cell.column_span is not None else 1

    # Fill in the data for each cell, respecting row_span and column_span
        for r in range(cell.row_index, cell.row_index + row_span):
            for c in range(cell.column_index, cell.column_index + column_span):
                content_data[r][c] = cell.content
    
    # Create multi-level DataFrame
    df_multilevel = pd.DataFrame(content_data, columns=multi_level_columns)
    
    return df_multilevel

In [23]:
def is_point_in_polygon(point, polygon):
    """Check if a point (x, y) is inside the polygon defined by the list of points."""
    x, y = point
    n = len(polygon)
    inside = False
    
    p1x, p1y = polygon[0]
    for i in range(n + 1):
        p2x, p2y = polygon[i % n]
        if y > min(p1y, p2y):
            if y <= max(p1y, p2y):
                if x <= max(p1x, p2x):
                    if p1y != p2y:
                        xinters = (y - p1y) * (p2x - p1x) / (p2y - p1y) + p1x
                    if p1x == p2x or x <= xinters:
                        inside = not inside
        p1x, p1y = p2x, p2y
    
    return inside

def is_polygon_within(bounding_polygon1, bounding_polygon2):
    """Check if bounding_polygon1 lies within bounding_polygon2."""
    # Convert flat lists to list of points
    polygon1 = [(bounding_polygon1[i], bounding_polygon1[i + 1]) for i in range(0, len(bounding_polygon1), 2)]
    polygon2 = [(bounding_polygon2[i], bounding_polygon2[i + 1]) for i in range(0, len(bounding_polygon2), 2)]

    # Check if all points of polygon1 are within polygon2
    for point in polygon1:
        if not is_point_in_polygon(point, polygon2):
            return False
    return True

In [24]:
#Helping function to determine if a bbox is within another bbox
def bbox_within(bbox1, bbox2):
        """
        Checks if bbox1 is inside bbox2.
        Assumes both are lists of 4 or more points [x1, y1, x2, y2, ..., xn, yn].
        """
        min_x1, min_y1 = min(bbox1[::2]), min(bbox1[1::2])
        max_x1, max_y1 = max(bbox1[::2]), max(bbox1[1::2])
        
        min_x2, min_y2 = min(bbox2[::2]), min(bbox2[1::2])
        max_x2, max_y2 = max(bbox2[::2]), max(bbox2[1::2])
        
        # Check if bbox1 is within bbox2
        return min_x2 <= min_x1 <= max_x2 and min_y2 <= min_y1 <= max_y2 and min_x2 <= max_x1 <= max_x2 and min_y2 <= max_y1 <= max_y2

In [25]:
def extract_integer_from_string(s):
    """Extracts the integer at the end of a string in the format '\\paragraphs\\xxx'."""
    # Find the last backslash in the string and extract the part that follows it
    integer_part = s.split('/')[-1]
    
    # Check if the extracted part is a digit (an integer)
    if integer_part.isdigit():
        return int(integer_part)
    else:
        raise ValueError("The string does not end with an integer.")

#### Dictionary structured by page number and object (paragraph, figure, table, handwritten)
Reorganise output of Azure Doc Intelligence in new dictionary

In [31]:
"""
Structure of dictionary:
"extraction_info": {
    "apiVersion"
    "modelID"
        },
"meta_info_file": {
    "source"
    "stringIndexType"
    "pagesTotal"
    "paragraphsTotal"
    "handwritten"
    "tablesTotal"
    "figuresTotal"
    },
"pages": [
    "page_number"
    "meta_page": {
        "width"
        "height"
        "unit"
    },
    "structure_page": {
        "footnote"
        "page_header"
        "page_footer"
        "page_title"
        "section_headings"
    },
    "content_page": {
        "handwritten": []
        "paragraphs": []
        "tables": []
        "figures": [] 
    }
]
"""

'\nStructure of dictionary:\n"extraction_info": {\n    "apiVersion"\n    "modelID"\n        },\n"meta_info_file": {\n    "source"\n    "stringIndexType"\n    "pagesTotal"\n    "paragraphsTotal"\n    "handwritten"\n    "tablesTotal"\n    "figuresTotal"\n    },\n"pages": [\n    "page_number"\n    "meta_page": {\n        "width"\n        "height"\n        "unit"\n    },\n    "structure_page": {\n        "footnote"\n        "page_header"\n        "page_footer"\n        "page_title"\n        "section_headings"\n    },\n    "content_page": {\n        "handwritten": []\n        "paragraphs": []\n        "tables": []\n        "figures": [] \n    }\n]\n'

In [32]:
# Main function to generate a structured dictionary from the Azure Doc Intelligence output
def extract_information(result: AnalyzeResult) -> dict:
    """
    Extracts information from the analysis result and organizes it into a structured dictionary.
    
    :param result: The analysis result containing document data.
    :param filepath: The path to the document file.
    :return: A dictionary containing the extracted information.
    """
    # Initialize result dictionary with the meta information for the file
    document_info = {
        "extraction_info": {
            "apiVersion": result.api_version,
            "modelID": result.model_id
        },
        "meta_info_file": {
            "stringIndexType": result.string_index_type, # The type of string indexing used, e.g. Unicode code points
            "pagesTotal": len(result.pages), # Total number of pages in the document
            "paragraphsTotal": len(result.paragraphs) if result.paragraphs is not None else 0, # Total number of paragraphs in the document
            "handwritten": any(style.is_handwritten for style in result.styles) if result.styles else False, # Check if any style is handwritten 
            "tablesTotal": len(result.tables) if result.tables is not None else 0, # Total number of tables in the document
            "figuresTotal": len(result.figures) if result.figures is not None else 0 # Total number of figures in the document
        },
        "pages": []
    }

    # Loop through each page and extract the necessary information
    for page in result.pages:
        page_info = {
            "page_number": page.page_number,
            "meta_page": {
                "width": page.width, # Page width
                "height": page.height, # Page height
                "unit": page['unit'] # Page unit in which width and height are measured
            },
            "structure_page": {
                "footnote": [], # Unique footnote content, e.g. citation
                "page_header": [], # Unique page header content, should be the same on every page
                "page_footer": [], # Unique page footer content, should be the same on every page
                "page_title": [], # Page titles extracted from the page
                "section_headings": [] # Page section headings extracted from the page
                },
            "content_page": {
                "handwritten": [], # Handwritten content extracted from the page
                "paragraphs": [], # Paragraph content extracted from the page
                "tables": [], # Table content extracted from the page
                "figures": [] # Figure content extracted from the page
                }}

        # Extract structure information per page 
        if result.paragraphs:
            for paragraph in result.paragraphs:
                if paragraph['boundingRegions'][0]['pageNumber'] == page.page_number:
                    role = paragraph.get('role')
                    if role == 'footnote':
                        page_info["structure_page"]["footnote"].append(paragraph['content'])
                    elif role == 'pageHeader':
                        page_info["structure_page"]["page_header"].append(paragraph['content'])
                    elif role == 'pageFooter':
                        page_info["structure_page"]["page_footer"].append(paragraph['content'])
                    elif role == 'title':
                        page_info["structure_page"]["page_title"].append(paragraph['content'])
                    elif role == 'sectionHeading':
                        page_info["structure_page"]["section_headings"].append(paragraph['content'])

        # Extract handwritten content based on styles
                    for paragraph_span in paragraph['spans']:
                        # Check styles for handwritten content
                        for style in result.get('styles', []):
                            if style.get('isHandwritten'):
                                for style_span in style.get('spans', []):
                                    
                                    # Compare offsets to detect handwritten content
                                    if style_span['offset'] == paragraph_span['offset']:
                                        
                                        # Extract necessary information
                                            handwritten_info={
                                            "reading_order": result.paragraphs.index(paragraph),
                                            "confidence": style.get('confidence'),
                                            "offset": style_span.get('offset'),
                                            "span_length": style_span.get('length'),
                                            "raw_text": paragraph['content'], #raw text extracted from associated paragraph with the same offset and span
                                            "bounding_box": paragraph['boundingRegions'][0]['polygon'] #region extracted from associated paragraph with the same offset and span
                                            }

                                            page_info["content_page"]["handwritten"].append(handwritten_info)
                                        

                # Append page information to the main document dictionary
            
                #document_info["pages"].append(page_info)
       
        # Extract text paragraph information
        
                    if role is None or role == 'title' or role == 'sectionHeading': # Check if the paragraph has no role, i.e. regular text
                        if result.tables:
                            tables_bboxes=[table.bounding_regions[0].polygon for table in result.tables]
                        else: tables_bboxes=[]
                        
                        if result.figures:
                            figures_bboxes=[figure.bounding_regions[0].polygon for figure in result.figures]
                        else: figures_bboxes=[]
                            
                        if not any(bbox_within(paragraph['boundingRegions'][0]['polygon'], table_bbox) for table_bbox in tables_bboxes) and not any(bbox_within(paragraph['boundingRegions'][0]['polygon'], figure_bbox) for figure_bbox in figures_bboxes): 
                            paragraph_info = {
                            "role": role,
                            "reading_order": result.paragraphs.index(paragraph)+1, #independent of the page number
                            "raw_text": paragraph['content'],
                            "bounding_box": paragraph['boundingRegions'][0]['polygon']
                        }
            
                            page_info["content_page"]["paragraphs"].append(paragraph_info)

            # Append page information to the main document dictionary
            #document_info["pages"].append(page_info)

        #Extract tables from document
        if result.tables:
            for table_idx, table in enumerate(result.tables):

                # Ensure the table has bounding regions and page numbers to compare
                if table.bounding_regions and table.bounding_regions[0].page_number == page.page_number:
                    # Extract the bounding polygon (bounding box) of the table
                    bounding_box = table.bounding_regions[0].polygon

                    # Calculate table dimensions (manually if columnCount/rowCount are not available)
                    row_count = max(
                        (cell.row_index + (cell.row_span if cell.row_span is not None else 1)) 
                        for cell in table.cells
                    ) if table.cells else 0
                    
                    column_count = max(
                        (cell.column_index + (cell.column_span if cell.column_span is not None else 1)) 
                        for cell in table.cells
                    ) if table.cells else 0

                    # 1. Table Info
                    table_info = {
                    "reading_order": min([extract_integer_from_string(cell.elements[0]) for cell in table.cells if cell.elements is not None]),
                    #"reading_order": min([result.paragraphs.index(paragraph) for paragraph in result.paragraphs if bbox_within(paragraph['boundingRegions'][0]['polygon'], bounding_box) and paragraph['boundingRegions'][0]['pageNumber'] == page.page_number], default=None)+1,
                    "table_no": table_idx,
                    "bounding_box": bounding_box,
                    "number_of_rows": row_count,
                    "number_of_columns": column_count,
                    "header_info": extract_header_info(table),
                    "table_raw":extract_table_raw(table),
                    "table_flat":to_flat_table(table),
                    "table_multilevel":to_multilevel_table(table)
                    }
                    page_info["content_page"]["tables"].append(table_info)


            
            #document_info["pages"].append(page_info)

        
        if result.figures:
            for figure_idx, figure in enumerate(result.figures):
                if figure.bounding_regions[0].page_number == page.page_number:
                    figure_info = {
                        #"reading_order": [result.paragraphs.index(paragraph)+1 for paragraph in result.paragraphs if bbox_within(paragraph['boundingRegions'][0]['polygon'], figure.bounding_regions[0].polygon) and paragraph['boundingRegions'][0]['pageNumber'] == figure.bounding_regions[0].page_number],
                        #"reading_order": min([(result.paragraphs.index(paragraph)) for paragraph in result.paragraphs if region.page_number == paragraph['boundingRegions'][0]['pageNumber']])+1,
                        "figure_no": figure.get('id'),
                        "caption": figure.get('caption',{}).get('content'),
                        "span": figure.spans if hasattr(figure, 'spans') else None,
                        "bounding_box": figure.bounding_regions[0].polygon,
                        "text_raw":[paragraph['content'] for paragraph in result.paragraphs if bbox_within(paragraph['boundingRegions'][0]['polygon'],figure.bounding_regions[0].polygon) and paragraph['boundingRegions'][0]['pageNumber'] == page.page_number]
                    }
                    if figure.get('elements') or figure.get('caption', {}).get('elements'): 
                        list1=figure.get('elements') if figure.get('elements') else []
                        list2= figure.get('caption', {}).get('elements') if figure.get('caption', {}).get('elements') else []
                        figure_info['reading_order']=min([extract_integer_from_string(element) for element in list1+list2])+1
                    
                        page_info["content_page"]["figures"].append(figure_info)

        document_info["pages"].append(page_info)


    return document_info

In [33]:
# Process each chunked file and store results in a structured dictionary with extract_information
result_dicts = {}
for doc_name in processed_results.keys():
    result_dicts[doc_name] = extract_information(processed_results[doc_name])

In [2304]:
#Process full document 
result_monzo = {}
for doc_name in processed_monzo.keys():
    result_monzo[doc_name] = extract_information(processed_monzo[doc_name])

In [2305]:
result_monzo

{'monzo_annual report_2024': {'extraction_info': {'apiVersion': '2024-07-31-preview',
   'modelID': 'prebuilt-layout'},
  'meta_info_file': {'stringIndexType': <StringIndexType.TEXT_ELEMENTS: 'textElements'>,
   'pagesTotal': 166,
   'paragraphsTotal': 7402,
   'handwritten': True,
   'tablesTotal': 103,
   'figuresTotal': 101},
  'pages': [{'page_number': 1,
    'meta_page': {'width': 11.6806, 'height': 8.2639, 'unit': 'inch'},
    'structure_page': {'footnote': [],
     'page_header': [],
     'page_footer': ['Monzo Bank Holding Group Limited Annual Report 2024'],
     'page_title': [],
     'section_headings': []},
    'content_page': {'handwritten': [],
     'paragraphs': [],
     'tables': [],
     'figures': [{'figure_no': '1.1',
       'caption': None,
       'span': [{'offset': 0, 'length': 26}],
       'bounding_box': [0.3018,
        0.3047,
        2.5447,
        0.3049,
        2.5449,
        0.7733,
        0.3019,
        0.7731],
       'text_raw': [],
       'reading_

In [29]:
# Save the extracted information to a pickle file to load it in a seperate notebook
output_dir = '/Users/ann-kathrin/repositories/PDFDataToolkit/data/output_azuredocintelligence/dict_objects'
for doc_name, result in result_dicts.items():
    output_filepath = os.path.join(output_dir, doc_name + "_dictpages.pkl")
    with open(output_filepath, 'wb') as f:
        pickle.dump(result, f)

In [2306]:
output_filepath_monzo= os.path.join(output_dir, 'monzo_annual report_2024' + "_dictpages.pkl")
with open(output_filepath, 'wb') as f:
        pickle.dump(result_monzo, f)

In [2199]:
result_dicts['monzo_annual report_2024_63_78']

{'extraction_info': {'apiVersion': '2024-07-31-preview',
  'modelID': 'prebuilt-layout'},
 'meta_info_file': {'source': 'monzo-annual-report-2024_63_78.pdf',
  'stringIndexType': <StringIndexType.TEXT_ELEMENTS: 'textElements'>,
  'pagesTotal': 2,
  'paragraphsTotal': 18,
  'handwritten': False,
  'tablesTotal': 0,
  'figuresTotal': 10},
 'pages': [{'page_number': 1,
   'meta_page': {'width': 11.6806, 'height': 8.2639, 'unit': 'inch'},
   'structure_page': {'footnote': [],
    'page_header': ['Annual Report and Group Financial Statements'],
    'page_footer': [],
    'page_title': ['Meet our Executive team'],
    'section_headings': ["The Executive team's in charge of the everyday running of the business"]},
   'content_page': {'handwritten': [],
    'paragraphs': [{'role': 'title',
      'reading_order': 3,
      'raw_text': 'Meet our Executive team',
      'bounding_box': [0.3382,
       1.4373,
       2.094,
       1.4394,
       2.0927,
       2.6033,
       0.3369,
       2.6013]},

#### Dictionary structure by pagenumber and reading order of paragraphs, figures and tables

In [34]:
"""
Structure of dictionary:

'extraction_info'
'meta_info_file'

'pages': [
    'page_number'
    'meta_page'
    'structure_page'
    'bboxes': [
    ]
]
"""

"\nStructure of dictionary:\n\n'extraction_info'\n'meta_info_file'\n\n'pages': [\n    'page_number'\n    'meta_page'\n    'structure_page'\n    'bboxes': [\n    ]\n]\n"

In [35]:
def extract_information_readingorder(result: AnalyzeResult) -> dict:
    
    result_objects=extract_information(result)

    document_info_readingorder={
        'extraction_info': result_objects.get('extraction_info'),
        'meta_info_file': result_objects.get('meta_info_file'),
        'pages': []
    }
    
    for page in result_objects['pages']:
        page_info_readingorder={
            "page_number": page['page_number'],
            "meta_page": page['meta_page'],
            "structure_page": page['structure_page'],
            "bboxes": []
        }
        

        bboxes=[]

        for handwritten in page['content_page']['handwritten']:
            bboxes.append({
                "role": "handwritten",
                "reading_order": handwritten["reading_order"],
                "bounding_box": handwritten["bounding_box"],
                "span": handwritten["confidence"],
                "offset": handwritten["offset"],
                "raw_text": handwritten["raw_text"]
            })
    
        # Process paragraphs and add to sorted_content with type "paragraph"
        for paragraph in page['content_page']['paragraphs']:
            bboxes.append({
                "role": paragraph['role'],
                "reading_order": paragraph["reading_order"],
                "bounding_box": paragraph["bounding_box"],
                "raw_text": paragraph["raw_text"]
            })
    

    
        # Process tables and add to sorted_content with type "table"
        for table in page['content_page']['tables']:
            bboxes.append({
                "role": "table",
                "reading_order": table["reading_order"],
                "bounding_box": table["bounding_box"],
                "table_no": table["table_no"],
                "number_of_rows": table["number_of_rows"],
                "number_of_columns": table["number_of_columns"],
                "header_info": table["header_info"],
                "table_raw": table["table_raw"],
                "table_flat": table["table_flat"],
                "table_multilevel": table["table_multilevel"]
            })
        
        for figure in page['content_page']['figures']:
            bboxes.append({
                "role": "figure",
                "reading_order": figure["reading_order"],
                "bounding_box": figure["bounding_box"],
                "figure_no": figure["figure_no"],
                "caption": figure["caption"],
                "span": figure["span"],
                "raw_text": figure["text_raw"]
            })
        
            
        #Sort combined list by the 'reading_order' field
        bboxes.sort(key=lambda x: x["reading_order"])
        
        page_info_readingorder['bboxes'].extend(bboxes)
    
        document_info_readingorder['pages'].append(page_info_readingorder)


    # Wrap in the final dictionary structure
    return document_info_readingorder



In [36]:
result_dicts

{'deutschebank_annual_report_2023_8_9_15_120': {'extraction_info': {'apiVersion': '2024-07-31-preview',
   'modelID': 'prebuilt-layout'},
  'meta_info_file': {'stringIndexType': <StringIndexType.TEXT_ELEMENTS: 'textElements'>,
   'pagesTotal': 2,
   'paragraphsTotal': 23,
   'handwritten': True,
   'tablesTotal': 0,
   'figuresTotal': 0},
  'pages': [{'page_number': 1,
    'meta_page': {'width': 8.2639, 'height': 11.6806, 'unit': 'inch'},
    'structure_page': {'footnote': [],
     'page_header': ['Letter from the Chairman of the Supervisory Board',
      'Deutsche Bank Annual Report 2023'],
     'page_footer': [],
     'page_title': ['Letter from the Chairman of the Supervisory Board'],
     'section_headings': []},
    'content_page': {'handwritten': [],
     'paragraphs': [{'role': 'title',
       'reading_order': 3,
       'raw_text': 'Letter from the Chairman of the Supervisory Board',
       'bounding_box': [0.8862,
        1.4035,
        6.9986,
        1.4103,
        6.9983,


In [37]:
# Process each chunked file and store results in a structured dictionary with extract_information
result_dicts_reordered = {}
for doc_name in result_dicts.keys():
    result_dicts_reordered[doc_name] = extract_information_readingorder(processed_results[doc_name])

In [2309]:
result_dicts_reordered_monzo = {}
for doc_name in result_monzo.keys():
    result_dicts_reordered_monzo[doc_name] = extract_information_readingorder(processed_monzo[doc_name])

In [53]:
result_dicts_reordered['Mock-handwritten_0']

{'extraction_info': {'apiVersion': '2024-07-31-preview',
  'modelID': 'prebuilt-layout'},
 'meta_info_file': {'stringIndexType': <StringIndexType.TEXT_ELEMENTS: 'textElements'>,
  'pagesTotal': 1,
  'paragraphsTotal': 19,
  'handwritten': True,
  'tablesTotal': 0,
  'figuresTotal': 1},
 'pages': [{'page_number': 1,
   'meta_page': {'width': 8.2778, 'height': 11.6944, 'unit': 'inch'},
   'structure_page': {'footnote': [],
    'page_header': ['Prymasche Bank,'],
    'page_footer': [],
    'page_title': [],
    'section_headings': ['Liquidity and capital resources',
     '11',
     'Credit Ratings',
     'Credit Ratings Development',
     'The rating agencies recognized the continued progress the bank has made over the course of 2023, specifically further improvements in profitability. This was reflected in upgrades by S&P, Fitch and Morningstar DBRS during the year. May 7,2023']},
   'bboxes': [{'role': 'handwritten',
     'reading_order': 1,
     'bounding_box': [7.0053,
      0.4457,
 

In [2310]:
result_dicts_reordered_monzo

{'monzo_annual report_2024': {'extraction_info': {'apiVersion': '2024-07-31-preview',
   'modelID': 'prebuilt-layout'},
  'meta_info_file': {'stringIndexType': <StringIndexType.TEXT_ELEMENTS: 'textElements'>,
   'pagesTotal': 166,
   'paragraphsTotal': 7402,
   'handwritten': True,
   'tablesTotal': 103,
   'figuresTotal': 101},
  'pages': [{'page_number': 1,
    'meta_page': {'width': 11.6806, 'height': 8.2639, 'unit': 'inch'},
    'structure_page': {'footnote': [],
     'page_header': [],
     'page_footer': ['Monzo Bank Holding Group Limited Annual Report 2024'],
     'page_title': [],
     'section_headings': []},
    'bboxes': [{'role': 'figure',
      'reading_order': 1,
      'bounding_box': [0.3018,
       0.3047,
       2.5447,
       0.3049,
       2.5449,
       0.7733,
       0.3019,
       0.7731],
      'figure_no': '1.1',
      'caption': None,
      'span': [{'offset': 0, 'length': 26}],
      'raw_text': []}]},
   {'page_number': 2,
    'meta_page': {'width': 11.6806, 

In [1590]:
result_dicts_reordered['monzo_annual report_2024_63_78'].get('pages')[0].get('bboxes')

[{'role': 'paragraph',
  'reading_order': 5,
  'bounding_box': [0.3388,
   4.044,
   2.9345,
   4.0406,
   2.9373,
   6.1905,
   0.3416,
   6.1939],
  'raw_text': "It's collectively responsible for helping run the day-to-day business of the Monzo Group. The Executive team makes up our Group Executive Committee (Group ExCo) which operates in a dual capacity, like the Boards. This means they consider things on behalf of the Monzo Group and on behalf of individual Monzo entities as appropriate. Many of the Group ExCo also sit on and chair other executive-level committees, with some of them reporting up to our Boards and Group Committees. We talk more about our governance structure from page 56."},
 {'role': 'paragraph',
  'reading_order': 6,
  'bounding_box': [4.0662,
   2.9639,
   5.3691,
   2.9626,
   5.3697,
   3.5941,
   4.0668,
   3.5954],
  'raw_text': 'TS Anil Group Chief Executive Officer and Executive Director'},
 {'role': 'figure',
  'reading_order': 6,
  'bounding_box': [5.9272

In [2201]:
#Save results in a pickle file
output_dir = '/Users/ann-kathrin/repositories/PDFDataToolkit/data/output_azuredocintelligence/dict_readingorder'
for doc_name, result in result_dicts_reordered.items():
    output_filepath = os.path.join(output_dir, doc_name + "_dictbboxes.pkl")
    with open(output_filepath, 'wb') as f:
        pickle.dump(result, f)

## 4. Draw bounding boxes around detected objects into the PDFs <a class="anchor" id="4"></a>

### Bounding boxes for all paragraphs, tables, figures and handwritten content
Draw for each analysed document into 4 seperate pdfs the bounding boxes for paragraphs, tables, figures and handwritten content detected by Azure Doc Intelligence, respectively, and save the resulting pdf

Check if the object detection of Azure Document Intelligence is accurate

In [38]:
# Define bounding box drawing function
def draw_box(page, polygon, color, scale_factor=72, line_width=2):
    if not polygon or len(polygon) != 8:
        return  # Skip drawing if polygon is empty or doesn't have 8 points

    # Convert 8-point polygon (in inches) to points
    x_coords = [polygon[i] * scale_factor for i in range(0, 8, 2)]  # Extract x-coordinates (0, 2, 4, 6)
    y_coords = [polygon[i] * scale_factor for i in range(1, 8, 2)]  # Extract y-coordinates (1, 3, 5, 7)

    # Get the minimum and maximum x and y values to form the bounding box
    x0, x1 = min(x_coords), max(x_coords)
    y0, y1 = min(y_coords), max(y_coords)

    # Create a rectangle (bounding box) and draw it
    rect = fitz.Rect(x0, y0, x1, y1)
    page.draw_rect(rect, color=color, width=line_width)  # Increase line width for visibility


def draw_bounding_boxes(filepath, document_info):
    # Open the original document
    doc = fitz.open(filepath)

    # Define colors for different elements
    colors = {
        "paragraph": (0, 0, 1),  # Blue for paragraphs
        "table": (0, 1, 0),  # Green for tables
        "figure": (1, 0, 0),  # Red for figures
        "handwritten": (1, 1, 0)  # Yellow for handwritten content
    }

    # Define file suffixes for different output files
    suffixes = {
        "paragraph": "_paragraphbox",
        "table": "_tablebox",
        "figure": "_figurebox",
        "handwritten": "_handwrittenbox"
    }


    # Draw boxes for each category and save as a separate file
    for category in ["paragraph", "table", "figure", "handwritten"]:
        # Copy the document for each category
        doc_copy = fitz.open(filepath)

        # Iterate over each page
        for page_info in document_info["pages"]:
            page_number = page_info["page_number"] - 1  # Page numbers are 0-indexed in PyMuPDF
            page = doc_copy.load_page(page_number)

            # Draw bounding boxes for paragraphs, tables, figures, or handwritten content
            if category == "paragraph":
                for paragraph in page_info["content_page"]["paragraphs"]:
                    polygon = paragraph["bounding_box"]
                    if polygon:
                        draw_box(page, polygon, colors[category])

            elif category == "table":
                for table in page_info["content_page"]["tables"]:
                    polygon = table["bounding_box"]
                    if polygon:
                        draw_box(page, polygon, colors[category])

            elif category == "figure":
                for figure in page_info["content_page"]["figures"]:
                    polygon = figure["bounding_box"]
                    if polygon:
                        draw_box(page, polygon, colors[category])

            elif category == "handwritten":
                for handwritten in page_info["content_page"]["handwritten"]:
                    polygon = handwritten["bounding_box"]
                    if polygon:
                        draw_box(page, polygon, colors[category])

        

        # Save the document with bounding boxes for the current category
        filename = os.path.basename(filepath)
        output_filename = filename.replace(".pdf", f"{suffixes[category]}.pdf")
        output_fullpath = os.path.join(f"/Users/ann-kathrin/repositories/PDFDataToolkit/data/output_azuredocintelligence/bboxes_combined/{category}", output_filename)
        doc_copy.save(output_fullpath)
        doc_copy.close()

    # Close the original document
    doc.close()

In [39]:
chunked_file_paths

{'deutschebank_annual_report_2023_8_9_15_120': '/Users/ann-kathrin/repositories/PDFDataToolkit/data/sample_docs/chunked/Annual-Report-2023_8_9_15_120.pdf',
 'deutschebank_annual_report_2023_9_15_120': '/Users/ann-kathrin/repositories/PDFDataToolkit/data/sample_docs/chunked/Annual-Report-2023_9_15_120.pdf',
 'deutschebank_annual_report_2023_120': '/Users/ann-kathrin/repositories/PDFDataToolkit/data/sample_docs/chunked/Annual-Report-2023_120.pdf',
 'deutschebank_annual_report_2023_420': '/Users/ann-kathrin/repositories/PDFDataToolkit/data/sample_docs/chunked/Annual-Report-2023_420.pdf',
 'deutschebank_annual_report_2022_8_9_15_119': '/Users/ann-kathrin/repositories/PDFDataToolkit/data/sample_docs/chunked/Annual-Report-2022_8_9_15_119.pdf',
 'deutschebank_annual_report_2022_9_15_119': '/Users/ann-kathrin/repositories/PDFDataToolkit/data/sample_docs/chunked/Annual-Report-2022_9_15_119.pdf',
 'deutschebank_annual_report_2022_118': '/Users/ann-kathrin/repositories/PDFDataToolkit/data/sample_

In [40]:
for doc_name in result_dicts.keys():
    draw_bounding_boxes(chunked_file_paths[doc_name], result_dicts[doc_name])

In [2312]:
draw_bounding_boxes(file_paths['monzo_annual report_2024'], result_monzo['monzo_annual report_2024'])

### Bounding box around single figure and table per pdf
Save for each detected figure and table a pdf with a bounding box around this specific figure/table

Will be used later to translate figures and tables into text

In [41]:
#For figures 
output_dir_figures = '/Users/ann-kathrin/repositories/PDFDataToolkit/data/output_azuredocintelligence/bboxes_single/figure/bbox'
for doc_name, item in result_dicts_reordered.items():
    if [bbox for page in item['pages'] for bbox in page['bboxes'] if bbox['role'] == 'figure']:
        try:
            pdf_document = fitz.open(chunked_file_paths[doc_name])
        except Exception as e:
            print(f"Error opening document {doc_name}: {e}")
            continue
        
        for page in item.get('pages'):
            try:
                pdf_page = pdf_document.load_page(page['page_number'] - 1)
            except Exception as e:
                print(f"Error loading page {page['page_number']} in document {doc_name}: {e}")
                continue
            
            for bbox in page.get('bboxes'):
                if bbox['role'] == 'figure':
                    # Create a new PDF document for the current page
                    new_pdf_document = fitz.open()
                    new_pdf_document.insert_pdf(pdf_document, from_page=page['page_number'] - 1, to_page=page['page_number'] - 1)
                    
                    # Draw the bounding box on the new document's page
                    new_pdf_page = new_pdf_document[0]
                    draw_box(new_pdf_page, bbox['bounding_box'], (1, 0, 0))

                    pix = new_pdf_page.get_pixmap(dpi=1400)
                
                    # Create the output filename
                    base_filename = os.path.splitext(os.path.basename(chunked_file_paths[doc_name]))[0]
                    output_filename = f"{base_filename}_pageno{page['page_number']}_figureno{bbox['figure_no']}_bbox.jpg"
                    output_image_path = os.path.join(output_dir_figures, output_filename)
                    bbox['path_figure_redbox']=output_image_path
                    
                    # Save the image as JPEG
                    try:
                        pix.save(output_image_path, 'JPEG')
                        print(f"Saved {output_image_path}")
                    except Exception as e:
                        print(f"Error saving image {output_image_path}: {e}")
                    finally:
                        new_pdf_document.close()
        
        pdf_document.close()

Saved /Users/ann-kathrin/repositories/PDFDataToolkit/data/output_azuredocintelligence/bboxes_single/figure/bbox/Annual-Report-2023_420_pageno1_figureno1.1_bbox.jpg
Saved /Users/ann-kathrin/repositories/PDFDataToolkit/data/output_azuredocintelligence/bboxes_single/figure/bbox/Annual-Report-2023_420_pageno1_figureno1.2_bbox.jpg
Saved /Users/ann-kathrin/repositories/PDFDataToolkit/data/output_azuredocintelligence/bboxes_single/figure/bbox/Deutsche-Bank-Q2-2024-Presentation_8_9_17_pageno1_figureno1.1_bbox.jpg
Saved /Users/ann-kathrin/repositories/PDFDataToolkit/data/output_azuredocintelligence/bboxes_single/figure/bbox/Deutsche-Bank-Q2-2024-Presentation_8_9_17_pageno1_figureno1.2_bbox.jpg
Saved /Users/ann-kathrin/repositories/PDFDataToolkit/data/output_azuredocintelligence/bboxes_single/figure/bbox/Deutsche-Bank-Q2-2024-Presentation_8_9_17_pageno2_figureno2.1_bbox.jpg
Saved /Users/ann-kathrin/repositories/PDFDataToolkit/data/output_azuredocintelligence/bboxes_single/figure/bbox/Deutsche-Ba

In [2313]:
#For figures - monzo - not in git repo for storage reasons
output_dir_figures = '/Users/ann-kathrin/repositories/PDFDataToolkit/data/output_azuredocintelligence/bboxes_single/figure/bbox_monzo'
for doc_name, item in result_dicts_reordered_monzo.items():
    if [bbox for page in item['pages'] for bbox in page['bboxes'] if bbox['role'] == 'figure']:
        try:
            pdf_document = fitz.open(file_paths[doc_name])
        except Exception as e:
            print(f"Error opening document {doc_name}: {e}")
            continue
        
        for page in item.get('pages'):
            try:
                pdf_page = pdf_document.load_page(page['page_number'] - 1)
            except Exception as e:
                print(f"Error loading page {page['page_number']} in document {doc_name}: {e}")
                continue
            
            for bbox in page.get('bboxes'):
                if bbox['role'] == 'figure':
                    # Create a new PDF document for the current page
                    new_pdf_document = fitz.open()
                    new_pdf_document.insert_pdf(pdf_document, from_page=page['page_number'] - 1, to_page=page['page_number'] - 1)
                    
                    # Draw the bounding box on the new document's page
                    new_pdf_page = new_pdf_document[0]
                    draw_box(new_pdf_page, bbox['bounding_box'], (1, 0, 0))

                    pix = new_pdf_page.get_pixmap(dpi=1400)
                
                    # Create the output filename
                    base_filename = os.path.splitext(os.path.basename(file_paths[doc_name]))[0]
                    output_filename = f"{base_filename}_pageno{page['page_number']}_figureno{bbox['figure_no']}_bbox.jpg"
                    output_image_path = os.path.join(output_dir_figures, output_filename)
                    bbox['path_figure_redbox']=output_image_path
                    
                    # Save the image as JPEG
                    try:
                        pix.save(output_image_path, 'JPEG')
                        print(f"Saved {output_image_path}")
                    except Exception as e:
                        print(f"Error saving image {output_image_path}: {e}")
                    finally:
                        new_pdf_document.close()
        
        pdf_document.close()

Saved /Users/ann-kathrin/repositories/PDFDataToolkit/data/output_azuredocintelligence/bboxes_single/figure/bbox_monzo/monzo-annual-report-2024_pageno1_figureno1.1_bbox.jpg
cannot create /Annot for kind: 4
cannot create /Annot for kind: 4
cannot create /Annot for kind: 4
cannot create /Annot for kind: 4
cannot create /Annot for kind: 4
cannot create /Annot for kind: 4
cannot create /Annot for kind: 4
cannot create /Annot for kind: 4
cannot create /Annot for kind: 4
cannot create /Annot for kind: 4
cannot create /Annot for kind: 4
cannot create /Annot for kind: 4
cannot create /Annot for kind: 4
cannot create /Annot for kind: 4
cannot create /Annot for kind: 4
cannot create /Annot for kind: 4
cannot create /Annot for kind: 4
cannot create /Annot for kind: 4
cannot create /Annot for kind: 4
cannot create /Annot for kind: 4
cannot create /Annot for kind: 4
cannot create /Annot for kind: 4
cannot create /Annot for kind: 4
cannot create /Annot for kind: 4
cannot create /Annot for kind: 4
can

In [42]:
#for tables 
output_dir_tables='/Users/ann-kathrin/repositories/PDFDataToolkit/data/output_azuredocintelligence/bboxes_single/table/bbox'
for doc_name, item in result_dicts_reordered.items():
    if [bbox for page in item['pages'] for bbox in page['bboxes'] if bbox['role'] == 'table']:
        try:
            pdf_document = fitz.open(chunked_file_paths[doc_name])
        except Exception as e:
            print(f"Error opening document {doc_name}: {e}")
            continue
        
        for page in item.get('pages'):
            try:
                pdf_page = pdf_document.load_page(page['page_number'] - 1)
            except Exception as e:
                print(f"Error loading page {page['page_number']} in document {doc_name}: {e}")
                continue
            
            for bbox in page.get('bboxes'):
                if bbox['role'] == 'table':
                    # Create a new PDF document for the current page
                    new_pdf_document = fitz.open()
                    new_pdf_document.insert_pdf(pdf_document, from_page=page['page_number'] - 1, to_page=page['page_number'] - 1)
                    
                    # Draw the bounding box on the new document's page
                    new_pdf_page = new_pdf_document[0]
                    draw_box(new_pdf_page, bbox['bounding_box'], (1, 0, 0))

                    pix = new_pdf_page.get_pixmap(dpi=1400)
                
                    # Create the output filename
                    base_filename = os.path.splitext(os.path.basename(chunked_file_paths[doc_name]))[0]
                    output_filename = f"{base_filename}_pageno{page['page_number']}_tableno{bbox['table_no']}_redbox.jpg"
                    output_table_path = os.path.join(output_dir_tables, output_filename)
                    bbox['path_table_redbox']=output_table_path
                    
                    # Save the image as JPEG
                    try:
                        pix.save(output_table_path, 'JPEG')
                        print(f"Saved {output_table_path}")
                    except Exception as e:
                        print(f"Error saving image {output_table_path}: {e}")
                    finally:
                        new_pdf_document.close()
        
        pdf_document.close()

Saved /Users/ann-kathrin/repositories/PDFDataToolkit/data/output_azuredocintelligence/bboxes_single/table/bbox/Annual-Report-2023_9_15_120_pageno2_tableno0_redbox.jpg
Saved /Users/ann-kathrin/repositories/PDFDataToolkit/data/output_azuredocintelligence/bboxes_single/table/bbox/Annual-Report-2023_9_15_120_pageno2_tableno1_redbox.jpg
Saved /Users/ann-kathrin/repositories/PDFDataToolkit/data/output_azuredocintelligence/bboxes_single/table/bbox/Annual-Report-2023_120_pageno1_tableno0_redbox.jpg
Saved /Users/ann-kathrin/repositories/PDFDataToolkit/data/output_azuredocintelligence/bboxes_single/table/bbox/Annual-Report-2023_120_pageno1_tableno1_redbox.jpg
Saved /Users/ann-kathrin/repositories/PDFDataToolkit/data/output_azuredocintelligence/bboxes_single/table/bbox/Annual-Report-2023_120_pageno1_tableno2_redbox.jpg
Saved /Users/ann-kathrin/repositories/PDFDataToolkit/data/output_azuredocintelligence/bboxes_single/table/bbox/Annual-Report-2023_120_pageno1_tableno3_redbox.jpg
Saved /Users/ann-k

In [2314]:
#for tables - monzo - not in git repo for storage reasons
output_dir_tables='/Users/ann-kathrin/repositories/PDFDataToolkit/data/output_azuredocintelligence/bboxes_single/table/bbox_monzo'
for doc_name, item in result_dicts_reordered_monzo.items():
    if [bbox for page in item['pages'] for bbox in page['bboxes'] if bbox['role'] == 'table']:
        try:
            pdf_document = fitz.open(file_paths[doc_name])
        except Exception as e:
            print(f"Error opening document {doc_name}: {e}")
            continue
        
        for page in item.get('pages'):
            try:
                pdf_page = pdf_document.load_page(page['page_number'] - 1)
            except Exception as e:
                print(f"Error loading page {page['page_number']} in document {doc_name}: {e}")
                continue
            
            for bbox in page.get('bboxes'):
                if bbox['role'] == 'table':
                    # Create a new PDF document for the current page
                    new_pdf_document = fitz.open()
                    new_pdf_document.insert_pdf(pdf_document, from_page=page['page_number'] - 1, to_page=page['page_number'] - 1)
                    
                    # Draw the bounding box on the new document's page
                    new_pdf_page = new_pdf_document[0]
                    draw_box(new_pdf_page, bbox['bounding_box'], (1, 0, 0))

                    pix = new_pdf_page.get_pixmap(dpi=1400)
                
                    # Create the output filename
                    base_filename = os.path.splitext(os.path.basename(file_paths[doc_name]))[0]
                    output_filename = f"{base_filename}_pageno{page['page_number']}_tableno{bbox['table_no']}_redbox.jpg"
                    output_table_path = os.path.join(output_dir_tables, output_filename)
                    bbox['path_table_redbox']=output_table_path
                    
                    # Save the image as JPEG
                    try:
                        pix.save(output_table_path, 'JPEG')
                        print(f"Saved {output_table_path}")
                    except Exception as e:
                        print(f"Error saving image {output_table_path}: {e}")
                    finally:
                        new_pdf_document.close()
        
        pdf_document.close()

Saved /Users/ann-kathrin/repositories/PDFDataToolkit/data/output_azuredocintelligence/bboxes_single/table/bbox_monzo/monzo-annual-report-2024_pageno19_tableno0_redbox.jpg
cannot create /Annot for kind: 4
Saved /Users/ann-kathrin/repositories/PDFDataToolkit/data/output_azuredocintelligence/bboxes_single/table/bbox_monzo/monzo-annual-report-2024_pageno21_tableno1_redbox.jpg
cannot create /Annot for kind: 4
cannot create /Annot for kind: 4
cannot create /Annot for kind: 4
cannot create /Annot for kind: 4
cannot create /Annot for kind: 4
Saved /Users/ann-kathrin/repositories/PDFDataToolkit/data/output_azuredocintelligence/bboxes_single/table/bbox_monzo/monzo-annual-report-2024_pageno23_tableno2_redbox.jpg
cannot create /Annot for kind: 4
cannot create /Annot for kind: 4
cannot create /Annot for kind: 4
cannot create /Annot for kind: 4
cannot create /Annot for kind: 4
cannot create /Annot for kind: 4
cannot create /Annot for kind: 4
Saved /Users/ann-kathrin/repositories/PDFDataToolkit/data/

##### Save all figures and tables as cropped pictures 
Will be used later to translate figures and tables into text

In [43]:
#cutting tables from pdfs
def save_table_from_pdf(pdf_path, page_number, coordinates, table_number, output_dir):
    """
    Saves a figure from a specific page and coordinates in a PDF as a JPEG image.

    Args:
    - pdf_path (str): Path to the PDF file.
    - page_number (int): The page number to extract the figure from (0-based index).
    - coordinates (tuple): A tuple of (x0, y0, x1, y1, x2, y2, x3, y3) representing the coordinates of the figure.
    - figure_number (int): The figure number on the page.
    - output_dir (str): Directory to save the output JPEG image.

    Returns:
    - str: Path to the saved image.
    """
    # Convert inches to points (1 inch = 72 points)
    coordinates_in_points = [coord * 72 for coord in coordinates]

    # Open the PDF file
    pdf_document = fitz.open(pdf_path)

    page_number=page_number-1
    
    # Select the specified page
    page = pdf_document.load_page(page_number)
    
    # Define the quadrilateral area to extract
    if len(coordinates) != 8:
        raise ValueError("Coordinates must be a tuple of 8 values (x0, y0, x1, y1, x2, y2, x3, y3).")
    quad = fitz.Quad([coordinates_in_points[0:2], coordinates_in_points[2:4], coordinates_in_points[4:6], coordinates_in_points[6:8]])
    
    
    # Extract the image from the specified quadrilateral
    #zoom = 2  # Increase the zoom factor to get a higher resolution image
   # mat = fitz.Matrix(zoom, zoom)
    pix = page.get_pixmap(clip=quad.rect, dpi=1200)
    
    # Create the output filename
    base_filename = os.path.splitext(os.path.basename(pdf_path))[0]
    output_filename = f"{base_filename}_pageno{page_number}_tableno{table_number}_cropped.jpg"
    #output_filename_whole = f"{base_filename}_page_{page_number + 1}.jpg"
    output_table_path = os.path.join(output_dir, output_filename)
   # output_image_path_whole = os.path.join(output_dir, output_filename_whole)
    
    # Save the image as JPEG
    pix.save(output_table_path, 'JPEG')
    
    return output_table_path

In [44]:
output_dir='/Users/ann-kathrin/repositories/PDFDataToolkit/data/output_azuredocintelligence/bboxes_single/table/cropped'
for doc_name, item in result_dicts_reordered.items():
    for page in item['pages']:
        for bbox in page['bboxes']:
            if bbox['role'] == 'table':
                bbox['table_path_cropped']=save_table_from_pdf(chunked_file_paths[doc_name], page['page_number'], bbox['bounding_box'], bbox['table_no'], output_dir)

In [45]:
#cutting images from pdfs
def save_figure_from_pdf(pdf_path, page_number, coordinates, figure_number, output_dir):
    """
    Saves a figure from a specific page and coordinates in a PDF as a JPEG image.

    Args:
    - pdf_path (str): Path to the PDF file.
    - page_number (int): The page number to extract the figure from (0-based index).
    - coordinates (tuple): A tuple of (x0, y0, x1, y1, x2, y2, x3, y3) representing the coordinates of the figure.
    - figure_number (int): The figure number on the page.
    - output_dir (str): Directory to save the output JPEG image.

    Returns:
    - str: Path to the saved image.
    """
    # Convert inches to points (1 inch = 72 points)
    coordinates_in_points = [coord * 72 for coord in coordinates]

    # Open the PDF file
    pdf_document = fitz.open(pdf_path)

    page_number=page_number-1
    
    # Select the specified page
    page = pdf_document.load_page(page_number)
    
    # Define the quadrilateral area to extract
    if len(coordinates) != 8:
        raise ValueError("Coordinates must be a tuple of 8 values (x0, y0, x1, y1, x2, y2, x3, y3).")
    quad = fitz.Quad([coordinates_in_points[0:2], coordinates_in_points[2:4], coordinates_in_points[4:6], coordinates_in_points[6:8]])
    
    
    # Extract the image from the specified quadrilateral
    #zoom = 2  # Increase the zoom factor to get a higher resolution image
   # mat = fitz.Matrix(zoom, zoom)
    pix = page.get_pixmap(clip=quad.rect, dpi=1200)
    
    # Create the output filename
    base_filename = os.path.splitext(os.path.basename(pdf_path))[0]
    output_filename = f"{base_filename}_pageno{page_number}_figureno{figure_number}_cropped.jpg"
    #output_filename_whole = f"{base_filename}_page_{page_number + 1}.jpg"
    output_image_path = os.path.join(output_dir, output_filename)
   # output_image_path_whole = os.path.join(output_dir, output_filename_whole)
    
    # Save the image as JPEG
    pix.save(output_image_path, 'JPEG')
    
    return output_image_path

In [46]:
output_dir='/Users/ann-kathrin/repositories/PDFDataToolkit/data/output_azuredocintelligence/bboxes_single/figure/cropped'
for doc_name, item in result_dicts_reordered.items():
    for page in item['pages']:
        for bbox in page['bboxes']:
            if bbox['role'] == 'figure':
                bbox['figure_path_cropped']=save_figure_from_pdf(chunked_file_paths[doc_name], page['page_number'], bbox['bounding_box'], bbox['figure_no'], output_dir)

## 5. Replacing figures and tables by their description generated by GPT 4.0 <a class="anchor" id="5"></a>

In [47]:
# Load credentials from .env file
load_dotenv()

AZURE_OPENAI_API_KEY = os.getenv("AZURE_OPENAI_API_KEY")
AZURE_OPENAI_ENDPOINT = os.getenv("AZURE_OPENAI_ENDPOINT")

### Get the description of a cropped figure
Save the output as .txt files to check the quality of the results

In [48]:
#Analysing a cropped figure
def process_figure_and_save_text(figure_path, api_key, endpoint, output_dir=None):
    """
    Process an image using OpenAI's GPT-4 Vision model and save the response to a text file.

    Args:
    - image_path (str): Path to the input image file.
    - api_key (str): API key for accessing the GPT-4 Vision model.
    - output_dir (str, optional): Directory where the output text file will be saved. Defaults to the same directory as the input image.

    Returns:
    - str: Path to the saved text file.
    """
    # Read and encode the image file
    encoded_figure = base64.b64encode(open(figure_path, 'rb').read()).decode('ascii')

    # Configuration
    headers = {
        "Content-Type": "application/json",
        "api-key": api_key,
    }

    # Payload for the request
    payload = {
      "messages": [
        {
          "role": "system",
          "content": [
            {
              "type": "text",
              "text": "You are an AI model that extracts detailed information from figures."
            }
          ]
        },
        {
          "role": "user",
          "content": [
            {
              "type": "image_url",
              "image_url": {
                "url": f"data:image/jpeg;base64,{encoded_figure}"
              }
            },
            {
              "type": "text",
              "text": "Please analyze the figure and provide the following information: "
                      "1. Specify if the figure is a chart, graph, diagram, photograph, pictogram or other type of visual representation. "
                      "2. If the figure is a chart or graph, print the table in markdown which represents the data in the chart."
                      "3. A detailed description of the figure content. "
                      "4. Any text present in the figure. "
                      "5. The context or setting of the figure. "
                      "6. The overall theme or message conveyed by the figure. "
                      "7. Any other relevant details or observations."
            }
          ]
        },
      ],
      "temperature": 0.7,
      "top_p": 0.95,
      
      "max_tokens": 1000
    }

    # GPT-4 Vision endpoint
    GPT4V_ENDPOINT = endpoint

    # Send request
    try:
        response = requests.post(GPT4V_ENDPOINT, headers=headers, json=payload)
        response.raise_for_status()  # Will raise an HTTPError if the HTTP request returned an unsuccessful status code
    except requests.RequestException as e:
        raise SystemExit(f"Failed to make the request. Error: {e}")

    # Handle the response
    response_data = response.json()
    
    # Determine output directory and filename
    if output_dir is None:
        output_dir = os.path.dirname(figure_path)
    else:
        os.makedirs(output_dir, exist_ok=True)
    
    figure_filename = os.path.basename(figure_path)
    text_filename = os.path.splitext(figure_filename)[0] + "_cropped_text.txt"
    text_filepath = os.path.join(output_dir, text_filename)

    # Save response to a text file
    with open(text_filepath, 'w') as f:
        f.write(response.json()['choices'][0]['message']['content'])

    print(f"Text response saved to: {text_filepath}")
    return text_filepath


In [2247]:
#Process cropped figures
# Directory path containing images
directory = "/Users/ann-kathrin/repositories/PDFDataToolkit/data/output_azuredocintelligence/bboxes_single/figure/cropped"
output_dir= "/Users/ann-kathrin/repositories/PDFDataToolkit/data/output_azuredocintelligence/bboxes_single/figure/cropped/output_figuretotext"

# Iterate over each file in the directory
for filename in os.listdir(directory):
    if filename.endswith(".jpg") or filename.endswith(".png"):  # Adjust based on your image file extensions
        image_path = os.path.join(directory, filename)
        process_figure_and_save_text(image_path, api_key=AZURE_OPENAI_API_KEY, endpoint=AZURE_OPENAI_ENDPOINT, output_dir=output_dir)
        time.sleep(8)
    else:
        continue

Text response saved to: /Users/ann-kathrin/repositories/PDFDataToolkit/data/output_azuredocintelligence/bboxes_single/figure/cropped/output_figuretotext/Deutsche-Bank-Q2-2024-Presentation_1_21_pageno1_figureno2.1_cropped_cropped_text.txt
Text response saved to: /Users/ann-kathrin/repositories/PDFDataToolkit/data/output_azuredocintelligence/bboxes_single/figure/cropped/output_figuretotext/Annual-Report-2023_420_pageno0_figureno1.2_cropped_cropped_text.txt
Text response saved to: /Users/ann-kathrin/repositories/PDFDataToolkit/data/output_azuredocintelligence/bboxes_single/figure/cropped/output_figuretotext/Deutsche-Bank-Q2-2024-Presentation_10_22_pageno1_figureno2.3_cropped_cropped_text.txt
Text response saved to: /Users/ann-kathrin/repositories/PDFDataToolkit/data/output_azuredocintelligence/bboxes_single/figure/cropped/output_figuretotext/Deutsche-Bank-Q2-2024-Presentation_38_pageno0_figureno1.1_cropped_cropped_text.txt
Text response saved to: /Users/ann-kathrin/repositories/PDFDataToo

### Add figure descriptions into the dictionary
Figure in red box on the whole page

In [2316]:
def figure_description(image_path, api_key, endpoint):

    """
    Process an image using OpenAI's GPT-4 Vision model and save the response to a text file.

    Args:
    - image_path (str): Path to the input image file.
    - api_key (str): API key for accessing the GPT-4 Vision model.
    - output_dir (str, optional): Directory where the output text file will be saved. Defaults to the same directory as the input image.

    Returns:
    - str: Path to the saved text file.
    """

    # Dictionary of prompts for each figure category
    prompt = """
    
    Analyze the figure thoroughly to extract all essential information. For charts or graphs (e.g., line, bar, pie), describe:
    1. Title, labels, axes, scales, units, and legends, including any multiple axes.
    2. Data trends, patterns, and outliers, highlighting peaks, clusters, or shifts.
    3. Categories, values, and relationships, specifying key figures and comparisons.
    4. Annotations and color schemes used for data differentiation.

    For diagrams (e.g., flowcharts, networks, processes), capture:
    1. Title, labels, sections, and annotations clarifying each component’s purpose.
    2. Structure, including steps, levels, nodes, and directional flows, noting all connections.
    3. Relationships between components and functions.
    4. Colors, shapes, and styles that emphasize meaning or distinguish parts.

    Convert all relevant data into CSV format. Provide a detailed summary of the extracted content for information retrieval.
   
    """

    # Read and encode the image file
    with open(image_path, 'rb') as image_file:
        encoded_image = base64.b64encode(image_file.read()).decode('ascii')

    # Configuration
    headers = {
        "Content-Type": "application/json",
        "api-key": api_key,
    }

    # Payload for the request
    payload = {
      "messages": [
        {
          "role": "system",
          "content": [
            {
              "type": "text",
              "text": "You are an assistant whose job is to provide the explanation of figures which is going to be used to retrieve the figures. The figure is contained in the red box."
            }
          ]
        },
        {
          "role": "user",
          "content": [
            {
              "type": "image_url",
              "image_url": {
                "url": f"data:image/jpeg;base64,{encoded_image}"
              }
            },
            {
              "type": "text",
              "text": prompt
            }
          ]
        },
      ],
      "temperature": 0.7,
      "top_p": 0.95,
      "max_tokens": 1500
    }

    # GPT-4 Vision endpoint
    GPT4V_ENDPOINT = endpoint

    # Send request
    try:
        response = requests.post(GPT4V_ENDPOINT, headers=headers, json=payload)
        response.raise_for_status()  # Will raise an HTTPError if the HTTP request returned an unsuccessful status code
    except requests.RequestException as e:
        raise SystemExit(f"Failed to make the request. Error: {e}")

    # Handle the response
    response_data = response.json()
    
     # Extract the response content
    response_content = response_data['choices'][0]['message']['content']
    
    return response_content

In [1818]:
for doc_name, item in result_dicts_reordered.items():
    for page in item['pages']:
        for bbox in page['bboxes']:
            if bbox['role'] == 'figure':
                if not bbox.get('raw_text') == []:
                    try:
                        bbox['figure_description'] = figure_description(bbox['path_figure_redbox'], api_key=AZURE_OPENAI_API_KEY, endpoint=AZURE_OPENAI_ENDPOINT)
                        time.sleep(8)  # Add a 8-second delay after each request, otherwise HTTPError: 429 Client Error: Too Many Requests
                    except SystemExit as e:
                        print(f"Request failed: {e}")
                        bbox['figure_description'] = ''  # Set to empty if request fails
                else:
                    bbox['figure_description'] = ''

In [2317]:
#for monzo
for doc_name, item in result_dicts_reordered_monzo.items():
    for page in item['pages']:
        for bbox in page['bboxes']:
            if bbox['role'] == 'figure':
                if not bbox.get('raw_text') == []:
                    try:
                        bbox['figure_description'] = figure_description(bbox['path_figure_redbox'], api_key=AZURE_OPENAI_API_KEY, endpoint=AZURE_OPENAI_ENDPOINT)
                        time.sleep(10)  # Add a 8-second delay after each request, otherwise HTTPError: 429 Client Error: Too Many Requests
                    except SystemExit as e:
                        print(f"Request failed: {e}")
                        bbox['figure_description'] = ''  # Set to empty if request fails
                else:
                    bbox['figure_description'] = ''

### Add table descriptions into the dictionary
Table in red box on the whole page

In [2319]:
def table_description(table_path, api_key, endpoint):

    """
    Process a table using OpenAI's GPT-4 Vision model and save the response to a text file.

    Args:
    - table_path (str): Path to the input table file.
    - api_key (str): API key for accessing the GPT-4 Vision model.
    - output_dir (str, optional): Directory where the output text file will be saved. Defaults to the same directory as the input table.

    Returns:
    - str: Path to the saved text file.
    """

    # Dictionary of prompts for each figure category
    prompt = """
    Analyze the table thoroughly to capture all relevant details for retrieval purposes. Include:
    1. **Title and Headings**: Extract the table’s title, headings, and subheadings, including any units or legends associated with columns and rows.
    2. **Structure and Layout**: Describe the table’s structure, including the number of rows and columns, any merged cells, and the organization of data into categories or groups.
    3. **Content and Key Values**: Summarize the data within each cell, highlighting any important or unusual values, trends, or patterns across rows or columns.
    4. **Comparative Insights**: Note any comparisons, relationships, or calculations presented within the table (e.g., percentages, totals, averages).
    5. **Annotations and Footnotes**: Capture any additional annotations, footnotes, or symbols that clarify data or provide context.
    6. **Color and Formatting**: Mention the use of color, bolding, or formatting styles that emphasize particular rows, columns, or cells.
    """

    # Read and encode the image file
    with open(table_path, 'rb') as table_file:
        encoded_table = base64.b64encode(table_file.read()).decode('ascii')

    # Configuration
    headers = {
        "Content-Type": "application/json",
        "api-key": api_key,
    }

    # Payload for the request
    payload = {
      "messages": [
        {
          "role": "system",
          "content": [
            {
              "type": "text",
              "text": "You are an assistant whose job is to provide the explanation of table which is going to be used to retrieve the table. The table is contained in the red box."
            }
          ]
        },
        {
          "role": "user",
          "content": [
            {
              "type": "image_url",
              "image_url": {
                "url": f"data:image/jpeg;base64,{encoded_table}"
              }
            },
            {
              "type": "text",
              "text": prompt
            }
          ]
        },
      ],
      "temperature": 0.7,
      "top_p": 0.95,
      "max_tokens": 1500
    }

    # GPT-4 Vision endpoint
    GPT4V_ENDPOINT = endpoint
    # Send request
    try:
        response = requests.post(GPT4V_ENDPOINT, headers=headers, json=payload)
        response.raise_for_status()  # Will raise an HTTPError if the HTTP request returned an unsuccessful status code
    except requests.RequestException as e:
        raise SystemExit(f"Failed to make the request. Error: {e}")

    # Handle the response
    response_data = response.json()
    
     # Extract the response content
    response_content = response_data['choices'][0]['message']['content']
    
    return response_content


In [1820]:
for doc_name, item in result_dicts_reordered.items():
    for page in item['pages']:
        for bbox in page['bboxes']:
            if bbox['role'] == 'table': 
                try:
                    bbox['table_description'] = table_description(bbox['path_table_redbox'], api_key=AZURE_OPENAI_API_KEY, endpoint=AZURE_OPENAI_ENDPOINT)
                    time.sleep(8)  # Add a 8-second delay after each request, otherwise HTTPError: 429 Client Error: Too Many Requests
                except SystemExit as e:
                    print(f"Request failed: {e}")
                    bbox['table_description'] = ''  # Set to empty if request fails

In [3]:
result_dicts_reordered

{}

In [2320]:
#for monzo
for doc_name, item in result_dicts_reordered_monzo.items():
    for page in item['pages']:
        for bbox in page['bboxes']:
            if bbox['role'] == 'table': 
                try:
                    bbox['table_description'] = table_description(bbox['path_table_redbox'], api_key=AZURE_OPENAI_API_KEY, endpoint=AZURE_OPENAI_ENDPOINT)
                    time.sleep(8)  # Add a 8-second delay after each request, otherwise HTTPError: 429 Client Error: Too Many Requests
                except SystemExit as e:
                    print(f"Request failed: {e}")
                    bbox['table_description'] = ''  # Set to empty if request fails

## 6. Implementing a multi modal RAG pipeline <a class="anchor" id="6"></a>
Sources: 
- https://devblogs.microsoft.com/ise/multimodal-rag-with-vision/
- https://medium.com/@uribrown_93284/extending-the-use-of-rag-functionality-to-images-with-azure-ai-search-and-azure-ai-services-cd4a3986155d
- https://github.com/microsoft/sample-app-aoai-chatGPT
- https://gautam75.medium.com/multi-modal-rag-a-practical-guide-99b0178c4fbb

Information to retrieve on business-model:
- Information on ownership structure
- Information on the bank's strategy and corporate structure (universal banking strategy, regional focus, market leadership, market position within the business area)
- Information on the profit and loss statement (P&L) and balance sheet (significant changes compared to the previous year, significant changes planned or being implemented?)
- Information on climate and environmental risks

### Chunking text information

### Hybrid Search

###Chunking of textblocks
- between subheadings if subheadings exist, between captions if captions exists
- if chunk exceeds token limit 8000, split after first 800 chunks, maintain paragraphs but include overlap
- if images/tables: include as many paragraphs before and after the chunk between subheadings and such that token limit <=8000

In [1932]:
# Sort all text information in the dictionary by reading order
def preprocess_chunks(result_lst):
    """Prepare chunk for indexing."""  
    # Include metadata in the text for enrichment (optional)
    
    chunk_dict = {}
    chunks = []
    for doc in result_lst.keys():
        for page in result_lst[doc].get('pages'):
            for bbox in page['bboxes']: 
                if bbox.get('role') is None or bbox.get('role') == 'title' or bbox.get('role') == 'sectionHeading':
                    text_content = bbox["raw_text"]
                    
                    bbox_dict = {
                        "enriched_text": f"{text_content}",
                        # Prepare metadata
                        "metadata": {
                            "doc_name": doc,
                            "role": bbox['role'],
                            "reading_order": bbox["reading_order"],
                            "page": page["page_number"]
                        }
                    }
                    chunks.append(bbox_dict)
                
                if bbox.get('role') == 'figure' and bbox["raw_text"]!=[]:
                    text_content = bbox["figure_description"]
                    
                    bbox_dict = {
                        "enriched_text": f"{text_content}, Figure Caption: {bbox['caption']}",
                        # Prepare metadata
                        "metadata": {
                            "doc_name": doc,
                            "role": bbox['role'],
                            "caption": bbox['caption'],
                            "reading_order": bbox["reading_order"],
                            "page": page["page_number"]
                        }
                    }
                    chunks.append(bbox_dict)
                
                if bbox.get('role') == 'table':
                    text_content = bbox["table_description"]
        
                    bbox_dict = {
                        "enriched_text": f"{text_content}",
                        # Prepare metadata
                        "metadata": {
                            "doc_name": doc,
                            "role": bbox['role'],
                            "reading_order": bbox["reading_order"],
                            "page": page["page_number"]
                        }
                    }
                    chunks.append(bbox_dict)
    
    for i, chunk in enumerate(chunks):
        chunk_dict[i] = {
            "chunk": chunk,
            "chunk_id": i
        }
    
    return chunk_dict

In [2249]:
# Preprocess the chunks result_dicts_reordered for indexing
chunks_ordered=preprocess_chunks(result_dicts_reordered)

In [2321]:
#for monzo
chunks_ordered_monzo=preprocess_chunks(result_dicts_reordered_monzo)

In [2250]:
chunks_ordered

{0: {'chunk': {'enriched_text': 'Letter from the Chairman of the Supervisory Board',
   'metadata': {'doc_name': 'deutschebank_annual_report_2023_8_9_15_120',
    'role': 'title',
    'reading_order': 3,
    'page': 1}},
  'chunk_id': 0},
 1: {'chunk': {'enriched_text': 'Dear Shareholders,',
   'metadata': {'doc_name': 'deutschebank_annual_report_2023_8_9_15_120',
    'role': None,
    'reading_order': 4,
    'page': 1}},
  'chunk_id': 1},
 2: {'chunk': {'enriched_text': "Last year, Deutsche Bank proved its resilience in a volatile environment, further increasing revenues and delivering its best pre-tax profit in 16 years. Thanks to its prudent risk management, the bank has a strong and stable balance sheet and has further strengthened its capital base. As a result, we are this year once again in a position to substantially increase capital distributions to shareholders and we are pleased that the Supervisory Board and Management Board will propose to you a 50% increase in the dividend

In [1937]:
# Enrich the chunk metadata with section headings and titles
def enrich_metadata(chunks):
    """
    Enrich chunks with metadata keys 'sectionHeading' and 'title' based on the nearest preceding chunks
    with the same doc_name. Includes multiple section headings if they are adjacent in reading order.
    
    Args:
    chunks (dict): Dictionary containing chunk data.
    
    Returns:
    dict: Updated chunks dictionary with enriched metadata.
    """
    enriched_chunks = {}

    # Group chunks by document name
    chunks_by_doc = {}
    for chunk_id, chunk_data in chunks.items():
        doc_name = chunk_data['chunk']['metadata']['doc_name']
        if doc_name not in chunks_by_doc:
            chunks_by_doc[doc_name] = []
        chunks_by_doc[doc_name].append((chunk_id, chunk_data))
    
    for doc_name, doc_chunks in chunks_by_doc.items():
        # Sort chunks by page and reading order for each document
        doc_chunks.sort(key=lambda x: (x[1]['chunk']['metadata']['page'], x[1]['chunk']['metadata']['reading_order']))
        
        # Track nearest section headings and title
        nearest_section_headings = []
        nearest_title = None
        
        for i, (chunk_id, chunk_data) in enumerate(doc_chunks):
            role = chunk_data['chunk']['metadata']['role']
            raw_text = chunk_data['chunk']['enriched_text']
            reading_order = chunk_data['chunk']['metadata']['reading_order']
            
            # Update nearest_section_headings
            if role == 'sectionHeading':
                # If the list is not empty and the current reading_order is adjacent to the last one, append the text
                if nearest_section_headings and \
                   chunk_data['chunk']['metadata']['reading_order'] == nearest_section_headings[-1][1] + 1:
                    nearest_section_headings.append((raw_text, reading_order))
                else:
                    # Otherwise, reset the list with the current heading
                    nearest_section_headings = [(raw_text, reading_order)]
            
            # Update nearest_title
            if role == 'title':
                nearest_title = raw_text
            
            # Prepare section headings as a list of texts
            section_heading_texts = [text for text, _ in nearest_section_headings] # There are cases in which we have two section headings on a page
            
            # Enrich the current chunk
            enriched_metadata = chunk_data['chunk']['metadata'].copy()
            enriched_metadata['sectionHeading'] = section_heading_texts
            enriched_metadata['title'] = nearest_title
            
            enriched_chunks[chunk_id] = {
                'chunk': {
                    'enriched_text': chunk_data['chunk']['enriched_text'],
                    'metadata': enriched_metadata
                },
                'chunk_id': chunk_id
            }
    
    return enriched_chunks


In [2254]:
chunks_ordered_with_meta=enrich_metadata(chunks_ordered)

In [2322]:
#for monzo
chunks_ordered_with_meta_monzo=enrich_metadata(chunks_ordered_monzo)


In [None]:
chunks_ordered_with_meta

{0: {'chunk': {'enriched_text': 'Letter from the Chairman of the Supervisory Board',
   'metadata': {'doc_name': 'deutschebank_annual_report_2023_8_9_15_120',
    'role': 'title',
    'reading_order': 3,
    'page': 1,
    'sectionHeading': [],
    'title': 'Letter from the Chairman of the Supervisory Board'}},
  'chunk_id': 0},
 1: {'chunk': {'enriched_text': 'Dear Shareholders,',
   'metadata': {'doc_name': 'deutschebank_annual_report_2023_8_9_15_120',
    'role': None,
    'reading_order': 4,
    'page': 1,
    'sectionHeading': [],
    'title': 'Letter from the Chairman of the Supervisory Board'}},
  'chunk_id': 1},
 2: {'chunk': {'enriched_text': "Last year, Deutsche Bank proved its resilience in a volatile environment, further increasing revenues and delivering its best pre-tax profit in 16 years. Thanks to its prudent risk management, the bank has a strong and stable balance sheet and has further strengthened its capital base. As a result, we are this year once again in a positi

In [1939]:
#count tokens to define chunks
import tiktoken
encoding=tiktoken.encoding_for_model("text-embedding-3-large")
def num_tokens_from_string(string: str, encoding_name='cl100k_base') ->int:
    """Returns the number of tokens in a text-string"""
    encoding=tiktoken.get_encoding(encoding_name)
    num_tokens=len(encoding.encode(string))
    return num_tokens

In [2040]:
#Write all chunks into a list
def transform_input(input_data: Dict[int, Dict]) -> List[Dict]:
    """
    Transform the input dictionary into a list of chunk dictionaries with the desired shape.

    Args:
    - input_data (dict): Original input data with a nested structure.

    Returns:
    - List[dict]: Transformed data in the desired flat structure.
    """
    transformed_chunks = []

    for _, chunk_data in input_data.items():
        chunk = chunk_data['chunk']  # Extract the 'chunk' dictionary
        transformed_chunk = {
            'chunk_id': chunk_data['chunk_id'],
            'enriched_text': chunk['enriched_text'],
            'metadata': chunk['metadata']
        }
        transformed_chunks.append(transformed_chunk)

    return transformed_chunks

#give all chunks a global chunk id which should enode which chunks belong together to bigger chunks with a maximum of 1000 tokens and an overlap window of 400 tokens
def process_chunks(chunks: List[Dict]) -> List[Dict]:
    """
    Process the chunks to assign global_chunk_id based on token size constraints, 
    applied separately for each doc_name.
    
    Args:
    - chunks (list of dict): List of chunk dictionaries, each with 'enriched_text' and metadata.
    
    Returns:
    - List of chunks with updated 'global_chunk_id'.
    """
    chunks=transform_input(chunks)
    def process_single_doc(doc_chunks: List[Dict]) -> List[Dict]:
        """Process chunks for a single doc_name."""
        global_chunk_id = 1
        current_group = []  # Temporary list to hold chunks in the current group
        max_group_size = 1000  # Max token size for a group
        result_chunks = []  # Final list of processed chunks
        
        for chunk in doc_chunks:
            current_group_token_sum = sum(c['token_size'] for c in current_group)
            
            if current_group_token_sum + chunk['token_size'] <= max_group_size:
                # Add to current group
                current_group.append(chunk)
            else:
                # Finalize current group
                for group_chunk in current_group:
                    group_chunk['global_chunk_id'] = global_chunk_id
                    result_chunks.append(group_chunk)
                
                # Handle overlap for continuity
                if len(current_group) > 0 and current_group[-1]['token_size'] <= 400:
                    overlap_chunk = current_group[-1].copy()
                    overlap_chunk['global_chunk_id'] = global_chunk_id + 1
                    result_chunks.append(overlap_chunk)
                    current_group = [overlap_chunk]
                else:
                    current_group = []
                
                # Start new group for the current chunk
                current_group.append(chunk)
                global_chunk_id += 1

        # Assign global_chunk_id to the remaining chunks in the last group
        for group_chunk in current_group:
            group_chunk['global_chunk_id'] = global_chunk_id
            result_chunks.append(group_chunk)

        return result_chunks
    
    # Organize chunks by doc_name
    chunks_by_doc = defaultdict(list)
    for chunk in chunks:
        chunk['token_size'] = num_tokens_from_string(chunk['enriched_text'])
        doc_name = chunk['metadata'].get('doc_name', 'default_doc')
        chunks_by_doc[doc_name].append(chunk)

    # Process each document's chunks separately
    processed_chunks = []
    for doc_name, doc_chunks in chunks_by_doc.items():
        processed_chunks.extend(process_single_doc(doc_chunks))
    
    return processed_chunks

In [2255]:
chunks_ordered_with_meta_groups=process_chunks(chunks_ordered_with_meta)

In [2323]:
#for monzo
chunks_ordered_with_meta_groups_monzo=process_chunks(chunks_ordered_with_meta_monzo)

In [2256]:
chunks_ordered_with_meta_groups

[{'chunk_id': 0,
  'enriched_text': 'Letter from the Chairman of the Supervisory Board',
  'metadata': {'doc_name': 'deutschebank_annual_report_2023_8_9_15_120',
   'role': 'title',
   'reading_order': 3,
   'page': 1,
   'sectionHeading': [],
   'title': 'Letter from the Chairman of the Supervisory Board'},
  'token_size': 9,
  'global_chunk_id': 1},
 {'chunk_id': 1,
  'enriched_text': 'Dear Shareholders,',
  'metadata': {'doc_name': 'deutschebank_annual_report_2023_8_9_15_120',
   'role': None,
   'reading_order': 4,
   'page': 1,
   'sectionHeading': [],
   'title': 'Letter from the Chairman of the Supervisory Board'},
  'token_size': 4,
  'global_chunk_id': 1},
 {'chunk_id': 2,
  'enriched_text': "Last year, Deutsche Bank proved its resilience in a volatile environment, further increasing revenues and delivering its best pre-tax profit in 16 years. Thanks to its prudent risk management, the bank has a strong and stable balance sheet and has further strengthened its capital base. As

In [2257]:
#Merge chunks with same doc_name according to their global chunk ids and with same title and sectionHeading to bigger chunks with token size <= 1000 and overlap <=400
def merge_chunks(chunks: List[Dict]) -> List[Dict]:
    """
    Merge consecutive chunks with the same doc_name, title, sectionHeading, and global_chunk_id.
    
    Args:
    - chunks (list of dict): List of chunk dictionaries.

    Returns:
    - List of merged chunks.
    """
    merged_chunks = []
    i = 0

    while i < len(chunks):
        # Initialize the merged chunk with the current chunk
        current_chunk = chunks[i]
        merged_chunk = {
            'doc_name': current_chunk['metadata']['doc_name'],
            'title': current_chunk['metadata']['title'],
            'sectionHeading': current_chunk['metadata']['sectionHeading'],
            'global_chunk_id': current_chunk['global_chunk_id'],
            'page': [current_chunk['metadata']['page']],
            'token_sizes': [current_chunk['token_size']],
            'reading_order': [current_chunk['metadata']['reading_order']],
            'chunk_ids': [current_chunk['chunk_id']],
            'enriched_text': f"Title: {current_chunk['metadata']['title']}, " # Include title and sectionHeading in the merged text
                             f"Section Heading: {current_chunk['metadata']['sectionHeading']}, "
                             f"{current_chunk['enriched_text']}"
        }
        
        # Check if the next chunk can be merged
        while (
            i + 1 < len(chunks) and
            chunks[i + 1]['metadata']['doc_name'] == merged_chunk['doc_name'] and
            chunks[i + 1]['metadata']['title'] == merged_chunk['title'] and
            chunks[i + 1]['metadata']['sectionHeading'] == merged_chunk['sectionHeading'] and
            chunks[i + 1]['global_chunk_id'] == merged_chunk['global_chunk_id']
        ):
            next_chunk = chunks[i + 1]
            
            # Merge details from the next chunk into the current merged chunk
            merged_chunk['page'].append(next_chunk['metadata']['page'])
            merged_chunk['token_sizes'].append(next_chunk['token_size'])
            merged_chunk['reading_order'].append(next_chunk['metadata']['reading_order'])
            merged_chunk['chunk_ids'].append(next_chunk['chunk_id'])
            merged_chunk['enriched_text'] += " " + next_chunk['enriched_text']
            
            # Move to the next chunk
            i += 1
        
        # Append the merged chunk to the result list
        merged_chunks.append(merged_chunk)
        # Move to the next chunk
        i += 1

    return merged_chunks

In [2258]:
chunks_merged_1=merge_chunks(chunks_ordered_with_meta_groups)

In [2324]:
#for monzo
chunks_merged_1_monzo=merge_chunks(chunks_ordered_with_meta_groups_monzo)

In [2059]:
#Merge chunks with same doc_name according to their global chunk ids to bigger chunks with token size <= 1000 and overlap <=400
def merge_consecutive_chunks(chunks: List[Dict]) -> List[Dict]:
    """
    Merge consecutive chunks that have the same doc_name and global_chunk_id.
    
    Args:
    - chunks (list of dict): List of chunk dictionaries.
    
    Returns:
    - List of merged chunks.
    """
    merged_chunks = []
    i = 0

    while i < len(chunks):
        # Initialize the merged chunk with the current chunk
        current_chunk = chunks[i]
        merged_chunk = {
            'doc_name': current_chunk['doc_name'],
            'global_chunk_id': current_chunk['global_chunk_id'],
            'titles': [current_chunk['title']],
            'sectionHeadings': [current_chunk['sectionHeading']],
            'page': [current_chunk['page']],
            'token_sizes': [current_chunk['token_sizes']],
            'reading_order': [current_chunk['reading_order']],
            'chunk_ids': [current_chunk['chunk_ids']],
            'enriched_text': current_chunk['enriched_text']
        }
        
        # Check if the next chunk can be merged based on doc_name and global_chunk_id
        while (
            i + 1 < len(chunks) and
            chunks[i + 1]['doc_name'] == merged_chunk['doc_name'] and
            chunks[i + 1]['global_chunk_id'] == merged_chunk['global_chunk_id']
        ):
            next_chunk = chunks[i + 1]
            
            # Merge details from the next chunk into the current merged chunk
            merged_chunk['titles'].append(next_chunk['title'])
            merged_chunk['sectionHeadings'].append(next_chunk['sectionHeading'])
            merged_chunk['page'].append(next_chunk['page'])
            merged_chunk['token_sizes'].append(next_chunk['token_sizes'])
            merged_chunk['reading_order'].append(next_chunk['reading_order'])
            merged_chunk['chunk_ids'].append(next_chunk['chunk_ids'])
            merged_chunk['enriched_text'] += " " + next_chunk['enriched_text']
            
            # Move to the next chunk
            i += 1
        
        # Append the merged chunk to the result list
        merged_chunks.append(merged_chunk)
        # Move to the next chunk
        i += 1

    return merged_chunks

In [2261]:
result_chunklist=merge_consecutive_chunks(chunks_merged_1)

In [2325]:
#for monzo
result_chunklist_monzo=merge_consecutive_chunks(chunks_merged_1_monzo)

In [2262]:
result_chunklist

[{'doc_name': 'deutschebank_annual_report_2023_8_9_15_120',
  'global_chunk_id': 1,
  'titles': ['Letter from the Chairman of the Supervisory Board'],
  'sectionHeadings': [[]],
  'page': [[1, 1, 1, 1, 1, 1, 1]],
  'token_sizes': [[9, 4, 112, 236, 137, 150, 197]],
  'reading_order': [[3, 4, 5, 6, 7, 8, 9]],
  'chunk_ids': [[0, 1, 2, 3, 4, 5, 6]],
  'enriched_text': "Title: Letter from the Chairman of the Supervisory Board, Section Heading: [], Letter from the Chairman of the Supervisory Board Dear Shareholders, Last year, Deutsche Bank proved its resilience in a volatile environment, further increasing revenues and delivering its best pre-tax profit in 16 years. Thanks to its prudent risk management, the bank has a strong and stable balance sheet and has further strengthened its capital base. As a result, we are this year once again in a position to substantially increase capital distributions to shareholders and we are pleased that the Supervisory Board and Management Board will propo

In [2263]:
#Create a dataframe from the result_chunklist
df_result_chunklist=pd.DataFrame(result_chunklist)

In [2326]:
#for monzo
df_result_chunklist_monzo=pd.DataFrame(result_chunklist_monzo)

In [2264]:
df_result_chunklist

Unnamed: 0,doc_name,global_chunk_id,titles,sectionHeadings,page,token_sizes,reading_order,chunk_ids,enriched_text
0,deutschebank_annual_report_2023_8_9_15_120,1,[Letter from the Chairman of the Supervisory B...,[[]],"[[1, 1, 1, 1, 1, 1, 1]]","[[9, 4, 112, 236, 137, 150, 197]]","[[3, 4, 5, 6, 7, 8, 9]]","[[0, 1, 2, 3, 4, 5, 6]]",Title: Letter from the Chairman of the Supervi...
1,deutschebank_annual_report_2023_8_9_15_120,2,[Letter from the Chairman of the Supervisory B...,[[]],"[[1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2]]","[[197, 197, 255, 102, 161, 70, 87, 89, 3, 4, 5...","[[9, 9, 10, 14, 15, 16, 17, 18, 19, 20, 21, 22]]","[[6, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]]",Title: Letter from the Chairman of the Supervi...
2,deutschebank_annual_report_2023_9_15_120,1,"[None, None]","[[], [Participation in meetings]]","[[1, 1, 1, 1, 1, 1, 1, 1, 1], [2, 2]]","[[102, 161, 70, 87, 89, 3, 4, 5, 7], [4, 53]]","[[3, 4, 5, 6, 7, 8, 9, 10, 11], [15, 16]]","[[17, 18, 19, 20, 21, 22, 23, 24, 25], [26, 27]]","Title: None, Section Heading: [], As part of p..."
3,deutschebank_annual_report_2023_9_15_120,2,[None],[[Participation in meetings]],"[[2, 2, 2]]","[[53, 53, 528]]","[[16, 16, 17]]","[[27, 27, 28]]","Title: None, Section Heading: ['Participation ..."
4,deutschebank_annual_report_2023_9_15_120,3,[None],[[Participation in meetings]],[[2]],[[602]],[[304]],[[29]],"Title: None, Section Heading: ['Participation ..."
5,deutschebank_annual_report_2023_120,1,[None],[[IFRS 9 - Sensitivities of Forward-Looking In...,[[1]],[[26]],[[3]],[[30]],"Title: None, Section Heading: ['IFRS 9 - Sensi..."
6,deutschebank_annual_report_2023_120,2,[None],[[IFRS 9 - Sensitivities of Forward-Looking In...,"[[1, 1, 1]]","[[26, 26, 1004]]","[[3, 3, 4]]","[[30, 30, 31]]","Title: None, Section Heading: ['IFRS 9 - Sensi..."
7,deutschebank_annual_report_2023_120,3,[None],[[IFRS 9 - Sensitivities of Forward-Looking In...,"[[1, 1]]","[[62, 36]]","[[41, 42]]","[[32, 33]]","Title: None, Section Heading: ['IFRS 9 - Sensi..."
8,deutschebank_annual_report_2023_120,4,[None],[[IFRS 9 - Sensitivities of Forward-Looking In...,"[[1, 1, 1]]","[[36, 36, 1139]]","[[42, 42, 43]]","[[33, 33, 34]]","Title: None, Section Heading: ['IFRS 9 - Sensi..."
9,deutschebank_annual_report_2023_120,5,"[None, None]",[[IFRS 9 - Sensitivities of Forward-Looking In...,"[[1], [1, 1]]","[[104], [26, 7]]","[[80], [81, 82]]","[[35], [36, 37]]","Title: None, Section Heading: ['IFRS 9 - Sensi..."


### Embedding the text chunks with text-embedding-3-large

In [2099]:
# Load credentials from .env file
load_dotenv()

AZURE_EMBEDDING__API_KEY = os.getenv("AZURE_EMBEDDING_API_KEY")
AZURE_EMBEDDING__ENDPOINT = os.getenv("AZURE_EMBEDDING_ENDPOINT")

client = AzureOpenAI(
  api_key = AZURE_EMBEDDING__API_KEY
  azure_endpoint = AZURE_EMBEDDING__ENDPOINT)

def get_embeddings(text_list, model_name="text-embedding-3-large"):
    """Get embeddings for a list of text strings using the specified model."""
    response = client.embeddings.create(input=text_list, model=model_name)
    return response

In [2265]:
# Generate embeddings for all chunks
df_result_chunklist['embedding_text-embedding-3-large'] = None

# Loop through the DataFrame and assign embeddings
for i in range(len(df_result_chunklist)):
    text_list = [df_result_chunklist.iloc[i]['enriched_text']]  # Wrap the text in a list
    embedding = get_embeddings(text_list).data[0].embedding     # Get the embedding
    # Assign the entire embedding as an object to the column
    df_result_chunklist.at[i, 'embedding_text-embedding-3-large'] = embedding

In [2327]:
#for monzo
# Generate embeddings for all chunks
df_result_chunklist_monzo['embedding_text-embedding-3-large'] = None

# Loop through the DataFrame and assign embeddings
for i in range(len(df_result_chunklist_monzo)):
    text_list = [df_result_chunklist_monzo.iloc[i]['enriched_text']]  # Wrap the text in a list
    embedding = get_embeddings(text_list).data[0].embedding     # Get the embedding
    # Assign the entire embedding as an object to the column
    df_result_chunklist_monzo.at[i, 'embedding_text-embedding-3-large'] = embedding

In [2266]:
df_result_chunklist

Unnamed: 0,doc_name,global_chunk_id,titles,sectionHeadings,page,token_sizes,reading_order,chunk_ids,enriched_text,embedding_text-embedding-3-large
0,deutschebank_annual_report_2023_8_9_15_120,1,[Letter from the Chairman of the Supervisory B...,[[]],"[[1, 1, 1, 1, 1, 1, 1]]","[[9, 4, 112, 236, 137, 150, 197]]","[[3, 4, 5, 6, 7, 8, 9]]","[[0, 1, 2, 3, 4, 5, 6]]",Title: Letter from the Chairman of the Supervi...,"[0.04903711378574371, 0.00020914606284350157, ..."
1,deutschebank_annual_report_2023_8_9_15_120,2,[Letter from the Chairman of the Supervisory B...,[[]],"[[1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2]]","[[197, 197, 255, 102, 161, 70, 87, 89, 3, 4, 5...","[[9, 9, 10, 14, 15, 16, 17, 18, 19, 20, 21, 22]]","[[6, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]]",Title: Letter from the Chairman of the Supervi...,"[0.04739595204591751, -0.018278546631336212, -..."
2,deutschebank_annual_report_2023_9_15_120,1,"[None, None]","[[], [Participation in meetings]]","[[1, 1, 1, 1, 1, 1, 1, 1, 1], [2, 2]]","[[102, 161, 70, 87, 89, 3, 4, 5, 7], [4, 53]]","[[3, 4, 5, 6, 7, 8, 9, 10, 11], [15, 16]]","[[17, 18, 19, 20, 21, 22, 23, 24, 25], [26, 27]]","Title: None, Section Heading: [], As part of p...","[0.05034126341342926, -0.02449539117515087, -0..."
3,deutschebank_annual_report_2023_9_15_120,2,[None],[[Participation in meetings]],"[[2, 2, 2]]","[[53, 53, 528]]","[[16, 16, 17]]","[[27, 27, 28]]","Title: None, Section Heading: ['Participation ...","[0.047187261283397675, -0.025101494044065475, ..."
4,deutschebank_annual_report_2023_9_15_120,3,[None],[[Participation in meetings]],[[2]],[[602]],[[304]],[[29]],"Title: None, Section Heading: ['Participation ...","[0.034215424209833145, -0.02564658597111702, -..."
5,deutschebank_annual_report_2023_120,1,[None],[[IFRS 9 - Sensitivities of Forward-Looking In...,[[1]],[[26]],[[3]],[[30]],"Title: None, Section Heading: ['IFRS 9 - Sensi...","[-0.008042172528803349, 0.038635168224573135, ..."
6,deutschebank_annual_report_2023_120,2,[None],[[IFRS 9 - Sensitivities of Forward-Looking In...,"[[1, 1, 1]]","[[26, 26, 1004]]","[[3, 3, 4]]","[[30, 30, 31]]","Title: None, Section Heading: ['IFRS 9 - Sensi...","[-0.005911923013627529, 0.00321170873939991, -..."
7,deutschebank_annual_report_2023_120,3,[None],[[IFRS 9 - Sensitivities of Forward-Looking In...,"[[1, 1]]","[[62, 36]]","[[41, 42]]","[[32, 33]]","Title: None, Section Heading: ['IFRS 9 - Sensi...","[-0.009254434145987034, 0.03645535930991173, -..."
8,deutschebank_annual_report_2023_120,4,[None],[[IFRS 9 - Sensitivities of Forward-Looking In...,"[[1, 1, 1]]","[[36, 36, 1139]]","[[42, 42, 43]]","[[33, 33, 34]]","Title: None, Section Heading: ['IFRS 9 - Sensi...","[-0.015078556723892689, 0.011807754635810852, ..."
9,deutschebank_annual_report_2023_120,5,"[None, None]",[[IFRS 9 - Sensitivities of Forward-Looking In...,"[[1], [1, 1]]","[[104], [26, 7]]","[[80], [81, 82]]","[[35], [36, 37]]","Title: None, Section Heading: ['IFRS 9 - Sensi...","[-0.0009420232963748276, 0.007235208060592413,..."


### Semantic Search

In [2267]:
# Define queries based on four categories which are of importance to describe the business model of a bank
queries = {
    "ownership_structure": [
        "Details about the bank's ownership, including shareholders, stakeholders, and their roles in governance.",
        "Information about the distribution of ownership rights and shares within the organization.",
        "Key details about the bank's ownership structure, including majority and minority stakeholders."
    ],
    "strategy_corporate_structure": [
        "Insights into the bank's universal banking strategy and its focus on diversified financial services.",
        "The bank's approach to achieving market leadership, including its competitive position within its business areas.",
        "Details about the bank's corporate structure, including organizational hierarchy and regional operations."
    ],
    "pnl_balance_sheet": [
        "Summary of significant changes in the bank's profit and loss statement compared to the previous year.",
        "Details about the financial performance, including major changes in revenue, expenses, or profitability.",
        "Insights into planned or implemented changes in the bank's balance sheet, such as asset reallocation or liabilities management."
    ],
    "climate_environmental_risks": [
        "Details about the bank's policies and actions to address climate risks and environmental sustainability.",
        "Information about the bank's exposure to environmental risks, including climate-related financial disclosures.",
        "Insights into how the bank evaluates and manages risks related to sustainability and carbon emissions."
    ]
}

In [2269]:
#Compute the cosine similarity between the embeddings of the chunks and the queries
df_result_chunklist['text-embedding-3-large_ownership_structure'] = None
df_result_chunklist['text-embedding-3-large_strategy_corporate_structure'] = None
df_result_chunklist['text-embedding-3-large_pnl_balance_sheet'] = None
df_result_chunklist['text-embedding-3-large_climate_environmental_risks'] = None

# Loop through each row in the DataFrame
for i in range(len(df_result_chunklist)):
    # Extract the text and get embeddings
    text_list = [df_result_chunklist.iloc[i]['enriched_text']]  # Wrap the text in a list
    embedding = df_result_chunklist.iloc[i]['embedding_text-embedding-3-large']   # Get the embeddings
    embedding = torch.tensor(embedding)  # Convert to 2D tensor
    
    # Generate embeddings for query sentences and convert them to tensors
    embedding_ownership_structure = torch.tensor([
        get_embeddings(queries['ownership_structure']).data[i].embedding for i in range(len(queries['ownership_structure']))])
    
    embedding_strategy_corporate_structure = torch.tensor([
        get_embeddings(queries['strategy_corporate_structure']).data[i].embedding for i in range(len(queries['strategy_corporate_structure']))])
    
    embedding_pnl_balance_sheet = torch.tensor([
        get_embeddings(queries['pnl_balance_sheet']).data[i].embedding for i in range(len(queries['pnl_balance_sheet']))])
    
    embedding_climate_environmental_risks = torch.tensor([
        get_embeddings(queries['climate_environmental_risks']).data[i].embedding for i in range(len(queries['climate_environmental_risks']))])
    

    # Compute cosine similarities and find max similarity for each category
    df_result_chunklist.at[i, 'text-embedding-3-large_ownership_structure'] = max(
        [util.cos_sim(emb, embedding).item() for emb in embedding_ownership_structure]
    )
    df_result_chunklist.at[i, 'text-embedding-3-large_strategy_corporate_structure'] = max(
        [util.cos_sim(emb, embedding).item() for emb in embedding_strategy_corporate_structure]
    )
    df_result_chunklist.at[i, 'text-embedding-3-large_pnl_balance_sheet'] = max(
        [util.cos_sim(emb, embedding).item() for emb in embedding_pnl_balance_sheet]
    )
    df_result_chunklist.at[i, 'text-embedding-3-large_climate_environmental_risks'] = max(
        [util.cos_sim(emb, embedding).item() for emb in embedding_climate_environmental_risks]
    )

# Rank the DataFrame based on similarity scores
df_result_chunklist_ranked = df_result_chunklist.sort_values(
    by=[
        'text-embedding-3-large_ownership_structure',
        'text-embedding-3-large_strategy_corporate_structure',
        'text-embedding-3-large_pnl_balance_sheet',
        'text-embedding-3-large_climate_environmental_risks'
    ],
    ascending=[False, False, False, False]
)

In [2328]:
#for monzo
df_result_chunklist_monzo['text-embedding-3-large_ownership_structure'] = None
df_result_chunklist_monzo['text-embedding-3-large_strategy_corporate_structure'] = None
df_result_chunklist_monzo['text-embedding-3-large_pnl_balance_sheet'] = None
df_result_chunklist_monzo['text-embedding-3-large_climate_environmental_risks'] = None

# Loop through each row in the DataFrame

for i in range(len(df_result_chunklist_monzo)):
    # Extract the text and get embeddings
    text_list = [df_result_chunklist_monzo.iloc[i]['enriched_text']]  # Wrap the text in a list
    embedding = df_result_chunklist_monzo.iloc[i]['embedding_text-embedding-3-large']   # Get the embeddings
    embedding = torch.tensor(embedding)  # Convert to 2D tensor
    
    # Generate embeddings for query sentences and convert them to tensors
    embedding_ownership_structure = torch.tensor([
        get_embeddings(queries['ownership_structure']).data[i].embedding for i in range(len(queries['ownership_structure']))])
    
    embedding_strategy_corporate_structure = torch.tensor([
        get_embeddings(queries['strategy_corporate_structure']).data[i].embedding for i in range(len(queries['strategy_corporate_structure']))])
    
    embedding_pnl_balance_sheet = torch.tensor([
        get_embeddings(queries['pnl_balance_sheet']).data[i].embedding for i in range(len(queries['pnl_balance_sheet']))])
    
    embedding_climate_environmental_risks = torch.tensor([
        get_embeddings(queries['climate_environmental_risks']).data[i].embedding for i in range(len(queries['climate_environmental_risks']))])
    

    # Compute cosine similarities and find max similarity for each category
    df_result_chunklist_monzo.at[i, 'text-embedding-3-large_ownership_structure'] = max(
        [util.cos_sim(emb, embedding).item() for emb in embedding_ownership_structure]
    )
    df_result_chunklist_monzo.at[i, 'text-embedding-3-large_strategy_corporate_structure'] = max(
        [util.cos_sim(emb, embedding).item() for emb in embedding_strategy_corporate_structure]
    )
    df_result_chunklist_monzo.at[i, 'text-embedding-3-large_pnl_balance_sheet'] = max(
        [util.cos_sim(emb, embedding).item() for emb in embedding_pnl_balance_sheet]
    )
    df_result_chunklist_monzo.at[i, 'text-embedding-3-large_climate_environmental_risks'] = max(
        [util.cos_sim(emb, embedding).item() for emb in embedding_climate_environmental_risks]
    )

# Rank the DataFrame based on similarity scores
df_result_chunklist_ranked_monzo = df_result_chunklist_monzo.sort_values(
    by=[
        'text-embedding-3-large_ownership_structure',
        'text-embedding-3-large_strategy_corporate_structure',
        'text-embedding-3-large_pnl_balance_sheet',
        'text-embedding-3-large_climate_environmental_risks'
    ],
    ascending=[False, False, False, False]
)


In [2128]:
df_result_chunklist_ranked

Unnamed: 0,doc_name,global_chunk_id,titles,sectionHeadings,page,token_sizes,reading_order,chunk_ids,enriched_text,text-embedding-3-large_ownership_structure,text-embedding-3-large_strategy_corporate_structure,text-embedding-3-large_pnl_balance_sheet,text-embedding-3-large_climate_environmental_risks
51,monzo_annual report_2024_55_64,1,[None],[[Our governance priorities in FY2024]],"[[1, 1, 1, 1]]","[[7, 93, 62, 115]]","[[3, 4, 5, 9]]","[[193, 194, 195, 196]]","Title: None, Section Heading: ['Our governance...",0.468743,0.428906,0.376329,0.394623
52,monzo_annual report_2024_55_64,2,"[None, How our governance works, How our gover...","[[Our governance priorities in FY2024], [Our g...","[[1, 1, 1], [2], [2, 2, 2]]","[[115, 115, 768], [4], [12, 33, 25]]","[[9, 9, 10], [17], [18, 19, 20]]","[[196, 196, 197], [198], [199, 200, 201]]","Title: None, Section Heading: ['Our governance...",0.448966,0.508932,0.334033,0.33599
53,monzo_annual report_2024_55_64,3,[How our governance works],[[We've developed an approach to governance th...,"[[2, 2, 2]]","[[25, 25, 824]]","[[20, 20, 21]]","[[201, 201, 202]]","Title: How our governance works, Section Headi...",0.444611,0.504684,0.350961,0.417145
21,deutschebank_annual_report_2022_9_15_119,1,"[None, Participation in meetings]","[[], []]","[[1, 1, 1], [2, 2]]","[[134, 84, 39], [4, 48]]","[[3, 4, 5], [9, 10]]","[[72, 73, 74], [75, 76]]","Title: None, Section Heading: [], When organiz...",0.420748,0.360184,0.376873,0.403098
0,deutschebank_annual_report_2023_8_9_15_120,1,[Letter from the Chairman of the Supervisory B...,[[]],"[[1, 1, 1, 1, 1, 1, 1]]","[[9, 4, 112, 236, 137, 150, 197]]","[[3, 4, 5, 6, 7, 8, 9]]","[[0, 1, 2, 3, 4, 5, 6]]",Title: Letter from the Chairman of the Supervi...,0.401525,0.406437,0.458029,0.433241
...,...,...,...,...,...,...,...,...,...,...,...,...,...
41,deutschebank_quarterly_presentation_q2_2024_10_22,1,"[Adjusted costs - Q2 2024 (YoY) In € bn, unles...",[[]],"[[1, 1]]","[[19, 637]]","[[1, 2]]","[[139, 140]]","Title: Adjusted costs - Q2 2024 (YoY) In € bn,...",0.260303,0.315389,0.445951,0.328664
25,deutschebank_annual_report_2022_118,3,[None],[[IFRS 9 - Sensitivities of Forward-Looking In...,"[[1, 1, 1, 1, 1, 1, 1]]","[[4, 3, 3, 2, 2, 4, 62]]","[[22, 27, 32, 37, 42, 47, 52]]","[[91, 92, 93, 94, 95, 96, 97]]","Title: None, Section Heading: ['IFRS 9 - Sensi...",0.251251,0.24612,0.429944,0.384537
14,deutschebank_annual_report_2023_420,3,[None],[[]],"[[1, 1]]","[[6, 6]]","[[16, 61]]","[[50, 51]]","Title: None, Section Heading: [], (LTA) (60%) ...",0.247425,0.214007,0.221449,0.159334
30,deutschebank_quarterly_presentation_q2_2024_8_...,3,"[Q2 2024 highlights In € bn, unless stated oth...",[[Financial results]],"[[1, 1, 1, 1]]","[[2, 5, 8, 8]]","[[3, 4, 5, 6]]","[[103, 104, 105, 106]]","Title: Q2 2024 highlights In € bn, unless stat...",0.22336,0.28475,0.440942,0.262745


### Lexical Search with bm25

In [2131]:
# Keywords around the topics of ownership structure, corporate strategy, P&L statement, and climate/environmental risks
# Will use keywords for lexical search 
keywords = {
    "ownership_structure": [
        "ownership structure",
        "shareholders",
        "stakeholders",
        "equity holders",
        "voting rights",
        "capital ownership",
        "board of directors",
        "governance structure",
        "ownership percentage",
        "major investors",
        "parent company",
        "subsidiaries",
        "affiliated entities",
        "institutional investors",
        "private equity",
    ],
    "strategy_corporate_structure": [
        "corporate strategy",
        "business strategy",
        "universal banking strategy",
        "regional focus",
        "market leadership",
        "competitive position",
        "business area strategy",
        "corporate structure",
        "strategic focus",
        "growth strategy",
        "business model",
        "strategic plan",
        "market dominance",
        "core operations",
        "strategic alignment",
    ],
    "pnl_balance_sheet": [
        "profit and loss statement",
        "P&L statement",
        "income statement",
        "balance sheet",
        "financial performance",
        "financial results",
        "year-over-year changes",
        "fiscal analysis",
        "net income",
        "operating income",
        "revenue changes",
        "assets and liabilities",
        "financial metrics",
        "profitability",
        "cost reduction",
        "financial forecast",
    ],
    "climate_environmental_risks": [
        "climate risks",
        "environmental risks",
        "sustainability",
        "climate change impact",
        "greenhouse gas emissions",
        "environmental sustainability",
        "climate resilience",
        "carbon footprint",
        "sustainable finance",
        "environmental strategy",
        "climate-related financial disclosures",
        "green investments",
        "renewable energy transition",
        "climate-related risks",
        "ecological impact",
    ],
}

In [2270]:
#lexical search with bm25
from rank_bm25 import BM25Okapi

df_result_chunklist['lexical-search_ownership_structure'] = None
df_result_chunklist['lexical-search_strategy_corporate_structure'] = None
df_result_chunklist['lexical-search_pnl_balance_sheet'] = None
df_result_chunklist['lexical-search_climate_environmental_risks'] = None

for i in range(len(df_result_chunklist)):
    # Extract the text and get embeddings
    text_list = [df_result_chunklist.iloc[i]['enriched_text']]  # Wrap the text in a list
    bm25 = BM25Okapi(text_list)

    df_result_chunklist.at[i,'lexical-search_ownership_structure'] = max([bm25.get_scores(term) for term in keywords['ownership_structure']])
    df_result_chunklist.at[i,'lexical-search_strategy_corporate_structure'] = max([bm25.get_scores(term) for term in keywords['strategy_corporate_structure']])
    df_result_chunklist.at[i,'lexical-search_pnl_balance_sheet'] = max([bm25.get_scores(term) for term in keywords['pnl_balance_sheet']])
    df_result_chunklist.at[i,'lexical-search_climate_environmental_risks'] = max([bm25.get_scores(term) for term in keywords['climate_environmental_risks']])


In [2329]:
#for monzo
#lexical search with bm25
from rank_bm25 import BM25Okapi

df_result_chunklist_monzo['lexical-search_ownership_structure'] = None
df_result_chunklist_monzo['lexical-search_strategy_corporate_structure'] = None
df_result_chunklist_monzo['lexical-search_pnl_balance_sheet'] = None
df_result_chunklist_monzo['lexical-search_climate_environmental_risks'] = None

for i in range(len(df_result_chunklist_monzo)):
    # Extract the text and get embeddings
    text_list = [df_result_chunklist_monzo.iloc[i]['enriched_text']]  # Wrap the text in a list
    bm25 = BM25Okapi(text_list)

    df_result_chunklist_monzo.at[i,'lexical-search_ownership_structure'] = max([bm25.get_scores(term) for term in keywords['ownership_structure']])
    df_result_chunklist_monzo.at[i,'lexical-search_strategy_corporate_structure'] = max([bm25.get_scores(term) for term in keywords['strategy_corporate_structure']])
    df_result_chunklist_monzo.at[i,'lexical-search_pnl_balance_sheet'] = max([bm25.get_scores(term) for term in keywords['pnl_balance_sheet']])
    df_result_chunklist_monzo.at[i,'lexical-search_climate_environmental_risks'] = max([bm25.get_scores(term) for term in keywords['climate_environmental_risks']])


In [2271]:
df_result_chunklist

Unnamed: 0,doc_name,global_chunk_id,titles,sectionHeadings,page,token_sizes,reading_order,chunk_ids,enriched_text,embedding_text-embedding-3-large,text-embedding-3-large_ownership_structure,text-embedding-3-large_strategy_corporate_structure,text-embedding-3-large_pnl_balance_sheet,text-embedding-3-large_climate_environmental_risks,lexical-search_ownership_structure,lexical-search_strategy_corporate_structure,lexical-search_pnl_balance_sheet,lexical-search_climate_environmental_risks
0,deutschebank_annual_report_2023_8_9_15_120,1,[Letter from the Chairman of the Supervisory B...,[[]],"[[1, 1, 1, 1, 1, 1, 1]]","[[9, 4, 112, 236, 137, 150, 197]]","[[3, 4, 5, 6, 7, 8, 9]]","[[0, 1, 2, 3, 4, 5, 6]]",Title: Letter from the Chairman of the Supervi...,"[0.04903711378574371, 0.00020914606284350157, ...",0.401525,0.406437,0.458029,0.433241,-8.158120538204775,-9.525489197429014,-6.819227166234784,-8.833458119054727
1,deutschebank_annual_report_2023_8_9_15_120,2,[Letter from the Chairman of the Supervisory B...,[[]],"[[1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2]]","[[197, 197, 255, 102, 161, 70, 87, 89, 3, 4, 5...","[[9, 9, 10, 14, 15, 16, 17, 18, 19, 20, 21, 22]]","[[6, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]]",Title: Letter from the Chairman of the Supervi...,"[0.04739595204591751, -0.018278546631336212, -...",0.38569,0.367267,0.395422,0.378512,-8.174136327216333,-9.547490038123463,-6.833292097883971,-8.854181694697521
2,deutschebank_annual_report_2023_9_15_120,1,"[None, None]","[[], [Participation in meetings]]","[[1, 1, 1, 1, 1, 1, 1, 1, 1], [2, 2]]","[[102, 161, 70, 87, 89, 3, 4, 5, 7], [4, 53]]","[[3, 4, 5, 6, 7, 8, 9, 10, 11], [15, 16]]","[[17, 18, 19, 20, 21, 22, 23, 24, 25], [26, 27]]","Title: None, Section Heading: [], As part of p...","[0.05034126341342926, -0.02449539117515087, -0...",0.3923,0.344329,0.344552,0.341645,-8.103537516785565,-9.462371405330316,-6.797174408970884,-8.774454048118049
3,deutschebank_annual_report_2023_9_15_120,2,[None],[[Participation in meetings]],"[[2, 2, 2]]","[[53, 53, 528]]","[[16, 16, 17]]","[[27, 27, 28]]","Title: None, Section Heading: ['Participation ...","[0.047187261283397675, -0.025101494044065475, ...",0.371209,0.321577,0.331446,0.313279,-7.929430518033133,-9.46735429596282,-6.809668939584883,-8.613416503096381
4,deutschebank_annual_report_2023_9_15_120,3,[None],[[Participation in meetings]],[[2]],[[602]],[[304]],[[29]],"Title: None, Section Heading: ['Participation ...","[0.034215424209833145, -0.02564658597111702, -...",0.393329,0.360115,0.349675,0.343064,-7.856025119475377,-9.46463340630513,-6.804495031723822,-8.540918690446901
5,deutschebank_annual_report_2023_120,1,[None],[[IFRS 9 - Sensitivities of Forward-Looking In...,[[1]],[[26]],[[3]],[[30]],"Title: None, Section Heading: ['IFRS 9 - Sensi...","[-0.008042172528803349, 0.038635168224573135, ...",0.295698,0.299006,0.399108,0.393919,-5.765859683904272,-7.192968313650932,-5.718062683469386,-6.888253517348961
6,deutschebank_annual_report_2023_120,2,[None],[[IFRS 9 - Sensitivities of Forward-Looking In...,"[[1, 1, 1]]","[[26, 26, 1004]]","[[3, 3, 4]]","[[30, 30, 31]]","Title: None, Section Heading: ['IFRS 9 - Sensi...","[-0.005911923013627529, 0.00321170873939991, -...",0.317738,0.310443,0.444305,0.441974,-7.98938313064192,-9.454834636344913,-6.80421176090632,-8.681348628083244
7,deutschebank_annual_report_2023_120,3,[None],[[IFRS 9 - Sensitivities of Forward-Looking In...,"[[1, 1]]","[[62, 36]]","[[41, 42]]","[[32, 33]]","Title: None, Section Heading: ['IFRS 9 - Sensi...","[-0.009254434145987034, 0.03645535930991173, -...",0.282604,0.260958,0.444551,0.399516,-7.145626115720379,-8.371786673655055,-6.483170408477891,-7.847770577028074
8,deutschebank_annual_report_2023_120,4,[None],[[IFRS 9 - Sensitivities of Forward-Looking In...,"[[1, 1, 1]]","[[36, 36, 1139]]","[[42, 42, 43]]","[[33, 33, 34]]","Title: None, Section Heading: ['IFRS 9 - Sensi...","[-0.015078556723892689, 0.011807754635810852, ...",0.312839,0.307072,0.4483,0.43765,-8.010980358587469,-9.484401511016772,-6.81018591492599,-8.697640814769006
9,deutschebank_annual_report_2023_120,5,"[None, None]",[[IFRS 9 - Sensitivities of Forward-Looking In...,"[[1], [1, 1]]","[[104], [26, 7]]","[[80], [81, 82]]","[[35], [36, 37]]","Title: None, Section Heading: ['IFRS 9 - Sensi...","[-0.0009420232963748276, 0.007235208060592413,...",0.369569,0.409871,0.465776,0.491273,-7.677958705996391,-8.976204266878653,-6.612860362064481,-8.495263571524763


In [2139]:
# Step 1: Normalize Scores
def normalize_scores(scores):
    """Normalize a list of scores to the range [0, 1]."""
    min_score = min(scores)
    max_score = max(scores)
    if max_score - min_score == 0:  # Avoid division by zero
        return [0.5] * len(scores)  # Neutral normalization if all scores are the same
    return [(score - min_score) / (max_score - min_score) for score in scores]

# Step 2: Normalize All Scores in the DataFrame
categories = [
    "ownership_structure",
    "strategy_corporate_structure",
    "pnl_balance_sheet",
    "climate_environmental_risks",
]

# Create normalized columns for semantic and lexical scores
for category in categories:
    df_result_chunklist[f'normalized_text-embedding_{category}'] = normalize_scores(
        df_result_chunklist[f'text-embedding-3-large_{category}'].fillna(0).tolist()
    )
    df_result_chunklist[f'normalized_lexical_{category}'] = normalize_scores(
        df_result_chunklist[f'lexical-search_{category}'].fillna(0).tolist()
    )

# Step 3: Combine Normalized Scores
for category in categories:
    final_column = f'final_{category}'
    semantic_column = f'normalized_text-embedding_{category}'
    lexical_column = f'normalized_lexical_{category}'

    df_result_chunklist[final_column] = (
        0.5 * df_result_chunklist[semantic_column] +
        0.4 * df_result_chunklist[lexical_column]
    )


In [2330]:
#for monzo
# Create normalized columns for semantic and lexical scores
for category in categories:
    df_result_chunklist_monzo[f'normalized_text-embedding_{category}'] = normalize_scores(
        df_result_chunklist_monzo[f'text-embedding-3-large_{category}'].fillna(0).tolist()
    )
    df_result_chunklist_monzo[f'normalized_lexical_{category}'] = normalize_scores(
        df_result_chunklist_monzo[f'lexical-search_{category}'].fillna(0).tolist()
    )

# Step 3: Combine Normalized Scores
for category in categories:
    final_column = f'final_{category}'
    semantic_column = f'normalized_text-embedding_{category}'
    lexical_column = f'normalized_lexical_{category}'

    df_result_chunklist_monzo[final_column] = (
        0.5 * df_result_chunklist_monzo[semantic_column] +
        0.4 * df_result_chunklist_monzo[lexical_column]
    )

In [2141]:
df_result_chunklist

Unnamed: 0,doc_name,global_chunk_id,titles,sectionHeadings,page,token_sizes,reading_order,chunk_ids,enriched_text,text-embedding-3-large_ownership_structure,...,final_pnl_balance_sheet,final_climate_environmental_risks,normalized_text-embedding_ownership_structure,normalized_lexical_ownership_structure,normalized_text-embedding_strategy_corporate_structure,normalized_lexical_strategy_corporate_structure,normalized_text-embedding_pnl_balance_sheet,normalized_lexical_pnl_balance_sheet,normalized_text-embedding_climate_environmental_risks,normalized_lexical_climate_environmental_risks
0,deutschebank_annual_report_2023_8_9_15_120,1,[Letter from the Chairman of the Supervisory B...,[[]],"[[1, 1, 1, 1, 1, 1, 1]]","[[9, 4, 112, 236, 137, 150, 197]]","[[3, 4, 5, 6, 7, 8, 9]]","[[0, 1, 2, 3, 4, 5, 6]]",Title: Letter from the Chairman of the Supervi...,0.401525,...,0.351948,0.416493,0.772382,0.002688,0.652470,0.003599,0.700883,0.003767,0.830049,0.003671
1,deutschebank_annual_report_2023_8_9_15_120,2,[Letter from the Chairman of the Supervisory B...,[[]],"[[1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2]]","[[197, 197, 255, 102, 161, 70, 87, 89, 3, 4, 5...","[[9, 9, 10, 14, 15, 16, 17, 18, 19, 20, 21, 22]]","[[6, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]]",Title: Letter from the Chairman of the Supervi...,0.38569,...,0.257703,0.339609,0.718762,0.000000,0.519656,0.000000,0.515406,0.000000,0.679217,0.000000
2,deutschebank_annual_report_2023_9_15_120,1,"[None, None]","[[], [Participation in meetings]]","[[1, 1, 1, 1, 1, 1, 1, 1, 1], [2, 2]]","[[102, 161, 70, 87, 89, 3, 4, 5, 7], [4, 53]]","[[3, 4, 5, 6, 7, 8, 9, 10, 11], [15, 16]]","[[17, 18, 19, 20, 21, 22, 23, 24, 25], [26, 27]]","Title: None, Section Heading: [], As part of p...",0.3923,...,0.186219,0.291397,0.741144,0.011847,0.441882,0.013923,0.364699,0.009674,0.571495,0.014124
3,deutschebank_annual_report_2023_9_15_120,2,[None],[[Participation in meetings]],"[[2, 2, 2]]","[[53, 53, 589]]","[[16, 16, 17]]","[[27, 27, 28]]","Title: None, Section Heading: ['Participation ...",0.353522,...,0.173486,0.266714,0.609833,0.039878,0.379754,0.010970,0.342980,0.004991,0.500267,0.041450
4,deutschebank_annual_report_2023_9_15_120,3,[None],[[Participation in meetings]],[[2]],[[611]],[[304]],[[29]],"Title: None, Section Heading: ['Participation ...",0.386793,...,0.195351,0.316336,0.722496,0.053313,0.502834,0.013267,0.384593,0.007637,0.588324,0.055434
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
57,monzo_annual report_2024_34_61,2,"[None, None, None]","[[], [Board diversity and engagement], [Board ...","[[1, 1, 1, 1], [2], [2, 2, 2, 2]]","[[4, 4, 410, 46], [4], [12, 86, 49, 51]]","[[29, 29, 40, 50], [53], [54, 55, 56, 58]]","[[228, 228, 229, 230], [231], [232, 233, 234, ...","Title: None, Section Heading: [], 0.94% 0.94% ...",0.347206,...,0.230633,0.421548,0.588446,0.008650,0.388484,0.011523,0.453981,0.009107,0.835717,0.009224
58,monzo_annual report_2024_34_61,3,[None],"[[Board diversity and engagement, We're commit...","[[2, 2, 2, 2]]","[[51, 51, 848, 22]]","[[58, 58, 59, 74]]","[[235, 235, 236, 237]]","Title: None, Section Heading: ['Board diversit...",0.388913,...,0.185421,0.327173,0.729676,0.006240,0.449031,0.010384,0.365151,0.007115,0.648341,0.007505
59,monzo_annual report_2024_46,1,[Our approach to risk management],[[]],"[[1, 1, 1, 1, 1, 1]]","[[5, 37, 61, 62, 51, 43]]","[[3, 4, 5, 6, 7, 8]]","[[238, 239, 240, 241, 242, 243]]","Title: Our approach to risk management, Sectio...",0.310937,...,0.165186,0.497456,0.465629,0.032705,0.613783,0.037731,0.305883,0.030612,0.972520,0.027990
60,monzo_annual report_2024_46,2,[Our approach to risk management],[[]],"[[1, 1, 1]]","[[43, 43, 872]]","[[8, 8, 9]]","[[243, 243, 244]]","Title: Our approach to risk management, Sectio...",0.327871,...,0.180130,0.451695,0.522972,0.009831,0.543621,0.012448,0.356836,0.004282,0.902531,0.001074


### Generate a report based on the information in the retrieved text chunks

In [2192]:
def generate_insights_report(df_result_chunklist, threshold=0.5):
    """
    Generates an insights report for each document based on relevant text chunks.

    Args:
        df_result_chunklist (pd.DataFrame): DataFrame containing text chunks and scores.
        headers (dict): Headers for the GPT-4 API request.
        threshold (float): Relevance threshold for selecting important chunks.

    Returns:
        dict: A dictionary with document names as keys and their respective reports as values.
    """

    # Configuration
    headers = {
        "Content-Type": "application/json",
        "api-key": AZURE_OPENAI_API_KEY,
    }
    # Filter chunks based on the threshold
    relevant_chunks = df_result_chunklist[
        (df_result_chunklist['final_ownership_structure'] >= threshold) |
        (df_result_chunklist['final_strategy_corporate_structure'] >= threshold) |
        (df_result_chunklist['final_pnl_balance_sheet'] >= threshold) |
        (df_result_chunklist['final_climate_environmental_risks'] >= threshold)
    ]
    
    # Initialize dictionary to store reports
    reports = {}

    
    # Get all relevant text chunks for the current document
    doc_chunks = relevant_chunks['enriched_text'].tolist()

    # Combine all text chunks into a single string
    combined_text = "\n\n".join(doc_chunks)

    # Construct the prompt with relevant chunks
    prompt = f"""
    Based on the following text from the documents:

    {combined_text}

    Please analyze the document and provide insights on:
    - Ownership structure
    - Strategy and corporate structure
    - Profit and loss statement (P&L) and balance sheet (notable changes)
    - Climate and environmental risks.
    """

    # Payload for the request
    payload = {
        "messages": [
            {
                "role": "system",
                "content": "You are an expert in analyzing the business model of banks. Please provide detailed insights on the following aspects."
            },
            {
                "role": "user",
                "content": prompt
            },
        ],
        "temperature": 0.7, #0
        "top_p": 0.95, #0
        "max_tokens": 1500
    }

    # GPT-4 Vision endpoint
    GPT4V_ENDPOINT = AZURE_OPENAI_ENDPOINT

    # Send request to GPT-4 API
    
    response = requests.post(GPT4V_ENDPOINT, headers=headers, json=payload)
    response.raise_for_status()  # Raise an error for unsuccessful status codes


    # Handle the response
    response_data = response.json()
    response_content = response_data['choices'][0]['message']['content']


    return response_content

In [2193]:
generate_insights_report(df_result_chunklist, threshold=0.5)

'### Ownership Structure\n\nThe ownership structure of Monzo Bank is centered around the Monzo Bank Holding Group Limited (MBHG), which serves as the parent company. The key entities within the group are:\n\n1. **Monzo Bank Holding Group Limited (MBHG)**: This is the top-level parent company.\n2. **Monzo Bank Limited (MBL)**: A direct subsidiary of MBHG and the main operational entity.\n3. **Monzo Inc. US**: An indirect subsidiary of MBL.\n4. **Monzo Support US Inc.**: Another indirect subsidiary of MBL.\n\nThe MBHG and MBL Boards operate under a "mirror board" structure, meaning they consist of the same directors, ensuring closely aligned interests and objectives between the holding company and its primary subsidiary.\n\n### Strategy and Corporate Structure\n\nMonzo\'s strategy focuses on maturing its corporate governance to align with evolving statutory and regulatory responsibilities. This involves:\n\n- **Governance Updates**: Establishment of Monzo Bank Holding Group (MBHG) to sup

In [2334]:
#for monzo
generate_insights_report(df_result_chunklist_monzo, threshold=0.6)

"### Ownership Structure\n\nBased on the document, the ownership structure of the company appears to be consolidated under Monzo Bank Holding Group Limited, which is the parent entity. The subsidiaries include Monzo Bank Limited, Monzo Support US Inc, and Monzo Inc, all of which are 100% owned by the parent entity. Here is a breakdown of the ownership:\n\n1. **Monzo Bank Holding Group Limited**: Parent company\n2. **Monzo Bank Limited**: 100% owned by the parent company\n3. **Monzo Support US Inc**: 100% owned by the parent company\n4. **Monzo Inc**: 100% owned by the parent company\n\nThe registered offices of Monzo Bank Holding Group Limited and Monzo Bank Limited are located in London, UK, while Monzo Support US Inc and Monzo Inc are based in Wilmington, Delaware, USA.\n\n### Strategy and Corporate Structure\n\nThe corporate structure is designed to support both UK and international operations, indicating a strategic focus on expanding its services beyond the UK market. The ownershi