
# Introduction to AI-Powered Computer Vision for PDF Extraction
Modern AI models trained on large document collections can read complex documents such as PDFs and extract structured information from them. In accounting, this makes it possible to automate the extraction of financial data from reports, invoices, and other documents. The key benefits include:

- **Efficiency**: Reduce manual data entry and potential errors.
- **Scalability**: Process large volumes of documents in a fraction of the time.
- **Flexibility**: Extract data from various document layouts and structures.

In this notebook, we'll explore how such models work and how their output can be processed for practical use.


# Extraction with Document Intelligence Models

Document Intelligence Studio (formerly Form Recognizer Studio): https://formrecognizer.appliedai.azure.com/studio

<!-- Document Intelligence Demo Here -->


# Exploring the JSON Output
Once a document is processed by the AI model, the extracted information is often represented in a structured format like JSON. Understanding the structure of this JSON output is crucial, as it contains the valuable data we need. Let's dive into its content.
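
As a rough orientation before opening the real file, the sketch below builds a small hand-written dictionary that mimics the parts of the output this notebook relies on. The key names `analyzeResult`, `tables`, `paragraphs`, `rowCount`, `columnCount`, `cells`, `boundingRegions`, and `polygon` are the ones accessed in the cells that follow. The `pageNumber` key, the `summarize` helper, and every value are assumptions made for illustration; a real output file contains many more keys.

```python
# Hand-written stand-in for the model output; all values are made up.
sample = {
    "analyzeResult": {
        "paragraphs": [
            {
                "content": "Balance Sheet",
                "boundingRegions": [
                    {"pageNumber": 1, "polygon": [1.0, 1.0, 4.0, 1.0, 4.0, 1.3, 1.0, 1.3]}
                ],
            }
        ],
        "tables": [
            {
                "rowCount": 2,
                "columnCount": 2,
                "cells": [
                    {"rowIndex": 0, "columnIndex": 0, "content": "Item"},
                    {"rowIndex": 0, "columnIndex": 1, "content": "Amount"},
                    {"rowIndex": 1, "columnIndex": 0, "content": "Cash"},
                    {"rowIndex": 1, "columnIndex": 1, "content": "1,200"},
                ],
                "boundingRegions": [
                    {"pageNumber": 1, "polygon": [1.0, 1.5, 6.0, 1.5, 6.0, 3.0, 1.0, 3.0]}
                ],
            }
        ],
    }
}


def summarize(result):
    """Hypothetical helper: return a few counts describing an analysis result."""
    tables = result.get("tables", [])
    return {
        "paragraphs": len(result.get("paragraphs", [])),
        "tables": len(tables),
        "table_shapes": [(t["rowCount"], t["columnCount"]) for t in tables],
    }


summary = summarize(sample["analyzeResult"])
print(summary)
```

Running the same helper on the loaded file's `analyzeResult` gives a quick sense of how many tables to expect and how large they are.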


In [None]:
import json

# Load the JSON data from the file
with open("extract.pdf.json", "r", encoding="utf-8") as file:
    data = json.load(file)

# Display the keys in the JSON to get an overview of its structure
data.keys()

In [None]:
# Exploring the structure of "analyzeResult"
analyze_result = data["analyzeResult"]

# Displaying the keys within "analyzeResult"
analyze_result.keys()

In [None]:
# Extracting the tables from the "tables" key
tables = analyze_result["tables"]

# Checking the number of tables and displaying the structure of the first table (if available)
num_tables = len(tables)
first_table = tables[0] if num_tables > 0 else None

num_tables, first_table



# Extracting Tables from the JSON
Extracting tables from the JSON output allows us to access the tabulated data within the document. However, a challenge arises: without context, we might not know the significance or purpose of each table. Let's first extract all the tables and then address this challenge.
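
Each table arrives as a flat list of cells carrying their own row and column indices, so the grid has to be rebuilt before it can be written out. The sketch below shows that step on a toy table. The helper name `cells_to_matrix` and the sample values are hypothetical, and the example assumes that cells may arrive in any order and that empty positions may simply be missing from the list.

```python
import csv
import io


def cells_to_matrix(table):
    """Rebuild a row-major grid of strings from a table's flat cell list."""
    matrix = [["" for _ in range(table["columnCount"])] for _ in range(table["rowCount"])]
    for cell in table["cells"]:
        matrix[cell["rowIndex"]][cell["columnIndex"]] = cell["content"]
    return matrix


# Toy table: cells are out of order and the middle column has no entries.
toy_table = {
    "rowCount": 2,
    "columnCount": 3,
    "cells": [
        {"rowIndex": 1, "columnIndex": 2, "content": "1,200"},
        {"rowIndex": 0, "columnIndex": 0, "content": "Account"},
        {"rowIndex": 0, "columnIndex": 2, "content": "Amount"},
        {"rowIndex": 1, "columnIndex": 0, "content": "Cash"},
    ],
}

grid = cells_to_matrix(toy_table)

# Write to an in-memory buffer to inspect the CSV text without touching disk.
buffer = io.StringIO()
csv.writer(buffer).writerows(grid)
csv_text = buffer.getvalue()
print(csv_text)
```

Positions with no cell stay as empty strings, and `csv.writer` quotes values that contain commas, so an amount such as `1,200` stays in a single column.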


In [None]:
import csv
import os

# Directory to store the CSV files
output_dir = "tables_csv"
os.makedirs(output_dir, exist_ok=True)

csv_files = []

for idx, table in enumerate(tables):
    # Initialize an empty matrix for the table
    matrix = [['' for _ in range(table['columnCount'])] for _ in range(table['rowCount'])]
    
    # Populate the matrix with cell content
    for cell in table['cells']:
        matrix[cell['rowIndex']][cell['columnIndex']] = cell['content']
    
    # Save the matrix to a CSV file
    csv_filename = os.path.join(output_dir, f"table_{idx + 1}.csv")
    with open(csv_filename, 'w', newline='', encoding='utf-8') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerows(matrix)
    
    csv_files.append(csv_filename)

csv_files[:5]  # Displaying paths to the first 5 CSV files as an example


<!-- Do it better by getting the closest paragraphs -->


# Associating Tables with Context
To provide meaningful context to each table, we can associate it with the nearest paragraph or section heading in the document. This approach not only gives us insight into the table's content but also aids in naming the extracted CSV files. Let's see how this can be done.
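
The idea can be sketched on synthetic data before applying it to the real output. The example assumes each bounding region holds a `polygon` given as a flat list `[x1, y1, x2, y2, x3, y3, x4, y4]` starting at the top-left corner, so index 1 is the top edge and index 5 is the bottom edge, and a `pageNumber` field. The helpers `nearest_paragraph_above` and `box` and all coordinates are hypothetical.

```python
def nearest_paragraph_above(table, paragraphs):
    """Return the paragraph whose bottom edge is closest above the table's top edge."""
    best, best_gap = None, float("inf")
    for t_region in table["boundingRegions"]:
        table_top = t_region["polygon"][1]
        for paragraph in paragraphs:
            for p_region in paragraph["boundingRegions"]:
                # Coordinates are page-relative, so skip regions on other pages.
                if p_region.get("pageNumber") != t_region.get("pageNumber"):
                    continue
                gap = table_top - p_region["polygon"][5]
                if 0 < gap < best_gap:
                    best, best_gap = paragraph, gap
    return best


def box(page, top, bottom, left=1.0, right=7.0):
    """Hypothetical helper: build a bounding region from top and bottom y values."""
    return {"pageNumber": page, "polygon": [left, top, right, top, right, bottom, left, bottom]}


paragraphs = [
    {"content": "Annual Report", "boundingRegions": [box(1, 0.5, 0.8)]},
    {"content": "Revenue by Quarter", "boundingRegions": [box(1, 2.0, 2.3)]},
    {"content": "Notes", "boundingRegions": [box(1, 6.0, 6.3)]},  # below the table
    {"content": "Other page text", "boundingRegions": [box(2, 2.4, 2.45)]},  # different page
]
table = {"boundingRegions": [box(1, 2.5, 5.5)]}

match = nearest_paragraph_above(table, paragraphs)
print(match["content"])
```

Only paragraphs that end above the table on the same page are considered. The paragraph on page 2 has the smallest raw gap, but it is skipped because its coordinates belong to a different page.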


In [None]:
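import json
import os
import csv

def vertical_distance(paragraph_region, table_region):
    # Each polygon is a flat list [x1, y1, x2, y2, x3, y3, x4, y4],
    # starting at the top-left corner and going clockwise.
    paragraph_bottom_y = paragraph_region['polygon'][5]  # y of the bottom-right corner
    table_top_y = table_region['polygon'][1]  # y of the top-left corner
    # Positive when the paragraph ends above the top of the table
    return table_top_y - paragraph_bottom_y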


In [None]:
def generate_filename(title, idx):
    max_length = 50
    truncated_title = title[:max_length].rstrip() if title else "table"
    sanitized_title = ''.join(c if c.isalnum() else "_" for c in truncated_title)
    filename = f"{idx}_{sanitized_title}.csv"
    return filename
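

A few sample calls illustrate what the naming function produces. The function is repeated so the snippet runs on its own, and the titles are invented for illustration.

```python
def generate_filename(title, idx):
    max_length = 50
    # Fall back to a generic name when no title was found
    truncated_title = title[:max_length].rstrip() if title else "table"
    # Replace anything that is not a letter or digit with an underscore
    sanitized_title = ''.join(c if c.isalnum() else "_" for c in truncated_title)
    return f"{idx}_{sanitized_title}.csv"


examples = [
    generate_filename("Balance Sheet (2023)", 3),  # spaces and brackets become "_"
    generate_filename(None, 1),                    # no title available
    generate_filename("A" * 80, 2),                # long titles are cut to 50 characters
]
for name in examples:
    print(name)
```

Because the table index is part of the name, two tables that share the same nearest paragraph still get distinct files.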

In [None]:
# Load the JSON data
json_path = "extract.pdf.json"
output_dir = "tables_csv"
with open(json_path, "r", encoding="utf-8") as file:
    data = json.load(file)

In [None]:
analyze_result = data["analyzeResult"]
tables = analyze_result["tables"]
paragraphs = analyze_result.get("paragraphs", [])

# Create output directory if it doesn't exist
os.makedirs(output_dir, exist_ok=True)

In [None]:
csv_files = []
for idx, table in enumerate(tables, 1):
    # Initialize an empty matrix for the table
    matrix = [['' for _ in range(table['columnCount'])] for _ in range(table['rowCount'])]
    for cell in table['cells']:
        matrix[cell['rowIndex']][cell['columnIndex']] = cell['content']

    # Identify the closest paragraph above the table (smallest positive vertical gap)
    closest_paragraph = None
    min_distance = float('inf')
    for paragraph in paragraphs:
        for p_region in paragraph['boundingRegions']:
            for t_region in table['boundingRegions']:
                # Coordinates are page-relative, so only compare regions on the same page
                if p_region.get('pageNumber') != t_region.get('pageNumber'):
                    continue
                distance = vertical_distance(p_region, t_region)
                if 0 < distance < min_distance:
                    min_distance = distance
                    closest_paragraph = paragraph

    title = closest_paragraph['content'] if closest_paragraph else None
    csv_filename = os.path.join(output_dir, generate_filename(title, idx))
    # Save the matrix to a CSV file
    with open(csv_filename, 'w', newline='', encoding='utf-8') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerows(matrix)
    csv_files.append(csv_filename)
