# PDF Parsing - Table Extraction


## Objective
This Python notebook downloads a PDF, extracts its text and tables, converts each table into human-readable text with Azure OpenAI, and writes the result to a text file. It uses pdfplumber to extract text and table data from each page. Each table is flattened to plain text, with missing (None) cells replaced by empty strings, and sent to Azure OpenAI, which returns a natural-language description of the table. The page text and the table descriptions are then written to a single text file that is easy to read and search.

In [None]:
!pip install pdfplumber pandas requests

This code imports the libraries needed for PDF extraction and for calling Azure OpenAI over HTTP. It prompts for the Azure OpenAI endpoint and API key with getpass, so the values are not echoed or stored in the notebook, prompts for the output directory and file name, and builds the request headers used for every call to the Azure OpenAI service.

In [None]:
import io  # To create an in-memory file-like object
import os
from getpass import getpass

import pdfplumber
import requests


ENDPOINT = getpass("Azure OpenAI Completions Endpoint: ")

API_KEY = getpass("Azure OpenAI API Key: ")

# The output location is not secret, so use input() to keep it visible while typing
PARSED_PDF_DIRECTORY = input("Output directory for parsed PDF: ")

PARSED_PDF_FILE_NAME = input("Parsed PDF file name: ")


headers = {
    "Content-Type": "application/json",
    "api-key": API_KEY,
}
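
For unattended runs, prompting for every value can be inconvenient. A minimal sketch of one alternative, assuming the settings may be supplied as environment variables: the helper `read_setting` and the variable names `AZURE_OPENAI_ENDPOINT` and `AZURE_OPENAI_API_KEY` are hypothetical and not part of the original notebook.

```python
import os
from getpass import getpass


def read_setting(env_name, prompt, secret=True):
    """Return a setting from the environment, or prompt for it if unset.

    env_name is a hypothetical variable name chosen for this sketch.
    """
    value = os.environ.get(env_name)
    if value:
        return value
    # Hide the typed value only when it is a secret
    return getpass(prompt) if secret else input(prompt)
```

With this helper, the prompts above could become, for example, `API_KEY = read_setting("AZURE_OPENAI_API_KEY", "Azure OpenAI API Key: ")`, which only asks interactively when the variable is not set.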

This code defines two functions: extract_table_text_from_openai and parse_pdf_from_url. The extract_table_text_from_openai function builds a request payload containing a table's plain text, sends it to Azure OpenAI, and returns the human-readable description from the response. The parse_pdf_from_url function downloads a PDF from a URL and processes it page by page: it extracts the text and the tables, sends each table to Azure OpenAI for conversion, and writes all the content (page text and table descriptions) to a text file.
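
The response-handling step can be tried without calling the service. The sketch below assumes the usual chat completions response shape, a `choices` list whose first item holds a `message` with a `content` string. The helper name `first_message_content` and the sample response are made up for illustration.

```python
def first_message_content(response_json):
    """Return the text of the first choice, or "" if it is missing."""
    choices = response_json.get("choices") or []
    if not choices:
        return ""
    message = choices[0].get("message") or {}
    # "content" may be absent or null, so fall back to an empty string
    return (message.get("content") or "").strip()


# Made-up response used only to exercise the helper
sample_response = {
    "choices": [
        {"message": {"role": "assistant", "content": "  Sample description.  "}}
    ]
}
print(first_message_content(sample_response))
```

Returning an empty string for a missing or null `content` keeps the caller simple, at the cost of hiding the difference between an empty answer and a malformed response.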

In [None]:
def extract_table_text_from_openai(table_text):
    # Payload for the Azure OpenAI request
    payload = {
        "messages": [
            {
                "role": "system",
                "content": [
                    {
                        "type": "text",
                        "text": "You are an AI assistant that helps convert tables into a human-readable text.",
                    }
                ],
            },
            {
                "role": "user",
                "content": f"Convert this table to a readable text format:\n{table_text}",
            },
        ],
        "temperature": 0.7,
        "top_p": 0.95,
        "max_tokens": 4096,
    }

    # Send the request to Azure OpenAI
    try:
        # A timeout prevents the notebook from hanging on a stalled connection
        response = requests.post(ENDPOINT, headers=headers, json=payload, timeout=120)
        response.raise_for_status()  # Raise error if the request fails
    except requests.RequestException as e:
        raise SystemExit(f"Failed to make the request. Error: {e}")

    # Process the response; tolerate an empty "choices" list or a null "content"
    choices = response.json().get("choices") or [{}]
    content = choices[0].get("message", {}).get("content") or ""
    return content.strip()


def parse_pdf_from_url(file_url):
    # Download the PDF file from the URL
    response = requests.get(file_url, timeout=60)
    response.raise_for_status()  # Ensure the request was successful

    # Open the PDF content with pdfplumber using io.BytesIO
    pdf_content = io.BytesIO(response.content)

    # Ensure the directory exists and has write permissions
    os.makedirs(PARSED_PDF_DIRECTORY, mode=0o755, exist_ok=True)

    with pdfplumber.open(pdf_content) as pdf, open(
        os.path.join(PARSED_PDF_DIRECTORY, PARSED_PDF_FILE_NAME), "w", encoding="utf-8"
    ) as output_file:
        for page_num, page in enumerate(pdf.pages, 1):
            print(f"Processing page {page_num}")

            # Extract text content
            text = page.extract_text()
            if text:
                output_file.write(f"Page {page_num} Text:\n")
                output_file.write(text + "\n\n")
                print("Text extracted:", text)

            # Extract tables
            tables = page.extract_tables()
            for idx, table in enumerate(tables):
                print(f"Table {idx + 1} found on page {page_num}")

                # Convert the table into tab-separated plain text, keeping the
                # header row and replacing None cells with empty strings
                table_text = "\n".join(
                    "\t".join("" if cell is None else str(cell) for cell in row)
                    for row in table
                )

                # Call Azure OpenAI to convert the table into a text representation
                table_description = extract_table_text_from_openai(table_text)

                # Write the text representation to the file
                output_file.write(
                    f"Table {idx + 1} (Page {page_num}) Text Representation:\n"
                )
                output_file.write(table_description + "\n\n")
                print("Text representation of the table:", table_description)
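
The table-flattening step inside the loop can also be written as a standalone helper, which makes it easier to test without a PDF. This is a sketch under assumptions: the name `table_to_text` is hypothetical, and collapsing whitespace inside each cell is an addition that the original code does not perform. The input is a list of rows, each a list of cell values or None, which is the form returned by `page.extract_tables()`.

```python
def table_to_text(table):
    """Flatten a table (list of rows of cells) into tab-separated text."""
    lines = []
    for row in table:
        cells = [
            # None becomes ""; line breaks and repeated spaces inside a
            # cell are collapsed so each row stays on one line
            "" if cell is None else " ".join(str(cell).split())
            for cell in row
        ]
        lines.append("\t".join(cells))
    return "\n".join(lines)


# Made-up table with a wrapped header cell and a missing value
sample_table = [["Region", "Net\nrevenue"], ["North", None]]
print(table_to_text(sample_table))
```

Keeping the header row in the output gives the model the column names it needs to describe each value.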

In [None]:
# URL of the PDF file
file_url = "https://raw.githubusercontent.com/elastic/elasticsearch-labs/refs/heads/sunman/supporting-blog-content/alternative-approach-for-parsing-pdfs-in-rag/quarterly_report.pdf"

# Call the function to parse the PDF from the URL
parse_pdf_from_url(file_url)
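
Once the file is written, it can be split back into sections by the header lines that `parse_pdf_from_url` emits ("Page N Text:" and "Table N (Page N) Text Representation:"). The following is a minimal sketch, not part of the original notebook: the helper `split_sections` and the sample text are hypothetical, and a line in the PDF that happens to look like a header would be treated as one.

```python
import re

# Matches the two header formats written by parse_pdf_from_url
SECTION_HEADER = re.compile(
    r"^(Page \d+ Text|Table \d+ \(Page \d+\) Text Representation):$"
)


def split_sections(text):
    """Split the parsed output into (header, body) pairs."""
    sections = []
    header, body = None, []
    for line in text.splitlines():
        match = SECTION_HEADER.match(line)
        if match:
            # Close the previous section before starting a new one
            if header is not None:
                sections.append((header, "\n".join(body).strip()))
            header, body = match.group(1), []
        elif header is not None:
            body.append(line)
    if header is not None:
        sections.append((header, "\n".join(body).strip()))
    return sections


# Made-up output used only to exercise the helper
sample = (
    "Page 1 Text:\nQuarterly results\n\n"
    "Table 1 (Page 1) Text Representation:\nRevenue rose in Q2.\n\n"
)
for header, body in split_sections(sample):
    print(header, "->", body)
```

To apply it to the real output, read the file at `os.path.join(PARSED_PDF_DIRECTORY, PARSED_PDF_FILE_NAME)` and pass its contents to `split_sections`.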