# Script to Extract Data for Land Titles from PDFs

#### Author: George Felobes  
#### Version: 1.0  

### Overview:
This script processes PDF files containing land title information and extracts structured data for analysis or record-keeping. It uses PyMuPDF to render PDF pages as images and OpenAI's API to extract and format the data.

### Features:
- **PDF to Image Conversion:** Converts each page of a multi-page PDF into a JPEG image (rendered at 150 DPI by default) for easier processing.
- **Intelligent Data Extraction:** Utilizes OpenAI's language models to extract structured data from images, including text and metadata with confidence scoring.
- **Data Transformation and Storage:** Extracted data is saved as a CSV file with two columns, `FIELD` and `ATTRIBUTE`, for further use.
- **Cost Estimation:** Estimates the number of tokens each image consumes and the resulting OpenAI API cost, to help with budgeting.
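
The data transformation step can be illustrated with a minimal sketch. It assumes the extracted data arrives as a Python dictionary; the helper name `dict_to_field_attribute_csv` and the sample record are hypothetical, and real field names depend on what the model returns.

```python
import csv
import io
import json


def dict_to_field_attribute_csv(data: dict) -> str:
    """Render a dictionary as FIELD/ATTRIBUTE rows; nested values are JSON-encoded."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["FIELD", "ATTRIBUTE"])
    for key, value in data.items():
        if isinstance(value, (list, dict)):
            value = json.dumps(value)  # keep nested structures in a single cell
        writer.writerow([key, value])
    return buffer.getvalue()


# Hypothetical sample record, for illustration only.
sample = {"title_number": "ABC123", "owners": ["A. Example", "B. Example"]}
csv_text = dict_to_field_attribute_csv(sample)
print(csv_text)
```

Nested lists and dictionaries are stored as JSON strings, so they can be recovered later with `json.loads` when the CSV is read back.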

### Prerequisites:
1. Python 3.8 or above.
2. Required libraries:
   - `PyMuPDF` (imported as `fitz`)
   - `openai`
   - `pydantic`
   - `Pillow`
   - `python-dotenv`
3. OpenAI API key with access to the specified model.
4. Input PDF files with standardized formatting for land title information.
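
Before running the notebook, it may help to confirm that the prerequisites are in place. The following is a minimal sketch, not part of the original script: the helper name `missing_modules` is hypothetical, and the module list uses import names (`fitz` for PyMuPDF, `PIL` for Pillow, `dotenv` for python-dotenv).

```python
import importlib.util
import os

# Import names of the libraries used by the script.
REQUIRED_MODULES = ["fitz", "openai", "pydantic", "PIL", "dotenv"]


def missing_modules(names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print(f"Missing libraries (import names): {', '.join(missing)}")
else:
    print("All required libraries are installed.")

# The script reads the key from the OPENAI_API_KEY environment variable.
if not os.getenv("OPENAI_API_KEY"):
    print("OPENAI_API_KEY is not set; add it to your .env file or environment.")
```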

### Workflow:
1. **PDF to JPEG Conversion:**  
   Each page of the input PDF is converted into a JPEG image and stored in the specified output folder.
   
2. **Data Extraction from Images:**  
   Images are processed through OpenAI's API to extract structured data, which is validated using a pydantic model for accuracy.

3. **Output as CSV:**  
   Extracted data is stored in a CSV file, with one row per key-value pair from the structured data.

4. **Cost Analysis:**  
   Calculates and reports the token cost of processing each image, providing insight into API usage and expenses.
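
The cost analysis step can be sketched as a small worked example. It assumes the tile-based rule for high-detail images (a flat 85 tokens plus 170 tokens per 512px tile) and the rate of $0.000150 per 1,000 tokens used in this script. The function names are hypothetical, the sketch assumes partial tiles are rounded up and small images are not upscaled, and actual billing may differ by model.

```python
import math


def estimate_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate image tokens from pixel dimensions using the 512px tile rule."""
    if detail == "low":
        return 85  # flat charge, independent of image size

    # Step 1: shrink to fit inside a 2048 x 2048 square, keeping the aspect ratio.
    if max(width, height) > 2048:
        scale = 2048 / max(width, height)
        width, height = int(width * scale), int(height * scale)

    # Step 2: shrink so the shortest side is at most 768px.
    # Assumption: images that are already smaller are not upscaled.
    if min(width, height) > 768:
        scale = 768 / min(width, height)
        width, height = int(width * scale), int(height * scale)

    # Step 3: count the 512px tiles needed to cover the image (partial tiles count).
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return 170 * tiles + 85


def estimate_cost_usd(tokens: int, usd_per_1k_tokens: float = 0.000150) -> float:
    """Convert a token count to US dollars at the given rate per 1,000 tokens."""
    return usd_per_1k_tokens * tokens / 1000


# A 1275 x 1650 px page (US Letter at 150 DPI, used here as an assumed example)
# shrinks to roughly 768 x 993 px and needs a 2 x 2 grid of tiles.
page_tokens = estimate_image_tokens(1275, 1650)
print(page_tokens, estimate_cost_usd(page_tokens))
```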

### Use Cases:
- Efficient data processing for legal, real estate, or administrative purposes.
- Automating land title data digitization for record-keeping.
- Reducing manual labor in extracting and formatting land title information.

### Instructions:
1. Place the PDF files to be processed in the designated input folder.
2. Run the script in the Jupyter Notebook environment.
3. Review the extracted CSV files and cost analysis for accuracy and budget evaluation.
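
If the input folder holds several PDFs, the steps above could be applied to each file in turn. The sketch below only plans the work: the helper `plan_conversions` and the folder names `input_pdfs` and `output_images` are hypothetical, and the calls to `PDFImageProcessor` are left as comments because they depend on the class defined in this notebook.

```python
from pathlib import Path


def plan_conversions(input_dir: str, output_root: str):
    """Pair every PDF in input_dir with its own output folder under output_root."""
    jobs = []
    for pdf_path in sorted(Path(input_dir).glob("*.pdf")):
        # One sub-folder per PDF keeps page_1.jpeg of different files apart.
        jobs.append((pdf_path, Path(output_root) / pdf_path.stem))
    return jobs


for pdf_path, out_dir in plan_conversions("input_pdfs", "output_images"):
    print(f"{pdf_path} -> {out_dir}")
    # PDFImageProcessor.pdf_to_jpeg_no_poppler(str(pdf_path), str(out_dir))
    # PDFImageProcessor.process_all_images_in_folder(openai_api_key, str(out_dir))
```

Giving each PDF its own output folder avoids overwriting `page_1.jpeg` and its CSV when a second document is processed.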

### Disclaimer:
The script assumes a standard formatting structure for the input PDFs. Variations in formatting may require adjustments to the code for optimal performance.


## OUTPUT JSON AND CSV

In [9]:
import base64
import fitz  # PyMuPDF
import os
import re
import csv
from openai import OpenAI
from pydantic import BaseModel
from dotenv import load_dotenv
import json


class PDFImageProcessor:
    """
    A utility class for converting PDF files to JPEG images, processing images using OpenAI's API,
    and extracting and saving structured data from text.
    """

    @staticmethod
    def pdf_to_jpeg_no_poppler(pdf_path, output_folder, dpi=150):
        """
        Convert a PDF file to JPEG images.

        Args:
            pdf_path (str): Path to the PDF file.
            output_folder (str): Directory to save the output JPEG images.
            dpi (int): Dots per inch for rendering the PDF. Default is 150.

        Returns:
            None
        """
        try:
            # Open the PDF
            pdf_document = fitz.open(pdf_path)

            # Create output folder if it doesn't exist
            if not os.path.exists(output_folder):
                os.makedirs(output_folder)

            # Iterate through pages
            for page_number in range(len(pdf_document)):
                page = pdf_document.load_page(page_number)

                # Render page to a pixmap
                pix = page.get_pixmap(dpi=dpi)

                # Save pixmap as a JPEG
                output_path = os.path.join(output_folder, f"page_{page_number + 1}.jpeg")
                pix.save(output_path)
                print(f"Saved: {output_path}")

            pdf_document.close()

        except Exception as e:
            print(f"An error occurred during PDF to JPEG conversion: {e}")
    
    @staticmethod
    def encode_image_to_base64(image_path):
        """
        Encodes an image to a base64 string.

        Args:
            image_path (str): Path to the image file.

        Returns:
            str: Base64-encoded image string.
        """
        try:
            with open(image_path, "rb") as image_file:
                return base64.b64encode(image_file.read()).decode('utf-8')
        except FileNotFoundError:
            raise FileNotFoundError(f"Image file not found at path: {image_path}")
        except Exception as e:
            raise RuntimeError(f"Error encoding image: {e}")
            
    @staticmethod
    def process_image_to_json(openai_api_key, image_path, model="gpt-4o-mini"):
        """
        Encodes an image to base64, sends it to OpenAI's API, and retrieves a structured response.

        Args:
            openai_api_key (str): API key for OpenAI.
            image_path (str): Path to the image file.
            model (str): Model name to use for OpenAI API. Default is "gpt-4o-mini".

        Returns:
            dict: Parsed JSON response from OpenAI API.
        """
        try:
            client = OpenAI(api_key=openai_api_key)

            # Encode the image
            base64_img = PDFImageProcessor.encode_image_to_base64(image_path)
            # The pages were saved as JPEG files, so declare the matching MIME type
            base64_img_with_prefix = f"data:image/jpeg;base64,{base64_img}"

            # Create a request to the API
            response = client.chat.completions.create(
                model=model,
                messages=[
                    {
                        "role": "system",
                        "content": "You are an expert at extracting structured data from images. Extract the text from the images",
                    },
                    {
                        "role": "user",
                        "content": [
                            {"type": "text", "text": "Return JSON document with data. Only return JSON not other text"},
                            {
                                "type": "image_url",
                                "image_url": {"url": f"{base64_img_with_prefix}"}
                            }
                        ],
                    },
                ],
            )
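            # Parse the JSON response, removing any Markdown code fence around it.
            # str.strip("```json") would strip a set of characters, not the literal prefix.
            json_string = response.choices[0].message.content.strip()
            json_string = re.sub(r"^```(?:json)?\s*|\s*```$", "", json_string)
            return json.loads(json_string)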

        except json.JSONDecodeError:
            raise ValueError("Failed to parse JSON response from the API.")
        except Exception as e:
            raise RuntimeError(f"An error occurred during text extraction: {e}")


    @staticmethod
    def extract_and_save_csv(json_data, output_file):
        """
        Save JSON data to a JSON file and to a CSV file with headers FIELD and ATTRIBUTE.

        Args:
            json_data (dict): The input JSON data.
            output_file (str): The name of the output file to save the CSV data.

        Returns:
            None
        """
        try:
            if not isinstance(json_data, dict):
                raise ValueError("Input data is not a valid dictionary.")
            
            # Save the JSON data next to the CSV, using the same base name
            json_file = os.path.splitext(output_file)[0] + ".json"
            with open(json_file, "w", encoding="utf-8") as jf:
                json.dump(json_data, jf, indent=4)
            print(f"JSON data has been saved to {json_file}")

            with open(output_file, "w", newline="", encoding="utf-8") as csvfile:
                writer = csv.writer(csvfile)

                # Write CSV header
                writer.writerow(["FIELD", "ATTRIBUTE"])

                # Write each key-value pair to the CSV
                for key, value in json_data.items():
                    if isinstance(value, (list, dict)):
                        value = json.dumps(value)  # Serialize complex values
                    writer.writerow([key, value])

            print(f"CSV data has been extracted and saved to {output_file}")

        except PermissionError:
            raise PermissionError(f"Permission denied: Unable to write to file {output_file}.")
        except Exception as e:
            raise RuntimeError(f"An error occurred while saving CSV: {e}")


    @staticmethod
    def process_all_images_in_folder(openai_api_key, output_folder, model="gpt-4o-mini"):
        """
        Process all JPEG images in the specified folder.

        Args:
            openai_api_key (str): API key for OpenAI.
            output_folder (str): The folder containing JPEG images to process.
            model (str): Model name to use for OpenAI API. Default is "gpt-4o-mini".

        Returns:
            None
        """
        try:
            # List all JPEG files in the folder
            image_files = [f for f in os.listdir(output_folder) if f.endswith('.jpeg')]

            for image_file in image_files:
                image_path = os.path.join(output_folder, image_file)
                print(f"Processing: {image_path}")
                response = PDFImageProcessor.process_image_to_json(openai_api_key, image_path, model)

                if response:
                    # Save CSV data to a file with the same name as the image
                    output_file = os.path.join(output_folder, f"{os.path.splitext(image_file)[0]}.csv")
                    PDFImageProcessor.extract_and_save_csv(response, output_file)
                    
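                    # Estimate the image's token usage; the cost printed below
                    # assumes a rate of $0.000150 per 1,000 tokens
                    tokens = PDFImageProcessor.calculate_openai_cost(image_path, detail="high")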
                    print(f"The number of tokens of processing the {image_path} is: {tokens}, cost: ${0.000150*tokens/1000}")

        except Exception as e:
            print(f"An error occurred while processing images: {e}")

    @staticmethod
    def calculate_openai_cost(image_path, detail="low"):
        """
        Calculate the cost in tokens for processing an image using OpenAI's API.

        Args:
            image_path (str): Path to the image file.
            detail (str): Detail level of the image, either "low" or "high". Default is "low".

        Returns:
            int: The cost in tokens.
        """
        try:
            from PIL import Image

            with Image.open(image_path) as img:
                width, height = img.size

            if detail == "low":
                return 85

            # Resize to fit within a 2048 x 2048 square
            if width > 2048 or height > 2048:
                scale = 2048 / max(width, height)
                width = int(width * scale)
                height = int(height * scale)

            # Scale so the shortest side is 768px
            scale = 768 / min(width, height)
            width = int(width * scale)
            height = int(height * scale)

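            # Count the 512px tiles needed to cover the image; a partial tile
            # counts as a whole one, so use ceiling division instead of floor
            tiles = (-(-width // 512)) * (-(-height // 512))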
            return 170 * tiles + 85

        except Exception as e:
            print(f"An error occurred while calculating cost: {e}")
            return 0
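
`os.listdir` returns file names in arbitrary order, and a plain alphabetical sort would still place `page_10.jpeg` before `page_2.jpeg`. If pages should be processed in page order, a natural-sort key along these lines could be applied to the file list; the helper name `page_sort_key` is hypothetical and not part of the class above.

```python
import re


def page_sort_key(filename: str):
    """Split digit runs from text so that numeric parts compare as integers."""
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", filename)]


names = ["page_1.jpeg", "page_10.jpeg", "page_2.jpeg"]
ordered = sorted(names, key=page_sort_key)
print(ordered)  # ['page_1.jpeg', 'page_2.jpeg', 'page_10.jpeg']
```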


In [10]:
# Example usage
if __name__ == "__main__":
    # Convert PDF to JPEG images
    pdf_path = "01_24-00206_Blk Par C Plan No 96R49207 SW 16-006-13 W2M Ext 0.pdf"  # Replace with your PDF file path
    output_folder = "output_images"
    
    # Load environment variables, in particular the OpenAI API key. Store the key in your .env file, or define it below and comment out the load_dotenv() call.
    load_dotenv()
    openai_api_key = os.getenv("OPENAI_API_KEY")
    
    PDFImageProcessor.pdf_to_jpeg_no_poppler(pdf_path, output_folder)

    # Process all JPEG images in the output folder
    PDFImageProcessor.process_all_images_in_folder(openai_api_key, output_folder)

Saved: output_images\page_1.jpeg
Saved: output_images\page_2.jpeg
Saved: output_images\page_3.jpeg
Saved: output_images\page_4.jpeg
Saved: output_images\page_5.jpeg
Saved: output_images\page_6.jpeg
Saved: output_images\page_7.jpeg
Saved: output_images\page_8.jpeg
Saved: output_images\page_9.jpeg
Saved: output_images\page_10.jpeg
Processing: output_images\page_1.jpeg
CSV data has been extracted and saved to output_images\page_1.csv
The number of tokens of processing the output_images\page_1.jpeg is: 255, cost: $3.825e-05
Processing: output_images\page_10.jpeg
CSV data has been extracted and saved to output_images\page_10.csv
The number of tokens of processing the output_images\page_10.jpeg is: 255, cost: $3.825e-05
Processing: output_images\page_2.jpeg
CSV data has been extracted and saved to output_images\page_2.csv
The number of tokens of processing the output_images\page_2.jpeg is: 255, cost: $3.825e-05
Processing: output_images\page_3.jpeg
CSV data has been extracted and saved to ou