# PDF to Image Converter

This notebook converts PDF files to high-resolution PNG images, one image per page, using PyMuPDF.


## Single PDF Conversion

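The first cell converts a single PDF file at a hardcoded path. It:
- Uses PyMuPDF (`fitz`) to open the PDF and render its pages
- Renders each page as a 300 DPI PNG image (a zoom factor of 300/72, since PDF page dimensions are measured in points at 72 per inch)
- Sanitizes the output filename by replacing every character other than ASCII letters, digits, and underscores with an underscore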
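
As a small illustration of the last two points, the sketch below applies the same regular expression as `sanitize_filename` to a shortened, made-up filename and works out the approximate pixel size of a rendered page. The US Letter page size (612 x 792 points) is an assumption chosen for the example; the actual contracts may use other page sizes, and the renderer may round slightly differently. PyMuPDF is not needed to run it.

```python
import re

def sanitize_filename(name):
    # Same rule as the notebook: anything other than A-Z, a-z, 0-9, or "_" becomes "_"
    return re.sub(r'[^A-Za-z0-9_]', '_', name)

# Hypothetical, shortened filename used only for illustration
example_name = "EX-10.33_Affiliate Agreement"
safe_name = sanitize_filename(example_name)
print(safe_name)  # EX_10_33_Affiliate_Agreement

# PDF dimensions are in points (72 per inch), so 300 DPI means scaling by 300/72
dpi = 300
zoom = dpi / 72

# Assumed page size: US Letter, 612 x 792 points
width_px = round(612 * zoom)
height_px = round(792 * zoom)
print(width_px, height_px)  # 2550 3300
```

Hyphens, dots, and spaces all collapse to underscores, which is why the saved image names differ from the PDF names shown in the log output.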


In [2]:
import fitz
import os
import re

def sanitize_filename(name):
    # Replace any character that is not alphanumeric or underscore with an underscore
    return re.sub(r'[^A-Za-z0-9_]', '_', name)

dpi = 300
zoom = dpi / 72  # PDF page sizes are in points (72 per inch)
magnify = fitz.Matrix(zoom, zoom)
pdf_name = "CreditcardscomInc_20070810_S-1_EX-10.33_362297_EX-10.33_Affiliate Agreement"
pdf_category = "Affiliate_Agreements"
safe_pdf_name = sanitize_filename(pdf_name)
path = f"../CUAD_v1/full_contract_pdf/Part_I/{pdf_category}/{pdf_name}.pdf"

# Make sure the output directory exists before saving
os.makedirs("./image", exist_ok=True)

doc = fitz.open(path)

# Render each page to a PNG, numbering pages from 1
for count, page in enumerate(doc, 1):
    pix = page.get_pixmap(matrix=magnify)
    pix.save(f"./image/{safe_pdf_name}_page_{count}.png")
    print(f"Converting: {pdf_name}")  # printed once per page

doc.close()


Converting: CreditcardscomInc_20070810_S-1_EX-10.33_362297_EX-10.33_Affiliate Agreement
Converting: CreditcardscomInc_20070810_S-1_EX-10.33_362297_EX-10.33_Affiliate Agreement
Converting: CreditcardscomInc_20070810_S-1_EX-10.33_362297_EX-10.33_Affiliate Agreement
Converting: CreditcardscomInc_20070810_S-1_EX-10.33_362297_EX-10.33_Affiliate Agreement
Converting: CreditcardscomInc_20070810_S-1_EX-10.33_362297_EX-10.33_Affiliate Agreement
Converting: CreditcardscomInc_20070810_S-1_EX-10.33_362297_EX-10.33_Affiliate Agreement
Converting: CreditcardscomInc_20070810_S-1_EX-10.33_362297_EX-10.33_Affiliate Agreement
Converting: CreditcardscomInc_20070810_S-1_EX-10.33_362297_EX-10.33_Affiliate Agreement
Converting: CreditcardscomInc_20070810_S-1_EX-10.33_362297_EX-10.33_Affiliate Agreement
Converting: CreditcardscomInc_20070810_S-1_EX-10.33_362297_EX-10.33_Affiliate Agreement
Converting: CreditcardscomInc_20070810_S-1_EX-10.33_362297_EX-10.33_Affiliate Agreement
Converting: CreditcardscomInc_20

## Bulk PDF Conversion

The second cell defines two functions:

1. `convert_pdf_to_images(pdf_path, output_dir="./image", dpi=300)`:
   - Converts a single PDF to one PNG image per page
   - Creates the output directory if it doesn't exist
   - Catches any exception raised while opening or rendering the PDF and prints an error message instead of stopping
   - Returns `True` on success and `False` on failure

2. `bulk_convert_pdfs(input_dir, output_dir="./image", dpi=300)`:
   - Recursively finds all PDFs in the input directory with `os.walk`, matching the `.pdf` extension case-insensitively
   - Mirrors the input directory structure under the output directory
   - Calls `convert_pdf_to_images` for each PDF
   - Counts successful and failed conversions and prints the totals at the end
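
The directory mirroring can be sketched in isolation, without PyMuPDF. The helper name `mirrored_target_dir` and the example paths are hypothetical and used only for illustration; the logic follows the same `os.path.relpath` approach as `bulk_convert_pdfs`.

```python
import os

def mirrored_target_dir(root, input_dir, output_dir):
    """Return the output folder that mirrors root's position under input_dir."""
    rel_path = os.path.relpath(root, input_dir)
    if rel_path == '.':
        # PDFs directly inside input_dir go straight into output_dir
        return output_dir
    return os.path.join(output_dir, rel_path)

# Hypothetical paths for illustration
nested = mirrored_target_dir(
    os.path.join("pdfs", "Part_I", "Affiliate_Agreements"), "pdfs", "image"
)
top = mirrored_target_dir("pdfs", "pdfs", "image")
print(nested)
print(top)
```

Handling the `'.'` case separately avoids output paths that end in a redundant `/.` component.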


In [4]:
def convert_pdf_to_images(pdf_path, output_dir="./image", dpi=300):
    """
    Convert a single PDF to PNG images, one per page
    
    Args:
        pdf_path: Path to the PDF file
        output_dir: Directory to save images
        dpi: DPI for the output images

    Returns:
        True if every page was saved, False if an error occurred
    """
    os.makedirs(output_dir, exist_ok=True)
        
    zoom = dpi/72
    magnify = fitz.Matrix(zoom, zoom)
    
    pdf_name = os.path.basename(pdf_path)
    pdf_name = os.path.splitext(pdf_name)[0]  # Remove extension
    safe_pdf_name = sanitize_filename(pdf_name)
    
    try:
        # The context manager closes the document even if rendering fails
        with fitz.open(pdf_path) as doc:
            print(f"Converting: {pdf_name}")

            for page_num, page in enumerate(doc, 1):
                pix = page.get_pixmap(matrix=magnify)
                output_path = os.path.join(output_dir, f"{safe_pdf_name}_page_{page_num}.png")
                pix.save(output_path)

        return True
    except Exception as e:
        print(f"Error processing {pdf_path}: {e}")
        return False

def bulk_convert_pdfs(input_dir, output_dir="./image", dpi=300):
    """
    Convert all PDFs in a directory to images
    
    Args:
        input_dir: Directory containing PDF files
        output_dir: Directory to save images
        dpi: DPI for the output images
    """
    os.makedirs(output_dir, exist_ok=True)
    
    success_count = 0
    failure_count = 0
    
    # Walk through all files in input_dir and its subdirectories
    for root, _, files in os.walk(input_dir):
        for file in files:
            if file.lower().endswith('.pdf'):
                pdf_path = os.path.join(root, file)
                
                # Create subdirectory structure mirroring the input directory
                rel_path = os.path.relpath(root, input_dir)
                if rel_path != '.':
                    target_dir = os.path.join(output_dir, rel_path)
                    os.makedirs(target_dir, exist_ok=True)
                else:
                    target_dir = output_dir
                
                # Convert the PDF
                result = convert_pdf_to_images(pdf_path, target_dir, dpi)
                
                if result:
                    success_count += 1
                else:
                    failure_count += 1
    
    print("\nConversion complete!")
    print(f"Successfully converted: {success_count} PDFs")
    print(f"Failed conversions: {failure_count} PDFs")
    
# Example usage
# bulk_convert_pdfs("../CUAD_v1/full_contract_pdf/")

## Usage Example

The third cell runs the bulk conversion on a single contract category:
- Set `input_directory` to the folder containing the PDFs
- Set `output_directory` to the folder where the images should be written
- Pass the desired `dpi` (the default is 300)

In [5]:
# Run this cell to convert all PDFs in a specific directory
input_directory = "../CUAD_v1/full_contract_pdf/Part_I/Affiliate_Agreements"
output_directory = "./image/bulk_output"

bulk_convert_pdfs(input_directory, output_directory, dpi=300)

Converting: CreditcardscomInc_20070810_S-1_EX-10.33_362297_EX-10.33_Affiliate Agreement
Converting: CybergyHoldingsInc_20140520_10-Q_EX-10.27_8605784_EX-10.27_Affiliate Agreement
Converting: DigitalCinemaDestinationsCorp_20111220_S-1_EX-10.10_7346719_EX-10.10_Affiliate Agreement
Converting: LinkPlusCorp_20050802_8-K_EX-10_3240252_EX-10_Affiliate Agreement
Converting: SouthernStarEnergyInc_20051202_SB-2A_EX-9_801890_EX-9_Affiliate Agreement
Converting: SteelVaultCorp_20081224_10-K_EX-10.16_3074935_EX-10.16_Affiliate Agreement
Converting: TubeMediaCorp_20060310_8-K_EX-10.1_513921_EX-10.1_Affiliate Agreement
Converting: UnionDentalHoldingsInc_20050204_8-KA_EX-10_3345577_EX-10_Affiliate Agreement
Converting: UsioInc_20040428_SB-2_EX-10.11_1723988_EX-10.11_Affiliate Agreement 2

Conversion complete!
Successfully converted: 9 PDFs
Failed conversions: 0 PDFs
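
One way to spot-check the result is to count the PNG files written for each PDF. The sketch below is not part of the notebook: `count_pages_per_pdf` is a hypothetical helper, and it assumes only the `{name}_page_{n}.png` naming pattern used by `convert_pdf_to_images`. PDFs in different subfolders that share a sanitized name would be counted together.

```python
import os
import re
from collections import Counter

def count_pages_per_pdf(output_dir):
    """Count PNG files per source PDF, based on the '<name>_page_<n>.png' pattern."""
    pattern = re.compile(r'^(?P<name>.+)_page_\d+\.png$')
    counts = Counter()
    for _, _, files in os.walk(output_dir):
        for file in files:
            match = pattern.match(file)
            if match:
                counts[match.group('name')] += 1
    return counts

# Report page counts for the bulk output directory, if it exists
output_directory = "./image/bulk_output"
if os.path.isdir(output_directory):
    for name, pages in sorted(count_pages_per_pdf(output_directory).items()):
        print(f"{name}: {pages} pages")
```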
