
# PDF Text Extraction Program Documentation

## Overview

This Jupyter Notebook contains a Python program that extracts text from a series of PDF documents. The program reads a list of PDF file paths from a text file, extracts the text of each PDF while skipping the page margins, and saves the result as a Markdown file in the same directory as the original PDF. It sorts text blocks into a logical reading order and adjusts text flow; detecting headings and subheadings from font sizes and styles is a stated goal that the current code does not yet implement.

## Requirements

- **PyMuPDF Library**: The program utilizes PyMuPDF (fitz) for handling PDFs. Ensure PyMuPDF is installed using `pip install PyMuPDF`.
- **Python 3**: The program is written for Python 3; compatibility with earlier versions is not guaranteed.

## Program Flow

1. **Reading PDF Paths**: Initially, the program reads a file named `pdfs_to_extract.txt`, which contains the local paths to all PDFs to be processed. Each line in this file represents one path.

2. **PDF Processing**:
    - For each PDF, the program:
        - Opens the PDF and iterates through each page.
        - Calculates and applies a half-inch margin to ignore text close to the page edges.
        - Extracts text blocks within the defined margins.
        - Sorts the extracted text blocks into a logical reading order based on their positions.
        - Adjusts text flow by removing hyphenated line breaks and unnecessary whitespace.
        - Leaves a placeholder for applying Markdown formatting to detected headings and subheadings (heading detection itself is not yet implemented).

3. **Markdown File Generation**:
    - After processing, the extracted text for each PDF is saved into a Markdown (.md) file.
    - The Markdown file is named after the original PDF and stored in the same directory.
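
The per-page steps in item 2 (margin, reading order, text flow) can be sketched on plain data without opening a PDF. This is a minimal illustration, not the notebook's code: the helper names `content_box`, `sort_reading_order`, and `reflow` are hypothetical, and the `row_tolerance` banding is an assumed refinement that the notebook does not use (it sorts on the exact top edge).

```python
import re

POINTS_PER_INCH = 72  # PDF user space: 1 point = 1/72 inch


def content_box(x0, y0, x1, y1, margin_inches=0.5):
    """Shrink a page rectangle (in points) by the margin on every side."""
    m = margin_inches * POINTS_PER_INCH
    return (x0 + m, y0 + m, x1 - m, y1 - m)


def sort_reading_order(blocks, row_tolerance=3.0):
    """Order blocks top-to-bottom, then left-to-right within a row.

    Blocks whose top edges fall in the same `row_tolerance`-point band are
    treated as one row (an assumption made for this sketch).
    """
    return sorted(blocks, key=lambda b: (int(b[1] // row_tolerance), b[0]))


def reflow(block_text):
    """Rejoin words hyphenated at a line break, then collapse line breaks."""
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", block_text)
    text = re.sub(r"\s*\n\s*", " ", text)
    return text.strip()


# US Letter is 8.5 x 11 inches, i.e. 612 x 792 points.
print(content_box(0, 0, 612, 792))  # (36.0, 36.0, 576.0, 756.0)

# Tuples follow the "blocks" layout: (x0, y0, x1, y1, text, block_no, block_type)
blocks = [
    (300.0, 99.6, 500.0, 120.0, "Right\ncell\n", 0, 0),
    (50.0, 200.0, 500.0, 260.0, "Text extrac-\ntion from PDF\nfiles.\n", 1, 0),
    (50.0, 100.0, 250.0, 120.0, "Left\ncell\n", 2, 0),
]
markdown = "\n\n".join(reflow(b[4]) for b in sort_reading_order(blocks))
print(markdown)
```

Two caveats apply to this sketch. Joining on `-\n` also removes the hyphen from genuine compounds that happen to break at a line end, and a row-based sort will interleave the columns of a multi-column page.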

## Text Processing Details

- **Margin Adjustment**: Text within half an inch (36 points) of any page edge is treated as non-essential and skipped during extraction, so that the output focuses on the main content.

- **Text Flow Adjustment**: Words hyphenated across a line break are rejoined, and the remaining line breaks inside a text block are replaced with spaces. Each text block is followed by a blank line so that it renders as a separate Markdown paragraph.

- **Heading Detection**: Not yet implemented. The code only marks where detection based on font sizes and styles would go; once added, headings are intended to be marked with the appropriate Markdown tags (e.g., `#` for main headings, `##` for subheadings).
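
One possible starting point for the heading detection described above is sketched here. It assumes the nested structure that PyMuPDF returns from `page.get_text("dict")` (blocks containing lines containing spans, where each span has `size`, `flags`, and `text`), and it runs on a hand-written sample so that it works without a PDF. The function names and the 1.5x and 1.2x size thresholds are hypothetical choices for illustration, not part of the notebook.

```python
from collections import Counter

BOLD_FLAG = 16  # bit 4 of a span's "flags" value marks bold text in PyMuPDF


def body_font_size(page_dict):
    """Return the font size that covers the most characters on the page."""
    sizes = Counter()
    for block in page_dict["blocks"]:
        for line in block.get("lines", []):  # image blocks have no "lines"
            for span in line["spans"]:
                sizes[round(span["size"], 1)] += len(span["text"])
    return sizes.most_common(1)[0][0] if sizes else 0.0


def block_to_markdown(block, body_size):
    """Prefix a block with '#' or '##' when it looks like a heading."""
    spans = [s for line in block.get("lines", []) for s in line["spans"]]
    text = " ".join(s["text"].strip() for s in spans).strip()
    if not text:
        return ""
    largest = max(s["size"] for s in spans)
    all_bold = all(s["flags"] & BOLD_FLAG for s in spans)
    if largest >= 1.5 * body_size:  # assumed threshold for main headings
        return "# " + text
    if largest >= 1.2 * body_size or all_bold:  # assumed subheading rule
        return "## " + text
    return text


# Hand-written sample in the shape of page.get_text("dict")
page_dict = {"blocks": [
    {"type": 0, "lines": [{"spans": [
        {"size": 18.0, "flags": 16, "text": "Annual Report"}]}]},
    {"type": 0, "lines": [{"spans": [
        {"size": 10.0, "flags": 16, "text": "Background"}]}]},
    {"type": 0, "lines": [{"spans": [
        {"size": 10.0, "flags": 0,
         "text": "Body text that makes up most of the page."}]}]},
]}

body = body_font_size(page_dict)
for block in page_dict["blocks"]:
    print(block_to_markdown(block, body))
```

The thresholds would need tuning per document set, and the bold rule will also promote fully bold body paragraphs, so a length limit on heading candidates may be worth adding.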

## Usage Instructions

- Prepare a text file named `pdfs_to_extract.txt` in the notebook's working directory, containing the paths of the PDF files you wish to process, one path per line.
- Ensure all dependencies are installed, then run the cells in this notebook in order.
- The extracted text is written as a Markdown file next to each source PDF.
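
If the PDFs sit under a common folder, the list file from the first step can be generated rather than typed by hand. The sketch below is one way to do that with the standard library; `write_pdf_list` is a hypothetical helper, and the demonstration uses a temporary directory with empty placeholder files rather than real PDFs.

```python
import tempfile
from pathlib import Path


def write_pdf_list(root_dir, list_file="pdfs_to_extract.txt"):
    """Write every .pdf path found under root_dir to list_file, one per line."""
    pdf_paths = sorted(str(p.resolve()) for p in Path(root_dir).rglob("*.pdf"))
    Path(list_file).write_text(
        "".join(p + "\n" for p in pdf_paths), encoding="utf-8"
    )
    return pdf_paths


# Demonstration on a throwaway directory tree
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "reports").mkdir()
    (root / "a.pdf").touch()
    (root / "reports" / "b.pdf").touch()
    (root / "notes.txt").touch()  # ignored: not a PDF
    list_path = root / "pdfs_to_extract.txt"
    found = write_pdf_list(root, list_path)
    listed = list_path.read_text(encoding="utf-8").splitlines()

print(len(found))  # 2
```

The `*.pdf` pattern is matched case-sensitively on most Linux file systems, so files ending in `.PDF` would need a second pattern.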


In [11]:
import fitz  # PyMuPDF
import os

In [12]:
def extract_text_from_pdf(pdf_path):
    """
    Extracts text from a PDF, ignoring a half-inch margin on all sides
    and adjusting text flow. Heading detection is not implemented yet.
    """
    # Open the PDF
    doc = fitz.open(pdf_path)
    
    extracted_text = ""  # To hold all text extracted from this PDF
    
    for page in doc:
        # Get the page's dimensions to calculate margin
        rect = page.rect
        margin = 0.5 * 72  # Half an inch in points
        clip_rect = fitz.Rect(rect.x0 + margin, rect.y0 + margin, rect.x1 - margin, rect.y1 - margin)
        
        # Extract text blocks, ignoring the specified margin
        text_blocks = page.get_text("blocks", clip=clip_rect)
        
        # Sort the blocks into their visual reading order
        text_blocks.sort(key=lambda block: (block[1], block[0]))  # Sort by y0 (top edge), then x0 (left edge)
        
        # Process and concatenate the text, adjusting for headings and wrapped words
        for block in text_blocks:
            text = block[4]
            text = text.replace('-\n', '')  # Rejoin words hyphenated across a line break
            text = text.replace('\n', ' ').strip()  # Join remaining lines; drop the trailing space left by the final newline
            # Additional logic for detecting and formatting headings can be implemented here
            if text:  # Skip blocks that contain only whitespace
                extracted_text += text + "\n\n"  # A blank line separates Markdown paragraphs
    
    # Close the document
    doc.close()
    
    return extracted_text

In [13]:
def save_markdown(text, path):
    """
    Writes the extracted text to the Markdown file at `path`, encoded as UTF-8.
    """
    with open(path, 'w', encoding='utf-8') as md_file:
        md_file.write(text)

In [14]:
def process_pdfs_from_list(file_path):
    """
    Reads the list of PDF paths from a text file and processes each PDF.
    """
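    with open(file_path, 'r', encoding='utf-8') as f:
        # One path per line; skip blank lines so they are never passed to fitz.open
        pdf_paths = [line.strip() for line in f if line.strip()]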
    
    for pdf_path in pdf_paths:
        print(f"Processing {pdf_path}...")
        text = extract_text_from_pdf(pdf_path)
        md_path = os.path.splitext(pdf_path)[0] + ".md"
        save_markdown(text, md_path)
        print(f"Saved extracted text to {md_path}")

In [15]:
# Path to the file containing the list of PDFs
list_file_path = 'pdfs_to_extract.txt'
process_pdfs_from_list(list_file_path)

Processing /Users/willit/Documents/WorldBank/samplefiles/713741468337198922/536490BRI0SPAN10Box345621B01PUBLIC1.pdf...
Saved extracted text to /Users/willit/Documents/WorldBank/samplefiles/713741468337198922/536490BRI0SPAN10Box345621B01PUBLIC1.md
Processing /Users/willit/Documents/WorldBank/samplefiles/615181468141301901/394600turkey0p1io0economic01public1.pdf...
Saved extracted text to /Users/willit/Documents/WorldBank/samplefiles/615181468141301901/394600turkey0p1io0economic01public1.md
Processing /Users/willit/Documents/WorldBank/samplefiles/561931468184777746/96423-BRI-CHILD-FECES-Box391444B-PUBLIC-WSP-Chad-CFD-Profile.pdf...
Saved extracted text to /Users/willit/Documents/WorldBank/samplefiles/561931468184777746/96423-BRI-CHILD-FECES-Box391444B-PUBLIC-WSP-Chad-CFD-Profile.md
Processing /Users/willit/Documents/WorldBank/samplefiles/115871467986280887/96433-BRI-CHILD-FECES-Box391444B-PUBLIC-WSP-Malawi-CFD-Profile.pdf...
Saved extracted text to /Users/willit/Documents/WorldBank/sampl