<a href="https://colab.research.google.com/github/deanopatoni/deanopatoni/blob/main/PDF_PRocess_Pro2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

PDF Text Extraction and Cleaning Script

In [None]:
"""
PDF Text Extraction and Cleaning Script

This script extracts text from uploaded PDF files, cleans and formats it,
and saves the result to `.txt` files. It repairs common extraction artifacts
such as words split across line breaks, which makes the output easier to read
and easier to pass to AI services.

Key Features:
- Uploads PDF files via Google Colab's `files.upload()`.
- Extracts text using PyPDF2.
- Fixes split words and unnecessary newlines.
- Formats headers, lists, and tables for better readability.
- Adds metadata (filename, date) to each processed file.
- Saves and downloads processed `.txt` files.

Usage:
1. Run the script in Google Colab.
2. Upload PDF files when prompted.
3. Processed `.txt` files will be saved and downloaded automatically.

Author: Patoni-Deano
Date: 2025-03-11
"""
# Install the only third-party library this script imports
!pip install PyPDF2

import io
import re
from datetime import datetime
from PyPDF2 import PdfReader
from google.colab import files

def upload_pdfs():
    """
    Upload PDF files and return their filenames and contents.
    Returns:
        dict: A dictionary with filenames as keys and file contents as values.
    """
    print("Please upload PDF files...")
    uploaded_files = files.upload()  # Returns a dictionary {filename: binary content}
    return uploaded_files

def extract_text_from_pdf(pdf_content):
    """
    Extract text from a PDF using PyPDF2.
    Args:
        pdf_content (bytes): Binary content of the PDF file.
    Returns:
        str: Extracted text from the PDF.
    """
    try:
        pdf_reader = PdfReader(io.BytesIO(pdf_content))
        text = ""
        for page in pdf_reader.pages:
            # Blank line marks each page break; `or ""` is a defensive guard
            # in case a page yields no text
            text += (page.extract_text() or "") + "\n\n"
        return text.strip()
    except Exception as e:
        print(f"Error extracting text from PDF: {e}")
        return ""

def fix_split_words(text):
    """
    Fixes improperly split words in the extracted text.
    Args:
        text (str): The raw extracted text.
    Returns:
        str: Text with split words fixed.
    """
    # Remove hyphenation at line breaks (e.g., "infor-\nmation" -> "information")
    text = re.sub(r'(\w+)-\n(\w+)', r'\1\2', text)

    # Join lines broken mid-sentence (e.g., "shared\nmodel" -> "shared model")
    text = re.sub(r'(\w+)\s*\n\s*(\w+)', r'\1 \2', text)

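    # Rejoin a capital letter that was split from the rest of its word
    # (e.g., "D ocument" -> "Document"). "A" and "I" are skipped because they
    # are words in their own right; this is a heuristic and can still misfire
    # on single-letter labels such as "Annex B shall".
    text = re.sub(r'\b([B-HJ-Z])\s(?=[a-z]{2,}\b)', r'\1', text)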

    return text

def process_text(text):
    """
    Clean and format extracted text for better readability.
    Args:
        text (str): Raw or fixed text extracted from the PDF.
    Returns:
        str: Processed and formatted text.
    """
    # Fix split words first
    text = fix_split_words(text)

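    # Collapse whitespace runs to a single space, but keep tabs so the table
    # rule below can still find tab-separated cells
    text = re.sub(r'[^\S\t]+', ' ', text).strip()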

    # Format headers (assumes headers are in uppercase)
    text = re.sub(r'([A-Z][A-Z\s]+):', r'\n## \1\n', text)

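    # Turn numbered list markers ("1. ") into bullets; the lookbehind skips
    # numbers that end a longer token, such as "2018." or "3.1."
    text = re.sub(r'(?<![\w.])\d{1,2}\.\s', r'\n- ', text)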

    # Add table structure (assuming tab-separated data)
    table_pattern = r'(\w+(\t\w+)+)'
    text = re.sub(table_pattern, lambda m: '\n| ' + ' | '.join(m.group(0).split('\t')) + ' |', text)

    return text

def save_processed_text(filename, content):
    """
    Save processed text to a file and return the filename.
    Args:
        filename (str): Original filename of the PDF.
        content (str): Processed text content to save.
    Returns:
        str: Name of the saved file.
    """
    # Strip only a trailing ".pdf" (any case), then add ".txt"
    output_filename = "processed_" + re.sub(r'\.pdf$', '', filename, flags=re.IGNORECASE) + ".txt"
    with open(output_filename, 'w', encoding='utf-8') as f:
        f.write(content)
    return output_filename

def main():
    """Upload PDFs, then extract, clean, save, and download each one as text."""
    # Upload PDF files
    pdf_files = upload_pdfs()

    for pdf_file_name, pdf_file_content in pdf_files.items():
        print(f"Processing {pdf_file_name}...")

        # Extract raw text from the PDF
        raw_text = extract_text_from_pdf(pdf_file_content)

        if not raw_text:
            print(f"No text extracted from {pdf_file_name}. Skipping.")
            continue

        # Process the extracted and fixed text
        processed_text = process_text(raw_text)

        # Add metadata to the processed text (title is the filename without a trailing ".pdf")
        title = re.sub(r'\.pdf$', '', pdf_file_name, flags=re.IGNORECASE)
        metadata = f"""# {title}
Author: Extracted from PDF
Date: {datetime.now().strftime('%Y-%m-%d')}

---

"""
        final_content = metadata + processed_text

        # Save and download the processed file
        output_filename = save_processed_text(pdf_file_name, final_content)

        try:
            files.download(output_filename)
            print(f"Processed file saved as {output_filename}")
        except Exception as e:
            print(f"Error downloading file: {e}")

if __name__ == "__main__":
    main()


Please upload PDF files...


Saving BS EN ISO 19650‑2_2018_Inc. Corrigendum Feb2021.pdf to BS EN ISO 19650‑2_2018_Inc. Corrigendum Feb2021 (5).pdf
Processing BS EN ISO 19650‑2_2018_Inc. Corrigendum Feb2021 (5).pdf...

Processed file saved as processed_BS EN ISO 19650‑2_2018_Inc. Corrigendum Feb2021 (5).txt
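
The line-joining rules in `fix_split_words` can be tried on a short string without uploading a PDF. The sketch below copies the first two regular expressions from that function into a hypothetical helper, `join_broken_lines` (a name introduced here only for illustration), and runs it on made-up sample text. It assumes the two patterns are applied exactly as they appear in the script, in the same order.

```python
import re

def join_broken_lines(text):
    """Illustrative copy of the first two rules in fix_split_words."""
    # Rule 1: rejoin a word hyphenated across a line break
    text = re.sub(r'(\w+)-\n(\w+)', r'\1\2', text)
    # Rule 2: replace a line break between two words with a single space
    text = re.sub(r'(\w+)\s*\n\s*(\w+)', r'\1 \2', text)
    return text

# Made-up sample: one hyphenated break, one mid-sentence break,
# and one break that follows a period
sample = "infor-\nmation is\nshared.\nNext"
print(repr(join_broken_lines(sample)))
```

Rule 2 only fires when a word character sits on both sides of the break, so the newline after `shared.` survives this step and is collapsed later by the whitespace cleanup in `process_text`. The same rule also merges a blank line between two words (`"a\n\nb"` becomes `"a b"`), which is worth knowing if paragraph breaks matter for the output.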
