# PDF-Word Data Scraping

Modify the System to Support MS Office Format

    Goal: Extract data and save it in a format that MS Office (e.g., Word) can open and edit, so that the edited content can be written back to the database.
    Tools/Libraries:
        Word document generation and manipulation: python-docx, which creates, reads, and modifies .docx files.

Workflow Overview

    Upload PDF: User uploads a PDF to be processed.
    Scrape PDF: Extract content (text, tables, images, formatting metadata).
    Store in DB: Store the extracted content in a structured format.
    Generate Word Document: Create a Word document (.docx) from the extracted data.
    Edit in MS Word: The user edits the .docx file in MS Word.
    Save Changes Back to DB: Parse the updated .docx file to extract modified content and save it back to the database.

Steps in Detail
1. PDF Scraping

    Extract the content from the PDF file, preserving formatting details for text, tables, and images. The code in this notebook uses pdfplumber for text, tables, and images, and pdfminer for per-character font sizes and font names.

2. Store Data in a Database

    Store the extracted content in a database with a structure that supports different content types (text, tables, images) and formatting.
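
One possible structure is sketched below, under assumptions: a single `content_blocks` table (a hypothetical name, not part of this notebook) keeps text and table blocks in document order, with each payload serialized as JSON so that nested table rows and empty cells can be read back unchanged. Images are left out of this sketch and would still need a BLOB column.

```python
import json
import sqlite3

def create_schema(conn):
    # Hypothetical schema: one row per content block, ordered by position
    conn.execute("""
        CREATE TABLE IF NOT EXISTS content_blocks (
            id INTEGER PRIMARY KEY,
            document_id TEXT NOT NULL,
            position INTEGER NOT NULL,
            kind TEXT NOT NULL CHECK (kind IN ('text', 'table')),
            payload TEXT NOT NULL
        )""")

def save_block(conn, document_id, position, kind, data):
    # json.dumps keeps nested lists (table rows) and None cells recoverable
    conn.execute(
        "INSERT INTO content_blocks (document_id, position, kind, payload) "
        "VALUES (?, ?, ?, ?)",
        (document_id, position, kind, json.dumps(data)),
    )

def load_blocks(conn, document_id):
    rows = conn.execute(
        "SELECT kind, payload FROM content_blocks "
        "WHERE document_id = ? ORDER BY position",
        (document_id,),
    )
    return [(kind, json.loads(payload)) for kind, payload in rows]

conn = sqlite3.connect(":memory:")
create_schema(conn)
save_block(conn, "doc-1", 0, "text", {"text": "Quarterly report", "bold": True})
save_block(conn, "doc-1", 1, "table", [["Item", "Qty"], ["Widget", None]])
conn.commit()
blocks = load_blocks(conn, "doc-1")
```

Keeping a `position` column means the document can be rebuilt in its original order, which separate per-type tables do not record on their own.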

3. Generate MS Word Document

    Goal: Convert the scraped content into a .docx file that retains as much of the original formatting as possible.
    Tools/Libraries:
        Use python-docx to create a Word document with structured text, images, and tables.
    Tasks:
        Create a .docx file using the data retrieved from the database.
        Apply formatting to match the original PDF content (e.g., headings, bold/italic text, table styles).
        Save the Word document and make it available for editing.
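
A PDF does not mark headings explicitly, so heading detection has to be inferred. The sketch below shows one heuristic, not a definitive rule: the most common font size is treated as body text, and the larger sizes are mapped to heading levels. The function name `classify_lines` and the limit of three heading levels are assumptions; the input mirrors the per-line formatting dictionaries produced by the extraction step.

```python
from collections import Counter

def classify_lines(lines):
    """Return (style_name, text) pairs for lines carrying 'text' and 'font_sizes'."""
    def dominant(values):
        return Counter(values).most_common(1)[0][0] if values else None

    # Round sizes so that 11.96 and 12.0 count as the same size
    line_sizes = [dominant([round(s) for s in line["font_sizes"]]) for line in lines]
    body_size = dominant([s for s in line_sizes if s is not None])

    # Sizes larger than the body size, largest first, become Heading 1, 2, 3
    heading_sizes = sorted(
        {s for s in line_sizes if s is not None and s > body_size},
        reverse=True,
    )[:3]

    result = []
    for line, size in zip(lines, line_sizes):
        if size in heading_sizes:
            style = f"Heading {heading_sizes.index(size) + 1}"
        else:
            style = "Normal"
        result.append((style, line["text"]))
    return result

lines = [
    {"text": "Annual Report", "font_sizes": [24.0] * 13},
    {"text": "Overview", "font_sizes": [16.0] * 8},
    {"text": "Body line one.", "font_sizes": [11.96] * 14},
    {"text": "Body line two.", "font_sizes": [12.0] * 14},
]
result = classify_lines(lines)
```

The returned style names match the built-in styles of python-docx's default template, so each pair could be passed to `doc.add_paragraph(text, style=style_name)`.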

4. Edit in MS Word

    The user can open the .docx file in MS Word, make necessary edits, and then save the document.

5. Save Changes Back to DB

    Goal: Extract content from the modified .docx file and update the database accordingly.
    Tools/Libraries:
        Use python-docx to read the modified .docx file.
    Tasks:
        Parse the updated Word document to extract the modified text, tables, and images.
        Compare the modified content with the original content in the database to identify changes.
        Update the database with the modified content, ensuring that the changes are reflected accurately.

6. Regenerate PDF (Optional)

    If required, the modified content from the database can be used to regenerate a PDF file that reflects the changes made by the user in MS Word.
    Use tools like ReportLab or FPDF to create the updated PDF, preserving the formatting.

Challenges and Considerations

    Maintaining Formatting Consistency: Converting between PDF and Word can change the formatting. Tables, fonts, and images need particular care.
    Version Control: Keep a version history in the database so that changes to a document can be tracked and an earlier version restored if necessary.
    Change Detection: A change detection mechanism is needed to determine what was modified in the Word document, so that only those changes are written to the database.
    User Interface: Create a user interface that allows users to upload, download, and re-upload Word documents, making the workflow seamless.
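
Change detection can be sketched with the standard library's `difflib`, comparing the stored paragraphs with those parsed from the re-uploaded file. The function name `diff_paragraphs` and the record layout are assumptions for illustration; comparing at paragraph level is one choice among several (tables and run formatting would need their own comparison).

```python
import difflib

def diff_paragraphs(original, edited):
    """Return change records between two lists of paragraph strings."""
    changes = []
    matcher = difflib.SequenceMatcher(a=original, b=edited, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        changes.append({
            "op": tag,               # 'replace', 'delete', or 'insert'
            "old": original[i1:i2],  # paragraphs removed or replaced
            "new": edited[j1:j2],    # paragraphs added in their place
            "old_index": i1,         # position in the original list
        })
    return changes

original = ["Introduction", "Sales rose 5%.", "Costs were flat.", "Outlook"]
edited = ["Introduction", "Sales rose 7%.", "Costs were flat.", "Risks", "Outlook"]
changes = diff_paragraphs(original, edited)
```

Here `changes` holds one `replace` record for the edited sales figure and one `insert` record for the new "Risks" paragraph, so only those two positions need a database write.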

Tools and Technologies Overview

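    PDF Extraction: pdfplumber (text, tables, images), pdfminer (font sizes and font names).
    Word Document Generation and Parsing: python-docx for creating and reading .docx files.
    Database Interaction: sqlite3 (used in this notebook), SQLAlchemy, pymongo, etc.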
    PDF Generation: ReportLab, FPDF.
    Microsoft Word Editing: .docx files for user edits.

Detailed Workflow Example

    Upload and Scrape PDF: Extract the content and save it in a structured format.
    Generate .docx File: Create a .docx file for user edits, applying appropriate formatting.
    Edit and Re-upload:
        User downloads and edits the .docx file.
        User re-uploads the modified .docx file.
    Parse Modified .docx:
        Extract the updated content and identify the changes.
        Update the database accordingly.
    Regenerate PDF (if needed): Generate an updated PDF based on the new content.
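
For the database update step, one way to keep earlier versions recoverable is to append a new row per edit instead of overwriting. The sketch below assumes a hypothetical `document_versions` table and stores the whole document text per version; a content hash prevents an unchanged re-upload from creating a new version.

```python
import hashlib
import sqlite3

def init_versions(conn):
    conn.execute("""
        CREATE TABLE IF NOT EXISTS document_versions (
            document_id TEXT NOT NULL,
            version INTEGER NOT NULL,
            content TEXT NOT NULL,
            content_hash TEXT NOT NULL,
            created_at TEXT DEFAULT CURRENT_TIMESTAMP,
            PRIMARY KEY (document_id, version)
        )""")

def save_version(conn, document_id, content):
    """Append a new version unless the content is unchanged; return the version number."""
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    row = conn.execute(
        "SELECT version, content_hash FROM document_versions "
        "WHERE document_id = ? ORDER BY version DESC LIMIT 1",
        (document_id,),
    ).fetchone()
    if row and row[1] == digest:
        return row[0]  # identical to the latest version, nothing to store
    version = row[0] + 1 if row else 1
    conn.execute(
        "INSERT INTO document_versions (document_id, version, content, content_hash) "
        "VALUES (?, ?, ?, ?)",
        (document_id, version, content, digest),
    )
    conn.commit()
    return version

def load_version(conn, document_id, version):
    row = conn.execute(
        "SELECT content FROM document_versions WHERE document_id = ? AND version = ?",
        (document_id, version),
    ).fetchone()
    return row[0] if row else None

conn = sqlite3.connect(":memory:")
init_versions(conn)
v1 = save_version(conn, "doc-1", "Sales rose 5%.")
v2 = save_version(conn, "doc-1", "Sales rose 7%.")
v3 = save_version(conn, "doc-1", "Sales rose 7%.")  # unchanged, so no new row
```

Reverting then amounts to reading an older row with `load_version` and saving it again as the newest version.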

This workflow combines PDF scraping and data extraction with editing in MS Office and synchronization with a database, giving a single system for document management and editing.

In [1]:
pdf_file_path = "pdf_word_app/uploads/sample pdf for project.pdf"  # Path to the PDF file

In [7]:
import pdfplumber
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer, LTTextLineHorizontal, LTChar
from docx import Document
from docx.shared import Inches, Pt

# Function to extract content and formatting details from a PDF
def extract_pdf_content_and_formatting(pdf_file_path):
    content = {
        "text": "",
        "tables": [],
        "images": [],
        "formatting": []
    }
    
    with pdfplumber.open(pdf_file_path) as pdf:
        for page in pdf.pages:
            # Extract tables and save them in content
            tables = page.extract_tables()
            if tables:
                content["tables"].extend(tables)
            
            # Collect the individual words found in table cells so they can be left
            # out of the main text. Cells are split into words because extract_words()
            # yields single words. Matching is by text only, so a body word that also
            # appears in a table on the same page is dropped as well.
            table_words = set()
            for table in tables:
                for row in table:
                    for cell in row:
                        if isinstance(cell, str):
                            table_words.update(cell.split())

            # Extract and store text, excluding table content
            words = page.extract_words()
            page_text = " ".join(
                word["text"] for word in words if word["text"] not in table_words
            )

            content["text"] += page_text + "\n"
            
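            # Extract images by rendering the page once and cropping each image's
            # bounding box. to_image() renders at 72 DPI by default, so pixel
            # coordinates equal PDF points. The result is a rasterized crop of the
            # rendered page, not the original embedded image file.
            if page.images:
                pil_image = page.to_image().original
                for img in page.images:
                    bbox = (img['x0'], img['top'], img['x1'], img['bottom'])
                    content["images"].append(pil_image.crop(bbox))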

    # Extract formatting using pdfminer
    for page_layout in extract_pages(pdf_file_path):
        for element in page_layout:
            if isinstance(element, LTTextContainer):
                for text_line in element:
                    if isinstance(text_line, LTTextLineHorizontal):
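                        # Walk the line once so text and formatting stay aligned.
                        # pdfminer emits inferred word gaps as LTAnno objects, which
                        # carry no font data, so a gap reuses the previous character's
                        # formatting instead of being dropped from the text.
                        line_text, font_sizes, font_styles = "", [], []
                        for obj in text_line:
                            if isinstance(obj, LTChar):
                                size, font = obj.size, obj.fontname
                            elif obj.get_text() == " " and font_sizes:
                                size, font = font_sizes[-1], font_styles[-1]
                            else:
                                continue  # line-ending "\n" or a leading gap
                            line_text += obj.get_text()
                            font_sizes.append(size)
                            font_styles.append(font)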

                        # Store the line with its formatting details
                        content["formatting"].append({
                            "text": line_text,
                            "font_sizes": font_sizes,
                            "font_styles": font_styles
                        })

    return content

# Function to generate Word document with extracted content and formatting
def generate_word_document(pdf_content, output_file):
    # Create a new Word document
    doc = Document()

    # Add text with formatting
    for formatted_line in pdf_content["formatting"]:
        text = formatted_line["text"]
        font_sizes = formatted_line["font_sizes"]
        font_styles = formatted_line["font_styles"]
        
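        # Create a new paragraph for each line of text
        p = doc.add_paragraph()

        # Pair every character with its (size, font); characters without
        # formatting data get (None, "")
        n = min(len(font_sizes), len(font_styles))
        formats = [
            (font_sizes[i], font_styles[i]) if i < n else (None, "")
            for i in range(len(text))
        ]

        # Group consecutive characters that share a size and font into a single
        # run; one run per character bloats the file and is awkward to edit in Word
        start = 0
        for idx in range(1, len(text) + 1):
            if idx < len(text) and formats[idx] == formats[start]:
                continue
            size, font = formats[start]
            run = p.add_run(text[start:idx])
            # Apply font size
            if size:
                run.font.size = Pt(size)
            # Apply bold/italic based on the font name (simplified for this example)
            if "Bold" in font:
                run.bold = True
            if "Italic" in font:
                run.italic = True
            start = idx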

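    # Add tables
    for table in pdf_content["tables"]:
        if not table or not table[0]:
            continue  # an empty table has no first row to size the columns from
        # Create a new table in the Word document; "Table Grid" is a built-in
        # style of the default template that draws visible cell borders
        word_table = doc.add_table(rows=len(table), cols=len(table[0]), style="Table Grid")
        for row_idx, row in enumerate(table):
            for col_idx, cell_text in enumerate(row):
                word_table.cell(row_idx, col_idx).text = cell_text if cell_text else ""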

    # Add images
    for img in pdf_content["images"]:
        # Save the image temporarily and add it to the Word document
        img_path = "temp_image.png"
        img.save(img_path)
        doc.add_picture(img_path, width=Inches(3))  # Adjust size as needed

    # Save the generated Word document
    doc.save(output_file)
    print(f"Word document saved to {output_file}")

# Example usage
pdf_content = extract_pdf_content_and_formatting(pdf_file_path)
generate_word_document(pdf_content, "output.docx")


Word document saved to output.docx


In [23]:
import pdfplumber
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer, LTTextLineHorizontal, LTChar
from docx import Document
from docx.shared import Pt
from docx.shared import Inches
import sqlite3  # Example using SQLite (You can switch to any other DB like MySQL or MongoDB)
import io
from PIL import Image

# Step 2: Store Data in a Database

def create_db():
    # Connect to SQLite database (or any other database system)
    conn = sqlite3.connect('document_data.db')
    cursor = conn.cursor()

    # Create tables for text, tables, images, and formatting
    cursor.execute('''
    CREATE TABLE IF NOT EXISTS TextContent (
        id INTEGER PRIMARY KEY,
        content TEXT
    )''')

    cursor.execute('''
    CREATE TABLE IF NOT EXISTS TableContent (
        id INTEGER PRIMARY KEY,
        table_data TEXT
    )''')

    cursor.execute('''
    CREATE TABLE IF NOT EXISTS ImageContent (
        id INTEGER PRIMARY KEY,
        image BLOB
    )''')

    cursor.execute('''
    CREATE TABLE IF NOT EXISTS FormattingContent (
        id INTEGER PRIMARY KEY,
        text TEXT,
        font_size TEXT,
        font_style TEXT
    )''')

    conn.commit()
    return conn

def store_pdf_content_in_db(conn, content):
    cursor = conn.cursor()

    # Store text content
    cursor.execute("INSERT INTO TextContent (content) VALUES (?)", (content["text"],))

    # Store tables (as a string representation, you may choose a better format)
    for table in content["tables"]:
        cursor.execute("INSERT INTO TableContent (table_data) VALUES (?)", (str(table),))

    # Store images as binary data
    for img in content["images"]:
        img_byte_arr = io.BytesIO()
        img.save(img_byte_arr, format='PNG')
        img_data = img_byte_arr.getvalue()
        cursor.execute("INSERT INTO ImageContent (image) VALUES (?)", (img_data,))

    # Store formatting details
    for formatting in content["formatting"]:
        text = formatting["text"]
        font_size = str(formatting["font_sizes"])  # Store font sizes as a string
        font_style = str(formatting["font_styles"])  # Store font styles as a string
        cursor.execute("INSERT INTO FormattingContent (text, font_size, font_style) VALUES (?, ?, ?)", 
                       (text, font_size, font_style))

    conn.commit()

# extract_pdf_content_and_formatting() and generate_word_document() are
# unchanged from the previous cell (In [7]) and are reused here.

# Example usage

conn = create_db()  # Step 2: Initialize the database
pdf_content = extract_pdf_content_and_formatting(pdf_file_path)
store_pdf_content_in_db(conn, pdf_content)  # Step 2: Store the content in the database
conn.close()  # Release the database file once the content is committed
generate_word_document(pdf_content, "output.docx")


Word document saved to output.docx
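
Because the cell above stores tables, font sizes, and font names with `str()`, reading them back requires parsing those strings into Python lists again. The sketch below uses `ast.literal_eval` for that; the helper names `load_formatting` and `load_tables` are assumptions, and the demo uses an in-memory database with the same column layout and made-up rows instead of `document_data.db`.

```python
import ast
import sqlite3

def load_formatting(conn):
    """Rebuild the per-line formatting dictionaries from FormattingContent."""
    rows = conn.execute(
        "SELECT text, font_size, font_style FROM FormattingContent ORDER BY id"
    )
    return [
        {
            "text": text,
            "font_sizes": ast.literal_eval(font_size),    # "[12.0, ...]" -> list of floats
            "font_styles": ast.literal_eval(font_style),  # "['Arial-Bold', ...]" -> list of str
        }
        for text, font_size, font_style in rows
    ]

def load_tables(conn):
    """Rebuild the nested table lists from TableContent."""
    rows = conn.execute("SELECT table_data FROM TableContent ORDER BY id")
    return [ast.literal_eval(table_data) for (table_data,) in rows]

# Demo data written the same way store_pdf_content_in_db writes it
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE FormattingContent "
    "(id INTEGER PRIMARY KEY, text TEXT, font_size TEXT, font_style TEXT)"
)
conn.execute("CREATE TABLE TableContent (id INTEGER PRIMARY KEY, table_data TEXT)")
conn.execute(
    "INSERT INTO FormattingContent (text, font_size, font_style) VALUES (?, ?, ?)",
    ("Hi", str([12.0, 12.0]), str(["Arial-Bold", "Arial-Bold"])),
)
conn.execute(
    "INSERT INTO TableContent (table_data) VALUES (?)",
    (str([["Item", None], ["A", "1"]]),),
)
formatting = load_formatting(conn)
tables = load_tables(conn)
```

`ast.literal_eval` only accepts Python literals, so it is safer than `eval` for this purpose; storing JSON instead of `str()` output would remove the need for it.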
