PDF Splitter Tool

A Python tool that splits a multi-section PDF into separate PDFs based on predefined components.

Features

Automatic Section Detection: Finds and extracts sections like "Project Summary", "Project Description", etc.
Clean Output: Creates separate PDF files for each component
Flexible Output: Specify custom output directory
Error Handling: Gracefully handles edge cases

Supported Components

The tool recognizes and separates the following sections:

Project Summary
Project Description
References Cited
Data Management and Sharing Plan
Mentoring Plan
Project Personnel and Partner Organizations
Facilities, Equipment and Other Resources
Synergistic Activities

Splitting Rules

Fixed Components (Order Guaranteed)

Project Summary: Page 1 (automatic)
Project Description: Pages 2-16 (automatic, always 15 pages)
References Cited: Pages 17+ until next section (automatic detection)

Variable Components (Order Not Determined)

Data Management and Sharing Plan
Mentoring Plan
Project Personnel and Partner Organizations
Facilities, Equipment and Other Resources
Synergistic Activities

Detection Algorithm

Extract page 1 as Project Summary and pages 2-16 as Project Description, by page position alone
Treat pages 17 onward as References Cited until the first variable section is detected
Detect remaining variable sections using fuzzy matching (approx. 70% similarity)
Each section ends where the next one begins
Last section extends to the end of the document
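
The boundary rules above can be sketched in a few lines. This is an illustration only, not the code in pdf_splitter.py: the function name section_ranges, the starts mapping, and the page numbers in the example are all hypothetical.

```python
def section_ranges(starts, total_pages):
    """Map each section name to a 1-based inclusive (first, last) page range.

    `starts` maps a section name to the 1-based page on which it begins.
    Each section ends on the page before the next one starts, and the
    last section runs to the end of the document.
    """
    ordered = sorted(starts.items(), key=lambda item: item[1])
    ranges = {}
    for i, (name, first) in enumerate(ordered):
        if i + 1 < len(ordered):
            last = ordered[i + 1][1] - 1  # page before the next section
        else:
            last = total_pages            # last section: run to the end
        ranges[name] = (first, last)
    return ranges


# Hypothetical 23-page proposal; the variable sections may come in any order.
starts = {
    "Project Summary": 1,
    "Project Description": 2,
    "References Cited": 17,
    "Mentoring Plan": 22,
    "Data Management and Sharing Plan": 20,
}
ranges = section_ranges(starts, total_pages=23)
print(ranges["References Cited"])  # (17, 19)
print(ranges["Mentoring Plan"])    # (22, 23)
```

Sorting by start page is what lets the variable sections appear in any order in the input.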

Key Requirements

Each component must start on a new page (guaranteed in input)
Component names are approximate (fuzzy matching handles variations like "Project Summary" vs "Summary of the Project")
No explicit section headers are required for the fixed components, which are located by page position; only the 5 variable components are located by their headers
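
One possible way to implement this kind of fuzzy matching uses difflib from the standard library. Whether pdf_splitter.py works this way is not stated here, so treat the following as a sketch under assumptions: the helper names normalize, similarity, and best_component are hypothetical, and sorting the words before comparing is a choice made for this example, because word order alone can pull a plain character-level ratio well below the threshold.

```python
import re
from difflib import SequenceMatcher

# Stand-in for the component list; the real one lives in pdf_splitter.py.
COMPONENTS = [
    "Project Summary",
    "Project Description",
    "References Cited",
    "Data Management and Sharing Plan",
    "Mentoring Plan",
    "Project Personnel and Partner Organizations",
    "Facilities, Equipment and Other Resources",
    "Synergistic Activities",
]


def normalize(text):
    """Lowercase, drop punctuation, and sort the words alphabetically."""
    return " ".join(sorted(re.findall(r"[a-z0-9]+", text.lower())))


def similarity(a, b):
    """Similarity in [0, 1] between two headings after normalization."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def best_component(heading, components=COMPONENTS, threshold=0.7):
    """Return the closest component name, or None if none reaches threshold."""
    best = max(components, key=lambda name: similarity(heading, name))
    return best if similarity(heading, best) >= threshold else None


print(best_component("Summary of the Project"))  # Project Summary
print(best_component("Budget Justification"))    # None
```

The 0.7 default mirrors the approximate 70% similarity mentioned above; a stricter value reduces false matches at the cost of missing looser heading variants.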

Installation

Install dependencies:

pip install -r requirements.txt

Usage

Command Line

python pdf_splitter.py input.pdf output_directory

Example:

python pdf_splitter.py proposal.pdf ./split_pdfs

As a Python Module

from pdf_splitter import split_pdf

results = split_pdf("proposal.pdf", "output_dir")
for component_name, filepath in results:
    print(f"{component_name}: {filepath}")

How It Works

Reads the input PDF file
Scans each page for component section headers
Identifies page boundaries where each section begins
Extracts pages for each section into a separate PDF file
Saves output PDFs with sanitized component names
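
The extraction step could look roughly like the sketch below. It assumes PyPDF2 2.0 or later (the PdfReader and PdfWriter classes with add_page); the function names page_indices and write_page_range are hypothetical and are not taken from pdf_splitter.py.

```python
def page_indices(first_page, last_page):
    """Convert a 1-based inclusive page range to 0-based page indices."""
    return range(first_page - 1, last_page)


def write_page_range(src_path, first_page, last_page, out_path):
    """Copy pages first_page..last_page (1-based, inclusive) into a new PDF."""
    # Imported here so page_indices stays usable without PyPDF2 installed.
    from PyPDF2 import PdfReader, PdfWriter

    reader = PdfReader(src_path)
    writer = PdfWriter()
    for index in page_indices(first_page, last_page):
        writer.add_page(reader.pages[index])
    with open(out_path, "wb") as handle:
        writer.write(handle)


# Example (assumes proposal.pdf exists and has at least 16 pages):
# write_page_range("proposal.pdf", 2, 16, "project_description.pdf")
```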

Output

Output files are named after the component, in lowercase with spaces replaced by underscores, for example:

project_summary.pdf
project_description.pdf
references_cited.pdf
data_management_and_sharing_plan.pdf
mentoring_plan.pdf
project_personnel_and_partner_organizations.pdf
synergistic_activities.pdf
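
A minimal sketch of a naming rule that reproduces the file names listed above. The function name component_filename is hypothetical, and the sketch only lowercases and swaps spaces for underscores; how the tool treats other characters, such as the comma in "Facilities, Equipment and Other Resources", is not specified here.

```python
def component_filename(name):
    """Lowercase the component name, replace spaces with underscores, add .pdf."""
    return name.lower().replace(" ", "_") + ".pdf"


print(component_filename("Project Summary"))   # project_summary.pdf
print(component_filename("References Cited"))  # references_cited.pdf
```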

Requirements

Python 3.6+
PyPDF2

Customization

To modify the component names or add new sections, edit the COMPONENTS list in pdf_splitter.py:

COMPONENTS = [
    "Project Summary",
    "Project Description",
    # ... add or modify components here
]

Limitations

Each component must start on a new page (as required)
Component names are matched case-insensitively
OCR is not performed (text must be selectable in the PDF)

License

MIT
