A Python tool that splits a multi-section PDF into separate PDFs based on predefined components.
- Automatic Section Detection: Finds and extracts sections like "Project Summary", "Project Description", etc.
- Clean Output: Creates separate PDF files for each component
- Flexible Output: Specify custom output directory
- Error Handling: Gracefully handles edge cases
The tool recognizes and separates the following sections:
- Project Summary
- Project Description
- References Cited
- Data Management and Sharing Plan
- Mentoring Plan
- Project Personnel and Partner Organizations
- Facilities, Equipment and Other Resources
- Synergistic Activities
Fixed-position components are extracted by page number:
- Project Summary: Page 1 (automatic)
- Project Description: Pages 2-16 (automatic, always 15 pages)
- References Cited: Pages 17+ until next section (automatic detection)
Variable components are located by detecting their section headers:
- Data Management and Sharing Plan
- Mentoring Plan
- Project Personnel and Partner Organizations
- Facilities, Equipment and Other Resources
- Synergistic Activities
- Extract page 1 as Project Summary and pages 2-16 as Project Description automatically
- Extract pages 17+ as References Cited until next section is detected
- Detect remaining variable sections using fuzzy matching (approx. 70% similarity)
- Each section ends where the next one begins
- Last section extends to the end of the document
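
The boundary rules above can be sketched roughly as follows. This is an illustration under stated assumptions, not the tool's actual code: `section_ranges`, `FIXED_STARTS`, and the `variable_starts` argument are hypothetical names, pages are 1-indexed, and the start pages of the variable sections are assumed to come from a separate header-detection step.

```python
from typing import Dict, List, Tuple

# First page of each fixed-position component (1-indexed), per the rules above.
FIXED_STARTS = {
    "Project Summary": 1,
    "Project Description": 2,
    "References Cited": 17,
}


def section_ranges(
    variable_starts: Dict[str, int], total_pages: int
) -> List[Tuple[str, int, int]]:
    """Return (name, first_page, last_page) for every section, in page order.

    variable_starts is a hypothetical input mapping each detected variable
    section to the page on which its header was found.
    """
    starts = dict(FIXED_STARTS)
    starts.update(variable_starts)

    ordered = sorted(starts.items(), key=lambda item: item[1])
    ranges = []
    for index, (name, first) in enumerate(ordered):
        if index + 1 < len(ordered):
            # Each section ends on the page before the next one begins.
            last = ordered[index + 1][1] - 1
        else:
            # The last section extends to the end of the document.
            last = total_pages
        ranges.append((name, first, last))
    return ranges
```

For example, calling `section_ranges({"Mentoring Plan": 22}, 25)` would assign pages 17-21 to References Cited and pages 22-25 to Mentoring Plan.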
- Each component must start on a new page (guaranteed in input)
- Component names are approximate (fuzzy matching handles variations like "Project Summary" vs "Summary of the Project")
- No explicit section headers required (except for the 5 variable components)
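
One way to approximate this kind of header matching is with the standard library's `difflib`. The sketch below is an assumption, since this README does not say which similarity measure `pdf_splitter.py` uses; `similarity` and `match_component` are hypothetical helper names. A plain character-level ratio like this one may score reordered wording lower than the tool's own matcher does.

```python
from difflib import SequenceMatcher
from typing import List, Optional

SIMILARITY_THRESHOLD = 0.7  # roughly the 70% similarity mentioned above


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def match_component(
    header: str, components: List[str], threshold: float = SIMILARITY_THRESHOLD
) -> Optional[str]:
    """Return the closest component name, or None if nothing reaches the threshold."""
    scored = [(similarity(header, name), name) for name in components]
    score, name = max(scored, default=(0.0, None))
    return name if score >= threshold else None
```

A header such as "MENTORING PLANS" would match "Mentoring Plan" under this sketch, while an unrelated line such as "Table of Contents" would return `None`.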
Install dependencies:

    pip install -r requirements.txt

Run from the command line:

    python pdf_splitter.py input.pdf output_directory

Example:

    python pdf_splitter.py proposal.pdf ./split_pdfs

Or call it from Python:

    from pdf_splitter import split_pdf
    results = split_pdf("proposal.pdf", "output_dir")
    for component_name, filepath in results:
        print(f"{component_name}: {filepath}")

When run, the tool:
- Reads the input PDF file
- Scans each page for component section headers
- Identifies page boundaries where each section begins
- Extracts pages for each section into a separate PDF file
- Saves output PDFs with sanitized component names
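
The extraction step might look something like the sketch below. It assumes a PyPDF2 release that provides `PdfReader` and `PdfWriter` (the names used from the 2.x series onward); `copy_pages` and `write_section` are hypothetical helpers, not functions taken from `pdf_splitter.py`.

```python
def copy_pages(pages, first, last, writer):
    """Append 1-indexed pages first..last (inclusive) to writer."""
    for index in range(first - 1, last):
        writer.add_page(pages[index])


def write_section(input_path, first, last, output_path):
    """Write pages first..last of input_path to a new PDF at output_path."""
    # Imported here so copy_pages stays usable without PyPDF2 installed.
    from PyPDF2 import PdfReader, PdfWriter

    reader = PdfReader(input_path)
    writer = PdfWriter()
    copy_pages(reader.pages, first, last, writer)
    with open(output_path, "wb") as handle:
        writer.write(handle)
```

Under these assumptions, `write_section("proposal.pdf", 2, 16, "project_description.pdf")` would save the 15-page Project Description as its own file.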
Output files are named based on the component, with spaces replaced by underscores:
- project_summary.pdf
- project_description.pdf
- references_cited.pdf
- data_management_and_sharing_plan.pdf
- mentoring_plan.pdf
- project_personnel_and_partner_organizations.pdf
- synergistic_activities.pdf
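
A naming rule of this kind could be sketched as below. `output_filename` is a hypothetical helper; lowercasing and replacing spaces with underscores follow the file names listed above, while dropping punctuation such as commas is an assumption, since this README does not say how such characters are handled.

```python
import re


def output_filename(component: str) -> str:
    """Turn a component name into a lowercase, underscore-separated PDF file name."""
    name = component.lower()
    # Assumption: punctuation (for example commas) is removed entirely.
    name = re.sub(r"[^a-z0-9\s]", "", name)
    # Collapse each run of whitespace into a single underscore.
    name = re.sub(r"\s+", "_", name.strip())
    return f"{name}.pdf"
```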
- Python 3.6+
- PyPDF2
To modify the component names or add new sections, edit the COMPONENTS list in pdf_splitter.py:

    COMPONENTS = [
        "Project Summary",
        "Project Description",
        # ... add or modify components here
    ]

- Each component must start on a new page (as required)
- Component names are matched case-insensitively
- OCR is not performed (text must be selectable in the PDF)
License: MIT