# Writing your own PDF handler to merge, split and extract PDF files

### Motivations for using Python:
- free and safe
- fast for handling massive files
- flexible, powerful and independent

### Here I share two small but very useful functions I wrote based on the [PyPDF2](https://pypi.org/project/PyPDF2/) package:
- merge_pdf
- split_pdf

I chose these two because, in my experience, merging and splitting are the most frequently needed PDF tasks. You can wrap these functions in a small module, for example ```pdfhandler.py```, and then import them directly via ```from pdfhandler import merge_pdf, split_pdf```. Or you can just copy and paste the code when you need it. In both cases, make sure the ```PyPDF2``` package is installed in advance. Note that the class names used below (```PdfFileMerger```, ```PdfFileReader```, ```PdfFileWriter```) belong to the PyPDF2 1.x/2.x API; PyPDF2 3.0 removed them in favour of ```PdfMerger```, ```PdfReader``` and ```PdfWriter```. There are many other Python packages for working with PDFs, so you can develop your own code to solve your own PDF problems.
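
As a small aid for the installation step mentioned above, the check below is a minimal sketch that tests whether a package can be imported before you call the PDF functions. The helper name `is_installed` is hypothetical, and the suggested `"PyPDF2<3"` pin is an assumption based on the older class names this notebook uses.

```python
import importlib.util

def is_installed(package_name):
    """Return True if a top-level package can be found on the import path."""
    # find_spec returns None when the package is not available
    return importlib.util.find_spec(package_name) is not None

if not is_installed("PyPDF2"):
    # assumption: the functions in this notebook rely on the pre-3.0 class names
    print('PyPDF2 not found, try: pip install "PyPDF2<3"')
```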

In [1]:
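# create a function to merge a list of pdfs in order (merge process = "append" + "save out")
# note: PdfFileMerger is the PyPDF2 1.x/2.x name (PyPDF2 3.0 uses PdfMerger instead)
from PyPDF2 import PdfFileMerger

def merge_pdf(pdfs_list, output_filename):
    '''Input the filenames of a list of pdfs, combine them into a single pdf with the specified name,
       keeping the default input order. The ".pdf" extension is added only if it is missing.'''
    # initialize the tool for merging pdf ("merger" here)
    merger = PdfFileMerger()
    # append each input pdf in the given order
    for pdf in pdfs_list:
        merger.append(pdf)
    # add the extension only when needed, so "out" and "out.pdf" both give "out.pdf"
    output_filename = str(output_filename)
    if not output_filename.lower().endswith('.pdf'):
        output_filename += '.pdf'
    # save out the combined file
    merger.write(output_filename)
    merger.close()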

# create a function to split a multi-page pdf and save out each selected page as a separate pdf
from PyPDF2 import PdfFileWriter, PdfFileReader

def split_pdf(input_pdf, output_filename, target_pages=None):
    '''Input a single pdf filename, save out each page as a separate pdf by default.
       Or you can provide selected page numbers in a list (just use the natural page number in the file).
       Output filenames are "Your filename + _page_ + page number + .pdf"'''
    # keep the input file open while pages are copied, and close it afterwards
    with open(input_pdf, "rb") as input_stream:
        # read the input pdf
        inputpdf = PdfFileReader(input_stream)
        # save out each page if no target pages are provided,
        # or save out the requested target pages only;
        # natural (1-based) page numbers are converted to Python (0-based) indices here
        if target_pages is None:
            output_pages = range(inputpdf.numPages)
        else:
            output_pages = [x - 1 for x in target_pages]
        # loop over all pages or requested ones only, then save out each page
        for i in output_pages:
            output = PdfFileWriter()
            output.addPage(inputpdf.getPage(i))
            with open('%s_page_%d.pdf' % (output_filename, i + 1), "wb") as output_stream:
                output.write(output_stream)
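
The conversion from natural page numbers to Python indices inside `split_pdf` can be sketched on its own, with a bounds check added so that a wrong page number fails early. The helper name `to_page_indices` and the choice to raise `ValueError` are assumptions for illustration; they are not part of the functions above or of PyPDF2.

```python
def to_page_indices(target_pages, num_pages):
    """Convert natural (1-based) page numbers to 0-based indices.

    Hypothetical helper: raises ValueError for pages outside 1..num_pages.
    """
    if target_pages is None:
        # no selection means every page
        return list(range(num_pages))
    indices = []
    for page in target_pages:
        if not 1 <= page <= num_pages:
            raise ValueError("page %d is outside 1..%d" % (page, num_pages))
        indices.append(page - 1)
    return indices

print(to_page_indices([1, 3, 10], num_pages=10))  # [0, 2, 9]
print(to_page_indices(None, num_pages=3))         # [0, 1, 2]
```

Checking the numbers before any file is written means a request such as page 0 or page 11 of a 10-page file stops with a clear message instead of an index error halfway through.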

### Here I show some examples of how to use these two functions:

In [None]:
##############################################################################################
# set up working directory
import os 
os.chdir('your working directory')

# get the list of pdf files

# when there are only a few files to merge, you can list their names manually
pdf_files = ['test1.pdf','test2.pdf','test3.pdf']

# then you just need one line to merge them together
# (merge_pdf adds the ".pdf" extension itself, so it is left out here)
merge_pdf(pdf_files,"combined_pdf_test1")

In [None]:
##############################################################################################
# if you want to merge a large number of files in a directory,
# you can pick out the PDF files by working with the file names as strings,
# for example with "str.startswith" and "str.endswith"

# list all files within this directory
all_files = os.listdir()

# here we write some code to identify all PDF files in a directory
pdf_files = []

# loop over all files
for filename in all_files:
    # only add the file to your list if it is a pdf (".PDF" is accepted too)
    if filename.lower().endswith('.pdf'):
        pdf_files.append(filename)

# sort the files to decide their order
pdf_files.sort()

# check the files and their order
# (merged outputs saved earlier in this directory will show up here as well)
for filename in pdf_files:
    print(filename)

# now you just need one line to merge them together
# (merge_pdf adds the ".pdf" extension itself, so it is left out here)
merge_pdf(pdf_files,"combined_pdf_test2")
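
The same directory scan can be written more compactly with the standard-library `pathlib` module. This is only a sketch of an alternative, not a replacement for the loop above; the function name `list_pdfs` is hypothetical.

```python
from pathlib import Path

def list_pdfs(directory="."):
    """Return the sorted names of the PDF files directly inside `directory`."""
    names = [
        path.name
        for path in Path(directory).iterdir()
        # compare the lower-cased suffix so "A.PDF" is found as well
        if path.is_file() and path.suffix.lower() == ".pdf"
    ]
    return sorted(names)

# list the PDFs in the current working directory
print(list_pdfs("."))
```

The returned list can be passed to `merge_pdf` in the same way as `pdf_files`.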

In [3]:
##############################################################################################
# split a PDF file with multiple pages
split_pdf('test1.pdf',"test1")

# then check outputs in your working directory
import glob
import re

test = glob.glob("test1_page*")
test.sort(key=lambda var:[int(x) if x.isdigit() else x for x in re.findall(r'[^0-9]|[0-9]+', var)])
display(test)

['test1_page_1.pdf',
 'test1_page_2.pdf',
 'test1_page_3.pdf',
 'test1_page_4.pdf',
 'test1_page_5.pdf',
 'test1_page_6.pdf',
 'test1_page_7.pdf',
 'test1_page_8.pdf',
 'test1_page_9.pdf',
 'test1_page_10.pdf']
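
The sort key in the cell above exists because a plain string sort would put page 10 before page 2. A minimal sketch of the same idea, using `re.split` and a hypothetical helper name `natural_key`:

```python
import re

def natural_key(name):
    """Split a name into text and integer chunks so that 'page_10' sorts after 'page_2'."""
    # re.split with a capturing group keeps the digit runs, alternating text / number
    return [int(chunk) if chunk.isdigit() else chunk.lower()
            for chunk in re.split(r'(\d+)', name)]

names = ['test1_page_10.pdf', 'test1_page_2.pdf', 'test1_page_1.pdf']
print(sorted(names))                   # plain sort: page_1, page_10, page_2
print(sorted(names, key=natural_key))  # natural sort: page_1, page_2, page_10
```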

### You can further improve these two functions by adding new arguments. I haven't done so, because these two basic functions already meet my needs. Here I show two examples of what you can achieve:

In [None]:
# PdfFileMerger is the PyPDF2 1.x/2.x class name; pdf_files is the list built above
from PyPDF2 import PdfFileMerger

# only combine the first 3 pages from each file
merger = PdfFileMerger()

for pdf in pdf_files:
    merger.append(pdf, pages=(0, 3)) # (start, stop): indices 0, 1, 2

merger.write("combined_pdf_test3.pdf")
merger.close()

# only combine pages 1,3,5 from each input pdf file
merger = PdfFileMerger()

for pdf in pdf_files:
    merger.append(pdf, pages=(0, 6, 2)) # (start, stop, step): indices 0, 2, 4

merger.write("combined_pdf_test4.pdf")
merger.close()
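
To see which natural page numbers a `pages` tuple selects, it helps to expand it with Python's `range`. The sketch below assumes the tuple is interpreted like `range` arguments (0-based, stop excluded); the helper name `natural_pages` is hypothetical and not part of PyPDF2.

```python
def natural_pages(pages):
    """List the natural (1-based) page numbers picked by a (start, stop[, step]) tuple."""
    # range() uses 0-based indices and excludes `stop`, so add 1 to each index
    return [index + 1 for index in range(*pages)]

print(natural_pages((0, 3)))     # [1, 2, 3]
print(natural_pages((0, 6, 2)))  # [1, 3, 5]
```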

### There is more than one Python package that can manipulate PDFs. You can build functions according to your research needs. For example, you can [merge two pages into one](https://stackoverflow.com/questions/22795091/how-to-append-pdf-pages-using-pypdf2), [convert PDFs to images](https://stackoverflow.com/questions/46184239/extract-a-page-from-a-pdf-as-a-jpeg) and [scrape data from PDFs](https://medium.com/codestorm/how-to-read-and-scrape-data-from-pdf-file-using-python-2f2a2fe73ae7).