**Working with PDF Files in Python**


The PyPDF2 package for Python lets you perform many useful operations on existing PDF files and write the results out as new PDF files.

This tutorial covers the following operations on PDF files:

*   Splitting a PDF file (reading one or more pages from a file)
*   Extracting text from PDF files
*   Merging PDFs


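**To install PyPDF2, run the following command from the command line:**

The examples in this tutorial use the `PdfFileReader`-style API, which PyPDF2 3.0 removed, so the command pins an earlier release.

In [0]:
pip install "PyPDF2<3.0"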

In [0]:
import PyPDF2

# open the file in binary mode; the with block closes it when done
with open('sample.pdf', 'rb') as pdfFileObj:
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

    # get total number of pages
    print(pdfReader.numPages)

    # read content of first page
    pageObj = pdfReader.getPage(0)
    print(pageObj.extractText())

Splitting PDF file

---



In [0]:
import os
from PyPDF2 import PdfFileReader, PdfFileWriter
 
def pdf_splitter(path):
    fname = os.path.splitext(os.path.basename(path))[0]
 
    pdf = PdfFileReader(path)
    for page in range(pdf.getNumPages()):
        pdf_writer = PdfFileWriter()
        pdf_writer.addPage(pdf.getPage(page))
 
        output_filename = '{}_page_{}.pdf'.format(
            fname, page+1)
 
        with open(output_filename, 'wb') as out:
            pdf_writer.write(out)
 
        print('Created: {}'.format(output_filename))
 
if __name__ == '__main__':
    path = 'sample.pdf'           # your pdf file to split
    pdf_splitter(path)

The code above defines a small function called pdf_splitter. It accepts the path of the input PDF file. The first line of the function grabs the name of the input file, minus the directory and the extension.
Next, we open the PDF and create a reader object. Then we loop over all the pages using the reader object's **getNumPages** method.

Inside the for loop we create a new PDF writer object and add the current page to it with **addPage**. We then open the output file name in write-binary mode and use the writer object's **write** method to save that single page to disk.
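The file-naming step described above can be tried on its own, without opening a PDF. The following is a minimal sketch using only the standard library; the helper name `split_output_names` and its `num_pages` parameter are hypothetical and exist only for this illustration.

```python
import os

def split_output_names(path, num_pages):
    """Return the per-page file names the splitter would create.

    Hypothetical helper for illustration; it does not read the PDF.
    """
    # basename drops the directory part; splitext drops the extension
    fname = os.path.splitext(os.path.basename(path))[0]
    # the loop index starts at 0, but the file names start at page 1
    return ['{}_page_{}.pdf'.format(fname, page + 1)
            for page in range(num_pages)]

print(split_output_names('reports/sample.pdf', 3))
```

Because only the base name is kept, the split pages are written to the current working directory rather than next to the input file.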


Extracting text from PDF files


---


In [0]:
# importing required modules
import PyPDF2
 
# creating a pdf file object
pdfFileObj = open('my.pdf', 'rb')
 
# creating a pdf reader object
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
 
# printing number of pages in pdf file
print(pdfReader.numPages)
 
# creating a page object
pageObj = pdfReader.getPage(0)
 
# extracting text from page
print(pageObj.extractText())
 
# closing the pdf file object
pdfFileObj.close()
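The cell above reads only the first page. As a rough sketch, the same calls could be looped over every page; the function name `extract_all_text` is hypothetical, and it assumes `reader` exposes `numPages` and `getPage()` as the reader object above does.

```python
def extract_all_text(reader):
    """Return a list holding the extracted text of every page.

    Hypothetical helper; `reader` is assumed to behave like the
    PdfFileReader object used in this tutorial.
    """
    pages = []
    for index in range(reader.numPages):
        text = reader.getPage(index).extractText()
        # a page with no extractable text is stored as an empty string
        pages.append(text or '')
    return pages
```

It would need to be called as `extract_all_text(pdfReader)` while the PDF file object is still open, that is, before `pdfFileObj.close()`.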

Merging PDFs


---


PyPDF2 makes merging a bit simpler by providing a **PdfFileMerger** class:


In [0]:
import PyPDF2
 
def PDFmerge(pdfs, output): 
    # creating pdf file merger object
    pdfMerger = PyPDF2.PdfFileMerger()
     
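    # appending pdfs one by one; passing the path lets the merger open
    # each file itself and keep it available until write() is called
    for pdf in pdfs:
        pdfMerger.append(pdf)

    # writing merged pdf to the output pdf file
    with open(output, 'wb') as f:
        pdfMerger.write(f)

    # release the input files held by the merger
    pdfMerger.close()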
 
def main():
    # pdf files to merge; change the file names here
    pdfs = ['fir.pdf', 'sec.pdf']
    
    # your output pdf file name
    output  = 'newpdf.pdf'
     
    # calling pdf merge function
    PDFmerge(pdfs = pdfs, output = output)
 
if __name__ == "__main__":
    # calling the main function
    main()