### Lesson 44:

### Reading and Editing PDFs

PDF files are binary files, which are more complex than text files, since they contain formatting information, images, and other assets.

PDF is great for printing, but less convenient for software, which typically works by parsing plain text.

The `PyPDF2` module allows us to more easily work with PDF files in Python.

In [1]:
import PyPDF2

Despite its functionality, there are some PDFs that even this module can't perfectly handle. This is simply the nature of PDF files.

We must first change the directory to the folder holding PDF files, to avoid having to reference full paths.

In [4]:
import os

os.chdir('files')

In [8]:
os.listdir()

['.DS_Store',
 '112065.pdf',
 '26645.pdf',
 'alarm.wav',
 'allMyCats1.py',
 'allMyCats1.py.backup',
 'allMyCats2.py',
 'allMyCats2.py.backup',
 'AutomateSearch.png',
 'backupToZip.py',
 'backupToZip.py.backup',
 'bacon.txt',
 'birthdays.py',
 'birthdays.py.backup',
 'boxPrint.py',
 'boxPrint.py.backup',
 'buggyAddingProgram.py',
 'buggyAddingProgram.py.backup',
 'bulletPointAdder.py',
 'bulletPointAdder.py.backup',
 'calcProd.py',
 'calcProd.py.backup',
 'catlogo.png',
 'catnapping.py',
 'catnapping.py.backup',
 'census2010.py',
 'census2010.py.backup',
 'censuspopdata.xlsx',
 'characterCount.py',
 'characterCount.py.backup',
 'coinFlip.py',
 'coinFlip.py.backup',
 'combinedminutes.pdf',
 'combinePdfs.py',
 'combinePdfs.py.backup',
 'countdown.py',
 'countdown.py.backup',
 'CSSSelector.png',
 'demo.docx',
 'dictionary.txt',
 'dimensions.xlsx',
 'downloadXkcd.py',
 'downloadXkcd.py.backup',
 'duesRecords.xlsx',
 'encrypted.pdf',
 'encryptedminutes.pdf',
 'error_log.txt',
 'errorExample.

The `open()` function lets the script interact with a file, and by default opens it in text 'read' mode. However, since PDFs are binary, we need 'read binary' mode, which we select by passing `'rb'` as the second argument.

In [11]:
pdfFile = open('meetingminutes.pdf', 'rb')
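
To see what 'read binary' mode returns, the sketch below writes a few bytes that mimic the start of a PDF to a temporary file and reads them back with `'rb'`. The file name `sample.pdf` and the helper `looks_like_pdf` are made up for illustration. PDF files begin with the marker `%PDF-`, so comparing the first five bytes works as a rough sanity check, not a full validation.

```python
import os
import tempfile


def looks_like_pdf(path):
    """Return True if the file starts with the PDF marker b'%PDF-'."""
    # 'rb' returns raw bytes instead of decoded text.
    with open(path, 'rb') as f:
        return f.read(5) == b'%PDF-'


# Build a stand-in file (hypothetical name) so the sketch runs anywhere.
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, 'sample.pdf')
with open(sample_path, 'wb') as f:
    f.write(b'%PDF-1.4\n\x25\xe2\xe3\xcf\xd3\n')

print(looks_like_pdf(sample_path))
```

The same check could be pointed at `meetingminutes.pdf` before handing the file to the reader.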

Once loaded, we can now pass the file to the PDF reader, and create a 'reader' object.

In [12]:
reader = PyPDF2.PdfFileReader(pdfFile)

In [13]:
reader

<PyPDF2.pdf.PdfFileReader at 0x105b93c18>

A 'reader' object has a variety of attributes and methods, one of which is `.numPages`, an attribute that holds the number of pages in the PDF.

In [15]:
reader.numPages

19

The `.getPage()` method loads a specific page, and turns it into a 'page' object for even more interaction.

In [17]:
page = reader.getPage(0)

Once created, we can use methods like `.extractText()` to retrieve the text of that page object.

In [19]:
page.extractText()

'OOFFFFIICCIIAALL  BBOOAARRDD  MMIINNUUTTEESS   Meeting of \nMarch 7\n, 2014\n        \n     The Board of Elementary and Secondary Education shall provide leadership and \ncreate policies for education that expand opportunities for children, empower \nfamilies and communities, and advance Louisiana in an increasingly \ncompetitive glob\nal market.\n BOARD \n of ELEMENTARY\n and \n SECONDARY\n EDUCATION\n  '

Note that while this extraction is not perfect (the title letters are doubled and some lines break mid-word), it is sufficient for understanding the content of the [PDF file](files/meetingminutes.pdf).

![image](files/meetingminutes.png)
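
Some of the stray whitespace in the extracted text can be tidied with ordinary string handling. The helper below, `tidy_extracted_text`, is a hypothetical name and only a minimal sketch: it collapses runs of spaces and newlines and removes the gap left before commas and periods. It does not repair words split mid-line (such as `glob\nal`) or the doubled title letters.

```python
import re


def tidy_extracted_text(raw):
    """Collapse stray whitespace left behind by PDF text extraction."""
    # Replace every run of spaces and newlines with a single space.
    text = ' '.join(raw.split())
    # Remove the space that ends up before commas and periods.
    return re.sub(r'\s+([,.])', r'\1', text)


# A fragment of the output shown above.
raw = 'Meeting of \nMarch 7\n, 2014\n        \n     The Board'
print(tidy_extracted_text(raw))
```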

To extract all the text from this PDF document, we can loop this method over all pages:

In [20]:
for pageNum in range(reader.numPages):
    print(reader.getPage(pageNum).extractText())

OOFFFFIICCIIAALL  BBOOAARRDD  MMIINNUUTTEESS   Meeting of 
March 7
, 2014
        
     The Board of Elementary and Secondary Education shall provide leadership and 
create policies for education that expand opportunities for children, empower 
families and communities, and advance Louisiana in an increasingly 
competitive glob
al market.
 BOARD 
 of ELEMENTARY
 and 
 SECONDARY
 EDUCATION
  
 LOUISIANA STATE BOARD OF ELEMENTARY AND SECONDARY EDUCATION
   MARCH 7, 2014
  
 The Louisiana Purchase Room
  Baton Rouge, LA
   
 
 
The Louisiana State Board of Elementary and Secondary Education met in 
regular
 session on
 March 7, 2014
, in the Louisiana Purcha
se Room, located in the Claiborne 
Building in Baton Rouge, Louisiana.  The meeting was called to order at 
9:17 a.m.
 by 
Board President 
Chas Roemer
 and opened with a prayer by
 Ms. Terry Johnson, Bossier 
Parish School System
.  
Board members present were 
Dr. Lottie Beebe, Ms. Holly Boffy, Mr. Jim Garvey, Mr.
 Jay 
Guillot, Ms.
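
Instead of printing each page, the text can be collected into one string for later searching or saving. The sketch below wraps the same loop in a hypothetical helper, `extract_all_text`. So that it runs without a PDF on disk, it uses small stand-in classes (`FakePage`, `FakeReader`) that imitate only the `numPages`, `getPage()`, and `extractText()` names used in this lesson; they are not part of `PyPDF2`.

```python
class FakePage:
    """Stand-in for a page object; only extractText() is imitated."""

    def __init__(self, text):
        self._text = text

    def extractText(self):
        return self._text


class FakeReader:
    """Stand-in for a reader object; only numPages and getPage() are imitated."""

    def __init__(self, texts):
        self._pages = [FakePage(t) for t in texts]
        self.numPages = len(self._pages)

    def getPage(self, page_num):
        return self._pages[page_num]


def extract_all_text(reader):
    """Join the text of every page, one page per line break."""
    return '\n'.join(
        reader.getPage(page_num).extractText()
        for page_num in range(reader.numPages)
    )


reader_stub = FakeReader(['page one', 'page two'])
all_text = extract_all_text(reader_stub)
print(all_text)
```

Passing the real `reader` object from earlier in place of `reader_stub` should behave the same way, assuming the file is still open.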

The `PyPDF2` module includes a 'writer' as well as a 'reader', which can create new PDF files. However, it cannot write arbitrary text to a PDF the way we would write to a text file. It works at the level of whole pages: it can add, remove, or reorder pages, but it cannot change the text or layout within a page.

We can use the writer to manage PDF pages across multiple documents in a batch approach.

In [25]:
import PyPDF2

# Open up both example PDFs in read-binary mode
pdf1File = open('meetingminutes.pdf', 'rb')
pdf2File = open('meetingminutes2.pdf', 'rb')

# Now generate readers for both of these files
reader1 = PyPDF2.PdfFileReader(pdf1File)
reader2 = PyPDF2.PdfFileReader(pdf2File)

# Now create a writer object that will merge these PDFs into one pdf. 
# this object only exists in memory at the moment, and will need to be saved later.
writer = PyPDF2.PdfFileWriter()

We can now loop through all the pages in the 'reader' objects and add them to the 'writer' object.

In [26]:
# For every page in reader 1
for pageNum in range(reader1.numPages):
    # Create a page object at that page
    page = reader1.getPage(pageNum)
    # Add that page object to the writer defined earlier
    writer.addPage(page)
    
# For every page in reader 2
for pageNum in range(reader2.numPages):
    # Create a page object at that page
    page = reader2.getPage(pageNum)
    # Add that page object to the writer defined earlier
    writer.addPage(page)    
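
The two loops above are identical apart from the reader, so they can be folded into one helper that also accepts an explicit page order. This is a sketch only: `add_pages` is a hypothetical name, and `FakeReader` and `FakeWriter` are stand-ins that imitate just the `numPages`, `getPage()`, and `addPage()` names used in this lesson, so the example runs without any PDF files.

```python
class FakeReader:
    """Stand-in for a reader object; only numPages and getPage() are imitated."""

    def __init__(self, pages):
        self._pages = list(pages)
        self.numPages = len(self._pages)

    def getPage(self, page_num):
        return self._pages[page_num]


class FakeWriter:
    """Stand-in for a writer object; it just records pages in order."""

    def __init__(self):
        self.pages = []

    def addPage(self, page):
        self.pages.append(page)


def add_pages(writer, reader, page_numbers=None):
    """Copy pages from reader to writer, all of them unless told otherwise."""
    if page_numbers is None:
        page_numbers = range(reader.numPages)
    for page_num in page_numbers:
        writer.addPage(reader.getPage(page_num))


writer_stub = FakeWriter()
first = FakeReader(['A1', 'A2', 'A3'])
second = FakeReader(['B1', 'B2'])

add_pages(writer_stub, first)                        # every page, in order
add_pages(writer_stub, second, page_numbers=[1, 0])  # reversed order
print(writer_stub.pages)
```

Leaving a page number out of `page_numbers` drops that page, which covers removing as well as reordering.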

Now the writer object contains all the pages, and we must save it to disk (in the working directory set earlier, or at an absolute path if one is provided).

We first need to `open()` a file in 'write binary' mode that the 'writer' object can be saved to.

In [30]:
outputFile = open('combinedminutes2.pdf', 'wb')

In [31]:
writer.write(outputFile)

The newly created file is now available as [combinedminutes2.pdf](files/combinedminutes2.pdf).

We can now close all the open files, so that the output is fully written to disk and no further edits are made.

In [32]:
outputFile.close()
pdf1File.close()
pdf2File.close()
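
An alternative to calling `.close()` by hand is a `with` block, which closes its files automatically when the block ends, even if an error occurs inside it. The sketch below shows the pattern with plain byte files under made-up names (`input.bin`, `output.bin`) rather than real PDFs. The same structure should carry over to the merge above, on the assumption that all reading and `writer.write()` calls happen inside the block while the source files are still open.

```python
import os
import tempfile

# Hypothetical file names in a temporary folder, so nothing real is touched.
tmp_dir = tempfile.mkdtemp()
in_path = os.path.join(tmp_dir, 'input.bin')
out_path = os.path.join(tmp_dir, 'output.bin')

with open(in_path, 'wb') as f:
    f.write(b'example bytes')

# Both files are closed as soon as the block is left.
with open(in_path, 'rb') as source, open(out_path, 'wb') as target:
    target.write(source.read())

print(source.closed, target.closed)
```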

### Recap
* The `PyPDF2` module can read and write PDFs.
* Opening a PDF is done by calling `open()` in read-binary mode and passing the file object to the `PdfFileReader()` function.
* A Page object can be obtained from a PDF reader object with the `.getPage()` method.
* The text from a Page object is obtained with the `.extractText()` method on a Page object, which can be imperfect.
* New PDFs can be made with `PdfFileWriter()`, but they can only be built by manipulating and merging existing pages.
* Pages can be appended to a writer object by passing page objects to its `.addPage()` method.
* Call the `write()` method on a writer object to save it to a defined & open output file.