# PDFs

Portable Document Format (PDF), standardized as ISO 32000, is a file format developed by Adobe in 1992 to present documents, including text formatting and images, in a manner independent of application software, hardware, and operating systems. Python has many libraries for working with PDFs, each with its own pros and cons. This notebook uses `PyPDF2`, one of the most common.

Keep in mind that not every PDF file can be read with this library. PDFs that are too blurry, use a special encoding, are encrypted, or were simply created with a program that doesn't work well with PyPDF2 may not be readable. Also, the approach used in this notebook extracts only the text of a PDF document; it does not grab images or other media files.
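
One inexpensive check before handing a file to a PDF library is to confirm that it begins with the `%PDF-` marker that PDF files start with. The sketch below is an illustration under assumptions rather than part of PyPDF2: `looks_like_pdf` is a hypothetical helper name, the two files are throwaway placeholders, and passing the check says nothing about encryption, encoding, or whether text extraction will succeed.

```python
import os
import tempfile


def looks_like_pdf(path):
    """Return True if the file at `path` starts with the PDF header marker.

    Hypothetical helper for illustration; not part of PyPDF2.
    """
    # Binary mode, because a PDF is not a plain-text file
    with open(path, "rb") as f:
        return f.read(5) == b"%PDF-"


# Demonstrate on two small throwaway files (placeholder contents, not real documents)
with tempfile.TemporaryDirectory() as tmp:
    pdf_like = os.path.join(tmp, "pdf_like.bin")
    plain = os.path.join(tmp, "plain.txt")

    with open(pdf_like, "wb") as f:
        f.write(b"%PDF-1.4\n")
    with open(plain, "wb") as f:
        f.write(b"just some text\n")

    header_results = (looks_like_pdf(pdf_like), looks_like_pdf(plain))

print(header_results)
```

A file that fails this check is almost certainly not worth passing to a reader; a file that passes may still be unreadable for the reasons listed above.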

In [1]:
pip install PyPDF2

Collecting PyPDF2
  Downloading PyPDF2-1.26.0.tar.gz (77 kB)
     |████████████████████████████████| 77 kB 2.6 MB/s
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: PyPDF2
  Building wheel for PyPDF2 (setup.py) ... done
  Created wheel for PyPDF2: filename=PyPDF2-1.26.0-py3-none-any.whl size=61101 sha256=6786c6bbeab732d9e9f408c998a69e705707de18e60d2001f5e650988bd808e9
  Stored in directory: /root/.cache/pip/wheels/80/1a/24/648467ade3a77ed20f35cfd2badd32134e96dd25ca811e64b3
Successfully built PyPDF2
Installing collected packages: PyPDF2
Successfully installed PyPDF2-1.26.0
Note: you may need to restart the kernel to use updated packages.


In [2]:
import PyPDF2

# Reading PDFs

As with the csv library, we first open the PDF file, then create a reader object for it.

In [3]:
# First we open it
# Notice we read it as a binary with 'rb'

f = open('../input/pdfs-and-csvs/Working_Business_Proposal.pdf','rb')

In [4]:
# After that, we read it

pdf_reader = PyPDF2.PdfFileReader(f)

In [5]:
# Get number of pages

pdf_reader.numPages

5

In [6]:
# Get the first page
page_one = pdf_reader.getPage(0)

# Extracting the text
page_one_text = page_one.extractText()

page_one_text

'Business Proposal\n The Revolution is Coming\n Leverage agile frameworks to provide a robust synopsis for high level \noverviews. Iterative approaches to corporate strategy foster collaborative \nthinking to further the overall value proposition. Organically grow the \nholistic world view of disruptive innovation via workplace diversity and \nempowerment. \nBring to the table win-win survival strategies to ensure proactive \ndomination. At the end of the day, going forward, a new normal that has \nevolved from generation X is on the runway heading towards a streamlined \ncloud solution. User generated content in real-time will have multiple \ntouchpoints for offshoring. \nCapitalize on low hanging fruit to identify a ballpark value added activity to \nbeta test. Override the digital divide with additional clickthroughs from \nDevOps. Nanotechnology immersion along the information highway will \nclose the loop on focusing solely on the bottom line. Podcasting operational change managem
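
The extracted string keeps the line breaks from the PDF layout, so sentences are split by `\n`. If a single flowing string is easier to work with, one option is to collapse every run of whitespace into a single space. This is a minimal sketch that uses only the opening of the output above; `normalize_whitespace` is a hypothetical helper name, not a PyPDF2 function.

```python
def normalize_whitespace(text):
    """Collapse runs of spaces and newlines into single spaces."""
    # str.split() with no arguments splits on any whitespace and drops empty pieces
    return " ".join(text.split())


# The first few lines of the extracted text shown above
sample = (
    'Business Proposal\n The Revolution is Coming\n Leverage agile frameworks '
    'to provide a robust synopsis for high level \noverviews.'
)

cleaned = normalize_whitespace(sample)
print(cleaned)
```

This also removes paragraph breaks, so it suits keyword searching better than work that depends on the original layout.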

In [7]:
f.close()

Remember that if PyPDF2 cannot read a particular file, you may have to try another library.
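
The explicit `open` / `close` pair used above can also be written as a `with` block, which closes the file even if an error occurs partway through. The sketch below uses only the standard library and a throwaway file with placeholder bytes so that it runs anywhere; with a real PDF, the reader object would be created and used inside the `with` block.

```python
import os
import tempfile

# Create a throwaway binary file (placeholder bytes, not a real PDF)
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.bin")
    with open(path, "wb") as out:
        out.write(b"placeholder")

    # 'rb' opens the file for reading in binary mode, as in the cells above
    with open(path, "rb") as f:
        first_bytes = f.read()
        open_inside = not f.closed

    # Leaving the with block has closed the file; no f.close() is needed
    closed_after = f.closed

print(first_bytes, open_inside, closed_after)
```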

# Adding to PDFs

Because of how most PDF files are structured, we cannot simply edit their text. What we can do is copy pages and append them to the end of another document.

In [8]:
# Getting the first page

f = open('../input/pdfs-and-csvs/Working_Business_Proposal.pdf','rb')

pdf_reader = PyPDF2.PdfFileReader(f)

first_page = pdf_reader.getPage(0)

In [9]:
# Creating a writer object

pdf_writer = PyPDF2.PdfFileWriter()

# Adding a page to the writer object

pdf_writer.addPage(first_page)

In [10]:
# In order to append a page to a PDF file, it should be a pdf.PageObject

type(first_page)

PyPDF2.pdf.PageObject

In [11]:
# Opening a new file in order to append the first page to it

pdf_output = open("Some_New_Doc.pdf","wb")

# Writing the pages held by pdf_writer (here, first_page) to Some_New_Doc.pdf

pdf_writer.write(pdf_output)

In [12]:
# Don't forget to close

f.close()
pdf_output.close()
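
The steps above can be gathered into one function that uses `with` blocks, so both files are closed automatically. This is a sketch under assumptions: it relies on the same PyPDF2 1.26.0 calls used in this notebook (`PdfFileReader`, `getPage`, `PdfFileWriter`, `addPage`, `write`), `copy_first_page` is a hypothetical name, and the import sits inside the function only so that the definition loads even where PyPDF2 is not installed.

```python
def copy_first_page(src_path, dst_path):
    """Copy page 0 of the PDF at src_path into a new PDF at dst_path.

    Hypothetical helper; assumes the PyPDF2 1.26.0 API used in this notebook.
    """
    import PyPDF2  # imported here so defining the function does not require PyPDF2

    # Keep the source open until the write finishes, since PyPDF2 may still
    # read page data from it while writing
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        pdf_reader = PyPDF2.PdfFileReader(src)
        pdf_writer = PyPDF2.PdfFileWriter()
        pdf_writer.addPage(pdf_reader.getPage(0))
        pdf_writer.write(dst)
```

With the file from this notebook, a call would look like `copy_first_page('../input/pdfs-and-csvs/Working_Business_Proposal.pdf', 'Some_New_Doc.pdf')`.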

# Grabbing all the Text

Let's try to grab all the text from this PDF file.

In [13]:
# Opening the file

f = open('../input/pdfs-and-csvs/Working_Business_Proposal.pdf','rb')

In [14]:
# Reading the file

pdf_reader = PyPDF2.PdfFileReader(f)

In [15]:
# Get number of pages

pdf_reader.numPages

5

In [16]:
# Create an empty list

pdf_text = []

In [17]:
# Loop over every page index, from 0 to numPages - 1
# (range() supplies n, so no separate counter is needed)
for n in range(pdf_reader.numPages):
    
    # Get the page number n
    page = pdf_reader.getPage(n)
    
    # append it to the list
    pdf_text.append(page.extractText())
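
The same loop can be wrapped in a small function and written as a list comprehension. In the sketch below, `extract_all_text` is a hypothetical helper, and `StubPage` / `StubReader` are stand-ins invented purely so the example runs without a PDF file; they mimic only the `numPages`, `getPage`, and `extractText` names used above. In the notebook, the real `pdf_reader` would be passed instead.

```python
def extract_all_text(reader):
    """Return a list with the extracted text of every page, in page order."""
    return [reader.getPage(n).extractText() for n in range(reader.numPages)]


class StubPage:
    """Stand-in for a page object (illustration only, not a PyPDF2 class)."""

    def __init__(self, text):
        self._text = text

    def extractText(self):
        return self._text


class StubReader:
    """Stand-in for a reader object (illustration only, not a PyPDF2 class)."""

    def __init__(self, texts):
        self._pages = [StubPage(t) for t in texts]
        self.numPages = len(self._pages)

    def getPage(self, n):
        return self._pages[n]


# Placeholder strings, not text from the real document
stub_reader = StubReader(["page one", "page two", "page three"])
all_text = extract_all_text(stub_reader)
print(all_text)
```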

In [18]:
pdf_text

['Business Proposal\n The Revolution is Coming\n Leverage agile frameworks to provide a robust synopsis for high level \noverviews. Iterative approaches to corporate strategy foster collaborative \nthinking to further the overall value proposition. Organically grow the \nholistic world view of disruptive innovation via workplace diversity and \nempowerment. \nBring to the table win-win survival strategies to ensure proactive \ndomination. At the end of the day, going forward, a new normal that has \nevolved from generation X is on the runway heading towards a streamlined \ncloud solution. User generated content in real-time will have multiple \ntouchpoints for offshoring. \nCapitalize on low hanging fruit to identify a ballpark value added activity to \nbeta test. Override the digital divide with additional clickthroughs from \nDevOps. Nanotechnology immersion along the information highway will \nclose the loop on focusing solely on the bottom line. Podcasting operational change manage

In [19]:
# Reading the last page

print(pdf_text[4])

Quickly communicate enabled technology and turnkey leadership skills. 
Uniquely enable accurate supply chains rather than frictionless 
technology. Globally network focused materials vis-a-vis cost effective 
manufactured products. 
BUSINESS PROPOSAL
!5
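
Hard-coding index `4` works because this document has five pages. A sketch that does not depend on the page count uses index `-1` for the last page, and `str.join` to combine all pages into one string. The list below holds placeholder strings standing in for `pdf_text`, not the real extracted text.

```python
# Placeholder stand-in for the pdf_text list built above
pdf_text_example = ["first page text", "middle page text", "last page text"]

# Index -1 always refers to the last element, whatever the length of the list
last_page = pdf_text_example[-1]

# Join every page into a single string, separated by blank lines
full_text = "\n\n".join(pdf_text_example)

print(last_page)
print(full_text)
```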
