<h1>Create and Modify PDF Files in Python</h1>

<h2>Table of Contents</h2>

<h3>Extracting Text from a PDF File</h3>

In [1]:
from PyPDF2 import PdfFileReader

In [2]:
from pathlib import Path

<h4>1. Opening a PDF File</h4>

In [3]:
Path.home()

WindowsPath('C:/Users/Felipe')

In [4]:
# Establish the path to the PDF file
pdf_path = (
    Path.home()
    /"python_work"
    /"notebooks"
    /"py_PDF"
    /"data"
    /"practice_files"
    /"Pride_and_Prejudice.pdf"
)

In [6]:
# Create the PdfFileReader instance
# Convert to string because PdfFileReader cannot read a pathlib.Path object
pdf = PdfFileReader(str(pdf_path))
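
<p>If the path is wrong, the error may only appear once the reader tries to open the file. As a minimal sketch, a small helper can check the path first and do the string conversion in one place. The function name <code>checked_pdf_path</code> is hypothetical and not part of PyPDF2.</p>

```python
from pathlib import Path


def checked_pdf_path(path: Path) -> str:
    """Return the path as a string after confirming it points to a PDF file.

    Hypothetical helper for illustration; it only checks the file name
    suffix and that the file exists, not that the content is valid PDF.
    """
    if path.suffix.lower() != ".pdf":
        raise ValueError(f"Not a PDF file name: {path.name}")
    if not path.is_file():
        raise FileNotFoundError(f"No such file: {path}")
    return str(path)
```

<p>With this in place, the reader could be created with <code>PdfFileReader(checked_pdf_path(pdf_path))</code>.</p>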

In [9]:
# Get the number of pages
pdf.getNumPages()

234

In [10]:
# Access some document information
pdf.documentInfo

{'/Title': 'Pride and Prejudice, by Jane Austen',
 '/Author': 'Chuck',
 '/Creator': 'Microsoft® Office Word 2007',
 '/CreationDate': 'D:20110812174208',
 '/ModDate': 'D:20110812174208',
 '/Producer': 'Microsoft® Office Word 2007'}
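
<p>The <code>/CreationDate</code> and <code>/ModDate</code> values are plain strings in the PDF date format. The sketch below shows one way to turn such a string into a <code>datetime</code> using only the standard library. The helper name <code>parse_pdf_date</code> is made up for this example, and it assumes the full <code>D:YYYYMMDDHHMMSS</code> form shown in the output above.</p>

```python
from datetime import datetime


def parse_pdf_date(raw: str) -> datetime:
    """Parse a PDF date such as 'D:20110812174208' into a naive datetime.

    Assumes the full YYYYMMDDHHMMSS form; any trailing time-zone suffix
    is ignored in this sketch.
    """
    digits = raw[2:] if raw.startswith("D:") else raw
    return datetime.strptime(digits[:14], "%Y%m%d%H%M%S")


created = parse_pdf_date("D:20110812174208")
print(created.isoformat())  # 2011-08-12T17:42:08
```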

In [11]:
# Retrieving only the title
pdf.documentInfo.title

'Pride and Prejudice, by Jane Austen'

<h4>2. Extracting Text From a Page</h4>
<h5>2.1 Get a <code>PageObject</code> with <code>PdfFileReader.getPage()</code>.</h5>

In [12]:
# Pass the page's index to retrieve the page
first_page = pdf.getPage(0)

<p><code>.getPage()</code> returns a <code>PageObject</code>:</p>

In [13]:
type(first_page)

PyPDF2._page.PageObject

<h5>2.2 Extract the page's text with <code>PageObject.extractText()</code>:</h5>

In [14]:
first_page.extractText()

' \n \nThe Project Gutenberg EBook of Pride and Prejudice, by Jane Austen  \n \nThis eBook is for the use of anyone anywhere at no cost and with  \nalmost no restrictions whatsoever.  You may copy it, give it away or  \nre-use it under the terms of the Project Gutenberg License included  \nwith this eBook or online at www.gutenberg.org  \n \n \nTitle: Pride and Prejudice  \n \nAuthor: Jane Austen  \n \nRelease Date: August 26, 2008 [EBook #1342]  \n[Last updated: August 11, 2011]  \n \nLanguage: Eng lish \n \nCharacter set encoding: ASCII  \n \n*** START OF THIS PROJECT GUTENBERG EBOOK PRIDE AND PREJUDICE ***  \n \n \n \n \nProduced by Anonymous Volunteers, and David Widger  \n \n \n \n \n \n \nPRIDE AND PREJUDICE  \n \nBy Jane Austen  \n \n \n \nContents  '
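
<p>The extracted string contains many lines that hold only a space, plus trailing spaces on most lines. A possible cleanup step, sketched here with a hypothetical helper called <code>tidy_page_text</code>, strips each line and collapses runs of blank lines. It does not repair words that the extraction split apart, such as <code>Eng lish</code>.</p>

```python
import re


def tidy_page_text(text: str) -> str:
    """Strip surrounding whitespace from each line and collapse blank-line runs."""
    lines = [line.strip() for line in text.split("\n")]
    cleaned = "\n".join(lines).strip()
    # Three or more consecutive newlines become a single blank line
    return re.sub(r"\n{3,}", "\n\n", cleaned)


# Shortened sample in the same shape as the output above
sample = " \n \nTitle: Pride and Prejudice  \n \nAuthor: Jane Austen  \n \n \n \nContents  "
print(tidy_page_text(sample))
```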

<p>Because every <code>PdfFileReader</code> object has a <code>pages</code> attribute, you can use it to iterate over all the pages of the PDF in order.</p>

In [15]:
for page in pdf.pages:
    print(page.extractText())

 
 
The Project Gutenberg EBook of Pride and Prejudice, by Jane Austen  
 
This eBook is for the use of anyone anywhere at no cost and with  
almost no restrictions whatsoever.  You may copy it, give it away or  
re-use it under the terms of the Project Gutenberg License included  
with this eBook or online at www.gutenberg.org  
 
 
Title: Pride and Prejudice  
 
Author: Jane Austen  
 
Release Date: August 26, 2008 [EBook #1342]  
[Last updated: August 11, 2011]  
 
Language: Eng lish 
 
Character set encoding: ASCII  
 
*** START OF THIS PROJECT GUTENBERG EBOOK PRIDE AND PREJUDICE ***  
 
 
 
 
Produced by Anonymous Volunteers, and David Widger  
 
 
 
 
 
 
PRIDE AND PREJUDICE  
 
By Jane Austen  
 
 
 
Contents  
 
Chapter  1  
Chapter  2  
Chapter  3  
Chapter  4  
Chapter  5  
Chapter  6  
Chapter  7  
Chapter  8  
Chapter  9  
Chapter  10  
Chapter  11  
Chapter  12  
Chapter  13  
Chapter  14  
Chapter  15  
Chapter  16  
Chapter  17  
Chapter  18  
Chapter  19  
Chapter  20  
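
<p>Printing every page of a 234-page book produces a lot of output. Since <code>pages</code> can be iterated, one option is to take only a range of pages with <code>itertools.islice</code>. The sketch below uses a hypothetical <code>FakePage</code> class in place of real <code>PageObject</code> instances so that it runs without a PDF file; <code>extract_range</code> is likewise a made-up name.</p>

```python
from itertools import islice


class FakePage:
    """Hypothetical stand-in for a PageObject, so the sketch runs without a PDF."""

    def __init__(self, text):
        self._text = text

    def extractText(self):
        return self._text


def extract_range(pages, start, stop):
    """Return the text of pages start..stop-1, extracting text only from those pages."""
    return [page.extractText() for page in islice(pages, start, stop)]


pages = [FakePage(f"page {n}") for n in range(1, 6)]
print(extract_range(pages, 1, 3))  # ['page 2', 'page 3']
```

<p>Assuming the reader's <code>pages</code> attribute behaves like any other iterable, <code>extract_range(pdf.pages, 0, 2)</code> would return the text of the first two pages.</p>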

<h4>3. Putting It All Together</h4>
<p>Let's save our extracted text in a new <code>.txt</code> file.</p>

In [16]:
# 1 Create pdf reader object and set the output file path
pdf_reader = PdfFileReader(str(pdf_path))
output_file_path = (
    Path.home()
    /"python_work"
    /"notebooks"
    /"py_PDF"
    /"data"
    /"practice_files"
    /"Pride_and_Prejudice.txt"
)

# 2 Open the output file in 'write' mode
with output_file_path.open(mode='w') as output_file:
    # 3 Write the title and the number of pages
    title = pdf_reader.documentInfo.title
    num_pages = pdf_reader.getNumPages()
    output_file.write(f'{title}\nNumber of pages: {num_pages}\n\n')

    # 4 Iterate over all the pages, extract the text from each one, and write it to the output file
    for page in pdf_reader.pages:
        text = page.extractText()
        output_file.write(text)

<p>And that's it! You should now have a <code>Pride_and_Prejudice.txt</code> file in your <code>practice_files</code> folder.</p>
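
<p>One detail the steps above leave implicit is the text encoding of the output file, which otherwise depends on the platform default. The following sketch separates the writing step into a hypothetical function, <code>write_text_report</code>, that takes already extracted page texts and writes them as UTF-8. The demo data and the temporary folder are made up so the example runs on its own.</p>

```python
from pathlib import Path
import tempfile


def write_text_report(output_path: Path, title: str, page_texts: list) -> None:
    """Write a title header followed by the text of each page, as UTF-8."""
    with output_path.open(mode="w", encoding="utf-8") as output_file:
        output_file.write(f"{title}\nNumber of pages: {len(page_texts)}\n\n")
        for text in page_texts:
            output_file.write(text)


# Demonstration with made-up page texts in a temporary folder
with tempfile.TemporaryDirectory() as tmp_dir:
    target = Path(tmp_dir) / "demo.txt"
    write_text_report(target, "Demo Title", ["first page\n", "second page\n"])
    print(target.read_text(encoding="utf-8"))
```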
<hr>