# <center>Working with PDFs in Python </center>
<p><p>
<center>Mike Driscoll (@driscollis)</center>
<p><p>

# Slides / Code
## https://github.com/driscollis/PrairieCode2019

# About Mike

- Blogs at www.MouseVsPython.com
- Contributes to https://realpython.com/
- Writes technical books

![title](screenshots/all_python.png)

# What is PDF?

"Portable Document Format (PDF) is a file format used to present and exchange documents reliably, independent of software, hardware, or operating system. Invented by Adobe, PDF is now an open standard maintained by the International Organization for Standardization (ISO). PDFs can contain links and buttons, form fields, audio, video, and business logic. They can also be signed electronically and are easily viewed using free Acrobat Reader DC software." - Adobe


# Creating a PDF with Python

# Tools for Creating PDFs

- ReportLab
- PyFPDF

# ReportLab

- Has the most features
- Decent documentation

# Two ReportLabs?

* ReportLab (open source)
* ReportLab Plus (commercial)

# Installation

`pip install reportlab`

# Creating a PDF with ReportLab

- Using coordinates (canvas)
- Using Flowables

# Using the canvas

In [None]:
from reportlab.pdfgen import canvas
 
c = canvas.Canvas("hello.pdf")
# Coordinates are in points (72 points per inch)
c.drawString(100, 750, "Welcome to ReportLab!")
c.save()
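The canvas places its origin at the bottom-left corner of the page, so `y` grows upward. A minimal sketch of converting "inches from the top-left corner" into canvas points; the helper name `from_top_left` is hypothetical, and the page height is passed in rather than assumed:

```python
POINTS_PER_INCH = 72

def from_top_left(x_in, y_in, page_height_pt):
    """Convert inches measured from the top-left corner to canvas points.

    The canvas origin is the bottom-left corner, so the y axis is flipped.
    """
    x_pt = x_in * POINTS_PER_INCH
    y_pt = page_height_pt - y_in * POINTS_PER_INCH
    return x_pt, y_pt

# US Letter is 8.5 x 11 inches, i.e. 612 x 792 points.
LETTER_HEIGHT_PT = 11 * POINTS_PER_INCH

# One inch in from the left edge and one inch down from the top edge.
x, y = from_top_left(1, 1, LETTER_HEIGHT_PT)
print(x, y)  # 72 720
```

The resulting pair can be passed straight to `c.drawString(x, y, ...)`.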

![title](screenshots/hello.png)

# Using Flowables

In [None]:
# hello_platypus.py

from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Paragraph
from reportlab.lib.styles import getSampleStyleSheet

doc = SimpleDocTemplate(
    "hello_platypus.pdf", pagesize=letter)
styles = getSampleStyleSheet()

flowables = []

text = "Hello, I'm a Paragraph"
para = Paragraph(text, style=styles["Normal"])
flowables.append(para)

doc.build(flowables)

![title](screenshots/hello_platypus.png)

# Why Flowables?

* Better multi-page support
* Add headers / footers
* Good table support
* Can add drawings / graphs / barcodes more easily

# Additional Resources

![title](screenshots/reportlab_book_cover3.jpg)

![title](screenshots/mvp_pdf.png)

![title](screenshots/real_pdf_py.png)

# Manipulating PDFs with Python



* How to Extract Document Information 
* How to Rotate Pages
* How to Merge PDFs
* How to Split PDFs
* How to Add Watermarks
* How to Encrypt a PDF

# PDF Libraries for Python

There are too many!

* **PyPDF2**: rotating, merging, splitting, watermarking, etc.
* **pdfrw**: rotating, merging, splitting, watermarking, etc.
* **PDFMiner / Slate**: For extracting text
* **ReportLab**: For creating PDFs
* and several others

# We Will Be Using PyPDF2

PyPDF2 may be old, but it's reliable

**pdfrw** is a viable alternative

# History of PyPDF

* PyPDF (dead)
* PyPDF2
* PyPDF3 (dead)
* PyPDF4 (beta)

# Installation

`pip install pypdf2`

# Let's Get Started!

# How to Extract Document Information

- author
- creator
- producer
- subject
- title
- number of pages

In [None]:
from PyPDF2 import PdfFileReader

path = 'reportlab-sample.pdf'
with open(path, 'rb') as f:
    pdf = PdfFileReader(f)
    information = pdf.getDocumentInfo()
    number_of_pages = pdf.getNumPages()  


In [None]:
txt = f"""
Information about {path}: 

Author: {information.author}
Creator: {information.creator}
Producer: {information.producer}
Subject: {information.subject}
Title: {information.title}
Number of pages: {number_of_pages}
"""

print(txt)
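A PDF may omit any of these fields, in which case the attribute is `None`, and `getDocumentInfo()` itself returns `None` when the file has no info dictionary. A small sketch of a formatter that tolerates both cases; the `describe` helper and the stand-in object are hypothetical:

```python
from types import SimpleNamespace

FIELDS = ("author", "creator", "producer", "subject", "title")

def describe(information, missing="(not set)"):
    """Return one 'Field: value' line per metadata field.

    `information` is any object exposing the attributes in FIELDS, such as
    the result of getDocumentInfo(); None or absent values become `missing`.
    """
    lines = []
    for field in FIELDS:
        value = getattr(information, field, None)
        shown = value if value is not None else missing
        lines.append(f"{field.title()}: {shown}")
    return "\n".join(lines)

# Stand-in object for illustration only; a real run would pass `information`.
sample = SimpleNamespace(author="Jane Doe", creator=None, title="Demo")
print(describe(sample))
```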

# How to Rotate Pages

# Why Rotate?

- Page scanned incorrectly

In [5]:
# rotate_pages.py

from PyPDF2 import PdfFileReader, PdfFileWriter

path = 'Jupyter_Notebook_An_Introduction.pdf'
pdf_writer = PdfFileWriter()
pdf_reader = PdfFileReader(path)
# Rotate page 90 degrees to the right
page_1 = pdf_reader.getPage(0).rotateClockwise(90)
pdf_writer.addPage(page_1)

In [6]:
# Rotate page 90 degrees to the left
page_2 = pdf_reader.getPage(1).rotateCounterClockwise(90)
pdf_writer.addPage(page_2)
# Add a page in normal orientation
pdf_writer.addPage(pdf_reader.getPage(2))

with open('rotate_pages.pdf', 'wb') as fh:
    pdf_writer.write(fh)
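Rotation angles must be multiples of 90 degrees, and a 90-degree turn to the left is the same as a 270-degree turn to the right. A sketch of a helper (the name `clockwise_angle` is hypothetical) that normalizes a signed angle before it is handed to `rotateClockwise`:

```python
def clockwise_angle(angle):
    """Normalize a signed rotation to a clockwise angle in {0, 90, 180, 270}.

    Positive values mean clockwise, negative values counter-clockwise.
    """
    if angle % 90 != 0:
        raise ValueError("rotation must be a multiple of 90 degrees")
    return angle % 360

print(clockwise_angle(-90))  # 270
print(clockwise_angle(450))  # 90
```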

# How to Merge PDFs

# Reasons for Merging

* Need to add a cover page
* Your copier scans the pages to individual PDFs
* Building a document from different sources

In [None]:
# pdf_merging.py

from PyPDF2 import PdfFileReader, PdfFileWriter

paths = ['document1.pdf', 'document2.pdf']
output = 'merged_documents.pdf'
pdf_writer = PdfFileWriter()

for path in paths:
    pdf_reader = PdfFileReader(path)
    for page in range(pdf_reader.getNumPages()):
        # Add each page to the writer object
        pdf_writer.addPage(pdf_reader.getPage(page))

# Write out the merged PDF
with open(output, 'wb') as out:
    pdf_writer.write(out)
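Pages are merged in the order of `paths`, which matters when a copier produces one numbered file per page. A plain `sorted()` would put `scan10.pdf` before `scan2.pdf`; the sketch below sorts by the embedded numbers instead. The file names and the `natural_key` helper are made up for illustration:

```python
import re

def natural_key(name):
    """Split a file name into text and integer chunks for sorting."""
    return [int(chunk) if chunk.isdigit() else chunk.lower()
            for chunk in re.split(r"(\d+)", name)]

# Hypothetical scanner output, deliberately out of order.
scans = ["scan10.pdf", "scan2.pdf", "scan1.pdf"]
paths = sorted(scans, key=natural_key)
print(paths)  # ['scan1.pdf', 'scan2.pdf', 'scan10.pdf']
```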

# Using PdfFileMerger

In [7]:
from PyPDF2 import PdfFileMerger

paths = ['rotate_pages.pdf', 'watermark.pdf']
pdf_merger = PdfFileMerger()

for path in paths:
    # append() accepts a path, so no file handle is left open
    pdf_merger.append(path)

# Write out the merged PDF
with open('merged.pdf', 'wb') as out:
    pdf_merger.write(out)
pdf_merger.close()

# How to Split PDFs

# Reasons for Splitting

- You want to take a cover page off
- You only need a subset of pages

In [9]:
# pdf_splitting.py

from PyPDF2 import PdfFileReader, PdfFileWriter

path = 'Jupyter_Notebook_An_Introduction.pdf'
pdf = PdfFileReader(path)
for page in range(pdf.getNumPages()):
    pdf_writer = PdfFileWriter()
    pdf_writer.addPage(pdf.getPage(page))
    name_of_split = 'jupyter_page'

    output = f'{name_of_split}{page}.pdf'
    with open(output, 'wb') as output_pdf:
        pdf_writer.write(output_pdf)
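The loop above writes every page to its own file. For the "subset of pages" case, a helper along these lines can turn a human-friendly range string into the zero-based indexes that `getPage` expects. The `parse_pages` name and the spec format are assumptions made for this sketch:

```python
def parse_pages(spec, number_of_pages):
    """Turn a 1-based spec such as '1-3,5' into zero-based page indexes."""
    indexes = []
    for part in spec.split(","):
        start, _, end = part.strip().partition("-")
        first = int(start)
        last = int(end) if end else first
        if not 1 <= first <= last <= number_of_pages:
            raise ValueError(f"invalid page range: {part!r}")
        indexes.extend(range(first - 1, last))
    return indexes

print(parse_pages("1-3,5", number_of_pages=10))  # [0, 1, 2, 4]
```

Each index can then be added to a single `PdfFileWriter` to produce one PDF containing only the selected pages.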

## Note: Splitting PDFs does not reduce file size significantly with PyPDF2 or pdfrw

### *In many PDFs, all pages will reference the same pool of fonts and other resources, so removing a set of pages, without intrusively scrubbing the resources, won't reduce the size dramatically.*

-- Patrick Maupin (author of pdfrw)

# How to Add Watermarks

Watermarks are identifying images or patterns on printed and digital documents.

They are useful for protecting intellectual property.

# Creating a Watermark PDF

* Programmatically or
* Word Processor

# Programmatically

You can use ReportLab for that

In [None]:
from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Image
from reportlab.lib import utils

img = utils.ImageReader('real-python-logo-wide.png')
orig_width, orig_height = img.getSize()
# Height-to-width ratio, used to scale the logo without distorting it
aspect = orig_height / orig_width

doc = SimpleDocTemplate('watermark.pdf', pagesize=letter)

logo = Image('real-python-logo-wide.png', width=400, height=400*aspect)

doc.build([logo])
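The aspect-ratio arithmetic can be generalized so the logo never exceeds a given box in either direction. A minimal sketch, with a hypothetical `fit_within` helper and made-up image dimensions:

```python
def fit_within(orig_width, orig_height, max_width, max_height):
    """Scale a size to fit inside a box while keeping its aspect ratio."""
    scale = min(max_width / orig_width, max_height / orig_height)
    return orig_width * scale, orig_height * scale

# Hypothetical 800 x 200 pixel logo, limited to a 400 x 400 point box.
width, height = fit_within(800, 200, 400, 400)
print(width, height)  # 400.0 100.0
```

The returned values could be passed as the `width` and `height` arguments of `Image`.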

# Using a Word Processor

* Add image in new blank document
* Verify document size matches PDF size (letter, A4, etc)
* Position image
* Save / Export as PDF

In [None]:
from PyPDF2 import PdfFileReader, PdfFileWriter
input_pdf = 'Jupyter_Notebook_An_Introduction.pdf'
output = 'watermarked_notebook.pdf'
watermark = 'watermark.pdf'

watermark_obj = PdfFileReader(watermark)
watermark_page = watermark_obj.getPage(0)

pdf_reader = PdfFileReader(input_pdf)
pdf_writer = PdfFileWriter()

# Watermark all the pages
for page_number in range(pdf_reader.getNumPages()):
    page = pdf_reader.getPage(page_number)
    # Draw the watermark on top of the existing page content
    page.mergePage(watermark_page)
    pdf_writer.addPage(page)

with open(output, 'wb') as out:
    pdf_writer.write(out)

# How to Encrypt a PDF

PyPDF2 currently only supports adding a user password and an owner password to a preexisting PDF.

# Owner vs User Passwords

* Owner allows setting permissions
* User only allows opening the document

PyPDF2 does not appear to allow setting any permissions, so the owner and user passwords are basically equivalent to each other.

# Without further ado, let's look at some code!

In [None]:
# pdf_encrypt.py

from PyPDF2 import PdfFileWriter, PdfFileReader

def add_encryption(input_pdf, output_pdf, password):
    pdf_writer = PdfFileWriter()
    pdf_reader = PdfFileReader(input_pdf)

    for page in range(pdf_reader.getNumPages()):
        pdf_writer.addPage(pdf_reader.getPage(page))

    pdf_writer.encrypt(user_pwd=password, owner_pwd=None, 
                       use_128bit=True)
    with open(output_pdf, 'wb') as fh:
        pdf_writer.write(fh)

if __name__ == '__main__':
    add_encryption(input_pdf='reportlab-sample.pdf',
                   output_pdf='reportlab-encrypted.pdf',
                   password='twofish')

- Added a password
- Encrypted the contents with 128-bit encryption
- Note: If **use_128bit** is set to False, 40-bit encryption is used
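The example hard-codes its password for brevity. One possible alternative, sketched here with the standard library's `secrets` module, is to generate a random one; the `make_password` helper is hypothetical:

```python
import secrets

def make_password(num_bytes=16):
    """Return a random URL-safe password built from `num_bytes` of entropy."""
    return secrets.token_urlsafe(num_bytes)

password = make_password()
```

The result could be passed as the `password` argument of `add_encryption`, provided it is also stored somewhere the reader of the PDF can retrieve it.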

# PDF Utility GUI

![title](screenshots/wx_cover.png)

# <center> Merge PDFs</center>

![title](screenshots/merge.png)

# <center> Split PDFs</center>

![title](screenshots/split.png)

# Slides / Code
## https://github.com/driscollis/PrairieCode2019

# Questions?