# <center>Working with PDFs in Python </center>
<p><p>
<center>Mike Driscoll (@driscollis)</center>
<p><p>

# Slides / Code
## https://github.com/driscollis/PrairieCode2019

# About Mike

- Blogs at www.MouseVsPython.com
- Contributes to https://realpython.com/
- Writes technical books

![title](screenshots/all_python.png)

# www.Python101.org

# Let's Get Started!

# What is PDF?

"Portable Document Format (PDF) is a file format used to present and exchange documents reliably, independent of software, hardware, or operating system. Invented by Adobe, PDF is now an open standard maintained by the International Organization for Standardization (ISO). PDFs can contain links and buttons, form fields, audio, video, and business logic. They can also be signed electronically and are easily viewed using free Acrobat Reader DC software." - Adobe


# Creating a PDF with Python

# Tools for Creating PDFs

- ReportLab
- PyFPDF
- WeasyPrint
- rst2pdf

# ReportLab

- Has the most features
- Decent documentation
- Lots of built-in objects

# PyFPDF
- Simpler
- Not as full-featured 
- Missing good table support

# Let's Use ReportLab!

# Two Versions of ReportLab
- ReportLab (Open Source)
- ReportLab Plus

# Installation

`pip install reportlab`

# Installing in a Virtual Environment

```
python -m venv pdf_test
cd pdf_test
source bin/activate
pip install reportlab
```

# Creating a PDF with ReportLab

- Using canvas
- Using PLATYPUS (Page Layout and Typography Using Scripts)

# Using the canvas

In [None]:
from reportlab.pdfgen import canvas
 
c = canvas.Canvas("output/hello.pdf")
# Coordinates are in points (72 points per inch), measured from the bottom-left corner
c.drawString(100, 750, "Welcome to Reportlab!")
c.showPage()
c.save()

![title](screenshots/hello.png)
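
Since the canvas works in points from the bottom-left corner, it can help to think in inches from the top-left instead. The sketch below is only an illustration: the helper name `inches_from_top_left` is made up for this slide, and it assumes a US Letter page (612 x 792 points).

```python
# Assumption: US Letter is 612 x 792 points (8.5 x 11 inches at 72 points per inch).
POINTS_PER_INCH = 72
LETTER_HEIGHT = 792


def inches_from_top_left(x_in, y_in, page_height=LETTER_HEIGHT):
    """Convert inches measured from the top-left corner into canvas points.

    The canvas origin is the bottom-left corner, so the y value is flipped.
    """
    x = x_in * POINTS_PER_INCH
    y = page_height - y_in * POINTS_PER_INCH
    return x, y


# One inch in from the left edge and one inch down from the top edge
print(inches_from_top_left(1, 1))  # (72, 720)
```

The resulting pair could be passed straight to `drawString` as its x and y arguments.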

# Cons of Canvas

- You keep track of placement
- You must add page breaks
- Adding a new element means you need to recalculate the placement of other elements
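
To make the bookkeeping concrete, here is a rough sketch of the tracking you end up writing yourself with the canvas. The function `layout_lines` and its margin and leading values are hypothetical, chosen only for illustration.

```python
def layout_lines(line_count, top=750, bottom=50, leading=14):
    """Return a (page_number, y) pair for each line of text.

    A new page starts whenever the next line would fall below the
    bottom margin. All values are in points; the defaults are assumptions.
    """
    positions = []
    page, y = 1, top
    for _ in range(line_count):
        if y < bottom:
            # Out of room: move to the next page and reset the cursor
            page += 1
            y = top
        positions.append((page, y))
        y -= leading
    return positions


print(layout_lines(3))  # [(1, 750), (1, 736), (1, 722)]
```

Each pair would feed a `drawString` call, with `showPage()` called whenever the page number changes. Change the leading or insert a line, and every later position shifts.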

# Let's See What Else You Can Do!

# Drawing Polygons

In [None]:
# drawing_polygons.py

from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas

def draw_shapes():
    c = canvas.Canvas("output/draw_other.pdf", pagesize=letter)
    c.setStrokeColorRGB(0.2, 0.5, 0.3)
    c.rect(10, 700, width=100, height=80, stroke=1, fill=0)
    c.ellipse(10, 680, 100, 630, stroke=1, fill=1)
    c.wedge(10, 600, 100, 550, startAng=45, extent=90, stroke=1, fill=0)
    c.circle(300, 600, r=50)
    c.save()

if __name__ == '__main__':
    draw_shapes()
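
`setStrokeColorRGB` takes each channel as a float between 0 and 1. If your colors come as hex strings, a small conversion helps. This is a minimal sketch; `hex_to_rgb_fractions` is a hypothetical helper, not part of ReportLab.

```python
def hex_to_rgb_fractions(hex_color):
    """Convert a '#rrggbb' string into three floats between 0 and 1."""
    value = hex_color.lstrip('#')
    if len(value) != 6:
        raise ValueError(f"Expected a 6-digit hex color, got {hex_color!r}")
    # Each pair of hex digits is one channel in the range 0-255
    return tuple(int(value[i:i + 2], 16) / 255 for i in (0, 2, 4))


print(hex_to_rgb_fractions('#ff0000'))  # (1.0, 0.0, 0.0)
```

The tuple can be unpacked into the call, for example `c.setStrokeColorRGB(*hex_to_rgb_fractions('#ff0000'))`.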

# Adding an Image

In [None]:
# image_on_canvas.py

from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas


def add_image(image_path):
    my_canvas = canvas.Canvas("output/canvas_image.pdf",
                              pagesize=letter)
    my_canvas.drawImage(image_path, 30, 600,
                        width=100, height=100)
    my_canvas.save()

if __name__ == '__main__':
    image_path = 'snakehead.jpg'
    add_image(image_path)

# Playing with Color

In [None]:
# colors_demo.py

from reportlab.lib import colors
from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas

my_canvas = canvas.Canvas("output/colors.pdf",
                          pagesize=letter)
my_canvas.setFont('Helvetica', 10)
x = 30

sample_colors = [colors.aliceblue,
                 colors.aquamarine,
                 colors.lavender,
                 colors.beige,
                 colors.chocolate]

In [None]:
for color in sample_colors:
    my_canvas.setFillColor(color)
    my_canvas.circle(x, 730, 20, fill=1)
    color_str = f"{color._lookupName()}"
    my_canvas.setFillColor(colors.black)
    my_canvas.drawString(x-10, 700, color_str)
    x += 75

my_canvas.save()

# Applying Fonts

In [None]:
# font_demo.py

from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas

my_canvas = canvas.Canvas("output/font_demo.pdf",
                              pagesize=letter)
fonts = my_canvas.getAvailableFonts()
pos_y = 750
for font in fonts:
    my_canvas.setFont(font, 12)
    my_canvas.drawString(30, pos_y, font)
    pos_y -= 10
my_canvas.save()

# Now Let's Try ReportLab's PLATYPUS

# Page Layout and Typography Using Scripts (PLATYPUS)

# Flowables come from reportlab.platypus

In [None]:
# hello_platypus.py

from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Paragraph
from reportlab.lib.styles import getSampleStyleSheet

doc = SimpleDocTemplate("output/hello_platypus.pdf",
                        pagesize=letter)
styles = getSampleStyleSheet()

flowables = []

text = "Hello, I'm a Paragraph"
para = Paragraph(text, style=styles["Normal"])
flowables.append(para)

doc.build(flowables)

![title](screenshots/hello_platypus.png)

# Why Flowables?

* Better multi-page support
* Add headers / footers
* Can add drawings / graphs / barcodes easier

# Let's Look at Other Examples!

# Paragraphs with Fonts / Colors

In [None]:
# paragraph_font_colors.py

from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Paragraph
from reportlab.lib.styles import getSampleStyleSheet
import reportlab.lib.colors


doc = SimpleDocTemplate("output/paragraph_font_colors.pdf",
                        pagesize=letter
                        )
styles = getSampleStyleSheet()

flowables = []

In [None]:
ptext = """<font name=helvetica size=12 color=red>
Welcome to ReportLab! (helvetica)</font>"""
para = Paragraph(ptext, style=styles["Normal"])
flowables.append(para)

ptext = """<font name=courier fg=blue size=14>
Welcome to Reportlab! (courier)</font>"""
para = Paragraph(ptext, style=styles["Normal"])
flowables.append(para)

ptext = """<font name=times-roman size=16 color=#777215>
Welcome to Reportlab! (times-roman)</font>"""
para = Paragraph(ptext, style=styles["Normal"])
flowables.append(para)

doc.build(flowables)

# Creating a Table

In [None]:
# simple_table.py

from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Table

doc = SimpleDocTemplate("output/simple_table.pdf", pagesize=letter)
flowables = []

data = [['col_{}'.format(x) for x in range(1, 6)],
        [str(x) for x in range(1, 6)],
        ['a', 'b', 'c', 'd', 'e']
        ]

tbl = Table(data)
flowables.append(tbl)

doc.build(flowables)

# Paragraphs with In-line Images

In [None]:
from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Paragraph
from reportlab.lib.styles import getSampleStyleSheet

doc = SimpleDocTemplate("output/paragraph_inline_images.pdf",
                        pagesize=letter
                        )
styles = getSampleStyleSheet()

flowables = []

ptext = '''Here is a picture:
<img src="snakehead.jpg" width="50" height="50"/> in the
middle of our text'''
p = Paragraph(ptext, styles['Normal'])
flowables.append(p)

doc.build(flowables)

# Using the Image Flowable

In [None]:
# scaled_image.py

from reportlab.lib import utils
from reportlab.lib.pagesizes import letter
from reportlab.platypus import Image, SimpleDocTemplate

doc = SimpleDocTemplate("output/image_with_scaling.pdf", pagesize=letter)
flowables = []

img = utils.ImageReader('snakehead.jpg')
img_width, img_height = img.getSize()
aspect = img_height / float(img_width)

img = Image("snakehead.jpg",
            width=50,
            height=(50 * aspect))
img.hAlign = 'CENTER'
flowables.append(img)
doc.build(flowables)
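
The example above fixes the width and derives the height from the aspect ratio. A related sketch fits an image inside a bounding box in both directions. `fit_within` is a hypothetical helper written for this slide, and the sizes are made-up numbers.

```python
def fit_within(orig_width, orig_height, max_width, max_height):
    """Scale a size to fit inside a box while keeping the aspect ratio."""
    # The smaller of the two ratios is the one that keeps both sides in bounds
    scale = min(max_width / orig_width, max_height / orig_height)
    return orig_width * scale, orig_height * scale


# A wide 200 x 100 image squeezed into a 50 x 50 box
print(fit_within(200, 100, 50, 50))  # (50.0, 25.0)
```

The returned width and height could be passed to the `Image` flowable in place of the hand-computed values.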

# Want to Learn More?

![title](screenshots/reportlab_book_cover3.jpg)

# Manipulating PDFs with Python



* How to Extract Document Information
* How to Rotate Pages
* How to Merge PDFs
* How to Split PDFs
* How to Add Watermarks
* How to Encrypt a PDF

# PDF Libraries for Python

There are too many!

* **PyPDF2**: rotating, merging, splitting, watermarking, etc.
* **pdfrw**: rotating, merging, splitting, watermarking, etc.
* **PDFMiner**: For extracting text
* **PDFPlumber**: Extracting text / tables

# We Will Be Using PyPDF2

PyPDF2 may be old, but it's reliable

**pdfrw** is a viable alternative

# History of PyPDF

* pyPdf (dead)
* PyPDF2
* PyPDF3 (dead)
* PyPDF4 (beta)

# Installation

`pip install "pypdf2<3"`

The examples use the `PdfFileReader` / `PdfFileWriter` API, which was removed in PyPDF2 3.0.

# Let's Get Started!

# How to Extract Document Information

- author
- creator
- producer
- subject
- title
- number of pages

In [None]:
from PyPDF2 import PdfFileReader

path = 'reportlab-sample.pdf'
with open(path, 'rb') as f:
    pdf = PdfFileReader(f)
    information = pdf.getDocumentInfo()
    number_of_pages = pdf.getNumPages()  


In [None]:
txt = f"""
Information about {path}: 

Author: {information.author}
Creator: {information.creator}
Producer: {information.producer}
Subject: {information.subject}
Title: {information.title}
Number of pages: {number_of_pages}
"""

print(txt)

# How to Rotate Pages

# Why Rotate?

- Page scanned incorrectly
- Client sent a poorly formatted PDF

In [None]:
# rotate_pages.py

from PyPDF2 import PdfFileReader, PdfFileWriter

path = 'Jupyter_Notebook_An_Introduction.pdf'
pdf_writer = PdfFileWriter()
pdf_reader = PdfFileReader(path)
# Rotate page 90 degrees to the right
page_1 = pdf_reader.getPage(0).rotateClockwise(90)
pdf_writer.addPage(page_1)

In [None]:
# Rotate page 90 degrees to the left
page_2 = pdf_reader.getPage(1).rotateCounterClockwise(90)
pdf_writer.addPage(page_2)
# Add a page in normal orientation
pdf_writer.addPage(pdf_reader.getPage(2))

with open('output/rotate_pages.pdf', 'wb') as fh:
    pdf_writer.write(fh)
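
PDF page rotation is stored in multiples of 90 degrees, so a 90 degree turn to the left is the same as a 270 degree turn to the right. The sketch below validates an angle and reduces it to a clockwise value. `normalize_rotation` is a hypothetical helper, not a PyPDF2 function.

```python
def normalize_rotation(angle):
    """Return the equivalent clockwise rotation in the range 0-270.

    Negative angles are treated as counter-clockwise turns.
    """
    if angle % 90 != 0:
        raise ValueError(f"Rotation must be a multiple of 90, got {angle}")
    return angle % 360


print(normalize_rotation(-90))  # 270
print(normalize_rotation(450))  # 90
```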

# How to Merge PDFs

# Reasons for Merging

* Need to add a cover page
* Your copier scans the pages to individual PDFs
* Building a document from different sources

In [None]:
# pdf_merging.py

from PyPDF2 import PdfFileReader, PdfFileWriter

paths = ['document1.pdf', 'document2.pdf']
output = 'output/merged_documents.pdf'
pdf_writer = PdfFileWriter()

for path in paths:
    pdf_reader = PdfFileReader(path)
    for page in range(pdf_reader.getNumPages()):
        # Add each page to the writer object
        pdf_writer.addPage(pdf_reader.getPage(page))

# Write out the merged PDF
with open(output, 'wb') as out:
    pdf_writer.write(out)
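
When a copier produces one PDF per page, the order of `paths` matters, and a plain alphabetical sort puts `scan_10.pdf` before `scan_2.pdf`. Here is a minimal sketch of a natural sort key. The file names are invented for illustration and `natural_key` is a hypothetical helper.

```python
import re


def natural_key(name):
    """Split a file name into text and number chunks so 2 sorts before 10."""
    return [int(chunk) if chunk.isdigit() else chunk.lower()
            for chunk in re.split(r'(\d+)', name)]


# Hypothetical scanner output
scans = ['scan_10.pdf', 'scan_2.pdf', 'scan_1.pdf']
print(sorted(scans, key=natural_key))
# ['scan_1.pdf', 'scan_2.pdf', 'scan_10.pdf']
```

The sorted list could then be used as `paths` in the merging loop.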

# Using PdfFileMerger

In [None]:
import PyPDF2
 
merger = PyPDF2.PdfFileMerger()

paths = ['rotate_pages.pdf', 'watermark.pdf']
for path in paths:
    # append() accepts a file path, so no file handle is left open
    merger.append(path)

with open("output/merged.pdf", 'wb') as fh:
    merger.write(fh)
merger.close()

# How to Split PDFs

# Reasons for Splitting

- You want to take a cover page off
- You only need a subset of pages

In [None]:
# pdf_splitting.py

from PyPDF2 import PdfFileReader, PdfFileWriter

path = 'Jupyter_Notebook_An_Introduction.pdf'
name_of_split = 'jupyter_page_'
pdf = PdfFileReader(path)
for page in range(pdf.getNumPages()):
    pdf_writer = PdfFileWriter()
    pdf_writer.addPage(pdf.getPage(page))

    output = f'output/{name_of_split}{page}.pdf'
    with open(output, 'wb') as output_pdf:
        pdf_writer.write(output_pdf)
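
If you only need a subset of pages, a print-dialog style range such as `1-3,5` is a handy way to describe it. The sketch below turns such a string into zero-based page indexes. `parse_page_ranges` is a hypothetical helper, not part of PyPDF2.

```python
def parse_page_ranges(spec, page_count):
    """Turn a string like '1-3,5' into zero-based page indexes."""
    indices = []
    for part in spec.split(','):
        part = part.strip()
        if not part:
            continue
        if '-' in part:
            start_text, end_text = part.split('-', 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        # Page numbers in the spec are one-based, as a reader would type them
        if not 1 <= start <= end <= page_count:
            raise ValueError(
                f"Invalid page range {part!r} for a {page_count}-page document")
        indices.extend(range(start - 1, end))
    return indices


print(parse_page_ranges('1-3,5', page_count=10))  # [0, 1, 2, 4]
```

Each index could be passed to `getPage` and added to a single writer to produce one PDF containing just those pages.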

## Note: Splitting PDFs does not reduce file size significantly with PyPDF2 or pdfrw

### *In many PDFs, all pages will reference the same pool of fonts and other resources, so removing a set of pages, without intrusively scrubbing the resources, won't reduce the size dramatically.*

-- Patrick Maupin (author of pdfrw)

# How to Add Watermarks

Watermarks are identifying images or patterns on printed and digital documents.

They are useful for protecting intellectual property.

# Creating a Watermark

* Programmatically
* With a word processor

# Programmatically

You can use ReportLab for that

In [None]:
from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Image
from reportlab.lib import utils

img = utils.ImageReader('real-python-logo-wide.png')
orig_width, height = img.getSize()
aspect = height / orig_width

doc = SimpleDocTemplate('watermark.pdf', pagesize=letter)

logo = Image('real-python-logo-wide.png', width=400, height=400*aspect)

doc.build([logo])

# Rotating the Watermark Image

In [None]:

from reportlab.lib.pagesizes import letter
from reportlab.platypus import Image, SimpleDocTemplate

class RotatedImage(Image):

    def wrap(self, availWidth, availHeight):
        height, width = Image.wrap(self, availHeight, availWidth)
        return width, height

    def draw(self):
        self.canv.rotate(45)
        Image.draw(self)

In [None]:
doc = SimpleDocTemplate("output/image_with_rotation.pdf", pagesize=letter)
flowables = []

img = RotatedImage('real-python-logo-wide.png',
                   width=550, height=50,
                   kind='proportional'
                   )
img.hAlign = 'CENTER'
flowables.append(img)
doc.build(flowables)

# Why programmatically?

- Multiple logos
- Automation

# Applying the Watermark Image

In [None]:
from PyPDF2 import PdfFileReader, PdfFileWriter
input_pdf = 'Jupyter_Notebook_An_Introduction.pdf'
logo = 'output/image_with_rotation.pdf'
output = 'output/watermarked_notebook.pdf'

logo_obj = PdfFileReader(logo)
watermark = logo_obj.getPage(0)

pdf_reader = PdfFileReader(input_pdf)
pdf_writer = PdfFileWriter()

# Watermark all the pages
for page_num in range(pdf_reader.getNumPages()):
    page = pdf_reader.getPage(page_num)
    page.mergePage(watermark)
    pdf_writer.addPage(page)

with open(output, 'wb') as out:
    pdf_writer.write(out)

# How to Encrypt a PDF

PyPDF2 currently supports only adding a user password and an owner password to a preexisting PDF.

# Owner vs User Passwords

* The owner password allows setting permissions
* The user password only allows opening the document

PyPDF2 does not appear to allow setting any permissions, so the owner and user passwords are effectively equivalent.

# Without further ado, let's look at some code!

In [None]:
# pdf_encrypt.py

from PyPDF2 import PdfFileWriter, PdfFileReader

def add_encryption(input_pdf, output_pdf, password):
    pdf_writer = PdfFileWriter()
    pdf_reader = PdfFileReader(input_pdf)

    for page in range(pdf_reader.getNumPages()):
        pdf_writer.addPage(pdf_reader.getPage(page))

    pdf_writer.encrypt(user_pwd=password, owner_pwd=None, 
                       use_128bit=True)

    with open(output_pdf, 'wb') as fh:
        pdf_writer.write(fh)

if __name__ == '__main__':
    add_encryption(input_pdf='reportlab-sample.pdf',
                   output_pdf='output/reportlab-encrypted.pdf',
                   password='twofish')

# PDF Utility GUI

![title](screenshots/fenix_kick.png)

# <center> Merge PDFs</center>

![title](screenshots/merge.png)

# <center> Split PDFs</center>

![title](screenshots/split.png)

# Want More Information?

![title](screenshots/mvp_pdf.png)

![title](screenshots/real_pdf_py.png)

# Let's Review!



# Creating a PDF with ReportLab
  * Using Canvas Methods
  * Using Flowables


# Manipulating PDFs with PyPDF2

* How to Extract Document Information
* How to Rotate Pages
* How to Merge PDFs
* How to Split PDFs
* How to Add Watermarks
* How to Encrypt a PDF

# Questions?

## https://github.com/driscollis/PrairieCode2019