# PDF Extraction

PDFs are another common data format: reports, books, company filings, and so on are all typically distributed this way. There are a variety of reasons for this, but one contributing reason is that the format was created to share documents that preserve their formatting and inline images (ever touched an image in a Word document?).

You can create and edit PDFs yourself using a variety of software now, although the easiest way to create one is to save or export an existing document as a PDF.

PDFs, unlike plaintext, present some annoying problems because of what the format is typically used for. With a Word document, if we see text then we can (generally) safely assume that it was digitally authored in the word processor itself. So, as long as we can open the Word document format, we can rest assured that we are able to process and analyze the text within.

PDFs really don't have that guarantee. Maybe the text was authored digitally, or maybe it is the result of a scan of printed text, *making it an image*. An image doesn't have any text available for extraction; it's simply a matrix of numbers telling the computer how to render the image. So there is no text available to process and analyze when we open a PDF that holds an image. Even if the PDF was digitally authored, the extraction is not entirely perfect for analysis, because the text is read as it was typeset *and* there can still be issues in how the extraction engine reads the PDF.

If we want to work with a PDF that has images of text, we actually have to work with programs that execute [Optical Character Recognition](https://en.wikipedia.org/wiki/Optical_character_recognition) (OCR) algorithms. It is important to recognize that, in the modern day, the output of these algorithms depends on the data they were trained with (machine learning). This means that we are not guaranteed (i) that the algorithm will be able to translate the image into text or (ii) that the extracted text is correct.
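Before reaching for OCR, it helps to know which pages actually need it. The following is a minimal sketch, assuming some PDF library has already produced one extracted string per page; the function name `find_pages_without_text` and the `min_chars` threshold are illustrative choices, not part of any library.

```python
def find_pages_without_text(page_texts, min_chars=20):
    """Return indices of pages whose extracted text is (nearly) empty.

    Such pages are likely scanned images and would need OCR.
    `min_chars` is an arbitrary threshold chosen for illustration.
    """
    suspect = []
    for index, text in enumerate(page_texts):
        # Treat a missing result (None) like an empty string,
        # and ignore whitespace when counting characters.
        visible = "".join((text or "").split())
        if len(visible) < min_chars:
            suspect.append(index)
    return suspect


# Hypothetical extraction results for a three-page document
sample_texts = ["Annual report for fiscal year 2020 ...", "", "   \n  "]
print(find_pages_without_text(sample_texts))
```

A page flagged here is only a candidate: a title page with very little text would also fall under the threshold, so the result is a hint rather than proof that the page is a scan.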

Which is all to say that the difficulty of extracting data from a PDF varies **wildly** depending on what the data is, ranging from *dead simple* to *you're now doing computer science research*.

# Starting with what we have

So let's examine documents that we downloaded from the SEC to start our introduction to parsing PDFs.

There are a number of libraries that exist to open and [extract data from PDFs with Python](https://letmegooglethat.com/?q=python+pdf+parsing).

We will start with something that is dead simple to install on any OS, because we're using an Anaconda installation.

In [None]:
!pip install PyPDF2

In [None]:
import PyPDF2

pdf_obj = open('../../data/pdfs/amazon_2020.pdf', 'rb')
pdf_obj

In [None]:
pdf_obj.read()

Success! We can at least open the PDF and get a standard file object to work with, although reading it directly only gives us raw bytes.
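Those raw bytes are not entirely opaque: PDF files normally begin with the marker `%PDF-` followed by a version number. As a small sketch, the check below confirms that a file object looks like a PDF before handing it to a reader. The helper name `looks_like_pdf` is hypothetical, and in-memory `io.BytesIO` objects stand in for real files.

```python
import io


def looks_like_pdf(file_obj):
    """Check for the '%PDF-' marker that PDF files normally start with."""
    file_obj.seek(0)
    header = file_obj.read(5)
    file_obj.seek(0)  # rewind so later readers start from the beginning
    return header == b"%PDF-"


# In-memory stand-ins for files opened with open(path, 'rb')
fake_pdf = io.BytesIO(b"%PDF-1.7\n...rest of file...")
not_pdf = io.BytesIO(b"PK\x03\x04 zip archive bytes")
print(looks_like_pdf(fake_pdf), looks_like_pdf(not_pdf))
```

The same function works on a file opened in binary mode, since it only relies on `seek` and `read`.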

But to work with the contents, we have to create a reader for the object too.

In [None]:
pdf_reader = PyPDF2.PdfReader(pdf_obj)

pdf_reader.pages

In [None]:
len(pdf_reader.pages)

In [None]:
dir(pdf_reader)

And now we can see that the reader finds 80 pages in the document. We can work with individual pages, so let's do that for the third page (index 2, since pages are zero-indexed).

In [None]:
pdf_reader.pages[2]

And try to extract some text:

In [None]:
page = pdf_reader.pages[2]
page.extract_text().split('\n')

Fun times, we can all totally read that right?

Nope. The text is totally mangled in the conversion from bytes to a string, because this PDF wasn't produced with text extraction in mind.

In comparison, here is a document created from LaTeX.

In [None]:
pdf_obj = open('../../data/pdfs/plos_template.pdf', 'rb')
pdf_reader = PyPDF2.PdfReader(pdf_obj)
page = pdf_reader.pages[2]
print(page.extract_text())

So it is not a perfect extraction compared to text stored as plain text, but it is accessible. There isn't much that can be done to aid in parsing other than hoping that there are textual markers that signify structure in the manuscript.
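When such markers do exist, they can be used to cut the extracted text into pieces. The sketch below assumes that section headings appear on their own line and come from a known list. The heading names, the `split_into_sections` helper, and the sample text are all hypothetical and would need adjusting to the document at hand.

```python
import re

# Hypothetical heading names; adjust to the markers in your own document.
HEADING_PATTERN = re.compile(
    r"^(Abstract|Introduction|Materials and Methods|Results|Discussion|References)$",
    re.IGNORECASE,
)


def split_into_sections(text):
    """Group extracted lines under the most recent heading line."""
    current = "preamble"  # text that appears before the first heading
    sections = {current: []}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        match = HEADING_PATTERN.match(line)
        if match:
            current = match.group(1).lower()
            sections.setdefault(current, [])
        else:
            sections[current].append(line)
    # Re-join the lines, since line breaks in extracted text follow the typesetting
    return {name: " ".join(lines) for name, lines in sections.items()}


sample = (
    "My Title\nAbstract\nWe study PDFs.\nThey are hard.\n"
    "Introduction\nPDFs are everywhere.\n"
)
sections = split_into_sections(sample)
print(sections["abstract"])
```

With real output, `text` would be the string returned by `page.extract_text()`, possibly joined across several pages.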

We could also try to get data from elements like tables:

In [None]:
pdf_reader = PyPDF2.PdfReader(pdf_obj)
page = pdf_reader.pages[16]
page.extract_text()

But again, we lost the structure of the table itself so we have to try to figure it out on our own. 
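One way to start figuring it out is to treat each extracted line as a row label followed by numeric cells. This is only a rough sketch under that assumption: the `parse_row` helper and the sample lines are made up for illustration and are not taken from the PDF above.

```python
import re

# A token counts as numeric if it looks like 12, 1,204, -3 or 18.25
NUMBER = re.compile(r"-?\d[\d,]*(\.\d+)?")


def parse_row(line):
    """Split a line into (label, [numbers]) by peeling numbers off the end."""
    tokens = line.split()
    values = []
    while tokens and NUMBER.fullmatch(tokens[-1]):
        token = tokens.pop()
        # Insert at the front so the values keep their left-to-right order
        values.insert(0, float(token.replace(",", "")))
    return " ".join(tokens), values


# Hypothetical lines, shaped like text extracted from a simple table
lines = ["Group Count Mean", "Treatment A 12 3.5 0.71", "Control 1,204 18.25"]
rows = [parse_row(line) for line in lines]
for label, values in rows:
    print(label, values)
```

This breaks down as soon as a label itself ends in a number or a cell is empty, which is exactly the kind of structure that is lost in the extracted text.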

There are other packages to extract text and tables from PDFs, *but* installing them consistently across operating systems is difficult because they depend on other software and libraries that are neither controlled by nor part of the Python installation.