In [3]:
from IPython.core.display import HTML

def css_styling():
    styles = open("../../data/www/styles/custom.css", "r").read()
    return HTML(styles)
css_styling()

# PDF Extraction

PDFs are another common data format: reports, books, company filings, etc. are all typically distributed this way. There are a variety of reasons for this, but one contributing reason is that the format was created to share documents that include formatting and inline images (ever touched an image in a Word document?).

You can create and edit PDFs yourself using a variety of software now, although the easiest way to create one is to save an existing document as a PDF.

PDFs, unlike plaintext, present some annoying problems because of what the format is typically used for. With a Word document, if we see text we can (generally) safely assume that it was digitally authored in the word processor itself. So, as long as we can open the Word document format, we can rest assured that we are able to process and analyze the text within.

PDFs don't have that guarantee: maybe the text was authored digitally, or maybe it is the result of a scan of printed text, *making it an image*. An image doesn't have any text available for extraction; it's simply a matrix of numbers telling the computer how to render the image. So there is no text available to process and analyze when we open a PDF that holds an image. Even if the PDF was digitally authored, the extraction is not perfect for analysis because the text is read as it was typeset, *and* the extraction engine can still have issues reading the PDF.
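To make the "matrix of numbers" point concrete, here is a minimal sketch using a made-up 5x5 black-and-white "scan" of the letter T. The names `scan_of_T` and `render` are hypothetical and only for illustration; a real scan would be far larger and would usually hold grayscale or color values.

```python
# A hypothetical 5x5 "scan" of the letter T: 1 = ink, 0 = paper.
scan_of_T = [
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
]


def render(matrix):
    """Draw the matrix the way a viewer would: one symbol per pixel."""
    return "\n".join("".join("#" if px else "." for px in row) for row in matrix)


print(render(scan_of_T))

# Nothing in the data is a character; there are only numbers to work with.
contains_text = any(isinstance(px, str) for row in scan_of_T for px in row)
print(contains_text)  # False
```

A person sees a "T" in the rendered picture, but the data itself never says so. Recovering the letter from the pixels is exactly the job of OCR.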

If we want to work with a PDF that has images of text, we have to use programs that run [Optical Character Recognition](https://en.wikipedia.org/wiki/Optical_character_recognition) (OCR) algorithms. It is important to recognize that modern OCR algorithms are typically machine learning models, so their output depends on the data they were trained on. This means that we are not guaranteed (i) that the algorithm will be able to translate the image into text or (ii) that the extracted text is correct.
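Before reaching for OCR it helps to know which pages actually need it. The sketch below flags pages whose extracted text is empty, relying on the `pages` attribute and `extract_text()` method of the PyPDF2 reader that this notebook uses further down. The function name `pages_without_text` and the `FakePage`/`FakeReader` stand-ins are hypothetical; the stand-ins exist only so the example runs without a PDF on disk.

```python
def pages_without_text(reader):
    """Return zero-based indexes of pages whose extracted text is empty."""
    empty = []
    for index, page in enumerate(reader.pages):
        text = page.extract_text() or ""
        if not text.strip():
            empty.append(index)
    return empty


# Stand-ins that mimic the two things the function needs from a reader.
# With a real file you would pass a PyPDF2.PdfReader instead.
class FakePage:
    def __init__(self, text):
        self._text = text

    def extract_text(self):
        return self._text


class FakeReader:
    def __init__(self, texts):
        self.pages = [FakePage(t) for t in texts]


print(pages_without_text(FakeReader(["PART I", "", "  \n"])))  # [1, 2]
```

An empty result is only a hint: a page can be genuinely blank, and a scanned page can carry an invisible text layer added by earlier OCR.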

Which is all to say that the difficulty of extracting data from a PDF varies **wildly** depending on what the data is, ranging from *dead simple* to *you're now doing computer science research*.

# Starting with what we have

So let's examine documents that we downloaded from the SEC to start our introduction to parsing PDFs.

There are a number of libraries that exist to open and [extract data from PDFs with Python](https://letmegooglethat.com/?q=python+pdf+parsing).

We will start with something that is dead simple to install on any OS, since we're using an Anaconda installation.

In [2]:
!pip install PyPDF2

Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 232.6/232.6 kB 5.5 MB/s eta 0:00:00
Installing collected packages: PyPDF2
Successfully installed PyPDF2-3.0.1


In [3]:
import PyPDF2

pdf_obj = open('../../data/pdfs/amazon_2020.pdf', 'rb')
pdf_obj

<_io.BufferedReader name='../../data/pdfs/amazon_2020.pdf'>

In [4]:
pdf_obj.read()

b'%PDF-1.4\r\n%\xd3\xf4\xcc\xe1\r\n1 0 obj\r\n<< /Author <E669B80F415383FFFBC1CCD8D72E85779906A48C7110DE446DCD9214BC44F87D269B4449609895A8404B55F4F6113C9A47BD83B65C817FCF87>\r\n   /Creator <E669B80F41539CD5D188E5D8956B96368900A0>\r\n   /Keywords <931DCF7F2342F4A6A59C8F8FCA23D467CD5FE2D139598A0A7C92D97F>\r\n   /Producer <E669B80F41539CD5D188E5D8956B96368900A0>\r\n   /Subject <921DD205>\r\n   /Title <931DCF7F2342F4A6A59C8F8FCA23D467CD5FE2D1>\r\n   /CreationDate <E717CD7E2142FCA3A79B928BCB3BD164D05FE7C2324996>\r\n   /ModDate <E717CD7E2142FCA3A79B928BCB38D463D05FE7C2324996>\r\n>>\r\nendobj\r\n2 0 obj\r\n<< /Type /Catalog /Pages 3 0 R >>\r\nendobj\r\n3 0 obj\r\n<< /Type /Pages\r\n   /Kids [29 0 R 32 0 R 36 0 R 190 0 R 41 0 R 193 0 R 196 0 R 199 0 R 202 0 R 205 0 R 208 0 R 211 0 R 214 0 R 217 0 R 46 0 R 53 0 R 60 0 R 67 0 R 220 0 R 223 0 R 226 0 R 229 0 R 232 0 R 235 0 R 238 0 R 241 0 R 244 0 R 247 0 R 250 0 R 253 0 R 256 0 R 72 0 R 259 0 R 77 0 R 262 0 R 297 0 R 268 0 R 273 0 R 278 0 R 283 

Success! We can at least open the PDF as a standard file object and read its raw bytes.
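Those raw bytes already tell us something. A PDF file starts with `%PDF-` followed by a version number, which is what the `b'%PDF-1.4...` at the top of the output shows. As a minimal sketch, the hypothetical helper `pdf_version` below pulls the version out of the first bytes; the `header` value is copied from the start of the output above.

```python
import re


def pdf_version(raw: bytes):
    """Return the version string from a PDF header, or None if there is none."""
    match = re.match(rb"%PDF-(\d+\.\d+)", raw)
    return match.group(1).decode("ascii") if match else None


# The first bytes of the Amazon filing, as printed by pdf_obj.read().
header = b"%PDF-1.4\r\n%\xd3\xf4\xcc\xe1\r\n1 0 obj\r\n"

print(pdf_version(header))            # 1.4
print(pdf_version(b"not a pdf file")) # None
```

A check like this is a cheap way to confirm that a downloaded file really is a PDF before handing it to a parser.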

But to work with the contents we also have to create a PDF reader for that file object.

In [10]:
pdf_reader = PyPDF2.PdfReader(pdf_obj)

pdf_reader.pages

<PyPDF2._page._VirtualList at 0x11106be00>

In [14]:
len(pdf_reader.pages)

80

In [11]:
dir(pdf_reader)

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_basic_validation',
 '_build_destination',
 '_build_field',
 '_build_outline_item',
 '_check_kids',
 '_encryption',
 '_find_eof_marker',
 '_find_startxref_pos',
 '_flatten',
 '_get_indirect_object',
 '_get_named_destinations',
 '_get_num_pages',
 '_get_object_from_stream',
 '_get_outline',
 '_get_page',
 '_get_page_number_by_indirect',
 '_get_xref_issues',
 '_override_encryption',
 '_page_id2num',
 '_pairs',
 '_read_pdf15_xref_stream',
 '_read_standard_xref_table',
 '_read_xref',
 '_read_xref_other_error',
 '_read_xref_subsections',
 '_read_xref_tables_and_trailers',
 '_rebuild_xref_table',
 '_write_fiel

The reader reports that there are 80 pages in the document. We can work with individual pages, so let's look at the third page (index 2, because pages are zero-indexed).

In [16]:
pdf_reader.pages[2]

{'/Type': '/Page',
 '/Contents': {'/Filter': '/FlateDecode'},
 '/MediaBox': [0, 0, 612, 792],
 '/CropBox': [0, 0, 612, 792],
 '/Parent': {'/Type': '/Pages',
  '/Kids': [IndirectObject(29, 0, 4580611408),
   IndirectObject(32, 0, 4580611408),
   IndirectObject(36, 0, 4580611408),
   IndirectObject(190, 0, 4580611408),
   IndirectObject(41, 0, 4580611408),
   IndirectObject(193, 0, 4580611408),
   IndirectObject(196, 0, 4580611408),
   IndirectObject(199, 0, 4580611408),
   IndirectObject(202, 0, 4580611408),
   IndirectObject(205, 0, 4580611408),
   IndirectObject(208, 0, 4580611408),
   IndirectObject(211, 0, 4580611408),
   IndirectObject(214, 0, 4580611408),
   IndirectObject(217, 0, 4580611408),
   IndirectObject(46, 0, 4580611408),
   IndirectObject(53, 0, 4580611408),
   IndirectObject(60, 0, 4580611408),
   IndirectObject(67, 0, 4580611408),
   IndirectObject(220, 0, 4580611408),
   IndirectObject(223, 0, 4580611408),
   IndirectObject(226, 0, 4580611408),
   IndirectObject(229, 

Now let's try to extract some text from it:

In [19]:
page = pdf_reader.pages[2]
page.extract_text().split('\n')

['Table of ContentsAMAZON.COM, INC.',
 'PART I',
 'Item 1.',
 'Business This Annual Report on Form 10-K and the documents incorporated here',
 'in by reference contain forward-looking statements based on expectations,estimates, and projections as of the da',
 'te of this filing. Actual results may differ materially from those expressed in forward-looking statements. See Item 1A of PartI — “Risk Factors.”',
 'Amazon.com, Inc.’s principa',
 'l corporate offices are located in Seattle, Washington. We completed our initial public offering in May 1997 and our commonstock is listed on the Nasdaq Global Se',
 'lect Market under the symbol “AMZN.”As used herein, “Amazon.com,” “we,” ',
 '“our,” and similar terms include Amazon.com, Inc. and its subsidiaries, unless the context indicates otherwise.General',
 'We seek to be Earth’s most ',
 'customer-centric company. We are guided by four principles: customer obsession rather than competitor focus, passion forinvention, commitment ',
 'to operati

Fun times, we can all totally read that right?

Nope. The text gets mangled in the conversion from the PDF's bytes to a string, because a PDF stores text as it was typeset for display rather than as a continuous stream meant to be read by a program.
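Some of the damage can be patched by hand. In the output above the line breaks fall in the middle of words ("here" / "in", "da" / "te"), so gluing fragments together with nothing in between restores those words. The sketch below does that and also adds a missing space after a comma. The function name `rejoin_fragments` is hypothetical, and the rule is tuned to this one page: it would wrongly fuse real line breaks such as the one after `'PART I'`, and it does not recover spaces lost elsewhere (for example "commonstock").

```python
import re


def rejoin_fragments(lines):
    """Concatenate extracted fragments and add a space after bare commas."""
    text = "".join(lines)
    # "expectations,estimates" -> "expectations, estimates"
    return re.sub(r",(?=[A-Za-z])", ", ", text)


# Three consecutive fragments copied from the extraction output above.
fragments = [
    "Business This Annual Report on Form 10-K and the documents incorporated here",
    "in by reference contain forward-looking statements based on expectations,estimates, and projections as of the da",
    "te of this filing.",
]

print(rejoin_fragments(fragments))
```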

In comparison, here is a document created with LaTeX.

In [21]:
pdf_obj = open('../../data/pdfs/plos_template.pdf', 'rb')
pdf_reader = PyPDF2.PdfReader(pdf_obj)
page = pdf_reader.pages[2]
print(page.extract_text())

3
tive of a global network. This framework has been applied to g ood measure in phylometabolism to assess
the emergence biological carbon-ﬁxation [11] and to unders tand the regulation of metabolism [12]. A
global network has also been recently used in conjunction wi th probabilistic methods to predict metabolic
networks on a small scale with experimental veriﬁcation [13 ]. While the motivation for the global net-
work approach has been mostly pragmatic, it reminds us of the “Res Potentia ” framework proposed by
Whitehead [14]. Wherein he proposes that which does exist, t ermed the Res Extenta , springs forth as a
set, speciﬁc realization from the realm of possibilities in theRes Potentia .
Furthermore, We contend that using a global network approac h to the study of metabolism is com-
parable to what epidemiologists do when studying worldwide propagation of infection. In building the
worldwide air transportation network [15] all carrier ﬂigh ts are aggregated into a single network and

So the extraction is not perfect compared with text stored as plain text, but it is usable. There isn't much that can be done to aid parsing other than hoping that there are textual markers that signify structure in the manuscript.
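Two kinds of surface damage in the output above can be cleaned up with the standard library: typographic ligatures such as "ﬁ" (a single character) and words hyphenated across line breaks ("com-" / "parable"). The following is a minimal sketch; the function name `tidy` is hypothetical. It assumes that a hyphen at the end of a line followed by a lowercase letter is a typesetting break, which will occasionally be wrong for genuinely hyphenated words. It does nothing about the stray spaces inside words ("g ood", "unders tand").

```python
import re
import unicodedata


def tidy(text):
    """Expand ligatures and rejoin words hyphenated across line breaks."""
    # NFKC normalization replaces the single character "ﬁ" with "f" + "i".
    text = unicodedata.normalize("NFKC", text)
    # Drop a hyphen + newline when the next line continues in lowercase.
    return re.sub(r"-\n(?=[a-z])", "", text)


sample = "the study of metabolism is com-\nparable to carbon-ﬁxation [11]"
print(tidy(sample))
```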

We could also try to get data from elements like tables:

In [6]:
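# PdfFileReader and getPage() were removed in PyPDF2 3.x (installed above);
# use PdfReader and index into .pages instead.
pdf_reader = PyPDF2.PdfReader(pdf_obj)
page = pdf_reader.pages[16]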
page.extract_text()

'17\nTables\nDomain Clade Number of Organisms\nArchaea\n(54)Crenarchaeota 19\nEuryarchaeota 34\nNanoarchaeota 1\nBacteria\n(750)Acidobacteria 2\nActinobacteria 59\nAlpha Proteobacteria 96\nBacillales 59\nBacteroides 12\nBeta Proteobacteria 60\nChlamydia 12\nClostridia 35\nCyanobacteria 36\nDeinococcus Thermus 5\nDelta Proteobacteria 21\nEpsilon Proteobacteria 23\nFusobacteria 1\nGamma Proteobacteria 196\nGreen Nonsulfur Bacteria 9\nGreen Sulfur Bacteria 9\nHyperthermophilic Bacteria 11\nLactobacillales 61\nMagnetococcus 1\nMollicutes 21\nPlanctomyces 1\nSpirochete 17\nTermite Group 1\nVerrucomicrobia 2\nEukaryotes\n(70)Animals 19\nFungi 27\nPlants 3\nProtists 21\nTable 1. Number of organisms considered by taxonomic clade'

But again, we lost the structure of the table itself so we have to try to figure it out on our own. 
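For this particular table, "figuring it out" can be sketched with a regular expression: each data row ends in an integer, and the first row of each domain has the domain total glued to its front, as in `(54)Crenarchaeota 19`. The names `ROW` and `parse_clade_table` are hypothetical, and the rules are fitted to this one output rather than to tables in general. A row like `Termite Group 1` is ambiguous (is the 1 part of the name or the count?), and this sketch would treat it as a count.

```python
import re

# Optional "(total)" prefix, then the clade name, then a trailing integer.
ROW = re.compile(r"^(?:\((?P<total>\d+)\))?(?P<clade>.+?) (?P<count>\d+)$")


def parse_clade_table(text):
    """Return (domain, clade, count) tuples from the flattened table text."""
    rows = []
    domain = None
    previous = None
    for line in text.split("\n"):
        match = ROW.match(line)
        if match:
            # A "(total)" prefix marks a new domain, named on the line before.
            if match.group("total") is not None:
                domain = previous
            rows.append((domain, match.group("clade"), int(match.group("count"))))
        previous = line
    return rows


# A shortened excerpt of the extracted string shown above.
excerpt = (
    "Domain Clade Number of Organisms\nArchaea\n(54)Crenarchaeota 19\n"
    "Euryarchaeota 34\nNanoarchaeota 1\nBacteria\n(750)Acidobacteria 2\n"
    "Alpha Proteobacteria 96"
)

for row in parse_clade_table(excerpt):
    print(row)
```

The resulting list of tuples can be handed to whatever tabular tool you prefer, but the parsing rules themselves would have to be rewritten for each new table layout.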

There are other packages to extract text and tables from PDFs, *but* installing them on every operating system is difficult because they depend on other software and libraries that are not controlled by, or part of, the Python installation.