# PDF Text Mining using PDFMiner

## Installation

`pip install pdfminer.six`

## How to Use
Below is a code example adapted from [Tim Arnold's blog post *Manipulating PDFs with Python*](https://www.binpress.com/tutorial/manipulating-pdfs-with-python/167) and modified to run on Python 3. Most of it is boilerplate that does not need to change; the only values to edit are the filename and the page(s) of interest.

In [1]:
from io import StringIO
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage  

Identify file and page of interest

In [2]:
filename = 'MDOT_fastfacts02-2011_345554_7.pdf'
pagenums = [3]  # zero-based page indices, so 3 is the fourth page; an empty list processes all pages
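
`PDFPage.get_pages` matches page numbers against zero-based indices, while PDF viewers usually count pages from 1. The sketch below shows one way to convert between the two; `to_page_indices` is a hypothetical helper written for this example, not part of PDFMiner.

```python
def to_page_indices(page_numbers):
    """Convert 1-based page numbers (as shown in a PDF viewer) to
    zero-based indices. Hypothetical helper, not a PDFMiner function."""
    indices = set()
    for number in page_numbers:
        if number < 1:
            raise ValueError(f"page numbers start at 1, got {number}")
        indices.add(number - 1)
    return indices


# The fourth page of the document corresponds to index 3.
pagenums = to_page_indices([4])
print(pagenums)
```

An empty input gives an empty set, which keeps the "process all pages" behavior.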

Create the objects needed to read the PDF

In [3]:
output = StringIO()
manager = PDFResourceManager()
converter = TextConverter(manager, output, laparams=LAParams())
interpreter = PDFPageInterpreter(manager, converter)

Open the PDF, then read and process the page(s) of interest

In [4]:
with open(filename, 'rb') as fin:
    for page in PDFPage.get_pages(fin, pagenums):
        interpreter.process_page(page)

Get output string

In [5]:
text = output.getvalue()
converter.close()
output.close()

Let's look at the output text string

In [6]:
text

'Fast Facts\n\n201 7\n\nCARPOOL LOTS\n\n2015 MICHIGAN \nSTATE REVENUE PACKAGE\n\nn There are 261 carpool parking lots located across \n\nthe state, 23 of which are public-private partnerships. \nIncluded in the public-private partnerships are 17 \nlocations that MDOT has partnered with Meijer Corp. \nto provide carpool parking spaces in Meijer parking lots \nlocated near the highway.\n\nn MDOT continues its efforts to provide bike racks at \n\ncarpool lots, and to attract transit service to lots  \nwhere appropriate.\n\nCOST OF ROAD CONSTRUCTION \n\nRoadway construction costs are typically based on standard  \ndesign characteristics, materials, and the type of work performed. \nGeneral estimates are provided for the average cost per lane mile \nof major work by roadway type, and material costs. \n\nAverage Cost Per Lane Mile by  \nMajor Work Type for Various Networks  \n(2016 figures; in millions) \n\nWork Type \n\nReconstruction Rehabilitation Average R&R\n\n \n \n\nCombined \nStatewi

Pretty Print Text

In [7]:
from pprint import pprint as prettyprint
prettyprint(text)

('Fast Facts\n'
 '\n'
 '201 7\n'
 '\n'
 'CARPOOL LOTS\n'
 '\n'
 '2015 MICHIGAN \n'
 'STATE REVENUE PACKAGE\n'
 '\n'
 'n There are 261 carpool parking lots located across \n'
 '\n'
 'the state, 23 of which are public-private partnerships. \n'
 'Included in the public-private partnerships are 17 \n'
 'locations that MDOT has partnered with Meijer Corp. \n'
 'to provide carpool parking spaces in Meijer parking lots \n'
 'located near the highway.\n'
 '\n'
 'n MDOT continues its efforts to provide bike racks at \n'
 '\n'
 'carpool lots, and to attract transit service to lots  \n'
 'where appropriate.\n'
 '\n'
 'COST OF ROAD CONSTRUCTION \n'
 '\n'
 'Roadway construction costs are typically based on standard  \n'
 'design characteristics, materials, and the type of work performed. \n'
 'General estimates are provided for the average cost per lane mile \n'
 'of major work by roadway type, and material costs. \n'
 '\n'
 'Average Cost Per Lane Mile by  \n'
 'Major Work Type for Various Networks
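
The output above has hard line breaks inside sentences and blank lines between blocks. The sketch below shows one possible cleanup step, assuming that a blank line marks a paragraph boundary; `to_paragraphs` and `sample` are names made up for this example. That assumption does not always hold for this PDF: the first carpool bullet is split by a blank line in the middle of a sentence, so the result would still need review.

```python
import re


def to_paragraphs(text):
    """Split extracted text on blank lines and rejoin the wrapped lines
    inside each block. Hypothetical helper, not a PDFMiner function."""
    paragraphs = []
    for block in re.split(r"\n\s*\n", text):
        # Collapse hard line breaks and repeated spaces within the block.
        joined = " ".join(block.split())
        if joined:
            paragraphs.append(joined)
    return paragraphs


# A short excerpt in the same shape as the extracted text above.
sample = (
    "CARPOOL LOTS\n\n"
    "the state, 23 of which are public-private partnerships. \n"
    "Included in the public-private partnerships are 17 \n"
    "locations that MDOT has partnered with Meijer Corp. \n"
)

for paragraph in to_paragraphs(sample):
    print(paragraph)
```

Calling `to_paragraphs(text)` on the full string gives a list that is easier to search or write out one paragraph per line.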

Write the text out to a file

In [8]:
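from pathlib import Path
# Swap only the file extension; str.replace('pdf', 'txt') would also change 'pdf' elsewhere in the name
savefile = Path(filename).with_suffix('.txt')
with open(savefile, 'w', encoding='utf-8') as fout:
    fout.write(text)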

## Conclusion

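Reconstructing tables from the output of PDF text-mining tools looks like a formatting nightmare, on par with copying and pasting from the PDF by hand. In the output above, the table headers and cells come out as separate lines with no column structure.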