## read_EGHS_pdf

E. Quinn 1/1/2020

This notebook uses pdfminer to extract the line-item information from the EGHS budget PDF.

The documentation for pdfminer is at:

https://buildmedia.readthedocs.org/media/pdf/pdfminer-docs/latest/pdfminer-docs.pdf

## Import standard python datascience packages

In [1]:
import math
import re
import numpy as np
import scipy as sc
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [2]:
from datetime import *
from datascience import *

## Import pdfminer packages

In [3]:
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfpage import PDFTextExtractionNotAllowed
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfdevice import PDFDevice
from pdfminer.layout import LAParams
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LTTextBoxHorizontal

In [4]:
pd.set_option("display.max_rows",1000)
pd.get_option("display.max_rows")
pd.set_option('display.max_columns', 50)

### Show the directory we are running in

In [5]:
!pwd

/home/gquinn/EG/school_committee/finance_subcommittee/notebooks


## Read the pdf and create a dictionary with the contents of each text box

### Function read_pdf() reads the budget PDF and returns a dictionary containing the contents

Strategy for this document:  

For each LTTextBoxHorizontal element, save the following in a dictionary (pdfminer measures coordinates from the bottom-left corner of the page, so y increases upward):

- x0 horizontal coordinate of the left edge of the text box
- x1 horizontal coordinate of the right edge of the text box
- y0 vertical coordinate of the bottom edge of the text box
- y1 vertical coordinate of the top edge of the text box
- page number 
- sequence number of text box within this page
- text contained in the text box, converted to ascii
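
As a rough illustration of the structure described above, the sketch below builds a small made-up dictionary in the same nested shape (page number, then text box sequence number, then the coordinate and text entries) and flattens it into one record per text box. The sample values and the helper name `flatten_boxes` are hypothetical and are not part of this notebook.

```python
# Hypothetical sample in the nested shape described above:
# {page number: {text box number: {'x0', 'x1', 'y0', 'y1', 'text'}}}
sample = {
    0: {
        1: {'x0': 31.0, 'x1': 190.5, 'y0': 500.0, 'y1': 512.0, 'text': 'made-up text A\n'},
        2: {'x0': 650.0, 'x1': 700.0, 'y0': 500.0, 'y1': 512.0, 'text': '1,234.00\n'},
    },
    1: {
        1: {'x0': 31.0, 'x1': 120.0, 'y0': 700.0, 'y1': 712.0, 'text': 'made-up text B\n'},
    },
}

def flatten_boxes(pdf):
    """Return one flat record per text box, ordered by page and sequence number."""
    records = []
    for pageno in sorted(pdf):
        for tbox_no in sorted(pdf[pageno]):
            record = {'page': pageno, 'tbox': tbox_no}
            record.update(pdf[pageno][tbox_no])   # copy x0, x1, y0, y1, text
            records.append(record)
    return records

records = flatten_boxes(sample)
for r in records:
    print(r['page'], r['tbox'], r['x0'], repr(r['text']))
```

A flat list like this is convenient for eyeballing which x0 and x1 values recur across pages before choosing column boundaries.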

Parsing the text is complicated by the fact that a text box may span multiple columns and/or rows, and the text box groupings vary quite a bit depending on the page contents and layout.

However, with a bit of luck the structure of the document will allow the contents to be deciphered with the following heuristics:

- Text boxes containing left justified columns will tend to have nearly the same x0 coordinates
- Text boxes containing right justified columns will tend to have nearly the same x1 coordinates
- The codes for fund, account code, and object code are numeric and have fixed lengths
- Extraneous information is often preceded or followed by a series of underscore and newline characters
- Last name can be distinguished because it is the only field that consists entirely of alphabetic characters followed by a comma
- Last name may be preceded by between one and three numerical fields: fund, account, object. If it is, the x0 value is shifted to the left.
    - Three numerical fields precede the name:  assume they are fund, account, object
    - Two numerical fields precede the name: assume they are account, object
    - One numerical field precedes the name: assume it is object
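
The last two heuristics above can be sketched as follows. This is only an illustration under stated assumptions: the helper name `split_leading_codes`, the sample lines, and the letters-only last-name pattern are hypothetical and are not taken from the PDF or from the cells in this notebook.

```python
import re

FIELD_NAMES = ['fund', 'account', 'object']
LAST_NAME = re.compile(r'^([A-Za-z]+),')   # assumption: letters only, then a comma

def split_leading_codes(line):
    """Assign the numeric fields that precede a last name to fund/account/object."""
    tokens = line.split()
    codes = []
    for tok in tokens:
        if not tok.isdigit():
            break
        codes.append(tok)
    if len(codes) > len(FIELD_NAMES):
        raise ValueError('more numeric fields than expected: ' + line)
    # Align from the right: one field -> object; two -> account, object; three -> all
    fields = dict(zip(FIELD_NAMES[len(FIELD_NAMES) - len(codes):], codes))
    rest = ' '.join(tokens[len(codes):])
    match = LAST_NAME.match(rest)
    fields['last_name'] = match.group(1) if match else None
    return fields

print(split_leading_codes('1000 71289006 53301 Doe, Jane'))
print(split_leading_codes('53301 Doe, Jane'))
```

Aligning the field names from the right reflects the rule that the field closest to the name is always the object code, whichever of the other codes happen to be present.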
    

In [6]:
def read_pdf(path):
    rsrcmgr = PDFResourceManager()                                  #create a resource manager
    laparams = LAParams()                                           #set the parameters for layout analysis
    device = PDFPageAggregator(rsrcmgr, laparams=laparams)          #create a PDF page aggregator object
    interpreter = PDFPageInterpreter(rsrcmgr, device)               #create an interpreter that feeds pages to the device

    pdf={}                                                          #dictionary to hold the results

    with open(path, 'rb') as document:                              #open the pdf in binary mode; closed on exit
        for pageno, page in enumerate(PDFPage.get_pages(document)): #loop through the pdf page by page, numbering from 0
            pdf[pageno] = {}                                        #dictionary for this page
            interpreter.process_page(page)                          #run layout analysis on the page
            layout = device.get_result()                            #receive the LTPage object for the page
            tbox_no=0                                               #sequence number of text box within this page
            for element in layout:
                if isinstance(element, LTTextBoxHorizontal):        #keep only horizontal text boxes
                    tbox_no += 1                                    #increment text box number
                    txt = element.get_text().encode('ascii', 'ignore').decode('ascii')  #drop non-ascii characters
                    pdf[pageno][tbox_no] = {                        #dictionary for this text box
                        'x0': round(element.x0,2),                  #left edge
                        'x1': round(element.x1,2),                  #right edge
                        'y0': round(element.y0,2),                  #bottom edge
                        'y1': round(element.y1,2),                  #top edge
                        'text': txt,                                #text as an ascii-only string
                    }
    return(pdf)

In [7]:
def get_lineitems(pdf):
    #keep the text boxes that start at x0 == 31.0 and contain at least four hyphens
    lineitems = {}

    for page in pdf:
        for tb, box in pdf[page].items():
            if box['x0'] == 31.0 and box['text'].count('-') >= 4:
                lineitems.setdefault(page, {})[tb] = {key: box[key] for key in ('x0', 'x1', 'y0', 'y1', 'text')}
    return(lineitems)

In [8]:
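def find_matches(pg,minx0,maxx0,y0in,y1in,tolerance):
    #collect the text boxes whose x0 lies strictly between minx0 and maxx0 and whose
    #vertical extent fits within y0in..y1in widened by tolerance; key each by its top edge y1
    dct = {}
    for tbx in pg.keys():
        tx0 = pg[tbx]['x0']
        if (minx0 < tx0 < maxx0):
            ty0 = pg[tbx]['y0']
            ty1 = pg[tbx]['y1']
            txt = pg[tbx]['text']
            if ((y0in-tolerance < ty0) and (ty1 < y1in+tolerance)):
                dct[ty1] = txt
    #walk the boxes from the bottom of the page up, and each box's lines from last to first
    dct2 = {}
    seq = 0
    for y in sorted(dct.keys()):
        lines = dct[y].split('\n')
        for line in reversed(lines):
            if (len(line) > 0):
                if (len(line)==1):
                    dct2[seq] = '0.0'                               #a one-character line is recorded as zero
                else:
                    dct2[seq] = line
                seq = seq+1
    #renumber so that key 1 is the topmost line and key len(dct2) is the bottommost
    dct3 = {}
    l2 = len(dct2)
    for seq in dct2.keys():
        dct3[l2-seq] = dct2[seq]
    return(dct3)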
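
The ordering produced by find_matches() can be hard to see from the cell alone: because y coordinates increase up the page, the boxes have to be read from the largest y1 down to get lines in reading order, numbered from 1. The sketch below shows the same ordering idea in a compact form. The helper name `order_lines` and the sample boxes are hypothetical, and treating a one-character line as a placeholder for zero (for example a lone dash) is an assumption about why the cell records '0.0' in that case.

```python
def order_lines(boxes):
    """boxes: list of (y1, text) pairs. Return {1: topmost line, 2: next line, ...}."""
    ordered = []
    # Largest y1 first, because the PDF origin is at the bottom of the page
    for y1, text in sorted(boxes, key=lambda b: b[0], reverse=True):
        for line in text.split('\n'):
            if line:                                   # skip empty lines
                # Assumption: a single character is a placeholder meaning zero
                ordered.append('0.0' if len(line) == 1 else line)
    return {i: line for i, line in enumerate(ordered, start=1)}

# Made-up boxes: the second one sits higher on the page than the first
boxes = [(480.0, '853.12\n379.98\n'), (520.0, '3,150.00\n-\n')]
print(order_lines(boxes))
# {1: '3,150.00', 2: '0.0', 3: '853.12', 4: '379.98'}
```

Unlike the cell above, this sketch keeps two boxes that share the same y1 rather than letting the later one overwrite the earlier one.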

In [9]:
def budget_dct(lid,bd_old):
    bd = {}
    bd_index = len(bd_old.keys()) + 1
    for page in lid.keys():
        for tb in lid[page].keys():
            if ('text' in lid[page][tb].keys()):
                txt = lid[page][tb]['text']
                lines = txt.split('\n')
                tb_index = 1
                for line in lines:
                    if (line.count('-') > 1):
                        bd[bd_index] = {}
                        bd[bd_index]['ucoa'] = line
                        words = line.split('-')
                        bd[bd_index]['Fund'] = float(words[0].strip() + '0000')                                                 
                        bd[bd_index]['Loc'] = float(words[1])   
                        bd[bd_index]['Func'] = float(words[2]) 
                        bd[bd_index]['Prog'] = float(words[3]) 
                        bd[bd_index]['Sub'] = float(words[4])
                        bd[bd_index]['JC'] = float(words[5])
                        acct_string = lid[page][tb]['acct-code'][tb_index]
                        acct_desc = acct_string[acct_string.index(' ',acct_string.index(' ')+1):].strip()
                        bd[bd_index]['acct-string'] = lid[page][tb]['acct-code'][tb_index]
                        bd[bd_index]['acct-desc'] = acct_desc
                        words = acct_string.split()
                        bd[bd_index]['acct-code'] = words[0]
                        bd[bd_index]['Obj'] = float(words[1])
                        bd[bd_index]['budget2020'] = float(lid[page][tb]['budget2020'][tb_index].replace(',',''))
                        bd[bd_index]['budget2019'] = float(lid[page][tb]['budget2019'][tb_index].replace(',',''))
                        bd[bd_index]['actual2018'] = float(lid[page][tb]['actual2018'][tb_index].replace(',',''))
                        bd[bd_index]['actual2017'] = float(lid[page][tb]['actual2017'][tb_index].replace(',',''))
                        bd_index += 1
                        tb_index += 1
    return(bd)
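
The string handling inside budget_dct() can be tried in isolation. The sketch below mirrors how the cell splits the hyphen-separated code line and the account string, using the values that appear in the first entry of the output further down. The helper names `parse_ucoa` and `parse_acct_string` are hypothetical, and the strict check for exactly six fields is an added assumption (the cell itself only requires more than one hyphen).

```python
UCOA_FIELDS = ['Fund', 'Loc', 'Func', 'Prog', 'Sub', 'JC']

def parse_ucoa(line):
    """Split a line such as '1000 - 05106 - 00222 - 10 - 0000 - 0000' into numeric fields."""
    words = [w.strip() for w in line.split('-')]
    if len(words) != len(UCOA_FIELDS):
        raise ValueError('expected six hyphen-separated fields: ' + line)
    words[0] = words[0] + '0000'          # the cell pads the fund code with four zeros
    return {name: float(word) for name, word in zip(UCOA_FIELDS, words)}

def parse_acct_string(acct_string):
    """Split 'code object description' on the first two runs of whitespace."""
    code, obj, desc = acct_string.split(None, 2)
    return {'acct-code': code, 'Obj': float(obj), 'acct-desc': desc.strip()}

print(parse_ucoa('1000 - 05106 - 00222 - 10 - 0000 - 0000'))
print(parse_acct_string('71289006 53301 Prof Dev Training'))
```

Splitting the account string with a maximum of two splits keeps multi-word descriptions such as 'Prof Dev Training' intact.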

### Read the PDF

In [10]:
pdfd = read_pdf('../EGHS FY20Budget.pdf')

In [11]:
lineitems = get_lineitems(pdfd)

In [12]:
for page in lineitems.keys():
    pg = lineitems[page]
    for tb in pg.keys():
        x0 = pg[tb]['x0']
        x1 = pg[tb]['x1']
        y0 = pg[tb]['y0']
        y1 = pg[tb]['y1']
        text = pg[tb]['text']
        pg[tb]['budget2019'] = find_matches(pdfd[page],525.0,570.0,y0,y1,5)
        pg[tb]['budget2020'] = find_matches(pdfd[page],650.0,705.0,y0,y1,5)
        pg[tb]['actual2018'] = find_matches(pdfd[page],450.0,505.0,y0,y1,5)
        pg[tb]['actual2017'] = find_matches(pdfd[page],395.0,445.0,y0,y1,5)
        pg[tb]['acct-code']  = find_matches(pdfd[page],195.0,385.0,y0,y1,5)

In [13]:
bd_old = {}
bd = budget_dct(lineitems,bd_old)

In [14]:
bd

{1: {'ucoa': '1000 - 05106 - 00222 - 10 - 0000 - 0000',
  'Fund': 10000000.0,
  'Loc': 5106.0,
  'Func': 222.0,
  'Prog': 10.0,
  'Sub': 0.0,
  'JC': 0.0,
  'acct-string': '71289006 53301 Prof Dev Training',
  'acct-desc': 'Prof Dev Training',
  'acct-code': '71289006',
  'Obj': 53301.0,
  'budget2020': 0.0,
  'budget2019': 3150.0,
  'actual2018': 853.12,
  'actual2017': 7925.49},
 2: {'ucoa': '1000 - 05106 - 00222 - 10 - 0000 - 0000',
  'Fund': 10000000.0,
  'Loc': 5106.0,
  'Func': 222.0,
  'Prog': 10.0,
  'Sub': 0.0,
  'JC': 0.0,
  'acct-string': '71289006 53303 Conf/Workshops',
  'acct-desc': 'Conf/Workshops',
  'acct-code': '71289006',
  'Obj': 53303.0,
  'budget2020': 0.0,
  'budget2019': 0.0,
  'actual2018': 379.98,
  'actual2017': 0.0},
 3: {'ucoa': '1000 - 05106 - 00511 - 10 - 0000 - 0000',
  'Fund': 10000000.0,
  'Loc': 5106.0,
  'Func': 511.0,
  'Prog': 10.0,
  'Sub': 0.0,
  'JC': 0.0,
  'acct-string': '71400006 53303 Conferences',
  'acct-desc': 'Conferences',
  'acct-code'