## Read the FY2019 Tax Roll (assessed 12/31/2018)

E. Quinn 12/19/2019

This notebook uses pdfminer to extract the information from the FY2019 EG tax roll.

The documentation for pdfminer is at:

https://buildmedia.readthedocs.org/media/pdf/pdfminer-docs/latest/pdfminer-docs.pdf

Change log:

12/23/2019  Modify code to read only the main property listing (pages 1-484) section and skip the pages with totals and exempt properties.  This solved the duplicate account number issue.  epq

## Import standard Python data science packages

In [1]:
import math
import re
import numpy as np
import scipy as sc
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

## Import pdfminer packages

In [2]:
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfpage import PDFTextExtractionNotAllowed
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfdevice import PDFDevice
from pdfminer.layout import LAParams
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LTTextBoxHorizontal

### Show the directory we are running in

In [3]:
!pwd

/home/gquinn/EG/notebooks


## Perform layout analysis - see section 2.3 of the pdfminer documentation


### Open the PDF file in binary mode

In [4]:
document = open('../re_tax_rolls/RETaxRollFINAL_201908081248490737.pdf', 'rb')

### Create a resource manager object

In [5]:
rsrcmgr = PDFResourceManager()

### Set the parameters for analysis

In [6]:
laparams = LAParams()

### Create a PDF page aggregator and a page interpreter

In [7]:
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
interpreter = PDFPageInterpreter(rsrcmgr, device)

### Store the information in a dictionary

Strategy for this document:  

Save information from each element in the LTTextBox objects in a dictionary including:

- x0 horizontal coordinate of the upper left corner of the text box
- x1 horizontal coordinate of the lower right corner of the text box
- page number 
- sequence number of text box within this page
- text contained in the text box, converted to ASCII

Parsing the text is complicated by the fact that a text box may span multiple columns and/or rows, and the text box groupings vary quite a bit depending on the page contents and layout.

However, with a bit of luck the structure of the document will allow the contents to be deciphered with the following heuristics:

- Text boxes containing left justified columns will tend to have nearly the same x0 coordinates
- Text boxes containing right justified columns will tend to have nearly the same x1 coordinates
- The codes for fund, account code, and object code are numeric and have fixed lengths
- Extraneous information is often preceded or followed by a series of underscore and newline characters
- Last name can be distinguished because it is the only field that is all characters followed by a comma
- Last name may be preceded by between one and three numerical fields:  fund, account, object.  If it is, the x0 value is shifted to the left.
    - Three numerical fields precede the name:  assume they are fund, account, object
    - Two numerical fields precede the name: assume they are account, object
    - One numerical field precedes the name: assume it is object
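
The first heuristic (left-justified columns share nearly the same x0) can be sketched as below. This is a minimal illustration, not the notebook's parsing code: the helper `group_by_x0`, the 2-point tolerance, and the sample boxes are all hypothetical. Only the `x0`/`x1`/`text` keys mirror the dictionary described above.

```python
def group_by_x0(boxes, tol=2.0):
    """Group text boxes whose x0 lies within `tol` of the first box in a group."""
    columns = []                                    # list of (anchor_x0, [boxes])
    for box in sorted(boxes, key=lambda b: b['x0']):
        if columns and abs(box['x0'] - columns[-1][0]) <= tol:
            columns[-1][1].append(box)              # same column as the previous box
        else:
            columns.append((box['x0'], [box]))      # start a new column
    return columns

# Made-up boxes, not taken from the tax roll
sample_boxes = [
    {'x0': 36.0, 'x1': 120.5, 'text': 'left column, row 1'},
    {'x0': 36.4, 'x1': 98.2, 'text': 'left column, row 2'},
    {'x0': 310.1, 'x1': 400.0, 'text': 'second column'},
]
columns = group_by_x0(sample_boxes)
for anchor, members in columns:
    print(anchor, [b['text'] for b in members])
```

The same approach would apply to right-justified columns by sorting and comparing on `x1` instead of `x0`.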

In [8]:
pdf={}                                     #dictionary to hold the results

pageno = 0                                 #initialize page counter to zero

for page in PDFPage.get_pages(document):   #loop through the pdf page by page
    pageno = pageno + 1                    #increment the page number
    pdf[pageno] = {}                       #dictionary for this page
    interpreter.process_page(page)         # receive the LTPage object for the page.
    layout = device.get_result()           # create layout object
    tbox_no=0                              # index for element number
    for element in layout:
        if (type(element).__name__=='LTTextBoxHorizontal'):             #loop through text boxes
            tbox_no += 1                                                #increment text box number
            pdf[pageno][tbox_no] = {}                                   #dictionary for text boxes within page
            x0 = round(element.x0,2)                                    #x0 coordinate of textbox corner
            x1 = round(element.x1,2)                                    #x1 coordinate of textbox corner
            txt = element.get_text().encode('ascii', 'ignore')          #text converted to ascii
            pdf[pageno][tbox_no]['x0'] = x0                             #create x0 coordinate entry
            pdf[pageno][tbox_no]['x1'] = x1                             #create x1 coordinate entry
            pdf[pageno][tbox_no]['text'] = txt                          #create text entry
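
The result is a dictionary of pages, each holding a dictionary of text boxes. The sketch below builds a tiny stand-in with made-up values (the name `pdf_demo` and its contents are hypothetical) and walks it page by page. In Python 3, `encode('ascii', 'ignore')` returns `bytes`, so the stand-in stores `bytes` too and decodes them before splitting into lines.

```python
# Hypothetical stand-in for the structure: pdf[page][text box] -> {'x0', 'x1', 'text'}
pdf_demo = {
    1: {
        1: {'x0': 36.0, 'x1': 120.5, 'text': b'first box\nsecond line\n'},
        2: {'x0': 310.1, 'x1': 400.0, 'text': b'another box\n'},
    },
}

for pageno, boxes in pdf_demo.items():              # loop through pages
    for tbox_no, box in boxes.items():              # loop through text boxes on the page
        lines = box['text'].decode('ascii').split('\n')
        print(pageno, tbox_no, box['x0'], box['x1'], lines)

n_boxes = sum(len(boxes) for boxes in pdf_demo.values())
```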

### Various functions for parsing tax roll

In [9]:
def check_acct(string):
    return string[0:9].isdigit()         #True if the line starts with a numeric account number

In [10]:
def check_RE_00(string):
    return ' RE 00 ' in string           #True if the line contains the ' RE 00 ' marker

In [11]:
def get_bill_no(line):
    words = line.split()
    lastword = ''
    for word in words:
        if (word=='RE'):
            if (line.index('  RE 00 ') > 40):
                return(lastword)
            else:
                return('XXXX')
        lastword = word

In [12]:
def get_re_class(line):
    ix = line.index(' RE 00 ')           #find the 'RE 00' substring
    line2 = line[ix:len(line)-1]         #take the rest of the line
    words = line2.split()                #split it into words
    re_class = words[2]                  #take the next word to be the first part of the RE class
    if (len(words[3]) < 15):             #if the word after that is less than 15 bytes, it's part of RE class
        re_class = re_class + ' ' + words[3]
    return(re_class)

In [13]:
def get_plot(line):
    ix = line.index(' RE 00 ')           #find the 'RE 00' substring
    line2 = line[ix:len(line)-1]         #take the rest of the line
    words = line2.split()                #split it into words
    for word in words:                   #find a word that fits the pattern 999-999-999-9999
        if (len(word)==16):              #length of the plot string has to be 16
            if (word[3]=='-'):           #fourth character has to be '-'
                return(word)
    return('')
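
The comments in `get_plot` describe the plot number as `999-999-999-9999`, but the code only checks the length and the fourth character. A stricter check can be sketched with a regular expression. The name `find_plot` and the sample line are hypothetical, and the pattern assumes every plot number is all digits in 3-3-3-4 groups, which the tax roll itself may not guarantee.

```python
import re

# Assumed pattern: three 3-digit groups and one 4-digit group, separated by '-'
PLOT_RE = re.compile(r'\d{3}-\d{3}-\d{3}-\d{4}')

def find_plot(line):
    """Return the first word that fully matches the assumed plot pattern, else ''."""
    for word in line.split():
        if PLOT_RE.fullmatch(word):
            return word
    return ''

# Made-up line, not a row from the tax roll
sample_line = 'XXXXXXXXX 1 RE 00 RES 001-002-003-0004 1 EXAMPLE ROAD 100,000 2,000.00'
print(find_plot(sample_line))
```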

In [14]:
def get_tax(line):
    ix = line.index(' RE 00 ')           #find the 'RE 00' substring
    line2 = line[ix:len(line)-1]         #take the rest of the line
    if (line2[len(line2)-9:].isspace()): #check for missing tax amount
        tax = '0.0'                      #    set these equal to 0.0
    else:
        words = line2.split()                #split it into words
        tax = words[len(words)-1]            #the final word is the tax amount
    tax = tax.replace(',','')                  #remove the commas so we can perform arithmetic
    return(tax)                         

In [15]:
def get_assessment(line):
    ix = line.index(' RE 00 ')                 #find the 'RE 00' substring
    line2 = line[ix:len(line)-1]               #take the rest of the line
    words = line2.split()
    if (line2[len(line2)-9:].isspace()):       #check for missing tax amount
        assessment = words[len(words)-1]       #if so the last word is the assessed value
    else:                                      #otherwise,
        assessment = words[len(words)-2]       #the next to last word is the assessed value
    assessment = assessment.replace(',','')    #remove the commas so we can perform arithmetic
    return(assessment)                         
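
The missing-tax check shared by `get_tax` and `get_assessment` tests whether the last nine characters of the line are blank. A compact sketch of the same idea follows. `split_amounts` and both sample lines are hypothetical, made-up data rather than rows from the tax roll.

```python
def split_amounts(line2):
    """Return (assessment, tax) as comma-free strings from the tail of a roll line."""
    words = line2.split()
    if line2[-9:].isspace():                        # tax column is blank
        return words[-1].replace(',', ''), '0.0'
    return words[-2].replace(',', ''), words[-1].replace(',', '')

# Made-up lines: one with a tax amount, one with the tax column left blank
with_tax = ' RE 00 RES 001-002-003-0004 1 EXAMPLE ROAD     100,000     2,000.00'
without_tax = ' RE 00 RES 001-002-003-0004 1 EXAMPLE ROAD     100,000             '

print(split_amounts(with_tax))
print(split_amounts(without_tax))
```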

In [16]:
def get_address(line):
    ix = line.index(' RE 00 ')                 #find the 'RE 00' substring
    line2 = line[ix:len(line)-1]               #take the rest of the line
    words = line2.split()                      #split it into words
    for word in words:                         #find a word that fits the pattern 999-999-999-9999
        if (len(word)==16):                    #length of the plot string has to be 16
            if (word[3]=='-'):                 #fourth character has to be '-'
                plot = word                    #find plot string
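    assessment = words[len(words)-2]           #find assessed value (assumes a tax amount follows it;
                                               # if the tax is missing this is the last address word,
                                               # so the address comes back truncated)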
    address_start = line2.index(plot) + 16     #address starts 16 characters after plot
    address_end   = line2.index(assessment)    #address ends just before assessment
    address = line2[address_start:address_end] #pick out the address string
    address = address.strip()                  #trim leading and trailing spaces
    return(address)                         

In [17]:
def get_total_assessment(line):
    ix = line.index('TOTALS ')                    #find the 'TOTALS ' substring
    line2 = line[ix:len(line)-1]                  #take the line from there on
    words = line2.split()                         #split it into words
    total_asmt = words[len(words)-2]              #the next to last word is the total assessed value
    if (total_asmt != 'TOTALS'):                  #check if no assessed value
        total_asmt = total_asmt.replace(',','')   #remove the commas so we can perform arithmetic
    else:
        total_asmt = '0.0'
    return(total_asmt)                                                  

In [18]:
def get_total_tax(line):
    ix = line.index('TOTALS ')                    #find the 'TOTALS ' substring
    line2 = line[ix:len(line)-1]                  #take the line from there on
    words = line2.split()                         #split it into words
    total_tax = words[len(words)-1]               #the final word is the total tax amount
    total_tax = total_tax.replace(',','')         #remove the commas so we can perform arithmetic
    return(total_tax)      


In [19]:
def get_exemption_type(line):
    ix = line.index(' EX ') +1                                  #find start of EX information
    line2 = line[ix:len(line)-1]                                #take the line from there out
    words = line2.split()                                       #split into words
    exemption_type=''                                           #initialize exemption_type string
    word_num = 1
    num_words = len(words)
    for word in words:
        if ((word_num != 1) and (word_num != num_words)):       #skip 'EX' and the amount
            exemption_type = exemption_type + ' ' + word
        word_num += 1
    exemption_type = exemption_type.strip()
    return(exemption_type)

In [20]:
def get_exemption_amt(line):
    ix = line.index(' EX ') +1                  #find start of EX information
    line2 = line[ix:len(line)-1]                #take the line from there out
    words = line2.split()
    num_words = len(words)
    exemption_amt = words[num_words-1]
    exemption_amt = exemption_amt.replace(',','')
    return(exemption_amt)
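
For illustration, the two exemption helpers can be combined into one function using slicing. This is a sketch, not a drop-in replacement: `parse_exemption` and the sample line are hypothetical, and unlike the functions above it does not drop the final character of the line.

```python
def parse_exemption(line):
    """Return (exemption_type, amount) from a line containing ' EX '."""
    words = line[line.index(' EX ') + 1:].split()   # words from 'EX' onward
    exemption_type = ' '.join(words[1:-1])          # everything between 'EX' and the amount
    amount = words[-1].replace(',', '')             # last word is the amount, commas removed
    return exemption_type, amount

# Made-up line, not a row from the tax roll
sample = '          EX SAMPLE EXEMPTION TYPE      1,000 '
print(parse_exemption(sample))
```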

### Parse the tax roll document

In [21]:
acctd = {}

#for k in pdf.keys():                                   #loop through pages
for k in np.arange(1,485):                              #loop through the main property listing only 
    for key in pdf[k].keys():                           #loop through items on pages
        if ('text' in pdf[k][key].keys()):              #extract the text portion
            string = str(pdf[k][key]['text'])           #convert to string
            lines = string.split('\\n')                 #split into lines
            for line in lines:                          #parse the content of each line
                if (' RE 00 ' in line):                 #lines with taxes contain ' RE 00 '
                    if (check_acct(line)):              #lines that start an account begin with the account number
                        acct = line[0:9]                #start new account entry
                        if (acct in acctd.keys()):      #check if it's already there
                            acct = 'XXXX'               #if so route it to junk acct
                        acctd[acct] = {}
                    bill = get_bill_no(line)            #get bill number
                    if (bill in acctd[acct].keys()):
                        print('duplicate bill number')
                        print(acct,bill)
                        bill = bill + '_2'
                    acctd[acct][bill] = {}                               #entry for this bill
                    acctd[acct][bill]['RE_CLASS'] = get_re_class(line)   #fill in RE class
                    acctd[acct][bill]['PLOT'] = get_plot(line)           #fill in plot
                    acctd[acct][bill]['TAX']  = get_tax(line)            #fill in the tax amount
                    acctd[acct][bill]['ASMT'] = get_assessment(line)     #fill in the assessed value
                    acctd[acct][bill]['ADDRESS'] = get_address(line)     #fill in the address
                elif (' EX ' in line):                                   #Exemptions line
                    exemption = get_exemption_type(line)                 #get the exemption type
                    exemption_amt = get_exemption_amt(line)              #get the exemption amount
                    if ('EXEMPTIONS' not in acctd[acct][bill].keys()):   #see if we need to add EXEMPTIONS
                        acctd[acct][bill]['EXEMPTIONS'] = {}             #   if so create empty dictionary
                    acctd[acct][bill]['EXEMPTIONS'][exemption] = exemption_amt

                elif (' TOTALS ' in line):                               #TOTALS line
                    if (acct != ''):
                        acctd[acct]['TOTAL_TAX'] = get_total_tax(line)         #get the total tax for acct
                        acctd[acct]['TOTAL_ASMT'] = get_total_assessment(line) #get the total assessment
                        acct = ''                                                    #no more for this acct

if ('XXXX' in acctd.keys()):
    print(acctd['XXXX'])
else: 
    print("No duplicate account numbers")

duplicate bill number
100000157 39
duplicate bill number
100404042 4313
duplicate bill number
100402109 2967
duplicate bill number
100002465 616
No duplicate account numbers
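
The `_2` suffix used in the parsing cell handles one repeat of a bill number within an account. A third occurrence would reuse the `_2` key and overwrite that entry. A more general suffixing scheme could look like the sketch below. The helper `unique_key` is hypothetical and not part of the notebook.

```python
def unique_key(existing, key):
    """Return `key` if unused, else the first of key_2, key_3, ... not in `existing`."""
    if key not in existing:
        return key
    n = 2
    while f'{key}_{n}' in existing:
        n += 1
    return f'{key}_{n}'

bills = {}
for bill in ['39', '39', '39']:                     # same bill number seen three times
    bills[unique_key(bills, bill)] = {}
print(list(bills))
```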


In [22]:
### ad-hoc checking

for acct in acctd.keys():
    if ('100000157' in acct):
        print(acct,acctd[acct])
    if ('100404042' in acct):
        print('\n',acct,acctd[acct])
    if ('100402109' in acct):
        print('\n',acct,acctd[acct])
    if ('100002465' in acct):
        print('\n',acct,acctd[acct])

100000157 {'39': {'RE_CLASS': 'FARM', 'PLOT': '043-011-002-0000', 'TAX': '0.0', 'ASMT': '13600', 'ADDRESS': '2068 SOUTH COUNTY'}, '39_2': {'RE_CLASS': 'FFOP', 'PLOT': '043-011-002-0000', 'TAX': '8484.3', 'ASMT': '351947', 'ADDRESS': '2068 SOUTH COUNTY TRA'}, 'TOTAL_TAX': '8484.3', 'TOTAL_ASMT': '365547'}

 100404042 {'4313': {'RE_CLASS': 'FFOP', 'PLOT': '020-019-020-0000', 'TAX': '8436.2', 'ASMT': '346273', 'ADDRESS': '1786 FRENCHTOWN ROAD'}, '4313_2': {'RE_CLASS': 'FFOP', 'PLOT': '020-019-020-0000', 'TAX': '0.0', 'ASMT': '17200', 'ADDRESS': '1786 FRENCHTOWN'}, 'TOTAL_TAX': '8436.2', 'TOTAL_ASMT': '363473'}

 100402109 {'2967': {'RE_CLASS': 'FARM', 'PLOT': '058-014-019-0000', 'TAX': '0.0', 'ASMT': '16900', 'ADDRESS': '465 SHIPPEETOWN'}, '2967_2': {'RE_CLASS': 'FFOP', 'PLOT': '058-014-019-0000', 'TAX': '8146.1', 'ASMT': '334077', 'ADDRESS': '465 SHIPPEETOWN ROAD'}, 'TOTAL_TAX': '8146.1', 'TOTAL_ASMT': '350977'}

 100002465 {'616': {'RE_CLASS': 'COMM II', 'PLOT': '071-012-097-0000', 'TAX

In [23]:
### balance totals
###################################################################################
total_tax = 0.0
total_asmt = 0.0
item_ct = 0.0

tax_roll_total_tax = 53101398.13
tax_roll_total_assessed = 2646550206
tax_roll_items = 5456

for acct in acctd.keys():
    total_tax += float(acctd[acct]['TOTAL_TAX'])
    total_asmt += float(acctd[acct]['TOTAL_ASMT'])
    item_ct += len(acctd[acct])

nbills = 0
for acct in acctd.keys():
    for bill in acctd[acct].keys():
        if (bill.isdigit()):
            nbills+=1

tax_diff = total_tax - tax_roll_total_tax
asmt_diff = total_asmt - tax_roll_total_assessed
item_diff = nbills - tax_roll_items

print("Total tax: ",total_tax,"  tax roll: ",tax_roll_total_tax," difference: ",tax_diff,\
    '\nTotal assessed value: ',total_asmt,"  tax roll: ",tax_roll_total_assessed," difference: ",asmt_diff,\
    '\nNumber of bills: ',nbills," tax roll: ",tax_roll_items," difference: ",item_diff )

Total tax:  53117438.5999999   tax roll:  53101398.13  difference:  16040.4699998945 
Total assessed value:  2646549562.0   tax roll:  2646550206  difference:  -644.0 
Number of bills:  5456  tax roll:  5456  difference:  0
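
The trailing digits in `53117438.5999999` are most likely an artifact of summing binary floats. If exact cents are wanted, `decimal.Decimal` can be used for the accumulation. A minimal sketch follows, using three `TOTAL_TAX` strings from the ad-hoc check above as sample data. The variable names are illustrative.

```python
from decimal import Decimal

tax_strings = ['8484.3', '8436.2', '8146.1']        # TOTAL_TAX values shown in the ad-hoc check

float_total = sum(float(t) for t in tax_strings)    # may carry binary rounding error
exact_total = sum(Decimal(t) for t in tax_strings)  # exact decimal arithmetic

print(float_total)
print(exact_total)
```

Since the parsing functions already return the amounts as strings, they can be passed to `Decimal` directly without going through `float` first.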


In [None]:
pdf[1]