## personal property tax roll

E. Quinn 6/8/2020

This notebook uses pdfminer to extract account, bill, and exemption information from the personal property tax roll PDF.

The documentation for pdfminer is at:

https://buildmedia.readthedocs.org/media/pdf/pdfminer-docs/latest/pdfminer-docs.pdf

## Import standard Python data science packages

In [None]:
import math
import re
import copy
import numpy as np
import scipy as sc
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import cloudpickle
%matplotlib inline

## Import pdfminer packages

In [None]:
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfpage import PDFTextExtractionNotAllowed
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfdevice import PDFDevice
from pdfminer.layout import LAParams
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LTTextBoxHorizontal

### Show the directory we are running in

In [None]:
!pwd

### Read town and fire district tax rates

In [None]:
has_fd={1994:'no',1995:'no',1996:'no',1997:'no',1998:'no',1999:'no',2000:'no',\
      2001:'no',2002:'no',2003:'no',2004:'no',2005:'no',2006:'no',2007:'no',\
      2008:'yes',2009:'yes',2010:'yes',2011:'no',2012:'yes',2013:'yes',2014:'yes',\
      2015:'yes',2016:'yes',2017:'yes',2018:'yes',2019:'yes',2020:'yes'}

tr = pd.read_csv("../../egsc/Town_and_Fire_District_tax_rates_2020.csv")
print(tr.shape)
print(tr.columns)
trdd = tr.to_dict(orient='list')

trs={}

for i in np.arange(len(trdd['Fiscal Year'])):
    fy = trdd['Fiscal Year'][i]
    if fy not in trs.keys():
        trs[fy] = {}
        trs[fy]['total comm rate'] = trdd['Total Comm Rate'][i]
        trs[fy]['total res rate'] = trdd['Total Residential Rate'][i]
        trs[fy]['comm rate'] = trdd['Comm Tax Rate'][i]
        trs[fy]['res rate'] = trdd['Residential Tax Rate'][i]
        trs[fy]['fire rate'] = trdd[' Fire Tax Rate '][i]
        if fy in has_fd.keys():
            trs[fy]['has_fd'] = has_fd[fy]
        else:
            trs[fy]['has_fd'] = 'unk'

        
trs
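
The column header `' Fire Tax Rate '` is read with its surrounding spaces, so the lookup above depends on that exact spelling. One way to make this less brittle is to strip whitespace from every header before building the per-year dictionary. The following is only a sketch using the standard library: the helper name `load_rates` and the sample rows are hypothetical, and the values are made up rather than taken from the real rates file.

```python
import csv
import io

def load_rates(handle):
    """Build {fiscal year: {column: value}} with whitespace-stripped headers."""
    reader = csv.DictReader(handle)
    rates = {}
    for row in reader:
        clean = {k.strip(): v.strip() for k, v in row.items()}
        fy = int(clean.pop('Fiscal Year'))
        # keep only the first row seen for each fiscal year, as the cell above does
        if fy not in rates:
            rates[fy] = {k: float(v) for k, v in clean.items()}
    return rates

# Made-up sample data; not taken from the real rates file
sample = io.StringIO(
    "Fiscal Year,Comm Tax Rate, Fire Tax Rate \n"
    "2019,20.0,1.5\n"
    "2020,21.0,1.6\n"
)
rates = load_rates(sample)
print(rates[2020]['Fire Tax Rate'])
```

With pandas, `tr.columns = tr.columns.str.strip()` right after `read_csv` would have a similar effect, at the cost of changing the key used for the fire rate.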

## Read the pdf and create a dictionary with the contents of each text box

### Function read_pdf() reads a PDF and returns a dictionary containing the contents

Strategy for this document:  

Save information from each element in the LTTextBox objects in a dictionary including:

- x0 horizontal coordinate of the left edge of the text box
- x1 horizontal coordinate of the right edge of the text box
- y0 vertical coordinate of the bottom edge of the text box (pdfminer measures from the bottom-left corner of the page)
- y1 vertical coordinate of the top edge of the text box
- page number 
- sequence number of text box within this page
- text contained in the text box, converted to ascii

Parsing the text is complicated by the fact that a text box may span multiple columns and/or rows, and the text box groupings vary quite a bit depending on the page contents and layout.

However, with a bit of luck the structure of the document will allow the contents to be deciphered with the following heuristics:

- Text boxes containing left justified columns will tend to have nearly the same x0 coordinates
- Text boxes containing right justified columns will tend to have nearly the same x1 coordinates
- The codes for fund, account code, and object code are numeric and have fixed lengths
- Extraneous information is often preceded or followed by a series of underscore and newline characters
- Last name can be distinguished because it is the only field that is all characters followed by a comma
- Last name may be preceded by between one and three numerical fields:  fund, account, object.  If it is, the x0 value is shifted to the left.
    - Three numerical fields precede the name:  assume they are fund, account, object
    - Two numerical fields precede the name: assume they are account, object
    - One numerical field precedes the name: assume it is object
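
The first heuristic (left-justified columns share nearly the same x0) can be sketched as a simple one-dimensional grouping. The function name `group_by_x0`, the tolerance of 2 points, and the sample boxes are assumptions for illustration; they are not part of the notebook's parsing code.

```python
def group_by_x0(boxes, tol=2.0):
    """Group text boxes whose x0 lies within `tol` of the first box in the group."""
    groups = []
    for box in sorted(boxes, key=lambda b: b['x0']):
        if groups and abs(box['x0'] - groups[-1][0]['x0']) <= tol:
            groups[-1].append(box)      # close enough: same column
        else:
            groups.append([box])        # otherwise start a new column
    return groups

# Made-up boxes with the same keys that read_pdf() stores
boxes = [
    {'x0': 36.0, 'text': 'a'},
    {'x0': 36.4, 'text': 'b'},
    {'x0': 210.2, 'text': 'c'},
    {'x0': 209.9, 'text': 'd'},
]
columns = group_by_x0(boxes)
print([[b['text'] for b in col] for col in columns])
```

The same idea would apply to right-justified columns by grouping on `x1` instead of `x0`.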
    

In [None]:
def read_pdf(path):
    rsrcmgr = PDFResourceManager()                                  #create a resource manager
    laparams = LAParams()                                           #set the parameters for layout analysis
    device = PDFPageAggregator(rsrcmgr, laparams=laparams)          #create a PDF page aggregator object
    interpreter = PDFPageInterpreter(rsrcmgr, device)               #create an interpreter that renders pages to the device
    
    pdf={}                                                          #dictionary to hold the results

    with open(path, 'rb') as document:                              #open the pdf; the file is closed on exit
        for pageno, page in enumerate(PDFPage.get_pages(document)): #loop through the pdf page by page, numbering from 0
            pdf[pageno] = {}                                        #dictionary for this page
            interpreter.process_page(page)                          #lay out the page
            layout = device.get_result()                            #receive the LTPage object for the page
            tbox_no=0                                               #index for text box number
            for element in layout:
                if isinstance(element, LTTextBoxHorizontal):        #only look at horizontal text boxes
                    tbox_no += 1                                    #increment text box number
                    txt = element.get_text().encode('ascii', 'ignore').decode('ascii')  #drop non-ascii characters
                    pdf[pageno][tbox_no] = {                        #dictionary for this text box
                        'x0': round(element.x0,2),                  #left edge
                        'x1': round(element.x1,2),                  #right edge
                        'y0': round(element.y0,2),                  #bottom edge
                        'y1': round(element.y1,2),                  #top edge
                        'text': txt}                                #text contents
    return(pdf)
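
Because PDF coordinates start at the bottom-left of the page, a larger `y1` means a text box sits higher on the page. A minimal sketch of putting one page of the returned dictionary into reading order follows. The `page` dictionary is a made-up stand-in with the same shape as `pdf[pageno]`, and `reading_order` is a hypothetical helper, not something the notebook defines.

```python
def reading_order(page):
    """Return text-box keys sorted top-to-bottom, then left-to-right."""
    return sorted(page, key=lambda k: (-page[k]['y1'], page[k]['x0']))

# Made-up page in the shape produced by read_pdf()
page = {
    1: {'x0': 300.0, 'x1': 380.0, 'y0': 690.0, 'y1': 700.0, 'text': 'right\n'},
    2: {'x0': 36.0, 'x1': 120.0, 'y0': 690.0, 'y1': 700.0, 'text': 'left\n'},
    3: {'x0': 36.0, 'x1': 120.0, 'y0': 740.0, 'y1': 750.0, 'text': 'top\n'},
}
order = reading_order(page)
print([page[k]['text'].strip() for k in order])
```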

### Parse the tax roll document

In [None]:
def find_plat(tx):
    plat_index=-1
    words = tx.split()
    for word in words:
        if (len(word)==16):
            if ((word[0:2].isdigit()) & (word[3]=='-') & \
                (word[4:6].isdigit()) & (word[7]=='-') & \
                (word[8:10].isdigit()) & (word[11]=='-') & (word[12:15].isdigit())):
                    plat_index = tx.find(word)
    return(plat_index)

def get_float(st):                                  #convert a string such as '1,234.56' to a float
    st2 = st.replace(',','')                        #remove thousands separators
    try:
        nv = float(st2)
    except ValueError:
        nv = np.nan                                 #not numeric
    return(nv)

def get_last3_numeric(txt):                         #function returns the last 3 words on a line as numbers, last word first
    last3 = []                                      #initialize list for last 3 numbers
    words = txt.split()                             #split the text line into words
    for i in np.arange(1,4):                        #for i=1,2,3
        last3.append(np.nan)                        #append np.nan as default
        if (len(words) >= i):                       #replace it with a floating point number if the word is numeric
            word = words[len(words)-i]              #i-th word from the end
            last3[i-1] = abs(get_float(word))       #get_float() strips commas and returns np.nan if not numeric
    return(last3)
        
class account():                                        #account class
    def __init__(self,acct_no):                         #account class constructor
        self.acct_no = acct_no                          #  acct_no
        self.bills = {}                                 #  bills dictionary
        self.totals_text = None                         #  totals line
        self.totals_valuation = np.nan                  #  valuation from totals line
        self.totals_tax = np.nan                        #  tax from totals line
        self.exemptions = {}                            #  exemptions dictionary
        self.effective_tax_rate = np.nan                #  effective tax rate
        return
    
    def get_acct_no(self):                              #return acct_no
        return(self.acct_no)
    
    def get_bills(self):                                #return bills dictionary
        return(self.bills)
    
    def get_exemptions(self):                           #return exemptions dictionary
        return(self.exemptions)
    
    def get_bill_count(self):                           #return bill count
        return(len(self.bills))
    
    def set_totals(self,text):                          #set totals text and values
        self.totals_text = text                         #save the totals line
        substr = text[text.find('TOTALS') + 6:]         #get text after 'TOTALS'
        words = substr.split()                          #split into words
        first = words[0] if words else ''               #first word: valuation
        last = words[-1] if words else ''               #last word: tax
        if (first.replace(',','').isdigit()):           #valuation is a whole number
            self.totals_valuation = get_float(first)
        if ('.' in last):                               #tax contains a decimal point
            self.totals_tax = get_float(last)
            
    def set_effective_tax_rate(self,etr):               #set effective tax rate for acct
        self.effective_tax_rate = etr
        
    def get_effective_tax_rate(self):                   #get effective tax rate for acct
        return(self.effective_tax_rate)
        
    def get_totals_text(self):                          #return totals text
        return(self.totals_text)
    
    def get_totals_valuation(self):                     #return valuation from totals
        return(self.totals_valuation)
    
    def get_totals_tax(self):                           #return tax from totals
        return(self.totals_tax)
    
    def get_bill(self,i):                               #return a specific bill
        try:
            return(self.bills[i])
        except KeyError:
            return(None)
    
    def add_exemption(self,exemp):                      #add exemption to dictionary
        xmp = exemp[3:]                                 #strip off ' EX' from the beginning
        words = xmp.split()                             #split into words
        last = words[-1] if words else ''               #last word is the exemption amount
        xm = xmp[:xmp.rfind(last)].strip()              #exemption type: text before the amount
        examt = abs(get_float(last))                    #get amt from last word
        if (xm not in self.exemptions.keys()):
            self.exemptions[xm] = examt                 #set type as key for amt
        else:
            self.exemptions[xm] += examt                #if duplicate, just add in exemption
        return
    
    def add_bill(self,bil,trs,fy):                      #add bill to dictionary
        ix = 1+len(self.bills)                          #increment bill index
        self.bills[ix] = bill(bil,trs,fy)               #add bill object to dictionary
        return
    
class bill():                                           #bill class
    def __init__(self,text,trs,fy):                     #constructor     
        self.text = text                                #set text
        self.platt = ''                                 #set platt missing
        self.tax = np.nan                               #set tax missing
        self.valuation = np.nan                         #set valuation missing
        self.re_class = ''                              #set real estate class missing
        self.bill_no = ''                               #set bill number missing
        self.address = ''                               #set address missing
        self.state_code = ''                            #set state code missing
        self.sc_desc = ''                               #set state code description missing

        rates = trs.get(fy, {})                         #tax rates for this fiscal year, if known
        self.comm_rate = rates.get('comm rate', np.nan) #commercial rate excluding fire district
        self.res_rate = rates.get('res rate', np.nan)   #residential rate excluding fire district
        self.fire_rate = rates.get('fire rate', np.nan) #fire district rate
        self.has_fd = rates.get('has_fd', 'unk')        #fire district indicator

        reix = text.find(' PP 00')+6                    #look for the ' PP 00' substring 
        substr = text[reix:]                            #text after ' PP 00'
        words = substr.split()                          #split into words

        if ((len(words) > 2) and (words[0] == 'TANG') and (words[1]=='>')):
            self.sc_desc = 'TANG >'                     #two-word description
            self.bill_no = words[2]
            start_word = 3
        elif (words[0] in ('TANGIBLE','LREM','UTY/RR')):
            self.sc_desc=words[0]                       #one-word description
            self.bill_no=words[1]
            start_word = 2
        else: 
            print('word[0] not found: ',words[0])
            start_word = 0
            
        self.tax = float(words[-1].replace(',',''))         #tax is the last word
        self.valuation = float(words[-2].replace(',',''))   #valuation is the second-to-last word
        self.address = ' '.join(words[start_word:-2])       #address is everything in between
            
        return
    
    def get_text(self):                                 #return text
        return(self.text)
    
    def get_class(self):
        return(self.re_class)
    
    def get_comm_rate(self):
        '''Returns commercial rate excluding fire district from history'''
        return(self.comm_rate)
    
    def get_fire_rate(self):
        '''Returns fire district rate from history'''
        return(self.fire_rate)

    def get_res_rate(self):
        '''Returns residential rate excluding fire district from history'''
        return(self.res_rate)

    def get_has_fd(self):
        '''Returns has fd indicator'''
        return(self.has_fd)
    
    def get_bill_no(self):
        return(self.bill_no)

    def get_state_code(self):
        return(self.state_code)
    
    def get_state_code_description(self):
        return(self.sc_desc)

    def get_address(self):
        return(self.address)
    
    def get_tax(self):
        return(self.tax)
    
    def get_valuation(self):
        return(self.valuation)
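
To illustrate the splitting done in the `bill` constructor: everything after the `' PP 00'` marker is split into words, the leading word or words give the description and bill number, the last two words give the valuation and tax, and whatever remains in between is treated as the address. The sketch below is a simplified standalone version that omits the check against the known descriptions. The function name `parse_bill_line` is hypothetical and the sample line is invented to match the assumed layout; it is not copied from the tax roll.

```python
def parse_bill_line(text):
    """Split a bill line into description, bill number, address, valuation, tax."""
    words = text[text.find(' PP 00') + 6:].split()
    if words[:2] == ['TANG', '>']:
        desc, bill_no, start = 'TANG >', words[2], 3    # two-word description
    else:
        desc, bill_no, start = words[0], words[1], 2    # one-word description
    return {
        'desc': desc,
        'bill_no': bill_no,
        'address': ' '.join(words[start:-2]),
        'valuation': float(words[-2].replace(',', '')),
        'tax': float(words[-1].replace(',', '')),
    }

# Invented line in the assumed layout; not copied from the tax roll
line = '123456789 PP 00 TANGIBLE 20-0001 12 MAIN ST 10,000 250.50'
parsed = parse_bill_line(line)
print(parsed)
```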

### Function read_taxroll() parses the extracted text boxes into account objects

In [None]:
def read_taxroll(tr,trs,fy):
    
    accts = {}

    for k in tr.keys():                                             #loop through pages
        for key in tr[k].keys():                                    #loop through text boxes in page
            if ('text' in tr[k][key].keys()):                       #look at 'text' elements
                text = tr[k][key]['text']                           #extract text
                lines = text.split('\n')                            #split into lines
                for line in lines:                                  #loop through lines
                    if (len(line) > 10):                            #only look at lines longer than 10 chars
                        if (line[0:9].isdigit()):                   #check for account number
                            acct_no = line[0:9]                     #if first 9 chars are digits
                            if (acct_no not in accts.keys()):       #check if it's already in keys
                                accts[acct_no] = account(acct_no)   #if not, add account object
                        if ('PP 00' in line):                       #check for PP 00 string
                            accts[acct_no].add_bill(line,trs,fy)    #if present, add to bills
                        elif ('TOTALS' in line):                    #check for TOTALS line
                            line2 = line[line.find('TOTALS'):]      #if so, add TOTALS to account object
                            accts[acct_no].set_totals(line2)        #using set_totals() function
                        elif (' EX ' in line):                      #check for exemption
                            line2 = line[line.find(' EX '):]        #if present
                            accts[acct_no].add_exemption(line2)     #add exemption
    return(accts)
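
The dispatch inside `read_taxroll()` comes down to classifying each line by a few markers. A standalone sketch of that classification is shown here; it leaves out the length check and the account bookkeeping. The name `classify_line` and the sample lines are hypothetical and only approximate the layout the parser assumes.

```python
def classify_line(line):
    """Return (leading account number or None, kind of line)."""
    acct_no = line[:9] if line[:9].isdigit() else None
    if 'PP 00' in line:
        kind = 'bill'
    elif 'TOTALS' in line:
        kind = 'totals'
    elif ' EX ' in line:
        kind = 'exemption'
    else:
        kind = 'other'
    return acct_no, kind

# Invented lines; not copied from the tax roll
samples = [
    '123456789 PP 00 TANGIBLE 20-0001 12 MAIN ST 10,000 250.50',
    '          TOTALS 10,000 250.50',
    '           EX SAMPLE EXEMPTION 1,000',
]
results = [classify_line(s) for s in samples]
print(results)
```

Note that the markers are tested in a fixed order, so a line containing both `PP 00` and `TOTALS` would be treated as a bill.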

### Read the tax roll documents and save the decoded contents

In [None]:
taxrolls = {}

pdfs = {}

fyears = np.arange(2020,2021)
for fy in fyears:
    
    fn = '../../../pp_tax_rolls/PPTaxRollFINAL_201908081249441367.pdf'
    print(fn)
    pdfs[fy] = read_pdf(fn)
    taxrolls[fy] = read_taxroll(pdfs[fy],trs,fy)

In [None]:
taxrolls[2020] = read_taxroll(pdfs[2020],trs,2020)

### Check that total is correct:  Tax roll total tax is \$1,886,267.96

In [None]:
ttax = 0.0

for acct in taxrolls[2020].keys():
    a = taxrolls[2020][acct]
    ttax += a.get_totals_tax()
round(ttax,2)
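
A bare running sum becomes NaN as soon as one account has no parsed total, which hides the comparison with the published figure. A sketch of a more forgiving check is given below: it skips and counts missing totals, then compares the sum with the expected value within half a cent. The helper `reconcile` and the sample totals are invented for illustration. In the notebook the inputs would presumably be the values of `a.get_totals_tax()` and the expected value \$1,886,267.96 from the heading above.

```python
import math

def reconcile(totals, expected, tol=0.005):
    """Sum the non-missing totals and report whether they match `expected`."""
    present = [t for t in totals if not math.isnan(t)]
    missing = len(totals) - len(present)
    total = round(math.fsum(present), 2)
    return total, missing, math.isclose(total, expected, abs_tol=tol)

# Invented per-account totals, one of them missing
totals = [250.50, 1200.25, float('nan'), 49.25]
total, missing, ok = reconcile(totals, expected=1500.00)
print(total, missing, ok)
```

A nonzero `missing` count points at accounts whose TOTALS line was not recognized, which is usually the first place to look when the sum disagrees.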

In [None]:
import csv

with open('../personal_property_2020.csv','w',newline='') as file:     #file is closed on exit
    writer = csv.writer(file, lineterminator='\n')                      #csv.writer quotes fields as needed
    writer.writerow(['Account','Bill_no','Address','Valuation','Tax','Exemption'])
    for acct in taxrolls[2020].keys():
        a = taxrolls[2020][acct]
        e = a.get_exemptions()
        exemp = str(sum(e.values())) if e else ''       #total exemption amount, blank if none
        b = a.get_bill(1)                               #first bill on the account, None if there is none
        bill_no = b.get_bill_no() if b else ''
        address = b.get_address() if b else ''
        writer.writerow([acct,
            bill_no,
            address,
            round(a.get_totals_valuation(),0),
            round(a.get_totals_tax(),2),
            exemp])