# Board of Estimates Tabulator  
This tool reads the PDF files that store the minutes of Baltimore's Board of Estimates and builds a small database of linked tables from them. Candidate entities include:

- meetings 
- agreements
- contracts
- contractors 
- personnel
- reclassifications

## Setup
### Import packages

In [1]:
from datetime import datetime
import pandas as pd
import requests
from bs4 import BeautifulSoup
import re
from pathlib import Path
import time 

from utils import *

Improvements needed for function `get_boe_pdfs`:

- Report errors more accurately 
- Get current year dynamically

### Try out a key bit of syntax  
We will need to convert month names to numbers, and `time.strptime` with the `%B` format does this directly.

In [2]:
time.strptime("November", "%B").tm_mon

11
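
The same call can be wrapped so that abbreviated month names also resolve. This is a sketch only: `month_to_number` is a hypothetical helper, and it assumes an English locale, since `%B` and `%b` are locale-dependent.

```python
import time


def month_to_number(name):
    """Convert a full or abbreviated English month name to 1-12."""
    cleaned = name.strip().rstrip(".")
    for fmt in ("%B", "%b"):  # full name first, then three-letter abbreviation
        try:
            return time.strptime(cleaned, fmt).tm_mon
        except ValueError:
            continue
    raise ValueError(f"Unrecognized month name: {name!r}")


print(month_to_number("November"))
print(month_to_number("Nov."))
```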

### Store PDFs to local directory  
The following code downloads every .pdf file of Board of Estimates minutes and saves it in your local copy of this repository.

The tricky part is getting a correct date for every file. Some files have a typo somewhere in the HTML, so the functions `store_boe_pdfs()` and `parse_long_dates()` are built to handle these errors. New kinds of errors may still appear in the future.
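
To illustrate the kind of tolerance this requires, here is a minimal sketch of typo-tolerant date parsing. It is not the implementation of `parse_long_dates()`; `parse_long_date_sketch` is a hypothetical name, the fuzzy-matching cutoff of 0.7 is an arbitrary assumption, and the month list assumes an English locale.

```python
import calendar
import difflib
import re
from datetime import date

MONTHS = [m for m in calendar.month_name if m]  # "January" .. "December"


def parse_long_date_sketch(text):
    """Parse strings like 'November 4, 2009', tolerating small typos."""
    # Month word, optional period, day, optional comma, four-digit year
    match = re.search(r"([A-Za-z]+)\.?\s*(\d{1,2})\s*,?\s*(\d{4})", text)
    if match is None:
        return None
    word, day, year = match.groups()
    # Pick the closest month name; misspellings such as "Novmber" still resolve
    close = difflib.get_close_matches(word.capitalize(), MONTHS, n=1, cutoff=0.7)
    if not close:
        return None
    try:
        return date(int(year), MONTHS.index(close[0]) + 1, int(day))
    except ValueError:  # impossible calendar date, e.g. day 31 in a 30-day month
        return None


print(parse_long_date_sketch("Novmber 4, 2009"))
```

Returning `None` rather than raising lets the caller log the offending string and continue with the remaining files.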

In [3]:
base_url = "https://comptroller.baltimorecity.gov/"
minutes_url = base_url + "boe/meetings/minutes"

store_boe_pdfs(base_url, minutes_url)

Saving files from url: https://comptroller.baltimorecity.gov//minutes-2009
Saving files from url: https://comptroller.baltimorecity.gov//minutes-2010
Saving files from url: https://comptroller.baltimorecity.gov//minutes-2011
Saving files from url: https://comptroller.baltimorecity.gov//minutes-2012
Saving files from url: https://comptroller.baltimorecity.gov//minutes-2013
Saving files from url: https://comptroller.baltimorecity.gov//minutes-2014
Saving files from url: https://comptroller.baltimorecity.gov//minutes-2015
Saving files from url: https://comptroller.baltimorecity.gov//minutes-2016-0
Saving files from url: https://comptroller.baltimorecity.gov//boe/meetings/minutes
Saving files from url: https://comptroller.baltimorecity.gov//minutes-2018
Saving files from url: https://comptroller.baltimorecity.gov//2019
Saving files from url: https://comptroller.baltimorecity.gov//minutes-2020
Wrote 512 .pdf files to local repo.


## Process and prepare data
### Create table with full texts  
Note that for testing purposes we're processing the data from the year 2009 only.

In [4]:
root = Path.cwd()
pdf_dir = root / "pdf_files" / "2009"

text_df = store_pdf_text_to_df(pdf_dir)



Wrote 44 rows to the table of minutes.


In [5]:
text_df.sample(6, random_state=444)

| | date | page_number | minutes |
|---|---|---|---|
| 4 | 2009-05-27 | 1796 | 1796 BOARD OF ESTIMATES ... |
| 32 | 2009-09-02 | 3299 | 3299 BOARD OF ESTIMATES ... |
| 21 | 2009-06-17 | 2159 | 2159 BOARD OF ESTIMATES ... |
| 28 | 2009-12-16 | 4661 | 4661 BOARD OF ESTIMATES ... |
| 5 | 2009-07-22 | 2683 | 2683 BOARD OF ESTIMATES ... |
| 42 | 2009-02-25 | 593 | 593 BOARD OF ESTIMATES February 25, 2009 MIN... |


### Replace erroneous characters and consolidate white spaces
For now, every run of whitespace is collapsed to a single space. This may not hold up in the long term, since it also discards line breaks.

In [6]:
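def replace_chars(val):
    # collapse every run of whitespace (including newlines) to a single space
    val = ' '.join(val.split())
    # the extracted text contains '™' where an apostrophe is expected
    val = val.replace('™', "'")
    # ... and 'Œ' where a dash is expected
    val = val.replace('Œ', "-")
    return val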

text_df['minutes_fixed'] = text_df['minutes'].apply(replace_chars)
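
To see what the whitespace consolidation keeps and what it discards, the sketch below runs the same three steps on a made-up string and compares the result with a hypothetical alternative, `replace_chars_keep_lines`, that preserves line breaks. The sample text is invented for illustration and does not come from the minutes.

```python
def replace_chars(val):
    # same three steps as in the cell above
    val = ' '.join(val.split())
    val = val.replace('™', "'")
    val = val.replace('Œ', "-")
    return val


def replace_chars_keep_lines(val):
    """Hypothetical variant: collapse spaces within each line, keep line breaks."""
    lines = (' '.join(line.split()) for line in val.splitlines())
    cleaned = '\n'.join(line for line in lines if line)  # drop empty lines
    return cleaned.replace('™', "'").replace('Œ', "-")


sample = "Mayor™s  Office Œ\n\n  Agreements"  # made-up text
print(repr(replace_chars(sample)))             # everything on one line
print(repr(replace_chars_keep_lines(sample)))  # line break preserved
```

Keeping line breaks could help later if section headings in the minutes need to be anchored to the start of a line, at the cost of slightly messier text.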

In [12]:
# view a sample of the transformed text
print(text_df['minutes_fixed'][0][0:500])

1721 BOARD OF ESTIMATES May 20, 2009 MINUTES REGULAR MEETING Stephanie Rawlings-Blake, President Sheila Dixon, Mayor - ABSENT Joan M. Pratt, Comptroller and Secretary George A. Nilson, City Solicitor Donald Huskey, Deputy City Solicitor David E. Scott, Director of Public Works Ben Meli, Deputy Director of Public Works Bernice H. Taylor, Deputy Comptroller, and Clerk The meeting was called to order by the President. Pursuant to Article VI, Section 1(c) of the revised City Charter effective July 1


## Tabulate data
### Create empty dataframes

In [8]:
def create_dataframes():
    # one row per meeting
    meetings_df = pd.DataFrame(
        columns=["date", "president", "mayor", "no_of_protests", "no_of_settlements"]
    )
    # one row per agreement
    agreements_df = pd.DataFrame(
        columns=["date", "department", "contractor", "account_number", "agreement"]
    )

    return meetings_df, agreements_df

# unpack in the same order the function returns them
meetings_df, agreements_df = create_dataframes()

In [13]:
account_lookup = r"\d{4}-\d{6}-\d{4}-\d{6}-\d{6}"
#department_lookup = r"^.+?(?=–|-.*Agreements)"
#department_lookup = r"(?<=MINUTES).+?(?=(\s–|-\s).*)"
department_lookup = r"(?<=MINUTES).+(?=(\s-\s))"

#^.+?(?=(\s–|-\s).*Agreements)


sample_minutes = text_df['minutes_fixed'][1]
print(re.fullmatch(re.compile(department_lookup), sample_minutes))

None
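
The `None` here is expected regardless of how good the pattern is: `re.fullmatch` succeeds only when the pattern consumes the entire string, whereas `re.search` looks for the pattern anywhere inside it. A small sketch on an invented string (the `sample` text is an assumption for illustration, not taken from the minutes):

```python
import re

department_lookup = r"(?<=MINUTES).+(?=(\s-\s))"
sample = "100 BOARD OF ESTIMATES MINUTES Department of Examples - Agreements"  # invented

# fullmatch must start at position 0, where the lookbehind for "MINUTES" cannot hold
print(re.fullmatch(department_lookup, sample))

# search scans the string and finds the text between "MINUTES" and " - "
found = re.search(department_lookup, sample)
print(found.group(0).strip() if found else None)
```

Note that `.+` is greedy, so when a string contains several ` - ` separators the match runs up to the last one; `.+?` would stop at the first.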


In [10]:
account_matches

NameError: name 'account_matches' is not defined
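
The `NameError` means that no earlier cell assigned `account_matches`. One plausible definition, offered as an assumption rather than the notebook's intended code, applies `account_lookup` with `re.findall`. The sample string and account number below are invented.

```python
import re

account_lookup = r"\d{4}-\d{6}-\d{4}-\d{6}-\d{6}"
sample = "Account: 1001-000000-1234-567890-603026 and 2 Agreements"  # invented

# findall returns every non-overlapping match as a list of strings
account_matches = re.findall(account_lookup, sample)
print(account_matches)
```

In the notebook, the equivalent call would be `re.findall(account_lookup, sample_minutes)`.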