# Board of Estimates Tabulator  
This tool reads the pdf files that store the minutes of Baltimore's Board of Estimates (BoE) and builds a small database of linked tables. Candidate entities include:

- __meetings__
    - one entity per BoE meeting
    - primary key is the date
- __agreements__
    - primary key is BAN plus the partner organization's name
    - features include: date, dollar amount, BAN, description
- __prequalifications__
- __contractors__ 
- __personnel__
- __reclassifications__

## Setup
### Import packages  
**Note:** We may want to break the visualization code into a separate notebook.

In [1]:
from datetime import datetime
import pandas as pd
import requests
from bs4 import BeautifulSoup
import re
from pathlib import Path
import time 
import seaborn as sns
import numpy as np
from matplotlib import pyplot as plt

from utils import *
from bike_rack.parse_utils import parse_pdf

from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters() 

### Define directories and urls
We'll also create the `pdf_files` directory if it doesn't exist already.

In [2]:
base_url = "https://comptroller.baltimorecity.gov/"
minutes_url = base_url + "boe/meetings/minutes"

root = Path.cwd()
pdf_dir = root / "pdf_files"

try:
    pdf_dir.mkdir(parents=True, exist_ok=False)
except FileExistsError:
    print("The pdf directory already exists.")
else:
    print("The pdf directory has been created.")

The pdf directory already exists.


### Store PDFs to local directory  
The following code downloads all the .pdf files with minutes from the Board of Estimates and saves them in your local version of this repository. 

The code will skip this time-consuming step if it detects files within the `pdf_files` directory.

The tricky part is getting a correct date for every file. Some files have a typo somewhere in the HTML, so the functions `store_boe_pdfs()` and `parse_long_dates()` are built to handle these errors. New types of errors may still appear in the future, however.
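
As an illustration of the kind of tolerance described above, here is a minimal sketch of parsing a long-form date whose month name is misspelled. It is not the code in `utils`; the name `parse_long_date_sketch`, the regex, and the `cutoff=0.6` similarity threshold are all assumptions made for this example.

```python
import re
from datetime import datetime
from difflib import get_close_matches

MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]


def parse_long_date_sketch(text):
    """Hypothetical helper: find a 'Month DD, YYYY' date, tolerating a misspelled month."""
    match = re.search(r"([A-Za-z]+)\s+(\d{1,2}),?\s+(\d{4})", text)
    if match is None:
        return None
    month_word, day, year = match.groups()
    # Pick the closest real month name; cutoff=0.6 is an arbitrary choice
    closest = get_close_matches(month_word.capitalize(), MONTHS, n=1, cutoff=0.6)
    if not closest:
        return None
    try:
        return datetime(int(year), MONTHS.index(closest[0]) + 1, int(day))
    except ValueError:
        # The day does not exist in that month (e.g. day 31 in a 30-day month)
        return None


print(parse_long_date_sketch("Novmber 20, 2013"))
```

Fuzzy matching only covers misspelled month names; other kinds of typos (a wrong year, a missing day) would still need their own handling.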

In [3]:
# set to True if you'll be repeatedly running store_boe_pdfs()
testing_mode = False

# a Path object is always truthy, so check explicitly that the directory exists
if testing_mode and pdf_dir.exists():
    del_dir_contents(pdf_dir)
if is_empty(pdf_dir):
    store_boe_pdfs(base_url, minutes_url)
else: 
    print("Files already exist in the pdf directory.")

Files already exist in the pdf directory.
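
The `is_empty()` helper used in the cell above comes from `utils` and is not shown in this notebook. A check of that kind could be sketched as follows; `is_empty_sketch` is a hypothetical stand-in, not the actual implementation.

```python
from pathlib import Path


def is_empty_sketch(directory):
    """Hypothetical stand-in for utils.is_empty(): True if the directory has no entries."""
    # any() stops at the first entry, so large directories are not fully listed
    return not any(Path(directory).iterdir())
```

Note that this version counts subdirectories as content, so a `pdf_files` directory holding only empty year folders would be treated as non-empty.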


## Process and prepare sample data
### Parse only a few sample pdfs
While we're still in development mode, we can save time by parsing only a few pdfs instead of all 500+.

In [4]:
# specifies the paths for a couple sample pdfs
meeting1_path = Path("pdf_files/2013/2013_11_20.pdf")
meeting2_path = Path("pdf_files/2010/2010_03_17.pdf")

# uses parse_pdf() to instantiate the Minutes class for each meeting
meeting1 = parse_pdf(meeting1_path)
meeting2 = parse_pdf(meeting2_path)

print(meeting1.clean_text[:500])

4680 BOARD OF ESITMATES November 20, 2013 MINUTES REGULAR MEETING Honorable Bernard C. "Jack" Young, President Honorable Stephanie Rawlings-Blake, Mayor - ABSENT Harry Black, Director of Finance Honorable Joan M. Pratt, Comptroller and Secretary George A. Nilson, City Solicitor Alfred H. Foxx, Director of Public Works David E. Ralph, Deputy City Solicitor Rudolph S. Chow, Deputy Director of Public Works Bernice H. Taylor, Deputy Comptroller and Clerk Pursuant to Article VI, Section 1(c) of the r


## Process and prepare data
### Create a Pandas dataframe with full texts  
Now that we have the .pdf files we need, we're ready to read them and store their text in a Pandas dataframe. 

This should take about one minute for each year of data.

In [5]:
text_df_raw = store_pdf_text_to_df(pdf_dir)

An error of type EOF marker not found occurred reading file /Users/james/Documents/Bmore/repos/BOE_tabulator/pdf_files/2013/2013_06_19.pdf
An error of type EOF marker not found occurred reading file /Users/james/Documents/Bmore/repos/BOE_tabulator/pdf_files/2013/2013_06_26.pdf
An error of type EOF marker not found occurred reading file /Users/james/Documents/Bmore/repos/BOE_tabulator/pdf_files/2015/2015_10_28.pdf
An error of type EOF marker not found occurred reading file /Users/james/Documents/Bmore/repos/BOE_tabulator/pdf_files/2011/2011_04_27.pdf
Wrote 527 rows to the table of minutes.
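
The log above suggests that unreadable files are reported and skipped rather than stopping the run. The sketch below shows one way such a loop could be structured. It is an assumption about the approach, not the code behind `store_pdf_text_to_df()`: the name `collect_minutes_sketch` and the `read_text` callable (any function that takes a path and returns the pdf's text) are hypothetical.

```python
from datetime import datetime
from pathlib import Path


def collect_minutes_sketch(pdf_dir, read_text):
    """Hypothetical outline: one row per readable pdf, skipping files that fail to parse.

    read_text is any callable that takes a Path and returns the pdf's text.
    """
    rows = []
    for pdf_path in sorted(Path(pdf_dir).rglob("*.pdf")):
        try:
            minutes = read_text(pdf_path)
        except Exception as err:
            # Report the failure and keep going instead of aborting the whole run
            print(f"An error of type {err} occurred reading file {pdf_path}")
            continue
        # File names are assumed to follow the YYYY_MM_DD.pdf pattern seen above
        date = datetime.strptime(pdf_path.stem, "%Y_%m_%d").date()
        rows.append({"date": date, "minutes": minutes})
    return rows
```

Passing the resulting list to `pd.DataFrame(rows)` would turn it into a table with one row per set of minutes.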


### View a sample of the stored text

In [182]:
text_df_raw.sample(6, random_state=444)

Unnamed: 0,date,page_number,minutes
376,2020-10-14,3960,"3960\n \nBOARD OF ESTIMATES\n \n \nOCTOBER 14,..."
289,2010-07-14,2341,2341 BOARD OF ESTIMATES ...
315,2019-02-06,584,"584\n \nBOARD OF \nESTIMATES\n \nFEBRUARY 06, ..."
0,2013-11-20,4680,"4680 BOARD OF ESITMATES November 20, 2013 MI..."
285,2010-09-29,3357,3357 BOARD OF ESTIMATES ...
187,2009-08-19,3094,3094 BOARD OF ESTIMATES ...


### Replace erroneous characters and consolidate white spaces
Collapsing every run of white space into a single space may not work in the long term, because we may need runs of multiple spaces to detect certain fields.

Hopefully we won't need that. For now, every run of white space is consolidated into a single space.

In [183]:
def replace_chars(text):
    replacements = [
        # ("\n", ""),
        ("Œ", "-"),
        ("ﬁ", '"'),
        ("ﬂ", '"'),
        ("™", "'"),
        ("Ł", "•"),
        ("Š", "-"),
        ("€", " "),
        ("¬", "-"),
        ("–", "…"),
        ("Ž", "™"),
        ("˚", "fl"),
        ("˜", "fi"),
        ("˛", "ff"),
        ("˝", "ffi"),
        ("š", "—"),
        ("ü", "ti"),
        ("î", "í"),
        ("è", "c"),
        ("ë", "e"),
        ("Ð", "–"),
        ("Ò", '"'),
        ("Ó", '"'),
        ("Õ", "'"),
    ]
    for old, new in replacements:
        text = text.replace(old, new)
    return text


text_df = text_df_raw.copy()

text_df["text"] = text_df["minutes"].apply(replace_chars)
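
The cell above only swaps characters; the white-space consolidation itself is not visible in it. One way that step could be done is sketched here. `consolidate_whitespace_sketch` is a hypothetical helper, not a function from this repository.

```python
import re


def consolidate_whitespace_sketch(text):
    """Hypothetical helper: collapse each run of white space into one space."""
    # \s+ matches spaces, tabs, and newlines, so line breaks are lost as well
    return re.sub(r"\s+", " ", text).strip()
```

Because newlines are collapsed too, any later parsing step that relies on line breaks would need to run before this one.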

In [184]:
# view a sample of the transformed text
print(text_df['text'][0][0:500])

4680  BOARD OF ESITMATES  November 20, 2013 MINUTES  REGULAR MEETING  Honorable Bernard C. "Jack" Young, President Honorable Stephanie Rawlings-Blake, Mayor - ABSENT Harry Black, Director of Finance Honorable Joan M. Pratt, Comptroller and Secretary George A. Nilson, City Solicitor 
Alfred H. Foxx, Director of Public Works David E. Ralph, Deputy City Solicitor Rudolph S. Chow, Deputy Director of Public Works 
Bernice H. Taylor, Deputy Comptroller and Clerk 
  Pursuant to Article VI, Section 1(c)


In [185]:
def add_fiscal_year(df):
    """Adds calendar_year, month, and fiscal_year columns based on the date column."""
    df = df.copy()
    df["calendar_year"] = df["date"].dt.year
    df["month"] = df["date"].dt.month
    # meetings from July onward count toward the next fiscal year
    df["fiscal_year"] = np.where(
        df["month"] >= 7, df["calendar_year"] + 1, df["calendar_year"]
    )
    return df

text_df['date'] = pd.to_datetime(text_df['date'])
text_df = add_fiscal_year(text_df)

text_df.head(3)

Unnamed: 0,date,page_number,minutes,text,calendar_year,month,fiscal_year
0,2013-11-20,4680,"4680 BOARD OF ESITMATES November 20, 2013 MI...","4680 BOARD OF ESITMATES November 20, 2013 MI...",2013,11,2014
1,2013-03-20,837,"837 BOARD OF ESTIMATES March 20, 2013 MINUTES...","837 BOARD OF ESTIMATES March 20, 2013 MINUTES...",2013,3,2013
2,2013-01-30,280,"280 BOARD OF ESTIMATES JANUARY 30, 2013 MINUT...","280 BOARD OF ESTIMATES JANUARY 30, 2013 MINUT...",2013,1,2013
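
The rule that `add_fiscal_year()` applies can be checked on single dates. This is a minimal sketch using plain `datetime.date` objects; `fiscal_year_sketch` is a hypothetical name, and the July cutoff is taken from the `month >= 7` condition in the cell above.

```python
from datetime import date


def fiscal_year_sketch(meeting_date):
    """Return the fiscal year for a date, assuming fiscal years start on July 1."""
    # July through December belong to the fiscal year named after the next calendar year
    if meeting_date.month >= 7:
        return meeting_date.year + 1
    return meeting_date.year


print(fiscal_year_sketch(date(2013, 11, 20)))  # November meeting
print(fiscal_year_sketch(date(2013, 3, 20)))   # March meeting
```

These two dates correspond to the first two rows of the table above, which show fiscal years 2014 and 2013 respectively.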


### Store the date of meeting as the index

In [187]:
text_df = text_df.set_index('date', drop=False)
text_df['word_count'] = text_df['text'].apply(lambda x: len(x.split()))

text_df.sample(3)

date (index),date,page_number,minutes,text,calendar_year,month,fiscal_year,word_count
2011-03-16,2011-03-16,741,741 BOARD OF ESTIMATES ...,741 BOARD OF ESTIMATES ...,2011,3,2011,13379
2017-02-08,2017-02-08,383,"383\n \nBOARD OF ESTIMATES\n \nFEBRUARY 08, 20...","383\n \nBOARD OF ESTIMATES\n \nFEBRUARY 08, 20...",2017,2,2017,13468
2010-06-09,2010-06-09,1767,1767 BOARD OF ESTIMATES ...,1767 BOARD OF ESTIMATES ...,2010,6,2010,23276


## Test consistency of the pdfs
Let's see how consistent our pdf documents are. Our first test will be to see if the document contains the string "REGULAR MEETING", which in most cases comprises the first words of the document below the header. 

In [113]:
test_df = text_df.copy()

def test_a(row):
    reg_meeting_regex = r"\bREGULAR\sMEETING\b"
    reg_match = re.search(reg_meeting_regex, row["text"])
    special_meeting_regex = r"\bSPECIAL\sMEETING\b"
    special_match = re.search(special_meeting_regex, row["text"])
    row["has_regular_meeting"] = reg_match is not None
    row["has_special_meeting"] = special_match is not None
    return row


match_df = test_df.apply(test_a, axis=1)
print(f"Exactly {sum(match_df['has_regular_meeting'])} pdf files contain the string 'REGULAR MEETING', \nand exactly {sum(match_df['has_special_meeting'])} contain the string 'SPECIAL MEETING'")

Exactly 508 pdf files contain the string 'REGULAR MEETING', 
and exactly 14 contain the string 'SPECIAL MEETING'


We see that most, but not all, of the pdfs contain the string "REGULAR MEETING". The 20 files that fail this test include 11 files that were special meetings. The remainder include some malformed pdfs as well as a few pdfs that appear to simply be missing the heading.
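
When checking the files that fail, it may be worth looking at the spacing inside the heading. The pattern in `test_a()` uses a single `\s`, which accepts one space or one newline between the two words but not two. The sketch below compares it with a more lenient `\s+` variant; the lenient pattern and the sample strings are assumptions made for illustration, not text taken from the pdfs.

```python
import re

strict = re.compile(r"\bREGULAR\sMEETING\b")    # pattern used in test_a
lenient = re.compile(r"\bREGULAR\s+MEETING\b")  # assumption: allow any run of white space

samples = [
    "MINUTES REGULAR MEETING Honorable",
    "MINUTES\nREGULAR\nMEETING\nHonorable",
    "MINUTES REGULAR  MEETING Honorable",  # two spaces between the words
]

for sample in samples:
    print(repr(sample), bool(strict.search(sample)), bool(lenient.search(sample)))
```

Whether any of the failing files are actually affected by spacing would need to be confirmed against the data.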

### Separate out the regular meetings

In [110]:
cond_reg = match_df["has_regular_meeting"] == 1
cond_spec = match_df["has_special_meeting"] == 1
reg_meetings = match_df[cond_reg & ~cond_spec]
print(len(reg_meetings))

507


In [112]:
reg_meetings[['date', 'text']].sample(3)

date (index),date,text
2009-09-02,2009-09-02,3299 BOARD OF ESTIMATES ...
2015-11-18,2015-11-18,"4161 BOARD OF ESTIMATES NOVEMBER 18, 2015 MINU..."
2015-04-22,2015-04-22,"1216 BOARD OF ESTIMATES APRIL 22, 2015 MINUTE..."


### Test the sequencing of the sections "BOARDS AND COMMISSIONS" and "TRANSFERS OF FUNDS"
Based on looking at the .pdfs, I expect to see the sections in this order:

1. REGULAR MEETING
2. BOARDS AND COMMISSIONS
3. TRANSFERS OF FUNDS

Let's go ahead and test how confident we can be in the assumption that the sections exist and that they come in that order.

In [None]:
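def test_sections_exist(row):
    # \s+ (not \w+) matches the white space between the two words
    regular_meeting = r"\bREGULAR\s+MEETING\b"
    return re.search(regular_meeting, row["text"]) is not None

reg_meetings.apply(test_sections_exist, axis=1)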

In [161]:
def test_sequence(row):
    """
    Takes a row of the dataframe as an input and returns an augmented row with information
    about the order in which we see the section headings.
    """
    boards_first = r"(\bBOARD\w?\sAND\sCOMMISSION\w?\b).*(\bTRANSFER\w?\sOF\sFUND\w?\b)"
    # note: unlike boards_first, this pattern requires the plural "COMMISSIONS"
    transfers_first = r"(\bTRANSFER\w?\sOF\sFUND\w?\b).*(\bBOARD\w?\sAND\sCOMMISSIONS\b)"
    boards_first_match = re.search(boards_first, row["text"])
    transfers_first_match = re.search(transfers_first, row["text"])
    row["boards_before_transfers"] = boards_first_match is not None
    row["transfers_before_boards"] = transfers_first_match is not None
    return row

seq_df = reg_meetings.apply(test_sequence, axis=1)

It turns out that in the majority of cases we see the sequencing we expect. But in about 10% of the cases, the two sections are out of order. There are also 11 documents that meet neither condition; manual checks suggest that a typist simply forgot to name the section. Note that the two patterns are not mutually exclusive, so a document that repeats a heading can be counted under both.

In [165]:
cond_boards_first = seq_df["boards_before_transfers"] == 1
cond_transfers_first = seq_df["transfers_before_boards"] == 1
irregulars = seq_df[~cond_boards_first & ~cond_transfers_first]

print(f"We find {sum(seq_df['boards_before_transfers'])} .pdfs where BOARDS comes before TRANSFERS.")
print(f"We find {sum(seq_df['transfers_before_boards'])} .pdfs where TRANSFERS comes before BOARDS.")
print(f"We find {len(irregulars)} sets of minutes that meet neither condition.")

We find 490 .pdfs where BOARDS comes before TRANSFERS.
We find 46 .pdfs where TRANSFERS comes before BOARDS.
We find 11 sets of minutes that meet neither condition.
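
Because the two patterns in `test_sequence()` can both match the same document, and because `.` does not cross newlines unless `re.DOTALL` is passed, an alternative is to compare where each heading first appears. The sketch below puts every document into exactly one category. It is a hypothetical variant, not the method used above, and its patterns use `S?` where the originals use `\w?`.

```python
import re

BOARDS = re.compile(r"\bBOARDS?\sAND\sCOMMISSIONS?\b")
TRANSFERS = re.compile(r"\bTRANSFERS?\sOF\sFUNDS?\b")


def classify_order_sketch(text):
    """Hypothetical helper: compare where each heading first appears in the text."""
    boards = BOARDS.search(text)
    transfers = TRANSFERS.search(text)
    if boards is None or transfers is None:
        return "missing_heading"
    # Only the first occurrence of each heading is considered
    if boards.start() < transfers.start():
        return "boards_first"
    return "transfers_first"
```

It could be applied with `reg_meetings["text"].apply(classify_order_sketch)`, followed by `value_counts()` to get one count per category.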


In [176]:
def test_sequence_3(row):
    """
    Takes a row of the dataframe as an input and returns an augmented row that flags
    whether the text contains the "OPTIONS/CONDEMNATION/QUICK-TAKES:" heading.
    """
    seq_3 = r"(\bOPTIONS\/CONDEMNATION\/QUICK-TAKES:)"
    #seq_3 = r"(\bBOARD\w?\sAND\sCOMMISSION\w?\b).*(\bTRANSFER\w?\sOF\sFUND\w?\b).*(\bOPTIONS/CONDEMNATION/QUICK-TAKES:\b)"
    seq_3_match = re.search(seq_3, row["text"])
    row["seq_3"] = seq_3_match is not None
    return row

seq_df = seq_df.apply(test_sequence_3, axis=1)

In [177]:
sum(seq_df['seq_3'])

387