# Board of Estimates Tabulator  
This tool reads the PDF files that store the minutes of Baltimore's Board of Estimates (BoE) and builds a small database of linked tables. Candidate entities include:

- __meetings__
    - one entity per BoE meeting
    - primary key is the date
- __agreements__
    - primary key is BAN plus the partner organization's name
    - features include: date, dollar amount, BAN, description
- __prequalifications__
- __contractors__ 
- __personnel__
- __reclassifications__
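
As a rough illustration of how two of these tables could link together, here is a minimal pandas sketch. The column names (`ban`, `partner`, `amount`, `description`) and every sample row are hypothetical placeholders rather than values taken from real minutes; the actual schema is still undecided.

```python
import pandas as pd

# Hypothetical sample rows; none of these values come from real minutes.
meetings = pd.DataFrame(
    {"date": pd.to_datetime(["2013-11-20", "2010-03-17"])}
).set_index("date")

agreements = pd.DataFrame(
    {
        "ban": ["BAN-001", "BAN-001", "BAN-002"],
        "partner": ["Org A", "Org B", "Org A"],
        "date": pd.to_datetime(["2013-11-20", "2013-11-20", "2010-03-17"]),
        "amount": [1000.0, 2500.0, 400.0],
        "description": ["placeholder"] * 3,
    }
).set_index(["ban", "partner"])

# A composite primary key (BAN plus partner name) must be unique.
assert agreements.index.is_unique

# Every agreement should point at an existing meeting date.
assert agreements["date"].isin(meetings.index).all()

# Linking through the date lets us aggregate agreements per meeting.
per_meeting = agreements.groupby("date")["amount"].sum()
print(per_meeting)
```

Keeping the meeting date as a plain column in `agreements` is one simple way to express the link; a real database would likely enforce it as a foreign key instead.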

## Setup
### Import packages  
**Note:** We may want to move the visualization code into a separate notebook.

In [1]:
from datetime import datetime
import pandas as pd
import requests
from bs4 import BeautifulSoup
import re
from pathlib import Path
import time 
import seaborn as sns
import numpy as np
from matplotlib import pyplot as plt

from utils import *
from bike_rack.parse_utils import parse_pdf

from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters() 

### Define directories and urls
We'll also create the `pdf_files` directory if it doesn't exist already.

In [2]:
base_url = "https://comptroller.baltimorecity.gov/"
minutes_url = base_url + "boe/meetings/minutes"

root = Path.cwd()
pdf_dir = root / "pdf_files"

try:
    pdf_dir.mkdir(parents=True, exist_ok=False)
except FileExistsError:
    print("The pdf directory already exists.")
else:
    print("The pdf directory has been created.")

The pdf directory already exists.


### Store PDFs to local directory  
The following code downloads every PDF of Board of Estimates minutes and saves it in your local copy of this repository.

The code will skip this time-consuming step if it detects files within the `pdf_files` directory.

The tricky part is getting a correct date for every file. Some entries have a typo somewhere in the page's HTML, so the functions `store_boe_pdfs()` and `parse_long_dates()` are built to handle these errors. New kinds of errors may still appear in the future, however.
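
To make the date problem concrete, here is a minimal sketch of typo-tolerant date parsing. It is not the implementation in `utils`: the function name `parse_meeting_date`, the list of candidate formats, and the sample strings are all assumptions chosen for illustration, and the real link text on the minutes page may differ.

```python
from datetime import datetime
import re

# Hypothetical formats; the real link text on the minutes page may differ.
CANDIDATE_FORMATS = ("%B %d, %Y", "%B %d %Y", "%b %d, %Y", "%b. %d, %Y")


def parse_meeting_date(raw):
    """Return a datetime for a long-form date string, or None if no format fits."""
    # Collapse repeated whitespace and trim the ends.
    cleaned = re.sub(r"\s+", " ", raw).strip()
    # Tolerate a stray space before the comma, e.g. "November 20 , 2013".
    cleaned = cleaned.replace(" ,", ",")
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt)
        except ValueError:
            continue
    # Returning None lets the caller log the string instead of crashing.
    return None


print(parse_meeting_date("November  20 , 2013"))
print(parse_meeting_date("not a date"))
```

Returning `None` instead of raising makes it easy to collect every unparseable string in one pass and then decide whether a new format belongs in the list.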

In [3]:
# set to True if you'll be repeatedly running store_boe_pdfs()
testing_mode = False

# Path objects are always truthy, so check for existence explicitly
if testing_mode and pdf_dir.exists():
    del_dir_contents(pdf_dir)
if is_empty(pdf_dir):
    store_boe_pdfs(base_url, minutes_url)
else: 
    print("Files already exist in the pdf directory.")

Files already exist in the pdf directory.
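
The skip logic above depends on a helper that reports whether the PDF directory holds any files. The sketch below shows one way such a check might be written; `dir_is_empty` is a hypothetical name and is not necessarily how `is_empty()` in `utils` works.

```python
from pathlib import Path
import tempfile


def dir_is_empty(directory: Path) -> bool:
    """Return True if the directory contains no files at any depth."""
    # rglob("*") also walks year subfolders; empty folders do not count as files.
    return not any(p.is_file() for p in directory.rglob("*"))


# Demonstrate on a throwaway directory so nothing in the repository is touched.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp)
    empty_before = dir_is_empty(demo_dir)
    (demo_dir / "2013").mkdir()
    (demo_dir / "2013" / "2013_11_20.pdf").write_bytes(b"")
    empty_after = dir_is_empty(demo_dir)

print(empty_before, empty_after)
```

Checking for files rather than for any directory entry means a leftover empty year folder would not prevent the download from running.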


## Process and prepare sample data
### Parse only a few sample pdfs
While we're still in development mode, we can save time by parsing only a few PDFs instead of all 500+.

In [4]:
# specifies the paths for a couple sample pdfs
meeting1_path = Path("pdf_files/2013/2013_11_20.pdf")
meeting2_path = Path("pdf_files/2010/2010_03_17.pdf")

# uses parse_pdf() to instantiate the Minutes class for each meeting
meeting1 = parse_pdf(meeting1_path)
meeting2 = parse_pdf(meeting2_path)

print(meeting1.clean_text[:500])

4680 BOARD OF ESITMATES November 20, 2013 MINUTES REGULAR MEETING Honorable Bernard C. "Jack" Young, President Honorable Stephanie Rawlings-Blake, Mayor - ABSENT Harry Black, Director of Finance Honorable Joan M. Pratt, Comptroller and Secretary George A. Nilson, City Solicitor Alfred H. Foxx, Director of Public Works David E. Ralph, Deputy City Solicitor Rudolph S. Chow, Deputy Director of Public Works Bernice H. Taylor, Deputy Comptroller and Clerk Pursuant to Article VI, Section 1(c) of the r


## Process and prepare data
### Create a Pandas dataframe with full texts  
Now that we have the .pdf files we need, we're ready to read them and store their text in a Pandas dataframe. 

This should take about one minute for each year of data.

In [None]:
text_df_raw = store_pdf_text_to_df(pdf_dir)

An error occurred reading file /Users/james/Documents/Bmore/repos/BOE_tabulator/pdf_files/2013/2013_06_19.pdf
An error occurred reading file /Users/james/Documents/Bmore/repos/BOE_tabulator/pdf_files/2013/2013_06_26.pdf


### View a sample of the stored text

In [None]:
text_df_raw.sample(6, random_state=444)

### Replace erroneous characters and consolidate white spaces
We currently collapse every run of whitespace into a single space. This decision may not hold up in the long term, because we may need runs of multiple spaces to detect certain fields.

Hopefully that won't be necessary.
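
The trade-off can be seen in a small sketch. The helper `consolidate_whitespace` and both sample strings are made up for illustration; `replace_chars()` in `utils` may do more or behave differently.

```python
import re


def consolidate_whitespace(text: str) -> str:
    """Collapse every run of whitespace (spaces, tabs, newlines) into one space."""
    return re.sub(r"\s+", " ", text).strip()


# Made-up sample text with mixed whitespace.
sample = "BOARD OF ESTIMATES   November 20,\n2013\t MINUTES"
collapsed = consolidate_whitespace(sample)
print(collapsed)

# Why multiple spaces can matter: in a made-up two-column line, a run of
# two or more spaces is the only signal separating the fields.
line = "Vendor Name    $1,000.00"
fields = re.split(r" {2,}", line)
print(fields)
```

If field detection of this kind turns out to be needed, one option is to keep the raw `minutes` column alongside the cleaned `text` column so both views stay available.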

In [None]:
text_df = text_df_raw.copy()

text_df['text'] = text_df['minutes'].apply(replace_chars)

In [None]:
# view a sample of the transformed text
print(text_df['text'][0][0:500])

In [None]:
def add_fiscal_year(df):
    """Add calendar_year, month, and fiscal_year columns derived from df['date']."""
    df = df.copy()
    df["calendar_year"] = df["date"].dt.year
    df["month"] = df["date"].dt.month
    # meetings from July onward are assigned to the next fiscal year
    df["fiscal_year"] = np.where(
        df["month"] >= 7, df["calendar_year"] + 1, df["calendar_year"]
    )
    return df

text_df['date'] = pd.to_datetime(text_df['date'])
text_df = add_fiscal_year(text_df)

text_df.head(3)
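
The rule in `add_fiscal_year()` treats July as the first month of a fiscal year, so a meeting in July or later is assigned to the following year. The standalone sketch below applies the same rule to three sample dates picked only to straddle the boundary.

```python
import numpy as np
import pandas as pd

# Sample dates on either side of the July boundary.
dates = pd.Series(pd.to_datetime(["2013-06-26", "2013-07-10", "2013-11-20"]))

# Same rule as add_fiscal_year(): July onward rolls into the next year.
fiscal_year = np.where(dates.dt.month >= 7, dates.dt.year + 1, dates.dt.year)

print(list(fiscal_year))  # [2013, 2014, 2014]
```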

### Store the meeting date as the index

In [None]:
text_df = text_df.set_index('date', drop=False)
text_df['word_count'] = text_df['text'].apply(lambda x: len(x.split()))

text_df.head(3)

## Test consistency of the pdfs