This notebook tests the helper functions in `utils.py` that (1) parse the APR spreadsheets for 2018-2019 data and (2) combine the ABAG permits dataset for 2013-2017 with the 2018-2019 APR spreadsheets to create a dataset of all permits over the entire period.

In [1]:
import geopandas as gpd
import pandas as pd
from IPython.display import Markdown
from housing_elements import utils

Set up logging to print to the screen:

In [2]:
import logging
import sys
logging.basicConfig(stream=sys.stdout, level=logging.INFO)

In [64]:
from importlib import reload
reload(utils)

<module 'housing_elements.utils' from '/Users/Salim/Desktop/housing-elements/housing_elements/utils.py'>

First, let's test the APR spreadsheet helper function (`utils.load_apr_permits`) on the cities we have, to check that the results look reasonable.

In [None]:
datasets = {}
for city in ['Berkeley', 'Mountain View', 'Oakland', 'Palo Alto', 'SanJose']:
    for year in ['2018', '2019']:
        filtered_df = utils.load_apr_permits(city, year)

        display(Markdown('# ' + city + ' ' + year))
        display(filtered_df[['Current APN', 'Street Address', '# of Units Issued Building Permits', 
                             'Unit Category (SFA,SFD,2 to 4,5+,ADU,MH)']])

These look reasonable.

Now let's test out the functions that load the ABAG permits:

In [None]:
sj_permits_df = utils.load_all_new_building_permits('San Jose')

In [None]:
sj_permits_df.groupby('permyear')['totalunit'].sum()

Looks good!

In [None]:
sj_permits_df.columns

In [None]:
sj_sites = utils.load_site_inventory('San Jose')

In [None]:
sj_permits_df.columns

In [None]:
utils.calculate_inventory_housing_over_all_housing(sj_sites, sj_permits_df)

In [None]:
sj_sites[sj_sites.apn == '25417084']

In [None]:
# TODO: While testing with the San Jose dataset, I saw one APN (25417084) listed three times.
# We should find a fix for this during cleaning.
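
One way to surface repeated APNs like this during cleaning is pandas' `duplicated`. The sketch below runs on a small made-up frame: `toy_sites` and its values are illustrative only, and it assumes the column is named `apn` as in `sj_sites`.

```python
import pandas as pd

# Toy stand-in for the site inventory; the values are made up for illustration.
toy_sites = pd.DataFrame({
    'apn': ['25417084', '25417084', '25417084', '11111111', '22222222'],
    'sitetype': ['Vacant', 'Vacant', 'Underutilized', 'Vacant', 'Underutilized'],
})

# keep=False flags every row whose APN appears more than once,
# not just the second and later occurrences.
dup_rows = toy_sites[toy_sites['apn'].duplicated(keep=False)]

# Count how many times each repeated APN occurs.
dup_counts = dup_rows['apn'].value_counts()
print(dup_counts)
```

Looking at `dup_rows` side by side should show whether the repeats are exact copies (safe to drop) or distinct records that happen to share an APN.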

In [None]:
sj_sites.apn.isin(sj_permits_df.apn).sum()

In [None]:
sj_permits_df.apn.isin(sj_sites.apn).sum()

In [None]:
# TODO: Why are there multiple permits per site? Are some of them unused? Are there duplicates?
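
A possible starting point for this question is to count permits per APN among the permits that match an inventory site. This is only a sketch on made-up frames: the column names `apn` and `totalunit` follow the ones used in this notebook, while `toy_sites`, `toy_permits`, and their values are hypothetical.

```python
import pandas as pd

# Made-up inventory and permit tables for illustration.
toy_sites = pd.DataFrame({'apn': ['A1', 'A2', 'A3']})
toy_permits = pd.DataFrame({
    'apn': ['A1', 'A1', 'A2', 'B9'],
    'totalunit': [10, 10, 4, 2],
})

# Keep only permits whose APN appears in the inventory.
matched = toy_permits[toy_permits['apn'].isin(toy_sites['apn'])]

# Number of permits and total permitted units per matched site.
per_site = matched.groupby('apn')['totalunit'].agg(['count', 'sum'])

# Sites with more than one permit are candidates for a closer look.
multi = per_site[per_site['count'] > 1]

# Rows identical in every column hint at double counting.
exact_dupes = matched[matched.duplicated(keep=False)]
print(multi)
```

If most multi-permit sites turn out to be exact duplicates, deduplicating before computing the ratios is probably enough; if not, the permits may be separate phases of one project.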

In [None]:
utils.calculate_mean_overproduction_on_sites(sj_sites, sj_permits_df)

In [None]:
utils.calculate_inventory_housing_over_all_housing(sj_sites, sj_permits_df)

In [None]:
utils.calculate_total_units_permitted_over_he_capacity(sj_sites, sj_permits_df)

In [None]:
sj_sites.sitetype.value_counts()

In [None]:
utils.calculate_pdev_for_nonvacant_sites(sj_sites, sj_permits_df)

In [None]:
utils.calculate_pdev_for_vacant_sites(sj_sites, sj_permits_df)

In [None]:
utils.calculate_pdev_for_inventory(sj_sites, sj_permits_df)

In [None]:
sj_sites

In [56]:
import numpy as np
df = gpd.read_file(
    "./data/raw_data/housing_sites/xn--Bay_Area_Housing_Opportunity_Sites_Inventory__20072023_-it38a.shp"
)


In [5]:
cities = df.jurisdict.unique()


In [63]:
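import os
import sys

class HiddenPrints:
    """Context manager that silences print() by pointing sys.stdout at os.devnull."""

    def __enter__(self):
        self._original_stdout = sys.stdout
        sys.stdout = open(os.devnull, 'w')

    def __exit__(self, exc_type, exc_val, exc_tb):
        # Close the devnull handle and restore the real stdout, even on error.
        sys.stdout.close()
        sys.stdout = self._original_stdout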
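
The standard library provides the same behavior through `contextlib.redirect_stdout`, so the custom class is optional. A minimal sketch follows; `noisy` is a hypothetical stand-in for a helper that prints while it works. Note that a handler created by `logging.basicConfig(stream=sys.stdout)` keeps a reference to the original stream, so neither approach should be expected to silence log records.

```python
import contextlib
import io
import os

def noisy():
    # Hypothetical stand-in for a helper that prints while it works.
    print('loading...')
    return 42

# Discard everything printed inside the block.
with open(os.devnull, 'w') as devnull, contextlib.redirect_stdout(devnull):
    result = noisy()

# Alternatively, capture the output so it can be inspected afterwards.
buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    noisy()
captured = buffer.getvalue()
```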

Unfortunate cases where `allowden` values are cast to nan:
 - San Ramon has "up to 1.35 FAR" as `allowden` for one parcel. I will cast it to nan.
 - Newark has "2500 sf/ac" as `allowden` for one parcel. It gets reduced to 'ac' and cast to nan.
 - Danville uses '1 du/20 ac' as `allowden`. Converting these is not worth the effort, so they are cast to nan.
 - El Cerrito uses non-standardized plain English in `allowden`.
 - Walnut Creek mostly uses FAR instead of du/ac.
 - Pittsburg has one `allowden` value of 'Max 96 units'. I'm not going to support this one parcel.
 - Sausalito expresses everything as p units per k square feet, where p and k vary. I'm not handling this.
 - Fairfax has many values of "project specific - no maximum".
 - Novato has some FAR values.
 - Portola Valley has a few `allowden` values of 'PD', whose meaning I could not determine.
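
To make the casting rule concrete, here is a minimal sketch of a parser that accepts plain numbers and "N du/ac" style strings and returns nan for everything else. The function name `parse_density` and the regular expression are assumptions for illustration, not the logic in `utils.py`.

```python
import math
import re

# Assumed set of accepted formats: "20", "20.5", "20 du/ac", "20 units/acre".
_DENSITY_RE = re.compile(
    r'^\s*(\d+(?:\.\d+)?)\s*(?:(?:du|units?)\s*/\s*(?:ac|acre))?\s*$',
    re.IGNORECASE,
)

def parse_density(value):
    """Return dwelling units per acre as a float, or nan if unparseable."""
    if isinstance(value, (int, float)):
        return float(value)
    match = _DENSITY_RE.match(str(value))
    return float(match.group(1)) if match else math.nan

# Strings quoted in the notes above, plus two parseable values.
for raw in ['20 du/ac', '12.5', 'up to 1.35 FAR', '2500 sf/ac', '1 du/20 ac', 'PD']:
    print(repr(raw), '->', parse_density(raw))
```

Anchoring the pattern at both ends is deliberate: a value such as "2500 sf/ac" is rejected outright instead of being read as a density of 2500.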
 
Other data info:
- Orinda has no sites in the cycle in the ABAG dataset.
- Piedmont's `allowden` values are nonsensical.
- Woodside's `allowden` values are also nonsensical.

For any quantity of interest where over 50% of an input variable is nan, we should mark it as ignored in the results table.
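
The 50% rule can be checked mechanically. The sketch below assumes a DataFrame whose input columns may contain nan; the helper name `too_sparse`, its `threshold` argument, and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd

def too_sparse(df, column, threshold=0.5):
    """Return True if more than `threshold` of `column` is nan."""
    # isna().mean() is the fraction of missing values in the column.
    return bool(df[column].isna().mean() > threshold)

# Made-up inputs: 3 of 4 densities missing, 1 of 4 acreages missing.
toy = pd.DataFrame({
    'allowden': [20.0, np.nan, np.nan, np.nan],
    'acres': [1.0, 2.0, np.nan, 0.5],
})

# Map each input column to whether results depending on it should be ignored.
ignored = {col: too_sparse(toy, col) for col in ['allowden', 'acres']}
print(ignored)
```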

In [None]:
la_sites['apn'] = la_sites['apn'].str.replace('-', '', regex=False).astype('float')

In [None]:
permits.sort_values('totalunit', ascending=False).drop_duplicates('apn').shape

permits

In [None]:
permits.shape

In [45]:
cities

array(['Berkeley', 'Albany', 'Alameda', 'Livermore', 'Fremont',
       'San Ramon', 'Newark', 'Brentwood', 'Hayward',
       'Contra Costa County', 'Emeryville', 'Alameda County',
       'Pleasanton', 'San Leandro', 'Concord', 'Richmond', 'Martinez',
       'Clayton', 'Pinole', 'Oakland', 'San Francisco', 'Dublin',
       'Antioch', 'Lafayette', 'Danville', 'San Pablo', 'Napa',
       'El Cerrito', 'Union City', 'Walnut Creek', 'Corte Madera',
       'Moraga', 'Hercules', 'Oakley', 'Orinda', 'Marin County',
       'Pittsburg', 'Pleasant Hill', 'American Canyon', 'Larkspur',
       'Piedmont', 'San Rafael', 'Calistoga', 'Tiburon', 'Sausalito',
       'Saint Helena', 'Yountville', 'Napa County', 'San Anselmo',
       'Belvedere', 'Fairfax', 'Ross', 'Novato', 'Half Moon Bay',
       'Millbrae', 'San Bruno', 'Mill Valley', 'Brisbane', 'Atherton',
       'Menlo Park', 'Pacifica', 'Redwood City', 'Belmont', 'San Mateo',
       'Colma', 'Daly City', 'San Carlos', 'Hillsborough', 'Woodside',
 

In [55]:
for city in cities:
    print(city.upper())
    try:
        with HiddenPrints():
            utils.load_all_new_building_permits(city)
    except Exception as exc:
        print('ERROR FOUND')
        print(city)
        print(str(exc))
        print('-------')

BERKELEY
ALBANY
ALAMEDA
LIVERMORE
FREMONT
SAN RAMON
ERROR FOUND
San Ramon
[Errno 2] No such file or directory: 'data/raw_data/APRs/SanRamon2018.xlsm'
-------
NEWARK
BRENTWOOD
HAYWARD
CONTRA COSTA COUNTY
ERROR FOUND
Contra Costa County

-------
EMERYVILLE
ALAMEDA COUNTY
ERROR FOUND
Alameda County

-------
PLEASANTON
SAN LEANDRO
CONCORD
RICHMOND
MARTINEZ
ERROR FOUND
Martinez
[Errno 2] No such file or directory: 'data/raw_data/APRs/Martinez2018.xlsm'
-------
CLAYTON
ERROR FOUND
Clayton
[Errno 2] No such file or directory: 'data/raw_data/APRs/Clayton2018.xlsm'
-------
PINOLE
OAKLAND
SAN FRANCISCO
DUBLIN
ANTIOCH
LAFAYETTE
DANVILLE
SAN PABLO
ERROR FOUND
San Pablo
[Errno 2] No such file or directory: 'data/raw_data/APRs/SanPablo2018.xlsm'
-------
NAPA
EL CERRITO
UNION CITY
WALNUT CREEK
CORTE MADERA
ERROR FOUND
Corte Madera
[Errno 2] No such file or directory: 'data/raw_data/APRs/CorteMadera2018.xlsm'
-------
MORAGA
ERROR FOUND
Moraga
[Errno 2] No such file or directory: 'data/raw_data/APRs/Mo