# NIMBY vs. YIMBY: geospatial analysis of new construction projects in the USA

TODO something like:
* Establish connection between pandas and dolt running in server mode.
* For each state/county/ZIP code (across some timeframe):
    * Compute the number of first property sales.
        * Absolute number
        * Per unit of area
        * Per unit of population
    * Compute percentage deviation from the mean.
        * Per unit of area
        * Per unit of population
* Do some dataviz to generate a nice, big chart and write it up.

During the second iteration of DoltHub's USA housing price data bounty, a large amount of public real estate data was scraped, wrangled and imported into a version-controlled database. This enables us to explore and analyse the data to gain insight into the dynamics of the real estate market. Some parts of the United States are said to suffer from NIMBYism: resistance to new property developments in one's own area. One famous example is Marc Andreessen, a prominent Silicon Valley venture capitalist, going out of his way to prevent new housing from being built in his town, Atherton, CA. But perhaps there are also areas that welcome and support new real estate projects? By wielding the power of programming and open data, we can use the `us-housing-prices-v2` database to find out which are which.

Our approach to the data analysis is as follows. We limit the data being analysed to the timeframe from 2009 June 30 to 2020 January 1. This gives us 10.5 years' worth of data, from the official end of the Great Recession of 2008 to the very beginning of the current, quite complicated decade. Thus we will be looking at steady-state trends from a relatively recent period that had no major shocks to the entire real estate market. Furthermore, we narrow our view to the initial record of each property being sold, which we take to imply that the property was newly built and had just entered the market. We count such initial sales for each county represented in the database. Some counties are geographically large, some are populous, and some are small in area and/or population. To make an apples-to-apples comparison, we normalise the number of initial sales by population and by land area. Lastly, we compute, for each county, how many standard deviations its per-capita and per-area values lie from the respective means, to appreciate how much each county stands out.
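The normalisation and deviation steps described above can be sketched with pandas as follows. This is a minimal illustration on made-up numbers, not the notebook's final implementation: the `n_first_sales` column and the `add_normalised_stats` helper are hypothetical names, while `pop2010` and `area_2010` mirror the columns of the county table used in this notebook.

```python
import pandas as pd


def add_normalised_stats(df: pd.DataFrame) -> pd.DataFrame:
    """Add per-capita and per-area rates plus their distance from the mean.

    Expects columns: n_first_sales (hypothetical name), pop2010, area_2010.
    """
    out = df.copy()
    out["per_capita"] = out["n_first_sales"] / out["pop2010"]
    out["per_area"] = out["n_first_sales"] / out["area_2010"]
    for col in ("per_capita", "per_area"):
        mean = out[col].mean()
        std = out[col].std()  # sample standard deviation (ddof=1)
        # Number of standard deviations each county lies from the mean.
        out[col + "_stdevs_from_mean"] = (out[col] - mean) / std
    return out


# Made-up toy data, for illustration only.
toy = pd.DataFrame({
    "county": ["A", "B", "C"],
    "n_first_sales": [100, 400, 50],
    "pop2010": [10_000, 20_000, 10_000],
    "area_2010": [100.0, 200.0, 500.0],
})
stats = add_normalised_stats(toy)
print(stats[["county", "per_capita_stdevs_from_mean", "per_area_stdevs_from_mean"]])
```

Counties with large positive values build more than average for their size or population; large negative values point the other way.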

In [1]:
from io import StringIO

import pandas as pd
import requests

resp = requests.get("https://www.openintro.org/data/csv/county_complete.csv", 
                    headers={"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36"})
resp

<Response [200]>

In [2]:
def cleanup_county(c):
    return c.upper().replace("COUNTY", "").strip()

buf = StringIO(resp.text)

df_counties = pd.read_csv(buf)
df_counties = df_counties[["fips", "state", "name", "pop2010", "area_2010"]]
df_counties = df_counties.rename(columns={'name': 'county'})
df_counties['county'] = df_counties['county'].apply(cleanup_county)
df_counties

Unnamed: 0,fips,state,county,pop2010,area_2010
0,1001,Alabama,AUTAUGA,54571,594.44
1,1003,Alabama,BALDWIN,182265,1589.78
2,1005,Alabama,BARBOUR,27457,884.88
3,1007,Alabama,BIBB,22915,622.58
4,1009,Alabama,BLOUNT,57322,644.78
...,...,...,...,...,...
3137,56037,Wyoming,SWEETWATER,43806,10426.65
3138,56039,Wyoming,TETON,21294,3995.38
3139,56041,Wyoming,UINTA,21118,2081.26
3140,56043,Wyoming,WASHAKIE,8533,2238.55


In [3]:
import mysql.connector as connection
import pandas as pd
from sqlalchemy import create_engine

db_connection_str = 'mysql+mysqlconnector://rl:trustno1@localhost/us_housing_prices_v2'
db_connection = create_engine(db_connection_str)

query = "SELECT * FROM `states`;"
states_df = pd.read_sql(query, db_connection)
states_df = states_df.rename(columns={'name': 'state'})
states_df

Unnamed: 0,code,state
0,AK,Alaska
1,AL,Alabama
2,AR,Arkansas
3,AS,American Samoa
4,AZ,Arizona
5,CA,California
6,CO,Colorado
7,CT,Connecticut
8,DC,District of Columbia
9,DE,Delaware


In [4]:
df_counties = pd.merge(df_counties, states_df, on='state')
df_counties

Unnamed: 0,fips,state,county,pop2010,area_2010,code
0,1001,Alabama,AUTAUGA,54571,594.44,AL
1,1003,Alabama,BALDWIN,182265,1589.78,AL
2,1005,Alabama,BARBOUR,27457,884.88,AL
3,1007,Alabama,BIBB,22915,622.58,AL
4,1009,Alabama,BLOUNT,57322,644.78,AL
...,...,...,...,...,...,...
3137,56037,Wyoming,SWEETWATER,43806,10426.65,WY
3138,56039,Wyoming,TETON,21294,3995.38,WY
3139,56041,Wyoming,UINTA,21118,2081.26,WY
3140,56043,Wyoming,WASHAKIE,8533,2238.55,WY


In [5]:
import mysql.connector as connection
import pandas as pd
from sqlalchemy import create_engine

db_connection_str = 'mysql+mysqlconnector://rl:trustno1@localhost/us_housing_prices_v2'
db_connection = create_engine(db_connection_str)

query = """
SELECT a.*
FROM `sales` a
INNER JOIN
(
    SELECT   `property_id`, `state`, `property_zip5`, `property_county`, MIN(`sale_datetime`) AS first_sale_datetime
    FROM     `sales`
    WHERE    `sale_datetime` > \"2009-06-30\" AND `sale_datetime` < \"2020-01-01\"
    GROUP BY `property_id`
) b ON a.property_id = b.property_id AND a.sale_datetime = b.first_sale_datetime;
"""

result_df = pd.read_sql(query, db_connection)
result_df = result_df[['state', 'property_zip5', 'property_county', 'sale_datetime', 'property_id']]
result_df = result_df.rename(columns={'property_county': 'county'})
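# Assign back to the column; reassigning result_df itself would replace the DataFrame with a Series.
result_df['county'] = result_df['county'].apply(cleanup_county)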
result_df

OperationalError: (mysql.connector.errors.OperationalError) 2013 (HY000): Lost connection to MySQL server during query
[SQL: 
SELECT a.*
FROM `sales` a
INNER JOIN
(
    SELECT   `property_id`, `state`, `property_zip5`, `property_county`, MIN(`sale_datetime`) AS first_sale_datetime
    FROM     `sales`
    WHERE    `sale_datetime` > "2009-06-30" AND `sale_datetime` < "2020-01-01"
    GROUP BY `property_id`
) b ON a.property_id = b.property_id AND a.sale_datetime = b.first_sale_datetime;
]
(Background on this error at: http://sqlalche.me/e/13/e3q8)

Reference for the GROUP BY / MIN pattern used in the query above: https://stackoverflow.com/questions/11683712/sql-group-by-and-min-mysql
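The earliest-row-per-group pattern can be tried out on a small scale before running it against the full `sales` table. The sketch below uses an in-memory SQLite database rather than the Dolt server, so the rows are made up and the SQL dialect is an assumption; only the column names `property_id`, `state` and `sale_datetime` come from the query above. It also differs from that query in one respect: the date window is applied in a `HAVING` clause, after the per-property minimum is taken, so a property first sold before the window is not counted as new. Whether that stricter definition is the right one for this analysis is a judgment call.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (property_id TEXT, state TEXT, sale_datetime TEXT)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("p1", "CA", "2005-03-01"),  # first sold before the window
        ("p1", "CA", "2012-07-15"),
        ("p2", "CA", "2010-01-20"),  # first sold inside the window
        ("p2", "CA", "2015-09-02"),
        ("p3", "NY", "2019-12-31"),  # first sold inside the window
    ],
)

query = """
SELECT a.property_id, a.state, a.sale_datetime
FROM sales a
INNER JOIN (
    SELECT property_id, MIN(sale_datetime) AS first_sale_datetime
    FROM sales
    GROUP BY property_id
    HAVING MIN(sale_datetime) > '2009-06-30'
       AND MIN(sale_datetime) < '2020-01-01'
) b ON a.property_id = b.property_id
   AND a.sale_datetime = b.first_sale_datetime
ORDER BY a.property_id;
"""
first_sales = conn.execute(query).fetchall()
print(first_sales)
```

Selecting only the needed columns, instead of `a.*`, also keeps the result set smaller, which may help with long-running queries.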

In [None]:
# TODO: rework this code to compute stats at county level.
# Count first sales per state code and turn the result into a two-column frame.
counts_by_state = result_df['state'].value_counts()
df_counts_by_state = counts_by_state.rename_axis('code').reset_index(name='n')

query = "SELECT * FROM `states`;"
states_df = pd.read_sql(query, db_connection)

df_counts_by_state = pd.merge(df_counts_by_state, states_df, on='code')

df_state_area = pd.read_html("https://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_area")[0]
df_state_area.columns = df_state_area.columns.to_flat_index()
df_state_area = df_state_area.rename(columns={('State', 'State'): 'name', ('Land area[2]', 'km2'): 'land_area_km2'})
df_state_area = df_state_area[['name', 'land_area_km2']]

df_counts_by_state = pd.merge(df_counts_by_state, df_state_area, on='name')

df_state_pop = pd.read_html('https://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population')[0]
df_state_pop.columns = df_state_pop.columns.to_flat_index()
df_state_pop = df_state_pop.rename(columns={('State or territory', 'State or territory'): 'name', 
                                            ('Census population[8][a]', 'July 1, 2021 (est.)'): 'population'})
df_state_pop = df_state_pop[['name', 'population']]

df_counts_by_state = pd.merge(df_counts_by_state, df_state_pop, on='name')
df_counts_by_state['per_capita'] = df_counts_by_state['n'] / df_counts_by_state['population']
df_counts_by_state['per_land_km2'] = df_counts_by_state['n'] / df_counts_by_state['land_area_km2']

per_capita_mean = float(df_counts_by_state['per_capita'].mean())
per_capita_stdev = float(df_counts_by_state['per_capita'].std())

df_counts_by_state['per_capita_stdevs_from_mean'] = (df_counts_by_state['per_capita'] - per_capita_mean) / per_capita_stdev

per_land_km2_mean = float(df_counts_by_state['per_land_km2'].mean())
per_land_km2_stdev = float(df_counts_by_state['per_land_km2'].std())

df_counts_by_state['per_land_km2_stdevs_from_mean'] = (df_counts_by_state['per_land_km2'] - per_land_km2_mean) / per_land_km2_stdev

df_counts_by_state['stdev_diff'] = abs(df_counts_by_state['per_capita_stdevs_from_mean'] - df_counts_by_state['per_land_km2_stdevs_from_mean'])

df_counts_by_state

In [None]:
# Based on: https://plotly.com/python/mapbox-county-choropleth/

import json

import plotly.graph_objects as go
import requests

token = "pk.eyJ1IjoicmwxOTg3IiwiYSI6ImNqa3k5MTBjczBneHYza3F0c3Vub3pjY2sifQ.yYKuXGFsuX6qbEUF0EJn1A"

# County boundaries as GeoJSON, identified by FIPS code.
resp = requests.get("https://raw.githubusercontent.com/plotly/datasets/master/geojson-counties-fips.json")
counties = json.loads(resp.text)

# Placeholder data from the Plotly example; FIPS codes are read as strings to keep leading zeros.
df = pd.read_csv("https://raw.githubusercontent.com/plotly/datasets/master/fips-unemp-16.csv",
                 dtype={"fips": str})

fig = go.Figure(go.Choroplethmapbox(geojson=counties, locations=df.fips, z=df.unemp,
                                    colorscale="Viridis", zmin=0, zmax=12, marker_line_width=0))
fig.update_layout(mapbox_style="light", mapbox_accesstoken=token,
                  mapbox_zoom=3, mapbox_center = {"lat": 37.0902, "lon": -95.7129})
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()

In [None]:
df