# Intro
Our task was to find the best neighborhood in Pittsburgh. To do this, we created a data-driven metric designed to identify the “best neighborhood for a family.” We calculate this metric from three components for each neighborhood: tree cover, crime, and the number of accessible parks. The component data for this project came from the Western Pennsylvania Regional Data Center.

In [1]:
# IMPORTS
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
from shapely.geometry import Point

In [2]:
neighborhoods = gpd.read_file("https://data.wprdc.org/dataset/e672f13d-71c4-4a66-8f38-710e75ed80a4/resource/4af8e160-57e9-4ebf-a501-76ca1b42fc99/download/neighborhoods.geojson")
trees = gpd.read_file("https://data.wprdc.org/dataset/9ce31f01-1dfa-4a14-9969-a5c5507a4b40/resource/d876927a-d3da-44d1-82e1-24310cdb7baf/download/trees_img.geojson")

trees = trees.to_crs(neighborhoods.crs)

trees_with_neighborhoods = gpd.sjoin(trees, neighborhoods, how="inner", predicate="within")
tree_counts = trees_with_neighborhoods.groupby("hood").size().reset_index(name='tree_count')

neighborhoods_with_counts = neighborhoods.merge(tree_counts, on="hood", how="left")
neighborhoods_with_counts["tree_count"] = neighborhoods_with_counts["tree_count"].fillna(0).astype(int)

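# Tree density: tree count divided by the neighborhood's area
neighborhoods_with_counts['trees_per_area'] = neighborhoods_with_counts['tree_count'] / neighborhoods_with_counts['Shape__Area']
# Despite its name, max_tree_count holds the highest tree density, not the highest raw count
max_tree_count = neighborhoods_with_counts['trees_per_area'].max()
# Scale so the densest neighborhood scores 1 and a neighborhood with no trees scores 0
neighborhoods_with_counts['proportion'] = neighborhoods_with_counts['trees_per_area'] / max_tree_count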

neighborhoods_with_counts = neighborhoods_with_counts.sort_values(by='proportion', ascending=False)
neighborhoods_with_counts.reset_index(drop=True, inplace=True)

trees_final = neighborhoods_with_counts[['hood', 'tree_count', 'proportion']]

# Trees
We included this data in our metric on the idea that the more trees a neighborhood has, the more inviting it is, especially for children. When we first found this dataset, we believed it was a complete tree cover dataset for the city of Pittsburgh. After working with the data for a while, it became clear that it must include only trees that are planted or maintained by the city or municipal government. Each entry is far too detailed for the dataset to account for every tree in Frick Park, for example, and some neighborhoods appeared to have far more trees than are reported here. We realized, however, that the data is still valuable: because it is composed of government-maintained trees, a higher number of trees indicates strong civic initiatives. In addition, most trees planted and maintained by local governments are located in highly urbanized areas, so this tree score can also boost a neighborhood that may not have many parks but still has a lot of maintained green space. The score itself is each neighborhood's tree count divided by its area, scaled by the highest density found in any neighborhood.
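The scoring used for the tree component (tree count divided by area, then scaled by the largest density) can be illustrated on a small toy table. This is only a sketch: the neighborhood names, counts, and areas below are made up for illustration and are not taken from the WPRDC data. The column names mirror the ones used in the cell above.

```python
import pandas as pd

# Hypothetical values for illustration only; not real Pittsburgh data.
toy_trees = pd.DataFrame({
    "hood": ["A", "B", "C"],
    "tree_count": [120, 300, 0],
    "Shape__Area": [2.0e6, 1.0e7, 5.0e6],
})

# Density first, then scale by the largest density so scores fall in [0, 1].
toy_trees["trees_per_area"] = toy_trees["tree_count"] / toy_trees["Shape__Area"]
toy_trees["proportion"] = toy_trees["trees_per_area"] / toy_trees["trees_per_area"].max()

toy_trees = toy_trees.sort_values(by="proportion", ascending=False).reset_index(drop=True)
print(toy_trees[["hood", "tree_count", "proportion"]])
```

In this toy table, neighborhood B has the most trees but only half the density of A, so A ranks first. Dividing by area keeps large neighborhoods from winning on size alone.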

In [3]:
crime_df = pd.read_excel("https://data.wprdc.org/dataset/65e69ee3-93b2-4f7a-b9cb-8ce977f15d9a/resource/bd41992a-987a-4cca-8798-fbe1cd946b07/download/allmergedtables.xlsx")

crime_geo = [Point(xy) for xy in zip(crime_df['XCOORD'], crime_df['YCOORD'])]

crime = gpd.GeoDataFrame(crime_df, geometry=crime_geo)
crime.set_crs('EPSG:4326', allow_override=True, inplace=True);

crime = crime.to_crs(neighborhoods.crs)

neighborhood_crime = gpd.sjoin(crime, neighborhoods, how="inner", predicate="within")
crime_count = neighborhood_crime.groupby("hood").size().reset_index(name='crime_count')

neighborhood_crime = neighborhoods.merge(crime_count, on="hood", how="left")
neighborhood_crime["crime_count"] = neighborhood_crime["crime_count"].fillna(0).astype(int)

# Scale by the highest crime count, then invert so that fewer crimes gives a higher score
max_crime = neighborhood_crime['crime_count'].max()
neighborhood_crime['proportion'] = 1 - neighborhood_crime['crime_count'] / max_crime

neighborhood_crime = neighborhood_crime.sort_values(by='proportion', ascending=False)
neighborhood_crime.reset_index(drop=True, inplace=True)

crime_final = neighborhood_crime[['hood', 'crime_count', 'proportion']]

# Crime
For our crime data, we used the Pittsburgh crime dashboard. We included crime in our metric because a family would likely want their neighborhood to be safe. We intentionally did __not__ calculate crime relative to the population living in an area, because if you live in a neighborhood where most of the crime is committed by non-residents, you are still affected. The best example is the Southside, which has a relatively high level of crime because of the many prominent bars on East Carson Street; when you account for population in these areas, the true effect is masked.
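To make the reasoning about population concrete, the sketch below compares the count-based score used in this notebook with a per-capita alternative. All numbers, including the `population` column, are hypothetical and chosen only to show how the two approaches can disagree; they are not drawn from the crime dashboard or from any census data.

```python
import pandas as pd

# Hypothetical values for illustration only; not real Pittsburgh data.
toy_crime = pd.DataFrame({
    "hood": ["Nightlife district", "Quiet residential"],
    "crime_count": [900, 300],
    "population": [9000, 2000],
})

# Count-based score, as in the cell above: fewer crimes gives a higher score.
toy_crime["count_score"] = 1 - toy_crime["crime_count"] / toy_crime["crime_count"].max()

# Per-capita alternative (not used in our metric), shown only for comparison.
toy_crime["per_capita"] = toy_crime["crime_count"] / toy_crime["population"]
toy_crime["per_capita_score"] = 1 - toy_crime["per_capita"] / toy_crime["per_capita"].max()

print(toy_crime[["hood", "count_score", "per_capita_score"]])
```

With these made-up numbers, the nightlife district scores worst on raw counts but best per capita, even though it has three times as many incidents. That reversal is the masking effect we wanted to avoid.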