---
format: 
  html:
    toc: false
    page-layout: full
execute:
    echo: false
---

## 3. Regression Analysis

In [14]:
import geopandas as gpd
import numpy as np
import pandas as pd
import cenpy
import pygris
import pandana as pnda
import osmnx as ox
import altair as alt
import geoviews as gv
import geoviews.tile_sources as gvts
import warnings

from pandana.loaders import osm
from shapely.geometry import Point
from matplotlib import pyplot as plt

In [15]:
warnings.filterwarnings("ignore")

In [16]:
np.random.seed(42)
pd.options.display.max_columns = 999

Physical amenities may be related to how residents perceive their community's health. To explore this, I ran an OLS regression of the percentage of adults who rate their health as “fair” or “poor” (a measure of how people perceive their own health) on the number of health centers and the number of parks in each planning district, both normalized by district area.
Regression analysis quantifies the relationships between variables. Here, it tests whether the density of health centers and parks has a statistically significant linear association with perceived health in a planning district. Understanding the factors associated with perceived health can guide public health interventions. For example, a negative association (more amenities alongside a smaller share of adults reporting fair or poor health) might indicate that improving access to health centers and recreational spaces improves residents' health perceptions. This information can help policymakers, urban planners, and health professionals identify areas for improvement.
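
Written out, the model fitted in this section can be sketched as follows. The symbols are notation introduced here for illustration: $y_i$ is Percentage for planning district $i$, $h_i$ is num_health_center_per_area, $p_i$ is num_parks_per_area, and $\varepsilon_i$ is the error term.

$$
y_i = \beta_0 + \beta_1 h_i + \beta_2 p_i + \varepsilon_i
$$

OLS picks the coefficients that minimize the sum of squared residuals. Because $y_i$ is the share of adults reporting fair or poor health, a negative $\beta_1$ or $\beta_2$ would correspond to better perceived health where amenities are denser.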

In [17]:
planning_districts = gpd.read_file(
    "https://opendata.arcgis.com/datasets/0960ea0f38f44146bb562f2b212075aa_0.geojson"
).to_crs(epsg=2272)
community_score = pd.read_csv("./data/CLEANED_community_health_score.csv")
# planning_districts is already in EPSG:2272, so the merged frame keeps that CRS
community_geo = planning_districts.merge(community_score, how='left', on='DIST_NAME')

In [18]:
Health_Centers_gdf = gpd.read_file("./data/Health_Centers.geojson").to_crs("EPSG:2272")
# Note: in EPSG:2272 these are projected x/y coordinates in feet, not degrees of longitude/latitude
Health_Centers_gdf['lon'] = Health_Centers_gdf.geometry.x
Health_Centers_gdf['lat'] = Health_Centers_gdf.geometry.y

In [19]:
# `predicate` replaces the deprecated `op` keyword (geopandas >= 0.10)
joined_data = gpd.sjoin(Health_Centers_gdf, planning_districts, how="left", predicate="within")
# Count health centers per district; districts with none are absent here and become NaN after the merge
health_centers_count = joined_data.groupby('OBJECTID_1').size().reset_index(name='health_centers_count')
planning_districts = planning_districts.merge(health_centers_count, on='OBJECTID_1', how='left')

In [20]:
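planning_districts = planning_districts.loc[:, ~planning_districts.columns.duplicated()]
# Area in EPSG:3857 (Web Mercator) square metres. Web Mercator inflates areas away from the
# equator, but the factor is nearly uniform across one city, so the densities are only rescaled.
area = planning_districts.to_crs(epsg=3857).geometry.area
# Health centers per 10,000 (Web Mercator) square metres
planning_districts["num_health_center_per_area"] = planning_districts["health_centers_count"] / area * 1e4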

In [21]:
url = "https://opendata.arcgis.com/datasets/d52445160ab14380a673e5849203eb64_0.geojson"
parks = gpd.read_file(url).to_crs("EPSG:2272")

In [22]:
# `predicate` replaces the deprecated `op` keyword (geopandas >= 0.10).
# "within" counts only parks whose geometry lies entirely inside a single district.
joined_data = gpd.sjoin(parks, planning_districts, how="left", predicate="within")
parks_count = joined_data.groupby('OBJECTID_1').size().reset_index(name='parks_count')
planning_districts = planning_districts.merge(parks_count, on='OBJECTID_1', how='left')

In [23]:
planning_districts["num_parks_per_area"] = planning_districts["parks_count"] / area * 1e4

In [24]:
planning_districts = planning_districts.merge(community_geo[['DIST_NAME', 'Percentage']], on='DIST_NAME', how='left')

In [25]:
planning_districts.head()

|  | OBJECTID_1 | OBJECTID | DIST_NAME | ABBREV | Shape__Area | Shape__Length | PlanningDist | DaytimePop | geometry | health_centers_count | num_health_center_per_area | parks_count | num_parks_per_area | Percentage |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 14 | River Wards | RW | 210727000.0 | 66931.59502 |  |  | POLYGON ((2711323.754 255818.110, 2711628.628 ... |  |  | 21 | 0.006289 | 0.282 |
| 1 | 2 | 3 | North Delaware | NDEL | 270091500.0 | 89213.074378 |  |  | POLYGON ((2743358.021 274541.170, 2743413.946 ... |  |  | 28 | 0.006534 | 0.221 |
| 2 | 3 | 0 | Lower Far Northeast | LFNE | 306852900.0 | 92703.285159 |  |  | POLYGON ((2747427.678 297865.068, 2747454.031 ... |  |  | 17 | 0.003487 | 0.162 |
| 3 | 4 | 9 | Central | CTR | 178288000.0 | 71405.14345 |  |  | POLYGON ((2697746.272 241701.844, 2697962.079 ... | 8.0 | 0.002835 | 77 | 0.027284 | 0.136 |
| 4 | 5 | 10 | University Southwest | USW | 129646800.0 | 65267.676141 |  |  | POLYGON ((2686719.537 239936.817, 2686992.274 ... | 6.0 | 0.002924 | 16 | 0.007798 |  |


In [26]:
import statsmodels.api as sm

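# Note: fillna(0) treats districts with no matched health center as density 0, but it also
# sets any missing Percentage value to 0, so those districts enter the model as 0%.
df = planning_districts[['num_health_center_per_area', 'num_parks_per_area', 'Percentage']].fillna(0)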

# Define independent variables (X) and dependent variable (y)
X = df[['num_health_center_per_area', 'num_parks_per_area']]
y = df['Percentage']

# Add a constant term to the independent variables matrix
X = sm.add_constant(X)

# Fit the regression model
model = sm.OLS(y, X).fit()

# Print the regression results
print(model.summary())

# Check for correlation
correlation_matrix = df.corr()
correlation_percentage = correlation_matrix.loc['Percentage', ['num_health_center_per_area', 'num_parks_per_area']]

# Print correlation results
print(f"\nCorrelation between Percentage and num_health_center_per_area: {correlation_percentage['num_health_center_per_area']:.4f}")
print(f"Correlation between Percentage and num_parks_per_area: {correlation_percentage['num_parks_per_area']:.4f}")


                            OLS Regression Results                            
Dep. Variable:             Percentage   R-squared:                       0.025
Model:                            OLS   Adj. R-squared:                 -0.106
Method:                 Least Squares   F-statistic:                    0.1884
Date:                Fri, 22 Dec 2023   Prob (F-statistic):              0.830
Time:                        02:29:39   Log-Likelihood:                 17.006
No. Observations:                  18   AIC:                            -28.01
Df Residuals:                      15   BIC:                            -25.34
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                                 coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------------
const               
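
The summary reports 18 observations even though the preview above shows University Southwest with no Percentage value. That follows from `fillna(0)` being applied to all three columns, so a missing outcome is modeled as 0%. The sketch below contrasts that with one alternative: zero-fill only the predictors and drop rows with no outcome. The `toy` frame and its values are made up for illustration and are not the project data.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the regression table; the values are made up.
toy = pd.DataFrame(
    {
        "num_health_center_per_area": [np.nan, 0.003, 0.002, np.nan],
        "num_parks_per_area": [0.006, 0.027, 0.008, 0.004],
        "Percentage": [0.28, 0.14, np.nan, 0.22],
    }
)

# Approach used in the cell above: every NaN becomes 0, including the outcome.
filled_all = toy.fillna(0)

# Alternative: zero-fill only the count-based predictors, where NaN means
# "no facility matched in the spatial join", then drop rows with no outcome.
predictors = ["num_health_center_per_area", "num_parks_per_area"]
filled_some = toy.copy()
filled_some[predictors] = filled_some[predictors].fillna(0)
filled_some = filled_some.dropna(subset=["Percentage"])

print(len(filled_all), len(filled_some))
```

Whether to drop or impute a missing outcome is a modeling decision; the point here is only that a zero-filled outcome is a data point the regression will try to fit.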

R-squared measures the proportion of the variance in the dependent variable that is explained by the independent variables. Here the R-squared is 0.025, so the two amenity densities explain only about 2.5% of the variability in Percentage. The negative adjusted R-squared (-0.106) means the predictors add less explanatory power than they cost in degrees of freedom.
The p-value for each coefficient tests the null hypothesis that the coefficient is zero; a high p-value (> 0.05) indicates that the corresponding independent variable is not statistically significant. The overall F-test (F = 0.1884, p = 0.830) likewise fails to reject the hypothesis that both slope coefficients are zero.
In summary, this analysis finds little evidence of a linear relationship between the independent variables num_health_center_per_area and num_parks_per_area and the dependent variable Percentage. Across the 18 observations, the density of health centers and parks in a planning district is not meaningfully associated with how residents rate their health.
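
As a quick consistency check, the adjusted R-squared and F-statistic can be recomputed from the R-squared, the number of observations, and the number of predictors printed in the summary. This is a sketch using the standard textbook formulas; because the printed R-squared is rounded to three decimals, the results match the reported -0.106 and 0.1884 only approximately.

```python
# Values read from the OLS summary above (R-squared is rounded to 3 decimals).
r_squared = 0.025
n_obs = 18   # No. Observations
k = 2        # Df Model: predictors, excluding the constant

# Residual degrees of freedom; matches "Df Residuals: 15" in the summary.
df_resid = n_obs - k - 1

# Adjusted R-squared penalizes R-squared for the number of predictors.
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / df_resid

# Overall F-statistic: explained variance per predictor over
# unexplained variance per residual degree of freedom.
f_statistic = (r_squared / k) / ((1 - r_squared) / df_resid)

print(f"adjusted R-squared: {adj_r_squared:.3f}")
print(f"F-statistic: {f_statistic:.3f}")
```

With only 18 observations and two predictors, the penalty factor (n - 1) / (n - k - 1) is 17/15, which is enough to push a small R-squared below zero after adjustment.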

Although the three indicators are only weakly correlated, it could still be meaningful to combine them into a Community Health Index that aggregates the number of health centers, the number of parks, and the percentage of adults rating their health as "fair" or "poor".
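
One possible way to build such an index, offered only as a sketch: min-max scale each indicator to [0, 1], invert Percentage so that higher always means better, and average the three with equal weights. The equal weights, the scaling choice, the `min_max` helper, the `health_index` column name, and the toy values are all assumptions for illustration, not the method used in this project.

```python
import pandas as pd

# Hypothetical districts with made-up values.
toy = pd.DataFrame(
    {
        "DIST_NAME": ["A", "B", "C"],
        "num_health_center_per_area": [0.000, 0.003, 0.001],
        "num_parks_per_area": [0.006, 0.027, 0.004],
        "Percentage": [0.28, 0.14, 0.22],
    }
)


def min_max(series: pd.Series) -> pd.Series:
    """Rescale linearly so the minimum maps to 0 and the maximum to 1.

    Assumes the series is not constant; a zero span would divide by zero.
    """
    span = series.max() - series.min()
    return (series - series.min()) / span


scaled = pd.DataFrame(
    {
        "health_centers": min_max(toy["num_health_center_per_area"]),
        "parks": min_max(toy["num_parks_per_area"]),
        # Invert: a high share of fair/poor ratings should lower the index.
        "self_rated": 1 - min_max(toy["Percentage"]),
    }
)

# Equal-weight average of the three scaled indicators.
toy["health_index"] = scaled.mean(axis=1)
print(toy[["DIST_NAME", "health_index"]])
```

In this toy data, district B has the most health centers and parks and the lowest share of fair or poor ratings, so it receives the highest index value.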