---
format: 
  html:
    toc: false
    page-layout: full
execute:
    echo: false
---

<div class="text-box">
    
# 2.3 Statistical Analysis and Heatmap 
    
</div>

In [8]:
import altair as alt
import folium
import geopandas as gpd
import hvplot.pandas  # registers the .hvplot accessor on pandas objects
import numpy as np
import pandas as pd
import panel as pn
import requests
import seaborn as sns
import xyzservices
from matplotlib import pyplot as plt
from scipy.stats import pearsonr



<div class="text-box">
    
## 2.3.1 Load Data and Counting
    
Here I'll
    
1) Load the buildings per tract data frame from part **2.1**
    
2) Group buildings by census tracts
    
3) Create a count column with the number of college buildings per census tract. 
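
The counting logic in steps 2 and 3 can be sketched without any geospatial libraries. The tract IDs below are made-up placeholders, not values from the real data. The point is only that tracts with no matched building need an explicit zero, which is the role a left merge followed by `fillna(0)` plays in a pandas pipeline.

```python
from collections import Counter

# Hypothetical tract IDs, one entry per building that fell inside a tract.
building_tracts = ["000100", "000100", "000200", "000100"]

# Hypothetical list of every tract in the study area.
all_tracts = ["000100", "000200", "000300"]

# Group buildings by tract and count them.
counts = Counter(building_tracts)

# Tracts without buildings are absent from `counts`, so default them to 0.
building_count = {tract: counts.get(tract, 0) for tract in all_tracts}

print(building_count)
```

Tract `000300` has no buildings and would be missing from the result if the counts were not matched back against the full tract list.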

In [2]:
import geopandas as gpd
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import pearsonr
from scipy.stats import spearmanr
import numpy as np

In [4]:
#| echo: true
#| code-fold: true

# Load the college buildings and the tract-level demographics.
college_buildings = gpd.read_file("Universities_Colleges.geojson")
demographics_with_tracts = gpd.read_file("identity_with_tracts.geojson")

# Put both layers in the same CRS before joining.
if college_buildings.crs != demographics_with_tracts.crs:
    demographics_with_tracts = demographics_with_tracts.to_crs(college_buildings.crs)

# Drop invalid geometries, which can break the spatial join.
college_buildings = college_buildings[college_buildings.is_valid]
demographics_with_tracts = demographics_with_tracts[demographics_with_tracts.is_valid]

# Attach tract attributes to every building that lies within a tract.
buildings_per_tract = gpd.sjoin(
    college_buildings,
    demographics_with_tracts,
    how="left",
    predicate="within",
)

# Count buildings per tract. size() counts rows, so the result does not
# depend on which building column happens to be non-null.
counts = (
    buildings_per_tract
    .groupby("tract")
    .size()
    .reset_index(name="building_count")
)

# Keep the demographic columns needed later, plus the tract geometry.
all_tracts_df = demographics_with_tracts[[
    'Total Population', 'Median Household Income',
    'Black and Latino/Hispanic', 'Bachelors Degree or Higher',
    'Associates Degree', 'Masters Degree', 'White and Latino/Hispanic',
    'state', 'county', 'tract', 'geometry', 'White Alone (Total)', 'Black or African American Alone',
    'Hispanic or Latino Total', 'Puerto Rican', 'Dominican',
]].copy()

# Normalize the join key so both sides compare as clean strings.
all_tracts_df['tract'] = all_tracts_df['tract'].astype(str).str.strip()
counts['tract'] = counts['tract'].astype(str).str.strip()

# A left merge keeps every tract; tracts with no buildings get a count of 0.
result = all_tracts_df.merge(counts, on="tract", how="left")
result["building_count"] = result["building_count"].fillna(0).astype(int)

# Frequency table: how many tracts have 0, 1, 2, ... college buildings.
building_freq = result['building_count'].value_counts().sort_index().reset_index()
building_freq.columns = ['Building Count', 'Frequency']

result.head()







|   | Total Population | Median Household Income | Black and Latino/Hispanic | Bachelors Degree or Higher | Associates Degree | Masters Degree | White and Latino/Hispanic | state | county | tract | geometry | White Alone (Total) | Black or African American Alone | Hispanic or Latino Total | Puerto Rican | Dominican | building_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4098 | 80470 | 60 | 1083 | 177 | 518 | 475 | 42 | 101 | 2701 | POLYGON ((-75.15600 39.92553, -75.15591 39.925... | 2864 | 428 | 785 | 141 | 0 | 0 |
| 1 | 4300 | 76060 | 0 | 991 | 338 | 427 | 117 | 42 | 101 | 2702 | POLYGON ((-75.15284 39.92511, -75.15277 39.925... | 3680 | 124 | 325 | 168 | 0 | 0 |
| 2 | 4452 | 65847 | 18 | 555 | 404 | 328 | 499 | 42 | 101 | 2801 | POLYGON ((-75.15910 39.92593, -75.15902 39.926... | 2119 | 312 | 1203 | 57 | 0 | 0 |
| 3 | 5772 | 67585 | 289 | 1566 | 91 | 698 | 289 | 42 | 101 | 2802 | POLYGON ((-75.16707 39.92680, -75.16693 39.926... | 3718 | 510 | 685 | 107 | 0 | 0 |
| 4 | 3762 | 66932 | 35 | 865 | 80 | 600 | 290 | 42 | 101 | 2900 | POLYGON ((-75.16949 39.92560, -75.16923 39.926... | 3018 | 85 | 544 | 159 | 35 | 0 |


</div>

<div class="text-box">

## 2.3.2 Recode Building Counts as a Binary Variable

Many census tracts across Philadelphia contain no college buildings. 

To address this, I recode the building count as a binary variable (`buildings_cat`) that is 1 for census tracts with one or more college buildings and 0 for tracts with none. This makes it easier to analyze how the presence of college buildings relates to the other variables of interest in each census tract. 
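
Once the count is recoded to 0/1, its Pearson correlation with a continuous variable is the point-biserial correlation, which depends only on the gap between the two group means, the overall spread, and the group sizes. The sketch below uses invented numbers (not tract data) and hypothetical helper names to show that the two calculations agree.

```python
import math

def pearson_r(x, y):
    """Plain Pearson correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def point_biserial(binary, values):
    """(mean of group 1 - mean of group 0) / population SD * sqrt(p * q)."""
    n = len(values)
    group1 = [v for b, v in zip(binary, values) if b == 1]
    group0 = [v for b, v in zip(binary, values) if b == 0]
    mean_all = sum(values) / n
    sd_pop = math.sqrt(sum((v - mean_all) ** 2 for v in values) / n)
    p = len(group1) / n  # share of observations coded 1
    q = 1 - p            # share of observations coded 0
    mean1 = sum(group1) / len(group1)
    mean0 = sum(group0) / len(group0)
    return (mean1 - mean0) / sd_pop * math.sqrt(p * q)

# Invented example: 1 = tract has a college building, 0 = it does not.
has_building = [0, 0, 0, 1, 1, 0, 1, 0]
income = [52, 61, 48, 70, 66, 55, 59, 50]  # hypothetical, in $1,000s

r_pearson = pearson_r(has_building, income)
r_pb = point_biserial(has_building, income)
print(round(r_pearson, 6), round(r_pb, 6))
```

A coefficient involving `buildings_cat` can therefore be read as a standardized difference in means between tracts with and without a college building.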

In [11]:
#| echo: true
#| code-fold: true



# Coerce the demographic columns to numeric; unparseable values become NaN.
numeric_cols = [
    "Median Household Income",
    "Bachelors Degree or Higher",
    "Associates Degree",
    "Masters Degree",
    "White and Latino/Hispanic",
    "Black and Latino/Hispanic",
    'White Alone (Total)', 'Black or African American Alone',
    'Hispanic or Latino Total', 'Puerto Rican', 'Dominican'
]
for col in numeric_cols:
    result[col] = pd.to_numeric(result[col], errors='coerce')

tracts_all_df = result.copy()

# Binary indicator: 1 if the tract has at least one college building, else 0.
tracts_all_df["buildings_cat"] = (tracts_all_df["building_count"] > 0).astype(int)




</div>

<div class="text-box">
    
## 2.3.3 Statistical Analysis and Altair Heatmap

Finally, I'll 

1) Compute the correlation, p-value, and standard error for every pair of variables in my data frame

2) Plot these correlations on an Altair heatmap with an interactive tooltip that shows the correlation, p-value, and standard error for each pair of variables. 
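
For reference, a common standard error for a Pearson coefficient is SE = sqrt((1 - r²) / (n - 2)). Dividing r by it gives the t statistic (with n - 2 degrees of freedom) on which the usual two-sided test of r = 0 is based. A minimal sketch with a hypothetical helper name and placeholder inputs; the sample size below is not the number of tracts in this analysis.

```python
import math

def pearson_se_and_t(r, n_pairs):
    """Return (standard error, t statistic) for a Pearson r from n_pairs observations."""
    if n_pairs < 3:
        raise ValueError("need at least 3 paired observations")
    se = math.sqrt((1 - r ** 2) / (n_pairs - 2))
    # A perfect correlation has zero standard error, so t is unbounded.
    t_stat = r / se if se > 0 else math.copysign(math.inf, r)
    return se, t_stat

# Placeholder inputs for illustration only.
se, t_stat = pearson_se_and_t(r=0.30, n_pairs=102)
print(f"SE = {se:.4f}, t = {t_stat:.2f}")
```

For a fixed r, the standard error shrinks as the number of paired observations grows, so even a small coefficient can produce a small p-value in a large sample.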
    
    

In [15]:
#| echo: true
#| code-fold: true


# Variables to include in the correlation matrix.
corr_numeric_cols = [
    "Median Household Income",
    "Bachelors Degree or Higher",
    "Associates Degree",
    "buildings_cat",
    "Black and Latino/Hispanic",
    "White and Latino/Hispanic",
    "Hispanic or Latino Total",
]

n = len(corr_numeric_cols)

# Diagonal entries: a variable correlates perfectly with itself (r = 1),
# with the p-value and standard error set to 0.
corr_matrix = np.eye(n)
pval_matrix = np.zeros((n, n))
stderr_matrix = np.zeros((n, n))

# Fill the upper triangle and mirror each value into the lower triangle.
for i in range(n):
    for j in range(i + 1, n):
        # Pairwise deletion: drop rows where either variable is missing.
        pair_df = tracts_all_df[[corr_numeric_cols[i], corr_numeric_cols[j]]].dropna()
        n_pairs = len(pair_df)

        if n_pairs < 3:
            # Too few observations for a meaningful estimate.
            r = p = stderr = np.nan
        else:
            r, p = pearsonr(pair_df[corr_numeric_cols[i]], pair_df[corr_numeric_cols[j]])
            # Standard error of Pearson's r with n_pairs - 2 degrees of freedom.
            stderr = np.sqrt((1 - r**2) / (n_pairs - 2))

        corr_matrix[i, j] = corr_matrix[j, i] = r
        pval_matrix[i, j] = pval_matrix[j, i] = p
        stderr_matrix[i, j] = stderr_matrix[j, i] = stderr

corr_df = pd.DataFrame(corr_matrix, columns=corr_numeric_cols, index=corr_numeric_cols)
pval_df = pd.DataFrame(pval_matrix, columns=corr_numeric_cols, index=corr_numeric_cols)
stderr_df = pd.DataFrame(stderr_matrix, columns=corr_numeric_cols, index=corr_numeric_cols)

corr_df

|   | Median Household Income | Bachelors Degree or Higher | Associates Degree | buildings_cat | Black and Latino/Hispanic | White and Latino/Hispanic | Hispanic or Latino Total |
|---|---|---|---|---|---|---|---|
| Median Household Income | 1.0 | 0.607788 | 0.008413 | 0.079966 | -0.18144 | -0.093035 | -0.177877 |
| Bachelors Degree or Higher | 0.607788 | 1.0 | 0.196838 | 0.000716 | -0.100238 | -0.007511 | -0.131714 |
| Associates Degree | 0.008413 | 0.196838 | 1.0 | -0.191563 | 0.17122 | 0.255804 | 0.238996 |
| buildings_cat | 0.079966 | 0.000716 | -0.191563 | 1.0 | -0.159477 | -0.12722 | -0.150066 |
| Black and Latino/Hispanic | -0.18144 | -0.100238 | 0.17122 | -0.159477 | 1.0 | 0.405201 | 0.559506 |
| White and Latino/Hispanic | -0.093035 | -0.007511 | 0.255804 | -0.12722 | 0.405201 | 1.0 | 0.887142 |
| Hispanic or Latino Total | -0.177877 | -0.131714 | 0.238996 | -0.150066 | 0.559506 | 0.887142 | 1.0 |


In [16]:
#| echo: true
#| code-fold: true

# Reshape each square matrix to long form: one row per variable pair.
corr_long = corr_df.stack().reset_index()
corr_long.columns = ["Variable1", "Variable2", "Correlation"]

pval_long = pval_df.stack().reset_index()
pval_long.columns = ["Variable1", "Variable2", "p_value"]

stderr_long = stderr_df.stack().reset_index()
stderr_long.columns = ["Variable1", "Variable2", "std_err"]

# Combine the three statistics so each heatmap cell carries all of them.
merged_long = (
    corr_long
    .merge(pval_long, on=["Variable1", "Variable2"], how="left")
    .merge(stderr_long, on=["Variable1", "Variable2"], how="left")
)

# One rectangle per variable pair, colored by correlation on a fixed [-1, 1] scale.
heatmap = alt.Chart(merged_long).mark_rect().encode(
    x=alt.X('Variable1:O', sort=sorted(merged_long["Variable1"].unique())),
    y=alt.Y('Variable2:O', sort=sorted(merged_long["Variable2"].unique()), scale=alt.Scale(reverse=True)),
    color=alt.Color('Correlation:Q',
                    scale=alt.Scale(scheme='redblue', domain=(-1, 1))),
    # The tooltip shows all three statistics for the hovered cell.
    tooltip=[
        alt.Tooltip('Variable1:N'),
        alt.Tooltip('Variable2:N'),
        alt.Tooltip('Correlation:Q', format=".3f"),
        alt.Tooltip('p_value:Q', format=".3g"),
        alt.Tooltip('std_err:Q', format=".3g")
    ]
).properties(
    width=450,
    height=450,
    title="Correlation Heatmap (Altair) with p-values & Std. Error"
)

heatmap.display()

</div>
<div class="text-box">
    
## Analysis 

    
Based on the correlation heatmap, the relationship between post-secondary attainment and the presence of post-secondary school buildings within Philadelphia's census tracts appears to be minimal. 

Specifically, the presence of college buildings (`buildings_cat`) shows a correlation of almost 0 with bachelor's degrees or higher (r = 0.001) and a weak negative correlation with associate degrees (r = -0.192), indicating that having a college building in a tract is not associated with higher post-secondary attainment. 

When distinguishing between Black-Latines and White-Latines, Black and Latino/Hispanic populations show a modest positive correlation with associate degree attainment (r = 0.171), whereas White and Latino/Hispanic populations show a somewhat stronger positive correlation (r = 0.256). 

However, neither group shows a positive relationship with bachelor's degrees (r = -0.100 and r = -0.008, respectively), and the presence of college buildings does not appear to explain these educational outcomes. These findings suggest that factors other than proximity to post-secondary institutions may play a larger role in educational attainment within Latino communities, with tracts that have larger White-Latine populations showing a somewhat stronger association with associate degree attainment than tracts with larger Black-Latine populations.

Moreover, building on our findings in 1.3, the Black and Latino/Hispanic population has a modest negative correlation with Median Household Income (r = -0.181): tracts with higher median household income tend to have smaller Black-Latine populations. The White and Latino/Hispanic population, by comparison, has only a very weak negative correlation with Median Household Income (r = -0.093). 
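
One way to put the size of these coefficients in perspective is to square them: r² is the share of variance in one variable that a linear relationship with the other would account for. The short calculation below uses only coefficients taken from the correlation matrix in 2.3.3, rounded to three decimals.

```python
# Coefficients from the correlation matrix in 2.3.3 (rounded to three decimals).
reported_r = {
    "buildings_cat vs. Bachelors Degree or Higher": 0.001,
    "buildings_cat vs. Associates Degree": -0.192,
    "Black and Latino/Hispanic vs. Associates Degree": 0.171,
    "White and Latino/Hispanic vs. Associates Degree": 0.256,
    "Black and Latino/Hispanic vs. Median Household Income": -0.181,
    "White and Latino/Hispanic vs. Median Household Income": -0.093,
}

# r squared, expressed as a percentage of shared variance.
shared_variance_pct = {pair: 100 * r ** 2 for pair, r in reported_r.items()}

for pair, pct in shared_variance_pct.items():
    print(f"{pair}: r^2 = {pct:.1f}%")
```

Even the largest of these coefficients corresponds to under 7% of shared variance, which is consistent with reading all of these relationships as weak.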

</div>