---
format: 
  html:
    toc: false
    page-layout: full
execute:
    echo: false
---

<div class="text-box">
    
# 2.3 Bar Graphs of Tracts with College Buildings vs Tracts Without College Buildings
    
</div>

In [10]:
import altair as alt
import folium
import geopandas as gpd
import hvplot.pandas
import numpy as np
import pandas as pd
import panel as pn
import requests
import seaborn as sns
import xyzservices
from matplotlib import pyplot as plt
from scipy.stats import pearsonr


<div class="text-box">
    
## 2.3.1 Load Data and Counting
    
Here I'll:

1) Load the college buildings and the tract-level demographics from part **2.1**

2) Spatially join the buildings to census tracts and group them by tract

3) Create a count column counting the college buildings per census tract.
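
The counting and merging steps can be sketched on toy data with plain pandas. The building names and tract IDs below are made up for illustration, and the sketch assumes the spatial join has already assigned each building a tract:

```python
import pandas as pd

# Hypothetical spatial-join output: one row per building with its tract
buildings = pd.DataFrame({
    "name": ["Hall A", "Hall B", "Library", "Gym"],
    "tract": ["2701", "2701", "2802", None],  # None: building matched no tract
})

# Hypothetical list of every tract, including tracts with no buildings
all_tracts = pd.DataFrame({"tract": ["2701", "2702", "2802"]})

# Count buildings per tract; groupby drops rows whose tract is missing
counts = buildings.groupby("tract").size().reset_index(name="building_count")

# Left-merge keeps tracts without buildings, whose counts are then set to 0
result = all_tracts.merge(counts, on="tract", how="left")
result["building_count"] = result["building_count"].fillna(0).astype(int)

print(result)
```

The left merge is what keeps zero-building tracts in the table; an inner merge would silently drop them.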

In [11]:
#| echo: true
#| code-fold: true

# Load the college buildings layer and the tract-level demographics
college_buildings = gpd.read_file("Universities_Colleges.geojson")
demographics_with_tracts = gpd.read_file("identity_with_tracts.geojson")

# Put both layers in the same coordinate reference system before joining
if college_buildings.crs != demographics_with_tracts.crs:
    demographics_with_tracts = demographics_with_tracts.to_crs(college_buildings.crs)

# Keep only rows with valid geometries
college_buildings = college_buildings[college_buildings.is_valid]
demographics_with_tracts = demographics_with_tracts[demographics_with_tracts.is_valid]

# Attach tract attributes to each building that lies within a tract
buildings_per_tract = gpd.sjoin(
    college_buildings,
    demographics_with_tracts,
    how="left",
    predicate="within",
)




# Pick a column to count; .count() only tallies its non-null values
if 'NAME_left' in buildings_per_tract.columns:
    count_col = 'NAME_left'
elif 'BUILDING_ID' in buildings_per_tract.columns:
    count_col = 'BUILDING_ID'  # Replace with the actual building identifier
else:
    # Fall back to the first column of the joined frame
    count_col = buildings_per_tract.columns[0]

counts = (
    buildings_per_tract
    .groupby("tract")[count_col]
    .count()
    .reset_index()
)

counts.rename(columns={count_col: "building_count"}, inplace=True)


all_tracts_df = demographics_with_tracts[[
    'Total Population', 'Median Household Income',
    'Black and Latino/Hispanic', 'Bachelors Degree or Higher',
    'Associates Degree', 'Masters Degree', 'White and Latino/Hispanic',
    'state', 'county', 'tract','geometry', 'White Alone (Total)', 'Black or African American Alone',
    'Hispanic or Latino Total', 'Puerto Rican','Dominican',
]].copy()


all_tracts_df['tract'] = all_tracts_df['tract'].astype(str).str.strip()
counts['tract'] = counts['tract'].astype(str).str.strip()


result = all_tracts_df.merge(counts, on="tract", how="left")


result["building_count"] = result["building_count"].fillna(0).astype(int)




building_freq = result['building_count'].value_counts().sort_index().reset_index()


building_freq.columns = ['Building Count', 'Frequency']




result.head()







| | Total Population | Median Household Income | Black and Latino/Hispanic | Bachelors Degree or Higher | Associates Degree | Masters Degree | White and Latino/Hispanic | state | county | tract | geometry | White Alone (Total) | Black or African American Alone | Hispanic or Latino Total | Puerto Rican | Dominican | building_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4098 | 80470 | 60 | 1083 | 177 | 518 | 475 | 42 | 101 | 2701 | POLYGON ((-75.15600 39.92553, -75.15591 39.925... | 2864 | 428 | 785 | 141 | 0 | 0 |
| 1 | 4300 | 76060 | 0 | 991 | 338 | 427 | 117 | 42 | 101 | 2702 | POLYGON ((-75.15284 39.92511, -75.15277 39.925... | 3680 | 124 | 325 | 168 | 0 | 0 |
| 2 | 4452 | 65847 | 18 | 555 | 404 | 328 | 499 | 42 | 101 | 2801 | POLYGON ((-75.15910 39.92593, -75.15902 39.926... | 2119 | 312 | 1203 | 57 | 0 | 0 |
| 3 | 5772 | 67585 | 289 | 1566 | 91 | 698 | 289 | 42 | 101 | 2802 | POLYGON ((-75.16707 39.92680, -75.16693 39.926... | 3718 | 510 | 685 | 107 | 0 | 0 |
| 4 | 3762 | 66932 | 35 | 865 | 80 | 600 | 290 | 42 | 101 | 2900 | POLYGON ((-75.16949 39.92560, -75.16923 39.926... | 3018 | 85 | 544 | 159 | 35 | 0 |


</div>

<div class="text-box">

## 2.3.2 Categorizing Tracts by Building Presence

Many census tracts across Philadelphia contain no college buildings.

To address this, I will transform the building count into a binary categorical variable that distinguishes census tracts with one or more college buildings from those with none. This makes it easier to analyze the relationship between the presence of college buildings and the other variables of interest in each census tract. I'll also split the tracts into two subsets: those with more White Latines than Black Latines, and those with more Black Latines than White Latines.
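
As a minimal sketch of this recoding, along with the subset split used in the cell that follows, here is the same logic on a made-up four-tract table. The numbers are illustrative only; the column names mirror the ones used in this section:

```python
import pandas as pd

# Toy tract table (values are invented for illustration)
tracts = pd.DataFrame({
    "tract": ["2701", "2702", "2802", "2900"],
    "building_count": [0, 3, 1, 0],
    "White and Latino/Hispanic": [475, 117, 289, 10],
    "Black and Latino/Hispanic": [60, 0, 289, 35],
})

# 1 if the tract has at least one college building, else 0
tracts["buildings_cat"] = (tracts["building_count"] > 0).astype(int)

# Strict comparisons: a tract with equal counts lands in neither subset
white_greater = tracts[
    tracts["White and Latino/Hispanic"] > tracts["Black and Latino/Hispanic"]
]
black_greater = tracts[
    tracts["Black and Latino/Hispanic"] > tracts["White and Latino/Hispanic"]
]

print(tracts[["tract", "building_count", "buildings_cat"]])
```

Because both comparisons are strict, tied tracts (tract 2802 in this toy table) are excluded from both subsets, which is worth keeping in mind when comparing subset sizes to the citywide total.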

In [12]:
#| echo: true
#| code-fold: true



numeric_cols = [
    "Median Household Income",
    "Bachelors Degree or Higher",
    "Associates Degree",
    "White and Latino/Hispanic",
    "Black and Latino/Hispanic",
]
for col in numeric_cols:
    result[col] = pd.to_numeric(result[col], errors='coerce')


tracts_all_df = result.copy()



# Binary indicator: 1 if the tract has at least one college building, else 0
tracts_all_df["buildings_cat"] = (tracts_all_df["building_count"] > 0).astype(int)



white_greater_df = tracts_all_df[tracts_all_df["White and Latino/Hispanic"] > tracts_all_df["Black and Latino/Hispanic"]]
black_greater_df = tracts_all_df[tracts_all_df["Black and Latino/Hispanic"] > tracts_all_df["White and Latino/Hispanic"]]




</div>

<div class="text-box">
    
    
## 2.3.3 Bar Graphs Of Means

Now I'll build three bar graphs with Altair showing how socioeconomic status (Median Household Income) and post-secondary achievement (Associates and Bachelors degrees) differ between census tracts that have a post-secondary building and those that do not. Each bar chart has two sets of bars (Building vs No Building), and the charts differ in which tracts are included in the analysis. The tracts are separated as follows:

a) Bar chart containing all census tracts in Philadelphia

b) Bar chart of tracts with more Black Latines than White Latines

c) Bar chart of tracts with more White Latines than Black Latines
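
Before plotting, each chart needs the group means in long format (one row per building category and variable). A small sketch of that reshaping on invented numbers, using only pandas:

```python
import pandas as pd

# Toy data: two tracts without buildings, two with (values are invented)
toy = pd.DataFrame({
    "buildings_cat": [0, 0, 1, 1],
    "Median Household Income": [50000, 70000, 60000, 80000],
    "Bachelors Degree or Higher": [400, 600, 500, 900],
})
value_cols = ["Median Household Income", "Bachelors Degree or Higher"]

# Mean of each variable within each building category
mean_df = toy.groupby("buildings_cat")[value_cols].mean().reset_index()
mean_df["Building Category"] = mean_df["buildings_cat"].map(
    {0: "No Buildings", 1: "Has Buildings"}
)

# Wide to long: one row per (category, variable) pair, ready for a grouped bar chart
mean_long = mean_df.melt(
    id_vars=["Building Category"],
    value_vars=value_cols,
    var_name="Variable",
    value_name="Mean",
)

print(mean_long)
```

Note that income (tens of thousands of dollars) and degree counts (hundreds of people) sit on very different scales, so on a shared y-axis the degree bars will look small next to the income bars.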
    


In [21]:
#| echo: true
#| code-fold: true


def create_mean_plot(df, title):
    """Grouped bar chart of mean values for tracts with vs. without college buildings."""
    selected_cols = [
        "Median Household Income",
        "Bachelors Degree or Higher",
        "Associates Degree",
    ]

    # Work on a copy of the input and make sure the plotted columns are numeric
    df = df.copy()
    for col in selected_cols:
        df[col] = pd.to_numeric(df[col], errors="coerce")

    # Drop tracts that are missing any plotted value
    df_clean = df.dropna(subset=selected_cols + ["buildings_cat"])

    # Mean of each variable per building category
    mean_df = df_clean.groupby("buildings_cat")[selected_cols].mean().reset_index()
    mean_df["Building Category"] = mean_df["buildings_cat"].map({0: "No Buildings", 1: "Has Buildings"})

    # Long format: one row per (category, variable) pair
    mean_long = mean_df.melt(
        id_vars=["Building Category"],
        value_vars=selected_cols,
        var_name="Variable",
        value_name="Mean"
    )

    color_scale = alt.Scale(
        domain=selected_cols,
        range=["#1f77b4", "#ff7f0e", "#2ca02c"]
    )

    # Encodings shared by the bars and their text labels
    base = alt.Chart(mean_long).encode(
        x=alt.X("Building Category:N", title="Building Category"),
        y=alt.Y("Mean:Q", title="Mean Value"),
        color=alt.Color("Variable:N", scale=color_scale, title="Variable"),
        xOffset="Variable:N",
        tooltip=[
            alt.Tooltip("Building Category:N", title="Building Category"),
            alt.Tooltip("Variable:N"),
            alt.Tooltip("Mean:Q", format=".2f")
        ]
    ).properties(
        width=300,
        height=400,
        title=title
    )

    bars = base.mark_bar()
    text = base.mark_text(
        align='center',
        baseline='bottom',
        dy=-5
    ).encode(
        text=alt.Text('Mean:Q', format=".0f")
    )

    # Layer the labels on the bars, then make the combined chart interactive
    return (bars + text).interactive()


In [22]:


Philadelphia_plot = create_mean_plot(
    tracts_all_df,
    "Mean Socioeconomic Variables by Building Category (All of Philadelphia)"
)

white_plot = create_mean_plot(
    white_greater_df,
    "Mean Socioeconomic Variables by Building Category (White Latino > Black Latino)"
)

black_plot = create_mean_plot(
    black_greater_df,
    "Mean Socioeconomic Variables by Building Category (Black Latino > White Latino)"
)

# Place the three charts side by side on a shared y-axis
final_chart = alt.hconcat(Philadelphia_plot, black_plot, white_plot).resolve_scale(
    y='shared'
)

final_chart



</div>



<div class="text-box">

## Analysis 
    
In contrast to my initial insights from part 2.2, the Median Household Income is not lower in tracts that contain a college building. In fact, it appears to be slightly higher than in census tracts without a college building.

Moreover, there is little variation in the mean values of Associates and Bachelors degrees between census tracts that have a building and those that do not, suggesting that the presence of a college building in the nearby area does little to mediate post-secondary achievement rates and Median Household Income.
    
    

To answer the second question:

"How is post-secondary achievement rate (Associate/Bachelors Degree) influenced by living in the same tract as a post-secondary school building(s)?

a) How does this trend differ between tracts that have more Black/Latines than White/Latines, and vice versa?"

**Overall, the presence of college buildings does not appear to meaningfully affect post-secondary achievement rates, and this does not differ between census tracts that have more Black/Latines than White/Latines and vice versa.**
    
    


Additionally, Black and Latino/Hispanic predominant tracts display lower socioeconomic indicators than the city average and than tracts with a higher White Latino population. Specifically, they have a lower mean number of individuals with a Bachelor’s degree but a higher mean number with an Associate’s degree. By contrast, White and Latino/Hispanic predominant tracts show higher median household incomes and a greater number of individuals holding a Bachelor’s degree, indicating better economic opportunities and higher educational attainment.

    

    



Demographic composition may actually play a more crucial role in shaping educational and economic landscapes within census tracts, and future studies should continue to analyze other attributes that influence post-secondary achievement within different Latine racial groups.
    
Moreover, future studies should also investigate more intricate spatial statistical analysis to analyze..... 
    
    
    
However, the weak relationship between living in a census tract with a post-secondary building and post-secondary achievement can also inform post-secondary institutions as they begin advancing outreach efforts into their local communities.




</div>



<div class="text-box">
    
## 2.3.4 Statistical Analysis and Altair Heatmap

Finally, to further investigate the relationships between educational attainment, household income, and the presence of college buildings, I'll:

1) Compute the correlation, p-value, and standard error for each pair among Associates Degree, Bachelors Degree, Presence of Building in Tract, and Median Household Income for the entirety of Philadelphia

2) Plot these correlations on an Altair heatmap with an interactive tooltip that shows the correlation, p-value, and standard error for each pair of variables

3) Repeat the previous two steps for tracts that have larger White Latine populations and tracts that have larger Black Latine populations
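
For reference, the standard error used in the next cell is the common approximation `sqrt((1 - r**2) / (n - 2))`, and dividing `r` by it gives the t statistic on which the usual p-value for Pearson's r is based. A minimal NumPy sketch on made-up data (the arrays are illustrative, not tract values):

```python
import numpy as np

# Made-up paired observations
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])

n = len(x)

# Pearson correlation from the 2x2 correlation matrix
r = np.corrcoef(x, y)[0, 1]

# Approximate standard error of r, matching the formula in the cell below
std_err = np.sqrt((1 - r**2) / (n - 2))

# t statistic with n - 2 degrees of freedom
t_stat = r / std_err

print(f"r={r:.3f}, std_err={std_err:.3f}, t={t_stat:.3f}")
```

With several hundred tracts, the standard error is small even for weak correlations, so a small p-value by itself does not imply a strong relationship.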
    
    

In [6]:


corr_numeric_cols = [
    "Median Household Income",
    "Bachelors Degree or Higher",
    "Associates Degree",
    "buildings_cat",
    "White and Latino/Hispanic",
    "Black and Latino/Hispanic"
    
]


n = len(corr_numeric_cols)
corr_matrix = np.zeros((n, n))
pval_matrix = np.zeros((n, n))
stderr_matrix = np.zeros((n, n)) 


for i in range(n):
    for j in range(n):
        if i == j:
            corr_matrix[i, j] = 1.0
            pval_matrix[i, j] = 0.0
            stderr_matrix[i, j] = 0.0
        elif i < j:
            pair_df = tracts_all_df[[corr_numeric_cols[i], corr_numeric_cols[j]]].dropna()
            if len(pair_df) < 3:
                corr_matrix[i, j] = np.nan
                corr_matrix[j, i] = np.nan
                pval_matrix[i, j] = np.nan
                pval_matrix[j, i] = np.nan
                stderr_matrix[i, j] = np.nan
                stderr_matrix[j, i] = np.nan
            else:
                x = pair_df[corr_numeric_cols[i]]
                y = pair_df[corr_numeric_cols[j]]
                r, p = pearsonr(x, y)
                
                corr_matrix[i, j] = r
                corr_matrix[j, i] = r
                pval_matrix[i, j] = p
                pval_matrix[j, i] = p
                
                n_pairs = len(pair_df)
                
                stderr = np.sqrt((1 - r**2) / (n_pairs - 2))
                stderr_matrix[i, j] = stderr
                stderr_matrix[j, i] = stderr


corr_df = pd.DataFrame(corr_matrix, columns=corr_numeric_cols, index=corr_numeric_cols)
pval_df = pd.DataFrame(pval_matrix, columns=corr_numeric_cols, index=corr_numeric_cols)
stderr_df = pd.DataFrame(stderr_matrix, columns=corr_numeric_cols, index=corr_numeric_cols)

corr_df

| | Median Household Income | Bachelors Degree or Higher | Associates Degree | buildings_cat | White and Latino/Hispanic | Black and Latino/Hispanic |
|---|---|---|---|---|---|---|
| Median Household Income | 1.0 | 0.607788 | 0.008413 | 0.079966 | -0.093035 | -0.18144 |
| Bachelors Degree or Higher | 0.607788 | 1.0 | 0.196838 | 0.000716 | -0.007511 | -0.100238 |
| Associates Degree | 0.008413 | 0.196838 | 1.0 | -0.191563 | 0.255804 | 0.17122 |
| buildings_cat | 0.079966 | 0.000716 | -0.191563 | 1.0 | -0.12722 | -0.159477 |
| White and Latino/Hispanic | -0.093035 | -0.007511 | 0.255804 | -0.12722 | 1.0 | 0.405201 |
| Black and Latino/Hispanic | -0.18144 | -0.100238 | 0.17122 | -0.159477 | 0.405201 | 1.0 |


In [7]:


merged_long = pd.concat([
    corr_df.stack().rename('Correlation'),
    pval_df.stack().rename('p_value'),
    stderr_df.stack().rename('std_err')
], axis=1).reset_index().rename(columns={'level_0': 'Variable1', 'level_1': 'Variable2'})


heatmap = alt.Chart(merged_long).mark_rect().encode(
    x=alt.X('Variable1:O', sort=sorted(merged_long["Variable1"].unique())),
    y=alt.Y('Variable2:O', sort=sorted(merged_long["Variable2"].unique()), scale=alt.Scale(reverse=True)),
    color=alt.Color('Correlation:Q',
                    scale=alt.Scale(scheme='redblue', domain=(-1, 1))),
    tooltip=[
        alt.Tooltip('Variable1:N'),
        alt.Tooltip('Variable2:N'),
        alt.Tooltip('Correlation:Q', format=".3f"),
        alt.Tooltip('p_value:Q', format=".3g"),
        alt.Tooltip('std_err:Q', format=".3g")
    ]
).properties(
    width=450,
    height=450,
    title="Correlation Heatmap (Altair) with p-values & Std. Error"
)

# Display the heatmap
heatmap.display()
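
The reshaping step in the cell above (stacking each square matrix into long format so every heatmap cell carries its own tooltip values) can be checked on a toy two-variable example. The labels and numbers here are invented:

```python
import pandas as pd

labels = ["income", "bachelors"]  # hypothetical variable names
corr = pd.DataFrame([[1.0, 0.6], [0.6, 1.0]], index=labels, columns=labels)
pval = pd.DataFrame([[0.0, 0.01], [0.01, 0.0]], index=labels, columns=labels)

# stack() turns each matrix into a Series indexed by (row label, column label);
# concat aligns the Series on that shared index
long_df = (
    pd.concat(
        [corr.stack().rename("Correlation"), pval.stack().rename("p_value")],
        axis=1,
    )
    .reset_index()
    .rename(columns={"level_0": "Variable1", "level_1": "Variable2"})
)

print(long_df)
```

The `level_0` and `level_1` names appear only because the index and columns of the matrices are unnamed; naming them beforehand would make the rename step unnecessary.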


</div>
<div class="text-box">
    
## Analysis 2

    
Based on the correlation heatmap, the relationship between post-secondary achievement rates and the presence of post-secondary school buildings within Philadelphia's census tracts appears to be minimal.

Specifically, the presence of college buildings (buildings_cat) shows a correlation of almost 0 with bachelor's degrees or higher (r = 0.001) and a weak negative correlation with associates degrees (r = -0.192), indicating that having a college building in a tract is not associated with higher post-secondary attainment.

When distinguishing between Black Latines and White Latines, Black and Latino/Hispanic populations show a weak positive correlation with associate degree attainment (r = 0.171), whereas White and Latino/Hispanic populations show a somewhat stronger, though still weak, positive correlation (r = 0.256).

Moreover, building on the findings in 1.3, Black Latino populations have a weak negative correlation with Median Household Income (r = -0.181), indicating that tracts with higher Median Household Income tend to have smaller Black Latino populations. White Latino populations, by comparison, show almost no correlation with Median Household Income (r = -0.093).
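
One way to read the correlations involving buildings_cat: because it is a 0/1 indicator, its Pearson correlation with another variable is the point-biserial correlation, which is just a rescaled difference between the two group means. A short NumPy sketch with invented numbers illustrates the equivalence:

```python
import numpy as np

# Invented data: building indicator and associate degree counts for 8 tracts
has_building = np.array([0, 0, 0, 1, 1, 0, 1, 0])
associates = np.array([300.0, 250.0, 400.0, 150.0, 200.0, 350.0, 100.0, 280.0])

# Ordinary Pearson correlation
r = np.corrcoef(has_building, associates)[0, 1]

# Point-biserial form: difference in group means, scaled by spread and group sizes
m1 = associates[has_building == 1].mean()
m0 = associates[has_building == 0].mean()
n1 = int((has_building == 1).sum())
n0 = int((has_building == 0).sum())
n = n1 + n0
s = associates.std()  # population standard deviation (ddof=0)
r_pb = (m1 - m0) / s * np.sqrt(n1 * n0 / n**2)

print(np.isclose(r, r_pb))
```

So a weak negative r with associates degrees corresponds to tracts with buildings having a somewhat lower mean count of associate degree holders, which is the same comparison the bar charts in 2.3.3 display.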
    
    
To conclude, I want to provide a bar chart breakdown of socioeconomic status and post-secondary achievement rates for the entirety of Philadelphia, tracts that have more White Latines than Black Latines, and tracts that have more Black Latines than White Latines.

</div>

<div class="text-box">
    
## Conclusion

To synthesize the findings from the correlation analysis and bar graphs:

**Educational Attainment and Buildings:** There is no meaningful positive correlation between the presence of college buildings and higher educational attainment levels (Bachelor’s degrees). The weak negative correlation with associate degrees suggests that other factors may influence educational outcomes more strongly.

**Income Disparities:** The negative correlation between Median Household Income and Black Latino populations indicates socioeconomic disparities that are not directly mitigated by the presence of college buildings.

**Demographic Influences:** White Latino populations show a stronger positive correlation with associate degrees and almost no correlation with household income, highlighting differing socioeconomic dynamics compared to Black Latino populations.

**Overall Conclusion:** The statistical analyses suggest that while educational infrastructure is a vital component of community development, its direct association with socioeconomic indicators is limited. Instead, demographic factors appear to play a more substantial role in shaping educational and economic outcomes within Philadelphia's census tracts. Future studies should explore additional variables and contextual factors to comprehensively understand the dynamics at play.

</div>