Exploring Segregation in NYC Schools
====================================

References
----------
- Allen, R., & Vignoles, A. (2007). What should an index of school segregation measure? _Oxford Review of Education_, _33_(5), 643–668. https://doi.org/10.1080/03054980701366306
- Cohen, D. (2021). NYC School Segregation Report Card: Still Last, Action Needed Now! https://escholarship.org/uc/item/5fx616qn
- Frankel, D. M., & Volij, O. (2011). Measuring school segregation. _Journal of Economic Theory_, _146_(1), 1–38. https://doi.org/10.1016/j.jet.2010.10.008
- Zhang, C. H., & Ruther, M. (2021). Contemporary patterns and issues of school segregation and white flight in U.S. metropolitan areas: Towards spatial inquiries. _GeoJournal_, _86_(3), 1511–1526. https://doi.org/10.1007/s10708-019-10122-1


In [30]:
# import analysis libraries and define shared constants
import pandas as pd

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import scale 
from sklearn.cross_decomposition import PLSRegression
from sklearn.feature_selection import chi2

import numpy as np
from functools import partial

import scipy
import statsmodels.api as sm

import matplotlib.pyplot as plt
import seaborn as sns
import pingouin as pg

from IPython.display import Markdown as md, HTML
from nycschools import schools, geo, ui, class_size
shsat_schools = ['10X445', 'K543', '13K430', 'M435', 'X495', '28Q687', '31R605', '02M475']


In [35]:
# load the demographic data and get just the most recent year and columns of interest
df = schools.load_school_demographics()
cols = ['dbn', 'beds', 'district',  'school_name', 'total_enrollment',
        'asian_n', 'asian_pct', 'black_n', 'black_pct', 'hispanic_n',
        'hispanic_pct', 'white_n', 'white_pct', 'swd_n', 'swd_pct', 'ell_n',
        'ell_pct', 'poverty_n', 'poverty_pct', 'eni_pct']
data = df[df.ay == df.ay.max()].copy()
data = data[cols]

Intense Segregation and Apartheid Schools
=========================================
In 2014, the [_Civil Rights Project/Proyecto Derechos Civiles_](https://civilrightsproject.ucla.edu/) at UCLA released a report indicating that NYC had the most segregated schools in the nation. They followed up the 2014 report in 2021, finding that NYC was still last in terms of racial and ethnic integration of major school districts.

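One simple measure used in the report classifies schools with more than 90% non-white students as **intensely segregated** and schools with 99% or more non-white students as **apartheid schools**. We can quickly apply both thresholds to the most recent demographic data in our set. Note that the code in this section applies the thresholds to the combined Black and Hispanic share (`black_hispanic_pct`), a narrower group than all non-white students.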

In [61]:
# add some columns related to our analysis
data["nonwhite_n"] = data.total_enrollment - data.white_n
data["nonwhite_pct"] = 1 - data.white_pct
data["black_hispanic_n"] = data.black_n + data.hispanic_n
data["black_hispanic_pct"] = data.black_pct + data.hispanic_pct

# thresholds are applied to the combined Black and Hispanic share
data["intense_seg"] = data.black_hispanic_pct >= .90
data["apartheid"] = data.black_hispanic_pct >= .99


# school counts: summing a boolean column counts its True values
seg_schools = data.intense_seg.sum()
apart_schools = data.apartheid.sum()
num_schools = data.dbn.count()
rep_schools = num_schools - seg_schools
# enrollment totals for each group of schools
total_students = data.total_enrollment.sum()
rep_students = data[~data.intense_seg].total_enrollment.sum()
seg_students = data[data.intense_seg].total_enrollment.sum()
apart_students = data[data.apartheid].total_enrollment.sum()


seg = pd.DataFrame({
    "School Group": ["NYC Total", "More Representative", "Intensely Segregated", "Apartheid"],
    "Total Students": [f"{total_students:,}", f"{rep_students:,}", f"{seg_students:,}", f"{apart_students:,}"],
    "% of City Students": ["100%", f"{rep_students/total_students:.1%}", f"{seg_students/total_students:.1%}", f"{apart_students/total_students:.1%}"]
})
display(seg)
md(f"""
**Overview of school segregation in NYC**

In 2020-21 there were {num_schools:,} schools in NYC with {total_students:,} students.

Of these schools, {seg_schools:,} were considered intensely segregated, enrolling
{seg_students:,} students.

In addition, {apart_schools:,} of these schools were "apartheid" schools,
with student populations of 99% or more Black and Hispanic students, enrolling
{apart_students:,} students.
""")


| | School Group | Total Students | % of City Students |
|---|---|---|---|
| 0 | NYC Total | 1,148,504 | 100% |
| 1 | More Representative | 757,098 | 65.9% |
| 2 | Intensely Segregated | 391,406 | 34.1% |
| 3 | Apartheid | 6,765 | 0.6% |



**Overview of school segregation in NYC**

In 2020-21 there were 2,048 schools in NYC with 1,148,504 students.

Of these schools, 887 were considered intensely segregated, enrolling
391,406 students.

In addition, 22 of these schools were "apartheid" schools,
with student populations of 99% or more Black and Hispanic students, enrolling
6,765 students.


10% Representative
==================
One measure of segregation used by the de Blasio administration's School Diversity Advisory Group (SDAG) checks whether a school's racial/ethnic percentages fall within 10% of the district averages.
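A minimal sketch of how such a check might look. It assumes that "within 10%" means within 10 percentage points of the unweighted district mean for every group; that reading, the helper name `flag_representative`, and the toy data are illustrative assumptions, not the SDAG's published method.

```python
import pandas as pd

PCT_COLS = ["asian_pct", "black_pct", "hispanic_pct", "white_pct"]

def flag_representative(schools, pct_cols=PCT_COLS, tolerance=0.10):
    """Flag schools whose group shares are all within `tolerance` of their district mean."""
    # district mean of each percentage column, aligned row-by-row with `schools`
    district_means = schools.groupby("district")[pct_cols].transform("mean")
    # True where a school's share is close enough to the district mean
    within = (schools[pct_cols] - district_means).abs() <= tolerance
    # a school is representative only if every group passes
    return within.all(axis=1)

# hypothetical data: two schools in each of two districts
toy = pd.DataFrame({
    "dbn": ["A", "B", "C", "D"],
    "district": [1, 1, 2, 2],
    "asian_pct": [0.10, 0.14, 0.05, 0.45],
    "black_pct": [0.30, 0.28, 0.60, 0.10],
    "hispanic_pct": [0.40, 0.38, 0.30, 0.20],
    "white_pct": [0.20, 0.20, 0.05, 0.25],
})
toy["representative"] = flag_representative(toy)
print(toy[["dbn", "district", "representative"]])
```

In the toy data, the two schools in district 1 have nearly identical compositions and are flagged as representative, while the two schools in district 2 differ sharply from their district mean and are not.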




Index of dissimilarity
----------------------

`D` is the _index of dissimilarity_, an index of unevenness. For our data, D measures how unevenly racial/ethnic groups are distributed across schools. Higher values of D indicate that groups are not spread evenly across schools (more segregation); lower values indicate a more even distribution (less segregation). A value of 1 means complete segregation, while 0 means every school mirrors the overall composition. Below we calculate D for each district and then for the entire school system. Some districts have a low D, while in others D is higher than the citywide value. This measure is based on Allen and Vignoles (2007) and Frankel and Volij (2011).

We create a function to calculate D for a set of schools. We can use it to measure unevenness within a geographic school district and/or across the whole city.
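In its basic two-group form, the index is $D = \frac{1}{2}\sum_i \left| \frac{a_i}{A} - \frac{b_i}{B} \right|$, where $a_i$ and $b_i$ are the counts of each group in school $i$ and $A$ and $B$ are the group totals. The sketch below illustrates that definition on two hypothetical three-school systems; the function name `dissimilarity` and the counts are made up for illustration.

```python
import numpy as np

def dissimilarity(group_counts, other_counts):
    """Two-group index of dissimilarity: 0.5 * sum(|a_i/A - b_i/B|)."""
    a = np.asarray(group_counts, dtype=float)
    b = np.asarray(other_counts, dtype=float)
    # compare each school's share of group A with its share of group B
    return 0.5 * np.abs(a / a.sum() - b / b.sum()).sum()

# both groups are spread across schools in the same proportions
even = dissimilarity([50, 100, 150], [50, 100, 150])
# the two groups never share a school
separate = dissimilarity([100, 0, 0], [0, 60, 40])
print(even, separate)
```

The first system sits at the bottom of the scale (no unevenness) and the second at the top (complete separation).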


In [51]:
# calculate unevenness

def calculate_dissimilarity(data):
    """Weighted average of one-vs-rest D indices for Asian, Black, Hispanic, and White students."""
    groups = ['asian', 'black', 'hispanic', 'white']
    n_cols = [f'{g}_n' for g in groups]
    total = data.total_enrollment.sum()
    # per-school headcount across the four listed groups
    listed_n = data[n_cols].sum(axis=1, skipna=False)

    D_values = []
    for g in groups:
        eth = f'{g}_n'
        eth_total = data[eth].sum()
        # students outside the target group: per school (four listed groups only)
        # and overall (derived from total_enrollment, so it may include other categories)
        non_eth = listed_n - data[eth]
        non_eth_total = total - eth_total
        D_values.append((data[eth] / eth_total - non_eth / non_eth_total).abs().sum() / 2)

    # weight each group's D by its mean share across schools
    weights = [data[f'{g}_pct'].mean() for g in groups]
    return np.average(D_values, weights=weights)

cols = ['dbn', 'district', 'boro', 'total_enrollment', 'black_n', 'white_n', 'asian_n',
        'hispanic_n', 'black_pct', 'white_pct', 'asian_pct', 'hispanic_pct']
# note: df holds one row per school per academic year, so this pools all years
data = df[cols].copy()
data.set_index('dbn', inplace=True)
# D for each district, then for the whole city
seg_D = pd.DataFrame()
seg_D['district'] = data.district.unique()
seg_D['D'] = seg_D.district.apply(lambda x: calculate_dissimilarity(data[data.district == x]))
nyc_D = calculate_dissimilarity(data)
seg_D = seg_D.sort_values('D', ascending=False)
print("City D", nyc_D)

City D 0.4951746899047429


In [67]:
agg = {
    'boro': 'first',
    'total_enrollment': 'sum',
    'black_n': 'sum',
    'white_n': 'sum',
    'asian_n': 'sum',
    'hispanic_n': 'sum',
    'black_pct': 'mean',
    'white_pct': 'mean',
    'asian_pct': 'mean',
    'hispanic_pct': 'mean'
}
t = data.groupby('district').agg(agg).reset_index()
t = t.merge(seg_D, on='district', how='inner')

del agg['boro']
nyc = data.aggregate(agg)

nyc = pd.DataFrame(nyc).T
nyc = nyc.reset_index()
nyc['district'] = 0
nyc['boro'] = 'NYC Schools'
nyc['D'] = nyc_D
# append the citywide row to the district rows
table = pd.concat([t, nyc[t.columns]])
# label the citywide districts (75, 79, 84), which have no single borough
m = {75:"SWD", 84:"Charter Schools", 79:"Alternative District"}
table.boro = table.apply(lambda x: x.boro if x.district < 33 else m[x.district], axis=1)
table.set_index("district")

| district | boro | total_enrollment | black_n | white_n | asian_n | hispanic_n | black_pct | white_pct | asian_pct | hispanic_pct | D |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Manhattan | 54408.0 | 8276.0 | 10017.0 | 11658.0 | 22388.0 | 0.184378 | 0.143704 | 0.151052 | 0.490504 | 0.384274 |
| 2 | Manhattan | 383052.0 | 57632.0 | 99881.0 | 77241.0 | 129514.0 | 0.172985 | 0.251782 | 0.168195 | 0.358181 | 0.465494 |
| 3 | Manhattan | 129405.0 | 25139.0 | 45179.0 | 11568.0 | 39089.0 | 0.259396 | 0.261026 | 0.064434 | 0.366362 | 0.367152 |
| 4 | Manhattan | 61348.0 | 14361.0 | 3054.0 | 5478.0 | 36747.0 | 0.251611 | 0.050785 | 0.067264 | 0.601111 | 0.233072 |
| 5 | Manhattan | 66446.0 | 28299.0 | 6408.0 | 3234.0 | 26299.0 | 0.453837 | 0.079608 | 0.041331 | 0.394464 | 0.287054 |
| 6 | Manhattan | 103532.0 | 7395.0 | 5474.0 | 1535.0 | 87888.0 | 0.063104 | 0.055578 | 0.012857 | 0.856391 | 0.394353 |
| 7 | Bronx | 89637.0 | 23391.0 | 1389.0 | 1021.0 | 62704.0 | 0.265293 | 0.014668 | 0.010678 | 0.697054 | 0.12358 |
| 8 | Bronx | 141479.0 | 30435.0 | 7966.0 | 8976.0 | 91944.0 | 0.232398 | 0.049901 | 0.05796 | 0.644321 | 0.192761 |
| 9 | Bronx | 172829.0 | 47202.0 | 2312.0 | 2855.0 | 118797.0 | 0.276125 | 0.013524 | 0.017098 | 0.683455 | 0.178404 |
| 10 | Bronx | 292542.0 | 43332.0 | 17394.0 | 23179.0 | 203895.0 | 0.162149 | 0.048305 | 0.049886 | 0.723032 | 0.309917 |


In [70]:
# charter schools are a segregating factor, but not in the way that you might think
# requires df["black_hispanic_pct"], which is added in cell In [69]

community = df[df.school_type == "community"]
charter = df[df.school_type == "charter"]
# drop rows with a missing Black/Hispanic percentage
community = community[community.black_hispanic_pct.notnull()]
charter = charter[charter.black_hispanic_pct.notnull()]

# t-test between community and charter schools
t = scipy.stats.ttest_ind(charter.black_hispanic_pct, community.black_hispanic_pct)
# sample sizes (rows in each group)
n_charter = len(charter)
n_community = len(community)

# means
M_charter = charter.black_hispanic_pct.mean()
M_community = community.black_hispanic_pct.mean()

# standard deviations
sd_charter = charter.black_hispanic_pct.std()
sd_community = community.black_hispanic_pct.std()


display(md(f"""
**T-Test results** comparing school averages of 
Charter School % Black/Hispanic (`n={n_charter}`) 
and Community School % Black/Hispanic (`n={n_community}`).

- Charter % Black/Hispanic: M={M_charter:.02%}, SD={sd_charter:.02%}
- Community % Black/Hispanic: M={M_community:.02%}, SD={sd_community:.02%}
- T-score: {t.statistic:.04f}, p-val: {t.pvalue:.04f}

`n` values count school-year observations (one row per school per academic year), not students.
"""))




**T-Test results** comparing school averages of 
Charter School % Black/Hispanic (`n=1256`) 
and Community School % Black/Hispanic (`n=8360`).

- Charter % Black/Hispanic: M=89.62%, SD=13.54%
- Community % Black/Hispanic: M=71.01%, SD=28.50%
- T-score: 22.7578, p-val: 0.0000

`n` values count school-year observations (one row per school per academic year), not students.
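
The summary above shows very different standard deviations for the two groups (13.54% vs. 28.50%), while `scipy.stats.ttest_ind` assumes equal variances by default. A hedged sketch of one possible robustness check follows: Welch's t-test (`equal_var=False`) plus Cohen's d as a rough effect size. The arrays are synthetic stand-ins generated for illustration, not the notebook's data, and the variable names are hypothetical.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# synthetic stand-ins; the means and SDs loosely echo the summary above
charter_pct = np.clip(rng.normal(0.90, 0.13, 200), 0, 1)
community_pct = np.clip(rng.normal(0.71, 0.28, 1000), 0, 1)

# Student's t-test assumes equal variances; Welch's version does not
student = stats.ttest_ind(charter_pct, community_pct)
welch = stats.ttest_ind(charter_pct, community_pct, equal_var=False)

# Cohen's d using a pooled standard deviation
n1, n2 = len(charter_pct), len(community_pct)
pooled_var = ((n1 - 1) * charter_pct.var(ddof=1)
              + (n2 - 1) * community_pct.var(ddof=1)) / (n1 + n2 - 2)
cohens_d = (charter_pct.mean() - community_pct.mean()) / np.sqrt(pooled_var)

print(f"Student t={student.statistic:.2f}, Welch t={welch.statistic:.2f}, d={cohens_d:.2f}")
```

On the real data the same calls would take `charter.black_hispanic_pct` and `community.black_hispanic_pct` in place of the synthetic arrays.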


In [26]:


# calc entropy index
def entropy_index(row):
    # Shannon entropy (base 2) of a school's group shares;
    # the small constant avoids log2(0) for groups with no students
    p = row.values + 0.00001
    return -np.sum(p * np.log2(p))

seg = df[["dbn", "asian_pct", "white_pct", 
          "black_pct", "hispanic_pct",
          "ay", "total_enrollment"]].copy()

# keep only the most recent academic year, indexed by school
seg = seg.set_index("dbn")
seg = seg[seg.ay == seg.ay.max()]
seg = seg.drop("ay", axis=1)

# set enrollment aside so entropy is computed over the percentage columns only
total_enrollment = seg.total_enrollment
seg = seg.drop("total_enrollment", axis=1)
# calculate entropy
seg["entropy"] = seg.apply(entropy_index, axis=1)

# despite the name, this is the citywide mean of school-level entropy
district_entropy = seg.entropy.mean()
seg["entropy_distance"] = abs(district_entropy - seg["entropy"])
seg["pct_enrollment"] = total_enrollment / total_enrollment.sum()

# enrollment-weighted mean distance from the citywide mean entropy
seg["m"] = seg.entropy_distance * seg.pct_enrollment
M = seg.m.sum()
print(M)
seg["district_entropy"] = district_entropy
seg["total_enrollment"] = total_enrollment
seg[["entropy","entropy_distance","pct_enrollment", "district_entropy", "total_enrollment"]].sort_values("entropy_distance").head(10)




0.3232528141746691


| dbn | entropy | entropy_distance | pct_enrollment | district_entropy | total_enrollment |
|---|---|---|---|---|---|
| 10X094 | 1.24682 | 2e-05 | 0.000922 | 1.246839 | 1059 |
| 84K752 | 1.246119 | 0.00072 | 0.000195 | 1.246839 | 224 |
| 08X537 | 1.247712 | 0.000873 | 0.000118 | 1.246839 | 136 |
| 05M129 | 1.247835 | 0.000996 | 0.000266 | 1.246839 | 306 |
| 84K746 | 1.248149 | 0.00131 | 0.000524 | 1.246839 | 602 |
| 84Q170 | 1.248152 | 0.001313 | 0.000288 | 1.246839 | 331 |
| 19K938 | 1.245183 | 0.001656 | 5.4e-05 | 1.246839 | 62 |
| 02M529 | 1.248577 | 0.001738 | 0.000171 | 1.246839 | 196 |
| 04M146 | 1.244635 | 0.002204 | 0.0003 | 1.246839 | 344 |
| 12X217 | 1.244627 | 0.002212 | 0.000265 | 1.246839 | 304 |
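
The cell above compares each school's entropy with the mean of the school-level entropies. A more common formulation in the segregation literature is Theil's information theory index H, which compares each school's entropy with the entropy of the system-wide composition: $H = \sum_i \frac{t_i}{T\,E}(E - E_i)$, where $t_i$ is school enrollment, $T$ total enrollment, $E$ the system entropy, and $E_i$ the entropy of school $i$. The sketch below illustrates that formulation on hypothetical counts. It is offered as a point of comparison, not as the method used in this notebook; the function names `entropy` and `theil_h` are made up, and it assumes every school has at least one student.

```python
import numpy as np

def entropy(shares):
    """Shannon entropy (base 2) along the last axis, treating 0 * log(0) as 0."""
    p = np.asarray(shares, dtype=float)
    logs = np.log2(p, out=np.zeros_like(p), where=p > 0)
    return -(p * logs).sum(axis=-1)

def theil_h(counts):
    """Information theory index H for a schools-by-groups array of counts."""
    counts = np.asarray(counts, dtype=float)
    school_totals = counts.sum(axis=1)
    # entropy of the overall composition and of each school's composition
    system_entropy = entropy(counts.sum(axis=0) / counts.sum())
    school_entropy = entropy(counts / school_totals[:, None])
    # enrollment-weighted shortfall of school entropy relative to the system
    weights = school_totals / school_totals.sum()
    return (weights * (system_entropy - school_entropy)).sum() / system_entropy

# every school mirrors the overall 50/50 composition
integrated = theil_h([[50, 50], [100, 100]])
# each school enrolls only one group
segregated = theil_h([[100, 0], [0, 100]])
print(integrated, segregated)
```

Like D, this index runs from 0 (every school mirrors the system) to 1 (no school contains more than one group), and it extends naturally to more than two groups.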


In [69]:

df["black_hispanic_pct"] = df["black_pct"] + df["hispanic_pct"]

df.school_type = df.school_type.astype("category")
df.head()



| | dbn | beds | district | geo_district | boro | school_name | short_name | ay | year | school_type | ... | swd_n | swd_pct | ell_n | ell_pct | poverty_n | poverty_pct | eni_pct | clean_name | zip | black_hispanic_pct |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 01M015 | 310100010015 | 1 | 1 | Manhattan | P.S. 015 Roberto Clemente | PS 15 | 2016 | 2016-17 | community | ... | 51 | 0.287 | 12 | 0.067 | 152 | 0.854 | 0.882 | roberto clemente | 10009 | 0.877 |
| 1 | 01M015 | 310100010015 | 1 | 1 | Manhattan | P.S. 015 Roberto Clemente | PS 15 | 2017 | 2017-18 | community | ... | 49 | 0.258 | 8 | 0.042 | 161 | 0.847 | 0.89 | roberto clemente | 10009 | 0.853 |
| 2 | 01M015 | 310100010015 | 1 | 1 | Manhattan | P.S. 015 Roberto Clemente | PS 15 | 2018 | 2018-19 | community | ... | 39 | 0.224 | 8 | 0.046 | 147 | 0.845 | 0.888 | roberto clemente | 10009 | 0.822 |
| 3 | 01M015 | 310100010015 | 1 | 1 | Manhattan | P.S. 015 Roberto Clemente | PS 15 | 2019 | 2019-20 | community | ... | 46 | 0.242 | 17 | 0.089 | 155 | 0.816 | 0.867 | roberto clemente | 10009 | 0.8 |
| 4 | 01M015 | 310100010015 | 1 | 1 | Manhattan | P.S. 015 Roberto Clemente | PS 15 | 2020 | 2020-21 | community | ... | 43 | 0.223 | 21 | 0.109 | 158 | 0.819 | 0.856 | roberto clemente | 10009 | 0.803 |
