School Demographics
==================

The New York City Department of Education collects and reports
school demographic data for all of the public schools in its
system. This notebook describes the data released under the 
[2020-2021 School Demographics Snapshot](https://data.cityofnewyork.us/Education/2020-2021-Demographic-Snapshot-School/vmmu-wj3w).

A "clean" version of this data can be loaded from the `schools` module with the
`load_school_demographics()` function.

This notebook describes that data and shows some ways to use it.

Importing and loading the data
--------------------------------------------

In [1]:
# Jupyter commands to reload libraries without a restart
# this lets changes to schools.py reflect immediately
%load_ext autoreload
%autoreload 2

In [2]:
import pandas as pd
from IPython.display import Markdown as md
# core functions for importing and manipulating school data
import schools
# some helper functions for displaying pandas data
import ui 

In [3]:
# uncomment to load the raw, underlying data for comparison
# raw_df = pd.read_csv("https://data.cityofnewyork.us/resource/vmmu-wj3w.csv?$limit=1000000")

# cleaned version of the data
df = schools.load_school_demographics()

In [4]:
# schools.demo has some lists of strings for accessing sub-sets of columns
df[schools.demo.core_cols]

,dbn,district,boro,school_name,year,total_enrollment,asian_n,asian_pct,black_n,black_pct,...,hispanic_pct,white_n,white_pct,swd_n,swd_pct,ell_n,ell_pct,poverty_n,poverty_pct,eni_pct
0,01M015,1,Manhattan,P.S. 015 Roberto Clemente,2016,178,14,0.079000,51,0.287000,...,0.590000,4,0.022000,51,0.287000,12,0.067,152,0.854,0.882
1,01M015,1,Manhattan,P.S. 015 Roberto Clemente,2017,190,20,0.105000,52,0.274000,...,0.579000,6,0.032000,49,0.258000,8,0.042,161,0.847,0.890
2,01M015,1,Manhattan,P.S. 015 Roberto Clemente,2018,174,24,0.138000,48,0.276000,...,0.546000,6,0.034000,39,0.224000,8,0.046,147,0.845,0.888
3,01M015,1,Manhattan,P.S. 015 Roberto Clemente,2019,190,27,0.142000,56,0.295000,...,0.505000,9,0.047000,46,0.242000,17,0.089,155,0.816,0.867
4,01M015,1,Manhattan,P.S. 015 Roberto Clemente,2020,193,26,0.135000,53,0.275000,...,0.528000,11,0.057000,43,0.223000,21,0.109,158,0.819,0.856
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9164,84X730,84,Bronx,Bronx Charter School for the Arts,2016,320,2,0.006250,76,0.237500,...,0.737500,3,0.009375,67,0.209375,51,0.159,235,0.734,0.840
9165,84X730,84,Bronx,Bronx Charter School for the Arts,2017,314,2,0.006369,65,0.207006,...,0.773885,1,0.003185,68,0.216561,57,0.182,258,0.822,0.891
9166,84X730,84,Bronx,Bronx Charter School for the Arts,2018,430,2,0.004651,98,0.227907,...,0.746512,3,0.006977,103,0.239535,71,0.165,363,0.844,0.888
9167,84X730,84,Bronx,Bronx Charter School for the Arts,2019,523,1,0.001912,131,0.250478,...,0.722753,5,0.009560,117,0.223709,69,0.132,453,0.866,0.892


In [6]:
# keep track of the number of tables we display
# create a simple counter function from our ui helper
table = ui.counter()

# find the total unique schools in the data set
num_schools = len(df["dbn"].unique())
# pull these from the data rather than hardcode
# will make it easier to update when we get a new data set release
min_year = df["year"].min()
max_year = df["year"].max()

# map of column names and the aggregate function to perform
agg_fun = {
    "total_enrollment":"sum",
    "asian_pct":"mean",
    "black_pct":"mean",
    "white_pct":"mean",
    "hispanic_pct":"mean",
    "swd_pct":"mean",
    "ell_pct":"mean",
    "poverty_pct":"mean"
}


# format and rename aggregate columns
def format_totals(df):
    
    # use the ui package to format the numbers in our tables to make them easier to read
    pct_cols = ["asian_pct", "black_pct", "white_pct", "hispanic_pct", "swd_pct", "ell_pct", "poverty_pct"]   
    totals = ui.fmt_table(df, pct_cols=pct_cols, num_cols=["total_enrollment"])
    
    # rename columns with descriptive headers
    totals.columns = ['Total Students', '% Asian', '% Black', '% White', '% Hispanic', '% SWD', '% ENL',
           '% Poverty']
    return totals


# calculate aggregates, grouping by boro, for the most recent year
df_boro = df[df["year"] == max_year].groupby("boro").agg(agg_fun)

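# add totals for the city by running aggregate on df_boro
# and making it the last row
# note: the percentage columns in this row are unweighted means of the five
# borough means, so they can differ from the citywide means in the by-year table
df_boro.loc["NYC (totals)"] = df_boro.agg(agg_fun)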


# calculate aggregates, grouping by year
df_years = df.groupby("year").agg(agg_fun)

# use `display` and `Markdown (md)` to mix formatted output and python variables
display(md(f"""
Calculating aggregates
----------------------
This data set contains school data from **academic years {min_year}-{max_year}**.
It includes demographic **data from {num_schools:,} different schools** in the 32 zoned
school districts as well as in the "special" districts:

- [District 75](https://www.schools.nyc.gov/learning/special-education/school-settings/district-75) 
  for students with highly specialized needs that cannot be met in the regular school 
  special education program
- [District 79](https://www.schools.nyc.gov/learning/special-education/school-settings/district-75)
  representing schools in the alternative school district for older students, 
  students with interrupted education, court-involved youth, etc. 
- [District 84]() designating public charter schools operating within the DOE

"""))

display(format_totals(df_years))
display(md(f"**Table {table()}: Summary of school demographics by year.**"))

display(format_totals(df_boro))
display(md(f"**Table {table()}: Summary of school demographics by borough.**"))


Calculating aggregates
----------------------
This data set contains school data from **academic years 2016-2020**.
It includes demographic **data from 1,879 different schools** in the 32 zoned
school districts as well as in the "special" districts:

- [District 75](https://www.schools.nyc.gov/learning/special-education/school-settings/district-75) 
  for students with highly specialized needs that cannot be met in the regular school 
  special education program
- [District 79](https://www.schools.nyc.gov/learning/special-education/school-settings/district-75)
  representing schools in the alternative school district for older students, 
  students with interrupted education, court-involved youth, etc. 
- [District 84]() designating public charter schools operating within the DOE



year,Total Students,% Asian,% Black,% White,% Hispanic,% SWD,% ENL,% Poverty
2016,1080892,11.3%,32.0%,12.2%,42.1%,22.6%,13.7%,73.3%
2017,1081307,11.4%,31.5%,12.2%,42.4%,23.2%,13.9%,78.1%
2018,1079862,11.5%,31.0%,12.0%,42.8%,23.7%,13.8%,77.1%
2019,1080549,11.5%,30.6%,11.9%,43.3%,23.7%,13.4%,77.3%
2020,1050017,11.7%,30.2%,11.7%,43.5%,23.6%,14.1%,76.7%


**Table 1: Summary of school demographics by year.**

boro,Total Students,% Asian,% Black,% White,% Hispanic,% SWD,% ENL,% Poverty
Bronx,218081.0,3.7%,28.2%,3.3%,63.3%,24.4%,17.7%,87.7%
Brooklyn,315797.0,10.4%,43.1%,13.0%,31.0%,23.6%,12.7%,77.6%
Manhattan,169026.0,9.4%,24.3%,14.6%,47.9%,25.2%,11.8%,70.6%
Queens,283289.0,25.3%,22.2%,11.4%,37.1%,19.8%,15.3%,71.2%
Staten Island,63824.0,11.0%,15.1%,38.1%,32.8%,29.7%,8.4%,62.5%
NYC (totals),1050017.0,11.9%,26.6%,16.1%,42.4%,24.6%,13.2%,73.9%


**Table 2: Summary of school demographics by borough.**
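
The percentage columns in Tables 1 and 2 are simple means of per-school percentages, so a school with 100 students counts as much as one with 2,000. A minimal sketch of an enrollment-weighted alternative, which divides summed counts by summed enrollment, might look like the following. The three-row frame `toy` is made-up data, not taken from the data set; only the column names follow the cleaned data.

```python
import pandas as pd

# made-up rows for illustration; column names follow the cleaned data set
toy = pd.DataFrame({
    "boro": ["Bronx", "Bronx", "Queens"],
    "total_enrollment": [100, 900, 500],
    "asian_n": [50, 90, 100],
})
toy["asian_pct"] = toy["asian_n"] / toy["total_enrollment"]

# simple mean of school percentages: every school counts equally
simple = toy.groupby("boro")["asian_pct"].mean()

# enrollment-weighted: total Asian students divided by total students
sums = toy.groupby("boro")[["asian_n", "total_enrollment"]].sum()
weighted = sums["asian_n"] / sums["total_enrollment"]

print(pd.DataFrame({"simple_mean": simple, "weighted": weighted}))
```

The two columns agree only when schools in a group are the same size or have the same percentage, so which one to report depends on whether the question is about the typical school or the typical student.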

In [7]:
# show the built-in descriptive statistics from pandas for this data set
desc = df.copy()
desc["year"] = pd.Categorical(df.year)
desc["district"] = pd.Categorical(df.district)

desc = desc[schools.demo.core_cols].describe(include="all")
display(desc)
display(md(f"**Table {table()}: Descriptive stats of key columns**"))

,dbn,district,boro,school_name,year,total_enrollment,asian_n,asian_pct,black_n,black_pct,...,hispanic_pct,white_n,white_pct,swd_n,swd_pct,ell_n,ell_pct,poverty_n,poverty_pct,eni_pct
count,9169,9169.0,9169,9169,9169.0,9169.0,9169.0,9169.0,9169.0,9169.0,...,9169.0,9169.0,9169.0,9169.0,9169.0,9169.0,9169.0,9169.0,9169.0,9169.0
unique,1879,35.0,5,1870,5.0,,,,,,...,,,,,,,,,,
top,01M015,84.0,Brooklyn,New Visions Charter High School for Advanced Math,2020.0,,,,,,...,,,,,,,,,,
freq,5,1187.0,2832,20,1878.0,,,,,,...,,,,,,,,,,
mean,,,,,,585.955611,95.272113,0.114829,149.827898,0.310663,...,0.428142,86.237321,0.120075,122.522303,0.23365,80.49362,0.137677,433.621878,0.765193,0.731424
std,,,,,,477.226538,217.404728,0.165497,166.121062,0.26357,...,0.249595,180.892501,0.176769,97.894632,0.158448,106.569319,0.14017,347.605026,0.194591,0.202723
min,,,,,,7.0,0.0,0.0,0.0,0.0,...,0.015,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.04,0.04
25%,,,,,,320.0,5.0,0.014,41.0,0.08,...,0.202,6.0,0.015,67.0,0.161,18.0,0.045,235.0,0.706,0.633
50%,,,,,,471.0,17.0,0.041,102.0,0.247,...,0.403,15.0,0.033,99.0,0.206,44.0,0.098,353.0,0.824,0.792
75%,,,,,,687.0,81.0,0.143,192.0,0.491,...,0.634,78.0,0.144,146.0,0.257329,100.0,0.181,517.0,0.904,0.886


**Table 3: Descriptive stats of key columns**

In [8]:
# read the data dictionary from a .csv file
# and format the results
dd = pd.read_csv("demo-data-dict.csv")

display(md("""
Data Dictionary
---------------
We renamed some of the columns from the raw dataset for brevity and clarity. For each
demographic group where there is a count and a percentage, we suffix the "count"
column with `_n` and the percentage column with `_pct`. The DOE data uses the
convention of column name with no suffix for counts and `_1` for percents.
Note also that `year` has changed to an integer representation of the academic
year. The raw data has a string, but other DOE data releases use integers
for the year. The int value makes the data easier to merge and sort.
"""))

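for _, row in dd.iterrows():
    # unpack the three fields of each row; avoid reusing the name `desc` from the previous cell
    col, dt, description = row
    display(md(f"- **{col}:** {description}"))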



Data Dictionary
---------------
We renamed some of the columns from the raw dataset for brevity and clarity. For each
demographic group where there is a count and a percentage, we suffix the "count"
column with `_n` and the percentage column with `_pct`. The DOE data uses the
convention of column name with no suffix for counts and `_1` for percents.
Note also that `year` has changed to an integer representation of the academic
year. The raw data has a string, but other DOE data releases use integers
for the year. The int value makes the data easier to merge and sort.
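
The renaming convention described here can be sketched in a few lines. The raw names (`asian`, `asian_1`) and the raw year format (`"2020-21"`) are assumptions for illustration, and `rename_raw_column` and `year_to_int` are hypothetical helpers, not functions from the `schools` module.

```python
def rename_raw_column(name, groups):
    """Map a raw DOE-style column name to the `_n` / `_pct` convention.

    `groups` is the set of demographic group names that have both a
    count column (no suffix) and a percent column (`_1` suffix).
    """
    if name.endswith("_1") and name[:-2] in groups:
        return f"{name[:-2]}_pct"
    if name in groups:
        return f"{name}_n"
    return name  # columns outside the groups keep their name


def year_to_int(academic_year):
    """Convert an academic year string such as "2020-21" to the int 2020."""
    return int(academic_year.split("-")[0])


# hypothetical raw column names, for illustration only
groups = {"asian", "black", "hispanic", "white"}
raw_cols = ["dbn", "year", "asian", "asian_1", "white", "white_1"]
renamed = [rename_raw_column(c, groups) for c in raw_cols]
print(renamed)
print(year_to_int("2020-21"))
```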


- **dbn:** the unique school ID in the format District-Borough-Number

- **district:** Districts 1-32 represent geographic school districts in the city. [District 75](https://www.schools.nyc.gov/learning/special-education/school-settings/district-75) supports students with highly specialized needs that cannot be met in the regular school special education program. [District 79](https://www.schools.nyc.gov/learning/special-education/school-settings/district-75) is the alternative school district for older students, students with interrupted education, court-involved youth, etc. [District 84]() designates public charter schools operating within the DOE.

- **boro:** the full borough name (e.g., Manhattan, Staten Island)

- **school_name:** the full name of the school. Elementary schools are usually called PS 15, PS 143, etc. They often have a descriptive name, too, like Roberto Clemente. Middle schools are _usually_ MS 915, but sometimes they are called IS 915. Example: P.S. 015 Roberto Clemente

- **year:** the academic year as an integer.  2020-21 is represented as `2020`

- **total_enrollment:** the total number of students in the school

- **grade_3k_pk_half_day_full:** the total number of students in the school in early childhood '3K' or 'pre-k'

- **[grade_k..grade_12]:** each of these columns is the number of students in each grade at the school

- **female_n:** total female students at the school

- **female_pct:** the percent of female students as a real number between 0 and 1

- **male_n:** the total male students at the school

- **male_pct:** the percent of male students as a real number between 0 and 1

- **asian_n:** total Asian students at the school

- **asian_pct:** the percent of Asian students as a real number between 0 and 1

- **black_n:** total Black students at the school

- **black_pct:** the percent of Black students as a real number between 0 and 1

- **hispanic_n:** total Latinx students at the school

- **hispanic_pct:** the percent of Latinx students as a real number between 0 and 1

- **multi_racial_n:** total multi-racial students at the school

- **multi_racial_pct:** the percent of multi-racial students as a real number between 0 and 1

- **native_american_n:** total Native American students at the school

- **native_american_pct:** the percent of Native American students as a real number between 0 and 1

- **white_n:** total White students at the school

- **white_pct:** the percent of White students as a real number between 0 and 1

- **missing_race_ethnicity_data_n:** total number of students with missing race/ethnic data

- **missing_race_ethnicity_data_pct:** the percent of students with missing race/ethnic data as a real number between 0 and 1

- **swd_n:** total number of students with disabilities (sometimes written SWD) in the schools. This counts the number of students with an IEP (individualized education plan) in special education at the school. For more info about special ed and IEPs:
https://www.schools.nyc.gov/learning/special-education/preschool-to-age-21/special-education-in-nyc

- **swd_pct:** the percent of SWDs as a real number between 0 and 1

- **ell_n:** total number of students who are characterized as English Language Learners (sometimes written ELL but also ENL or ESL students) in the schools. This counts the number of students who receive modified instruction through English as a New Language instruction, bilingual education, or both. For more info on how NYC identifies ELLs:
https://www.schools.nyc.gov/learning/multilingual-learners/english-language-learners

- **ell_pct:** the percent of ENL students as a real number between 0 and 1

- **poverty_n:** the number of students who qualify for free or reduced-price lunch or HRA benefits. If the underlying data shows ‘Less than 5%’, then the field represents 4% of the school enrollment. If the underlying data has ‘Greater than 95%’, then the value will be 96% of the school enrollment.

- **poverty_pct:** the percent of students in poverty as a real number between 0 and 1

- **eni_pct:** Economic Need Index (ENI) estimates the percentage of students in the school living in poverty. The ENI is a percentage represented as a real number between 0 and 1
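
The `dbn` format described at the top of this list can be split into its parts with a few string slices. This is a minimal sketch: `parse_dbn` is a hypothetical helper, not part of the `schools` module, and the borough letter mapping is an assumption based on the codes seen in identifiers such as `01M015` and `84X730`.

```python
# borough letter codes used in DBNs (assumed mapping)
BORO_CODES = {
    "M": "Manhattan",
    "X": "Bronx",
    "K": "Brooklyn",
    "Q": "Queens",
    "R": "Staten Island",
}


def parse_dbn(dbn):
    """Split a 6-character DBN into district, borough name, and school number."""
    if len(dbn) != 6:
        raise ValueError(f"expected a 6-character DBN, got {dbn!r}")
    district = int(dbn[:2])      # "01" -> 1
    boro = BORO_CODES[dbn[2]]    # "M" -> "Manhattan"
    number = dbn[3:]             # kept as a string to preserve leading zeros
    return district, boro, number


print(parse_dbn("01M015"))
print(parse_dbn("84X730"))
```

Keeping the school number as a string avoids losing the leading zero in values like `015`.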

Change over time
-------------------------
The school demographics data set tells us about the student population a school serves.
We can investigate questions around school size, racial and ethnic makeup, students with disabilities,
ENL students, and poverty levels. Looking at the number of students at each grade level, we can see the school type: P-5, K-5, 6-8, P-12, etc. We can aggregate this data by district, grade level, or school type (community/charter). Also, since our data spans several years, we can calculate changes over time.
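
As a rough sketch of reading school type from the grade columns, the lowest and highest grades with any enrolled students can be turned into a label. The column names `grade_1` through `grade_12` are assumed from the `[grade_k..grade_12]` entry in the data dictionary, `grade_span` is a hypothetical helper, and the enrollment numbers are made up.

```python
# grade columns in order, following the data dictionary's naming (assumed)
GRADE_COLS = ["grade_3k_pk_half_day_full", "grade_k"] + [f"grade_{g}" for g in range(1, 13)]
LABELS = ["P", "K"] + [str(g) for g in range(1, 13)]


def grade_span(row):
    """Return a label such as "K-5" from the lowest and highest grades with students."""
    served = [label for col, label in zip(GRADE_COLS, LABELS) if row.get(col, 0) > 0]
    if not served:
        return None
    if served[0] == served[-1]:
        return served[0]
    return f"{served[0]}-{served[-1]}"


# made-up enrollment counts for two schools
elementary = {"grade_3k_pk_half_day_full": 18, "grade_k": 30, "grade_1": 28,
              "grade_2": 31, "grade_3": 27, "grade_4": 29, "grade_5": 30}
middle = {"grade_6": 110, "grade_7": 105, "grade_8": 98}

print(grade_span(elementary))
print(grade_span(middle))
```

Because the function only uses `.get()`, it should also work when applied to DataFrame rows with `df.apply(grade_span, axis=1)`.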

This data is particularly interesting as a baseline for merging with other data sets (test scores, high school acceptance, etc.), but there may be interesting questions without a merge.

This example calculates the change in the poverty percentage and in student demographics over time to get a sense of which schools and districts are "gentrifying" and what that means for the student body.

The change in `poverty_pct` measures the year-over-year change in the percentage of students in the school
who are considered impoverished in City data. Negative numbers for `poverty_change` indicate
that the school, as a whole, is _wealthier_ than in the previous year. Accordingly, schools
with the _most negative change_ are the ones that are gentrifying
most rapidly by attracting wealthier families.


In [9]:
gen = df.copy()

# look at just one district to run more quickly
# comment out for whole city (slow!!)
# gen = gen[gen["district"]==13]


pct_cols = ['asian', 'black', 'hispanic', 'white', 'swd', 'ell',  'poverty', 'eni']


def calc_year_change(row, col):
    y = row["year"]
    dbn = row["dbn"]
    if y > min_year:
        last_year = gen.query(f"year == {y-1} and dbn=='{dbn}'")
        
        # if no previous year for this school, return 0 - no change
        if len(last_year) != 1:
            return 0
           
        # last_year has exactly one row here; take its value explicitly
        change = float(row[col]) - float(last_year[col].iloc[0])
        return change

    return 0

print("Calculating yearly changes...")
# run apply() for each column we want to track changes in
for col in pct_cols:
    gen[f"{col}_change"] = gen.apply(calc_year_change, axis=1, args=(f"{col}_pct",))
    
print("calculating changes completed.")
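
The row-by-row `apply()` above runs one query per row for each column, which is why it is slow for the whole city. A vectorized sketch of the same idea using `groupby().diff()` is shown here on a small made-up frame. As in `calc_year_change`, a row with no record for the immediately preceding year gets a change of 0. The frame `toy` and its values are invented for illustration, and unlike the original function this sketch assumes at most one row per school per year.

```python
import pandas as pd

# made-up data: school A has 2016-2018, school B is missing 2017
toy = pd.DataFrame({
    "dbn": ["A", "A", "A", "B", "B"],
    "year": [2016, 2017, 2018, 2016, 2018],
    "poverty_pct": [0.80, 0.75, 0.78, 0.60, 0.50],
})

# sort so each school's rows are in year order
toy = toy.sort_values(["dbn", "year"]).reset_index(drop=True)

by_school = toy.groupby("dbn")
change = by_school["poverty_pct"].diff()  # difference from the school's previous row
gap = by_school["year"].diff()            # years elapsed since that previous row

# keep the change only when the previous row is exactly one year earlier
toy["poverty_change"] = change.where(gap == 1, 0.0)
print(toy)
```

The same pattern could be repeated in a loop over the `_pct` columns to fill every `*_change` column.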

In [40]:
top_20 = gen.sort_values(by="poverty_change")[0:20]
# show these columns in our table
pct_cols = ['poverty_change','white_change',  'black_change', 'hispanic_change','asian_change',  'swd_change', 'ell_change',  ]
cols = [ "district", "boro", "school_name", "year"] + pct_cols

# display the columns we want without the index counter
top_20 = ui.fmt_table(top_20[cols], pct_cols=pct_cols)


display(top_20.style.hide(axis='index'))

report = f"""
**Table {table()}: 20 schools that had the largest 1-year change in wealth.**

Scanning this table, we can see that the most frequent schools are charter schools from
district 84 (n={len(top_20[top_20["district"]==84])}) and that Brooklyn was
the most frequent borough (n={len(top_20[top_20["boro"]=="Brooklyn"])}).
"""

md(report)

district,boro,school_name,year,poverty_change,white_change,black_change,hispanic_change,asian_change,swd_change,ell_change
84,Brooklyn,Brooklyn Prospect Charter School Downtown,2017,-30.3%,27.8%,-38.1%,5.1%,0.4%,-1.5%,-0.2%
75,Manhattan,Hospital Schools,2018,-26.4%,7.7%,-10.1%,-6.0%,7.5%,-8.4%,-1.2%
84,Brooklyn,Brooklyn LAB Charter School,2019,-25.5%,0%,-2.0%,2.1%,-0.5%,-1.3%,0.7%
84,Brooklyn,Canarsie Ascend Charter School,2017,-23.2%,0.1%,1.0%,-0.9%,-0.2%,-1.0%,0.1%
84,Brooklyn,Edmund W. Gordon Brooklyn Laboratory Charter Schoo,2019,-22.5%,1.6%,-3.7%,4.9%,-2.9%,7.3%,0.4%
30,Queens,P.S. 384,2019,-21.9%,13.5%,4.5%,-12.6%,1.4%,-3.2%,-1.9%
29,Queens,P.S. 360,2017,-21.8%,0%,1.6%,-1.4%,-0.2%,1.9%,0%
84,Queens,Success Academy Charter School - Springfield Garde,2019,-21.3%,-0.1%,0%,0.1%,0.2%,-1.4%,0.2%
84,Brooklyn,Brooklyn Prospect Charter School 15.2,2020,-20.3%,7.0%,-0.3%,-11.1%,1.3%,-3.6%,-2.7%
84,Brooklyn,Lefferts Gardens Ascend Charter School,2020,-19.6%,0%,3.7%,-1.2%,-1.2%,-4.6%,-1.2%



**Table 29: 20 schools that had the largest 1-year change in wealth.**

Scanning this table, we can see that the most frequent schools are charter schools from
district 84 (n=12) and that Brooklyn was
the most frequent borough (n=13).


In [45]:
# calculate some aggregates 
change_by_year = gen[gen["year"]>min_year].groupby(["year", "boro"]).agg({"poverty_pct":"mean", "poverty_change":"mean"})
change_by_dist = gen[gen["year"]==max_year].groupby(["district"]).agg({
    "district":"max", # keep the district for each group
    "boro":"max", # by using max on a repeated string it just keeps this col in results
    "poverty_pct":"mean", 
    "poverty_change":"mean",
    "white_pct": "mean",
    "black_pct": "mean",
    "hispanic_pct": "mean",
    "asian_pct": "mean",
  })

change_by_dist = change_by_dist.sort_values(by="poverty_change")



display(ui.fmt_table(change_by_year, pct_cols=["poverty_pct", "poverty_change"]))
display(md(f"**Table {table()}: Poverty Change and Percent Poverty by Borough, {min_year+1}-{max_year}**"))


pct_cols=["poverty_pct", "poverty_change", "white_pct","black_pct","hispanic_pct","asian_pct"]
change_by_dist = ui.fmt_table(change_by_dist, pct_cols=pct_cols)

display(change_by_dist.style.hide(axis="index"))
display(md(f"**Table {table()}: Poverty Change and Percentage by District, {max_year}**"))


year,boro,poverty_pct,poverty_change
2017,Bronx,88.1%,4.5%
2017,Brooklyn,79.7%,4.8%
2017,Manhattan,71.5%,4.7%
2017,Queens,73.9%,5.2%
2017,Staten Island,62.1%,3.4%
2018,Bronx,87.5%,-0.7%
2018,Brooklyn,78.7%,-1.1%
2018,Manhattan,70.7%,-0.8%
2018,Queens,71.8%,-1.9%
2018,Staten Island,62.5%,0.3%


**Table 35: Poverty Change and Percent Poverty by Borough, 2017-2020**

district,boro,poverty_pct,poverty_change,white_pct,black_pct,hispanic_pct,asian_pct
29,Queens,73.0%,-2.1%,1.8%,64.7%,15.9%,12.6%
28,Queens,68.3%,-2.1%,12.0%,24.5%,27.0%,29.5%
21,Brooklyn,75.6%,-1.9%,30.4%,14.5%,29.0%,23.7%
26,Queens,50.7%,-1.8%,15.5%,9.7%,16.5%,54.7%
32,Brooklyn,87.7%,-1.7%,2.6%,15.9%,78.7%,2.0%
13,Brooklyn,70.7%,-1.6%,12.5%,54.7%,22.3%,6.9%
22,Brooklyn,69.8%,-1.5%,27.8%,34.8%,16.6%,17.8%
6,Manhattan,84.6%,-0.9%,6.1%,6.1%,84.9%,1.3%
1,Manhattan,72.2%,-0.8%,14.3%,18.4%,49.8%,13.9%
10,Bronx,86.3%,-0.7%,4.1%,16.1%,73.6%,4.5%


**Table 36: Poverty Change and Percentage by District, 2020**