# Exploring math and ELA proficiencies

We explore whether trends in math and ELA proficiency emerge across the school clusters. The measure is `% Level 3+4`, the percentage of tested students who meet learning standards, with or without distinction.

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv("output/clustered-2024.csv")

In [3]:
cols_to_use = [ "DBN", "School Name", "Grade", "Year", "Number Tested", "% Level 3+4" ]
math_df = pd.read_excel("data/school-math-results-2018-2025-public.xlsx", sheet_name="Math - All", usecols=cols_to_use)
ela_df = pd.read_excel("data/school-ela-results-2018-2025-public.xlsx", sheet_name="ELA - All", usecols=cols_to_use)

In [4]:
merged_df = df.copy()

In [5]:
test_data = {
    "math": (math_df, "math_pct"),
    "ela": (ela_df, "ela_pct")
}

for test_name, (test_df, col_name) in test_data.items():
    # filter all grades and latest year
    filtered = test_df[(test_df["Grade"] == "All Grades") & (test_df["Year"] == 2025)]

    # merge and rename cols
    merged_df = merged_df.merge(
        filtered[["DBN", "% Level 3+4"]].rename(columns={"% Level 3+4": col_name}),
        on="DBN", how="left"
    )

In [6]:
merged_df.sample(3)

Unnamed: 0,DBN,School Name,Year,Total Enrollment,% Asian and Pacific Islander,% Black,% Hispanic,% Multi-Racial,% Native American,% White,% Missing Race/Ethnicity Data,Borough,kmeans_label,math_pct,ela_pct
1458,31R013,P.S. 013 M. L. Lindemeyer,2024-25,773,0.288486,0.10608,0.416559,0.019405,0.003881,0.165589,0.0,R,0,78.716217,74.662163
1822,84X395,"NYC Charter High School for Architecture, Engi...",2024-25,454,0.013216,0.34141,0.618943,0.0,0.013216,0.013216,0.0,X,1,,
1219,26Q213,P.S. 213 The Carl Ullman School,2024-25,373,0.546917,0.075067,0.252011,0.018767,0.008043,0.099196,0.0,Q,0,70.303032,64.968155


In [7]:
merged_df.to_csv("output/test-scores.csv", encoding="UTF-8", index=False)

In [8]:
merged_df[["math_pct", "ela_pct"]] = merged_df[["math_pct", "ela_pct"]].apply(pd.to_numeric, errors="coerce")

*Note*: Data caveat - after the coercion above, `math_pct` and `ela_pct` contain `NaN` wherever a result was suppressed or missing. The `groupby` aggregations below skip those rows.
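To make the caveat concrete, here is a minimal sketch of what `errors="coerce"` does to non-numeric entries. The toy values and the `"s"` suppression marker are assumptions for illustration; the actual marker in the source spreadsheets may differ.

```python
import pandas as pd

# Toy values for illustration only. "s" is an ASSUMED suppression marker,
# not confirmed against the source spreadsheets.
raw = pd.Series(["78.7", "s", None, "64.9"])

# Anything that cannot be parsed as a number becomes NaN
coerced = pd.to_numeric(raw, errors="coerce")

# Count how many rows ended up without a usable value
n_missing = int(coerced.isna().sum())

# mean() skips NaN by default, so only the two reported values are averaged
mean_of_reported = coerced.mean()

print(coerced)
print(n_missing, mean_of_reported)
```

Running the same `isna().sum()` check on `merged_df[["math_pct", "ela_pct"]]`, grouped by `kmeans_label`, would show how many schools in each cluster drop out of the statistics.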

In [9]:
math_trends = merged_df.groupby("kmeans_label")["math_pct"].agg(["mean", "min", "max", "std"])
math_trends.T

kmeans_label,0,1,2,3,4
mean,72.514923,42.917546,43.888176,49.679284,72.555059
min,37.7551,7.228916,5.932203,12.820513,15.384615
max,99.404762,95.973152,97.368423,98.529411,100.0
std,13.801632,18.235831,15.791776,19.153974,15.35655


In [10]:
ela_trends = merged_df.groupby("kmeans_label")["ela_pct"].agg(["mean", "min", "max", "std"])
ela_trends.T

kmeans_label,0,1,2,3,4
mean,68.082826,45.292373,43.048934,51.409843,72.749463
min,39.917694,8.77193,16.410257,22.602739,29.166666
max,98.802399,98.924728,94.174759,96.480942,100.0
std,13.082734,15.703252,13.275849,15.930233,13.593726


## Observations

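Generally:
Cluster 4 is the top performer on mean proficiency in both math and ELA (math: 72.6%, ELA: 72.7%), with Cluster 0 second (math: 72.5%, ELA: 68.1%); in math the two are effectively tied. Clusters 1-3 average far lower (math means of 42.9% to 49.7%) and show wide spreads in math proficiency, with standard deviations of 15.8 to 19.2 points.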

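* ELA proficiency varies less than math: the standard deviation is lower for ELA than for math in every cluster (13.1 to 15.9 points versus 13.8 to 19.2).
* Although the schools within a cluster serve similar populations, Clusters 1-3 span a wide range of proficiency rates (math runs from under 13% to over 95% in each), so some schools are likely underperforming while others do exceptionally well. School size could also distort these aggregates: the means above give every school equal weight regardless of how many students were tested, so a few small schools with extreme rates can move a cluster mean as much as one large school.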
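One way to probe the school-size concern is to compare the plain per-school mean with a mean weighted by the number of students tested. The sketch below uses invented toy data and a hypothetical `number_tested` column. `merged_df` as built above does not carry `Number Tested`, so that column would first need to be added in the merge step.

```python
import pandas as pd

# Toy data invented for illustration; not drawn from the real results.
# "number_tested" is a hypothetical column name.
toy = pd.DataFrame({
    "kmeans_label": [1, 1, 1, 2, 2],
    "number_tested": [1000, 50, 50, 200, 200],
    "math_pct": [30.0, 90.0, 90.0, 40.0, 60.0],
})

# Unweighted: every school counts equally, as in the tables above
unweighted = toy.groupby("kmeans_label")["math_pct"].mean()

# Weighted: drop rows missing either value, then weight each school's rate
# by its number of tested students
valid = toy.dropna(subset=["math_pct", "number_tested"])
weighted_sum = (valid["math_pct"] * valid["number_tested"]).groupby(valid["kmeans_label"]).sum()
total_tested = valid.groupby("kmeans_label")["number_tested"].sum()
weighted = weighted_sum / total_tested

comparison = pd.DataFrame({"unweighted": unweighted, "weighted": weighted})
print(comparison)
```

In this toy example, cluster 1's unweighted mean is 70% while its weighted mean is about 35%, because one large school with a low rate holds most of the students. A large gap between the two columns on the real data would suggest that school size matters for that cluster.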
