<h1>MCAS results: Why are some schools considered better than others?</h1>
<p>Data are from:</p>
    <p>
    MCAS achievement, MCAS growth percentiles<br>
<a href=https://profiles.doe.mass.edu/statereport/nextgenmcas.aspx>
    https://profiles.doe.mass.edu/statereport/nextgenmcas.aspx</a>

<p>Race and gender of students<br><a href=https://profiles.doe.mass.edu/statereport/enrollmentbyracegender.aspx>
    https://profiles.doe.mass.edu/statereport/enrollmentbyracegender.aspx</a>

<p>High needs/English learners/low income etc.<br><a href=https://profiles.doe.mass.edu/statereport/selectedpopulations.aspx>
    https://profiles.doe.mass.edu/statereport/selectedpopulations.aspx</a>

<p>Total per pupil expenditure<br><a href=https://profiles.doe.mass.edu/statereport/ppx.aspx>
    https://profiles.doe.mass.edu/statereport/ppx.aspx</a>

<p>Town/city income per capita<br><a href="https://dlsgateway.dor.state.ma.us/reports/rdPage.aspx?rdReport=DOR_Income_EQV_Per_Capita">
    https://dlsgateway.dor.state.ma.us/reports/rdPage.aspx?rdReport=DOR_Income_EQV_Per_Capita</a>

<p>School data are taken from the 2018-2019 school year, the last full school year before pandemic interruptions. MCAS tests were taken in the spring of 2019, and fiscal data are from fiscal year 2019.

In [61]:
#Load the data into pandas dataframes
import pandas
import pandasql
mcas=pandas.read_csv("NextGenMCAS2019.csv",encoding="utf-8",delimiter="\t")
demographics=pandas.read_csv("enrollmentbydemographic2018-2019.csv",encoding="utf-8",delimiter="\t")
student_needs=pandas.read_csv("selectedpopulations2018-2019.csv",encoding="utf-8",delimiter="\t")
ppexp=pandas.read_csv("PerPupilExpenditures_2019.csv",encoding="utf-8",delimiter="\t")
ppexp["In-District Expenditures per Pupil"]= \
ppexp["In-District Expenditures per Pupil"].apply(lambda x: float(x.replace("$","").replace(",","")))
town_income=pandas.read_csv("DOR_Income_Per_Capita_2019.csv",encoding="utf-8",delimiter="\t",thousands=",")

#Drop unused columns
ppexp=ppexp[["District Code","In-District Expenditures per Pupil"]]
town_income=town_income[["Municipality","DOR Income Per Capita","EQV Per Capita"]]
#Combine the science, math, and English Language Arts data into a combined score for each school district.
mcas_combined_subjects=pandasql.sqldf("SELECT `District Code`,MAX(`District Name`) AS district_name," +
                                      "AVG(`M+E %`) AS 'Passing %',AVG(SGP) AS 'Student Growth Percentile' " +
                                      "FROM mcas GROUP BY `District Code` ORDER BY `District Code`",globals())
#Add code for local, regional, and charter schools
def school_sorter(code):
    if code==0:
        return "state average"
    if code<4000000:
        return "local"
    if code<6000000:
        return "charter"
    if code<35000000:
        return "regional"
    return "charter"
mcas_combined_subjects["district_type"]=mcas_combined_subjects["District Code"].map(school_sorter)

#Merge the MCAS results data with data on school district demographics and spending
schools_combined=pandas.merge(mcas_combined_subjects,demographics,how="inner",on="District Code")
schools_combined=pandas.merge(schools_combined,student_needs,how="inner",on="District Code")
schools_combined=pandas.merge(schools_combined,ppexp,how="inner",on="District Code")
all_districts_df=schools_combined.copy()
#Drop the count ("#") columns so only the percentage columns remain
all_districts_df=all_districts_df.drop(columns=[c for c in all_districts_df.columns if "#" in c])
locals_df=pandas.merge(all_districts_df,town_income,how="inner",left_on="district_name",right_on="Municipality")
df=locals_df.copy()
df=df.drop(["District Name_x","District Name_y"],axis=1)
df=df.dropna()
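The inner join on district name silently drops any district whose name has no exact match in the municipality table. One way to see which names fall out is a left merge with pandas' `indicator=True`. The sketch below uses two tiny made-up frames (the town names and numbers are hypothetical, not the DESE or DOR data) as stand-ins for `all_districts_df` and `town_income`.

```python
import pandas

# Hypothetical stand-ins for all_districts_df and town_income.
districts = pandas.DataFrame({
    "district_name": ["Town A", "Town B-Town C", "Town D"],
    "Passing %": [50.0, 70.0, 35.0],
})
towns = pandas.DataFrame({
    "Municipality": ["Town A", "Town B", "Town D"],
    "DOR Income Per Capita": [40000, 80000, 45000],
})

# A left merge keeps every district; the _merge column records
# whether a matching municipality was found for that row.
checked = pandas.merge(districts, towns, how="left",
                       left_on="district_name", right_on="Municipality",
                       indicator=True)
unmatched = checked.loc[checked["_merge"] == "left_only", "district_name"].tolist()
print(unmatched)
```

In the notebook, the same check on the real frames would show which districts the inner join excludes from `locals_df`.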

In [62]:
#Analyze the data on town income and school spending
import plotly
import plotly.express as px
p1=px.scatter(df,x="DOR Income Per Capita",y="Passing %",log_x=True,hover_data=
              ["district_name","Passing %"],trendline="ols",trendline_options=dict(log_x=True))
p1.update_xaxes(range=[4,5.60206],dtick="D1",title="city/town income per capita (dollars/year)")
p1.show()
p2=px.scatter(df,x="In-District Expenditures per Pupil",y="Passing %",hover_data=
              ["district_name","Passing %"],trendline="ols")
p2.show()

There is a modest correlation (R<sup>2</sup>=0.62, fitted against the logarithm of income) between per capita income and MCAS scores, but no correlation between district educational spending and MCAS performance. This suggests that something other than school spending explains why the higher-scoring districts outperform the others.
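Because the first plot sets `trendline_options=dict(log_x=True)`, its trendline is fitted against log10 of income, so its R<sup>2</sup> is not the same number as an R<sup>2</sup> from a fit against raw income. A minimal numpy sketch of the difference, using synthetic values invented only to show the mechanics:

```python
import numpy

def r_squared(x, y):
    """R^2 of a one-variable least-squares line y = slope*x + intercept."""
    slope, intercept = numpy.polyfit(x, y, 1)
    residuals = y - (slope * x + intercept)
    return 1.0 - numpy.sum(residuals ** 2) / numpy.sum((y - y.mean()) ** 2)

# Synthetic, exactly log-linear data: y rises 20 points per tenfold increase in x.
income = numpy.array([10_000.0, 20_000.0, 50_000.0, 100_000.0, 400_000.0])
passing = 20.0 * numpy.log10(income) - 50.0

r2_raw = r_squared(income, passing)               # fit against raw income
r2_log = r_squared(numpy.log10(income), passing)  # fit against log10(income)
print(r2_raw, r2_log)
```

On data like this the log fit is essentially perfect while the raw fit is not, which is why the two kinds of R<sup>2</sup> should not be compared directly.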

In [63]:
#Finding the strongest correlations
import numpy
import sklearn
from sklearn.linear_model import LinearRegression as LinReg
dfreg=df.copy()
X=dfreg[["dem1","dem2","dem3","dem4","dem5","dem6","dem7","dem8","dem9","First Language Not English %",
       "English Language Learner %","Students With Disabilities %","Economically Disadvantaged %",
       "In-District Expenditures per Pupil","DOR Income Per Capita","EQV Per Capita"]].copy()
Y=dfreg[["Passing %"]].copy()

xnames=[]
ynames=[]
Rsq=[]
for x in X.columns:
    for y in Y.columns:
        xnames.append(x)
        xdata=numpy.array(X[x]).reshape((-1,1))
        ynames.append(y)
        ydata=Y[y]
        Lregression=LinReg().fit(xdata,ydata)
        Rsq.append(Lregression.score(xdata,ydata))
correlation_df=pandas.DataFrame(dict([("x",xnames),("y",ynames),("correlation",Rsq)])).sort_values(by="correlation",ascending=False)
print(correlation_df.head(8))

                               x          y  correlation
12  Economically Disadvantaged %  Passing %     0.736020
14         DOR Income Per Capita  Passing %     0.415541
1                           dem2  Passing %     0.326032
10    English Language Learner %  Passing %     0.207176
11  Students With Disabilities %  Passing %     0.206712
7                           dem8  Passing %     0.168617
5                           dem6  Passing %     0.138964
9   First Language Not English %  Passing %     0.124293
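The `correlation` column above holds the R<sup>2</sup> returned by `LinearRegression.score`, not a correlation coefficient. For a regression on a single predictor, R<sup>2</sup> equals the square of Pearson's r, so the sign of the relationship is lost. A small numpy sketch on made-up numbers:

```python
import numpy

x = numpy.array([5.0, 12.0, 20.0, 33.0, 41.0, 58.0])   # made-up predictor values
y = numpy.array([72.0, 66.0, 61.0, 49.0, 47.0, 30.0])  # made-up outcome values

r = numpy.corrcoef(x, y)[0, 1]             # Pearson correlation (negative here)
slope, intercept = numpy.polyfit(x, y, 1)  # one-variable least-squares line
ss_res = numpy.sum((y - (slope * x + intercept)) ** 2)
ss_tot = numpy.sum((y - y.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot                 # coefficient of determination

print(r, r2)
```

Reading the table with this in mind, a high value says the factor predicts the passing rate well but not whether the rate rises or falls with it; the scatter plots show the direction.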


In [64]:
#Strongest correlation graph
p3=px.scatter(df,x="Economically Disadvantaged %",y="Passing %",hover_data=
              ["district_name"],trendline="ols")
p3.show()

In [65]:
FullLinReg=LinReg().fit(X,Y["Passing %"])
print("16-factor linear regression: R sq = ",FullLinReg.score(X,Y["Passing %"]))

16-factor linear regression: R sq =  0.8075823722468721


The 16-factor linear regression (R<sup>2</sup>=0.81), which is likely overfit, offers only a small improvement over the fit with just 1 factor, the economic disadvantage of the students (R<sup>2</sup>=0.74).
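One common way to discount the extra predictors is the adjusted R<sup>2</sup>, 1 - (1 - R<sup>2</sup>)(n - 1)/(n - p - 1), where n is the number of districts and p the number of predictors. A sketch follows. The helper name and the numbers in the usage line are hypothetical placeholders, since the notebook does not print the number of rows in `df`.

```python
def adjusted_r_squared(r2, n_samples, n_predictors):
    """Adjusted R^2: penalizes R^2 for the number of predictors used."""
    degrees_of_freedom = n_samples - n_predictors - 1
    if degrees_of_freedom <= 0:
        raise ValueError("need more samples than predictors plus one")
    return 1.0 - (1.0 - r2) * (n_samples - 1) / degrees_of_freedom

# Placeholder inputs, not results from the MCAS data.
example = adjusted_r_squared(0.80, n_samples=100, n_predictors=16)
print(example)
```

In the notebook one would pass `FullLinReg.score(...)`, `len(df)`, and `X.shape[1]`; the penalty grows as the number of predictors approaches the number of districts.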

To look at the effect of local per capita income and wealth, I looked only at local school districts, whose names match a single city or town. But what if I look at all the school districts in the MCAS data set?

In [66]:
#Analyze the same data for all districts, including charters and regionals.
dfreg=all_districts_df.copy()
X=dfreg[["dem1","dem2","dem3","dem4","dem5","dem6","dem7","dem8","dem9","First Language Not English %",
       "English Language Learner %","Students With Disabilities %","Economically Disadvantaged %",
       "In-District Expenditures per Pupil"]].copy()
Y=dfreg[["Passing %"]].copy()

xnames=[]
ynames=[]
Rsq=[]
for x in X.columns:
    for y in Y.columns:
        xnames.append(x)
        xdata=numpy.array(X[x]).reshape((-1,1))
        ynames.append(y)
        ydata=Y[y]
        Lregression=LinReg().fit(xdata,ydata)
        Rsq.append(Lregression.score(xdata,ydata))
correlation_df=pandas.DataFrame(dict([("x",xnames),("y",ynames),("correlation",Rsq)])).sort_values(by="correlation",ascending=False)
print(correlation_df.head(8))
p4=px.scatter(all_districts_df,x="Economically Disadvantaged %",y="Passing %",color="district_type",
              hover_data=["district_name"],trendline="ols",trendline_scope="overall")
p4.show()

                               x          y  correlation
12  Economically Disadvantaged %  Passing %     0.612423
1                           dem2  Passing %     0.167192
7                           dem8  Passing %     0.154452
10    English Language Learner %  Passing %     0.153489
11  Students With Disabilities %  Passing %     0.147537
2                           dem3  Passing %     0.137764
5                           dem6  Passing %     0.136346
9   First Language Not English %  Passing %     0.070975


The same independent variables top the list, but the correlations are weaker (R<sup>2</sup> for economic disadvantage falls from 0.74 to 0.61). The cause seems to be charter schools<sup>*</sup> (green dots): compared to local and regional public school districts, they show a much weaker relationship between economic disadvantage and test scores.

<i><font size="2">*Technically charter school districts, but in Massachusetts a charter school is usually its own school district.</font></i>
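The charter-school observation can be checked numerically by fitting the line separately for each `district_type` instead of once overall. A minimal sketch, assuming a frame shaped like `all_districts_df`; the rows are synthetic and chosen only to make the mechanics visible.

```python
import numpy
import pandas

def r_squared(x, y):
    """R^2 of a one-variable least-squares line y = slope*x + intercept."""
    slope, intercept = numpy.polyfit(x, y, 1)
    residuals = y - (slope * x + intercept)
    return 1.0 - numpy.sum(residuals ** 2) / numpy.sum((y - y.mean()) ** 2)

# Synthetic rows (not MCAS data): a tight trend for "local", a loose one for "charter".
toy = pandas.DataFrame({
    "district_type": ["local"] * 4 + ["charter"] * 4,
    "Economically Disadvantaged %": [10.0, 30.0, 50.0, 70.0, 40.0, 50.0, 60.0, 70.0],
    "Passing %": [75.0, 60.0, 45.0, 30.0, 52.0, 47.0, 55.0, 49.0],
})

# One R^2 per district type instead of a single overall fit.
r2_by_type = {
    name: r_squared(group["Economically Disadvantaged %"].to_numpy(),
                    group["Passing %"].to_numpy())
    for name, group in toy.groupby("district_type")
}
print(r2_by_type)
```

In the plot itself, leaving out `trendline_scope="overall"` makes Plotly Express draw one trendline per color group, which gives the same per-type comparison visually.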