ANOVA and OLS Regression
========================
This notebook combines school demographic data
and New York State ELA test scores to examine
the factors that predict test scores. In it we
run a t-test to test for a statistically significant difference
between the scores of White students and Black students.

We then run an analysis of variance across the 4
main ethnic/racial groups to see if the differences in outcomes
are statistically significant. We run an OLS regression
on the demographic categories with "All Students" as the baseline
reference category and display the ANOVA table and regression summary.

Finally, we run a different OLS regression to examine
the school demographic factors and their effect on test scores.

In [1]:
# automatically reload changes we make to schools.py and ui.py
%load_ext autoreload
%autoreload 2

In [3]:
# import the libraries used in this notebook
import pandas as pd
import numpy as np
import scipy.stats
from scipy.stats import pearsonr
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from statsmodels.formula.api import ols
import networkx as nx
import math

from IPython.display import Markdown as md

from school_data import schools, exams, ui


In [4]:
# load the demographic data and merge it with the ELA data
df = schools.load_school_demographics()

# load the data from the csv file
ela = exams.load_ela()


# drop the rows with NaN scores (where the population is too small to report)
ela = ela[ela["mean_scale_score"].notnull()]
df = df.merge(ela, how="inner", on=["dbn", "ay"])

# show the columns in our merged data set
df.columns

Index(['dbn', 'district', 'boro', 'school_name', 'short_name', 'ay', 'year_x',
       'total_enrollment', 'grade_3k_pk_half_day_full', 'grade_k', 'grade_1',
       'grade_2', 'grade_3', 'grade_4', 'grade_5', 'grade_6', 'grade_7',
       'grade_8', 'grade_9', 'grade_10', 'grade_11', 'grade_12', 'female_n',
       'female_pct', 'male_n', 'male_pct', 'asian_n', 'asian_pct', 'black_n',
       'black_pct', 'hispanic_n', 'hispanic_pct', 'multi_racial_n',
       'multi_racial_pct', 'native_american_n', 'native_american_pct',
       'white_n', 'white_pct', 'missing_race_ethnicity_data_n',
       'missing_race_ethnicity_data_pct', 'swd_n', 'swd_pct', 'ell_n',
       'ell_pct', 'poverty_n', 'poverty_pct', 'eni_pct', 'clean_name', 'grade',
       'year_y', 'category', 'number_tested', 'mean_scale_score', 'level_1_n',
       'level_1_pct', 'level_2_n', 'level_2_pct', 'level_3_n', 'level_3_pct',
       'level_4_n', 'level_4_pct', 'level_3_4_n', 'level_3_4_pct', 'test_year',
       'charter'],
     

In [5]:
# for this analysis we will only look at grade 8 scores for the 2018-19 school year
# the last pre-covid year
df = df[df["grade"] == '8']
df = df[df["ay"] == 2018]

# create 5 groups as independent data frames

all_students = df[df["category"] == "All Students"][["dbn", "mean_scale_score"]]
black = df[df["category"] == "Black"][["dbn", "mean_scale_score"]]
white = df[df["category"] == "White"][["dbn", "mean_scale_score"]]
hispanic = df[df["category"] == "Hispanic"][["dbn", "mean_scale_score"]]
asian = df[df["category"] == "Asian"][["dbn", "mean_scale_score"]]


In [6]:
# calculate the mean test score and standard deviation for each group
# (string aggregation names avoid the pandas deprecation warning for NumPy callables)
mean_std = df.groupby('category').agg(Mean=('mean_scale_score', 'mean'), STD=('mean_scale_score', 'std'))
display(md("**Mean and standard deviation of test scores for each group.**"))
display(mean_std)


# run a t-test to see if there is a statistically significant difference between White and Black student scores
t = scipy.stats.ttest_ind(white["mean_scale_score"], black["mean_scale_score"])

display(md(f"""
**T-Test results** comparing school averages of 8th grade ELA scores for
White (`n={white["dbn"].count()}`) and Black (`n={black["dbn"].count()}`)
students in the 2018-19 academic year.

- White students: M={white["mean_scale_score"].mean()}, SD={white["mean_scale_score"].std()}
- Black students: M={black["mean_scale_score"].mean()}, SD={black["mean_scale_score"].std()}
- T-score: {round(t.statistic, 4)}, p-val: {round(t.pvalue, 4)}

`n` values report the number of schools observed, not the number of test takers. Further analysis
will report on the t-test for weighted means that account for school size.
"""))

**Mean and standard deviation of test scores for each group.**

category,Mean,STD
All Students,601.650957,9.014575
Asian,610.558379,9.357716
Black,598.167014,7.726598
Current ELL,577.973401,6.288727
Econ Disadv,601.111595,8.710385
Ever ELL,604.205822,6.341045
Female,603.969318,9.035509
Hispanic,598.65078,8.18175
Male,597.865757,9.571399
Never ELL,602.116687,8.431567



**T-Test results** comparing school averages of 8th grade ELA scores for
White (`n=176`) and Black (`n=318`)
students in the 2018-19 academic year.

- White students: M=608.2399111659091, SD=10.524117180693764
- Black students: M=598.1670142591196, SD=7.7265981679485085
- T-score: 12.1507, p-val: 0.0

`n` values report the number of schools observed, not the number of test takers. Further analysis
will report on the t-test for weighted means that account for school size.


In [7]:
# run a one way anova to test if there is significant difference between the
# average test scores at the school level of 4 different racial/ethnic groups

fvalue, pvalue = scipy.stats.f_oneway(
    asian["mean_scale_score"], 
    black["mean_scale_score"],
    hispanic["mean_scale_score"],
    white["mean_scale_score"])


results = f"""
A **one-way between subjects ANOVA** was conducted to compare the effect of 
racial/ethnic group on the test score for 8th grade NYS ELA exams for
Asian (n={len(asian)}), Black (n={len(black)}), Latinx (n={len(hispanic)}),
and White (n={len(white)}) students.

There was a significant effect of racial/ethnic group on test score at
the p<.001 level for the four conditions [p={pvalue}, F={round(fvalue,3)}].
"""
md(results)


A **one-way between subjects ANOVA** was conducted to compare the effect of 
racial/ethnic group on the test score for 8th grade NYS ELA exams for
Asian (n=169), Black (n=318), Latinx (n=374),
and White (n=176) students.

There was a significant effect of racial/ethnic group on test score at
the p<.001 level for the four conditions [p=1.2920393648289998e-68, F=124.039].
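
A significant F says little about how large the group effect is. As a rough supplement, eta squared (the share of variance between schools' group scores that is associated with group membership) can be recovered from the F value and the group sizes reported above. This is a sketch added for illustration; the variable names are ours and the effect size is not part of the original output.

```python
# Values copied from the ANOVA output above
f_value = 124.039
group_sizes = {"Asian": 169, "Black": 318, "Latinx": 374, "White": 176}

df_between = len(group_sizes) - 1  # k - 1
df_within = sum(group_sizes.values()) - len(group_sizes)  # N - k, N = school-level observations

# For a one-way ANOVA:
# eta^2 = SS_between / SS_total = (F * df_between) / (F * df_between + df_within)
eta_squared = (f_value * df_between) / (f_value * df_between + df_within)

print(df_between, df_within)   # 3 1033
print(round(eta_squared, 3))   # 0.265
```

Under this calculation, racial/ethnic group accounts for roughly a quarter of the variance in the school-level group means.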


In [8]:
# run OLS regression on the different categories of test scores
model = ols('mean_scale_score ~ C(category, Treatment(reference="All Students"))', data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
display(md("### OLS Model Summary: Demographic Group Test Scores Regression"))
display(model.summary())
display(md("**ANOVA Table**"))
anova_table

### OLS Model Summary: Demographic Group Test Scores Regression

Dep. Variable:,mean_scale_score,R-squared:,0.395
Model:,OLS,Adj. R-squared:,0.393
Method:,Least Squares,F-statistic:,252.1
Date:,"Tue, 12 Apr 2022",Prob (F-statistic):,0.0
Time:,14:56:55,Log-Likelihood:,-17966.0
No. Observations:,5034,AIC:,35960.0
Df Residuals:,5020,BIC:,36050.0
Df Model:,13,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,601.6510,0.356,1689.961,0.000,600.953,602.349
"C(category, Treatment(reference=""All Students""))[T.Asian]",8.9074,0.751,11.861,0.000,7.435,10.380
"C(category, Treatment(reference=""All Students""))[T.Black]",-3.4839,0.599,-5.814,0.000,-4.659,-2.309
"C(category, Treatment(reference=""All Students""))[T.Current ELL]",-23.6776,0.667,-35.484,0.000,-24.986,-22.369
"C(category, Treatment(reference=""All Students""))[T.Econ Disadv]",-0.5394,0.566,-0.953,0.341,-1.649,0.570
"C(category, Treatment(reference=""All Students""))[T.Ever ELL]",2.5549,0.645,3.959,0.000,1.290,3.820
"C(category, Treatment(reference=""All Students""))[T.Female]",2.3184,0.539,4.303,0.000,1.262,3.375
"C(category, Treatment(reference=""All Students""))[T.Hispanic]",-3.0002,0.569,-5.268,0.000,-4.117,-1.884
"C(category, Treatment(reference=""All Students""))[T.Male]",-3.7852,0.539,-7.017,0.000,-4.843,-2.728

Omnibus:,220.028,Durbin-Watson:,0.561
Prob(Omnibus):,0.0,Jarque-Bera (JB):,326.061
Skew:,0.406,Prob(JB):,1.57e-71
Kurtosis:,3.947,Cond. No.,12.1
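
Because this model has a single categorical predictor with treatment coding, the intercept is the mean of the reference category and each coefficient is the difference between a group's mean and that reference mean. The short check below illustrates this using a few of the group means printed earlier; it is a sanity check added for illustration, and the dictionary names are ours.

```python
# Group means copied from the mean and standard deviation table above
group_means = {
    "All Students": 601.650957,
    "Asian": 610.558379,
    "Black": 598.167014,
    "Current ELL": 577.973401,
    "Hispanic": 598.65078,
}

baseline = group_means["All Students"]  # equals the model's Intercept (601.6510)

# With treatment coding, each coefficient is (group mean - baseline mean)
coefs = {
    group: round(mean - baseline, 4)
    for group, mean in group_means.items()
    if group != "All Students"
}
print(coefs)
# {'Asian': 8.9074, 'Black': -3.4839, 'Current ELL': -23.6776, 'Hispanic': -3.0002}
```

These values agree with the `coef` column of the regression summary, so the coefficients can be read directly as point differences from the "All Students" school average.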


**ANOVA Table**

,sum_sq,df,F,PR(>F)
"C(category, Treatment(reference=""All Students""))",242199.057751,13.0,252.130032,0.0
Residual,370943.901871,5020.0,,
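
The ANOVA table and the regression summary describe the same fit, and the link between them can be checked by hand. The sketch below (added for illustration, with variable names of our choosing) recomputes the F-statistic and R-squared from the sums of squares and degrees of freedom in the table.

```python
# Values copied from the ANOVA table above
ss_category, df_category = 242199.057751, 13.0
ss_residual, df_residual = 370943.901871, 5020.0

# F is the ratio of the two mean squares (sum of squares / degrees of freedom)
f_stat = (ss_category / df_category) / (ss_residual / df_residual)

# R-squared is the share of the total sum of squares explained by the model
r_squared = ss_category / (ss_category + ss_residual)

print(round(f_stat, 2))     # 252.13
print(round(r_squared, 3))  # 0.395
```

Both numbers match the `F-statistic` and `R-squared` entries in the OLS summary.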


In [9]:
# run a different OLS regression using the school demographics to predict mean_scale_score for all students
# the above example uses the R-like formula method for OLS
# this uses the python/programmatic approach

# calculate coefficients for these factors
factors = ['total_enrollment', 'asian_pct',  'black_pct', 
       'hispanic_pct',  'white_pct','swd_pct',  'ell_pct',  'poverty_pct']

# only use the All Students data
data = df[df["category"]=="All Students"].copy()

# add boro as a factor, but convert to category first
# "dummies" basically converts the categorical boro data to "wide" Boolean data with one col for each boro
dummies = pd.get_dummies(data[['boro']])

display(md("### OLS Model Summary: School Demographics Regression"))

y = data['mean_scale_score']
# boros are not adding information, leave them out
# X = pd.concat([data[factors], dummies], axis=1)
X = data[factors]
X = sm.add_constant(X)
model = sm.OLS(y, X).fit()

display(model.summary())
display(model.summary2())

### OLS Model Summary: School Demographics Regression

Dep. Variable:,mean_scale_score,R-squared:,0.663
Model:,OLS,Adj. R-squared:,0.659
Method:,Least Squares,F-statistic:,141.4
Date:,"Tue, 12 Apr 2022",Prob (F-statistic):,2.34e-130
Time:,14:56:59,Log-Likelihood:,-1791.3
No. Observations:,583,AIC:,3601.0
Df Residuals:,574,BIC:,3640.0
Df Model:,8,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,622.0261,10.951,56.801,0.000,600.517,643.535
total_enrollment,-0.0004,0.001,-0.606,0.544,-0.002,0.001
asian_pct,8.6873,12.431,0.699,0.485,-15.729,33.103
black_pct,-3.4150,12.178,-0.280,0.779,-27.334,20.504
hispanic_pct,2.4193,12.020,0.201,0.841,-21.189,26.028
white_pct,5.7811,11.786,0.491,0.624,-17.368,28.930
swd_pct,-50.6672,3.546,-14.290,0.000,-57.631,-43.703
ell_pct,-34.9550,2.747,-12.726,0.000,-40.350,-29.560
poverty_pct,-8.4041,2.703,-3.109,0.002,-13.713,-3.095

Omnibus:,35.989,Durbin-Watson:,1.778
Prob(Omnibus):,0.0,Jarque-Bera (JB):,70.017
Skew:,0.387,Prob(JB):,6.25e-16
Kurtosis:,4.511,Cond. No.,89600.0
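
The large condition number and the wide confidence intervals on the four race/ethnicity percentages are consistent with multicollinearity: if those shares sum to nearly 1 in every school, they are almost collinear with the constant. The sketch below illustrates the effect on synthetic data only (not the school data) by computing variance inflation factors with NumPy. The `vif` helper, the Dirichlet parameters, and the size of the remainder are all assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_schools = 583  # same number of observations as the regression above

# Synthetic data only: four shares that sum to almost 1 in every school,
# leaving a small remainder for all other groups (an assumption of this sketch)
other = rng.uniform(0.0, 0.02, size=n_schools)
shares = rng.dirichlet([2, 3, 4, 2], size=n_schools) * (1 - other)[:, None]
names = ["asian_pct", "black_pct", "hispanic_pct", "white_pct"]


def vif(X, j):
    """Variance inflation factor of column j: 1 / (1 - R^2) when column j
    is regressed on the remaining columns plus an intercept."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    r2 = 1 - resid.var() / y.var()
    return 1 / (1 - r2)


vifs = {name: vif(shares, j) for j, name in enumerate(names)}
for name, value in vifs.items():
    print(f"{name}: VIF = {value:.1f}")
```

A common rule of thumb treats VIF values above about 10 as a sign of problematic collinearity. If the real data behave like this, dropping one share column so that it acts as the reference group is one way to make the remaining coefficients interpretable.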


Model:,OLS,Adj. R-squared:,0.659
Dependent Variable:,mean_scale_score,AIC:,3600.6017
Date:,2022-04-12 14:56,BIC:,3639.9154
No. Observations:,583,Log-Likelihood:,-1791.3
Df Model:,8,F-statistic:,141.4
Df Residuals:,574,Prob (F-statistic):,2.34e-130
R-squared:,0.663,Scale:,27.738

,Coef.,Std.Err.,t,P>|t|,[0.025,0.975]
const,622.0261,10.9509,56.8013,0.0000,600.5173,643.5349
total_enrollment,-0.0004,0.0007,-0.6065,0.5444,-0.0018,0.0010
asian_pct,8.6873,12.4310,0.6988,0.4849,-15.7286,33.1032
black_pct,-3.4150,12.1782,-0.2804,0.7793,-27.3344,20.5043
hispanic_pct,2.4193,12.0199,0.2013,0.8406,-21.1891,26.0277
white_pct,5.7811,11.7859,0.4905,0.6240,-17.3676,28.9297
swd_pct,-50.6672,3.5457,-14.2896,0.0000,-57.6315,-43.7030
ell_pct,-34.9550,2.7468,-12.7257,0.0000,-40.3500,-29.5600
poverty_pct,-8.4041,2.7029,-3.1093,0.0020,-13.7129,-3.0953

Omnibus:,35.989,Durbin-Watson:,1.778
Prob(Omnibus):,0.0,Jarque-Bera (JB):,70.017
Skew:,0.387,Prob(JB):,0.0
Kurtosis:,4.511,Condition No.:,89579.0
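
To make the coefficients concrete, the fitted equation can be applied to a single school by hand. The sketch below uses the coefficients from the summary above and a hypothetical school whose values are invented for illustration. It assumes the `_pct` columns are stored as fractions between 0 and 1, which the magnitude of the coefficients suggests but the output does not confirm.

```python
# Coefficients copied from the regression summary above
coef = {
    "const": 622.0261,
    "total_enrollment": -0.0004,
    "asian_pct": 8.6873,
    "black_pct": -3.4150,
    "hispanic_pct": 2.4193,
    "white_pct": 5.7811,
    "swd_pct": -50.6672,
    "ell_pct": -34.9550,
    "poverty_pct": -8.4041,
}

# Hypothetical school; percentages expressed as fractions (an assumption)
school = {
    "total_enrollment": 500,
    "asian_pct": 0.10,
    "black_pct": 0.30,
    "hispanic_pct": 0.40,
    "white_pct": 0.15,
    "swd_pct": 0.20,
    "ell_pct": 0.10,
    "poverty_pct": 0.75,
}

# Predicted score = intercept + sum of (coefficient * value)
predicted = coef["const"] + sum(coef[k] * v for k, v in school.items())
print(round(predicted, 2))  # 603.57

# Predicted change for a 10 percentage point rise in students with disabilities
swd_effect = coef["swd_pct"] * 0.10
print(round(swd_effect, 2))  # -5.07
```

Under that assumption, a 10 percentage point increase in `swd_pct` is associated with a predicted mean scale score about 5 points lower, holding the other factors fixed.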
