ANOVA and OLS Regression
========================
This notebook combines school demographic data
and New York State ELA test scores to examine
the factors that predict test scores. In it we
run a t-test to check for a statistically significant difference
in the scores of White students and Black students.

We then run a one-way analysis of variance (ANOVA) across the 4
main ethnic/racial groups to see whether their outcomes differ
significantly. We run an OLS regression on the student categories
with "All Students" as the baseline reference category and display
the ANOVA table and regression summary.

Finally, we run a different OLS regression to examine
the school demographic factors and their effect on test scores.

In [1]:
# automatically reload changes we make to schools.py and ui.py
%load_ext autoreload
%autoreload 2

In [2]:
# import the libraries used in this notebook
import pandas as pd
import numpy as np
import scipy.stats  # makes scipy.stats.ttest_ind and scipy.stats.f_oneway available
from scipy.stats import pearsonr
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from statsmodels.formula.api import ols

from IPython.display import Markdown as md

import schools
import ui

In [3]:
# load the demographic data and merge it with the ELA data
df = schools.load_school_demographics()

# load the data from the csv file
ela = pd.read_csv("ela-combined.csv")


# drop the rows with NaN (where the population is too small to report)
ela = ela[ela["mean_scale_score"].notnull()]
df = df.merge(ela, how="inner", on=["dbn", "year"])

# show the columns in our merged data set
df.columns

Index(['dbn', 'district', 'boro', 'school_name', 'year', 'total_enrollment',
       'grade_3k_pk_half_day_full', 'grade_k', 'grade_1', 'grade_2', 'grade_3',
       'grade_4', 'grade_5', 'grade_6', 'grade_7', 'grade_8', 'grade_9',
       'grade_10', 'grade_11', 'grade_12', 'female_n', 'female_pct', 'male_n',
       'male_pct', 'asian_n', 'asian_pct', 'black_n', 'black_pct',
       'hispanic_n', 'hispanic_pct', 'multi_racial_n', 'multi_racial_pct',
       'native_american_n', 'native_american_pct', 'white_n', 'white_pct',
       'missing_race_ethnicity_data_n', 'missing_race_ethnicity_data_pct',
       'swd_n', 'swd_pct', 'ell_n', 'ell_pct', 'poverty_n', 'poverty_pct',
       'eni_pct', 'grade', 'category', 'number_tested', 'mean_scale_score',
       'level_1', 'level_1_pct', 'level_2', 'level_2_pct', 'level_3',
       'level_3_pct', 'level_4', 'level_4_pct', 'level_3_4', 'level_3_4_pct'],
      dtype='object')
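
The inner merge above silently drops any school-year that appears in only one of the two tables. A minimal sketch of how that could be audited, using made-up toy tables (the `dbn` values and scores here are hypothetical, not taken from the real data): an outer merge with `indicator=True` labels each row by where it came from.

```python
import pandas as pd

# Toy stand-ins for the demographic and ELA tables (all values are made up).
demo = pd.DataFrame({
    "dbn": ["01M001", "01M002", "01M003"],
    "year": [2019, 2019, 2019],
    "total_enrollment": [300, 450, 120],
})
scores = pd.DataFrame({
    "dbn": ["01M001", "01M001", "01M002", "01M004"],
    "year": [2019, 2019, 2019, 2019],
    "category": ["All Students", "Black", "All Students", "All Students"],
    "mean_scale_score": [601.2, 598.4, None, 605.0],
})

# Drop suppressed scores first, as the notebook does.
scores = scores[scores["mean_scale_score"].notnull()]

# The outer merge keeps everything and records the origin of each row
# in the "_merge" column: "both", "left_only", or "right_only".
audit = demo.merge(scores, how="outer", on=["dbn", "year"], indicator=True)

# The inner merge keeps only the rows that would be labeled "both".
merged = demo.merge(scores, how="inner", on=["dbn", "year"])

print(audit["_merge"].value_counts())
print(merged)
```

Rows labeled `left_only` are schools with demographics but no reportable score, and `right_only` rows are scores with no matching demographic record.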

In [5]:
# for this analysis we will only look at grade 8 scores for the 2018-19 school year
# the last pre-covid year
df = df[df["grade"] == "8"]
df = df[df["year"] == 2019]

# create 5 groups as independent data frames

all_students = df[df["category"] == "All Students"][["dbn", "mean_scale_score"]]
black = df[df["category"] == "Black"][["dbn", "mean_scale_score"]]
white = df[df["category"] == "White"][["dbn", "mean_scale_score"]]
hispanic = df[df["category"] == "Hispanic"][["dbn", "mean_scale_score"]]
asian = df[df["category"] == "Asian"][["dbn", "mean_scale_score"]]


In [9]:
# calculate the mean test score and standard deviation for each group
# string aggregators avoid the deprecation warning newer pandas raises for np.mean/np.std
mean_std = df.groupby("category").agg(Mean=("mean_scale_score", "mean"), STD=("mean_scale_score", "std"))
display(md("**Mean average and standard deviation of test scores for each group.**"))
display(mean_std)


# run a t-test to see if there is a statistically significant difference between white and black student scores
t = scipy.stats.ttest_ind(white["mean_scale_score"], black["mean_scale_score"])

display(md(f"""
**T-Test results** comparing school averages of 
White (`n={white["dbn"].count()}`) and Black (`n={black["dbn"].count()}`)
students on the 8th grade ELA exam for the 2018-19 academic year.

- White students: M={white["mean_scale_score"].mean()}, SD={white["mean_scale_score"].std()}
- Black students: M={black["mean_scale_score"].mean()}, SD={black["mean_scale_score"].std()}
- T-score: {round(t.statistic, 4)}, p-val: {round(t.pvalue, 4)}

`n` values report the number of schools observed, not the number of test takers. Further analysis
will report on the t-test for weighted means that account for school size.
"""))

**Mean average and standard deviation of test scores for each group.**

category,Mean,STD
All Students,600.919888,9.352696
Asian,610.558379,9.357716
Black,598.167014,7.726598
Current ELL,577.973401,6.288727
Econ Disadv,601.111595,8.710385
Ever ELL,604.205822,6.341045
Female,603.969318,9.035509
Hispanic,598.65078,8.18175
Male,597.865757,9.571399
Never ELL,602.116687,8.431567



**T-Test results** comparing school averages of 
White (`n=176`) and Black (`n=318`)
students on the 8th grade ELA exam for the 2018-19 academic year.

- White students: M=608.2399111659091, SD=10.524117180693764
- Black students: M=598.1670142591196, SD=7.7265981679485085
- T-score: 12.1507, p-val: 0.0

`n` values report the number of schools observed, not the number of test takers. Further analysis
will report on the t-test for weighted means that account for school size.
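
The output above shows unequal standard deviations (about 10.5 vs 7.7) and unequal group sizes (176 vs 318). By default `ttest_ind` pools the two variances, so it may be worth comparing against Welch's variant, which does not assume equal variances. A minimal sketch, working only from the reported summary statistics rather than the raw data (the variable names are illustrative):

```python
from scipy.stats import ttest_ind_from_stats

# Summary statistics copied from the t-test output above.
white_mean, white_sd, white_n = 608.2399111659091, 10.524117180693764, 176
black_mean, black_sd, black_n = 598.1670142591196, 7.7265981679485085, 318

# Pooled-variance (Student) t-test: should reproduce the reported T-score.
pooled = ttest_ind_from_stats(
    white_mean, white_sd, white_n,
    black_mean, black_sd, black_n,
    equal_var=True,
)

# Welch's t-test: same inputs, but without the equal-variance assumption.
welch = ttest_ind_from_stats(
    white_mean, white_sd, white_n,
    black_mean, black_sd, black_n,
    equal_var=False,
)

print(f"Pooled: t={pooled.statistic:.4f}, p={pooled.pvalue:.3g}")
print(f"Welch:  t={welch.statistic:.4f}, p={welch.pvalue:.3g}")
```

With the raw series, the same comparison is available through `scipy.stats.ttest_ind(..., equal_var=False)`.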


In [10]:
# run a one way anova to test if there is significant difference between the
# average test scores at the school level of 4 different racial/ethnic groups

fvalue, pvalue = scipy.stats.f_oneway(
    asian["mean_scale_score"], 
    black["mean_scale_score"],
    hispanic["mean_scale_score"],
    white["mean_scale_score"])


results = f"""
A **one-way between subjects ANOVA** was conducted to compare the effect of 
racial/ethnic group on the test score for 8th grade NYS ELA exams for
Asian (n={len(asian)}), Black (n={len(black)}), Latinx (n={len(hispanic)}),
and White (n={len(white)}) students.

There was a significant effect of racial/ethnic group on test score at
the p<.001 level for the four conditions [p={pvalue}, F={round(fvalue,3)}].
"""
md(results)


A **one-way between subjects ANOVA** was conducted to compare the effect of 
racial/ethnic group on the test score for 8th grade NYS ELA exams for
Asian (n=169), Black (n=318), Latinx (n=374),
and White (n=176) students.

There was a significant effect of racial/ethnic group on test score at
the p<.001 level for the four conditions [p=1.2920393648289998e-68, F=124.039].
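
To make the F statistic less of a black box, here is a small sketch that computes it by hand from the between-group and within-group sums of squares and checks it against `f_oneway`. The four groups are synthetic draws (the means, spread, and sizes are assumptions chosen for illustration, not the notebook's data).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic stand-ins for four groups of school-level mean scores.
groups = [
    rng.normal(loc=mean, scale=8.0, size=n)
    for mean, n in [(610, 40), (598, 60), (599, 70), (608, 45)]
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k = len(groups)            # number of groups
n_total = all_values.size  # total number of observations

# Between-group variation: how far each group mean sits from the grand mean.
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group variation: spread of observations around their own group mean.
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

df_between = k - 1
df_within = n_total - k
f_manual = (ss_between / df_between) / (ss_within / df_within)
p_manual = stats.f.sf(f_manual, df_between, df_within)

f_scipy, p_scipy = stats.f_oneway(*groups)
print(f"manual: F={f_manual:.3f}, p={p_manual:.3g}")
print(f"scipy:  F={f_scipy:.3f}, p={p_scipy:.3g}")
```

A significant F only says that at least one group mean differs. It does not say which pairs differ, so a post-hoc comparison would be needed for that.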


In [21]:
# run OLS regression on test scores across the different student categories
model = ols('mean_scale_score ~ C(category, Treatment(reference="All Students"))', data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
display(md("### OLS Model Summary: Demographic Group Test Scores Regression"))
display(model.summary())
display(md("**ANOVA Table**"))
anova_table

### OLS Model Summary: Demographic Group Test Scores Regression

Dep. Variable:,mean_scale_score,R-squared:,0.398
Model:,OLS,Adj. R-squared:,0.396
Method:,Least Squares,F-statistic:,249.3
Date:,"Thu, 24 Mar 2022",Prob (F-statistic):,0.0
Time:,17:18:39,Log-Likelihood:,-17569.0
No. Observations:,4919,AIC:,35170.0
Df Residuals:,4905,BIC:,35260.0
Df Model:,13,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,600.9199,0.398,1508.033,0.000,600.139,601.701
"C(category, Treatment(reference=""All Students""))[T.Asian]",9.6385,0.774,12.459,0.000,8.122,11.155
"C(category, Treatment(reference=""All Students""))[T.Black]",-2.7529,0.626,-4.394,0.000,-3.981,-1.525
"C(category, Treatment(reference=""All Students""))[T.Current ELL]",-22.9465,0.692,-33.152,0.000,-24.303,-21.590
"C(category, Treatment(reference=""All Students""))[T.Econ Disadv]",0.1917,0.594,0.323,0.747,-0.974,1.357
"C(category, Treatment(reference=""All Students""))[T.Ever ELL]",3.2859,0.671,4.897,0.000,1.971,4.601
"C(category, Treatment(reference=""All Students""))[T.Female]",3.0494,0.568,5.364,0.000,1.935,4.164
"C(category, Treatment(reference=""All Students""))[T.Hispanic]",-2.2691,0.598,-3.795,0.000,-3.441,-1.097
"C(category, Treatment(reference=""All Students""))[T.Male]",-3.0541,0.569,-5.366,0.000,-4.170,-1.938

Omnibus:,223.023,Durbin-Watson:,0.537
Prob(Omnibus):,0.0,Jarque-Bera (JB):,330.244
Skew:,0.417,Prob(JB):,1.94e-72
Kurtosis:,3.957,Cond. No.,13.2


**ANOVA Table**

,sum_sq,df,F,PR(>F)
"C(category, Treatment(reference=""All Students""))",240840.943761,13.0,249.304265,0.0
Residual,364498.941506,4905.0,,
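
With treatment coding, the intercept is the mean of the reference category and each coefficient is that category's mean minus the reference mean. This is visible in the summary above: the intercept (600.9199) matches the "All Students" mean reported earlier, and the Asian coefficient (9.6385) equals 610.5584 - 600.9199. A minimal sketch of the same idea on made-up scores, building the design matrix by hand and solving with `numpy` least squares instead of the formula API:

```python
import numpy as np
import pandas as pd

# Made-up scores for three categories (illustration only).
toy = pd.DataFrame({
    "category": ["All Students"] * 3 + ["Asian"] * 3 + ["Black"] * 3,
    "mean_scale_score": [600.0, 602.0, 601.0, 611.0, 609.0, 610.0, 598.0, 597.0, 599.0],
})

# Treatment coding: a column of ones plus one 0/1 column per
# non-reference level ("All Students" is the reference here).
levels = ["Asian", "Black"]
X = np.column_stack(
    [np.ones(len(toy))]
    + [(toy["category"] == level).to_numpy(dtype=float) for level in levels]
)
y = toy["mean_scale_score"].to_numpy()

coef, *_ = np.linalg.lstsq(X, y, rcond=None)
group_means = toy.groupby("category")["mean_scale_score"].mean()

print("intercept:", coef[0], "reference mean:", group_means["All Students"])
for level, b in zip(levels, coef[1:]):
    print(level, "coef:", b, "mean difference:", group_means[level] - group_means["All Students"])
```

Because "All Students" contains the students in every other category, each coefficient compares a subgroup against a total that includes it. That is worth keeping in mind when reading the p-values.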


In [19]:
# run a different OLS regression using the school demographics to predict mean_scale_score for all students
# the above example uses the R-like formula method for OLS
# this uses the python/programmatic approach

# calculate coefficients for these factors
factors = ['total_enrollment', 'asian_pct', 'black_pct',
           'hispanic_pct', 'white_pct', 'swd_pct', 'ell_pct', 'poverty_pct', 'number_tested']

# only use the All Students data
data = df[df["category"]=="All Students"].copy()

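# add boro as a factor, but convert to category first
# "dummies" basically converts the categorical boro data to "wide" Boolean data with one col for each boro
# dtype=float keeps the columns numeric; newer pandas returns bool columns by default, which sm.OLS may reject
dummies = pd.get_dummies(data[['boro']], dtype=float)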

display(md("### OLS Model Summary: School Demographics Regression"))

y = data['mean_scale_score']
X = pd.concat([data[factors], dummies], axis=1)
X = sm.add_constant(X)
model = sm.OLS(y, X).fit()

display(model.summary())



### OLS Model Summary: School Demographics Regression

Dep. Variable:,mean_scale_score,R-squared:,0.728
Model:,OLS,Adj. R-squared:,0.72
Method:,Least Squares,F-statistic:,93.41
Date:,"Thu, 24 Mar 2022",Prob (F-statistic):,2.95e-119
Time:,17:17:25,Log-Likelihood:,-1405.3
No. Observations:,468,AIC:,2839.0
Df Residuals:,454,BIC:,2897.0
Df Model:,13,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,515.4631,9.411,54.774,0.000,496.969,533.957
total_enrollment,-0.0004,0.001,-0.420,0.675,-0.003,0.002
asian_pct,15.8875,12.735,1.248,0.213,-9.139,40.914
black_pct,-1.1354,12.589,-0.090,0.928,-25.875,23.604
hispanic_pct,6.1447,12.319,0.499,0.618,-18.064,30.353
white_pct,8.7286,12.240,0.713,0.476,-15.326,32.784
swd_pct,-45.9423,4.452,-10.319,0.000,-54.692,-37.193
ell_pct,-36.1449,2.778,-13.009,0.000,-41.605,-30.685
poverty_pct,-9.4903,2.893,-3.280,0.001,-15.176,-3.805

Omnibus:,21.447,Durbin-Watson:,1.823
Prob(Omnibus):,0.0,Jarque-Bera (JB):,58.654
Skew:,0.027,Prob(JB):,1.83e-13
Kurtosis:,4.734,Cond. No.,2.02e+18
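
The very large condition number (2.02e+18) is consistent with the "dummy variable trap": a full set of boro dummies always sums to 1, which duplicates the constant column. The `Df Model` of 13 points the same way, since the design has more non-constant columns than that. A minimal sketch on toy data (the boro codes and poverty values are made up) showing that the full dummy set plus a constant is rank-deficient, and that `drop_first=True` is one possible remedy:

```python
import numpy as np
import pandas as pd

# Toy data: five hypothetical boro codes and a made-up numeric predictor.
toy = pd.DataFrame({
    "boro": ["K", "M", "Q", "X", "R", "K", "M", "Q"],
    "poverty_pct": [0.81, 0.55, 0.62, 0.90, 0.48, 0.77, 0.60, 0.70],
})


def design(drop_first: bool) -> pd.DataFrame:
    """Build a design matrix with a constant, one predictor, and boro dummies."""
    dummies = pd.get_dummies(toy[["boro"]], drop_first=drop_first, dtype=float)
    X = pd.concat([toy[["poverty_pct"]], dummies], axis=1)
    X.insert(0, "const", 1.0)
    return X


full = design(drop_first=False)    # all five dummies plus the constant
reduced = design(drop_first=True)  # one boro becomes the baseline

full_rank = np.linalg.matrix_rank(full.to_numpy())
reduced_rank = np.linalg.matrix_rank(reduced.to_numpy())

print("full:   ", full.shape[1], "columns, rank", full_rank)
print("reduced:", reduced.shape[1], "columns, rank", reduced_rank)
```

Dropping one dummy leaves the model fit unchanged but makes the remaining boro coefficients interpretable as differences from the omitted boro.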
