ANOVA and OLS Regression
========================
This notebook combines school demographic data
and New York State ELA test scores to examine
the factors that predict test scores. In it we
run a t-test to check for a statistically significant difference
between the scores of White students and Black students.

We then run an analysis of variance (ANOVA) across the 4
main ethnic/racial groups to see whether the differences in
outcomes are statistically significant. We run an OLS regression
on these groups with "All Students" as the baseline
reference category and display the ANOVA table and regression summary.

Finally, we run a different OLS regression to examine
the school demographic factors and their effect on test scores.

In this notebook:

- t-test (review)
- ANOVA (analysis of variance)
- OLS (ordinary least squares) regression

In [1]:
# import the libraries used in this notebook
import pandas as pd
import numpy as np
import scipy.stats  # import the stats submodule explicitly so scipy.stats.* is available

import pingouin as pg

import statsmodels.api as sm

from IPython.display import Markdown as md

from nycschools import schools, exams


In [2]:
# load the demographic data and merge it with the ELA data
df = schools.load_school_demographics()

# load the data from the csv file
ela = exams.load_ela()


# drop the rows with NaN (where the population is too small to report)
ela = ela[ela["mean_scale_score"].notnull()]
df = df.merge(ela, how="inner", on=["dbn", "ay"])

# for this analysis we only look at grade 8 scores for the 2018-19 school year,
# the last pre-covid year
df = df[df.grade == '8']
df = df[df.ay == 2018]
df = df[df.mean_scale_score.notnull()]

# create 5 groups as independent data frames

all_students = df[df["category"] == "All Students"][["dbn", "mean_scale_score"]]
black = df[df["category"] == "Black"][["dbn", "mean_scale_score"]]
white = df[df["category"] == "White"][["dbn", "mean_scale_score"]]
hispanic = df[df["category"] == "Hispanic"][["dbn", "mean_scale_score"]]
asian = df[df["category"] == "Asian"][["dbn", "mean_scale_score"]]



t-test: White and Black students
--------------------------------
Before running the ANOVA and further analysis, we run a t-test between Black and White students to determine whether there is a significant difference between their average test scores.

In [3]:

# calculate the mean test score and standard deviation for each group
# use the string names "mean" and "std" (sample standard deviation) rather than numpy functions
mean_std = df.groupby('category').agg(Mean=('mean_scale_score', 'mean'), STD=('mean_scale_score', 'std'))
display(md("**Mean average and standard deviation of test scores for each group.**"))
display(mean_std)


# run a t-test to see if there is a statistical difference between white and black student scores
t = scipy.stats.ttest_ind(white["mean_scale_score"],black["mean_scale_score"])

display(md(f"""
**T-Test results** comparing school averages of 
White (`n={white["dbn"].count()}`) and Black (`n={black["dbn"].count()}`)
students on the 8th grade ELA exam for the 2018-19 academic year.

- White students: M={white["mean_scale_score"].mean():.04f}, SD={white["mean_scale_score"].std():.04f}
- Black students: M={black["mean_scale_score"].mean():.04f}, SD={black["mean_scale_score"].std():.04f}
- T-score: {round(t.statistic, 4)}, p-val: {round(t.pvalue, 4)}

`n` values report the number of schools observed, not the number of test takers. Further analysis
will report on the t-test for weighted means that account for school size.

We see that there is a statistically significant difference in test scores between the groups.
"""))

**Mean average and standard deviation of test scores for each group.**

category,Mean,STD
All Students,601.673565,9.005773
Asian,610.558379,9.357716
Black,598.167014,7.726598
Current ELL,577.987138,6.298893
Econ Disadv,601.111595,8.710385
Ever ELL,604.239202,6.331076
Female,603.995597,9.028235
Hispanic,598.683482,8.168227
Male,597.893669,9.563723
Never ELL,602.137611,8.430177



**T-Test results** comparing school averages of 
White (`n=176`) and Black (`n=318`)
students on the 8th grade ELA exam for the 2018-19 academic year.

- White students: M=608.2399, SD=10.5241
- Black students: M=598.1670, SD=7.7266
- T-score: 12.1507, p-val: 0.0

`n` values report the number of schools observed, not the number of test takers. Further analysis
will report on the t-test for weighted means that account for school size.

We see that there is a statistically significant difference in test scores between the groups.
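
The t-test above uses the `scipy.stats.ttest_ind` default, which pools the two variances. Because the groups differ in both size (176 vs. 318 schools) and spread (SD 10.52 vs. 7.73), Welch's t-test, which does not assume equal variances, may be a useful cross-check. The sketch below is not part of the original analysis: it uses synthetic data drawn to resemble the reported means and standard deviations, and the names `group_a` and `group_b` are placeholders for `white["mean_scale_score"]` and `black["mean_scale_score"]`.

```python
import numpy as np
from scipy import stats

# Synthetic school-level means that only resemble the reported summary
# statistics; these are NOT the real NYC school data.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=608.0, scale=10.5, size=176)
group_b = rng.normal(loc=598.0, scale=7.7, size=318)

# Student's t-test: assumes both groups share one (pooled) variance
student = stats.ttest_ind(group_a, group_b)

# Welch's t-test: each group keeps its own variance estimate
welch = stats.ttest_ind(group_a, group_b, equal_var=False)

print(f"Student: t={student.statistic:.3f}, p={student.pvalue:.3g}")
print(f"Welch:   t={welch.statistic:.3f}, p={welch.pvalue:.3g}")
```

Printing the p-value with a format such as `.3g` also avoids reporting a very small p-value as `0.0`, which is what rounding to four decimal places produces.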


ANOVA & F-values
----------------
In the example below we calculate the F-statistic to see whether there are significant differences in test scores based on the racial/ethnic group of the test takers. We compare the four main groups in NYC Schools: Asian, Black, Hispanic, and White.

In [4]:
# run a one way anova to test if there is significant difference between the
# average test scores at the school level of 4 different racial/ethnic groups

fvalue, pvalue = scipy.stats.f_oneway(
    asian["mean_scale_score"], 
    black["mean_scale_score"],
    hispanic["mean_scale_score"],
    white["mean_scale_score"])


results = f"""
A **one-way between subjects ANOVA** was conducted to compare the effect of 
racial/ethnic group on the test score for 8th grade NYS ELA exams
during the 2018-2019 academic year.

The four groups in the test are: Asian (n={len(asian)}), Black (n={len(black)}), Latinx (n={len(hispanic)}),
and White (n={len(white)}) students.

There was a significant effect of racial/ethnic group on test score at
the p<.001 level for the four conditions, [p={pvalue:.04f}, F={fvalue:.04f}].

_`n` values are the number of schools reported, not the number of test takers in each group_
"""
md(results)


A **one-way between subjects ANOVA** was conducted to compare the effect of 
racial/ethnic group on the test score for 8th grade NYS ELA exams
during the 2018-2019 academic year.

The four groups in the test are: Asian (n=169), Black (n=318), Latinx (n=373),
and White (n=176) students.

There was a significant effect of racial/ethnic group on test score at
the p<.001 level for the four conditions, [p=0.0000, F=123.7368].

_`n` values are the number of schools reported, not the number of test takers in each group_
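
To make the F-statistic less of a black box, the following sketch computes it by hand from the between-group and within-group sums of squares and compares the result with `scipy.stats.f_oneway`. It uses four synthetic groups (not the school data), and the variable names are illustrative only.

```python
import numpy as np
from scipy import stats

# Four synthetic groups of unequal size; NOT the real school data.
rng = np.random.default_rng(1)
groups = [
    rng.normal(610, 9, size=40),
    rng.normal(598, 8, size=60),
    rng.normal(599, 8, size=70),
    rng.normal(608, 10, size=45),
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k = len(groups)            # number of groups
n_total = all_values.size  # total number of observations

# Between-group variation: how far each group mean is from the grand mean
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group variation: how far observations are from their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# Mean squares are sums of squares divided by their degrees of freedom
ms_between = ss_between / (k - 1)
ms_within = ss_within / (n_total - k)

f_manual = ms_between / ms_within
p_manual = stats.f.sf(f_manual, k - 1, n_total - k)  # upper tail of the F distribution

f_scipy, p_scipy = stats.f_oneway(*groups)
print(f"manual: F={f_manual:.4f}, p={p_manual:.3g}")
print(f"scipy:  F={f_scipy:.4f}, p={p_scipy:.3g}")
```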


### Pingouin results
When we ran the t-test we saw that `pingouin` offers some useful additional features.
The `pingouin` API (its calling syntax) differs from that of `scipy`. Before we run the test, we
create a single new dataframe with just the columns we care about. We then tell the function which column
is the dependent variable and which column specifies the groups.

Below we show the [pingouin ANOVA](https://pingouin-stats.org/generated/pingouin.anova.html#pingouin.anova).

In [5]:
# pg.anova() takes one dataframe: keep only the four groups and the two columns we need
data = df.copy()
data = data[data.category.isin(["Asian", "Black", "Hispanic", "White"])]
data = data[["category", "mean_scale_score"]]

pg.anova(dv='mean_scale_score', between='category', data=data, detailed=True)

,Source,SS,DF,MS,F,p-unc,np2
0,category,27998.508706,3,9332.836235,123.736753,1.84031e-68,0.264544
1,Within,77838.530186,1032,75.424932,,,
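
The columns of this table are tied together by simple arithmetic, which the short check below reproduces from the reported sums of squares and degrees of freedom. The within-group degrees of freedom, 1032, equal the 1036 school-level observations (169 + 318 + 373 + 176) minus the 4 groups. For a one-way design, `np2` (partial eta squared) is the share of the total sum of squares attributed to `category`.

```python
# Values copied from the pingouin ANOVA table above
ss_between, df_between = 27998.508706, 3
ss_within, df_within = 77838.530186, 1032

# Mean square = sum of squares / degrees of freedom
ms_between = ss_between / df_between
ms_within = ss_within / df_within

# F is the ratio of the two mean squares
f_stat = ms_between / ms_within

# Partial eta squared: effect SS over (effect SS + error SS)
eta_sq = ss_between / (ss_between + ss_within)

print(f"MS between = {ms_between:.6f}")
print(f"MS within  = {ms_within:.6f}")
print(f"F          = {f_stat:.4f}")
print(f"np2        = {eta_sq:.6f}")
```

Read this way, racial/ethnic group accounts for roughly 26% of the variance in school-level mean scores among these four groups.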


OLS Linear Regression
---------------------
The ANOVA tells us that there is a significant difference in test results based on racial/ethnic group. We can run a regression analysis to help us isolate the impact of different factors on our `mean_scale_score` -- our dependent variable.

For this analysis we will look at the school demographics to analyze the mean ELA test score for All Students at the school.

In [7]:
# first choose the "factors" from our data fields that we believe impact mean_scale_score
data = df.copy()
data = data[data.category == "All Students"]
factors = [
       'total_enrollment',
       'female_pct', 
       'asian_pct',  
       'black_pct',
       'hispanic_pct',  
       'white_pct',
       'swd_pct',  
       'ell_pct',  
       'poverty_pct',
       'eni',
       'charter']

y = data['mean_scale_score']
X = data[factors]
X = sm.add_constant(X)
model = sm.OLS(y, X).fit()

model.summary()


Dep. Variable:,mean_scale_score,R-squared:,0.677
Model:,OLS,Adj. R-squared:,0.67
Method:,Least Squares,F-statistic:,108.4
Date:,"Fri, 06 Oct 2023",Prob (F-statistic):,7.91e-132
Time:,10:49:16,Log-Likelihood:,-1776.1
No. Observations:,582,AIC:,3576.0
Df Residuals:,570,BIC:,3629.0
Df Model:,11,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,617.3600,10.868,56.808,0.000,596.015,638.705
total_enrollment,-0.0003,0.001,-0.486,0.627,-0.002,0.001
female_pct,7.6363,2.421,3.154,0.002,2.882,12.391
asian_pct,9.6632,12.217,0.791,0.429,-14.332,33.659
black_pct,-4.2340,11.986,-0.353,0.724,-27.776,19.308
hispanic_pct,1.5070,11.830,0.127,0.899,-21.728,24.742
white_pct,4.8929,11.592,0.422,0.673,-17.875,27.661
swd_pct,-41.4359,4.043,-10.249,0.000,-49.377,-33.495
ell_pct,-30.6338,2.863,-10.700,0.000,-36.257,-25.010

Omnibus:,34.343,Durbin-Watson:,1.801
Prob(Omnibus):,0.0,Jarque-Bera (JB):,67.33
Skew:,0.367,Prob(JB):,2.4e-15
Kurtosis:,4.495,Cond. No.,89700.0
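
The large condition number (89700) and the wide standard errors on the four race/ethnicity percentages (about 12 each) are consistent with multicollinearity. One plausible cause, offered here as an assumption rather than something verified against the data, is that `asian_pct`, `black_pct`, `hispanic_pct` and `white_pct` sum to nearly 1 for each school, so together they almost duplicate the constant column. The sketch below illustrates the effect with synthetic shares that sum to exactly 1; the column layout is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500

# Synthetic "shares" for four groups; each row sums to 1.
# These are NOT the real school demographics.
shares = rng.dirichlet(alpha=[1, 2, 3, 1], size=n)
const = np.ones((n, 1))

# Design matrix with a constant plus all four shares: the shares add up
# to the constant column, so the columns are linearly dependent.
X_all = np.hstack([const, shares])

# Dropping one share makes it the reference category and removes the dependency.
X_drop = np.hstack([const, shares[:, 1:]])

rank_all = np.linalg.matrix_rank(X_all)
rank_drop = np.linalg.matrix_rank(X_drop)
cond_all = np.linalg.cond(X_all)
cond_drop = np.linalg.cond(X_drop)

print(f"all four shares: rank={rank_all} of {X_all.shape[1]} columns, cond={cond_all:.3g}")
print(f"one share dropped: rank={rank_drop} of {X_drop.shape[1]} columns, cond={cond_drop:.3g}")
```

If the same pattern holds in the school data, leaving one group out of `factors` as a reference category would likely give more stable coefficients for the remaining groups.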


In [8]:
# we can also pull specific data from the model
# here we create our own table with the factors, coefficients and p-values
# .iloc[1:] skips the first entry, which is the constant (intercept)
params = list(model.params.index[1:])
coefs = list(model.params.iloc[1:])
pvalues = list(model.pvalues.iloc[1:])

table = pd.DataFrame({"factor":params,"coef":coefs,"p-values":pvalues})
table.sort_values(by="coef")

,factor,coef,p-values
6,swd_pct,-41.435949,9.754318e-23
7,ell_pct,-30.6338,1.797988e-24
8,poverty_pct,-6.748134,0.08523855
3,black_pct,-4.234043,0.7240341
9,eni,-3.759965,0.3277058
0,total_enrollment,-0.00034,0.627444
4,hispanic_pct,1.506996,0.8986746
10,charter,2.460203,0.0001154229
5,white_pct,4.892862,0.6731113
1,female_pct,7.636345,0.001692509
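
One way to read these coefficients: each is the expected change in a school's mean scale score for a one-unit change in that factor, holding the other factors fixed. The sketch below assumes the `_pct` columns are stored as fractions between 0 and 1, which the size of the coefficients suggests but the notebook does not state, and converts the statistically significant coefficients into the change associated with a 10 percentage point difference.

```python
# Coefficients copied from the table above (those with p < 0.05)
coefs = {
    "swd_pct": -41.435949,
    "ell_pct": -30.6338,
    "female_pct": 7.636345,
}

# ASSUMPTION: *_pct columns are fractions (0 to 1), so 10 percentage points = 0.10
delta = 0.10

effects = {name: coef * delta for name, coef in coefs.items()}
for name, effect in effects.items():
    print(f"{name}: {effect:+.2f} scale-score points per 10 percentage points")
```

Under that assumption, a school with 10 percentage points more students with disabilities is associated with a mean score about 4.1 points lower. These are associations at the school level, not causal effects on individual students.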
